Neural Style Transfer via Deep Image Prior

Eitan Kosman
Aug 19, 2021 · 6 min read

Introduction

The first time I was shown a picture of London mixed with Van Gogh’s “Starry Night”, I was amazed. Until then, I knew a few classical methods for blending images, for instance Laplacian Pyramid Blending [1]. Following the renaissance of deep learning, and specifically the success of Convolutional Neural Networks (CNNs) in computer vision tasks, it didn’t take long for a neural method to emerge. Neural Style Transfer (NST) from Gatys et al. [2] was the first to bridge the gap between image blending and neural networks. It works by computing the activations of a pre-trained CNN, e.g. VGG-16 [3], for two input images, and generating a new image whose activations are similar to those of the two images of interest. As depicted in Figure 1, this is done by minimizing a distance measure between activations: gradients with respect to the input image are computed via back-propagation, and an optimizer drives the image toward a (possibly local) minimum.
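To make this concrete, here is a minimal sketch of that optimization loop, assuming PyTorch and torchvision’s pre-trained VGG-16. The layer indices, loss weights, and the use of Adam are illustrative choices of mine, not the exact configuration of Gatys et al. (who optimize with L-BFGS):

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg16

device = "cuda" if torch.cuda.is_available() else "cpu"
cnn = vgg16(weights="DEFAULT").features.to(device).eval()
for p in cnn.parameters():
    p.requires_grad_(False)  # the CNN stays frozen; only the image is optimized

CONTENT_LAYERS = {15}          # relu3_3 (illustrative choice)
STYLE_LAYERS = {3, 8, 15, 22}  # relu1_2, relu2_2, relu3_3, relu4_3

def activations(x):
    """Collect feature maps at the chosen layers."""
    feats = {}
    for i, layer in enumerate(cnn):
        x = layer(x)
        if i in CONTENT_LAYERS | STYLE_LAYERS:
            feats[i] = x
    return feats

def gram(f):
    """Gram matrix: channel-wise feature correlations, used for style."""
    b, c, h, w = f.shape
    f = f.reshape(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

# Placeholders; in practice, load and normalize the real photographs here.
content_img = torch.rand(1, 3, 256, 256, device=device)
style_img = torch.rand(1, 3, 256, 256, device=device)

with torch.no_grad():
    content_ref = activations(content_img)
    style_ref = {i: gram(f) for i, f in activations(style_img).items()}

def nst_loss(generated, style_weight=1e6):
    feats = activations(generated)
    c = sum(F.mse_loss(feats[i], content_ref[i]) for i in CONTENT_LAYERS)
    s = sum(F.mse_loss(gram(feats[i]), style_ref[i]) for i in STYLE_LAYERS)
    return c + style_weight * s

# Optimize the pixels of the generated image directly via back-propagation.
generated = content_img.clone().requires_grad_(True)
opt = torch.optim.Adam([generated], lr=0.02)
for _ in range(500):
    opt.zero_grad()
    nst_loss(generated).backward()
    opt.step()
```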

Many improvements have been introduced since then. One prominent approach that I find very interesting is reparameterizing the generated image [4]. The idea is that different parameterizations may lead to convergence to different local minima. In [4], the authors offer several explanations for why this helps:

  • Improved Optimization — transforming the input to make an optimization problem easier.
  • Basins of Attraction — The probability of our optimization process falling into any particular local minimum is controlled by its basin of attraction (i.e., the region of the optimization landscape under the influence of that minimum). Changing the parameterization of an optimization problem is known to change the sizes of different basins of attraction, influencing the likely result.
  • Additional Constraints — An optimizer working in such a parameterization will still find solutions that minimize or maximize the objective function, but they’ll be subject to the constraints of the parameterization. By picking the right parameterization, one can impose a variety of constraints, ranging from simple ones (e.g., the boundary of the image must be black) to rich, subtle ones.
  • Implicitly Optimizing other Objects — A parameterization may internally use a different kind of object than the one it outputs and that we optimize for. For example, while the natural input to a vision network is an RGB image, we can parameterize that image as a rendering of a 3D object and, by back-propagating through the rendering process, optimize that instead. Because the 3D object has more degrees of freedom than the image, we generally use a stochastic parameterization that produces images rendered from different perspectives.

In [4], the authors specifically demonstrate NST by parameterizing the generated image with a simple linear transformation: the Fourier transform. The rest of the method remains the same as in the original NST paper from Gatys et al. This setting is depicted in Figure 2.

Figure 2: The settings of NST via FFT reparameterization of the generated image. Taken from [4].
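To illustrate the idea, here is how such a reparameterization might be sketched in PyTorch, reusing the `nst_loss` and image size from the sketch above. Note that [4] additionally scales frequencies and decorrelates color channels; I omit those details here:

```python
import torch

h, w = 256, 256  # match the content image size used above

# The free parameters live in the frequency domain: a real/imaginary pair
# for each (channel, frequency) entry of the half-spectrum.
spectrum = torch.randn(1, 3, h, w // 2 + 1, 2, device=device, requires_grad=True)

def decode(spec):
    """Map spectrum coefficients to a valid RGB image."""
    complex_spec = torch.view_as_complex(spec.contiguous())
    img = torch.fft.irfft2(complex_spec, s=(h, w))  # inverse FFT back to pixels
    return torch.sigmoid(img)                       # squash into [0, 1]

opt = torch.optim.Adam([spectrum], lr=0.05)
for _ in range(500):
    opt.zero_grad()
    nst_loss(decode(spectrum)).backward()  # gradients flow through the inverse FFT
    opt.step()
```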

Figure 3 shows an example of the style-transferred London image. It certainly looks very good.

Figure 3: Style transfer of “Starry Night” onto a photograph by Andyindia. Taken from [4]. This image is generated via reparameterization with 2D-FFT.

As I find this idea very interesting, I decided to experiment with it and combine it with another interesting method called Deep Image Prior (DIP) [5]. DIP employs convolutional generators to solve inverse problems such as image denoising and inpainting. It has been found that convolutional neural networks (CNNs) carry a strong prior for natural images, which can be explained by the similarity between adjacent pixels imposed by the convolution operator. Moreover, different generator architectures may impose additional constraints on the generated image: a shallow generator with very few parameters (far fewer than the number of pixels in the original image) expresses the notion of compressed sensing, while over-parameterized generators provably denoise images via early stopping, a phenomenon that is well studied in [6].
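As a toy illustration of DIP’s denoising behavior, the sketch below (my own minimal construction, not the encoder-decoder architecture from [5]) fits a small convolutional generator, fed a fixed noise tensor, to a noisy target and stops early:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# A deliberately small generator; [5] uses a deeper encoder-decoder with skips.
generator = nn.Sequential(
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 3, 3, padding=1), nn.Sigmoid(),
)

z = torch.randn(1, 32, 256, 256)    # fixed random input, never updated
noisy = torch.rand(1, 3, 256, 256)  # placeholder for a noisy photograph

opt = torch.optim.Adam(generator.parameters(), lr=1e-3)
for step in range(1800):  # early stopping: the conv prior fits structure before noise
    opt.zero_grad()
    F.mse_loss(generator(z), noisy).backward()
    opt.step()

denoised = generator(z).detach()
```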

I intend to combine NST with DIP in order to generate high-quality, style-transferred images. In the following, I’ll briefly describe the method I propose.

DIP NST

The original method optimizes the input image directly. I add regularization to the problem by using a deep convolutional generator to produce the input to the style transfer problem. The optimization manipulates the parameters of the generator instead of directly changing the generated image. This results in fewer parameters to update and imposes a sparse representation on the image. It’s also important to emphasize that I use gradient descent, while the original NST paper uses L-BFGS, which is very memory-consuming.

Figure 4: The settings of NST via DIP reparameterization of the generated image.
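Under the same assumptions as the earlier sketches (the `nst_loss` from the first code block and a convolutional `generator` like the one above, moved to the same device), the combined loop might look like this; note that the image itself is never a free variable:

```python
import torch

generator = generator.to(device)
z = torch.randn(1, 32, 256, 256, device=device)  # fixed noise input to the generator

opt = torch.optim.SGD(generator.parameters(), lr=0.01, momentum=0.9)
for _ in range(2000):
    opt.zero_grad()
    generated = generator(z)        # decode an image from the generator weights
    nst_loss(generated).backward()  # gradients flow into the weights, not the pixels
    opt.step()

result = generator(z).detach()
```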

I conducted an experiment with the new parameterization. These are the input images for the algorithm:

Figure 5: On the left, the content image. On the right, the style image.

I chose two optimizers to experiment with: (1) L-BFGS, as used in the original paper, and (2) SGD, a commonly used first-order optimizer in deep learning.
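In PyTorch (again, an assumption of these sketches), the two differ in more than output quality: L-BFGS requires a closure that re-evaluates the loss, and it keeps a history of past gradients, which is where its extra memory consumption comes from:

```python
import torch

# Pixel parameterization from the first sketch; for DIP, pass
# generator.parameters() instead.
opt = torch.optim.LBFGS([generated], history_size=50)

def closure():
    opt.zero_grad()
    loss = nst_loss(generated)
    loss.backward()
    return loss

for _ in range(50):
    opt.step(closure)

# SGD, by contrast, is a plain first-order update:
# opt = torch.optim.SGD([generated], lr=0.01, momentum=0.9)
```

For these two images, the outputs of the original NST method with each optimizer are shown below: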

Figure 6: Original method, L-BFGS

The image generated via the original NST using L-BFGS looks pretty good to me. The brush strokes came out well and the colors match the style of “Starry Night”. Now, let’s repeat the experiment with SGD:

Figure 7: Original method, SGD

It is well known that SGD tends to converge to a poor local minimum for NST. Indeed, the resulting image is very noisy. Although the colors clearly shift toward the tones of “Starry Night”, the optimization failed to create coherence between adjacent pixels, resulting in color variation that reads as noise.

The output of NST using DIP is shown in Figure 8. The size of the output image is 3×600×880, which means there are ≈1.58 million parameters in the original problem. The convolutional generator I used has 164,893 parameters, roughly a tenth of that amount.
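Counting a model’s parameters is a one-liner in PyTorch; note that the toy generator sketched earlier is smaller than the one I actually used, so its count will differ from 164,893:

```python
pixels = 3 * 600 * 880  # 1,584,000 free variables in the pixel parameterization
gen_params = sum(p.numel() for p in generator.parameters())
print(pixels, gen_params)
```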

Figure 8: NST via DIP, optimized with SGD

This image looks quite good. Indeed, I find it better than the two previous images obtained via the original NST, which applies no reparameterization at all. The convolutional generator successfully imposed a good prior on the generated image: the noise effect from Figure 7 is completely gone, and the style of “Starry Night” is conspicuously expressed. In my opinion, this result looks even better than the L-BFGS output of the original NST shown in Figure 6.

Conclusions

In this post I’ve shown you my method that combines Neural Style Transfer with Deep Image Prior. I hope you enjoyed it and learned something useful :)

References

[1] S. Paris, S. W. Hasinoff, J. Kautz, “Local Laplacian Filters: Edge-Aware Image Processing with a Laplacian Pyramid”. https://research.adobe.com/publication/local-laplacian-filters-edge-aware-image-processing-with-a-laplacian-pyramid/

[2] L. Gatys, A. Ecker, M. Bethge, “A Neural Algorithm of Artistic Style”. https://arxiv.org/abs/1508.06576

[3] K. Simonyan, A. Zisserman, “Very Deep Convolutional Networks for Large-Scale Image Recognition”. https://arxiv.org/abs/1409.1556

[4] A. Mordvintsev et al., “Differentiable Image Parameterizations”, Distill. https://distill.pub/2018/differentiable-parameterizations/

[5] D. Ulyanov, A. Vedaldi, V. Lempitsky, “Deep Image Prior”. https://arxiv.org/abs/1711.10925

[6] R. Heckel, M. Soltanolkotabi, “Denoising and Regularization via Exploiting the Structural Bias of Convolutional Generators”. https://arxiv.org/abs/1910.14634
