This post was imported from an earlier blog. The model it uses, Stable Diffusion 1.4, has since been superseded, and fine-tuning with LoRA has become a more widely adopted method of style transfer.
The internet is awash with images generated by AI models like Midjourney, DALL-E, and Stable Diffusion. But the images that accompany articles about it tend to be a bit same-y. Sci-fi themes, Blade Runner dystopian cities, anime. 'Photorealistic' images of women that come ready-airbrushed.
I wanted to experiment with generating some different (nicer) looking images using a form of style transfer. I did this using textual inversion on Stable Diffusion. Textual inversion is a technique for adding a new word to pinpoint some representational capacity of a generative model.
For this experiment, I wanted to push the model into recreating the textures, palettes and style of children's illustrator Brian Wildsmith. Wildsmith made books and posters using a range of media, from scratchy pen and ink to rich gouache. One of the most striking things about his work is the use of colour: fuchsias, emerald greens, and frequent use of geometric patterns. I wanted to see how well the model would learn and apply these themes. This isn't the simplest way to showcase style transfer, since Wildsmith's style is very broad, spanning a lifetime of work.
In the post below, I take a deeper look at textual inversion, primarily for style transfer, but also as a tool for countering bias in training datasets. I'll also dive into the mathematics of classifier-free guidance as it's applied in this case.
Here are some examples of original Wildsmith:
Caveats
I don't want to start throwing around derivative images without saying something about the relationship of style-transfer results to the original creations. It's clear that style-transfer results depend in a strong sense on the original artist, and right that the artist's role be respected.
Can AI-generated images be art? My feeling is that simply grabbing the output image from a generative neural network isn't producing a piece of art. However, these neural nets are tools, which can be used creatively, flexibly and imaginatively. Generative AI can be used to produce art, much as artists create artwork with mixed media, photography, and other methods that differ from traditional paint on canvas.
Certain characteristics are essential to art. One is creativity. Another is that art, as I see it, is essentially a form of human-to-human communication: the artist has some intention for the piece they are creating. This is why a sunset, however beautiful, is not art. So whether AI-generated media is art depends in part on the role played by the human in its creation.
Given this understanding of the relationship between art and generative AI, we can address worries sometimes voiced about whether generative AI is going to replace artists.
Consider the minor scandal at the end of last year, when a digital art prize was won by an entry generated with Midjourney. The winner gloated: "Art is dead, dude. It's over. A.I. won. Humans lost." I don't see how you can hold this view without knowing very little about both the technology and art. Rather than being an adversary of art, generative AI is simply a new tool to be put to various uses.
However, this doesn't mean that other professions won't be heavily disrupted by generative AI. It seems inevitable that less creative image generation, like stock photography, will end up being at least partially automated by generative AI.
Let’s examine one technique for guiding the models’ outputs in a desired direction.
Textual inversion on a style
There is a lot of interest in fine-tuning large foundation models to make them applicable to custom domains. This isn't simple to apply in practice, due amongst other things to the alarmingly named 'catastrophic forgetting'. But there are reliable methods that can be used to add specific representational capacity to a pretrained model, such as textual inversion.
Textual inversion enables you to add a new pseudo-word to the embedding space of the model, representing a chosen image or style. The method effectively teaches the model the new word by examples.
Gal et al. (2022) present textual inversion on text-to-image models. They suggest 3-5 images of an object are sufficient to embed it into the model. I started off with about 6 images in my training set, and then increased this considerably, first to 136 and ultimately to 318.
Data and setup
I used Automatic1111's webui to train the textual inversion on Stable Diffusion v1.4.
My dataset was around 300 images I scanned from Brian Wildsmith books and found online. I named each file with a brief description of its content. The filenames are used in training to generate prompts of the form "a painting of [filewords] by [name]", where "[filewords]" comes from the filename and "[name]" is the new pseudo-word.
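As a rough illustration, here is a minimal sketch of how such a template expands. The directory, filenames and template string are hypothetical; the actual expansion is handled by the webui's training code.

```python
from pathlib import Path

# Hypothetical training directory and template; the real expansion is done by the webui.
image_dir = Path("train/brian_wildsmith")
template = "a painting of [filewords] by [name]"
pseudo_word = "brian_wildsmith"

for image_path in sorted(image_dir.glob("*.png")):
    # "elephant-with-geometric-pattern.png" -> "elephant with geometric pattern"
    filewords = image_path.stem.replace("-", " ").replace("_", " ")
    prompt = template.replace("[filewords]", filewords).replace("[name]", pseudo_word)
    print(prompt)  # e.g. "a painting of elephant with geometric pattern by brian_wildsmith"
```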
Images were cropped to 512x512. One hurdle was how to programmatically crop the images in a way that preserved features of interest. Centre-cropping meant that the resulting models generated images missing the tops of heads. For speed, I used a feature of the webui to programmatically crop based on a detected focal point, in combination with splitting large images into smaller, overlapping tiles, but this was not entirely satisfactory.
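For reference, here is a minimal sketch of the tiling half of that process. It ignores the focal-point detection, drops ragged right and bottom edges rather than padding them, and the file paths are placeholders.

```python
from PIL import Image

def tile_image(path, tile=512, overlap=128):
    """Split a large scan into overlapping square tiles for the training set.
    Ragged right/bottom edges are simply dropped in this simplified version."""
    img = Image.open(path).convert("RGB")
    w, h = img.size
    step = tile - overlap
    tiles = []
    for top in range(0, h - tile + 1, step):
        for left in range(0, w - tile + 1, step):
            tiles.append(img.crop((left, top, left + tile, top + tile)))
    return tiles

for i, t in enumerate(tile_image("scans/carousel.png")):
    t.save(f"train/carousel_{i:03d}.png")
```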
The work was done with permission of the Wildsmith family.
Results
Here are a couple of images that were generated randomly during training of the new embedding.
The geometric pattern in the background of the elephant is frequently generated by my model. It appears in the training set in some of Wildsmith's illustrations of abstract shapes. For instance, from his counting book:
On the other hand, the decorative design on the elephant's trunk does not appear in the training set, but was created by the model.
The following sets of examples were created using the same prompts, with embeddings using 1 and 16 vectors per token respectively, though different random seeds in each case. The version with 16 vectors per token captures more of the texture of the pencil and paintbrush strokes in the original images, as well as some unintended artefacts of the training data like page creases, photographic glare and shadows.
Results were generated with 20-80 sampling steps, and cfg_scale raised to 14 from the default of 7. More on this parameter below. Theoretically, higher cfg_scale values should force closer adherence to the prompt; I found they gave better-looking results, although they tended to reproduce more of the unwanted artefacts too.
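For anyone reproducing this outside the webui, roughly the same settings map onto the diffusers library as follows. This is only a sketch: the embedding filename and the prompt are placeholders, and load_textual_inversion assumes a reasonably recent diffusers version.

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

# Load the learned pseudo-word embedding (filename is a placeholder).
pipe.load_textual_inversion("brian_wildsmith.pt", token="brian_wildsmith")

image = pipe(
    "a carousel in the style of brian_wildsmith",
    num_inference_steps=50,   # I used 20-80 sampling steps
    guidance_scale=14,        # cfg_scale raised from the default of 7
).images[0]
image.save("carousel.png")
```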
Using 16 vectors per token (sadly the cat image was a victim of the cropping issue):
Classifier-free guidance for style transfer
One parameter I experimented with was the value controlling classifier-free guidance. Classifier-free guidance is a technique that can be applied at inference time to control the closeness with which the generated results adhere to the prompt. I wanted to understand what exactly classifier-free guidance is doing in the context of style transfer, both in terms of the mathematics and the visual results. In most codebases, this parameter is cfg_scale.
To show the effect of varying the classifier-free guidance scale, see the sequence of images below, for values ranging from 0 to 30 with a fixed seed. The model was prompted to draw a city of tents in the desert under a full moon, in Wildsmith's style. It's immediately noticeable that while results for initial values in [0,1] don't resemble the prompt, after this the model is gently exploring a set of similar points in the space.
This time, a house with a mountain range in the background. The image flickers between different placements of the house, leaving it out altogether for large classifier-free guidance scale values. Both of these quick experiments strongly suggest that there is no benefit to tuning the classifier-free guidance scale beyond a certain point.
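A sweep like the ones above can be reproduced by fixing the seed and varying only the guidance scale. A sketch, reusing the pipeline from the earlier snippet:

```python
import torch

prompt = "a house with a mountain range in the background, in the style of brian_wildsmith"

for scale in [0, 0.5, 1, 2, 4, 7, 10, 14, 20, 30]:
    # Re-seed each time so the initial latent noise is identical across frames.
    generator = torch.Generator(device="cuda").manual_seed(42)
    image = pipe(prompt, guidance_scale=scale, num_inference_steps=50,
                 generator=generator).images[0]
    image.save(f"house_cfg_{scale}.png")
```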
So what does it do? The method of classifier-free guidance developed from an earlier method, classifier guidance. This was introduced by Dhariwal and Nichol (2021) and was the technique that gave diffusion models the edge over GANs in image synthesis. I found that Sander Dieleman's blog post on guidance gave a careful motivation for the methods, and Lilian Weng's blog post covered the maths. This post builds on their accounts by including concrete examples of varying the classifier-free guidance values.
Conditional diffusion
To see what classifier guidance does, let's recap what exactly a conditional diffusion model like Stable Diffusion does. The trick to understanding the model is to see that the things we are modeling, images, are sampled from a probability distribution q(x), albeit an extremely complex one. Images are not noise; they have a structure, and this can be approximated by a distribution. We can also consider a noising process, called forward diffusion, in which we add incremental amounts of Gaussian noise to an image x_0 drawn from q(x), giving us a sequence x_t for t in [0, T] such that x_T ends up being pure noise. If we had access to q, we could take a sample of this pure noise and apply the reverse of the diffusion process, denoising it back to content, by estimating

q(x_{t-1} | x_t)

at each step and picking the x_{t-1} with greatest likelihood. This is not possible, however, since q depends on the entire dataset of images and is therefore intractable to compute.
What we can do is to train a model that approximates these conditional probabilities above, i.e. the probability of a slightly-less-noisy version of x, given a noisier version. Call this model

pθ(x_{t-1} | x_t)

Under certain assumptions, the distribution pθ is a Gaussian, thus computable. We can use pθ in the manner we would have liked to use q: to denoise images to get them back to noiselessness.
What we do when we're denoising is to use a 'noise predicting model'

εθ(x_{t+1}, y)

to estimate the noise added at timestep t, given the prompt y and the noised image x_{t+1}. This task is equivalent to predicting the incrementally denoised image x_t, because if you know what the added noise is, you know what the image is like when that noise is taken away.
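To make that equivalence concrete: in the standard DDPM parameterisation (which Stable Diffusion inherits), the predicted noise directly gives the mean of the denoising step. A sketch, using the usual convention in which εθ predicts the noise present in x_t:

$$
\mu_\theta(x_t, y) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, y)\right),
\qquad \alpha_t = 1 - \beta_t,\quad \bar{\alpha}_t = \prod_{s \le t}\alpha_s
$$

Subtracting a scaled copy of the predicted noise from x_t yields the mean of the Gaussian from which the less-noisy x_{t-1} is sampled.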
There is an additional hurdle: for applications of image generation, we would like to use a conditional diffusion model. This is to say, we want the denoising to be conditioned on some text prompt. We don't just want our denoising process to reveal any image that looks like it came from the underlying distribution q(x), but rather to reveal the image most closely aligned with a given text prompt, y. This means we want to instead use the conditional probability

q(x_{t-1} | x_t, y)

This can likewise be approximated by a Gaussian.
Here's how we apply that guidance. Note that diffusion models approximate

∇_x log pθ(x),

the 'score function' of the distribution, rather than the distribution itself. (The noise predictor εθ is, up to a negative scaling factor, an estimate of this score.)
Thus, the score function of the conditional distribution (conditional on y, that is) is

∇_x log pθ(x | y)

This makes sense because, in simple terms, in reverse conditional diffusion we want to know how much the log-probability of an image x given prompt y changes when we vary x, so that we can iteratively move towards the x that maximises it. That quantity, the gradient of the log-probability of x given the label y, is just what the expression above encodes.
The insight of classifier guidance is that we can use a classifier, in the sense of a classification model, to guide image generation. The classifier is trained on noisy versions of x labeled with classes y. Think of ImageNet, and how we can train a classifier model, such as a ResNet, on it. In this case, we use a noised version of ImageNet, one that has been put through the same forward diffusion process as is used in the diffusion model. For a full derivation, see Dhariwal and Nichol (2021), Appendix H.
We can decompose the score function of the conditional distribution so that it has a classifier model as a component. The decomposition is just an application of Bayes' rule, and is straightforward enough to repeat here. We start with the conditional score function:

∇_x log pθ(x | y)

Bayes' rule says that:

pθ(x | y) = pθ(y | x) pθ(x) / pθ(y)

Then, taking logs and gradients w.r.t. x on both sides, and distributing the log across the terms, we have:

∇_x log pθ(x | y) = ∇_x log pθ(x) + ∇_x log pθ(y | x)

since pθ(y) is not a function of x, so its gradient w.r.t. x is 0.
Notice that via the rearrangement above, we are expressing the score function of the conditional probability,

∇_x log pθ(x | y),

in terms of the gradient of the log of the unconditional probability,

∇_x log pθ(x),

and the gradient of the log of a classifier,

∇_x log pθ(y | x).

This last term is a classifier term because pθ(y | x) is just the thing that a classification model learns: the probability of the label given the data. In this way, the diffusion model's score function can be decomposed into an unconditional probability and a classifier.
Classifier guidance tweaks this score function to introduce a coefficient, s, controlling the gradient of the classifier term and allowing it to be amplified during denoising:

∇_x log pθ(x) + s ∇_x log pθ(y | x)
To see exactly what effect varying s in the new score function has on the model outputs, we start from the following identity, true for any constant Z:

s ∇_x log pθ(y | x) = ∇_x log [ (1/Z) pθ(y | x)^s ]

This is to say, scaling the gradient of the log probability of y given x by s is the same as taking the gradient of the log of the renormalised probability raised to the power s. This exponentiation has the effect of disproportionately increasing the distribution's larger values. Increasing the larger values is the same thing as sharpening the modes of the distribution, the values that maximise the probability density function. So, for s>1, the x that maximises

(1/Z) pθ(y | x)^s

will be closer to the mode of the distribution

pθ(y | x),

so closer to the class label y.
Classifier guidance is just the practice of dialing up the weight of the classifier term in the score function, which as we have seen results in x that maximise the likelihood of label y given x.
This is why classifier guidance is known to have the effect of boosting the fidelity of diffusion model outputs at the expense of diversity. As the name classifier guidance suggests, it is used to guide the score function closer to the modes of a classifier based on the prompt.
However, there is a huge drawback to classifier guidance — you need to have trained a classifier on noisy images!
Classifier-free guidance (Ho & Salimans, 2021) is a development of classifier guidance. It pushes the model in the same direction as classifier guidance, but avoids the need to train a specialised classifier.
To use classifier-free guidance, during training we replace the text caption y in a conditional diffusion model with a null label, ∅, a fixed proportion of the time, typically 10-20%. This process is called text-conditioning dropout. It means that the resulting model can be used as a conditional diffusion model (i.e. generation is conditioned on a text prompt), but also as an unconditional diffusion model (i.e. with no text prompt).
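A minimal sketch of what that dropout looks like inside a training loop; in Stable Diffusion the null label is, in practice, the embedding of the empty prompt.

```python
import random

def maybe_drop_caption(caption: str, p_drop: float = 0.1) -> str:
    """Text-conditioning dropout: with probability p_drop, replace the caption
    with the null label so the same network also learns unconditional denoising."""
    return "" if random.random() < p_drop else caption
```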
The score function for such a model is a weighted combination of the conditional and unconditional score functions, with weight s:

(1 - s) ∇_x log pθ(x) + s ∇_x log pθ(x | y)

where

∇_x log pθ(x | y)

is the score function of the conditional model, and

∇_x log pθ(x)

is the score function of the unconditional model. For s=0, this weighted combination is just the unconditional score function, and for s=1, it is the original conditional score function. As s increases beyond 1, the combination is weighted ever more heavily towards the conditional model, pushing away from the unconditional one. Typically, the classifier-free guidance value at inference time is set to 7 or so, strongly weighting towards the conditional guidance by the label y, which, for me, is my new pseudo-word brian_wildsmith.
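In code, this combination is applied to the noise predictions rather than to the score functions directly (the two are equivalent up to a scaling factor). Here is a minimal sketch of the guidance step, assuming a diffusers-style UNet whose forward pass returns the predicted noise:

```python
import torch

@torch.no_grad()
def guided_noise(unet, x_t, t, cond_emb, uncond_emb, scale: float):
    """Classifier-free guidance: (1 - scale) * unconditional + scale * conditional.
    scale=0 is purely unconditional, scale=1 is the plain conditional model,
    and scale>1 extrapolates towards the prompt."""
    eps_uncond = unet(x_t, t, encoder_hidden_states=uncond_emb).sample
    eps_cond = unet(x_t, t, encoder_hidden_states=cond_emb).sample
    return eps_uncond + scale * (eps_cond - eps_uncond)
```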
Now we have a feel for what's going on, let's end with two more examples. Below are the results for the prompt 'dove in the style of brian_wildsmith'. In this series of results, the classifier-free guidance scale ranges from 0 to 20, increasing in increments of 0.02 per frame, an order of magnitude smaller than in the sequences above.
At 14 frames per second, it takes about 4 seconds to reach s=1. At 4s, the frame is not discernibly a bird, but is just before the turning point at which a bird sharply emerges. This is when the score function is

∇_x log pθ(x | y)

That is, at s=1 the score function is just the plain conditional score function, with no guidance applied: the baseline we wanted to improve on by introducing classifier, and then classifier-free, guidance.
Prior to this point, the images resemble a portrait of a woman: not noise, but not an image that is conditioned on the prompt either.
There is another jump in form around 16 seconds in, when s is roughly 4. Beyond this, some more Wildsmith-like detail is added, but in general s>10 yields minor variations on the result with no particular improvement in quality. This matches the default classifier-free guidance scale of 7 in Stable Diffusion codebases.
Finally, a tree in the style of Brian Wildsmith. As before, the prompted image emerges a little after s=1 (4 seconds in), and there is a long tail of minor variations for s>4.
Textual inversion for bias reduction
Finally, I wanted to flag a use case of textual inversion that I haven't seen discussed beyond the original paper by Gal et al. It is well known that generative models reproduce the biases of their training data. For instance, the prompts 'doctor' and 'scientist' disproportionately produce images that resemble men. This reflects biases in the kind of images that are uploaded to image hosting sites where the training data is collected from.
When DALL-E 2 was first released, this bias received a lot of attention and OpenAI scrambled to release a patch that would force the model to produce more diverse images.
The first-pass solution was surprisingly hacky. It consisted of silently appending a keyword denoting an underrepresented group (such as 'female' or 'black') to the prompt, as users quickly discovered by prompting the model with "A doctor holding a sign that says", and observing the keyword printed on the sign in the generated image.
This phenomenon isn't reproducible in current versions, so clearly OpenAI have a more robust solution now. One solution would be to fix the training set. But retraining is computationally expensive.
As the authors of the textual inversion paper suggest, we can use the technique to update a trained model's understanding of words already in its vocabulary. Rather than using a novel token, as we do when adding a style or object to the model's vocabulary, we overwrite the embedding of an existing token, training on a small, curated set of images.
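As a rough sketch of the mechanics (assuming the learned vector has already been produced by a textual inversion run, and using a placeholder filename), overwriting an existing token in Stable Diffusion's CLIP text encoder comes down to replacing one row of its embedding table:

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

# Stable Diffusion v1.x uses the CLIP ViT-L/14 text encoder.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# A debiased embedding learned on a small curated image set (placeholder path),
# with the same dimensionality as the encoder's token embeddings (768 for ViT-L/14).
learned_embedding = torch.load("debiased_doctor_embedding.pt")

# Look up the existing token's id and overwrite its embedding row in place.
token_id = tokenizer("doctor", add_special_tokens=False)["input_ids"][0]
with torch.no_grad():
    text_encoder.get_input_embeddings().weight[token_id] = learned_embedding
```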
The compute budget for doing this is negligible relative to training the base model. My textual inversion above ran in about 4 hours on an unspectacular GeForce GTX 1080, whereas the initial training of Stable Diffusion took 150,000 GPU hours on the considerably more FLOP-heavy A100. This suggests that textual inversion is an elegant and practical tool for countering bias.
References
Ho, J. and Salimans, T. (2021). Classifier-Free Diffusion Guidance. NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications. https://arxiv.org/abs/2207.12598
Gal, R., Alaluf, Y., Atzmon, Y., Patashnik, O., Bermano, A., Chechik, G. and Cohen-Or, D. (2022). An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion. https://arxiv.org/abs/2208.01618
Nichol, A., Dhariwal, P., Ramesh, A., Shyam, P., Mishkin, P., McGrew, B., Sutskever, I. and Chen, M. (2022). GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models. https://arxiv.org/abs/2112.10741
Dhariwal, P. and Nichol, A. (2021). Diffusion Models Beat GANs on Image Synthesis. Advances in Neural Information Processing Systems, 34, 8780-8794.