Part 0 - Setup

To start off, I used the DeepFloyd IF model to generate images. Here, and in all steps of the project, I used 2024 as the seed. Here are the images I obtained with 20 inference steps:

Pencil at 20 inference steps
Barista at 20 inference steps
Waterfall at 20 inference steps

When the number of inference steps is reduced, image quality decreases significantly. Here are the same images with far fewer inference steps:

Pencil at 5 inference steps
Barista at 5 inference steps
Waterfall at 5 inference steps

As expected, both the quality of the image and the quality of the upscale are significantly worse.

Part 1 - Sampling loops

1.1 Implementing the forward process

An important part of diffusion is the forward process, which takes a clean image and outputs a noisy version of it. The noise level increases with the timestep t according to the following formula:

\[ q(x_t | x_0) = N(x_t; \sqrt{\bar{\alpha}_t} x_0, (1 - \bar{\alpha}_t) I) \]

which is equivalent to computing

\[ x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1 - \bar{\alpha}_t} \epsilon \quad \text{where } \epsilon \sim N(0, I) \]
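As a sanity check, this computation can be sketched in a few lines of NumPy (the real project operates on PyTorch tensors; `alphas_cumprod` here stands in for the model's cumulative noise schedule):

```python
import numpy as np

def forward(x0, t, alphas_cumprod, rng=None):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(abar_t) * x0, (1 - abar_t) * I)."""
    rng = np.random.default_rng(rng)
    abar = alphas_cumprod[t]
    eps = rng.standard_normal(x0.shape)  # epsilon ~ N(0, I)
    x_t = np.sqrt(abar) * x0 + np.sqrt(1.0 - abar) * eps
    return x_t, eps
```

Returning `eps` alongside `x_t` is convenient later, since the denoiser is trained to predict exactly this noise.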

After implementing this forward process, I obtained the following results:

Original Campanile image
Noisy Campanile for timestep t=250
Noisy Campanile for timestep t=500
Noisy Campanile for timestep t=750

1.2 Classical Denoising

Classical denoising can be attempted by applying a Gaussian blur to the noisy images. The idea is that since the noise is high frequency, we may remove some of it by limiting the image to its lower frequencies. In this case, however, the results are quite bad. Here we can see the results of classical denoising:

Noisy Campanile for timestep t=250
Noisy Campanile for timestep t=500
Noisy Campanile for timestep t=750
Gaussian-blurred Campanile for timestep t=250
Gaussian-blurred Campanile for timestep t=500
Gaussian-blurred Campanile for timestep t=750
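A minimal NumPy sketch of this kind of blur, using a separable 1-D Gaussian kernel applied along each axis (sigma and kernel radius are illustrative choices, not necessarily the values used for the figures above):

```python
import numpy as np

def gaussian_kernel(sigma, radius):
    """1-D Gaussian kernel, normalized to sum to 1."""
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x**2 / (2 * sigma**2))
    return k / k.sum()

def gaussian_blur(img, sigma=2.0):
    """Separable blur: convolve rows, then columns, with the same kernel."""
    k = gaussian_kernel(sigma, radius=int(3 * sigma))
    out = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 0, img)
    return np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, out)
```

Since the blur discards high frequencies indiscriminately, it removes image detail along with the noise, which is why the results look so poor.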

1.3 One-step denoising

In this step, we try to denoise images using the UNet. The goal of this subsection is to take a noisy image as input and return, directly (not iteratively), the model's best estimate of the denoised image. We provide no text guidance for the denoising other than the prompt 'a high quality photo'.

In this and subsequent steps of the project, we observe that when denoising a very noisy image, the result may be high quality yet quite different from the input. This is expected, since the model has to hallucinate the missing content, and it is what enables image generation in later sections.
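Concretely, the one-step estimate follows by inverting the forward equation; a sketch, assuming `eps_hat` is the UNet's predicted noise:

```python
import numpy as np

def estimate_x0(x_t, eps_hat, abar_t):
    """Invert x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps for x0."""
    return (x_t - np.sqrt(1.0 - abar_t) * eps_hat) / np.sqrt(abar_t)
```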

Here are my results for one-step denoising of the Campanile image:

Noisy Campanile for timestep t=250
Noisy Campanile for timestep t=500
Noisy Campanile for timestep t=750
One-step denoised Campanile for timestep t=250
One-step denoised Campanile for timestep t=500
One-step denoised Campanile for timestep t=750

1.4 Iterative denoising

The UNet achieves significantly better results when the image is denoised iteratively rather than in one step. Instead of stepping through every timestep, we stride 30 timesteps at a time (out of one thousand), since quality remains acceptable while runtime improves significantly.

The following relationship relates a noisy image at timestep t to one at an earlier timestep t' (less than t):

\[ x_{t'} = \frac{\sqrt{\bar{\alpha}_{t'}} \beta_t}{1 - \bar{\alpha}_t} x_0 + \frac{\sqrt{\alpha_t}(1 - \bar{\alpha}_{t'})}{1 - \bar{\alpha}_t} x_t + v_\sigma \]

Here, the alpha-bar values are given by the model's noise schedule, alpha_t is the ratio between alpha_t-bar and alpha_t'-bar, beta_t is 1 minus alpha_t, and v_sigma is random noise added back in.
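One update of this formula can be sketched as follows (a NumPy stand-in; `x0_hat` would come from the UNet's noise prediction, and `noise` plays the role of v_sigma):

```python
import numpy as np

def ddpm_skip_step(x_t, x0_hat, abar_t, abar_prev, noise):
    """Move from timestep t to the earlier timestep t' in one stride."""
    alpha_t = abar_t / abar_prev          # alpha_t = abar_t / abar_t'
    beta_t = 1.0 - alpha_t
    coef_x0 = np.sqrt(abar_prev) * beta_t / (1.0 - abar_t)
    coef_xt = np.sqrt(alpha_t) * (1.0 - abar_prev) / (1.0 - abar_t)
    return coef_x0 * x0_hat + coef_xt * x_t + noise
```

Looping this over a strided list of timesteps (e.g. every 30th) gives the iterative denoising procedure.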

By using this algorithm, starting from the noisy Campanile at timestep t=690, we achieve the following iterative denoising results.

Noisy Campanile for timestep t=690
Iteratively denoised Campanile at t=540
Iteratively denoised Campanile at t=390
Iteratively denoised Campanile at t=240
Iteratively denoised Campanile at t=90
Iteratively denoised Campanile
One-step denoised Campanile
Gaussian-blurred t=690 Campanile

1.5 Diffusion model sampling

Now, we notice that if we provide pure noise as input to the previous algorithm, it generates random but plausible images (nothing specific, since the prompt carries no information). Here are 5 samples I generated this way:

Naive sample 1
Naive sample 2
Naive sample 3
Naive sample 4
Naive sample 5

1.6 Classifier Free Guidance

While the previous images were somewhat realistic, they were undeniably of low quality. Here, we try to improve on that. Specifically, instead of taking a single noise prediction from the model, we estimate the noise twice, once with the prompt and once with a null string. Then, we overweight the prompted estimate using the following formula (for gamma greater than 1):

\[ \epsilon = \epsilon_u + \gamma (\epsilon_c - \epsilon_u) \]
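In code, this combination is a single line (gamma = 7 is shown as an illustrative guidance scale, not necessarily the value used for these samples):

```python
def cfg_noise(eps_uncond, eps_cond, gamma=7.0):
    """Classifier-free guidance: push the estimate past the conditional one."""
    return eps_uncond + gamma * (eps_cond - eps_uncond)
```

Note that gamma = 1 recovers the plain conditional estimate, and gamma = 0 the unconditional one.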

By using this, we may achieve higher quality images:

CFG sample 1
CFG sample 2
CFG sample 3
CFG sample 4
CFG sample 5

1.7 Image-to-image translation

In this section, we experiment with obtaining images similar to an input by adding significant noise and then denoising. If we add too much noise, the output ends up too different from the input; if we add too little, it stays too similar.

Here is the original Campanile image:

Original Campanile

And here are the edited versions:

Campanile Edit i=1
Campanile Edit i=3
Campanile Edit i=5
Campanile Edit i=7
Campanile Edit i=10
Campanile Edit i=20

I also edited an image of a minion toy. I believe that since it has an unusual shape, it took longer for the edited images to resemble the original:

Original Minion

And here are the edited versions:

Minion Edit i=1
Minion Edit i=3
Minion Edit i=5
Minion Edit i=7
Minion Edit i=10
Minion Edit i=20

Finally, I used an image of a mountain I took on vacation:

Original Mountain

And here are the edited versions:

Mountain Edit i=1
Mountain Edit i=3
Mountain Edit i=5
Mountain Edit i=7
Mountain Edit i=10
Mountain Edit i=20

1.7.1 Hand-drawn images

A similar procedure can be applied to hand-drawn or otherwise unrealistic images. Here are the results for a hand-drawn image I found online:

Internet house drawing
Internet drawing i=1
Internet drawing i=3
Internet drawing i=5
Internet drawing i=7
Internet drawing i=10
Internet drawing i=20

Then I applied this to my own terrible drawing skills. I achieved great results with a castle drawing and worse ones with a mountain drawing:

Castle drawing
Castle edit i=1
Castle edit i=3
Castle edit i=5
Castle edit i=7
Castle edit i=10
Castle edit i=20
Mountain drawing
Mountain edit i=1
Mountain edit i=3
Mountain edit i=5
Mountain edit i=7
Mountain edit i=10
Mountain edit i=20

1.7.2 Inpainting

Using a similar procedure, we can replace only a section of an image, inpainting that region while leaving the rest fixed. On the example image, we obtain these results:

Original Campanile
Campanile Mask
Inpainted Campanile
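The masking applied after every denoising step can be sketched as follows (here `x_orig_t` is the original image noised to the current timestep, and `mask` is 1 where new content should be generated):

```python
import numpy as np

def inpaint_step(x_t, x_orig_t, mask):
    """Keep the model's output inside the mask; restore the original elsewhere."""
    return mask * x_t + (1 - mask) * x_orig_t
```

Because the unmasked pixels are re-imposed at every step, the model is forced to hallucinate content only inside the mask that stays consistent with its surroundings.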

Next, I applied the same process to internet images. Interestingly, the Mona Lisa was transformed into an alien, and Trump disappeared into the background:

Original Mona Lisa
Mona Lisa Mask
Alien Mona
Original Trump
Trump Mask
Disappeared Trump

1.7.3 Text conditioned image-to-image

We can refine this editing procedure by using text prompts to guide it. In the case of the Campanile, we can turn it into a rocket by giving 'a rocket ship' as the prompt; the noise will then be hallucinated into a rocket ship. Here are the results for the Campanile:

Campanile to rocket, for varying i values

Then I converted Arnold Schwarzenegger into a skull (i=7 is particularly good):

Arnold original portrait
Arnold to skull

And here is Biden wearing a hat:

Biden original portrait
Biden to man with hat

1.8 Visual Anagrams

Let's try something else! In this section, we create an image that matches one prompt when upright and another when flipped upside down. To achieve this, in each denoising iteration we compute one noise estimate on the upright image with the first prompt and another on the flipped image with the second prompt (flipping that estimate back), then average the two.
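Per iteration, the combined noise estimate can be sketched like this (`eps_model` is a hypothetical stand-in for the UNet's noise prediction given an image and a prompt):

```python
import numpy as np

def anagram_noise(x_t, eps_model, prompt_up, prompt_down):
    eps1 = eps_model(x_t, prompt_up)                          # upright estimate
    eps2 = np.flipud(eps_model(np.flipud(x_t), prompt_down))  # flipped estimate, flipped back
    return (eps1 + eps2) / 2.0                                # average the two
```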

Here is a visual anagram that is an oil painting of an old man when upright, and a campfire scene when flipped upside down.

Old Man
Campfire

Here are some other examples:

Skull facing up
Barista facing down
Man wearing a hat facing up
Dog facing down

1.9 Hybrid Images

Similar to the last section, we can also create hybrid images. In this case, instead of flipping the image before calculating the noise, we calculate both estimates in the same orientation, then apply a low-pass filter to one and a high-pass filter to the other before combining them. Here is a hybrid between a skull and waterfalls:

Hybrid between skull and waterfall
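The frequency split can be sketched as follows (`eps1` and `eps2` are the two prompts' noise estimates; a crude box filter stands in for the Gaussian low-pass used in the project):

```python
import numpy as np

def box_lowpass(x, r=2):
    """Separable box-filter low-pass, a stand-in for a Gaussian blur."""
    k = np.ones(2 * r + 1) / (2 * r + 1)
    out = np.apply_along_axis(lambda row: np.convolve(row, k, mode="same"), 0, x)
    return np.apply_along_axis(lambda row: np.convolve(row, k, mode="same"), 1, out)

def hybrid_noise(eps1, eps2):
    """Low frequencies from the first estimate, high frequencies from the second."""
    return box_lowpass(eps1) + (eps2 - box_lowpass(eps2))
```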

Here are two more I made: a hybrid between a rocket and a dog, and one between a pencil and a barista:

Hybrid between rocket and dog
Hybrid between pencil and barista

Part 2: Diffusion Models from Scratch

Training a Single-Step Denoiser

In part B of the project, we implement the UNet from scratch. To do this, we must implement a series of building blocks, such as ConvBlocks, DownBlocks, and UpBlocks. The goal of this part is to assemble this model and train it to denoise unconditionally:

Unconditional UNet diagram

1.2 Training the UNet

First, we implement the noising procedure, similar to the one from part A of the project. These are the associated results:

Varying levels of noise

Using these noisy digits as input data, we train our model. Here is the training loss curve:

Training loss

Here are some sample results for the first and fifth epoch:

First epoch results
Fifth epoch results

Out-of-sample testing

After we finish training, we can test out of sample. Here are my results:

Out-of-sample results

Part 2: Training a Diffusion Model

In part 2, we extend the model from part 1, first by making it time-conditioned, and then by making it class-conditioned.

2.1 Time conditioning

In order to create a diffusion model, we must update our part 1 model to denoise iteratively rather than in one step. To adapt our previous model to this iterative process, we make the noise generation iterative (as defined in part A) and condition the model on the timestep t by adding two fully connected (FC) blocks. Here is the diagram for the conditioned model:

Time-conditioned UNet diagram
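A rough NumPy sketch of such an FC block (Linear, then GELU, then Linear) and of how its output can modulate a feature map; the weight shapes here are illustrative stand-ins, not the project's exact dimensions:

```python
import numpy as np

def gelu(x):
    """Tanh approximation of the GELU activation."""
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def fc_block(t, w1, b1, w2, b2):
    """Map the (normalized) timestep t to one value per channel."""
    return gelu(t * w1 + b1) @ w2 + b2

def condition(features, t, params):
    """Scale each channel of a (C, H, W) feature map by the FC block output."""
    scale = fc_block(t, *params)          # shape (C,)
    return features * scale[:, None, None]
```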

2.2 Training the UNet

After training the model from the previous part, I got the following loss curve:

Time-conditioned UNet log loss

2.3 Sampling from the UNet

Here are the sampling results after 1, 5, and 20 epochs:

Samples after 1 epoch
Samples after 5 epochs
Samples after 20 epochs

2.4 Adding class conditioning to the UNet

The final improvement to our model is adding class conditioning. Conditioned generation is not only the essence of what has made diffusion models so popular; it should also significantly improve our results.

To implement this, we add two new FC blocks to our model. We also implement dropout of the class conditioning, so that the model still works without it (this is also how larger real models work).
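The conditioning dropout can be sketched as follows (p = 0.1 is a typical choice, shown here as an assumption rather than the project's exact value):

```python
import numpy as np

def drop_label(onehot, p=0.1, rng=None):
    """With probability p, replace the one-hot class vector with zeros."""
    rng = np.random.default_rng(rng)
    return np.zeros_like(onehot) if rng.random() < p else onehot
```

Training on both conditioned and zeroed labels lets the same network serve as the conditional and unconditional model at sampling time.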

Here is the log loss curve for the class-conditioned model:

Class-conditioned UNet log loss

2.5 Sampling from the class-conditioned model

Here are the sampling results for the class-conditioned model:

Class-conditioned samples after 1 epoch
Class-conditioned samples after 5 epochs
Class-conditioned samples after 20 epochs