Part 0 - Setup

To start off, I used the DeepFloyd IF model to generate images. Here, and in all steps of the project, I used 2024 as the seed. Here are the images I obtained with 20 inference steps:

Pencil at 20 inference steps
Barista at 20 inference steps
Waterfall at 20 inference steps

When the number of inference steps is reduced, image quality decreases significantly. Here are the same images with far fewer inference steps:

Pencil at 5 inference steps
Barista at 5 inference steps
Waterfall at 5 inference steps

As expected, both the quality of the image and the quality of the upscale are significantly worse.

Part 1 - Sampling loops

1.1 Implementing the forward process

An important part of diffusion is the forward process, which takes a clean image and outputs a noisy version of it. The noise level increases with the timestep t according to the following formula:

\[ q(x_t | x_0) = N(x_t; \sqrt{\bar{\alpha}_t} x_0, (1 - \bar{\alpha}_t) I) \]

which is equivalent to computing

\[ x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1 - \bar{\alpha}_t} \epsilon \quad \text{where } \epsilon \sim N(0, I) \]
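As a sanity check, this computation can be sketched in a few lines of NumPy (the real project operates on PyTorch tensors; `alphas_cumprod` here stands in for the model's cumulative noise schedule):

```python
import numpy as np

def forward(x0, t, alphas_cumprod, rng=None):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(abar_t) * x0, (1 - abar_t) * I)."""
    rng = np.random.default_rng(rng)
    abar = alphas_cumprod[t]
    eps = rng.standard_normal(x0.shape)  # epsilon ~ N(0, I)
    x_t = np.sqrt(abar) * x0 + np.sqrt(1.0 - abar) * eps
    return x_t, eps
```

Returning `eps` alongside `x_t` is convenient later, since the denoiser is trained to predict exactly this noise.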

After implementing this forward process, I obtained the following results:

Original Campanile image
Noisy Campanile for timestep t=250
Noisy Campanile for timestep t=500
Noisy Campanile for timestep t=750

1.2 Classical Denoising

Classical denoising can be attempted by applying a Gaussian blur to the noisy images. The idea is that since the noise is high frequency, we may remove some of it by limiting the image to its lower frequencies. In this case, however, the results are quite bad. Here we can see the results of classical denoising:

Noisy Campanile for timestep t=250
Noisy Campanile for timestep t=500
Noisy Campanile for timestep t=750
Gaussian-blurred Campanile for timestep t=250
Gaussian-blurred Campanile for timestep t=500
Gaussian-blurred Campanile for timestep t=750
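A minimal NumPy sketch of this kind of blur, using a separable 1-D Gaussian kernel applied along each axis (sigma and kernel radius are illustrative choices, not necessarily the values used for the figures above):

```python
import numpy as np

def gaussian_kernel(sigma, radius):
    """1-D Gaussian kernel, normalized to sum to 1."""
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x**2 / (2 * sigma**2))
    return k / k.sum()

def gaussian_blur(img, sigma=2.0):
    """Separable blur: convolve rows, then columns, with the same kernel."""
    k = gaussian_kernel(sigma, radius=int(3 * sigma))
    out = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 0, img)
    return np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, out)
```

Since the blur discards high frequencies indiscriminately, it removes image detail along with the noise, which is why the results look so poor.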

1.3 One-step denoising

In this step, we try to denoise images using the UNet. The goal of this subsection is to take a noisy image as input and return, directly (not iteratively), the model's best estimate of the denoised image. We provide no text guidance for the denoising other than the prompt 'a high quality photo'.

In this and subsequent steps of the project, we observe that when denoising a very noisy image, the result may be high quality yet quite different from the input. This is expected, since the model has to hallucinate the missing content, and it is what enables image generation in later sections.
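Concretely, the one-step estimate follows by inverting the forward equation; a sketch, assuming `eps_hat` is the UNet's predicted noise:

```python
import numpy as np

def estimate_x0(x_t, eps_hat, abar_t):
    """Invert x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps for x0."""
    return (x_t - np.sqrt(1.0 - abar_t) * eps_hat) / np.sqrt(abar_t)
```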

Here are my results for one-step denoising of the Campanile image:

Noisy Campanile for timestep t=250
Noisy Campanile for timestep t=500
Noisy Campanile for timestep t=750
One-step denoised Campanile for timestep t=250
One-step denoised Campanile for timestep t=500
One-step denoised Campanile for timestep t=750

1.4 Iterative denoising

The UNet achieves significantly better results when the image is denoised iteratively rather than in one step. Instead of stepping through every timestep, we stride 30 timesteps at a time (out of one thousand), since quality remains acceptable while runtime improves significantly.

The following relationship relates a noisy image at timestep t to one at an earlier timestep t' (less than t):

\[ x_{t'} = \frac{\sqrt{\bar{\alpha}_{t'}} \beta_t}{1 - \bar{\alpha}_t} x_0 + \frac{\sqrt{\alpha_t}(1 - \bar{\alpha}_{t'})}{1 - \bar{\alpha}_t} x_t + v_\sigma \]

Here, the alpha-bar values are given by the model's noise schedule, alpha_t is the ratio between alpha_t-bar and alpha_t'-bar, beta_t is 1 minus alpha_t, and v_sigma is random noise added back in.
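One update of this formula can be sketched as follows (a NumPy stand-in; `x0_hat` would come from the UNet's noise prediction, and `noise` plays the role of v_sigma):

```python
import numpy as np

def ddpm_skip_step(x_t, x0_hat, abar_t, abar_prev, noise):
    """Move from timestep t to the earlier timestep t' in one stride."""
    alpha_t = abar_t / abar_prev          # alpha_t = abar_t / abar_t'
    beta_t = 1.0 - alpha_t
    coef_x0 = np.sqrt(abar_prev) * beta_t / (1.0 - abar_t)
    coef_xt = np.sqrt(alpha_t) * (1.0 - abar_prev) / (1.0 - abar_t)
    return coef_x0 * x0_hat + coef_xt * x_t + noise
```

Looping this over a strided list of timesteps (e.g. every 30th) gives the iterative denoising procedure.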

By using this algorithm, starting from the noisy Campanile at timestep t=690, we achieve the following iterative denoising results.

Noisy Campanile for timestep t=690
Iteratively denoised Campanile at t=540
Iteratively denoised Campanile at t=390
Iteratively denoised Campanile at t=240
Iteratively denoised Campanile at t=90
Iteratively denoised Campanile
One-step denoised Campanile
Gaussian-blurred t=690 Campanile

1.5 Diffusion model sampling

Now, we notice that if we provide pure noise as input to the previous algorithm, it generates random but plausible images (nothing specific, since the prompt carries no information). Here are 5 samples I generated this way:

Naive sample 1
Naive sample 2
Naive sample 3
Naive sample 4
Naive sample 5

1.6 Classifier Free Guidance

While the previous images were somewhat realistic, they were undeniably of low quality. Here, we try to improve on that. Specifically, instead of taking a single noise prediction from the model, we estimate the noise twice, once with the prompt and once with a null string. Then, we overweight the prompted estimate using the following formula (for gamma greater than 1):

\[ \epsilon = \epsilon_u + \gamma (\epsilon_c - \epsilon_u) \]
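In code, this combination is a single line (gamma = 7 is shown as an illustrative guidance scale, not necessarily the value used for these samples):

```python
def cfg_noise(eps_uncond, eps_cond, gamma=7.0):
    """Classifier-free guidance: push the estimate past the conditional one."""
    return eps_uncond + gamma * (eps_cond - eps_uncond)
```

Note that gamma = 1 recovers the plain conditional estimate, and gamma = 0 the unconditional one.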

By using this, we may achieve higher quality images:

CFG sample 1
CFG sample 2
CFG sample 3
CFG sample 4
CFG sample 5

1.7 Image-to-image translation

In this section, we experiment with obtaining images similar to an input by adding significant noise and then denoising. If we add too much noise, the output ends up too different from the input; if we add too little, it stays too similar.

Here is the original Campanile image:

Original Campanile

And here are the edited versions:

Campanile Edit i=1
Campanile Edit i=3
Campanile Edit i=5
Campanile Edit i=7
Campanile Edit i=10
Campanile Edit i=20

I also edited an image of a minion toy. I believe that since it has an unusual shape, it took longer for the edited images to resemble the original:

Original Minion

And here are the edited versions:

Minion Edit i=1
Minion Edit i=3
Minion Edit i=5
Minion Edit i=7
Minion Edit i=10
Minion Edit i=20

Finally, I used an image of a mountain I took on vacation:

Original Mountain

And here are the edited versions:

Mountain Edit i=1
Mountain Edit i=3
Mountain Edit i=5
Mountain Edit i=7
Mountain Edit i=10
Mountain Edit i=20

1.7.1 Hand-drawn images

A similar procedure can be applied to hand-drawn or otherwise unrealistic images. Here are the results for a hand-drawn image I found online:

Internet house drawing
Internet drawing i=1
Internet drawing i=3
Internet drawing i=5
Internet drawing i=7
Internet drawing i=10
Internet drawing i=20

Then I applied this to my own terrible drawing skills. I achieved great results with a castle drawing and worse ones with a mountain drawing:

Castle drawing
Castle edit i=1
Castle edit i=3
Castle edit i=5
Castle edit i=7
Castle edit i=10
Castle edit i=20
Mountain drawing
Mountain edit i=1
Mountain edit i=3
Mountain edit i=5
Mountain edit i=7
Mountain edit i=10
Mountain edit i=20

1.7.2 Inpainting

Using a similar procedure, we can replace only a section of an image, inpainting that region while leaving the rest fixed. On the example image, we obtain these results:

Original Campanile
Campanile Mask
Inpainted Campanile
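The masking applied after every denoising step can be sketched as follows (here `x_orig_t` is the original image noised to the current timestep, and `mask` is 1 where new content should be generated):

```python
import numpy as np

def inpaint_step(x_t, x_orig_t, mask):
    """Keep the model's output inside the mask; restore the original elsewhere."""
    return mask * x_t + (1 - mask) * x_orig_t
```

Because the unmasked pixels are re-imposed at every step, the model is forced to hallucinate content only inside the mask that stays consistent with its surroundings.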

Next, I applied the same process to internet images. Interestingly, the Mona Lisa was transformed into an alien, and Trump disappeared into the background:

Original Mona Lisa
Mona Lisa Mask
Alien Mona
Original Trump
Trump Mask
Disappeared Trump

1.7.3 Text conditioned image-to-image

We can refine this editing procedure by using text prompts to guide it. In the case of the Campanile, we can turn it into a rocket by giving 'a rocket ship' as the prompt; the noise will then be hallucinated into a rocket ship. Here are the results for the Campanile:

Campanile to rocket, for varying i values

Then I converted Arnold Schwarzenegger into a skull (i=7 is particularly good):

Arnold original portrait
Arnold to skull

And here is Biden wearing a hat:

Biden original portrait
Biden to man with hat

1.8 Visual Anagrams

Let's try something else! In this section, we create an image that matches one prompt when upright and another when flipped upside down. To achieve this, in each denoising iteration we compute one noise estimate on the upright image with the first prompt and another on the flipped image with the second prompt (flipping that estimate back), then average the two.
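Per iteration, the combined noise estimate can be sketched like this (`eps_model` is a hypothetical stand-in for the UNet's noise prediction given an image and a prompt):

```python
import numpy as np

def anagram_noise(x_t, eps_model, prompt_up, prompt_down):
    eps1 = eps_model(x_t, prompt_up)                          # upright estimate
    eps2 = np.flipud(eps_model(np.flipud(x_t), prompt_down))  # flipped estimate, flipped back
    return (eps1 + eps2) / 2.0                                # average the two
```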

Here is a visual anagram that is an oil painting of an old man when upright, and a campfire scene when flipped upside down.

Old Man
Campfire

Here are some other examples:

Skull facing up
Barista facing down
Man wearing a hat facing up
Dog facing down

1.9 Hybrid Images

Similar to the last section, we can also create hybrid images. In this case, instead of flipping the image before calculating the noise, we calculate both estimates in the same orientation, then apply a low-pass filter to one and a high-pass filter to the other before combining them. Here is a hybrid between a skull and waterfalls:

Hybrid between skull and waterfall
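The frequency split can be sketched as follows (`eps1` and `eps2` are the two prompts' noise estimates; a crude box filter stands in for the Gaussian low-pass used in the project):

```python
import numpy as np

def box_lowpass(x, r=2):
    """Separable box-filter low-pass, a stand-in for a Gaussian blur."""
    k = np.ones(2 * r + 1) / (2 * r + 1)
    out = np.apply_along_axis(lambda row: np.convolve(row, k, mode="same"), 0, x)
    return np.apply_along_axis(lambda row: np.convolve(row, k, mode="same"), 1, out)

def hybrid_noise(eps1, eps2):
    """Low frequencies from the first estimate, high frequencies from the second."""
    return box_lowpass(eps1) + (eps2 - box_lowpass(eps2))
```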

Here are two more I made: a hybrid between a rocket and a dog, and one between a pencil and a barista:

Hybrid between rocket and dog
Hybrid between pencil and barista

Part 2: Diffusion Models from Scratch

Training a Single-Step Denoiser

In part B of the project, we implement the UNet from scratch. To do this, we must implement a series of building blocks, such as ConvBlocks, DownBlocks, and UpBlocks. The goal of this part is to assemble this model and train it to denoise unconditionally:

Unconditional UNet diagram

1.2 Training the UNet

First, we implement the noising procedure, similar to the one from part A of the project. These are the associated results:

Varying levels of noise

Using these noisy digits as input data, we train our model. Here is the training loss curve:

Training loss

Here are some sample results for the first and fifth epoch:

First epoch results
Fifth epoch results

Out-of-sample testing

After we finish training, we can test out of sample. Here are my results:

Out-of-sample results

Part 2: Training a Diffusion Model

In part 2, we extend the model from part 1, first by making it time-conditioned, and then by making it class-conditioned.

2.1 Time conditioning

In order to create a diffusion model, we must update our part 1 model to denoise iteratively rather than in one step. To adapt our previous model to this iterative process, we make the noise generation iterative (as defined in part A) and condition the model on the timestep t by adding two fully connected (FC) blocks. Here is the diagram for the conditioned model:

Time-conditioned UNet diagram
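A rough NumPy sketch of such an FC block (Linear, then GELU, then Linear) and of how its output can modulate a feature map; the weight shapes here are illustrative stand-ins, not the project's exact dimensions:

```python
import numpy as np

def gelu(x):
    """Tanh approximation of the GELU activation."""
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def fc_block(t, w1, b1, w2, b2):
    """Map the (normalized) timestep t to one value per channel."""
    return gelu(t * w1 + b1) @ w2 + b2

def condition(features, t, params):
    """Scale each channel of a (C, H, W) feature map by the FC block output."""
    scale = fc_block(t, *params)          # shape (C,)
    return features * scale[:, None, None]
```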

2.2 Training the UNet

After training the model from the previous part, I got the following loss curve:

Time-conditioned UNet log loss

2.3 Sampling from the UNet

Here are the sampling results after 1, 5, and 20 epochs:

Samples after 1 epoch
Samples after 5 epochs
Samples after 20 epochs

2.4 Adding class conditioning to the UNet

The final improvement to our model is adding class conditioning. Conditioned generation is not only the essence of what has made diffusion models so popular; it should also significantly improve our results.

To implement this, we add two new FC blocks to our model. We also implement dropout of the class conditioning, so that the model still works without it (this is also how larger real models work).
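The conditioning dropout can be sketched as follows (p = 0.1 is a typical choice, shown here as an assumption rather than the project's exact value):

```python
import numpy as np

def drop_label(onehot, p=0.1, rng=None):
    """With probability p, replace the one-hot class vector with zeros."""
    rng = np.random.default_rng(rng)
    return np.zeros_like(onehot) if rng.random() < p else onehot
```

Training on both conditioned and zeroed labels lets the same network serve as the conditional and unconditional model at sampling time.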

Here is the log loss curve for the class-conditioned model:

Class-conditioned UNet log loss

2.5 Sampling from the class-conditioned model

Here are the sampling results for the class-conditioned model:

Class-conditioned samples after 1 epoch
Class-conditioned samples after 5 epochs
Class-conditioned samples after 20 epochs