Neural Radiance Fields

Part 1: Fit a Neural Field to a 2D image

Network

First, we create a multilayer perceptron (MLP) with sinusoidal positional encoding. The network takes in 2D pixel coordinates and outputs the corresponding RGB pixel color.
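To make this concrete, here is a minimal PyTorch sketch of such a network, assuming L = 10 frequencies and a 256-unit hidden width; the exact depth and widths of my network may differ.

```python
import torch
import torch.nn as nn

def positional_encoding(x, L=10):
    """Map each coordinate to [x, sin(2^0 pi x), cos(2^0 pi x), ..., sin(2^(L-1) pi x), cos(2^(L-1) pi x)]."""
    out = [x]
    for i in range(L):
        out.append(torch.sin((2.0 ** i) * torch.pi * x))
        out.append(torch.cos((2.0 ** i) * torch.pi * x))
    return torch.cat(out, dim=-1)

class NeuralField2D(nn.Module):
    def __init__(self, L=10, hidden=256):
        super().__init__()
        self.L = L
        in_dim = 2 * (1 + 2 * L)  # encoded dimension of a 2D pixel coordinate
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Sigmoid(),  # RGB in [0, 1]
        )

    def forward(self, uv):
        return self.mlp(positional_encoding(uv, self.L))
```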

MLP model architecture

Dataloader

The next step is to implement a dataloader that randomly samples pixels from the image at each iteration. This is necessary because feeding every pixel of a high-resolution image at once would exceed the GPU memory limit.
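A minimal sketch of this sampler, assuming the image is stored as an (H, W, 3) tensor with values in [0, 1]; the function name and batch size are illustrative.

```python
import torch

def sample_pixels(image, batch_size=10_000):
    """Return normalized (x, y) coordinates and RGB colors for a random batch of pixels."""
    H, W, _ = image.shape
    idx = torch.randint(0, H * W, (batch_size,))
    ys, xs = idx // W, idx % W
    coords = torch.stack([xs / W, ys / H], dim=-1).float()  # normalize to [0, 1)
    colors = image[ys, xs]                                   # ground-truth RGB values
    return coords, colors
```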

Loss function, Optimizer and Metric

As suggested, I used the mean squared error (MSE) between the predicted and true pixel colors as the loss function, and PSNR (which follows directly from MSE for images in [0, 1]) as the evaluation metric. With these, I trained my networks using Adam for 2000 iterations.
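For concreteness, a sketch of the training loop under those choices, assuming the model and sampler sketched above; the 10,000-pixel batch is an illustrative assumption, and 5e-3 is the learning rate selected by the tuning described next.

```python
import torch

model = NeuralField2D(L=10)
optimizer = torch.optim.Adam(model.parameters(), lr=5e-3)

for step in range(2000):
    coords, target = sample_pixels(image, batch_size=10_000)
    pred = model(coords)
    loss = torch.mean((pred - target) ** 2)   # MSE between predicted and true colors
    psnr = -10.0 * torch.log10(loss)          # PSNR derived directly from the MSE
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```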

Hyperparameter Tuning

The next step is hyperparameter tuning. I chose to explore the impact of varying the learning rate and L, the number of frequencies in the positional encoding. Here are my results:

Learning Rate Tuning
L tuning

As can be seen from the graph, the optimal learning rate among those I tested was 5e-3. Larger values of L also lead to better results over 2000 iterations, but there is little difference once L exceeds 10. For that reason, I used a learning rate of 5e-3 and kept the recommended L = 10.
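A hedged sketch of how such a sweep can be organized, assuming a hypothetical train_and_eval() helper built on the loop above; the candidate grids are placeholders rather than the exact values I swept.

```python
# Sweep learning rate and positional-encoding size L, keeping the best final PSNR.
results = {}
for lr in [2e-2, 5e-3, 1e-3]:        # placeholder learning-rate grid
    for L in [2, 6, 10, 14]:          # placeholder L grid
        results[(lr, L)] = train_and_eval(lr=lr, L=L, iters=2000)  # hypothetical helper
best_lr, best_L = max(results, key=results.get)
```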

Results

The first image I used is this photo of a fox:

Fox Original Image

With the selected parameters, I obtained the following results:

Fox Iteration 1
Fox Iteration 25
Fox Iteration 100
Fox Iteration 500
Fox Iteration 1000
Fox Iteration 2000 (Final)

Then, I repeated the process with the painting 'Las Meninas'. I chose it because there isn't a single subject in the middle but rather multiple characters. As with the fox image, the first step was hyperparameter tuning:

Learning Rate Tuning
L tuning

Interestingly, for this painting (over 2000 iterations) there was no perceptible difference between learning rates, as long as they stayed below 2e-2. I chose the highest learning rate among the ties since it gave better results at earlier iterations. As for L, there was no perceptible difference above 10. Here are my resulting images at various iterations:

Meninas Iteration 1
Meninas Iteration 25
Meninas Iteration 100
Meninas Iteration 500
Meninas Iteration 1000
Meninas Iteration 2000 (Final)

Part 2: Fit a Neural Radiance Field to a 3D Scene

In the second part of the project, I use a Neural Radiance Field (NeRF) to represent 3D space. Compared to the original NeRF paper, we use lower-resolution images and preprocessed cameras, but our overall procedure is similar.

2.1 Create Rays From Cameras

The following transform relates world coordinates to camera coordinates:

\[ \begin{bmatrix} x_c \\ y_c \\ z_c \\ 1 \end{bmatrix} = \begin{bmatrix} \mathbf{R}_{3 \times 3} & \mathbf{t} \\ \mathbf{0}_{1 \times 3} & 1 \end{bmatrix} \begin{bmatrix} x_w \\ y_w \\ z_w \\ 1 \end{bmatrix} \]

The matrix to the right of the equals sign is the extrinsic (world-to-camera, w2c) matrix. Next, we relate camera coordinates to pixel coordinates using the intrinsic matrix:

Camera to pixel: \[ s \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \begin{bmatrix} f_x & 0 & o_x \\ 0 & f_y & o_y \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x_c \\ y_c \\ z_c \end{bmatrix} \]

With these two conversions, we only need to define a pixel-to-ray conversion to obtain each ray's origin and direction: the origin is the camera center, and the direction is the normalized vector from that center towards the pixel's unprojected world-space location.
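A minimal sketch of that conversion, assuming K is the 3x3 intrinsic matrix and c2w is the 4x4 camera-to-world matrix (the inverse of the w2c extrinsic above); names are illustrative.

```python
import torch

def pixel_to_ray(K, c2w, uv):
    """uv: (N, 2) pixel coordinates. Returns ray origins and unit directions in world space."""
    uv = uv.float()
    N = uv.shape[0]
    # Offset to pixel centers, then unproject to camera space at depth z_c = 1.
    uv1 = torch.cat([uv + 0.5, torch.ones(N, 1)], dim=-1)
    x_c = (torch.linalg.inv(K) @ uv1.T).T
    # Transform the camera-space points to world space.
    x_c1 = torch.cat([x_c, torch.ones(N, 1)], dim=-1)
    x_w = (c2w @ x_c1.T).T[:, :3]
    # The ray origin is the camera center; the direction points towards the unprojected pixel.
    ray_o = c2w[:3, 3].expand(N, 3)
    ray_d = torch.nn.functional.normalize(x_w - ray_o, dim=-1)
    return ray_o, ray_d
```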

2.2 Sampling

Using the ray creation implemented in part 2.1, we can now implement sampling. The task is harder than in the 2D case: we must first sample rays from the training images and then sample points along each ray.
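A sketch of the point sampling along rays, reusing the rays from the sketch in 2.1; the near/far bounds and sample count are common defaults for this scene, not necessarily my exact settings.

```python
import torch

def sample_along_rays(ray_o, ray_d, near=2.0, far=6.0, n_samples=64, perturb=True):
    """Return (N, n_samples, 3) points along each ray and the sample depths t."""
    t = torch.linspace(near, far, n_samples)
    if perturb:
        # Jitter the depths during training so the network does not overfit to fixed samples.
        t = t + torch.rand(n_samples) * (far - near) / n_samples
    points = ray_o[:, None, :] + ray_d[:, None, :] * t[None, :, None]
    return points, t
```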

2.3 Putting the Dataloading Together

We can use the provided visualization to check that the code so far is correct. Here are the results I obtained:

Viser dataloading visualization
Viser: all rays from the same image

2.4 Neural Radiance Field

Here is the structure of a Neural Radiance Field:

NeRF model architecture

As part of this section, I implemented the network shown above. Compared to the network from Part 1, the first thing one notices is that this net is significantly deeper, which is explained by the fact that representing a 3D scene is far more challenging than a 2D image. It also stands out that we use an adapted positional encoding that now covers both the 3D coordinates and the ray direction.
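A hedged sketch of such a network, reusing the positional_encoding helper from Part 1; the depth and widths follow the general shape of the figure rather than my exact layer counts.

```python
import torch
import torch.nn as nn

class NeRF(nn.Module):
    def __init__(self, L_x=10, L_d=4, hidden=256):
        super().__init__()
        self.L_x, self.L_d = L_x, L_d
        in_x = 3 * (1 + 2 * L_x)   # encoded 3D position
        in_d = 3 * (1 + 2 * L_d)   # encoded ray direction
        self.trunk = nn.Sequential(
            nn.Linear(in_x, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.sigma_head = nn.Sequential(nn.Linear(hidden, 1), nn.ReLU())  # density >= 0
        self.rgb_head = nn.Sequential(
            nn.Linear(hidden + in_d, hidden // 2), nn.ReLU(),
            nn.Linear(hidden // 2, 3), nn.Sigmoid(),                      # color in [0, 1]
        )

    def forward(self, x, d):
        h = self.trunk(positional_encoding(x, self.L_x))
        sigma = self.sigma_head(h)
        rgb = self.rgb_head(torch.cat([h, positional_encoding(d, self.L_d)], dim=-1))
        return rgb, sigma
```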

2.5 Volumetric Rendering

Given the densities and colors predicted along each ray, the pixel color is computed with the discrete volume-rendering equation:

\[ \hat{C}(\mathbf{r}) = \sum_{i=1}^{N} T_i \left(1 - \exp(-\sigma_i \delta_i)\right) \mathbf{c}_i, \quad \text{where} \quad T_i = \exp\left(-\sum_{j=1}^{i-1} \sigma_j \delta_j\right) \]
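A minimal sketch of this equation in code, assuming per-ray tensors sigmas of shape (N_rays, N_samples, 1), rgbs of shape (N_rays, N_samples, 3), and sample spacings deltas with the same shape as sigmas.

```python
import torch

def volrend(sigmas, rgbs, deltas):
    """Composite per-sample colors along each ray into one RGB value per ray."""
    alphas = 1.0 - torch.exp(-sigmas * deltas)                       # per-sample opacity
    # T_i: transmittance accumulated before sample i (shifted cumulative product).
    trans = torch.cumprod(1.0 - alphas + 1e-10, dim=1)
    trans = torch.cat([torch.ones_like(trans[:, :1]), trans[:, :-1]], dim=1)
    weights = trans * alphas
    return (weights * rgbs).sum(dim=1)                               # (N_rays, 3)
```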

With this defined, we're ready to train the model and evaluate the results. Here are the training loss and validation PSNR of the model:

Training loss and PSNR
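For completeness, here is a hedged sketch of a single training step that ties together the ray sampling, network, and renderer sketched above; the sample count, spacing handling, and learning rate are illustrative assumptions.

```python
import torch

model = NeRF()
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)  # illustrative learning rate

def train_step(ray_o, ray_d, target_rgb, near=2.0, far=6.0, n_samples=64):
    points, t = sample_along_rays(ray_o, ray_d, near, far, n_samples)
    deltas = torch.full((ray_o.shape[0], n_samples, 1), (far - near) / n_samples)
    dirs = ray_d[:, None, :].expand_as(points)     # broadcast each ray's direction to its samples
    rgbs, sigmas = model(points, dirs)
    pred = volrend(sigmas, rgbs, deltas)
    loss = torch.mean((pred - target_rgb) ** 2)    # MSE against the ground-truth pixel colors
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```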

Instead of comparing results for six still images, I decided it looked better to create the video visualization for the intermediate models as well (in effect comparing 60 frames). I've also included the first six frames of each in the appendix to ensure I meet the requirement.

Rendered results after 150 steps
Rendered results after 500 steps
Rendered results after 1000 steps
Rendered results after 2000 steps

Bells & Whistles

Changing background color

For the bells and whistles portion, I decided to change the background color. This can be done with the same trained model, because the model already tells us where a ray hits no object: whatever transmittance remains after the last sample corresponds to background. By adding the product of this leftover transmittance and the desired RGB background color to the rendered color, we obtain a non-black setting for the Lego scene. Below are a grey and a red background:
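A sketch of this modification, building on the volrend sketch above; bg_color is any RGB triple in [0, 1].

```python
import torch

def volrend_with_background(sigmas, rgbs, deltas, bg_color):
    alphas = 1.0 - torch.exp(-sigmas * deltas)
    trans = torch.cumprod(1.0 - alphas + 1e-10, dim=1)
    trans = torch.cat([torch.ones_like(trans[:, :1]), trans[:, :-1]], dim=1)
    weights = trans * alphas
    color = (weights * rgbs).sum(dim=1)
    # Transmittance that survives the whole ray hit no geometry, so fill it
    # with the desired background color instead of leaving it black.
    leftover = trans[:, -1] * (1.0 - alphas[:, -1])
    return color + leftover * torch.tensor(bg_color)
```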

Grey background
Red background

Appendix

Here are the first six frames of each result GIF, in case the grader is looking at the PDF version:

150 steps - frame 1
150 steps - frame 2
150 steps - frame 3
150 steps - frame 4
150 steps - frame 5
150 steps - frame 6
500 steps - frame 1
500 steps - frame 2
500 steps - frame 3
500 steps - frame 4
500 steps - frame 5
500 steps - frame 6
1000 steps - frame 1
1000 steps - frame 2
1000 steps - frame 3
1000 steps - frame 4
1000 steps - frame 5
1000 steps - frame 6
2000 steps - frame 1
2000 steps - frame 2
2000 steps - frame 3
2000 steps - frame 4
2000 steps - frame 5
2000 steps - frame 6