Introduction and the Pinhole camera#

Have you ever wondered what complex mathematics runs behind the miniature cameras that we use in our everyday lives? Point-and-shoot cameras are so ubiquitous that we take camera technology for granted.

Consider an image of a railway track.

railways

The rails look as if they converge in the distance; however, we know that they do not. The depth of the scene is lost in the image (since the picture is a 2D image). This raises some imperative questions: How do we map a 3D object onto a 2D image? Better yet, how do we map a 3D point in space onto a 2D plane? While making this 3D-to-2D mapping, what information about the scene is lost? What illusions arise?

This section aims to answer the above questions while providing a detailed explanation of the pinhole camera.

Computer Vision Overview

  1. Low-level vision

    1. Image processing

    2. Edge detection

    3. Feature detection

    4. Cameras

    5. Image formation

  2. Geometry and algorithms

    1. Projective Geometry

    2. Stereo Vision

    3. Structure from motion

  3. Recognition

    1. Face detection/recognition

    2. Category recognition

    3. Image segmentation

To model a camera, one must preserve both the geometry and the semantics of the scene. Applications of computer vision include, but are not limited to, single-view modelling, detection and recognition, visual question answering, optical character recognition, entertainment (e.g., Snapchat), shape reconstruction using depth sensors, Building Rome in a Day, 3D scanning, medical imaging, and so on.

Image Formation#

How is a color image formed? Every image is the result of the following entities:

  1. Lighting Conditions

  2. Scene Geometry

  3. Surface Properties

  4. Camera Properties

The focus of this course is predominantly on camera properties. Image formation requires a light source (e.g., the sun) that continuously emits photons. These photons are reflected by a surface, and a portion of this light is directed towards the camera.

The sensor plane (the film) physically exists, whereas the image plane is virtual. The image on the sensor plane is flipped upside down compared to the virtual image plane.

Pinhole camera model#

The pinhole ensures that every point in the scene is mapped to a unique point on the film.

../_images/pinhole_1.PNG

Fig. 1 Pinhole camera#

../_images/pinhole_2.PNG

Fig. 2 Pinhole camera Model#

The image that we see on the screen after taking a picture corresponds to the virtual image plane. The same phenomenon occurs in the human eye.

Zooming in and zooming out changes the focal length of a camera (optical zoom, not digital zoom).

../_images/focal_length.PNG

Fig. 3 Pinhole camera 2D representation#

The above diagram gives us the first equation for image formation in computer vision. Given a 3D point \(P =(X,Y,Z)^T\), the camera projects this point onto a \(2D\) image plane at \(p = (x,y)^T\). From the above figure:

\[\begin{gather*} \frac{\text{Object size}}{\text{Object distance}} = \frac{\text{Image size}}{\text{Focal Length}} \end{gather*}\]
(1)#\[\frac{X}{Z} = \frac{x}{f}, \frac{Y}{Z} = \frac{y}{f}\]
\[x=f \frac{X}{Z}, y = f\frac{Y}{Z}\]

This is the simplest form of perspective projection. You can change the size of the image by changing any of the three parameters above. Taking a human face as a simple example, the image size can be increased by i) actually having a bigger object size \(Y\), ii) increasing the focal length \(f\), or iii) bringing the object closer to the camera (decreasing \(Z\)).
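The projection equations \(x = fX/Z\), \(y = fY/Z\) can be sketched in a few lines of Python (NumPy assumed; the function name `project_point` is ours):

```python
import numpy as np

def project_point(P, f):
    """Perspective projection of a 3D point P = (X, Y, Z) onto the
    image plane of a pinhole camera with focal length f:
    x = f * X / Z, y = f * Y / Z."""
    X, Y, Z = P
    return np.array([f * X / Z, f * Y / Z])

# A point 4 units in front of the camera, focal length 2:
p = project_point((1.0, 2.0, 4.0), f=2.0)  # -> [0.5, 1.0]
```

Halving the depth \(Z\) (or doubling \(f\)) doubles the projected size, exactly as discussed above.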

In the above figure, the focal length is the only parameter that can be changed (the size of the object and its distance from the camera are assumed to be fixed). This means that increasing the focal length increases the size of the image proportionally. When we zoom in, we increase the focal length.

If I zoom in/out of a picture after taking a picture, will focal length change?

After taking a picture, zooming in and out is a digital zoom; it does not change the focal length of the camera.

How to calculate the default focal length of a smartphone camera?

Calculating the focal length of camera

The aim of a camera is to map each 3D point in the world to a location on the 2D image plane. The camera performs this 3D-to-2D mapping, also called perspective projection. This leads to the question of how to recover the 3D location of a point from a 2D image.

While projecting from 3D to 2D, the depth information is lost, and so are the angles between objects in 3D. Each ray becomes a pixel on the image plane.

Note

In the above perspective projection, the angle between lines is not preserved and the depth information is lost.

When the distance between the camera and the objects is very large (a distant, bird's-eye view), the difference in size between closer and farther objects is not preserved; the projection reduces to a uniform scaling (weak perspective):

\(x = sX, \enspace y = sY, \text{ where } s=\frac{f}{Z_0}\) and \(Z_0\) is the (average) distance between the camera and the scene.
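Under this approximation every point shares the same scale factor; a minimal sketch (NumPy assumed; the function name `weak_perspective` is ours):

```python
import numpy as np

def weak_perspective(P, f, Z0):
    """Weak-perspective projection: every point is scaled by the same
    factor s = f / Z0, where Z0 is the average distance to the scene."""
    s = f / Z0
    X, Y, _ = P   # the individual depth Z is ignored
    return np.array([s * X, s * Y])

# With the scene ~1000 units away, individual depths barely matter:
p = weak_perspective((10.0, 20.0, 998.0), f=2.0, Z0=1000.0)  # -> [0.02, 0.04]
```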

Projection Matrix and the Intrinsic parameters#

(2)#\[\begin{split}Z \begin{pmatrix} \frac{fX}{Z} \\ \frac{fY}{Z} \\ 1 \end{pmatrix} = \begin{pmatrix} fX \\ fY \\ Z \end{pmatrix} = \begin{bmatrix} f & 0 & 0 & 0\\ 0 & f & 0 & 0\\ 0 & 0 & 1 & 0 \end{bmatrix}\begin{pmatrix} X\\ Y\\ Z\\ 1 \end{pmatrix}\end{split}\]

The simplest form of perspective projection in matrix form is given by the above equation.
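Equation (2) can be verified numerically. In this sketch (NumPy assumed), the homogeneous result is divided by its last entry to recover \((x, y)\):

```python
import numpy as np

f = 2.0
# 3x4 projection matrix of equation (2)
P_mat = np.array([[f,   0.0, 0.0, 0.0],
                  [0.0, f,   0.0, 0.0],
                  [0.0, 0.0, 1.0, 0.0]])

X_h = np.array([1.0, 2.0, 4.0, 1.0])  # homogeneous 3D point (X, Y, Z, 1)
p_h = P_mat @ X_h                     # = (fX, fY, Z)
x, y = p_h[:2] / p_h[2]               # divide by Z: (fX/Z, fY/Z)
```

The result agrees with \(x = fX/Z = 0.5\), \(y = fY/Z = 1.0\) computed directly from (1).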

What is the need for the extra row in the 3D coordinate?

The extra 1 turns \((X,Y,Z)^T\) into the homogeneous coordinates \((X,Y,Z,1)^T\); together with the extra column of zeros, this lets the projection (and, later, the translation) be written as a single matrix multiplication.

What do the distances \(X, Y, Z\) mean here?

\(X,Y,Z\) are the relative coordinates of a 3D point from the camera origin. This means that when the object is moved or the camera is moved, the values of \(X,Y,Z\) change.

This leads us to the question: what about the values of \(x,y\)? The projected coordinates given by the equations above are measured relative to the principal point of the image plane, whereas pixel coordinates are measured from the image origin, typically located at the bottom left or the top left.

Hence, we need to account for the principal point not coinciding with the image origin, so we add offsets \(o_x\) and \(o_y\) within the image plane. Keep in mind that this offset is not caused by movement of the object or the camera; it is an intrinsic parameter of every camera.

Equation (2) then becomes:

(3)#\[\begin{split}\begin{pmatrix} fX + Zo_x \\ fY + Zo_y\\ Z \end{pmatrix} = \begin{bmatrix} f & 0 & o_x & 0\\ 0 & f & o_y & 0\\ 0 & 0 & 1 & 0 \end{bmatrix}\begin{pmatrix} X\\ Y\\ Z\\ 1 \end{pmatrix}\end{split}\]
\[\begin{split} K = \begin{bmatrix} f & 0 & o_x \\ 0 & f & o_y \\ 0 & 0 & 1 \end{bmatrix} \end{split}\]

The focal length \(f\) need not always be the same along the \(x\) and \(y\) axes. A more general form of the projection matrix is:

(4)#\[\begin{split}K = \begin{bmatrix} f_x & 0 & o_x \\ 0 & f_y & o_y \\ 0 & 0 & 1 \end{bmatrix}\end{split}\]

This is the (intrinsic) calibration matrix.

A more general version of the above matrix, often called the camera calibration matrix, also accounts for the pixel aspect ratio and skew:

(5)#\[\begin{split}K = \begin{bmatrix} \gamma f_x & s & o_x \\ 0 & f_y & o_y \\ 0 & 0 & 1 \end{bmatrix}\end{split}\]

\(\gamma\) - when you zoom in and out, the \(x\) and \(y\) coordinates normally change proportionally. If the change along the \(x\) and \(y\) axes is disproportionate, \(\gamma\) accounts for it. \(\gamma\) is also called the aspect ratio of the pixel.

\(s\) - the skew of the sensor pixel, i.e., whether the pixel is a parallelogram rather than a square.

There are 5 intrinsic parameters for a camera; refer to (5).
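As a sketch (NumPy assumed; the helper name `intrinsic_matrix` and the numeric values are ours), the calibration matrix of (5) can be built and sanity-checked: a ray along the optical axis should land at the principal point.

```python
import numpy as np

def intrinsic_matrix(fx, fy, ox, oy, gamma=1.0, skew=0.0):
    """Build the 5-parameter calibration matrix of equation (5):
    gamma is the pixel aspect ratio and skew the sensor skew."""
    return np.array([[gamma * fx, skew, ox],
                     [0.0,        fy,   oy],
                     [0.0,        0.0,  1.0]])

K = intrinsic_matrix(fx=800.0, fy=800.0, ox=320.0, oy=240.0)

# A ray along the optical axis (X=0, Y=0, Z=1) lands at the principal point:
p = K @ np.array([0.0, 0.0, 1.0])  # -> [320., 240., 1.]
```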

Extrinsic parameters of a camera#

The extrinsic parameters come into the picture even before shooting an image: where the camera is located and how it is oriented (rotation and translation).

These are the parameters that uniquely identify the transformation between the unknown camera reference frame and the known world reference frame.

Determining these parameters includes:

  1. Finding the translation vector between the relative positions of the origins of the two reference frames.

  2. Finding the rotation matrix that brings the corresponding axes of the two frames into alignment (i.e., onto each other).

../_images/rot_translate.PNG

Fig. 4 Rotation and Translation from world coordinates to camera coordinates.#

Using the extrinsic camera parameters, we can find the relation between the coordinates of a point \(P\) in the world \((P_w)\) and camera \((P_c)\) coordinates:

(6)#\[P_c = R(P_w-T)\]

where

(7)#\[\begin{split}R = \begin{bmatrix} r_{11} & r_{12} & r_{13} \\ r_{21} & r_{22} & r_{23} \\ r_{31} & r_{32} & r_{33} \end{bmatrix}\end{split}\]
\[\begin{split} \text{If } P_c = \begin{bmatrix} X_c \\ Y_c \\ Z_c \end{bmatrix} \text{ and } P_w = \begin{bmatrix} X_w \\ Y_w \\ Z_w \end{bmatrix}, \text{ then} \end{split}\]
(8)#\[\begin{split}\begin{bmatrix} X_c \\ Y_c \\ Z_c \end{bmatrix} = \begin{bmatrix} r_{11} & r_{12} & r_{13} \\ r_{21} & r_{22} & r_{23} \\ r_{31} & r_{32} & r_{33} \end{bmatrix}\begin{bmatrix} X_w-T_x \\ Y_w-T_y \\ Z_w-T_z \end{bmatrix}\end{split}\]

In other words, we first translate the coordinates to match the camera coordinates and then rotate the axes so that both camera axes and world coordinate axes overlap.

\[\begin{split} X_c = R_1^T(P_w-T) \\ Y_c = R_2^T(P_w-T) \\ Z_c = R_3^T(P_w-T) \end{split}\]

where \(R_i^T\) corresponds to the \(i^{th}\) row of the rotation matrix.
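A minimal numeric check of (6) and (8) (NumPy assumed; the rotation and translation below are made up for illustration):

```python
import numpy as np

def world_to_camera(P_w, R, T):
    """Rigid transform of equation (6): P_c = R (P_w - T), where T is the
    camera position expressed in world coordinates."""
    return R @ (np.asarray(P_w) - np.asarray(T))

# Hypothetical setup: 90-degree rotation about the Z axis, camera at (1, 0, 0)
R = np.array([[0.0,  1.0, 0.0],
              [-1.0, 0.0, 0.0],
              [0.0,  0.0, 1.0]])
T = np.array([1.0, 0.0, 0.0])
P_c = world_to_camera([1.0, 1.0, 5.0], R, T)  # -> [1., 0., 5.]
```

Note that \(R\) satisfies \(RR^T = I\), as any rotation matrix must per (12).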

What is a camera?

A camera is nothing but a 3D-point-to-2D-point mapping function with 11 parameters (intrinsic and extrinsic).

3D to 2D mapping (both intrinsic and extrinsic)#

  1. Given an image of an object, we would like to find the intrinsic and extrinsic parameters of a camera.

  2. With these intrinsic and extrinsic parameters, we want to use the camera to map any 3D point in the real world to a 2D coordinate on the image plane.

\([X_w, Y_w, Z_w]^T \rightarrow [X_c, Y_c, Z_c]^T\) is a rigid transformation and \([X_c, Y_c, Z_c]^T \rightarrow s[x,y,1]^T\) is a projective transformation (perspective projection).

\[\begin{split} s\begin{bmatrix}x\\ y\\ 1 \end{bmatrix} = \begin{bmatrix} \gamma f_x & s & o_x \\ 0 & f_y & o_y \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0& 1& 0\end{bmatrix} \begin{bmatrix}R_{3 \times 3} & 0_{3 \times 1} \\ 0_{1 \times 3} & 1\end{bmatrix} \begin{bmatrix} I_{3 \times 3} & -T_{3 \times 1} \\ 0_{1 \times 3} & 1\end{bmatrix}\begin{bmatrix} X_w \\ Y_w \\ Z_w \\ 1\end{bmatrix} \end{split}\]
(9)#\[\begin{split}s\begin{bmatrix}x\\ y\\ 1 \end{bmatrix} = \begin{bmatrix} f_x & 0 & o_x \\ 0 & f_y & o_y \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} r_{11} & r_{12} & r_{13} & T_x \\ r_{21} & r_{22} & r_{23} & T_y \\ r_{31} & r_{32} & r_{33} & T_z \end{bmatrix} \begin{bmatrix} X_w \\ Y_w \\ Z_w \\ 1\end{bmatrix}\end{split}\]

\([x, y]^T\) are the image coordinates and \([X_w, Y_w, Z_w]^T\) are the world coordinates. In (9), \([T_x, T_y, T_z]^T\) is the translation expressed in the camera frame (it equals \(-RT\) for the world-frame translation \(T\) of (6)), and the skew and aspect-ratio terms are dropped (\(s = 0\), \(\gamma = 1\)).

The intrinsic and extrinsic parameters combine to form the matrix \(M\).

\[\begin{split} M = M_{in}.M_{ex} = \begin{bmatrix}m_{11} & m_{12} & m_{13} & m_{14} \\ m_{21} & m_{22} & m_{23} & m_{24} \\ m_{31} & m_{32} & m_{33} & m_{34} \end{bmatrix} \end{split}\]
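Composing the two factors, the full pipeline can be sketched (NumPy assumed; identity rotation, a camera-frame translation, and made-up intrinsics are chosen so the expected pixel is easy to verify by hand):

```python
import numpy as np

# Made-up intrinsics (K) and extrinsics (R = I, camera-frame translation T)
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0,   0.0,   1.0]])
Rt = np.hstack([np.eye(3), [[0.0], [0.0], [1.0]]])  # [R | T], shape 3x4
M = K @ Rt                                          # full projection matrix

P_w = np.array([0.5, 0.25, 3.0, 1.0])  # homogeneous world point
p_h = M @ P_w
x, y = p_h[:2] / p_h[2]                # -> (420.0, 290.0)
```

The camera-frame point is \((0.5, 0.25, 4.0)\), so \(x = 800 \cdot 0.5/4 + 320 = 420\) and \(y = 800 \cdot 0.25/4 + 240 = 290\), matching the code.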

Intuition behind the 3D \(\rightarrow\) 2D mapping in a camera.#

(10)#\[\begin{split}s\begin{bmatrix}x\\ y\\ 1 \end{bmatrix} = \begin{bmatrix}m_{11} & m_{12} & m_{13} & m_{14} \\ m_{21} & m_{22} & m_{23} & m_{24} \\ m_{31} & m_{32} & m_{33} & m_{34}\end{bmatrix} \cdot \begin{bmatrix} X_w \\ Y_w \\ Z_w \\ 1\end{bmatrix}\end{split}\]

Let's suppose you have \(n\) image points from a camera and their corresponding world coordinates. Our aim is to find the matrix \(M\). From (10), we have:

\[\begin{split} sx = m_{11}X_w + m_{12}Y_w + m_{13}Z_w + m_{14} \\ sy = m_{21}X_w + m_{22}Y_w + m_{23}Z_w + m_{24} \\ s = m_{31}X_w + m_{32}Y_w + m_{33}Z_w + m_{34}\end{split}\]

To solve for the matrix \(M\),

../_images/matrix_m.PNG

Fig. 5 Solving for the matrix \(M\)#

where the first matrix is of size \(2n \times 12\) (\(n\) is the number of available point correspondences).

In the above homogeneous linear system \(Ax=0\), we know the image coordinates and the real-world coordinates, and we need to find the elements of the matrix \(M\), stacked into the vector \(x\). The system \(Ax = 0\) has infinitely many solutions, since any solution \(x\) can be scaled by a scalar \(\lambda\) and \(A(\lambda x)= 0\) still holds. Therefore, we impose \(\|x\|=1\), and solving the equation becomes the minimization:

\[ \min_{\|x\|=1}\|Ax\| \]

The minimization problem can be solved with the Singular Value Decomposition (SVD). Decomposing \(A = U\Sigma V^T\), and using the fact that the orthogonal matrix \(U\) preserves norms, we have

\[ \min_{\|x\|=1}\|Ax\| = \min_{\|x\|=1}\|U \Sigma V^Tx\| = \min_{\|x\|=1}\|\Sigma V^Tx\| \]

Substituting \(y = V^Tx\), with \(\|y\|=\|x\|=1\), the problem becomes \(\min_{\|y\|=1}\|\Sigma y\|\).

Since the singular values in \(\Sigma\) are sorted in decreasing order, \(\|\Sigma y\|\) is minimized by \(y = (0,\dots,0,1)^T\), so \(x = Vy\) is the last column of \(V\) (equivalently, the last row of \(V^T\)).
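The construction of \(A\) from point correspondences and the SVD solution can be sketched as follows (NumPy assumed; the function name `solve_projection_matrix` and the synthetic data are ours). We project points with a known \(M\) and confirm that the estimate matches it up to scale:

```python
import numpy as np

def solve_projection_matrix(world_pts, image_pts):
    """Direct linear solve of equation (10): stack two rows of A per point
    correspondence and take the right singular vector associated with the
    smallest singular value."""
    rows = []
    for (Xw, Yw, Zw), (x, y) in zip(world_pts, image_pts):
        P = [Xw, Yw, Zw, 1.0]
        rows.append(P + [0.0] * 4 + [-x * c for c in P])   # sx equation
        rows.append([0.0] * 4 + P + [-y * c for c in P])   # sy equation
    A = np.array(rows)                    # shape (2n, 12)
    _, _, Vt = np.linalg.svd(A)
    return Vt[-1].reshape(3, 4)           # solution, defined up to scale

# Synthetic check: project points with a known M, then recover it.
M_true = np.array([[800.0, 0.0, 320.0, 100.0],
                   [0.0, 800.0, 240.0,  50.0],
                   [0.0,   0.0,   1.0,   2.0]])
rng = np.random.default_rng(0)
world = rng.uniform(0.0, 5.0, size=(8, 3))
homog = np.hstack([world, np.ones((8, 1))])
proj = homog @ M_true.T
image = proj[:, :2] / proj[:, 2:3]

M_est = solve_projection_matrix(world, image)
M_est *= M_true[2, 3] / M_est[2, 3]       # fix the arbitrary scale (and sign)
```

With noise-free correspondences, `M_est` agrees with `M_true` to numerical precision; with real measurements the SVD gives the least-squares estimate instead.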

Once we have the matrix \(M\),

(11)#\[\begin{split}M = \begin{bmatrix}m_{11} & m_{12} & m_{13} & m_{14} \\ m_{21} & m_{22} & m_{23} & m_{24} \\ m_{31} & m_{32} & m_{33} & m_{34}\end{bmatrix} = \begin{bmatrix} f_xr_{11}+o_xr_{31} & f_x r_{12}+o_xr_{32} & f_xr_{13}+o_xr_{33}&f_xT_x+o_xT_z \\ f_y r_{21}+o_yr_{31} & f_yr_{22}+ o_y r_{32} & f_yr_{23}+o_yr_{33} & f_yT_y+o_yT_z \\ r_{31} & r_{32} & r_{33} & T_z\end{bmatrix}\end{split}\]

Here, \(M\) is the projection matrix. Let’s define

\[\begin{split} \begin{align} m_1 &= (m_{11}, m_{12}, m_{13})^T \\ m_2 &= (m_{21}, m_{22}, m_{23})^T \\ m_3 &= (m_{31}, m_{32}, m_{33})^T \\ m_4 &= (m_{14}, m_{24}, m_{34})^T \end{align} \end{split}\]

Also we define,

\[\begin{split} \begin{align} r_1 &= (r_{11}, r_{12}, r_{13})^T \\ r_2 &= (r_{21}, r_{22}, r_{23})^T \\ r_3 &= (r_{31}, r_{32}, r_{33})^T \end{align} \end{split}\]

Observe that \(r_1^T, r_2^T, r_3^T\) are the rows of the rotation matrix \(R\), so orthonormality gives \(R^TR = I\):

(12)#\[\begin{split}(r_1, r_2, r_3) \begin{pmatrix} r_1^T\\ r_2^T\\ r_3^T \end{pmatrix} = \begin{pmatrix} 1 & 0 & 0\\ 0 & 1 & 0\\ 0 & 0 & 1 \end{pmatrix}\end{split}\]

Then we have \(r_i^Tr_i = 1, r_i^Tr_j=0 \enspace (i \neq j)\).

From \(M\), using \(m_3 = r_3\), \(m_1 = f_x r_1 + o_x r_3\), and \(m_2 = f_y r_2 + o_y r_3\), we have

\[o_x = m_1^Tm_3, \enspace o_y = m_2^Tm_3\]
\[m_1^Tm_1 = f_x^2+o_x^2, \enspace m_2^Tm_2 = f_y^2+o_y^2\]
\[f_x = \sqrt{m_1^Tm_1 - o_x^2}, \enspace f_y = \sqrt{m_2^Tm_2 - o_y^2}\]
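A sketch of recovering the intrinsic parameters from \(M\) (NumPy assumed; the function name `intrinsics_from_M` and the synthetic matrix are ours), checked on a projection matrix with identity rotation:

```python
import numpy as np

def intrinsics_from_M(M):
    """Recover o_x, o_y, f_x, f_y from a projection matrix M, assuming M is
    scaled so that its third row (r31, r32, r33) is a unit vector."""
    m1, m2, m3 = M[0, :3], M[1, :3], M[2, :3]
    ox, oy = m1 @ m3, m2 @ m3                 # o_x = m1^T m3, o_y = m2^T m3
    fx = np.sqrt(m1 @ m1 - ox**2)             # f_x = sqrt(m1^T m1 - o_x^2)
    fy = np.sqrt(m2 @ m2 - oy**2)
    return ox, oy, fx, fy

# Synthetic M = K [R | T] with R = I, built from known intrinsics:
M = np.array([[800.0, 0.0, 320.0, 100.0],
              [0.0, 800.0, 240.0,  50.0],
              [0.0,   0.0,   1.0,   2.0]])
ox, oy, fx, fy = intrinsics_from_M(M)  # -> 320.0, 240.0, 800.0, 800.0
```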

Summary#

This chapter discusses the pinhole camera model and the intrinsic and extrinsic camera parameters. Given the 3D world coordinates of an object, a camera performs a 3D \(\rightarrow\) 2D mapping of each point onto an image plane. There are a total of 11 parameters for any camera to perform this perspective projection mapping.

This chapter also explains the mathematics used to find the intrinsic and extrinsic parameters of a camera from image coordinates and world coordinates. Given an image of an object and \(n\) pairs of world and image coordinates, we derive the matrix \(M = M_{in}\cdot M_{ex}\) that maps any 3D point in the real world to a 2D point on the image plane.

The next chapter provides code for the above mathematics: we use chessboard corners as objects and find the intrinsic and extrinsic parameters of a given camera.