Intel’s RealSense cameras are astonishingly precise but not as accurate. By optimising the calibration of the depth stream and correcting for non-linearity, the accuracy can be improved by an order of magnitude at 2.5 metres and becomes almost linear in the depth:

Sources of error

Calculating the coordinates of the 3D point corresponding to a depth reading is straightforward trigonometry – here‘s a quick refresher – but the accuracy of the results depends on several factors.

Accuracy of the intrinsics

The supplied Intel® RealSense™ Depth Module D400 Series Custom Calibration program uses the traditional method, displaying a chequerboard to the camera in various poses and solving for the intrinsics. There are several issues with this methodology:

This method establishes the intrinsics solely for the colour camera.
Although it resolves to sub-pixel accuracy, it does so on a single frame, which is imprecise. The results of 3D calculations are extremely sensitive to errors in the field-of-view: A one-degree error in the vertical field of view translates into >11mm error at 1 metre. Concomitant errors in the horizontal field of view make matters worse and they are quadratic in the depth.
The depth stream is synthesised by the stereo depth module and the vision processor. Imperfections anywhere in the chain (unforeseen distortion, varying refraction at different wavelengths, heuristics in the algorithms, depth filtering) may negatively affect the accuracy. One cannot assume that an apparently perfect colour image will produce ideal results in the depth map.

Non-linearity

This is readily observed with the supplied DepthQuality tool. When viewing a target at a measured distance of 1’000 from the glass, the instantaneous reported depth is out by ~22mm:

By averaging depth measurements over a period, errors in precision can be eliminated. Averaged over 1’000’000 measurements, my out-of-the-box D435 reports a range of 980.70mm – an error of 19.3mm. This is within the specified accuracy 2%=20mm but increases quadratically, as is to be expected. Fortunately, this non-linearity appears to be constant for a given camera and once determined, can be eliminated.

Focal Point

The focal point of the depth map is supplied in the Intel RealSense D400 Series Datasheet, for a D435 it is defined as being 3.2mm behind of the glass. Presumably due to the manufacturing tolerances of ±3%, the focal point may in reality be tens of mm away.

Mounting

No matter how precisely the camera is mounted, there will be errors between the mounting and the camera’s true central axis. Knowing them improves the accuracy when translating from the camera frame to the parent (vehicle or world) frame.

Solution

I have written a program that calibrates a camera based solely on measurements in the depth stream. It derives all the parameters discussed above by making several observations of a target with known dimensions. A much higher degree of accuracy is obtained by averaging over a large number of measurements. The optimal parameter values are then calculated, as a single problem, with a non-linear solver.

It is open-source, available on GitHub https://github.com/smirkingman/RealSense-Calibrator

Screenshot of an optimiser output:

Discussion

The comparisons presented above use the Z-range as the metric, as this is the metric in the reference documentation. The Z measure alone is only part of the answer, a more realistic metric is the 3D error of the point: the vector between the truth and the 3D point determined by the camera and software. Futhermore, just supplying a number doesn’t tell the whole story. Traditional error analysis supplies descriptive statistics, which give a value and a confidence known as the 68–95–99.7 rule, which allows us to make statements like “The error will be no more than Xmm 99.7% of the time” (3-sigma, or 3σ).

The 3D error – the length of the vector between the true coordinates of the point and what the camera+software reported is:

The 3-sigma error is:

what this shows is that at 1.5 metres, the coordinates of the 3D point will be between 1’447 and 1’553 from the camera 99.7% of the time.

A depth camera measures the distance from the camera to a point in 3D space. For a given point, the camera supplies the row and column on its ‘screen’ and the depth towards the point. It is worth pointing out here that classic depth cameras like the Kinect supply the length of the ray; RealSense cameras supply the range, or Z component.

Calculating the coordinates of the point is fairly straightforward trigonometry. Suppose a D435 camera is mounted 500mm off the ground, pointing at the horizon. 1000mm away there is an object 101.5mm high:

To warm up, the camera’s vertical field of view is 56°, so at 1’000mm half of the height is

$1000*tan(56/2)=531.7mm$

The camera has 480 rows, so it will see the 101.5mm-high object at row

$((480-1)/2)+(-398.5)/531.7*((480-1)/2)=row\ 60$

Bonus: It sees the object at an angle of

$Atan⁡(398.5/1000)=-21.7^\circ$

Now we define constants for the Fx intrinsics, the centre row and the height of a pixel:

$VFov2=VFov/2$

$VSize=Tan(VFov2)*2$

$VCentre=(Rows-1)/2$

$VPixel=VSize/(Rows-1)$

$VRatio=(VCentre-row)*VPixel$

$Y=Range*-VRatio$

Notice the ‘Rows-1’ because there are 479 intervals between 480 pixels: row 239 points just under the horizon and row 240 points just above the horizon.

Then, for the example above we define our constants:

$VFov2=56/2=28$

$VSize=Tan(28)*2=1.0634$

$VCentre=(480-1)/2=239.5$

$VPixel=1.0634/(480-1)=0.00222$

and calculate the Y coordinate:

$VRatio=(239.5-60)*0.00222=0.3985$

$Y=1000*-VRatio=-398.5 mm$

The calculations for the X-coordinate are identical, replacing ‘Vertical’ with ‘Horizontal’ and Z is simply the supplied range.

Maurice's Musings

Improving the depth map accuracy of RealSense cameras – by an order of magnitude