Wednesday 21 August 2013

Data Accuracy

Across the default and near modes, Microsoft advertises a usable depth range of 40cm to 4 metres, with millimetre granularity along the depth axis.  The process of “undoing” the perspective view of the camera essentially stretches the orthogonal plane with respect to depth – see the previous post on Mapping Depth Data into Virtual Space.
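As a quick recap of that mapping, the sketch below back-projects a depth pixel into camera space under a simple pinhole-style model. The scaling constant K and the centred principal point are illustrative stand-ins, not the exact empirically derived values discussed in that earlier post.

```python
import numpy as np

# Illustrative scaling constant: metres of x/y extent per metre of
# depth, per pixel. A stand-in for the empirically derived value from
# the earlier post, not the exact figure used in our pipeline.
K = 0.00178

def depth_pixel_to_camera_space(u, v, depth_m, width=640, height=480):
    """Back-project depth-map pixel (u, v) at depth_m metres into
    camera-space coordinates, undoing the perspective projection by
    stretching the x/y plane proportionally to depth."""
    x = (u - width / 2.0) * depth_m * K
    y = (v - height / 2.0) * depth_m * K
    return np.array([x, y, depth_m])
```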

Using the empirically derived constants, a theoretical resolution for the sensor can be determined: at the closest point (40cm), the x/y-plane resolution is 0.712mm per pixel, whereas at 4 metres from the sensor this coarsens to 7.12mm.  The effective resolution scales linearly with depth and assumes the higher capture resolution of 640x480.
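As a worked check of those figures, the linear model can be written down directly; the slope below is simply 0.712mm divided by 400mm, and inverting it gives the 1mm crossover distance used in the comparison that follows.

```python
# Linear model: the x/y footprint of one pixel grows in proportion
# to depth. 0.712 mm at 400 mm gives the slope directly.
SLOPE = 0.712 / 400.0  # mm of x/y footprint per mm of depth

def plane_resolution_mm(depth_mm):
    return SLOPE * depth_mm

print(plane_resolution_mm(400))    # 0.712 mm at the 40 cm near limit
print(plane_resolution_mm(4000))   # 7.12 mm at the 4 m far limit

# Inverting the model gives the depth at which the footprint reaches
# a target resolution, e.g. the 1 mm laser-scanning standard:
print(1.0 / SLOPE)                 # ~562 mm, the 56.2 cm quoted below
```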

The resolution at closer distances compares very favourably with the equipment typically used in the field by archaeologists; between the project partners, we settled on 1mm-accuracy laser scanning equipment as the standard to compare against.  Anything measured within 56.2cm of the Kinect sensor would therefore, in theory, be at least as good as this standard.

The linear coarsening of resolution as objects move further from the camera is something that should also be taken into account when performing the Iterative Closest Point (ICP) algorithm, perhaps by favouring point pairings closer to the camera, as sketched below.  While this has not been factored into our current process, it is certainly worth investigating as an aid to tracking accuracy.  It also highlights the need for objects close to the camera to remain visible while tracking and stitching for optimal results; these could later be removed from the final models.
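One way to express that preference, should we pursue it, is to weight each ICP correspondence by the inverse of its depth when estimating the rigid transform. The sketch below shows a single weighted alignment step (a weighted Kabsch solve); the correspondence search and the iteration loop are elided, and the function names are our own.

```python
import numpy as np

def depth_weights(points, eps=1e-6):
    """Weight correspondences inversely with depth, so near points
    (captured at finer x/y resolution) dominate the fit."""
    return 1.0 / np.maximum(points[:, 2], eps)

def weighted_rigid_transform(src, dst, w):
    """Best rigid transform mapping paired points src -> dst (both
    (N, 3) arrays) under per-pair weights w: one weighted Kabsch
    solve, as used inside each ICP iteration."""
    w = w / w.sum()
    src_c = (w[:, None] * src).sum(axis=0)   # weighted centroids
    dst_c = (w[:, None] * dst).sum(axis=0)
    H = ((src - src_c) * w[:, None]).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T  # guard against reflection
    t = dst_c - R @ src_c
    return R, t
```

Dropped into a standard ICP loop in place of the unweighted estimate, this lets pairings nearer the camera pull the alignment more strongly.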

The camera-plane resolution is only one part of the overall data accuracy question; the next is how reliable the depth data returned from the camera is.  To give an indication of this, the camera was pointed at a static scene and the depth measurements were recorded and compared over time.  Lighting was kept constant and nothing visible entered or exited the scene.  The scene is illustrated below and contains visible points that span from too close to the camera, through its normal range, and beyond:


It should be noted at this stage that what follows are observations based on rudimentary experiments, by no means rigorously tested under strict scientific conditions.  That said, conditions were controlled well enough to provide indicative, meaningful results.

The first thing we looked at was the variation in depth for a given pixel in the depth map.  Over 300 frames of the static scene (10 seconds), the minimum and maximum depths reported per pixel were extracted along with an average pixel depth.  The average pixel depth is then plotted against the average variation, with outliers removed.  This gives the plot below; note that the default Kinect mode is used, giving an effective sensor range of 0.8m to 4m.


This plot demonstrates that there is noise at all depths from the camera sensor, although measurements under 1.7m appear to suffer less, as the baseline variation tends to be lower.  This adds further support for keeping scanned objects close to the camera rather than far away.  Many of the high peaks can probably be attributed to the edges of objects, as our scene has little in the way of smooth gradient changes.  The depth error therefore appears to be within a few millimetres at nearer distances.
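For reference, the per-pixel statistics behind the plot were gathered along the following lines. This sketch assumes the 300 frames have already been captured into an (N, H, W) array of depths in millimetres, with zero marking an invalid reading; the capture step and the outlier removal are elided.

```python
import numpy as np

def per_pixel_statistics(frames):
    """frames: (N, H, W) depth maps in mm, with 0 = invalid reading.
    Returns the per-pixel mean depth and the variation (max - min),
    computed only over the frames in which each pixel was valid."""
    masked = np.ma.masked_array(frames, mask=(frames == 0))
    mean = masked.mean(axis=0).filled(0)
    variation = (masked.max(axis=0) - masked.min(axis=0)).filled(0)
    return mean, variation
```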

In addition to the variation in depth per pixel, there are times when the sensor fluctuates between returning valid and invalid data for a pixel.  Constructing a histogram of the percentage of frames in which each pixel was valid suggests that in our example 89% of the pixels remained valid throughout, leaving 11% fluctuating.  A plot of this 11% (normalised) is given below; note that the 100%-valid bin has not been plotted, to ensure the remaining values can be displayed on a meaningful scale.


The majority of those pixels without 100% valid depth values across time have either a very low (below 1%) or a very high (above 98%) valid percentage.  While “missing” depth data isn’t a significant problem, it is worth noting that it happens: erroneous values (such as those from pixels with low valid percentages) need to be pruned, and it cannot be assumed that each pixel will consistently report a depth.  Pruning algorithms therefore need to take temporal history into account, as in the sketch below.
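A minimal sketch of that pruning, under the same (N, H, W) frame-stack assumption as above: compute the fraction of frames in which each pixel reported valid data, then drop pixels whose history falls below a threshold. The 50% cut-off here is illustrative, not a tuned value.

```python
import numpy as np

def prune_by_temporal_history(frames, min_valid_fraction=0.5):
    """frames: (N, H, W) depth maps in mm, with 0 = invalid reading.
    Returns a (H, W) boolean mask of pixels whose depth history is
    trustworthy enough to keep, plus the per-pixel valid fraction
    (the quantity histogrammed in the plot above)."""
    valid_fraction = (frames > 0).mean(axis=0)
    keep = valid_fraction >= min_valid_fraction
    return keep, valid_fraction
```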

In summary, the accuracy of the Kinect sensor appears sufficient to capture objects at 1-2mm resolution when they are less than 1 metre from the camera.  However, because of the variation in the depth data and the fluctuating validity of pixel data, care needs to be taken when designing the tracking and stitching process to accommodate the level of error we are seeing.