Wednesday 21 August 2013

Data Accuracy

Taking the default and near modes together, Microsoft advertises a usable depth range of 40cm to 4 metres with millimetre granularity along the depth axis.  The process of “undoing” the perspective view of the camera essentially stretches the orthogonal plane with respect to depth – see the previous post on Mapping Depth Data into Virtual Space.

Using the empirically derived constants, a theoretical resolution of the sensor can be determined: at the closest point (40cm), the x/y-plane resolution is 0.712mm, whereas at 4 metres from the sensor this worsens to 7.12mm.  The effective resolution varies linearly with depth and is based on the higher capture resolution of 640x480.
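As a rough illustration (and not part of the original derivation), these figures fall straight out of the constants described in the Mapping Depth Data into Virtual Space post.  The short C++ sketch below assumes the 640x480 PERSPECTIVE_CORRECTION constant of 0.00000178:

#include <cstdio>

// Empirically derived constant for the 640x480 depth map; converts a pixel
// offset into metres per millimetre of depth.
const double PERSPECTIVE_CORRECTION = 0.00000178;

// Size, in millimetres, of one depth-map pixel in the x/y-plane at the given
// depth (depth supplied in millimetres).
double XYResolutionMM(double depthMM)
{
    return depthMM * PERSPECTIVE_CORRECTION * 1000.0; // metres -> millimetres
}

int main()
{
    std::printf("Resolution at 400mm:  %.3fmm\n", XYResolutionMM(400.0));   // ~0.712mm
    std::printf("Resolution at 4000mm: %.3fmm\n", XYResolutionMM(4000.0));  // ~7.12mm
    // Depth at which the pixel size matches a 1mm laser-scanning standard.
    std::printf("1mm or better up to:  %.1fmm\n", 1.0 / (PERSPECTIVE_CORRECTION * 1000.0)); // ~561.8mm
    return 0;
}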

The resolution at close range compares very favourably with the equipment typically used in the field by archaeologists; between the project partners, we were looking at using 1mm-accuracy laser scanning as the standard to compare against.  Anything measured up to 56.2cm from the Kinect sensor would therefore, in theory, be equal to or better than this standard.

The linear coarsening of resolution as objects move further from the camera is also something that should be taken into account when performing the Iterative Closest Point algorithm, perhaps by favouring pairings closer to the camera.  While this has not been factored into our current process, it is certainly worth investigating to aid tracking accuracy.  It also highlights the need for objects close to the camera to remain visible while tracking and stitching for optimal results; these could later be removed from the final models.

The camera-plane resolution is only one part of the overall data accuracy question; the next is how reliable the depth data returned from the camera is.  To give an indication of this, the camera was pointed towards a static scene and the depth measurements recorded and compared over time.  Lighting was kept constant and nothing visible entered or left the scene.  The scene is illustrated below and contains visible points that span from being too close to the camera, through its normal range and beyond:


It should be noted at this stage that what follows are observations based on rudimentary experiments, by no means rigorously tested under strict scientific conditions.  That said, conditions were controlled well enough to provide some indicative results.

The first thing we looked at was the variation of depth for a given pixel in the depth map.  Over 300 frames of the static scene (10 seconds), the minimum and maximum depths reported for each pixel were extracted along with an average pixel depth.  The average pixel depth is plotted against the average variation, with outliers removed.  This gives the plot below – note that the default Kinect range is used, giving an effective sensor range of 0.8m to 4m.
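For anyone wishing to repeat the experiment, the per-pixel statistics are straightforward to gather.  The sketch below is a minimal version (not our actual capture code) that assumes the depth frames have already been read into memory as millimetre values, with 0 marking an invalid reading:

#include <cstdint>
#include <vector>

struct PixelStats
{
    uint16_t minDepth = 0;    // minimum valid depth seen (mm)
    uint16_t maxDepth = 0;    // maximum valid depth seen (mm)
    double   sumDepth = 0.0;  // running sum of valid depths (mm)
    int      validCount = 0;  // number of frames with a valid reading
};

// frames: one vector per captured frame, each with width*height entries of
// millimetre depths, where 0 means "no reading" for that pixel.
std::vector<PixelStats> GatherStats(const std::vector<std::vector<uint16_t>>& frames,
                                    int width, int height)
{
    std::vector<PixelStats> stats(width * height);
    for (const auto& frame : frames)
    {
        for (int i = 0; i < width * height; ++i)
        {
            const uint16_t d = frame[i];
            if (d == 0)
                continue; // pixel invalid in this frame
            PixelStats& s = stats[i];
            if (s.validCount == 0 || d < s.minDepth) s.minDepth = d;
            if (s.validCount == 0 || d > s.maxDepth) s.maxDepth = d;
            s.sumDepth += d;
            ++s.validCount;
        }
    }
    return stats;
}

// The average depth (sumDepth / validCount) can then be plotted against the
// variation (maxDepth - minDepth) for each pixel to reproduce the plot below.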


This plot demonstrates that there is noise at all depths from the camera sensor, although measurements at less than 1.7m appear to suffer less, as the baseline variation tends to be lower.  This adds further support for using this technology to scan objects close to the camera rather than far away.  Many of the high peaks can probably be attributed to the edges of objects, as our scene has little in the way of smooth gradient changes.  The depth error therefore seems to be within a few millimetres at nearer distances.

In addition to the variation of depths per pixel, there are times when the sensor fluctuates between returning valid and invalid data.  Constructing a histogram of the percentage of frames in which each pixel returned valid data suggests that in our example 89% of the pixels remained valid throughout, leaving 11% fluctuating.  A plot of this 11% (normalised) is given below (note that the 100% valid bin has not been plotted, to ensure the remaining values can be displayed on a meaningful scale).
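The percentage-valid histogram is then just a binning exercise over the per-pixel valid counts; a possible sketch (again illustrative rather than our exact code, using 1% bins and the 300-frame capture) is:

#include <array>
#include <vector>

// validCounts holds, for each pixel, the number of frames in which it
// returned a valid depth; frameCount is the total number of frames captured.
std::array<int, 101> ValidPercentageHistogram(const std::vector<int>& validCounts,
                                              int frameCount)
{
    std::array<int, 101> histogram{}; // one bin per whole percentage, 0..100
    for (int count : validCounts)
    {
        const int percent = static_cast<int>(100.0 * count / frameCount + 0.5);
        ++histogram[percent];
    }
    // The 100% bin holds the ~89% of pixels that never fluctuated; the other
    // bins are what is plotted (normalised) in the figure below.
    return histogram;
}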


The majority of those pixels without 100% valid depth values across time have either a very low (below 1%) or a high (above 98%) valid percentage.  While “missing” depth data isn’t a significant problem, it is worth noting that it happens: erroneous values (such as those with low valid percentages) need to be pruned, and it cannot be relied upon that each pixel will consistently return a depth value.  Pruning algorithms therefore need to take temporal history into account.

In summary, the accuracy of the Kinect sensor appears sufficient to capture objects at a resolution of 1-2mm at distances of less than 1 metre from the camera.  However, because of the variation in the depth data and the fluctuating validity of pixel data, care needs to be taken when designing the tracking and stitching process to accommodate the level of error that we are seeing.

Thursday 4 July 2013

When Tracking and Merging Goes Wrong

As mentioned in previous posts, the Kinect sensor data can be noisy, which can lead to errors during both the ICP and merging processes that potentially compound each other.  The process itself is also subject to inherent algorithmic issues.  This short post gives a couple of examples where the tracking drifts, and demonstrates the importance both of the previously discussed perspective correction factor and of balancing the thresholds used in the various processes.

In the first video we’ve completely removed the perspective correction factor.  Without this, as the camera moves around the scene, the tracking process does its best to stitch the “skewed” frame data into world space, yet inevitably fails.

The next video takes a much larger pan; because the flat wall surfaces being tracked provide little depth variation, the tracking algorithm drifts upwards.


In the above videos, the top two left windows show what the camera is seeing as a colour image and as a greyscale depth map.  The bottom left window is a 3D point cloud representation of the RGB and depth data combined from the Kinect sensor (which would normally be correctly coloured, but the green in this window represents point matches into the world space).  The larger window in the centre of the screen is the compiled world space.  The green wireframe box indicates the current camera position and orientation within this world, and the green points are those that are paired with the individual captures from the Kinect device.

Underneath the larger 3D window are "debug" outputs - the one on the left gives internal states for the steps within the matching process and the right one gives the camera orientation of the current frame in terms of rotation and offset into the global space.

Merging Point Clouds

Once the frame data has been aligned with the world space using the iterative closest point algorithm, it can be merged to create the larger environment.  In this process we maintain the concept of point clouds rather than creating surfaces.  There are three components to merging the dataset:
  1. Refine existing world points
  2. Add new frame points
  3. Remove erroneous world points

Merging point clouds proceeds by considering only a subset of world points.  This subset is defined as the points that fall within the camera’s view frustum once it has been positioned by the frame transformation; a sketch of this test is given below.  In the discussion that follows, this subset will be referred to as simply the world points.
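We won’t go into our exact implementation here, but the frustum test can reuse the constants from the Mapping Depth Data into Virtual Space post, since the x/y extent of the camera’s view grows linearly with depth.  In the sketch below the world point is assumed to have already been transformed into camera space by the inverse of the frame transformation:

#include <cmath>

struct Vec3 { double x, y, z; };

// Returns true if a point expressed in camera space falls inside the
// camera's view frustum.  The frustum half-width and half-height at a given
// depth are recovered from the same PERSPECTIVE_CORRECTION constant used for
// the depth mapping.
bool InsideFrustum(const Vec3& p)
{
    const double PERSPECTIVE_CORRECTION = 0.00000178; // 640x480 constant
    const int    DEPTH_FRAME_WIDTH  = 640;
    const int    DEPTH_FRAME_HEIGHT = 480;
    const double NEAR_M = 0.8, FAR_M = 4.0;            // default-mode range

    const double depthM = -p.z;  // the camera looks down the negative z-axis
    if (depthM < NEAR_M || depthM > FAR_M)
        return false;

    const double depthMM    = depthM * 1000.0;
    const double halfWidth  = (DEPTH_FRAME_WIDTH  / 2) * depthMM * PERSPECTIVE_CORRECTION;
    const double halfHeight = (DEPTH_FRAME_HEIGHT / 2) * depthMM * PERSPECTIVE_CORRECTION;
    return std::fabs(p.x) <= halfWidth && std::fabs(p.y) <= halfHeight;
}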


Refine Existing World Points
Each world point is matched against a frame point after applying the frame transformation.  The matching threshold can be stricter than the one used in the ICP process in order to increase world point cloud density; for example, a threshold of 1cm will provide a final resolution of 1cm, whereas a threshold of 1mm will provide much more fidelity, although it might also introduce errors due to the level of noise returned by the Kinect sensor.  More information about the level of accuracy, noise and reliability will be given in a future post.

The world points are updated using the matches, where one frame point may map to many world points.  After the existing points are updated, all world points and frame points that are involved in a match are ignored for the remaining merging processes.
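The post describes the refinement in general terms; exactly how matched points are combined isn’t spelled out, and the running average used below is just one plausible choice.  The sketch continues the Vec3 type from the frustum example and uses a brute-force nearest-neighbour search for clarity (a spatial index would be used in practice):

#include <cmath>
#include <limits>
#include <vector>

struct WorldPoint
{
    Vec3 position;
    int  observations  = 1;     // how many frames have supported this point
    int  lastSeenFrame = 0;     // frame index at which it was last matched
};

static double Distance(const Vec3& a, const Vec3& b)
{
    const double dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    return std::sqrt(dx * dx + dy * dy + dz * dz);
}

// Refine the frustum-culled world points against frame points that have
// already been transformed into world space.  frameMatched has one entry per
// frame point and records which of them were consumed by this step.
void RefineWorldPoints(std::vector<WorldPoint>& worldPoints,
                       const std::vector<Vec3>& framePoints,
                       std::vector<bool>& frameMatched,
                       double threshold, int frameIndex)
{
    for (WorldPoint& wp : worldPoints)
    {
        double bestDist = std::numeric_limits<double>::max();
        int    best = -1;
        for (size_t i = 0; i < framePoints.size(); ++i)
        {
            const double d = Distance(wp.position, framePoints[i]);
            if (d < bestDist) { bestDist = d; best = static_cast<int>(i); }
        }
        if (best >= 0 && bestDist <= threshold)
        {
            // Fold the new observation into the existing point as a running
            // average; note that one frame point may refine many world points.
            const Vec3& fp = framePoints[best];
            const double n = ++wp.observations;
            wp.position.x += (fp.x - wp.position.x) / n;
            wp.position.y += (fp.y - wp.position.y) / n;
            wp.position.z += (fp.z - wp.position.z) / n;
            wp.lastSeenFrame = frameIndex;
            frameMatched[best] = true; // excluded from the remaining steps
        }
    }
}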


Add New Frame Points
The frame points that were not matched to a world point are considered new points and added to the world point dataset.
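Continuing the sketch above, adding the unmatched frame points is then a simple append:

// Any frame point (already transformed into world space) that was not
// matched during the refinement step is treated as newly observed geometry.
void AddNewFramePoints(std::vector<WorldPoint>& worldPoints,
                       const std::vector<Vec3>& framePoints,
                       const std::vector<bool>& frameMatched,
                       int frameIndex)
{
    for (size_t i = 0; i < framePoints.size(); ++i)
    {
        if (frameMatched[i])
            continue;
        WorldPoint wp;
        wp.position      = framePoints[i];
        wp.lastSeenFrame = frameIndex;
        worldPoints.push_back(wp);
    }
}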


Remove Erroneous World Points
There is noise in the depth map that the Kinect sensor returns, and thus some points in the world dataset will also be erroneous and need to be pruned.  The strategy employed here is to eliminate any world points within the transformed camera frustum that do not have significant support for their existence.  We therefore don’t simply remove, on every frame, all world points that lack a match with the frame points, as the frame itself could be in error.  Instead, as each world point is updated or added, we take note of when it was last seen.  If, at the end of a frame, there are world points that have not been matched and have not been seen for a given number of frames, they are removed.
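In code, the pruning step can be as simple as the sketch below (the 30-frame grace period is purely illustrative, and the check that a point lies within the current frustum is omitted for brevity):

#include <algorithm>

// Remove world points that have not been matched for a while.  A grace
// period is used rather than immediate removal, since any individual frame
// may itself be in error.
void PruneWorldPoints(std::vector<WorldPoint>& worldPoints, int frameIndex)
{
    const int MAX_UNSEEN_FRAMES = 30; // illustrative: roughly one second
    worldPoints.erase(
        std::remove_if(worldPoints.begin(), worldPoints.end(),
                       [&](const WorldPoint& wp)
                       { return frameIndex - wp.lastSeenFrame > MAX_UNSEEN_FRAMES; }),
        worldPoints.end());
}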


Tracking and Merging Example
The following video illustrates the process of tracking and merging point clouds.  The bottom left window is a 3D point cloud representation of the RGB and depth data combined from the Kinect sensor (which would normally be correctly coloured, but the green in this window represents point matches into the world space).  The larger window in the centre of the screen is the compiled world space.  The green wireframe box indicates the current camera position and orientation within this world, and the green points are those that are paired with the individual captures from the Kinect device.
 
Underneath the larger 3D window are "debug" outputs - the one on the left gives internal states for the steps within the matching process and the right one gives the camera orientation of the current frame in terms of rotation and offset into the global space.


Tuesday 2 July 2013

Aligning Point Clouds

Point clouds are joined together once the Kinect depth map data has been converted into an absolute [x, y, z] coordinate system, as described in an earlier post.  At the core of this process is the Iterative Closest Point (ICP) algorithm.  ICP finds an affine transformation between two point clouds that maximises their overlapping regions (and thus minimises the distance between the two point clouds in space).

It should be noted that this post will only provide a textual overview of the process, which will be expanded in future articles.  The web also has a good number of resources and examples of this process that can be found by searching for Iterative Closest Point.

The following definitions describe the terms to be used:
  • Frame Data (F): Perspective-corrected depth map data from the Kinect device
  • World Data (W): An accumulation of point data from one or many frames defined in world space
  • Frame Transformation (T): A 4x4 affine transformation matrix that transforms frame data into world space, defined by rotation about and translation along the three axes (affording 6 independent variables).

The basic ICP algorithm proceeds as follows and is applied to each frame of data:
  1. For each pixel in the frame data, apply the current frame transformation and find the closest point in the world data
  2. Minimise the distance between the pairs by refining the frame transformation
  3. Return to step 1 while the sum of errors over all pairs is still decreasing

Initially the frame transformation is the identity matrix and the world data is the first frame from the Kinect device.  The closest-point search also takes into account the point normals, which are calculated using neighbouring points.

The minimisation step is an inner iterative process that makes use of the Jacobian of the transformation matrix; the Jacobian relates small changes in the six independent variables to changes in the positions of the frame data once transformed into world space.  The aim of ICP is thus to minimise the general expression (F*T)-W over all frame and world pairs, although in practice this becomes slightly more complex when point-to-plane distances along the normals are taken into account, and the error function used to measure the correctness of the transformation matrix is typically a sum of squared errors.

The inner iteration involving the Jacobian arises from the way the problem is linearised: each step produces a new approximation to the solution, which may be improved further by linearising again around the values from the previous step, and so on.  The inner iterations continue until the error between the pairs no longer decreases.

The outer iteration of the above algorithm uses the revised frame transformation to determine new (and hopefully better) pairings.  The outer loop continues while the resulting sum of distances between pairs is decreasing.
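To make the structure of the outer loop concrete, a skeleton is sketched below.  The pairing step, the Jacobian-based solve and the error function are only declared here, not implemented, and the names are ours rather than anything from a particular library:

#include <vector>

struct Vec3 { double x, y, z; };
struct Mat4 { double m[16]; };            // 4x4 affine frame transformation
struct Pair { int frameIndex; int worldIndex; };

// Declarations only: nearest-neighbour pairing (taking normals into account),
// the inner Jacobian-based minimisation, and the sum-of-squared-errors
// measure over all pairs.
std::vector<Pair> FindClosestPairs(const std::vector<Vec3>& frame,
                                   const std::vector<Vec3>& world, const Mat4& T);
Mat4   MinimisePairDistances(const std::vector<Pair>& pairs,
                             const std::vector<Vec3>& frame,
                             const std::vector<Vec3>& world, const Mat4& T);
double SumPairError(const std::vector<Pair>& pairs,
                    const std::vector<Vec3>& frame,
                    const std::vector<Vec3>& world, const Mat4& T);

// Outer ICP loop: re-pair, re-minimise, and stop once the total error over
// the pairs no longer decreases.  T starts as the previous frame's
// transformation (the identity for the very first frame).
Mat4 AlignFrame(const std::vector<Vec3>& frame, const std::vector<Vec3>& world, Mat4 T)
{
    std::vector<Pair> pairs = FindClosestPairs(frame, world, T);
    double lastError = SumPairError(pairs, frame, world, T);
    for (;;)
    {
        const Mat4   refined = MinimisePairDistances(pairs, frame, world, T);
        const double error   = SumPairError(pairs, frame, world, refined);
        if (error >= lastError)
            break;                                   // no further improvement
        T = refined;
        lastError = error;
        pairs = FindClosestPairs(frame, world, T);   // step 1: new pairings
    }
    return T;
}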

Assuming this process finds a transformation that generates a low error, the frame data can be transformed into world space and merged with the world environment.  The next frame of data is processed, using the last known frame transformation, and so on, building the virtual environment by piecing together the data from the individual frames.  This will be discussed further in a later post.

An example of the tracking process is given below:

In this video, the first frame is captured and taken to be the world data; updates to the world data are not undertaken.  Each frame of the video represents a new frame from the Kinect device.  The cyan dots indicate that a pairing between the frame data and world data has been found that lies within a given threshold.  The red dots indicate that the closest point match between frame and world data lies outside the given threshold.  The video shows the position of the frame data once it has been transformed into world space; hence the more stable the features appear, the better the tracking (if just the original frame data were viewed, the scene would be seen to pan and shake as the camera is moved).

Mapping Depth Data into Virtual Space

While mapping depth values from the Kinect sensor into a virtual space is straightforward, a perspective correction factor needs to be taken into account, which is discussed in this post.  In the following, the official Windows Kinect SDK is used and all formulae given relate to the specific values returned from its API (which can differ from those returned by unofficial SDKs).  Depth data is delivered as scanlines from bottom to top.

To convert Kinect data into 3D space where one unit is equal to 1 metre:

scale=depth*PERSPECTIVE_CORRECTION

x=(i-(DEPTH_FRAME_WIDTH/2))*scale;
y=(j-(DEPTH_FRAME_HEIGHT/2))*scale;
z=-depth/1000;
Where:
  • depth is the millimetre depth value returned by the Kinect device within the depth map
  • PERSPECTIVE_CORRECTION is an empirically derived constant that converts from the camera’s perspective into an orthogonal view (essentially “undoing” the natural perspective view of the camera)
  • DEPTH_FRAME_WIDTH is the width dimension of the depth map (typically 320 or 640)
  • DEPTH_FRAME_HEIGHT is the height dimension of the depth map (typically 240 or 480)
  • i and j represent the ith pixel from the left and jth pixel from the bottom of the frame
Notes:
  • This formula translates the depth values onto the negative z-axis such that a value of zero is the camera position and -1.0 is 1 metre away.
  • A right-handed coordinate system is used.
  • The PERSPECTIVE_CORRECTION constant is fixed for a given depth map resolution and defined as 0.00000356 for a resolution of 320x240 and 0.00000178 for a resolution of 640x480
  • When doubling the width and height of the depth map, the constant is halved
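For convenience, the same mapping wrapped up as a single function (assuming the 640x480 resolution; the constant would be doubled for 320x240):

#include <cstdint>

struct Point3 { float x, y, z; };

// Converts the depth-map pixel (i, j) with millimetre depth value 'depth'
// into 3D space where one unit equals 1 metre.  j is counted from the bottom
// of the frame, and the point lands on the negative z-axis in front of the
// camera.
Point3 DepthToWorld(int i, int j, uint16_t depth)
{
    const float PERSPECTIVE_CORRECTION = 0.00000178f; // 640x480 depth map
    const int   DEPTH_FRAME_WIDTH  = 640;
    const int   DEPTH_FRAME_HEIGHT = 480;

    const float scale = depth * PERSPECTIVE_CORRECTION;

    Point3 p;
    p.x = (i - (DEPTH_FRAME_WIDTH  / 2)) * scale;
    p.y = (j - (DEPTH_FRAME_HEIGHT / 2)) * scale;
    p.z = -depth / 1000.0f;
    return p;
}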



Perspective Correction

The camera’s perspective field of view needs to be factored out in order to obtain precise [x, y, z] coordinates that can be used to correlate different snapshots of the same scene taken from different angles, since the perspective effect varies with camera position.  Figure 1 illustrates the result of mapping depth values directly to fixed [x, y] coordinates without taking perspective into account.

Figure 1a) Mapping depth values to fixed [x, y] coordinates without perspective correction: view seen from the camera




Figure 1b) Mapping depth values to fixed [x, y] coordinates without perspective correction: view of the scene from above - note that the wall and shelves do not make right-angles due to the camera taking a perspective view

By including the perspective correction, real-world right angles remain right angles in the virtual space and distances are corrected to their absolute values, as illustrated in Figure 2.

Figure 2a) Mapping depth values to absolute [x, y, z] coordinates using perspective correction: view seen from the camera

Figure 2b) Mapping depth values to absolute [x, y, z] coordinates using perspective correction: view of the scene from above – note that the wall and shelves make right-angles when using the perspective correction constant and appear straight and well aligned

The perspective correction was determined by measuring objects in the real world and comparing them to the size of their virtual counterparts without correction.  This was correlated against distance from the camera, resulting in the derived constants.  The formulae for determining the initial fixed [x, y] positions are given below:

x=(i-(DEPTH_FRAME_WIDTH/2))*WORLD_SCALE;
y=(j-(DEPTH_FRAME_HEIGHT/2))*WORLD_SCALE;
z=-depth*WORLD_SCALE*DEPTH_SCALE;

WORLD_SCALE is 0.01 or 0.02 for the 640x480 and 320x240 depth map resolutions respectively, and DEPTH_SCALE is 0.1.  These values were selected empirically to offer a visually good representation of the real world when mapped into the virtual space.

Using this mapping, a number of objects were placed in front of the camera and measured in both the real world and virtual space along their x- and y-axes to provide a scale factor mapping between the two spaces.  These values are given in Table 1 along with each object’s distance from the camera.

Distance from Camera    Mean Scale Factor
810mm                   0.137
1380mm                  0.245
2630mm                  0.472
3750mm                  0.666
Table 1: Scale factors between real and virtual objects at a specific distance

Plotting the two columns of Table 1 against each other illustrates a linear correlation, as shown in Figure 3.
Figure 3: Plotting distance from camera against mean depth scale factor for perspective correction


The gradient of this line gives the perspective correction value, calculated with respect to millimetre distances as per the original set of equations and factoring in the WORLD_SCALE and DEPTH_SCALE constants used in the second, uncorrected set of equations.
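As a quick check on that calculation, fitting a straight line through the origin to the Table 1 values and multiplying the gradient by the 640x480 WORLD_SCALE recovers the published constant.  Whether the original fit used a free intercept isn’t recorded here, so treat the sketch below as illustrative only:

#include <cstdio>

int main()
{
    // Distances (mm) and mean scale factors from Table 1.
    const double distance[] = { 810.0, 1380.0, 2630.0, 3750.0 };
    const double scale[]    = { 0.137, 0.245, 0.472, 0.666 };
    const int    n = 4;

    // Least-squares gradient of a line through the origin: sum(xy) / sum(xx).
    double sumXY = 0.0, sumXX = 0.0;
    for (int i = 0; i < n; ++i)
    {
        sumXY += distance[i] * scale[i];
        sumXX += distance[i] * distance[i];
    }
    const double gradient = sumXY / sumXX;  // scale factor per millimetre

    const double WORLD_SCALE = 0.01;        // 640x480 depth map
    std::printf("Gradient: %.8f per mm\n", gradient);                      // ~0.000178
    std::printf("Perspective correction: %.8f\n", gradient * WORLD_SCALE); // ~0.00000178
    return 0;
}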