Thursday 4 July 2013

When Tracking and Merging Goes Wrong

As mentioned in previous posts, the Kinect sensor data can be noisy, which can lead to errors during both the ICP and merging processes that potentially compound one another.  The process is also subject to inherent algorithmic issues.  This short post gives a couple of examples where the tracking drifts, and demonstrates both the importance of the perspective correction factor previously discussed and the need to balance the thresholds used in the various processes.

In the first video we’ve completely removed the perspective correction factor.  Without it, as the camera moves around the scene, the tracking process does its best to stitch “skewed” frame data into world space yet inevitably fails.

The next video takes a much larger pan, but because the flat wall surfaces being tracked provide little depth variation, the tracking algorithm drifts upwards.


In the above videos, the top two left windows show what the camera is seeing as a colour image and as a greyscale depth map.  The bottom left window is a 3D point cloud representation of the RGB and depth data combined from the Kinect sensor (which would normally be correctly coloured, but the green in this window represents point matches into the world space).  The larger window in the centre of the screen is the compiled world space.  The green wireframe box indicates the current camera position and orientation within this world, and green indicates the points that are paired with the individual captures from the Kinect device.

Underneath the larger 3D window are "debug" outputs - the one on the left gives internal states for the steps within the matching process and the one on the right gives the camera orientation of the current frame in terms of rotation and offset into the global space.

Merging Point Clouds

Once the frame data has been aligned with the world space using the Iterative Closest Point algorithm, it can be merged to create the larger environment.  In this process we maintain the concept of point clouds as opposed to creating surfaces.  There are three components to merging the dataset:
  1. Refine existing world points
  2. Add new frame points
  3. Remove erroneous world points

Merging point clouds proceeds by considering only a subset of world points.  This subset is defined as the points that fall within the camera’s view frustum when transformed by the frame transformation.  In the discussion that follows, this subset will be referred to as simply the world points.
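As a minimal sketch of the frustum test, assuming world points are brought back into camera space via the inverse of the frame transformation, the check below asks whether a point would project onto the depth map.  It simply inverts the depth-to-world mapping described in the earlier post; the depth range limits and the names used are illustrative.

#include <stdbool.h>

#define DEPTH_FRAME_WIDTH      640
#define DEPTH_FRAME_HEIGHT     480
#define PERSPECTIVE_CORRECTION 0.00000178   /* constant for 640x480, from the mapping post */
#define MIN_DEPTH_MM           800          /* assumed usable Kinect depth range */
#define MAX_DEPTH_MM           4000

typedef struct { double x, y, z; } Point3;

/* Returns true if a point, already expressed in camera space, would project
   onto the depth map within the usable depth range.  The projection simply
   inverts the depth-to-world mapping used for the Kinect data. */
bool in_camera_frustum(Point3 p)
{
    double depth_mm = -p.z * 1000.0;                  /* the camera looks down the -z axis */
    if (depth_mm < MIN_DEPTH_MM || depth_mm > MAX_DEPTH_MM)
        return false;

    double scale = depth_mm * PERSPECTIVE_CORRECTION;
    double i = p.x / scale + DEPTH_FRAME_WIDTH  / 2.0;
    double j = p.y / scale + DEPTH_FRAME_HEIGHT / 2.0;

    return i >= 0.0 && i < DEPTH_FRAME_WIDTH &&
           j >= 0.0 && j < DEPTH_FRAME_HEIGHT;
}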


Refine Existing World Points
Each world point is matched against a frame point after the frame transformation has been applied.  The matching threshold can be stricter than the one used in the ICP process in order to increase world point cloud density; for example, a threshold of 1cm will provide a final resolution of 1cm whereas a threshold of 1mm will provide much more fidelity, although it may also introduce errors due to the level of noise returned by the Kinect sensor.  More information about the level of accuracy, noise and reliability will be given in a future post.

The world points are updated using the matches, where one frame point may map to many world points.  After the existing points are updated, all world points and frame points that are involved in a match are ignored for the remaining merging processes.
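As a minimal sketch, assuming the refinement is a running average over all observations of a point, the update might look as follows.  The observation count and last-seen frame fields are the bookkeeping that the pruning step below relies on; all names are illustrative.

typedef struct { double x, y, z; } Point3;

typedef struct {
    Point3 position;
    int    observations;   /* how many frames have contributed to this point */
    int    last_seen;      /* frame number of the last match (used later for pruning) */
} WorldPoint;

/* Refine one world point from a matched frame point that has already been
   transformed into world space.  A running average is one plausible update
   rule; the actual implementation may use a different scheme. */
void refine_world_point(WorldPoint *wp, Point3 frame_point_world, int frame_number)
{
    double n = (double)wp->observations;
    wp->position.x = (wp->position.x * n + frame_point_world.x) / (n + 1.0);
    wp->position.y = (wp->position.y * n + frame_point_world.y) / (n + 1.0);
    wp->position.z = (wp->position.z * n + frame_point_world.z) / (n + 1.0);
    wp->observations += 1;
    wp->last_seen = frame_number;   /* matched this frame, so the point has support */
}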


Add New Frame Points
The frame points that were not matched to a world point are considered new points and added to the world point dataset.


Remove Erroneous World Points
There is noise in the depth map that the Kinect sensor returns, and thus some points in the world dataset will be erroneous and need to be pruned.  The strategy employed here is to eliminate any world points that fall within the transformed camera frustum but do not have significant support for their existence.  We therefore don’t simply remove, on every frame, all world points that lack a match with the frame points, as the frame itself could be in error.  Instead, as each world point is updated or added, we note when it was last seen.  If at the end of a frame there are world points that have not been matched and have not been seen for a given period of frames, they are removed.
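A minimal sketch of this pruning, assuming each world point records the frame in which it was last matched (newly added points from the previous step would be appended with last_seen set to the current frame), might look as follows.  MAX_UNSEEN_FRAMES stands in for the “given period of frames” and is purely illustrative.

#include <stdbool.h>
#include <stddef.h>

#define MAX_UNSEEN_FRAMES 30   /* illustrative number of frames before a point is dropped */

typedef struct { double x, y, z; } Point3;

typedef struct {
    Point3 position;
    int    last_seen;    /* frame number of the last match */
    bool   in_frustum;   /* inside the transformed camera frustum this frame? */
} WorldPoint;

/* Remove world points that are visible to the camera but have gone unmatched
   for too long.  Points outside the frustum are left alone, since the current
   frame says nothing about them.  Returns the new number of world points. */
size_t prune_world_points(WorldPoint *world, size_t count, int frame_number)
{
    size_t kept = 0;
    for (size_t i = 0; i < count; ++i) {
        bool stale = world[i].in_frustum &&
                     (frame_number - world[i].last_seen) > MAX_UNSEEN_FRAMES;
        if (!stale)
            world[kept++] = world[i];   /* compact the array in place */
    }
    return kept;
}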


Tracking and Merging Example
The following video illustrates the process of tracking and merging point clouds.  The bottom left window is a 3D point cloud representation of the RGB and depth data combined from the Kinect sensor (which would normally be correctly coloured, but the green in this window represents point matches into the world space).  The larger window in the centre of the screen is the compiled world space.  The green wireframe box indicates the current camera position and orientation into this world.  Green indicates the points that are paired with the individual captures from the Kinect device.
 
Underneath the larger 3D window are "debug" outputs - the one on the left gives internal states for the steps within the matching process and the one on the right gives the camera orientation of the current frame in terms of rotation and offset into the global space.


Tuesday 2 July 2013

Aligning Point Clouds

Point clouds are joined together once the Kinect depth map data has been converted into an absolute [x, y, z] coordinate system, as described in an earlier post.  At the core of this process is the Iterative Closest Point (ICP) algorithm.  The ICP process finds an affine transformation between two point clouds that maximises their overlapping regions (and thus minimises the distance between the two point clouds in space).

It should be noted that this post will only provide a textual overview of the process, which will be expanded in future articles.  The web also has a good number of resources and examples of this process that can be found by searching for Iterative Closest Point.

The following definitions describe the terms to be used:
  • Frame Data (F): Perspective-corrected depth map data from the Kinect device
  • World Data (W): An accumulation of point data from one or many frames defined in world space
  • Frame Transformation (T): A 4x4 affine transformation matrix that transforms frame data into world space, defined by a rotation about and a translation along each of the 3 axes (affording 6 independent variables); a sketch of applying T to a point is given after this list.
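For reference, applying the frame transformation T to a point is a standard 4x4 affine transform in which only the rotation and translation entries matter; a minimal sketch with illustrative names:

typedef struct { double x, y, z; } Vec3;
typedef struct { double m[4][4]; } Mat4;   /* row-major 4x4 affine transform */

/* Transform a frame point into world space: [x' y' z' 1]^T = T * [x y z 1]^T */
Vec3 apply_frame_transform(Mat4 T, Vec3 p)
{
    Vec3 r;
    r.x = T.m[0][0] * p.x + T.m[0][1] * p.y + T.m[0][2] * p.z + T.m[0][3];
    r.y = T.m[1][0] * p.x + T.m[1][1] * p.y + T.m[1][2] * p.z + T.m[1][3];
    r.z = T.m[2][0] * p.x + T.m[2][1] * p.y + T.m[2][2] * p.z + T.m[2][3];
    return r;
}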

The basic ICP algorithm proceeds as follows and is run per frame of data:
  1. For each pixel in the frame data, apply the current frame transformation and find the closest point in the world data
  2. Minimise the distance between the pairs by refining the frame transformation
  3. Repeat from step 1 while the sum of errors over all pairs is still decreasing

Initially the frame transformation is the identity matrix and the world data is the first frame from the Kinect device.  The closest-point search also takes into account the point normals, which are calculated using neighbouring points.
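As a sketch of the pairing step, the code below performs a brute-force closest-point search for a single frame point, rejecting candidates whose normals disagree too much and matches that fall outside a distance threshold.  The threshold values are illustrative, and a practical implementation would use a spatial index (for example a k-d tree) rather than a linear scan.

#include <stddef.h>

#define MAX_PAIR_DISTANCE 0.05   /* illustrative 5cm pairing threshold */
#define MIN_NORMAL_DOT    0.8    /* illustrative normal-agreement threshold */

typedef struct { double x, y, z; } Vec3;

typedef struct {
    Vec3 position;
    Vec3 normal;      /* unit normal estimated from neighbouring points */
} CloudPoint;

double dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

double sq_dist(Vec3 a, Vec3 b)
{
    double dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    return dx * dx + dy * dy + dz * dz;
}

/* Step 1 of the ICP loop: for one frame point (already transformed into world
   space by the current frame transformation), find the closest compatible
   world point.  Returns the index of the match, or -1 if no world point lies
   within the pairing threshold. */
long closest_world_point(CloudPoint frame_pt, const CloudPoint *world, size_t world_count)
{
    long   best = -1;
    double best_d2 = MAX_PAIR_DISTANCE * MAX_PAIR_DISTANCE;

    for (size_t i = 0; i < world_count; ++i) {
        if (dot(frame_pt.normal, world[i].normal) < MIN_NORMAL_DOT)
            continue;                                /* normals disagree too much */
        double d2 = sq_dist(frame_pt.position, world[i].position);
        if (d2 < best_d2) {
            best_d2 = d2;
            best = (long)i;
        }
    }
    return best;
}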

The minimisation step is an inner iterative process that makes use of the Jacobian of the transformation matrix; the Jacobian relates small changes in the independent variables to changes in the positions of the frame data transformed into world space.  The aim of ICP is thus to minimise the general expression (F*T)-W over all frame and world pairs (although in practice this becomes slightly more complex when point-to-plane distances along the normals are taken into account; furthermore, the error function used to measure the correctness of the transformation matrix is typically a sum of squared errors).

The inner iteration involving the Jacobian arises from the way the problem is linearised: each step yields a new approximation to the solution, which can then be improved using the values from the previous step, and so on.  The inner iterations continue until the error between the pairs no longer decreases.
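To make the inner step concrete, the sketch below builds and solves the 6x6 linearised system for a single Gauss-Newton update of the standard point-to-plane formulation, in which each pair contributes one Jacobian row relating the six motion parameters to its point-to-plane distance.  It is a generic sketch, assuming the pairs have already been found and the frame points have already been transformed by the current estimate; none of the names come from the actual implementation.

#include <math.h>
#include <stddef.h>

typedef struct { double x, y, z; } Vec3;

Vec3 cross(Vec3 a, Vec3 b)
{
    Vec3 c = { a.y * b.z - a.z * b.y,
               a.z * b.x - a.x * b.z,
               a.x * b.y - a.y * b.x };
    return c;
}

double dot3(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

/* Solve the 6x6 linear system A * x = b by Gaussian elimination with
   partial pivoting.  Returns 0 on success, -1 if the system is singular. */
int solve6(double A[6][6], double b[6], double x[6])
{
    for (int col = 0; col < 6; ++col) {
        int pivot = col;
        for (int r = col + 1; r < 6; ++r)
            if (fabs(A[r][col]) > fabs(A[pivot][col])) pivot = r;
        if (fabs(A[pivot][col]) < 1e-12) return -1;
        for (int c = 0; c < 6; ++c) { double t = A[col][c]; A[col][c] = A[pivot][c]; A[pivot][c] = t; }
        double t = b[col]; b[col] = b[pivot]; b[pivot] = t;
        for (int r = col + 1; r < 6; ++r) {
            double factor = A[r][col] / A[col][col];
            for (int c = col; c < 6; ++c) A[r][c] -= factor * A[col][c];
            b[r] -= factor * b[col];
        }
    }
    for (int r = 5; r >= 0; --r) {
        double s = b[r];
        for (int c = r + 1; c < 6; ++c) s -= A[r][c] * x[c];
        x[r] = s / A[r][r];
    }
    return 0;
}

/* One inner (Gauss-Newton) step of point-to-plane ICP.  For each pair the
   residual is the distance from the transformed frame point f[i] to the plane
   through the world point w[i] with unit normal n[i].  Linearising for a small
   rotation (a,b,g) and translation (tx,ty,tz) gives one Jacobian row per pair:
       row_i = [ f x n , n ],   r_i = (f - w) . n
   and the update x = (a,b,g,tx,ty,tz) solves  (J^T J) x = -J^T r. */
int point_to_plane_step(const Vec3 *f, const Vec3 *w, const Vec3 *n,
                        size_t pairs, double x[6])
{
    double JtJ[6][6] = {{0}};
    double Jtr[6] = {0};

    for (size_t i = 0; i < pairs; ++i) {
        Vec3 c = cross(f[i], n[i]);
        double row[6] = { c.x, c.y, c.z, n[i].x, n[i].y, n[i].z };
        Vec3 d = { f[i].x - w[i].x, f[i].y - w[i].y, f[i].z - w[i].z };
        double r = dot3(d, n[i]);

        for (int a = 0; a < 6; ++a) {
            for (int b = 0; b < 6; ++b) JtJ[a][b] += row[a] * row[b];
            Jtr[a] += row[a] * r;
        }
    }
    for (int a = 0; a < 6; ++a) Jtr[a] = -Jtr[a];
    return solve6(JtJ, Jtr, x);
}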

The outer iteration of the above algorithm uses the revised frame transformation to determine new (and hopefully better) pairings.  The outer loop continues while the resulting sum of distances between pairs is decreasing.

Assuming this process finds a transformation that generates a low error, the frame data can be transformed into world space and merged with the world environment.  The next frame of data is processed, using the last known frame transformation, and so on, building the virtual environment by piecing together the data from the individual frames.  This will be discussed further in a later post.

An example of the tracking process is given below:

In this video, the first frame is captured and taken to be the world data; updates to the world data are not undertaken.  Each frame of the video represents a new frame from the Kinect device.  The cyan coloured dots indicate that a pairing between the frame data and world data has been found that lies within a given threshold.  The red dots indicate that the closest point match between frame and world data lies outside the given threshold.  The video shows the position of the frame data once it has been transformed into world space; hence the more stable the features, the better the tracking (if just the original frame data were viewed, the scene would be seen to pan and shake as the camera is moved).

Mapping Depth Data into Virtual Space

While mapping depth values from the Kinect sensor to a virtual space is straightforward, a perspective correction factor needs to be taken into account, which is discussed in this post.  In the following, the official Windows Kinect SDK is used and all formulae given relate to the specific values returned from that API (which can be different to those returned by unofficial SDKs).  Depth data is delivered as scanlines from bottom to top.

To convert Kinect data into 3D space where one unit is equal to 1 metre:

scale=depth*PERSPECTIVE_CORRECTION;

x=(i-(DEPTH_FRAME_WIDTH/2))*scale;
y=(j-(DEPTH_FRAME_HEIGHT/2))*scale;
z=-depth/1000;
Where:
  • depth is the millimetre depth value returned by the Kinect device within the depth map
  • PERSPECTIVE_CORRECTION is an empirically derived constant that converts from the camera’s perspective into an orthogonal view (essentially “undoing” the natural perspective view of the camera)
  • DEPTH_FRAME_WIDTH is the width dimension of the depth map (typically 320 or 640)
  • DEPTH_FRAME_HEIGHT is the height dimension of the depth map (typically 240 or 480)
  • i and j represent the ith pixel from the left and jth pixel from the bottom of the frame
Notes:
  • This formula translates the depth values onto the negative z-axis such that a value of zero is the camera position and -1.0 is 1 metre away.
  • A right-handed coordinate system is used.
  • The PERSPECTIVE_CORRECTION constant is fixed for a given depth map resolution and defined as 0.00000356 for a resolution of 320x240 and 0.00000178 for a resolution of 640x480
  • When doubling the width and height of the depth map, the constant is halved
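Putting the formula together, a sketch of converting a whole depth frame into camera-space points might look like the following, assuming the buffer already holds plain millimetre values laid out one scanline after another in the bottom-to-top order described above.  The function name and the skipping of zero (unknown) depth readings are illustrative.

#include <stdint.h>
#include <stddef.h>

#define DEPTH_FRAME_WIDTH      640
#define DEPTH_FRAME_HEIGHT     480
#define PERSPECTIVE_CORRECTION 0.00000178   /* constant for 640x480 */

typedef struct { double x, y, z; } Point3;

/* Convert one depth frame (millimetre values, scanlines running bottom to top)
   into metre-scaled camera-space points.  Zero depth values, which indicate an
   unknown reading, are skipped.  Returns the number of points written to out
   (the caller provides room for WIDTH * HEIGHT points). */
size_t depth_frame_to_points(const uint16_t *depth_mm, Point3 *out)
{
    size_t count = 0;
    for (int j = 0; j < DEPTH_FRAME_HEIGHT; ++j) {           /* j: pixels from the bottom */
        for (int i = 0; i < DEPTH_FRAME_WIDTH; ++i) {        /* i: pixels from the left  */
            double depth = depth_mm[j * DEPTH_FRAME_WIDTH + i];
            if (depth <= 0.0)
                continue;                                    /* no reading at this pixel */

            double scale = depth * PERSPECTIVE_CORRECTION;
            out[count].x = (i - DEPTH_FRAME_WIDTH  / 2) * scale;
            out[count].y = (j - DEPTH_FRAME_HEIGHT / 2) * scale;
            out[count].z = -depth / 1000.0;                  /* metres along the -z axis */
            ++count;
        }
    }
    return count;
}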



Perspective Correction

The camera’s perspective field of view needs to be factored out in order to obtain precise [x, y, z] coordinates in space that can be used to correlate different snapshots of the same scene taken from different angles, since the camera’s perspective varies according to its position.  Figure 1 illustrates the result of mapping depth values directly to fixed [x, y] coordinates without taking perspective into account.

Figure 1a) Mapping depth values to fixed [x, y] coordinates without perspective correction: view seen from the camera




Figure 1b) Mapping depth values to fixed [x, y] coordinates without perspective correction: view of the scene from above - note that the wall and shelves do not make right-angles due to the camera taking a perspective view

By including the perspective correction, real-world right angles remain right angles in the virtual space and distances are corrected to their absolute values, as illustrated in Figure 2.

Figure 2a) Mapping depth values to absolute [x, y, z] coordinates using perspective correction: view seen from the camera

Figure 2b) Mapping depth values to absolute [x, y, z] coordinates using perspective correction: view of the scene from above - note that the wall and shelves make right-angles when using the perspective correction constant and appear straight and well aligned

The perspective correction was determined by measuring objects in the real world and comparing them to the size of their virtual counterparts without correction.  This was correlated against distance from the camera, resulting in the derived constants.  The formulae for determining the initial fixed [x, y] positions are given below:

x=(i-(DEPTH_FRAME_WIDTH/2))*WORLD_SCALE;
y=(j-(DEPTH_FRAME_HEIGHT/2))*WORLD_SCALE;
z=-depth*WORLD_SCALE*DEPTH_SCALE;

WORLD_SCALE is 0.01 or 0.02 for 640x480 and 320x240 depth map resolutions respectively, and DEPTH_SCALE is 0.1.  These values were selected empirically to offer a visually good representation of the real world when mapped into the virtual space.

Using this mapping, a number of objects were placed in front of the camera and measured in both the real world and virtual space along their x- and y-axes to provide a scale factor mapping between the two spaces.  These values are given in Table 1 along with each object’s distance from the camera.

Distance from Camera    Mean Scale Factor
810mm                   0.137
1380mm                  0.245
2630mm                  0.472
3750mm                  0.666

Table 1: Scale factors between real and virtual objects at a specific distance

Plotting the two columns of Table 1 against each other illustrates a linear correlation, as shown in Figure 3.
Figure 3: Plotting distance from camera against mean depth scale factor for perspective correction


The gradient of the fitted line in Figure 3 gives the perspective correction value, calculated with respect to millimetre distances as per the original set of equations and factoring in the DEPTH_SCALE and WORLD_SCALE constants from the second, uncorrected set of equations.
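As a rough check using the last row of Table 1, the gradient is approximately 0.666 / 3750 ≈ 0.000178 per millimetre; multiplying this by the 640x480 WORLD_SCALE of 0.01 gives 0.00000178, the PERSPECTIVE_CORRECTION constant quoted earlier for that resolution (and using the 320x240 WORLD_SCALE of 0.02 gives 0.00000356).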