It should be noted that this post will only provide a textual overview of the
process, which will be expanded in future articles. The web also has a good number of resources
and examples of this process that can be found by searching for Iterative
Closest Point.
The following definitions
describe the terms to be used:
-
Frame Data (F): Perspective-corrected depth map data from the Kinect device
- World Data (W): An accumulation of point data from one or many frames defined in world space
- Frame Transformation (T): A 4x4 affine transformation matrix that transforms frame data into world space defined by rotation and translation about the 3 axes (affording 6 independent variables).
The basic ICP algorithm proceeds as follows and is per frame of data:
-
For each pixel in the frame data, apply the current frame transformation and find the closest point in the world data
- Minimise the distance between the pairs by refining the frame transformation
- Repeat back to step 1 while the sum of errors over all pairs is reducing
Initially the frame transformation is the identify matrix and the world data is the first frame from the Kinect device. The closest point also takes into account the point normals which are calculated using neighbouring points.
The minimisation step is an
inner iterative process that makes use of the Jacobian of the transformation
matrix; the Jacobian relates small changes in the independent variables with
changes in the positions of the frame data transformed into world space. The aim of ICP is thus to minimise the
general equation (F*T)-W over all frame and world pairs (although in practice,
this equation becomes slightly more complex when normal point-to-plane
distances are taken into account.
Furthermore, the error function used to measure the correctness of the
transformation matrix is typically a sum of square errors).
The inner iteration that
involves the Jacobian is due to the manner in which the problem is linearised and
thus at each step, a new approximation to the solution is given which might be
bettered using the new values from the previous step and so on. The inner iterations occur until the error
between the pairs is no longer decreased.
The outer iteration of the
above algorithm uses the revised frame transformation to determine new (and
hopefully better) pairings. The outer
loop continues while the resulting sum of distances between pairs is
decreasing.
Assuming this process finds a transformation
that generates a low error, the frame data can be transformed into world space
and merged with the world environment. The
next frame of data is processed, using the last known frame transformation, and
so on, building the virtual environment by piecing together the data from the
individual frames. This will be
discussed further in a later post.
An example of the tracking process is given below:
An example of the tracking process is given below:
In this video, the first frame is captured and taken to be the world data; updates to the world data are not undertaken. Each frame of the video represents a new frame from the Kinect device. The cyan colour dots indicate that a pairing between the frame data and world data has been found and lies within a given threshold. The red dots indicate that the closest point match between frame and world data lies outside the given threshold. The video shows the position of the frame data once it has been transformed into world space and hence the more stable the features, the better the tracking (if just the original frame data were viewed, the scene would be seen to pan and shake as the camera is moved).
No comments:
Post a Comment