A system for detecting and tracking faces was previously described [8]. It combined motion detection by spatio-temporal filtering with an appearance-based face model in the form of a neural net. Multiple person tracking was performed using time-symmetric matching and Kalman filtering. In this section, the use of colour as a cue for detection and tracking is described. Colour provides a computationally efficient yet effective method which is robust under rotations in depth and partial occlusions. It can be combined with motion and appearance-based face detection.


Human skin forms a relatively tight cluster in colour space even when different races are considered [5]. Figure 3 shows the colour distribution of three faces in hue-saturation (H-S) space. Face colour distributions were modelled as Gaussian mixtures of the form:

$$p(\boldsymbol{\xi}) = \sum_{j=1}^{M} p(\boldsymbol{\xi} \mid j)\, P(j)$$

The mixing parameter $P(j)$ corresponds to the prior probability that the data, $\boldsymbol{\xi}$, was generated by component $j$. Each mixture component, $p(\boldsymbol{\xi} \mid j)$, is a Gaussian with mean $\boldsymbol{\mu}_j$ and covariance matrix $\boldsymbol{\Sigma}_j$.
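As a minimal sketch of how such a mixture density might be evaluated at a single H-S colour point, the following Python uses only the standard library. The function names, the tuple layout of the covariance matrix and the parameter values are illustrative assumptions and are not taken from the system described here; hue is treated as an ordinary linear coordinate, ignoring its wrap-around.

```python
import math


def gaussian_2d(x, mean, cov):
    """Density of a 2-D Gaussian at point x.

    cov is a symmetric 2x2 matrix given as ((a, b), (b, c)).
    """
    (a, b), (_, c) = cov
    det = a * c - b * b
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    # Quadratic form d^T cov^{-1} d, using the closed-form 2x2 inverse.
    quad = (c * dx * dx - 2.0 * b * dx * dy + a * dy * dy) / det
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))


def mixture_density(x, priors, means, covs):
    """Sum over components of P(j) times the j-th Gaussian density at x."""
    return sum(P * gaussian_2d(x, m, S)
               for P, m, S in zip(priors, means, covs))


# Hypothetical two-component model in (hue, saturation) space.
priors = [0.6, 0.4]
means = [(0.05, 0.45), (0.08, 0.30)]
covs = [((0.001, 0.0), (0.0, 0.010)), ((0.002, 0.0), (0.0, 0.020))]
density = mixture_density((0.06, 0.40), priors, means, covs)
```

Because the components are weighted by priors that sum to one, the result is itself a probability density over colour points.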
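Given $n$ face pixels $\boldsymbol{\xi}_i$, $i = 1, \ldots, n$, Expectation-Maximisation (EM) provides an effective maximum-likelihood algorithm for learning a Gaussian mixture model [9]. An expectation (E) step consists of evaluating the posterior probabilities $P(j \mid \boldsymbol{\xi}_i)$ for each mixture component. Let the sum of these probabilities be $S_j = \sum_{i=1}^{n} P(j \mid \boldsymbol{\xi}_i)$. A maximisation (M) step then updates the mean, covariance matrix and prior of each mixture component as follows:

$$\boldsymbol{\mu}_j^{\mathrm{new}} = \frac{1}{S_j} \sum_{i=1}^{n} P(j \mid \boldsymbol{\xi}_i)\, \boldsymbol{\xi}_i$$

$$\boldsymbol{\Sigma}_j^{\mathrm{new}} = \frac{1}{S_j} \sum_{i=1}^{n} P(j \mid \boldsymbol{\xi}_i)\, (\boldsymbol{\xi}_i - \boldsymbol{\mu}_j^{\mathrm{new}})(\boldsymbol{\xi}_i - \boldsymbol{\mu}_j^{\mathrm{new}})^{T}$$

$$P^{\mathrm{new}}(j) = \frac{S_j}{n}$$

The E and M steps are iterated until convergence. If $M=1$, the parameters of the Gaussian are estimated directly.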
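One E step followed by one M step can be sketched in plain Python as below. This is an illustration of the standard EM update for a two-dimensional Gaussian mixture, not the authors' implementation: the function names, the data layout and the small regularisation term `reg` added to the covariance diagonal are assumptions introduced here to keep the example well conditioned.

```python
import math


def gaussian_2d(x, mean, cov):
    """Density of a 2-D Gaussian; cov is ((a, b), (b, c))."""
    (a, b), (_, c) = cov
    det = a * c - b * b
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    quad = (c * dx * dx - 2.0 * b * dx * dy + a * dy * dy) / det
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))


def em_step(data, priors, means, covs, reg=1e-6):
    """Run one E step and one M step; returns (priors, means, covs).

    Assumes every data point has non-zero density under at least one
    component. `reg` is an assumed safeguard, not part of the update rule.
    """
    n, num_components = len(data), len(priors)

    # E step: posterior probability of each component for each pixel,
    # obtained with Bayes' rule from the current parameters.
    post = []
    for x in data:
        joint = [priors[j] * gaussian_2d(x, means[j], covs[j])
                 for j in range(num_components)]
        total = sum(joint)
        post.append([q / total for q in joint])

    # M step: re-estimate each component from posterior-weighted data.
    new_priors, new_means, new_covs = [], [], []
    for j in range(num_components):
        w = [post[i][j] for i in range(n)]
        s_j = sum(w)  # sum of posteriors for component j
        mx = sum(wi * x[0] for wi, x in zip(w, data)) / s_j
        my = sum(wi * x[1] for wi, x in zip(w, data)) / s_j
        a = sum(wi * (x[0] - mx) ** 2 for wi, x in zip(w, data)) / s_j
        b = sum(wi * (x[0] - mx) * (x[1] - my) for wi, x in zip(w, data)) / s_j
        c = sum(wi * (x[1] - my) ** 2 for wi, x in zip(w, data)) / s_j
        new_priors.append(s_j / n)
        new_means.append((mx, my))
        new_covs.append(((a + reg, b), (b, c + reg)))
    return new_priors, new_means, new_covs
```

In use, `em_step` would be called repeatedly, feeding its outputs back in as inputs, until the parameters stop changing appreciably. With a single component the first call already yields the sample mean and covariance (up to `reg`), which matches the direct estimate for $M=1$.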
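In practice, an H-S model built from a single person also functions well for people of other races. The mixture model is used to assign a probability $p(\boldsymbol{\xi})$ to each pixel in an image, and faces are detected by grouping suitably sized areas of high probability. A face is tracked by estimating the position as the mean $\mathbf{m}$ and the spatial extent as the vertical and horizontal standard deviations $\boldsymbol{\sigma}$ of the local colour probability distribution in the image plane. For a given frame $t$, the box position $\mathbf{m}_t$ is estimated as an offset from the position $\mathbf{m}_{t-1}$:

$$\mathbf{m}_t = \mathbf{m}_{t-1} + \frac{\sum_{\mathbf{x}} p(\boldsymbol{\xi}_{\mathbf{x}})\, (\mathbf{x} - \mathbf{m}_{t-1})}{\sum_{\mathbf{x}} p(\boldsymbol{\xi}_{\mathbf{x}})}$$

where $\mathbf{x}$ ranges over all image coordinates in the region of interest and $\boldsymbol{\xi}_{\mathbf{x}}$ is the colour point at image position $\mathbf{x}$. To improve accuracy, the probabilities $p(\boldsymbol{\xi}_{\mathbf{x}})$ are thresholded. Values lower than the threshold are taken to be background and are set to zero in order to nullify their influence on the estimation of $\mathbf{m}_t$ and $\boldsymbol{\sigma}_t$.

The size of the bounding box is estimated by computing the standard deviation weighted by the pixel probabilities:

$$\boldsymbol{\sigma}_t = \sqrt{\frac{\sum_{\mathbf{x}} p(\boldsymbol{\xi}_{\mathbf{x}})\, (\mathbf{x} - \mathbf{m}_t)^2}{\sum_{\mathbf{x}} p(\boldsymbol{\xi}_{\mathbf{x}})}}$$

where the square and the square root are taken separately for the horizontal and vertical components.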
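The per-frame update of box position and size can be sketched as follows. The sketch assumes the colour probabilities have already been computed into a 2-D list `prob[y][x]`; the function name `update_box`, the half-open rectangle used for the region of interest, the threshold value and the fallback when no pixel passes the threshold are all assumptions made for illustration rather than details given in the text.

```python
import math


def update_box(prob, roi, m_prev, threshold):
    """Estimate the new mean position and standard deviations.

    prob      -- 2-D list, prob[y][x] is the colour probability of a pixel
    roi       -- (x0, y0, x1, y1), half-open region of interest
    m_prev    -- (x, y) mean position from the previous frame
    threshold -- probabilities below this are treated as background
    """
    x0, y0, x1, y1 = roi
    total = off_x = off_y = 0.0
    kept = []
    for y in range(y0, y1):
        for x in range(x0, x1):
            p = prob[y][x]
            if p < threshold:
                continue  # background: contributes zero weight
            total += p
            off_x += p * (x - m_prev[0])
            off_y += p * (y - m_prev[1])
            kept.append((x, y, p))

    if total == 0.0:
        # Assumed fallback: keep the old position if nothing passes.
        return m_prev, (0.0, 0.0)

    # New mean as a probability-weighted offset from the previous mean.
    m = (m_prev[0] + off_x / total, m_prev[1] + off_y / total)

    # Probability-weighted standard deviation about the new mean,
    # computed separately for the horizontal and vertical axes.
    var_x = sum(p * (x - m[0]) ** 2 for x, _, p in kept) / total
    var_y = sum(p * (y - m[1]) ** 2 for _, y, p in kept) / total
    return m, (math.sqrt(var_x), math.sqrt(var_y))
```

A caller would typically derive the next frame's region of interest from the returned mean and standard deviations, for example a box extending some multiple of the standard deviation either side of the mean; the text does not specify that multiple.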
Figure 2 shows a sequence of a face being tracked with a moving camera against a cluttered background. The sequence demonstrates the tracker's ability to deal with changes in scale, large rotations in depth and partial occlusion.
The colour-based tracking system has been implemented on a 200 MHz Pentium PC equipped with a Matrox Meteor frame grabber and a Sony EVI-D31 active camera. The camera can be driven so as to keep the mean position, m, at the centre of the image. Tracking is performed at approximately 15 frames per second. Some problems are inevitably caused by large changes in the spectral composition of scene illumination. It has been found necessary to use at least two colour models, one for interior lighting and one for exterior natural daylight.