Before two images can be mosaiced, it is necessary to find the
transformation which aligns one image with the other
. This is the well known ``motion estimation''
or ``registration'' problem [2]. Techniques for solving
this problem can be divided into two broad
categories [2, 7]: feature-based and featureless
techniques. When the motion between the two images is small, for
instance when the two images are from successive frames in a video
sequence, the motion parameters can be estimated using optical
flow [7]. When the motion is not small, as in this
application, the motion could be estimated using generalised
cross-correlation [10]. However, this requires a
computationally expensive combination of coarse-to-fine and iterative
gradient-based search [7]. The final option is to use a
feature-based approach [4], estimating the motion from
point correspondences. Such techniques are often criticized on the
grounds that it is difficult to find the features automatically, or
that the features are sensitive to noise and
occlusion [7]. In this particular application, however,
there is a plentiful supply of stable, detectable features. We
therefore adopt a feature-based approach.
Instead of using generic ``corner'' features, we exploit knowledge of the domain to extract a more organised set of features. Each image is automatically segmented into a hierarchy of columns, lines and words: it is relatively straightforward to match such organised sets of features across images. The algorithm for producing the hierarchical segmentation is summarised below.
The first step is to estimate the angle that the rows of text make
with the image raster lines (the skew angle), which is assumed
to lie in the range
. To achieve this, a small patch
of text is selected randomly, and then rotated in small steps over the
range
until the variance of the pixel intensities
summed along raster lines is maximised [1] -- see
Figure 2. The exhaustive search is not computationally
expensive since small rotations (
) can be well
approximated by vertical shears, which are fast to compute. To ensure
that an accurate skew angle is found, the calculation is performed at
several image patches, and the final estimate is the average of the
individual angles weighted by the variance of the pixel intensities
at each patch. All subsequent feature detection operations are
performed on the de-skewed images.
Figure 2: Finding the skew angle. The text is rotated from its
original orientation (a) until the variance of the grey levels summed
along the image raster lines is maximized (b). (c) shows the results on a
real image, where the central portion of text has been de-skewed.
De-skewed document images can be reliably and intuitively segmented
into a hierarchy of columns, lines and words. To remove any
sensitivity to illumination and page coloration, we work on binary
gradient images
. These are obtained by applying a Sobel operator
to the de-skewed greyscale image and thresholding the
output [11] -- see Figure 3.
Figure 3: Binary gradient images. To reduce sensitivity to
lighting conditions and page coloration, a Sobel operator is applied
to the original images (a). The absolute value of the output is
thresholded to produce the binary gradient image (b).
Columns are easily segmented from the de-skewed, binary gradient images by summing pixels vertically -- see Figure 4(a). A similar horizontal process is used to find the baselines of each row within each column (we do not assume the columns are aligned). Finally, a vertical process is applied within each row to segment individual words. A complete segmentation of a real image is shown in Figure 4(b).
The mosaic will be built up by matching the lower right hand corners of words in pairs of overlapping images. The segmentation scheme finds these points reliably, and also places them in the context of a hierarchy of rows and columns: in effect, the points have been grouped [6]. As we shall see in Section 4, the grouping can be exploited by an intuitive and efficient matching algorithm.
The segmentation process involves a considerable amount of summing in the binary gradient image, which could be computationally expensive if not implemented efficiently. To this end, we construct a matrix of partial sums [9] whose elements are given by
The matrix of partial sums can be calculated in one pass through the binary gradient image. Summing pixels within any rectangular area of the binary gradient image is then very efficient [9]:
Figure 4: Segmentation of a page of text. Columns can be found by
summing the binary gradient images in vertical strips and looking at
the difference between neighbouring sums (a). The strips are swept
from left to right across the image, and local minima with magnitudes
exceeding the threshold T indicate column start
points. Likewise, significant local maxima indicate column end
points. Since the columns are so prominent, the procedure is not
sensitive to the parameter T. The optimal width d of the
strips depends on the inter-column separation in pixels, which is
largely a function of the imaging resolution (dpi). Thus d needs
adjusting only when the camera is moved or the lens is changed. A
similar process can be used to find rows within columns and words
within rows: a full segmentation is superimposed on the original
(skewed) greyscale image in (b). The lower right hand corner of each
word provides a suitable point for matching across images.