Graphics Application Lab.,
Department of Computer Science,
Pusan National University,
Kum-Jung-Ku, Pusan 609-735, Korea.
E-mail: {seok,hcpark,hgcho}@pearl.cs.pusan.ac.kr
This paper presents a new method for extracting
several types of text from a mixed text/graphic document image.
We also propose a new word grouping method for the case where intersected
words are placed on a circular arc or on a line segment with
an arbitrary orientation.
Our algorithm is based on an analysis of the
run-lengths of the document image.
The average and variance of the number of runs in a run-length
encoding provide a useful structural property of symbols and text.
We propose a 3-dimensional neighborhood graph
for grouping words from a set of isolated characters, which are
obtained from the first, character-isolating phase.
This graph maps each character to a vertex in 3-dimensional
space according to the ``volume'' of the character.
Experimental results show that more than 97% of words were successfully
extracted from a text/graphic mixed document.
Keywords: document analysis, pattern recognition, text extraction, image processing
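
As a rough illustration of the run statistics mentioned in the abstract, the sketch below computes the mean and variance of the number of runs per scanline of a binary image. It is not the authors' implementation: the function names `count_runs` and `run_count_stats`, the choice to count only foreground (1) runs, and the row-wise scan are all assumptions made for this example.

```python
from statistics import mean, pvariance


def count_runs(row):
    """Count maximal runs of foreground pixels (value 1) in one scanline.

    Assumption: only foreground runs are counted; the paper does not say
    whether background runs are included.
    """
    runs = 0
    prev = 0
    for pixel in row:
        # A new run starts at each 0 -> 1 transition.
        if pixel == 1 and prev == 0:
            runs += 1
        prev = pixel
    return runs


def run_count_stats(image):
    """Return (mean, population variance) of the per-row run counts."""
    counts = [count_runs(row) for row in image]
    return mean(counts), pvariance(counts)


if __name__ == "__main__":
    # Tiny hypothetical 4x5 binary image (1 = ink, 0 = background).
    image = [
        [0, 1, 1, 0, 1],
        [1, 1, 1, 1, 1],
        [0, 0, 0, 0, 0],
        [1, 0, 1, 0, 1],
    ]
    print(run_count_stats(image))
```

Under these assumptions, a region whose rows have a consistently small number of runs would score a low mean and variance, while a region whose run count changes sharply from row to row would score a high variance. How the paper turns these statistics into a text/graphic decision is not specified in this section.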