Using Neural Cells to Improve Image Textual Line Segmentation

semanticscholar(2017)

Abstract
Before one can begin applying automatic transcription processes to a document image, a line segmentation algorithm usually must be applied first in order to identify the individual text lines upon which recognition will be performed. Most line segmentation algorithms use standard image processing techniques and/or statistics to identify inking activity in the image. Unfortunately, these algorithms have no awareness of inking that is intentional (such as that written by an author) versus that which is merely background, darkness of copy, noise, etc. Moreover, the algorithms themselves cannot always tell when lines have been over- or under-segmented. Neural networks can be taught to learn such distinctions. We have trained a neural process which can detect image line phenomena and can be used to improve automatic line segmentation. To the best of our knowledge, no other researchers have demonstrated similar processes. We observe definite accuracy gains using our neurally-enhanced line segmenter over a previous high quality line segmenter, and we believe such improvements will apply when using other line segmentation algorithms.

1. Background on Line Segmentation

When transcription of document images is a user's goal, it is almost always a requirement to first transform the document into the sequence of its textual lines so that subsequent processing can proceed on a line-by-line basis. The process of transforming a document into its constituent lines is referred to as line segmentation. The vast majority of line segmentation techniques leverage either statistics or common image processing techniques, including projection-based methods, the finding of connected components, smearing approaches, clustering, and Hough transforms (see, for example, [1], [2]).
Other techniques also exist which try to use more sophisticated approaches such as peak and trough detection with seam carving between them, hidden Markov model-like analyses, and graphical approaches (see [1], [3], [4]). These methods provide good or even excellent results depending on the image collection, especially when collections are small or homogeneous. Yet when the images in a collection number in the millions or billions and are quite diverse, as in a genealogical data set, typical line segmentation algorithms fail to account for many of the phenomena that will occur. This is largely because these systems are designed to identify contrastive areas of pixel darkness and/or pixel connectivity as opposed to trying to capture what we will refer to here as deliberate marks: markings on the page which were intentionally placed there by the document's author. Let us consider some situations where line segmentation can be problematic for activity-based or connectivity-based algorithms:

a) In the presence of faint copy where inking is weak, systems that rely on connectivity get confused because single lines of text can appear as disconnected. This can result in falsely detecting multiple lines. Activity detectors, on the other hand, may treat these areas as having insufficient information to even predict the appearance of some of the lines.

b) Speckle noise can cause challenges for a connectivity-based algorithm since it may be treated as myriad regions of disconnected components and thus over-generate lines.

c) If textual lines have significant overlap (where, say, the descenders from one line intersect frequently with ascenders from the line below), connectivity-based algorithms may falsely merge these lines together. Activity-based algorithms, on the other hand, can treat such regions of overlap as if they were additional lines since there may be sufficient pixel darkness to warrant it.
d) When the image is copied from a book or on a black background, resulting in the image having side regions with dark vertical streaks, the activity algorithms can end up over-generating lines if the streaks are of variable thicknesses.

e) If the slope of the text lines drifts from one side of the page to the other, both kinds of algorithms can either over-generate lines or can produce hypothesized lines where the left half and right half of the lines actually are drawn from different textual lines.

This list can be extended. Yet it suffices as an illustration that the base algorithms, absent a search for deliberate marks, are likely to have challenges with certain kinds of documents. In this paper, we demonstrate that deep neural networks can be built which can identify deliberate marks and which can be used to improve a system's ability to properly find textual lines. We demonstrate through this effort that not only is the error rate of line segmentation reduced through our neural processes, but we actually train a handwriting recognition system on a very large corpus of US legal documents and show that the resultant line segmentation improves performance by 1.2% absolute (7.5% relative reduction in error).

2. DNNs to Count Lines

Prior to the work described in this paper, we had created a line segmentation algorithm (which we will refer to here as PreDNN) which leverages many of the leading techniques mentioned in Section 1. The base components of our algorithm are described elsewhere (see [5]), but suffice it to say that it uses multi-swath projections to identify activity peaks, dynamic programming to carve seams between those activity peaks, and overlap of connected components to detect falsely merged lines. These processes were then extended using statistical analyses to find and remove the overgeneration of detected peaks and to trim out falsely carved seams.
The resultant algorithm seems to perform quite well on a huge collection of generic printed documents and has accuracy of perhaps 85%-90% on a collection of tens of thousands of handwritten documents spanning four centuries. We expect that the results of the PreDNN system rival those of the state of the art, and we have observed that they are quite usable. That said, leaving out up to 15% of a document's content is undesirable.

2.1. Line Count Cells

Deep neural networks provide a potential means to overcome some of these gaps. We reasoned that if we could build a neural network which, given a snippet of an image, could predict whether the snippet contains no text lines, fractions of a line, multiple lines, or exactly one line, then such a network could be used to either completely perform line segmentation or to improve upon an existing line segmentation algorithm. To test this hypothesis, we assembled a mixed corpus consisting of printed Latin-script texts, Chinese print, and Latin-script handwritten documents. From these, we automatically generated a huge collection of image snippets of variable sizes whose edges were not necessarily straight. These snippets were then condensed down into small cells of size 30 pixels by 30 pixels. We tagged each of these 30x30 cells with one of seven different tags based on the number of lines observed in each cell. As shown in Figures 1a-1g, these snippets fall into the categories: (a) no-text-lines, (b) single-text-line, (c) vertical-bar-only, (d) less-than-one-text-line, (e) two-fragment-lines, (f) more-than-one-but-fewer-than-two, and (g) two-plus-lines. The colors that are selected here are deliberate in that they will be used throughout this paper in colorized images representing the various classifications. Our collection of tagged cells currently consists of 70.5K different images, of which 65,259 are used for training and 5,277 are used for testing.
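The paper does not specify how snippets are condensed to 30x30 cells; a minimal sketch of one plausible approach, a plain nearest-neighbour downsample of a grayscale snippet (the function name and normalization are our assumptions), might look like:

```python
import numpy as np

CELL_SIZE = 30  # the paper's 30x30 cell dimension

def snippet_to_cell(snippet: np.ndarray) -> np.ndarray:
    """Nearest-neighbour downsample of a 2-D grayscale snippet to a
    30x30 cell with pixel values normalized into [0, 1].

    This is an illustrative stand-in; the paper leaves the exact
    condensing method unspecified."""
    h, w = snippet.shape
    # Pick one source row/column for each of the 30 target rows/columns.
    rows = np.arange(CELL_SIZE) * h // CELL_SIZE
    cols = np.arange(CELL_SIZE) * w // CELL_SIZE
    return snippet[np.ix_(rows, cols)].astype(np.float32) / 255.0

# Hypothetical usage with a blank stand-in snippet of variable size.
snippet = np.full((120, 45), 255, dtype=np.uint8)
cell = snippet_to_cell(snippet)
assert cell.shape == (CELL_SIZE, CELL_SIZE)
```

In practice an area-averaging resize (e.g. `PIL.Image.resize` or `tf.image.resize`) would likely preserve faint strokes better than nearest-neighbour sampling.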
No specific effort was made to have a comparable number of each of the seven classes of cells, so the final distribution is reflective of what is observed in practice. Table 1 shows the actual distribution by category in both the training and testing sets:

Table 1: Distribution of Counts of Tagged Cells

  CATEGORY                    # in TRAIN   # in TEST
  no-text-lines                     7330         625
  single-text-line                 12651         848
  vertical-bar-only                  703          89
  less-than-one-text-line          16813        1098
  two-fragment-lines                5027         407
  more-than-1/fewer-than-2          8573         801
  two-plus-lines                   14162        1409

An interesting thing about tagging cells in the way specified is that the tags remain the same even if the cells are flipped with respect to the y-axis, or if they are rotated 180 degrees. That means that if we consider these four permutations, we can likewise multiply the size of our collection by four, resulting in a training set with 280K elements and a test set with 21K.

2.2. Training a DNN for Line Counts

With training cells available, we can now train a deep neural network (DNN) to try to predict the line count of future image cells. To train our DNN, we make use of Google's open source Tensorflow [6] engine. Additionally, the Tensorflow developers created a "recipe" using convolutional neural networks (CNN) as applicable to the MNIST digit-recognition task, which we have modified for our task. We use the same number of layers in our network as they have, but we use a kernel size of 3, and our first through third hidden layers have, respectively, 16, 32, and 216 nodes. Each of our layers is smaller than those of the recipe, but this is beneficial for recognition speed (which seems essential for our task). Our particular DNN topology yields an average tagging accuracy of 91%.
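A minimal sketch of such a topology, using the Keras API, could look like the following. The paper states only the layer widths (16, 32, 216) and the kernel size; the assignment of the first two layers as convolutions and the third as a dense layer, along with the pooling, activations, and optimizer, are our assumptions based on the MNIST recipe it references.

```python
import tensorflow as tf

NUM_CLASSES = 7  # the seven line-count categories of Section 2.1

# Hypothetical reconstruction of the modified MNIST-style CNN:
# kernel size 3 and hidden widths 16, 32, 216 as stated in the paper.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(30, 30, 1)),          # one 30x30 grayscale cell
    tf.keras.layers.Conv2D(16, kernel_size=3, padding="same", activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(32, kernel_size=3, padding="same", activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(216, activation="relu"),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```

Keeping the layers this narrow keeps inference cheap, which matters when every cell of every document image must be classified.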
We can get up to 2% better prediction accuracy by doing one of the following: (a) on a per cell basis, apply the DNN predictor to each of the four legal permutations of that cell and vote, or (b) create cells from overlapping rectangular regions and keep the result if the prediction for both regions agree or otherwise use the prediction of the overlap region as a tie breaker. Since computational cost is a factor, method (a) is somewhat less desirable because one must perform four times more computation for a 2% gain. On the other hand, method (b) can be performed for only about 10% more cost than just doing cell-by-cell evaluation because the rectangles take the place of the cells and the tie b
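Voting scheme (a) above can be sketched as follows. The four tag-preserving permutations (identity, y-axis flip, 180-degree rotation, and both combined) are generated and the predictor's votes are tallied; `predict_tag` is a hypothetical stand-in for the trained DNN.

```python
from collections import Counter

import numpy as np

def permutations_of(cell: np.ndarray) -> list:
    """The four orientations that leave a cell's line-count tag unchanged:
    identity, y-axis flip, 180-degree rotation, and flip plus rotation."""
    flipped = np.fliplr(cell)
    return [cell, flipped, np.rot90(cell, 2), np.rot90(flipped, 2)]

def vote_tag(cell: np.ndarray, predict_tag) -> int:
    """Majority vote of the predictor over the four permutations.
    `predict_tag` is assumed to map a 30x30 cell to a category index."""
    votes = [predict_tag(p) for p in permutations_of(cell)]
    return Counter(votes).most_common(1)[0][0]

# Hypothetical usage with a constant stand-in predictor.
cell = np.zeros((30, 30), dtype=np.float32)
tag = vote_tag(cell, lambda c: 1)
```

The same `permutations_of` helper also serves the data-augmentation step of Section 2.1, since each permutation is a legitimately tagged training cell in its own right.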