
ICDAR 2015 Contest on MultiSpectral Text Extraction (MS-TEx 2015)

Rachid Hedjam∗†, Hossein Ziaei Nafchi∗, Reza Farrahi Moghaddam∗, Margaret Kalacska†, and Mohamed Cheriet∗

∗Synchromedia Lab, Department of Automated Manufacturing Engineering, ETS, University of Quebec,
1100 Notre-Dame West, Montreal, QC, Canada H3C 1K3
Email: [email protected], [email protected], [email protected]

†Remote Sensing Lab, Department of Geography, McGill University,
805 Sherbrooke West, Montreal, QC, Canada H3A 2K6
Email: {rachid.hedjam, margaret.kalacska}@mcgill.ca

Abstract—The first competition on MultiSpectral Text Extraction (MS-TEx) from historical document images has been organized in conjunction with the ICDAR 2015 conference. The goal of this contest is the evaluation of the most recent advances in text extraction from historical document images captured by a multispectral imaging system. The MS-TEx 2015 dataset contains 10 handwritten and machine-printed historical document images, along with eight spectral images for each document. This paper provides a report on the methodology and performance of the five algorithms submitted by various research groups across the world. The objective evaluation and ranking were performed using well-known evaluation metrics for binarization and classification.

Keywords—Multispectral imaging; document image binarization; historical document analysis.

I. INTRODUCTION

In recent years, multispectral (MS) imaging has become a very important tool for historical document analysis. This technique is widely known as a non-invasive method of investigation thanks to its simultaneous use of ultraviolet, infrared and visible light. It enables conservators and art historians to obtain valuable information on ancient documents without causing any physical damage to the materials: it makes it possible to reveal text that has been overwritten, to distinguish and recognize the chemical material composing the ink, and to detect signs of degradation in historical documents. It can also help to extract information from cultural heritage patterns which cannot be extracted using conventional color photography. Extracting (segmenting) the original text (old writing) from an MS document image is a very important step for subsequent document image analysis and investigation. In order to facilitate comparison of the results of different algorithms and to track their progress over time towards the level of human performance, it is of great interest to have a standard benchmark and an accurate ground-truth with the most representative information about the targeted samples. To this end, we have generated a dataset of twenty-one (21) MS document images for training purposes1 (see the URL below) [1], [2], and another dataset of ten (10) MS document images for testing purposes.

The structure of the paper is as follows: Section II briefly describes the different submitted methods; Section III describes the dataset used to test the submitted methods; Section IV defines the objective measures of the contest; Section V is devoted to the experimental results and evaluation of the submitted methods; at the end, the conclusions are provided.

1 http://www.synchromedia.ca/databases/msi-histodoc

II. METHODS AND PARTICIPANTS

Five (5) different methods were submitted to the MS-TEx 2015 contest by four (4) research groups. The submitted methods are briefly described below.

1) Computer Vision Lab, Vienna University of Technology (Markus Diem, Fabian Hollaus and Robert Sablatnig): The proposed approach incorporates three methods for multispectral text extraction. First, a rough foreground estimation is performed by thresholding a cleaned channel using Lu et al.'s [3] binarization. In order to compute a cleaned channel, the background channel F8 (IR4 band) is removed from a visible channel F2 (Blue band). This rough foreground estimation is used in a second step to train an Adaptive Coherence Estimator (ACE) proposed by Scharf and McWhorter [4]. The ACE detects a spectral subspace that enhances ink while the contrast of other elements (e.g., stains) is reduced. Finally, we combine the cleaned channel with the mean and standard deviation images and perform a GrabCut [5]. The GrabCut is guided by a mask based on the results of the ACE. The source code is available at https://github.com/diemmarkus/MSTEx-CVL.git
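As a rough illustration of this channel-cleaning step, the sketch below subtracts the IR4 band from the Blue band and thresholds the result. The file paths are hypothetical, and Otsu's method stands in for Lu et al.'s binarization [3], which is not reproduced here.

```python
# Hedged sketch of the "cleaned channel" step; file paths are hypothetical
# and Otsu replaces Lu et al.'s binarization used by the actual method.
import cv2
import numpy as np

f2 = cv2.imread("z97/F2.png", cv2.IMREAD_GRAYSCALE).astype(np.float32)  # Blue
f8 = cv2.imread("z97/F8.png", cv2.IMREAD_GRAYSCALE).astype(np.float32)  # IR4

# Remove the background channel (IR4) from the visible Blue channel.
cleaned = cv2.normalize(f2 - f8, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)

# Rough foreground estimate: ink pixels are dark, hence THRESH_BINARY_INV.
_, rough_fg = cv2.threshold(cleaned, 0, 255,
                            cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
```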

2) Computer Vision Lab, Vienna University of Technology (Fabian Hollaus, Markus Diem and Robert Sablatnig): In the first step of the approach, the binarization method of Lu et al. [3] is applied on the Blue band (F2). The output of this method is used for the estimation of the mean spectral signature of the writing. This signature is used to train the Adaptive Coherence Estimator (ACE), as suggested by Scharf and McWhorter [4]. The output of this method is an image in which the writing is enhanced while the background regions are suppressed. Thus, the foreground can be successfully detected by applying a global Otsu threshold on the output image of the ACE method. The resulting binary image is then finally combined with the output of the binarization method of Lu et al. [3]. The source code of the method is available at: https://github.com/hollaus/MSTEx-CVL-matlab
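The ACE statistic itself is standard and can be sketched directly. The following NumPy version is a simplified global-covariance variant, with the rough binary mask supplying the mean ink signature; it is not the authors' implementation (linked above), and the function name and data layout are illustrative.

```python
# Simplified Adaptive Coherence Estimator over an (H, W, B) reflectance cube.
# fg_mask is the rough binary text mask from the first step.
import numpy as np

def ace(cube, fg_mask):
    h, w, b = cube.shape
    x = cube.reshape(-1, b).astype(np.float64)
    mu = x.mean(axis=0)
    xc = x - mu                                  # centred pixel spectra
    s = cube[fg_mask].mean(axis=0) - mu          # mean spectral signature of ink
    cov_inv = np.linalg.pinv(np.cov(xc, rowvar=False))
    num = (xc @ cov_inv @ s) ** 2
    den = (s @ cov_inv @ s) * np.einsum("ij,jk,ik->i", xc, cov_inv, xc)
    return (num / np.maximum(den, 1e-12)).reshape(h, w)
```

A global Otsu threshold on the returned map then gives the binary foreground, as the method description states.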

3) Document Image and Pattern Analysis (DIPA) Center, Islamabad, Pakistan (Ahsen Raza): The proposed method for multispectral image segmentation is based on three main steps. First, image fusion is performed using a wavelet transform-based image fusion technique. Once a fused image is obtained, a conditional noise removal procedure is performed with a mix of noise removal filters. This step is carried out to preserve the information of interest and discard unwanted artifacts. After that, we perform window-based (size 5×5) thresholding using a modified form of Niblack's thresholding technique. In the third and final step, we again perform conditional noise removal, followed by image cleaning based on the aspect ratio of the connected components.
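The exact "modified form" of Niblack's rule is not specified in the submission; as a reference point, a textbook Niblack threshold T = m + k·s over a 5×5 window can be sketched as follows (k = -0.2 is a conventional, assumed choice).

```python
# Textbook Niblack thresholding over a local window; the submission uses a
# modified variant whose exact form is not given, so this is illustrative.
import cv2
import numpy as np

def niblack_binarize(gray, window=5, k=-0.2):
    g = gray.astype(np.float64)
    mean = cv2.boxFilter(g, -1, (window, window))
    sq_mean = cv2.boxFilter(g * g, -1, (window, window))
    std = np.sqrt(np.maximum(sq_mean - mean ** 2, 0.0))
    thresh = mean + k * std                        # per-pixel local threshold
    return ((g <= thresh) * 255).astype(np.uint8)  # dark text marked white
```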

4) Institute of Automation, Chinese Academy of Sciences (Alex Zhang and Cheng-Lin Liu): The key of our method is to binarize images by a graph-based semi-supervised classification method. Specifically, it proceeds as follows: 1) It extracts edges from the normal image (F2) using an edge detector, i.e., the Canny edge detector. 2) Coarse classification by rules: a) it classifies dark pixels that are located near some edges as foreground pixels; b) it classifies light pixels that are located far from edges as background pixels; c) it classifies the remaining pixels as unlabeled pixels. 3) Fine classification by graph-based semi-supervised learning: a) it constructs a graph by connecting each pixel with its neighbors; b) it infers unlabeled pixels by Gaussian Random Fields. 4) It removes the noise using the F7 and F8 multispectral bands.
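Steps 1 and 2 can be sketched with standard tools. The distance thresholds and the use of Otsu's global threshold to define "dark" versus "light" are illustrative assumptions, since the exact rule parameters are not given.

```python
# Sketch of the coarse labelling rules (step 2); NEAR/FAR distances and the
# Otsu-based dark/light split are assumptions, not the authors' parameters.
import cv2
import numpy as np

f2 = cv2.imread("z97/F2.png", cv2.IMREAD_GRAYSCALE)   # hypothetical path
edges = cv2.Canny(f2, 50, 150)
# Distance of every pixel to the nearest Canny edge pixel.
dist = cv2.distanceTransform(255 - edges, cv2.DIST_L2, 3)

otsu_t, _ = cv2.threshold(f2, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
NEAR, FAR = 2.0, 10.0                                 # assumed, in pixels

labels = np.full(f2.shape, -1, np.int8)               # -1: unlabeled
labels[(f2 < otsu_t) & (dist <= NEAR)] = 1            # dark and near an edge
labels[(f2 >= otsu_t) & (dist > FAR)] = 0             # light and far from edges
# The remaining -1 pixels are then inferred with Gaussian Random Fields (step 3).
```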

5) Information Sciences Institute, University of Southern California (Yue Wu, Stephen Rawls, Wael Abd-Almageed and Premkumar Natarajan): In summary, the method consists of a four-stage pipeline: 1) parameter estimation, 2) feature extraction, 3) initial classification, and 4) refinement. This is a fully trainable pipeline without any hard-coded pre- or post-processing, but with three learned classifiers, namely a) "base," b) "spectrum," and c) "refine." Specifically, we first estimate various method parameters from all image channels, e.g., text stroke width, noise level, edge map, etc. Using these estimated parameters, we describe a pixel location via statistical features across image channels and probabilistic features that indicate the likelihood that a pixel is text. Such probabilistic features are obtained by applying the base classifier on each single image channel (see details in [6]). We then apply the spectrum classifier to classify each pixel location as text or non-text, and obtain an initial binarized text image. Finally, we refine this initial result by rejecting connected components via the refine classifier. All three classifiers are trained using either the ICDAR DIBCO datasets or the provided MS-TEx dataset.

In addition to the participants, we consider Howe's [7] and Lelore's [8] methods as non-participating baselines; they are not included in the ranking. These methods deal with color or gray-scale document images. The motivation behind including these two outstanding methods from previous challenges is to show the usefulness of multispectral document image analysis and its capability of providing additional and discriminant information that helps binarization methods distinguish between different objects on the basis of their photometric properties rather than their color.

III. DATASET DESCRIPTION

A. Multispectral dataset description

The MS-TEx test dataset consists of ten (10) multispectral (MS) images generated from a set of historical manuscripts written between the 17th and 20th centuries, collected by the Archives of Quebec (Canada). The documents are handwritten with iron gall-based (ferro-gallic) ink made from salts and tannic acid from vegetable sources [9], [10]. It was the standard writing and drawing ink from about the 12th century until the 19th century, and remained in use well into the 20th century. The MS document images have been recorded using Synchromedia's MS imaging system, which is composed of a CCD sensor [model Chroma X3 KAF 6303E (Kodak), with a high quantum efficiency up to 1100 nm and a resolution of 3072×2048 (6 megapixel) with 9×9 µm pixels] and uses a set of eight (8) chromatic filters, motorized and controlled by the software of the camera. Each chromatic filter acts as a band-pass filter to produce a band image at a specific wavelength. The set of collected bands constitutes the so-called MS information cube (see Fig. 1), which contains one spectral reflectance curve (or spectral signature) for each pixel. For more details the reader can refer to [1], [2], [11].

Fig. 1. MS information cube (bands from 340 nm to 1100 nm). It contains for each wavelength a calibrated spectral reflectance image, so that for each image pixel an entire spectral reflectance curve (reflectance in % versus wavelength in nm) is provided.

The test dataset is composed of ten (10) folders. The name of each folder is composed of the letter "z" and a number (for example, z97). Each folder contains one MS document image with eight (8) spectral bands and one ground-truth image in binary form with the name template [folder-name][GT.png] (for example, "z97 GT.png"). Generally speaking, an MS document image (in our dataset) may contain the following classes (see Fig. 2):
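Under these conventions, a folder can be loaded into an MS information cube in a few lines. The per-band file names (F1.png ... F8.png) are an assumption, since the paper identifies bands only by their F-numbers; the GT name follows the template quoted above.

```python
# Hedged sketch of loading one test folder; band file names are assumed.
import os
import cv2
import numpy as np

def load_ms_folder(path):
    name = os.path.basename(path)
    bands = [cv2.imread(os.path.join(path, f"F{b}.png"), cv2.IMREAD_GRAYSCALE)
             for b in range(1, 9)]                    # assumed F1.png ... F8.png
    cube = np.stack(bands, axis=-1)                   # the MS information cube
    gt = cv2.imread(os.path.join(path, f"{name} GT.png"), cv2.IMREAD_GRAYSCALE)
    return cube, gt

cube, gt = load_ms_folder("z97")
signature = cube[100, 200, :]   # spectral signature of one pixel over 8 bands
```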

1) Original text (OT);
2) Annotations (AN);
3) Stamps (ST);
4) Degradation (DE);
5) Background (BG);

where OT and BG are present in all the document images. AN and ST are not always present. DE is always present in all the document images, but with different degrees of severity; it may contain several types of degradation that, virtually, cannot be enumerated.

Fig. 2. The different classes of the MS document image z97 (band at 600 nm). The image is selected from the training dataset.

B. Ground-truth

Each ground-truth is a binary image. It is defined by black pixels, which correspond to the class OT, and white pixels, which cover the class BG and all the other possible classes (AN, ST, DE; see Fig. 3). The protocol for generating the ground-truth follows the method introduced in [12], [13] and consists of two main steps. First, the images in the different bands are processed and a rough binary image is produced. This rough binary image is then manually modified to generate the final ground-truth.

Fig. 3. The ground-truth image of the MS document image z97.

IV. ORIGINAL TEXT EXTRACTION

The objective is to extract only the class OT (original text) from the input MS document image. In our context, the term extracting can refer to segmentation, isolation, separation or retrieval. Therefore, there is no restriction on the methods proposed for the original text extraction: any kind of classification, binarization, image fusion, source separation or segmentation method can be used. Accordingly, the classes BG, DE, ST and AN must be combined into one class, BG+, at the output of the submitted algorithm. The full description of the contest and the call for participation are available at http://www.synchromedia.ca/competition/ICDAR/mstexicdar2015.html.

V. EXPERIMENTAL RESULTS

A. Evaluation measures

The submitted methods are evaluated using several well-known and widely used measures: 1) FM (F-Measure) [14]; 2) NRM (Negative Rate Metric) [15]; 3) DRD (Distance Reciprocal Distortion) [16]; and 4) the Kappa measure [17].

1) F-Measure (FM):

$$ FM = \frac{2 \times R \times P}{R + P}, \qquad (1) $$

where $R = TP/(TP + FN)$ and $P = TP/(TP + FP)$, with $TP$, $FP$, $TN$ and $FN$ denoting respectively the true positives, false positives, true negatives, and false negatives.

2) NRM (Negative Rate Metric): NRM measures the mismatch between the GT and the machine output (prediction). It is defined as

$$ NRM = \frac{NR_{FN} + NR_{FP}}{2}, \qquad (2) $$

where $NR_{FN} = FN/(TP + FN)$ and $NR_{FP} = FP/(FP + TN)$.
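Both measures are direct functions of the confusion counts. A minimal sketch of Eqs. (1) and (2), assuming binary arrays in which nonzero marks text pixels in both the GT and the prediction:

```python
# F-Measure and NRM from the confusion counts of two binary images.
import numpy as np

def fm_nrm(gt, pred):
    gt, pred = gt.astype(bool), pred.astype(bool)
    tp = np.sum(gt & pred)
    fp = np.sum(~gt & pred)
    fn = np.sum(gt & ~pred)
    tn = np.sum(~gt & ~pred)
    r = tp / (tp + fn)                            # recall
    p = tp / (tp + fp)                            # precision
    fm = 2 * r * p / (r + p)                      # Eq. (1)
    nrm = (fn / (tp + fn) + fp / (fp + tn)) / 2   # Eq. (2)
    return fm, nrm
```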

3) DRD (Distance Reciprocal Distortion): The DRD measure has been proposed to calculate the distortion between binary images [16]. For all the $F$ flipped pixels, it computes the distortion as follows:

$$ DRD = \frac{\sum_{l=1}^{F} DRD_l}{NUBM}, \qquad (3) $$

where $NUBM$ is the number of nonuniform (not all black or white pixels) $8 \times 8$ blocks in the GT image. $DRD_l$, which corresponds to the distortion of the $l$-th flipped pixel [16], is defined as the weighted sum of the pixels in the $5 \times 5$ block of the GT that differ from the value of the flipped pixel $B(x, y)_l$ in the machine output $B$ (predicted image). $DRD_l$ can be expressed as follows:

$$ DRD_l = \sum_{i=-2}^{2} \sum_{j=-2}^{2} |GT(i, j)_l - B(x, y)_l| \times W(i, j), \qquad (4) $$

where $W$ is a normalized weight matrix [16].
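A sketch of Eqs. (3) and (4), assuming 0/1 binary arrays; the 5×5 reciprocal-distance weight matrix W (zero at the centre, normalized to sum to one) follows [16]:

```python
# DRD per Eqs. (3)-(4), assuming gt and pred are 0/1 binary arrays.
import numpy as np

# 5x5 reciprocal-distance weights: zero at the centre, normalized (see [16]).
ii, jj = np.mgrid[-2:3, -2:3]
W = np.zeros((5, 5))
off_centre = (ii != 0) | (jj != 0)
W[off_centre] = 1.0 / np.sqrt(ii[off_centre] ** 2 + jj[off_centre] ** 2)
W /= W.sum()

def drd(gt, pred, block=8):
    gt_pad = np.pad(gt, 2, mode="edge")
    total = 0.0
    for y, x in zip(*np.nonzero(gt != pred)):            # the F flipped pixels
        patch = gt_pad[y:y + 5, x:x + 5].astype(float)
        total += np.sum(np.abs(patch - pred[y, x]) * W)  # Eq. (4)
    # NUBM: number of non-uniform (not all 0 or all 1) 8x8 blocks in the GT.
    h, w = gt.shape
    nubm = sum(0 < gt[i:i + block, j:j + block].sum() < block * block
               for i in range(0, h - block + 1, block)
               for j in range(0, w - block + 1, block))
    return total / max(nubm, 1)                          # Eq. (3)
```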

4) Kappa: The Kappa coefficient [17] is well known in the domain of remotely sensed hyperspectral image classification. Its purpose is to give the reader a quantitative measure of the magnitude of agreement between observers. The calculation is based on the difference between how much agreement is actually present ("observed" agreement $P_o$) and how much agreement would be expected by chance alone ("expected" agreement $P_e$):

$$ Kappa = \frac{P_o - P_e}{1 - P_e}. \qquad (5) $$
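For the two-class (text/background) case used here, Kappa reduces to a few lines over the same binary arrays; a minimal sketch:

```python
# Cohen's Kappa for two-class binary maps, per Eq. (5).
import numpy as np

def kappa(gt, pred):
    gt, pred = gt.astype(bool), pred.astype(bool)
    po = np.mean(gt == pred)                        # observed agreement Po
    p_text = gt.mean() * pred.mean()                # chance agreement on text
    p_bg = (1 - gt.mean()) * (1 - pred.mean())      # chance agreement on background
    pe = p_text + p_bg                              # expected agreement Pe
    return (po - pe) / (1 - pe)                     # Eq. (5)
```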

B. Results and method ranking

As described in [18], for each image in the dataset, the best value of each measure among all of the methods is considered. A method with the best value for a measure gets a score of 1, and the other methods get a fraction of 1 in comparison to the best value.

Since there are four evaluation measures and ten test images, we can compute the score of method $k$ as:

$$ S_k = \sum_{i=1}^{10} \sum_{j=1}^{4} \left( \frac{Best_{i,j}}{value_{k,i,j}} \,,\, \frac{value_{k,i,j}}{Best_{i,j}} \right)_j, \quad k = 1, \cdots, 5, \qquad (6) $$

where $k$ denotes the index of a particular participant, and $value_{k,i,j}$ is the value of measure number $j$ obtained on test image number $i$ by method $k$. The operator $(\cdot)_j$ selects the first fraction for those measures $j$ for which a lower value means better performance (such as DRD), and the second fraction for those measures $j$ with the reverse behavior (for example, F-Measure). At the end, the method with the highest score $S$ is ranked first, and so on.
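A sketch of Eq. (6), assuming the results are stored in a 5×10×4 array ordered (FM, NRM, DRD, Kappa); the array orientation and names are illustrative:

```python
# Overall score S_k of Eq. (6): per image and measure, each method earns the
# ratio of the best value to its own value (lower-is-better measures) or of
# its own value to the best (higher-is-better measures).
import numpy as np

def overall_scores(values, lower_is_better=(False, True, True, False)):
    # values[k, i, j]: measure j of method k on test image i.
    scores = np.zeros(values.shape[0])
    for j, lower in enumerate(lower_is_better):
        v = values[:, :, j]
        best = v.min(axis=0) if lower else v.max(axis=0)
        scores += (best / v).sum(axis=1) if lower else (v / best).sum(axis=1)
    return scores            # the method with the highest S_k ranks first
```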

TABLE I. THE AVERAGE PERFORMANCE OF THE METHODS ON THE TEST DATASET, EVALUATED USING FOUR MEASURES. THE RANKING SCORES ARE CALCULATED USING THE OVERALL SCORE S. THE FINAL RANKS ARE PROVIDED IN THE COLUMN 'RANK'.

Rank | Method     | FM    | NRM   | DRD   | Kappa | S
-----+------------+-------+-------+-------+-------+------
1st  | 1          | 83.33 | 9.250 | 4.241 | 82.20 | 35.26
2nd  | 2          | 81.87 | 10.05 | 4.793 | 80.67 | 33.65
3rd  | 4          | 79.09 | 12.58 | 5.084 | 77.80 | 31.21
4th  | 5          | 76.57 | 14.31 | 5.548 | 75.12 | 29.55
5th  | 3          | 73.14 | 10.26 | 9.325 | 71.18 | 27.38
 -   | Howe [7]   | 70.35 | 12.09 | 8.598 | 68.58 | 27.69
 -   | Lelore [8] | 67.16 | 6.965 | 15.27 | 64.47 | 27.85

Table I shows the ranking and performance of each method. Overall, Method 1, submitted by M. Diem et al. from Vienna University of Technology, achieved the highest performance. Figure 4 shows the outputs of the different methods on the MS image z92 from the test dataset. As mentioned above, for the sake of comparison with methods working in gray or color spaces, we added, in the last two rows of Table I, the results of two binarization algorithms [7], [8]. The input to these two algorithms was the RGB image formed by combining the F2 (Blue), F3 (Green) and F4 (Red) visible bands; this color image is in fact the appearance of the original document under common imaging systems. The experimental results show that practically all of the multispectral-based submitted methods outperform Howe's [7] and Lelore's [8] methods. This may confirm that advanced imaging systems that can capture document images at both visible and invisible wavelengths (UV, IR) may be of great interest for document image analysis tasks.

VI. CONCLUSION

The ICDAR 2015 MS-TEx contest was organized to provide a venue to objectively evaluate binarization and segmentation methods for multispectral document images. Five methods from four teams successfully participated in the contest. Method 1, submitted from Vienna University of Technology, ranked first in the contest in terms of the combined ranking of four evaluation measures. In addition to the submitted multispectral-based methods, two grayscale/color-based state-of-the-art methods were included but not ranked, to show that using advanced multispectral imaging systems is of great interest in document image analysis because of their advantage in discriminating between different document objects even when they appear with similar colors in the visible range of human vision.

For more meaningful outcomes, we will work in the future on enlarging the training and testing multispectral document image datasets with various degradations.

REFERENCES

[1] R. Hedjam and M. Cheriet, "Ground-truth estimation in multispectral representation space: Application to degraded document image binarization," in 12th International Conference on Document Analysis and Recognition (ICDAR), Aug 2013, pp. 190–194.

[2] R. Hedjam and M. Cheriet, "Historical document image restoration using multispectral imaging system," Pattern Recognition, vol. 46, pp. 2297–2312, 2013.

[3] S. Lu, B. Su, and C. L. Tan, "Document image binarization using background estimation and stroke edges," International Journal on Document Analysis and Recognition (IJDAR), vol. 13, no. 4, pp. 303–314, 2010.

[4] L. Scharf and L. McWhorter, "Adaptive matched subspace detectors and adaptive coherence estimators," in Conference Record of the Thirtieth Asilomar Conference on Signals, Systems and Computers, Nov 1996, pp. 1114–1117, vol. 2.

[5] C. Rother, V. Kolmogorov, and A. Blake, ""GrabCut": Interactive foreground extraction using iterated graph cuts," ACM Transactions on Graphics (SIGGRAPH), August 2004.

[6] Y. Wu, S. Rawls, W. AbdAlmageed, and P. Natarajan, "Learning document image binarization from data," ArXiv e-prints, May 2015.

[7] N. Howe, "Document binarization with automatic parameter tuning," International Journal on Document Analysis and Recognition (IJDAR), vol. 16, no. 3, pp. 247–258, 2013.

[8] T. Lelore and F. Bouchara, "FAIR: A fast algorithm for document image restoration," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 2039–2048, Aug 2013.

[9] M. Trojan-Bedynski, F. Kalbfleisch, S. Tse, and P. Sirois, "The use of simmering water in the treatment of a nineteenth century sketchbook of iron gall ink drawings by James G. Mackay," Journal of the Canadian Association for Conservation, vol. 28, 2003.

[10] J. G. Neevel and B. Reiland, "Bathophenanthroline indicator paper: Development of a new test for iron ions," Restaurierung, vol. 6(1), 2005.

[11] R. Hedjam, "Visual image processing in various representation spaces for documentary preservation," Ph.D. thesis, University of Quebec (ETS), 2013.

[12] H. Ziaei Nafchi, S. Ayatollahi, R. Farrahi Moghaddam, and M. Cheriet, "An efficient ground truthing tool for binarization of historical manuscripts," in 12th International Conference on Document Analysis and Recognition (ICDAR), Aug 2013, pp. 807–811.

[13] H. Ziaei Nafchi, R. Farrahi Moghaddam, and M. Cheriet, "Phase-based binarization of ancient document images: Model and applications," IEEE Transactions on Image Processing, vol. 23, pp. 2916–2930, 2014.

[14] M. Sokolova and G. Lapalme, "A systematic analysis of performance measures for classification tasks," Information Processing & Management, vol. 45, no. 4, pp. 427–437, 2009.

[15] D. P. Young and J. M. Ferryman, "PETS metrics: On-line performance evaluation service," in Proceedings of the 14th International Conference on Computer Communications and Networks, ser. ICCCN '05. Washington, DC, USA: IEEE Computer Society, 2005, pp. 317–324.

[16] H. Lu, A. Kot, and Y. Shi, "Distance-reciprocal distortion measure for binary document images," IEEE Signal Processing Letters, vol. 11, no. 2, pp. 228–231, Feb 2004.

[17] A. J. Viera and J. M. Garrett, "Understanding interobserver agreement: The kappa statistic," Fam Med., vol. 37(5), pp. 360–363, 2005.

[18] S. Ayatollahi and H. Ziaei Nafchi, "Persian heritage image binarization competition (PHIBC 2012)," in First Iranian Conference on Pattern Recognition and Image Analysis (PRIA), March 2013, pp. 1–4.


Fig. 4. Example of different methods' outputs. (a) Visible image made from the Blue (F2), Green (F3) and Red (F4) bands; (b) Infrared band (F8); (c) GT; (d) Method 1; (e) Method 2; (f) Method 3; (g) Method 4; (h) Method 5; (i) Lelore [8]; (j) Howe [7].
