Tutorials
Statistical and Adaptive OCR — a Hands-On Tutorial with OCRopus
- Instructor:
- Prof. Dr. Thomas Breuel
- Date/Time:
- Sep. 16, 13:00 - 15:00
- Place:
- Room Yuri (Hotel Nikko Nara 5F)
- Intended Audience:
- students and researchers interested in the state of the art in document analysis,
including OCR and layout analysis; students, researchers,
and industrial developers interested in using the OCRopus open source OCR system
as a platform for research and development
- Prerequisites for participants:
- some knowledge of pattern recognition, machine learning, image processing
- Outline and brief description:
- The instructor will give an introduction to state-of-the-art techniques for OCR,
layout analysis, and digital library capture applications.
Techniques likely to be covered include:
- efficient document image processing and cleanup using fast morphology
- morphological decomposition, bit-blit and run-length based implementations,
output complexity, marker morphology, applications to document cleanup and layout analysis
- automatic and adaptive document image binarization and segmentation
- review of standard document binarization methods,
fast integral-image based implementations of adaptive binarization methods
- skew detection and correction, dewarping
- overview of skew detection methods based on the docstrum,
Fourier transforms, mathematical morphology, line fitting and RANSAC;
detailed description of robust least-squares matching for skew correction;
brief overview of image dewarping methods and camera-based methods
- text/image segmentation in documents
- local image features used for pixel and region classification,
logistic regression and decision tree based classification
- geometric document layout analysis
- review of morphological layout analysis, docstrum, XY cut, Voronoi methods,
whitespace methods, and computational geometry methods
- statistical document layout analysis
- an overview of statistical approaches to layout analysis, including MRFs,
stochastic 2D grammars, 2D HMMs, turbo decoding, and structural mixture models
- text line recognition using oversegmentation
- recognition by oversegmentation, grouping, and segmentation graphs;
different methods for oversegmentation (upper contour, skeletal graph, dynamic programming);
relationship to HMMs; statistically sound estimation of segmentation costs and posterior probabilities;
relationship to speech recognition and HMM-based methods
- bitmap clustering for OCR
- clustering methods for OCR, both for style modeling and for speeding up OCR engines;
contour-based methods; shape matching-based methods;
statistically sound modeling of shape variations
- adaptive character recognition
- an overview of current approaches to adaptive character recognition, that is, methods
that adapt to book- or document-specific degradations, fonts, and shape variations
- statistical language modeling for OCR using WFSTs and the OpenFST library
- "classical" language models, including n-graphs, n-grams, dictionaries, and tries;
expression of classical language models in weighted finite state transducers; smoothing;
representation of recognition alternatives; integration with statistical natural
language post-processing, including information extraction; use of WFST models for
training OCR with misaligned ground truth
- benchmarking and performance evaluation
- an overview of layout performance evaluation metrics;
color-based pixel-accurate ground truth representation;
bipartite matching and computation of vectorial performance scores;
evaluation of segmentation performance; evaluation of character recognition performance
The instructor will review the state of the art in each of these areas and then describe specific approaches and algorithms in detail.
All methods covered are implemented in the OCRopus open source OCR system,
and the tutorial will be illustrated with OCRopus-based examples.
- Profile of the Instructor:
- Thomas Breuel is a professor of computer science in the Computer Science Department of the Technical University
of Kaiserslautern, head of the Image Understanding and Pattern Recognition (IUPR)
research group at the DFKI, and a consultant in Palo Alto, CA, USA.
His research group works in the areas of image understanding, document imaging, computer vision,
and pattern recognition. Previously, he was a researcher at Xerox PARC, the IBM Almaden Research Center,
IDIAP, Switzerland, as well as a consultant to the US Bureau of the Census. He is an alumnus of the Massachusetts
Institute of Technology and Harvard University.
- Teaching Experience:
- Thomas Breuel is a full professor of computer science at the University of Kaiserslautern.
Since 2004, he has developed the HCI curriculum and three new courses in pattern recognition and image understanding.
Unlocking the World's Knowledge: The Analysis of Historical Documents
- Instructor:
- Prof. Apostolos Antonacopoulos
- Date/Time:
- Sep. 16, 15:30 - 17:30
- Place:
- Room Yuri (Hotel Nikko Nara 5F)
- Overview:
The tutorial will cover the background issues, challenges and opportunities in the analysis of historical documents.
The tutorial is broadly divided into two parts. The first part starts with an examination of the different motivations
and other institutional factors that influence technical decisions. The types of documents typically
encountered are discussed next with the challenges and possibilities they offer for digitisation and full-text conversion. Focussing on the needs of major libraries, the remainder of the first part presents in detail
the different stages in full-text conversion. In each of the stages (scanning, image enhancement, segmentation,
OCR and post-processing) the challenges and possibilities for improvement are examined.
The second part of the tutorial comprises a more technical description of the state of the art
in the analysis of historical documents. Major past and current initiatives will be mentioned
and individual methods will be described for each stage in the processing, analysis
and recognition of historical documents. Finally, as an essential aspect in measuring and making progress,
approaches to the performance evaluation of historical document analysis methods will be presented.
PART A
- Digitisation Approaches. Differences between libraries and archives,
showcase vs. mass digitisation, preservation vs. searchability.
- Documents. Differences through the centuries. Handwritten and printed. Manuscripts, books and newspapers.
- Full-text conversion workflow. Stage-by-stage description of processing, analysis and recognition.
- Scanning: scanning options, library processes.
- Image enhancement: different types of artefacts, challenges.
- Layout analysis / segmentation: challenges.
- OCR: challenges and approaches.
- Post-processing: dictionaries, automated and manual correction approaches.
PART B
- State-of-the-art.
- Projects: past and current.
- Methods and examples for each stage.
- Performance evaluation.
- Intended audience:
- All Document Analysis researchers.
- Prerequisites for participants:
- A basic knowledge of Document Analysis methodology.
- About the presenter:
Apostolos Antonacopoulos is currently a Senior Lecturer in Pattern Recognition and heads the Pattern Recognition
and Image Analysis (PRImA) research group at the University of Salford, UK.
He received his PhD from the University of Manchester Institute of Science and Technology (UMIST),
UK in 1995. For his early research in Document Image Analysis he was presented with the Best Student Paper Award
at the International Association for Pattern Recognition (IAPR) workshop on Document Analysis Systems in 1994 (DAS'94).
In 2005, he received the IAPR/ICDAR Young Investigator Award for
"outstanding service to the ICDAR community and his innovative research in historical document processing applications."
Dr Antonacopoulos is a member of the Editorial Boards of the International Journal
on Document Analysis and Recognition (IJDAR) and of the Electronic Letters on Computer Vision and Image Analysis (ELCVIA) journal,
Chair of the IAPR Conferences and Meetings Committee, Vice-Chair of the IAPR Technical Committee on Reading Systems (TC11),
past Chair of the IAPR Education Committee, Advisory Board member of ICDAR, Program Co-Chair of ICDAR2009,
Chair (Publications) of ICDAR2003, Chair (Tutorials and Demos) of ICFHR2008, Co-Chair of WDA2001 and WDA2003,
and a Program Committee member of most current and recent editions of conferences in his field of research:
ICPR, ICDAR, DAS, ACM DocEng, SPIE DRR, ACM SAC etc. He is a member of the IET (MIET), the IEEE and the IEEE Computer Society,
the ACM and the British Machine Vision Association.
In 2007 he co-edited a special issue on the Analysis of Historical Documents in the International Journal
on Document Analysis and Recognition. He currently (2008-2011) plays a leading role in the largest EU-funded project to date
in Digital Libraries and Cultural Heritage, IMPACT (Improving Access to Text), which carries out research in full-text conversion
of historical documents in the leading libraries of Europe.