The Eighth IAPR International Workshop on Document Analysis Systems (DAS2008)

September 16-19, 2008
Nara Prefectural New Public Hall, Nara, Japan


Statistical and Adaptive OCR — a Hands-On Tutorial with OCRopus

Prof. Dr. Thomas Breuel
Sep. 16, 13:00 - 15:00
Room Yuri (Hotel Nikko Nara 5F)

Intended Audience:
students and researchers interested in the state of the art in document analysis, including OCR and layout analysis; students, researchers, and industrial developers interested in using the OCRopus open source OCR system as a platform for research and development
Prerequisites for participants:
some knowledge of pattern recognition, machine learning, image processing
Outline and brief description:
He will give an introduction to state-of-the-art techniques for OCR, layout analysis, and digital library capture applications. Techniques likely to be covered include:

He will review the state of the art in each of these areas, and then describe specific approaches and algorithms in detail. All methods he will be talking about are implemented in the OCRopus open source OCR system, and he will illustrate the tutorial with OCRopus-based examples.

Profile of the Instructor:
Thomas Breuel is professor of computer science at the Technical University of Kaiserslautern Computer Science Department, head of the Image Understanding and Pattern Recognition (IUPR) research group at the DFKI, and a consultant in Palo Alto, CA, USA.
His research group works in the areas of image understanding, document imaging, computer vision, and pattern recognition. Previously, he was a researcher at Xerox PARC, the IBM Almaden Research Center, IDIAP, Switzerland, as well as a consultant to the US Bureau of the Census. He is an alumnus of the Massachusetts Institute of Technology and Harvard University.
Teaching Experience:
Thomas Breuel is a full professor of computer science at the University of Kaiserslautern. Since 2004, he has developed the HCI curriculum and three new courses in pattern recognition and image understanding.

Unlocking the World's Knowledge: The Analysis of Historical Documents

Prof. Apostolos Antonacopoulos
Sep. 16, 15:30 - 17:30
Room Yuri (Hotel Nikko Nara 5F)

The tutorial will cover the background issues, challenges and opportunities in the analysis of historical documents.

The tutorial is broadly divided into two parts. The first part starts with an examination of the different motivations and other institutional factors that influence technical decisions. The types of documents typically encountered are discussed next, with the challenges and possibilities they offer for digitisation and full-text conversion. Focussing on the needs of major libraries, the remainder of the first part presents in detail the different stages in full-text conversion. In each of the stages (scanning, image enhancement, segmentation, OCR and post-processing) the challenges and possibilities for improvement are examined.

The second part of the tutorial comprises a more technical description of the state of the art in the analysis of historical documents. Major past and current initiatives will be mentioned and individual methods will be described for each stage in the processing, analysis and recognition of historical documents. Finally, as an essential aspect in measuring and making progress, ways of evaluating the performance of historical document analysis methods will be presented.
  1. Digitisation Approaches. Differences between libraries and archives, showcase vs. mass digitisation, preservation vs. searchability.
  2. Documents. Differences through the centuries. Handwritten and printed. Manuscripts, books and newspapers.
  3. Full-text conversion workflow. Stage-by-stage description of processing, analysis and recognition.
  4. State of the art.
Intended audience:
All Document Analysis researchers.
Prerequisites for participants:
A basic knowledge of Document Analysis methodology.
About the presenter:
Apostolos Antonacopoulos is currently a Senior Lecturer in Pattern Recognition and heads the Pattern Recognition and Image Analysis (PRImA) research group at the University of Salford, UK. He received his PhD from the University of Manchester Institute of Science and Technology (UMIST), UK in 1995. For his early research in Document Image Analysis he was presented with the Best Student Paper Award at the International Association for Pattern Recognition (IAPR) workshop on Document Analysis Systems in 1994 (DAS'94). In 2005, he received the IAPR/ICDAR Young Investigator Award for "outstanding service to the ICDAR community and his innovative research in historical document processing applications."

Dr Antonacopoulos is a member of the Editorial Boards of the International Journal on Document Analysis and Recognition (IJDAR) and of the Electronic Letters on Computer Vision and Image Analysis (ELCVIA) journal, Chair of the IAPR Conferences and Meetings Committee, Vice-Chair of the IAPR Technical Committee on Reading Systems (TC11), past Chair of the IAPR Education Committee, Advisory Board member of ICDAR, Program Co-Chair of ICDAR2009, Chair (Publications) of ICDAR2003, Chair (Tutorials and Demos) of ICFHR2008, Co-Chair of WDA2001 and WDA2003, and a Program Committee member of most current and recent editions of conferences in his field of research: ICPR, ICDAR, DAS, ACM DocEng, SPIE DRR, ACM SAC etc. He is a member of the IET (MIET), the IEEE and the IEEE Computer Society, the ACM and the British Machine Vision Association.

In 2007 he co-edited a special issue on the Analysis of Historical Documents in the International Journal on Document Analysis and Recognition. He currently (2008-2011) plays a leading role in the largest EU-funded project to date in Digital Libraries and Cultural Heritage: IMPACT (Improving Access to Text), which carries out research in full-text conversion of historical documents in the leading libraries of Europe.