Overview
AUTHOR
Rolf INGOLD: Professor - Computer Science Department, University of Fribourg (Switzerland)
INTRODUCTION
Document image analysis and recognition is a scientific discipline that brings together a range of computing techniques aimed at reconstructing the content of a document from its image. Long confined to character recognition, it now has much broader objectives, ranging from simple document classification to complete content interpretation, indexing and re-editing. The ultimate goal of document image recognition is thus to generate a high-level representation, in the form of a structured document, that suits the intended application.
By way of introduction, consider a page from a scientific book (figure 1a) that needs to be "hypertextualized", i.e. produced as an electronic version with hypertext links for navigation. Such an application must determine the logical structure of the book, i.e. its hierarchical organization into chapters, sections and paragraphs, and must identify definitions, exercise statements, experiment descriptions, formulas, etc. Figure 1b shows this structure at page level, while figure 1c illustrates the resulting hierarchical structure. It is this structure that can be used for hypertext navigation.
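To make the idea of a logical structure concrete, the following is a minimal sketch, not the author's method: it assumes the hierarchy is held as a simple tree and derives one navigation anchor per element from its position in that tree. The names `Node` and `toc_links`, the element kinds and the sample book are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One element of a (hypothetical) logical structure tree."""
    kind: str                      # e.g. "chapter", "section", "definition"
    title: str = ""
    children: List["Node"] = field(default_factory=list)


def toc_links(node: Node, prefix: str = "") -> List[str]:
    """Return one navigation entry per descendant, as '#anchor<TAB>label'.

    The anchor encodes the path from the root (e.g. '1.1.2'), so each
    entry could serve as a hypertext link target.
    """
    entries = []
    for i, child in enumerate(node.children, start=1):
        anchor = f"{prefix}{i}"
        label = f"{child.kind}: {child.title}" if child.title else child.kind
        entries.append(f"#{anchor}\t{label}")
        # Recurse so that nested elements get dotted anchors.
        entries.extend(toc_links(child, prefix=anchor + "."))
    return entries


# Invented sample content, standing in for a recognized book page.
book = Node("book", "Example", [
    Node("chapter", "Optics", [
        Node("section", "Refraction", [
            Node("definition", "Refractive index"),
            Node("formula"),
            Node("exercise", "Prism"),
        ]),
    ]),
])

for line in toc_links(book):
    print(line)
```

In this sketch the recognition step itself is left out entirely; the tree is written by hand, whereas a real system would have to infer it from the page image.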
Traditionally, document recognition has been applied primarily to paper documents for which no electronic form was available. Today, these techniques are also recognized as particularly useful for restructuring unstructured or poorly structured electronic documents, working from an image produced synthetically, for example with a PostScript print engine.
From a historical point of view, it is interesting to note that optical character recognition predates the development of computer technology: patents were already filed in the 19th century, and a demonstration prototype was reported in 1916. The first computerized approaches to character recognition date back to the early 1960s; for example, the first mail-sorting machine (limited to typed addresses) was installed in the USA in 1965. However, the major developments came with the advent of office automation in the 1980s.
This article is included in
Digital documents and content management
Document image analysis and recognition
Bibliography
References
Websites
Private research and commercial products
Server on character recognition and document analysis https://cfar.umd.edu/
Organizations
Center of Excellence for Document Analysis and Recognition (CEDAR) http://www.cedar.buffalo.edu
Perception, Systèmes, Information (PSI) http://psiserver.insa-rouen.fr/psi
Reconnaissance de l'Écriture et Analyse de Documents...