Features to be supported by Malayalam OCR
  1. Purpose of this document
    This document lists the features that Malayalam OCR will support. It does not describe how each feature will be implemented. It must be used to guide design decisions.
    Examples of features
    • Ability to handle slanted pages.
      - Pages are often placed imprecisely on the scanner during digitization, so the scanned image makes an angle with the horizontal axis. This is practically unavoidable. The angle is called skew. Skew significantly reduces OCR accuracy, so a good OCR system should detect the angle and rotate the image in the opposite direction before the recognition stage. This process is called skew detection and correction.
    • Ability to recognize pages in which text and graphics are intermingled.
  2. Design Guidelines
    There could be many different ways to support a particular feature. We do not know in advance which method will work best to support a feature. Often, it depends on the particular page to be recognized (input). In some cases, a combination of methods will give the best results.

    As an example, different kinds of pages require different skew detection methods to estimate the skew angle correctly. Most OCR systems detect skew in two stages. In the first stage, a coarse estimate of the skew angle is obtained. The second stage uses a different algorithm to obtain a finer estimate of the skew angle.
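
    The two-stage idea can be sketched as follows. This is a minimal illustration, not the method chosen for this project: it uses a projection-profile score for both stages and only changes the angular step, whereas a real system may use a different algorithm for the fine stage. The function names, the ±10 degree search range, the step sizes, and the synthetic test page are all assumptions made for the example. The sign of the returned angle depends on the image coordinate convention (here y grows downward).

```python
import math
from collections import Counter

def profile_score(pixels, angle_deg):
    """Score how sharply foreground pixels collapse onto rows at this angle.

    pixels is a list of (x, y) foreground coordinates. Each pixel is projected
    onto the axis perpendicular to the candidate text direction; when the
    candidate matches the true skew, each text line falls into few rows and the
    sum of squared row counts is large.
    """
    t = math.radians(angle_deg)
    c, s = math.cos(t), math.sin(t)
    rows = Counter(round(y * c - x * s) for x, y in pixels)
    return sum(n * n for n in rows.values())

def best_angle(pixels, lo, hi, step):
    """Return the candidate angle in [lo, hi] with the highest profile score."""
    count = int(round((hi - lo) / step))
    candidates = [lo + i * step for i in range(count + 1)]
    return max(candidates, key=lambda a: profile_score(pixels, a))

def estimate_skew(pixels, max_angle=10.0, coarse_step=1.0, fine_step=0.1):
    """Coarse search over the full range, then a fine search around the winner."""
    coarse = best_angle(pixels, -max_angle, max_angle, coarse_step)
    return best_angle(pixels, coarse - coarse_step, coarse + coarse_step, fine_step)

def synthetic_page(skew_deg, lines=8, width=400, spacing=30):
    """Hypothetical test data: thin horizontal 'text lines' tilted by skew_deg."""
    slope = math.tan(math.radians(skew_deg))
    return [(x, round(line * spacing + x * slope))
            for line in range(lines) for x in range(width)]

if __name__ == "__main__":
    page = synthetic_page(3.0)
    print("estimated skew (degrees):", estimate_skew(page))
```

    Correction would then rotate the image by the negative of the estimated angle; that step is omitted here because it depends on the image library in use.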

    So, our system should allow a developer/user to rapidly test various approaches (or a combination of approaches) and decide which works best for the particular case at hand.
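
    One way to make such experiments cheap is to treat every method as an interchangeable function, allow functions to be chained, and rank the candidates on sample pages with known results. The sketch below illustrates this for binarization only. Everything in it is hypothetical: the names `compose` and `rank_methods`, the three toy methods, the pixel-accuracy score, and the 2x2 sample page are assumptions for illustration, not part of any agreed design.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Page = List[List[int]]            # grayscale rows: 0 = black, 255 = white
Method = Callable[[Page], Page]   # any processing step: page in, page out

def compose(*stages: Method) -> Method:
    """Chain several steps into one method (a 'combination of approaches')."""
    def pipeline(page: Page) -> Page:
        for stage in stages:
            page = stage(page)
        return page
    return pipeline

def contrast_stretch(page: Page) -> Page:
    """Rescale gray levels so the darkest pixel maps to 0 and the lightest to 255."""
    flat = [v for row in page for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return page
    return [[(v - lo) * 255 // (hi - lo) for v in row] for row in page]

def fixed_threshold(page: Page, t: int = 128) -> Page:
    """Mark a pixel as ink (1) when it is darker than a fixed threshold."""
    return [[1 if v < t else 0 for v in row] for row in page]

def mean_threshold(page: Page) -> Page:
    """Mark a pixel as ink (1) when it is darker than the page's mean gray level."""
    flat = [v for row in page for v in row]
    t = sum(flat) / len(flat)
    return [[1 if v < t else 0 for v in row] for row in page]

def pixel_accuracy(output: Page, expected: Page) -> float:
    """Fraction of pixels on which the output agrees with the expected result."""
    pairs = [(o, e) for out_row, exp_row in zip(output, expected)
             for o, e in zip(out_row, exp_row)]
    return sum(o == e for o, e in pairs) / len(pairs)

def rank_methods(methods: Dict[str, Method],
                 samples: Sequence[Tuple[Page, Page]],
                 score: Callable[[Page, Page], float]) -> List[Tuple[str, float]]:
    """Run every method on every sample and sort by mean score, best first."""
    means = {name: sum(score(m(page), truth) for page, truth in samples) / len(samples)
             for name, m in methods.items()}
    return sorted(means.items(), key=lambda item: item[1], reverse=True)

methods = {
    "fixed_threshold": fixed_threshold,
    "mean_threshold": mean_threshold,
    "stretch_then_fixed": compose(contrast_stretch, fixed_threshold),
}
# A low-contrast sample: ink at gray level 150 on paper at gray level 200.
samples = [([[150, 200], [200, 150]], [[1, 0], [0, 1]])]

if __name__ == "__main__":
    for name, accuracy in rank_methods(methods, samples, pixel_accuracy):
        print(f"{name}: {accuracy:.2f}")
```

    Adding a new approach then means adding one entry to the `methods` table, and the same ranking function can be reused for other features by swapping the score function.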

  3. List of features that need to be supported
    1. Binarization
    2. Skew detection and correction
    3. Character recognition based on supervised and unsupervised learning.
    4. Recognition of pages in which text and graphics are intermingled.
    5. Use of a lexicon to improve accuracy.
  4. List of features that will not be supported
    1. Lines of different font sizes.
