GET A LIFE NOW: Optical Recognition

http://www.computerworld.com/s/article/73023/Optical_Character_Recognition?taxonomyId=11&pageNumber=2

Optical Character Recognition
By Sami Lais
July 29, 2002
Computerworld - Suppose you wanted to digitize the novel Moby Dick overnight. You could stay up all night typing and still not finish. Or you could use a high-end scanner and in minutes scan all of author Herman Melville's works into a computer using optical character recognition (OCR) technology.

This is the technology long used by libraries and government agencies to make lengthy documents quickly available electronically. Advances in OCR technology have spurred its increasing use by enterprises.

For many document-input tasks, OCR is the most cost-effective and speedy method available. And each year, the technology frees acres of storage space once given over to file cabinets and boxes full of paper documents.

Before OCR can be used, the source material must be scanned using an optical scanner (and sometimes a specialized circuit board in the PC) to read in the page as a bitmap (a pattern of dots). Software to recognize the images is also required.

The OCR software then processes these scans to differentiate between images and text and determine what letters are represented in the light and dark areas.

Older OCR systems match these images against stored bitmaps based on specific fonts. The hit-or-miss results of such pattern-recognition systems helped establish OCR's reputation for inaccuracy.

Today's OCR engines add the multiple algorithms of neural network technology to analyze the stroke edge, that is, the line of discontinuity between the text characters and the background. Allowing for irregularities of printed ink on paper, each algorithm averages the light and dark along the side of a stroke, matches it to known characters and makes a best guess as to which character it is. The OCR software then averages or polls the results from all the algorithms to obtain a single reading.
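The polling step can be pictured as a simple majority vote. The sketch below is illustrative only: the function name poll_readings and the three sample guesses are assumptions, and real OCR engines may weight or combine their algorithms quite differently.

```python
from collections import Counter


def poll_readings(readings):
    """Return the character most recognizers chose and its share of the votes.

    `readings` is a list with one best-guess character per algorithm.
    This is a hypothetical helper, not any vendor's actual method.
    """
    if not readings:
        raise ValueError("need at least one reading")
    counts = Counter(readings)
    winner, votes = counts.most_common(1)[0]
    return winner, votes / len(readings)


# Three hypothetical algorithms disagree about a smudged glyph.
char, agreement = poll_readings(["e", "c", "e"])
print(char, round(agreement, 2))  # prints: e 0.67
```

A low agreement share could be used to flag the character for human review rather than accepting it outright.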

Advances are being made to recognize characters based on the context of the word in which they appear, as with the Predictive Optical Word Recognition algorithm from Peabody, Mass.-based ScanSoft Inc. The next step for developers is document recognition, in which the software will use knowledge of the parts of speech and grammar to recognize individual characters.
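As a rough illustration of using word context, the sketch below picks the best-scoring combination of per-character guesses that forms a known word. It is not ScanSoft's algorithm, whose details are not given here; the function best_word, the scores and the two-word lexicon are all assumptions made for the example.

```python
from itertools import product


def best_word(candidates, lexicon):
    """Choose a word from per-position character guesses.

    `candidates` holds, for each character position, a list of
    (character, score) pairs. The highest-scoring combination found in
    `lexicon` wins; if none is a known word, fall back to the top guess
    at each position. Hypothetical helper for illustration only.
    """
    best, best_score = None, float("-inf")
    for combo in product(*candidates):
        word = "".join(ch for ch, _ in combo)
        score = sum(s for _, s in combo)
        if word in lexicon and score > best_score:
            best, best_score = word, score
    if best is None:
        best = "".join(max(pos, key=lambda c: c[1])[0] for pos in candidates)
    return best


# The fourth glyph looks more like the digit 1 than the letter l.
guesses = [
    [("w", 0.9)],
    [("h", 0.8), ("b", 0.2)],
    [("a", 0.9)],
    [("1", 0.6), ("l", 0.4)],
    [("e", 0.9)],
]
print(best_word(guesses, {"whale", "while"}))  # prints: whale
```

Taken character by character, the top guesses would spell "wha1e"; checking against the word list recovers "whale".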

Today, OCR software can recognize a wide variety of fonts, but handwriting and script fonts that mimic handwriting are still problematic.

Developers are taking different approaches to improve script and handwriting recognition. OCR software from ExperVision Inc. in Fremont, Calif., first identifies the font and then runs its character-recognition algorithms.

Advances have made OCR more reliable; expect a minimum of 90% accuracy for average-quality documents. Despite vendor claims of one-button scanning, achieving 99% or greater accuracy takes clean copy and practice setting scanner parameters and requires you to "train" the OCR software with your documents.

The first step toward better recognition begins with the scanner. The quality of its charge-coupled device light arrays will affect OCR results. The more tightly packed these arrays, the finer the image and the more distinct colors the scanner can detect.

Smudges or background color can fool the recognition software. Adjusting the scan's resolution can help refine the image and improve the recognition rate, but there are trade-offs.

For example, in an image scanned at 24-bit color with 1,200 dots per inch (dpi), each of the 1,200 pixels in every linear inch has 24 bits' worth of color information. This scan will take longer than a lower-resolution scan and produce a larger file, but OCR accuracy will likely be high.

A scan at 72 dpi will be faster and produce a smaller file—good for posting an image of the text to the Web—but the lower resolution will likely degrade OCR accuracy.

Most scanners are optimized for 300 dpi, but scanning at a higher number of dots per inch will increase accuracy for type under 6 points in size.

Bilevel (black and white only) scans are the rule for text documents. Bilevel scans are faster and produce smaller files, because unlike 24-bit color scans, they require only one bit per pixel. Some scanners can also let you determine how subtle to make the color differentiation.
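To put numbers on the file-size trade-off described above, the sketch below computes the uncompressed size of a scan from its dimensions, resolution and bit depth. The 8.5-by-11-inch page and the function name raw_scan_bytes are assumptions for the example; real files are usually compressed and will be smaller.

```python
def raw_scan_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size in bytes of a scan of the given page."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    # Round up to a whole number of bytes.
    return (pixels * bits_per_pixel + 7) // 8


# Assumed page size: 8.5 x 11 inches.
print(raw_scan_bytes(8.5, 11, 1200, 24))  # 403920000 (24-bit color, 1,200 dpi)
print(raw_scan_bytes(8.5, 11, 72, 24))    # 1454112   (24-bit color, 72 dpi)
print(raw_scan_bytes(8.5, 11, 300, 1))    # 1051875   (bilevel, 300 dpi)
```

Under these assumptions, a bilevel page at 300 dpi takes about 1MB uncompressed, while the same page in 24-bit color at 1,200 dpi takes roughly 400MB.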

Which method will be more effective depends on the image being scanned. A bilevel scan of a shopworn page may yield more legible text. But if the image to be scanned has text in a range of colors, as in a brochure, text in lighter colors may drop out.
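The drop-out effect can be seen in a minimal thresholding sketch. The gray values, the default threshold of 128 and the function name to_bilevel are assumptions for illustration; scanners and OCR packages typically use more elaborate methods.

```python
def to_bilevel(gray_row, threshold=128):
    """Convert 8-bit gray values (0 = black, 255 = white) to 1 = ink, 0 = paper."""
    return [1 if value < threshold else 0 for value in gray_row]


# Dark text (20), white paper (250), light-colored text (180), paper again.
row = [20, 20, 250, 180, 180, 250]

print(to_bilevel(row))                 # [1, 1, 0, 0, 0, 0] light text drops out
print(to_bilevel(row, threshold=200))  # [1, 1, 0, 1, 1, 0] light text kept
```

Raising the threshold keeps the lighter text, but on a smudged or tinted page it would also turn more of the background into false ink.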

Lais is a freelance writer in Takoma Park, Md.

What is OCR?

Next to keypunching, Optical Character Recognition is the oldest data entry technique in existence. Long before the first key-to-disk system or CRT was used, Optical Character Readers were entering data in commercial and government EDP installations.

The popularity of OCR has been increasing each year with the advent of fast microprocessors providing the vehicle for vastly improved recognition techniques. This can be seen in OCR wands that now read print that large batch readers would have rejected a little over 10 years ago. There have also been tremendous improvements in both effective read rates and accuracy. Data entry through OCR is faster, more accurate, and generally more efficient than keystroke data entry. Desktop OCR scanners can read typewritten data into a computer at rates up to 2400 words per minute!

How Does OCR Work?

There are two basic methods used for OCR: Matrix matching and feature extraction. Of the two ways to recognize characters, matrix matching is the simpler and more common.

Matrix Matching compares what the OCR scanner sees as a character with a library of character matrices or templates. When an image matches one of these prescribed matrices of dots within a given level of similarity, the computer labels that image as the corresponding ASCII character.
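A minimal sketch of matrix matching follows. The tiny 3-by-5 templates, the 0.85 similarity cutoff and the function names are assumptions for illustration; commercial template libraries are far larger and finer-grained.

```python
# Hypothetical 3x5 templates: '1' is ink, '0' is background.
TEMPLATES = {
    "I": ["010", "010", "010", "010", "010"],
    "L": ["100", "100", "100", "100", "111"],
    "T": ["111", "010", "010", "010", "010"],
}


def similarity(a, b):
    """Fraction of cells that agree between two equally sized bitmaps."""
    cells = [(x, y) for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b)]
    return sum(x == y for x, y in cells) / len(cells)


def match_character(glyph, templates=TEMPLATES, min_similarity=0.85):
    """Return the best-matching character, or None if nothing is close enough."""
    best_char, best_score = max(
        ((ch, similarity(glyph, tpl)) for ch, tpl in templates.items()),
        key=lambda pair: pair[1],
    )
    return best_char if best_score >= min_similarity else None


# A "T" with one stray dot in the bottom-right corner still matches.
noisy_t = ["111", "010", "010", "010", "011"]
print(match_character(noisy_t))  # prints: T

# A checkerboard pattern resembles no template closely enough.
print(match_character(["101", "010", "101", "010", "101"]))  # prints: None
```

The cutoff plays the role of the "given level of similarity": lowering it accepts noisier characters at the cost of more misreads.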

Feature Extraction is OCR without strict matching to prescribed templates. Also known as Intelligent Character Recognition (ICR), or Topological Feature Analysis, this method varies by how much "computer intelligence" is applied by the manufacturer. The computer looks for general features such as open areas, closed shapes, diagonal lines, line intersections, etc. This method is much more versatile than matrix matching. Matrix matching works best when the OCR encounters a limited repertoire of type styles, with little or no variation within each style. Where the characters are less predictable, feature extraction, or topological analysis, is superior.
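One such feature, the number of closed shapes in a character, can be sketched as follows. The function count_holes and the small sample bitmaps are assumptions for illustration; a real recognizer would combine many features rather than rely on one.

```python
def count_holes(glyph):
    """Count enclosed background regions in a glyph given as rows of '0'/'1'."""
    rows, cols = len(glyph) + 2, len(glyph[0]) + 2
    # Pad with a background border so all outside background is connected.
    grid = [[0] * cols]
    grid += [[0] + [int(c) for c in row] + [0] for row in glyph]
    grid += [[0] * cols]
    seen = set()

    def flood(start):
        """Mark every background cell reachable from `start` (4-connected)."""
        stack = [start]
        seen.add(start)
        while stack:
            r, c = stack.pop()
            for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                if (0 <= nr < rows and 0 <= nc < cols
                        and grid[nr][nc] == 0 and (nr, nc) not in seen):
                    seen.add((nr, nc))
                    stack.append((nr, nc))

    flood((0, 0))  # everything outside the character
    holes = 0
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] == 0 and (r, c) not in seen:
                holes += 1  # background not reachable from outside
                flood((r, c))
    return holes


print(count_holes(["100", "100", "111"]))                # 0, like an L
print(count_holes(["111", "101", "111"]))                # 1, like an O
print(count_holes(["111", "101", "111", "101", "111"]))  # 2, like an 8
```

Because the count does not depend on exact dot positions, the same feature holds across type styles in a way a fixed template does not.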

OCR Fonts

What is a font? A font is the term given to a set of characters, usually 0 - 9, A through Z, and a few special characters. Each character within a font will have a defined reproducible size and shape. For OCR, these are defined by ANSI, the American National Standards Institute.

OCR fonts, or characters, that can be read by the lower speed, lower cost systems we are discussing here require well defined character shapes that are very reproducible and designed to be both machine and human readable. These unique and well defined character sets allow for greater accuracy.

OCR Scanners

OCR reading devices fall into two fundamental categories: Text Input and Data Capture.

Text input devices are page readers or document scanners that scan entire documents or large portions of documents. The source data is entered with the intention of someone editing it during or after it is scanned. Text input devices have varying degrees of automation from hand fed to having automatic feeding, reading, sorting, and stacking capabilities.

Data Capture devices are designed to capture repetitive data and to perform formatting functions on the data as it is being entered. The data delivered from the scanner to the computer must be more accurate than with text input devices, because it is entered without the intention of being edited later.

Elements of a Successful OCR Application

The elements of a successful OCR installation include:
Proper Media
Forms Design
Data Integrity and Output Processing
OCR Reader

Reasons for Using OCR

There are a number of reasons for choosing OCR scanning over other methods of data entry. Some of the more significant include:
To Reduce Data Entry Errors
To Consolidate Data Entry
To Handle Peak Loads
Human Readable
Can Be Used with Many Printing Techniques
Scanning Corrections

When is OCR Preferred over Bar Code?

OCR is better suited for data entry in a controlled environment for any number of characters. One example is remittance processing, where data on utility bills or other turnaround documents needs to be entered into a system.

Some OCR scanlines may contain more than 40 characters and a variety of valuable information such as the date the bill is due, account number, amount owed, type of service, etc.

Bar code is best suited where the primary function is to identify parts or items in harsh environments or where the media is to be used over and over again and consists of relatively few characters. One example is identifying and tracking passenger luggage in the airline industry. Bar codes are very tolerant of rough handling and harsh environments, but require much more space on a label or document than OCR. Inch for inch, OCR can hold 6 times more information than a standard bar code.

1/13/2012