Four factors determine how good your OCR results will be:
Quality of the original text
Type of scanner
Speed (power) of the computer
Quality of the software
The process of optical character recognition is much like a person reading. Like a person's eye, the scanner analyzes light reflected from a page to create a pictorial representation of what is black (the ink) and what is not black (the paper). This picture of the page is stored in the computer where the OCR software tries to chop the black parts up into individual letters and then guess what each one is.
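The two steps described above, separating ink from paper and then chopping the ink into letters, can be sketched in a few lines of Python. This is only a toy illustration, not how any particular OCR product works: the function names, the threshold of 128, and the tiny 3-row "scan" are all assumptions made for the example, and the final step of guessing what each letter is has been left out.

```python
def binarize(gray, threshold=128):
    """Turn grayscale pixels (0 = black, 255 = white) into 1 for ink, 0 for paper.

    The threshold value is an assumption chosen for this example.
    """
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def split_into_glyphs(bitmap):
    """Return (start, end) column ranges that contain ink.

    A run of inked columns is treated as one character; a completely
    blank column ends the run. The end index is exclusive.
    """
    width = len(bitmap[0]) if bitmap else 0
    spans = []
    start = None
    for x in range(width):
        has_ink = any(row[x] for row in bitmap)
        if has_ink and start is None:
            start = x                    # first inked column of a new glyph
        elif not has_ink and start is not None:
            spans.append((start, x))     # blank column closes the glyph
            start = None
    if start is not None:                # glyph touching the right edge
        spans.append((start, width))
    return spans


# Hypothetical 3-row scan containing two separate marks.
page = [
    [255,   0,   0, 255, 255,   0, 255],
    [255,   0, 255, 255, 255,   0, 255],
    [255,   0,   0, 255, 255,   0, 255],
]
bitmap = binarize(page)
glyph_spans = split_into_glyphs(bitmap)
print(glyph_spans)  # [(1, 3), (5, 6)]
```

Note that this simple splitter depends on at least one blank column between letters. Two characters that touch would come out as a single span.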
Generally, anything difficult for a person to read will be impossible for an OCR system to convert. This includes smudged text, small type, strange fonts, characters that are too close together, and typewritten characters created with an old ribbon or dirty keys.
The bottom line to OCR efficiency: always start with high-quality printed material.
Given good clean text, the next critical element in the system is the resolution of the scanner. Typical scanner resolutions range from 75 to 450 dots per inch (DPI). The higher the resolution of the scanner, the higher the likelihood of accurate recognition by the software. Although some companies claim to be able to accurately recognize text at 200 DPI, 300 DPI is probably the lowest practical resolution.
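To see what those resolution figures mean in practice, the short Python sketch below works out how many dots a scanner records for one page and how many dots tall a single character is. The page size (8.5 x 11 inches), the 10-point type, and the one-bit-per-dot black-and-white image are assumptions picked for illustration, and the function names are made up for this example.

```python
def scan_size(width_in, height_in, dpi, bits_per_pixel=1):
    """Return (columns, rows, bytes) for an uncompressed scan of one page."""
    cols = round(width_in * dpi)
    rows = round(height_in * dpi)
    total_bits = cols * rows * bits_per_pixel
    total_bytes = (total_bits + 7) // 8   # round up to whole bytes
    return cols, rows, total_bytes


def glyph_height_px(point_size, dpi):
    """Height in dots of type at the given size (1 point = 1/72 inch)."""
    return point_size / 72 * dpi


# Assumed page: 8.5 x 11 inches, scanned in black and white.
for dpi in (200, 300, 450):
    cols, rows, size = scan_size(8.5, 11, dpi)
    height = glyph_height_px(10, dpi)     # assumed 10-point type
    print(f"{dpi} DPI: {cols} x {rows} dots, {size:,} bytes, "
          f"10-point type is about {height:.0f} dots tall")
```

At 300 DPI the assumed page comes to 2,550 by 3,300 dots, or roughly 1 MB uncompressed. The same character has about half again as many dots of height at 300 DPI as at 200 DPI, which gives the software more detail to work with.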
Another factor that influences the suitability of a scanner for OCR is its method of feeding. The two standard methods are flatbed and roller-fed.
Flatbed scanners are like photocopiers in that the artwork to be scanned is placed on a glass plate, covered by a lid, and then passed over by a light.
Roller-fed scanners are like a typical fax machine, in that artwork is scanned as it is fed through rollers.
Flatbed scanners are well-suited for bound, oversized, or especially small pages. Roller-fed scanners are good for bulk scanning of material with a consistent format. A flatbed scanner with an optional document feeder offers the best of both worlds.
The power of the computer used in an OCR system doesn't affect the accuracy of conversion, only how long it takes. Nevertheless, we recommend the computer have at least an 80386 processor running at 25 MHz. Also, since most OCR programs require large amounts of memory, having 4 MB of RAM or more can make life a lot easier.
There are dozens of OCR programs on the market. Most of these are simply software programs that you load into a computer and run, though some high-end programs work with a board that plugs into the computer. These hardware and software combinations are usually faster, more accurate, and more expensive.
There are a lot of options available when buying OCR programs. Some of the more interesting ones can do the following things:
Read both monospaced and proportionally spaced fonts
Read dot-matrix output
Learn new fonts
Mix text and graphics
Define frames of text to read
Retain character formatting attributes
Retain columns (tables)
Save in various word processing formats
Incorporate spell checking into the conversion process
Be aware, however, that character recognition systems rarely achieve more than 95% accuracy on anything but the cleanest of text. Even at 99% accuracy, there could be 20 mistakes on a typical 2,000-character page. So, budget time for spell checking and proofreading.
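The arithmetic behind those figures is easy to repeat for your own job. The Python sketch below is just a rough estimator: the function name is made up for this example, and the 300-page document is a hypothetical case, not a figure from any real project.

```python
def expected_errors(chars, accuracy):
    """Estimate the number of misread characters.

    chars    -- number of characters converted
    accuracy -- recognition accuracy as a fraction, e.g. 0.99 for 99%
    """
    return round(chars * (1.0 - accuracy))


CHARS_PER_PAGE = 2000   # the "typical" page used in the text

print(expected_errors(CHARS_PER_PAGE, 0.99))   # 20 mistakes per page
print(expected_errors(CHARS_PER_PAGE, 0.95))   # 100 mistakes per page

# Hypothetical 300-page document at 99% accuracy.
pages = 300
print(expected_errors(CHARS_PER_PAGE * pages, 0.99))   # 6000 mistakes
```

Even a small change in accuracy makes a large difference to the proofreading load, so it is worth running a few sample pages through the system before committing to a large job.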
The number of possible combinations of scanners, computers, and software packages is almost unlimited. If you have printed pages numbering in the hundreds, OCR may work for you. If you have massive amounts of printed text, you should consider hiring a data entry service to re-key it.