
Buying guide

The market today offers a very wide range of data capture systems. Italian and foreign software houses offer vertical or general-purpose applications, each with its own characteristics and peculiarities that make it more or less attractive. Although the activity of these systems can usually be divided into document capture, recognition, correction and data output, not all systems work the same way, and not all deliver the same performance. Moreover, the price range is very wide and can add to the confusion. Let us therefore see how to navigate the features and capabilities of these systems in order to evaluate them and choose wisely.

"It's not all gold that glitters" - THE RECOGNITION RATE

The first question that comes naturally is how well a given system reads. The percentage of data correctly recognized by the recognition subsystem, the recognition rate, might at first sight seem a good discriminant. Unfortunately, a vendor claiming that its system reads "X" percent does not provide reliable information, or rather does not provide an absolute figure comparable with others. For example, if two vendors both claim that their optical reading system has a recognition rate of 99%, it may well be true that each reached that result on its own sample documents, but there is no guarantee that the two systems will deliver the same performance on your documents. Each vendor will have used a different sample for its estimate, so the two numbers may not be comparable, nor, of course, can they offer any guarantee for the future! The only way to use the recognition rate for comparison is to run a test on the same sample of documents, but even then there are further implications to keep in mind.

Imagine that, processing the same sample with two different systems, one achieves a recognition rate of 95% and the other 90%. At first sight the first system might seem better, but in reality this may not be so. We should also consider another measure, which we will call the false positive rate: the percentage of data read incorrectly but considered good. Whenever a piece of data is read, it is assigned a reading confidence that indicates how certain the system is of having read it correctly; based on a threshold on this value, the system determines whether the data requires validation by an operator. Suppose the first system, the one with a 95% recognition rate, assigns a high reading confidence to 100% of the data, so that it is not "aware" that 5% of the data is incorrect, while the second assigns a high confidence to only 90% of the data (the data that is actually correct), and thus always knows when 10% of the data is wrong. Clearly, in this situation the apparently lower-performing system would be preferable, because it is aware of when it reads badly!
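The interplay between recognition rate and false positive rate can be sketched numerically. The figures below reproduce the hypothetical 95%-vs-90% scenario above; the function and data are illustrative, not any real product's output.

```python
# Hypothetical comparison of two OCR systems: recognition rate alone
# is misleading without the false-positive rate (errors accepted as good).
def evaluate(reads, threshold):
    """reads: list of (is_correct, confidence) pairs."""
    recognition_rate = sum(ok for ok, _ in reads) / len(reads)
    # Errors whose confidence clears the threshold slip past the operator.
    false_positive_rate = sum(
        1 for ok, c in reads if not ok and c >= threshold
    ) / len(reads)
    return recognition_rate, false_positive_rate

# System A: 95% correct, but it assigns high confidence to everything,
# so its 5% of errors are never flagged for validation.
a = [(True, 0.99)] * 95 + [(False, 0.99)] * 5
# System B: 90% correct, but its errors get low confidence and are flagged.
b = [(True, 0.99)] * 90 + [(False, 0.30)] * 10

print(evaluate(a, 0.8))  # (0.95, 0.05) -> 5% silent errors
print(evaluate(b, 0.8))  # (0.9, 0.0)   -> every error goes to an operator
```

Despite the lower recognition rate, system B is the safer choice: its false positive rate is zero.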

Moreover, the recognition rate is usually computed on individual characters, while the entities we actually read are fields, i.e. sequences of characters. Consider a form with 10 fields of 20 characters each: a recognition rate of 95% means that, out of 200 characters, the system has read 190 correctly and failed on the remaining 10. If the 10 misread characters all fall in the same field, then 9 out of 10 fields require no correction, so the recognition rate on fields would be 90%. Conversely, if at least one unrecognized character falls in each field, all 10 fields need correction, and the recognition rate on fields would be 0%!
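The best-case and worst-case field rates for that example can be checked with a few lines of arithmetic (the form layout and error positions are the hypothetical ones from the paragraph above):

```python
def field_rate(error_positions, fields=10, chars_per_field=20):
    """Fraction of fields containing no misread character."""
    bad_fields = {pos // chars_per_field for pos in error_positions}
    return (fields - len(bad_fields)) / fields

# 200 characters at a 95% character rate -> 10 misread characters.
# Best case: all 10 errors fall in the same field.
print(field_rate(range(0, 10)))        # 0.9 - only one field to correct
# Worst case: one error lands in each of the 10 fields.
print(field_rate(range(0, 200, 20)))   # 0.0 - every field needs correction
```

The same 95% character rate thus yields anywhere between a 90% and a 0% field rate, which is why character-level figures say little about correction effort.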

"Sooner or later the chickens come home to roost" - VALIDATION AND CORRECTION DATA

A good data capture system must provide the ability to validate the data recognized by the OCR/ICR/CHR subsystem against lookup tables that verify their existence. For example, suppose you read a full address: street, city, zip code. The municipality and the province could be validated automatically against a dedicated database to make sure they have been recognized correctly, perhaps performing a cross-validation and even a self-correction that finds the most similar entry in the database when the value read does not match exactly (for example "70124 TOBIHO TD" -> "10124 TURIN TO"). In this way, during the correction step the operator can simply confirm the correction without having to type anything.
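A minimal sketch of such cross-validation, using string similarity against a tiny stand-in table (a real system would query the full national zip/municipality/province database, and would likely use a more robust distance metric):

```python
import difflib

# Tiny stand-in for a real postal database (assumption: a production
# system would load the complete national table).
postal_db = {
    ("10124", "TURIN", "TO"),
    ("20121", "MILAN", "MI"),
    ("00184", "ROME", "RM"),
}

def autocorrect(zip_code, city, prov):
    """Return the database row most similar to the OCR output."""
    read = zip_code + city + prov
    def score(row):
        return difflib.SequenceMatcher(None, read, "".join(row)).ratio()
    return max(postal_db, key=score)

# The OCR misread "10124 TURIN TO" as "70124 TOBIHO TD"; cross-validation
# against the table still recovers the intended row.
print(autocorrect("70124", "TOBIHO", "TD"))  # ('10124', 'TURIN', 'TO')
```

The operator then only has to confirm the proposed correction rather than retype the field.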

Similarly, it should be possible to perform automatic formal validation of dates, amounts, codes and other data with a specific, known format. Note: you should be able to rely not only on common pre-set checking routines (for example, tax code, VAT number, etc.), but also to implement your own custom routines.
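As a sketch of what pluggable formal validators might look like (the function names and registry are illustrative, not any specific product's API):

```python
import re
from datetime import datetime

def valid_date(value):
    """Formal check: the value must parse as a real dd/mm/yyyy date."""
    try:
        datetime.strptime(value, "%d/%m/%Y")
        return True
    except ValueError:
        return False

def valid_vat(value):
    # Italian VAT numbers are 11 digits; this checks only the format.
    # A production routine would also verify the check digit.
    return re.fullmatch(r"\d{11}", value) is not None

# Custom routines plug in alongside the pre-set ones.
validators = {"date": valid_date, "vat": valid_vat}

print(validators["date"]("31/02/2023"))  # False - not a real date
print(validators["vat"]("12345678901"))  # True
```

A field that fails its formal validator can be routed straight to the operator, regardless of the engine's reading confidence.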

A good solution should therefore offer a streamlined user interface for easily comparing what was recognized with what is actually written on the document: the better the ergonomics and functionality of the interface, the less man-time is needed for the validation and correction phase.


A data capture system can become more attractive than another after a careful assessment of its particular features, obviously in relation to one's own needs.

For a multinational company, for example, a system that has its user interface and documentation in English as well as in Italian could be interesting, as it would allow the adoption of the same technology across the whole company.

The use of an ICR recognition engine trained specifically for the Italian style of handwriting should be a prerequisite if you have to process forms filled in by hand mainly in Italy: it is well known that the Italian style of writing differs noticeably from the overseas one (think, for example, of the digits 1, 4, 7, etc.), and recognition performance can be heavily penalized by not using the right tool.

A service center should be able to rely on a system that keeps the productivity of individual operators under control and automatically compiles statistics of any kind on any basis: for example, for a certain type of form you should know the quantity processed on a given day, the percentage of recognized/suspect/corrected data, the man-time spent on corrections, and so on.

To be truly effective on forms not specifically designed for data capture, a system should offer special features, such as the ability to align the form even without preprinted registration marks and the ability to remove preprinted black grid lines.

Conversely, when dealing with documents designed for data capture, with colored grid lines filtered out by the scanner, it can be very useful to display the reconstructed form, with the grid restored, during the correction phase, to help the operator navigate between fields.

If you have to handle documents containing sensitive data, where privacy is an essential requirement, it may be imperative that the data capture system keeps the images encrypted and does not allow the operators performing the correction to view the entire form, but only the area limited to the field to be checked. Telecommuting has long been much discussed, and the validation/correction of data, being similar to data entry, may be one of the tasks most easily outsourced, provided the data capture system has features that allow it to be performed remotely, perhaps over the internet, naturally with all the required guarantees of security and robustness.

"Who acts as if it's three" - POSSIBILITIES FOR CUSTOMIZATION

Unfortunately, no software comes with everything you want already built in! It is important that a good data capture system offers many customization options, so that it can be adapted to your needs with the least possible effort.

A first level of customization can usually take place "visually", by modifying a number of settings and parameters through the user interface.

A deeper level of customization, however, can only be achieved through scripting languages or APIs (Application Programming Interfaces), which allow you to change the default behavior of the application and extend its functionality.

For example, if you need a non-standard output with a particular record layout, or want to use custom logic to decide which data is acceptable and which needs fixing, perhaps by implementing a specific self-validation routine, then it is essential to be able to intervene without requiring the software house that developed the product, so as to remain independent.
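The idea of overriding a default behavior through a scripting hook can be sketched as follows; the hook registry and function names are invented for illustration and do not correspond to any particular product's API:

```python
# Default export routine shipped with the hypothetical product:
# semicolon-separated values.
def default_export(record):
    return ";".join(record.values())

hooks = {"export": default_export}

def export(record):
    return hooks["export"](record)  # the registered hook decides the layout

# A user-supplied routine replaces the default record layout without
# touching the product's own code.
hooks["export"] = lambda r: "|".join(f"{k}={v}" for k, v in r.items())
print(export({"name": "ROSSI", "zip": "10124"}))  # name=ROSSI|zip=10124
```

Whether via a hook table like this, an embedded scripting language, or a compiled API, the point is the same: the customer can change the output format independently of the vendor.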

"Appetite comes with eating" - CALABILITY AND MODULARITY

Experience teaches us that needs can change over time, even in a short while and even considerably:

  • Today you only need to read structured forms, but tomorrow you may also need to read unstructured documents.
  • Today you work with documents that only require reading check boxes and bar codes, but tomorrow you may need to read printed or handwritten text.
  • Today a single recognition server may be sufficient for the volume of documents to be processed, but tomorrow it may no longer be.

To protect your investment, it is therefore essential to choose a product that can potentially meet future needs, but that is also modular and scalable, i.e. one that does not force you to buy in advance capabilities you do not need immediately but might still need in the future.

"Who spends more spend less" - LICENSING POLICY

Data capture solutions, as with most software, are usually sold in the form of a "license", whose duration can be "perpetual", meaning you pay once for software you can use forever (subject to optional maintenance and upgrade contracts), or "annual"/"monthly", meaning you pay a periodic fee to use the software, as a kind of rent.

However, in the specific case of data capture systems, it is easy to find solutions deliberately limited in performance (speed of the recognition engines) or in processable volumes (number of documents that can be processed per day/month/year).

Indeed, some vendors offer solutions with the speed of the recognition engines capped at a certain number of characters per second (CPS), usually to reduce the impact of the cost of the recognition engine (purchased from third parties) on the cost of the whole solution. This can significantly reduce the cost of an entry-level solution, but it can be penalizing for professional solutions where throughput matters. For example, a solution based on an ICR engine limited to 10 CPS will always deliver the same modest recognition speed even when run on very high-performance machines, where an unlimited solution could easily reach 300 CPS, and therefore cannot benefit from upgrades and evolution of the hardware it runs on.
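A back-of-the-envelope calculation shows what that cap costs in practice. The 10 CPS and 300 CPS figures come from the example above; the batch size (10,000 forms of 200 characters each) is a hypothetical workload chosen for illustration:

```python
# Effect of a CPS-capped engine on a hypothetical batch:
# 10,000 forms x 200 characters = 2,000,000 characters to recognize.
chars_per_form = 200
forms = 10_000

def hours(cps):
    """Wall-clock hours to recognize the whole batch at a given speed."""
    return chars_per_form * forms / cps / 3600

print(round(hours(10), 1))   # 55.6 hours at the capped 10 CPS
print(round(hours(300), 1))  # 1.9 hours with the unlimited engine
```

For a service bureau, the difference between roughly two hours and more than two working days per batch dwarfs the saving on the license.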

Other vendors offer solutions limited in the volume of processable documents, on an absolute basis (e.g. 100,000 forms) or per period (e.g. 10,000 forms/year). While it may be attractive to qualify for a reduced price when you need to do one-off work or process small quantities of forms, you must also consider how much it could cost to handle an unexpected, even temporary, increase in volume. Other licenses that require payment of a fee for each page processed (pay per use) can become disadvantageous for service bureau work, given the impossibility of achieving economies of scale.

One last thing to consider, but not least, is the licensing scheme for acquisition and correction stations: the most advanced data capture systems in fact allow the use of multiple acquisition and correction stations. Some solutions require payment of a royalty for each additional station, while others allow an unlimited number of workstations at no extra cost. The latter option proves very useful when you need to manage processing peaks, since you can freely scale the system architecture according to the hardware and human resources available and the contingent need.

"Walking with leaden feet" - THE SUPPLIER

The vendor, that is, the provider of the data capture system, should, perhaps needless to say, also be evaluated with great care: a product that at first glance seems satisfactory but is poorly supported may soon make you regret the purchase!

Specialized suppliers, who have data capture as their core business, should certainly be preferred, as they give the greatest assurances on the maintenance and development of their products.

References can help you verify that the supplier addresses your market segment, and also confirm its reliability in terms of both product quality and after-sales support.

If you choose a foreign supplier without an office in Italy, bear in mind the problems that might arise from language differences, time zones and travel costs should you require on-site interventions.