- BCR 2D
- MICR CMC7
- MICR E13B
- Form Identification
- Image Enhancement
- TWAIN and ISIS Scanning
- Black Border Removal
- Lines Removal
- Dynamic Thresholding
- Layout Analysis
- Quality Control
- File Format Conversion
- Book Curvature Correction
- Keystone Correction
The technology that we call Free-Form represents the new frontier of data capture from free-structure documents.
Technologies for data capture from structured forms have been available
for several years and have reached a very high degree of maturity,
laying the groundwork for new challenges: data capture from free-structure documents is one of them.
If a structured document is any type of form in which
the positions of the data to be extracted are precise and known in advance,
an unstructured document is instead a document that still contains
very precise data, but whose position and layout are not known
a priori and can vary greatly from one document to another of the same type.
The most classic example of an unstructured document, and one that is very easy
to come across on a daily basis, is the invoice: although
we know a priori that each invoice contains the business name of the supplier, the date, the
progressive number, the taxable amount, the VAT and the total,
we cannot know in advance where these data are located.
In fact, their position is not standardized but is left to the free choice of each supplier,
who can use fonts, graphic elements, colors and shading as they see fit.
One possible strategy for dealing with these types of documents is to reduce them, where possible, to the case of homogeneous structured documents.
For example, staying with invoices, you might create a specific template to associate with the invoices of each vendor,
so that once the supplier has been identified, the invoice can be processed in the appropriate way.
This approach works well when the number of classes is not high
and when classification can be done accurately,
whether performed directly by software or manually by an operator.
We must therefore take care to prepare the different templates
to be used and be certain that only the documents related to them are processed with them.
For other types of unstructured documents, for example curricula vitae, this strategy is not applicable.
The approach used to solve this problem starts from a logical definition of the data rather than a spatial one.
In practice, the data to be read are defined and then identified by a series of specific attributes,
such as, for example, keywords next to them,
expected format, relative position, presence or absence of graphical elements,
cross-validation criteria, and so on.
In the case of the VAT number on an invoice, for example, the system will be able
to recognize it, and then obtain its value, if it is instructed to find a
sequence of 11 numeric characters (or 2 letters followed by 11 numeric characters),
near (above, below, to the right or to the left of) the word "VAT",
perhaps limited to a certain area of the document (for example the top half of the image),
verifying the checksum and, if one is available, checking for its presence in a database of suppliers.
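The rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the product's implementation: the `Word` class, the normalized coordinates, the `max_dist` threshold and the `find_vat` function are all hypothetical, and the checksum is assumed to be the Luhn algorithm, which fits the 11-digit format but is not named in the text.

```python
import re
from dataclasses import dataclass

@dataclass
class Word:
    """Hypothetical OCR word: text plus center position, normalized to 0..1."""
    text: str
    x: float  # 0 = left edge, 1 = right edge
    y: float  # 0 = top edge, 1 = bottom edge

# 11 digits, optionally preceded by 2 letters (e.g. a country prefix).
VAT_PATTERN = re.compile(r"^(?:[A-Z]{2})?(\d{11})$")

def luhn_ok(digits: str) -> bool:
    """Assumed checksum: Luhn, doubling every second digit from the right."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def find_vat(words, max_dist=0.25, known_suppliers=None):
    """Return the VAT number closest to a "VAT" keyword, or None."""
    anchors = [w for w in words if w.text.upper().strip(":.") == "VAT"]
    best = None
    for w in words:
        m = VAT_PATTERN.match(w.text.upper())
        if not m or w.y > 0.5:          # format check, top half only
            continue
        digits = m.group(1)
        if not luhn_ok(digits):         # checksum check
            continue
        if known_suppliers is not None and digits not in known_suppliers:
            continue                    # optional supplier database check
        for a in anchors:               # proximity to the keyword, any direction
            dist = ((w.x - a.x) ** 2 + (w.y - a.y) ** 2) ** 0.5
            if dist <= max_dist and (best is None or dist < best[0]):
                best = (dist, digits)
    return best[1] if best else None

page = [
    Word("VAT", 0.10, 0.10),
    Word("12345678904", 0.20, 0.10),    # right format, wrong checksum
    Word("IT12345678903", 0.25, 0.10),  # valid candidate
    Word("12345678903", 0.30, 0.90),    # valid, but in the bottom half
]
print(find_vat(page))
```

Each condition is an independent filter, so a rule can be tightened or relaxed (for example by dropping the area restriction) without touching the others.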
In practice, you instruct the software to "think" as humans do:
in fact, when we look for the document total on an invoice
we are naturally inclined to look at the bottom right of the sheet,
perhaps focusing on a particularly prominent or highlighted box,
and we look for confirmation in the words "TOTAL DOCUMENT" or "INVOICE AMOUNT" or "TOT. INVOICE".
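As a rough sketch of this reasoning, the following Python snippet scores amount candidates by how close they are to the bottom right of the page and by whether one of the labels quoted above sits on the same row to their left. The `Token` class, the amount pattern, the `row_tol` tolerance and the scoring weights are assumptions made for the example, not part of the product.

```python
import re
from dataclasses import dataclass

@dataclass
class Token:
    """Hypothetical OCR text element with a normalized center position."""
    text: str
    x: float  # 0 = left, 1 = right
    y: float  # 0 = top, 1 = bottom

TOTAL_LABELS = ("TOTAL DOCUMENT", "INVOICE AMOUNT", "TOT. INVOICE")
# Assumed amount format: optional thousands groups and two decimals.
AMOUNT = re.compile(r"^\d{1,3}(?:[.,]\d{3})*[.,]\d{2}$")

def pick_total(tokens, row_tol=0.03):
    """Return the text of the most plausible total amount, or None."""
    labels = [t for t in tokens if t.text.upper() in TOTAL_LABELS]
    best, best_score = None, float("-inf")
    for t in tokens:
        if not AMOUNT.match(t.text):
            continue
        # Prefer candidates near the bottom right of the sheet.
        score = (t.x + t.y) / 2
        # Strong bonus when a known label is on the same row, to the left.
        if any(abs(l.y - t.y) <= row_tol and l.x < t.x for l in labels):
            score += 1.0
        if score > best_score:
            best, best_score = t, score
    return best.text if best else None

invoice = [
    Token("Taxable", 0.60, 0.80),
    Token("1.000,00", 0.90, 0.80),
    Token("TOT. INVOICE", 0.60, 0.70),
    Token("1.220,00", 0.90, 0.70),
    Token("220,00", 0.90, 0.85),
]
print(pick_total(invoice))
```

Position alone would favor the lowest amount on the page; the label acts as the confirmation that tips the choice toward the correct value.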
A system for processing unstructured documents acts in the same way:
it relies on the information we provide,
in the form of suitably configured rules, which must therefore be defined precisely and exhaustively.
Example of unstructured document recognition (Invoices): notice how the same "date" field has been identified in completely different positions.
The basis of these features is the use of optical character recognition (OCR)
on the entire document together with a robust layout analysis algorithm:
the combined use of these two tools makes it possible to identify blocks of text, vertical and horizontal lines, and text elements with their confidence values,
and thus to verify whether the logical conditions imposed on the data being searched for are satisfied on the page.
To make the processing of unstructured documents even more accurate, it is also possible
to combine the two strategies described above:
if the system is able to associate the document to be processed with a known template,
it is treated as a structured document; otherwise it is treated as an unstructured document and is processed all the same.
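The combined strategy amounts to a simple dispatch, sketched below in Python. The names are hypothetical and the template match is reduced to a keyword lookup in the OCR text purely for illustration; a real classifier would be more elaborate.

```python
def process_document(ocr_text, templates, free_form):
    """Use a known template when one matches, else fall back to free-form.

    templates: dict mapping a supplier keyword to a template-based extractor
    free_form: rule-based extractor used when no template matches
    (both are hypothetical callables taking the OCR text).
    """
    lowered = ocr_text.lower()
    for keyword, extractor in templates.items():
        if keyword.lower() in lowered:
            return "structured", extractor(ocr_text)
    return "free-form", free_form(ocr_text)

templates = {"ACME S.p.A.": lambda text: {"supplier": "ACME"}}
free_form = lambda text: {"supplier": None}

print(process_document("Invoice ACME S.p.A. n. 12", templates, free_form))
print(process_document("Invoice from another vendor", templates, free_form))
```

Either way the document yields a result, which is the point of the fallback: an unknown supplier lowers the precision of the processing but does not stop it.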
FreeForm data capture thus makes it possible to extract data from any type of document.
Our products that implement the FreeForm technology
For more information on the FreeForm technology and on our solutions that implement it, you can send us an e-mail at firstname.lastname@example.org or fill in the form below.