We extract
data from websites (PDF or HTML files) in a pre-specified
format using W4F technology and Java, and
produce the output in XML format. We have been doing this
as part of a production process and have completed
20 projects. These projects delivered what we call Shopkeeper
Units, or SKUs.
Extraction projects can be divided
into three types: Type A, Type B, and Hand Built extractors.
- "Type A" extractors
look at each individual data source page, typically HTML,
and use client-defined rules to extract
the proper data. These extractors generate smaller volumes
of data, handle complex data sources, and are
more sensitive to changes in the data source.
- "Type B" (programming-logic-driven)
extractors read data from a table
that defines all the possible
permutations of data to be generated. The extractors then
iterate through the permutations, creating an XML
entry for each SKU. These extractors tend to generate
large volumes of data.
- "Hand Built" extractors
are just that: the extraction is done by hand. Here the
client provides the data source
containing the data to be extracted
and a list of rules.
We then set up a process in which the source documents are
processed by hand and the extracted data is entered into
a database for delivery to the client.
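
The Type B approach described above can be sketched in Java roughly as follows. This is only an illustration, not our production code: it assumes the table lists the allowed values for each attribute and that the permutations are the Cartesian product of those values. The class name, method names, attribute names ("model", "color") and the <skus>/<sku> element names are all hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch of a "Type B" extractor; all names here are hypothetical.
class TypeBExtractorSketch {

    // Expands a table of attribute -> allowed values into every permutation
    // (the Cartesian product), keeping attributes in insertion order.
    static List<Map<String, String>> permutations(Map<String, List<String>> table) {
        List<Map<String, String>> result = new ArrayList<>();
        result.add(new LinkedHashMap<>());
        for (Map.Entry<String, List<String>> column : table.entrySet()) {
            List<Map<String, String>> next = new ArrayList<>();
            for (Map<String, String> partial : result) {
                for (String value : column.getValue()) {
                    Map<String, String> extended = new LinkedHashMap<>(partial);
                    extended.put(column.getKey(), value);
                    next.add(extended);
                }
            }
            result = next;
        }
        return result;
    }

    // Escapes the characters that are not allowed in XML element text.
    static String escape(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Writes one <sku> entry per permutation. Attribute names are assumed
    // to be valid XML element names.
    static String toXml(List<Map<String, String>> skus) {
        StringBuilder xml = new StringBuilder("<skus>\n");
        for (Map<String, String> sku : skus) {
            xml.append("  <sku>\n");
            for (Map.Entry<String, String> field : sku.entrySet()) {
                xml.append("    <").append(field.getKey()).append('>')
                   .append(escape(field.getValue()))
                   .append("</").append(field.getKey()).append(">\n");
            }
            xml.append("  </sku>\n");
        }
        return xml.append("</skus>\n").toString();
    }

    public static void main(String[] args) {
        // Hypothetical permutation table: 2 models x 3 colors = 6 SKUs.
        Map<String, List<String>> table = new LinkedHashMap<>();
        table.put("model", List.of("Basic", "Pro"));
        table.put("color", List.of("red", "blue", "green"));
        System.out.print(toXml(permutations(table)));
    }
}
```

Because the number of entries is the product of the number of values per attribute, even a small table yields many SKUs, which is why these extractors tend to generate large volumes of data.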
We also extract data from websites for Exactone Inc.
by writing configuration files in a proprietary language
called IQL and submitting them to the Engine
hosted at the client site. The Engine, built on proprietary
Java-based technology,
aggregates the data and generates an online catalogue on
the Net for use by the end customers. Apart from
writing
the configuration files, we also maintain them to handle
broken or missing links and similar problems, and ensure
that the catalogue data
is available 24 hours a day, 365 days a year to the end
users.