The goal of data analysis is to highlight useful information, suggest conclusions, and support decision-making. Aware Research applies an advanced toolkit of statistical, linguistic, and structural techniques, depending on your needs.
Data cleaning means identifying incomplete, incorrect, irrelevant, or inconsistent parts of the data and correcting them by replacement, modification, or deletion. It generally does not include the "noise reduction" step of eliminating advertisements, administrative text, or repeating blocks by page segmentation, nor the data validation performed during extraction.
Duplicate detection and correction is included in the above, but crosses over into the linguistic or statistical realm when near-duplicates must be identified by phonetic or fuzzy matching.
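As a rough sketch of how fuzzy near-duplicate detection might work, the example below uses only the standard library's difflib; the 0.9 threshold and the sample records are illustrative assumptions, not values from any particular project.

```python
from difflib import SequenceMatcher

def is_near_duplicate(a: str, b: str, threshold: float = 0.9) -> bool:
    """Flag two strings as near-duplicates when their normalized
    similarity ratio meets the (assumed) threshold."""
    a_norm = " ".join(a.lower().split())
    b_norm = " ".join(b.lower().split())
    return SequenceMatcher(None, a_norm, b_norm).ratio() >= threshold

records = [
    "Acme Widgets, Inc.",
    "ACME Widgets Inc",       # near-duplicate of the first record
    "Globex Corporation",
]

# Keep the first occurrence of each near-duplicate group.
deduped = []
for rec in records:
    if not any(is_near_duplicate(rec, kept) for kept in deduped):
        deduped.append(rec)
```

In practice a production pipeline would add blocking or hashing so that every pair need not be compared, but the core idea of a tunable similarity threshold is the same.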
For many projects, this is all the analysis that is required before delivery.
However, for those who need a little more:
Statistical methods including
- Classification by Bayesian methods or by support vector machines.
- Clustering by k-means, nearest-neighbor, etc., with a variety of distance metrics.
- Dimensionality reduction by singular value decomposition or principal component analysis.
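To make one of these concrete, here is a minimal pure-Python sketch of k-means clustering with Euclidean distance; the seeding strategy (first k points), iteration count, and the 2-D sample data are all illustrative assumptions.

```python
import math

def kmeans(points, k, iters=20):
    """Plain k-means: assign each point to its nearest centroid,
    then recompute centroids, repeating for a fixed number of rounds.
    The first k input points seed the centroids (a simplifying assumption)."""
    centroids = list(points[:k])
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        for i, cluster in enumerate(clusters):
            if cluster:  # leave an empty cluster's centroid where it was
                centroids[i] = tuple(sum(c) / len(cluster) for c in zip(*cluster))
    return centroids, clusters

# Two well-separated groups of 2-D points (made-up illustration data).
points = [(0.1, 0.2), (0.0, 0.1), (0.2, 0.0),
          (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
centroids, clusters = kmeans(points, k=2)
```

Swapping in a different distance metric means replacing the `math.dist` call; that is the "variety of distance metrics" mentioned above.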
Linguistic methods including
- Tokenization, for second-order queries using words, analysis of word and n-gram frequencies, and co-occurrence
- Segmentation by paragraph and sentence, necessary for specifying words co-occurring within a sentence or within a certain window on the page
- Stemming/lemmatization, reducing words to simpler forms for broader matching
- Shallow parsing, e.g., extracting noun phrases for summarizing or matching content
- Named Entity Recognition, for names of people, places, organizations, events, etc.
- Miscellaneous methods including readability metrics, spell checking, text normalization, synonym generation...
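A toy sketch of the first two linguistic steps, tokenization and n-gram frequency counting, using only the standard library; the regex tokenizer and the sample sentence are deliberately simple stand-ins for what a real pipeline would use.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase word tokenizer -- a simple illustrative stand-in."""
    return re.findall(r"[a-z0-9']+", text.lower())

def ngrams(tokens, n):
    """Sliding window of n consecutive tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

text = "the quick brown fox jumps over the lazy dog the quick fox"
tokens = tokenize(text)
word_freq = Counter(tokens)            # word frequencies
bigram_freq = Counter(ngrams(tokens, 2))  # 2-gram frequencies
```

Frequencies like these feed directly into the co-occurrence and windowed-query analyses described above.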
These analytical tools, along with our software and hardware infrastructure, provide a great deal of capability, helping us to help you.