What is character-set encoding?

A character-set encoding defines how the bytes of a document map to characters, so it tells the web browser how to display text, including characters specific to certain languages, regions or special purposes. In the early days of computing, the standard set known as ASCII contained only 128 distinct codes. Since then many other encodings have come into use, covering Western European, Eastern European, Asian and many other scripts. A newer standard called Unicode aims to reconcile and unify these many systems, but older encodings are still in use on many web pages.

In web data-mining we often come across pages whose characters don't match their declared encoding, which mix several encodings, or which declare no encoding at all. Consequences include garbled or wrongly accented characters, missing characters, or unknown characters replaced with the Unicode replacement character (U+FFFD), which is usually displayed as a question mark inside a diamond.
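
As a small illustration of how these symptoms arise, the Python sketch below decodes the same word with the wrong encoding in both directions. The sample word and variable names are chosen purely for illustration and do not come from any real page.

```python
# Illustrative sketch: what happens when bytes are decoded with the wrong encoding.
text = "café"

# Case 1: a UTF-8 page is read as if it were Latin-1.
# The two bytes that encode "é" in UTF-8 show up as two separate characters.
utf8_bytes = text.encode("utf-8")
garbled = utf8_bytes.decode("latin-1")

# Case 2: a Latin-1 page is read as if it were UTF-8.
# The single byte for "é" is not valid UTF-8, so with errors="replace"
# it becomes the Unicode replacement character U+FFFD.
latin1_bytes = text.encode("latin-1")
replaced = latin1_bytes.decode("utf-8", errors="replace")

print(garbled)         # cafÃ©
print(repr(replaced))  # 'caf\ufffd'
```

The first case produces the familiar "strangely accented" text, and the second produces the substitute character described above.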

At Aware Research we pay attention to these issues and do whatever is possible to detect and correct encoding errors. We provide all text results encoded in UTF-8, which can represent every Unicode character and is supported by all modern computers.
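
To give a rough idea of what converting to UTF-8 can involve, here is a minimal Python sketch of one common heuristic. It is not a description of Aware Research's actual process: the function name `to_utf8`, the `declared` parameter and the order of fallbacks are all assumptions made for this example. It tries strict UTF-8 first (text in other encodings rarely happens to be valid UTF-8), then the declared encoding if there is one, then Windows-1252, and finally Latin-1, which accepts any byte sequence.

```python
from typing import Optional


def to_utf8(raw: bytes, declared: Optional[str] = None) -> bytes:
    """Hypothetical helper: re-encode raw page bytes as UTF-8 using simple fallbacks."""
    # Strict UTF-8 is tried first, even if another encoding is declared,
    # because a wrong declaration such as Latin-1 would otherwise "succeed"
    # and silently garble the text.
    candidates = ["utf-8"]
    if declared:
        candidates.append(declared)
    candidates.append("cp1252")

    for name in candidates:
        try:
            return raw.decode(name).encode("utf-8")
        except (UnicodeDecodeError, LookupError):
            # UnicodeDecodeError: the bytes are not valid in this encoding.
            # LookupError: the declared encoding name is unknown to Python.
            continue

    # Last resort: Latin-1 maps every byte to a character, so it never fails,
    # although the result may not be what the author intended.
    return raw.decode("latin-1").encode("utf-8")


# A UTF-8 page wrongly declared as Latin-1 is passed through unchanged.
print(to_utf8("café".encode("utf-8"), declared="latin-1"))
# A Latin-1 page with no declaration is converted to UTF-8.
print(to_utf8(b"caf\xe9"))
```

A heuristic like this cannot always recover the intended text, particularly for pages that mix encodings, so real-world detection generally needs more evidence than a chain of fallbacks.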