Mining the Deep Web

"Searching on the Internet today can be compared to dragging a net across the surface of the ocean", said Michael Bergman in a study back in 2000.

Nearly a decade later, some estimates show that up to 500 times more online information remains hidden below the shallow nets cast by the general search engines. There's a wealth of information that is deep and therefore missed, unless you know how to go after it.

What are these information resources?

  • Dynamic content: dynamic pages returned in response to a submitted query or accessed only through a form. Examples include the massive databases of the Securities and Exchange Commission and the Patent and Trademark Office.
  • Unlinked content: pages which are not linked to by other pages, which may prevent web crawling programs from accessing their content.
  • Contextual Web: pages with content varying by different access contexts (e.g., ranges of client IP addresses or previous navigation sequence).  This includes sites that will adjust content and even language based on your inferred location.
  • Volatile Web: pages with content varying with time.  This includes most news sources as well as millions of blogs.
  • Private Web: sites that require registration and login (password-protected resources).  This includes many directories and association databases, often with access further limited by technical means.
  • Limited access content: sites that limit access to their pages in a technical way, e.g., behind a proxy server, using the Robots Exclusion Standard, CAPTCHAs, or pragma: no-cache HTTP headers which prohibit search engines from browsing them and creating cached copies.
  • Scripted content: pages that are only accessible through links produced by JavaScript as well as content dynamically downloaded from Web servers via Flash or AJAX web applications.
  • Non-HTML/text content: textual content encoded in multimedia (image or video) files or specific file formats not handled by search engines.
  • Non-HTTP protocols: content served over other protocols, such as files available via FTP. For example, the 2000 Census data alone at the U.S. Census Bureau comes to roughly 200 GB.
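To make the "limited access" category in the list above a little more concrete, here is a minimal sketch of how a well-behaved crawler might consult the Robots Exclusion Standard before fetching a page. It uses Python's standard urllib.robotparser module. The robots.txt content, the example.org URLs, the "ExampleBot" user agent, and the allowed_paths helper are all made up for illustration and do not describe any real site or search engine.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /members/
Disallow: /search
"""


def allowed_paths(robots_txt, user_agent, urls):
    """Map each URL to True if the given robots.txt lets user_agent fetch it."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return {url: parser.can_fetch(user_agent, url) for url in urls}


# Hypothetical URLs: one public page, one members-only page, one query page.
urls = [
    "https://example.org/about.html",
    "https://example.org/members/directory.html",
    "https://example.org/search?q=patents",
]

result = allowed_paths(ROBOTS_TXT, "ExampleBot", urls)
for url, ok in result.items():
    print("crawl" if ok else "skip ", url)
```

Under these assumed rules, only the public page would be crawled. The members directory and the query page stay below the surface, even though nothing stops a human visitor from reaching them.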

General search engines such as Google face a challenge to their coverage posed by the increasingly dynamic content of the Deep Web.  Their strategy is essentially inverted: where earlier they could exploit the power of multiple links directing them to static, visible content, now they have in effect a single gateway to highly variable content.  Where previously they could become good at deriving answers from the data, now they must become good at asking questions to derive the data.*
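As a rough illustration of what "asking questions to derive the data" might look like in practice, the sketch below enumerates query URLs for a form-backed database by combining candidate values for each form field. The endpoint, the field names, the values, and the build_queries helper are hypothetical. This is one simple way to approach the idea, not a description of how any particular search engine does it.

```python
from itertools import product
from urllib.parse import urlencode


def build_queries(base_url, field_values):
    """Return one GET URL per combination of candidate form-field values."""
    names = sorted(field_values)  # fixed field order keeps the output stable
    combos = product(*(field_values[name] for name in names))
    return [base_url + "?" + urlencode(dict(zip(names, combo))) for combo in combos]


# Hypothetical search form with two fields and two candidate values each.
queries = build_queries(
    "https://example.org/search",
    {"form_type": ["10-K", "8-K"], "year": ["2008", "2009"]},
)

for q in queries:
    print(q)
```

Each generated URL stands for one "question" put to the database. The number of questions multiplies with every added field or value, which is why choosing good ones matters more than asking all of them.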

It's like the truism that the more you know, the more you can ask—a spiraling task for those who would seek information in general.

On the other hand, it is a harbinger of increasing opportunity for those whose aim is focused research and targeted information.



*Until the game evolves to a higher level of semantic exchange, but that's another post for another day.