Updated 15 June 2007. A prospective client had something to hide
when she claimed no previous involvement in an industry rife with
fraud. This claim, made while she submitted a well-informed business
plan, rang false. Other clues about her integrity
worried the lawyer. He soon suspected that she was a dishonest
person. After the meeting, he consulted another partner, who in turn
delivered the puzzle to my e-mail inbox. My mission was to fit the
mismatched pieces of information together, either substantiating or
disproving the lawyer's skepticism.
Internet Archive to the Rescue
Wanting to emphasize the importance of
retaining knowledge of history, George Santayana wrote the words
made famous by the film, Rise and Fall of the Third Reich--"Those
who cannot remember the past are condemned to repeat it." Of course,
neither the Internet Archive nor the Information Age existed at the
time. If they had, perhaps he would have edited his philosophy to
state, "Those who cannot discover the past are condemned to repeat it."
Certainly in times when new information
amounts to about five exabytes a year, or the equivalent of "information
contained in half a million new libraries the size of the Library of
Congress print collections" (How
Much Information 2003?), it is perhaps fortunate that librarians
possess a knack for discovering information. It is also in our favor
that Brewster Kahle and Alexa Internet foresaw a need for an archive
of Web sites.
Internet Archive and the Wayback Machine
Founded in 1996, the Internet Archive
contains about 30 billion archived Web pages. While always open to
researchers, the collection did not become readily accessible until
the introduction of the Wayback Machine in 2001. The Wayback Machine
lets you find archived pages by their Web address. Enter a URL to
retrieve a dated listing of archived
versions. You can then display the archived document as well as any
archived pages linked from it.
The Internet Archive helped me successfully
respond to the concerns the lawyers had about the prospective
client. It contained evidence of a business relationship with a
company clearly in the suspect industry. Broadening the
investigation to include the newly discovered company led to
information about an active criminal investigation. Suddenly, the
pieces of the puzzle came together and spelled L-I-A-R.
Consider using the Internet Archive for any
research project that involves due diligence, that is, the careful
investigation of someone or something to satisfy an
obligation. In addition to people and company investigations, it can
assist in patent research for evidence of prior art, or copyright or
trademark research for evidence of infringement. It can also come in
handy when researching events in history, looking for copies of
older documents like superseded statutes or regulations,
or when seeking the ideals of a former political administration.
(Note: 25 October 2004. A special keyword search engine,
called Recall Search, facilitates some of these queries.
Unfortunately, it was removed from the site in mid-September.
Messages posted in the Internet Archive forum indicate that the
project plans to bring it back. Note: 15 June 2007. I think it's safe to
assume that Recall Search is not coming back. However, check out the
site for developments in searching archived audio (music), video
(movies) and text (books).)
Recall Search at the Internet Archive
But while the Internet Archive contains
information useful in investigative research, finding what you want
within the massive collection presents a challenge. If you know the
exact URL of the document, or if you want to examine the contents of
a specific Web site--as was the case in the scenario involving the
prospective client--then the Wayback
Machine will suffice. But searching the Internet Archive by keyword
was not an option until recently. (Note:
See the note in the previous paragraph.)
During September 2003, the project
introduced Recall Search, a beta version of a keyword search feature.
Recall makes about one-third of the archived collection, or 11 billion
Web pages, accessible by keyword. While it
further facilitates finding information in the Internet Archive, it
does not replace the Wayback Machine. Because of the limited size of
the keyword-indexed collection and the problems inherent in keyword
searching, due diligence researchers should use both finding tools.
Recall does not support Boolean operators.
Instead, enter one or more keywords (fewer is probably better) and,
if desired, limit the results by date.
Results appear with a graph that illustrates
the frequency of the search terms over time. It also provides clues
about their context. For example, a search for my name limited to
Web pages collected between January 2002 and May 2003 finds ties to
the concepts "school of law," "government resources," "research
site," "research librarian," "legal professionals" and "legal
research." The resulting graph further shows peaks at the beginning
of 2002 and in the spring of 2003.
Applying content-based relevancy ranking,
Recall also generates topics and categories. Little information
exists about how this feature works, and I have experienced mixed
results. But the idea is to limit results by selecting a topic or
category relevant to the issue.
Suppose you enter the keyword Microsoft.
The right side of the search results page suggests concepts for
narrowing the query. For example, it asks whether you mean
Microsoft Windows, Microsoft Internet Explorer, Microsoft Word,
and so on instead. Likewise, a search for turkey suggests wild
turkey, the country of Turkey, turkey hunting, roast turkey and
other interpretations.
While content-based relevancy ranking can be
a useful algorithm, it is far from perfect. Some of the topics and
categories it generates may not make sense. If the queries you run
do not produce satisfactory
results, consider another approach.
Pinpoint the specific sites you want to
investigate by first conducting the research on the Web. In the
prospective client example, an old issue of the newsletter of the
company under criminal investigation (Company A) mentioned the
prospective client's company (Company B). This clue led us to
Company A's Web site, where we found no further mention of Company B.
However, with the Web site address in hand, we reviewed almost every
archived page at the Internet Archive and found solid evidence of a
past relationship. Additional research, during which we tracked down
court records and spoke to one of the investigators, provided the
verification we needed to confront the prospective client.
Advanced Search Techniques
You can display all versions of a specific
page or Web site during a certain time period by modifying the URL.
Greg Notess first illustrated this strategy in his On The Net
column (see "The Wayback Machine: The Web's Archive," Online,
March/April 2002).
A request for all archived versions of a
page looks like this:
http://web.archive.org/web/*/http://www.domain.com
The asterisk is a wildcard that you can
narrow by placing a date in front of it. For example, to find all
versions from the year 2002, you would enter:
http://web.archive.org/web/2002*/http://www.domain.com
Or to find all versions from September 2002,
you would enter:
http://web.archive.org/web/200209*/http://www.domain.com
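If you build these addresses often, a few lines of Python can
assemble them for you. This is only a minimal sketch of the URL
pattern shown above; the function name wayback_listing_url and its
year and month parameters are my own inventions for illustration,
not part of any Internet Archive tool.

```python
# Prefix shared by the three example addresses in this article.
WAYBACK_PREFIX = "http://web.archive.org/web/"


def wayback_listing_url(page_url, year=None, month=None):
    """Return the address that lists archived versions of page_url.

    With no year, the bare asterisk requests every archived version.
    A year (and optionally a month) is placed in front of the
    asterisk to narrow the listing, as in the examples above.
    """
    if month is not None and year is None:
        raise ValueError("a month can only be given together with a year")

    date_prefix = ""
    if year is not None:
        date_prefix = f"{year:04d}"
        if month is not None:
            if not 1 <= month <= 12:
                raise ValueError("month must be between 1 and 12")
            date_prefix += f"{month:02d}"

    return f"{WAYBACK_PREFIX}{date_prefix}*/{page_url}"


if __name__ == "__main__":
    print(wayback_listing_url("http://www.domain.com"))
    print(wayback_listing_url("http://www.domain.com", year=2002))
    print(wayback_listing_url("http://www.domain.com", year=2002, month=9))
```

Run as a script, it prints the three example addresses given above,
which you can then paste into a browser.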
Sometimes you encounter problems when you
browse pages in the archive. For example, I often receive a "failed
connection" error message. This may be the result of busy Web
servers or a problem with the page. It may also occur if the live
Web site prohibits crawlers.
To find out if the latter issue is the problem, check the
site's robot exclusion file. A standard honored by most search
engines, the robot exclusion file resides in the root-level
directory. To find it, enter the main URL in your browser address
line followed by /robots.txt, like this:
http://www.domain.com/robots.txt
If the site blocks the Internet Archive's
crawler, it will contain two lines of text similar to the following:
User-agent: ia_archiver
Disallow: /
If it forbids all crawlers, the commands
should look like this:
User-agent: *
Disallow: /
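If you would rather not read the file by eye, Python's standard
library includes a parser for robot exclusion files. The sketch below
assumes you have already copied the contents of robots.txt into a
string; the helper names robots_url and blocks_crawler are
hypothetical, chosen only for this illustration, and the test address
http://www.domain.com/ is the same placeholder used above.

```python
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser


def robots_url(site_url):
    """Return the root-level robots.txt address for any page on a site."""
    return urljoin(site_url, "/robots.txt")


def blocks_crawler(robots_text, crawler="ia_archiver",
                   page_url="http://www.domain.com/"):
    """Return True if the robots.txt text forbids crawler from page_url.

    robots_text is the file's contents as a string. The default
    crawler name is the user-agent shown in the example above.
    """
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return not parser.can_fetch(crawler, page_url)


if __name__ == "__main__":
    archive_only = "User-agent: ia_archiver\nDisallow: /\n"
    everyone = "User-agent: *\nDisallow: /\n"

    print(robots_url("http://www.domain.com/some/page.html"))
    print(blocks_crawler(archive_only))               # Internet Archive's crawler
    print(blocks_crawler(archive_only, "googlebot"))  # some other crawler
    print(blocks_crawler(everyone, "googlebot"))
```

A result of True means the file tells that crawler to stay out. It
does not tell you whether the restriction was in place when a given
page was archived.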
It's common for Web sites to block crawlers,
including the Internet Archive, from indexing their copyrighted
images and other non-text files. If the Internet Archive blots out
images with gray boxes, then the Web site probably prevents it from
making the graphics available.
If the site does not appear to block the
Internet Archive, don't give up when you encounter a "failed
connection" message. Return to the Wayback
Machine and enter the Web page address. This strategy generates a
list of archived versions of the page, whereas Recall presents
specific matches to a query. One of the other dated copies of the
page may load without problems.
Conclusion
While the Internet Archive does not contain
a complete archive of the Web, it offers a significant collection
that due diligence researchers should not overlook. Tools like the
Wayback Machine and Recall Search provide points of access. However,
these utilities handle only simple queries. You can search by Web
page address or keyword. You cannot conduct Boolean searching or
limit a query by key information. Moreover, Recall Search limits
keyword access to one-third of the collection. Consequently, conduct
what research you can elsewhere first, using public Web search
engines and commercial sources. Then use the information you
discover to scour relevant sites in the Internet Archive.