Google Secrets: THE INVISIBLE WEB

No matter how good you are at using Web search engines and general
directories, there are valuable resources on the Web that search engines
will not find for you. You can get to most of them if you know the URL,
but a search engine will probably not lead you there. These resources,
often referred to as the “Invisible Web,” cover a variety of content, most
importantly databases of articles, data, statistics, and government documents.
The “invisible” refers to “invisible to search engines.” There is nothing
mysterious or mystical involved.
The Invisible Web is important to know about because it contains a lot of
tremendously useful information, and it is large. Various estimates put the size
of the Invisible Web at between two and five hundred times the content of the
visible Web. Before that number sinks in and alarms you, keep in mind the following:
1. There is a lot of very important material contained in the Invisible Web.
2. For the information that is there that you are likely to have a need for,
and the right to access, there are ways of finding out about it and getting
to it.
3. In terms of volume, most of the material is meaningless
except to those who already know about it, or to the producer’s immediate
relatives. Much of the material that can’t be found is probably not
worth finding.
To adequately understand what this is all about, one must know why some
content is invisible. Note the use of the word “content” instead of the word
“sites.” The main page of an Invisible Web site is usually easy to find and is covered
by search engines. It is the rest of the site (Web pages and other content) that
may be invisible. Search engines do not index certain Web content mainly for
the following reasons:
1. The search engine does not know about the page. No one has submitted the
URL to the search engine and no pages currently covered by the search
engine have linked to it. (This falls in the category, “Hardly anyone cares
about this page, you probably don’t need to either.”)
2. The search engines have decided not to index the content because it is
too deep in the site (and probably less useful), it is a page that changes
so frequently that indexing the content would be somewhat meaningless
(as, for example, in the case of some news pages), or the page is generated
dynamically and likewise is not amenable to indexing. (Think in terms
of “Even if you searched and found the page, the content you searched
for would probably be gone.”)
3. The search engine is asked not to index the content, by the presence of a
robots.txt file on the site that asks engines not to index the site, or specific
pages, or particular parts of the site. (A lot of this content could be
placed in the “It’s nobody else’s business” category.)
4. The search engine does not have or does not utilize a technology that
would be required to index non-HTML content. This applies to files such
as images and audio files. Until 2001, this category also included file types
such as PDF (Portable Document Format), Excel, and Word files, among
others, which the major search engines began to index in 2001 and 2002.
Because of this increased coverage, the Invisible Web may be shrinking
in proportion to the size of the total Web.
5. The search engine cannot get to the pages to index them because it
encounters a request for a password or the site has a search box that
must be filled out in order to get to the content.
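To make reason 3 concrete, here is a minimal sketch of how a well-behaved crawler might consult a robots.txt file before indexing a page. It uses Python's standard urllib.robotparser module. The robots.txt content, the example.com URLs, and the "ExampleBot" agent name are all invented for illustration; real search engines apply their own, more elaborate rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for an imaginary site (an assumption,
# not taken from any real server).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /reports/draft.html
"""

def check_urls(robots_text, urls, agent="ExampleBot"):
    """Return {url: True/False}, True meaning the crawler may fetch the URL."""
    parser = RobotFileParser()
    # parse() accepts the file's lines directly, so no network access is needed.
    parser.parse(robots_text.splitlines())
    return {url: parser.can_fetch(agent, url) for url in urls}

results = check_urls(ROBOTS_TXT, [
    "http://example.com/index.html",
    "http://example.com/private/memo.html",
    "http://example.com/reports/draft.html",
])

for url, allowed in results.items():
    print("index" if allowed else "skip ", url)
```

In this sketch the main page stays indexable while the two disallowed paths are skipped, which is how part of a site can end up invisible even though its home page is easy to find.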
It is the last part of the last category that holds the most interest for the
searcher—sites that contain their information in databases. Prime examples of
such sites would be phone directories, literature databases such as Medline,
newspaper sites, and patents databases. As you can see, if you can find out that
the site exists, then you (without going through a search engine) can search
the site contents. This leads to the obvious question of where one finds out
about sites that contain unindexed (Invisible Web) content.
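A rough illustration of why database content stays out of reach: the results page for a site's search box typically exists only as the response to a query that someone types in. The sketch below builds such a query URL with Python's standard urllib.parse module. The endpoint and the parameter names ("q", "limit") are hypothetical; every real site defines its own.

```python
from urllib.parse import urlencode

# Hypothetical search endpoint (an assumption for illustration only).
SEARCH_ENDPOINT = "http://example.com/search"

def build_query_url(term, max_results=10):
    """Build the URL a search form might request when it is submitted."""
    # The parameter names are invented; a real site uses its own.
    params = {"q": term, "limit": max_results}
    return SEARCH_ENDPOINT + "?" + urlencode(params)

url = build_query_url("workplace safety")
print(url)
```

A crawler that only follows links never fills in the search box, so it never requests a URL like this one. A person who knows the site exists can.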
The three sites listed below are directories of Invisible Web sites. Keep in
mind that they list and describe the overall site; they do not index the contents
of the site. Therefore, these directories should be searched or browsed at a
broad level. For example, look for “economics,” not a particular economic
indicator, or for sites on “safety,” not “workplace safety.” As you identify sites
of interest, bookmark them.
You may also want to look at the excellent book on the Invisible Web by Chris
Sherman and Gary Price (The Invisible Web: Uncovering Information Sources
Search Engines Can’t See. CyberAge Books. Medford, NJ USA. 2001).

Direct Search
http://www.freepint.com/gary/direct.htm
The “grandfather” of Invisible Web directories, this site was created and is maintained
by Gary Price (co-author of The Invisible Web). The sites listed here are
carefully selected for quality of content, and you can either search or browse.

invisible-web.net
http://www.invisible-web.net
By the authors of The Invisible Web, this is the most selective of the three
Invisible Web directories listed here. It contains about 1,000 entries and you
can either browse or search.

CompletePlanet
http://completeplanet.com
The site claims “103,000 searchable databases and specialty search engines,”
but a significant number of the sites seem to be individual pages (e.g., news
articles) and many of the databases are company catalogs, Yahoo! categories,
and the like, not necessarily “invisible.” It lists a lot of useful resources, but the
content also emphasizes how trivial much Invisible Web material can be.

Tuesday, August 3, 2010