Wednesday, August 4, 2010

20 Great Google Secrets

Source: pcmag.com/article
Google is clearly the best general-purpose search engine on the Web (see www.pcmag.com/searchengines).

But most people don't use it to its best advantage. Do you just plug in a keyword or two and hope for the best? That may be the quickest way to search, but with more than 3 billion pages in Google's index, it's still a struggle to pare results to a manageable number.

But Google is a remarkably powerful tool that can ease and enhance your Internet exploration. Google's search options go beyond simple keywords, the Web, and even its own programmers. Let's look at some of Google's lesser-known options.

Syntax Search Tricks

Using a special syntax is a way to tell Google that you want to restrict your searches to certain elements or characteristics of Web pages. Google has a fairly complete list of its syntax elements at www.google.com/help/operators.html. Here are some advanced operators that can help narrow down your search results.

Intitle: at the beginning of a query word or phrase (intitle:"Three Blind Mice") restricts your search results to just the titles of Web pages.

Intext: does the opposite of intitle:, searching only the body text, ignoring titles, links, and so forth. Intext: is perfect when what you're searching for might commonly appear in URLs. If you're looking for the term HTML, for example, and you don't want to get results such as www.mysite.com/index.html, you can enter intext:html.

Link: lets you see which pages are linking to your Web page or to another page you're interested in. For example, try typing in link:http://www.pcmag.com.


Try using site: (which restricts results to a particular site or top-level domain) with intitle: to find certain types of pages. For example, get scholarly pages about Mark Twain by searching for intitle:"Mark Twain" site:edu. Experiment with mixing various elements; you'll develop several strategies for finding the stuff you want more effectively. The site: command is very helpful as an alternative to the mediocre search engines built into many sites.
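If you find yourself reusing these operators, it can be handy to build the query URLs programmatically. Here is a minimal sketch in Python; the operator strings are the ones described above, and the plain /search?q= URL format is assumed to be all you need.

# Build Google queries that combine the operators described above.
from urllib.parse import quote_plus

def google_url(query):
    """Return a Google search URL for the given query string."""
    return "http://www.google.com/search?q=" + quote_plus(query)

# Scholarly pages about Mark Twain on .edu sites:
print(google_url('intitle:"Mark Twain" site:edu'))

# Pages that mention HTML in their body text, not just in their URLs:
print(google_url('intext:html'))

# Pages linking to pcmag.com:
print(google_url('link:http://www.pcmag.com'))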

Swiss Army Google

Google has a number of services that can help you accomplish tasks you may never have thought to use Google for. For example, the new calculator feature (www.google.com/help/features.html#calculator) lets you do both math and a variety of conversions from the search box. For extra fun, try the query "Answer to life the universe and everything."

Let Google help you figure out whether you've got the right spelling—and the right word—for your search. Enter a misspelled word or phrase into the query box (try "thre blund mise") and Google may suggest a proper spelling. This doesn't always succeed; it works best when the word you're searching for can be found in a dictionary. Once you search for a properly spelled word, look at the results page, which repeats your query. (If you're searching for "three blind mice," underneath the search window will appear a statement such as Searched the web for "three blind mice.") You'll discover that you can click on each word in your search phrase and get a definition from a dictionary.

Suppose you want to contact someone and don't have his phone number handy. Google can help you with that, too. Just enter a name, city, and state. (The city is optional, but you must enter a state.) If a phone number matches the listing, you'll see it at the top of the search results along with a map link to the address. If you'd rather restrict your results, use rphonebook: for residential listings or bphonebook: for business listings. If you'd rather use a search form for business phone listings, try Yellow Search (www.buzztoolbox.com/google/yellowsearch.shtml).


Extended Googling

Google offers several services that give you a head start in focusing your search. Google Groups (http://groups.google.com) indexes literally millions of messages from decades of discussion on Usenet. Google even helps you with your shopping via two tools: Froogle (http://froogle.google.com), which indexes products from online stores, and Google Catalogs (http://catalogs.google.com), which features products from more than 6,000 paper catalogs in a searchable index. And this only scratches the surface. You can get a complete list of Google's tools and services at www.google.com/options/index.html.

You're probably used to using Google in your browser. But have you ever thought of using Google outside your browser?

Google Alert (www.googlealert.com) monitors your search terms and e-mails you information about new additions to Google's Web index. (Google Alert is not affiliated with Google; it uses Google's Web services API to perform its searches.) If you're more interested in news stories than general Web content, check out the beta version of Google News Alerts (www.google.com/newsalerts). This service (which is affiliated with Google) will monitor up to 50 news queries per e-mail address and send you information about news stories that match your query. (Hint: Use the intitle: and source: syntax elements with Google News to limit the number of alerts you get.)

Google on the telephone? Yup. This service is brought to you by the folks at Google Labs (http://labs.google.com), a place for experimental Google ideas and features (which may come and go, so what's there at this writing might not be there when you decide to check it out). With Google Voice Search (http://labs1.google.com/gvs.html), you dial the Voice Search phone number, speak your keywords, and then click on the indicated link. Every time you say a new search term, the results page will refresh with your new query (you must have JavaScript enabled for this to work). Remember, this service is still in an experimental phase, so don't expect 100 percent success.

In 2002, Google released the Google API (application programming interface), a way for programmers to access Google's search engine results without violating the Google Terms of Service. A lot of people have created useful (and occasionally not-so-useful but interesting) applications not available from Google itself, such as Google Alert. For many applications, you'll need an API key, which is available free from www.google.com/apis. See the figures for two more examples, and visit www.pcmag.com/solutions for more.
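As a rough illustration of what a Google API call involved, here is a hand-rolled SOAP request in Python. The endpoint, method name (doGoogleSearch), and parameter list below are recalled from the 2002-era documentation and should be treated as assumptions; check them against the developer kit that comes with your API key before relying on this sketch.

# A rough sketch of calling the Google SOAP Search API by hand.
# The endpoint, namespace, and parameters are assumptions from the
# 2002-era documentation; no XML escaping is done on the query here.
import urllib.request

ENDPOINT = "http://api.google.com/search/beta2"   # assumed SOAP endpoint

def do_google_search(api_key, query, start=0, max_results=10):
    envelope = f"""<?xml version="1.0" encoding="UTF-8"?>
<SOAP-ENV:Envelope
    xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <SOAP-ENV:Body>
    <ns1:doGoogleSearch xmlns:ns1="urn:GoogleSearch">
      <key xsi:type="xsd:string">{api_key}</key>
      <q xsi:type="xsd:string">{query}</q>
      <start xsi:type="xsd:int">{start}</start>
      <maxResults xsi:type="xsd:int">{max_results}</maxResults>
      <filter xsi:type="xsd:boolean">true</filter>
      <restrict xsi:type="xsd:string"></restrict>
      <safeSearch xsi:type="xsd:boolean">false</safeSearch>
      <lr xsi:type="xsd:string"></lr>
      <ie xsi:type="xsd:string">latin1</ie>
      <oe xsi:type="xsd:string">latin1</oe>
    </ns1:doGoogleSearch>
  </SOAP-ENV:Body>
</SOAP-ENV:Envelope>"""
    request = urllib.request.Request(
        ENDPOINT,
        data=envelope.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8",
                 "SOAPAction": "urn:GoogleSearchAction"},
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")   # raw SOAP response XML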

Thanks to its many different search properties, Google goes far beyond a regular search engine. Give the tricks in this article a try. You'll be amazed at how many different ways Google can improve your Internet searching.


Online Extra: More Google Tips


Here are a few more clever ways to tweak your Google searches.

Search Within a Timeframe

Daterange: (start date–end date). You can restrict your searches to pages that were indexed within a certain time period. Daterange: searches by when Google indexed a page, not when the page itself was created. This operator can help you ensure that results will have fresh content (by using recent dates), or you can use it to avoid a topic's current-news blizzard and concentrate only on older results. Daterange: is actually more useful if you go elsewhere to take advantage of it, because daterange: requires Julian dates, not standard Gregorian dates. You can find converters on the Web (such as http://aa.usno.navy.mil/data/docs/JulianDate.html), but an easier way is to do a Google daterange: search by filling in a form at www.researchbuzz.com/toolbox/goofresh.shtml or www.faganfinder.com/engines/google.shtml.
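If you would rather do the Gregorian-to-Julian conversion yourself, the arithmetic is simple enough to script. A small Python sketch, assuming the integer Julian day numbers that daterange: expects:

from datetime import date

def julian_day(d):
    """Julian day number for a Gregorian date (2000-01-01 -> 2451545)."""
    return d.toordinal() + 1721425

def daterange_query(terms, start, end):
    return f"{terms} daterange:{julian_day(start)}-{julian_day(end)}"

# Pages about Mars indexed during the first week of August 2010:
print(daterange_query("mars", date(2010, 8, 1), date(2010, 8, 7)))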

If one special syntax element is good, two must be better, right? Sometimes. Though some operators can't be mixed (you can't use the link: operator with anything else), many can be, quickly narrowing your results to a less overwhelming number.

More Google API Applications

Staggernation.com offers three tools based on the Google API. The Google API Web Search by Host (GAWSH) lists the Web hosts of the results for a given query (www.staggernation.com/gawsh/). When you click on the triangle next to each host, you get a list of results for that host. The Google API Relation Browsing Outliner (GARBO) is a little more complicated: You enter a URL and choose whether you want pages that are related to the URL or linked to the URL (www.staggernation.com/garbo/). Click on the triangle next to a URL to get a list of pages linked or related to that particular URL. CapeMail is an e-mail search application that allows you to send an e-mail to google@capeclear.com with the text of your query in the subject line and get the first ten results for that query back. Maybe it's not something you'd do every day, but if your cell phone does e-mail and doesn't do Web browsing, this is a very handy address to know.

Tuesday, August 3, 2010

THE INVISIBLE WEB

No matter how good you are at using Web search engines and general
directories, there are valuable resources on the Web that search engines
will not find for you. You can get to most of them if you know the URL,
but a search engine search will probably not find them for you. These resources,
often referred to as the “Invisible Web,” include a variety of content, including,
most importantly, databases of articles, data, statistics, and government documents.
The “invisible” refers to “invisible to search engines.” There is nothing
mysterious or mystical involved.
The Invisible Web is important to know about because it contains a lot of
tremendously useful information—and it is large. Various estimates put the size
of the Invisible Web at from two to five hundred times the content of the visible
Web. Before that number sinks in and alarms you, keep in mind the following:
1. There is a lot of very important material contained in the Invisible Web.
2. For the information that is there that you are likely to have a need for,
and the right to access, there are ways of finding out about it and getting
to it.
3. In terms of volume, most of the material is material that is meaningless
except to those who already know about it, or to the producer’s immediate
relatives. Much of the material that can’t be found is probably not
worth finding.
To adequately understand what this is all about, one must know why some
content is invisible. Note the use of the word “content” instead of the word
“sites.” The main page of invisible Web sites is usually easy to find and is covered
by search engines. It is the rest of the site (Web pages and other content) that
may be invisible. Search engines do not index certain Web content mainly for
the following reasons:
1. The search engine does not know about the page. No one has submitted the
URL to the search engine and no pages currently covered by the search
engine have linked to it. (This falls in the category, “Hardly anyone cares
about this page, you probably don’t need to either.”)
2. The search engines have decided not to index the content because it is
too deep in the site (and probably less useful), it is a page that changes
so frequently that indexing the content would be somewhat meaningless
(as, for example, in the case of some news pages), or the page is generated
dynamically and likewise is not amenable to indexing. (Think in terms
of “Even if you searched and found the page, the content you searched
for would probably be gone.”)
3. The search engine is asked not to index the content, by the presence of a
robots.txt file on the site that asks engines not to index the site, or specific
pages, or particular parts of the site; a short example follows this list. (A lot
of this content could be placed in the "It's nobody else's business" category.)
4. The search engine does not have or does not utilize a technology that
would be required to index non-HTML content. This applies to files such
as images and audio files. Until 2001, this category included file types
such as PDF (Portable Document Format files), Excel files, Word
files, and others, that began to be indexed by the major search
engines in 2001 and 2002. Because of this increased coverage, the
Invisible Web may be shrinking, proportionate to the size of the total
Web.
5. The search engine cannot get to the pages to index them because it
encounters a request for a password or the site has a search box that
must be filled out in order to get to the content.
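As promised under reason 3, here is a minimal sketch of what such a robots.txt file looks like, written as a small Python script that generates it. The directory names and the robot name are made up for illustration; the directives themselves follow the robots exclusion convention.

# Write a simple robots.txt to be served at the root of the site.
ROBOTS_TXT = """\
# Applies to every robot: keep these directories out of search engine indexes.
User-agent: *
Disallow: /private/
Disallow: /drafts/

# A specific robot, named by its User-agent string, is asked to stay out entirely.
User-agent: BadBot
Disallow: /
"""

with open("robots.txt", "w") as f:
    f.write(ROBOTS_TXT)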
It is the last part of the last category that holds the most interest for the
searcher—sites that contain their information in databases. Prime examples of
such sites would be phone directories, literature databases such as Medline,
newspaper sites, and patents databases. As you can see, if you can find out that
the site exists, then you (without going through a search engine) can search
the site contents. This leads to the obvious question of where one finds out
about sites that contain unindexed (Invisible Web) content.
The three sites listed below are directories of Invisible Web sites. Keep in
mind that they list and describe the overall site, they do not index the contents
of the site. Therefore, these directories should be searched or browsed at a
broad level. For example, look for “economics” not a particular economic
indicator, or for sites on “safety” not “workplace safety.” As you identify sites
of interest, bookmark them.
You may also want to look at the excellent book on the Invisible Web by Chris
Sherman and Gary Price (The Invisible Web: Uncovering Information Sources
Search Engines Can’t See. CyberAge Books. Medford, NJ USA. 2001).
Direct Search
http://www.freepint.com/gary/direct.htm
The “grandfather” of Invisible Web directories, this site was created and is maintained
by Gary Price (co-author of The Invisible Web). The sites listed here are
carefully selected for quality of content, and you can either search or browse.
invisible-web.net
http://www.invisible-web.net
By the authors of The Invisible Web, this is the most selective of the three
Invisible Web directories listed here. It contains about 1,000 entries and you
can either browse or search.
CompletePlanet
http://completeplanet.com
The site claims “103,000 searchable databases and specialty search engines,”
but a significant number of the sites seem to be individual pages (e.g., news
articles) and many of the databases are company catalogs, Yahoo! categories,
and the like, not necessarily “invisible.” It lists a lot of useful resources, but the
content also emphasizes how trivial much Invisible Web material can be.

A robot is traversing my whole site too fast!

This is called "rapid-fire", and people usually notice it if they're monitoring or analysing an access log file.
First of all, check whether this is actually a problem by checking the load on your server, monitoring your server's error log, and watching concurrent connections if you can. If you have a medium or high performance server, it is quite likely to be able to cope with a high load of even several requests per second, especially if the visits are quick.
However you may have problems if you have a low performance site, such as your own desktop PC or Mac you're working on, or you run low performance server software, or if you have many long retrievals (such as CGI scripts or large documents). These problems manifest themselves in refused connections, a high load, performance slowdowns, or in extreme cases a system crash.
If this happens, there are a few things you should do. Most importantly, start logging information: when you noticed it, what happened, what your logs say, what you are doing in response, and so on; this helps when investigating the problem later. Secondly, try to find out where the robot came from, which IP addresses or DNS domains, and see if they are mentioned in the list of active robots. If you can identify a site this way, you can email the person responsible and ask them what's up. If this doesn't help, try their own site for telephone numbers, or mail postmaster at their domain.
If the robot is not on the list, mail me with all the information you have collected, including actions on your part. If I can't help, at least I can make a note of it for others.
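As a concrete illustration of the log monitoring described above, the following Python sketch counts requests per client per minute in an access log. It assumes the common log format, with the client address first and the timestamp in square brackets; adjust the parsing if your server logs differently.

import re
from collections import Counter

LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\]')

def busiest_clients(log_path, top=10):
    """Return the (client, minute) pairs with the most requests."""
    hits = Counter()
    with open(log_path) as log:
        for line in log:
            match = LOG_LINE.match(line)
            if not match:
                continue
            client, timestamp = match.groups()
            # Truncate "10/Oct/2000:13:55:36 -0700" to minute resolution.
            minute = timestamp.split()[0].rsplit(":", 1)[0]
            hits[(client, minute)] += 1
    return hits.most_common(top)

for (client, minute), count in busiest_clients("access.log"):
    print(f"{count:5d} requests from {client} during {minute}")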

How do I know if I've been visited by a robot?

You can check your server logs for sites that retrieve many documents, especially in a short time.
If your server supports User-agent logging you can check for retrievals with unusual User-agent header values.
Finally, if you notice a site repeatedly checking for the file '/robots.txt', chances are that it is a robot too.
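Those three checks are easy to automate. Here is a small Python sketch that assumes the combined log format, where the request line and the User-agent are the quoted fields; it reports clients that fetched /robots.txt, along with the User-agent strings they sent.

def likely_robots(log_path):
    """Map client address -> {User-agents seen, number of /robots.txt fetches}."""
    seen = {}
    with open(log_path) as log:
        for line in log:
            parts = line.split('"')
            if len(parts) < 7:
                continue                      # not a combined-format line
            client = parts[0].split()[0]
            request = parts[1]                # e.g. 'GET /robots.txt HTTP/1.0'
            user_agent = parts[5]
            entry = seen.setdefault(client, {"agents": set(), "robots_txt": 0})
            entry["agents"].add(user_agent)
            if request.split()[1:2] == ["/robots.txt"]:
                entry["robots_txt"] += 1
    return {c: e for c, e in seen.items() if e["robots_txt"]}

for client, info in likely_robots("access.log").items():
    print(client, sorted(info["agents"]))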

I've been visited by a robot! Now what?

 

Well, nothing :-) The whole idea is they are automatic; you don't need to do anything.
If you think you have discovered a new robot (i.e., one that is not listed on the list of active robots) and it does more than pay sporadic visits, drop me a line so I can make a note of it for future reference. But please don't tell me about every robot that happens to drop by!

Can I use /robots.txt or meta tags to remove offensive content on some other site from a search engine?

No, because those tools can only be used by the person controlling the content on that site.
You will have to contact the site and ask them to remove the offensive content, and ask them to take steps to remove it from the search engine too. That usually involves using /robots.txt, and then using the search engine's tools to request the content to be removed. For example, see: How can I prevent content from being indexed or remove content from Google's index.
If that fails, you can try contacting the search engine administrators directly to ask for help, but they are likely to only remove content if it is a legal matter. For example, see: How can I inform Google about a legal matter?

How do I get the best listing in search engines?

This is referred to as "SEO" -- Search Engine Optimisation. Many web sites, forums, and companies exist that aim/claim to help with that.
But it basically comes down to this:
  • In your site design, use text rather than images and Flash for important content
  • Make your site work with JavaScript, Java and CSS disabled
  • Organise your site such that you have pages that focus on a particular topic
  • Avoid HTML frames and iframes
  • Use normal URLs, avoiding links that look like form queries (http://www.example.com/engine?id); see the short sketch after this list
  • Market your site by having other relevant sites link to yours
  • Don't try to cheat the system (by stuffing your pages with keywords, attempting to target specific content at search engines, or using link farms)
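To make the "normal URLs" point concrete, here is a tiny Python sketch that derives a readable, crawler-friendly path from a page title instead of exposing a form-style query string. The example.com URLs are purely illustrative.

import re

def slug(title):
    """'20 Great Google Secrets' -> '20-great-google-secrets'."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")

print("Avoid:  http://www.example.com/engine?id=1487")
print("Prefer: http://www.example.com/articles/" + slug("20 Great Google Secrets"))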

How do I register my page with a robot?

You guessed it, it depends on the service :-) Many services have a link to a URL submission form on their search page, or have more information in their help pages. For example, Google has Information for Webmasters.

How does a robot decide where to visit?

This depends on the robot, each one uses different strategies. In general they start from a historical list of URLs, especially of documents with many links elsewhere, such as server lists, "What's New" pages, and the most popular sites on the Web. Most indexing services also allow you to submit URLs manually, which will then be queued and visited by the robot.
Sometimes other sources for URLs are used, such as scanning USENET postings, published mailing list archives, etc.
Given those starting points, a robot can select URLs to visit and index, and to parse and use as a source for new URLs.
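A toy version of that crawl loop, in Python: start from a list of seed URLs, fetch each page, pull out new links, and queue them for later visits. Real robots add robots.txt checks, per-host rate limits, and far better HTML handling; the seed URL below is a placeholder.

import time
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkCollector(HTMLParser):
    """Collect href attributes from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seeds, max_pages=20):
    queue, seen, fetched = deque(seeds), set(seeds), 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except OSError:
            continue                        # skip unreachable pages
        fetched += 1
        print(url)                          # the "index this page" step, reduced to a print
        parser = LinkCollector()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)   # resolve relative links against the page URL
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
        time.sleep(1)                       # space out requests instead of rapid-firing the server

crawl(["http://www.example.com/"])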


Aren't robots bad for the web?

There are a few reasons people believe robots are bad for the Web:
  • Certain robot implementations can (and have in the past) overloaded networks and servers. This happens especially with people who are just starting to write a robot; these days there is sufficient information on robots to prevent some of these mistakes.
  • Robots are operated by humans, who make mistakes in configuration, or simply don't consider the implications of their actions. This means people need to be careful, and robot authors need to make it difficult for people to make mistakes with bad effects.
  • Web-wide indexing robots build a central database of documents, which doesn't scale too well to millions of documents on millions of sites.
But at the same time the majority of robots are well designed, professionally operated, cause no problems, and provide a valuable service in the absence of widely deployed better solutions.
So no, robots aren't inherently bad, nor inherently brilliant; they simply need careful attention.

So what are Robots, Spiders, Web Crawlers, Worms, Ants?

They're all names for the same sort of thing, with slightly different connotations:

Robots
the generic name, see above.
Spiders
same as robots, but sounds cooler in the press.
Worms
same as robots, although technically a worm is a replicating program, unlike a robot.
Web crawlers
same as robots, but note that WebCrawler is a specific robot.
WebAnts
distributed cooperating robots.

What is an agent?

The word "agent" is used for lots of meanings in computing these days. Specifically:

Autonomous agents
are programs that do travel between sites, deciding themselves when to move and what to do. These can only travel between special servers and are currently not widespread in the Internet.
Intelligent agents
are programs that help users with things, such as choosing a product, or guiding a user through form filling, or even helping users find things. These have generally little to do with networking.
User-agent
is a technical name for programs that perform networking tasks for a user, such as Web User-agents like Netscape Navigator and Microsoft Internet Explorer, and e-mail User-agents like Qualcomm Eudora.

What is a WWW robot?

A robot is a program that automatically traverses the Web's hypertext structure by retrieving a document, and recursively retrieving all documents that are referenced.

Note that "recursive" here doesn't limit the definition to any specific traversal algorithm; even if a robot applies some heuristic to the selection and order of documents to visit and spaces out requests over a long space of time, it is still a robot.

Normal Web browsers are not robots, because they are operated by a human, and don't automatically retrieve referenced documents (other than inline images).

Web robots are sometimes referred to as Web Wanderers, Web Crawlers, or Spiders. These names are a bit misleading, as they give the impression the software itself moves between sites like a virus; this is not the case, as a robot simply visits sites by requesting documents from them.
