Step By Step 11 - Searching like the pros

Originally published in Australian Personal Computer magazine, September 1998
Last modified 03-Dec-2011.

 

You can find pretty much any information you want on the Internet. But will you? Here's the lowdown on finding it fast.

Plenty of people say they can't find what they want on the Web, though they know it must be there. This is partly the fault of the Internet's legendarily anarchic structure, but a bit of practice can have anybody finding data like a pro.

To find what you're looking for on the Net, you need to know how to search, and you need to know where to search. The "how" part is the one most people forget, so I'll tackle that first. The URLs for the sites mentioned in this column are at the end.

Stopwords, operators and wildcards

There's nothing wrong with broad search terms if you don't mind an avalanche of "hits", but the more carefully you construct your search "string" the less you're going to have to sift the results. "String" is the computer term for any sequence of characters, including spaces; searching is about the only place you still hear the word in general computing parlance. Another relevant word from days of yore is "syntax". A little time spent learning syntax, and a little thought put into your search string, can save you a lot of time poring through reams of close-but-no-cigar links or painstakingly revising your search string.

The syntax used by different searchers varies; here, I'll talk about the most common ways of doing things.

Common words like "the" or "and" are called "stopwords"; they're so common that search engines don't pay any attention to them in a search string, unless they're part of a quoted string. So a search for

tom and jerry

is the same as a search for

tom jerry

and will get you about a zillion things you don't want. But a search for

"tom and jerry"

will get you the cat-versus-mouse info you're looking for.
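
For the curious, here's roughly how that stopword handling might look from the engine's side. This is only a sketch in Python: the stopword list and the parse_query name are made up for illustration, and real engines each have their own rules.

```python
import re

# A tiny, made-up stopword list; real engines use their own, much longer ones.
STOPWORDS = {"a", "an", "and", "of", "or", "the", "to"}

def parse_query(query):
    """Split a query into quoted phrases and bare words.

    Stopwords are dropped from the bare words only; quoted phrases
    are kept exactly as typed.
    """
    phrases = re.findall(r'"([^"]+)"', query)
    # Whatever is left after removing the quoted parts is bare words.
    rest = re.sub(r'"[^"]*"', " ", query)
    words = [w for w in rest.lower().split() if w not in STOPWORDS]
    return phrases, words

print(parse_query('tom and jerry'))      # ([], ['tom', 'jerry'])
print(parse_query('"tom and jerry"'))    # (['tom and jerry'], [])
```

The unquoted search loses its "and" before the engine ever looks at it, which is why it behaves just like "tom jerry".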

Wildcards are characters that stand in for one or more other characters. The simple asterisk (*), by convention, matches any run of characters, including none at all. The question mark traditionally matches any single character. They can both be used to broaden your search without forcing you to enter a tedious "or-list" of every form of a doubtful word. Can't remember whether it's Hindenberg or Hindenburg? "Hindenb?rg" will do. Likewise, win*98 will match "Windows 98", "Win98", "Win 98" and any other string starting and ending with the right letters.
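
You can try the same wildcard conventions at home. The sketch below leans on Python's standard fnmatch module, which follows the * and ? rules described above; the wildcard_match helper is just a name invented for this example, and a given search engine may well behave differently.

```python
from fnmatch import fnmatchcase

def wildcard_match(pattern, text):
    """Case-insensitive wildcard match.

    * matches any run of characters (including none),
    ? matches exactly one character.
    """
    return fnmatchcase(text.lower(), pattern.lower())

for candidate in ["Hindenburg", "Hindenberg", "Hindenbrg"]:
    print(candidate, wildcard_match("Hindenb?rg", candidate))

for candidate in ["Windows 98", "Win98", "Win 98", "Windows 95"]:
    print(candidate, wildcard_match("win*98", candidate))
```

Both spellings of the airship match, "Hindenbrg" doesn't (the question mark insists on exactly one character), and "Windows 95" fails because it doesn't end in 98.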

Most search engines are not, by default, case sensitive, so "rolf harris" will match any instance of the wobbleboarder's name, regardless of how it's capitalised. Some searchers turn on case sensitivity if you use capitalisation anywhere in your string. So while "foo" matches Foo, foo, FOO and foO, "Foo" matches only Foo.
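
The "capitals switch on case sensitivity" rule is simple enough to sketch in a few lines of Python. The term_matches function is a hypothetical name, and this only illustrates the behaviour described above, not any particular engine's code.

```python
def term_matches(term, word):
    """Match case-insensitively unless the search term contains a capital."""
    if term != term.lower():
        # A capital letter anywhere in the term demands an exact match.
        return term == word
    return term == word.lower()

words = ["Foo", "foo", "FOO", "foO"]
print([w for w in words if term_matches("foo", w)])  # all four
print([w for w in words if term_matches("Foo", w)])  # only 'Foo'
```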

For real search power, you have to get into Boolean logic, named after its inventor, 19th century English mathematician George Boole. The standard Boolean operators are AND, OR and NOT, and most engines add a proximity operator, NEAR; between them they let you focus a search very finely.

By default, practically all search engines treat all entered terms as if they had AND between them; only pages featuring all of the terms will be displayed. But to find, say, references to Widgetsoft products for Windows and OS/2, but not references to WidgetMail, you could enter

widgetsoft AND (windows OR os/2) NOT widgetmail

Note the parentheses (brackets). They group terms and operators together so other operators can deal with the bracketed section as a whole. Without them, in this case, you'd be finding what you wanted, but also any page that contained the term os/2 but not the term widgetmail. Nested brackets are allowed, and should be handled properly by any competently written engine.
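
To see what the brackets buy you, here's the same query evaluated with and without them. This is a sketch that borrows Python's own and/or/not, on the assumption that the engine gives AND priority over OR the way Python does; the matches function and the sample page are invented for the example.

```python
def matches(page_terms, grouped=True):
    """Evaluate the Widgetsoft query against the set of terms found on a page."""
    def has(term):
        return term in page_terms

    if grouped:
        # widgetsoft AND (windows OR os/2) NOT widgetmail
        return (has("widgetsoft")
                and (has("windows") or has("os/2"))
                and not has("widgetmail"))
    # Brackets removed: AND binds tighter than OR, so the query splits in two.
    return (has("widgetsoft") and has("windows")
            or has("os/2") and not has("widgetmail"))

# A page about OS/2 that never mentions Widgetsoft at all.
unrelated_os2_page = {"os/2", "warp", "ibm"}
print(matches(unrelated_os2_page, grouped=True))   # False
print(matches(unrelated_os2_page, grouped=False))  # True
```

Without the brackets, the unrelated OS/2 page sneaks into the results.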

The NEAR operator varies in behaviour depending on the search engine; some engines, like AltaVista, treat "foo NEAR bar" as meaning foo within ten words of bar, while others let you configure how far apart the terms can be.
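
A proximity test like NEAR might be sketched as follows. Note the assumptions: distance is counted in whole words, the default of ten is configurable through a max_words parameter, and the near function itself is a made-up name rather than anything a real engine exposes.

```python
def near(text, first, second, max_words=10):
    """Return True if the two terms occur within max_words words of each other."""
    words = text.lower().split()
    first_at = [i for i, w in enumerate(words) if w == first]
    second_at = [i for i, w in enumerate(words) if w == second]
    return any(abs(i - j) <= max_words for i in first_at for j in second_at)

text = "foo is a placeholder name often paired with bar"
print(near(text, "foo", "bar"))               # True: eight words apart
print(near(text, "foo", "bar", max_words=3))  # False: too far apart
```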

Different engines have their own shortcuts for and extensions to the Boolean operators. AltaVista accepts &, |, ! and ~ for AND, OR, NOT and NEAR, while Infoseek uses a plus sign in front of a term to mean "this term must be present" and a minus sign for NOT.
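
The plus-and-minus style is easy to model. Here's a rough Python sketch; parse_plus_minus and page_matches are hypothetical names, and how unmarked terms affect ranking is left out entirely.

```python
def parse_plus_minus(query):
    """Sort query terms into required (+term), excluded (-term) and optional."""
    required, excluded, optional = [], [], []
    for term in query.lower().split():
        if term.startswith("+") and len(term) > 1:
            required.append(term[1:])
        elif term.startswith("-") and len(term) > 1:
            excluded.append(term[1:])
        else:
            optional.append(term)
    return required, excluded, optional

def page_matches(page_terms, query):
    """A page qualifies if it has every + term and none of the - terms."""
    required, excluded, _ = parse_plus_minus(query)
    return (all(t in page_terms for t in required)
            and not any(t in page_terms for t in excluded))

print(parse_plus_minus("+widgetsoft windows -widgetmail"))
print(page_matches({"widgetsoft", "windows"}, "+widgetsoft windows -widgetmail"))
```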

Every search engine that supports non-obvious search modifiers like Boolean operations should explain them in its help pages, so check there first. Many engines have basic and advanced interfaces, so people doing simple searches aren't intimidated by extra features. The extras that you can't get from the meta-searchers, like matching files of a particular kind or pages in a particular date range, can help dig out valuable information in those annoying situations when you know what you're after but can't quite put it into words.

Where to search

Now that you know how to look for things, let's get into where you should be looking. Many users use only whatever default search engine comes up when they hit the search button in their browser, and aren't too picky with their search "string". Do your searches this way and you'll get far too many results, yet may still miss out on the hits you're really looking for, because of the limitations of the engine you use.

For general Web searches, at the moment, I'm a huge fan of Google. Google has a large index, but its real strength is that it weights its results according to how many other pages link to the page in question. This makes Google searches, often, spookily accurate. Google is the only search engine with the guts to provide an "I'm feeling lucky" button that takes you automatically to hit number 1; it's surprising how often this very first hit is just what you're looking for!
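
The idea of weighting by incoming links can be shown with a toy example. To be clear, this is not Google's actual algorithm, just a bare link count in Python; the rank_by_inbound_links name and the example sites are invented.

```python
from collections import Counter

def rank_by_inbound_links(hits, links):
    """Order hits by how many distinct other pages link to each one.

    links is a list of (source_page, target_page) pairs.
    """
    inbound = Counter(target for source, target in set(links) if source != target)
    # sorted() is stable, so ties keep the order the hits arrived in.
    return sorted(hits, key=lambda page: inbound[page], reverse=True)

links = [
    ("a.example", "widgets.example"),
    ("b.example", "widgets.example"),
    ("a.example", "blog.example"),
]
hits = ["blog.example", "orphan.example", "widgets.example"]
print(rank_by_inbound_links(hits, links))
# ['widgets.example', 'blog.example', 'orphan.example']
```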

If you're looking for something esoteric, though, the best engine to use is all of them, via a meta-searcher of some kind. A meta-searcher is a site or program that takes your search query and feeds it to several search engines, then collates their results and presents them with duplicates pruned out. Even the best search engines only index a portion of the Web, so using several at once gives notably higher chances of finding what you want.
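
The collating-and-pruning step is the heart of any meta-searcher, and it might look something like this in Python. The engine names, URLs and the merge_results function are all placeholders, and treating URLs as duplicates when they differ only by a trailing slash is a simplifying assumption.

```python
def merge_results(results_by_engine):
    """Collate hits from several engines, dropping duplicate URLs.

    results_by_engine maps an engine name to its ranked list of URLs.
    """
    merged = {}
    for engine, urls in results_by_engine.items():
        for rank, url in enumerate(urls):
            key = url.rstrip("/")  # crude duplicate detection; an assumption
            entry = merged.setdefault(
                key, {"url": url, "engines": [], "best_rank": rank})
            entry["engines"].append(engine)
            entry["best_rank"] = min(entry["best_rank"], rank)
    # Hits found by more engines come first, then best position in any engine.
    return sorted(merged.values(),
                  key=lambda e: (-len(e["engines"]), e["best_rank"]))

results = {
    "engine_a": ["http://one.example/", "http://two.example/"],
    "engine_b": ["http://two.example", "http://three.example/"],
}
for hit in merge_results(results):
    print(hit["url"], hit["engines"])
```

The page both engines agree on floats to the top, and it's only listed once.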

For my heavy duty searches, I use Copernic. Copernic is a Windows program that feeds your string to a list of engines (all the big names plus a few more - you can pick which ones to use) and gives you a list of hits that you can easily sort by various criteria. It integrates neatly into the Start menu's Find section, and it keeps a history of past searches. Copernic also automatically updates the engines it searches and the syntax it uses, so it always does a pretty good job of getting what you ask for.

If you'd rather use a Web site than an outboard program for your combo-searches, try Metacrawler or ProFusion.

Yahoo's strengths

Yahoo used to be my only port of call when I was just looking for a company's Web site, or any other site whose name I had a good idea of, or if I was just looking for a selection of sites on a subject. When you've already found one site of the kind you're after, it often helps to find that site on Yahoo, then click the area identifier link above the site name to see everything else in that section.

But Yahoo's not what it used to be. Or, rather, it's better than it used to be, but it's not keeping up with the amazing growth of the Web. Getting listed on Yahoo takes a really long time, and so the Yahoo listings are always out of date. There's no perfect alternative, but Google comes surprisingly close.

Because Yahoo is a human-edited directory in which site owners get to describe their sites and put them in appropriate categories, it still really cuts through the noise for less esoteric searches. Yahoo also "falls through" to an AltaVista search after it's exhausted its own index, so it's possible you'll get what you need even if Yahoo doesn't list it.

Searching Usenet

Usenet newsgroups are frequently a lot more useful than Web pages, simply because much more data has been posted to Usenet over the years than has made it onto the Web. Newsgroups are also the way to go if you're looking for information on something that's only just happened, since Web pages on the subject usually take a while to be built, while newsgroup denizens can be counted on to sound off ten seconds after the event.

For this sort of total immediacy you'll need to fire up a newsreader and check out the relevant groups yourself, but if you can stand reading messages a day or two old, head for Deja.com (or, as it used to be called, Deja News). Other sites offer Usenet searching, but they either send their requests to Deja News (Hotbot does this) or have a much smaller database. The late lamented Reference.com and AltaVista's Usenet search functions can't hold a candle to Deja's - although Reference.com did index recent posts to various e-mail lists, which Deja News doesn't.

Deja.com's Power Search lets you restrict your search to one or more newsgroups, with wildcard support, so rec.pets.* matches a load of pet newsgroups. You can also specifically search by subject, author and date range. And whenever you're looking at a message, you can choose to view the "thread" of messages it's a part of (often catching other, similarly named threads as well), or view the message author's posting profile, which is an excellent way to judge the credibility of a poster, find more messages from someone especially amusing or erudite, or just snoop into someone's online activities. Deja.com also lets you read news and post to newsgroups, once you've registered with them. This makes it easy to chase up more information.
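
A search restricted by newsgroup, author and date boils down to a simple filter. The Python sketch below only illustrates the idea; the power_search function, the field names and the sample posts are all invented, and it has nothing to do with how Deja.com is actually built.

```python
from datetime import date
from fnmatch import fnmatchcase

def power_search(posts, group_pattern="*", author=None, since=None, until=None):
    """Filter posts by newsgroup wildcard, author and date range."""
    found = []
    for post in posts:
        if not fnmatchcase(post["group"], group_pattern):
            continue
        if author is not None and post["author"] != author:
            continue
        if since is not None and post["date"] < since:
            continue
        if until is not None and post["date"] > until:
            continue
        found.append(post)
    return found

# Made-up posts, purely for illustration.
posts = [
    {"group": "rec.pets.cats", "author": "alice@example.com",
     "date": date(1998, 8, 1), "subject": "Hairball remedies?"},
    {"group": "rec.pets.dogs", "author": "bob@example.com",
     "date": date(1998, 8, 20), "subject": "Re: Best leash"},
    {"group": "comp.os.os2.misc", "author": "alice@example.com",
     "date": date(1998, 8, 21), "subject": "Warp 4 fixpaks"},
]

for post in power_search(posts, "rec.pets.*", since=date(1998, 8, 10)):
    print(post["group"], post["subject"])
```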

Here's one time-saving tip for Usenet searching - if you're searching for the answer to a question, and you see one or more threads of messages which from the title seem to be dealing with it, don't bother looking at the first message in the thread. Concentrate on all of the "Re:<subject>" messages instead. The initial message is probably just asking the same question you are, and whatever sections of its text are relevant will generally be quoted in the replies.

Specialised searchers

The other search resources you use depend on your interests. Databases and other information repositories of all sorts are being made Web-accessible, so you can check out U.S. patents, the huge Medline medical abstracts database, piles of information on the Australian Government - you name it. The one thing just about all of the huge online databases have in common is that they support Boolean search terms, so the basic search string skills you use on the Web will stand you in good stead when searching these sites too.


Links

Perl regular expressions
The standard syntax for advanced searches

Google
The best search engine currently available

Copernic
The best meta-search client

Metacrawler and ProFusion
Meta-search Web sites

 


