Searching the Web
Directories vs. Indexes
Spiders and Crawlers
Finding the needle in the haystack: keywords
The real power: Boolean operators

Directories vs. Indexes
A directory has a hierarchical or tree structure:
animal
  cat   dog   gerbil   hamster
    Collie   Dachshund   German Shepherd
      Toy   Miniature   Full

Directories vs. Indexes
One possible path down the tree:
animal
  cat   dog   gerbil   hamster
    Collie   Dachshund   German Shepherd
      Toy   Miniature   Full

Directories vs. Indexes
A directory has a hierarchical or tree structure,
which looks like this in Yahoo!:

[Slide 5: Yahoo! directory screenshot]

Directories vs. Indexes
A directory has a hierarchical or tree structure, like a table of contents
It is context-based, meaning that "adjacent" information is related
This offers efficient and effective browsing

Directories vs. Indexes
An index has no inherent structure other than words; hence it is like, well, an index
It has granularity, meaning a detailed breakdown of where words are on the Web, without context or a sense of surroundings
This offers efficient and effective searching

Directories: Characteristics
Similar to a library or bookstore, with familiar categories
Arranged by subject or topic
And then subtopic and sub-subtopic...
Uses hyperlinks effectively to move "down" the topics, hence well-suited to purposeful browsing

Directories: Characteristics
Context and hyperlinks work together:
Topic: Animals or pets
Subtopic: Dogs
Sub-subtopic: Australian Shepherds
Target information: Finding a breeder, or training, or cost...
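
A minimal sketch of browsing a directory: each step follows one hyperlink down one level of the tree. The nested dictionary and the browse function are hypothetical and only mirror the example topics on this slide; they are not how Yahoo! or any real directory stores its categories.

```python
# Hypothetical directory: each key is a topic, each value holds its subtopics.
# The entries mirror the slide's example and are not real Yahoo! categories.
directory = {
    "Animals or pets": {
        "Dogs": {
            "Australian Shepherds": ["Finding a breeder", "Training", "Cost"],
        },
        "Cats": {},
    },
}


def browse(tree, path):
    """Follow one topic per level, like clicking one hyperlink per page."""
    node = tree
    for topic in path:
        node = node[topic]  # raises KeyError if the topic is not listed
    return node


targets = browse(directory, ["Animals or pets", "Dogs", "Australian Shepherds"])
print(targets)  # ['Finding a breeder', 'Training', 'Cost']
```

Each level narrows the context, which is why the last step lands directly on the target information.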

Directories: Issues
Because sites/links are chosen by editors, their scope (breadth and depth) is limited
Editing can introduce bias, personal or corporate
Editing can give unbalanced coverage, over- or underemphasizing topics
Staying current requires editors to recheck content, fix link rot, etc.
Some directories charge for "favorable" listings

Directories: Examples
The cream of the crop: Yahoo! It is a "closed" directory, meaning that its editors are its own employees
Open Directory Project uses unpaid editors and is used by Google (and formerly AltaVista); it is "open"
About.com is a half-open, half-closed hybrid

Indexes: Characteristics
An index is a database, like a dictionary or thesaurus that lists URLs of words and phrases instead of their definitions
It is machine-created, not human-built
Like any database, it is structured for efficient machine use, not human use
Hence, it is ideally suited for searching... and speed!
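
The "dictionary that lists URLs instead of definitions" idea can be sketched as a mapping from each word to the set of pages containing it. The URLs, page texts, and the build_index function are made up for illustration; real search engines use far more elaborate structures.

```python
from collections import defaultdict

# Made-up pages: URL -> page text (assumptions for illustration only).
pages = {
    "http://example.com/aussie": "Australian shepherd breeder and training",
    "http://example.com/sheep": "A shepherd guards sheep",
}


def build_index(pages):
    """Map each lowercase word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


index = build_index(pages)
print(sorted(index["shepherd"]))
# ['http://example.com/aussie', 'http://example.com/sheep']
```

Looking up a word is a single dictionary access, which is where the speed comes from; note that the index records nothing about what the pages are about.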

Indexes: Issues
Because all sites/links are included, their scope (breadth and depth) is unlimited
Financial costs can limit scope/content, e.g., frequency of revisiting pages already indexed
Indexing programs offer no quality review
Requires high user proficiency...

High user proficiency
In Fall 2001, a search on Australian shepherd would have returned more than 75K hits on Google but 6M on AltaVista (A/V)
Yet Google is a much larger database!
So how could this happen? And why would it not happen now? [answer will come later]

Indexes: Issues
Text-focused, less useful for images, sound

Indexes: Examples
Google is now the frontrunner
But there may be reasons to use others: selective coverage, ease of use, comfort, all of which are driven by past experience, the same as preference for a browser
Despite Google's market share, we will also look at AltaVista because of its historical and technological innovations

Indexes: Spiders and Allies
Automatic "spiders" (also robots, crawlers) find Web pages by following hyperlinks
They retrieve some portion of each page (title, first lines, full text)
Indexer adds the results to the database, calculates "relevancy"
Query processor responds to search requests
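
The three parts named above (spider, indexer, query processor) can be sketched over a tiny in-memory "Web". Everything here is hypothetical: the WEB table, the function names, and especially the use of a simple word count as "relevancy", which stands in for whatever ranking a real engine computes.

```python
from collections import defaultdict, deque

# A made-up, in-memory "Web": URL -> (page text, outgoing links).
WEB = {
    "a.example": ("dogs and cats", ["b.example", "c.example"]),
    "b.example": ("Australian shepherd dogs", ["a.example"]),
    "c.example": ("gerbil care", []),
    "d.example": ("unlinked page about dogs", []),  # nothing links here
}


def crawl(start):
    """Spider: breadth-first walk over hyperlinks from a start URL."""
    seen, queue, fetched = {start}, deque([start]), {}
    while queue:
        url = queue.popleft()
        text, links = WEB[url]
        fetched[url] = text
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return fetched


def build_index(fetched):
    """Indexer: record, for each word, how often it occurs on each page."""
    index = defaultdict(dict)
    for url, text in fetched.items():
        for word in text.lower().split():
            index[word][url] = index[word].get(url, 0) + 1
    return index


def query(index, word):
    """Query processor: URLs containing the word, most occurrences first."""
    hits = index.get(word.lower(), {})
    return sorted(hits, key=lambda url: (-hits[url], url))


index = build_index(crawl("a.example"))
print(query(index, "dogs"))  # ['a.example', 'b.example']
```

The page d.example mentions dogs but is never returned, because a spider that only follows hyperlinks cannot find a page that nothing links to.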

Keywords: An Overview
In Minerva, you can search fields (title, author, subject, title keywords, subject keywords), but these are like Yahoo! topics: a librarian has chosen them
In Web search engines such as AltaVista and Google, you can search full page content, as represented in the indexed database
This requires a very different skill set...

Keywords: An Overview
Choosing keywords is equivalent to starting at the "bottom" of a directory:
Topic: Animals or pets
Subtopic: Dogs
Sub-subtopic: Australian Shepherds
Target information: Finding a breeder, or training, or cost...

Context vs. Keywords
Topic: Animals or pets
Subtopic: Dogs
Sub-subtopic: Australian Shepherds
Target information: Finding a breeder, or training, or cost...
Directory tree
Index search string

Keywords: An Overview
Pick distinctive, unusual, or unique words
Vary their order: sail boat vs. boat sail
Vary their case: boat vs. Boat vs. BOAT
Look at returned results ("hits") to find additional keywords
Check your spelling!

Boolean Operators
"Boolean" refers to George Boole, a 19th-century British mathematician who developed much of the logic that underlies computer science
"Operators" are mathematical recipes, e.g., in 2+2=4, "+" is the addition operator.
"Boolean operators" are recipes for logical combinations

Boolean Operators
AND: both keywords connected by this operator must be present on the Web page for a result ("hit") to be returned
Australian AND shepherd
AND is the default for most search engines
Always type Boolean operators in UPPER CASE
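
In terms of an index, AND can be pictured as a set intersection: only pages that appear in the list for every keyword survive. The index contents and the search_and function below are hypothetical, with page numbers standing in for URLs.

```python
# Made-up index: keyword -> set of page IDs containing it (illustrative only).
index = {
    "australian": {1, 2, 3, 4},
    "shepherd": {3, 4, 5},
}


def search_and(index, *keywords):
    """Return pages that contain every keyword (set intersection)."""
    sets = [index.get(k.lower(), set()) for k in keywords]
    return set.intersection(*sets) if sets else set()


and_hits = search_and(index, "Australian", "shepherd")
print(sorted(and_hits))  # [3, 4]
```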

Boolean Operators
OR: at least one of the keywords connected by this operator must be present on the Web page for a result ("hit") to be returned
boundary OR dispute
Australian OR shepherd
A few years ago, OR was the default for Yahoo and AltaVista; this explains the Google vs. AV "discrepancy" [earlier slide]
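
OR can be pictured as a set union, which is why an OR default returns so many more hits than an AND default on the same keywords. The index and the search_or function are hypothetical; the page numbers are made up.

```python
# Made-up index: keyword -> set of page IDs containing it (illustrative only).
index = {
    "australian": {1, 2, 3, 4},
    "shepherd": {3, 4, 5},
}


def search_or(index, *keywords):
    """Return pages that contain at least one keyword (set union)."""
    hits = set()
    for k in keywords:
        hits |= index.get(k.lower(), set())
    return hits


or_hits = search_or(index, "Australian", "shepherd")
and_hits = index["australian"] & index["shepherd"]
print(len(and_hits), len(or_hits))  # 2 5
```

Every AND hit is also an OR hit, so the OR count can never be smaller, regardless of how large the underlying database is.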

Boolean Operators
AND vs. OR: Australian shepherd
AND: 3,450,000 hits on AltaVista
AND: 6,360,000 hits on Google
AND: 3,450,000 hits on Yahoo
OR: 266,000,000 hits on AltaVista
OR: 269,000,000 hits on Google
OR: 267,000,000 hits on Yahoo

Boolean Operators
The solution is to search on the phrase Australian shepherd, which is done by placing it in double quotes:
"Australian shepherd"
This is now almost universal among search engines.
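
A phrase search differs from AND in that the words must appear next to each other and in order. A minimal sketch, assuming pages are plain strings and words are separated by spaces; the pages and the contains_phrase function are hypothetical.

```python
# Made-up pages: page ID -> text (illustrative only).
pages = {
    1: "the Australian shepherd is a herding dog",
    2: "an Australian farmer hired a shepherd",
}


def contains_phrase(text, phrase):
    """True if the words of phrase appear consecutively, in order, in text."""
    words = text.lower().split()
    target = phrase.lower().split()
    n = len(target)
    return any(words[i:i + n] == target for i in range(len(words) - n + 1))


phrase_hits = [pid for pid, text in pages.items()
               if contains_phrase(text, "Australian shepherd")]
print(phrase_hits)  # [1]
```

Page 2 contains both words and would match Australian AND shepherd, but it is not about the breed, and the phrase search correctly leaves it out.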

Boolean Operators
Australian shepherd
default = AND: 3,450,000 hits on AltaVista
default = AND: 6,360,000 hits on Google
default = AND: 3,450,000 on Yahoo
"Australian shepherd"
1,670,000 hits on AltaVista
3,480,000 hits on Google
1,670,000 on Yahoo

Boolean Operators
NOT: the second keyword connected by this operator must NOT be present on the Web page for a result ("hit") to be returned
boundary NOT dispute
Australian AND NOT shepherd
"Australian shepherd" AND NOT breeder
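
NOT (or AND NOT) can be pictured as a set difference: start with the pages for the first keyword and remove any that also contain the second. The index and the search_and_not function are hypothetical.

```python
# Made-up index: keyword -> set of page IDs containing it (illustrative only).
index = {
    "australian": {1, 2, 3, 4},
    "shepherd": {3, 4, 5},
}


def search_and_not(index, keep, drop):
    """Return pages containing keep but not drop (set difference)."""
    return index.get(keep.lower(), set()) - index.get(drop.lower(), set())


not_hits = search_and_not(index, "Australian", "shepherd")
print(sorted(not_hits))  # [1, 2]
```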

Boolean Operators
NEAR: the second keyword connected by this operator must be adjacent to the first on the Web page for a result ("hit") to be returned
boundary NEAR dispute
"boundary dispute" NEAR Canada
"adjacent" usually means within 10 words
"Only" on AltaVista
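
A proximity test can be sketched by comparing word positions. This is only an illustration of the "within 10 words" idea: the near function, its max_gap parameter, and the choice to ignore word order are assumptions, not AltaVista's actual definition of NEAR.

```python
def near(text, first, second, max_gap=10):
    """True if first and second occur within max_gap words of each other."""
    # Strip common punctuation so "dispute." still matches "dispute".
    words = [w.strip('.,;:!?"').lower() for w in text.split()]
    pos_first = [i for i, w in enumerate(words) if w == first.lower()]
    pos_second = [i for i, w in enumerate(words) if w == second.lower()]
    return any(abs(i - j) <= max_gap for i in pos_first for j in pos_second)


close = "The boundary between the two farms was in dispute."
far = "boundary " + "filler " * 20 + "dispute"
print(near(close, "boundary", "dispute"), near(far, "boundary", "dispute"))
# True False
```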

Boolean Operators
There are two other search tools that are not logical operators, but they are most often combined with Boolean terms to refine searches: "wildcard" and "nesting."

Boolean Operators
"wildcard": using a special symbol, usually an asterisk (*), to search for part of a word
"boundary dispute resolution" might miss statements such as "X and Y announced today that they had resolved their long-standing boundary dispute." Try this instead:
"boundary dispute" AND resol*, or better yet:
"boundary dispute" NEAR resol*
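
A trailing wildcard can be sketched as a prefix match over the words in the index: resol* collects the pages for every indexed word that starts with "resol". The index and the search_wildcard function are hypothetical, and only a trailing asterisk is handled.

```python
# Made-up index: word -> set of page IDs containing it (illustrative only).
index = {
    "resolve": {1},
    "resolved": {2},
    "resolution": {3},
    "result": {4},
}


def search_wildcard(index, pattern):
    """Expand a trailing-asterisk pattern such as 'resol*' by prefix match."""
    if not pattern.endswith("*"):
        return set(index.get(pattern.lower(), set()))
    prefix = pattern[:-1].lower()
    hits = set()
    for word, pages in index.items():
        if word.startswith(prefix):
            hits |= pages
    return hits


wild_hits = search_wildcard(index, "resol*")
print(sorted(wild_hits))  # [1, 2, 3]
```

The word "result" is not matched, because it shares only "res" with the prefix; a shorter pattern such as res* would pull it in along with many unrelated words.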

Boolean Operators
"nesting": using parentheses to combine various operators and lessen ambiguity
"boundary disputes between the U.S. and Canada": long, full of assumptions...
assumes all words present
assumes U.S., not US or United States
assumes Canada, not Canadian

Boolean Operators
An alternative:
"boundary dispute" AND Canada AND (US OR "U.S." OR "United States")
Another possibility:
"boundary dispute" AND (Canad* NEAR (US OR "U.S." OR "United States"))
But use caution: nesting is very powerful but an easy place to make mistakes
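
The effect of the parentheses can be illustrated with Python set operators, where & plays the role of AND and | the role of OR. The index contents and the hits helper are hypothetical, and phrases are treated as single index entries for simplicity. The unparenthesized version follows Python's own precedence rules (& before |), which is an assumption and not necessarily how any given search engine resolves the ambiguity.

```python
# Made-up index: term or phrase -> set of page IDs (illustrative only).
index = {
    "boundary dispute": {1, 2, 3, 4},
    "canada": {1, 2, 5},
    "us": {1, 6},
    "u.s.": {2},
    "united states": {3, 5},
}


def hits(term):
    """Look up the pages for one term or quoted phrase."""
    return index.get(term.lower(), set())


# "boundary dispute" AND Canada AND (US OR "U.S." OR "United States")
nested = hits("boundary dispute") & hits("Canada") & (
    hits("US") | hits("U.S.") | hits("United States")
)

# Same terms without parentheses: the ANDs are grouped first, then the ORs.
unnested = (
    hits("boundary dispute") & hits("Canada") & hits("US")
    | hits("U.S.")
    | hits("United States")
)

print(sorted(nested))    # [1, 2]
print(sorted(unnested))  # [1, 2, 3, 5]
```

Without the parentheses, page 5 is returned even though it never mentions a boundary dispute, which is the kind of mistake the caution above refers to.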

Boolean Search Tools
The classic operators:
AND
OR
NOT (AND NOT)
NEAR
And the additions:
"phrase in double quotes"
wildcard
nesting