Information Technology Services: Web

Constructing a local search using ht://Dig Search forms

About Search

Purpose

This document gives basic instructions for constructing ht://Dig search forms that limit the returned results to a specified area of one or more web sites. This allows web authors to provide a search of their own web site or of sub-sections of it.

For example, the library may wish to return results only from the library server.

If you do not limit the scope of your search form, it will return results from all currently indexed web servers, up to the limit of 200 results.

What is a search form and how does it work?

A search form is simply an HTML form constructed in such a way as to send instructions to the ht://Dig search engine. The ht://Dig search engine then performs a search based on the currently indexed web pages and the instructions defined within your search form.

You tell ht://Dig how to limit returned results by using the form fields "restrict" and "exclude".

Using "restrict" and "exclude" form fields

The form fields we will be discussing here are needed to limit the scope of the search results returned by your search form. Many examples of this behavior can be seen within the source code of the University at Albany Search page.

To target a particular area of the indexed University at Albany web pages, use the "restrict" field, which limits results to pages whose web address contains the path you specify.

To exclude content in a section of your site, use the "exclude" field. You can combine it with "restrict" to search your own site while leaving certain sections out of the results.

For example, the library could use the "restrict" field with the value "library.albany.edu" to return only results related to the library. If it did not want "Databases & E-Texts" pages to appear, it could also use the "exclude" field with the value "library.albany.edu/databases" to keep pages in that sub-section out of the results.
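
The library case can be sketched as a complete form. This is an illustration rather than the library's actual form. The restrict and exclude values come from the example above, and the action URL and format value are copied from the working example later in this document. The label text is hypothetical.

```html
<!-- Sketch only: the label text is a hypothetical placeholder. -->
<form method="get" action="http://search.albany.edu/cgi-bin/htsearch">
<!-- Return only pages whose address contains library.albany.edu ... -->
<input type="hidden" name="restrict" value="library.albany.edu">
<!-- ... but leave out anything under the databases sub-section. -->
<input type="hidden" name="exclude" value="library.albany.edu/databases">
<input type="hidden" name="format" value="builtin-long">
Search the library for:
<br>
<input type="text" size="21" name="words" value="">
<input type="submit" value="Search">
</form>
```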

If you wish to prevent single web pages or linked sections of your site from being indexed during the next weekly indexing run, you can use the robots meta tag, which ht://Dig supports.
Please note that this tag affects every robot on the web that respects it, not just ht://Dig: such robots will not index the page and/or follow its links, depending on the directive you use. Google is one such robot. If you want to prevent only ht://Dig from indexing certain pages while still allowing other robots to index them, use the htdig-noindex meta tag:
<meta name="htdig-noindex">
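
For comparison, the standard robots meta tag is usually written as shown below. This is a sketch of common usage, not a site-specific recommendation. The content values noindex and nofollow are the standard directives for "do not index this page" and "do not follow its links", and either can be used on its own.

```html
<!-- Place inside the <head> of the page you want kept out of the index. -->
<!-- noindex: do not index this page; nofollow: do not follow its links. -->
<meta name="robots" content="noindex, nofollow">
```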

Examples

Working example:
Search Rockefeller College for: [text box] [Search button]

The code:
<form method="get" action="http://search.albany.edu/cgi-bin/htsearch">
<input type="hidden" name="exclude" value="">
<input type="hidden" name="format" value="builtin-long">
<input type="hidden" name="restrict" value="/rockefeller">
Search Rockefeller College for:
<br>
<input type="text" size="21" name="words" value="">
<input type="submit" value="Search">
</form>
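
Because the form uses method="get", the browser sends each field as a query parameter on the action URL. The same search can therefore be written as an ordinary link, which can help when testing a restrict or exclude value. The sketch below assumes a hypothetical search term, "budget". All other names and values come from the form above.

```html
<!-- Hypothetical link that runs the same search as the form above for the
     example term "budget". Field names and values are taken from the form;
     "/" is written as %2F, as a browser encodes it on submission. -->
<a href="http://search.albany.edu/cgi-bin/htsearch?exclude=&amp;format=builtin-long&amp;restrict=%2Frockefeller&amp;words=budget">
  Search Rockefeller College for "budget"
</a>
```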

The key fields for controlling the scope of the search are the exclude and restrict hidden fields.

The restrict field limits results to pages whose web address matches a pattern. In this case, only pages that match the search terms and contain /rockefeller as part of their URL (web address) are returned. You can broaden this scope by listing additional URL patterns in the restrict value, separated by a pipe "|".

For example:
<input type="hidden" name="restrict" value="/rockefeller|/www.pdp.albany.edu">
This returns results from both the Rockefeller College and the Professional Development Program web sites.

You can prevent sub-sections of your site from returning results by using the exclude field. In the first example above, the exclude value is empty, so no exclusions apply and any matching results within /rockefeller are returned.

However, you can define an exclude value along with the restrict value:
<input type="hidden" name="restrict" value="/rockefeller">
<input type="hidden" name="exclude" value="/rockefeller/pos|/rockefeller/career">

This returns results from Rockefeller College unless they are in the Political Science or Career sections of the Rockefeller site.

Note: Because we index many web servers on campus, it is best to use restrict values that include more of the web address, so that only the results you intend are returned. For example, if you restrict results to "baseball", you may get results from many different servers, or from other web sites on the same server. The Athletics Department would instead limit results to its baseball pages with the restrict value "www.albany.edu/sports/mens/baseball/".
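
A form using the longer restrict value from the note above might look like the following sketch. It is not the Athletics Department's actual form. The label text is hypothetical, and the action URL, empty exclude field, and format value are copied from the Rockefeller example.

```html
<!-- Sketch only: the label text is a hypothetical placeholder. -->
<form method="get" action="http://search.albany.edu/cgi-bin/htsearch">
<input type="hidden" name="exclude" value="">
<input type="hidden" name="format" value="builtin-long">
<!-- A long, specific value avoids matching unrelated URLs that merely
     contain the word "baseball". -->
<input type="hidden" name="restrict" value="www.albany.edu/sports/mens/baseball/">
Search Men's Baseball for:
<br>
<input type="text" size="21" name="words" value="">
<input type="submit" value="Search">
</form>
```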