Copyright 2000
The Importance of Intranet Search Engines
and Techniques to Improve Search Results
ABSTRACT
Search engines have become extremely important tools whose
accuracy and scope are critically important to users. A Web site is
useless as an information resource if users cannot find the
information that they are looking for quickly and efficiently. To help
improve finding information on the Web, this paper will examine 1)
the importance of Intranet search engines, 2) how to prepare pages
for searching, and 3) how to build more powerful searches.
INTRODUCTION
With the wide use of the Web today, Intranet and Internet search
engines have become extremely important tools, and users depend on
their accuracy and scope. No matter how well sites are
designed and organized, finding information would be much more
tedious and time-consuming if search engines did not exist. These
tools allow users to type keywords into a form, press a search button,
and get a list of documents that match the keywords. As simple as
this process may sound, many relevant documents may go
unexamined because of incomplete indexing of documents, poorly
formatted documents, and unspecific queries. To help improve
finding information on the Web, this paper will examine 1) the
importance of Intranet search engines, 2) how to prepare pages for
searching, and 3) how to build more powerful searches.
THE IMPORTANCE OF INTRANET SEARCH ENGINES
Intranet Web sites are a great way to disseminate information both
internally and externally. However, a resourceful Web site is useless
if users cannot find the information that they are looking for,
resulting in lost time and lost opportunities. If users cannot find
what they are looking for, they will probably move on to another site.
Even Web sites that only contain a few pages should be searchable.
Assuming that a site is searchable because it has been submitted and
indexed by a major Internet search engine (e.g., AltaVista, HotBot,
and Lycos) can be a costly mistake. Due to the large number of
documents on the Web in comparison to the small number of
documents these external search engines have indexed, it is unlikely
that a corporate Web site has been completely indexed by one of
these major search engines. For example, AltaVista, which claims to
be the most comprehensive search engine, has 140 million pages
indexed ("About AltaVista," 1999). However, the Web is currently
estimated to contain over 1 billion documents ("Important Things to
Know," 1999). This implies that approximately 86% of all
documents on the Web cannot be located directly using AltaVista.
Similarly, HotBot has indexed 110 million pages on the Web
(approximately 11%) leaving 89% unindexed ("Wired Digital",
1999). To illustrate the incomplete indexing of documents,
HotBot's Check URL application
(http://www.hotbot.com/help/checkurl.asp) was run against a
sample site. The Check URL application verifies whether a URL has been
indexed in HotBot's database. The site chosen for testing was the
Web site for the College of Business and Industry at Northeastern
State University (http://arapaho.nsuok.edu/~cbi). HotBot's
Check URL application returned information indicating that only 4
documents at this site have been indexed. With 14 cross-linked
documents available at this highly visible site, HotBot is currently
only indexing 29% of the available documents. While this indexing
percentage is much higher than that of the average Web site, it illustrates
that time might be wasted and opportunities lost if the College of
Business and Industry relied exclusively on external search engines
such as HotBot to completely index their Web site.
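The coverage percentages quoted above follow from simple division. The short sketch below reproduces them; the function names are illustrative only, and because the Web is estimated at "over" 1 billion documents, the unindexed shares should be read as lower bounds.

```python
def indexed_share(indexed, total):
    """Percentage of documents that a search engine has indexed."""
    return 100.0 * indexed / total


def unindexed_share(indexed, total):
    """Percentage of documents that a search engine has not indexed."""
    return 100.0 - indexed_share(indexed, total)


# Figures quoted in the text; the Web total is the 1 billion estimate.
WEB_TOTAL = 1_000_000_000

print(round(unindexed_share(140_000_000, WEB_TOTAL)))  # AltaVista, unindexed %
print(round(unindexed_share(110_000_000, WEB_TOTAL)))  # HotBot, unindexed %
print(round(indexed_share(4, 14)))                     # sample site, indexed %
```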
Based on these numbers, it is obvious that an Intranet search engine
must be incorporated into a Web site to ensure complete indexing.
However, before installing an Intranet search engine, you should
carefully consider 1) the types of queries that will be performed and
the types of results that you want returned, 2) the ease of
maintenance and installation, and 3) the price.
Queries/Results
The queries that can be submitted to Intranet search engines and the
results returned by them can vary drastically. Below are some
important features that should be considered. Good search engines
should have the ability to:
- Produce a generated summary of the important text.
- Include common misspellings in search queries.
- Retrieve search results quickly.
- Perform Natural Language Query (a query expressed by
typing English, Spanish, or some other natural language in a
normal manner).
- Refine searches (i.e., search a subset of your query results).
- Search numeric, alphanumeric, and special characters.
- Access the full set of Boolean operators.
- Perform word stemming (query of "graphic" will also return
documents containing "graphics" and "graphical").
- Detect and eliminate duplicate documents.
- Rank results by the date a document was created or last
modified.
- Index documents across firewalls and password protected
sites.
Maintenance and Installation
The tools for maintaining and installing an Intranet search engine
vary drastically from one search engine to the next. Commercial
versions are typically installed and maintained using graphical user
interfaces (GUIs), whereas non-commercial versions are typically
maintained by editing configuration files.
Price
Below is a cost comparison of four Intranet search engines. The
price for these tools can range from free to very expensive. As
shown in this table, the price for commercially available search
engines is typically based on the number of documents that can be
indexed.
| Search Tool | Number of documents that can be indexed | Cost |
| WebGlimpse (http://glimpse.cs.arizona.edu/webglimpse) | Unlimited | Free |
| Harvest (http://harvest.transarc.com) | Unlimited | Free |
| AltaVista Search Intranet (http://altavista.software.digital.com) | 3,000 | Free |
| | 100,000 | $29,995 |
| | 1,000,000 | $99,995 |
| InfoSeek UltraSeek Intranet Server (http://www.ultraseek.com) | 10,000 | $4,995 |
| | >10,000 | Contact their sales group |
PREPARING PAGES FOR SEARCHING
When formatting or editing pages for the Web, keep search results
in mind. A properly formatted document will be indexed better and,
consequently, will be easier to find through searches.
In particular, this section explains how
search engines index information in specific HTML tags. By
knowing this information, you will be able to edit and create more
searchable documents.
Page Titles
Most search engines rank the information in the title of a document
higher than the information found in the body of a document.
Hence, it is important that a descriptive title be provided for each
document. The title should provide a little context as well as a
specific topic for the document. For example, if this document were
placed on the Web, either one of the following titles would be
accurate. However, the longer title tells the user exactly what to
expect when this document is viewed.
<title>The Importance of Intranet Search Engines
and Techniques to Improve Search Results</title>
Or
<title>Search Engines</title>
Meta Descriptions and Keywords
Many search engines display the meta description as part of
the results page. If a meta description is not provided, search
engines typically display the first 50-70 words of a document.
Unfortunately, the beginning of a document might only contain
information about the author. By providing a meta description, a
search engine will more likely display a summary of the document
rather than simply the first few sentences. Below is an example of a
meta description for this document.
<META NAME="description" CONTENT="Description of the importance
of Intranet search engines, preparing pages for searching, and how
to build more powerful searches.">
Meta keywords are also an important part of a Web page. A
good set of keywords should cover the topics mentioned in the
document. Below is an example of meta keywords for this
document.
<META NAME="keywords" CONTENT="Intranet, search engines,
Web, Internet, indexing, retrieval, Boolean operators">
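The fallback behavior described in this section can be sketched in a few lines of Python using the standard html.parser module. This is only an illustration of the idea, not how any particular search engine is implemented; the class name SummaryParser, the function result_summary, and the 50-word default are assumptions made for the example.

```python
from html.parser import HTMLParser


class SummaryParser(HTMLParser):
    """Collect the meta description and the visible words of a page."""

    def __init__(self):
        super().__init__()
        self.description = None
        self.words = []
        self._skip = 0  # nesting depth inside <title>, <script>, <style>

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.description = attrs.get("content")
        elif tag in ("title", "script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("title", "script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.words.extend(data.split())


def result_summary(html, fallback_words=50):
    """Return the meta description, or the first words of the page if absent."""
    parser = SummaryParser()
    parser.feed(html)
    parser.close()
    if parser.description:
        return parser.description
    return " ".join(parser.words[:fallback_words])
```

One practical consequence of this kind of lookup is that a misspelled NAME attribute is silently ignored, so the page falls back to its opening words even though the author wrote a summary.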
Headings
Many search engines also use headings (i.e., <h1> through <h6>) to
rank the relevance of a document for a particular query. They
assume that words in headings are more important than the words in
text. Hence, when possible, place main concepts and ideas into
HTML heading tags.
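The effect of ranking title and heading words above body words can be illustrated with a small scoring function. The weights below are hypothetical, since search engines do not publish theirs, and the document layout (a dictionary with "title", "heading", and "body" keys) is an assumption made for the example.

```python
import re

# Hypothetical weights, chosen only to show the idea.
FIELD_WEIGHTS = {"title": 3.0, "heading": 2.0, "body": 1.0}


def tokenize(text):
    """Split text into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score(document, query):
    """Sum the weighted number of query-term occurrences in each field."""
    terms = set(tokenize(query))
    total = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        tokens = tokenize(document.get(field, ""))
        total += weight * sum(1 for token in tokens if token in terms)
    return total


doc_a = {
    "title": "The Importance of Intranet Search Engines",
    "heading": "Preparing Pages for Searching",
    "body": "How to index pages.",
}
doc_b = {
    "title": "Home Page",
    "heading": "",
    "body": "Our search engines page lists search tools.",
}

print(score(doc_a, "search engines"), score(doc_b, "search engines"))
```

Under these assumed weights, the page that names its topic in the title outranks the page that mentions the same words more often in its body.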
Register your URL with the Search Engines and Directories
The final step in preparing pages for searching is to plan and
implement an awareness-building campaign for your Web site. At a
minimum, this should include submitting your URL to the major
search engines and directories for indexing. One easy approach is to
use a commercial service such as Submit-It
(http://www.submit-it.com).
HOW TO BUILD MORE POWERFUL SEARCHES
Despite differences in search engines, they have many searching
characteristics in common that can be used to build more powerful
searches. Below are three general search tips that will result in more
relevant documents being returned.
- Perform "phrase searching." Sometimes the order of the search
terms matters. By using phrase searching, you can greatly reduce
the number of documents that match a search query. For example,
if you performed a phrase search for "The Golden Gate Bridge," you
would get a list of documents that contain all four words in that
order.
- Use specific keywords as opposed to general ones. For example,
"Purple Martins" will return much more specific results than
"birds."
- Incorporate Boolean operators into your search. Boolean
operators let you combine search terms using logical
relationships. Below is
a list of Boolean operators and other search features that will help
produce more powerful search expressions.
AND
Joining search terms with the AND operator tells the search engine
that only documents containing all the terms should be returned. For
example (heart AND transplant) finds documents with both
the word heart and the word transplant. Note: On some
search engines, a plus sign (+) can be used to indicate an AND
operation.
OR
Joining search terms with the OR operator tells the search engine that
documents containing any of the terms or phrases should be
returned. For example (nearsighted OR myopic) finds
documents containing either the word nearsighted or the word
myopic. The returned documents could contain both of the
keywords or just one.
NOT
The NOT operator excludes unwanted documents containing the
specified terms or phrases. For example, (heart AND attack
NOT transplant) would find documents on heart attacks, but
would not return documents on heart transplants. On some search
engines, a minus sign (-) can be used to indicate a NOT operation.
When using AltaVista, the NOT operator cannot stand alone. It must
be used in conjunction with another operator such as AND or OR. If
using AltaVista, the query above would be phrased: (heart AND
attack AND NOT transplant).
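The AND, OR, and NOT operators correspond to set intersection, union, and difference over the sets of documents that contain each word. The following minimal sketch uses a tiny in-memory index; the four sample documents and the helper names are invented for illustration and do not reflect how any particular search engine stores its index.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Map each lowercase word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(doc_id)
    return index


# Invented sample documents.
DOCS = {
    1: "Recovery after a heart transplant",
    2: "Warning signs of a heart attack",
    3: "Heart attack risk after a transplant",
    4: "Nearsighted or myopic vision",
}
INDEX = build_index(DOCS)


def docs(word):
    """Return the ids of documents containing the word (empty set if none)."""
    return INDEX.get(word.lower(), set())


# heart AND transplant
both = docs("heart") & docs("transplant")
# nearsighted OR myopic
either = docs("nearsighted") | docs("myopic")
# heart AND attack AND NOT transplant
attack_only = (docs("heart") & docs("attack")) - docs("transplant")

print(both, either, attack_only)
```

Note that NOT removes every document containing the excluded word, so document 3 is dropped even though it is about heart attacks.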
Wild Cards/Word Stemming
When used at the end of a word, the asterisk (*) functions like a wild
card. It broadens a search to include extensions and plurals of the
word. For example: consult* would match consults, consultant,
consulted, and consulting.
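A trailing asterisk amounts to a prefix match against the words in the index. The sketch below assumes a simple list of indexed words; the function name and the vocabulary are made up for the example.

```python
def wildcard_match(pattern, vocabulary):
    """Return the words matching a pattern with an optional trailing asterisk."""
    if pattern.endswith("*"):
        prefix = pattern[:-1].lower()
        return sorted(w for w in vocabulary if w.lower().startswith(prefix))
    # Without an asterisk, only an exact (case-insensitive) match counts.
    return sorted(w for w in vocabulary if w.lower() == pattern.lower())


# Invented vocabulary; "console" and "insult" should not match.
VOCAB = ["consults", "consultant", "consulted", "consulting", "console", "insult"]

print(wildcard_match("consult*", VOCAB))
```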
NEAR
The NEAR operator finds documents in which both specified
keywords appear close to each other. For example,
(constitution NEAR "United States") would find
documents containing the phrase "the Constitution of the United
States" or the phrase "United States Constitution". When using the
NEAR operator in Lycos, the words must appear within 25 words of
each other in the resulting documents. However, when using the NEAR
operator in AltaVista, the words must appear within 10 words of
each other.
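A proximity test of this kind can be sketched by comparing word positions. The example below is a simplification: it measures the distance between the starting positions of the two terms, which may differ from how Lycos or AltaVista actually count, and the window parameter stands in for the 25-word and 10-word limits mentioned above.

```python
import re


def positions(tokens, phrase):
    """Return the start positions where the phrase's words occur in sequence."""
    words = phrase.lower().split()
    n = len(words)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == words]


def near(text, left, right, window=10):
    """True if `left` and `right` start within `window` words of each other."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    left_positions = positions(tokens, left)
    right_positions = positions(tokens, right)
    return any(abs(a - b) <= window
               for a in left_positions
               for b in right_positions)


close = "the Constitution of the United States"
# 15 filler words separate the two terms in this invented text.
far = "constitution " + "filler " * 15 + "united states"

print(near(close, "constitution", "United States"))
print(near(far, "constitution", "United States", window=10))
print(near(far, "constitution", "United States", window=25))
```

The second and third calls show why the same query can return different documents on engines with different proximity limits.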
Parentheses
Parentheses can be used to ensure that the operators are evaluated in
the desired order. For example, the parentheses in the query ("Lasik
surgery") AND (astigmatism OR nearsighted) will ensure that the
OR operation is performed before the AND operation. This query
would find documents with the phrase Lasik surgery, and either
astigmatism or nearsighted or both.
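The effect of the parentheses can be seen by treating each term as a set of matching document ids. The ids below are invented. Python's set operators happen to evaluate & (AND) before | (OR) when no parentheses are given; individual search engines may apply a different default order, which is exactly why explicit grouping is safer.

```python
# Invented document ids for each search term.
lasik = {1, 2}        # documents containing the phrase "Lasik surgery"
astigmatism = {1, 5}
nearsighted = {2, 7}

# ("Lasik surgery") AND (astigmatism OR nearsighted)
grouped = lasik & (astigmatism | nearsighted)

# Without parentheses, the AND is evaluated first here, so every
# document mentioning "nearsighted" is returned, Lasik or not.
ungrouped = lasik & astigmatism | nearsighted

print(grouped, ungrouped)
```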
HTML Tags
Many search engines also incorporate features to restrict searches to
specific parts of a Web page. For example, typing
title:hypertension will retrieve only documents that have
the word hypertension in their title. Below is a partial listing of
other words that can be specified on various search engines.
| url: | Returns documents containing the specified URL. |
| image: | Detects image files (GIF, JPEG, etc.). |
| link: | Returns documents containing a link to the specified URL. |
Find Similar
Excite and Magellan currently support a feature that searches the Web
based on an already retrieved document rather than on keywords.
Instead of using keywords, the search engine can use a document just
viewed as an example in the next search. The new search should
then find documents that are very similar to the one previously
viewed. In Excite this feature is called "MORE like this link" and in
Magellan this feature is called "Find Similar."
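One common way to search by example is to compare word-frequency vectors using cosine similarity. The sketch below illustrates that general technique only; it is not a description of how either of these search engines implements the feature, and the function names and sample documents are invented.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Count how often each lowercase word occurs in the text."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def find_similar(example, candidates, top=3):
    """Rank candidate document ids by similarity to the example document."""
    example_vector = term_vector(example)
    ranked = sorted(
        candidates,
        key=lambda doc_id: cosine(example_vector, term_vector(candidates[doc_id])),
        reverse=True,
    )
    return ranked[:top]


# Invented candidate documents.
CANDIDATES = {
    "a": "intranet search engine indexing",
    "b": "purple martins are birds",
    "c": "search engine ranking",
}

print(find_similar("intranet search engine", CANDIDATES))
```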
CONCLUSION
Search engines have become an essential tool for both the Internet
and Intranets. A resourceful Web site is useful only if people can
find information quickly and efficiently. While the
Web and search engines will continue to evolve, this study has
indicated the importance of Intranet search engines, how to prepare
pages for searching, and how to build more powerful searches.