''The Mechanism of the Search Engine'' by TakuyaMurata

'''Foreword'''

I organized this paper in chronological order. First I talk about the history, then about what happens in the back end after the user posts a search query.

'''SearchEngineHistory'''

The history of search engines is good material for understanding why they are needed and how they work. The Internet started, as the name indicates, as a network connecting various networks. At first, URL addresses were used to express the location of documents available for remote use; the naming was mostly institutional or geographical, and it was good enough for that job. For instance, if you needed a paper from the University of California, Berkeley, you just went to berkeley.edu. There was basically no need for searching. Some years later, some sort of directory became necessary to find information, like the index in a library. The WorldWideWeb is a way for the user to easily reach the information they want with simple mouse clicks, using hypertext technology (see MicrosoftEncarta).

In the mid-1990s the Internet was commercialized and the number of web pages increased astronomically; it then became seemingly impossible to find anything without some kind of database that indexed the web's pages, like a phone directory. There were a number of such databases, some academic, some commercial. Yahoo is one of the classic and most popular sites that tried to index as many web sites as possible. Most of them, including Yahoo, used a directory model, organizing web pages into corresponding categories; it resembles the catalogs in a library. For instance, Yahoo has a link to Saint Mary's University of Minnesota at the location "Home > Regional > U.S. States > Minnesota > Cities > Winona > Education > College and University > Private". Sites like Yahoo are often called directories, since they resemble the yellow pages.

As the number of pages kept increasing, steadily but considerably, indexing web sites by hand became unrealistic. Most of the time, new pages are born faster than pages are added to the databases, and pages are fundamentally modified, or even die, faster than the databases can reflect their new status. Besides, categorizing web pages in the conventional way also became complicated as bulk loads of new pages were added every day; the categories became deeper and deeper, as in the example above. Some search engines therefore started to provide text searching with keywords. These are the so-called robot search engines. Basically, you get results containing all of the sites that include the keyword you supplied. As you might guess, you are quite likely to get an enormous number of sites, particularly if the keyword is as common as "Internet". AltaVista, from Digital Equipment Corporation, is a classic instance; it was initially designed to show off the performance of DEC's computers.

Google is the newest and latest generation in the history of search engines. Although it is just one search engine among many, Google is currently the most popular and is considered the most powerful. Some young geeks even use Google as a verb, as in "I google when writing a research paper". Given that popularity, most new search engines after Google became Google-like: a clean interface, fast searching, organized search results. Aspseek is an instance (http://www.aspseek.com/). In this paper I focus largely on the mechanism of Google rather than on traditional engines like AltaVista, since Google's approach has become quite common among the new generation of search engines.
'''GatheringPages'''

Whatever the user wants to retrieve from a search engine, you first have to build a database that indexes as many sites on the Internet as possible. The process of gathering web pages exhaustively is called crawling. At the very beginning, when you have nothing in hand, you have to start somewhere (where exactly does not matter); then you just trace the links to other pages. Once you have enough coverage of web pages, you revisit them to keep them up to date, and whenever you detect an unknown link, it likely points to a new page or a page you have not yet visited. Given the immense number of web pages, robots usually revisit sites only at intervals of weeks or a couple of months.

Robots are often also called "spiders". "Robot" is sometimes shortened to "bot", a term that often refers to simple but persistent programs; the robot of Google, for instance, is called Googlebot. Crawling itself is quite a simple process: you do not have to manipulate pages at all; all you have to do is store them in your database along with the terms that appear in them (a minimal sketch of this process appears at the end of the DatabaseSearching section below). Some search engines, Google among them, provide a way for the user to access the pages stored in their databases, under the name of a cache.

Robots.txt is a file that specifies a site's policy about crawling by search engines and is stored at the same location where the site sits. In it you can specify that your pages should not appear in search engine results. Crawling robots are expected to respect that policy, though it is easy to ignore the file.

'''IssueQuery'''

The searching of web pages starts when the user submits a new search to the search engine with keywords. The request is often called a query, a word borrowed from the study of databases. Query messages are often more than just a sequence of keywords. Boolean operations are widely used to control the results of a search when multiple words are supplied: AND means the results must contain both keywords, the one preceding AND and the one following it; in the same spirit, OR means the results may contain either keyword, the one preceding OR or the one following it (a toy query-evaluation sketch also appears below). Some search engines, such as Google and AltaVista, support spell checking to improve the search results. [Some engines also support NOT and NEAR. -- AmitPatel]

Articles, prepositions, and pronouns are typically treated as stop words, which are words simply ignored by search engines. There is little need to discuss why they are ignored once you think of what you would get if you issued a query with a keyword like "the". [Stop words are often ignored because user queries like "who created wikiwiki?" might have good answers that do not contain the word "who". -- AmitPatel] Like other search engines, Google also supports exact phrase search.

'''DatabaseSearching'''

After the search engine site receives a query from the user, it posts a request to the database. The structure and implementation of the database depend largely on the vendor that owns it, and its details are usually secret. Typically the database is exceptionally huge and is spread across a cluster of multiple computers, often many. The vendors most likely use customized software rather than patch together existing tools or libraries; Google, for instance, uses Linux-based personal computers to build the cluster that supports its database.

Most search engines support caching to dramatically reduce the time spent searching for common words like "Amazon". If the site receives a query whose result is stored in the cache, it returns the result from the cache without posting any request to the main database.
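Since the sections above walk through crawling, querying, and caching in turn, here are three small sketches in Python, in the same order. First, the crawling process from GatheringPages: start from a seed, trace links breadth-first, store each page's terms, and check robots.txt before fetching. This is only an illustration under assumed simplifications (the agent name "toy-crawler", the page limit, and the crude term extraction are all made up); real crawlers are far more careful about politeness, scale, and errors.

 # Minimal crawler sketch: seed, frontier, visited set, robots.txt check.
 import re
 import urllib.request
 import urllib.robotparser
 from html.parser import HTMLParser
 from urllib.parse import urljoin, urlparse

 class LinkExtractor(HTMLParser):
     """Collects the href attribute of every <a> tag on a page."""
     def __init__(self):
         super().__init__()
         self.links = []

     def handle_starttag(self, tag, attrs):
         if tag == "a":
             for name, value in attrs:
                 if name == "href" and value:
                     self.links.append(value)

 def allowed_by_robots(url, agent="toy-crawler"):
     """Respect the site's robots.txt policy, as polite robots should."""
     parser = urllib.robotparser.RobotFileParser(urljoin(url, "/robots.txt"))
     try:
         parser.read()
     except OSError:
         return True                    # no readable robots.txt; assume allowed
     return parser.can_fetch(agent, url)

 def crawl(seed, max_pages=10):
     index = {}                         # url -> set of terms on that page
     frontier, visited = [seed], set()
     while frontier and len(index) < max_pages:
         url = frontier.pop(0)
         if url in visited or not allowed_by_robots(url):
             continue
         visited.add(url)
         try:
             with urllib.request.urlopen(url, timeout=10) as response:
                 html = response.read().decode("utf-8", errors="replace")
         except OSError:
             continue                   # dead or unreachable page; skip it
         index[url] = set(re.findall(r"[a-z0-9]+", html.lower()))
         extractor = LinkExtractor()
         extractor.feed(html)
         for link in extractor.links:   # trace links to pages not yet visited
             absolute = urljoin(url, link)
             if urlparse(absolute).scheme in ("http", "https"):
                 frontier.append(absolute)
     return index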
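Next, query evaluation as described under IssueQuery: an inverted index maps each term to the set of pages containing it, stop words are dropped, and AND/OR combine the per-keyword hits. The three sample "pages" and the tiny stop word list are made up for the example.

 # Toy boolean query evaluation over an inverted index.
 STOP_WORDS = {"the", "a", "an", "of", "in", "who"}

 pages = {
     "page1": "the history of the internet",
     "page2": "the mechanism of a search engine",
     "page3": "search results in the internet age",
 }

 # Inverted index: term -> set of pages that contain the term.
 inverted = {}
 for url, text in pages.items():
     for term in text.split():
         if term not in STOP_WORDS:
             inverted.setdefault(term, set()).add(url)

 def search(query):
     """Evaluate a flat query like 'internet AND history' left to right."""
     result, op = None, "AND"
     for token in query.lower().split():
         if token in ("and", "or"):
             op = token.upper()
         elif token not in STOP_WORDS:       # stop words are simply ignored
             hits = inverted.get(token, set())
             if result is None:
                 result = hits
             elif op == "AND":
                 result = result & hits      # must contain both keywords
             else:
                 result = result | hits      # may contain either keyword
     return result or set()

 print(search("internet AND history"))      # {'page1'}
 print(search("mechanism OR results"))      # {'page2', 'page3'}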
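Finally, the caching just described under DatabaseSearching is, at its simplest, a lookup in front of the main database. The query_database function here is a hypothetical stand-in for the real (secret, vendor-specific) back end.

 # Toy query cache: answer repeated queries without touching the database.
 cache = {}

 def query_database(query):
     """Hypothetical stand-in for the expensive search in the main database."""
     return ["result for " + query]

 def cached_search(query):
     key = query.lower().strip()        # "Amazon" and "amazon" share an entry
     if key in cache:
         return cache[key]              # hit: no request to the main database
     result = query_database(key)       # miss: do the expensive search...
     cache[key] = result                # ...and remember it for next time
     return result

 cached_search("Amazon")                # first time: goes to the database
 cached_search("amazon")                # second time: served from the cache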
After the search engine receives the result from the main database or the cache, the site has to display the result to the user. The listing of results is usually quite simple: it just lists the web pages that were hit, along with a description of each site. However, the order of the list is important, yet difficult to judge by pure computation. Users usually look at only a couple of sites from the top, even though they may have gotten hundreds of thousands (Brandt). See the PageRanking section below for detail. Some search engines support not only text search but also image or news search; the gathering of information and the mechanism of searching are almost identical to those of text search.

'''PageRanking'''

Ranking is another difficult but interesting theme in search engines. Nowadays the difference among search engines lies almost entirely in the order of the search results, since most search engines return nearly the same results. Google introduced its ranking system, the so-called PageRank. Before that, most search engines used information such as how often keywords appear on a site, or whether keywords are displayed in the title or in bold fonts rather than in plain text. Google instead organizes the order of sites in terms of its ranking: the bigger a site's ranking, the closer to the top the site is displayed. In reality the algorithm is unrevealed, so I cite the description from Google, which is the only accurate information about PageRank (a small illustrative sketch appears at the bottom of this page):

PageRank relies on the uniquely democratic nature of the web by using its vast link structure as an indicator of an individual page's value. In essence, Google interprets a link from page A to page B as a vote, by page A, for page B. But, Google looks at more than the sheer volume of votes, or links a page receives; it also analyzes the page that casts the vote. Votes cast by pages that are themselves "important" weigh more heavily and help to make other pages "important." (Our Search: Google Technology, http://www.google.com/technology/index.html)

You can find out the rank, out of 10, of a specific page using the Google toolbar. Mostly the rank reflects the popularity and authority of the site. For instance, the rank of www.cnn.com is 9, that of www.smumn.edu is 6, and that of blogan.com (just my personal homepage) is 0.

[It is not true that Google introduced the ranking system. Other search engines had ranking systems as well. What Google introduced was a global, web-wide ranking of pages based on linkage. Keywords, titles, and bold fonts are used in Google as well; it's just that PageRank is one additional factor taken into account. -- AmitPatel]

'''SearchEngineNotes'''

[Brandt] Daniel Brandt, "PageRank: Google's Original Sin", Google Watch, http://www.google-watch.org/pagerank.html

[Encarta] Microsoft Encarta Encyclopedia Standard 2002

----

Web crawling is an amazingly stupid waste of resources. Though it is wasteful, Google can provide its caching feature quite easily, simply because Google's database has copies of the web pages. -- TakuyaMurata

If Google is going to cache the entire contents of the Internet anyway, why doesn't Google simply provide free hosting for everyone and ''become'' the Internet rather than a copy of it?

Actually, I believe it is not such a bad idea. As you pointed out, Google hosts tons of pages as caches. Some of the originals are even dead. -- TakuyaMurata

They have the storage to be the Internet, but storage is cheap. They'd need a lot more bandwidth, and bandwidth is expensive. Web crawling doesn't seem terribly expensive compared to human surfers.
If a spider visits each page once a month, and that's expensive, then that means I must be getting fewer than one human visitor per page per month on average. Also, most web crawlers don't fetch images, JavaScript, style sheets, Flash, music, animation, and all those other things that eat up so much bandwidth. -- AmitPatel

----

Can anyone recommend, from hands-on experience, a software-based tool for intranet indexing and searching? We're looking at around 150,000 web pages/documents (including PDF). Thanks.

''The tool I used to use is long gone now, but where I work uses this (not sure which variant) and apparently everyone's happy with it: http://www.google.com/enterprise/search/index.html''

We tried that, but installing a new server box for each update creates havoc. We want a software-only solution. We are also limited to Windows servers, for good or bad. Thanks.

The Google Search Appliance is a pain in the arse. I've worked with two versions about 3 years apart, and they both suck. They have a lot of features, but never the ones I need, or they don't work as expected. Who the heck is their audience?
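----

Since Google's actual PageRank algorithm is unpublished, here is a minimal sketch of the standard power-iteration formulation suggested by the description quoted under PageRanking above: a link from A to B is a vote for B, and votes from pages that are themselves important weigh more. The four-page link graph is made up, and the damping factor 0.85 is the commonly cited value from the original PageRank paper, not something confirmed about Google's production system.

 # Toy PageRank by power iteration (Python).
 links = {
     "A": ["B", "C"],   # page A links to (votes for) B and C
     "B": ["C"],
     "C": ["A"],
     "D": ["C"],
 }

 def pagerank(links, damping=0.85, iterations=50):
     pages = list(links)
     rank = {page: 1.0 / len(pages) for page in pages}
     for _ in range(iterations):
         new_rank = {page: (1.0 - damping) / len(pages) for page in pages}
         for page, outgoing in links.items():
             for target in outgoing:
                 # A page splits its vote evenly among its outgoing links, so
                 # a vote from a high-ranked page carries more weight.
                 new_rank[target] += damping * rank[page] / len(outgoing)
         rank = new_rank
     return rank

 for page, score in sorted(pagerank(links).items(), key=lambda kv: -kv[1]):
     print(page, round(score, 3))   # C comes out on top: everyone votes for it

This is the same intuition behind the toolbar example above: a heavily linked site like www.cnn.com scores high, while an unlinked personal homepage scores 0.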