Thursday, October 9, 2008

Reading Notes: Week 7: Access in Digital Libraries

Lesk, Chapter 4.
This chapter surveys the varied and disparate non-textual materials that turn up in digital archives. As Lesk comments, it’s not all text pages! Not anymore, anyway! He runs through four main categories: sound formats; pictures (indexed and searched by color, texture, and shape); speech (more difficult to index and search than images); and moving images (an active research area, with no contemporary solution affordable for library use). Lesk discusses how these items are indexed, along with the problems that come up in searching them and some proposed solutions.

David Hawking, Web Search Engines: Part 1 and Part 2 IEEE Computer, June 2006.
In 1995 there was much speculation about the vastness of the web and the inability of any engine to search even a usable portion of it. Yet today the Big Three (Google, Yahoo, and Microsoft) each handle about a billion queries a day in over a thousand languages worldwide. This article explores the issues these major search engines encounter and the techniques they use to resolve them.
INFRASTRUCTURE - Large search engines operate numerous geographically dispersed data centers. Services in these data centers are built up from clusters of commodity PCs, and individual servers can be dedicated to specialized tasks: crawling, indexing, query processing, snippet generation, link-graph computations, result caching, and insertion of advertising content. The amount of web data that search engines crawl and index is on the order of 400 TB. Crazy!
CRAWLING ALGORITHMS - A crawler uses a queue of URLs to be visited and a system for determining whether it has already seen a URL. This requires huge data structures; a list of 20 billion URLs alone comes to about a terabyte of data. The crawler initializes its queue with "seed" URLs and proceeds by making an HTTP request to fetch the page at the first URL in the queue. When the crawler fetches a page, it scans the contents for links to other URLs, adds each previously unseen URL to the queue, and saves the page content for indexing. This continues until the queue is empty.
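Here is a minimal sketch of that loop in Python (my own illustration, not code from the article; the function names and the naive regex link extraction are mine):

```python
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen
import re

def crawl(seed_urls, max_pages=100):
    """Breadth-first crawl: fetch a page, queue its unseen links, save the content."""
    queue = deque(seed_urls)   # URLs still to be visited
    seen = set(seed_urls)      # at web scale this must be a huge disk-backed structure
    saved = {}                 # url -> page text, handed off to the indexer
    while queue and len(saved) < max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except OSError:
            continue           # unreachable page; move on
        saved[url] = html
        # Naive link extraction; a production crawler uses a real HTML parser.
        for link in re.findall(r'href=["\'](.*?)["\']', html):
            absolute = urljoin(url, link)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return saved
```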

Crawling must address the following issues: Speed, Politeness, Excluded Content, Duplicate Content, Continuous Crawling, and Spam Rejection.
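Two of those, politeness and excluded content, are easy to picture in code. A sketch using Python's standard robots.txt parser (my own illustration; the one-second per-host delay is an arbitrary choice, not a figure from the article):

```python
import time
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser

last_hit = {}  # host -> timestamp of our last request there

def may_fetch(url, agent="example-crawler", delay=1.0):
    """Excluded content: obey robots.txt. Politeness: rate-limit each host."""
    rp = RobotFileParser()
    rp.set_url(urljoin(url, "/robots.txt"))
    try:
        rp.read()                      # fetch and parse the site's robots.txt
    except OSError:
        return False
    if not rp.can_fetch(agent, url):
        return False                   # the site asked crawlers to stay out
    host = urlparse(url).netloc
    wait = delay - (time.time() - last_hit.get(host, 0.0))
    if wait > 0:
        time.sleep(wait)               # never hammer a single server
    last_hit[host] = time.time()
    return True
```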

INDEXING ALGORITHMS - “Search engines use an inverted file to rapidly identify indexing terms—the documents that contain a particular word or phrase” (J. Zobel and A. Moffat, "Inverted Files for Text Search Engines," ACM Computing Surveys, 2006). In other words, for each term the index keeps a postings list of the documents containing it.
REAL INDEXERS - Store additional information in the postings, such as term frequency or positions.
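A toy version of such an index in Python (my own sketch, not from the article; real inverted files are compressed, disk-resident structures rather than in-memory dicts), storing the extra per-posting information mentioned above:

```python
from collections import defaultdict

def build_inverted_file(docs):
    """docs: {doc_id: text}. Returns {term: [(doc_id, freq, positions), ...]}."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        positions = defaultdict(list)
        for pos, term in enumerate(text.lower().split()):
            positions[term].append(pos)
        for term, pos_list in sorted(positions.items()):
            # One posting: the document, the term frequency, and word positions.
            index[term].append((doc_id, len(pos_list), pos_list))
    return index
```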
QUERY PROCESSING ALGORITHMS - Average query length is 2.3 words.
By default, current search engines return only documents containing all the query words.
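That AND semantics amounts to intersecting the postings lists of the query terms. A sketch against the toy index above (again my own illustration, not the article's code):

```python
def and_query(index, query):
    """Return the doc_ids containing ALL query terms (default AND semantics)."""
    terms = query.lower().split()
    if not terms:
        return set()
    # Start with the rarest term so the running intersection stays small.
    terms.sort(key=lambda t: len(index.get(t, [])))
    result = {doc_id for doc_id, _, _ in index.get(terms[0], [])}
    for term in terms[1:]:
        result &= {doc_id for doc_id, _, _ in index.get(term, [])}
        if not result:
            break        # early exit: no document can match anymore
    return result
```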
REAL QUERY PROCESSORS – “The major problem with the simple-query processor is that it returns poor results. In response to the query "the Onion" (seeking the satirical newspaper site), pages about soup and gardening would almost certainly swamp the desired result.”
Result quality can be dramatically improved if the query processor scans and sorts results according to a relevance-scoring function that takes into account the number of query-term occurrences, document length, and other factors. The MSN search engine reportedly takes more than 300 such factors into account.
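The article doesn't give the engines' actual formulas, but BM25 is a classic published relevance function of exactly this shape, weighing term occurrences against document length. A sketch reusing the postings format from the index example above (my own code, not MSN's):

```python
import math
from collections import defaultdict

def bm25_rank(index, doc_lengths, query, k1=1.2, b=0.75):
    """Rank documents by BM25: more query-term occurrences score higher,
    normalized so that long documents don't win by sheer bulk."""
    n_docs = len(doc_lengths)
    avg_len = sum(doc_lengths.values()) / n_docs
    scores = defaultdict(float)
    for term in query.lower().split():
        postings = index.get(term, [])
        if not postings:
            continue
        # Rarer terms carry more weight (inverse document frequency).
        idf = math.log(1 + (n_docs - len(postings) + 0.5) / (len(postings) + 0.5))
        for doc_id, freq, _ in postings:
            length_norm = 1 - b + b * doc_lengths[doc_id] / avg_len
            scores[doc_id] += idf * freq * (k1 + 1) / (freq + k1 * length_norm)
    return sorted(scores.items(), key=lambda kv: -kv[1])
```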
Search engines use many techniques to speed things up – Skipping, Early Termination, Caching, and Clever Assignment of Document Numbers.
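As one example, skipping lets the query processor hop over long runs of a postings list during intersection instead of comparing every entry. A sketch (the fixed skip distance is my simplification; real indexes embed skip pointers in the compressed lists):

```python
def intersect_with_skips(a, b, skip=4):
    """Intersect two sorted doc-id lists, hopping 'skip' entries ahead when
    it's provably safe -- the idea behind skip pointers in postings lists."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            # Everything before a[i+skip] is still below b[j], so jump past it.
            if i + skip < len(a) and a[i + skip] <= b[j]:
                i += skip
            else:
                i += 1
        else:
            if j + skip < len(b) and b[j + skip] <= a[i]:
                j += skip
            else:
                j += 1
    return out
```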


M. Henzinger et al., "Challenges in Web Search Engines," ACM SIGIR, 2002. http://portal.acm.org/citation.cfm?coll=GUIDE&dl=GUIDE&id=792553
Couldn’t get into this site without a subscription.

Question: I’ve tried using the semantic search engine HAKIA and have come up with some perfect hits and some deplorable misses. Do the factors in the Hawking article still apply to semantic searching, or are there different factors involved in such a redesigned engine?
