Thursday, October 9, 2008

Muddiest Point

One of my Digital Library teammates might have beaten me to this question, but . . . is there any issue with the database for the final project having very little coherence? Let me explain a little better. If we are developing our own fictitious d-library, can we pick three media that have no real content connection just so we can practice with those formats?

Reading Notes: Week 7: Access in Digital Libraries

LESK chapter 4.
This chapter discusses the varied and disparate non-textual materials that are involved in digital archives. As Lesk comments, it's not all text pages! Not any more, anyway! It runs through four main categories: sound formats; pictures (indexed by color, texture, and shape); speech (more difficult to index and search than images); and moving images (an active research area, but with no solution yet that is affordable for library use). Lesk discusses the indexing of these items, as well as issues with searching them and solutions to those problems.
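Lesk stays at the conceptual level, but a toy sketch of the "index pictures by color" idea might look like the following. This is my own illustration, not anything from the chapter, and it assumes images have already been decoded into lists of RGB pixel tuples:

def color_histogram(pixels, bins=4):
    # pixels: list of (r, g, b) tuples with values 0-255
    # bucket each channel into `bins` ranges and count pixels per bucket
    hist = [0] * (bins ** 3)
    step = 256 // bins
    for r, g, b in pixels:
        key = (r // step) * bins * bins + (g // step) * bins + (b // step)
        hist[key] += 1
    total = float(len(pixels))
    return [count / total for count in hist]

def similarity(h1, h2):
    # histogram intersection: 1.0 means identical color distributions
    return sum(min(a, b) for a, b in zip(h1, h2))

# two tiny made-up "images": one mostly red, one mostly blue
red_image = [(250, 10, 10)] * 90 + [(10, 10, 250)] * 10
blue_image = [(10, 10, 250)] * 95 + [(250, 10, 10)] * 5
print(similarity(color_histogram(red_image), color_histogram(blue_image)))  # low score

Texture and shape features work the same way in spirit: reduce the image to a compact numeric signature that can be compared and indexed.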

David Hawking, Web Search Engines: Part 1 and Part 2 IEEE Computer, June 2006.
In 1995 there was much speculation about the vastness of the web and the inability of any engine to search even a usable portion of it. And yet today the big three, Google, Yahoo!, and Microsoft, handle roughly a billion queries a day in more than a thousand languages worldwide. This article explores the issues and techniques that these major search engines encounter and resolve.
INFRASTRUCTURE - Large search engines operate numerous geographically dispersed data centers. Services from these data centers are built up from clusters of commodity PCs, and individual servers can be dedicated to specialized tasks: crawling, indexing, query processing, snippet generation, link-graph computations, result caching, and insertion of advertising content. The amount of web data that search engines crawl and index is on the order of 400 TB. Crazy!
CRAWLING ALGORITHMS - A crawler uses a queue of URLs to be visited and a system for determining whether it has already seen a URL. This requires huge data structures; a list of 20 billion URLs amounts to roughly a terabyte of data. The crawler initializes the queue with "seed" URLs. Crawling proceeds by making an HTTP request to fetch the page at the first URL in the queue. When the crawler fetches the page, it scans the contents for links to other URLs and adds each previously unseen URL to the queue. Finally, the crawler saves the page content for indexing. This continues until the queue is empty.
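Hawking describes the crawl loop in prose; a minimal Python sketch of that queue-plus-seen-set loop might look like this (the regex link extraction, the one-second delay, and the page cap are simplifications of my own, not the article's):

import re
import time
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen

def crawl(seed_urls, max_pages=50):
    queue = deque(seed_urls)          # URLs waiting to be visited
    seen = set(seed_urls)             # URLs already queued or fetched
    saved = {}                        # url -> page content, for the indexer

    while queue and len(saved) < max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except Exception:
            continue                  # skip pages that fail to fetch
        saved[url] = html             # save the content for indexing
        # naive link extraction; real crawlers parse the HTML properly
        for href in re.findall(r'href="([^"]+)"', html):
            link = urljoin(url, href)
            if link.startswith("http") and link not in seen:
                seen.add(link)
                queue.append(link)
        time.sleep(1)                 # crude politeness delay between requests
    return saved

A real crawler would also respect robots.txt, detect duplicates, and re-crawl continuously, which is exactly the list of issues below.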

Crawling must address the following issues: Speed, Politeness, Excluded Content, Duplicate Content, Continuous Crawling, and Spam Rejection.

INDEXING ALGORITHMS - “Search engines use an inverted file to rapidly identify indexing terms—the documents that contain a particular word or phrase (J. Zobel and A. Moffat, "Inverted Files for Text Search Engines," to be published in ACM Computing Surveys, 2006).”
REAL INDEXERS - Store additional information in the postings, such as term frequency or positions.
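As a rough illustration of the inverted file with positional postings that the article describes (the toy documents here are made up, and this is just my sketch of the data structure):

from collections import defaultdict

def build_index(docs):
    # docs: {doc_id: text}; index: term -> list of (doc_id, [positions])
    index = defaultdict(list)
    for doc_id, text in docs.items():
        positions = defaultdict(list)
        for pos, term in enumerate(text.lower().split()):
            positions[term].append(pos)
        for term, pos_list in positions.items():
            index[term].append((doc_id, pos_list))
    return index

docs = {1: "the onion satirical newspaper", 2: "onion soup recipe"}
index = build_index(docs)
# index["onion"] -> [(1, [1]), (2, [0])]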
QUERY PROCESSING ALGORITHMS - Average query length is 2.3 words.
By default, current search engines return only documents containing all the query words.
REAL QUERY PROCESSORS – “The major problem with the simple-query processor is that it returns poor results. In response to the query "the Onion" (seeking the satirical newspaper site), pages about soup and gardening would almost certainly swamp the desired result.”
Result quality can be dramatically improved if the query processor scans and sorts results according to a relevance-scoring function that takes into account the number of query-term occurrences, document length, and other factors. The MSN search engine reportedly takes more than 300 factors into account.
Search engines use many techniques to speed things up – Skipping, Early Termination, Caching, and Clever Assignment of Document Numbers.
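Tying the pieces together, here is a toy conjunctive query processor over the index sketch above, with a crude occurrences-over-length relevance score. This is only my illustration of the idea; it is nowhere near what a production engine, with its hundreds of ranking factors and speed tricks, actually does:

def search(query, index, docs):
    terms = query.lower().split()
    postings = [dict(index.get(t, [])) for t in terms]   # term -> {doc_id: positions}
    if not postings:
        return []
    # conjunctive semantics: keep only documents containing every query term
    matching = set(postings[0])
    for p in postings[1:]:
        matching &= set(p)
    # crude relevance score: total query-term occurrences / document length
    def score(doc_id):
        occurrences = sum(len(p[doc_id]) for p in postings)
        return occurrences / len(docs[doc_id].split())
    return sorted(matching, key=score, reverse=True)

print(search("the onion", index, docs))   # -> [1], the satirical-newspaper document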


M. Henzinger et al., "Challenges in Web Search Engines," ACM SIGIR 2002. http://portal.acm.org/citation.cfm?coll=GUIDE&dl=GUIDE&id=792553
Couldn’t get into this site without a subscription.

Question: I’ve tried using the semantic search engine, HAKIA, and have come up with some perfect hits and some deplorable misses. Do the factors in the Hawking article still apply to semantic searching or are there different factors involved in such a redesigned engine?

Friday, October 3, 2008

Readings: Week 6 Preservation in Digital Libraries

Margaret Hedstrom “Research Challenges in Digital Archiving and Long-term Preservation” http://www.sis.pitt.edu/~dlwkshop/paper_hedstrom.pdf

The major Research Challenges:
1. Digital collections are large, multi-media libraries that are growing exponentially. There is currently no method for preservation in light of an exponentially growing collection into which new and variable media are constantly introduced.
2. Digital preservation of these collections bears more similarity to archival programs than to library-oriented issues. There is a need to develop self-sustaining, self-monitoring, and self-repairing collections.
3. Maintaining digital archives over long periods of time is as much an economic, social, and institutional problem as a technological one. And there are no current models for this kind of long-term undertaking.
4. There is a need for tools that automatically supply and extract metadata from resources, then ingest, restructure, and manage that metadata over time, and that become progressively more affordable as the digital archive expands.
5. Future collections will need inexpensive, flexible, and effective infrastructures.

Brian F. Lavoie, The Open Archival Information System Reference Model: Introductory Guide. http://www.dpconline.org/docs/lavoie_OAIS.pdf

The OAIS Reference Model was developed through a joint venture between the Consultative Committee for Space Data Systems (CCSDS) and ISO to address data-handling issues, in particular digital preservation problems.

An archival information system is "an organization of people and systems that has accepted the responsibility to preserve information and make it available for a Designated Community." "Open" means that the model was developed as a public forum for creating a solution, and anyone who wishes to assist or to use it is welcome.

The duties of the OAIS model are:
1. Negotiate for and accept appropriate information from information producers
2. Obtain sufficient control of the information in order to meet long-term preservation objectives
3. Determine the scope of the archive’s user community
4. Ensure that the preserved information is independently understandable to the user community, in the sense that the information can be understood by users without the assistance of the information producer
5. Follow documented policies and procedures to ensure the information is preserved against all reasonable contingencies, and to enable dissemination of authenticated copies of the preserved information in its original form, or in a form traceable to the original
6. Make the preserved information available to the user community

The development group has created a fully detailed conceptual model that explores the digital environment, the functional relationships among management, administration, and the user, and even how data would be packaged in the system. However, this model is just that, only a model; it is a conceptual framework rather than a working implementation.


Jones, Maggie, and Neil Beagrie. Preservation Management of Digital Materials: A Handbook. 2001. http://www.dpconline.org/graphics/handbook/index.html
introduction and digital preservation sections.

A manual developed through the Digital Preservation Coalition that elaborates on preservation management issues. Although designed as an international handbook for digital preservation, it does note that it deals primarily with UK issues, particularly legislative matters. It stays current, however, by posting up-to-date links on preservation topics and sites.

Justin Littman. Actualized Preservation Threats: Practical Lessons from Chronicling America. D-Lib Magazine July/August 2007. http://www.dlib.org/dlib/july07/littman/07littman.html

Chronicling America:
1. to support the digitization of historically significant newspapers.
2. to facilitate public access via a Web site.
3. to provide for the long-term preservation of these materials by constructing a digital repository.
The program has a digital repository component that houses the digitized newspapers, supporting access and facilitating long-term preservation.

Preservation threats encountered: failures of media, software, and hardware. But the worst errors came from operators, i.e., human error such as the deletion of files.
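Littman describes the threats rather than the code behind the repository, but the standard defense against silent media failures and accidental deletions is a periodic fixity audit over stored checksums. A minimal sketch of that idea (the manifest format and file paths here are hypothetical, not anything from the article):

import hashlib
import os

def sha256_of(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def audit(manifest):
    # manifest: {file path: expected SHA-256 hex digest recorded at ingest}
    problems = []
    for path, expected in manifest.items():
        if not os.path.exists(path):
            problems.append((path, "missing"))            # e.g. operator deletion
        elif sha256_of(path) != expected:
            problems.append((path, "checksum mismatch"))  # e.g. media failure
    return problems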

Question: Statistically, is operator error always the worst preservation threat found in digital archives?

Sunday, September 28, 2008

Digital Ice Age.


I came upon this article while doing the second half of Assignment 2. It discusses some of the major disadvantages of digitization as a preservation and archiving tool. Thought it might be worth a look for those who like a dystopian interpretation of digitization. Follow the link here.

Assignment 2: My Flickr Link

My assignment 2 part 1 link. Pictures of random books and junk I own:
http://flickr.com/photos/30778466@N06/

Friday, September 26, 2008

Reading Notes : Week 5 : XML

Martin Bryan. Introducing the Extensible Markup Language (XML) http://burks.bton.ac.uk/burks/internet/web/xmlintro.htm

What is XML?
-a subset of the Standard Generalized Markup Language (SGML)
-designed to make it easy to interchange structured documents over the Internet
-marks where each of the logical parts (called elements) of an interchanged document starts and ends

-XML does not require the presence of a DTD.
-an XML system can assign a default definition for undeclared components of the markup.

XML allows users to:
bring multiple files together to form compound documents
identify where illustrations are to be incorporated into text files, and the format used to encode each illustration
provide processing control information to supporting programs, such as document validators and browsers
add editorial comments to a file.
It is important to note, however, that XML is not:
a predefined set of tags, of the type defined for HTML, that can be used to mark up documents
a standardized template for producing particular types of documents.
XML is based on the concept of documents composed of a series of entities. An entity can contain one or more logical elements, and each element can have attributes (properties) that describe the way in which it is to be processed.

Unlike other markup languages such as HTML and XHTML, XML clearly identifies the boundaries of document parts, whether a new chapter, a piece of boilerplate text, or a reference to another publication.
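A tiny, made-up example of what Bryan describes: elements nested inside a document, attributes on an element, and no DTD required. The element names here are my own invention; the snippet is parsed with Python's standard library:

import xml.etree.ElementTree as ET

# a well-formed XML fragment with no DTD; element names are made up
doc = """
<article id="42">
  <title>Access in Digital Libraries</title>
  <author affiliation="Pitt">A. Student</author>
</article>
"""

root = ET.fromstring(doc)
print(root.tag, root.attrib["id"])        # article 42
for child in root:
    print(child.tag, child.attrib, child.text)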

Uche Ogbuji. A survey of XML standards: Part 1. January 2004. http://www-128.ibm.com/developerworks/xml/library/x-stand1.html
Extending Your Markup: An XML Tutorial, by Andre Bergholz http://www.computer.org/internet/xml/xml.tutorial.pdf
XML Schema Tutorial http://www.w3schools.com/Schema/default.asp

These three sites are tutorials running through examples of XML and its applications. I initially had a difficult time seeing the difference between HTML and XML, but the W3Schools site has a Web Primer page that lists what the average Joe needs to know about site development and links to the constituent pieces of understanding the WWW: http://www.w3schools.com/web/default.asp

Question: Do HTML and XHTML serve the same purpose; meaning, do you only use one or the other on a web page?

Friday, September 19, 2008

Reading Notes: Week 4 Metadata in Digital Libraries

Anne J. Gilliland. Introduction to Metadata: Pathways to Digital Information, 1: Setting the Stage http://www.getty.edu/research/conducting_research/standards/intrometadata/setting.html

Metadata: "the sum total of what one can say about any information object at any level of aggregation."

ALL information objects have three features - content, context, and structure - all of which can be reflected through metadata:
Content relates to what the object contains or is about, and is intrinsic.
Context indicates the who, what, why, where, how aspects associated with the object's creation and is extrinsic.
Structure relates to the formal set of associations within or among individual information objects and can be intrinsic or extrinsic.

Library metadata includes indexes, abstracts, and catalog records, created according to cataloging rules and structural and content standards such as MARC, as well as authority files such as LCSH or the AAT (Art & Architecture Thesaurus). Such bibliographic metadata has been cooperatively created since the 1960s and made available to repositories and users through automated systems such as bibliographic utilities, OPACs, and online databases.

Archival and manuscript metadata: accession records, finding aids, and catalog records. Archival descriptive standards developed over the past two decades include: the MARC Archival and Manuscript Control (AMC) format, published by the Library of Congress (1984) and now integrated into the MARC format for bibliographic description; the General International Standard Archival Description (ISAD(G)), published by the International Council on Archives (1994); and the Encoded Archival Description (EAD), adopted as a standard by the Society of American Archivists (SAA) in 1999.

metadata:
certifies the authenticity and degree of completeness of the content;
establishes and documents the context of the content;
identifies and exploits the structural relationships that exist between and within information objects;
provides a range of intellectual access points for an increasingly diverse range of users; and
provides some of the information an information professional might have provided in a physical reference or research setting.
Metadata provides a Rosetta Stone for decoding information objects into the knowledge systems of the 21st century, and it provides a basis for translating between systems.


Stuart L. Weibel, “Border Crossings: Reflections on a Decade of Metadata Consensus Building”, D-Lib Magazine, Volume 11 Number 7/8, July/August 2005 http://www.dlib.org/dlib/july05/weibel/07weibel.html

A personal reflection on some of the achievements and lessons learned as part of the Dublin Core Metadata Initiative management team. The goal: a starting place for more elaborate description schemes.
What, then, is metadata for?
Harvesting and indexing.
Metadata for images is useful: associating images with text makes them discoverable.
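For a concrete sense of what a harvestable Dublin Core record looks like, here is a minimal sketch built with Python's ElementTree. The field values are invented; the dc namespace URI is the standard Dublin Core one:

import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC)

record = ET.Element("record")
for name, value in [("title", "Pictures of random books and junk I own"),
                    ("creator", "A. Student"),
                    ("date", "2008-09-28"),
                    ("format", "image/jpeg")]:
    elem = ET.SubElement(record, "{%s}%s" % (DC, name))
    elem.text = value

print(ET.tostring(record, encoding="unicode"))
# -> <record xmlns:dc="http://purl.org/dc/elements/1.1/"><dc:title>...</dc:title>...</record>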

The Mongolian/Chinese railroad gauge dilemma: an interoperability challenge in crossing between standards, suffering a measure of broken semantics in the process.

The Web demands an international, multicultural approach to standards and infrastructure, but should those be broad, brush-stroke standards or a lighter-weight set?


Question: Weibel mentions Google in reference to international standards definitions. Is there a relationship between search engine groups and Dublin Core, OCLC, and academic databases in the development of international standards?