Friday, October 31, 2008

Week 10 Interaction and Evaluation

  1. Arms chapter 8. http://www.cs.cornell.edu/wya/DigLib/new/Chapter8.html. This is useful if you want to learn the real basics of interaction.

This chapter deals primarily with the issue of the user interface.  Over the past decade, interfaces to these systems have had to change due to the broad range of people who now have access.  Originally such systems only needed to serve academics and IT people who were comfortable with abstract access interfaces.  They have since become more user friendly, even to the extent of imitating page turning, as in the case of JSTOR and American Memory.

 Usability is a property of the whole system, not just the user interface of a client.

            Conceptual Model:

·        Interface design

·        Functional design

·        Data and metadata

·        Computer systems and networks

 Browsers added a new layer of access to systems and networks, starting in 1993 with Mosaic. "Mobile code gives the designer of a web site the ability to create web pages that incorporate computer programs. Java is a general purpose programming language that was explicitly designed for creating distributed systems, especially user interfaces, in a networked environment. A Java applet is a short computer program. It is compiled into a file of Java byte code and can be delivered across the network to a browser, usually by executing an HTTP command. The browser recognizes the file as an applet and invokes the Java interpreter to execute it."

 

New conceptual models: DLITE and Pad++.

 

Proper user support is more than an aesthetic concern. Well-designed interfaces, suitable functionality, and responsive systems make a quantifiable difference to the value of DLs. When a system is difficult to use, users may fail to find important results, may misconstrue the data they find, or may give up, believing that the system lacks the proper data.

           

  1. Rob Kling and Margaret Elliott "Digital Library Design for Usability" http://www.csdl.tamu.edu/DL94/paper/kling.html

 

During the last decade, software designers have made progress in developing usable systems for products such as word processors, but less attention has been given to usability in DL design. Two forms of DL usability are discussed: interface and organizational. While the Human-Computer Interaction research community has helped pioneer design principles that improve interface usability, organizational usability is less well understood. "Design for usability" is a new term that refers to designing computer systems so that organizational usability is addressed. DL developers need to consider "design for usability" issues during DL system design.

Systems usability: refers to how well people can exploit a computer system's intended functionality.

The authors discuss two key forms of DL usability: interface usability and organizational usability.

Design for usability: refers to the design of computer systems so that they can be effectively integrated into the work practices of specific organizations.

Organizational usability: characterizes the effective "fit" between computer systems (and DLs) and the social organization of computing in specific organizations. Designers must learn the primary characteristics of their client organizations.

 

The authors examined five models of computer-system design that are well known in the information systems and computer science research and professional communities. Each is a cultural model specific to a particular organization and is hard to alter. They characterize one design model that they believe is the dominant cultural design model in the DL research community. Each of the five has strengths and weaknesses, so they propose a new organizationally sensitive model that has the strongest chance of producing DL systems most people will find usable in their workplaces.

 

This is a good time for the DL research community to analyze its user systems and DL frameworks, so we can start developing a systematic understanding of the actual working conditions under which users find these models highly usable.

 

Tefko Saracevic, “Evaluation of digital libraries: An overview” http://www.scils.rutgers.edu/~tefko/DL_evaluation_Delos.pdf.

 

An extensive overview of DLs. It states that DLs have a short history: discussion of them began in the 1960s, but no applicable systems were developed until the mid-1990s.  The author evaluated over 80 studies of DLs in order to assemble a history that outlines the criteria by which digital libraries are observed and explored.  He concludes that theorists and practitioners of DL evaluation do not seem to agree with, or build on, each other's observations and work.

 Diagnosis of lack of evaluation:

"Complexity: Digital libraries are highly complex, they are much more than technological systems alone; evaluation of complex systems is very hard; we are just learning how to do this job and have a lot more to learn. In other words, we as yet do not know how to evaluate and we are experimenting with doing it in many different ways. 

Premature: Even though they are exploding and are widespread, it may be too early in the evolution of digital libraries for evaluation. 

Interest: There is no interest in evaluation. Those that do or research digital libraries are interested in doing, building, implementing, breaking new paths, operating … evaluation is of little or no interest, plus there is no time to do it. 

Funding: There are inadequate or no funds for evaluation. Evaluation is time consuming, expensive and requires commitment – all these are in short supply. Grants have minimal or no funds allocated for evaluation. Granting agencies, while professing evaluation, are not allocating programs and budgets for evaluation. If there were funds there would be evaluation. With no funds there is no evaluation.

Culture: evaluation is not a part of the culture in research and operations of digital libraries. It is below the cultural radar. A stepchild. Plus many communities with very different cultures are involved in digital libraries. This particularly pertains to differences between technical and humanists cultures: language and frames of reference, priorities and understandings are different; communication is hard and at times impossible. Under these circumstances evaluation means very different things to different constituencies.

 Cynical: who wants to know or demonstrate actual performance? Are there any emperor clothes around? Evaluation may be subconsciously or consciously suppressed. The ultimate evaluation of digital libraries will revolve around assessing transformation of their context – determining possible enhancing changes in institutions, learning, scholarly publishing, disciplines, small worlds and ultimately society due to digital libraries(10)."

 Ben Shneiderman, Catherine Plaisant, "Designing the User Interface," 4th ed., chapter 1. A good introduction to usability and its application in human-computer interaction (available in CourseWeb)

 

I cannot find this article on Course Web.

 

 

Muddiest Point Week 10

I can't find certain hyperlinks or materials on Course Web.  Is there a possibility of reevaluating some of these links, as they may not be current?  In particular, this week's "article" by Shneiderman and Plaisant, as well as the Arms e-text, which seems to migrate around the Cornell servers.

Friday, October 17, 2008

Week 8 Access in Digital Libraries: Part

Chapter 1. Definition and Origins of OAI-PMH. (Available in CourseWeb)

 Todd Miller, "Federated Searching: Put It in Its Place." April 15, 2004. http://www.libraryjournal.com/article/CA406012.html

 Proposing a relationship between federated search engines and library catalogs:

If the catalog is the primary source of information, then should federated searches be accessed through the catalog?

Available content is not limited to data stored within the physical library, and the content users demand is often not cataloged by libraries. Viewing the catalog as the primary source of data does not reflect the current library: today's libraries are vast information centers, and providing books and other cataloged material is only one aspect of the modern library.

 

“Knowledge is power” holds true for the patron and for the library. The more libraries enable and engage with their information, the more central they become in the lives of their constituency.

 

U.S. Senator Wendell Ford said, "If information is the currency of democracy, then libraries are the banks." Libraries have been made too secure. Google has shown that the most powerful information access approach also happens to be the simplest and easiest. “The most complex and least intuitive interfaces wind up securing information, not facilitating information access.”

 

The Truth About Federated Searching. October 2003. http://www.infotoday.com/it/oct03/hane1.shtml

 

Not all federated search engines can search all databases; most can search Z39.50-compliant and free databases. Federated search engines cannot search all licensed databases for both walk-up and remote users. Why? Authentication is difficult to manage for subscription databases, especially for remote users.

 

True de-duplication is not possible.

Relevancy rankings are never totally relevant.

Subscribing to a hosted service, rather than running a federated engine as local software, is the best option, given the updates and the labor-intensive IT issues involved. Leave those to the engine and database developers.

 

A federated search translates a search into something the native database's engine can understand. It is restricted to the capabilities of the native database's search function; a federated search can't go beyond the parameters set by the native database engine.
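As a rough illustration of that translation layer, here is a minimal Python sketch; the backend names, query syntaxes, and the de-duplication rule are hypothetical assumptions of mine, not taken from the article or any real product. The point is that the federated layer can only reformulate the query and merge results; it cannot add capabilities a native engine lacks.

```python
# Hypothetical sketch of a federated search layer. Backend names, query
# syntaxes, and the de-duplication rule are illustrative assumptions only.

def to_native_query(backend: str, title_terms: str) -> str:
    """Translate a title search into each backend's native syntax."""
    if backend == "catalog_z3950":          # assume a PQF-style target
        return f'@attr 1=4 "{title_terms}"'
    if backend == "vendor_a":               # assume a fielded web API
        return f"title:({title_terms})"
    if backend == "vendor_b":               # assume plain keywords only
        return title_terms                  # no title limit available here
    raise ValueError(f"unknown backend {backend}")

def merge_results(result_lists):
    """Naive de-duplication by normalized title; it misses many duplicates,
    which is one reason 'true' de-duplication is so hard."""
    seen, merged = set(), []
    for results in result_lists:
        for record in results:
            key = record["title"].strip().lower()
            if key not in seen:
                seen.add(key)
                merged.append(record)
    return merged

# Example: the same user query becomes three different native queries.
for b in ("catalog_z3950", "vendor_a", "vendor_b"):
    print(b, "->", to_native_query(b, "digital libraries"))
```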

 

Lynch, Clifford A. (1997).  The Z39.50 Information Retrieval Standard, Part 1: A Strategic View of its Past, Present, and Future.  D-Lib Magazine, April 1997. http://www.dlib.org/dlib/april97/04lynch.html

 

I’ve been dying all these years for a succinct definition for Z39.50 and I finally have it: “Z39.50 -- properly "Information Retrieval (Z39.50); Application Service Definition and Protocol Specification, ANSI/NISO Z39.50-1995" -- is a protocol which specifies data structures and interchange rules that allow a client machine (called an "origin" in the standard) to search databases on a server machine (called a "target" in the standard) and retrieve records that are identified as a result of such a search.

 

The rather forbidding name "Z39.50" comes from the fact that the National Information Standards Organization (NISO), the American National Standards Institute (ANSI)-accredited standards development organization serving libraries, publishing and information services, was once the Z39 committee of ANSI. NISO standards are numbered sequentially and Z39.50 is the fiftieth standard developed by NISO. The current version of Z39.50 was adopted in 1995, thus superseding earlier versions adopted in 1992 and 1988. It is sometimes referred to as Z39.50 Version 3.”

 

The article is the first part of a two-part story on the history and implementation of the Z39.50 protocol, dealing primarily with Z39.50 and its use in digital libraries.
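To make the "origin" and "target" vocabulary concrete, here is a purely conceptual Python sketch of the two roles; every class, field, and value is my own invention for illustration and does not reflect the standard's actual message formats or attribute sets (e.g. Bib-1).

```python
# Conceptual illustration of the Z39.50 roles only: an "origin" (client)
# sends a search to a "target" (server) and retrieves matching records.
# All names here are hypothetical simplifications of the protocol.

from dataclasses import dataclass, field

@dataclass
class SearchRequest:            # what an origin sends
    database: str
    query: str                  # real Z39.50 queries use typed attribute sets

@dataclass
class SearchResponse:           # what a target returns
    hit_count: int
    records: list = field(default_factory=list)

class Target:
    """Stand-in for a server exposing a searchable database."""
    def __init__(self, records):
        self.records = records

    def search(self, request: SearchRequest) -> SearchResponse:
        hits = [r for r in self.records if request.query.lower() in r.lower()]
        return SearchResponse(hit_count=len(hits), records=hits)

class Origin:
    """Stand-in for a client, e.g. an OPAC or a federated search layer."""
    def __init__(self, target: Target):
        self.target = target

    def search(self, database: str, query: str) -> SearchResponse:
        return self.target.search(SearchRequest(database, query))

target = Target(["Digital Libraries / Arms", "Practical Digital Libraries / Lesk"])
print(Origin(target).search("catalog", "lesk").records)
```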

 

Norbert Lossau, “Search Engine Technology and Digital Libraries: Libraries Need to Discover the Academic Internet” D-Lib Magazine, June 2004, Volume 10 Number 6. http://www.dlib.org/dlib/june04/lossau/06lossau.html

 

Librarians should not look only to Google and Yahoo but also to other search engines capable of searching the “Deep Web,” as discussed in last week's class.  I think this is a big conundrum in all areas of IS and LIS: the ignorance or laziness of skimming the Web rather than exploring it, and finding those sites and e-documents that can be pulled from beneath the levels normally searched by the larger engines.  I’ve found many an important document with newer engines like Hakia, which claim to be semantic but offer a far better advantage: they contain sites pooled and suggested manually by IT and LIS experts and amateurs.

 

Question:  Has there been recent research on semantic engines like Hakia?  I haven’t seen any news pertaining to the idea.  Perhaps I’m not looking “deep” enough?

Thursday, October 9, 2008

Muddiest Point

One of my Digital Library teammates might have beaten me to this question, but . . . is there any issue with the database for the final project having very little coherency? Let me explain a little better: can we pick three media that have no real content connection, just so we can practice with said formats, if we are developing our own fictitious d-library?

Reading Notes: Week 7: Access in Digital Libraries

LESK chapter 4.
This chapter discusses the varied and disparate non-textual materials that are involved in digital archives. As Lesk comments, it’s not all text pages! Not any more, anyway! It runs through four main categories: sound formats, pictures (searched by color, texture, and shape), speech (more difficult to index and search than images), and moving images (currently being researched, but with no contemporary solution affordable for library use). Lesk discusses the indexing of these items, as well as issues with searches and solutions to these problems.

David Hawking, "Web Search Engines: Part 1 and Part 2." IEEE Computer, June 2006.
In 1995 there was much speculation about the vastness of the Web and the inability of any engine to search even a usable portion of it. Yet today the Big Three (Google, Yahoo, and Microsoft) all handle about a billion queries a day in over a thousand languages worldwide. This article explores the issues and techniques that these major search engines encounter and resolve.
INFRASTRUCTURE - Large search engines operate numerous dispersed data centers. Services in these data centers are built from clusters of commodity PCs, and individual servers can be dedicated to specializations, e.g., crawling, indexing, query processing, snippet generation, link-graph computations, result caching, and insertion of advertising content. The amount of web data that search engines crawl and index is on the order of 400 TB. Crazy!
CRAWLING ALGORITHMS - The crawler uses a queue of URLs to be visited and a system for determining whether it has already seen a URL. This requires huge data structures: a list of 20 billion URLs amounts to about a terabyte of data. The crawler initializes with "seed" URLs. Crawling proceeds by making an HTTP request to fetch the page at the first URL in the queue. When the crawler fetches the page, it scans the contents for links to other URLs and adds each previously unseen URL to the queue. Finally, the crawler saves the page content for indexing. This continues until the queue is empty.
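Here is a minimal Python sketch of that queue-plus-seen-set loop (my own toy illustration, not Hawking's code); politeness delays, robots.txt handling, spam rejection, and distribution across machines are all omitted.

```python
# Toy crawler illustrating the queue + "have I seen this URL?" structure
# described above. Real crawlers distribute this across many machines and
# add politeness, robots.txt checks, duplicate detection, spam rejection, etc.

import re
import urllib.request
from collections import deque

LINK_RE = re.compile(r'href="(https?://[^"#]+)"')

def crawl(seed_urls, max_pages=10):
    queue = deque(seed_urls)          # URLs still to visit
    seen = set(seed_urls)             # URLs already enqueued
    saved = {}                        # url -> page content, for indexing
    while queue and len(saved) < max_pages:
        url = queue.popleft()
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except Exception:
            continue                  # skip unreachable pages
        saved[url] = html             # save content for the indexer
        for link in LINK_RE.findall(html):
            if link not in seen:      # enqueue only previously unseen URLs
                seen.add(link)
                queue.append(link)
    return saved

# pages = crawl(["http://example.com/"])
```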

Crawling must address the following issues: Speed, Politeness, Excluded Content, Duplicate Content, Continuous Crawling, and Spam Rejection.

INDEXING ALGORITHMS - “Search engines use an inverted file to rapidly identify indexing terms—the documents that contain a particular word or phrase (J. Zobel and A. Moffat, "Inverted Files for Text Search Engines," to be published in ACM Computing Surveys, 2006).”
REAL INDEXERS - Store additional information in the postings, such as term frequency or positions.
QUERY PROCESSING ALGORITHMS - Average query length is 2.3 words.
By default, current search engines return only documents containing all the query words.
REAL QUERY PROCESSORS – “The major problem with the simple-query processor is that it returns poor results. In response to the query "the Onion" (seeking the satirical newspaper site), pages about soup and gardening would almost certainly swamp the desired result.”
Result quality can be dramatically improved if the query processor scans and sorts results according to a relevance-scoring utility that takes into account the number of query term occurrences, document length, etc. The MSN search engine reportedly takes into account more than 300 factors.
Search engines use many techniques to speed things up – Skipping, Early Termination, Caching, and Clever Assignment of Document Numbers.
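To tie the indexing and query-processing notes together, here is a small Python sketch (my own illustration, not from the article) that builds an inverted file with term frequencies, answers a query conjunctively (all query words required), and ranks the survivors with a crude frequency-over-length score; real engines use hundreds of ranking factors plus skipping, early termination, and caching.

```python
# Minimal inverted index + conjunctive, relevance-ranked query processor.
# The scoring is a deliberately crude stand-in for real ranking functions.

from collections import Counter, defaultdict

docs = {
    1: "the onion satirical newspaper",
    2: "growing the onion in your garden",
    3: "french onion soup recipe with onion stock",
}

# Inverted file: term -> {doc_id: term frequency}; also keep document lengths.
index = defaultdict(dict)
doc_len = {}
for doc_id, text in docs.items():
    counts = Counter(text.split())
    doc_len[doc_id] = sum(counts.values())
    for term, tf in counts.items():
        index[term][doc_id] = tf

def search(query):
    terms = query.lower().split()
    postings = [index.get(t, {}) for t in terms]
    if not postings:
        return []
    # Conjunctive semantics: keep only documents containing every query term.
    candidates = set(postings[0])
    for p in postings[1:]:
        candidates &= set(p)
    # Rank by total term frequency normalized by document length.
    scored = [(sum(p[d] for p in postings) / doc_len[d], d) for d in candidates]
    return sorted(scored, reverse=True)

print(search("onion soup"))   # only doc 3 contains both words
```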


M. Henzinger et al., "Challenges in Web Search Engines." ACM SIGIR 2002. http://portal.acm.org/citation.cfm?coll=GUIDE&dl=GUIDE&id=792553
Couldn’t get into this site without a subscription.

Question: I’ve tried using the semantic search engine, HAKIA, and have come up with some perfect hits and some deplorable misses. Do the factors in the Hawking article still apply to semantic searching or are there different factors involved in such a redesigned engine?

Friday, October 3, 2008

Readings: Week 6 Preservation in Digital Libraries

Margaret Hedstrom “Research Challenges in Digital Archiving and Long-term Preservation” http://www.sis.pitt.edu/~dlwkshop/paper_hedstrom.pdf

The major research challenges:
1. Digital collections are large, multimedia libraries that are growing exponentially. There is currently no method for preservation in light of an exponentially growing collection that is constantly augmented with new and variable media.
2. Digital preservation of these collections bears more similarity to archive programs than to library-oriented issues. There is a need to develop self-sustaining, self-monitoring, and self-repairing collections.
3. Maintaining digital archives over long periods of time is as much an economic, social, and institutional challenge as a technological one, and there are no current models for this type of extended undertaking.
4. Tools are needed that automatically supply and extract metadata from resources, and that ingest, restructure, and manage metadata over time, becoming progressively more affordable as the digital archive expands.
5. Future infrastructures for collections must be inexpensive, flexible, and effective.

Brian F. Lavoie, The Open Archival Information System Reference Model: Introductory Guide. http://www.dpconline.org/docs/lavoie_OAIS.pdf

The OAIS Reference Model was developed jointly by the CCSDS and ISO to create a solution to data-handling issues, specifically digital preservation problems.

An archival information system is "an organization of people and systems that has accepted the responsibility to preserve information and make it available for a Designated Community." "Open" means that the model was developed in a public forum, and anyone who wishes to contribute to it or use it is welcome.

The duties of the OAIS model are:
1. Negotiate for and accept appropriate information from information producers
2. Obtain sufficient control of the information in order to meet long-term preservation objectives
3. Determine the scope of the archive’s user community
4. Ensure that the preserved information is independently understandable to the user community, in the sense that the information can be understood by users without the assistance of the information producer
5. Follow documented policies and procedures to ensure the information is preserved against all reasonable contingencies, and to enable dissemination of authenticated copies of the preserved information in its original form, or in a form traceable to the original
6. Make the preserved information available to the user community

The development group has created a fully detailed conceptual model that describes a digital archive's environment, the functional interactions among producers, management, administration, and consumers, and even how data would be packaged in the system. However, this model is just that, only a model; it does not by itself constitute a working system.
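For a concrete feel of the "packaging" idea, here is a conceptual Python sketch of an OAIS-style information package, combining Content Information with Preservation Description Information (reference, provenance, context, fixity). The class layout and all values are my own illustrative simplification, not the standard's formal specification.

```python
# Conceptual sketch of an OAIS-style information package, for illustration
# only. This simplifies the model's Content Information + Preservation
# Description Information structure; all concrete values are hypothetical.

from dataclasses import dataclass, field

@dataclass
class PreservationDescriptionInformation:
    reference: str        # e.g. a persistent identifier
    provenance: str       # history of custody and processing
    context: str          # relationship to other objects
    fixity: str           # e.g. a checksum used to verify integrity

@dataclass
class InformationPackage:
    content_data: bytes                       # the digital object itself
    representation_info: str                  # how to interpret the bits
    pdi: PreservationDescriptionInformation
    packaging_info: dict = field(default_factory=dict)

aip = InformationPackage(
    content_data=b"%PDF-1.4 ...",
    representation_info="PDF 1.4 specification",
    pdi=PreservationDescriptionInformation(
        reference="doi:10.9999/example",       # hypothetical identifier
        provenance="deposited by producer X, migrated from TIFF",
        context="page 3 of newspaper issue Y",
        fixity="sha-256:ab12...",
    ),
)
print(aip.pdi.fixity)
```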


Jones, Maggie, and Neil Beagrie. Preservation Management of Digital Materials: A Handbook. 2001. http://www.dpconline.org/graphics/handbook/index.html
introduction and digital preservation sections.

A manual developed through the Digital Preservation Coalition that elaborates on preservation management issues. Although designed to be an international handbook on digital preservation, it notes that it deals primarily with UK issues, particularly legislative matters. However, it remains current by posting up-to-date links to preservation topics and sites.

Justin Littman. Actualized Preservation Threats: Practical Lessons from Chronicling America. D-Lib Magazine July/August 2007. http://www.dlib.org/dlib/july07/littman/07littman.html

Chronicling America:
1. to support the digitization of historically significant newspapers.
2. to facilitate public access via a Web site.
3. to provide for the long-term preservation of these materials by constructing a digital repository.
The program has a digital repository component that houses the digitized newspapers, supporting access and facilitating long-term preservation.

Preservation threats encountered included failures of media, software, and hardware, but the worst errors came from operators, i.e., human error such as accidental deletion of files.
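One standard countermeasure against exactly these threats, media corruption and accidental deletion, is routine fixity auditing against a checksum manifest. The sketch below is my own illustration (hypothetical paths and manifest format), not a description of how Chronicling America's repository actually works.

```python
# Sketch of fixity auditing: compare files on disk against a stored
# manifest of SHA-256 checksums to catch media corruption and accidental
# deletions. Paths and the manifest format are hypothetical.

import hashlib
import os

def sha256(path, chunk_size=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def audit(manifest):
    """manifest: dict of {file path: expected checksum}."""
    problems = []
    for path, expected in manifest.items():
        if not os.path.exists(path):
            problems.append((path, "missing (deleted?)"))
        elif sha256(path) != expected:
            problems.append((path, "checksum mismatch (corruption?)"))
    return problems

# Example (hypothetical paths and digest):
# problems = audit({"batches/batch_001/page_0001.tif": "ab12...ef"})
```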

Question: Statistically, is operator error always the worst preservation threat found in digital archives?