Friday, September 12, 2008

Reading Notes Week 3: Representation of Digital Objects

LESK sections 2.1, 2.2, 2.7, chapter 3.
2.1 Computer typesetting:
Almost nothing in commercial settings is typeset traditionally. This move away from tradition means most current written information is machine readable and available in some form via computer.
Output progressed from filmstrips manipulated by computer to laser printers.
Software was developed to assist with margin justification and line numbering, e.g. nroff/troff (Bell Labs), Scribe (CMU), TeX (Stanford).
Between 1990 and 2000 the number of online databases almost doubled; however, most books and journals are not available in full text via the internet. This gap is economic rather than technical.

2.2 Text Formats:
A variety of formats exist, starting with ASCII. Because ASCII is so simple, it does not cover all the characters and symbols of all languages. A new standard has been proposed to solve this.
Large publishing groups use higher-level systems. The 3 main standards are: MARC, SGML and HTML.
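The ASCII limitation noted above can be seen in a small Python sketch. This is only an illustration: the sample strings and the helper name fits_in_ascii are my own, and UTF-8 is used as a stand-in for a wider encoding because the notes do not name the proposed standard.

```python
def fits_in_ascii(text):
    """Return True if every character in text can be encoded as ASCII."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        # At least one character falls outside the 128 ASCII code points.
        return False


plain = "digital library"
accented = "Bibliothèque"   # contains a non-ASCII letter

print(fits_in_ascii(plain))
print(fits_in_ascii(accented))

# A wider encoding (UTF-8 here, as an assumption) can still represent it.
print(accented.encode("utf-8").decode("utf-8") == accented)
```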

2.3 Ways of Searching:
Linear Searching – a search algorithm that goes through a file from beginning to end looking for a string.
Inverted Files – the elements to be searched are extracted and alphabetized, which makes them more readily accessible for multiple searches.
Hash tables or hash coding – computing a code for each word so that checking whether a word appears in a file is fast.
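The three approaches above can be sketched in Python. This is a minimal illustration, not how Lesk or any real retrieval system implements them: the sample documents and all function names are made up for the example, and Python's built-in set stands in for a hash table.

```python
import bisect

# Made-up sample "files" for illustration only.
documents = {
    "doc1": "computer typesetting made text machine readable",
    "doc2": "inverted files speed up repeated text searches",
    "doc3": "scanned page images need separate indexing",
}


def linear_search(docs, term):
    """Scan each document from beginning to end looking for the string."""
    return sorted(name for name, text in docs.items() if term in text)


def build_inverted_file(docs):
    """Extract the words, alphabetize them, and record where each occurs."""
    postings = {}
    for name, text in docs.items():
        for word in text.split():
            postings.setdefault(word, set()).add(name)
    words = sorted(postings)                      # alphabetized word list
    doc_lists = [sorted(postings[w]) for w in words]
    return words, doc_lists


def inverted_lookup(words, doc_lists, term):
    """Binary-search the alphabetized list instead of rescanning the files."""
    i = bisect.bisect_left(words, term)
    if i < len(words) and words[i] == term:
        return doc_lists[i]
    return []


def build_hash_table(text):
    """Python sets are hash tables, so membership tests hash the word."""
    return set(text.split())


words, doc_lists = build_inverted_file(documents)
print(linear_search(documents, "text"))
print(inverted_lookup(words, doc_lists, "text"))
print("images" in build_hash_table(documents["doc3"]))
```

One difference worth noting: the linear search matches any substring, so "index" finds "indexing" in doc3, while the inverted file only matches whole words. The inverted file costs more to build but pays off when the same files are searched many times.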

3 Images of Pages

Scanning
Image Formats
Display Requirements
Indexing Images of Pages
Shared Text/Image Systems
Image Storage vs Book Storage
Large Scale Projects:
Thesaurus Linguae Graecae (1970s) - machine-readable form of the corpus of classical Greek literature.
Gallica collection - 100,000 texts from the Bibliothèque nationale de France in image format.

Largest online book project: The Million Book Project by Raj Reddy, CMU - aims to place 1,000,000 full-text books online, although these are older texts that are out of copyright.


Clifford Lynch, “Identifiers and Their Role In Networked Information Applications”. http://www.arl.org/bm~doc/identifier.pdf


Norman Paskin. “Digital Object Identifier (DOI) System”. Encyclopedia of Library and Information Sciences. http://www.doi.org/overview/080625DOI-ELIS-Paskin.pdf
Bibliographic utility identifier numbers, such as OCLC or RLIN numbers, are used for duplicate detection and consolidation when building online union catalog databases. A bibliographic citation can also be viewed as an identifier, though one with many variations in style and in data elements, depending on editorial policy.
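A rough Python sketch of how a shared identifier could drive duplicate detection and consolidation in a union catalog. Everything here is hypothetical: the records, the field names, the identifier values (placeholders, not real OCLC or RLIN numbers), and the consolidate function are invented for illustration.

```python
# Made-up records from three contributing libraries. The titles differ in
# citation style, but two records share the same utility identifier.
records = [
    {"library": "A", "utility_id": "id-001", "title": "Sample Title One"},
    {"library": "B", "utility_id": "id-001", "title": "Sample title one."},
    {"library": "C", "utility_id": "id-002", "title": "Sample Title Two"},
]


def consolidate(records):
    """Merge records that share an identifier into one catalog entry."""
    catalog = {}
    for rec in records:
        # The first record seen for an identifier supplies the display title;
        # later records with the same identifier only add a holding library.
        entry = catalog.setdefault(
            rec["utility_id"], {"title": rec["title"], "holdings": []}
        )
        entry["holdings"].append(rec["library"])
    return catalog


union_catalog = consolidate(records)
for utility_id, entry in union_catalog.items():
    print(utility_id, entry["title"], entry["holdings"])
```

Matching on the identifier sidesteps the style variations in the titles, which is why a citation alone makes a weaker identifier than a utility number.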

Question for the week: With all these databases coming to fruition, they do not seem to reach popular audiences much, even pseudo-popular sites such as Project Gutenberg. Do Google and other search engines make these sites harder to find by tending to rank other results higher in their search lists?
