Dienst - An Architecture for Distributed Document Libraries

Carl Lagoze - Cornell University

James R. Davis - Xerox Corporation

As one of the five universities participating in the ARPA-sponsored Computer Science Technical Report project, we at Cornell have developed a digital library architecture called Dienst. Dienst is a protocol and implementation that provides Internet access to a distributed, decentralized multi-format document collection. The collection is managed by a set of interoperating Dienst servers distributed over the Internet. These servers provide three digital library services: repositories of multi-format documents; indexes into the document collection and search engines for these indexes; and user interfaces for browsing, searching, and accessing the collection.

Dienst models the distributed digital library as a flat set of documents, each of which has a unique location-independent identifier, exists in multiple formats (e.g., TIFF, GIF, PostScript, HTML), and consists of a set of named parts. These parts may be physical, such as pages, or logical, such as chapters and tables.
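The document model described above can be illustrated with a small data structure. This is a minimal sketch only, not the Dienst implementation: the class name, field names, and the identifier string are all hypothetical, chosen to mirror the three properties named in the text (a location-independent identifier, multiple formats, and named physical or logical parts).

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """One entry in the flat collection (illustrative model, not Dienst code)."""
    doc_id: str  # location-independent identifier; this string format is made up
    formats: set[str] = field(default_factory=set)
    # Maps a part name to its kind, either "physical" or "logical".
    parts: dict[str, str] = field(default_factory=dict)

    def has_format(self, fmt: str) -> bool:
        """Report whether the document is available in the given format."""
        return fmt in self.formats

    def parts_of_kind(self, kind: str) -> list[str]:
        """Return the names of all parts of one kind, sorted by name."""
        return sorted(name for name, k in self.parts.items() if k == kind)


# The library is flat: a single mapping from identifier to document.
library: dict[str, Document] = {}

doc = Document(
    doc_id="example.site/TR-0001",
    formats={"TIFF", "PostScript"},
    parts={"page-1": "physical", "page-2": "physical", "chapter-1": "logical"},
)
library[doc.doc_id] = doc
```

Because the identifier carries no location information, a lookup in this sketch never needs to know which server holds the document.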

The architecture provides a number of helpful abstractions for the Dienst user. First, all elements of the collection are uniformly searchable and accessible without regard to their actual location. Second, multiple representations of a document are logically linked. Finally, documents are structured objects that can be viewed in part or as a whole. Using publicly available WWW clients, users may search the document collection, browse "thumbnail" images of documents, read individual documents in any of their available formats, and download or print a document.

A distinguishing feature of Dienst is that indexes are distributed and searches are processed in parallel across all index sites. The current Dienst implementation provides two types of searching - bibliographic and full-text. Users may search for documents by number, title, author, abstract keywords, or other bibliographic information using an HTML forms interface. A user may search the full text of documents through two interfaces - by directly entering the text to be searched, or by "click-to-search", where the user selects a paragraph from a document as the basis of the search. The Dienst protocol can be extended to include other search types and engines in the future.
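The idea of sending one query to every index site at once and merging the answers can be sketched as follows. This is an illustration under stated assumptions, not the Dienst search code: the site names, the in-memory indexes, and the functions `search_site` and `search_all` are hypothetical, and a thread pool stands in for whatever mechanism a real server would use to contact remote sites.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for remote index sites: site name -> {doc id: title}.
INDEX_SITES = {
    "site-a": {"a/TR-1": "Distributed file systems", "a/TR-2": "Type inference"},
    "site-b": {"b/TR-7": "Distributed document libraries"},
    "site-c": {"c/TR-3": "Parallel graph algorithms"},
}


def search_site(site: str, keyword: str) -> list[tuple[str, str]]:
    """Search one site's index; a real system would issue a network request here."""
    index = INDEX_SITES[site]
    return [(doc_id, title) for doc_id, title in index.items()
            if keyword.lower() in title.lower()]


def search_all(keyword: str) -> list[tuple[str, str]]:
    """Send the query to every index site concurrently and merge the hits."""
    sites = list(INDEX_SITES)
    with ThreadPoolExecutor(max_workers=len(sites)) as pool:
        per_site = pool.map(lambda s: search_site(s, keyword), sites)
        merged = [hit for hits in per_site for hit in hits]
    return sorted(merged)


results = search_all("distributed")
```

The user sees a single merged result list, which matches the abstraction that the collection is searchable without regard to where each index lives.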

The Dienst software also provides site administrators with tools for managing their collections. These include, among others, automated document submission procedures, indexing tools, database integrity checkers, and format conversion tools.

Dienst servers are accessed through gateways from any World Wide Web (WWW) server that supports the Common Gateway Interface. Dienst protocol requests are packaged within HTTP, the WWW protocol. In this manner, Dienst exploits all the current features of the WWW - widely available multi-architecture clients, MIME typing of documents, support for embedded images, and the like - and will be able to leverage future developments in areas such as user authentication and support for new graphics standards.
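To make the notion of packaging a protocol request within HTTP concrete, the sketch below encodes a request as a URL that a CGI gateway could decode. The URL layout, the gateway address, the verb name, and both helper functions are assumptions made for illustration; they are not the actual Dienst request syntax.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_request_url(gateway: str, verb: str, **params: str) -> str:
    """Encode a protocol verb and its arguments as an HTTP URL (hypothetical layout)."""
    query = urlencode(sorted(params.items()))
    return f"{gateway}/{verb}?{query}" if query else f"{gateway}/{verb}"


def parse_request_url(url: str) -> tuple[str, dict[str, str]]:
    """Recover the verb and arguments on the gateway side."""
    parts = urlsplit(url)
    verb = parts.path.rsplit("/", 1)[-1]
    # parse_qs returns a list per key; this sketch keeps only the first value.
    params = {key: values[0] for key, values in parse_qs(parts.query).items()}
    return verb, params


url = build_request_url(
    "http://gateway.example/cgi-bin/library",  # made-up gateway address
    "search",
    title="digital library",
)
verb, params = parse_request_url(url)
```

Since the request is an ordinary URL, any WWW client can issue it, and the response can carry a standard MIME type.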

Dienst servers are currently running at ten sites, providing common access to several thousand CS technical reports. The Cornell server is available at http://cs-tr.cs.cornell.edu. We continue to work at Cornell on the Dienst protocol and implementation. In the future we plan to provide easier installation and maintenance tools for site administrators, develop and incorporate more powerful search techniques, and extend the system to enforce copyright restrictions.