By Jens Gebhart, August 31, 2000, 00:00

The interview with Peter Lyman was conducted by Jens Gebhart in September 2000; photos by Katja Dell.

"The Internet Archive" in San Francisco, ein Ort an dem das komplette Internet archivert wird...


?: What is archive.org?

PL: The Archive is a 501(c)(3) public nonprofit, founded in 1996 and located in the Presidio of San Francisco, that was established to build an ‘Internet library’ offering free access to historical digital collections for researchers, historians, and scholars. In late 1999, the organization started to grow and to build more well-rounded collections.

?: Why is the Internet Archive collecting sites from the Internet? What makes the information useful?

PL: Most societies place importance on preserving their culture and heritage. As our culture produces more and more artifacts in digital form, the Archive is preserving them to create a public library for researchers, historians, and scholars. While the newness of the digital media format presents challenges in collecting and preserving materials, we feel it is necessary: Our cultural, political, and historical artifacts are increasingly created in digital form, and if they are not saved now, they may never be saved at all. Much early media — television and radio, for example — was not saved. Many early movies were recycled to recover the silver in the film! Even now, at the turn of the 21st century, no comprehensive archives of television or radio programs exist.

But without cultural artifacts, civilization has no memory and no mechanism to learn from its successes and failures. The Internet Archive is working to prevent the Internet — a new medium with major historical significance — from disappearing into the past. Collaborating with institutions including the Library of Congress and the Smithsonian, we are working to permanently preserve a record of public material. In addition to developing our own collections, we will be working to promote formation of other Internet libraries in the United States and elsewhere.


?: Who has access to the collections?

PL: The Archive makes the collections available at no cost to researchers, historians, and scholars. At present, it takes someone with a certain level of technical knowledge to access the data, but there is no requirement that a user be affiliated with any particular organization. Open and free access to literature and other writings has long been considered essential to education and to the maintenance of an open society. Public and philanthropic enterprises have supported it through the ages. The Internet Archive is opening its collections to researchers, historians, and scholars to ensure that they have free and permanent access to public materials. The Archive has no vested interest in the discoveries of the users of its collections, nor is it a grant-making organization. At present, using collections of this size requires programming skills. However, we are hopeful about the development of tools and methods that will give the general public easy and meaningful access to our collective history.

?: Do you collect all the sites on the Web?

PL: No, we collect only publicly accessible Web pages, which may include pages with personal information. If there is any indication that a site’s owner doesn’t want us to archive the site, we don’t. We also do not collect or archive personal email messages or chat systems.

?: Are you violating copyright laws?

PL: No. Like your local library’s collections, our collections consist of publicly available documents. In our case, the Archive has collected only pages that were available on the Internet at no cost and without passwords or special privileges. Furthermore, the authors of Web pages can remove their documents from the collection, and Webmasters can stop robots from crawling their sites; stopping robots from collecting the pages on a site also leads to the removal of those pages from the existing collection.


?: What privacy issues does the Archive bring up? How do you protect my privacy if you archive my site?

PL: The Archive collects Web pages that are publicly available — the same ones that you might find as you surfed around the Web. We do not archive sites when there is any indication that their owners do not want them archived. Like a public library, the Archive provides free and open access to its collections to researchers, historians, scholars, and possibly to the general public. Our cultural norms have long promoted access to documents that were, but no longer are, publicly accessible.
Unlike a paper library, however, the Archive will not collect publicly available materials if there is an indication that the owner does not wish them to be archived. Furthermore, if we find any such indication on a site, we remove all previous versions of its pages from the collections. We provide information on removing a site from the collections, and those who use the collections must agree to certain terms of use.

Given the rate at which the Internet is changing — the average life of a Web page is only 77 days — if no effort is made to preserve it, it will be entirely and irretrievably lost. Rather than let this moment slip by, we are proceeding with documenting the growth and content of the Internet, using libraries as our model.

?: What is the current size of the Archive's collection?

PL: We have been collecting since October 1996. The current size is 13.8 terabytes (about 1 billion pages; text only during 1999; for comparison, the entire World Wide Web amounted to about 2 terabytes in 1997), and the monthly rate of growth is 2 terabytes as of March 2000. At that rate, the collection, and with it the Internet, roughly doubles in size every half year. At the moment, all Web pages from late 1998 to six or more months ago (the collection contains no material less than six months old) are accessible, or about 3 terabytes as of March 2000. We hope to make the rest of the material (collected from late 1996 to late 1998) available during 2000.
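The doubling claim can be checked against the figures above:

\[ 2\ \text{TB/month} \times 6\ \text{months} = 12\ \text{TB} \approx 13.8\ \text{TB (the current total)}, \]

so each half year adds roughly as much data as the entire existing collection.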


?: What are the technical requirements for access to the Archive's collections?

PL: While the Internet Archive does not charge for access to the collections, you will need Unix programming skills to gain access to and use an entire collection, such as the Archive’s collection of Web snapshots. (A free, easy-to-use application from Alexa Internet, which donates Internet materials to the Archive, can be used to access individual Web pages.)

The Archive assigns each user an ssh (secure shell) access account and disk space on the server facade.archive.org. (Secure shell access provides character-terminal log-in; it’s similar to Telnet access but more secure.) The server runs the Linux operating system and has access to a series of Linux machines (named ia000.archive.org, ia001.archive.org, and so on). Each machine has either 12 or 20 disk drives (named 0, 1, 2, and so on). On each drive are three types of files: ARC format, DAT or MDT format (URLs and image references extracted from the ARC files), and IDX (index) format, each of which contains a list of URLs and their associated places in the ARC and DAT files. Users access the hard drives where the collections reside by referencing these remote files from facade.archive.org, using either FTP or NFS (network file system) access.
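As an illustration, here is a minimal sketch in Python of how a user with such an account might look up one page. It assumes, hypothetically, that each IDX line holds a URL, an ARC file name, and a byte offset, and that a drive is NFS-mounted under /ia; the actual field layout and paths on facade.archive.org may differ.

# Look up a URL in an index file and read its record from the ARC file.
# ASSUMPTIONS: IDX lines read "url arc_filename offset" and the drive is
# NFS-mounted under /ia -- the real layout may differ from this sketch.

def find_record(url, idx_path):
    """Return (arc_file, offset) for a URL, or None if it is not indexed."""
    with open(idx_path) as idx:
        for line in idx:
            fields = line.split()
            if len(fields) >= 3 and fields[0] == url:
                return fields[1], int(fields[2])
    return None

def read_record(arc_path, offset, size=4096):
    """Read the raw bytes of one archived record at the given offset."""
    with open(arc_path, "rb") as arc:
        arc.seek(offset)
        return arc.read(size)

hit = find_record("http://example.com/", "/ia/ia001/0/pages.idx")
if hit:
    arc_file, offset = hit
    data = read_record("/ia/ia001/0/" + arc_file, offset)
    print(data[:200])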

?: How does the Archive acquire collections?

PL: The Internet Archive acquires collections in two ways: Web robots that collect publicly accessible Web pages and donations of digital collections. A Web-crawling robot is software that automatically collects Web pages from publicly accessible Web servers. It examines each page for links to other pages that it can collect. In turn, if it finds more links on those pages, it follows those too. A set of pages that has been retrieved by a robot is called a "crawl." Crawling is how most search engines collect Web pages for indexing.
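To make the idea concrete, a toy crawler can be sketched in a few lines of Python. This is purely illustrative and is not the Archive's or Alexa's software; real crawlers add politeness rules, retries, deduplication, and persistent storage.

# A toy Web crawler: fetch a page, extract its links, follow them in turn.
from html.parser import HTMLParser
from urllib.request import urlopen
from urllib.parse import urljoin

class LinkParser(HTMLParser):
    """Collect the href targets of all anchor tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, limit=10):
    """Breadth-first crawl from start_url, stopping after `limit` pages."""
    seen, queue = set(), [start_url]
    while queue and len(seen) < limit:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            page = urlopen(url).read().decode("utf-8", errors="replace")
        except OSError:
            continue  # unreachable pages are simply skipped
        parser = LinkParser()
        parser.feed(page)
        queue.extend(urljoin(url, link) for link in parser.links)
    return seen

print(crawl("http://example.com/"))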

The Internet Archive currently receives "crawls," or snapshots of the Web, as donations from the Web navigation service Alexa Internet. In addition, we have plans to augment Alexa’s crawls with our own. Each Alexa crawl takes about two months.

Alexa’s robot currently gathers more than 100 gigabytes of publicly available information a day, often from thousands of different sites at a time. It’s equipped with a "throttle" that limits the rate of requests and prevents the robot from interfering with a server’s normal activities. Unlike people, who can follow any link as they browse the Web, robots cannot, because Webmasters and authors use certain standards to control robots' access to their sites. For example, Alexa’s robot does not copy pages that require a password to access, pages tagged for "robot exclusion" by their owners, pages that are only accessible when a person types into and sends a form, or pages on secure servers.
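The robot-exclusion convention mentioned here is machine-readable, so honoring it takes only a few lines. A minimal sketch using Python's standard robots.txt parser, with a simple delay standing in for the throttle (Alexa's actual throttle mechanism is not described here):

# Honor a site's robots.txt and rate-limit requests, as polite crawlers do.
import time
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("http://example.com/robots.txt")
rp.read()  # fetch and parse the site's robot-exclusion rules

REQUEST_DELAY = 1.0  # seconds between requests -- the "throttle"

for path in ["/", "/private/page.html"]:
    url = "http://example.com" + path
    if rp.can_fetch("*", url):     # "*" = rules applying to any robot
        print("allowed: ", url)
        time.sleep(REQUEST_DELAY)  # pause so the server is not overloaded
    else:
        print("excluded:", url)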

Crawls in the Archive from late 1996 (when Alexa began crawling) to late 1998 included images and other media. Crawls since then have included only ASCII text, but plans are in progress to conduct our own crawls to fill out the collections.


?: How is the Alexa software different from other search engines?

PL: Alexa is software that can be downloaded free from the company's Web site (http://www.alexa.com) and added to a Web browser. Unlike other search engines, such as Yahoo! and Excite, it doesn't rely on word searches. Instead, it watches where its users go on the Internet and records that information in a central database. Based on that information, Alexa can tell a user the most popular paths that other Alexa users have taken from the site the user is visiting at a given time. It can also suggest other sites offering related material. The top 10 sites pop up in a thin, gray bar near the browser and change as the user moves from page to page. The advantage of Alexa as a search engine is that it "attempts to be an objective source" for people seeking information.
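The principle behind those recommendations can be modeled simply: given navigation logs of which site each user visited next, count the site-to-site transitions and report the most common destinations. The sketch below is a conceptual model with made-up session data, not Alexa's actual algorithm.

# Conceptual model of "most popular paths": count site-to-site transitions.
from collections import Counter, defaultdict

# Hypothetical navigation logs: each list is one user's browsing session.
sessions = [
    ["news.com", "archive.org", "alexa.com"],
    ["archive.org", "loc.gov"],
    ["news.com", "archive.org", "loc.gov"],
]

transitions = defaultdict(Counter)
for session in sessions:
    for here, there in zip(session, session[1:]):
        transitions[here][there] += 1

# The sites most often visited after archive.org:
print(transitions["archive.org"].most_common(10))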

?: Is there any selection in crawling the Internet?

PL: No, that would take too much time given the huge number of files. It's easier to store everything.

?: Can you tell us something about Storage of the Collections?

PL: Storing the Archive’s collections involves parsing, indexing, and physically encoding the data. With the Internet collections growing at a rate of about 2 terabytes a month, this task poses a formidable challenge. For hardware, we use Linux PCs with clusters of IDE hard drives. Data collected until late 1998 was stored on DLT tape (a relatively inexpensive storage medium that is, however, too slow for querying); we are in the process of migrating that data to disk. We receive and store data in two formats:

ARC (.arc): 100-megabyte files, each made up of many individual files. Alexa Internet (currently the source of all crawls in our collections) is proposing ARC as a standard for archiving Internet objects.

MDT (.dat or .dt): Metadata files that contain contextual information extracted from the ARC files, such as when each data file was gathered, any URLs it might contain, and its size. MDT files make it easier to index the ARC files and to conduct research analyses.
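The container idea can be sketched as follows, with a deliberately simplified record layout (one header line carrying the URL, a timestamp, and the content size, followed by the content itself); the actual ARC format Alexa is proposing has more header fields.

# Simplified ARC-style container: concatenated records, each preceded by
# a one-line header "url timestamp size" (the real format has more fields).

def append_record(container_path, url, timestamp, content):
    """Append one captured page to the container file."""
    header = f"{url} {timestamp} {len(content)}\n".encode()
    with open(container_path, "ab") as arc:
        arc.write(header)
        arc.write(content)

def read_records(container_path):
    """Yield (url, timestamp, content) for every record in the container."""
    with open(container_path, "rb") as arc:
        while header := arc.readline():
            url, timestamp, size = header.decode().split()
            yield url, timestamp, arc.read(int(size))

append_record("sample.arc", "http://example.com/", "20000301120000",
              b"<html>hi</html>")
for url, ts, body in read_records("sample.arc"):
    print(url, ts, len(body))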

?: And what about preservation?

PL: Preservation is the ongoing task of permanently protecting stored resources from damage or destruction. The main issues are guarding against the consequences of accidents and data degradation and maintaining the accessibility of data as formats become obsolete. Any medium or site used to store data is potentially vulnerable to accidents and natural disasters. Maintaining copies of the Archive’s collections at multiple sites can help alleviate this risk. Part of the collection is already handled this way, and we are proceeding as quickly as possible to do the same with the rest.
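Keeping replicas only helps if damage can be detected, and a standard technique for that is comparing checksums across sites. A minimal sketch of this general approach (the paths are hypothetical, and this is not a description of the Archive's actual procedures):

# Detect data degradation by comparing checksums of replicas at two sites.
import hashlib

def sha256_of(path):
    """Compute the SHA-256 digest of a file, reading it in 1 MB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# Hypothetical paths to the same ARC file stored at two mirror sites.
primary = "/ia/ia001/0/pages.arc"
mirror = "/mirror/ia001/0/pages.arc"
if sha256_of(primary) != sha256_of(mirror):
    print("replica mismatch -- restore from the intact copy")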

?: Thank you, and I hope the Archive has already archived the unstable betacity.de website...

PL: Sure...



Brewster Kahle: Brewster is an engineer by profession and an archivist at heart. He designed supercomputers for Thinking Machines and helped found WAIS, Inc., Alexa Internet, and the Internet Archive.

Peter Lyman: Peter is university librarian and a professor in the School of Information Management and Systems at the University of California, Berkeley.

Kathleen Burch: Kathleen has helped start and run nonprofits, including the San Francisco Center for the Book, since 1973. She is a Xerox PARC Artist in Residence for 2000.

Bruce Gilliat: Bruce, with a background in networking and online content strategies, is a cofounder (with Brewster Kahle) of Alexa Internet.

[Links to Internet Libraries and Librarianship]

Alexa Internet has catalogued Web sites and provides this information in a free service.
http://www.alexa.com" target="_blank">http://www.alexa.com

The Council on Library and Information Resources works to ensure the well-being of the scholarly communication system.
http://www.clir.org" target="_blank">http://www.clir.org
See its publication Why Digitize?
http://www.clir.org/pubs/reports/pub80-smith/pub80.html

The Digital Library Forum has an online magazine and other resources for building digital libraries.
http://www.dlib.org" target="_blank">http://www.dlib.org

The Internet Public Library site has many links to online resources for the general public.
http://www.ipl.org" target="_blank">http://www.ipl.org

Brewster Kahle is founder of WAIS Inc. and Alexa Internet and chairman of the board of the Internet Archive. See his paper The Ethics of Digital Librarianship at
http://www.archive.org/about/documents/ethics_BK.html

Michael Lesk of the National Science Foundation has written extensively on digital archiving and digital libraries.
http://www.purl.net/NET/lesk

The Library of Congress is the national library of the United States.
http://www.loc.gov" target="_blank">http://www.loc.gov

The Museum Digital Library plans to help digitize collections and provide access to them.
http://www.digitalmuseums.org

The National Science Foundation Digital Library Program has funded academic research on digital libraries.
http://www.nsf.gov/home/crssprgm/dli/start.htm

Network Wizards has been tracking Internet growth for many years.
http://www.nw.com" target="_blank">http://www.nw.com

Project Gutenberg is making ASCII versions of classic literature openly available.
http://www.gutenberg.org

The Radio and Television Archive has many links to related resources.
http://www.rtvf.unt.edu/links/histsites.htm

Revival of the Library of Alexandria is a project to revive the ancient library in Egypt.
http://www.unesco.org/webworld/alexandria_new

The United States Government Printing Office produces and distributes information published by the US government.
http://www.access.gpo.gov

[Internet Mapping]

An Atlas of Cyberspaces has maps and dynamic tools for visualizing Web browsing.
http://www.cybergeography.com/atlas/surf.html

The Internet Mapping Project is a long-term project by a scientist at Bell Labs to collect routing data on the Internet.
http://www.cs.bell-labs.com/who/ches/map

The Matrix Information Directory Service has good maps and visualizations of the networked world.
http://www.mids.org" target="_blank">http://www.mids.org

Peacock Maps has maps of Internet connectivity.

[Internet Statistics]

WebReference has an Internet statistics page (publisher: Internet.com).
http://www.webreference.com/internet/statistics.html
