Peter Lyman, researcher at The Archive, an Internet archive in San Francisco
By Jens Gebhart on 31 August 2000 at 00:00
The interview with Peter Lyman was conducted by Jens Gebhart ( infoi-laborg ) in September 2000,
photos by Katja Dell ( katjadellsde )
"The Internet Archive" in San Francisco, a place where the entire Internet is archived...
?: What is archive.org?
PL: The Archive is a 501(c)(3) public nonprofit that was founded to build an ‘Internet
library,’ with the purpose of offering free access to historical digital
collections for researchers, historians, and scholars. It was founded in 1996 and is located
in the Presidio of San Francisco. In late 1999, the organization started to grow
to build more well-rounded collections.
?: Why is the Internet Archive collecting sites from the Internet? What makes
the information useful?
PL: Most societies place importance on preserving their culture and heritage.
As our culture produces more and more artifacts in digital form, the Archive is
preserving them to create a public library for researchers, historians, and scholars.
While the newness of the digital media format presents challenges in collecting
and preserving materials, we feel it is necessary: Our cultural, political, and
historical artifacts are increasingly created in digital form, and if they are
not saved now, they may never be saved at all. Much early media — television
and radio, for example — was not saved. Many early movies were recycled to
recover the silver in the film! Even now, at the turn of the 21st century, no
comprehensive archives of television or radio programs exist.
But without cultural artifacts, civilization has no memory and no mechanism to
learn from its successes and failures. The Internet Archive is working to prevent
the Internet — a new medium with major historical significance — from
disappearing into the past. Collaborating with institutions including the Library
of Congress and the Smithsonian, we are working to permanently preserve a record
of public material. In addition to developing our own collections, we will be
working to promote formation of other Internet libraries in the United States
and elsewhere.
?: Who has access to the collections?
PL: The Archive makes the collections available at no cost to researchers, historians,
and scholars. At present, it takes someone with a certain level of technical knowledge
to access the data, but there is no requirement that a user be affiliated with
any particular organization. Open and free access to literature and other writings
has long been considered essential to education and to the maintenance of an open
society. Public and philanthropic enterprises have supported it through the ages.
The Internet Archive is opening its collections to researchers, historians, and
scholars to ensure that they have free and permanent access to public materials.
The Archive has no vested interest in the discoveries of the users of its collections,
nor is it a grant-making organization. At present, using collections of this size
requires programming skills. However, we are hopeful about the development of
tools and methods that will give the general public easy and meaningful access
to our collective history.
?: Do you collect all the sites on the Web?
PL: No, we collect only publicly accessible Web pages. These may include pages
with personal information. If there is any indication that a site’s owner
doesn’t want us to archive the site, we don’t. We also do not collect
or archive personal email messages or chat systems.
?: Are you violating copyright laws?
PL: No. Like your local library’s collections, our collections consist of
publicly available documents. But in our case, the Archive has collected only
pages that were available on the Internet at no cost and without passwords or
special privileges. Even further, the authors of Web pages can remove their documents
from the collection. You can also stop robots from crawling your site. Stopping
robots from collecting the pages on a site leads to the removal of the pages from
the existing collection.
?: What privacy issues does the Archive bring up? How do you protect my privacy
if you archive my site?
PL: The Archive collects Web pages that are publicly available — the same
ones that you might find as you surfed around the Web. We do not archive sites
when there is any indication that their owners do not want them archived. Like
a public library, the Archive provides free and open access to its collections
to researchers, historians, scholars, and possibly to the general public. Our
cultural norms have long promoted access to documents that were, but no longer
are, publicly accessible.
Unlike a paper library, on the other hand, the Archive will not collect publicly
available materials if there is an indication that the owner does not wish them
to be archived. Furthermore, if we find any such indication on a site, we remove
all previous versions of the page from the collections. We provide information
on removing a site from the collections. Those who use the collections must agree
to certain terms of use.
Given the rate at which the Internet is changing — the average life of a
Web page is only 77 days — if no effort is made to preserve it, it will be
entirely and irretrievably lost. Rather than let this moment slip by, we are proceeding
with documenting the growth and content of the Internet, using libraries as our
model.
?: What is the actual size of the Archive's Collection?
PL: We have been collecting since October 1996. The current size is 13.8 terabytes
(about 1 billion pages, text only during 1999; for comparison, the World Wide Web
in 1997 was 2 terabytes), and the monthly rate of growth is 2 terabytes as of March 2000.
That means the Internet doubles in size about every half a year. At the moment all
Web pages from late 1998 to six or more months ago (the collection contains no
material less than six months old) are accessible, or about 3 terabytes as of
March 2000. We hope to make the rest of the material (collected from late 1996
to late 1998) available during 2000.
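The doubling figure can be checked with a little arithmetic. The sketch below assumes, as a simplification not stated in the interview, that the collection keeps growing at a constant 2 terabytes per month; the function name is made up for illustration.

```python
def months_to_double(current_tb: float, growth_tb_per_month: float) -> float:
    """Months until the collection doubles, assuming constant linear growth."""
    if growth_tb_per_month <= 0:
        raise ValueError("growth rate must be positive")
    return current_tb / growth_tb_per_month


# Figures quoted in the interview (March 2000).
collection_tb = 13.8
growth_tb = 2.0

print(f"{months_to_double(collection_tb, growth_tb):.1f} months")
```

Under that assumption the result is a little under seven months, which is roughly in line with the half-year figure, although it describes the growth of the collection rather than of the Internet as a whole.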
?: What are the technical requirements for access to the Archive's collections?
PL: While the Internet Archive does not charge for access to the collections,
you will need Unix programming skills to gain access to and use an entire collection,
such as the Archive’s collection of Web snapshots. (A free, easy-to-use application
from Alexa Internet, which donates Internet materials to the Archive, can be used
to access individual Web pages.)
The Archive assigns each user an ssh (secure shell) access account and disk space
on the server facade.archive.org. (Secure shell access provides character-terminal
log-in; it’s similar to Telnet access but more secure.) The server runs the
Linux operating system. The server facade.archive.org has access to a series of
Linux machines (named ia000.archive.org, ia001.archive.org, and so on). Each machine
has either 12 or 20 disk drives (named 0, 1, 2, and so on). On each drive are
three types of files: ARC format; DAT or MDT format (URLs and image references
from the ARC files); and IDX (index) format, each of which contains a list of URLs
and their associated place in the ARC and DAT files. Users access the hard drives
where the collections reside by referencing these remote files from facade.archive.org.
You can use either FTP or NFS (network file system) access.
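The role of an index file can be illustrated with a small sketch. The interview does not describe the actual IDX or ARC layouts, so everything here is hypothetical: the index is a plain dictionary from URL to byte offset and length, and the "archive" is an in-memory blob. The point is only that an index lets a reader jump to one record without scanning the whole file.

```python
import io
from typing import Dict, Tuple

# Hypothetical index: URL -> (byte offset, length) inside one archive file.
Index = Dict[str, Tuple[int, int]]


def build_archive(pages: Dict[str, bytes]) -> Tuple[bytes, Index]:
    """Concatenate page bodies into one blob and record where each one sits."""
    blob = io.BytesIO()
    index: Index = {}
    for url, body in pages.items():
        index[url] = (blob.tell(), len(body))
        blob.write(body)
    return blob.getvalue(), index


def fetch(archive: bytes, index: Index, url: str) -> bytes:
    """Use the index to go straight to one record instead of scanning."""
    offset, length = index[url]
    return archive[offset:offset + length]


pages = {
    "http://example.org/": b"<html>home</html>",
    "http://example.org/about": b"<html>about</html>",
}
archive, index = build_archive(pages)
print(fetch(archive, index, "http://example.org/about"))
```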
?: How does the Archive acquire collections?
PL: The Internet Archive acquires collections in two ways: Web robots that collect
publicly accessible Web pages and donations of digital collections. A Web-crawling
robot is software that automatically collects Web pages from publicly accessible
Web servers. It examines each page for links to other pages that it can collect.
In turn, if it finds more links on those pages, it follows those too. A set of
pages that has been retrieved by a robot is called a "crawl." Crawling
is how most search engines collect Web pages for indexing.
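The link-following behaviour described above can be sketched as a breadth-first traversal. This is a toy illustration, not Alexa's robot: the "Web" is a hypothetical in-memory dictionary mapping each page to the links it contains, so no network access is involved.

```python
from collections import deque
from typing import Dict, List


def crawl(start: str, links: Dict[str, List[str]]) -> List[str]:
    """Collect a page, then queue every link on it that has not been seen yet."""
    seen = {start}
    queue = deque([start])
    collected = []
    while queue:
        page = queue.popleft()
        collected.append(page)
        for target in links.get(page, []):
            if target not in seen:  # never collect the same page twice
                seen.add(target)
                queue.append(target)
    return collected


# Hypothetical four-page site; "/b" links back to the start page.
toy_web = {
    "/": ["/a", "/b"],
    "/a": ["/b", "/c"],
    "/b": ["/"],
    "/c": [],
}
print(crawl("/", toy_web))
```

The `seen` set is what keeps the robot from looping forever when pages link back to each other.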
The Internet Archive currently receives "crawls," or snapshots of the
Web, as donations from the Web navigation service Alexa Internet. In addition,
we have plans to augment Alexa’s crawls with our own. Each Alexa crawl takes
about two months.
Alexa’s robot currently gathers more than 100 gigabytes of publicly available
information a day, often from thousands of different sites at a time. It’s
equipped with a "throttle" that limits the rate of requests and prevents
the robot from interfering with a server’s normal activities. Unlike people,
who can follow any link when they browse the Web, robots do not follow every link,
because Webmasters and authors use certain standards to control robots’ access to their sites.
For example, Alexa’s robot does not copy pages that require a password to
access, pages tagged for "robot exclusion" by their owners, pages that
are only accessible when a person types into and sends a form, or pages on secure
servers.
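Two of the mechanisms mentioned here, robot exclusion and the throttle, can be sketched with Python's standard library. The sketch assumes that exclusion is expressed through a robots.txt file, which the interview does not name explicitly; the robot name, the sample rules, and the `seconds_to_wait` helper are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt published by a site owner.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def make_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt rules from a string instead of fetching them."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def seconds_to_wait(last_request: float, now: float, min_interval: float) -> float:
    """Throttle: how long to pause so requests stay min_interval seconds apart."""
    return max(0.0, min_interval - (now - last_request))


parser = make_parser(ROBOTS_TXT)
print(parser.can_fetch("example-bot", "http://example.org/index.html"))
print(parser.can_fetch("example-bot", "http://example.org/private/diary.html"))
print(seconds_to_wait(last_request=10.0, now=10.5, min_interval=2.0))
```

A polite robot would check `can_fetch` before every request and sleep for the returned number of seconds between requests to the same server.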
Crawls in the Archive from late 1996 (when Alexa began crawling) to late 1998
included images and other media. Crawls since then have included only ASCII text,
but plans are in progress to conduct our own crawls to fill out the collections.
?: How is the Alexa software different from other search engines?
PL: Alexa is software that can be retrieved free from the company's Web site (http://www.alexa.com)
and added to a Web browser. Unlike other search engines, such as Yahoo! and Excite,
it doesn't rely on word searches. Instead, it watches where its users go on the
Internet, and then records that information in a central database. Based on that
information, Alexa can tell a user the most popular paths that other Alexa users
have taken from the site the user is visiting at a given time. It also can suggest
other sites offering related material. The top 10 sites pop up in a thin, gray
bar near the browser (see below) and change as the user moves from page to page.
The advantage of Alexa as a search engine is that it "attempts to be an objective
source" for people seeking information.
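The "most popular paths" idea can be sketched as counting where users go next from each site. This is only an illustration of the principle described in the answer, not Alexa's actual method; the site names and function names are invented.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def build_paths(visits: Iterable[Tuple[str, str]]) -> Dict[str, Counter]:
    """Count, for every site, how often users moved on to each other site."""
    paths: Dict[str, Counter] = defaultdict(Counter)
    for source, destination in visits:
        paths[source][destination] += 1
    return paths


def popular_next(paths: Dict[str, Counter], site: str, limit: int = 10) -> List[str]:
    """Return the most frequently taken next sites from the given site."""
    return [dest for dest, _ in paths.get(site, Counter()).most_common(limit)]


# Hypothetical log of (site the user was on, site the user went to).
visits = [
    ("news.example", "sports.example"),
    ("news.example", "weather.example"),
    ("news.example", "sports.example"),
    ("news.example", "mail.example"),
    ("news.example", "sports.example"),
    ("news.example", "weather.example"),
]
paths = build_paths(visits)
print(popular_next(paths, "news.example", limit=2))
```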
?: Is there any selection in crawling the Internet?
PL: No, that would take too much time, considering the huge amount of files. It's
easier to store everything.
?: Can you tell us something about Storage of the Collections?
PL: Storing the Archive’s collections involves parsing, indexing, and physically
encoding the data. With the Internet collections growing at a rate of about 2
terabytes a month, this task poses a formidable challenge. For hardware, we use
Linux PCs with clusters of IDE hard drives. Data collected until late 1998 was
stored on DLT tape (a relatively inexpensive storage medium that is, however,
too slow for querying). We are in the process of migrating that data to disk.
We receive and store data in two formats: archive (ARC) files and metadata (MDT)
files. ARC (.arc): These are 100-megabyte files made up of many individual files.
Alexa Internet (currently the source of all crawls in our collections) is proposing
ARC as a standard for archiving Internet objects. MDT (.dat or .dt): These are
metadata files that contain contextual information extracted from ARC files, such
as when each data file was gathered, any URLs it might contain, and its size.
MDT files make it easier to index the ARC files and to conduct research analyses.
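The split between large container files and separate metadata can be sketched as follows. The real ARC and MDT layouts are not described in the interview, so the record fields, the decimal reading of "100 megabytes", and the simple in-order packing rule are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

# "100-megabyte files"; treating a megabyte as 10**6 bytes is an assumption.
MAX_CONTAINER_BYTES = 100 * 1000 * 1000


@dataclass
class Metadata:
    url: str
    gathered: str   # when the item was collected
    size: int       # size in bytes
    container: int  # which container file holds the item


def pack(items: List[Tuple[str, str, int]],
         limit: int = MAX_CONTAINER_BYTES) -> List[Metadata]:
    """Assign items to containers in order, starting a new one when full."""
    records = []
    container, used = 0, 0
    for url, gathered, size in items:
        if used and used + size > limit:
            container += 1
            used = 0
        records.append(Metadata(url, gathered, size, container))
        used += size
    return records


# Tiny hypothetical items and a tiny limit so the roll-over is visible.
items = [
    ("http://example.org/a", "2000-03-01", 60),
    ("http://example.org/b", "2000-03-01", 30),
    ("http://example.org/c", "2000-03-02", 50),
]
for record in pack(items, limit=100):
    print(record)
```

Keeping the metadata records separate from the containers means questions such as "how large was this page" can be answered without opening the large files at all.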
?: And preservation?
PL: Preservation is the ongoing task of permanently protecting stored resources
from damage or destruction. The main issues are guarding against the consequences
of accidents and data degradation and maintaining the accessibility of data as
formats become obsolete. Any medium or site used to store data is potentially
vulnerable to accidents and natural disasters. Maintaining copies of the Archive’s
collections at multiple sites can help alleviate this risk. Part of the collection
is already handled this way, and we are proceeding as quickly as possible to do
the same with the rest.
?: Thank you, and I hope that the Archive has already archived the unstable betacity.de
website...
PL: Sure.
[Board]
Brewster Kahle: Brewster is an engineer by profession and an archivist at heart.
He designed supercomputers for Thinking Machines and helped found WAIS, Inc.,
Alexa Internet, and the Internet Archive.
Peter Lyman: Peter is university librarian and a professor in the School of Information
Management and Systems at the University of California, Berkeley.
Kathleen Burch: Kathleen has helped start and run nonprofits, including the San
Francisco Center for the Book, since 1973. She is a Xerox PARC Artist in Residence
for 2000.
Bruce Gilliat: Bruce, with a background in networking and online content strategies,
is a cofounder (with Brewster Kahle) of Alexa Internet.
[Links to Internet Libraries and Librarianship]
Alexa Internet has catalogued Web sites and provides this information in a free
service.
http://www.alexa.com
The Council on Library and Information Resources works to ensure the well-being
of the scholarly communication system.
http://www.clir.org
See its publication Why Digitize?
http://www.clir.org/pubs/reports/pub80-smith/pub80.html
The Digital Library Forum has an online magazine and other resources for building
digital libraries.
http://www.dlib.org
The Internet Public Library site has many links to online resources for the general
public.
http://www.ipl.org
Brewster Kahle is founder of WAIS Inc. and Alexa Internet and chairman of the
board of the Internet Archive. See his paper The Ethics of Digital Librarianship
at
http://www.archive.org/about/documents/ethics_BK.html
Michael Lesk of the National Science Foundation has written extensively on digital
archiving and digital libraries.
http://www.purl.net/NET/lesk
The Library of Congress is the national library of the United States.
http://www.loc.gov
The Museum Digital Library plans to help digitize collections and provide access
to them.
http://www.digitalmuseums.org
The National Science Foundation Digital Library Program has funded academic research
on digital libraries.
http://www.nsf.gov/home/crssprgm/dli/start.htm
Network Wizards has been tracking Internet growth for many years.
http://www.nw.com
Project Gutenberg is making ASCII versions of classic literature openly available.
http://www.gutenberg.org
The Radio and Television Archive has many links to related resources.
http://www.rtvf.unt.edu/links/histsites.htm
Revival of the Library of Alexandria is a project to revive the ancient library
in Egypt.
http://www.unesco.org/webworld/alexandria_new
The United States Government Printing Office produces and distributes information
published by the US government.
http://www.access.gpo.gov
[Internet Mapping]
An Atlas of Cyberspaces has maps and dynamic tools for visualizing Web browsing.
http://www.cybergeography.com/atlas/surf.html
The Internet Mapping Project is a long-term project by a scientist at Bell Labs
to collect routing data on the Internet.
http://www.cs.bell-labs.com/who/ches/map
The Matrix Information Directory Service has good maps and visualizations of the
networked world.
http://www.mids.org
Peacock Maps has maps of Internet connectivity.
http://www.peacockmaps.com
[Internet Statistics]
WebReference has an Internet statistics page (publisher: Internet.com).
http://www.webreference.com/internet/statistics.html