The Cost of a WARC: Analyzing Web Archives in the Cloud

dc.contributor.author: Deschamps, Ryan
dc.contributor.author: Fritz, Samantha
dc.contributor.author: Lin, Jimmy
dc.contributor.author: Milligan, Ian
dc.contributor.author: Ruest, Nick
dc.date.accessioned: 2019-04-19T19:47:49Z
dc.date.available: 2019-04-19T19:47:49Z
dc.date.issued: 2019
dc.description.abstract: The value of web archives to support scholarship in the humanities and social sciences is slowly being realized by the increasing availability of scalable tools and platforms. The cost of providing scholarly access is a critical component of developing a long-term sustainability strategy. This paper attempts to answer a straightforward question: How much does it cost to analyze web archives in the cloud? To make this question more concrete, we examine the creation of three derivatives (extraction of collection statistics, full text, and the webgraph) that serve as the starting points of many scholarly inquiries. Our analysis shows that these typical derivatives cost around US$7 per TB using our Archives Unleashed Toolkit. We describe in detail the methodology and assumptions made to arrive at this figure. To our knowledge, we are the first to quantify the economics of scholarly access to web archives, and we believe that this information is valuable for service planning by archives, libraries, and other institutions. [en_US]
dc.description.sponsorship: This work was primarily supported by the Andrew W. Mellon Foundation, with additional funding from Start Smart Labs, the Natural Sciences and Engineering Research Council of Canada, the Social Sciences and Humanities Research Council of Canada, and the Ontario Ministry of Research and Innovation's Early Researcher Award program. We'd like to thank our content partners and Raymie Stata for comments on an earlier draft.
dc.identifier.citation: Ryan Deschamps, Samantha Fritz, Jimmy Lin, Ian Milligan, and Nick Ruest. “The Cost of a WARC: Analyzing Web Archives in the Cloud.” Proceedings of the ACM/IEEE Joint Conference on Digital Libraries, Vol. 19 (2019).
dc.identifier.issn: 978-1-7281-1547-4/19
dc.identifier.uri: http://hdl.handle.net/10315/36158
dc.identifier.uri: https://doi.org/10.1109/JCDL.2019.00043
dc.language.iso: en
dc.rights.uri: https://doi.org/10.1109/JCDL.2019.00043
dc.subject: web archives [en]
dc.subject: computational analysis [en]
dc.subject: sustainability [en]
dc.subject: cloud computing [en]
dc.title: The Cost of a WARC: Analyzing Web Archives in the Cloud [en]
dc.type: Article

Files

Original bundle

Name: cost-analysis.pdf
Size: 739.96 KB
Format: Adobe Portable Document Format
Description: Main article

Name: JCDL - Cost of a WARC Presentation-4.pdf
Size: 2.27 MB
Format: Adobe Portable Document Format
Description: JCDL Presentation (slides)

License bundle

Name: license.txt
Size: 1.83 KB
Format: Item-specific license agreed upon to submission