YUL research and professional contributions
Research conducted by York University Library Faculty members can be found in this collection, along with professional contributions such as presentation slides and instructional videos.
Browsing YUL research and professional contributions by Author "972f8992358baee35384472d4bcaa06e"
Item Open Access
ABCDEF - The 6 key features behind scalable, multi-tenant web archive processing with ARCH: Archive, Big Data, Concurrent, Distributed, Efficient, Flexible (ACM, 2022-06-20)
Holzmann, Helge; Ruest, Nick; Bailey, Jefferson; Dempsey, Alex; Fritz, Samantha; Lee, Peggy; Milligan, Ian
Over the past quarter-century, web archive collection has emerged as a user-friendly process thanks to cloud-hosted solutions such as the Internet Archive’s Archive-It subscription service. Despite advancements in collecting web archive content, no equivalent has been found by way of a user-friendly cloud-hosted analysis system. Web archive processing and research require significant hardware resources and cumbersome tools that interdisciplinary researchers find difficult to work with. In this paper, we identify six principles - the ABCDEFs (Archive, Big data, Concurrent, Distributed, Efficient, and Flexible) - used to guide the development and design of a system. These make the transformation of, and working with, web archive data as enjoyable as the collection process. We make these objectives – largely common sense – explicit and transparent in this paper. They can be employed by every computing platform in the area of digital libraries and archives and adapted by teams seeking to implement similar infrastructures. Furthermore, we present ARCH (Archives Research Compute Hub), the first cloud-based system designed from scratch to meet all of these six key principles. ARCH is an interactive interface, closely connected with Archive-It, engineered to provide analytical actions, specifically generating datasets and in-browser visualizations. It efficiently streamlines research workflows while eliminating the burden of computing requirements. Building off past work by both the Internet Archive (Archive-It Research Services) and the Archives Unleashed Project (the Archives Unleashed Cloud), this merged platform achieves a scalable processing pipeline for web archive research.
It will be made open-source shortly and can be considered a reference implementation of the ABCDEF, which we have evaluated and discussed in terms of feasibility and compliance as a benchmark for similar platforms.

Item Open Access
Active Digital Preservation and Data/Metadata Migration (2017-04-04)
Estlund, Karen; Ruest, Nick
Digital preservation activities increasingly focus on the movement of data and metadata between systems. This panel will present case studies in moving content through preservation activities with APTrust, the Digital Preservation Network, MetaArchive, and local applications. The presentations will highlight common methodologies and elicit group discussion on strategic and sustainable planning for active digital preservation. As the pace of evolution of repository systems continues to increase and new opportunities for digital preservation systems continue to emerge, the nature of active movement of repository objects and metadata has become a growing concern. The focus of content stewardship is shifting from being application-centric to data-centric, with the understanding that content must move through time. In order to provide effective mechanisms to move repository data during repository migrations and to these preservation systems, significant efforts are needed for various import, export, and verification services. The Fedora and MetaArchive communities have begun collaborative efforts to create tools that, using the BagIt standard, will enable preservation and system profiles that allow for ease of digital object transfer.
Essential to these discussions is the role of metadata, file integrity, and size of transfers to actively manage digital objects.

Item Open Access
Arch-It! (2022-06-24)
Holzmann, Helge; Ruest, Nick; Bailey, Jefferson; Dempsey, Alex; Fritz, Samantha; Milligan, Ian; Willis, Kody
Over the past quarter-century, web archive collection has emerged as a user-friendly process thanks to cloud-hosted solutions such as the Internet Archive’s Archive-It subscription service. Despite advancements in collecting web archive content, no equivalent has been found by way of a user-friendly cloud-hosted analysis system. Web archive processing and research require significant hardware resources and cumbersome tools that interdisciplinary researchers find difficult to work with. In this paper, we present ARCH (Archives Research Compute Hub), an interactive interface, closely connected with Archive-It, engineered to provide analytical actions, specifically generating datasets and in-browser visualizations. It efficiently streamlines research workflows while eliminating the burden of computing requirements. Building off past work by both the Internet Archive (Archive-It Research Services) and the Archives Unleashed Project (the Archives Unleashed Cloud), this merged platform achieves a scalable processing pipeline for web archive research.

Item Open Access
The Archives Unleashed Notebook: Madlibs for Jumpstarting Scholarly Exploration (2019)
Deschamps, Ryan; Ruest, Nick; Lin, Jimmy; Fritz, Samantha; Milligan, Ian
This paper introduces the Archives Unleashed Notebook, which is designed to work with derivative datasets from the Archives Unleashed Cloud, a platform for analyzing web archives. These datasets contain common starting points for scholarly inquiry, including full text content and the domain-level webgraph. Our notebooks interactively walk a scholar through the process of interrogating a collection using a fill-in-the-blanks 'madlibs' approach to promote engagement.
Scholars start with a notebook populated with common analyses, in which they can make minor changes to variables to alter the subject of study in systematic ways.

Item Open Access
The Archives Unleashed Project: Technology, Process, and Community to Improve Scholarly Access to Web Archives (ACM/IEEE, 2020-08)
Ruest, Nick; Lin, Jimmy; Milligan, Ian; Fritz, Samantha
The Archives Unleashed project aims to improve scholarly access to web archives through a multi-pronged strategy involving tool creation, process modeling, and community building -- all proceeding concurrently in mutually reinforcing efforts. As we near the end of our initially-conceived three-year project, we report on our progress and share lessons learned along the way. The main contribution articulated in this paper is a process model that decomposes scholarly inquiries into four main activities: filter, extract, aggregate, and visualize. Based on the insight that these activities can be disaggregated across time, space, and tools, it is possible to generate "derivative products", using our Archives Unleashed Toolkit, that serve as useful starting points for scholarly inquiry. Scholars can download these products from the Archives Unleashed Cloud and manipulate them just like any other dataset, thus providing access to web archives without requiring any specialized knowledge. Over the past few years, our platform has processed over a thousand different collections from over two hundred users, totaling around 300 terabytes of web archives.

Item Open Access
Building Community and Tools for Analyzing Web Archives through Datathons (2019)
Milligan, Ian; Casemajor, Nathalie; Fritz, Samantha; Lin, Jimmy; Ruest, Nick; Weber, Matthew S.; Worby, Nicholas
Starting in March 2016, the Archives Unleashed team and our collaborators have brought together social scientists, humanists, archivists, librarians, computer scientists, and other stakeholders to explore web archives as research objects.
Three objectives motivated our team to develop and organize these events: facilitating scholarly access, community building, and skills training. We believe that we have been successful on all three fronts. For each event, over the course of two to three days, participants formed interdisciplinary teams and explored web archives using a variety of methods and tools. This paper details our experiences in designing these "datathons", with an intent to share lessons learned, highlight interdisciplinary approaches to research and education on web archives, and describe future opportunities.

Item Open Access
Building Community at Distance: A Datathon during COVID-19 (Digital Library Perspectives, 2020-08-04)
Fritz, Samantha; Milligan, Ian; Ruest, Nick; Lin, Jimmy
This paper aims to use the experience of an in-person event that was forced to go virtual in the wake of COVID-19 as an entryway into a discussion on the broader implications around transitioning events online. It gives both practical recommendations to event organizers and broader reflections on the role of digital libraries during the COVID-19 pandemic and beyond.

Item Open Access
Building Successful, Open Repository Software Ecosystems: Technology and Community (2014-06-10)
Ruest, Nick; Mumma, Courtney; Fleming, Declan; Giarlo, Michael; Woods, Andrew
Archivematica, AtoM (Access to Memory), Fedora, Hydra, and Islandora provide a set of functions which contribute to a diverse curation and repository ecosystem. They are also projects existing in a greater open source ecosystem. Overlapping functionality and gaps among the tools available can make piecing together workflows and sharing solutions challenging. Sharing practices and collaborating on directions benefits each of the projects, as well as advances the objectives of the open repository community. This panel session offers a conversation about strategies to collaborate more efficiently and explore connections in the repository software ecosystem.
Panelists will provide an overview of how each community operates, touching on the technological and social dimensions of community open-source project governance. We will look at how each of the communities collaborates with, learns from, challenges, and inspires the others, as well as how they engage other communities and organizations contributing to a thriving and more diverse repository ecosystem.

Item Open Access
Capturing the Web Today for Tomorrow: Innovations in capturing and analyzing social media and websites for the new scholarly record (2017-03-07)
Ruest, Nick; Milligan, Ian
The growth of digital sources since the advent of the World Wide Web in 1991, and the commencement of widespread web archiving in 1996, presents profound new opportunities for social and cultural analysis. In simple terms, the 1990s cannot be studied without web archives: they are both primary sources that reflect how people consume and understand media, as well as repositories that document the thoughts, opinions, and activities of millions of everyday people. These are a dream for social historians. However, all of this opportunity brings challenges. The size and complexity of the data requires interdisciplinary collaboration. Historians might have the research questions but not the technical resources or knowledge to work with these sources, requiring outreach to other disciplines. Libraries and archives are perfectly positioned to work in this new emerging field that brings together historians, computer scientists, and information specialists. In this talk, our speakers will discuss the fruits of one collaboration that has emerged at York University, the University of Alberta, and the University of Waterloo. Bringing together librarians, archivists, historians, and computer scientists, as well as an interdisciplinary team of undergraduate and graduate students, this distributed group is developing several web archival analytics projects.
They work using a combination of centralized and de-centralized infrastructure to run data analytics, store web archives, provide a publicly-facing portal, and collaborate. Ian and Nick will discuss the challenges of working in an interdisciplinary environment, and give insights into how the team has been working through in-detail case studies of their work with webarchives.ca, Twitter archiving and analysis, Compute Canada, and Warcbase, a web analytics platform. Collaboration between computer scientists, librarians, archivists and humanists is not always a simple one, but it is a collaboration worth pursuing.

Item Open Access
Content Selection and Curation for Web Archiving: The Gatekeepers vs. the Masses (2016)
Milligan, Ian; Ruest, Nick; Lin, Jimmy
Any preservation effort must begin with an assessment of what content to preserve, and web archiving is no different. There have historically been two answers to the question "what should we archive?" The Internet Archive's broad entire-web crawls have been supplemented by narrower domain or topic-specific collections gathered by numerous libraries. We can characterize this as content selection and curation by "gatekeepers". In contrast, we have witnessed the emergence of another approach driven by "the masses": we can archive pages that are contained in social media streams such as Twitter. The interesting question, of course, is how these approaches differ. We provide an answer to this question in the context of a case study about the 2015 Canadian federal elections. Based on our analysis, we recommend a hybrid approach that combines an effort driven by social media and more traditional curatorial methods.

Item Open Access
Content Selection and Curation for Web Archiving; The Gatekeepers vs the Masses (Presentation) (2016-06-22)
Ruest, Nick
Any preservation effort must begin with an assessment of what content to preserve, and web archiving is no different.
There have historically been two answers to the question "what should we archive?" The Internet Archive's broad entire-web crawls have been supplemented by narrower domain or topic-specific collections gathered by numerous libraries. We can characterize this as content selection and curation by "gatekeepers". In contrast, we have witnessed the emergence of another approach driven by "the masses": we can archive pages that are contained in social media streams such as Twitter. The interesting question, of course, is how these approaches differ. We provide an answer to this question in the context of a case study about the 2015 Canadian federal elections. Based on our analysis, we recommend a hybrid approach that combines an effort driven by social media and more traditional curatorial methods.

Item Open Access
Content-Based Exploration of Archival Images Using Neural Networks (ACM/IEEE, 2020-08)
Adewoye, Tobi; Han, Xiao; Ruest, Nick; Milligan, Ian; Fritz, Samantha; Lin, Jimmy
We present DAIRE (Deep Archival Image Retrieval Engine), an image exploration tool based on latent representations derived from neural networks, which allows scholars to "query" using an image of interest to rapidly find related images within a web archive. This work represents one part of our broader effort to move away from text-centric analyses of web archives and scholarly tools that are direct reflections of methods for accessing the live web. This short piece describes the implementation of our system and a case study on a subset of the GeoCities web archive.

Item Open Access
The Cost of a WARC: Analyzing Web Archives in the Cloud (2019)
Deschamps, Ryan; Fritz, Samantha; Lin, Jimmy; Milligan, Ian; Ruest, Nick
The value of web archives to support scholarship in the humanities and social sciences is slowly being realized by the increasing availability of scalable tools and platforms.
The cost of providing scholarly access is a critical component of developing a long-term sustainability strategy. This paper attempts to answer a straightforward question: How much does it cost to analyze web archives in the cloud? To make this question more concrete, we examine the creation of three derivatives (extraction of collection statistics, full text, and the webgraph) that serve as the starting points of many scholarly inquiries. Our analysis shows that these typical derivatives cost around US$7 per TB using our Archives Unleashed Toolkit. We describe in detail the methodology and assumptions made to arrive at this figure. To our knowledge, we are the first to quantify the economics of scholarly access to web archives, and we believe that this information is valuable for service planning by archives, libraries, and other institutions.

Item Open Access
CURATEcamp iPres 2012 (Ariadne, 2012-12-13)
Jordan, Mark; Mumma, Courtney; Ruest, Nick
Mark Jordan, Courtney Mumma, Nick Ruest and the participants of CURATEcamp iPres 2012 report on this unconference for digital curation practitioners and researchers, held on 2 October 2012 in Toronto.

Item Open Access
d3 Data Visualization Bootcamp (2013-06-07)
Ruest, Nick; Suhonos, MJ
Brief introduction of data visualization concepts, brief introduction of d3, and a walkthrough of three exercises using library datasets.

Item Open Access
Desiderata for Exploratory Search Interfaces to Web Archives in Support of Scholarly Activities (2016)
Jackson, Andrew; Lin, Jimmy; Milligan, Ian; Ruest, Nick
Web archiving initiatives around the world capture ephemeral web content to preserve our collective digital memory. In this paper, we describe initial experiences in providing an exploratory search interface to web archives for humanities scholars and social scientists. We describe our initial implementation and discuss our findings in terms of desiderata for such a system.
It is clear that the standard organization of a search engine results page (SERP), consisting of an ordered list of hits, is inadequate to support the needs of scholars. Shneiderman's mantra for visual information seeking ("overview first, zoom and filter, then details-on-demand") provides a nice organizing principle for interface design, to which we propose an addendum: "Make everything transparent". We elaborate on this by highlighting the importance of the temporal dimension of web pages as well as issues surrounding metadata and veracity.

Item Open Access
Digital Preservation at York University Libraries (2022-11-16)
Ruest, Nick
York University Libraries are ten years into a digital preservation program. How did it start, how did it evolve, what does our policy and documentation look like, and what are the lessons learned? Library organizations are unique, but there is generally a fair bit of overlap where our path, policies and documentation can be of use to other organizations.

Item Open Access
Digital Preservation Tools, Practices, and Policies in Islandora (2014-06-10)
Ruest, Nick; Jordan, Mark; Moses, Donald
There exist many standards and best practices in the digital preservation community, but not many of these practices are implemented as easy-to-use tools in our digital repository platforms. This presentation will focus on the policy-driven and community-focused development of digital preservation tools and best practices in the Islandora community, and on the tools created for transforming Submission Information Packages for Islandora to Archival Information Packages.
Specifically, it covers automated digital object file identification, automated checksum creation and verification, automated generation of PREMIS metadata, and the integration of BagIt, allowing for the network transfer of objects in a given Islandora repository.

Item Open Access
Enabling Access to Old Wu-Tang Clan Fan Sites: Facilitating Interdisciplinary Web Archive Collaboration (2016-03-08)
Ruest, Nick; Milligan, Ian
The growth of digital sources since the advent of the World Wide Web in 1991, and the commencement of widespread web archiving in 1996, presents profound new opportunities for social and cultural analysis. In simple terms, we cannot study the 1990s without web archives: they are both primary sources that reflect how people consume and understand media, as well as repositories that document the thoughts, opinions, and activities of millions of everyday people. These are a dream for social historians. For example, consider GeoCities, which grew to some thirty-eight million pages created by as many as seven million users during the fifteen years between 1994 and 2009. There are untold opportunities to understand the recent past, based on the voices of people who never before would have been included in a traditional historical record. But wait, with all this opportunity come challenges: large data, the need for interdisciplinary collaboration between historians who might have the questions but not the technical resources or knowledge to work with these sources, and basic questions around what a web archive is and how to access it. Libraries and archives are perfectly positioned to work in this new emerging field that brings together historians, computer scientists, and information specialists. In our talk, we discuss the fruits of one collaboration that has emerged at York University and the University of Waterloo.
Bringing together a librarian, a historian, a computer scientist, and an interdisciplinary team of undergraduate and graduate students, York has become a collaborative hub: using a combination of centralized and de-centralized infrastructure to run data analytics, store web archives, provide a publicly-facing portal (http://webarchives.ca/), and to collaborate using Slack, a research team has taken shape. We’ll discuss the challenges of working in an interdisciplinary environment, and give insights into how the team has been working through in-detail case studies of our work with http://webarchives.ca and the warcbase web analytics platform. The combination of computer scientists and humanists is not always a simple one, and York University Libraries provided the infrastructure, help, and leadership to make the team a success.

Item Open Access
Engaging the Public with Web Archives: Providing Access to 10 Years of Political History with WebArchives.ca (2016-05-31)
Ruest, Nick; Milligan, Ian
Introduction: The growth of digital sources since the advent of the World Wide Web in 1990-91 presents profound opportunities for historians. Large web archives contain billions of webpages, and now make it possible for us to develop large-scale reconstructions of the recent web. Yet the sheer number of these sources presents significant challenges. The Internet Archive's "Wayback Machine" (http://archive.org/web) is a standard entryway to these collections, but requires that the user know the URL of the resource they want to visit; it is not feasible to do large-scale research in this manner. By unlocking the Wayback Machine's underlying WebARCHive (ARC/WARC) files, we can develop methods to track, visualize, and analyze change occurring over time. In this paper, we discuss how we implemented the United Kingdom Web Archive (UKWA) "Shine" interface on a Canadian corpus, and how the provision of a user layer significantly changed levels of user engagement.
Project Rationale and Case Study: The University of Toronto Library (UTL) began collecting a quarterly crawl in 2005 of Canadian political parties and political interest groups. It includes fifty websites: major and minor political parties, as well as political interest groups such as the Assembly of First Nations and equal marriage advocacy groups. Collecting continues. Despite 2005-2015 having been a pivotal period for Canadian politics, analytics reveal that few took advantage of the collection. The current portal requires a visit to https://archive-it.org/collections/227 for full-text queries. There are no faceting or significant advanced search features. The interface is largely unusable for broad research questions.
Shine: To provide access, we implemented the Shine interface (https://github.com/ukwa/shine). Shine provides a web-based interface for interacting with Apache Solr. Using the open-sourced code, we indexed all of the sites, provided explanatory layers, generated additional analytics around what each crawl contained (as some crawls might contain more webpages from, say, the Liberals, which throws off the relative frequency of keywords), and tried to write better user documentation. We launched http://webarchives.ca as the 2015 Canadian federal election campaign began.
Results: WebArchives.ca received significant attention. The Canadian Broadcasting Corporation (CBC) carried stories in Canada Votes, the Kitchener-Waterloo affiliate, Spark, as well as talk radio and campus news. We received 17,861 pageviews over 4,000 user sessions, largely between 27 August and 19 October. It also led to research findings, including:
* unlike other forms of web content, political parties and interest groups do not archive material on their websites. This eases analysis due to fewer duplicates, but also shows why collecting is time critical;
* political parties flip flop: the Conservatives accused the Liberals in 2005 of paying insufficient attention to murdered and missing indigenous women; a complete reversal occurred on the 2015 websites;
* significant shifts away from user-generated content on party sites, which experimented with and then abandoned widespread commenting and hosting of blogs.
These were discoverable due to the Shine/Webarchives.ca interface.
Conclusions: More work needs to be done. The next step is to work with more Archive-It collections of national/international significance and publicize them in a similar way. At the end of the presentation, I will note an ongoing project we have with Canadian partners to consolidate and provide access to multiple collections.
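
The abstract above notes that uneven crawl sizes throw off the relative frequency of keywords. As a rough illustration of that caveat only, and not the WebArchives.ca or Shine implementation, the sketch below normalizes raw keyword hits by the number of pages captured in each crawl. The function name, the crawl labels, and all of the numbers are hypothetical.

```python
def relative_frequency(keyword_hits, pages_per_crawl):
    """Return keyword hits per 1,000 captured pages for each crawl.

    Both arguments are plain dicts keyed by a crawl label. This is an
    illustrative helper, not part of Shine or Apache Solr.
    """
    rates = {}
    for crawl, hits in keyword_hits.items():
        pages = pages_per_crawl.get(crawl, 0)
        # Skip crawls with no recorded page count rather than divide by zero.
        if pages > 0:
            rates[crawl] = 1000 * hits / pages
    return rates


# Hypothetical figures, for illustration only.
keyword_hits = {"2005-Q4": 120, "2015-Q3": 300}
pages_per_crawl = {"2005-Q4": 40_000, "2015-Q3": 150_000}

rates = relative_frequency(keyword_hits, pages_per_crawl)
for crawl, rate in sorted(rates.items()):
    print(f"{crawl}: {rate:.1f} hits per 1,000 pages")
```

With these made-up numbers the later crawl has more raw hits but a lower rate per 1,000 pages, which is the kind of distortion a per-crawl normalization is meant to expose.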