Archiving DH Part 4: Solutions
Visit the first three parts in the series:
- Archiving DH Part 1 - The Problem
- Archiving DH Part 2 - The Problem in Detail
- Archiving DH Part 3 - The Long View
Archiving ≠ ____
Before we look at some options for archiving Digital Humanities projects, we need to discuss the elephant in the room. Archiving, or preservation, does not equal… many things. It does not equal a fully functional, live, continuously developed project. In terms of digital preservation, does it even require an accessible project? This should have been discussed in earlier posts in this series, but lack of planning or foresight leaves it for here.
The American Library Association (ALA), the authoritative body for the purposes of this discussion, defines preservation on its website as follows:
“The preservation of library materials is a process dependent upon both the producers and curators of information resources. In keeping with the missions of their individual institutions, librarians must commit to preserving their collections through appropriate care. Preservation of materials in their original format should be practiced whenever possible, through proper storage and handling, supplemented by remedial treatment of damaged and fragile items. Replacement or reformatting for deteriorated materials must be actively pursued to enable users to have unimpeded access to the intellectual record, regardless of the condition of original documents. Preservation issues should be addressed while planning for new construction or the renovation of existing buildings to ensure that collections are preserved through appropriate and non-damaging storage and given proper security.”
Take note of the desire to keep a “material” or object in original format, and the need for proper storage, and the active pursuit of “reformatting” and “remedial treatment” all with the goal of keeping the intellectual record intact and available to users unimpeded. The goal is for a user to have access to the material in as close to original form as humanly possible, regardless of effort or cost.
With regard to digital preservation, the ALA states:
“The Association defines digital preservation as policies, strategies and actions that ensure access to digital content over time. Publishers and distributors of content in digital form must address the usability and longevity of their electronic works. The Association encourages publishers to provide metadata that will facilitate the life cycle management of works in digital formats and to deposit digital works in repositories that provide for the long-term persistence and usability of digital content. The Association will work with the publishers to develop guidelines on digital preservation to help ensure that such information will not be lost when publishers can no longer retain and disseminate it.”
The key point here is the call for digital works to be deposited in “repositories that provide for the long-term persistence and usability of digital content.” The goal, as in “regular” preservation, is to keep the intellectual knowledge accessible to the user in as close to the original format as possible. By speaking of digital works in terms of publishers and distributors, the ALA statement seems to refer only to text published by organizations of some kind, such as journals. But what about other digital works, like websites, video, audio, etc.? Are digital projects such as the ones we are discussing here being categorized with books, articles, and other text-based media?
This, then, is the crux of the problem as I (howdy, Ammon here) see it. When people talk about archiving and preserving DH projects, the projects are most often classified with the more common forms of scholarship: books and journal articles. This is problematic for a number of reasons, but the main one I discuss below has to do with long-term usability of the object in question.
When a book is preserved or archived, the process is relatively easy, especially for new books. You take the book and put it on a shelf. Granted, this does assume that the shelf is in a large existing building with a climate-controlled atmosphere, staffed with knowledgeable librarians, archivists and preservationists. There is a whole infrastructure built around books to keep them usable as is for hundreds of years. One must also think about the effort that went into creating the physical book in the first place: the quality of paper, ink, binding, and a slew of other processes developed over centuries to give us books today that will last for centuries to come.
And that’s the main problem (which sparked this whole series of thought-posts). There is no established, tried and true process or infrastructure for building a DH project that will last for centuries. Heck, there’s nothing existing today that really will allow digital projects (or anything electrical for that matter) to exist for centuries. So that is the issue. We hope that digital projects have the same life cycle as books. But they don’t, and they can’t. Once a book is printed, it can exist on its own, say in a cave, for decades and centuries. A user can go into that cave, and as long as that user knows how to read the text, they can get the information from that book without any other technology or resources. The same is not true of digital projects. There’s no way to get a digital project “printed” and put it in a cave such that the information contained within can be retrieved with no external technology or resources.
What we offer below are some ways to deal with this lack of infrastructure and processes for archiving and preserving DH projects. We assume in these “solutions” that a massive infrastructure of technology and electricity will exist in order for users to access the project; but they are not self-sustaining objects fit for a cave.
Solutions
As articulated in the previous post in this series, projects should begin with the end in mind. Since the “end” cannot be some self-contained object, we must compromise and decide on an “ending” that meets our needs. Only once the “end” is defined can you design the pathway to reach it. The “solution” heavily depends on what is desired in the “end.” Possible endings include:
- A fully functional project (website, or what have you)
- A project with limited functionality
- Put the project in a repository
- Abandon the project
Let’s take each in turn and give some practical examples of what these endings may look like.
Fully Functional
A possible ending is the fully functional website or project. It continues to exist exactly as it was developed. As defined above, though, this is not preservation. It is an actively maintained project. If we look to other fields for a definition of preservation we can see that preservation implies an altered state. Cucumbers preserved in dill and vinegar don’t grow any more, they are no longer cucumbers, but pickles. Butterflies preserved in a shadow box are no longer living breathing bugs. In these instances, preservation means to change from one state to another. Since we (or at least I, Ammon, do) talk about DH projects as if they were alive (they live on a certain server, they die when not actively developed), this seems to be the better comparison for digital projects. As argued above, we should stop comparing Digital Humanities projects with books. With that in mind, a fully-functional web project is not preserved, it’s a living, breathing project, that must be maintained and cared for. This end really just means continued development.
Limited Functionality
Because all web-based projects need some kind of upkeep, our definition of preservation and archiving must be adapted to include projects that have limited functionality. This category includes projects that maintain some functionality, but shut down and close other parts. For example, a project that collected data from users or researchers could stop the data collection, but continue the searching capabilities, or allow for interactive maps, or data manipulation. This form of archiving tries to boil down the project to the basic components that can most easily be replicated in order to retrieve the information contained within. Two examples of such preservation are given here: scraping and containers.
Static HTML, CSS (JavaScript?)
Perhaps the most stable format for a web project is the one built from the most basic building blocks: pure HTML and CSS. A simple way to archive a website in this most basic format is to use command line tools like wget and curl. The following command was used to create static versions of several old DH projects hosted by the Scholars’ Lab.
wget --mirror -P static-content -nH -np -p -k -E http://womensbios.lib.virginia.edu
With these options, wget copies all of the pages and the files they require (css, js, images, etc.), saves pages with the .html extension, and rewrites the links to match. This works great, but breaks old links out in the wild that lack the extension (fixed with an nginx rewrite).
Notes:
- --mirror: "This option turns on recursion and time-stamping, sets infinite recursion depth and keeps FTP directory listings. It is currently equivalent to -r -N -l inf --no-remove-listing."
- -P static-content: where to put the downloaded files
- -nH: no host directories (so it won't save the domain name part of the URL as a directory in which files are stored)
- -np: don't ascend to the parent directory when doing recursive saves
- -p: get all of the files necessary to make the page render properly (css, js, images, etc.)
- -k: convert all of the links in all of the downloaded HTML files so that they work locally
- -E: adjust extension; appends .html to the names of saved HTML files that lack it (causes issues that are resolved with nginx rewrites)
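The nginx rewrite itself is not shown in this post. As a rough sketch of the kind of lookup such a rule might perform (the path as requested, then the path with .html appended, then a directory index), the shell function below reports which mirrored file an old link would resolve to. The function name and the lookup order are assumptions made for illustration, not the Scholars’ Lab configuration.

```shell
#!/bin/sh
# resolve_static_path DOCROOT REQUEST_PATH
# Hypothetical helper: prints the mirrored file that an old link would map to,
# trying the path as given, then with ".html" appended, then a directory index.
# Returns 1 when nothing in the mirror matches.
resolve_static_path() {
    docroot=$1
    request=${2#/}          # drop the leading slash
    request=${request%/}    # drop a trailing slash, if any

    if [ -z "$request" ]; then
        # A request for "/" can only be the top-level index page.
        set -- "$docroot/index.html"
    else
        set -- "$docroot/$request" \
               "$docroot/$request.html" \
               "$docroot/$request/index.html"
    fi

    for candidate in "$@"; do
        if [ -f "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}
```

Running something like `resolve_static_path static-content /some/old/path` against the mirror is a quick way to spot old links that would still come back as missing.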
JavaScript is what adds interaction to current websites, and these files can be included alongside the static HTML, but they are definitely not future-proof. Browsers’ JavaScript engines will most likely change in the future, and such changes may break backwards compatibility.
Containerization
Another option for providing limited functionality (or perhaps even full functionality) is to migrate the project to a container. Docker is the most common and widely known program for working with containers. In general, containerization involves creating a lightweight, isolated computer or server environment with only the bare necessities needed to run the project, all packaged up in a small container. This has the added benefit that wherever the underlying software needed to run a container is in place, the containerized version of the project can be run.
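As a minimal sketch of the idea for the simplest case, a static mirror, the function below serves a directory of files from a stock web server container. The function name, the `nginx:stable` image, and the port are assumptions chosen for illustration; the projects linked in the examples may be packaged quite differently, and anything with a database or server-side code needs more than this.

```shell
#!/bin/sh
# serve_static_in_container SITE_DIR [PORT]
# Hypothetical helper: serves SITE_DIR read-only from the official nginx image,
# which publishes whatever is mounted at /usr/share/nginx/html.
# SITE_DIR must be an absolute path, or Docker treats it as a volume name.
serve_static_in_container() {
    site_dir=$1
    port=${2:-8080}
    # ${DOCKER:-docker} lets a dry run swap in "echo" to print the arguments
    # instead of starting a container.
    ${DOCKER:-docker} run --rm \
        -p "$port:80" \
        -v "$site_dir:/usr/share/nginx/html:ro" \
        nginx:stable
}
```

For example, `serve_static_in_container "$PWD/static-content"` would make the mirrored site available at http://localhost:8080 until the container is stopped.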
Examples
Many of the legacy projects in Scholars’ Lab’s care were “preserved” using this method. Detailed documentation of the process was essential in attempting to preserve the process and the decisions that were made in archiving the project. One of the better examples of this is seen with the Collective Biographies of Women project. https://github.com/scholarslab/womensbios
Other examples, from moving projects off one deprecated old server and onto another:
- https://gitlab.com/scholars-lab/latviandainas
- https://gitlab.com/scholars-lab/salisbury
- https://github.com/scholarslab/ibnjubayr.git
Repository
The distinction between access and preservation (though the two are very closely related), and what “access” means, has proven particularly thorny. These choices depend on decisions about the collection made by the organization sponsoring the repository, the developers, and the project authors. As a preservationist, I (Lauren, here) rely on the knowledge of curators around the sustaining needs of projects. To get at those needs, we ask what the creator actually wants to preserve, and whether that preservation keeps in place the context, parts, and technology that made the scholarship unique, groundbreaking, or a contribution to the scholarly field. All parties involved need to realize that with enough time and money, technically anything is possible, but not everything is possible for every project. The challenge then becomes how to provide the best technical path, as determined by preservationists, developers and curators/creators/authors.
When moving a project into a state of preservation, there are several things to keep in mind. Of utmost importance is an inventory of all component parts of the project, such as images, text, video, etc., and metadata for those components. In our experience, this has proven to be highly replicable for most projects. It is also necessary to have a top-level description of the work, which is then placed as part of faculty papers in a university’s archives. A detailed inventory and a high-level description, in many cases, solve the problem of how people may discover and explore context around the project. My take-away is that we need to be able to provide examples and reach realistic compromises over access to preserved projects.
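One small, mechanical piece of such an inventory can be sketched in shell: a manifest that lists every file in the project alongside a checksum, so the set of components can be verified later. This is only an illustration. The function name is hypothetical, it assumes the GNU `sha256sum` tool is available, and it covers none of the descriptive metadata a curator would add.

```shell
#!/bin/sh
# write_inventory PROJECT_DIR MANIFEST_FILE
# Hypothetical helper: records a SHA-256 checksum and relative path for every
# file under PROJECT_DIR. Write MANIFEST_FILE outside PROJECT_DIR so the
# manifest does not list itself. File names containing newlines are not handled.
write_inventory() {
    project_dir=$1
    manifest=$2
    (
        cd "$project_dir" || exit 1
        # Sort so the manifest is stable from one run to the next.
        find . -type f -print | LC_ALL=C sort | while IFS= read -r file; do
            sha256sum "$file"
        done
    ) > "$manifest"
}
```

The resulting file can be checked later by running `sha256sum -c` on it from inside the project directory, which reports any component that has gone missing or changed.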
The following are some examples of where to store preserved projects.
Put on a Raspberry Pi
One option, and perhaps the closest to the elusive goal of placing an object in a cave, is to put the project on a small form factor computer like a Raspberry Pi. In many cases, if the project does not rely on third-party sources that can’t be localized, this can provide a way to encapsulate a project into one sustainable unit. This allows for a complete system to render and view the project. The software at the correct versions can be pre-installed. All that is needed to access the project is a power source and a monitor. It would be quite a thing if the future of dissertation submissions were done on single-board computers.
Put into Library repository
Most (if not all) universities have some sort of academic repository in which to place digital objects. The University of Virginia utilizes Libra, which is akin to zipping up project files, giving it a description, and making it findable and accessible for users to download as they wish and explore locally.
Internet Archive
The Internet Archive and its Wayback Machine already make backups of websites and online resources. Although not perfect, and dependent again upon there being resources like electricity and funding to keep the organization running, this is an easy way to have a static copy of a web project.
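To check whether a project already has a capture, the Wayback Machine offers an availability endpoint that answers with JSON. The helper below only builds the query URL; the function name is hypothetical, and the site value is passed through as is, so anything beyond a plain host or path would need URL-encoding first.

```shell
#!/bin/sh
# wayback_availability_url SITE [TIMESTAMP]
# Hypothetical helper: prints the Wayback Machine availability query for SITE.
# TIMESTAMP (for example 20150101) is optional and asks for the capture
# closest to that date.
wayback_availability_url() {
    site=$1
    timestamp=${2:-}
    url="https://archive.org/wayback/available?url=$site"
    if [ -n "$timestamp" ]; then
        url="$url&timestamp=$timestamp"
    fi
    printf '%s\n' "$url"
}
```

Fetching the result, for example with `curl -s "$(wayback_availability_url womensbios.lib.virginia.edu)"`, returns a JSON document whose `archived_snapshots` entry is empty when no capture exists.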
Web Archivers
Webrecorder is another option that seeks to capture the dynamic bits of a project and bundle all the resources into a reusable package.
GitHub Pages
Another viable option is to migrate to a distributed repo like GitHub, where folks may have opportunities to replicate the project themselves. Questions still exist about how we (UVA Library) might manage this for something we have deemed to be “part” of UVA Library collections. For example, is a description or a MARC record/presence in Virgo enough? As a preservationist (Lauren, here again), I’d say minimally we’d need the master files that comprise the website within our own repository and preservation systems as well, even if they’re replicated on GitHub.
Alternatives to Storing the Original Object
Some particularly thorny or complicated projects rely on cloud servers, networks, and large interoperable systems, and may stretch the bounds of what we think of as “projects” or “objects” (preserving gaming environments is an example folks have used). For these, there are suggestions that documenting or recording the interaction with the game/app/etc. is the best way to preserve what it was like to play or interact. Preserving the master files of a game that is an empty world when you log in may not be the best long-term preservation of what that work actually represented. There are similarly tough arguments around the highly personalized environments we now all encounter through algorithms: the context and content can only be a snapshot of that time, because you and your past behavior have shaped that environment.
Abandonment
A perfectly viable solution is to just abandon the project and let digital nature take its course. No updates, bug fixes, or maintenance. It kind of just slowly rides off into the sunset, and when you decide to stop paying for hosting or the domain name, it becomes a memory and a static snapshot in the Internet Archive’s Wayback Machine.
If it lives at a university or similar sponsoring institution, this could mean a decade or more of life before some developer or systems administrator deletes files, changes server configurations, or implements some other change. One way to prolong the life of an abandoned project is to put the files and instructions for the project on GitHub, GitLab, or some other such service for hosting code.
Conclusion
Preserving and archiving Digital Humanities projects is hard. We are still in the early days of figuring out and building the infrastructure to support the long-term accessibility of digital objects. We may never reach the point where digital projects can be packaged up into a physical object that, if placed in a cave, will still be accessible a hundred years later. Until that day, we will continue to preserve projects by porting them forward to the current technology.