Archiving DH Part 3: The Long View
Begin with the End in Mind
You don’t begin a journey without having a destination in mind. (Well, I guess you could. That would be a fun adventure. But for the purposes of our discussion, such a thing would be absurd.) The phrase “begin with the end in mind,” attributed to Stephen R. Covey, is a necessary way to think about any Digital Humanities project. When you have a clear sense of how the project should spend the end of its life (whatever that may be), the project can be built from the beginning to fulfill that purpose. The final post in this series will look at what some of these “digital end-life” options might look like. This post focuses on things that project owners, designers, developers, and users should think about as they plan and create digital scholarship in general, and digital humanities projects and web-based scholarship in particular. Here we look at the true costs of projects, establishing a data policy, defining ownership, documentation, and resources for planning digital scholarship projects.
Standing on Shoulders
Many minds more brilliant than mine (Ammon) have written about this topic already, from the 2005 website and book “Digital History: A Guide to Gathering, Preserving, and Presenting the Past on the Web” by Dan Cohen and Roy Rosenzweig to last year’s (2018) book “The Theory and Craft of Digital Preservation” by Trevor Owens. (Disclaimer: I worked with Dan, Roy, and Trevor at the Center for History and New Media at George Mason University, so I unabashedly hold up their offerings as great examples.) Another incredibly useful resource is the Research Data Curation Bibliography by Charles W. Bailey, Jr., which contains over 750 articles, books, and technical reports relating to digital curation, including many on digital preservation. There are surely many, many more resources available. Someone oughta build a website and database…
The Long View
True Costs
Supremely important for any project (not just Digital Humanities projects) is knowing, with as much certainty as possible, what the project will cost. One wouldn’t agree to build a house without knowing how much it will cost. A funder doesn’t need to know exactly how much each construction worker will be paid, but the contractor does need to know how much the workers, materials, and such are going to cost. And that’s one of the main points. As DH project managers, developers, etc., we are more like a contractor than a funder. So we do need to know all of the hidden costs: costs of labor, costs of resources, costs of sustainability, etc. Here are some of the known and hidden costs that might occur in any given project, and that should be kept in mind when developing a road map for a project.
Some of the hidden and known costs of DH projects may include:
- Resources you assume will be there
  - Grant funding
  - Department funding
  - Personal funding
  - In-kind resources
- Resources that you aren’t paying for directly
  - Electricity
  - Department or university provided hosting (servers)
  - Technical support for servers, phones, computers, etc.
  - Access to data, journals, databases, etc.
  - Housekeeping for your office, building
- Staff and cost of living
  - Developers
  - Designers
  - Documentation writers
  - Project managers
  - Testers
- Hosting or server space
  - Local vs. contracted out
  - Ongoing costs
- Maintenance
  - Updates
  - Security patches
  - Backup services
  - Monitoring
- Domain name
  - Yearly cost
- Moderators for digital collecting sites
- User advocates to monitor and answer support forums and contact email addresses
- Software
  - Initial purchase costs
  - Upgrade or subscription costs
There are surely more. Just like each tasty meal you will eat this week is made from different ingredients, your specific project will have a unique blend of up-front and hidden costs, making your project the unique and wonderful gift to the world that it is.
Image: Deukhine, a sauce with peanut butter and tamarind. By T.K. Naliaka, CC BY-SA 4.0.
Data Policy
Your project will most likely use data or generate data. You need to be clear up front on how you expect and desire others to be able to use, or not use, the data from your project. For a smaller, not institutionally supported DH project, a data policy can be a simple set of public commitments about the possible futures of your project. The following is a practical example of implementing a data policy for a small digital project.
My (hey, this is Amanda!) Infinite Ulysses project was dissertational research supported by a single person, but it also had a wide and varied audience for such a small project (12k unique visitors during the first couple weeks of beta opening) and hosted user-created content (1k annotations). Its focus on interface design and user studies, and the use of social media encouraging testing of the site, were important in inviting use by enough users from both inside and outside academia to address my questions about the impact of such a shared space on literary reading and learning.
On the other hand, the site’s polished look and publicity probably conveyed a permanency that I was hopeful for at the time. I had some real plans and resources for continuing post-PhD, but the site was ultimately the dissertation project of a single person contributing their own limited time, money, and moderation stress to running it. I should have done better at making sure folks were clear that there was one person behind the project and no presumption of ongoing hosting, especially folks planning long-term or repeat readings (likely, given Ulysses’ length, difficulty, and rewards for rereading) or planning to demo or use the site in a class.
I am happy that the project, from the start, did include a public document addressing both data privacy and data preservation plans for the project. For example, because the site invited and stored users’ annotations as they read a long and difficult novel, I had a responsibility to let folks know about possible disasters (aka “I lose all your comments”) and what I was doing to prepare against those calamities. My public data policy stated: “This site has an automated daily backup that includes all user annotations and comments, as well as weekly server backups. In addition, the site is regularly replicated on a development and local server.”
Critically, this public data policy included plans for what would happen if I needed to stop site interaction or take down the site for some reason. Even if you plan for your project to continue indefinitely in its current or an improved form, sharing a contingency plan from the start lets users make informed decisions about their activity with your project. My policy stated: “If the site ever needs to be shut down, go offline indefinitely, or be transferred to substantially different ownership, I’ll contact all users through the email address given on your user profile page with directions for downloading your content. Users will be given at least one month’s notice to export their content. Should such a situation ever occur, I’d prioritize keeping the site up but in static form (i.e. you can’t change or add to the annotations and comments anymore) so that it’s still available as a resource; users could then opt to use the Annotator.js browser plugin to continue the annotation of the text using an AnnotateIt account. If that ever becomes the case, I’ll post instructions to the front page of this site on how to continue using the site.” I did end up first closing the site to new users, then to new annotations, and eventually migrating from Drupal to a static archived version of the site. I felt better about doing this given that there was never an explicit promise to always run the site the way it currently ran (though see the above caveats about site shininess and social media).
My data policy also helped project users protect and own their labor through documented paths to export their own work for use elsewhere or personal preservation: “users should be able to export their content with the push of a button”, ideally, in multiple non-proprietary formats that support both reading (e.g. HTML or TXT) and data manipulation (e.g. JSON or CSV).
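As a rough illustration of that “push of a button” export, here is a minimal Python sketch that writes the same set of annotations in a data-friendly form (JSON and CSV) and a readable form (plain text). It is not how Infinite Ulysses (a Drupal site) implemented export; the field names and sample records are hypothetical placeholders.

```python
import csv
import io
import json

# Hypothetical field names and sample records, for illustration only.
FIELDS = ["id", "user", "passage", "note"]
annotations = [
    {"id": 1, "user": "reader_a", "passage": "Example passage one", "note": "First note."},
    {"id": 2, "user": "reader_a", "passage": "Example passage two", "note": "Second note."},
]


def to_json(records):
    """Serialize the records as indented JSON, for data manipulation."""
    return json.dumps(records, indent=2, ensure_ascii=False)


def to_csv(records):
    """Serialize the records as CSV with a header row, for spreadsheets."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def to_text(records):
    """Render each record as a quoted passage followed by its note, for reading."""
    blocks = [f'"{r["passage"]}"\n  {r["note"]}' for r in records]
    return "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    print(to_json(annotations))
    print(to_csv(annotations))
    print(to_text(annotations))
```

Offering more than one format from the same records costs little and means a user is not locked into whichever tool happens to read one of them.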
To create a similar public data policy for a small DH project, you’ll want to ask yourself the following:
- What might people build off this? Think about ways your site/data might be treated as a feed, API, or permanent fixture that you might not hear about, e.g. folks running Twitter bots on the presumption of continuous new content or activity on your website, or instructors relying on the site in a classroom.
- What have people built on this? Uses you know about; on Infinite Ulysses, this meant users’ textual annotations, but also: the community they help build and constitute, and their reputation, scholarly or otherwise, in that community; any scholarship citing and/or depending on folks being able to look at or interact with your project as it currently appears.
Ownership
Who owns the project is also a very important decision to make at the beginning of the project life cycle. Agree up front on ownership and licensing issues and set out some permissions in advance:
- who are the authors of a project (as distinct from advisors, technical helpers, editors, well-wishers, etc.),
- who can decide its fate, who is responsible for fixing it when broken,
- who is liable for any legal issues that may arise.
Decide in advance whether to give the Library permission to maintain/migrate/preserve/serve a project if/when it is no longer maintained by its authors/owners in the future. Decide on an appropriate copyright license. If so desired, choose open-license solutions that proactively grant the public permission to migrate/preserve/modify/update/fork projects when owners move on.
Documentation, documentation, documentation
Steve Ballmer’s “Developers” mantra
I wish I (Ammon, here) had an audience of DH developers (or more importantly, their supervisors and the project leads) where I could chant “Documentation! Documentation! Documentation!” until I had giant sweat spots on my shirt. :) One of my biggest gripes with any DH project I’ve worked on is the lack of documentation. And for the record, I’m just as guilty as those I accuse.
There seem to be about three phases to a project, and documentation is critical for each phase to go smoothly from one to the next: 1) development phase, 2) stable release, 3) retirement. If nothing else, documentation provides a history of the project. And as we all hope, some future researchers are going to study our project and want to know as much about it as possible. Let’s give those future scholars a wealth of documentation to work with. My plea is to build documentation writing into the process of creating the project. Give developers time and space (on the calendar and distraction free) to write about what they are creating. In my view, documentation is just as important as the project itself. Documentation and the project should actually be seen as the same thing; they should be inseparable. Documentation is like one of the three legs of a three-legged stool. (I leave it to the reader to label the other two legs.)
Development Phase
Documentation during this phase is critical for developers actively working with the project. Ongoing documentation helps developers collaborate and coordinate. It records the logic and reasoning behind coding and technology decisions, and it defines which files and resources are important to the project, including any that are not used on the public side but are required for the backend, or are kept for storage or future ideas. But more importantly for the long view, actively documenting the process of creation provides information that is vital for those who need to put the project into an archived state.
So many times I go into an ancient folder (anything older than 2005) and find many files that don’t seem to be used in the production version of the site. Because there is no documentation, I’m left to decide myself if they should be included in an archive version or left out. What I have found very helpful for projects is to include a File Structure section in the documentation. This section simply lays out all of the files that are required to make the project run and a short description of their function. An easy way to create the file structure (if you are using a Unix or GNU/Linux based server) is to use the tree command, like so:
tree -L 1
This prints out the first level (-L 1) of files and directories located in the folder where you run the command. You may need to play with the level and other options to get the result you’re looking for. I find it is not necessary to list every single file and folder (like the application files and folders, or all of the resources like images and videos, files in archive directories, and such); the main parent folders and a short description of their contents usually suffice. An example taken from the Collective Biography of Women project git repository README file is shown below.
Folder Structure
All of these files are in the git repository (except .env which is created separately on local and production).
├── alldata.json (data file, contains all of the bibliography information, in JSON format)
├── data/ (data folder for the solr container)
├── default.conf (nginx config file with a change to redirect old URLs to new)
├── Dockerfile (File to create the nginx image, pulls in the default.conf and files from static-content)
├── docker-compose.yml (file used by docker, determines which docker settings and images to use)
├── .env (environment file)
├── myconfig/ (folder containing the solr config files)
├── README.md (this file)
├── static-content/ (folder containing the static html, css, js, image files. the cloned website, not under version control, but in the wb-static image as /usr/share/nginx/html/)
├── web.xml (solr config file that allows the javascript in the search.html file to access the solr server)
Note: The following files in the static-content/ folder were edited from the originally scraped static versions, or created new.
- search.html (modified to connect with the solr docker container for search)
- solrSearch.js (created new as the connection between the solr database and the search page; displays the results from solr to search.html)
- style.css (adds styles for the loading animation)
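If the tree command is not available on your system, a listing like the one above can be roughed out with a few lines of Python. This is only a sketch: the function name, the descriptions argument, and the “TODO: describe” placeholder are my own assumptions, not part of any project mentioned here. Unlike tree’s default output, it also lists dotfiles such as .env.

```python
from pathlib import Path


def top_level_listing(root, descriptions=None):
    """Return one line per top-level entry in root, with a short description.

    descriptions is an optional (hypothetical) mapping of entry name to a
    description; entries without one get a TODO placeholder to fill in by hand.
    """
    descriptions = descriptions or {}
    lines = []
    # Sort case-insensitively so README.md and data/ land in a predictable order.
    for entry in sorted(Path(root).iterdir(), key=lambda p: p.name.lower()):
        name = entry.name + ("/" if entry.is_dir() else "")
        note = descriptions.get(entry.name, "TODO: describe")
        lines.append(f"{name} ({note})")
    return "\n".join(lines)


if __name__ == "__main__":
    # List the folder the script is run from; paste the result into the README.
    print(top_level_listing(".", {"README.md": "this file"}))
```

The generated lines are only a starting point; the descriptions still have to be written by someone who knows what each file is for.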
Stable Release
While a project is out in the wild and working as planned, it is important to document the expectations of the project. Write out clearly what the expected behaviour of the site is. What should the results of searching look like? How does the site differ from what was originally planned and written into the project or grant proposal? What were the big compromises that had to be made to get the project functional and live? What are the next steps for the project? Are there functionalities and steps that were left undone?
Retirement
This phase of the project should be seen similarly to the development phase. All decisions, changes, alterations, modifications, concessions, and choices should be documented. Specifically, note what functionality was lost during the archiving process and the reasons for that decision.
Further Resources
Here are a few more resources for planning Digital Humanities projects while keeping the long view in mind. If you know of other great resources, send an email to ammon@virginia.edu and I’ll add them to the list.