Automating Digital Archival Processing at Johns Hopkins University

This is a guest post from Elizabeth England, National Digital Stewardship Resident, and Eric Hanson, Digital Content Metadata Specialist, at Johns Hopkins University. 
Elizabeth: In my National Digital Stewardship Residency at Johns Hopkins University, I am responsible for a digital preservation project addressing a large backlog (about 50 terabytes) of photographs documenting the university’s born-digital visual history. Since 2004, the campus photography unit, Homewood Photography, has used an exclusively digital workflow and the photographs have been stored on optical media. Before I arrived, the university archives had already taken physical custody of thousands of these DVDs, but needed someone who would be devoted to processing this large-scale collection.
I’ve relied heavily on the electronic records accessioning workflow written by my mentor, the university’s digital archivist Lora Davis, and worked with her to adapt the workflow for the Homewood Photography collection. It’s the first digital collection of this size for the university archives, so some processing steps that work for smaller digital collections don’t work as well in this scenario. For example, disk images are typically created to capture content from physical carriers such as hard drives or DVDs, in order to preserve all the files and how they were structured on the carrier. However, in this collection, many large jobs, such as university-wide graduation, have content split across multiple DVDs. I needed to reunite split content to restore the original order, hence the decision to work with the photographs at the file level and not as disk images. On the DVDs, the photographs were saved as both .NEF (Nikon’s proprietary camera raw file format) and .JPEG. When transferring the photographs off DVDs, I’ve been keeping just the raw files since .NEF is a lossless format and is preferable for creating preservation files over the lossy .JPEG derivatives.
All this to say, the collection is being processed at a more granular level than may be expected for its size. From the beginning, I knew that using scripts to manage bulk actions would be a huge time-saver, but as someone with essentially zero scripting experience, I didn’t have a good sense of what could be automated. While reviewing the workflow with Lora, I showed her how time consuming it was going to be to manually move the .NEF files in order to nest them directly below the descriptive job titles. She recommended I look into using a script to collapse the directory structures, and although I found some scripts that accomplish this, none could manage the variety of disc directory structures.
Two examples of disc directory structures within this collection.
I described the situation to Eric Hanson, the Digital Content Metadata Specialist here at Johns Hopkins, knowing that he had experience with Python and might be able to help.
Eric: At the time, I had been using Python for a few months to interact with the REST API for our archival data management system, ArchivesSpace, and to retrieve uniform resource identifiers (URIs) for name and subject headings for potential use in linked data applications. Up until then, I had not used Python for manipulating files on a hard drive, aside from creating a few text files, but I assumed that there were Python libraries that could handle this type of task. I soon came across the “os” module, with its “os.walk” function for mapping the directory structure, and the “shutil” module, with its “shutil.move” function for actually moving files. Both modules are part of the Python Standard Library.
I worked with Elizabeth to examine the directory structures and get a sense of the variations that we would need to act upon. Given the inconsistent depth of the files, the script was written so that any file existing below the eighth level in our directory structure (i.e. /media/bitCurator/RAID/HomewoodPhoto/IncomingTransfer/BatchNumber/Disc/JobTitle/…) would be moved to the eighth level, placing the files directly under the folder with the descriptive job title.
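The idea can be sketched roughly as follows. This is a minimal illustration, not the script published on GitHub: the function name `collapse_to_depth` and its parameters are hypothetical, and depth is counted relative to a starting folder (for example, a batch folder, with job titles two levels below it) rather than from the filesystem root. The sketch assumes that a file should be skipped, not overwritten, when a file of the same name already sits in the job folder.

```python
import os
import shutil


def collapse_to_depth(base, depth):
    """Move every file nested deeper than `depth` levels below `base`
    up into its ancestor folder at exactly that depth.

    Returns a list of (source, destination) pairs for the files moved.
    """
    moves = []
    for dirpath, _dirnames, filenames in os.walk(base):
        rel = os.path.relpath(dirpath, base)
        parts = [] if rel == os.curdir else rel.split(os.sep)
        if len(parts) <= depth:
            continue  # files here are already at or above the target level
        # The ancestor folder at the target depth (the job title folder).
        target_dir = os.path.join(base, *parts[:depth])
        for name in filenames:
            src = os.path.join(dirpath, name)
            dst = os.path.join(target_dir, name)
            if os.path.exists(dst):
                continue  # assumption: never overwrite an existing file
            shutil.move(src, dst)
            moves.append((src, dst))
    return moves
```

Calling `collapse_to_depth(batch_folder, 2)` would place every file directly under its `Disc/JobTitle` folder. The emptied subfolders are left in place in this sketch and would need to be removed separately.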
We added a time-stamped log creation function to the script so that we would have a record of all of the changes that were made. We did several test runs in which the log recorded the changes that would be made, while the part of the script that actually moves the files was disabled. After we were satisfied with the results in the test log, I fully enabled the script and Elizabeth put it to use. The final version of the script can be found on GitHub.
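A test-run-then-enable pattern of this kind might look something like the sketch below. The function name `run_moves`, the `dry_run` flag, the log file naming, and the CSV layout are all assumptions made for illustration; the published script may organize its log differently.

```python
import csv
import os
import shutil
from datetime import datetime


def run_moves(moves, log_dir, dry_run=True):
    """Write a time-stamped CSV log of (source, destination) pairs.

    With dry_run=True only the log is written; with dry_run=False each
    file is also moved. Returns the path of the log file.
    """
    stamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
    mode = "test" if dry_run else "live"
    log_path = os.path.join(log_dir, f"moveLog_{mode}_{stamp}.csv")
    with open(log_path, "w", newline="", encoding="utf-8") as log_file:
        writer = csv.writer(log_file)
        writer.writerow(["source", "destination"])
        for src, dst in moves:
            if not dry_run:
                os.makedirs(os.path.dirname(dst), exist_ok=True)
                shutil.move(src, dst)
            writer.writerow([src, dst])
    return log_path
```

Reviewing the "test" log before rerunning with `dry_run=False` gives the same safety net described above, and the "live" log can be kept as a record of what was actually changed.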
Elizabeth: The success of the collapse directories script helped me realize other processing tasks that could potentially be automated. Considering the storage implications of this collection, it was decided early on that sampling as a selection strategy was necessary, so I researched archival sampling to determine a standard for retention and how to go about conducting the sampling. Most resources I consulted recommended using a random number table; however, this would be too time consuming to implement across thousands of jobs. After discussing with my other mentor, the university archivist Jordon Steele, we decided that 10% would be a sufficient standard of retention, which would be accomplished by systematically keeping every 10th image from the jobs. I decided to start the sampling with the 2nd image from each job (then 12th, 22nd, and so on) because the 1st image was often of a color checker, an important tool for photographers, but of low archival value. While this systematic sampling may not be ideal, it ensures that what’s retained would capture the beginning, middle, and end of each job.
A color checker from this collection, often the 1st image in a job.
Eric: The second Python script I created for Elizabeth came together more quickly since I was now familiar with the “os” and “shutil” modules. The initial version of the script looped through every job directory, skipped the first file, then selected every 10th file of the job (rounding up), and moved those files into a separate “sampled” directory that kept the descriptive job title. We again used a time-stamped log to see what changes would be made before fully enabling the script and moving the files from “processing” to “sampled.”
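The selection rule itself (skip the first file, then keep every 10th) can be expressed as a list slice. The sketch below is an illustration rather than the published script; the function name `select_sample` is hypothetical, and it assumes that sorting the filenames reproduces the shooting order.

```python
def select_sample(filenames, start=1, step=10):
    """Return the 2nd, 12th, 22nd, ... names from a job's sorted file list.

    `start` is the zero-based index of the first file kept, so start=1
    skips the first file (often the color checker).
    """
    return sorted(filenames)[start::step]


# A hypothetical job of 25 images named DSC_0001.NEF through DSC_0025.NEF.
job = [f"DSC_{n:04d}.NEF" for n in range(1, 26)]
print(select_sample(job))
# ['DSC_0002.NEF', 'DSC_0012.NEF', 'DSC_0022.NEF']
```

With 25 files, the 24 that remain after the first is skipped yield three samples, which matches the rounding-up behavior described above. Note that under this rule a job containing a single file would retain nothing; how such jobs should be handled is a policy decision the sketch leaves open.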
Elizabeth: I implemented the sampling script without issue the first time around, but hadn’t fully considered its future use. Because of the size of the collection and limited short-term processing storage space, I’ve been processing the collection a few terabytes at a time. I may have content in multiple parts of the processing pipeline simultaneously, as was the case the second time I went to use the sampling script. Luckily, I identified the issue before implementing the script: if the “unsampled” 90% stayed in the processing directory after running the sampling script, I couldn’t move any new content into the processing directory, because I’d be mixing to-be-sampled jobs with already-sampled jobs in the same storage space. I realized that each time I was enacting something on the content, I wanted to move the content into a new directory in the pipeline, which was an automation Eric was able to add to the sampling script.
10% of the files within each job are retained, and automatically moved from the Processing Directory into the Sampled Directory.
Eric: When Elizabeth first described the unsampled issue to me, I figured that I would just add a section at the end of the script that moves all of the unsampled files after the sampling was completed. After talking it over with Elizabeth, we realized that this approach could cause problems if the sampling script failed for any reason, as we would have to manually find and move any jobs that had been sampled before the script failed. With that in mind, I actually found it was easier and more efficient to move the unsampled files in the same loop that was moving the sampled files, leaving the source directory empty after the script had run through it. The final version of this script is also available on GitHub.
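A single-loop version of that approach could be sketched as follows. The function name `split_job` and the two destination arguments are hypothetical, and the sketch assumes each job folder holds only files (no subfolders) by the time it is sampled.

```python
import os
import shutil


def split_job(job_dir, sampled_root, unsampled_root, start=1, step=10):
    """Move every file out of `job_dir` in one pass: the sampled files go to
    sampled_root/<job name>, all others to unsampled_root/<job name>.

    Returns a list of (filename, "sampled" or "unsampled") pairs.
    """
    job_name = os.path.basename(os.path.normpath(job_dir))
    names = sorted(
        n for n in os.listdir(job_dir)
        if os.path.isfile(os.path.join(job_dir, n))
    )
    moved = []
    for index, name in enumerate(names):
        # Keep the 2nd, 12th, 22nd, ... file (zero-based index 1, 11, 21, ...).
        keep = index >= start and (index - start) % step == 0
        dest_root = sampled_root if keep else unsampled_root
        dest_dir = os.path.join(dest_root, job_name)
        os.makedirs(dest_dir, exist_ok=True)
        shutil.move(os.path.join(job_dir, name), os.path.join(dest_dir, name))
        moved.append((name, "sampled" if keep else "unsampled"))
    return moved
```

Because each file is routed to one destination or the other as soon as it is visited, a failure partway through leaves only untouched files in the source folder, so it is clear which content still needs to be processed.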
Elizabeth: The final step I needed to accomplish before normalizing the photographs to .DNG, our chosen preservation file format for the collection, was to rename the job folders. This step was very important because the folder-level job names are essentially the only source of pre-existing descriptive metadata. The names typically followed a date_client_event format, with underscores between each word, such as 20081211_mechanical_engineering_faculty_portraits. I wanted to simplify and standardize the names, so that they would read more like 20081211_mechEng_facultyPortraits. I knew OpenRefine was a good option for cleaning up the thousands of names, but hadn’t worked with it before.
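To illustrate why a consistent date_client_event pattern matters, here is a small sketch of how a standardized name could later be split into its parts. The function name `parse_job_name` and the returned field names are hypothetical, and the sketch assumes every name has exactly the three underscore-separated parts shown in the example above.

```python
from datetime import datetime


def parse_job_name(name):
    """Split a standardized job name into its date, client, and event parts.

    Raises ValueError if the name has fewer than three parts or the
    leading date is not in YYYYMMDD form.
    """
    date_text, client, event = name.split("_", 2)
    return {
        "date": datetime.strptime(date_text, "%Y%m%d").date(),
        "client": client,
        "event": event,
    }


print(parse_job_name("20081211_mechEng_facultyPortraits"))
# {'date': datetime.date(2008, 12, 11), 'client': 'mechEng', 'event': 'facultyPortraits'}
```

The original underscore-per-word names could not be split this reliably, since there would be no way to tell where the client ends and the event begins.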
Eric: Prior to using Python for most of my work, I worked extensively with OpenRefine, which introduced me to a number of programming and automation concepts that I carried over into Python. I agreed with Elizabeth that OpenRefine was an excellent option for the type of clean-up she needed to do. I wrote a script that created a .CSV file of all of the job names, and Elizabeth created an OpenRefine project using that file. Given that the job names were a source of metadata elements, such as dates and the job clients and events, I worked with Elizabeth to establish naming conventions, in order to simplify the extraction of these elements. I showed her some basic functions in OpenRefine’s native language, GREL, so that she could take charge of the clean-up. After the clean-up was completed, Elizabeth exported a .CSV file containing both the original job names and the corrected names. I created a simple find-and-replace script that used the “os.rename” function to change names based on the .CSV file, available here.
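A find-and-replace step of this kind might be sketched as below. It is an illustration only: the function name `rename_jobs` and the CSV column headers `original` and `corrected` are assumptions, since the layout of the exported file is not described here. The sketch also assumes that a rename should be skipped, rather than forced, when the source folder is missing or the new name is already taken.

```python
import csv
import os


def rename_jobs(csv_path, parent_dir):
    """Rename job folders under `parent_dir` using a CSV that has
    'original' and 'corrected' columns (hypothetical header names).

    Returns a list of (old name, new name) pairs that were renamed.
    """
    renamed = []
    with open(csv_path, newline="", encoding="utf-8") as csv_file:
        for row in csv.DictReader(csv_file):
            old_path = os.path.join(parent_dir, row["original"])
            new_path = os.path.join(parent_dir, row["corrected"])
            # Skip unchanged names, missing folders, and name collisions.
            if old_path == new_path:
                continue
            if not os.path.isdir(old_path) or os.path.exists(new_path):
                continue
            os.rename(old_path, new_path)
            renamed.append((row["original"], row["corrected"]))
    return renamed
```

The returned list of pairs could be written to a log so that each renamed folder can be traced back to its original name.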
Elizabeth: I started with 1,500 DVDs for the first processing iteration. While I had outlined the processing workflow in advance, some details weren’t fully figured out until I was actually in the trenches. It took extra time to figure out how to automate processes, but devoting that time up front has already proven worthwhile for many reasons. First, the workflow for these 1,500 DVDs took 2 months to complete, while the second iteration of 1,400 DVDs was accomplished in just 2 weeks. Automating saved so much time that I’m now well ahead of schedule with the project! Second, automating processes means there’s less room for human error. After my close call with almost running the sampling script on content a second time, I realized scripts can protect against errors by building in simple actions, such as moving content into new directories or prompting the user to enter the name of the directory on which the script should run. Third, the logs generated by these scripts are useful not just when testing: the logs generated when running the scripts in production document the actions taken on the content and will be retained as processing documentation. Since the jobs were renamed, being able to trace the changes is important for archival chain of custody.
While these scripts were written for this very specific use case, they have potential future use for my residency project. When I’m done with the DVDs, I will begin developing workflows for transferring content from other sources and physical media, such as getting athletics-related photos directly from Athletic Communications via external hard drives, which will introduce new naming conventions and systems of organization.
Eric’s role in the greater landscape of my project is to assist with metadata clean-up (much of which is still forthcoming), and I couldn’t have predicted how extensive this collaboration would become back when Lora suggested I look into a script to collapse directory structures. One of the biggest takeaways for me has been to reach out to colleagues in other departments and ask for help; you both might learn a new thing or two. Our collaboration has been successful not just in producing these scripts to automate processing; when we began this process in January, I was rather intimidated by Python. I still have a ways to go with learning Python, but I’m now more intrigued than apprehensive because I have a better sense of its capabilities and potential use in processing large, digital archival collections.
Eric: By tackling these problems with Elizabeth, I learned how to use the “os” and “shutil” Python modules, which I have already reused in other projects, including a script for batch ingesting files into a DSpace repository and a documentation clean-up project. More importantly though, this collaboration highlighted the advantage of taking a broad view of what it means to provide metadata support for a project. While most of these tasks could be classified as file management, none of them required all that much of my time, and helping Elizabeth with these issues deepened my understanding of the project and allowed me to offer better recommendations regarding the metadata. Additionally, these tasks benefited both of our interests in the project. Given that the folder names were a primary source of metadata about the images, when I helped Elizabeth with the renaming of folders, I was also helping myself because the information would be easier for me to extract at a later stage of the project.