Contributed by: Megan Schanz and Lydia Tang of Michigan State University.

In this post, Megan Schanz and Lydia Tang of ArchivesSpace member institution Michigan State University outline MSU’s need to combine two separate ArchivesSpace implementations and the steps they took to accomplish this task. They have made their migration documentation and scripts publicly available; links are provided at the end of this post.

Background

Michigan State University has historically maintained an administratively separate University Archives and Historical Collections (UAHC) and the Stephen O. Murray and Keelong Hong Special Collections. Over the decades, the UAHC had various reporting lines, but in 2018 it joined the Libraries, where Special Collections also resides. Aligning reporting structures has enabled the UAHC to coordinate further with the Libraries on initiatives and infrastructure such as ArchivesSpace. This blog post details our successful initiative to merge two distinct instances of ArchivesSpace into a single instance with two repositories.

Why merge?

As a self-hosted institution, we initially maintained four servers: development, testing, production, and failover. This arrangement was ideal because our technical team could safely test plugins and upgrades in the development environment, and our users could try out features and functionality in the test environment, all without impacting production data. The failover server was synced regularly with the production server and could be used in the event of an issue with our primary production server. While effective, this workflow meant that any change or upgrade had to be applied to all four servers.

This arrangement worked well enough for Special Collections, but it proved unsustainable once the number of servers doubled after UAHC joined MSU Libraries. Maintaining separate custom theme plugins and different configurations for each instance became increasingly untenable.

Image representing the different servers and hosted instances of ArchivesSpace across Special Collections and University Archives
Server diagram before merge: Special Collections (top) and University Archives (bottom) each had their own set of servers hosting ArchivesSpace (green), connected to external Solr (red) and database (yellow) servers shared by other applications at the Libraries.
Image representing the unified server environment for ArchivesSpace for development, testing and production.
Server diagram after merge: Special Collections and University Archives now both live in the same Docker container (dark green), with the ArchivesSpace (blue), database (yellow), and Solr (red) services running within it.

How we merged

After deciding to combine the UAHC and Special Collections instances, we assumed that someone must have done this before and that, of course, there would already be a tool built to handle it. After experimenting with every solution we could find, we realized that our two instances were more complicated than existing solutions could account for. Even making direct API calls to retrieve records from one instance and add them to the other proved far more complicated than we had hoped, which meant we needed to work directly with the database. These are the steps we took:

Step 1: We identified which instance would serve as the base into which we would merge the other. We selected UAHC’s instance, which had the most existing data, as the base to merge Special Collections’ records into. This limited the number of records being moved.

Step 2: We identified the dependencies between all of the tables in the database. Anyone who has looked at the ArchivesSpace data model can tell you that it is not a simple one. With over 100 tables, each with many links to other tables, it takes time to fully understand how data flows through the system. Looking at the database we were migrating, we examined each table to determine whether it had data to move and what relationships it had with other tables. We noted that not all tables were used, and some tables were identical between the two instances and did not require migration. With that information, we determined the order in which the tables had to be migrated so that dependencies were preserved as we moved them (one way to surface these dependencies is sketched after the screenshot below).

Image description: a screenshot of the spreadsheet we used to analyze the databases.
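For anyone attempting a similar analysis, MySQL’s own foreign-key metadata offers a starting point. The query below is a minimal sketch rather than part of our documented process, and it assumes the ArchivesSpace database is named archivesspace:

```sql
-- Minimal sketch: list the foreign-key relationships in a MySQL-backed
-- ArchivesSpace database so tables can be ordered parent-first for migration.
-- The schema name 'archivesspace' is an assumption; substitute your own.
SELECT
    table_name            AS child_table,
    column_name           AS fk_column,
    referenced_table_name AS parent_table
FROM information_schema.key_column_usage
WHERE table_schema = 'archivesspace'
  AND referenced_table_name IS NOT NULL
ORDER BY parent_table, child_table;
```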

Step 3: Using the database information we gathered in step 2, we wrote a 2,000-line SQL script that queried one instance and inserted the data into our base instance.

Image description: a screenshot of the SQL script written to perform the data migration.
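To give a flavor of the pattern the script repeats, here is a hypothetical excerpt. The database names (base_db, sc_db), the table, and its columns are illustrative stand-ins rather than the actual ArchivesSpace schema; the identifier offsets are explained below:

```sql
-- Hypothetical excerpt of the merge pattern: copy rows from the Special
-- Collections database (sc_db) into the base database (base_db), shifting
-- primary keys by a fixed offset so they cannot collide with existing rows.
-- Database, table, and column names are illustrative stand-ins.
INSERT INTO base_db.example_table (id, repo_id, title)
SELECT id + 4000,  -- precalculated identifier: original id plus a fixed offset
       4,          -- the repository id assigned in the merged instance
       title
FROM sc_db.example_table;
```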

The database merge script ran surprisingly smoothly. However, when we reset the Solr index, it seemed to ignore the new records. Since our attempt at inserting into the repository table via SQL hadn’t worked, we created an empty repository through the ArchivesSpace interface and then migrated our information into that new repository.

In writing our script, we calculated identifiers for all of the migrating tables ourselves instead of letting the database auto-increment them as it normally would. This enabled us to fix links to the site by simply applying a calculation to the original identifier to get the new identifier. For example, since both of our original repositories had an identifier of “2,” the merged repository was incremented to “4.” Additionally, resource identifiers were incremented by 4,000, so a URL that was originally https://original_host.edu/repositories/2/resources/100 became https://new_host.edu/repositories/4/resources/4100.
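Because the mapping is purely arithmetic, it can be spot-checked straight from the database. This query is a sketch under the assumptions above (the resource table, the +4,000 offset, and repository id 4), using the placeholder hostname from the example URLs:

```sql
-- Sketch: derive each resource's new URL from its original identifier,
-- assuming the +4000 offset and repository id 4 described above.
-- 'new_host.edu' is the placeholder hostname from the example URLs.
SELECT id AS old_id,
       CONCAT('https://new_host.edu/repositories/4/resources/', id + 4000) AS new_url
FROM sc_db.resource;
```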

After our process ran and was tested, we were ready to put it to use in our production environment. We used Docker to deploy and run our new merged instance, which helped us configure and deploy to each server consistently, minimizing maintenance overhead and the risk of human error. Docker combined with GitLab CI/CD now allows us to change a plugin or setting, or even upgrade ArchivesSpace versions, and deploy to each server with the click of a button.

Image description: GitLab CI/CD deployment workflow

How you can do it

Looking back, we spent far more hours on this project than anyone anticipated. Granted, we expanded the initial scope by adding the Dockerized setup, but it has been well worth the time investment. Having learned so much about the ArchivesSpace data model in the process, we felt it was important to share that knowledge with others.

To help other institutions or groups facing similar challenges, we have made our migration process documentation and SQL script public. While it may not be ready to use out of the box for other ArchivesSpace instances, it should at least provide a head start on how to approach the issue, requiring only minor repository-specific customizations to the SQL script or data-model updates to accommodate future versions of ArchivesSpace.

We also decided to share our Docker setup, which includes implementation documentation and can easily be customized further to include new plugins or configurations.

Feel free to reach out if you have any questions on either our merge experience or our Docker deployment!
