
The challenge: Archiving legacy software

For software that has been developed on an online forge and can be copied without specific legal authorizations, the Software Heritage approach relies on massive automation. By prioritizing popular development platforms like GitHub and GitLab, we’ve already automatically archived more than 26 billion unique source code files from over 400 million different repositories.

However, the automated approach hits a wall when dealing with source code that hasn’t been developed on a modern platform. 

Legacy code is software that was created with now-obsolete tools, is often stored in older formats, and may not be available through modern software repositories like GitHub or GitLab. Such code requires a completely different strategy.

 

Why legacy code needs a different approach

For those of us whose offices are still scattered with stacks of faded printer paper containing source code, the challenge of legacy preservation is immediately clear. Legacy code isn’t just about old files; it’s about complexity and physical reality. 

Archiving this heritage requires dealing with:

  • The variety of formats, from ancient tapes to obscure disk images.
  • The existence of multiple copies and versions across different media.
  • The crucial input of authors who may still be alive; their knowledge is essential for context.
  • Supporting materials, such as documentation, technical reports, email exchanges, and even physical books that provide the required context.

 

How to collect & preserve legacy source code

At Software Heritage, we focus on preserving the source code in digital format. There are two main ways to bring this digitized historical source code into the Software Heritage archive.

1: Do-It-Yourself Archiving (SWHAP)

The Software Heritage Acquisition Process (SWHAP) was developed in collaboration with the University of Pisa to help you manage this curation process. It guides you through all the necessary steps to successfully package the legacy source code you care about and archive it in Software Heritage.

  • A detailed, step-by-step guide is available here.
  • Check out the video tutorials below.

“Thanks to the SWHAP process, I was able to archive the source code of version 1.0 of Georges Gonthier’s proof of the Four Color Theorem for Coq 7. This code is of interest for the history of computer-assisted mathematical proof, and I was able to use its archived version in my research work.”


— Baptiste Mélès, Research Associate, CNRS

2: Request Software Heritage support

If you own source code that you believe holds historical interest, and you need assistance to curate and archive it, please reach out to us. We can help reference it within the Software Heritage ecosystem and make it available to the broader community.

We will do our best to support your request, but we cannot guarantee that we will be able to take on every submission.

To request support, please send an email to legacy-code@softwareheritage.org stating:

  • Your name and affiliation.
  • The name of your code and a brief description that explains its historical significance.
  • The format of the code and the number of versions you possess.
  • Whether you have time to help us curate and archive the code.

 

Case studies: We have successfully archived the source code of challenging projects such as Amaya, one of the first web browsers and editors, and the early computer vision program Chainage de contour. These cases demonstrate the value of this targeted approach.

Get the Guide

Download the latest SWHAP guidelines here.

Contribute to the Guide

Distributed under a CC-BY 4.0 license, the guide’s source code is open for public use and community contributions.

Join the community

If you’re interested in curating and archiving legacy source code and would like to take part in future events focused on this work, join our mailing list.