January 7, 2021

Software Heritage in 2020: looking beyond the crisis

The year that has passed since we posted our last activity report has been a very special one: humankind has been confronted with a global crisis the like of which younger generations have never seen. Yet it is not the first time we have faced such a challenge, and it will not be the last.

The epidemic we are facing today is a powerful yet mindless adversary that knows no borders and has no political agenda. This means that we cannot sit down with this enemy and negotiate our way out of the danger that it poses to humankind as a whole: our only hope is to tap into all our collective knowledge to find a cure.

At Software Heritage, we committed ourselves over five years ago to the long term mission of collecting, preserving and making available to all the source code of all software publicly available, as it contains a growing amount of our collective knowledge.

We strongly believe that building the universal source code archive as a common non-profit infrastructure will help humankind be better prepared for the next global crisis, and will contribute to answering the WHO, UNESCO and UNHR joint appeal for Open Science.

It is the commitment to this mission that kept the Software Heritage team working at full speed despite the difficulties of a year spent almost entirely in lockdown. And we are happy to report that 2020 has been a very productive year!

Preserving the web of knowledge: archival

A quick look at the main page of the source code archive shows that it now contains almost 10 billion unique source files from more than 2 billion unique commits, coming from over 150 million projects collected worldwide. We have put the collaboration with GitHub to good use to ease its archival, salvaged hundreds of thousands of endangered repositories from Bitbucket, processed tens of thousands of "save code now" requests, and established collaborations with academic journals in life sciences and computer science to archive research software associated with published articles.

Preserving the web of knowledge: reference

Ensuring that past and present software source code is collected and safely archived is one part of the mission, but to fully reap the benefits of this effort, it is necessary to also make sure that all the software artifacts we archive can be referenced now and in the long term.

This is why we put significant effort into formalising the Software Heritage intrinsic identifiers (SWHIDs) that the Software Heritage archive provides for the tens of billions of software artifacts that it preserves, and into working with industry and academia towards their widespread adoption.

We are happy to report that the full specification of SWHIDs is available, the swh prefix used in SWHIDs is now registered with IANA, the Software Package Data Exchange (SPDX) industry standard specification includes SWHIDs in its recently published version 2.2, and SWHIDs are now clearly present in the scholarly landscape for software source code identification.
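
To give a concrete feel for what "intrinsic" means here, the following minimal sketch computes an identifier for a single file purely from its bytes, with no registry involved. It assumes the content flavour of SWHIDs (`swh:1:cnt:`), whose hash is compatible with the one Git uses for blobs; the function name `content_swhid` is ours and purely illustrative, not part of any Software Heritage tool, and identifiers for directories, revisions, releases and snapshots are not covered.

```python
import hashlib


def content_swhid(data: bytes) -> str:
    """Return the content SWHID for a sequence of bytes.

    Assumption: content SWHIDs use the same salted SHA1 as Git blobs,
    i.e. the hash of b"blob <length>\\0" followed by the raw bytes.
    """
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    digest = hashlib.sha1(header + data).hexdigest()
    return f"swh:1:cnt:{digest}"


if __name__ == "__main__":
    # The identifier depends only on the bytes, not on a file name or location.
    print(content_swhid(b"hello world\n"))
```

Because the identifier is derived from the content itself, anyone holding a copy of the file can recompute it and check that it matches the reference, now or in the long term.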

Preserving the web of knowledge: description and citation

Our commitment to preserving the web of knowledge goes beyond ensuring that software is safely archived and identified for the long term. We also care about the metadata that describes it, and the way it is cited in documentation and research articles. As part of our effort to support Open Science and reproducibility of research, we contribute to community initiatives to describe research software with proper metadata, we have released the first bibliographic style ever designed to cite software, and we have introduced the Software Heritage badges (swh-badges), which you can use to link to the archived source code.

There are three types of badges:

These are important steps forward in our years-long commitment to raising awareness about the importance of software in general, and as a key ingredient of academic research, on a par with articles and data.

Building the community

The scope of our mission is broad, and humbling. We know well that to succeed in the long term we need help from a broad community, ranging from industry to academia, from governments to international organisations, from private foundations to cultural institutions, and including passionate individuals and contributors all over the world.

This year has seen the first bold step forward to foster the emergence of such a broad community.


We partnered with the Alfred P. Sloan Foundation and the NLnet foundation to provide grants for experts who are willing to engage with the long-term mission of Software Heritage and build adapters for each of the platforms and version control systems out there, in order to collect and archive their source code properly.

The NLnet foundation supported Octobus to rescue 250,000 endangered public Mercurial repositories, and Tweag to develop an adapter that allows Software Heritage to archive more than 20,000 source code tarballs used to build the Nix package collection.

The cascading grant received from the Sloan Foundation has enabled us to award two subgrants already. The first supports Cottage Labs, which will build a connector that will allow all instances of InvenioRDM to safely and efficiently archive in Software Heritage the source code of all research projects deposited in them, and to provide the corresponding intrinsic identifiers (SWHIDs) to the research community. The second will fund the work of Stefan Sperling to improve the current Subversion loader and develop a CVS loader.

The Software Heritage website now has a dedicated page that details the grant programs: some are still open, so do not hesitate to apply!

Open Science and Research Infrastructures

We want to help carry the voice of software developers, researchers and research engineers in the Open Science movement: we actively participate in the Research Data Alliance (RDA), carry out various activities to promote software recognition in the FAIR ecosystem, and take part in the FAIRsFAIR and EOSC-Pillar European projects.

EOSC SIRS Architecture

This year we moved one step forward by coordinating the EOSC SIRS task force, which brought together 9 scholarly infrastructures and produced an official report on the basic building blocks needed to support software source code in the scholarly ecosystem. An essential infrastructure in the global architecture is the Software Heritage universal software archive.


During the meeting organized at UNESCO’s headquarters in February 2020, 30 participants, representing the expanding network that supports our mission, met to contribute to the discussion on the next steps and strategic directions for the coming years.


We are very grateful to them for continuing to support our mission despite the difficult year we all went through.

And we have been delighted to welcome three important new supporters: the CNRS as platinum sponsor, and Sorbonne Université and Université de Paris as gold sponsors, are now working with us to build the software pillar of Open Science.

Sharing and spreading the news


We have spent years working hard to accomplish our mission. The time has come to share and spread the news better than we have done up to now. This is why this year we launched the Newsletter and the YouTube channel. Now you can stay up to date with Software Heritage news by subscribing to the newsletter, and find all our presentations in one place on our YouTube channel!

Introducing the Software Heritage ambassador programme

Last but not least, we are now launching the Software Heritage ambassador programme, designed to welcome enthusiastic organizations and individuals that want to help spread the word about Software Heritage and the services it provides to society as a whole. There are many reasons to engage and you can apply to become an Ambassador for Software Heritage right now!


Preparing for the long haul

Our top priority will again be to ensure that the key functionalities that Software Heritage offers are rock solid: browsing, referencing, and saving source code. We have also been actively working on many exciting developments that are not really visible right now, and we hope to roll them out progressively in the coming months: mirrors and integration with extrinsic metadata sources that will help better describe the contents of the archive, to name a few.

And most importantly, the time has come to start working on setting up the independent, international, non-profit, open organization that will host Software Heritage for the long term, with an exclusive focus on its mission of building and maintaining the universal source code archive, for the benefit of society as a whole.

We look forward to working with all interested parties to build this essential infrastructure that will contribute to preserving our software commons and provide the reference archive and knowledge base for all use cases, from industry to research, from cultural heritage to governments, from individuals to organizations.

Let’s join forces to preserve our past, improve our present, and, looking beyond the current crisis, prepare better for the future.

— Roberto Di Cosmo


