December 28, 2018

Opening up to the world, and full speed ahead

Our ambition is to collect, preserve, and share the source code of all software ever written.

In the very intense year 2018 that has passed since our last activity report we have moved forward at a steady pace, and now is a good moment to recall the key accomplishments, putting them in perspective in the framework of our mission, at the service of cultural heritage, science, industry, and society as a whole.

Let’s start by looking at the progress made along these three dimensions, and then we’ll wrap up looking at how awareness is finally raising.

Collect: expanding the coverage

This last year the archive has grown quite a lot and we have made steady progress in expanding the coverage of our collection process. We have automated the tracking of Debian sources and added PyPi, GitLab.com and Inria’s own GitLab instance to the list of tracked forges.

For the source code hosted on forges that we do not track yet, we are always eager to receive contributions to our listers, but we are also opening up an initial version of a save code now service that allows to request the ingestion of source code even if it is not hosted on one of the already tracked forges.

And last, but not least, for research software, we have opened up a moderated software source code deposit service.

Preserve: kickstarting the mirror network

We acknowledge that there are many threats that might endanger long-term source code preservation: from technical failures, to mere economic decisions, from dispersion of efforts in uncoordinated initiatives, to changes in the legal framework, like the EU Copyright reform that absorbed a great amount of our time this year.

We know very well that we cannot entirely avoid them if we work alone.

That’s why an important part of our long-term strategy is the development of a geographically distributed network of mirrors, implemented using a variety of storage technologies, running in various administrative domains, controlled by different institutions, and located in different jurisdictions.

We have spent a great deal of effort to set up the legal and ethical framework for the mirror network, and we are now really thrilled to have recently welcomed the first member of the network, that we hope to see growing steadily in the future.

Collecting and preserving the software source code have been our top priorities from the very beginning, and for a while very little of the huge amount of work we made was visible from outside. This year has been a turning point, with a lot of visible progress on the share dimension!

Opening the doors of the archive

On June 7th 2018, just a bit more than a year after we established a landmark agreement on preserving and sharing the knowledge embedded in software source code, we were proud to be back at Unesco headquarters for the grand opening of the doors of the Software Heritage archive to the public. This is a major step forward in the sharing dimension: it is now possible to explore the largest collection of software source code in the world taking advantage of the many features detailed in the archive guided tour. And we will keep adding new ones!

Intrinsic identifiers for digital objects

When you find something interesting in the archive, you want to share it, and for this you need an identifier that will point to what you found. Hence we must provide identifiers for each and every one of the over 10 billions digital objects in the Software Heritage archive. It is not an easy task, because it is not just a technological challege: we knew that whatever choice we made, it would end up setting a standard in the medium term, and this is a serious responsibility.

Our iPres2018 research article provides a full account of why and how we designed the system of identifiers that is now deployed across all the Software Heritage archive (and that stands behind the “Permalinks” red vertical tab that is available in all views of the webapp that allows to browse the code).

Awareness is raising

The importance of software in general, and software source code in particular, in our modern societies has been long underestimated, overshadowed by the more visible aspects of the digital revolution, and that is one of the major reasons why Software Heritage was not created earlier.

Thanks to patient and steady efforts, the situation is now slowly changing, and the past year has been a real turning point.

An international expert meeting that was held at Unesco headquarters in November produced a detailed report on the relevance of Software Source Code, the blockers and the enablers, and a call for action that we hope will be heard all around the world.

Research Software, with its source code, is a long forgotten essential pillar for Open Science, along with research articles and research data. Things are now changing: Software Heritage has been included in the french national plan for Open Science, and research software can be in Software Heritage via an open access portal, a stepping stone for software citation.

We also reached out to the broad computing community to ask for help with the many technical and scientific challenges lying ahead.

According to our plan, we have recently established a non for profit foundation for Software Heritage, currently hosted by the Inria foundation. That’s why everybody can now contribute to our mission, joining the growing list of sponsors and partners: donations of any size are now accepted, and we do welcome even very small ones, as they indicate that you care.

Looking ahead

There are very many exciting areas of development and collaboration that will keep us busy in this coming year.

The coverage of the Software Heritage archive will be increased: more forges will be tracked, and more version control systems will be supported. We will also work on improving access to the contents of the archive, progressively adding metadata and provenance information to the webapp. We will welcome new members in the mirror network, and collaborate with industry on new interesting use cases.

Research software deposit in the Software Heritage archive will be made available through more open access portals, software source code will be promoted as a pillar for Open Science, and the intrinsic identifiers pointing into the archive will be promoted in the framework of software citation and reproducibility of research.

The content of the Software Heritage archive will be made available as a dataset to researchers in all fields, enabling the emergence of new tools and techniques that will enhance the fruition of the software heritage of humankind.

And we hope to attract even more users and contributors: we have an exciting mission, and everybody is welcome onboard!

— Roberto Di Cosmo

Software Heritage