Software Heritage: Ethical Charter for using the archive data


Software is at the heart of our digital society and embodies a growing part of our scientific, technical and organizational knowledge. As a consequence, software source code is now a growing part of our cultural heritage and a valuable asset for education, research, and industry.

The core mission of Software Heritage is to ensure that this precious body of knowledge will be preserved over time and made available to all, by collecting, preserving, and sharing all available software in source code form. Together with its complete development history. Forever.

We do this for multiple reasons. To preserve the scientific and technological knowledge embedded in software source code, which is a significant part of our heritage. To allow better software development and reuse for society and industry, by building the largest open software knowledge database, enabling the development of a broad range of value-added applications. To foster better science, by assembling the largest curated archive for software research, and building the infrastructure for preserving and sharing research software.

We do this now, because we are at a turning point: on one side, many of the people who created the computer technology we rely upon today are still around, and willing to help by making available the source code of their contributions, but we have only a limited time to collect their legacy. On the other side, we seem to be at increasing risk of massive loss of source code developed collaboratively, because of hosting sites that shut down when their popularity decreases, and the lack of a structured effort to archive software artefacts.

You have been provided access to the full contents of the Software Heritage archive, which is the result of a significant collection, preprocessing and preservation effort undertaken by Software Heritage and the Software Heritage mirror network.

This offers you unprecedented opportunities to study and analyze the largest collection of source code ever assembled.

We hope that this access to the Archive will foster research projects that produce positive results, such as enhancing our understanding of software as a noble artefact of human ingenuity, improving its quality, studying its history, and many others that we cannot yet foresee.

But with power comes responsibility, and this Ethical Charter highlights the principles that all persons and organizations accessing the Archive commit to respect.

Avoid harm

The source code collected in the Software Heritage archive enables a broad range of analyses and applications, in many areas of research. Unfortunately, even well-intended actions, including those carried out purely for research purposes, may lead to harm.

You are expected to consider all potential ethical issues arising from your use of the data, and refrain from performing analysis or processing that may result in harm.

Protect Personal Data

The Software Heritage archive collects publicly available source code, and its development history, from a variety of public sources. Any personal information contained in the source code or in the development history is hence collected in the archive, and you have access to it.

Even where the local legislation does not make it mandatory, you will strive to adopt processes and policies that protect personal data in general, and in particular to safeguard from abusive behavior the people who, through their work and dedication, created the very software commons we are preserving. Mass mailing software developers is a well-known example of misuse that is clearly unacceptable, but there may be many others.

Avoid useless copies

You are also asked to refrain from redistributing the full content of the Archive, or significant portions of it: it is both unnecessary and dangerous. If you need to make (a portion of) the data available, for example for reproducibility studies, do not copy the data; instead, use persistent identifiers that reference the data in Software Heritage itself. Software Heritage is a long-term archive, so the references will be stable over time, unlike bulk copies that may rot over time.

Keeping copies of the Archive inside the Software Heritage Mirror network also ensures that all persons getting access to the data are bound by the same obligations as you are.
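
As an illustration of referencing data by identifier instead of copying it, here is a minimal sketch that computes such an identifier for a single file. It assumes the persistent identifiers meant above are SWHIDs (the identifier scheme used by Software Heritage), and that a file content is identified with the same hash Git uses for blobs. The function name content_swhid is hypothetical, chosen for this example only.

```python
import hashlib


def content_swhid(data: bytes) -> str:
    """Return a SWHID-style identifier for a file content, given its raw bytes.

    Assumption: content identifiers reuse Git's blob hashing, that is,
    SHA-1 over a "blob <length>\\0" header followed by the raw bytes.
    """
    header = b"blob %d\x00" % len(data)
    digest = hashlib.sha1(header + data).hexdigest()
    return f"swh:1:cnt:{digest}"


if __name__ == "__main__":
    # Publish this short string in a paper or dataset description
    # instead of shipping a copy of the file itself.
    print(content_swhid(b"hello world\n"))
```

A reproducibility package can then list one identifier per file, or a single identifier for a whole directory or revision, and leave the data where it is already preserved.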

Care about derived data

You are expected to think carefully about the derived data you make available to third parties as a result of your processing and analysis. As an example, even if you do not engage directly in mass mailing software developers, publishing a complete database of all developers' email addresses as a result of one of your studies does enable third parties to mass mail them; hence, you must refrain from publishing it.
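
One possible way to act on this, sketched below, is to replace email addresses with keyed pseudonyms before any derived table leaves your environment. This is an illustration and not a procedure prescribed by this Charter: the record layout (email and commits fields) and the function names pseudonymize and scrub are hypothetical, and the secret key is assumed to stay private and never be published with the data.

```python
import hashlib
import hmac
import secrets


def pseudonymize(email: str, key: bytes) -> str:
    """Map an email address to a stable pseudonym using HMAC-SHA256."""
    normalized = email.strip().lower().encode("utf-8")
    return hmac.new(key, normalized, hashlib.sha256).hexdigest()[:16]


def scrub(records, key):
    """Return copies of the records with the email replaced by a pseudonym."""
    return [
        {"author": pseudonymize(r["email"], key), "commits": r["commits"]}
        for r in records
    ]


if __name__ == "__main__":
    key = secrets.token_bytes(32)  # keep private; never ship it with the data
    sample = [
        {"email": "alice@example.org", "commits": 12},
        {"email": "bob@example.org", "commits": 3},
    ]
    for row in scrub(sample, key):
        print(row)
```

A keyed hash is pseudonymization rather than anonymization, so whether the resulting table is safe to publish still requires the careful thought this section asks for.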