Our mission at Software Heritage is to collect, preserve, and make publicly available the entire body of software, in the preferred form for making modifications to it. We consider that publicly available source code, and even more so Free and Open Source Software (FOSS), is a digital commons that embodies decades of human creative effort. As we strive to preserve this vital resource for future generations, we acknowledge the emergence of inquiries regarding the use of the Software Heritage archive for the training of machine learning models, particularly large language models (LLMs) that can automatically generate code to assist with software development tasks.
The legal and ethical frameworks governing these undertakings are complex and rapidly evolving. On the legal side, while copyright law traditionally permits reading by humans for learning purposes, and recent legislation provides exceptions designed to facilitate text and data mining, both in Europe and in France, the large-scale extraction of knowledge from code for LLM training enters uncharted territory. This uncertainty is further intensified by ongoing legislative discussions, such as the AI Act in Europe, which reflect regulators’ struggle to keep pace with scientific and technological advancement. On the ethical side, we still lack an agreed-upon definition of what “open source AI [model]” means, although seminal work on the question has recently started.
Software source code is more than mere data. It is the result of a profound human endeavour, a cumulative effort of engineers, developers, computer scientists, and many others, particularly manifest in the FOSS movement. This colossal digital commons captures our technical, scientific, and organisational knowledge.
In view of our mission, we recognise the potential value in engaging with parties interested in training LLMs on the content of the Software Heritage archive. The development of machine learning models that encapsulate this digital commons promises to democratise the software creation process, making it easier for a broader constituency to reap the benefits of the digital revolution. This is a significant goal that aligns with our values.
We feel that the question is no longer whether LLMs for code should be built. They are already being built, independently of what we do, and there is no turning back. The real question is how they should be built and whom they should benefit.
In alignment with our mission, we believe that LLMs for code should be built in a transparent and respectful way, to the benefit of all. We hence state the following principles for acceptable machine learning use of the Software Heritage archive.
- Knowledge derived from the Software Heritage archive must be given back to humanity, rather than monopolised for private gain. The resulting machine learning models must be made available under a suitable open license, together with the documentation and tooling needed to use them.
- The initial training data extracted from the Software Heritage archive must be fully and precisely identified, for example by publishing the corresponding SWHIDs (note that, in the context of Software Heritage, public availability of the initial training data is a given: anyone can obtain it from the archive). This will enable use cases such as studying biases (fairness), verifying whether a given piece of code was present in the training data (transparency), and providing appropriate attribution when generated code resembles training data (credit), among others.
- Mechanisms should be established, where possible, for authors to exclude their archived code from the training inputs before model training begins.
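
To illustrate the transparency use case in the second principle, here is a minimal sketch of how someone might check whether a file was part of a published training set. It relies on the fact that a SWHID for file content (`swh:1:cnt:...`) uses the same intrinsic hash as a Git blob. The function names and the idea of a published set of SWHID strings are assumptions made for this example; the principles above do not prescribe any particular publication format. Note also that an exact-match check like this only detects byte-identical files, not near-duplicates.

```python
import hashlib
from typing import AbstractSet


def content_swhid(data: bytes) -> str:
    """Compute the SWHID of a file's content (object type `cnt`)."""
    # Content SWHIDs reuse the Git blob hash: SHA-1 over the header
    # b"blob <length>\0" followed by the raw bytes of the file.
    header = b"blob %d\0" % len(data)
    return "swh:1:cnt:" + hashlib.sha1(header + data).hexdigest()


def was_in_training_data(data: bytes, published_swhids: AbstractSet[str]) -> bool:
    """Return True if this exact content appears in the published set.

    `published_swhids` is a hypothetical manifest of SWHIDs released by
    a model provider; its format is an assumption of this sketch.
    """
    return content_swhid(data) in published_swhids
```

A model provider following the principle could publish such a set alongside the model, and an author could then run the check locally on their own files without sending any code to a third party.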
In the context of training machine learning models on the content of the Software Heritage archive, we will consider collaborating with entities committed to upholding these principles.