March 5, 2019

Building a provenance index for all the source code, in partnership with CAST

We are delighted to announce a research partnership with CAST to create a provenance index for all the source code stored in the Software Heritage archive. The index will support a greater understanding of the evolution of software development over time, and enable a wealth of advanced uses of the archive.

Continuing the Software Heritage mission

The mission of Software Heritage is to collect, preserve, and make accessible the source code of all available software, serving the needs of industry, research and cultural heritage.

Software Heritage is a nonprofit initiative created to become the universal archive of software source code. The archive already holds more than 5.6 billion source files from more than 88 million projects, collected from sources including Debian, GitHub, GitLab, Gitorious, GoogleCode, GNU, the Python Package Index and more, which gives it the unique ability to retrace the detailed revision history of all the code it contains.

Where did this file really come from?

The research partnership between Software Heritage and CAST aims to build an efficient provenance index on top of the Software Heritage platform. This index is the key to quickly identifying the original occurrence of any given source file, as well as all its subsequent occurrences, allowing users to ask questions like “when and where did this file first appear?”. Such questions are of paramount importance when fixing vulnerabilities or detecting exogenous code.
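
To make the idea concrete, here is a toy sketch of what a provenance lookup could look like. It is only an illustration under simplifying assumptions, not the design of the actual index: the `ProvenanceIndex` and `Occurrence` names, the use of SHA-256 as a content identifier, and the example.org origins are all hypothetical, and an in-memory dictionary would obviously not scale to billions of files.

```python
import hashlib
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class Occurrence:
    """One place a file's content was seen (hypothetical record layout)."""
    timestamp: int  # commit time, in seconds since the epoch
    origin: str     # project the revision belongs to
    path: str       # file path inside that revision


class ProvenanceIndex:
    """Toy in-memory index: content identifier -> list of occurrences."""

    def __init__(self):
        self._by_content = defaultdict(list)

    @staticmethod
    def content_id(data: bytes) -> str:
        # Assumption: identify a file purely by a hash of its bytes.
        return hashlib.sha256(data).hexdigest()

    def add(self, data: bytes, occurrence: Occurrence) -> None:
        self._by_content[self.content_id(data)].append(occurrence)

    def occurrences(self, data: bytes) -> list:
        # Sorted by timestamp first, because of the dataclass field order.
        return sorted(self._by_content.get(self.content_id(data), []))

    def first_occurrence(self, data: bytes):
        found = self.occurrences(data)
        return found[0] if found else None


index = ProvenanceIndex()
blob = b"int main(void) { return 0; }\n"
# The same content recorded in two projects, inserted out of order.
index.add(blob, Occurrence(1400000000, "https://example.org/fork", "src/main.c"))
index.add(blob, Occurrence(1200000000, "https://example.org/original", "main.c"))

first = index.first_occurrence(blob)
print(first.origin, first.path)
```

Keying on the file content rather than on its name or location is what lets the same lookup answer both "where did this file first appear?" and "where else does it occur?", even when the file has been renamed or copied into another project.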

Building such a provenance index for more than five billion known source code files is a nontrivial undertaking, but we are looking forward to making available an experimental version of this functionality soon.

Getting involved

Everybody is welcome to help build the Great Library of Source Code: you can contribute to the development, make a donation, or contact us to become a Software Heritage partner or sponsor!