Our long-term goal is to collect all publicly available software in source code form together with its development history, replicate it massively to ensure its preservation, and share it with everyone who needs it.
The Software Heritage archive is growing over time as we crawl new source code from software projects and development forges. We will incrementally release archive search and browse functionalities; for now, you can check whether source code you care about is already present in the archive.
Even though we just got started, we have already ingested a significant amount of source code into the Software Heritage archive, possibly assembling the largest source code archive in the world. The archive currently includes:
We currently keep up with changes happening on GitHub, and are in the process of automating syncing with all the above source code origins. In the future we will add many more origins and ingest into the archive software that we have salvaged from recently disappeared forges. The figures below let you peek into the archive and its evolution over time.
Note: the counters and graphs above are based on heuristics that might not reflect the exact size of the archive. While the long-term trends shown and ballpark figures are reliable, individual point-in-time values might not be.
All content stored in the archive gets a unique, intrinsic identifier, obtained by composing several different cryptographic hash functions. Using the search box below you can check, via its SHA1, whether the archive already contains source code you care about. Just drag and drop the relevant source code files into the box (if you have them), or enter the SHA1 of one of them.
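If you only have the files locally and want the SHA1 to type into the search box, it can be computed with any standard tool. The sketch below uses Python's standard hashlib module; the function names are ours, and pairing SHA1 with SHA256 is only an illustrative assumption about what "several hash functions" might look like, not a description of the archive's actual identifier scheme.

```python
import hashlib


def file_checksums(data: bytes) -> dict:
    """Return hex digests of the same bytes under two hash functions.

    The choice of SHA256 as the second function is an assumption made
    for illustration only.
    """
    return {
        "sha1": hashlib.sha1(data).hexdigest(),
        "sha256": hashlib.sha256(data).hexdigest(),
    }


def sha1_of_file(path: str, chunk_size: int = 65536) -> str:
    """Compute the SHA1 of a file, reading it in chunks to bound memory use."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Calling sha1_of_file on a source file yields a 40-character hexadecimal string, which is the form a SHA1 takes when entered by hand.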
Programmatic access to the content of the archive is available via the Software Heritage API.
The API lets you navigate the archive as a graph of development-related objects, such as file contents, directories, commits, and releases. With the API, developers can look up individual objects by their IDs, retrieve their metadata, and jump from one to another by following links: for example, from commits to the corresponding directories or parent commits, or from releases to released commits. The API also exposes crawling information, such as the tracked software origins and the full list of visits performed on each of them. This makes it possible, for instance, to know when snapshots of a specific Git repository were taken and, for each of them, where each branch was pointing at the time.
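To make the graph model concrete, here is a small in-memory sketch of that kind of navigation. It is a toy model and does not call the real Software Heritage API: the object IDs, the link names ("target", "directory", "parents", "entries"), the origin URL, and every function name are hypothetical, chosen only to mirror the relationships described above.

```python
from dataclasses import dataclass, field


@dataclass
class ArchiveObject:
    kind: str                                     # "content", "directory", "commit" or "release"
    links: dict = field(default_factory=dict)     # link name -> list of target IDs
    metadata: dict = field(default_factory=dict)


# Hypothetical objects: one release pointing at the newer of two commits.
ARCHIVE = {
    "rel-1.0": ArchiveObject("release", {"target": ["rev-b"]}, {"name": "1.0"}),
    "rev-b": ArchiveObject("commit", {"directory": ["dir-b"], "parents": ["rev-a"]},
                           {"message": "Add README"}),
    "rev-a": ArchiveObject("commit", {"directory": ["dir-a"], "parents": []},
                           {"message": "Initial commit"}),
    "dir-b": ArchiveObject("directory", {"entries": ["file-main", "file-readme"]}),
    "dir-a": ArchiveObject("directory", {"entries": ["file-main"]}),
    "file-main": ArchiveObject("content", {}, {"length": 120}),
    "file-readme": ArchiveObject("content", {}, {"length": 42}),
}

# Hypothetical crawling information: visits per origin, each recording
# where every branch pointed when the snapshot was taken.
VISITS = {
    "https://example.org/project.git": [
        {"date": "2016-01-10", "branches": {"master": "rev-a"}},
        {"date": "2016-03-02", "branches": {"master": "rev-b"}},
    ],
}


def lookup(object_id: str) -> ArchiveObject:
    """Look up a single object by its ID."""
    return ARCHIVE[object_id]


def follow(object_id: str, link: str) -> list:
    """Return the IDs reachable from an object through one named link."""
    return list(lookup(object_id).links.get(link, []))


def ancestry(commit_id: str) -> list:
    """List a commit and all its ancestors, following parent links depth-first."""
    ordered, pending, seen = [], [commit_id], set()
    while pending:
        current = pending.pop()
        if current in seen:
            continue
        seen.add(current)
        ordered.append(current)
        pending.extend(follow(current, "parents"))
    return ordered


def branch_history(origin: str, branch: str) -> list:
    """For each visit of an origin, report (visit date, commit the branch pointed to)."""
    return [
        (visit["date"], visit["branches"][branch])
        for visit in VISITS.get(origin, [])
        if branch in visit["branches"]
    ]
```

In this model, follow("rel-1.0", "target") jumps from a release to the released commit, ancestry walks the development history backwards, and branch_history answers the question of where a branch was pointing at each visit. A real client would replace the two dictionaries with requests to the API.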
Additional functionalities of the Software Heritage archive are in the works. Here are some of the items on our roadmap:
In terms of what is in the archive… we just got started. We plan to track many more software projects and development forges, as well as enable people to directly submit URLs of missing projects that should be archived.
All archived content is timestamped at retrieval time and associated with where we found it (its origin). We are working on exposing provenance information as it is useful for a wealth of different applications.
Once content is in the archive, with all relevant metadata, it will be possible to browse the entirety of the archive, explore contained software projects, their timelines, and the corresponding source code.
We are building the largest source code archive ever conceived. To allow people to exploit it fully, we are working on source code indexing and full-text search. Delivering this at scale, however, will require overcoming a number of challenges.
To fulfill our preservation mission we will enable download of actual source code content as well as retrieval of development history in a format that is exploitable using modern version control systems.