The Software Heritage archive

Our long-term goal is to collect all publicly available software in source code form together with its development history, replicate it massively to ensure its preservation, and share it with everyone who needs it.

The Software Heritage archive is growing over time as we crawl new source code from software projects and development forges. We will release archive search and browse functionality incrementally; for now, you can check whether source code you care about is already present in the archive.

Content

Even though we just got started, we have already ingested a significant amount of source code into the Software Heritage archive, possibly assembling the largest source code archive in the world. The archive currently includes:

  • public repositories from GitHub
  • source packages from the Debian distribution (as of August 2015, via the snapshot service)
  • tarball releases from the GNU project (as of August 2015)

We currently keep up with changes happening on GitHub, and are in the process of automating syncing with all the above source code origins. In the future we will add many more origins and ingest into the archive software that we have salvaged from recently disappeared forges. The figures below give a peek into the archive and its evolution over time.

[Figures: source files, commits, and projects in the archive over time]

Do we already have your code?

All content stored in the archive gets a unique, intrinsic identifier, obtained by composing several different cryptographic hash functions. Using the search box below, you can check whether the archive already contains source code you care about via its SHA1. Just drag and drop the relevant source code files into the box (if you have them), or enter the SHA1 of one of them.

Examples:

  • the SHA1 of Player.cpp from the DOOM 3 videogame is a4d0c728252b18f66ac38d0a6f5e51fc471aa68d; is it present in the archive?
  • the text of the license under which DOOM 3 is released (GPL3) has SHA1 8624bcdae55baeef00cd11d5dfcfa60f68710a02
  • a source code file present in the archive has SHA1 3ae58a7760b841b9588c81cf65602e0f5361bd22, can you find out what it is?
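
If you would rather compute the SHA1 of a local file yourself before entering it in the search box, a minimal Python sketch could look like the following. It assumes the identifier in question is the plain SHA1 of the file's raw bytes; the function names are illustrative and are not part of any Software Heritage tool.

```python
import hashlib
from pathlib import Path


def sha1_of_bytes(data: bytes) -> str:
    """Return the hexadecimal SHA1 digest of a byte string."""
    return hashlib.sha1(data).hexdigest()


def sha1_of_file(path: str, chunk_size: int = 65536) -> str:
    """Return the hexadecimal SHA1 digest of a file, reading it in chunks
    so that large files do not have to fit in memory."""
    digest = hashlib.sha1()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Calling `sha1_of_file("Player.cpp")` on your own copy of a file returns a 40-character hexadecimal string that can be compared with the examples above or pasted into the search box.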

API

Programmatic access to the content of the archive is available via the Software Heritage API.

The API lets you navigate the archive as a graph of development-related objects, such as file contents, directories, commits, and releases. With the API, developers can look up individual objects by their IDs, retrieve their metadata, and jump from one to another by following links, e.g., from commits to the corresponding directories or parent commits, from releases to released commits, etc. The API also lets you retrieve crawling information, such as tracked software origins and the full list of visits performed on each of them. This makes it possible, for instance, to know when snapshots of a specific Git repository were taken and, for each of them, where each branch was pointing at the time.
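
To make the graph model described above more concrete, here is a small in-memory sketch in Python. It does not call the Software Heritage API: every class, field, and function name below is hypothetical and chosen only to illustrate the kinds of links involved (release to commit, commit to directory and parent commits, visit to branch targets).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Commit:
    id: str
    directory: str                                     # ID of the root directory
    parents: List[str] = field(default_factory=list)   # IDs of parent commits


@dataclass
class Release:
    id: str
    name: str
    target: str                                        # ID of the released commit


@dataclass
class Visit:
    date: str                                          # ISO 8601 timestamp (UTC)
    branches: Dict[str, str]                           # branch name -> commit ID


def history(commits: Dict[str, Commit], start: str) -> List[str]:
    """Return the IDs of all commits reachable from `start` via parent links."""
    seen, order, stack = set(), [], [start]
    while stack:
        commit_id = stack.pop()
        if commit_id in seen or commit_id not in commits:
            continue
        seen.add(commit_id)
        order.append(commit_id)
        # Push parents in reverse so the first parent is visited first.
        stack.extend(reversed(commits[commit_id].parents))
    return order


def branch_at(visits: List[Visit], branch: str, date: str) -> Optional[str]:
    """Return where `branch` pointed at the latest visit on or before `date`.

    Timestamps are compared as strings, which is only valid when they all
    share the same ISO 8601 format and time zone."""
    past = [v for v in visits if v.date <= date and branch in v.branches]
    if not past:
        return None
    return max(past, key=lambda v: v.date).branches[branch]


# Made-up example data: three commits in a line, one release, two visits.
commits = {
    "c1": Commit("c1", directory="d1"),
    "c2": Commit("c2", directory="d2", parents=["c1"]),
    "c3": Commit("c3", directory="d3", parents=["c2"]),
}
release = Release("r1", name="v1.0", target="c3")
visits = [
    Visit("2015-08-01T00:00:00Z", {"master": "c2"}),
    Visit("2015-09-01T00:00:00Z", {"master": "c3"}),
]
```

With this data, `history(commits, release.target)` walks from the released commit back through its ancestors, and `branch_at(visits, "master", "2015-08-15T00:00:00Z")` answers the question of where a branch pointed at a given time. The real API exposes these objects over HTTP; see its documentation for the actual endpoints and field names.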

Read the API documentation

Help us unlock the next levels

Additional functionalities of the Software Heritage archive are in the works. Here are some of the items on our roadmap:

Increase coverage

In terms of what is in the archive… we just got started. We plan to track many more software projects and development forges, as well as enable people to directly submit URLs of missing projects that should be archived.

Provenance information

All archived content is timestamped at retrieval time and associated with where we found it (its origin). We are working on exposing provenance information as it is useful for a wealth of different applications.

Browsing

Once content is in the archive, with all relevant metadata, it will be possible to browse the entirety of the archive and explore the software projects it contains, their timelines, and the corresponding source code.

Full-text search

We are building the largest source code archive ever conceived. To allow people to exploit it fully, we are working on source code indexing and full-text search. Delivering this at the scale of the archive will require overcoming a number of challenges.

Download

To fulfill our preservation mission we will enable download of actual source code content as well as retrieval of development history in a format that is exploitable using modern version control systems.

You can help

The Software Heritage archive will serve the needs of the many, from cultural institutions to scientists and industries. Everyone can help us achieve these ambitious goals, and there are several ways to do so.

Become a sponsor

Pursuing our roadmap for the archive requires significant resources. We welcome companies, institutions, and individuals who would like to join our sponsorship program and sustain the Software Heritage project.

Discover our sponsorship program

Tackle the scientific challenges

Building, maintaining, and exploiting the universal source code archive poses relevant scientific challenges. We welcome scientists who would like to contribute to this mission by participating in our research activities.

Join our research community

Code with us

All the software we develop ourselves is open source. We welcome contributors who are willing to delve into it and help us build the many components needed to move the archive towards its next milestones.

Dive into the code