Sharing the source code commons with a set of trustworthy and open services, providing access to the largest source code library in the world.
The SWH archive is the gateway to all captured source code and its entire development history. On the browsable platform, you can view every visit made to a given code location (collected from different forges, package managers and distributions) and read the source code content that was captured.
SWH provides a persistent identifier (PID), called a SWHID, that identifies each and every source code artifact and guarantees its integrity. SWHIDs are intrinsic identifiers: they are intimately bound to the designated object, so they do not need a registry, only agreement on a standard to resolve them.
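To make "intrinsic" concrete, here is a minimal sketch of how the SWHID of a single file content can be computed locally. It assumes the content scheme of the SWHID standard, in which the identifier is the same SHA-1 that Git computes for a blob. The function name `content_swhid` is illustrative and not part of any SWH library.

```python
import hashlib


def content_swhid(data: bytes) -> str:
    """Return the SWHID of a file content (illustrative helper).

    Assumption: content identifiers use the Git blob hash, i.e. the SHA-1
    of the header b"blob <length>\\0" followed by the raw bytes.
    """
    header = b"blob %d\x00" % len(data)
    digest = hashlib.sha1(header + data).hexdigest()
    return f"swh:1:cnt:{digest}"


# The empty file has a well-known Git blob hash.
print(content_swhid(b""))  # swh:1:cnt:e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

Because the identifier depends only on the bytes, anyone holding the file can recompute it and check it against the SWHID without asking a central registry.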
The SWHID can also be used as a badge.
Go to the resolver API endpoint
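As a rough illustration of resolving a SWHID through the API, the sketch below validates a core SWHID (without qualifiers) and builds the URL to query. The `resolve/<swhid>/` path under the API root is an assumption about the resolver endpoint, and `parse_swhid` and `resolver_url` are hypothetical helper names. No request is sent.

```python
import re
from typing import Tuple

API_ROOT = "https://archive.softwareheritage.org/api/1/"

# Core SWHID: scheme, version 1, object type, 40 hex digits.
SWHID_RE = re.compile(r"swh:1:(cnt|dir|rev|rel|snp):([0-9a-f]{40})")


def parse_swhid(swhid: str) -> Tuple[str, str]:
    """Split a core SWHID into (object type, hex hash), or raise ValueError."""
    match = SWHID_RE.fullmatch(swhid)
    if match is None:
        raise ValueError(f"not a valid core SWHID: {swhid!r}")
    return match.group(1), match.group(2)


def resolver_url(swhid: str) -> str:
    """Build the resolver URL for a SWHID (endpoint path is an assumption)."""
    parse_swhid(swhid)  # validate before building the URL
    return f"{API_ROOT}resolve/{swhid}/"


print(resolver_url("swh:1:cnt:e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"))
```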
The Vault is the service in charge of reconstructing parts of the archive as self-contained bundles that can then be imported locally, for instance into a Git repository. With the Vault, users can download directories and revisions on the web platform or through the API.
Go to the download directory API endpoint
The deposit feature is a SWORD 2.0 server implementation. SWORD (Simple Web-service Offering Repository Deposit) is an interoperability standard for digital file deposit. The deposit allows a client (a repository, e.g. HAL) to submit software source archives and their associated metadata to the SWH archive. Metadata can also be submitted on its own, referencing a repository URL (origin) or a SWHID.
The SWH archive harvests source code from different sources and converts all the source code into a single and universal data structure, an enormous Merkle directed acyclic graph [Merkle, 1987], which is a classical cryptographic construction, combining a tree and a hash function.
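The toy model below sketches the idea of a Merkle DAG: every node is identified by a hash of its own payload, and a directory's payload lists the identifiers of its children. The class `MerkleStore` and its hashing format are simplifications made up for this example, not the archive's actual object formats.

```python
import hashlib
from typing import Dict, Tuple


class MerkleStore:
    """Toy content-addressed store; identical nodes are stored once."""

    def __init__(self) -> None:
        self.objects: Dict[str, Tuple[str, bytes]] = {}

    def _put(self, kind: str, payload: bytes) -> str:
        # The node id is a hash of the node type and payload.
        node_id = hashlib.sha1(kind.encode() + b"\x00" + payload).hexdigest()
        self.objects[node_id] = (kind, payload)
        return node_id

    def add_content(self, data: bytes) -> str:
        return self._put("content", data)

    def add_directory(self, entries: Dict[str, str]) -> str:
        # A directory hashes the sorted (name, child id) pairs, so any change
        # in a child propagates up to the directory's own id.
        lines = [f"{name} {child}" for name, child in sorted(entries.items())]
        return self._put("directory", "\n".join(lines).encode())


store = MerkleStore()
readme = store.add_content(b"hello\n")
v1 = store.add_directory(
    {"README": readme, "main.c": store.add_content(b"int main(){}\n")}
)
v2 = store.add_directory(
    {"README": readme, "main.c": store.add_content(b"int main(){return 0;}\n")}
)
print(v1 != v2, len(store.objects))
```

The two directory versions get different identifiers because one child changed, while the unchanged README is stored only once. This sharing is what lets a single graph hold many projects and versions without duplicating identical files.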
Crawling is separated into three phases: listing software sources, scheduling updates and collecting the software artifacts into the archive.
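The three phases can be pictured with a small in-memory simulation. Everything in it is hypothetical: the function names, the `Task` type, the integer clock and the fixed revisit interval are illustration only and do not describe the actual SWH listers, scheduler or loaders.

```python
import heapq
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass(order=True)
class Task:
    next_run: int                      # when the origin should be visited next
    origin: str = field(compare=False)


def list_origins(source: Dict[str, str]) -> List[str]:
    """Phase 1 (listing): enumerate the origins a software source exposes."""
    return sorted(source)


def schedule(origins: List[str], now: int = 0) -> List[Task]:
    """Phase 2 (scheduling): queue one visit task per origin."""
    queue = [Task(now, origin) for origin in origins]
    heapq.heapify(queue)
    return queue


def load_due(queue: List[Task], now: int, source: Dict[str, str],
             archive: Dict[str, List[Tuple[int, str]]], interval: int = 10) -> None:
    """Phase 3 (collecting): visit every due origin, then reschedule it."""
    while queue and queue[0].next_run <= now:
        task = heapq.heappop(queue)
        archive.setdefault(task.origin, []).append((now, source[task.origin]))
        heapq.heappush(queue, Task(now + interval, task.origin))


# A fake "forge": origin URL -> current state of the repository.
forge = {"https://example.org/a.git": "state-1", "https://example.org/b.git": "state-7"}
archive: Dict[str, List[Tuple[int, str]]] = {}
queue = schedule(list_origins(forge))
load_due(queue, now=0, source=forge, archive=archive)
print(sorted(archive), [task.next_run for task in queue])
```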
Archiving all the source code is a daunting task and there are different mechanisms put in place to ensure the preservation of source code from different types of origins.
API access is over HTTPS. All API endpoints are rooted at https://archive.softwareheritage.org/api/1/ and the data is sent and received as JSON by default.
You can jump directly to the endpoint index, which lists all available API functionalities, or read on for more general information about the API.
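A minimal sketch of calling the API from Python's standard library follows. It relies only on the root URL and the JSON default stated above. The helper names `build_request` and `get_json` and the endpoint string in the usage line are placeholders; real endpoint paths come from the endpoint index.

```python
import json
import urllib.request

API_ROOT = "https://archive.softwareheritage.org/api/1/"


def build_request(endpoint: str) -> urllib.request.Request:
    """Prepare a GET request for an endpoint path relative to the API root."""
    url = API_ROOT + endpoint.lstrip("/")
    # JSON is the default format; the Accept header only makes that explicit.
    return urllib.request.Request(url, headers={"Accept": "application/json"})


def get_json(endpoint: str, timeout: float = 30.0):
    """Send the request over HTTPS and decode the JSON response body."""
    with urllib.request.urlopen(build_request(endpoint), timeout=timeout) as response:
        return json.load(response)


# Building a request does not touch the network; get_json() does.
request = build_request("/some/endpoint/")  # placeholder path
print(request.full_url)
```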
Archiving a repository from a forge isn’t the same action as archiving source code from a package manager. It becomes even harder when you realize that version control systems have evolved a lot over the last decades. The SWH architecture was designed to harmonize different sources into a robust infrastructure.
The data model adopted by Software Heritage to represent the information that it collects is centered around the notion of software artifact, using the following canonical names, from bottom to top: contents, directories, revisions and releases. It also uses origins, visits and snapshots to store provenance information. Read more in Software Heritage: Why and How to Preserve Software Source Code.
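One way to picture how these notions stack up is a set of small record types. This is a simplified sketch: the field names are assumptions chosen for readability and do not reflect the archive's actual schema.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Content:                       # a file's raw bytes
    data: bytes


@dataclass(frozen=True)
class Directory:                     # named entries pointing to contents or directories
    entries: Tuple[Tuple[str, Union[Content, "Directory"]], ...]


@dataclass(frozen=True)
class Revision:                      # a point in history: a root directory plus parents
    directory: Directory
    parents: Tuple["Revision", ...]
    message: str


@dataclass(frozen=True)
class Release:                       # a named revision, such as a tagged version
    name: str
    target: Revision


@dataclass(frozen=True)
class Snapshot:                      # the branches seen at an origin at one moment
    branches: Tuple[Tuple[str, Union[Revision, Release]], ...]


@dataclass(frozen=True)
class Visit:                         # provenance: which origin, when, and what was found
    origin: str
    date: str
    snapshot: Snapshot


root = Directory(entries=(("README", Content(b"hello\n")),))
first = Revision(directory=root, parents=(), message="initial import")
snapshot = Snapshot(branches=(("main", first), ("v1.0", Release("v1.0", first))))
visit = Visit(origin="https://example.org/a.git", date="2024-01-01", snapshot=snapshot)
print(visit.snapshot.branches[0][0])
```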
SWH mirrors are full, synchronized copies of the Software Heritage universal source code archive, operated independently from the Software Heritage initiative. Mirrors improve software availability and reduce the risk of data loss due to uncontrolled events, ultimately ensuring unfettered access to software source code for all.
SWH collects and extracts metadata that describes and provides additional information on source code.
The swh-indexer module is in charge of processing source code files to extract information, with the following objectives:
fossology-license (detecting the license of a file)