June 13, 2025

Using the SoftWare Hash Identifier (SWHID): A tutorial


Software identification is crucial for ensuring the long-term traceability of scholarly outputs. However, identifying software can be complex, resembling an investigation requiring tailored solutions. The Software Hash Identifier (SWHID) is an intrinsic identifier designed for software, acting like a unique fingerprint or DNA sequence intrinsically bound to the software’s content. It complements extrinsic identifiers like DOIs, which typically identify metadata records or broader projects. The SWHID provides actionable solutions for researchers, repository managers, and others involved in the scholarly ecosystem.

This tutorial provides a guide for research support staff, designed to answer the question: “What does an end-user from my institution need to understand about software identification?”

We’ll explain why common identifiers like DOIs aren’t always sufficient for software, highlighting the specific concerns of unique software identification. Most importantly, we’ll introduce a straightforward, “plug-and-play” solution that your community can use, emphasizing the crucial role you’ll play in helping them implement it. This post derives from a two-hour live session by the Software Heritage Open Science team, Morane Gruenpeter and Sabrina Granger, as part of the FAIR implementation workshops. The slides are also available.

Understand what SWHID Identifies

A SWHID identifies a specific software artifact at a chosen level of granularity. SWHIDs identify the source code content itself, rather than the project or its metadata. The different types of objects identifiable by a SWHID include:

CNT (Content): Identifies the content of a single file.
DIR (Directory): Identifies a directory, including its contents and the names of the files within it. This SWHID type is recommended for academic use – it’s self-contained and doesn’t depend on external services like Software Heritage to work.
REV (Revision): Identifies a commit in a development history sequence.
REL (Release): Identifies a tagged release, similar to a revision but specifically marked as a release.
SNP (Snapshot): Identifies a point in time, recording all entry points (like branches and releases) found in a software origin and where they pointed at that time.

These intrinsic identifiers correspond to granularity levels from the bottom of the software identification pyramid (Level 10: Code Fragment, Level 9: File, Level 8: Directory, Level 7: Commit, Level 6: Release, Level 5: Snapshot), where the number of items increases as you go down the pyramid.
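To make the difference between a content identifier and a directory identifier concrete, here is a minimal sketch. It assumes schema version 1, where content and directory hashes follow the same SHA-1 scheme Git uses for blobs and trees, and it only handles a flat directory of regular, non-executable files (no subdirectories, symlinks, or executables). The function names are illustrative, not part of any official tool.

```python
import hashlib


def git_object_id(kind: str, payload: bytes) -> str:
    """SHA-1 over a Git-style header ("<kind> <length>\\0") plus the payload."""
    header = f"{kind} {len(payload)}".encode() + b"\0"
    return hashlib.sha1(header + payload).hexdigest()


def content_swhid(data: bytes) -> str:
    """Identifier of a single file's bytes; the file name plays no role."""
    return "swh:1:cnt:" + git_object_id("blob", data)


def flat_directory_swhid(files: dict[str, bytes]) -> str:
    """Identifier of a flat directory: covers both file names and file contents.

    Assumption: every entry is a regular, non-executable file (mode 100644).
    """
    payload = b""
    for name in sorted(files, key=lambda n: n.encode()):
        file_id = bytes.fromhex(git_object_id("blob", files[name]))
        payload += b"100644 " + name.encode() + b"\0" + file_id
    return "swh:1:dir:" + git_object_id("tree", payload)


if __name__ == "__main__":
    code = b'print("hello")\n'
    # Same bytes, so the content identifier is the same whatever the file is called.
    print(content_swhid(code))
    # Renaming the file changes the directory identifier.
    print(flat_directory_swhid({"hello.py": code}))
    print(flat_directory_swhid({"main.py": code}))
```

Renaming a file leaves its content identifier untouched but changes the identifier of the directory that holds it, which is why a directory SWHID pins down a whole source tree.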

How to generate a SWHID

A key feature of SWHID is that any end-user can generate one. You do not need an account on Software Heritage or need to be the software author. SWHIDs are free. For digital resources that are frequently created or modified, especially in large volumes, charging a per-identifier fee just doesn’t work.

You can find the SWHID for software artifacts already archived in Software Heritage in the permalinks box on the artifact’s page.
You can also compute a SWHID locally on your own machine using a command-line tool. For the same content, the SWHID computed locally will be the same as the one computed by Software Heritage, as long as the computational method (schema version) is the same.
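In practice you would use an existing tool for this, such as the `swh identify` command from the `swh.model` Python package. The sketch below only illustrates why a local computation gives the same result as the archive for a single file. It assumes schema version 1, where a content hash is the SHA-1 that Git computes for a blob, and the helper names are hypothetical.

```python
import hashlib
from pathlib import Path


def content_swhid_for_file(path) -> str:
    """Compute the content (cnt) SWHID of a file on disk, schema version 1."""
    data = Path(path).read_bytes()
    # Git-style blob header: the word "blob", the byte length, then a NUL byte.
    header = b"blob " + str(len(data)).encode() + b"\0"
    return "swh:1:cnt:" + hashlib.sha1(header + data).hexdigest()


def matches_swhid(path, expected: str) -> bool:
    """Check a local file against a SWHID copied from elsewhere.

    Context parameters (everything after the first ';') are ignored,
    because only the core identifier depends on the file's bytes.
    """
    core = expected.split(";", 1)[0]
    return content_swhid_for_file(path) == core


if __name__ == "__main__":
    import sys

    for name in sys.argv[1:]:
        print(content_swhid_for_file(name), name)
```

Run against a file you have downloaded, `matches_swhid` tells you whether its bytes are exactly those the identifier refers to, with no network access and no account.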

Deconstruct the SWHID structure

A SWHID is a structured identifier with several parts:

Prefix: Always starts with swh (lowercase).
Schema Version: Indicates the hash computation method used (currently 1, which relies on SHA-1). This can evolve if needed, with older hashes remaining valid.
Object Type: Indicates the type of software artifact being identified, written as a three-letter tag (cnt, dir, rev, rel, or snp).
Hash: The hash value computed for the specific content or object.
Context Parameters (Optional): Provide additional information about where or when the artifact was found or its position within a larger structure. These parameters can include:
- Origin: The URL from which the software originated (e.g., a GitHub or GitLab repository). This parameter differentiates SWHIDs for identical content found in different locations.
- Visit: For artifacts lower in the graph (Content, Directory, Revision, Release), this refers to the snapshot in which the artifact was seen.
- Anchor: For artifacts lower than a snapshot, this is another item from the graph, typically a Revision, that provides a specific point of reference (for example, the starting point for a path).
- Path: The path to the artifact within a directory or revision.
- Lines: For content fragments, specifies the lines of code being identified.

Context parameters explain variations in seemingly identical SWHIDs: the core content hash is the same, but the context (e.g., path, origin) differs.
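The structure described above can be sketched as a small parser. It assumes that the parts of the core identifier are separated by colons and that each context parameter is appended as `;key=value`. The class and function names are hypothetical, and the example origin URL is made up for illustration.

```python
import re
from dataclasses import dataclass, field
from urllib.parse import unquote

# Core identifier: prefix, schema version 1, object type tag, 40 hex characters.
CORE_RE = re.compile(r"^swh:1:(cnt|dir|rev|rel|snp):([0-9a-f]{40})$")
KNOWN_QUALIFIERS = {"origin", "visit", "anchor", "path", "lines"}


@dataclass
class ParsedSwhid:
    object_type: str
    object_hash: str
    qualifiers: dict = field(default_factory=dict)


def parse_swhid(text: str) -> ParsedSwhid:
    """Split a SWHID into its core parts and its optional context parameters."""
    core, *raw_qualifiers = text.split(";")
    match = CORE_RE.match(core)
    if match is None:
        raise ValueError(f"not a valid core SWHID: {core!r}")
    qualifiers = {}
    for item in raw_qualifiers:
        key, sep, value = item.partition("=")
        if not sep or key not in KNOWN_QUALIFIERS:
            raise ValueError(f"unknown or malformed context parameter: {item!r}")
        qualifiers[key] = unquote(value)  # values may be percent-encoded
    return ParsedSwhid(match.group(1), match.group(2), qualifiers)


if __name__ == "__main__":
    example = (
        "swh:1:cnt:e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"
        ";origin=https://example.org/demo.git"  # hypothetical origin
        ";path=/src/empty.txt"
        ";lines=1-3"
    )
    parsed = parse_swhid(example)
    print(parsed.object_type, parsed.object_hash)
    print(parsed.qualifiers)
```

Two SWHIDs that parse to the same object type and hash but different qualifiers point to the same content seen in different contexts.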

How to use SWHIDs

SWHIDs have several important use cases, primarily related to referencing, reproducibility, and citation of software source code:

Referencing specific code: SWHIDs allow you to point directly to specific versions or parts of software code (files, directories, revisions, etc.). This is different from DOIs, which often point to a metadata record about the software.
Ensuring reproducibility: Because SWHIDs are based on the intrinsic content, they enable reproducibility. If you have the SWHID, you can potentially regenerate or verify the exact content it refers to, even if the original infrastructure where it was found is no longer available.
Citing software: SWHIDs are designed to be used in software citations. The recommended way to facilitate this is to include metadata files like codemeta.json or CITATION.cff alongside your code. Software Heritage can use these files to generate a citation that includes the SWHID of the corresponding artifact (e.g., the directory SWHID is often recommended for academia).
IMPORTANT CITATION RULE: Never include the SWHID itself within the source code files. Adding the SWHID changes the file contents, resulting in a new SWHID for the changed file, which breaks the link to the original content. Instead, include metadata files that allow platforms to generate citations, including the SWHID.
Resolving SWHIDs: SWHIDs can be resolved to access the corresponding software artifact, for example, on the Software Heritage archive (archive.softwareheritage.org) or its operational mirror networks.
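The citation rule in this list can be checked with a few lines of code. The sketch below assumes schema version 1 content hashing (the SHA-1 Git computes for a blob). The sample source file and the function names are invented for illustration.

```python
import hashlib


def content_swhid(data: bytes) -> str:
    """Content (cnt) SWHID of a byte string, schema version 1."""
    header = b"blob " + str(len(data)).encode() + b"\0"
    return "swh:1:cnt:" + hashlib.sha1(header + data).hexdigest()


def stamp_with_own_swhid(data: bytes) -> bytes:
    """Prepend a comment line holding the file's SWHID (what NOT to do)."""
    return b"# " + content_swhid(data).encode() + b"\n" + data


if __name__ == "__main__":
    original = b'print("hello")\n'  # made-up source file
    stamped = stamp_with_own_swhid(original)

    print("original:", content_swhid(original))
    print("stamped: ", content_swhid(stamped))
    # The identifier written inside the stamped file is the original's,
    # so it no longer matches the file that carries it.
    print("still matches:", content_swhid(stamped) == content_swhid(original))
```

The identifier written into the file always describes the previous version of that file, never the file it sits in. Keeping the SWHID in generated citations instead, driven by the metadata files, avoids this.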

What the SWHID is not for

Data Sets: SWHIDs are designed specifically for software source code. While data might be stored alongside code in repositories and thus archived by Software Heritage, SWHIDs are not the recommended identifier for data sets. Other identifier types are more appropriate for data.
AI-Generated Code: Currently, SWHIDs cannot distinguish code generated by AI tools from human-generated code, nor do they provide functionality to specifically track the origin of AI-generated code.

By understanding these steps, you can leverage SWHIDs for robust and reproducible identification, referencing, and citation of software artifacts.

A toolbox

For further info:
https://www.softwareheritage.org/faq/#3_Referencing_and_identification
https://www.softwareheritage.org/how-to-archive-reference-code
https://www.softwareheritage.org/software-hash-identifier-swhid
https://www.swhid.org

Software Heritage