July 31, 2025

Why we need better software identification

With cybersecurity breaches and new regulations making headlines, software supply chain security is now top of mind for many people. New laws like the European Union’s Cyber Resilience Act (CRA) and recent United States Executive Orders are pushing for more transparency in digital goods.

All this attention means we need a solid, trustworthy way to identify software. Here’s the problem: how we currently name software and point to it in repositories often falls short. These ways can be temporary, vague, or just not secure enough. That leads to messy situations like confusion, name clashes, and outdated links. These aren’t just minor annoyances; they’re open doors for attacks, like “dependency confusion,” where bad actors trick systems into using malicious code. Plus, software bits can just disappear or move, making it impossible to check them later.

Clearly, we need a permanent fix that guarantees we can always find and verify software. This post outlines key information from the preprint paper “Software Identification for Cybersecurity: Survey and Recommendations for Regulators,” authored by Olivier Barais, Roberto Di Cosmo, Ludovic Mé, Stefano Zacchiroli, and Olivier Zendra with support from the SWHSec project.

Existing ID approaches: The good and the bad

There are two main types of software identification:

External IDs: These rely on outside info, like product names, version numbers, or links to package managers.
- Pros: They’re usually easy for humans to read and work with existing lists like the National Vulnerability Database. Some examples: the SWID, Package URL (purl), and SPDXID.
- Cons: Their reliability depends on external lists or naming rules, which can change or even be reused. That causes conflicts and makes them unreliable for security checks.
Internal IDs: These come directly from the software’s actual content, usually using a cryptographic hash (like a digital fingerprint).
- Pros: They offer uniqueness and integrity without relying on a central authority. They’re great for spotting if something’s been tampered with, don’t rely much on outside dependencies, and are difficult to fake with good hashing. Simple SHA256 checksums and Software Hash IDentifiers (SWHIDs) are examples.
- Cons: They’re often not very human-readable, which can make searching or brand recognition tricky.

In the real world, effective software bills of materials (SBOMs) and supply-chain tools generally combine both external references (which help connect with existing databases, vulnerability feeds, or licensing tools) and internal references (for strong integrity checks and guaranteed uniqueness). This means the smart approach is often to publish both—say, a purl or SWID alongside a cryptographic hash or SWHID. That way, you ensure both discoverability and verifiability.

round black and white light — Photo by George Prentzas on Unsplash

Inside the SWHID

SWHIDs are based on content, they’re permanent, and they can’t be tampered with easily. In 2025, they became an international standard (ISO/IEC 18670), making them globally recognized.

SWHIDs essentially package up both the data and its context using a clever Merkle DAG structure. This means each ID is directly tied to the exact piece of software it refers to.
They follow a simple pattern:
swh: <schema_version> : <object_type> : <object_id>

Key types include:

Content (cnt): Identifies a single file based on its raw contents:
swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2
Directory (dir): Points to a directory’s layout and what’s inside it, including IDs of its contents:
swh:1:dir:d198bc9d7a6bcf6db04f476d29314f157507d505
Revision (rev): Like a “commit” in version control, holding details like who did it, when, and the message: swh:1:rev:309cf2674ee7a0749978cf8265ab91a60aea0f7d
Release (rel): Similar to a “tag,” pointing to a specific revision and maybe including a version name or signature: swh:1:rel:22ece559cc7cc2364edc5e5593d63ae8bd229f9f
Snapshot (snp): Captures everything in a whole version control system (all branches) at one specific moment:
swh:1:snp:c7c108084bc0bf3d81436bf980b46e98bd338453

SWHIDs also allow for optional qualifiers to add more context. You can specify:

Lines qualifier (lines=…): To point to specific lines in a file: swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2;lines=112-116
Origin qualifier (origin=…)To say where the software was first seen: swh:1:rev:309cf2674ee7a0749978cf8265ab91a60aea0f7d;
origin=https://github.com/example/repo
Path, anchor, and context qualifiers. These help pinpoint subdirectories, specific parts, or other key info for super-precise references:
swh:1:dir:d198bc9d…;path=/docs;anchor=readme-section

This way, SWHIDs combine the best of both internal and external identification methods into one stable system.

SWHIDs + The Software Heritage Archive

SWHIDs get even more robust when you link them with the Software Heritage Archive. Software Heritage is a non-profit project that saves publicly available source code and its entire history, and once code is in, it’s never deleted. It’s the biggest public archive of source code, 400 million projects, over 25 billion unique source code files, and more than five billion unique commits. The archive stores everything in a cryptographically secure way, which helps with saving space by not duplicating things and makes sure everything is truly what it claims to be.

The combination of SWHIDs and the Software Heritage archive offers real advantages for meeting today’s legal requirements:

Guaranteed integrity: If the code changes even a little, the SWHID changes. This makes tampering immediately detectable.
Always there: SWHIDs don’t rely on outside services or websites, so they stay valid no matter where the code is hosted or if the original platform goes down. This solves the problem of code just vanishing.
Trackable history: SWHIDs identify parts of the Software Heritage structure, letting you trace a project’s development history, see where code came from, and check how different parts are related. Those extra qualifiers let you even track tiny code snippets.
Plays nice with rules: This combined approach directly helps meet the strict requirements for Software Bills of Materials (SBOMs), open-source security, and vulnerability management that the CRA and US Executive Orders demand.
Works everywhere: SWHIDs work consistently across all sorts of version control systems and software ecosystems.

The authors recommend SWHIDs, paired with the Software Heritage Archive, as the standard approach for referencing software, especially concerning the CRA and relevant US Executive Orders.
Here are some specifics for stakeholders:

Policy makers: Should mention SWHIDs (ISO/IEC 18670) in their rules and encourage their use in government purchases and funding programs.
Software companies: Should start making SWHIDs a part of their development process (CI/CD pipelines) to get stable IDs for their releases and patches.
Open source communities: Should publish official releases with their SWHIDs, ensure their code and history are archived by Software Heritage, and adopt best practices for referencing any outside software they use via SWHIDs in docs and SBOMs.

To wrap it up, using content-based, permanent software identifiers—specifically SWHIDs linked with the Software Heritage archive—is a strong and reliable answer to today’s cybersecurity and regulatory challenges. This approach builds trust and transparency, keeps us aligned with regulations, and even helps with innovation and saving money by simplifying compliance checks and cutting down on supply chain risks.

For more details and recommendations for implementation, check out the paper (preprint).

See our Publications section for more research from the Software Heritage Archive.

Software Heritage

Existing ID approaches: The good and the bad

Inside the SWHID

SWHIDs + The Software Heritage Archive

Follow us