Identifying billions of digital objects: a whitepaper
In the very broad scope addressed by digital preservation initiatives, a special
place belongs to the scientific and technical artifacts that we need to properly
archive to enable scientific reproducibility.
For these artifacts we need identifiers that are not only unique and persistent,
but also support integrity in an intrinsic way.
They must provide strong guarantees that the object denoted by a given
identifier will always be the same, without relying on third parties and
external administrative processes.
And of course, they must be free, as in free beer: we have billions of
software artifacts waiting for a name, so any unit cost associated with an
identifier, however small, would be a showstopper.
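To make the idea of intrinsic integrity concrete, here is a minimal sketch in Python. It assumes, purely for illustration, that an identifier is derived from a SHA-256 digest of the object's bytes; the function names `intrinsic_id` and `verify` and the `sha256:` prefix are hypothetical, and this is not necessarily the scheme adopted by Software Heritage. The point is only that anyone holding the object can recompute and check its identifier locally, with no registry, no resolver and no per-identifier fee.

```python
import hashlib


def intrinsic_id(content: bytes) -> str:
    """Derive an identifier from the object's own bytes (hypothetical scheme)."""
    # Assumption: SHA-256 stands in for whatever cryptographic hash
    # a real identifier system would choose.
    return "sha256:" + hashlib.sha256(content).hexdigest()


def verify(identifier: str, content: bytes) -> bool:
    """Check that the given bytes are the object the identifier denotes."""
    # No third party is consulted: the check only recomputes the hash.
    return intrinsic_id(content) == identifier


if __name__ == "__main__":
    source = b"int main(void) { return 0; }\n"
    name = intrinsic_id(source)
    print(name)
    print(verify(name, source))         # same bytes: identifier matches
    print(verify(name, source + b" "))  # any change: identifier no longer matches
```

Because the identifier is computed rather than assigned, minting a new one costs only the hash computation, which is what makes the approach viable for billions of objects.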
In our quest for the right identifiers, we investigated many identifier systems
in widespread use today, including ARK, DOI and PURL,
and we found that none of them has all the properties we need.
The reason is simple: all these systems are actually
construed as “Digital Identifiers of Objects”, or DIOs, that is, digital systems
of identifiers that can be used to name any kind of object, digital or
not, as Norman Paskin already remarked in his 2010 article [1]. As such,
they do not take advantage of the specific properties of digital objects.
We, on the other hand, have billions of digital objects to identify, so we need
another kind of identifier: “Identifiers of Digital Objects”, which we call
IDOs, specifically designed to identify only digital objects.
This led us to design a complete system of identifiers for Software Heritage, one that is at once simple, powerful, scalable and free.
You can find all the details in the article that we present at iPRES 2018;
we encourage all digital preservationists to read it and spread the word.
1. Norman Paskin, 2010. Digital object identifier (DOI) system. Encyclopedia of library and information sciences 3 (2010), 1586–1592.
The term “Digital Object Identifier” is construed as “digital identifier of an object,” rather than “identifier of a digital object”: the objects identified by DOI names may be of any form —digital, physical, or abstract— as all these forms may be necessary parts of a content management system. The DOI system is an abstract framework which does not specify a particular context of its application, but is designed with the aim of working over the Internet.
Photo credits: https://unsplash.com/@aaronburden