Thanks to a collaboration between Software Heritage (SWH), HAL-Inria and the CCSD, HAL is opening its doors to a new type of scientific deposit: software. Researchers now have the ability to deposit source code while contributing to Software Heritage, the Library of Alexandria of Software.
A testing phase that started in January 2018 allowed users to deposit software through the HAL-Inria portal. Starting September 25th, all HAL portals will be able to submit code directly to Software Heritage.
Software has become an inseparable medium of technical and scientific knowledge. Preserving it is as essential as preserving articles and research datasets to promote open science and open source software.
By building a universal and sustainable archive of software, Software Heritage aims to build a central infrastructure for the benefit of society, science, and industry. Software Heritage, initiated by Inria, aims to collect, organize, preserve, and make available to all, the source code of all available software.
The Software Heritage project is supported by UNESCO and many international partners such as Microsoft, DANS (an institute of the Royal Netherlands Academy of Arts and Sciences), the University of Bologna, Société Générale, Huawei, Nokia Bell Labs, Intel and, more recently, Google, UQAM, GitHub, Qwant, and FOSSID.
How to deposit software?
A researcher can submit source code with appropriate metadata on hal.archives-ouvertes.fr. A few metadata fields are mandatory: title, domain, license and authors. The source code should contain README, LICENSE and AUTHORS files; a moderator will check that the submitted metadata is consistent with these files.
See also the deposit guide for a detailed example of the process.
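The requirements listed above can be checked locally before submitting. The following is a minimal sketch, not part of HAL itself: the function name check_deposit, the metadata dictionary keys, and the rule that a file such as README.md counts as README are all assumptions made for illustration.

```python
from pathlib import Path

# Mandatory metadata fields and expected files, as listed in the text above.
MANDATORY_FIELDS = ("title", "domain", "license", "authors")
EXPECTED_FILES = ("README", "LICENSE", "AUTHORS")


def check_deposit(metadata: dict, source_dir: str) -> list[str]:
    """Return a list of problems found; an empty list means none were found.

    Hypothetical helper: HAL's moderators perform the real review.
    """
    problems = []
    for field in MANDATORY_FIELDS:
        if not metadata.get(field):
            problems.append(f"missing metadata field: {field}")
    # Assumption: the file extension is ignored, so README.md counts as README.
    present = {p.stem.upper() for p in Path(source_dir).iterdir() if p.is_file()}
    for name in EXPECTED_FILES:
        if name not in present:
            problems.append(f"missing file: {name}")
    return problems
```

Calling check_deposit with the metadata dictionary and the root directory of the source tree returns the list of missing items. It does not replace the moderator's review of coherence between the metadata and the files.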
By depositing the software on HAL and archiving it on SWH, the software becomes a legitimate and citable research product, which is essential in the quest for the reproducibility of scientific results.
Behind the scenes
Once the deposit is validated by a moderator who has reviewed the coherence between the content and the submitted metadata, it is pushed to SWH using the SWORD (Simple Web-service Offering Repository Deposit) protocol. SWH then performs a simple verification of the deposit and its metadata.
The transferred metadata uses the CodeMeta vocabulary [CodeMeta 2017], which was created by several actors from science and industry who wanted a minimal metadata schema for scientific software, based on Schema.org's SoftwareSourceCode and SoftwareApplication classes.
Then, the source code and metadata are ingested into Software Heritage, which generates an SWH-ID that is intrinsically bound to the deposit.
The citation format proposed on HAL contains some of the mandatory metadata submitted with the software (title, authors, production date) and the persistent identifiers that make it possible to locate it. The research product identifier is provided by HAL and the source code intrinsic identifier is provided by SWH.
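To give an idea of what CodeMeta metadata looks like, here is a minimal sketch that builds a CodeMeta 2.0 record as JSON-LD. The exact mapping HAL uses between its own fields and CodeMeta terms, and the serialization it sends over SWORD, are not described here; the function to_codemeta, its parameters, and the sample values are assumptions chosen for illustration.

```python
import json


def to_codemeta(title, authors, license_url, date_created):
    """Build a minimal CodeMeta 2.0 record (hypothetical field mapping)."""
    return {
        # Context published with CodeMeta 2.0 [CodeMeta 2017].
        "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
        # Schema.org class on which CodeMeta is based.
        "@type": "SoftwareSourceCode",
        "name": title,
        "author": [{"@type": "Person", "name": a} for a in authors],
        "license": license_url,
        "dateCreated": date_created,
    }


# Sample values are invented for the example.
record = to_codemeta(
    title="My research software",
    authors=["Ada Example", "Bob Example"],
    license_url="https://spdx.org/licenses/MIT",
    date_created="2018-09-25",
)
print(json.dumps(record, indent=2))
```

Because the record is plain JSON-LD, the same description can be stored in the source tree and read by other tools that understand the CodeMeta vocabulary.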
How to use the source code intrinsic persistent identifier
Source code is identified by an intrinsic identifier (the SWH-ID) computed with cryptographic hashes. The same content therefore has the same identifier regardless of where it originated, and anyone applying the same hash function to the same content obtains the same identifier. This makes the SWH-ID reproducible and usable for content integrity checking, and it means the identifier does not depend on a middleman [Di Cosmo, Gruenpeter and Zacchiroli 2018]. If you are not yet convinced of the value of cryptographic identifiers, note that they are widely used in industry (e.g., Git, Nix, blockchains, IPFS) and that they are free.
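The reproducibility described above can be illustrated for the simplest case, a single file. The sketch below computes a content identifier (type cnt) the same way Git hashes a blob: SHA-1 over a short header followed by the bytes of the file. The function name swhid_for_content is an assumption for this example, and directory identifiers (type dir, as used for deposits) are computed over a more involved serialization that is not shown here.

```python
import hashlib


def swhid_for_content(data: bytes) -> str:
    """Compute the identifier of a single file's content.

    The hash is the Git blob hash: SHA-1 of b"blob <length>\\0" + data.
    """
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    return "swh:1:cnt:" + hashlib.sha1(header + data).hexdigest()


# The same bytes give the same identifier, wherever they come from.
print(swhid_for_content(b""))
# swh:1:cnt:e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
print(swhid_for_content(b"hello world\n"))
# swh:1:cnt:3b18e512dba79e4c8300dd08aeb37f8e728b8dad
```

Recomputing the identifier from a downloaded file and comparing it with the cited one is a simple integrity check that needs no third party.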
Software Heritage guarantees a very long-term intrinsic identifier which can be resolved on several resolvers, including the Software Heritage resolver at https://archive.softwareheritage.org/.
The SWH-ID generated for a deposit is returned to HAL, where it serves as the reference to the content within the software citation. It is also accessible in the Permalinks box, which is available on a side tab on the Web site.
The neutral SWH-ID and the contextual SWH-ID
An identifier can be resolved to the content as is or with optional contextual attributes that provide more information about the referenced object, for example, the origin of the content.
The neutral identifier points towards the deposited content: swh:1:dir:42a13fc721c8716ff695d0d62fc851d641f3a12b
The contextual identifier includes information about the origin: swh:1:dir:42a13fc721c8716ff695d0d62fc851d641f3a12b;origin=https://hal.archives-ouvertes.fr/hal-01727745
For more information about the SWH-ID, you can read the documentation or the article “Identifiers for Digital Objects: the Case of Software Source Code Preservation” presented at iPRES 2018.
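The two forms of identifier shown above can be taken apart with a few lines of code. The following sketch splits an SWH-ID into its core part and its optional contextual attributes. The names parse_swhid and resolver_url are assumptions for this example, and building a resolver link by appending the identifier to the resolver address is also an assumption about how the resolver is addressed.

```python
RESOLVER = "https://archive.softwareheritage.org/"


def parse_swhid(swhid: str) -> dict:
    """Split an SWH-ID into its core fields and contextual attributes."""
    core, *qualifiers = swhid.split(";")
    # A malformed core part makes this unpacking raise ValueError.
    scheme, version, object_type, object_id = core.split(":")
    is_hex = all(c in "0123456789abcdef" for c in object_id)
    if scheme != "swh" or len(object_id) != 40 or not is_hex:
        raise ValueError(f"not a valid SWH-ID: {swhid!r}")
    return {
        "version": version,
        "object_type": object_type,
        "object_id": object_id,
        # Split on the first "=" only, since values such as URLs may contain more.
        "qualifiers": dict(q.split("=", 1) for q in qualifiers),
    }


def resolver_url(swhid: str) -> str:
    """Assumption: the resolver accepts the identifier appended to its address."""
    return RESOLVER + swhid


contextual = (
    "swh:1:dir:42a13fc721c8716ff695d0d62fc851d641f3a12b"
    ";origin=https://hal.archives-ouvertes.fr/hal-01727745"
)
parsed = parse_swhid(contextual)
print(parsed["object_type"], parsed["qualifiers"])
```

The neutral identifier parses to the same core fields with an empty set of attributes, which reflects that the contextual attributes add information without changing what is identified.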
Now it’s your turn
The collaboration between SWH, HAL-Inria, and the CCSD has brought, among other things, two major benefits for researchers and for the reproducibility of research and knowledge: it is now easy to provide a permanent home for the software that drives your work, and it is possible to consistently cite and point to whole bodies of source code or specific lines within them.
We look forward to seeing your deposits on SWH, and these new forms of software citation in your papers.
[CodeMeta 2017] Matthew B. Jones, Carl Boettiger, Abby Cabunoc Mayes, Arfon Smith, Peter Slaughter, Kyle Niemeyer, Yolanda Gil, Martin Fenner, Krzysztof Nowak, Mark Hahnel, Luke Coy, Alice Allen, Mercè Crosas, Ashley Sands, Neil Chue Hong, Patricia Cruse, Daniel S. Katz, Carole Goble. 2017. CodeMeta: an exchange schema for software metadata. Version 2.0. KNB Data Repository. doi:10.5063/schema/codemeta-2.0
[Di Cosmo, Gruenpeter and Zacchiroli 2018] Roberto Di Cosmo, Morane Gruenpeter, Stefano Zacchiroli. Identifiers for Digital Objects: the Case of Software Source Code Preservation. iPRES 2018 – 15th International Conference on Digital Preservation, Sep 2018, Boston, United States. pp.1-9. 〈hal-01865790v3〉