June 28, 2022

All of humankind’s source code in a nutshell

The mission of Software Heritage is to collect, preserve, and share all software that is publicly available in source code form, addressing the needs of cultural heritage, industry, research and society as a whole.

As part of our long-term mission, we focus on improving the Software Heritage infrastructure and we look out for emerging technologies that may revolutionize the way archival is performed.

In May 2022, we had the honour to participate in the launch event of MoleculArXiv, an ambitious multidisciplinary research project that explores new ways of storing information in DNA chains: Software Heritage will provide one of the use cases, alongside INA (the French national audiovisual archives), BnF (the Bibliothèque Nationale de France) and the archives of the European Parliament.

Magnetic tape, despite its venerable age, is currently the method of choice for long-term archival. The technology is mature, has proven reliable and dependable, and is used by most digital archivists around the world. Yet tape storage is not without flaws. Tape formats are constantly evolving, forcing users to keep upgrading their tape libraries and drives to keep up with physical and logical standards. This migration is a race against time, and there are already concerns that the oldest contents recorded on tapes (e.g. audio contents such as speech, radio, or music that were taped almost a century ago) might be lost due to the lack of tape players, raising the spectre of a digital dark age. Tapes are also bulky, and robots are needed to handle petabytes of data. Lastly, tape is essentially a linear medium with sequential access. A tape must be mechanically spooled to the desired location, which makes reading and writing operations cumbersome.

Archivists are still searching for the ideal medium, one that would be durable and compact, easy to access, and yet could still be read with certainty in dozens or hundreds of years. DNA might be the ultimate medium for archival. It is Nature's tried-and-tested solution for storing the genetic information of all organisms on Earth. The chemical structure of DNA has not changed for billions of years, and we will be able to read it as long as there are humans around. Molecules of DNA are also durable: kept away from light, air and water, DNA is stable for millennia. Lastly, DNA is incredibly dense: it has the potential to store all the data generated in the world in less than 100 g of DNA (although the actual density will depend on technological choices).

The formidable potential of DNA for data storage has not been lost on scientists, technologists and decision-makers. The oldest proofs-of-concept were reported decades ago, and a small but diverse community is active on the subject. But in 2022 we still do not have DNA drives on our desks. Why is that? One reason is that the core technology for writing DNA (phosphoramidite chemistry) is slow and expensive: it takes about 100€ and weeks to write several kilobytes of data in DNA. (And this chemistry is not even environmentally friendly: it bathes DNA in a large volume of harmful solvent.)

On closer inspection, it is not surprising that phosphoramidite chemistry fails to live up to the challenge of massive data storage. It was invented 40 years ago for biologists who had specific needs: they only wanted a few strands of DNA, but their sequences had to be perfect, as the biological code (which prescribes how DNA is translated into proteins) cannot correct any errors. If we had to make a comparison, writing massive amounts of data now with phosphoramidite chemistry is a bit like asking copyist monks to replicate Sacred Scriptures before the printing press was invented: the results will be perfect and beautiful, but the throughput will be horrendously low.

Just as Gutenberg did not bring publishing to the masses by perfecting hand copying, we need to invent radically new technologies to write, handle, and read data in DNA on a massive scale. This vision of a radical shake-up is shared by a number of academic and industrial teams around the world, which aim to catalyze the revolution of DNA data storage. Some teams focus on massively parallelizing the writing of DNA with microelectronics, piggybacking on the gains afforded by miniaturization and automation (which in some ways is similar to the mechanization of printing). Other teams seek to assemble existing DNA blocks to write information (an idea that was proposed almost three decades ago, and which could be seen as the equivalent of assembling a set of predefined blocks for printing). All those efforts are impressive, but ultimately they all rely on the same “ink” to write DNA, i.e. phosphoramidite chemistry.

The CNRS project MoleculArXiv is seeking a new way of writing DNA, a new “ink” that would accelerate the writing of DNA by several orders of magnitude, while being eco-friendly and adapted to massive data storage. By assembling a multidisciplinary team of chemists, physicists, engineers, biologists, and computer scientists, the project aims to establish an ecosystem around this technology, providing an end-to-end solution to archive data in DNA.

The involvement of institutional end-users like Software Heritage, BnF, INA or the European Parliament is crucial. They will help to define specifications and expectations from the start, and allow the MoleculArXiv team to understand and account for their needs. This partnership, bringing together technologists and end-users, will help to avoid the pitfalls often seen in technology development, where technologists develop a new method in a vacuum, only to realize (after toiling away for years) that it is not exactly what the world was looking for (the syndrome of a “better mousetrap”).

In that context, Software Heritage has a role to play from the inception of the technology. On a practical level, source code is within the early reach of the MoleculArXiv technologies, as most source codes are small enough (~10 KB) to fit on a single strand of DNA. This is in contrast with images (~MB) or video (~GB), which will have to be partitioned into thousands or millions of strands and will therefore require substantial technological development. On a conceptual level, source codes are a world unto themselves, forming a complex ecosystem that resembles the set of genetic programs of living organisms. A single source code executes a well-defined function, relying on functions provided by other source codes. This naturally generates a web of intricate dependencies between source codes, which are often a nightmare for developers (motivating the development of containers and virtual machines), but which will be fascinating to conceptualize at a molecular level. MoleculArXiv and Software Heritage will work closely to formulate elegant and efficient ways to store and retrieve source codes, accounting for these intricate dependencies.
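To give a feel for what "fitting source code on a strand" could mean, here is a minimal sketch of the simplest textbook mapping, in which every two bits become one nucleotide. This is only an illustrative assumption: the post does not describe the encoding MoleculArXiv will use, and practical schemes typically add error correction and sequence constraints on top. The function names (`encode`, `decode`, `nucleotides_needed`) are hypothetical.

```python
# Naive 2-bits-per-nucleotide mapping (illustrative assumption only;
# NOT the MoleculArXiv encoding, which the post does not describe).
BASES = "ACGT"  # A=00, C=01, G=10, T=11


def encode(data: bytes) -> str:
    """Map each byte to four nucleotides, most significant bit pair first."""
    return "".join(
        BASES[(byte >> shift) & 0b11]
        for byte in data
        for shift in (6, 4, 2, 0)
    )


def decode(strand: str) -> bytes:
    """Invert encode(): every group of four nucleotides becomes one byte."""
    if len(strand) % 4:
        raise ValueError("strand length must be a multiple of 4")
    out = bytearray()
    for i in range(0, len(strand), 4):
        byte = 0
        for base in strand[i:i + 4]:
            byte = (byte << 2) | BASES.index(base)
        out.append(byte)
    return bytes(out)


def nucleotides_needed(n_bytes: int) -> int:
    """Strand length for n_bytes under this naive mapping (no redundancy)."""
    return 4 * n_bytes


source = b'print("hello")\n'
strand = encode(source)
assert decode(strand) == source
print(strand)
print(nucleotides_needed(10_000))  # nucleotides for a ~10 KB source file
```

Under this naive mapping, a ~10 KB file corresponds to about 40,000 nucleotides before any redundancy is added; whether that fits on a single strand depends on the writing technology.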

And if all goes as planned, in a few years we will be able to keep a cold copy of the full Software Heritage archive (on the order of petabytes) in a capsule the size of the proverbial nutshell!

You can find the official announcement, which mentions the use cases, on the CNRS website.

For more information and to follow the progress of MoleculArXiv, follow their Twitter account.

Thanks to CNRS for launching such an ambitious project!
