September 1, 2016

Google Code content now safely collected

A significant amount of publicly available source code is hosted on a variety of free hosting services, some of which are being phased out for various reasons. Salvaging this code before it’s too late is one of the reasons Software Heritage was created. We are therefore actively working to collect and archive endangered source code, because there is now a clear and present danger of losing large and important parts of our collective technological and scientific knowledge.

Today, we are happy to announce that all of the source code originally hosted on Google Code has been safely copied to our servers.

We are deeply grateful to Vint Cerf for his help in making this possible, and to Chris Smith for his technical support.

While we have now retrieved the full Google Code source code, in the form of raw, on-disk Git, Subversion, or Mercurial repositories, we haven’t yet ingested it into the Software Heritage archive. This means that you will not find Google Code content using our web search yet, unless the same content was also available from other software origins that we currently track (e.g., GitHub). Ingesting all of it will take some time, as there are over 1.4 million projects and 1.5 million releases to handle!

More importantly, ingesting this content will require dedicated development effort to build robust Subversion and Mercurial loaders for Software Heritage, to complement our well-tested Git loader, which has been in production since well before our launch. If you know Subversion or Mercurial internals well, please don’t hesitate to join our development community (yes, of course, our code is all Free Software!).