October 10, 2018

PyPI Available on Software Heritage

The PyPI logo, including the text "Python Package Index."

We’re pleased to share that the Python Package Index (PyPI) is now integrated with Software Heritage: all PyPI packages have been archived, and PyPI is now tracked so that new Python package releases are archived promptly as they become available.

PyPI is a volunteer-run repository of nearly 1.5 million Python packages. All of these packages are now also available from Software Heritage (SWH).

SWH is building a Library of Alexandria of source code, recognizing the cultural heritage of software, the usefulness of source code to industry, and the necessity of openness and preservation for research. Aiming to represent all publicly available software within the Software Heritage Archive, SWH is progressively extending its scope: after a few major code hosting platforms, both active (e.g., GitHub) and phased out (Google Code, Gitorious), it is now time to expand the coverage of package distributions.

Before the ingestion of PyPI, SWH had integrated source packages from Debian. Building the infrastructure necessary to enable the addition of PyPI to the Archive will make it easier to add support for other package managers in the future.

Indeed, building the Archive into what we want it to be means supporting all the ways people share their code with one another.

The success of the open source community is predicated on people giving back to the community, which is one of the reasons we’re excited about PyPI. SWH is largely written in Python, and archiving is one way we can give back to the Python community.

As always, we’re carrying out the SWH mission in the open, so everyone can see what we are doing and participate in making it happen.

The technical side

A chart showing three stages of source code files and how they're added to the Software Heritage Archive: listing, scheduling, and loading.

There are three steps to go from having a package repository to having source code hosted and accessible on SWH: listing, scheduling, and loading.

Software Heritage has listers, which crawl and parse upstream listing APIs and generate origins (references to software projects in the SWH Archive). For this project we used PyPI's simple index as the package listing API.
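To illustrate the listing step, here is a minimal sketch of turning a simple-index-style HTML page into a list of origins. It is not the SWH lister itself: the `SimpleIndexParser` and `list_origins` names, the shape of the origin dictionaries, and the project URL pattern are assumptions made for this example, and the page is an inline sample rather than a live request.

```python
from html.parser import HTMLParser


class SimpleIndexParser(HTMLParser):
    """Collect the text of every anchor tag, one per listed project."""

    def __init__(self):
        super().__init__()
        self._in_anchor = False
        self.projects = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_anchor = True

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_anchor = False

    def handle_data(self, data):
        # Only keep text that appears inside an <a> element.
        if self._in_anchor and data.strip():
            self.projects.append(data.strip())


def list_origins(index_html, base_url="https://pypi.org/project/"):
    """Return one hypothetical origin record per project in the page."""
    parser = SimpleIndexParser()
    parser.feed(index_html)
    return [{"type": "pypi", "url": f"{base_url}{name}/"} for name in parser.projects]


# Inline sample standing in for the real index page.
SAMPLE_INDEX = """<html><body>
<a href="/simple/requests/">requests</a>
<a href="/simple/flask/">Flask</a>
</body></html>"""

origins = list_origins(SAMPLE_INDEX)
```

Each resulting record could then be handed to the scheduler as the argument of a loading task.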

The scheduler (SWH-scheduler) runs and records our recurrent and one-shot tasks in a database, which we think of as the single source of truth. SWH-scheduler pulls tasks from this database into Celery, a Python tool which we use as task queuing middleware and a worker management framework. One of the ways our scheduler extends Celery is to allow prioritization between tasks of the same type; Celery itself works in a strictly FIFO manner.
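The difference between strict FIFO and prioritized ordering can be sketched with a small in-memory queue. This is only an illustration of the idea, not SWH-scheduler or Celery code: the `PrioritizedTaskQueue` class, the `load-pypi` task type name, and the convention that a lower number means higher priority are all assumptions made here.

```python
import heapq
import itertools


class PrioritizedTaskQueue:
    """Pop tasks by priority first, then by insertion order (FIFO)."""

    def __init__(self):
        self._heap = []
        # A monotonically increasing counter breaks ties between equal
        # priorities, so same-priority tasks still come out in FIFO order.
        self._counter = itertools.count()

    def push(self, task_type, arguments, priority=1):
        # Assumption for this sketch: lower number = higher priority.
        entry = (priority, next(self._counter), task_type, arguments)
        heapq.heappush(self._heap, entry)

    def pop(self):
        _priority, _seq, task_type, arguments = heapq.heappop(self._heap)
        return task_type, arguments


queue = PrioritizedTaskQueue()
queue.push("load-pypi", {"project": "a"}, priority=2)
queue.push("load-pypi", {"project": "b"}, priority=0)
queue.push("load-pypi", {"project": "c"}, priority=2)

# "b" jumps ahead; "a" and "c" keep their relative order.
order = [queue.pop()[1]["project"] for _ in range(3)]
```

A plain FIFO queue would have returned the projects as `a`, `b`, `c`; here the high-priority task for `b` is served first.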

Source distributions (Python source packages, also known as sdists) are available as tarballs or zip files, depending on the platform the upstream developer uploads from. The upcoming PEP 517 will standardize the way Python source distributions are built, which should ease our archiving work.

SWH uses a common pattern to implement the origin loading process:
  • fetching metadata about available versions;
  • comparing them with the latest loaded versions in the archive;
  • downloading and processing the new versions; and
  • loading the new data.
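The four steps above can be sketched as a single function. This is a simplified illustration, not the SWH loader API: `load_origin` and its callback parameters (`fetch_versions`, `download`, `store`) are hypothetical names, and plain dictionaries stand in for the upstream repository and the archive.

```python
def load_origin(fetch_versions, known_versions, download, store):
    """Load every upstream version that is not yet in the archive.

    All four parameters are placeholders for this sketch:
    fetch_versions() lists upstream versions, known_versions is the set
    already archived, download(v) returns content, store(v, c) saves it.
    """
    # 1. Fetch metadata about available versions.
    available = fetch_versions()
    # 2. Compare with the versions already loaded in the archive.
    new_versions = [v for v in available if v not in known_versions]
    loaded = []
    for version in new_versions:
        # 3. Download and process the new version.
        content = download(version)
        # 4. Load the new data into the archive.
        store(version, content)
        loaded.append(version)
    return loaded


# Dictionaries standing in for the archive and the upstream repository.
archive = {"1.0": b"old"}
upstream = {"1.0": b"old", "1.1": b"new", "2.0": b"newer"}

loaded = load_origin(
    fetch_versions=lambda: sorted(upstream),
    known_versions=set(archive),
    download=upstream.__getitem__,
    store=archive.__setitem__,
)
```

Running the function a second time with the updated archive would load nothing, which is the property that makes recurrent visits to the same origin cheap.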

The specificities of PyPI have been taken into account:

  • known versions are compared using the digests provided by the API (this allows us to detect if a version was overwritten, and archive the new one);
  • PKG-INFO metadata is parsed and saved; and
  • versions with multiple sdists are imported separately.
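As a rough sketch of these three points, the example below filters release files by digest and parses a PKG-INFO file. The release records, their field names (`version`, `filename`, `sha256`), and the helper names are assumptions made for illustration; PKG-INFO is read with the standard library's `email.parser`, since the file uses an email-header-style layout.

```python
import hashlib
from email.parser import Parser


def sha256_hex(data: bytes) -> str:
    """Hex SHA-256 digest of some bytes."""
    return hashlib.sha256(data).hexdigest()


def select_new_files(release_files, known_digests):
    """Keep files whose digest is unknown, one entry per sdist file."""
    return [f for f in release_files if f["sha256"] not in known_digests]


def parse_pkg_info(text: str) -> dict:
    """Extract a few fields from PKG-INFO content."""
    message = Parser().parsestr(text)
    return {
        "name": message["Name"],
        "version": message["Version"],
        "summary": message["Summary"],
    }


# Digests already archived: version 0.9 and the original 1.0 upload.
known_digests = {sha256_hex(b"0.9 tarball"), sha256_hex(b"original 1.0 tarball")}

# Hypothetical listing: 1.0 was overwritten, 1.1 ships two sdists.
release_files = [
    {"version": "0.9", "filename": "example-0.9.tar.gz",
     "sha256": sha256_hex(b"0.9 tarball")},
    {"version": "1.0", "filename": "example-1.0.tar.gz",
     "sha256": sha256_hex(b"re-uploaded 1.0 tarball")},
    {"version": "1.1", "filename": "example-1.1.tar.gz",
     "sha256": sha256_hex(b"1.1 tarball")},
    {"version": "1.1", "filename": "example-1.1.zip",
     "sha256": sha256_hex(b"1.1 zip")},
]

to_load = [f["filename"] for f in select_new_files(release_files, known_digests)]

PKG_INFO = "Metadata-Version: 1.1\nName: example\nVersion: 1.0\nSummary: An example package\n"
metadata = parse_pkg_info(PKG_INFO)
```

Comparing digests rather than version strings is what lets the overwritten 1.0 upload be picked up again, while the untouched 0.9 file is skipped.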

Now that PyPI has been successfully added, there are more package managers to integrate into the Archive. Software is spread across disparate distribution platforms. There needs to be a place to find, track, and search all source code, which is why we’re interested in adding support for more of them to Software Heritage. If you’re interested in helping us with these efforts, you can get started with the documentation. The PyPI implementation of loaders is particularly concise.

Python Package Index logo © 2018 Python Software Foundation
