We’re pleased to share that the Python Package Index (PyPI) is now integrated with Software Heritage: all PyPI packages have been archived, and PyPI is now tracked so that new Python package releases are archived promptly as they become available.
PyPI is a volunteer-run repository of nearly 1.5 million Python packages. All of these packages are now also available from Software Heritage (SWH).
SWH is building a Library of Alexandria of source code, recognizing the cultural heritage of software, the usefulness source code provides to industry, and the necessity of openness and preservation to research. Aiming to represent all publicly available software within the Software Heritage Archive, SWH is progressively extending its scope: after a few major code hosting platforms, both active (e.g., GitHub) and phased out (Google Code, Gitorious), it is now time to expand the coverage of package distributions.
Before the ingestion of PyPI, SWH. Building the infrastructure necessary to enable the addition of PyPI to the Archive will make it easier to add support for other package managers in the future.
Indeed, building the Archive into what we want it to be means supporting all the ways people share their code with one another.
The success of the open source community is predicated on people giving back to the community, which is one of the reasons we’re excited about PyPI. SWH is largely written in Python, and archiving pypi.org is one way we can give back to the Python community.
As always, we’re carrying out the SWH mission in the open, so everyone can see what we are doing and participate in making it happen.
The technical side
There are three steps to go from having a package repository to having source code hosted and accessible on SWH: listing, scheduling, and loading.
Software Heritage has listers, which crawl and parse upstream APIs and generate origins (collections of references to software projects in the SWH Archive). For this project, we used PyPI’s simple index as the package listing API.
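To make the listing step more concrete, here is a minimal sketch of turning a simple index page into a list of origins. It is not the actual SWH lister: the names SimpleIndexParser and list_origins are hypothetical, the origin URL pattern is an assumption, and the page is a small hard-coded sample rather than a live request to PyPI.

```python
from html.parser import HTMLParser


class SimpleIndexParser(HTMLParser):
    """Collect the text of every <a> element in a simple index page."""

    def __init__(self):
        super().__init__()
        self._in_anchor = False
        self.project_names = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_anchor = True

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_anchor = False

    def handle_data(self, data):
        # Only text found inside an anchor is a project name.
        if self._in_anchor and data.strip():
            self.project_names.append(data.strip())


def list_origins(index_html, base_url="https://pypi.org/project/"):
    """Return one origin URL per project (URL pattern is an assumption)."""
    parser = SimpleIndexParser()
    parser.feed(index_html)
    return [f"{base_url}{name}/" for name in parser.project_names]


# Hard-coded sample standing in for the real index page.
SAMPLE_INDEX = """<!DOCTYPE html>
<html><body>
<a href="/simple/requests/">requests</a>
<a href="/simple/attrs/">attrs</a>
</body></html>"""

origins = list_origins(SAMPLE_INDEX)
```

A real lister would also need to fetch the page, handle failures, and hand the resulting origins over to the scheduler; those parts are left out here.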
The scheduler (SWH-scheduler) runs and records our recurrent and one-shot tasks in a database, which we think of as the single source of truth. SWH-scheduler pulls tasks from this database into Celery, a Python tool which we use as a task queuing middleware and worker management framework.
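The idea of a database acting as the single source of truth for tasks can be sketched as follows. This is an illustration only, not the SWH-scheduler schema or API: the table layout, the function names, and the 'recurring'/'oneshot' policy labels are all assumptions, SQLite stands in for the real database, and a plain callable (send) stands in for handing a task to Celery.

```python
import sqlite3


def create_task_db():
    """Create an in-memory task table (hypothetical schema)."""
    db = sqlite3.connect(":memory:")
    db.execute(
        """CREATE TABLE task (
               id INTEGER PRIMARY KEY,
               origin TEXT NOT NULL,
               policy TEXT NOT NULL CHECK (policy IN ('recurring', 'oneshot')),
               interval_s INTEGER,
               next_run INTEGER NOT NULL,
               status TEXT NOT NULL DEFAULT 'pending'
           )"""
    )
    return db


def add_task(db, origin, policy, next_run, interval_s=None):
    db.execute(
        "INSERT INTO task (origin, policy, interval_s, next_run) "
        "VALUES (?, ?, ?, ?)",
        (origin, policy, interval_s, next_run),
    )


def dispatch_due_tasks(db, now, send):
    """Send every pending task that is due, then record the outcome."""
    due = db.execute(
        "SELECT id, origin, policy, interval_s FROM task "
        "WHERE status = 'pending' AND next_run <= ? ORDER BY id",
        (now,),
    ).fetchall()
    for task_id, origin, policy, interval_s in due:
        send(origin)  # stand-in for pushing the task to a worker queue
        if policy == "recurring":
            # Recurring tasks stay pending and are rescheduled.
            db.execute(
                "UPDATE task SET next_run = ? WHERE id = ?",
                (now + interval_s, task_id),
            )
        else:
            # One-shot tasks are never dispatched again.
            db.execute("UPDATE task SET status = 'done' WHERE id = ?", (task_id,))
    return len(due)


db = create_task_db()
add_task(db, "https://pypi.org/project/requests/", "recurring", 0, interval_s=3600)
add_task(db, "https://pypi.org/project/attrs/", "oneshot", 0)

queue = []
first = dispatch_due_tasks(db, now=10, send=queue.append)
second = dispatch_due_tasks(db, now=20, send=queue.append)
third = dispatch_due_tasks(db, now=3700, send=queue.append)
```

Because all state lives in the table, the dispatcher itself can be restarted at any time without losing track of what is pending.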
Source distribution files (Python packages, also known as sdists) are available as tarballs and zips. Loading them involves:
- fetching metadata about available versions;
- comparing them with the latest loaded versions;
- downloading and processing new versions; and
- loading new data.
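The four steps above can be sketched as a single function. This is a simplified illustration rather than the SWH loader: load_project, its parameters, and the in-memory archive dictionary are hypothetical, and the fetching and downloading are injected as plain callables so the example runs without network access.

```python
def load_project(fetch_versions, download, archive, already_loaded):
    """Run the four loading steps for one project and return what was new."""
    # Step 1: fetch metadata about available versions.
    available = fetch_versions()
    # Step 2: compare with what was loaded during previous visits.
    new_versions = [v for v in available if v not in already_loaded]
    for version in new_versions:
        # Step 3: download and process each new version.
        content = download(version)
        # Step 4: load the new data into the archive.
        archive[version] = content
        already_loaded.add(version)
    return new_versions


# Fake upstream data standing in for a real package repository.
upstream = {"1.0": b"first release", "1.1": b"second release"}
archive = {}
loaded = {"1.0"}  # pretend 1.0 was archived during an earlier visit

new = load_project(
    fetch_versions=lambda: sorted(upstream),
    download=lambda version: upstream[version],
    archive=archive,
    already_loaded=loaded,
)
again = load_project(
    fetch_versions=lambda: sorted(upstream),
    download=lambda version: upstream[version],
    archive=archive,
    already_loaded=loaded,
)
```

Running the function a second time finds nothing new, which is the property that makes recurring visits of the same origin cheap.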
The specificities of PyPI have been taken into account:
- comparison done using digests;
- PKG-INFO metadata parsing and saving; and
- separate importing for versions with multiple sdists.
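These three points can be illustrated with a short sketch. It is an assumption-laden example, not the SWH implementation: SHA-256 is assumed as the digest, the helper names (make_sdist, read_pkg_info, import_new_sdists) are hypothetical, and the sdists are tiny tarballs built in memory that contain only a PKG-INFO file. PKG-INFO uses an email-header-like layout, so the standard library email parser is used to read it.

```python
import hashlib
import io
import tarfile
from email.parser import Parser


def make_sdist(name, version, summary):
    """Build a tiny in-memory .tar.gz holding only PKG-INFO (test fixture)."""
    pkg_info = (
        f"Metadata-Version: 1.1\nName: {name}\n"
        f"Version: {version}\nSummary: {summary}\n"
    ).encode()
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        info = tarfile.TarInfo(f"{name}-{version}/PKG-INFO")
        info.size = len(pkg_info)
        tar.addfile(info, io.BytesIO(pkg_info))
    return buf.getvalue()


def read_pkg_info(sdist_bytes):
    """Return the top-level PKG-INFO fields of a .tar.gz sdist as a dict."""
    with tarfile.open(fileobj=io.BytesIO(sdist_bytes), mode="r:gz") as tar:
        for member in tar:
            # Expected location: "<name>-<version>/PKG-INFO".
            if member.name.count("/") == 1 and member.name.endswith("/PKG-INFO"):
                text = tar.extractfile(member).read().decode("utf-8")
                return dict(Parser().parsestr(text).items())
    return {}


def import_new_sdists(sdists, known_digests):
    """Import each sdist separately, skipping files whose digest is known."""
    imported = []
    for filename, data in sdists:
        digest = hashlib.sha256(data).hexdigest()  # digest choice is an assumption
        if digest in known_digests:
            continue
        known_digests.add(digest)
        imported.append((filename, digest, read_pkg_info(data)))
    return imported


# Two different sdists published for the same version.
sdist_a = make_sdist("demo", "1.0", "from tarball A")
sdist_b = make_sdist("demo", "1.0", "from tarball B")

known = set()
run1 = import_new_sdists(
    [("demo-1.0.tar.gz", sdist_a), ("demo-1.0-alt.tar.gz", sdist_b)], known
)
run2 = import_new_sdists([("demo-1.0.tar.gz", sdist_a)], known)
```

Comparing digests rather than version strings means a file is recognized as already imported even if it shows up again under the same version, while a second, different sdist for that version is still imported on its own.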
Now that PyPI has been successfully added, there are more package managers to integrate into the Archive. Software is spread across disparate distribution platforms. There needs to be a place to find, track, and search all source code, which is why we’re interested in adding more support to Software Heritage. If you’re interested in helping us, you can get started with the documentation, which explains how to contribute to these efforts. The PyPI implementation of loaders is particularly concise.
Python Package Index logo © 2018 Python Software Foundation