CodeCommons is testing the limits of swh-fuse on large-scale clusters. In preliminary experiments on the 10,000-core Kraken cluster, the system reached an optimal rate of 30,000 file reads per second and sustained 8,000 file writes per second.
A recent talk by Director Roberto Di Cosmo highlights how, 10 years in, Software Heritage is aiming its ‘large telescope’ at the future of code.
CodeCommons aims to address these issues, making source code and metadata available in a single, accessible location. It will implement standardized data pipelines for cleaning and preprocessing, provide traceability through identifiers, and incorporate ethical considerations, such as attribution and similarity checks.
CodeCommons aims to provide a centralized repository of essential resources, including code, documentation, and metadata, to facilitate the creation of smaller, more effective datasets for the next generation of AI tools.
CodeCommons is a two-year project building on the Software Heritage archive. Here’s an overview of the projects we and our partners are working on.
CodeCommons, a two-year project funded by the French government, is building on Software Heritage—the world’s largest public source code archive—to create higher-quality datasets for responsible artificial intelligence.