Beyond Stargate: Open code and the AI black box
Experts from France, Brazil, and the UAE explore how open-source code and transparent archives provide the essential foundation for digital sovereignty, ethical development, and linguistic inclusion.
CodeCommons is testing the limits of swh-fuse on large-scale clusters. In preliminary experiments on the 10,000-core Kraken cluster, the system reached an optimal rate of 30,000 file reads per second and sustained 8,000 file writes per second.
A recent talk by Director Roberto Di Cosmo highlights how, 10 years in, Software Heritage is aiming its ‘large telescope’ at the future of code.
CodeCommons aims to address these issues, making source code and metadata available in a single, accessible location. It will implement standardized data pipelines for cleaning and preprocessing, provide traceability through identifiers, and incorporate ethical considerations, such as attribution and similarity checks.
CodeCommons aims to provide a centralized repository of essential resources, including code, documentation, and metadata, to facilitate the creation of smaller, more effective datasets for the next generation of AI tools.
CodeCommons is a two-year project building on the Software Heritage archive. Here’s an overview of the projects we and our partners are working on.
CodeCommons, a two-year project funded by the French government, is building on Software Heritage—the world’s largest public source code archive—to create higher-quality datasets for responsible artificial intelligence.