EleutherAI Unveils One of the Largest Licensed AI Training Datasets

EleutherAI, a prominent AI research group, has released a massive new dataset designed for training AI models using only licensed and public domain content. Named Common Pile v0.1, the dataset spans 8 terabytes and represents nearly two years of collaborative work. Key contributors include AI startups such as Poolside and Hugging Face, alongside academic institutions including the University of Toronto.

Common Pile v0.1 was used to train two new 7-billion-parameter models — Comma v0.1-1T and Comma v0.1-2T — which EleutherAI claims perform on par with other leading models trained on copyrighted material. These results, they argue, challenge the widespread belief that high-performing AI models must rely on unlicensed data.

This release comes amid growing legal scrutiny of AI training practices. Companies like OpenAI have faced lawsuits for training on copyrighted content scraped from the web. While some firms have struck licensing deals, many others claim fair use protection under U.S. law. However, EleutherAI says this legal uncertainty has had a chilling effect — not on data use, but on transparency.

“Copyright lawsuits haven’t meaningfully changed data sourcing practices, but they have drastically decreased the transparency companies engage in,” said Stella Biderman, EleutherAI’s executive director, in a blog post on Hugging Face.

According to Biderman, the lawsuits have discouraged some researchers from releasing work in data-heavy fields. In response, EleutherAI says it worked closely with legal experts to ensure Common Pile v0.1 complies with licensing standards. The dataset draws from vetted sources, including 300,000 public domain books digitized by the Library of Congress and the Internet Archive, and also incorporates audio content transcribed with OpenAI's Whisper model.

The performance of the Comma models, trained on just a portion of Common Pile, reportedly matches benchmarks from proprietary models like Meta’s LLaMA in areas such as code generation, image understanding, and math.

“We think the idea that unlicensed text is required for top-tier performance is unjustified,” Biderman added. “As the volume of openly licensed data grows, so will the quality of models built on it.”

This release also signals a shift in EleutherAI’s strategy. In the past, its dataset “The Pile” included copyrighted content, drawing criticism and legal risk for companies that used it. With Common Pile v0.1, the organization aims to correct that course and has committed to releasing more transparent, legally vetted datasets in the future.

The dataset and models are publicly available through Hugging Face and GitHub.

Control F5 Team
Blog Editor