EleutherAI Launches Massive Licensed AI Dataset

Posted 2 days ago by Anonymous

Breaking New Ground in Ethical AI Training

EleutherAI has unveiled Common Pile v0.1, a collection of licensed and open-domain text designed for training AI models responsibly. At 8TB, it is one of the largest collections of legally sourced text available to researchers.

Collaborative Development Process

The two-year project involved key partnerships with:

  • AI startup Poolside
  • AI development platform Hugging Face
  • Several academic institutions

Addressing the Copyright Controversy

This release comes as major AI companies face lawsuits over training practices that rely on scraped copyrighted material. Unlike many competitors, which lean on fair-use arguments, EleutherAI has taken a transparent, legally vetted approach.

The Common Pile v0.1 Dataset

The collection features diverse licensed content including:

  • 300,000 public domain books from the Library of Congress
  • Internet Archive materials
  • Audio transcriptions using OpenAI’s Whisper speech-to-text model

Proving the Concept With New AI Models

EleutherAI has demonstrated the dataset’s effectiveness with two new models:

Comma v0.1-1T and Comma v0.1-2T

These 7-billion-parameter models rival Meta’s original Llama model on performance benchmarks for:

  • Coding tasks
  • Image understanding
  • Mathematical reasoning

“We’ve shown unlicensed text isn’t required for high performance,” stated EleutherAI executive director Stella Biderman.

A Shift Toward Responsible AI Development

This release marks a notable departure from EleutherAI’s earlier dataset, The Pile, which drew controversy for including copyrighted material. The organization now commits to regular releases of open, ethically sourced datasets through its research partnerships.

Collaborative Development Update

In a follow-up clarification, Biderman noted that the initiative involved multiple partners, including the University of Toronto, underscoring the project’s collaborative nature.