In an effort to reduce server strain caused by automated AI scraping, Wikipedia is proactively offering a new dataset tailored for artificial intelligence development. The Wikimedia Foundation has partnered with Kaggle — the data science platform owned by Google — to release a beta version of structured Wikipedia content in both English and French.
Designed with machine learning workflows in mind, the dataset provides developers with clean, machine-readable data suited to training, fine-tuning, benchmarking, and analysis. As of April 15, it includes research summaries, short descriptions, infobox data, image links, and article sections, but excludes citations and non-text elements such as audio files.
By offering well-structured JSON representations of Wikipedia content, Wikimedia hopes to provide a more efficient and reliable alternative to scraping raw text, which has become a growing burden on the site’s infrastructure. While large organizations like Google and the Internet Archive already have data-sharing agreements in place, the Kaggle partnership aims to make high-quality Wikipedia data more accessible to smaller developers and independent researchers.
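Because the release is structured JSON rather than raw page markup, a downstream pipeline can read fields directly instead of scraping and parsing HTML. Here is a minimal sketch in Python of flattening one record into training text, assuming a hypothetical record shape; the actual Kaggle schema and field names may differ:

```python
import json

# Hypothetical example of one structured Wikipedia record.
# The real Kaggle dataset's schema and field names may differ.
sample_record = json.dumps({
    "name": "Ada Lovelace",
    "abstract": "English mathematician and writer.",
    "description": "English mathematician (1815-1852)",
    "infobox": {"born": "10 December 1815", "died": "27 November 1852"},
    "image": "https://example.org/ada.jpg",
    "sections": [
        {"title": "Biography", "text": "Augusta Ada King was an English mathematician."}
    ],
})

def extract_training_text(line: str) -> str:
    """Flatten one JSON record into plain text, e.g. for model training."""
    record = json.loads(line)
    parts = [record.get("abstract", "")]
    parts += [s.get("text", "") for s in record.get("sections", [])]
    return "\n".join(p for p in parts if p)

print(extract_training_text(sample_record))
```

With a dataset delivered as one JSON object per line, the same function could be applied to each line of the file in turn, with no HTML parsing or rate-limited crawling involved.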
“Kaggle is proud to support the Wikimedia Foundation in making this data widely available to the machine learning community,” said Brenda Flynn, Kaggle’s partnerships lead. “We’re excited to help keep it accessible and useful for all.”