Wikipedia gives AI developers its data to deter scraper bots

Wikipedia is trying to dissuade artificial intelligence developers from scraping the platform by releasing a dataset specifically optimized for training AI models. On Wednesday, the Wikimedia Foundation announced that it has partnered with Kaggle, a Google-owned data science community platform that hosts machine learning data, to publish a beta dataset of “structured Wikipedia content in English and French.”

Wikimedia states that the dataset hosted on Kaggle was “designed with machine learning workflows in mind,” making it easier for AI developers to access machine-readable article data for modeling, fine-tuning, benchmarking, alignment, and analysis. The content of the dataset is openly licensed and, as of April 15, includes abstracts, short descriptions, image links, infobox data, and article sections, but excludes references and non-written elements such as audio files.

According to Wikimedia, the “well-structured JSON representations of Wikipedia content” available to Kaggle users should be a more attractive alternative to “extracting or parsing the raw text of an article.” That kind of scraping is currently putting a strain on Wikipedia’s servers, as automated AI bots steadily consume the platform’s bandwidth. Wikimedia already has content sharing agreements with Google and the Internet Archive, but the partnership with Kaggle should make this data more accessible to smaller companies and independent data researchers.
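
To illustrate why structured JSON is easier to work with than scraped page text, here is a minimal Python sketch. It assumes one JSON object per article; the field names (`name`, `description`, `abstract`, `infobox`, `sections`) and the sample record are hypothetical and chosen only for illustration, not taken from the dataset's documented schema.

```python
import json

# Hypothetical record: these field names are illustrative assumptions,
# not the dataset's documented schema.
SAMPLE_LINE = json.dumps({
    "name": "Example article",
    "description": "A short description",
    "abstract": "A summary paragraph.",
    "infobox": {"Type": "Example"},
    "sections": [
        {"title": "History", "text": "First section text."},
        {"title": "Usage", "text": "Second section text."},
    ],
})


def summarize_record(line: str) -> dict:
    """Parse one JSON line and pull out a few fields by key."""
    record = json.loads(line)
    return {
        "title": record.get("name", ""),
        "description": record.get("description", ""),
        # Sections arrive as a list, so no HTML or wikitext parsing is needed.
        "section_titles": [s.get("title", "") for s in record.get("sections", [])],
    }


summary = summarize_record(SAMPLE_LINE)
print(summary["title"], summary["section_titles"])
```

With data in this shape, a developer reads fields by key instead of stripping markup from downloaded pages, which is the workflow Wikimedia hopes will replace bot scraping.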

“As the go-to place for the machine learning community to find tools and benchmarks, Kaggle is extremely excited to be the host for the Wikimedia Foundation’s data,” said Brenda Flynn, Head of Partner Relations at Kaggle. “Kaggle is excited to play a role in ensuring that this data is accessible, available, and useful.”
