OpenAI models trained on over a million hours of YouTube videos: Report

New report alleges OpenAI and Google scraped over 1 million hours of YouTube videos to train their large language models like ChatGPT, potentially violating creators’ copyrights.

According to a recent report from The New York Times, some of the biggest tech giants have been using transcripts from YouTube videos to train their powerful AI language models – potentially violating creators’ copyrights.

The story alleges that OpenAI used its speech recognition tool Whisper to transcribe over a million hours of YouTube content. Those transcripts were then fed into GPT-4, the AI model that powers ChatGPT Plus, as training data.

OpenAI isn’t the only one accused of this YouTube data mining. The report claims there were teams at Google doing the same – scraping YouTube videos to build up datasets for their own large language models like Bard/Gemini. A Google spokesperson admitted to the publication that “unauthorised scraping or downloading of YouTube content” goes against their policies.

But the report suggests Google may have turned a blind eye to OpenAI’s YouTube transcript heist because they were doing similar things themselves. Allegedly, Google knew what OpenAI was up to but didn’t raise objections since they were using YouTube data to train their AI as well.

Both companies had reportedly hit limits on the amount of useful training data they could find from more conventional sources like books, websites, and databases. OpenAI exhausted useful supplies back in 2021, for instance. So these companies started looking at new data streams like videos and podcasts.

Google reportedly even changed its data policy language in July last year to expand what it could do with consumer data, including content from tools like Google Docs.

OpenAI and Google have defended their practices, claiming they only use public data or content where they have permission. But the allegations raise some thorny questions around fair use, copyright, and data privacy.

After all, most YouTube creators probably didn’t expect their videos could end up transcribed without their knowledge. It shows that in the race for AI supremacy, big tech companies are fine with cutting corners to feed the immense appetite of large language models.

Source: indianexpress.com
