Sora vs. YouTube – Using YouTube content to train Sora violates YouTube's Terms of Service


OpenAI vs. Google / Sora vs. YouTube

There's been a lot of talk about Sora lately, and for good reason. It is causing major controversy on many fronts. Some are concerned about the jobs it will make redundant, while others are concerned about how it will be trained.

OpenAI says they are “not sure” whether Sora is trained using YouTube videos. But in an interview with Bloomberg, YouTube CEO Neal Mohan went on the record. He says using YouTube videos to train AI models would be a violation of the company's terms of service.


OpenAI's Sora vs. YouTube

Sora seems to be the most talked about topic on the internet at the moment, especially since OpenAI released its latest examples. It was even used to produce a music video. We've moved from focusing on large language models (LLMs) like ChatGPT and single-image generation tools like Midjourney and DALL-E to focusing on video.

But how is Sora trained? OpenAI is a company that has been plagued by controversy for some time. Lawsuits have been filed against the company for allegedly training its various models using private data and stolen photos.

Now the controversy seems to be back again, this time over Sora, the company's AI tool for video generation. It has come a very long way in a very short period of time – at least publicly. But did OpenAI train its models using YouTube video content?

OpenAI's CTO has no idea what they do

In an interview last month, OpenAI CTO Mira Murati said Sora will be available to the general public sometime during 2024. When specifically asked what data the model was trained on, the Wall Street Journal reports that she responded evasively and did not go into detail.

I won't go into the details of the data used, but it was publicly available or licensed data

Mira Murati

She confirmed that they used content from Shutterstock, with which OpenAI has a partnership. However, when asked, she said she did not know whether the training data included video content from YouTube, Facebook and Instagram.

Now call me an old cynic, but it sounds to me like they absolutely know the answer and they certainly used videos from YouTube, Facebook and Instagram to train their models. Of course that's just my opinion.

According to YouTube's CEO, this is a 100% violation of the terms and conditions

In an interview with Bloomberg published yesterday, YouTube CEO Neal Mohan was asked if he could confirm whether or not OpenAI used YouTube content to train its models. He says he doesn't know either.

He can be given the benefit of the doubt here. He doesn't work for OpenAI, and if OpenAI wanted to hide its activity on YouTube, it wouldn't be difficult to download the video data anonymously with the help of some VPNs. I'm not saying OpenAI has done this, just that it's hypothetically possible.

What he did say, however, was that he had seen reports saying it may or may not have been used. He also said that it would be a violation of YouTube's Terms of Service (TOS) for OpenAI to use YouTube video content to train its models.

We have clear terms of service that um, um, when a… you know… again from a creator's perspective, when a creator uploads their, you know, hard work to our platform, there are certain expectations. One of these expectations is that the Terms of Service will be followed.

Neal Mohan

He was obviously a little unprepared for the question and unsure how to answer it. When asked whether Google trains its own Gemini AI (formerly Bard) using YouTube data, things were a little less clear. He says Google is bound by the same terms of service as OpenAI or any other YouTube user, even though it has trained on some YouTube data.

They say this data was collected through individual contracts with specific creators on the platform or under YouTube's Terms of Service – which apply differently to YouTube/Google than to everyone else. Let's take a look at the YouTube Terms of Service, or at least the parts that might be relevant here.

Rights You Grant
You retain all of your ownership rights in your Content. In short: What belongs to you remains your property. However, we require you to grant YouTube and other users of the Service certain rights as described below.

License for YouTube
By providing Content to the Service, you grant YouTube a worldwide, non-exclusive, royalty-free, transferable, sublicensable license to use such Content (including copying, distributing, modifying, displaying and performing) for the purposes of operating, promoting and improving the Service.

It essentially says that YouTube is free to do whatever it wants with content uploaded to YouTube, as long as it serves to operate, promote or improve the service YouTube offers.

Again, hypothetically, Google could argue that Gemini – or any other AI it is working on – is being developed to improve the service offered by YouTube. This would have to be tested in court, but on that reading, you are allowing them to use your videos to train Google Gemini if they wish.

TL;DR – OpenAI can’t, YouTube can

In short, if OpenAI trains its models on YouTube data, OpenAI is violating the YouTube Terms of Service. If Google uses it to train Gemini, Google/YouTube isn't – because of the rights we grant them when uploading.

Be that as it may, it doesn't look like the copyright situation will be resolved any time soon. And by the time the lawsuits are filed and new laws come into effect, it will be far too late to do anything about it anyway.

Each company's models already have more than enough data to do what they need to do. We will likely see more class action lawsuits. We may also see some battles between technology heavyweights in the courtroom.

However, if you don't want your video content to be used to train AI models, don't publish it on the internet.

[via The Verge]