Opting out: How to stop AI companies from using your online content to train their models

A US company created a button for website owners to block AI crawlers. Here’s a look at how to block AI from websites and social media.

We have ad block and now there’s an artificial intelligence (AI) block.

US cybersecurity company Cloudflare has created a button for website customers to block their data from being used by AI crawlers: Internet bots that roam the web to collect training data.

“We helped people protect against the scraping of their websites by bots (…) so I really think AI is the new iteration of content owners wanting to control how their content is used,” John Graham-Cumming, the company’s chief technical officer, told Euronews Next in an interview.

When a request reaches a website served through Cloudflare, the company can see who is asking for the page, including any AI crawlers that identify themselves. The blocker responds by returning an error to those crawlers.
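The idea of refusing crawlers that identify themselves can be sketched in a few lines of Python. This is an illustration only, not Cloudflare's implementation: the function name `status_for_request` is hypothetical, the crawler names are the three OpenAI bots mentioned later in this article, and modelling the "error" as an HTTP 403 is an assumption.

```python
# Crawler names taken from this article; a real blocklist would be longer
# and would need to be kept up to date.
AI_CRAWLER_TOKENS = ("GPTBot", "ChatGPT-User", "OAI-SearchBot")


def status_for_request(user_agent: str) -> int:
    """Return an HTTP status code based on the request's User-Agent header."""
    ua = user_agent.lower()
    if any(token.lower() in ua for token in AI_CRAWLER_TOKENS):
        # Assumption: the "error" shown to the crawler is 403 Forbidden.
        return 403
    return 200


# A crawler that announces itself is refused; an ordinary browser is served.
print(status_for_request("Mozilla/5.0 (compatible; GPTBot/1.0)"))
print(status_for_request("Mozilla/5.0 (Windows NT 10.0; Win64; x64)"))
```

A check like this only catches bots that announce themselves honestly in the User-Agent header.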

Some AI bots pretend to be human users when accessing a website, so Cloudflare built a machine learning model that scores how likely it is that a request comes from a human rather than a bot, Graham-Cumming said.

The CTO couldn’t say which clients are using the new button but said it’s been “very popular,” with a wide variety of small and large companies.

Blocking AI crawlers in general is becoming more popular, according to one study from the Data Provenance Initiative, a group of independent AI researchers.

Their recent analysis of over 14,000 web domains found that five per cent of all data assembled into the Internet’s public databases of C4, RefinedWeb, and Dolma is now restricted. But researchers note this number goes up to 25 per cent when looking at the highest quality sources.

Ways of blocking AI crawlers

There are ways to manually block AI crawlers from accessing your content.

Raptive, a US company advocating for creators, wrote in a guide that website hosts could manually add commands to robots.txt, the file that tells search engines and other crawlers what they may access on your site.

To do it, you would add a “User-agent” line naming the crawler of a popular AI company, such as Anthropic, and then add a “Disallow” line with a colon and a forward slash.

Then, the website host would clear the cache and add /robots.txt at the end of the website’s domain in the browser’s address bar to check that the updated file is live.
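As a rough sketch of what such entries might look like, the Python snippet below holds a sample robots.txt in a string and uses the standard library's `urllib.robotparser` to check how a rule-following crawler would read it. The crawler names are assumptions for illustration: `GPTBot` is named by OpenAI, and `ClaudeBot` is used here for Anthropic. Site owners should confirm the current names in each company's own documentation.

```python
from urllib.robotparser import RobotFileParser

# Sample robots.txt content: one block per crawler, each disallowing
# the whole site ("/"). Crawler names are illustrative assumptions.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Ask the parser what each crawler is permitted to fetch.
for agent in ("GPTBot", "ClaudeBot", "Googlebot"):
    allowed = parser.can_fetch(agent, "https://example.com/any-page")
    print(agent, "allowed" if allowed else "blocked")
```

With these rules the two named AI crawlers should be reported as blocked, while a crawler that is not listed, such as a search engine's, remains allowed.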

“Adding an entry to your site’s robots.txt file (…) is the industry-standard method for declaring which crawlers you permit to access your site,” Raptive says in their guide.

Some AI companies, content platforms, and social media platforms also offer ways to opt out.

Before its planned June launch, Meta AI gave users a chance to opt out of a new policy under which public posts would be used to train its AI models. The company then committed to the European Commission in June that it would not use user data for “undefined artificial intelligence techniques”.

In 2023, OpenAI published strings of code that website owners can use to block three types of bots from their sites: OAI-SearchBot, ChatGPT-User and GPTBot.

OpenAI is also working on Media Manager, a tool that will let creators better control what content is being used to train generative AI.

“This will (be) (…) the first-ever tool of its kind to help us identify copyrighted text, images, audio and video across multiple sources and reflect creator preferences,” OpenAI said in a May blog post.

Some websites, like Squarespace and Substack, have easy commands or toggles to turn off AI crawling. Others, like Tumblr and WordPress, have “prevent third-party sharing” options that you can turn on to avoid AI training.

Users can opt out of AI scraping with Slack by sending their support team an email.

Industry standard in the works

Websites are able to identify AI crawlers because of a longstanding Internet convention called the Robots Exclusion Protocol.

Martijn Koster, a Dutch software engineer, created the protocol in 1994 to limit crawlers overwhelming his own site. It was later adopted by search engines to “help manage their server resources,” according to a blog post from Google Search Central, a site for developers.

However, it’s not an official Internet standard, which means developers “interpreted the protocol somewhat differently over the years,” according to Google.

One recent example is Perplexity, a US AI company that runs chatbots, which is being investigated by Amazon over taking online news content without approval to train its bots.

“We don’t have an industry agreement for how that applies in the world of AI,” Graham-Cumming from Cloudflare said. “The good (companies) respect the protocol but they don’t actually have to.”

“We need something across the internet … that makes it very clear that yes or no you may scrape this website for data.”
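The voluntary nature of the protocol is easier to see from the crawler's side. In the minimal sketch below, the check against robots.txt happens inside the crawler's own code, so nothing forces a crawler to run it. All names here (`polite_fetch`, `fake_fetch`, `ExampleAIBot`) are hypothetical, and `fake_fetch` stands in for a real HTTP request.

```python
from typing import Callable, Optional
from urllib.robotparser import RobotFileParser


def polite_fetch(robots_txt: str, agent: str, url: str,
                 fetch: Callable[[str], str]) -> Optional[str]:
    """Fetch url only if the site's robots.txt allows this agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    if not parser.can_fetch(agent, url):
        return None  # the crawler chooses to honour the rule
    return fetch(url)


def fake_fetch(url: str) -> str:
    """Stand-in for a real download, so the example needs no network."""
    return f"<html>content of {url}</html>"


# Hypothetical site rules that disallow a made-up AI crawler everywhere.
SITE_RULES = "User-agent: ExampleAIBot\nDisallow: /\n"
URL = "https://example.com/news"

# A compliant crawler gets nothing back for a disallowed page...
print(polite_fetch(SITE_RULES, "ExampleAIBot", URL, fake_fetch))
# ...but a crawler that skips the check can still download it.
print(fake_fetch(URL))
```

Because the check lives in the crawler rather than on the website, robots.txt works as a request, not an enforcement mechanism.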

The Internet Architecture Board (IAB) is hosting two-day workshops in September, where Graham-Cumming believes an industry standard will be set. Euronews Next has reached out to the IAB to confirm this.
