News outlets including the New York Times, CNN, Reuters and the Australian Broadcasting Corporation (ABC) have blocked a tool from OpenAI, limiting the company’s ability to continue accessing their content.
OpenAI is behind one of the best-known AI chatbots, ChatGPT. Its web crawler – known as GPTBot – may scan web pages to help improve its AI models.
The Verge was the first to report that the New York Times had blocked GPTBot on its website. The Guardian later found that other major news websites, including CNN, Reuters, the Chicago Tribune, the ABC, and Australian Community Media (ACM) brands such as the Canberra Times and the Newcastle Herald, also appeared to have blocked the web crawler.
So-called large language models such as ChatGPT require huge amounts of information to train their systems and allow them to answer user queries in ways that resemble human language patterns. But the companies behind them are often reticent about the presence of copyrighted material in their datasets.
The block on GPTBot can be seen in the publishers’ robots.txt files, which tell crawlers from search engines and other entities which pages they are allowed to visit.
“Allowing GPTBot to access your site can help AI models become more accurate and improve their general capabilities and safety,” OpenAI said in a blog post that included instructions on how to disallow the crawler.
All the outlets examined added the blocks in August. Some have also blocked CCBot, the web crawler for an open repository of web data known as Common Crawl that has also been used for AI projects.
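For readers who want to see how such a block works in practice, the following is a minimal sketch in Python, using the standard library's `urllib.robotparser`. The robots.txt text, the `example.com` URL and the helper name `blocked_crawlers` are all hypothetical and were written for illustration; they are not copied from any publisher's site. The sketch only shows how a crawler that honours robots.txt would read such rules, and compliance with the file is voluntary.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, written for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def blocked_crawlers(robots_txt, user_agents, url="https://example.com/news/story"):
    """Return the user agents that the given robots.txt text bars from `url`."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return [ua for ua in user_agents if not parser.can_fetch(ua, url)]


if __name__ == "__main__":
    print(blocked_crawlers(SAMPLE_ROBOTS_TXT, ["GPTBot", "CCBot", "Googlebot"]))
```

With this sample file, GPTBot and CCBot are barred from every page, while other crawlers are only kept out of the hypothetical `/private/` section.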
CNN confirmed to Guardian Australia that it recently blocked GPTBot across its titles, but did not comment on whether the brand plans to take further action on the use of its content in AI systems.
A Reuters spokesperson said it regularly reviews its robots.txt and site terms and conditions. “As intellectual property is the lifeblood of our business, it is imperative that we protect the copyright of our content,” she said.
The New York Times’s terms of service were recently updated to make the prohibition on “copying our content for AI training and development… clearer,” according to a spokesperson.
As of August 3, its site rules expressly prohibit the use of the publisher’s content to “develop any software, including, but not limited to, training a machine learning or artificial intelligence (AI) system” without consent.
News outlets globally are facing decisions about whether to use AI as part of news gathering, as well as how to handle their content potentially being absorbed into training datasets by companies developing AI systems.
In early August, media organisations including AFP and Getty Images signed an open letter advocating for AI regulation, including transparency about the “composition of all training sets used to create AI models” and consent for the use of copyrighted materials.
Google has suggested that AI systems should be able to scrape publishers’ work unless they explicitly opt out.
In a submission to the Australian government’s review of the regulatory framework around artificial intelligence, the company argued for “copyright systems that enable appropriate and fair use of copyrighted content to enable training of AI models in Australia on a broad and diverse set of data, while supporting applicable opt-outs.”
Research from OriginalityAI, a company that checks for the presence of AI content, shared this week found that major websites including Amazon and Shutterstock had also blocked GPTBot.
The Guardian’s robots.txt file does not allow GPTBot.
ABC, Australian Community Media, Chicago Tribune, OpenAI, and Common Crawl did not respond by the deadline.