ChatGPT and other large language models (LLMs) are trained on vast amounts of text data scraped from the internet.
Check whether your content was scraped – and remove it from the database!
Bots called “web scrapers” are designed to visit webpages across the internet and copy information from them. This information may be text, photos, or anything else present on the page. Web scraping is a common technique for gathering data from the internet for various commercial and non-commercial uses, many of them in machine learning and AI applications and research.
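To make the mechanics concrete, here is a minimal sketch of a scraper in Python, using the third-party requests and beautifulsoup4 libraries. The URL is a placeholder; a real scraper would add crawl scheduling, rate limiting, robots.txt handling, and storage.

```python
import requests
from bs4 import BeautifulSoup

# Fetch one page (placeholder URL) and parse its HTML.
response = requests.get("https://example.com/some-article", timeout=10)
soup = BeautifulSoup(response.text, "html.parser")

# Copy out the visible text and the address of every image on the page.
text = soup.get_text(separator=" ", strip=True)
image_urls = [img["src"] for img in soup.find_all("img", src=True)]

print(text[:200])
print(image_urls)
```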
Some organizations specialize in this, conducting web scraping on a massive scale. A prime example is Common Crawl, a non-profit that finds and saves the text of hundreds of millions of webpages per month. This data is then made freely available to any individual or company that wants to use it, including companies like OpenAI that train artificial intelligence models. These models are fed terabytes upon terabytes of text scraped from the internet to identify patterns in human language and learn how to generate human-like text.
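Common Crawl publishes an index of every page it captures, so in principle you can query it yourself. Below is a sketch that asks one crawl's public CDX index whether a given URL was captured; the crawl ID CC-MAIN-2023-50 is just an example snapshot, and each monthly crawl has its own index (listed at index.commoncrawl.org).

```python
import json
import requests

# Placeholder: the page you want to look up.
url_to_check = "example.com/my-article"

# One example crawl snapshot; swap in the crawl you want to search.
index = "https://index.commoncrawl.org/CC-MAIN-2023-50-index"

resp = requests.get(
    index, params={"url": url_to_check, "output": "json"}, timeout=30
)
if resp.ok and resp.text.strip():
    # The API returns one JSON record per line, one line per capture.
    for line in resp.text.strip().splitlines():
        record = json.loads(line)
        print(record["timestamp"], record["url"])
else:
    print("No captures of this URL found in this crawl.")
```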
Neither the data collectors nor the companies training their AI models on this data pay any attention to copyright. This is not surprising: it is hard for a bot to tell copyrighted text apart from non-copyrighted text. Yet your blogs, articles, and other original content are automatically copyrighted, and you have the right to know whether they have been scraped and to have them removed. That is what this service attempts to provide.
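Removal after the fact is what this service is for, but you can also discourage future collection. Common Crawl states that its crawler identifies itself as CCBot and respects robots.txt, so a rule like the sketch below at the root of your site asks it to skip your pages entirely. Note that this only affects future crawls; it does not delete data already collected.

```
# robots.txt served at the root of your site (hypothetical example)
User-agent: CCBot
Disallow: /
```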