~/blog/"OpenAI's data collection"

2023-08-11

To the surprise of no-one, OpenAI has been crawling the web for content to train their Large Language Models (LLMs).

While LLMs have opened up a world of possibilities, as with every other technology, not all of those possibilities will be good. So, for the time being, I’ll be adding the following lines to my website’s robots.txt (and I encourage you to do the same):

User-agent: GPTBot
Disallow: /
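
If you want to sanity-check that those two lines do what you expect, here is a minimal sketch using Python's standard-library urllib.robotparser. The helper name is_allowed and the example.com URLs are made up for illustration, and this only shows how a compliant parser reads the rule; whether a given crawler actually honors robots.txt is up to the crawler.

```python
from urllib.robotparser import RobotFileParser

# The same two lines from the robots.txt snippet above.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if user_agent may fetch url under the given robots.txt text.

    Hypothetical helper for illustration; it parses the text locally and
    makes no network requests.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # GPTBot is blocked from the whole site.
    print(is_allowed(ROBOTS_TXT, "GPTBot", "https://example.com/"))  # False
    # Agents with no matching rule are still allowed.
    print(is_allowed(ROBOTS_TXT, "SomeOtherBot", "https://example.com/blog/"))  # True
```

To check a live site instead, RobotFileParser also offers set_url() and read(), which fetch the file over the network before you call can_fetch().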

Even disregarding the controversies that OpenAI has been involved in (e.g. the fact that they are not as open as their name implies), I feel like we don’t fully understand what this brave new world has in store for us.

When large-scale data collection became widespread, we left that Pandora’s box open for years, which allowed companies like Facebook and Google to collect vast amounts of data. We’re still dealing with the consequences today.

We should learn from the mistakes of the past, and let public policy catch up with the tech this time around.