Blocking ChatGPT from stealing your content
I don't intend for this site to become a tech blog or anything, but I do feel that this is interesting enough to warrant a quick post. As I mentioned before, Bored Horse runs on Eleventy (and more; see the colophon). I've added the necessary rule to `robots.txt`, which is simple enough, to block OpenAI from scraping content from Bored Horse. This is what you need to add:
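At the time of writing, OpenAI's documented crawler user agent is GPTBot, so the rule looks like this:

```
# Block OpenAI's crawler (GPTBot is the user agent OpenAI documents)
User-agent: GPTBot
Disallow: /
```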
Yep, that's all there is to it. Just stick that in your `robots.txt`, and place that file in the root of your site. OpenAI won't scrape your site now, or so they claim.
However, if you want to add this to Eleventy, and want to manage your `robots.txt` with the rest of your content, you'll need to tweak it some more. There might be better ways to do this, but this is what I did, within `module.exports = function(eleventyConfig)` (yours might be called something else):
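It's essentially a one-liner; here's a sketch assuming Eleventy's `addPassthroughCopy` API, with folder names (`content`, `dist`) matching my setup — yours will likely differ:

```javascript
// .eleventy.js (a sketch; folder names are assumptions from my own setup)
module.exports = function(eleventyConfig) {
  // Copy content/robots.txt straight through to the root of the output folder
  eleventyConfig.addPassthroughCopy({ "content/robots.txt": "robots.txt" });

  return {
    dir: {
      input: "content", // where my content lives; yours may be named differently
      output: "dist"    // Eleventy's default output is _site
    }
  };
};
```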
This copies the `robots.txt` file that resides in my `content` folder (so, `content/robots.txt`), where all the content in my install lives (your input folder might be named something else), to the root of the `dist` folder. That's the actual site that Eleventy builds for me, and that I push live (yours might be `_site`, Eleventy's default output, or something else). The one you're reading right now, as it were.
Simple enough. Granted, there might be better and/or different ways of doing this, but I like being able to work with the `robots.txt` file alongside the rest of my content, knowing that it'll deploy accordingly.
But why would you want to block OpenAI's bot, you might ask? Because they don't credit you in any meaningful way when ChatGPT generates a response to someone's question. Offering an opt-out is surely a way to calm people down, and to let OpenAI point to something that claims they are, indeed, not actually just stealing content. Because, you know, you had a chance to say no.
Bullshit, I say, and will continue to block their scraping until they credit and link sources in a reasonable way.