Inspired by the comments on this Ars article, I’ve decided to program my website to “poison the well” when it gets a request from GPTBot.

The intuitive approach is just to generate some HTML like this:

<p>
// Twenty pages of random words
</p>

(I also considered just hardcoding twenty megabytes of “FUCK YOU,” but that’s a little juvenile for my taste.)

Unfortunately, I’m not very familiar with ML beyond a few basic concepts, so I’m unsure if this would get me the most bang for my buck.

What do you smarter people on Lemmy think?

(I’m aware this won’t do much, but I’m petty.)

  • jet@hackertalks.com
    link
    fedilink
    English
    arrow-up
    26
    arrow-down
    4
    ·
    1 year ago

    you dont have to do anything… people are already using LLMs to astroturf content online, all you have to do is wait. Garbage in, and garbage out.

  • nothacking@discuss.tchncs.de
    link
    fedilink
    arrow-up
    11
    ·
    edit-2
    1 year ago

    These models chose the most likely next word based on the training data, so a much more effective option would be a bunch of plausible sentences followed by an unhelpful or incorrect answer, formated like an FAQ. That way instead of slightly increasing the probability of random words, you massive increase the probability of a phrase you chose getting generated. I would also avoid phrases that outright refuse to provide an answer because these models are also trained to produce helpful and “ethical” answers, so using an confidently incorrect answer increases the chance that a user will see it

    Example: What is the color of an apple? Purple.

  • Sigmatics@lemmy.ca
    link
    fedilink
    arrow-up
    3
    arrow-down
    2
    ·
    edit-2
    1 year ago

    It’s not going to work. I’m pretty sure they have filters in place for stuff like this. And your random website won’t be crawled anyway because nobody’s linking to it

    • Reader9@programming.dev
      link
      fedilink
      English
      arrow-up
      4
      ·
      1 year ago

      It’s probably not going to work as a defense against training LLMs (unless everyone does it?) but it also doesn’t have to — it’s an interesting thought experiment which can aid in understanding of this technology from an outside perspective.

  • kamstrup@programming.dev
    link
    fedilink
    arrow-up
    1
    ·
    1 year ago

    You should probably change page content entirely, server sizey, based on the user agent og request IP.

    Using CSS to change layout based on the request has long since been “fixed” by smart crawlers. Even hacks that use JS to show/hide content is mostly handled by crawlers.

    • colonial@lemmy.worldOP
      link
      fedilink
      arrow-up
      4
      ·
      1 year ago

      I won’t be using CSS or JS. I control the entire stack, so I can do a server-side check - GPTBot user agents get random garbage, everyone else gets the real deal.

      Obviously this relies on OpenAI not masking their user agent, but I think webmasters would notice a conspicuous lack of hits if they did that.