Intentionally corrupting LLM training data?

colonial@lemmy.world · edit-2 11 months ago

kamstrup@programming.dev · 11 months ago

You should probably change the page content entirely, server side, based on the user agent or request IP.

Using CSS to change layout based on the request has long since been “fixed” by smart crawlers. Even hacks that use JS to show/hide content are mostly handled by crawlers.
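For the request-IP half of that, a rough sketch of what a server-side check could look like, using Python's standard `ipaddress` module. The networks listed are documentation-only placeholder ranges, not real crawler ranges, and `CRAWLER_NETWORKS` and `is_crawler_ip` are made-up names for illustration. You would have to fill in whatever ranges the crawler operator actually publishes.

```python
import ipaddress

# Placeholder ranges only (reserved documentation blocks), NOT real crawler IPs.
# Replace with the ranges the crawler operator actually publishes.
CRAWLER_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_crawler_ip(remote_addr: str) -> bool:
    """Return True if the request IP falls inside one of the listed networks."""
    try:
        addr = ipaddress.ip_address(remote_addr)
    except ValueError:
        # Malformed address: treat it as a normal visitor.
        return False
    return any(addr in net for net in CRAWLER_NETWORKS)
```

Keep in mind that behind a reverse proxy the address your app sees may be the proxy's, so the check would need the real client address.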

colonial@lemmy.world · 11 months ago

I won’t be using CSS or JS. I control the entire stack, so I can do a server-side check - GPTBot user agents get random garbage, everyone else gets the real deal.
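Roughly what I have in mind, as a framework-agnostic Python sketch. The function names are hypothetical, and matching on the substring "GPTBot" is an assumption about how the crawler identifies itself. The garbage generator is just random lowercase "words"; anything that looks like text would do.

```python
import random
import string

# Assumption: the crawler's User-Agent header contains this token.
BLOCKED_AGENT_TOKENS = ("gptbot",)


def is_blocked_agent(user_agent):
    """Case-insensitive substring match against the blocked tokens."""
    ua = (user_agent or "").lower()
    return any(token in ua for token in BLOCKED_AGENT_TOKENS)


def random_garbage(n_words=50, rng=None):
    """Build a string of n_words random lowercase pseudo-words."""
    rng = rng or random.Random()
    words = (
        "".join(rng.choice(string.ascii_lowercase) for _ in range(rng.randint(2, 9)))
        for _ in range(n_words)
    )
    return " ".join(words)


def choose_body(user_agent, real_body):
    """Serve garbage to blocked agents and the real page to everyone else."""
    if is_blocked_agent(user_agent):
        return random_garbage()
    return real_body
```

In the actual request handler this would just be `choose_body(request_user_agent, page)`, with the header pulled from whatever the server framework exposes.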

Obviously this relies on OpenAI not masking their user agent, but I think webmasters would notice a conspicuous lack of hits if they did that.