I was reading the Reddit thread about Claude AI crawlers effectively DDoSing the Linux Mint forums: https://libreddit.lunar.icu/r/linux/comments/1ceco4f/claude_ai_name_and_shame/

and I wanted to block all AI crawlers from my self-hosted stuff.

I don’t trust crawlers to respect robots.txt, but you can generate one here: https://darkvisitors.com/
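
For reference, the generated file is just a plain robots.txt that lists the crawler user agents and disallows everything. A minimal hand-written sketch (abbreviated to a few of the agents from the regex below; the darkvisitors generator produces a much longer list):

```
# robots.txt — only honored by well-behaved crawlers
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /
```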

Since I use Caddy as my web server, I put together a snippet that blocks them based on their user agent. The contents of the regex basically come from darkvisitors.

Side note: there is also a Caddy module for blocking crawlers, but it seemed like overkill for my use case: https://github.com/Xumeiquer/nobots

For anybody who is interested, here is the block_ai_crawlers.conf I wrote.

# Caddy snippet: block known AI crawlers by their user agent
(blockAiCrawlers) {
  # Match requests whose User-Agent contains any of these crawler names (case-insensitive)
  @blockAiCrawlers {
    header_regexp User-Agent "(?i)(Bytespider|CCBot|Diffbot|FacebookBot|Google-Extended|GPTBot|omgili|anthropic-ai|Claude-Web|ClaudeBot|cohere-ai)"
  }
  # Drop the connection without sending a response
  handle @blockAiCrawlers {
    abort
  }
}

# Usage:
# 1. Place this file next to your Caddyfile
# 2. Edit your Caddyfile as in the example below
#
# ```
# import block_ai_crawlers.conf
#
# www.mywebsite.com {
#   import blockAiCrawlers
#   reverse_proxy * localhost:3000
# }
# ```
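
To sanity-check the block (assuming the import is set up as in the usage example above), you can send requests with and without a blocked user agent; `abort` should close the connection without any HTTP response for the former:

```
# Blocked UA: Caddy resets the connection, curl reports an empty reply
curl -A "GPTBot" -I https://www.mywebsite.com
# Normal UA: served as usual
curl -A "Mozilla/5.0" -I https://www.mywebsite.com
```
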
    • winnie@lemmy.ml · 6 months ago

      Suggestion at the end:

        <a class="boom" href="https://boom.arielaw.ar">hehe</a>

      Wouldn’t it also catch GoogleBot (and other search engine crawlers), getting your site delisted from search?

    • jkrtn@lemmy.ml · 6 months ago

      This is one of the best things I’ve ever read.

      I’d love to see a robots.txt list a couple of safe paths, then a zip bomb, then another safe path. It would be fun to see how many log entries from a single IP go: get a, get b, get zip bomb… no more requests.
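
      A sketch of what that bait robots.txt could look like (all paths invented for illustration; a well-behaved crawler never touches any of them, while a scraper that crawls disallowed entries walks straight into the bomb):

      ```
      # Hypothetical bait robots.txt
      User-agent: *
      Disallow: /notes/a.html      # harmless
      Disallow: /notes/b.html      # harmless
      Disallow: /notes/c.html      # actually a gzip bomb
      Disallow: /notes/d.html      # harmless
      ```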

    • A Basil Plant@lemmy.world · 6 months ago

      In dark mode, the anchor tags are difficult to read. They’re dark blue on a dark background. Perhaps consider something with a much higher contrast?

      [Image: a website with a dark purple background and dark blue links.]

      Apart from that, nice idea - I’m going to deploy the zip bomb today!

    • pvq@lemmy.ml · 6 months ago

      I really like your site’s color scheme, fonts, and overall aesthetics. Very nice!

    • Para_lyzed@lemmy.world · 6 months ago

      From your recommendation, I found a related project, pandoras_pot, which I can run in a Docker container and which seems to run more efficiently on my Pi home server. I now use it in my Caddyfile as the target for a number of fake subdomains and paths that a malicious bot is likely to find (all of which are, of course, excluded in my robots.txt for bots that actually respect it). Thanks for the recommendation!
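
      For anyone wanting to do something similar, a rough sketch of the idea in Caddy terms (the hostnames, decoy paths, and ports here are made up; adjust to your own setup):

      ```
      # Hypothetical decoy subdomain, pointed straight at pandoras_pot
      trap.example.com {
        reverse_proxy localhost:8080
      }

      example.com {
        # Decoy paths that only a misbehaving bot should ever request
        @honeypot path /wp-admin/* /backup/* /private-api/*
        handle @honeypot {
          reverse_proxy localhost:8080   # pandoras_pot container
        }
        handle {
          reverse_proxy localhost:3000   # the real site
        }
      }
      ```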

    • Deckweiss@lemmy.world (OP) · 7 months ago

      That’s an easy modification. Just redirect or reverse_proxy to the tarpit instead of abort.
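
      A minimal sketch of that change against the snippet above (the tarpit’s address here is hypothetical):

      ```
      handle @blockAiCrawlers {
        # Feed matched crawlers to a tarpit instead of dropping the connection
        reverse_proxy localhost:8081
      }
      ```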

      I was even thinking about an infinitely linked, data-poisoned HTML document, but there seemed to be no ready-made project that can generate one at the moment. (No published data-poisoning techniques for plain text at all, AFAIK, but there is one for images.)

      Ultimately I decided to just abort the connection as I don’t want my servers to waste traffic or CPU cycles.

    • winnie@lemmy.ml · 6 months ago

      Your link has no article, and the video is inside a Flash file (.swf) that won’t open in 2024.

      And I don’t want to install Flash on my machine…

  • LiveLM@lemmy.zip · 6 months ago

    Huh, looks like the post in r/linux got removed for not being relevant.
    What a joke.