A very basic check that tries to connect to every known instance once a day. If the connection fails or returns anything other than HTTP 200, the instance is marked as dead and no federation activities are sent to it.

This implementation is really basic: there can be false positives if an instance is temporarily down or unreachable during the check. It also rechecks all known instances every day, even if they have been down for years. Nevertheless, it should be a major improvement, and we can add more sophisticated checks later.
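The check described above is simple enough to sketch. The following is a minimal, standalone approximation of the idea (not the actual Lemmy code), assuming the reqwest and tokio crates and a hard-coded stand-in for the server's instance table:

```rust
// Minimal sketch of a daily liveness check; KNOWN_INSTANCES stands in for
// whatever the server actually stores in its instance table.
use std::time::Duration;

const KNOWN_INSTANCES: &[&str] = &["lemmy.ml", "lemm.ee"];

async fn instance_is_alive(client: &reqwest::Client, domain: &str) -> bool {
    // The instance only counts as alive if the request succeeds with HTTP 200.
    match client.get(format!("https://{domain}/")).send().await {
        Ok(resp) => resp.status() == reqwest::StatusCode::OK,
        Err(_) => false,
    }
}

#[tokio::main]
async fn main() {
    let client = reqwest::Client::builder()
        .timeout(Duration::from_secs(10))
        .build()
        .expect("failed to build HTTP client");

    loop {
        for domain in KNOWN_INSTANCES {
            let alive = instance_is_alive(&client, domain).await;
            // A real server would flip a `dead` flag in its instance table here
            // and skip sending federation activities while the flag is set.
            println!("{domain}: alive = {alive}");
        }
        // Re-check all known instances once a day.
        tokio::time::sleep(Duration::from_secs(24 * 60 * 60)).await;
    }
}
```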

Still need to fix two problems mentioned in the comments.

Originally posted by Nutomic in #3427

  • issue_tracking_bot@lemm.ee (OP, bot) · 2 years ago

    There are instances which send traffic out, but block incoming traffic, thus still tying up my federation workers.

    I think the outbound federation code is basically an engine for denial-of-servicing its peers. It makes no allowance for the fact that it is sending votes, comments, and posts over and over to the same 100 servers; it just blindly queues HTTP transactions with no concern that it is the same host it is already trying to communicate with.

    I’ve managed 1990s e-mail MTAs with almost all the same sending problems and more traffic in 1999 than I’ve seen Lemmy do this month, and you have to have awareness of your outbound queue to a particular (familiar/frequent) host. Store and forward is what Lemmy needs to manage the variety of different software (Kbin) and low-budget hardware.

    Right now the outbound design worries too much about not wasting storage, but gives no consideration to just how much overhead there is in HTTP transactions and mindlessly opens connections to the exact same server so many times in a short period. It also does not give server operators an API to monitor their queues and activity, masking background information that is essential for server capacity planning and even for spotting attacks against the content/users of the site.

    Community to Community replication is a huge amount of content, and a single HTTP transaction per message, with all the federation boilerplate and signing, is probably doomed. I do not consider the volume of messages as of July 1 to be that high; the crashing servers and maturing smartphone apps have held back a lot of the content - you get fewer replies for every comment that does not get shared.

    I encourage something drastic on the outbound queue. I would suggest biting the bullet and making a big change now. Three ideas for a new direction:

    1. Put in a SQLite database (don’t put more load on PostgreSQL) and at minimum log every new outbound item there, so the server can tell when sending to a particular host is backed up. Maybe don’t store the individual comments and posts there, only their id from the main PostgreSQL database and which instance they need to be delivered to. (A rough sketch of this appears near the end of this comment.)

    2. Make the MTA part of Lemmy a different server app and service. Queue to the other app.

    3. “Punt” and face the reality that the huge traffic potential of Community replication doesn’t go well with the boilerplate federation JSON structure (bulky overhead), a single HTTP transaction per item, and even the digital-signature overhead. Build a Community-to-Community, Lemmy-to-Lemmy replication agent that uses the front-door API to do posts, comments, and votes. This has to be the majority of the traffic and overhead. The front-end API can already load 300 comments at a time; add some new API paths for accepting bulk input. Now you have a logged-in session and don’t have to put a digital signature on each individual comment. This also allows backfill when servers are down or new. I would make it a pull agent - even a non-logged-in user can fetch comments and posts per community (and users subscribe to community) - that way you don’t need to log in to remote Lemmy servers to pick up new messages (read only). A rough sketch of such a pull agent follows below.
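As a rough illustration of idea 3, here is a minimal pull-agent sketch. It assumes the reqwest (with its json feature), tokio, and serde_json crates; the /api/v3/comment/list endpoint and its parameters are written from memory of the public Lemmy HTTP API and should be checked against the API documentation before relying on them:

```rust
// Sketch of a read-only pull agent that backfills a community's comments
// through the regular HTTP API instead of one signed federation POST per item.

async fn pull_comments(
    client: &reqwest::Client,
    remote: &str,
    community: &str,
    page: u32,
) -> Result<serde_json::Value, reqwest::Error> {
    let page = page.to_string();
    client
        .get(format!("https://{remote}/api/v3/comment/list"))
        .query(&[
            ("community_name", community),
            ("sort", "New"),
            ("limit", "50"),
            ("page", page.as_str()),
        ])
        .send()
        .await?
        .json()
        .await
}

#[tokio::main]
async fn main() -> Result<(), reqwest::Error> {
    let client = reqwest::Client::new();
    // No login and no per-item signature: comments and posts are publicly
    // readable, so a new or recovering server could backfill a community by
    // paging through recent content.
    let body = pull_comments(&client, "lemmy.ml", "rust", 1).await?;
    let count = body["comments"].as_array().map_or(0, |c| c.len());
    println!("pulled {count} comments");
    Ok(())
}
```

Because reads are anonymous, the same loop could simply page further back to backfill a community after downtime or when a server is new.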

    Drastic, I say. If a Rust programmer is handy, I’d throw in SQLite right now and build some structures to track, per instance, what is queued outbound for delivery. Also build an API to return some JSON on the queue sizes for server operators.
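And a rough sketch of the SQLite idea from point 1 and the paragraph above: one row per queued outbound activity, so the server can see per instance how far behind delivery is, plus the JSON an operator-facing queue-size endpoint could return. It assumes the rusqlite and serde_json crates; the table, columns, and function names are invented for illustration:

```rust
use rusqlite::{params, Connection, Result};

fn open_queue_db(path: &str) -> Result<Connection> {
    let conn = Connection::open(path)?;
    conn.execute(
        "CREATE TABLE IF NOT EXISTS outbound_queue (
             id          INTEGER PRIMARY KEY,
             instance    TEXT NOT NULL,
             activity_id TEXT NOT NULL,   -- only the id; the body stays in PostgreSQL
             queued_at   INTEGER NOT NULL -- unix timestamp
         )",
        [],
    )?;
    Ok(conn)
}

fn enqueue(conn: &Connection, instance: &str, activity_id: &str) -> Result<()> {
    conn.execute(
        "INSERT INTO outbound_queue (instance, activity_id, queued_at)
         VALUES (?1, ?2, strftime('%s','now'))",
        params![instance, activity_id],
    )?;
    Ok(())
}

// Queue depth per instance: the JSON an operator-facing endpoint could return.
fn queue_sizes_json(conn: &Connection) -> Result<serde_json::Value> {
    let mut stmt =
        conn.prepare("SELECT instance, COUNT(*) FROM outbound_queue GROUP BY instance")?;
    let rows: Vec<(String, i64)> = stmt
        .query_map([], |row| Ok((row.get(0)?, row.get(1)?)))?
        .collect::<Result<_>>()?;
    let entries: Vec<serde_json::Value> = rows
        .into_iter()
        .map(|(instance, queued)| serde_json::json!({ "instance": instance, "queued": queued }))
        .collect();
    Ok(serde_json::Value::Array(entries))
}

fn main() -> Result<()> {
    let conn = open_queue_db("outbound_queue.sqlite3")?;
    enqueue(&conn, "lemmy.ml", "https://example.org/activity/1")?;
    println!("{}", queue_sizes_json(&conn)?);
    Ok(())
}
```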

    Originally posted by RocketDerp in #3427