A very basic check that tries once a day to connect to every known instance. If the connection fails or returns something other than HTTP 200, the instance is marked as dead and no federation activities will be sent to it.
This implementation is really basic: there can be false positives if an instance is temporarily down or unreachable during the check. It also rechecks all known instances every day, even if they have been down for years. Nevertheless it should be a major improvement, and we can add more sophisticated checks later.
Still need to fix two problems mentioned in comments.
Moved the blocklist caching to https://github.com/LemmyNet/lemmy/pull/3486 and decreased cache time to one minute to ensure that changes take effect quickly.
This PR will take more scrutiny and testing, let's leave it for 0.18.2
I've reworked this now to rely on the `updated` column for alive checks. Essentially there is a daily task which tries to connect to all known instances, and if this succeeds, they are marked as updated at that time. When sending out activities, it checks that the instance was updated at most three days ago; otherwise no activities are sent to it. This way it doesn't matter if one or two checks fail.
Right now the code is very messy and needs cleanup/error handling as well as testing.
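As a rough sketch of the gating described above, assuming a hypothetical Instance struct with an `updated` timestamp and chrono for the time math (names are illustrative, not the actual Lemmy schema):

```rust
use chrono::{DateTime, Duration, Utc};

// Hypothetical stand-in for a row from the instance table.
struct Instance {
    domain: String,
    updated: Option<DateTime<Utc>>,
}

/// Only send activities to instances whose last successful check is at
/// most three days old, so a single failed daily check does not cut an
/// instance off.
fn should_send_to(instance: &Instance, now: DateTime<Utc>) -> bool {
    match instance.updated {
        Some(updated) => now - updated <= Duration::days(3),
        None => false,
    }
}
```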
Following up on what @RocketDerp said, I guess it could be restructured the other way around: if you haven't received activity (comments, likes or posts) from a certain instance in X amount of time, actually check whether it is still alive.
This reduces the risk of false positives. Otherwise, effectively 24 hours of defederation would be a heavy punishment for maybe 30-60 seconds of downtime from scheduled backups, updates, etc.
I want to mention that the fetch_local_site_data function is the third most expensive function in the code base and the most expensive one that is actually needed. I'd recommend either that this be merged, or I can create a minimal PR that just caches fetch_local_site_data for a few seconds.
The cache duration can be significantly reduced and still have a huge impact. Even though this function takes only 1 ms, it is called over 1000 times per second on lemmy.world. A cache duration of 5 seconds would be perfectly fine, and even 1 second would be useful.
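As a rough illustration of what such a short-lived cache could look like, here is a sketch assuming the moka crate and a simplified stand-in for the data returned by fetch_local_site_data (the real signature and types in Lemmy differ):

```rust
use std::time::Duration;
use moka::future::Cache;

// Simplified stand-in for the data returned by fetch_local_site_data.
#[derive(Clone)]
struct LocalSiteData {
    blocked_instances: Vec<String>,
    allowed_instances: Vec<String>,
}

// Placeholder for the real database query.
async fn fetch_local_site_data_from_db() -> LocalSiteData {
    LocalSiteData {
        blocked_instances: vec![],
        allowed_instances: vec![],
    }
}

fn build_cache() -> Cache<(), LocalSiteData> {
    Cache::builder()
        .max_capacity(1)
        .time_to_live(Duration::from_secs(5)) // even 1 s helps at ~1000 calls/s
        .build()
}

async fn cached_local_site_data(cache: &Cache<(), LocalSiteData>) -> LocalSiteData {
    // All callers within the TTL window share a single database query.
    cache
        .get_with((), fetch_local_site_data_from_db())
        .await
}
```

Using () as the key makes this effectively a single-value cache, so the TTL expiring is the only way stale data goes away.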
People won’t understand why they’ve blocked an instance yet posts are still coming through, for example.
It should be fairly easy to also update the cache wherever the query updates the database. That won't fix the issue if people are running multiple lemmy_server instances, though.
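A rough sketch of that write-through idea, assuming a shared single-value moka cache and a hypothetical update_blocklist helper (names are illustrative only):

```rust
use moka::future::Cache;

// Hypothetical write path: after persisting a change to the blocklist,
// push the fresh value into the shared single-value cache so readers in
// this process see it immediately. Other lemmy_server processes keep
// their own cache and only catch up once their TTL expires.
async fn update_blocklist(
    cache: &Cache<(), Vec<String>>,
    new_blocklist: Vec<String>,
) {
    // ... write new_blocklist to the database here ...
    cache.insert((), new_blocklist).await;
}
```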
I foresee a lot of problems down the road as we start adding layers of stores and caches on top of each other.
Cache invalidation is definitely a non-trivial problem. The site just being down because it can’t handle millions of queries is arguably a bigger problem though :)
which send traffic out, but block incoming traffic, thus still tying up my federation workers.
The outbound federation code is, I think, basically an engine for denial-of-servicing peers. It takes no account of the fact that it is sending votes, comments, and posts over and over to the same 100 servers; it just blindly queues HTTP transactions with no concern that it is the same host it is already trying to communicate with.
I've managed 1990s e-mail MTAs with almost all the same sending problems, and more traffic in 1999 than I've seen Lemmy do this month, and you have to have awareness of your outbound queue to a particular (familiar/frequent) host. Store-and-forward is what Lemmy needs in order to cope with the variety of different software (Kbin) and low-budget hardware.
Right now the outbound design worries too much about not wasting storage, but gives no consideration to just how much overhead there is in HTTP transactions and in mindlessly opening connections to the exact same server so many times in a short period. It also does not give server operators an API to monitor their queues and activity, hiding background information that is essential for server capacity planning and even for spotting attacks against the content/users of the site.
Community-to-Community replication is a huge amount of content, and a single HTTP transaction per message with all the federation boilerplate and signing is probably doomed. I do not consider the volume of messages as of July 1 to be that high; the crashing servers and maturing smartphone apps have held back a lot of the content - you get fewer replies for every comment that does not get shared.
I encourage something drastic on the outbound queue. I would suggest biting the bullet and making a big change now. Three ideas for a new direction:
- Put it in a SQLite database (don't put more load on PostgreSQL) and at minimum log every new outbound item there, so you can tell when sending to a particular host is backed up. Maybe don't store the individual comments and posts, and only keep their id in the main PostgreSQL along with which instance you need to deliver to.
- Make the MTA part of Lemmy a different server app and service. Queue to the other app.
- "Punt" and face the reality that the huge traffic potential of Community replication doesn't go well with the boilerplate federation JSON structure (bulky overhead), a single HTTP transaction per item, and even the digital-signature overhead. Build a Community-to-Community, Lemmy-to-Lemmy replication agent that uses the front-door API to do posts, comments, and votes. This has to be the majority of the traffic and overhead. The front-end API can already load 300 comments at a time; add some new API paths for accepting bulk input. Now you have a logged-in session and don't have to put a digital signature on each individual comment. This also allows backfill when servers are down or new. I would make it a pull agent: even a non-logged-in user can fetch comments and posts per community (and users subscribe to communities), so you don't need to log in to remote Lemmy servers to pick up new messages (read only).
Drastic, I say. If a Rust programmer is handy, I'd throw in SQLite right now and build some structures to track, per instance, what is queued for outbound delivery. Also build an API to return some JSON on the queue sizes for server operators.
Originally posted by RocketDerp in #3427
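To make the per-instance queue tracking from the quote above concrete, here is a minimal sketch using the rusqlite crate; the table and column names are hypothetical and only meant to illustrate the idea, not Lemmy's actual schema:

```rust
use rusqlite::{Connection, Result};

/// Create a small SQLite-backed outbound log so the sender can tell
/// when deliveries to a particular host are backed up.
fn init_outbound_queue(conn: &Connection) -> Result<()> {
    conn.execute_batch(
        "CREATE TABLE IF NOT EXISTS outbound_activity (
             id          INTEGER PRIMARY KEY,
             instance    TEXT NOT NULL,
             activity_id TEXT NOT NULL,
             queued_at   INTEGER NOT NULL,
             delivered   INTEGER NOT NULL DEFAULT 0
         );
         CREATE INDEX IF NOT EXISTS idx_outbound_instance
             ON outbound_activity (instance, delivered);",
    )
}

/// Pending deliveries per instance, e.g. to expose as JSON for operators.
fn pending_per_instance(conn: &Connection) -> Result<Vec<(String, i64)>> {
    let mut stmt = conn.prepare(
        "SELECT instance, COUNT(*) AS pending
         FROM outbound_activity
         WHERE delivered = 0
         GROUP BY instance
         ORDER BY pending DESC",
    )?;
    let rows = stmt.query_map([], |row| Ok((row.get(0)?, row.get(1)?)))?;
    rows.collect()
}
```

Exposing something like pending_per_instance through an operator-facing endpoint would cover the queue-visibility point without adding load to PostgreSQL.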
-
I want to mention that the fetch_local_site_data
Discussion on this here: https://lemmy.ml/post/1700930
Originally posted by RocketDerp in #3427
Like I said, it was very messy and needed cleanup. I did that now and also did another rework of the code. Most importantly, dead instances and blocklists are now stored in single-value moka caches. Much cleaner than using scheduled tasks to update them.
I also restored scheduled_tasks to the original implementation. However, there is a problem because it checks nodeinfo, which isn't required for ActivityPub and isn't present on some Fediverse instances (e.g. misskey.de). So it needs to fall back to checking that a request to the domain root returns HTTP 200. Also, these requests should really be async.
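A sketch of what such an async check could look like, assuming reqwest; the exact endpoints and error handling in Lemmy will differ:

```rust
use reqwest::Client;

/// Consider an instance alive if its nodeinfo endpoint responds successfully,
/// falling back to checking that the domain root returns HTTP 200, since not
/// all ActivityPub software serves nodeinfo.
async fn instance_alive(client: &Client, domain: &str) -> bool {
    let nodeinfo_url = format!("https://{domain}/.well-known/nodeinfo");
    if let Ok(res) = client.get(&nodeinfo_url).send().await {
        if res.status().is_success() {
            return true;
        }
    }
    match client.get(format!("https://{domain}/")).send().await {
        Ok(res) => res.status() == reqwest::StatusCode::OK,
        Err(_) => false,
    }
}
```

Running these probes concurrently (e.g. with futures::future::join_all) would keep the daily task from stalling on slow or unreachable hosts.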
I would suggest, at minimum, you query the database to determine which instances have delivered you comments in the past 24 hours and exclude them from any dead server check.
Example SQL query:
SELECT SUBSTRING(ap_id FROM '.*://([^/]*)') AS hostname, COUNT(SUBSTRING(ap_id FROM '.*://([^/]*)')) FROM comment WHERE published >= NOW() - INTERVAL '24 HOURS' GROUP BY hostname ORDER BY count DESC;
Originally posted by RocketDerp in #3427