A very basic check that tries once a day to connect to every known instance. If the connection fails or returns something other than HTTP 200, the instance is marked as dead and no federation activities will be sent to it.
This implementation is really basic: there can be false positives if an instance is temporarily down or unreachable during the check. It also rechecks all known instances every day, even if they have been down for years. Nevertheless it should be a major improvement, and we can add more sophisticated checks later.
Still need to fix two problems mentioned in comments.
Moved the blocklist caching to https://github.com/LemmyNet/lemmy/pull/3486 and decreased cache time to one minute to ensure that changes take effect quickly.
This PR will take more scrutiny and testing, let's leave it for 0.18.2
I've reworked this now to rely on the `updated` column for alive checks. Essentially there is a daily task which tries to connect to all known instances, and if this succeeds, they are marked as updated at that time. When sending out activities, it checks that the instance was updated at most three days ago; otherwise no activities are sent to it. This way it doesn't matter if one or two checks fail.
Right now the code is very messy and needs cleanup/error handling as well as testing.
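As a rough sketch of the gating described above, assuming a hypothetical Instance struct with an `updated` timestamp and chrono for the time math (names are illustrative, not the actual Lemmy schema):

```rust
use chrono::{DateTime, Duration, Utc};

// Hypothetical stand-in for a row from the instance table.
struct Instance {
    domain: String,
    updated: Option<DateTime<Utc>>,
}

/// Only send activities to instances whose last successful check is at
/// most three days old, so a single failed daily check does not cut an
/// instance off.
fn should_send_to(instance: &Instance, now: DateTime<Utc>) -> bool {
    match instance.updated {
        Some(updated) => now - updated <= Duration::days(3),
        None => false,
    }
}
```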
Following up on what @RocketDerp said, I guess it could be restructured the other way around: if you haven't received activity (comments, likes or posts) from a certain instance in X amount of time, actually check whether it is still alive.
This reduces the risk of false positives. Otherwise, effectively 24 hours of defederation would be a heavy punishment for maybe 30-60 seconds of downtime from scheduled backups, updates, etc.
I want to mention that the fetch_local_site_data function is the third most expensive function in the code base and the most expensive one that is actually needed. I'd recommend either that this be merged, or I can create a minimal PR that just caches fetch_local_site_data for a few seconds.
The cache duration can be significantly reduced and still have a huge impact. Even though this function takes only 1 ms, it is called over 1000 times per second on lemmy.world. A cache duration of 5 seconds would be perfectly fine, and even 1 second would be useful.
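As a rough illustration of what such a short-lived cache could look like, here is a sketch assuming the moka crate and a simplified stand-in for the data returned by fetch_local_site_data (the real signature and types in Lemmy differ):

```rust
use std::time::Duration;
use moka::future::Cache;

// Simplified stand-in for the data returned by fetch_local_site_data.
#[derive(Clone)]
struct LocalSiteData {
    blocked_instances: Vec<String>,
    allowed_instances: Vec<String>,
}

// Placeholder for the real database query.
async fn fetch_local_site_data_from_db() -> LocalSiteData {
    LocalSiteData {
        blocked_instances: vec![],
        allowed_instances: vec![],
    }
}

fn build_cache() -> Cache<(), LocalSiteData> {
    Cache::builder()
        .max_capacity(1)
        .time_to_live(Duration::from_secs(5)) // even 1 s helps at ~1000 calls/s
        .build()
}

async fn cached_local_site_data(cache: &Cache<(), LocalSiteData>) -> LocalSiteData {
    // All callers within the TTL window share a single database query.
    cache
        .get_with((), fetch_local_site_data_from_db())
        .await
}
```

Using () as the key makes this effectively a single-value cache, so the TTL expiring is the only way stale data goes away.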
People won’t understand why they’ve blocked an instance yet posts are still coming through, for example.
It should be fairly easy to also update the cache wherever the query updates the database. That won't fix the issue if people are running multiple lemmy_server instances, though.
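A rough sketch of that write-through idea, assuming a shared single-value moka cache and a hypothetical update_blocklist helper (names are illustrative only):

```rust
use moka::future::Cache;

// Hypothetical write path: after persisting a change to the blocklist,
// push the fresh value into the shared single-value cache so readers in
// this process see it immediately. Other lemmy_server processes keep
// their own cache and only catch up once their TTL expires.
async fn update_blocklist(
    cache: &Cache<(), Vec<String>>,
    new_blocklist: Vec<String>,
) {
    // ... write new_blocklist to the database here ...
    cache.insert((), new_blocklist).await;
}
```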
I foresee a lot of problems down the road as we start adding layers of stores and caches on top of each other.
Cache invalidation is definitely a non-trivial problem. The site just being down because it can’t handle millions of queries is arguably a bigger problem though :)
which send traffic out, but block incoming traffic, thus still tying up my federation workers.
The outbound federation code is, I think, basically an engine for denial-of-servicing peers. It takes no account of the fact that it is sending votes, comments, and posts over and over to the same 100 servers; it just blindly queues HTTP transactions with no concern that it is the same host it is already trying to communicate with.
I've managed 1990s e-mail MTAs with almost all the same sending problems, and more traffic in 1999 than I've seen Lemmy do this month, and you have to have awareness of your outbound queue to a particular (familiar/frequent) host. Store-and-forward is what Lemmy needs in order to cope with the variety of different software (Kbin) and low-budget hardware.
Right now the outbound design worries too much about not wasting storage, but gives no consideration to just how much overhead there is in HTTP transactions and in mindlessly opening connections to the exact same server so many times in a short period. It also does not give server operators an API to monitor their queues and activity, hiding background information that is essential for server capacity planning and even for spotting attacks against the content/users of the site.
Community-to-Community replication is a huge amount of content, and a single HTTP transaction per message with all the federation boilerplate and signing is probably doomed. I do not consider the volume of messages as of July 1 to be that high; the crashing servers and maturing smartphone apps have held back a lot of the content - you get fewer replies for every comment that does not get shared.
I encourage something drastic on the outbound queue. I would suggest biting the bullet and making a big change now. Three ideas for a new direction:
- Put it in a SQLite database (don't put more load on PostgreSQL) and at minimum log every new outbound item there, so you can tell when sending to a particular host is backed up. Maybe don't store the individual comments and posts, and only keep their id in the main PostgreSQL along with which instance you need to deliver to.
- Make the MTA part of Lemmy a different server app and service. Queue to the other app.
- "Punt" and face the reality that the huge traffic potential of Community replication doesn't go well with the boilerplate federation JSON structure (bulky overhead), a single HTTP transaction per item, and even the digital-signature overhead. Build a Community-to-Community, Lemmy-to-Lemmy replication agent that uses the front-door API to do posts, comments, and votes. This has to be the majority of the traffic and overhead. The front-end API can already load 300 comments at a time; add some new API paths for accepting bulk input. Now you have a logged-in session and don't have to put a digital signature on each individual comment. This also allows backfill when servers are down or new. I would make it a pull agent: even a non-logged-in user can fetch comments and posts per community (and users subscribe to communities), so you don't need to log in to remote Lemmy servers to pick up new messages (read only).
Drastic, I say. If a Rust programmer is handy, I'd throw in SQLite right now and build some structures to track, per instance, what is queued for outbound delivery. Also build an API to return some JSON on the queue sizes for server operators.
Originally posted by RocketDerp in #3427
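To make the per-instance queue tracking from the quote above concrete, here is a minimal sketch using the rusqlite crate; the table and column names are hypothetical and only meant to illustrate the idea, not Lemmy's actual schema:

```rust
use rusqlite::{Connection, Result};

/// Create a small SQLite-backed outbound log so the sender can tell
/// when deliveries to a particular host are backed up.
fn init_outbound_queue(conn: &Connection) -> Result<()> {
    conn.execute_batch(
        "CREATE TABLE IF NOT EXISTS outbound_activity (
             id          INTEGER PRIMARY KEY,
             instance    TEXT NOT NULL,
             activity_id TEXT NOT NULL,
             queued_at   INTEGER NOT NULL,
             delivered   INTEGER NOT NULL DEFAULT 0
         );
         CREATE INDEX IF NOT EXISTS idx_outbound_instance
             ON outbound_activity (instance, delivered);",
    )
}

/// Pending deliveries per instance, e.g. to expose as JSON for operators.
fn pending_per_instance(conn: &Connection) -> Result<Vec<(String, i64)>> {
    let mut stmt = conn.prepare(
        "SELECT instance, COUNT(*) AS pending
         FROM outbound_activity
         WHERE delivered = 0
         GROUP BY instance
         ORDER BY pending DESC",
    )?;
    let rows = stmt.query_map([], |row| Ok((row.get(0)?, row.get(1)?)))?;
    rows.collect()
}
```

Exposing something like pending_per_instance through an operator-facing endpoint would cover the queue-visibility point without adding load to PostgreSQL.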
-
I want to mention that the fetch_local_site_data
Discussion on this here: https://lemmy.ml/post/1700930
Originally posted by RocketDerp in #3427
Like I said, it was very messy and needed cleanup. I did that now and also did another rework of the code. Most importantly, dead instances and blocklists are now stored in single-value moka caches. Much cleaner than using scheduled tasks to update them.
I also restored scheduled_tasks to the original implementation. However, there is a problem because it checks nodeinfo, which isn't required for ActivityPub and isn't present on some Fediverse instances (e.g. misskey.de). So it needs to fall back to checking that a request to the domain root returns HTTP 200. Also, these requests should really be async.
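A sketch of what such an async check could look like, assuming reqwest; the exact endpoints and error handling in Lemmy will differ:

```rust
use reqwest::Client;

/// Consider an instance alive if its nodeinfo endpoint responds successfully,
/// falling back to checking that the domain root returns HTTP 200, since not
/// all ActivityPub software serves nodeinfo.
async fn instance_alive(client: &Client, domain: &str) -> bool {
    let nodeinfo_url = format!("https://{domain}/.well-known/nodeinfo");
    if let Ok(res) = client.get(&nodeinfo_url).send().await {
        if res.status().is_success() {
            return true;
        }
    }
    match client.get(format!("https://{domain}/")).send().await {
        Ok(res) => res.status() == reqwest::StatusCode::OK,
        Err(_) => false,
    }
}
```

Running these probes concurrently (e.g. with futures::future::join_all) would keep the daily task from stalling on slow or unreachable hosts.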
I would suggest, at minimum, you query the database to determine which instances have delivered you comments in the past 24 hours and exclude them from any dead server check.
Example SQL query:
SELECT SUBSTRING(ap_id FROM '.*://([^/]*)') AS hostname, COUNT(SUBSTRING(ap_id FROM '.*://([^/]*)')) FROM comment WHERE published >= NOW() - INTERVAL '24 HOURS' GROUP BY hostname ORDER BY count DESC;
Originally posted by RocketDerp in #3427