This commit changed the implementation for sending outgoing activities. I believe it is responsible for major increases in CPU and RAM usage, as well as client errors, because it can leave up to millions of async tasks active that do nothing but sleep, which likely interferes with the scheduler. I will rework this for 0.18.2.

Unfortunately this brings back the problem of HTTP signatures expiring after only 10 seconds, but that's better than overloaded servers. This change needs to go into 0.18.1.

This reverts commit d6b580a530563d4a2be76d077e015f9aecc75479.

Originally posted by Nutomic in #3466
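
For illustration, here is a minimal sketch of the pattern the commit message describes, with hypothetical names (`Activity`, `queue_delivery`, `send`) that are not Lemmy's actual identifiers: each outgoing activity gets its own detached tokio task that sleeps between retries, so a large delivery backlog turns directly into a huge number of idle tasks sitting in the runtime.

```rust
use std::time::Duration;

// Hypothetical stand-ins for the real activity type and the signed HTTP send.
struct Activity;

async fn send(_activity: &Activity, _inbox: &str) -> Result<(), ()> {
    Ok(()) // placeholder for the signed HTTP POST to the remote inbox
}

// The problematic shape: every queued activity gets its own long-lived task.
// With millions of pending deliveries to slow or dead instances, millions of
// tasks end up doing nothing but waiting on `sleep`.
fn queue_delivery(activity: Activity, inbox: String) {
    tokio::spawn(async move {
        for attempt in 0u32..10 {
            if send(&activity, &inbox).await.is_ok() {
                return;
            }
            // Exponential backoff; the task stays alive for the entire wait.
            tokio::time::sleep(Duration::from_secs(60 * 2u64.pow(attempt))).await;
        }
    });
}
```

Reverting trades this pattern for the previous approach, which is why the 10-second HTTP-signature expiry problem mentioned above comes back.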

  • issue_tracking_bot@lemm.ee (OP) · 1 year ago

    Side-note: there were lots of discussions happening regarding this current scaling issue in several different Matrix channels, Discord, and several GitHub issues. It’s hard to keep track of it all, so I have set up a new Matrix channel exclusively for this topic, in hopes of getting everybody on the same page regarding current plans, results of experiments, etc.

    I have invited both Lemmy maintainers, as well as several folks who have over the past week been providing valuable input in discussions and performance tuning. If anybody else feels they can contribute, then please message me at @sunaurus:matrix.org for an invite.

    Originally posted by sunaurus in #3466

  • issue_tracking_bot@lemm.ee (OP) · 1 year ago

    Hey @Nutomic - this change might not have a net positive effect for the following reasons:

    1. We were already seeing slowdowns on 0.17.4 recently as well
    2. 0.18.1-rc4 has drastically improved federation so far (especially anything coming out of lemmy.world)

    For additional context, I was on a call with Ruud for the lemmy.world upgrade to 0.18.1-rc4. During that upgrade we discovered that there is definitely a major bottleneck somewhere: a single lemmy_server process would only actively use a small number of connections from the DB connection pool while the rest sat idle, and at the same time query run times had increased to 10-15 seconds. Running multiple lemmy_server processes with small connection pools in parallel works around this issue - it immediately and drastically improved performance on lemmy.world.

    @phiresky has a pretty credible theory that spawning unlimited tokio tasks means there is a very low chance of an incoming HTTP request being served immediately - this would be one explanation for slowdowns in the case of large federation queues. From @phiresky in Matrix:

    maybe on the initial queue the timeout should be limited to e.g. 2 seconds and all slow requests immediately moved to the retry queue
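
    A rough sketch of that idea, assuming a generic retry queue and a plain reqwest client (the names below are illustrative, not Lemmy's actual API, and request signing is omitted for brevity): the first attempt gets a short timeout, and anything that does not complete in time is handed off to the retry queue instead of tying up the initial send path.

    ```rust
    use std::time::Duration;
    use tokio::sync::mpsc;

    // Hypothetical delivery type; not Lemmy's actual identifiers.
    struct Delivery {
        inbox: String,
        body: String,
    }

    // Short timeout for the very first delivery attempt.
    const FIRST_TRY_TIMEOUT: Duration = Duration::from_secs(2);

    /// Try a delivery once with a short timeout; on timeout or error, hand it
    /// to the retry queue so the initial send path is freed up immediately.
    async fn try_deliver_once(
        client: &reqwest::Client,
        delivery: Delivery,
        retry_queue: &mpsc::Sender<Delivery>,
    ) {
        let request = client
            .post(delivery.inbox.clone())
            .body(delivery.body.clone())
            .send();

        match tokio::time::timeout(FIRST_TRY_TIMEOUT, request).await {
            // Delivered on the first try.
            Ok(Ok(response)) if response.status().is_success() => {}
            // Timed out or failed: the retry queue can apply longer timeouts
            // and exponential backoff without blocking new outgoing activities.
            _ => {
                let _ = retry_queue.send(delivery).await;
            }
        }
    }
    ```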

    I highly recommend that we try to improve the situation right now with the following mitigations (rather than rolling back to broken federation):

    1. Set a short timeout for outgoing HTTP requests on the first try as per @phiresky’s comment above
    2. Reduce worker limits in production instances to ensure that incoming HTTP requests have better odds of being picked up quickly
    3. Give people guidance on how to horizontally scale Lemmy (I will write up some notes about this in the evening)
    4. Implement some quick hack to stop flooding the retry queue for servers that are totally unresponsive (one possible shape is sketched after this list)
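
    For point 4, one possible shape for such a hack - purely a sketch, with made-up names and thresholds rather than anything Lemmy currently implements - is a per-instance circuit breaker: after a number of consecutive failures, stop enqueueing retries for that instance until a cooldown has passed.

    ```rust
    use std::collections::HashMap;
    use std::time::{Duration, Instant};

    // Illustrative thresholds, not tuned values.
    const FAILURE_THRESHOLD: u32 = 10;
    const COOLDOWN: Duration = Duration::from_secs(3600);

    /// Tracks consecutive delivery failures per remote instance so that
    /// completely unresponsive servers stop generating new retry work.
    #[derive(Default)]
    struct DeadInstanceTracker {
        // domain -> (consecutive failures, time of last failure)
        state: HashMap<String, (u32, Instant)>,
    }

    impl DeadInstanceTracker {
        /// Should a delivery to this domain be queued at all right now?
        fn should_attempt(&self, domain: &str) -> bool {
            match self.state.get(domain) {
                Some((failures, last_failure)) if *failures >= FAILURE_THRESHOLD => {
                    // Only probe a "dead" instance again once the cooldown has passed.
                    last_failure.elapsed() >= COOLDOWN
                }
                _ => true,
            }
        }

        fn record_failure(&mut self, domain: &str) {
            let entry = self
                .state
                .entry(domain.to_string())
                .or_insert((0, Instant::now()));
            entry.0 += 1;
            entry.1 = Instant::now();
        }

        fn record_success(&mut self, domain: &str) {
            self.state.remove(domain);
        }
    }
    ```

    The delivery worker would check `should_attempt` before enqueueing a retry, call `record_failure` on each failed attempt, and `record_success` once a delivery goes through.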

    Originally posted by sunaurus in #3466