• RoundSparrow @ BT@bulletintree.comOPM
    link
    fedilink
    arrow-up
    1
    ·
    11 months ago

    If you read enough EXPLAIN ANALYZE output from PostgreSQL you start to get some ideas…

    PostgreSQL makes estimates of queries before it dives into it. The same concept could be applied to Lemmy sorting orders and page numbers.

    The whole project seem stuck on fringe cases. The reality is that the servers are running slow and overloading with routine fetches of Hot and Active post listings on page number 1, 2, or 3. Sure, some of the apps have speed scrolling to get to page 5 quickly… but there is too much concern out there for how page 970 behaves with a 5 month old posting… and the servers are crashing. Did I mention that the servers have been nonstop crashing since I came around in May?

    Today I’m gong to experiment with doing what PostgreSQL does internally before picking an INDEX or a TABLE scan… some hints to the query where to go. For every sort order, there is a natural way to approach routine conventional paging… the trick is how to apply that to both single community and a blend of many communities.

    • RoundSparrow @ BT@bulletintree.comOPM
      link
      fedilink
      arrow-up
      1
      ·
      11 months ago

      Brainstorming…

      Every combination of community block and subscribe is too much…

      But can we leverage community_aggregates. With all the overhead of updating it on every ++ of post and ++ of comment… can we just write a hint of where page 20 in sorts may fall?

      each community could set a dirty flag on community_aggregates update. count alone can’t be relied upon because it also decrements.

      The hierarchy is important to consider. Every comment is owned by a post, every post is owned by a community, every community is owned by an instance. For Reddit, the user is secondary and their comment and post traditionally can be linked to and read even if the user deletes their account. For Lemmy, removal of content was an early design priority, but think that harms search engine use users and goes against the spirit of what a public forum is about (yes, everyone wants to turn it into private playland, and ignore that hackers break into major websites with full time security teams - and Lemmy being operated by unknown strangers isn’t exactly something they should count on).

      Ok, back on topic. With all the overhead currently in place to update community)_aggregates on each new comment and post, it makes sense to try and filter out the cruft of 5000 communities that may be stale yet bog down every PostgreSQL for posts for the most recent hour. I don’t think I’ve seen one single mention by anyone analyzing the situation that lemmy.world having over 9,000 local communities is a huge factor - as Lemmy was sort of build with “20 communities” in 2020.

    • RoundSparrow @ BT@bulletintree.comOPM
      link
      fedilink
      arrow-up
      1
      ·
      11 months ago

      Brainstorming…

      We know the sort order and the page number we are loading before we start building any SQL statement at all. Regardless of any customized user preferences, these are foundational in terms of filtering out data.

      For /c/All - page number is useful. It is when you get into /c/Subscribed with a blend of communities that the trouble begins.

      There is also the behavior people might expect with “Top 1 hour” and running out of posts to read. On that specific choice, they are likely to accept it… but with Hot and Active…

      So what exactly are Hot and Active… and how far back in time do they go?

      • RoundSparrow @ BT@bulletintree.comOPM
        link
        fedilink
        arrow-up
        1
        ·
        11 months ago

        For post listing, these are the three that aren’t simple to grasp:

              SortType::Active => query
                .then_order_by(post_aggregates::hot_rank_active.desc())
                .then_order_by(post_aggregates::published.desc()),
              SortType::Hot => query
                .then_order_by(post_aggregates::hot_rank.desc())
                .then_order_by(post_aggregates::published.desc()),
              SortType::Controversial => query.then_order_by(post_aggregates::controversy_rank.desc()),
        
        • RoundSparrow @ BT@bulletintree.comOPM
          link
          fedilink
          arrow-up
          1
          ·
          edit-2
          11 months ago
              hot_rank: 1728,
              hot_rank_active: 1728,
          
            -- Note: 1728 is the result of the hot_rank function, with a score of 1, posted now
            -- hot_rank = 10000*log10(1 + 3)/Power(2, 1.8)
          
          • RoundSparrow @ BT@bulletintree.comOPM
            link
            fedilink
            arrow-up
            1
            ·
            11 months ago

            Difference between hot_rank and hot_rank_active

            SET hot_rank = hot_rank(a.score, a.published),
                    hot_rank_active = hot_rank(a.score, a.newest_comment_time_necro)"
            
            
          • RoundSparrow @ BT@bulletintree.comOPM
            link
            fedilink
            arrow-up
            1
            ·
            edit-2
            11 months ago
            lemmy@lemmy_alpha LOG:  duration: 50.220 ms  execute : WITH batch AS (SELECT a.id
            	               FROM post_aggregates a
            	               WHERE a.published > $1 AND (a.hot_rank != 0 OR a.hot_rank_active != 0)
            	               ORDER BY a.published
            	               LIMIT $2
            	               FOR UPDATE SKIP LOCKED)
            	         UPDATE post_aggregates a SET hot_rank = hot_rank(a.score, a.published),
            	         hot_rank_active = hot_rank(a.score, a.newest_comment_time_necro)
            	             FROM batch WHERE a.id = batch.id RETURNING a.published;
            	    
            2023-08-18 09:00:34.578 MST [1877420] lemmy@lemmy_alpha DETAIL: 
             parameters: $1 = '2023-08-16 23:40:31.149267', $2 = '1000'
            
            
            • RoundSparrow @ BT@bulletintree.comOPM
              link
              fedilink
              arrow-up
              1
              ·
              11 months ago
              CREATE FUNCTION public.hot_rank(score numeric, published timestamp without time zone) RETURNS integer
                  LANGUAGE plpgsql IMMUTABLE PARALLEL SAFE
                  AS $$
              DECLARE
                  hours_diff numeric := EXTRACT(EPOCH FROM (timezone('utc', now()) - published)) / 3600;
              BEGIN
                  IF (hours_diff > 0) THEN
                      RETURN floor(10000 * log(greatest (1, score + 3)) / power((hours_diff + 2), 1.8))::integer;
                  ELSE
                      RETURN 0;
                  END IF;
              END;
              $$;
              
              
  • RoundSparrow @ BT@bulletintree.comOPM
    link
    fedilink
    arrow-up
    1
    ·
    11 months ago
    CREATE OR REPLACE FUNCTION post_aggregates_score ()
        RETURNS TRIGGER
        LANGUAGE plpgsql
        AS $$
    BEGIN
        IF (TG_OP = 'INSERT') THEN
            UPDATE
                post_aggregates pa
            SET
                score = score + NEW.score,
                upvotes = CASE WHEN NEW.score = 1 THEN
                    upvotes + 1
                ELSE
                    upvotes
                END,
                downvotes = CASE WHEN NEW.score = - 1 THEN
                    downvotes + 1
                ELSE
                    downvotes
                END,
                controversy_rank = controversy_rank (pa.upvotes + CASE WHEN NEW.score = 1 THEN
                        1
                    ELSE
                        0
                    END::numeric, pa.downvotes + CASE WHEN NEW.score = - 1 THEN
                        1
                    ELSE
                        0
                    END::numeric)
            WHERE
                pa.post_id = NEW.post_id;
        ELSIF (TG_OP = 'DELETE') THEN
            -- Join to post because that post may not exist anymore
            UPDATE
                post_aggregates pa
            SET
                score = score - OLD.score,
                upvotes = CASE WHEN OLD.score = 1 THEN
                    upvotes - 1
                ELSE
                    upvotes
                END,
                downvotes = CASE WHEN OLD.score = - 1 THEN
                    downvotes - 1
                ELSE
                    downvotes
                END,
                controversy_rank = controversy_rank (pa.upvotes + CASE WHEN NEW.score = 1 THEN
                        1
                    ELSE
                        0
                    END::numeric, pa.downvotes + CASE WHEN NEW.score = - 1 THEN
                        1
                    ELSE
                        0
                    END::numeric)
            FROM
                post p
            WHERE
                pa.post_id = p.id
                AND pa.post_id = OLD.post_id;
        END IF;
        RETURN NULL;
    END
    $$;
    
  • RoundSparrow @ BT@bulletintree.comOPM
    link
    fedilink
    arrow-up
    1
    ·
    11 months ago

    It is interesting to ponder why Reddit had a multi-reddit 100 limit and similar for /r/All blocking limit of 100. As subscribed home page didn’t have a 100 limit. What performance reason would be solved by 100 limit if you had to render home pages with 2500 subreddits anyway?

  • RoundSparrow @ BT@bulletintree.comOPM
    link
    fedilink
    arrow-up
    1
    ·
    11 months ago
        query = match options.sort.unwrap_or(SortType::Hot) {
          SortType::Active => query
            .then_order_by(post_aggregates::hot_rank_active.desc())
            .then_order_by(post_aggregates::published.desc()),
          SortType::Hot => query
            .then_order_by(post_aggregates::hot_rank.desc())
            .then_order_by(post_aggregates::published.desc()),
          SortType::Controversial => query.then_order_by(post_aggregates::controversy_rank.desc()),
          SortType::New => query.then_order_by(post_aggregates::published.desc()),
          SortType::Old => query.then_order_by(post_aggregates::published.asc()),
          SortType::NewComments => query.then_order_by(post_aggregates::newest_comment_time.desc()),
          SortType::MostComments => query
            .then_order_by(post_aggregates::comments.desc())
            .then_order_by(post_aggregates::published.desc()),
          SortType::TopAll => query
            .then_order_by(post_aggregates::score.desc())
            .then_order_by(post_aggregates::published.desc()),
          SortType::TopYear => query
            .filter(post_aggregates::published.gt(now - 1.years()))
            .then_order_by(post_aggregates::score.desc())
            .then_order_by(post_aggregates::published.desc()),
          SortType::TopMonth => query
            .filter(post_aggregates::published.gt(now - 1.months()))
            .then_order_by(post_aggregates::score.desc())
            .then_order_by(post_aggregates::published.desc()),
          SortType::TopWeek => query
            .filter(post_aggregates::published.gt(now - 1.weeks()))
            .then_order_by(post_aggregates::score.desc())
            .then_order_by(post_aggregates::published.desc()),
          SortType::TopDay => query
            .filter(post_aggregates::published.gt(now - 1.days()))
            .then_order_by(post_aggregates::score.desc())
            .then_order_by(post_aggregates::published.desc()),
          SortType::TopHour => query
            .filter(post_aggregates::published.gt(now - 1.hours()))
            .then_order_by(post_aggregates::score.desc())
            .then_order_by(post_aggregates::published.desc()),
          SortType::TopSixHour => query
            .filter(post_aggregates::published.gt(now - 6.hours()))
            .then_order_by(post_aggregates::score.desc())
            .then_order_by(post_aggregates::published.desc()),
          SortType::TopTwelveHour => query
            .filter(post_aggregates::published.gt(now - 12.hours()))
            .then_order_by(post_aggregates::score.desc())
            .then_order_by(post_aggregates::published.desc()),
          SortType::TopThreeMonths => query
            .filter(post_aggregates::published.gt(now - 3.months()))
            .then_order_by(post_aggregates::score.desc())
            .then_order_by(post_aggregates::published.desc()),
          SortType::TopSixMonths => query
            .filter(post_aggregates::published.gt(now - 6.months()))
            .then_order_by(post_aggregates::score.desc())
            .then_order_by(post_aggregates::published.desc()),
          SortType::TopNineMonths => query
            .filter(post_aggregates::published.gt(now - 9.months()))
            .then_order_by(post_aggregates::score.desc())
            .then_order_by(post_aggregates::published.desc()),
        };
    
    • RoundSparrow @ BT@bulletintree.comOPM
      link
      fedilink
      arrow-up
      1
      ·
      edit-2
      11 months ago

      With performance enhancement in mind…

      when listing a single community, filtering logic should be more open-ended, especially if community has little content.

      Changes:

      1. Controversial should be consistent with Hot and Active that when it runs out of data it falls back to published DESC.
      2. This makes “published DESC” a consistent sorting pattern for every situation except “Old” and “NewComments”.
          query = match options.sort.unwrap_or(SortType::Hot) {
            SortType::Active => query
              .then_order_by(post_aggregates::hot_rank_active.desc())
              .then_order_by(post_aggregates::published.desc()),
            SortType::Hot => query
              .then_order_by(post_aggregates::hot_rank.desc())
              .then_order_by(post_aggregates::published.desc()),
            SortType::Controversial => query
              .then_order_by(post_aggregates::controversy_rank.desc())
              .then_order_by(post_aggregates::published.desc()),
            SortType::New => query.then_order_by(post_aggregates::published.desc()),
            SortType::Old => query.then_order_by(post_aggregates::published.asc()),
            SortType::NewComments => query.then_order_by(post_aggregates::newest_comment_time.desc()),
            SortType::MostComments => query
              .then_order_by(post_aggregates::comments.desc())
              .then_order_by(post_aggregates::published.desc()),
            SortType::TopAll => query
              .then_order_by(post_aggregates::score.desc())
              .then_order_by(post_aggregates::published.desc()),
            SortType::TopYear => query
              .filter(post_aggregates::published.gt(now - 1.years()))
              .then_order_by(post_aggregates::score.desc())
              .then_order_by(post_aggregates::published.desc()),
            SortType::TopMonth => query
              .filter(post_aggregates::published.gt(now - 1.months()))
              .then_order_by(post_aggregates::score.desc())
              .then_order_by(post_aggregates::published.desc()),
            SortType::TopWeek => query
              .filter(post_aggregates::published.gt(now - 1.weeks()))
              .then_order_by(post_aggregates::score.desc())
              .then_order_by(post_aggregates::published.desc()),
            SortType::TopDay => query
              .filter(post_aggregates::published.gt(now - 1.days()))
              .then_order_by(post_aggregates::score.desc())
              .then_order_by(post_aggregates::published.desc()),
            SortType::TopHour => query
              .filter(post_aggregates::published.gt(now - 1.hours()))
              .then_order_by(post_aggregates::score.desc())
              .then_order_by(post_aggregates::published.desc()),
            SortType::TopSixHour => query
              .filter(post_aggregates::published.gt(now - 6.hours()))
              .then_order_by(post_aggregates::score.desc())
              .then_order_by(post_aggregates::published.desc()),
            SortType::TopTwelveHour => query
              .filter(post_aggregates::published.gt(now - 12.hours()))
              .then_order_by(post_aggregates::score.desc())
              .then_order_by(post_aggregates::published.desc()),
            SortType::TopThreeMonths => query
              .filter(post_aggregates::published.gt(now - 3.months()))
              .then_order_by(post_aggregates::score.desc())
              .then_order_by(post_aggregates::published.desc()),
            SortType::TopSixMonths => query
              .filter(post_aggregates::published.gt(now - 6.months()))
              .then_order_by(post_aggregates::score.desc())
              .then_order_by(post_aggregates::published.desc()),
            SortType::TopNineMonths => query
              .filter(post_aggregates::published.gt(now - 9.months()))
              .then_order_by(post_aggregates::score.desc())
              .then_order_by(post_aggregates::published.desc()),
          };
      
  • RoundSparrow @ BT@bulletintree.comOPM
    link
    fedilink
    arrow-up
    1
    ·
    11 months ago
     hot_rank_now_minus1second_score_250 |        current_now         
    -------------------------------------+----------------------------
                                    6899 | 2023-08-18 19:02:54.807922
    (1 row)
    
     hot_rank_now_minus1day_score_250 |        current_now         
    ----------------------------------+----------------------------
                                   68 | 2023-08-17 19:02:55.808032
    (1 row)
    
     hot_rank_now_minus2day_score_250 |        current_now         
    ----------------------------------+----------------------------
                                   21 | 2023-08-16 19:02:55.808132
    (1 row)
    
     hot_rank_now_minus3day_score_250 |        current_now        
    ----------------------------------+---------------------------
                                   10 | 2023-08-15 19:02:55.80823
    (1 row)
    
     hot_rank_now_minus6day_score_250 |        current_now         
    ----------------------------------+----------------------------
                                    3 | 2023-08-12 19:02:55.808326
    (1 row)
    
     hot_rank_now_minus9day_score_250 |        current_now         
    ----------------------------------+----------------------------
                                    1 | 2023-08-09 19:02:55.808419
    (1 row)
    
     hot_rank_now_minus12day_score_250 |        current_now         
    -----------------------------------+----------------------------
                                     0 | 2023-08-06 19:02:55.808512
    (1 row)
    
     hot_rank_now_minus22day_score_250 |        current_now         
    -----------------------------------+----------------------------
                                     0 | 2023-07-27 19:02:55.808606
    (1 row)
    

    from

    executing this:

    -- when does hot_rank go flat like old soda?
    
    SELECT hot_rank(250::numeric, timezone('utc', now() - interval '1 second')) AS hot_rank_now_minus1second_score_250, timezone('utc', now() - interval '1 second') AS current_now;
    
    SELECT hot_rank(250::numeric, timezone('utc', now() - interval '1 day')) AS hot_rank_now_minus1day_score_250, timezone('utc', now() - interval '1 day') AS current_now;
    
    SELECT hot_rank(250::numeric, timezone('utc', now() - interval '2 day')) AS hot_rank_now_minus2day_score_250, timezone('utc', now() - interval '2 day') AS current_now;
    
    SELECT hot_rank(250::numeric, timezone('utc', now() - interval '3 day')) AS hot_rank_now_minus3day_score_250, timezone('utc', now() - interval '3 day') AS current_now;
    
    SELECT hot_rank(250::numeric, timezone('utc', now() - interval '6 day')) AS hot_rank_now_minus6day_score_250, timezone('utc', now() - interval '6 day') AS current_now;
    
    SELECT hot_rank(250::numeric, timezone('utc', now() - interval '9 day')) AS hot_rank_now_minus9day_score_250, timezone('utc', now() - interval '9 day') AS current_now;
    
    SELECT hot_rank(250::numeric, timezone('utc', now() - interval '12 day')) AS hot_rank_now_minus12day_score_250, timezone('utc', now() - interval '12 day') AS current_now;
    
    SELECT hot_rank(250::numeric, timezone('utc', now() - interval '22 day')) AS hot_rank_now_minus22day_score_250, timezone('utc', now() - interval '22 day') AS current_now;