Observability is typically a hard problem. Looking for the path of least resistance, I stumbled upon Coroot, an eBPF-based observability tool that integrates with Prometheus, Pyroscope, ClickHouse, and OpenTelemetry to provide a truly useful interface for observing application behavior.
I will start with the Application Overview.
When you open Coroot, you will see a map like this (may need to right click -> open image in new tab):
This is useful to get a sense of the flow of traffic from ingress to the database and backend services, like pictrs. Without reading the Ananace chart, we can see that there is an intermediary proxy (probably that crazy nginx config provided by the Lemmy project). There is also the Lemmy frontend in its own container, then the Lemmy API server (the Rust component), then pictrs and PostgreSQL. We also see the health of mlmym, which shows up as a simple standalone web service. This is because of the way mlmym accesses Lemmy, and the fact that its egress traffic isn’t captured well by Coroot here.
Clicking on any component brings me to the Application page, where I can see facts and data about that object. For example, if I click the Lemmy backend service, I can see exactly where it gets its traffic from, and what it depends on. In this case, it receives traffic from the Lemmy proxy and the Lemmy UI, and depends on pictrs and PostgreSQL. I can also see basic statistics such as latency, requests/sec, HTTP error rate, and whether or not we’re meeting our availability objectives (may need to right click -> open image in new tab):
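To make those statistics a bit more concrete, here is a minimal sketch of how requests/sec, error rate, and an availability objective relate to each other. This is not how Coroot computes them internally; the `WindowStats` type, the `summarize` function, the sample numbers, and the 99% objective are all assumptions I made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    """Request counts observed over one time window (hypothetical shape)."""
    total_requests: int
    failed_requests: int  # e.g. responses with an HTTP 5xx status
    window_seconds: float


def summarize(stats: WindowStats, availability_objective: float = 0.99) -> dict:
    """Derive requests/sec, error rate, and whether the objective is met."""
    if stats.total_requests == 0:
        # No traffic in the window: nothing failed, so treat it as available.
        return {
            "requests_per_sec": 0.0,
            "error_rate": 0.0,
            "availability": 1.0,
            "objective_met": True,
        }
    error_rate = stats.failed_requests / stats.total_requests
    availability = 1.0 - error_rate
    return {
        "requests_per_sec": stats.total_requests / stats.window_seconds,
        "error_rate": error_rate,
        "availability": availability,
        "objective_met": availability >= availability_objective,
    }


# Made-up numbers: 12,000 requests in 10 minutes, 60 of them failed.
print(summarize(WindowStats(total_requests=12000, failed_requests=60, window_seconds=600)))
```

With these made-up numbers the error rate is 0.5%, so a 99% availability objective would be met, while a 99.9% objective would not.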
From this page, I can also click on Instances to see the health of the Kubernetes node running the Lemmy API pod:
I can also see CPU and Memory metrics in the other tabs:
Coroot even parses logs and will alert me about new errors:
In the deployment tab, Coroot can tell me whether the Lemmy Deployment is healthy or in a CrashLoopBackOff or Error state:
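As a rough mental model of that check, here is a small sketch that classifies a Deployment from the state of its pods. It is only an illustration under my own assumptions: the `deployment_health` function and the simplified pod dictionaries (`ready`, `reason`) are hypothetical, and this is neither the Kubernetes API nor Coroot's actual logic.

```python
# Container state reasons that this sketch treats as "failing".
FAILING_REASONS = {"CrashLoopBackOff", "Error"}


def deployment_health(desired: int, pods: list[dict]) -> str:
    """Classify a Deployment from simplified, hypothetical pod records.

    Each pod is a dict with an optional 'ready' bool and an optional
    'reason' string describing why its container is not running.
    """
    failing = sorted({p["reason"] for p in pods if p.get("reason") in FAILING_REASONS})
    if failing:
        return "failing: " + ", ".join(failing)
    ready = sum(1 for p in pods if p.get("ready"))
    if ready >= desired:
        return "healthy"
    return f"degraded: {ready}/{desired} ready"


print(deployment_health(1, [{"ready": True, "reason": None}]))
print(deployment_health(1, [{"ready": False, "reason": "CrashLoopBackOff"}]))
```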
Also, since Coroot is eBPF-based, it can trace application binaries down to the syscall, and present the result as a flame graph. I can’t stress enough how useful this is for live production systems:
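If you have not read a flame graph before, it helps to know what goes into one: a profiler samples call stacks, identical stacks are counted, and each frame's width is proportional to how many samples it appears in. The sketch below aggregates samples into the semicolon-separated "folded stack" text format that flame graph tools commonly consume. The `fold_stacks` helper and the function names in the samples are invented for illustration; they are not real Lemmy symbols, and this is not Coroot's profiling pipeline.

```python
from collections import Counter


def fold_stacks(samples: list[list[str]]) -> list[str]:
    """Count identical call stacks and emit 'frame;frame;frame count' lines."""
    counts = Counter(";".join(stack) for stack in samples)
    return [f"{stack} {n}" for stack, n in sorted(counts.items())]


# Each sample is one captured call stack, outermost frame first (made-up names).
samples = [
    ["main", "handle_request", "query_db"],
    ["main", "handle_request", "query_db"],
    ["main", "handle_request", "render"],
    ["main", "idle"],
]

for line in fold_stacks(samples):
    print(line)
# main;handle_request;query_db 2
# main;handle_request;render 1
# main;idle 1
```

Here `query_db` appears in half of the samples, so it would take up half the width of the graph, which is exactly the kind of hotspot you look for.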
And finally, in the tracing tab, I can see traces for HTTP calls that failed, including details from those traces:
But wait, there’s more. If I click the PostgreSQL deployment, profiling and tracing also work there, down to the query level:
This is very powerful for a free-forever tool, so if you’re running Lemmy in Kubernetes I highly recommend it.