Building a Dry-Run Mode for the OpenTelemetry Collector

2026-03-17 · 1638 words · 8 min read

Teams continuously deploy programmable telemetry pipelines to production without access to a dry-run mode. At the same time, most organizations lack staging environments that resemble production, especially with regard to observability and other platform-level services. Teams know the risks of working without a proper safety harness, but most have no safe alternative for determining what telemetry they can cut to improve their signal-to-noise ratio without missing something important.

So, teams resort to experimenting on live traffic.

How I built this page

2026-02-23 · 438 words · 3 min read

Development AI Web Writing

This site has needed a facelift for years. Not because the technology was outdated, but because every previous version of this blog eventually died. Quietly.

I’ve started blogs before. Many of them. They all followed the same lifecycle: excitement → a few posts → silence. Over the years, those genuine attempts quietly turned into a blog graveyard. Apparently, enthusiasm alone isn’t a sustainable publishing system.

Observability is becoming mission critical, but who watches the watchmen?

2022-09-14 · 1317 words · 7 min read

Observability SRE Monitoring

Over the last couple of years, there has been a lot of development aimed at lowering the barrier to entry for observability. There are now quite a few reasonably mature options that let you set up a good monitoring stack with either a few clicks or a few one-liners in the terminal.

In the managed open-source space, the most successful one so far is probably Grafana Cloud, but there is definitely no shortage of closed-source vendors providing APM solutions where all you need to get started is to drop one or more agents into your cluster or onto your machine.

Error Economics - How to avoid breaking the budget

2021-08-27 · 1784 words · 9 min read

SLOs Reliability Performance Site-Reliability Engineering

At SLOConf 2021, I talked about how we can use error budgets to add pass/fail criteria to the reliability tests we run as part of our CI pipelines.

As Site Reliability Engineers, one of our primary goals is to reduce manual labor, or toil, to a minimum while at the same time keeping the systems we manage as reliable and available as possible. To be able to do this in a safe way, it’s really important that we’re able to easily inspect the state of the system.