On Observability

An Observability Picture, by Matthew Heusser, August 2019

Imagine, for a moment, that you bought a home door lock. From your phone, you press the unlock button, and wait.

Fifteen seconds pass.

Thirty seconds pass.

At about sixty seconds, you get a pop-up telling you that your unlock message failed to send.

Back in the real world, fifteen seconds later, you hear a snicker-snack as your door actually unlocks.

What. Just. Happened?

You don’t know.

So you call customer support. “Hello, my name is <Your_name/>. I’m using your internet door lock product; my userid is HOSER12345. I was looking at my phone at exactly 2:34PM and clicked unlock …” and you tell the story.

The customer service rep replies “Huh. That is odd. We’ll look into it … Weird. Try again. Oh, you don’t need to try again, the door is unlocked? Okay. We’ll call you back.”

No one ever calls you back.

Again: What Just Happened?

You still don’t know. You slowly come to realize that they don’t know either.

This is a very real problem, one I see with websites, mobile apps, and other internet of things devices. At Excelon we’ve seen it recently on real projects. In modern terms, this is a lack of observability.

Let’s talk.

TOAD and SRE

There are two emerging roles/discussions growing up around this problem. SRE, Site Reliability Engineer (or perhaps Software Reliability Engineer), is a role created by Google to keep large infrastructure projects running. It is one of those jobs that, like release engineer, everybody knows but few people can describe well. Personally, I frequently find definitions of the type "release manager: do whatever it takes to keep releases happening," or "reliability engineer: do whatever it takes to keep the site up." In the worst form, I see definitions such as "Test Data Management: the management of the test data." I call these hollow definitions because they are, well, hollow. They have the form of meaning, but there is nothing inside of them.

I can think of a few ways we might improve a hollow definition. First, we can ignore it, which gives us the freedom to do whatever we would like to accomplish the objective while ducking any supervision or accountability. Or we could search for an operating model. There are a few different ways to "do" SRE; Google's model and Netflix's model may be the two most well-known.

Finally, you can interview people actually doing the work (or do it yourself), and build your own model.

I’m not quite sure of the exact history, but one of our peers, Noah Sussman, has gone and done just that, using the term TOAD – Testing, Observability and DevOps. Chris McMahon, who I worked with a decade ago at Socialtext, has started a series on it, now up to five parts (One, Two, Three, Four, Five) on his blog. I’ve known Noah almost as long as Chris, and frankly, I find TOAD filling in a fair number of gaps in the SRE literature. I had just started some work on SRE when I found TOAD. Perhaps SRE is the what and TOAD is the how. It is early days; we haven’t figured out who does the work or how to structure it. The field is a great wild west of experiments and possibility.

Today, the small contribution I can make to TOAD is on the O: Observability.

My small contribution is, of course, asking what is going on with the door lock for that pesky user HOSER12345. If we are "doing observability," we should probably be able to figure that out.

Let’s put that front and center.

On Observability

We could go to the Wikipedia definition of observability, or even the definition of software observability. Personally, I’d prefer to use my plain-spoken, folksy, arguably imprecise language to say: Hey man, if you can’t debug a function call to figure out where it went wrong, it ain’t that observable.

The folks at Twitter have a two-part series on observability that breaks it down for software into four parts:

  • Monitoring
  • Alerting/visualization
  • Distributed systems tracing infrastructure
  • Log aggregation/analytics

Personally, I’d cover alerting separately from visualization and make it five points. For log aggregation, I’d suggest Splunk or a tool like it. Throwing the logs and other unstructured data into something with a query language behind it, and then writing queries, is a good place to start; other days we could expand on that. For visualization there are tools like Tableau.
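To make that concrete, here is a toy sketch of the idea, not Splunk itself and not any real vendor's logs. The log lines, event names, and the userid are invented for illustration; the point is simply that once logs are structured and queryable, "what happened to this user at 2:34PM" becomes a one-line question.

```python
import json

# Hypothetical structured log lines, one JSON object per line -- the sort of
# data you would throw into Splunk or a tool like it. All values are invented.
RAW_LOGS = """\
{"ts": "2019-08-01T14:34:02Z", "userid": "HOSER12345", "event": "unlock_requested", "source": "phone_app"}
{"ts": "2019-08-01T14:35:01Z", "userid": "HOSER12345", "event": "unlock_timeout", "source": "phone_app"}
{"ts": "2019-08-01T14:35:16Z", "userid": "HOSER12345", "event": "unlock_executed", "source": "lock_unit"}
"""

def query(raw, **filters):
    """Poor man's log query: parse each line, keep records matching every filter."""
    records = [json.loads(line) for line in raw.strip().splitlines()]
    return [r for r in records if all(r.get(k) == v for k, v in filters.items())]

# "Show me everything this user did" -- the question customer support
# could not answer about the door lock.
events = query(RAW_LOGS, userid="HOSER12345")
for e in events:
    print(e["ts"], e["event"], e["source"])
```

Run against these three fake records, the query surfaces the whole story at a glance: a request, a timeout reported to the phone, and a late execution at the lock.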

What we don’t talk about enough is “distributed systems tracing infrastructure.” To me, of this list, that would be the thing that gets me the ability to figure out what happened to my API call.

And it is a complex API call. We need the ability to trace the call from my phone to the internet, to the vendor’s servers, to my cable modem, to my router, to my base unit, to my wireless electronic lock, and back. Before you can say "that’s more complex than it needs to be," let me tell you, what actually happens could be more complex.

What I want to see is a stack trace of API calls: something like a waterfall chart on the front end, or a "stack trace" for a crashing piece of software on the back end. I want them all, and I want them to make sense.
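The core trick behind that kind of trace is simple enough to sketch: mint one trace id at the edge (the phone), pass it through every hop, and have each hop record a span against it. This is a minimal toy version of the idea, not a real tracing system; the service names and call chain stand in for the phone-to-vendor-to-base-unit-to-lock path.

```python
import time
import uuid

# Collected spans -- in a real system these would ship to a tracing backend.
TRACE_LOG = []

def traced(service):
    """Decorator: record a span (trace id, service name, elapsed time) per call."""
    def wrap(fn):
        def inner(trace_id, *args, **kwargs):
            start = time.monotonic()
            result = fn(trace_id, *args, **kwargs)
            TRACE_LOG.append({"trace_id": trace_id, "service": service,
                              "elapsed": time.monotonic() - start})
            return result
        return inner
    return wrap

# Hypothetical hops in the door-lock call chain, innermost first.
@traced("lock_unit")
def unlock_lock(trace_id):
    return "unlocked"

@traced("base_unit")
def relay_to_lock(trace_id):
    return unlock_lock(trace_id)

@traced("vendor_api")
def handle_unlock(trace_id):
    return relay_to_lock(trace_id)

# The phone app mints one trace id; every hop logs a span against it, so a
# single query over TRACE_LOG reconstructs the whole waterfall for one call.
trace_id = str(uuid.uuid4())
handle_unlock(trace_id)
for span in TRACE_LOG:
    print(span["service"], span["trace_id"][:8], f'{span["elapsed"]:.4f}s')
```

With that one shared id, the support rep's question stops being "huh, that is odd" and becomes "pull every span for this trace id and see which hop sat on the message for seventy-five seconds."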

I want to observe them, you see.

That is part of observability. It is an important part.

And it is often missed.

That’s just my tiny contribution to TOAD, at least for today.

More to come.

PS: If this website was really observant, I could replace <Your_name/> with your actual name. But that would probably freak you out and get you into a whole conversation about privacy. So, let’s call that a different kind of observability, and save it for another day. 
