Validate your assumptions
Last week I presented you with a coding challenge. This week, I’m going to follow up and show you one of my favorite software development techniques.
First, here are the solutions I got, in the order in which they arrived:
I included the lines of code count, to show the vast size difference between the solutions, from 39 to 207. Some of that variance is a matter of coding style, some a choice of language features (e.g. callbacks vs promises). But the solutions also all differ in terms of complexity, and how they operate. And that’s because each solution expresses a different point of view.
That’s what I hoped we’d see from these submissions, so thank you all for participating.
Assuming that …
I think it’s fairly obvious how the code works, so I’m not going to talk about that. I will show you, though, how easily different developers can arrive at different solutions to the very same problem. And in doing so, I’ll illustrate a technique I depend on to write better software.
It’s a very simple technique, and consists of three steps:
- Write down your assumptions
- Validate your assumptions
- Write the code
I use pen and paper, notes app, or code comments, whichever is easily accessible. The point is not to document anything (that comes later), but to make sure I’m building the right thing.
Writing down my assumptions takes almost no time. It does make them visible. That forces me to acknowledge and review my assumptions. It’s easy to spot a false assumption when it’s right in front of you. Most often validation is just asking myself “does this make sense?”
It’s also a way to communicate design choices in pull requests.
The opposite of this approach is writing code based on unexamined assumptions. There’s always a gap between the problem definition, no matter how specific it is, and the working code. That gap is always filled by assumptions; whether they’re implicit or explicit is your choice.
The implicit assumptions, unexamined, are not subject to review. They’re not accountable to being correct. And so they have this tendency to favor convenience — as in writing less code — rather than leading us towards better code, able to deal with the realities of the world.
I’m going to illustrate this by example. In each case, I’m going to start out with a convenient assumption, the one I wish was true, and then I’m going to examine it, and change as necessary. You can trace the results to the working code.
For Example …
So let’s start with the first assumption:
Assumption #1: if I have a result, I can tell if it’s correct.
This is an easy one. I run the script, I check the results, I do that a couple of times, and I can certify without a doubt whether the code Works On My Machine™.
But usually I write software to automate things, and this assumption is not particularly favorable to automation, because it requires a human in the loop. So imagine that this script is going to run daily at 4AM. Do you want to be there to verify the results?
So this assumption failed validation, let’s change it:
Assumption #1’: I’m not around to tell if a result is correct.
This new assumption is going to lead me to a different implementation. Specifically, it’s forcing me to pick a failure mode: given the choice between returning the wrong result, and no result, my script chooses to return no result.
From here on, I am going to put much emphasis on the correctness of the result.
Ok, let’s move on to the next assumption:
Assumption #2: the network is always reliable.
This is a convenient assumption, it simplifies our code. Immensely. Imagine all the work you don’t have to do, if there are no network errors to deal with.
Back in the real world:
Assumption #2': the network is not reliable.
Anything that can go wrong, will go wrong. Finding the IP address of a server? Can fail. Having an IP address but no route to the server? Happens. Have a route, but can't establish a connection? All the time (especially on mobile). Connected, but server doesn't understand our request? Yes, that too. Server got the request, never sends a response? Too often, especially in Node-land.
We’ll get to talk about error handling later, for now, let’s review the next assumption:
Assumption #3: the network has no latency.
I would love to live in a world where web pages load instantly, videos never buffer, texts always arrive when you send them. And I’m not the only one, judging by how much code I see that's based on the assumption that networks have no latency.
Assumption #3': the network has unpredictable latency.
Latency can be variable. That means any value. Could it take 30 seconds to get a response? Sure, that could happen. But, if the response tells the time, what time is it telling? Now? 30 seconds ago? 15 seconds ago?
We can make a few guesses, like maybe “15 seconds ago” because it takes 15 seconds for the request to hit the server, and 15 seconds for the response to come back, which is based on the assumption that the latency is actually quite predictable (it doesn’t fluctuate), which … do I need to explain why not?
My take: if it takes too long to get the response, there’s no valuable data there, so we’ll ignore it. If we’re going to ignore the result, why even wait for it? So I’m going to force all requests to time out after half a second. As an added bonus, my script finishes in less than a second.
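Here’s a rough sketch of what that might look like in Python, using only the standard library. The function name `fetch_date_header` and the constant are my own placeholders, not taken from any of the submitted solutions, and treating every kind of failure the same way is an assumption. One caveat: the `timeout` passed to `urlopen` applies to individual blocking socket operations rather than the whole request, so it only approximates a hard half-second deadline.

```python
import http.client
import urllib.request

TIMEOUT_SECONDS = 0.5  # the half-second budget from the text


def fetch_date_header(url, timeout=TIMEOUT_SECONDS):
    """Return the raw Date header from `url`, or None if anything goes wrong."""
    try:
        # HEAD is enough: we only care about the response headers.
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.headers.get("Date")
    except (OSError, http.client.HTTPException, ValueError):
        # DNS failure, refused connection, timeout, HTTP error status,
        # malformed response, or malformed URL: all count as "no answer".
        return None
```

Returning `None` for every failure keeps the caller simple: a server either gave us a header to look at, or it didn’t.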
Moving on. I picked the Date header because it’s one of the simplest headers in the HTTP spec. It’s hard to get this one wrong:
Assumption #4: servers always implement the spec correctly.
But software developers are a very creative bunch. For example, if you make a request to http://github.com, it will redirect you to a different URL, and that redirect response is missing the Date header. So there’s that. And that’s just one example.
Assumption #4': servers deviate from the spec in every possible way.
Now we’re talking from firsthand experience. You see, I don’t always remember every nuanced part of the spec, and sometimes the code I write deviates from the spec. Even specs I wrote myself.
Anyway that’s easy to solve, we just add code to deal with a header that’s not there, and headers that are there but don’t have a usable value. Basically, another form of unreliable server, along with network errors and request timeouts.
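As a minimal sketch of that defensive parsing, assuming Python and its standard `email.utils.parsedate_to_datetime` (HTTP dates share the email date format), something like the following would do. The name `parse_date_header` is hypothetical, and returning `None` for anything unusable is my choice of convention.

```python
from datetime import timezone
from email.utils import parsedate_to_datetime


def parse_date_header(value):
    """Convert an HTTP Date header to a Unix timestamp, or None if unusable."""
    if not value:
        return None  # header missing or empty
    try:
        parsed = parsedate_to_datetime(value)
    except (TypeError, ValueError, IndexError):
        return None  # header present, but not a parseable date
    if parsed.tzinfo is None:
        # HTTP dates are defined in GMT; treat a zone-less value as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.timestamp()


print(parse_date_header("Tue, 11 Aug 2015 00:00:00 GMT"))  # 1439251200.0
print(parse_date_header("yesterday-ish"))                  # None
```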
Let’s try three more assumptions, in succession:
Assumption #5: all servers have the same time.
Assumption #6: time is static.
Assumption #7: servers don’t lie.
We know servers don’t all have the exact same time, clocks are never perfectly synchronized.
Even if the clocks were perfectly synchronized, we know time is not static because race conditions! Time moves forward, the network has latency, so our requests all arrive at different times, and so the responses contain different time values.
Time can also go backwards, but for this particular case it makes no difference, so we can go with the simplified assumption.
Last thing, our computer’s clock is broken, so how come we’re quick to trust some other computer’s clock?
Assumption #5': servers have different time.
Assumption #6': time moves forward.
Assumption #7': servers lie.
So far we’ve dealt with assumptions that are universal truths. Or universal falsehoods, if you prefer. I started with these because they are so obvious that they help drive home the point of how articulating and validating our assumptions leads us to better code.
Next, we’re going to try a few assumptions more specific to this particular project.
Quick recap: we’re going to collect results from some servers, maybe not all of them, the results are not going to be identical, some may even be wrong.
Assumption #8: most servers can tell the time, those that don’t are independently incorrect.
This is a bold assumption, but it’s also not too difficult to validate. Run a process over a long period of time and measure the results.
What we’re going to find is that indeed, most servers have a working clock, it gets synchronized often enough, it drifts ever so slightly. We’re also going to find out there’s low probability that multiple servers will be wrong at the same time, or will report the same wrong time (wrong times are typically all over the place).
So while we can’t trust any one server with the time it reports, we can trust a plurality of servers with very high certainty. So validated, let’s simplify our assumption:
Assumption #8’: the majority of servers will return the correct time.
Easy, now we need to figure out what the majority of servers are telling us. We can, perhaps, find the average value?
Assumption #9: to find the average value, calculate the mean.
As you may know, there are three types of averages. When you give people a set of numbers and ask them for the average, they tend to choose the median (half way value) or the mode (most popular value). That’s our conception of average. We know it when we see it.
When you ask the same group of people to write an algorithm that finds the average, strangely enough, you end up with the arithmetic mean. I find this very curious, in particular because it’s so easy to prove that the mean gives the wrong result.
Let’s say most servers report the date August 11th 2015, which happens to be today, and one server reports January 1st 1970. What’s the mean time? I got October 2004.
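To see the effect with concrete numbers, here’s a small Python sketch. The mix of three good servers and one broken one is an assumption I picked for illustration; the exact date the mean lands on depends on how many servers are in the mix, but it will always be dragged years away from the truth.

```python
from datetime import datetime, timezone
from statistics import mean, median

good = datetime(2015, 8, 11, tzinfo=timezone.utc).timestamp()
broken = 0.0  # January 1st 1970, the Unix epoch

# Assumed sample: three servers agree, one reports the epoch.
reports = [good, good, good, broken]


def to_date(timestamp):
    """Render a Unix timestamp as a UTC calendar date."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc).date()


mean_date = to_date(mean(reports))
median_date = to_date(median(reports))

print(mean_date)    # 2004-03-16
print(median_date)  # 2015-08-11
```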
In general, if your brain solves a problem one way, and your code solves the problem some other way, there needs to be a good reason other than “it was easier”.
Anyway, I’m actually going to stick with the arithmetic mean, but I’m going to improve on it by dumping out all the outliers before calculating the mean. That way, I’m going to have all the August 2015 results and no January 1970’s. How?
First, I’m going to turn to the dark side and use a tool that no real programmer would dare use. Excel has this nice function called TRIMMEAN that “calculates the mean taken by excluding … outlying data from your analysis.” I’ll continue my descent and find a trimmed mean algorithm on StackExchange.
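I won’t claim this is the exact algorithm from StackExchange, but a trimmed mean in the spirit of TRIMMEAN can be sketched in a few lines of Python. The function name and the `proportion` parameter are hypothetical; the idea is to sort the values, drop the same number of points from each end, and average what’s left.

```python
from statistics import mean


def trimmed_mean(values, proportion):
    """Mean of `values` after dropping `proportion` of the points,
    split evenly between the lowest and the highest ends."""
    if not values:
        raise ValueError("no values to average")
    if not 0 <= proportion < 1:
        raise ValueError("proportion must be in [0, 1)")
    ordered = sorted(values)
    drop_each_side = int(len(ordered) * proportion / 2)  # rounds down
    kept = ordered[drop_each_side:len(ordered) - drop_each_side]
    return mean(kept)


# One wild outlier at each end, three values that roughly agree.
print(trimmed_mean([0, 100, 101, 102, 5000], 0.5))  # 101
```

Because the count is rounded down, a small sample may lose no points at all, which is one reason the trimmed result still needs checking.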
Assumption #10: the output of the previous calculation will always be correct.
Of course, what happens when you have too few data points, is that you can’t reliably remove the outliers. Imagine we’ve got two dates, one in January 1970, the other in August 2015. Which one do we discard and why?
If your answer was “January 1970”, you’ve just made an assumption that some other computer’s clock will never be set to a future date. Good luck.
So trust, but verify.
And I’m going to verify by looking at two things. I know I can trust a majority of servers, and I also know they’re all going to report approximately the same time. So I’m going to verify that I got a result from at least 3 servers, and that all values are within 2 seconds of each other.
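That check is small enough to sketch directly. In this Python version the function name and the constants are placeholders of mine; "within 2 seconds of each other" is read as "the largest and smallest values differ by at most 2 seconds".

```python
MIN_RESPONSES = 3         # need at least this many servers to agree
MAX_SPREAD_SECONDS = 2.0  # largest allowed gap between any two answers


def is_trustworthy(timestamps,
                   min_responses=MIN_RESPONSES,
                   max_spread=MAX_SPREAD_SECONDS):
    """True if there are enough timestamps and they all roughly agree."""
    if len(timestamps) < min_responses:
        return False
    return max(timestamps) - min(timestamps) <= max_spread


print(is_trustworthy([100.0, 101.0, 101.5]))  # True
print(is_trustworthy([100.0, 101.0]))         # False: too few responses
print(is_trustworthy([100.0, 101.0, 103.0]))  # False: spread is 3 seconds
```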
This might feel like overkill. If I’m verifying the results, in a way that will always spot any outliers, why did I start out by removing the outliers? Isn’t that duplicate functionality?
Assumption #11: if a server returns the wrong data, it will quickly get fixed.
Because there’s no way of telling when a server would return an outlier value, and how long that situation would last. And as long as the issue is there, the script that doesn’t trim outliers keeps failing. And the script that trims outliers can return a correct result.
And so …
Here’s the algorithm I came up with:
- Ask all servers what the time is
- Ignore servers that didn’t respond
- Ignore responses you can’t trust (timeout)
- Ignore responses you can’t use (missing/invalid header)
- Discard the outliers
- Do we have at least 3 responses?
- Are they all within 2 seconds of each other?
- Calculate the mean and display the result
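The steps above, minus the network part, can be sketched as one pure Python function. This is an illustration under my own assumptions, not the code I shipped: the input is a list with one entry per server (a Unix timestamp, or `None` for a server that failed, timed out, or sent an unusable header), and the name `estimate_time` and the default `trim` of 0.4 are hypothetical.

```python
from statistics import mean


def estimate_time(timestamps, trim=0.4, min_responses=3, max_spread=2.0):
    """Return the agreed time as a Unix timestamp, or None if it can't be trusted.

    `timestamps` holds one entry per server: a float, or None when that
    server gave us nothing usable.
    """
    # Ignore servers with no usable response.
    usable = sorted(t for t in timestamps if t is not None)

    # Discard the outliers: drop the same number from each end.
    drop = int(len(usable) * trim / 2)
    kept = usable[drop:len(usable) - drop]

    # Do we have at least 3 responses?
    if len(kept) < min_responses:
        return None
    # Are they all within 2 seconds of each other?
    if kept[-1] - kept[0] > max_spread:
        return None

    return mean(kept)


print(estimate_time([0.0, 100.0, 100.5, 101.0, 5000.0]))  # 100.5
print(estimate_time([100.0, 101.0, None]))                # None
```

Note the failure mode: whenever the checks don’t pass, the function returns no result rather than a wrong one.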
The nice thing about this algorithm is its reliability: it can tolerate a lot of errors (network issues, slow servers, false responses, etc), but will only show correct results (with high enough certainty, accurate to our spec).
Can it be simplified? Yes, but that’s a topic for another day.
The important part is how it came to be: from a simple process whereby I first state my assumptions, then validate them, and last work them into the code.