In The Beginning There Was CGI
That’s inaccurate; in the beginning there were punch cards. But this post is about building scalable applications using Web technologies, and in this post, everything starts with CGI.
CGIs are incredibly slow. Every time you make a request to the server, it starts up another process. You can build simple and light processes if you focus on optimizing those CGIs and tuning every line of code. But you end up developing at the speed of traffic jams.
To develop fast enough you need to reuse code, pull in libraries, hold state in the database. All of which means you end up with significant startup costs. Starting up a VM and a Web framework can take seconds, enough to kill the response time for anything more than a handful of users.
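To put a rough number on that startup cost, here is a minimal sketch rather than a real CGI setup. It handles the same trivial "request" two ways: inside one long-lived process, and CGI-style by starting a fresh interpreter each time. The names handle_request, time_in_process and time_cgi_style are made up for illustration, and the child process only prints a greeting, so a real framework with libraries to load would be far worse.

```python
import subprocess
import sys
import time


def handle_request(name):
    """The 'application': trivial work, so startup cost dominates."""
    return f"Hello, {name}"


def time_in_process(n):
    """Handle n requests inside one long-lived process."""
    start = time.perf_counter()
    for i in range(n):
        handle_request(f"user{i}")
    return time.perf_counter() - start


def time_cgi_style(n):
    """Handle n requests by starting a fresh interpreter for each one."""
    start = time.perf_counter()
    for i in range(n):
        # Hypothetical stand-in for a CGI script: a new process per request.
        subprocess.run(
            [sys.executable, "-c", f"print('Hello, user{i}')"],
            check=True,
            capture_output=True,
        )
    return time.perf_counter() - start


if __name__ == "__main__":
    n = 20
    print(f"in-process: {time_in_process(n):.4f}s for {n} requests")
    print(f"CGI-style:  {time_cgi_style(n):.4f}s for {n} requests")
```

The absolute numbers depend entirely on your machine. The gap between the two is the point.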
There must be a better way. Incidentally, that’s when the LAMP, Windows and Java crowds each go their own way.
In the LAMP world, processes are everything. If you want to pull out data from a file, sort it, and e-mail the result, you pipe several programs together. You’re building a solution by assembling processes.
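As a sketch of what "assembling processes" looks like, the snippet below wires three processes together with real pipes: one pulls a column out of the data, one sorts it, one wraps the result as a message. It assumes small inputs, and the three stages are hypothetical Python one-liners standing in for the usual tools (cut, sort, sendmail) so the example runs anywhere. The run_pipeline helper is my own name, not a library function.

```python
import subprocess
import sys

# Hypothetical stand-ins for the programs you would normally pipe together.
EXTRACT = [sys.executable, "-c",
           "import sys\nfor line in sys.stdin: print(line.split(',')[0])"]
SORT = [sys.executable, "-c",
        "import sys; sys.stdout.writelines(sorted(sys.stdin))"]
MAIL = [sys.executable, "-c",
        "import sys; sys.stdout.write('Subject: sorted names\\n\\n' + sys.stdin.read())"]


def run_pipeline(commands, data):
    """Chain commands so each one's stdout feeds the next one's stdin."""
    procs = []
    for cmd in commands:
        upstream = procs[-1].stdout if procs else subprocess.PIPE
        proc = subprocess.Popen(cmd, stdin=upstream,
                                stdout=subprocess.PIPE, text=True)
        if procs:
            # The child holds its own copy of the pipe; drop ours.
            procs[-1].stdout.close()
        procs.append(proc)

    # Fine for small inputs; large ones need a writer thread or a temp file.
    procs[0].stdin.write(data)
    procs[0].stdin.close()
    output = procs[-1].stdout.read()
    procs[-1].stdout.close()

    for proc in procs:
        if proc.wait() != 0:
            raise RuntimeError(f"stage failed: {proc.args}")
    return output


if __name__ == "__main__":
    print(run_pipeline([EXTRACT, SORT, MAIL], "carol,3\nalice,1\nbob,2\n"))
```

Each stage knows nothing about the others. Swap the sort stage for a different program and the rest of the pipeline doesn’t change.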
And for more complex tasks you add even more processes. Want to do things on a schedule? Fire them up with cron. Need to improve throughput? Start up a cache process. Monitor uptime? That’s another process for you.
For LAMP developers, multi-process is the natural order of things. Multi-threading is reserved for the light stuff.
Windows views the world in a different way. Windows is, after all, a GUI operating system. Building GUIs with threads is easier than building them with processes. Windows has always focused on the GUI at the expense of its underdeveloped command line. As a result, it never developed an ecosystem of processes you can wire together.
The Windows developer mentality focuses on threads and APIs over processes and tasks. Not surprisingly, Windows is also optimized for threads and carries too much overhead when it comes to processes.
Java is not Windows. But it tried. Years before J2EE, Java was evolving into an alternative to Windows. There was even talk of replacing the entire operating system and office suite with one written in Java. So Java followed Windows and did threads.
While Windows focused on GUI, Java focused on platform independence. So it went and re-invented the operating system, one library at a time. In Java you don’t scan files with grep, you use a library. You don’t pipe e-mails to sendmail, you use a library. All the features you need are folded into the VM.
Which turned a snappy VM into a huge behemoth that takes a couple of minutes to boot, as it’s setting up libraries, frameworks and containers. You don’t want to start up the JVM more than once.
Java’s Love/Hate Relationship With Threads
Java did to multi-threaded applications what it did to garbage collection and JIT: it brought them to the wider developer community and made them accessible and acceptable. It’s not the best language for multi-threaded development, but it’s a damn good one. And the most popular.
When it comes to multi-threaded development, Java was years ahead of the C/C++ it came to replace, and the VB/Delphi it competed with.
Then it realized what a problem that is. Multi-threaded developers understand concurrency, how threads interact and how to deal with shared resources and data. Most developers don’t. And so they get to suffer the unintended consequences of side effects.
The solution was to force developers to build separate units of work that can run in a multi-threaded environment without conflicts, and scale to more than one server. What comes naturally to every PHP and Perl developer had to be forced down the throat of Java developers.
It was aptly named EJB.
So Java made multi-threaded development incredibly easy, and then made multi-threaded development intentionally hard. Perl, PHP and their crowd didn’t do much to address multi-threading, instead falling back to processes.
Why Threads Don’t Scale
When you build software, you spend a lot of time tackling problems. You’re solving one set of problems, which means you’re not solving another, or building new features.
You probably want to run more than one application on your server and handle more than one request at a time. So multi-threaded developers need to deal with issues like deployment units, managing containers, tackling class loaders. Anyone who has developed in Java knows those technologies are not as slick as they’re made to look in the brochure.
Multi-process developers don’t have to deal with that. But they’re not saving any time. They have to deal with managing processes, deploying in different locations, file system paths. They have their own share of problems to solve. But every time they solve one of these problems, they’re solving a problem that helps them scale.
You build for one machine, but you’re building for a cluster.
When multi-threaded developers optimize their system, they first focus on “in-VM” optimization. How to get components to share resources and talk to each other and build a solution out of objects.
Multi-process developers can’t do that. Instead they have to focus on the cost of sending data from one process to another and coordinating their work.
You have a finite budget for optimizing, and you’ll probably run into a deadline before you max it out. So multi-threaded developers end up optimizing for much better performance on any given machine, but multi-process developers end up optimizing for scale.
Scalability is not performance. Scalability is getting from small to big (and back). I count software design in that.
Multi-threaded developers tend to scale through objects, libraries and frameworks. When you focus on the components around you, you don’t pay much attention to anything outside the sandbox. The level of abstraction is the API.
Multi-process developers scale by assembling programs together, chaining them or running them in parallel. If it’s not in the framework, you look for a program (or a combination of programs) that does what you need. The level of abstraction is the task.
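The "running them in parallel" half can be sketched in a few lines. This is an illustration under assumptions, not a job scheduler: WORKER is a hypothetical one-line program that adds up its arguments, and run_in_parallel is a name I made up. The only trick is to start every process before waiting on any of them.

```python
import subprocess
import sys

# Hypothetical task: a tiny program that sums its command-line arguments.
WORKER = "import sys; print(sum(int(x) for x in sys.argv[1:]))"


def run_in_parallel(chunks):
    """Give each chunk of numbers to its own process and collect the sums."""
    # Start every worker first so they all run at the same time.
    procs = [
        subprocess.Popen(
            [sys.executable, "-c", WORKER, *map(str, chunk)],
            stdout=subprocess.PIPE,
            text=True,
        )
        for chunk in chunks
    ]

    results = []
    for proc in procs:
        out, _ = proc.communicate()  # waits for this worker to finish
        if proc.returncode != 0:
            raise RuntimeError(f"worker failed: {proc.args}")
        results.append(int(out))
    return results


if __name__ == "__main__":
    partial_sums = run_in_parallel([[1, 2, 3], [4, 5], [6]])
    print(partial_sums, "total:", sum(partial_sums))
```

Nothing here cares whether the workers run on this machine or another one. Replace the command with something that runs remotely and the shape of the code stays the same.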
I happen to think tasks are the right level of abstraction. The more complex the system is, the more you need to focus away from pieces of code, and to what those pieces can do.
If you’re solving problems, not building software for its own sake, then the right level of abstraction is the task. Start working with processes and you’ll discover that you can do a lot more than with one process, no matter how big it is on the inside.
The more independent processes you have, the easier they are to combine into new and interesting uses. In fact, you might be seeing an analogy here with Mashups and SOA, and you’re right.
There are different ways to measure performance. In my experience multi-threaded code has higher throughput on a single machine, but there’s only so much one machine can do. Eventually, you’ll have to think outside that box.
But the real measure of performance is how fast you can go from idea to working solution. Hardware is cheaper than the people who work with it. Developers who think in processes have an easier time building to scale, and an easier time assembling solutions out of existing pieces of code.
It won’t show up in your profiler, but you will get better design and longer lasting code.
Update: Some people confused this post with “how to build a Web server”. This post is about how to build applications higher up the stack. Different tools for different jobs. I also happen to use threads often; I just believe the “everything in one process” approach is misguided. It’s limited by design.