When you are building high-availability, high-traffic systems, it’s all about redundancy.
This afternoon I met with Matt M. from our Ops department. The Ops group is responsible for making sure our software gets deployed properly and for maintaining our server farms. Since I am going to be handling our stress and performance testing, Matt and I focused on our performance cluster of machines.
I noticed something that I thought would be of interest to developers and system admins everywhere: redundancy is built into our system at every level. At the machine level, our servers have many hard drives organized into RAID arrays (the ‘R’ in RAID stands for redundant). If one hard drive fails, no problem; the array keeps running on the surviving drives. In addition, each machine has redundant power supplies, so if one fails, the user doesn’t even notice. So far, this is all very standard, and somewhat basic.
Let’s zoom out one level. Our system is built of multiple subsystems. One of them is our public sites cluster, which itself consists of two subsystems: the front end (IIS/ASP.Net) and the back end (Sql Server).
Within the Sql Server subsystem, there are two physical servers, each with all of the machine-level redundancy described above. If one of the Sql servers crashes in such a way that it is gone forever, the user can still perform any and all actions in the application, although some operations may be slower than usual. To achieve this, we have a piece of hardware in place that writes all database changes to both servers simultaneously. That same hardware detects the failure of a single server and redirects all traffic to the survivor. This setup ensures that the loss of one server will not cause the loss of any customer data, and provides much better availability than a single server ever could.
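In our case a hardware device does that redirection, but to make the idea concrete, here is a minimal sketch of what the equivalent client-side failover could look like in ADO.NET. The server names are hypothetical placeholders, not our actual configuration:

```csharp
using System;
using System.Data.SqlClient;

static class FailoverSketch
{
    // Hypothetical server names; in our setup, a hardware device handles
    // this redirection, so the application never has to.
    static readonly string[] Servers = { "SQLA", "SQLB" };

    public static SqlConnection OpenWithFailover(string database)
    {
        foreach (string server in Servers)
        {
            try
            {
                SqlConnection conn = new SqlConnection(
                    "Server=" + server + ";Database=" + database +
                    ";Integrated Security=true;Connect Timeout=5");
                conn.Open();
                return conn; // first reachable server wins
            }
            catch (SqlException)
            {
                // this server is down; fall through and try the next one
            }
        }
        throw new InvalidOperationException("All database servers are unreachable.");
    }
}
```

One advantage of doing this in hardware instead is that every application on the network gets failover for free, without each one carrying retry logic like the above.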
Within the front end subsystem, we have several web servers, all carrying the same image of our code base. If any one of these servers goes offline, we once again recover gracefully, without any human intervention. The only data lost is whatever responses the failed server was still building and had not yet sent back to clients. For those of you who aren’t web developers: a single response does not normally contain very much data, so in the usual case the affected users only have to repeat the single step they just performed (usually just clicking the submit button again).
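How does the load balancer know a web server has gone offline? Typically it polls a health-check URL on each server and pulls any server that stops answering out of rotation. As an illustration only (this handler name and its check are invented, not our production code), such an endpoint in ASP.Net might look like:

```csharp
using System.Web;

// Hypothetical health-check endpoint for the load balancer to poll.
// It would be mapped to a URL (e.g., health.ashx) in web.config.
public class HealthCheckHandler : IHttpHandler
{
    public bool IsReusable { get { return true; } }

    public void ProcessRequest(HttpContext context)
    {
        // A real check might also verify database connectivity, disk
        // space, and so on before reporting healthy.
        context.Response.ContentType = "text/plain";
        context.Response.StatusCode = 200;
        context.Response.Write("OK");
    }
}
```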
So far, we have redundancy within the servers, and we have redundant servers. This is all fairly standard stuff; any medium-sized web application will probably be architected similarly.
Let’s get on to the more interesting bits. Given a public site cluster, consisting of a front-end subsystem and its paired back-end subsystem, we also have redundancy at the cluster level: we run multiple of these front-end/back-end pairs, load-balanced through another piece of hardware. Now, if we somehow lose all of the front-end servers and/or all of the back-end servers in a given cluster, the application as a whole keeps running. The catch is that the data on the failed cluster is lost, or at least unreachable, until our Ops guys bring the cluster back up. Users on the surviving clusters don’t notice anything; the affected users are shown a (presumably) friendly error message explaining the problem.
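To give a flavor of how that friendly error message could be wired up, here is a sketch of an ASP.Net Global.asax error hook. The page name and the failure test are assumptions for illustration, not a description of our actual code:

```csharp
using System;
using System.Data.SqlClient;
using System.Web;

// Sketch of a Global.asax error hook. "ClusterDown.aspx" is an invented
// page name for the friendly error message.
public class Global : HttpApplication
{
    protected void Application_Error(object sender, EventArgs e)
    {
        Exception ex = Server.GetLastError();

        // A SqlException bubbling all the way up here suggests the whole
        // back-end pair for this cluster is unreachable.
        if (ex != null && ex.GetBaseException() is SqlException)
        {
            Server.ClearError();
            Response.Redirect("~/ClusterDown.aspx");
        }
    }
}
```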
Why do we need four levels of redundancy in our application, you ask? Good question. Here are some stats that should help explain. Our application has been in beta for 3½ months. In that time, almost 80,000 domains have been registered through our service, so on any given day we are hosting nearly 80,000 public websites on our servers. That is a lot of websites. Without this level of redundancy, a single machine failing could make all of those sites inaccessible to our customers, and that translates directly into lost revenue for every one of them. Clearly, we need a very robust architecture to prevent that from happening.
One of my responsibilities is making sure that our systems can handle failures. We call this fault tolerance, and I think this is going to be the most enjoyable part of testing this application. There’s nothing better than breaking something on purpose, just to see what happens next.
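To make that concrete, here is the rough shape of the kind of probe I have in mind: fire a steady stream of requests at the site while we deliberately kill a server, and count how many requests actually fail. The URL and request count are placeholders:

```csharp
using System;
using System.Net;

// Rough sketch of a fault-tolerance probe: request the site in a loop
// while a server is deliberately taken down, and count the failures.
// The URL and request count are placeholders, not real values.
class FaultToleranceProbe
{
    static void Main()
    {
        int failures = 0;
        const int total = 1000;

        for (int i = 0; i < total; i++)
        {
            try
            {
                using (WebClient client = new WebClient())
                {
                    client.DownloadString("http://www.example.com/");
                }
            }
            catch (WebException)
            {
                failures++; // a request lost during the induced failure
            }
        }

        Console.WriteLine("{0} of {1} requests failed during the outage.",
                          failures, total);
    }
}
```

If the redundancy works as advertised, that failure count should stay near zero no matter which single machine we pull the plug on.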