
It’s NOT the Network, the Hypervisor, or the OS!

I’ll do my best to keep this post on an even keel and not let it devolve into a rant, but no promises. As I poke around in various technical forums, whether LinkedIn or Spiceworks or TechNet or whatever, a new theme is popping up. People are complaining that they’ve dropped in a shiny new 10GbE network and just can’t figure out why it’s not any faster than their old 1GbE network, or that they’re using internal/private networks inside Hyper-V or ESX and they don’t go any faster than a standard copper line. In a couple of cases, I’ve actually taken the effort to try to explain what’s going on, but usually it falls on deaf ears. This post serves as both an explanation and an appeal to the more sensible administrators out there to look at what’s really going on.

The Symptom Tree

As nearly as I can tell, the steps to reproduce the “problem” are: buy shiny new hardware, plug in shiny new hardware, copy file from source A to destination B and time the transfer, then run to forums and complain.

Innocent Victim #1: The Operating System/Hypervisor

The first few responses often come from platform evangelists/haters who insist that it’s because the operating system or hypervisor (doesn’t matter what it is) is highly inefficient at handling networking and the poster just needs to forklift everything out and put in some other operating system or hypervisor. A fight will then ensue with the people who feel the opposite way. With any luck, the thread won’t be derailed.

There are some things to try here, just to be sure, but don’t expect them to have more than a minor performance impact. I’ll just list out some possibilities and leave it to your Internet-searching capabilities to find out how to apply them to your OS/hypervisor/hardware: TCP offloading, TCP chimney, VMQ, enabling/disabling protocols (IPv4, IPv6), RSS, auto-tuning, MTU/jumbo frames, and subnetting. If you’re reading this because you’re dealing with a Hyper-V deployment, I wrote an earlier article on that which included a link to Microsoft’s tuning guide. Just a heads up: I’ve spent many hours in many installations tweaking, tuning, and testing, and it has been a very rare occurrence that I was able to achieve meaningful differences. The change that makes the most difference is adjusting the MTU for jumbo frames when working with iSCSI, because it can dramatically improve the overhead-to-data ratio. For everything else, it’s easy to put more effort in than will ever be recouped.
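To put a rough number on that overhead-to-data point, here’s a quick back-of-the-envelope sketch. It assumes plain TCP/IPv4 headers with no options and ignores Ethernet preamble, FCS, interframe gap, and iSCSI PDU headers, so treat the exact percentages loosely:

```python
# Rough payload efficiency of a TCP/IPv4 frame at a standard vs. a jumbo MTU.
# Header sizes assume no IP or TCP options; lower-level framing overhead
# (preamble, FCS, interframe gap) and iSCSI PDU headers are ignored.
ETH_HEADER = 14   # bytes: destination MAC + source MAC + EtherType
IP_HEADER = 20
TCP_HEADER = 20

for mtu in (1500, 9000):
    payload = mtu - IP_HEADER - TCP_HEADER      # bytes of actual data per frame
    frame = mtu + ETH_HEADER                    # bytes on the wire per frame
    print(f"MTU {mtu}: {payload} payload bytes in a {frame}-byte frame "
          f"({payload / frame:.1%} efficiency)")

# MTU 1500 -> roughly 96% efficiency; MTU 9000 -> roughly 99%, and about
# one-sixth as many frames (and interrupts) per megabyte moved.
```

The ratio itself only moves a few points; the bigger win is usually the reduction in per-frame processing, which is why iSCSI traffic tends to benefit more than ordinary file traffic.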

Innocent Victim #2: The Hardware Manufacturer

Once the OS/hypervisor is given a clean bill of health, the next most popular move is to jump on the company that built the equipment. These are usually empty assaults based on anecdotal or even unrelated evidence: “Intel NICs are junk because Justin Bieber” or “Broadcom always breaks everything, just look at Greece’s economy”. I’m not claiming that there isn’t bad hardware out there or that even the biggest names in the industry haven’t trotted out some clunkers. What I’m saying is that hardware is the problem less than 2% of the time in the aggregate of all computer problems. A long chain of troubleshooting needs to occur before blame is laid on the physical components. The only exception is cables; check cables first. You do, of course, need to worry a bit about driver issues, but those get conflated with hardware failures too; it has been exceedingly rare for many years now for a driver problem to cause a major performance hit.

(Partially) Innocent Victim #3: The Operator

Most computer problems do wind up having something to do with the ineptitude of the operator. We know this because it got its own acronym: EEOC (equipment exceeds operator’s capabilities). In our hypothetical thread, it won’t be long before someone comes along and says, “You’re doing it wrong,” with a varying level of politeness. However, the way hardware is nowadays, an operator error usually means either that it doesn’t work at all or that it just doesn’t work quite as well as it should. The middle ground is almost always something else.

What Should Have Happened

If you’ve read the title and the “Symptom Tree”, then you’ve hopefully already noticed that the hypothetical poster-in-trouble was testing the wrong thing. What the individual needed to do was test the actual network transfer speed. If you don’t see what he missed and don’t want to duplicate his mistake, then this is the paragraph you don’t want to skim. If you’re testing network connectivity speeds, then you need to ensure that what you’re testing is actually the network, and nothing else. Check the obvious things: have you hit your cables with a cable tester? Did you restart the switch after enabling jumbo frames? Did you ensure that every endpoint and waypoint in a jumbo frames communication chain is enabled for jumbo frames? Are your switches and endpoints reporting the correct negotiated speeds and duplex settings?
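On a Linux endpoint, one quick way to eyeball two of those items is the kernel’s sysfs entries; here is a small sketch (the interface name is a placeholder, and on a Windows host you would reach for something like Get-NetAdapter instead):

```python
# Sanity-check the negotiated link speed and MTU on a Linux host via sysfs.
# "eth0" is a placeholder interface name; substitute your own.
from pathlib import Path

IFACE = "eth0"
base = Path("/sys/class/net") / IFACE

speed_mbps = int((base / "speed").read_text().strip())  # negotiated speed in Mb/s (-1 if the link is down)
mtu = int((base / "mtu").read_text().strip())

print(f"{IFACE}: negotiated {speed_mbps} Mb/s, MTU {mtu}")
if mtu < 9000:
    print("jumbo frames are not enabled on this interface")
```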

If all the basics are covered and you’re absolutely convinced that it must be a problem with the network, get your hands on something that’s actually designed to test network transmission speeds. Iperf is one such example, but there are plenty of others, especially if you’ve got a little money to spend. Until you’ve done these tests, you have absolutely no credible evidence that the network is the basis for the issue. The reason is simple: file copy is one of the poorest possible ways to test transmission metrics on a high speed network.
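If you’re curious what a tool like iperf is actually doing, here’s a bare-bones sketch of the same idea in Python: it moves data between two machines using nothing but sockets and in-memory buffers, so no hard drive ever gets a vote. The port, buffer size, and duration are arbitrary placeholders, and this is an illustration, not a replacement for a real testing tool.

```python
# A minimal memory-to-memory throughput probe: run receive() on one host,
# then transmit("<receiver address>") on the other. No disk I/O anywhere.
import socket
import time

PORT = 5201          # arbitrary test port (placeholder)
CHUNK = 1 << 20      # 1 MiB buffer per send/recv call
DURATION = 10        # seconds to transmit

def receive():
    """Accept one connection, drain it as fast as possible, report the rate."""
    with socket.create_server(("", PORT)) as server:
        conn, peer = server.accept()
        with conn:
            total = 0
            start = time.monotonic()
            while True:
                data = conn.recv(CHUNK)
                if not data:
                    break
                total += len(data)
            elapsed = time.monotonic() - start
        print(f"received {total / 1e9:.2f} GB from {peer[0]} "
              f"at {total * 8 / elapsed / 1e9:.2f} Gb/s")

def transmit(host):
    """Blast zero-filled buffers at the receiver for DURATION seconds."""
    payload = bytes(CHUNK)
    with socket.create_connection((host, PORT)) as conn:
        sent = 0
        deadline = time.monotonic() + DURATION
        while time.monotonic() < deadline:
            conn.sendall(payload)
            sent += len(payload)
    print(f"sent {sent / 1e9:.2f} GB in {DURATION} s")
```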

But… File Copy is EASY! Why Isn’t it Enough?

As I drafted this post, I envisioned a lot of flashy diagrams and images and such. I’ve decided not to use them for the sake of brevity. The basic point they were all meant to illustrate is this: hard drives are slow. Compared to just about anything else in modern computing, they are boat anchors. The rest of your computer feels the same way about its hard drives as you did when you had to use a dial-up modem connection. All the not-included diagrams were intended to show why, and they involved fun things like angular velocity and read-head travel and command queuing and all sorts of other things, but the synopsis is that hard drives require moving mechanical parts, and those will just never be as fast as solid-state components – they’re behind by an order of magnitude.

I Have the Latest and Greatest HDD, So You’re Wrong!

I poked around for a while and noticed that hard drive manufacturers really don’t like to talk about just how fast their drives can move data in a real-world environment. That’s because, compared to everything else, they’re slow. However, they’ve got a few really quick components. To grab one at random, let’s look at a current SATA device. It says right on it that it’s got a 3Gb/s transfer rate. A few of those are faster than a 10Gb line, right? Well, as long as you’re only talking about pushing data to/from the cache, yes. Once the cache is exhausted, it’s back to that pokey read head and angular velocity and all that. Some drive manufacturers will post a “sustained maximum transfer rate” and a “sustained average transfer rate”, and those numbers are much more enlightening. The fastest drive I found in my search boasted a sustained maximum transfer rate of right around 220MB/s. More on this in a bit.
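In the meantime, here’s the gap between the interface rating and the sustained number in one place (a sketch; the 220MB/s figure is the spec-sheet number quoted above, and the divide-by-ten comes from SATA’s 8b/10b line encoding, which spends 10 bits on the wire for every byte of data):

```python
# Interface ceiling vs. what the platters can actually sustain.
SIGNAL_RATE_BPS = 3e9                  # the "3Gb" printed on the drive, in bits/s
BURST_CEILING = SIGNAL_RATE_BPS / 10   # 8b/10b encoding: 10 line bits per data byte -> ~300 MB/s
SUSTAINED = 220e6                      # the quoted sustained maximum, in bytes/s

print(f"burst (to/from cache) ceiling : {BURST_CEILING / 1e6:.0f} MB/s")
print(f"sustained off the platters    : {SUSTAINED / 1e6:.0f} MB/s")
print(f"the platters deliver about {SUSTAINED / BURST_CEILING:.0%} of the interface rating")
```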

Why is this Just Now a Problem?

I don’t know that it could be classified as a “problem”, but the condition certainly isn’t new. What’s new is that increases in networking speeds have dramatically outpaced increases in hard drive speeds. You can now get 6Gb fibre or 10Gb copper connections, and team/trunk them into terrifyingly massive bit-bearing beasts. If you’re talking about internal or private networks on a hypervisor, the rate is limited by the speed of the internal fabric of the hardware host. Hard drives, on the other hand, are still slow. In the past, you got around the limitations of hard drives by placing them in multi-spindle arrays: the more spindles, the better the throughput. Networking speeds have now leapfrogged hard drives, so the drives have once again become the bottleneck.

Let’s See It

We’ll do some numbers, and if I feel like it, I might even produce some graphs. Let’s go back to that hard drive from earlier. It’s a 15,000 RPM SAS drive in a 2.5” form factor, so even if there are faster spindle drives out there, they won’t be much faster. The 220MB/s number is only going to occur when there’s only one operation in the pipe and it’s either 100% read or 100% write, when all the data is being read from or written to a perfectly contiguous section, when it’s all in the optimal angular-velocity zone, and when there is absolutely no contention for the drive’s attention. In a production environment, the number of times this situation will occur is zero. Getting a more realistic idea of what you can expect is very difficult due to the huge array of factors involved (which is why HDD manufacturers don’t like to publish such numbers), but you can easily lop off a good 20% or more. For easy math, let’s just say this drive can actually sustain 180MB/s (which is still probably a fairly optimistic number), and keep in mind that this is one of the fastest drives on the market, if not the fastest.

The next thing to think about is network transmission speeds. We need to get all of our numbers into a common format so they’re easier to compare. I’m going to convert MB/s into Gb/s using the online calculator at UnitConversion.org. Drop in 180 megabytes per second and you can see that it’s the equivalent of 1.40625 gigabits per second (the calculator divides by 1024; straight decimal math gives 1.44). So, one of these drives can supply data faster than a gigabit line can carry it, and it can read data off a gigabit line and still get some idle time. Let’s say you drop 10 of these drives into a SAN and stripe them into a RAID-5 array. The controller has come back in time from the 23rd century and has absolutely zero performance impact, and your environment is so pristine that every read/write operation on this SAN is able to utilize all 10 spindles. In that fictional place, this SAN can keep one 10GbE line completely saturated.
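If you’d rather skip the online calculator, the arithmetic is short enough to do inline. A sketch of the same math:

```python
# The same conversion as the UnitConversion.org step, plus the ten-drive array.
drive_MBps = 180                         # our (optimistic) sustained rate per drive
drive_Gbps = drive_MBps * 8 / 1000       # decimal units: 1.44 Gb/s
drive_Gbps_1024 = drive_MBps * 8 / 1024  # 1024-based units: 1.40625, matching the calculator

spindles = 10
array_Gbps = spindles * drive_Gbps       # ignores parity, controller, and contention entirely
print(f"one drive : {drive_Gbps:.2f} Gb/s ({drive_Gbps_1024:.5f} using 1024-based units)")
print(f"ten drives: {array_Gbps:.1f} Gb/s against a 10 Gb/s line -> saturated, on paper")
```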

[Diagram: the hypothetical ten-spindle SAN keeping a single 10GbE line saturated]

Once you really think it through, you should leave here with at least some understanding. Once you factor in things like real-world hard drive speeds, drive contention, network contention (especially broadcast traffic), iSCSI NICs doing double duty (a common configuration no-no), hot-spotting, multiple hosts, controller overhead, and LUNs that only span a few spindles, it’s pretty obvious what’s going on. It’s just plain difficult for all but the biggest and fastest SANs to keep a 10GbE line full. But let’s say you’ve done the math, considered all the possibilities, and you’re absolutely certain that your SAN is faster than its network interconnects, and you’re still not getting the throughput you think you should. First, just to reiterate: if you haven’t run an iperf or similar test, you cannot actually know that. Once you have, we can move on and look at it.
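Before that, to put a very rough number on how those limiting factors stack up, here’s a purely illustrative sketch; every penalty factor in it is a made-up placeholder rather than a measurement, and the point is the shape of the multiplication, not the specific values.

```python
# Illustrative only: hypothetical penalty factors applied to the ideal array rate.
array_Gbps = 14.4                        # the ten-drive figure from the previous section
penalties = {                            # all values are placeholders, not measurements
    "random rather than sequential I/O":   0.50,
    "mixed read/write contention":         0.80,
    "RAID-5 parity and controller work":   0.85,
    "LUN spans only some of the spindles": 0.70,
}

effective = array_Gbps
for reason, factor in penalties.items():
    effective *= factor
    print(f"after {reason:<37}: {effective:5.2f} Gb/s")

# Even with these modest, invented penalties the array lands well under 10 Gb/s,
# which is why the 10GbE line so rarely turns out to be the bottleneck.
```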

I’m going to use one of my first encounters of this subject as the model for this discussion. An individual had hooked up a couple of virtual hosts to a SAN across 10GbE lines and run a file copy from one VM to another. I asked several questions about the make-up of the SAN and never got any answer at all, so I suspect that its drive array simply couldn’t sustain a 10Gb/s transfer rate and that’s all there was to it. However, let’s draw a picture. See if you can spot the challenge.

[Diagram: a file copy between two VMs on the same host, where both the read from the SAN and the write back to it share a single 10GbE iSCSI link]

See it? Let’s walk through it.

  1. Data is requested from the SAN. It shoots up the 10Gb pipe.
  2. The host processes the data as iSCSI traffic and translates it to the source VM as standard HDD data.
  3. The source VM converts the data from HDD to network traffic, then places it on the network line, which it thinks is a standard TCP/IP connection.
  4. The host transfers the data across its bus to the destination VM.
  5. The destination VM converts the data from network traffic to HDD data.
  6. The host converts the VM’s HDD data to iSCSI traffic and places it on the 10Gb pipe where…
  7. … for almost all of the copy, it has to share space with the data coming up the source line.

No matter how this is done, no matter how well it’s optimized, no matter how much work is done to eliminate contention, no matter what, the copy speed in this scenario will absolutely never exceed 640MB/s. Even if we pretend that steps 2 through 6 introduce absolutely no overhead, the read operation and the write operation are sharing the same 10Gb/s line, so by necessity, the transfer rate in this situation is hard-capped at 5Gb/s. That all assumes the underlying SAN can read and write at that speed; unless the file is small enough to fit in cache, this is pretty much a worst-case scenario for the SAN’s performance, because it has to balance reading from one location on its platters against writing to completely different positions on those same platters. Once you factor in reality, the actual transfer rate is going to be noticeably lower.
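The ceiling falls straight out of the division. Using the same 1024-based units as the rest of this article gives the 640MB/s figure; plain decimal units give 625MB/s:

```python
# Why the VM-to-VM copy tops out at half the link rate, best case.
link_Gbps = 10
per_direction_Gbps = link_Gbps / 2                 # read and write share the same wire
cap_1024_MBps = per_direction_Gbps * 1024 / 8      # 640 MB/s with 1 Gb = 1024 Mb
cap_decimal_MBps = per_direction_Gbps * 1000 / 8   # 625 MB/s with decimal units

print(f"copy ceiling: {per_direction_Gbps:.0f} Gb/s = {cap_1024_MBps:.0f} MB/s "
      f"({cap_decimal_MBps:.0f} MB/s in decimal units)")
```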

But Wait! There’s More!

This particular scenario also involves the internal/private networks offered by hypervisors. Those rely on the speed of all the components of the physical server and were intended to compete with the more common 1GbE copper lines and some of the slower fibre lines, and they win that competition every time. 10GbE and 6Gb fibre, on the other hand, are contenders in this race. Even if you’re not using an internal/private network, the characteristics of the computer that has to process the traffic going on and off the wire are going to play a much more visible role when dealing with these newer high-speed networks. Even if you’re copying from one physical source to a completely separate physical destination, the general-purpose computers in the mix might present barriers of their own. Those limits will be defined by the machine’s bus architecture and, of course, contention. A good analogy here is a teller at a bank’s drive-through lane that uses vacuum tubes. The “tube” can shoot transactions back and forth at 10Gb/s, but the teller’s “hands” cannot load and unload the transactions, much less decide what to do with the packets, nearly that quickly.

Conclusion

The moral of the story is essentially to ensure you know where your bottlenecks are before you complain about them. Having an understanding of how data will travel in your environment is a good way to start looking for the source of slowdowns.

