Your network is continuously, every microsecond, tested to destruction.
TCP, the ubiquitous IP transport protocol used for virtually all data exchanges in a storage context, is designed to eternally probe for higher bandwidth. Eventually, between the vast number of hosts all trying to deliver data ever more quickly, and the network you operate, something has to give.
With today’s deployed networking gear, which is hardly ever configured differently from the factory default settings in this regard, packet loss is the dire consequence when this happens. This is the only option available to your switches, routers and WLAN access points to shed some load and get themselves a (very) short break.
However, not every packet lost is terminated equally. From the viewpoint of the switch, any packet loss (euphemistically also called drop or discard, with the relevant counters often hidden from plain sight) is just taking away a very tiny fraction of possible bandwidth – and congestion happens when the load on a link is constantly right there at 100%, correct?
But not so fast, this simplistic view is missing the bigger picture. As hinted above, the prevailing protocol nowadays for connecting networked devices with each other is TCP. And TCP not only delivers data in order and reliably (but not necessarily in a timely manner), it has also co-evolved over the last 30 years to deal with the harsh realities of packet networks.
In this day and age, packet loss is virtually always a sign of network congestion; even low-layer WiFi links have sophisticated mechanisms that try very hard to get a packet delivered. Even there, packets typically get discarded not while “on the air”, but while waiting for some earlier packet to get properly delivered, e.g. at a much lower transmission rate. Again, the incoming data has to queue up, and eventually the queue is full – packet loss and congestion caught in the act.
But think of your datacenter, where you have only two servers in need of writing data to your NetApp storage system of choice. As soon as these two hosts each transmit at more than 625 MB/s (half of the roughly 1,250 MB/s a 10 Gbit/s link can carry) towards a 10G LIF in the very same microsecond, the switch capabilities and the link bandwidth towards the storage are overloaded, and at least some data has to queue up again.
So, isn’t more queueing buffer in all participating devices the solution? At least there are fashions in network technology – some time ago, deep buffered switches were all the rage; then you had shallow buffered switches. Nowadays, these no longer appear to be significant talking points – the fashion train has moved on to different marketing statements (while switches still come in shallow and deep buffered varieties).
But again, this simplistic view – more buffers mitigate packet losses, and all is good – misses the bigger picture.
A short primer on how TCP works, in very broad strokes: Unless TCP understands that there may be an issue with the bandwidth towards the other end host, it continues to increase the sending bandwidth – always. And while your network device buffers more data, the sender only increases its sending rate the entire time, thus filling up the buffer ever more quickly. Until, that is, an indication of network overload (yes, this is an allusion to packet loss) arrives back at the sender. But typically, the loss happens on enqueue – that is, for the freshest packet that happens to arrive when the queue is full. And the receiver can only notice this *after* all previous packets in the queue have been delivered to it. With a huge queue, it takes more time until the receiver knows about the lost packet, and only then can it inform the sender. Which has just kept on increasing the sending rate all this time….
You see where we are going – not only is the congestion signal delayed due to the huge buffer, but in the meantime, even larger swaths of data have had to be purged from the switches….
(The exception is an application or sender that is not latency sensitive and has other means to adjust its average bandwidth utilization; that is when such very deep buffered switches make sense in actually preventing packet loss.)
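To put a rough number on how long a full buffer delays the congestion signal, here is a minimal back-of-the-envelope sketch. The buffer sizes are hypothetical values picked for illustration, not figures for any particular switch, and the calculation assumes a single queue draining at full link rate.

```python
def queue_drain_time_ms(buffer_bytes: int, link_bits_per_s: float) -> float:
    """Time in ms that a packet at the tail of a full queue waits before it is sent."""
    return buffer_bytes * 8 / link_bits_per_s * 1000.0


TEN_GBIT = 10e9  # link rate in bit/s

# Hypothetical per-port buffer sizes, from shallow to deep.
for buffer_bytes in (100_000, 1_000_000, 12_000_000):
    delay = queue_drain_time_ms(buffer_bytes, TEN_GBIT)
    print(f"{buffer_bytes:>12,} bytes of queue -> {delay:.2f} ms added delay")
```

Under these assumptions, every packet behind a full queue, including the one that finally reveals the loss to the receiver, waits that long before it even leaves the switch.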
Now, let’s take a closer look again at TCP and what it does once a congestion signal arrives. But what is that, anyway?
Well, the receiver will indicate, using duplicate ACKs, ideally augmented with SACK (Selective Acknowledgment) blocks, that not all packets have arrived in sequence. To tolerate brief reordering, the default is to require 3 duplicate ACKs before the sender acts. But for these duplicate ACKs to be elicited from the receiver, it has to see at least as many packets after the loss. We will come back to this later on.
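The dependency between duplicate ACKs and the packets that follow a loss can be sketched in a few lines. This is a simplified model, not a TCP implementation: segments are numbered 0, 1, 2, … instead of using byte sequence numbers, the receiver ACKs every arriving segment, and SACK and delayed ACKs are ignored. The function name is made up for this illustration.

```python
def count_duplicate_acks(arrivals):
    """Count ACKs that do not advance the cumulative ACK point.

    `arrivals` lists the segment numbers in the order they reach the receiver.
    """
    received = set()
    next_expected = 0  # cumulative ACK point; numbering starts at segment 0
    dupacks = 0
    for seg in arrivals:
        received.add(seg)
        before = next_expected
        while next_expected in received:
            next_expected += 1
        if next_expected == before:
            dupacks += 1  # the ACK repeats the old value: a duplicate ACK
    return dupacks


DUPACK_THRESHOLD = 3

# Segment 2 is lost; three later segments still arrive.
print(count_duplicate_acks([0, 1, 3, 4, 5]) >= DUPACK_THRESHOLD)
# Segment 2 is lost; only two later segments arrive.
print(count_duplicate_acks([0, 1, 3, 4]) >= DUPACK_THRESHOLD)
```

In this model, the first case reaches the threshold and the second does not, purely because of how many segments arrived behind the hole.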
Once the sender understands that the network had to discard packets (because of the inevitable momentary overload), it will immediately reduce the transmission rate (how quickly packets are being sent). “Immediately” is, obviously, delayed by the time it took the follow-up packets after the loss to make it all the way to the receiver, and the ACKs to get back to the sender. That bandwidth reduction is defined to be down to 50% of the previous bandwidth – although modern stacks improve their network utilization by reducing the bandwidth only to 70-80%.
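Why the gentler reduction helps utilization can be seen with a little arithmetic. The sketch below assumes the classic additive increase of one segment per round trip after the reduction; real stacks (CUBIC, for instance) grow their window differently, so treat the numbers as an illustration of the trend only. The window size of 100 segments is an arbitrary example value.

```python
def window_after_loss(cwnd_segments: int, beta: float) -> int:
    """Congestion window after a multiplicative decrease to `beta` of its old size."""
    return max(1, round(cwnd_segments * beta))


def rtts_to_recover(cwnd_segments: int, beta: float) -> int:
    """Round trips needed to grow back, assuming +1 segment per RTT (an assumption)."""
    return cwnd_segments - window_after_loss(cwnd_segments, beta)


for beta in (0.5, 0.7, 0.8):
    print(f"reduce to {beta:.0%}: window {window_after_loss(100, beta)} segments, "
          f"{rtts_to_recover(100, beta)} RTTs to get back to 100")
```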
So, all is well now, correct? The network indicated overload to the TCP session, which reduced its bandwidth. But not so fast – the lost packet still has to make it to the receiver. So, the sender will retransmit (“fast retransmit”) that single lost packet – into the still quite full queues. At the very least, it will take more time than your typical (“no-load”) ping time for that retransmission to traverse the network. And only then is the thing you really care about, e.g. your NFS Write operation, SMB FileQuery, or iSCSI read request, completed from the viewpoint of the client. Because only once all the data is available in sequence will TCP provide it to these data consumers.
So, a lost packet costs one round-trip time; what is the big deal? Well, one round-trip time in a network with filled-up queues can be significantly longer than your typical ping time (usually measured without major load between these two end devices). This also glosses over the fact that ICMP (ping) doesn’t necessarily use the very same queues, or even the same path in some cases, as your TCP packets. And with modern All-Flash Arrays, the storage device latency (as reported by the array) may be shorter than this RTT. (The latency reported by a storage array can only account for the time from when it saw the request – that is, when the complete request got delivered to the storage protocol by TCP – until the response is assembled and handed down to TCP again for transmission to the client. Storage latency as reported by a client, on the other hand, does include this network delay, as it measures from when the request is handed to TCP for delivery until the completed response is received in full.)
But that is not even the worst part of the story. Sometimes, the TCP retransmission makes it through to the overloaded switch, just to find that some other connection (e.g. one with a higher RTT – so that other session is still sending at an unsustainable rate, as it does not yet know about the overload situation) has filled up all the queue space again. And the switch does what it has to do in this situation: it drops yet another packet.
However, this time it was a retransmitted packet. And here is where all TCP implementations have to rely on a timer to detect and recover from this situation. You may have seen this unknowingly when using the “netstat -s” command:
[netstat -s, highlight RTO]
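If you would rather track this counter over time than eyeball it, a minimal sketch for Linux follows. It assumes the host exposes /proc/net/netstat in its usual layout (a line of counter names followed by a line of values, both with the same prefix) and that the retransmission timeout counter is named TCPTimeouts, as on recent Linux kernels. The sample text in the code is made up to demonstrate the parsing and is not real output.

```python
from pathlib import Path


def parse_proc_netstat(text: str) -> dict:
    """Parse name/value line pairs into a dict such as {'TcpExt.TCPTimeouts': 7}."""
    counters = {}
    lines = text.strip().splitlines()
    for header, values in zip(lines[0::2], lines[1::2]):
        prefix, names = header.split(":", 1)
        _, numbers = values.split(":", 1)
        for name, number in zip(names.split(), numbers.split()):
            counters[f"{prefix}.{name}"] = int(number)
    return counters


def retransmission_timeouts(path: str = "/proc/net/netstat"):
    """Return the RTO counter, or None where the file or the counter is missing."""
    proc_file = Path(path)
    if not proc_file.exists():
        return None
    return parse_proc_netstat(proc_file.read_text()).get("TcpExt.TCPTimeouts")


# Made-up sample in the assumed layout, for illustration only.
SAMPLE = "TcpExt: TCPTimeouts TCPLossProbes\nTcpExt: 7 42\n"
print(parse_proc_netstat(SAMPLE))
print(retransmission_timeouts())
```

Sampling this value periodically and looking at the difference between samples shows when timeouts actually occur, which is more telling than the absolute count since boot.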
That timer is specified in the standards to trigger after 1000 ms – but fortunately, no one implements this extremely lengthy timer as specified today – the TCP stack of every major operating system operates with this timer set to 200 to 400 ms. But from a storage point of view, where we all strive to achieve consistent, low-latency operations well within 10 ms, and with AFF possibly even within 1-2 ms, this is a huge spike in client-visible latency.
And the chain of events described above is not the only circumstance in which this lengthy timer is essential to get data flowing again. Remember the duplicate ACK counter, which needs to count to 3 before the sender reacts?
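To see why the lower bound of this timer matters so much in a datacenter, here is a sketch of the retransmission timeout computation along the lines of RFC 6298 (smoothed RTT plus four times the RTT variation, clamped to a minimum). The 0.2 ms round-trip time and the 1 ms clock granularity are assumed example values, and real stacks deviate from the RFC in various details.

```python
def rto_after_samples(rtt_samples_ms, min_rto_ms, granularity_ms=1.0):
    """Retransmission timeout in ms after processing the given RTT samples."""
    if not rtt_samples_ms:
        raise ValueError("need at least one RTT sample")
    alpha, beta = 1 / 8, 1 / 4  # smoothing gains suggested by RFC 6298
    srtt = rttvar = None
    for sample in rtt_samples_ms:
        if srtt is None:
            # First measurement initializes both estimators.
            srtt, rttvar = sample, sample / 2
        else:
            rttvar = (1 - beta) * rttvar + beta * abs(srtt - sample)
            srtt = (1 - alpha) * srtt + alpha * sample
    rto = srtt + max(granularity_ms, 4 * rttvar)
    return max(rto, min_rto_ms)


samples = [0.2] * 20  # a steady 0.2 ms round trip (assumed value)
print(rto_after_samples(samples, min_rto_ms=0.0))     # what the estimator alone yields
print(rto_after_samples(samples, min_rto_ms=200.0))   # with a 200 ms floor
print(rto_after_samples(samples, min_rto_ms=1000.0))  # with the 1000 ms floor
```

With round trips this short, the measured RTT contributes next to nothing; the configured minimum alone decides how long the client stalls.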
As a queue fills up, it becomes increasingly likely for subsequent packets to get discarded. But in a given transaction, TCP needs at least 3 duplicate ACKs (* Early Retransmit, Limited Transmit) to act fast. If any of the last 4 packets in a write transaction is lost, the retransmission timeout timer is all that is available to eventually continue sending.
Now, all latency is lost?
Well, no. There are a few recent developments that improve the situation. First, all modern TCP stacks use heuristics such as Limited Transmit and Early Retransmit to address the situation with transactional data exchanges. Instead of the loss of any of the last 4 packets causing the lengthy timer-triggered recovery, it’s only the last one or two. Some operating systems, like Linux or Windows, have a feature called Tail Loss Probe (TLP), which retransmits the last packet in such a transaction when there is no other data available to send. That retransmission happens very quickly, without any of the other consequences of a full retransmission timeout (which I did not mention in this essay; it is getting too long as it is).
But another big tunable resides right there in the network. Remember the description of how queues fill up, and only drop packets once they are completely full? And that when this happens, it’s likely that large swaths of data are discarded, as the sender keeps on sending at increasing rates until it receives word about the first packet loss from the receiver?
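On Linux, whether TLP is active can be checked from the tcp_early_retrans sysctl. The sketch below relies on my reading of the kernel's ip-sysctl documentation, where the values 3 and 4 enable TLP and 0 disables it; the meaning of the other values has changed between kernel versions, so treat this as an assumption to verify against the documentation of your kernel. The function names are made up for this illustration.

```python
from pathlib import Path


def tlp_enabled_for_value(value: int) -> bool:
    """Interpret a tcp_early_retrans value (assumption: 3 and 4 mean TLP is on)."""
    return value in (3, 4)


def tlp_enabled(sysctl_path: str = "/proc/sys/net/ipv4/tcp_early_retrans"):
    """Return True/False on Linux, or None where the sysctl does not exist."""
    sysctl_file = Path(sysctl_path)
    if not sysctl_file.exists():
        return None
    return tlp_enabled_for_value(int(sysctl_file.read_text().strip()))


print(tlp_enabled())
```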
This is the way all network devices are shipped (factory default settings). But you can make use of AQM (active queue management) to give an early warning – a single packet lost (or marked – more about ECN in a follow-up) is easy for TCP to detect and recover from. That technology has been around for nearly 20 years, but it’s hardly ever used (or known). Look in your switch or router manual for something called RED or WRED; on your WiFi access points, you may be in luck if you find “CoDel” or “FQ-CoDel.” RED / WRED, despite being readily available on all decent gear, has a reputation of being complex to set up – which is true if you strive for perfection. But isn’t the perfect the enemy of the good? And that complexity is typically discussed in the context of maximum bandwidth utilization – which is no longer the main issue at hand. Latency is. If you are doing an RFP, ask your vendor about auto-tuning mechanisms like PIE or CoDel (perhaps even in the FQ – fair / flow queuing – variant). They are really easy, as the switch will automatically figure out the optimal settings at each moment in time (evolving with your network’s demands), so you never have to tweak them.
But even good old stock RED / WRED, without any adjustment of the parameters, will space out the necessary (!) losses more evenly in time, thus preventing burst losses, which are very time-consuming to recover from.
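For the curious, the core idea of RED fits in a few lines. This is a sketch of the textbook drop probability curve only: it leaves out the averaging of the queue length and the spacing between drops that real implementations add, and the thresholds and maximum probability are hypothetical values, not recommended settings.

```python
def red_drop_probability(avg_queue: float, min_th: float, max_th: float,
                         max_p: float) -> float:
    """Probability of dropping (or marking) an arriving packet under basic RED."""
    if avg_queue < min_th:
        return 0.0  # queue is short: never drop
    if avg_queue >= max_th:
        return 1.0  # queue is too long: behaves like drop-tail
    # In between, the probability rises linearly from 0 to max_p.
    return max_p * (avg_queue - min_th) / (max_th - min_th)


# Hypothetical thresholds, in packets.
for avg in (10, 30, 40, 50, 60):
    p = red_drop_probability(avg, min_th=20, max_th=60, max_p=0.1)
    print(f"average queue {avg:>2} packets -> drop probability {p:.3f}")
```

The point of the slope is that a sender receives a single, isolated loss while the queue is still moderately filled, long before a burst of packets would have to be discarded.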
In summary, let me conclude with the following observations: It’s a false goal to try to avoid packet loss at all costs (deep buffered switches, priority or link layer flow control) when you are running TCP. TCP will just try to go even faster, inducing unnecessary buffering delays. Instead, do away with the legacy drop-tail queueing discipline that is the factory default everywhere and interacts poorly with latency-sensitive but reliable data transfers such as we have in storage. Moving to an AQM like RED / WRED (random-detect) is also the first step towards enabling truly lossless networks with today’s technology. But more about Explicit Congestion Notification (enabled by default on more hosts in your environment than you are aware of) in a later installment.
tl;dr summary: Enable random-detect (RED) on switches and TLP on Linux and Windows, and prepare for ECN, to address latency spikes you observe on your clients but not on the storage.
(* Limited Transmit and Early Retransmit are mechanisms that are in widespread use already. The first tries to send additional data on duplicate ACKs, to improve the chances that three duplicate ACKs get generated. But it requires data to be available for sending, which may not be the case in a synchronous transaction. Early Retransmit approaches this from the other side, reducing the number of duplicate ACKs necessary before starting fast retransmission. But if the very last packet of such a transaction is lost, only TLP can keep the latency in check.)