So, I sort of understand this whole TCP thing: You open a connection, you send packets, you close the connection. TCP provides a reliable delivery protocol layered on top of the unreliable IP protocol. So your data gets wrapped in a TCP segment, which gets wrapped in an IP datagram.
But what does that actually look like?
Web requests, email, and all of that add another layer of protocol overhead on top of TCP, so let’s start out with something really simple: the world’s dumbest instant messaging service. We’re going to use netcat, the Swiss army knife of TCP/IP utilities. All we’re going to do is have one netcat process (the server) listen on a TCP port, and have another netcat process (the client) open a connection to it. Both will send any messages typed on the command line, and print any messages they get. We start up the server like so:
$ nc -l 43981
That’s just telling netcat to start up and listen on port 43981. Why 43981? We’ll get to that in a bit.
Then we switch to another terminal, and start up the client like so:
$ nc localhost 43981
Here, we need to tell it which server to connect to, and give it the same port number. Then we type stuff into the client:
$ nc localhost 43981
hello world!
how's it going?
Each time we hit return, the line shows up in the server:
$ nc -l 43981
hello world!
how's it going?
A key thing about TCP is that it’s a two-way connection. Part of what the client does when it opens the connection is tell the server how to send messages back to it. So here we can also type something into the server:
$ nc -l 43981
hello world!
how's it going?
pretty good!
And it will show up in the client:
$ nc localhost 43981
hello world!
how's it going?
pretty good!
When we get bored, we hit ctrl-c to quit either the server or the client, and the other shuts down automatically.
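For a rough idea of what netcat is doing for us here, this is a minimal Python sketch of the same exchange. It is not how netcat is implemented. The helper names (`listen`, `recv_line`, `demo`) are made up for illustration, and it sends one line in each direction rather than running an interactive session. The demo asks the OS for any free port (port 0) instead of 43981, so it won't collide with a netcat you already have running.

```python
import socket

def listen(host="127.0.0.1", port=43981):
    """Open a listening TCP socket, roughly what `nc -l 43981` does."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind((host, port))
    srv.listen(1)
    return srv

def recv_line(sock):
    """Read until a newline arrives. TCP delivers a byte stream, not lines."""
    buf = b""
    while not buf.endswith(b"\n"):
        chunk = sock.recv(1024)
        if not chunk:              # the other side closed the connection
            break
        buf += chunk
    return buf

def demo():
    srv = listen(port=0)                       # port 0: let the OS pick a free port
    port = srv.getsockname()[1]
    client = socket.create_connection(("127.0.0.1", port))  # like `nc localhost PORT`
    conn, _addr = srv.accept()                 # server side of the same connection

    client.sendall(b"hello world!\n")          # client -> server
    print(recv_line(conn))
    conn.sendall(b"pretty good!\n")            # server -> client, same connection
    print(recv_line(client))

    for s in (client, conn, srv):
        s.close()

if __name__ == "__main__":
    demo()
```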
Pop the Hood
Ok, so that’s it. Messages going across a TCP connection a line at a time. Totally bare-bones. So what’s going on under the hood? To answer that, we’re going to re-run this little exercise, and this time we’re going to use tcpdump to listen in on the conversation. As the name implies, tcpdump listens in on TCP traffic and dumps it out to the screen. So, open a third terminal and fire up tcpdump:
$ sudo tcpdump -i lo -X port 43981
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on lo, link-type EN10MB (Ethernet), capture size 65535 bytes
“-i lo” tells it to listen on the loopback interface, since our machine is just sending messages to itself, and “-X” will dump out the TCP segments in a couple of useful formats. “port 43981” tells it to only report traffic to and from our netcat server port.
We don’t see anything when we start up our netcat server, but as soon as we start up the client, we get this in the tcpdump terminal:
12:50:53.227362 IP localhost.59356 > localhost.43981: Flags [S], seq 586457076, win 32792, options [mss 16396,sackOK,TS val 48581958 ecr 0,nop,wscale 7], length 0
0x0000: 4500 003c d2b2 4000 4006 6a07 7f00 0001 E..<..@.@.j.....
0x0010: 7f00 0001 e7dc abcd 22f4 9ff4 0000 0000 ........".......
0x0020: a002 8018 fe30 0000 0204 400c 0402 080a .....0....@.....
0x0030: 02e5 4d46 0000 0000 0103 0307 ..MF........
12:50:53.227404 IP localhost.43981 > localhost.59356: Flags [S.], seq 2685804629, ack 586457077, win 32768, options [mss 16396,sackOK,TS val 48581958 ecr 48581958,nop,wscale 7], length 0
0x0000: 4500 003c 0000 4000 4006 3cba 7f00 0001 E..<..@.@.<.....
0x0010: 7f00 0001 abcd e7dc a016 2055 22f4 9ff5 ...........U"...
0x0020: a012 8000 fe30 0000 0204 400c 0402 080a .....0....@.....
0x0030: 02e5 4d46 02e5 4d46 0103 0307 ..MF..MF....
12:50:53.227439 IP localhost.59356 > localhost.43981: Flags [.], ack 1, win 257, options [nop,nop,TS val 48581958 ecr 48581958], length 0
0x0000: 4500 0034 d2b3 4000 4006 6a0e 7f00 0001 E..4..@.@.j.....
0x0010: 7f00 0001 e7dc abcd 22f4 9ff5 a016 2056 ........"......V
0x0020: 8010 0101 fe28 0000 0101 080a 02e5 4d46 .....(........MF
0x0030: 02e5 4d46 ..MF
What we see here is the client and server negotiating a TCP connection in what’s known as a three-way handshake. Our client sends a packet saying that it wants to start a connection, the server sends back an acknowledgement, and the client responds with a confirmation. For each of these we get a summary line describing the packet and then a dump of the actual contents – verbatim, byte-by-byte. The “0x0000” and such on the left are the byte index in hexadecimal for the start of each row; so zero, 10 (16 in decimal), 20 (32), 30 (48). The big chunk in the center is the data in hex characters. Each hex character is 4 bits (half a byte, and thus referred to as a “nibble” – ah, nerd humor), so each set of 4 is two bytes. The block on the right is the same data, rendered as ASCII characters (with all the non-printing characters shown as periods). Since what we’re dealing with here is all binary data, that’s not useful yet.
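If you’d rather poke at these bytes in code than by eye, here’s a small Python sketch that turns the hex rows back into raw bytes. The function name `parse_hexdump` is my own invention, and it assumes the row layout shown above (an offset, a colon, up to eight hex groups, then the ASCII column). It would get confused by an ASCII column on a short row that happens to look like hex, so treat it as a toy.

```python
import string

HEX_DIGITS = set(string.hexdigits)

def parse_hexdump(dump):
    """Turn `tcpdump -X` style hex rows back into raw bytes."""
    data = bytearray()
    for line in dump.strip().splitlines():
        _offset, _, rest = line.partition(":")   # drop the "0x0010" row index
        for group in rest.split()[:8]:            # at most 8 two-byte groups per row
            if len(group) not in (2, 4) or not set(group) <= HEX_DIGITS:
                break                             # we've reached the ASCII column
            data += bytes.fromhex(group)
    return bytes(data)

# The first packet of the handshake, copied from the dump above.
SYN_PACKET = parse_hexdump('''
0x0000: 4500 003c d2b2 4000 4006 6a07 7f00 0001 E..<..@.@.j.....
0x0010: 7f00 0001 e7dc abcd 22f4 9ff4 0000 0000 ........".......
0x0020: a002 8018 fe30 0000 0204 400c 0402 080a .....0....@.....
0x0030: 02e5 4d46 0000 0000 0103 0307 ..MF........
''')

print(len(SYN_PACKET))         # 60
print(SYN_PACKET[:4].hex())    # 4500003c
```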
So what is all this crap?
Well, like I said, we’ve got TCP segments wrapped in IP datagrams, so the IP data is going to be what we see first. As always, Wikipedia is an awesome resource, and its page on IPv4 lays out the datagram structure for us bit-by-bit. (You may want to open that up in another tab for reference while you’re reading this.) Let’s look at our dump of the first packet:
0x0000: 4500 003c d2b2 4000 4006 6a07 7f00 0001 E..<..@.@.j.....
0x0010: 7f00 0001 e7dc abcd 22f4 9ff4 0000 0000 ........".......
0x0020: a002 8018 fe30 0000 0204 400c 0402 080a .....0....@.....
0x0030: 02e5 4d46 0000 0000 0103 0307 ..MF........
The first thing we see is the “4” telling us that this is an IPv4 datagram (not IPv6). Then a “5” for the header length. That’s in 32-bit words, so each of those will be two blocks of hex characters, and five of them come to 20 bytes. So we already know that the IP header part of this packet is just:
0x0000: 4500 003c d2b2 4000 4006 6a07 7f00 0001 E..<..@.@.j.....
0x0010: 7f00 0001 ....
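The nibble-splitting for that first byte is just a shift and a mask. A tiny Python sketch (the function name is mine, purely for illustration):

```python
def split_version_ihl(first_byte):
    """Split the first IP header byte into (version, header length in bytes)."""
    version = first_byte >> 4        # high nibble: IP version
    ihl_words = first_byte & 0x0F    # low nibble: header length in 32-bit words
    return version, ihl_words * 4    # 4 bytes per 32-bit word

print(split_version_ihl(0x45))  # (4, 20)
```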
The next byte is the DSCP and ECN fields. They’re all zeros, so we can ignore them here, but essentially they tell routers how important or urgent this packet is. In principle, all packets are the same, but in practice we might want some packets – like for Voice Over IP – to have a higher priority if there’s a lot of traffic. This one byte opens a rabbit hole of technical and policy issues.
The next two bytes – 003c or 60 in decimal – tell us the total length of the packet. Sure enough, the packet ends after 12 bytes of the “0x0030” row. Two bytes here means that the total length of the packet can’t be more than 65535 bytes (2^16 – 1).
Each packet incurs a certain amount of overhead in transmission and processing, so it makes sense to put as much data in each packet as possible. But while the IPv4 protocol sets a maximum of 65535 bytes, it doesn’t require that every router support that. Remember that the protocol was developed back when 64K bytes was more than most machines had, and even now, that’s a lot for one message on a router that’s handling large volumes of traffic.
So our next four bytes – d2b2 4000 – deal with fragmentation. When a router has to forward a datagram that’s bigger than the next link can carry, it will break it into fragments. The first thing we need is an Identification field so the receiver knows which fragments go together. The starting value is arbitrary – d2b2 for this one – but you’ll see that it’s incremented normally for later packets. Why not just start with 0001 or 0000? I suspect there’s another rabbit hole there, and I’d guess it has to do with managing multiple connections between the same client and server.
The other two bytes are divided up a little oddly: 3 bits for flags and 13 bits for the Fragment Offset – its index number. That means that the flags are the 8, 4, and 2 bits of the first nibble. It’s 4, so that’s the Don’t Fragment bit. Even if we were fragmented, this is the first packet, so the Fragment Offset is zero.
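Here’s that 3-bit/13-bit split as a Python sketch. The function and key names are made up for illustration. One detail not mentioned above: the Fragment Offset counts in units of 8 bytes, so the sketch multiplies it out.

```python
def split_flags_fragment(word):
    """Split the 16-bit flags/fragment word of an IPv4 header."""
    flags = word >> 13                       # top 3 bits
    return {
        "reserved": bool(flags & 0b100),        # the "8" bit of the first nibble
        "dont_fragment": bool(flags & 0b010),   # the "4" bit
        "more_fragments": bool(flags & 0b001),  # the "2" bit
        "fragment_offset_bytes": (word & 0x1FFF) * 8,  # low 13 bits, in 8-byte units
    }

print(split_flags_fragment(0x4000))
# {'reserved': False, 'dont_fragment': True, 'more_fragments': False, 'fragment_offset_bytes': 0}
```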
Next is TTL – Time To Live. When I bring up this page in a browser on my laptop, it sends packets skipping across the network to my hosting provider in California. They’ll pass through a few routers at my ISP and several more at internet backbone providers across the country before they get to the hosting server. There isn’t a pre-ordained route that they’ll follow. Each router looks at each packet and tries to figure out where to send it to get it closer to its destination. This is what makes the internet robust: If one of those connections goes down, the router will figure out the next best way to get the message through. (And yes, the mechanics of how that works are another whole essay.)
The downside of this is that if one or more routers are mis-configured, they could send the packets back to a previous router, and they’d end up going in loops. To keep packets from circling endlessly, they have a limited lifespan, measured in “hops”. Each router along the way decrements the TTL field. If the packet hasn’t got where it’s going by the time it gets to zero, the router knows something’s gone wrong, and drops it. Our packet starts off with a TTL byte of 40 in hex, so it’s got 64 hops to live.
It may not look like it, but we’re almost down to the end of the IP header here. The next byte tells us the IP Protocol number for the contents of this IP datagram. 06 means it’s TCP.
The next two bytes – 6a07 – are the header checksum. It’s a number calculated from all the bytes in the header. It’s a way to check that the header wasn’t garbled in transit. When a router gets a packet, it calculates a checksum based on the header it received; if any bits got randomly flipped, the checksums won’t match. (This doesn’t protect against intentional tampering because someone could also update the checksum.)
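We can check that 6a07 ourselves. The IP header checksum is the standard “Internet checksum”: add up the header as 16-bit words with the checksum field zeroed, fold any carry back in, and flip the bits. Here’s a Python sketch (the function name is mine), run against the 20 header bytes from our first packet:

```python
def ipv4_checksum(header):
    """One's-complement sum of 16-bit words, complemented."""
    if len(header) % 2:
        header += b"\x00"                      # pad to a whole number of 16-bit words
    total = 0
    for i in range(0, len(header), 2):
        total += (header[i] << 8) | header[i + 1]
    while total >> 16:                         # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

header = bytes.fromhex("4500003cd2b2400040066a077f0000017f000001")
zeroed = header[:10] + b"\x00\x00" + header[12:]   # blank out the checksum field

print(hex(ipv4_checksum(zeroed)))   # 0x6a07, what the sender put in the header
print(ipv4_checksum(header))        # 0, what a receiver gets if nothing was garbled
```

The second line is the trick receivers use: summing the header with the checksum left in place comes out to zero when everything is intact.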
The last two fields are the source and destination IP addresses. Again, this is a two-way connection we’re setting up here, so the client needs to tell the server where to send packets back to. Since we’re just talking to ourselves over the loopback interface, they’re both 127.0.0.1 – 7f00 0001 in hex.
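To pull the whole IP header apart in one go, Python’s `struct` module does the job. This is only a sketch for option-less 20-byte headers like ours, and the function and dictionary key names are my own:

```python
import socket
import struct

def parse_ipv4_header(header):
    """Unpack a 20-byte IPv4 header (no IP options) into a dict."""
    (ver_ihl, dscp_ecn, total_length, identification, flags_frag,
     ttl, protocol, checksum, src, dst) = struct.unpack("!BBHHHBBH4s4s", header[:20])
    return {
        "version": ver_ihl >> 4,
        "header_bytes": (ver_ihl & 0x0F) * 4,
        "dscp_ecn": dscp_ecn,
        "total_length": total_length,
        "identification": identification,
        "flags_frag": flags_frag,
        "ttl": ttl,
        "protocol": protocol,              # 6 means TCP
        "checksum": checksum,
        "src": socket.inet_ntoa(src),      # 4 raw bytes -> dotted quad
        "dst": socket.inet_ntoa(dst),
    }

fields = parse_ipv4_header(bytes.fromhex("4500003cd2b2400040066a077f0000017f000001"))
print(fields["total_length"], fields["ttl"], fields["protocol"])  # 60 64 6
print(fields["src"], fields["dst"])                                # 127.0.0.1 127.0.0.1
```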
Ok, that’s the IP header. Now on to the TCP header. Let’s strip the IP header out of our packet and see what’s left.
0x0000:
0x0010:           e7dc abcd 22f4 9ff4 0000 0000 ....".......
0x0020: a002 8018 fe30 0000 0204 400c 0402 080a .....0....@.....
0x0030: 02e5 4d46 0000 0000 0103 0307 ..MF........
The first two fields – two bytes each – are the source and destination port numbers: e7dc and abcd. That’s why I picked the weird port to run this on: 43981 in hex is abcd, so it’s easy to spot in the output. e7dc is 59356, which isn’t significant – it’s just what was automatically chosen when the client opened the connection. Perhaps the most significant thing about the ports is that they’re not part of the IP header. Ports are a TCP-level concept; the IP layer only cares about getting the packets to the right machine.
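If you want to double-check the hex, two lines of Python will do it (nothing here beyond the standard library):

```python
import struct

# The first four bytes of the TCP header: source port, then destination port.
src_port, dst_port = struct.unpack("!HH", bytes.fromhex("e7dcabcd"))
print(src_port, dst_port)   # 59356 43981
print(hex(43981))           # 0xabcd
```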
The next four bytes – 22f4 9ff4 – are the Sequence Number (586,457,076). As with the Fragment Offset in the IP layer, this is to keep track of what order the segments belong in and which have been received. The big difference is that here it’s the index of the starting byte in the segment, so it will increase from segment to segment by the number of bytes in the TCP data. It also starts at an arbitrary value, and loops around to zero when it hits the maximum Sequence Number (4 Gigabytes). More on this later.
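The wrap-around is plain modular arithmetic on a 32-bit counter. A quick Python sketch (the names `SEQ_MODULUS` and `advance_seq` are mine):

```python
SEQ_MODULUS = 2 ** 32    # Sequence Numbers are 32 bits wide

def advance_seq(seq, nbytes):
    """Sequence Number after sending nbytes of data, wrapping past the maximum."""
    return (seq + nbytes) % SEQ_MODULUS

isn = int.from_bytes(bytes.fromhex("22f49ff4"), "big")
print(isn)                            # 586457076
print(advance_seq(0xFFFFFFFF, 1))     # 0
print(advance_seq(0xFFFFFFF0, 32))    # 16
```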
The next number is the Acknowledgement Number. It’s essentially the Sequence Number for the data received. It’s zero for now, so we’ll talk about it later when it’s got something to say for itself.
The next nibble is the Data Offset, which is the TCP header length in 32-bit words. It’s “a” (10), for a total of 40 bytes, which matches what we can see.
The rest of the a002 block is made up of unused bits (reserved for future use) and flags. They’re all zero except the 2 bit, which is the SYN flag (for synchronize), which means that this segment is the start of a connection.
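Here’s a Python sketch that splits that two-byte block into the header length and a list of flag names. The function name is made up, and for simplicity it only looks at the low eight flag bits.

```python
# Flag bits in the low byte of the block, lowest bit first.
TCP_FLAGS = [
    (0x01, "FIN"), (0x02, "SYN"), (0x04, "RST"), (0x08, "PSH"),
    (0x10, "ACK"), (0x20, "URG"), (0x40, "ECE"), (0x80, "CWR"),
]

def split_offset_flags(word):
    """Return (TCP header length in bytes, list of flag names that are set)."""
    header_bytes = (word >> 12) * 4          # top nibble: Data Offset in 32-bit words
    names = [name for bit, name in TCP_FLAGS if word & bit]
    return header_bytes, names

print(split_offset_flags(0xA002))   # (40, ['SYN'])
print(split_offset_flags(0xA012))   # (40, ['SYN', 'ACK'])
```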
The next two bytes – 8018 (32792 in decimal) – are the Window Size. This is the sender putting a cap on how much data can be sent back to it, in case it has limited resources. I don’t know the reason for that exact number, but there’s a surprise here: We’ll see in a minute that there’s an optional field that multiplies this value.
Next is the TCP checksum. Unlike the IP checksum, this one is summed across both the TCP header and data. Why doesn’t the IP checksum just do both? I’m not sure, but I’d guess there’s both a design principle and a practical reason. TCP shouldn’t really depend on IP for that. Even though they were designed to work together, they have separate responsibilities. In theory, you could run TCP on top of protocols other than IP, though I don’t know of anyone doing that. So if TCP has to calculate its own checksum, there’s no point making IP do it as well. The practical concern is that the TCP data can be huge compared to the 20 bytes of IP header, and the IP checksum has to be checked at every hop; the TCP checksum is only checked when it reaches its destination.
The last standard field is the Urgent Pointer. The URG flag wasn’t set, so this is 0000. As to when that flag is set and how the urgent pointer is used when it is, that’s probably yet another rabbit hole.
Beyond that, we have a number of optional fields. They’re odd in that they’re not in a specific order, they’re different sizes, and they may have multiple sub-fields. The first byte of each tells us what type of field it is. I’ll run through them quickly, putting pipes between the sub-fields so you can see how they’re broken up.
- 02|04|400c: Maximum segment size = 400c (16,396 bytes). This is used by the TCP layer to limit the segment size and save it from getting fragmented at the IP layer.
- 04|02: Selective acknowledgement permitted. Allows the receiver to request re-transmission of only missing segments, rather than the whole message. More on this later.
- 08|0a|02e5 4d46|0000 0000: Timestamp=02e5 4d46 (48581958), previous timestamp=0000 0000. Used to help determine the order of the TCP segments when the amount of data being sent is more than the maximum Sequence Number.
- 01: no operation – padding to align options on word boundaries for performance.
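- 03|03|07: window scale = 7; multiplies Window Size by 2^7 (128). Strictly speaking, the scaling applies to the segments that follow the handshake rather than to the SYN itself; scaled, 32792 would come to 4,197,376 bytes.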
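Walking the options by hand gets tedious, so here’s a Python sketch that does the same walk: read a type byte, then (for everything except no-op and end-of-list) a length byte that covers the whole option. The function name and the labels it returns are mine, chosen to resemble the tcpdump summary, and it only knows about the option types seen in this packet.

```python
import struct

def parse_tcp_options(raw):
    """Walk the TCP options area, returning a list of (name, value) pairs."""
    options = []
    i = 0
    while i < len(raw):
        kind = raw[i]
        if kind == 0:                      # end of option list
            break
        if kind == 1:                      # no-operation: a single padding byte
            options.append(("nop", None))
            i += 1
            continue
        if i + 1 >= len(raw):
            raise ValueError("option is missing its length byte")
        length = raw[i + 1]                # counts the type and length bytes too
        if length < 2 or i + length > len(raw):
            raise ValueError("bad option length")
        body = raw[i + 2:i + length]
        if kind == 2:
            options.append(("mss", struct.unpack("!H", body)[0]))
        elif kind == 3:
            options.append(("wscale", body[0]))
        elif kind == 4:
            options.append(("sackOK", None))
        elif kind == 8:
            options.append(("timestamp", struct.unpack("!II", body)))
        else:
            options.append(("kind %d" % kind, body))
        i += length
    return options

raw = bytes.fromhex("0204400c" "0402" "080a02e54d4600000000" "01" "030307")
print(parse_tcp_options(raw))
# [('mss', 16396), ('sackOK', None), ('timestamp', (48581958, 0)), ('nop', None), ('wscale', 7)]
```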
We can check our homework here by looking at the packet summary. (Come on, it wouldn’t have been any fun if we did that first!)
12:50:53.227362 IP localhost.59356 > localhost.43981: Flags [S], seq 586457076, win 32792, options [mss 16396,sackOK,TS val 48581958 ecr 0,nop,wscale 7], length 0
Now that we know what we’re looking at, it’s pretty easy to read: time; host.port for source and destination; SYN flag; Sequence Number; window size; options with max segment, selective ack, timestamp, no-op, and window scale; and data length of zero.
Awesome, done! That’s the first packet.
The Rest of the Handshake
Now that we have the structure down, we just need to look at what’s different about the rest of the packets, and we can get most of what we need from the summary lines.
So, packet two is the response from our server.
12:50:53.227404 IP localhost.43981 > localhost.59356: Flags [S.], seq 2685804629, ack 586457077, win 32768, options [mss 16396,sackOK,TS val 48581958 ecr 48581958,nop,wscale 7], length 0
0x0000: 4500 003c 0000 4000 4006 3cba 7f00 0001 E..<..@.@.<.....
0x0010: 7f00 0001 abcd e7dc a016 2055 22f4 9ff5 ...........U"...
0x0020: a012 8000 fe30 0000 0204 400c 0402 080a .....0....@.....
0x0030: 02e5 4d46 02e5 4d46 0103 0307 ..MF..MF....
What’s different? The IP identification is 0000, which strikes me as odd. Don’t know what’s going on there. And the checksum is different because of that. If we were connecting two different machines, we’d have seen the source and destination addresses switch. In the TCP header, the ports switched, which is the tip-off that this packet is going from the server to the client. We have a new Sequence Number, since the client and server keep separate counts of the data bytes they send. We now have an Acknowledgement Number, which is the client’s Sequence Number from the last packet, plus one. Both the SYN and ACK flags are set, marking this as the server acknowledgement. There’s a slightly different Window Size, but with the same scaling factor. The previous timestamp is set. You’d expect a different TCP checksum because of all that, but the field reads fe30 in both packets. My best guess is that the kernel doesn’t bother computing a full checksum for loopback traffic, so tcpdump is showing a placeholder rather than a real one.
The third packet is the client’s confirmation of the connection. The server knows that the client asked for a connection, and the server knows that it sent an acknowledgement, but it needs to know that the client got the acknowledgment.
12:50:53.227439 IP localhost.59356 > localhost.43981: Flags [.], ack 1, win 257, options [nop,nop,TS val 48581958 ecr 48581958], length 0
0x0000: 4500 0034 d2b3 4000 4006 6a0e 7f00 0001 E..4..@.@.j.....
0x0010: 7f00 0001 e7dc abcd 22f4 9ff5 a016 2056 ........"......V
0x0020: 8010 0101 fe28 0000 0101 080a 02e5 4d46 .....(........MF
0x0030: 02e5 4d46 ..MF
The IP identification has been incremented, which changes the checksum. The TCP Sequence Number has been incremented and matches the Acknowledgement Number from the previous server packet. Likewise, the Acknowledgement Number is now the Sequence Number from the server packet, incremented. We have fewer options – just the timestamps and a couple of no-ops – so the Data Offset is only 8 (32 bytes). Only the ACK flag is set, which says that the connection is solid now. The Window Size is 257, with no scaling factor in the options. That’s expected: the window scale option is only sent in the SYN packets, and the factor agreed there applies to every later segment, so 257 really means 257 times 128, or 32,896 bytes. The window stays in use for the life of the connection, which is why it’s not zero.
Anyway, hey, TCP connection established! So this is the first thing we’d see whether we’re sending email, hitting a web page, or whatever.
Getting Down to Work
After that, we send a message from the client to the server, and get a response back. This time, we’re actually sending data!
12:51:04.418321 IP localhost.59356 > localhost.43981: Flags [P.], seq 1:14, ack 1, win 257, options [nop,nop,TS val 48584755 ecr 48581958], length 13
0x0000: 4500 0041 d2b4 4000 4006 6a00 7f00 0001 E..A..@.@.j.....
0x0010: 7f00 0001 e7dc abcd 22f4 9ff5 a016 2056 ........"......V
0x0020: 8018 0101 fe35 0000 0101 080a 02e5 5833 .....5........X3
0x0030: 02e5 4d46 6865 6c6c 6f20 776f 726c 6421 ..MFhello.world!
0x0040: 0a .
12:51:04.418446 IP localhost.43981 > localhost.59356: Flags [.], ack 14, win 256, options [nop,nop,TS val 48584755 ecr 48584755], length 0
0x0000: 4500 0034 6d10 4000 4006 cfb1 7f00 0001 E..4m.@.@.......
0x0010: 7f00 0001 abcd e7dc a016 2056 22f4 a002 ...........V"...
0x0020: 8010 0100 fe28 0000 0101 080a 02e5 5833 .....(........X3
0x0030: 02e5 5833 ..X3
Here’s where the ASCII output finally becomes useful. You can spot the “hello world!” content right away, which makes it a lot easier to keep track of which packets are which as we’re digging through this.
In the first packet, the Total Length is bigger by 13 (“hello world!” plus the newline character). The client set the PSH flag, to indicate that there’s data to push to the application (netcat). The Acknowledgement and Sequence numbers are the same as in the last handshake packet because no data was sent in between; but then in the server’s response, its Acknowledgement Number is 13 (no coincidence) more than the client’s Sequence Number.
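To see those 13 bytes fall out of the header lengths, here’s a Python sketch that slices the payload out of the client’s packet using nothing but the length fields we’ve already decoded. The function name `tcp_payload` is mine, and it assumes a plain IPv4 packet carrying TCP.

```python
def tcp_payload(packet):
    """Return the TCP data bytes from a raw IPv4 packet carrying TCP."""
    ip_header_len = (packet[0] & 0x0F) * 4                   # IHL, in 32-bit words
    total_length = int.from_bytes(packet[2:4], "big")        # whole datagram
    tcp_header_len = (packet[ip_header_len + 12] >> 4) * 4   # Data Offset nibble
    return packet[ip_header_len + tcp_header_len:total_length]

# The client's "hello world!" packet, copied from the dump above.
DATA_PACKET = bytes.fromhex(
    "45000041d2b4400040066a007f000001"
    "7f000001e7dcabcd22f49ff5a0162056"
    "80180101fe3500000101080a02e55833"
    "02e54d4668656c6c6f20776f726c6421"
    "0a"
)

payload = tcp_payload(DATA_PACKET)
seq = int.from_bytes(DATA_PACKET[24:28], "big")   # TCP Sequence Number field
print(payload)                    # b'hello world!\n'
print(hex(seq + len(payload)))    # 0x22f4a002, the Acknowledgement Number in the server's reply
```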
Ok, so now that the Acknowledgement number is starting to move for real, let’s talk about what all the futzing around with it and the Sequence Number is about. This is really the core of TCP, what makes it special. This is how it guarantees that the data gets through even when IP delivery fails and packets get dropped. To do that, the client needs to keep track of each chunk of data it sends out, and it needs to get a response from the server saying that piece has been received. It’s like registered mail but better, because the response tells the client not only that the server got a packet, but how much data it got and where it is in the client’s data set. (The Acknowledgement Number is actually the number of the next byte the server expects to get from the client).
TCP isn’t normally a one-for-one exchange like this. Often, the client would send out a whole mess of packets at once. Rather than acknowledging each individually, which would generate a whole lot of traffic, the server just sends back the Acknowledgement Number for the highest packet received, assuming it gets them all. If it doesn’t, if there are packets missing, it could send an acknowledgement for the highest contiguous packet it gets, and have the client re-send everything later. But it could be smarter than that, and this is where the Selective Acknowledgement option comes in. That lets the server acknowledge several discontinuous blocks (as start and end bytes), so the client only has to re-send the missing pieces.
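Here’s a toy Python model of that decision from the receiver’s side: given the byte ranges that have arrived so far, work out the Acknowledgement Number and which out-of-order blocks could be reported via Selective Acknowledgement. It’s a sketch under simplifying assumptions: the function name is made up, it ignores Sequence Number wrap-around, and a real TCP stack can only fit a few blocks into the options.

```python
def ack_and_sack(next_expected, received):
    """received: list of (start, end) byte ranges, end exclusive.

    Returns (acknowledgement number, list of out-of-order blocks).
    """
    merged = []
    for start, end in sorted(received):
        if merged and start <= merged[-1][1]:      # touches or overlaps the last block
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])

    ack = next_expected
    sack_blocks = []
    for start, end in merged:
        if start <= ack:                # contiguous with what we already have
            ack = max(ack, end)
        else:                           # there's a gap before this block
            sack_blocks.append((start, end))
    return ack, sack_blocks

# Bytes 1001-2000 went missing: acknowledge up to 1001 and report the rest.
print(ack_and_sack(1, [(1, 1001), (2001, 3001), (3001, 4001)]))
# (1001, [(2001, 4001)])

# Once the missing piece is re-sent, a single acknowledgement covers everything.
print(ack_and_sack(1, [(1, 1001), (2001, 3001), (3001, 4001), (1001, 2001)]))
# (4001, [])
```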
Also remember that this is a two-way conversation. When the server is sending an acknowledgement to the client, it’s also sending its own Sequence Number, so the client can keep track of what it’s received from the server. In this case, all the content is in the outbound message, and what’s coming back is an empty acknowledgement. But in an HTTP request, we’d see content in the outbound message – request headers, the type of request (GET, POST, etc.), the path to the web page we’re requesting, and any form data – and the response would have the HTML content of the web page.
The one thing left to show you is what happens when we close the connection. If you hit ctrl-c in either terminal, you’ll see both the client and server exit immediately, but you’ll also see a bunch of traffic in tcpdump.
12:51:38.214610 IP localhost.59356 > localhost.43981: Flags [F.], seq 30, ack 14, win 257, options [nop,nop,TS val 48593204 ecr 48590359], length 0
0x0000: 4500 0034 d2b7 4000 4006 6a0a 7f00 0001 E..4..@.@.j.....
0x0010: 7f00 0001 e7dc abcd 22f4 a012 a016 2063 ........"......c
0x0020: 8011 0101 fe28 0000 0101 080a 02e5 7934 .....(........y4
0x0030: 02e5 6e17 ..n.
12:51:38.215934 IP localhost.43981 > localhost.59356: Flags [F.], seq 14, ack 31, win 256, options [nop,nop,TS val 48593205 ecr 48593204], length 0
0x0000: 4500 0034 6d13 4000 4006 cfae 7f00 0001 E..4m.@.@.......
0x0010: 7f00 0001 abcd e7dc a016 2063 22f4 a013 ...........c"...
0x0020: 8011 0100 fe28 0000 0101 080a 02e5 7935 .....(........y5
0x0030: 02e5 7934 ..y4
12:51:38.215994 IP localhost.59356 > localhost.43981: Flags [.], ack 15, win 257, options [nop,nop,TS val 48593205 ecr 48593205], length 0
0x0000: 4500 0034 d2b8 4000 4006 6a09 7f00 0001 E..4..@.@.j.....
0x0010: 7f00 0001 e7dc abcd 22f4 a013 a016 2064 ........"......d
0x0020: 8010 0101 fe28 0000 0101 080a 02e5 7935 .....(........y5
0x0030: 02e5 7935 ..y5
We’ve sent a couple more messages back and forth (“how’s it going?”, “pretty good!”), so the Sequence and Acknowledgement numbers have jumped ahead a bit, as have the timestamps.
The real action here is in the TCP flags, the 2nd byte in the 0x0020 row. The ACK bit is still set, but now the FIN bit is too. That’s the client telling the server to close the connection. The server sends back a response with the FIN bit set, and the client sends a simple acknowledgement. It’s the same send-acknowledge-confirm exchange that we saw in the opening handshake.
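You may have noticed that the summary lines say things like “seq 30” and “ack 14” while the hex shows much bigger numbers. tcpdump is printing them relative to each side’s starting Sequence Number. A short Python sketch of that subtraction (the names are mine), which also shows that a FIN, like a SYN, uses up one sequence number even though it carries no data:

```python
def relative(absolute, initial):
    """Sequence Number relative to the connection's starting value."""
    return (absolute - initial) % 2 ** 32   # modulo handles wrap-around

CLIENT_ISN = 0x22F49FF4   # Sequence Number in the client's SYN
SERVER_ISN = 0xA0162055   # Sequence Number in the server's SYN/ACK

print(relative(0x22F4A012, CLIENT_ISN))   # 30: the client's FIN
print(relative(0x22F4A013, CLIENT_ISN))   # 31: the server's ack of that FIN
print(relative(0xA0162063, SERVER_ISN))   # 14: the server's FIN
print(relative(0xA0162064, SERVER_ISN))   # 15: the client's final ack
```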
Wrapping Up, Moving On
Ok, so that’s been a lot to absorb, but what I hope you’ve gotten out of this is that all these internet protocol details are interestingly complex, but totally comprehensible. You’ve got the tools to look under the hood, and with a bit of patience you can figure out what all the parts are doing.
If you haven’t actually run through this little netcat/tcpdump exercise on your own terminal, give it a try. You don’t need to pick through it byte-by-byte like I have. Just take a couple minutes to watch the packets go back and forth, and skim the summary lines. That gave me a sort of visceral sense of what’s going on.
If you do want to dig into this more, try pointing tcpdump at a real service. Set up a minimal web page on your local web server, point tcpdump at port 80, and hit the page with your browser. I just tried that myself, and I think it’s going to keep me busy for a while.
Thanks to Frank Hunleth for corrections to the original version of this post.