Transport Layer Protocols

I collect transport-layer protocols. NONE MORE DORK.

Transmission Control Protocol (TCP)

Specification	RFC 793 (amended by RFC 1122)
Connections?	Y
Messages?	N
Reliable?	Y
Ordered?	Y
Weight	heavy

The protocol which transports 99% of the world's network data (not counting phone calls).

There are a bunch of specifications extending RFC 793; apart from RFC 1122, the only one which officially updates it is RFC 3168, which adds support for ECN, but there are specs for high-performance options, SACK, DSACK, and a bunch of other stuff.

The single big weakness of TCP, to my mind, is that it's a stream-oriented protocol, when almost all application protocols are message-oriented in some way (the only one i can think of that isn't is telnet). This means that every application-layer protocol has to provide its own messaging sublayer (usually an implicit one), which is a lot of wasted effort. Also, the invisibility of the message boundaries to the TCP layer means it can't use them to organise its transmissions, so you end up with hacks like Nagle's algorithm to make it work smoothly. Yes, being a stream fits naturally with the unix programming model, but then the unix programming model is cracked anyway.
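To make the 'messaging sublayer' point concrete, here's a minimal sketch of the most common one: a length prefix in front of every message. The 4-byte big-endian prefix and the names send_message, recv_exactly and recv_message are illustrative choices, not taken from any particular protocol.

```python
import socket
import struct

# 4-byte big-endian length prefix: an arbitrary choice for this sketch
LENGTH_PREFIX = struct.Struct(">I")


def send_message(sock, payload):
    """Write one message as a length prefix followed by the payload."""
    sock.sendall(LENGTH_PREFIX.pack(len(payload)) + payload)


def recv_exactly(sock, count):
    """Read exactly `count` bytes, looping because recv may return fewer."""
    chunks = []
    remaining = count
    while remaining > 0:
        chunk = sock.recv(remaining)
        if not chunk:
            raise ConnectionError("stream ended in the middle of a message")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def recv_message(sock):
    """Read one length-prefixed message from the stream."""
    (length,) = LENGTH_PREFIX.unpack(recv_exactly(sock, LENGTH_PREFIX.size))
    return recv_exactly(sock, length)


def demo():
    # socketpair gives two connected stream sockets without touching the network
    left, right = socket.socketpair()
    with left, right:
        send_message(left, b"hello")
        send_message(left, b"")
        send_message(left, b"world!")
        return [recv_message(right) for _ in range(3)]
```

Every application protocol on TCP ends up with some variation of recv_exactly, which is exactly the wasted effort complained about above.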

Another weakness of TCP is its setup overhead. TCP carries out an exchange of packets (the 'three-way handshake') before the endpoints get to exchange data. In addition, the congestion control algorithm for TCP involves a 'slow start', where transmission starts slowly, and ramps up to the capacity of the route over time. These factors combine to mean that a TCP connection does not become efficient until quite a number of packets in; whilst this is not a problem for long-lived connections (as used by connection-oriented application layer protocols, or those making large transfers), it makes TCP very unwieldy for short-lived connections, as used by many service protocols (like DNS, SNMP, etc).
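A rough way to see the cost is to count round trips. The sketch below is a toy model, not a description of any real TCP stack: it assumes no packet loss, one round trip for the handshake, an initial window of one segment (real stacks typically start larger), and a window that doubles every round trip until it hits a threshold. The function name and its parameters are made up for the illustration.

```python
def round_trips_to_send(segments, initial_window=1, ssthresh=64):
    """Count round trips needed to deliver `segments` segments.

    Toy model: no loss, one round trip for the handshake, and a window that
    doubles each round trip until it reaches `ssthresh`, then grows by one.
    """
    round_trips = 1  # the three-way handshake, before any data flows
    window = initial_window
    sent = 0
    while sent < segments:
        sent += window
        round_trips += 1
        window = window * 2 if window < ssthresh else window + 1
    return round_trips
```

Under these assumptions a one-segment exchange costs two round trips (where a datagram protocol would need one), and a hundred segments take eight, most of them spent ramping up.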

Transactional TCP (T/TCP)

RFC 1644 specifies a modification of TCP (which never really took off) which allows a TCP connection to start carrying data earlier, partially overcoming the setup overhead.

TCP With Sequenced Packets

The lack of message demarcation is addressed by my modest proposal for sequenced packets over ordinary TCP.

User Datagram Protocol (UDP)

Specification	RFC 768
Connections?	N
Messages?	Y
Reliable?	N
Ordered?	N
Weight	ultralight

The protocol which transports the other 1% of the world's traffic.

UDP's killer problem is its unreliability; messages are guaranteed to be delivered intact if at all, but there's no guarantee that they'll actually be delivered. Other problems are lack of in-order delivery, lack of duplicate prevention, lack of connections, and the practical limitation of message size to the network layer MTU (anything bigger relies on IP fragmentation). If you don't need those, though, UDP is boss.

UDP Lite

RFC 3828 specifies UDP Lite, a minor modification of UDP which allows delivery of damaged messages. This may be useful for error-tolerant application layer protocols, such as streaming audio or video protocols.

Stream Control Transmission Protocol (SCTP)

Specification	RFC 2960
Connections?	Y
Messages?	Y
Reliable?	Y
Ordered?	Y (optional)
Weight	heavy

Big, scary protocol with more options than you can shake a stick at. It was essentially designed as a successor to TCP, although it's not intended to replace it. The major changes, from the application point of view, are that it provides a message-oriented connection, and messages can optionally be delivered out of order ('order of arrival') in a fairly flexible way. Other changes include multiplexing of several streams of messages within a connection, multihoming of connections (so connections can be spread over several networking interfaces at either end), and bundling of multiple messages into a single network-layer packet. Internally, SCTP uses more complex mechanisms for flow control and validation than TCP.

SCTP messages can be larger than the network layer MTU.

See also RFC 3286 for a gentle introduction to SCTP.

Internet Link (IL)

Specification	'The IL Protocol' (Plan 9 Manual)
Connections?	Y
Messages?	Y
Reliable?	Y
Ordered?	Y
Weight	medium

This is the transport-layer protocol used for RPC in the Plan 9 operating system. It's used to transport a reliable, duplicate-free, ordered stream of smallish (up to MTU sized) messages from one host to another. IL doesn't really have any flow control, although a rudimentary form could probably be added, using the information used for reliable delivery.

IL packets sit inside IP packets (with protocol number 40 = 0x28), and look like this:

    0                   1                   2                   3   
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |            Checksum           |         Packet Length         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  Packet Type  |    Special    |          Source Port          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |        Destination Port       |             ?!?!?!
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                     Sequence Identifier                       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                   Sequence Acknowledgement                    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Where the fields are:

Checksum (ilsum): IP-style checksum (complement of one's-complement sum) over the entire packet, including the IP header, the IL header, and the payload, with the sum and special fields taken as zero. Fucking stupid definition if you ask me - firstly, it's time to lose the stupid IP style checksum, and use a proper CRC, secondly, it should use a TCP-style IP pseudo-header, not the actual header, and thirdly, it shouldn't take the special field as zero (unless there's a reason for that i don't know).
Packet Length (illen): Total length of the packet, from the start of the IL header.
Packet Type (iltype): Identifies the type of packet; one of sync (0), data (1), dataquery (2), ack (3), query (4), state (5) or close (6).
Special (ilspec): Reserved for future use. The spec doesn't say what you should do with it now; internet tradition says set it to zero when sending, and ignore its value when receiving.
Source Port (ilsrc): Source port number, much as in TCP.
Destination port (ildst): Destination port number, much as in TCP.
?!?!?!: Okay. The spec (ie the Plan 9 manual) defines the header in terms of a C struct. The struct is laid out in such a way that any C compiler known to man will insert two bytes of padding at this point, and if there isn't padding here, the next two four-byte quantities are not aligned on four-byte boundaries, which would be kind of unprecedented in an internet protocol. However, the spec says in a couple of places that the header is 18 bytes long, which implies that there's no padding. I assume, therefore, that there really is no padding (i guess Plan 9 was first implemented on 16-bit machines, where four-byte quantities are doublewords, and don't have to be naturally aligned). However, drawing the next two fields following on directly makes the diagram look gross, so i'm just leaving a hole here. Note that if there is padding here, tradition says you should set it to zero when sending (although if you leave it to the compiler, it could have any value; under the gcc i have, it'll be 0x0000 if it's allocated on the heap, and 0xffff if it's on the stack - unless you use alloca, in which case it's 0x0000!), and ignore it when receiving.
Sequence Identifier (ilid): The sequence number of the message.
Sequence Acknowledgement (ilack): The sequence number of the last in-sequence message received by the sender.
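
For the unpadded, 18-byte reading of the header, here's a minimal sketch of packing and unpacking it, plus the IP-style checksum arithmetic. Network (big-endian) byte order is an assumption, as are the names IL_HEADER, ILHeader, pack_header, unpack_header and internet_checksum. The checksum function only does the sum over whatever bytes it is given; which bytes to feed it is the business of the field description above.

```python
import struct
from collections import namedtuple

# Assumed layout: big-endian, no padding.
# Fields in order: sum, len, type, spec, src, dst, id, ack
IL_HEADER = struct.Struct(">HHBBHHII")

ILHeader = namedtuple(
    "ILHeader", "ilsum illen iltype ilspec ilsrc ildst ilid ilack"
)


def pack_header(header):
    """Serialise an ILHeader into its 18-byte wire form."""
    return IL_HEADER.pack(*header)


def unpack_header(data):
    """Parse the first 18 bytes of `data` into an ILHeader."""
    return ILHeader(*IL_HEADER.unpack_from(data))


def internet_checksum(data):
    """Complement of the one's-complement sum of 16-bit big-endian words."""
    data = bytes(data)
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for (word,) in struct.iter_unpack(">H", data):
        total += word
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold the carries back in
    return ~total & 0xFFFF
```

IL_HEADER.size comes out as 18, matching the figure the spec quotes. Asking Python for native alignment instead (struct.calcsize("@HHBBHHII")) will typically report two bytes more, which is the padding discussed above, though that number depends on the platform.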

If i were designing IL2, i'd reform the checksum (CRC16 over the whole IL packet, with only the checksum set to zero, plus a pseudo-header as in TCP), drop the packet length (it's available from the network layer, dammit!), and shuffle the fields to lose the padding. But i'm not.

Anyway, the packet structure and the meanings of the fields are all fairly straightforward (ie fairly similar to TCP!). There are four things to explain: the use of sequence numbers, the different types of packet, the handshake and closing exchanges, and the reliability mechanism.

Sequence numbers are easy: every message (not every byte, as in TCP) in a connection has a unique one (unique within each side of the connection, that is - the 5-tuple (source address, source port, destination address, destination port, sequence number) globally uniquely identifies a message), with the first message having an arbitrary number (not zero, please, to give some protection against packets from dead connections), and each subsequent message having a number one higher than the previous one. A packet carrying a message bears the sequence number of that message as an identifier; packets not bearing messages (for which, see below), use the next number due to be assigned to a message. Every packet (with the exception of an opening sync packet) also carries an acknowledgement, which is the sequence number of the last message successfully received by the sender, where 'successfully' means 'intact, and with all preceding messages also successfully received'. Sequence numbers are the basis of IL's reliability mechanism, for which, see below.
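
The 'intact, and with all preceding messages also successfully received' rule is easy to get subtly wrong, so here's a small sketch of a receiver working out what to acknowledge. The class name InOrderTracker is made up, holding on to messages that arrive ahead of a gap is an assumption (an implementation could equally throw them away), and 32-bit wraparound of sequence numbers is ignored for brevity.

```python
class InOrderTracker:
    """Tracks the last message id received with no gaps before it."""

    def __init__(self, last_received):
        self.ack = last_received  # id of the last in-sequence message
        self.pending = set()      # ids that arrived ahead of a gap

    def receive(self, message_id):
        """Record an arriving message id; return the acknowledgement to send."""
        if message_id > self.ack:
            self.pending.add(message_id)
        # Advance over any run of consecutive ids that is now complete.
        while self.ack + 1 in self.pending:
            self.ack += 1
            self.pending.remove(self.ack)
        return self.ack
```

If messages 100, 102 and 103 arrive, the acknowledgement stays at 100 until 101 turns up, at which point it jumps straight to 103.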

There are seven packet types: sync, data, dataquery, ack, query, state and close.

sync: Opens a connection.
data: Transmits a message.
dataquery: Transmits a message and queries the state of the receiver.
ack: Acknowledges receipt of a message.
query: Queries the state of the receiver.
state: Indicates the state of the receiver.
close: Closes a connection.

Only data and dataquery packets carry messages; the other types of packets do not.
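
In code, the packet types and the two distinctions that matter might be written like this. The numeric values are the ones given in the field list above; the names PacketType, carries_message and expects_state_reply are invented for the sketch.

```python
from enum import IntEnum


class PacketType(IntEnum):
    SYNC = 0
    DATA = 1
    DATAQUERY = 2
    ACK = 3
    QUERY = 4
    STATE = 5
    CLOSE = 6


def carries_message(packet_type):
    """Only data and dataquery packets have a payload."""
    return packet_type in (PacketType.DATA, PacketType.DATAQUERY)


def expects_state_reply(packet_type):
    """query and dataquery both ask the receiver to report its state."""
    return packet_type in (PacketType.QUERY, PacketType.DATAQUERY)
```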

The opening handshake for IL is as follows:

Host A picks an initial sequence number (ISN) and sends a sync packet to host B; the sequence identifier is set to the ISN, and the sequence acknowledgement to zero.
Host B receives the sync, picks an ISN of its own and replies with another sync, with the sequence identifier set to its ISN and the sequence acknowledgement set to host A's ISN (ooh - shouldn't it really be one less than A's ISN?).
Both hosts are now clear to send.
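
The loss-free exchange above can be sketched in a few lines. Packet, choose_isn, open_sync, answer_sync and handshake are hypothetical names, and having host A check that the answering sync acknowledges its own ISN is an assumption; the steps above don't say whether it should.

```python
import random
from collections import namedtuple

SYNC = 0  # packet type value for sync

Packet = namedtuple("Packet", "type ident ack")


def choose_isn(rng):
    """Pick a nonzero 32-bit initial sequence number."""
    return rng.randrange(1, 2**32)


def open_sync(isn):
    """Host A's opening sync: identifier is the ISN, acknowledgement is zero."""
    return Packet(SYNC, isn, 0)


def answer_sync(incoming, isn):
    """Host B's reply: its own ISN, acknowledging the ISN it received."""
    return Packet(SYNC, isn, incoming.ident)


def handshake(rng):
    """Run the two-packet exchange and report whether A would accept it."""
    first = open_sync(choose_isn(rng))
    reply = answer_sync(first, choose_isn(rng))
    accepted = reply.type == SYNC and reply.ack == first.ident
    return first, reply, accepted
```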

The spec is hazy on what to do if packets get lost. I am by no means a transport-layer protocol expert, but my thinking is:

If A's sync gets lost, it times out waiting for an answering sync, and retries. So far so good.
If B's sync gets lost, B thinks the connection is open, but A does not; A eventually times out and retries. Here we have a problem: B will receive a sync on an already-open connection. I suggest that if a host receives a sync on an already-open connection, it should abort it, pretend it never happened, and send an answering sync. Additionally, the initiating host should not use the same ISN on sync retries; this allows the target host to distinguish between retried syncs and duplicate packets, which it can simply ignore.
It's even possible that B will start sending packets after it's sent its sync, in which case, these will arrive at A while it's still waiting for the answering sync, on a connection which it doesn't believe is open yet. A has a few options here. Firstly, it could buffer the packets, hoping that the sync is simply delayed; if it times out waiting, it can discard them, and if the sync arrives, it can process them. Secondly, it could discard them straight away, and carry on waiting for the sync; if it arrives, they can be resent later, and if not, they don't matter. Thirdly, it could discard them straight away, take their arrival as evidence that the answering sync has got lost, and immediately retry its sync. Which is the best option depends on the relative frequencies of packet loss and packet reordering, with increasing likelihood of loss biasing towards the latter. Implementations could even monitor these frequencies and dynamically choose a policy accordingly!
The spec says that a host in the 'Syncee' state - what i call host B - should time out and retry, but it's not clear how it could do this.
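
Here's a sketch of the responder-side policy suggested above for a lost answering sync: a sync with the same ISN as the one that opened the connection is treated as a duplicate and ignored, while a sync with a different ISN is treated as a retry, so the old connection is dropped and answered afresh. This is the suggestion from these notes, not behaviour taken from the spec. The Responder class is hypothetical, and picking a fresh local ISN on a reopen is a further assumption.

```python
import random


class Responder:
    """Host B's handling of incoming syncs for one connection."""

    def __init__(self, rng=None):
        self.rng = rng or random.Random()
        self.peer_isn = None   # ISN from the sync that opened the connection
        self.local_isn = None  # our own ISN for the current connection

    def _fresh_isn(self):
        return self.rng.randrange(1, 2**32)  # nonzero

    def on_sync(self, ident):
        """Return (action, reply); reply is (identifier, acknowledgement) or None."""
        if self.peer_isn is not None and ident == self.peer_isn:
            # Same ISN as before: a duplicated packet, not a retry.
            return ("ignore", None)
        action = "open" if self.peer_isn is None else "reopen"
        # Either a brand new connection, or a retry with a new ISN:
        # forget any old state and answer from scratch.
        self.peer_isn = ident
        self.local_isn = self._fresh_isn()
        return (action, (self.local_isn, ident))
```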

AIUI, sync packets don't carry messages. I don't see why they couldn't, though, and this would allow a fast T/TCP style setup.

Closing a connection is as follows:

Host X sends a close packet to host Y, with the sequence identifier and acknowledgement set as usual.
Host Y receives the close, and sends a close packet back, again with the sequence numbers set as usual.
The connection is closed.

Again, packet loss must be considered:

If X's close gets lost, it times out and retries.
If Y's close gets lost, X will time out and retry, and Y will receive a close on an already closed connection. Therefore, receipt of a close on a closed channel should elicit an appropriate answering close; after all, this doesn't do the answerer any harm.
We don't need to worry about the case of a close arriving on a connection which has been closed and reopened - X won't open a new connection with the same parameters until it's satisfied that the first one is closed. There is the potential for 'Buck Rogers duplicate' close packets (where a packet gets duplicated, and one duplicate flies round the internet for ages before arriving at its destination) arriving on a reopened connection, though; luckily, closes on reopened connections can be detected by sequence number mismatch (probably) and can thus be ignored.
There is a problem with duplication of answering closes; this can arise through network-layer duplication of Y's reply, or by a delay of X's initial close, which could cause it to retry, thus leading to two closes arriving at Y, and two answering closes being sent. If X exhibited the behaviour suggested above, where a close on a closed connection elicits an answering close, an infinite loop of answering closes on a closed connection could develop. As a safeguard against this, X should remember recently closed connections, and ignore closes on them.
There is the question of what happens if messages sent prior to the close have not been received; this could happen two ways: messages from X to Y missing, or messages from Y to X missing. The former can, and should, be avoided entirely: when the application layer requests a close, the transport layer should use normal means to ensure that all messages have been delivered before sending the close. The latter is even simpler: when the application layer requests a close, it is indicating that it has no interest in any further messages, so these messages should simply be forgotten about. The sequence acknowledgement in X's close indicates to Y which was the last of its messages to be delivered, if it cares.
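
The two close rules suggested above (answer a close even on a closed connection; have the closing side remember recently closed connections and ignore closes on them) pull in opposite directions, so here's a sketch of how they might sit together in one host. The Endpoint class, the opaque connection key and the remember_for period are all made up for the illustration, and the sketch leaves out timeouts, retries and the flushing of undelivered messages.

```python
class Endpoint:
    """Close handling for one host; connections are identified by an opaque key."""

    def __init__(self, remember_for=60.0):
        self.open = set()
        self.closing = set()        # we sent a close and await the answer
        self.recently_closed = {}   # key -> time we finished a close we started
        self.remember_for = remember_for

    def request_close(self, key):
        """The application asked to close: return the packet type to send."""
        self.open.discard(key)
        self.closing.add(key)
        return "close"

    def on_close(self, key, now):
        """Handle an incoming close; return "close" to answer, None to stay silent."""
        if key in self.closing:
            # The answer to our own close: finished, and worth remembering.
            self.closing.discard(key)
            self.recently_closed[key] = now
            return None
        closed_at = self.recently_closed.get(key)
        if closed_at is not None and now - closed_at < self.remember_for:
            return None  # duplicate answering close; ignoring it breaks the loop
        self.recently_closed.pop(key, None)
        self.open.discard(key)
        return "close"  # peer-initiated close, or a retry on a closed connection
```

With X calling request_close and Y only ever seeing on_close, Y answers every close it receives (including retries), while X swallows any extra answers for as long as it remembers the connection.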

Finally, reliability. The key thing is that each end of a connection keeps track of which messages the other end has received, by maintaining an awareness of the acknowledged sequence number. During normal, rapid two-way traffic, this occurs simply through the exchange of data packets, which carry a sequence acknowledgement. If only one end is actively sending, then the other end should periodically send an ack packet (using an ack timeout, reset on sending of any kind of packet), purely to communicate its sequence acknowledgement. This is highly straightforward.

It's when packets go missing that things get interesting. Two mechanisms come into play:

The first mechanism is for the maintenance of sequence awareness in the face of packet loss, and involves the query and state packets. If a host has not received a packet from its peer for a certain period, it can send a query packet. On receipt of a query packet, a host should reply with a state packet. These packets have no payload, but simply serve to carry sequence numbers.
The second mechanism corrects packet loss. If a host has not received acknowledgement of a particular message within a certain time, it should send it again, in a new data packet. However, it should only resend the first unacknowledged message; it should not resend subsequent messages until the first has finally been acknowledged.
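
The second mechanism, resending only the first unacknowledged message, can be sketched as a small sender-side queue. The Sender class, its method names, the single timer and the timeout value are assumptions for illustration; sequence number wraparound and the query/state exchange are left out.

```python
from collections import OrderedDict


class Sender:
    """Keeps unacknowledged messages and resends only the oldest on timeout."""

    def __init__(self, first_id, timeout=1.0):
        self.next_id = first_id
        self.timeout = timeout
        self.unacked = OrderedDict()  # message id -> payload, oldest first
        self.deadline = None          # resend time for the oldest unacked message

    def send(self, payload, now):
        """Assign the next sequence number and return the data packet to send."""
        ident = self.next_id
        self.next_id += 1
        self.unacked[ident] = payload
        if self.deadline is None:
            self.deadline = now + self.timeout
        return ("data", ident, payload)

    def on_ack(self, ack, now):
        """Forget every message up to and including `ack`."""
        progressed = False
        while self.unacked and next(iter(self.unacked)) <= ack:
            self.unacked.popitem(last=False)
            progressed = True
        if not self.unacked:
            self.deadline = None
        elif progressed:
            self.deadline = now + self.timeout  # restart the timer for the new oldest

    def on_tick(self, now):
        """Return a packet resending the oldest unacked message if overdue, else None."""
        if self.deadline is None or now < self.deadline:
            return None
        ident, payload = next(iter(self.unacked.items()))
        self.deadline = now + self.timeout
        return ("data", ident, payload)
```

Note that on_tick never returns anything but the head of the queue: later messages wait until the first one has been acknowledged, as described above.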

As an optimisation, a host can send a dataquery packet; this is simply a packet which is both a data and a query - it carries a message, and asks for a state packet to be sent back.

I don't understand why ack and state are separate. Maybe it's so the querier can know that the packet is a response to its query, and not just a delayed acknowledgement.

There is probably a hell of a lot more information about the state of the network and the peer that can be wrung out of these exchanges by a clever implementation. Suggestions on a postcard to Bell Labs, please!

Realtime Transport Protocol (RTP)

Specification	RFC 1889
Connections?	Y
Messages?
Reliable?	N
Ordered?	Y (sort of)
Weight	heavy

Is realtime. For media stuff.

Runs on top of UDP, or another protocol; sort of a transport decorator. It claims to be "a new style of protocol following the principles of application level framing and integrated layer processing proposed by Clark and Tennenhouse". Make of that what you will.