Implementing TCP in Rust

Resources

Questions

  • How does Linux deal with port conflicts between SOCK_RAW and SOCK_STREAM? Does it multicast messages to 1+ sockets? Does the raw socket override all other sockets on the kernel?
    • I tried this using the pnet crate, and it looks like raw sockets can receive all packets for a given protocol, but this doesn’t stop the “regular” stream/datagram sockets from receiving this data too.
    • So I see what jonhoo meant about potentially running into conflicts if we were to implement a userspace TCP stack this way; we’d have to prevent the kernel from registering any other sockets.
    • With that said, “for a given protocol” is probably key here - the main reason for SOCK_RAW’s existence is probably to define new transport protocols, so that can presumably provide enough isolation. IANA leaves most of the upper range of IP protocol numbers unassigned (check the registry for the exact bounds), and 253-254 (0xFD-0xFE) are reserved for experimentation and testing.
    • If I set the pnet socket’s protocol to 1 (ICMP), I can spy on all ping traffic but nothing else.
  • Do both the binder and the connector need to maintain both send & recv queues? (I think yes)
    • YES
  • Does the 32-bit seq number limit imply that only 4 GiB (2^32 bytes, ~4.3GB) of data can be transferred over a single connection, or is it safe to assume that conflicts won’t occur when we wrap around?
  • Why are bind and listen separate calls?

Notes

  • Getting a bit boring at the 3:41:00 mark. Implementing the nitty-gritty of the actual protocol (window checking, etc.) isn’t super interesting to me; I’d rather stop there and learn more about the Linux networking space in general.

Impl Notes

  • pnet crate for raw sockets: https://docs.rs/pnet/0.28.0/pnet

  • Not using this because the kernel’s impl. of TCP can interfere (how?); using tun/tap instead.

  • The tun_tap crate allows sending packets represented as byte-buffers. What level of the stack do these packets live at? Level 1, 2, or 3?

    • Ok, it looks like this is the difference between TUN and TAP: TUN devices carry layer 3 (IP) packets, while TAP devices carry layer 2 (Ethernet) frames. /Screen Shot 2021-06-28 at 3.30.12 PM.png
  • You need to be root to create the TUN device and send IP packets through it (how is this enforced?), but the CAP_NET_ADMIN capability is sufficient even if you aren’t root.

  • Not sure how tun/tap is going to be used to create a TCP stack yet; it seems to be the kind of thing you’d use to create a whole new virtual network in userspace.

  • TUN packets can carry 4 bytes of extra info at the start of each message: 2 bytes of flags followed by 2 bytes of protocol. The protocol seems to match the “EtherType” field in Ethernet frames, so 0x0800 is IPv4 and 0x86DD is IPv6.

  • This seems to be a theme, with lower-level protocols having knowledge of the higher-level protocol being encapsulated. IP packets have a protocol field too: 0x01 is ICMP and 0x06 is TCP.
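
A minimal sketch of peeling these two layers apart by hand, assuming the 4-byte prefix is laid out as 2 bytes of flags (host byte order) followed by 2 bytes of protocol (network byte order), and that the IPv4 protocol field sits at byte offset 9 of the IP header. The names `TunHeader`, `split_tun_frame`, `ipv4_protocol`, and `classify` are made up for illustration and are not part of the tun_tap crate.

```rust
/// Hypothetical view of the 4-byte prefix on a TUN read.
#[derive(Debug, PartialEq, Eq)]
struct TunHeader {
    flags: u16,
    proto: u16,
}

const ETHERTYPE_IPV4: u16 = 0x0800;
const ETHERTYPE_IPV6: u16 = 0x86DD;
const IPPROTO_ICMP: u8 = 0x01;
const IPPROTO_TCP: u8 = 0x06;

/// Splits a raw TUN read into the 4-byte prefix and the encapsulated packet.
fn split_tun_frame(buf: &[u8]) -> Option<(TunHeader, &[u8])> {
    if buf.len() < 4 {
        return None;
    }
    // Assumption: flags are host-endian, the protocol is big-endian.
    let flags = u16::from_ne_bytes([buf[0], buf[1]]);
    let proto = u16::from_be_bytes([buf[2], buf[3]]);
    Some((TunHeader { flags, proto }, &buf[4..]))
}

/// Reads the protocol field (byte 9) out of an IPv4 header.
fn ipv4_protocol(packet: &[u8]) -> Option<u8> {
    // Require a minimal 20-byte header whose version nibble is 4.
    if packet.len() < 20 || packet[0] >> 4 != 4 {
        return None;
    }
    Some(packet[9])
}

/// Names what a TUN read contains, one encapsulation layer at a time.
fn classify(buf: &[u8]) -> &'static str {
    match split_tun_frame(buf) {
        Some((hdr, packet)) if hdr.proto == ETHERTYPE_IPV4 => match ipv4_protocol(packet) {
            Some(IPPROTO_TCP) => "ipv4/tcp",
            Some(IPPROTO_ICMP) => "ipv4/icmp",
            Some(_) => "ipv4/other",
            None => "ipv4/malformed",
        },
        Some((hdr, _)) if hdr.proto == ETHERTYPE_IPV6 => "ipv6",
        Some(_) => "other",
        None => "too short",
    }
}
```

Each layer only looks at its own header and hands the rest of the buffer to the next one, which is why every header needs a field naming what comes next.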

  • A TCP connection is identified by the (srcport, srcaddr, destport, destaddr) quad (where srcport is randomly generated per-connection).
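
One way to use the quad, sketched under the assumption that connections live in a hash map keyed by it. `Quad`, `Connection`, and `record_segment` are illustrative names, and the per-connection state is reduced to a counter.

```rust
use std::collections::HashMap;
use std::net::Ipv4Addr;

/// The four values that identify a connection.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct Quad {
    src: (Ipv4Addr, u16),
    dst: (Ipv4Addr, u16),
}

/// Placeholder for per-connection state.
#[derive(Debug, Default)]
struct Connection {
    segments_seen: u64,
}

/// Looks up (or creates) the connection for `quad` and counts the segment.
fn record_segment(conns: &mut HashMap<Quad, Connection>, quad: Quad) -> u64 {
    let conn = conns.entry(quad).or_default();
    conn.segments_seen += 1;
    conn.segments_seen
}
```

Two segments that differ only in source port land in different map entries, so one client host can hold many connections to the same server port.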

  • The first TCP packet is header-only, so has a zero-byte payload.

  • Server flow:

    • Start listening, state: LISTEN
    • Receive a SYN, respond with a SYN_ACK, state: SYN RCVD
    • Receive an ACK for the SYN_ACK, state: ESTAB
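
The three steps above can be sketched as a tiny state machine. This is a simplification with made-up names (`State`, `Flags`, `Reply`, `on_segment`): it only looks at the SYN and ACK flags, whereas a real stack would also validate sequence and acknowledgment numbers and answer bad segments with a RST.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum State {
    Listen,
    SynRcvd,
    Estab,
}

/// Only the two header flags this sketch cares about.
#[derive(Clone, Copy, Debug, Default)]
struct Flags {
    syn: bool,
    ack: bool,
}

#[derive(Debug, PartialEq, Eq)]
enum Reply {
    SynAck,
    Nothing,
}

/// Passive-open side of the handshake: returns the next state and what to send.
fn on_segment(state: State, flags: Flags) -> (State, Reply) {
    match (state, flags.syn, flags.ack) {
        // LISTEN + SYN: answer with SYN,ACK and wait for the final ACK.
        (State::Listen, true, false) => (State::SynRcvd, Reply::SynAck),
        // SYN RCVD + ACK of our SYN: the connection is established.
        (State::SynRcvd, false, true) => (State::Estab, Reply::Nothing),
        // Everything else is ignored here.
        (s, _, _) => (s, Reply::Nothing),
    }
}
```
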
  • The RFC expects “remembered variables” (state) to be stored in a “Transmission Control Block” (TCB). These variables are:

    Send Sequence Variables
    
      SND.UNA - send unacknowledged
      SND.NXT - send next
      SND.WND - send window
      SND.UP  - send urgent pointer
      SND.WL1 - segment sequence number used for last window update
      SND.WL2 - segment acknowledgment number used for last window update
      ISS     - initial send sequence number
    
    Receive Sequence Variables
    
      RCV.NXT - receive next
      RCV.WND - receive window
      RCV.UP  - receive urgent pointer
      IRS     - initial receive sequence number
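
A possible Rust shape for the TCB, mirroring the two lists above. The field types are assumptions (32-bit sequence numbers, 16-bit windows as in the TCP header), and `after_syn_ack` is a hypothetical constructor showing what the variables could look like right after a listener answers a SYN.

```rust
/// Send Sequence Variables.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct SendSequenceSpace {
    una: u32, // SND.UNA - oldest unacknowledged sequence number
    nxt: u32, // SND.NXT - next sequence number to send
    wnd: u16, // SND.WND - window advertised by the peer
    up: u32,  // SND.UP  - urgent pointer
    wl1: u32, // SND.WL1 - seq number of last window update
    wl2: u32, // SND.WL2 - ack number of last window update
    iss: u32, // ISS     - initial send sequence number
}

/// Receive Sequence Variables.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct RecvSequenceSpace {
    nxt: u32, // RCV.NXT - next sequence number we expect
    wnd: u16, // RCV.WND - window we advertise
    up: u32,  // RCV.UP  - urgent pointer
    irs: u32, // IRS     - initial receive sequence number
}

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct Tcb {
    snd: SendSequenceSpace,
    rcv: RecvSequenceSpace,
}

impl Tcb {
    /// State right after receiving a SYN numbered `irs` (advertising
    /// `peer_wnd`) and replying with a SYN,ACK numbered `iss`.
    fn after_syn_ack(iss: u32, irs: u32, peer_wnd: u16, our_wnd: u16) -> Self {
        Tcb {
            snd: SendSequenceSpace {
                una: iss,
                // Our SYN occupies one sequence number.
                nxt: iss.wrapping_add(1),
                wnd: peer_wnd,
                up: 0,
                wl1: 0,
                wl2: 0,
                iss,
            },
            rcv: RecvSequenceSpace {
                // The peer's SYN occupies one sequence number too.
                nxt: irs.wrapping_add(1),
                wnd: our_wnd,
                up: 0,
                irs,
            },
        }
    }
}
```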
    
    
  • ISS is the sequence number where our side of the byte stream starts (the SYN itself takes ISS, so the first data byte is ISS+1); it doesn’t have to be zero, and it wraps around after 2^32.

  • Here’s a graphical representation of the send buffer: /Untitled-2021-06-28-2304.png

    • SND.UNA marks the acked/un-acked boundary
    • SND.NXT marks the sent/unsent boundary
    • SND.WND specifies how many bytes can be sent, and this range starts from SND.UNA
  • The receive buffer is simpler: RCV.NXT marks the end of the received (+acked) bytes, and RCV.WND sets how many bytes past RCV.NXT we’re willing to accept.
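
Because sequence numbers wrap at 32 bits, these boundaries can't be compared with a plain `<`. Below is a sketch of wrap-aware comparison, using the common convention that `a` is "before" `b` when the two are less than 2^31 apart. The helper names are made up; `ack_acceptable` encodes the RFC's check SND.UNA < SEG.ACK =< SND.NXT.

```rust
/// True if `lhs` comes strictly before `rhs` in wrapping sequence space.
fn wrapping_lt(lhs: u32, rhs: u32) -> bool {
    // If lhs is behind rhs, the wrapped difference lands in the upper half.
    lhs.wrapping_sub(rhs) > (1 << 31)
}

/// True if `start < x < end` in wrapping sequence space.
fn is_between_wrapped(start: u32, x: u32, end: u32) -> bool {
    wrapping_lt(start, x) && wrapping_lt(x, end)
}

/// Acceptable ACK: SND.UNA < SEG.ACK =< SND.NXT.
fn ack_acceptable(snd_una: u32, seg_ack: u32, snd_nxt: u32) -> bool {
    is_between_wrapped(snd_una, seg_ack, snd_nxt.wrapping_add(1))
}
```

For example, with SND.UNA = `u32::MAX - 1` and SND.NXT = 3, an ACK of 1 is accepted even though it is numerically smaller than SND.UNA.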

  • Every byte (octet) gets a sequence number

  • Segment == packet

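  • The initial sequence number is chosen from a clock-driven, ever-increasing counter (so it only looks random) to keep segments left over from an earlier incarnation of a connection from being mistaken for segments of a new one. Connections with different quads can’t collide anyway; the risk is the same quad being reused quickly, e.g. after a crash and reopen.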
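
A sketch of the clock-driven generator RFC 793 describes: a 32-bit counter whose low-order bit ticks roughly every 4 microseconds. The function names are illustrative, and using the wall clock as the time source is an assumption. Modern stacks additionally mix in a secret per-connection hash so the ISN is hard to guess, which this sketch leaves out.

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// Maps a microsecond timestamp onto the 32-bit ISN clock (one tick per 4 us).
fn isn_from_micros(micros: u64) -> u32 {
    // The `as u32` cast keeps only the low 32 bits, which gives the wraparound.
    (micros / 4) as u32
}

/// ISN derived from the current wall-clock time.
fn isn_now() -> u32 {
    let micros = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|d| d.as_micros() as u64)
        .unwrap_or(0);
    isn_from_micros(micros)
}
```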

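  • A segment stays in the network for a maximum bounded time (MSL - Max. Segment Lifetime), which RFC 793 takes to be 2 minutes. The 4.55 hours figure in the RFC is roughly how long the ISN clock takes to cycle; the MSL is assumed to be well below that.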

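  • This is the reason the second part of a handshake is a SYN_ACK: steps 2 and 3 below travel in the same direction, so they can be combined into a single segment, which turns these four steps into the three-way handshake: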

    1) A --> B  SYN my sequence number is X
    2) A <-- B  ACK your sequence number is X
    3) A <-- B  SYN my sequence number is Y
    4) A --> B  ACK your sequence number is Y
    
  • In this example the client advertises a window size in its SYN. The window a side advertises says how much data it is willing to receive, not how much it is going to send. What if it sets it SUPER high? DDoS?

  • Woohoo!

    2 0.347657799  192.168.0.1 → 192.168.0.5  TCP 60 59846 → 5000 [SYN] Seq=0 Win=64240 Len=0 MSS=1460 SACK_PERM=1 TSval=175804761 TSecr=0 WS=128
    3 0.347846056  192.168.0.5 → 192.168.0.1  TCP 40 5000 → 59846 [SYN, ACK] Seq=0 Ack=1 Win=10 Len=0
    4 0.347905879  192.168.0.1 → 192.168.0.5  TCP 40 59846 → 5000 [ACK] Seq=1 Ack=1 Win=64240 Len=0	
    
  • Here’s a very useful overview of the handshake itself: /Screen Shot 2021-06-29 at 4.35.13 PM.png

  • ACKs don’t take up space in the byte stream, and so don’t need to be ACKed!
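
To make this concrete, here is a sketch of how much sequence space a segment consumes: every payload byte takes one number, SYN and FIN take one each, and a bare ACK takes none. `seq_len` and `expected_ack` are hypothetical helper names.

```rust
/// Sequence numbers consumed by a segment: payload bytes plus SYN and FIN.
fn seq_len(payload_len: u32, syn: bool, fin: bool) -> u32 {
    payload_len + u32::from(syn) + u32::from(fin)
}

/// The ACK number the peer should send back after receiving the whole segment.
fn expected_ack(seg_seq: u32, payload_len: u32, syn: bool, fin: bool) -> u32 {
    seg_seq.wrapping_add(seq_len(payload_len, syn, fin))
}
```

A SYN with Seq=0 and Len=0 consumes one sequence number, so the reply carries Ack=1, matching the capture above. A pure ACK has a length of zero, so there is nothing for the other side to acknowledge.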

  • Generate a RST when you see a segment that apparently isn’t intended for the current connection (e.g. an ACK for something we never sent)

Connection States

                              +---------+ ---------\      active OPEN
                              |  CLOSED |            \    -----------
                              +---------+<---------\   \   create TCB
                                |     ^              \   \  snd SYN
                   passive OPEN |     |   CLOSE        \   \
                   ------------ |     | ----------       \   \
                    create TCB  |     | delete TCB         \   \
                                V     |                      \   \
                              +---------+            CLOSE    |    \
                              |  LISTEN |          ---------- |     |
                              +---------+          delete TCB |     |
                   rcv SYN      |     |     SEND              |     |
                  -----------   |     |    -------            |     V
 +---------+      snd SYN,ACK  /       \   snd SYN          +---------+
 |         |<-----------------           ------------------>|         |
 |   SYN   |                    rcv SYN                     |   SYN   |
 |   RCVD  |<-----------------------------------------------|   SENT  |
 |         |                    snd ACK                     |         |
 |         |------------------           -------------------|         |
 +---------+   rcv ACK of SYN  \       /  rcv SYN,ACK       +---------+
   |           --------------   |     |   -----------
   |                  x         |     |     snd ACK
   |                            V     V
   |  CLOSE                   +---------+
   | -------                  |  ESTAB  |
   | snd FIN                  +---------+
   |                   CLOSE    |     |    rcv FIN
   V                  -------   |     |    -------
 +---------+          snd FIN  /       \   snd ACK          +---------+
 |  FIN    |<-----------------           ------------------>|  CLOSE  |
 | WAIT-1  |------------------                              |   WAIT  |
 +---------+          rcv FIN  \                            +---------+
   | rcv ACK of FIN   -------   |                            CLOSE  |
   | --------------   snd ACK   |                           ------- |
   V        x                   V                           snd FIN V
 +---------+                  +---------+                   +---------+
 |FINWAIT-2|                  | CLOSING |                   | LAST-ACK|
 +---------+                  +---------+                   +---------+
   |                rcv ACK of FIN |                 rcv ACK of FIN |
   |  rcv FIN       -------------- |    Timeout=2MSL -------------- |
   |  -------              x       V    ------------        x       V
    \ snd ACK                 +---------+delete TCB         +---------+
     ------------------------>|TIME WAIT|------------------>| CLOSED  |
                              +---------+                   +---------+

                      TCP Connection State Diagram
                               Figure 6.