Thursday, February 25, 2016

[Networking] The excerpt about Jumbo Frame, MTU size and RX drop

This document is the excerpt about my survey of Jumbo Frame, MTU size and RX drop so that it may looks disorganized.
I can give my quick conclusion for my study:
( my lab is 10G network with supporting Jumbo Frame )

server1 <-----> Switch1 <-----> Switch2 <-----> server2
mtu:1500        mtu:9000        mtu:1500        mtu:1500    ==> only less then or equal mtu:1500 can pass through

server1 <-----> Switch1 <-----> Switch2 <-----> server2
mtu:1500        mtu:9000        mtu:9000        mtu:1500    ==> all mtu between 1500 and 9000 can pass through

P.S: the setting mtu to server/host doesn't affect their's mru (maximum receive units). So they can receive mtu 9000 packets even though mtu is 1500

============ The following are the excerpt ===============
  • 'ifconfig -a' shows excessive RX errors.
  • 'ethtool -S interface_name' command shows RX errors in rx_no_buffer_count.
  • 'top' command shows high SoftIRQ.
  1. Positive values in RX-ERR counter mean that the NIC received malformed Ethernet frames from the transmitting switch port, and data integrity could not be validated during frame's cyclic redundancy check (CRC) . The root cause of this is usually a bad cable, or a bad interface on either the machine or the switch.
  1. NIC speed / duplex mis-match with the connecting port on the switch/router.
  1. High or critical performance rated IPS protections are set to Prevent or Detect.
  1. High number of rules in Security and/or in NAT policy.
  • Softnet backlog full
  • Bad / Unintended VLAN tags
  • Unknown / Unregistered protocols
  • IPv6 frames when the server is not configured for IPv6
  1. When M1 sends a frame with a 600-byte payload to M2, there would be no problem.
  1. When M2 sends a frame with a 1200-byte payload to M1, there still would be no problem. Why not? Because setting M1's MTU didn't necessarily change its MRU, and in my experience MTUs and MRUs are separate, and implementations don't give you a way to change your MRU. So M1's MRU on that interface would be 1500 since it's Ethernet.
  1. Router wouldn't know it needs to fragment the frames from M2, because it believes all hosts on the Ethernet LAN that M1 is on are able to receive frames with 1200-byte payloads, because it was configured for a 1200-byte MTU on that interface. Luckily this would still probably work out fine, as I discussed in (2).
softnet backlog full (accounted in /proc/net/softnet_stat)Issuing a modprobe 8021q resolves the issue.Not sure if this will affect the flow control problem or not but will check that out.    
# ethtool -g eth0

Change the MTU of a network interface

How to test if 9000 MTU/Jumbo Frames are working
ping -M do -s 8972 [destinationIP]

RX Errors are typically caused by one or more of the following:

Try a new cable.

Solved the problem with the DHCP request packets coming from the wireless network. They must have been tagged with a VLAN id, and I did not have the 8021q module loaded in the kernel. Recent changes to the kernel now increment the rx_dropped counter under the following conditions:

- bad vlan tag (not accounted)
- unknown/unregistered protocol (not accounted)

Beginning with kernel 2.6.37, it has been changed the meaning of dropped packet count. Before, dropped packets was most likely due to an error. Now, the rx_dropped counter shows statistics for dropped frames because of:
If the rx_dropped counter stops incrementing while tcpdump is running; then it is more than likely showing drops because of the reasons listed earlier.

    Ring parameters for eth0:
    Pre-set maximums:
    RX:             1020
    RX Mini:        0
    RX Jumbo:       16320
    TX:             255
    Current hardware settings:
    RX:             255
    RX Mini:        0
    RX Jumbo:       0
    TX:             255
    To increase the RX ring buffer to 4080 you would run "ethtool -G eth0 rx 1020".

About MTU settings in machines and switch
You didn't specify what networking technologies you were talking about, so I'm going to assume Ethernet and IP[v4].
Ethernet has always defined its range of acceptable payload lengths to be from 46 to 1500 bytes, and requires all devices (hosts and switches) on the LAN to be able to receive frames with 1500-byte payloads. Because of this, Ethernet does not provide a fragmentation mechanism, nor does it provide a mechanism for communicating or negotiating MTUs (or, more importantly, MRUs -- Maximum Receive Units) between devices. In fact the term "MTU" or "maximum transmission unit" does not appear anywhere in the IEEE 802.3 specification.
So let's add IP into the picture. IP has a concept of an MTU, and most modern IP stacks let you set MTUs on a per-interface basis (and more). But your question as stated doesn't quite work out in the context of IP either, because IP has a minimum MTU of 576. So allow me to restate your question as "M1 has an MTU of 600, and M2 has an MTU of 1200". But what MTU shall we say that "Switch" has? Well, if Switch is just a Layer 2 Ethernet switch, it doesn't have a concept of a settable MTU. So to make your question work out in the context of IP, we'll have to turn that switch into a router. So let's call it "Router" and say it has two Ethernet interfaces, one attached to M1 and one attached to M2. Let's also say it has MTUs of 1200 set on both of its interfaces.
Okay, still trying to find and answer the true spirit of your question, let's say the link between M1 and Router is actually PPP instead of Ethernet. The PPP protocol allows hosts to communicate/negotiate their MRUs. Let's say that M1 told Router that M1 has a 600-byte MRU limitation, so Router has set its MTU for that link to 600 bytes.
Now, in this case, if M2 sends a 1200-byte IP datagram to M1 (without setting the "Don't Fragment" bit in the IP header), Router will receive it just fine, and realize it needs to fragment it to send it to M1. So does Router fragment it into two 600-byte fragments? Well, no, it's not that simple for a couple reasons.
One reason is that every fragment has to have its own IP header, which adds 20 bytes to the size of each fragment after the first. The other reason is that IP's fragmentation offset field counts in 8-byte chunks instead of individual bytes.
So let's say the 1200-byte datagram was specifically 1172 bytes of application data in a UDP datagram (8 bytes of UDP headers, 20 bytes of IP headers). After fragmentation, the first fragment would contain a 20-byte IP header, the 8-byte UDP header, and the first 568 bytes of the application data, for a total of 586 bytes. The second frame would contain another 20-byte IP header, no UDP header, and the next 576 bytes of the application data, for a total of 586 bytes. That leaves 28 bytes of application data left over for the final fragment, which, with its IP header added, would be 48 bytes.
Update based on Kavin's update that he was talking about Jumbo frames:
Jumbo frames are something that some Gigabit Ethernet product vendors created independently around the time GigE was created, and it was (I believe) subsequently rejected or ignored by the IEEE and seems unlikely to ever become part of the 802.3 Ethernet standard. Even IEEE 802.3-2008 which includes not just 1000BASE-T but 10GBASE-T, does not contain anything about 9000-byte frame payloads.
The vendors that came up with jumbo frames did not provide any kind of autonegotiation or communication mechanism for jumbo frame support, nor did they create an Ethernet-layer fragmentation method to handle the (very common) case you illustrated. If you want to run your Ethernet LAN in this nonstandard mode, you have to ensure that all hosts and switches on your LAN support jumbo frames.
If M1's NIC is not capable of receiving jumbo frames, it will consider a jumbo frame to be "Ethernet jabber" -- a broken device that "keeps jabbering on and on"; keeps sending bits well beyond the end of a maximum allowable 1500 (really 1518) -byte frame. Note that this meaning of jabber is a term for a kind of Ethernet malfunction and is not to be confused with the similarly-named "Jabber" Internet chat system. You'll have to decide if you want to stop using jumbo frames on this network, or if you want to upgrade M1 to have a NIC that supports jumbo frames.
If M1's NIC is capable of receiving jumbo frames, I suspect that setting its IPv4 MTU for that interface down to 1500 will ensure it doesn't transmit any jumbo sized IP datagrams in a single jumbo Ethernet frame, but it will most likely be able to receive large IP datagrams in single jumbo Ethernet frames no problem, because again, MTU is not MRU, and setting an IP-layer MTU doesn't affect what size frame buffers the NIC allows. Now, if you're tweaking a NIC/driver setting to tell the NIC to only use 1500-byte buffers instead of 9000-byte buffers, that's an Ethernet-layer change, and would probably make your NIC act as if it didn't support 9000-byte buffers.
iperf was utilized to test intervm network performance ‣ By default the HV and all backend network devices had an MTU set at 9000, the MTU on the default ubuntu 14.04 cloud image is set to 1500, we ran tests using both 1500 byte and 9000 byte MTU’s on the the guest. ‣ 1500 Byte MTU: 5Gigabit per instance/HV (25% of theoretical max) ‣ 9000 Byte MTU: 16Gigabit per instance/HV (80% of theoretical max) ‣ Latency ~.5ms average guest<—>guest ‣ Traffic was appropriately divided by node count, whereas one node would use 80% of the theoretical max, 10 nodes would each use 8% of the theoretical max, 5 nodes @ 16%, etc.

ip link set dev ethx mtu 9000
Use ethtool -i to check if NIC supports Jumbo Frame
Post a Comment