Troubleshooting network performance issues
Lately, I've come across network performance issues in some data centers. These are usually a headache for network engineers: when the bandwidth is enough but the throughput you reach isn't what you expected, something is wrong. This is when solid networking knowledge is needed, because concepts like checksum, Frame Check Sequence (FCS) or overruns are required to analyse network performance issues and fix them.
Obviously, we can also have performance issues because applications and services aren't configured properly or were poorly developed, but in this post I would like to highlight what we can check on the networking side.
We should look at the network interfaces and check the following counters:
- Errors: This is the first thing we should look for. It counts frames received with CRC errors (checksum mismatch), as well as frames that are too short or too long.
- Dropped: It counts frames the interface discards, for example when it receives unintended VLAN tags or IPv6 frames while it isn't configured for IPv6.
- Overruns: This is another important attribute to look for. It counts the times the FIFO buffer gets full and the kernel isn't able to empty it in time. For example, if the network interface has a buffer of X bytes and it fills up before it can be emptied, then we have overruns.
- Frame: It only counts misaligned frames, that is, frames whose length in bits isn't divisible by 8. Such a frame doesn't end on a byte boundary, so it isn't valid and it is discarded.
- Carrier: It counts losses of link pulse. It can sometimes be reproduced by unplugging the Ethernet cable and plugging it back in. Therefore, if this counter is high, the link is flapping (up and down), the Ethernet chip is having issues, or the device at the other end of the cable is having issues.
- Collisions: This is another typical reason why we can't reach good performance. Collisions may be counted when an interface is running as half duplex and the other end is running as full duplex. The half-duplex interface sees frames arriving while it is transmitting, treats that as a collision and aborts its transmission. As a result, we get collisions, a duplex mismatch and very bad throughput. It is important to remember that modern switched environments normally operate as full duplex, and collision detection is disabled in full-duplex mode.
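On Linux, most of these counters can be read from /proc/net/dev. The following is a minimal Python sketch, assuming the usual two header lines and 16 numeric columns per interface; the function names parse_proc_net_dev and suspicious_counters are mine, just for illustration. In that file the "fifo" column is, as far as I know, what ifconfig shows as overruns.

```python
from pathlib import Path

# Column order of /proc/net/dev (receive side, then transmit side).
RX_FIELDS = ["bytes", "packets", "errs", "drop", "fifo", "frame",
             "compressed", "multicast"]
TX_FIELDS = ["bytes", "packets", "errs", "drop", "fifo", "colls",
             "carrier", "compressed"]

# Counters discussed in the list above ("fifo" corresponds to overruns).
WATCHED = [("rx", "errs"), ("rx", "drop"), ("rx", "fifo"), ("rx", "frame"),
           ("tx", "errs"), ("tx", "drop"), ("tx", "colls"), ("tx", "carrier")]


def parse_proc_net_dev(text):
    """Return {iface: {"rx": {...}, "tx": {...}}} from /proc/net/dev text."""
    stats = {}
    for line in text.splitlines()[2:]:  # skip the two header lines
        if ":" not in line:
            continue
        name, values = line.split(":", 1)
        numbers = [int(v) for v in values.split()]
        if len(numbers) < 16:
            continue  # unexpected layout, ignore this line
        stats[name.strip()] = {
            "rx": dict(zip(RX_FIELDS, numbers[:8])),
            "tx": dict(zip(TX_FIELDS, numbers[8:16])),
        }
    return stats


def suspicious_counters(iface_stats):
    """Return only the watched counters that are greater than zero."""
    return {f"{direction}_{field}": iface_stats[direction][field]
            for direction, field in WATCHED
            if iface_stats[direction][field] > 0}


if __name__ == "__main__":
    proc = Path("/proc/net/dev")
    if proc.exists():  # only present on Linux
        for iface, s in parse_proc_net_dev(proc.read_text()).items():
            print(iface, suspicious_counters(s) or "clean")
```

A single reading says little by itself; what matters is whether the counters keep growing while the slow transfer is running, so it is worth running it twice and comparing.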
Next, we can see a duplex mismatch lab where Fa 0/1 of ASW1 is working as full duplex and it has FCS-Errors, which means “Frames with valid size with Frame Check Sequence (FCS) errors but no framing errors”. Consequently, throughput between PC1 and SRV1 is very poor.
And we can also see that Fa 0/1 of CSW1 is working as half duplex and it has Late-Collision, which means “Number of times that a collision is detected on a particular port late in the transmission process”. This is a big clue that we have a duplex mismatch, which should be fixed to get good network performance.
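The reasoning in this lab can be written down as a small check. This is only a sketch of the heuristic, not a vendor tool: the dictionary keys (duplex, fcs_errors, late_collisions) and the counter values are hypothetical and would have to be filled in by hand from the output of each switch.

```python
def duplex_mismatch_suspected(end_a, end_b):
    """Heuristic: one end is full duplex with FCS errors while the
    other end is half duplex with late collisions.

    Each argument is a dict with the hypothetical keys
    "duplex", "fcs_errors" and "late_collisions".
    """
    def full_side_symptom(port):
        return port.get("duplex") == "full" and port.get("fcs_errors", 0) > 0

    def half_side_symptom(port):
        return (port.get("duplex") == "half"
                and port.get("late_collisions", 0) > 0)

    return ((full_side_symptom(end_a) and half_side_symptom(end_b))
            or (full_side_symptom(end_b) and half_side_symptom(end_a)))


if __name__ == "__main__":
    # Illustrative values only, not taken from the lab output.
    asw1_fa01 = {"duplex": "full", "fcs_errors": 120, "late_collisions": 0}
    csw1_fa01 = {"duplex": "half", "fcs_errors": 0, "late_collisions": 85}
    print(duplex_mismatch_suspected(asw1_fa01, csw1_fa01))
```

If the check is positive, the fix is still manual: set both ends to the same speed and duplex, or leave both ends on auto-negotiation.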
This post is getting too long, I'm sorry, but I would also like to leave some Linux commands that help when troubleshooting network performance: ethtool -S eth0, netstat -s and netstat -i.
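If you want to script the first of these commands, here is a small Python sketch that runs ethtool -S and keeps only the non-zero counters that look error related. The counter names printed by ethtool -S depend on the driver, so the keyword list below is an assumption of mine and should be adapted to your NIC; the function names are also just illustrative.

```python
import subprocess

# Substrings that usually indicate an error-related counter.
# Driver-specific names vary, so treat this list as a starting point.
KEYWORDS = ("err", "drop", "collision", "fifo", "carrier", "crc")


def parse_ethtool_stats(text):
    """Parse 'name: value' lines from `ethtool -S` output into a dict."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.rpartition(":")
        value = value.strip()
        if sep and value.isdigit():  # skips the "NIC statistics:" header
            stats[name.strip()] = int(value)
    return stats


def interesting(stats):
    """Keep non-zero counters whose name contains one of KEYWORDS."""
    return {name: value for name, value in stats.items()
            if value > 0 and any(word in name.lower() for word in KEYWORDS)}


def read_ethtool_stats(iface="eth0"):
    """Run `ethtool -S <iface>` and return the parsed counters."""
    result = subprocess.run(["ethtool", "-S", iface],
                            capture_output=True, text=True, check=True)
    return parse_ethtool_stats(result.stdout)
```

For example, interesting(read_ethtool_stats("eth0")) returns only the counters worth a second look; it needs ethtool installed and may require root on some systems.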
Regards, my friends, and remember: sometimes we have to go down to the physical layer to fix network performance issues.