
Ethernet Emulation (TCP/IP and UDP/IP) Performance for GM-2
In addition to its OS-bypass features, GM also presents itself to the host operating system as an ethernet interface. This "ethernet emulation" feature of GM allows Myrinet to carry any packet traffic and protocols that can be carried on ethernet, including TCP/IP and UDP/IP.
It is helpful to understand that when using ethernet emulation over GM, traffic goes from the application through the OS kernel to the GM driver, following the same path as it would for a "real" ethernet NIC; traffic does not go directly from the application to the NIC, as it does when using GM in its OS-bypass mode. Thus, the TCP/IP and UDP/IP performance over GM depends primarily on the host-CPU performance and the host-OS's IP protocol stack. This performance varies widely for different hosts and operating systems. Also, unlike GM's OS-bypass mode, which exhibits a very small host-CPU overhead, TCP/IP and UDP/IP protocol processing at high data-transfer rates may use a significant fraction of the host-CPU cycles.
The GM developers have streamlined ethernet emulation over GM wherever practical. For example, in operating systems that support it (Linux, FreeBSD, MacOS-X, Tru64 5.1), the ethernet-emulation code uses the DMA engines of the PCIX-series NICs to offload the receive-side IP-checksum computation for TCP/IP and UDP/IP. Because the kernel no longer needs a separate pass over the received data to compute the checksum, less data is accessed in the host-OS kernel. GM supports 9000-Byte jumbo frames in addition to the standard 1500-Byte ethernet frames; indeed, the MTU (Maximum Transmission Unit) can be set to any value between 64 Bytes and 9000 Bytes. Larger frames mean fewer packets, and therefore less per-packet processing, to transfer the same amount of data. An optimization used in GM-2 but not provided in GM-1 is interrupt coalescing, which reduces host overhead by batching multiple transmitted and received packets together, thereby reducing the number of interrupts the host needs to service.
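To make the frame-size argument concrete, the sketch below counts how many frames are needed to carry a given amount of TCP payload at a given MTU. It assumes 20-byte IPv4 and 20-byte TCP headers with no options, so each frame carries MTU - 40 bytes of payload; the function name `frames_needed` is illustrative, and real segment sizes also depend on TCP options such as timestamps.

```python
import math

# Assumption: IPv4 (20 B) + TCP (20 B) headers per frame, no options.
IP_TCP_HEADER_BYTES = 40


def frames_needed(payload_bytes: int, mtu: int) -> int:
    """Number of frames required to carry payload_bytes of TCP data at this MTU."""
    if not 64 <= mtu <= 9000:
        # GM ethernet emulation accepts MTUs from 64 to 9000 bytes.
        raise ValueError("MTU must be between 64 and 9000 bytes")
    segment = mtu - IP_TCP_HEADER_BYTES  # payload bytes carried per frame
    return math.ceil(payload_bytes / segment)


if __name__ == "__main__":
    one_mb = 1_000_000
    for mtu in (1500, 9000):
        print(f"MTU {mtu}: {frames_needed(one_mb, mtu)} frames")
```

Under these assumptions, 1,000,000 bytes needs 685 frames at a 1500-byte MTU but only 112 at 9000 bytes, roughly a sixfold reduction in per-packet work.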
In the tables below, we report the ethernet-emulation (TCP/IP and UDP/IP) performance of GM-2.1.9 and GM-2.0.19 between a pair of 3.06 GHz Intel Pentium-4 hosts that use the ServerWorks Grand Champion chipset. The test machines ran Debian 3.0 with the kernel.org 2.6.11smp Linux kernel, and Hyperthreading was enabled. The GM driver was configured to use a 9K MTU for ethernet emulation and to tune interrupt coalescing for low latency by issuing the command `gm_ethertune -i 7`.
The standard netperf 2.2pl4 benchmark produced the following bandwidth results for TCP and UDP. The TCP test requests 256 KB (262144-byte) socket buffers; the UDP test uses an 8 KB (8192-byte) message size.
| NIC | Protocol | Bandwidth | Sender CPU Utilization | Receiver CPU Utilization |
|---|---|---|---|---|
| PCIXE | TCP | 3674 Mb/s | 37% | 38% |
| PCIXE | UDP | 3964 Mb/s | 36% | 40% |
| PCIXD | TCP | 1977 Mb/s | 19% | 23% |
| PCIXD | UDP | 1982 Mb/s | 17% | 19% |
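The raw netperf output below also reports service demand, the CPU time consumed per kilobyte transferred, which normalizes the CPU-utilization figures above by throughput. The following is a minimal sketch of that relationship, not netperf's own code. It assumes netperf counted two CPUs per host (for example, two logical CPUs with Hyperthreading enabled) and that a KB is 1024 bytes; under those assumptions it reproduces the reported 1.627 and 1.703 us/KB for the PCIXE TCP test. The function name is hypothetical.

```python
def service_demand_us_per_kb(throughput_mbps: float, cpu_util_pct: float,
                             num_cpus: int) -> float:
    """Microseconds of CPU time spent per KB (assumed 1024 bytes) transferred."""
    # Convert 10^6 bits/s into KB/s.
    kb_per_sec = throughput_mbps * 1e6 / 8 / 1024
    # Busy CPU-microseconds per wall-clock second across all counted CPUs.
    cpu_us_per_sec = cpu_util_pct / 100 * num_cpus * 1e6
    return cpu_us_per_sec / kb_per_sec


if __name__ == "__main__":
    # PCIXE TCP_STREAM figures taken from the raw netperf output.
    print(round(service_demand_us_per_kb(3673.80, 36.48, 2), 3))  # sender: 1.627
    print(round(service_demand_us_per_kb(3673.80, 38.18, 2), 3))  # receiver: 1.703
```

Service demand is useful when comparing NICs with different bandwidths: the PCIXE and PCIXD TCP senders show similar utilization per KB even though their absolute CPU utilization differs by roughly a factor of two.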
The following table shows the one-way (half-round-trip) latency for a 1-Byte message. The netperf request/response tests report a transaction rate (transactions per second), where each transaction is one complete round trip, so we divide 1 second by the transaction rate to obtain the round-trip latency and then divide that by 2 to obtain the one-way latencies below.
| NIC | Protocol | One-way Latency | Sender CPU Utilization | Receiver CPU Utilization |
|---|---|---|---|---|
| PCIXE | TCP | 23 µs | 17% | 14% |
| PCIXE | UDP | 22 µs | 15% | 15% |
| PCIXD | TCP | 25 µs | 13% | 12% |
| PCIXD | UDP | 24 µs | 11% | 12% |
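The latency conversion amounts to halving the reciprocal of netperf's transaction rate. A minimal Python sketch follows, using the transaction rates from the raw netperf output; the function name `one_way_latency_us` is illustrative. Rounded to whole microseconds, its results match the latency table.

```python
def one_way_latency_us(transactions_per_sec: float) -> float:
    """Half-round-trip latency in microseconds from a netperf *_RR transaction rate.

    Each request/response transaction is one full round trip, so the round-trip
    time is 1 / rate seconds; half of that approximates the one-way latency.
    """
    round_trip_us = 1e6 / transactions_per_sec
    return round_trip_us / 2


if __name__ == "__main__":
    # Transaction rates copied from the raw netperf TCP_RR / UDP_RR output.
    rates = {
        ("PCIXE", "TCP"): 22026.08,
        ("PCIXE", "UDP"): 23033.84,
        ("PCIXD", "TCP"): 19996.15,
        ("PCIXD", "UDP"): 20881.64,
    }
    for (nic, proto), rate in rates.items():
        print(f"{nic} {proto}: {one_way_latency_us(rate):.1f} us")
```

This prints 22.7, 21.7, 25.0 and 23.9 µs, which round to the 23, 22, 25 and 24 µs shown in the table.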
The "raw" netperf output for these tests is attached below.
Raw netperf output for PCIXE NICs:
```
>netperf224 -Hshout-my -l 60 -c -C -- -S262144 -s262144
TCP STREAM TEST to shout-my
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

217088 217088  217088   60.01    3673.80     36.48    38.18    1.627   1.703

>netperf224 -Hshout-my -l 60 -c -C -tUDP_STREAM -- -m 8192
UDP UNIDIRECTIONAL SEND TEST to shout-my
Socket  Message  Elapsed      Messages                   CPU      Service
Size    Size     Time         Okay Errors   Throughput   Util     Demand
bytes   bytes    secs            #      #   10^6bits/sec % SS     us/KB

108544    8192   60.01     3629130      0     3963.5     35.67    1.475
108544           60.01     3629041            3963.4     39.89    1.649

>netperf224 -Hshout-my -l 60 -c -C -tTCP_RR
TCP REQUEST/RESPONSE TEST to shout-my
Local /Remote
Socket Size   Request Resp.  Elapsed Trans.    CPU    CPU    S.dem   S.dem
Send   Recv   Size    Size   Time    Rate      local  remote local   remote
bytes  bytes  bytes   bytes  secs.   per sec   % S    % S    us/Tr   us/Tr

16384  87380  1       1      60.01   22026.08  16.58  13.91  15.059  12.628
16384  87380

>netperf224 -Hshout-my -l 60 -c -C -tUDP_RR
UDP REQUEST/RESPONSE TEST to shout-my
Local /Remote
Socket Size   Request Resp.  Elapsed Trans.    CPU    CPU    S.dem   S.dem
Send   Recv   Size    Size   Time    Rate      local  remote local   remote
bytes  bytes  bytes   bytes  secs.   per sec   % S    % S    us/Tr   us/Tr

108544 108544 1       1      60.01   23033.84  15.34  15.00  13.322  13.024
108544 108544
```
Raw netperf output for PCIXD NICs:
```
>netperf224 -Hshout-my -l 60 -c -C -- -S262144 -s262144
TCP STREAM TEST to shout-my
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

217088 217088  217088   60.01    1976.77     19.46    22.78    1.613   1.888

>netperf224 -Hshout-my -l 60 -c -C -tUDP_STREAM -- -m 8192
UDP UNIDIRECTIONAL SEND TEST to shout-my
Socket  Message  Elapsed      Messages                   CPU      Service
Size    Size     Time         Okay Errors   Throughput   Util     Demand
bytes   bytes    secs            #      #   10^6bits/sec % SS     us/KB

108544    8192   60.01     1814677      0     1981.9     16.49    1.364
108544           60.01     1814608            1981.8     19.24    1.591

>netperf224 -Hshout-my -l 60 -c -C -tTCP_RR
TCP REQUEST/RESPONSE TEST to shout-my
Local /Remote
Socket Size   Request Resp.  Elapsed Trans.    CPU    CPU    S.dem   S.dem
Send   Recv   Size    Size   Time    Rate      local  remote local   remote
bytes  bytes  bytes   bytes  secs.   per sec   % S    % S    us/Tr   us/Tr

16384  87380  1       1      60.01   19996.15  13.14  12.06  13.147  12.067
16384  87380

>netperf224 -Hshout-my -l 60 -c -C -tUDP_RR
UDP REQUEST/RESPONSE TEST to shout-my
Local /Remote
Socket Size   Request Resp.  Elapsed Trans.    CPU    CPU    S.dem   S.dem
Send   Recv   Size    Size   Time    Rate      local  remote local   remote
bytes  bytes  bytes   bytes  secs.   per sec   % S    % S    us/Tr   us/Tr

108544 108544 1       1      60.01   20881.64  10.96  11.79  10.496  11.296
108544 108544
```
02 June 2006