Eric,
Were you able to resolve your network tuning issues? This is a different
situation, but I just wanted to share our results.
I did some tests with 20 FPGA gigabit Ethernet ports, each hammering a single soft IOC (on RHEL 7.4) with
UDP packets at 22.6 kHz. In this scenario, tuning the network parameters didn't help much; the biggest improvement
came from setting the CPU affinity of the soft IOC so that each of its threads was locked to its own core:
$ top -H -b -n 1 | grep PsDaq
32051 ctlsdaq 20 0 2301500 16160 6720 S 81.2 0.4 1:12.42 PsDaqDriverUdpR
32050 ctlsdaq 20 0 2301500 16160 6720 S 6.2 0.4 0:06.80 PsDaqDriverThre
$ taskset -p -c 3 32051
pid 32051's current affinity list: 0-3
pid 32051's new affinity list: 3
$ taskset -p -c 2 32050
pid 32050's current affinity list: 0-3
pid 32050's new affinity list: 2
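For what it's worth, the affinity can also be applied when the IOC is launched rather than after the fact. A rough sketch (the core list, binary name, and startup script below are placeholders, and note this confines the whole process to a set of cores rather than pinning each thread to its own core):
$ # start the soft IOC with all of its threads confined to cores 2 and 3
$ taskset -c 2,3 ./bin/linux-x86_64/psDaqIoc st.cmd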
PS Controllers   CPU % (driver)   CPU % (udprx)   RX pps      Drop pps
20               10.8             80.8            476,305.0       14.0
20               10.8             80.6            475,491.2       14.9
20               10.8             80.3            486,127.4        3.9
20               10.8             80.7            473,840.1       26.6
Prior to this, we measured tens of thousands of dropped packets per second.
Ultimately, to get down to zero dropped packets, we buffered more data in the FPGAs and sent jumbo frames at ~1 kHz from each one, set the MTU to 9000, and increased net.core.rmem_max to 4 MB.
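In case it helps, the host-side pieces of that change look roughly like the following (the interface name eth0 is just an assumption, and 4194304 bytes = 4 MB; any switches in the path also have to allow jumbo frames):
$ # allow jumbo frames on the receiving interface
$ ip link set dev eth0 mtu 9000
$ # let applications request socket receive buffers up to 4 MB
$ sysctl -w net.core.rmem_max=4194304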
Tom
--
Thomas Fors
Engineer, AES Division
Argonne National Laboratory
Are there guidelines for tuning the network parameters for a Linux host running a bunch of EPICS soft IOCs? We were seeing CA clients experience delays of over a second for 'get' requests. Setting the values shown here:
net.core.netdev_max_backlog = 5000
net.core.rmem_max = 8388608
net.core.wmem_max = 8388608
net.core.rmem_default = 124928
net.core.wmem_default = 124928
net.ipv4.tcp_rmem = 4096 87380 8388608
net.ipv4.tcp_wmem = 4096 65536 8388608
net.ipv4.tcp_mem = 8388608 8388608 8388608
seems to have improved things, but I don't know whether those numbers, which I got from a Google search, are way too big, not yet big enough, or completely weird.
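For completeness, values like these can be set on the fly with sysctl and made persistent in a drop-in file (the file name here is arbitrary):
$ # one-off change, lost at reboot
$ sysctl -w net.core.rmem_max=8388608
$ # persistent: put the lines above into e.g. /etc/sysctl.d/90-epics-net.conf, then
$ # re-read /etc/sysctl.conf and all files under /etc/sysctl.d/
$ sysctl --system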
Suggestions welcomed.
Machine details:
- 8 cores
- 24 GiB RAM
- 1 TB SSD
- Gigabit Ethernet
- EPICS R3.15.4
Load average was in the 3 to 4 range, but 'top' showed the CPU at least 60% idle even when clients were experiencing the slow responses.