EPICS Controls Argonne National Laboratory

Experimental Physics and
Industrial Control System


Subject: Re: Ethernet question
From: Ryan Pierce via Tech-talk <[email protected]>
To: Mark Rivers <[email protected]>, "'J. Lewis Muir'" <[email protected]>
Cc: "[email protected]" <[email protected]>
Date: Mon, 6 Jan 2020 19:43:22 -0600

Are all of these Linux boxes using the same model of 10 Gbit NIC, or similar models from the same chipset manufacturer that share the same Linux driver? If so, my guess is that the Linux driver and/or the firmware in the 10 Gbit NIC is at fault. I'm thinking this because:

* Once the packets get into the first switch, then everything downstream of that has no way of knowing if the first hop between the Linux box and the first switch was 10 G or 1 G. Also, the failure seems to happen on the packet from the AIM to the Linux box, and the AIM clearly can't tell anything about the speed of the Linux box's NIC. So I'm inclined to pin the fault at either the Linux box or the switch it connects with.

* Different brands of switches talking to different Linux boxes similarly fail with the 10 G NIC and work with the 1 G NIC. So while different manufacturers could have common faulty firmware, this is looking less likely, and I'm less inclined to blame the switch. Also, if it were a bug in the switch processing certain 802.2 SNAP frames at 10 Gbit, I'd expect it to be resolved if the 10 G NIC were slowed down to 1 G or 100 M, but it still fails. I don't see how the switch would see a difference between the working 1 G NIC and the 10 G NIC in 1 G mode.

* The application software stack on the Linux boxes clearly works, seeing as it works fine at 1 G and fails at 10 G.

* That leads me to suspect the Linux network driver level and/or NIC firmware. If all 3 of your Linux boxes use the same model 10 G NIC, or models sharing the same chipset and Linux driver, then I consider this scenario highly likely.

* Something to consider: This use case is unusual on two levels. First, IEEE 802.2 Extended SNAP is probably not going to get the same amount of testing as, say, TCP/IP, with Linux kernel driver developers. Second, your IOC must put the NIC in promiscuous mode in order to receive the frames. Running a NIC in promiscuous mode is not typical. The intersection of 802.2 SNAP and promiscuous mode is unusual enough for me to think the NIC's kernel driver developer may not have tested it. This leads me to suspect a bug in either the 10 G Linux driver or the NIC firmware.
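To make the frame format concrete, here is a minimal Python sketch of how an 802.3 frame carrying an 802.2 LLC/SNAP header differs from an ordinary Ethernet II frame. The function name `parse_frame`, the returned dictionary keys, and the example source address and protocol ID are all made up for illustration; only the field layout (length field of 1500 or less, DSAP/SSAP of 0xAA, control byte 0x03, 3-byte OUI, 2-byte protocol ID) comes from the standards. It assumes the frame bytes exclude the preamble and FCS.

```python
import struct

SNAP_SAP = 0xAA        # DSAP and SSAP value that announces a SNAP header
LLC_UI_CONTROL = 0x03  # LLC "unnumbered information" control byte


def _mac(raw: bytes) -> str:
    """Format 6 raw bytes as a colon-separated MAC address."""
    return ":".join(f"{b:02x}" for b in raw)


def parse_frame(frame: bytes) -> dict:
    """Classify an Ethernet frame and decode the LLC/SNAP header if present."""
    if len(frame) < 14:
        raise ValueError("frame is shorter than an Ethernet header")
    dst, src, type_or_len = struct.unpack("!6s6sH", frame[:14])
    info = {"dst": _mac(dst), "src": _mac(src)}

    if type_or_len >= 0x0600:
        # Ethernet II: the field is an EtherType (e.g. 0x0800 for IPv4).
        info["kind"] = "ethernet2"
        info["ethertype"] = type_or_len
        return info
    if type_or_len > 1500:
        # 1501..1535 is neither a valid length nor a valid EtherType.
        info["kind"] = "undefined"
        return info

    # IEEE 802.3: the field is the length of the LLC data that follows.
    payload = frame[14:]
    info["kind"] = "802.3"
    info["length"] = type_or_len
    # The payload may be longer than the declared length because short
    # frames are padded up to the 46-byte minimum.
    info["length_fits"] = type_or_len <= len(payload)

    if (len(payload) >= 8 and payload[0] == SNAP_SAP
            and payload[1] == SNAP_SAP and payload[2] == LLC_UI_CONTROL):
        info["kind"] = "802.2-snap"
        info["oui"] = _mac(payload[3:6])
        info["pid"] = struct.unpack("!H", payload[6:8])[0]
    return info


# Hypothetical example frame: 8-byte LLC/SNAP header plus 4 data bytes,
# padded to the 46-byte minimum payload.
llc_snap = bytes([0xAA, 0xAA, 0x03, 0x00, 0x00, 0x00, 0x12, 0x34])
data = b"\x01\x02\x03\x04"
body = (llc_snap + data).ljust(46, b"\x00")
example = (bytes.fromhex("3cfdfea3f258")      # destination MAC
           + bytes.fromhex("0000af000001")    # made-up source MAC
           + struct.pack("!H", len(llc_snap) + len(data))
           + body)
result = parse_frame(example)
```

The point is only that the two-byte field after the MAC addresses is a length here rather than an EtherType, which is a code path most IP-only traffic never exercises.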

This seems to explain all the data. It would explain why this works on each of the same Linux boxes using 1 G NICs, why the same switch is happy with 1 G but not with 10 G, why this same behavior happens when using 10 G switches from different manufacturers, and why a slowed-down 10 G port also fails. If the same model 10 G NIC is in your Windows boxes, then I'd narrow it further, exonerate the NIC firmware, and blame the Linux driver.
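One way to check whether the interfaces share a driver is to look at the `driver` symlink that Linux exposes under `/sys/class/net/<interface>/device/`. The sketch below does that in Python; the function names `nic_driver` and `same_driver` are made up, and the `sysfs_root` parameter exists only so the lookup can be pointed at a different directory. Running `ethtool -i <interface>` should report the same driver name, along with the driver and firmware versions.

```python
import os


def nic_driver(iface: str, sysfs_root: str = "/sys/class/net") -> str:
    """Return the kernel driver bound to iface, read from the sysfs symlink."""
    link = os.path.join(sysfs_root, iface, "device", "driver")
    if not os.path.islink(link):
        # Virtual interfaces such as "lo" have no backing device or driver.
        raise FileNotFoundError(f"no driver symlink for interface {iface!r}")
    # The symlink points at .../drivers/<driver name>.
    return os.path.basename(os.path.realpath(link))


def same_driver(ifaces, sysfs_root: str = "/sys/class/net"):
    """Return (True/False, {iface: driver}) for a list of interface names."""
    drivers = {name: nic_driver(name, sysfs_root) for name in ifaces}
    return len(set(drivers.values())) == 1, drivers
```

For example, `same_driver(["p5p1", "em1"])` would compare the 10 G port against a 1 G port, where `em1` is a placeholder for whatever the 1 G interface is actually called on that machine.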

Ryan

On 1/6/2020 6:28 PM, Mark Rivers via Tech-talk wrote:

> All of the packets are still going through the same 10 GbE switch, though, right (i.e., the one labeled "10 Gbit switch #1" in the network path diagram you included in a message upthread)? 

> So, that switch has been involved in all of the tests conducted so far, so it still could be the problem, right?

 

No, the two different Linux systems have different switches.  The one I showed in a previous message is on the APS experiment hall floor and is our CentOS 7 server corvette.

 

The second Linux machine I have tested is an Ubuntu 18 system in my office.  Its topology is:

 

Linux machine (has both 10 Gbit and 1 Gbit NICs)

      |

10 Gbit switch #1 (in my office)

      | (1 Gbit uplink)

1 Gbit switch #2 (in APS network closet)

      | (possibly additional switches in here, I'm not sure)

      |

1 Gbit switch

      |

Device (10 Mbit AUI)

 

> Also, are the 10 GbE switches labeled "10 Gbit switch #1" and "10 Gbit switch #2" in the network path diagram identical and running the same OS or firmware version, or are they different?

 

Different OS and firmware.  Switch #1 in the system on the floor is a Dell N1548, while switch #1 in my office is a Netgear X5712T.  Switch #2 in both cases is managed by the APS in the network closet.  It is definitely not a Dell, but is probably an HP or Cisco, and they may or may not be the same switch.

 

> Also, are those two switches managed or unmanaged?  If managed, can you find out what the switch has set the speed and duplex to for the ports involved to ensure that it set them correctly?

 

Switches #1 and #2 are managed for both configurations. The link between them is definitely 10 Gbit full-duplex.  The final 1 Gbit switch to the device is unmanaged.  But it clearly is configured correctly because when the Linux NIC is 1 Gbit it works fine.  And using a different NIC on Linux has no effect on the configuration of the port on the final 1 Gbit switch.  And I have tested 2 devices that are connected to 2 different 1 Gbit switches.  They are both Dell but different models.

 

> Could you show the output of "ethtool -S p5p1" just in case it shows more detail about exactly what it means by RX "frame"?

                                                                                                                       

Here is the current output of ifconfig and ethtool (abbreviated).  ifconfig "frame" is 235, which is the same as ethtool "rx_length_errors", so those are the same thing.  They are not CRC errors, which is what I think Michael was assuming.

 

corvette:areaDetector/ADCore/iocBoot>/sbin/ifconfig p5p1

p5p1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500

        inet 164.54.160.82  netmask 255.255.255.0  broadcast 164.54.160.255

        inet6 fe80::3efd:feff:fea3:f258  prefixlen 64  scopeid 0x20<link>

        ether 3c:fd:fe:a3:f2:58  txqueuelen 1000  (Ethernet)

        RX packets 147929456684  bytes 111438844085337 (101.3 TiB)

        RX errors 0  dropped 920  overruns 0  frame 235

        TX packets 100625271243  bytes 29596808110595 (26.9 TiB)

        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

 

corvette:areaDetector/ADCore/iocBoot>/sbin/ethtool -S p5p1

NIC statistics:

     rx_packets: 147929466577

     tx_packets: 100625281563

     rx_bytes: 111438844918608

     tx_bytes: 29596809166293

     rx_errors: 0

     tx_errors: 0

     rx_dropped: 891

     tx_dropped: 0

     collisions: 0

     rx_length_errors: 235

     rx_crc_errors: 0

     rx_unicast: 147182042495

     tx_unicast: 100555363340

     rx_multicast: 5115

     tx_multicast: 412

     rx_broadcast: 747419793

     tx_broadcast: 69917765

     rx_unknown_protocol: 0

     tx_linearize: 2166

     tx_force_wb: 0

     rx_alloc_fail: 0

     rx_pg_alloc_fail: 0
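To watch these counters before and after a test, the `ethtool -S` text can be parsed into a dictionary and filtered for the nonzero receive-side problem counters. This is only a rough Python sketch: the function names are made up, and the rule for what counts as a "problem" counter (an `rx_` name containing `error`, `dropped`, or `fail`) is my own assumption based on the names in the output above, not something defined by ethtool or the driver.

```python
def parse_ethtool_stats(text: str) -> dict:
    """Parse 'name: value' lines from `ethtool -S` output into {name: int}."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.strip().partition(":")
        value = value.strip()
        # Skip blank lines and the "NIC statistics:" heading (no number).
        if sep and value.isdigit():
            stats[name.strip()] = int(value)
    return stats


def rx_problem_counters(stats: dict) -> dict:
    """Return nonzero rx_ counters whose names suggest an error or a drop."""
    tags = ("error", "dropped", "fail")  # assumed naming convention
    return {name: count for name, count in stats.items()
            if name.startswith("rx_") and count > 0
            and any(tag in name for tag in tags)}


# A few lines copied from the output above.
sample = """NIC statistics:
     rx_errors: 0
     rx_dropped: 891
     rx_length_errors: 235
     rx_crc_errors: 0
     rx_alloc_fail: 0
"""
problems = rx_problem_counters(parse_ethtool_stats(sample))
```

Taking a snapshot before and after a failed exchange with the device and comparing the two dictionaries would show whether `rx_length_errors` increments on each failure.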

 

> Is the NIC driver the same for the 1 GbE and the 10 GbE NICs on Linux?

 

I'm not sure.  How can I tell that?

 

> Do you have a Windows machine with a 10 GbE NIC that you could try?

 

Yes, I could try that, but I have not yet.

 

> Do you have a Mac with a 10 GbE NIC that you could try?

 

No.

 

I should add that I normally communicate with these devices from vxWorks, going through the same switches, and that has been working fine for 20 years (with switch upgrades over the years, of course).  I would like to move from vxWorks to Linux but have hit this problem with the 10 Gbit NICs.

 

Thanks,

Mark

 

 

-----Original Message-----
From: J. Lewis Muir <[email protected]>
Sent: Monday, January 6, 2020 2:39 PM
To: Mark Rivers <[email protected]>
Cc: [email protected]
Subject: Re: Ethernet question

 

On 12/23, Mark Rivers via Tech-talk wrote:

> On further investigation I found that the problem only occurs when using 10 Gbit Ethernet adapters.  When using a 1 Gbit adapter it works fine.  I was able to see this in a single machine that has both 10 Gbit and 1 Gbit adapters.  If I use the 1 Gbit adapter it works fine, if I use the 10 Gbit adapter it fails (but kind of works as described above).  On 3 separate systems 1 Gbit works, and 3 other systems 10 Gbit fails.

 

Is the NIC driver the same for the 1 GbE and the 10 GbE NICs on Linux?

 

Do you have a Windows machine with a 10 GbE NIC that you could try?

 

Do you have a Mac with a 10 GbE NIC that you could try?

 

Lewis

