After the fault happened again, I followed Michael's suggestion and suspended all EPICS tasks one after the other to see
if any of them was feeding the tNet0 task with work. Unfortunately, this had no effect: even after all EPICS tasks had
been halted, the tNet0 task still used 100% CPU.
I have now upgraded the IOC from VxWorks 6.9.0.0 to 6.9.4.11. Maybe that helps....
Dirk
On Mon, 2022-08-29 at 09:57 +0200, Zimoch Dirk wrote:
> Hi Michael, Andrew,
>
> Thanks for your suggestions. Details below in-line.
>
> On Sat, 2022-08-27 at 09:53 -0700, Michael Davidsaver wrote:
> > On 8/26/22 06:05, Zimoch Dirk (PSI) via Core-talk wrote:
> > > Hi fellow VxWorks users,
> > >
> > > Since I migrated most of our IOCs from VxWorks 5.5 to VxWorks 6.9, I have noticed a severe network driver problem.
> > > After running fine for weeks or months, the tNet0 task suddenly consumes 100% CPU and stays that way, which of
> > > course makes the IOC unusable. Only a reboot helps. This happens on multiple IOCs, but not at the same time.
> >
> > Are there any commonalities in terms of application code / epics drivers?
> >
> > Any other changes besides the OS version? (eg. different Base version,
> > drivers, adding PVA?)
>
> It happens with EPICS 3.14.12 as well as with 7.0.6.
> Common code is iocStats, caPutLog, autosave and some hytec drivers (which we have been using for many years
> already and which do not access the network).
>
> >
> >
> > > It looks like the tNet0 task is constantly timing out and trying to drop some connection. But I cannot find
> > > out which one. Any network diagnostics function freezes in ipcom_block_wait at a semaphore.
> >
> > Is there any network traffic associated with this?
>
> As I only ever learn of this problem after the fact and it is a rare event for any particular IOC, there is no log
> of network traffic available. I have not been able to trigger this problem intentionally. Also, so far I could not
> find out which sockets/IP addresses are involved.
>
> > Does vxworks expose any packet Tx/Rx counters?
>
> After it happens, VxWorks 6 does not expose any network-related information any more, because everything
> network-related (such as ifconfig) hangs at a semaphore.
>
> > Can you look at what other threads are doing?
> >
> > eg. Maybe some code is in a tight loop trying to (re)connect
> > a TCP socket without a hold-off time in between?
>
> All the CPU goes to tNet0. I suspect that tNet0 itself is putting stuff back on its work queue.
> But next time it happens, I will try to trace other threads and maybe suspend them to see what happens.
>
> > As there is a TCP server for at least CA, such an offending
> > loop could be in client code running on another computer.
>
> My first suspect was our central iocLog/caPutLog server, because our stupid firewall keeps dropping inactive
> connections after 30 minutes. To test this theory, I would probably need to disable logging for a while.
> Also, the same CA clients (caqtdm) are used everywhere of course, but they use the standard R7.0.6 CA library.
> Another suspect may be autosave. On VxWorks 6 it uses NFS3, while on VxWorks 5 there was only NFS2 (our main
> reason to upgrade).
>
> I will check whether suspending the suspects helps.
>
> I found one strange thing in our logs though:
> CAS: request from 172.19.10.40:46228 => CAS: Missaligned protocol rejected
> These messages show up on several systems, but may be completely unrelated.
> I suspect some Java CA clients.
>
>
> On Fri, 2022-08-26 at 11:30 -0500, Andrew Johnson wrote:
> > It hasn't happened here at APS to my knowledge. Is it unique to a
> > particular board type, or a specific network interface driver? What
> > sub-version of VxWorks 6.9 are you running, and have you checked with
> > the Wind River knowledge forums and/or support?
> >
>
> It is a pure 6.9.0.0 without any updates. I was considering updating, but you once told me that there may be
> incompatibilities. I may do it anyway.
>
> Most of our boards are MV5100. I am not quite sure if it has happened on MV2300 as well. I googled and searched the
> WR support network without any relevant results. My experience with the WR support is less than satisfactory, so
> this will be the last thing to try. First I wanted to ask people with a better reputation for actual help :-)
> WR typically spends more time telling me how old my HW (and host Linux version) is than trying to solve problems.
>
>
> >
> > > Has anyone seen this behavior before?
> > >
> > > > tt tNet0
> > >
> > > 0x0012c68c vxTaskEntry +0x48 : ipcomNetTask ()
> > > 0x00113c40 ipcomNetTask +0x34 : jobQueueProcess ()
> > > 0x0029d9d8 jobQueueProcess+0xe8 : 0x00241584 ()
> > > 0x002415b4 ipcom_atomic_sub_and_return+0x9c : 0x001f9208 ()
> > > 0x001f9274 ipnet_timeout_cancel+0xe4 : 0x00226910 ()
> > > 0x0022699c iptcp_drop_connection+0x8a4: ipcom_list_remove ()
> > > value = 0 = 0x0
> > > > tt tNet0
> > >
> > > 0x0012c68c vxTaskEntry +0x48 : ipcomNetTask ()
> > > 0x00113c40 ipcomNetTask +0x34 : jobQueueProcess ()
> > > 0x0029d9d8 jobQueueProcess+0xe8 : 0x00241584 ()
> > > 0x002415b4 ipcom_atomic_sub_and_return+0x9c : 0x001f9208 ()
> > > 0x001f9274 ipnet_timeout_cancel+0xe4 : 0x00226910 ()
> > > 0x00226a48 iptcp_drop_connection+0x950: 0x0022b380 ()
> > > 0x0022c4e0 sockInfo +0x2ca4: ipcom_list_insert_first (0x1f, 0x1f)
> > > value = 0 = 0x0
> > > > tt tNet0
> > >
> > > 0x0012c68c vxTaskEntry +0x48 : ipcomNetTask ()
> > > 0x00113c40 ipcomNetTask +0x34 : jobQueueProcess ()
> > > 0x0029d9d8 jobQueueProcess+0xe8 : 0x00241584 ()
> > > 0x002415b4 ipcom_atomic_sub_and_return+0x9c : 0x001f9208 ()
> > > 0x001f9274 ipnet_timeout_cancel+0xe4 : 0x00226910 ()
> > > 0x00226a48 iptcp_drop_connection+0x950: 0x0022b380 ()
> > > 0x0022c4bc sockInfo +0x2c80: 0x0022c58c ()
> > > value = 0 = 0x0
> > > > tt tNet0
> > >
> > > 0x0012c68c vxTaskEntry +0x48 : ipcomNetTask ()
> > > 0x00113c40 ipcomNetTask +0x34 : jobQueueProcess ()
> > > 0x0029d9d8 jobQueueProcess+0xe8 : 0x00241584 ()
> > > 0x002415b4 ipcom_atomic_sub_and_return+0x9c : 0x001f9208 ()
> > > 0x001f9274 ipnet_timeout_cancel+0xe4 : 0x00226910 ()
> > > 0x00226a48 iptcp_drop_connection+0x950: 0x0022b380 ()
> > > value = 0 = 0x0
> > > > tt tNet0
> > >
> > > 0x0012c68c vxTaskEntry +0x48 : ipcomNetTask ()
> > > 0x00113c40 ipcomNetTask +0x34 : jobQueueProcess ()
> > > 0x0029d9d8 jobQueueProcess+0xe8 : 0x00241584 ()
> > > 0x002415b4 ipcom_atomic_sub_and_return+0x9c : 0x001f9208 ()
> > > 0x001f9274 ipnet_timeout_cancel+0xe4 : 0x00226910 ()
> > > value = 0 = 0x0
> > >
> > > Dirk
> > >
> >
> >
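
Michael's remark about a loop that (re)connects a TCP socket "without a hold-off time in between" can be illustrated
with a small sketch. This is not code from any IOC, driver or client mentioned in this thread, and nothing here has
been shown to be related to the tNet0 fault. The function name connect_with_holdoff and all of its parameters are
made up for illustration, and Python is used only for brevity.

```python
import time


def connect_with_holdoff(connect, max_attempts=5, first_delay=1.0,
                         max_delay=60.0, sleep=time.sleep):
    """Call connect() until it succeeds, pausing between failed attempts.

    The pause doubles after every failure and is capped at max_delay, so a
    server that is down or unreachable is never hit in a tight loop.
    All names and default values are illustrative assumptions.
    """
    delay = first_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return connect()
        except OSError:
            if attempt == max_attempts:
                raise  # out of attempts: let the caller decide what to do
            sleep(delay)                       # the hold-off between attempts
            delay = min(delay * 2, max_delay)  # back off, but not without bound
```

A caller would pass something like lambda: socket.create_connection((host, port), timeout=5), where host and port
stand for whatever server the client talks to. The sleep parameter exists only so that the pauses can be replaced
by a recording function when trying the sketch out.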