My patch to the gateway that keeps the exception handler installed has completely solved the problem of the gateway restarting over and over again.
I have some new information about why the newer gateway (2.1.3) repeatedly disconnects from the Labview PXI crate where the old gateway (2.0.6) didn't. The disconnects happen about every 10 seconds.
Looking at packet captures, the root problem seems to be that the PXI crate reports its protocol version as 12 but isn't really compliant with version 12. The proximate cause is that the newer gateway uses DBE_PROPERTY masks for some connections, where the
old gateway didn't use those at all.
Whenever the gateway tries to subscribe to a PV on the IOC with a mask of 0x08 (DBE_PROPERTY), the IOC responds with an ECA_BADMASK exception. Per the protocol spec, I believe that a version 12 server should understand DBE_PROPERTY, and it should almost certainly
ignore unknown mask bits rather than return ECA_BADMASK.
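To make the mask argument concrete, here is a minimal sketch in Python. The DBE_* bit values are the ones defined in EPICS base; the two server functions and the "legacy" set of supported bits are hypothetical illustrations, not code from the PXI crate or the gateway.

```python
# Event mask bits as defined in EPICS base (caeventmask.h).
DBE_VALUE = 0x01
DBE_ARCHIVE = 0x02   # also known as DBE_LOG
DBE_ALARM = 0x04
DBE_PROPERTY = 0x08

# Assumption: a server that predates DBE_PROPERTY knows only the first three bits.
LEGACY_SUPPORTED = DBE_VALUE | DBE_ARCHIVE | DBE_ALARM


def strict_server_accepts(mask, supported=LEGACY_SUPPORTED):
    """Hypothetical strict server: reject any mask containing an unknown bit."""
    return mask != 0 and (mask & ~supported) == 0


def tolerant_effective_mask(mask, supported=LEGACY_SUPPORTED):
    """Hypothetical tolerant server: silently drop the bits it doesn't know."""
    return mask & supported
```

With these definitions a strict server rejects 0x08 outright, which matches the ECA_BADMASK seen in the captures, while a tolerant one would accept the subscription and simply never post property events.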
When the gateway gets the BADMASK exception from the IOC, it attempts to cancel the subscription. The IOC, apparently never having created the subscription, responds with a Bad ID exception. When the gateway gets this exception it closes the virtual circuit
and the cycle starts over again.
In contrast, when I use another client like camonitor talking directly to the IOC, it never complains or disconnects even if I use -m p to set the mask to DBE_PROPERTY. I think this is because camonitor is just ignoring the exceptions.
The Labview PXI crate itself is a complete black box to me. Even the owners know very little about it. The only thing they seem to be able to do with it is configure the environment. The purpose of the gateway for my department is to shield our stuff from this
kind of behavior. The frequent disconnections result in a lot of beacon anomalies on our side from the gateway that are kind of playing havoc with systems like our archiver, which is expected to be able to archive PVs from this crate, among others.
Does anybody have any suggestions about how to approach this? At this point I'm ready to hack the gateway to detect the BADMASK and NOT cancel the subscriptions. Only the property info would be lost, and our archiver doesn't care about that anyway. Alternatively,
I could patch it to not shut down the VC in response to the Bad ID exception, but that seems riskier.
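The failure sequence and the first proposed workaround can be modeled with a toy simulation. This is only a sketch: `FakeServer`, `run_client`, and the `cancel_on_badmask` flag are hypothetical names, and nothing here is taken from the gateway source.

```python
ECA_BADMASK = "ECA_BADMASK"
BAD_ID = "Bad ID"  # label used in this thread; the exact status code is not stated

DBE_PROPERTY = 0x08


class FakeServer:
    """Toy stand-in for the crate: rejects DBE_PROPERTY and creates no subscription."""

    def __init__(self):
        self.subscriptions = set()

    def subscribe(self, sub_id, mask):
        if mask & DBE_PROPERTY:
            return ECA_BADMASK  # the subscription is never created
        self.subscriptions.add(sub_id)
        return None

    def cancel(self, sub_id):
        if sub_id not in self.subscriptions:
            return BAD_ID  # nothing to cancel
        self.subscriptions.remove(sub_id)
        return None


def run_client(server, masks, cancel_on_badmask):
    """Return (circuit_still_open, events) for a hypothetical client policy."""
    events = []
    for sub_id, mask in enumerate(masks):
        err = server.subscribe(sub_id, mask)
        if err is None:
            events.append(("subscribed", sub_id))
            continue
        events.append((err, sub_id))
        if not cancel_on_badmask:
            continue  # workaround: forget the subscription locally, send no cancel
        if server.cancel(sub_id) == BAD_ID:
            events.append((BAD_ID, sub_id))
            return False, events  # observed behavior: the circuit is closed
    return True, events
```

Running `run_client(FakeServer(), [0x01, 0x08], cancel_on_badmask=True)` ends with the circuit closed, while the same call with `cancel_on_badmask=False` keeps it open and loses only the property subscription.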
--Brian Bevins
Jefferson Lab
From: Tech-talk <tech-talk-bounces at aps.anl.gov> on behalf of Brian Bevins via Tech-talk <tech-talk at aps.anl.gov>
Sent: Thursday, March 12, 2026 12:44 PM
To: Wang, Lin <wanglin at ihep.ac.cn>
Cc: tech-talk at aps.anl.gov <tech-talk at aps.anl.gov>
Subject: Re: [EXTERNAL] Re: CA gateway crashes periodically due to badly behaved servers
Hi Lin,
Thank you for the information. In my case I don't seem to get back the name of the PV in question. The server reports that the exception was generated in response to cancelling a subscription, but the payload doesn't include the PV name.
I can report that the problem does seem to be mitigated now that I've tweaked our gateway to not uninstall its exception handler when the throttle threshold is reached. The gateway just keeps running. It still disconnects from the Labview server much more frequently
than our older gateway (2.0.3) did. I'm not sure why yet.
Best,
-Brian
From: Wang, Lin <wanglin at ihep.ac.cn>
Sent: Saturday, March 7, 2026 8:33 PM
To: Brian Bevins <bevins at jlab.org>
Cc: tech-talk at aps.anl.gov <tech-talk at aps.anl.gov>
Subject: [EXTERNAL] Re: CA gateway crashes periodically due to badly behaved servers
Hello Brian,
At CSNS, we have been using LabView IOCs in beam diagnostic devices together with the CA gateway for quite a few years. The CA gateway seems to be sensitive to unexpected data formats, and it does sometimes crash, or stop working without crashing. We
plan to replace the non-standard IOCs in the ongoing CSNS upgrade.
But the helpful aspect of the CA gateway, or of the CA library in EPICS base that it depends on, is that they always report the specific badly formatted PVs in the log file for us to troubleshoot.
For example, half a year ago we encountered a similar issue: the CA gateway serving LabView IOCs kept restarting until it crashed whenever badly formatted PVs were accessed by other CA clients via the gateway. It was reported like this in the CA gateway log file:
Oct 03 00:21:54 !!! Errlog message received (message is above)
filename="../../../../src/cas/generic/casPVI.cc" line number=253
Bad data type application type "enums" string conversion table for enumerated PV isnt a string type?
Then we added one line in EPICS base to print the problematic PVs (newer EPICS versions print this by default) and reported the issue to colleagues in the beam diagnostic group. They restarted the relevant IOC to resolve the issue, and it
has never happened again.
So, when an issue occurs in our LabView IOC / CA gateway environment, we often restart the CA gateway, or find the problematic IOC and restart it.
Regards,
Lin
-----Original Messages-----
From: "Brian Bevins via Tech-talk" <tech-talk at aps.anl.gov>
Send time: Sunday, 03/08/2026 04:50:13
To: "tech-talk at aps.anl.gov" <tech-talk at aps.anl.gov>
Subject: [SPAM] CA gateway crashes periodically due to badly behaved servers
I've noticed an odd behavior in one of our channel access gateways. We use them in various places to sort of shield our stuff from stuff we don't control. I'm using version 2.1.3.
There is an apparently badly behaved server that I don't control that is seen by the gateway. It's a bit of a black box even to those who do control it. All I know is that it's a PXI crate running a LabView CA server. It disconnects about once every 8-40 seconds
and frequently returns errors about bad resource IDs, apparently in response to CA_PROTO_EVENT_CANCEL commands. For unknown reasons this happens much more often than it did when we were running version 2.0.3.0 of the gateway. I'm still looking into that.
The gateway does a fine job of ignoring the errors and reconnecting, until it has received 100 errors. After that the error logging throttle kicks in and it logs only disconnects for 1 hour. What seems odd to me is that when it does this it uninstalls its own
error handler. When the next serious error comes in it invokes the default client handler and aborts the program. This seems wrong to me. I've worked around it by just not uninstalling the error handler. Messages still stop being logged for an hour, but they
don't abort the gateway.
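The workaround described above amounts to separating "stop logging" from "stop handling". A minimal sketch of that idea follows; the `ErrorThrottle` class and its parameters are hypothetical (the defaults are taken from the 100-error and 1-hour figures in this message) and are not the gateway's actual implementation.

```python
class ErrorThrottle:
    """Keeps the handler installed at all times; only logging is suppressed."""

    def __init__(self, limit=100, quiet_seconds=3600):
        self.limit = limit                  # errors logged before throttling starts
        self.quiet_seconds = quiet_seconds  # length of the quiet period
        self.count = 0
        self.quiet_until = 0.0
        self.log = []

    def handle(self, message, now):
        """Handle one exception; return True if it was logged, False if swallowed."""
        if now < self.quiet_until:
            return False  # still handled here, so no default handler can abort
        self.count += 1
        self.log.append(message)
        if self.count >= self.limit:
            # Threshold reached: go quiet, but do not uninstall the handler.
            self.quiet_until = now + self.quiet_seconds
            self.count = 0
        return True
```

For example, with `ErrorThrottle(limit=3, quiet_seconds=10)`, errors arriving at t = 0, 1, 2 are logged, those at t = 3 and t = 11 are swallowed silently, and logging resumes at t = 12.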
The gateway always restarts within just a few seconds, but our archiver (Mya) doesn't always reconnect for some reason, leaving occasional gaps of several hours in the archived data. I'm trying to look into this too. Unfortunately our Mya expert just retired.
--Brian Bevins
Thomas Jefferson National Accelerator Facility