|
|
Experimental Physics and
| ||||||||||||||
|
|
Hi Michael, On 06/27/2018 05:15 AM, [email protected] wrote: > Just a little reminder: we've just seen bug #541221 again here at DLS, > and the impact was ... disconcerting; it's taken a little while to > figure out exactly what's going on. > > Here's the text of our event this time: > > [Tue Jun 26 15:01:38 2018]DB CA Link Exception: "Virtual circuit disconnect", context "172.23.194.103:5064" > [Tue Jun 26 15:04:15 2018] > [Tue Jun 26 15:04:15 2018] > [Tue Jun 26 15:04:15 2018] > [Tue Jun 26 15:04:15 2018]A call to 'assert(pca->pgetNative)' > [Tue Jun 26 15:04:15 2018] by thread 'CAC-TCP-recv' failed in ../dbCa.c line 683. > [Tue Jun 26 15:04:15 2018]EPICS Release EPICS R3.14.12.3 $Date: Mon 2012-12-17 14:11:47 -0600$. > [Tue Jun 26 15:04:15 2018]Local time is 2018-06-26 14:04:15.383540000 UTC > [Tue Jun 26 15:04:15 2018]Please E-mail this message to the author or to [email protected] > [Tue Jun 26 15:04:15 2018]Calling epicsThreadSuspendSelf() > > This happened simultaneously on about thirty IOCs (all running the same > template) when an IOC serving a particular PV was rebooted, and as a > result all affected IOCs needed to be rebooted. Those IOCs must have all had links pointing to a PV in the rebooted IOC that changed its data type; that's usually what triggers this assertion. > I see that I last reported this issue just under 14 years ago (wow!) and > that it is present and live in the bug tracker. > > Now of course you'll see that we're running a pretty elderly version of > EPICS, but given the state and conversation around this bug, I have to > ask -- has this bug actually been fixed in a more recent version of > EPICS? We have made quite a few changes to dbCa.c but the code around line 683 is still mostly the same as it was in your Base-3.14.12.3 revision. That assertion failure gets reported so rarely that I haven't spent time fixing it myself yet, and nobody else has stepped up in the meantime. The code does contain the following comment in connectionCallback() though, which I added in 2008: /* BUG: We have no way to clear any old subscription with the * originally chosen data type/size. That will continue * to send us data and will result in an assert() fail. */ > Looking at the code I see an interesting disjunction between the code > that creates a subscription callback and the callback itself. Let's see > if I can create a suitable caricature of the offending code (obviously > this is 3.14.12.3 code, newer versions may be very different): > > > dbCaTask: > if (link_action & CA_MONITOR_NATIVE) { > pca->pgetNative = dbCalloc(pca->nelements, element_size); > status = ca_add_array_event( > ca_field_type(pca->chid)+DBR_TIME_STRING, > ca_element_count(pca->chid), > pca->chid, eventCallback, pca, 0.0, 0.0, 0.0, 0); > } > if (link_action & CA_MONITOR_STRING) { > pca->pgetString = dbCalloc(1, MAX_STRING_SIZE); > status = ca_add_array_event(DBR_TIME_STRING, 1, > pca->chid, eventCallback, pca, 0.0, 0.0, 0.0, 0); > } > > > eventCallback: > if (arg.type == DBR_TIME_STRING && > ca_field_type(pca->chid) == DBR_ENUM) { > // use pca->pgetString > } else { > // use pca->pgetNative > > > Notice that in the calling code we choose whether to create pgetNative > and pgetString based on the transient link_action variable, but in the > callback code we choose which to use based on arg.type and > ca_field_type(). Now, I'm not familiar with the low level of CA, but I > can well imagine that either arg.type or ca_field_type() may depend on > external data, making this a fairly obvious bug! Indeed. When a CA link reconnects after its server has rebooted, the CA subscription gets restarted automatically by libCa, using the originally requested data type. However if the native type of the channel is now different than it used to be, the subsequent discrepancy between the channel's ca_field_type() and the monitor's arg.typ causes problems that the original authors of this code apparently didn't see (or think important enough to handle). My "BUG:" comment was added at a time when I looked at the problem, but didn't fix it. Another element of the problem is that the code allocates a local buffer to hold the latest value(s) received. When this code was written allocating and freeing memory inside the IOC at runtime was seriously frowned upon as it caused memory fragmentation in the VxWorks IOCs, shortening their lives and requiring more frequent reboots. We aren't so concerned about that now, VxWorks 6 changed their memory allocation strategy from first-fit to best-fit which significantly reduces the fragmentation problem. The obvious question is why doesn't the link subscribe using the type that the record wants the data in? The link code doesn't actually know what type that will be until the first call to dbCaGetLink() during record processing, and even then the data type requested could change for later dbCaGetLink() calls, hence all the flags and stuff in the struct caLink definition. The code can't even be sure whether a link is going to be used for input or output, because (probably before CA links appeared) someone noticed that the INP links of a subroutine record would also work as outputs if the code used them that way. Personally I don't think this is something we should be supporting, but I haven't got around to deprecating it and printing lots of ugly warnings when that happens (coming some-day!). Note that the two calls to ca_add_array_event() that you quoted above do not save the eventId, so there is currently no easy way to cancel the subscription and recreate it if we detect that it reconnected with a different data type. I suspected that doing so would be the easiest way to fix the problem. So by this point you're probably hoping that the attached patch fixes the issue; well congratulations for reading this far, I tried out my suspicion above and the attached patch does seem to work for me on the 3.14 branch version, which should be close enough to yours to be able to apply one way or the other. Please test and let me know so I can apply it to the Base-3.14 branch and merge up. - Andrew -- Arguing for surveillance because you have nothing to hide is no different than making the claim, "I don't care about freedom of speech because I have nothing to say." -- Edward Snowdon diff --git a/src/db/dbCa.c b/src/db/dbCa.c
index aaaaa4b..610ce9d 100644
--- a/src/db/dbCa.c
+++ b/src/db/dbCa.c
@@ -591,11 +591,16 @@ static void connectionCallback(struct connection_handler_args arg)
if (pca->gotFirstConnection) {
if (pca->nelements != ca_element_count(arg.chid) ||
pca->dbrType != ca_field_type(arg.chid)) {
- /* BUG: We have no way to clear any old subscription with the
- * originally chosen data type/size. That will continue
- * to send us data and will result in an assert() fail.
- */
- /* Let next dbCaGetLink and/or dbCaPutLink determine options */
+ /* Size or type changed, clear everything and let the next call
+ to dbCaGetLink() and/or dbCaPutLink() reset everything */
+ if (pca->evidNative) {
+ ca_clear_event(pca->evidNative);
+ pca->evidNative = 0;
+ }
+ if (pca->evidString) {
+ ca_clear_event(pca->evidString);
+ pca->evidString = 0;
+ }
plink->value.pv_link.pvlMask &=
~(pvlOptInpNative | pvlOptInpString |
pvlOptOutNative | pvlOptOutString);
@@ -969,7 +974,8 @@ static void dbCaTask(void *arg)
status = ca_add_array_event(
ca_field_type(pca->chid)+DBR_TIME_STRING,
ca_element_count(pca->chid),
- pca->chid, eventCallback, pca, 0.0, 0.0, 0.0, 0);
+ pca->chid, eventCallback, pca, 0.0, 0.0, 0.0,
+ &pca->evidNative);
if (status != ECA_NORMAL) {
errlogPrintf("dbCaTask ca_add_array_event %s\n",
ca_message(status));
@@ -981,7 +987,8 @@ static void dbCaTask(void *arg)
pca->pgetString = dbCalloc(1, MAX_STRING_SIZE);
epicsMutexUnlock(pca->lock);
status = ca_add_array_event(DBR_TIME_STRING, 1,
- pca->chid, eventCallback, pca, 0.0, 0.0, 0.0, 0);
+ pca->chid, eventCallback, pca, 0.0, 0.0, 0.0,
+ &pca->evidString);
if (status != ECA_NORMAL) {
errlogPrintf("dbCaTask ca_add_array_event %s\n",
ca_message(status));
diff --git a/src/db/dbCaPvt.h b/src/db/dbCaPvt.h
index a0a1f65..58e9afd 100644
--- a/src/db/dbCaPvt.h
+++ b/src/db/dbCaPvt.h
@@ -73,6 +73,8 @@ typedef struct caLink
char *pgetString;
void *pputNative;
char *pputString;
+ evid evidNative;
+ evid evidString;
char gotInNative;
char gotInString;
char gotOutNative;
| ||||||||||||||
| ANJ, 29 Jun 2018 |
·
Home
·
News
·
About
·
Talk
·
Base
·
Modules
·
Extensions
·
· Distributions · Download · Documents · Links · Licensing · |