EPICS Controls Argonne National Laboratory

Experimental Physics and
Industrial Control System

1994  1995  1996  1997  1998  1999  2000  2001  2002  2003  2004  2005  2006  2007  2008  2009  2010  2011  2012  2013  2014  2015  2016  2017  <20182019  2020  2021  2022  2023  2024  Index 1994  1995  1996  1997  1998  1999  2000  2001  2002  2003  2004  2005  2006  2007  2008  2009  2010  2011  2012  2013  2014  2015  2016  2017  <20182019  2020  2021  2022  2023  2024 
<== Date ==> <== Thread ==>

Subject: Re: 'assert (pca->pgetNative)' failed in ../dbCa.c
From: Andrew Johnson <[email protected]>
To: <[email protected]>
Date: Wed, 27 Jun 2018 15:09:02 -0500
Hi Michael,

On 06/27/2018 05:15 AM, [email protected] wrote:
> Just a little reminder: we've just seen bug #541221 again here at DLS,
> and the impact was ... disconcerting; it's taken a little while to
> figure out exactly what's going on.
> 
> Here's the text of our event this time:
> 
> [Tue Jun 26 15:01:38 2018]DB CA Link Exception: "Virtual circuit disconnect", context "172.23.194.103:5064"
> [Tue Jun 26 15:04:15 2018]
> [Tue Jun 26 15:04:15 2018]
> [Tue Jun 26 15:04:15 2018]
> [Tue Jun 26 15:04:15 2018]A call to 'assert(pca->pgetNative)'
> [Tue Jun 26 15:04:15 2018] by thread 'CAC-TCP-recv' failed in ../dbCa.c line 683.
> [Tue Jun 26 15:04:15 2018]EPICS Release EPICS R3.14.12.3 $Date: Mon 2012-12-17 14:11:47 -0600$.
> [Tue Jun 26 15:04:15 2018]Local time is 2018-06-26 14:04:15.383540000 UTC
> [Tue Jun 26 15:04:15 2018]Please E-mail this message to the author or to [email protected]
> [Tue Jun 26 15:04:15 2018]Calling epicsThreadSuspendSelf()
> 
> This happened simultaneously on about thirty IOCs (all running the same
> template) when an IOC serving a particular PV was rebooted, and as a
> result all affected IOCs needed to be rebooted.

Those IOCs must have all had links pointing to a PV in the rebooted IOC
that changed its data type; that's usually what triggers this assertion.

> I see that I last reported this issue just under 14 years ago (wow!) and
> that it is present and live in the bug tracker.
> 
> Now of course you'll see that we're running a pretty elderly version of
> EPICS, but given the state and conversation around this bug, I have to
> ask -- has this bug actually been fixed in a more recent version of
> EPICS?

We have made quite a few changes to dbCa.c but the code around line 683
is still mostly the same as it was in your Base-3.14.12.3 revision. That
assertion failure gets reported so rarely that I haven't spent time
fixing it myself yet, and nobody else has stepped up in the meantime.
The code does contain the following comment in connectionCallback()
though, which I added in 2008:

    /* BUG: We have no way to clear any old subscription with the
     *      originally chosen data type/size.  That will continue
     *      to send us data and will result in an assert() fail.
     */

> Looking at the code I see an interesting disjunction between the code
> that creates a subscription callback and the callback itself.  Let's see
> if I can create a suitable caricature of the offending code (obviously
> this is 3.14.12.3 code, newer versions may be very different):
> 
> 
> dbCaTask:
>     if (link_action & CA_MONITOR_NATIVE) {
>         pca->pgetNative = dbCalloc(pca->nelements, element_size);
>         status = ca_add_array_event(
>             ca_field_type(pca->chid)+DBR_TIME_STRING,
>             ca_element_count(pca->chid),
>             pca->chid, eventCallback, pca, 0.0, 0.0, 0.0, 0);
>     }
>     if (link_action & CA_MONITOR_STRING) {
>         pca->pgetString = dbCalloc(1, MAX_STRING_SIZE);
>         status = ca_add_array_event(DBR_TIME_STRING, 1,
>             pca->chid, eventCallback, pca, 0.0, 0.0, 0.0, 0);
>     }
> 
> 
> eventCallback:
>     if (arg.type == DBR_TIME_STRING && 
>         ca_field_type(pca->chid) == DBR_ENUM) {
>         // use pca->pgetString
>     } else {
>         // use pca->pgetNative
> 
> 
> Notice that in the calling code we choose whether to create pgetNative
> and pgetString based on the transient link_action variable, but in the
> callback code we choose which to use based on arg.type and
> ca_field_type().  Now, I'm not familiar with the low level of CA, but I
> can well imagine that either arg.type or ca_field_type() may depend on
> external data, making this a fairly obvious bug!

Indeed. When a CA link reconnects after its server has rebooted, the CA
subscription gets restarted automatically by libCa, using the originally
requested data type. However if the native type of the channel is now
different than it used to be, the subsequent discrepancy between the
channel's ca_field_type() and the monitor's arg.typ causes problems that
the original authors of this code apparently didn't see (or think
important enough to handle). My "BUG:" comment was added at a time when
I looked at the problem, but didn't fix it.

Another element of the problem is that the code allocates a local buffer
to hold the latest value(s) received. When this code was written
allocating and freeing memory inside the IOC at runtime was seriously
frowned upon as it caused memory fragmentation in the VxWorks IOCs,
shortening their lives and requiring more frequent reboots. We aren't so
concerned about that now, VxWorks 6 changed their memory allocation
strategy from first-fit to best-fit which significantly reduces the
fragmentation problem.

The obvious question is why doesn't the link subscribe using the type
that the record wants the data in? The link code doesn't actually know
what type that will be until the first call to dbCaGetLink() during
record processing, and even then the data type requested could change
for later dbCaGetLink() calls, hence all the flags and stuff in the
struct caLink definition.

The code can't even be sure whether a link is going to be used for input
or output, because (probably before CA links appeared) someone noticed
that the INP links of a subroutine record would also work as outputs if
the code used them that way. Personally I don't think this is something
we should be supporting, but I haven't got around to deprecating it and
printing lots of ugly warnings when that happens (coming some-day!).

Note that the two calls to ca_add_array_event() that you quoted above do
not save the eventId, so there is currently no easy way to cancel the
subscription and recreate it if we detect that it reconnected with a
different data type. I suspected that doing so would be the easiest way
to fix the problem.


So by this point you're probably hoping that the attached patch fixes
the issue; well congratulations for reading this far, I tried out my
suspicion above and the attached patch does seem to work for me on the
3.14 branch version, which should be close enough to yours to be able to
apply one way or the other. Please test and let me know so I can apply
it to the Base-3.14 branch and merge up.

- Andrew

-- 
Arguing for surveillance because you have nothing to hide is no
different than making the claim, "I don't care about freedom of
speech because I have nothing to say." -- Edward Snowdon
diff --git a/src/db/dbCa.c b/src/db/dbCa.c
index aaaaa4b..610ce9d 100644
--- a/src/db/dbCa.c
+++ b/src/db/dbCa.c
@@ -591,11 +591,16 @@ static void connectionCallback(struct connection_handler_args arg)
     if (pca->gotFirstConnection) {
         if (pca->nelements != ca_element_count(arg.chid) ||
             pca->dbrType != ca_field_type(arg.chid)) {
-            /* BUG: We have no way to clear any old subscription with the
-             *      originally chosen data type/size.  That will continue
-             *      to send us data and will result in an assert() fail.
-             */
-            /* Let next dbCaGetLink and/or dbCaPutLink determine options */
+            /* Size or type changed, clear everything and let the next call
+               to dbCaGetLink() and/or dbCaPutLink() reset everything */
+            if (pca->evidNative) {
+                ca_clear_event(pca->evidNative);
+                pca->evidNative = 0;
+            }
+            if (pca->evidString) {
+                ca_clear_event(pca->evidString);
+                pca->evidString = 0;
+            }
             plink->value.pv_link.pvlMask &=
                 ~(pvlOptInpNative | pvlOptInpString |
                   pvlOptOutNative | pvlOptOutString);
@@ -969,7 +974,8 @@ static void dbCaTask(void *arg)
                 status = ca_add_array_event(
                     ca_field_type(pca->chid)+DBR_TIME_STRING,
                     ca_element_count(pca->chid),
-                    pca->chid, eventCallback, pca, 0.0, 0.0, 0.0, 0);
+                    pca->chid, eventCallback, pca, 0.0, 0.0, 0.0,
+                    &pca->evidNative);
                 if (status != ECA_NORMAL) {
                     errlogPrintf("dbCaTask ca_add_array_event %s\n",
                         ca_message(status));
@@ -981,7 +987,8 @@ static void dbCaTask(void *arg)
                 pca->pgetString = dbCalloc(1, MAX_STRING_SIZE);
                 epicsMutexUnlock(pca->lock);
                 status = ca_add_array_event(DBR_TIME_STRING, 1,
-                    pca->chid, eventCallback, pca, 0.0, 0.0, 0.0, 0);
+                    pca->chid, eventCallback, pca, 0.0, 0.0, 0.0,
+                    &pca->evidString);
                 if (status != ECA_NORMAL) {
                     errlogPrintf("dbCaTask ca_add_array_event %s\n",
                         ca_message(status));
diff --git a/src/db/dbCaPvt.h b/src/db/dbCaPvt.h
index a0a1f65..58e9afd 100644
--- a/src/db/dbCaPvt.h
+++ b/src/db/dbCaPvt.h
@@ -73,6 +73,8 @@ typedef struct caLink
 	char		*pgetString;
 	void		*pputNative;
 	char		*pputString;
+	evid		evidNative;
+	evid		evidString;
 	char		gotInNative;
 	char		gotInString;
 	char		gotOutNative;

Replies:
RE: 'assert (pca->pgetNative)' failed in ../dbCa.c [email protected]
RE: 'assert (pca->pgetNative)' failed in ../dbCa.c [email protected]
References:
'assert (pca->pgetNative)' failed in ../dbCa.c [email protected]

Navigate by Date:
Prev: areaDetector R3-3 released Mark Rivers
Next: dbpf to waveform Hinko Kocevar
Index: 1994  1995  1996  1997  1998  1999  2000  2001  2002  2003  2004  2005  2006  2007  2008  2009  2010  2011  2012  2013  2014  2015  2016  2017  <20182019  2020  2021  2022  2023  2024 
Navigate by Thread:
Prev: 'assert (pca->pgetNative)' failed in ../dbCa.c [email protected]
Next: RE: 'assert (pca->pgetNative)' failed in ../dbCa.c [email protected]
Index: 1994  1995  1996  1997  1998  1999  2000  2001  2002  2003  2004  2005  2006  2007  2008  2009  2010  2011  2012  2013  2014  2015  2016  2017  <20182019  2020  2021  2022  2023  2024 
ANJ, 29 Jun 2018 Valid HTML 4.01! · Home · News · About · Base · Modules · Extensions · Distributions · Download ·
· Search · EPICS V4 · IRMIS · Talk · Bugs · Documents · Links · Licensing ·