I recently had to diagnose a problem with our local CA
nameserver. We use a fork of the EPICS CA nameserver, and the
upstream version exhibits the same problem.
The symptom was that certain clients would receive monitor
updates with element counts of zero, even for scalar PVs, when
the PVs were resolved through a new build of the CA nameserver.
This only occurred with clients built against R3.14.12 or newer,
a nameserver built against R3.14.12 or newer, and an ioc built
against R3.14.11 or older. (Note that I use the term "ioc"
throughout to refer to any CA server in order to avoid confusion
over which server is being described.) We identified camonitor
and PyEpics as two clients that were affected. Notably, caget
was not.
I traced the problem to a change made in CA_V413 that allows
clients to use a zero element count in a CA_PROTO_EVENT_ADD or
CA_PROTO_READ_NOTIFY message. Prior to 413, an element count of
zero in a request was supposed to always trigger an error. 413
allows a request for zero elements to which the server responds
with the actual number of elements available. Using the element
count of zero seems to have become the default when a client is
communicating with an ioc that supports V413. Everything works
fine when the client and ioc handle name resolution directly,
because the channel falls back to the lowest common denominator
of the CA minor protocol version.
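
As a rough illustration of that fallback, here is a simplified model
in C. It is not libca code: the function names and the macro are
hypothetical, and the threshold of 13 is simply the minor revision
implied by the name CA_V413.

```c
/* Simplified model of the behavior described above; not libca code.
 * All names here are hypothetical. */

/* Minor revision at which a zero element count became legal (CA 4.13). */
#define DYNAMIC_COUNT_MINOR 13u

/* A directly connected channel uses the lower of the two minor versions. */
static unsigned channel_minor(unsigned client_minor, unsigned ioc_minor)
{
    return client_minor < ioc_minor ? client_minor : ioc_minor;
}

/* Element count the client places in an EVENT_ADD or READ_NOTIFY request:
 * zero ("send whatever you have") only when the channel supports 4.13. */
static unsigned long request_count(unsigned minor, unsigned long native_count)
{
    return minor >= DYNAMIC_COUNT_MINOR ? 0ul : native_count;
}
```

With a 4.13 client and an older ioc (minor 11 is used here only as an
example value), channel_minor() yields the older version, so
request_count() returns the native count rather than zero.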
The problem occurs when a CA nameserver built against a newer
EPICS redirects newer clients to older iocs. The nameserver
always answers the client with a message that correctly gives
the ioc's address, but includes the nameserver's own protocol
version. When the client then opens a channel with the ioc, the
client falsely believes that the ioc has a later protocol
version than it really does. This never seemed to create a
problem for us until CA_V413, but when a 413 client connects to
a <= 412 ioc, it uses the zero element count, which the ioc
handles by returning updates with zero elements. (This itself
seems like another bug in that for V412 and earlier an element
count of zero is supposed to generate an error.)
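
The mismatch can be modeled in a few lines of C. This is only a sketch
of the behavior we observed, not EPICS code: both function names are
hypothetical, and the older ioc's handling of a zero count reflects
what we saw on the wire rather than anything taken from its source.

```c
/* Simplified model of the failure; not EPICS code, names hypothetical. */

/* Count the client requests, based on the minor version it BELIEVES
 * the ioc speaks (here, the one advertised in the search reply). */
static unsigned long client_request_count(unsigned believed_minor,
                                          unsigned long native_count)
{
    return believed_minor >= 13u ? 0ul : native_count;
}

/* Count the ioc puts in each update, based on its ACTUAL minor version. */
static unsigned long ioc_update_count(unsigned actual_minor,
                                      unsigned long requested,
                                      unsigned long native_count)
{
    if (requested == 0ul) {
        /* 4.13 and later: zero means "all available elements".
         * Older iocs: we observed updates carrying zero elements. */
        return actual_minor >= 13u ? native_count : 0ul;
    }
    return requested < native_count ? requested : native_count;
}
```

For a scalar PV on an older ioc (minor 11 as an example value), a
client that believes it is talking to a 4.13 server requests zero
elements and gets zero back. A client that believes the true version
requests one element and gets one.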
My fix was to create patches to libca and libcas and modify the
nameserver to use them. The libca patch adds a
ca_host_minor_protocol(chid) function to the CA client API that
allows a client to request the minor protocol version of a
connected server. The client side of the CA nameserver uses it
to fetch and store the minor protocol version from each ioc. The
libcas patch adds an overload of pvExistReturn() that allows the
server to respond to a client with both a specified network
address and a specified minor protocol version number. The
server side of the CA nameserver uses this to respond to clients
with both the stored address and the stored protocol version.
With these changes in place the problem disappears.
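
The idea behind the nameserver change can be sketched as follows. This
is an illustration only, not the actual patch: the struct and function
names are hypothetical, and no EPICS calls are made. In the real
nameserver the stored minor version would come from the patched
ca_host_minor_protocol(chid), and the reply would go out through the
new pvExistReturn() overload.

```c
#include <stddef.h>
#include <string.h>

/* What the nameserver remembers about each PV it has resolved. */
struct pv_entry {
    const char *pv_name;
    const char *ioc_addr;   /* address of the ioc hosting the PV */
    unsigned    ioc_minor;  /* minor protocol version reported by that ioc */
};

/* What the nameserver sends back to a searching client. */
struct search_reply {
    const char *ioc_addr;
    unsigned    minor;
};

/* Answer a search with the ioc's stored address AND its stored minor
 * version, rather than the nameserver's own. Returns 1 if the PV is
 * known, 0 otherwise (reply is left untouched in that case). */
static int answer_search(const struct pv_entry *table, size_t n,
                         const char *pv_name, struct search_reply *reply)
{
    for (size_t i = 0; i < n; i++) {
        if (strcmp(table[i].pv_name, pv_name) == 0) {
            reply->ioc_addr = table[i].ioc_addr;
            reply->minor    = table[i].ioc_minor;
            return 1;
        }
    }
    return 0;
}
```

Because the client now sees the ioc's real minor version, it no longer
sends a zero element count to an ioc that predates 413.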
I know that pcas is basically dead, but the CA nameserver seems
to still be in use. I can provide the patches against R3.15.9 if
there is any interest in that.
--Brian Bevins
Jefferson Lab