Thanks Jeff
All great information.
Will r3.14.9 be sufficient? That is what
we have installed.
I think Kevin was having problems with
r3.14.10 (tnet driver I think) and I am not sure he ever built r3.14.11.
Allan
Allan,
It does sound like optical telescope upgrades are a bit more challenging, and
more exposed to schedule risk, than what we typically see in the particle
accelerator world.
We have actually
made the network behavior of EPICS systems as a whole more robust under load in
R3.14. The main change has been that network congestion induced disconnects do
_not_ cause positive congestion
feedback. In practice this means that the client library disconnects the
application, but not the TCP circuit when a CA channel times out. A large
number of other subtle issues have also been fixed, and KECK can quickly
benefit from many of them by switching all the apps to R3.14, even if the IOCs
must stay at R3.13 while their upgrade work is completed.
> We have deployed r3.14 caRepeaters.
I wouldn’t
expect any functionally different behavior between the R3.14 and R3.13 CA
repeaters, but I am very willing to have a look at some additional details to
see what might be occurring. At some point we did change from using the fork
system call to auto-start the CA repeater to using the exec system call which
is certainly a behavior change, but I seem to recall that this change was made
as a patch at some point to R3.13. The big change here is avoidance of
duplicating resources created by an application into the CA repeater’s
process, and problems with the CA repeater process having a misleading name
belonging to the app that started it. One can of course avoid such issues
altogether by starting the CA repeater in the workstation’s boot script.
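For example, a workstation rc script fragment might look like the following
(the install path is hypothetical; caRepeater is the executable shipped with
EPICS base):

```sh
#!/bin/sh
# Start the CA repeater at boot so client applications never need to
# spawn it themselves.  /opt/epics/base/bin is a hypothetical location.
PATH=/opt/epics/base/bin:$PATH
export PATH

caRepeater &
```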
> And we have some random issues: sometimes we have ‘get’ failure responses,
> indicating channels are disconnected or non-existent, in one client, whilst
> other clients are happy;
This can certainly occur if there is CPU
saturation in the IOC, buffer starvation in the IOC’s IP kernel, or
possibly also if you have a really old network with hubs instead of switches.
Old Ethernet networks can experience delays if you have collision chains of
sufficient magnitude. Another possibility is applications with too-short
connection timeouts configured. Here at LANL, on one project only, I have seen
a strange situation where many HP CAD workstations were configured with the
wrong subnet (network/host partitioning) mask, and they were responding to CA
search requests with an ICMP error response. That saturated the IP input queue
on the CA client’s host, which led to CA connects taking much longer to
complete than they should have. Otherwise, this is not expected.
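On the too-short-timeout point, the knob involved is the EPICS_CA_CONN_TMO
environment variable (in seconds; the CA reference manual’s default is 30).
A fragment for a client’s environment setup might look like:

```sh
# CA connection timeout, in seconds.  Values well below the default of 30
# can produce spurious disconnects on a congested network.
EPICS_CA_CONN_TMO=30
export EPICS_CA_CONN_TMO
```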
> responses from multiple IOCs for a given channel when only one of those
> IOCs has that channel.
Not expected, and I don’t recall hearing complaints in this area. If the CA
client library receives a search response from the wrong IOC, it will try to
connect to that IOC, fail, and then return to searching for the channel, where
it would presumably eventually find the correct IOC.
> I have not had time to delve into those. The errors seem to be realtime
> IOC/board and/or possibly vxWorks related, as there is one newer processor
> (not PPC as are most of the others) running a newer version of vxWorks,
> from which I never see those types of errors.
Some earlier vxWorks versions
had substantial issues with mbuf starvation and driver buffer pool starvation.
As I recall, the SNS resolved these issues by allocating more m-bufs and
cluster-bufs at vxWorks kernel build time, and by installing all of the very
latest vxWorks network interface driver patches. The network congestion
robustness improvements in R3.14 probably helped also.
> Does someone have a simple multi-threaded example, utilizing r3.14 (so I
> can compare the CA library calls with what we are currently using in
> r3.13.10)?
The CA client
interfaces are very close to 100% backwards compatible. There are some new
interfaces that enable new features of course. Hopefully I am not
oversimplifying the situation, but it’s probably safe to say that the
primary multi-threading issue will be with how to properly structure your app
for multi-threading as all of the multi-threading issues related to CA
internals were dealt with when preparing the first releases of R3.14. As
mentioned in a previous message, you will also need to decide if you want
non-preemptive callback, which requires periodic calls to ca_poll from your
thread, or preemptive callback, where you will receive asynchronous
callbacks from the CA client library. Asynchronous callbacks will of course
require some additional expertise: the application may need additional
mutual exclusion primitives to control asynchronous access into the
application’s data structures originating from multiple instances of the
CA client library’s auxiliary threads.
Jeff
______________________________________________________
Jeffrey O. Hill            Email  [email protected]
LANL MS H820               Voice  505 665 1831
Los Alamos NM 87545 USA    FAX    505 665 5107
Message content: TSPA
No negativity noticed :-)
We have deployed r3.14 caRepeaters. And we
have some random issues: sometimes we have ‘get’ failure responses,
indicating channels are disconnected or non-existent, in one client, whilst
other clients are happy; responses from multiple IOCs for a given channel when
only one of those IOCs has that channel. I have not had time to delve into
those. The errors seem to be realtime IOC/board and/or possibly vxWorks
related, as there is one newer processor (not PPC as are most of the others)
running a newer version of vxWorks, from which I never see those types of
errors.
I think our big issue, with respect to
forging ahead with multi-threaded clients (i.e. using r3.14 for all clients),
is that major modifications would need to be made to the layer we have between
CA and our clients (said layer hides CA, as it is not the only
inter-process/processor communications mechanism in place, for instance we have
numerous RPC systems; and other socket based systems). Most of our operational
clients do not interface directly to CA. Hence, that ‘layer’ is
critical. It was the application interface provided to all our sister institutions
(which create non-EPICS/CA instruments/systems) back in the early 90’s. I
have been studying that ‘layer’ in great detail, in my attempt to
solve the multi-threaded issue (r3.13.10), and it may be that I am now
sufficiently less ignorant that I can make those modifications. If that is the
case then I will no doubt have more questions. So, thanks for pointing me
to pertinent documents that will make the transition from r3.13.10 to r3.14
possible.
Does someone have a simple multi-threaded
example, utilizing r3.14 (so I can compare the CA library calls with what we
are currently using in r3.13.10)?
Cheers,
Al
Aloha again Allan,
Sorry, after
rereading my message, the tone sounds a bit negative, which wasn’t
my intent. I should have said, “please read also the section in the
reference manual entitled - Thread Safety and Preemptive Callback to User Code”. When designing this type of application, one
must decide if CA callbacks should occur only when periodically executing in a
CA client library function such as ca_poll, or if the CA callbacks should occur
asynchronously, as soon as the network messages are processed by the auxiliary
threads in the library. Either approach can be used in a multi-threaded
program.
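For what it’s worth, the two models look roughly like this in R3.14 client
code. This is an outline only, not a complete buildable program: it requires
EPICS base R3.14 (cadef.h) to compile, and the PV name "fred" is hypothetical.

```c
#include <stdio.h>
#include <cadef.h>   /* EPICS base R3.14 CA client API */

/* In non-preemptive mode this runs only while the application is inside a
 * CA function such as ca_poll or ca_pend_event; in preemptive mode it can
 * run at any time, from one of the library's auxiliary threads. */
static void on_event(struct event_handler_args args)
{
    if (args.status == ECA_NORMAL)
        printf("%s = %g\n", ca_name(args.chid), *(const double *) args.dbr);
}

int main(void)
{
    chid chan;

    /* Model 1: non-preemptive callback.  For model 2, pass
     * ca_enable_preemptive_callback instead, and guard any data the
     * callbacks share with the rest of the program using your own mutexes. */
    SEVCHK(ca_context_create(ca_disable_preemptive_callback),
           "ca_context_create");

    SEVCHK(ca_create_channel("fred", NULL, NULL, 0, &chan),
           "ca_create_channel");
    SEVCHK(ca_pend_io(5.0), "ca_pend_io");   /* wait for the connection */
    SEVCHK(ca_create_subscription(DBR_DOUBLE, 1, chan, DBE_VALUE,
                                  on_event, NULL, NULL),
           "ca_create_subscription");

    /* Non-preemptive: this loop is where the callbacks actually execute.
     * Preemptive: it is effectively just a sleep. */
    for (int i = 0; i < 60; i++)
        ca_pend_event(1.0);

    ca_clear_channel(chan);
    ca_context_destroy();
    return 0;
}
```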
Jeff
Aloha Allan,
> Does the seg fault occur because r3.13.10 is NOT thread safe?
The R3.13 CA Client
library is definitely __not__ thread safe, and I can easily imagine that this
might be the cause of your seg fault.
> Does anyone have an example of a multi-threaded app using r3.13.10 on UNIX?
The R3.14 CA client library _is_ thread safe, and it should also
interoperate fine with R3.13 IOCs. We routinely operate LANSCE with that
configuration in our production system. Our control room runs R3.14, but many
of our IOCs still run R3.13. You should read the section in the reference
manual entitled “Thread Safety and Preemptive Callback to User Code”.
Jeff
Aloha
I am trying to get a multi-threaded application working on
SunOS 5.10 with connections to two UNIX IOCs.
I get a seg fault in ellDelete, two statements from the end
of cac_select_io() (epics/r3.13.10/base/src/ca/bsd_depen.c).
The seg fault does not occur immediately but within a couple
of minutes (connections are to two IOCs running on UNIX, with events
from two long records on each IOC, where one record on each system is updated
at 1 Hz and the other at 10 Hz).
Does the seg fault occur because r3.13.10 is NOT thread
safe?
Does anyone have an example of a multi-threaded app using
r3.13.10 on UNIX?
Thanks,
Allan