Subject: RE: cac_select_io Segmentation fault
From: "Jeff Hill" <[email protected]>
To: "'Al Honey'" <[email protected]>
Cc: [email protected]
Date: Wed, 7 Apr 2010 15:50:29 -0600
Allan,

I should also mention that, with R3.14, an application thread isn't
allowed to participate in a preexisting CA context unless it makes a CA call requesting
to join that context. Otherwise, it will implicitly create a new CA context when
it calls the CA client library.

Jeff
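[For illustration only; this sketch is mine, not from the thread. A worker thread can join the context created in main() by passing the handle from ca_current_context() to ca_attach_context(). The PV name demo:pv and the thread layout are hypothetical; only a preemptive context can be shared across threads this way.]

#include <stdio.h>
#include "cadef.h"        /* CA client API: ca_context_create, ca_attach_context, ... */
#include "epicsThread.h"  /* OSI threads shipped with EPICS base */

/* Worker thread: joins the main thread's CA context rather than
   implicitly creating its own on the first CA call. */
static void worker(void *arg)
{
    struct ca_client_context *ctx = (struct ca_client_context *) arg;
    chid chan;

    SEVCHK(ca_attach_context(ctx), "ca_attach_context");
    /* From here on, CA calls in this thread use the shared context. */
    SEVCHK(ca_create_channel("demo:pv", NULL, NULL,
                             CA_PRIORITY_DEFAULT, &chan),
           "ca_create_channel");
    SEVCHK(ca_pend_io(5.0), "ca_pend_io");
    printf("connected to %s\n", ca_name(chan));
}

int main(void)
{
    /* A preemptive context is required if several threads are to share it. */
    SEVCHK(ca_context_create(ca_enable_preemptive_callback),
           "ca_context_create");
    epicsThreadCreate("worker", epicsThreadPriorityMedium,
                      epicsThreadGetStackSize(epicsThreadStackSmall),
                      worker, ca_current_context());
    epicsThreadSleep(10.0);   /* crude wait for the worker; adequate for a sketch */
    ca_context_destroy();
    return 0;
}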
______________________________________________________
From: Jeff Hill [mailto:[email protected]]

Allan,

It does sound like optical telescope upgrades are a tad more
challenging, and more exposed to schedule risk, than what we typically see in
the particle accelerator world.

We have actually made the network behavior of EPICS systems as a
whole more robust under load in R3.14. The main change has been that
network-congestion-induced disconnects do _not_ cause positive congestion
feedback. In practice this means that when a CA channel times out, the client
library disconnects the application but not the TCP circuit. A large
number of other subtle issues have also been fixed, and KECK can quickly
benefit from many of them by switching all the apps to R3.14 even if the IOCs
must stay at R3.13 because it takes more time to complete their upgrade work.

> We have deployed r3.14 caRepeaters.

I wouldn't expect any functionally different behavior between
the R3.14 and R3.13 CA repeaters, but I am very willing to have a look at some
additional details to see what might be occurring. At some point we did change
from using the fork system call to auto-start the CA repeater to using the exec
system call which is certainly a behavior change, but I seem to recall that
this change was made as a patch to R3.13 at some point. The big change here is
avoiding the duplication of resources created by an application into the CA
repeater's process, and avoiding the CA repeater process having a strange
name belonging to the app that started it. One can of course avoid such issues
altogether by starting the CA repeater in the workstation's boot script.

> And we have some random issues: sometimes we have 'get' failure
> responses, indicating channels are disconnected or non-existent,
> in one client, whilst other clients are happy;

This can certainly occur if there is CPU saturation in the IOC,
buffer starvation in the IOC’s IP kernel, or possibly also if you have a really
old network with hubs instead of switches. Old Ethernet networks can experience
delays if you have collision chains of sufficient magnitude. Another
possibility is applications with too-short connection timeouts configured. Here
at LANL I have, only on one project, seen a weird situation where many HP CAD
workstations were configured with the wrong network/host partitioning mask, and
they were responding to CA search requests with an ICMP error response. That
was causing the IP input queue on the CA client’s host to get saturated, and
this led to CA connects taking much longer to complete than they should have.
Otherwise, this is not expected.

> responses from multiple IOCs for a given channel when only one of
> those IOCs has that channel.

Not expected, and I don't recall hearing complaints in this area.
If the CA client library received a search response from the wrong IOC it will
try to connect to the wrong IOC, fail doing that, and then return to searching
for the channel, where it would presumably eventually find the correct IOC.

> I have not had time to delve into those. The errors seem
> to be realtime IOC/board and/or possibly vxWorks related, as there
> is one newer processor (not PPC as are most of the others) running
> a newer version of vxWorks, from which I never see those types of errors.

Some earlier vxWorks versions had substantial issues with mbuf starvation
and driver buffer pool starvation. As I recall, the SNS resolved these issues
by allocating more m-bufs and cluster-bufs at vxWorks kernel build time, and by
installing all of the very latest vxWorks network interface driver patches. The
network congestion robustness improvements in R3.14 probably helped also.

> Does someone have a simple multi-threaded example, utilizing r3.14
> (so I can compare the CA library calls with what we are currently using in r3.13.10)?

The CA client interfaces are very close to 100% backwards
compatible. There are some new interfaces that enable new features of course.
Hopefully I am not oversimplifying the situation, but it’s probably safe to say
that the primary multi-threading issue will be with how to properly structure
your app for multi-threading, as all of the multi-threading issues related to CA
internals were dealt with when preparing the first releases of R3.14. As
mentioned in a previous message, you will also need to decide if you want
non-preemptive callback, which requires periodic calls to ca_poll from your
thread, or preemptive callback, where you will receive asynchronous
callbacks from the CA client library. Asynchronous callbacks will of course
require some additional expertise. The application may need some additional
mutual exclusion primitives to control asynchronous access into the
application's data structures originating from multiple instances of the CA
client library's auxiliary threads.

Jeff
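[To make Jeff's point concrete, a minimal sketch of the preemptive-callback style; this is my illustration, not code from the thread, and the PV name demo:counter is made up. Note the mutex guarding the application data that the library's auxiliary threads touch from inside the event callback.]

#include <stdio.h>
#include "cadef.h"       /* R3.14 CA client API */
#include "epicsMutex.h"
#include "epicsThread.h"

static epicsMutexId lock;    /* guards shared_value */
static double shared_value;  /* updated asynchronously by CA threads */

/* Runs in one of the CA client library's auxiliary threads
   because the context is preemptive. */
static void on_update(struct event_handler_args args)
{
    if (args.status == ECA_NORMAL && args.type == DBR_DOUBLE) {
        epicsMutexLock(lock);
        shared_value = *(const double *) args.dbr;
        epicsMutexUnlock(lock);
    }
}

int main(void)
{
    chid chan;

    lock = epicsMutexCreate();
    SEVCHK(ca_context_create(ca_enable_preemptive_callback),
           "ca_context_create");
    SEVCHK(ca_create_channel("demo:counter", NULL, NULL,
                             CA_PRIORITY_DEFAULT, &chan),
           "ca_create_channel");
    SEVCHK(ca_pend_io(5.0), "ca_pend_io");
    SEVCHK(ca_create_subscription(DBR_DOUBLE, 1, chan, DBE_VALUE,
                                  on_update, NULL, NULL),
           "ca_create_subscription");
    SEVCHK(ca_flush_io(), "ca_flush_io");

    for (;;) {               /* main thread is free to do other work */
        epicsThreadSleep(1.0);
        epicsMutexLock(lock);
        printf("latest value: %f\n", shared_value);
        epicsMutexUnlock(lock);
    }
}

[With ca_disable_preemptive_callback instead, no callback ever runs outside a CA call made by the application thread, so the mutex becomes unnecessary, at the cost of calling ca_pend_event or ca_poll periodically.]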
______________________________________________________
From: Al Honey [mailto:[email protected]]

No negativity noticed :-)

We have deployed r3.14 caRepeaters. And we have some random issues:
sometimes we have 'get' failure responses, indicating channels are disconnected
or non-existent, in one client, whilst other clients are happy; responses from
multiple IOCs for a given channel when only one of those IOCs has that channel.
I have not had time to delve into those. The errors seem to be realtime
IOC/board and/or possibly vxWorks related, as there is one newer processor (not
PPC as are most of the others) running a newer version of vxWorks, from which I
never see those types of errors.

I think our big issue, with respect to forging ahead with
multi-threaded clients (i.e. using r3.14 for all clients), is that major
modifications would need to be made to the layer we have between CA and our
clients (said layer hides CA, as it is not the only inter-process/processor
communications mechanism in place, for instance we have numerous RPC systems;
and other socket based systems). Most of our operational clients do not
interface directly to CA. Hence, that ‘layer’ is critical. It was the
application interface provided to all our sister institutions (which create
non-EPICS/CA instruments/systems) back in the early 90’s. I have been studying
that ‘layer’ in great detail, in my attempt to solve the multi-threaded issue
(r3.13.10), and it may be that I am now sufficiently less ignorant that I can
make those modifications. If that is the case then I will no doubt have more
questions. So, thanks for pointing me to pertinent documents that will
make the transition from r3.13.10 to r3.14 possible.

Does someone have a simple multi-threaded example, utilizing r3.14
(so I can compare the CA library calls with what we are currently using in
r3.13.10)?

Cheers,
Al

______________________________________________________
From: Jeff Hill [mailto:[email protected]]

Aloha again Allan,

Sorry, after rereading my message, the tone sounds a bit
negative, which wasn't my intent. I should have said, "please read also the
section in the reference manual entitled 'Thread
Safety and Preemptive Callback to User Code'". When designing this type
of application, one must decide if CA callbacks should occur only when
periodically executing in a CA client library function such as ca_poll, or if
the CA callbacks should occur asynchronously, as soon as the network messages
are processed by the auxiliary threads in the library. Either approach can be
used in a multi-threaded program.

Jeff
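[For contrast with the preemptive sketch earlier in this thread, here is the other option in minimal form; again my illustration, with a made-up PV name. In a non-preemptive context, callbacks are dispatched only while the application thread sits inside a CA call such as ca_pend_event.]

#include <stdio.h>
#include "cadef.h"

/* With a non-preemptive context, this runs only while the
   application thread is inside a CA call such as ca_pend_event. */
static void on_update(struct event_handler_args args)
{
    if (args.status == ECA_NORMAL && args.type == DBR_DOUBLE)
        printf("update: %f\n", *(const double *) args.dbr);
}

int main(void)
{
    chid chan;

    SEVCHK(ca_context_create(ca_disable_preemptive_callback),
           "ca_context_create");
    SEVCHK(ca_create_channel("demo:pv", NULL, NULL,
                             CA_PRIORITY_DEFAULT, &chan),
           "ca_create_channel");
    SEVCHK(ca_pend_io(5.0), "ca_pend_io");
    SEVCHK(ca_create_subscription(DBR_DOUBLE, 1, chan, DBE_VALUE,
                                  on_update, NULL, NULL),
           "ca_create_subscription");

    for (;;)
        ca_pend_event(0.1);  /* poll: flushes output, dispatches callbacks */
}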
[email protected] [mailto:[email protected]] On
Behalf Of Jeff Hill Aloha
Allan, Ø Does the seg fault
> occur because r3.13.10 is NOT thread safe?

The R3.13 CA client library is definitely __not__ thread safe, and
I can easily imagine that this might be the cause of your seg fault.

> Does anyone have an
> example of a multi-threaded app using r3.13.10 on UNIX?

The R3.14 CA client
library _is_ thread safe, and it should also interoperate fine with
R3.13 IOCs. We routinely operate LANSCE with that configuration in our
production system. Our control room runs R3.14, but many of our IOCs still run
R3.13. You should read the section in the reference manual entitled “Thread
Safety and Preemptive Callback to User Code".
Jeff

______________________________________________________
From: [email protected] [mailto:[email protected]] On Behalf Of Al Honey

Aloha,
I am trying to get a multi-threaded application working on SunOS 5.10 with
connections to two UNIX IOCs.

I get a seg fault in ellDelete, two statements from the end of cac_select_io()
(epics/r3.13.10/base/src/ca/bsd_depen.c). The seg fault does not occur
immediately, but within a couple of minutes (connections are to two IOCs
running on UNIX, with events from two long records on each IOC, where one
record on each system is updated at 1 Hz and the other at 10 Hz).

Does the seg fault occur because r3.13.10 is NOT thread safe?

Does anyone have an example of a multi-threaded app using r3.13.10 on UNIX?

Thanks,
Allan