Folks,
I just found one of our vxWorks IOCs had drained its 256-byte mbufs.
inetstatShow gave me IP address of the client that had filled up the sendQ on
the server. It turns out that machine was NOT running IDL. Rather
medm had crashed due to X errors, but the medm process was still running. That
led to the same problem as IDL, i.e. the client (medm) had requested callbacks,
but was no longer processing them. Killing the medm process fixed the
problem and the mbufs recovered.
This was on a Windows 7 64-bit machine where we are having occasional problems
with Exceed errors with medm. But I think the same thing could happen on
Linux or other systems as well if the application generates an error but does
not actually exit and close all sockets.
Mark
________________________________________
From: Mark Rivers
Sent: Friday, October 21, 2011 4:28 PM
To: '[email protected]'
Subject: RE: EZCA and ezcaIDL improvements?
Folks,
I have verified that indeed previous versions of the ezca library used
by IDL, Matlab, and other scripting languages could easily cause vxWorks IOCs
to die because of mbuf depletion. I have appended the output of the
vxWorks IOC that demonstrates this. It happens when IDL enables monitors
on some PVs and then does not periodically call ca_pend_event. This
probably most often happens when one launches a GUI like scanSee or the MCA
display program, then closes the GUI without closing IDL. The monitor
messages initially queue up on the client, but eventually queue up on the
vxWorks IOC. A single IDL client doing this waa not enough to kill the
IOC, but 2 or 3 IDL clients were, and it happened within a few minutes.
The fix is to use a new version of ezca that is built to use preemptive
callbacks with Channel Access 3.14. Till Straumann had already made the
needed changes to ezca.c for use with his LabCA module. I have taken
Till’s version and put it into the EPICS Extensions CVS repository at the
APS. Using this new version of ezca I have built new versions of the
ezcaIDL shareable library for 32-bit and 64-bit architectures on both Linux and
Windows.
The new DLLs can be found (along with other binaries from EPICS base
3-14.12.1) here:
http://cars9.uchicago.edu/software/pub/EPICS_linux-x86_binaries.zip
http://cars9.uchicago.edu/software/pub/EPICS_linux-x86_64_binaries.zip
http://cars9.uchicago.edu/software/pub/EPICS_win32-x86_binaries.zip
http://cars9.uchicago.edu/software/pub/EPICS_windows-x64_binaries.zip
Users who want to build the source code should be able to obtain ezca,
EzcaScan and ezcaIDL from the CVS snapshots here:
http://www.aps.anl.gov/epics/download/extensions/index.php#cvs
However, it appears they are not being updated, so we will work on
that.
There is a problem with generating the required undecorated function
names in ezcaIDL.dll consistently for 32-bit and 64-bit Windows systems.
For now there will be lots of warning messages when building for 64-bits,
but it does work fine.
I have verified that when using these new versions the problem of mbuf
depletion does not occur because the preemptive callbacks on the client
continue to process the monitor events in the background even when IDL is not
doing anything.
Till added a new “count” argument to ezcaSetMonitor.
I have added that to ezcaIDL.c and ezcaIDL.pro. It defaults to 0,
so existing code will work as it previously did.
I have also changed the way that ezcaIDL.pro finds the shareable
library. This is from ezcaIDL/README. Previously only mechanism 1)
below was supported.
1) First, it looks for the environment variable EZCA_IDL_SHARE.
If this envionment
variable exists it must point to the complete path to ezcaIDL.dll
(Windows) or
libezcaIDL.so (all other architectures). For example:
setenv EZCA_IDL_SHARE /usr/local/lib/libezcaIDL.so
2) If the shareable library is not found from the EZCA_IDL_SHARE
environment variable
then it is searched for in the !IDL_PATH. The file name searched
for is the
following:
base_!VERSION.OS_!VERSION.ARCH
base is either ezcaIDL.dll (Windows) or libezcaIDL.so (all other
architectures).
!VERSION.OS can be "linux", "Win32", etc.
!VERSION.ARCH can be "x86", "x86_64", etc.
Do "help, /structure, !VERSION" in IDL to see the values for
your system.
This mechanism can be more convenient than using EZCA_IDL_SHARE because
it allows
easy switching between 32 and 64 bit versions of IDL, and does not
require setting an
environment variable. It does require either copying and renaming
the files to a
location in the !IDL_PATH, or creating soft links in the !IDL_PATH to
point to the
locations of the actual shareable libraries.
On our system I created soft links in idl_user/epics/, which is in the
!IDL_PATH:
lrwxrwxrwx 1 epics epics 53 Oct 21 14:32
ezcaIDL.dll_Win32_x86 ->
/usr/local/epics/extensions/bin/win32-x86/ezcaIDL.dll
lrwxrwxrwx 1 epics epics 55 Oct 21 14:32
ezcaIDL.dll_Win32_x86_64 ->
/usr/local/epics/extensions/bin/windows-x64/ezcaIDL.dll
lrwxrwxrwx 1 epics epics 55 Oct 21 14:35
libezcaIDL.so_linux_x86 ->
/usr/local/epics/extensions/lib/linux-x86/libezcaIDL.so
lrwxrwxrwx 1 epics epics 58 Oct 21 14:36
libezcaIDL.so_linux_x86_64 ->
/usr/local/epics/extensions/lib/linux-x86_64/libezcaID
I suggest that anyone who is using IDL to communicate with EPICS PVs in
vxWorks IOCs upgrade to the new version of these DLLs.
Cheers,
Mark
Here is a demonstration of the problem with the previous version.
This is a vxWorks IOC before setting the monitor callbacks in IDL.
164.54.160.82 is the IP address of the Linux client where 3 IDL sessions are
running.
ioc13lab> mbufShow
type number
--------- ------
FREE : 788
DATA : 5
HEADER : 7
SOCKET : 0
PCB : 0
RTABLE : 0
HTABLE : 0
ATABLE : 0
SONAME : 0
ZOMBIE : 0
SOOPTS : 0
FTABLE : 0
RIGHTS : 0
IFADDR : 0
CONTROL : 0
OOBDATA : 0
IPMOPTS : 0
IPMADDR : 0
IFMADDR : 0
MRTABLE : 0
TOTAL : 800
number of mbufs: 800
number of times failed to find space: 0
number of times waited for space: 0
number of times drained protocols for space: 0
__________________
CLUSTER POOL TABLE
_______________________________________________________________________________
size clusters
free usage
-------------------------------------------------------------------------------
64
125 117
450510
128
400 400
4990010
256 50
44
589498
512
25
25
6297
1024
25
25
7374
2048 25
25
146726
-------------------------------------------------------------------------------
value = 80 = 0x50 = 'P'
ioc13lab> inetstatShow
Active Internet connections (including servers)
PCB Proto Recv-Q Send-Q Local
Address Foreign Address (state)
-------- ----- ------ ------ ------------------
------------------ -------
1da89d4
TCP 0
0 164.54.160.73.5064
164.54.160.82.40702
1da8950
TCP 0
0 164.54.160.73.5064 164.54.160.82.40680
1da88cc
TCP 0
0 164.54.160.73.5064 164.54.160.82.40674
1da8848
TCP 0
0 164.54.160.73.5064 164.54.160.82.40662
1da87c4
TCP 0
0 164.54.160.73.5064 164.54.160.82.40650
1da8740
TCP 0
0 164.54.160.73.5064 164.54.160.74.1123
1da8428
TCP 0
0 164.54.160.73.5064 164.54.160.130.2328
1da7f84
TCP 0
0 0.0.0.0.5064
0.0.0.0.0
1da7cf0
TCP 0
0 0.0.0.0.111
0.0.0.0.0
1da86bc
UDP 0
0 0.0.0.0.989
0.0.0.0.0
1da8638
UDP 0
0 164.54.160.73.1038
0.0.0.0.0
1da85b4
UDP 0
0 127.0.0.1.1037
127.0.0.1.1031
1da8530
UDP 0
0 127.0.0.1.1036
127.0.0.1.1030
1da84ac
UDP 0
0 0.0.0.0.5065
0.0.0.0.0
1da83a4
UDP 0
0 0.0.0.0.991
0.0.0.0.0
1da7df8
UDP 0
0 0.0.0.0.1034
0.0.0.0.0
1da8320
UDP 0
0 0.0.0.0.1033
0.0.0.0.0
1da829c
UDP 0
0 0.0.0.0.1032
0.0.0.0.0
1da8218
UDP 0 0
0.0.0.0.1031
0.0.0.0.0
1da8194
UDP 0
0 0.0.0.0.1030
0.0.0.0.0
1da8110
UDP 0
0 0.0.0.0.1029
0.0.0.0.0
1da808c
UDP 0
0 0.0.0.0.5064
0.0.0.0.0
1da8008
UDP 0
0 164.54.160.73.1035 164.54.160.255.5065
1da7f00
UDP 0
0 0.0.0.0.1028
0.0.0.0.0
1da7e7c
UDP 0
0
0.0.0.0.1027
0.0.0.0.0
1da7d74
UDP 0
0 0.0.0.0.1025
0.0.0.0.0
1da7ae0 UDP
0 0
0.0.0.0.111
0.0.0.0.0
value = 1 = 0x1
This is the output of netstat on the client Linux machine as it local
TCP buffers fill up:
corvette:~>netstat | grep ioc13lab
tcp 2971200 0
corvette.cars.aps.anl:40674 ioc13lab.cars.aps.anl.:ca-1 ESTABLISHED
tcp
0 0 corvette.cars.aps.anl:40650
ioc13lab.cars.aps.anl.:ca-1 ESTABLISHED
tcp 2947200 0
corvette.cars.aps.anl:40662 ioc13lab.cars.aps.anl.:ca-1 ESTABLISHED
tcp
0 0 corvette.cars.aps.anl:40702
ioc13lab.cars.aps.anl.:ca-1 ESTABLISHED
tcp 2994400 0
corvette.cars.aps.anl:40680 ioc13lab.cars.aps.anl.:ca-1 ESTABLISHED
corvette:~>more /proc/sys/net/ipv4/tcp_rmem
4096 87380 4194304
corvette:~>netstat | grep ioc13lab
tcp 3013800 0
corvette.cars.aps.anl:40674 ioc13lab.cars.aps.anl.:ca-1 ESTABLISHED
tcp
0 0 corvette.cars.aps.anl:40650
ioc13lab.cars.aps.anl.:ca-1 ESTABLISHED
tcp 2989720 0
corvette.cars.aps.anl:40662 ioc13lab.cars.aps.anl.:ca-1 ESTABLISHED
tcp
0 0 corvette.cars.aps.anl:40702
ioc13lab.cars.aps.anl.:ca-1 ESTABLISHED
tcp 3037000 0
corvette.cars.aps.anl:40680 ioc13lab.cars.aps.anl.:ca-1 ESTABLISHED
corvette:~>
corvette:~>netstat | grep ioc13lab
tcp 3419400 0
corvette.cars.aps.anl:40674 ioc13lab.cars.aps.anl.:ca-1 ESTABLISHED
tcp
0 0 corvette.cars.aps.anl:40650
ioc13lab.cars.aps.anl.:ca-1 ESTABLISHED
tcp 3427280 0 corvette.cars.aps.anl:40662
ioc13lab.cars.aps.anl.:ca-1 ESTABLISHED
tcp
0 0 corvette.cars.aps.anl:40702
ioc13lab.cars.aps.anl.:ca-1 ESTABLISHED
tcp 3406800 0
corvette.cars.aps.anl:40680 ioc13lab.cars.aps.anl.:ca-1 ESTABLISHED
corvette:~>netstat | grep ioc13lab
tcp 3419400 0
corvette.cars.aps.anl:40674 ioc13lab.cars.aps.anl.:ca-1 ESTABLISHED
tcp
0 0 corvette.cars.aps.anl:40650
ioc13lab.cars.aps.anl.:ca-1 ESTABLISHED
tcp 3427280 0
corvette.cars.aps.anl:40662 ioc13lab.cars.aps.anl.:ca-1 ESTABLISHED
tcp
0 0 corvette.cars.aps.anl:40702
ioc13lab.cars.aps.anl.:ca-1 ESTABLISHED
tcp 3406800 0
corvette.cars.aps.anl:40680 ioc13lab.cars.aps.anl.:ca-1 ESTABLISHED
Note that at this point 3 connections from corvette to ioc13lab have
filled their receive buffers, and they are no longer getting any larger.
At that point the sendQ buffers on the vxWorks system begin filling up.
ioc13lab> inetstatShow
Active Internet connections (including servers)
PCB Proto Recv-Q Send-Q Local
Address Foreign Address (state)
-------- ----- ------ ------ ------------------
------------------ -------
1da89d4
TCP 0
0 164.54.160.73.5064 164.54.160.82.40702
1da8950
TCP 0 8192 164.54.160.73.5064
164.54.160.82.40680
1da88cc
TCP 0 2600
164.54.160.73.5064 164.54.160.82.40674
1da8848
TCP 0 7760
164.54.160.73.5064 164.54.160.82.40662
1da87c4
TCP 0
0 164.54.160.73.5064 164.54.160.82.40650
1da8740
TCP 0
0 164.54.160.73.5064 164.54.160.74.1123
1da8428
TCP 0
0 164.54.160.73.5064 164.54.160.130.2328
1da7f84
TCP 0
0 0.0.0.0.5064
0.0.0.0.0
1da7cf0
TCP 0 0
0.0.0.0.111
0.0.0.0.0
1da86bc
UDP 0
0 0.0.0.0.989
0.0.0.0.0
1da8638
UDP 0
0 164.54.160.73.1038
0.0.0.0.0
1da85b4
UDP 0
0 127.0.0.1.1037 127.0.0.1.1031
1da8530
UDP 0
0 127.0.0.1.1036
127.0.0.1.1030
1da84ac
UDP 0
0 0.0.0.0.5065
0.0.0.0.0
1da83a4
UDP 0
0 0.0.0.0.991
0.0.0.0.0
1da7df8 UDP 0
0 0.0.0.0.1034
0.0.0.0.0
1da8320
UDP 0
0 0.0.0.0.1033
0.0.0.0.0
1da829c
UDP 0
0 0.0.0.0.1032
0.0.0.0.0
1da8218
UDP 0
0 0.0.0.0.1031 0.0.0.0.0
1da8194
UDP 0
That in turn depletes the mbufs. Note that there are no free
mbufs larger than 256 bytes.
ioc13lab> mbufShow
type number
--------- ------
FREE : 592
DATA : 201
HEADER : 7
SOCKET : 0
PCB : 0
RTABLE : 0
HTABLE : 0
ATABLE : 0
SONAME : 0
ZOMBIE : 0
SOOPTS : 0
FTABLE : 0
RIGHTS : 0
IFADDR : 0
CONTROL : 0
OOBDATA : 0
IPMOPTS : 0
IPMADDR : 0
IFMADDR : 0
MRTABLE : 0
TOTAL : 800
number of mbufs: 800
number of times failed to find space: 0
number of times waited for space: 0
number of times drained protocols for space: 0
__________________
CLUSTER POOL TABLE
_______________________________________________________________________________
size clusters
free usage
-------------------------------------------------------------------------------
64
125 117
450874
128
400 326
5035830
256
50
0
613939
512
25
0
6708
1024
25
0
8549
2048 25
0
147650
-------------------------------------------------------------------------------
value = 80 = 0x50 = 'P'
At that point network access is dead. Save_restore begins
failing, etc.
value = 80 = 0x50 = 'P'
ioc13lab> save_restore:write_it: fprintf returned -1.
[111019-131137]
../save_restore.c(1369):
[0x2f000b]=write_it:S_rpcLib_RPC_CANTDECODEARGS
save_restore:write_it: fclose('auto_settings.sav') returned -1
../save_restore.c(1485):
[0x2f000b]=write_it:S_rpcLib_RPC_CANTDECODEARGS
save_restore:write_it: Giving up on this attempt to write 'auto_settings.sav'.
[111019-131137]
*** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
save_restore:write_save_file: Can't write save file. [111019-131137]
*** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
________________________________________
From: Mark Rivers
Sent: Tuesday, October 18, 2011 6:19 PM
To: [email protected]
Subject: EZCA and ezcaIDL improvements?
Folks,
Andrew Johnson thought he remembered someone at the recent EPICS
collaboration meeting talking about improvements to the EZCA extension to use
3.14 preemptive callbacks. Has someone done some work on this?
Here’s why I want to know:
Thanks to help from Andrew I just discovered that the IDL interface
(and probably other scripting languages like Matlab, etc.) can cause serious
problems with an EPICS IOC. The problem arises when the IDL client sets
monitors on a PV (with caSetMonitor()), but then does not call ca_pend_event()
frequently. ca_pend_event() is called by the “caGet”,
“caPut”, “cacheckMonitor”, etc. functions in IDL.
But if one calls caSetMonitor() and then just sits at the IDL prompt,
ca_pend_event() is not being called. When that happens the CA server is
trying to send monitor events to the client. These initially queue up on
the client side, but eventually they begin to queue up on the server
side. As they queue on a vxWorks server they consume mbufs. I have
found that a single client in this state will consume all of the 256 byte
clusters, and some of the 512 byte clusters. If a few more clients are in
this state then there are no mbufs available above a certain size. When
that happens no new CA connections can be made, other network services fail,
etc.
I think this problem could be avoided if we changed the ezca C library
that IDL calls to use preemptive callbacks, so that periodic calls to
ca_pend_event are no longer needed.
Thanks,
Mark