Hi Oleg,
Remote diagnosis of an unknown system is always more of a
guessing game than anything else.
So, first and most important suggestion: refer to a local
expert.
Nevertheless, some thoughts:
Statistically, many if not most weird errors on the IOC are
caused by memory corruption.
In your case, the thread suspensions happen when the CA
server on the IOC calls db_event_enable (line 477) or
db_event_disable (line 493), and the attempt to acquire the
monitor lock fails with an error.
The routines db_event_enable /
db_event_disable are called from within the CA server when
access rights change for a record or when a client sets up /
cancels a monitor.
Were there access rights changes happening on the IOC at
07-Apr 19:59:10 and 08-Apr 08:11:10 (at the "line 493"
events)?
Some "line 477" thread suspensions happen at intervals of
a few minutes. That could match a client repeatedly getting
ungracefully disconnected (because the server-side thread
is suspended) and then reconnecting, which provokes another
attempt to take an invalid monitor lock and another
disconnect.
The semaphore locking code is used everywhere, all the
time, all over EPICS Base. Not an obvious candidate for a bug.
So ... I think what you see may be consistent with memory
corruption that affects at least one record (i.e. the pointer
to its monitor lock semaphore) or the memory area where the
semaphore structures have been allocated.
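To illustrate the idea (this is a toy sketch, not EPICS or
OS code): assume a lock object carries a validity marker that
is checked on every take, and that a stray write from a
neighbouring buffer overwrites it. All names here (toy_lock,
toy_record, LOCK_MAGIC, ...) are made up for the example. The
locking code itself is correct, yet the take fails with an
error after the corruption.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical validity marker; real semaphore layouts differ per OS. */
#define LOCK_MAGIC 0x4C4F434Bu

/* Toy stand-ins, NOT the EPICS or operating system structures. */
struct toy_lock {
    unsigned magic;   /* set at creation, checked on every take */
    int      held;
};

struct toy_record {
    char            name[8];  /* a buffer that sits next to the lock */
    struct toy_lock mlok;     /* stands in for the monitor lock */
};

static void toy_lock_init(struct toy_lock *l)
{
    assert(l != NULL);
    l->magic = LOCK_MAGIC;
    l->held = 0;
}

/* Returns 0 on success, -1 if the lock object does not look valid. */
static int toy_lock_take(struct toy_lock *l)
{
    if (l->magic != LOCK_MAGIC)
        return -1;
    l->held = 1;
    return 0;
}

static void toy_lock_give(struct toy_lock *l)
{
    l->held = 0;
}

/* Simulates a stray write: fills the first len bytes of the record,
 * clamped to the record size, so it runs past name[] into mlok. */
static void toy_stray_write(struct toy_record *r, size_t len)
{
    if (len > sizeof *r)
        len = sizeof *r;
    memset(r, 'X', len);
}

/* Returns 1 if the take worked before and failed after the corruption. */
int toy_demo(void)
{
    struct toy_record rec;
    int before, after;

    memset(&rec, 0, sizeof rec);
    toy_lock_init(&rec.mlok);

    before = toy_lock_take(&rec.mlok);
    toy_lock_give(&rec.mlok);

    toy_stray_write(&rec, sizeof rec);   /* clobbers the marker */
    after = toy_lock_take(&rec.mlok);

    printf("take before: %d, take after: %d\n", before, after);
    return before == 0 && after == -1;
}
```

Calling toy_demo() from a small main() shows the take
succeeding once and then failing, although nothing in the
take/give routines has changed.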
Too bad that the error messages don't show the record
involved. That would give valuable information.
Memory corruption issues (if that is what this is) are not
easy to track down; strategies and tools depend on the
operating system. Which brings me back to my first and most
important suggestion: refer to a local expert.
Cheers,
~Ralph