The problem appears between commits 4ab98081 and 56f05d72. (The two commits in between do not produce stable code.)
On Mon, 2022-06-27 at 09:44 +0000, Zimoch Dirk (PSI) via Core-talk wrote:
> 7.0.5 does not show this behavior...
>
> On Mon, 2022-06-27 at 10:09 +0200, Zimoch Dirk wrote:
> > Hi everyone,
> >
> > Thanks Mark for testing it with AD. The linear ramp is probably similar to my "sequence" test device support.
> > The client I used is caqtdm.
> >
> > As Mark already noticed, the problem does not appear with a client running on the same host. I guess this is
> > because the host short-circuits TCP traffic to itself, so the bandwidth is much higher than what we can achieve
> > between two different hosts.
> >
> > Running the IOC with RT priorities does not change anything.
> >
> > I had not counted the monitors, so I did not notice that the IOC sends exactly n+1 frames. The behavior is
> > consistent with putting a pointer to the array data (at an unchanging location) into a queue. If the data changes
> > faster than it can be sent, the next element in the queue points not to the next frame but to a later one, so some
> > frames are missed. After data production ends, all remaining pointers in the queue point to the unchanging last
> > frame, until some other client makes the array produce new data. At that point the original client receives those
> > updates while working through the backlog. Of course this behavior makes no sense at all. Either put the whole
> > data into the queue or do not queue the pointer at all.
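
The queueing behavior described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not the CA server code: the function `simulate`, its parameters, and the one-element list standing in for the waveform buffer are all hypothetical. It compares queueing a reference to a buffer that is overwritten in place with queueing a snapshot of each frame.

```python
from collections import deque


def simulate(frames_produced, send_every, queue_copies):
    """Return the frame numbers a slow client would receive (toy model).

    frames_produced: number of frames the record generates before STOP
    send_every:      the sender drains one queue entry per this many frames
    queue_copies:    True queues a snapshot, False queues a reference
    """
    buffer = [0]          # stands in for the waveform's fixed data buffer
    queue = deque()
    received = []

    for frame in range(1, frames_produced + 1):
        buffer[0] = frame                          # record overwrites data in place
        queue.append(list(buffer) if queue_copies else buffer)
        if frame % send_every == 0:                # sender is slower than the record
            received.append(queue.popleft()[0])    # value is read at send time

    # Production has stopped; the sender works through the backlog.
    while queue:
        received.append(queue.popleft()[0])
    return received


by_reference = simulate(frames_produced=9, send_every=3, queue_copies=False)
by_copy = simulate(frames_produced=9, send_every=3, queue_copies=True)
print(by_reference)
print(by_copy)
```

In this model the by-reference queue skips frames while data is being produced and then repeats the last frame until the backlog is empty, which matches the symptom reported in this thread. Queueing copies delivers every frame, but for 40 MB arrays that costs memory and still leaves a long backlog.
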
> >
> > Using PVA, I do not see this effect. The network still cannot keep up with the rate at which the data is
> > produced, so frames are lost. But pressing the STOP button takes effect almost immediately. (Maybe after one or
> > two more updates, but that can be latency in the client or network.) Is there a debug variable like CASDEBUG for
> > PVA that would let me see what PVA is sending?
> >
> > I can try to find out which commit changed the behavior in CA ...
> >
> > Dirk
> >
> >
> > On Sat, 2022-06-25 at 09:03 +0000, Zimoch Dirk (PSI) via Core-talk wrote:
> > > Hi Andrew,
> > >
> > > The cameras run on Windows. I did my test on Linux but not as root, thus I had no RT scheduling. I will repeat the
> > > test on Monday running as root.
> > >
> > > The STOP message gets processed in time! A client that does not monitor the array sees the change immediately. The
> > > counter stops. But CA keeps sending!
> > >
> > > I had expected the IOC to drop frames if CA cannot send fast enough, not to spend minutes working through a
> > > pile of unsent frames, and then not even send updates but simply repeat the last frame.
> > >
> > > Dirk
> > >
> > > > On 24.06.2022 at 18:10, Andrew Johnson via Core-talk <core-talk at aps.anl.gov> wrote:
> > > >
> > > > Hi Dirk,
> > > >
> > > > What OS is the IOC running on? I'm guessing Linux, but you didn't say. If so, is it built for and using priority
> > > > thread scheduling? If the OSSPRI field from epicsThreadShowAll is all zeros it isn't, and enabling that might
> > > > help. The normal Linux scheduler tends to maximize throughput, not fairness, so it could be delaying the
> > > > threads which process your STOP message while the threads handling image data continue to make progress.
> > > > However, this is just a guess.
> > > >
> > > > - Andrew
> > > >
> > > >
> > > > General musings: The setpriority(2) manpage on RHEL-7 says:
> > > > > BUGS
> > > > > According to POSIX, the nice value is a per-process setting. However, under
> > > > > the current Linux/NPTL implementation of POSIX threads, the nice value is a
> > > > > per-thread attribute: different threads in the same process can have different
> > > > > nice values. Portable applications should avoid relying on the Linux behavior,
> > > > > which may be made standards conformant in the future.
> > > >
> > > >
> > > > I wonder whether we should look at setting nice values for Linux threads when the process doesn't have the
> > > > ability to use SCHED_FIFO?
> > > >
> > > >
> > > >
> > > > On 6/24/22 10:39 AM, Zimoch Dirk (PSI) via Core-talk wrote:
> > > > > Hi folks,
> > > > >
> > > > > Some of our users complained that a camera server became less responsive after it was upgraded from EPICS
> > > > > 3.14.12.6 to 7.0.6.1.
> > > > >
> > > > > The camera sends image data as arrays of 20000000 SHORTs (5000x4000 pixels). When the user presses the "STOP"
> > > > > button on the client which displays the image, it takes a long time to stop. The more active clients, the
> > > > > longer it takes. But even sending stop from a different client (e.g. command line caput) takes a long time
> > > > > before the GUI clients update.
> > > > >
> > > > > I have set up a simple simulation and run it with 'var CASDEBUG 3'.
> > > > > Here is what I see on EPICS 7.0.6.1:
> > > > > CAS: Sending a message of 40000032 bytes
> > > > > CAS: Sending a message of 40000032 bytes
> > > > > CAS: Sending a message of 40000032 bytes
> > > > > CAS: Sending a message of 40000032 bytes
> > > > > CAS: Sending a message of 40000032 bytes
> > > > > CAS: Sending a message of 40000032 bytes
> > > > > CAS: Sending a message of 40000032 bytes
> > > > > CAS: TCP Request from 129.129.130.117:47142 => cmmd=4 (CA_PROTO_WRITE) cid=0x4 type=0 count=1 postsize=8 version=13
> > > > > CAS: Request from 129.129.130.117:47142 => available=0x2 N=1 paddr=0x7efcb800db80
> > > > > CAS: Request from 129.129.130.117:47142 => Wrote string "STOP"
> > > > > CAS: Sending a message of 40000032 bytes
> > > > > CAS: Sending a message of 40000032 bytes
> > > > > [>80 times the same!]
> > > > > CAS: Sending a message of 40000056 bytes <---- I think this one contains the update of the STOP button
> > > > > CAS: Sending a message of 40000032 bytes
> > > > > [eventually stops many seconds later]
> > > > >
> > > > > The IOC obviously gets the STOP message immediately when I press the button on the client. But the client (and
> > > > > any other client showing the image) does not see the button change. The GUI appears "frozen". But a command
> > > > > line camonitor monitoring the stop button (and a counter that counts the number of created images, but not
> > > > > the image itself) shows that the records stop immediately.
> > > > > Nevertheless the IOC keeps sending images. But the images do not change any more on the clients. So it seems
> > > > > that the IOC keeps sending the same array data over and over again.
> > > > >
> > > > > On 3.14.12, the output looks similar, but the "send after stop" consists of only a few messages:
> > > > > CAS: Request from 129.129.130.117:47184 => cmmd=4 cid=0x1 type=0 count=1 postsize=8
> > > > > CAS: Request from 129.129.130.117:47184 => available=0x2 N=1 paddr=0x7f0768010b28
> > > > > CAS: Request from 129.129.130.117:47184 => Wrote string "STOP"
> > > > > CAS: Sending a message of 40000032 bytes
> > > > > CAS: Sending a message of 40000032 bytes
> > > > > CAS: Sending a message of 40000032 bytes
> > > > > CAS: Sending a message of 40000032 bytes
> > > > > CAS: Sending a message of 40000056 bytes <---- update of the STOP button
> > > > >
> > > > > What can be wrong here?
> > > > > The IOC consists of a counting calc, a bo for the stop switch and a waveform record with a driver that simply
> > > > > fills the waveform with a sequence starting at the counter value. Nothing fancy.
> > > > >
> > > > > Here is my db:
> > > > >
> > > > > record(waveform, "DZ:BIGARRAY")
> > > > > {
> > > > >     field(FTVL, "SHORT")
> > > > >     field(NELM, "20000000")
> > > > >     field(DTYP, "sequence")
> > > > >     field(SCAN, ".1 second")
> > > > >     field(SDIS, "DZ:STOP")
> > > > >     field(INP, "DZ:COUNT")
> > > > >     field(FLNK, "DZ:COUNT")
> > > > > }
> > > > >
> > > > > record(calc, "DZ:COUNT")
> > > > > {
> > > > >     field(CALC, "VAL+1")
> > > > > }
> > > > >
> > > > > record(bo, "DZ:STOP")
> > > > > {
> > > > >     field(ZNAM, "GO")
> > > > >     field(ONAM, "STOP")
> > > > > }
> > > > >
> > > > > I suspect this happens when the record produces new waveforms faster than they can be sent.
> > > > > The IOC has no problem processing the waveform at 10 Hz, but I see only about 3 CAS messages per second.
> > > > > I had to slow down the waveform processing to ".5 second" to improve responsiveness. At that rate the monitor
> > > > > updates can be sent as quickly as they are produced. But opening a second client again spoils everything.
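
The numbers quoted in this message can be checked with a little arithmetic. The sketch below uses only figures from the log and the db above; the variable names are made up for illustration, and the reading of roughly 1 Gbit/s as a saturated gigabit link is an assumption, since the thread does not state the link speed.

```python
NELM = 20_000_000            # elements per waveform (field NELM in the db)
BYTES_PER_SHORT = 2          # FTVL is SHORT
MESSAGE_BYTES = 40_000_032   # size reported by "CAS: Sending a message of ..."

payload_bytes = NELM * BYTES_PER_SHORT
overhead_bytes = MESSAGE_BYTES - payload_bytes   # everything that is not array data

PRODUCE_HZ = 10              # SCAN ".1 second"
SEND_HZ = 3                  # "about 3 CAS messages per second"

# Throughput implied by the observed send rate, for one client.
throughput_bits_per_s = MESSAGE_BYTES * SEND_HZ * 8

# Time to drain a backlog of 80 queued messages at the observed send rate.
BACKLOG_MESSAGES = 80        # ">80 times the same" in the log
drain_seconds = BACKLOG_MESSAGES / SEND_HZ

print(payload_bytes, overhead_bytes)
print(throughput_bits_per_s / 1e9, "Gbit/s")
print(round(drain_seconds, 1), "s")
```

At about 0.96 Gbit/s for a single client, the sender is already near what a gigabit link could carry (if that is what is in use), so a 10 Hz record outruns it by more than a factor of three, and a backlog of 80 messages takes roughly 27 seconds to clear.
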
> > > > >
> > > > > Dirk
> > > > >
> > > > >
> > > >
> > > >