Hi Dirk,
On Monday 02 March 2009 06:49:05 Dirk Zimoch wrote:
>
> 1. The error message does not give any hint of the location at all.
>
> 2. The reported location is inside a (EPICS) system function or a macro,
> but the caller who actually caused the error is not reported.
>
> 3. The reported location is inside the source code of a file parser but the
> error location in the faulty file is not reported.
Note that the DBD and DB file parser underwent some major improvements to its
error reporting between R3.14.9 and R3.14.10 so now it does indicate (in
excruciating detail in some cases) the location of any errors in the input
file. If you find further issues like this in R3.14.10 or later, please
report them.
> When looking into epicsMutexOsdLock(), I found that the calling thread is
> suspended via a call to cantProceed() when an error happens. In my opinion,
> this is as well not very helpful because the calling function does not get
> any indication of the failure and thus cannot print a proper error message.
In general cantProceed() was intended to be used when there really is no way
for operations to continue if some particular condition is detected; it is
usually an indication there's a bug in the code. For example, if we run out
of memory when loading a .db file, there's no point returning an error code
to the parser because it isn't going to be able to do anything to recover
from that. By catching such errors early and not returning at all, we save
the caller from having to check for and pass the error condition upwards,
thus reducing code size and complexity in the caller. In C++ we would use an
exception to flag the problem without adding caller complexity, but this
isn't C++ code.
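
To illustrate the "catch it early and never return" pattern described above, here is a minimal sketch. It is not the EPICS implementation: every name beginning with sketch_ is hypothetical, and the halt routine only imitates the suspend-forever behaviour on a POSIX system.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical stand-in for cantProceed(): report the problem, then
 * park the calling thread forever so a human can attach a debugger. */
void sketch_cant_proceed(const char *msg)
{
    fprintf(stderr, "fatal: %s\n", msg);
    for (;;)
        pause();
}

/* Hypothetical allocate-or-halt helper: it either returns usable
 * memory or does not return at all, so callers never see NULL. */
void *sketch_alloc_or_halt(size_t size, const char *what)
{
    void *p = malloc(size);
    if (p == NULL)
        sketch_cant_proceed(what);
    return p;
}

/* A caller needs no error branch and no status to pass upwards. */
char *sketch_copy_name(const char *name)
{
    size_t n = strlen(name) + 1;
    char *copy = sketch_alloc_or_halt(n, "sketch_copy_name: out of memory");
    memcpy(copy, name, n);
    return copy;
}
```

The saving is in sketch_copy_name(): it has no failure path of its own, which is the reduction in caller complexity mentioned above.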
We should never be calling cantProceed() from any code path that is visited
during normal operation, thus the fact that it is called implies that there
is a bug somewhere which needs to be fixed.
In your particular case I believe the POSIX implementation of the routine
epicsMutexOsdLock() is being passed a pointer to something that is no longer
an epicsMutexOSD object, causing the pthread_mutex_lock() call to return
EINVAL ("Invalid argument"). This indicates there's code somewhere which is
using an object after it has been destroyed. I don't know why
epicsMutexOsdLock() doesn't just return epicsMutexLockError in this case
though; it already does that if you pass in a NULL for pmutex. I will change
that, although it won't fix the underlying problem and may result in some
even less desirable symptoms appearing if/when it occurs.
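
The change described above, returning an error status instead of suspending, might look roughly like the following sketch. The type and enum names are hypothetical and the real epicsMutexOSD layout is not reproduced. Note also that POSIX leaves locking a destroyed mutex undefined, so an EINVAL return in that situation is not guaranteed.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical status codes, loosely modelled on the ones named above. */
typedef enum { sketchLockOK, sketchLockError } sketchLockStatus;

/* Hypothetical wrapper object around a pthread mutex. */
typedef struct {
    pthread_mutex_t lock;
} sketchMutex;

/* Report every failure to the caller instead of suspending the thread. */
sketchLockStatus sketch_mutex_lock(sketchMutex *pmutex)
{
    int status;

    if (pmutex == NULL)
        return sketchLockError;

    status = pthread_mutex_lock(&pmutex->lock);

    /* Any non-zero status (EINVAL, for example) becomes an error return. */
    return status == 0 ? sketchLockOK : sketchLockError;
}
```

The caller now has to check the result, which is exactly the burden the suspending behaviour was meant to avoid.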
> In my opinion, cantProceed() does more harm than good as long as it is not
> able to provide useful debug information (at least a stack trace) -- which
> is probably hard to do in a portable way. I think, a call to abort() would
> be more helpful as it produces a process dump (at least on Unix systems)
> which can be analyzed.
Calling abort() on vxWorks destroys all information about the problem location
since there is no equivalent of a core dump file (at least on vxWorks 5.x).
The cantProceed() behavior of suspending the calling thread makes debugging
possible on most (all?) architectures by allowing a human to request a
backtrace of the relevant thread. On Unix-like systems you can do that by
attaching a system debugger (gdb --pid=<pid> on Linux) to the frozen process.
EPICS also provides a fully-documented mechanism (in the AppDevGuide Section
16.3) called the task watchdog that monitors tasks that register themselves
with it and can execute callback functions when any such tasks are suspended.
Most EPICS threads register themselves on start-up and un-register again when
they shut down. If a program wants to recover from a cantProceed() error by
generating an abort(), it can register a task watchdog callback to do just
that.
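
The register-and-callback idea can be sketched without EPICS as below. This is a toy model only: the real interface is declared in taskwd.h and documented in the AppDevGuide, and none of the sketch_ names here come from it. In particular, the int flag used to mark a task as suspended is an assumption made for illustration.

```c
#include <stddef.h>
#include <stdlib.h>

#define SKETCH_MAX_TASKS 8

typedef void (*sketchWdCallback)(void *usr);

typedef struct {
    int in_use;
    const int *suspended;       /* non-zero once the task is parked */
    sketchWdCallback callback;  /* called once when that is noticed */
    void *usr;
    int reported;
} sketchWdEntry;

static sketchWdEntry sketch_tasks[SKETCH_MAX_TASKS];

/* Register a task; returns its slot number, or -1 if the table is full. */
int sketch_wd_insert(const int *suspended, sketchWdCallback cb, void *usr)
{
    int i;
    for (i = 0; i < SKETCH_MAX_TASKS; i++) {
        sketchWdEntry *t = &sketch_tasks[i];
        if (!t->in_use) {
            t->in_use = 1;
            t->suspended = suspended;
            t->callback = cb;
            t->usr = usr;
            t->reported = 0;
            return i;
        }
    }
    return -1;
}

/* Un-register a task when it shuts down cleanly. */
void sketch_wd_remove(int slot)
{
    if (slot >= 0 && slot < SKETCH_MAX_TASKS)
        sketch_tasks[slot].in_use = 0;
}

/* One polling pass; returns how many callbacks were fired. */
int sketch_wd_poll(void)
{
    int i, fired = 0;
    for (i = 0; i < SKETCH_MAX_TASKS; i++) {
        sketchWdEntry *t = &sketch_tasks[i];
        if (t->in_use && *t->suspended && !t->reported) {
            t->reported = 1;
            if (t->callback)
                t->callback(t->usr);
            fired++;
        }
    }
    return fired;
}

/* Callback a program could register to turn a suspension into abort(). */
void sketch_abort_on_suspend(void *usr)
{
    (void)usr;
    abort();
}

/* Harmless callback for trying the sketch out: counts notifications. */
void sketch_count_suspend(void *usr)
{
    ++*(int *)usr;
}
```

Registering sketch_abort_on_suspend instead of the counting callback gives the abort-on-suspend policy while leaving the choice with the application.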
> Generally, error handling should not be part of the low-level functions.
> These functions do not have any application knowledge and thus cannot
> decide if suspending the thread, terminating the program, printing an error
> message or continuing normally is the "correct" behavior. Instead,
> low-level functions should only report errors (by exceptions or return
> values -- I am not religious in this matter) and leave the choice of the
> correct response to the caller.
In general I agree with you, but sometimes the original API was designed
without the possibility of reporting errors to the caller (say no errors
could ever occur when the API was first introduced) and it's now called from
many other pieces of code outside of Base. Just adding an error status
return value to a function that used to return void doesn't help because the
unmodified callers will completely ignore the new return value and the
compiler won't flag that. We do use exceptions in C++ code, but obviously
can't in C APIs.
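
A small sketch of why adding a status return to a formerly void function does not help existing callers. The function names are hypothetical and not taken from Base.

```c
static int sketch_level;

/* Originally declared as: void sketch_set_level(int level);
 * The revised version returns 0 on success and -1 on a bad argument. */
int sketch_set_level(int level)
{
    if (level < 0)
        return -1;
    sketch_level = level;
    return 0;
}

int sketch_get_level(void)
{
    return sketch_level;
}

/* Unmodified legacy caller: it still compiles exactly as before, and
 * the new status is silently discarded, so the failure goes unnoticed. */
void sketch_legacy_caller(void)
{
    sketch_set_level(-5);
}
```

Some compilers offer extensions that warn about ignored results (GCC's warn_unused_result attribute, for example), but that is compiler-specific and not something a portable C API can rely on.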
> In this case, the gateway (maybe the CAS library) stopped accepting new
> clients which made it effectively useless. On the other hand, it did not
> terminate. Thus any mechanism to restart the gateway on failure does not
> work, too. I do not consider this a "useful" behavior.
The gateway should register a task watchdog callback to make sure that it can
be restarted in this circumstance; it probably needs to register the main
thread to be monitored as well since it isn't by default. However, I'm hoping
that Jeff has fixed or will fix your underlying problem too.
> It may be a goal of the next codathlon to improve the error messages
> provided by EPICS.
The list of people attending is available on the website; feel free to
encourage someone to work on this issue, which I've added to my list of
tasks.
Thanks,
- Andrew
--
The best FOSS code is written to be read by other humans -- Harald Welte
- References:
- Useless error messages in EPICS Dirk Zimoch