Hi everyone,
I've got a bit more info about this problem. Or maybe not. Given that we can't really test our 2700s right now, we're trying to ship the system on an MVME6100 instead, and I've been testing it in the lab. This board has given us quite some trouble in the past (mainly due to overheating, it seems), so we're testing it thoroughly.
Along with it, I've written a soft watchdog to probe an idea that I've been toying with. The watchdog operates in a very simple way:
- An RTEMS timer is created, and set to fire after 1 second (this time is arbitrary). If it fires, it will simply inform us that the system is stalling.
- An EPICS record processes every 0.2 seconds, and the associated SNAM resets the timer. If it stops processing, the timer will (eventually) fire.
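For anyone who wants to play with the logic, here is a minimal host-side model of the two steps above. It is only a sketch and not the code running on the board: the soft_wdog_* names are made up for illustration, and time is advanced by hand instead of by an RTEMS timer. On the target, arming and re-arming would presumably map to Classic API directives such as rtems_timer_fire_after and rtems_timer_reset, which the model does not call.

```c
/* Host-side model of the soft watchdog. NOT RTEMS code.
 * All soft_wdog_* names are hypothetical and exist only for illustration. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t timeout_ms;   /* interval the timer is armed with (1000 in the post) */
    uint32_t remaining_ms; /* simulated time left before the timer would fire */
    bool     fired;        /* latched once the timeout elapses without a reset */
} soft_wdog;

/* Arm the watchdog: stands in for creating the timer and starting it. */
void soft_wdog_arm(soft_wdog *w, uint32_t timeout_ms)
{
    assert(w != NULL && timeout_ms > 0);
    w->timeout_ms = timeout_ms;
    w->remaining_ms = timeout_ms;
    w->fired = false;
}

/* Reset the countdown: stands in for what the record's SNAM routine does
 * each time the record processes (every 200 ms in the post). */
void soft_wdog_kick(soft_wdog *w)
{
    assert(w != NULL);
    w->remaining_ms = w->timeout_ms;
}

/* Advance simulated time by elapsed_ms. Returns true once the watchdog
 * has fired, i.e. a full timeout passed with no kick in between. */
bool soft_wdog_advance(soft_wdog *w, uint32_t elapsed_ms)
{
    assert(w != NULL);
    if (w->fired) {
        return true;
    }
    if (elapsed_ms >= w->remaining_ms) {
        w->remaining_ms = 0;
        w->fired = true; /* the real handler would only report the stall */
    } else {
        w->remaining_ms -= elapsed_ms;
    }
    return w->fired;
}
```

As long as soft_wdog_kick is called every 200 ms of simulated time, soft_wdog_advance keeps returning false; once the kicks stop, it returns true after a full 1000 ms has elapsed.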
So far so good. Although the watchdog was intended to test the situation with the 2700, I included it in the build for the 6100, just for future use. About 50% of the time, everything works without (visible) problems. When it doesn't, I observe the following, which at least on the surface seems close to what I saw on the 2700, although I can't confirm that until I have the same set of debugging tools on it:
- The system starts booting and going through the startup script
- At a certain point, while initializing the sequencer tasks (from the seq supp module), the system stops going through the initialization.
- This always happens at a certain, specific transition. I still need to check which one.
- The system stops responding to CA events at this point, which is consistent with the stall at that transition, because the transition depends on an external PV (simulated in this case, but accessed through CA)
- The watchdog is not firing at this point, meaning that RTEMS is still working and, at least, scan0.2 is processing!
- When I get to this point, if I interact with the iocsh console (e.g. I just press Enter), something else locks up, scan0.2 stops processing and the watchdog fires.
- Interestingly, iocsh registers the first Enter (a newline is echoed back), but not subsequent ones.
NB: when discussing this with my team, someone suggested that maybe the whole system was blocking until I hit Enter, not just (parts of) EPICS. I have confirmed that this is not the case by introducing a periodic printf in the function that resets the timer. Indeed, it keeps being called while the sequencer and CA are stalled.
But, and this is a very big but, I had to use printf to put out those messages. Initially I used errlogMessage, and the messages did not show up, even though the watchdog was still not firing all this time. Now, errlogPrintf (called by errlogMessage) does not print to the console unless this has been explicitly enabled using eltc (which is not our case), meaning that the messages are queued in the message buffer to be printed later by errlogThread. But errlogThread is not getting to run...
errlogThread runs as a low-priority task (priority 10, one of the lowest). It won't be given CPU time if other, higher-priority tasks are hogging the CPU, which suggests to me that something is doing exactly that. Most probably this is the problem I'm facing right now, possibly (but I'm not 100% sure) related to the initial one. That something is not high priority enough to preempt scan0.2 (priority 65), though, and certainly not the timer, which is activated from an ISR and thus should run at the highest priority.
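The reasoning above can be written down as a small bounding exercise. The sketch below is only an illustration under some loud assumptions: there is a single CPU hog, scheduling is strictly priority-preemptive without timeslicing, priorities are EPICS-style (a larger number is more urgent), and the hog never blocks. The thread_obs and bound_hog_priority names are hypothetical; only the values 10 (errlogThread, starved) and 65 (scan0.2, still running) come from what I observed.

```c
/* Bound the priority of a suspected CPU hog from per-thread observations.
 * Hypothetical helper for illustration; not part of EPICS or RTEMS. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    int  priority; /* EPICS-style priority: larger number = more urgent */
    bool running;  /* true if the thread was seen to keep getting CPU time */
} thread_obs;

typedef struct {
    int lo; /* hog priority is >= lo (it starves everything at or below lo) */
    int hi; /* hog priority is <  hi (threads at hi still preempt it) */
} prio_bounds;

/* min_prio and max_prio delimit the valid priority range (0 and 99 are
 * assumed here for EPICS). Each starved thread raises the lower bound,
 * each running thread lowers the upper bound. */
prio_bounds bound_hog_priority(const thread_obs *obs, size_t n,
                               int min_prio, int max_prio)
{
    prio_bounds b = { min_prio, max_prio + 1 };
    assert(obs != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (obs[i].running) {
            if (obs[i].priority < b.hi) {
                b.hi = obs[i].priority;
            }
        } else {
            if (obs[i].priority > b.lo) {
                b.lo = obs[i].priority;
            }
        }
    }
    return b;
}
```

With just the two observations so far, this puts the hog somewhere in [10, 65). Every extra thread that can be classified as starved or still running (for instance a heartbeat thread spawned at some intermediate priority, which is an idea and not something I have done yet) narrows that interval.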
I'll keep trying to isolate the priority of the task that is blocking everything else. If things get hairy, I've got a support module that I've been using to measure the system behavior at the RTEMS thread level, and I'll try to merge it in to get some info on what's going on.
In the meantime, if anyone has any ideas, you're welcome to chime in!
Cheers,
Ricardo