Azure GUIX thread stops running after several hours of normal operation.

Mohamed Shehata 0 Reputation points
2023-01-13T01:51:30.2133333+00:00

Hello!

I'm working on a system that uses Azure RTOS and GUIX on an STM32H743 MCU. We've been trying to track down and debug this issue for a few weeks now. Everything runs normally after startup, but at some point (the time between startup and failure has not been consistent in testing) the GUI freezes and, on further inspection, we've found that the GUIX thread has stopped running.

Some information on the system config:

TX_PORT_USE_BASEPRI, TX_PORT_BASEPRI=5,

TX_TIMER_TICKS_PER_SECOND=1000,

TX_TIMER_PROCESS_IN_ISR,

TX_ENABLE_STACK_CHECKING,

TX_DISABLE_PREEMPTION_THRESHOLD,

TX_ENABLE_EVENT_TRACE.

GUIX is set up with the default configuration.
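For readers less familiar with these options, here is a minimal sketch that restates the list above as preprocessor definitions and works out the tick arithmetic they imply. It assumes the options are supplied as plain defines (in a user configuration header or on the compiler command line). `ms_to_ticks` is a hypothetical helper written for illustration and is not part of ThreadX.

```c
#include <assert.h>
#include <stdint.h>

/* Build options from the post, written as preprocessor definitions. */
#define TX_PORT_USE_BASEPRI
#define TX_PORT_BASEPRI                 5
#define TX_TIMER_TICKS_PER_SECOND       1000
#define TX_TIMER_PROCESS_IN_ISR
#define TX_ENABLE_STACK_CHECKING
#define TX_DISABLE_PREEMPTION_THRESHOLD
#define TX_ENABLE_EVENT_TRACE

/* At 1000 ticks per second, one tick is exactly one millisecond. */
static_assert(TX_TIMER_TICKS_PER_SECOND == 1000, "sketch assumes a 1 ms tick");

/* Hypothetical helper (not a ThreadX API): convert milliseconds to timer
   ticks, rounding up so a nonzero delay never becomes zero ticks. */
static uint32_t ms_to_ticks(uint32_t ms)
{
    return (uint32_t)(((uint64_t)ms * TX_TIMER_TICKS_PER_SECOND + 999u) / 1000u);
}
```

With this tick rate, any timer value read in the debugger can be interpreted directly in milliseconds.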

As stated, we've been trying to crack this for a while now, so we've already tried quite a few things to help diagnose it. I'll do my best to outline as much of that data as I can here:

  • We found an extremely similar question here, and we have confirmed that the root cause found there does not apply to us.
  • We've disabled all but two threads for the sake of testing: the GUIX thread itself and the main application thread, which sends events to the GUIX thread. We've also tried running the GUIX thread at both a higher and a lower priority than the application thread.
  • We've gone through our ISRs and added TX_INTERRUPT_SAVE_AREA to each one except the fault handlers, which are infinite loops anyway.
  • We've run the code with debugger breakpoints at the end of each thread to ensure they aren't terminating.

Unfortunately, none of our tests so far has changed the behavior, apart from seemingly random differences in how long the system runs smoothly before failing. I've attached a couple of TraceX screenshots of trace buffer dumps below: one showing the system during normal operation (immediately after the application thread runs), and one taken after the failure has occurred.

We've also confirmed that the events we send to the GUIX thread are being generated correctly, and that the event queue is full (0x30) in the fault state, once the GUIX thread has stopped popping events off it.

The only unexplained piece of evidence we have so far is that we never see the gx system timer decrementing from its re-init value. We've looked at it during the failure state, at random moments during normal operation, and at breakpoints where we expected it to be about to expire. The application thread's timer behaves as expected, with its value changing every time we stopped to check it.
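The full-queue observation is what one would expect when the producer keeps running while the consumer is stalled. Below is a minimal host-side model of that behaviour, assuming a fixed-capacity queue of 0x30 entries as reported above. The type and function names are hypothetical, and this is not GUIX's queue implementation.

```c
#include <stdbool.h>
#include <stdint.h>

#define QUEUE_CAPACITY 0x30u   /* 48 entries: the "full" count from the post */

typedef struct {
    uint32_t slots[QUEUE_CAPACITY];
    uint32_t head;    /* index of the oldest entry */
    uint32_t count;   /* entries currently queued */
} event_queue_t;

static void queue_init(event_queue_t *q)
{
    q->head = 0;
    q->count = 0;
}

/* Producer side: returns false when the queue is full and the event is rejected. */
static bool queue_send(event_queue_t *q, uint32_t event)
{
    if (q->count == QUEUE_CAPACITY) {
        return false;
    }
    q->slots[(q->head + q->count) % QUEUE_CAPACITY] = event;
    q->count++;
    return true;
}

/* Consumer side: returns false when there is nothing to pop. */
static bool queue_receive(event_queue_t *q, uint32_t *event)
{
    if (q->count == 0) {
        return false;
    }
    *event = q->slots[q->head];
    q->head = (q->head + 1) % QUEUE_CAPACITY;
    q->count--;
    return true;
}

/* Send n events with no consumer running; returns how many were rejected. */
static uint32_t send_without_consumer(event_queue_t *q, uint32_t n)
{
    uint32_t rejected = 0;
    for (uint32_t i = 0; i < n; i++) {
        if (!queue_send(q, i)) {
            rejected++;
        }
    }
    return rejected;
}
```

In this model the count saturates at 0x30 and stays there until the consumer pops an entry, so a full queue by itself only confirms that the consumer has stalled; it does not say why.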

At this point, we're running out of theories as to the possible causes, and running out of ideas on how to further test the system for more information. Any insight that you may have would be most welcome, and I'll be happy to provide any further information you may need.

Thanks

normalOperationTrace.png

faultStateTrace.png

Azure RTOS

1 answer

  1. Ken Maxwell 706 Reputation points Microsoft Employee
    2023-01-24T20:30:30.1066667+00:00

    @Mohamed Shehata, is this issue resolved? We had a very similar thread come in on our GitHub log, and I'm not sure whether this is a duplicate. In that case, I believe the issue turned out to be a different setting for BASEPRI in the C vs. assembler settings. For the assembler, I believe you need to pass the value on the command line. Let me know if this issue is still open for you.
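To illustrate why a BASEPRI mismatch between the C and assembler builds could matter, here is a small host-side sketch of the Cortex-M masking rule. It compares priorities as plain levels and ignores how the priority bits are positioned in the register. The function names are hypothetical, and the answer does not say which values were involved in the other report, so the numbers in the usage note are purely illustrative.

```c
#include <stdbool.h>
#include <stdint.h>

/* ARMv7-M rule: BASEPRI == 0 masks nothing; otherwise an interrupt is
   blocked when its priority value is numerically >= BASEPRI. */
static bool irq_is_masked(uint8_t basepri, uint8_t irq_priority)
{
    return basepri != 0u && irq_priority >= basepri;
}

/* True when code built with two different BASEPRI values disagrees about
   whether the given interrupt is blocked inside a critical section. */
static bool masking_disagrees(uint8_t basepri_c, uint8_t basepri_asm,
                              uint8_t irq_priority)
{
    return irq_is_masked(basepri_c, irq_priority) !=
           irq_is_masked(basepri_asm, irq_priority);
}
```

For example, `masking_disagrees(5, 8, 6)` is true: a priority-6 interrupt is blocked by critical sections built with BASEPRI 5 but can still fire inside sections built with BASEPRI 8, so the two halves of the kernel would no longer protect the same data against the same interrupts.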
