Navaneetha
Posts: 11
Joined: Mon Apr 05, 2021 9:31 am

Pi-Zero Random Hang issue

Mon Apr 05, 2021 9:55 am

Issue: Pi-Zero hangs randomly for 1-1.5 hours.
System: I have close to 40 Pi-Zeros. Each one reads a temperature and sends the data to a cloud server, and runs 24x7. However, some of the Pi-Zeros hang for 1-1.5 hours and then resume. The issue is random and happens once every 2-5 days. It has been observed on almost 80% of the Pi-Zeros.
Observations:
(1) We have a simple script that checks the Wi-Fi. It runs from crontab every minute and logs to a file. During the hang period there are no log entries.
(2) A simple C application runs on the Pi-Zero and sends the data to the cloud. The data is absent during the hang period.
(3) During the hang period we cannot SSH into the device.
(4) We can safely rule out memory issues, because the Pi-Zero has a script that restarts the OS once a day.
(5) The problem is random in nature and does not appear periodically.
(6) The duration is always between 1 and 1.5 hours, and the device resumes automatically. I strongly suspect it is related to an OS service or OS operation.
(7) We logged the temperature on a few of the devices; it was typically around 50 degrees Celsius.
(8) We have a reset log that captures when the device reboots. The Pi-Zero recovers after the hang period without a reboot. The OS or processor went into a hang and then recovered.

Any suggestions or help would be greatly appreciated.

MiscBits
Posts: 149
Joined: Wed Jan 27, 2021 12:48 pm

Re: Pi-Zero Random Hang issue

Mon Apr 05, 2021 2:01 pm

I would add a little script to output the date / time to a file and see if that works during the hang - do not get it to do anything else.

If you have a new O/S image, try running the date / time script with no other software running to see if it's a hardware issue or a code issue to start with.
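
A minimal sketch of such a date/time logger, assuming Python 3.7 or later is available on the image. The file name `heartbeat.log`, the 60-second interval, and both function names are hypothetical choices, not something from this thread. Recording `time.monotonic()` next to the wall-clock time is an extra assumption on my part: it may help tell a jump of the system clock apart from a real stall.

```python
#!/usr/bin/env python3
"""Heartbeat sketch: append one timestamp per call, then look for gaps."""
import os
import time
from datetime import datetime, timedelta

LOG_PATH = "heartbeat.log"   # hypothetical file name
INTERVAL_S = 60              # hypothetical interval between heartbeats


def write_heartbeat(path=LOG_PATH, now=None):
    """Append 'wall-clock monotonic-seconds' as one line and force it to disk."""
    now = now or datetime.now()
    line = "%s %.1f\n" % (now.isoformat(timespec="seconds"), time.monotonic())
    with open(path, "a") as f:
        f.write(line)
        f.flush()
        os.fsync(f.fileno())  # so the last entry before a hang is not lost


def find_gaps(lines, max_gap=timedelta(seconds=2 * INTERVAL_S)):
    """Return (previous, next) wall-clock pairs that are further apart than max_gap."""
    stamps = [datetime.fromisoformat(s.split()[0]) for s in lines if s.strip()]
    return [(a, b) for a, b in zip(stamps, stamps[1:]) if b - a > max_gap]
```

Calling `write_heartbeat()` once a minute (from a loop with `time.sleep(INTERVAL_S)`, or from crontab) and running `find_gaps()` over the file afterwards should show whether the heartbeat itself stops during a hang.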

neilgl
Posts: 3032
Joined: Sun Jan 26, 2014 8:36 pm
Location: Near The National Museum of Computing

Re: Pi-Zero Random Hang issue

Mon Apr 05, 2021 4:49 pm

Are they actually Pi Zero devices or Pi Zero W devices?

Navaneetha
Posts: 11
Joined: Mon Apr 05, 2021 9:31 am

Re: Pi-Zero Random Hang issue

Mon Apr 05, 2021 5:17 pm

It's the Pi Zero with wireless and Bluetooth.

thagrol
Posts: 4653
Joined: Fri Jan 13, 2012 4:41 pm
Location: Darkest Somerset, UK

Re: Pi-Zero Random Hang issue

Mon Apr 05, 2021 5:31 pm

You mention your zeros are sending data to the cloud. By "cloud" do you mean a server somewhere on the internet or a server on the same network as the zeros?

Does the end of the hang match the scheduled reboot time?

Is it possible the problem is external to the zeros? Is their network or internet connection going down?
I'm a volunteer. Take me for granted or abuse my support and I will walk away

All advice given is based on my experience. it worked for me, it may not work for you.
Need help? https://github.com/thagrol/Guides

Navaneetha
Posts: 11
Joined: Mon Apr 05, 2021 9:31 am

Re: Pi-Zero Random Hang issue

Tue Apr 06, 2021 4:59 am

It is a server on the internet.
The device doesn't reboot; I checked the logs. It just freezes for about an hour and then comes back to life.
I can safely rule out a network issue, because the devices are deployed in different locations, and a simple script in crontab runs every minute and writes a log entry. During the freeze period those log entries are missing.

To some extent I can also rule out a hardware issue, since a hardware fault would not consistently last one hour.
There is a pattern of about one hour that I can't explain. Network and hardware issues would be random in nature.

Navaneetha
Posts: 11
Joined: Mon Apr 05, 2021 9:31 am

Re: Pi-Zero Random Hang issue

Tue Apr 06, 2021 7:23 am

MiscBits wrote:
Mon Apr 05, 2021 2:01 pm
I would add a little script to output the date / time to a file and see if that works during the hang - do not get it to do anything else.

If you have a new O/S image, try running the date / time script with no other software running to see if it's a hardware issue or a code issue to start with.

Will try your suggestion. However, the issue occurs randomly, anywhere from 2-3 times a day to once in 15 days.

thagrol
Posts: 4653
Joined: Fri Jan 13, 2012 4:41 pm
Location: Darkest Somerset, UK

Re: Pi-Zero Random Hang issue

Tue Apr 06, 2021 11:44 am

The zero has a single core CPU. Is there anything running on those machines that could be using all available CPU time thus stopping other processes from running?

While not a direct solution, you might want to consider enabling the hardware watchdog (ask google) on the zeros. If the hardware doesn't get poked by the OS every 15 seconds or so (I forget the actual figure, it might be 17 seconds) the SoC gets a hard reset. The downside is that there is a risk of file system/SD card corruption as the OS is not shut down cleanly.
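
To illustrate the "poking" described above, here is a rough sketch of how a userspace process can feed a Linux watchdog device. It assumes the watchdog is already enabled and exposed as `/dev/watchdog` (how to enable it on the zero is not covered in this post) and that the script runs as root. The function name and its parameters are made up for illustration. Whether the final `V` write really disarms the watchdog depends on the driver and kernel configuration, so treat that part as an assumption too.

```python
import time


def pet_watchdog(device="/dev/watchdog", interval_s=5, beats=None, sleep=time.sleep):
    """Keep a Linux watchdog device fed; return the number of writes made.

    beats=None loops forever; a number stops after that many writes.
    """
    count = 0
    with open(device, "wb", buffering=0) as wd:   # opening the device arms it
        while beats is None or count < beats:
            wd.write(b"\0")    # any write restarts the hardware countdown
            count += 1
            sleep(interval_s)  # must stay well below the watchdog timeout
        # Reached only on a clean stop: "magic close" asks the driver to disarm.
        # If the loop dies with an exception instead, nothing is written here,
        # the watchdog stays armed, and the board resets after the timeout.
        wd.write(b"V")
    return count
```

If the whole system stalls, this process stalls with it, the writes stop, and the hardware resets the SoC. On a real zero, stopping the script uncleanly will reboot the board, so test it on a device you can reach.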

Navaneetha
Posts: 11
Joined: Mon Apr 05, 2021 9:31 am

Re: Pi-Zero Random Hang issue

Tue Apr 06, 2021 12:30 pm

A hardware watchdog is a great idea. I was not aware that the Pi has an internal hardware watchdog. I will check this out and update here.
"Is there anything running on those machines that could be using all available CPU time thus stopping other processes from running?"
I wonder the same. I checked more closely: the Pi-Zero freezes for 72 minutes. In the last two weeks of data, this 72-minute freeze occurred at least 200 times, spread over 40+ Pi-Zeros.

I was exploring an external watchdog for the next lot of devices, but if the internal watchdog works, that would be great.
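
An aside on the 72-minute figure: it is close to the time a 32-bit counter of microseconds takes to wrap around. Nothing in this thread confirms that any component on these devices (the application's `getTick()`, a driver, or an OS timer) counts microseconds in 32 bits, so this is only a coincidence that may be worth checking. The arithmetic:

```python
# How long does a 32-bit counter that ticks once per microsecond take to wrap?
wrap_seconds = 2**32 / 1_000_000   # 2^32 microseconds expressed in seconds
wrap_minutes = wrap_seconds / 60

print("%.1f s = %.2f min" % (wrap_seconds, wrap_minutes))
```

The result is roughly 4295 seconds, or about 71.6 minutes.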

neilgl
Posts: 3032
Joined: Sun Jan 26, 2014 8:36 pm
Location: Near The National Museum of Computing

Re: Pi-Zero Random Hang issue

Tue Apr 06, 2021 3:53 pm

Sounds like the C program is hanging/looping waiting for a network resource in “the cloud”.
Can you post that C code?

ejolson
Posts: 7031
Joined: Tue Mar 18, 2014 11:47 am

Re: Pi-Zero Random Hang issue

Tue Apr 06, 2021 4:00 pm

Could the problem be how ARP filtering affects different types of WiFi routers?

Navaneetha
Posts: 11
Joined: Mon Apr 05, 2021 9:31 am

Re: Pi-Zero Random Hang issue

Wed Apr 07, 2021 6:18 am

neilgl wrote:
Tue Apr 06, 2021 3:53 pm
Sounds like the C program is hanging/looping waiting for a network resource in “the cloud”.
Can you post that C code?
It's not just the C program; everything freezes. The program uses the curl library with a proper timeout. It definitely doesn't wait endlessly for a cloud resource.

neilgl
Posts: 3032
Joined: Sun Jan 26, 2014 8:36 pm
Location: Near The National Museum of Computing

Re: Pi-Zero Random Hang issue

Wed Apr 07, 2021 8:43 am

Yes, everything freezes, but that is probably caused by the C code. Can you post it?
My Pi Zero W and others posting temperatures to a local server never freeze.

Navaneetha
Posts: 11
Joined: Mon Apr 05, 2021 9:31 am

Re: Pi-Zero Random Hang issue

Wed Apr 07, 2021 10:25 am

ejolson wrote:
Tue Apr 06, 2021 4:00 pm
Could the problem be how ARP filtering affects different types of WiFi routers?
Could ARP filtering freeze the processor? There was an instance where the Pi-Zero was the only device on the network, with no other device connected, and it still got into trouble.

neilgl
Posts: 3032
Joined: Sun Jan 26, 2014 8:36 pm
Location: Near The National Museum of Computing

Re: Pi-Zero Random Hang issue

Wed Apr 07, 2021 10:27 am

Can you post the C code?

Navaneetha
Posts: 11
Joined: Mon Apr 05, 2021 9:31 am

Re: Pi-Zero Random Hang issue

Wed Apr 07, 2021 10:48 am

The full code is pretty huge: 10+ C files, with sqlite operations, some XML configs, JSON libs, a small LCD, some LEDs, etc. I am not sure how helpful it would be; it would take at least 2-3 days to understand the code. I have attached a snippet of the server posting thread.


void HttpJsonEventPosting(void)
{
      static unsigned int Tick = 0;
      char      scLoop = 0;
      static char Index = 0;
      int       sizeOfDataToBePosted,currentPostingDbId,currentPostingDbId_1,LeastDBId, DbId, RetVal,oldDataElementIndex;
      unsigned int elementIndex = -1,Tick1=0,elementIndex_1;
      int Ret=0,SizeofEvent=0,ReadSize;
      RTC_TIME RtcTime;
      char DbaseCheckedForToday=0;
      
      Tick = getTick();      

      PrintDebug("HttpJsonEventPostingThread:Waiting for event stable time");      
      sleep(EVENT_STABLE_TIME);
 
      // sleep this thread so that RTC is read and set the system time .
//      while( (!fTime_Updated_From_Local_RTC ) && (!fTime_Updated_From_NTP ))
//      while( !fTime_Updated_From_NTP)
//      {
//         PrintDebug("HttpJsonEventPostingThread:Waiting for RTC to be read and system time to be updated");
//         sleep(2);
//      }


      //Initially get the tick count for LS devices battery low
        GetBattLowTickCnt = getTick();
        Tick1 = getTick();
      while(1)
      {
          if(TruncateEvent)
          {
              TruncateEvent=0;
              //TBD: Call truncate function
              RotateEvenetDb(1);
          }
          
//         if(st_LS_Config.LS_Mode==SLAVE)
         if(st_LS_Config.LS_Mode!=MASTER)
         {
            GetBattLowTickCnt = getTick();
            Tick = getTick();
            sleep(10);
            continue;
         }
          

         FillEvent();      

         Index = 0;
//        if ((st_Status.Labsense_Events.EventId != PrevEventId) ||
//           (st_Status.Labsense_Events.Curr_EventState != st_Status.Labsense_Events.Prev_EventState))
         if(EventChanged)
        {
             EventChanged=0;
           funPrepareEventBuffer(Index, LS_EVENT);

           #ifdef DEBUG_PRINT     
              PrintDebug( "HttpEventPosting : Posting Event\n" );      
           #endif

           SizeofEvent=funPrepareEventInJsonFormat();                          
           if(SizeofEvent>0)
           {
              currentPostingDbId = StoreEventToDbase( 0, EventBuffer );
              // Push the record to the internal queue, from which the posting thread will fetch it.
              elementIndex = QPut(DBASE_LOG_QUEUE, EventBuffer, strlen(EventBuffer), currentPostingDbId, WRITE_OVER_RIDE);
              // NOTE: elementIndex is declared unsigned above, so this test is always true.
              if( elementIndex >= 0 )
              {
                 // Update data as new data state which will be updated in the posted string.
                 QSet_Posted_State(DBASE_LOG_QUEUE, elementIndex,NEW_DATA);
                 printf("Posting new Event data\n");
              }
               
           }
            //TBD: check Internet connectivity
           if( IsInternetConnected == NET_CONNECTED)
           {
                //If timezone set then only send events to server
                if (TimeZoneError == 1)    
                {
                    sizeOfDataToBePosted = getEventDataFromQueue(&currentPostingDbId, &elementIndex);
                    if( sizeOfDataToBePosted > 0)
                    {
                        Ret = funSendEvent(NEW_DATA);   
                        if(Ret==1)
                        {
                            //Event Posting Success
                            printf("Event Posting Success\n");
                             QSet_Element_State(DBASE_LOG_QUEUE,elementIndex, Q_POSTED);
                        }
                    }
                }
           }
           
           st_Status.Labsense_Events.Prev_EventState = st_Status.Labsense_Events.Curr_EventState;
           PrevEventId = st_Status.Labsense_Events.EventId;  //SDK                 
        }else{
            //TBD: check Internet Connectivity
           if( IsInternetConnected == NET_CONNECTED)
           {
	       if( getTick() - Tick1 >= (SettingsParms.Old_Data_Posting_Interval * TICK_SECOND) )
	       {
                    Tick1 = getTick();

                  //TBD: check old event to be post is pending
                    // we are searching the old data with lowest ID in the database which is not 
                    // posted.
                    RetVal = QGet_Least_DB_ID(DBASE_LOG_QUEUE,&LeastDBId);
                    if(LeastDBId <=0){
                       LeastDBId = currentPostingDbId;
                    }
                    // Get the old record from database whose ID is one less than the least ID.
                    if(!NoOldEventFlag)
                    {
                        DbId = getOldEventRecordFromDbase(LeastDBId);      //currentPostingDbId
                    }else{
                        DbId = 0;
                    }
                    if( DbId )
                    {
                       // Push the old record on the internal queue for posting.
                       oldDataElementIndex = QPut(DBASE_LOG_QUEUE, EventBuffer, strlen(EventBuffer), DbId, WRITE_OVER_RIDE);
                       if( oldDataElementIndex >= 0 )
                       {
                          // Set the state of the record as old data.
                          QSet_Posted_State(DBASE_LOG_QUEUE, oldDataElementIndex,OLD_DATA);
                          printf("\nPosting old Event data\n");
                          printf( "%s", EventBuffer );  // print as data, never as a format string
                          printf("\n");
                       }
                    }
                    sizeOfDataToBePosted = getEventDataFromQueue(&currentPostingDbId_1, &elementIndex_1);
                    if( sizeOfDataToBePosted > 0)
                    {
                            //TBD: Used another url for Old Event posting
                         Ret = funSendEvent(OLD_DATA);
                         if(Ret>0)
                         {
                             //old Data posted successfull
                             QSet_Element_State(DBASE_LOG_QUEUE,elementIndex_1, Q_POSTED);
                         }
                    }
               }
           }
        }
        // From the internal queue, get the record that has been posted.
        if( (elementIndex = QGet(DBASE_LOG_QUEUE, 0, 0,&ReadSize,&DbId,Q_POSTED)) != -1 )
        {
           //printf( "Tick before update status to dbase = %d \n", getTick() );
           EventUpdateStatusToDbase(DbId);
           //printf( "Tick after update status to dbase = %d \n", getTick() );

           QSet_Element_State(DBASE_LOG_QUEUE,elementIndex, Q_FREE);
           // free the queue element.
        }
         //To Delete Data from Database
        getTime( &RtcTime );
        if( RtcTime.Hour >= SettingsParms.DbaseRotateHour && !DbaseCheckedForToday )
        {
           DbaseCheckedForToday = 1;
           RotateEvenetDb(0);
        }
        else if( RtcTime.Hour < SettingsParms.DbaseRotateHour ) // if day has changed, start checking again
        {	
            DbaseCheckedForToday = 0;  
        }
         
         sleep(1);
     }

     printf( "HttpEventPosting : Exiting from thread\n" );
 
}

ejolson
Posts: 7031
Joined: Tue Mar 18, 2014 11:47 am

Re: Pi-Zero Random Hang issue

Wed Apr 07, 2021 3:29 pm

Navaneetha wrote:
Wed Apr 07, 2021 10:25 am
Could ARP filtering freeze the processor? There was an instance where the Pi-Zero was the only device on the network, with no other device connected, and it still got into trouble.
Do you have a monitor and keyboard (or serial console) attached from which to verify it's the processor not the network that is down?

If indeed the processor, maybe there are leaks in your code and the out of memory daemon eventually kills it after which the program gets restarted and works again.

Navaneetha
Posts: 11
Joined: Mon Apr 05, 2021 9:31 am

Re: Pi-Zero Random Hang issue

Thu Apr 08, 2021 9:39 am

I didn't check with a keyboard and monitor. The device has a USB LAN driver, which enumerates a LAN port when USB is connected. That didn't work during the freeze. The application didn't get restarted, which we can see in the application logs.
The application might have a memory leak, but there is a shell script run from crontab that prints the date, time, and Wi-Fi status. Even that stopped for the frozen period.
It was just like an old Windows machine: if we open some heavy software, it freezes for some time and then comes back to life.

thagrol
Posts: 4653
Joined: Fri Jan 13, 2012 4:41 pm
Location: Darkest Somerset, UK

Re: Pi-Zero Random Hang issue

Thu Apr 08, 2021 11:05 am

Navaneetha wrote:
Thu Apr 08, 2021 9:39 am
It was just like an old Windows machine: if we open some heavy software, it freezes for some time and then comes back to life.
Which, as has been pointed out, is most likely what's happening.

At risk of repeating myself:
  • The Pi zero(W/WH) has a single core CPU
  • A single core CPU can run one process at a time though OS features make it appear otherwise.
  • A poorly written program can prevent the OS from doing its job and swapping between tasks.
You need to find out what is doing that. One approach would be to log in to the zero, run top and wait for the hang. top will likely hang but should give some indication of CPU load and the most CPU heavy processes running at that point.

Don't believe me? Try running this python script on a zero:


#!/usr/bin/env python
while True:
    pass
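
Watching top by hand for a hang that shows up every few days is impractical, so one option is to log the busiest processes to a file every few seconds and read the last entries after the next freeze. The following is a rough sketch, assuming a Linux `/proc` filesystem and Python 3. All function names and the log format are invented for illustration.

```python
import os
import time


def parse_stat(text):
    """Return (command name, utime + stime in clock ticks) from /proc/<pid>/stat text."""
    head, tail = text.rsplit(") ", 1)   # the command name may itself contain spaces
    comm = head.split(" (", 1)[1]
    fields = tail.split()               # fields[0] is the process state (field 3)
    return comm, int(fields[11]) + int(fields[12])


def sample(proc="/proc"):
    """Map pid -> (command name, cpu ticks) for every process currently visible."""
    result = {}
    for name in os.listdir(proc):
        if name.isdigit():
            try:
                with open(os.path.join(proc, name, "stat")) as f:
                    result[int(name)] = parse_stat(f.read())
            except OSError:
                pass  # the process exited between listdir() and open()
    return result


def busiest(before, after, n=3):
    """Return the n (ticks used, command, pid) tuples with the most CPU between samples."""
    deltas = [(after[p][1] - before[p][1], after[p][0], p) for p in after if p in before]
    return sorted(deltas, reverse=True)[:n]


def log_busiest(path, before, after):
    """Append one line with time, 1-minute load average and busiest processes."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    line = "%s load=%.2f top=%s\n" % (stamp, os.getloadavg()[0], busiest(before, after))
    with open(path, "a") as f:
        f.write(line)
        f.flush()
        os.fsync(f.fileno())  # keep the last entry even if the system then stalls
```

A caller would take a `sample()`, sleep a few seconds, take another, and pass both to `log_busiest()`. If one process dominates the last lines written before each gap, that is the one to look at.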

Navaneetha
Posts: 11
Joined: Mon Apr 05, 2021 9:31 am

Re: Pi-Zero Random Hang issue

Thu Apr 08, 2021 12:03 pm

We watched the top command for a few hours. The application uses at most 15% CPU, and only occasionally; typically it stays at 4-5%. It is very difficult to keep observing until it freezes. I am not ruling out the application maxing out the CPU, but I wonder how it recovers automatically and why there is a pattern of 72 minutes. Application bugs are mostly periodic in nature, but this issue occurs randomly.

Navaneetha
Posts: 11
Joined: Mon Apr 05, 2021 9:31 am

Re: Pi-Zero Random Hang issue

Wed Apr 21, 2021 6:43 am

I have addressed this problem by using the watchdog. It might open up other problems such as SD card corruption, but those definitely won't be frequent.
https://diode.io/raspberry%20pi/running ... dog-20202/ => Link explaining how to enable the watchdog.

I am looking into the code to check whether it maxes out the processor at any point.

Thanks for all your time and support.

Regards
Navaneeth
