sonettguy
Posts: 142
Joined: Wed Jan 10, 2018 7:29 pm
Location: texas, USA

debugging "too many files" error

Tue Jul 24, 2018 5:04 pm

I am trying unsuccessfully to debug an error that occurs after hours of run time on my Pi Zero W running Raspbian 9 Stretch, with an external WiFi adapter and a 7-segment display added.

Code: Select all

336029 Jul 23 17:40:54 pi02 systemd[1]: Starting Cleanup of Temporary Directories...
336030 Jul 23 17:40:55 pi02 systemd[1]: Started Cleanup of Temporary Directories.
336031 Jul 23 18:17:01 pi02 CRON[14511]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
336032 Jul 23 18:20:56 pi02 systemd[1]: Starting Raise network interfaces...
336033 Jul 23 18:20:57 pi02 ifup[21141]: command failed: Too many open files in system (-23)
336034 Jul 23 18:20:57 pi02 ifup[21141]: ifup: failed to bring up wlan1
336035 Jul 23 18:20:57 pi02 ifup[21141]: command failed: No such file or directory (-2)
336036 Jul 23 18:20:57 pi02 ifup[21141]: ifup: failed to bring up wlan2
336037 Jul 23 18:20:57 pi02 systemd[1]: networking.service: Main process exited, code=exited, status=1/FAILURE
In particular, I am trying to determine why there are too many open files in the system. I can't figure out how to see which files are currently open. I saw the command 'lsof', which I tried unsuccessfully, as follows.

Code: Select all

pi@pi:~ $ lsof -p 1903
bash: lsof: command not found
pi@pi:~ $ sudo apt-get install lsof
Reading package lists... Done
Building dependency tree       
Reading state information... Done
Package lsof is not available, but is referred to by another package.
This may mean that the package is missing, has been obsoleted, or
is only available from another source
I tried several other ways to look for open files, but none worked. So, I just started reviewing my code. I have ten programs, some of which only run at boot; I think four or five run in continuous loops. I use two files to pass information from a bash script to python. They each look like this.

Code: Select all

 1 #!/usr/bin/env bash
16 count=$(ls -1 | wc -l)
23 sudo sh -c "echo $count > /home/pi/RFcount.txt"

Code: Select all

 1 #!/usr/bin/python
  2 import time
  3 import datetime
  4 import subprocess
  5 # For program ancillary to clock
  6 import os, sys, shutil
122   if second==25 or second==55:
123       subprocess.call("/home/pi/aboveshellscript.sh")
176   # Display data from another app
177   if second==10 or second==20 or second==40 or second==50:
178       with open("/home/pi/RFcount.txt") as fp:
179          lst=list(fp.readlines())
180       count=lst[len(lst)-1].strip()
181       RFcount=str(count).zfill(4)
182       segment.print_number_str(RFcount)
183       segment.write_display()
In either the python script or the bash script, am I leaving a file open? Or am I looking in the wrong place in my code?
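One Linux-specific way to check for descriptor leaks from inside a script (a sketch, not part of the original post) is to count the entries under /proc/self/fd:

```python
import os
import tempfile

def open_fd_count(pid="self"):
    """Count open file descriptors of a process via /proc/<pid>/fd (Linux only).

    The listing itself briefly uses one descriptor, but that cost is the
    same for every call, so differences between calls are meaningful.
    """
    return len(os.listdir("/proc/{}/fd".format(pid)))

before = open_fd_count()
tmp = tempfile.NamedTemporaryFile()   # stand-in for a file the script opens
print(open_fd_count() - before)       # 1 -- one new open file, one new fd
tmp.close()
```

Calling something like this at the top of the main loop and watching whether the number keeps growing would show whether the script itself is leaking.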

sonettguy
Posts: 142
Joined: Wed Jan 10, 2018 7:29 pm
Location: texas, USA

Re: debugging "too many files" error

Tue Jul 24, 2018 5:11 pm

More information. Taken 15 seconds apart....

Code: Select all

pi@pi02:~ $ cat /proc/sys/fs/file-nr
2331    0       36895
pi@pi02:~ $ cat /proc/sys/fs/file-nr
2335    0       36895
pi@pi02:~ $ cat /proc/sys/fs/file-nr
2329    0       36895
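For reference, the three fields of /proc/sys/fs/file-nr are the number of allocated file handles, the number allocated but unused (always 0 on 2.6 and later kernels), and the system-wide maximum. A small parsing sketch, using one of the readings above as a literal sample:

```python
def parse_file_nr(text):
    """Split a /proc/sys/fs/file-nr line into (allocated, unused, maximum)."""
    allocated, unused, maximum = (int(field) for field in text.split())
    return allocated, unused, maximum

print(parse_file_nr("2331\t0\t36895"))  # (2331, 0, 36895)
```

At roughly 2,300 allocated handles out of a 36,895 maximum, these readings are nowhere near the system-wide limit.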

Ernst
Posts: 1334
Joined: Sat Feb 04, 2017 9:39 am
Location: Germany

Re: debugging "too many files" error

Tue Jul 24, 2018 5:41 pm

I do not know much about Python, but it would make a lot of sense if there were a close somewhere in the code.
The road to insanity is paved with static ip addresses

B.Goode
Posts: 10356
Joined: Mon Sep 01, 2014 4:03 pm
Location: UK

Re: debugging "too many files" error

Tue Jul 24, 2018 5:54 pm

Ernst wrote:
Tue Jul 24, 2018 5:41 pm
I do not know much about Python, but it would make a lot of sense if there were a close somewhere in the code.

Check the semantics of the with statement....
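To expand on that: a with statement closes the file automatically when the block is left, even if an exception occurs, so no explicit close() call is needed. A minimal illustration (scratch paths, not the original script):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "RFcount.txt")  # stand-in for /home/pi/RFcount.txt
with open(path, "w") as fp:
    fp.write("0042\n")
# No fp.close() needed: leaving the with-block already closed the file.

with open(path) as fp:
    count = fp.readlines()[-1].strip()   # last line, as in the original script
print(fp.closed)   # True
print(count)       # 0042
```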

B.Goode
Posts: 10356
Joined: Mon Sep 01, 2014 4:03 pm
Location: UK

Re: debugging "too many files" error

Tue Jul 24, 2018 6:31 pm

My suspicion would fall on those zombie subprocesses you create and then lose sight and control of...

Would python threading be a more elegant way of doing this?

sonettguy
Posts: 142
Joined: Wed Jan 10, 2018 7:29 pm
Location: texas, USA

Re: debugging "too many files" error

Tue Jul 24, 2018 6:49 pm

Good, @B.Goode, I like suspicions. They give me leads to follow. @Ernst, I thought so too and began with a "fq.close()" statement after "with", but then after reading the man page several times, I thought that might be the problem and removed it.

If there were a way to see what files are open at a point in time, that would give me a good clue. But, I can't find such a way.

It only fails after several hours, so it is hard to reproduce. I just now caught a failure, so at least I know where the python code was when it went down. That is another clue. It is somewhere between lines 138 and 145. Hm. Maybe I don't understand "break." It jumps out of the loop it's in, right? Or does it kill the entire process? That would be embarrassing, but good for coding practices.

Code: Select all

128     backdir=os.getcwd()
129      os.chdir('/home/pi/logs')
130      os.chdir(path)
131      lst=os.listdir('.')
132      os.chdir('/home/pi/logsMAC')
133      os.chdir(path)
134 #                                          #a mental example (see test.txt)
135      length=len(lst)                       #length=13; a string
136      print "length as string: ", length, "time: ", hour, minute, second
137      l=int(length)                         #convert to an integer
138      print "length as integer: ", l
139      if l<10:
140        break
141      else:
142          for m in range(l-1):
143              n=1                           #m 1:12 2:12    12:12
144              mac=lst[m][9:23]              #mac=ac ab      ac
145              print (mac, 'mac')

Paeryn
Posts: 2966
Joined: Wed Nov 23, 2011 1:10 am
Location: Sheffield, England

Re: debugging "too many files" error

Tue Jul 24, 2018 10:10 pm

sonettguy wrote:
Tue Jul 24, 2018 6:49 pm
Hm. Maybe I don't understand "break." It jumps out of the loop its in, right? Or, does it kill the entire process? That would be embarrassing, but good for coding practices.
Yes, break breaks out of the innermost for or while loop that it is in.
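A quick illustration (not from the thread) of break leaving only the innermost loop:

```python
hits = []
for outer in range(2):
    for inner in range(5):
        if inner == 1:
            break                 # ends only the inner for-loop
        hits.append((outer, inner))
print(hits)  # [(0, 0), (1, 0)] -- the outer loop carried on
```

Note that in the posted script the break sits directly inside while(True), so there the innermost enclosing loop is the main loop, and breaking it ends the program.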

As to installing lsof, did you do a sudo apt update before your sudo apt install lsof (or apt-get if you prefer; apt combines several of the apt-* utilities into one)? It is definitely there; I just installed it.

Code: Select all

pi@rpi3:~ $ apt policy lsof
lsof:
  Installed: 4.89+dfsg-0.1
  Candidate: 4.89+dfsg-0.1
  Version table:
 *** 4.89+dfsg-0.1 500
        500 http://raspbian.raspberrypi.org/raspbian stretch/main armhf Packages
        100 /var/lib/dpkg/status
Just wondering about a bit of your code,

Code: Select all

134 #                                          #a mental example (see test.txt)
135      length=len(lst)                       #length=13; a string
136      print "length as string: ", length, "time: ", hour, minute, second
137      l=int(length)                         #convert to an integer
138      print "length as integer: ", l
len() returns an int, not a string so you don't need to convert it into one.

Code: Select all

>>> import os
>>> my_list = os.listdir('.')
>>> my_len = len(my_list)
>>> print(my_len)
58
>>> type(my_len)
<class 'int'>
If you are using Python 2 then that last line will say <type 'int'> rather than <class 'int'>, but it amounts to the same thing: it is a numeric value.
She who travels light — forgot something.

sonettguy
Posts: 142
Joined: Wed Jan 10, 2018 7:29 pm
Location: texas, USA

Re: debugging "too many files" error

Tue Jul 24, 2018 10:54 pm

OK, so my base python script has in it a section like this.

Code: Select all

 62 while(True):
139   if l<10:
140     break
141   else:
142       for m in range(l-1):
143           n=1                           #m 1:12 2:12    12:12
144           mac=lst[m][9:23]              #mac=ac ab      ac
145           print (mac, 'mac')

So, my program stopping after a few hours of running is likely not due to an error, but following the 'break' command. That explains why I couldn't find an error in any of the logs. Good grief. Once you see it, you have to slap your forehead.

But, the "too many files" question still looms. I have now run all of the subprocesses without running the base python script. New files are not created. The file count stays the same. So, it must either be the base script or a conflict between the base script and subprocesses opening a file at the same time(?)...is that even possible? Do I need to do something to block bash script from opening a file when python is and vice versa? That sounds hard!

sonettguy
Posts: 142
Joined: Wed Jan 10, 2018 7:29 pm
Location: texas, USA

Re: debugging "too many files" error

Tue Jul 24, 2018 11:35 pm

@Paeryn, no I did not do an apt-get update. I'll try that.

Thanks for the help on the "len" command. I thought I tried it first as an integer, getting an error, but that was a few weeks ago so I'm not sure now. I'll correct it for better coding. I appreciate the help. This is my first python program. (And, first bash script too.)

sonettguy
Posts: 142
Joined: Wed Jan 10, 2018 7:29 pm
Location: texas, USA

Re: debugging "too many files" error

Wed Jul 25, 2018 2:55 am

In pursuit of the source of "too many open files," I installed 'lsof' and looked through 2500 records of open files. Only a handful come from my programs; most are evenly distributed across many sources, with two exceptions called gmain and gdbus. There are about five or six sections repeating a large number of files from these two sources. They look like this.

Code: Select all

gmain     21820 21835               pi    1w      REG      179,2    41382     257902 /home/pi/.cache/lxsession/LXDE-pi/run.log
gmain     21820 21835               pi    2w      REG      179,2    41382     257902 /home/pi/.cache/lxsession/LXDE-pi/run.log
gmain     21820 21835               pi    3u     unix 0xc44364c0      0t0     280593 type=STREAM
gmain     21820 21835               pi    4u  a_inode       0,11        0       4423 [eventfd]
gmain     21820 21835               pi    5u     unix 0xc4436260      0t0     280597 /run/user/1000/.lxterminal-socket- type=STREAM
gmain     21820 21835               pi    6u      CHR        5,2      0t0       4714 /dev/ptmx
gmain     21820 21835               pi    7u      CHR      136,2      0t0          5 /dev/pts/2
gmain     21820 21835               pi    8u     unix 0xc4436e40      0t0     280603 type=STREAM
gmain     21820 21835               pi    9u  a_inode       0,11        0       4423 [eventfd]
gmain     21820 21835               pi   11u     unix 0xc4437300      0t0     280635 type=STREAM
gmain     21820 21835               pi   12u  a_inode       0,11        0       4423 [eventfd]
gdbus     21820 21851               pi  cwd       DIR      179,2     4096        108 /home/pi
gdbus     21820 21851               pi  rtd       DIR      179,2   962560          2 /
gdbus     21820 21851               pi  txt       REG      179,2    60812      15028 /usr/bin/lxterminal
gdbus     21820 21851               pi  DEL       REG        0,5              360450 /SYSV00000000
gdbus     21820 21851               pi  mem       REG      179,2    18140     148506 /usr/lib/arm-linux-gnueabihf/gdk-pixbuf-2.0/2.10.0/loaders/libpixbufloader-png.so
gdbus     21820 21851               pi  mem       REG      179,2  1269620      14591 /usr/lib/arm-linux-gnueabihf/libstdc++.so.6.0.22
gdbus     21820 21851               pi  mem       REG      179,2   104796      18601 /usr/lib/arm-linux-gnueabihf/libproxy.so.1.0.0
gdbus     21820 21851               pi  mem       REG      179,2   132844      28661 /usr/share/mime/mime.cache
gdbus     21820 21851               pi  mem       REG      179,2   161456     147254 /usr/lib/arm-linux-gnueabihf/gio/modules/libgvfsdbus.so
gdbus     21820 21851               pi  mem       REG      179,2     9836     146391 /lib/arm-linux-gnueabihf/libutil-2.24.so
gdbus     21820 21851               pi  mem       REG      179,2   173412      19328 /usr/lib/arm-linux-gnueabihf/gvfs/libgvfscommon.so
gdbus     21820 21851               pi  mem       REG      179,2    79400     147253 /usr/lib/arm-linux-gnueabihf/gio/modules/libgioremote-volume-monitor.so
gdbus     21820 21851               pi  mem       REG      179,2   301712      43089 /usr/share/fonts/truetype/liberation2/LiberationMono-Bold.ttf
I don't know if I'm doing myself any favors in this pursuit. I have no earthly idea where these come from or whether I should care. I did this without my main python program running, but with my five subprocesses running. I'll use my new tools with the main python program running.

sonettguy
Posts: 142
Joined: Wed Jan 10, 2018 7:29 pm
Location: texas, USA

Re: debugging "too many files" error

Wed Jul 25, 2018 4:46 pm

OK, I believe I see what is going on, but I don't know how to fix it. In trying to clean up my large cache of collected data, I use the following code. As a reminder, this is for a Pi Zero W with Raspbian 9 Stretch running a single python script in a continuous auto-started loop (using cron @reboot) and three or four bash scripts running continuously as spawned by the python code.

Code: Select all

  1 #!/usr/bin/python
  2 import time
  3 import datetime
  4 import subprocess
  5 # For program ancillary to clock
  6 import os, sys, shutil
 55 # Find path for latest Day folder in /home/pi/logs
 56 lost=os.listdir('/home/pi/logs')
 57 print "listdir output:",lost
 58 last=sorted(lost)
 60 pathtmp=last[-1]
 66 while(True):
132   if (minute==0 or minute==15 or minute==30 or minute==45) and second==1:
133      backdir=os.getcwd()
134      os.chdir('/home/pi/logs')
135      os.chdir(pathtmp)
136      lst=os.listdir('.')
137      os.chdir('/home/pi/logsMAC')
138      os.chdir(path)
139 #                                          
140      length=len(lst)                       
144      if length<10:
145        pass                                #too few records to deal with
146      else:
147          for m in range(length-1):
                    do many things to analyze data
169          shutil.rmtree("/home/pi/logs/"+pathtmp)
170          os.mkdir("/home/pi/logs/"+pathtmp)
171          print "cache deleted from logs/"+pathtmp
173      # Put things back in order
174      os.chdir(backdir)
I believe the problem is occurring between lines 169 and 170. I get an error that reads

Code: Select all

Traceback (most recent call last):
  File "Adafruit_Python_LED_Backpack/examples/ex_7segment_clock.py", line 169, in <module>
    shutil.rmtree("/home/pi/logs/"+path)
  File "/usr/lib/python2.7/shutil.py", line 256, in rmtree
    onerror(os.rmdir, path, sys.exc_info())
  File "/usr/lib/python2.7/shutil.py", line 254, in rmtree
    os.rmdir(path)
OSError: [Errno 39] Directory not empty: '/home/pi/logs/day9'
I was confused by this for a long time because I have read
os.rmdir(path)
Remove (delete) the directory path. Only works when the directory is empty, otherwise, OSError is raised. In order to remove whole directory trees, shutil.rmtree() can be used.
Today, however, I realized that one of my subprocesses, a bash file, records the data into that working folder and is running separately. This process contains

Code: Select all

#!/bin/bash
FOLDER=$(ls -d /home/pi/logs/*/ | tail -1)
/usr/sbin/tcpdump -i mon1 -e -s 1000 -l | grep -P --line-buffered -o '(a bunch of regex)' | awk -Winteractive -v folder=$FOLDER '{system("touch "folder$0)}' -
I believe it may be recording a record in between the execution of lines 169 and 170 in the python script. How can I block or halt the bash script while I do this file cleanup?

sonettguy
Posts: 142
Joined: Wed Jan 10, 2018 7:29 pm
Location: texas, USA

Re: debugging "too many files" error

Wed Jul 25, 2018 5:24 pm

What if I somehow use chmod on the python side to change permissions to read-only, then add try-except to the bash file?

sonettguy
Posts: 142
Joined: Wed Jan 10, 2018 7:29 pm
Location: texas, USA

Re: debugging "too many files" error

Thu Jul 26, 2018 1:39 am

I wonder if I could kill -TSTP <bash process> from python, then kill -CONT <bash process> after the cleanup.

sonettguy
Posts: 142
Joined: Wed Jan 10, 2018 7:29 pm
Location: texas, USA

Re: debugging "too many files" error

Thu Jul 26, 2018 3:13 pm

I consistently have between 2300 and 2400 files open according to cat /proc/sys/fs/file-nr. Is that a lot? Too many? Normal? I have nothing to compare it to.

jamesh
Raspberry Pi Engineer & Forum Moderator
Posts: 26659
Joined: Sat Jul 30, 2011 7:41 pm

Re: debugging "too many files" error

Thu Jul 26, 2018 3:32 pm

sonettguy wrote:
Wed Jul 25, 2018 4:46 pm
OK, I believe I see what is going on, but I don't know how to fix it. In trying to clean up my large cache of collected data, I use the following code. As a reminder, this is for a Pi Zero W with Raspbian 9 Stretch running a single python script in a continuous auto-started loop (using cron @reboot) and three or four bash scripts running continuously as spawned by the python code.

Code: Select all

  1 #!/usr/bin/python
  2 import time
  3 import datetime
  4 import subprocess
  5 # For program ancillary to clock
  6 import os, sys, shutil
 55 # Find path for latest Day folder in /home/pi/logs
 56 lost=os.listdir('/home/pi/logs')
 57 print "listdir output:",lost
 58 last=sorted(lost)
 60 pathtmp=last[-1]
 66 while(True):
132   if (minute==0 or minute==15 or minute==30 or minute==45) and second==1:
133      backdir=os.getcwd()
134      os.chdir('/home/pi/logs')
135      os.chdir(pathtmp)
136      lst=os.listdir('.')
137      os.chdir('/home/pi/logsMAC')
138      os.chdir(path)
139 #                                          
140      length=len(lst)                       
144      if length<10:
145        pass                                #too few records to deal with
146      else:
147          for m in range(length-1):
                    do many things to analyze data
169          shutil.rmtree("/home/pi/logs/"+pathtmp)
170          os.mkdir("/home/pi/logs/"+pathtmp)
171          print "cache deleted from logs/"+pathtmp
173      # Put things back in order
174      os.chdir(backdir)
I believe the problem is occurring between lines 169 and 170. I get an error that reads

Code: Select all

Traceback (most recent call last):
  File "Adafruit_Python_LED_Backpack/examples/ex_7segment_clock.py", line 169, in <module>
    shutil.rmtree("/home/pi/logs/"+path)
  File "/usr/lib/python2.7/shutil.py", line 256, in rmtree
    onerror(os.rmdir, path, sys.exc_info())
  File "/usr/lib/python2.7/shutil.py", line 254, in rmtree
    os.rmdir(path)
OSError: [Errno 39] Directory not empty: '/home/pi/logs/day9'
I was confused by this for a long time because I have read
os.rmdir(path)
Remove (delete) the directory path. Only works when the directory is empty, otherwise, OSError is raised. In order to remove whole directory trees, shutil.rmtree() can be used.
Today, however, I realized that one of my subprocesses, a bash file, records the data into that working folder and is running separately. This process contains

Code: Select all

#!/bin/bash
FOLDER=$(ls -d /home/pi/logs/*/ | tail -1)
/usr/sbin/tcpdump -i mon1 -e -s 1000 -l | grep -P --line-buffered -o '(a bunch of regex)' | awk -Winteractive -v folder=$FOLDER '{system("touch "folder$0)}' -
I believe it may be recording a record in between the execution of lines 169 and 170 in the python script. How can I block or halt the bash script while I do this file cleanup?
This is called a race condition, where two processes are actively working on the same "thing", effectively racing each other to do their stuff. Nasty bugs to find.

If everything were in the same program you would use mutexes or similar, but in this case you may need to provide some sort of file-based semaphore - create a file when you want access to the folder to delete it, and have your bash script stop writing while that file is present. Once your folder is deleted, also delete the semaphore file. The bash script checks for the file; it's not there, so it can continue, taking into account that the file structure will have changed.
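That file-based semaphore can be sketched on the Python (cleanup) side as follows; the lock path and helper names here are invented for illustration, and os.open with O_CREAT | O_EXCL makes the check-and-create atomic:

```python
import errno
import os

LOCK = "/home/pi/logs/.cleanup.lock"   # hypothetical lock path for this sketch

def acquire_lock(path=LOCK):
    """Atomically create the lock file; return False if it already exists."""
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except OSError as e:
        if e.errno == errno.EEXIST:
            return False               # someone else holds the lock
        raise
    os.close(fd)
    return True

def release_lock(path=LOCK):
    os.remove(path)
```

The recording bash script would then guard its touch with a test along the lines of [ -e /home/pi/logs/.cleanup.lock ], skipping the write while the lock exists.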
Principal Software Engineer at Raspberry Pi (Trading) Ltd.
Contrary to popular belief, humorous signatures are allowed.
I've been saying "Mucho" to my Spanish friend a lot more lately. It means a lot to him.

Paeryn
Posts: 2966
Joined: Wed Nov 23, 2011 1:10 am
Location: Sheffield, England

Re: debugging "too many files" error

Thu Jul 26, 2018 4:02 pm

sonettguy wrote:
Thu Jul 26, 2018 3:13 pm
I consistently have between 2300 and 2400 files open according to cat /proc/sys/fs/file-nr. Is that a lot? Too many? Normal? I have nothing to compare it to.
That's fairly normal. Unless you're running out of memory Linux can happily deal with a lot more than that. Sort out your race condition(s) first, at least that's a known problem. You never know, the too many open files may be a symptom of that.
She who travels light — forgot something.

sonettguy
Posts: 142
Joined: Wed Jan 10, 2018 7:29 pm
Location: texas, USA

Re: debugging "too many files" error

Sat Jul 28, 2018 1:49 am

Solved.

Interestingly, I was (after two weeks of steady work on this) able to fix the problem of the main python program going down after several hours of operation. When I did, it seemed to free CPU time and cure the "too many files" problem, which apparently came from the Pi busily recording errors from the spawned bash scripts that were still running after the main program went down.

Also interesting is the way I fixed it. After studying this for a long time, I realized that shutil.rmtree must count the files it needs to remove and stop after hitting that count. Thus, as my spawned bash script loads new data (which I keep in the form of filenames), it is not actually causing a long-term problem. In fact, it is an advantage! I'm not losing any data, and rmtree is deleting all of the old data properly. Sooo, I just ignore the error. So easy. Ignoring the next error, incurred by trying to remake the directory name that is actually still there (using os.mkdir), is a little harder. Here's the fix. It works just fine. The program has been running for two days now without skipping a beat.

Code: Select all

169          print "start removal of logs "+pathtmp
170          shutil.rmtree("/home/pi/logs/"+pathtmp, ignore_errors=True)
171          try:
172              os.mkdir("/home/pi/logs/"+pathtmp)
173          except OSError:
174              print "error 2:", sys.exc_info()[0]
175          print "cache deleted from logs/"+pathtmp
So, I am not "solving" the race problem, but instead "allowing" it. This was so easy to fix because I am storing data as filenames in a directory, not as characters in a file. I do this because it automatically takes care of duplicate data without my doing anything. In my case, I want only unique data points. All the stars aligned and I was able to allow the errors and keep on going flawlessly.
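The same tolerant-cleanup pattern, sketched against a scratch directory (Python 3 print syntax here, and the paths are stand-ins; the thread's script is Python 2 and writes under /home/pi/logs):

```python
import os
import shutil
import tempfile

logdir = os.path.join(tempfile.mkdtemp(), "day9")      # stand-in for /home/pi/logs/day9
os.mkdir(logdir)
open(os.path.join(logdir, "some-record"), "w").close()  # a fake data filename

# Delete whatever is there; ignore_errors tolerates records appearing mid-delete.
shutil.rmtree(logdir, ignore_errors=True)
try:
    os.mkdir(logdir)              # recreate; may fail if a racing write revived it
except OSError:
    pass
print(os.listdir(logdir))         # [] -- an empty folder, ready for new records
```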
