
Using RPI for searching in huge .txt

Posted: Mon May 21, 2018 10:56 am
by henri_007
I have some huge .txt files (about 10-15 GB). Every file has about 100 million lines. Now I have to pull some information out of those files. I'm thinking of using an RPI for that job. Is there any software you can recommend for this? Or any other approach that suits my situation?

Thank you!

Re: Using RPI for searching in huge .txt

Posted: Mon May 21, 2018 11:09 am
by piglet
It depends what you want to get out of the file.

To find lines matching some string, I'd use "grep" from a console:

grep "what you want to find" filename

For complex extraction I'd probably use perl as it's very fast.

Re: Using RPI for searching in huge .txt

Posted: Mon May 21, 2018 11:38 am
by B.Goode
An RPi running the Raspbian Operating System can in principle do anything a Linux 'mainframe' computer can do.

But it will take a lot longer.

Try it and see. But I have a feeling you will find it frustrating.

Re: Using RPI for searching in huge .txt

Posted: Mon May 21, 2018 11:45 am
by jbudd
grep prints out lines from a file which match a string or regular expression.
awk can do more complex selection, eg "print out the 7th and 8th word on each line which contains 'foo' if it follows a line which contains 'bar'".
perl offers much the same as awk, plus some.
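That awk selection could be sketched like this (foo, bar, and filename are placeholders, and words are assumed whitespace-separated):

```shell
# Keep the previous line in "prev"; when the current line contains "foo"
# and the previous one contained "bar", print its 7th and 8th words
awk '/foo/ && prev ~ /bar/ { print $7, $8 } { prev = $0 }' filename
```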

You can use these tools as "filters" - they don't need to read the whole input file into memory, they just watch the data flow past and grab the bits of interest. So your big input file shouldn't be a problem. But just in case it is, split lets you break it up into multiple files of eg 1000000 lines.
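For instance, a sketch of that split step (bigfile is a placeholder name):

```shell
# Break bigfile into pieces of 1,000,000 lines each, named chunk_aa,
# chunk_ab, ...; the original file is left untouched
split -l 1000000 bigfile chunk_
```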

On my Pi3 grep can scan a file of 5 million (short) lines in less than a second.
Awk took 6 seconds for the same search. A regular expression will no doubt be much slower (grep took 25s for a simple regex).

Re: Using RPI for searching in huge .txt

Posted: Mon May 21, 2018 11:58 am
by hippy
You can write a program in almost any language which scans through a text file and pulls out lines which match some criteria. What would be best depends on how quickly you want to do that and that may be affected by what data you are looking for.

You may be able to search a smaller version of the full file which holds only the data you will be making matches on or holds tokenized data which makes matching easier and quicker.
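As a rough sketch of that, assuming (purely for illustration) comma-separated records with the product code in the first field:

```shell
# Build a much smaller search file holding only line number and product code
awk -F',' '{ print NR, $1 }' bigfile > codes.txt
# Search the small file, then fetch the matching line from the full file
line=$(awk '$2 == "10457" { print $1; exit }' codes.txt)
sed -n "${line}p" bigfile
```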

Re: Using RPI for searching in huge .txt

Posted: Mon May 21, 2018 12:26 pm
by henri_007
Thank you all!! I want to pull out the whole line that matches the criteria.
I will try perl, because after some googling I found claims that perl is fastest for this kind of job.

Re: Using RPI for searching in huge .txt

Posted: Mon May 21, 2018 12:41 pm
by piglet
I'd suspect grep will be quickest, and can use perl regular expression syntax to do the searches. What are the matching search criteria?

Re: Using RPI for searching in huge .txt

Posted: Mon May 21, 2018 12:44 pm
by henri_007
The matching criterion is the product code. In each line I have code, name, price, dateOfManufacturing, color, etc., and I need to read the information for products with code 10457 (for example).

Re: Using RPI for searching in huge .txt

Posted: Mon May 21, 2018 12:54 pm
by hippy
Create an index file of product codes with an offset to the start of the relevant line in the full file and you can probably do the lookup in microseconds.
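A rough sketch of that idea, assuming (again for illustration) comma-separated lines with the code first and single-byte characters, so awk's length() matches byte counts:

```shell
# Write "code offset" pairs, where offset is the byte position of the line start
awk -F',' '{ print $1, off; off += length($0) + 1 }' bigfile > index.txt
# Look up the offset for code 10457, then jump straight to that line
# without scanning the whole file (tail -c +N starts at byte N, 1-based)
off=$(awk '$1 == "10457" { print $2; exit }' index.txt)
tail -c +$((off + 1)) bigfile | head -n 1
```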

Re: Using RPI for searching in huge .txt

Posted: Mon May 21, 2018 1:01 pm
by piglet
If you know that the character strings making up the codes cannot appear anywhere else in the file you can do this:

grep 10457 filename

or to look for a list of known values:

grep -E "10457|22222|33333" filename


However if those codes could exist elsewhere in the lines where you don't want to look (e.g. in the price $10457.12) then you'd need a more complex search. For example, if you know that the code is at the start of the line:

grep -E "^(10457|22222|33333)" filename

(that probably makes it quicker too!)

Re: Using RPI for searching in huge .txt

Posted: Mon May 21, 2018 1:17 pm
by jahboater
henri_007 wrote:
Mon May 21, 2018 12:26 pm
Thank you all!! I want pull out whole line that match criteria.
That's exactly what grep does.
piglet wrote:
Mon May 21, 2018 1:01 pm

grep -E "10457|22222|33333" filename
This might be faster with grep -F
-F, --fixed-strings
Interpret PATTERN as a list of fixed strings (instead of regular expressions),
separated by newlines, any of which is to be matched.
In general, you probably won't beat grep for this sort of work; it's been around for a very long time and is well optimized (short of hand-building an index, anyway).
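For example, -F pairs nicely with -f to search for many fixed codes in one pass (patterns.txt and filename are placeholder names):

```shell
# One fixed string per line in patterns.txt; grep prints lines matching any of them
printf '10457\n22222\n33333\n' > patterns.txt
grep -F -f patterns.txt filename
```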

Re: Using RPI for searching in huge .txt

Posted: Mon May 21, 2018 1:23 pm
by jbudd
grep is definitely the tool for the job if you only want to find 10457

Code: Select all

for file in filename1 filename2 filename3...
do
grep -F "10457" "$file" >> matches.txt
done

cat matches.txt
Edit - I'm not sure the -F flag makes any difference in execution time.

Re: Using RPI for searching in huge .txt

Posted: Mon May 21, 2018 1:55 pm
by jahboater
jbudd wrote:
Mon May 21, 2018 1:23 pm
Edit - I'm not sure the -F flag makes any difference in execution time.
It's probably limited by disk IO speed. grep reads 32k at a time.

The documentation for fgrep does say it uses a "fast and compact" algorithm.
"time" shows no difference for short files (I don't have any 15GB files around!).
I tried a short search for non-existent string in a small 14000 line text file and counted the instructions:-

grep 1248890 instructions

grep -E 1249327 instructions

grep -F 1149336 instructions

So it might be fractionally faster, because it's not an RE.

Re: Using RPI for searching in huge .txt

Posted: Mon May 21, 2018 2:19 pm
by jbudd
I tried a short search for non-existent string in a small, 14000 line text file and counted the instructions
How do you do that?

time shows no difference in execution time between grep -F 10475 20millionlinefile and grep 10475 20millionlinefile
(1.9 seconds to write the results to a file - wow it's fast!)

Re: Using RPI for searching in huge .txt

Posted: Mon May 21, 2018 2:24 pm
by jahboater
jbudd wrote:
Mon May 21, 2018 2:19 pm
I tried a short search for non-existent string in a small 14000 line text file and counted the instructions
How do you do that?
With valgrind. Specifically:-

valgrind --tool=exp-bbv --bb-out-file=/tmp/bbv --pc-out-file=/dev/null grep fsdfsdfsdf file

I think you can use Intel's IACA instead.

From distant memory (decades ago) I remember fgrep had some clever algorithm that could search for multiple (simple) strings with little extra cost.

I bet the slow SD card speed dominates the execution time.

Re: Using RPI for searching in huge .txt

Posted: Mon May 21, 2018 2:27 pm
by droleary
While grep and other regex-capable commands are the go-to for searching, it sounds like OP has structured data that might better be put into a database and (repeatedly?) searched that way. Loading it into SQLite would be my suggestion in this case.
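A rough sketch with the sqlite3 command-line tool, assuming (for illustration only) comma-separated data with code, name, and price fields and no header row:

```shell
# Import the text file as CSV into a table, index the code column for fast
# repeated lookups, then query it; file and table names are illustrative
sqlite3 products.db <<'EOF'
CREATE TABLE products(code TEXT, name TEXT, price TEXT);
.mode csv
.import data.txt products
CREATE INDEX idx_code ON products(code);
SELECT * FROM products WHERE code = '10457';
EOF
```

The one-off import and indexing cost pays for itself if the same file is queried many times.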

Re: Using RPI for searching in huge .txt

Posted: Mon May 21, 2018 2:28 pm
by jbudd
Thanks for that.

Hmm. Mine's a Pi3B+ running from a USB3 flash drive, dunno if that's faster or slower than an SD card.

Anyway, it looks like the Pi will make mincemeat of a 100 million line input file. I know I'd have had to split it into smaller chunks on a 386 Xenix box.

Re: Using RPI for searching in huge .txt

Posted: Mon May 21, 2018 2:32 pm
by jbudd
droleary wrote:
Mon May 21, 2018 2:27 pm
While grep and other regex-capable commands are the go-to for searching, it sounds like OP has structured data that might better be put into a database and (repeatedly?) searched that way. Loading it into SQLite would be my suggestion in this case.
That reminds me of the two suggestions many years ago for a searchable company phone book:
1. Buy and set up an Oracle database. Create a form to input the desired name and display the results.
2. grep "Janet Smith" phonebook.txt

Re: Using RPI for searching in huge .txt

Posted: Mon May 21, 2018 2:37 pm
by jahboater
jbudd wrote:
Mon May 21, 2018 2:32 pm
That reminds me of the two suggestions many years ago for a searchable company phone book:
1. Buy and set up an Oracle database. Create a form to input the desired name and display the results.
2. grep "Janet Smith" phonebook.txt
:) :) :)
I remember ICL had an expensive hardware search product called CAFS.
When it came up in discussion with UNIX people, they just said: we have grep, why would we want it?

Re: Using RPI for searching in huge .txt

Posted: Mon May 21, 2018 3:01 pm
by scruss
henri_007 wrote:
Mon May 21, 2018 12:44 pm
Matching criteria is code of products. In line I have, code, name, price, dateOfManufacturing,color, etc. and I need to read informations of products with code 10457 (example)
Um, you're almost perfectly describing a database table there. While grep will work for simple queries, the shell code will get really slow and fiddly if someone asks you how many blue 10457's were made in Q3 2017.

While it's likely to be a bit slower than raw grep calls, the massively-unhelpfully-named “q” allows you to query text files as if they were an SQL database. It's in the Raspbian repo as python3-q-text-as-data.

grep's fine for easily-defined problems like a phone directory, but big text searches benefit from an engine like Lucene. It brings typo-tolerant searches, ranking, search ranges, searching while updating, … none of which grep does well, if even at all.

Re: Using RPI for searching in huge .txt

Posted: Mon May 21, 2018 3:49 pm
by piglet
scruss wrote:
Mon May 21, 2018 3:01 pm
grep's fine for easily-defined problems like a phone directory, but big text searches benefit from an engine like Lucene. It brings typo-tolerant searches, ranking, search ranges, searching while updating, … none of which grep does well, if even at all.
Ahh - but Lucene doesn't produce nice graphs of the extracted data on a smartphone well, if even at all.

Without knowing the full story of what the OP is trying to do, we can't really advise on the appropriate tool-set to achieve it. The discussion so far has been based on the little we know. I wouldn't expect it needs to be typo- (or coffee-spill-) tolerant.

Re: Using RPI for searching in huge .txt

Posted: Mon May 21, 2018 5:06 pm
by jahboater
henri_007 wrote:
Mon May 21, 2018 10:56 am
I have some huge .txt files (about 10-15 GB). Every file has about 100 million lines.
OK, for fun, I created a large file - 16GB and over 643 million lines.
pi@raspberrypi:~ $ ls -lh bigfile
-rw-r--r-- 1 pi pi 16G May 21 17:34 bigfile
pi@raspberrypi:~ $ wc -l bigfile
643565520 bigfile
pi@raspberrypi:~ $ time grep 10457 bigfile

real 12m7.402s
user 1m25.724s
sys 0m38.458s
Note that this was almost entirely disk IO limited (as you can see from the low ratio of user+sys time to real time).
The CPU usage was not even enough for the scaling governor to change from 600MHz to 1400MHz (around 16% the whole time).
The micro SD card is a SanDisk Ultra A1 app class. The text was C source code.
Search speed is 884748 lines per sec.

As usual, I am impressed that the little Pi with a tiny micro SD card can comfortably deal with such volumes of data!

Re: Using RPI for searching in huge .txt

Posted: Tue May 22, 2018 2:27 am
by scruss
piglet wrote:
Mon May 21, 2018 3:49 pm
Ahh - but Lucene doesn't produce nice graphs of the extracted data on a smartphone well, if even at all.
Ah sorry — I've got a bunch of experience in corpus linguistics, so to me, searching a file might include collocates and lemmatized searches. While the Linux text tools are pretty good, I often get the feeling here that many people think they're the be-all-and-end-all of text management. There's a lot more out there than just the builtins. Also, get a better smartphone …

My answer was mostly about q and how nifty it is, despite the name.