henri_007
Posts: 12
Joined: Wed Feb 10, 2016 9:27 pm

Using RPI for searching in huge .txt

Mon May 21, 2018 10:56 am

I have some huge .txt files (about 10-15 GB). Every file has about 100 million lines. Now I have to pull some information out of those files. I'm thinking of using an RPI for that job. Is there any software you can recommend for that? Or any other way that suits my situation?

Thank you!

piglet
Posts: 909
Joined: Sat Aug 27, 2011 1:16 pm

Re: Using RPI for searching in huge .txt

Mon May 21, 2018 11:09 am

It depends what you want to get out of the file.

To find lines matching some string, I'd use "grep" from a console:

grep "what you want to find" filename

For complex extraction I'd probably use perl as it's very fast.
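For example, a rough sketch (assuming comma-separated lines with the code in the first field - adjust to the real layout):

# print whole lines whose first field is 10457
perl -F, -lane 'print if $F[0] eq "10457"' filename

# or pull out just the 2nd and 3rd fields of those lines
perl -F, -lane 'print "$F[1] $F[2]" if $F[0] eq "10457"' filename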

B.Goode
Posts: 8561
Joined: Mon Sep 01, 2014 4:03 pm
Location: UK

Re: Using RPI for searching in huge .txt

Mon May 21, 2018 11:38 am

An RPi running the Raspbian Operating System can in principle do anything a Linux 'mainframe' computer can do.

But it will take a lot longer...

Try it and see. But I have a feeling you will find it frustrating.

jbudd
Posts: 990
Joined: Mon Dec 16, 2013 10:23 am

Re: Using RPI for searching in huge .txt

Mon May 21, 2018 11:45 am

grep prints out lines from a file which match a string or regular expression.
awk can do more complex selection, eg "print out the 7th and 8th word on each line which contains 'foo' if it follows a line which contains 'bar'".
perl offers much the same as awk, plus some.

You can use these tools as "filters" - they don't need to read the whole input file into memory, they just watch the data flow past and grab the bits of interest. So your big input file shouldn't be a problem. But just in case it is, split lets you break it up into multiple files of eg 1000000 lines.
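As a rough sketch of that awk example (the words "foo" and "bar" and the field numbers are just placeholders):

# print the 7th and 8th word of any line containing "foo"
# that immediately follows a line containing "bar"
awk '/foo/ && prev ~ /bar/ { print $7, $8 } { prev = $0 }' filename

# and split can break a huge file into 1,000,000-line chunks (chunk_aa, chunk_ab, ...)
split -l 1000000 hugefile.txt chunk_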

On my Pi3 grep can scan a file of 5 million (short) lines in less than a second.
Awk took 6 seconds for the same search. A regular expression search is much slower (grep took 25s for a simple regex).

hippy
Posts: 5959
Joined: Fri Sep 09, 2011 10:34 pm
Location: UK

Re: Using RPI for searching in huge .txt

Mon May 21, 2018 11:58 am

You can write a program in almost any language which scans through a text file and pulls out lines which match some criteria. What would be best depends on how quickly you want to do that and that may be affected by what data you are looking for.

You may be able to search a smaller version of the full file which holds only the data you will be making matches on or holds tokenized data which makes matching easier and quicker.
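A rough sketch of the smaller-file idea (assuming comma-separated lines with the code as the first field):

# one-off pass: keep just "lineNumber code" in a much smaller companion file
awk -F, '{ print NR, $1 }' bigfile.txt > codes.txt

# searching the small file is far cheaper than scanning the full 15GB file;
# then pull the matching line(s) out of the big file by line number
for n in $(awk '$2 == "10457" { print $1 }' codes.txt)
do
    sed -n "${n}p;${n}q" bigfile.txt
done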

henri_007
Posts: 12
Joined: Wed Feb 10, 2016 9:27 pm

Re: Using RPI for searching in huge .txt

Mon May 21, 2018 12:26 pm

Thank you all! I want to pull out the whole line that matches the criteria.
I will try perl, because after some googling I found that perl is the fastest for this job - is that right?

piglet
Posts: 909
Joined: Sat Aug 27, 2011 1:16 pm

Re: Using RPI for searching in huge .txt

Mon May 21, 2018 12:41 pm

I'd suspect grep will be quickest, and it can use perl regular expression syntax to do the searches. What are the matching search criteria?
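For reference, GNU grep's -P option selects Perl-compatible regex syntax; a purely illustrative example (the pattern is made up):

grep -P '^\d+,' filename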

henri_007
Posts: 12
Joined: Wed Feb 10, 2016 9:27 pm

Re: Using RPI for searching in huge .txt

Mon May 21, 2018 12:44 pm

The matching criterion is the product code. In each line I have code, name, price, dateOfManufacturing, color, etc., and I need to read the information for products with code 10457 (for example).

hippy
Posts: 5959
Joined: Fri Sep 09, 2011 10:34 pm
Location: UK

Re: Using RPI for searching in huge .txt

Mon May 21, 2018 12:54 pm

Create an index file of product codes with an offset to the start of the relevant line in the full file and you can probably do the lookup in microseconds.
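A rough sketch of that (assuming comma-separated lines with the code first; awk's length() counts bytes only for plain ASCII text, so treat this as illustrative):

# one-off pass: record "code byteOffset" for the start of every line
awk -F, 'BEGIN { off = 0 } { print $1, off; off += length($0) + 1 }' bigfile.txt > index.txt

# lookup: find the offset for a code, then jump straight to that byte
off=$(awk '$1 == "10457" { print $2; exit }' index.txt)
tail -c +$((off + 1)) bigfile.txt | head -n 1

Sort index.txt and you could use look(1) for a true binary search instead of scanning the index.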

piglet
Posts: 909
Joined: Sat Aug 27, 2011 1:16 pm

Re: Using RPI for searching in huge .txt

Mon May 21, 2018 1:01 pm

If you know that the character strings making up the codes cannot appear anywhere else in the file you can do this:

grep 10457 filename

or to look for a list of known values:

grep -E "10457|22222|33333" filename


However if those codes could exist elsewhere in the lines where you don't want to look (e.g. in the price $10457.12) then you'd need a more complex search. For example, if you know that the code is at the start of the line:

grep -E "^(10457|22222|33333)" filename

(that probably makes it quicker too!)

jahboater
Posts: 4685
Joined: Wed Feb 04, 2015 6:38 pm

Re: Using RPI for searching in huge .txt

Mon May 21, 2018 1:17 pm

henri_007 wrote:
Mon May 21, 2018 12:26 pm
Thank you all! I want to pull out the whole line that matches the criteria.
That's exactly what grep does.
piglet wrote:
Mon May 21, 2018 1:01 pm

grep -E "10457|22222|33333" filename
This might be faster with grep -F:
-F, --fixed-strings
Interpret PATTERN as a list of fixed strings (instead of regular expressions),
separated by newlines, any of which is to be matched.
In general you probably won't beat grep for this sort of work; it's been around for a very long time and is well optimized (without hand-building an index, anyway).
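If the list of codes gets long, the fixed strings can also be read from a file, one per line (codes.txt here is just a hypothetical name):

grep -F -f codes.txt filename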

jbudd
Posts: 990
Joined: Mon Dec 16, 2013 10:23 am

Re: Using RPI for searching in huge .txt

Mon May 21, 2018 1:23 pm

grep is definitely the tool for the job if you only want to find 10457

Code:

for file in filename1 filename2 filename3...
do
    grep -F "10457" "$file" >> matches.txt
done

cat matches.txt
Edit - I'm not sure the -F flag makes any difference in execution time.

jahboater
Posts: 4685
Joined: Wed Feb 04, 2015 6:38 pm

Re: Using RPI for searching in huge .txt

Mon May 21, 2018 1:55 pm

jbudd wrote:
Mon May 21, 2018 1:23 pm
Edit - I'm not sure the -F flag makes any difference in execution time.
It's probably limited by disk IO speed; grep reads 32k at a time.

The documentation for fgrep does say it uses a "fast and compact" algorithm.
"time" shows no difference for short files (I don't have any 15GB files around!).
I tried a short search for non-existent string in a small 14000 line text file and counted the instructions:-

grep 1248890 instructions

grep -E 1249327 instructions

grep -F 1149336 instructions

So it might be fractionally faster, because it's not an RE.

jbudd
Posts: 990
Joined: Mon Dec 16, 2013 10:23 am

Re: Using RPI for searching in huge .txt

Mon May 21, 2018 2:19 pm

jahboater wrote:
Mon May 21, 2018 1:55 pm
I tried a short search for non-existent string in a small, 14000 line text file and counted the instructions
How do you do that?

time shows no difference in execution time between grep -F 10475 20millionlinefile and grep 10475 20millionlinefile
(1.9 seconds to write the results to a file - wow it's fast!)

jahboater
Posts: 4685
Joined: Wed Feb 04, 2015 6:38 pm

Re: Using RPI for searching in huge .txt

Mon May 21, 2018 2:24 pm

jbudd wrote:
Mon May 21, 2018 2:19 pm
I tried a short search for non-existent string in a small 14000 line text file and counted the instructions
How do you do that?
With valgrind. Specifically:-

valgrind --tool=exp-bbv --bb-out-file=/tmp/bbv --pc-out-file=/dev/null grep fsdfsdfsdf file

I think you can use Intel's IACA instead.

From distant memory (decades ago) I remember fgrep had some clever algorithm that could search for multiple (simple) strings with little extra cost.

I bet the slow SD card speed dominates the execution time.
Last edited by jahboater on Mon May 21, 2018 2:27 pm, edited 1 time in total.

droleary
Posts: 174
Joined: Fri Feb 09, 2018 3:45 am
Location: Minneapolis, MN USA

Re: Using RPI for searching in huge .txt

Mon May 21, 2018 2:27 pm

While grep and other regex-capable commands are the go-to for searching, it sounds like OP has structured data that might better be put into a database and (repeatedly?) searched that way. Loading it into SQLite would be my suggestion in this case.
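A minimal sketch of that route, assuming the file is comma-separated and has a header row naming the columns (otherwise create the table by hand before importing):

# one-off import; .import creates the table from the header row
sqlite3 products.db <<'EOF'
.mode csv
.import products.txt products
CREATE INDEX idx_code ON products(code);
EOF

# after that, queries are quick and flexible
sqlite3 products.db "SELECT * FROM products WHERE code = '10457';"

(products.txt, the products table and the code column name are assumptions based on the OP's description.)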

jbudd
Posts: 990
Joined: Mon Dec 16, 2013 10:23 am

Re: Using RPI for searching in huge .txt

Mon May 21, 2018 2:28 pm

Thanks for that.

Hmm. Mine's a Pi3B+ running from a USB3 flash drive, dunno if that's faster or slower than an SD card.

Anyway it looks like the Pi will make mincemeat of a 100 million line input file. I know I'd have had to split it into smaller chunks on a 386 Xenix box.

jbudd
Posts: 990
Joined: Mon Dec 16, 2013 10:23 am

Re: Using RPI for searching in huge .txt

Mon May 21, 2018 2:32 pm

droleary wrote:
Mon May 21, 2018 2:27 pm
While grep and other regex-capable commands are the go-to for searching, it sounds like OP has structured data that might better be put into a database and (repeatedly?) searched that way. Loading it into SQLite would be my suggestion in this case.
That reminds me of the two suggestions many years ago for a searchable company phone book:
1. Buy and set up an Oracle database. Create a form to input the desired name and display the results.
2. grep "Janet Smith" phonebook.txt

jahboater
Posts: 4685
Joined: Wed Feb 04, 2015 6:38 pm

Re: Using RPI for searching in huge .txt

Mon May 21, 2018 2:37 pm

jbudd wrote:
Mon May 21, 2018 2:32 pm
That reminds me of the two suggestions many years ago for a searchable company phone book:
1. Buy and set up an Oracle database. Create a form to input the desired name and display the results.
2. grep "Janet Smith" phonebook.txt
:) :) :)
I remember ICL had an expensive hardware search thing called CAFS.
In discussions with UNIX people, they just said: we have grep - why would we want it?

scruss
Posts: 2480
Joined: Sat Jun 09, 2012 12:25 pm
Location: Toronto, ON

Re: Using RPI for searching in huge .txt

Mon May 21, 2018 3:01 pm

henri_007 wrote:
Mon May 21, 2018 12:44 pm
The matching criterion is the product code. In each line I have code, name, price, dateOfManufacturing, color, etc., and I need to read the information for products with code 10457 (for example).
Um, you're almost perfectly describing a database table there. While grep will work for simple queries, the shell code will get really slow and fiddly if someone asks you how many blue 10457's were made in Q3 2017.

While it's likely to be a bit slower than raw grep calls, the massively-unhelpfully-named “q” allows you to query text files as if they were an SQL database. It's in the Raspbian repo as python3-q-text-as-data.
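For example, something along these lines (the comma delimiter and column position are assumptions, and the exact quoting depends on how q infers the column types; with -H it would use a header row and real column names instead of c1, c2, ...):

q -d , "SELECT * FROM ./products.txt WHERE c1 = '10457'"

It's still a full scan of the file each time, but you get SQL niceties like WHERE and GROUP BY for free.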

grep's fine for easily-defined problems like a phone directory, but big text searches benefit from an engine like Lucene. It brings typo-tolerant searches, ranking, search ranges, searching while updating, … none of which grep does well, if even at all.
‘Remember the Golden Rule of Selling: “Do not resort to violence.”’ — McGlashan.

piglet
Posts: 909
Joined: Sat Aug 27, 2011 1:16 pm

Re: Using RPI for searching in huge .txt

Mon May 21, 2018 3:49 pm

scruss wrote:
Mon May 21, 2018 3:01 pm
grep's fine for easily-defined problems like a phone directory, but big text searches benefit from an engine like Lucene. It brings typo-tolerant searches, ranking, search ranges, searching while updating, … none of which grep does well, if even at all.
Ahh - but Lucene doesn't produce nice graphs of the extracted data on a smartphone well, if even at all.

Without knowing the full story of what the OP is trying to do, we can't really advise on the appropriate tool-set to achieve it. Discussion so far has been based on the little we know. I wouldn't expect it needs to be typo- (or coffee-spill-) tolerant.

jahboater
Posts: 4685
Joined: Wed Feb 04, 2015 6:38 pm

Re: Using RPI for searching in huge .txt

Mon May 21, 2018 5:06 pm

henri_007 wrote:
Mon May 21, 2018 10:56 am
I have some huge .txt files (about 10-15 GB). Every file has about 100 million lines.
OK, for fun, I created a large file - 16GB and over 643 million lines.
$ ls -lh bigfile
-rw-r--r-- 1 pi pi 16G May 21 17:34 bigfile
$ wc -l bigfile
643565520 bigfile
$ time grep 10457 bigfile

real 12m7.402s
user 1m25.724s
sys 0m38.458s
Note that this was 100% disk IO limited (you can see that the user and sys CPU times are tiny compared to the real elapsed time).
The CPU usage was not even enough for the scaling governor to change from 600MHz to 1400MHz (around 16% all the time).
The micro SD card is a SanDisk Ultra A1 app class. The text was C source code.
Search speed was about 884,748 lines per second.

As usual, I am impressed that the little Pi with a tiny micro SD card can comfortably deal with such volumes of data!

scruss
Posts: 2480
Joined: Sat Jun 09, 2012 12:25 pm
Location: Toronto, ON

Re: Using RPI for searching in huge .txt

Tue May 22, 2018 2:27 am

piglet wrote:
Mon May 21, 2018 3:49 pm
Ahh - but Lucene doesn't produce nice graphs of the extracted data on a smartphone well, if even at all.
Ah sorry — I've got a bunch of experience in corpus linguistics, so to me, searching a file might include collocates and lemmatized searches. While the Linux text tools are pretty good, I often get the feeling here that many people think they're the be-all-and-end-all of text management. There's a lot more out there than just the builtins. Also, get a better smartphone …

My answer was mostly about q and how nifty it is, despite the name.
‘Remember the Golden Rule of Selling: “Do not resort to violence.”’ — McGlashan.
