Working on big files

From time to time I have to work on files that exceed a certain size. When that happens, programs begin to stutter, load indefinitely or simply crash when opening these files. This can already start with files over 100 megabytes. Try to open such a text file in any IDE and you will understand the problem. Even on a high-end machine, my Vim setup takes over 20 seconds to open files over 10 MB. There are many reasons: the editor might try to do smart stuff (e.g. syntax highlighting), it might read the whole file into memory at once, and some tools do a lot of preprocessing (e.g. they “count” the lines). And that is only for reading the file. Try changing a specific line in a 100k-line text file.

Therefore, I often need to evaluate which tools can open, edit and save big files. This blog post is a reminder to myself.

Preparation

We will use a file with a large number of lines to run our examples on. If your network bandwidth is better than your processing power, you can find big text files on the Internet. For my experiments I created a file with the seq command (the -w flag pads the numbers with leading zeros so that every line has the same width):

# time seq -w 500000000 > 500kk_numbers.txt

real	4m11,968s
user	4m4,503s
sys	0m2,672s

We can check the size and number of lines with the following commands:

# wc -l 500kk_numbers.txt && ls -hl 500kk_numbers.txt 

500000000 500kk_numbers.txt
-rw-rw-r-- 1 davide davide 4,7G mar 11 16:20 500kk_numbers.txt

So our file now has 500 million lines and is nearly 5 GB in size. Here you already see two tools that are vital to our efforts: wc -l counts the newlines in the file, and ls -hl shows its size.

Checking the file

To demonstrate what I meant above about the Vim editor, look at how long it takes just to load the file:

# open file and close immediately
# time vi -c q 500kk_numbers.txt 

real	2m0,722s
user	1m55,402s
sys	0m4,455s

It takes nearly half the time it took to create the file. And this is only for “load file, close file”. So using Vim is not an option in our case.

We can start by looking at the file type:

# file 500kk_numbers.txt 
500kk_numbers.txt: ASCII text

Then we can check the first 10 lines of the file to confirm what we are looking at:

# head -n10 500kk_numbers.txt 
000000001
000000002
000000003
000000004
000000005
000000006
000000007
000000008
000000009
000000010

And then let us look at the last ten lines:

# tail -n10 500kk_numbers.txt 
499999991
499999992
499999993
499999994
499999995
499999996
499999997
499999998
499999999
500000000

You will notice that these last commands were almost instant. This is because head and tail do not scan the whole file: head stops reading after the requested lines, and tail seeks to the end instead of reading everything that comes before it. Different tools have different approaches to reading files.
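You can see the same effect by asking for bytes instead of lines. Every line in our file is exactly 10 bytes (9 digits plus the newline), so both commands print two numbers and return immediately, because they never touch the middle of the file:

# first 20 bytes, read from the start of the file
# head -c 20 500kk_numbers.txt

# last 20 bytes, read by seeking to the end of the file
# tail -c 20 500kk_numbers.txt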

sed

Read line 250 million with sed (the d command deletes, and therefore suppresses, every line, while 250000000q prints line 250 million and quits before d can run):

# time sed '250000000q;d' 500kk_numbers.txt 
250000000

real	0m8,956s
user	0m8,596s
sys	0m0,360s

Read last line with sed:

# time sed '500000000q;d' 500kk_numbers.txt 
500000000

real	0m19,345s
user	0m17,918s
sys	0m0,876s

Read line 250 million with a combination of head and tail:

# time head -n 250000000 500kk_numbers.txt | tail -1
250000000

real	0m1,824s
user	0m2,466s
sys	0m1,153s

Read the last line with a combination of head and tail:

# time head -n 500000000 500kk_numbers.txt | tail -1
500000000

real	0m3,492s
user	0m4,802s
sys	0m2,161s
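
For comparison, awk (which comes up again in the conclusion) can do the same lookup with a one-liner; a quick sketch, not timed here:

# print line 250 million, then stop reading the rest of the file
# awk 'NR == 250000000 { print; exit }' 500kk_numbers.txt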

Less is more

less is an amazing command line tool. It reads the file chunk by chunk and only loads the data you are currently looking at into memory. Try opening it at the beginning of the file without showing line numbers (-n):

# less -n 500kk_numbers.txt

You can also open the file at a specific line, which is a bit more costly to do:

# time less -n +250000000 500kk_numbers.txt

real	0m14,826s
user	0m14,578s
sys	0m0,248s

What I like about the tool is that you can easily scroll through a file. Search is still slow, but scrolling (e.g. with Page Down) is pretty fast. Jumping to the start of the file (pressing g) or to the end (pressing Shift+g) is also instantaneous if you have disabled line numbers. If you type a number before the command, you jump to that specific line, e.g. 500g jumps to line 500.
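
Two related invocations build on the same idea (a sketch; the flags are standard in GNU less, but check your version):

# open the file directly at its end
# less -n +G 500kk_numbers.txt

# open the file at the first line matching a pattern (this uses the slow search)
# less -n -p 499999999 500kk_numbers.txt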

Unfortunately less is not an editor and cannot modify files. For this we need other tools.

Editing large and small parts

There are lots of editors that can handle large files, if you are willing to wait a few seconds or minutes for the file to load. Spoiler: I have not found a single editor that instantly loads such a file for editing. If you know one, for any platform, please leave a comment below!

Oftentimes I do not even need a full-blown editor. I just need to make small changes, maybe to a single line, or search and replace strings in the file.

sed is your friend

Remember sed, the “stream editor for filtering and transforming text” we already used above? It is the best built-in tool we have on Linux for performing operations on text. It is also a pretty typical Linux tool: cryptic, with every solution requiring a deep dive into the manual, and with a number of obscure command line parameters that speed up or slow down the process. Therefore every example shown here comes either from StackOverflow or from other sources far more invested in reading man pages than I am.

sed uses the following pattern to define a search-and-replace operation:

s/search/replace/[options]
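
A quick way to see the pattern in action without touching the big file (the sample string is arbitrary):

# echo "level 10" | sed 's/l/L/g'
LeveL 10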

You can enable extended regular expressions with the -E parameter (an example follows further below). For example, replacing all zeros in our file with the letter Z:

# time sed 's/0/Z/g' 500kk_numbers.txt > NO_0.txt

real	2m10,011s
user	2m4,069s
sys	0m3,567s

This command replaces every occurrence (the g, “global”, option) of zero in every line with Z. According to this StackOverflow comment, there is a faster method for big files, which did not really work for me:

# time sed '/0/ s//Z/g' 500kk_numbers.txt > NO_0_fast.txt

real	2m43,308s
user	2m37,130s
sys	0m3,627s
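
As mentioned above, the -E flag switches sed to extended regular expressions, which helps when the pattern is more than a literal string. A small sketch (assuming GNU sed; the output file name is just a placeholder) that collapses every run of consecutive zeros into a single Z instead of replacing them one by one:

# sed -E 's/0+/Z/g' 500kk_numbers.txt > NO_0_runs.txt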

By the way, the above commands create a new file, which is not optimal if your disk space is limited. You can edit the file in place with the -i parameter (note that GNU sed still writes the result to a temporary file and renames it afterwards, so you temporarily need room for a second copy), e.g.:

# sed -i 's/0/Z/g' 500kk_numbers.txt

Conclusion

As you can see, you can make small changes to big files with built-in Linux tools, and every other editor I have tried fails to provide such a fast service. Of course, if you want to do more complicated substitutions and editing, you either:

  • start adding even more tools to the pipe (awk is another strong contender; see the sketch below)
  • use custom scripts in Python, Bash or Perl
  • split the file up into smaller chunks (also sketched below)

…or you just take your time, get a coffee while your favourite editor starts up, and hope for the best!
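
If you do go down the tool route, here is a rough sketch of the first and third options (the chunk prefix and output file names are just placeholders, and nothing here is timed):

# awk: replace all zeros with Z, like the sed example above
# awk '{ gsub(/0/, "Z"); print }' 500kk_numbers.txt > NO_0_awk.txt

# split: cut the file into chunks of 100 million lines each...
# split -l 100000000 500kk_numbers.txt chunk_

# ...work on the chunks individually, then glue them back together
# cat chunk_* > 500kk_numbers_edited.txt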
