From time to time I have to work on files that exceed certain sizes. When that happens, programs begin to stutter, load indefinitely or simply crash when opening these files. This can already start with files over 100 megabytes. Try to open such a text file in any IDE and you will understand the problem. Even on a high-end machine, my Vim setup takes over 20 seconds to open files over 10 MB. There are many reasons: the editor might try to do smart things (e.g. syntax highlighting), it might read the whole file into memory at once, and some tools do a lot of preprocessing (e.g. they “count” the lines). And this is only for reading the file. Now try to change a specific line in a 100k-line text file.
Therefore, I often need to evaluate which tools can open, edit and save big files. This blog post is a reminder to myself.
We will use a file with a large number of lines to run our examples on. If your network bandwidth is better than your processing power, you can find big text files on the Internet. For my experiments I created a file with the seq command:
# time seq -w 500000000 > 500kk_numbers.txt

real	4m11,968s
user	4m4,503s
sys	0m2,672s
We can check the size and number of lines with the following commands:
# wc -l 500kk_numbers.txt && ls -hl 500kk_numbers.txt
500000000 500kk_numbers.txt
-rw-rw-r-- 1 davide davide 4,7G mar 11 16:20 500kk_numbers.txt
So our file now has 500 million lines and is nearly 5 GB in size. You can already see two tools that are vital to our efforts:
- wc -l counts the newlines in our file, and
- ls -hl shows the file size.
Checking the file
To demonstrate what I meant above about the Vim editor, look at how long it takes just to load the file:
# open file and close immediately
# time vi -c q 500kk_numbers.txt

real	2m0,722s
user	1m55,402s
sys	0m4,455s
It takes nearly half the time it took to create the file. And this is only for “load file, close file”. So using Vim is not an option in our case.
We can start by looking at the file type:
# file 500kk_numbers.txt
500kk_numbers.txt: ASCII text
Then we can check the first 10 lines of the file to confirm what we are looking at:
# head -n10 500kk_numbers.txt
000000001
000000002
000000003
000000004
000000005
000000006
000000007
000000008
000000009
000000010
And then let us look at the last ten lines:
# tail -n10 500kk_numbers.txt
499999991
499999992
499999993
499999994
499999995
499999996
499999997
499999998
499999999
500000000
You will notice that these last commands returned almost instantly. This is because different tools take different approaches to reading files: head and tail only read as much of the file as they need, instead of loading all of it.
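In fact, if all you need is the end of the file, tail alone is the cheapest option of all, because it seeks to the end of the file and reads backwards rather than scanning through all 500 million lines. A quick sketch, using the file we created above:

```shell
# tail seeks to the end of the file instead of reading it from the start,
# so this returns immediately even on a multi-gigabyte file
tail -n1 500kk_numbers.txt
```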
Read line 250 million with sed:
# time sed '250000000q;d' 500kk_numbers.txt
250000000

real	0m8,956s
user	0m8,596s
sys	0m0,360s
Read the last line with sed:
# time sed '500000000q;d' 500kk_numbers.txt
500000000

real	0m19,345s
user	0m17,918s
sys	0m0,876s
Read line 250 million with a combination of head and tail:
# time head -n 250000000 500kk_numbers.txt | tail -1
250000000

real	0m1,824s
user	0m2,466s
sys	0m1,153s
Read the last line with a combination of head and tail:
# time head -n 500000000 500kk_numbers.txt | tail -1
500000000

real	0m3,492s
user	0m4,802s
sys	0m2,161s
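Another common approach for reading line N is awk. I have not benchmarked it on this file, so take the sketch with a grain of salt, but the important part is the exit: it lets awk stop as soon as the requested line has been printed instead of scanning the rest of the file:

```shell
# print line 250 million and stop; without the exit,
# awk would keep reading all remaining lines
awk 'NR==250000000{print; exit}' 500kk_numbers.txt
```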
Less is more
less is an amazing command line tool. It reads a file chunk by chunk and only loads the data you are currently looking at into memory. Try opening our file at the beginning, without showing line numbers (the -n parameter):
# less -n 500kk_numbers.txt
You can also open the file at a specific line, which is a bit more costly to do:
# time less -n +250000000 500kk_numbers.txt

real	0m14,826s
user	0m14,578s
sys	0m0,248s
What I like about the tool is that you can easily scroll through a file. Search is still slow, but scrolling (e.g. with Page Down) is pretty fast. Jumping to the start of the file (pressing g) or to the end (pressing Shift+g) is also instantaneous if you disabled line numbers. If you type a number before the command, you jump to that specific line, e.g. 500g jumps to line 500.
Unfortunately, less is not an editor and cannot modify files. For this we need other tools.
Editing large and small parts
There are lots of editors that can handle large files, if you are willing to wait a few seconds or minutes for the file to load. Spoiler: there is not a single editor that instantly opens a file for editing. If you have found one, for any platform, please write a comment below!
Often I do not even need a full-blown editor. I just need small changes, maybe only to a single line. Sometimes you just want to search-and-replace strings in the file.
sed is your friend
Remember sed, the “stream editor for filtering and transforming text” we already used above? It is the best built-in tool we have on Linux to perform operations on text. sed is a pretty typical Linux tool: it is cryptic, every solution requires diving deep into the manual, and there are a number of obscure command line parameters that speed the process up or slow it down. Therefore every example shown here comes either from StackOverflow or from other sources far more invested in reading man pages than I am.
The sed tool uses the s/pattern/replacement/flags syntax to define search-and-replace operations. You can enable extended regular expressions with the -E parameter. For example, replacing all zeros in our example with the letter Z:
# time sed 's/0/Z/g' 500kk_numbers.txt > NO_0.txt

real	2m10,011s
user	2m4,069s
sys	0m3,567s
This command replaces every occurrence (the g “global” flag) of zero in every line with Z. According to this StackOverflow comment there is a faster method for big files, which did not really work for me:
# time sed '/0/ s//Z/g' 500kk_numbers.txt > NO_0_fast.txt

real	2m43,308s
user	2m37,130s
sys	0m3,627s
By the way, the above commands create a new file, which is not optimal if your disk space is limited. You can edit the file directly by using the -i (“in place”) parameter, e.g.:
# sed -i 's/0/Z/g' 500kk_numbers.txt
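This also covers the “change a specific line” problem from the beginning of the post, since sed can address a single line by number. A sketch (the line number and replacement text here are made up for illustration):

```shell
# replace the entire content of line 1234 in place;
# the '1234' address restricts the substitution to that one line
sed -i '1234s/.*/REPLACED/' 500kk_numbers.txt
```

Note that -i still rewrites the whole file behind the scenes, so this is about as slow as the substitution examples above; it just saves you the second copy on disk.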
As you can see, you can make small changes to big files with built-in Linux tools at a speed no editor I have seen can match. Of course, if you want to do more complicated substitutions and editing, you either:
- start adding even more tools to the pipe (awk is another strong contender)
- use custom scripts in Python, Bash or Perl
- split the file up into smaller chunks
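The split option can look like this (chunk size and file name prefix chosen arbitrarily for illustration); after editing the smaller pieces with your tool of choice, cat glues them back together in glob order:

```shell
# cut the file into chunks of 50 million lines each,
# named chunk_aa, chunk_ab, ... in the current directory
split -l 50000000 500kk_numbers.txt chunk_

# reassemble the (possibly edited) chunks into one file
cat chunk_* > 500kk_numbers_edited.txt
```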
…or you just take your time, get a coffee while your favourite editor starts up, and hope for the best!