From time to time I have to work on files that exceed a certain size. When that happens, programs begin to stutter, load indefinitely or simply crash when opening these files. This can already start with files over 100 megabytes. Try to open such a text file in any IDE and you will understand the problem. Even on a high-end machine, my Vim setup takes over 20 seconds to open files over 10 MB. There are a lot of reasons: the editor might try to do smart stuff (e.g. syntax highlighting), it might try to read the whole file into memory at once, and some tools do a lot of preprocessing (e.g. they “count” the lines). And this is only for reading the file. Now try to change a specific line in a 100k-line text file.
Therefore, I often need to evaluate which tools can open, edit and save big files. This blog post is a reminder to myself.
Preparation
We will use a file with a large number of lines to run our examples on. If your network bandwidth is better than your processing power, you can find big text files on the Internet. For my experiments I created a file with the seq command:
# time seq -w 500000000 > 500kk_numbers.txt
real 4m11,968s
user 4m4,503s
sys 0m2,672s
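By the way, if you do not need every line to be distinct, yes piped into head is usually a much quicker way to generate a big test file. Just a sketch, not part of the measurements in this post:
# 500 million identical lines; fast to generate, but every line looks the same
# yes "0123456789" | head -n 500000000 > 500kk_repeated.txt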
We can check the size and number of lines with the following commands:
# wc -l 500kk_numbers.txt && ls -hl 500kk_numbers.txt
500000000 500kk_numbers.txt
-rw-rw-r-- 1 davide davide 4,7G mar 11 16:20 500kk_numbers.txt
So our file now has 500 million lines and is nearly 5 GB in size. You already see two tools that are vital to our efforts: wc -l counts the newlines in our file, and ls -hl shows the file size.
Checking the file
To demonstrate what I meant above about Vim, look at how long it takes just to load the file:
# open file and close immediately
# time vi -c q 500kk_numbers.txt
real 2m0,722s
user 1m55,402s
sys 0m4,455s
It takes nearly half the time it took to create the file. And this is only for “load file, close file”. So using Vim is not an option in our case.
We can start by looking at the file type:
# file 500kk_numbers.txt
500kk_numbers.txt: ASCII text
Then we can check the first 10 lines of the file to confirm what we are looking at:
# head -n10 500kk_numbers.txt
000000001
000000002
000000003
000000004
000000005
000000006
000000007
000000008
000000009
000000010
And then let us look at the last ten lines:
# tail -n10 500kk_numbers.txt
499999991
499999992
499999993
499999994
499999995
499999996
499999997
499999998
499999999
500000000
You will notice that these last commands returned almost instantly. This is because head only reads the beginning of the file, while tail can seek straight to the end without touching anything in between. Different tools have different approaches to reading files.
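For example, tail on its own does not read the file from the front at all; it seeks to the end and reads backwards, so even on our 5 GB file this returns immediately:
# tail seeks to the end of the file instead of reading all of it
# tail -n1 500kk_numbers.txt
500000000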
sed
Read line 250 million with sed:
# time sed '250000000q;d' 500kk_numbers.txt
250000000
real 0m8,956s
user 0m8,596s
sys 0m0,360s
Read the last line with sed:
# time sed '500000000q;d' 500kk_numbers.txt
500000000
real 0m19,345s
user 0m17,918s
sys 0m0,876s
Read line 250 million with a combination of head and tail:
# time head -n 250000000 500kk_numbers.txt | tail -1
250000000
real 0m1,824s
user 0m2,466s
sys 0m1,153s
Read the last line with a combination of head and tail:
# time head -n 500000000 500kk_numbers.txt | tail -1
500000000
real 0m3,492s
user 0m4,802s
sys 0m2,161s
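For completeness, awk can do the same line lookup. This is not part of the measurements above, just a sketch of an alternative worth knowing:
# print line 250 million, then exit so awk does not scan the rest of the file
# awk 'NR==250000000 {print; exit}' 500kk_numbers.txt
250000000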
Less is more
less is an amazing command line tool. It reads the file chunk by chunk and only loads the data you are currently looking at into memory. Try opening the file at the beginning without showing line numbers (the -n parameter):
# less -n 500kk_numbers.txt
You can also open the file at a specific line, which is a bit more costly to do:
# time less -n +250000000 500kk_numbers.txt
real 0m14,826s
user 0m14,578s
sys 0m0,248s
What I like about the tool is that you can easily scroll through a file. Search is still slow, but scrolling (e.g. with Page Down) is pretty fast. Jumping to the start of the file (pressing g) or to the end (pressing Shift+g) is also instantaneous if you disabled line numbers. If you type a number before the command, you jump to that specific line, e.g. 500g jumps to line 500.
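If you only want to peek at a window of lines without starting an interactive pager at all, sed can print a range and stop. A small sketch; the final q keeps sed from reading the remaining lines:
# print lines 1000 to 1010, then quit immediately
# sed -n '1000,1010p;1010q' 500kk_numbers.txt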
Unfortunately, less is not an editor and cannot modify files. For this we need other tools.
Editing large and small parts
There are lots of editors that can handle large files, if you are willing to wait a few seconds or minutes for the file to load. Spoiler: there is not a single editor that instantly opens a file for editing. If you have found one, for any platform, please write a comment below!
Oftentimes I do not even need a full-blown editor. I just need small changes, maybe even to only one line. Sometimes you want to search-and-replace strings in the file.
sed is your friend
Remember sed, the “stream editor for filtering and transforming text” we already used above? It is the best built-in tool we have on Linux to perform operations on text. It is also a pretty standard Linux tool in the sense that it is cryptic, every solution requires diving deep into the manual, and there are a number of obscure command line parameters that speed the process up or slow it down. Therefore every example shown here is taken either from StackOverflow or from other sources that are far more invested in reading man pages than I am.
The sed tool uses the following string pattern to define search-and-replace strings:
s/search/replace/[options]
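Before unleashing a pattern on a 5 GB file, it is worth testing it on a throwaway string first, e.g. by piping echo into sed:
# test the substitution on a small sample first
# echo "a0b00c" | sed 's/0/Z/g'
aZbZZc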
You can enable extended regular expressions with the -E parameter, though a basic pattern is enough here. For example, replacing all zeros in our file with the letter Z:
# time sed 's/0/Z/g' 500kk_numbers.txt > NO_0.txt
real 2m10,011s
user 2m4,069s
sys 0m3,567s
This command replaces every occurrence of zero in every line with Z (the g option stands for “global”). According to this StackOverflow comment there is a faster method for big files, which actually turned out slower in my case:
# time sed '/0/ s//Z/g' 500kk_numbers.txt > NO_0_fast.txt
real 2m43,308s
user 2m37,130s
sys 0m3,627s
By the way, the above commands create a new file, which is not optimal if your disk space is limited. You can edit the file directly by using the -i (“in place”) parameter, e.g.:
# sed -i 's/0/Z/g' 500kk_numbers.txt
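And since I said earlier that sometimes only one line needs to change: with a line address in front of the s command, sed touches just that line. The line number and replacement text here are made up for illustration:
# replace only line 1234; every other line passes through untouched
# sed -i '1234s/.*/my corrected line/' 500kk_numbers.txt
Keep in mind that -i still streams the whole file into a temporary copy, so it is not instant either, but it beats waiting for an editor to load.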
Conclusion
As you can see, you can make small changes to big files with built-in Linux tools, while every editor I have seen fails to provide a comparably fast experience. Of course, if you want to do more complicated substitutions and editing, you either:
- start adding even more tools to the pipe (awk is another strong contender)
- use custom scripts in Python, Bash or Perl
- split the file up into smaller chunks (see the sketch below)
…or you just take your time, get a coffee while your favourite editor starts up and hope for the best!
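Coming back to the split option from the list above, a minimal sketch with GNU coreutils; the chunk size is an arbitrary choice:
# split into chunks of 50 million lines each: part_aa, part_ab, ...
# split -l 50000000 500kk_numbers.txt part_
# edit the chunk you need, then stitch everything back together
# cat part_* > 500kk_numbers_edited.txt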