Tip 8: Data Manipulation in Unix
Latest revision as of 22:24, 12 March 2015
In most projects, you ultimately have some data in rows and columns in text files. How do you get it there? I assume that you are using "grep" to extract it from a file. How do you manipulate it? That is what I will show now.
There are a few VERY useful Unix commands: paste, cut, sort, and uniq.
paste
paste allows you to append files horizontally, line-by-line. Suppose you have file1:
1
2
3
4
5
and file2:
2
4
6
8
10
12
14
and you run
paste file1 file2
It will output:
1	2
2	4
3	6
4	8
5	10
	12
	14
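The example above can be reproduced with the sketch below. The scratch directory and the output names joined.tsv and joined.csv are only illustrative. It also shows the "-d" option, which swaps the default tab separator for another character, and what happens once the shorter file runs out of lines.

```shell
# Work in a scratch directory so nothing real is overwritten.
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1

# Recreate the two input files, one number per line.
printf '%s\n' 1 2 3 4 5 > file1
printf '%s\n' 2 4 6 8 10 12 14 > file2

# Default behaviour: columns are joined with a tab.
paste file1 file2 > joined.tsv

# -d picks a different separator, here a comma.
paste -d ',' file1 file2 > joined.csv

# Once file1 is exhausted, its column is simply left empty.
cat joined.csv
```

The empty first column on the last two rows is easy to miss with tabs, which is one reason to try a visible separator such as a comma when checking the result.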
If one of the files is specified as "-", it will use stdin. This means you can gather data like this:
myprog | grep somekeyword > file.dat
myprog -option2 | grep somekeyword | paste - file.dat > file2.dat
Note, however, that you must redirect the output to a separate file (file2.dat cannot be the same as file.dat). Otherwise the shell empties file.dat as soon as it sets up the redirection, before paste has read it, and the existing data is lost!
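If you do want to keep accumulating columns in a single file, one common workaround is to write to a temporary file and then rename it over the original. The sketch below assumes this pattern fits your workflow. The myprog function is only a stand-in with made-up output so that the example runs, and file.dat.tmp is an arbitrary name.

```shell
# Work in a scratch directory so nothing real is overwritten.
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1

# Stand-in for the real program; its output format is invented for this demo.
myprog() {
    if [ "$1" = "-option2" ]; then
        printf 'somekeyword 10\nother 0\nsomekeyword 20\n'
    else
        printf 'somekeyword 1\nother 0\nsomekeyword 2\n'
    fi
}

# First run: start the data file.
myprog | grep somekeyword > file.dat

# Second run: paste into a temporary file, then replace the original
# only if paste succeeded.
myprog -option2 | grep somekeyword | paste - file.dat > file.dat.tmp &&
    mv file.dat.tmp file.dat

cat file.dat
```

Because file.dat is only replaced after paste has finished reading it, the earlier columns survive.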
cut
cut is the opposite of paste. It allows you to extract columns of data based on delimiters. For example, if you have a file like this:
x = 1
y = 2
z = 3
and you run:
cut -d '=' -f 2
it will extract column 2 and print it out (each value keeps the space that followed the "="):
1
2
3
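You can also select by character position (-c) or byte position (-b) instead of by field. With -f alone, fields are split on tabs, which is the default delimiter; -d changes the delimiter to another single character.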
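A short sketch of these cut variants follows. The file names vars.txt and table.tsv are made up for the demonstration, and the "tr -d" step is just one possible way to strip the leftover space.

```shell
# Work in a scratch directory so nothing real is overwritten.
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1

printf 'x = 1\ny = 2\nz = 3\n' > vars.txt

# Field 2 with "=" as the delimiter; tr removes the space left after "=".
values=$(cut -d '=' -f 2 vars.txt | tr -d ' ')
echo "$values"

# Character positions: keep only the first character of each line.
names=$(cut -c 1 vars.txt)
echo "$names"

# Tab-separated input needs no -d; fields 1 and 3 are selected here.
printf 'a\tb\tc\n1\t2\t3\n' > table.tsv
cols=$(cut -f 1,3 table.tsv)
echo "$cols"
```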
sort
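sort will, you guessed it, sort data. The only trick is that it sorts alphabetically (character by character) by default, so 10 comes before 9. If you want numeric sorting, you must specify "-n". You can also specify "-r" for a reverse sort. The "-k" option lets you specify the key, that is, which whitespace-separated column to sort on: "-k 2,2" sorts on column 2 only, whereas "-k 2" uses everything from column 2 to the end of the line.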
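The difference between these options is easiest to see on a tiny input. In the sketch below the file names nums.txt and pairs.txt are invented, and LC_ALL=C is set on the alphabetical sort because the default ordering can otherwise depend on your locale.

```shell
# Work in a scratch directory so nothing real is overwritten.
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1

printf '%s\n' 10 9 100 2 > nums.txt

# Alphabetical (character-by-character) order: 10, 100, 2, 9.
alpha=$(LC_ALL=C sort nums.txt)

# Numeric order: 2, 9, 10, 100.
numeric=$(sort -n nums.txt)

# Numeric order, reversed: 100, 10, 9, 2.
reverse=$(sort -n -r nums.txt)

# Sort two-column data numerically on column 2 only.
printf 'b 3\na 20\nc 1\n' > pairs.txt
bykey=$(sort -n -k 2,2 pairs.txt)

echo "$bykey"
```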
uniq
uniq collapses adjacent repeated lines into one, so if the input is sorted you are left with unique values. If you give it the option "-c", it prefixes each line with the number of times it occurred, so that you can easily make a histogram like this:
sort -n mydata.txt | uniq -c > newdata.txt
Then something like this in gnuplot:
gnuplot> set boxwidth 1
gnuplot> set style fill solid
gnuplot> plot 'newdata.txt' using 2:1 with boxes
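To see what the counting step hands to gnuplot, here is a small sketch with made-up sample values. The awk line is an optional extra, not part of the recipe above, and hist.dat is an arbitrary name. It swaps the columns into "value count" order, in which case the plot command would use 1:2 instead of 2:1.

```shell
# Work in a scratch directory so nothing real is overwritten.
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1

# Sample data: the value 3 appears three times, 1 twice, 2 once.
printf '%s\n' 3 1 2 3 1 3 > mydata.txt

# Sort first so that equal values are adjacent, then count them.
# Each output line is "count value", with the count padded by spaces.
sort -n mydata.txt | uniq -c > newdata.txt
cat newdata.txt

# Optional: rewrite as "value count" without the padding.
awk '{print $2, $1}' newdata.txt > hist.dat
cat hist.dat
```

Without the sort step, uniq would treat the two 1s and three 3s in this sample as separate runs, since none of them are adjacent in the input.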