Tip 8: Data Manipulation in Unix

From Vlsiwiki
Latest revision as of 22:24, 12 March 2015

In most projects, you ultimately have some data in rows and columns in text files. How do you get it there? I assume that you are using "grep" to extract it from a file. How do you manipulate it? That is what I will show now.

There are a few VERY useful Unix commands: paste, cut, sort, and uniq.

paste

paste allows you to append files horizontally, line-by-line. Suppose you have file1:

1
2
3
4
5

and file2:

2
4
6
8
10
12
14

and you run

paste file1 file2

It will output:

1	2
2	4
3	6
4	8
5	10 
	12
	14

If one of the files is specified as "-", it will use stdin. This means you can gather data like this:

myprog | grep somekeyword > file.dat
myprog -option2 | grep somekeyword | paste - file.dat > file2.dat

Note, however, that you must redirect the output to a separate file (file2.dat cannot be the same as file.dat). The shell empties the output file before paste starts reading it, so writing back to file.dat would destroy the data you are trying to keep!
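
A minimal sketch of the same pattern, with printf standing in for myprog. The data and the file names demo_col.dat and demo_col.tmp are made up for illustration. It adds a new first column with "paste -" and then replaces the original file by going through a temporary file:

```shell
# First run: one column of results (printf stands in for "myprog | grep ...").
printf '%s\n' 10 20 30 > demo_col.dat

# Second run: new values arrive on stdin ("-") and become the first column.
printf '%s\n' 1 2 3 | paste - demo_col.dat > demo_col.tmp

# Replace the original only after the new file has been written completely.
mv demo_col.tmp demo_col.dat
cat demo_col.dat
```

Repeating the last two commands adds one more column per run, which is handy when sweeping over a program option.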

cut

cut is the opposite of paste. It allows you to extract columns of data based on delimiters. For example, if you have a file like this:

x = 1
y = 2
z = 3

and you run:

cut -d '=' -f 2

it will extract column 2 and print it out (each value keeps the space that followed the "="):

 1
 2
 3

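You can also select by character positions (-c) or byte positions (-b) instead of fields. When you use -f without -d, fields are assumed to be separated by tabs, which is the default delimiter.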
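
A few variations, as a sketch on made-up data (all demo_* file names are hypothetical). The tr step is just one way to drop the space that cut leaves in front of each value:

```shell
printf '%s\n' 'x = 1' 'y = 2' 'z = 3' > demo_cut.txt

# Field 2 with "=" as the delimiter; tr -d ' ' removes the leftover space.
cut -d '=' -f 2 demo_cut.txt | tr -d ' ' > demo_values.txt
cat demo_values.txt

# Character positions: keep only the first character of each line.
cut -c 1 demo_cut.txt > demo_names.txt
cat demo_names.txt

# Tab is the default delimiter, so -f alone works on paste output.
paste demo_names.txt demo_values.txt | cut -f 2
```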

sort

sort will, you guessed it, sort data. The only trick is that it uses alphabetical sorting by default, so "10" comes before "9". If you want numeric sorting, you must specify "-n". You can also specify "-r" for a reverse sort. The "-k" option lets you specify the key (which column to sort on), for example "-k 2" for the second column.
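
The difference between the two orderings is easy to see on a small made-up file (the demo_* names and values are hypothetical):

```shell
printf '%s\n' 'a 10' 'b 9' 'c 100' > demo_sort.txt

# Default (alphabetical) sort on column 2: compared character by character.
sort -k 2 demo_sort.txt > demo_alpha.txt
cat demo_alpha.txt

# Numeric sort on column 2: 9, 10, 100.
sort -n -k 2 demo_sort.txt > demo_num.txt
cat demo_num.txt

# Numeric sort on column 2, largest first.
sort -n -r -k 2 demo_sort.txt > demo_rev.txt
cat demo_rev.txt
```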

uniq

This command eliminates repeated lines so that you have unique values. It only compares adjacent lines, so the input must be sorted first. If you give it the option "-c", it prefixes each line with the number of times it occurred, so that you can easily make a histogram like this:

 sort -n mydata.txt | uniq -c > newdata.txt

Then something like this in gnuplot:

 gnuplot> set boxwidth 1
 gnuplot> set style fill solid
 gnuplot> plot 'newdata.txt' using 2:1 with boxes
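
To see why the sort step matters before uniq, here is a small sketch on made-up data (the demo_* file names are hypothetical):

```shell
printf '%s\n' 3 1 3 3 1 > demo_uniq.txt

# Without sorting, only adjacent repeats collapse: 3 1 3 1.
uniq demo_uniq.txt > demo_unsorted.txt
cat demo_unsorted.txt

# Sorted first, every value appears exactly once: 1 3.
sort -n demo_uniq.txt | uniq > demo_sorted.txt
cat demo_sorted.txt

# With -c each line is prefixed by its count (the padding may vary).
sort -n demo_uniq.txt | uniq -c > demo_counts.txt
cat demo_counts.txt
```

The last file has the count in column 1 and the value in column 2, which is why the gnuplot command uses 2:1.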