Difference between revisions of "Tip 8: Data Manipulation in Unix"

Latest revision as of 22:24, 12 March 2015

In most projects, you ultimately have some data in rows and columns in text files. How do you get it there? I assume that you are using "grep" to extract it from a file. How do you manipulate it? That is what I will show now.

There are several VERY useful Unix commands: paste, cut, sort, and uniq.

@@ Line 1: / Line 1: @@
 In most projects, you ultimately have some data in rows and columns in text files. How do you get it there? I assume that you are using "grep" to extract it from a file. How do you manipulate it? That is what I will show now.
-There are a couple VERY useful unix commands: paste and cut.
+There are several VERY useful Unix commands: paste, cut, sort, and uniq.
+=== paste ===
 paste allows you to append files horizontally, line-by-line. Suppose you have file1:
@@ Line 26: / Line 27: @@
+If one of the files is specified as "-", it will use stdin. This means you can gather data like this:
+ myprog | grep somekeyword > file.dat
+ myprog -option2 | grep somekeyword | paste - file.dat > file2.dat
+Note, however, that you must redirect the output to a separate file (file2.dat cannot be the same as file.dat). The shell empties the output file before paste gets a chance to read it, so the existing contents of file.dat would be lost.
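The stdin form of paste described above can be tried without a real program. This is only a sketch: the printf commands stand in for "myprog | grep somekeyword", and the "energy" lines and the scratch directory are made-up placeholders.

```shell
# Scratch directory so no real file.dat is touched (hypothetical location).
workdir=$(mktemp -d)

# Stand-in for: myprog | grep somekeyword > file.dat
printf 'energy 1.0\nenergy 2.0\n' > "$workdir/file.dat"

# Stand-in for: myprog -option2 | grep somekeyword | paste - file.dat > file2.dat
# "-" makes paste read stdin as the first column; file.dat becomes the second.
printf 'energy 1.5\nenergy 2.5\n' | paste - "$workdir/file.dat" > "$workdir/file2.dat"

# Each output line is the stdin line, a tab, then the matching line of file.dat.
cat "$workdir/file2.dat"
```

The output goes to file2.dat rather than back to file.dat, for the reason given in the note above.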
+=== cut ===
+cut is the opposite of paste. It allows you to extract columns of data based on delimiters. For example, if you have a file like this:
+ x = 1
+ y = 2
+ z = 3
+and you run:
+ cut -d '=' -f 2
+it will extract column 2 and print it out: 1, 2, and 3, one per line, each still preceded by the space that followed the "=".
+You can also select by character positions (-c) or byte positions (-b). With -f, fields are separated by tabs by default; -d sets a different delimiter.
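As a small sketch of the two selection modes, the three-line sample from above can be fed to cut on stdin. The variable names here are illustrative only.

```shell
# The sample file contents from the text, kept in a variable for reuse.
sample=$(printf 'x = 1\ny = 2\nz = 3\n')

# Field mode: split each line on '=' and keep field 2.
# The space after '=' is part of the field, so each value has a leading space.
by_field=$(printf '%s\n' "$sample" | cut -d '=' -f 2)

# Character mode: keep only the 5th character of each line (the digit).
by_char=$(printf '%s\n' "$sample" | cut -c 5)

printf '%s\n' "$by_field"
printf '%s\n' "$by_char"
```

Character mode only works here because every line has the digit in the same position; field mode is the safer choice when widths vary.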
+=== sort ===
+sort will, you guessed it, sort data. The only trick is that it sorts alphabetically by default, so 10 comes before 9. If you want numeric sorting, you must specify "-n". You can also specify "-r" for a reverse sort.
+The "-k" option allows you to specify the key (which column to sort on), for example "-k 2" to sort starting at the second column.
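The difference between the default and "-n" is easiest to see on a tiny input. The numbers and labels below are made up for illustration.

```shell
# Default sort compares lines as text, so "10" and "100" come before "9".
as_text=$(printf '10\n9\n100\n' | sort)

# -n compares the lines as numbers.
as_numbers=$(printf '10\n9\n100\n' | sort -n)

# -r reverses the order (here combined with -n).
as_numbers_desc=$(printf '10\n9\n100\n' | sort -n -r)

# -k 2 uses column 2 as the key; -n makes that key numeric.
by_column2=$(printf 'a 10\nb 9\nc 100\n' | sort -n -k 2)

printf '%s\n' "$as_text" "$as_numbers" "$as_numbers_desc" "$by_column2"
```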
+=== uniq ===
+This command collapses adjacent repeated lines, so the input must be sorted first if you want each value to appear only once. If you give it the option "-c", it prefixes each line with the number of times it occurred, so that you can easily make a histogram like this:
+  sort -n mydata.txt | uniq -c > newdata.txt
+Then something like this in gnuplot:
+  gnuplot> set boxwidth 1
+  gnuplot> set style fill solid
+  gnuplot> plot 'newdata.txt' using 2:1 with boxes
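To see what the histogram file looks like before plotting, here is a sketch with made-up data. The contents of mydata.txt and the scratch directory are assumptions for illustration.

```shell
# Scratch directory and sample data (hypothetical values).
workdir=$(mktemp -d)
printf '3\n1\n3\n2\n3\n1\n' > "$workdir/mydata.txt"

# Sort first so equal values are adjacent, then count each run.
sort -n "$workdir/mydata.txt" | uniq -c > "$workdir/newdata.txt"

# Each line is: count, then value. The count is padded with leading spaces,
# and the amount of padding differs between uniq implementations.
cat "$workdir/newdata.txt"
```

Because the count is column 1 and the value is column 2, the gnuplot command above uses "using 2:1" to put the value on the x axis and the count on the y axis.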
