Read the Command Sequencing section of HPC Tutorials volume 4 before continuing.

Pipelining

Compression Using Pipes

An extremely useful case on the supercomputer is piping output into compression tools. This can save you an immense amount of storage space. In this example, the dd utility was used to write 1000 blocks of zeroes into a file. The size of the file is 500 kB.

$ dd if=/dev/zero count=1000 > zeroes.txt
1000+0 records in
1000+0 records out
512000 bytes (512 kB) copied, 0.180904 s, 2.8 MB/s
$ ls -lh
total 500k
-rw-rw-r-- 1 user user 500K Oct 21 16:45 zeroes.txt

But, if we pipe that same command into gzip, a compression utility, the same data is only 531 bytes:

$ dd if=/dev/zero count=1000 | gzip > zeroes.txt.gz
1000+0 records in
1000+0 records out
512000 bytes (k12 kB0 copied, 0.0207491 s, 24.7 MB/s
$ ls -lh
total 504k
-rw-rw-r-- 1 user user 500K Oct 21 16:45 zeroes.txt
-rw-rw-r-- 1 user user  531 Oct 21 16:45 zeroes.txt.gz

Data Wrangling

Another use of piping is to manipulate data into the desired format. Imagine that given the wall of text seen here, we’re trying to insert every ‘random reader’ number into on Excel sheet column, and every ‘ops/sec’ number into a second Excel sheet column, to build a chart. Copying and pasting this data into Excel would be difficult.

$ grep 'random readers' slurm-1234.out
     throughput for 1 random readers   =   2847.90 ops/sec
     throughput for 2 random readers   =   5926.94 ops/sec
     throughput for 3 random readers   =   8877.60 ops/sec
     throughput for 4 random readers   =  11733.96 ops/sec
     throughput for 5 random readers   =  14533.67 ops/sec
     throughput for 6 random readers   =  17289.51 ops/sec
     throughput for 7 random readers   =  19039.88 ops/sec
     throughput for 8 random readers   =  20200.80 ops/sec
     throughput for 9 random readers   =  22942.14 ops/sec
     throughput for 10 random readers  =  18564.02 ops/sec
     ...

We can pipe the output into a tool called awk which is very useful for printing specific columns in text output, to print only the two numbers that we want:

$ grep 'random readers' slurm-1234.out | awk '{ print $3, $7 }'
1 2847.90
2 5926.94
3 8877.60
4 11733.96
5 14533.67
6 17289.51
7 19039.88
8 20200.80
9 22942.14
10 18564.02

Exercise

Log into the supercomputer and construct a pipeline to count how many lines in /etc/vimrc contain the string “th.” Use wc to do the counting.