NSCI 580A4

Instructors
Tai Montgomery
Erin Nishimura

2016pipes2

# MORE PIPES

Now that we know what piping is, we can discover some new functionalities of Linux. Let's learn how to pipe the following commands:

sort - sort lines in a file
uniq - find unique (or duplicated) lines in a pre-sorted file
tee - send a command's output to multiple locations (for example, the screen and a file)

Exercise: Let's make a test file. Copy and paste the text below into a file called mini.gff

# A tester gff file.
# For testing pipes.
chrV	test	CDS	789	809	.	+	.	annotation info
chrII	test	CDS	24558	26798	.	+	.	annotation info
chrV	test	CDS	789	809	.	+	.	annotation info
chrI	test	CDS	233	236	.	+	.	annotation info
chrIV	test	CDS	1234	7654	.	-	.	annotation info
chrI	test	CDS	233	236	.	+	.	annotation info
chrII	test	CDS	24558	26798	.	+	.	annotation info
CHRI	test	CDS	11565	11951	.	+	.	annotation info
chrII	test	CDS	24558	26798	.	+	.	annotation info
chrIII	test	CDS	13678	137888	.	+	.	annotation info
CHRII	test	CDS	7997	8547	.	+	.	annotation info
chrIII	test	CDS	13678	137888	.	+	.	annotation info
chrIV	test	CDS	1234	7654	.	-	.	annotation info
chrV	test	CDS	13363	13743	.	+	.	annotation info
chrIV	test	CDS	1234	7654	.	-	.	annotation info
chrIV	test	CDS	1234	7654	.	-	.	annotation info
chrV	test	CDS	789	809	.	+	.	annotation info

## Sorting files by line using sort

We can use sort to sort a file's lines into a new order…

sort usage:
sort [options] <file.txt> …

Exercise: Sort the mini.gff file:

$ sort mini.gff

Exercise: Read the sort man pages to figure out how you would…

• sort in reverse order
• sort the capital and lower case letters together
• sort in numerical order
• Try some of these options

## Find unique lines using uniq

We can identify unique (or duplicated) lines in a pre-sorted file using the command uniq.

uniq usage:
uniq [options] <sortedFile.txt>

To operate on a presorted file, we have two options. We can do the process in two steps:

1. sort file.txt > sortedFile.txt
2. uniq sortedFile.txt

OR, we can use the pipe operator to chain the two commands together:

$ sort mini.gff | uniq

Quick tip: To find the duplicated lines, use -d as an option for uniq.
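Why does uniq want a pre-sorted file? It only compares each line with the line directly before it, so duplicates that are not next to each other are missed. Here is a small sketch of that behavior. The file name demo.txt and its contents are made up for illustration and are not part of the class files.

```shell
# Build a tiny unsorted example file (demo.txt is a hypothetical name).
printf 'chrV\nchrI\nchrV\nchrI\nchrII\n' > demo.txt

# uniq alone only collapses ADJACENT duplicate lines.
# No two identical lines are neighbors here, so nothing is removed.
unsorted_count=$(uniq demo.txt | wc -l)

# Sorting first puts identical lines next to each other, so uniq can collapse them.
sorted_count=$(sort demo.txt | uniq | wc -l)

# -d prints one copy of each line that appears more than once.
duplicated=$(sort demo.txt | uniq -d)

echo "lines after uniq only: $unsorted_count"
echo "lines after sort | uniq: $sorted_count"
echo "duplicated lines:"
echo "$duplicated"
```

Compare the two counts: without sort, uniq leaves every line in place.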

Common pitfall: Pipes are fun, but long pipelines can be problematic with very large files. Depending on your computer or cluster, memory or other resource limits may cause a piped command to slow down or fail. In these cases, saving the intermediate result to a temp file (sometimes written as file.tmp) is preferable, since each step can then be checked and rerun on its own.
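One way to do the temp file version is sketched below. It uses mktemp to pick a unique temp file name, which is our choice here rather than something the class requires. The file names demo_big.txt and demo_unique.txt are made up for illustration.

```shell
# Small stand-in for a large input file (demo_big.txt is a hypothetical name).
printf 'b\na\nb\nc\na\n' > demo_big.txt

# mktemp creates an empty, uniquely named temp file and prints its path.
tmpfile=$(mktemp)

# Step 1: write the sorted lines to the temp file instead of piping them.
sort demo_big.txt > "$tmpfile"

# Step 2: run uniq on the temp file and save the result.
uniq "$tmpfile" > demo_unique.txt

# Clean up the intermediate file once it is no longer needed.
rm "$tmpfile"

cat demo_unique.txt
```

Because the sorted lines are saved on disk, you can inspect them or rerun the uniq step without sorting again.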

## Redirect to multiple locations using tee

In an earlier class, we learned how to redirect STDOUT and STDERR to a file. If we want to send STDOUT to both a file and the screen, we can use the tee command. tee is used with the pipe operator.

tee usage:
command | tee <filename.txt>

Exercise: Try to send output from a command to both the screen and a file.

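$ wc mini.gff | tee wc_output.txt

Quick tip: tee only captures what arrives through the pipe, and by default that is stdout. If you want to capture both stdout and stderr, add 2>&1 before the pipe. This sends stderr (file descriptor 2) to the same place as stdout (file descriptor 1), so both streams flow through the pipe to tee. In the command below, skdjfldj is a nonsense filename used to make wc print an error message:

$ wc mini.gff skdjfldj 2>&1 | tee wc_stdoutstderr.txt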
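To see the difference that 2>&1 makes, here is a small sketch using a stand-in command that writes one line to each stream. The function name noisy and the output file names are made up for this example.

```shell
# A stand-in command that writes one line to stdout and one line to stderr.
# (The function name "noisy" is hypothetical.)
noisy() {
    echo "this goes to stdout"
    echo "this goes to stderr" >&2
}

# Without 2>&1, only stdout travels through the pipe into tee.
# The stderr line still shows up on the screen, but not in the file.
noisy | tee stdout_only.txt

# With 2>&1, stderr is sent to the same place as stdout (the pipe),
# so tee receives both lines and writes both to the file.
noisy 2>&1 | tee stdout_and_stderr.txt

# The -a option makes tee append to the file instead of overwriting it.
echo "one more line" | tee -a stdout_and_stderr.txt
```

Afterwards, stdout_only.txt should hold a single line, while stdout_and_stderr.txt should hold three.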

Exercise: Can you write a series of pipes that will determine how many unique chromosomes are represented in mini.gff?