The following commands will help you to extract information from files.
cat – concatenate. concatenate files together
uniq – unique. extract only the unique lines out of a file
cut – cut. pull out a specific column (or any other delimited information) from a file
grep – regular expressions. search for a specific pattern within a file
Exercise: Let's make some simple files to play with. Copy the following text into a file and name it chr_sizes.txt.
# An example file containing four Saccharomyces cerevisiae chromosomes and their lengths. sacCer3.
chrI	230218
chrII	813184
chrIII	316620
chrM	85779
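Copying text from a web page sometimes turns tabs into spaces, which matters later because cut expects tabs. As one possible workaround, the sketch below builds the same file with printf, which is not otherwise covered in this lesson. It assumes the two columns are meant to be separated by a single tab.

```shell
# Write the header line (this overwrites any existing chr_sizes.txt).
printf '# An example file containing four Saccharomyces cerevisiae chromosomes and their lengths. sacCer3.\n' > chr_sizes.txt

# printf reuses the format for each pair of arguments,
# so every name/length pair becomes one tab-separated line.
printf '%s\t%s\n' chrI 230218 chrII 813184 chrIII 316620 chrM 85779 >> chr_sizes.txt

# Print the file to check the result.
cat chr_sizes.txt
```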
The cat command reads one or more files and prints their contents to the screen, one file after another. The output can also be redirected to a file, and in this way we can join files together.
concatenate usage:
cat <file.txt> …
Typically, we join two different files together. For the purpose of example, let's try to duplicate the contents of our file using cat.
$ cat chr_sizes.txt chr_sizes.txt
$ cat chr_sizes.txt chr_sizes.txt > double_sizes.txt
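To check that the redirected file really holds two copies, we can count lines. This is only a sketch: wc -l is not covered in this lesson, and the first two commands simply recreate the example file so the block runs on its own.

```shell
# Recreate the five-line example file (tab-separated columns).
printf '# An example file containing four Saccharomyces cerevisiae chromosomes and their lengths. sacCer3.\n' > chr_sizes.txt
printf '%s\t%s\n' chrI 230218 chrII 813184 chrIII 316620 chrM 85779 >> chr_sizes.txt

# Join the file to itself and save the result instead of printing it.
cat chr_sizes.txt chr_sizes.txt > double_sizes.txt

# wc -l counts lines; the joined file should have twice as many.
wc -l < chr_sizes.txt
wc -l < double_sizes.txt
```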
A regular expression describes a sequence of characters for which you want to search. The term is often shortened to regex. Regular expressions are very powerful, and the expressions themselves can quickly become complex, with lots of wildcards and wiggle room for variations on the searched pattern. For this lesson, we'll focus on simple letter and number combinations, so we can think of it as simple pattern searching and matching.
grep usage
grep [options] <pattern> <file> …
Let's say we want to know how long the mitochondrial genome is in yeast:
$ grep 'chrM' chr_sizes.txt
Exercises: Try executing the following to get a sense of what grep does and does not do. To learn more about these options, read the grep man page.
$ grep -n 'chrM' chr_sizes.txt
$ grep -n 'chrM' double_sizes.txt
$ grep -v 'chrM' chr_sizes.txt
$ grep -v '#' chr_sizes.txt
$ grep 'chr' chr_sizes.txt
$ grep 'chrII' chr_sizes.txt
$ grep -w 'chrII' chr_sizes.txt
Common pitfall: Did you notice how searching for chr gave you both the chromosomes listed in columns as well as the word chromosomes in the header? Also, chrII returned both chrII and chrIII. This is something to look out for with grep. We'll cover more advanced ways to restrict your regular expressions in later lessons.
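One way to see the difference is to count matching lines with and without -w. The sketch below uses grep -c, an option not introduced in the exercises above, and recreates the example file so it runs on its own.

```shell
# Recreate the five-line example file (tab-separated columns).
printf '# An example file containing four Saccharomyces cerevisiae chromosomes and their lengths. sacCer3.\n' > chr_sizes.txt
printf '%s\t%s\n' chrI 230218 chrII 813184 chrIII 316620 chrM 85779 >> chr_sizes.txt

# Without -w, the pattern chrII is also found inside chrIII,
# so two lines are counted.
grep -c 'chrII' chr_sizes.txt

# With -w, the pattern must match a whole word,
# so only the chrII line is counted.
grep -c -w 'chrII' chr_sizes.txt
```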
Quick tip: As long as you use quotes around your search pattern, you can include a space in it.
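As a small illustration of the tip, the sketch below searches for a two-word phrase. It also shows what tends to happen without quotes: grep takes the second word as a file name. The example assumes there is no file called Saccharomyces in the current directory.

```shell
# Recreate the five-line example file (tab-separated columns).
printf '# An example file containing four Saccharomyces cerevisiae chromosomes and their lengths. sacCer3.\n' > chr_sizes.txt
printf '%s\t%s\n' chrI 230218 chrII 813184 chrIII 316620 chrM 85779 >> chr_sizes.txt

# The quotes keep the space inside the pattern,
# so grep searches for the whole phrase.
grep 'four Saccharomyces' chr_sizes.txt

# Without quotes, the pattern is just "four" and "Saccharomyces"
# is treated as a file to search, which grep reports as missing.
# "|| true" only keeps the error from stopping a script.
grep four Saccharomyces chr_sizes.txt || true
```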
cut is a command that can be used for slicing and dicing information out of delimited files. We'll just use the most basic feature of cut which, by default, pulls out specific columns from tab-delimited files. There are ways to change this so that it splits on other delimiters, but today, we'll just stick with the default operation.
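For reference only, here is a sketch of splitting on another delimiter using the -d option. The comma-separated file chr_sizes.csv is a hypothetical stand-in made up for this illustration; it is not part of the lesson's files.

```shell
# Hypothetical comma-separated version of two rows, for illustration only.
printf '%s,%s\n' chrI 230218 chrII 813184 > chr_sizes.csv

# -d sets the delimiter to a comma; -f still selects the column.
cut -d ',' -f 2 chr_sizes.csv
```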
column extraction usage
cut -f <number> <file.txt>
Exercise: Let's try to just extract out some columns using cut.
cut works by default by splitting a file into tab-delimited columns. Try the following on chr_sizes.txt using cut:
$ cut -f 1 chr_sizes.txt
$ cut -f 2 chr_sizes.txt
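One thing to watch for in the output: the header line contains no tab, and cut prints such lines unchanged. The sketch below contrasts the default behaviour with the -s option, which is not covered in this lesson. It assumes the file was created with real tab characters.

```shell
# Recreate the five-line example file (tab-separated columns).
printf '# An example file containing four Saccharomyces cerevisiae chromosomes and their lengths. sacCer3.\n' > chr_sizes.txt
printf '%s\t%s\n' chrI 230218 chrII 813184 chrIII 316620 chrM 85779 >> chr_sizes.txt

# The header has no tab, so cut passes it through as-is,
# followed by the second column of the other lines.
cut -f 2 chr_sizes.txt

# -s skips lines that contain no delimiter,
# leaving only the four chromosome lengths.
cut -s -f 2 chr_sizes.txt
```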
Common pitfall: cut counts columns like so: 1, 2, 3, 4. However, not all tools and programming languages start at 1. Many start at 0 and count like so: 0, 1, 2, 3. It is a good idea to check how a tool or language counts by testing it before you rely on it.
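A quick way to run such a test is to use a line whose columns name their own position. The file count_test.txt below is a throwaway name chosen for this sketch.

```shell
# A one-line test file: each column says which position it is in.
printf 'first\tsecond\tthird\n' > count_test.txt

# Column 1 is the first column.
cut -f 1 count_test.txt

# Asking for column 0 is rejected by cut, which confirms that
# its counting starts at 1. The message after || is printed
# only when cut fails.
cut -f 0 count_test.txt || echo "cut does not accept column 0"
```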