WORKING WITH FILES II

The following commands will help you to extract information from files.

cat – concatenate. join files together and print their contents
uniq – unique. collapse adjacent duplicate lines in a file so that each appears once
cut – cut. pull out a specific column (or any other delimited field) from a file
grep – global regular expression print. search for a specific pattern within a file

Let's make a file

:!: Exercise: Let's make a simple file to play with. Copy the following text into a file and name it chr_sizes.txt. The two columns are separated by a single tab character.

# An example file containing four Saccharomyces cerevisiae chromosomes and their lengths. sacCer3.
chrI	230218
chrII	813184
chrIII	316620
chrM	85779
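
Copying text from a web page can turn tabs into spaces, which would break the column exercises later on. As an alternative to copy and paste, here is a minimal sketch that writes the same five lines with printf, which turns \t into a real tab. The two check commands at the end are additions for illustration and are not part of the original exercise.

```shell
# Write the header line, then the four chromosome lines with a tab (\t) between the columns.
printf '%s\n' '# An example file containing four Saccharomyces cerevisiae chromosomes and their lengths. sacCer3.' > chr_sizes.txt
printf 'chrI\t230218\nchrII\t813184\nchrIII\t316620\nchrM\t85779\n' >> chr_sizes.txt

# Check the result: the file should have 5 lines, 4 of which contain a tab.
wc -l < chr_sizes.txt
grep -c "$(printf '\t')" chr_sizes.txt
```

If the second number is not 4, the columns are probably separated by spaces instead of tabs.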

Concatenating files with cat

The cat command reads one or more files and prints their contents to the screen, one file after another. The output can also be redirected to a file with >, and in this way we can join files together.

concatenate usage:
cat <file.txt> …

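Typically, we join two different files together. For the purpose of this example, let's duplicate the contents of our file using cat. The first command prints the doubled contents to the screen; the second redirects them into a new file, double_sizes.txt.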

$cat chr_sizes.txt chr_sizes.txt
$cat chr_sizes.txt chr_sizes.txt > double_sizes.txt
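
To check what the redirect produced, here is a small sketch that counts lines before and after joining. It also shows how uniq, from the command list at the top, undoes the duplication. It relies on sort, which this lesson does not otherwise cover, and the file names sorted_sizes.txt and unique_sizes.txt are made up for this example.

```shell
# Recreate chr_sizes.txt so this sketch runs on its own
# (this overwrites any existing copy with the same five lines).
printf '%s\n' '# An example file containing four Saccharomyces cerevisiae chromosomes and their lengths. sacCer3.' > chr_sizes.txt
printf 'chrI\t230218\nchrII\t813184\nchrIII\t316620\nchrM\t85779\n' >> chr_sizes.txt

# Join the file to itself; the result has twice as many lines.
cat chr_sizes.txt chr_sizes.txt > double_sizes.txt
wc -l < chr_sizes.txt      # 5 lines
wc -l < double_sizes.txt   # 10 lines

# uniq only collapses duplicates that sit next to each other, so sort first.
sort double_sizes.txt > sorted_sizes.txt
uniq sorted_sizes.txt > unique_sizes.txt
wc -l < unique_sizes.txt   # 5 lines again
```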

Searching for patterns using grep

A regular expression, often shortened to regex, describes a pattern of characters for which you want to search. Regular expressions are very powerful, and they can quickly become complex, with wildcards and other constructs that allow for variations on the searched pattern. For this lesson, we'll focus on simple letter and number combinations, so we can think of it as simple pattern searching and matching.

grep usage
grep [options] <pattern> <file> …

  • There are many options for grep.
  • Typically, the pattern given to search is enclosed in quotes.
  • grep can search multiple files.

Let's say we want to know how long the mitochondrial genome is in yeast:

$grep 'chrM' chr_sizes.txt

:!: Exercises: Try executing the following to get a sense of what grep does and does not do. To learn more about these options, read the grep man page.

$grep -n 'chrM' chr_sizes.txt
$grep -n 'chrM' double_sizes.txt
$grep -v 'chrM' chr_sizes.txt
$grep -v '#' chr_sizes.txt
$grep 'chr' chr_sizes.txt
$grep 'chrII' chr_sizes.txt
$grep -w 'chrII' chr_sizes.txt

:!: Common pitfall: Did you notice how searching for chr returned both the chromosome lines and the header, which contains the word chromosomes? Likewise, chrII returned both chrII and chrIII, because grep matches the pattern anywhere in a line. The -w option restricts the match to whole words, which is why grep -w 'chrII' returns only chrII. We'll cover more advanced ways to restrict your regular expressions in later lessons.

;-) Quick tip: As long as you use quotes around your search pattern, you can include a space in it.
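
One way to make the differences between these searches easy to compare is to count matching lines instead of printing them. The sketch below uses the -c option, which is not part of the exercises above, to print the number of matching lines for each pattern.

```shell
# Recreate chr_sizes.txt so this sketch runs on its own
# (this overwrites any existing copy with the same five lines).
printf '%s\n' '# An example file containing four Saccharomyces cerevisiae chromosomes and their lengths. sacCer3.' > chr_sizes.txt
printf 'chrI\t230218\nchrII\t813184\nchrIII\t316620\nchrM\t85779\n' >> chr_sizes.txt

grep -c 'chr' chr_sizes.txt        # 5: four chromosome lines plus the header ("chromosomes")
grep -c 'chrII' chr_sizes.txt      # 2: chrII and chrIII both contain the pattern
grep -c -w 'chrII' chr_sizes.txt   # 1: -w matches chrII only as a whole word
grep -c -v '#' chr_sizes.txt       # 4: -v keeps the lines that do NOT contain #
grep -c 'and their' chr_sizes.txt  # 1: a quoted pattern may contain a space
```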


Extracting columns with cut

cut is a command for slicing and dicing information out of delimited files. We'll use only its most basic feature: pulling specific columns out of tab-delimited files, which is how cut splits lines by default. The -d option changes the delimiter, but today we'll stick with the default.

column extraction usage
cut -f <number> <file.txt>

:!: Exercise: Let's extract some columns using cut. By default, cut splits each line of a file into tab-delimited columns.

  • Let's make a test file that has two columns by removing the first line from chr_sizes.txt. (cut prints a line that contains no tab unchanged, so the header line would otherwise show up in every column.)
  • Next, we can extract the first column or second column using cut:
$cut -f 1 chr_sizes.txt
$cut -f 2 chr_sizes.txt
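
The steps above can be sketched end to end as follows. This is one possible way to do it, not the only one: the header is removed with grep -v from the earlier section, and chr_only.txt is a made-up file name for the result. The last command shows the -s option, which is not used in the exercise above.

```shell
# Recreate chr_sizes.txt so this sketch runs on its own
# (this overwrites any existing copy, restoring the header line).
printf '%s\n' '# An example file containing four Saccharomyces cerevisiae chromosomes and their lengths. sacCer3.' > chr_sizes.txt
printf 'chrI\t230218\nchrII\t813184\nchrIII\t316620\nchrM\t85779\n' >> chr_sizes.txt

# Remove the header line by keeping only the lines without a #.
grep -v '#' chr_sizes.txt > chr_only.txt

cut -f 1 chr_only.txt   # chrI, chrII, chrIII, chrM (one per line)
cut -f 2 chr_only.txt   # 230218, 813184, 316620, 85779 (one per line)

# -s skips lines that contain no delimiter, so the header is dropped here too.
cut -s -f 2 chr_sizes.txt
```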

:!: Common pitfall: cut counts columns like so: 1, 2, 3, 4. However, not all tools and computing languages start at 1. Many start at 0 and count like so: 0, 1, 2, 3. It is a good idea to check which convention applies by testing on a small example first.

Pipes

wiki/2016grep.txt · Last modified: 2016/08/31 12:45 by erin