User Tools

Site Tools


wiki:2016awk

PROGRAMMING IN AWK AND SED


The purpose of this exercise is to introduce awk and sed programming to manipulate genomics data.

OUTLINE

  1. awk
  2. sed

1. awk

  • awk is a programming language for processing text.
  • Very useful for processing large genomics datasets.
  • Works very well with columns, rows, and blocks of data.
  • Often much faster than writing a Perl, Python, or R script.
  • Standard feature of UNIX-based operating systems (Linux and Mac).
  • Drawback - cryptic syntax.

We don't have much time to sufficiently cover even the basics of awk. The purpose of this is to introduce the power of AWK and provide a starting point for you to learn AWK on your own or to modify and AWK code written by others.

For a nice tutorial, see http://www.grymoire.com/Unix/Awk.html.

awk program structure

$ awk 'pattern { action }' input_file >output_file

When the specified pattern is identified in the file, some action is taken.

Pattern can be basically anything, specific lines, specific columns, specific characters strings, etc.

Pattern can also be the beginning of a file (BEGIN) or the end of a file (END). Here is an example of a simple awk script:

#!/bin/awk -f
BEGIN { print "Beginning" }
      { print }
END   { print "End" }

This little program would print the word Beginning on a line, followed by each line of the file, and then a new line with the word End and output the results to a new file.

To execute the script from the command line:

awk -f awk_script input_file >output_file

AWK Relational Operators

Operator Meaning
== is equal to
!= is not equal to
~ matches
!~ doesn't match
> is greater than
>= is greater than or equal to
< is less than
<= is less than or equal to
% remainder (i.e. what's left over after a division)
&& and
|| or

AWK Commands

Command Meaning
RS record separator, by default a new line
FS field separator (use internally on blocks of code; by default any white space - tab or space)
-F field separator (use to specify a field separator from command line before executing code, e.g. awk -F “\t” 'code')
NF number of field in current record ($NF → last field)
$# field (e.g. \$1, \$2, etc.; $0 = all fields)
OFS output separator, by default a space (to change, to tab for example: OFS='\t')
NR total number of records processed
FNR total number of records processed for each input file when multiple files
next script starts again at next line; subsequent commands are only run on files other than first
n[$#] array (e.g. NR==FNR{n[$0];next} → read all NR starting with FNR for each line in FNR)
gsub global sub (see example below)
print return output (by default adds new line; use printf to not add new line; use toupper() to change case; use print substr to return a defined number of characters from a string)

Take for example the following fastq file:

@D64TDFP1:1:1101:1124:2467#0/1
GCAGGATGTAGCGAAGTTTTTCTGGAATTCTCGGTTGCCAAGGAACTCCAGATCAGA
+
abbeeceeegggghhhhhiiihiiiiihigiiiiifc^ceghieghhhiiabbeeec
@D64TDFP1:1:1101:1254:2311#0/1
GATTGGATGGTATCTCTCATTTGGAATTCTCGGGTGCCAAGGAACTCCAGGATCAGA
+
aaaeeecegggggghhhiiiiiiiiiiiiiiiiiiiiihiiihiiihiii\\_ccc`

RS = new line: each line of the file is a record.
FS = white space: there is only one field in each record because there is no white space.
NR = 1-8: the number of records changes as the program runs.
NF = 1: no matter which record it is, each only has 1 field.

Examples

  • If no output file is specified, output is returned to screen.
  • Use the lines.txt file (available in under sample data on the website) as an input file for these examples.

To print a single line number, for example line 7:

awk 'NR == 7 { print }' input_file

Even though we only want line 7, awk will still read through the entire file, which can be slow. Instead we can exit after returning line 7:

awk 'NR == 7 {print; exit}' input_file

* Note that the default is to print the output if no actions are specified, but if you want to include additional code or modify what is printed, you must specify print.

Print every second line counting from line 0 (the first line printed is line 2):

awk 'NR % 2 == 0'  input_file

To print every second line counting from line 1 (the first line printed is line 1):

awk '(NR + 1) % 2 == 0'  input_file

To extract the first 100 lines:

awk 'NR <= 100' input_file

To extract lines 10-20:

awk 'NR >= 10 && NR <= 20 { print }' input_file

To print every third and fourth line of a fastq file, counting from line 1:

awk '(NR + 3) % 4 == 2 { print } (NR + 3) % 4 == 3 { print }'  input_file

Print lines containing a pattern (can also be used for fields - \$1, \$2, etc):

awk '/pattern/ {print $0}' input_file

* Note here we specify the fields to print - $0, all fields. This doesn't change anything because by default all fields are printed. Is print even needed here?

Find and replace:

awk 'gsub(/DNA/,"RNA")' input_file

Could also be written:

awk 'gsub(/DNA/,"RNA") { print }' input_file

Could also be written (if in doubt, guess and check):

awk '{gsub(/DNA/,"RNA"); print}' input_file

Sum the second column of a tab delimited file:

awk '{sum += $2} END {print sum}' input file

Print only the genomic coordinates of a gene from a gff file (alg-1 in this example):

awk '/WormBase\tgene.*alg-1/ {print $1":"$4"-"$5}' input_file

*in this example we created a variable called sum and add the values in the second field of each line.
* awk is typically pretty forgiving when it comes to spaces or lack thereof in code.

Exercises

1. Extract line 10 from the lines.txt file.

2. Extract lines 20-60 from the lines.txt file.

3. Extract all lines after but not including line 100 from the lines.txt file.

4. Extract every sequence ID line from the example fastq file, fastq_example.fastq.

5. Extract every sequence ID and quality score line from the example fastq file, fastq_example.fastq.

6. Convert the tab-delimited human miRNA sequence file, human_miRNAs-tab.txt, to fasta format. You will need to specify a field separator using -F (see instructions in the awk commands table above).

7. In the fasta file you created in challenge 6, convert the RNA sequences to DNA. Only substitute Us for Ts in the sequence lines, not in the header lines.

2. sed

  • sed (stream editor) is a Unix tool.
  • Useful for making substitutions in a document.
  • Allows pattern matching with regular expressions.
  • Reads in a file line by line.
  • Simpler but far more limited ability compared to AWK, but allows for more complex substitutions than tr.
  • On a Mac, install gnu sed to use some common regular expressions missing in the Mac sed distribution.

For a nice tutorial, see http://www.grymoire.com/Unix/Sed.html

sed program structure

$ sed 's/"old_pattern"/"new_pattern"/g' input_file >output_file

Convert a tab delimited sequence file to a fasta formatted file:

gsed 's/\(.*\)\t/>&\n/g' <human_miRNAs-tab.txt >human_miRNAs_fasta.txt

*The pattern match stored in () is saved as the variable &. The parentheses have to be negated as part of the match with backslashes.

★ Interactive Exercise: using awk to process data

Problem: extract the sequence of the gene alg-1 from the C. elegans genome sequence.

1. Download the C. elegans genome sequence and annotations from wormbase (the relevant section of the annotations file can also be downloaded from the course website):

$ ftp ftp.wormbase.org
$ cd /pub/wormbase/releases/WS255/species/c_elegans/PRJNA13758
$ mget c_elegans.PRJNA13758.WS255.genomic.fa.gz
$ mget c_elegans.PRJNA13758.WS255.annotations.gff3.gz

2. Decompress the files using gzip:

$ gzip -d c_elegans.PRJNA13758.WS255.genomic.fa.gz 
$ gzip -d c_elegans.PRJNA13758.WS255.annotations.gff3.gz

3. Examine the decompressed files using more. Find an example of a coding gene entry by searching for 'coding' using grep How can we distinguish entries for coding genes from other entries?

4. Extract wormbase coding gene information from the annotations file using grep (what format is the file in?):

$ grep "WormBase\tgene.*protein_coding" c_elegans.PRJNA13758.WS255.annotations.gff3 >Coding_Genes.gff

5. Identify the genomic coordinates of argonaute-like gene 1 (alg-1; the human argonaute1 homolog) by grep searching the Coding_Genes file you created in step 4. Output the coordinates to a new file named alg-1_coordinates.txt.

* Note: the genome sequence you downloaded is in fasta format and has the following structure:

>CHROMOSOME_I
chromosome I sequence
chromosome I sequence
chromosome I sequence
chromosome I sequence
etc.
>CHROMOSOME_II
chromosome II sequence
chromosome II sequence
chromosome II sequence
chromosome II sequence
etc.
>CHROMOSOME_III
chromosome III sequence
chromosome III sequence
chromosome III sequence
chromosome III sequence
etc.
>CHROMOSOME_IV
chromosome IV sequence
chromosome IV sequence
chromosome IV sequence
chromosome IV sequence
etc.
>CHROMOSOME_V
chromosome V sequence
chromosome V sequence
chromosome V sequence
chromosome V sequence
etc.
>CHROMOSOME_X
chromosome X sequence
chromosome X sequence
chromosome X sequence
chromosome X sequence
etc.
>CHROMOSOME_MTDNA
MtDNA sequence
MtDNA sequence
MtDNA sequence
MtDNA sequence
etc.

6. The sequence of each chromosome is separated across multiple lines of a fixed length. Use awk to combine each chromosome sequence onto a single line. Here is one of many possible approaches:

$ awk 'NR==1 {print} NR>1 {if ($0 ~ />/) print "\n"$0; else printf}' <c_elegans.PRJNA13758.WS255.genomic.fa >c_elegans_genome.fa

7. Create a FASTA file for the alg-1 sequence as follows:

a. Create a new file named alg-1.fa with the alg-1 header line using echo.

$ echo ">alg-1" >alg-1.fa

b. Use awk to extract alg-1 sequence from the FASTA formatted genome from step 6 using the alg-1 coordinates from the annotations file identified in step 5 and insert it into the alg-1_sequence.fa file:

$ awk 'NR==linenumber{print substr(string,start,length)}' <inputfile >>alg-1.fa

*Note: the >> adds the text to the alg-1_sequence.fa file. A single > would overwrite the file contents.
linenumber = the line of the file that contains the alg-1 sequence. The genome is now formatted as follows:

>I
chromosome I sequence
>II
chromosome II sequence
>III
chromosome III sequence
>IV
chromosome IV sequence
>V
chromosome V sequence
>X
chromosome X sequence
>MTDNA
MtDNA sequence

So if alg-1 were on chromosome I, than linenumber would be 2.

  • string = $0
  • start = start position of alg-1
  • length = length of alg-1

8. BLAST the sequence to confirm that it is alg-1.

Appendix: Installing homebrew and gnu sed on a Mac

Install x-code command line tools:

$ xcode-select –install

Install homebrew package manager:

$ /usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

Check homebrew installation:

$ brew doctor

Install gnu-sed:

$ brew install gnu-sed

If gnu-sed not linked properly, do the following:

$ sudo chown -R whoami:admin /usr/local/bin
$ sudo chown -R whoami:admin /usr/local/share
$ brew uninstall gnu-sed
$ brew install gnu-sed
wiki/2016awk.txt · Last modified: 2017/09/12 09:30 by tai