Now that we've learned how to acquire data and do some very basic tasks with that data, we'll discuss ways of manipulating data using substitutions and regular expressions.
A regular expression is a combination of characters that describes a search pattern. Common building blocks include:
|\s||any white space (new line, space, etc.)|
|\d||any digit 0-9|
|\w||any alphanumeric character or underscore|
|.||a wildcard: any single character except a new line|
|*||match the preceding character or pattern zero or more times (greedy and will match as much as possible; in tools that support Perl-style regular expressions, use *? to match as little as possible)|
|+||match the preceding character or pattern one or more times|
|^||matches beginning of line or string|
|$||matches end of line or string|
|[0-9]||a character class, any single digit|
|[a-z]||a character class, any single lowercase letter|
||||the pipe character (|): match the pattern on its left or the pattern on its right, e.g. DNA|RNA (requires egrep)|
|\\n||the first backslash escapes the second, so the pattern matches a literal backslash followed by n rather than a new line; useful if you really want to search for the text \n|
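As a minimal sketch of how a few of these patterns behave with grep, the commands below count matching lines in a small throwaway file. The file contents and variable names are made up for illustration. Note that many grep implementations do not recognize \d outside Perl-compatible mode, so the example uses [0-9] instead.

```shell
# Build a tiny sample file (hypothetical contents, for illustration only).
demo=$(mktemp)
printf '%s\n' 'DNA sample 42' 'RNA sample' 'protein_7' 'no match here' > "$demo"

# [0-9] is a character class: count lines that contain at least one digit.
digits=$(grep -c '[0-9]' "$demo")

# ^ anchors the pattern to the start of the line.
starts_dna=$(grep -c '^DNA' "$demo")

# | (alternation) needs extended regular expressions: egrep or grep -E.
dna_or_rna=$(grep -E -c 'DNA|RNA' "$demo")

# $ anchors to the end of the line; + means one or more of the preceding class.
ends_digit=$(grep -E -c '[0-9]+$' "$demo")

echo "$digits $starts_dna $dna_or_rna $ends_digit"
rm -f "$demo"
```

With the sample lines above, this prints `2 1 2 2`.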
Substitutions refer to changing one pattern to another in a file, filename, etc.
There are several tools for doing substitutions, the simplest of which is tr.
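tr deletes characters or replaces one set of characters with another set, one character at a time. It reads from standard input, so a file must be supplied with a redirect or a pipe.
$ tr 'ACTG' 'TGAC' < input_file
In the above example, each A is replaced with T, each C with G, each T with A, and each G with C, which yields the complement of a DNA sequence.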
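The following sketch shows the same idea on short strings instead of a file, along with deletion (-d) and character ranges. The sequences and variable names are made up for illustration.

```shell
# tr reads standard input, so feed it text with a pipe or a < redirect.
seq='AACGTT'

# Complement: A<->T and C<->G, one character at a time.
complement=$(printf '%s\n' "$seq" | tr 'ACTG' 'TGAC')

# -d deletes characters instead of replacing them (here, every N).
no_n=$(printf '%s\n' 'ACNNGT' | tr -d 'N')

# Character ranges work too: convert lowercase to uppercase.
upper=$(printf '%s\n' 'acgu' | tr 'a-z' 'A-Z')

echo "$complement $no_n $upper"
```

This prints `TTGCAA ACGT ACGU`. Because tr maps single characters, it cannot replace one word with another; that takes a different tool.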
Follow along with the instructor on your own computer.
1. Obtain all miRNA sequences in fasta format from the miRBase download page. The file is called mature.fa.
2. Extract all worm (C. elegans), fly (D. melanogaster), and human (H. sapiens) miRNAs, preserving fasta format (egrep is needed for alternation, i.e. matching one pattern or another).
3. Use tr to convert all RNA sequences to DNA sequences.
4. Determine how many miRNAs start with each possible 5' nucleotide (A, C, G, T) - what is the most common 5' nt?
5. Determine how many miRNAs are 20, 21, 22, 23, and 24 nt. What is the most common length of miRNAs?
6. Extract all potential let-7 miRNA family members using grep, preserving the fasta format. miRNA families are determined by their seed sequence, positions 2-8 of the mature miRNA. The seed sequence of let-7 is: GAGGTAG.
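One possible way to approach steps 2 through 6 is sketched below on a toy file rather than the real download. The four records are made up, and the sketch assumes that each miRNA occupies one header line followed by one sequence line and that headers begin with a species prefix such as cel-, dme-, or hsa-. It also uses two things not covered above: the -A/-B context options of grep and the {22} repetition count. Treat it as one approach, not the only one.

```shell
# Toy stand-in for mature.fa (made-up records; assumes one header line
# followed by one sequence line per miRNA).
fa=$(mktemp)
dna=$(mktemp)
printf '%s\n' \
  '>cel-demo-1 toy record' 'UGAGGUAGUAGGUUGUAUAGUU' \
  '>dme-demo-2 toy record' 'AGAGGUAGCAUUGCAUGCAUG' \
  '>hsa-demo-3 toy record' 'CAUGCAUGCAUGCAUGCAUGCA' \
  '>mmu-demo-4 toy record' 'GGGGCCCCAAAAUUUUGGGG' > "$fa"

# Step 2: keep worm, fly, and human records. -A1 also prints the line after
# each matching header; grep -v drops any "--" group separators grep adds.
# Step 3: tr turns every U into T (it would also change a U in a header).
grep -E -A1 '^>(cel|dme|hsa)-' "$fa" | grep -v '^--$' | tr 'U' 'T' > "$dna"
n_records=$(grep -c '^>' "$dna")

# Step 4: count sequences beginning with T (repeat for A, C, and G).
starts_t=$(grep -c '^T' "$dna")

# Step 5: count sequences exactly 22 nt long (repeat for 20, 21, 23, 24).
len22=$(grep -E -c '^[ACGT]{22}$' "$dna")

# Step 6: the seed occupies positions 2-8, so allow any one character first.
# -B1 also prints the header line before each matching sequence.
let7=$(grep -B1 '^.GAGGTAG' "$dna" | grep -v '^--$')
n_let7=$(printf '%s\n' "$let7" | grep -c '^>')

echo "$n_records $starts_t $len22 $n_let7"
rm -f "$fa" "$dna"
```

On the toy records this prints `3 1 2 2`: three records survive the species filter, one starts with T, two are 22 nt long, and two carry the let-7 seed.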