assignments:2016assignment2

Differences

This shows you the differences between two versions of the page.

assignments:2016assignment2 [2016/08/31 22:25]
erin
assignments:2016assignment2 [2016/08/31 23:01] (current)
erin
Line 26:
   - The file ''S288C_reference_sequence_R64-1-1_20110203.fsa'' is a fasta file. It contains annotation lines that begin with ''>'' and sequence lines made up of the characters A, T, G, or C. Can you figure out how many annotation lines are in the file? Write two commands piped together that will display the number of annotation lines in the ''S288C_reference_sequence_R64-1-1_20110203.fsa'' file.
   - One common thing we do in computational biology is to make test files. What command would you use to make a test file, called ''test.fsa'', that contains just the top 1000 lines of ''S288C_reference_sequence_R64-1-1_20110203.fsa''?
-  - For the next few questions, move to the annotation file (''~/03_annotations/saccharomyces_cerevisiae_R64-1-1_20110208.gff''). You'll remember that .gff files list all the annotated features in a genome (genes, start codons, tRNAs, snoRNAs, etc.). A gff file contains (1) commented information (lines that start with #); (2) tab-delimited annotation lines corresponding to all the genome features (lines start with chr...); and (3) sometimes a fasta file at the end. One cool aspect of grep is that you can match strings that appear at the very beginning of a line using the ''^'' symbol. It is used like so: <code bash>
+  - For the next few questions, move to the annotation file (''~/03_annotations/saccharomyces_cerevisiae_R64-1-1_20110208.gff''). You'll remember that .gff files list all the annotated features in a genome (genes, start codons, tRNAs, snoRNAs, etc.). A gff file contains (1) commented information (lines that start with #); (2) tab-delimited annotated feature lines corresponding to all the genome features (lines start with chr...); and (3) sometimes a fasta file at the end. One cool aspect of grep is that you can match strings that appear at the very beginning of a line using the ''^'' symbol. It is used like so: <code bash>
 $ grep '^apple' file.txt
-</code> This would find only instances of the word ''apple'' that appear at the beginning of a line. See how you could use ''^'' within a grep command to pull out just the tab-delimited annotation portion of the .gff file and save it as ''sacCer3_tab.gff''. What command did you use?
+</code> This would find only instances of the word ''apple'' that appear at the beginning of a line. See how you could use ''^'' within a grep command to pull out just the tab-delimited annotated feature portion of the .gff file and save it as ''sacCer3_tab.gff''. What command did you use?
   - Say we want to start with the file ''sacCer3_tab.gff'' and extract the tab-delimited lines for ONLY the nuclear-encoded tRNAs. What piped series of commands would you use to (1) extract just the lines that contain 'tRNA' entries (listed in the third column), (2) remove any lines that contain 'chrMito', and (3) save the resulting file with the name ''sacCer3_tRNA_minusMito.gff''? ([[assignments:2_hint1|hint1]]; [[assignments:2_hint2|hint2]])
   - Write a series of piped commands that will allow you to test that 'tRNA' is the only unique entry in the 3rd column of the ''sacCer3_tRNA_minusMito.gff'' file. What are they? ([[assignments:2_hint3|hint3]])
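As a minimal sketch of the ''^'' anchor described in the list above, the following compares an unanchored and an anchored grep. The file ''fruit.txt'' and its contents are hypothetical toy data, not part of the assignment files, and the variable names are made up for illustration.

```shell
# Hypothetical toy file (an assumption for illustration, not assignment data).
tmpdir=$(mktemp -d)
printf 'apple pie\ngreen apple\napple tart\n# apple note\n' > "$tmpdir/fruit.txt"

# Unanchored pattern: counts every line that contains "apple" anywhere.
all_matches=$(grep -c 'apple' "$tmpdir/fruit.txt")

# Anchored pattern: ^ restricts the match to lines that BEGIN with "apple".
# The matching lines are piped to wc -l to count them; tr strips any
# padding that some wc implementations add around the number.
anchored_matches=$(grep '^apple' "$tmpdir/fruit.txt" | wc -l | tr -d ' ')

echo "unanchored: $all_matches, anchored: $anchored_matches"

# Clean up the temporary directory.
rm -rf "$tmpdir"
```

With this toy input, all four lines contain ''apple'' but only two begin with it, so the anchored count is the smaller one. The same idea applies to any fixed prefix at the start of a line.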
assignments/2016assignment2.txt · Last modified: 2016/08/31 23:01 by erin