# NSCI 580A3 fall 2017

Instructors
Tai Montgomery
Erin Nishimura

# Assignment 2

Due: 10am, Sept 6, 2016

Assignments account for 60% of your final grade. All assignments are due by the start of the following class (Tuesday, 10 am). You will need to have completed the assignment to follow along effectively in the next class. One point will be awarded for each question (10 points total). Possible answers will be posted the evening after the assignments are due.

## Instructions:

We are going to use the yeast genome for these exercises. Make sure you have downloaded S288C_reference_genome_R64-1-1_20110203.tgz from http://www.yeastgenome.org.
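If the downloaded archive has not been unpacked yet, the steps might look roughly like the sketch below. It builds a tiny stand-in archive so the commands can be tried anywhere; toy_genome.tgz and its contents are made up for illustration and are not the actual download. For the real file, substitute the .tgz name given above.

```shell
# Build a tiny stand-in archive (hypothetical; not the real genome download).
mkdir -p toy_genome
printf '>chrToy\nACGT\n' > toy_genome/toy.fsa
tar -czf toy_genome.tgz toy_genome
rm -r toy_genome

# List what is inside the archive without unpacking it.
tar -tzf toy_genome.tgz

# Unpack the archive into the current directory, then look at the result.
tar -xzf toy_genome.tgz
ls toy_genome
```

Listing the contents first is a quick way to see which directory the archive will create before anything is written to disk.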

## Performing the assignment:

• Using your text editor (TextWrangler, Notepad++, etc.), start a file titled <yourlastname>_Assignment2.txt. Replace <yourlastname> with your actual last name.
• Answer each question below. You can use the template shown below by copying and pasting the template into your text editor.
• Do NOT use Microsoft Word or any other Office software to edit your .txt file. These programs can add extra characters to the file, and you'll lose points.
• When a question asks for a command, please supply the entire command line, the full set of instructions that you would put after the prompt and before pressing return.

## Assignment questions:

1. Navigate to the yeast genome file. It is a file that ends in .fsa. What command would you execute to show the number of lines, words, and characters in this file?
2. What command would you execute to show JUST the number of lines in this file?
3. The file S288C_reference_sequence_R64-1-1_20110203.fsa is a fasta file. It contains annotation lines that begin with > and sequence lines made up of the characters A, T, G, or C. Can you figure out how many annotation lines are in the file? Write two commands piped together that will display the number of annotation lines in the S288C_reference_sequence_R64-1-1_20110203.fsa file.
4. One common thing we do in computational biology is to make testfiles. What command would you use to make a testfile that contains just the top 1000 lines of S288C_reference_sequence_R64-1-1_20110203.fsa?
5. For the next few questions, move to the annotation file (~/03_annotations/saccharomyces_cerevisiae_R64-1-1_20110208.gff). One cool aspect of grep is that you can match strings that appear at the very beginning of a line using the ^ symbol. It is used like so:
$ grep '^apple' file.txt

This would find only instances of the word apple that appear at the beginning of a line. Work out how you could use ^ to pull out just the tab-delimited portion of the .gff file and save it as sacCer3_tdt_annotation.gff. What command did you use?

6. Say we want to start with the file sacCer3_tdt_annotation.gff and extract annotation lines for ONLY the nuclear-encoded tRNAs. What piped series of commands would you use to (1) extract just the lines that contain 'tRNA' entries (in the third column), (2) remove any lines that contain 'chrMito', and (3) save the resulting file with the name sacCer3_tRNA_minusMito.gff? (hint1: I used a grep command, another grep command, and a redirection. hint2: You may need to get tricky to extract the 'tRNA' entries. Think '\t'.)
7. How many lines are in your sacCer3_tRNA_minusMito.gff file?
8. Write a series of piped commands that will allow you to test that 'tRNA' is the only unique entry in the 3rd column of sacCer3_tRNA_minusMito.gff file. What are they?
9. A .bed file is a standardized file type that contains four columns of information: (1) chromosome name, (2) start, (3) stop, (4) strand. Can you run a command that will convert your sacCer3_tRNA_minusMito.gff file into a bed file? Save it as sacCer3_tRNA.bed. (hint: see the cut man pages and look at the examples)
10. Let's say you want to make a custom command that converts gff files into bed files. How would you make a command called gff2bed using alias?
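
If you want to practice the building blocks used in these questions before working on the real files, a small made-up file is a convenient sandbox. The sketch below is not an answer to any question; toy_fruit.txt and toy_apple_only.txt are hypothetical names and the contents are invented for illustration. It shows the ^ anchor from the grep example above, piping into wc -l, inverting a match with grep -v, redirecting output to a file, and pulling one tab-delimited column with cut.

```shell
# Build a small tab-delimited practice file (hypothetical; not course data).
printf 'apple\tfruit\t3\n'       >  toy_fruit.txt
printf 'apple pie\tdessert\t1\n' >> toy_fruit.txt
printf 'pineapple\tfruit\t2\n'   >> toy_fruit.txt
printf 'banana\tfruit\t5\n'      >> toy_fruit.txt

# ^ anchors the match to the start of the line, so 'pineapple' is skipped.
grep '^apple' toy_fruit.txt

# Pipe the matching lines into wc -l to count them instead of printing them.
n_apple=$(grep '^apple' toy_fruit.txt | wc -l)
echo "lines starting with apple: $n_apple"

# -v inverts the match (drops lines containing 'pie');
# > saves the surviving lines to a new file.
grep '^apple' toy_fruit.txt | grep -v 'pie' > toy_apple_only.txt

# cut -f 2 keeps only the second tab-delimited column;
# sort | uniq collapses repeated values so each one appears once.
cut -f 2 toy_fruit.txt | sort | uniq
```

Because the practice file is only four lines long, you can check each command's output by eye, which is the same reason test files are useful for larger data sets.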