Assignment 2

Due Date & Grading

Due: 10am, Sept 5, 2017

A point will be awarded for each question (10 points total).

Instructions:

  • We are going to use the yeast genome for these exercises.
  • Make sure you have downloaded S288C_reference_genome_R64-1-1_20110203.tgz from http://www.yeastgenome.org.
  • These exercises will use a fasta file, a gff file, and a bed file. To learn more about these file types, please explore their descriptions on UCSC Genome Browser File Format FAQ.

Performing the assignment:

  • Using your text editor (TextWrangler, Notepad++, etc.), start a file titled <yourlastname>_Assignment2.txt. Replace <yourlastname> with your actual last name.
  • Answer each question below. You can use this template by copying and pasting the template into your text editor.
  • Do NOT use Microsoft Word or any other Office software to edit your .txt file. This can add extra characters to the file and you'll lose points.
  • When a question asks for a command, please supply the entire command line: everything you would type after the prompt and before pressing return.
  • You don't need to include the output of the command, just the command itself.
  • Turn the assignment in via Canvas.

Assignment questions:

  1. Navigate to the yeast genome file called S288C_reference_sequence_R64-1-1_20110203.fsa. What command would you execute to show the number of lines, words, and characters in this file?
  2. What command would you execute to show JUST the number of lines in this file? (hint: use man)
  3. The file S288C_reference_sequence_R64-1-1_20110203.fsa is a fasta file. It contains annotation lines that begin with > and sequence lines made up of the characters A, T, G, or C. Can you figure out how many annotation lines are in the file? Write two commands piped together that will display the number of annotation lines in the S288C_reference_sequence_R64-1-1_20110203.fsa file.
  4. One common thing we do in computational biology is to make test files. What command would you use to make a test file called test.fsa that contains just the top 1000 lines of S288C_reference_sequence_R64-1-1_20110203.fsa?
  5. For the next few questions, move to the annotation file (~/03_annotations/saccharomyces_cerevisiae_R64-1-1_20110208.gff). You'll remember that .gff files list all the annotated features in a genome (genes, start codons, tRNAs, snoRNAs, etc.). A gff file contains (1) commented annotation information (lines that start with #); (2) tab-delimited feature information lines corresponding to all the genome features (lines that start with chr…); and (3) sometimes a fasta file at the end. One useful feature of grep is that you can match strings that appear only at the very beginning of a line by using the ^ symbol. It is used like so:
    $ grep '^apple' file.txt

    This would find only lines in which the word apple appears at the very beginning. Work out how you could use ^ within a grep command to pull out just the tab-delimited feature information from the .gff file (in other words, leave behind the #-commented information and the fasta information) and save the feature information as sacCer3_tab.gff. What command did you use?

  6. Say we want to start with the file sacCer3_tab.gff and extract the tab-delimited lines for ONLY the tRNAs. What command would you use to (1) extract just the lines that have 'tRNA' in the third column and (2) save them to a file called sacCer3_tRNA.gff? (Hint: to avoid capturing the word 'tRNA' where it appears inside some other entry, require a tab immediately before and after it. Tabs can be specified as '\t'.) (windows_hint; hint1; hint2)
  7. Write a series of piped commands that will allow you to test that 'tRNA' is the only entry in the 3rd column of the sacCer3_tRNA.gff file. What are they? (hint3)
  8. A .bed file is a standardized file type that, for this exercise, contains four columns of information: (1) chromosome name, (2) start, (3) stop, (4) strand. Execute a command that will convert your sacCer3_tRNA.gff file into a bed file and save it as sacCer3_tRNA.bed. What command did you use? (hint4)
  9. On Summit, there are different architectures of compute nodes that you can request to use (CPU, GPU, big memory, etc.). A) How many bigmem nodes are on Summit? B) What is the maximum number of cores per bigmem node you can request? C) Therefore, what is the maximum number of total bigmem cores that can be used on the system?
  10. Say you want to (1) execute a job on one sgpu node, using (2) the maximum number of cores you can use on that node and (3) the maximum time, and you would like your job to be in (4) the normal queue. List the #SBATCH options you would put at the top of your script to specify these requests.
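
As a small illustration of the ^ anchor described in question 5, the sketch below compares an unanchored and an anchored grep on a made-up file. The file name fruits.txt and its contents are hypothetical and are not part of the assignment data; the -c option simply asks grep to print a count of matching lines instead of the lines themselves.

```shell
# Build a small, made-up file (fruits.txt is a hypothetical name).
printf 'apple pie\ngreen apple\n#apple comment\napple\n' > fruits.txt

# Unanchored: counts every line that contains "apple" anywhere.
unanchored=$(grep -c 'apple' fruits.txt)

# Anchored: counts only the lines that begin with "apple".
anchored=$(grep -c '^apple' fruits.txt)

echo "unanchored: $unanchored, anchored: $anchored"
```

The same idea carries over to the .gff file: choosing what comes right after ^ decides which class of lines is kept.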

Template

#######################################################################################
#NAME: 
#
#DATE:
#
#ASSIGNMENT: 2
#######################################################################################
1)

2)

3)

4)

5)

6)

7) 

8)

9A)
9B)
9C)

10)
assignments/2017assignment2.txt · Last modified: 2017/08/31 14:03 by erin