Due date: 9/7 by 10 am
*To access the Montgomery lab server, you must be on campus or use a VPN client.
*Submit your answers to the questions marked Q on Canvas.
Useful commands for this assignment (refer to the unix cheat sheet for usage):
|Common grep options|Purpose|
|-v|report lines that do not match the pattern|
|-f|search against a list of patterns from a file|
|-A N|report N additional lines after each pattern match|
|-B N|report N additional lines before each pattern match|
|-F|interpret regex characters literally|
|-w|search whole words only|
|-r|search files in subdirectories|
|-c|return the number of lines that match a pattern|
|-n|show the line number of each pattern match|
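A few of these options can be tried safely on a throwaway file before touching the assignment data. The sketch below is only an illustration: the file demo.txt and its contents are made up, and the variable names are arbitrary.

```shell
# Build a small throwaway file to try the options on (contents are made up).
tmpdir=$(mktemp -d)
cat > "$tmpdir/demo.txt" <<'EOF'
>seq1 example
ACGT
>seq2 example
TTGA
>seq3 other
ACGTACGT
EOF

# -c counts the lines that match the pattern.
headers=$(grep -c '>' "$tmpdir/demo.txt")

# -v inverts the match: keep only lines that do NOT contain '>'.
sequences=$(grep -v '>' "$tmpdir/demo.txt")

# -w matches whole words only, so ACGT is not reported inside ACGTACGT;
# -n prefixes each hit with its line number.
whole=$(grep -w -n 'ACGT' "$tmpdir/demo.txt")

# Piping grep into wc -l is another way to count matching lines
# (tr strips the padding some versions of wc add).
example_count=$(grep 'example' "$tmpdir/demo.txt" | wc -l | tr -d ' ')

echo "header lines: $headers"
echo "whole-word hit: $whole"
echo "lines containing 'example': $example_count"

rm -r "$tmpdir"
```

Quoting the pattern, as in '>', keeps the shell from treating it as an output redirection.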
1. Access the montgomery lab server using ssh: ssh firstname.lastname@example.org, password: genomics2.
2. Change into the NSCI580A4 directory within the Documents directory.
3. Create a new directory named with your first and last name (e.g. TaiMontgomery) within the NSCI580A4 directory.
4. Copy the C_elegans directory within the NSCI580A4 directory to your directory from the command line.
5. From the sample data page on the course website, download miRNAs.txt onto your computer.
6. Transfer miRNAs.txt from your computer to your directory on the server.
7. Inspect miRNAs.txt in a terminal window. What type of file format is miRNAs.txt?
8. How many instances of the miRNA let-7 (TGAGGTAGTAGGTTGTATAGTT) are in miRNAs.txt? Why is one of the sequences different from the others?
9. Determine the number of C. elegans miRNA sequences in the miRNAs.txt file. Hint: pipe the output of grep to wc.
10. Challenge: Extract all the C. elegans miRNAs preserving their fasta format using grep to a new file. All the commands/options needed are listed above. Feel free to work in groups. Be sure to remove the separator lines containing -- (can be done with grep).
Q. There is no question for this assignment, but it is a prerequisite for assignment 3B.
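One possible approach to steps 9 and 10 is sketched below on a toy file. It assumes each record is a header line followed by exactly one sequence line, and that C. elegans headers begin with >cel-. Check both assumptions against the real miRNAs.txt. The toy names and all sequences other than let-7 are invented.

```shell
# Toy stand-in for miRNAs.txt (names other than let-7 are invented).
tmpdir=$(mktemp -d)
cat > "$tmpdir/toy_miRNAs.txt" <<'EOF'
>cel-let-7 Caenorhabditis elegans let-7
TGAGGTAGTAGGTTGTATAGTT
>hsa-toy-1 Homo sapiens toy-1
TGAGGTAGTAGGTTGTATAGTT
>cel-toy-2 Caenorhabditis elegans toy-2
AAAACCCCGGGGTTTT
>dme-toy-3 Drosophila melanogaster toy-3
GGGGTTTTAAAACCCC
>cel-toy-4 Caenorhabditis elegans toy-4
CCCCAAAATTTTGGGG
EOF

# Count header lines for the assumed C. elegans prefix by piping grep to wc.
cel_count=$(grep '^>cel-' "$tmpdir/toy_miRNAs.txt" | wc -l | tr -d ' ')

# -A 1 keeps the sequence line after each matching header. grep prints a
# "--" line between non-adjacent groups, so a second grep -v removes those.
out="$tmpdir/cel_miRNAs.fa"
grep -A 1 '^>cel-' "$tmpdir/toy_miRNAs.txt" | grep -v '^--$' > "$out"

out_lines=$(wc -l < "$out" | tr -d ' ')
separators=$(grep -c '^--$' "$out" || true)

echo "C. elegans records: $cel_count"
echo "lines written: $out_lines (separator lines left: $separators)"

rm -r "$tmpdir"
```

The output file should contain two lines per record, which is a quick sanity check that no separator lines slipped through.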
1. Extract all C. elegans miRNA sequences (excluding fasta header lines) from the file you generated in step 10 (the Challenge) above.
2. Challenge: Determine the number of reads for each C. elegans miRNA in a wild type small RNA high-throughput sequencing library. A tab-delimited file containing small RNA sequences and reads from the library is in the C_elegans folder - Lib129.txt. Copy the results into Excel and sum the total number of miRNA reads.
Q. Question: What is the total number of miRNA reads in the high-throughput sequencing library contained in Lib129.txt?
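The two steps above can be sketched on toy data as follows. The sketch assumes the library file has the sequence in column 1 and the read count in column 2, separated by a tab. Confirm this against Lib129.txt before relying on it. The file names and read counts are invented, and awk is used here only as an optional alternative to summing in Excel.

```shell
tmpdir=$(mktemp -d)

# Toy fasta file standing in for the extracted C. elegans miRNAs.
printf '%s\n' '>cel-let-7 toy header' 'TGAGGTAGTAGGTTGTATAGTT' \
              '>cel-toy-1 toy header' 'AAAACCCCGGGGTTTT' > "$tmpdir/cel.fa"

# Toy tab-delimited library: sequence <TAB> reads (numbers are invented).
printf 'TGAGGTAGTAGGTTGTATAGTT\t120\nTGAGGTAGTAGGTTGTATAGTTA\t7\nAAAACCCCGGGGTTTT\t45\nGGGGTTTTAAAACCCC\t300\n' \
  > "$tmpdir/toy_lib.txt"

# Step 1: drop the header lines so only sequences remain.
grep -v '>' "$tmpdir/cel.fa" > "$tmpdir/seqs.txt"

# Step 2: use the sequences as a pattern list. Without -w, a longer read
# that merely contains a miRNA sequence is also reported.
loose=$(grep -c -f "$tmpdir/seqs.txt" "$tmpdir/toy_lib.txt")
grep -w -f "$tmpdir/seqs.txt" "$tmpdir/toy_lib.txt" > "$tmpdir/hits.txt"
strict=$(wc -l < "$tmpdir/hits.txt" | tr -d ' ')

# Sum the read counts in column 2 of the whole-word hits.
total=$(awk -F '\t' '{ sum += $2 } END { print sum }' "$tmpdir/hits.txt")

echo "lines matched without -w: $loose, with -w: $strict"
echo "total reads for exact matches: $total"

rm -r "$tmpdir"
```

Whether longer or shorter variants of a miRNA should be counted is a judgment call. The -w option simply makes the choice explicit.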
1. Examine the file barcoded.txt.gz located in the C_elegans directory. What type of file is it and what information is contained in each line?
2. Challenge: We commonly multiplex high-throughput sequencing libraries by introducing a barcode. Sometimes the barcode, or index, sequence is inserted into the name of each read in the fastq file. Extract all reads (4 lines each) from the library with the index sequence CGATGT. Be sure to remove the separator lines containing --.
Q. Question: How many reads correspond to the library containing the index CGATGT?
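A minimal sketch of the extraction on a toy compressed fastq file follows. It assumes the index sits at the very end of each read name line, which may not match barcoded.txt.gz, so inspect the real headers first. The read names and sequences are invented, and gzip -dc is used as a portable equivalent of zcat.

```shell
tmpdir=$(mktemp -d)

# Toy gzipped fastq file; read2 carries CGATGT inside its sequence only.
gzip -c > "$tmpdir/toy.fastq.gz" <<'EOF'
@read1 1:N:0:CGATGT
ACGTACGTAC
+
IIIIIIIIII
@read2 1:N:0:TGACCA
AACGATGTCC
+
IIIIIIIIII
@read3 1:N:0:CGATGT
TTTTGGGGCC
+
IIIIIIIIII
EOF

# A bare pattern also hits sequence lines that happen to contain the index.
naive_hits=$(gzip -dc "$tmpdir/toy.fastq.gz" | grep -c 'CGATGT')

# Anchor the pattern to the name line, keep the 3 lines after each hit,
# then drop the "--" separators that grep inserts between groups.
gzip -dc "$tmpdir/toy.fastq.gz" \
  | grep -A 3 '^@.*CGATGT$' \
  | grep -v '^--$' > "$tmpdir/CGATGT.fastq"

n_lines=$(wc -l < "$tmpdir/CGATGT.fastq" | tr -d ' ')
n_reads=$((n_lines / 4))
first_line=$(head -n 1 "$tmpdir/CGATGT.fastq")

echo "lines matching the bare pattern: $naive_hits"
echo "reads extracted: $n_reads"

rm -r "$tmpdir"
```

Dividing the line count by 4 gives the number of reads only if every record is exactly four lines, which is worth confirming on the real file.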
1. Obtain the human argonaute1 (hAgo1) protein sequence from GenBank and identify the closest homolog in C. elegans using BLAST (search Google for BLAST; the answer is alg-1).
2. Obtain the C. elegans genome features table from WormBase using ftp:
$ ftp ftp.wormbase.org
Name (ftp.wormbase.org:montgomery): ftp
ftp> cd /pub/wormbase/releases/WS255/species/c_elegans/PRJNA13758
ftp> mget c_elegans.PRJNA13758.WS255.annotations.gff3.gz
(At the prompt 'mget c_elegans.PRJNA13758.WS255.annotations.gff3.gz [anpqy?]?' type y for yes.)
3. Decompress the table.
4. Browse the decompressed table in a terminal window and identify what information is contained in each column and row. What type of file format is this?
5. Search the table for alg-1 while viewing it in more or less. Use /pattern to search.
6. Extract all alg-1 associated features (i.e. any features that have the name alg-1 anywhere in the description) using grep.
7. The actual sequence ID of alg-1 is F48F7.1. Repeat step 6 using the sequence ID.
Q. Question: What are the genomic coordinates of the alg-1 gene? (The information is contained in the line with gene in the feature column.)
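Steps 3, 6 and 7 can be rehearsed on a toy features table before running them on the full download. Everything in the toy file is invented for illustration: the chromosome name, the coordinates, the attribute layout, and the neighbouring entries alg-10 and F48F7.10. The real table may name its attributes differently. awk is used only as an optional way to pick out columns.

```shell
tmpdir=$(mktemp -d)
gff="$tmpdir/toy.gff3"

# Toy tab-delimited feature lines (9 columns; all values are made up).
printf 'chrToy\tWormBase\tgene\t1000\t5000\t.\t-\t.\tID=Gene:toy1;locus=alg-1;sequence_name=F48F7.1\n' > "$gff"
printf 'chrToy\tWormBase\tmRNA\t1000\t5000\t.\t-\t.\tID=Transcript:F48F7.1a;Parent=Gene:toy1\n' >> "$gff"
printf 'chrToy\tWormBase\tgene\t7000\t9000\t.\t+\t.\tID=Gene:toy2;locus=alg-10;sequence_name=F48F7.10\n' >> "$gff"

# Step 3: compress, then decompress, to mimic the downloaded .gz file.
gzip "$gff"
gunzip "$gff.gz"

# Step 6: a plain search for alg-1 also reports alg-10; -w avoids that.
loose=$(grep -c 'alg-1' "$gff")
whole=$(grep -c -w 'alg-1' "$gff")

# Step 7: -F treats the dot in F48F7.1 literally rather than as "any character".
by_id=$(grep -c -F 'F48F7.1' "$gff")

# Keep only the exact ID, then print sequence name, start and end
# (columns 1, 4 and 5) for the line whose feature column is "gene".
coords=$(grep -F -w 'F48F7.1' "$gff" | awk -F '\t' '$3 == "gene" { print $1, $4, $5 }')

echo "alg-1 lines: $loose loose, $whole whole-word; ID lines: $by_id"
echo "toy gene coordinates: $coords"

rm -r "$tmpdir"
```

Note that -w also excludes isoform names such as F48F7.1a, so it is a poor fit for step 7 if every associated feature is wanted. It is used here only to isolate the gene line.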