# NSCI 580A4 fall 2017

Instructors: Tai Montgomery, Erin Nishimura

# Assignment 3

Due date: 9/7 by 10 am

Note: To access the Montgomery lab server, you must be on campus or use a VPN client.

Useful commands for this assignment (refer to the unix cheat sheet for usage):

• To change directories, use cd.
• To list the contents of a directory, use ls.
• To create a new directory, use mkdir.
• To redirect output from a command to a new file (overwriting any existing file of that name), use >. To append it to an existing file, use >>.
• To display the contents of a file in the terminal, use less or more.
• To compress files, use gzip. To decompress files, use the -d option.
• To set up a secure shell on a remote computer, use ssh.
• In some instances, files can be copied from a remote computer using ftp and mget.
• To copy files between computers from the command line, use scp.
• To copy files from one directory to another on the same machine (can be a remote machine), use cp.
• To redirect command output to another command, use a pipe |.
• Some commands, such as more, less, and grep, have counterparts that work on compressed files; prepend z to the command name (e.g. zgrep).
• To search for patterns within files, use grep.

| Common grep options | Purpose |
| --- | --- |
| -- | end of options |
| -v | report non-pattern matching lines |
| -f | search against a list of patterns from a file |
| -A N | report N additional lines after pattern match |
| -B N | report N additional lines before pattern match |
| -F | interpret regex characters literally |
| -i | ignore case |
| -w | search whole words only |
| -r | search files in subdirectories |
| -c | return the number of lines that match a pattern |
| -n | show the line number of the pattern match |
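
The commands above can be tried safely on a small scratch file before touching the course data. The following is a minimal sketch, assuming a throwaway directory created with mktemp and a made-up file named fruits.txt (neither is part of the assignment). It shows > versus >>, a pipe into wc, gzip with zgrep, and the -c, -i, and -w options of grep.

```shell
# Work in a throwaway directory so nothing real is overwritten.
workdir=$(mktemp -d)
cd "$workdir"

# > creates (or overwrites) a file; >> appends to it.
printf 'apple\nBanana\n' > fruits.txt
printf 'pineapple\napple pie\n' >> fruits.txt

# -c counts matching lines, -w matches whole words only, -i ignores case.
n_apple=$(grep -c 'apple' fruits.txt)        # apple, pineapple, apple pie
n_word=$(grep -c -w 'apple' fruits.txt)      # apple, apple pie
n_banana=$(grep -c -i 'banana' fruits.txt)   # Banana

# A pipe sends the output of grep to wc, which counts the lines it receives.
n_piped=$(grep 'apple' fruits.txt | wc -l)

# gzip replaces fruits.txt with fruits.txt.gz; zgrep searches it without decompressing.
gzip fruits.txt
n_zipped=$(zgrep -c 'apple' fruits.txt.gz)

# gzip -d restores the original, uncompressed file.
gzip -d fruits.txt.gz

echo "$n_apple $n_word $n_banana $n_piped $n_zipped"
```

Note that grep -c and grep piped into wc -l report the same number here, because both count matching lines rather than individual matches.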

### 3A

1. Access the Montgomery lab server using ssh: ssh genomics@montgomeryserver.biology.colostate.edu (password: genomics2).

2. Change into the NSCI580A4 directory within the Documents directory.

3. Create a new directory named with your first and last name (e.g. TaiMontgomery) within the NSCI580A4 directory.

4. Copy the C_elegans directory within the NSCI580A4 directory to your directory from the command line.

6. Transfer miRNAs.txt from your computer to your directory on the server.

7. Inspect miRNAs.txt in a terminal window. What type of file format is miRNAs.txt?

8. How many instances of the miRNA let-7 (TGAGGTAGTAGGTTGTATAGTT) are in miRNAs.txt? Why is one of the sequences different from the others?

9. Determine the number of C. elegans miRNA sequences in the miRNAs.txt file. Hint: pipe the output of grep to wc.

10. Challenge: Using grep, extract all the C. elegans miRNAs to a new file, preserving their fasta format. All the commands/options needed are listed above. Feel free to work in groups. Be sure to remove the separator lines containing -- (can be done with grep).

Q. There is no question for this part, but it is a prerequisite for assignment 3B.
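
The context-line and separator behavior mentioned in the challenge can be explored on toy data first. This is only a sketch under stated assumptions: toy.fa, its record names, its sequences, and the "aaa" and "bbb" species prefixes are all invented for illustration and do not come from miRNAs.txt.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Toy fasta file: made-up names and sequences with hypothetical "aaa"/"bbb" prefixes.
cat > toy.fa <<'EOF'
>aaa-let-7 toy
ACGU
>bbb-let-7 toy
ACGA
>aaa-miR-1 toy
GGCU
>bbb-miR-2 toy
UUUU
>aaa-miR-3 toy
CCCC
EOF

# Count the header lines for one prefix by piping grep into wc.
n_headers=$(grep '^>aaa' toy.fa | wc -l)

# -A 1 keeps the sequence line after each matching header.
# grep prints a "--" line between groups of matches that are not adjacent.
grep -A 1 '^>aaa' toy.fa > with_separators.txt

# The first -- marks the end of options, so the quoted '--' is read as the pattern.
grep -v -- '--' with_separators.txt > aaa_only.fa

cat aaa_only.fa
```

Without the end-of-options marker, grep would try to interpret the pattern '--' as an option rather than as text to search for.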

### 3B

1. Extract all C. elegans miRNA sequences (excluding fasta header lines) from the file generated in the 3A challenge step above.

2. Challenge: Determine the number of reads for each C. elegans miRNA in a wild-type small RNA high-throughput sequencing library. A tab-delimited file containing small RNA sequences and reads from the library, Lib129.txt, is in the C_elegans folder. Copy the results into Excel and sum the total number of miRNA reads.

Q. Question: What is the total number of miRNA reads in the high-throughput sequencing library contained in Lib129.txt?
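
Searching one file against a list of patterns stored in another can be sketched on toy data as follows. Everything here is an assumption for illustration: the file names toy.fa, seqs.txt, and toy_lib.txt, the sequences, and the read counts are invented, and the two-column layout (sequence, then reads) is a guess at how such a table might look. The awk step is not in the command list above; it is included only as an optional cross-check of a sum done in Excel.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Toy fasta file with made-up sequences.
printf '>toy-1\nACGTACGT\n>toy-2\nGGGTTTAA\n' > toy.fa

# -v drops the header lines, leaving one sequence per line.
grep -v '^>' toy.fa > seqs.txt

# Toy tab-delimited library: sequence, then number of reads (invented values).
printf 'ACGTACGT\t10\nACGTACGTA\t4\nGGGTTTAA\t5\nTTTTCCCC\t7\n' > toy_lib.txt

# -f reads the patterns from seqs.txt, -F treats them as literal text, and
# -w requires whole-word matches, so ACGTACGT does not also hit ACGTACGTA.
grep -w -F -f seqs.txt toy_lib.txt > hits.txt

# Optional cross-check: sum the second column with awk.
total=$(awk '{ sum += $2 } END { print sum }' hits.txt)

cat hits.txt
echo "$total"
```

Dropping -w in this toy case would also report the longer read that merely contains one of the listed sequences, which would inflate the total.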

### 3C

1. Examine the file barcoded.txt.gz located in the C_elegans directory. What type of file is it and what information is contained in each line?

2. Challenge: We commonly multiplex high-throughput sequencing libraries by introducing a barcode. Sometimes the barcode, or index, sequence is inserted into the name of each read in the fastq file. Extract all reads (4 lines each) from the library with the index sequence CGATGT. Be sure to remove the separator lines containing --.

Q. Question: How many reads correspond to the library containing the index CGATGT?
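
The same idea applies to a compressed fastq file, where each read occupies four lines. The sketch below uses a made-up file, toy.fastq.gz, whose read names, sequences, and quality strings are invented and may not resemble the layout of barcoded.txt.gz. It also anchors the separator pattern as '^--$', an optional safeguard assumed here because a quality line can itself contain two consecutive hyphens.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Toy fastq file: the index sequence is appended to each (invented) read name.
cat > toy.fastq <<'EOF'
@read1:CGATGT
ACGT
+
IIII
@read2:TTAGGC
GGCC
+
IIII
@read3:CGATGT
TTTT
+
I--I
EOF
gzip toy.fastq

# zgrep searches the compressed file directly.
# -A 3 keeps the three lines that follow each matching read name.
zgrep -A 3 '^@.*CGATGT' toy.fastq.gz > hits_with_separators.txt

# Remove only lines that consist of exactly "--", so the quality line I--I survives.
grep -v '^--$' hits_with_separators.txt > CGATGT.fastq

# Four lines per read, so the read count is the line count divided by four.
n_lines=$(wc -l < CGATGT.fastq)
n_reads=$((n_lines / 4))
echo "$n_reads"
```

In real fastq data a quality line can also begin with @, so a pattern anchored on '^@' is a simplification that happens to be safe for this toy file.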

### 3D

1. Obtain the human argonaute1 (hAgo1) protein sequence from GenBank and identify the closest homolog in C. elegans using BLAST (search Google for BLAST; the answer is alg-1).

2. Obtain the C. elegans genome features table from wormbase using ftp:

$ ftp ftp.wormbase.org
Name (ftp.wormbase.org:montgomery): ftp
ftp> cd /pub/wormbase/releases/WS255/species/c_elegans/PRJNA13758
ftp> mget c_elegans.PRJNA13758.WS255.annotations.gff3.gz
(At the prompt 'mget c_elegans.PRJNA13758.WS255.annotations.gff3.gz [anpqy?]?', type y for yes.)

3. Decompress the table.

4. Browse the decompressed table in a terminal window and identify what information is contained in each column and row. What type of file format is this?

5. Search the table for alg-1 while in more or less. Use /pattern to search.

6. Extract all alg-1 associated features (i.e. any features that have the name alg-1 anywhere in the description) using grep.

7. The actual sequence ID of alg-1 is F48F7.1. Repeat step 6 using the sequence ID.

Q. Question: What are the genomic coordinates of the alg-1 gene (the information is contained in the line with gene in the feature column)?
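
The difference between searching by gene name and by sequence ID can be sketched on a toy feature table. All names and values below are hypothetical: toy.gff3, the gene names abc-1 and abc-10, the sequence IDs T01A1.1 and T01A1.10, and every coordinate are invented and say nothing about alg-1 or the WormBase file. The awk step is not in the command list above and is shown only as one possible way to pick out the line whose feature column is gene.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Toy tab-delimited feature table with nine columns (all values invented).
printf 'chrT\tToySource\tgene\t100\t900\t.\t+\t.\tID=Gene:toy1;Name=abc-1;sequence_name=T01A1.1\n' > toy.gff3
printf 'chrT\tToySource\tmRNA\t100\t900\t.\t+\t.\tID=Transcript:T01A1.1;Parent=Gene:toy1\n' >> toy.gff3
printf 'chrT\tToySource\tgene\t2000\t2500\t.\t-\t.\tID=Gene:toy2;Name=abc-10;sequence_name=T01A1.10\n' >> toy.gff3
printf 'chrT\tToySource\texon\t100\t300\t.\t+\t.\tParent=Transcript:T01A1.1;Note=abc-1 exon\n' >> toy.gff3

# A plain search for abc-1 also matches abc-10.
n_loose=$(grep -c 'abc-1' toy.gff3)

# -w restricts the match to whole words, which excludes abc-10.
n_word=$(grep -c -w 'abc-1' toy.gff3)

# -F treats the dot in the sequence ID as a literal character, not a regex wildcard.
grep -w -F 'T01A1.1' toy.gff3 > by_id.txt

# Keep the row whose third (feature) column is exactly "gene"; print seqid:start-end.
coords=$(awk -F '\t' '$3 == "gene" { print $1 ":" $4 "-" $5 }' by_id.txt)
echo "$coords"
```

In this toy table, the name search misses the mRNA line, which mentions only the sequence ID, while the sequence ID search picks it up.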