# NSCI 580A5 fall 2017

Instructors
Tai Montgomery
Asa Ben-Hur

## Assignment 5

Due date: 10/25 at 10am.

#### Part 1: Processing CSV files

One of the most common formats for storing data is CSV (comma-separated values), in which a matrix is represented as lines of comma-separated values. In this part of the assignment you will write code for reading and processing data stored in CSV format. The first step is to read the data: write a function called csv_read(file_name) that reads the given file and returns a matrix (implemented as a list of lists). Suppose the file test.csv contains the following data:

```
2,3,4
5,6,7
```

```python
>>> matrix = csv_read("test.csv")
>>> matrix
[[2, 3, 4], [5, 6, 7]]
>>> matrix[0]
[2, 3, 4]
>>> matrix[1][2]
7
```
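One possible way to get this behavior is sketched below using the standard csv module. It assumes every field in the file is numeric, and the helper `_to_number` is a hypothetical name introduced only for this illustration; you are free to structure your own solution differently.

```python
import csv


def _to_number(text):
    """Convert a CSV field to int when possible, otherwise to float."""
    try:
        return int(text)
    except ValueError:
        return float(text)


def csv_read(file_name):
    """Read a CSV file of numbers and return it as a list of lists."""
    matrix = []
    with open(file_name, newline="") as handle:
        for row in csv.reader(handle):
            if not row:  # skip blank lines
                continue
            # Assumption: every field is numeric.
            matrix.append([_to_number(field) for field in row])
    return matrix
```

The csv reader returns every field as a string, so the explicit conversion is what makes `matrix[1][2]` a number rather than the text `"7"`.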

Our next step is to write a function, column_average(matrix), that computes the column-wise averages of the matrix given to it as input. It should return a list of the column averages, such that element $i$ of the list is the average of column $i$ of the matrix. Continuing our previous example:

```python
>>> averages = column_average(matrix)
>>> averages
[3.5, 4.5, 5.5]
# the resulting list should have as many entries as
# there are columns in the matrix:
>>> len(averages) == len(matrix[0])
True
```
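As a rough sketch of the idea, the columns of a list-of-lists matrix can be obtained with `zip(*matrix)`. The version below assumes the matrix is non-empty and rectangular (every row has the same length); it is one way to do it, not the required one.

```python
def column_average(matrix):
    """Return a list holding the average of each column of the matrix."""
    n_rows = len(matrix)
    # zip(*matrix) yields the columns of the matrix, one tuple per column.
    return [sum(column) / n_rows for column in zip(*matrix)]


matrix = [[2, 3, 4], [5, 6, 7]]
print(column_average(matrix))  # [3.5, 4.5, 5.5]
```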

We will use this function in order to generate a new mean-subtracted matrix using a function called mean_subtract(matrix). This function should return a new matrix that has the same size as the original and satisfies the following equation:

$$M_{ij} = O_{ij} - m_j$$

where $O$ is the original matrix, $m$ is the list of column means (so $m_j$ is the mean of column $j$), and $M$ is the result of mean-subtraction.
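To make the definition concrete, here is a minimal sketch that subtracts from each entry the mean of the column it belongs to. It computes the column means inline so that the example is self-contained, and it assumes a non-empty rectangular matrix; in your own module you would more naturally call column_average.

```python
def mean_subtract(matrix):
    """Return a new matrix in which each column has had its mean subtracted."""
    n_rows = len(matrix)
    means = [sum(column) / n_rows for column in zip(*matrix)]
    # Entry in row i, column j becomes original value minus mean of column j.
    return [[value - means[j] for j, value in enumerate(row)]
            for row in matrix]


print(mean_subtract([[2, 3, 4], [5, 6, 7]]))
# [[-1.5, -1.5, -1.5], [1.5, 1.5, 1.5]]
```

After mean-subtraction every column averages to zero, which is a convenient check when testing.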

Our final step is to write the resulting matrix to a file using a function called csvwrite(matrix, filename), which writes a matrix (the first argument) into the file given by the second argument.

Overall, a user might use your functions as follows:

```python
>>> matrix = csv_read("test.csv")
>>> mean_subtracted = mean_subtract(matrix)
>>> csvwrite(mean_subtracted, "test_mean_subtracted.csv")
```

Python has a module called csv which, as you may guess, is useful for reading and writing CSV files; you may use it for this task, or write the reading and writing code yourselves.
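If you choose to use the csv module, writing a matrix might look like the sketch below. It relies only on `csv.writer` and its `writerows` method; passing `newline=""` to `open` is what the csv documentation recommends so that line endings are not translated twice.

```python
import csv


def csvwrite(matrix, filename):
    """Write a list-of-lists matrix to filename, one row per line."""
    with open(filename, "w", newline="") as handle:
        csv.writer(handle).writerows(matrix)
```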

#### Part 2: finding frequently occurring k-mers

In my lab we are interested in alternative splicing, and particularly in intron retention, which is the most common form of alternative splicing in plants. This phenomenon is known to be regulated, and thus it is interesting to find short sequence elements that tend to appear more often in retained introns than in non-retained introns. Your task is to score each k-mer (sequence element of length $k$) using the following formula:

$$\log_2 \frac{\textrm{fraction of retained introns that contain the k-mer of interest}}{\textrm{fraction of non-retained introns that contain the k-mer of interest}}$$

where the logarithm is taken in base 2. A high value of this score indicates that the k-mer is associated with increased rates of intron retention and is potentially involved in the regulation of this process.
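The score can be illustrated on a few made-up sequences held in memory. In the sketch below, `kmer_score` is a hypothetical helper name and the sequences are invented for the example. The formula is undefined when either fraction is zero; returning None in that case is an assumption of this sketch, not something the assignment specifies.

```python
import math


def kmer_score(kmer, retained, non_retained):
    """Log2 ratio of the fractions of sequences that contain the k-mer."""
    frac_retained = sum(kmer in seq for seq in retained) / len(retained)
    frac_non_retained = (sum(kmer in seq for seq in non_retained)
                         / len(non_retained))
    # Assumption: report None when the log-ratio is undefined.
    if frac_retained == 0 or frac_non_retained == 0:
        return None
    return math.log2(frac_retained / frac_non_retained)


# Invented toy sequences: "ACG" occurs in 2 of 4 retained
# and in 1 of 4 non-retained sequences.
retained = ["ACGTAC", "TTACGA", "GGGTTT", "CCCCCC"]
non_retained = ["ACGTTT", "TTTTTT", "GGGCCC", "CCCAAA"]
print(kmer_score("ACG", retained, non_retained))  # 1.0
```

A score of 1.0 means the k-mer is found in twice as large a fraction of retained introns as of non-retained ones.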

Write a function called score_kmers(retained_introns_file, non_retained_introns_file, k) where

• retained_introns_file: a Fasta file containing retained introns
• non_retained_introns_file: a Fasta file containing introns not known to be retained
• k: the k-mer size

Your function should return the highest-scoring k-mer and print out the following information:

• The highest-scoring k-mer
• The number and fraction of retained and non-retained introns in which this k-mer occurs.
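One way the pieces could fit together is sketched below. The helpers `read_fasta` and `count_containing` are hypothetical names introduced for this illustration, the Fasta parser is a deliberately simple hand-written one, and the output format is only a suggestion. The sketch also assumes that only k-mers present in both sets are scored, because the log-ratio is undefined when either fraction is zero.

```python
import math


def read_fasta(file_name):
    """Return the sequences in a Fasta file as a list (headers discarded)."""
    sequences, current = [], []
    with open(file_name) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if current:
                    sequences.append("".join(current))
                current = []
            elif line:
                current.append(line.upper())
    if current:
        sequences.append("".join(current))
    return sequences


def count_containing(sequences, k):
    """Map each k-mer to the number of sequences containing it at least once."""
    counts = {}
    for seq in sequences:
        # A set ensures each sequence is counted once per k-mer.
        for kmer in {seq[i:i + k] for i in range(len(seq) - k + 1)}:
            counts[kmer] = counts.get(kmer, 0) + 1
    return counts


def score_kmers(retained_introns_file, non_retained_introns_file, k):
    retained = read_fasta(retained_introns_file)
    non_retained = read_fasta(non_retained_introns_file)
    r_counts = count_containing(retained, k)
    n_counts = count_containing(non_retained, k)

    best_kmer, best_score = None, -math.inf
    # Assumption: score only k-mers seen in both sets (see lead-in).
    for kmer in sorted(r_counts.keys() & n_counts.keys()):
        score = math.log2((r_counts[kmer] / len(retained))
                          / (n_counts[kmer] / len(non_retained)))
        if score > best_score:
            best_kmer, best_score = kmer, score
    if best_kmer is None:
        return None

    r, n = r_counts[best_kmer], n_counts[best_kmer]
    print(f"Highest-scoring k-mer: {best_kmer} (score {best_score:.3f})")
    print(f"Retained introns: {r} of {len(retained)} "
          f"({r / len(retained):.3f})")
    print(f"Non-retained introns: {n} of {len(non_retained)} "
          f"({n / len(non_retained):.3f})")
    return best_kmer
```

Counting each k-mer at most once per intron matches the formula, which is stated in terms of the fraction of introns that contain the k-mer rather than the total number of occurrences.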

Use the following Fasta files to write this function:

#### Submission

Put the functions you wrote in a module called assignment5.py, and follow the template shown in class in writing your code. The “main” segment of the module should be used to test each of the functions. Use a single main segment! Submit your code via Canvas.