Assignment 5

Due date: 10/25 at 10am.

Part 1: Processing CSV files

One of the most common formats for storing data is CSV, in which a matrix is represented as lines of comma-separated values (hence the name). In this part of the assignment you will write code for reading and processing data stored in CSV format. The first step is to read the data stored in the file. For that purpose, write a function called csv_read(file_name) that reads the data stored in the given file and returns a matrix (implemented as a list-of-lists). Suppose the data is stored in a file called test.csv that has the following data:

2,3,4
5,6,7

Reading this file should generate:

>>> matrix = csv_read("test.csv")
>>> matrix
[[2, 3, 4], [5, 6, 7]]
>>> matrix[0]
[2, 3, 4]
>>> matrix[1][2]
7
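One possible shape for csv_read is sketched below, without the csv module. It assumes Python 3, that every value in the file is numeric, and that blank lines can be ignored. The helper parse_number is a hypothetical name introduced only for this illustration; it keeps whole numbers as ints so the result matches the example above.

```python
def parse_number(token):
    """Hypothetical helper: convert text to an int if possible, else a float."""
    token = token.strip()
    try:
        return int(token)
    except ValueError:
        return float(token)


def csv_read(file_name):
    """Read a CSV file of numbers into a list of lists, one inner list per line."""
    matrix = []
    with open(file_name) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                # Assumption: blank lines carry no data and are skipped.
                continue
            matrix.append([parse_number(token) for token in line.split(",")])
    return matrix
```

Splitting each line on commas is enough here because the values are plain numbers; quoted fields containing commas would need more care.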

Our next step is to write a function that computes the column-wise averages of the matrix given to it as input. Call that function column_average(matrix). It should return a list providing the averages of the columns of the matrix, such that element $i$ in the list is the average of column $i$ in the matrix. Continuing our previous example:

>>> averages = column_average(matrix)
>>> averages
[3.5, 4.5, 5.5]
# the resulting list should have as many entries as 
# there are columns in the matrix:
>>> len(averages) == len(matrix[0])
True
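A minimal sketch of column_average follows. It assumes Python 3 (so that / is true division), a non-empty matrix, and rows of equal length.

```python
def column_average(matrix):
    """Return a list whose element j is the average of column j of the matrix."""
    n_rows = len(matrix)
    n_cols = len(matrix[0])  # assumption: all rows have the same length
    averages = []
    for j in range(n_cols):
        total = 0
        for row in matrix:
            total += row[j]
        averages.append(total / n_rows)
    return averages


print(column_average([[2, 3, 4], [5, 6, 7]]))  # [3.5, 4.5, 5.5]
```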

We will use this function in order to generate a new mean-subtracted matrix using a function called mean_subtract(matrix). This function should return a new matrix that has the same size as the original and satisfies the following equation:

$$ M_{ij} = O_{ij} - m_j $$ where $O$ is the original matrix, $m$ is the list of column means (so $m_j$ is the mean of column $j$), and $M$ is the result of mean-subtraction.
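As an illustration, mean subtraction can be sketched as subtracting from every entry the average of its column. The block below repeats a compact column_average so that it runs on its own; it assumes Python 3 and rows of equal length, and it builds a new matrix rather than modifying the input.

```python
def column_average(matrix):
    """Compact version: element j is the average of column j."""
    return [sum(row[j] for row in matrix) / len(matrix)
            for j in range(len(matrix[0]))]


def mean_subtract(matrix):
    """Return a new matrix in which each column has had its mean subtracted."""
    means = column_average(matrix)
    return [[value - means[j] for j, value in enumerate(row)]
            for row in matrix]


original = [[2, 3, 4], [5, 6, 7]]
print(mean_subtract(original))
# [[-1.5, -1.5, -1.5], [1.5, 1.5, 1.5]]
```

A quick sanity check is that every column of the result averages to zero.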

Our final step is to write the resulting matrix to a file using a function called csvwrite(matrix, filename), which writes a matrix (the first argument) into the file given by the second argument.
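A minimal sketch of csvwrite, again without the csv module, is shown below. It assumes that each row becomes one line, that values are converted with str, and that an existing file with the same name may be overwritten.

```python
def csvwrite(matrix, filename):
    """Write a list-of-lists matrix to a file, one comma-separated row per line."""
    with open(filename, "w") as handle:  # assumption: overwriting is acceptable
        for row in matrix:
            handle.write(",".join(str(value) for value in row) + "\n")
```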

Overall, a user might use your functions as follows:

>>> matrix = csv_read("test.csv")
>>> mean_subtracted = mean_subtract(matrix)
>>> csvwrite(mean_subtracted, "test_mean_subtracted.csv")

Although Python has a module called csv, which, as you may guess, is useful for reading and writing CSV files, please refrain from using it.

Part 2: Finding frequently occurring k-mers

In my lab we are interested in alternative splicing, and particularly in intron retention, which is the most common form of alternative splicing in plants. This phenomenon is known to be regulated, and thus it is interesting to find short sequence elements that tend to appear more often in retained introns than in non-retained introns. Your task is to score each k-mer (sequence element of length $k$) using the following formula:

$$ \log_2 \frac{\textrm{fraction of retained introns that contain the k-mer of interest}}{\textrm{fraction of non-retained introns that contain the k-mer of interest}} $$ where the logarithm is taken in base 2. A high value of this score indicates that the k-mer is associated with increased rates of intron retention, and is potentially involved in the regulation of this process.
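To make the score concrete, here is a rough sketch that computes it for sequences already held in memory as lists of strings; reading the Fasta files is left out. The names kmers_in, fraction_containing and score_sequences are hypothetical and used only for this illustration. One further assumption: a k-mer that is missing from either set of introns is skipped, since its score would involve a zero fraction and is undefined.

```python
from math import log2


def kmers_in(sequence, k):
    """Return the set of distinct k-mers that occur in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def fraction_containing(sequences, k):
    """Map each k-mer to the fraction of sequences containing it at least once."""
    counts = {}
    for sequence in sequences:
        # Using a set means each sequence counts a k-mer only once.
        for kmer in kmers_in(sequence, k):
            counts[kmer] = counts.get(kmer, 0) + 1
    return {kmer: count / len(sequences) for kmer, count in counts.items()}


def score_sequences(retained, non_retained, k):
    """Return a dict mapping each shared k-mer to its log2 ratio score."""
    retained_fraction = fraction_containing(retained, k)
    non_retained_fraction = fraction_containing(non_retained, k)
    scores = {}
    for kmer, fraction in retained_fraction.items():
        # Assumption: k-mers absent from the non-retained set are skipped.
        if kmer in non_retained_fraction:
            scores[kmer] = log2(fraction / non_retained_fraction[kmer])
    return scores


# Toy sequences made up for illustration, not real intron data.
retained = ["ACGTAC", "TTACGA", "ACGACG"]
non_retained = ["ACGTTT", "GGGTTT", "CCCTTT", "AAATTT"]
scores = score_sequences(retained, non_retained, 3)
print(scores["ACG"])  # 2.0
```

In the toy data, ACG appears in all three retained sequences and in one of the four non-retained sequences, so its score is $\log_2(1 / 0.25) = 2$.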

Write a function called score_kmers(retained_introns_file, non_retained_introns_file, k) where

Your function should return the k-mer and print out the following information:

Use the following Fasta files to write this function:


Put the functions you wrote in a module called, and follow the template shown in class in writing your code. The “main” segment of the module should be used to test each of the functions. Use one single main segment! Submit your code via Canvas.