User Tools

Site Tools


wiki:differential_expression

Differential Expression

Before starting today's work we need to update our .bashrc by adding the following lines to it:

export PYTHONPATH="${PYTHONPATH}:/s/chopin/c/proj/protfun/arch/x86_64/lib64/python/pysam-0.8.3-py2.7-linux-x86_64.egg/:/s/chopin/c/proj/protfun/arch/univ/lib/python"
export R_LIBS=/s/chopin/c/proj/protfun/arch/x86_64/R

After doing that you will need to run the source command for these changes to take effect:

$ source .bashrc

Read counts

Tools such as edgeR require a table of read counts - how many reads are associated with each gene. There is a variety of tools for computing read counts, including HTSeq, and several R/Bioconductor packages. We'll use code written by my students that uses the pysam library:

count_reads.py
from SpliceGrapher.formats.loader import loadGeneModels
import pysam,os
 
# the output file name                                                                               
read_counts_file = "read_counts_L1_t1.txt"
 
# sorted, indexed bamfiles (for line 1, time stamp = 1Hr)                                            
alignWT_r1 = "/s/jawar/b/nobackup/altsplice2/Sorghum_New/Sample_R2/TopHat_Aligned/accepted_hits_noMu\
lti.bam"
alignWT_r2 = "/s/jawar/b/nobackup/altsplice2/Sorghum_New/Sample_R3/TopHat_Aligned/accepted_hits_noMu\
lti.bam"
alignTreated_r1 = "/s/jawar/b/nobackup/altsplice2/Sorghum_New/Sample_R4/TopHat_Aligned/accepted_hits\
_noMulti.bam"
alignTreated_r2 = "/s/jawar/b/nobackup/altsplice2/Sorghum_New/Sample_R5/TopHat_Aligned/accepted_hits\
_noMulti.bam"
 
ctr_rep1 = pysam.Samfile(alignWT_r1, 'rb')
ctr_rep2 = pysam.Samfile(alignWT_r2, 'rb')
trt_rep1 = pysam.Samfile(alignTreated_r1, 'rb')
trt_rep2 = pysam.Samfile(alignTreated_r2, 'rb')
 
# load annotations                                                                                   
annotation_file = "/s/jawar/b/nobackup/altsplice2/Sorghum_New/annotations/genome.gff3"
gene_models = loadGeneModels(annotation_file, verbose=1)
genes = gene_models.getAllGenes()
 
read_count_threshold = 50
with open(read_counts_file, 'w') as fout :
    # write the header line                                                                          
    fout.write('Symbol\tCTR1\tCTR2\tTRD1\tTRD2\n')
    for gene in genes:
	#if not gene.chromosome.isdigit(): continue
        if gene.chromosome.startswith('chrm') or gene.chromosome.startswith('chrc') : continue
	c = gene.chromosome.capitalize() if gene.chromosome.startswith('chr') else gene.chromosome
        ctr1 = len([r for r in ctr_rep1.fetch(c, gene.minpos, gene.maxpos)])
        ctr2 = len([r for r in ctr_rep2.fetch(c, gene.minpos, gene.maxpos)])
	trt1 = len([r for r in trt_rep1.fetch(c, gene.minpos, gene.maxpos)])
        trt2 = len([r for r in trt_rep2.fetch(c, gene.minpos, gene.maxpos)])
        #You can require a certain overall read count in order to include the gene:                  
        if ctr1 + ctr2 + trt1 + trt2  > read_count_threshold :
            fout.write('%s\t%d\t%d\t%d\t%d\n' % ( gene.id, ctr1, ctr2, trt1, trt2))

edgeR

First make sure you have a file of read counts if you haven't run the above script.

In the Linux command line type the command “R” to start the R interpreter. Our first step is to load the edgeR library:

library(edgeR)

Next, read in the read count data, where the row identifiers are taken from the column named “Symbol”:

counts <- read.delim("read_counts_l1_t1.txt", row.names="Symbol")

The following command encodes the grouping of the experiments:

group <- factor(c("CTR","CTR","TRD","TRD"))

Create a DGEList object which is edgeR's container for RNA-seq count data:

d <- DGEList(counts=counts, group=group)

Each of the following commands adds additional components to the DGEList object:

d <- calcNormFactors(d)
d <- estimateCommonDisp(d)
d <- estimateTagwiseDisp(d)

The calcNormFactors() function normalizes the data by finding scaling factors for the library sizes that minimize the log-fold changes between the samples for most genes. Using calcNormFactors does not change the counts: it just updates the column norm.factors in $samples. The dispersion parameter for each tag is a measure of the degree of inter-library variation for that tag. Common dispersion estimates a global dispersion for the dataset, and gives an idea of the overall variability.

We are now ready to test for differential expression:

de <- exactTest(d)

For more sophisticated experimental designs, you will need to use edgeR's glmRT method. Let's take a look at a histogram of the unadjusted p-values:

hist(de$table[,"PValue"], breaks=50)

For a summary of the results:

summary(decideTestsDGE(de, p=0.01, adjust="fdr"))

This will tell us how many genes are down/up regulated at an FDR-adjusted level of 0.01. Next, we'll use the topTags function to create a tabular summary of the differential expression statistics:

tt = topTags(de, n=nrow(d))

This creates a table of the top differentially expressed genes, ranked by their p-value. Let's write the table to a file:

write.table(tt, "deg_list.csv", quote=FALSE)

Next, let's create an MA-plot, which is a nice way of viewing the distribution of significant genes:

deg =  rownames(tt$table)[tt$table$FDR < .05]
plotSmear(d, de.tags=deg)
abline(h = c(-2, 2), col = "blue")

Gene set enrichment

To help make sense of the list of differentially expressed genes, let's use David, which performs a gene set enrichment analysis.

wiki/differential_expression.txt · Last modified: 2016/09/01 09:29 (external edit)