# NSCI 580A3 fall 2017

### Sidebar

NSCI 580A3

Instructors
Tai Montgomery
Erin Nishimura

wiki2016nsci580a3tuxedoexercise

## DIFFERENTIAL GENE EXPRESSION ANALYSIS USING CUFFDIFF

The purpose of this exercise is to introduce tools for analyzing differential gene expression in RNA-seq data. You will analyze RNA-seq data from human reference (a mix of tissues) and brain tissue to identify genes for which expression is enriched in the brain. Due to time constraints, the data we will analyze is a subset (chr22) of a larger RNA-seq dataset.

We will use a suite of tools called the Tuxedo pipeline. For an additional tutorial, see the following paper: Trapnell et al. 2014, Nature Protocols. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks.

Note: this pipeline is no longer updated and has been replaced with a more efficient and accurate pipeline, however, we will use the original Tuxedo pipeline because there is far more support currently available for it and it has fewer bugs to contend with. The new and improved pipeline consists of a similar suite of tools: HISAT, StringTie, and Ballgown.

### OUTLINE

1. Assess data quality using FastQC.
2. Quality filter datasets using Trimmomatic.
3. Align the RNA-seq reads to the human genome using TopHat2.
4. Assemble transcripts based on RNA-seq data using cufflinks and cuffmerge.
5. Compare expression differences using cuffdiff.
6. Visualize data using the genome viewing software IGV.
7. Plot data with R and cummeRbund.

### LAB NOTEBOOK

1. Create a new document in TextWrangler and save it as NSCI580A3_Lab_Notebook.txt.

2. Record all commands executed during the course in this document.

• Each date should be indicated.
• Use # to distinguish notes from commands.

### QUALITY CONTROL, ALIGNMENT, AND DIFFERENTIAL GENE EXPRESSION

1. Open a terminal window and log into the rna server:
ssh last_name@rna.biology.colostate.edu

2. cd into the brain_data/brain_data_fastq_files directory within your home directory.

There are 12 RNA-seq datasets corresponding to paired-end data for 3 replicates from two sample sets (brain and ref). Examine a few lines of one of the files using zmore or zless. What information is contained in each line?

• brain_rep1_1.fastq.gz
• brain_rep1_2.fastq.gz
• brain_rep2_1.fastq.gz
• brain_rep2_2.fastq.gz
• brain_rep3_1.fastq.gz
• brain_rep3_2.fastq.gz
• ref_rep1_1.fastq.gz
• ref_rep1_2.fastq.gz
• ref_rep2_1.fastq.gz
• ref_rep2_2.fastq.gz
• ref_rep3_1.fastq.gz
• ref_rep3_2.fastq.gz

3. Assess the quality of the data using FastQC:

In FastQC:

File>open

If you don't wish to install FastQC on your computer, you can follow along with instructor.

See FastQC tutorial for additional details: https://www.youtube.com/watch?v=bz93ReOv87Y

4. Trim adapter sequences and quality filter the RNA-seq data (fastq files) using Trimmomatic:

Trim adapter sequences and quality filter each dataset using Trimmomatic (you will run trimmomatic 6 times in total).

$trimmomatic PE -phred33 'input_fastq_1' 'input_fastq_2' 'output_fastq_paired_1' 'output_fastq_unpaired_1' 'output_fastq_paired_2' 'output_fastq_unpaired_2' ILLUMINACLIP:/usr/share/trimmomatic/adapters/TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36  See the Trimmomatic manual for a detailed description of options: http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/TrimmomaticManual_V0.32.pdf What proportion of the reads were retained? 5. Examine a file after adapter trimming and quality filtering using zmore or zless. 6. Assess the quality of one of the datasets after quality filtering using FastQC: In FastQC: File>open 7. Create a bowtie index for the human chromosome 22 sequence: $ bowtie2-build 'sequence.fa' 'prefix'

The chr22 sequence is in the brain_data folder: hg38_chr22.fa.

For the bowtie prefix, use chr22.

8. Move the 6 bowtie index files and the genome sequence file (hg38_chr22.fa) to a new folder called bowtie_chr22.

See the bowtie manual for additional details: http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml

9. Align sequences from each of the libraries to the human genome using TopHat2 (you will run tophat 6 times in total):

tophat -p 8 -G 'path_to_genome_annotations.gtf' -o 'output_folder' 'path_to_bowtie_index_for_reference_genome/prefix' 'fastq_file_paired_1_1','fastq_file_paired_1_2','fastq_file_unpaired_1_1','fastq_file_unpaired_1_2' • NOTE: There are no spaces between the fastq file names. • The directory containing the fastq files should be the current working directory. • Name the output folders as follows: ref1, ref2, ref3, brain1, brain2, brain3 See the TopHat manual for additional details:https://ccb.jhu.edu/software/tophat/manual.shtml 10. Determine what proportion of the reads from each library were aligned: Use the UNIX more command to open each TopHat summary file in the terminal. The TopHat summary files are named align_summary.txt and are located in the output folder specified in step 5.  more ./ref1/align_summary.txt
etc.

11. Assemble transcripts using cufflinks (cufflinks uses the accepted_hits.bam output files from TopHat):

$cufflinks -p 8 -o 'output_folder' 'path_to_library_accepted_hits.bam' Name the output folders as follows: output_folder1: cufflinks_ref1 output_folder2: cufflinks_ref2 etc See the cufflinks manual for additional details about cufflinks, cuffmerge, and cuffdiff: http://cole-trapnell-lab.github.io/cufflinks/manual/ 12. Merge the assembled transcripts from the four libraries using cuffmerge: Create a file called assemblies.txt with the paths to each of the six individual assemblies files: $ echo ./cufflinks_ref1/transcripts.gtf >assemblies.txt
$echo ./cufflinks_ref2/transcripts.gtf >>assemblies.txt etc. Merge assemblies: $ cuffmerge -p 8 -g 'path_to_genome_annotations.gtf' -s 'path_to_genome_sequence.fa' assemblies.txt

You should now have a single file, merged.gtf located in a folder called merged_asm created by cuffmerge, that contains all of the predicted transcripts based on the sequencing data, as well as the previously annotated transcripts.

13. Identify genes differentially regulated between the reference and brain tissue samples using cuffdiff:

• Provide an output_folder name such as cuffdiff_output
• The relative path to the merged.gtf is merged_asm/merged.gtf
• You will compare the three reference libraries to the three brain libraries. Cufflinks uses the accepted_hits.bam output files from TopHat. If you are in the RNA-seq_Data directory, the paths to these files are as follows:
• ./ref1/accepted_hits.bam
• ./ref2/accepted_hits.bam
• etc
\$ cuffdiff -p 8 -o 'output_folder' -L ref,brain 'path_to_merged.gtf' 'path_to_tophat_output_library1_replicate1'/accepted_hits.bam,'path_to_tophat_output_library1_replicate2'/accepted_hits.bam,'path_to_tophat_output_library1_replicate3'/accepted_hits.bam \'path_to_tophat_output_library2_replicate1'/accepted_hits.bam,'path_to_tophat_output_library2_replicate2'/accepted_hits.bam,'path_to_tophat_output_library2_replicate3'/accepted_hits.bam

NOTE: There are no spaces between the label names (i.e. ref and brain).

Several output files are generated. Explore these on your own. The gene_exp.diff file contains a summary of differential gene expression.

14. Identify which genes are enriched or depleted in brain tissue:

• Open the gene_exp.diff file from step 13 using Excel.
• Reverse sort the data in Excel based on significance.

How many genes are significantly different between the brain and reference tissue?

### VISUALIZING DATA WITH IGV

1. Open the Integrative Genome Viewer.

2. Select the 'hg19 human genome' from the toolbar (top left).

3. Load the transcripts generated in cufflinks:

File > Load from file > Select the merged.gtf file that was downloaded in step 7 above.

The transcripts should now show up as a track.

In the search box (that says go next to it), enter chr22.

How well do the transcripts generated by cufflinks compare to the annotated RefSeq genes? Why might they differ?

4. Reformat accepted_hits.bam file generated by TopHat and convert to tdf format using IGV tools:

Tools > Run igvtools Under Command, select Count

Under Input file, select one of the accepted_hits.bam files and select Run

Repeat for each for each of the accepted_hits.bam files.

Close IGV tools when complete.

5. Load each of the tdf formatted files into IGV:

File > Load from File.
Select the tdf file that contains sequencing data.

Right click on each plot to modify the parameters (e.g. Change Track Height, Autoscale, Track Color etc).

Repeat for each of the tdf formatted RNA-seq files.

6. Examine the genes in IGV that were differentially regulated based on the cuffdiff analysis:

To view all the data, in the search box that says 'go' next to it, enter chr22.

In the same search box, enter the ID of a gene that is significantly different in brain tissue to zoom in on that region of the genome.

Repeat for several genes.

Does there appear to be in differences in the expression levels of these genes based on the RNA-seq tracks in IGV?

See the IGV website for additional information: http://software.broadinstitute.org/software/igv/

### PLOTTING DATA WITH R

1. Log into RStudio on the rna server using your account name and password: http://rna.biology.colostate.edu:8787

2. Load the CummeRbund package:

library(cummeRbund)

3. Create a CummeRbund database:

data <- readCufflinks('path_to_cuffdiff_output')

4. Plot the distribution of gene expression levels:

csDensity(genes(data))

5. Draw a scatter plot of gene expression in brain and reference tissues:

csScatter(genes(data), 'ref', 'brain')

6. Draw a volcano plot displaying brain and reference tissues:

csVolcano(genes(data), 'ref', 'brain', alpha=0.05, showSignificant=T)

7. Draw an MA plot displaying brain and reference tissues:

MAplot(genes(data), 'ref', 'brain')

Recall that the data from this exercise is only a partial dataset.

See the CummeRbund manual for additional details: http://compbio.mit.edu/cummeRbund/manual_2_0.html

wiki2016nsci580a3tuxedoexercise.txt · Last modified: 2017/11/14 09:14 by tai