NSCI 580A3

Instructors
Tai Montgomery
Erin Nishimura

wiki2016htseq

# Intro to HTSeq

## References

HTSeq — A Python framework to work with high-throughput sequencing data
Simon Anders, Paul Theodor Pyl, Wolfgang Huber
Bioinformatics (2014)

## HTSeq Installation

HTSeq is already installed on Tai's server.

If you would like to install HTSeq on your own laptop, desktop, or server, please see the instructions here:

## What does HTSeq do?

HTSeq simply counts the number of reads that fall within the limits of each “feature”. In our case, we think of features as genes.

EXERCISE: How would you design a read-counting program to assign reads to different features?

The htseq-count “union” mode solves this problem with the following logic: If a read is contained within a unique feature (gene) OR if it partially overlaps a unique feature, it is counted as mapping to that gene. If it spans more than one feature (gene) it is discarded.

The HTSeq algorithm takes the following input and gives you the following output…

HTSeq input

• A .bam/.sam file from Tophat or any other aligner
• A .gtf/.gff annotation file for your genome

HTSeq output

• a .txt file containing count data for each feature.

## HTSeq Usage

htseq-count [options] <alignment_file> <gff_file> > <redirected_outputfile>

Options
-f <format>, --format=<sam/bam> #Default is <sam>
-s <yes/no/reverse>, --stranded=<yes/no/reverse> #Default is <yes>
-a <minaqual>, --minaqual=<Minimum_quality_score> #Default is <10>
-t <feature type>, --type=<feature_type/exon> #Can be any term in 3rd column of .gff file. Default is <exon>
-m <mode>, --mode=<union/intersection_strict/intersection_nonempty> #Default is <union>

For more detailed usage, Counting reads in features with htseq-count or simply:

 $htseq-count --help ## HTSeq Exercise Copy the dataset. Copy the directory /home/erin/02_htseq_demo to your home directory. $ cd     # Go to YOUR home directory
$cp -R /home/erin/02_htseq_demo . # Copy the dataset from MY directory to YOURS Looks like someone already started an RNA-seq analysis project and they have already run Tophat. Let's find a .bam file. • Navigate to the sub-directory 03_OUTPUT. • Navigate to the sub-directory AD001_tophat HTSeq says it works with either .bam or .sam, but for me, it always works better with .sam files. Let's convert .bam to .sam (~ 1 min): $ samtools
$samtools view -h -o accepted_hits.sam accepted_hits.bam Now we can execute HTSeq using the following options (~ 4 min): • stranded = no • minaqual = 20 • samfile = accepted_hits.sam • gfffile = ../../01_INPUT/C_elegans_WS235_liftover_ce10.gtf • outputfile = AD001_counts.txt $ htseq-count --stranded=no --minaqual=20 accepted_hits.sam ../../01_INPUT/C_elegans_WS235_liftover_ce10.gtf > AD001_counts.txt

## Automating a Tophat/HTSeq Analysis Pipeline

Within the directory, 02_htseq_demo, you'll find a project that has already been started. Let's use the script located in the 02_SCRIPT sub-directory to automate the HT-seq read counting for all samples.

Let's take a look at what this script does. Go ahead and open it in Text Wrangler to inspect the code.

Let's learn about the metadata.txt file. Open metadata.txt to inspect it.

OK, let's execute the script. First, we may need to change some basic settings on your computer to allow the execution of the shell script.

Execute the script with command:

$bash RNAseq_analyzer_elt2.sh <metadataFILE>.txt 2>&1 | tee <LOGFILE>.txt$ bash RNAseq_analyzer_elt2.sh metadata.txt 2>&1 | tee 171128_output.txt

Inspect progress. This takes about 40 minutes to complete.

Instead of letting it go the whole time, you can cancel the command…

<CNTRL> + C