User Tools

Site Tools


wiki2016htseq

Intro to HTSeq

References

HTSeq Installation

HTSeq is already installed on Tai's server.

If you would like to install HTSeq on your own laptop, desktop, or server, please see the instructions here:

How To Install HTSeq


What does HTSeq do?

HTSeq simply counts the number of reads that fall within the limits of each “feature”. In our case, we think of features as genes.

:!: EXERCISE: How would you design a read-counting program to assign reads to different features?


The htseq-count “union” mode solves this problem with the following logic: If a read is contained within a unique feature (gene) OR if it partially overlaps a unique feature, it is counted as mapping to that gene. If it spans more than one feature (gene) it is discarded.

The HTSeq algorithm takes the following input and gives you the following output…

HTSeq input

  • A .bam/.sam file from Tophat or any other aligner
  • A .gtf/.gff annotation file for your genome

HTSeq output

  • a .txt file containing count data for each feature.

HTSeq Usage

htseq-count [options] <alignment_file> <gff_file> > <redirected_outputfile>
 
Options
     -f <format>, --format=<sam/bam> #Default is <sam>
     -s <yes/no/reverse>, --stranded=<yes/no/reverse> #Default is <yes>
     -a <minaqual>, --minaqual=<Minimum_quality_score> #Default is <10>
     -t <feature type>, --type=<feature_type/exon> #Can be any term in 3rd column of .gff file. Default is <exon>
     -m <mode>, --mode=<union/intersection_strict/intersection_nonempty> #Default is <union>

For more detailed usage, Counting reads in features with htseq-count or simply:

 $ htseq-count --help

HTSeq Exercise

Copy the dataset. Copy the directory /home/erin/02_htseq_demo to your home directory.

$ cd     # Go to YOUR home directory
$ cp -R /home/erin/02_htseq_demo .   # Copy the dataset from MY directory to YOURS

Looks like someone already started an RNA-seq analysis project and they have already run Tophat. Let's find a .bam file.

  • Navigate to the sub-directory 03_OUTPUT.
  • Navigate to the sub-directory AD001_tophat

HTSeq says it works with either .bam or .sam, but for me, it always works better with .sam files. Let's convert .bam to .sam (~ 1 min):

$ samtools
$ samtools view -h -o accepted_hits.sam accepted_hits.bam

Now we can execute HTSeq using the following options (~ 4 min):

  • stranded = no
  • minaqual = 20
  • samfile = accepted_hits.sam
  • gfffile = ../../01_INPUT/C_elegans_WS235_liftover_ce10.gtf
  • outputfile = AD001_counts.txt
$ htseq-count --stranded=no --minaqual=20 accepted_hits.sam ../../01_INPUT/C_elegans_WS235_liftover_ce10.gtf > AD001_counts.txt

Did it work? Inspect AD001_counts.txt.


Automating a Tophat/HTSeq Analysis Pipeline

Within the directory, 02_htseq_demo, you'll find a project that has already been started. Let's use the script located in the 02_SCRIPT sub-directory to automate the HT-seq read counting for all samples.

Let's take a look at what this script does. Go ahead and open it in Text Wrangler to inspect the code.

Let's learn about the metadata.txt file. Open metadata.txt to inspect it.

OK, let's execute the script. First, we may need to change some basic settings on your computer to allow the execution of the shell script.

Execute the script with command:

$ bash RNAseq_analyzer_elt2.sh <metadataFILE>.txt 2>&1 | tee <LOGFILE>.txt
$ bash RNAseq_analyzer_elt2.sh metadata.txt 2>&1 | tee 171128_output.txt

Inspect progress. This takes about 40 minutes to complete.

Instead of letting it go the whole time, you can cancel the command…

<CNTRL> + C

… and then download the final results here..

htseq_out.tar.gz


 assignments:assignment2 Assignment 2

wiki2016htseq.txt · Last modified: 2017/11/28 09:03 by erin