This is an old revision of the document!
HTSeq — A Python framework to work with high-throughput sequencing data
Simon Anders, Paul Theodor Pyl, Wolfgang Huber
HTSeq is already installed on Tai's server.
If you would like to install HTSeq on your own laptop, desktop, or server, please see the instructions here:
HTSeq simply counts the number of reads that fall within the limits of each “feature”. In our case, we think of features as genes.
EXERCISE: How would you design a read-counting program to assign reads to different features?
The htseq-count “union” mode solves this problem with the following logic: If a read is contained within a unique feature (gene) OR if it partially overlaps a unique feature, it is counted as mapping to that gene. If it spans more than one feature (gene) it is discarded.
The HTSeq algorithm takes the following input and gives you the following output…
htseq-count [options] <alignment_file> <gff_file> > <redirected_outputfile> Options -f <format>, --format=<sam/bam> #Default is <sam> -s <yes/no/reverse>, --stranded=<yes/no/reverse> #Default is <yes> -a <minaqual>, --minaqual=<Minimum_quality_score> #Default is <10> -t <feature type>, --type=<feature_type/exon> #Can be any term in 3rd column of .gff file. Default is <exon> -m <mode>, --mode=<union/intersection_strict/intersection_nonempty> #Default is <union>
For more detailed usage, Counting reads in features with htseq-count or simply:
$ htseq-count --help
Copy the dataset. Copy the directory
/home/erin/02_htseq_demo to your home directory.
$ cd # Go to YOUR home directory $ cp -R /home/erin/02_htseq_demo . # Copy the dataset from MY directory to YOURS
Looks like someone already started an RNA-seq analysis project and they have already run Tophat. Let's find a .bam file.
HTSeq says it works with either .bam or .sam, but for me, it always works better with .sam files. Let's convert .bam to .sam (~ 1 min):
$ samtools $ samtools view -h -o accepted_hits.sam accepted_hits.bam
Now we can execute HTSeq using the following options (~ 4 min):
$ htseq-count --stranded=no --minaqual=20 accepted_hits.sam ../../01_INPUT/C_elegans_WS235_liftover_ce10.gtf > AD001_counts.txt
Did it work? Inspect AD001_counts.txt.
Within the directory,
02_htseq_demo, you'll find a project that has already been started. Let's use the script located in the
02_SCRIPT sub-directory to automate the HT-seq read counting for all samples.
Let's take a look at what this script does. Go ahead and open it in Text Wrangler to inspect the code.
Let's learn about the
metadata.txt file. Open
metadata.txt to inspect it.
OK, let's execute the script. First, we may need to change some basic settings on your computer to allow the execution of the shell script.
Execute the script with command:
$ bash RNAseq_analyzer_elt2.sh <metadataFILE>.txt 2>&1 | tee <LOGFILE>.txt $ bash RNAseq_analyzer_elt2.sh metadata.txt 2>&1 | tee 171128_output.txt
Inspect progress. This takes about 40 minutes to complete.
Instead of letting it go the whole time, you can cancel the command…
<CNTRL> + D
… and then download the final results here..