Once we've acquired data, it is often necessary to view it, store it, reformat it, and extract information from it. Again, Linux provides many powerful tools.


  1. Obtaining sequence and annotation information
  2. Processing, storing, and manipulating large datasets

1. Obtaining Sequence and Annotation Information

  • The most current and complete genome sequences and annotations can be found in model organism databases (usually via FTP sites) and GenBank.
  • Genome sequences are typically in FASTA format (.fa).
  • Features are typically in general feature format (.gff).

Sequence and annotation file formats

  • Gene and genome sequences are typically in FASTA format (.fa).
  • Features are typically in general feature format (.gff).
  • High-throughput sequencing data are typically in FASTQ format (.fastq).

FASTA: originally a sequence alignment software package. The software gave rise to the FASTA format, now a ubiquitous sequence file format.



A FASTA record is a header line beginning with ">" followed by one or more lines of sequence (DNA, RNA, or amino acid).
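
As a rough illustration of the FASTA layout, the sketch below writes a toy file and summarizes it with standard tools. The file name example.fa, the record names, and the sequences are made up for the example, and the file is created in a temporary directory.

```shell
# Work in a throwaway directory so no real data is touched.
workdir=$(mktemp -d)
cd "$workdir"

# Toy FASTA file (hypothetical records): a ">" header line, then sequence lines.
cat > example.fa <<'EOF'
>seq1 hypothetical gene A
ATGCGTACGTTAG
ACGT
>seq2 hypothetical gene B
ATGAAACCCGGGTTTTAA
EOF

# Count records by counting header lines.
n_seqs=$(grep -c '^>' example.fa)
echo "sequences: $n_seqs"

# List identifiers: text after ">" up to the first space.
ids=$(grep '^>' example.fa | cut -c 2- | cut -d ' ' -f 1)
echo "$ids"

# Total residues: drop headers, remove newlines, count characters.
n_bases=$(grep -v '^>' example.fa | tr -d '\n' | wc -c | tr -d ' ')
echo "residues: $n_bases"
```

Counting header lines works because every record has exactly one line starting with ">", no matter how the sequence itself is wrapped.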

Generic Feature Format (gff3): The most common format for positional information about genomic features. Each record has 9 tab-delimited columns.
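
The nine GFF3 columns are seqid, source, type, start, end, score, strand, phase, and attributes. The sketch below builds a toy file and pulls out the gene features with awk. The file name example.gff3 and all feature records are made up for illustration.

```shell
# Throwaway location for the toy file.
workdir=$(mktemp -d)
gff="$workdir/example.gff3"

# Toy GFF3 (hypothetical features); printf turns \t into real tab characters.
{
  printf '##gff-version 3\n'
  printf 'chr1\texample\tgene\t100\t900\t.\t+\t.\tID=gene1\n'
  printf 'chr1\texample\tmRNA\t100\t900\t.\t+\t.\tID=mrna1;Parent=gene1\n'
  printf 'chr1\texample\texon\t100\t400\t.\t+\t.\tID=exon1;Parent=mrna1\n'
  printf 'chr2\texample\tgene\t50\t450\t.\t-\t.\tID=gene2\n'
} > "$gff"

# Count gene features: skip "#" lines and test column 3 (type).
n_genes=$(awk -F '\t' '!/^#/ && $3 == "gene"' "$gff" | wc -l | tr -d ' ')
echo "genes: $n_genes"

# Print seqid, start, end, and strand (columns 1, 4, 5, 7) for each gene.
gene_coords=$(awk -F '\t' '!/^#/ && $3 == "gene" { print $1, $4, $5, $7 }' "$gff")
echo "$gene_coords"
```

Setting the field separator to a tab matters here, because the attributes column may contain spaces.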

2. Processing, Storing, and Manipulating Large Datasets

gzip…compress or decompress a file.

$ gzip file.txt
$ gzip -d file.txt.gz
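
A minimal sketch of a compress and decompress round trip, assuming a small made-up file.txt in a temporary directory. It also shows how to read a compressed file without writing an uncompressed copy, which is often useful for large datasets.

```shell
# Throwaway directory with a small, made-up text file.
workdir=$(mktemp -d)
cd "$workdir"
printf 'line one\nline two\nline three\n' > file.txt
orig_sum=$(cksum < file.txt)

# Compress: file.txt is replaced by file.txt.gz.
gzip file.txt

# Read the compressed file without decompressing it on disk
# (-d decompress, -c write to standard output).
first_line=$(gzip -dc file.txt.gz | head -n 1)
echo "$first_line"

# Check the integrity of the compressed file.
gzip -t file.txt.gz

# Decompress: file.txt.gz is replaced by file.txt.
gzip -d file.txt.gz
new_sum=$(cksum < file.txt)
```

Comparing the two checksums is a cheap way to confirm that the round trip left the contents unchanged.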

tar…combine files into a single archive – a tarball.

$ tar cf archive_name.tar file1 file2 file3
$ tar xf archive_name.tar
  • See cheat sheet for additional options.
  • Datasets can be stored in a variety of locations: portable external hard drives, storage arrays, etc.
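
tar and gzip are commonly combined into a single compressed tarball. The sketch below is one way to do that. The directory name data, the file contents, and the archive name are made up for the example.

```shell
# Throwaway directory with a few made-up files.
workdir=$(mktemp -d)
cd "$workdir"
mkdir data
printf 'a\n' > data/file1
printf 'b\n' > data/file2
printf 'c\n' > data/file3

# Create a gzip-compressed archive (c = create, z = gzip, f = archive file).
tar czf archive_name.tar.gz data

# List the contents without extracting (t = list).
listing=$(tar tzf archive_name.tar.gz)
echo "$listing"

# Extract into a separate directory (x = extract, -C = change to directory first).
mkdir restored
tar xzf archive_name.tar.gz -C restored
```

Listing an archive before extracting it is a useful habit, since it shows whether the files will unpack into their own directory or into the current one.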

grep…search for patterns within a file and return lines containing the pattern.

$ grep "pattern" file
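
A few commonly used grep options, sketched on a made-up gene list. The file name genes.txt and its contents are hypothetical.

```shell
# Throwaway file with made-up contents.
workdir=$(mktemp -d)
file="$workdir/genes.txt"
cat > "$file" <<'EOF'
# hypothetical gene list
geneA kinase
geneB Kinase
geneC transporter
geneD phosphatase
EOF

# -c counts matching lines; -i ignores case.
n_kinase=$(grep -c -i 'kinase' "$file")
echo "kinase lines: $n_kinase"

# -v inverts the match: keep every line that is not a comment.
non_comment=$(grep -v '^#' "$file")
echo "$non_comment"

# -n prefixes each matching line with its line number.
numbered=$(grep -n 'transporter' "$file")
echo "$numbered"
```

The pattern is a regular expression, so "^#" matches only lines that begin with "#".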
wiki/2016datasets.txt · Last modified: 2017/09/01 10:37 by tai