Title: From sequence to knowledge: Analysis of large-scale sequencing data

Abstract: Powered by advances in next-generation sequencing, the cost of sequencing is approaching $1,000 per genome. This lower cost has enabled the collection of vast amounts of genomic information and has made large-scale sequencing a practical strategy for biological and medical studies. However, tools for analyzing this vast amount of data lag behind. The analysis of high-throughput sequencing data is challenging because errors in the data are even more common than true sequence variation. These errors derive from multiple sources: in-vitro problems such as DNA sample contamination, and in-silico problems such as mapping artifacts and the existence of complex variants. I will describe methods for estimating and ultimately correcting in-vitro contamination in sequencing and genotype-array data, and then present a massively parallel SNP-calling pipeline that can handle the analysis of thousands of sequenced samples and identify high-quality SNPs by applying a machine-learning-based method to detect in-silico problems. I will then briefly summarize recent developments in ongoing large-scale sequencing projects, followed by future research directions.