Biomedical Data Science

External Resources

As the need for biomedical data science grows, so do the resources available for learning the skills it requires. There are many such resources, so please recognize that inclusion on this page is not a direct endorsement of a resource and that the list provided is not exhaustive.

Educational Resource Discovery Index (ERuDIte)

ERuDIte is the educational resource discovery index that powers the NIH BD2K Training Coordinating Center (TCC) Web Portal. ERuDIte serves not only as a resource collector and aggregator but also as a system powered by Machine Learning, Information Retrieval, and Natural Language Processing that intelligently organizes resources to provide a dynamic and personalized curriculum for biomedical researchers interested in learning about Data Science.


MOOCs and Online Trainings

Pandas in Python

Online Tutorial

pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language. It is already well on its way toward this goal. pandas is well suited for many different kinds of data:

  1. Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet.
  2. Ordered and unordered (not necessarily fixed-frequency) time series data.
  3. Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels.
  4. Any other form of observational / statistical data sets. The data actually need not be labeled at all to be placed into a pandas data structure.
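To make the description above concrete, here is a minimal sketch of two of the data kinds it mentions: a table with mixed column types and an irregularly spaced time series. The column names and values (sample_id, tissue, read_count, temperature_c) are invented for illustration and do not come from the pandas tutorial.

```python
import pandas as pd

# Tabular data with heterogeneously typed columns (hypothetical values).
samples = pd.DataFrame(
    {
        "sample_id": ["s1", "s2", "s3"],
        "tissue": ["liver", "liver", "kidney"],
        "read_count": [1200, 950, 1430],
    }
)

# Work with column labels instead of positions: mean read count per tissue.
mean_reads = samples.groupby("tissue")["read_count"].mean()

# A time series that is not fixed-frequency: the dates are irregularly spaced.
temps = pd.Series(
    [36.6, 37.1, 38.0],
    index=pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-05"]),
    name="temperature_c",
)

# Select by date label; both endpoints of the slice are included.
first_two_days = temps.loc["2020-01-01":"2020-01-02"]

print(mean_reads)
print(first_two_days)
```

The same label-based selection and grouping apply whether the labels are strings, dates, or plain integers, which is the sense in which the data "need not be labeled at all" to fit into a pandas structure.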

Dataquest

Whether you’re new to the field or looking to take a step up in your career, Dataquest can teach you the data skills you’ll need.

Learn Python, R, SQL, data visualization, data analysis, and machine learning. Try any of our 60 free missions now and start your data science journey.

Codecademy - Python

Learn the basics of the world's fastest growing and most popular programming language used by software engineers, analysts, data scientists, and machine learning engineers alike.

Data Science Specialization (Johns Hopkins University)

MOOC (Coursera)

This Specialization covers the concepts and tools you'll need throughout the entire data science pipeline, from asking the right kinds of questions to making inferences and publishing results. In the final Capstone Project, you’ll apply the skills learned by building a data product using real-world data. At completion, students will have a portfolio demonstrating their mastery of the material.

Genomics Data Science

MOOC (Coursera)

This specialization covers the concepts and tools to understand, analyze, and interpret data from next generation sequencing experiments. It teaches the most common tools used in genomic data science including how to use the command line, Python, R, Bioconductor, and Galaxy. The sequence is a stand-alone introduction to genomic data science or a perfect complement to a primary degree or postdoc in biology, molecular biology, or genetics.


Rosalind

Rosalind is a platform for learning bioinformatics and programming through problem-solving.

R for Data Science

Garrett Grolemund and Hadley Wickham

This book will teach you how to do data science with R: You’ll learn how to get your data into R, get it into the most useful structure, transform it, visualise it and model it. In this book, you will find a practicum of skills for data science.

Bioinformatics Data Skills

Vince Buffalo

Bioinformatics Data Skills is an intermediate-level book, aimed at readers with some experience with a scripting language like Python, and very basic Unix. Bioinformatics Data Skills gives readers a solid Unix foundation. Readers are also introduced to the R language through learning exploratory data analysis.

Genomics Papers

Curated by Jeff Leek, JHSPH

When I was a student, my advisor John Storey made a list of papers for me to read on nights and weekends. That list was incredibly helpful for a couple of reasons.

  • It got me caught up on the field of computational genomics
  • It was expertly curated, so it filtered a lot of papers I didn't need to read
  • It gave me my first set of ideas to try to pursue as I was reading the papers

I have often thought I should make a similar list for folks who may want to work with me (or who want to learn about statistical genomics). So this is my attempt at that list. I've tried to separate the papers into categories and I've probably missed important papers. I'm happy to take suggestions for the list, but this is primarily designed for people in my group so I might be a little bit parsimonious.

Exploratory Data Analysis with R

Roger Peng

This book covers the essential exploratory techniques for summarizing data with R. These techniques are typically applied before formal modeling commences and can help inform the development of more complex statistical models. Exploratory techniques are also important for eliminating or sharpening potential hypotheses about the world that can be addressed by the data you have. We will cover in detail the plotting systems in R as well as some of the basic principles of constructing informative data graphics. We will also cover some of the common multivariate statistical techniques used to visualize high-dimensional data.

Elements of Statistical Learning

Trevor Hastie, Robert Tibshirani, Jerome Friedman

During the past decade there has been an explosion in computation and information technology. With it has come vast amounts of data in a variety of fields such as medicine, biology, finance, and marketing. The challenge of understanding these data has led to the development of new tools in the field of statistics, and spawned new areas such as data mining, machine learning, and bioinformatics. Many of these tools have common underpinnings but are often expressed with different terminology. This book describes the important ideas in these areas in a common conceptual framework. While the approach is statistical, the emphasis is on concepts rather than mathematics. Many examples are given, with a liberal use of color graphics. It should be a valuable resource for statisticians and anyone interested in data mining in science or industry. The book's coverage is broad, from supervised learning (prediction) to unsupervised learning. The many topics include neural networks, support vector machines, classification trees and boosting--the first comprehensive treatment of this topic in any book.

Introduction to Statistical Learning

Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani

This book provides an introduction to statistical learning methods. It is aimed at upper-level undergraduate students, master's students and Ph.D. students in the non-mathematical sciences. The book also contains a number of R labs with detailed explanations on how to implement the various methods in real life settings, and should be a valuable resource for a practicing data scientist. (You can buy the book or download a free PDF.)

ModernDive: Statistical Inference via Data Science

Chester Ismay and Albert Y. Kim

We hope that by the end of this book, you’ll have learned

  1. How to use R to explore data.
  2. How to answer statistical questions using tools like confidence intervals and hypothesis tests.
  3. How to effectively create “data stories” using these tools.

What do we mean by data stories? We mean any analysis involving data that engages the reader in answering questions with careful visuals and thoughtful discussion, such as “How strong is the relationship between per capita income and crime in Chicago neighborhoods?” and “How many f**ks does Quentin Tarantino give (as measured by the amount of swearing in his films)?” Further discussions on data stories can be found in this Think With Google article.

Stats for Data Science

Danny Kaplan

If statistics had not already existed, data science would need to invent it. Statistics provides the foundation for describing variation among individuals, for relating different factors to one another, and for drawing appropriate inferences from patterns observed in data. Statistics also provided the impetus for a critically important method of science: the randomized controlled experiment. It is fair to say, “There can be no data science without statistics.”

The converse is not true. Statistics as a field can and did exist without data science.