Revolutions: life sciences

life sciences

July 20, 2017

Data Analysis for Life Sciences

Rafael Irizarry from the Harvard T.H. Chan School of Public Health has presented a number of courses on R and Biostatistics on EdX, and he recently also provided an index of all of the course modules as YouTube videos with supplemental materials. The EdX courses are linked below, which you can take for free, or simply follow the series of YouTube videos and materials provided in the index.

Data Analysis for the Life Sciences Series

A companion book and associated R Markdown documents are also available for download.

Genomics Data Analysis Series

For links to all of the course components, including videos and supplementary materials, follow the link below.

rafalab: HarvardX Biomedical Data Science Open Online Training

Posted by David Smith at 08:00 in courses, life sciences, R | Permalink | Comments (0)

June 29, 2017

How R is used by the FDA for regulatory compliance

I was recently alerted (thanks Maëlle and Mikhail!) to an enlightening presentation from last year's useR! conference. (This year's useR! conference takes place next week in Belgium.) Paul H Schuette, Scientific Computing Coordinator at the FDA Center for Drug Evaluation and Research (CDER), talked about how R is used in the process of regulating and approving drugs at the FDA.

In what has become a common theme of FDA presentations at R conferences, Schuette refutes the fallacy that SAS is the only software that can be used for FDA submissions, by sponsors such as pharmaceutical companies. On the contrary, he says "sponsors may propose to use R, and R has been used by some sponsors for certain types of analyses and simulations (post-market)."

The myth persists despite the FDA's Statistical Software Clarifying Statement declaring that any suitable software can be used. This is probably because some data-exchange regulations do require the use of the "XPT" (also known as SAS XPORT) file format, but that data format is an open standard and not restricted to SAS. XPT files can be read into R with the read.xport function in the foreign package (which ships with R), and exported from R with the write.xport function in the SASxport package. (If you have legacy data in other SAS formats, here's a handy SAS macro to export XPT files.) The R Foundation also provides guidance on how R complies with other FDA regulations in the document R: Regulatory Compliance and Validation Issues A Guidance Document for the Use of R in Regulated Clinical Trial Environments.
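As a minimal sketch of the reading side, assuming the foreign package is installed (it is distributed with R as a recommended package): the helper name read_xpt_safely and the file name adsl.xpt are placeholders for illustration, not anything referenced in the presentation.

```r
# Hypothetical helper: read a SAS XPORT (XPT) file if it exists,
# and return NULL instead of stopping with an error if it does not.
read_xpt_safely <- function(path) {
  if (!file.exists(path)) {
    message("No XPT file found at: ", path)
    return(NULL)
  }
  # read.xport() returns a data frame when the file holds one dataset,
  # and a named list of data frames when it holds several.
  foreign::read.xport(path)
}

# "adsl.xpt" is a placeholder file name used only for illustration.
adsl <- read_xpt_safely("adsl.xpt")
if (!is.null(adsl)) str(adsl)
```

Wrapping the call this way keeps a script from stopping halfway when a submission folder is incomplete; for a one-off read, calling foreign::read.xport() directly is just as good.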

In addition to sponsors using R in submission, R is also used internally at the FDA. Statisticians there may use the statistical package of their choice, provided it's fit for the purpose. The software used includes SAS, R, Minitab and Stata. Schuette notes that R is used specifically for:

Statistical review of data analysis in clinical trial submissions. The primary goal here is, "Can we, on our own, replicate the conclusions of the sponsor?"
Methodology development, innovation and evaluation.
Graphics (in some cases, the detailed information sheets folded and included with prescription medications feature R graphics).
Simulations.
The openFDA initiative, including this LRT Signal Analysis for a Drug Shiny application.

Check out the entire presentation, embedded below.

Channel 9: Using R in a regulatory environment: FDA experiences.

Posted by David Smith at 11:48 in applications, life sciences, R | Permalink | Comments (0)

April 04, 2016

Help improve treatment for brain injuries using machine learning and R

The field of neuroscience -- the study of brains and the nervous system -- has taken some major leaps in recent years. Scientists can now gather real-time electrical activity from the brain during actions and thoughts, which is helping to pinpoint the exact location of brain lesions caused by strokes, and is leading to promising treatments for epilepsy and even profound paralysis. Joseph Sirosh describes these advances in a keynote presented at Strata Hadoop World last week:

In the video, Dr Kai Miller, Neurosurgery Resident at Stanford University, described an ingenious experiment designed to link brain activity to perception. In the experiment, several epilepsy patients were shown a series of images, each of which was either a house or a face. Simultaneously, electrical activity on the brain surface was measured by 64 separate brain sensors. The goal is to be able to create a model from the brain sensor data that can accurately predict what the patient is seeing: a face or a house.

You can try creating such a model yourself in the Azure Machine Learning competition, Decoding Brain Signals. To enter the competition, you'll need to train a model on the competition data, and have it accurately predict the images seen by other patients in the study (whose data remains hidden from all participants). You can use the built-in Azure ML Studio machine learning modules, or you can build your model entirely using R and R packages (this Tutorial using R explains the process).

Your submission will be ranked against the other participants according to prediction accuracy. As of this writing, the best model has a 73.75% accuracy rate. If your model can do better than that, and remains the best model when the competition closes on July 1, you could win $3,000 in prize money. (Second place gets $1,500 and third place gets $500.) Note that you'll need a free Microsoft Azure account to participate, and there is no charge for training, validating or submitting your competition models. For more information on the competition and how to submit a model, follow the link below.

Cortana Intelligence Gallery: Competition: Decoding Brain Signals

Posted by David Smith at 10:00 in life sciences, Microsoft, predictive analytics, R | Permalink | Comments (0)

August 04, 2015

A Short Introduction to Bioconductor

by Peter Hickey (@PeteHaitch)

One of the keys to R's success as a software environment for data analysis is the availability of user-contributed packages. Most useRs will be familiar with (and very grateful for) the Comprehensive R Archive Network (CRAN). The packages available on CRAN, nearly 7000 at last count, cover common data analysis tasks, such as importing data and plotting, through to more specialised tasks, such as packages for parsing data from the web, analysing financial time series data, or analysing data from clinical trials. What may be less familiar to useRs is another large R package repository and software development project, Bioconductor.

Bioconductor is an open source, open development software project that focuses on providing tools for the analysis of high-throughput genomic data, an area of research known variously as bioinformatics or computational biology. Examples of these data are sequencing the DNA of human genomes or measuring the level of expression of genes in hundreds of tumours. Recent advances in technology mean that such data are a central part of modern biological research, be it medical, agricultural, or basic science.

The Bioconductor project began in 2001 and was initiated by Robert Gentleman, one of the originators of the R language. Nowadays there is a core team of nine developers, led by Martin Morgan, who develop some of the important core packages and maintain the infrastructure of the project. As with CRAN, it is the user-contributed packages that make the Bioconductor project the valuable resource that it is. There are more than 1000 software packages in the most recent Bioconductor release. In addition to these packages, Bioconductor includes more than 900 annotation packages and 200 experiment data packages. Annotation packages help streamline the oft-tedious bookkeeping and annotation of data associated with bioinformatics research, while the experiment data packages contain processed data and are a valuable teaching resource.

Since its establishment, two of the main goals of the Bioconductor project have been reproducible research and high-quality documentation. In support of these aims, Bioconductor releases packages under a biannual schedule, which is tied to the most recent 'release' version of R, and each Bioconductor software package must contain a vignette. Each vignette is a document that provides a task-oriented description of package functionality, more like a book chapter than the technical and often terse function-level documentation accessible via ? or help() at the R console. Some of these vignettes, such as the User's Guide that accompanies the limma package (pdf), include multiple case studies and carefully explain the statistical foundations of the methods implemented in the package. There is also a dedicated support forum containing many years' worth of questions on common problems with answers from experts in the field.
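Vignettes can be listed from the R console with the base function vignette(). Here is a small sketch; the helper name list_vignettes is made up for illustration, and the grid package is used only because it is part of a standard R installation.

```r
# List the vignettes that ship with an installed package.
# list_vignettes() is a hypothetical convenience wrapper around vignette().
list_vignettes <- function(pkg) {
  info <- vignette(package = pkg)   # a "packageIQR" object
  info$results[, c("Item", "Title"), drop = FALSE]
}

print(list_vignettes("grid"))

# The same call works for a Bioconductor package once it is installed, e.g.
# if (requireNamespace("limma", quietly = TRUE)) print(list_vignettes("limma"))
```

browseVignettes() gives the same information as a clickable page in the browser, which is often the more comfortable way to read the longer guides.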

Bioconductor has recently begun publishing separate "workflows", along with teaching materials used in courses and conferences to help users learn how to analyse high-throughput biological data. These are excellent resources for those wishing to learn more about what is available in Bioconductor and how to get the most from the project. The website also hosts detailed instructions on installing Bioconductor on your local machine or trying it out in a preconfigured setup using Amazon Machine Images or Docker images.

The teaching resources have been further bolstered by material from the recent Bioconductor meeting, held in Seattle, USA on July 21-22. This annual meeting is a great mix of basic science and data analysis methodology talks, presentations on interesting Bioconductor packages, and afternoon workshops where you can learn from the developers themselves. All the workshop materials, and most of the slides from the presentations, can be found here. The meeting was preceded by Developer Day, a less formal get-together including talks and brainstorming sessions about the current state and future directions of the Bioconductor project. There is also an annual European Bioconductor meeting and, for the first time, an Asia-Pacific Bioconductor Developer's Meeting and workshop, to be held as part of GIW/InCoB 2015 in Tokyo, Japan on September 8-11.

With its more specialised focus than CRAN, Bioconductor strongly encourages package developers to make use of the excellent infrastructure provided by existing Bioconductor packages. The intention is to reduce the number of times the wheel is re-invented, as well as increasing the interoperability of objects and methods from different packages. The source code of these core packages can make useful reading for R developers, particularly those wishing to learn more about the S4 object oriented system. This source code can be accessed using Subversion or via the GitHub mirror of all Bioconductor packages.

Bioconductor has been, and continues to be, an incredibly useful resource for people analysing high-throughput genomic data. The development and maintenance of the project is a considerable undertaking, and there is a great debt owed to those who established the project, brought it into being, and continue its day-to-day running. But just as important is the community of users and developers. It is this community of users and developers that sees such a project succeed and be exciting to be a part of.

Posted by Joseph Rickert at 08:30 in life sciences, packages, R | Permalink | Comments (4)

June 05, 2015

Any R code as a cloud service: R demonstration at BUILD

At last month's BUILD conference for Microsoft developers in San Francisco, R was front-and-center on the keynote stage.

In the keynote, Microsoft CVP Joseph Sirosh introduced the "language of data": open source R. Sirosh encouraged the audience to learn R, saying "if there is a single language that you choose to learn today .. let it be R".

The keynote featured a demonstration of genomic data analysis using R. The analysis was based on the 1000 genomes data set stored in the HDInsight Hadoop-in-the-cloud service. Revolution R Enterprise running on eight Hadoop clusters distributed around the globe (about 1600 cores in total), and R's Bioconductor suite (specifically the VariantTools and gmapR packages), were used to perform 'variant calling' and calculate the disease risks indicated by a subset of the 1000 genomes in parallel. The result was an interactive heat-map showing the disease risks for each individual.

The heat map was created by Winston Chang and Joe Cheng from RStudio as an htmlwidget using the D3heatmap package. (You can interact with a variant of the heatmap from the demo here.)

The next part of the demo was to compare an individual's disease risks — as indicated by his or her DNA — to the population. Joseph Sirosh had his own DNA sequence for this purpose, which he submitted via a Windows Phone app to an Azure service running R. This is easy to do with Azure ML Studio: just put your R code as part of a workflow, and an API will automatically be generated on request. In this way you can publish any R code as an API to the cloud, which is then callable by any connected application.

You can watch the entire keynote presentation below, and the R demo begins at around the 23 minute mark.

Posted by David Smith at 06:00 in applications, big data, events, life sciences, Microsoft, R, Rmedia | Permalink | Comments (0)

January 20, 2015

New Introductory R Course for Health Analytics

by Bob Horton, Data Scientist, Revolution Analytics

From electronic medical records to genomic sequences, the data deluge is affecting all aspects of health care. The Masters of Science in Health Informatics (MSHI) program at the University of San Francisco, now in its second year, is designed to help students develop the practical computing skills and quantitative perspicacity they need to manage and exploit this wealth of data in health care applications.

This spring, I am privileged to participate in this effort by developing and teaching a new course, “Statistical Computing for Biomedical Data Analytics”, intended to motivate and prepare students for further studies in data science, such as the intensive summer courses of the MSAN bootcamp. The syllabus is on github.

As you’ve probably guessed, we will be using R. Other courses in the curriculum use Python, which seems to be favored by engineers; in contrast, R was developed by and for statisticians. We want the students to be exposed to both perspectives, and to have the technical background needed to make use of the extensive repositories of code available from CRAN and Bioconductor.

Data science is an interdisciplinary endeavor born of the synergy between computing, statistics, data management, and visualization. This can make it challenging to get started, because you have to know so many things before you get to the good stuff. We’re going to try to ease into it by starting with computational explorations of mathematical and statistical concepts. R is a fantastic environment for this; you can see a bell-shaped curve emerge from an example as simple as

plot(0:20, choose(n=20, k=0:20))

Note the expressive power of the vector of k values, and the easy convenience of having a world of statistical functions at your fingertips. Imagine how this little plot would have delighted Sir Francis Galton.
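To make the "bell-shaped curve" concrete, here is a short sketch that overlays the normal density with matching mean and variance on the scaled binomial coefficients. The variable names are mine, not part of the course materials.

```r
n <- 20
k <- 0:n

# Exact probabilities of k heads in n fair coin flips: choose(n, k) / 2^n
p_binom <- choose(n, k) / 2^n

# Normal density with the same mean (n/2) and variance (n/4)
p_norm <- dnorm(k, mean = n / 2, sd = sqrt(n / 4))

# Spikes for the binomial probabilities, dashed line for the normal curve
plot(k, p_binom, type = "h", xlab = "k", ylab = "probability")
lines(k, p_norm, lty = 2)

# Largest pointwise gap between the two curves
max(abs(p_binom - p_norm))
```

Dividing by 2^n only rescales the original plot so that the heights sum to one, which is what makes the comparison with a density meaningful.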

Data science is a journey. The enormous breadth of material and the rapid pace of development mean that the most important thing to learn is how to learn more. We’ll explore many fantastic resources for learning data science and R. For example, Coursera has excellent offerings, exemplified by the series of mini-courses from Johns Hopkins; our students will take at least one of these as a course project.

Of course, the R community itself is the biggest and most important resource. One class will be a field trip to a Bay Area useR Group (BARUG) meeting, and the comments in response to this post will be required reading. Ideas or suggestions regarding the syllabus or course materials from the github repository are welcome, as are observations or ruminations on the process of learning data science and R.

Finally, we are very interested in helping our students find outstanding internship opportunities in health-related organizations. Please don’t hesitate to contact me through [email protected] if you are interested in working with us. Stay tuned for progress reports.

Posted by Joseph Rickert at 08:00 in academia, courses, life sciences, R, Revolution, statistics | Permalink | Comments (0)

January 16, 2015

Learn Statistics and R online from Harvard

Harvard University is offering a free 5-week on-line course on Statistics and R for the Life Sciences on the edX platform. The course promises you will learn the basics of statistical inference and the basics of using R scripts to conduct reproducible research. You'll just need a background in basic math and programming to follow along and complete homework in the R language.

As this is a new course, I haven't seen any of the content, but the presenters Rafael Irizarry and Michael Love are active contributors to the Bioconductor project, so it should be good. The course begins January 19 and registration is open through 27 April at the link below.

edX: Statistics and R for the Life Sciences

Posted by David Smith at 09:07 in courses, life sciences, R | Permalink | Comments (0)

January 09, 2015

Bioconductor project advances understanding of genetics

Bioconductor is a project to develop and curate a collection of R packages used for analysis of genetic data (specifically, analysis and comprehension of high-throughput genomic data). With the wealth of genetic data on humans and animals now available, Bioconductor is widely used in medical research to understand how genes influence our health, and to develop new therapies and drugs. (It was used in this recent Nature Molecular Psychiatry article, for example.)

The project currently includes 935 R packages, a tally that's not normally included in the count of available R packages (the count of packages in CRAN currently stands at 6179). There are also 895 packages of genetic data (including many annotated genomes), and 223 packages of experimental data.

The latest issue of the Bioconductor newsletter shared some impressive statistics on the growth of the project:

A search for Bioconductor citations on Google Scholar yields more than 27,000 hits. (You can see a list of recent articles citing Bioconductor here, including articles in Nature, Genome Biology, and Statistical Science.)
Visits to the Bioconductor website increased by 23% in 2014.
63 packages were added in the last 3 months of 2014.
Package downloads have increased by 9%.

You can find much more news about the Bioconductor project in the current and previous Bioconductor newsletters. Start with the most recent issue at the link below.

Bioconductor Newsletter: January 2015

Posted by David Smith at 09:09 in applications, life sciences, packages, R | Permalink | Comments (1)

July 24, 2014

Agent Based Models and RNetLogo

by Joseph Rickert

If I had to pick just one application to be the “killer app” for the digital computer I would probably choose Agent Based Modeling (ABM). Imagine creating a world populated with hundreds, or even thousands of agents, interacting with each other and with the environment according to their own simple rules. What kinds of patterns and behaviors would emerge if you just let the simulation run? Could you guess a set of rules that would mimic some part of the real world? This dream is probably much older than the digital computer, but according to Jan Thiele’s brief account of the history of ABMs that begins his recent paper, R Marries NetLogo: Introduction to the RNetLogo Package in the Journal of Statistical Software, academic work with ABMs didn’t really take off until the late 1990s.

Now, people are using ABMs for serious studies in economics, sociology, ecology, socio-psychology, anthropology, marketing and many other fields. No less of a complexity scientist than Doyne Farmer (of Dynamic Systems and Prediction Company fame) has argued in Nature for using ABMs to model the complexity of the US economy, and has published on using ABMs to drive investment models. In the following clip of a 2006 interview, Doyne talks about building ABMs to explain the role of subprime mortgages on the Housing Crisis. (Note that when asked about how one would calibrate such a model Doyne explains the need to collect massive amounts of data on individuals.)

Fortunately, the tools for building ABMs seem to be keeping pace with the ambition of the modelers. There are now dozens of platforms for building ABMs, and it is somewhat surprising that NetLogo, a tool with some whimsical terminology (e.g. agents are called turtles) that was designed for teaching children, has apparently become a de facto standard. NetLogo is Java based, has an intuitive GUI, ships with dozens of useful sample models, is easy to program, and is available under the GPL 2 license.

As you might expect, R is a perfect complement for NetLogo. Doing serious simulation work requires a considerable amount of statistics for calibrating models, designing experiments, performing sensitivity analyses, reducing data, exploring the results of simulation runs and much more. The recent JASSS paper Facilitating Parameter Estimation and Sensitivity Analysis of Agent-Based Models: a Cookbook Using NetLogo and R by Thiele and his collaborators describes the R / NetLogo relationship in great detail and points to a decade’s worth of reading. But the real fun is that Thiele’s RNetLogo package lets you jump in and start analyzing NetLogo models in a matter of minutes.

Here is part of an extended example from Thiele's JSS paper that shows R interacting with the Fire model that ships with NetLogo. Using some very simple logic, Fire models the progress of a forest fire.

Snippet of NetLogo Code that drives the Fire model

to go
  if not any? turtles  ;; either fires or embers
    [ stop ]
  ask fires
    [ ask neighbors4 with [pcolor = green]
        [ ignite ]
      set breed embers ]
  fade-embers
  tick
end
 
;; creates the fire turtles
to ignite  ;; patch procedure
  sprout-fires 1
    [ set color red ]
  set pcolor black
  set burned-trees burned-trees + 1
end

The general idea is that turtles represent the frontier of the fire as it runs through a grid of randomly placed trees. Not shown in the above snippet is the setup logic, in which the entire model is controlled by a single parameter representing the density of the trees.

This next bit of R code shows how to launch the Fire model from R, set the density parameter, and run the model.

# Launch RNetLogo and control an initial run of the
# NetLogo Fire Model
library(RNetLogo)
nlDir <- "C:/Program Files (x86)/NetLogo 5.0.5"
setwd(nlDir)
 
nl.path <- getwd()
NLStart(nl.path)
 
model.path <- file.path("models", "Sample Models", "Earth Science","Fire.nlogo")
NLLoadModel(file.path(nl.path, model.path))
 
NLCommand("set density 70")    # set density value
NLCommand("setup")             # call the setup routine 
NLCommand("go")                # launch the model from R

Here we see the Fire model running in the NetLogo GUI after it was launched from RStudio.

This next bit of code tracks the progression of the fire as a function of time (model "ticks"), returns results to R and plots them. The plot shows the non-linear behavior of the system.

# Investigate percentage of forest burned as simulation proceeds and plot
library(ggplot2)
NLCommand("set density 60")
NLCommand("setup")
burned <- NLDoReportWhile("any? turtles", "go",
                c("ticks", "(burned-trees / initial-trees) * 100"),
                as.data.frame = TRUE, df.col.names = c("tick", "percent.burned"))
# Plot with ggplot2
p <- ggplot(burned,aes(x=tick,y=percent.burned))
p + geom_line() + ggtitle("Non-linear forest fire progression with density = 60")

As with many dynamical systems, the Fire model displays a phase transition. Setting the density lower than 55 will not result in the complete destruction of the forest, while setting density above 75 will very likely result in complete destruction. The following plot shows this behavior.
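For readers without NetLogo at hand, the same threshold effect can be sketched in plain R. This is a simplified stand-in written for illustration, not NetLogo's Fire model or code from Thiele's paper: trees are placed at random on a square grid, the left-most column is ignited, and fire spreads to the four neighbouring cells. The function name burn_fraction and the grid size are my own choices.

```r
# Simplified, NetLogo-free stand-in for the Fire model (illustration only).
# density: percentage of grid cells that contain a tree
# size:    number of rows and columns in the square grid
burn_fraction <- function(density, size = 101) {
  trees <- matrix(runif(size * size) < density / 100, nrow = size)
  initial <- sum(trees)
  if (initial == 0) return(0)

  # Ignite every tree in the left-most column
  burning <- matrix(FALSE, nrow = size, ncol = size)
  burning[, 1] <- trees[, 1]
  trees[, 1] <- FALSE
  burned <- sum(burning)

  while (any(burning)) {
    # Shift the burning mask by one cell in each of the four directions
    from_below <- rbind(burning[-1, ], FALSE)
    from_above <- rbind(FALSE, burning[-size, ])
    from_right <- cbind(burning[, -1], FALSE)
    from_left  <- cbind(FALSE, burning[, -size])

    # Unburned trees next to a burning cell catch fire on this tick
    ignite <- (from_below | from_above | from_right | from_left) & trees
    trees[ignite] <- FALSE
    burned <- burned + sum(ignite)
    burning <- ignite
  }
  100 * burned / initial   # percentage of the initial trees that burned
}

set.seed(2014)
densities <- c(40, 55, 60, 65, 80)
round(sapply(densities, burn_fraction), 1)
```

Because each run is random, the percentages near the transition vary a lot from run to run, which is exactly why repeated simulations are needed in that region.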

RNetLogo makes it very easy to programmatically run multiple simulations and capture the results for analysis in R. The following two lines of code run the Fire model twenty times for each value of density between 55 and 65, the region surrounding the phase transition.

d <- seq(55, 65, 1)                  # vector of densities to examine
res <- rep.sim(d, 20)                # Run the simulation
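Note that rep.sim is not defined in the snippets shown in this post. One plausible shape for such a helper is sketched below, assuming NLStart() and NLLoadModel() have already been called as above. The name sim.fire and the extra sim argument are hypothetical additions for illustration, and the author's own definition may differ.

```r
# Hypothetical single-run helper: runs the Fire model once at the given
# density and returns the percentage of trees burned.
# Assumes a NetLogo session with the Fire model is already loaded.
sim.fire <- function(density) {
  RNetLogo::NLCommand(paste("set density", density))
  RNetLogo::NLCommand("setup")
  RNetLogo::NLDoCommandWhile("any? turtles", "go")
  RNetLogo::NLReport("(burned-trees / initial-trees) * 100")
}

# Run `sim` `rep` times for each density value.
# Returns a list with one numeric vector of results per density.
rep.sim <- function(density, rep, sim = sim.fire) {
  lapply(density, function(dens) replicate(rep, sim(dens)))
}

# With a NetLogo session running, the results could be summarised with e.g.
# res <- rep.sim(seq(55, 65, 1), 20)
# boxplot(res, names = seq(55, 65, 1),
#         xlab = "density", ylab = "percent burned")
```

Passing the single-run function as an argument makes it easy to try the looping logic with a stand-in function before involving NetLogo at all.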

The plot below shows the variability of the percent of trees burned as a function of density in the transition region.

My code to generate plots is available in the file: Download NelLogo_blog while all of the code from Thiele's JSS paper is available from the journal website.

Finally, here are a few more interesting links related to ABMs.

On validating ABMs
ABMs and

Posted by Joseph Rickert at 08:30 in applications, finance, life sciences, open source, packages, predictive analytics, R, statistics | Permalink | Comments (0)

August 15, 2013

R, drug development and the FDA

by Joseph Rickert

When you are not directly working in an industry it is often extremely difficult to get any real insight into common practices that may be blindingly obvious to people who are. With some persistence though, every once in a while you can stumble into an opportunity to see why things are the way they are. Last week, at the JSM in Montreal, I was lucky enough to hear some candid discussion of the relative roles of SAS and R in the pharmaceutical industry during the invited session Opening the Doors to Open Source Programming in Drug Development.

Now, one doesn’t have to be much of an insider to know that SAS has been dominant in pharma for quite some time. This is still true even though there is ample evidence that the situation is changing and R is seeing quite a bit of use in the drug development process. At useR! 2007, Mat Sokup of the FDA gave a talk entitled R: Regulatory Compliance and Validation Issues A Guidance Document for the Use of R in Regulated Clinical Trial Environments in which he concluded that there was no impediment to using R in an FDA submission. Last year at useR! 2012, FDA statistician Jea Brodsky presented a poster describing how FDA scientists “use R on a daily basis” and have themselves written R packages for use at various stages in the drug submission process. Nevertheless, certain myths persist.

During his talk, Open Source Software in the Biopharma Industry: Challenges and Opportunities, José Pinheiro of Janssen R & D (and coauthor of an influential text on mixed-effects models) addressed several of these myths including two that I encounter frequently: “the FDA will only accept SAS submissions” and “Unlike SAS, R code cannot be validated”. I was surprised that Dr Pinheiro felt compelled to address these untruths, but I suppose I should not have been. Six years isn’t much time for ideas floated at useR conferences to have much of an impact on statisticians who have been using SAS all of their working careers. One can easily imagine that in an industry as conservative as pharma even informed people might ignore information that would perturb what is perceived to be a successful, established pattern. The twin pressures of not wanting to disturb a process that has worked successfully in the past and the fear that even small changes could contribute to additional FDA scrutiny are powerful forces working to preserve the status quo.

However, according to Dr Pinheiro not only has R made its way into drug development, R is probably used more than SAS. (Apparently, R based simulations have become a good part of the daily work of designing drugs.) Nevertheless, R is very rarely used in submissions. The insight I got from Dr Pinheiro's presentation and the conversation at the JSM session is that the daily simulations are a relatively new task in the drug development process. Therefore, one would expect new tools could be applied if they were useful. But in the ultra conservative pharma environment it is unreasonable to assume that a new tool would replace the customary tool in an established process that seems to be working well. Something big has to break before the current generation of SAS trained statisticians give R a chance in the FDA submission process. The irony is that R is used by FDA statisticians for all kinds of tasks associated with the submission process. Jyoti Rayamahi of Eli Lilly and Company, the discussant for the JSM session, had the best line: “At Lilly we do our submissions in SAS and the FDA uses R to check them”.

Note that last week the R Foundation refreshed the date on the document R: Regulatory Compliance and Validation Issues A Guidance Document for the Use of R in Regulated Clinical Trial Environments, which it first published a few years ago. As it states in the Introduction, the document “ ...is intended to provide a reasonable consensus position on the part of the R Foundation for Statistical Computing ... relative to the use of R within these regulated environments (environments in which the FDA regulates the use of statistical software) and to provide a common foundation for end users to meet their own internal standard operating procedures, documentation requirements and regulatory obligations.” It lists the pertinent regulatory documents and provides information that should be helpful to anyone who would like to deploy R in a regulated activity.

Posted by Joseph Rickert at 08:54 in life sciences, R, statistics | Permalink | Comments (0) | TrackBack (0)


Revolutions

Milestones in AI, Machine Learning, Data Science, and visualization with R and Python since 2008