Rafael Irizarry from the Harvard T.H. Chan School of Public Health has presented a number of courses on R and Biostatistics on edX, and he recently also provided an index of all of the course modules as YouTube videos with supplemental materials. The edX courses, which you can take for free, are linked below; alternatively, you can simply follow the series of YouTube videos and materials provided in the index.
In what has become a common theme of FDA presentations at R conferences, Schuette refutes the fallacy that SAS is the only software that sponsors, such as pharmaceutical companies, can use for FDA submissions. On the contrary, he says "sponsors may propose to use R, and R has been used by some sponsors for certain types of analyses and simulations (post-market)."
In addition to sponsors using R in submission, R is also used internally at the FDA. Statisticians there may use the statistical package of their choice, provided it's fit for the purpose. The software used includes SAS, R, Minitab and Stata. Schuette notes that R is used specifically for:
Statistical review of data analysis in clinical trial submissions. The primary goal here is, "Can we, on our own, replicate the conclusions of the sponsor?"
Methodology development, innovation and evaluation.
Graphics (in some cases, the detailed information sheets folded and included with prescription medications feature R graphics).
The field of neuroscience -- the study of brains and the nervous system -- has taken some major leaps in recent years. Scientists can now gather real-time electrical activity from the brain during actions and thoughts, which is helping to pinpoint the exact location of brain lesions caused by strokes, and is leading to promising treatments for epilepsy and even profound paralysis. Joseph Sirosh describes these advances in a keynote presented at Strata Hadoop World last week:
In the video, Dr Kai Miller, Neurosurgery Resident at Stanford University, describes an ingenious experiment designed to link brain activity to perception. In the experiment, several epilepsy patients were shown a series of images, each of which was either a house or a face. Simultaneously, electrical activity on the brain surface was measured by 64 separate brain sensors. The goal is to be able to create a model from the brain sensor data that can accurately predict what the patient is seeing: a face or a house.
You can try creating such a model yourself in the Azure Machine Learning competition, Decoding Brain Signals. To enter the competition, you'll need to train a model on the competition data, and have it accurately predict the images seen by other patients in the study (whose data remains hidden from all participants). You can use the built-in Azure ML Studio machine learning modules, or you can build your model entirely using R and R packages (this Tutorial using R explains the process).
Your submission will be ranked against the other participants according to prediction accuracy. As of this writing, the best model has a 73.75% accuracy rate. If your model can do better than that, and remains the best model when the competition closes on July 1, you could win $3,000 in prize money. (Second place gets $1,500 and third place gets $500.) Note that you'll need a free Microsoft Azure account to participate, and there is no charge for training, validating or submitting your competition models. For more information on the competition and how to submit a model, follow the link below.
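To make "prediction accuracy" concrete, here is a minimal sketch in R of the kind of workflow involved. It does not use the competition data: the signals are simulated, and the number of trials, the choice of informative sensors and the use of logistic regression are all assumptions made purely for illustration.

```r
# Simulated stand-in for the experiment: 64 sensor features per trial,
# with label 1 = "face" and 0 = "house". All values are synthetic.
set.seed(2016)
n_trials  <- 800
n_sensors <- 64
label   <- rbinom(n_trials, 1, 0.5)
signals <- matrix(rnorm(n_trials * n_sensors), nrow = n_trials)

# Assumption: only the first six sensors respond differently to faces
informative <- 1:6
signals[label == 1, informative] <- signals[label == 1, informative] + 0.8

trials <- data.frame(label = label, signals)  # columns X1..X64

# Hold out 200 trials to estimate accuracy on unseen data
train_idx <- sample(n_trials, 600)
fit <- glm(label ~ ., data = trials[train_idx, ], family = binomial)

prob_face <- predict(fit, newdata = trials[-train_idx, ], type = "response")
predicted <- as.integer(prob_face > 0.5)
accuracy  <- mean(predicted == trials$label[-train_idx])
print(accuracy)
```

The held-out accuracy computed at the end is the same kind of figure as the leaderboard score, although the competition evaluates it on patients whose data you never see.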
One of the keys to R's success as a software environment for data analysis is the availability of user-contributed packages. Most useRs will be familiar with (and very grateful for) the Comprehensive R Archive Network (CRAN). The packages available on CRAN, nearly 7000 at last count, cover common data analysis tasks, such as importing data and plotting, through to more specialised tasks, such as packages for parsing data from the web, analysing financial time series data, or analysing data from clinical trials. What may be less familiar to useRs is another large R package repository and software development project, Bioconductor.
Bioconductor is an open source, open development software project that focuses on providing tools for the analysis of high-throughput genomic data, an area of research known variously as bioinformatics or computational biology. Examples of such data include DNA sequences of human genomes and measurements of the level of expression of genes in hundreds of tumours. Recent advances in technology mean that such data are a central part of modern biological research, be it medical, agricultural, or basic science.
The Bioconductor project began in 2001 and was initiated by Robert Gentleman, one of the originators of the R language. Nowadays there is a core team of nine developers, led by Martin Morgan, who develop some of the important core packages and maintain the infrastructure of the project. As with CRAN, it is the user-contributed packages that make the Bioconductor project the valuable resource that it is. There are more than 1000 software packages in the most recent Bioconductor release. In addition to these packages, Bioconductor includes more than 900 annotation packages and 200 experiment data packages. Annotation packages help streamline the oft-tedious bookkeeping and annotation of data associated with bioinformatics research, while the experiment data packages contain processed data and are a valuable teaching resource.
Since its establishment, two of the main goals of the Bioconductor project have been reproducible research and high-quality documentation. In support of these aims, Bioconductor releases packages on a biannual schedule, which is tied to the most recent 'release' version of R, and each Bioconductor software package must contain a vignette. Each vignette is a document that provides a task-oriented description of package functionality, more like a book chapter than the technical and often terse function-level documentation accessible via ? or help() at the R console. Some of these vignettes, such as the User's Guide that accompanies the limma package (pdf), include multiple case studies and carefully explain the statistical foundations of the methods implemented in the package. There is also a dedicated support forum containing many years' worth of questions on common problems with answers from experts in the field.
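As a small illustration of how vignettes are reached from the R console, the sketch below lists the vignettes of an installed package. It uses the grid package only because it is bundled with R; whether any vignettes are actually listed depends on how R was installed, and the limma call in the comment assumes that package is installed locally.

```r
# List the vignettes shipped with an installed package
v <- vignette(package = "grid")

# v$results is a character matrix; keep the vignette name and its title
vignette_table <- as.data.frame(v$results[, c("Item", "Title"), drop = FALSE])
print(head(vignette_table))

# The same calls work for a Bioconductor package once it is installed, e.g.
# browseVignettes("limma")   # assumption: limma is available locally
```

Opening a single vignette is then a matter of calling vignette() with one of the names from the Item column.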
The teaching resources have been further bolstered by material from the recent Bioconductor meeting, held in Seattle, USA on July 21-22. This annual meeting is a great mix of basic science and data analysis methodology talks, presentations on interesting Bioconductor packages, and afternoon workshops where you can learn from the developers themselves. All the workshop materials, and most of the slides from the presentations, can be found here. The meeting was preceded by Developer Day, a less formal get-together including talks and brainstorming sessions about the current state and future directions of the Bioconductor project. There is also an annual European Bioconductor meeting and, for the first time, an Asia-Pacific Bioconductor Developer's Meeting and workshop, to be held as part of GIW/InCoB 2015 in Tokyo, Japan on September 8-11.
With its more specialised focus than CRAN, Bioconductor strongly encourages package developers to make use of the excellent infrastructure provided by existing Bioconductor packages. The intention is to reduce the number of times the wheel is re-invented, as well as to increase the interoperability of objects and methods from different packages. The source code of these core packages can make useful reading for R developers, particularly those wishing to learn more about the S4 object oriented system. This source code can be accessed using Subversion or via the GitHub mirror of all Bioconductor packages.
Bioconductor has been, and continues to be, an incredibly useful resource for people analysing high-throughput genomic data. The development and maintenance of the project is a considerable undertaking, and there is a great debt owed to those who established the project, brought it into being, and continue its day-to-day running. But just as important is the community of users and developers. It is this community that makes such a project succeed and makes it exciting to be a part of.
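For readers who have not met S4 before, here is a minimal sketch of the pieces involved: a class with typed slots, a validity check, a generic and a method. The class name ExpressionSample and the generic topGene are hypothetical and are not taken from any Bioconductor package.

```r
library(methods)

# A class with two typed slots and a validity check (hypothetical example)
setClass("ExpressionSample",
         representation(gene = "character", level = "numeric"),
         validity = function(object) {
           if (length(object@gene) != length(object@level)) {
             "gene and level must have the same length"
           } else {
             TRUE
           }
         })

# A generic function plus a method that dispatches on the class
setGeneric("topGene", function(x, ...) standardGeneric("topGene"))
setMethod("topGene", "ExpressionSample",
          function(x, ...) x@gene[which.max(x@level)])

s <- new("ExpressionSample",
         gene  = c("TP53", "BRCA1", "MYC"),
         level = c(2.1, 5.4, 3.3))
print(topGene(s))
```

Because other packages can define their own methods for a shared generic, this style of object system is one way to get the interoperability described above.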
In the keynote, Microsoft CVP Joseph Sirosh introduced the "language of data": open source R. Sirosh encouraged the audience to learn R, saying "if there is a single language that you choose to learn today .. let it be R".
The keynote featured a demonstration of genomic data analysis using R. The analysis was based on the 1000 genomes data set stored in the HDInsight Hadoop-in-the-cloud service. Revolution R Enterprise running on eight Hadoop clusters distributed around the globe (about 1600 cores in total), and R's Bioconductor suite (specifically the VariantTools and gmapR packages), were used to perform 'variant calling' and calculate the disease risks indicated by a subset of the 1000 genomes in parallel. The result was an interactive heat-map showing the disease risks for each individual.
The next part of the demo was to compare an individual's disease risks — as indicated by his or her DNA — to the population. Joseph Sirosh had his own DNA sequence for this purpose, which he submitted via a Windows Phone app to an Azure service running R. This is easy to do with Azure ML Studio: just put your R code as part of a workflow, and an API will automatically be generated on request. In this way you can publish any R code as an API to the cloud, which is then callable by any connected application.
by Bob Horton, Data Scientist, Revolution Analytics
From electronic medical records to genomic sequences, the data deluge is affecting all aspects of health care. The Masters of Science in Health Informatics (MSHI) program at the University of San Francisco, now in its second year, is designed to help students develop the practical computing skills and quantitative perspicacity they need to manage and exploit this wealth of data in health care applications.
This spring, I am privileged to participate in this effort by developing and teaching a new course, “Statistical Computing for Biomedical Data Analytics”, intended to motivate and prepare students for further studies in data science, such as the intensive summer courses of the MSAN bootcamp. The syllabus is on github.
As you’ve probably guessed, we will be using R. Other courses in the curriculum use Python, which seems to be favored by engineers; in contrast, R was developed by and for statisticians. We want the students to be exposed to both perspectives, and to have the technical background needed to make use of the extensive repositories of code available from CRAN and Bioconductor.
Data science is an interdisciplinary endeavor born of the synergy between computing, statistics, data management, and visualization. This can make it challenging to get started, because you have to know so many things before you get to the good stuff. We’re going to try to ease into it by starting with computational explorations of mathematical and statistical concepts. R is a fantastic environment for this; you can see a bell-shaped curve emerge from an example as simple as
plot(0:20, choose(n=20, k=0:20))
Note the expressive power of the vector of k values, and the easy convenience of having a world of statistical functions at your fingertips. Imagine how this little plot would have delighted Sir Francis Galton.
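To take the exploration one step further, the sketch below rescales the binomial coefficients into probabilities and compares them with a normal density that has the same mean and standard deviation. The variable names are my own; the plotting calls only run in an interactive session.

```r
n <- 20
k <- 0:n

# Dividing by 2^n turns the coefficients into Binomial(n, 1/2) probabilities
probs <- choose(n, k) / 2^n

# Normal density with matching mean (n/2) and standard deviation (sqrt(n)/2)
normal_approx <- dnorm(k, mean = n / 2, sd = sqrt(n) / 2)

# Largest pointwise gap between the two curves
max_gap <- max(abs(probs - normal_approx))
print(max_gap)

if (interactive()) {
  plot(k, probs, type = "h", ylab = "probability")
  lines(k, normal_approx, col = "red")
}
```

Even at n = 20 the two curves are hard to tell apart by eye, which is the central limit theorem showing up in a few lines of code.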
Data science is a journey. The enormous breadth of material and the rapid pace of development mean that the most important thing to learn is how to learn more. We’ll explore many fantastic resources for learning data science and R. For example, Coursera has excellent offerings, exemplified by the series of mini-courses from Johns Hopkins; our students will take at least one of these as a course project.
Of course, the R community itself is the biggest and most important resource. One class will be a field trip to a Bay Area useR Group (BARUG) meeting, and the comments in response to this post will be required reading. Ideas or suggestions regarding the syllabus or course materials from the github repository are welcome, as are observations or ruminations on the process of learning data science and R.
Finally, we are very interested in helping our students find outstanding internship opportunities in health-related organizations. Please don’t hesitate to contact me through [email protected] if you are interested in working with us. Stay tuned for progress reports.
Harvard University is offering a free 5-week on-line course on Statistics and R for the Life Sciences on the edX platform. The course promises you will learn the basics of statistical inference and the basics of using R scripts to conduct reproducible research. You'll just need a background in basic math and programming to follow along and complete homework in the R language.
This is a new course, so I haven't seen any of the content, but the presenters Rafael Irizarry and Michael Love are active contributors to the Bioconductor project, so it should be good. The course begins January 19 and registration is open through 27 April at the link below.
Bioconductor is a project to develop and curate a collection of R packages used for analysis of genetic data (specifically, analysis and comprehension of high-throughput genomic data). With the wealth of genetic data on humans and animals now available, Bioconductor is widely used in medical research to understand how genes influence our health, and to develop new therapies and drugs. (It was used in this recent Nature Molecular Psychiatry article, for example.)
The project currently includes 935 R packages, a tally that's not normally included in the count of available R packages (the count of packages in CRAN currently stands at 6179). There are also 895 packages of genetic data (including many annotated genomes), and 223 packages of experimental data.
A search for Bioconductor citations on Google Scholar yields more than 27,000 hits. (You can see a list of recent articles citing Bioconductor here, including articles in Nature, Genome Biology, and Statistical Science.)
Visits to the Bioconductor website increased by 23% in 2014.
63 packages were added in the last 3 months of 2014.
Package downloads have increased by 9%.
You can find much more news about the Bioconductor project in the current and previous Bioconductor newsletters. Start with the most recent issue at the link below.
If I had to pick just one application to be the “killer app” for the digital computer I would probably choose Agent Based Modeling (ABM). Imagine creating a world populated with hundreds, or even thousands of agents, interacting with each other and with the environment according to their own simple rules. What kinds of patterns and behaviors would emerge if you just let the simulation run? Could you guess a set of rules that would mimic some part of the real world? This dream is probably much older than the digital computer, but according to Jan Thiele’s brief account of the history of ABMs that begins his recent paper, R Marries NetLogo: Introduction to the RNetLogo Package in the Journal of Statistical Software, academic work with ABMs didn’t really take off until the late 1990s.
Now, people are using ABMs for serious studies in economics, sociology, ecology, socio-psychology, anthropology, marketing and many other fields. No less of a complexity scientist than Doyne Farmer (of Dynamic Systems and Prediction Company fame) has argued in Nature for using ABMs to model the complexity of the US economy, and has published on using ABMs to drive investment models. In the following clip of a 2006 interview, Doyne talks about building ABMs to explain the role of subprime mortgages in the Housing Crisis. (Note that when asked about how one would calibrate such a model Doyne explains the need to collect massive amounts of data on individuals.)
Fortunately, the tools for building ABMs seem to be keeping pace with the ambition of the modelers. There are now dozens of platforms for building ABMs, and it is somewhat surprising that NetLogo, a tool with some whimsical terminology (e.g. agents are called turtles) that was designed for teaching children, has apparently become a de facto standard. NetLogo is Java based, has an intuitive GUI, ships with dozens of useful sample models, is easy to program, and is available under the GPL 2 license.
As you might expect, R is a perfect complement for NetLogo. Doing serious simulation work requires a considerable amount of statistics for calibrating models, designing experiments, performing sensitivity analyses, reducing data, exploring the results of simulation runs and much more. The recent JASSS paper Facilitating Parameter Estimation and Sensitivity Analysis of Agent-Based Models: a Cookbook Using NetLogo and R by Thiele and his collaborators describes the R / NetLogo relationship in great detail and points to a decade’s worth of reading. But the real fun is that Thiele’s RNetLogo package lets you jump in and start analyzing NetLogo models in a matter of minutes.
Here is part of an extended example from Thiele's JSS paper that shows R interacting with the Fire model that ships with NetLogo. Using some very simple logic, Fire models the progress of a forest fire.
Snippet of NetLogo Code that drives the Fire model
to go
  if not any? turtles  ;; either fires or embers
    [ stop ]
  ask fires
    [ ask neighbors4 with [pcolor = green]
        [ ignite ]
      set breed embers ]
  fade-embers
  tick
end

;; creates the fire turtles
to ignite  ;; patch procedure
  sprout-fires 1
    [ set color red ]
  set pcolor black
  set burned-trees burned-trees + 1
end
The general idea is that turtles, which represent the frontier of the fire, run through a grid of randomly placed trees. Not shown in the above snippet is the setup logic, which makes the entire model depend on a single parameter: the density of the trees.
This next bit of R code shows how to launch the Fire model from R, set the density parameter, and run the model.
# Launch RNetLogo and control an initial run of the
# NetLogo Fire Model
library(RNetLogo)
nlDir <- "C:/Program Files (x86)/NetLogo 5.0.5"
setwd(nlDir)
nl.path <- getwd()
NLStart(nl.path)
model.path <- file.path("models", "Sample Models", "Earth Science", "Fire.nlogo")
NLLoadModel(file.path(nl.path, model.path))
NLCommand("set density 70")  # set density value
NLCommand("setup")           # call the setup routine
NLCommand("go")              # launch the model from R
Here we see the Fire model running in the NetLogo GUI after it was launched from RStudio.
This next bit of code tracks the progression of the fire as a function of time (model "ticks"), returns results to R and plots them. The plot shows the non-linear behavior of the system.
# Investigate percentage of forest burned as simulation proceeds and plot
library(ggplot2)
NLCommand("set density 60")
NLCommand("setup")
burned <- NLDoReportWhile("any? turtles", "go",
                          c("ticks", "(burned-trees / initial-trees) * 100"),
                          as.data.frame = TRUE,
                          df.col.names = c("tick", "percent.burned"))
# Plot with ggplot2
p <- ggplot(burned, aes(x = tick, y = percent.burned))
p + geom_line() + ggtitle("Non-linear forest fire progression with density = 60")
As with many dynamical systems, the Fire model displays a phase transition. Setting the density lower than 55 will not result in the complete destruction of the forest, while setting density above 75 will very likely result in complete destruction. The following plot shows this behavior.
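If you don't have NetLogo installed, the phase transition can still be explored with a toy version of the model written in plain R. The sketch below is my own simplification and not a port of the NetLogo code: trees are placed at random with the given density, the left-hand column is set alight, and fire spreads to the four neighbouring cells. The function name simulate_fire and the 50 x 50 grid size are assumptions.

```r
# Toy forest-fire spread on a square grid (pure R, not the NetLogo model)
simulate_fire <- function(density, size = 50) {
  # TRUE marks a tree; each cell holds a tree with probability density / 100
  trees <- matrix(runif(size * size) * 100 < density, nrow = size)
  burning <- matrix(FALSE, nrow = size, ncol = size)
  burning[, 1] <- TRUE   # ignite the whole left-hand column
  trees[, 1] <- FALSE    # the ignition column is not counted as forest
  initial_trees <- sum(trees)
  if (initial_trees == 0) return(0)
  burned <- 0
  while (any(burning)) {
    # Shift the burning mask by one cell in each of the four directions
    up    <- rbind(burning[-1, , drop = FALSE], FALSE)
    down  <- rbind(FALSE, burning[-size, , drop = FALSE])
    left  <- cbind(burning[, -1, drop = FALSE], FALSE)
    right <- cbind(FALSE, burning[, -size, drop = FALSE])
    newly <- trees & (up | down | left | right)
    trees[newly] <- FALSE          # a burned tree cannot burn again
    burned <- burned + sum(newly)
    burning <- newly               # only the new frontier keeps spreading
  }
  100 * burned / initial_trees     # percent of the forest burned
}

set.seed(1)
densities <- c(45, 55, 60, 65, 75)
mean_burned <- sapply(densities,
                      function(d) mean(replicate(20, simulate_fire(d))))
print(data.frame(density = densities, mean_burned = mean_burned))
```

The exact percentages depend on the grid size and the random seed, but the jump from a mostly intact forest to a mostly burned one as density increases should be visible in the printed table.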
RNetLogo makes it very easy to programmatically run multiple simulations and capture the results for analysis in R. The following two lines of code run the Fire model twenty times for each value of density between 55 and 65, the region surrounding the phase transition. (rep.sim is a helper function that is not shown here.)
d <- seq(55, 65, 1)    # vector of densities to examine
res <- rep.sim(d, 20)  # run the simulation
The plot below shows the variability of the percent of trees burned as a function of density in the transition region.
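Since rep.sim is not defined in this post, here is a hedged sketch of what such a helper might look like. The name rep_sim, its sim argument and the stand-in simulator fake_sim are all hypothetical; with RNetLogo, sim would instead wrap calls such as NLCommand and NLReport to run the Fire model once and return the percentage burned.

```r
# Run a single-run simulator `sim` several times for each density
rep_sim <- function(densities, reps, sim) {
  results <- lapply(densities, function(d) replicate(reps, sim(d)))
  names(results) <- densities
  results
}

# Stand-in simulator so the sketch runs without NetLogo (purely synthetic)
fake_sim <- function(density) {
  min(100, max(0, rnorm(1, mean = density, sd = 5)))
}

set.seed(42)
res <- rep_sim(seq(55, 65, 1), 20, fake_sim)

# Summarise the spread of outcomes at each density
summary_df <- data.frame(density     = as.numeric(names(res)),
                         mean_burned = sapply(res, mean),
                         sd_burned   = sapply(res, sd))
print(summary_df)
```

Returning a list with one vector of results per density keeps the raw runs available, which is what a plot of variability against density needs.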
My code to generate plots is available in the file: Download NelLogo_blog while all of the code from Thiele's JSS paper is available from the journal website.
Finally, here are a few more interesting links related to ABMs.
When you are not directly working in an industry it is often extremely difficult to get any real insight into common practices that may be blindingly obvious to people who are. With some persistence though, every once in a while you can stumble into an opportunity to see why things are the way they are. Last week, at the JSM in Montreal, I was lucky enough to hear some candid discussion of the relative roles of SAS and R in the pharmaceutical industry during the invited session Opening the Doors to Open Source Programming in Drug Development.
Now, one doesn’t have to be much of an insider to know that SAS has been dominant in pharma for quite some time. This is still true even though there is ample evidence that the situation is changing and R is seeing quite a bit of use in the drug development process. At useR! 2007, Mat Sokup of the FDA gave a talk entitled R: Regulatory Compliance and Validation Issues A Guidance Document for the Use of R in Regulated Clinical Trial Environments in which he concluded that there was no impediment to using R in an FDA submission. Last year at useR 2012, FDA statistician Jea Brodsky presented a poster describing how FDA scientists “use R on a daily basis” and have themselves written R packages for use at various stages in the drug submission process. Nevertheless, certain myths persist.
However, according to Dr Pinheiro, not only has R made its way into drug development, R is probably used more than SAS. (Apparently, R based simulations have become a good part of the daily work of designing drugs.) Nevertheless, R is very rarely used in submissions. The insight I got from Dr Pinheiro's presentation and the conversation at the JSM session is that the daily simulations are a relatively new task in the drug development process. Therefore, one would expect that new tools could be applied if they were useful. But in the ultra conservative pharma environment it is unreasonable to assume that a new tool would replace the customary tool in an established process that seems to be working well. Something big has to break before the current generation of SAS trained statisticians give R a chance in the FDA submission process. The irony is that R is used by FDA statisticians for all kinds of tasks associated with the submission process. Jyoti Rayamahi of Eli Lilly and Company, the discussant for the JSM session, had the best line: “At Lilly we do our submissions in SAS and the FDA uses R to check them”.
Note that last week the R Foundation refreshed the date on the document R: Regulatory Compliance and Validation Issues A Guidance Document for the Use of R in Regulated Clinical Trial Environments, which it first published a few years ago. As it states in the Introduction, the document “ ...is intended to provide a reasonable consensus position on the part of the R Foundation for Statistical Computing ... relative to the use of R within these regulated environments (environments in which the FDA regulates the use of statistical software) and to provide a common foundation for end users to meet their own internal standard operating procedures, documentation requirements and regulatory obligations.” It lists the pertinent regulatory documents and provides information that should be helpful to anyone who would like to deploy R in a regulated activity.