Getting Started with RStudio by John Verzani
Chapter 4. Case Study: Creating a Package
Before describing more systematically the components that RStudio provides for development work in R (most importantly the source-code editor), we will pick up where we left off on our case study of analyzing the group behavior and individual movements of a colony of naked mole rats. Here, our goal is to illustrate one way to do package development with RStudio.
Imagine that, after a short time using RStudio for interactive use, we are pretty happy using the command line for short commands but have learned to really enjoy writing scripts in the Code editor. Even 2- or 3-line commands are much easier to debug when done in the editor. The directness of typing at the command line isn’t lost, as our fingers are now trained to hit Ctrl+Enter to send the current line or selection to the R interpreter—or even Ctrl+Shift+Enter to send the entire buffer (Command, not Ctrl, for Mac users). We never need to leave the keyboard unless we choose to.
Along the way, we have been able to create a large script file that we now want to share with a colleague.
How do we do this? There are many ways. We could just send along the entire script, or, with a bit more work, we could share our work through a version control system. In some cases, this approach might be the best thing to do, as our colleague can then do exactly what we have been doing. However, there are many situations where this isn’t so great. For example, perhaps this colleague doesn’t know R too well and just wants some functions to use. Plus, she may want some documentation on how to actually use the functions. Besides, we might want to share our work much more widely. At times like this, R users start to think about packages.
Packages are how R users extend R in a structured, reusable way. CRAN houses over 3,000 of them, and many more are scattered widely throughout the internet at R-specific repositories like those hosted by the Bioconductor project or on r-forge . Packages also appear on code-hosting sites such as http://github.com or http://code.google.com . However, we don’t need to get packages from a website. We can start by creating our own local packages to share with colleagues. Let’s see how, taking advantage of the features of the code-editor component of RStudio.
Creating Functions from Script Files
Currently, our script is one long set of commands that processes the data files and then makes a plot. We first want to turn some of the commands into functions . Functions make code reuse much more feasible. A basic pattern in R is to write functions for simple small tasks and then chain these tasks together using function composition . This is similar to the composition concept from mathematics, where we take the output from one function and use this as the input for another.
RStudio’s integrated Source code editor—where we wrote our script—makes working with functions quite easy. We’ll illustrate some of the features here.
For our task, we have a script that does four things to process the data:
It reads in the data and does some data cleaning.
It creates zoo objects for each mole rat.
It merges these into one large zoo object.
It makes a plot.
This naturally lends itself to four functions. Keeping our functions small and focused on a single task makes them easier to test and debug. It can also help later on in the development of a package, when we may think about combining similar tasks into more general functions, although we won’t see that here.
RStudio provides a convenient action for turning a series of commands into a function. The magic wand toolbar button in the code editor has the Extract Function action. We simply highlight the text we want and then wave the magic wand—tada! In Figures 4-1 and 4-2 , we illustrate the changes introduced by the magic wand. Our first function will be the one that reads the data into a data frame where the time column is using one of R’s date classes.
The magic wand does most of the work, but not all in this case, as the text can’t be adequately parsed. In R, functions have arguments, a body of commands, a return value, and, optionally, a name they are assigned to. We specify the name in the extract-function dialog, but in this instance we added the function argument “f” and the return value “x” after the extraction.
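The extracted function might look roughly like the following sketch. The tab separator and the column names are assumptions here, since the full script isn’t reproduced in the text:

```r
## Sketch of the extracted function: read the raw activity data and
## convert the time column to one of R's date-time classes.
## (The separator and column names are assumptions.)
readNMRData <- function(f) {
  x <- read.table(f, sep = "\t", header = TRUE, stringsAsFactors = FALSE)
  x$time <- as.POSIXct(x$time)
  x
}
```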
We don’t try to automate the process of converting the rtf file into a txt file, as that isn’t so easy. We will put together the commands to process the data frame and create a list of zoo objects (one for each mole rat) and the commands to create a multivariate zoo object. This will be done with the magic wand in a similar manner as above.
A Package Skeleton
Packages must have a special structure, detailed in the Writing R Extensions manual that accompanies a standard R installation. We can consult that for detailed reference, but for now all we need to know is that the function package.skeleton will set up this structure for us. (The ProjectTemplate package can be used to provide even more detail to this process.)
This function needs, at a minimum, just two things: where and what. As in, where are we going to write our package files and what will we initially populate them with? We choose the directory ~/NMRpackage , and will start with one of the functions from our script:
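With readNMRData defined in our workspace, the call might look like this (the directory follows the text; the stub body below is just a placeholder for the real function):

```r
## A stand-in for the function extracted from our script
readNMRData <- function(f) read.table(f, header = TRUE)

## "Where and what": write the package skeleton to ~/NMRpackage,
## initially populated with readNMRData
package.skeleton(name = "NMRpackage", path = "~", list = "readNMRData")
```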
We now want to inform RStudio that we are working on a new project, allowing us to compartmentalize our session data and accompanying functions. A more detailed description of projects in RStudio is postponed to Organizing Activities with Projects; for now we note that we used the directory just created by package.skeleton .
After creating a new project, we refresh the Files browser to show us which files were created ( Figure 4-3 ).
We see two unexpected files in the base directory and two subdirectories. We first investigate what is in the Read-and-delete-me by clicking on the link and reading. For now, nothing we need. It says to delete the file, so we oblige by selecting the file’s checkbox and clicking the Delete toolbar button.
The DESCRIPTION file is used by R to organize its packages. Ours needs to be updated to reflect our package. Clicking the link opens the file in the code editor. Here we edit the Title: field and some others. Since our package will rely on the zoo and ggplot2 packages, we add those to the Depends field. This file is in dcf format with a keyword (the name before the colon) and value on one line. If you need more lines for the value, just give any additional lines some indented space, as was done for the “Description:” line (see Figure 4-4 ).
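An edited DESCRIPTION might look something like this (the title, version, date, and contact details are placeholders, not the book’s actual values; the Author and Maintainer stubs are what package.skeleton generates):

```
Package: NMRpackage
Type: Package
Title: Tools to Analyze Naked Mole Rat Activity Data
Version: 0.1
Date: 2011-07-01
Author: Who wrote it
Maintainer: Who to complain to <yourfault@somewhere.net>
Depends: zoo, ggplot2
Description: Functions to read, clean, and visualize the activity
    data collected for a colony of naked mole rats.
License: GPL (>= 2)
```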
The R directory is where all the programming is done. In this directory we have the files containing our function definitions. We change our working directory (Ctrl+Shift+K), and the file browser updates to show this directory.
We see that the call to package.skeleton created a file named readNMRData.R , containing the definition of the one function we gave it. We could have one file per function, but that will quickly get unwieldy. We could also put all our functions into one file—but again that gets bulky. A better strategy is to group similar functions together into a file. For now, we will create a file to hold our data-processing functions ( process.R ), and another file for our still-to-be-written visualization functions ( visualize.R ).
To rename our file through the Files browser, we select its checkbox and then click the Rename toolbar button. A dialog prompts us for the new name. We open this file for editing by clicking on its link. We then open our original script file ( one-big-script-file.R , which isn’t in our new project) by using the Open File toolbar button on the application’s toolbar. We then proceed to use the magic wand to create functions createZooObjects and createStateMatrix . These are then copy-and-pasted into the appropriate file in the R directory.
RStudio has some facilities for navigating through a file full of functions. The “Go to file/function” search box in the main toolbar allows one to quickly and conveniently navigate to any function in a project. For in-file navigation, the lower-left corner of the code-editor component has a label ( Figure 4-5 ) containing the line and column number and, next to that, a combobox that can be popped up to select a function to jump to.
We next open a new R script (Shift+Ctrl+N or through the File menu) to hold any functions for visualization, and add a function that uses ggplot2 to make a graphic. We save the file and update the Files browser through its Refresh button.
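A visualization function along these lines is sketched below; the column names ( time , activity , id ) are assumptions about the merged data’s layout:

```r
library(ggplot2)

## Plot each mole rat's activity over time, one colored line per animal.
## (The column names are assumptions about the merged data set.)
plotActivity <- function(x) {
  ggplot(x, aes(x = time, y = activity, color = id)) +
    geom_line()
}
```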
Documenting Functions With roxygen2
The package.skeleton command makes the man subdirectory. In R, all exported functions must be documented in some file. Such files are written using R’s Rd markup language. Looking at the man directory, we see that two files were made: readNMRData.Rd (a stub for our function), and NMRpackage-package.Rd (a stub for documenting the entire package). We open up the latter and make the required changes—at a minimum, the lines that have paired double tildes are edited to match our idea of the package.
We could go on to edit the readNMRData.Rd template, but instead we will use the roxygen2 package to document our package’s functions. Although R is organized around a workflow where one writes the function then documents it separately (presumably after the function is done), many other programming languages have facilities for writing in a literate programming style using inline documentation. Some R programmers are used to this functionality (it simplifies iterative function writing and documenting) and the roxygen2 package makes this feasible. For the modest demands of this package, it is an excellent choice.
Rd format has a number of required sections, and using roxygen2 does not eliminate the need to follow that structure. All directives appear in comments (we use ##' ). Keywords are prefaced with an at symbol ( @ ). The main sections usually defined are a title (taken from the first line), an optional description (taken from the first paragraph), the function’s arguments (defined through @param tags), a description of the return value ( @return ), whether the function will be exported ( @export ), and, optionally, some examples. R has tools to check that the right format is followed. In particular, they will catch cases where you have failed to document all the arguments or have misnamed a section tag.
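For instance, documenting readNMRData inline with roxygen2 might look like the following sketch (the wording, example, and function body are illustrative, not the book’s exact text):

```r
##' Read naked mole rat activity data
##'
##' Reads the raw activity data and converts the time column to one of
##' R's date-time classes.
##'
##' @param f the name of a data file produced by the experiment
##' @return a data frame whose time column has class POSIXct
##' @export
##' @examples
##' \dontrun{x <- readNMRData("activity.txt")}
readNMRData <- function(f) {
  x <- read.table(f, sep = "\t", header = TRUE, stringsAsFactors = FALSE)
  x$time <- as.POSIXct(x$time)
  x
}
```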
The Rd markup is fairly straightforward and is described in the Writing R Extensions manual. An example of a documented function is shown in Figure 4-6 .
We can also create a NEWS file to keep track of changes between versions of the package. This may not be useful for this package, but if the package proves popular with our colleagues, a NEWS file will help them see what has happened with our package ( Figure 4-7 ). The NEWS file is a plain-text file with a simple structure. We open it through the File > New menu, but this time select Text File . The code editor will present a different toolbar in this case, as it makes no sense to be able to source R code from this file.
The devtools Package
Testing a package can involve loading the package, testing it, making desired changes, then reloading the package. This workflow can get tedious—it can even involve closing and restarting R to flush residual changes. The devtools package is designed to make this task easier.
If it isn’t installed, we can install it from CRAN using the Packages component ( Figure 4-8 ). Click the Install Packages toolbar button and then type the desired package name into the dialog. (An auto-complete feature makes this easy.) Leaving the Install dependencies option checked will also install roxygen2 and the testthat package, if needed.
Windows users will want to install RTools , a set of development tools for Windows maintained by D. Murdoch. The files are found at http://cran.r-project.org/bin/windows/Rtools .
The devtools package provides the load_all function to reload the package without having to restart R. To use it we define a package variable ( pkg ) pointing to the directory (an “.Rpackages” file can be used to avoid this step), then load the code ( Figure 4-9 ). The new functions do not appear in the Workspace browser, as they are stored in an environment just below the global workspace, but they do show up through RStudio’s code-completion functionality.
We can try it out. In doing so, if we realize that we would like to change some function, no problem. We make the adjustment in the code editor, save the file, then reissue the command load_all(pkg) .
For working with documentation, the devtools package has the document function (as in document(pkg) ) to call roxygen2 to create the corresponding Rd files and show_news to view the NEWS file.
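Put together, the edit–reload loop might look like this (the pkg path matches the directory used earlier; load_all, document, and show_news are the devtools functions named in the text):

```r
library(devtools)

pkg <- "~/NMRpackage"   # directory created by package.skeleton

load_all(pkg)           # (re)source the package's R/ files without reinstalling
document(pkg)           # have roxygen2 regenerate the man/*.Rd files
show_news(pkg)          # view the package's NEWS file
```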
We can add our testing commands in an example, but we will need to have some data to use when we distribute our package. We wrote readNMRData to accept any data file in the same format, as we imagine our colleagues using it with other data sets generated by the experiment. However, we can combine the data we have into the package for testing and example purposes. R has the data directory for including data in a package. This data should be in a format R can easily read in—ours isn’t (it has a different separator and we need to skip every other line). So instead, we use the inst directory to create our own data directory. We call this sampledata (not data , as this would interfere with the data directory that R processes automatically). We create the needed directories with the New Folder toolbar button in the Files browser.
How you get the package data file into this folder depends on how you are using RStudio. If you are using the desktop version, you simply copy the file over using your usual method (e.g., Finder, command line, Windows Explorer). If you are using the server version, then this won’t work. In that case, the Files component has an additional Upload toolbar button to allow you to upload your file. This button summons a dialog that allows you to browse for a file or a zip archive of many files ( Figure 4-10 ).
R documentation files have the option of an “examples” section, where one typically documents example usage of the function(s). This is a very good idea, as it gives the user a sample template to build on. In Figure 4-11 , we see sample code added to our readNMRData function’s documentation.
For an installed package, examples can be run by the user through the example function. During development with devtools , the programmer can use the run_examples function.
Although examples will typically be run during package development, it is a good practice to include tests of the package’s core functions as well. Tests serve a different purpose than examples. Well-designed tests can help find bugs introduced by changes to functions—a not uncommon event. The devtools package can run tests (through testthat ) that appear in the inst/tests subdirectory of the package.
Building and Installing the Package
Packages can be checked for issues, built for distribution, and installed for subsequent use. RStudio does not have any features for performing these tasks, but all of them can be done with devtools or from a shell outside of the R process. For example, a UNIX or Mac OS X user could run:
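A minimal sketch, assuming the shell’s working directory contains the NMRpackage sources (the version in the tarball name depends on the DESCRIPTION file):

```shell
# Build a source package (produces, e.g., NMRpackage_0.1.tar.gz)
R CMD build NMRpackage
```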
We could replace build with check to check our package for consistency with R’s expectations. Though checking isn’t required for sharing a package with colleagues, a package distributed on CRAN should pass the check phase cleanly. Checking is a good thing in any case.
Installing locally built packages can be done from the Install Packages dialog by selecting the option to install from a Package Archive File (.tgz) .
The devtools package provides the functions check , build , and install for performing these three basic tasks.
For Windows users, the WinBuilder project ( http://win-builder.R-project.org ) is a web service that can be used to create packages. Otherwise, building R packages under Windows is made easier using the Rtools bundle mentioned earlier.
Gautier Paux and Alex Dmitrienko

Introduction
Several case studies have been created to facilitate the implementation of simulation-based Clinical Scenario Evaluation (CSE) approaches in multiple settings and help the user understand individual features of the Mediana package. Case studies are arranged in terms of increasing complexity of the underlying clinical trial setting (i.e., trial design and analysis methodology). For example, Case study 1 deals with a number of basic settings and increasingly more complex settings are considered in the subsequent case studies.
Case study 1
This case study serves as a good starting point for users who are new to the Mediana package. It focuses on clinical trials with simple designs and analysis strategies where power and sample size calculations can be performed using analytical methods.
- Trial with two treatment arms and single endpoint (normally distributed endpoint).
- Trial with two treatment arms and single endpoint (binary endpoint).
- Trial with two treatment arms and single endpoint (survival-type endpoint).
- Trial with two treatment arms and single endpoint (survival-type endpoint with censoring).
- Trial with two treatment arms and single endpoint (count-type endpoint).
Case study 2
This case study is based on a clinical trial with three or more treatment arms . A multiplicity adjustment is required in this setting and no analytical methods are available to support power calculations.
This example also illustrates a key feature of the Mediana package, namely, a useful option to define custom functions, for example, it shows how the user can define a new criterion in the Evaluation Model.
Clinical trial in patients with schizophrenia
Case study 3
This case study introduces a clinical trial with several patient populations (marker-positive and marker-negative patients). It demonstrates how the user can define independent samples in a data model and then specify statistical tests in an analysis model based on merging several samples, i.e., merging samples of marker-positive and marker-negative patients to carry out a test that evaluated the treatment effect in the overall population.
Clinical trial in patients with asthma
Case study 4
This case study illustrates CSE simulations in a clinical trial with several endpoints and helps showcase the package’s ability to model multivariate outcomes in clinical trials.
Clinical trial in patients with metastatic colorectal cancer
Case study 5
This case study is based on a clinical trial with several endpoints and multiple treatment arms and illustrates the process of performing complex multiplicity adjustments in trials with several clinical objectives.
Clinical trial in patients with rheumatoid arthritis
Case study 6
This case study is an extension of Case study 2 and illustrates how the package can be used to assess the performance of several multiplicity adjustments. The case study also walks the reader through the process of defining customized simulation reports.
Case study 1 deals with a simple setting, namely, a clinical trial with two treatment arms (experimental treatment versus placebo) and a single endpoint. Power calculations can be performed analytically in this setting. Specifically, closed-form expressions for the power function can be derived using the central limit theorem or other approximations.
Several distributions will be illustrated in this case study:

- Normally distributed endpoint
- Binary endpoint
- Survival-type endpoint
- Survival-type endpoint (with censoring)
- Count-type endpoint
Suppose that a sponsor is designing a Phase III clinical trial in patients with pulmonary arterial hypertension (PAH). The efficacy of experimental treatments for PAH is commonly evaluated using a six-minute walk test and the primary endpoint is defined as the change from baseline to the end of the 16-week treatment period in the six-minute walk distance.
Define a Data Model
The first step is to initialize the data model:
After the initialization, components of the data model can be added to the DataModel object incrementally using the + operator.
The change from baseline in the six-minute walk distance is assumed to follow a normal distribution. The distribution of the primary endpoint is defined in the OutcomeDist object:
The sponsor would like to perform power evaluation over a broad range of sample sizes in each treatment arm:
As a side note, the seq function can be used to compactly define sample sizes in a data model:
The sponsor is interested in performing power calculations under two treatment effect scenarios (standard and optimistic scenarios). Under these scenarios, the experimental treatment is expected to improve the six-minute walk distance by 40 or 50 meters compared to placebo, respectively, with the common standard deviation of 70 meters.
Therefore, the mean change in the placebo arm is set to μ = 0 and the mean changes in the six-minute walk distance in the experimental arm are set to μ = 40 (standard scenario) or μ = 50 (optimistic scenario). The common standard deviation is σ = 70.
Note that the mean and standard deviation are explicitly identified in each list. This is done mainly for the user’s convenience.
After having defined the outcome parameters for each sample, two Sample objects that define the two treatment arms in this trial can be created and added to the DataModel object:
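Assembled in one place, the data model might look like the following sketch, which follows the Mediana package’s documented interface. The per-arm sample sizes are assumptions; the outcome parameters follow the two scenarios described above, with one parameter set per scenario in each sample’s list:

```r
library(Mediana)

# Outcome parameter set 1 (standard scenario)
outcome1.placebo   <- parameters(mean = 0,  sd = 70)
outcome1.treatment <- parameters(mean = 40, sd = 70)

# Outcome parameter set 2 (optimistic scenario)
outcome2.placebo   <- parameters(mean = 0,  sd = 70)
outcome2.treatment <- parameters(mean = 50, sd = 70)

case.study1.data.model <- DataModel() +
  OutcomeDist(outcome.dist = "NormalDist") +
  SampleSize(seq(50, 70, 5)) +   # per-arm sample sizes (values assumed)
  Sample(id = "Placebo",
         outcome.par = parameters(outcome1.placebo, outcome2.placebo)) +
  Sample(id = "Treatment",
         outcome.par = parameters(outcome1.treatment, outcome2.treatment))
```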
Define an Analysis Model
Just like the data model, the analysis model needs to be initialized as follows:
Only one significance test is planned to be carried out in the PAH clinical trial (treatment versus placebo). The treatment effect will be assessed using the one-sided two-sample t -test:
According to the specifications, the two-sample t-test will be applied to Sample 1 (Placebo) and Sample 2 (Treatment). These sample IDs come from the data model defined earlier. As explained in the manual (see Analysis Model ), the sample order is determined by the expected direction of the treatment effect. In this case, an increase in the six-minute walk distance indicates a beneficial effect, and a numerically larger value of the primary endpoint is expected in Sample 2 (Treatment) compared to Sample 1 (Placebo). This implies that the list of samples passed to the t-test should include Sample 1 followed by Sample 2. Note that, beginning with version 1.0.6, an option is available to indicate whether a larger numeric value is expected in Sample 2 ( larger = TRUE ) or in Sample 1 ( larger = FALSE ). By default, this argument is set to TRUE .
To illustrate the use of the Statistic object, the mean change in the six-minute walk distance in the treatment arm can be computed using the MeanStat statistic:
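A sketch of the analysis model with the t-test and the mean statistic, assuming Mediana’s documented Test and Statistic interface (the IDs are illustrative):

```r
library(Mediana)

case.study1.analysis.model <- AnalysisModel() +
  Test(id = "Placebo vs treatment",
       samples = samples("Placebo", "Treatment"),  # placebo listed first
       method = "TTest") +
  Statistic(id = "Mean Treatment",
            samples = samples("Treatment"),
            method = "MeanStat")
```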
Define an Evaluation Model
The data and analysis models specified above collectively define the Clinical Scenarios to be examined in the PAH clinical trial. The scenarios are evaluated using success criteria or metrics that are aligned with the clinical objectives of the trial. In this case it is most appropriate to use regular power or, more formally, marginal power . This success criterion is specified in the evaluation model.
First of all, the evaluation model must be initialized:
Secondly, the success criterion of interest (marginal power) is defined using the Criterion object:
The tests argument lists the IDs of the tests (defined in the analysis model) to which the criterion is applied (note that more than one test can be specified). The test IDs link the evaluation model with the corresponding analysis model. In this particular case, marginal power will be computed for the t-test that compares the mean change in the six-minute walk distance in the placebo and treatment arms (Placebo vs treatment).
In order to compute the average value of the mean statistic specified in the analysis model (i.e., the mean change in the six-minute walk distance in the treatment arm) over the simulation runs, another Criterion object needs to be added:
The statistics argument of this Criterion object lists the ID of the statistic (defined in the analysis model) to which this metric is applied (e.g., Mean Treatment ).
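The two criteria together might be specified as follows; the one-sided alpha level and the labels are assumptions:

```r
library(Mediana)

case.study1.evaluation.model <- EvaluationModel() +
  Criterion(id = "Marginal power",
            method = "MarginalPower",
            tests = tests("Placebo vs treatment"),
            labels = c("Placebo vs treatment"),
            par = parameters(alpha = 0.025)) +   # one-sided alpha (assumed)
  Criterion(id = "Average Mean",
            method = "MeanSumm",
            statistics = statistics("Mean Treatment"),
            labels = c("Average mean change in the treatment arm"))
```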
Perform Clinical Scenario Evaluation
After the clinical scenarios (data and analysis models) and evaluation model have been defined, the user is ready to evaluate the success criteria specified in the evaluation model by calling the CSE function.
To accomplish this, the simulation parameters need to be defined in a SimParameters object:
The function call for CSE specifies the individual components of Clinical Scenario Evaluation in this case study as well as the simulation parameters:
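A sketch of both steps, assuming the model objects defined earlier in this case study; the number of simulations, processor load, and seed are assumed values:

```r
library(Mediana)

# Simulation parameters (values assumed)
case.study1.sim.parameters <- SimParameters(n.sims = 1000,
                                            proc.load = 2,
                                            seed = 42938001)

# Run Clinical Scenario Evaluation
case.study1.results <- CSE(case.study1.data.model,
                           case.study1.analysis.model,
                           case.study1.evaluation.model,
                           case.study1.sim.parameters)
```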
The simulation results are saved in a CSE object ( case.study1.results ). This object contains complete information about this particular evaluation, including the data, analysis, and evaluation models specified by the user. The most important component of this object is the data frame contained in the list named simulation.results ( case.study1.results$simulation.results ). This data frame includes the values of the success criteria and metrics defined in the evaluation model.
Summarize the Simulation Results
Summary of Simulation Results in the R Console
To facilitate the review of the simulation results produced by the CSE function, the user can invoke the summary function. This function displays the data frame containing the simulation results in the R console:
If the user is interested in generating graphical summaries of the simulation results (using the ggplot2 package or other packages), this data frame can also be saved to an object:
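Both uses of summary are sketched below; this fragment assumes the CSE object from the previous step:

```r
# Display the simulation results data frame in the R console
summary(case.study1.results)

# Or save the data frame for further processing, e.g., plotting
case.study1.simulation.results <- summary(case.study1.results)
```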
Generate a Simulation Report
A very useful feature of the Mediana package is the generation of a Microsoft Word-based report that provides a summary of the Clinical Scenario Evaluation.
To generate a simulation report, the user needs to define a presentation model by creating a PresentationModel object. This object must be initialized as follows:
Project information can be added to the presentation model using the Project object:
The user can easily customize the simulation report by defining report sections and specifying properties of summary tables in the report. The code shown below creates a separate section within the report for each set of outcome parameters (using the Section object) and sets the sorting option for the summary tables (using the Table object). The tables will be sorted by the sample size. Further, in order to define descriptive labels for the outcome parameter scenarios and sample size scenarios, the CustomLabel object needs to be used:
Once the presentation model has been defined, the simulation report is ready to be generated using the GenerateReport function:
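Put together, the presentation model and report generation might look like this sketch; the project details, labels, and output filename are placeholders:

```r
library(Mediana)

case.study1.presentation.model <- PresentationModel() +
  Project(username = "[User name]",
          title = "Case study 1",
          description = "Clinical trial in patients with PAH") +
  Section(by = "outcome.parameter") +   # one report section per scenario
  Table(by = "sample.size") +           # sort summary tables by sample size
  CustomLabel(param = "sample.size",
              label = paste0("N = ", seq(50, 70, 5))) +
  CustomLabel(param = "outcome.parameter",
              label = c("Standard", "Optimistic"))

GenerateReport(presentation.model = case.study1.presentation.model,
               cse.results = case.study1.results,
               report.filename = "Case study 1.docx")
```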
The R code and the report that summarizes the results of Clinical Scenario Evaluation for this case study can be downloaded on the Mediana website:
Consider a Phase III clinical trial for the treatment of rheumatoid arthritis (RA). The primary endpoint is the response rate based on the American College of Rheumatology (ACR) definition of improvement. The trial’s sponsor is interested in performing power calculations using several treatment effect assumptions (Placebo 30% vs. Treatment 50%, Placebo 30% vs. Treatment 55%, and Placebo 30% vs. Treatment 60%).
The three outcome parameter sets displayed in the table are combined with four sample size sets ( SampleSize(c(80, 90, 100, 110)) ) and the distribution of the primary endpoint ( OutcomeDist(outcome.dist = "BinomDist") ) is specified in the DataModel object case.study1.data.model :
The analysis model uses a standard two-sample test for comparing proportions ( method = "PropTest" ) to assess the treatment effect in this clinical trial example:
Power evaluations are easily performed in this clinical trial example using the same evaluation model utilized in the case of a normally distributed endpoint, i.e., evaluations rely on marginal power:
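The three steps for the binary endpoint might be sketched as follows, using the outcome parameter and sample size sets stated above (the IDs and alpha level are assumptions):

```r
library(Mediana)

# Data model: three response-rate scenarios and four sample size sets
case.study1.data.model <- DataModel() +
  OutcomeDist(outcome.dist = "BinomDist") +
  SampleSize(c(80, 90, 100, 110)) +
  Sample(id = "Placebo",
         outcome.par = parameters(parameters(prop = 0.30),
                                  parameters(prop = 0.30),
                                  parameters(prop = 0.30))) +
  Sample(id = "Treatment",
         outcome.par = parameters(parameters(prop = 0.50),
                                  parameters(prop = 0.55),
                                  parameters(prop = 0.60)))

# Analysis model: two-sample test for proportions
case.study1.analysis.model <- AnalysisModel() +
  Test(id = "Placebo vs treatment",
       samples = samples("Placebo", "Treatment"),
       method = "PropTest")

# Evaluation model: marginal power, as in the normal-endpoint case
case.study1.evaluation.model <- EvaluationModel() +
  Criterion(id = "Marginal power",
            method = "MarginalPower",
            tests = tests("Placebo vs treatment"),
            labels = c("Placebo vs treatment"),
            par = parameters(alpha = 0.025))
```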
An extension of this clinical trial example is provided in Case study 5 . The extension deals with a more complex setting involving several trial endpoints and multiple treatment arms.
If the trial’s primary objective is formulated in terms of analyzing the time to a clinically important event (progression or death in an oncology setting), data and analysis models can be set up based on an exponential distribution and the log-rank test.
As an illustration, consider a Phase III trial which will be conducted to evaluate the efficacy of a new treatment for metastatic colorectal cancer (MCC). Patients will be randomized in a 2:1 ratio to an experimental treatment or placebo (in addition to best supportive care).
The trial’s primary objective is to assess the effect of the experimental treatment on progression-free survival (PFS).
A single treatment effect scenario is considered in this clinical trial example. Specifically, the median time to progression is assumed to be:
Placebo : t0 = 6 months,
Treatment: t1 = 9 months.
Under an exponential distribution assumption (which is specified using the ExpoDist distribution), the median times correspond to the following hazard rates:
λ0 = log(2)/t0 = 0.116,
λ1 = log(2)/t1 = 0.077,
and the resulting hazard ratio (HR) is 0.077/0.116 = 0.67.
It is important to note that, if no censoring mechanisms are specified in a data model with a time-to-event endpoint, all patients will reach the endpoint of interest (e.g., progression) and thus the number of patients will be equal to the number of events. Using this property, power calculations can be performed using either the Event object or SampleSize object. For the purpose of illustration, the Event object will be used in this example.
To define a data model in the MCC clinical trial, the total event count in the trial is assumed to range between 270 and 300. Since the trial’s design is not balanced, the randomization ratio needs to be specified in the Event object:
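A sketch of the MCC data model, using the hazard rates derived above and the stated event counts; the Event object carries the 2:1 (treatment:placebo) randomization ratio:

```r
library(Mediana)

# Hazard rates from the assumed median times (exponential distribution)
median.time.placebo   <- 6   # months
median.time.treatment <- 9   # months

case.study1.data.model <- DataModel() +
  OutcomeDist(outcome.dist = "ExpoDist") +
  Event(n.events = c(270, 300),     # target event counts
        rando.ratio = c(1, 2)) +    # placebo : treatment = 1 : 2
  Sample(id = "Placebo",
         outcome.par = parameters(parameters(rate = log(2)/median.time.placebo))) +
  Sample(id = "Treatment",
         outcome.par = parameters(parameters(rate = log(2)/median.time.treatment)))
```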
It is worth noting that the primary endpoint’s type (i.e., the outcome.type argument in the OutcomeDist object) is not specified. By default, the outcome type is set to fixed , which means that a design with a fixed follow-up is assumed even though the primary endpoint in this clinical trial is clearly a time-to-event endpoint. This is due to the fact that, as was explained earlier in this case study, there is no censoring in this design and all patients are followed until the event of interest is observed. It is easy to verify that the same results are obtained if the outcome type is set to event .
The analysis model in this clinical trial is very similar to the analysis models defined in the case studies with normal and binomial outcome variables. The only difference is the choice of the statistical method utilized in the primary analysis ( method = "LogrankTest" ):
To illustrate the specification of a Statistic object, the hazard ratio will be computed using the Cox method. This can be accomplished by adding a Statistic object to the AnalysisModel , as presented below.
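A sketch of this analysis model; the use of HazardRatioStat with a Cox-method parameter follows the Mediana manual, and the IDs are illustrative:

```r
library(Mediana)

case.study1.analysis.model <- AnalysisModel() +
  Test(id = "Placebo vs treatment",
       samples = samples("Placebo", "Treatment"),
       method = "LogrankTest") +
  Statistic(id = "Hazard Ratio",
            samples = samples("Placebo", "Treatment"),
            method = "HazardRatioStat",
            par = parameters(method = "Cox"))   # Cox-model hazard ratio
```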
An evaluation model identical to those used earlier in the case studies with normal and binomial outcomes can be applied to compute the power function at the selected event counts. In addition, the average hazard ratio across the simulations will be computed.
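A sketch of this evaluation model, assuming the test and statistic IDs used above (the `MeanSumm` method averages a statistic over the simulation runs):

```r
case.study1.evaluation.model = EvaluationModel() +
  Criterion(id = "Marginal power",
            method = "MarginalPower",
            tests = tests("Placebo vs treatment"),
            labels = c("Placebo vs treatment"),
            par = parameters(alpha = 0.025)) +
  Criterion(id = "Average HR",
            method = "MeanSumm",
            statistics = statistics("Hazard Ratio"),
            labels = c("Average Cox hazard ratio"))
```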
The power calculations presented in the previous case study assume an idealized setting where each patient is followed until the event of interest (e.g., progression) is observed. In this case, the sample size (number of patients) in each treatment arm is equal to the number of events. In reality, events are often censored and a sponsor is generally interested in determining the number of patients to be recruited in order to ensure a target number of events, which translates into desirable power.
The Mediana package can be used to perform power calculations in event-driven trials in the presence of censoring. This is accomplished by setting up design parameters such as the length of the enrollment and follow-up periods in a data model using a Design object.
In general, even though closed-form solutions have been derived for sample size calculations in event-driven designs, the available approaches force clinical trial researchers to make a variety of simplifying assumptions, e.g., assumptions on the enrollment distribution are commonly made, see, for example, Julious (2009, Chapter 15). A general simulation-based approach to power and sample size calculations implemented in the Mediana package enables clinical trial sponsors to remove these artificial restrictions and examine a very broad set of plausible design parameters.
Suppose, for example, that a standard design with a variable follow-up will be used in the MCC trial introduced in the previous case study. The total study duration will be 21 months, which includes a 9-month enrollment (accrual) period and a minimum follow-up of 12 months. The patients are assumed to be recruited at a uniform rate. The set of design parameters also includes the dropout distribution and its parameters. In this clinical trial, the dropout distribution is exponential with a rate determined from historical data. These design parameters are specified in a Design object:
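A sketch of this Design object; the exponential dropout rate below is a placeholder for the value derived from historical data:

```r
# 9-month uniform accrual, 21-month total duration, exponential dropout
case.study1.design = Design(enroll.period = 9,
                            study.duration = 21,
                            enroll.dist = "UniformDist",
                            dropout.dist = "ExpoDist",
                            dropout.dist.par = parameters(rate = 0.0115))
```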
Finally, the primary endpoint’s type is set to event in the OutcomeDist object to indicate that a variable follow-up will be utilized in this clinical trial.
The complete data model in this case study is defined as follows:
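A sketch of the complete data model, combining the event specification, design parameters and sample-specific outcome parameters (the dropout rate and 10-event increments are illustrative):

```r
case.study1.data.model = DataModel() +
  OutcomeDist(outcome.dist = "ExpoDist", outcome.type = "event") +
  Event(n.events = seq(270, 300, 10), rando.ratio = c(1, 2)) +
  Design(enroll.period = 9,
         study.duration = 21,
         enroll.dist = "UniformDist",
         dropout.dist = "ExpoDist",
         dropout.dist.par = parameters(rate = 0.0115)) +
  Sample(id = "Placebo",
         outcome.par = parameters(parameters(rate = log(2) / 6))) +
  Sample(id = "Treatment",
         outcome.par = parameters(parameters(rate = log(2) / 9)))
```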
Since the number of events is fixed in this clinical trial example and some patients will not reach the event of interest, it is important to estimate the number of patients required to accrue the target number of events. In the Mediana package, this can be accomplished by specifying a descriptive statistic named PatientCountStat (this statistic needs to be specified in a Statistic object). Another descriptive statistic of interest is the event count in each sample; to compute this statistic, EventCountStat needs to be included in a Statistic object.
In order to compute the average values of the two statistics ( PatientCountStat and EventCountStat ) in each sample over the simulation runs, two Criterion objects need to be specified, in addition to the Criterion object defined to obtain marginal power. The IDs of the corresponding Statistic objects will be included in the statistics argument of the two Criterion objects:
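A sketch of these additions, assuming the analysis and evaluation model objects from the earlier steps (statistic and criterion IDs are illustrative):

```r
# Descriptive statistics: patient and event counts in each sample
case.study1.analysis.model = case.study1.analysis.model +
  Statistic(id = "Patients in placebo", samples = samples("Placebo"),
            method = "PatientCountStat") +
  Statistic(id = "Patients in treatment", samples = samples("Treatment"),
            method = "PatientCountStat") +
  Statistic(id = "Events in placebo", samples = samples("Placebo"),
            method = "EventCountStat") +
  Statistic(id = "Events in treatment", samples = samples("Treatment"),
            method = "EventCountStat")

# Criteria averaging the two statistics over the simulation runs
case.study1.evaluation.model = case.study1.evaluation.model +
  Criterion(id = "Mean patient count",
            method = "MeanSumm",
            statistics = statistics("Patients in placebo",
                                    "Patients in treatment"),
            labels = c("Mean patients (placebo)",
                       "Mean patients (treatment)")) +
  Criterion(id = "Mean event count",
            method = "MeanSumm",
            statistics = statistics("Events in placebo",
                                    "Events in treatment"),
            labels = c("Mean events (placebo)",
                       "Mean events (treatment)"))
```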
The last clinical trial example within Case study 1 deals with a Phase III clinical trial in patients with relapsing-remitting multiple sclerosis (RRMS). The trial aims at assessing the safety and efficacy of a single dose of a novel treatment compared to placebo. The primary endpoint is the number of new gadolinium enhancing lesions seen during a 6-month period on monthly MRIs of the brain and a smaller number indicates treatment benefit. The distribution of such endpoints has been widely studied in the literature and Sormani et al. ( 1999a , 1999b ) showed that a negative binomial distribution provides a fairly good fit.
The list below gives the expected treatment effect in the experimental treatment and placebo arms (note that the negative binomial distribution is parameterized using the mean rather than the probability of success in each trial). The mean number of new lesions is set to 13 in the placebo arm and 7.8 in the treatment arm, with a common dispersion parameter of 0.5.
The corresponding treatment effect, i.e., the relative reduction in the mean number of new lesions counts, is 100 * (13 − 7.8)/13 = 40%. The assumptions in the table define a single outcome parameter set.
The OutcomeDist object defines the distribution of the trial endpoint ( NegBinomDist ). Further, a balanced design is utilized in this clinical trial and the range of sample sizes is defined in the SampleSize object (it is convenient to do this using the seq function). The Sample object includes the parameters required by the negative binomial distribution (dispersion and mean).
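A sketch of this data model; the sample size range passed to `seq` is illustrative:

```r
# Negative binomial outcome parameters (common dispersion 0.5)
outcome.placebo = parameters(dispersion = 0.5, mean = 13)
outcome.treatment = parameters(dispersion = 0.5, mean = 7.8)

case.study1.rrms.data.model = DataModel() +
  OutcomeDist(outcome.dist = "NegBinomDist") +
  SampleSize(seq(100, 150, 25)) +
  Sample(id = "Placebo",
         outcome.par = parameters(outcome.placebo)) +
  Sample(id = "Treatment",
         outcome.par = parameters(outcome.treatment))
```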
The treatment effect will be assessed in this clinical trial example using a negative binomial generalized linear model (NBGLM). In the Mediana package, the corresponding test is carried out using the GLMNegBinomTest method, which is specified in the Test object. It should be noted that, as a smaller value indicates a treatment benefit, the first sample defined in the samples argument must be Treatment .
Alternatively, from version 1.0.6, the lower argument can be specified in the parameters of the method. If set to FALSE , a numerically lower value is expected in the second sample.
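A sketch of the two equivalent specifications described above:

```r
# Treatment listed first: a numerically lower lesion count indicates benefit
case.study1.rrms.analysis.model = AnalysisModel() +
  Test(id = "Placebo vs treatment",
       samples = samples("Treatment", "Placebo"),
       method = "GLMNegBinomTest")

# Equivalent specification from version 1.0.6 using the lower argument
case.study1.rrms.analysis.model = AnalysisModel() +
  Test(id = "Placebo vs treatment",
       samples = samples("Placebo", "Treatment"),
       method = "GLMNegBinomTest",
       par = parameters(lower = FALSE))
```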
The objective of this clinical trial is identical to that of the clinical trials presented earlier on this page, i.e., evaluation will be based on marginal power of the primary endpoint test. As a consequence, the same evaluation model can be applied.
This clinical trial example deals with settings where no analytical methods are available to support power calculations. However, as demonstrated below, simulation-based approaches are easily applied to perform a comprehensive assessment of the relevant operating characteristics within the clinical scenario evaluation framework.
Case study 2 is based on a clinical trial example introduced in Dmitrienko and D’Agostino (2013, Section 10). This example deals with a Phase III clinical trial in a schizophrenia population. Three doses of a new treatment, labelled Dose L, Dose M and Dose H, will be tested versus placebo. The trial will be declared successful if a beneficial treatment effect is demonstrated in any of the three dosing groups compared to the placebo group.
The primary endpoint is defined as the reduction in the Positive and Negative Syndrome Scale (PANSS) total score compared to baseline and a larger reduction in the PANSS total score indicates treatment benefit. This endpoint is normally distributed and the treatment effect assumptions in the four treatment arms are displayed in the next table.
The treatment effect assumptions presented in the table above define a single outcome parameter set and the common sample size is set to 260 patients. These parameters are specified in the following data model:
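A sketch of this data model; the mean and SD values below are placeholders for the assumptions displayed in the table:

```r
# Placeholder PANSS treatment effect assumptions (see table above)
outcome.placebo = parameters(mean = 16, sd = 18)
outcome.dosel   = parameters(mean = 19.5, sd = 18)
outcome.dosem   = parameters(mean = 21, sd = 18)
outcome.doseh   = parameters(mean = 21, sd = 18)

case.study2.data.model = DataModel() +
  OutcomeDist(outcome.dist = "NormalDist") +
  SampleSize(260) +
  Sample(id = "Placebo", outcome.par = parameters(outcome.placebo)) +
  Sample(id = "Dose L", outcome.par = parameters(outcome.dosel)) +
  Sample(id = "Dose M", outcome.par = parameters(outcome.dosem)) +
  Sample(id = "Dose H", outcome.par = parameters(outcome.doseh))
```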
The analysis model, shown below, defines the three individual tests that will be carried out in the schizophrenia clinical trial. Each test corresponds to a dose-placebo comparison such as:
H1: Null hypothesis of no difference between Dose L and placebo.
H2: Null hypothesis of no difference between Dose M and placebo.
H3: Null hypothesis of no difference between Dose H and placebo.
Each comparison will be carried out based on a one-sided two-sample t -test ( TTest method defined in each Test object).
As indicated earlier, the overall success criterion in the trial is formulated in terms of demonstrating a beneficial effect at any of the three doses. Due to multiple opportunities to claim success, the overall Type I error rate will be inflated and the Hochberg procedure is introduced to protect the error rate at the nominal level.
Since no procedure parameters are defined, the three significance tests (or, equivalently, three null hypotheses of no effect) are assumed to be equally weighted. The corresponding analysis model is defined below:
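A sketch of this analysis model (test IDs are illustrative; with no par argument, the Hochberg procedure treats the hypotheses as equally weighted):

```r
case.study2.analysis.model = AnalysisModel() +
  MultAdjProc(proc = "HochbergAdj") +
  Test(id = "Placebo vs Dose L",
       samples = samples("Placebo", "Dose L"), method = "TTest") +
  Test(id = "Placebo vs Dose M",
       samples = samples("Placebo", "Dose M"), method = "TTest") +
  Test(id = "Placebo vs Dose H",
       samples = samples("Placebo", "Dose H"), method = "TTest")
```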
To request the Hochberg procedure with unequally weighted hypotheses, the user needs to assign a list of hypothesis weights to the par argument of the MultAdjProc object. The weights typically reflect the relative importance of the individual null hypotheses. Assume, for example, that 60% of the overall weight is assigned to H3 and the remainder is split between H1 and H2. In this case, the MultAdjProc object should be defined as follows:
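A sketch of this weighted specification:

```r
# 60% of the weight on H3 (Dose H), remainder split between H1 and H2
MultAdjProc(proc = "HochbergAdj",
            par = parameters(weight = c(0.2, 0.2, 0.6)))
```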
It should be noted that the order of the weights must be identical to the order of the Test objects defined in the analysis model.
An evaluation model specifies clinically relevant criteria for assessing the performance of the individual tests defined in the corresponding analysis model or composite measures of success. In virtually any setting, it is of interest to compute the probability of achieving a significant outcome in each individual test, e.g., the probability of a significant difference between placebo and each dose. This is accomplished by requesting a Criterion object with method = "MarginalPower" .
Since the trial will be declared successful if at least one dose-placebo comparison is significant, it is natural to compute the overall success probability, which is defined as the probability of demonstrating treatment benefit in one or more dosing groups. This is equivalent to evaluating disjunctive power in the trial ( method = "DisjunctivePower" ).
In addition, the user can easily define a custom evaluation criterion. Suppose that, based on the results of the previously conducted trials, the sponsor expects a much larger treatment difference at Dose H compared to Doses L and M. Given this, the sponsor may be interested in evaluating the probability of observing a significant treatment effect at Dose H and at least one other dose. The associated evaluation criterion is implemented in the following function:
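A sketch of this function, following the argument structure and p-value comparisons described below:

```r
# Custom criterion: significant effect at Dose H (third test) and at
# least one of Dose L or Dose M (first or second test)
case.study2.criterion = function(test.result, statistic.result, parameter) {
  alpha = parameter
  significant = ((test.result[, 3] <= alpha) &
                 ((test.result[, 1] <= alpha) | (test.result[, 2] <= alpha)))
  power = mean(significant)
  return(power)
}
```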
The function’s first argument ( test.result ) is a matrix of p-values produced by the Test objects defined in the analysis model, and the second argument ( statistic.result ) is a matrix of results produced by the Statistic objects defined in the analysis model. In this example, the criterion only uses the test.result argument, which contains the p-values produced by the tests associated with the three dose-placebo comparisons. The last argument ( parameter ) receives the optional parameter(s) defined by the user in the par argument of the Criterion object; in this example, it contains the overall alpha level.
The case.study2.criterion function computes the probability of a significant treatment effect at Dose H ( test.result[,3] <= alpha ) and a significant treatment difference at Dose L or Dose M ( (test.result[,1] <= alpha) | (test.result[,2]<= alpha) ). Since this criterion assumes that the third test is based on the comparison of Dose H versus Placebo, the order in which the tests are included in the evaluation model is important.
The following evaluation model specifies marginal and disjunctive power as well as the custom evaluation criterion defined above:
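A sketch of this evaluation model, assuming the test IDs used above and the custom function `case.study2.criterion` (criterion IDs and labels are illustrative):

```r
case.study2.tests = tests("Placebo vs Dose L",
                          "Placebo vs Dose M",
                          "Placebo vs Dose H")

case.study2.evaluation.model = EvaluationModel() +
  Criterion(id = "Marginal power",
            method = "MarginalPower",
            tests = case.study2.tests,
            labels = c("Placebo vs Dose L",
                       "Placebo vs Dose M",
                       "Placebo vs Dose H"),
            par = parameters(alpha = 0.025)) +
  Criterion(id = "Disjunctive power",
            method = "DisjunctivePower",
            tests = case.study2.tests,
            labels = "Disjunctive power",
            par = parameters(alpha = 0.025)) +
  Criterion(id = "Dose H and at least one other dose",
            method = "case.study2.criterion",
            tests = case.study2.tests,
            labels = "Custom criterion",
            par = parameters(alpha = 0.025))
```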
Another potential option is to apply the conjunctive criterion which is met if a significant treatment difference is detected simultaneously in all three dosing groups ( method = "ConjunctivePower" ). This criterion helps characterize the likelihood of a consistent treatment effect across the doses.
The user can also use the metric.tests parameter to choose the specific tests to which the disjunctive and conjunctive criteria are applied (the resulting criteria are known as subset disjunctive and conjunctive criteria). To illustrate, the following statement computes the probability of a significant treatment effect at Dose M or Dose H (Dose L is excluded from this calculation):
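A sketch of this subset disjunctive criterion, assuming the subset is specified by listing only the two relevant test IDs:

```r
# Subset disjunctive power: Dose M or Dose H only (Dose L excluded)
Criterion(id = "Subset disjunctive power",
          method = "DisjunctivePower",
          tests = tests("Placebo vs Dose M", "Placebo vs Dose H"),
          labels = "Subset disjunctive power",
          par = parameters(alpha = 0.025))
```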
This case study deals with a Phase III clinical trial in patients with mild or moderate asthma (it is based on a clinical trial example from Millen et al., 2014, Section 2.2 ). The trial is intended to support a tailoring strategy. In particular, the treatment effect of a single dose of a new treatment will be compared to that of placebo in the overall population of patients as well as in a pre-specified subpopulation of patients with a marker-positive status at baseline (for compactness, the overall population is denoted by OP, the marker-positive subpopulation by M+ and the marker-negative subpopulation by M−).
Marker-positive patients are more likely to receive benefit from the experimental treatment. The overall objective of the clinical trial accounts for the fact that the treatment’s effect may, in fact, be limited to the marker-positive subpopulation. The trial will be declared successful if the treatment’s beneficial effect is established in the overall population of patients or, alternatively, the effect is established only in the subpopulation. The primary endpoint in the clinical trial is defined as an increase from baseline in the forced expiratory volume in one second (FEV1). This endpoint is normally distributed and improvement is associated with a larger change in FEV1.
To set up a data model for this clinical trial, it is natural to define samples (mutually exclusive groups of patients) as follows:
Sample 1: Marker-negative patients in the placebo arm.
Sample 2: Marker-positive patients in the placebo arm.
Sample 3: Marker-negative patients in the treatment arm.
Sample 4: Marker-positive patients in the treatment arm.
Using this definition of samples, the trial’s sponsor can model the fact that the treatment’s effect is most pronounced in patients with a marker-positive status.
The treatment effect assumptions in the four samples are summarized in the next table (expiratory volume in FEV1 is measured in liters). As shown in the table, the mean change in FEV1 is constant across the marker-negative and marker-positive subpopulations in the placebo arm (Samples 1 and 2). A positive treatment effect is expected in both subpopulations in the treatment arm but marker-positive patients will experience most of the beneficial effect (Sample 4).
The following data model incorporates the assumptions listed above by defining a single set of outcome parameters. The data model includes three sample size sets (the total sample size is set to 330, 340 and 350 patients). The sizes of the individual samples are computed from historical information (40% of patients in the population of interest are expected to have a marker-positive status) and, since each sample has its own size, they are specified within each Sample object.
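A sketch of this data model. The sample sizes follow from the stated totals (330/340/350), 1:1 randomization and 40% marker-positive prevalence; the mean and SD values are placeholders for the table entries:

```r
# Placeholder FEV1 outcome parameters (see table above)
outcome.placebo.minus   = parameters(mean = 0.12, sd = 0.45)
outcome.placebo.plus    = parameters(mean = 0.12, sd = 0.45)
outcome.treatment.minus = parameters(mean = 0.24, sd = 0.45)
outcome.treatment.plus  = parameters(mean = 0.30, sd = 0.45)

# Per-sample sizes for totals of 330, 340 and 350 patients
case.study3.data.model = DataModel() +
  OutcomeDist(outcome.dist = "NormalDist") +
  Sample(id = "Placebo M-", sample.size = c(99, 102, 105),
         outcome.par = parameters(outcome.placebo.minus)) +
  Sample(id = "Placebo M+", sample.size = c(66, 68, 70),
         outcome.par = parameters(outcome.placebo.plus)) +
  Sample(id = "Treatment M-", sample.size = c(99, 102, 105),
         outcome.par = parameters(outcome.treatment.minus)) +
  Sample(id = "Treatment M+", sample.size = c(66, 68, 70),
         outcome.par = parameters(outcome.treatment.plus))
```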
The analysis model in this clinical trial example is generally similar to that used in Case study 2 but there is an important difference which is described below.
As in Case study 2 , the primary endpoint follows a normal distribution and thus the treatment effect will be assessed using the two-sample t -test.
Since two null hypotheses are tested in this trial (null hypotheses of no effect in the overall population of patients and subpopulation of marker-positive patients), a multiplicity adjustment needs to be applied. The Hochberg procedure with equally weighted null hypotheses will be used for this purpose.
A key feature of the analysis strategy in this case study is that the samples defined in the data model differ from the samples used in the analysis of the primary endpoint. As shown in the table, four samples are included in the data model. However, from the analysis perspective, the sponsor is interested in examining the treatment effect in two populations, namely, the overall population and the marker-positive subpopulation. As shown below, to perform a comparison in the overall population, the t -test is applied to the following analysis samples:
Placebo arm: Samples 1 and 2 ( Placebo M- and Placebo M+ ) are merged.
Treatment arm: Samples 3 and 4 ( Treatment M- and Treatment M+ ) are merged.
Further, the treatment effect test in the subpopulation of marker-positive patients is carried out based on these analysis samples:
Placebo arm: Sample 2 ( Placebo M+ ).
Treatment arm: Sample 4 ( Treatment M+ ).
These analysis samples are specified in the analysis model below. The samples defined in the data model are merged using the c() or list() function, e.g., c("Placebo M-", "Placebo M+") defines the placebo arm and c("Treatment M-", "Treatment M+") defines the experimental treatment arm in the overall population test.
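A sketch of this analysis model, merging the data model samples as described (test IDs are illustrative):

```r
case.study3.analysis.model = AnalysisModel() +
  MultAdjProc(proc = "HochbergAdj") +
  Test(id = "OP test",
       samples = samples(c("Placebo M-", "Placebo M+"),
                         c("Treatment M-", "Treatment M+")),
       method = "TTest") +
  Test(id = "M+ test",
       samples = samples("Placebo M+", "Treatment M+"),
       method = "TTest")
```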
It is reasonable to consider the following success criteria in this case study:
Marginal power: Probability of a significant outcome in each patient population.
Disjunctive power: Probability of a significant treatment effect in the overall population (OP) or marker-positive subpopulation (M+). This metric defines the overall probability of success in this clinical trial.
Conjunctive power: Probability of simultaneously achieving significance in the overall population and marker-positive subpopulation. This criterion will be useful if the trial’s sponsor is interested in pursuing an enhanced efficacy claim ( Millen et al., 2012 ).
The following evaluation model applies the three criteria to the two tests listed in the analysis model:
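A sketch of this evaluation model, assuming the test IDs `OP test` and `M+ test`:

```r
case.study3.tests = tests("OP test", "M+ test")

case.study3.evaluation.model = EvaluationModel() +
  Criterion(id = "Marginal power",
            method = "MarginalPower",
            tests = case.study3.tests,
            labels = c("OP test", "M+ test"),
            par = parameters(alpha = 0.025)) +
  Criterion(id = "Disjunctive power",
            method = "DisjunctivePower",
            tests = case.study3.tests,
            labels = "Disjunctive power",
            par = parameters(alpha = 0.025)) +
  Criterion(id = "Conjunctive power",
            method = "ConjunctivePower",
            tests = case.study3.tests,
            labels = "Conjunctive power",
            par = parameters(alpha = 0.025))
```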
Case study 4 serves as an extension of the oncology clinical trial example presented in Case study 1 . Consider again a Phase III trial in patients with metastatic colorectal cancer (MCC). The same general design will be assumed in this section; however, an additional endpoint (overall survival) will be introduced. The case of two endpoints helps showcase the package’s ability to model complex design and analysis strategies in trials with multivariate outcomes.
Progression-free survival (PFS) is the primary endpoint in this clinical trial and overall survival (OS) serves as the key secondary endpoint, which provides supportive evidence of treatment efficacy. A hierarchical testing approach will be utilized in the analysis of the two endpoints. The PFS analysis will be performed first at α = 0.025 (one-sided), followed by the OS analysis at the same level if a significant effect on PFS is established. The resulting testing procedure is equivalent to the fixed-sequence procedure and controls the overall Type I error rate ( Dmitrienko and D’Agostino, 2013 ).
The treatment effect assumptions that will be used in clinical scenario evaluation are listed in the table below. The table shows the hypothesized median times along with the corresponding hazard rates for the primary and secondary endpoints. It follows from the table that the expected effect size is much larger for PFS compared to OS (PFS hazard ratio is lower than OS hazard ratio).
In this clinical trial, two endpoints are evaluated for each patient (PFS and OS) and thus their joint distribution needs to be specified in the data model.
A bivariate exponential distribution will be used in this example and samples from this bivariate distribution will be generated by the MVExpoPFSOSDist function which implements multivariate exponential distributions. The function utilizes the copula method, i.e., random variables that follow a bivariate normal distribution will be generated and then converted into exponential random variables.
The next several statements specify the parameters of the bivariate exponential distribution:
Parameters of the marginal exponential distributions, i.e., the hazard rates.
Correlation matrix of the underlying multivariate normal distribution used in the copula method.
The hazard rates for PFS and OS in each treatment arm are defined based on the information presented in the table above ( placebo.par and treatment.par ) and the correlation matrix is specified based on historical information ( corr.matrix ). These parameters are combined to define the outcome parameter sets ( outcome.placebo and outcome.treatment ) that will be included in the sample-specific set of data model parameters ( Sample object).
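A sketch of these parameter definitions. The PFS medians (6 and 9 months) come from the earlier MCC example; the OS medians and the correlation value are placeholders for the table and historical values:

```r
# Hazard rates from the hypothesized median times (exponential assumption)
rate.pfs.placebo   = log(2) / 6
rate.pfs.treatment = log(2) / 9
rate.os.placebo    = log(2) / 15   # placeholder OS medians
rate.os.treatment  = log(2) / 19

# Correlation matrix of the underlying bivariate normal (copula method)
corr.matrix = matrix(c(1.0, 0.3,
                       0.3, 1.0), 2, 2)

# Outcome parameter sets for the two treatment arms
placebo.par = parameters(parameters(rate = rate.pfs.placebo),
                         parameters(rate = rate.os.placebo))
treatment.par = parameters(parameters(rate = rate.pfs.treatment),
                           parameters(rate = rate.os.treatment))
outcome.placebo = parameters(par = placebo.par, corr = corr.matrix)
outcome.treatment = parameters(par = treatment.par, corr = corr.matrix)
```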
To define the sample-specific data model parameters, note first that a 2:1 randomization ratio will be used in this clinical trial, and thus the number of events as well as the randomization ratio are specified by the user in the Event object. Secondly, a separate sample ID needs to be assigned to each endpoint within the two samples (e.g., Placebo PFS and Placebo OS ) corresponding to the two treatment arms. This enables the user to construct analysis models for examining the treatment effect on each endpoint.
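A sketch of this data model, assuming the outcome parameter sets `outcome.placebo` and `outcome.treatment` defined from the table (the event count is illustrative):

```r
case.study4.data.model = DataModel() +
  OutcomeDist(outcome.dist = "MVExpoPFSOSDist") +
  Event(n.events = 300, rando.ratio = c(1, 2)) +
  Sample(id = list("Placebo PFS", "Placebo OS"),
         outcome.par = parameters(outcome.placebo)) +
  Sample(id = list("Treatment PFS", "Treatment OS"),
         outcome.par = parameters(outcome.treatment))
```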
The treatment comparisons for both endpoints will be carried out based on the log-rank test ( method = "LogrankTest" ). Further, as was stated in the beginning of this page, the two endpoints will be tested hierarchically using a multiplicity adjustment procedure known as the fixed-sequence procedure. This procedure belongs to the class of chain procedures ( proc = "ChainAdj" ) and the following figure provides a visual summary of the decision rules used in this procedure.
The circles in this figure denote the two null hypotheses of interest:
H1: Null hypothesis of no difference between the two arms with respect to PFS.
H2: Null hypothesis of no difference between the two arms with respect to OS.
The value displayed above a circle defines the initial weight of each null hypothesis. All of the overall α is allocated to H1 to ensure that the OS test will be carried out only after the PFS test is significant and the arrow indicates that H2 will be tested after H1 is rejected.
More formally, a chain procedure is uniquely defined by a vector of hypothesis weights (W) and a matrix of transition parameters (G). Based on the figure, these parameters are W = (1, 0) and G = ((0, 1), (0, 0)).
Two objects (named chain.weight and chain.transition ) are defined below to pass the hypothesis weights and transition parameters to the multiplicity adjustment parameters.
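A sketch of this analysis model with the chain procedure (test IDs are illustrative):

```r
# W = (1, 0): all alpha on H1 (PFS); G passes it to H2 (OS) after rejection
chain.weight = c(1, 0)
chain.transition = matrix(c(0, 1,
                            0, 0), 2, 2, byrow = TRUE)

case.study4.analysis.model = AnalysisModel() +
  MultAdjProc(proc = "ChainAdj",
              par = parameters(weight = chain.weight,
                               transition = chain.transition)) +
  Test(id = "PFS test",
       samples = samples("Placebo PFS", "Treatment PFS"),
       method = "LogrankTest") +
  Test(id = "OS test",
       samples = samples("Placebo OS", "Treatment OS"),
       method = "LogrankTest")
```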
As shown above, the two significance tests included in the analysis model reflect the two-fold objective of this trial. The first test focuses on a PFS comparison between the two treatment arms ( id = "PFS test" ) whereas the other test is carried out to assess the treatment effect on OS ( id = "OS test" ).
Alternatively, the fixed-sequence procedure can be implemented using the FixedSeqAdj method, introduced in version 1.0.4. This implementation is simpler, as no parameters need to be specified.
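A sketch of this simpler specification; with FixedSeqAdj , the hypotheses are tested in the order in which the Test objects are added to the analysis model:

```r
case.study4.analysis.model = AnalysisModel() +
  MultAdjProc(proc = "FixedSeqAdj") +
  Test(id = "PFS test",
       samples = samples("Placebo PFS", "Treatment PFS"),
       method = "LogrankTest") +
  Test(id = "OS test",
       samples = samples("Placebo OS", "Treatment OS"),
       method = "LogrankTest")
```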
The evaluation model specifies the most basic criterion for assessing the probability of success in the PFS and OS analyses (marginal power). A criterion based on disjunctive power could be considered but it would not provide additional information.
Due to the hierarchical testing approach, the probability of detecting a significant treatment effect on at least one endpoint (disjunctive power) is simply equal to the probability of establishing a significant PFS effect.
This case study extends the straightforward setting presented in Case study 1 to a more complex setting involving two trial endpoints and three treatment arms. Case study 5 illustrates the process of performing power calculations in clinical trials with multiple, hierarchically structured objectives and “multivariate” multiplicity adjustment strategies (gatekeeping procedures).
Consider a three-arm Phase III clinical trial for the treatment of rheumatoid arthritis (RA). Two co-primary endpoints will be used to evaluate the effect of a novel treatment on clinical response and on physical function. The endpoints are defined as follows:
Endpoint 1: Response rate based on the American College of Rheumatology definition of improvement (ACR20).
Endpoint 2: Change from baseline in the Health Assessment Questionnaire-Disability Index (HAQ-DI).
The two endpoints have different marginal distributions. The first endpoint is binary whereas the second one is continuous and follows a normal distribution.
The efficacy profile of two doses of a new treatment (Dose L and Dose H) will be compared to that of a placebo, and a successful outcome will be defined as a significant treatment effect at either or both doses. A hierarchical structure has been established within each dose so that Endpoint 2 will be tested if and only if there is evidence of a significant effect on Endpoint 1.
Three treatment effect scenarios for each endpoint are displayed in the table below. The scenarios define three outcome parameter sets. The first set represents a rather conservative treatment effect scenario, the second set is a standard (most plausible) scenario and the third set represents an optimistic scenario. Note that a reduction in the HAQ-DI score indicates a beneficial effect and thus the mean changes are assumed to be negative for Endpoint 2.
As in Case study 4 , two endpoints are evaluated for each patient in this clinical trial example, which means that their joint distribution needs to be specified. The MVMixedDist method will be utilized for specifying a bivariate distribution with binomial and normal marginals ( var.type = list("BinomDist", "NormalDist") ). In general, this function is used for modeling correlated normal, binomial and exponential endpoints and relies on the copula method, i.e., random variables are generated from a multivariate normal distribution and converted into variables with pre-specified marginal distributions.
Three parameters must be defined to specify the joint distribution of Endpoints 1 and 2 in this clinical trial example:
Variable types (binomial and normal).
Outcome distribution parameters (proportion for Endpoint 1, mean and SD for Endpoint 2) based on the assumptions listed in the Table above.
Correlation matrix of the multivariate normal distribution used in the copula method.
These parameters are combined to define three outcome parameter sets (e.g., outcome1.plac , outcome1.dosel and outcome1.doseh ) that will be included in the Sample object in the data model.
These outcome parameter sets are then combined within each Sample object, and the common sample size per treatment arm ranges between 100 and 120:
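A sketch of this data model for one of the three scenarios; the proportions, means, SDs and correlation are placeholders for the table entries, and the other two outcome parameter sets would be built the same way and added as further elements of each Sample object's outcome.par list:

```r
# Placeholder outcome parameter set (standard scenario)
var.type = list("BinomDist", "NormalDist")
corr.matrix = matrix(c(1.0, 0.5,
                       0.5, 1.0), 2, 2)
plac.par  = parameters(parameters(prop = 0.30),
                       parameters(mean = -0.10, sd = 0.50))
dosel.par = parameters(parameters(prop = 0.50),
                       parameters(mean = -0.20, sd = 0.50))
doseh.par = parameters(parameters(prop = 0.55),
                       parameters(mean = -0.25, sd = 0.50))
outcome2.plac  = parameters(type = var.type, par = plac.par,  corr = corr.matrix)
outcome2.dosel = parameters(type = var.type, par = dosel.par, corr = corr.matrix)
outcome2.doseh = parameters(type = var.type, par = doseh.par, corr = corr.matrix)

case.study5.data.model = DataModel() +
  OutcomeDist(outcome.dist = "MVMixedDist") +
  SampleSize(seq(100, 120, 10)) +
  Sample(id = list("Placebo ACR20", "Placebo HAQ-DI"),
         outcome.par = parameters(outcome2.plac)) +
  Sample(id = list("DoseL ACR20", "DoseL HAQ-DI"),
         outcome.par = parameters(outcome2.dosel)) +
  Sample(id = list("DoseH ACR20", "DoseH HAQ-DI"),
         outcome.par = parameters(outcome2.doseh))
```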
To set up the analysis model in this clinical trial example, note that the treatment comparisons for Endpoints 1 and 2 will be carried out based on two different statistical tests:
Endpoint 1: Two-sample test for comparing proportions ( method = "PropTest" ).
Endpoint 2: Two-sample t-test ( method = "TTest" ).
It was pointed out earlier in this page that the two endpoints will be tested hierarchically within each dose. The figure below provides a visual summary of the testing strategy used in this clinical trial. The circles in this figure denote the four null hypotheses of interest:
H1: Null hypothesis of no difference between Dose L and placebo with respect to Endpoint 1.
H2: Null hypothesis of no difference between Dose H and placebo with respect to Endpoint 1.
H3: Null hypothesis of no difference between Dose L and placebo with respect to Endpoint 2.
H4: Null hypothesis of no difference between Dose H and placebo with respect to Endpoint 2.
A multiple testing procedure known as the multiple-sequence gatekeeping procedure will be applied to account for the hierarchical structure of this multiplicity problem. This procedure belongs to the class of mixture-based gatekeeping procedures introduced in Dmitrienko et al. (2015) . This gatekeeping procedure is specified by defining the following three parameters:
Families of null hypotheses ( family ).
Component procedures used in the families ( component.procedure ).
Truncation parameters used in the families ( gamma ).
These parameters are included in the MultAdjProc object defined below. The tests to which the multiplicity adjustment will be applied are defined in the tests argument; this argument is optional if the adjustment is to be applied to all tests included in the analysis model. The family argument states that the null hypotheses will be grouped into two families:
Family 1: H1 and H2.
Family 2: H3 and H4.
Note that this order corresponds to the order of the Test objects defined in the analysis model, unless the tests are explicitly listed in the tests argument of the MultAdjProc object.
The families will be tested sequentially and a truncated Holm procedure will be applied within each family ( component.procedure ). Lastly, the truncation parameter will be set to 0.8 in Family 1 and to 1 in Family 2 ( gamma ). The resulting parameters are included in the par argument of the MultAdjProc object and, as before, the proc argument is used to specify the multiple testing procedure ( MultipleSequenceGatekeepingAdj ).
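A sketch of this MultAdjProc object, assuming the par elements are named family , proc and gamma and that the test IDs follow the naming used in the analysis model:

```r
# Two families tested sequentially; truncated Holm within each family
family = families(family1 = c(1, 2), family2 = c(3, 4))
component.procedure = families(family1 = "HolmAdj", family2 = "HolmAdj")
gamma = families(family1 = 0.8, family2 = 1)

case.study5.mult.adj =
  MultAdjProc(proc = "MultipleSequenceGatekeepingAdj",
              par = parameters(family = family,
                               proc = component.procedure,
                               gamma = gamma))
```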
The tests are then specified in the analysis model, and the overall analysis model is defined as follows:
Recall that a numerically lower value indicates a beneficial effect for the HAQ-DI score and, as a result, the experimental treatment arm must be defined prior to the placebo arm in the samples argument of the corresponding HAQ-DI tests, e.g., samples = samples("DoseL HAQ-DI", "Placebo HAQ-DI") .
In order to assess the probability of success in this clinical trial, a hybrid criterion based on the conjunctive criterion (both trial endpoints must be significant) and disjunctive criterion (at least one dose-placebo comparison must be significant) can be considered.
This criterion will be met if a significant effect is established at one or two doses on Endpoint 1 (ACR20) and also at one or two doses on Endpoint 2 (HAQ-DI). However, due to the hierarchical structure of the testing strategy (see Figure), this is equivalent to demonstrating a significant difference between Placebo and at least one dose with respect to Endpoint 2. The corresponding criterion is a subset disjunctive criterion based on the two Endpoint 2 tests (subset disjunctive power was briefly mentioned in Case study 2 ).
In addition, the sponsor may also be interested in evaluating marginal power as well as subset disjunctive power based on the Endpoint 1 tests. The latter criterion will be met if a significant difference between Placebo and at least one dose is established with respect to Endpoint 1. Additionally, as in Case study 2 , the user could consider defining custom evaluation criteria. The three resulting evaluation criteria (marginal power, subset disjunctive criterion based on the Endpoint 1 tests and subset disjunctive criterion based on the Endpoint 2 tests) are included in the following evaluation model.
Case study 6 is an extension of Case study 2 where the objective of the sponsor is to compare several Multiple Testing Procedures (MTPs). The main difference is in the specification of the analysis model.
The same data model as in Case study 2 will be used in this case study. However, as shown in the table below, a new set of outcome parameters will be added in this case study (an optimistic set of parameters).
As in Case study 2 , each dose-placebo comparison will be performed using a one-sided two-sample t -test ( TTest method defined in each Test object). The same nomenclature will be used to define the hypotheses, i.e.:
In this case study, as in Case study 2 , the overall success criterion is formulated in terms of demonstrating a beneficial effect at any of the three doses, which inflates the overall Type I error rate. Here, the sponsor is interested in comparing several multiple testing procedures, such as the weighted Bonferroni, Holm and Hochberg procedures. These MTPs are defined below:
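A sketch of these four specifications; setting proc = NA is assumed to request no adjustment:

```r
# Dose H test receives weight 1/2; Doses L and M receive 1/4 each
mtp.weight = c(1/4, 1/4, 1/2)

mult.adj1 = MultAdjProc(proc = NA)  # no adjustment, for reference
mult.adj2 = MultAdjProc(proc = "BonferroniAdj",
                        par = parameters(weight = mtp.weight))
mult.adj3 = MultAdjProc(proc = "HolmAdj",
                        par = parameters(weight = mtp.weight))
mult.adj4 = MultAdjProc(proc = "HochbergAdj",
                        par = parameters(weight = mtp.weight))
```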
The mult.adj1 object, which specifies that no adjustment will be used, is defined in order to observe the decrease in power induced by each MTP.
It should be noted that for each weighted procedure, a higher weight is assigned to the test of Placebo vs Dose H (1/2), and the remaining weight is equally assigned to the two other tests (i.e. 1/4 for each test). These parameters are specified in the par argument of each MTP.
The analysis model is defined as follows:
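A sketch of this analysis model, assuming the four MultAdjProc objects ( mult.adj1 through mult.adj4 ) defined above:

```r
case.study6.analysis.model = AnalysisModel() +
  MultAdj(mult.adj1, mult.adj2, mult.adj3, mult.adj4) +
  Test(id = "Placebo vs Dose L",
       samples = samples("Placebo", "Dose L"), method = "TTest") +
  Test(id = "Placebo vs Dose M",
       samples = samples("Placebo", "Dose M"), method = "TTest") +
  Test(id = "Placebo vs Dose H",
       samples = samples("Placebo", "Dose H"), method = "TTest")
```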
For the sake of compactness, all MTPs are combined using a MultAdj object, but it is worth mentioning that each MTP could have been directly added to the AnalysisModel object using the + operator.
As with the data model, the same evaluation model as in Case study 2 will be used in this case study. Refer to Case study 2 for more information.
The last Criterion object specifies the custom criterion which computes the probability of a significant treatment effect at Dose H and a significant treatment difference at Dose L or Dose M.
Using the data, analysis and evaluation models, simulation-based Clinical Scenario Evaluation is performed by calling the CSE function:
Generate a Simulation Report
This case study will also illustrate the process of customizing a Word-based simulation report. This can be accomplished by defining custom sections and subsections to provide a structured summary of the complex set of simulation results.
Create a Customized Simulation Report
Define a presentation model.
Several presentation models will be used to produce customized simulation reports:
A report without subsections.
A report with subsections.
A report with combined sections.
First of all, a default PresentationModel object ( case.study6.presentation.model.default ) will be created. This object will include the common components of the report that are shared across the presentation models. The project information ( Project object), sorting options in summary tables ( Table object) and specification of custom labels ( CustomLabel objects) are included in this object:
Report without subsections
The first simulation report will include a section for each outcome parameter set. To accomplish this, a Section object is added to the default PresentationModel object and the report is generated:
Report with subsections
The second report will include a section for each outcome parameter set and, in addition, a subsection will be created for each multiplicity adjustment procedure. The Section and Subsection objects are added to the default PresentationModel object as shown below and the report is generated:
Report with combined sections
Finally, the third report will include a section for each combination of outcome parameter set and each multiplicity adjustment procedure. This is accomplished by adding a Section object to the default PresentationModel object and specifying the outcome parameter and multiplicity adjustment in the section’s by argument.
CSE report without subsections
CSE report with subsections
CSE report with combined sections
A Data Science Case Study in R
Posted on March 13, 2017 by Robert Grünwald in R bloggers | 0 Comments
Demanding data science projects are becoming more and more relevant, and conventional evaluation procedures are often no longer sufficient. For this reason, there is a growing need for tailor-made solutions that are individually fitted to a project’s goal, and these are often implemented in R. To support our readers in their own R programming, we have carried out an example analysis that demonstrates several possibilities of programming in R.
Data Science Projects
Approaching your first data science project can be a daunting task. Luckily, there are rough step-by-step outlines and heuristics that can help you on your way to becoming a data ninja. In this article, we review some of these guidelines and apply them to an example project in R.
For our analysis and the R programming, we will make use of the following R packages:
Anatomy of a Data Science project
A basic data science project consists of the following six steps:
- State the problem you are trying to solve. It has to be an unambiguous question that can be answered with data and a statistical or machine learning model. At least, specify: What is being observed? What has to be predicted?
- Collect the data, then clean and prepare it. This is commonly the most time-consuming task, but it has to be done in order to fit a prediction model with the data.
- Explore the data. Get to know its properties and quirks. Check numerical summaries of your metric variables, tables of the categorical data, and plot univariate and multivariate representations of your variables. By this, you also get an overview of the quality of the data and can find outliers.
- Check if any variables may need to be transformed. Most commonly, this is a logarithmic transformation of skewed measurements such as concentrations or times. Also, some variables might have to be split up into two or more variables.
- Choose a model and train it on the data. If you have more than one candidate model, apply each and evaluate their goodness-of-fit using independent data that was not used for training the model.
- Use the best model to make your final predictions.
We apply the principles on an example data set that was used in the ASA’s 2009 Data expo . The given data are around 120 million commercial and domestic flights within the USA between 1987 and 2008. Measured variables include departure and arrival airport, airline, and scheduled and actual departure and arrival times.
We will focus on the 2008 subset of this data. Because even this is a 600MB subset, it makes sense to start a first analysis on a random sample of this data to be able to quickly explore and develop your code, and then, periodically verify on the real data set that your results are still valid on the complete data.
The following commands read in our subset data and display the first three observations:
Fortunately, the ASA provides a code book with descriptions of each variable here. For example, we now know that for the variable DayOfWeek, a 1 denotes Monday, a 2 is Tuesday, and so on.
With this data, it is possible to answer many interesting questions. Examples include:
- Do planes with a delayed departure fly with a faster average speed to make up for the delay?
- How does the delay of arriving flights vary during the day? Are planes more delayed on weekends?
- How has the market share of different airlines shifted over these 20 years?
- Are there specific planes that tend to have longer delays? What characterizes them? Maybe the age, or the manufacturer?
Additionally to these concrete questions, the possibilities for explorative, sandbox-style data analysis are nearly limitless.
Here, we will focus on the first two of these questions.
You should always check out the amount of missing values in your data. For this, we write an sapply-loop over each column in the flights data and report the percentage of missing values:
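The original loop is not preserved; a minimal sketch of such an sapply loop, using a tiny stand-in for the flights data frame (the real data is read from the 2008 CSV file):

```r
# Tiny stand-in for the flights data; CarrierDelay illustrates a column
# that is NA whenever the flight was on time
flights <- data.frame(
  ArrDelay     = c(-2, 15, 30, 0),
  CarrierDelay = c(NA, 10, 25, NA)
)

# Percentage of missing values per column
na_share <- sapply(flights, function(x) round(100 * mean(is.na(x)), 1))
na_share
```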
We see that most variables have at most a negligible amount of missing values. However, the last five variables, starting at the CarrierDelay, show almost 80% missing values. This is usually an alarmingly high amount of missing data that would suggest dropping this variable from the analysis altogether, since not even a sophisticated imputing procedure can help here. But, as further inspection shows, these variables only apply for delayed flights, i.e. a positive value in the ArrDelay column.
When selecting only the arrival delay and the five sub-categories of delays, we see that they add up to the total arrival delay. For our analysis here, we are not interested in the delay reason, but view only the total ArrDelay as our outcome of interest.
The pipe operator %>%, by the way, is a nice feature of the magrittr package (also implemented in dplyr) that resembles the UNIX-style pipe. The following two lines mean and do exactly the same thing, but the second version is much easier to read:
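The original two lines were not preserved; the same idea can be sketched with base R's native pipe `|>` (available since R 4.1), which behaves like `%>%` for simple calls such as these:

```r
x <- c(4, 9, 16)

# Nested call: must be read inside-out
result1 <- sqrt(sum(x))

# Piped version: reads left to right
result2 <- x |> sum() |> sqrt()
```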
The pipe operator thus takes the output of the left expression, and makes it the first argument of the right expression.
We have surprisingly clean data where not much has to be done before proceeding to feature engineering.
Our main variables of interest are:
- The date, which conveniently is already split up in the columns Year, Month, and DayOfMonth, and even contains the weekday in DayOfWeek. This is rarely the case; you usually get a single column with a name like date and entries such as "2016-06-24". In that case, the R package lubridate provides helpful functions to efficiently work with and manipulate these dates.
- CRSDepTime, the scheduled departure time. This will indicate the time of day for our analysis of when flights tend to have higher delays.
- ArrDelay, the delay in minutes at arrival. We use this variable (rather than the delay at departure) for the outcome in our first analysis, since the arrival delay is what has the impact on our day.
- For our second question of whether planes with delayed departure fly faster, we need DepDelay, the delay in minutes at departure, as well as a measure of average speed while flying. This variable is not available, but we can compute it from the available variables Distance and AirTime. We will do that in the next section, "Feature Engineering".
Let’s have an exploratory look at all our variables of interest.
Since these are exploratory analyses that you usually won’t show anyone else, spending time on pretty graphics does not make sense here. For quick overviews, I mostly use the standard graphics functions from R, without much decoration in terms of titles, colors, and such.
Since we subsetted the data beforehand, it makes sense that all our flights are from 2008. We also see no big changes between the months. There is a slight drop after August, but the remaining changes can be explained by the number of days in a month.
The day of the month shows no influence on the amount of flights, as expected. The fact that the 31st has around half the flights of all other days is also expected, since only seven of the twelve months have a 31st day.
When plotting flights per weekday, however, we see that Saturday is the most quiet day of the week, with Sunday being the second most relaxed day. Between the remaining weekdays, there is little variation.
A histogram of the departure time shows that the number of flights is relatively constant from 6am to around 8pm and dropping off heavily before and after that.
Arrival and departure delay
Both arrival and departure delay show a very asymmetric, right-skewed distribution. We should keep this in mind and think about a logarithmic transformation or some other method of acknowledging this fact later.
The structure of the third plot of departure vs. arrival delay suggests that flights that start with a delay usually don’t compensate that delay during the flight. The arrival delay is almost always at least as large as the departure delay.
To get a first overview for our question of how the departure time influences the average delay, we can also plot the departure time against the arrival delay:
Aha! Something looks weird here. There seem to be periods of times with no flights at all. To see what is going on here, look at how the departure time is coded in the data:
A departure of 2:55pm is written as an integer 1455. This explains why the values from 1460 to 1499 are impossible. In the feature engineering step, we will have to recode this variable in a meaningful way to be able to model it correctly.
Distance and AirTime
Plotting the distance against the time needed, we see a linear relationship as expected, with one large outlier. This one point denotes a flight of 2762 miles and an air time of 823 minutes, suggesting an average speed of 201 mph. I doubt a passenger plane flies this slowly, so we should maybe remove this observation.
Feature engineering describes the manipulation of your data set to create variables that a learning algorithm can work with. Often, this consists of transforming a variable (through e.g. a logarithm), extracting specific information from a variable (e.g. the year from a date string), or converting something like a ZIP code into a geographic region.
For our data, we have the following tasks:
- Convert the weekday into a factor variable so it doesn’t get interpreted linearly.
- Create a log-transformed version of the arrival and departure delay.
- Transform the departure time so that it can be used in a model.
- Create the average speed from the distance and air time variables.
Converting the weekday into a factor is important because otherwise, it would be interpreted as a metric variable, which would result in a linear effect. We want the weekdays to be categories, however, and so we create a factor with nice labels:
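A sketch of this conversion on a stand-in vector (in the real data this is the flights$DayOfWeek column; the three-letter labels are an assumption):

```r
# Stand-in vector; 1 = Monday, ..., 7 = Sunday per the code book
DayOfWeek <- c(1, 5, 7, 3)

# Factor with readable labels instead of a metric 1-7 scale
weekday <- factor(DayOfWeek, levels = 1:7,
                  labels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))
weekday
```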
log-transform delay times
When looking at the delays, we note that there are a lot of negative values in the data. These denote flights that left or arrived earlier than scheduled. To allow a log-transformation, we set all negative values to zero, which we interpret as "on time":
Now, since there are zeros in these variables, we create the variables log(1+ArrDelay) and log(1+DepDelay):
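A sketch of both steps on a stand-in vector (the variable names mirror the text; log1p(x) is base R's numerically stable log(1 + x)):

```r
ArrDelay <- c(-5, 0, 20, 120)   # stand-in for flights$ArrDelay, in minutes

# Early arrivals count as "on time"
ArrDelay <- pmax(ArrDelay, 0)

# log(1 + delay) handles the zeros that a plain log() cannot
LogArrDelay <- log1p(ArrDelay)
LogArrDelay
```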
Transform the departure time
The departure time is coded in the format hhmm, which is not helpful for modelling, since we need equal distances between equal durations of time. This way, the distance between 10:10pm and 10:20pm would be 10, but the distance between 10:50pm and 11:00pm, the same 10 minutes, would be 50.
For the departure time, we therefore need to convert the time format. We will use a decimal format, so that 11:00am becomes 11, 11:15am becomes 11.25, and 11:45 becomes 11.75.
The mathematical rule to transform the "old" time in hhmm format into a decimal format is: decimal time = (hhmm div 100) + (hhmm mod 100) / 60, where "div" denotes integer division and "mod" the remainder.
Here, the first part of the sum generates the hours, and the second part takes the remainder when dividing by 100 (i.e., the last two digits), and divides them by 60 to transform the minutes into a fraction of one hour.
Let’s implement that in R:
Of course, you should always verify that your code did what you intended by checking the results.
Create average speed
The average flight speed is not available in the data – we have to compute it from the distance and the air time:
We have a few outliers with very high, implausible average speeds. Domain knowledge or a quick Google search can tell us that speeds of more than 800mph are not maintainable with current passenger planes. Thus, we will remove these flights from the data:
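Both steps can be sketched on a small stand-in data frame (the real data has the columns Distance, in miles, and AirTime, in minutes; the 800 mph cutoff follows the text):

```r
flights <- data.frame(Distance = c(500, 2762, 300),   # miles (stand-in rows)
                      AirTime  = c(75, 823, 10))      # minutes

# Average speed in mph: miles divided by hours
flights$Speed <- flights$Distance / (flights$AirTime / 60)

# Drop implausible speeds above 800 mph
flights <- flights[flights$Speed <= 800, ]
```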
Choosing an appropriate Method
For building an actual model with your data, you have the choice between two worlds, statistical modelling and machine learning.
Broadly speaking, statistical models focus more on quantifying and interpreting the relationships between input variables and the outcome. This is the case in situations such as clinical studies, where the main goal is to describe the effect of a new medication.
Machine learning methods on the other hand focus on achieving a high accuracy in prediction, sometimes sacrificing interpretability. This results in what is called "black box" algorithms, which are good at predicting the outcome, but it's hard to see how a model computes a prediction for the outcome. A classic example for a question where machine learning is the appropriate answer is the product recommendation algorithm on online shopping websites.
For our questions, we are interested in the effects of certain input variables (speed and time of day / week). Thus we will make use of statistical models, namely a linear model and a generalized additive model.
To answer our first question, we first plot the variables of interest to get a first impression of the relationship. Since these plots will likely make it to the final report or at least a presentation to your supervisors, it now makes sense to spend a little time on generating a pretty image. We will use the ggplot2 package to do this:
It seems like there is a slight increase in average speed for planes that leave with a larger delay. Let’s fit a linear model to quantify the effect:
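The model fit itself is not reproduced here; the following sketch fits the same kind of linear model on synthetic data whose slope mimics the reported effect (all numbers are simulated, not the real flights data):

```r
set.seed(1)

# Synthetic stand-in: speed increases by roughly 0.034 mph
# per minute of departure delay, plus noise
DepDelay <- runif(500, 0, 180)
Speed    <- 430 + 0.034 * DepDelay + rnorm(500, sd = 5)

fit <- lm(Speed ~ DepDelay)
coef(fit)["DepDelay"]          # slope: increase in mph per minute of delay
summary(fit)$coefficients      # estimates, standard errors, p values
```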
There is a highly significant effect of 0.034 for the departure delay. This represents the increase in average speed for each minute of delay. So, a plane with 60 minutes of delay will fly 2.04mph faster on average.
Even though the effect is highly significant with a p value of less than 0.0001, its actual effect is negligibly small.
For the second question of interest, we need a slightly more sophisticated model. Since we want to know the effect of the time of day on the arrival delay, we cannot assume a linear effect of the time on the delay. Let’s plot the data:
We plot both the actual delay and our transformed log-delay variable. The smoothing line of the second plot gives a better image of the shape of the delay. It seems that delays are highest at around 8pm, and lowest at 5am. This emphasizes the fact that a linear model would not be appropriate here.
We fit a generalized additive model , a GAM, to this data. Since the response variable is right skewed, a Gamma distribution seems appropriate for the model family. To be able to use it, we have to transform the delay into a strictly positive variable, so we compute the maximum of 1 and the arrival delay for each observation first.
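A sketch of such a GAM fit, using the mgcv package (a recommended package that ships with standard R installations) on synthetic stand-in data; the variable names and the evening delay peak are illustrative assumptions:

```r
library(mgcv)

set.seed(2)
# Synthetic stand-in for the flights data: delays vary smoothly over the day
DepTimeDec <- runif(300, 5, 23)
ArrDelay   <- rgamma(300, shape = 2,
                     scale = 5 + 10 * sin((DepTimeDec - 5) * pi / 18))
ArrDelay   <- pmax(ArrDelay, 1)   # strictly positive, as the Gamma family requires

# Smooth (nonlinear) effect of departure time on delay
fit <- gam(ArrDelay ~ s(DepTimeDec), family = Gamma(link = "log"))
plot(fit)   # visualize the estimated smooth
```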
We again see the trend of lower delays in the morning before 6am, and high delays around 8pm. To differentiate between weekdays, we now include this variable in the model:
With this model, we can create an artificial data frame x_new, which we use to plot one prediction line per weekday:
We now see several things:
- The nonlinear trend over the day is the same shape on every day of the week
- Fridays are the worst days to fly by far, with Sunday being a close second. Expected delays are around 20 minutes during rush-hour (8pm)
- Wednesdays and Saturdays are the quietest days
- If you can manage it, fly on a Wednesday morning to minimize expected delays.
As noted in the beginning of this post, this analysis is only one of many questions that can be tackled with this enormous data set. Feel free to browse the data expo website and especially the "Posters & results" section for many other interesting analyses.
The post A Data Science Case Study in R first appeared on Statistik Service.
Hashtag by IECSE
Oct 5, 2021
Programming in R: A Case Study
This introductory tutorial walks you through the basic statistical features available in R and how to use them. The tutorial gives you hands-on learning using the Palmer Penguins dataset.
R is a programming language that is used in the statistical and numerical analysis of complex datasets. R provides multiple libraries, which makes data analysis and graphical representation amazingly easy to achieve.
Software programmers, statisticians, data scientists, and data miners are using R to develop statistical software.
Prerequisites for the Tutorial
There are no strict prerequisites, but basic programming knowledge in any language will help you understand Exploratory Data Analysis (EDA) and Data Visualization. And if you have dealt with Python before, a lot of things will look familiar.
How is R different from Python?
R offers a particularly rich set of data analysis and visualization libraries, many of which are arguably easier to apply and understand than their Python counterparts.
R is built by statisticians and leans heavily into statistical models and specialized analytics. Data scientists use R for deep statistical analysis, supported by just a few lines of code and beautiful data visualizations. For example, you might use R for customer behavior analysis or genomics research.
The visualizations in R are very pleasing to the eye, making it easier to comprehend a dataset. It has excellent libraries like dplyr and dbplyr for data wrangling, ggplot2 for data visualization, mlr3 and caret for machine learning, and the tidyverse for general data science tasks. All these libraries and many more can be used for extensive data exploration and experimentation.
Even though Python is widely used for its wide range of purposes because developers can use Python for production and deployment, R scores higher from a statistical analysis point of view.
What will I learn at the end of the Tutorial?
By the end, you should have a good understanding of the packages and libraries in R, and you will be able to analyze your own custom datasets using R.
Palmer Penguins: A Case Study
Now that we are familiar with R, let’s try coding some basic commands on one of R’s most popular datasets — Palmer Penguins. We urge you to code along with us to experience the power of R.
To begin, we need to load the Palmer Penguins dataset in the environment.
To install any package in an R environment, we follow the syntax:
This is a one-time step . Once the package is installed on your system, you only need to load it in the environment before using it. To load a package, we use the syntax:
Take note that the # symbol denotes a comment, which we insert in the code to make it more readable.
Basic Exploratory Data Analysis
Now that the Palmer Penguins package is all loaded and set to be manipulated, let’s get to know it! Remember that the name of the dataset in this package is ‘penguins’.
The conventional way to start any dataset analysis is to view its first few rows. This gives us an idea of how the dataset is built without overloading the environment with an enormous number of rows. To do this, we use the function:
head(name of the dataset)
By default, the head() function displays the first 6 rows of the dataset. However, if we want to view more or fewer rows, we can specify the number of rows in the parentheses as such:
head(name of the dataset, number of rows)
Similarly, we can also view the last few rows of the dataset using the function:
tail(name of the dataset)
Just like the head() function, by default, the tail() function displays the last 6 rows. However, we can tailor the function to our needs and display as many rows as we want by adding a parameter:
tail(name of the dataset, number of rows)
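A quick sketch of both functions using the built-in mtcars data as a stand-in (any data frame behaves the same way):

```r
# mtcars ships with base R, so this runs anywhere
head(mtcars)       # first 6 rows (the default)
head(mtcars, 3)    # first 3 rows
tail(mtcars, 2)    # last 2 rows
```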
To know the dimensions of the dataset, we use the dim() function . The syntax of the function is:
dim(name of the dataset)
Here we can see that our dataset has 344 data points and 8 features, namely: species, island, bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, sex, and year. We can also see that the columns species, island, and sex have the data type factor <fct> (categorical data); the columns bill_length_mm and bill_depth_mm have the data type double <dbl> (numbers with decimals); and the columns flipper_length_mm, body_mass_g, and year have the data type int <int> (numbers without a decimal point).
A simpler way to know all these details is by using the str() — structure function . The syntax of the str() function is as follows:
str(name of the dataset)
Using the str() function, we get to know the dimensions of the dataset, the data type of the columns, and the first few values as well.
Here, we also get deeper insights, such as the different ‘levels’ (classes or categories) of the columns with the factor data type: for example, the column sex has two levels (female and male), and the column species has three levels (Adelie, Chinstrap, and Gentoo).
Awesome! Now that we are introduced to the Palmer Penguins dataset, let’s apply some functions that will help us measure some stats about the dataset.
We are using the library ‘ tidyverse’ . Like any other package, we first install the package in our system (reminder: one-time step only) and then load it in the environment.
Before we begin, let’s look at a handy operator: %>%, the pipe operator. This operator builds a pipeline of functions, which are performed one after the other, each using the result of the previous one. Note that the pipe operator must be placed at the end of each line.
To start, we will use the count() function . This function is generally applied to columns having the factor data type. It returns the count of each level in a column. The syntax is as follows:
dataset name %>%
count(name of the column)
As we can see, the count() function returns the count of each species in the column. There are 152 penguins of the ‘Adelie’ species, 68 penguins of the ‘Chinstrap’ species, and 124 penguins of the ‘Gentoo’ species.
Next, we use the summarize() function that returns a particular value after performing an operation. The syntax is:
summarize(new column = function(column name))
This function returns the average of the body mass of the penguins as a new column called avg_body_mass. We use the mean() function to calculate the average; similarly, median() returns the column’s median (R has no built-in mode function, though a small user-defined getmode() helper is commonly written for this). The argument na.rm=TRUE ignores any null values in the column, which would otherwise make the result NA.
Let’s add another function to the pipe. Now, we are going to find the average of the body mass group-wise. The summarize() function remains the same, but before it, we add in the group_by() function.
The group_by() function groups the dataset by the levels of a factor column and then applies the next functions group/factor-wise. The syntax of the group_by() function:
group_by(column name) %>%
Unlike the previous result, here we see that the summarize() function has calculated the mean for each level of the species separately and stored it in the avg_body_mass column.
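As a cross-check, the same group-wise average can be computed in base R with tapply(); here is a sketch on a tiny stand-in data frame (the real dataset's columns are species and body_mass_g, with different values):

```r
# Tiny stand-in for the penguins data
penguins <- data.frame(
  species     = factor(c("Adelie", "Adelie", "Gentoo", "Gentoo")),
  body_mass_g = c(3700, 3800, 5000, NA)
)

# Base-R equivalent of group_by(species) %>% summarize(mean(body_mass_g, na.rm = TRUE))
avg_body_mass <- tapply(penguins$body_mass_g, penguins$species,
                        mean, na.rm = TRUE)
avg_body_mass
```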
Now, what if we want to filter out only those penguins who have body mass less than the average of their species? The filter() function comes to our rescue!
In the pipe, we add in the filter() function after the group_by() function. The syntax for the filter() function is as follows:
Here, we see the penguins whose body mass is less than the average body mass of their species. Recall that we have used the mean() function and na.rm=TRUE for the calculation. We use the relational operator ‘<’ (less than) to build the condition. We can also use ‘>’ (greater than), ‘>=’ (greater than or equal to), ‘<=’ (less than or equal to), and ‘==’ (equal to) to build conditions. The ‘!’ operator represents not and reverses a condition.
Moving on, if we want to arrange the penguins who have less body mass than the average of their species, in a particular order, we use the arrange() function , using the syntax:
Here, we see that the penguins having body mass less than the average of their species are sorted in descending order by their body mass. By default, the arrange function sorts the dataset in ascending order. However, because we have added the desc() parameter , our dataset is arranged in descending order.
Yay! We have explored our dataset and various statistical measures of it.
Let’s proceed to visualize our data using the library ‘ggplot2’. Like before, we first install the package in our system (reminder: this is a one-time step) and then load the package in the environment.
The basic syntax to make any plot using ggplot2 is as follows:
ggplot(dataset, aes(x=.., y=..)) +
geom_&lt;shape&gt;()
Here, aes stands for the aesthetics of the plot, and geom stands for the geometric object. Note that the suffix after geom_ changes for every kind of plot we make. Also, take care that the ‘+’ operator is placed at the end of each line.
To begin with, we will make a basic scatter plot. A scatter plot plots every individual data point in the dataset using shapes on a graph. The syntax for it is as follows:
ggplot(dataset, aes(x=.., y=..)) +
geom_point()
The above plot represents the variation of body mass with respect to the flipper length of the penguin. Note that passing the values of both x and y is compulsory in a scatter plot.
Next, we will plot a line graph . A line graph plots every point in the dataset on the graph and joins it with its consequent and subsequent data points in the dataset using a line. The syntax to plot a line graph is as follows:
ggplot(dataset, aes(x=.., y=..)) +
geom_line()
The line graph above represents the variation of body mass with respect to the flipper length of a penguin. Like a scatter plot, passing both the values of x and y is compulsory in a line graph.
Moving on, we will plot a bar graph . In a bar graph, every data point is represented by a bar whose length is proportional to the value they represent. To build a bar graph, we use the following syntax:
ggplot(dataset, aes(x=..)) +
geom_bar()
As we can see, the bar graph we have plotted represents the number of penguins having a particular body mass. Unlike the scatter and line plots, we have to pass only x to the bar graph function, and by default, the y-axis will represent the count of the x-axis variable.
Similar to the bar graph, a histogram plots bars but for data points grouped in bins. The bins are continuous number ranges.
To make a histogram, we use the syntax:
ggplot(dataset, aes(x=..)) +
geom_histogram()
Referring to the above plot, we can see that the system has grouped the body mass of penguins into ranges and then plotted bars with length proportional to the count of penguins having body mass in that range. Like a bar graph, we only have to pass the x variable, and by default the y axis measures the count of the x axis variables.
The difference between a bar plot and a histogram is that bar plots work with both categorical and numeric data types, whereas histograms work only with numeric data.
All these plots convey a lot of information, but they look a little boring, don’t they? Let’s make them more visually appealing!
To add a dash of colour, we include the colour parameter inside the aes() parentheses. The colour parameter highlights the points on the basis of the levels of a factor column.
Here, the points are highlighted based on the species of penguins. R also provides a legend to make sense of the colours used. In this case, the data related to the Adelie penguins are orange in colour, the Chinstrap are green and the Gentoo are blue.
Similarly, we can change the size of the points on the basis of the levels of a factor column. Just add the size parameter in aesthetics parenthesis.
In this case, we see that the different species of penguins can also be separated on the basis of point size: the Adelie penguins have small points, the Chinstrap penguins medium points, and the Gentoo penguins large points.
If we want to go even further, we can change the shape of the point on the basis of levels of the factor column. To do this, we add the shape parameter in the aesthetics parenthesis.
Here, the three species of penguins are very clearly distinguished on the basis of colour, size and shape. Notice, the legend updates itself accordingly.
To make the graph more readable, we will add in some text. The different ways we can insert text are:
- Title : Appears at the top left of the plot in a big font size.
- Subtitle : Appears at the top left of the plot beneath the Title in a small font size.
- Caption : Appears as a note at the bottom right of the plot in a small font size.
- X : Appears as the label of the x-axis, in a medium font size.
- Y : Appears as the label of the y-axis, in a medium font size.
The text is added in the labs() function as follows:
geom_point() +
labs(title=.., subtitle=.., caption=.., x=.., y=..)
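Putting the pieces together, a complete sketch of such a labeled scatter plot (assuming the ggplot2 and palmerpenguins packages are installed; the specific label texts are illustrative):

```r
library(ggplot2)
library(palmerpenguins)   # provides the penguins dataset

p <- ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g,
                          colour = species)) +
  geom_point() +
  labs(title    = "Penguin body mass vs. flipper length",
       subtitle = "Palmer Penguins dataset",
       caption  = "Source: palmerpenguins package",
       x        = "Flipper length (mm)",
       y        = "Body mass (g)")
p
```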
A Sneak Peek into Advanced Exploratory Data Analysis
Dimensionality can be a curse: we often need to drop some columns to avoid overfitting our models, which would decrease their accuracy on new data.
To do this, there are two ways:
To select columns on the basis of indices , we use [..,..] to access them in the dataset. The syntax is as follows:
dataset [row indices, column indices]
If we want to include all of the rows or all of the columns, we leave the corresponding space blank. If we want to access only specific columns, we put in the index or range of indices.
Here, we have left the indices of rows empty, meaning we want to include all rows in the new dataset. Also, we want columns with indices 3, 4, 5, and 6. So we have put the expression 3:6 which generates the required series.
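As a sketch, the column-selection step might look like this (assuming the penguins data frame from the palmerpenguins package):

```r
library(palmerpenguins)

# Keep all rows, and only columns 3 through 6
subset_penguins <- penguins[, 3:6]
ncol(subset_penguins)  # 4 columns remain
```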
Caution: the changes we have made are not permanent. These functions return a copy of the dataset with the filters applied. To make the changes stick, we assign the result back to the original dataset.
Similarly, if we want to drop rows, we will put in the indices before the comma. Here, we have also used the ‘-’ operator which omits the mentioned rows and retains the others. Also notice that we have kept the indices of columns empty, hence, we retain all the columns.
We can see that the number of rows has been reduced by 3 (from 344 to 341).
If we want to access the columns by their names, we use the select() function. This comes from the dplyr package, part of the ‘tidyverse’. The syntax is as follows:
select(dataset name, column names)
In this case, we do not want the columns sex and year. To do this, we have made a vector of column names using the c() (combine) function and put a ‘-’ sign in front of the vector to omit those columns.
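A minimal sketch of that select() call, again assuming the penguins data frame:

```r
library(dplyr)           # select() comes from dplyr (tidyverse)
library(palmerpenguins)

# Drop the sex and year columns by name
penguins_trimmed <- select(penguins, -c(sex, year))
names(penguins_trimmed)  # sex and year are gone
```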
Another major problem in datasets is the presence of null values. To count the number of null values, we use the function sum(is.na(..)) as follows:
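For example, on the penguins data frame:

```r
library(palmerpenguins)

# Total number of missing values across the whole data frame
sum(is.na(penguins))
```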
From above, we can see that the dataset has 19 null values in total.
To remove these values, we use the na.omit(..) function as shown below:
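A sketch of that step, with the row counts the article reports:

```r
library(palmerpenguins)

nrow(penguins)          # 344 rows before
penguins_clean <- na.omit(penguins)
nrow(penguins_clean)    # 333 rows after dropping any row with an NA
```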
Here, we can see that 11 rows were removed (as the number of rows dropped from 344 to 333). This means that some rows had multiple null values.
In the real world, we remove columns on the basis of correlation. To know which columns are highly correlated, we will plot a correlation heatmap. However, correlation can only be calculated between numeric columns. So, let’s select only the numeric columns from the dataset and make the change permanent.
To do this, we have used the select_if() and is.numeric functions. The select_if() function comes from dplyr, part of the ‘tidyverse’, and subsets the columns based on a condition. The is.numeric function checks whether a column is of the numeric data type. The syntax of the select_if() function is:
select_if(dataset name, condition)
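A sketch of that call, assigning the result to a new name so we can keep using it (the variable name penguins_num is our own choice):

```r
library(dplyr)
library(palmerpenguins)

# Keep only the numeric columns
penguins_num <- select_if(penguins, is.numeric)
names(penguins_num)
```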
As we can see, the penguins dataset now has only numeric columns, viz. bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, and year.
Note: just by typing the name of the dataset, we can have a preview of the dataset.
To avoid any calculation error, let’s drop rows with missing values, and make it permanent.
As we can see, there are no missing values in the dataset.
All set now! Let’s plot the correlation heatmap.
We will be using the cor() function to calculate the correlation between the columns in the dataset. Let’s store this in a variable called corr.
To plot the correlation heatmap, we need to install the ‘corrplot’ library. Before you see the next code snippet, write your own code! Let’s test your learning.
Now let’s plot the heatmap . The syntax is rather simple:
corrplot(correlation between columns)
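Putting the last few steps together as a sketch (penguins_num stands for the numeric-only, NA-free dataset built earlier; that name is an assumption):

```r
# install.packages("corrplot")  # run once if the package is not installed
library(corrplot)
library(dplyr)
library(palmerpenguins)

penguins_num <- na.omit(select_if(penguins, is.numeric))
corr <- cor(penguins_num)  # pairwise correlations between numeric columns
corrplot(corr)             # draws the correlation heatmap
```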
Ta-da! As we can see, flipper_length_mm is highly correlated with bill_length_mm and body_mass_g. That means we will have to drop flipper_length_mm to improve the accuracy of our model. Nuh-uh, you’ll have to write the code for this one!
Psst. scroll up for the cheat code!
Congratulations! You can now perform basic statistical operations on datasets and make beautiful visualizations using R. At this point, we encourage you to explore further and move forward in your journey to become an R expert. Here are some resources that might help:
Special thanks to Lajith Puthuchery and Nishant Dash for their valuable contributions to the article.
5 A predictive modeling case study
Introduction 🔗︎
Each of the four previous Get Started articles has focused on a single task related to modeling. Along the way, we also introduced core packages in the tidymodels ecosystem and some of the key functions you’ll need to start working with models. In this final case study, we will use all of the previous articles as a foundation to build a predictive model from beginning to end with data on hotel stays.
To use code in this article, you will need to install the following packages: glmnet, ranger, readr, tidymodels, and vip.
The Hotel Bookings Data 🔗︎
Let’s use hotel bookings data from Antonio, Almeida, and Nunes (2019) to predict which hotel stays included children and/or babies, based on the other characteristics of the stays such as which hotel the guests stay at, how much they pay, etc. This was also a #TidyTuesday dataset with a data dictionary you may want to look over to learn more about the variables. We’ll use a slightly edited version of the dataset for this case study.
To start, let’s read our hotel data into R, which we’ll do by providing readr::read_csv() with a url where our CSV data is located ("https://tidymodels.org/start/case-study/hotels.csv"):
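A sketch of that import step; converting character columns to factors is an assumption about the preprocessing, since most tidymodels functions work best with factors:

```r
library(tidymodels)
library(readr)

hotels <-
  read_csv("https://tidymodels.org/start/case-study/hotels.csv") %>%
  mutate(across(where(is.character), as.factor))  # modeling prefers factors

dim(hotels)  # rows and columns in the edited dataset
```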
In the original paper, the authors caution that the distribution of many variables (such as number of adults/children, room type, meals bought, country of origin of the guests, and so forth) is different for hotel stays that were canceled versus not canceled. This makes sense because much of that information is gathered (or gathered again more accurately) when guests check in for their stay, so canceled bookings are likely to have more missing data than non-canceled bookings, and/or to have different characteristics when data is not missing. Given this, it is unlikely that we can reliably detect meaningful differences between guests who cancel their bookings and those who do not with this dataset. To build our models here, we have already filtered the data to include only the bookings that did not cancel, so we’ll be analyzing hotel stays only.
We will build a model to predict which actual hotel stays included children and/or babies, and which did not. Our outcome variable children is a factor variable with two levels:
We can see that children were only in 8.1% of the reservations. This type of class imbalance can often wreak havoc on an analysis. While there are several methods for combating this issue using recipes (search for steps to upsample or downsample ) or other more specialized packages like themis , the analyses shown below analyze the data as-is.
Data Splitting & Resampling 🔗︎
For a data splitting strategy, let’s reserve 25% of the stays to the test set. As in our Evaluate your model with resampling article, we know our outcome variable children is pretty imbalanced so we’ll use a stratified random sample:
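A sketch of the stratified split; the seed is arbitrary, and initial_split()'s default prop of 3/4 is what reserves 25% of the stays for testing:

```r
set.seed(123)
splits <- initial_split(hotels, strata = children)

hotel_other <- training(splits)  # used for model development
hotel_test  <- testing(splits)   # held out until the very end
```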
In our articles so far, we’ve relied on 10-fold cross-validation as the primary resampling method using rsample::vfold_cv() . This has created 10 different resamples of the training set (which we further split into analysis and assessment sets), producing 10 different performance metrics that we then aggregated.
For this case study, rather than using multiple iterations of resampling, let’s create a single resample called a validation set . In tidymodels, a validation set is treated as a single iteration of resampling. This will be a split from the 37,500 stays that were not used for testing, which we called hotel_other . This split creates two new datasets:
the set held out for the purpose of measuring performance, called the validation set , and
the remaining data used to fit the model, called the training set .
We’ll use the validation_split() function to allocate 20% of the hotel_other stays to the validation set and 30,000 stays to the training set . This means that our model performance metrics will be computed on a single set of 7,500 hotel stays. This is fairly large, so the amount of data should provide enough precision to be a reliable indicator for how well each model predicts the outcome with a single iteration of resampling.
This function, like initial_split(), has a strata argument, which uses stratified sampling to create the resample. This means that we’ll have roughly the same proportions of hotel stays with and without children in our new validation and training sets, as compared to the original hotel_other proportions.
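A sketch of that validation split; validation_split() lives in the rsample package, and prop here is the fraction kept for training:

```r
set.seed(234)
val_set <- validation_split(hotel_other,
                            strata = children,
                            prop   = 0.80)  # 80% training, 20% validation
```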
A first model: penalized logistic regression 🔗︎
Since our outcome variable children is categorical, logistic regression would be a good first model to start. Let’s use a model that can perform feature selection during training. The glmnet R package fits a generalized linear model via penalized maximum likelihood. This method of estimating the logistic regression slope parameters uses a penalty on the process so that less relevant predictors are driven towards a value of zero. One of the glmnet penalization methods, called the lasso method , can actually set the predictor slopes to zero if a large enough penalty is used.
Build the model
To specify a penalized logistic regression model that uses a feature selection penalty, let’s use the parsnip package with the glmnet engine :
We’ll set the penalty argument to tune() as a placeholder for now. This is a model hyperparameter that we will tune to find the best value for making predictions with our data. Setting mixture to a value of one means that the glmnet model will potentially remove irrelevant predictors and choose a simpler model.
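A sketch of that model specification with parsnip:

```r
lr_mod <-
  logistic_reg(penalty = tune(), mixture = 1) %>%  # mixture = 1 is the lasso
  set_engine("glmnet")
```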
Create the recipe
Let’s create a recipe to define the preprocessing steps we need to prepare our hotel stays data for this model. It might make sense to create a set of date-based predictors that reflect important components related to the arrival date. We have already introduced a number of useful recipe steps for creating features from dates:
step_date() creates predictors for the year, month, and day of the week.
step_holiday() generates a set of indicator variables for specific holidays. Although we don’t know where these two hotels are located, we do know that the countries of origin for most stays are based in Europe.
step_rm() removes variables; here we’ll use it to remove the original date variable since we no longer want it in the model.
Additionally, all categorical predictors (e.g., distribution_channel , hotel , …) should be converted to dummy variables, and all numeric predictors need to be centered and scaled.
step_dummy() converts characters or factors (i.e., nominal variables) into one or more numeric binary model terms for the levels of the original data.
step_zv() removes indicator variables that only contain a single unique value (e.g. all zeros). This is important because, for penalized models, the predictors should be centered and scaled.
step_normalize() centers and scales numeric variables.
Putting all these steps together into a recipe for a penalized logistic regression model, we have:
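A sketch of the assembled recipe; the column names (children, arrival_date) and the holiday list follow the tidymodels article, but treat them as assumptions about the data:

```r
holidays <- c("AllSouls", "AshWednesday", "ChristmasEve", "Easter",
              "ChristmasDay", "GoodFriday", "NewYearsDay", "PalmSunday")

lr_recipe <-
  recipe(children ~ ., data = hotel_other) %>%
  step_date(arrival_date) %>%                          # year, month, weekday
  step_holiday(arrival_date, holidays = holidays) %>%  # holiday indicators
  step_rm(arrival_date) %>%                            # drop the raw date
  step_dummy(all_nominal_predictors()) %>%             # dummy variables
  step_zv(all_predictors()) %>%                        # drop zero-variance cols
  step_normalize(all_predictors())                     # center and scale
```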
Create the workflow
As we introduced in Preprocess your data with recipes , let’s bundle the model and recipe into a single workflow() object to make management of the R objects easier:
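Bundling the model and recipe, assuming the lr_mod and lr_recipe objects from the previous steps:

```r
lr_workflow <-
  workflow() %>%
  add_model(lr_mod) %>%
  add_recipe(lr_recipe)
```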
Create the grid for tuning
Before we fit this model, we need to set up a grid of penalty values to tune. In our Tune model parameters article, we used dials::grid_regular() to create an expanded grid based on a combination of two hyperparameters. Since we have only one hyperparameter to tune here, we can set the grid up manually using a one-column tibble with 30 candidate values:
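One way to build that grid by hand, spacing 30 candidate penalty values evenly on the log scale (the range shown is an assumption):

```r
lr_reg_grid <- tibble(penalty = 10^seq(-4, -1, length.out = 30))
nrow(lr_reg_grid)  # 30 candidate values
```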
Train and tune the model
Let’s use tune::tune_grid() to train these 30 penalized logistic regression models. We’ll also save the validation set predictions (via the call to control_grid() ) so that diagnostic information can be available after the model fit. The area under the ROC curve will be used to quantify how well the model performs across a continuum of event thresholds (recall that the event rate—the proportion of stays including children— is very low for these data).
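A sketch of the tuning call, assuming the workflow, validation set, and grid objects defined above:

```r
lr_res <-
  lr_workflow %>%
  tune_grid(val_set,
            grid    = lr_reg_grid,
            control = control_grid(save_pred = TRUE),  # keep predictions
            metrics = metric_set(roc_auc))
```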
It might be easier to visualize the validation set metrics by plotting the area under the ROC curve against the range of penalty values:
This plot shows us that model performance is generally better at the smaller penalty values. This suggests that the majority of the predictors are important to the model. We also see a steep drop in the area under the ROC curve towards the highest penalty values. This happens because a large enough penalty will remove all predictors from the model, and not surprisingly predictive accuracy plummets with no predictors in the model (recall that an ROC AUC value of 0.50 means that the model does no better than chance at predicting the correct class).
Our model performance seems to plateau at the smaller penalty values, so going by the roc_auc metric alone could lead us to multiple options for the “best” value for this hyperparameter:
Every candidate model in this tibble likely includes more predictor variables than the model in the row below it. If we used select_best() , it would return candidate model 11 with a penalty value of 0.00137, shown with the dotted line below.
However, we may want to choose a penalty value further along the x-axis, closer to where we start to see the decline in model performance. For example, candidate model 12 with a penalty value of 0.00174 has effectively the same performance as the numerically best model, but might eliminate more predictors. This penalty value is marked by the solid line above. In general, fewer irrelevant predictors is better. If performance is about the same, we’d prefer to choose a higher penalty value.
Let’s select this value and visualize the validation set ROC curve:
The level of performance generated by this logistic regression model is good, but not groundbreaking. Perhaps the linear nature of the prediction equation is too limiting for this data set. As a next step, we might consider a highly non-linear model generated using a tree-based ensemble method.
A second model: tree-based ensemble 🔗︎
An effective and low-maintenance modeling technique is a random forest . This model was also used in our Evaluate your model with resampling article. Compared to logistic regression, a random forest model is more flexible. A random forest is an ensemble model typically made up of thousands of decision trees, where each individual tree sees a slightly different version of the training data and learns a sequence of splitting rules to predict new data. Each tree is non-linear, and aggregating across trees makes random forests also non-linear but more robust and stable compared to individual trees. Tree-based models like random forests require very little preprocessing and can effectively handle many types of predictors (sparse, skewed, continuous, categorical, etc.).
Build the model and improve training time
Although the default hyperparameters for random forests tend to give reasonable results, we’ll plan to tune two hyperparameters that we think could improve performance. Unfortunately, random forest models can be computationally expensive to train and to tune. The computations required for model tuning can usually be easily parallelized to improve training time. The tune package can do parallel processing for you, and allows users to use multiple cores or separate machines to fit models.
But, here we are using a single validation set, so parallelization isn’t an option using the tune package. For this specific case study, a good alternative is provided by the engine itself. The ranger package offers a built-in way to compute individual random forest models in parallel. To do this, we need to know the number of cores we have to work with. We can use the parallel package to query the number of cores on your own computer to understand how much parallelization you can do:
We have 12 cores to work with. We can pass this information to the ranger engine when we set up our parsnip rand_forest() model. To enable parallel processing, we can pass engine-specific arguments like num.threads to ranger when we set the engine:
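A sketch of the core detection and the model specification; trees = 1000 is an assumption about the forest size:

```r
cores <- parallel::detectCores()  # how many cores are available

rf_mod <-
  rand_forest(mtry = tune(), min_n = tune(), trees = 1000) %>%
  set_engine("ranger", num.threads = cores) %>%  # engine-level parallelism
  set_mode("classification")
```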
This works well in this modeling context, but it bears repeating: if you use any other resampling method, let tune do the parallel processing for you — we typically do not recommend relying on the modeling engine (like we did here) to do this.
In this model, we used tune() as a placeholder for the mtry and min_n argument values, because these are our two hyperparameters that we will tune .
Create the recipe and workflow
Unlike penalized logistic regression models, random forest models do not require dummy or normalized predictor variables. Nevertheless, we want to do some feature engineering again with our arrival_date variable. As before, the date predictor is engineered so that the random forest model does not need to work hard to tease these potential patterns from the data.
Adding this recipe to our parsnip model gives us a new workflow for predicting whether a hotel stay included children and/or babies as guests with a random forest:
When we set up our parsnip model, we chose two hyperparameters for tuning:
The mtry hyperparameter sets the number of predictor variables that each node in the decision tree “sees” and can learn about, so it can range from 1 to the total number of features present; when mtry = all possible features, the model is the same as bagging decision trees. The min_n hyperparameter sets the minimum n to split at any node.
We will use a space-filling design to tune, with 25 candidate models:
The message printed above “Creating pre-processing data to finalize unknown parameter: mtry” is related to the size of the data set. Since mtry depends on the number of predictors in the data set, tune_grid() determines the upper bound for mtry once it receives the data.
Here are our top 5 random forest models, out of the 25 candidates:
Right away, we see that these values for area under the ROC look more promising than our top model using penalized logistic regression, which yielded an ROC AUC of 0.876.
Plotting the results of the tuning process highlights that both mtry (number of predictors at each node) and min_n (minimum number of data points required to keep splitting) should be fairly small to optimize performance. However, the range of the y-axis indicates that the model is very robust to the choice of these parameter values — all but one of the ROC AUC values are greater than 0.90.
Let’s select the best model according to the ROC AUC metric. Our final tuning parameter values are:
To calculate the data needed to plot the ROC curve, we use collect_predictions() . This is only possible after tuning with control_grid(save_pred = TRUE) . In the output, you can see the two columns that hold our class probabilities for predicting hotel stays including and not including children.
To filter the predictions for only our best random forest model, we can use the parameters argument and pass it our tibble with the best hyperparameter values from tuning, which we called rf_best :
Now, we can compare the validation set ROC curves for our top penalized logistic regression model and random forest model:
The random forest is uniformly better across event probability thresholds.
The last fit 🔗︎
Our goal was to predict which hotel stays included children and/or babies. The random forest model clearly performed better than the penalized logistic regression model, and would be our best bet for predicting hotel stays with and without children. After selecting our best model and hyperparameter values, our last step is to fit the final model on all the rows of data not originally held out for testing (both the training and the validation sets combined), and then evaluate the model performance one last time with the held-out test set.
We’ll start by building our parsnip model object again from scratch. We take our best hyperparameter values from our random forest model. When we set the engine, we add a new argument: importance = "impurity" . This will provide variable importance scores for this last model, which gives some insight into which predictors drive model performance.
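A sketch of the final fit. The hyperparameter values shown (mtry = 8, min_n = 7) are illustrative; use the values stored in rf_best from your own tuning run. The workflow name rf_workflow is likewise an assumption:

```r
last_rf_mod <-
  rand_forest(mtry = 8, min_n = 7, trees = 1000) %>%
  set_engine("ranger", num.threads = cores, importance = "impurity") %>%
  set_mode("classification")

last_rf_fit <-
  rf_workflow %>%               # the random forest workflow built earlier
  update_model(last_rf_mod) %>%
  last_fit(splits)              # fit on train + validation, score on test
```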
This fitted workflow contains everything , including our final metrics based on the test set. So, how did this model do on the test set? Was the validation set a good estimate of future performance?
This ROC AUC value is pretty close to what we saw when we tuned the random forest model with the validation set, which is good news. That means that our estimate of how well our model would perform with new data was not too far off from how well our model actually performed with the unseen test data.
We can access those variable importance scores via the .workflow column. We can extract out the fit from the workflow object, and then use the vip package to visualize the variable importance scores for the top 20 features:
The most important predictors in whether a hotel stay had children or not were the daily cost for the room, the type of room reserved, the time between the creation of the reservation and the arrival date, and the type of room that was ultimately assigned.
Let’s generate our last ROC curve to visualize. Since the event we are predicting is the first level in the children factor (“children”), we provide roc_curve() with the relevant class probability .pred_children :
Based on these results, the validation set and test set performance statistics are very close, so we would have pretty high confidence that our random forest model with the selected hyperparameters would perform well when predicting new data.
Where to next? 🔗︎
If you’ve made it to the end of this series of Get Started articles, we hope you feel ready to learn more! You now know the core tidymodels packages and how they fit together. After you are comfortable with the basics we introduced in this series, you can learn how to go farther with tidymodels in your modeling and machine learning projects.
Here are some more ideas for where to go next:
Study up on statistics and modeling with our comprehensive books .
Dig deeper into the package documentation sites to find functions that meet your modeling needs. Use the searchable tables to explore what is possible.
Keep up with the latest about tidymodels packages at the tidyverse blog .
Find ways to ask for help and contribute to tidymodels to help others.
Session information 🔗︎