## Getting Started with RStudio by John Verzani


## Chapter 4. Case Study: Creating a Package

## Creating Functions from Script Files

For our task, we have a script that processes the data in several steps:

1. It reads in the data and does some data cleaning.
2. It creates zoo objects for each mole rat.
3. It merges these into one large zoo object.
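A minimal sketch of turning the first step into a reusable function (the file name, columns, and cleaning rule here are placeholders, not the chapter's actual code):

```r
# Hypothetical sketch: the read-and-clean step wrapped as a function.
read_molerat <- function(file) {
  d <- read.csv(file, stringsAsFactors = FALSE)
  na.omit(d)  # drop incomplete records as a minimal cleaning step
}

# Toy usage with a temporary file standing in for the real data
f <- tempfile(fileext = ".csv")
write.csv(data.frame(id = c(1, 2, NA), mass = c(10, 20, 30)),
          f, row.names = FALSE)
d <- read_molerat(f)
```

Functions extracted this way can then be collected into the `R/` directory of the package skeleton described below.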

## A Package Skeleton

## Documenting Functions With roxygen2

## The devtools Package

## Package Data

## Package Examples

## Adding Tests

## Building and Installing the Package


## Case Study: Exploratory Data Analysis in R

by Daniel Pinedo

## Case studies

Gautier Paux and Alex Dmitrienko, Introduction.

## Case study 1

- Trial with two treatment arms and single endpoint (normally distributed endpoint).
- Trial with two treatment arms and single endpoint (binary endpoint).
- Trial with two treatment arms and single endpoint (survival-type endpoint).
- Trial with two treatment arms and single endpoint (survival-type endpoint with censoring).
- Trial with two treatment arms and single endpoint (count-type endpoint).

## Case study 2

Clinical trial in patients with schizophrenia

## Case study 3

Clinical trial in patients with asthma

## Case study 4

Clinical trial in patients with metastatic colorectal cancer

## Case study 5

Clinical trial in patients with rheumatoid arthritis

## Case study 6

Several distributions will be illustrated in this case study:

## Normally distributed endpoint

## Define a Data Model

The first step is to initialize the data model:

As a side note, the seq function can be used to compactly define sample sizes in a data model:
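The `seq` call itself is plain base R; for instance (these sample-size values are illustrative, not taken from the case study):

```r
# Three per-arm sample sizes, written compactly with seq()
sample.size <- seq(220, 260, 20)
identical(sample.size, c(220, 240, 260))  # TRUE
```

In a Mediana data model, a vector like this would be supplied where individual sample sizes are expected.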

## Define an Analysis Model

Just like the data model, the analysis model needs to be initialized as follows:

## Define an Evaluation Model

First of all, the evaluation model must be initialized:

Secondly, the success criterion of interest (marginal power) is defined using the Criterion object:
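Marginal power can also be approximated directly with a base-R simulation. The sketch below is illustrative only (effect size, per-arm sample size, and alpha are all assumed); the Mediana package automates this evaluation:

```r
# Monte Carlo estimate of marginal power for one two-arm comparison
# with a normally distributed endpoint (all parameters assumed).
set.seed(1)
n.sims <- 2000
n <- 100        # patients per arm
delta <- 0.4    # treatment effect in SD units
pvals <- replicate(n.sims, {
  placebo   <- rnorm(n)
  treatment <- rnorm(n, mean = delta)
  t.test(treatment, placebo, alternative = "greater")$p.value
})
mean(pvals < 0.025)  # marginal power at a one-sided 2.5% level (~0.8 here)
```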

## Perform Clinical Scenario Evaluation

To accomplish this, the simulation parameters need to be defined in a SimParameters object:

## Summarize the Simulation Results

Summary of the simulation results in the R console.

## Generate a Simulation Report

Project information can be added to the presentation model using the Project object:

## Report generation

and the resulting hazard ratio (HR) is 0.077/0.116 ≈ 0.67.

The complete data model in this case study is defined as follows:

H1: Null hypothesis of no difference between Dose L and placebo.

H2: Null hypothesis of no difference between Dose M and placebo.

H3: Null hypothesis of no difference between Dose H and placebo.

Sample 1: Marker-negative patients in the placebo arm.

Sample 2: Marker-positive patients in the placebo arm.

Sample 3: Marker-negative patients in the treatment arm.

Sample 4: Marker-positive patients in the treatment arm.

Placebo arm: Samples 1 and 2 ( Placebo M- and Placebo M+ ) are merged.

Treatment arm: Samples 3 and 4 ( Treatment M- and Treatment M+ ) are merged.

Placebo arm: Sample 2 ( Placebo M+ ).

Treatment arm: Sample 4 ( Treatment M+ ).

It is reasonable to consider the following success criteria in this case study:

Marginal power: Probability of a significant outcome in each patient population.

The next several statements specify the parameters of the bivariate exponential distribution:

Parameters of the marginal exponential distributions, i.e., the hazard rates.

Correlation matrix of the underlying multivariate normal distribution used in the copula method.
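The Mediana package handles this generation internally; a base-R sketch of the Gaussian copula idea (the latent correlation value here is assumed):

```r
# Simulate correlated exponential event times via a Gaussian copula.
set.seed(1)
n <- 10000
rates <- c(0.077, 0.116)                 # marginal hazard rates from the text
rho <- 0.3                               # assumed latent correlation
Sigma <- matrix(c(1, rho, rho, 1), 2, 2)
Z <- matrix(rnorm(2 * n), n, 2) %*% chol(Sigma)  # correlated standard normals
U <- pnorm(Z)                                    # probability integral transform
times <- cbind(qexp(U[, 1], rates[1]), qexp(U[, 2], rates[2]))
```

The marginals of `times` are exactly exponential with the specified hazard rates, while the copula induces the desired dependence between the two endpoints.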

The circles in this figure denote the two null hypotheses of interest:

H1: Null hypothesis of no difference between the two arms with respect to PFS.

H2: Null hypothesis of no difference between the two arms with respect to OS.

Endpoint 2: Change from baseline in the Health Assessment Questionnaire-Disability Index (HAQ-DI).

Variable types (binomial and normal).

Correlation matrix of the multivariate normal distribution used in the copula method.

Endpoint 1: Two-sample test for comparing proportions ( method = "PropTest" ).

Endpoint 2: Two-sample t-test ( method = "TTest" ).
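These method strings correspond to standard two-sample tests; base-R counterparts look like this (counts and effect size are made up for illustration):

```r
set.seed(1)
# "PropTest": compare response proportions between two arms
prop.p <- prop.test(x = c(30, 45), n = c(100, 100))$p.value
# "TTest": compare means between two arms
t.p <- t.test(rnorm(100, mean = 0.4), rnorm(100))$p.value
c(prop.p, t.p)
```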

H1: Null hypothesis of no difference between Dose L and placebo with respect to Endpoint 1.

H2: Null hypothesis of no difference between Dose H and placebo with respect to Endpoint 1.

H3: Null hypothesis of no difference between Dose L and placebo with respect to Endpoint 2.

H4: Null hypothesis of no difference between Dose H and placebo with respect to Endpoint 2.

Families of null hypotheses ( family ).

Component procedures used in the families ( component.procedure ).

Truncation parameters used in the families ( gamma ).

The analysis model is defined as follows:

## Generate a Simulation Report

## Create a Customized Simulation Report

Several presentation models will be used to produce customized simulation reports:

A report with combined sections.

## Report without subsections

## Report with subsections

## Report with combined sections

CSE report without subsections

CSE report with combined subsections


## A Data Science Case Study in R

Posted on March 13, 2017 by Robert Grünwald in R-bloggers.

## Data Science Projects

For our analysis and the R programming, we will make use of the following R packages:

## Anatomy of a Data Science project

A basic data science project consists of the following six steps:

- State the problem you are trying to solve. It has to be an unambiguous question that can be answered with data and a statistical or machine learning model. At least, specify: What is being observed? What has to be predicted?
- Collect the data, then clean and prepare it. This is commonly the most time-consuming task, but it has to be done in order to fit a prediction model with the data.
- Explore the data. Get to know its properties and quirks. Check numerical summaries of your metric variables, tables of the categorical data, and plot univariate and multivariate representations of your variables. By this, you also get an overview of the quality of the data and can find outliers.
- Check if any variables may need to be transformed. Most commonly, this is a logarithmic transformation of skewed measurements such as concentrations or times. Also, some variables might have to be split up into two or more variables.
- Choose a model and train it on the data. If you have more than one candidate model, apply each and evaluate their goodness-of-fit using independent data that was not used for training the model.
- Use the best model to make your final predictions.
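Steps 5 and 6 in miniature, using the built-in mtcars data as a stand-in (not the flight data analyzed below):

```r
# Split the data, train two candidate models, keep the lower-RMSE one.
set.seed(1)
idx   <- sample(nrow(mtcars), floor(0.8 * nrow(mtcars)))
train <- mtcars[idx, ]
test  <- mtcars[-idx, ]
rmse  <- function(fit) sqrt(mean((test$mpg - predict(fit, test))^2))
m1 <- lm(mpg ~ wt, data = train)
m2 <- lm(mpg ~ wt + hp, data = train)
best <- if (rmse(m2) < rmse(m1)) m2 else m1  # model selected on holdout data
```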

The following commands read in our subset data and display the first three observations:
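A stand-in sketch of that import (the real file is a subset of the airline on-time dataset; the path and column values here are invented):

```r
# Write a tiny stand-in CSV, read it back, and show the first rows.
path <- tempfile(fileext = ".csv")
write.csv(data.frame(Year = 1999, Month = 1, DayOfWeek = c(5, 6, 7, 1),
                     ArrDelay = c(12, -3, 45, 0)),
          path, row.names = FALSE)
flights <- read.csv(path)
head(flights, 3)  # display the first three observations
```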

## The problem

With this data, it is possible to answer many interesting questions. Examples include:

- Do planes with a delayed departure fly with a faster average speed to make up for the delay?
- How does the delay of arriving flights vary during the day? Are planes more delayed on weekends?
- How has the market share of different airlines shifted over these 20 years?
- Are there specific planes that tend to have longer delays? What characterizes them? Maybe the age, or the manufacturer?

Here, we will focus on the first two questions.

## Data cleaning

## Explorative analyses

Our main variables of interest are:

- The date, which conveniently is already split up into the columns Year, Month, and DayOfMonth, and even contains the weekday in DayOfWeek. This is rarely the case; you mostly get a single column with a name like date and entries such as "2016-06-24". In that case, the R package lubridate provides helpful functions to work with and manipulate these dates efficiently.
- CRSDepTime, the scheduled departure time. This will indicate the time of day for our analysis of when flights tend to have higher delays.
- ArrDelay, the delay in minutes at arrival. We use this variable (rather than the delay at departure) for the outcome in our first analysis, since the arrival delay is what has the impact on our day.
- For our second question of whether planes with delayed departure fly faster, we need DepDelay, the delay in minutes at departure, as well as a measure of average speed while flying. This variable is not available, but we can compute it from the available variables Distance and AirTime. We will do that in the next section, "Feature Engineering".

Let’s have an exploratory look at all our variables of interest.

## Flight date

## Departure Time

## Arrival and departure delay

## Distance and AirTime

## Feature Engineering

For our data, we have the following tasks:

- Convert the weekday into a factor variable so it doesn’t get interpreted linearly.
- Create a log-transformed version of the arrival and departure delay.
- Transform the departure time so that it can be used in a model.
- Create the average speed from the distance and air time variables.

## log-transform delay times
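A toy sketch of this step. Using `log1p()` and clamping negative delays to zero are assumptions for illustration; the original post may have handled negative delays differently:

```r
flights <- data.frame(ArrDelay = c(12, -3, 95), DepDelay = c(5, 0, 80))
flights$LogArrDelay <- log1p(pmax(flights$ArrDelay, 0))  # log(1 + x) avoids log(0)
flights$LogDepDelay <- log1p(pmax(flights$DepDelay, 0))
```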

## Transform the departure time

The mathematical rule to transform the "old" time in hhmm-format into a decimal format is: hours + minutes/60, i.e., floor(hhmm / 100) + (hhmm mod 100) / 60.
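A one-line base-R helper for this hhmm-to-decimal conversion:

```r
# Convert hhmm-format times (e.g. 1530 = 3:30pm) to decimal hours (15.5)
dec_time <- function(hhmm) hhmm %/% 100 + (hhmm %% 100) / 60
dec_time(c(830, 1530))  # 8.5 15.5
```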

Of course, you should always verify that your code did what you intended by checking the results.

## Create average speed
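With distance in miles and air time in minutes, average speed in miles per hour is (toy values shown, not the real flight data):

```r
flights <- data.frame(Distance = c(400, 1800), AirTime = c(65, 220))
flights$Speed <- flights$Distance / (flights$AirTime / 60)  # miles per hour
```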

## Choosing an appropriate Method

- The nonlinear trend over the day is the same shape on every day of the week
- Fridays are the worst days to fly by far, with Sunday being a close second. Expected delays are around 20 minutes during rush-hour (8pm)
- Wednesdays and Saturdays are the quietest days
- If you can manage it, fly on a Wednesday morning to minimize expected delays.

## Closing remarks


## Programming in R: A Case Study

## Prerequisites for the Tutorial

## How is R different from Python?

## What will I learn at the end of the Tutorial?

## Palmer Penguins: A Case Study

To begin, we need to load the Palmer Penguins dataset in the environment.

To install any package in an R environment, we follow the syntax:

install.packages("package name")

library("package name")

Take note that the # symbol denotes a comment, which we insert in the code to make it more readable.

## Basic Exploratory Data Analysis

head(name of the dataset)

head(name of the dataset, number of rows)

Similarly, we can also view the last few rows of the dataset using the function:

tail(name of the dataset)

tail(name of the dataset, number of rows)

To know the dimensions of the dataset, we use the dim() function. The syntax of the function is:

dim(name of the dataset)

str(name of the dataset)

count(name of the column)

summarize(new column = function(column name))

group_by(column name) %>%

filter(condition)

arrange(column name)
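These inspection functions can be tried on any data frame. Here the built-in mtcars dataset stands in for the penguins data, with base-R analogues of the dplyr verbs noted in comments:

```r
head(mtcars, 3)              # first three rows
dim(mtcars)                  # 32 rows, 11 columns
str(mtcars)                  # column names, types, example values
six.cyl <- subset(mtcars, cyl == 6)     # like dplyr::filter(mtcars, cyl == 6)
sorted  <- mtcars[order(mtcars$mpg), ]  # like dplyr::arrange(mtcars, mpg)
```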

Yay! We have explored our dataset and various statistical measures of it.

The basic syntax to make any plot using ggplot2 is as follows:

ggplot(dataset, aes(x = .., y = ..)) +

ggplot(dataset, aes(x = .., y = ..)) +

geom_point()

ggplot(dataset, aes(x = .., y = ..)) +

geom_line()

geom_bar()

To make a histogram, we use the syntax:

geom_histogram()

- Title : Appears at the top left of the plot in a big font size.
- Subtitle : Appears at the top left of the plot beneath the Title in a small font size.
- Caption : Appears as a note at the bottom right of the plot in a small font size.
- X : Appears as the label of the x-axis, in a medium font size.
- Y : Appears as the label of the y-axis, in a medium font size.

The text is added in the labs() function as follows:

labs(title=.., subtitle=.., caption=.., x=.., y=..)

## A sneak into Advanced Exploratory Data Analysis

To do this, there are two ways:

dataset[row indices, column indices]

We can see that the number of rows has decreased by 3 (from 344 to 341).

select(dataset name, column names)

From above, we can see that the dataset has 19 null values in total.

To remove these values, we use the na.omit() function.
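na.omit() drops every row that contains at least one missing value (toy data here, not the penguins):

```r
df <- data.frame(x = c(1, NA, 3), y = c("a", "b", NA))
clean <- na.omit(df)
nrow(clean)  # 1 complete row remains
```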

select_if(dataset name, condition)

Note: just by typing the name of the dataset, we can have a preview of the dataset.

To avoid any calculation error, let’s drop rows with missing values, and make it permanent.

As we can see, there are no missing values in the dataset.

All set now! Let’s plot the correlation heatmap.

Now let’s plot the heatmap. The syntax is rather simple:

corrplot(correlation between columns)
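corrplot() (from the corrplot package) takes a correlation matrix, which base R computes with cor(); mtcars stands in for the penguins here since all of its columns are numeric:

```r
M <- cor(mtcars)          # pairwise Pearson correlations of all columns
round(M[1:3, 1:3], 2)     # peek at one corner of the matrix
# corrplot::corrplot(M)   # would draw the heatmap (needs the corrplot package)
```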

Psst. scroll up for the cheat code!

https://cran.r-project.org/doc/manuals/R-intro.pdf

https://ggplot2.tidyverse.org/reference/


## Khushee Kapoor

Data Science & Engineering Undergraduate at Manipal Institute of Technology

## A predictive modeling case study

## Introduction


## The Hotel Bookings Data

## Data Splitting & Resampling

- the set held out for the purpose of measuring performance, called the validation set, and
- the remaining data used to fit the model, called the training set.
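In base-R terms this split is just random index partitioning (the case study itself uses the rsample package; the data and 25% holdout proportion below are assumed for illustration):

```r
set.seed(123)
n <- nrow(mtcars)                        # stand-in data
val_idx    <- sample(n, floor(0.25 * n))
validation <- mtcars[val_idx, ]          # held out to measure performance
training   <- mtcars[-val_idx, ]         # used to fit the model
```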

## A first model: penalized logistic regression

## Build the model

## Create the recipe

step_date() creates predictors for the year, month, and day of the week.

step_normalize() centers and scales numeric variables.

Putting all these steps together into a recipe for a penalized logistic regression model, we have:

## Create the workflow

## Create the grid for tuning

## Train and tune the model

Let’s select this value and visualize the validation set ROC curve:

## A second model: tree-based ensemble

## Build the model and improve training time

## Create the recipe and workflow

When we set up our parsnip model, we chose two hyperparameters for tuning:

We will use a space-filling design to tune, with 25 candidate models:

Here are our top 5 random forest models, out of the 25 candidates:

Let’s select the best model according to the ROC AUC metric. Our final tuning parameter values are:

The random forest is uniformly better across event probability thresholds.

## The last fit

## Where to next?

Here are some more ideas for where to go next:

Study up on statistics and modeling with our comprehensive books.

Keep up with the latest about tidymodels packages at the tidyverse blog.

Find ways to ask for help and contribute to tidymodels to help others.
