Challenge 4: Hotel Bookings

challenge_4

Teresa Lardo

hotel_bookings

Using mutate and lubridate to consolidate data

Author

Teresa Lardo

Published

March 21, 2023

Code

library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Read in data

For this challenge, we’ll read in the csv file on hotel bookings.

Code

library(readr)
hotel_bookings <- read_csv("_data/hotel_bookings.csv", show_col_types = FALSE)

Briefly describe the data

Code

head(hotel_bookings, 10)

# A tibble: 10 × 32
   hotel  is_ca…¹ lead_…² arriv…³ arriv…⁴ arriv…⁵ arriv…⁶ stays…⁷ stays…⁸ adults
   <chr>    <dbl>   <dbl>   <dbl> <chr>     <dbl>   <dbl>   <dbl>   <dbl>  <dbl>
 1 Resor…       0     342    2015 July         27       1       0       0      2
 2 Resor…       0     737    2015 July         27       1       0       0      2
 3 Resor…       0       7    2015 July         27       1       0       1      1
 4 Resor…       0      13    2015 July         27       1       0       1      1
 5 Resor…       0      14    2015 July         27       1       0       2      2
 6 Resor…       0      14    2015 July         27       1       0       2      2
 7 Resor…       0       0    2015 July         27       1       0       2      2
 8 Resor…       0       9    2015 July         27       1       0       2      2
 9 Resor…       1      85    2015 July         27       1       0       3      2
10 Resor…       1      75    2015 July         27       1       0       3      2
# … with 22 more variables: children <dbl>, babies <dbl>, meal <chr>,
#   country <chr>, market_segment <chr>, distribution_channel <chr>,
#   is_repeated_guest <dbl>, previous_cancellations <dbl>,
#   previous_bookings_not_canceled <dbl>, reserved_room_type <chr>,
#   assigned_room_type <chr>, booking_changes <dbl>, deposit_type <chr>,
#   agent <chr>, company <chr>, days_in_waiting_list <dbl>,
#   customer_type <chr>, adr <dbl>, required_car_parking_spaces <dbl>, …

Code

tail(hotel_bookings, 10)

# A tibble: 10 × 32
   hotel  is_ca…¹ lead_…² arriv…³ arriv…⁴ arriv…⁵ arriv…⁶ stays…⁷ stays…⁸ adults
   <chr>    <dbl>   <dbl>   <dbl> <chr>     <dbl>   <dbl>   <dbl>   <dbl>  <dbl>
 1 City …       0      44    2017 August       35      31       1       3      2
 2 City …       0     188    2017 August       35      31       2       3      2
 3 City …       0     135    2017 August       35      30       2       4      3
 4 City …       0     164    2017 August       35      31       2       4      2
 5 City …       0      21    2017 August       35      30       2       5      2
 6 City …       0      23    2017 August       35      30       2       5      2
 7 City …       0     102    2017 August       35      31       2       5      3
 8 City …       0      34    2017 August       35      31       2       5      2
 9 City …       0     109    2017 August       35      31       2       5      2
10 City …       0     205    2017 August       35      29       2       7      2
# … with 22 more variables: children <dbl>, babies <dbl>, meal <chr>,
#   country <chr>, market_segment <chr>, distribution_channel <chr>,
#   is_repeated_guest <dbl>, previous_cancellations <dbl>,
#   previous_bookings_not_canceled <dbl>, reserved_room_type <chr>,
#   assigned_room_type <chr>, booking_changes <dbl>, deposit_type <chr>,
#   agent <chr>, company <chr>, days_in_waiting_list <dbl>,
#   customer_type <chr>, adr <dbl>, required_car_parking_spaces <dbl>, …

Code

dim(hotel_bookings)

[1] 119390     32

The data set describes just under 120,000 bookings at two hotels over the course of 2 years - July 2015 through August 2017. There are 32 variables.

Identify variables that need to be mutated

This data set has many variables, and some of them can be combined. For instance, the arrival date for each booking is broken into 4 different variables - year of arrival, month of arrival, day of the month of arrival, and week of the year of arrival. These could be used to break down the data and create visualizations showing which months or even weeks of the year are the most/least booked, but that’s still a lot of variables just to answer the question “when does the customer plan to show up?”

Create a single Arrival Date column

We can use the mutate() and make_date() functions to turn some of these columns into a single column that lists the entire date in a date format. Because the column showing the expected month of arrival uses the names of the months - July, August, etc. - instead of the number of the month (1-12), the values in that column will also have to be altered so we can use the make_date() function.

Code

# Load dplyr and lubridate from the library to enable the piping operator and the make_date function
library(dplyr)
library(lubridate)
hotel_bookings <- hotel_bookings %>%
# Use mutate and case_when to change the month values from names to numbers  
    mutate(month = case_when(
       arrival_date_month == "January" ~ 1,
       arrival_date_month == "February" ~ 2,
       arrival_date_month == "March" ~ 3,
       arrival_date_month == "April" ~ 4,
       arrival_date_month == "May" ~ 5,
       arrival_date_month == "June" ~ 6,
       arrival_date_month == "July" ~ 7,
       arrival_date_month == "August" ~ 8,
       arrival_date_month == "September" ~ 9,
       arrival_date_month == "October" ~ 10,
       arrival_date_month == "November" ~ 11,
       arrival_date_month == "December" ~ 12,
     )) %>%
# Use the year, day, and new month columns to create a new column showing the entire arrival date
    mutate(arrival_date = make_date(arrival_date_year, month, arrival_date_day_of_month)) %>%
# Use select to remove the extraneous date columns
    select(-c(arrival_date_year, arrival_date_month, arrival_date_day_of_month, month, arrival_date_week_number)) %>%
# Use select to move the new arrival_date column closer to the left side of the data set
    select(hotel, arrival_date, everything())

Because we’ve removed 4 of the original date columns (year, month, day, and week) and added one new column to show the entire date of expected arrival, the number of columns (variables) in our data set should change from 32 to 29.

Code

dim(hotel_bookings)

[1] 119390     29

Using the dim() function confirms our new number of variables.

Code

head(hotel_bookings)

# A tibble: 6 × 29
  hotel   arrival_…¹ is_ca…² lead_…³ stays…⁴ stays…⁵ adults child…⁶ babies meal 
  <chr>   <date>       <dbl>   <dbl>   <dbl>   <dbl>  <dbl>   <dbl>  <dbl> <chr>
1 Resort… 2015-07-01       0     342       0       0      2       0      0 BB   
2 Resort… 2015-07-01       0     737       0       0      2       0      0 BB   
3 Resort… 2015-07-01       0       7       0       1      1       0      0 BB   
4 Resort… 2015-07-01       0      13       0       1      1       0      0 BB   
5 Resort… 2015-07-01       0      14       0       2      2       0      0 BB   
6 Resort… 2015-07-01       0      14       0       2      2       0      0 BB   
# … with 19 more variables: country <chr>, market_segment <chr>,
#   distribution_channel <chr>, is_repeated_guest <dbl>,
#   previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
#   reserved_room_type <chr>, assigned_room_type <chr>, booking_changes <dbl>,
#   deposit_type <chr>, agent <chr>, company <chr>, days_in_waiting_list <dbl>,
#   customer_type <chr>, adr <dbl>, required_car_parking_spaces <dbl>,
#   total_of_special_requests <dbl>, reservation_status <chr>, …

Combine Week & Weekend Night columns into a single Total Nights column

One other thing we can do to make our data more concise is to combine the columns that show the number of weekend nights and week nights. We can mutate those two columns into one that shows simply the total number of nights of the booked stay. If we add a new Total Nights column and remove the two original columns, we should have 28 columns.

Code

hotel_bookings <- hotel_bookings %>%
# Mutate new column that adds the values from the weekend/week night columns
  mutate(total_nights = stays_in_weekend_nights + stays_in_week_nights) %>%
# Remove the original columns
  select(-c(stays_in_weekend_nights, stays_in_week_nights)) %>%
# Rearrange the order of the columns
  select(hotel, arrival_date, is_canceled, lead_time, total_nights, everything())

Let’s use the dim() function for a quick sanity check.

Code

dim(hotel_bookings)

[1] 119390     28

28 columns confirmed!

Code

head(hotel_bookings)

# A tibble: 6 × 28
  hotel   arrival_…¹ is_ca…² lead_…³ total…⁴ adults child…⁵ babies meal  country
  <chr>   <date>       <dbl>   <dbl>   <dbl>  <dbl>   <dbl>  <dbl> <chr> <chr>  
1 Resort… 2015-07-01       0     342       0      2       0      0 BB    PRT    
2 Resort… 2015-07-01       0     737       0      2       0      0 BB    PRT    
3 Resort… 2015-07-01       0       7       1      1       0      0 BB    GBR    
4 Resort… 2015-07-01       0      13       1      1       0      0 BB    GBR    
5 Resort… 2015-07-01       0      14       2      2       0      0 BB    GBR    
6 Resort… 2015-07-01       0      14       2      2       0      0 BB    GBR    
# … with 18 more variables: market_segment <chr>, distribution_channel <chr>,
#   is_repeated_guest <dbl>, previous_cancellations <dbl>,
#   previous_bookings_not_canceled <dbl>, reserved_room_type <chr>,
#   assigned_room_type <chr>, booking_changes <dbl>, deposit_type <chr>,
#   agent <chr>, company <chr>, days_in_waiting_list <dbl>,
#   customer_type <chr>, adr <dbl>, required_car_parking_spaces <dbl>,
#   total_of_special_requests <dbl>, reservation_status <chr>, …