4  Unit 2 Part 2

5 Fundamental of R Programming

Topics Covered in Unit 2 Part 2
  • Setting up R Project
  • Data manipulations
  • Data transformations, normalization, standardization
  • Missing values imputation
  • Dummy variables
  • Data visualization (2D and 3D)
  • Descriptive analytics-case study covering data manipulation, measures of central tendency, measures of dispersion, measures of distribution, measures of associations, t-test, Hypothesis testing (t-test, ANOVA, Chi Squared Test, F-Test), basic statistical modeling framework

To start with content of this chapter, we need to create a new project. Let’s create new project with this sequence:

-> Go to File tab -> Click New Project tab -> In New Project Wizard, click New Project Directory -> In New Project Wizard, click Project Type: Quarto Project

Now you see, following window:

In this window, give name to your directory in the box “Directory Name” and set the directory path with browse option. Now, click the Create Project tab to start your first ever project. This is how your workspace will look like:

In the output pane (in blue circle), you can see that there are three files are created- _quarto.yml, name.qmd and name.Rproj. Here “name” is inherited from directory name we have given while setting the project. You can give any name by selecting the .qmd file and using rename tab in output pane. We will do all coding in the .qmd file only. We can add more .qmd file in the project by clicking File Tab>Quarto Document and the file will be available in the project.

_quarto.yml file contains the information related to project. We can use many parameters to define the projects through changes in _quarto.yml file. We can create book, website, blog, manuscript, etc. by changing the parameters in this file.

name.Rproj file is created to denote the project environment. We don’t need to change or modify this.

We can add more files and folders in the project using the Folder tab and File in the output pane. It is advised to create different folders for different types of files like - for datasets - create data folder, for images - create images folder, etc.

The red arrow in the image pointed towards Source tab. Click on this tab to set the .qmd from editor mode to source mode. Now copy each step to build your first data project.


5.0.1 Uploading Dataset

In this step, we upload the dataset in R workspace which we are going to analyze. DOwnload the dataset from this link. We need to install a few libraries to process the dataset:

#install.packages("dplyr")
library(dplyr)
Warning: package 'dplyr' was built under R version 4.4.2

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
#install.packages("tidyverse")
library(tidyverse)
Warning: package 'ggplot2' was built under R version 4.4.3
Warning: package 'tidyr' was built under R version 4.4.2
Warning: package 'purrr' was built under R version 4.4.2
Warning: package 'lubridate' was built under R version 4.4.2
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ forcats   1.0.0     ✔ readr     2.1.5
✔ ggplot2   3.5.2     ✔ stringr   1.5.1
✔ lubridate 1.9.4     ✔ tibble    3.2.1
✔ purrr     1.0.4     ✔ tidyr     1.3.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
#install.packages("janitor")
library(janitor)
Warning: package 'janitor' was built under R version 4.4.2

Attaching package: 'janitor'

The following objects are masked from 'package:stats':

    chisq.test, fisher.test

After downloading the dataset zomato.csv, create a data folder in the project directory and move the dataset into the folder. Now upload the dataset in workspace with following command chunk:

zomato = read.csv("data/zomato.csv", header = TRUE, stringsAsFactors = FALSE)

Now we have uploaded the dataset in csv format, and named as zomato. You can give any name to it. To see complete dataset in R workspace, we can type View(zomato) command in Console.

To know how many rows and columns are in the dataset:

dim(zomato)
[1] 9551   21

In the dataset, we have 9551 rows and 21 columns in the dataset. The 21 columns are variables and rows are the observations.

Let’s check the variable names in the dataset:

colnames(zomato)
 [1] "Restaurant.ID"        "Restaurant.Name"      "Country.Code"        
 [4] "City"                 "Address"              "Locality"            
 [7] "Locality.Verbose"     "Longitude"            "Latitude"            
[10] "Cuisines"             "Average.Cost.for.two" "Currency"            
[13] "Has.Table.booking"    "Has.Online.delivery"  "Is.delivering.now"   
[16] "Switch.to.order.menu" "Price.range"          "Aggregate.rating"    
[19] "Rating.color"         "Rating.text"          "Votes"               

Now, we know the dataset variables and following is the definitions of the variables:

  • Restaurant Id: Unique id of every restaurant across various cities of the world
  • Restaurant Name: Name of the restaurant
  • Country Code: Country in which restaurant is located
  • City: City in which restaurant is located
  • Address: Address of the restaurant
  • Locality: Location in the city
  • Locality Verbose: Detailed description of the locality
  • Longitude: Longitude coordinate of the restaurant’s location
  • Latitude: Latitude coordinate of the restaurant’s location
  • Cuisines: Cuisines offered by the restaurant
  • Average Cost for two: Cost for two people in different currencies 👫
  • Currency: Currency of the country
  • Has Table booking: yes/no
  • Has Online delivery: yes/ no
  • Is delivering: yes/ no
  • Switch to order menu: yes/no
  • Price range: range of price of food
  • Aggregate Rating: Average rating out of 5
  • Rating color: depending upon the average rating color
  • Rating text: text on the basis of rating of rating
  • Votes: Number of ratings casted by people

We need to also know thke data type contained in each variable:

str(zomato)
'data.frame':   9551 obs. of  21 variables:
 $ Restaurant.ID       : int  6317637 6304287 6300002 6318506 6314302 18189371 6300781 6301290 6300010 6314987 ...
 $ Restaurant.Name     : chr  "Le Petit Souffle" "Izakaya Kikufuji" "Heat - Edsa Shangri-La" "Ooma" ...
 $ Country.Code        : int  162 162 162 162 162 162 162 162 162 162 ...
 $ City                : chr  "Makati City" "Makati City" "Mandaluyong City" "Mandaluyong City" ...
 $ Address             : chr  "Third Floor, Century City Mall, Kalayaan Avenue, Poblacion, Makati City" "Little Tokyo, 2277 Chino Roces Avenue, Legaspi Village, Makati City" "Edsa Shangri-La, 1 Garden Way, Ortigas, Mandaluyong City" "Third Floor, Mega Fashion Hall, SM Megamall, Ortigas, Mandaluyong City" ...
 $ Locality            : chr  "Century City Mall, Poblacion, Makati City" "Little Tokyo, Legaspi Village, Makati City" "Edsa Shangri-La, Ortigas, Mandaluyong City" "SM Megamall, Ortigas, Mandaluyong City" ...
 $ Locality.Verbose    : chr  "Century City Mall, Poblacion, Makati City, Makati City" "Little Tokyo, Legaspi Village, Makati City, Makati City" "Edsa Shangri-La, Ortigas, Mandaluyong City, Mandaluyong City" "SM Megamall, Ortigas, Mandaluyong City, Mandaluyong City" ...
 $ Longitude           : num  121 121 121 121 121 ...
 $ Latitude            : num  14.6 14.6 14.6 14.6 14.6 ...
 $ Cuisines            : chr  "French, Japanese, Desserts" "Japanese" "Seafood, Asian, Filipino, Indian" "Japanese, Sushi" ...
 $ Average.Cost.for.two: int  1100 1200 4000 1500 1500 1000 2000 2000 6000 1100 ...
 $ Currency            : chr  "Botswana Pula(P)" "Botswana Pula(P)" "Botswana Pula(P)" "Botswana Pula(P)" ...
 $ Has.Table.booking   : chr  "Yes" "Yes" "Yes" "No" ...
 $ Has.Online.delivery : chr  "No" "No" "No" "No" ...
 $ Is.delivering.now   : chr  "No" "No" "No" "No" ...
 $ Switch.to.order.menu: chr  "No" "No" "No" "No" ...
 $ Price.range         : int  3 3 4 4 4 3 4 4 4 3 ...
 $ Aggregate.rating    : num  4.8 4.5 4.4 4.9 4.8 4.4 4 4.2 4.9 4.8 ...
 $ Rating.color        : chr  "Dark Green" "Dark Green" "Green" "Dark Green" ...
 $ Rating.text         : chr  "Excellent" "Excellent" "Very Good" "Excellent" ...
 $ Votes               : int  314 591 270 365 229 336 520 677 621 532 ...

In the dataset, we have 5 integer variables, 13 character variables, 8 numeric variables, and 0 factor variables.

To process the data smoothly, we need the column names in the suitable format. Column names of dataset contains “.” symbol which will cause problem and secondly, the column names have capital letters. We can improve these column names with janitor package commands:

zomato = zomato |>
  
  janitor::clean_names()

colnames(zomato)
 [1] "restaurant_id"        "restaurant_name"      "country_code"        
 [4] "city"                 "address"              "locality"            
 [7] "locality_verbose"     "longitude"            "latitude"            
[10] "cuisines"             "average_cost_for_two" "currency"            
[13] "has_table_booking"    "has_online_delivery"  "is_delivering_now"   
[16] "switch_to_order_menu" "price_range"          "aggregate_rating"    
[19] "rating_color"         "rating_text"          "votes"               

Now we see consistency in column names and all “.” symbol is removed from all column names. We need to do multiple changes in the dataset before we can analyze the dataset and gain the insights. So, we begin with data manipulation step now. Data manipulation is also called data preparation, data wrangling step.

5.0.2 Data Manipulation

Certain variables contain repetitive values. We need to change them from character variable type to factor variable type. These variables are has_table_booking, has_online_delivery, is_delivering_now, switch_to_order_menu, rating_color, and rating_text.

zomato$has_table_booking = as.factor(zomato$has_table_booking)

zomato$has_online_delivery = as.factor(zomato$has_online_delivery)

zomato$is_delivering_now = as.factor(zomato$is_delivering_now)

zomato$switch_to_order_menu = as.factor(zomato$switch_to_order_menu)

zomato$rating_color = as.factor(zomato$rating_color)

zomato$rating_text = as.factor(zomato$rating_text)

Now check whether variable is changed from character variable to factor variable. Write command chunk in console:

class(zomato$rating_color)

Likewise, we can check for other variables also. DO IT ON YOUR LAPTOP NOW.

We need to find out if some variables are really useful or not for the data analysis purpose. We can remove these variables. We can remove the variables if we find following:

  • If variable is a unique identifier
  • If variable contains a constant value
  • If variable contains no value
  • If variable contains large part as NAs
  • If variable contains address or uneccessary landmark detains

In zomato dataset, we have restaurant_id, restaurant_name,address, locality, and locality_verbose which can be atrributed to the unique identifier and address. We shall remove them because they are not useful for data analysis.

zomato$restaurant_id = NULL

zomato$restaurant_name = NULL

zomato$address = NULL

zomato$locality = NULL

zomato$locality_verbose = NULL

Now, look at the summary of each variable from following command chunk:

summary(zomato)
  country_code        city             longitude          latitude     
 Min.   :  1.00   Length:9551        Min.   :-157.95   Min.   :-41.33  
 1st Qu.:  1.00   Class :character   1st Qu.:  77.08   1st Qu.: 28.48  
 Median :  1.00   Mode  :character   Median :  77.19   Median : 28.57  
 Mean   : 18.37                      Mean   :  64.13   Mean   : 25.85  
 3rd Qu.:  1.00                      3rd Qu.:  77.28   3rd Qu.: 28.64  
 Max.   :216.00                      Max.   : 174.83   Max.   : 55.98  
   cuisines         average_cost_for_two   currency         has_table_booking
 Length:9551        Min.   :     0       Length:9551        No :8393         
 Class :character   1st Qu.:   250       Class :character   Yes:1158         
 Mode  :character   Median :   400       Mode  :character                    
                    Mean   :  1199                                           
                    3rd Qu.:   700                                           
                    Max.   :800000                                           
 has_online_delivery is_delivering_now switch_to_order_menu  price_range   
 No :7100            No :9517          No:9551              Min.   :1.000  
 Yes:2451            Yes:  34                               1st Qu.:1.000  
                                                            Median :2.000  
                                                            Mean   :1.805  
                                                            3rd Qu.:2.000  
                                                            Max.   :4.000  
 aggregate_rating     rating_color     rating_text       votes        
 Min.   :0.000    Dark Green: 301   Average  :3737   Min.   :    0.0  
 1st Qu.:2.500    Green     :1079   Excellent: 301   1st Qu.:    5.0  
 Median :3.200    Orange    :3737   Good     :2100   Median :   31.0  
 Mean   :2.666    Red       : 186   Not rated:2148   Mean   :  156.9  
 3rd Qu.:3.700    White     :2148   Poor     : 186   3rd Qu.:  131.0  
 Max.   :4.900    Yellow    :2100   Very Good:1079   Max.   :10934.0  

We are keeping country_code and city variables in spite of that these are unique identifier variables. We will be using them when we will visualize the data in map. Now, we check for missing values in the dataset.

5.0.3 Missing Vakye Analysis

sum(is.na(zomato))
[1] 0

In the dataset, there is no missing values hence we don’t need to do any missing value treatment. We can also check each variable wise missing values:

sapply(zomato, function(x) sum(is.na(x)))
        country_code                 city            longitude 
                   0                    0                    0 
            latitude             cuisines average_cost_for_two 
                   0                    0                    0 
            currency    has_table_booking  has_online_delivery 
                   0                    0                    0 
   is_delivering_now switch_to_order_menu          price_range 
                   0                    0                    0 
    aggregate_rating         rating_color          rating_text 
                   0                    0                    0 
               votes 
                   0 

5.0.4 Data transformations, normalization, standardization

Sometimes, we need to transform the variable to make the data values consitent across the observations. For example, When we look at variable average_cost_for_two. If all values of this variable are in same currency than it should not be a problem but we check variable currency, there are 12 different currencies are used to report average_cost_for_two. We can’t compare prices in different currencies. We also need to consider Purchasing Power Parity to rationalize the prices.

Currency PPP Conversion
Pounds 0.679372
Botswana Pula 5.081844545
Brazilian Real 2.480586012
Dollar 1
Emirati Dirham 2.326086126
Indian Rupees 20.44566932
Indonesian Rupiah 4747.908576
NewZealand 1.46197
Qatari Rial 2.201602287
Rand 0.456
Sri Lankan Rupee 87.26860544
Turkish Lira 11.554529

Source: World Bank link

With PPP Conversion factor, we will use formula to convert all data into Indian Rupees:

\[ Local Price = Base Price * (PPP Conversion Factor(Country)/PPP Conversion Factor(Base)) \] \(LocalPrice\) = Dollar equivalent in local currency

\(PPP Conversion Factor(Country)\) = PPP Conversion Factor

\(BasePrice\) = $100 (US PPP = 1.0)

\(PPP Conversion Factor(Base)\) = USD 1.0

After getting LocalPrice in USD, we will convert the LocalPrice in IndianRupee value with following formula:

\[ INR_Value = LocalPrice * USDINR Corrency Ex Rate \] The USDINR coversion rate we are considering is 83.

Let’s convert average_cost_for_two:

average_cost_for_two = ifelse(zomato$currency=="Pounds(\x8c\xa3)", 
                                     zomato$average_cost_for_two * 20.44566932/11.554529,
                                     ifelse(zomato$currency=="Botswana Pula(P)", 
                                     zomato$average_cost_for_two * 20.44566932/5.081844545,
                                     ifelse(zomato$currency=="Brazilian Real(R$)", 
                                     zomato$average_cost_for_two * 20.44566932/2.480586012,
                                     ifelse(zomato$currency=="Dollar($)", 
                                     zomato$average_cost_for_two * 20.44566932/1,
                                     ifelse(zomato$currency=="Emirati Diram(AED)", 
                                     zomato$average_cost_for_two * 20.44566932/2.326086126,
                                     ifelse(zomato$currency=="Indonesian Rupiah(IDR)", 
                                     zomato$average_cost_for_two * 20.44566932/4747.908576,
                                     ifelse(zomato$currency=="NewZealand($)", 
                                     zomato$average_cost_for_two * 20.44566932/1.46197,
                                     ifelse(zomato$currency=="Qatari Rial(QR)", 
                                     zomato$average_cost_for_two * 20.44566932/ 2.201602287,
                                     ifelse(zomato$currency=="Rand(R)", 
                                     zomato$average_cost_for_two * 20.44566932/0.456,
                                     ifelse(zomato$currency=="Sri Lankan Rupee(LKR)", 
                                     zomato$average_cost_for_two * 20.44566932/87.26860544,
                                     ifelse(zomato$currency=="Turkish Lira(TL)", 
                                     zomato$average_cost_for_two * 20.44566932/11.554529,zomato$average_cost_for_two
                                     )))))))))))

zomato$average_cost_for_two = round(average_cost_for_two, digits = 0)

Now we have all prices in Indian Rupees. We can see from the summary:

summary(zomato$average_cost_for_two)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      0     300     500     760     700  143927 

When we summary, maximum value of the order is 143927 and minimum value is 0. But order value varies between 300 to 700 most of the time. We need to check whether the Min and Max values are outlier or not and how many values are below 300 and more than 700. The value 300 is 1st Quartile and value 700 is third quartile. Let’s find out how many values are more than 3rd quartile 700:

nrow(zomato[zomato$average_cost_for_two>700,])
[1] 2380

The number of observations where value of average_cost_for_two is more than 700 are 2380 which is approx. 25% of the total observations in the dataset which is huge. We can discard the observations carrying 700+ value of average_cost_for_two. Let’s see the visualisation box plot for this:

hist(zomato$average_cost_for_two, n=1000)

We can see that the density plot of the average cost is right skewed due to presence of outliers on the higher side. Let’s plot the log form of this and see what happens:

hist(log(zomato$average_cost_for_two), n=1000, col="blue")

In the plot, we can see that the distribution of average_cost_for_two values is normally distributed. We will use the log value for further data analysis now. We add new variable log_avg_cost to the dataset:

zomato$log_avg_cost = log(zomato$average_cost_for_two)

We learnt one method of data transformation here i.e. log transformation to achieve normality in the data. Normality of data is an important factor in the data analysis. Many statistical theories, tests and models assume that the variable values should be normally distributed. To comply with that we transform the variable to get closer to the normal distribution.

Other type of transformations and their application criteria is as follows:

Data Transformation Type Criteria to Apply Sample Command in R
Log Transformation Data is right-skewed; variance increases with mean log(x) or log1p(x) (for zero values)
Square Root Transformation Count data or moderately skewed data sqrt(x)
Standardization (Z-score) Variables on different scales; need mean 0, sd 1 scale(x)
Min-Max Normalization Scale data to [0,1] for machine learning (x - min(x)) / (max(x) - min(x))
Box-Cox Transformation Data is positive and skewed; find best power transform library(MASS); boxcox(lm(y ~ x))
Categorical Encoding Convert factors for modeling as.numeric(factor(x)) or model.matrix(~x-1)
Logit Transformation Data is proportion/fraction in (0,1) log(x / (1 - x))
Differencing Time series with trends to make stationary diff(x)

Purpose of transformation to achieve the normality in the data as per the middle figure shown here:

Variable price_range contains four repeating values - 1,2,3 and 4. Let’s change this variable from integer to factor variable:

zomato$price_range = as.factor(zomato$price_range)

Let’s also see how values of votes are distributed:

hist(zomato$votes, n=1000)

We don’t transform votes variable in spite of the fact that the data is normally distributed because the votes are abolute values which we will use to rank different restaurants.

Data standardization is used when we have two different variables measured in different units needs to be compared. With standardization, variables are scaled on common scale, usually between 0 and 1.

plot(zomato$average_cost_for_two, zomato$aggregate_rating)

Now, we standardize the variables between 0 and 1 with function min_max_scale and then we log transform both variable and create a scatter plot:

min_max_scale <- function(data) {
  # Avoid division by zero if all values are the same
  if (max(data) == min(data)) {
    return(rep(0, length(data)))
  }
  return((data - min(data)) / (max(data) - min(data)))
}

scaled_cost = min_max_scale(zomato$average_cost_for_two)

scaled_rating = min_max_scale(zomato$aggregate_rating)

plot(log(scaled_cost), log(scaled_rating))

This plot gives better visualization of the interaction between two variables then the previous one.