In this data-driven study we examine how several factors influence taxi driver tips. Since any analysis is only as good as the data behind it, we work with taxi rides in the city of Chicago for the entire year of 2017. After a thorough look at the available columns, we plan to rigorously test and explore the following questions in the data:
# All the modules we'll ever need for the analysis; see inline comments for more information
import dask                              # For large-scale data analysis
import dask.dataframe as dd              # Provides a pandas-like environment for large-scale data
from dask.distributed import Client      # A compute cluster to parallelize tasks
import matplotlib.pyplot as plt          # The ubiquitous Matplotlib
import seaborn as sns                    # Seaborn: just because it looks way cooler
import scipy.stats as stats              # SciPy: for statistical analysis
import swifter                           # Swifter: speeds up pandas apply() calls
import geopandas as gpd                  # GeoPandas: for geospatial analysis
import itertools                         # For getting combinations (pairs) in an array, etc.
import numpy as np                       # Accelerated linear algebra computations
from IPython.display import HTML, Image  # Display .png/.html files in the notebook
client = Client()
client
df = dd.read_csv('2017_*.csv', dtype={'Trip Seconds': 'float64'})
number_of_records = len(df)
print('There are {} records present'.format(number_of_records))
Hence there are around 25 million records, each corresponding to a taxi ride taken in Chicago during 2017.
############INSERT EXPLANATION/TABLE AS TO WHICH COLUMNS WE CHOSE AND WHY############
df.describe().compute()
Computing the Pearson correlation coefficient between trip time and tips to check for a linear relationship
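As a reminder, the Pearson correlation coefficient between two variables $x$ and $y$ is defined as:

$$ r = \frac{\sum_{i}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i}(x_i - \bar{x})^2}\,\sqrt{\sum_{i}(y_i - \bar{y})^2}} $$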
time_df = df[['Trip Seconds', 'Tips']]
correlation = time_df.corr()
correlation = correlation.compute()
corr_coef = correlation['Tips']['Trip Seconds']
corr_coef
The correlation coefficient of 0.3209 indicates a moderately positive correlation. However, this is computed over all of the data, outliers included.
Let us now try to remove outliers on both Tips and Trip Times and compute the correlation coefficient
In contrast to the typical definition of an outlier as any data point more than $1.5 \times \text{IQR}$ beyond the quartiles, we offer a slightly more liberal definition: an outlier is anything that falls above the 97th percentile of the tip values. We suggest this because, from past experience, the most generous tip a taxi driver is likely to get is around USD 5 to USD 10, and upon checking the distribution of quantiles as shown below, the 97th percentile of tip values is 10.3; we feel that is a suitable (if somewhat arbitrary) threshold based on personal experience. A more stringent outlier cutoff on Tips might distort the nature of our statistical analysis.
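For comparison, a minimal sketch of the conventional $1.5 \times \text{IQR}$ upper fence on the same `time_df` frame might look like the following (illustrative only; this threshold is not used in the analysis below):

```python
# Illustrative sketch: conventional 1.5 * IQR upper fence for Tips.
# Not used in the analysis below -- we use the 97th percentile instead.
q1, q3 = time_df['Tips'].quantile([0.25, 0.75]).compute()
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr
print('1.5 * IQR upper fence for Tips: {:.2f}'.format(upper_fence))
```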
outlier_time = dask.array.percentile(time_df['Trip Seconds'].dropna().values, q = 97)
outlier_tips = dask.array.percentile(time_df['Tips'].dropna().values, q = 97)
outlier_tips[0].compute()
filtered_time_df = time_df[(time_df['Trip Seconds'] <= outlier_time[0]) & (time_df['Tips'] <= outlier_tips[0])]
correlation_filtered = filtered_time_df.corr()
correlation_filtered = correlation_filtered.compute()
corr_coef = correlation_filtered['Tips']['Trip Seconds']
corr_coef
We can see that the correlation coefficient has improved to around 0.46, signifying a stronger positive relationship between Tips and trip duration.
Finally, we compute the correlation coefficient between Tips and time duration for those trips which actually gave a non-zero tip
filtered_time_df_nonzeros = filtered_time_df[filtered_time_df['Tips'] != 0]
correlation_filtered = filtered_time_df_nonzeros.corr()
correlation_filtered = correlation_filtered.compute()
corr_coef = correlation_filtered['Tips']['Trip Seconds']
corr_coef
THIS IS A HUGE IMPROVEMENT!
As we can see, the correlation coefficient implies that roughly 75% of the variance in Tips can be explained by the Trip Seconds data alone. Moreover, the higher degree of correlation means that, for those passengers who do tip (barring outliers), there is a stronger linear association between the tips received and the amount of time a passenger spends inside a taxi.
To further substantiate this finding, we visualize a linear regression line with a 95% confidence interval. Since there are around 25 million records, we select a random sample of 500 rides to show the general trend of the regression line.
thousand_random = filtered_time_df_nonzeros.sample(frac = 4.0156697860121856e-05)
thousand_random = thousand_random.compute()
sns.regplot(x = 'Trip Seconds', y = 'Tips', data = thousand_random)
plt.title('Regressing Tips vs Trip Seconds on a Sample from Taxi Rides')
plt.show()
The graph above shows that the confidence interval is fairly narrow, suggesting that the linear trend is in fact significant.
Seeing this, we now try to perform statistical validation of our strong linear relationship
Let us check if our Tips (y-values) are normally distributed
VISUAL TEST
y_values = filtered_time_df_nonzeros['Tips'].dropna().to_dask_array()
sns.distplot(y_values, kde = True)
plt.title('Distribution of Tips')
plt.xlabel('Tip')
plt.ylabel('KDE density')
plt.show()
From the histogram we can see a large peak centered at 2 USD, with smaller peaks at 1 USD and 3 USD. The distribution is heavily right-skewed, with smaller and smaller peaks at the other integer tip values; it is therefore unimodal with a heavy right skew. This suggests that the distribution is not normal, which is to be expected: intuitively, tip values should be right-skewed to account for those people who give an uncommonly high tip!
Note that the KDE density can be used as a proxy for frequency.
STATISTICAL TEST
We perform the following test with the null and alternative hypotheses below, using a significance level ($\alpha$) of 0.05:
$H_0:$ The distribution of tip values is normal
$H_A:$ The distribution of tip values is not normal
results = stats.normaltest(y_values)
results
The p-value clearly indicates that we can reject the null hypothesis. Hence we conclude that the distribution of tips is not normal.
However, our goal is to establish that there is a statistically significant linear relationship between tips and trip duration.
Hence, we use the simple fact that $\rho(x,y) = \rho(y,x)$, where $\rho$ is the correlation coefficient between two variables $x$ and $y$, namely tips and time spent in the cab.
Let us check if time duration (the previous x-variable) is normally distributed instead.
VISUAL TEST
y_values = filtered_time_df_nonzeros['Trip Seconds'].dropna().to_dask_array()
sns.distplot(y_values, kde = True)
plt.title('Distribution of Trip Times')
plt.xlabel('Trip Seconds')
plt.ylabel('KDE density')
plt.show()
From the histogram we can see that this distribution again has a heavy right skew while remaining unimodal. Visually, it does not appear to be normal.
Note that the KDE density can be used as a proxy for frequency.
STATISTICAL TEST
We perform the following test with the null and alternative hypotheses below, using a significance level ($\alpha$) of 0.05:
$H_0:$ The distribution of time spent in cabs is normal
$H_A:$ The distribution of time spent in cabs is not normal
results = stats.normaltest(y_values)
results
The p-value clearly indicates that we can reject the null hypothesis. Hence we conclude that the distribution of trip durations is not normal.
We perform the following test with the null and alternative hypotheses below, using a significance level ($\alpha$) of 0.05:
$H_0:$ The slope of the regression line, $b_1$,[1] is 0
$H_A:$ The slope of the regression line, $b_1$,[1] is not 0
[1] When both variables are standardized, the slope of the regression line equals the correlation coefficient ($\rho$) between the dependent and independent variables.
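To illustrate the footnote, here is a minimal, self-contained sketch on synthetic data (not the taxi dataset) showing that after z-scoring both variables, the fitted slope matches the Pearson correlation coefficient:

```python
# Illustrative sketch on synthetic data: with both variables z-scored,
# the least-squares slope equals the Pearson correlation coefficient.
rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 2.0 * x + rng.normal(size=1000)   # a noisy linear relationship

zx = (x - x.mean()) / x.std()
zy = (y - y.mean()) / y.std()

slope_std, _, r, _, _ = stats.linregress(zx, zy)
print(slope_std, r)                   # the two values coincide
```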
filtered_time_df_nonzeros_dropna = filtered_time_df_nonzeros.dropna()
slope, intercept, r_value, p_value, std_err = stats.linregress(filtered_time_df_nonzeros_dropna['Trip Seconds'],filtered_time_df_nonzeros_dropna['Tips'])
print('correlation coefficient is: {}'.format(r_value), 'p-value is: {}'.format(p_value))
The p-value clearly indicates that we can reject the null hypothesis. Hence we conclude that there is a statistically significant, strong linear relationship between Tips and trip duration.
Since the normality assumptions underlying inference on the linear regression are not met, and since we have so much data, let us use the non-parametric Spearman rank correlation to estimate whether there is a monotonically increasing relationship between cab travel times and tips.
Assumptions for Spearman Rank Correlation: the observations are paired and each variable is measured on at least an ordinal (rankable) scale; normality is not required, which is why the test is suitable here.
STATISTICAL TEST
We perform the following test with the null and alternative hypotheses below, using a significance level ($\alpha$) of 0.05:
$H_0:$ There is no monotonic relationship between Trip Times and Tips
$H_A:$ There is a monotonic relationship between Trip Times and Tips
results = stats.spearmanr(filtered_time_df_nonzeros_dropna['Trip Seconds'],filtered_time_df_nonzeros_dropna['Tips'])
results
The moderately high positive value of the statistic suggests that the relationship is monotonically increasing: greater time spent in the cab results in greater tips received by drivers, for those passengers who do tip. The low p-value indicates that we can reject the null hypothesis, so this monotonically increasing relationship is statistically significant.
We select the Company column, which contains the name of the company each trip's driver belongs to.
company_df = df[['Tips', 'Company']]
We see that the number of trips for which the company is null is in fact zero
numnulls = company_df['Company'].isna().sum().compute()
numnulls
unique_companies = company_df['Company'].unique().compute()
We see that there are 86 companies!
len(unique_companies)
Let us select only the "major-players" in Chicago
To do this we list the proportion of rides with non-null tips for each company
company_df_nonnulltips = company_df.dropna()
proportion_of_trips = (company_df_nonnulltips['Company'].value_counts())/len(company_df_nonnulltips['Company'])
proportion_of_trips = proportion_of_trips.compute()
proportion_of_trips.hist()
plt.title('Histogram of the Proportion of Trips by Taxi Company')
plt.ylabel('Number of Companies')
plt.xlabel('Proportion of Trips')
plt.show()
We see that most (~70 out of 86) of the companies account for less than 2% of the rides in the city. Let us compute some descriptive statistics to help identify the major players.
proportion_of_trips.describe(percentiles=[0.25, 0.5, 0.75, 0.9, 0.95, 0.98])
From the descriptive statistics, let us define a major player as any company whose share of taxi rides in the city of Chicago exceeds the 90th percentile.
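As a side note, the hardcoded threshold used below could equally be derived programmatically; a small sketch (not the code we actually ran) would be:

```python
# Derive the 90th-percentile share programmatically instead of hardcoding it
threshold_90 = proportion_of_trips.quantile(0.90)
major_players = proportion_of_trips[proportion_of_trips > threshold_90]
```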
ninety_percentile = proportion_of_trips[proportion_of_trips > 3.809464e-02]
ninety_percentile.plot(kind = 'bar')
plt.title('Top Taxi Companies in Chicago')
plt.xlabel('Company Name')
plt.ylabel('Ride Share')
plt.show()
Let us collect the tips for the trips offered by these companies that are non-null, non-zero, and fall within the 97th percentile of Tips, to eliminate the outliers.
top_companies_tips = company_df[(company_df['Tips'] <= outlier_tips[0]) & (company_df['Company'].isin(ninety_percentile.index.tolist())) & (company_df['Tips'] != 0)]
categories = dict(zip(list(ninety_percentile.index), [i+1 for i in range(len(ninety_percentile))]))
top_companies_tips['Company Category'] = top_companies_tips['Company'].apply(lambda c: categories[c], meta = int)
import pandas as pd
pd.DataFrame.from_dict(categories, orient = 'index', columns = ['Category Number'])
Let us plot a violin plot to see if there are any differences in the distribution of Tips across companies
plt.rcParams["figure.figsize"] = [16,9]
sns.violinplot(x="Company Category", y="Tips", data=top_companies_tips.compute())
plt.title('Distribution of Tips by Company')
plt.savefig('CATPLOT.png', dpi = 400)
We see that the distributions of tips are extremely similar across companies, with peaks corresponding to USD 1, USD 2, and USD 3. However, companies 2, 5, 8, and 9 show slight additional peaks at USD 4 or more.
The boxplots inside these violins, however, tell a different story: the median tips across the various companies are clearly not all the same. Chicago Carriage Cab Corp, Taxi Affiliation Service Yellow, and Medallion Leasin have a median tip close to USD 3, whereas Flash Cab, Taxi Affiliation Services, and Yellow Cab have a median closer to USD 2. Lastly, City Service, Sun Taxi, and Blue Ribbon Taxi Association Inc. have a median tip lower than USD 2.
Since none of these distributions is normal, we use a non-parametric test (the Kruskal-Wallis test) to check whether tips received differ across companies. Since each company has more than 5 rides, the assumptions of this test are satisfied.
We perform the following test with the null and alternative hypotheses below, using a significance level ($\alpha$) of 0.05:
$H_0:$ The median tip received by drivers is the same across companies
$H_A:$ The median tip received by drivers is not the same across companies
# Collect the Tips series for each company category and run the Kruskal-Wallis test
samples = [top_companies_tips[top_companies_tips['Company Category'] == cat].Tips
           for cat in categories.values()]
stats.kruskal(*samples)
This suggests that the median tip across companies is in fact not the same, an observation that is also clearly visible in the boxplots in the figure above. The differences in tips received are most likely due to differences in business models across the companies; however, such causal explanations are merely speculative, and further analysis would be needed to move from association to causality.
In this section we are forced to use single-core pandas/GeoPandas, which is inherently slower than Dask. So, to keep the computation feasible, we work with a representative sample of the dataset. For the sampling strategy we use proportionate allocation based on a simple random sample, and to make our results more robust we repeat the analysis over 4 successive iterations. Finally, to address ethical considerations, we drop the coordinates themselves and work with census tracts, which are far less granular than exact coordinates. Further, with the dataset's description in mind, we drop all records for which no census tract information is available.
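As an aside, a minimal sketch of what proportionate allocation by census tract could look like on a pandas frame is shown below (a hypothetical helper; in practice a simple random sample of the whole frame is already proportionate in expectation):

```python
# Hypothetical sketch of proportionate allocation by stratum on a pandas frame.
def proportionate_sample(pdf, stratum_col, frac, seed=0):
    """Sample the same fraction from every stratum, preserving stratum proportions."""
    return (pdf.groupby(stratum_col, group_keys=False)
               .apply(lambda g: g.sample(frac=frac, random_state=seed)))
```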
df_geographic = df[['Tips', 'Dropoff Census Tract']]
df_geographic = df_geographic.dropna(subset = ['Dropoff Census Tract'])
print('There are {} records left'.format(len(df_geographic)))
So we see that we still have a large number of records (~17 million) left! Let us now filter out the tips that are outliers or 0 USD.
outlier_tips = dask.array.percentile(df['Tips'].dropna().values, q = 97)
filtered_geo_df = df_geographic[(df_geographic['Tips'] <= outlier_tips[0]) & (df_geographic['Tips'] != 0)]
print('There are {} records left'.format(len(filtered_geo_df)))
Even after this filtering, we still have a large number of records (~7 million) left for analysis. To simplify the analysis we choose a random subset of 12,000 taxi rides (an arbitrarily high threshold, beyond which our computers started taking too long). We recognize that using only a random subset affects the power of the conclusions we draw; however, since Dask does not natively support GeoPandas, we are left with little choice computationally speaking.
Using this random sample of ~12,000 taxi rides, we do an inner join with census tract information retrieved from the United States Census Bureau for the State of Illinois for 2017. This provides the polygons (geographical areas) that constitute each census tract.
df_geographic = filtered_geo_df[['Tips', 'Dropoff Census Tract']]
sample = df_geographic.sample(frac = 12000/7310603)
sample_df = sample.compute()
Reading the census tracts for the State of Illinois from the U.S. Census Bureau
shapefile_df = gpd.read_file('./shapefiles/cb_2017_17_tract_500k.shp')
shapefile_df['GEOID'] = shapefile_df['GEOID'].astype(float)
geo_df_merged = sample_df.merge(shapefile_df, how = 'inner', left_on='Dropoff Census Tract', right_on='GEOID')
Merging Entries and converting it to a Geopandas Geo Data Frame
geo_df_merged_shp = geo_df_merged[['Tips', 'GEOID', 'geometry']]
geo_df_merged_shp = gpd.GeoDataFrame(geo_df_merged_shp, crs=shapefile_df.crs, geometry='geometry')
Plotting everything out using Matplotlib
plt.rcParams["figure.figsize"] = [16,9]
ax = plt.gca()
ax.grid(color='#b8b8b8', linestyle='--', linewidth=2, zorder=0)
ax.set_facecolor("#f2f2f2")
geo_df_merged_shp.plot(column = 'Tips', cmap='OrRd', legend=True, ax = ax)
plt.title('Degree of Spatial Clustering of Taxi Tips across Chicago', fontsize=15)
plt.xlabel('Longitude', fontsize=12)
plt.ylabel('Latitude', fontsize=12)
plt.savefig('plotted.png', dpi = 300)
plt.close()
### We have manually added the annotation on the scale to make things a bit clearer without cluttering the figure.
Image('plotted.png')
We can see that this map does not look anything like the city of Chicago! However, we do see certain areas with clusters of high tips and others with low tips. This motivates the need to test for spatial autocorrelation.
Spatial Autocorrelation 1
Spatial autocorrelation is the term used to describe the presence of systematic spatial variation in a variable. Positive spatial autocorrelation is the tendency for sites or areas that are close to one another to have similar values of the variable (i.e., both high or both low). Negative spatial autocorrelation is the tendency for adjacent values to be dissimilar (high values next to low values). The presence of spatial autocorrelation in map data has often been taken as indicative that there is something of interest in the map that calls for further investigation in order to understand the reasons behind the observed spatial variation.
In order to test for overall spatial autocorrelation in this map, we use a statistic known as the Global Moran's I coefficient. Its interpretation is much akin to a correlation coefficient: Moran's I ranges between -1 and 1, where 1 indicates complete positive spatial autocorrelation, -1 indicates complete negative spatial autocorrelation, and 0 indicates spatial randomness. We compute Moran's I using the following formula:
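With $n$ regions, spatial weights $w_{ij}$, observed values $x_i$, and mean $\bar{x}$, the standard definition of the Global Moran's I is:

$$ I = \frac{n}{\sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}} \cdot \frac{\sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}\,(x_i - \bar{x})(x_j - \bar{x})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} $$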
The spatial weights matrix $W$ is computed through an inverse-distance metric, $w_{ij} \propto 1/d_{ij}$, so that points near a chosen region in space are weighted more heavily than points farther away.
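For intuition only, here is a minimal NumPy sketch of this computation with inverse-distance weights (this is not the implementation we actually used, described next; `coords` is a hypothetical array of tract-centroid coordinates):

```python
# Illustrative sketch of Global Moran's I with inverse-distance weights.
# `values` are the tip values; `coords` is a hypothetical (n, 2) array of tract centroids.
def morans_i(values, coords):
    x = np.asarray(values, dtype=float)
    n = len(x)
    # Pairwise Euclidean distances between centroids
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    with np.errstate(divide='ignore'):
        w = 1.0 / d                   # inverse-distance weights
    np.fill_diagonal(w, 0.0)          # no self-weight
    z = x - x.mean()
    return (n / w.sum()) * (w * np.outer(z, z)).sum() / (z ** 2).sum()
```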
We attempted to compute the Global Moran's I using the library PySAL; however, the latest version of the library failed to install on our computer. To work around this, we used a free trial of the geospatial software ArcGIS Pro, an industry-standard alternative that offers comprehensive spatial statistics through ArcPy, its in-house Python interface, in addition to ArcMap. The spatial analysis was still interfaced through GeoPandas: GeoPandas allowed us to create a shapefile from the dataset, which we could then load into ArcPy to write Python scripts performing the spatial analysis. We attach all of the Python code we wrote in the code cells of this notebook.
We found that the only way to transfer our data would be to convert our GeoDataFrame to a shapefile.
geo_df_merged_shp.to_file('tipsfil.shp')
Now we can work with Python within the ArcGIS Pro console. First we compute the global spatial autocorrelation present in the tips. Note that our shapefile is 'tipsfil'; the second argument specifies which column of the shapefile to test for autocorrelation (Tips); GENERATE_REPORT produces a visualization of the spatial autocorrelation statistic; INVERSE_DISTANCE is the method used to build the weights matrix (alternatives include Queen's contiguity, Rook's contiguity, etc.); and EUCLIDEAN_DISTANCE specifies how distances between features are measured.
Code
arcpy.stats.SpatialAutocorrelation("tipsfil", "Tips", "GENERATE_REPORT", "INVERSE_DISTANCE", "EUCLIDEAN_DISTANCE", "NONE", None, None)
From the report we can clearly see the presence of autocorrelation between dropoff location and taxi driver tips. The Global Moran's I statistic is weakly positive, which suggests that the spatial distribution of high and/or low tips in the dataset is more spatially clustered than would be expected if the underlying spatial processes were random, albeit weakly so. This indicates a dependence between dropoff location and the tips received, for drivers who do receive a tip for the ride.
Having quantified the location dependence of tips, let us find out whether any particular locations are hotspots for high tips!
We compute this using the Getis-Ord $G_i^*$ statistic. According to the ArcGIS Pro documentation, the HotSpots function in ArcMap allows one to detect statistically significant clusters of high tip values (hot spots) and low tip values (cold spots). The formula of the $G_i^*$ statistic is provided below.
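The standard form of the statistic, for feature $i$ with values $x_j$, spatial weights $w_{ij}$, and $n$ features, is:

$$ G_i^* = \frac{\displaystyle\sum_{j=1}^{n} w_{ij} x_j \;-\; \bar{X}\sum_{j=1}^{n} w_{ij}}{S\,\sqrt{\dfrac{n\displaystyle\sum_{j=1}^{n} w_{ij}^2 - \Bigl(\displaystyle\sum_{j=1}^{n} w_{ij}\Bigr)^2}{n-1}}},
\qquad \bar{X} = \frac{1}{n}\sum_{j=1}^{n} x_j,
\qquad S = \sqrt{\frac{1}{n}\sum_{j=1}^{n} x_j^2 - \bar{X}^2} $$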
Based on these values, hot spots are colored red and cold spots blue. Further, the intensity of the color varies with the level of statistical significance of the respective value.
Code
import arcpy
arcpy.stats.HotSpots("tipsfil", "Tips", r"C:\Users\gani1\OneDrive\Documents\ArcGIS\Projects\COGS108\COGS108.gdb\tipsfil_HotSpots", "FIXED_DISTANCE_BAND", "EUCLIDEAN_DISTANCE", "NONE", None, None, None, "NO_FDR")
Are Airports a Center for High Tips?
We chose the U.S. labelled basemap in ArcGIS Pro and used the to_featurelayer() function in ArcPy to overlay the results on it. Using this, and the code above, we can see that the entire city of Chicago is a cold spot for tips. We attribute this to the fact that taxi rides within the city limits do not cost as much, and hence do not warrant a large tip. However, taxi rides to the outskirts of the city typically end up as hot spots, which agrees with our conclusion from Part 1 that greater time spent in the cab results in greater tips. Finally, we see that both airports serving Chicago (MDW, ORD) are hotspots for tips! This is probably because the airports are quite a distance away from the city center. However, association does not mean causation, and further investigation is necessary to understand why airports in Chicago attract higher tips.
In the final section we investigate whether there are any significant differences in the taxi tips received during the holiday season (Thanksgiving/Christmas) compared to other parts of the year. Our hypothesis is that if people are more generous during the holidays, we would expect greater tips on a typical ride!
We first isolate the Trip Start Timestamp and Tips columns. As usual, we then retain all tips within the 97th percentile of tip values, and we also drop all tips that are nil (0 USD).
holiday_df = df[['Trip Start Timestamp', 'Tips']]
outlier_tips = dask.array.percentile(df['Tips'].dropna().values, q = 97)
filtered_holiday_df = holiday_df[(holiday_df['Tips'] <= outlier_tips[0]) & (holiday_df['Tips'] > 0)]
filtered_holiday_df = filtered_holiday_df.dropna()
tg_df = filtered_holiday_df.copy()
xmas_df = filtered_holiday_df.copy()
non_hol_df = filtered_holiday_df.copy()
We write user-defined functions to filter the dates around Christmas and Thanksgiving 2017. Thanksgiving was 11/23/2017, so we keep dates from 11/18/17 through 11/27/17 to account for people travelling to Chicago for Thanksgiving. Similarly, we keep dates from 12/20/17 to 12/31/17 for Christmas travel. Finally, we also write a function to select the dates that do not fall into either of these brackets (i.e. non-Christmas, non-Thanksgiving time frames). Note that there are plenty of holidays celebrated in the U.S., and it is our subjective choice to use the two biggest holiday seasons.
def date_check_tg(x):
x = x[:10]
if x[:2] != "11":
return np.nan
elif 18 <= int(x[3:5]) <= 27:
return x
else:
return np.nan
def non_tg_non_xmas(x):
    # Keep only dates outside both holiday windows (timestamps start with MM/DD/YYYY)
    x = x[:10]
    month, day = x[:2], int(x[3:5])
    if month == "11" and 18 <= day <= 27:   # Thanksgiving window
        return np.nan
    if month == "12" and 20 <= day <= 31:   # Christmas window
        return np.nan
    return x
def date_check_xmas(x):
x = x[:10]
if x[:2] != "12":
return np.nan
elif 20 <= int(x[3:5]) <= 31:
return x
else:
return np.nan
tg_df["Trip Start Timestamp"] = tg_df["Trip Start Timestamp"].apply(date_check_tg, meta=('Trip Start Timestamp', 'object'))
non_hol_df["Trip Start Timestamp"] = non_hol_df["Trip Start Timestamp"].apply(non_tg_non_xmas, meta=('Trip Start Timestamp', 'object'))
xmas_df["Trip Start Timestamp"] = xmas_df["Trip Start Timestamp"].apply(date_check_xmas, meta=('Trip Start Timestamp', 'object'))
We drop nulls, since that is how our user-defined functions mark non-conforming dates.
tg_df = tg_df.dropna()
non_hol_df = non_hol_df.dropna()
xmas_df = xmas_df.dropna()
In order to draw a seaborn violin plot, we need a categorical variable, so we manually create a feature called season, representing which holiday season the date of the ride falls into. The variable season takes on 3 values: 'thanksgiving', 'xmas', and 'non-holiday'.
We then concatenate these datasets to create two larger datasets, all_data_with_season and all_data_with_thanksgiving, each of which contains the categorical variable season!
xmas_df['season'] = 'xmas'
non_hol_df['season'] = 'non-holiday'
tg_df['season'] = 'thanksgiving'
all_data_with_season = dd.concat([xmas_df, non_hol_df])
all_data_with_thanksgiving = dd.concat([tg_df, non_hol_df])
Note that we could easily have concatenated xmas_df, non_hol_df, and tg_df into a single dataframe, but we choose not to, because in the following cells we perform the Kruskal-Wallis test separately for each holiday against the non-holiday period. The Kruskal-Wallis test is essentially a non-parametric version of ANOVA. In short, it tests the following hypotheses:
$H_0:$ The distribution of tips during Thanksgiving and the non-holiday season is the same
$H_A:$ The distribution of tips received by drivers during Thanksgiving and the non-holiday season is not the same
stats.kruskal(tg_df.Tips, non_hol_df.Tips)
Using our earlier defined cut-off of $\alpha$ = 0.05, we fail to reject the null hypothesis. Hence, this statistical test suggests that there are no significant differences between Thanksgiving tips and non-holiday tips. This is also clearly reflected in the violin plot generated below!
sns.violinplot(x = 'season', y = 'Tips', data = all_data_with_thanksgiving.compute())
plt.savefig('thanksgiving.png')
plt.close()
Similarly, we do the same analysis for Christmas and the other periods of the year.
$H_0:$ The median tip during Christmas and the non-holiday season is the same
$H_A:$ The median tip received by drivers during Christmas and the non-holiday season is not the same
Using our earlier defined cut-off of $\alpha$ = 0.05, we can reject the null hypothesis. Hence, this statistical test suggests that there are significant differences between Christmas tips and non-holiday tips. While the overall shapes of the distributions look quite similar in the violin plots, the inner side-by-side boxplots show that the medians are indeed different. In particular, the 75th percentile of tips during Christmas is lower than during other periods of the year, which in turn pulls the median down.
stats.kruskal(xmas_df.Tips, non_hol_df.Tips)
sns.violinplot(x = 'season', y = 'Tips', data = all_data_with_season.compute())
plt.savefig('xmas.png')
plt.close()
According to the City of Chicago's data collection team, personal privacy was prioritized when developing the data set, through specially developed de-identification and aggregation techniques. These include aggregation by time, in which all trips were rounded to the nearest 15-minute interval. The data collection team also exploits the sheer size of the city of Chicago, which is spread over 800 census tracts ranging between 89,000 sq. ft and 84 million sq. ft. As a result of this, coupled with the time-based de-identification, it is impossible to know the precise time and place a trip occurred beyond a 15-minute window and an 89,000-square-foot area. Since the dataset does provide the approximate location of a trip, another layer of protection was added to avoid linking individuals' trip location data to their identities. These techniques and explanations are obtained directly from the City of Chicago's data privacy site referenced below:
Source: City of Chicago: Data Privacy
In addition to the above facts, we performed a quick analysis of the columns of the dataset we obtained. We saw that no information about the rider is published; the only columns related to the rider are the fare, the locations, and the start/stop times of the trip. No data such as name, address, phone number, email, or any other information that might invade the rider's privacy is collected. The columns we chose to analyse, namely Tips, Company, and Trip Start Time, relate only to the taxi rides themselves and adhere to the safe harbor methods discussed in class.
Our efforts to ensure Privacy and Ethics in Analysis
Despite the City of Chicago's efforts in anonymizing the data, securing personal privacy, and aggregating by geographical location and time, there is still an obvious concern about the exact pick-up and drop-off locations (latitude, longitude) provided. While these cannot be linked directly to a specific individual, patterns of individuals could readily be found using this data set, which is an invasion of privacy. For example, a frequent drop-off location could be extrapolated to a specific individual, a clear invasion of privacy. This also violates the safe harbor "geographical subdivisions" method, indicating that such data should not be published publicly. In addition, trip and taxi IDs are disclosed in the data set; while these are anonymized, each still relates to a unique individual, which could be an invasion of privacy.
We handled this issue by ignoring the latitude and longitude coordinates (dropping them from the data set), and instead focused our geographical analysis on the census tracts disclosed by the data set. These census tracts are far more general than the coordinates provided, since they are by definition aggregations rather than specific locations and are simply abstract regions created by the City of Chicago.
The drop-off census tracts which did not guarantee anonymity (indicated by NaNs in the dataset), were dropped from the geospatial portion of our analysis. Thus, there is no zip-code or specific coordinates involved in the analysis, just a general region (akin to a neighbourhood) associated with the drop-off location.
Similarly, in an effort to respect privacy and ethical norms, we simply chose to drop the taxi ID and trip ID columns from our analysis, following the insightful discussion in class where one could potentially "hack" the anonymization (hashing) process.
Other than this, the data set is open for public use, so there are no issues with us utilizing it for our project. There should not be any bias within the data set either, as it is collected by an agency unrelated to the cab services; the files exist simply for documentation purposes and for analyses like this one.