Univariate Plots Section
## InvoiceNo StockCode Description Quantity InvoiceDateTime UnitPrice
## 1 0 0 0 0 0 0
## CustomerID Country Date InvoiceYear InvoiceMonth InvoiceWday InvoiceHour
## 1 0 0 0 0 0 0 0
## Value
## 1 0
## InvoiceNo StockCode Description Quantity
## 1: 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6
## 2: 536365 71053 WHITE METAL LANTERN 6
## 3: 536365 84406B CREAM CUPID HEARTS COAT HANGER 8
## 4: 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6
## 5: 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6
## 6: 536365 22752 SET 7 BABUSHKA NESTING BOXES 2
## InvoiceDateTime UnitPrice CustomerID Country Date
## 1: 2010-12-01 08:26:00 2.55 17850 United Kingdom 2010-12-01
## 2: 2010-12-01 08:26:00 3.39 17850 United Kingdom 2010-12-01
## 3: 2010-12-01 08:26:00 2.75 17850 United Kingdom 2010-12-01
## 4: 2010-12-01 08:26:00 3.39 17850 United Kingdom 2010-12-01
## 5: 2010-12-01 08:26:00 3.39 17850 United Kingdom 2010-12-01
## 6: 2010-12-01 08:26:00 7.65 17850 United Kingdom 2010-12-01
## InvoiceYear InvoiceMonth InvoiceWday InvoiceHour Value
## 1: 2010 Dec Wed 8 6.5025
## 2: 2010 Dec Wed 8 11.4921
## 3: 2010 Dec Wed 8 7.5625
## 4: 2010 Dec Wed 8 11.4921
## 5: 2010 Dec Wed 8 11.4921
## 6: 2010 Dec Wed 8 58.5225
## Classes 'data.table' and 'data.frame': 406829 obs. of 14 variables:
## $ InvoiceNo : chr "536365" "536365" "536365" "536365" ...
## $ StockCode : chr "85123A" "71053" "84406B" "84029G" ...
## $ Description : chr "WHITE HANGING HEART T-LIGHT HOLDER" "WHITE METAL LANTERN" "CREAM CUPID HEARTS COAT HANGER" "KNITTED UNION FLAG HOT WATER BOTTLE" ...
## $ Quantity : int 6 6 8 6 6 2 6 6 6 32 ...
## $ InvoiceDateTime: POSIXct, format: "2010-12-01 08:26:00" "2010-12-01 08:26:00" ...
## $ UnitPrice : num 2.55 3.39 2.75 3.39 3.39 7.65 4.25 1.85 1.85 1.69 ...
## $ CustomerID : int 17850 17850 17850 17850 17850 17850 17850 17850 17850 13047 ...
## $ Country : chr "United Kingdom" "United Kingdom" "United Kingdom" "United Kingdom" ...
## $ Date : Date, format: "2010-12-01" "2010-12-01" ...
## $ InvoiceYear : num 2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...
## $ InvoiceMonth : Ord.factor w/ 12 levels "Jan"<"Feb"<"Mar"<..: 12 12 12 12 12 12 12 12 12 12 ...
## $ InvoiceWday : Ord.factor w/ 7 levels "Sun"<"Mon"<"Tue"<..: 4 4 4 4 4 4 4 4 4 4 ...
## $ InvoiceHour : int 8 8 8 8 8 8 8 8 8 8 ...
## $ Value : num 6.5 11.49 7.56 11.49 11.49 ...
## - attr(*, ".internal.selfref")=<externalptr>
After the transformations, our dataset consists of 14 variables with 406,829 observations and no missing values.
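For reference, a minimal sketch of how these transformations could be reproduced with data.table and lubridate. The raw column name `InvoiceDate` and its format string are assumptions about the source file:

```r
library(data.table)
library(lubridate)

# retail holds the raw UCI "Online Retail" file read with fread();
# rows without a CustomerID have already been dropped (541,909 -> 406,829).
retail[, InvoiceDateTime := as.POSIXct(InvoiceDate, format = "%m/%d/%Y %H:%M")]
retail[, `:=`(
  Date         = as.Date(InvoiceDateTime),
  InvoiceYear  = year(InvoiceDateTime),
  InvoiceMonth = month(InvoiceDateTime, label = TRUE),  # ordered factor Jan < ... < Dec
  InvoiceWday  = wday(InvoiceDateTime, label = TRUE),   # ordered factor Sun < ... < Sat
  InvoiceHour  = hour(InvoiceDateTime)
)]
# Intended definition per the text: line value in Sterling.
# (Note: the Value column printed above equals UnitPrice^2, e.g. 6.5025 = 2.55^2,
# which suggests the original computation may have squared UnitPrice by mistake.)
retail[, Value := Quantity * UnitPrice]
```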
InvoiceNo
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 3.00 12.00 18.33 24.00 542.00
##
## Outliers are 9.17 % of the total occurrences
## [1] "22190 No of distinct InvoiceNos"
Each occurrence of an `InvoiceNo` marks a unique transaction. The number of times an `InvoiceNo` appears is equivalent to the basket size.
We observe an extremely right-skewed distribution with a prevalence of single-item orders, a third quartile of 24, a maximum of 542 occurrences, and quite a few outliers in between.
- There are 22,190 distinct `InvoiceNo`s (including cancellations).
InvoiceDateTime
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 4.00 13.00 19.88 26.00 543.00
##
## Outliers are 9.92 % of the total occurrences
`InvoiceDateTime` is the timestamp of every transaction, and its frequency distribution is essentially the same as that of `InvoiceNo`.
The summary statistics nevertheless look slightly different from those of `InvoiceNo`, although they should be identical. A plausible explanation is that distinct invoices occasionally share the same minute-level timestamp, which merges their counts.
How does the number of transactions look over time?
We observe the number of transactions decreasing towards the beginning of 2011 and a sharp increase in the last quarter of 2011.
StockCode
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 12.0 49.0 110.4 139.0 2077.0
##
## Outliers are 12.4 % of the total occurrences
Description
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 11.0 45.0 104.7 131.0 2070.0
##
## Outliers are 12.97 % of the total occurrences
The distributions are right-skewed, with a prevalence of single occurrences.
## [1] "3684 No of distinct stock codes"
## [1] "3885 No of distinct descriptions"
`StockCode` and `Description` should represent the same thing: the unique stock items, described either by a code or by a human-readable description. Yet there are 3,684 distinct stock codes and 3,885 distinct descriptions, a discrepancy of about 200 items.
The distributions are very similar but the summary statistics differ. What could be the reason?
## InvoiceNo StockCode Description Quantity InvoiceDateTime
## 1: 536370 POST POSTAGE 3 2010-12-01 08:45:00
## 2: C536379 D Discount -1 2010-12-01 09:41:00
## 3: 536403 POST POSTAGE 1 2010-12-01 11:27:00
## 4: 536527 POST POSTAGE 1 2010-12-01 13:04:00
## 5: 536540 C2 CARRIAGE 1 2010-12-01 14:05:00
## 6: 536569 M Manual 1 2010-12-01 15:35:00
## 7: 536569 M Manual 1 2010-12-01 15:35:00
## 8: 536779 BANK CHARGES Bank Charges 1 2010-12-02 15:08:00
## 9: 536840 POST POSTAGE 1 2010-12-02 18:27:00
## 10: 536852 POST POSTAGE 1 2010-12-03 09:51:00
## UnitPrice CustomerID Country Date InvoiceYear
## 1: 18.00 12583 France 2010-12-01 2010
## 2: 27.50 14527 United Kingdom 2010-12-01 2010
## 3: 15.00 12791 Netherlands 2010-12-01 2010
## 4: 18.00 12662 Germany 2010-12-01 2010
## 5: 50.00 14911 EIRE 2010-12-01 2010
## 6: 1.25 16274 United Kingdom 2010-12-01 2010
## 7: 18.95 16274 United Kingdom 2010-12-01 2010
## 8: 15.00 15823 United Kingdom 2010-12-02 2010
## 9: 18.00 12738 Germany 2010-12-02 2010
## 10: 18.00 12686 France 2010-12-03 2010
## InvoiceMonth InvoiceWday InvoiceHour Value
## 1: Dec Wed 8 324.0000
## 2: Dec Wed 9 756.2500
## 3: Dec Wed 11 225.0000
## 4: Dec Wed 13 324.0000
## 5: Dec Wed 14 2500.0000
## 6: Dec Wed 15 1.5625
## 7: Dec Wed 15 359.1025
## 8: Dec Thu 15 225.0000
## 9: Dec Thu 18 324.0000
## 10: Dec Fri 9 324.0000
We see that there are generic service transactions that are not actual sales, so there is no one-to-one relationship between `StockCode` and `Description`.
A Discount in the `Description` marks a cancelled order, which is denoted by a `D` in the `StockCode` and a `C` at the start of the `InvoiceNo`.
How many such special transactions are there?
## # A tibble: 8 x 3
## StockCode Description n
## <chr> <chr> <int>
## 1 POST POSTAGE 1196
## 2 M Manual 465
## 3 C2 CARRIAGE 134
## 4 D Discount 77
## 5 CRUK CRUK Commission 16
## 6 DOT DOTCOM POSTAGE 16
## 7 BANK CHARGES Bank Charges 12
## 8 PADS PADS TO MATCH ALL CUSHIONS 4
## [1] "The number of special transactions in the ddataset are : 1920"
1920 is much less than the total number of invoices.
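For reference, a sketch of how these special codes can be isolated: regular product codes are five digits, optionally followed by a letter, so everything else is a service line. The regex is an assumption that happens to match the eight codes above:

```r
library(dplyr)

# Keep only lines whose StockCode does NOT look like a product code
special <- retail %>%
  filter(!grepl("^[0-9]{5}[A-Z]?$", StockCode)) %>%
  count(StockCode, Description, sort = TRUE)

special            # the tibble above
sum(special$n)     # 1920 special transaction lines
```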
Let’s have a look at the top 20 descriptions by the number of occurrences.
- The first 6 descriptions have a significantly higher number of occurrences than the rest.
- The top items are a light holder, a cakestand and a retrospot bag, which gives us an idea of the sort of items the online store is selling.
- `POSTAGE`, which is a generic description, is included in the top 10.
Quantity
## nbr.val nbr.null nbr.na min max
## 4.068290e+05 0.000000e+00 0.000000e+00 -8.099500e+04 8.099500e+04
## range sum median mean SE.mean
## 1.619900e+05 4.906888e+06 5.000000e+00 1.206130e+01 3.899045e-01
## CI.mean.0.95 var std.dev coef.var
## 7.642011e-01 6.184839e+04 2.486934e+02 2.061911e+01
##
## Outliers are 13.43 % of the total occurrences
We observe a huge variance (6.18e+04) and large symmetric outliers on both sides. We limited the range of the x-axis in steps to get a better idea of the distribution.
- A minimum and maximum of ±80,995, which could be a big order and its cancellation.
- Values are more symmetric the bigger they get, which suggests that most big orders get cancelled.
We observe a prevalence of orders with `Quantity` less than 25, with a big peak at 12, which is also the upper quartile.
How many negative values (i.e. cancelled orders) are there?
## [1] 8905
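Negative quantities should line up with the `C`-prefixed invoice numbers; a quick cross-check (sketch):

```r
# Cancelled order lines: negative Quantity
retail[Quantity < 0, .N]

# Cross-check against cancellation invoices (InvoiceNo starting with "C")
retail[grepl("^C", InvoiceNo), .N]   # should be (nearly) identical
```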
UnitPrice
## nbr.val nbr.null nbr.na min max
## 4.068290e+05 4.000000e+01 0.000000e+00 0.000000e+00 3.897000e+04
## range sum median mean SE.mean
## 3.897000e+04 1.407820e+06 1.950000e+00 3.460471e+00 1.086732e-01
## CI.mean.0.95 var std.dev coef.var
## 2.129961e-01 4.804592e+03 6.931516e+01 2.003056e+01
##
## Outliers are 12.53 % of the total occurrences
- The distribution is so right-skewed that it does not even show up as a histogram. After transforming to log10, the `UnitPrice` distribution looks uniform.
We again limited the range of the x-axis to take a better look.
- There is a huge outlier at 38,970 Sterling that skews the distribution extremely to the right. I wonder what this item is.
- The most frequent price (the mode) is 1.25, the median is 1.95 and the minimum is 0!
I wonder which items have a price of 0.
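A quick look answers that; the 40 zero-price lines correspond to the `nbr.null` count above (sketch):

```r
# Lines with a zero unit price
retail[UnitPrice == 0, .(InvoiceNo, StockCode, Description, Quantity)]
```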
Value
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000e+00 2.000e+00 4.000e+00 4.817e+03 1.400e+01 1.519e+09
##
## Outliers are 17.73 % of the total occurrences
We zoomed in below 10,000 to be able to visualize the distribution.
- The maximum order value of 1.519e+09 looks like an extreme outlier, since it is very far from the third quartile of 14.
- The outlier percentage, at 17.73%, is higher than any we have seen so far.
CustomerID
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 17.00 42.00 93.05 102.00 7983.00
##
## Outliers are 12.6 % of the total occurrences
- The frequency distribution here is similar to the previous ones, with most `CustomerID` counts at 1.
- One `CustomerID` appears 7,983 times, far above the average frequency of 93.05.
Country
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10 127 358 10995 1259 361878
##
## Outliers are 16.22 % of the total occurrences
- The distribution is so right-skewed it barely shows on the histogram.
- The mean (10,995) is roughly 30 times the median (358).
- The max (the UK) at 361,878 occurrences is very far from the rest of the distribution, but this is expected, since it is where the majority of transactions happen.
How many countries are there and which have the highest occurrences?
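The proportions below can be obtained with a base-R one-liner (sketch):

```r
# Share of transaction lines per country, most frequent first
country_share <- sort(prop.table(table(retail$Country)), decreasing = TRUE)
length(country_share)         # 37 countries
as.numeric(country_share)     # the proportions printed below
```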
## [1] 8.895089e-01 2.333904e-02 2.087118e-02 1.839839e-02 6.226203e-03
## [6] 5.828001e-03 5.085675e-03 4.613732e-03 3.637892e-03 3.094666e-03
## [11] 2.669426e-03 1.973802e-03 1.863191e-03 1.708334e-03 1.528898e-03
## [16] 1.135612e-03 9.856721e-04 9.561757e-04 8.799766e-04 8.381900e-04
## [21] 7.152882e-04 6.145088e-04 5.997606e-04 5.628901e-04 4.473624e-04
## [26] 3.711633e-04 3.588731e-04 3.121705e-04 1.671464e-04 1.499401e-04
## [31] 1.425660e-04 1.106116e-04 8.603123e-05 7.865713e-05 7.374106e-05
## [36] 4.178660e-05 2.458035e-05
The UK dominates with 88.9% of occurrences, which makes sense since that is where the online store is based.
- We observe a small group of countries (Germany, France, EIRE) with a much higher number of occurrences than the rest.
Univariate Analysis
What is the structure of your dataset?
Original Dataset Variables
- InvoiceNo: Invoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction. If this code starts with letter ‘c’, it indicates a cancellation.
- StockCode: Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product.
- Description: Product (item) name. Nominal.
- Quantity: The quantities of each product (item) per transaction. Numeric.
- InvoiceDate: Invoice date and time. Numeric, the day and time when each transaction was generated.
- UnitPrice: Unit price. Numeric, Product price per unit in sterling.
- CustomerID: Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer.
- Country: Country name. Nominal, the name of the country where each customer resides.
`InvoiceNo`, `StockCode`, `Description`, `CustomerID` and `Country` are categorical variables; `Quantity`, `UnitPrice` and `Value` are numeric variables; `InvoiceDateTime` is a datetime variable.
After the transformations, our dataset consists of 14 features (InvoiceNo, StockCode, Description, Quantity, InvoiceDateTime, UnitPrice, CustomerID, Country, Date, InvoiceYear, InvoiceMonth, InvoiceWday, InvoiceHour and Value), with 406,829 observations and no missing values.
Main Observations:
- The frequency distributions of `InvoiceNo`, `StockCode`, `Description` and `CustomerID` are extremely right-skewed, with a prevalence of single occurrences.
- UK is the country where the most orders come from (88.9%)
- There is a discrepancy between the distinct value counts of `Description` and `StockCode`, which in theory should be identical. Both columns include 1,920 special transaction lines that do not correspond to specific items but to logistical charges (postage, discounts, fees) on some invoices. This represents a small portion of the total invoices, though.
- Most orders are placed towards the end of 2011.
- 8,905 of the order lines are cancellations. It looks like most of them are the big outlying orders.
What is/are the main feature(s) of interest in your dataset?
`InvoiceDateTime` and its components, `Value`, `UnitPrice` and `Quantity`. I would like to explore the correlation between the datetime features (month, weekday, hour) and the revenue, the unit price and the country.
What other features in the dataset do you think will help support your investigation into your feature(s) of interest?
`Country`, `CustomerID`, `StockCode`/`Description`.
Did you create any new variables from existing variables in the dataset?
I created the `Value` variable by multiplying `UnitPrice` and `Quantity` to have a measure of the value in Sterling of each order line. I also extracted `InvoiceMonth`, `InvoiceWday` and `InvoiceHour` from `InvoiceDateTime` to analyze the monthly, weekly and hourly purchase patterns.
Bivariate Plots Section
InvoiceDateTime
How does `InvoiceDateTime` correlate with `Value`?
We will group the values into daily observations and plot them.
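A sketch of the daily aggregation and plot:

```r
library(ggplot2)

# Total revenue per calendar day
daily <- retail[, .(Revenue = sum(Value)), by = Date][order(Date)]

ggplot(daily, aes(Date, Revenue)) +
  geom_line() +
  labs(title = "Daily revenue", y = "Revenue (Sterling)")
```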
There seems to be one transaction with a very high value (the outlier we found earlier), that is skewing the results. Let’s plot it without this value.
There is more information now, but there are still a number of values (as we saw in the univariate analysis) on a totally different scale from the majority.
Our assumption was true. There is still a lot of volatility, but a clearer trend can now be perceived, consistent with the trend in the number of transactions.
Revenue drops off abruptly in December 2011 because the whole month is not included. We will subset the data to exclude this month.
We now have a dataset that starts at the beginning of December 2010 and ends at the end of November 2011, 12 months in total.
We plotted the data using the `loess` smoothing method with monthly and yearly re-sampling. Here the orange line represents the yearly trend and the blue line the monthly trend.
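A sketch of the layered smoothers with `geom_smooth()`; the `span` values are assumptions, chosen so that the wide span tracks the yearly trend and the narrow span the monthly one:

```r
ggplot(daily, aes(Date, Revenue)) +
  geom_line(alpha = 0.3) +
  geom_smooth(method = "loess", span = 0.75, colour = "orange", se = FALSE) +  # yearly
  geom_smooth(method = "loess", span = 0.10, colour = "blue",   se = FALSE)    # monthly
```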
- We observe a great amount of volatility where values drop very low and rise very high in a short time period.
Let’s zoom in to the middle of our time period.
We observe what appears to be a monthly seasonality peaking towards the middle of each month, and a steady, very slightly declining trend until around September 2011, when revenue picks up.
We will plot it using moving averages to compare with the `loess` method we used above.
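A sketch with `data.table::frollmean()`; the 7- and 30-day windows are assumptions approximating weekly and monthly averages:

```r
# Right-aligned rolling means over the daily revenue series
daily[, ma_weekly  := frollmean(Revenue, 7)]
daily[, ma_monthly := frollmean(Revenue, 30)]

ggplot(daily, aes(Date)) +
  geom_line(aes(y = Revenue), alpha = 0.3) +
  geom_line(aes(y = ma_weekly),  colour = "blue") +
  geom_line(aes(y = ma_monthly), colour = "orange")
```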
The monthly and weekly moving averages look very similar to the `loess` estimates.
Let's sum the value per month of the year and plot it.
The monthly pattern is not obvious here. We observe a decline until February, then two consecutive and increasing peaks in March and May, a decline again, and then a very steep increase starting in August 2011.
We will try to confirm our observations using a time-series decomposition model.
Time Series
We will calculate the seasonality:
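A sketch of the padding and periodogram; padding the missing days with zero revenue and using `TSA::periodogram()` are assumptions:

```r
library(TSA)

# Regularise the series: fill calendar gaps (e.g. Saturdays) with zero revenue
all_days   <- data.table(Date = seq(min(daily$Date), max(daily$Date), by = "day"))
daily_full <- merge(all_days, daily[, .(Date, Revenue)], by = "Date", all.x = TRUE)
daily_full[is.na(Revenue), Revenue := 0]

# Periodogram: spec = spectral power at each frequency, time = period in days
p   <- periodogram(daily_full$Revenue)
top <- data.table(freq = p$freq, spec = p$spec)[order(-spec)]
top[, time := 1 / freq]
head(top, 10)
```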
## freq spec time
## 107 0.285333333 5415684253 3.504673
## 54 0.144000000 4515980204 6.944444
## 2 0.005333333 4171897151 187.500000
## 1 0.002666667 3340504198 375.000000
## 53 0.141333333 1537341509 7.075472
## 161 0.429333333 1223535323 2.329193
## 3 0.008000000 1163956615 125.000000
## 10 0.026666667 689007595 37.500000
## 12 0.032000000 660367372 31.250000
## 6 0.016000000 638031544 62.500000
We padded the missing Saturday values to create a regular univariate daily time series and computed a periodogram.
We see that there are indeed seasonal patterns in our dataset, although not monthly as we expected. The most important periods are 3.5 and 6.9 days, followed by 187.5 and 375.0 days.
The 3.5-day period suggests some sub-weekly seasonality, while 6.9 days is the weekly seasonality expected in a daily time series.
The periods closest to a monthly cycle (37.5 and 31.25 days) have much lower spectral power, so our visual impression of a monthly seasonality is not confirmed by the periodogram.
In addition to volatility, modelling daily-level data might require specifying multiple seasonality levels, such as day of the week, week of the year, month of the year, holidays, etc.
For the sake of simplicity, we will model using a frequency of 7 (weekly).
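A decomposition sketch at weekly frequency (7 observations per cycle):

```r
# Daily revenue as a weekly-frequency time series, decomposed with STL
rev_ts <- ts(daily_full$Revenue, frequency = 7)
plot(stl(rev_ts, s.window = "periodic"))
```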
- The trend looks similar to the monthly moving average we plotted earlier and still seems to contain an irregular monthly-to-bimonthly pattern.
- We can also see that there is quite a lot of noise confirming the volatility and high amount of outliers we have in our dataset.
- The missing Saturday values are obvious in the seasonal pattern, as is a weekly peak. Which day would that be?
- The number of transactions and revenue increases towards the end of the year with a peak in November.
- The minimum for both transactions and revenue is in February.
- We observe transactions and revenue increasing through the week, with the peak on Thursday.
- No transactions seem to happen on Saturday.
- Most of the orders and the highest revenue happen at the middle of the day (12.00 and 13.00).
- The distribution of orders and revenue over the day looks almost normal.
Results per year are as expected from our previous observations.
UnitPrice
We see that the distribution of price is mostly similar across the hours of the working day, and that most unit prices fall below 5 Sterling with a median around 2.
- The items bought very early in the morning have a higher unit price and variance, without outliers.
- Prices seem to be slightly lower at 16.00 and at 20.00.
How does the picture look if we plot the unit price against weekdays?
Distribution of UnitPrice is fairly similar for all weekdays.
UnitPrice vs Datetime
- We observe distinct horizontal unit-price lines, suggesting a more continuous price range at lower prices and more discrete price points the higher they get.
- The blue line depicts the linear trend; there does not seem to be a clear time trend, although it is slightly decreasing.
Country
- There are 5 country codes in the data that do not correspond to actual countries, like `Unspecified` or `European Community`.
- The revenue is the highest in the UK, as expected, but the store has a presence in quite a few countries.
- Most of the countries with sales are developed economies.
- The only continent that is not represented is Africa.
How many?
## [1] 37
Let's see what their sales look like without the UK.
Most sales are in European countries, the Channel Islands :) and Australia.
What is the relationship of country to the average UnitPrice?
The highest average `UnitPrice` is in Lebanon! The UK is in the middle of the list.
I wonder if there is any relationship between the total revenue of the country and the average unit price per country.
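The tests below are plain `cor.test()` calls; a sketch of the per-country aggregation feeding them:

```r
# One row per country: average unit price and total revenue
by_country <- retail[, .(avg_unit_price = mean(UnitPrice),
                         sum_revenue    = sum(Value)),
                     by = Country]

with(by_country, cor.test(avg_unit_price, sum_revenue))
with(by_country, cor.test(avg_unit_price, log10(sum_revenue)))
```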
##
## Pearson's product-moment correlation
##
## data: avg_unit_price and sum_revenue
## t = -0.42159, df = 35, p-value = 0.6759
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.3862059 0.2589000
## sample estimates:
## cor
## -0.07108199
##
## Pearson's product-moment correlation
##
## data: avg_unit_price and log10(sum_revenue)
## t = -0.4866, df = 35, p-value = 0.6296
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.3954881 0.2486497
## sample estimates:
## cor
## -0.08197367
- We used a log transformation of the y-axis to make the relationship easier to see.
- No correlation is visible in the scatter plot between the average unit price and the total value per country.
- The correlation is weak, negative, and barely changes under the log transformation.
##
## Pearson's product-moment correlation
##
## data: avg_unit_price and avg_revenue
## t = 27.685, df = 35, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9572090 0.9886667
## sample estimates:
## cor
## 0.9779217
There seems to be a strong positive correlation between average revenue and average unit price. This is logical, since unit price is a component of revenue.
Let’s see which countries have the highest average revenue.
This looks the same as the plot with the average unit price per country.
It does make sense, as the average revenue is total value divided by the number of transactions, and the higher the average unit price, the higher this will be.
How many customers per country?
The ordering of countries by number of customers is the same as the ordering by revenue.
Which customers have the most transactions and revenue?
Reflection
E-commerce is one of the fastest growing economic sectors worldwide. Typically e-commerce datasets are proprietary and consequently hard to find among publicly available data.
However, the UCI Machine Learning Repository has made available this dataset containing actual transactions from 2010 and 2011. The dataset is maintained on their site, where it can be found under the title “Online Retail”.
The dataset can also be found on Kaggle as “E-Commerce Data”.
I chose this dataset because I felt it accurately represents real-world datasets and I wanted to get experience exploring it.
One of the main challenges I faced is the open-ended nature of EDA: the endless explorations and transformations that are possible, and having to draw the line and create a narrative out of the existing visualizations and explorations.
Besides the basic univariate exploration, in this EDA I chose to get a general idea of the patterns and changes over time and their components (weekly, monthly, hourly), and to assess the data's potential for predicting future customer behaviour. Additionally, I explored possible differences in these patterns between countries.
The main issues I faced during the analysis:
- Very skewed distributions with a big percentage of outliers.
- The transactions cover a relatively short time period, making predictions and observations weaker.
- The number of items and transactions is quite big, increasing computation and plotting times to problematic levels at times.
- The UK dominates the sales, making country a poor predictor for modelling sales behavior.
Future Work
This was just the beginning. In the future I intend to proceed with the following analyses:
- Cohort analysis.
- Churn prediction.
- Customer basket prediction.
- Customer segmentation.
- Time Series forecasting.