Univariate Plots Section
## InvoiceNo StockCode Description Quantity InvoiceDateTime UnitPrice
## 1 0 0 0 0 0 0
## CustomerID Country Date InvoiceYear InvoiceMonth InvoiceWday InvoiceHour
## 1 0 0 0 0 0 0 0
## Value
## 1 0
## InvoiceNo StockCode Description Quantity
## 1: 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6
## 2: 536365 71053 WHITE METAL LANTERN 6
## 3: 536365 84406B CREAM CUPID HEARTS COAT HANGER 8
## 4: 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6
## 5: 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6
## 6: 536365 22752 SET 7 BABUSHKA NESTING BOXES 2
## InvoiceDateTime UnitPrice CustomerID Country Date
## 1: 2010-12-01 08:26:00 2.55 17850 United Kingdom 2010-12-01
## 2: 2010-12-01 08:26:00 3.39 17850 United Kingdom 2010-12-01
## 3: 2010-12-01 08:26:00 2.75 17850 United Kingdom 2010-12-01
## 4: 2010-12-01 08:26:00 3.39 17850 United Kingdom 2010-12-01
## 5: 2010-12-01 08:26:00 3.39 17850 United Kingdom 2010-12-01
## 6: 2010-12-01 08:26:00 7.65 17850 United Kingdom 2010-12-01
## InvoiceYear InvoiceMonth InvoiceWday InvoiceHour Value
## 1: 2010 Dec Wed 8 6.5025
## 2: 2010 Dec Wed 8 11.4921
## 3: 2010 Dec Wed 8 7.5625
## 4: 2010 Dec Wed 8 11.4921
## 5: 2010 Dec Wed 8 11.4921
## 6: 2010 Dec Wed 8 58.5225
## Classes 'data.table' and 'data.frame': 406829 obs. of 14 variables:
## $ InvoiceNo : chr "536365" "536365" "536365" "536365" ...
## $ StockCode : chr "85123A" "71053" "84406B" "84029G" ...
## $ Description : chr "WHITE HANGING HEART T-LIGHT HOLDER" "WHITE METAL LANTERN" "CREAM CUPID HEARTS COAT HANGER" "KNITTED UNION FLAG HOT WATER BOTTLE" ...
## $ Quantity : int 6 6 8 6 6 2 6 6 6 32 ...
## $ InvoiceDateTime: POSIXct, format: "2010-12-01 08:26:00" "2010-12-01 08:26:00" ...
## $ UnitPrice : num 2.55 3.39 2.75 3.39 3.39 7.65 4.25 1.85 1.85 1.69 ...
## $ CustomerID : int 17850 17850 17850 17850 17850 17850 17850 17850 17850 13047 ...
## $ Country : chr "United Kingdom" "United Kingdom" "United Kingdom" "United Kingdom" ...
## $ Date : Date, format: "2010-12-01" "2010-12-01" ...
## $ InvoiceYear : num 2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...
## $ InvoiceMonth : Ord.factor w/ 12 levels "Jan"<"Feb"<"Mar"<..: 12 12 12 12 12 12 12 12 12 12 ...
## $ InvoiceWday : Ord.factor w/ 7 levels "Sun"<"Mon"<"Tue"<..: 4 4 4 4 4 4 4 4 4 4 ...
## $ InvoiceHour : int 8 8 8 8 8 8 8 8 8 8 ...
## $ Value : num 6.5 11.49 7.56 11.49 11.49 ...
## - attr(*, ".internal.selfref")=<externalptr>
After the transformations, our dataset consists of 14 variables with 406,829 observations and no missing values.
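For reference, a minimal sketch of how these transformations could be reproduced with data.table and lubridate. The raw column name `InvoiceDate` and its format string are assumptions about the source file:

```r
library(data.table)
library(lubridate)

# retail holds the raw UCI "Online Retail" file read with fread();
# rows without a CustomerID have already been dropped (541,909 -> 406,829).
retail[, InvoiceDateTime := as.POSIXct(InvoiceDate, format = "%m/%d/%Y %H:%M")]
retail[, `:=`(
  Date         = as.Date(InvoiceDateTime),
  InvoiceYear  = year(InvoiceDateTime),
  InvoiceMonth = month(InvoiceDateTime, label = TRUE),  # ordered factor Jan < ... < Dec
  InvoiceWday  = wday(InvoiceDateTime, label = TRUE),   # ordered factor Sun < ... < Sat
  InvoiceHour  = hour(InvoiceDateTime)
)]
# Intended definition per the text: line value in Sterling.
# (Note: the Value column printed above equals UnitPrice^2, e.g. 6.5025 = 2.55^2,
# which suggests the original computation may have squared UnitPrice by mistake.)
retail[, Value := Quantity * UnitPrice]
```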
InvoiceNo
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 3.00 12.00 18.33 24.00 542.00
##
## Outliers are 9.17 % of the total occurrences
## [1] "22190 No of distinct InvoiceNos"
Each occurrence of an `InvoiceNo` marks a unique transaction. The number of times an `InvoiceNo` appears is equivalent to the basket size.
We observe an extremely right-skewed distribution with a prevalence of single-item orders, a third quartile of 24, a maximum of 542 occurrences, and quite a few outliers in between.
- There are 22,190 distinct `InvoiceNo`s (including cancellations).
InvoiceDateTime
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 4.00 13.00 19.88 26.00 543.00
##
## Outliers are 9.92 % of the total occurrences
`InvoiceDateTime` is the timestamp of every transaction, and its frequency distribution is essentially the same as that of `InvoiceNo`.
The summary statistics nevertheless look slightly different from those of `InvoiceNo`, although they should be identical. A plausible explanation is that distinct invoices occasionally share the same minute-level timestamp, which merges their counts.
How does the number of transactions look over time?
We observe the number of transactions decreasing towards the beginning of 2011 and a sharp increase in the last quarter of 2011.
StockCode
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 12.0 49.0 110.4 139.0 2077.0
##
## Outliers are 12.4 % of the total occurrences
Description
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 11.0 45.0 104.7 131.0 2070.0
##
## Outliers are 12.97 % of the total occurrences
The distributions are right-skewed, with a prevalence of single occurrences.
## [1] "3684 No of distinct stock codes"
## [1] "3885 No of distinct descriptions"
`StockCode` and `Description` should represent the same thing: the unique stock items, described either by a code or by a human-readable description. Yet there are 3,684 distinct stock codes and 3,885 distinct descriptions, a discrepancy of about 200 items.
The distributions are very similar but the summary statistics differ. What could be the reason?
## InvoiceNo StockCode Description Quantity InvoiceDateTime
## 1: 536370 POST POSTAGE 3 2010-12-01 08:45:00
## 2: C536379 D Discount -1 2010-12-01 09:41:00
## 3: 536403 POST POSTAGE 1 2010-12-01 11:27:00
## 4: 536527 POST POSTAGE 1 2010-12-01 13:04:00
## 5: 536540 C2 CARRIAGE 1 2010-12-01 14:05:00
## 6: 536569 M Manual 1 2010-12-01 15:35:00
## 7: 536569 M Manual 1 2010-12-01 15:35:00
## 8: 536779 BANK CHARGES Bank Charges 1 2010-12-02 15:08:00
## 9: 536840 POST POSTAGE 1 2010-12-02 18:27:00
## 10: 536852 POST POSTAGE 1 2010-12-03 09:51:00
## UnitPrice CustomerID Country Date InvoiceYear
## 1: 18.00 12583 France 2010-12-01 2010
## 2: 27.50 14527 United Kingdom 2010-12-01 2010
## 3: 15.00 12791 Netherlands 2010-12-01 2010
## 4: 18.00 12662 Germany 2010-12-01 2010
## 5: 50.00 14911 EIRE 2010-12-01 2010
## 6: 1.25 16274 United Kingdom 2010-12-01 2010
## 7: 18.95 16274 United Kingdom 2010-12-01 2010
## 8: 15.00 15823 United Kingdom 2010-12-02 2010
## 9: 18.00 12738 Germany 2010-12-02 2010
## 10: 18.00 12686 France 2010-12-03 2010
## InvoiceMonth InvoiceWday InvoiceHour Value
## 1: Dec Wed 8 324.0000
## 2: Dec Wed 9 756.2500
## 3: Dec Wed 11 225.0000
## 4: Dec Wed 13 324.0000
## 5: Dec Wed 14 2500.0000
## 6: Dec Wed 15 1.5625
## 7: Dec Wed 15 359.1025
## 8: Dec Thu 15 225.0000
## 9: Dec Thu 18 324.0000
## 10: Dec Fri 9 324.0000
We see that there are generic service transactions that are not actual sales, so there is no one-to-one relationship between `StockCode` and `Description`.
A Discount in the `Description` marks a cancelled order, which is denoted by a `D` in the `StockCode` and a `C` at the start of the `InvoiceNo`.
How many such special transactions are there?
## # A tibble: 8 x 3
## StockCode Description n
## <chr> <chr> <int>
## 1 POST POSTAGE 1196
## 2 M Manual 465
## 3 C2 CARRIAGE 134
## 4 D Discount 77
## 5 CRUK CRUK Commission 16
## 6 DOT DOTCOM POSTAGE 16
## 7 BANK CHARGES Bank Charges 12
## 8 PADS PADS TO MATCH ALL CUSHIONS 4
## [1] "The number of special transactions in the ddataset are : 1920"
1920 is much less than the total number of invoices.
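For reference, a sketch of how these special codes can be isolated: regular product codes are five digits, optionally followed by a letter, so everything else is a service line. The regex is an assumption that happens to match the eight codes above:

```r
library(dplyr)

# Keep only lines whose StockCode does NOT look like a product code
special <- retail %>%
  filter(!grepl("^[0-9]{5}[A-Z]?$", StockCode)) %>%
  count(StockCode, Description, sort = TRUE)

special            # the tibble above
sum(special$n)     # 1920 special transaction lines
```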
Let’s have a look at the top 20 descriptions by the number of occurrences.
- The first 6 descriptions have a significantly higher number of occurrences than the rest.
- The top items are a light holder, a cakestand and a retrospot bag, which gives us an idea of the sort of items the online store is selling.
- `POSTAGE`, which is a generic description, is included in the top 10.
Quantity
## nbr.val nbr.null nbr.na min max
## 4.068290e+05 0.000000e+00 0.000000e+00 -8.099500e+04 8.099500e+04
## range sum median mean SE.mean
## 1.619900e+05 4.906888e+06 5.000000e+00 1.206130e+01 3.899045e-01
## CI.mean.0.95 var std.dev coef.var
## 7.642011e-01 6.184839e+04 2.486934e+02 2.061911e+01
##
## Outliers are 13.43 % of the total occurrences
We observe a huge variance (6.18e+04) and large symmetric outliers on both sides. We limited the range of the x-axis in steps to get a better idea of the distribution.
- A minimum and maximum of ±80,995, which could be a big order and its cancellation.
- Values are more symmetric the bigger they get, which suggests that most big orders get cancelled.
We observe a prevalence of orders with `Quantity` less than 25, with a big peak at 12, which is also the upper quartile.
How many negative values (i.e. cancelled orders) are there?
## [1] 8905
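Negative quantities should line up with the `C`-prefixed invoice numbers; a quick cross-check (sketch):

```r
# Cancelled order lines: negative Quantity
retail[Quantity < 0, .N]

# Cross-check against cancellation invoices (InvoiceNo starting with "C")
retail[grepl("^C", InvoiceNo), .N]   # should be (nearly) identical
```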
UnitPrice
## nbr.val nbr.null nbr.na min max
## 4.068290e+05 4.000000e+01 0.000000e+00 0.000000e+00 3.897000e+04
## range sum median mean SE.mean
## 3.897000e+04 1.407820e+06 1.950000e+00 3.460471e+00 1.086732e-01
## CI.mean.0.95 var std.dev coef.var
## 2.129961e-01 4.804592e+03 6.931516e+01 2.003056e+01
##
## Outliers are 12.53 % of the total occurrences
- The distribution is so right-skewed that it does not even show up as a histogram. After transforming to log10, the `UnitPrice` distribution looks uniform.
We again limited the range of the x-axis to take a better look.
- There is a huge outlier at 38,970 Sterling that skews the distribution extremely to the right. I wonder what this item is.
- The most frequent price (the mode) is 1.25, the median is 1.95 and the minimum is 0!
I wonder which items have a price of 0.
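A quick look answers that; the 40 zero-price lines correspond to the `nbr.null` count above (sketch):

```r
# Lines with a zero unit price
retail[UnitPrice == 0, .(InvoiceNo, StockCode, Description, Quantity)]
```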
Value
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000e+00 2.000e+00 4.000e+00 4.817e+03 1.400e+01 1.519e+09
##
## Outliers are 17.73 % of the total occurrences
We zoomed in below 10,000 to be able to visualize the distribution.
- The maximum order value of 1.519e+09 looks like an extreme outlier, since it is very far from the third quartile of 14.
- The outlier percentage, at 17.73%, is higher than any we have seen so far.
CustomerID
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 17.00 42.00 93.05 102.00 7983.00
##
## Outliers are 12.6 % of the total occurrences
- The frequency distribution here is similar to the previous ones, with most `CustomerID` counts at 1.
- One `CustomerID` appears 7,983 times, far above the average frequency of 93.05.
Country
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10 127 358 10995 1259 361878
##
## Outliers are 16.22 % of the total occurrences
- The distribution is so right-skewed it barely shows on the histogram.
- The mean (10,995) is roughly 30 times the median (358).
- The max (the UK) at 361,878 occurrences is very far from the rest of the distribution, but this is expected, since it is where the majority of transactions happen.
How many countries are there and which have the highest occurrences?
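The proportions below can be obtained with a base-R one-liner (sketch):

```r
# Share of transaction lines per country, most frequent first
country_share <- sort(prop.table(table(retail$Country)), decreasing = TRUE)
length(country_share)         # 37 countries
as.numeric(country_share)     # the proportions printed below
```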
## [1] 8.895089e-01 2.333904e-02 2.087118e-02 1.839839e-02 6.226203e-03
## [6] 5.828001e-03 5.085675e-03 4.613732e-03 3.637892e-03 3.094666e-03
## [11] 2.669426e-03 1.973802e-03 1.863191e-03 1.708334e-03 1.528898e-03
## [16] 1.135612e-03 9.856721e-04 9.561757e-04 8.799766e-04 8.381900e-04
## [21] 7.152882e-04 6.145088e-04 5.997606e-04 5.628901e-04 4.473624e-04
## [26] 3.711633e-04 3.588731e-04 3.121705e-04 1.671464e-04 1.499401e-04
## [31] 1.425660e-04 1.106116e-04 8.603123e-05 7.865713e-05 7.374106e-05
## [36] 4.178660e-05 2.458035e-05
The UK dominates with 88.9% of occurrences, which makes sense since that is where the online store is based.
- We observe a small group of countries (Germany, France, EIRE) with a much higher number of occurrences than the rest.
Univariate Analysis
What is the structure of your dataset?
Original Dataset Variables
- InvoiceNo: Invoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction. If this code starts with letter ‘c’, it indicates a cancellation.
- StockCode: Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product.
- Description: Product (item) name. Nominal.
- Quantity: The quantities of each product (item) per transaction. Numeric.
- InvoiceDate: Invoice date and time. Numeric, the day and time when each transaction was generated.
- UnitPrice: Unit price. Numeric, Product price per unit in sterling.
- CustomerID: Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer.
- Country: Country name. Nominal, the name of the country where each customer resides.
`InvoiceNo`, `StockCode`, `Description`, `CustomerID` and `Country` are categorical variables; `Quantity`, `UnitPrice` and `Value` are numeric variables; `InvoiceDateTime` is a datetime variable.
After the transformations, our dataset consists of 14 features (InvoiceNo, StockCode, Description, Quantity, InvoiceDateTime, UnitPrice, CustomerID, Country, Date, InvoiceYear, InvoiceMonth, InvoiceWday, InvoiceHour and Value), with 406,829 observations and no missing values.
Main Observations:
- The frequency distributions of `InvoiceNo`, `StockCode`, `Description` and `CustomerID` are extremely right-skewed, with a prevalence of single occurrences.
- UK is the country where the most orders come from (88.9%)
- There is a discrepancy between the distinct value counts of `Description` and `StockCode`, which in theory should be identical. Both columns include 1,920 special transaction lines that do not correspond to specific items but to logistical charges (postage, discounts, fees) on some invoices. This represents a small portion of the total invoices, though.
- Most orders are placed towards the end of 2011.
- 8,905 of the order lines are cancellations. It looks like most of them are the big outlying orders.
What is/are the main feature(s) of interest in your dataset?
`InvoiceDateTime` and its components, `Value`, `UnitPrice` and `Quantity`. I would like to explore the correlation between the datetime features (month, weekday, hour) and the revenue, the unit price and the country.
What other features in the dataset do you think will help support your investigation into your feature(s) of interest?
`Country`, `CustomerID`, `StockCode`/`Description`.
Did you create any new variables from existing variables in the dataset?
I created the `Value` variable by multiplying `UnitPrice` and `Quantity` to have a measure of the value in Sterling of each order line. I also extracted `InvoiceMonth`, `InvoiceWday` and `InvoiceHour` from `InvoiceDateTime` to analyze the monthly, weekly and hourly purchase patterns.
Bivariate Plots Section
InvoiceDateTime
How does `InvoiceDateTime` correlate with `Value`?
We will group the values into daily observations and plot them.
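A sketch of the daily aggregation and plot:

```r
library(ggplot2)

# Total revenue per calendar day
daily <- retail[, .(Revenue = sum(Value)), by = Date][order(Date)]

ggplot(daily, aes(Date, Revenue)) +
  geom_line() +
  labs(title = "Daily revenue", y = "Revenue (Sterling)")
```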
There seems to be one transaction with a very high value (the outlier we found earlier), that is skewing the results. Let’s plot it without this value.
There is more information now, but there are still a number of values (as we saw in the univariate analysis) on a totally different scale from the majority.
Our assumption was true. There is still a lot of volatility, but a clearer trend can now be perceived, consistent with the trend in the number of transactions.
Revenue drops off abruptly in December 2011 because the whole month is not included. We will subset the data to exclude this month.
We now have a dataset that starts at the beginning of December 2010 and ends at the end of November 2011, 12 months in total.
We plotted the data using the `loess` smoothing method with monthly and yearly re-sampling. Here the orange line represents the yearly trend and the blue line the monthly trend.
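A sketch of the layered smoothers with `geom_smooth()`; the `span` values are assumptions, chosen so that the wide span tracks the yearly trend and the narrow span the monthly one:

```r
ggplot(daily, aes(Date, Revenue)) +
  geom_line(alpha = 0.3) +
  geom_smooth(method = "loess", span = 0.75, colour = "orange", se = FALSE) +  # yearly
  geom_smooth(method = "loess", span = 0.10, colour = "blue",   se = FALSE)    # monthly
```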
- We observe a great amount of volatility where values drop very low and rise very high in a short time period.
Let’s zoom in to the middle of our time period.
We observe what appears to be a monthly seasonality peaking towards the middle of each month, and a steady, very slightly declining trend until around September 2011, when revenue picks up.
We will plot it using moving averages to compare with the `loess` method we used above.
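A sketch with `data.table::frollmean()`; the 7- and 30-day windows are assumptions approximating weekly and monthly averages:

```r
# Right-aligned rolling means over the daily revenue series
daily[, ma_weekly  := frollmean(Revenue, 7)]
daily[, ma_monthly := frollmean(Revenue, 30)]

ggplot(daily, aes(Date)) +
  geom_line(aes(y = Revenue), alpha = 0.3) +
  geom_line(aes(y = ma_weekly),  colour = "blue") +
  geom_line(aes(y = ma_monthly), colour = "orange")
```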
The monthly and weekly moving averages look very similar to the `loess` estimates.
Let's sum the value per month of the year and plot it.
The monthly pattern is not obvious here. We observe a decline until February, then two consecutive and increasing peaks in March and May, a decline again, and then a very steep increase starting in August 2011.
We will try to confirm our observations using a time-series decomposition model.
Time Series
We will calculate the seasonality:
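A sketch of the padding and periodogram; padding the missing days with zero revenue and using `TSA::periodogram()` are assumptions:

```r
library(TSA)

# Regularise the series: fill calendar gaps (e.g. Saturdays) with zero revenue
all_days   <- data.table(Date = seq(min(daily$Date), max(daily$Date), by = "day"))
daily_full <- merge(all_days, daily[, .(Date, Revenue)], by = "Date", all.x = TRUE)
daily_full[is.na(Revenue), Revenue := 0]

# Periodogram: spec = spectral power at each frequency, time = period in days
p   <- periodogram(daily_full$Revenue)
top <- data.table(freq = p$freq, spec = p$spec)[order(-spec)]
top[, time := 1 / freq]
head(top, 10)
```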
## freq spec time
## 107 0.285333333 5415684253 3.504673
## 54 0.144000000 4515980204 6.944444
## 2 0.005333333 4171897151 187.500000
## 1 0.002666667 3340504198 375.000000
## 53 0.141333333 1537341509 7.075472
## 161 0.429333333 1223535323 2.329193
## 3 0.008000000 1163956615 125.000000
## 10 0.026666667 689007595 37.500000
## 12 0.032000000 660367372 31.250000
## 6 0.016000000 638031544 62.500000
We padded the missing Saturday values to create a regular univariate daily time series and computed a periodogram.
We see that there are indeed seasonal patterns in our dataset, although not monthly as we expected. The most important periods are 3.5 and 6.9 days, followed by 187.5 and 375.0 days.
The 3.5-day period suggests some sub-weekly seasonality, while 6.9 days is the weekly seasonality expected in a daily time series.
The periods closest to a monthly cycle (37.5 and 31.25 days) have much lower spectral power, so our visual impression of a monthly seasonality is not confirmed by the periodogram.
In addition to volatility, modelling daily-level data might require specifying multiple seasonality levels, such as day of the week, week of the year, month of the year, holidays, etc.
For the sake of simplicity, we will model using a frequency of 7 (weekly).
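A decomposition sketch at weekly frequency (7 observations per cycle):

```r
# Daily revenue as a weekly-frequency time series, decomposed with STL
rev_ts <- ts(daily_full$Revenue, frequency = 7)
plot(stl(rev_ts, s.window = "periodic"))
```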
- The trend looks similar to the monthly moving average we plotted earlier and still seems to contain an irregular monthly-to-bimonthly pattern.
- We can also see that there is quite a lot of noise confirming the volatility and high amount of outliers we have in our dataset.
- The missing Saturday values are obvious in the seasonal pattern, as is a weekly peak. Which day would that be?
- The number of transactions and revenue increases towards the end of the year with a peak in November.
- The minimum for both transactions and revenue is in February.
- We observe transactions and revenue increasing through the week, with the peak on Thursday.
- No transactions seem to happen on Saturday.
- Most of the orders and the highest revenue happen at the middle of the day (12.00 and 13.00).
- The distribution of orders and revenue over the day looks almost normal.
Results per year are as expected from our previous observations.
UnitPrice
We see that the distribution of price is mostly similar across the hours of the working day, and that most unit prices fall below 5 Sterling with a median around 2.
- The items bought very early in the morning have a higher unit price and variance, without outliers.
- Prices seem to be slightly lower at 16.00 and at 20.00.
How does the picture look if we plot the unit price against weekdays?
Distribution of UnitPrice is fairly similar for all weekdays.
UnitPrice vs Datetime
- We observe distinct horizontal unit-price lines, suggesting a more continuous price range at lower prices and more discrete price points the higher they get.
- The blue line depicts the linear trend; there does not seem to be a clear time trend, although it is slightly decreasing.
Country
- There are 5 country codes in the data that do not correspond to actual countries, like `Unspecified` or `European Community`.
- The revenue is the highest in the UK, as expected, but the store has a presence in quite a few countries.
- Most of the countries with sales are developed economies.
- The only continent that is not represented is Africa.
How many?
## [1] 37
Let's see what their sales look like without the UK.
Most sales are in European countries, the Channel Islands :) and Australia.
What is the relationship of country to the average UnitPrice?
The highest average `UnitPrice` is in Lebanon! The UK is in the middle of the list.
I wonder if there is any relationship between the total revenue of the country and the average unit price per country.
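The tests below are plain `cor.test()` calls; a sketch of the per-country aggregation feeding them:

```r
# One row per country: average unit price and total revenue
by_country <- retail[, .(avg_unit_price = mean(UnitPrice),
                         sum_revenue    = sum(Value)),
                     by = Country]

with(by_country, cor.test(avg_unit_price, sum_revenue))
with(by_country, cor.test(avg_unit_price, log10(sum_revenue)))
```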
##
## Pearson's product-moment correlation
##
## data: avg_unit_price and sum_revenue
## t = -0.42159, df = 35, p-value = 0.6759
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.3862059 0.2589000
## sample estimates:
## cor
## -0.07108199
##
## Pearson's product-moment correlation
##
## data: avg_unit_price and log10(sum_revenue)
## t = -0.4866, df = 35, p-value = 0.6296
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.3954881 0.2486497
## sample estimates:
## cor
## -0.08197367
- We used a log transformation of the y-axis to make the relationship easier to see.
- No correlation is visible in the scatter plot between the average unit price and the total value per country.
- The correlation is weak, negative, and barely changes under the log transformation.
##
## Pearson's product-moment correlation
##
## data: avg_unit_price and avg_revenue
## t = 27.685, df = 35, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9572090 0.9886667
## sample estimates:
## cor
## 0.9779217
There seems to be a strong positive correlation between average revenue and average unit price. This is logical, since unit price is a component of revenue.
Let’s see which countries have the highest average revenue.
This looks the same as the plot with the average unit price per country.
It does make sense, as the average revenue is total value divided by the number of transactions, and the higher the average unit price, the higher this will be.
How many customers per country?
The ordering of countries by number of customers is the same as the ordering by revenue.
Which customers have the most transactions and revenue?
Reflection
E-commerce is one of the fastest growing economic sectors worldwide. Typically e-commerce datasets are proprietary and consequently hard to find among publicly available data.
However, the UCI Machine Learning Repository has made available this dataset containing actual transactions from 2010 and 2011. The dataset is maintained on their site, where it can be found under the title “Online Retail”.
The dataset can also be found on Kaggle as “E-Commerce Data”.
I chose this dataset because I felt it accurately represents real-world datasets and I wanted to get experience exploring it.
One of the main challenges I faced is the open-ended nature of EDA: the endless explorations and transformations that are possible, and having to draw the line and create a narrative out of the existing visualizations and explorations.
Besides the basic univariate exploration, in this EDA I chose to get a general idea of the patterns and changes over time and their components (weekly, monthly, hourly), and to assess the data's potential for predicting future customer behaviour. Additionally, I explored possible differences in these patterns between countries.
The main issues I faced during the analysis:
- Very skewed distributions with a big percentage of outliers.
- The transactions cover a relatively short time period, making predictions and observations weaker.
- The number of items and transactions is quite big, increasing computation and plotting times to problematic levels at times.
- The UK dominates the sales, making country a poor predictor for modelling sales behavior.
Future Work
This was just the beginning. In the future I intend to proceed with the following analyses:
- Cohort analysis.
- Churn prediction.
- Customer basket prediction.
- Customer segmentation.
- Time Series forecasting.