“In God we trust. Everyone else, bring data.”

Former New York City Mayor Michael R. Bloomberg

Marketing is “the science and art of finding, retaining, and growing profitable customers [sic]” (Kotler and Armstrong 2001, p. 392). Over time, the marketing departments of most companies have collected vast amounts of customer data. Through the use of technology and data mining techniques these data sets can be processed into valuable insights into customer behavior (Davenport 2006; Harris et al. 2010).

However, because data analysis as such focuses on the past rather than on the future, the insights derived from it are of limited value on their own (Sanders 2015). It is therefore all the more important that Chief Marketing Officers (CMOs) are able to leverage predictive customer analytics by making inferences from past data, yet many CMOs lack the skills to do so (Davenport 2006; Sanders 2015). Predictive analytics allows substantially better business decisions by providing more precise predictions (Sanders 2015). Indeed, for high-performing companies, sophisticated analytical capabilities represent a key differentiator separating them from their competitors (Gutierrez 2015; Davenport 2006; Davenport and Harris 2013). Accordingly, predictive analytics is said to be “the next big thing” (Hollison 2014, p. 1).

One important application of predictive analytics is to analyze customer information and predict future purchases in order to calculate the present value of each customer to the organization (Corrigan et al. 2014; Kim 2015b). Businesses seek to identify their frequent and, more importantly, their profitable customers for two reasons. First, they seek to treat each customer appropriately. For instance, by targeting customers with customized offers, companies aspire to increase customer loyalty (Corrigan et al. 2014; Kim 2015b). Second, firms intend to maximize the profits that result from sales-oriented marketing campaigns (Zettelmeyer and Ertle 2014; Kim 2015b).

However, collecting, combining and acting intelligently on customer data is a challenging task for marketers because they often lack deep statistical and technological expertise. Statisticians, in contrast, lack the marketing knowledge. As a result, most organizations still act in a product-centric rather than a customer-centric way and miss out on great revenue potential (Kim 2015b). A recent publication suggests that “with less than 10 percent of marketers using any predictive modeling, according to Gartner Research Director Martin Kihn, there’s plenty of room for growth” (Sluis 2014, p. 44).

Against this background, this thesis aims to convey an understanding of how to use customer analytics to target the most profitable customers, using the software program Stata to empirically analyze and process customer data sets. In a first step, the reader is introduced to customer analytics and to how customer value can be calculated. In a second step, the necessary basics of business statistics are explained. Finally, in a third step, predictive customer targeting models are applied and reviewed using Stata. The aim is to familiarize readers with the quantitative methods behind customer analytics so that they are able to apply them.

This thesis addresses the following main research question: How can a company use Stata to identify and target the most profitable customers based on customer data? To investigate this research question, I discuss the following aspects in more detail: What is Customer Analytics? What is strategic targeting? What is Stata? What data is used? What does predictive mean in the marketing context? Which opportunities are there to identify and target a customer? How does a company identify and target a customer? How can a company quantify customer value? How can a company use statistics to predict a customer response rate? What is RFM analysis? How does RFM analysis work? What is logistic regression and how does it work? What alternatives are there to identify customers?

This thesis employs a systematic review of the academic literature. Furthermore, it applies the theory in practice by analyzing real customer data in the statistics software Stata. In addition, state-of-the-art marketing and statistics literature as well as best practices from recognized marketing professionals are drawn upon. Research was conducted on EBSCO Business Source Premier, T&I ProQuest, Google Search, Google Scholar and Springer, as well as through personal communication with my former lecturer at Emory University, Tongil Kim.

Theoretical Foundation

This chapter explains basic terms and provides a framework for the analysis. It thus deals with the following major questions: What is Customer Analytics? What is Predictive Analytics? What is strategic targeting? What is Stata?

Customer Analytics

Customer Analytics consists of the two terms customer and analytics. Firstly, according to the American Marketing Association (2015, p. 1), a customer is “the actual or prospective purchaser of products or services.” Secondly, Analytics is defined as the “practice of extracting information from existing data sets in order to determine patterns” (Lawless 2015, p. 44). By combining these two terms, Customer Analytics can be described as the scientific approach to examine gathered information about an actual or prospective purchaser of products or services in order to determine purchase patterns. This thesis aims to find these patterns in the data using the software program Stata and use the patterns uncovered in a further step to predict customer behavior.

Predictive Analytics

As mentioned earlier, the real value of customer analytics lies in prediction. Predictive Analytics makes “use of statistics, machine learning, data mining, and modeling to analyze current and historical facts to make predictions about future events” (Hollison 2014, p. 1). However, it merely “forecasts what might happen in the future with an acceptable level of reliability, and includes what-if scenarios and risk assessment” (Lawless 2015, p. 44).

Strategic Targeting

Considering the uncertainties of the future, a company’s aim is to strategically target its customers using predictions of future purchase behavior. According to Kotler et al. (2001), a target customer is a specific person to whom an offer, an advertisement or a message is addressed. Marketers seek to motivate prospective customers to purchase by targeting them through marketing actions.

Strategic targeting takes this approach one step further: it selects and executes the most profitable targeting strategy of a marketing campaign. When it is unprofitable to approach certain customers because they are unlikely to respond with a purchase, a selective targeting strategy makes sense.

Data

For the purposes of this thesis, I use a modified customer data set of Amazonas, a fictitious online retailer for books and other consumer goods. The data was originally collected by an actual company called the Bookbinders Book Club from its customer database and, in this example, sold to Amazonas. In reality, companies buy or lease customer data in addition to collecting it themselves, as gathering customer data is expensive and takes a long time. The additionally acquired customer information represents substantial added value for Amazonas in the context of the analysis and its marketing activities.

To keep Amazonas’ data orderly, it is important to distinguish between the different types of variables, as they are measured on different levels and are handled in different ways in statistics (Mooi and Sarstedt 2014a; Kim 2015f). For example, it is possible to calculate an average age, but impossible to determine an average gender. Certain statistical techniques only work with one type of variable (Kim 2015f). Measurement levels can be broadly divided into non-metric (qualitative) and metric (quantitative) variables (Weiss 2012).

Non-metric variables “yield nonnumerical information” (Weiss 2012, p. 35) and can be either categorical (sometimes called nominal) or ordinal. Categorical variables have two (binary variables) or more categories; examples are gender or religion (Kim 2015f; Waldhauser 2013). Ordinal variables are similar to categorical variables. The main difference is that ordinal variables follow a clear ordering and thus offer additional information (Chen et al. 2003; Mooi and Sarstedt 2014a). Ordinal variables can therefore be described as a numeric code for a qualitative characteristic, for instance, a customer category A, B or C (Kim 2015f; Waldhauser 2013).

Metric variables can be divided into discrete and continuous variables (Weiss 2012). A discrete variable is “a quantitative variable whose possible values can be listed” (Weiss 2012, p. 36) and is usually a count of something. Continuous variables can be further divided into interval variables and ratio variables (Chen et al. 2003; Kim 2015f; Waldhauser 2013). An interval variable is a numeric code for a quantitative characteristic (Waldhauser 2013) and the “intervals between the values of the interval variable are equally spaced” (Chen et al. 2003, p. 1). As an example, consider temperature measured in degrees Celsius, for which it is clearly defined what one degree more or less means. Ratio variables are a specific type of interval variable with a clearly defined zero point, e.g. temperature measured in kelvins (Waldhauser 2013).

The original size of Amazonas’ database is 550,000 customers, from which a random sample of 50,000 customers was extracted for analysis and prediction purposes. This means that the 50,000 sample customers are addressed with a test marketing campaign and each customer’s buying response to this special offer is recorded. Based on this record of buying behavior, analytical methods can discover characteristics or certain values of variables that make a purchase more probable. Based on these, Stata estimates a purchase probability, which can then be applied to the remaining 500,000 customers in the database. Of course, the test sample can be far smaller than 50,000. However, a larger test sample will deliver more precise estimates. Table 1 below lists all variables included in the data set together with the scale level of each variable.

To get a better understanding of the dataset, various descriptive analyses can be conducted. They are further explained below.

Table 1 Amazonas customer variables

Content of the Amazonas data sample (50,000 customers)

| Variable name | Scale level | Description |
| --- | --- | --- |
| account_num | Metric: discrete | Customer account number |
| sex | Non-metric: categorical | Customer gender: M = male, F = female |
| state | Non-metric: categorical | US state where the customer lives |
| zip | Metric: discrete | ZIP code (5 digits) |
| zip3 | Metric: discrete | The first 3 digits of the ZIP code |
| purch_initial | Metric: discrete | Number of months since the initial purchase |
| purch_last | Metric: discrete | Number of months since the last purchase |
| product_book | Metric: discrete | Total amount of money spent on books |
| product_nonbook | Metric: discrete | Total amount of money spent on other products |
| purchase_total | Metric: discrete | Total amount of money spent |
| total_num_purch | Metric: discrete | Total number of purchased books |
| genre_child | Metric: discrete | Total number of purchased children's books |
| genre_youth | Metric: discrete | Total number of purchased youth books |
| genre_cook | Metric: discrete | Total number of purchased cook books |
| genre_do_it | Metric: discrete | Total number of purchased do-it-yourself books |
| genre_reference | Metric: discrete | Total number of purchased reference books |
| genre_art | Metric: discrete | Total number of purchased art books |
| genre_geog | Metric: discrete | Total number of purchased geography books |
| response_buy | Non-metric: categorical | Did the customer buy the special offer? (1 = yes, 0 = no) |

Describe data

To describe the data, there are two broad types of analyses: univariate descriptives and bivariate descriptives. Univariate descriptives describe only one variable at a time, whereas bivariate descriptives focus on the relationship between two variables (Mooi and Sarstedt 2014b).

Exhibit 2 The different types of descriptive statistics (Mooi and Sarstedt 2014b, p. 100; Kim 2015f)

Univariate data is commonly depicted graphically using instruments such as bar charts, histograms, box plots, pie charts and frequency tables. The objective is to show the distribution of the data (Kim 2015f). Moreover, one can describe the “central tendency” of a data set. Firstly, one can identify the mode, that is, the value that occurs most often in a given variable distribution (Weiss 2012). Secondly, the analyst can assess the median, that is, the middle point of the distribution. Thirdly, and only in the case of metric variables, it is possible to calculate the mean, that is, the average. For instance, you cannot calculate an average gender (non-metric variable), but you can calculate an average number of purchases (metric variable). Furthermore, the dispersion of metric data can be analyzed, that is, its variance, standard deviation and range (Kim 2015f).
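The measures of central tendency and dispersion named above can be illustrated with a short Python sketch. It is not the Stata workflow used in this thesis, and the purchase counts are hypothetical values chosen only for illustration.

```python
import statistics

# Hypothetical number of purchases for seven customers (illustrative values only).
purchases = [1, 2, 2, 3, 5, 8, 2]

mode = statistics.mode(purchases)          # value that occurs most often
median = statistics.median(purchases)      # middle value of the sorted data
mean = statistics.mean(purchases)          # arithmetic average
variance = statistics.variance(purchases)  # sample variance (divides by n - 1)
std_dev = statistics.stdev(purchases)      # sample standard deviation
value_range = max(purchases) - min(purchases)

print(mode, median, round(mean, 2), round(variance, 2), round(std_dev, 2), value_range)
```

The mean, variance and standard deviation are only meaningful here because the number of purchases is a metric variable.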

Bivariate data means that there are two variables that can have an association. This relationship can be illustrated in scatter plots and cross-tabs. For statistically testing the relationship, common tools are correlation tests and regression. These tools are reviewed and explained in chapter 1.3 and 1.4.

Stata

Multiple statistical software programs are available for describing and analyzing data. This thesis employs Stata for several reasons. Stata is a powerful and user-friendly statistical software. It includes “smart data-management facilities, a wide array of up-to-date statistical techniques, and an excellent system for producing publication-quality graphs” (Rodríguez 2015, p. 1). Furthermore, many data scientists and researchers use Stata because of its breadth, accuracy, extensibility, and reproducibility (Stata 2015a).

Moreover, Stata offers considerable advantages over other potential software. In contrast to Excel, Stata includes the statistical tools required for serious analysis of the data and can handle vast numbers of customer records (50,000+). Compared to SPSS, SAS and R, Stata is described as easier to use and as independent of any existing infrastructure. It is therefore considered well suited to marketers (Zettelmeyer and Ertle 2014).

Quantifying Customer Value

Ofek (2014, p. 1) contends that “the ultimate goal [of marketing activities] is to develop highly committed customers who not only make repeat purchases and generate continual revenue streams, but also require minimal maintenance along the way.” This goal is central to the idea of customer centricity. Companies that follow customer centricity refocus their attention from the product to the customer. Consequently, businesses shift their focus from so-called product profitability to customer profitability. As a result, whole customer relationships are analyzed rather than single transactions (Kim 2015c).

This raises two questions: How can you determine the economic value of a customer to a business? And how can this information improve marketing and product decisions?

The Customer Lifetime Value

In order to quantify the worth of a customer to an organization, companies have to compare the initial customer acquisition costs with the expected profits that result from the customer’s relationship with the organization. The resulting figure is known as the Customer Lifetime Value (CLTV). The CLTV is a newer approach to measuring customer value that goes beyond looking only at the revenues and profits the customer has generated so far.

[Exhibit 3: matrix relating cash flow (inflow only; inflow and outflow) to captured behavior (past; past and future), positioning revenue, profit and lifetime value (LTV)]

Exhibit 3 Customer Lifetime Value (Kim 2015b, p. 4)

The Customer Lifetime Value can be described using the following equation 1 (Mason 2001; Ofek 2014; Gupta et al. 2006):

$$\mathrm{CLTV} = \sum_{t=1}^{n} \frac{P_t \,(Q_t \,\pi_t)}{(1+d)^t} - \sum_{t=1}^{n} \frac{D_t + R_t}{(1+d)^t} - AC \qquad (1)$$

where

$P_t$ = the probability of purchase in period $t$

$Q_t$ = the quantity purchased in period $t$

$\pi_t$ = the margin on purchases in period $t$

$d$ = the discount factor (where often $d$ = interest rate × risk factor)

$D_t$ = costs to develop the relationship in period $t$

$R_t$ = costs to retain the customer in period $t$

$AC$ = initial customer acquisition cost

$n$ = the number of periods

The first part of the CLTV equation represents the net revenues of one customer, discounted over the customer lifetime. The second part captures the costs of maintaining and developing the relationship with the customer, likewise discounted over the lifetime. The third part constitutes the initial customer acquisition costs.
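To make the three parts of equation 1 concrete, the following Python sketch evaluates the formula term by term. The function name `cltv` and all input values are hypothetical and serve only as an illustration. They are not taken from the Amazonas data.

```python
def cltv(p, q, margin, develop, retain, d, acquisition_cost):
    """Evaluate equation 1 for one customer.

    p, q, margin, develop, retain are lists with one entry per period t = 1..n.
    d is the discount factor and acquisition_cost the initial cost AC.
    """
    n = len(p)
    # First part: expected margins, discounted per period.
    revenues = sum(p[t] * q[t] * margin[t] / (1 + d) ** (t + 1) for t in range(n))
    # Second part: development and retention costs, discounted per period.
    costs = sum((develop[t] + retain[t]) / (1 + d) ** (t + 1) for t in range(n))
    # Third part: the initial acquisition cost is not discounted.
    return revenues - costs - acquisition_cost


# Hypothetical two-period example (all numbers are assumptions).
value = cltv(
    p=[0.9, 0.8],
    q=[2, 2],
    margin=[10.0, 10.0],
    develop=[1.0, 1.0],
    retain=[2.0, 2.0],
    d=0.10,
    acquisition_cost=5.0,
)
print(round(value, 2))
```

Because each period enters with its own purchase probability, a higher attrition rate lowers the later terms of the first sum and thereby the CLTV.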

As an example, consider one of Amazonas’ customers, Michael, who is male, lives in New Jersey and made his initial purchase at the company 41 months ago. His last purchase was only 1 month ago, and he has spent a total of $238 so far: $124 on books and $114 on other goods. That means Michael spends on average $36.29 every 12 months on books and $33.37 a year on other goods. Furthermore, internal Amazonas research showed that customers who receive a newsletter are less likely to churn than customers who never receive an email from Amazonas. This suggests a difference in customer lifetime value between those two groups.
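The annual spending figures follow from dividing the total spending by the 41 months of the relationship and multiplying by 12. A short Python check of this arithmetic, with variable names chosen only for illustration:

```python
months_since_first_purchase = 41
spent_on_books = 124.0
spent_on_other_goods = 114.0

# Average spending per 12 months = total spending / months * 12.
annual_books = spent_on_books / months_since_first_purchase * 12
annual_other = spent_on_other_goods / months_since_first_purchase * 12

print(round(annual_books, 2), round(annual_other, 2))
```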

Exhibit 4 shows the calculation of Michael's value with and without newsletter marketing, predicted five years into the future. The maintenance and support service costs are estimated to be $2 per year per customer. Furthermore, to retain the average customer, $3 per year is assigned to customer relationship management and promotional measures. As the different attrition rates of the two groups result in different probabilities of remaining an active customer, the expected profits per average customer differ as well. Discounting to the present at a 10% discount rate yields the present value of expected profits per year. To calculate the CLTV over the five years, we add up the discounted values of the individual years. In the upper table of Exhibit 4, the costs and revenues are used to calculate the CLTV for Michael in the case that he is not targeted with newsletters. The lower table of Exhibit 4 displays the lifetime value calculation for Michael in the case that he receives newsletters. In this example, the two calculations differ only in the attrition rate, that is, the probability that Michael stops purchasing from Amazonas.

Exhibit 4 shows that the current CLTV of Michael is $230.42 if he receives newsletters. In comparison, without any newsletter he is worth only $206.40 at this point due to the higher attrition rate. This is a difference in customer value of $24.02, or more than 10%. The result suggests that newsletter mailing is of substantial value for Amazonas.

In order to maximize its profits, Amazonas aims to maximize the lifetime value of every customer. Therefore, Amazonas needs to understand how its customers interact with the company. Below is a framework for analyzing the relationship between Amazonas and its customers in order to identify effective methods to increase the CLTV.

The Customer Lifecycle

The CLTV of any customer can be used as a reference figure for adjusting marketing activities and the promotional mix. To do so, marketers have to analyze the relationship between the organization and the customer. One common framework to describe the relationship between a customer and a business over time is the customer lifecycle. The customer lifecycle divides the relationship into three stages, which are shown in Exhibit 5. The relationship starts with the acquisition stage, which represents the period before an individual becomes a customer. This stage is followed by the development stage and eventually by the retention stage (Kim 2015b; Mason 2001; Gallo 2014). In this thesis, current Amazonas customers are analyzed. Therefore, I am working within the retention stage.

Exhibit 5 The customer lifecycle (Mason 2001; Kim 2015b, p. 20)

In the acquisition stage individuals can be differentiated into Prospects and Responders. A Prospect is any potential customer in the target market who is not yet a customer. Responders, however, have at least had some contact with the business but have not yet evolved into customers (Mason 2001; Kim 2015b; Ofek 2014).

A responder who has made an initial purchase of a product or service is called a New Customer. Within the development stage, a customer’s early behavior can often be used to predict future behavior and can thus be followed by suitable marketing actions. Once the customer is acquired, the main focus is to retain him and to encourage him to make repeat purchases and become a loyal customer. Based on historical customer data, companies are able to rate their customers according to their value to the organization. Customer value usually follows a power-law distribution. This means that only a few customers account for the majority of profits and have high customer lifetime values. Naturally, businesses try to retain those.

However, there are also low-value customers who are unprofitable for the firm. Marketers try to convert them into profitable ones or to make them churn, that is, to leave the company voluntarily. High-potential customers vary in profitability but appear to have the potential to join the high-value customer group. They are of special interest to marketers, who try to turn them into high-value customers (Mason 2001; Ofek 2014).

Customers who are no longer active are called Former Customers. Some customers churn voluntarily, often because of attractive offers by competitors. This behavior is particularly common in the telecommunication and financial services industries. Other customers churn because they do not need the product or service anymore, for instance, parents who no longer need diapers because their child has outgrown them. In some cases, however, companies have to force customers to churn, for example, when they fail to pay (Mason 2001; Ofek 2014).

Along this customer lifecycle, the value of a customer can be optimized in every stage. One possibility to increase Customer Lifetime Value is to reduce acquisition costs through more targeted acquisition campaigns, in particular by identifying profitable customers in order to focus on the acquisition of high-value customers. Within the development stage, the CLTV can be expanded, for instance, by cross-selling, up-selling, personalized communications and targeted offers. Furthermore, by making more attractive offers than competitors, voluntary attrition of valued customers can be decreased. As a result, the customer lifetime is extended and thus the CLTV is higher. Moreover, by increasing fees or reducing service, some unprofitable customers can be converted into profitable ones, while the remaining ones are encouraged to leave (Mason 2001; Ofek 2014). Samuel (2015, p. 1) illustrates that “data can play a powerful role in telling your story at every stage in your relationship with customers”.

Applied Business Statistics

After having outlined the importance of customer analysis, I now turn to the statistical concepts that are important for the actual analysis. Statistics has evolved into a crucial instrument in marketing due to its predictive functionality. Therefore, this chapter gives an overview of the use of statistics in marketing, which represents a foundation for the understanding and appropriate use of Stata in chapter 1.4 and the following chapters.

Sampling

The analyzed Amazonas data is a random sample of 50,000 customers extracted from the full database of 550,000 customers. There are two main reasons for using samples in marketing. Firstly, it is difficult and expensive to collect complete data from every individual or item examined in a statistical study (called the population) (Weiss 2012; Kim 2015c). Secondly, test samples are especially important for testing the effectiveness of marketing campaigns such as promotional offers.

By employing one or more representative samples of its customer population, the marketer is able to identify the profitable customers who are most likely to respond to a campaign (and buy) and, consequently, to estimate the response among the whole population. In a next step, the company would roll out the full campaign to all those customers who have been identified as potentially interested based on the test sample or test campaign (Clow and James 2013). In most cases, the company saves money by pre-selecting the target audience, resulting in higher profits. Moreover, a large portion of the company’s customers will not be bothered by irrelevant offers, as the data analysis excludes them from targeting.

Therefore, most statistical procedures are based merely on a sample, that is, “part of the population from which information is obtained” (Weiss 2012, p. 4). There are different types of samples. A representative sample can be used to predict the opinions of the population without contacting every single member because it represents, on a smaller scale, the composition of the overall population (Stine and Foster 2014a). In order to obtain a representative sample with the composition of the whole population, Stine and Foster (2014a, p. 311) suggest “to pick members of the population at random”.

In this context, Stine and Foster (2014a, p. 311) mention that “larger populations don’t require larger samples”. In particular, the size of a population does not affect the minimum size of a sample drawn from it. However, considering the trade-off between marginal insight and costs, samples should be as large as the costs allow.

Nevertheless, sampling involves uncertainty, as incomplete information is used to make statements about a population. Biased samples, for instance those that omit one part of the population completely, distort inferences about the population and can only be used with care for prediction (Stine and Foster 2014a). Kim (2015c, p. 7) describes a bias as a “systematic error in selecting the sample” which can result from voluntary response during the data collection, from convenience samples, or from a certain wording in the survey (Kim 2015c).

There are multiple approaches to sampling customer data. The easiest and a very common one is simple random sampling (Mooi and Sarstedt 2014; Clow and James 2013; Stine and Foster 2014a). In Stata, the command sample draws a random sample. With the command “sample 10”, Stata creates a random sample comprising 10% of the observations in the data set. With the command “sample 10, count”, Stata creates a random sample of only 10 individual customers (Stata 2015b).
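The difference between sampling a percentage and sampling a fixed count can be sketched in Python. This is an analogy to the two Stata commands, not a reimplementation of them. The helper names `sample_percent` and `sample_count` and the customer IDs are hypothetical.

```python
import random


def sample_percent(population, percent, seed=None):
    """Simple random sample without replacement, sized as a percentage."""
    rng = random.Random(seed)
    size = round(len(population) * percent / 100)
    return rng.sample(population, size)


def sample_count(population, count, seed=None):
    """Simple random sample without replacement, sized as a fixed count."""
    rng = random.Random(seed)
    return rng.sample(population, count)


# Hypothetical customer account numbers 1..1000.
customer_ids = list(range(1, 1001))

ten_percent = sample_percent(customer_ids, 10, seed=1)  # analogous to "sample 10"
ten_customers = sample_count(customer_ids, 10, seed=1)  # analogous to "sample 10, count"

print(len(ten_percent), len(ten_customers))
```

Fixing the seed makes the draw reproducible, which is useful when a test campaign has to be documented.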

Randomization, which prevents bias, allows us to gauge characteristics of a population based on a random sample (Stine and Foster 2014a, p. 312). Characteristics of the population, for instance, its average revenue per customer and the respective variance, are called population parameters. Characteristics of a sample are called sample statistics. There is only one value of a given parameter per population, for instance, only one population average of sales, but there is a practically unlimited number of sample statistics because the marketer can draw many different samples from the overall population (Kim 2015d). Using representative sample statistics, we can estimate the corresponding population parameters. For example, a population mean can be estimated using a sample (Stine and Foster 2014a, p. 316).
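The distinction between one population parameter and many sample statistics can be illustrated with a small simulation in Python. The population below is randomly generated and purely hypothetical. It only stands in for a revenue-per-customer variable.

```python
import random
import statistics

rng = random.Random(42)

# Hypothetical population: revenue per customer for 10,000 customers.
population = [rng.uniform(0, 200) for _ in range(10_000)]

# Population parameter: there is exactly one population mean.
population_mean = statistics.mean(population)

# Sample statistics: every random sample yields its own sample mean.
sample_means = [statistics.mean(rng.sample(population, 500)) for _ in range(5)]

print(round(population_mean, 2))
print([round(m, 2) for m in sample_means])
```

Each sample mean differs slightly from the others, yet all of them lie close to the population mean, which is why a random sample can serve as an estimate.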

Determining Association Between Variables

As stated in chapter 1, prediction is of very high value for marketers seeking to find and develop the most profitable customers. It should be noted that prediction is mainly about finding a relationship between variables. As mentioned in chapter 1.14, there are two types of variables, metric and non-metric. Consequently, there are three possible combinations that can be of interest for marketing analysis: the association between two metric variables, between two non-metric variables, and between one metric and one non-metric variable.

The Association Between Two Non-Metric Variables

In order to describe the association between two non-metric variables, we use the cross-tabs method (Kim 2015f). For Amazonas, there are two genders within the 50,000 customer sample: 33,302 customers are female, whereas 16,698 are male, as shown in the left table of Exhibit 6. Furthermore, the right table of Exhibit 6 shows the purchase results of the executed marketing test campaign: 4,522 out of the 50,000 customers made a purchase in response to the test offer that was sent to them. The remaining 45,478 customers from the test sample, who received the same offer, did not respond with a purchase.

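Exhibit 6

| Gender | Customers | Share |
| --- | --- | --- |
| Male | 16,698 | 33.4% |
| Female | 33,302 | 66.6% |

| Special offer | Customers |
| --- | --- |
| Yes | 4,522 |
| No | 45,478 |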

If there were no relationship between gender and buying decision, the company would predict the response shown in Exhibit 7 below, that is, by multiplying the number of respondents and non-respondents by the share of male and female customers. This results in the expectation that 1,510 buyers are male, whereas 3,012 are female.

Exhibit 7 Predict Purchase

| | No | Yes | Total |
| --- | --- | --- | --- |
| Male | 45,478 × 33.4% = 15,188 | 4,522 × 33.4% = 1,510 | 16,698 |
| Female | 45,478 × 66.6% = 30,290 | 4,522 × 66.6% = 3,012 | 33,302 |
| Total | 45,478 | 4,522 | 50,000 |
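The expected counts under the assumption of no relationship can be reproduced from the row and column totals alone: each cell equals row total × column total / sample size. The following Python sketch uses the totals reported above together with the unrounded gender shares. The variable names are chosen only for illustration.

```python
# Totals taken from the gender split and the campaign response.
row_totals = {"male": 16698, "female": 33302}
col_totals = {"no": 45478, "yes": 4522}
n = 50000

# Expected count per cell if gender and response were unrelated.
expected = {
    (sex, response): row * col / n
    for sex, row in row_totals.items()
    for response, col in col_totals.items()
}

for cell, value in expected.items():
    print(cell, round(value))
```

The expected counts in each row and column add up to the respective totals, which is a quick plausibility check for such a table.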

However, reality is different. To show the actual response by gender, the company has to create a cross-tab. In Stata, the analyst runs the command “tabulate sex response_buy” to obtain an overview table as shown in Exhibit 8.

Exhibit 8 Stata cross-tab

As illustrated in Exhibit 8, the test results show that 2,133 male customers made a purchase in response to the test offer, rather than the expected 1,510. Furthermore, only 2,389 women responded to the offer with a purchase, that is, 623 fewer than calculated in Exhibit 7 above under the assumption of no relationship between gender and buying decision.

Consequently, the question arises whether the difference between the actual and the predicted response (assuming the variables “sex” and “response_buy” are independent) is significant. Hence, we specify the null hypothesis H0: predicted response and actual response are the same, that is, there is no relationship between gender and response (Kim 2015f).

Chi-Square Test

To determine whether the association between non-metric variables is significant, a common method is the chi-square test. The chi-square test compares the actual numbers of the test campaign shown in Exhibit 8 above with the expected numbers which are shown in Exhibit 7 above (Kim 2015f).

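$$\chi^2 = \sum \frac{(O - E)^2}{E} = \frac{(14{,}565 - 15{,}188)^2}{15{,}188} + \frac{(2{,}133 - 1{,}510)^2}{1{,}510} + \frac{(30{,}913 - 30{,}290)^2}{30{,}290} + \frac{(2{,}389 - 3{,}012)^2}{3{,}012} \approx 424.02$$

Here $O$ denotes the observed count and $E$ the expected count of a cell (row total × column total / 50,000). The expected counts are shown rounded to whole customers; the value of 424.02 results from the unrounded expected counts.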

The larger the χ² test statistic, the more the actual test results differ from the expected numbers of men and women who purchase. A large value thus indicates an association, meaning that knowing one variable tells you something about the other variable. To run a chi-square test in Stata, simply add “, chi2” to the tabulate command used before (Kim 2015f).

“Pr” represents the p-value of the chi-square test in the Stata output. As the p-value is below 0.05, we reject the null hypothesis and conclude that the actual and the predicted response are significantly different. We can infer that gender and purchase decision are significantly associated, that is, gender matters (Kim 2015f).
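The chi-square statistic and its p-value can be recomputed by hand from the observed cross-tab counts. The following Python sketch uses the standard Pearson formula, summing (observed − expected)² / expected over all cells. It is an illustration of the calculation and not the Stata output itself. For a 2×2 table with one degree of freedom, the p-value is obtained here from the complementary error function.

```python
import math

# Observed counts from the cross-tab of sex and response_buy.
observed = {
    ("male", "no"): 14565, ("male", "yes"): 2133,
    ("female", "no"): 30913, ("female", "yes"): 2389,
}

n = sum(observed.values())

# Row and column totals derived from the observed counts.
row_totals = {}
col_totals = {}
for (sex, response), count in observed.items():
    row_totals[sex] = row_totals.get(sex, 0) + count
    col_totals[response] = col_totals.get(response, 0) + count

# Pearson chi-square: sum over cells of (observed - expected)^2 / expected.
chi2 = 0.0
for (sex, response), count in observed.items():
    expected = row_totals[sex] * col_totals[response] / n
    chi2 += (count - expected) ** 2 / expected

# Upper-tail probability of a chi-square distribution with 1 degree of freedom.
p_value = math.erfc(math.sqrt(chi2 / 2))

print(round(chi2, 2), p_value < 0.05)
```

Because the expected counts are derived from the row and column totals, only the four observed cells are needed as input.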

The Association Between Two Metric Variables

Scatter

As for Amazonas, it seems interesting to measure the association between the total amount of money spent and the various genres of books. Thus, the company can assess which genre is most popular among customer who spend a lot, or least which genre is most often bought. A good way to get a graphic depiction of how two metric variables are associated, is to draw a scatterplot.

In order to create a scatterplot in Stata, we use the command “scatter <var1> <var2> … <varn>”, where the placeholders <var1>, <var2>, …, <varn> are replaced by the variables of interest, such as “purchase_total” and “genre_cook”. Stata plots the last variable listed on the x-axis. The scatterplot in Exhibit 10 shows the association between the total amount of money spent per customer and the purchases of cook books per customer.
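A minimal sketch of such a plot is given below. It uses the two variable names mentioned in the text; the overlaid fitted line and the axis titles are additions for illustration and are not part of the original analysis. The “///” line continuation works in a do-file, not in the interactive command window.

```stata
* y-variable first, x-variable last
twoway (scatter purchase_total genre_cook) ///
       (lfit purchase_total genre_cook), ///
       ytitle("Total amount spent") xtitle("Cook books purchased")
```

The fitted line makes the direction of the association easier to judge by eye than the cloud of points alone.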

Exhibit 10 Scatterplot Stata

The scatterplot in Exhibit 10 shows each customer in the data sample as a dot, with “purchase_total” on the y-axis and the number of cook books purchased per customer on the x-axis. The distribution of the data points suggests that customers who buy more cook books tend to spend more in total; whether this correlation is statistically significant is tested in the next section.

1.1.1.2 Measuring Correlation

A common method to statistically describe the association between two metric variables, for instance “purchase_total” and “genre_cook”, is to calculate the correlation coefficient, which is bounded between -1 and 1. To run a correlation analysis in Stata, the analyst uses the command “pwcorr <var1> <var2> … <varn>” with the variables of interest. To see whether the correlation is statistically significant, the analyst appends “, sig” to the command. Stata will then report the p-value for each correlation as well (Kim 2015f). The result of this analysis is illustrated in Exhibit 11 below.

This output is interpreted as follows: the highest correlation is found between the total amount of money spent and cook books, with a correlation coefficient of 0.3671. Its p-value of 0.0000, which is displayed in the row below the respective correlation coefficient, indicates that this result is statistically significant.
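The correlation analysis described above can be sketched as follows. Only “purchase_total” and “genre_cook” are named in the text; the wildcard in the second command assumes that all genre variables share the prefix “genre_”, which is a guess about the data set.

```stata
* Correlation of two variables with its p-value and the number of observations
pwcorr purchase_total genre_cook, sig obs

* Assumption: all genre count variables start with "genre_"
pwcorr purchase_total genre_*, sig star(0.05)
```

The option “star(0.05)” marks every coefficient that is significant at the 5% level, which helps when the correlation matrix contains many genres.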

Following this, it could be concluded that a customer who buys cook books is more likely to be a top spender than a customer who does not. However, the result could equally mean that a top spender is more likely to purchase a cook book. In general, correlations give only limited information and say nothing about the direction or causality of a relationship. The correlation coefficient merely describes how consistently two customer characteristics move together within the sample. Correlations can therefore be misleading and have to be used carefully (Kim 2015f).

Furthermore, in many cases the association between two variables is misleading because a third variable that drives both of them is not included in the analysis. This problem is referred to as “omitted variable bias” or “spurious correlation”. To guard against this bias, all results should be questioned critically (Kim 2015f).

1.1.1.3 Ordinary Least Squares Regression Analysis

A more sophisticated approach to measuring the influence of one variable on another is ordinary least squares (OLS) regression analysis. OLS regression gives us the equation of a line that is fitted to our sample of customers and takes all data points into account. The regression equation estimates rates of change, for instance, the increase in the total amount of money spent that is associated with a larger number of purchased cook books. The equation of the line is:

(2)

or equivalently,

(3)

In this example, “purchase_total” on the y-axis is called the response, outcome or dependent variable. The number of purchased cook books on the x-axis is called the explanatory, predictor or independent variable (Stine and Foster 2014d). Amazonas now has to choose the intercept b0 and the slope b1 in a way that describes all data points as well and as evenly as possible. In other words, Amazonas needs to find the one line that has, in total, the smallest distance to all points. The vertical deviation of a data point from the line is called a residual and is illustrated graphically in Exhibit 12.

Exhibit 12 Residuals (Stine and Foster 2014, p. 488)

Using least squares regression, Amazonas chooses the line (of all potential lines) that makes the sum of the squared residuals as small as possible. The method squares the residuals so that positive and negative deviations are treated equally. As calculating all residuals by hand is a lot of work, statistical software is normally used. In particular, the intercept and slope are calculated in the following way:

$$b_0 = \bar{y} - b_1\,\bar{x} \tag{4}$$

and

$$b_1 = r\,\frac{s_y}{s_x} \tag{5}$$

The first formula, for the intercept b0, expresses that the point of means (x̄, ȳ) always lies on the line. The slope b1 is the product of the correlation r between x and y and the ratio of the standard deviations s_y/s_x (Stine and Foster 2014d, p. 489).
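To see that these two relationships reproduce what the software reports, the intercept and slope can be computed by hand from summary statistics and compared with the regression output. The sketch below assumes that neither variable has missing values, so that all commands use the same customers; the scalar names are chosen for illustration only.

```stata
* Correlation between response and explanatory variable
quietly correlate purchase_total genre_cook
scalar rho = r(rho)

* Mean and standard deviation of the response
quietly summarize purchase_total
scalar ybar = r(mean)
scalar sy   = r(sd)

* Mean and standard deviation of the explanatory variable
quietly summarize genre_cook
scalar xbar = r(mean)
scalar sx   = r(sd)

* Slope = correlation times ratio of standard deviations;
* intercept follows because the line passes through the point of means
scalar b1 = scalar(rho) * scalar(sy) / scalar(sx)
scalar b0 = scalar(ybar) - scalar(b1) * scalar(xbar)
display "slope = " scalar(b1) "   intercept = " scalar(b0)

* Should show the same coefficient and constant
regress purchase_total genre_cook
```

The coefficient on “genre_cook” and the constant (“_cons”) in the regression table should agree with the two displayed values up to rounding.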

To make the regression more meaningful, further explanatory variables can be added. For Amazonas, it is interesting how the number of books of each genre purchased by a customer affects the total amount of money spent by him or her. Therefore, Amazonas runs a regression with “purchase_total” as the response variable and all its book genres as explanatory variables. To run this regression in Stata, the command “regress <response variable> <explanatory variable1> … <explanatory variablen>” is used, where the placeholders in angle brackets are replaced by the actual variables of interest from the sample (Kim 2015f). The outcome of this regression is shown in Exhibit 13 below.

Exhibit 13 Stata regression

The p-value of the regression as a whole is shown as Prob > F = 0.0000. The regression is therefore statistically significant, as its p-value is smaller than 0.01. From this regression, Amazonas can infer that each additional cook book a customer purchases is associated with an increase of 15.5 in the total amount of money spent, holding the purchases in the other genres constant. However, the low r-squared of 0.266 shows that this regression equation does not describe the data sample very well. The r-squared is explained further below.
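A regression of this kind, together with the quantities read from its output, might be run as sketched below. The wildcard assumes that all genre variables share the prefix “genre_” (only “genre_cook” is named in the text), and the names of the two new variables are hypothetical.

```stata
* Assumption: every genre count variable starts with "genre_"
regress purchase_total genre_*

* R-squared and overall F-test p-value from the stored results
display "R-squared = " e(r2)
display "Prob > F  = " Ftail(e(df_m), e(df_r), e(F))

* Fitted values and residuals for each customer
predict total_hat, xb
predict total_resid, residuals
```

Saving the residuals makes it possible to inspect which customers the fitted equation describes poorly.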

1.1.1.3.1 R-Squared

The quantity r2, or r-squared, says how well the least squares regression line fits the data and has a value between 0 and 1. For example, r2 = 0 states that the regression line does not explain any of the variation in the data. If r2 = 1, all of the variation in the data is represented by the regression line. In general, r2, the square of the correlation, is a commonly used summary of a regression because it is convenient to interpret (Stine and Foster 2014e).
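In symbols, a common way to write this measure is shown below. The notation (fitted values with a hat, the mean of the response with a bar) is an assumption of this sketch rather than taken from the text. The numerical line squares the correlation of 0.3671 reported earlier, so it applies to a regression of “purchase_total” on “genre_cook” alone.

```latex
\[
R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2},
\qquad 0 \le R^2 \le 1
\]
% With a single explanatory variable, R^2 equals the squared correlation:
\[
r^2 = 0.3671^2 \approx 0.135
\]
```

Read this way, the number of cook books alone would account for roughly 13.5% of the variation in the total amount spent.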

As for Amazonas, the last regression of “purchase_total” on the different book genres has an r2 of 0.266. It can be concluded that the total amount of money spent by a customer cannot be described very well by this regression equation. Therefore, the company has to look for a more sophisticated way to gain insights into its customers’ preferences and behavior, which is the subject of chapter 4.

1.1.2 Associations Between Metric And Non-Metric Variables

Now consider the case that Amazonas wants to know whether female or male customers spend more money in order to improve its targeting. To this end, the company needs to assess the association between the non-metric variable “sex” and the metric variable “purchase_total”. It therefore has to calculate the average amount of money spent by men and by women. The command “tabstat purchase_total, statistics(mean) by(sex)” determines the average (mean) total amount of money spent (purchase_total) for each of the two genders (sex) (Kim 2015f). The results are illustrated in Exhibit 14 below: the average female customer spent approximately $201 in total, whereas the average male customer spent more than $223.
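A slightly extended version of this command is sketched below. The additional statistics and the display format are illustrative additions, not part of the original analysis.

```stata
* Mean, standard deviation and number of customers per gender
tabstat purchase_total, statistics(mean sd n) by(sex) format(%9.2f)
```

Reporting the standard deviation and group size next to the mean already gives a first impression of whether the gap between the two averages is large relative to the spread within each group.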

Exhibit 14 Female versus male customers

In order to determine whether the means of men and women are statistically different, Amazonas runs a t-test in Stata with the command “ttest purchase_total, by(sex)”. The results of the t-test are illustrated in Exhibit 15 below.
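The test can be sketched as follows. The first command is the one given in the text; the variant with “unequal” and the display of the stored p-value are additions for illustration.

```stata
* Two-sample t-test assuming equal variances (Stata's default)
ttest purchase_total, by(sex)

* Variant that does not assume equal variances in the two groups
ttest purchase_total, by(sex) unequal

* Two-sided p-value of the most recent test
display "p-value = " r(p)
```

The null hypothesis in both versions is that the mean of “purchase_total” is the same for both genders.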

Exhibit 15 t-test purchase_total: men versus women

The output shows that male customers spent $223.64 in total on average, whereas female customers spent $200.64 on average. This output