Problem: Only about 11% of customers at a Portuguese bank purchased long-term deposits (between 2008 and 2013). How can the bank increase these numbers?

This is a telemarketing campaign, so time is money! What we want to maximize is the number of sales during a fixed time period (e.g. 40 hours), and I propose that targeting customers based only on their occupation will improve overall sales. The key performance indicator (KPI) that we want to optimize is therefore the time needed to ensure that at least one sale occurs.

Exploratory Data Analysis

As you probably know, data scientists spend the majority of their time cleaning up data. I converted a few variables into binary variables and converted the day and month strings to integers. After cleaning, we can summarize the data at a high level. In this case, our typical customer is 40 years old, married, works in the admin field, graduated from university and is contacted via a cell phone.
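A minimal sketch of that cleaning step, assuming pandas, raw `yes`/`no` strings in the binary columns, and three-letter abbreviations for `day_of_week` and `month`. The zero-based integer codes are an assumption chosen to match the ranges in the profile further down (`day_of_week` runs from 0 to 4), and the helper name `clean` is hypothetical.

```python
import pandas as pd

# Assumed raw encodings; the zero-based codes are illustrative.
DAY_CODES = {"mon": 0, "tue": 1, "wed": 2, "thu": 3, "fri": 4}
MONTH_CODES = {m: i for i, m in enumerate(
    ["jan", "feb", "mar", "apr", "may", "jun",
     "jul", "aug", "sep", "oct", "nov", "dec"])}
BINARY_COLUMNS = ["default", "housing", "loan", "y"]


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with binary and calendar columns encoded as integers."""
    out = df.copy()
    for col in BINARY_COLUMNS:
        # Assumption: anything other than "yes" (e.g. "unknown") becomes 0.
        out[col] = (out[col] == "yes").astype(int)
    out["day_of_week"] = out["day_of_week"].map(DAY_CODES)
    out["month"] = out["month"].map(MONTH_CODES)
    return out


# Two made-up rows, only to show the transformation.
raw = pd.DataFrame({
    "default": ["no", "unknown"],
    "housing": ["yes", "no"],
    "loan": ["no", "yes"],
    "y": ["no", "yes"],
    "day_of_week": ["mon", "thu"],
    "month": ["may", "nov"],
})
cleaned = clean(raw)
print(cleaned)
```

Mapping through explicit dictionaries keeps the codes readable and makes an unexpected label show up as a missing value instead of silently getting a code.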

Stats on the current campaign:

  • Primarily conducted from late spring through summer, particularly May (month code 4 in the zero-based encoding of the profile below)
  • Calls were spread nearly uniformly across the days of the week
  • Calls that resulted in no purchase lasted an average of 3.7 minutes and those that did result in a sale lasted an average of 9.2 minutes
  • Each customer was contacted an average of 2.5 times during this campaign

Stats on previous campaign(s):

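  • The mean of pdays (days since the customer was last contacted for a previous campaign) is 962, but that figure is dominated by the placeholder value 999, which marks customers who were not previously contacted (96.3% of rows); the real gaps range from 0 to 27 days
  • During the previous campaign(s), customers received an average of 0.17 contacts each, or roughly one contact for every 5.8 customers
  • Every customer was contacted at least once during the current campaign
  • 24.4% of the customers contacted during the previous campaign purchased a long-term deposit
  • 86.3% of all customers were not contacted at all during the previous campaign
    • Which means only 13.7% of all customers were contacted at least once during both campaigns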
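The previous-campaign figures can be re-derived from the value counts in the profile further down. A short check, using only the `previous` and `poutcome` counts reported there; reading `poutcome` = 1 as a purchase follows the text above.

```python
# Value counts copied from the profile: number of contacts per customer
# before this campaign (`previous`) and the outcome of the earlier
# campaign (`poutcome`, missing for customers never contacted before).
previous_counts = {0: 35563, 1: 4561, 2: 754, 3: 216, 4: 70, 5: 18, 6: 5, 7: 1}
poutcome_counts = {0: 4252, 1: 1373}

customers = sum(previous_counts.values())
never_contacted = previous_counts[0] / customers
contacted = 1 - never_contacted
mean_contacts = sum(k * n for k, n in previous_counts.items()) / customers
previous_conversion = poutcome_counts[1] / sum(poutcome_counts.values())

print(f"never contacted before: {never_contacted:.1%}")
print(f"contacted at least once: {contacted:.1%}")
print(f"mean earlier contacts per customer: {mean_contacts:.3f}")
print(f"purchase rate among those contacted: {previous_conversion:.1%}")
```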

The standard deviation of some socio-economic variables is high relative to the mean (emp.var.rate, for example, has a coefficient of variation of about 19).

Automated Profiling

Gathering descriptive statistics can be a tedious process. Fortunately, there are libraries that do all of the data crunching for you and output a very clear profile of your data. pandas-profiling is one of them: it offers out-of-the-box statistical profiling of a dataset. Since the dataset we are using is tidy and standardized, we can run the library on it right away.
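For orientation, several of the per-variable numbers in such a profile are ordinary pandas summaries. The sketch below is not the library's implementation; it recomputes a few fields for `day_of_week`, rebuilding the column from the value counts in the report that follows. The helper name `summarize` is hypothetical.

```python
import pandas as pd

# Rebuild the column from the reported value counts (value -> count).
counts = {0: 8514, 1: 8090, 2: 8134, 3: 8623, 4: 7827}
day_of_week = pd.Series(
    [value for value, n in counts.items() for _ in range(n)],
    name="day_of_week",
)


def summarize(s: pd.Series) -> dict:
    """Compute a handful of the statistics shown for a numeric variable."""
    return {
        "distinct_count": s.nunique(),
        "missing_n": int(s.isna().sum()),
        "mean": s.mean(),
        "std": s.std(),  # sample standard deviation (ddof=1)
        "coef_of_variation": s.std() / s.mean(),
        "median": s.median(),
        "iqr": s.quantile(0.75) - s.quantile(0.25),
        "zeros_pct": (s == 0).mean() * 100,
    }


stats = summarize(day_of_week)
for name, value in stats.items():
    print(name, value)
```

Comparing the printed values with the `day_of_week` entry in the report is a quick way to confirm how each field is defined.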

Overview

Dataset info

Number of variables 22
Number of observations 41188
Total Missing (%) 3.9%
Total size in memory 5.5 MiB
Average record size in memory 141.0 B

Variables types

Numeric 11
Categorical 5
Boolean 4
Date 0
Text (Unique) 0
Rejected 2
Unsupported 0

Variables

age
Numeric

Distinct count 78
Unique (%) 0.2%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 40.024
Minimum 17
Maximum 98
Zeros (%) 0.0%

Quantile statistics

Minimum 17
5-th percentile 26
Q1 32
Median 38
Q3 47
95-th percentile 58
Maximum 98
Range 81
Interquartile range 15

Descriptive statistics

Standard deviation 10.421
Coef of variation 0.26037
Kurtosis 0.79131
Mean 40.024
MAD 8.4615
Skewness 0.7847
Sum 1648511
Variance 108.6
Memory size 321.9 KiB
Value Count Frequency (%)  
31 1947 4.7%
 
32 1846 4.5%
 
33 1833 4.5%
 
36 1780 4.3%
 
35 1759 4.3%
 
34 1745 4.2%
 
30 1714 4.2%
 
37 1475 3.6%
 
29 1453 3.5%
 
39 1432 3.5%
 
Other values (68) 24204 58.8%
 

Minimum 5 values

Value Count Frequency (%)  
17 5 0.0%
 
18 28 0.1%
 
19 42 0.1%
 
20 65 0.2%
 
21 102 0.2%
 

Maximum 5 values

Value Count Frequency (%)  
91 2 0.0%
 
92 4 0.0%
 
94 1 0.0%
 
95 1 0.0%
 
98 2 0.0%
 

age_bracket
Categorical

Distinct count 4
Unique (%) 0.0%
Missing (%) 0.0%
Missing (n) 0
Value Count Frequency (%)  
1 19768 48.0%
 
2 19188 46.6%
 
3 2041 5.0%
 
4 191 0.5%
 

campaign
Numeric

Distinct count 42
Unique (%) 0.1%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 2.5676
Minimum 1
Maximum 56
Zeros (%) 0.0%

Quantile statistics

Minimum 1
5-th percentile 1
Q1 1
Median 2
Q3 3
95-th percentile 7
Maximum 56
Range 55
Interquartile range 2

Descriptive statistics

Standard deviation 2.77
Coef of variation 1.0788
Kurtosis 36.98
Mean 2.5676
MAD 1.6342
Skewness 4.7625
Sum 105754
Variance 7.673
Memory size 321.9 KiB
Value Count Frequency (%)  
1 17642 42.8%
 
2 10570 25.7%
 
3 5341 13.0%
 
4 2651 6.4%
 
5 1599 3.9%
 
6 979 2.4%
 
7 629 1.5%
 
8 400 1.0%
 
9 283 0.7%
 
10 225 0.5%
 
Other values (32) 869 2.1%
 

Minimum 5 values

Value Count Frequency (%)  
1 17642 42.8%
 
2 10570 25.7%
 
3 5341 13.0%
 
4 2651 6.4%
 
5 1599 3.9%
 

Maximum 5 values

Value Count Frequency (%)  
40 2 0.0%
 
41 1 0.0%
 
42 2 0.0%
 
43 2 0.0%
 
56 1 0.0%
 

cons.conf.idx
Numeric

Distinct count 26
Unique (%) 0.1%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean -40.503
Minimum -50.8
Maximum -26.9
Zeros (%) 0.0%

Quantile statistics

Minimum -50.8
5-th percentile -47.1
Q1 -42.7
Median -41.8
Q3 -36.4
95-th percentile -33.6
Maximum -26.9
Range 23.9
Interquartile range 6.3

Descriptive statistics

Standard deviation 4.6282
Coef of variation -0.11427
Kurtosis -0.35856
Mean -40.503
MAD 3.9383
Skewness 0.30318
Sum -1668200
Variance 21.42
Memory size 321.9 KiB
Value Count Frequency (%)  
-36.4 7763 18.8%
 
-42.7 6685 16.2%
 
-46.2 5794 14.1%
 
-36.1 5175 12.6%
 
-41.8 4374 10.6%
 
-42.0 3616 8.8%
 
-47.1 2458 6.0%
 
-31.4 770 1.9%
 
-40.8 715 1.7%
 
-26.9 447 1.1%
 
Other values (16) 3391 8.2%
 

Minimum 5 values

Value Count Frequency (%)  
-50.8 128 0.3%
 
-50.0 282 0.7%
 
-49.5 204 0.5%
 
-47.1 2458 6.0%
 
-46.2 5794 14.1%
 

Maximum 5 values

Value Count Frequency (%)  
-33.0 172 0.4%
 
-31.4 770 1.9%
 
-30.1 357 0.9%
 
-29.8 267 0.6%
 
-26.9 447 1.1%
 

cons.price.idx
Numeric

Distinct count 26
Unique (%) 0.1%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 93.576
Minimum 92.201
Maximum 94.767
Zeros (%) 0.0%

Quantile statistics

Minimum 92.201
5-th percentile 92.713
Q1 93.075
Median 93.749
Q3 93.994
95-th percentile 94.465
Maximum 94.767
Range 2.566
Interquartile range 0.919

Descriptive statistics

Standard deviation 0.57884
Coef of variation 0.0061858
Kurtosis -0.82981
Mean 93.576
MAD 0.50981
Skewness -0.23089
Sum 3854200
Variance 0.33506
Memory size 321.9 KiB
Value Count Frequency (%)  
93.994 7763 18.8%
 
93.918 6685 16.2%
 
92.893 5794 14.1%
 
93.444 5175 12.6%
 
94.465 4374 10.6%
 
93.2 3616 8.8%
 
93.075 2458 6.0%
 
92.201 770 1.9%
 
92.963 715 1.7%
 
92.431 447 1.1%
 
Other values (16) 3391 8.2%
 

Minimum 5 values

Value Count Frequency (%)  
92.201 770 1.9%
 
92.379 267 0.6%
 
92.431 447 1.1%
 
92.469 178 0.4%
 
92.649 357 0.9%
 

Maximum 5 values

Value Count Frequency (%)  
94.199 303 0.7%
 
94.215 311 0.8%
 
94.465 4374 10.6%
 
94.601 204 0.5%
 
94.767 128 0.3%
 

contact
Categorical

Distinct count 2
Unique (%) 0.0%
Missing (%) 0.0%
Missing (n) 0
Value Count Frequency (%)  
cellular 26144 63.5%
 
telephone 15044 36.5%
 

day_of_week
Numeric

Distinct count 5
Unique (%) 0.0%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 1.9796
Minimum 0
Maximum 4
Zeros (%) 20.7%

Quantile statistics

Minimum 0
5-th percentile 0
Q1 1
Median 2
Q3 3
95-th percentile 4
Maximum 4
Range 4
Interquartile range 2

Descriptive statistics

Standard deviation 1.4115
Coef of variation 0.71304
Kurtosis -1.2998
Mean 1.9796
MAD 1.2032
Skewness 0.00055242
Sum 81535
Variance 1.9924
Memory size 321.9 KiB
Value Count Frequency (%)  
3 8623 20.9%
 
0 8514 20.7%
 
2 8134 19.7%
 
1 8090 19.6%
 
4 7827 19.0%
 

Minimum 5 values

Value Count Frequency (%)  
0 8514 20.7%
 
1 8090 19.6%
 
2 8134 19.7%
 
3 8623 20.9%
 
4 7827 19.0%
 

Maximum 5 values

Value Count Frequency (%)  
0 8514 20.7%
 
1 8090 19.6%
 
2 8134 19.7%
 
3 8623 20.9%
 
4 7827 19.0%
 

default
Boolean

Distinct count 2
Unique (%) 0.0%
Missing (%) 0.0%
Missing (n) 0
Mean 7.2837e-05
Value Count Frequency (%)  
0 41185 100.0%
 
1 3 0.0%
 

duration
Numeric

Distinct count 1544
Unique (%) 3.7%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 258.29
Minimum 0
Maximum 4918
Zeros (%) 0.0%

Quantile statistics

Minimum 0
5-th percentile 36
Q1 102
Median 180
Q3 319
95-th percentile 752.65
Maximum 4918
Range 4918
Interquartile range 217

Descriptive statistics

Standard deviation 259.28
Coef of variation 1.0038
Kurtosis 20.248
Mean 258.29
MAD 171.67
Skewness 3.2631
Sum 10638243
Variance 67226
Memory size 321.9 KiB
Value Count Frequency (%)  
85 170 0.4%
 
90 170 0.4%
 
136 168 0.4%
 
73 167 0.4%
 
124 164 0.4%
 
87 162 0.4%
 
72 161 0.4%
 
104 161 0.4%
 
111 160 0.4%
 
106 159 0.4%
 
Other values (1534) 39546 96.0%
 

Minimum 5 values

Value Count Frequency (%)  
0 4 0.0%
 
1 3 0.0%
 
2 1 0.0%
 
3 3 0.0%
 
4 12 0.0%
 

Maximum 5 values

Value Count Frequency (%)  
3631 1 0.0%
 
3643 1 0.0%
 
3785 1 0.0%
 
4199 1 0.0%
 
4918 1 0.0%
 

education
Categorical

Distinct count 8
Unique (%) 0.0%
Missing (%) 0.0%
Missing (n) 0
Value Count Frequency (%)  
university.degree 12168 29.5%
 
high.school 9515 23.1%
 
basic.9y 6045 14.7%
 
professional.course 5243 12.7%
 
basic.4y 4176 10.1%
 
basic.6y 2292 5.6%
 
unknown 1731 4.2%
 
illiterate 18 0.0%
 

emp.var.rate
Numeric

Distinct count 10
Unique (%) 0.0%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 0.081886
Minimum -3.4
Maximum 1.4
Zeros (%) 0.0%

Quantile statistics

Minimum -3.4
5-th percentile -2.9
Q1 -1.8
Median 1.1
Q3 1.4
95-th percentile 1.4
Maximum 1.4
Range 4.8
Interquartile range 3.2

Descriptive statistics

Standard deviation 1.571
Coef of variation 19.185
Kurtosis -1.0626
Mean 0.081886
MAD 1.4228
Skewness -0.7241
Sum 3372.7
Variance 2.4679
Memory size 321.9 KiB
Value Count Frequency (%)  
1.4 16234 39.4%
 
-1.8 9184 22.3%
 
1.1 7763 18.8%
 
-0.1 3683 8.9%
 
-2.9 1663 4.0%
 
-3.4 1071 2.6%
 
-1.7 773 1.9%
 
-1.1 635 1.5%
 
-3.0 172 0.4%
 
-0.2 10 0.0%
 

Minimum 5 values

Value Count Frequency (%)  
-3.4 1071 2.6%
 
-3.0 172 0.4%
 
-2.9 1663 4.0%
 
-1.8 9184 22.3%
 
-1.7 773 1.9%
 

Maximum 5 values

Value Count Frequency (%)  
-1.1 635 1.5%
 
-0.2 10 0.0%
 
-0.1 3683 8.9%
 
1.1 7763 18.8%
 
1.4 16234 39.4%
 

euribor3m
Highly correlated

This variable is highly correlated with emp.var.rate and should be ignored for analysis

Correlation 0.97224

housing
Boolean

Distinct count 2
Unique (%) 0.0%
Missing (%) 0.0%
Missing (n) 0
Mean 0.52384
Value Count Frequency (%)  
1 21576 52.4%
 
0 19612 47.6%
 

job
Categorical

Distinct count 12
Unique (%) 0.0%
Missing (%) 0.0%
Missing (n) 0
Value Count Frequency (%)  
admin. 10422 25.3%
 
blue-collar 9254 22.5%
 
technician 6743 16.4%
 
services 3969 9.6%
 
management 2924 7.1%
 
retired 1720 4.2%
 
entrepreneur 1456 3.5%
 
self-employed 1421 3.5%
 
housemaid 1060 2.6%
 
unemployed 1014 2.5%
 
Other values (2) 1205 2.9%
 

loan
Boolean

Distinct count 2
Unique (%) 0.0%
Missing (%) 0.0%
Missing (n) 0
Mean 0.15169
Value Count Frequency (%)  
0 34940 84.8%
 
1 6248 15.2%
 

marital
Categorical

Distinct count 4
Unique (%) 0.0%
Missing (%) 0.0%
Missing (n) 0
Value Count Frequency (%)  
married 24928 60.5%
 
single 11568 28.1%
 
divorced 4612 11.2%
 
unknown 80 0.2%
 

month
Numeric

Distinct count 10
Unique (%) 0.0%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 5.6079
Minimum 2
Maximum 11
Zeros (%) 0.0%

Quantile statistics

Minimum 2
5-th percentile 3
Q1 4
Median 5
Q3 7
95-th percentile 10
Maximum 11
Range 9
Interquartile range 3

Descriptive statistics

Standard deviation 2.041
Coef of variation 0.36395
Kurtosis -0.027876
Mean 5.6079
MAD 1.661
Skewness 0.85151
Sum 230978
Variance 4.1657
Memory size 321.9 KiB
Value Count Frequency (%)  
4 13769 33.4%
 
6 7174 17.4%
 
7 6178 15.0%
 
5 5318 12.9%
 
10 4101 10.0%
 
3 2632 6.4%
 
9 718 1.7%
 
8 570 1.4%
 
2 546 1.3%
 
11 182 0.4%
 

Minimum 5 values

Value Count Frequency (%)  
2 546 1.3%
 
3 2632 6.4%
 
4 13769 33.4%
 
5 5318 12.9%
 
6 7174 17.4%
 

Maximum 5 values

Value Count Frequency (%)  
7 6178 15.0%
 
8 570 1.4%
 
9 718 1.7%
 
10 4101 10.0%
 
11 182 0.4%
 

nr.employed
Highly correlated

This variable is highly correlated with euribor3m and should be ignored for analysis

Correlation 0.94515

pdays
Numeric

Distinct count 27
Unique (%) 0.1%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 962.48
Minimum 0
Maximum 999
Zeros (%) 0.0%

Quantile statistics

Minimum 0
5-th percentile 999
Q1 999
Median 999
Q3 999
95-th percentile 999
Maximum 999
Range 999
Interquartile range 0

Descriptive statistics

Standard deviation 186.91
Coef of variation 0.1942
Kurtosis 22.229
Mean 962.48
MAD 70.362
Skewness -4.9222
Sum 39642439
Variance 34936
Memory size 321.9 KiB
Value Count Frequency (%)  
999 39673 96.3%
 
3 439 1.1%
 
6 412 1.0%
 
4 118 0.3%
 
9 64 0.2%
 
2 61 0.1%
 
7 60 0.1%
 
12 58 0.1%
 
10 52 0.1%
 
5 46 0.1%
 
Other values (17) 205 0.5%
 

Minimum 5 values

Value Count Frequency (%)  
0 15 0.0%
 
1 26 0.1%
 
2 61 0.1%
 
3 439 1.1%
 
4 118 0.3%
 

Maximum 5 values

Value Count Frequency (%)  
22 3 0.0%
 
25 1 0.0%
 
26 1 0.0%
 
27 1 0.0%
 
999 39673 96.3%
 

poutcome
Numeric

Distinct count 3
Unique (%) 0.0%
Missing (%) 86.3%
Missing (n) 35563
Infinite (%) 0.0%
Infinite (n) 0
Mean 0.24409
Minimum 0
Maximum 1
Zeros (%) 10.3%

Quantile statistics

Minimum 0
5-th percentile 0
Q1 0
Median 0
Q3 0
95-th percentile 1
Maximum 1
Range 1
Interquartile range 0

Descriptive statistics

Standard deviation 0.42958
Coef of variation 1.7599
Kurtosis -0.57967
Mean 0.24409
MAD 0.36902
Skewness 1.1919
Sum 1373
Variance 0.18454
Memory size 321.9 KiB
Value Count Frequency (%)  
0.0 4252 10.3%
 
1.0 1373 3.3%
 
(Missing) 35563 86.3%
 

Minimum 5 values

Value Count Frequency (%)  
0.0 4252 10.3%
 
1.0 1373 3.3%
 

Maximum 5 values

Value Count Frequency (%)  
0.0 4252 10.3%
 
1.0 1373 3.3%
 

previous
Numeric

Distinct count 8
Unique (%) 0.0%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 0.17296
Minimum 0
Maximum 7
Zeros (%) 86.3%

Quantile statistics

Minimum 0
5-th percentile 0
Q1 0
Median 0
Q3 0
95-th percentile 1
Maximum 7
Range 7
Interquartile range 0

Descriptive statistics

Standard deviation 0.4949
Coef of variation 2.8613
Kurtosis 20.109
Mean 0.17296
MAD 0.29868
Skewness 3.832
Sum 7124
Variance 0.24493
Memory size 321.9 KiB
Value Count Frequency (%)  
0 35563 86.3%
 
1 4561 11.1%
 
2 754 1.8%
 
3 216 0.5%
 
4 70 0.2%
 
5 18 0.0%
 
6 5 0.0%
 
7 1 0.0%
 

Minimum 5 values

Value Count Frequency (%)  
0 35563 86.3%
 
1 4561 11.1%
 
2 754 1.8%
 
3 216 0.5%
 
4 70 0.2%
 

Maximum 5 values

Value Count Frequency (%)  
3 216 0.5%
 
4 70 0.2%
 
5 18 0.0%
 
6 5 0.0%
 
7 1 0.0%
 

y
Boolean

Distinct count 2
Unique (%) 0.0%
Missing (%) 0.0%
Missing (n) 0
Mean 0.11265
Value Count Frequency (%)  
0 36548 88.7%
 
1 4640 11.3%
 

Sample

age job marital education default housing loan contact month day_of_week duration campaign pdays previous poutcome emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed y age_bracket
0 56 housemaid married basic.4y 0 0 0 telephone 4 0 261 1 999 0 NaN 1.1 93.994 -36.4 4.857 5191.0 0 2
1 57 services married high.school 0 0 0 telephone 4 0 149 1 999 0 NaN 1.1 93.994 -36.4 4.857 5191.0 0 2
2 37 services married high.school 0 1 0 telephone 4 0 226 1 999 0 NaN 1.1 93.994 -36.4 4.857 5191.0 0 1
3 40 admin. married basic.6y 0 0 0 telephone 4 0 151 1 999 0 NaN 1.1 93.994 -36.4 4.857 5191.0 0 2
4 56 services married high.school 0 0 1 telephone 4 0 307 1 999 0 NaN 1.1 93.994 -36.4 4.857 5191.0 0 2

Key Performance Indicator

For this experiment, the KPI that we wish to optimize is the average duration required to make one sale. My proposal is to target customers based only on their occupation. With this being said, it is also very important that all other variables remain the same (e.g. the total number of calls should not change dramatically). We do not wish to change the distribution of customer classes such as job or education, only the selection process used to identify candidates to call.
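One way to make this KPI concrete, offered as an assumption and not necessarily the definition used in the analysis: divide total call time by the number of sales. The totals come from the `duration` and `y` entries of the profile, and the function name `seconds_per_sale` is hypothetical.

```python
def seconds_per_sale(total_call_seconds: float, sales: int) -> float:
    """Average call time spent per sale; lower is better."""
    if sales <= 0:
        raise ValueError("at least one sale is needed to compute the KPI")
    return total_call_seconds / sales


# Baseline for the current campaign, from the profile totals
# (sum of `duration` in seconds, count of y = 1).
baseline = seconds_per_sale(total_call_seconds=10638243, sales=4640)
print(f"{baseline:.0f} s per sale (about {baseline / 60:.1f} minutes)")
```

In an A/B comparison the same function would be applied to each group's totals, and the group with the lower value wins.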

So we want to call fewer customers who work in the blue-collar and services fields, and call more students and retired customers. We can simply swap the observed probabilities of those groups (blue-collar and services on one side, student and retired on the other).

Feasibility

To determine whether this experiment is feasible, we need to compute the minimum sample size for each group. A few assumptions are made here:

  • The groups will be the same size
  • An even split will result in 2 groups that have equal variance in the output variable being measured (purchase)
  • Sales associates work 6 hours a day, 5 days a week (or 108,000 seconds a week)

We can find the value of N (sample size) as:

\[N = \left(\frac{t}{0.05}\right)^{2} \cdot 2 \cdot \mathrm{var}, \qquad t = -1.652\]

So N needs to be at least 218.25, and since we cannot make 0.25 of a call we round up to 219. Group A and group B therefore each need to make 219 calls, which at the average call duration of 258.29 seconds comes to 56,564.42 seconds per group. Since there are 108,000 seconds in a work week (6 hours a day, 5 days a week), this experiment will require at least 75 calls per group, for a total of 150 calls.
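The sample-size arithmetic can be reproduced in a few lines. This is a sketch under one assumption that the text does not state: `var` is taken to be the Bernoulli variance p(1 - p) of the purchase variable `y`, with p the overall purchase rate from the profile. That choice reproduces the 218.25 figure.

```python
import math

T_STATISTIC = -1.652            # t value from the formula above
DELTA = 0.05                    # the 0.05 in the formula's denominator
SALES, CALLS = 4640, 41188      # `y` value counts from the profile
TOTAL_CALL_SECONDS = 10638243   # sum of `duration` from the profile

p = SALES / CALLS
var = p * (1 - p)               # assumption: Bernoulli variance of `y`

n_exact = (T_STATISTIC / DELTA) ** 2 * 2 * var
n_calls = math.ceil(n_exact)    # calls needed per group
mean_call_seconds = TOTAL_CALL_SECONDS / CALLS
seconds_per_group = n_calls * mean_call_seconds
work_week_seconds = 6 * 5 * 3600  # 6 hours a day, 5 days a week

print(round(n_exact, 2), n_calls, round(seconds_per_group, 2))
print(round(seconds_per_group / work_week_seconds, 2), "work weeks per group")
```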

Conclusion

In conclusion, we are going to need at least 12 sales associates (6 per group) who will make a total of 150 calls. This is quite feasible.

  • The more calls made, the higher the probability of a sale, so adding more sales associates should also result in more sales. However, the number of sales associates in each group needs to be equal.