Ctrip Customer Churn Analysis (Personal Practice Project with Source Code)

This is a personal practice project on customer churn analysis at Ctrip (Trip.com), involving substantial data processing and machine-learning modeling. By exploring and modeling a dataset of roughly 690,000 rows and 51 columns, the goal is to predict churn and understand its causes and patterns, providing decision support for the business.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from time import time
import datetime
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.feature_selection import f_classif, SelectFromModel, VarianceThreshold
from sklearn.ensemble import RandomForestClassifier as RFC
%matplotlib inline

plt.rcParams['font.family'] = ['SimHei']    # use a CJK font so Chinese labels render in plots
plt.rcParams['axes.unicode_minus'] = False  # keep minus signs rendering correctly with SimHei
# Read the raw file (tab-separated)
df = pd.read_csv('userlostprob.txt', sep='\t')
# First five rows
df.head()
(wide preview table omitted — 5 rows × 51 columns)

# Distribution of the target label
df['label'].value_counts()
0    500588
1    189357
Name: label, dtype: int64
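From the counts above, churners (label = 1) make up about 27% of the samples, so the classes are imbalanced — worth remembering when judging model scores later. A quick check from the printed counts:

```python
# class counts copied from the value_counts() output above
counts = {0: 500588, 1: 189357}
churn_rate = counts[1] / sum(counts.values())
print(round(churn_rate, 3))  # 0.274
```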
# Last five rows
df.tail()
(wide preview table omitted — 5 rows × 51 columns)

# Five random rows
df.sample(5)
(wide preview table omitted — 5 rows × 51 columns)

# Shape of the data
df.shape
(689945, 51)
# Column dtypes
df.dtypes
label                                 int64
sampleid                              int64
d                                    object
arrival                              object
iforderpv_24h                         int64
decisionhabit_user                  float64
historyvisit_7ordernum              float64
historyvisit_totalordernum          float64
hotelcr                             float64
ordercanceledprecent                float64
landhalfhours                       float64
ordercanncelednum                   float64
commentnums                         float64
starprefer                          float64
novoters                            float64
consuming_capacity                  float64
historyvisit_avghotelnum            float64
cancelrate                          float64
historyvisit_visit_detailpagenum    float64
delta_price1                        float64
price_sensitive                     float64
hoteluv                             float64
businessrate_pre                    float64
ordernum_oneyear                    float64
cr_pre                              float64
avgprice                            float64
lowestprice                         float64
firstorder_bu                       float64
customereval_pre2                   float64
delta_price2                        float64
commentnums_pre                     float64
customer_value_profit               float64
commentnums_pre2                    float64
cancelrate_pre                      float64
novoters_pre2                       float64
novoters_pre                        float64
ctrip_profits                       float64
deltaprice_pre2_t1                  float64
lowestprice_pre                     float64
uv_pre                              float64
uv_pre2                             float64
lowestprice_pre2                    float64
lasthtlordergap                     float64
businessrate_pre2                   float64
cityuvs                             float64
cityorders                          float64
lastpvgap                           float64
cr                                  float64
sid                                   int64
visitnum_oneyear                    float64
h                                     int64
dtype: object
# Basic information, including non-null counts
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 689945 entries, 0 to 689944
Data columns (total 51 columns):
 #   Column                            Non-Null Count   Dtype  
---  ------                            --------------   -----  
 0   label                             689945 non-null  int64  
 1   sampleid                          689945 non-null  int64  
 2   d                                 689945 non-null  object 
 3   arrival                           689945 non-null  object 
 4   iforderpv_24h                     689945 non-null  int64  
 5   decisionhabit_user                385450 non-null  float64
 6   historyvisit_7ordernum            82915 non-null   float64
 7   historyvisit_totalordernum        386525 non-null  float64
 8   hotelcr                           689148 non-null  float64
 9   ordercanceledprecent              447831 non-null  float64
 10  landhalfhours                     661312 non-null  float64
 11  ordercanncelednum                 447831 non-null  float64
 12  commentnums                       622029 non-null  float64
 13  starprefer                        464892 non-null  float64
 14  novoters                          672918 non-null  float64
 15  consuming_capacity                463837 non-null  float64
 16  historyvisit_avghotelnum          387876 non-null  float64
 17  cancelrate                        678227 non-null  float64
 18  historyvisit_visit_detailpagenum  307234 non-null  float64
 19  delta_price1                      437146 non-null  float64
 20  price_sensitive                   463837 non-null  float64
 21  hoteluv                           689148 non-null  float64
 22  businessrate_pre                  483896 non-null  float64
 23  ordernum_oneyear                  447831 non-null  float64
 24  cr_pre                            660548 non-null  float64
 25  avgprice                          457261 non-null  float64
 26  lowestprice                       687931 non-null  float64
 27  firstorder_bu                     376993 non-null  float64
 28  customereval_pre2                 661312 non-null  float64
 29  delta_price2                      437750 non-null  float64
 30  commentnums_pre                   598368 non-null  float64
 31  customer_value_profit             439123 non-null  float64
 32  commentnums_pre2                  648457 non-null  float64
 33  cancelrate_pre                    653015 non-null  float64
 34  novoters_pre2                     657616 non-null  float64
 35  novoters_pre                      648956 non-null  float64
 36  ctrip_profits                     445187 non-null  float64
 37  deltaprice_pre2_t1                543180 non-null  float64
 38  lowestprice_pre                   659689 non-null  float64
 39  uv_pre                            660548 non-null  float64
 40  uv_pre2                           661189 non-null  float64
 41  lowestprice_pre2                  660664 non-null  float64
 42  lasthtlordergap                   447831 non-null  float64
 43  businessrate_pre2                 602960 non-null  float64
 44  cityuvs                           682274 non-null  float64
 45  cityorders                        651263 non-null  float64
 46  lastpvgap                         592818 non-null  float64
 47  cr                                457896 non-null  float64
 48  sid                               689945 non-null  int64  
 49  visitnum_oneyear                  592910 non-null  float64
 50  h                                 689945 non-null  int64  
dtypes: float64(44), int64(5), object(2)
memory usage: 268.5+ MB
# Descriptive statistics with extra percentiles
df.describe([0.01, 0.1, 0.25, 0.5, 0.75, 0.9, 0.99])
(wide describe output omitted — 12 rows × 49 columns)

# Remove duplicate rows (the shape is unchanged, so there were none)
df.drop_duplicates(inplace=True)
df.shape
(689945, 51)
# Missing-value ratio per feature, sorted ascending
null = df.isnull().mean().reset_index().sort_values(0)
null_1 = null.rename(columns={'index': '特征', 0: '缺失比'})
null_1
    特征                              缺失比
0   label                             0.000000
48  sid                               0.000000
4   iforderpv_24h                     0.000000
50  h                                 0.000000
2   d                                 0.000000
1   sampleid                          0.000000
3   arrival                           0.000000
8   hotelcr                           0.001155
21  hoteluv                           0.001155
26  lowestprice                       0.002919
44  cityuvs                           0.011118
17  cancelrate                        0.016984
14  novoters                          0.024679
28  customereval_pre2                 0.041500
10  landhalfhours                     0.041500
40  uv_pre2                           0.041679
41  lowestprice_pre2                  0.042440
39  uv_pre                            0.042608
24  cr_pre                            0.042608
38  lowestprice_pre                   0.043853
34  novoters_pre2                     0.046857
33  cancelrate_pre                    0.053526
45  cityorders                        0.056065
35  novoters_pre                      0.059409
32  commentnums_pre2                  0.060132
12  commentnums                       0.098437
43  businessrate_pre2                 0.126075
30  commentnums_pre                   0.132731
49  visitnum_oneyear                  0.140642
46  lastpvgap                         0.140775
37  deltaprice_pre2_t1                0.212720
22  businessrate_pre                  0.298646
13  starprefer                        0.326190
20  price_sensitive                   0.327719
15  consuming_capacity                0.327719
47  cr                                0.336330
25  avgprice                          0.337250
23  ordernum_oneyear                  0.350918
42  lasthtlordergap                   0.350918
11  ordercanncelednum                 0.350918
9   ordercanceledprecent              0.350918
36  ctrip_profits                     0.354750
31  customer_value_profit             0.363539
29  delta_price2                      0.365529
19  delta_price1                      0.366405
16  historyvisit_avghotelnum          0.437816
7   historyvisit_totalordernum        0.439774
5   decisionhabit_user                0.441332
27  firstorder_bu                     0.453590
18  historyvisit_visit_detailpagenum  0.554698
6   historyvisit_7ordernum            0.879824
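The same ranking can drive an automated cut: drop every column whose missing ratio exceeds a chosen threshold (only historyvisit_7ordernum, at ~88%, is dropped below). A minimal sketch with an illustrative threshold of 0.5:

```python
import pandas as pd

toy = pd.DataFrame({'a': [1, None, None, None], 'b': [1, 2, 3, None]})
ratios = toy.isnull().mean()       # a: 0.75, b: 0.25
kept = toy.loc[:, ratios <= 0.5]   # keep only columns at or below the threshold
print(kept.columns.tolist())       # ['b']
```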
# Density plot of the missing-value ratios
plt.figure(figsize=(8,6))
sns.kdeplot(null_1['缺失比'], fill=True)  # fill replaces the deprecated shade argument

(figure: KDE of the missing-value ratios)

# Bar chart of the missing ratio per feature
plt.figure(figsize=(8,6))
plt.bar(range(null_1.shape[0]), null_1['缺失比'], label='missing rate')
plt.legend(loc='best')

(figure: bar chart of missing ratios per feature)

# Drop the column with excessive missing values (historyvisit_7ordernum is ~88% missing)
df = df.drop(['historyvisit_7ordernum'], axis=1)
df
(wide table output omitted — 689945 rows × 50 columns)

# Look for outliers
df.describe([0.01, 0.25, 0.5, 0.75, 0.99]).T
(transposed describe output omitted, one row per feature — note lowestprice has a minimum of -3, and both price columns a maximum of 100000)
# Columns with outliers: a negative price and implausibly high prices appear here
df[['lowestprice_pre', 'lowestprice']].describe([0.01, 0.25, 0.5, 0.75, 0.99]).T
                 count     mean        std         min   1%    25%    50%    75%    99%     max
lowestprice_pre  659689.0  315.954583  463.723643   1.0  38.0  118.0  208.0  385.0  1750.0  100000.0
lowestprice      687931.0  318.806242  575.782415  -3.0  37.0  116.0  200.0  380.0  1823.0  100000.0
# Columns to cap
col_block = ['lowestprice_pre', 'lowestprice']
# Capping (winsorization) helpers: clip values beyond the 1st/99th percentiles
def block_upper(x):
    upper = x.quantile(0.99)
    out = x.mask(x > upper, upper)
    return out

def block_lower(x):
    lower = x.quantile(0.01)
    out = x.mask(x < lower, lower)
    return out
# Apply the caps
df[col_block] = df[col_block].apply(block_upper)

df[col_block] = df[col_block].apply(block_lower)

df[['lowestprice_pre', 'lowestprice']].describe([0.01, 0.25, 0.5, 0.75, 0.99]).T
                 count     mean        std         min   1%    25%    50%    75%    99%     max
lowestprice_pre  659689.0  304.439507  287.192512  38.0  38.0  118.0  208.0  385.0  1750.0  1750.0
lowestprice      687931.0  305.025771  297.382838  37.0  37.0  116.0  200.0  380.0  1823.0  1823.0
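The two mask-based helpers can also be expressed as a single vectorized `DataFrame.clip` over both quantiles at once; a sketch (the function name `cap_quantiles` is illustrative, not from the source):

```python
import pandas as pd

def cap_quantiles(frame, cols, lower=0.01, upper=0.99):
    """Winsorize the given columns at their own 1st/99th percentiles."""
    bounds = frame[cols].quantile([lower, upper])  # one row per quantile
    out = frame.copy()
    out[cols] = frame[cols].clip(lower=bounds.loc[lower], upper=bounds.loc[upper], axis=1)
    return out

toy = pd.DataFrame({'price': list(range(1, 101))})
capped = cap_quantiles(toy, ['price'])
```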
# Deep copy so later edits do not touch the original frame
df_copy = df.copy(deep=True)

# Features: drop the label and sample-id columns
X = df_copy.iloc[:, 2:]

# Target column
y = df_copy.iloc[:, 0]
X.head(10)
(wide preview omitted — 10 rows × 48 columns)

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
# Drop the date columns from both splits
col_date = ['d', 'arrival']
X_train.drop(col_date, axis=1, inplace=True)
X_test.drop(col_date, axis=1, inplace=True)
X_train.shape
(517458, 46)
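Since the label is imbalanced (~27% positives), passing `stratify=y` to `train_test_split` keeps the class ratio identical in both splits rather than only approximately. A sketch on synthetic data (all names here are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 3))
y_toy = (rng.random(1000) < 0.27).astype(int)  # roughly 27% positives, like the churn label

# stratify keeps the 0/1 ratio the same (up to rounding) in train and test
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=42, stratify=y_toy)
```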
# Group features for different imputation strategies
col = X_train.columns.tolist()
col_no = ['sid', 'iforderpv_24h', 'h'] # features with no missing values (the two date features are already dropped)
col_clf = ['decisionhabit_user'] # categorical-like feature
col_neg = ['delta_price1', 'delta_price2', 'customer_value_profit', 'ctrip_profits', 'deltaprice_pre2_t1'] # features containing negative values
col_35 = ['ordernum_oneyear', 'lasthtlordergap', 'ordercanncelednum',
          'ordercanceledprecent', 'ctrip_profits', 'historyvisit_avghotelnum', 'historyvisit_totalordernum',  # features missing more than 35%
          'decisionhabit_user', 'firstorder_bu', 'historyvisit_visit_detailpagenum']
col_std = X_train.columns[X_train.describe(include='all').T['std'] > 100].to_list() # features with standard deviation above 100
col_std.remove('sid')
col_std.remove('delta_price2')
col_std.remove('delta_price1')
col_std.remove('lasthtlordergap')
col_norm = list(set(col) - set(col_no + col_clf + col_neg + col_35))  # the remaining features
# Impute missing values in the training set
X_train[col_clf] = X_train[col_clf].fillna(X_train[col_clf].mode().iloc[0])  # .iloc[0] gives the per-column mode; fillna(mode()) alone would align by row index

X_train[col_neg] = X_train[col_neg].fillna(X_train[col_neg].median())

X_train[col_35] = X_train[col_35].fillna(-1)  # sentinel value for heavily missing features

X_train[col_std] = X_train[col_std].fillna(X_train[col_std].median())

X_train[col_norm] = X_train[col_norm].fillna(X_train[col_norm].mean())
# Impute the test set the same way
X_test[col_clf] = X_test[col_clf].fillna(X_test[col_clf].mode().iloc[0])

X_test[col_neg] = X_test[col_neg].fillna(X_test[col_neg].median())

X_test[col_35] = X_test[col_35].fillna(-1)

X_test[col_std] = X_test[col_std].fillna(X_test[col_std].median())

X_test[col_norm] = X_test[col_norm].fillna(X_test[col_norm].mean())
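One pitfall worth spelling out: `fillna(df.mode())` aligns the mode frame by row index, so at most row 0 could ever be filled; selecting the first mode row gives the intended per-column fill. A minimal demonstration:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'x': [1.0, np.nan, np.nan, 1.0, 2.0]})

bad = toy.fillna(toy.mode())           # index-aligned: rows 1 and 2 stay NaN
good = toy.fillna(toy.mode().iloc[0])  # fills every NaN with the column's mode

print(bad['x'].isna().sum(), good['x'].isna().sum())  # 2 0
```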
# Confirm no missing values remain
X_train.isnull().any().sum()
0
X_test.isnull().any().sum()
0
X_train.shape
(517458, 46)
X_test.shape
(172487, 46)
# Variance filter (drops zero-variance features; all 46 survive)
selector = VarianceThreshold()

X_train_var = selector.fit_transform(X_train)
X_train_var.shape
(517458, 46)
# ANOVA F-test of each feature against the label
f, p_values = f_classif(X_train, y_train)

(p_values > 0.01).sum()  # number of features that fail the test at p = 0.01
6
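`f_classif` runs a one-way ANOVA of each feature against the class label; an informative feature gets a vanishing p-value while pure noise does not. A toy check on synthetic data:

```python
import numpy as np
from sklearn.feature_selection import f_classif

rng = np.random.default_rng(0)
y_toy = rng.integers(0, 2, 400)
X_toy = np.column_stack([
    y_toy + rng.normal(scale=0.5, size=400),  # informative: mean shifts with the label
    rng.normal(size=400),                     # pure noise, unrelated to the label
])
F, p = f_classif(X_toy, y_toy)
```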
# Keep only the features that pass the F-test
col_f = X_train.columns[p_values <= 0.01]
X_train = X_train[col_f]
X_train.shape
(517458, 40)
X_test = X_test[col_f]
X_test.shape
(172487, 40)
# Reset indices after the split
X_train = X_train.reset_index(drop=True)
X_test = X_test.reset_index(drop=True)
y_train = y_train.reset_index(drop=True)
y_test = y_test.reset_index(drop=True)
# Fit a random forest and extract feature importances
rfc = RFC(n_estimators=10, random_state=42)
importances = rfc.fit(X_train, y_train).feature_importances_
importances
array([0.0116028 , 0.01896167, 0.01802851, 0.0150776 , 0.01591757,
       0.02015549, 0.02034934, 0.0203898 , 0.01943307, 0.0211841 ,
       0.01791408, 0.02056686, 0.02291317, 0.02291462, 0.02051174,
       0.02323111, 0.02106971, 0.01054197, 0.02306534, 0.01938803,
       0.0258305 , 0.02571515, 0.02508617, 0.02485002, 0.02534584,
       0.02850795, 0.02408976, 0.02625239, 0.0279376 , 0.02743348,
       0.02772216, 0.03058076, 0.02754623, 0.04068007, 0.03904556,
       0.03655198, 0.03553142, 0.03861745, 0.04116244, 0.03829647])
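The raw array is hard to read on its own; pairing importances with column names in a pandas Series makes the ranking explicit. A sketch on synthetic data (the toy dataset and names are illustrative — any fitted forest works the same way):

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_toy, y_toy = make_classification(n_samples=500, n_features=6, n_informative=3, random_state=42)
cols = [f'f{i}' for i in range(6)]

forest = RandomForestClassifier(n_estimators=10, random_state=42).fit(X_toy, y_toy)
ranked = pd.Series(forest.feature_importances_, index=cols).sort_values(ascending=False)
print(ranked.head(3).index.tolist())  # top three features by importance
```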
# Embedded selection with cross-validation: learning curve over importance thresholds
scores = []
thresholds = np.linspace(0, importances.max(), 20)
for i in thresholds:
    time0 = time()
    X_embedded = SelectFromModel(rfc, threshold=i).fit_transform(X_train, y_train)
    score = cross_val_score(rfc, X_embedded, y_train, cv=5, n_jobs=-1).mean()
    scores.append(score)
    print(datetime.datetime.fromtimestamp(time() - time0).strftime('%M:%S:%f'))

plt.plot(thresholds, scores)
plt.show()
01:12:090613
01:13:526636
01:09:249811
01:08:676264
01:07:224143
01:06:510684
01:14:200829
01:13:976302
01:13:173232
01:07:879761
01:05:081422
01:03:703478
00:57:668075
00:55:913328
00:49:275084
00:47:250707
00:48:596243
00:53:333874
00:45:020932
00:53:343730

(figure: CV score vs. importance threshold)

# Best cross-validation score
max(scores)
0.9507844100831351
# Threshold that achieves the best score
thresholds[scores.index(max(scores))]
0.028163774952387383
# Keep the features whose importance exceeds the best threshold
col_k = X_train.columns[importances > 0.028163774952387383].to_list()
X_train_embedded = X_train[col_k]
X_train_embedded.head()
(first five rows omitted — the 9 selected columns: ctrip_profits, lasthtlordergap, cityuvs, cityorders, lastpvgap, cr, sid, visitnum_oneyear, h)
X_test_embedded = X_test[col_k]
X_test_embedded.head()
(first five rows omitted — same 9 selected columns as the training set)
# Heatmap of pairwise correlations among the selected features
plt.figure(figsize=(10,8))
sns.heatmap(X_train_embedded.corr(), annot=True, linewidths=1)

(figure: correlation heatmap of the selected features)

# Drop one feature from the highly correlated pair
X_train_embedded.drop('cityuvs', axis=1, inplace=True)
X_test_embedded.drop('cityuvs', axis=1, inplace=True)
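The manual drop can be generalized: scan the upper triangle of the absolute correlation matrix and drop one member of every pair above a cutoff (0.9 below is an illustrative choice, not from the source):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=500)
toy = pd.DataFrame({'a': base,
                    'b': base + rng.normal(scale=0.05, size=500),  # near-duplicate of a
                    'c': rng.normal(size=500)})

corr = toy.corr().abs()
# keep only the upper triangle so each pair is counted once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
print(to_drop)  # ['b']
```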
# Save the cleaned data
X_train_embedded.to_csv('X_train_embedded.csv')
X_test_embedded.to_csv('X_test_embedded.csv')
y_train.to_csv('y_train.csv')
y_test.to_csv('y_test.csv')
# Reload the cleaned data
X_train_embedded = pd.read_csv('X_train_embedded.csv', index_col=0)
X_test_embedded = pd.read_csv('X_test_embedded.csv', index_col=0)
y_train = pd.read_csv('y_train.csv', index_col=0)
y_test = pd.read_csv('y_test.csv', index_col=0)
y_train = np.ravel(y_train)
y_train.shape
(517458,)
y_test = np.ravel(y_test)
y_test.shape
(172487,)
# 'lasthtlordergap': time since the last order within one year
# 'cityorders': app orders submitted yesterday for the current city and the same check-in date
# 'lastpvgap': time since the last visit within one year
# 'cr': user conversion rate
# 'sid': session id; sid == 1 can be treated as a new visitor
# 'visitnum_oneyear': number of visits in the past year
# 'h': hour of the visit
import scipy

# Recombine features and label for WoE binning
woe_data = pd.concat([X_train_embedded, pd.Series(y_train, name='label')], axis=1)
woe_data
(table output omitted — 517458 rows × 9 columns)

# Compute WoE statistics for a list of bins
def get_woe(num_bins):
    columns = ['min', 'max', 'count_0', 'count_1']
    df = pd.DataFrame(num_bins, columns=columns)
    
    df['total'] = df['count_0'] + df['count_1']
    df['percentage'] = df['total'] / df['total'].sum()
    df['bad_rate'] = df['count_1'] / df['total']
    df['good%'] = df['count_0'] / df['count_0'].sum()
    df['bad%'] = df['count_1'] / df['count_1'].sum()
    df['good-bad'] = df['good%'] - df['bad%']
    df['woe'] = np.log(df['good%'] / df['bad%'])
    return df
# Compute the IV from a WoE frame
def get_iv(df):
    iv = np.sum(df['good-bad'] * df['woe'])
    return iv
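A quick sanity check of the WoE/IV arithmetic on a hand-made bin list (a self-contained replica of the relevant columns; the bins are symmetric, so the IV works out to exactly 0.8·ln 4 ≈ 1.109):

```python
import numpy as np
import pandas as pd

# (min, max, count_0, count_1) per bin, in the same shape get_bin produces
bins = [(0, 10, 80, 20), (10, 20, 50, 50), (20, 30, 20, 80)]
tbl = pd.DataFrame(bins, columns=['min', 'max', 'count_0', 'count_1'])
tbl['good%'] = tbl['count_0'] / tbl['count_0'].sum()
tbl['bad%'] = tbl['count_1'] / tbl['count_1'].sum()
tbl['woe'] = np.log(tbl['good%'] / tbl['bad%'])
iv = ((tbl['good%'] - tbl['bad%']) * tbl['woe']).sum()
```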
# Build the detailed WoE matrix from equal-frequency bins
def get_bin(X, q):
    df = woe_data.copy()
    df['qcut'], updown = pd.qcut(df[X], retbins=True, q=q, duplicates='drop')
    count_0 = df[df['label']==0].groupby('qcut').count()['label']
    count_1 = df[df['label']==1].groupby('qcut').count()['label']
    num_bins = [*zip(updown,updown[1:],count_0,count_1)]
    woe_df = get_woe(num_bins)
    
    return woe_df
# Plot IV against the number of bins under chi-square merging
def get_graph(X, n=2, q=20):
    df = woe_data.copy()
    df['qcut'], updown = pd.qcut(df[X], retbins=True, q=q, duplicates='drop')
    count_0 = df[df['label']==0].groupby('qcut').count()['label']
    count_1 = df[df['label']==1].groupby('qcut').count()['label']
    num_bins = [*zip(updown,updown[1:],count_0,count_1)]

    IV = []
    axisx = []

    while len(num_bins) > n:
        pvs = []
        for i in range(len(num_bins)-1):
            x1 = num_bins[i][2:]
            x2 = num_bins[i+1][2:]
            pv = scipy.stats.chi2_contingency([x1,x2])[1]
            pvs.append(pv)
        i = pvs.index(max(pvs))
        num_bins[i:i+2] = [(num_bins[i][0], num_bins[i+1][1],
                           num_bins[i][2] + num_bins[i+1][2],
                           num_bins[i][3] + num_bins[i+1][3])]
        woe_df = get_woe(num_bins)
        axisx.append(len(num_bins))
        IV.append(get_iv(woe_df))

    plt.figure()
    plt.plot(axisx, IV)
    plt.xticks(axisx)
    plt.xlabel("number of bins")
    plt.ylabel("IV")
    plt.show()
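The merge rule inside get_graph: for each pair of adjacent bins, run a chi-square test on their 2×2 label counts, then merge the pair with the largest p-value — the two bins whose class mix is most similar. A toy illustration:

```python
from scipy.stats import chi2_contingency

similar   = [[50, 50], [52, 48]]  # nearly identical class mix across the two bins
different = [[90, 10], [40, 60]]  # clearly different class mix

p_similar = chi2_contingency(similar)[1]
p_different = chi2_contingency(different)[1]
print(p_similar > p_different)  # True: the similar pair would be merged first
```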
col_woe = ['ctrip_profits', 'lasthtlordergap', 'cityorders',
          'lastpvgap', 'cr', 'sid', 'visitnum_oneyear', 'h']
for i in col_woe:
    print(i)
    get_graph(i)
ctrip_profits
(figure: IV vs. number of bins)

lasthtlordergap
(figure: IV vs. number of bins)

cityorders
(figure: IV vs. number of bins)

lastpvgap
(figure: IV vs. number of bins)

cr
(figure: IV vs. number of bins)

sid
(figure: IV vs. number of bins)

visitnum_oneyear
(figure: IV vs. number of bins)

h
(figure: IV vs. number of bins)
# Users with lasthtlordergap between 2356 and 29219 churn at a noticeably higher rate
get_bin('lasthtlordergap', 10)
(WoE/IV table for lasthtlordergap omitted — 7 bins; bad_rate peaks around 0.40 in the 2356–13291 bin)
# Retention is noticeably better when the conversion rate cr is below 1.12
get_bin('cr',10)
(WoE/IV table for cr omitted — 5 bins)
# Retention generally falls as cityorders rises, apart from a clear rebound in the 1.4–2.25 bin
get_bin('cityorders',10)
(WoE/IV table for cityorders omitted — 10 bins)
# Users who visit after 7 p.m. churn less; daytime visitors churn more
get_bin('h',10)
(WoE/IV table for h omitted — 10 bins)
# Higher customer value does not simply mean lower churn; the 1.147-1.347 interval has the lowest churn rate
get_bin('ctrip_profits',10)
|   | min | max | count_0 | count_1 | total | percentage | bad_rate | good% | bad% | good-bad | woe |
|---|-----|-----|---------|---------|-------|------------|----------|-------|------|----------|-----|
| 0 | -44.313 | 0.147 | 37921 | 13979 | 51900 | 0.100298 | 0.269345 | 0.100987 | 0.098475 | 0.002512 | 0.025192 |
| 1 | 0.147 | 0.500 | 37673 | 13923 | 51596 | 0.099711 | 0.269846 | 0.100327 | 0.098080 | 0.002246 | 0.022645 |
| 2 | 0.500 | 1.147 | 37713 | 14400 | 52113 | 0.100710 | 0.276323 | 0.100433 | 0.101441 | -0.001007 | -0.009980 |
| 3 | 1.147 | 1.347 | 150615 | 44404 | 195019 | 0.376879 | 0.227691 | 0.401102 | 0.312803 | 0.088299 | 0.248641 |
| 4 | 1.347 | 1.587 | 8296 | 3333 | 11629 | 0.022473 | 0.286611 | 0.022093 | 0.023479 | -0.001386 | -0.060856 |
| 5 | 1.587 | 3.220 | 36089 | 15701 | 51790 | 0.100085 | 0.303167 | 0.096108 | 0.110605 | -0.014497 | -0.140493 |
| 6 | 3.220 | 7.327 | 35310 | 16403 | 51713 | 0.099937 | 0.317193 | 0.094034 | 0.115551 | -0.021517 | -0.206054 |
| 7 | 7.327 | 600.820 | 31886 | 19812 | 51698 | 0.099908 | 0.383226 | 0.084915 | 0.139565 | -0.054650 | -0.496877 |
# The IV values of lastpvgap, sid and visitnum_oneyear are too small, so these three features can be dropped
X_train_woe = X_train_embedded[['ctrip_profits', 'lasthtlordergap', 'cityorders', 'cr', 'h']]
X_test_woe = X_test_embedded[['ctrip_profits', 'lasthtlordergap', 'cityorders', 'cr', 'h']]
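The binning tables above come from a `get_bin` helper defined earlier in the project. For reference, the WOE/IV arithmetic behind them can be sketched as follows — a minimal re-implementation on toy data; `woe_iv`, its quantile binning, and the demo frame are assumptions, not the project's actual code:

```python
import numpy as np
import pandas as pd

def woe_iv(df, col, target, bins=10):
    """Quantile-bin `col` and compute per-bin WOE plus the total IV.
    WOE = ln(good% / bad%); IV = sum((good% - bad%) * WOE)."""
    binned = pd.qcut(df[col], q=bins, duplicates='drop')
    g = df.groupby(binned, observed=True)[target].agg(total='count', bad='sum')
    g['good'] = g['total'] - g['bad']
    good_pct = g['good'] / g['good'].sum()   # share of retained users per bin
    bad_pct = g['bad'] / g['bad'].sum()      # share of churned users per bin
    g['woe'] = np.log(good_pct / bad_pct)
    iv = ((good_pct - bad_pct) * g['woe']).sum()
    return g, iv

# toy data: churners (label 1) concentrate at higher x
rng = np.random.RandomState(0)
demo = pd.DataFrame({'x': np.r_[rng.normal(0, 1, 500), rng.normal(1.5, 1, 500)],
                     'label': np.r_[np.zeros(500, int), np.ones(500, int)]})
table, iv = woe_iv(demo, 'x', 'label', bins=5)
print(table)
print('IV =', round(iv, 3))
```

A feature with IV well below ~0.02 is conventionally treated as uninformative, which is the rule applied above to drop lastpvgap, sid and visitnum_oneyear.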
# Learning curve over max_depth
scores = []
time0 = time()
for i in np.arange(5,21,1):
    rfc = RFC(n_estimators=10, max_depth=i, random_state=42)
    score = cross_val_score(rfc, X_train_embedded, y_train, cv=5, n_jobs=-1).mean()
    scores.append(score)

print('Elapsed time: {}'.format(datetime.datetime.fromtimestamp(time()-time0).strftime('%M:%S:%f')))
print('Best score: {} at max_depth {}'.format(max(scores), np.arange(5,21,1)[np.argmax(scores)]))
plt.figure(figsize=(10,8))
plt.plot(np.arange(5,21,1), scores)
plt.show()
Elapsed time: 03:27:049065
Best score: 0.8891446300224614 at max_depth 20

[Figure: cross-validation score vs. max_depth]

# Learning curve over min_samples_split
scores = []
time0 = time()
for i in np.arange(2,10,1):
    rfc = RFC(n_estimators=10, max_depth=20, min_samples_split=i, random_state=42)
    score = cross_val_score(rfc, X_train_embedded, y_train, cv=5, n_jobs=-1).mean()
    scores.append(score)

print('Elapsed time: {}'.format(datetime.datetime.fromtimestamp(time()-time0).strftime('%M:%S:%f')))
print('Best score: {} at min_samples_split {}'.format(max(scores), np.arange(2,10,1)[np.argmax(scores)]))
plt.figure(figsize=(10,8))
plt.plot(np.arange(2,10,1), scores)
plt.show()
Elapsed time: 02:16:546873
Best score: 0.8891446300224614 at min_samples_split 2

[Figure: cross-validation score vs. min_samples_split]

# Learning curve over min_samples_leaf
scores = []
time0 = time()
for i in np.arange(1,10,1):
    rfc = RFC(n_estimators=10, max_depth=20, min_samples_split=2, min_samples_leaf=i, random_state=42)
    score = cross_val_score(rfc, X_train_embedded, y_train, cv=5, n_jobs=-1).mean()
    scores.append(score)

print('Elapsed time: {}'.format(datetime.datetime.fromtimestamp(time()-time0).strftime('%M:%S:%f')))
print('Best score: {} at min_samples_leaf {}'.format(max(scores), np.arange(1,10,1)[np.argmax(scores)]))
plt.figure(figsize=(10,8))
plt.plot(np.arange(1,10,1), scores)
plt.show()
Elapsed time: 02:13:264602
Best score: 0.8891446300224614 at min_samples_leaf 1

[Figure: cross-validation score vs. min_samples_leaf]

# Learning curve over n_estimators
scores = []
time0 = time()
for i in np.arange(10,201,10):
    rfc = RFC(n_estimators=i, max_depth=20, random_state=42)
    score = cross_val_score(rfc, X_train_embedded, y_train, cv=3, n_jobs=-1).mean()
    scores.append(score)
    print('Elapsed time: {}'.format(datetime.datetime.fromtimestamp(time()-time0).strftime('%M:%S:%f')))
print('Best score: {} at n_estimators {}'.format(max(scores), np.arange(10,201,10)[np.argmax(scores)]))
plt.figure(figsize=(10,8))
plt.plot(np.arange(10,201,10), scores)
plt.show()
Elapsed time: 00:33:316156
Elapsed time: 01:39:789986
Elapsed time: 03:21:423289
Elapsed time: 05:40:185622
Elapsed time: 08:30:948403
Elapsed time: 11:56:105026
Elapsed time: 15:51:769696
Elapsed time: 20:36:914026
Elapsed time: 25:38:212320
Elapsed time: 31:10:214501
Elapsed time: 37:27:759795
Elapsed time: 44:10:525391
Elapsed time: 51:28:059915
Elapsed time: 59:16:547367
Elapsed time: 07:41:280818
Elapsed time: 16:38:253700
Elapsed time: 26:07:070954
Elapsed time: 36:06:570602
Elapsed time: 48:40:747174
Elapsed time: 00:14:880475
Best score: 0.8988864801364768 at n_estimators 100

[Figure: cross-validation score vs. n_estimators]
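The four curves above tune one hyperparameter at a time while holding the others fixed. A joint search over the same grid is a common alternative; a minimal `GridSearchCV` sketch, with synthetic data standing in for `X_train_embedded` / `y_train`:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# synthetic stand-in for X_train_embedded / y_train
X, y = make_classification(n_samples=600, n_features=8, random_state=42)

# search both parameters jointly instead of one curve at a time
param_grid = {'n_estimators': [50, 100], 'max_depth': [10, 20]}
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, cv=3, n_jobs=-1)
search.fit(X, y)
print(search.best_params_)
print(search.best_score_)
```

The trade-off is cost: the grid grows multiplicatively, which is why the one-dimensional curves above are a reasonable shortcut on a dataset of this size.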

# Refit the model with the best parameters found
rfc = RFC(n_estimators=100, max_depth=20, random_state=42).fit(X_train_embedded, y_train)
# Scores on the training and test sets
print('Training set score: {}'.format(rfc.score(X_train_embedded, y_train)))
print('Test set score: {}'.format(rfc.score(X_test_embedded, y_test)))
Training set score: 0.9144162424776504
Test set score: 0.8858812548192038
# Feature importances of the fitted forest
rfc.feature_importances_
array([0.12193391, 0.12869867, 0.14163503, 0.13799971, 0.10331983,
       0.12834216, 0.14547079, 0.09259991])
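The raw `feature_importances_` array is hard to read on its own; pairing it with the column names and sorting makes the ranking explicit. A sketch on toy data — the column names here are hypothetical placeholders, not the columns of `X_train_embedded`:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# toy data with placeholder feature names
X, y = make_classification(n_samples=300, n_features=4, random_state=0)
cols = ['f1', 'f2', 'f3', 'f4']
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
# attach names and sort so the most important feature comes first
imp = pd.Series(model.feature_importances_, index=cols).sort_values(ascending=False)
print(imp)
```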
# Predicted probabilities on the test set
y_scores = rfc.predict_proba(X_test_embedded)
y_scores
array([[0.42810513, 0.57189487],
       [0.90260411, 0.09739589],
       [0.87430575, 0.12569425],
       ...,
       [0.3633656 , 0.6366344 ],
       [0.79112542, 0.20887458],
       [0.14388709, 0.85611291]])
from sklearn.metrics import roc_curve, auc
fpr, tpr, thresholds = roc_curve(y_test, y_scores[:, 1])
roc_auc = auc(fpr, tpr)
roc_auc
0.9680887508287116
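Besides AUC, churn models are often summarized by the KS statistic — the maximum gap between TPR and FPR over all thresholds — which falls directly out of `roc_curve`. A toy illustration (the scores below are made up):

```python
import numpy as np
from sklearn.metrics import roc_curve

# made-up labels and scores: positives tend to score higher
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_prob = np.array([0.1, 0.2, 0.35, 0.6, 0.4, 0.7, 0.8, 0.9])
fpr, tpr, thr = roc_curve(y_true, y_prob)
ks = np.max(tpr - fpr)  # KS statistic: largest vertical gap to the diagonal
print('KS =', ks)
```

The threshold where the gap peaks is also a sensible operating point for flagging likely churners.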
# Plot the ROC curve
def draw_roc(roc_auc, fpr, tpr):
    plt.subplots(figsize=(7,5.5))
    plt.plot(fpr, tpr, color='orange', label='ROC curve (area = {:.4f})'.format(roc_auc))
    plt.plot([0,1], [0,1], color='blue', linestyle='--')   # random-guess baseline
    plt.xlabel('FPR')
    plt.ylabel('TPR')
    plt.xlim([0,1])
    plt.ylim([0,1.05])
    plt.title('ROC Curve')
    plt.legend(loc=4)
    plt.show()

draw_roc(roc_auc, fpr, tpr)

[Figure: ROC curve]

# RFM analysis: recency (R), frequency (F), monetary value (M)
rfm = df[['sampleid','ordernum_oneyear','avgprice','lasthtlordergap']]
rfm.head()
|   | sampleid | ordernum_oneyear | avgprice | lasthtlordergap |
|---|----------|------------------|----------|-----------------|
| 0 | 24636 | NaN | NaN | NaN |
| 1 | 24637 | NaN | NaN | NaN |
| 2 | 24641 | NaN | NaN | NaN |
| 3 | 24642 | NaN | NaN | NaN |
| 4 | 24644 | NaN | NaN | NaN |
# Drop missing rows and rename the columns to F / M / R
rfm = rfm.dropna().reset_index(drop=True).rename(columns={'ordernum_oneyear':'F', 'avgprice':'M', 'lasthtlordergap':'R'})
rfm.head()
|   | sampleid | F | M | R |
|---|----------|---|---|---|
| 0 | 24650 | 21.0 | 363.0 | 10475.0 |
| 1 | 24653 | 7.0 | 307.0 | 18873.0 |
| 2 | 24655 | 1.0 | 343.0 | 32071.0 |
| 3 | 24658 | 33.0 | 1000.0 | 4616.0 |
| 4 | 24662 | 4.0 | 685.0 | 44830.0 |
# R turns out to be in minutes; convert it to days (1440 minutes per day)
rfm['R'] = round(rfm['R'] / 1440, 0)
rfm.head()
|   | sampleid | F | M | R |
|---|----------|---|---|---|
| 0 | 24650 | 21.0 | 363.0 | 7.0 |
| 1 | 24653 | 7.0 | 307.0 | 13.0 |
| 2 | 24655 | 1.0 | 343.0 | 22.0 |
| 3 | 24658 | 33.0 | 1000.0 | 3.0 |
| 4 | 24662 | 4.0 | 685.0 | 31.0 |
rfm.describe().T
|   | count | mean | std | min | 25% | 50% | 75% | max |
|---|-------|------|-----|-----|-----|-----|-----|-----|
| sampleid | 426425.0 | 629380.138599 | 414760.183032 | 24650.0 | 313549.0 | 600907.0 | 887813.0 | 2238403.0 |
| F | 426425.0 | 12.137916 | 17.405419 | 1.0 | 3.0 | 6.0 | 14.0 | 711.0 |
| M | 426425.0 | 421.604962 | 286.987700 | 1.0 | 233.0 | 351.0 | 523.0 | 6383.0 |
| R | 426425.0 | 70.742163 | 84.844780 | 0.0 | 11.0 | 33.0 | 97.0 | 366.0 |
# Bin R, F and M based on their distributions and common practice
f_bins = [-1, 3, 5, 7, 10, 720]
m_bins = [-1, 200, 400, 600, 800, 7000]
r_bins = [-1, 3, 7, 30, 180, 370]

rfm['R_score'] = pd.cut(rfm['R'], bins=r_bins, labels=[5,4,3,2,1]).astype('int')
rfm['F_score'] = pd.cut(rfm['F'], bins=f_bins, labels=[1,2,3,4,5]).astype('int')
rfm['M_score'] = pd.cut(rfm['M'], bins=m_bins, labels=[1,2,3,4,5]).astype('int')

rfm
|   | sampleid | F | M | R | R_score | F_score | M_score |
|---|----------|---|---|---|---------|---------|---------|
| 0 | 24650 | 21.0 | 363.0 | 7.0 | 4 | 5 | 2 |
| 1 | 24653 | 7.0 | 307.0 | 13.0 | 3 | 3 | 2 |
| 2 | 24655 | 1.0 | 343.0 | 22.0 | 3 | 1 | 2 |
| 3 | 24658 | 33.0 | 1000.0 | 3.0 | 5 | 5 | 5 |
| 4 | 24662 | 4.0 | 685.0 | 31.0 | 2 | 2 | 4 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 426420 | 2238388 | 2.0 | 226.0 | 119.0 | 2 | 1 | 2 |
| 426421 | 2238389 | 4.0 | 461.0 | 0.0 | 5 | 2 | 3 |
| 426422 | 2238396 | 5.0 | 193.0 | 44.0 | 2 | 2 | 1 |
| 426423 | 2238397 | 1.0 | 258.0 | 87.0 | 2 | 1 | 2 |
| 426424 | 2238403 | 3.0 | 256.0 | 52.0 | 2 | 1 | 2 |

426425 rows × 7 columns
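Note that `pd.cut` with the label list reversed gives recent visitors the higher R score. A quick check on the first few R values reproduces the scores shown above:

```python
import pandas as pd

r_bins = [-1, 3, 7, 30, 180, 370]          # bin edges copied from above
demo = pd.DataFrame({'R': [7.0, 13.0, 22.0, 3.0, 31.0]})
# labels run 5 (most recent) down to 1 (least recent)
demo['R_score'] = pd.cut(demo['R'], bins=r_bins, labels=[5, 4, 3, 2, 1]).astype('int')
print(demo)
# R_score column: 4, 3, 3, 5, 2 — matching rows 0-4 of the table above
```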

# Mark a score above its column mean as 1, otherwise 0
rfm['R_level'] = (rfm['R_score'] > rfm['R_score'].mean()) * 1
rfm['F_level'] = (rfm['F_score'] > rfm['F_score'].mean()) * 1
rfm['M_level'] = (rfm['M_score'] > rfm['M_score'].mean()) * 1

rfm
|   | sampleid | F | M | R | R_score | F_score | M_score | R_level | F_level | M_level |
|---|----------|---|---|---|---------|---------|---------|---------|---------|---------|
| 0 | 24650 | 21.0 | 363.0 | 7.0 | 4 | 5 | 2 | 1 | 1 | 0 |
| 1 | 24653 | 7.0 | 307.0 | 13.0 | 3 | 3 | 2 | 1 | 1 | 0 |
| 2 | 24655 | 1.0 | 343.0 | 22.0 | 3 | 1 | 2 | 1 | 0 | 0 |
| 3 | 24658 | 33.0 | 1000.0 | 3.0 | 5 | 5 | 5 | 1 | 1 | 1 |
| 4 | 24662 | 4.0 | 685.0 | 31.0 | 2 | 2 | 4 | 0 | 0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 426420 | 2238388 | 2.0 | 226.0 | 119.0 | 2 | 1 | 2 | 0 | 0 | 0 |
| 426421 | 2238389 | 4.0 | 461.0 | 0.0 | 5 | 2 | 3 | 1 | 0 | 1 |
| 426422 | 2238396 | 5.0 | 193.0 | 44.0 | 2 | 2 | 1 | 0 | 0 | 0 |
| 426423 | 2238397 | 1.0 | 258.0 | 87.0 | 2 | 1 | 2 | 0 | 0 | 0 |
| 426424 | 2238403 | 3.0 | 256.0 | 52.0 | 2 | 1 | 2 | 0 | 0 | 0 |

426425 rows × 10 columns

# Concatenate the three level flags into an RFM code and map it to a segment name
rfm['RFM'] = rfm['R_level'].astype('str') + rfm['F_level'].astype('str') + rfm['M_level'].astype('str')
rfm['RFM'].replace(['111','101','011','001','110','100','010','000']
            , ['Important-value users','Important-develop users','Important-retain users','Important-winback users','General-value users','General-develop users','General-retain users','General-winback users'], inplace=True)
rfm
|   | sampleid | F | M | R | R_score | F_score | M_score | R_level | F_level | M_level | RFM |
|---|----------|---|---|---|---------|---------|---------|---------|---------|---------|-----|
| 0 | 24650 | 21.0 | 363.0 | 7.0 | 4 | 5 | 2 | 1 | 1 | 0 | General-value users |
| 1 | 24653 | 7.0 | 307.0 | 13.0 | 3 | 3 | 2 | 1 | 1 | 0 | General-value users |
| 2 | 24655 | 1.0 | 343.0 | 22.0 | 3 | 1 | 2 | 1 | 0 | 0 | General-develop users |
| 3 | 24658 | 33.0 | 1000.0 | 3.0 | 5 | 5 | 5 | 1 | 1 | 1 | Important-value users |
| 4 | 24662 | 4.0 | 685.0 | 31.0 | 2 | 2 | 4 | 0 | 0 | 1 | Important-winback users |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 426420 | 2238388 | 2.0 | 226.0 | 119.0 | 2 | 1 | 2 | 0 | 0 | 0 | General-winback users |
| 426421 | 2238389 | 4.0 | 461.0 | 0.0 | 5 | 2 | 3 | 1 | 0 | 1 | Important-develop users |
| 426422 | 2238396 | 5.0 | 193.0 | 44.0 | 2 | 2 | 1 | 0 | 0 | 0 | General-winback users |
| 426423 | 2238397 | 1.0 | 258.0 | 87.0 | 2 | 1 | 2 | 0 | 0 | 0 | General-winback users |
| 426424 | 2238403 | 3.0 | 256.0 | 52.0 | 2 | 1 | 2 | 0 | 0 | 0 | General-winback users |

426425 rows × 11 columns

# Count users in each segment (groupby with as_index=False already returns a DataFrame)
rfm_new = rfm.groupby('RFM', as_index=False)['sampleid'].agg('count')
rfm_new
|   | RFM | sampleid |
|---|-----|----------|
| 0 | General-value users | 78592 |
| 1 | General-retain users | 46850 |
| 2 | General-develop users | 42275 |
| 3 | General-winback users | 83394 |
| 4 | Important-value users | 63595 |
| 5 | Important-retain users | 38850 |
| 6 | Important-develop users | 20235 |
| 7 | Important-winback users | 52634 |
# Pie chart of segment shares
plt.figure(figsize=(12,6))
plt.pie((rfm_new['sampleid'] / rfm_new['sampleid'].sum()).to_list(), labels=rfm_new['RFM'].to_list(), autopct='%0.2f%%')
[Pie chart output — segment shares: General-value 18.43%, General-retain 10.99%, General-develop 9.91%, General-winback 19.56%, Important-value 14.91%, Important-retain 9.11%, Important-develop 4.75%, Important-winback 12.34%]

[Figure: pie chart of RFM segment shares]

