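Rossmann Store Sales Prediction

17 Mar 2025 | 13 min read

Introduction

The demand for a product or service is always changing. No company can improve its financial performance without a reliable estimate of customer demand and of the future sales of its products or services. Sales forecasting predicts the demand for, or sales of, a given product over a specified period. In this article I use a real business challenge from Kaggle to demonstrate how machine learning can be used to forecast sales. The case study is solved from scratch, so you will see every step of how such a problem is tackled in practice.

Problem Statement

Rossmann operates more than 3,000 drug stores in seven European countries.

Rossmann store managers must predict their daily sales up to six weeks in advance. Store sales are influenced by promotions, competition, school and state holidays, seasonality, and locality. Because thousands of individual managers forecast sales based on their own circumstances, the accuracy of the results can vary widely.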

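Error metric: RMSPE stands for Root Mean Square Percentage Error.

The metric is defined as

RMSPE = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} \left( \frac{y_i - \hat{y}_i}{y_i} \right)^2 }

where y_i is the actual sales of a store on a single day and \hat{y}_i is the corresponding prediction. Days with zero sales are ignored in the scoring.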
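To make the metric concrete, here is a minimal numpy sketch of RMSPE on four made-up values. The function name and the sample numbers are illustrative assumptions, not part of the competition data; rows whose true value is zero are skipped.

```python
import numpy as np

def rmspe(y_true, y_pred):
    """Root mean square percentage error, skipping rows whose true value is zero."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mask = y_true != 0
    pct_err = (y_true[mask] - y_pred[mask]) / y_true[mask]
    return float(np.sqrt(np.mean(pct_err ** 2)))

# Illustrative values only: the third day has zero sales and is ignored
actual = [100, 200, 0, 400]
predicted = [110, 180, 50, 400]
score = rmspe(actual, predicted)
print(round(score, 4))  # 0.0816
```

The percentage errors are -0.1, 0.1 and 0, so the score is sqrt(0.02 / 3). A 10% miss on a small store therefore costs as much as a 10% miss on a large one.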

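Objectives

Use the data to forecast sales for the next six weeks.
Minimise the specified metric as far as possible.

Data

The files are as follows

train.csv
test.csv
store.csv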

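Data fields

Id: an identifier for a (Store, Date) pair within the test set.
Store: a unique ID for each store.
Sales: the turnover on a given day; this is the value you are predicting.
Customers: the number of customers on a given day.
Open: 0 means the store was closed, 1 means it was open.

StateHoliday: indicates a state holiday. With few exceptions, all stores are normally closed on state holidays. Note that all schools are closed on public holidays and weekends. a = public holiday, b = Easter holiday, c = Christmas, 0 = none.

SchoolHoliday: indicates whether the (Store, Date) was affected by the closure of public schools.

StoreType: distinguishes four different store models (a, b, c, and d).

Assortment: specifies one of three assortment levels: basic, extra, and extended.

CompetitionDistance: the distance in metres to the nearest competitor store.

CompetitionOpenSince[Month / Year]: the approximate year and month in which the nearest competitor first opened.

Promo: indicates whether the store was running a promotion on that day.

Promo2: a continuing promotion run by some stores: 0 means the store is not participating, 1 means it is.

Promo2Since[Year / Week]: the calendar year and week in which the store first joined Promo2.

PromoInterval: the months in which Promo2 is started anew at regular intervals.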

Exploratory Data Analysis (EDA)

Let us use EDA to gain insight into the data provided.

Below is the information for train.csv (shown here after concatenation with test.csv, which is why the Id and Set columns appear)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1058297 entries, 0 to 41087
Data columns (total 11 columns):
Customers        1017209 non-null float64
Date             1058297 non-null datetime64[ns]
DayOfWeek        1058297 non-null int64
Id               41088 non-null float64
Open             1058286 non-null float64
Promo            1058297 non-null int64
Sales            1017209 non-null float64
SchoolHoliday    1058297 non-null int64
Set              1058297 non-null int64
StateHoliday     1058297 non-null object
Store            1058297 non-null int64
dtypes: datetime64[ns](1), float64(4), int64(5), object(1)
memory usage: 96.9+ MB

As we can see, there are roughly one million data points. Because this is a time series forecasting problem, the data must also be sorted by date.
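Sorting by date matters most when a validation set is carved out. The sketch below is one possible way to hold out the final six weeks, mirroring the forecasting horizon of the problem; the helper name `split_last_weeks` and the toy frame are assumptions made for illustration and are not part of the original notebook.

```python
import pandas as pd

def split_last_weeks(df, date_col="Date", weeks=6):
    """Hold out the final `weeks` weeks of a date-sorted frame as a validation set."""
    df = df.sort_values(date_col).reset_index(drop=True)
    cutoff = df[date_col].max() - pd.Timedelta(weeks=weeks)
    train = df[df[date_col] <= cutoff]
    valid = df[df[date_col] > cutoff]
    return train, valid

# Toy data: 100 consecutive days, shuffled to show that the helper sorts first
toy = pd.DataFrame({
    "Date": pd.date_range("2015-01-01", periods=100, freq="D"),
    "Sales": range(100),
})
train_part, valid_part = split_last_weeks(toy.sample(frac=1.0, random_state=0))
print(len(train_part), len(valid_part))  # 58 42
```

Every validation row is later than every training row, so the model is never scored on days that precede its training data.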

In this case, our target variable is Sales.

Below is the information for store.csv

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1115 entries, 0 to 1114
Data columns (total 10 columns):
Store                        1115 non-null int64
StoreType                    1115 non-null object
Assortment                   1115 non-null object
CompetitionDistance          1112 non-null float64
CompetitionOpenSinceMonth    761 non-null float64
CompetitionOpenSinceYear     761 non-null float64
Promo2                       1115 non-null int64
Promo2SinceWeek              571 non-null float64
Promo2SinceYear              571 non-null float64
PromoInterval                571 non-null object
dtypes: float64(5), int64(2), object(3)
memory usage: 95.8+ KB

There are 1,115 different stores. Many columns in this table contain null values; we will deal with them later.

Now let us look at the individual columns in more detail.

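Promo

Promo plotted against Sales and Customers.

We can see that both sales and customer counts rise significantly during promotions. This indicates that promotions have a positive effect on the stores.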
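A comparison like this can be produced with a simple group-by. The sketch below uses a tiny made-up frame (the numbers are illustrative, not taken from train.csv) to show the shape of the computation.

```python
import pandas as pd

# Illustrative values only; in the real analysis this would be the training frame
toy = pd.DataFrame({
    "Promo":     [0, 0, 1, 1, 0, 1],
    "Sales":     [4000, 5000, 8000, 9000, 4500, 8500],
    "Customers": [500, 600, 800, 900, 550, 850],
})

# Mean sales and customers on days without (0) and with (1) a promotion
promo_effect = toy.groupby("Promo")[["Sales", "Customers"]].mean()
print(promo_effect)
```

The same pattern, grouping by DayOfWeek or StoreType instead of Promo, would give the other comparisons discussed in this section.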

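Sales

Average sales per week

It is also worth noting that Christmas and New Year (see week 52 in the chart) lead to a surge in sales. Since Rossmann stores sell health and beauty products, it is reasonable to assume that people buy beauty products when they go out to celebrate over the holidays, which would explain the rapid rise in sales.

DayOfWeek

Sales and Customers plotted against DayOfWeek.

Because most stores are closed on Sundays, both sales and customer counts drop on that day.

In addition, Monday has the highest sales of the week, probably because most stores are closed on Sunday.

Another key point is that stores that stay open during school holidays sell more than on normal days.

Customers and Sales by StoreType.

We can see that type A stores have the most customers and sales. Type B stores rank second in both sales and customer counts.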

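Conclusions from the EDA

Type A is the most popular and busiest store type.
The number of customers is strongly correlated with sales.
Promotions increase both sales and customer counts in all stores.
Stores that are open during school holidays sell more than on normal days.
More stores are open during school holidays than during state holidays.
Sales rise around Christmas, probably because people buy more beauty products over the holidays.
Missing values in CompetitionOpenSinceYear / CompetitionOpenSinceMonth do not mean that there is no competition: CompetitionDistance is non-null in cases where the other two values are null.
After analysing sales with a wave decomposition, I found some seasonality in the sales data.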

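Feature Engineering

Outlier column

In this column we flag whether a sales figure is an outlier, based on the median absolute deviation (MAD).

For a set of values x, MAD = median(|x_i - median(x)|).

The outlier column is built per store: each unique store is processed independently and the results are then combined.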
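The article does not show the exact rule, so the following is only a sketch of a per-store MAD check. It assumes the commonly used modified z-score, 0.6745 * |x - median| / MAD, with a cut-off of 3.5; both constants, the function name, and the toy data are assumptions.

```python
import numpy as np
import pandas as pd

def flag_outliers_per_store(df, threshold=3.5):
    """Return a boolean Series marking sales whose modified z-score exceeds `threshold`."""
    median = df.groupby("Store")["Sales"].transform("median")
    abs_dev = (df["Sales"] - median).abs()
    mad = abs_dev.groupby(df["Store"]).transform("median")
    # A MAD of zero would divide by zero; treat such stores as having no outliers
    modified_z = 0.6745 * abs_dev / mad.replace(0, np.nan)
    return modified_z > threshold

# Illustrative values: store 1 has one extreme day, store 2 has none
toy = pd.DataFrame({
    "Store": [1] * 6 + [2] * 6,
    "Sales": [5000, 5100, 4900, 5050, 4950, 20000,
              900, 1000, 1100, 950, 1050, 1000],
})
toy["Outlier"] = flag_outliers_per_store(toy)
print(toy[toy["Outlier"]])
```

Because the median and the MAD are computed within each store, a day that is ordinary for a large store is not flagged merely for being far above a small store's level.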

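Date features

First, we convert the Date column with pandas' to_datetime method. After that, further attributes can be extracted from Date.

Holidays this week, last week, and next week

We create three features that give the total number of holidays in the current week, the previous week, and the following week.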
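One possible way to build these three counts is sketched below. The helper name, the output column names, and the choice of SchoolHoliday as the flag are assumptions; the sketch also assumes every store has a row for each consecutive calendar week, since it relies on shifting the weekly totals by one position.

```python
import pandas as pd

def add_weekly_holiday_counts(df, flag_col="SchoolHoliday"):
    """Add per-store holiday counts for the current, previous, and next calendar week."""
    df = df.copy()
    # Monday of the week each row falls in
    df["WeekStart"] = df["Date"].dt.to_period("W").dt.start_time
    weekly = (df.groupby(["Store", "WeekStart"])[flag_col].sum()
                .rename("HolidaysThisWeek").reset_index()
                .sort_values(["Store", "WeekStart"]))
    by_store = weekly.groupby("Store")["HolidaysThisWeek"]
    weekly["HolidaysLastWeek"] = by_store.shift(1).fillna(0)
    weekly["HolidaysNextWeek"] = by_store.shift(-1).fillna(0)
    return df.merge(weekly, on=["Store", "WeekStart"], how="left")

# Toy data: one store, three full Monday-to-Sunday weeks
toy = pd.DataFrame({
    "Store": 1,
    "Date": pd.date_range("2015-01-05", periods=21, freq="D"),
    "SchoolHoliday": [0] * 7 + [1, 1, 0, 0, 0, 0, 0] + [0, 0, 1, 0, 0, 0, 0],
})
out = add_weekly_holiday_counts(toy)
print(out.loc[7, ["HolidaysThisWeek", "HolidaysLastWeek", "HolidaysNextWeek"]].tolist())
```

For a day in the middle week the three counts are 2, 0 and 1: two holidays in its own week, none the week before, and one the week after.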

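State holiday counters

A helper function (not shown in this article) generates two new features: one gives the number of days until the next state holiday, the other the number of days since the last state holiday.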
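Since the original helper is not reproduced, here is a minimal sketch of how such counters could be computed. It assumes one row per calendar day for a single store, already sorted by date; the function names are hypothetical.

```python
import numpy as np

def days_since_last(flags):
    """For each day, days elapsed since the most recent flagged day (NaN if none yet)."""
    out = np.full(len(flags), np.nan)
    last = None
    for i, flag in enumerate(flags):
        if flag:
            last = i
        if last is not None:
            out[i] = i - last
    return out

def days_until_next(flags):
    """Mirror image of days_since_last: days remaining until the next flagged day."""
    return days_since_last(flags[::-1])[::-1]

# Toy holiday indicator for eight consecutive days
holiday = np.array([0, 0, 1, 0, 0, 0, 1, 0])
since = days_since_last(holiday)   # nan, nan, 0, 1, 2, 3, 0, 1
until = days_until_next(holiday)   # 2, 1, 0, 3, 2, 1, 0, nan
```

The same two functions can be reused for promotions and school holidays by passing a different indicator column.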

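School holiday and promo counters

In addition to the features above, I built four more features that give the number of days before or after a promotion or a school holiday.

Closed dummy: this feature takes one of two values, +1 or -1. It is +1 if the store was closed yesterday or will be closed tomorrow, and -1 otherwise.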
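The closed dummy can be derived from the Open column by shifting it one day in each direction. The sketch below assumes one store's Open values in date order with no missing days; the function name is hypothetical, and it would be applied to each store's series separately.

```python
import pandas as pd

def closed_neighbour_flag(open_flags):
    """Return +1 where the store was closed the day before or is closed the day after, else -1."""
    closed = open_flags.eq(0)
    prev_closed = closed.shift(1, fill_value=False)    # closed yesterday
    next_closed = closed.shift(-1, fill_value=False)   # closed tomorrow
    return (prev_closed | next_closed).map({True: 1, False: -1})

# Toy Open column for nine consecutive days
open_flags = pd.Series([1, 1, 0, 1, 1, 1, 0, 0, 1])
flag = closed_neighbour_flag(open_flags)
print(flag.tolist())  # [-1, 1, -1, 1, -1, 1, 1, 1, 1]
```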

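Removing data points with zero sales: data points with zero sales are removed because they indicate that the store was closed for some reason. If we are asked to predict for a store that is not open, we can simply predict zero sales.

Customers_per_day and Sales_per_customers_per_day

The names of these features are self-explanatory.

Open Competition and Open Promo: we convert these two features from years to months.

Promo interval features: PromoInterval has the form May, August, November. We split it so that May is one feature, August another, and November a third.

Sales variation and acceleration: Variation_t = y_t - y_{t-1}, where y_t is the sales on day t

Acceleration_t = y_{t-1} - y_{t-2}, where y_t is the sales on day t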
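Both quantities are lagged differences of the sales series, so they can be sketched with pandas group-wise shifts. The toy frame is illustrative and assumes rows are already sorted by date within each store.

```python
import pandas as pd

# Illustrative values: two stores, four consecutive days each
toy = pd.DataFrame({
    "Store": [1, 1, 1, 1, 2, 2, 2, 2],
    "Sales": [100, 120, 150, 140, 50, 55, 45, 60],
})
by_store = toy.groupby("Store")["Sales"]

# Today's sales minus yesterday's sales
toy["Variation"] = by_store.diff()
# Yesterday's sales minus the sales of the day before, as defined in the text
toy["Acceleration"] = by_store.shift(1) - by_store.shift(2)
print(toy)
```

The first rows of each store are NaN because there is no earlier day to compare with; grouping by Store keeps one store's history from leaking into the next.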

Fourier features

I use numpy's fft function to compute Fourier frequencies and amplitudes, which are then used as features.
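How the frequencies and amplitudes were selected is not shown, so the sketch below is only one plausible reading: take the real FFT of a store's daily series and keep the k strongest components. The function name and the synthetic weekly signal are assumptions.

```python
import numpy as np

def top_fourier_features(series, k=2):
    """Return the k strongest (frequency, amplitude) pairs of a daily series."""
    values = np.asarray(series, dtype=float)
    n = len(values)
    spectrum = np.fft.rfft(values - values.mean())   # remove the constant offset
    freqs = np.fft.rfftfreq(n, d=1.0)                # cycles per day
    amps = 2.0 * np.abs(spectrum) / n
    order = np.argsort(amps)[::-1][:k]
    return [(float(freqs[i]), float(amps[i])) for i in order]

# Synthetic series: a weekly cycle of amplitude 300 around a level of 5000
days = np.arange(70)
sales = 5000 + 300 * np.sin(2 * np.pi * days / 7)
features = top_fourier_features(sales, k=1)
print(features)  # frequency close to 1/7 per day, amplitude close to 300
```

On real sales data the strongest components would be expected to reflect weekly and yearly cycles, which is the seasonality noted in the EDA.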

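Other features

These capture important patterns related to DayOfWeek, promotions, holidays, and so on.

External data

Only two additional sources of information are used here. One is state data, which identifies the state each store belongs to; the other is weather data for a given state on a given date.

VIF analysis: after adding all the features, we ran a variance inflation factor (VIF) analysis to check for collinearity among them. Highly collinear features were dropped.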
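For reference, the VIF of feature j is 1 / (1 - R_j^2), where R_j^2 comes from regressing feature j on all the other features. The sketch below computes it with plain numpy on synthetic data; the function name is hypothetical, and what counts as a "high" VIF (values above 5 or 10 are common rules of thumb) is an assumption, since the article does not state its cut-off.

```python
import numpy as np

def variance_inflation_factors(X):
    """VIF_j = 1 / (1 - R_j^2), regressing column j on all other columns plus an intercept."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    vifs = []
    for j in range(p):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        residual = target - others @ coef
        ss_res = float(residual @ residual)
        ss_tot = float(((target - target.mean()) ** 2).sum())
        r2 = 1.0 - ss_res / ss_tot
        vifs.append(np.inf if r2 >= 1.0 else 1.0 / (1.0 - r2))
    return np.array(vifs)

# Synthetic features: c is nearly a copy of a, b is independent
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.05 * rng.normal(size=500)
vifs = variance_inflation_factors(np.column_stack([a, b, c]))
print(vifs.round(1))
```

Columns a and c receive very large VIFs while b stays close to 1, so one of a or c would be dropped.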

Now let us do some modelling.

Modelling

Base model

We use sklearn's Pipeline and ColumnTransformer to preprocess the data.

import numpy as npp
import pandas as pdd
import matplotlib.pyplot as plot
import xgboost

import pylab
import csv
import datetime
import math
import re
import time
import random
import os
from pandas.tseries.offsets import *
from operator import *
# sklearn.cross_validation was removed; train_test_split now lives in model_selection
from sklearn.model_selection import train_test_split
%matplotlib inline
npp.set_printoptions(precision=4, threshold=10000, linewidth=100, edgeitems=999, suppress=True)
pdd.set_option('display.max_columns', None)
pdd.set_option('display.max_rows', None)
pdd.set_option('display.width', 100)
pdd.set_option('display.expand_frame_repr', False)
pdd.set_option('display.precision', 6)
# seed is used further down but never defined in the article; 42 is an assumed value
seed = 42
start_time = time.time()
In [2]:
def ToWeight(y):
    # Weight 1 / y^2 for non-zero targets; rows with zero sales get weight 0 and are ignored
    w = npp.zeros(y.shape, dtype=float)
    ind = y != 0
    w[ind] = 1. / (y[ind] ** 2)
    return w

def rmspe(yhat, y):
    # Root mean square percentage error on the original sales scale
    w = ToWeight(y)
    return npp.sqrt(npp.mean(w * (y - yhat) ** 2))

def rmspe_xg(yhat, y):
    # Custom XGBoost metric: y is a DMatrix whose labels are log1p(sales),
    # so labels and predictions are mapped back to sales before scoring
    y = y.get_label()
    y = npp.exp(y) - 1
    yhat = npp.exp(yhat) - 1
    w = ToWeight(y)
    return "rmspe", npp.sqrt(npp.mean(w * (y - yhat) ** 2))

Numeric values are imputed with the median and categorical values with the most frequent value. Numeric values are also scaled.
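The preprocessing code itself is not shown, so the following is a sketch of what such a Pipeline and ColumnTransformer could look like. The column lists, the one-hot encoding step, and the toy frame are assumptions; only the median and most-frequent imputation and the scaling come from the description above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_cols = ["CompetitionDistance", "Customers"]   # assumed column choice
categorical_cols = ["StoreType", "Assortment"]        # assumed column choice

numeric_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
categorical_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),   # encoding step is an assumption
])
preprocess = ColumnTransformer([
    ("num", numeric_pipe, numeric_cols),
    ("cat", categorical_pipe, categorical_cols),
], sparse_threshold=0.0)   # always return a dense array

# Toy frame with one missing value per column
toy = pd.DataFrame({
    "CompetitionDistance": [1270.0, np.nan, 570.0, 14130.0],
    "Customers": [555.0, 625.0, np.nan, 821.0],
    "StoreType": ["a", "c", np.nan, "a"],
    "Assortment": ["a", "a", "c", np.nan],
})
features = preprocess.fit_transform(toy)
print(features.shape)  # (4, 6)
```

Wrapping the steps in a pipeline means the medians, modes, and scaling constants are learned from the training data only and then reused unchanged on validation and test data.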

Now split the data into a training set and a validation set.

Splitting the data into test and train for outlier detection

In [36]:
# Training rows from open stores that were not flagged as outliers
mask = (dataframe['Set'] == 1) & (dataframe['Open'] == 1) & (dataframe['Outlier'] == False)
X_train, X_test, y_train, y_test = train_test_split(dataframe.loc[mask, features_x],
                                                    dataframe.loc[mask, features_y],
                                                    test_size=0.1, random_state=seed)
In [37]:
dtrain = xgboost.DMatrix(X_train, label=y_train)
dtest = xgboost.DMatrix(X_test, label=y_test)
In [38]:
num_round = 20000
evallist = [(dtrain, 'train'), (dtest, 'test')]
In [39]:
# Parameter names follow current XGBoost releases: the old 'bst:' prefix is dropped,
# 'silent' is replaced by 'verbosity', and 'reg:linear' is now 'reg:squarederror'
param = {'max_depth': 12,
         'eta': 0.01,
         'subsample': 0.8,
         'colsample_bytree': 0.7,
         'verbosity': 0,
         'objective': 'reg:squarederror',
         'nthread': 6,
         'seed': seed}

# custom_metric replaces the deprecated feval argument (XGBoost 1.6 and later)
bst = xgboost.train(param, dtrain, num_boost_round=num_round, evals=evallist,
                    custom_metric=rmspe_xg, maximize=False,
                    verbose_eval=250, early_stopping_rounds=250)

Output

Will train until test error hasn't decreased in 250 rounds.
[0]	train-rmspe:0.99963	test-rmspe:0.999863
[250]	train-rmspe:0.41216	test-rmspe:0.487971
[500]	train-rmspe:0.19972	test-rmspe:0.188309
...

Training continues in the same way, logging every 250 rounds, until early stopping halts it and reports the best iteration.

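Feature selection

After building the base model above, I ran forward selection over all the additional features created during feature engineering.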
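Forward selection can be sketched as a greedy loop: try each remaining feature, keep the one that improves the validation score the most, and stop when nothing helps. The function below is a generic illustration, not the author's code; `score_fn` stands in for "train a model on these features and return its validation RMSPE", and the toy scorer is invented so that the example runs on its own.

```python
def forward_select(base_features, candidates, score_fn):
    """Greedily add the candidate that lowers score_fn the most; stop when none helps."""
    selected = list(base_features)
    remaining = list(candidates)
    best_score = score_fn(selected)
    while remaining:
        trial_scores = {f: score_fn(selected + [f]) for f in remaining}
        best_feature = min(trial_scores, key=trial_scores.get)
        if trial_scores[best_feature] >= best_score:
            break
        selected.append(best_feature)
        remaining.remove(best_feature)
        best_score = trial_scores[best_feature]
    return selected, best_score

# Toy scorer: each feature changes the error by a fixed, made-up amount
gains = {"Promo": 0.05, "DateWeek": 0.03, "Noise": -0.01}

def toy_score(features):
    return 0.30 - sum(gains.get(f, 0.0) for f in features)

chosen, score = forward_select(["Store"], list(gains), toy_score)
print(chosen)  # ['Store', 'Promo', 'DateWeek']
```

The loop refits the model once per remaining candidate in every round, so with an expensive model it is worth scoring on a subsample first.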

The new features in the pipeline are as follows

Reading the store data

In [14]:
dataframe_store = pdd.read_csv('../data/store.csv',
                       nrows=nrows)
In [16]:
dataframe_store['StoreType'] = dataframe_store['StoreType'].astype('category').cat.codes
dataframe_store['Assortment'] = dataframe_store['Assortment'].astype('category').cat.codes
In [17]:
def convertCompetitionOpen(row):
    # Competitor opening date as integer nanoseconds since the epoch; -1 if unknown
    try:
        opened = pdd.Timestamp(year=int(row['CompetitionOpenSinceYear']),
                               month=int(row['CompetitionOpenSinceMonth']), day=1)
        return opened.value
    except (ValueError, TypeError):
        return -1

dataframe_store['CompetitionOpenInt'] = dataframe_store.apply(convertCompetitionOpen, axis=1).astype(npp.int64)
In [18]:
def convertPromo2(row):
    # Monday of the week in which the store joined Promo2, as integer nanoseconds; -1 if not participating
    try:
        # Zero-pad the week so that single-digit weeks parse correctly
        date = '{}{:02d}1'.format(int(row['Promo2SinceYear']), int(row['Promo2SinceWeek']))
        return pdd.Timestamp(datetime.datetime.strptime(date, '%Y%W%w')).value
    except (ValueError, TypeError):
        return -1

dataframe_store['Promo2SinceFloat'] = dataframe_store.apply(convertPromo2, axis=1).astype(npp.int64)
In [19]:
# One column per month listed in PromoInterval
s = dataframe_store['PromoInterval'].str.split(',', expand=True)
s.columns = ['PromoInterval0', 'PromoInterval1', 'PromoInterval2', 'PromoInterval3']
dataframe_store = dataframe_store.join(s)
In [20]:
def monthToNum(date):
    # The data abbreviates September as 'Sept'
    return {
            'Jan': 1,
            'Feb': 2,
            'Mar': 3,
            'Apr': 4,
            'May': 5,
            'Jun': 6,
            'Jul': 7,
            'Aug': 8,
            'Sept': 9,
            'Oct': 10,
            'Nov': 11,
            'Dec': 12
     }[date]

# Month names to numbers; stores without Promo2 keep NaN
for col in ['PromoInterval0', 'PromoInterval1', 'PromoInterval2', 'PromoInterval3']:
    dataframe_store[col] = dataframe_store[col].map(lambda x: monthToNum(x) if pdd.notna(x) else npp.nan)
In [21]:
del dataframe_store['PromoInterval']
In [22]:
store_features = ['Store', 'StoreType', 'Assortment',
                  'CompetitionDistance', 'CompetitionOpenInt',
                  'PromoInterval0']

### Features not helping
# PromoInterval1, PromoInterval2, PromoInterval3

features_x = list(set(features_x + store_features))
In [23]:
dataframe = pdd.merge(dataframe, dataframe_store[store_features], how='left', on=['Store'])
In [24]:
### Convert every NAN to -1
for feature in features_x:
    dataframe[feature] = dataframe[feature].fillna(-1)

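Meta-learning

The method is outlined below

Split the data 80-20 into a training set and a test set.
Split the training set into two parts, D1 and D2.
Draw 9 samples from D1 and train a forest-based regressor on each sample.
Predict D2 with these 9 models. To train a new model, use the 9 predictions as features and the original target of D2 (y_original) as the output.
For the test set, predict with the 9 models and pass the 9 predictions as features to the meta-model. The meta-model's prediction is the final prediction.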
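The steps above can be sketched with scikit-learn as follows. This is an illustration on synthetic data, not the author's implementation: the even D1/D2 split, sampling with replacement, the random forest settings, and the linear meta-model are all assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for the engineered features
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 5))
y = X[:, 0] * 2.0 + np.sin(X[:, 1]) + 0.1 * rng.normal(size=600)

# Step 1: 80/20 split into training and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Step 2: split the training data into D1 and D2
X_d1, X_d2, y_d1, y_d2 = train_test_split(X_train, y_train, test_size=0.5, random_state=0)

# Step 3: draw 9 samples (with replacement) from D1 and fit one base model per sample
base_models = []
for k in range(9):
    idx = rng.integers(0, len(X_d1), size=len(X_d1))
    model = RandomForestRegressor(n_estimators=20, random_state=k)
    base_models.append(model.fit(X_d1[idx], y_d1[idx]))

def meta_features(models, X):
    """Stack each base model's prediction as one column."""
    return np.column_stack([m.predict(X) for m in models])

# Step 4: train the meta-model on the base predictions for D2
meta_model = LinearRegression().fit(meta_features(base_models, X_d2), y_d2)
# Step 5: final prediction for the test set
y_pred = meta_model.predict(meta_features(base_models, X_test))
```

Training the meta-model on D2 rather than D1 matters: the base models have not seen D2, so their predictions there behave like predictions on unseen data.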

Reading the sales data

In [4]:
nrows = None  # set to an integer to load only the first rows while experimenting

# Read StateHoliday as a string so that 0 and '0' are not treated as different categories
dataframe_train = pdd.read_csv('../data/train.csv',
                       nrows=nrows,
                       parse_dates=['Date'],
                       dtype={'StateHoliday': str})

dataframe_submit = pdd.read_csv('../data/test.csv',
                        nrows=nrows,
                        parse_dates=['Date'],
                        dtype={'StateHoliday': str})
dataframe_train['Set'] = 1   # 1 = training rows
dataframe_submit['Set'] = 0  # 0 = rows to predict
In [6]:
frames = [dataframe_train, dataframe_submit]
# ignore_index avoids duplicate row labels in the combined frame
dataframe = pdd.concat(frames, ignore_index=True)
In [8]:
features_x = ['Store', 'Date', 'DayOfWeek', 'Open', 'Promo', 'SchoolHoliday', 'StateHoliday']
features_y = ['SalesLog']
In [9]:
# Drop rows where the store was open but recorded zero sales
dataframe = dataframe.loc[~((dataframe['Open'] == 1) & (dataframe['Sales'] == 0))].copy()
In [10]:
# Model log(1 + Sales) on the training rows
train_rows = dataframe['Set'] == 1
dataframe.loc[train_rows, 'SalesLog'] = npp.log1p(dataframe.loc[train_rows, 'Sales'])
In [11]:
dataframe['StateHoliday'] = dataframe['StateHoliday'].astype('category').cat.codes
In [12]:
var_name = 'Date'

# Calendar parts of the date (.week was removed from pandas; isocalendar() replaces it)
date_parts = {'Day': dataframe[var_name].dt.day,
              'Week': dataframe[var_name].dt.isocalendar().week,
              'Month': dataframe[var_name].dt.month,
              'Year': dataframe[var_name].dt.year,
              'DayOfYear': dataframe[var_name].dt.dayofyear}

features_x.remove(var_name)
for part, values in date_parts.items():
    # fillna(0) guards against missing dates
    dataframe[var_name + part] = values.fillna(0).astype(int)
    features_x.append(var_name + part)
In [13]:
dataframe['DateInt'] = dataframe['Date'].astype(npp.int64)



In [40]:
# Rows flagged as outliers among the training data of open stores
outlier_mask = (dataframe['Set'] == 1) & (dataframe['Open'] == 1) & (dataframe['Outlier'] == True)
dpred = xgboost.DMatrix(dataframe.loc[outlier_mask, features_x])
In [41]:
ypred_bst = bst.predict(dpred)
In [42]:
# Replace the outlier sales with the model's predictions
dataframe.loc[outlier_mask, 'SalesLog'] = ypred_bst
dataframe.loc[outlier_mask, 'Sales'] = npp.exp(ypred_bst) - 1
In [43]:
no_stores_to_check = 10

plot.rcParams["figure.figsize"] = [20, no_stores_to_check * 5]

for i in range(1, no_stores_to_check + 1):
    stor = i
    store_rows = (dataframe['Set'] == 1) & (dataframe['Store'] == stor) & (dataframe['Open'] == 1)

    # Normal sales
    X1 = dataframe.loc[store_rows & (dataframe['Outlier'] == False)]
    y1 = X1['Sales']

    # Outliers
    X2 = dataframe.loc[store_rows & (dataframe['Outlier'] == True)]
    y2 = X2['Sales']

    plot.subplot(10, 5, i)
    plot.plot(X1['Date'], y1, '-')
    plot.plot(X2['Date'], y2, 'r.')
    plot.title(i)
    plot.axis('off')

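Output: one panel per store, with regular sales drawn as a line and the corrected outlier days as red dots.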

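Conclusion

Table 1: Scores of all models.
Model	RMSPE on test set	Kaggle private score
SGD Regressor	0.250	0.234
Decision Tree Regressor	0.1767	0.16442
Random Forest Regressor	0.164	0.139
LightGBM Regressor	0.157	0.121
Stacking with 9 models	0.1988	0.17375

The table above shows that LightGBM is the best model.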

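Future work

On the deep learning side, an LSTM could be a good starting point for improving performance on this dataset.

Other ensembling strategies could be tried to see whether they improve the results.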
