Demographic-Based Recommendation System

February 3, 2025 | 9 min read

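A demographic-based recommendation system is a recommendation engine that suggests products using users' demographic data. To predict preferences and behavior, these systems group users by variables such as age, gender, employment, education level, and geographic location. By taking demographic factors into account, they can tailor suggestions to better meet the needs and preferences of different user groups.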
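As a rough illustration of the idea, the sketch below buckets users by age and recommends the most common product within each bucket. The toy data, the bin edges, and the helper name `recommend_for_age` are all assumptions made for this example; they do not come from the dataset used later.

```python
import pandas as pd

# Hypothetical age segments (bin edges and labels are assumptions).
BINS = [0, 30, 60, 120]
LABELS = ["young", "middle", "senior"]

# Toy data: every value here is invented for illustration.
users = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5, 6],
    "age": [22, 25, 41, 47, 68, 72],
    "product": ["app", "app", "loan", "loan", "pension", "pension"],
})

# Assign each user to a coarse demographic segment.
users["segment"] = pd.cut(users["age"], bins=BINS, labels=LABELS)

# The most common product within each segment becomes its recommendation.
top_product = (
    users.groupby("segment", observed=True)["product"]
    .agg(lambda s: s.value_counts().idxmax())
)


def recommend_for_age(age):
    """Return the top product of the segment that `age` falls into."""
    segment = pd.cut([age], bins=BINS, labels=LABELS)[0]
    return top_product[segment]
```

A new user only needs a known age to receive a suggestion, which is why demographic approaches are often used when no interaction history exists yet.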

Code

We will now build a demographic-based recommendation system that recommends services to users.

Importing Libraries

 
# For Data Manipulation
import numpy as np
import pandas as pd
from datetime import datetime

# For Graphical Plots
import seaborn as sns
import matplotlib.pyplot as plt

# For Data Preprocessing
from sklearn.preprocessing import LabelEncoder
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.metrics.pairwise import cosine_similarity

# ML Models used to fill null values
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import RandomForestClassifier

# To visualize iterations
from tqdm import tqdm

# For reading files and data
import os

# Ignore warnings
import warnings
warnings.filterwarnings("ignore")
# Configure to display maximum columns
pd.set_option("display.max_columns",1000)  

Reading the Dataset

# Train data path - the parquet files reside here
train_folder = "D:/DATA SCIENCE/Kaggle Datasets/Santander Product Recommendation/santander-product-recommendation/parquet files"

# Names of all the files inside the train folder
train_files = os.listdir(train_folder)

# Sort the files (as we will concatenate them later)
train_files.sort(key=len)

# List of dataframes
train_df_list = []

# Iterate through each file and read
for file in tqdm(train_files):
    # Complete file path
    train_file_path = os.path.join(train_folder, file)
    # Read the parquet file
    train_file = pd.read_parquet(train_file_path)
    # Append the dataframes
    train_df_list.append(train_file)

# Concatenate all the files
train_df = pd.concat(train_df_list, axis=0)

# Delete train_df_list to save space
del train_df_list

# Print head of dataframe
train_df.head()  

Output

# Read the data description, which will be used later
data_desc = pd.read_csv("D:/DATA SCIENCE/Kaggle Datasets/Santander Product Recommendation/santander-product-recommendation/train_ver2.csv/data_desc.csv")

Data Cleaning and Processing

 
# Drop the index
train_df.reset_index(drop=True, inplace=True)

# Print head of data frame
train_df.head()   

Output

 
# Checking percentage of null values
train_df.isnull().mean() * 100   

Output

# Deleting 'conyuemp' and 'ult_fec_cli_1t' as 99% of values were missing
train_df.drop(columns=['ult_fec_cli_1t','conyuemp'], inplace=True)  

 
# Inspect the rows where ind_empleado is null; the same null share (0.203220%) appears in many features
train_df[train_df['ind_empleado'].isnull()]   

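Output

As we can see, the only information these rows hold about the user is the service selection data. Because they carry service choices but no user attributes, they could be used to build a collaborative filtering recommender, but not our demographic recommender.

We can also see that some records opted for no service yet have null values in the `ind_nomina_ult1` and `ind_nom_pens_ult1` fields, while other records did opt for a service and also have null values in these two fields. When converting to label format, we will drop the records that did not opt for any service.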

 
# Checking records with a null value in column ind_nomina_ult1
train_df[train_df['ind_nomina_ult1'].isnull()]   

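Output

Encoding the Target

To make recommendations, we must convert the one-hot encoded vectors (the targets) into label encoding. Once encoded, we can quickly look up a service name through `sklearn`'s `LabelEncoder` object.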
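The conversion can be tried on a toy frame first. The sketch below is only an illustration: the three rows are invented, and the column names merely mimic the dataset's naming style.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy one-hot frame; the values are invented for illustration.
one_hot = pd.DataFrame({
    "ind_cco_fin_ult1": [1, 0, 0],
    "ind_nomina_ult1": [0, 1, 0],
    "ind_recibo_ult1": [0, 0, 1],
})

# idxmax(axis=1) returns, for each row, the name of the column holding the maximum.
raw = one_hot.idxmax(axis=1)

encoder = LabelEncoder()
labels = encoder.fit_transform(raw)        # integer codes, assigned in sorted order of the names
names = encoder.inverse_transform(labels)  # round trip back to the column names
```

Note that `idxmax` returns the first column when a row is all zeros, so rows with no service selected need to be handled separately.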

 
# Define label encoder object
le = LabelEncoder()

# Convert one-hot encoded vectors to a single column
raw_target = train_df.iloc[:, 22:].idxmax(axis=1)

# Fit transform the labels
transformed_target = le.fit_transform(raw_target)

# Concatenate the column to the dataframe
train_df['service_opted'] = transformed_target

# Typecast to uint8 to save memory
train_df['service_opted'] = train_df['service_opted'].astype('uint8')

# Print the dataframe
train_df.head(10)   

Output

 
# Checking the value count of the products
plt.figure(figsize=(12,8))

# Get the names and the occurrences
names = raw_target.value_counts().index
values = raw_target.value_counts().values

# Map the names with their English translation via data_desc
names = [data_desc[data_desc['Column Name'] == name]['Description'].values[0] for name in names]

# Draw the bar plot
ax = sns.barplot(x=names, y=values)

# Set the title
ax.set_title("Number Of Services Opted In Millions")

# Set the xticklabels and rotate
ax.set_xticklabels(ax.get_xticklabels(), rotation=90)

# Label the bars
for p in ax.patches:
    ax.annotate("{:.1f}".format(p.get_height()), (p.get_x(), p.get_height()), rotation=25)

# Show the plot
plt.show()   

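Output

The three columns we want in the dataset are `user_id`, `item_id`, and `rating`. We have `user_id` and `item_id`, since the item being recommended is a banking service, but we have no `rating`. We will therefore use the service selection ratio, treated as a proxy for customer satisfaction, in place of a rating.

First, we count how many times each user opted for each service. Then we divide that count by the total number of service selections the user made during their time with the bank. The resulting ratio ranges from 0 to 1.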
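In other words, for user u and service s the ratio is count(u, s) divided by the sum of count(u, s') over all services s'. A minimal sketch on an invented transaction log (the column name `service` is an assumption for this example):

```python
import pandas as pd

# Toy transaction log: one row per (user, service) event; values are invented.
log = pd.DataFrame({
    "ncodpers": [1, 1, 1, 2, 2],
    "service": [0, 0, 3, 3, 3],
})

# Count how often each user opted for each service.
counts = pd.crosstab(log["ncodpers"], log["service"])

# Divide each row by its total so that every user's ratios sum to 1.
ratios = counts.div(counts.sum(axis=1), axis=0)
```

Here user 1 chose service 0 twice and service 3 once, giving ratios of 2/3 and 1/3, while user 2 only ever chose service 3 and gets a ratio of 1 for it.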

 
# Creating a user-item matrix, each entry indicates the number of times service opted by that user
user_item_matrix = pd.crosstab(index=train_df.ncodpers, columns=le.transform(raw_target), values=1, aggfunc='sum')

# Fill NaN values with 0, meaning the user never opted for that service
user_item_matrix.fillna(0, inplace=True)

# Print the user-item matrix(Represents Count)
user_item_matrix   

Output

 
# Creating a user-item interaction matrix representing the ratio
# Divide each row (user) by its total count in one vectorized step,
# so that every user's ratios sum to 1
user_item_ratio_matrix = user_item_matrix.div(user_item_matrix.sum(axis=1), axis=0)

# Print the user_item_ratio_matrix(Represents the ratio)
user_item_ratio_matrix

Output

Merging into a Single File

 
# Stack the user_item_ratio_matrix to get all values in a single column
user_item_ratio_stacked = user_item_ratio_matrix.stack().to_frame()

# Create a column for user id
user_item_ratio_stacked['ncodpers'] = [index[0] for index in user_item_ratio_stacked.index]

# Create a column for service_opted
user_item_ratio_stacked['service_opted'] = [index[1] for index in user_item_ratio_stacked.index]

# Reset and drop the index
user_item_ratio_stacked.reset_index(drop=True, inplace=True)

# Print the dataframe
user_item_ratio_stacked   

Output

We will now tidy up the representation of the data.

# Rename the column 0 to service_selection_ratio
user_item_ratio_stacked.rename(columns={0:"service_selection_ratio"}, inplace=True)

# Arrange the columns systematically for a better view
user_item_ratio_stacked = user_item_ratio_stacked[['ncodpers','service_opted', 'service_selection_ratio']]

# Drop all the rows with 0 entries as it means the user has never opted for the service
user_item_ratio_stacked.drop(user_item_ratio_stacked[user_item_ratio_stacked['service_selection_ratio']==0].index, inplace=True)

# Reset the index
user_item_ratio_stacked.reset_index(drop=True, inplace=True)

# Display the final dataframe
user_item_ratio_stacked  

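Output

Demographic-Based Recommendation System

By vectorizing the user data, we will use these features to recommend services to similar users, that is, users who share similar characteristics.

As noted earlier when examining the null values, we first drop the entries that carry no user information.

# Dropping rows with no useful data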
train_df.drop(train_df[train_df['ind_empleado'].isnull()].index, axis=0, inplace=True)

# Dropping rows with no useful data
train_df.drop(train_df[train_df['ind_nomina_ult1'].isnull()].index, axis=0, inplace=True)

# Dropping one-hot encoded columns of services
train_df.drop(columns=train_df.iloc[:1,22:-1].columns, inplace=True)

# Print the dataframe
train_df.head()  

Output

 
# Checking the null value for all columns
train_df.isnull().mean()*100  

Output

# Filling renta with its mean
train_df['renta'] = train_df['renta'].fillna(train_df['renta'].mean())

# Filling cod_prov with its mode
train_df['cod_prov'] = train_df['cod_prov'].fillna(train_df['cod_prov'].mode()[0])

# Filling indrel_1mes with its mode
train_df['indrel_1mes'] = train_df['indrel_1mes'].fillna(train_df['indrel_1mes'].mode()[0])

We need to convert the categorical columns to a numeric format so that we can compute similarity.

 
# Names of the columns of type object
obj_cols = train_df.select_dtypes('object').columns

# Iterate through each column
for col in obj_cols:
    print("*"*5,col,"*"*5)
    # Print its unique value
    print(train_df[col].unique(),"\n\n")   

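Output

Here we note the following:

- The `age` feature needs to be converted from the `object` dtype to `uint8`.
- The `indrel_1mes` feature has several duplicate labels that differ only slightly (for example `'1'` and `'1.0'`), so we will merge each set into a single label.
- `cod_prov` is already a numeric encoding, so `nomprov` can be dropped.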

 
# Typecast age to an unsigned 8-bit integer
train_df['age'] = train_df['age'].astype('uint8')
# Correcting the categories of the column - indrel_1mes
train_df['indrel_1mes'] = train_df['indrel_1mes'].replace({
    '1': 1, '1.0': 1,
    '2': 2, '2.0': 2,
    '3': 3, '3.0': 3,
    '4': 4, '4.0': 4,
    'P': 5,
    'None': np.nan,
})

# Print dataframe
train_df.head()   

Output

 
# List of columns to encode
cols_to_encode = ['ind_empleado', 'pais_residencia', 'sexo', 'indrel', 'tiprel_1mes', 'indresi', 'indext', 'canal_entrada', 'indfall', 'segmento']

# List of label encoders which will be used for transformations later
label_encoders = []

# Label encode these columns iteratively
for col in tqdm(cols_to_encode):
    # Initialize a label encoder object
    lab_enc = LabelEncoder()
    
    # Encode the column and replace the existing values
    train_df[col] = lab_enc.fit_transform(train_df[col])
    
    # Typecast to uint8 dtype
    train_df[col] = train_df[col].astype('uint8')
    
    # Append it in the label_encoders list to use it later
    label_encoders.append(lab_enc)
    
    # Delete the label encoder object
    del lab_enc
    
# Print the data
train_df.head()   

Output

 
# Deleting column 'nomprov' as we already have its encoded feature(cod_prov)
train_df.drop(columns=['nomprov'], inplace=True)

# Deleting column tipodom as all values are '1'
train_df.drop(columns=['tipodom'], inplace=True)

# Print the dataframe
train_df.head()  

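Output

Suppose we have N users. For each user we select the most recent transaction. We then count, for every service, the number of transactions the user made up to that latest transaction date and record the counts in the dataset.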
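The two steps described above can be sketched on a toy history as follows. This is only an illustration under stated assumptions: the rows are invented and assumed to be ordered by date, and the grouped count is one possible way to obtain per-service totals without iterating row by row.

```python
import pandas as pd

# Toy history, already ordered by date; all values are invented.
history = pd.DataFrame({
    "ncodpers": [1, 1, 1, 2, 2],
    "fecha_dato": ["2015-01-28", "2015-02-28", "2015-03-28",
                   "2015-01-28", "2015-02-28"],
    "service_opted": [2, 2, 5, 7, 7],
})

# Step 1: keep the last (most recent) row per user.
latest = (
    history.drop_duplicates(subset="ncodpers", keep="last")
    .set_index("ncodpers")
)

# Step 2: count each user's rows per service, one column per service.
counts = (
    history.groupby(["ncodpers", "service_opted"]).size()
    .unstack(fill_value=0)
    .add_prefix("service_")
)

# Attach the counts to the per-user snapshot.
user_snapshot = latest.join(counts)
```

In this toy data, user 1 ends up with `service_2 = 2` and `service_5 = 1`, while the `service_opted` column still holds the most recent choice.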

 
# Keep one row per user: keep='last' retains the latest transaction; copy() avoids working on a view of train_df
user_data = train_df[~train_df['ncodpers'].duplicated(keep='last')].copy()

# Reset the index
user_data.reset_index(drop=True, inplace=True)

# Print the head
user_data.head()   

Output

# Switch to the notebook progress bar and register progress_apply on pandas objects
from tqdm.notebook import tqdm
tqdm.pandas()
# Create one-hot encodings using the service_opted variables
service_one_hot = pd.get_dummies(user_data['service_opted'],prefix='service')

# Join service one hot with real data
user_data = pd.concat([user_data, service_one_hot], axis=1)

# Print dataframe
user_data.head()  

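Output

We set the user ID and the selected service as the index and sort it so that filtering is very fast, since we will be retrieving the earlier records that relate to each entry in the `user_data` dataframe.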

 
# Set the userid and the service opted as index
train_df.set_index(['ncodpers','service_opted'], inplace=True)

# Sort the index to fetch records faster
train_df.sort_index(inplace=True)

# Print the dataframe
train_df.head()  

Output

 
# List of service labels
service_list = list(range(24))

# For each service label
for service_no in tqdm(service_list):
    # Iterate through each row of user_data
    for index, row in tqdm(enumerate(user_data.itertuples())):
        # Fetch the count of transactions of the current user for this service
        try:
            old_service_no_count = train_df.loc[(row.ncodpers, service_no)].shape[0]
        except KeyError:
            # The user never opted for this service
            old_service_no_count = 0
        # Create the column if needed and store the count
        user_data.at[index, f'service_{service_no}'] = old_service_no_count
        
# Print the user_data dataframe
user_data.head()   

Output

 
# fecha_alta feature creation: parse the 'YYYY-MM-DD' strings once
fecha_alta = pd.to_datetime(user_data['fecha_alta'], format='%Y-%m-%d')
user_data['fecha_alta_dow'] = fecha_alta.dt.dayofweek
user_data['fecha_alta_month'] = fecha_alta.dt.month
user_data['fecha_alta_year'] = fecha_alta.dt.year

# Converting all these columns to uint8(0-255 range) except year to save memory as these features will be in this range
user_data['fecha_alta_dow'] = user_data['fecha_alta_dow'].astype('uint8')
user_data['fecha_alta_month'] = user_data['fecha_alta_month'].astype('uint8')
user_data['fecha_alta_year'] = user_data['fecha_alta_year'].astype('int16')

# Drop the fecha_alta and fecha_dato columns
del user_data['fecha_alta'], user_data['fecha_dato']

# show dataframe
user_data.head()   

Output

 
# Splitting the Dataset 
Y = user_data['service_opted'].copy()
X = user_data.drop(columns=['service_opted'], inplace=False)  

Now we scale the dataset.

from sklearn.preprocessing import StandardScaler
# Define a scaler object
scaler = StandardScaler()

# Fit transform the data
user_data_scaled = scaler.fit_transform(X)   

Next, we reduce the dimensionality.

 
from sklearn.decomposition import PCA
# Define a PCA instance
pca = PCA(0.95)

# Fit transform the data
user_data_reduced = pd.DataFrame(pca.fit_transform(user_data_scaled), index=user_data.ncodpers)

# show data
user_data_reduced.head()  

Output
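To see what scaling followed by `PCA(0.95)` does, the sketch below runs the same two steps on synthetic data. The data is an assumption made for illustration: five columns, two of which are near-copies of others, so fewer than five components should be enough to retain 95% of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data: 200 rows and 5 columns, where the last two columns are
# near-copies of the first two (values are invented for illustration).
base = rng.normal(size=(200, 3))
features = np.column_stack([
    base,
    base[:, 0] + 0.01 * rng.normal(size=200),
    base[:, 1] + 0.01 * rng.normal(size=200),
])

# Standardize each column to zero mean and unit variance.
scaled = StandardScaler().fit_transform(features)

# A float in (0, 1) asks PCA to keep as many components as are needed
# to explain at least that fraction of the variance.
pca = PCA(n_components=0.95)
reduced = pca.fit_transform(scaled)

print(features.shape[1], "->", pca.n_components_, "components")
```

Scaling first matters because PCA is driven by variance, so an unscaled column with large values such as income would otherwise dominate the components.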

 
# Get recommendations for a specified user
from numpy import dot
from numpy.linalg import norm
def get_label_name(label, le=le):
    return data_desc[data_desc['Column Name'] == le.inverse_transform([label])[0]]['Description'].values[0]
def cosine_sim(X,Y):
    return dot(X,Y) / (norm(X)*norm(Y))

def get_sim_user_recommendation(uid, top_n, X):
    # Fetch the specified user
    user_specified = X.loc[uid]
    
    # Calculate similarity with each and every user
    res = X.progress_apply(lambda user: cosine_sim(user_specified, user), axis=1)
    
    # Convert to a dataframe
    res = res.to_frame(name='sim_score')
    
    # Drop the index and make it a column
    res.reset_index(inplace=True)
    
    # Join the user_data and the res table on ncodpers
    res = pd.merge(left= user_data[['ncodpers','service_opted']], 
                   right = res, 
                   on='ncodpers')
    
    # Sort by similarity so the most similar user of each service comes first
    res.sort_values(by='sim_score', ascending=False, inplace=True)
    
    # Keep the most similar row from each service category
    res = res[~res['service_opted'].duplicated(keep='first')].copy()
    
    # Add a service opted name column
    res['service_opted_name'] = res['service_opted'].progress_apply(lambda label: get_label_name(label, le))
    
    # Reset the index
    res.reset_index(drop=True, inplace=True)
    
    # Return the top_n predictions
    return res.head(top_n)
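Computing the similarity row by row with `apply` can be slow on a large user table. As a possible alternative, the sketch below uses `cosine_similarity` from scikit-learn, which is already imported at the top of this tutorial, to score one user against all others in a single call. The toy vectors and the helper name `most_similar` are assumptions made for this example.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Toy feature vectors indexed by user id; all values are invented.
vectors = pd.DataFrame(
    [[1.0, 0.0], [2.0, 0.1], [0.0, 1.0], [-1.0, 0.0]],
    index=[101, 102, 103, 104],
)


def most_similar(uid, vectors, top_n=2):
    """Return the top_n users most similar to uid, excluding uid itself."""
    # cosine_similarity expects 2-D inputs, so select the query row as a frame.
    scores = cosine_similarity(vectors.loc[[uid]], vectors)[0]
    ranked = pd.Series(scores, index=vectors.index).drop(uid)
    return ranked.sort_values(ascending=False).head(top_n)


neighbours = most_similar(101, vectors)
```

Cosine similarity depends only on direction, so user 102, whose vector points almost the same way as user 101's, ranks first even though its magnitude differs.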

Checking the Recommendations

Let's look at the recommendations for users of different ages.

 
# Get result for 1214789.0 (age-22)
res1 = get_sim_user_recommendation(1214789.0, 24, user_data_reduced)
res1 

Output

 
# Get result for 891565.0 (age-51)
res2 = get_sim_user_recommendation(891565.0, 24, user_data_reduced)
res2

Output

 
# Get result for 55890.0 (age-82)
res3 = get_sim_user_recommendation(55890.0, 24, user_data_reduced)
res3   

Output
