Placement Prediction Using Machine Learning

July 30, 2025 | 10 min read

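As technology reaches into more areas of daily life, machine learning applications are becoming increasingly common. Placement prediction is one of them: a machine learning model estimates how likely a student is to be hired by a company, based on criteria such as academic performance, skill set, and prior work experience.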

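How Placement Prediction with Machine Learning Works

To predict placement outcomes, information is collected from several sources, including academic transcripts, resumes, and prior work experience.
The data is then cleaned and preprocessed to remove inconsistencies and errors.
After cleaning, the data is split into two sets: training data and test data.
The machine learning algorithm is trained on the training data, and its effectiveness is evaluated on the test data. Several methods can be used for training, including neural networks, decision trees, and regression analysis.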
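The workflow described above can be sketched end to end in a few lines. This is a minimal illustration, not the article's actual pipeline: the `records` table, its column names, and the rule that generates the labels are all made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for collected student records (NOT real data).
rng = np.random.default_rng(0)
n = 200
records = pd.DataFrame({
    "cgpa": rng.uniform(5.0, 10.0, n),
    "internships": rng.integers(0, 4, n),
})
# Hypothetical rule, used only to generate labels for this sketch.
records["placed"] = ((records["cgpa"] + records["internships"]) > 8.5).astype(int)

# Step 1: clean (drop missing values and exact duplicate rows).
clean = records.dropna().drop_duplicates()

# Step 2: split into training and test sets.
X = clean[["cgpa", "internships"]]
y = clean["placed"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0, stratify=y
)

# Step 3: train on the training set, evaluate on the held-out test set.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
test_accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Test accuracy: {test_accuracy:.2f}")
```

The test set is never shown to the model during fitting, so the printed accuracy is an estimate of how the model behaves on unseen students.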

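Regression analysis is a statistical method for determining the relationship between two or more variables. In placement prediction, it is used to relate variables such as academic performance, skill set, and prior work experience to the likelihood of being hired by a company.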
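As a rough illustration of what "relating variables to the likelihood of being hired" means, the sketch below fits a logistic regression to synthetic data and inspects the learned coefficients. The features, the labeling rule, and the noise level are assumptions chosen for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 500
cgpa = rng.uniform(5.0, 10.0, n)
noise_feature = rng.normal(0.0, 1.0, n)  # unrelated to the label by construction

# Hypothetical ground truth: placement depends on CGPA only (plus some noise).
placed = (cgpa + rng.normal(0.0, 0.5, n) > 7.5).astype(int)

X = np.column_stack([cgpa, noise_feature])
model = LogisticRegression(max_iter=1000).fit(X, placed)

# One coefficient per feature: sign gives the direction of the relationship,
# magnitude gives its strength (on the feature's own scale).
coef_cgpa, coef_noise = model.coef_[0]
print("CGPA coefficient:         ", coef_cgpa)
print("Noise-feature coefficient:", coef_noise)
```

A positive CGPA coefficient means the predicted placement probability rises with CGPA, while the coefficient of the unrelated feature stays close to zero.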

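A decision tree is a machine learning algorithm that uses a tree-like structure to model decisions and their possible outcomes. In placement prediction, decision trees are used to model the decisions a company makes during the hiring process.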
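To make the tree structure concrete, here is a small sketch that fits a shallow tree on eight made-up records and prints the learned rules with scikit-learn's `export_text`. The records and labels are invented for illustration.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy records: [cgpa, internships]; labels are made up for illustration.
X = [[6.0, 0], [6.5, 1], [7.0, 0], [7.5, 2],
     [8.0, 1], [8.5, 0], [9.0, 2], [9.5, 1]]
y = [0, 0, 0, 1, 1, 1, 1, 1]

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# export_text renders the fitted tree as nested if/else style rules.
rules = export_text(tree, feature_names=["cgpa", "internships"])
print(rules)
```

In this toy data the labels are separated perfectly by CGPA alone, so the printed tree needs only a single threshold on `cgpa`.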

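Neural networks are machine learning algorithms inspired by the structure and function of the human brain. In placement prediction, they are used to represent the complex relationships among the many factors that influence the likelihood of being hired.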
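The following is a minimal sketch of a small neural network on synthetic data, using scikit-learn's `MLPClassifier`. The labeling rule is a hypothetical non-linear one, and the layer size and iteration limit are arbitrary choices for the example.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
n = 400
cgpa = rng.uniform(5.0, 10.0, n)
internships = rng.integers(0, 4, n)

# Hypothetical non-linear rule: high CGPA, or moderate CGPA with at least 2 internships.
placed = ((cgpa > 8.0) | ((cgpa > 6.5) & (internships >= 2))).astype(int)
X = np.column_stack([cgpa, internships])

# Neural networks train more reliably on standardized inputs,
# so the scaler and the network are chained in one pipeline.
net = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0),
)
net.fit(X, placed)
train_accuracy = net.score(X, placed)
print(f"Training accuracy: {train_accuracy:.2f}")
```

This reports accuracy on the training data only, which is enough to show that the network can fit a rule that is not a single straight-line boundary; it says nothing about generalization.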

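Once training is complete, the model is evaluated on the test data. Its effectiveness is measured with several metrics, including accuracy, precision, recall, and F1 score, which together indicate how well the algorithm predicts whether a student will be placed.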
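The four metrics can be computed by hand from the confusion counts. The sketch below does so for ten made-up predictions and checks the result against scikit-learn; the label vectors are invented for the example.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Made-up labels: 1 = placed, 0 = not placed.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0, 1, 1]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # placed, predicted placed
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # not placed, predicted placed
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # placed, predicted not placed
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # not placed, predicted not placed

accuracy = (tp + tn) / len(pairs)                    # 7 / 10
precision = tp / (tp + fp)                           # 4 / 6
recall = tp / (tp + fn)                              # 4 / 5
f1 = 2 * precision * recall / (precision + recall)   # 8 / 11

print(accuracy, precision, recall, f1)

# The manual values agree with scikit-learn's implementations.
assert abs(accuracy - accuracy_score(y_true, y_pred)) < 1e-12
assert abs(precision - precision_score(y_true, y_pred)) < 1e-12
assert abs(recall - recall_score(y_true, y_pred)) < 1e-12
assert abs(f1 - f1_score(y_true, y_pred)) < 1e-12
```

Precision answers "of the students predicted as placed, how many really were?", while recall answers "of the students who were placed, how many did the model find?". F1 is the harmonic mean of the two.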

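Advantages of Using Machine Learning for Placement Prediction

Using machine learning algorithms to predict placement outcomes has several advantages.

It automates the initial screening of candidates, reducing the time and effort the hiring process requires.
It offers a more data-driven and objective hiring process, which can reduce the influence of subjectivity and bias.
It gives companies a chance to discover strong candidates they might miss through a traditional hiring process.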

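Code Implementation

Here we apply machine learning techniques to look for relationships and patterns that distinguish placed students from those who were not placed.

1. Importing Libraries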

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import plotly.express as px
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)

from sklearn.decomposition import PCA

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score, recall_score, f1_score

import pickle

2. Reading the Dataset

dataframe = pd.read_csv('/kaggle/input/engineering-placements-prediction/collegePlace.csv')

# Getting to know the shape of data
dataframe.shape

Output

The dataset has 2,966 rows and 8 columns (7 features plus the target column PlacedOrNot).

# Showing the first 5 rows of the dataset
dataframe.head()

Output

# Showing 4 rows of the dataset at random
dataframe.sample(4)

Output

# Getting to know the data type of columns that are in the dataset
dataframe.dtypes

Output

# Getting to know the detailed information of the columns
dataframe.info()

Output

# Statistical Descriptions of the numerical values in the dataset
dataframe.describe()

Output

# Getting to know the correlation between the target column and other features.
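# numeric_only=True skips the text columns (Gender, Stream), which are not encoded yet
dataframe.corr(numeric_only=True)['PlacedOrNot']

Output

PlacedOrNot is most strongly correlated with the student's CGPA.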

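3. Preprocessing

Preprocessing is an important step in machine learning: it turns raw data into a clean format suitable for analysis and modeling before the data is fed to a learning algorithm.

We now check whether the dataset contains any missing values or duplicate rows.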

# missing values
dataframe.isnull().sum()

Output

# duplicate rows
print(dataframe.duplicated().sum())

#drop duplicates
dataframe.drop_duplicates(inplace=True)

Output

# Check if the duplicate rows are removed
print(dataframe.duplicated().sum())

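Output

4. EDA

Exploratory data analysis (EDA) is an important stage in machine learning that involves examining and visualizing the data to understand its composition, characteristics, and trends. It is carried out before building the actual model and is essential for spotting potential problems and for choosing suitable preprocessing and feature engineering strategies.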

# Plotting  the graph so that we can visualize the output with respect to major features
figure = px.scatter(dataframe, x="CGPA", y="Internships", color="PlacedOrNot",
                 hover_data=['CGPA'])
figure.show()

Output

# Plotting a histogram of the counts of placed and not placed students
px.histogram(dataframe, x='PlacedOrNot', color='PlacedOrNot', barmode='group')

Output

# Pie Chart: Percentage pie chart of Placed or Not Placed
figure = px.pie(dataframe, values=dataframe['PlacedOrNot'].value_counts().values, names=dataframe['PlacedOrNot'].value_counts().index, title='Placed Vs Not Placed')
figure.show()

Output

# Printing the ages of the oldest and youngest students who were placed
placed_ages = dataframe.loc[dataframe['PlacedOrNot'] == 1, 'Age']
print("Max Age of Placed Person: ", placed_ages.max())
print("Min Age of Placed Person: ", placed_ages.min())

Output

# Printing the maximum and minimum number of internships done by placed students,
# and how many placed students did that maximum or minimum number of internships.
placed_internships = dataframe.loc[dataframe['PlacedOrNot'] == 1, 'Internships']
print("Max Internships Done by the Placed Student: ", placed_internships.max())
print("No of students who did max Internships and are placed: ", (placed_internships == placed_internships.max()).sum())

print("Min Internships Done by the Placed Person: ", placed_internships.min())
print("No of students who did min Internships and are placed: ", (placed_internships == placed_internships.min()).sum())

Output

# Printing the maximum and minimum CGPA among placed students,
# and how many placed students have that maximum or minimum CGPA.
placed_cgpa = dataframe.loc[dataframe['PlacedOrNot'] == 1, 'CGPA']

print("Max CGPA of Placed Student: ", placed_cgpa.max())
print("No of students has max CGPA and are placed: ", (placed_cgpa == placed_cgpa.max()).sum())

print("Min CGPA of Placed Person: ", placed_cgpa.min())
print("No of students has min CGPA and are placed: ", (placed_cgpa == placed_cgpa.min()).sum())

Output

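5. Representation

Statistical representation uses summary statistics and visualizations to present data in a meaningful, easy-to-understand way. Its main purpose is to help the reader see the patterns in the data and make informed decisions from them. The box plots below show the spread and outliers of the numeric columns.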

figure = px.box(dataframe, y='CGPA')
figure.show()

Output

figure = px.box(dataframe, y='Age')
figure.show()

Output

figure = px.box(dataframe, y=['Internships','CGPA', 'Age'])
figure.show()

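Output

6. Encoding Categorical Variables as Numeric Variables

Encoding categorical variables as numeric variables is a common preprocessing step in machine learning. It converts qualitative attributes that represent categories into numeric values that can be used in mathematical operations and models.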

# Converting Gender column
dataframe['Gender'] = dataframe['Gender'].map({'Male': 1, 'Female': 0})

Output

# Converting Stream column
dataframe['Stream'] = dataframe['Stream'].map({'Electronics And Communication': 1,
                                 'Computer Science': 2,
                                'Information Technology': 3,
                                'Mechanical':4,
                                'Electrical':5,
                                'Civil':6})

Output
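Mapping Stream to the integers 1 to 6 imposes an order on the branches that has no real meaning, which can mislead linear and distance-based models. One common alternative, shown here only as a sketch on a made-up four-row table, is one-hot encoding with `pd.get_dummies`.

```python
import pandas as pd

# Tiny made-up sample; the real dataset has more columns and rows.
sample = pd.DataFrame({
    "Stream": ["Computer Science", "Civil", "Mechanical", "Computer Science"],
    "CGPA": [8, 7, 6, 9],
})

# One 0/1 indicator column per category replaces the original text column.
encoded = pd.get_dummies(sample, columns=["Stream"], dtype=int)
print(encoded)
```

Tree-based models such as random forests are usually less sensitive to this choice, so the integer mapping used in this tutorial remains a workable option for them.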

7. Extracting the Input and Output Columns

X = dataframe.iloc[:,0:7]
y = dataframe.iloc[:,-1]
X

Output

# Getting the shape of the X and Y
print(X.shape)
print(y.shape)

Output

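# Splitting the dataset into training and testing datasets.
# random_state fixes the shuffle so the split and the scores below are reproducible.
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.33,random_state=0)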

# Getting the Shape of all the training and testing dataset
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

Output

9. Scaling the Values

scaler = StandardScaler()
X_train_scale = scaler.fit_transform(X_train)
X_test_scale = scaler.transform(X_test)
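The scaler is fit on the training data only and then reused to transform the test data, so no information from the test set reaches the scaling statistics. The sketch below, on a made-up single-column example, shows that `StandardScaler` applies z = (x - mean) / std using the training mean and the population standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-feature data for illustration.
train = np.array([[6.0], [7.0], [8.0], [9.0]])
test = np.array([[10.0]])

demo_scaler = StandardScaler().fit(train)   # learns mean and std from train only
scaled = demo_scaler.transform(test)        # applies the training statistics to test

# Manual computation: np.std defaults to ddof=0, matching StandardScaler.
manual = (test - train.mean(axis=0)) / train.std(axis=0)

print(scaled, manual)
```

Here the training mean is 7.5 and the standard deviation is sqrt(1.25), so the test value 10.0 maps to 2.5 / sqrt(1.25), roughly 2.24.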

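10. Model Training and Evaluation

Training and evaluation are the two steps that determine a model's accuracy and performance. Both require careful planning, attention to detail, and rigorous assessment in order to produce a model that generalizes well to new, unseen data.

Here we try several machine learning algorithms and compare their accuracy.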
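The comparison in this section repeats the same fit-and-score pattern for every algorithm. As a compact sketch of that pattern, the loop below scores a few candidates with cross-validation on synthetic data from `make_classification`. The dataset, the candidate list, and the names `X_demo`, `y_demo`, and `cv_results` are assumptions for the example. Wrapping the scaler and the model in a pipeline means the scaler is refit inside each fold.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data with 7 features (NOT the placement dataset).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=7, n_informative=4, random_state=0
)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(max_depth=10, random_state=0),
    "knn": KNeighborsClassifier(n_neighbors=3),
}

cv_results = {}
for name, estimator in candidates.items():
    # The scaler is refit on the training part of each fold, so the
    # validation part never influences the scaling statistics.
    pipeline = make_pipeline(StandardScaler(), estimator)
    cv_results[name] = cross_val_score(pipeline, X_demo, y_demo, cv=5).mean()

for name, score in sorted(cv_results.items(), key=lambda item: -item[1]):
    print(f"{name}: {score:.3f}")
```

Scoring every candidate through the same loop keeps the comparison consistent, since each model sees the same folds and the same preprocessing.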

1. Logistic Regression

from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression()

# Without Scaling
classifier.fit(X_train,y_train)
y_pred = classifier.predict(X_test)
print("Without Scaling and CV: ",accuracy_score(y_test,y_pred))
scores = cross_val_score(classifier, X_train, y_train, cv=10)
print("Without Scaling and With CV: ",scores.mean())

# With Scaling
classifier.fit(X_train_scale,y_train)
y_pred = classifier.predict(X_test_scale)
print("With Scaling and Without CV: ",accuracy_score(y_test,y_pred))
scores = cross_val_score(classifier, X_train_scale, y_train, cv=10)
print("With Scaling and With CV: ",scores.mean())

#scaling has not much effect

Output

2. Decision Tree Classifier

from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(random_state=0)

#without scaling
classifier.fit(X_train,y_train)
y_pred = classifier.predict(X_test)
print("Without Scaling and CV: ",accuracy_score(y_test,y_pred))
scores = cross_val_score(classifier, X_train, y_train, cv=10)
print("Without Scaling and With CV: ",scores.mean())

# With Scaling
classifier.fit(X_train_scale,y_train)
y_pred = classifier.predict(X_test_scale)
print("With Scaling and Without CV: ",accuracy_score(y_test,y_pred))
scores = cross_val_score(classifier, X_train_scale, y_train, cv=10)
print("With Scaling and With CV: ",scores.mean())

Output

3. Random Forest Classifier

from sklearn.ensemble import RandomForestClassifier

classifier = RandomForestClassifier(max_depth=10, random_state=0)

classifier.fit(X_train,y_train)
y_pred = classifier.predict(X_test)
print("Without Scaling and CV: ",accuracy_score(y_test,y_pred))
scores = cross_val_score(classifier, X_train, y_train, cv=10)
print("Without Scaling and With CV: ",scores.mean())

# With Scaling
classifier.fit(X_train_scale,y_train)
y_pred = classifier.predict(X_test_scale)
print("With Scaling and Without CV: ",accuracy_score(y_test,y_pred))
scores = cross_val_score(classifier, X_train_scale, y_train, cv=10)
print("With Scaling and With CV: ",scores.mean())

Output

4. Support Vector Machine

from sklearn.svm import SVC

svc = SVC()
parameters = {'kernel':('linear', 'rbf'), 'C':[1, 10]}
classifier = GridSearchCV(svc, parameters)

classifier.fit(X_train,y_train)
y_pred = classifier.predict(X_test)
print("Without Scaling and CV: ",accuracy_score(y_test,y_pred))
scores = cross_val_score(classifier, X_train, y_train, cv=10)
print("Without Scaling and With CV: ",scores.mean())

# With Scaling
classifier.fit(X_train_scale,y_train)
y_pred = classifier.predict(X_test_scale)
print("With Scaling and Without CV: ",accuracy_score(y_test,y_pred))
scores = cross_val_score(classifier, X_train_scale, y_train, cv=10)
print("With Scaling and With CV: ",scores.mean())

Output

from sklearn.svm import NuSVC
classifier = NuSVC()

classifier.fit(X_train,y_train)
y_pred = classifier.predict(X_test)
print("Without Scaling and CV: ",accuracy_score(y_test,y_pred))
scores = cross_val_score(classifier, X_train, y_train, cv=10)
print("Without Scaling and With CV: ",scores.mean())

# With Scaling
classifier.fit(X_train_scale,y_train)
y_pred = classifier.predict(X_test_scale)
print("With Scaling and Without CV: ",accuracy_score(y_test,y_pred))
scores = cross_val_score(classifier, X_train_scale, y_train, cv=10)
print("With Scaling and With CV: ",scores.mean())

Output

5. Naive Bayes

from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()

classifier.fit(X_train,y_train)
y_pred = classifier.predict(X_test)
print("Without Scaling and CV: ",accuracy_score(y_test,y_pred))
scores = cross_val_score(classifier, X_train, y_train, cv=10)
print("Without Scaling and With CV: ",scores.mean())

# With Scaling
classifier.fit(X_train_scale,y_train)
y_pred = classifier.predict(X_test_scale)
print("With Scaling and Without CV: ",accuracy_score(y_test,y_pred))
scores = cross_val_score(classifier, X_train_scale, y_train, cv=10)
print("With Scaling and With CV: ",scores.mean())

Output

from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB()

classifier.fit(X_train,y_train)
y_pred = classifier.predict(X_test)
print("Without Scaling and CV: ",accuracy_score(y_test,y_pred))
scores = cross_val_score(classifier, X_train, y_train, cv=10)
print("Without Scaling and With CV: ",scores.mean())

Output

from sklearn.naive_bayes import BernoulliNB
classifier = BernoulliNB()

classifier.fit(X_train,y_train)
y_pred = classifier.predict(X_test)
print("Without Scaling and CV: ",accuracy_score(y_test,y_pred))
scores = cross_val_score(classifier, X_train, y_train, cv=10)
print("Without Scaling and With CV: ",scores.mean())

# With Scaling
classifier.fit(X_train_scale,y_train)
y_pred = classifier.predict(X_test_scale)
print("With Scaling and Without CV: ",accuracy_score(y_test,y_pred))
scores = cross_val_score(classifier, X_train_scale, y_train, cv=10)
print("With Scaling and With CV: ",scores.mean())

Output

from sklearn.naive_bayes import CategoricalNB
classifier = CategoricalNB()

classifier.fit(X_train,y_train)
y_pred = classifier.predict(X_test)
print("Without Scaling and CV: ",accuracy_score(y_test,y_pred))

Output

6. KNN

from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=3)

classifier.fit(X_train,y_train)
y_pred = classifier.predict(X_test)
print("Without Scaling and CV: ",accuracy_score(y_test,y_pred))
scores = cross_val_score(classifier, X_train, y_train, cv=10)
print("Without Scaling and With CV: ",scores.mean())


# With Scaling
classifier.fit(X_train_scale,y_train)
y_pred = classifier.predict(X_test_scale)
print("With Scaling and Without CV: ",accuracy_score(y_test,y_pred))
scores = cross_val_score(classifier, X_train_scale, y_train, cv=10)
print("With Scaling and With CV: ",scores.mean())

Output

7. SGD Classifier

from sklearn.linear_model import SGDClassifier
classifier = SGDClassifier(max_iter=1000, tol=1e-3)

# Without Scaling
classifier.fit(X_train,y_train)
y_pred = classifier.predict(X_test)
print("Without Scaling and CV: ",accuracy_score(y_test,y_pred))
scores = cross_val_score(classifier, X_train, y_train, cv=10)
print("Without Scaling and With CV: ",scores.mean())

# With Scaling
classifier.fit(X_train_scale,y_train)
y_pred = classifier.predict(X_test_scale)
print("With Scaling and Without CV: ",accuracy_score(y_test,y_pred))
scores = cross_val_score(classifier, X_train_scale, y_train, cv=10)
print("With Scaling and With CV: ",scores.mean())

Output

from sklearn.linear_model import Perceptron

classifier = Perceptron(tol=1e-3, random_state=0)
# Without Scaling
classifier.fit(X_train,y_train)
y_pred = classifier.predict(X_test)
print("Without Scaling and CV: ",accuracy_score(y_test,y_pred))
scores = cross_val_score(classifier, X_train, y_train, cv=10)
print("Without Scaling and With CV: ",scores.mean())

# With Scaling
classifier.fit(X_train_scale,y_train)
y_pred = classifier.predict(X_test_scale)
print("With Scaling and Without CV: ",accuracy_score(y_test,y_pred))
scores = cross_val_score(classifier, X_train_scale, y_train, cv=10)
print("With Scaling and With CV: ",scores.mean())

Output

from sklearn.linear_model import LogisticRegressionCV
classifier = LogisticRegressionCV(cv=5, random_state=0)

# Without Scaling
classifier.fit(X_train,y_train)
y_pred = classifier.predict(X_test)
print("Without Scaling and CV: ",accuracy_score(y_test,y_pred))
scores = cross_val_score(classifier, X_train, y_train, cv=10)
print("Without Scaling and With CV: ",scores.mean())

# With Scaling
classifier.fit(X_train_scale,y_train)
y_pred = classifier.predict(X_test_scale)
print("With Scaling and Without CV: ",accuracy_score(y_test,y_pred))
scores = cross_val_score(classifier, X_train_scale, y_train, cv=10)
print("With Scaling and With CV: ",scores.mean())

Output

Model Selection

Of the models tried, the random forest classifier gave the best accuracy.

classifier = RandomForestClassifier(max_depth=10, random_state=0)
classifier.fit(X_train,y_train)
y_pred = classifier.predict(X_test)
print("Without CV: ",accuracy_score(y_test,y_pred))
scores = cross_val_score(classifier, X_train, y_train, cv=10)
print("With CV: ",scores.mean())
print("Precision Score: ", precision_score(y_test, y_pred))
print("Recall Score: ", recall_score(y_test, y_pred))
print("F1 Score: ", f1_score(y_test, y_pred))

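Output

Model Tuning

We use GridSearchCV for hyperparameter tuning: it tries every combination of the random forest parameters in the grid below and reports the best one.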

param_grid = {
    'bootstrap': [False,True],
    'max_depth': [5,8,10, 20],
    'max_features': [3, 4, 5, None],
    'min_samples_split': [2, 10, 12],
    'n_estimators': [100, 200, 300]
}

rfclassifier = RandomForestClassifier()

classifier = GridSearchCV(estimator = rfclassifier, param_grid = param_grid, cv = 5, n_jobs = -1, verbose = 1)

classifier.fit(X_train,y_train)
y_pred = classifier.predict(X_test)
print("Accuracy: ",accuracy_score(y_test,y_pred))
print(classifier.best_params_)
print(classifier.best_estimator_)

Output
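With its default `refit=True`, GridSearchCV retrains the best configuration on the full training set and exposes it as `best_estimator_`, so the tuned model can be used directly without retyping the parameters. The sketch below shows this on synthetic data with a deliberately small grid; the data and the names `small_grid`, `search`, and `best_model` are assumptions for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data standing in for the placement dataset.
X_demo, y_demo = make_classification(n_samples=200, n_features=7, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=0
)

# A small grid keeps the example fast: 2 x 2 combinations, 3 folds each.
small_grid = {"max_depth": [3, 5], "n_estimators": [20, 50]}
search = GridSearchCV(RandomForestClassifier(random_state=0), small_grid, cv=3)
search.fit(X_tr, y_tr)

# best_estimator_ is already refit on all of X_tr with the best parameters.
best_model = search.best_estimator_
test_accuracy = best_model.score(X_te, y_te)

print("Best parameters:", search.best_params_)
print(f"Test accuracy: {test_accuracy:.2f}")
```

Passing a fixed `random_state` to the forest inside the search makes the selected parameters reproducible from run to run.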

Checking Model Accuracy with the Best Parameters

classifier = RandomForestClassifier(bootstrap=False, max_depth=5,max_features=None,
                             min_samples_split=2,
                             n_estimators=100, random_state=0)
classifier.fit(X_train,y_train)
y_pred = classifier.predict(X_test)
print("Without CV: ",accuracy_score(y_test,y_pred))
scores = cross_val_score(classifier, X_train, y_train, cv=10)
print("With CV: ",scores.mean())
print("Precision Score: ", precision_score(y_test, y_pred))
print("Recall Score: ", recall_score(y_test, y_pred))
print("F1 Score: ", f1_score(y_test, y_pred))

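Output

The model's accuracy is 83% with cross-validation and 80% without it (a single evaluation on the test set).

In other words, the model classifies roughly four out of five students correctly.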

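Conclusion

Machine learning can be used to predict how likely a student is to be hired by a company. Applying machine learning algorithms gives the hiring process a more data-driven and objective basis, and helps companies find candidates that traditional recruiting methods might overlook. As machine learning becomes more widely adopted across industries, placement prediction is likely to become an important tool in the hiring process.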
