Predicting Rideshare Fares Using Python

17 Mar 2025 | 9 min read

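The taxi service market has boomed recently and is expected to expand substantially in the near future. Many companies have emerged to meet the growing demand for rides. However, some companies charge more than others for the same trip, so customers end up overpaying even when the cost could have been lower. The main goal is to predict the fare before a ride is booked, to keep pricing transparent and prevent unfair practices.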

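Project Initiative

Our project lets users estimate the cost of a ride by taking into account various dynamic factors, including weather, vehicle availability, vehicle size, and the driving distance between two locations.
Build an equation from the existing dataset that captures the key trends.
The model is then used to make future predictions or to recommend the best estimate.
The system was implemented with several methods, including machine learning, supervised learning, regression, random forests, and hyperparameter tuning (to improve model accuracy).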

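Chicago was the first city in the United States to publish detailed rideshare statistics for companies such as Lyft, Uber, and Via. The data was first released in April 2019 and covers trips starting from November 2018. The trip, driver, and vehicle datasets offer information about the pricing strategies of rideshare companies as well as insight into rider behavior.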

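Several articles have examined pricing (Reuters, on Uber fare increases) and rider behavior (rideshare data). The Reuters investigation indicated that price increases for shared rides mainly affected low-income neighborhoods in Chicago. Meanwhile, Storybench's analysis found that rides tend to cluster around the morning commute and "night life" hours. Against this background, I set out to build machine learning models that predict rideshare prices.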

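Dataset

The trip data contains details of every ride, such as start time, end time, distance traveled, and pickup and dropoff locations. More detailed data descriptions and data sources are available online.

Chicago applies several modifications to the data, including suppressing census tracts and rounding times to the nearest 15 minutes. Fares are rounded to the nearest $2.50 and tips to the nearest $1.00. The modeling data contains more than 7 million rows, made up of trips taken in December 2019.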

[Table: summary statistics (count, standard deviation, percentiles) for the trip variables Trip Miles, Pickup Census Tract, Dropoff Census Tract, Pickup Community Area, Dropoff Community Area, Hour, Tip, Additional Charges, Trip Total, Trips Pooled, Pickup Centroid Latitude, Pickup Centroid Longitude, Dropoff Centroid Latitude, and Dropoff Centroid Longitude]

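Weather Data

Weather information for Chicago in December 2019 comes from NOAA (National Centers for Environmental Information) and includes precipitation, temperature, hourly visibility, hourly wind direction, and hourly wind speed. For simplicity, all weather information for Chicago is taken from a single weather station located at O'Hare International Airport.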

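Data Wrangling

Because the weather observations arrive at irregular intervals, they have to be resampled into an evenly spaced 15-minute time series before they can be joined to the trip data.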
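The resampling step can be sketched in plain Python as follows. This is a minimal illustration, not the code used for the article: the function name `resample_ffill`, the forward-fill rule (each grid point takes the most recent earlier reading), and the sample readings are all assumptions made for the example.

```python
from bisect import bisect_right
from datetime import datetime, timedelta


def resample_ffill(observations, start, end, step=timedelta(minutes=15)):
    """Return (timestamp, value) pairs on an even grid from start to end.

    Each grid point takes the most recent observation at or before it
    (forward fill). Grid points before the first observation get None.
    """
    observations = sorted(observations)
    times = [t for t, _ in observations]
    grid = []
    current = start
    while current <= end:
        idx = bisect_right(times, current) - 1  # last reading <= current
        value = observations[idx][1] if idx >= 0 else None
        grid.append((current, value))
        current += step
    return grid


# Hypothetical irregular temperature readings, for illustration only.
readings = [
    (datetime(2019, 12, 1, 0, 5), 39.0),
    (datetime(2019, 12, 1, 0, 51), 38.0),
    (datetime(2019, 12, 1, 1, 20), 37.0),
]
grid = resample_ffill(
    readings, datetime(2019, 12, 1, 0, 15), datetime(2019, 12, 1, 1, 30)
)
for ts, temp in grid:
    print(ts.isoformat(sep=" "), temp)
```

Forward fill is only one possible choice here. Interpolating between readings would be a reasonable alternative for smooth variables such as temperature.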

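Trip start and end times are read into RStudio as factors, with morning and afternoon times expressed in 12-hour format. These must be converted to date-times in the local time zone and 24-hour format. For the trips, we also define variables for the ride day, the hour, the day of the week, and the date.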
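The conversion described above can also be sketched in Python with the standard library. This is an illustrative sketch under stated assumptions: the helper names `parse_trip_timestamp` and `ride_category` are hypothetical, the labels "morning commute" and "late morning" are assumed names for the 5-10 and 11-12 hour buckets, and attaching the local time zone is left out to keep the example short.

```python
from datetime import datetime

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]


def parse_trip_timestamp(text):
    """Parse a 12-hour timestamp such as '12/01/2019 11:45:00 PM'."""
    return datetime.strptime(text, "%m/%d/%Y %I:%M:%S %p")


def ride_category(hour):
    """Map an hour of the day (0-23) to a time-of-day category."""
    if 5 <= hour <= 10:
        return "morning commute"
    if 11 <= hour <= 12:
        return "late morning"
    if 13 <= hour <= 17:
        return "afternoon"
    if hour in (18, 19):
        return "evening commute"
    return "night life"  # hours 0-4 and 20-23


start = parse_trip_timestamp("12/01/2019 11:45:00 PM")
day_of_week = DAYS[start.weekday()]
print(start.hour, day_of_week, ride_category(start.hour))
```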

Source Code Snippet

library(dplyr)

# Keep only rides within Chicago. In this data, Pickup.Centroid.Latitude
# is left blank for any location outside the city, so dropping rows with
# missing values removes those rides.
rides.chicago_city <- rides %>%
  tidyr::drop_na()
# rides.chicago_city %>% dplyr::glimpse(78)
# Optionally drop the original data frame to free memory:
# rm(rides)
# Convert the 12-hour timestamp to a 24-hour date-time in the local time zone
rides.chicago_city$ride_start <- as.POSIXct(rides.chicago_city$Tour.Start.Timestamp,
                                       format = '%m/%d/%Y %I:%M:%S %p',
                                       tz = "America/Chicago")
# Create ride_hours, dow, week, date_week, and tour.mins
rides.chicago_city$ride_hours <- lubridate::hour(rides.chicago_city$ride_start)
rides.chicago_city$dow <- base::weekdays(rides.chicago_city$ride_start)
rides.chicago_city$week <- lubridate::week(rides.chicago_city$ride_start)
rides.chicago_city$date_week <- as.Date(cut(rides.chicago_city$ride_start, "week"))
# Trip duration in minutes, derived from the Tour.Seconds column
rides.chicago_city$tour.mins <- rides.chicago_city$Tour.Seconds / 60
# Create a category for the time of day of each ride
rides.chicago_city <- rides.chicago_city %>%
  mutate(ride_category1 = case_when(
             ride_hours >= 5 & ride_hours <= 10 ~ "morning commute",
             ride_hours > 10 & ride_hours <= 12 ~ "late morning",
             ride_hours > 12 & ride_hours <= 17 ~ "afternoon",
             ride_hours %in% c(18, 19) ~ "evening commute",
             ride_hours %in% c(0, 1, 2, 3, 4, 20, 21, 22, 23, 24) ~ "night life"))
# Set the level order for ride_category1
rides.chicago_city$ride_category1 <- factor(rides.chicago_city$ride_category1,
                                      levels = c("morning commute", "late morning", "afternoon", "evening commute", "night life"))
# Set the level order for the day of the week
rides.chicago_city$dow <- factor(rides.chicago_city$dow, levels = c("Monday",
      "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))
# Flag tippers and non-tippers
rides.chicago_city <- rides.chicago_city %>% # count(Tip)
  dplyr::mutate(tipper = case_when(Tip == 0 ~ "no tip", TRUE ~ "tip"),
                tipper = factor(tipper))

Output: after filling in the missing values, the weather data looks like this

date	temperature	precipitation	hourly wind speed
2019-12-01 00:15:00	39.0	4.97	8.0
2019-12-01 00:30:00	39.0	4.97	8.0
2019-12-01 00:45:00	39.0	4.97	8.0
2019-12-01 01:00:00	39.0	7.00	7.0
2019-12-01 01:15:00	39.0	7.00	8.0

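Visualization

To make sure there are no errors, missing data, and so on, we prefer to visualize the complete dataset first. Three packages are very useful here: skimr, visdat, and inspectdf. All three offer a wide range of tools for displaying your data and the distributions of the underlying variables.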

Source Code Snippet

library(skimr)
library(visdat)
library(inspectdf)
# check for NAs
inspectdf::inspect_na(rides, show_plot = TRUE) 

Output

#> # A tibble: 28 x 3
#>    col_name                 cnt  pcnt
#>    <chr>                  <int> <dbl>
#>  1 Tour.ID                    0     0
#>  2 Tour.Start.Timestamp       0     0
#>  3 Tour.End.Timestamp         0     0
#>  4 Tour.Seconds               0     0
#>  5 Tour.Miles                 0     0
#>  6 Pickup.Census.Tract        0     0
#>  7 Dropoff.Census.Tract       0     0
#>  8 Pickup.Community.Area      0     0
#>  9 Dropoff.Community.Area     0     0
#> 10 Fare                       0     0
#> # ... with 18 more rows

Source Code Snippet

# summarize data types
inspectdf::inspect_types(rides, show_plot = TRUE)

Output

#> # A tibble: 5 x 4
#>   type             cnt  pcnt col_name
#>   <chr>          <int> <dbl> <named list>
#> 1 numeric           17 60.7  <chr [17]>
#> 2 character          7 25    <chr [7]>
#> 3 Date               2  7.14 <chr [2]>
#> 4 logical            1  3.57 <chr [1]>
#> 5 POSIXct POSIXt     1  3.57 <chr [1]>

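Visualizing Rides by Hour of the Day

We want to look at rides at two levels within the week: the day of the week and the time of day. The chart below shows the number of rides per hour for each day of the week.

Specifically, the rides.chicago_city data frame is piped (%>%) into ggplot2 functions to build a bar chart of rides per hour, which is then faceted by day of the week to show the hourly breakdown for each day.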
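The numbers behind such a chart are simply ride counts per (day of week, hour) pair. A minimal Python sketch of that aggregation is shown here, assuming ride start times are already parsed into `datetime` objects. The function name `rides_per_day_and_hour` and the sample timestamps are hypothetical.

```python
from collections import Counter
from datetime import datetime

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]

# Hypothetical ride start times, for illustration only.
ride_starts = [
    datetime(2019, 12, 6, 22, 15),  # Friday
    datetime(2019, 12, 6, 22, 45),  # Friday
    datetime(2019, 12, 7, 8, 0),    # Saturday
    datetime(2019, 12, 8, 8, 30),   # Sunday
]


def rides_per_day_and_hour(starts):
    """Count rides for every (day of week, hour) pair."""
    return Counter((DAYS[ts.weekday()], ts.hour) for ts in starts)


counts = rides_per_day_and_hour(ride_starts)
for (day, hour), n in sorted(counts.items()):
    print(day, hour, n)
```

Each facet of the chart corresponds to one day, and each bar to one hour within that day.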

Source Code Snippet

library(ggplot2)
library(ggthemes)
# Rides by hour of the day, one panel per day of the week
ggRideCountPerHour <- rides.chicago_city %>%
  ggplot(aes(x = ride_hours)) +
  geom_bar() +
  facet_grid( ~ dow) +
  ggthemes::theme_fivethirtyeight() +
  theme(axis.title = element_text()) +
  labs(title = "Rideshare Rides by Hour of the Day",
       x = 'Hour of the Day',
       y = 'Ride Count') +
  theme(axis.text.x  = element_text(size = 8, angle = 90))
ggRideCountPerHour

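Output

The next chart shows tipping across different trip durations. We can use dplyr::sample_frac() to sample the data and get a more manageable dataset. We group the data by the two variables of interest (tipper and ride_category1) and then compute the mean trip duration (mean_tour_mins1), which gives a visualization that is easier to interpret across these groups.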
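The sample-then-aggregate step can be sketched in plain Python as follows. This is an illustration only: the function name `mean_minutes_by_group`, the dictionary keys, and the sample rides are hypothetical, and a fixed seed is used so the sample is reproducible.

```python
import random
from collections import defaultdict


def mean_minutes_by_group(rides, frac=0.05, seed=1234):
    """Sample a fraction of rides, then average trip minutes per group.

    Each ride is a dict with 'seconds', 'tip', and 'category' keys
    (hypothetical field names chosen for this sketch).
    Returns {(tipper, category): (mean minutes, ride count)}.
    """
    rng = random.Random(seed)
    k = min(len(rides), max(1, round(len(rides) * frac)))
    sample = rng.sample(rides, k)
    totals = defaultdict(lambda: [0.0, 0])  # group -> [sum of minutes, count]
    for ride in sample:
        tipper = "tip" if ride["tip"] > 0 else "no tip"
        group = (tipper, ride["category"])
        totals[group][0] += ride["seconds"] / 60
        totals[group][1] += 1
    return {g: (total / n, n) for g, (total, n) in totals.items()}


rides = [
    {"seconds": 600, "tip": 0, "category": "afternoon"},
    {"seconds": 1200, "tip": 2, "category": "afternoon"},
    {"seconds": 900, "tip": 0, "category": "afternoon"},
    {"seconds": 1800, "tip": 3, "category": "night life"},
]
# frac=1.0 keeps every ride so this tiny example is deterministic.
summary = mean_minutes_by_group(rides, frac=1.0)
for group, (mean_minutes, n) in sorted(summary.items()):
    print(group, mean_minutes, n)
```

Keeping the ride count next to each mean matters because small groups in a 5% sample can produce noisy averages.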

Source Code Snippet

library(dplyr)
library(ggplot2)

rides.chicago_city %>%
  # create trip duration in minutes
  mutate(tour_mins1 = (Tour.Seconds / 60)) %>%
  dplyr::sample_frac(size = .05) %>%  # take a 5% sample
  # group by the two variables of interest
  group_by(tipper, ride_category1) %>%
  summarize(mean_tour_mins1 = mean(tour_mins1),
            rides = n()) %>%
  ungroup() %>%    # ungroup
  ggplot(aes(x = mean_tour_mins1,
             y = ride_category1,
             label = rides)) +
        # connect the tip and no-tip points within each time-of-day category
        geom_line(aes(group = ride_category1),
                  color = "gray50") +
        geom_point(aes(color = tipper),
                   size = 1.5) +
        geom_text(aes(label = rides), nudge_y = 0.2, size = 3) +
    ggthemes::theme_fivethirtyeight() +
    theme(axis.title = element_text(size = 10)) +
    theme(axis.text.x  = element_text(size = 8, angle = 45)) +
    ggplot2::labs(x = "Average trip duration (minutes)",
                y = "Time of day",
               title = "The ride time gap",
               subtitle = "Difference in average trip times by tippers")

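Output

Encouraging riders to tip is another source of income for drivers. At present, not tipping is more common than tipping, so it would be interesting to understand which factors influence tipping behavior.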

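Machine Learning Models

We evaluated three well-known tree-based models: random forest, gradient boosting, and XGBoost. Below are code snippets for the setup of each model, along with a brief overview of each.

1. Random Forest

A random forest is an ensemble of decision trees. Each tree is trained on a random sample of the dataset. The forest then makes its prediction by averaging the predictions of the individual trees.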

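Source Code Snippet

# Random forest: initial setup
from sklearn.ensemble import RandomForestRegressor

reg_rf = RandomForestRegressor(
    n_estimators=100,     # number of trees in the forest
    random_state=1234,    # fixed seed for reproducibility
    max_depth=10,         # maximum depth of each tree
    min_samples_leaf=1,   # minimum number of samples at a leaf
    verbose=2)            # print progress while fitting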
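To make the idea of "random samples plus averaging" concrete, here is a toy sketch in plain Python. It is not how scikit-learn implements random forests: it assumes one-dimensional input, uses one-split trees (stumps) instead of full decision trees, and the names `fit_stump` and `fit_forest` as well as the data are hypothetical.

```python
import random


def fit_stump(xs, ys):
    """Fit a one-split regression tree (stump) on 1-D data by least squares."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:  # all x values identical: predict the mean everywhere
        mean = sum(ys) / len(ys)
        return lambda x: mean
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def fit_forest(xs, ys, n_trees=25, seed=1234):
    """Train each stump on a bootstrap sample and predict by averaging."""
    rng = random.Random(seed)
    n = len(xs)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        trees.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: sum(tree(x) for tree in trees) / len(trees)


# Hypothetical data: fare grows with trip miles.
miles = [1, 2, 3, 4, 5, 6, 7, 8]
fares = [5.0, 7.5, 7.5, 10.0, 12.5, 15.0, 15.0, 17.5]
forest = fit_forest(miles, fares)
print(forest(2), forest(7))
```

Because every stump sees a different bootstrap sample, the split points differ from tree to tree, and the averaged prediction is smoother than that of any single stump.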

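2. Gradient Boosting Machine

GBM is another ensemble technique based on decision trees. It tries to improve the performance of the ensemble by adding trees sequentially, with each new tree fitted to the errors of the ensemble built so far.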

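Source Code Snippet

# GBM: initial setup
from sklearn.ensemble import GradientBoostingRegressor

reg_gbm = GradientBoostingRegressor(
    random_state=1234,
    verbose=0,
    n_estimators=100,        # number of boosting stages
    learning_rate=0.1,       # shrinkage applied to each tree
    loss='squared_error',    # least squares; named 'ls' before scikit-learn 1.0
    max_depth=3)             # depth of each individual tree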
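The sequential idea can be illustrated with a toy sketch in plain Python. It assumes squared-error loss, one-dimensional input, and one-split trees (stumps). It is not scikit-learn's implementation, and the names `fit_stump`, `fit_gbm`, and `mse` as well as the data are hypothetical.

```python
def fit_stump(xs, ys):
    """Fit a one-split regression tree (stump) on 1-D data by least squares."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:  # all x values identical: predict the mean everywhere
        mean = sum(ys) / len(ys)
        return lambda x: mean
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def fit_gbm(xs, ys, n_estimators=100, learning_rate=0.1):
    """Boost stumps for squared error: each stump is fit to the residuals."""
    base = sum(ys) / len(ys)  # start from the mean prediction
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_estimators):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)


def mse(model, xs, ys):
    """Mean squared error of a model on (xs, ys)."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(ys)


# Hypothetical data: fare grows with trip miles.
miles = [1, 2, 3, 4, 5, 6, 7, 8]
fares = [5.0, 7.5, 7.5, 10.0, 12.5, 15.0, 15.0, 17.5]
gbm = fit_gbm(miles, fares)
baseline_value = sum(fares) / len(fares)
print(mse(lambda x: baseline_value, miles, fares), mse(gbm, miles, fares))
```

A smaller learning rate makes each tree contribute less, which usually calls for more trees but tends to generalize better.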

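3. XGBoost

XGBoost is another ensemble method, built on a gradient boosting framework over decision trees. Because XGBoost exposes many parameters, tuning the hyperparameters to find the best configuration is essential when using it.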

Source Code Snippet

# XGBoost: initial setup
import xgboost as xgb

reg_xgb = xgb.XGBRegressor(
    max_depth=3,
    learning_rate=0.2,
    gamma=0.0,
    min_child_weight=0.0,
    max_delta_step=0.0,
    subsample=1.0,
    colsample_bytree=1.0,
    colsample_bylevel=1.0,
    reg_alpha=0.0,
    reg_lambda=1.0,
    n_estimators=300,
    verbosity=1,            # replaces the older `silent` flag
    n_jobs=4,               # number of parallel threads
    scale_pos_weight=1.0,
    base_score=0.5,
    random_state=1234,      # replaces the older `seed` argument
    missing=float("nan"))   # value treated as missing
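Hyperparameter tuning is often done by trying a grid of candidate values and keeping the combination with the lowest validation error. The following is a generic sketch of that loop in plain Python. The `grid_search` helper, the candidate values, and the `toy_validation_error` function are hypothetical stand-ins. In practice the scoring function would train the model and return a cross-validated error.

```python
from itertools import product


def grid_search(param_grid, evaluate):
    """Try every parameter combination and return the one with the lowest score."""
    names = list(param_grid)
    best_params, best_score = None, float("inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical candidate values for three boosting parameters.
param_grid = {
    "max_depth": [3, 5, 7],
    "learning_rate": [0.05, 0.1, 0.2],
    "n_estimators": [100, 300],
}


def toy_validation_error(params):
    """Stand-in for a real validation error such as cross-validated RMSE."""
    return (abs(params["max_depth"] - 5)
            + abs(params["learning_rate"] - 0.1)
            + abs(params["n_estimators"] - 300) / 100)


best_params, best_score = grid_search(param_grid, toy_validation_error)
print(best_params, best_score)
```

The number of combinations grows multiplicatively with each added parameter (18 here), which is why random or staged searches are common for larger grids.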

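Results

These tree-based models have strong predictive power, as shown by R-squared values above 90% on the test dataset. Unsurprisingly, trip miles and trip seconds are the two most important features. The weather-related features contributed less than expected. Using the raw temperature and precipitation values in this setting, for example without accounting for how precipitation changes over time, may have reduced the predictive power of these variables.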

Model	R²
Random Forest	93.7%
GBM	93.6%
XGBoost	91%
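For reference, R² (the coefficient of determination) compares the model's squared errors with those of always predicting the mean. A minimal Python sketch of the calculation follows. The function name `r_squared` and the sample values are hypothetical and unrelated to the results in the table.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical actual and predicted fares, for illustration only.
actual = [5.0, 7.5, 10.0, 12.5, 15.0]
predicted = [5.5, 7.0, 10.0, 13.0, 14.5]
score = r_squared(actual, predicted)
print(round(score, 3))
```

A value of 1.0 means perfect predictions, while 0.0 means the model does no better than predicting the mean fare for every trip.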

Trip miles is the most important feature when a tree from the random forest model is visualized.

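Next Steps

What did we observe?

Rideshare trips typically occur during "night life" hours and the early-morning commute. Unsurprisingly, "night life" ride volume is especially high on Fridays and Saturdays and drops noticeably on Sunday evenings.

In addition, behavioral differences affect how riders engage with the product and with drivers. Tipping is one such behavior. Overall, tipping is uncommon, but time of day influences a rider's willingness to tip more than trip duration does. Longer trips tend to occur early in the week, which raises the possibility that these are riders' first trips of the week.

Through these visualizations, we identified some trends and associations among time, frequency, and behavior in the Chicago rideshare data. A next step could be a static report, a slide presentation, or a PDF. Ideally, we could develop an intervention, plan an experiment, and create a dashboard showing ongoing research results and real-time data.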

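Conclusion

We tested and evaluated tree-based machine learning models to determine how well they predict rideshare prices. Although these models showed excellent predictive power, further gains could come from transforming the weather-related variables and using more precise location data.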
