Forest Cover Type Prediction in Machine Learning

March 17, 2025 | 20 min read

Forest Cover Type Prediction Using Machine Learning

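In the vast and varied world of forests, every vegetation type has its own ecological importance. Being able to predict these types is valuable for conservation, for natural resource management, and for our understanding of the natural world. This is where machine learning comes in.

The task at hand is to predict the vegetation type of a given area from its environmental characteristics. Machine learning algorithms uncover the patterns hidden in the large volume of collected data. Cover types range from tall Spruce/Fir stands to hardy Krummholz, and each plays a role in the biodiversity of the ecosystem. Accurate predictions in turn support our efforts to protect and manage these natural resources.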

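Data Summary

The study area comprises four wilderness areas in the Roosevelt National Forest of northern Colorado. Each observation represents a 30m x 30m patch. The task is to predict an integer label for the forest cover type, which falls into one of seven classes:

1 - Spruce/Fir
2 - Lodgepole Pine
3 - Ponderosa Pine
4 - Cottonwood/Willow
5 - Aspen
6 - Douglas-fir
7 - Krummholz

The training set contains 15,120 observations with both the features and Cover_Type. The test set contains only the features, and participants must predict Cover_Type for its 565,892 observations.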

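Key Data Fields

Elevation: elevation in meters
Aspect: aspect in degrees azimuth
Slope: slope in degrees
Horizontal_Distance_To_Hydrology: horizontal distance to the nearest surface water feature
Vertical_Distance_To_Hydrology: vertical distance to the nearest surface water feature
Horizontal_Distance_To_Roadways: horizontal distance to the nearest roadway
Hillshade_9am, Hillshade_Noon, Hillshade_3pm: hillshade index at 9 am, noon, and 3 pm on the summer solstice
Horizontal_Distance_To_Fire_Points: horizontal distance to the nearest wildfire ignition point
Wilderness_Area: binary columns indicating presence (1) or absence (0) of a wilderness area
Soil_Type: binary columns indicating presence (1) or absence (0) of a soil type
Cover_Type: integer designation (1-7) of the forest cover type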

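The wilderness areas are:

1 - Rawah Wilderness Area
2 - Neota Wilderness Area
3 - Comanche Peak Wilderness Area
4 - Cache la Poudre Wilderness Area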

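The soil types are:

1 Cathedral family - Rock outcrop complex, extremely stony.
2 Vanet - Ratake families complex, very stony.
3 Haploborolis - Rock outcrop complex, rubbly.
4 Ratake family - Rock outcrop complex, rubbly.
5 Vanet family - Rock outcrop complex, rubbly.
6 Vanet - Wetmore families - Rock outcrop complex, stony.
7 Gothic family.
8 Supervisor - Limber families complex.
9 Troutville family, very stony.
10 Bullwark - Catamount families - Rock outcrop complex, rubbly.
11 Bullwark - Catamount families - Rock land complex, rubbly.
12 Legault family - Rock land complex, stony.
13 Catamount family - Rock land - Bullwark family complex, rubbly.
14 Pachic Argiborolis - Aquolis complex.
15 Unspecified in the USFS Soil and ELU Survey.
16 Cryaquolis - Cryoborolis complex.
17 Gateview family - Cryaquolis complex.
18 Rogert family, very stony.
19 Typic Cryaquolis - Borohemists complex.
20 Typic Cryaquepts - Typic Cryaquolls complex.
21 Typic Cryaquolls - Leighcan family, till substratum complex.
22 Leighcan family, till substratum, extremely bouldery.
23 Leighcan family, till substratum - Typic Cryaquolls complex.
24 Leighcan family, extremely stony.
25 Leighcan family, warm, extremely stony.
26 Granile - Catamount families complex, very stony.
27 Leighcan family, warm - Rock outcrop complex, extremely stony.
28 Leighcan family - Rock outcrop complex, extremely stony.
29 Como - Legault families complex, extremely stony.
30 Como family - Rock land - Legault family complex, extremely stony.
31 Leighcan - Catamount families complex, extremely stony.
32 Catamount family - Rock outcrop - Leighcan family complex, extremely stony.
33 Leighcan - Catamount families - Rock outcrop complex, extremely stony.
34 Cryorthents - Rock land complex, extremely stony.
35 Cryumbrepts - Rock outcrop - Cryaquepts complex.
36 Bross family - Rock land - Cryumbrepts complex, extremely stony.
37 Rock outcrop - Cryumbrepts - Cryorthents complex, extremely stony.
38 Leighcan - Moran families - Cryaquolls complex, extremely stony.
39 Moran family - Cryorthents - Leighcan family complex, extremely stony.
40 Moran family - Cryorthents - Rock land complex, extremely stony.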

We will now build a model that predicts the forest cover type.

Python Code for Forest Cover Type Prediction Using Machine Learning

Importing Libraries

import warnings
warnings.filterwarnings('ignore')
import pandas
import numpy

Reading the Dataset

df_dataset = pandas.read_csv("../input/train.csv") 

# Drop the initial 'Id' column as it solely contains serial numbers, which have no relevance in the prediction procedure.
df_dataset = df_dataset.iloc[:,1:]

Dataset Statistics

This is a summary of the key numerical characteristics and properties of the dataset.

# Size of the dataframe
print(df_dataset.shape)

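Output

There are 15,120 instances with 55 attributes each (54 features plus Cover_Type). The dimensions match the data description, so the data was loaded successfully.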

# Datatypes of the attributes
print(df_dataset.dtypes)

Output

The data type of every attribute was inferred as int64.

# Statistical description
pandas.set_option('display.max_columns', None)
print(df_dataset.describe())

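Output

The following observations stand out:

Every attribute has a count of 15,120, so no attribute has missing values and all rows can be used.
Vertical_Distance_To_Hydrology contains negative values, which rules out tests that require non-negative input, such as the chi-squared test.
Wilderness_Area and Soil_Type are one-hot encoded, so they can be converted back to a single categorical column for some analyses. Soil_Type7 and Soil_Type15 can be dropped because they are constant.
The attributes are not on the same scale, so some algorithms may need rescaling or standardization.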
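
The checks behind these observations can be sketched on a small, made-up frame. The column names follow the dataset, but the values and the names toy, missing_cols, negative_cols, and constant_cols are illustrative assumptions and not part of the original notebook.

```python
import pandas as pd

# Toy frame standing in for the real training data (values are made up).
toy = pd.DataFrame({
    "Elevation": [2596, 2590, 2804, 2785],
    "Vertical_Distance_To_Hydrology": [0, -6, 65, 118],
    "Soil_Type7": [0, 0, 0, 0],
    "Soil_Type15": [0, 0, 0, 0],
})

# Columns that contain at least one missing value.
missing_cols = toy.columns[toy.isnull().any()].tolist()

# Columns that contain at least one negative value.
negative_cols = toy.columns[(toy < 0).any()].tolist()

# Columns with a single distinct value carry no information for a model.
constant_cols = toy.columns[toy.nunique() == 1].tolist()

print(missing_cols, negative_cols, constant_cols)
```

Run against df_dataset instead of toy, the same three expressions should reproduce the observations listed above.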

# Skewness of the distribution
print(df_dataset.skew())

Output

Values close to zero indicate little skew. Several Soil_Type attributes are strongly skewed, and correcting that skew may benefit some algorithms.
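
As a rough illustration of what correcting skew can look like for a continuous, non-negative feature, the sketch below applies a log transform to made-up values. The series s and the choice of numpy.log1p are assumptions for demonstration; the original notebook does not apply this step, and a log transform does not help binary columns such as Soil_Type.

```python
import numpy as np
import pandas as pd

# Made-up, right-skewed values: mostly small with one long-tail entry.
s = pd.Series([0, 0, 0, 1, 1, 2, 3, 50], dtype=float)

# Sample skewness before any transformation.
skew_before = s.skew()

# log1p compresses large values; it requires non-negative input.
skew_after = np.log1p(s).skew()

print(round(skew_before, 2), round(skew_after, 2))
```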

# Number of instances belonging to each class
df_dataset.groupby('Cover_Type').size()

Output

Every class is equally represented, so no class rebalancing is needed.
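
A quick way to quantify class balance is the ratio between the largest and smallest class counts. The sketch below uses a made-up label series; labels, counts, and imbalance_ratio are hypothetical names.

```python
import pandas as pd

# Made-up labels with three perfectly balanced classes.
labels = pd.Series([1, 2, 3, 1, 2, 3, 1, 2, 3], name="Cover_Type")

# Number of rows per class, ordered by class label.
counts = labels.value_counts().sort_index()

# 1.0 means perfectly balanced; larger values mean more imbalance.
imbalance_ratio = counts.max() / counts.min()

print(counts.to_dict(), imbalance_ratio)
```

With 15,120 rows spread evenly over seven classes, each class would hold 15,120 / 7 = 2,160 rows and the ratio would be 1.0.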

Interacting with the Dataset

Here we explore the dataset through correlations and scatter plots.

Correlation

import numpy

# Correlation indicates the relationship between two attributes.
# For Correlation it is necessary to have continuous data. Hence, ignore Wilderness_Area and Soil_Type as they are binary

# sets the number of features considered
size = 10 

# create a dataframe with only 'size' features
data=df_dataset.iloc[:,:size] 

# get the names of all the columns
cols=data.columns 

# "Computes Pearson coefficients for all possible combinations."
corr_data = data.corr()

# Setting the threshold to choose only attributes with strong correlations
threshold = 0.5

# List of pairs along with correlation above threshold
list_corr = []

#Search for the highly correlated pairs
for i in range(0,size): #for 'size' features
    for j in range(i+1,size): #avoid repetition
        if (corr_data.iloc[i,j] >= threshold and corr_data.iloc[i,j] < 1) or (corr_data.iloc[i,j] < 0 and corr_data.iloc[i,j] <= -threshold):
            list_corr.append([corr_data.iloc[i,j],i,j]) #store correlation and columns index

#Sort to show higher ones first            
s_list_corr = sorted(list_corr,key=lambda x: -abs(x[0]))

#Print correlations and column names
for v,i,j in s_list_corr:
    print ("%s and %s = %.2f" % (cols[i],cols[j],v))

# Significant correlation is noted among the following pairs, suggesting a potential to reduce the feature set through techniques like PCA

Output

The correlations show how the different environmental variables relate to one another.
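
The nested loop used above to collect strongly correlated pairs can also be written without explicit indexing. The following is a minimal sketch on synthetic data, assuming pandas 1.1 or later for the key argument of sort_values; the frame toy and its columns a, b, and c are made up.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a * 2 + rng.normal(scale=0.1, size=200),  # strongly tied to a
    "c": rng.normal(size=200),                      # independent noise
})

corr = toy.corr()

# Keep only the strict upper triangle so each pair appears once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()

# Keep pairs whose absolute correlation reaches the threshold, strongest first.
strong = pairs[pairs.abs() >= 0.5].sort_values(key=np.abs, ascending=False)

print(strong)
```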

Scatter Plot

#import plotting libraries
import seaborn as sns
import matplotlib.pyplot as plt

# Scatter plot of only the highly correlated pairs
for v,i,j in s_list_corr:
    #'size' was renamed to 'height' in newer seaborn releases; x_vars/y_vars are passed as lists
    sns.pairplot(df_dataset, hue="Cover_Type", height=6, x_vars=[cols[i]], y_vars=[cols[j]])
    plt.show()

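Output

Key points from the plots above:

The plots show how the data points fall into their classes. The class distributions overlap in places.
The hillshade attributes form ellipsoid patterns when plotted against each other.
Aspect and the hillshade attributes together form an S-shaped pattern.
Horizontal and vertical distance to hydrology are almost linearly related.

Data Visualization

Next we visualize the data with violin plots and then group the one-hot encoded attributes.

Box and Density Plots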

# We will visualize all the attributes using Violin Plot - a combination of box and density plots

#names of all the attributes 
cols = df_dataset.columns

#number of attributes (exclude target)
size = len(cols)-1

# The x-axis has a target attribute to distinguish between classes
x = cols[size]

# The y-axis shows the values of an attribute
y = cols[0:size]

#Plot violin for all attributes
for i in range(0,size):
    sns.violinplot(data=df_dataset,x=x,y=y[i])  
    plt.show()

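Output

Observations from the violin plots:

Elevation has a distinct distribution for most classes. It is strongly related to the target, which makes it an important attribute.
Aspect shows a mixture of several normal distributions in several classes.
Horizontal distance to roadways and to hydrology follow similar distributions.
Hillshade at 9 am and at noon is left-skewed.
Hillshade at 3 pm is approximately normal.
Vertical distance to hydrology has many zero values.
Wilderness_Area3 gives no clear class distinction because it lacks values. The other wilderness areas offer some potential to separate the classes.
Some Soil_Type values, notably 1, 5, 8, 9, 12, 14, 18-22, 25-30, and 35-40, help separate classes because they are absent from many classes.

Grouping One-Hot Encoded Attributes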

# Group one-hot encoded variables of a category into one single variable

#names of all the columns
cols = df_dataset.columns

#number of rows=r , number of columns=c
r,c = df_dataset.shape

#Create a new dataframe with r rows, one column for each encoded category, and target in the end
data = pandas.DataFrame(index=numpy.arange(0, r),columns=['Wilderness_Area','Soil_Type','Cover_Type'])

#Make an entry in 'data' for each r as category_id, target value
for i in range(0,r):
    w=0;
    s=0;
    # Category1 range
    for j in range(10,14):
        if (df_dataset.iloc[i,j] == 1):
            w=j-9  #category class
            break
    # Category2 range        
    for k in range(14,54):
        if (df_dataset.iloc[i,k] == 1):
            s=k-13 #category class
            break
    #Make an entry in 'data' for each r as category_id, target value        
    data.iloc[i]=[w,s,df_dataset.iloc[i,c-1]]

#Plot for Category1    
sns.countplot(x="Wilderness_Area", hue="Cover_Type", data=data)
plt.show()
#Plot for Category2
plt.rc("figure", figsize=(25, 10))
sns.countplot(x="Soil_Type", hue="Cover_Type", data=data)
plt.show()

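Output

Conclusions from the plots:

Wilderness_Area4 is heavily represented in Cover_Type 4, which indicates strong class distinction.
Wilderness_Area3 offers no significant class distinction.
Soil types 1-6, 10-14, 17, 22-23, 29-33, 35, and 38-40 contribute noticeably to class distinction because their counts are markedly higher for certain classes.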
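
The row-by-row loop that collapses the one-hot columns can be slow on 15,120 rows. A vectorized alternative is sketched below on made-up data, assuming every row has exactly one 1 among the Wilderness_Area columns (unlike the loop, idxmax cannot return 0 for a row with no 1). The names toy, wild_cols, area, and grouped are hypothetical.

```python
import pandas as pd

# Made-up one-hot rows; each row has exactly one wilderness area set.
toy = pd.DataFrame({
    "Wilderness_Area1": [1, 0, 0],
    "Wilderness_Area2": [0, 0, 1],
    "Wilderness_Area3": [0, 1, 0],
    "Cover_Type": [5, 2, 1],
})

wild_cols = [col for col in toy.columns if col.startswith("Wilderness_Area")]

# idxmax returns the name of the column holding the 1 in each row;
# stripping the prefix leaves the category number.
area = (
    toy[wild_cols]
    .idxmax(axis=1)
    .str.replace("Wilderness_Area", "", regex=False)
    .astype(int)
)

grouped = pd.DataFrame({"Wilderness_Area": area, "Cover_Type": toy["Cover_Type"]})
print(grouped)
```

The same pattern applies to the Soil_Type columns by changing the prefix.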

Cleaning the Dataset

Now we remove the unnecessary columns.

#Removal list initialize
rem = []

#Add constant columns as they don't help in the prediction process
for c in df_dataset.columns:
    if df_dataset[c].std() == 0: #standard deviation is zero
        rem.append(c)

#drop the columns        
df_dataset.drop(rem,axis=1,inplace=True)

print(rem)

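Output

The columns listed above were removed.

Preparing the Dataset

Here we prepare the following versions of the data:

Original
Deleting rows or imputing values where data is missing
StandardScaler
MinMaxScaler
Normalizer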

#get the number of rows and columns
r, c = df_dataset.shape

#get the list of columns
cols = df_dataset.columns
#create an array that has indexes of columns
i_cols = []
for i in range(0,c-1):
    i_cols.append(i)
#array of importance rank of all features  
ranks = []

#Extract only the values
array = df_dataset.values

#Y is the target column, X has the rest
X = array[:,0:(c-1)]
Y = array[:,(c-1)]

#Validation chunk size
val_size = 0.1

#Use a common seed in all experiments so that the same chunk is used for validation
seed = 0

#Split the data into chunks
#sklearn.cross_validation was removed; train_test_split now lives in sklearn.model_selection
from sklearn.model_selection import train_test_split
X_train, val_X, Y_train, val_Y = train_test_split(X, Y, test_size=val_size, random_state=seed)

#Import libraries for data transformations
#SimpleImputer is the successor of the removed sklearn.preprocessing.Imputer (not used below)
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import Normalizer

#All features
X_all = []
#Additionally we will make a list of subsets
all_X_add =[]

#columns to be dropped
rem = []
#indexes of columns to be dropped
i_rem = []

#List of combinations
comb = []
comb.append("All+1.0")

#Add this version of X to the list 
X_all.append(['Orig','All', X_train,val_X,1.0,cols[:c-1],rem,ranks,i_cols,i_rem])

#point where categorical data begins
size=10

#Standardized
#Apply transform only for non-categorical data
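#Fit the scaler on the training rows only, then reuse it for the validation rows
std_scaler = StandardScaler().fit(X_train[:,0:size])
X_temp = std_scaler.transform(X_train[:,0:size])
val_X_temp = std_scaler.transform(val_X[:,0:size])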
#Concatenate non-categorical data and categorical
X_con = numpy.concatenate((X_temp,X_train[:,size:]),axis=1)
val_X_con = numpy.concatenate((val_X_temp,val_X[:,size:]),axis=1)
#Add this version of X to the list 
X_all.append(['StdSca','All', X_con,val_X_con,1.0,cols,rem,ranks,i_cols,i_rem])

#MinMax
#Apply transform only for non-categorical data
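#Fit the scaler on the training rows only, then reuse it for the validation rows
mm_scaler = MinMaxScaler().fit(X_train[:,0:size])
X_temp = mm_scaler.transform(X_train[:,0:size])
val_X_temp = mm_scaler.transform(val_X[:,0:size])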
#Concatenate non-categorical data and categorical
X_con = numpy.concatenate((X_temp,X_train[:,size:]),axis=1)
val_X_con = numpy.concatenate((val_X_temp,val_X[:,size:]),axis=1)
#Add this version of X to the list 
X_all.append(['MinMax', 'All', X_con,val_X_con,1.0,cols,rem,ranks,i_cols,i_rem])

#Normalize
#Apply transform only for non-categorical data
X_temp = Normalizer().fit_transform(X_train[:,0:size])
val_X_temp = Normalizer().fit_transform(val_X[:,0:size])
#Concatenate non-categorical data and categorical
X_con = numpy.concatenate((X_temp,X_train[:,size:]),axis=1)
val_X_con = numpy.concatenate((val_X_temp,val_X[:,size:]),axis=1)
#Add this version of X to the list 
X_all.append(['Norm', 'All', X_con,val_X_con,1.0,cols,rem,ranks,i_cols,i_rem])

#Impute
#Imputer is not used as no data is missing

#List of transformations
trans_list = []

for trans,name,X,val_X,v,cols_list,rem_list,rank_list,i_cols_list,i_rem_list in X_all:
    trans_list.append(trans)
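
The manual split-scale-concatenate pattern above can also be expressed with scikit-learn's ColumnTransformer, which scales the chosen columns and passes the rest through unchanged. This is a sketch on synthetic data rather than the notebook's own approach; X_tr, X_va, and ct are hypothetical names, and the scaler statistics are learned from the training rows only.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Made-up data: two continuous columns followed by one binary column.
X_tr = np.column_stack([rng.normal(100, 20, 50), rng.normal(5, 1, 50), rng.integers(0, 2, 50)])
X_va = np.column_stack([rng.normal(100, 20, 10), rng.normal(5, 1, 10), rng.integers(0, 2, 10)])

# Scale columns 0 and 1; leave the binary column untouched.
ct = ColumnTransformer([("num", StandardScaler(), [0, 1])], remainder="passthrough")

X_tr_s = ct.fit_transform(X_tr)  # learn mean and std on the training rows only
X_va_s = ct.transform(X_va)      # reuse the same statistics for validation

print(X_tr_s[:, :2].mean(axis=0).round(6), X_va_s.shape)
```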

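Feature Selection

Feature selection is a key step in preprocessing data for machine learning. It selects the subset of features (variables or columns) that is most relevant and informative, and discards irrelevant or redundant ones.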
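
In its simplest form, importance-based selection fits a tree ensemble, orders the columns by feature_importances_, and keeps the top fraction. The sketch below shows that idea on synthetic data from make_classification; X_demo, y_demo, order, ratio, and keep are illustrative names, and the settings are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic data: 8 features, of which 3 are informative.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)

model = ExtraTreesClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)

# Column indices ordered from most to least important.
order = np.argsort(model.feature_importances_)[::-1]

# Keep the top 50% of the columns.
ratio = 0.5
keep = order[: int(ratio * X_demo.shape[1])]
X_reduced = X_demo[:, keep]

print(keep, X_reduced.shape)
```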

#Select top 75%,50%,25%
list_ratio = [0.75,0.50,0.25]

#List of feature selection models
feat = []

#List of names of feature selection models
feat_list =[]

#Import the libraries
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

#Add ExtraTreeClassifiers to the list
n = 'ExTree'
feat_list.append(n)
for val in list_ratio:
    comb.append("%s+%s" % (n,val))
    feat.append([n,val,ExtraTreesClassifier(n_estimators=c-1,max_features=val,n_jobs=-1,random_state=seed)])      

#Add GradientBoostingClassifiers to the list 
n = 'GraBst'
feat_list.append(n)
for val in list_ratio:
    comb.append("%s+%s" % (n,val))
    feat.append([n,val,GradientBoostingClassifier(n_estimators=c-1,max_features=val,random_state=seed)])   

#Add RandomForestClassifiers to the list 
n = 'RndFst'
feat_list.append(n)
for val in list_ratio:
    comb.append("%s+%s" % (n,val))
    feat.append([n,val,RandomForestClassifier(n_estimators=c-1,max_features=val,n_jobs=-1,random_state=seed)])   

#Add XGBClassifier to the list 
n = 'XGB'
feat_list.append(n)
for val in list_ratio:
    comb.append("%s+%s" % (n,val))
    feat.append([n,val,XGBClassifier(n_estimators=c-1,random_state=seed)])   
        
#For all transformations of X
for trans,s, X, val_X, d, cols, rem, ra, i_cols, i_rem in X_all:
    #For all feature selection models
    for name,v, model in feat:
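        #Train the model against Y. Labels are shifted from 1..7 to 0..6 because recent
        #XGBoost releases expect zero-based classes; feature importances are unaffected
        model.fit(X,Y_train-1)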
        #Combine the importance and index of the column in the array joined
        joined = []
        for i, pred in enumerate(list(model.feature_importances_)):
            joined.append([i,cols[i],pred])
        #Sort in descending order    
        joined_sorted = sorted(joined, key=lambda x: -x[2])
        #Starting point of the columns to be dropped
        rem_start = int((v*(c-1)))
        #List of names of columns selected
        cols_list = []
        #Indexes of columns selected
        i_cols_list = []
        #Ranking of all the columns
        rank_list =[]
        #List of columns not selected
        rem_list = []
        #Indexes of columns not selected
        i_rem_list = []
        #Split the array. Store selected columns in cols_list and remove them in rem_list
        for j, (i, col, x) in enumerate(list(joined_sorted)):
            #Store the rank
            rank_list.append([i,j])
            #Store selected columns in cols_list and indexes in i_cols_list
            if(j < rem_start):
                cols_list.append(col)
                i_cols_list.append(i)
            #Store not selected columns in rem_list and indexes in i_rem_list    
            else:
                rem_list.append(col)
                i_rem_list.append(i)    
        #Sort the rank_list and store only the ranks. Drop the index 
        #Append model name, array, columns selected and columns to be removed to the additional list        
        all_X_add.append([trans,name,X,val_X,v,cols_list,rem_list,[x[1] for x in sorted(rank_list,key=lambda x:x[0])],i_cols_list,i_rem_list])    

#Set figure size
plt.rc("figure", figsize=(25, 10))

#Plot a graph for different feature selectors        
for f_name in feat_list:
    #Array to store the list of combinations
    leg=[]
    fig, ax = plt.subplots()
    #Plot each combination
    for trans,name,X,val_X,v,cols_list,rem_list,rank_list,i_cols_list,i_rem_list in all_X_add:
        if(name==f_name):
            plt.plot(rank_list)
            leg.append(trans+"+"+name+"+%s"% v)
    #Set the tick names to the names of columns
    ax.set_xticks(range(c-1))
    ax.set_xticklabels(cols[:c-1],rotation='vertical')
    #Display the plot
    plt.legend(leg,loc='best')    
    #Plot the rankings of all the features for all combinations
    plt.show()

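Output

Ranking Summary

The ranking summary reports the importance rank of each feature in the dataset. It helps in understanding how much each feature contributes to the task and in deciding which features to keep in or drop from the predictive model.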

df_rank = pandas.DataFrame(data=[x[7] for x in all_X_add],columns=cols[:c-1])
_ = df_rank.boxplot(rot=90)
# The Below plot summarizes the rankings according to the standard feature selection techniques
#Top ranked attributes are ... first 10 attributes, Wilderness_Area1,4 ...Soil_Type 3,4,10,38-40

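Output

Ranking Features by Median

Ranking features by the median of their ranks across the selectors is a straightforward feature selection method.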
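
To make the idea concrete, the sketch below aggregates three made-up rankings, where 0 marks the most important feature, into a median rank per feature and keeps the better half. The frame ranks and its values are invented for illustration.

```python
import pandas as pd

# Each row is one selector's ranking (0 = most important); values are made up.
ranks = pd.DataFrame(
    [[0, 2, 1, 3],
     [0, 1, 2, 3],
     [1, 0, 2, 3]],
    columns=["Elevation", "Slope", "Aspect", "Soil_Type8"],
)

# Median rank per feature, best (lowest) first.
median_rank = ranks.median().sort_values()

# Keep the better half of the features.
top_half = median_rank.index[: len(median_rank) // 2].tolist()

print(median_rank.to_dict(), top_half)
```

The median is less sensitive than the mean to a single selector that ranks a feature unusually high or low.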

df_rank = pandas.DataFrame(data=[x[7] for x in all_X_add],columns=cols[:c-1])
med = df_rank.median()
print(med)
#Write medians to output file for exploratory study on ML algorithms
#to_csv writes a real two-column CSV (to_string would write space-aligned text under a CSV header)
med.to_csv("median.csv", header=["Median"], index_label="Column")

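Output

Highest median rank (ranked least important)

Soil_Type8
Soil_Type25
Wilderness_Area2

Lowest median rank (ranked most important)

Elevation
Horizontal_Distance_To_Hydrology
Horizontal_Distance_To_Fire_Points

We now select features by median rank, since it appears to give the best results among the feature selection methods tried.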

#Select top 75%,50%,25%
list_ratio = [0.75,0.50,0.25]

#Median of rankings for each column
unsorted_rank = [0,8,11,4,5,2,5,7.5,9.5,3,8,28.5,14.5,2,35,19.5,12,14,37,25.5,50,44,9,28,20.5,19.5,40,38,20,38,43,35,44,22,24,33,49,42,46,47,27.5,19,31.5,23,28,42,30.5,46,40,12,13,18]

#List of feature selection models
feat = []

#Add Median to the list 
n = 'Median'
for val in list_ratio:
    feat.append([n,val])   

for trans,s, X, val_X, d, cols, rem_cols, ra, i_cols, i_rem in X_all:
    #Create subsets of feature lists based on ranking and list_ratio
    for name, v in feat:
        #Combine the importance and index of the column in the array joined
        joined = []
        for i, pred in enumerate(unsorted_rank):
            joined.append([i,cols[i],pred])
        #Sort in ascending order of median rank (lower rank = more important)
        joined_sorted = sorted(joined, key=lambda x: x[2])
        #Starting point of the columns to be dropped
        rem_start = int((v*(c-1)))
        #List of names of columns selected
        cols_list = []
        #Indexes of columns selected
        i_cols_list = []
        #Ranking of all the columns
        rank_list =[]
        #List of columns not selected
        rem_list = []
        #Indexes of columns not selected
        i_rem_list = []
        #Split the array. Store selected columns in cols_list and remove them in rem_list
        for j, (i, col, x) in enumerate(list(joined_sorted)):
            #Store the rank
            rank_list.append([i,j])
            #Store selected columns in cols_list and indexes in i_cols_list
            if(j < rem_start):
                cols_list.append(col)
                i_cols_list.append(i)
            #Store not selected columns in rem_list and indexes in i_rem_list    
            else:
                rem_list.append(col)
                i_rem_list.append(i)    
        #Sort the rank_list and store only the ranks. Drop the index 
        #Append model name, array, columns selected and columns to be removed to the additional list        
        all_X_add.append([trans,name,X,val_X,v,cols_list,rem_list,[x[1] for x in sorted(rank_list,key=lambda x:x[0])],i_cols_list,i_rem_list])

#Import plotting library    
import matplotlib.pyplot as plt    

#Dictionary to store the accuracies for all combinations 
acc = {}

#List of combinations
comb = []

#Append the name of the transformation to trans_list
for trans in trans_list:
    acc[trans]=[]

Models

We now apply a range of machine learning algorithms.
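
Each of the model sections that follow repeats the same fit-and-score loop over the feature subsets. As a sketch of how that loop could be factored out, the hypothetical helper score_subsets below fits a fresh copy of a model on each column subset and returns the validation accuracies. It runs on synthetic data and is not part of the original notebook.

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def score_subsets(model, X_tr, y_tr, X_va, y_va, subsets):
    """Fit a fresh copy of `model` on each column subset; return validation accuracies."""
    scores = {}
    for label, idx in subsets.items():
        fitted = clone(model).fit(X_tr[:, idx], y_tr)
        scores[label] = fitted.score(X_va[:, idx], y_va)
    return scores


# Synthetic stand-in for the real data.
X_demo, y_demo = make_classification(n_samples=400, n_features=10, n_informative=4, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.1, random_state=0)

# Column index lists keyed by a readable label.
subsets = {"all": list(range(10)), "first_half": list(range(5))}

scores = score_subsets(DecisionTreeClassifier(max_depth=5, random_state=0), X_tr, y_tr, X_va, y_va, subsets)
print(scores)
```

Using clone keeps the runs independent, so one subset's fit never leaks into the next.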

01. Linear Discriminant Analysis

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

#Set the base model
model = LinearDiscriminantAnalysis()
algo = "LDA"

##Set figure size
#plt.rc("figure", figsize=(25, 10))

#Accuracy of the model using all features
for trans,name,X,val_X,v,cols_list,rem_list,rank_list,i_cols_list,i_rem_list in X_all:
    model.fit(X[:,i_cols_list],Y_train)
    result = model.score(val_X[:,i_cols_list], val_Y)
    acc[trans].append(result)
    #print(trans+"+"+name+"+%d" % (v*(c-1)))
    #print(result)
comb.append("%s+%s of %s" % (algo,"All",1.0))
        
#Accuracy of the model using a subset of features    
for trans,name,X,val_X,v,cols_list,rem_list,rank_list,i_cols_list,i_rem_list in all_X_add:
    model.fit(X[:,i_cols_list],Y_train)
    result = model.score(val_X[:,i_cols_list], val_Y)
    acc[trans].append(result)
    #print(trans+"+"+name+"+%d" % (v*(c-1)))
    #print(result)
for v in list_ratio:
    comb.append("%s+%s of %s" % (algo,"Subset",v))

02. Logistic Regression

from sklearn.linear_model import LogisticRegression

C_list = [100]

for C in C_list:
    #Set the base model
    model = LogisticRegression(n_jobs=-1,random_state=seed,C=C)
   
    algo = "LR"

    ##Set figure size
    #plt.rc("figure", figsize=(25, 10))

    #Accuracy of the model using all features
    for trans,name,X,val_X,v,cols_list,rem_list,rank_list,i_cols_list,i_rem_list in X_all:
        model.fit(X[:,i_cols_list],Y_train)
        result = model.score(val_X[:,i_cols_list], val_Y)
        acc[trans].append(result)
        #print(trans+"+"+name+"+%d" % (v*(c-1)))
        #print(result)
    comb.append("%s with C=%s+%s of %s" % (algo,C,"All",1.0))

    #Accuracy of the model using a subset of features    
    for trans,name,X,val_X,v,cols_list,rem_list,rank_list,i_cols_list,i_rem_list in all_X_add:
        model.fit(X[:,i_cols_list],Y_train)
        result = model.score(val_X[:,i_cols_list], val_Y)
        acc[trans].append(result)
        #print(trans+"+"+name+"+%d" % (v*(c-1)))
        #print(result)
    for v in list_ratio:
        comb.append("%s with C=%s+%s of %s" % (algo,C,"Subset",v))

03. KNN

#Evaluation of various combinations of KNN Classifier using all the views

#Import the library
from sklearn.neighbors import KNeighborsClassifier

n_list = [1]

for n_neighbors in n_list:
    #Set the base model
    model = KNeighborsClassifier(n_jobs=-1,n_neighbors=n_neighbors)
   
    algo = "KNN"

    ##Set figure size
    #plt.rc("figure", figsize=(25, 10))

    #Accuracy of the model using all features
    for trans,name,X,val_X,v,cols_list,rem_list,rank_list,i_cols_list,i_rem_list in X_all:
        model.fit(X[:,i_cols_list],Y_train)
        result = model.score(val_X[:,i_cols_list], val_Y)
        acc[trans].append(result)
        #print(trans+"+"+name+"+%d" % (v*(c-1)))
        #print(result)
    comb.append("%s with n=%s+%s of %s" % (algo,n_neighbors,"All",1.0))

    #Accuracy of the model using a subset of features    
    for trans,name,X,val_X,v,cols_list,rem_list,rank_list,i_cols_list,i_rem_list in all_X_add:
        model.fit(X[:,i_cols_list],Y_train)
        result = model.score(val_X[:,i_cols_list], val_Y)
        acc[trans].append(result)
        #print(trans+"+"+name+"+%d" % (v*(c-1)))
        #print(result)
    for v in list_ratio:
        comb.append("%s with n=%s+%s of %s" % (algo,n_neighbors,"Subset",v))

04. Naive Bayes

#Evaluation of various combinations of Naive Bayes using all the views

#Import the library
from sklearn.naive_bayes import GaussianNB

#Set the base model
model = GaussianNB()
algo = "NB"

##Set figure size
#plt.rc("figure", figsize=(25, 10))

#Accuracy of the model using all features
for trans,name,X,val_X,v,cols_list,rem_list,rank_list,i_cols_list,i_rem_list in X_all:
    model.fit(X[:,i_cols_list],Y_train)
    result = model.score(val_X[:,i_cols_list], val_Y)
    acc[trans].append(result)
    #print(trans+"+"+name+"+%d" % (v*(c-1)))
    #print(result)
comb.append("%s+%s of %s" % (algo,"All",1.0))
        
#Accuracy of the model using a subset of features    
for trans,name,X,val_X,v,cols_list,rem_list,rank_list,i_cols_list,i_rem_list in all_X_add:
    model.fit(X[:,i_cols_list],Y_train)
    result = model.score(val_X[:,i_cols_list], val_Y)
    acc[trans].append(result)
    #print(trans+"+"+name+"+%d" % (v*(c-1)))
    #print(result)
for v in list_ratio:
    comb.append("%s+%s of %s" % (algo,"Subset",v))

05. Decision Tree Classifier

#Evaluation of various combinations of CART using all the views

#Import the library
from sklearn.tree import DecisionTreeClassifier

d_list = [13]

for max_depth in d_list:
    #Set the base model
    model = DecisionTreeClassifier(random_state=seed,max_depth=max_depth)
   
    algo = "CART"

    #Set figure size
    plt.rc("figure", figsize=(15, 10))

    #Accuracy of the model using all features
    for trans,name,X,val_X,v,cols_list,rem_list,rank_list,i_cols_list,i_rem_list in X_all:
        model.fit(X[:,i_cols_list],Y_train)
        result = model.score(val_X[:,i_cols_list], val_Y)
        acc[trans].append(result)
        #print(trans+"+"+name+"+%d" % (v*(c-1)))
        #print(result)
    comb.append("%s with d=%s+%s of %s" % (algo,max_depth,"All",1.0))

    #Accuracy of the model using a subset of features    
    for trans,name,X,val_X,v,cols_list,rem_list,rank_list,i_cols_list,i_rem_list in all_X_add:
        model.fit(X[:,i_cols_list],Y_train)
        result = model.score(val_X[:,i_cols_list], val_Y)
        acc[trans].append(result)
        #print(trans+"+"+name+"+%d" % (v*(c-1)))
        #print(result)
    for v in list_ratio:
        comb.append("%s with d=%s+%s of %s" % (algo,max_depth,"Subset",v))

06. Support Vector Machine

#Evaluation of various combinations of SVM using all the views

#Import the library
from sklearn.svm import SVC

c_list = [10]

for C in c_list:
    #Set the base model
    model = SVC(random_state=seed,C=C)

    algo = "SVM"

    #Set figure size
    #plt.rc("figure", figsize=(15, 10))

    #Accuracy of the model using all features
    for trans,name,X,val_X,v,cols_list,rem_list,rank_list,i_cols_list,i_rem_list in X_all:
        model.fit(X[:,i_cols_list],Y_train)
        result = model.score(val_X[:,i_cols_list], val_Y)
        acc[trans].append(result)
        #print(trans+"+"+name+"+%d" % (v*(c-1)))
        #print(result)
    comb.append("%s with C=%s+%s of %s" % (algo,C,"All",1.0))

07. Bagged Decision Trees

#Evaluation of various combinations of Bagged Decision Trees using all the views

#Import the library
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

#Base estimator
base_estimator = DecisionTreeClassifier(random_state=seed,max_depth=13)

n_list = [100]

for n_estimators in n_list:
    #Set the base model
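    #The tree is passed positionally because the keyword was renamed from base_estimator to estimator in scikit-learn 1.2
    model = BaggingClassifier(base_estimator, n_jobs=-1, n_estimators=n_estimators, random_state=seed)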
   
    algo = "Bag"

    #Set figure size
    plt.rc("figure", figsize=(20, 10))

    #Accuracy of the model using all features
    for trans,name,X,val_X,v,cols_list,rem_list,rank_list,i_cols_list,i_rem_list in X_all:
        model.fit(X[:,i_cols_list],Y_train)
        result = model.score(val_X[:,i_cols_list], val_Y)
        acc[trans].append(result)
        #print(trans+"+"+name+"+%d" % (v*(c-1)))
        #print(result)
    comb.append("%s with n=%s+%s of %s" % (algo,n_estimators,"All",1.0))

    #Accuracy of the model using a subset of features    
    for trans,name,X,val_X,v,cols_list,rem_list,rank_list,i_cols_list,i_rem_list in all_X_add:
        model.fit(X[:,i_cols_list],Y_train)
        result = model.score(val_X[:,i_cols_list], val_Y)
        acc[trans].append(result)
        #print(trans+"+"+name+"+%d" % (v*(c-1)))
        #print(result)
    for v in list_ratio:
        comb.append("%s with n=%s+%s of %s" % (algo,n_estimators,"Subset",v))

08. Random Forest Classifier

#Evaluation of various combinations of Random Forest using all the views

#Import the library
from sklearn.ensemble import RandomForestClassifier

n_list = [100]

for n_estimators in n_list:
    #Set the base model
    model = RandomForestClassifier(n_jobs=-1,n_estimators=n_estimators, random_state=seed)
   
    algo = "RF"

    #Set figure size
    plt.rc("figure", figsize=(20, 10))

    #Accuracy of the model using all features
    for trans,name,X,val_X,v,cols_list,rem_list,rank_list,i_cols_list,i_rem_list in X_all:
        model.fit(X[:,i_cols_list],Y_train)
        result = model.score(val_X[:,i_cols_list], val_Y)
        acc[trans].append(result)
        #print(trans+"+"+name+"+%d" % (v*(c-1)))
        #print(result)
    comb.append("%s with n=%s+%s of %s" % (algo,n_estimators,"All",1.0))

    #Accuracy of the model using a subset of features    
    for trans,name,X,val_X,v,cols_list,rem_list,rank_list,i_cols_list,i_rem_list in all_X_add:
        model.fit(X[:,i_cols_list],Y_train)
        result = model.score(val_X[:,i_cols_list], val_Y)
        acc[trans].append(result)
        #print(trans+"+"+name+"+%d" % (v*(c-1)))
        #print(result)
    for v in list_ratio:
        comb.append("%s with n=%s+%s of %s" % (algo,n_estimators,"Subset",v))

09. Extra Trees

from sklearn.ensemble import ExtraTreesClassifier

n_list = [100]

for n_estimators in n_list:
    #Set the base model
    model = ExtraTreesClassifier(n_jobs=-1,n_estimators=n_estimators, random_state=seed)
   
    algo = "ET"

    #Set figure size
    plt.rc("figure", figsize=(20, 10))

    #Accuracy of the model using all features
    for trans,name,X,val_X,v,cols_list,rem_list,rank_list,i_cols_list,i_rem_list in X_all:
        model.fit(X[:,i_cols_list],Y_train)
        result = model.score(val_X[:,i_cols_list], val_Y)
        acc[trans].append(result)
    #Record the combination once, after the loop over transformations
    comb.append("%s with n=%s+%s of %s" % (algo,n_estimators,"All",1.0))

    #Accuracy of the model using a subset of features    
    for trans,name,X,val_X,v,cols_list,rem_list,rank_list,i_cols_list,i_rem_list in all_X_add:
        model.fit(X[:,i_cols_list],Y_train)
        result = model.score(val_X[:,i_cols_list], val_Y)
        acc[trans].append(result)
    for v in list_ratio:
        comb.append("%s with n=%s+%s of %s" % (algo,n_estimators,"Subset",v))

10. AdaBoost (Boosting)

from sklearn.ensemble import AdaBoostClassifier

n_list = [100]

for n_estimators in n_list:
    #Set the base model
    model = AdaBoostClassifier(n_estimators=n_estimators, random_state=seed)
   
    algo = "Ada"

    #Set figure size
    plt.rc("figure", figsize=(20, 10))

    #Accuracy of the model using all features
    for trans,name,X,val_X,v,cols_list,rem_list,rank_list,i_cols_list,i_rem_list in X_all:
        model.fit(X[:,i_cols_list],Y_train)
        result = model.score(val_X[:,i_cols_list], val_Y)
        acc[trans].append(result)
    comb.append("%s with n=%s+%s of %s" % (algo,n_estimators,"All",1.0))

    #Accuracy of the model using a subset of features    
    for trans,name,X,val_X,v,cols_list,rem_list,rank_list,i_cols_list,i_rem_list in all_X_add:
        model.fit(X[:,i_cols_list],Y_train)
        result = model.score(val_X[:,i_cols_list], val_Y)
        acc[trans].append(result)

    for v in list_ratio:
        comb.append("%s with n=%s+%s of %s" % (algo,n_estimators,"Subset",v))

11. Gradient Boosting Classifier

from sklearn.ensemble import GradientBoostingClassifier

d_list = [9]

for max_depth in d_list:
    #Set the base model
    model = GradientBoostingClassifier(max_depth=max_depth, random_state=seed)
   
    algo = "SGB"

    #Set figure size
    plt.rc("figure", figsize=(20, 10))

    #Accuracy of the model using all features
    for trans,name,X,val_X,v,cols_list,rem_list,rank_list,i_cols_list,i_rem_list in X_all:
        model.fit(X[:,i_cols_list],Y_train)
        result = model.score(val_X[:,i_cols_list], val_Y)
        acc[trans].append(result)
    comb.append("%s with d=%s+%s of %s" % (algo,max_depth,"All",1.0))

12. Voting Classifier

from sklearn.ensemble import VotingClassifier

estimators_list =[]

estimators = []
model_01 = ExtraTreesClassifier(n_jobs=-1,n_estimators=100, random_state=seed)
estimators.append(('et', model_01))
model_02 = RandomForestClassifier(n_jobs=-1,n_estimators=100, random_state=seed)
estimators.append(('rf', model_02))
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
base_estimator = DecisionTreeClassifier(random_state=seed,max_depth=13)
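#The tree is passed positionally because the keyword was renamed from base_estimator to estimator in scikit-learn 1.2
model3 = BaggingClassifier(base_estimator, n_jobs=-1, n_estimators=100, random_state=seed)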
estimators.append(('bag', model3))

estimators_list.append(['Voting',estimators])

for name, estimators in estimators_list:
    #Set the base model
    model = VotingClassifier(estimators=estimators, n_jobs=-1)
   
    algo = name

    #Set figure size
    plt.rc("figure", figsize=(20, 10))

    #Accuracy of the model using all features
    for trans,name,X,val_X,v,cols_list,rem_list,rank_list,i_cols_list,i_rem_list in X_all:
        model.fit(X[:,i_cols_list],Y_train)
        result = model.score(val_X[:,i_cols_list], val_Y)
        acc[trans].append(result)
        #print(trans+"+"+name+"+%d" % (v*(c-1)))
        #print(result)
    comb.append("%s+%s of %s" % (algo,"All",1.0))

    #Accuracy of the model using a subset of features    
    for trans,name,X,val_X,v,cols_list,rem_list,rank_list,i_cols_list,i_rem_list in all_X_add:
        model.fit(X[:,i_cols_list],Y_train)
        result = model.score(val_X[:,i_cols_list], val_Y)
        acc[trans].append(result)
    for v in list_ratio:
        comb.append("%s+%s of %s" % (algo,"Subset",v))
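
VotingClassifier defaults to hard voting, where each estimator casts one vote for a class. With voting="soft" it averages the predicted class probabilities instead. The sketch below compares the two on synthetic data; the dataset, the estimator sizes, and the variable names are arbitrary assumptions, and no particular ranking of the two modes is implied.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier, VotingClassifier
from sklearn.model_selection import train_test_split

# Synthetic three-class problem standing in for the real data.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, n_informative=5, n_classes=3, random_state=0
)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

estimators = [
    ("et", ExtraTreesClassifier(n_estimators=30, random_state=0)),
    ("rf", RandomForestClassifier(n_estimators=30, random_state=0)),
]

# Hard voting: majority vote over predicted labels.
hard = VotingClassifier(estimators=estimators, voting="hard").fit(X_tr, y_tr)
# Soft voting: argmax of the averaged class probabilities.
soft = VotingClassifier(estimators=estimators, voting="soft").fit(X_tr, y_tr)

hard_acc = hard.score(X_va, y_va)
soft_acc = soft.score(X_va, y_va)
print(hard_acc, soft_acc)
```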

13. XGBoost

from xgboost import XGBClassifier

n_list = [300]

for n_estimators in n_list:
    #Set the base model
    model = XGBClassifier(n_estimators=n_estimators, random_state=seed,subsample=0.25)
   
    algo = "XGB"

    #Set figure size
    plt.rc("figure", figsize=(20, 10))

    #Recent XGBoost releases expect class labels 0..6, so the 1..7 labels are shifted down by one
    #Accuracy of the model using all features
    for trans,name,X,val_X,v,cols_list,rem_list,rank_list,i_cols_list,i_rem_list in X_all:
        model.fit(X[:,i_cols_list],Y_train-1)
        result = model.score(val_X[:,i_cols_list], val_Y-1)
        acc[trans].append(result)
    comb.append("%s with n=%s+%s of %s" % (algo,n_estimators,"All",1.0))

    #Accuracy of the model using a subset of features    
    for trans,name,X,val_X,v,cols_list,rem_list,rank_list,i_cols_list,i_rem_list in all_X_add:
        model.fit(X[:,i_cols_list],Y_train-1)
        result = model.score(val_X[:,i_cols_list], val_Y-1)
        acc[trans].append(result)

    for v in list_ratio:
        comb.append("%s with n=%s+%s of %s" % (algo,n_estimators,"Subset",v))
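
Recent XGBoost releases are understood to expect class labels numbered from 0, while Cover_Type runs from 1 to 7; this is worth checking against the installed version. One general way to convert labels and map predictions back is scikit-learn's LabelEncoder, sketched here on made-up labels (y_raw, y_zero_based, and y_back are hypothetical names).

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Made-up Cover_Type labels in the original 1..7 coding.
y_raw = np.array([1, 7, 3, 3, 5, 2, 4, 6])

encoder = LabelEncoder().fit(y_raw)

# Zero-based labels suitable for libraries that require classes 0..n-1.
y_zero_based = encoder.transform(y_raw)

# Map zero-based predictions back to the original coding.
y_back = encoder.inverse_transform(y_zero_based)

print(y_zero_based, y_back)
```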

Model Evaluation

#Evaluation of baseline MLP models using all the views

#Import libraries for deep learning
#Note: keras.wrappers.scikit_learn is the Keras 2.x location of this wrapper;
#later releases dropped it in favour of the separate SciKeras package
from keras.wrappers.scikit_learn import KerasClassifier
from keras.models import Sequential
from keras.layers import Dense

#Import libraries for encoding
from keras.utils import to_categorical
from sklearn.preprocessing import LabelEncoder

#no. of output classes
y = 7

#random state
numpy.random.seed(seed)

# one hot encode class values
encoder = LabelEncoder()
Y_train_en = encoder.fit_transform(Y_train)
Y_train_hot = to_categorical(Y_train_en,y)
#Reuse the encoder fitted on the training labels so both sets share one mapping
val_Y_en = encoder.transform(val_Y)
val_Y_hot = to_categorical(val_Y_en,y)
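
The encoding step above first maps the Cover_Type labels 1-7 to indices 0-6 and then expands each index into a 7-element one-hot row. The following plain-Python sketch mirrors those two steps without scikit-learn or Keras. The helper names and the sample labels are hypothetical, chosen only for illustration.

```python
def fit_classes(labels):
    """Return the sorted distinct labels; a label's position is its encoded index."""
    return sorted(set(labels))

def encode(labels, classes):
    """Map each label to its index in `classes` (raises ValueError for unseen labels)."""
    return [classes.index(label) for label in labels]

def one_hot(indices, num_classes):
    """Turn each index into a row with a single 1.0 at that position."""
    return [[1.0 if i == idx else 0.0 for i in range(num_classes)] for idx in indices]

train_labels = [1, 2, 3, 4, 5, 6, 7, 5]   # hypothetical Cover_Type values
val_labels = [7, 2]                        # hypothetical validation labels

classes = fit_classes(train_labels)
train_hot = one_hot(encode(train_labels, classes), len(classes))
val_hot = one_hot(encode(val_labels, classes), len(classes))

print(val_hot[0])
```

Reusing the mapping learned from the training labels keeps the column order identical for both sets, even when the validation labels do not contain every class.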


# define baseline model
def baseline(v):
    # number of input features; cast to int because Dense needs an integer size
    n = int(v*(c-1))
    # create model
    model = Sequential()
    model.add(Dense(n, input_dim=n, kernel_initializer='normal', activation='relu'))
    # softmax yields class probabilities that sum to 1, matching categorical_crossentropy
    model.add(Dense(y, kernel_initializer='normal', activation='softmax'))
    # Compile model
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

# define a smaller model
def smaller(v):
    n = int(v*(c-1))
    # create model with a hidden layer half the input width
    model = Sequential()
    model.add(Dense(n//2, input_dim=n, kernel_initializer='normal', activation='relu'))
    model.add(Dense(y, kernel_initializer='normal', activation='softmax'))
    # Compile model
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

# define a deeper model
def deeper(v):
    n = int(v*(c-1))
    # create model with two hidden layers
    model = Sequential()
    model.add(Dense(n, input_dim=n, kernel_initializer='normal', activation='relu'))
    model.add(Dense(n//2, kernel_initializer='normal', activation='relu'))
    model.add(Dense(y, kernel_initializer='normal', activation='softmax'))
    # Compile model
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model
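
The networks above are compiled with categorical_crossentropy, a loss that compares a one-hot target with a vector of class probabilities. A softmax output is the usual way to produce such a vector. The sketch below works through both steps in plain Python for a single sample. The raw scores are hypothetical, and the functions illustrate the formulas rather than reproduce the Keras implementation.

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_crossentropy(one_hot_target, probs):
    """Negative log-probability assigned to the true class."""
    return -sum(t * math.log(p) for t, p in zip(one_hot_target, probs) if t > 0)

scores = [2.0, 0.5, 0.1, -1.0, 0.0, 0.3, 1.2]   # hypothetical outputs for 7 classes
target = [1, 0, 0, 0, 0, 0, 0]                  # true class is the first one

probs = softmax(scores)
loss = categorical_crossentropy(target, probs)
print(round(sum(probs), 6), loss)
```

The loss shrinks toward zero as the probability assigned to the true class approaches 1, which is the quantity the optimizer drives down during training.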

# Optimize using dropout and decay
from keras.optimizers import SGD
from keras.layers import Dropout
from keras.constraints import MaxNorm

def dropout(v):
    n = int(v*(c-1))
    #create model; each hidden layer is followed by 20% dropout and
    #its weights are capped with a max-norm constraint of 3
    model = Sequential()
    model.add(Dense(n, input_dim=n, kernel_initializer='normal', activation='relu', kernel_constraint=MaxNorm(3)))
    model.add(Dropout(0.2))
    model.add(Dense(n//2, kernel_initializer='normal', activation='relu', kernel_constraint=MaxNorm(3)))
    model.add(Dropout(0.2))
    model.add(Dense(y, kernel_initializer='normal', activation='softmax'))
    # Compile model
    sgd = SGD(lr=0.1,momentum=0.9,decay=0.0,nesterov=False)
    model.compile(loss='categorical_crossentropy', optimizer=sgd, metrics=['accuracy'])
    return model

# define decay model
def decay(v):
    n = int(v*(c-1))
    # create model
    model = Sequential()
    model.add(Dense(n, input_dim=n, kernel_initializer='normal', activation='relu'))
    model.add(Dense(y, kernel_initializer='normal', activation='softmax'))
    # Compile model; decay=0.01 lowers the learning rate as training proceeds
    sgd = SGD(lr=0.1,momentum=0.8,decay=0.01,nesterov=False)
    model.compile(loss='categorical_crossentropy', optimizer=sgd, metrics=['accuracy'])
    return model
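
The decay model above starts SGD at a learning rate of 0.1 with decay=0.01. Assuming the time-based schedule used by the legacy Keras optimizers, where the rate is divided by (1 + decay * iteration) and an iteration is one weight update, the effective learning rate can be sketched as follows. The function name is hypothetical.

```python
def decayed_lr(initial_lr, decay, iteration):
    """Time-based decay: the rate shrinks as the number of weight updates grows."""
    return initial_lr / (1.0 + decay * iteration)

# Learning rate after 0, 100 and 1000 weight updates with lr=0.1, decay=0.01
for step in (0, 100, 1000):
    print(step, decayed_lr(0.1, 0.01, step))
```

With decay=0.0, as in the dropout model, the learning rate stays at its initial value throughout training.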
    
est_list = [('MLP',baseline),('smaller',smaller),('deeper',deeper),('dropout',dropout),('decay',decay)]

for name, est in est_list:
 
    algo = name

    #Set figure size
    plt.rc("figure", figsize=(20, 10))

    #Accuracy of the model using all features
    for trans,name,X,val_X,v,cols_list,rem_list,rank_list,i_cols_list,i_rem_list in X_all:
        #epochs is the Keras 2 name for the former nb_epoch argument
        model = KerasClassifier(build_fn=est, v=v, epochs=10, verbose=0)
        model.fit(X[:,i_cols_list],Y_train_hot)
        result = model.score(val_X[:,i_cols_list], val_Y_hot)
        acc[trans].append(result)
    comb.append("%s+%s of %s" % (algo,"All",1.0))

    ##Accuracy of the model using a subset of features (left disabled)
    #for trans,name,X,val_X,v,cols_list,rem_list,rank_list,i_cols_list,i_rem_list in all_X_add:
    #    model = KerasClassifier(build_fn=est, v=v, epochs=10, verbose=0)
    #    model.fit(X[:,i_cols_list],Y_train_hot)
    #    result = model.score(val_X[:,i_cols_list], val_Y_hot)
    #    acc[trans].append(result)
    #for v in list_ratio:
    #    comb.append("%s+%s of %s" % (algo,"Subset",v))

#Plot the accuracies of all combinations
fig, ax = plt.subplots()
#Plot each transformation
for trans in trans_list:
        plt.plot(acc[trans])
#Set the tick names to names of combinations
ax.set_xticks(range(len(comb)))
ax.set_xticklabels(comb,rotation='vertical')
#Display the plot
plt.legend(trans_list,loc='best')    
#Plot the accuracy for all combinations
plt.show()   

Output

The model evaluation leads to the following observations:

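Linear Discriminant Analysis: The best estimated accuracy, 65%, came from using all features without any transformation. MinMax scaling and normalization performed markedly worse than expected.
Logistic Regression: The best estimated accuracy was close to 67%, obtained with C = 100, all attributes, and standardized data. Accuracy tended to improve as C increased. The Normalizer and MinMaxScaler transformations generally performed poorly.
KNN: The best estimated accuracy was around 86%, with n_neighbors set to 1 and normalized data.
Naive Bayes: The best estimated accuracy was about 64%. The original dataset outperformed every transformed variant, even when only a 50% subset was used.
Decision Tree Classifier: The best estimated accuracy was close to 79%, reached with a maximum depth of 13 on the original dataset.
SVM: Training took noticeably longer than for the other algorithms. Performance on the original dataset was clearly inadequate, which underlines the importance of data transformation. The best estimated accuracy was about 77%, with C = 10, StandardScaler, and a 0.25 subset.
Bagged Decision Trees: The best estimated accuracy was close to 82%, using the original dataset and n_estimators = 100.
Random Forest: The best estimated accuracy was almost 85% with n_estimators = 100.
Extra Trees: The best estimated accuracy was close to 88%, with n_estimators = 100, StandardScaler, and a 0.75 subset.
AdaBoost: The best estimated accuracy was about 38% with n_estimators = 100.
Gradient Boosting: Training took a very long time. The best estimated accuracy was close to 86%, with the depth set to 7.
Voting Classifier: The best estimated accuracy was close to 86%.
XGBoost: The best estimated accuracy was close to 80%, with n_estimators = 300, a subsample of 0.25, and a 0.75 subset.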

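KNN, the Voting Classifier, Extra Trees, and Random Forest performed best at predicting forest cover type, so any of them is a reasonable choice for future forest cover predictions.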
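
To make the comparison easier to scan, the short sketch below ranks the approximate best accuracies quoted in the observations. The figures are rounded percentages taken from the list above ("close to" and "about" values), so treat them as indicative rather than exact. The 85% cut-off used to pick the leading models is an arbitrary choice made for this illustration.

```python
# Approximate best accuracies (in percent) as quoted in the observations
reported_accuracy = {
    "Linear Discriminant Analysis": 65,
    "Logistic Regression": 67,
    "KNN": 86,
    "Naive Bayes": 64,
    "Decision Tree": 79,
    "SVM": 77,
    "Bagged Decision Trees": 82,
    "Random Forest": 85,
    "Extra Trees": 88,
    "AdaBoost": 38,
    "Gradient Boosting": 86,
    "Voting Classifier": 86,
    "XGBoost": 80,
}

# sorted() is stable, so models with equal scores keep their listed order
ranking = sorted(reported_accuracy.items(), key=lambda item: item[1], reverse=True)

for model_name, accuracy in ranking:
    print(f"{model_name:<30}{accuracy}%")

# Models at or above the illustrative 85% threshold
top_models = [model_name for model_name, accuracy in ranking if accuracy >= 85]
print(top_models)
```

Gradient Boosting also clears the threshold, but the observations note its very long training time, which is a practical reason to prefer the other four.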

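Conclusion

Predicting forest cover type with machine learning is an important tool for protecting forests and managing them sustainably. It lets us make informed decisions, protect biodiversity, and safeguard the long-term health of these critical ecosystems. As technology and data continue to improve, so will our ability to understand and protect the forests that life on Earth depends on.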
