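Python Web Crawler: Scraping FIFA Player Data

Distinctive Match Analysis · 2025-10-12 14:11:34

Scraping FIFA Player Data

I. Background

The World Cup brings excitement and spectacle to football fans around the world, and the players' performances give audiences everywhere one thrilling match after another. Players are the key factor in deciding a match. To explore the characteristics and abilities of players worldwide, this article scrapes FIFA player data (from sofifa.com) and, through visual exploration and regression modeling, tries to identify the factors that most influence a player's ability, which in turn can help with player development and scouting.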

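II. Crawler Design

1. Crawler name

FIFA player information multithreaded crawler.

2. Content to crawl and data characteristics

For each player the crawler collects: the player's name; basic information (date of birth, height, weight); Overall_Rating, Potential, Value, and Wage; and the individual attributes under the ATTACKING, SKILL, MOVEMENT, POWER, MENTALITY, DEFENDING, and GOALKEEPING groups.

The data analysis focuses on the distribution of player abilities and on the relationship between a player's Potential and the individual attributes.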

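3. Overview of the crawler design

Approach

The crawler first collects the URLs of the player detail pages from the player list pages, then visits each of those URLs to scrape the detailed information.

Each list page (the entry point to the detail pages) links to 60 players. The crawler uses the gevent library to fetch them concurrently: it spawns 60 tasks per list page, one for each detail page. Strictly speaking, gevent tasks are greenlets (coroutines) scheduled cooperatively inside one thread rather than operating-system threads, although this article calls the approach "multithreading". The crawler also uses the time, requests, lxml, and random libraries; the analysis then uses numpy, pandas, seaborn, sklearn, and matplotlib.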
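The fan-out described above (one task per detail link on a list page, wait for all of them, then move on to the next page) can be sketched as follows. This is only a minimal sketch, not the crawler itself: it swaps gevent for the standard library's ThreadPoolExecutor so that it runs without third-party packages, and `fake_fetch`, `crawl_page`, and the example URLs are hypothetical stand-ins for real requests.

```python
from concurrent.futures import ThreadPoolExecutor
import random
import time


def fake_fetch(url):
    """Hypothetical stand-in for an HTTP request: sleeps briefly, returns a fake row."""
    time.sleep(random.random() / 100)  # simulated network latency
    return {"url": url, "player_id": url.rstrip("/").split("/")[-1]}


def crawl_page(detail_links, fetch=fake_fetch, max_workers=60):
    """Fetch every detail link of one list page concurrently and wait for all of them."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in the same order as detail_links
        return list(pool.map(fetch, detail_links))


# 60 made-up detail links, mirroring the 60 entries per list page
links = [f"https://example.invalid/player/{i}/" for i in range(60)]
rows = crawl_page(links)
print(len(rows), rows[0]["player_id"], rows[-1]["player_id"])
```

Because each task returns one complete row, the rows cannot get out of step with each other even when the tasks finish in a different order.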

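Technical difficulties

Running the crawl concurrently, locating the player information in the page, getting around the site's anti-crawling measures, and setting up the regression equation.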

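III. Structure of the Target Pages

1. Page structure and features

The player list page shows an overview of each player (the screenshot is not included here).

The player detail page shows the full information for one player (the screenshot is not included here).

2. HTML page parsing

Each list page contains links to 60 player detail pages. The crawler collects these 60 links from the list page and then starts 60 tasks to scrape the 60 players' details at the same time. XPath expressions locate each piece of information. The positions were found by inspecting the page element by element, and all of the required data turned out to sit under one common parent element (its tag name is missing from the source text).

3. Node (tag) lookup and traversal

Lookup method: the xpath function of the lxml library.

Traversal method: the data positions were determined manually.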
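To make the lookup step concrete, here is a small sketch of pulling detail links out of a list table with an XPath-style path. It is an illustration under assumptions: it uses the standard library's ElementTree, which supports only a subset of XPath and needs well-formed markup, in place of lxml; the markup in `SNIPPET` is made up and is not the real page; `extract_links` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed fragment shaped like a player list table (not real site markup)
SNIPPET = """
<table>
  <tbody class="list">
    <tr><td class="col-name"><a href="/player/1/">Player One</a></td></tr>
    <tr><td class="col-name"><a href="/player/2/">Player Two</a></td></tr>
    <tr><td class="col-age">25</td></tr>
  </tbody>
</table>
"""

BASE_URL = "https://sofifa.com/"


def extract_links(markup, base_url=BASE_URL):
    """Return absolute URLs for every link inside a 'col-name' cell."""
    root = ET.fromstring(markup)
    # Attribute predicates select only the cells that hold player links
    anchors = root.findall(".//tbody[@class='list']/tr/td[@class='col-name']/a")
    # hrefs are site-relative, so join them to the base URL without doubling the slash
    return [base_url.rstrip("/") + a.get("href") for a in anchors]


links = extract_links(SNIPPET)
print(links)
```

The attribute predicates play the same role as the `@class` filters in the crawler's lxml expressions: they narrow the match to the cells that actually contain player links.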

IV. Crawler Implementation

1. Data crawling and collection

# Import modules
import gevent
from gevent import monkey
monkey.patch_all()  # patch blocking I/O before requests is imported

import random
import time

import requests
from lxml import etree
import pandas as pd


class FIFA21():
    def __init__(self):
        self.headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko"}
        self.baseURL = 'https://sofifa.com/'
        self.path = 'data.csv'
        self.row_num = 0
        # General columns
        self.title = ['player_name', 'basic_info', 'Overall_rating', 'Potential', 'Value', 'Wage',
                      'preferred_foot', 'weak_foot', 'skill_moves', 'international_reputation']
        # Ratings per pitch position
        self.small_file = ['LS', 'ST', 'RS', 'LW', 'LF', 'CF', 'RF', 'RW', 'LAM', 'CAM', 'RAM', 'LM', 'LCM', 'CM', 'RCM', 'RM', 'LWB',
                           'LDM', 'CDM', 'RDM', 'RWB', 'LB', 'LCB', 'CB', 'RCB',
                           'RB', 'GK']
        # Individual attributes
        self.details = ['Crossing', 'Finishing', 'Heading Accuracy', 'Short Passing', 'Volleys', 'Dribbling', 'Curve',
                        'FK Accuracy', 'Long Passing', 'Ball Control', 'Acceleration', 'Sprint Speed', 'Agility', 'Reactions',
                        'Balance', 'Shot Power', 'Jumping', 'Stamina', 'Strength', 'Long Shots', 'Aggression', 'Interceptions', 'Positioning',
                        'Vision', 'Penalties', 'Composure', 'Defensive Awareness', 'Standing Tackle', 'Sliding Tackle', 'GK Diving',
                        'GK Handling', 'GK Kicking', 'GK Positioning', 'GK Reflexes']
        # One dict per player; save() turns the list into a DataFrame
        self.rows = []

    def loadPage(self, url):
        time.sleep(random.random())  # random pause of up to 1 s before each request
        return requests.get(url, headers=self.headers).content

    def get_player_links(self, url):
        html = etree.HTML(self.loadPage(url))
        player_links = html.xpath("//div[@class='card']/table[@class='table table-hover persist-area']/tbody[@class='list']"
                                  "/tr/td[@class='col-name']/a[@role='tooltip']/@href")
        return [self.baseURL[:-1] + link for link in player_links]

    def next_page(self, url):
        html = etree.HTML(self.loadPage(url))
        new_page = html.xpath("//div[@class='pagination']/a/@href")
        if not new_page:
            return 'stop'
        if url == self.baseURL:
            # First page: the only pagination link is taken as the next page
            return self.baseURL[:-1] + new_page[0]
        if len(new_page) == 1:
            # Later page with a single pagination link: there is no next page
            return 'stop'
        return self.baseURL[:-1] + new_page[1]

    def Get_player_small_field(self, html):
        content = html.xpath(
            "//div[@class='center']/div[@class='grid']/div[@class='col col-4']/div[@class='card calculated']/div[@class='field-small']"
            "/div[@class='lineup']/div[@class='grid half-spacing']/div[@class='col col-2']/div/text()")
        content = content[1::2]  # keep every second text node
        return dict(zip(self.small_file, content))

    def Get_player_basic_info(self, html):
        player_name = html.xpath("//div[@class='center']/div[@class='grid']/div[@class='col col-12']"
                                 "/div[@class='bp3-card player']/div[@class='info']/h1/text()")
        player_basic = html.xpath("//div[@class='center']/div[@class='grid']/div[@class='col col-12']"
                                  "/div[@class='bp3-card player']/div[@class='info']/div[@class='meta ellipsis']/text()")
        return {'player_name': player_name[0], 'basic_info': player_basic[-1]}

    def Get_rating_value_wage(self, html):
        # / html / body / div[1] / div / div[2] / div[1] / section / div[1] / div / span
        tail = ("/div[@class='grid']/div[@class='col col-12']"
                "/div[@class='bp3-card player']/section[@class='card spacing']/div[@class='block-quarter']/div")
        overall_potential_rating = html.xpath("//div[2]" + tail + "/span[1]/text()")
        if len(overall_potential_rating) == 0:
            overall_potential_rating = html.xpath("//div[1]" + tail + "/span[1]/text()")
        value_wage = html.xpath("//div[2]" + tail + "/text()")
        if len(value_wage) == 0:
            value_wage = html.xpath("//div[1]" + tail + "/text()")
        return {'Overall_rating': overall_potential_rating[0], 'Potential': overall_potential_rating[1],
                'Value': value_wage[2], 'Wage': value_wage[3]}

    def Get_profile(self, html):
        profile = html.xpath(
            "//div[@class='center']/div[@class='grid']/div[@class='col col-12']/div[@class='block-quarter'][1]"
            "/div[@class='card']/ul[@class='pl']/li[@class='ellipsis']/text()[1]")
        return {'preferred_foot': profile[0], 'weak_foot': profile[1],
                'skill_moves': profile[2], 'international_reputation': profile[3]}

    def Get_detail(self, html):
        # // *[ @ id = "body"] / div[3] / div / div[2] / div[9] / div / ul / li[1] / span[1]
        tail = ("/div[@class='grid']/div[@class='col col-12']"
                "/div[@class='block-quarter']/div[@class='card']/ul[@class='pl']/li")
        keys = html.xpath("//div[3]" + tail + "/span[2]/text()")
        if len(keys) == 0:
            keys = html.xpath("//div[2]" + tail + "/span[2]/text()")
        values = html.xpath("//div[3]" + tail + "/span[1]/text()")
        if len(values) == 0:
            values = html.xpath("//div[2]" + tail + "/span[1]/text()")
        values = dict(zip(keys, values))  # zip stops at the shorter of the two lists
        # 'Nan' marks an attribute that is absent from the page
        return {name: values.get(name, 'Nan') for name in self.details}

    def start_player(self, url):
        html = etree.HTML(self.loadPage(url))
        row = {}
        row.update(self.Get_player_basic_info(html))
        row.update(self.Get_profile(html))
        row.update(self.Get_rating_value_wage(html))
        row.update(self.Get_player_small_field(html))
        row.update(self.Get_detail(html))
        # The row is stored only once every field has been parsed
        self.rows.append(row)
        self.row_num += 1

    def startWork(self):
        current_url = self.baseURL
        while current_url != 'stop':
            print(current_url)
            player_links = self.get_player_links(current_url)
            # Spawn one gevent task (greenlet) per player detail page
            jobs = [gevent.spawn(self.start_player, link) for link in player_links]
            # Block until every task for this page has finished
            gevent.joinall(jobs)
            current_url = self.next_page(current_url)
        self.save()

    def save(self):
        # Column order: general columns, position ratings, then attributes
        columns = self.title + self.small_file + self.details
        df = pd.DataFrame(self.rows, columns=columns)
        df.to_csv(self.path, index=False)


if __name__ == "__main__":
    start = time.time()
    spider = FIFA21()
    spider.startWork()
    stop = time.time()
    print("\n[LOG]: %f seconds..." % (stop - start))
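The crawler above deals with anti-crawling measures through a browser User-Agent header and a random pause before each request, but a request can still fail now and then. One common complement is to retry with exponential backoff. The following is a minimal sketch under assumptions, not part of the original crawler: `fetch_with_retry`, its parameters, and the `flaky` test function are hypothetical names introduced for illustration.

```python
import random
import time


def fetch_with_retry(fetch, url, retries=3, base_delay=0.5, sleep=time.sleep):
    """Call fetch(url); on failure wait base_delay * 2**attempt plus jitter, then retry."""
    for attempt in range(retries + 1):
        try:
            return fetch(url)
        except Exception:
            if attempt == retries:
                raise  # out of retries: surface the last error
            sleep(base_delay * 2 ** attempt + random.random() * base_delay)


# Hypothetical flaky fetcher: fails twice, then succeeds
calls = {"n": 0}


def flaky(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated failure")
    return b"ok"


waits = []  # record the delays instead of actually sleeping
body = fetch_with_retry(flaky, "https://example.invalid/", sleep=waits.append)
print(body, len(waits))
```

In the crawler, a wrapper like this could be placed around the page download, for example by passing the `loadPage` method as `fetch`.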

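2. Data cleaning and processing

Inspecting the scraped data shows that some of the fields are not needed for the analysis in this article, so they are removed.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

data = pd.read_csv('data.csv')  # load the scraped data
data.info()  # inspect the columns and dtypes

# Keep only the columns needed for the analysis
left_data = data.iloc[:, :9]    # the first 9 general columns
right_data = data.iloc[:, 37:]  # the attribute columns (after 10 general and 27 position columns)
data = pd.concat([left_data, right_data], axis=1)
data.drop(['basic_info', 'Value', 'Wage'], axis=1, inplace=True)
data.dropna(inplace=True)  # drop rows with missing values

# Convert the attribute columns to floats
for index in data.columns[6:]:
    data[index] = data[index].astype('float64')
data.info()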
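The slice positions 9, 37, and 6 used above follow from the column order in which the crawler writes data.csv. The short check below reproduces that arithmetic with the column lists from the crawler; the variable names `first_detail_column`, `kept_left`, and `remaining` are introduced here only for illustration.

```python
# Column lists as defined in the crawler
title = ['player_name', 'basic_info', 'Overall_rating', 'Potential', 'Value', 'Wage',
         'preferred_foot', 'weak_foot', 'skill_moves', 'international_reputation']
small_file = ['LS', 'ST', 'RS', 'LW', 'LF', 'CF', 'RF', 'RW', 'LAM', 'CAM', 'RAM',
              'LM', 'LCM', 'CM', 'RCM', 'RM', 'LWB', 'LDM', 'CDM', 'RDM', 'RWB',
              'LB', 'LCB', 'CB', 'RCB', 'RB', 'GK']

# The attribute columns start right after the general and position columns
first_detail_column = len(title) + len(small_file)

# iloc[:, :9] keeps the general columns except international_reputation
kept_left = title[:9]

# Dropping three of them leaves the columns that precede the attributes
dropped = ['basic_info', 'Value', 'Wage']
remaining = [c for c in kept_left if c not in dropped]

print(first_detail_column)  # 37
print(len(remaining))       # 6
print(remaining)
```

So `iloc[:, 37:]` selects exactly the attribute columns, and after the drop `data.columns[6:]` addresses those same columns for the float conversion.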

3. Data analysis and visualization

fig, ax = plt.subplots(nrows=1, ncols=2)
fig.set_size_inches(14, 4)
sns.boxplot(data=data.loc[:, ["Overall_rating", "Potential"]], ax=ax[0])
ax[0].set_xlabel('')
ax[0].set_ylabel('')
# Density histograms with a KDE curve; histplot is used because distplot is deprecated in seaborn
sns.histplot(data["Overall_rating"], bins=10, kde=True, stat="density", ax=ax[1], label='Overall_rating')
sns.histplot(data["Potential"], bins=10, kde=True, stat="density", ax=ax[1], label='Potential')
ax[1].legend()
fig.tight_layout()
# jointplot draws on a figure of its own
sns.jointplot(x='Overall_rating', y='Potential', data=data, kind='scatter')

These plots show the distributions of Potential and Overall_rating and the relationship between the two.

Next, the distributions of the categorical variables:

fig, ax = plt.subplots(nrows=1, ncols=3)
fig.set_size_inches(12, 4)
sns.countplot(x=data['preferred_foot'], ax=ax[0])
sns.countplot(x=data['weak_foot'], ax=ax[1])
sns.countplot(x=data['skill_moves'], ax=ax[2])
fig.tight_layout()

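Scatter plots of pairs of variables

The correlation heat map (its code is in the full analysis source in part 6) shows that the variables correlated with Potential include Overall_rating, Reactions, and Shot Power, among others. Scatter plots against Potential are drawn as follows.

sns.jointplot(x='Reactions', y='Potential', data=data, kind='scatter')
sns.jointplot(x='Shot Power', y='Potential', data=data, kind='scatter')
sns.jointplot(x='Vision', y='Potential', data=data, kind='scatter')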
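The correlation coefficient behind that selection is, by default in pandas, the Pearson coefficient: the covariance of two variables divided by the product of their standard deviations. The sketch below computes it by hand. The numbers are made up for five imaginary players and are not scraped data, and `pearson` is a hypothetical helper written only to show the arithmetic.

```python
import math


def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # The 1/n factors cancel, so sums of squares are enough
    return cov / math.sqrt(var_x * var_y)


# Made-up ratings for five imaginary players (illustration only)
reactions = [60, 65, 70, 80, 85]
potential = [70, 72, 75, 84, 88]

r = pearson(reactions, potential)
print(round(r, 3))
```

A value close to 1 means the two ratings rise together almost linearly, which is the pattern the scatter plots are meant to confirm visually.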

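4. Building the regression model

from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Potential (the target) plus the predictors chosen from the correlation analysis
df = data[['Potential', 'Overall_rating', 'Short Passing', 'Long Passing', 'Ball Control', 'Reactions', 'Shot Power', 'Vision', 'Composure']]

x = df.drop('Potential', axis=1)
y = df['Potential']
# Hold out 30% of the rows for testing
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)
model = LinearRegression()
model.fit(x_train, y_train)
r2_score(y_test, model.predict(x_test))

The final model reaches an R² of 0.50 on the test set, which the author considers a fairly good result; it means the linear model explains about half of the variance in Potential. Because train_test_split is called without a fixed random_state, the exact value varies from run to run.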
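For reference, R² is defined as one minus the ratio of the residual sum of squares to the total sum of squares, which is also the definition that sklearn's r2_score uses. The sketch below implements that formula directly. The target values are made up for illustration, and `r_squared` is a hypothetical helper, not part of the analysis code.

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # unexplained variation
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total variation
    return 1 - ss_res / ss_tot


# Made-up Potential values for four imaginary players
y_true = [70, 75, 80, 85]

perfect = r_squared(y_true, y_true)       # predictions equal to the truth
baseline = r_squared(y_true, [77.5] * 4)  # always predicting the mean

print(perfect, baseline)  # 1.0 0.0
```

An R² of 0.50 therefore sits halfway between always predicting the mean Potential and predicting every player exactly.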

5. Data persistence

df.to_csv('df.csv')

The modeling data is saved as a CSV file.
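As a rough illustration of what ends up in such a file, the sketch below writes a few rows to CSV and reads them back. It is a sketch under assumptions: it uses the standard library's csv module rather than pandas, writes to an in-memory buffer instead of df.csv, and the rows and column subset are made up.

```python
import csv
import io

columns = ["player_name", "Potential", "Vision"]
rows = [
    {"player_name": "Player One", "Potential": 88, "Vision": 75},
    {"player_name": "Player Two", "Potential": 81},  # Vision missing on purpose
]

# In-memory file; a real run would use open('df.csv', 'w', newline='')
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=columns, restval="")
writer.writeheader()
writer.writerows(rows)  # missing keys are written as empty cells

# Read the data back: every value comes back as a string
buffer.seek(0)
loaded = list(csv.DictReader(buffer))
print(loaded[1])
```

The round trip shows two things worth remembering when the file is reloaded for analysis: numbers come back as text until they are converted, and a missing value comes back as an empty cell.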

6. Source code

Crawler

The crawler source is the same as the listing given in part 1 of this section.

Data analysis and modeling

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Load and clean the scraped data
data = pd.read_csv('data.csv')
left_data = data.iloc[:, :9]
right_data = data.iloc[:, 37:]
data = pd.concat([left_data, right_data], axis=1)
data.drop(['basic_info', 'Value', 'Wage'], axis=1, inplace=True)
data.dropna(inplace=True)
for index in data.columns[6:]:
    data[index] = data[index].astype('float64')
data.info()

# Distributions of Overall_rating and Potential
fig, ax = plt.subplots(nrows=1, ncols=2)
fig.set_size_inches(14, 4)
sns.boxplot(data=data.loc[:, ["Overall_rating", "Potential"]], ax=ax[0])
ax[0].set_xlabel('')
ax[0].set_ylabel('')
# Density histograms with a KDE curve; histplot is used because distplot is deprecated in seaborn
sns.histplot(data["Overall_rating"], bins=10, kde=True, stat="density", ax=ax[1], label='Overall_rating')
sns.histplot(data["Potential"], bins=10, kde=True, stat="density", ax=ax[1], label='Potential')
ax[1].legend()
fig.tight_layout()
sns.jointplot(x='Overall_rating', y='Potential', data=data, kind='scatter')

# Distributions of the categorical variables
fig, ax = plt.subplots(nrows=1, ncols=3)
fig.set_size_inches(12, 4)
sns.countplot(x=data['preferred_foot'], ax=ax[0])
sns.countplot(x=data['weak_foot'], ax=ax[1])
sns.countplot(x=data['skill_moves'], ax=ax[2])
fig.tight_layout()

# Correlation matrix of the numeric columns, drawn as a heat map
corr2 = data.select_dtypes(include=['float64', 'int64']).corr()
fig, ax = plt.subplots(nrows=1, ncols=1)
fig.set_size_inches(w=24, h=24)
sns.heatmap(corr2, annot=True, linewidths=0.5, ax=ax)
fig.savefig('1.png')

# Scatter plots of selected attributes against Potential
sns.jointplot(x='Reactions', y='Potential', data=data, kind='scatter')
sns.jointplot(x='Shot Power', y='Potential', data=data, kind='scatter')
sns.jointplot(x='Vision', y='Potential', data=data, kind='scatter')
sns.jointplot(x='Composure', y='Potential', data=data, kind='scatter')

# Linear regression of Potential on the selected predictors
df = data[['Potential', 'Overall_rating', 'Short Passing', 'Long Passing', 'Ball Control', 'Reactions', 'Shot Power', 'Vision', 'Composure']]
x = df.drop('Potential', axis=1)
y = df['Potential']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)
model = LinearRegression()
model.fit(x_train, y_train)
r2_score(y_test, model.predict(x_test))

df.to_csv('df.csv')

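V. Summary

Many problems came up while completing this crawler course project, such as the site's anti-crawling measures and repeated errors while writing the code, and the author had to search online for solutions. Because the data volume was large and crawling took a long time, the crawling approach was changed to fetch the pages concurrently with gevent. The page layout is not particularly clean, so the location of each field had to be found by hand, one at a time, which was tedious. Not all of the scraped data was needed for the later analysis, so it had to be filtered. The data also contained some missing values, and the rows with missing values were dropped.

The expected goals were met, with the following takeaways:

Through this course project the author learned the basic principles of web crawling and ways of cleaning and processing data. In particular, crawlers commonly use requests to simulate page requests and fetch data. The project also deepened the author's ability to use the sklearn library; calling sklearn directly to model and analyze the data is more efficient.

Suggested improvement:

For the linear regression, add more data so that the model has more predictive power.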

