Project author: hicala

Project description:
Data extraction from the Atlanta Hawks roster web site
Language: Jupyter Notebook
Project URL: git://github.com/hicala/nba_roster_analytic.git
Created: 2020-10-30T19:54:48Z
Project community: https://github.com/hicala/nba_roster_analytic

License:


Web Scraping using Beautiful Soup

Summary

This study is part of a series of statistical analyses of the composition of NBA rosters and the salaries earned by key players.

I am using Beautiful Soup for this Python app. Beautiful Soup is a Python library for parsing data out of HTML and XML files (aka webpages). It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree.
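
As a minimal sketch of what that navigation looks like (using a tiny made-up HTML snippet, not the roster page):

  import bs4

  # A made-up document, just to illustrate parse-tree navigation
  html = "<html><body><h1>Roster</h1><a href='/team'>team link</a></body></html>"
  doc = bs4.BeautifulSoup(html, "html.parser")
  print(doc.h1.text)             # -> Roster
  print(doc.find('a')['href'])   # -> /team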

The data I used came from the Atlanta Hawks roster page. Reference: https://www.espn.com/nba/team/roster/_/name/atl/atlanta-hawks

Home Page

Methodology

  1. Import Modules
  2. Get the URL link
  3. Navigate the URL Data Structure
  4. Testing out data requests
  5. Write data to a file, in pseudo-code:
    • Open a file for writing and appending data.
    • Write the headers.
    • Run a for loop that strips the HTML tags and appends the cleaned values to a results array.
    • Run a for loop that writes the elements of the array to the file.
    • When complete, close the file.
  6. Save the output file in CSV format.

Main goal

  • Access all of the content from the source code of the webpage with Python.
  • Parse and extract the data.
  • Save the info in a CSV file for further analysis.

Data info extracted:

Name, POS, Age, HT, WT, College, and Salary of the team roster

Atlanta Hawks Roster

If you don’t have Beautiful Soup, install it with ‘conda install beautifulsoup4’ in a terminal.

Python requires us to explicitly load the libraries that we want to use:

  import requests
  import bs4
  import re
  import pandas as pd

Load the webpage into Python so that we can parse and manipulate it:

  URL = 'https://www.espn.com/nba/team/roster/_/name/atl/atlanta-hawks'

Control the connection. These two lines request the page and turn the website's HTML into a Python object:

  response = requests.get(URL)
  soup = bs4.BeautifulSoup(response.text, "html.parser")
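
Before parsing, it can be worth confirming the request actually succeeded; a small optional check (my addition, not part of the original script):

  # Optional sanity check: a 200 status code means the page was retrieved
  if response.status_code != 200:
      raise RuntimeError("Request failed with status " + str(response.status_code))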

Find all the tags with the class inline:

  data = soup.findAll(attrs={'class':['inline']})
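
To confirm the selector grabs the right cells, a quick peek at the first few matches helps; get_text() here is just an alternative to the regex stripping used further below:

  # Print the first few matched elements to verify the class selector
  for element in data[:5]:
      print(element.get_text())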

Source Code HTML

Open a new file, making sure the path to your data file is correct, then write the headers:

  f = open('hilca_nba_team_roster.csv','w')
  f.write("Name\tPos\tAge\tHT\tWT\tCollege\tSalary" + "\n")  # tab-separated header row
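
As an aside, the standard csv module could produce the same tab-delimited layout; this is only a sketch (the demo filename is hypothetical, and the script itself keeps using f):

  import csv

  # Sketch: csv.writer handles the delimiter and row endings for us
  with open('roster_header_demo.csv', 'w', newline='') as out:
      writer = csv.writer(out, delimiter='\t')
      writer.writerow(['Name', 'Pos', 'Age', 'HT', 'WT', 'College', 'Salary'])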

Strip the HTML tags and append the cleaned text to the results array:

  results = []
  TAG_RE = re.compile(r'<[^>]+>')  # matches any HTML tag
  for element in data:
      text = TAG_RE.sub('', str(element))  # drop the tags, keep the text
      results.append(text)
  i = 0
  j = 0
  for item in results:
      if not item:                   # an empty entry marks a row boundary
          i = 0
          j = j + 1
          if j > 1: f.write("\n")    # start a new row after the first
      else:
          i = i + 1
          if i == 1: f.write(item + "\t")  # write name and add tabulator
          if i == 2: f.write(item + "\t")  # write pos and add tabulator
          if i == 3: f.write(item + "\t")  # write age and add tabulator
          if i == 4: f.write(item + "\t")  # write ht and add tabulator
          if i == 5: f.write(item + "\t")  # write wt and add tabulator
          if i == 6: f.write(item + "\t")  # write college and add tabulator
          if i == 7: f.write(item)         # write salary (no trailing tab)
  f.close()  # close the file
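
Note that pandas is imported above but never used in the loop. As an alternative sketch, assuming the roster table is present in the static HTML (pd.read_html also needs lxml or html5lib installed), pandas can parse the table directly; the output filename here is just illustrative:

  # Sketch: let pandas find and parse the HTML tables in the page
  tables = pd.read_html(response.text)
  roster = tables[0]   # assume the first table found is the roster
  roster.to_csv('hilca_nba_team_roster_pandas.csv', index=False)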

CSV data

Conclusions

We used Beautiful Soup as the main tool. The major concept behind Beautiful Soup is that it lets you access elements of a page by following its CSS structure, such as grabbing all links, all headers, or specific classes. It is a powerful library.
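
A few illustrative one-liners of that kind of access (not from the scraper above; the selector reuses the 'inline' class targeted earlier):

  links = soup.find_all('a')              # every link on the page
  headers = soup.find_all(['h1', 'h2'])   # all top-level headers
  cells = soup.select('.inline')          # a specific class, via a CSS selector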

Once we grab elements, Python makes it easy to write the elements or relevant components of the elements into other files, such as a CSV, that can be stored in a database or opened in other software.