raw-data-state-of-the-union

A scraper for the State of the Union speeches of the United States of America, written in Python with beautifulsoup4.

Purpose. The scripts in this repository provide an easy way to scrape the State of the Union speeches from https://www.presidency.ucsb.edu/documents/presidential-documents-archive-guidebook/annual-messages-congress-the-state-the-union.

Instructions.

Clone repository and install the required Python modules.

```
git clone https://github.com/stressosaurus/raw-data-state-of-the-union.git
cd raw-data-state-of-the-union/
pip install -r requirements.txt
```

Start scraping the website for the speeches by using the command below.
```
python wrangleSotu.py
```
The above command creates an ‘html_files’ folder containing the HTML files of the speeches, and a separate ‘sotu.pkl’ file containing the processed data for easy access. The data is stored as a pandas DataFrame with the columns ‘year’, ‘month’, ‘day’, ‘president’, ‘title’, and ‘text’.
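
As a rough illustration of the kind of parsing a scraper like this performs, the sketch below pulls a title and body text out of an HTML string with beautifulsoup4. The sample markup, the class names `speech-title` and `speech-body`, and the helper `parse_speech` are all hypothetical stand-ins; they are not the archive's actual markup or the logic used in wrangleSotu.py.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; the real pages saved in html_files will differ.
SAMPLE_HTML = """
<html><body>
  <h1 class="speech-title">Example Annual Message</h1>
  <div class="speech-body">
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </div>
</body></html>
"""


def parse_speech(html):
    """Return a dict with the title and body text of one speech page."""
    soup = BeautifulSoup(html, "html.parser")
    # The class names below are assumptions made for this sketch.
    title = soup.find("h1", class_="speech-title")
    body = soup.find("div", class_="speech-body")
    # Collect the text of every <p> in the body, one paragraph per line.
    paragraphs = [p.get_text(strip=True) for p in body.find_all("p")] if body else []
    return {
        "title": title.get_text(strip=True) if title else None,
        "text": "\n".join(paragraphs),
    }


if __name__ == "__main__":
    print(parse_speech(SAMPLE_HTML))
```

Returning `None` or an empty string when an element is missing keeps one malformed page from stopping a long scraping run.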

You can open the “sotu.pkl” file by using the pandas module in Python.

```
import pandas as pd
sotu_df = pd.read_pickle('sotu.pkl')
print(sotu_df)
```
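
Once loaded, the DataFrame can be queried like any other. The sketch below uses a small hand-made stand-in frame with the same six columns and a hypothetical helper, `speeches_per_president`, that counts speeches and total words per president. The rows are placeholders, not real speech data, and the integer types for ‘year’, ‘month’, and ‘day’ are an assumption.

```python
import pandas as pd

# Placeholder rows with the documented columns; swap in
# pd.read_pickle('sotu.pkl') to work on the scraped data.
sample_df = pd.DataFrame(
    {
        "year": [2001, 2002, 2003],
        "month": [1, 1, 2],
        "day": [10, 20, 5],
        "president": ["President A", "President A", "President B"],
        "title": ["Message 1", "Message 2", "Message 3"],
        "text": ["one two three", "four five", "six"],
    }
)


def speeches_per_president(df):
    """Count speeches and total whitespace-separated words per president."""
    words = df["text"].str.split().str.len()
    return (
        df.assign(words=words)
        .groupby("president")
        .agg(speeches=("title", "count"), total_words=("words", "sum"))
    )


if __name__ == "__main__":
    print(speeches_per_president(sample_df))
```

Splitting on whitespace is only a crude word count; a proper tokenizer would be a better fit for serious text analysis.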