Austrian Aid Scraper
The scraper extracts information on Austrian development projects since 2010 from the Austrian Development Agency (ADA) website. The automatically extracted information is stored in CSV and JSON files to make further use as easy as possible.
This repository provides the code and documentation and keeps track of bugs as well as feature requests.
Used software
The source code is written in Python 2. It was created using IPython, BeautifulSoup4 and urllib2.
Description
The scraper fetches the HTML of the overview page containing the table, stores it locally and parses out the data with BeautifulSoup4. It then downloads every aid project entry and parses the description from it. Finally, the data is stored as JSON and CSV files for easy later use.
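The table-parsing step can be illustrated with a minimal sketch. This is not the code from aid-scraper.py: it is written for Python 3 and uses the standard library's html.parser instead of BeautifulSoup4 so that it runs without extra dependencies. The names TableRowParser, parse_overview and SAMPLE are hypothetical, and the sample markup is a heavily simplified stand-in for the real overview page.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def parse_overview(html):
    """Return one dict per data row, keyed by the header row's cell texts."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


# Hypothetical, simplified overview markup (an assumption, not the real page).
SAMPLE = """
<table>
  <tr><th>Vertragsnummer</th><th>Land/Region</th></tr>
  <tr><td>2325-02/2016</td><td>Guatemala</td></tr>
</table>
"""

projects = parse_overview(SAMPLE)
```

Keying each row by the header texts keeps the sketch independent of the column order, which may help if the published table layout changes.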
Run scraper
Go to the root folder of this repository and execute the following commands in your terminal:
cd code
python aid-scraper.py
Original source code
Thanks to Christian Goebel for the original source code, which was used for the final version.
Configure the Scraper
There are two global variables in aid-scraper.py that you may want to adapt to your needs.
datetime.now().strftime('%Y-%m-%d-%H-%M'), so it is the timestamp when the scraper starts.
Download raw html
Here all the raw HTML is downloaded and stored locally, and the basic data is parsed.
Parse html
Here the description of each project is added to the data.
Export as CSV
Here the data is exported as a CSV file.
The original data comes from the project list that the Austrian Development Agency (ADA) publishes on its website. It consists of all contracts approved since January 1, 2010, listed in reverse chronological order. The date of the last update can be found on the first table page as “Datum der letzten Aktualisierung” (date of last update).
The tables hold the basic data, from which most of the data is parsed. The data is published in the following structure (e.g. the first project):
Vertragsnummer | Vertragstitel | Land/Region | OEZA/ADA-Vertragssumme | Vertragspartner |
---|---|---|---|---|
2325-02/2016 | Programm zum Schutz der MenschenrechtsverteidigerInnen in der westlichen Region Guatemalas | Guatemala | EUR 64.300,00 | HORIZONT3000 - Österreichische Organisation für Entwicklungszusammena |
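The contract sum in the table uses German number notation ("." for thousands, "," for decimals). If you want to calculate with it, a conversion along the following lines may help. This is a minimal Python 3 sketch; the helper name parse_contract_volume is hypothetical and not part of aid-scraper.py, and it assumes the value always has the form "currency, space, amount" as in the row above.

```python
from decimal import Decimal


def parse_contract_volume(text):
    """Split a string such as 'EUR 64.300,00' into (currency, Decimal amount)."""
    currency, number = text.split()
    # German notation: '.' groups thousands, ',' marks the decimal places.
    normalized = number.replace(".", "").replace(",", ".")
    return currency, Decimal(normalized)


currency, amount = parse_contract_volume("EUR 64.300,00")
```

Decimal is used instead of float so that money amounts are not affected by binary rounding.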
Attributes
When you click on the contract title in a table, you get to the project page. It contains the same data as the table view, plus an additional description text (named “Beschreibung”).
So far, we cannot say anything about the data quality (completeness, accuracy, etc.), but we have also found no reason to doubt it.
Data errors found
raw html
The scraper downloads the raw HTML of every table page and every project page.
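Storing the raw HTML locally could look roughly like the sketch below. It is written for Python 3 (urllib.request instead of urllib2), and the function names, the file naming and the "skip the download if the file already exists" behaviour are assumptions for illustration, not taken from aid-scraper.py. The folder name reuses the timestamp format from the configuration section.

```python
import os
import tempfile
from datetime import datetime
from urllib.request import urlopen


def fetch_html(url):
    """Download a page and return its raw bytes (needs network access)."""
    with urlopen(url) as response:
        return response.read()


def save_raw_html(folder, name, url, fetch=fetch_html):
    """Return the raw HTML for url, downloading it only if it is not stored yet."""
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, name + ".html")
    if not os.path.exists(path):
        with open(path, "wb") as handle:
            handle.write(fetch(url))
    with open(path, "rb") as handle:
        return handle.read()


# One folder per run, named with the scraper's start timestamp.
run_id = datetime.now().strftime('%Y-%m-%d-%H-%M')
raw_folder = os.path.join(tempfile.mkdtemp(), run_id)

# A stand-in fetcher keeps this demo offline; the URL is a placeholder.
calls = []


def fake_fetch(url):
    calls.append(url)
    return b"<html><body>demo</body></html>"


first = save_raw_html(raw_folder, "table-1", "http://example.invalid/table", fake_fetch)
second = save_raw_html(raw_folder, "table-1", "http://example.invalid/table", fake_fetch)
```

Passing the fetch function as a parameter makes it possible to try the storage logic without hitting the website; in a real run you would leave it at its default.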
aid data JSON
The parsed data is stored in an easy-to-read JSON file for further usage.
[
{
'contract-number': contract number of the project
'contract-title': title of the project
'country-region': country and/or region, where the project takes place
'OEZA-ADA-contract-volume': amount of funding by the Austrian Development Agency
'contract-partner': partner organisation(s)
'description': description text of the project
'url': url of the project page
},
]
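As a rough illustration of this structure, the following Python 3 sketch writes one record with the keys listed above to a JSON file and reads it back. The description, partner and URL values are placeholders, and the file name is an assumption. Note that a real JSON file uses double quotes for keys and strings.

```python
import json
import os
import tempfile

# One example record; description, partner and url are placeholder values.
record = {
    "contract-number": "2325-02/2016",
    "contract-title": "Programm zum Schutz der MenschenrechtsverteidigerInnen "
                      "in der westlichen Region Guatemalas",
    "country-region": "Guatemala",
    "OEZA-ADA-contract-volume": "EUR 64.300,00",
    "contract-partner": "(placeholder partner)",
    "description": "(placeholder description)",
    "url": "http://example.invalid/project",
}

json_path = os.path.join(tempfile.mkdtemp(), "aid-data.json")

# ensure_ascii=False keeps umlauts readable instead of writing \u escapes.
with open(json_path, "w", encoding="utf-8") as handle:
    json.dump([record], handle, ensure_ascii=False, indent=2)

with open(json_path, encoding="utf-8") as handle:
    loaded = json.load(handle)
```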
aid data csv
The parsed data is stored in a human-readable CSV file for further usage.
columns (see attribute description above):
row: one project per row.
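A CSV export with one project per row can be sketched as follows in Python 3. The column order is an assumption (it simply follows the attribute list of the JSON structure), the names FIELDS and to_csv are hypothetical, and the record values other than those shown in the table above are placeholders.

```python
import csv
import io

# Assumed column order, following the attribute list of the JSON file.
FIELDS = [
    "contract-number",
    "contract-title",
    "country-region",
    "OEZA-ADA-contract-volume",
    "contract-partner",
    "description",
    "url",
]


def to_csv(records):
    """Serialize a list of project dicts to CSV text with a header row."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


records = [{
    "contract-number": "2325-02/2016",
    "contract-title": "(placeholder title)",
    "country-region": "Guatemala",
    "OEZA-ADA-contract-volume": "EUR 64.300,00",
    "contract-partner": "(placeholder partner)",
    "description": "(placeholder description)",
    "url": "http://example.invalid/project",
}]

csv_text = to_csv(records)

# Reading the text back shows that the comma inside the amount survives.
rows = list(csv.DictReader(io.StringIO(csv_text, newline="")))
```

The csv module quotes fields that contain a comma, so amounts such as "EUR 64.300,00" do not break the column layout.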
In the spirit of free software, everyone is encouraged to help improve this project.
Here are some ways you can contribute:
When you are ready, submit a pull request.
We use the GitHub issue tracker to track bugs and features. Before submitting a bug report or feature request, check to make sure it hasn’t already been submitted. When submitting a bug report, please try to provide a screenshot that demonstrates the problem.
All content is openly licensed under the Creative Commons Attribution 4.0 license, unless otherwise stated.
All source code is free software: you can redistribute it and/or modify it under the terms of the MIT License.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Visit http://opensource.org/licenses/MIT to learn more about the MIT License.
Aid
Documentation
See the whole history. The current version follows.
extended scraper