Data Protection Enforcement Cases in Singapore
This package contains utilities which allow you to create a corpus of decisions from the
Personal Data Protection Commission of Singapore’s
Data Protection Enforcement Cases.
The primary use of such a corpus is for studying, possibly using data science tools such as
natural language processing.
It currently has the following features:
The features are discovered by passing --extras
to the command.
I dockerised the application for my personal ease of use.
It is probably the easiest and most straight-forward way
to use the application and I recommend it too.
The dockerised application also contains all pre-requisites
so there is no need for any manual installs.
You need to have docker installed.
Pull the image from docker hub.
```shell script
docker pull houfu/pdpc-decisions
After that you can run the image and pass commands and arguments to it.
For example, if you would like the application to do all actions.
```shell script
docker run houfu/pdpc-decisions all
This isn’t clever because downloads will be stored in the docker image
and not easily accessed. Bind a volume in your filesystem and
use the --root
option to direct the application
to save the files there. For example:
```shell script
docker run \
—mount type=bind,source=”$(pwd)”/target,target=/code/download \ # Target directory must exist!
houfu/pdpc-decisions \
all \
—root /code/download/
### Local install
* Install via PIP
```shell script
pip install pdpc-decisions
Once the package is installed, used the command line tool pdpc-decisions
to use the script.
If necessary, install Chrome
and ChromeDriver for Selenium to work.
The main entry point for the script is pdpcdecision.py
The script accepts the following actions and options:
Accepts the following actions.
“all
“ Does all the actions (scraping the website, saving a csv,
downloading all files and creating a corpus).
“corpus
“ Converts PDF format of decisions into plain text files.
“csv
“ Save the items gathered by the scraper as a csv file.
“files
“ Downloads all the decisions from the PDPC website into a
folder.
Options:
--csv FILE
Filename for saving the items gathered by scraper as a
csv file. [default: scrape_results.csv]
--download DIRECTORY
Destination folder for downloads of all PDF/web pages
of PDPC decisions [default: download/]
--corpus DIRECTORY
Destination folder for PDPC decisions converted to
text files [default: corpus/]
-r, --root DIRECTORY
Root directory for downloads and files [default:
Your current working directory]
--extras/--no-extras
Add extra features to the data collected. This increases processing time. This feature is ignored if action is files
or downloads
.
(Experimental and requires reading of actual decisions)
[default: False, ‘—no-extras’]
--extra-corpus/--no-extra-corpus
Enable experimental features for corpus.
This increases processing time.
--verbose
Verbose output
--help
Show this message and exit.
Feel free to let me have your suggestions, comments or issues using the issue tracker or by
emailing me.
It would also be nice to hear how you have used this corpus by using the above contacts.