This repository contains Excadrill, a deep web data extraction and analytics platform.
The code in this repository was created for a hackathon, so it contains
unfinished work and is generally unpolished.
Excadrill crawls Tor hidden services related to predefined keywords, extracts
meaningful data from them, and presents it in a form from which conclusions can
be drawn. It is divided into two parts: discovery and analytics.
The discovery module crawls .onion domains, using results from several deep web
search engines as a starting point. Each page is parsed to separate high-quality
content from noise, using an algorithm specifically optimized for the structure
of typical Tor hidden services. Crawling is prioritized by several parameters
and heuristics, and the results are stored in a PostgreSQL database for further
interaction.
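
As an illustration of heuristic crawl prioritization, the sketch below keeps a
priority queue of URLs waiting to be fetched. It is not the code used in this
repository: the `CrawlFrontier` class, the scoring formula (keyword hits minus a
depth penalty), and the example .onion addresses are all assumptions made for
the example.

```python
import heapq
import itertools
from dataclasses import dataclass, field


@dataclass(order=True)
class QueuedUrl:
    priority: float                 # lower value is popped first
    order: int                      # tie-breaker: insertion order
    url: str = field(compare=False)


class CrawlFrontier:
    """Hypothetical crawl frontier: best-scoring URLs are crawled first."""

    def __init__(self, keywords):
        self.keywords = [k.lower() for k in keywords]
        self._heap = []
        self._seen = set()
        self._counter = itertools.count()

    def score(self, url, anchor_text, depth):
        # Assumed heuristic: one point per keyword found in the URL or the
        # link text, minus a penalty for links found deep in the crawl.
        text = (url + " " + anchor_text).lower()
        hits = sum(1 for k in self.keywords if k in text)
        return hits - 0.5 * depth

    def push(self, url, anchor_text="", depth=0):
        # Skip URLs that were already queued, so each is crawled once.
        if url in self._seen:
            return False
        self._seen.add(url)
        s = self.score(url, anchor_text, depth)
        # heapq is a min-heap, so the score is negated.
        heapq.heappush(self._heap, QueuedUrl(-s, next(self._counter), url))
        return True

    def pop(self):
        return heapq.heappop(self._heap).url

    def __len__(self):
        return len(self._heap)


frontier = CrawlFrontier(["market"])
frontier.push("http://aaa.onion/", "home", depth=0)
frontier.push("http://bbb.onion/market", "market listings", depth=1)
frontier.push("http://ccc.onion/about", "about", depth=2)
crawl_order = [frontier.pop() for _ in range(len(frontier))]
```

With these assumed weights, the page that mentions the keyword is crawled
before the shallower but irrelevant ones. A real crawler would likely combine
more signals and persist the queue.
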
The analytics module analyzes the stored data and extracts several useful
features, such as dates and times, geographical locations, and sentiments. The
results are summarized, stored in the same database, and visualized on an HTML5
dashboard developed using React and Semantic UI.
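
To make the feature extraction step more concrete, here is a minimal sketch of
pulling dates out of page text with regular expressions. The `extract_dates`
function and the two supported formats (ISO `YYYY-MM-DD` and `12 March 2016`)
are assumptions for illustration; the actual analytics code may work
differently.

```python
import re
from datetime import date

# Month names mapped to their numbers (1 to 12).
MONTHS = {name: i for i, name in enumerate(
    ["january", "february", "march", "april", "may", "june", "july",
     "august", "september", "october", "november", "december"], start=1)}

ISO_RE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")
LONG_RE = re.compile(
    r"\b(\d{1,2})\s+(" + "|".join(MONTHS) + r")\s+(\d{4})\b",
    re.IGNORECASE,
)


def extract_dates(text):
    """Return the valid dates found in text, in order of appearance."""
    found = []
    for m in ISO_RE.finditer(text):
        year, month, day = map(int, m.groups())
        found.append((m.start(), year, month, day))
    for m in LONG_RE.finditer(text):
        month = MONTHS[m.group(2).lower()]
        found.append((m.start(), int(m.group(3)), month, int(m.group(1))))

    results = []
    for _, year, month, day in sorted(found):
        try:
            results.append(date(year, month, day))
        except ValueError:
            # Looks like a date but is not one (e.g. month 13); skip it.
            pass
    return results


sample = "Posted 2016-03-12, updated 14 March 2016, bogus 2016-13-40."
dates = extract_dates(sample)
```

Here `dates` holds the two valid dates and the impossible one is dropped.
Extracted values like these could then be aggregated and stored for the
dashboard to chart.
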
Due to the limited time available for the completion of the project (48 hours),
only a few example analysis tools and visualizations have been developed, as a
proof of concept.
The dashboard also shows the current status of the discovery process running on
the server, with the ability to start new crawling jobs from given keywords and
monitor the progress of existing ones.
The widgets provided are: