Academic project for the web data integration course held by Prof. G. Costagliola at the Dipartimento di Informatica (‘Department of Computer Science’) of the University of Salerno.
Adbis is an ebook and audiobook aggregator that lets its users buy books from several e-commerce websites by querying a single website.
The available sources are the following ones:
The sources' page layouts may change without notice, which can cause the application to stop working as expected at any time.
The Adbis architecture is based on a mediator over the above-mentioned sources. Previously retrieved results are stored in a database that acts as a cache.
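The mediator-plus-cache idea can be illustrated with a minimal sketch. Everything named here (`SourceWrapper`, `StaticWrapper`, `Mediator`, the in-memory array standing in for the database) is an assumption made for illustration and is not taken from the Adbis source code.

```php
<?php
// Hypothetical sketch of a mediator with a cache; names and structure are
// assumptions for illustration, not the actual Adbis classes.

interface SourceWrapper
{
    /** @return string[] titles found for the keyword */
    public function search(string $keyword): array;
}

// Stand-in wrapper that filters canned titles instead of querying a real site.
class StaticWrapper implements SourceWrapper
{
    private $titles;
    public $calls = 0; // counts how often the source was actually queried

    public function __construct(array $titles)
    {
        $this->titles = $titles;
    }

    public function search(string $keyword): array
    {
        $this->calls++;
        $matches = array_filter($this->titles, function (string $title) use ($keyword): bool {
            return stripos($title, $keyword) !== false;
        });
        return array_values($matches);
    }
}

class Mediator
{
    private $wrappers;
    private $cache = []; // in-memory stand-in for the database cache

    public function __construct(array $wrappers)
    {
        $this->wrappers = $wrappers;
    }

    public function search(string $keyword): array
    {
        $key = strtolower(trim($keyword));
        if (array_key_exists($key, $this->cache)) {
            return $this->cache[$key]; // cache hit: no source is contacted
        }
        $results = [];
        foreach ($this->wrappers as $wrapper) {
            $results = array_merge($results, $wrapper->search($keyword));
        }
        $this->cache[$key] = $results;
        return $results;
    }
}

$first = new StaticWrapper(['Dune', 'Emma']);
$second = new StaticWrapper(['Dune Messiah']);
$mediator = new Mediator([$first, $second]);

$results = $mediator->search('dune'); // ['Dune', 'Dune Messiah']
$again = $mediator->search('Dune');   // served from the cache
```

In this sketch the second call normalizes to the same cache key, so each wrapper is queried only once.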
Apart from Google Play Books, which exposes an API, the sources had to be scraped to retrieve their data. The scraping classes are implemented as a hierarchy: common methods are gathered in the abstract superclass, and each subclass specializes in the type of item it scrapes.
According to this scheme, the abstract class `Scraper` is the superclass of the `BookScraper`, `AudiobookScraper` and `ReviewScraper` subclasses.
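A minimal sketch of such a hierarchy follows. The class names come from the description above, while the methods (`clean`, `itemType`, `describe`) are hypothetical and only show how shared logic could live in the superclass.

```php
<?php
// Sketch only: the class names match the README, the methods are assumed.

abstract class Scraper
{
    // Shared helper: collapse runs of whitespace and trim the ends.
    protected function clean(string $raw): string
    {
        return trim(preg_replace('/\s+/', ' ', $raw));
    }

    // Each subclass states which type of item it scrapes.
    abstract public function itemType(): string;

    // Common behaviour built on top of the subclass-specific part.
    public function describe(string $rawTitle): string
    {
        return $this->itemType() . ': ' . $this->clean($rawTitle);
    }
}

class BookScraper extends Scraper
{
    public function itemType(): string
    {
        return 'book';
    }
}

class AudiobookScraper extends Scraper
{
    public function itemType(): string
    {
        return 'audiobook';
    }
}

class ReviewScraper extends Scraper
{
    public function itemType(): string
    {
        return 'review';
    }
}

$label = (new BookScraper())->describe("  Moby \n Dick "); // 'book: Moby Dick'
```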
Every scraper connects to search pages via cURL. The resulting pages are scraped with XPath queries, which are stored in the source wrappers. The extracted strings are validated, and for each valid result a new entity is built and added to a set; the scraper returns this set to the wrapper, which in turn returns it to the mediator.
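The fetch-then-query flow can be sketched with PHP's cURL and DOM extensions. The function names (`fetchPage`, `extractTitles`), the sample markup and the XPath expression are assumptions for illustration; the real sources use their own markup and queries.

```php
<?php
// Sketch under stated assumptions: function names, markup and the XPath
// expression are hypothetical, not taken from the Adbis code.

// Download a page with cURL and return its body as a string.
function fetchPage(string $url): string
{
    $handle = curl_init($url);
    curl_setopt_array($handle, [
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_TIMEOUT => 10,
    ]);
    $html = curl_exec($handle);
    if ($html === false) {
        $error = curl_error($handle);
        curl_close($handle);
        throw new RuntimeException('cURL request failed: ' . $error);
    }
    curl_close($handle);
    return $html;
}

// Run an XPath query on an HTML string and keep only non-empty text values.
function extractTitles(string $html, string $query): array
{
    $document = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence warnings on sloppy HTML
    $document->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($document);
    $titles = [];
    foreach ($xpath->query($query) as $node) {
        $title = trim($node->textContent);
        if ($title !== '') { // basic validity check on the extracted string
            $titles[] = $title;
        }
    }
    return $titles;
}

// Static markup is used here so the example runs without network access.
$html = '<html><body>'
    . '<div class="result"><h2 class="title"> Dune </h2></div>'
    . '<div class="result"><h2 class="title"></h2></div>'
    . '<div class="result"><h2 class="title">Emma</h2></div>'
    . '</body></html>';
$titles = extractTitles($html, '//div[@class="result"]/h2[@class="title"]');
// $titles is ['Dune', 'Emma']; the empty heading is discarded
```

In a real scraper the HTML would come from `fetchPage()` and the query would be supplied by the wrapper of the source being scraped.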
To determine whether a result is similar to the keyword queried by the user, we implemented a similarity metric based on the Jaccard index, a value in the `[0, 1]` range that expresses how similar two strings are.
The basic algorithm is divided into the following steps:

1. split the `keyword` and `target` strings into tokens;
2. build the keyword set `K` and the target set `T`;
3. compute the Jaccard index of `K` and `T`: `J(K, T) = |K ∩ T| / |K ∪ T|`;
4. if `J(K, T)` is greater than or equal to 0.5, then the `keyword` and `target` strings are similar;
5. otherwise, check whether set `K` is contained in set `T` or vice versa: if one of them is contained in the other, consider the `keyword` and `target` strings similar.

The backend is written in object-oriented PHP 7; MySQL is used to cache data about previous search results; the front-end interface is built with Bootstrap.
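As a rough illustration of the similarity steps listed earlier, here is a minimal PHP sketch. The function names, the whitespace-based tokenization, the ASCII lowercasing and the handling of empty input are assumptions; the actual implementation may tokenize and normalize differently.

```php
<?php
// Sketch of a Jaccard-based similarity check; tokenization rules and
// function names are assumptions made for illustration.

// Turn a string into a set (array of unique lowercase tokens).
function tokenSet(string $text): array
{
    $tokens = preg_split('/\s+/', strtolower(trim($text)), -1, PREG_SPLIT_NO_EMPTY);
    return array_values(array_unique($tokens));
}

// J(K, T) = |K ∩ T| / |K ∪ T|, defined here as 0 when both sets are empty.
function jaccardIndex(array $k, array $t): float
{
    $union = count(array_unique(array_merge($k, $t)));
    if ($union === 0) {
        return 0.0;
    }
    return count(array_intersect($k, $t)) / $union;
}

function isSimilar(string $keyword, string $target, float $threshold = 0.5): bool
{
    $k = tokenSet($keyword);
    $t = tokenSet($target);
    if ($k === [] || $t === []) {
        return false; // assumption: empty input never matches
    }
    if (jaccardIndex($k, $t) >= $threshold) {
        return true;
    }
    // Fallback: one set fully contained in the other.
    return array_diff($k, $t) === [] || array_diff($t, $k) === [];
}

$byIndex = isSimilar('dune messiah', 'Dune Messiah Audiobook');         // true, J = 2/3
$byContainment = isSimilar('dune', 'Dune Frank Herbert Deluxe Edition'); // true, J = 0.2 but K is contained in T
$unrelated = isSimilar('dune messiah', 'Emma');                          // false
```

The containment fallback matters for short keywords: a one-word query against a long title yields a low Jaccard index even when every query word appears in the title.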
PHP dependencies are managed with Composer, JavaScript dependencies with npm.
After cloning or downloading the repository or a release, make sure to run the following commands (`composer` and `npm` have to be installed):

- `composer install`
- in the `view` subfolder, `npm install` (dependencies should be installed anyway, despite security warnings)

A web server has to be configured in order to use the routing functionality properly (PHP's built-in development server isn't enough for that; please rely on Apache or nginx).
Adbis has been developed on an Apache server properly configured to support PHP; the following alias configuration makes it reachable at the http://localhost:8080/adbis/ URL:
Alias /adbis "<parentDir>/adbis/"
<Directory "<parentDir>/adbis">
Options Indexes FollowSymLinks MultiViews ExecCGI
AllowOverride All
Require all granted
</Directory>
Also make sure that the `mod_rewrite` module is enabled.
Adbis authors are Antonio Addeo (@AddeusExMachina) and Simone Bisogno (@bissim).