Project author: italia

Project description:
Primary language: Scala
Project address: git://github.com/italia/daf-semantic-triplifier.git
Created: 2018-11-27T09:08:44Z
Project community: https://github.com/italia/daf-semantic-triplifier

License: Apache License 2.0

RDF triplifier

[README last update: 2018-07-11] pre-alpha version

This component provides a simple microservice for creating an RDF representation of data from a JDBC connector.

The RDF processor used is Ontop, which implements the W3C standard R2RML mapping language for tabular-to-RDF conversion.

NOTE (Impala)

The ssl_impala folder should be created under the root folder of the project and should contain the following files:

  ├── ssl_impala
  │   ├── jssecacerts
  │   ├── master-impala.jks
  │   └── master-impala.pem
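
For reference, a minimal sketch of how this folder could be prepared (the source paths of the certificate files are placeholders and depend on your Impala setup):

  # create the folder under the project root and copy the Impala SSL material into it
  mkdir -p ssl_impala
  cp /path/to/jssecacerts       ssl_impala/
  cp /path/to/master-impala.jks ssl_impala/
  cp /path/to/master-impala.pem ssl_impala/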

NOTE (dependencies)

This project uses third-party dependencies, which are included under the local /lib folder (there is currently no publicly available Maven repository for the DAF components):

  ├── /lib
  │   ├── ImpalaJDBC41.jar                    (required for Impala)
  │   ├── TCLIServiceClient.jar               (required for Impala)
  │   └── http-api-jersey-0.2.0-SNAPSHOT.jar  (uber jar)
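
A quick sanity check before building could simply verify that the expected jars are in place, e.g.:

  # these third-party jars are expected under /lib
  ls lib/
  # expected: ImpalaJDBC41.jar  TCLIServiceClient.jar  http-api-jersey-0.2.0-SNAPSHOT.jar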

stateless endpoint

A simple (stateless) version of the endpoint for executing the R2RML mapping process can be used as follows:

  curl -X POST 'http://localhost:7777/kb/api/v1/triplify/process' \
    -H "accept: text/plain" \
    -H "content-type: application/x-www-form-urlencoded" \
    --data-binary "config=${config}" \
    --data-binary "r2rml=${r2rml}" \
    -d 'format=text/turtle'

NOTE that this version of the service expects the actual content of the mapping, so when using curl it’s best to prepare it
using a shell variable such as r2rml=`cat r2rml_file` before launching curl.
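
For example, a complete invocation could look like the following sketch (config_file and r2rml_file are placeholder names for your own connection configuration and R2RML mapping):

  # load the connection configuration and the R2RML mapping into shell variables
  config=$(cat config_file)
  r2rml=$(cat r2rml_file)

  # send both to the stateless endpoint, asking for Turtle output
  curl -X POST 'http://localhost:7777/kb/api/v1/triplify/process' \
    -H "accept: text/plain" \
    -H "content-type: application/x-www-form-urlencoded" \
    --data-binary "config=${config}" \
    --data-binary "r2rml=${r2rml}" \
    -d 'format=text/turtle'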

The /script directory contains some examples, which can be extended.

Otherwise, we can test the endpoint by using the example page http://localhost:7777/static/r2rml.html:

[screenshot: http_rdf_processor example page]


dataset-oriented endpoint

Another endpoint is provided, which may be useful for invoking the process on a per-dataset basis, for example from a workflow/pipeline orchestrator.
This can be useful when a mapping process needs to be split up, creating several different datasets (for example, one dataset for each resource type).

The structure of a call is the following:

  /kb/api/v1/triplify/datasets/{group}/{dataset_path}.{ext}?cached={T|F}

The idea is to expose the last created RDF representation of a dataset, unless an explicit cached=false parameter is provided.
This way the dump is generated the first time the endpoint is called, while on subsequent calls we can choose to re-use the data already created.
The group parameter is simply a convenient way to keep test data separate from the rest, while dataset_path can be used to create further subdivisions.
The mappings need to be prepared on disk accordingly, as explained later.

Example: creating the RDF representation for the regions dataset

  curl -X GET \
    'http://localhost:7777/kb/api/v1/triplify/datasets/test/territorial-classifications/regions.ttl?cached=true' \
    -H "accept: text/plain" \
    -H "content-type: application/x-www-form-urlencoded"

Each configuration on disk will have a structure similar to the one used for testing with the SQLite example database:

  ├── /data
  │   └── test
  │       └── territorial-classifications
  │           ├── cities
  │           │   └── ...
  │           ├── provinces
  │           │   └── ...
  │           └── regions
  │               ├── regions.conf
  │               ├── regions.metadata.ttl
  │               └── regions.r2rml.ttl
  ...
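
A new dataset can then be added by replicating this layout; as a rough sketch (municipalities is just a hypothetical name, and it is assumed that the .conf file carries the connection configuration and the .r2rml.ttl file the R2RML mapping, as for the stateless endpoint):

  # create the folder for a hypothetical new dataset and its three files
  mkdir -p data/test/territorial-classifications/municipalities
  cd data/test/territorial-classifications/municipalities
  touch municipalities.conf municipalities.metadata.ttl municipalities.r2rml.ttl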

TODO:

  • add config example
  • add R2RML example
  • add metadata example / explanation

SEE ALSO: daf-semantics project


running locally

  mvn clean package
  # Windows
  java -cp "target/triplifier-0.0.5.jar;target/libs/*" triplifier.main.MainHTTPTriplifier
  # *nix
  java -cp "target/triplifier-0.0.5.jar:target/libs/*" triplifier.main.MainHTTPTriplifier
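
Once the service is up, a quick smoke test could be something like:

  # the example page should answer with HTTP 200 on the default port
  curl -s -o /dev/null -w "%{http_code}\n" http://localhost:7777/static/r2rml.html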

TODO

  • update the Dockerfile
  • merge the external manual Swagger definitions
  • fix Swagger problems with multi-line content: try updating to version 3+
  • add an internal interlinking strategy, using Silk/Duke

Ideally we could imagine having some specific microservices:

  • one for handling the merging of RDF and its direct publication
  • one for creating relations between the current datasource and an external target, using Silk or Duke