💨 Luft is a standard-operators replacement for Airflow with declarative DAGs via YAML files. It is basically a client that helps you with everyday BI tasks.
Airflow comes with batteries included - a couple of operators that make your BI work less painful. But after years of using it we stumbled upon some issues with the standard operators.
Luft solves most of those problems.
Luft is meant to run inside a Docker container (but of course it can run without it). It is just a simple Python library that wraps multiple services.
For example, for parallel and fast bulk loading of data from any JDBC database into BigQuery it uses Embulk, for executing BigQuery commands it uses the standard Python BQ library, etc.
All work is done by tasks, each defined in a YAML file (examples are in examples/tasks).
For example, loading the table Test from a MySQL database into S3 is one task, loading data from GA into S3 is another task, a historization script in BQ is another task, etc.
Mandatory parameters of every task are:

- source_system - if you do not specify it in your yaml file, it is taken from the folder path. E.g. if the task file is placed in example/tasks/world, then world will be the source system.
- source_subsystem - if you do not specify it in your yaml file, it is taken from the folder path as well. E.g. for example/tasks/world/public, public will be the source subsystem.

Tasks are organized into Task Lists - an array of work to be done for a certain period of time.
E.g. you want to download tables T1, T2 and T3 from a MySQL database into S3 from 2018-01-01 to 2019-05-02 (and you have a where condition on some date).
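With the bundled world example this maps to a single date-bounded command, for instance (a hypothetical run; the individual options are described per task type below):

```
# Process every task yaml under the world folder once per date,
# from 2018-01-01 up to (but not including) 2019-05-02
luft jdbc load -y world -s 2018-01-01 -e 2019-05-02
```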
Luft currently supports the following task types:
Run Embulk and load data from a JDBC database into S3 or GCS. Data is stored as CSV. Other output data formats will be added later.
`luft jdbc load`

Options:

- `-y`, `--yml-path` (mandatory): folder or a single yml file inside the default tasks folder (see luft.cfg).
- `-s`, `--start-date`: start date in the format YYYY-MM-DD for executing the task in a loop. If not specified, yesterday's date is used.
- `-e`, `--end-date`: end date in the format YYYY-MM-DD for executing the task in a loop. This day is not included. If not specified, today's date is used.
- `-sys`, `--source-system`: override the source_system parameter (see its description in the Task section above). Has to be the same as the name in jdbc.cfg to get the right credentials for the JDBC database.
- `-sub`, `--source-subsystem`: override the source_subsystem parameter (see its description in the Task section above).
- `-b`, `--blacklist`: names of tables/objects to be ignored during processing. E.g. `--yml-path gis` with `-b TEST` will process all objects in the gis folder except the object TEST.
- `-w`, `--whitelist`: names of tables/objects to be processed. E.g. `--yml-path gis` with `-w TEST` will process only the object TEST.
You also need a jdbc.cfg file with the right configuration. This file contains the basic JDBC configuration for all of your databases. Every database has to have a `[DATABASE_NAME]` header, and this has to be the same as source_system. The supported parameters follow the example in example/config/jdbc.cfg (a sketch is shown below). The database password can also be supplied through an environment variable named after the database, e.g. MY_DB_PASS; then `docker run -e MY_DB_PASS=Password123 luft jdbc load -y <path_to_yml>` should work.
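For illustration only, a jdbc.cfg section might look like the sketch below; the `[WORLD]` header mirrors the source_system as described above, but the individual keys are placeholders, so use example/config/jdbc.cfg as the reference:

```
# Hypothetical sketch - [WORLD] must match the task's source_system.
# Key names are placeholders; copy the real ones from example/config/jdbc.cfg.
[WORLD]
host = localhost
port = 5432
user = postgres
password = Password123
```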
Inside the yaml file, the following parameters are supported:

- type - `embulk-jdbc-load` by default but can be overridden. When overridden it is going to be a different kind of task :).
- thread_name - applicable only when used with Airflow. The thread name is automatically generated based on the number of threads. If you need this task to run in a completely different thread you can specify a custom thread name.
  E.g. I have tasks T1, T2, T3, T4 and T5 in my task list and the thread count set to 3. By default (if no task has thread_name specified) it will look like this in Airflow:

  ```
  |T1| -> |T4|
  |T2| -> |T5|
  |T3|
  ```

  When I specify any thread_name in task T4:

  ```
  |T1| -> |T5|
  |T2|
  |T3|
  |T4|
  ```

- color - applicable only when used with Airflow. Hex color of the task in Airflow. If not specified, #A3E9DA is used.
- … - if not specified, the value from luft.cfg will be used.
- where_clause - you can use the {date_valid} placeholder inside this clause to insert the actual date valid. E.g. `where_clause: date_of_change >= '{date_valid}'`. If you then execute `luft jdbc load -y <path_to_task> -s 2019-01-01 -e 2019-05-01`, the task runs once for every date between 2019-01-01 and 2019-05-01 and renders the clause with that date, e.g. `WHERE date_of_change >= '2019-01-01'` for the first run.
- … - `value: 'Null'`.
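For illustration, a task yaml using only the parameters described above might look like the sketch below; the file name and exact field set are assumptions, so check examples/tasks for real task files:

```yaml
# Hypothetical example/tasks/world/public/test.yml - field names are taken
# from the parameter list above; real tasks may require more fields.
type: embulk-jdbc-load        # default for this task type, override to change the task kind
source_system: world          # optional here - derived from the folder name if omitted
source_subsystem: public      # optional here - derived from the subfolder name if omitted
thread_name: slow_tables      # optional, Airflow only
color: '#A3E9DA'              # optional, Airflow only (this is already the default color)
where_clause: date_of_change >= '{date_valid}'   # {date_valid} is replaced with each run date
```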
Load data into BigQuery from Google Cloud Storage and historize it. Currently only CSV is supported.
`luft bq load`

Options:

- `-y`, `--yml-path` (mandatory): folder or a single yml file inside the default tasks folder (see luft.cfg).
- `-s`, `--start-date`: start date in the format YYYY-MM-DD for executing the task in a loop. If not specified, yesterday's date is used.
- `-e`, `--end-date`: end date in the format YYYY-MM-DD for executing the task in a loop. This day is not included. If not specified, today's date is used.
- `-sys`, `--source-system`: override the source_system parameter (see its description in the Task section above). Has to be the same as the name in jdbc.cfg to get the right credentials for the JDBC database.
- `-sub`, `--source-subsystem`: override the source_subsystem parameter (see its description in the Task section above).
- `-b`, `--blacklist`: names of tables/objects to be ignored during processing. E.g. `--yml-path gis` with `-b TEST` will process all objects in the gis folder except the object TEST.
- `-w`, `--whitelist`: names of tables/objects to be processed. E.g. `--yml-path gis` with `-w TEST` will process only the object TEST.
Requirements:
- `pip install luft[bq]`
- Google credentials (service_account.json) mapped into docker and configured in luft.cfg.

Inside the yaml file, the following parameters are supported:
- type - `bq-load` by default but can be overridden. When overridden it is going to be a different kind of task :).
- thread_name - applicable only when used with Airflow. The thread name is automatically generated based on the number of threads. If you need this task to run in a completely different thread you can specify a custom thread name.
  E.g. I have tasks T1, T2, T3, T4 and T5 in my task list and the thread count set to 3. By default (if no task has thread_name specified) it will look like this in Airflow:

  ```
  |T1| -> |T4|
  |T2| -> |T5|
  |T3|
  ```

  When I specify any thread_name in task T4:

  ```
  |T1| -> |T5|
  |T2|
  |T3|
  |T4|
  ```

- color - applicable only when used with Airflow. Hex color of the task in Airflow. If not specified, #03A0F3 is used.
- … luft.cfg.
- … location.cfg.
- … `value: 'Null'`.
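To make the thread_name behaviour above concrete, a (hypothetical) task T4 pinned to its own Airflow thread could be as small as:

```yaml
# Hypothetical bq-load task "T4" - any custom thread_name moves it out of the
# automatically generated threads shown above.
type: bq-load
thread_name: t4_own_thread
color: '#03A0F3'   # optional; this is already the default for bq-load tasks
```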
Run a BigQuery SQL command from a file.
`luft bq exec`

Options:

- `-y`, `--yml-path` (mandatory): folder or a single yml file inside the default tasks folder (see luft.cfg).
- `-s`, `--start-date`: start date in the format YYYY-MM-DD for executing the task in a loop. If not specified, yesterday's date is used.
- `-e`, `--end-date`: end date in the format YYYY-MM-DD for executing the task in a loop. This day is not included. If not specified, today's date is used.
- `-sys`, `--source-system`: override the source_system parameter (see its description in the Task section above). Has to be the same as the name in jdbc.cfg to get the right credentials for the JDBC database.
- `-sub`, `--source-subsystem`: override the source_subsystem parameter (see its description in the Task section above).
- `-b`, `--blacklist`: names of tables/objects to be ignored during processing. E.g. `--yml-path gis` with `-b TEST` will process all objects in the gis folder except the object TEST.
- `-w`, `--whitelist`: names of tables/objects to be processed. E.g. `--yml-path gis` with `-w TEST` will process only the object TEST.
Requirements:
- `pip install luft[bq]`
- Google credentials (service_account.json) mapped into docker and configured in luft.cfg.

Inside the yaml file, the following parameters are supported:
- type - `bq-exec` by default but can be overridden. When overridden it is going to be a different kind of task :).
- thread_name - applicable only when used with Airflow. The thread name is automatically generated based on the number of threads. If you need this task to run in a completely different thread you can specify a custom thread name.
  E.g. I have tasks T1, T2, T3, T4 and T5 in my task list and the thread count set to 3. By default (if no task has thread_name specified) it will look like this in Airflow:

  ```
  |T1| -> |T4|
  |T2| -> |T5|
  |T3|
  ```

  When I specify any thread_name in task T4:

  ```
  |T1| -> |T5|
  |T2|
  |T3|
  |T4|
  ```

- color - applicable only when used with Airflow. Hex color of the task in Airflow. If not specified, #73DBF5 is used.
- … luft.cfg.
- … location.cfg.

Inside the SQL you can use shortcuts for some useful variables. Example:

```sql
-- Example of templating
SELECT '{{ BQ_LOCATION }}';
SELECT '{{ BQ_PROJECT_ID }}';
SELECT '{{ DATE_VALID }}';
SELECT '{{ SOURCE_SYSTEM }}';
SELECT '{{ ENV }}';
```
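For illustration, assuming hypothetical values BQ_LOCATION=US, BQ_PROJECT_ID=my-project, a run date of 2019-01-01, source system world and ENV=prod, the rendered statements would look like:

```sql
-- Rendered output for the hypothetical values above
SELECT 'US';
SELECT 'my-project';
SELECT '2019-01-01';
SELECT 'world';
SELECT 'prod';
```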
Export an application from Qlik Sense Enterprise, upload it to Qlik Sense Cloud and publish it into a certain stream.
`luft qlik-cloud upload`

Options:

- `-y`, `--yml-path` (mandatory): folder or a single yml file inside the default tasks folder (see luft.cfg).
- `-s`, `--start-date`: start date in the format YYYY-MM-DD for executing the task in a loop. If not specified, yesterday's date is used.
- `-e`, `--end-date`: end date in the format YYYY-MM-DD for executing the task in a loop. This day is not included. If not specified, today's date is used.
- `-sys`, `--source-system`: override the source_system parameter (see its description in the Task section above). Has to be the same as the name in jdbc.cfg to get the right credentials for the JDBC database.
- `-sub`, `--source-subsystem`: override the source_subsystem parameter (see its description in the Task section above).
- `-b`, `--blacklist`: names of tables/objects to be ignored during processing. E.g. `--yml-path gis` with `-b TEST` will process all objects in the gis folder except the object TEST.
- `-w`, `--whitelist`: names of tables/objects to be processed. E.g. `--yml-path gis` with `-w TEST` will process only the object TEST.
Requirements:
- `pip install luft[qlik-cloud]`
- google-chrome and chromedriver in your Docker image or on localhost (see Python Selenium installation).
- certificates (client_key.pem, client.pem and root.pem) mapped into docker and configured in luft.cfg in the [qlik_enterprise] section.
- luft.cfg configured in the [qlik_enterprise] and [qlik_cloud] sections.

Inside the yaml file, the following parameters are supported:
- type - `bq-load` by default but can be overridden. When overridden it is going to be a different kind of task :).
- thread_name - applicable only when used with Airflow. The thread name is automatically generated based on the number of threads. If you need this task to run in a completely different thread you can specify a custom thread name.
  E.g. I have tasks T1, T2, T3, T4 and T5 in my task list and the thread count set to 3. By default (if no task has thread_name specified) it will look like this in Airflow:

  ```
  |T1| -> |T4|
  |T2| -> |T5|
  |T3|
  ```

  When I specify any thread_name in task T4:

  ```
  |T1| -> |T5|
  |T2|
  |T3|
  |T4|
  ```

- color - applicable only when used with Airflow. Hex color of the task in Airflow. If not specified, #009845 is used.
Load data from a Qlik metric, convert it to JSON and upload it to blob storage.
`luft qlik-metric load`

Options:

- `-y`, `--yml-path` (mandatory): folder or a single yml file inside the default tasks folder (see luft.cfg).
- `-s`, `--start-date`: start date in the format YYYY-MM-DD for executing the task in a loop. If not specified, yesterday's date is used.
- `-e`, `--end-date`: end date in the format YYYY-MM-DD for executing the task in a loop. This day is not included. If not specified, today's date is used.
- `-sys`, `--source-system`: override the source_system parameter (see its description in the Task section above). Has to be the same as the name in jdbc.cfg to get the right credentials for the JDBC database.
- `-sub`, `--source-subsystem`: override the source_subsystem parameter (see its description in the Task section above).
- `-b`, `--blacklist`: names of tables/objects to be ignored during processing. E.g. `--yml-path gis` with `-b TEST` will process all objects in the gis folder except the object TEST.
- `-w`, `--whitelist`: names of tables/objects to be processed. E.g. `--yml-path gis` with `-w TEST` will process only the object TEST.
Requirements:
- `pip install luft[qlik-metric]`
- certificates (client_key.pem, client.pem and root.pem) mapped into docker and configured in luft.cfg in the [qlik_enterprise] section.
- luft.cfg configured in the [qlik_enterprise] section.

Inside the yaml file, the following parameters are supported:
- type - `bq-load` by default but can be overridden. When overridden it is going to be a different kind of task :).
- thread_name - applicable only when used with Airflow. The thread name is automatically generated based on the number of threads. If you need this task to run in a completely different thread you can specify a custom thread name.
  E.g. I have tasks T1, T2, T3, T4 and T5 in my task list and the thread count set to 3. By default (if no task has thread_name specified) it will look like this in Airflow:

  ```
  |T1| -> |T4|
  |T2| -> |T5|
  |T3|
  ```

  When I specify any thread_name in task T4:

  ```
  |T1| -> |T5|
  |T2|
  |T3|
  |T4|
  ```

- color - applicable only when used with Airflow. Hex color of the task in Airflow. If not specified, #009845 is used.
luft.cfg

First you need to create the config file luft.cfg according to the example in example/config/luft.cfg and place it into the root folder. If you want to use BigQuery and Google Cloud Storage you of course need credentials for them (GC authentication). In the case of AWS S3 you need to get an AWS Access Key ID and an AWS Secret Access Key.
Credentials (GCS, AWS, BigQuery) can be specified in three ways:

1. In the luft.cfg file. WARNING: this option is recommended only for local development, because if you publish the image to a public repository, everybody will know your secrets.

2. In an environment file, e.g. a `.env` file such as:

   ```
   EMBULK_COMMAND=embulk
   LUFT_CONFIG=example/config/luft.cfg
   JDBC_CONFIG=example/config/jdbc.cfg
   TASKS_FOLDER=example/tasks
   BLOB_STORAGE=gcs
   GCS_BUCKET=
   GCS_AUTH_METHOD=json_key
   GCS_JSON_KEYFILE=
   BQ_PROJECT_ID=
   BQ_CREDENTIALS_FILE=
   BQ_LOCATION=US
   AWS_BUCKET=
   AWS_ENDPOINT=
   AWS_ACCESS_KEY_ID=
   AWS_SECRET_ACCESS_KEY=
   ```

   And then run your Docker command with this environment file:

   ```
   docker run -it --rm --env-file .env luft
   ```

   This variant is preferred.

3. As individual environment variables passed to Docker with `-e`:

   ```
   docker run -it --rm -e BLOB_STORAGE=gcs luft
   ```
jdbc.cfg

For example purposes just copy jdbc.cfg from example/config/ into the root folder, or set JDBC_CONFIG in your .env file or via the -e parameter.
Just run:

```
docker build -t luft .
docker run -d -p 5432:5432 aa8y/postgres-dataset:world
```

Store the example data from the postgres database in S3 or GCS:

```
docker run --rm luft jdbc load -y world
```

Optionally, if you have configured BigQuery in your luft.cfg, you can run:

```
docker run --rm luft bq exec -y bq
```