# DSS plugin for fast loading between Snowflake and HDFS

This is a Dataiku plugin that makes it easy to enable fast loads of files between Snowflake and S3-backed HDFS. Supported file formats include JSON Lines and Parquet (as an HDFS dataset).
To use the plugin, you will need, in Snowflake:

- For Parquet: a `STAGE` that points to the same S3 bucket and path as DSS's managed HDFS connection.
- For JSON: a `STORAGE_INTEGRATION` that points to the same S3 bucket and path as DSS's managed S3 connection.

The root path of your HDFS connection in DSS should match the `URL` of the stage, for example:
```sql
CREATE OR REPLACE STAGE YOUR_ACCOUNT_DATAIKU_EMR_MANAGED_STAGE
  URL = 's3://your-account-dataiku-emr/data'
  CREDENTIALS = (AWS_ROLE = 'arn:aws:iam::123456:role/SnowflakeCrossAccountRole');

GRANT USAGE ON STAGE YOUR_ACCOUNT_DATAIKU_EMR_MANAGED_STAGE TO ROLE DSS_SF_ROLE_NAME;
```
Note that this example uses an AWS IAM role for securing the stage’s connection to your S3 bucket. There’s no reason a stage secured using AWS access keys wouldn’t work, but it has not been tested.
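For the JSON path, the `STORAGE_INTEGRATION` can be set up along the same lines. The sketch below is illustrative only; the integration name, role ARN, and bucket path are placeholders for your account's values:

```sql
-- Illustrative only: the integration name, role ARN, and bucket are placeholders.
CREATE STORAGE INTEGRATION YOUR_ACCOUNT_DATAIKU_S3_INTEGRATION
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = 'S3'
  ENABLED = TRUE
  STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456:role/SnowflakeCrossAccountRole'
  STORAGE_ALLOWED_LOCATIONS = ('s3://your-account-dataiku/data/');

-- The role DSS connects with needs USAGE on the integration.
GRANT USAGE ON INTEGRATION YOUR_ACCOUNT_DATAIKU_S3_INTEGRATION TO ROLE DSS_SF_ROLE_NAME;
```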
You can install the plugin by referencing this GitHub repository and following these instructions.
Or, you can create a Zip file and follow these instructions. To create the Zip file, you'll need to build it:

1. Make sure `json_pp` and `node` are installed locally.
2. Run `make plugin`.
3. The Zip file is written to the `/dist` directory.

You can (optionally) configure a Default Snowflake Stage in the plugin's settings. For example, the `STAGE` created above would be entered as `@PUBLIC.YOUR_ACCOUNT_DATAIKU_EMR_MANAGED_STAGE`.
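One quick way to confirm that a stage actually lines up with the managed HDFS connection is to list it from Snowflake. This is only a manual sanity check, using the stage name created above:

```sql
-- Files DSS has written under the HDFS connection's root path should show up
-- when listing the stage created earlier.
LIST @PUBLIC.YOUR_ACCOUNT_DATAIKU_EMR_MANAGED_STAGE;
```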
When using the recipe, you can override the default stage in the Snowflake Stage setting.
The recipe has a few requirements:

- In one direction, it needs a `STAGE`, as described above, and the Format is Parquet.
- In the other direction, it likewise needs a `STAGE`, as described above. Additionally, the dataset must be Parquet format with Snappy compression.

Make sure that wget is installed. For macOS, you can install it via `brew install wget`.
Custom Recipe libraries aren't included in DSS's `dataiku-internal-client` package, so we need to fake it 'til we make it.

First, create the package by executing `./make_dss_pip.sh`. If successful, the last line it prints is a `pip` command.

Second, use the `pip` command from the previous step to install the package in the library's virtual environment. (If you're using PyCharm, open View → Tool Windows → Terminal and paste the `pip` command in.)
If the client package changes, repeat the `pip` steps above.

`TIMESTAMP_TZ` and `TIMESTAMP_LTZ` columns are cast to `TIMESTAMP_NTZ`, which simply drops the timezone offset attribute. For greater control of this behaviour, transform your Snowflake table before passing it to this plugin. Consider using `CONVERT_TIMEZONE('UTC', t."date")::TIMESTAMP_NTZ`. See Date & Time Data Types in Snowflake for more details.
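A minimal sketch of that approach, using hypothetical schema, table, and column names:

```sql
-- Hypothetical example: normalise the offset to UTC explicitly, instead of
-- letting the implicit cast silently drop it.
CREATE OR REPLACE VIEW ANALYTICS.EVENTS_UTC AS
SELECT
  t."id",
  CONVERT_TIMEZONE('UTC', t."created_at")::TIMESTAMP_NTZ AS "created_at_utc"
FROM ANALYTICS.EVENTS t;
```

Syncing a view (or table) like this instead of the raw table keeps the instant unambiguous, at the cost of fixing the zone to UTC.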
Snowflake writes `TIMESTAMP_NTZ` and `DATE` columns as annotated logical types in Parquet. Because Dataiku does not support logical types, these appear in Dataiku as `bigint` (`int64` as `datetime`) and `int` (`int32` as `date`). More details and workarounds are described in #12; one possible casting workaround is also sketched after the list below.

Not yet covered:

- other file formats (CSV, AVRO, etc.)
- validation of the configuration (the `STAGE` lining up to HDFS)
- automatic timezone handling with `CONVERT_TIMEZONE`
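As one possible workaround for the logical-type rendering (not necessarily the one described in #12), date and timestamp columns can be cast to strings on the Snowflake side before syncing. The schema, table, and column names below are hypothetical:

```sql
-- Hypothetical sketch: ship dates/timestamps as text so they arrive in
-- Dataiku as parseable strings rather than raw int32/int64 values.
CREATE OR REPLACE VIEW ANALYTICS.ORDERS_FOR_DSS AS
SELECT
  o."id",
  TO_VARCHAR(o."order_date", 'YYYY-MM-DD')            AS "order_date",
  TO_VARCHAR(o."created_at", 'YYYY-MM-DD HH24:MI:SS') AS "created_at"
FROM ANALYTICS.ORDERS o;
```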