Project author: spe-uob

Project description:
FHIR to OMOP using PySpark on AWS Glue
Primary language: Python
Repository: git://github.com/spe-uob/HealthcareLakeETL.git
Created: 2021-03-07T01:04:36Z
Project community: https://github.com/spe-uob/HealthcareLakeETL

License: MIT License

HealthcareLakeETL

This repository contains the Spark ETL jobs for our AWS Glue pipeline, used by the HealthcareLake project.
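For orientation, a Glue PySpark job script typically follows the boilerplate sketched below. This is a generic illustration of how such a job is structured, not this project's actual entry point; the S3 path is a placeholder.

```python
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

# Standard AWS Glue job setup
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Read the catalogued FHIR data (placeholder path, not the project's actual location)
df = spark.read.parquet('s3://<your-bucket>/catalog.parquet')

# ... apply the FHIR -> OMOP transformations here ...

job.commit()
```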

FHIR → OMOP

We are transforming one dataframe (FHIR) into several dataframes that correspond to the OMOP Common Data Model (CDM). The exact mapping can be found here.

Once the patient-level data model (FHIR) has been transformed into the population-level data model (OMOP CDM), we can use the Observational Health Data Sciences and Informatics (OHDSI) resources, which provide data aggregation and packages for cohort creation and various population-level analytics. More info
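To illustrate the general shape of such a transformation, the sketch below derives an OMOP-style person dataframe from a single flattened FHIR dataframe in PySpark. The column names (patient_id, gender, birth_date) and the mapping itself are illustrative assumptions, not the project's actual schema or mapping.

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName('fhir-to-omop-sketch').getOrCreate()

# One flattened FHIR dataframe; column names here are assumptions for illustration.
fhir_df = spark.read.parquet('data/catalog.parquet')

# Derive an OMOP CDM "person"-like dataframe from the FHIR columns.
person_df = (
    fhir_df
    .select(
        F.col('patient_id').alias('person_id'),
        F.col('gender').alias('gender_source_value'),
        F.year(F.col('birth_date')).alias('year_of_birth'),
    )
    .dropDuplicates(['person_id'])
)

# Other OMOP tables (condition_occurrence, observation, etc.) would be built
# as separate dataframes from the same source in a similar way.
```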

Local development

These instructions are for working with the data offline as opposed to connecting to AWS EMR. This is recommended as there is less setup involved.

To set up the Jupyter Notebook environment, follow these steps:

  1. Install Anaconda

  2. Create a Virtual Environment with Anaconda

     ```bash
     conda create --name etl python=3.7
     ```

     Switch to this virtual environment:

     ```bash
     conda activate etl
     ```

     Add the environment to the Jupyter kernels:

     ```bash
     pip install --user ipykernel
     ```

     And then link it:

     ```bash
     python -m ipykernel install --user --name=etl
     ```

     You should now be able to run Jupyter Notebook in your browser:

     ```bash
     jupyter notebook
     ```

     Select Kernel → Change kernel → etl

  3. Install PySpark

     Open a new terminal. (Remember to activate the environment with `conda activate etl`.)

     ```bash
     pip install pyspark
     ```

  4. Start developing

In your notebook:

```python
from pyspark.sql import SparkSession

# Create a local Spark session
spark = SparkSession.builder.appName('etl').getOrCreate()

# Read in our data
df = spark.read.parquet('data/catalog.parquet')
```

That’s it, you have the DataFrame to work with.
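Before writing transformations, it can help to sanity-check what was loaded. A minimal sketch follows; the schema and counts printed depend entirely on the catalogued FHIR data.

```python
# Inspect the schema and size of the catalogued FHIR data
df.printSchema()
print(df.count())

# Register the dataframe as a temporary view so it can be queried with Spark SQL
df.createOrReplaceTempView('catalog')
spark.sql('SELECT COUNT(*) FROM catalog').show()
```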