# MeDaX Pipeline

## 📋 Description
The MeDaX pipeline transforms healthcare data from FHIR databases into Neo4j graph databases. This conversion enables efficient searching, querying, and analysis of interconnected health data that would otherwise be complex to retrieve from traditional SQL databases.
## ✨ Features
- Seamless conversion from FHIR to Neo4j graph structure
- Support for patient-centric data retrieval using FHIR's `$everything` operation (see the example request after this list)
- Configurable batch processing for handling large datasets
- Docker-based deployment for easy setup and portability
- Compatible with public FHIR servers (e.g., HAPI FHIR) and private authenticated instances
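
For context, `$everything` is a standard FHIR operation that returns a patient's complete record as a Bundle. A minimal sketch of this kind of request, assuming the public HAPI FHIR R4 test endpoint and a placeholder patient ID (this is not the pipeline's actual client code):

```bash
# Fetch one patient's complete record as a FHIR Bundle via $everything.
# Base URL and patient ID are placeholders; adjust to your server.
curl -s -H "Accept: application/fhir+json" \
  "https://hapi.fhir.org/baseR4/Patient/<patient-id>/\$everything"
```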
## ⚙️ Prerequisites
- Docker with the Docker Compose plugin
- A FHIR database with API access and the `$everything` operation enabled for retrieving patient data (a quick connectivity check is shown after this list)
- Alternatively: use a public FHIR server such as HAPI FHIR (default configuration)
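
Before starting the pipeline, it can help to confirm that the FHIR API is reachable. One way, sketched here with a placeholder base URL, is to request the server's CapabilityStatement:

```bash
# Print the HTTP status of the FHIR metadata endpoint; 200 means the API responds.
# Replace the base URL with your own server.
curl -s -o /dev/null -w "%{http_code}\n" \
  -H "Accept: application/fhir+json" \
  "https://your-fhir-server.example/fhir/metadata"
```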
## 🚀 Installation

### Setup
- Clone this repository
- Create an environment configuration file
- Configure the environment variables in `.env` (an example is sketched after this list):
  - For the HAPI test server (default): no changes needed
  - For a custom FHIR server:
    - Change `MODE` to anything else
    - Uncomment and set the `URL`, `PASSWORD`, and `USERNAME` variables
    - Adjust `BATCH_SIZE` and `NUMBER_OF_PATIENTS` according to your needs
    - Configure any required proxy settings
- If needed, modify proxy settings in the `Dockerfile`:
  - Uncomment and set proxy variables
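
For illustration only, a `.env` for a custom, authenticated server might look roughly like this. The variable names are the ones listed above; every value, as well as the proxy variable names, is a placeholder and may differ from your repository's `.env` template:

```env
# Any value other than the default switches the pipeline to a custom FHIR server
MODE=custom

# Endpoint and credentials of your FHIR server (placeholder values)
URL=https://your-fhir-server.example/fhir
USERNAME=medax-user
PASSWORD=change-me

# Tune batching and cohort size to your dataset
BATCH_SIZE=100
NUMBER_OF_PATIENTS=500

# Set to FALSE if your server does not support the pipeline's patient search logic
COMPLEX_PATIENTS=TRUE

# Proxy settings, if required (variable names are illustrative)
# HTTP_PROXY=http://proxy.example:3128
# HTTPS_PROXY=http://proxy.example:3128
```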
### Running the Pipeline
Start the containers:

```bash
docker compose up --build
```

Stop and clean up (between runs):

```bash
docker compose down --volumes
```

Complete removal (containers and images):

```bash
docker compose down --volumes --rmi all
```

Note: Depending on your Docker installation, you might need to use `docker-compose` instead of `docker compose`.
## 🔍 Accessing the Neo4j Database
Once the pipeline has completed processing, you can access the Neo4j database:
- Open your browser and navigate to http://localhost:8080/
- Connect using the following credentials:
  - Username: `neo4j`
  - Password: `neo4j`
- Set the new password and save it to a secure password manager
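
As an alternative to the browser, you can open a Cypher shell directly inside the running Neo4j container. The container name below is a placeholder; use `docker ps` to find the actual name in your setup:

```bash
# Open an interactive Cypher shell in the Neo4j container
# (replace <neo4j-container> and the password with your own values).
docker exec -it <neo4j-container> cypher-shell -u neo4j -p <your-password>
```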
## 📊 Example Queries
Here are some basic Cypher queries to get you started with exploring your health data:
```cypher
// Count all nodes by type
MATCH (n) RETURN labels(n) AS NodeType, count(*) AS Count;

// Find all records for a specific patient
MATCH (p:Patient {id: 'patient-id'})-[r]-(connected)
RETURN p, r, connected;

// Retrieve all medication prescriptions
MATCH (m:Medication)-[r]-(p:Patient)
RETURN m, r, p;
```
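
Depending on which FHIR resources your server returns, additional node labels such as `Observation` or `Condition` may exist in the graph. Assuming an `Observation` label is present, a query along these lines summarizes observations per patient:

```cypher
// Count observations per patient (assumes Observation nodes exist in your graph)
MATCH (p:Patient)--(o:Observation)
RETURN p.id AS PatientId, count(o) AS Observations
ORDER BY Observations DESC
LIMIT 10;
```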
## ❓ Troubleshooting
Common Issues:
- Connection refused to FHIR server: Check your network settings and ensure the FHIR server is accessible from within the Docker container.
- Authentication failures: Verify your credentials in the `.env` file.
- Container startup failures: Ensure all required Docker ports are available and not used by other applications.
- No data found in FHIR bundle: Ensure that the FHIR server is up and returning patient data. Try setting the `COMPLEX_PATIENTS` variable to `FALSE` in your `.env` file; some FHIR servers might not support the FHIR search logic.
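
When debugging any of these issues, the container logs are usually the first place to look:

```bash
# Follow the logs of all services started by the compose file
docker compose logs -f
```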
## 📚 Architecture
The MeDaX pipeline consists of the following components:
- FHIR Client: Connects to the FHIR server and retrieves patient data
- Data Transformer: Converts FHIR resources into graph entities and relationships
- Reference Processor: Converts FHIR references into graph relationships (see the sketch after this list)
- BioCypher Adapter: Prepares the transformed data for Neo4j admin import
- Neo4j Database: Stores and serves the graph representation of the health data
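
To illustrate the Reference Processor conceptually: a FHIR reference field points from one resource to another, and in the graph this becomes an edge between the corresponding nodes. The labels, property values, and relationship handling below are purely illustrative, not the pipeline's exact schema:

```cypher
// A FHIR resource containing  "subject": { "reference": "Patient/123" }
// is represented, conceptually, as an edge between the two resource nodes:
MATCH (o:Observation)-[ref]->(p:Patient {id: '123'})
RETURN o, type(ref), p;
```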
## ✍️ Citation
If you use the MeDaX pipeline in your research, please cite DOI 10.5281/zenodo.15229077 and Mazein, I., Gebhardt, T., et al., "MeDaX, a knowledge graph on FHIR".
## 🙏 Acknowledgements
- We are leveraging BioCypher to create the Neo4j admin input.
  - Remark: We introduced slight adjustments to BioCypher's code to support batching.
- We used BioCypher's git template as a starting point for our development:
  - Lobentanzer, S., BioCypher Consortium, & Saez-Rodriguez, J. Democratizing knowledge representation with BioCypher [Computer software]. https://github.com/biocypher/biocypher
- We used synthetic data generated with Synthea during the development process. This data is provided in the `testData` folder.