
MeDaX Pipeline

📋 Description

The MeDaX pipeline transforms healthcare data from FHIR databases into Neo4j graph databases. This conversion enables efficient searching, querying, and analysis of interconnected health data that would otherwise be complex to retrieve with traditional SQL databases.

Features

  • Seamless conversion from FHIR to Neo4j graph structure
  • Support for patient-centric data retrieval using FHIR's $everything operation
  • Configurable batch processing for handling large datasets
  • Docker-based deployment for easy setup and portability
  • Compatible with public FHIR servers (e.g., HAPI FHIR) and private authenticated instances

⚙️ Prerequisites

  • Docker with the Docker Compose plugin
  • A FHIR database with API access and the $everything operation enabled for retrieving patient data
    • Alternatively: Use a public FHIR server such as HAPI FHIR (default configuration)
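The $everything operation mentioned above is a standard FHIR operation that returns a patient's full record as a bundle. As a minimal sketch of how such a request URL is formed (the helper name and base URL are illustrative, not the pipeline's actual code):

```python
# Hypothetical helper illustrating FHIR's $everything request URL.
# The function name and default page size are assumptions for this sketch.

def everything_url(base_url: str, patient_id: str, page_size: int = 100) -> str:
    """Build the request URL for the $everything operation on one patient."""
    return f"{base_url.rstrip('/')}/Patient/{patient_id}/$everything?_count={page_size}"

url = everything_url("https://hapi.fhir.org/baseR4", "example")
# → "https://hapi.fhir.org/baseR4/Patient/example/$everything?_count=100"
```

The `_count` parameter controls page size; servers that paginate return `link` entries in the bundle for fetching subsequent pages.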

🚀 Installation

Setup

  1. Clone this repository

  2. Create an environment configuration file

  3. Configure the environment variables in .env:

    • For HAPI test server (default): No changes needed
    • For custom FHIR server:
      • Set MODE to any value other than the default
      • Uncomment and set URL, PASSWORD, and USERNAME variables
      • Adjust BATCH_SIZE and NUMBER_OF_PATIENTS according to your needs
      • Configure any required proxy settings
  4. If needed, modify proxy settings in the Dockerfile

    • Uncomment and set proxy variables
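Putting the steps above together, a .env for a custom FHIR server might look like the following. The variable names are the ones referenced in this README; all values are placeholders, and the proxy variable names are assumptions:

```env
# Illustrative .env for a custom FHIR server (all values are placeholders)
MODE=custom                  # any value other than the default disables the HAPI test setup
URL=https://fhir.example.org/baseR4
USERNAME=medax_user
PASSWORD=change-me
BATCH_SIZE=100               # resources fetched per request
NUMBER_OF_PATIENTS=500       # upper bound on patients to process
# HTTP_PROXY=http://proxy.example.org:3128
# HTTPS_PROXY=http://proxy.example.org:3128
```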

Running the Pipeline

Start the containers:

docker compose up --build

Stop and clean up (between runs):

docker compose down --volumes

Complete removal (containers and images):

docker compose down --volumes --rmi all

Note: Depending on your Docker installation, you might need to use docker-compose instead of docker compose.

🔍 Accessing the Neo4j Database

Once the pipeline has completed processing, you can access the Neo4j database:

  1. Open your browser and navigate to http://localhost:8080/
  2. Connect using the following credentials:
    • Username: neo4j
    • Password: neo4j
  3. Set a new password and store it in a secure password manager

📊 Example Queries

Here are some basic Cypher queries to get you started with exploring your health data:

// Count all nodes by type
MATCH (n) RETURN labels(n) as NodeType, count(*) as Count;

// Find all records for a specific patient
MATCH (p:Patient {id: 'patient-id'})-[r]-(connected)
RETURN p, r, connected;

// Retrieve all medications and their connected patients
MATCH (m:Medication)-[r]-(p:Patient)
RETURN m, r, p;

Troubleshooting

Common Issues:

  • Connection refused to FHIR server: Check your network settings and ensure the FHIR server is accessible from within the Docker container.
  • Authentication failures: Verify your credentials in the .env file.
  • Container startup failures: Ensure all required Docker ports are available and not used by other applications.
  • No data found in FHIR bundle: Ensure that the FHIR server is up and returning patient data. Try setting the COMPLEX_PATIENTS variable to FALSE in your .env file, as some FHIR servers may not support the full FHIR search logic.

📚 Architecture

The MeDaX pipeline consists of the following components:

  1. FHIR Client: Connects to the FHIR server and retrieves patient data
  2. Data Transformer: Converts FHIR resources into graph entities and relationships
  3. Reference Processor: Converts references to relationships
  4. BioCypher Adapter: Prepares the transformed data for Neo4j admin import
  5. Neo4j Database: Stores and serves the graph representation of the health data
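The core idea behind the Data Transformer and Reference Processor steps can be sketched as follows: each FHIR resource becomes a node, and each FHIR reference (e.g. "Patient/123") becomes a relationship to the referenced node. This is a simplified illustration under assumed names, not the pipeline's actual implementation:

```python
# Simplified sketch of the transform step: one FHIR resource in,
# one node plus its outgoing relationships out. Function and field
# names are illustrative assumptions.

def to_graph(resource: dict) -> tuple[dict, list[tuple[str, str, str]]]:
    """Return (node, relationships) for one FHIR resource."""
    node = {"label": resource["resourceType"], "id": resource["id"]}
    rels = []
    for key, value in resource.items():
        # FHIR references look like {"reference": "Patient/123"}
        if isinstance(value, dict) and "reference" in value:
            _, target_id = value["reference"].split("/", 1)  # drop the type prefix
            rels.append((resource["id"], key.upper(), target_id))
    return node, rels

# An Observation pointing at its patient yields one SUBJECT relationship:
node, rels = to_graph({
    "resourceType": "Observation",
    "id": "obs-1",
    "subject": {"reference": "Patient/p-1"},
})
```

In the real pipeline this output is then handed to the BioCypher adapter, which writes it in the format expected by Neo4j admin import.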

✍️ Citation

If you use the MeDaX pipeline in your research, please cite DOI 10.5281/zenodo.15229077 and Mazein, I., Gebhardt, T., et al. MeDaX, a knowledge graph on FHIR.

🙏 Acknowledgements

  • We leverage BioCypher (DOI) to create the input for Neo4j admin import.
    • Remark: We introduced slight adjustments to BioCypher's code to support batching.
  • We used BioCypher's git template as a starting point for our development.
  • We used synthetic data generated with Synthea during the development process. This data is provided in the testData folder.