ETL Pipeline
You only need to run the ETL pipeline if you would like to reproduce results, collect information beyond the resources we have shared (such as data from blocks created after our cutoff point), or host your own solution.
You do not need to run the ETL pipeline if you only intend to sample new graphs or develop models on the sampled communities we provide.
This section details the Extract, Transform, and Load (ETL) pipeline for collecting Bitcoin data, formatting it as a graph, and importing it into a Neo4j graph database.
You can run the pipeline on the entire Bitcoin history or on a subset of blocks. Since running the pipeline end-to-end is highly resource-intensive, we made two key design decisions. First, we developed it from the ground up to run on a single machine, relying on on-disk processing for resource-heavy operations. Second, given its weeks-long runtime on the entire Bitcoin history, the pipeline is divided into distinct steps, and we provide checkpoint data for each one, allowing you to start the process at any point depending on your application's needs.
Please use the following guide to select the right entry point into the pipeline based on your specific goals and available resources.
Step 1: Sync a Bitcoin Core Node

When to Run: If you need to reproduce the entire pipeline, include historical data not in our current graph, or add data from new blocks created after our cutoff date.

This step performs the Initial Block Download (IBD), in which the Bitcoin Core client downloads and verifies the full blockchain history directly from the Bitcoin network. We cannot provide checkpoint data for this step. Upon completion, you will have a fully synchronized and indexed Bitcoin Core node.

System Requirements: ~1 week runtime, ~900GB storage, ~800GB Internet traffic.
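During the week-long IBD it is useful to check how far the node has synced. A minimal sketch, assuming a local Bitcoin Core node with its JSON-RPC interface enabled on the default port (the endpoint, credentials, and `"etl"` request id are illustrative):

```python
import json
import urllib.request

def rpc_payload(method, params=None):
    """Build a Bitcoin Core JSON-RPC 1.0 request body."""
    return json.dumps({
        "jsonrpc": "1.0",
        "id": "etl",  # arbitrary client-chosen request id
        "method": method,
        "params": params or [],
    }).encode()

def sync_progress(url="http://127.0.0.1:8332", auth_header=None):
    """Query the node's `getblockchaininfo` RPC and return
    (verification progress in [0, 1], still-in-IBD flag)."""
    req = urllib.request.Request(
        url,
        data=rpc_payload("getblockchaininfo"),
        headers={"Content-Type": "application/json"},
    )
    if auth_header:  # e.g. a Basic auth header built from rpcuser/rpcpassword
        req.add_header("Authorization", auth_header)
    with urllib.request.urlopen(req) as resp:
        info = json.load(resp)["result"]
    return info["verificationprogress"], info["initialblockdownload"]
```

`verificationprogress` and `initialblockdownload` are standard fields of `getblockchaininfo`; calling `sync_progress()` periodically gives a rough completion estimate for this step.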
Step 2: Parse the Blockchain into Graph TSV Files

When to Run: After completing Step 1, to generate the Bitcoin graph as TSV files, block-level summary statistics, raw TXO lifecycle data, or raw Bitcoin address statistics.

System Requirements: 2-3 days runtime, ~1.5TB storage.
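The core transform in this step is flattening decoded transactions into tab-separated edge rows. A minimal sketch of that idea; the column names, the `txo_id`/`inputs`/`outputs` dictionary layout, and the `txid:index` identifier scheme are illustrative assumptions, not the pipeline's actual schema:

```python
import csv
import io

def tx_to_edge_rows(tx):
    """Flatten one decoded transaction into (src, dst, value_sats) edges:
    one edge per spent TXO -> transaction, and one per
    transaction -> newly created TXO (identified as "txid:index")."""
    rows = []
    for vin in tx["inputs"]:
        rows.append((vin["txo_id"], tx["txid"], vin["value"]))
    for i, vout in enumerate(tx["outputs"]):
        rows.append((tx["txid"], f'{tx["txid"]}:{i}', vout["value"]))
    return rows

def write_tsv(txs, fileobj):
    """Stream edge rows to a TSV file, one pass, constant memory."""
    writer = csv.writer(fileobj, delimiter="\t", lineterminator="\n")
    writer.writerow(["src", "dst", "value_sats"])
    for tx in txs:
        writer.writerows(tx_to_edge_rows(tx))

# Toy transaction: one input spending "ff00:0", two outputs.
tx = {"txid": "ab12",
      "inputs": [{"txo_id": "ff00:0", "value": 5000}],
      "outputs": [{"value": 4000}, {"value": 900}]}
buf = io.StringIO()
write_tsv([tx], buf)
print(buf.getvalue())
```

Writing rows as they are produced, rather than accumulating them, is what keeps memory flat over the full blockchain.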
Step 3: Compute Address Summary Statistics (Optional)

When to Run: If you need detailed summary statistics about script addresses. This optional step runs on the raw address statistics generated in Step 2.

System Requirements: ~1 day runtime, ~300GB storage.
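Aggregating statistics for hundreds of millions of addresses fits the pipeline's on-disk processing approach: externally sort the raw rows by address, then summarize them in one streaming pass. A sketch of the summarization pass under that assumption (the `(address, value)` row shape and the particular statistics are illustrative):

```python
from itertools import groupby

def summarize_sorted(rows):
    """Aggregate rows pre-sorted by address (as an external sort would
    produce) into (address, n_txos, total_sats) tuples. Because input
    is sorted, each address's rows are adjacent and memory use stays
    constant regardless of input size."""
    for addr, group in groupby(rows, key=lambda r: r[0]):
        values = [int(r[1]) for r in group]
        yield addr, len(values), sum(values)

# Toy input already sorted by address, e.g. read from a sorted TSV.
raw = [("addr1", "500"), ("addr1", "1500"), ("addr2", "42")]
print(list(summarize_sorted(raw)))  # [('addr1', 2, 2000), ('addr2', 1, 42)]
```

The same pattern extends to any per-address statistic that can be computed from one contiguous group of rows.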
Step 4: Compute TXO Lifecycle Statistics (Optional)

When to Run: If you need summary statistics on transaction output (TXO) lifecycles, such as coin dormancy. This optional step runs on the raw TXO lifecycle data generated in Step 2.

System Requirements: ~1 day runtime, ~300GB storage.
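Coin dormancy, mentioned above, is a simple derived quantity: how long a TXO sat unspent between creation and spending. A minimal sketch measuring dormancy in blocks, assuming lifecycle records carry creation and spending heights (the record shape is illustrative):

```python
def dormancy(created_height, spent_height):
    """Number of blocks a TXO remained unspent; None if still unspent."""
    if spent_height is None:
        return None
    return spent_height - created_height

def mean_dormancy(lifecycles):
    """Mean dormancy over spent TXOs only, from (created, spent) pairs;
    unspent TXOs (spent_height=None) are excluded from the average."""
    spans = [dormancy(c, s) for c, s in lifecycles if s is not None]
    return sum(spans) / len(spans) if spans else None

# Two spent TXOs (dormant 50 and 100 blocks) and one still unspent.
print(mean_dormancy([(100, 150), (100, 200), (300, None)]))  # 75.0
```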
Step 5: Import Data into Neo4j

When to Run: If you need efficient graph exploration and custom graph sampling. This step imports the formatted TSV files generated in Step 2 into a Neo4j database.

System Requirements: ~2-3 weeks runtime, ~6TB storage.
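Neo4j's bulk importer consumes delimited files whose header rows carry type annotations such as `:ID`, `:LABEL`, `:START_ID`, `:END_ID`, and `:TYPE`. A sketch of generating matching header and data lines; the column names (`txoId`, `value`), the `TXO` label, and the `SPENDS` relationship type are illustrative, not the pipeline's actual schema:

```python
# Hypothetical column layouts for Neo4j bulk-import headers.
NODE_HEADER = "\t".join(["txoId:ID", "value:long", ":LABEL"])
REL_HEADER = "\t".join([":START_ID", ":END_ID", ":TYPE"])

def node_row(txo_id, value, label="TXO"):
    """Render one node line matching NODE_HEADER."""
    return "\t".join([txo_id, str(value), label])

def rel_row(src, dst, rel_type="SPENDS"):
    """Render one relationship line matching REL_HEADER."""
    return "\t".join([src, dst, rel_type])

print(NODE_HEADER)
print(node_row("ab12:0", 4000))
print(REL_HEADER)
print(rel_row("ff00:0", "ab12"))
```

With headers and data files in place, the actual load would go through Neo4j's offline bulk importer (`neo4j-admin database import` in Neo4j 5, `neo4j-admin import` in earlier versions), pointing `--nodes` and `--relationships` at the files and setting its `--delimiter` option to tab; consult the documentation for your Neo4j version for the exact invocation.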