ETL Pipeline
You only need to run the ETL pipeline if you would like to reproduce results, collect information beyond the resources we have shared (such as data from blocks created after our cutoff point), or host your own solution.
You do not need to run the ETL pipeline if you only intend to sample new graphs or develop models on the sampled communities we provide.
This section details the Extract, Transform, and Load (ETL) pipeline for collecting Bitcoin data, formatting it as a graph, and importing it into a Neo4j graph database.
You can run the pipeline on the entire Bitcoin history or on a subset of blocks. Since running the pipeline end-to-end is highly resource-intensive, we made two key design decisions. First, we developed it from the ground up to run on a single machine, relying on on-disk processing for resource-heavy operations. Second, given its weeks-long runtime on the entire Bitcoin history, the pipeline is divided into distinct steps, and we provide checkpoint data for each one, allowing you to start the process at any point depending on your application's needs.
Please use the following guide to select the right entry point into the pipeline based on your specific goals and available resources.
Step 1: Sync a Bitcoin Core Node

When to Run: If you need to reproduce the entire pipeline, include historical data not in our current graph, or add data from new blocks created after our cutoff date.

This step performs the Initial Block Download (IBD), in which the Bitcoin Core client downloads and verifies the full blockchain history directly from the Bitcoin network. We cannot provide checkpoint data for this step. Upon completion, you will have a fully synchronized and indexed Bitcoin Core node.

System Requirements: ~1 week runtime, ~900GB storage, ~800GB Internet traffic.
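During the week-long IBD it is useful to check how far the node has synced. A minimal sketch, assuming a local Bitcoin Core node with its JSON-RPC interface enabled on the default port (the endpoint, credentials, and `"etl"` request id are illustrative):

```python
import json
import urllib.request

def rpc_payload(method, params=None):
    """Build a Bitcoin Core JSON-RPC 1.0 request body."""
    return json.dumps({
        "jsonrpc": "1.0",
        "id": "etl",  # arbitrary client-chosen request id
        "method": method,
        "params": params or [],
    }).encode()

def sync_progress(url="http://127.0.0.1:8332", auth_header=None):
    """Query the node's `getblockchaininfo` RPC and return
    (verification progress in [0, 1], still-in-IBD flag)."""
    req = urllib.request.Request(
        url,
        data=rpc_payload("getblockchaininfo"),
        headers={"Content-Type": "application/json"},
    )
    if auth_header:  # e.g. a Basic auth header built from rpcuser/rpcpassword
        req.add_header("Authorization", auth_header)
    with urllib.request.urlopen(req) as resp:
        info = json.load(resp)["result"]
    return info["verificationprogress"], info["initialblockdownload"]
```

`verificationprogress` and `initialblockdownload` are standard fields of `getblockchaininfo`; calling `sync_progress()` periodically gives a rough completion estimate for this step.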
Step 2: Parse the Blockchain into Graph TSV Files

When to Run: After completing Step 1, to generate the Bitcoin graph as TSV files, block-level summary statistics, raw TXO lifecycle data, or raw Bitcoin address statistics.

System Requirements: 2-3 days runtime, ~1.5TB storage.
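The core transform in this step is flattening decoded transactions into tab-separated edge rows. A minimal sketch of that idea; the column names, the `txo_id`/`inputs`/`outputs` dictionary layout, and the `txid:index` identifier scheme are illustrative assumptions, not the pipeline's actual schema:

```python
import csv
import io

def tx_to_edge_rows(tx):
    """Flatten one decoded transaction into (src, dst, value_sats) edges:
    one edge per spent TXO -> transaction, and one per
    transaction -> newly created TXO (identified as "txid:index")."""
    rows = []
    for vin in tx["inputs"]:
        rows.append((vin["txo_id"], tx["txid"], vin["value"]))
    for i, vout in enumerate(tx["outputs"]):
        rows.append((tx["txid"], f'{tx["txid"]}:{i}', vout["value"]))
    return rows

def write_tsv(txs, fileobj):
    """Stream edge rows to a TSV file, one pass, constant memory."""
    writer = csv.writer(fileobj, delimiter="\t", lineterminator="\n")
    writer.writerow(["src", "dst", "value_sats"])
    for tx in txs:
        writer.writerows(tx_to_edge_rows(tx))

# Toy transaction: one input spending "ff00:0", two outputs.
tx = {"txid": "ab12",
      "inputs": [{"txo_id": "ff00:0", "value": 5000}],
      "outputs": [{"value": 4000}, {"value": 900}]}
buf = io.StringIO()
write_tsv([tx], buf)
print(buf.getvalue())
```

Writing rows as they are produced, rather than accumulating them, is what keeps memory flat over the full blockchain.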
Step 3: Compute Address Summary Statistics (Optional)

When to Run: If you need detailed summary statistics about script addresses. This optional step runs on the raw address statistics generated in Step 2.

System Requirements: ~1 day runtime, ~300GB storage.
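Aggregating statistics for hundreds of millions of addresses fits the pipeline's on-disk processing approach: externally sort the raw rows by address, then summarize them in one streaming pass. A sketch of the summarization pass under that assumption (the `(address, value)` row shape and the particular statistics are illustrative):

```python
from itertools import groupby

def summarize_sorted(rows):
    """Aggregate rows pre-sorted by address (as an external sort would
    produce) into (address, n_txos, total_sats) tuples. Because input
    is sorted, each address's rows are adjacent and memory use stays
    constant regardless of input size."""
    for addr, group in groupby(rows, key=lambda r: r[0]):
        values = [int(r[1]) for r in group]
        yield addr, len(values), sum(values)

# Toy input already sorted by address, e.g. read from a sorted TSV.
raw = [("addr1", "500"), ("addr1", "1500"), ("addr2", "42")]
print(list(summarize_sorted(raw)))  # [('addr1', 2, 2000), ('addr2', 1, 42)]
```

The same pattern extends to any per-address statistic that can be computed from one contiguous group of rows.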
Step 4: Compute TXO Lifecycle Statistics (Optional)

When to Run: If you need summary statistics on transaction output (TXO) lifecycles, such as coin dormancy. This optional step runs on the raw TXO lifecycle data generated in Step 2.

System Requirements: ~1 day runtime, ~300GB storage.
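Coin dormancy, mentioned above, is a simple derived quantity: how long a TXO sat unspent between creation and spending. A minimal sketch measuring dormancy in blocks, assuming lifecycle records carry creation and spending heights (the record shape is illustrative):

```python
def dormancy(created_height, spent_height):
    """Number of blocks a TXO remained unspent; None if still unspent."""
    if spent_height is None:
        return None
    return spent_height - created_height

def mean_dormancy(lifecycles):
    """Mean dormancy over spent TXOs only, from (created, spent) pairs;
    unspent TXOs (spent_height=None) are excluded from the average."""
    spans = [dormancy(c, s) for c, s in lifecycles if s is not None]
    return sum(spans) / len(spans) if spans else None

# Two spent TXOs (dormant 50 and 100 blocks) and one still unspent.
print(mean_dormancy([(100, 150), (100, 200), (300, None)]))  # 75.0
```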
Step 5: Import Data into Neo4j

When to Run: If you need efficient graph exploration and custom graph sampling. This step imports the formatted TSV files generated in Step 2 into a Neo4j database.

System Requirements: ~2-3 weeks runtime, ~6TB storage.
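Neo4j's bulk importer consumes delimited files whose header rows carry type annotations such as `:ID`, `:LABEL`, `:START_ID`, `:END_ID`, and `:TYPE`. A sketch of generating matching header and data lines; the column names (`txoId`, `value`), the `TXO` label, and the `SPENDS` relationship type are illustrative, not the pipeline's actual schema:

```python
# Hypothetical column layouts for Neo4j bulk-import headers.
NODE_HEADER = "\t".join(["txoId:ID", "value:long", ":LABEL"])
REL_HEADER = "\t".join([":START_ID", ":END_ID", ":TYPE"])

def node_row(txo_id, value, label="TXO"):
    """Render one node line matching NODE_HEADER."""
    return "\t".join([txo_id, str(value), label])

def rel_row(src, dst, rel_type="SPENDS"):
    """Render one relationship line matching REL_HEADER."""
    return "\t".join([src, dst, rel_type])

print(NODE_HEADER)
print(node_row("ab12:0", 4000))
print(REL_HEADER)
print(rel_row("ff00:0", "ab12"))
```

With headers and data files in place, the actual load would go through Neo4j's offline bulk importer (`neo4j-admin database import` in Neo4j 5, `neo4j-admin import` in earlier versions), pointing `--nodes` and `--relationships` at the files and setting its `--delimiter` option to tab; consult the documentation for your Neo4j version for the exact invocation.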