ETL Pipeline
You only need to run the ETL pipeline if you would like to reproduce results, collect information beyond the resources we have shared (such as data from blocks created after our cutoff point), or host your own solution.
You do not need to run the ETL pipeline if you only intend to sample new graphs or develop models on the sampled communities we provide.
This section details the Extract, Transform, and Load (ETL) pipeline for collecting Bitcoin data, formatting it as a graph, and importing it into a Neo4j graph database.
You can run the pipeline on the entire Bitcoin history or on a subset of blocks. Because running the pipeline end-to-end is highly resource-intensive, we made two key design decisions. First, we developed it from the ground up to run on a single machine, relying on on-disk processing for resource-heavy operations. Second, because a run over the entire Bitcoin history takes weeks, we divided the pipeline into distinct steps and provide checkpoint data where possible, so you can start at whichever step suits your application's needs.
Please use the following guide to select the right entry point into the pipeline based on your specific goals and available resources.
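The choice of entry point can be summarized as a small lookup. The sketch below is illustrative only: the goal labels and the function name `pick_entry_point` are hypothetical, and the mapping simply restates the guidance in this section rather than anything shipped with the pipeline.

```python
# Hypothetical goal labels mapped to the first pipeline step to run.
# None means that no pipeline step needs to be run for that goal.
ENTRY_POINTS = {
    "reproduce_everything": 1,        # full run, starting from the Bitcoin network
    "add_new_blocks": 1,              # blocks after the cutoff need a synced node
    "regenerate_tsv_graph": 2,        # requires a completed Step 1
    "import_tsv_into_neo4j": 3,       # start from the Step 2 checkpoint files
    "restore_neo4j_dump": None,       # start from the Step 3 checkpoint
    "use_sampled_communities": None,  # the pipeline is not needed at all
}


def pick_entry_point(goal: str):
    """Return the first pipeline step to run, or None if no step is needed."""
    if goal not in ENTRY_POINTS:
        raise ValueError(f"unknown goal: {goal!r}")
    return ENTRY_POINTS[goal]
```

For example, `pick_entry_point("import_tsv_into_neo4j")` returns `3`, meaning you would download the Step 2 checkpoint and run only the Neo4j import.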
Step 1
- When to Run: Run this step if you need to reproduce the entire pipeline, include historical data that is not in our current graph, or add data from blocks created after our cutoff date. This step performs the Initial Block Download (IBD), a process in which the Bitcoin Core client downloads the blockchain history directly from the Bitcoin network. Upon completion, you will have a fully synchronized and indexed Bitcoin Core node.
- System Requirements: ~1 week runtime, ~900 GB storage, ~800 GB Internet traffic.
- Checkpoint: None. We cannot provide checkpoint data for this step.
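One way to check whether the node has finished synchronizing is to inspect the output of Bitcoin Core's `getblockchaininfo` RPC (for example, via `bitcoin-cli getblockchaininfo`). The following is a minimal sketch, not the pipeline's own check: the helper name `ibd_finished` and the sample values are made up for illustration, and it says nothing about whether indexing is complete.

```python
import json


def ibd_finished(getblockchaininfo_json: str) -> bool:
    """Return True once the node reports that Initial Block Download is over."""
    info = json.loads(getblockchaininfo_json)
    # `initialblockdownload` turns False once the node considers itself synced;
    # `blocks` is the validated chain height, `headers` the known header height.
    return (not info["initialblockdownload"]) and info["blocks"] >= info["headers"]


# Abridged, made-up sample of the RPC output, for illustration only.
sample = '{"blocks": 800000, "headers": 850000, "initialblockdownload": true}'
print(ibd_finished(sample))  # False
```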
Step 2
- When to Run: Run this step after completing Step 1 to generate the Bitcoin graph as TSV files.
- System Requirements: 2-3 days runtime, ~1.5 TB storage.
- Checkpoint: Neo4j-format files on AWS:
  s3://bitcoin-graph/v1/data_to_import_neo4j/
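If you start from the Step 2 checkpoint instead of running the step, the files need to be copied from S3 first. The sketch below only builds an `aws s3 sync` command line; it assumes the AWS CLI is installed, the destination directory `/data/neo4j_import` is a hypothetical path, and whether the bucket allows anonymous access (the `--no-sign-request` option) is an assumption to verify.

```python
import shlex

# Step 2 checkpoint location, as listed in this section.
CHECKPOINT_URI = "s3://bitcoin-graph/v1/data_to_import_neo4j/"


def s3_sync_command(source_uri: str, dest_dir: str, anonymous: bool = False) -> list[str]:
    """Build (but do not run) an `aws s3 sync` command as an argument list."""
    cmd = ["aws", "s3", "sync", source_uri, dest_dir]
    if anonymous:
        # Skip request signing; only works if the bucket permits public reads.
        cmd.append("--no-sign-request")
    return cmd


# Hypothetical destination directory; choose a volume with enough free space.
cmd = s3_sync_command(CHECKPOINT_URI, "/data/neo4j_import")
print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` once you have confirmed the destination and your credentials.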
Step 3: Import Data into Neo4j
- When to Run: Run this step to import the formatted TSV files from Step 2 into a Neo4j database, which enables efficient graph exploration and custom graph sampling.
- System Requirements: ~2-3 weeks runtime, ~6 TB storage.
- Checkpoint: Neo4j database dump on AWS:
  s3://bitcoin-graph/v1/neo4j_db_dump/
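Because each step has a sizeable storage footprint, it can help to check free disk space before starting. The following sketch uses the approximate storage figures quoted in this section; it assumes decimal units (1 GB = 10^9 bytes) and that the figures apply per step, and the helper names are hypothetical.

```python
import shutil

# Approximate storage per step, in GB, as quoted in this section.
STORAGE_GB = {1: 900, 2: 1500, 3: 6000}


def required_bytes(step: int) -> int:
    """Approximate storage needed for a step, assuming 1 GB = 10**9 bytes."""
    return STORAGE_GB[step] * 10**9


def has_room(step: int, path: str = ".") -> bool:
    """Return True if the filesystem holding `path` has enough free space."""
    return shutil.disk_usage(path).free >= required_bytes(step)


if not has_room(3):
    print("Less than ~6 TB free here; Step 3 would likely run out of space.")
```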