ETL Pipeline

When do I need to run the ETL pipeline?

You only need to run the ETL pipeline if you would like to reproduce results, collect information beyond the resources we have shared (such as data from blocks created after our cutoff point), or host your own solution.

You do not need to run the ETL pipeline if you only intend to sample new graphs or develop models on the sampled communities we provide.

This section details the Extract, Transform, and Load (ETL) pipeline for collecting Bitcoin data, formatting it as a graph, and importing it into a Neo4j graph database.

You can run the pipeline on the entire Bitcoin history or on a subset of blocks. Since running the pipeline end-to-end is highly resource-intensive, we made two key design decisions: first, we developed it from the ground up to run on a single machine, relying on on-disk processing for resource-heavy operations. Second, given its weeks-long runtime on the entire Bitcoin history, the pipeline is divided into distinct steps, and we provide checkpoint data for each one, allowing you to start the process at any point depending on your application's needs.
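To illustrate the checkpointing design, here is a minimal sketch of a resumable step runner. The step names, checkpoint directory, and run_pipeline entry point are hypothetical stand-ins, not the pipeline's actual interface; they only show how completed steps can be recorded on disk so a run can start at any point.

    # Minimal sketch of the checkpoint idea: each step records its
    # completion on disk, so a later run can resume at any entry point.
    # Step names, paths, and function names here are hypothetical.
    from pathlib import Path

    CHECKPOINT_DIR = Path("checkpoints")
    STEPS = ["sync_node", "extract_transform", "address_stats",
             "txo_lifecycle", "neo4j_import"]

    def run_step(name: str) -> None:
        """Placeholder for the real work performed by a pipeline step."""
        print(f"running {name} ...")

    def run_pipeline(start_at: str) -> None:
        CHECKPOINT_DIR.mkdir(exist_ok=True)
        # Steps before the entry point are skipped entirely; their
        # outputs are assumed to be on disk (e.g. our checkpoint data).
        for name in STEPS[STEPS.index(start_at):]:
            marker = CHECKPOINT_DIR / f"{name}.done"
            if marker.exists():
                continue  # finished in a previous run
            run_step(name)
            marker.touch()  # record completion for future resumes

    run_pipeline(start_at="address_stats")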

Please use the following guide to select the right entry point into the pipeline based on your specific goals and available resources.


  • Step 1: Sync a Bitcoin Node

    Start here if you need to reproduce the entire pipeline, include historical data not present in our current graph, or add data from new blocks created after our cutoff date. (A sketch for checking your node's sync progress follows this list.)


  • Step 2: Extract and Transform

    Run this step if you have a synced node and need to regenerate the graph data yourself. It parses the raw blockchain into graph-formatted TSV files and produces the raw address statistics and raw TXO lifecycle data consumed by the optional steps below.



  • Step 3: Address Statistics

    You may run this optional step if you need detailed summary statistics about script addresses. It runs on the raw address statistics generated in Step 2. (An aggregation sketch follows this list.)


  • Step 4: TXO Lifecycle

    You may run this optional step if you need summary statistics on transaction output (TXO) lifecycles, such as coin dormancy. It runs on the raw TXO lifecycle data generated in Step 2. (A dormancy computation is sketched after this list.)


  • Step 5: Import Data into Neo4j

    This step imports the graph-formatted TSV files generated in Step 2 into a Neo4j database to enable efficient graph exploration and custom graph sampling. (A connection example follows this list.)
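

Examples

For Step 1, a quick way to watch a node catch up is to poll bitcoind's JSON-RPC interface. The sketch below assumes default mainnet settings and placeholder credentials; adjust them to match your bitcoin.conf.

    import requests

    RPC_URL = "http://127.0.0.1:8332"      # bitcoind's default mainnet RPC port
    RPC_AUTH = ("rpcuser", "rpcpassword")  # placeholders; use your bitcoin.conf values

    def rpc(method, params=None):
        payload = {"jsonrpc": "1.0", "id": "etl", "method": method,
                   "params": params or []}
        resp = requests.post(RPC_URL, json=payload, auth=RPC_AUTH, timeout=30)
        resp.raise_for_status()
        return resp.json()["result"]

    # getblockchaininfo reports the node's current height and how much
    # of the chain has been verified so far.
    info = rpc("getblockchaininfo")
    print(f"blocks: {info['blocks']}/{info['headers']} "
          f"({info['verificationprogress']:.2%} verified)")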
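
For Step 3, the gist is rolling the raw per-output address records up into per-address summaries. The file name and columns below (address, value) are illustrative assumptions, not the pipeline's actual schema.

    import csv
    from collections import defaultdict

    # Accumulate simple per-address summaries from the raw records.
    totals = defaultdict(lambda: {"n_outputs": 0, "total_value": 0})

    with open("raw_address_stats.tsv", newline="") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            stats = totals[row["address"]]
            stats["n_outputs"] += 1
            stats["total_value"] += int(row["value"])  # satoshis

    for address, stats in list(totals.items())[:5]:
        print(address, stats)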
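
For Step 4, coin dormancy can be read off a TXO's lifecycle as the number of blocks between its creation and its spend. Again, the file name and columns (txo_id, created_height, spent_height) are assumptions for illustration.

    import csv

    with open("raw_txo_lifecycle.tsv", newline="") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            if row["spent_height"] == "":  # still unspent at the cutoff
                continue
            # Dormancy: blocks the output sat unspent before being consumed.
            dormancy = int(row["spent_height"]) - int(row["created_height"])
            print(row["txo_id"], dormancy)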
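
After Step 5 completes, the official Neo4j Python driver is one way to confirm the import and start exploring. The URI and credentials below are defaults and placeholders, not values the pipeline sets for you.

    from neo4j import GraphDatabase

    # Bolt endpoint and credentials are placeholders; use your own.
    driver = GraphDatabase.driver("bolt://localhost:7687",
                                  auth=("neo4j", "password"))

    with driver.session() as session:
        # A schema-agnostic sanity check: count everything that was imported.
        total = session.run("MATCH (n) RETURN count(n) AS n").single()["n"]
        print(f"imported nodes: {total}")

    driver.close()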