Import into neo4j
Neo4j's neo4j-admin import tool offers the fastest method for initially populating a database. Its main drawback is that it requires the database to be empty, so it does not support incremental updates. We therefore provide two separate solutions: one optimized for initial population and one for incremental updates.
Initial Data Load
-
[Experimental] Run a script that takes the batches of CSV files, combines them into a single file per node or relationship type, and formats them so that the neo4j-admin tool can import them.
1.1. Run:
EXP_PrepareDataForNeo4j
Node definitions must not contain duplicates. However, because of file size and memory usage constraints, this first step does not attempt to remove them; instead, it outputs files whose duplicates are removed in the following steps.
1.2. cd to the directory where the data is persisted.
1.3. Combine the files:
cat *_BitcoinTxNode.tsv > combined_BitcoinTxNode.tsv
cat *_BitcoinScriptNode.tsv > combined_BitcoinScriptNode.tsv
1.4. Sort the files. The goal is to de-duplicate the Tx and Script node files; sorting on the first column groups rows with the same first-column value together. (As an alternative, neo4j-admin has the argument --skip-duplicate-nodes[=true|false].)
LC_ALL=C sort --buffer-size=32G --parallel=16 --temporary-directory=. -t$'\t' -k1,1 combined_BitcoinTxNode.tsv > sorted_BitcoinTxNode.tsv
LC_ALL=C sort --buffer-size=32G --parallel=16 --temporary-directory=. -t$'\t' -k1,1 combined_BitcoinScriptNode.tsv > sorted_BitcoinScriptNode.tsv
1.5. Run the application
EXP_ProcessSortedNodeFiles
-
Create a neo4j database, or if using an existing database, make sure it is empty.
-
Stop the database if it is running.
-
cd to the database's import directory, for instance:
C:\neo4j\relate-data\dbmss\dbms-5739a8c7-7235-4e8a-a2ad-7b708514efce
-
Use the following commands to load the data into the empty neo4j database using the neo4j-admin tool.
$ENV:GDIR="" # set to the directory containing graph data without the trailing `\`
$ENV:HEAP_SIZE = "96G" # set the heap size for the neo4j-admin tool
Note that if you change the name of the database in the following, you will need to create that database in neo4j first, then run the import command.
neo4j admin
.\bin\neo4j-admin.ps1 database import full --overwrite-destination neo4j `
--nodes="$ENV:GDIR\BitcoinCoinbase.tsv.gz" `
--nodes="$ENV:GDIR\BitcoinGraph_header.tsv.gz,$ENV:GDIR\0_BitcoinGraph.tsv.gz" `
--nodes="$ENV:GDIR\BitcoinScriptNode_header.tsv.gz,$ENV:GDIR\unique_BitcoinScriptNode.tsv.gz" `
--nodes="$ENV:GDIR\BitcoinTxNode_header.tsv.gz,$ENV:GDIR\unique_BitcoinTxNode.tsv.gz" `
--relationships="$ENV:GDIR\BitcoinS2S_header.tsv.gz,$ENV:GDIR\.*BitcoinS2S.tsv.gz" `
--relationships="$ENV:GDIR\BitcoinT2T_header.tsv.gz,$ENV:GDIR\.*BitcoinT2T.tsv.gz" `
--relationships="$ENV:GDIR\BitcoinC2T_header.tsv.gz,$ENV:GDIR\.*BitcoinC2T.tsv.gz" `
--relationships="$ENV:GDIR\BitcoinC2S_header.tsv.gz,$ENV:GDIR\.*BitcoinC2S.tsv.gz" `
--relationships="$ENV:GDIR\BitcoinB2T_header.tsv.gz,$ENV:GDIR\.*BitcoinB2T.tsv.gz" `
--relationships="$ENV:GDIR\BitcoinB2S_header.tsv.gz,$ENV:GDIR\.*BitcoinB2S.tsv.gz" `
--relationships="$ENV:GDIR\BitcoinS2B_header.tsv.gz,$ENV:GDIR\.*BitcoinS2B.tsv.gz" `
--relationships="$ENV:GDIR\BitcoinT2B_header.tsv.gz,$ENV:GDIR\.*BitcoinT2B.tsv.gz" `
--delimiter "\t" --array-delimiter ";" --verbose
neo4j admin (skip duplicates)
.\bin\neo4j-admin.ps1 database import full --overwrite-destination neo4j `
--nodes="$ENV:GDIR\BitcoinCoinbase.tsv.gz" `
--nodes="$ENV:GDIR\BitcoinGraph_header.tsv.gz,$ENV:GDIR\0_BitcoinGraph.tsv.gz" `
--nodes="$ENV:GDIR\BitcoinScriptNode_header.tsv.gz,$ENV:GDIR\unique_BitcoinScriptNode.tsv.gz" `
--nodes="$ENV:GDIR\BitcoinTxNode_header.tsv.gz,$ENV:GDIR\unique_BitcoinTxNode.tsv.gz" `
--relationships="$ENV:GDIR\BitcoinS2S_header.tsv.gz,$ENV:GDIR\.*BitcoinS2S.tsv.gz" `
--relationships="$ENV:GDIR\BitcoinT2T_header.tsv.gz,$ENV:GDIR\.*BitcoinT2T.tsv.gz" `
--relationships="$ENV:GDIR\BitcoinC2T_header.tsv.gz,$ENV:GDIR\.*BitcoinC2T.tsv.gz" `
--relationships="$ENV:GDIR\BitcoinC2S_header.tsv.gz,$ENV:GDIR\.*BitcoinC2S.tsv.gz" `
--relationships="$ENV:GDIR\BitcoinB2T_header.tsv.gz,$ENV:GDIR\.*BitcoinB2T.tsv.gz" `
--relationships="$ENV:GDIR\BitcoinB2S_header.tsv.gz,$ENV:GDIR\.*BitcoinB2S.tsv.gz" `
--relationships="$ENV:GDIR\BitcoinS2B_header.tsv.gz,$ENV:GDIR\.*BitcoinS2B.tsv.gz" `
--relationships="$ENV:GDIR\BitcoinT2B_header.tsv.gz,$ENV:GDIR\.*BitcoinT2B.tsv.gz" `
--delimiter "\t" --array-delimiter ";" --verbose --skip-duplicate-nodes
-
Enable APOC.
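Steps 1.3 and 1.4 above can be tried on a small sample before committing to the full data set. The following is a minimal sketch, not the project's tooling: the sample rows, the `[0-9]*` batch-file prefix, the `sample_unique_` output name, and the awk filter are all assumptions made for illustration. In particular, the awk filter only shows why sorting on the first column makes duplicate keys adjacent, so they can be dropped in a single streaming pass; what EXP_ProcessSortedNodeFiles actually does with duplicate rows is not described here.

```shell
#!/bin/sh
set -eu

# Work in a throwaway directory so no real data is touched.
workdir="$(mktemp -d)"
cd "$workdir"

tab="$(printf '\t')"

# Two hypothetical batch files that share one node ID (tx_b).
printf 'tx_a\t1\ntx_b\t2\n' > 0_BitcoinTxNode.tsv
printf 'tx_b\t2\ntx_c\t3\n' > 1_BitcoinTxNode.tsv

# Step 1.3: combine the batches. The numeric prefix in the glob is an
# assumption about how batch files are named.
cat ./[0-9]*_BitcoinTxNode.tsv > combined_BitcoinTxNode.tsv

# Step 1.4: byte-wise sort on the first tab-separated column.
LC_ALL=C sort -t "$tab" -k1,1 combined_BitcoinTxNode.tsv > sorted_BitcoinTxNode.tsv

# Illustrative stand-in for de-duplication: keep the first row of each run
# of identical first-column values. This relies on the input being sorted.
awk -F'\t' 'NR == 1 || $1 != prev { print } { prev = $1 }' \
  sorted_BitcoinTxNode.tsv > sample_unique_BitcoinTxNode.tsv

cat sample_unique_BitcoinTxNode.tsv
```

Restricting the glob to a numeric prefix keeps a re-run from picking up earlier combined_ or sorted_ outputs, which a plain *_BitcoinTxNode.tsv pattern would also match.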
Incremental Update
-
Enable APOC.
-
Make sure to increase the max heap size for neo4j, otherwise you may get the following error message:
There is not enough memory to perform the current task. Please try increasing 'dbms.memory.heap.max_size' in the neo4j configuration (normally in 'conf/neo4j.conf' or, if you are using Neo4j Desktop, found through the user interface) or if you are running an embedded installation increase the heap by using '-Xmx' command line flag, and then restart the database.
-
Run the following (use --help for usage docs):
eba bitcoin import
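The heap limit mentioned in the error message above is set in neo4j.conf. As a hedged sketch, the snippet below changes it in a throwaway stand-in for the config file rather than in a real one. It assumes Neo4j 5.x, where to our knowledge the setting is named server.memory.heap.max_size (the error text uses the older dbms.memory.heap.max_size name); the 16g value and the sample file contents are hypothetical.

```shell
#!/bin/sh
set -eu

# Throwaway stand-in for conf/neo4j.conf (sample contents are hypothetical).
conf="$(mktemp)"
printf '%s\n' \
  '# sample neo4j.conf' \
  'server.memory.heap.initial_size=512m' \
  'server.memory.heap.max_size=512m' > "$conf"

key='server.memory.heap.max_size'  # assumed Neo4j 5.x name; 4.x used dbms.memory.heap.max_size
value='16g'                        # hypothetical; choose a value that fits the machine's RAM

# Rewrite the line whose key matches exactly, or append the setting if absent.
awk -F= -v k="$key" -v v="$value" '
  $1 == k { print k "=" v; found = 1; next }
  { print }
  END { if (!found) print k "=" v }
' "$conf" > "$conf.new"
mv "$conf.new" "$conf"

# Show the resulting heap settings.
grep 'memory.heap' "$conf"
```

After changing the real file, restart the database, as the error message itself instructs.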
Load database dump
We also provide a dump of the neo4j database containing the entire bitcoin graph. In order to use this, you may take the following steps:
-
Import into neo4j. You may follow the steps outlined on this page for importing the downloaded database dump. Or, you may run the following.
.\bin\neo4j-admin.bat database load --overwrite-destination=true --verbose --from-path=M:\\ neo4j
This process may take a few hours and needs 2.722 TiB of storage. If it asks for a password, you may enter: password
-
Enable APOC.