Loading 22 million+ records into a Neo4j database via the Databricks environment

Karu 1 Reputation point
2021-05-18T09:17:48.637+00:00

I am using the Neo4j Spark connector in a Databricks environment. I have loaded 22 million records into a Neo4j database, and I am now creating relationships for these 22 million records with the following query:

JournalLineItemdf.repartition(1).write.format("org.neo4j.spark.DataSource")\
.mode("Overwrite")\
.option("url", "bolt://xxx:7687")\
.option("authentication.basic.username", "xxx")\
.option("authentication.basic.password", "xxx")\
.option("database", "mydb")\
.option("query", """CALL apoc.periodic.iterate(
'MATCH (jh:JournalHeader) RETURN jh',
'WITH jh MATCH (jli:JournalLineItem) WHERE jli.glheader_id = jh.uid MERGE (jh)-[:HAS_A]->(jli)',
{batchSize: 10000, parallel: true, iterateList: true}) YIELD batch RETURN null;""")\
.save()

Although the required relationships were created in the Neo4j database, the Spark job hangs and no further updates are written.

Is there a way to make the query more efficient so that the job finishes once the relationships have been created? At the moment, even though the relationships already exist, the query keeps searching through all the other records, which is unnecessary.

Also, if I run the query directly in Neo4j, it works, but running it through the Databricks Neo4j Spark connector and writing into Neo4j causes the issue described above.
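
A possible cause, offered as a sketch rather than a confirmed diagnosis: according to the connector's documentation, the Cypher passed in the "query" option is run for every DataFrame row (the connector prepends UNWIND $events AS event), so the apoc.periodic.iterate call above would be repeated once per row of JournalLineItemdf. One alternative is to let each row create only its own relationships. The Python below assumes the DataFrame has a glheader_id column that matches JournalHeader.uid, and that indexes exist on JournalHeader.uid and JournalLineItem.glheader_id. The names build_write_options and write_links are hypothetical helpers, not part of the connector.

```python
# Sketch only: per-row Cypher for the Neo4j Spark connector's "query" write option.
# Assumption: the DataFrame has a glheader_id column matching JournalHeader.uid.
# "event" refers to the current DataFrame row, as exposed by the connector.
LINK_QUERY = """
MATCH (jh:JournalHeader {uid: event.glheader_id})
MATCH (jli:JournalLineItem {glheader_id: event.glheader_id})
MERGE (jh)-[:HAS_A]->(jli)
"""


def build_write_options(url, username, password, database, query=LINK_QUERY):
    """Return the connector options as a plain dict (same keys as the original job)."""
    return {
        "url": url,
        "authentication.basic.username": username,
        "authentication.basic.password": password,
        "database": database,
        # Collapse whitespace so the query is passed as a single line.
        "query": " ".join(query.split()),
    }


def write_links(df, options):
    """Hypothetical helper: send one row per distinct glheader_id to Neo4j."""
    keys = df.select("glheader_id").distinct()
    (keys.repartition(1)
         .write.format("org.neo4j.spark.DataSource")
         .mode("Overwrite")
         .options(**options)
         .save())


options = build_write_options("bolt://xxx:7687", "xxx", "xxx", "mydb")
# On the cluster (not run here): write_links(JournalLineItemdf, options)
```

With this shape, the amount of work is bounded by the number of distinct glheader_id values in the DataFrame instead of a full JournalHeader scan per row. Whether it resolves the hang would need to be checked against the actual data and connector version.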

Azure Databricks
An Apache Spark-based analytics platform optimized for Azure.

1 answer

  1. Karu 1 Reputation point
    2021-05-18T09:56:35.477+00:00

    DBR version 6.4 (includes Apache Spark 2.4.5, Scala 2.11). We are using this version because we load Common Data Model (CDM) data from Azure Data Lake Storage Gen2 into Azure Databricks using the Spark CDM connector, spark-cdm-connector-assembly-0.19.1.jar. That connector is compatible with Spark 2.4+ but not with Spark 3+, hence DBR version 6.4.

