DBR version 6.4 (includes Apache Spark 2.4.5, Scala 2.11). We are on this version because we load Common Data Model (CDM) data from Azure Data Lake Storage Gen2 into Azure Databricks using the Spark CDM connector (spark-cdm-connector-assembly-0.19.1.jar), which is compatible with Spark 2.4+ but not with Spark 3+. Hence we are using DBR 6.4.
Loading 22M+ records into a Neo4j database from a Databricks environment
Using the Neo4j Spark Connector from the Databricks environment, I have loaded 22 million records into a Neo4j database. I then establish relationships for these 22 million records with the following query:
JournalLineItemdf.repartition(1).write.format("org.neo4j.spark.DataSource")\
.mode("Overwrite")\
.option("url", "bolt://xxx:7687")\
.option("authentication.basic.username", "xxx")\
.option("authentication.basic.password", "xxx")\
.option("database", "mydb")\
.option("query","""CALL apoc.periodic.iterate(
'MATCH(jh:JournalHeader) return jh', 'with jh MATCH (jli:JournalLineItem) WHERE jli.glheader_id=jh.uid MERGE(jh)-[:HAS_A]->(jli)',{batchSize:10000, parallel: true, iterateList:true}) yield batch return null;""")\
.save()
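For comparison, the connector also documents a declarative relationship write mode that needs no embedded Cypher query. Below is a sketch of that approach, assuming the Neo4j Connector for Apache Spark 4.x (which also ships a Spark 2.4 / Scala 2.11 build); the option names come from the connector documentation, while the labels, node keys, and column names are assumptions based on the question.

```python
# Hypothetical sketch: declarative relationship write options for the
# Neo4j Spark Connector, instead of a raw Cypher "query" option.
# The node-key mappings ("dataframeColumn:nodeProperty") are assumptions.
NEO4J_REL_OPTIONS = {
    "url": "bolt://xxx:7687",
    "authentication.basic.username": "xxx",
    "authentication.basic.password": "xxx",
    "database": "mydb",
    "relationship": "HAS_A",
    "relationship.save.strategy": "keys",
    "relationship.source.labels": ":JournalHeader",
    "relationship.source.save.mode": "Match",            # nodes already loaded
    "relationship.source.node.keys": "glheader_id:uid",  # df column -> node property
    "relationship.target.labels": ":JournalLineItem",
    "relationship.target.save.mode": "Match",
    "relationship.target.node.keys": "uid:uid",
}

def write_has_a_relationships(df):
    """Configure and run a relationship write for the line-item DataFrame."""
    writer = df.write.format("org.neo4j.spark.DataSource").mode("Append")
    for key, value in NEO4J_REL_OPTIONS.items():
        writer = writer.option(key, value)
    writer.save()
```

With this mode the connector builds one relationship per DataFrame row, so the job naturally ends when the rows are exhausted.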
The required relationships do get established in the Neo4j database, but the Spark job then hangs and no further updates are written.
Is there a way to make the query more efficient so that the job finishes once the relationships are established? As it stands, even after all the relationships are created, the query keeps scanning the remaining records, which is unnecessary here.
Also, if I run the same query directly in the Neo4j environment it works fine; it is only when writing to Neo4j through the Databricks Neo4j Spark Connector that the issue occurs.
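One workaround sketch: since the apoc.periodic.iterate call does not actually consume any DataFrame rows, it can be issued exactly once with the official neo4j Python driver (pip install neo4j) rather than through the Spark write path. The connection details reuse the placeholders from the question; note this sketch also sets parallel: false, because parallel MERGE on the same nodes can cause lock contention.

```python
# Hypothetical sketch: run the batch relationship query once, outside Spark,
# using the official neo4j Python driver.
try:
    from neo4j import GraphDatabase
except ImportError:  # driver not installed; sketch only
    GraphDatabase = None

REL_QUERY = """
CALL apoc.periodic.iterate(
  'MATCH (jh:JournalHeader) RETURN jh',
  'MATCH (jli:JournalLineItem) WHERE jli.glheader_id = jh.uid
   MERGE (jh)-[:HAS_A]->(jli)',
  {batchSize: 10000, parallel: false})
YIELD batches, total, errorMessages
RETURN batches, total, errorMessages
"""

def build_relationships(uri="bolt://xxx:7687", user="xxx", password="xxx",
                        database="mydb"):
    """Run the relationship batch job exactly once and return its summary."""
    driver = GraphDatabase.driver(uri, auth=(user, password))
    try:
        with driver.session(database=database) as session:
            record = session.run(REL_QUERY).single()
            return record["batches"], record["total"]
    finally:
        driver.close()
```

Because the query runs a single time, the job terminates as soon as apoc.periodic.iterate reports its final batch, instead of being re-driven by the Spark writer.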