When you load data into DataFrames in Databricks, how can you make sure the rows in the DataFrames are not duplicated?
In SQL you can handle this with a unique constraint on the table. How can this be handled in DataFrames to ensure rows are not duplicated?
If df is the name of your DataFrame, there are two ways to get unique rows:
df2 = df.distinct()
or
df2 = df.drop_duplicates()
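If duplicates should be judged on a business key rather than the full row, drop_duplicates also accepts a subset of columns. A minimal sketch, where customer_id and order_date are hypothetical placeholders for whatever columns define uniqueness in your data:
# Keep the first row seen for each (customer_id, order_date) pair;
# the column names here are hypothetical placeholders.
df2 = df.drop_duplicates(subset=["customer_id", "order_date"])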
Hi,
Delta does not enforce primary key (PK) constraints, but you can create surrogate keys in different ways:
1) You can read this article for an example (a short sketch follows these links):
https://www.linkedin.com/pulse/creating-surrogate-keys-databricks-delta-using-spark-sql-patel/
2) Once this key is created, you can use the MERGE function:
https://docs.delta.io/latest/delta-update.html#upsert-into-a-table-using-merge
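To illustrate option 1, a surrogate key can be derived by hashing the business key columns. This is only a minimal sketch, not the article's exact method; customer_id and order_date are hypothetical columns:
from pyspark.sql.functions import concat_ws, sha2
# Build a deterministic surrogate key from the (hypothetical) business key
# columns, so the same row always hashes to the same uniqueId.
df_with_key = df.withColumn("uniqueId", sha2(concat_ws("||", "customer_id", "order_date"), 256))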
If your use case is "Data deduplication when writing into Delta tables", then with MERGE you can avoid inserting duplicate records:
MERGE INTO logs
USING newDedupedLogs
ON logs.uniqueId = newDedupedLogs.uniqueId
WHEN NOT MATCHED
THEN INSERT *
https://docs.delta.io/latest/delta-update.html#data-deduplication-when-writing-into-delta-tables
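The same insert-only merge can also be expressed through the Python DeltaTable API. A minimal sketch assuming a registered Delta table named logs and a deduplicated source DataFrame newDedupedLogs, mirroring the SQL above:
from delta.tables import DeltaTable
# Load the target Delta table and insert only the rows whose uniqueId
# is not already present; matched rows are left untouched.
logs = DeltaTable.forName(spark, "logs")
(logs.alias("logs")
    .merge(newDedupedLogs.alias("newDedupedLogs"),
           "logs.uniqueId = newDedupedLogs.uniqueId")
    .whenNotMatchedInsertAll()
    .execute())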