@Shambhu Rai
In Databricks, you can use a combination of PySpark and DBFS commands to move files based on the last modified timestamp. Here’s an example:
from pyspark.sql import SparkSession
import time

spark = SparkSession.builder.getOrCreate()

# Define source and destination directories
src_dir = "dbfs:/mnt/test/test1"
dest_dir = "dbfs:/mnt/test/test2"

# List the files in the source directory via the Hadoop FileSystem API,
# which exposes each file's last modified timestamp
fs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())
list_status = fs.listStatus(spark._jvm.org.apache.hadoop.fs.Path(src_dir))

for file_status in list_status:
    if not file_status.isFile():
        continue  # skip subdirectories
    modification_time = file_status.getModificationTime() / 1000  # milliseconds -> seconds
    # Check if the file was modified in the last 24 hours
    if time.time() - modification_time <= 86400:
        # Move the file into the destination directory, keeping its name
        dbutils.fs.mv(file_status.getPath().toString(),
                      dest_dir + "/" + file_status.getPath().getName())
This script moves every file in dbfs:/mnt/test/test1 that was modified in the last 24 hours to dbfs:/mnt/test/test2. You can adjust the time condition (86400 seconds = 24 hours) as per your needs.
Please note that this script must be run with sufficient permissions to read from the source directory and write to the destination directory. Also, remember to replace dbfs:/mnt/test/test1 and dbfs:/mnt/test/test2 with your actual source and destination paths.
Alternatively, you can use Databricks Auto Loader for incremental ingestion of files. It processes new data files as they arrive in cloud object storage and keeps its state at a checkpoint location in a key-value store called RocksDB. Because the state lives in the checkpoint, Auto Loader can resume from where it left off even after a failure and can guarantee exactly-once semantics.
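As a rough illustration, here is a minimal Auto Loader sketch. The file format (json) and the schema/checkpoint/output paths are assumptions you would adapt to your setup, and trigger(availableNow=True) requires Spark 3.3+ / a recent Databricks Runtime:

# Incrementally ingest new files from the source directory with Auto Loader
df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "json")                      # assumed format of incoming files
      .option("cloudFiles.schemaLocation", "dbfs:/mnt/test/_schema")
      .load("dbfs:/mnt/test/test1"))

(df.writeStream
   .format("delta")
   .option("checkpointLocation", "dbfs:/mnt/test/_checkpoint")  # RocksDB state lives here
   .trigger(availableNow=True)                                  # process pending files, then stop
   .start("dbfs:/mnt/test/test2"))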
Another approach would be to maintain a control table that keeps track of the last load timestamp, and compare it with the modified timestamps of your files to identify and load only the new ones. This part would need to be done in Python (for example with dbutils.fs.ls or the Hadoop FileSystem API), as Spark has no direct function for it.
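A rough sketch of that idea follows; the table and column names here are hypothetical, and FileInfo.modificationTime from dbutils.fs.ls is only available on recent Databricks runtimes:

from pyspark.sql import functions as F

control_table = "default.file_load_control"  # hypothetical control table with a load_ts column

# Read the high-water mark from the control table; fall back to 0 on the first run
try:
    last_load_ts = spark.table(control_table).agg(F.max("load_ts")).first()[0] or 0
except Exception:
    last_load_ts = 0  # control table does not exist yet

# Keep only files modified after the last recorded load (modificationTime is epoch millis)
files = [f for f in dbutils.fs.ls(src_dir) if f.modificationTime > last_load_ts]

if files:
    # Load the new files (plain text here for illustration) and append to a target table
    spark.read.text([f.path for f in files]).write.mode("append").saveAsTable("default.target_table")
    # Record the new high-water mark for the next run
    spark.createDataFrame([(max(f.modificationTime for f in files),)], "load_ts long") \
         .write.mode("append").saveAsTable(control_table)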
For more details, please go through this related thread: https://community.databricks.com/t5/data-engineering/load-files-filtered-by-last-modified-in-pyspark/td-p/4159