AzCopy re-trying MD5 failed files

Zeno GH 25 Reputation points
2024-06-17T10:09:30.33+00:00

Hi,

In case of an MD5 mismatch in AzCopy, can we catch this error and retry immediately while the download is happening, or do we have to wait until it's finished and then check the azcopy.log file to find what failed?

Microsoft article - https://video2.skills-academy.com/en-us/azure/storage/common/storage-use-azcopy-configure?toc=%2Fazure%2Fstorage%2Fblobs%2Ftoc.json&bc=%2Fazure%2Fstorage%2Fblobs%2Fbreadcrumb%2Ftoc.json only mentions one option, which is to read the azcopy.log file and search for failed downloads.

  1. Is there a more efficient way to check for an MD5 mismatch in memory while the download is happening, and then immediately retry (re-download) the file that failed the MD5 check?
  2. azcopy.log option: will an MD5 mismatch be logged in the DOWNLOADSFAILED section with the file path? That way I can fetch the file location and retry that file instead of downloading the whole folder again.

Regards

Azure Storage Accounts

3 answers

  1. Vinodh247-1375 12,506 Reputation points
    2024-06-17T10:29:02.2+00:00

    Hi Zeno GH,

    Thanks for reaching out to Microsoft Q&A.

    Is there a more efficient way to check for an MD5 mismatch in memory while the download is happening, and then immediately retry (re-download) the file that failed the MD5 check?

    To efficiently check for MD5 mismatches during a file download, you can use a streaming approach where the data is hashed as it’s being received. Libraries like hashlib in Python allow you to update the hash with chunks of data as they come in. If a mismatch is detected, you can use exception handling to trigger an immediate retry of the download. This process can be automated using a script or within a data pipeline in Azure Data Factory by incorporating custom activities or Azure Functions for the hashing and retry logic.

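    import hashlib
    import requests

    def download_file(url, expected_md5, dest_path, max_attempts=3):
        """Download url to dest_path, hashing chunks as they arrive; retry on MD5 mismatch."""
        # expected_md5 is a hex digest. Azure's Content-MD5 header is base64,
        # so convert it to hex before comparing.
        for attempt in range(1, max_attempts + 1):
            md5_hash = hashlib.md5()
            with requests.get(url, stream=True, timeout=30) as response:
                response.raise_for_status()  # fail fast on HTTP errors
                with open(dest_path, "wb") as f:
                    for chunk in response.iter_content(chunk_size=4096):
                        md5_hash.update(chunk)  # hash in memory as data streams in
                        f.write(chunk)
            if md5_hash.hexdigest() == expected_md5.lower():
                print("MD5 matched.")
                return True
            print(f"MD5 mismatch on attempt {attempt}, retrying download...")
        return False  # give up after max_attempts

    # Example usage
    url = 'https://example.com/file'
    expected_md5 = 'expected_md5_hash_here'
    download_file(url, expected_md5, 'file.bin')

    This function retries the download until the MD5 matches the expected value or the attempt limit is reached, so a permanently corrupt source cannot cause endless retries. You can try integrating this logic into an Azure Function and triggering it from ADF for automated retries.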

    azcopy.log option: will an MD5 mismatch be logged in the DOWNLOADSFAILED section with the file path? That way I can fetch the file location and retry that file instead of downloading the whole folder again.

    Yes, AzCopy logs an MD5 mismatch as a failed download in the azcopy.log file, including the file path. Failed downloads are marked with the DOWNLOADFAILED string, so you can search the log for it. This allows you to identify the specific files that failed the MD5 check and retry downloading just those files instead of the entire folder.
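
As a rough illustration of the log-based retry described above, the sketch below scans AzCopy log lines for failed downloads and keeps those that mention MD5. The `DOWNLOADFAILED` marker is the string the AzCopy documentation suggests searching for; the exact line layout, the `md5` keyword match, the helper name `find_md5_failures`, and the sample lines are all assumptions made for illustration, so check them against a real log before relying on this.

```python
import re

# Assumption: a failed download is marked with DOWNLOADFAILED and the
# source URL follows the marker. Verify against your own azcopy.log.
FAILED_RE = re.compile(r"DOWNLOADFAILED:\s*(\S+)")

def find_md5_failures(log_lines):
    """Return source URLs of failed downloads whose log line mentions MD5."""
    failures = []
    for line in log_lines:
        match = FAILED_RE.search(line)
        if match and "md5" in line.lower():
            failures.append(match.group(1))
    return failures

# Hypothetical sample lines, made up for illustration only.
sample_log = [
    "2024/06/17 10:00:01 INFO: starting transfer",
    "2024/06/17 10:00:02 ERR: DOWNLOADFAILED: "
    "https://example.blob.core.windows.net/c/b.bin : the MD5 hash does not match",
    "2024/06/17 10:00:03 ERR: DOWNLOADFAILED: "
    "https://example.blob.core.windows.net/c/c.bin : 404 blob not found",
]

for url in find_md5_failures(sample_log):
    print("retry:", url)
```

The returned URLs could then be passed to a single-file download, so only the affected blobs are fetched again.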

    Please 'Upvote'(Thumbs-up) and 'Accept' as an answer if the reply was helpful. This will benefit other community members who face the same issue.


  2. Zeno GH 25 Reputation points
    2024-06-17T12:03:41.3933333+00:00

    Hi,

    Thanks for the feedback :)

    I'm using a C# web service with the check/put MD5 option, so I don't think I need any external library.

    Will the AzCopy error response object contain the "md5 mismatch" string along with the file path? In that case, while AzCopy is running and the response object receives an error, I will read the message content, and if the string matches "md5 mismatch", I will read the file path from the same object and retry the download for that particular file. This function can be recursive so that it retries until the error is gone.

    Will this approach work, and is it better than reading azcopy.log, fetching the filename, and then retrying the failed file (instead of the whole folder)?

    Regards


  3. Sumarigo-MSFT 44,891 Reputation points Microsoft Employee
    2024-07-02T05:28:18.0166667+00:00

    @Zeno GH Apologies for the delayed response!
    AzCopy does not retry a failed job by default. Once the job is complete, you get a summary of how many files failed and how many went through. If you resume the job with the correct job ID, it will retry the failed files. If there are failures other than MD5 mismatches and you wish to retry only the MD5-related failures, you need to go through the logs, find the file names, and retry them manually.
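
The resume flow described above can be sketched as a small wrapper around the `azcopy jobs show` and `azcopy jobs resume` commands. This is a minimal sketch, assuming `azcopy` is on the PATH; the helper names and the job ID value are hypothetical, and if the original job used SAS tokens they typically have to be supplied again on resume (for example via `--source-sas`).

```python
import subprocess

def show_failed_cmd(job_id):
    """Command that lists the transfers of a job that ended in Failed state."""
    return ["azcopy", "jobs", "show", job_id, "--with-status=Failed"]

def resume_cmd(job_id):
    """Command that resumes a job, retrying transfers that did not succeed."""
    return ["azcopy", "jobs", "resume", job_id]

def retry_failed(job_id, run=subprocess.run):
    """List the failed transfers, then resume the job.

    `run` is injectable so the flow can be exercised without azcopy installed.
    """
    run(show_failed_cmd(job_id), check=False)
    return run(resume_cmd(job_id), check=False)

# Hypothetical job ID placeholder; the real one is printed by azcopy
# when the job starts and is listed by `azcopy jobs list`.
print(" ".join(resume_cmd("<job-id>")))
```

Note that resuming retries every failed transfer in the job, not only the MD5 mismatches, which matches the behaviour described in the answer.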

    Please let us know if you have any further queries. I’m happy to assist you further.     


    Please do not forget to "Accept the answer" and "up-vote" wherever the information provided helps you; this can be beneficial to other community members.
