Azure databrciks pyspark code needs support based on the file extension received in datalake

Chitra Marimuthu 1 Reputation point
2021-06-10T08:53:48.223+00:00

In my azure data lake everyday I will receive any one of the below file (only the file extension will vary but the file name remains same).

File Name:
Receipts_currendate.xlsx(extension is in small letter) or Receipts_currentdate.XLSX(extension is in Caps letter).

In my databricks the current pyspark code reads a file from a data lake (through ADL mount)and is supported only for the file extension in Caps letter .but I need the code needs to support for the file extension with small letter as well .

Current code in databricks :

import os
from pyspark.sql.types import *
from pyspark.sql.functions import *
import pandas as pd
from datetime import date,datetime,timedelta
from pyspark.sql.window import Window
from pyspark.sql import SQLContext

currentdate=(date.today().strftime('%m_%d_%Y')) #Filepath='/dbfs'+var_raw_path+'Receipts/receipts_02_03_2021.XLSX' Filepath='/dbfs'+var_raw_path+'WFCReceipts/receipts_'+currentdate+'.XLSX'
df = pd.read_excel(Filepath,index_col=None,usecols=[*range(0, 9)])

Azure Databricks
Azure Databricks
An Apache Spark-based analytics platform optimized for Azure.
2,163 questions
0 comments No comments
{count} votes

1 answer

Sort by: Most helpful
  1. PRADEEPCHEEKATLA-MSFT 88,791 Reputation points Microsoft Employee
    2021-06-10T09:58:51.027+00:00

    Hello @Chitra Marimuthu ,

    Welcome to the Microsoft Q&A platform.

    As per my repro, I'm able to read the excel file with file extension in small/Caps letter.

    I had created two excel file with file extension in small & Caps letter in ADLS gen1 storage account.

    • Receipts_currendate.xlsx(extension is in small letter)
    • Receipts_currentdate.XLSX(extension is in Caps letter)

    Note: Installed openpyxl package to read the excel files using pandas.

    104194-image.png

    Reason for install openpyxl package.

    Pandas uses the xlrd as their default engine for reading excel files. However, xlrd has removed support for anything other than xls files in their latest release.

    To solve this, do the following:

    • Install openpyxl: This is another excel package that still supports the xlsx format
    • Set the engine to “openpyxl” instead of the default “xlrd”

    104241-image.png

    Able to read the excel files with file extension in small/Caps letter using Azure Databricks.

    104242-image.png

    Hope this helps. Do let us know if you any further queries.

    ---------------------------------------------------------------------------

    Please "Accept the answer" if the information helped you. This will help us and others in the community as well.


Your answer

Answers can be marked as Accepted Answers by the question author, which helps users to know the answer solved the author's problem.