I am not getting all the files using listFiles(basep: String, globp: String) in Azure Databricks

Manoj Kumar 1 Reputation point
2021-04-23T06:01:58.243+00:00

Hi,

I have a question regarding this article https://video2.skills-academy.com/en-us/azure/databricks/kb/data/list-delete-files-faster

I have a directory in my ADLS account that contains several folders, and inside those folders I have some Parquet files. I was using the method described in that article to collect the paths of all the files.
But after running listFiles(basep: String, globp: String) in my ADB notebook, I checked the output and found that some folder paths are completely missing from it, so the result is that I am not getting a single file for those particular folders.
I tried to debug it by putting print statements in between, and I saw that the ArrayBuffer comes back empty for those folders, even though in ADLS I can see the data is there. Now I wonder why this is happening and on what basis it is skipping some folders.

(Screenshot 90603-empty.png: the empty ArrayBuffer output.)

How does this InMemoryFileIndex.bulkListLeafFiles() function work?

What should my approach be to get all the paths? Any suggestions or help would be appreciated.

2 answers

  1. PRADEEPCHEEKATLA-MSFT 88,716 Reputation points Microsoft Employee
    2021-04-23T09:19:39.57+00:00

    Hello @Manoj Kumar,

    Welcome to the Microsoft Q&A platform.

    I'm able to successfully list all the files present in the mount point using the same code mentioned in the document.

    (Screenshot 90715-image.png: the file listing output.)

    Could you please share the code you are running, along with the complete stack trace of any error message you are seeing?

    How does the InMemoryFileIndex.bulkListLeafFiles() function work?

    The listFiles function takes a base path and a glob path as arguments, scans the files, matches them against the glob pattern, and then returns all the matched leaf files as a sequence of strings.

    The function also uses the utility function globPath from the SparkHadoopUtil package. This function lists all the paths in a directory with the specified prefix, and does not further list leaf children (files). The list of paths is passed into InMemoryFileIndex.bulkListLeafFiles method, which is a Spark internal API for distributed file listing.

    Neither of these listing utility functions works well alone. By combining them, you can get a list of the top-level directories that you want to list using the globPath function, which runs on the driver, and you can distribute the listing of all child leaves of those top-level directories across the Spark workers using bulkListLeafFiles.
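
    For reference, here is a sketch of the helper along the lines of the one in the article. Note that InMemoryFileIndex.bulkListLeafFiles is a Spark-internal API (private in open-source Spark, but callable from Databricks notebooks per the article), so its exact signature and return type can differ across Spark and Databricks Runtime versions; sc and spark are the variables predefined in a Databricks notebook:

        import java.net.URI
        import org.apache.hadoop.conf.Configuration
        import org.apache.hadoop.fs.{FileSystem, Path}
        import org.apache.spark.deploy.SparkHadoopUtil
        import org.apache.spark.sql.execution.datasources.InMemoryFileIndex

        def listFiles(basep: String, globp: String): Seq[String] = {
          val conf = new Configuration(sc.hadoopConfiguration)
          val fs = FileSystem.get(new URI(basep), conf)

          // Ensure both fragments start with "/" so Path.mergePaths joins them cleanly.
          def validated(path: String): Path =
            if (path.startsWith("/")) new Path(path) else new Path("/" + path)

          // Step 1 (driver): expand the glob into the matching top-level directories.
          val matched = SparkHadoopUtil.get.globPath(
            fs, Path.mergePaths(validated(basep), validated(globp)))

          // Step 2 (workers): list all leaf files under those directories in parallel.
          val fileCatalog = InMemoryFileIndex.bulkListLeafFiles(
            paths = matched,
            hadoopConf = conf,
            filter = null,
            sparkSession = spark)

          fileCatalog.flatMap(_._2.map(_.getPath.toString))
        }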

    According to Amdahl’s law, the speed-up can be around 20-50x. The reason is that you can easily control the glob path according to the actual physical file layout, and you can control the parallelism through spark.sql.sources.parallelPartitionDiscovery.parallelism for InMemoryFileIndex.
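
    For example, you can raise the listing parallelism like this (the value 128 is just an illustration; tune it to your cluster size):

        // Number of tasks used for the distributed file listing in InMemoryFileIndex.
        spark.conf.set("spark.sql.sources.parallelPartitionDiscovery.parallelism", "128")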

    Hope this helps. Do let us know if you have any further queries.

    ------------

    Please don’t forget to Accept Answer and Up-Vote wherever the information provided helps you, as this can be beneficial to other community members.


  2. Manoj Kumar 1 Reputation point
    2021-04-26T12:49:34.373+00:00

    Sorry, @PRADEEPCHEEKATLA-MSFT, I think you didn't get my main point.

    I am not saying that I am getting errors or wrong output; it's just that I am not getting the complete output. I know this code works fine for you, and it also works fine for me on my other ADLS data. That's why I want to know how this code works internally.

    I broke this code into small parts and tried to understand them, but without much success.

    SparkHadoopUtil.get.globPath(fs, Path.mergePaths(validated(basep), validated(globp)))

    (Screenshot 91324-image.png: the output of this call.)

    I tried to read the definition of the function used in this code from the GitHub file SparkHadoopUtil.scala.
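
    As far as I can tell, the definition there looks roughly like this (copied approximately from the Spark 2.4 branch, so later versions may differ):

        def globPath(fs: FileSystem, pattern: Path): Seq[Path] = {
          // Hadoop's FileSystem.globStatus can return null (for example, when a
          // non-glob path does not exist), so the fallback yields an empty Seq.
          Option(fs.globStatus(pattern)).map { statuses =>
            statuses.map(_.getPath.makeQualified(fs.getUri, fs.getWorkingDirectory)).toSeq
          }.getOrElse(Seq.empty[Path])
        }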

    I didn't understand the complete code, but I see that the last line, .getOrElse(Seq.empty[Path]), could also be a reason for the skipped paths: if this returns an empty list of paths, how will InMemoryFileIndex.bulkListLeafFiles() tell what's inside?

    If you understand this function's definition, in which cases will it return Seq.empty[Path]?
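
    One check I can think of (a hypothetical snippet, reusing the fs and validated helpers from the article's code, with basep and globp set for one of the folders that goes missing) is to call fs.globStatus directly on the merged path and see what comes back:

        val merged = Path.mergePaths(validated(basep), validated(globp))
        val statuses = fs.globStatus(merged) // null or an empty array means nothing matched
        println(if (statuses == null) "no match (null)"
                else statuses.map(_.getPath).mkString("\n"))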

    Thanks.

