I have the following Scala code in Azure Databricks that creates a DataFrame with the list of files and file properties for a specific folder.
Can someone help me transform this code so it recursively navigates each sub-folder and adds those files to the DataFrame?
Here is my code:
```scala
%scala
import org.apache.spark.sql.DataFrame

def GetFiles(path: String): DataFrame = {
  try {
    spark.createDataFrame(
      dbutils.fs.ls(path).map { info =>
        (info.path, info.name, info.size, info.modificationTime)
      }
    ).toDF("path", "name", "size", "modificationTime").orderBy('name)
  }
  catch {
    case e: Exception => spark.emptyDataFrame
  }
}
```
2 Answers
`dbutils.fs.ls` will not list files recursively; it only returns the files in that particular folder. As a workaround, you can try the approach below to get your requirement done.
Using `Files.walkFileTree()`, I first got the list of all files recursively, as shown below. Then I extracted the distinct parent-folder paths from that list. Finally, looping over those sub-folders with `dbutils.fs.ls(folderPath)` and `union`, I generated the result DataFrame with the required columns; the complete code follows that three-step outline.
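A minimal sketch of those three steps, assuming a Databricks notebook (so `dbutils` and `spark` are in scope) and a `dbfs:/` folder that is also reachable through the local `/dbfs/` mount. The sketch uses `Files.walk` as a compact, stream-based relative of `Files.walkFileTree`, and the helper name `getFilesFromSubfolders` is just an illustrative choice:

```scala
import java.nio.file.{Files, Paths}
import scala.collection.JavaConverters._
import org.apache.spark.sql.DataFrame

def getFilesFromSubfolders(rootPath: String): DataFrame = {
  // Step 1: walk the directory tree through the /dbfs/ mount to find every file.
  // Assumes rootPath is a dbfs:/ (or plain /mnt/...) path visible under /dbfs/.
  val localRoot = Paths.get("/dbfs" + rootPath.stripPrefix("dbfs:"))
  val allFiles  = Files.walk(localRoot).iterator().asScala
    .filter(p => Files.isRegularFile(p))
    .toList

  // Step 2: extract the distinct parent-folder paths and map them back to dbfs: URIs.
  val folders = allFiles
    .map(_.getParent.toString.replaceFirst("^/dbfs", "dbfs:"))
    .distinct

  // Step 3: list each folder with dbutils.fs.ls and union the per-folder DataFrames.
  folders match {
    case Nil => spark.emptyDataFrame
    case fs =>
      fs.map { folder =>
        spark.createDataFrame(
          dbutils.fs.ls(folder).map(info => (info.path, info.name, info.size, info.modificationTime))
        ).toDF("path", "name", "size", "modificationTime")
      }.reduce(_ union _).orderBy("name")
  }
}
```

Whether the `/dbfs/` mount is available depends on the workspace and cluster configuration, so treat the path translation above as an assumption of this sketch.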
You can accomplish this with a simple recursive function, as shown below. P.S. I have written this to work with Azure's `mssparkutils.fs.ls`, but the code is generic. Replace XXX with whatever works for you, i.e. whatever type `dbutils.fs.ls` returns.
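A minimal sketch of such a recursive helper, written against `dbutils.fs.ls` and assuming a Databricks notebook where `dbutils` and `spark` are in scope. The helper name `listRecursive` and the trailing-slash check used to detect folders are illustrative choices; returning plain tuples sidesteps having to name the XXX type explicitly:

```scala
import org.apache.spark.sql.DataFrame

// Recursively collect (path, name, size, modificationTime) tuples for every file
// under `path`, descending into sub-folders as it goes.
def listRecursive(path: String): Seq[(String, String, Long, Long)] = {
  dbutils.fs.ls(path).flatMap { info =>
    if (info.name.endsWith("/"))   // dbutils.fs.ls marks folders with a trailing slash
      listRecursive(info.path)     // descend into the sub-folder
    else
      Seq((info.path, info.name, info.size, info.modificationTime))
  }
}

def getFilesRecursive(path: String): DataFrame =
  spark.createDataFrame(listRecursive(path))
    .toDF("path", "name", "size", "modificationTime")
    .orderBy("name")
```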