Azure Data Factory – How to Retrieve the Latest File in Each Folder

In this article, we are going to learn about retrieving the file based on the Last Modified Date / time in a given folder of Azure Data Lake Gen2

Advertisements

Introduction

Azure Data Factory is a powerful tool for managing and processing data in the cloud. One of the common tasks in data processing is retrieving the latest file from a set of folders. This can be a time-consuming and error-prone task, especially when dealing with large volumes of data.

Fortunately, Azure Data Factory provides a simple and effective way to automate this process using its built-in activities and connectors. In this article, we will explore how to retrieve the latest file in each folder using Azure Data Factory, and how to use this feature to streamline your data processing workflows.

Let’s get started.

Advertisements

Retrieve Latest Files

Create a Pipeline Variable JSONArray of type String as shown in the below screenshot which stores information about all the files available in the given folder.

Azure Data Factory How to Retrieve the Latest File in Each Folder – Json array variable

Create a Dataset named ds_Folder which refers the Container in which you would like to get the latest file from. If you have folders inside the container then you can specify the directory path/name in the Directory field. However, please make sure that the Filename field is empty as shown in the below screenshot.

Azure Data Factory How to Retrieve the Latest File in Each Folder – Dataset Creation

Create a Get Metadata Activity named GetAllFiles, select the dataset and then create a field list and choose ChildItems as shown below.

Azure Data Factory How to Retrieve the Latest File in Each Folder – Get Metadata

In the above scenario, the Get Metada retrieves all the child items of the given folder in the dataset as shown below.

Azure Data Factory How to Retrieve the Latest File in Each Folder – Iterate Each File

Please make a note that the about output has only the file name and the type as file.

Advertisements

In order to iterate through all the all the files, create a new ForEach activity named IterateEachFile with the settings shown below.

Azure Data Factory How to Retrieve the Latest File in Each Folder – Iterate Each File

If you would like to test till the current step, you can’t as the ForEach activity is empty. In order to test, let’s add Wait activity within the ForEach and test it to review the output.

Advertisements

The next important step is to create another Dataset named ds_File with a Dataset Parameter named FileName as shown below.

Azure Data Factory How to Retrieve the Latest File in Each Folder – File Dataset

Now, navigate to Connection tab and add refer the FileName parameter as shown below.

Azure Data Factory How to Retrieve the Latest File in Each Folder – File Dataset Filename

It’s time to retrieve the LastModified property of each file. So, Navigate to ForEach loop and perform the below steps which are shown in the next screenshot.

Advertisements
  • Navigate inside the ForEach activity
  • Create a new Get Metadata activity named Get Last Modified Date
  • Goto Settings tab
  • Select ds_File dataset
  • Assign the @item().name property to the FileName parameter
  • In the Field list property, click New
  • In the Argument dropdown, select Last Modified
Azure Data Factory How to Retrieve the Latest File in Each Folder – Get Modified Date

Now, when you run you should view the LastModified value for each file that exist in your Container.

Advertisements

At this stage, we have Filenames and LastModified values separately. We should develop the functionality of combining them in a single Array. Let’s now, create another Array variable named FileListWithModifiedDate as shown below.

Azure Data Factory How to Retrieve the Latest File in Each Folder – FileListWithModifiedDate

Navigate to the ForEach loop, create a new activity FrameJsonArray of type Append variable which

Azure Data Factory How to Retrieve the Latest File in Each Folder – FrameJson

The Dynamic contents of the value for the FileListWithModifiedDate array value is given below.

@concat('{"FileName":"',item().name,'"',',"LastModified":"',activity('Get Last Modified Date').output.lastModified,'"}')
Advertisements

In the above code a Key-Value pair for each File is create in the json format. You can review the same in the output when you run the pipeline.

Azure Data Factory How to Retrieve the Latest File in Each Folder – FrameJson Array

In the above screenshot, the output of one of the FrameJsonArray activity has been shown. The value is a Key-Value pair of the FileName and the LastModified of each file.

Advertisements

The next step is to combine the Key-Value pairs available in the FileListWithModifiedDate Array of all the files into a single Json String. We can achieve it using the join function.

After the ForEach activity, create a new SetVariable named Set JSONArray Content activity with the below settings as shown below.

Azure Data Factory How to Retrieve the Latest File in Each Folder – Set Json Array Content

You can use the below expression that create the values in the Json specific array

@concat('[',join(variables('FileListWithModifiedDate'),','),']')

The output of the above expression would be something similar to the below.

Azure Data Factory How to Retrieve the Latest File in Each Folder – Set Json Array Content Output

In the above JSON we have both the FileName along with LastModifiedDate. It’s time to sort the data based on the LastModifiedDate. As of this writing, Azure Data Factory doesn’t support sorting functionality. So, we need to leverage some other service (like Azure Functions, Logic App etc) which could help in sorting the json data. In this example, we will take SQLDatabase Stored Procedure to perform the sorting of the JSON Content.

Advertisements

Create a new Stored Procedure which takes JSON string as input, sorts the json array and returns back the json items sorted based on the LastModified date in the descending order (recent one on the top)

CREATE   PROCEDURE  SortJsonArray     
@JsonArray NVARCHAR(MAX) 
AS 
BEGIN    
DECLARE @OutputJson NVARCHAR(MAX)          
SELECT @OutputJson = (         
SELECT * FROM OPENJSON(@JsonArray)  
WITH (
FileName NVARCHAR(MAX) '$.FileName',   
LastModified DATETIME '$.LastModified'         
)         
ORDER BY LastModified DESC  FOR JSON PATH     )       
SELECT @OutputJson AS OutputJson END  

Let’s now, add a Lookup Activity to execute the above Stored Procedure as shown below.

Azure Data Factory How to Retrieve the Latest File in Each Folder – Set Json Array – Execute

Execute the pipeline now which at the end returns the files sorted based on the LastModified date as shown below.

Azure Data Factory How to Retrieve the Latest File in Each Folder – Set Json Array – Sorted Data

That’s how you can retrieve the latest file uploaded to the Folder of a given container.

Do you like this article? If you want to get more updates about these kind of articles, you can join my Learning Groups

WhatsApp

Telegram

Advertisements
Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s