
Case Study: Ensuring seamless data integration, accuracy, and quality through comprehensive ETL testing processes.

Preethi P Unnikrishnan

Objective

Replace the existing .NET programs with Talend for the data extraction process. QA validation ensures that the new orchestration process runs successfully and that its results match those produced by the existing .NET code.


Problem Statement

  • The legacy jobs are scheduled to start at fixed times, with what is assumed to be enough of a gap for each process to finish before the next one starts. However, as data volumes have grown and database performance issues have surfaced, the jobs occasionally run long and create a deadlock in the database as both try to read/update the Target database.

  • There is no automated monitoring or alerting when a legacy job fails, so the only way to identify an issue is to check manually or wait for a client to report it.

Impact

  • If the legacy process fails, the Target data is not ingested until the next day.

  • With no effective way to monitor or react to an incident, it is often too late to fix anything the same day, so we end up waiting for the next scheduled run and hoping it succeeds.

Solution

  • The process is orchestrated through Talend, replacing the existing .NET programs.

  • Notifications and alerts are implemented for both successes and failures in the new process.


Process Flow & Testing Approach

QA validation is performed at each ETL stage - file retrieval, data ingestion, and staging - to ensure data integrity, accuracy, and consistency across the system.


File Retrieval Process

Goal:-

Retrieve files from SFTP and place them in the configured S3 location.


Test Data Preparation:-

  • We used the Right Data tool (RDt) Query Studio to create queries for extracting data from the files in the S3 location.

  • Functional Data Reconciliation (FDR) scenarios are created to validate the data and record counts between the files placed in the SFTP and S3 locations. Conceptually, an FDR scenario works like the sketch below.
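RDt's FDR scenarios are built in a proprietary tool, so the following is only a tool-agnostic illustration of what such a scenario checks - record counts plus row-level differences between a source and a target extract. The function, key columns, and sample data are assumptions for the example, not the RDt API.

```python
# Tool-agnostic sketch of an FDR-style check (not the RDt API).
# Key columns and sample data are illustrative assumptions.
import pandas as pd

def reconcile(source: pd.DataFrame, target: pd.DataFrame, keys: list) -> dict:
    """Compare record counts and row contents between two extracts."""
    result = {"source_count": len(source), "target_count": len(target)}
    result["counts_match"] = result["source_count"] == result["target_count"]

    # Rows present on one side but not the other (full outer join with indicator)
    merged = source.merge(target, on=keys, how="outer", indicator=True)
    result["source_only"] = merged[merged["_merge"] == "left_only"]
    result["target_only"] = merged[merged["_merge"] == "right_only"]
    result["data_match"] = result["source_only"].empty and result["target_only"].empty
    return result

# Example: reconcile a file extract from SFTP against the same file read from S3
sftp_extract = pd.DataFrame({"trade_id": [1, 2, 3], "amount": [10.0, 20.0, 30.0]})
s3_extract = pd.DataFrame({"trade_id": [1, 2, 3], "amount": [10.0, 20.0, 30.0]})
outcome = reconcile(sftp_extract, s3_extract, keys=["trade_id", "amount"])
print(outcome["counts_match"], outcome["data_match"])  # True True
```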

Data Verification:-

  1. Verify that all the files placed in the test SFTP location are dropped into the configured S3 location by the Talend retrieval job.

  2. The file name and record count of each file are verified between the SFTP and S3 locations.

  3. Run the FDR scripts to ensure that the data is not altered during the file retrieval process. A file-level sketch of checks 1 and 2 follows this list.
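A minimal sketch of the file-name and record-count comparison, assuming paramiko for SFTP access and boto3 for S3; the host, bucket, credentials, and paths are illustrative placeholders.

```python
# Sketch of the SFTP-vs-S3 file check; hosts, bucket, and credentials are placeholders.
import boto3
import paramiko

SFTP_HOST, SFTP_DIR = "test-sftp.example.com", "/outbound"
BUCKET, PREFIX = "etl-landing-bucket", "incoming/"

def sftp_file_counts() -> dict:
    """Map each file in the test SFTP directory to its record (line) count."""
    transport = paramiko.Transport((SFTP_HOST, 22))
    transport.connect(username="qa_user", password="...")  # placeholder credentials
    sftp = paramiko.SFTPClient.from_transport(transport)
    counts = {}
    for name in sftp.listdir(SFTP_DIR):
        with sftp.open(f"{SFTP_DIR}/{name}") as fh:
            counts[name] = sum(1 for _ in fh)
    transport.close()
    return counts

def s3_file_counts() -> dict:
    """Map each file the Talend retrieval job dropped into S3 to its record count."""
    s3 = boto3.client("s3")
    counts = {}
    for obj in s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX).get("Contents", []):
        body = s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"]
        counts[obj["Key"].rsplit("/", 1)[-1]] = sum(1 for _ in body.iter_lines())
    return counts

src, dst = sftp_file_counts(), s3_file_counts()
assert set(src) == set(dst), f"file name mismatch: {set(src) ^ set(dst)}"
assert src == dst, "record counts differ between SFTP and S3"
```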

File Data Ingestion Job

Goal:-

An AWS Glue job ingests the file data into the corresponding Athena raw tables as files are placed in the S3 location. Each file type has its own raw tables, so verification is done against the corresponding raw tables. A minimal sketch of such a job appears below.
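The actual Glue job is part of the solution under test, so the following is only a rough sketch of what a job of this shape might look like, assuming CSV files and raw tables registered in the Glue Data Catalog; the job parameters, database, and table names are illustrative.

```python
# Rough sketch of a Glue ingestion job (illustrative parameters, not the real job).
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME", "source_s3_path", "raw_database", "raw_table"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the newly arrived CSV file(s) from the S3 landing location
dyf = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": [args["source_s3_path"]]},
    format="csv",
    format_options={"withHeader": True},
)

# Append the rows into the raw table registered in the Glue Data Catalog
glue_context.write_dynamic_frame.from_catalog(
    frame=dyf,
    database=args["raw_database"],
    table_name=args["raw_table"],
)

job.commit()
```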


Test Data Preparation:-

Create queries to extract data from the corresponding Athena raw tables.

Create FDR scenarios to validate the data between the files in S3 and the raw tables.

Data Verification:-

  1. Verify that all the files placed in the S3 location are ingested into the corresponding raw table after a successful Glue job run.

  2. Verify that the record counts match between the files placed in S3 and the raw tables.

  3. Run the FDR scripts to ensure that the data is not altered during the data ingestion process. A sketch of the count reconciliation follows this list.
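A sketch of the S3-to-raw-table count check, assuming boto3's Athena client; the result output location, bucket, file key, and table name are illustrative (the database name echoes the one in the error message later in this post).

```python
# Sketch of the S3-vs-Athena count reconciliation; names and paths are illustrative.
import time
import boto3

athena = boto3.client("athena")

def athena_count(database: str, table: str) -> int:
    """Run SELECT COUNT(*) against a raw table and wait for the result."""
    qid = athena.start_query_execution(
        QueryString=f"SELECT COUNT(*) FROM {table}",
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": "s3://qa-athena-results/"},  # illustrative
    )["QueryExecutionId"]
    while True:
        state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)
    assert state == "SUCCEEDED", f"Athena query ended in state {state}"
    rows = athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]
    return int(rows[1]["Data"][0]["VarCharValue"])  # rows[0] is the header row

# Compare against the record count of the file in S3 (header line excluded)
s3 = boto3.client("s3")
body = s3.get_object(Bucket="etl-landing-bucket", Key="incoming/trades_20240115.csv")["Body"]
file_count = sum(1 for _ in body.iter_lines()) - 1  # minus the header row
assert athena_count("test_xxx", "trades_raw") == file_count, "S3 vs raw table count mismatch"
```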

File Staging Process

Goal:-

Ingest data from the raw table into a #temp table, then run the stored procedures (SPs) to load the staging and target tables correspondingly. A hypothetical sketch of this flow follows.
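The real flow is orchestrated by Talend; this is only a hypothetical sketch of the raw-to-#temp-to-SP sequence, using pyodbc against SQL Server. The connection string, columns, sample rows, and SP names are invented for illustration.

```python
# Hypothetical sketch of the staging flow (the real flow is orchestrated by Talend).
# Connection details, columns, and SP names are illustrative assumptions.
import pyodbc

CONN_STR = (
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=qa-sql;DATABASE=StageDB;Trusted_Connection=yes"
)

# Rows previously extracted from the Athena raw table (illustrative sample)
rows = [("F123", "2024-01-15", 100.0), ("F124", "2024-01-15", 250.5)]

conn = pyodbc.connect(CONN_STR, autocommit=True)
cur = conn.cursor()

# 1. Load the raw-table extract into a session-scoped #temp table
cur.execute("CREATE TABLE #temp_file_data (file_id VARCHAR(20), trade_date DATE, amount FLOAT)")
cur.executemany("INSERT INTO #temp_file_data VALUES (?, ?, ?)", rows)

# 2. Run the SPs that populate the staging and target tables from #temp
cur.execute("EXEC dbo.usp_Load_Staging")  # hypothetical SP name
cur.execute("EXEC dbo.usp_Load_Target")   # hypothetical SP name
conn.close()
```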


Test Data Preparation:-

  1. Create queries to extract data from the #temp tables for the corresponding files.

  2. FDR scenarios are created to validate the data and record counts between the raw table (Athena) and the #temp table. Note:- because #temp tables are session-scoped, we copied the #temp table data into our QA workspace tables for validation.

Data Verification:-

  1. Record counts between the raw table (Athena) and the #temp table are verified for all possible test scenarios after a successful Talend job run.

  2. Check that all the SPs (staging and target) called by the old .NET process are also called by the new Talend process.

  3. Run the FDR scenarios to ensure that the data is not altered during the file staging process.

  4. Create and run FDR scenarios to validate the data and record counts between the staging tables in QA and Prod for a specific date range.

  5. Create and run FDR scenarios to validate the data and record counts between the target tables in QA and Prod for the same date range. A count-comparison sketch for checks 4 and 5 follows this list.
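A minimal sketch of the QA-vs-Prod count comparison for a date range, assuming pyodbc and SQL Server on both sides; the server names, table names, and the load_date audit column are illustrative assumptions.

```python
# Sketch of the QA-vs-Prod date-range reconciliation; names are illustrative.
import pyodbc

QA_DSN = "DRIVER={ODBC Driver 17 for SQL Server};SERVER=qa-sql;DATABASE=DW;Trusted_Connection=yes"
PROD_DSN = "DRIVER={ODBC Driver 17 for SQL Server};SERVER=prod-sql;DATABASE=DW;Trusted_Connection=yes"

def count_for_range(dsn: str, table: str, start: str, end: str) -> int:
    """Record count for one table restricted to the validation date range."""
    with pyodbc.connect(dsn) as conn:
        cur = conn.cursor()
        cur.execute(f"SELECT COUNT(*) FROM {table} WHERE load_date BETWEEN ? AND ?", start, end)
        return cur.fetchone()[0]

# Hypothetical staging and target table names; load_date is an assumed audit column
for table in ("dbo.Staging_Trades", "dbo.Target_Trades"):
    qa = count_for_range(QA_DSN, table, "2024-01-01", "2024-01-31")
    prod = count_for_range(PROD_DSN, table, "2024-01-01", "2024-01-31")
    assert qa == prod, f"{table}: QA={qa} vs Prod={prod}"
```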

Generic Verification:-

Manually verify the Slack notifications/alerts for all successes and failures across all ETL phases. An illustrative sketch of the kind of alert being verified follows.
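The alerts themselves are emitted by the Talend process; the following is only an illustration of the kind of Slack message being verified, assuming a standard Slack incoming webhook. The webhook URL, job names, and message format are placeholders.

```python
# Illustration of a success/failure Slack alert via an incoming webhook (placeholder URL).
import requests

WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"  # placeholder webhook

def notify(job: str, status: str, detail: str = "") -> None:
    """Post a success/failure message to the QA alerts channel."""
    emoji = ":white_check_mark:" if status == "SUCCESS" else ":x:"
    text = f"{emoji} {job} finished with status {status}. {detail}".strip()
    resp = requests.post(WEBHOOK_URL, json={"text": text}, timeout=10)
    resp.raise_for_status()

notify("File Retrieval Job", "SUCCESS")
notify("Glue Ingestion Job", "FAILED", detail="Table or view 'XXX' already exists")
```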


Issues Identified

  1. The Glue job fails when multiple files arrive for the very first insert, with an error of the form: Table or view 'XXX' already exists in database 'test_xxx' (this applies to each category of files). Solution: a log table was introduced in which an entry is created for each file as the file retrieval job picks it up from SFTP and places it in S3. A Glue scheduler job then runs in parallel for each file category present in the log table and ingests the files into the raw tables (see the sketch after this list).

  2. Got an ICEBERG_COMMIT_ERROR while writing to the Athena Iceberg tables: tWarn_3 - Message: Cannot connect to AWS Athena: java.lang.RuntimeException: [Simba][AthenaJDBC](100071) An error has been thrown from the AWS Athena client. ICEBERG_COMMIT_ERROR: Failed to commit Iceberg update to table. Reference for the solution:- https://iceberg.apache.org/docs/1.3.0/configuration/
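A sketch of the log-table fix from issue 1: read the pending file categories from the log table and start one Glue job run per category. The Glue job name, log table, and column names are assumptions; true parallelism also requires the job definition's MaxConcurrentRuns to allow multiple runs.

```python
# Sketch of the log-table driven trigger (issue 1 fix); names are illustrative.
import boto3
import pyodbc

CONN_STR = (
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=qa-sql;DATABASE=ETL;Trusted_Connection=yes"
)
glue = boto3.client("glue")

def pending_categories() -> list:
    """Read the distinct file categories awaiting ingestion from the log table."""
    with pyodbc.connect(CONN_STR) as conn:
        cur = conn.cursor()
        cur.execute("SELECT DISTINCT file_category FROM dbo.File_Log WHERE status = 'RECEIVED'")
        return [row[0] for row in cur.fetchall()]

def start_ingestion(category: str) -> str:
    """Start the Glue ingestion job for one category; runs proceed concurrently
    as long as the job definition allows multiple concurrent runs."""
    resp = glue.start_job_run(
        JobName="raw-ingestion-job",              # hypothetical job name
        Arguments={"--file_category": category},
    )
    return resp["JobRunId"]

run_ids = [start_ingestion(c) for c in pending_categories()]
print(f"started {len(run_ids)} Glue job runs")
```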

Conclusion

By implementing thorough checks at each stage, we ensure data integrity and consistency across the system. The use of automated scripts for Functional Data Reconciliation (FDR), combined with manual verification of alerts and notifications, ensures that any discrepancies are promptly identified and addressed. This rigorous QA approach not only mitigates the risks associated with migrating from the legacy .NET programs but also provides a robust framework for ongoing monitoring and issue resolution.


As a result, the new process is more reliable, scalable, and maintainable, ensuring timely and accurate data processing to meet business needs. The comprehensive validation ensures that the transition to Talend does not compromise data quality, thereby maintaining trust in data-driven decisions.





