ETL Data Validation Framework
Data Validation is one of the most critical features to be implemented in any ETL flow. There are several validations that are performed in the ETL flow. Some of the most common validations are
Source to Target Row count validation Data type checks Aggregates matching
Based on use case the more validations can be performed. Building a scalable validation framework is key to any ETL flow. Below describe a common framework for implementing ETL validation
e.g. Source to Target Row Count Validation – The validation framework is designed to be data driven and the ETL control scripts(bash/python) interacts with the RDBMS which has all the validation logic built in it. The componenets are:
ETL Validation table Action Table Bash/Python script to execute the validation ACTION Script to execute steps based on the success criteria
VALIDATION Table: The Validation table has the all the specifics of the validation to be performed. Sample table structure below. This serves as the primary table being accessed by the validation script. It also has the success failure criteria and the action to be taken as well. ACTION Table: The Action table has the subsequent actions to be taken incase of sucess of failure. it can multiple entries for the same action code incase multiple tasks need to be performed. VALIDATION RESULTS table: The table contains the validation results of the validations performed along with the timestamp and also indicates if it is pass/fail VALIDATION Execution Script: This is parameterized script which accepts 3 parameters the 1)validation code 2)LEFT/RIGHT/BOTH indicator 3)Y/N Perform action indicator as input parameters. It inserts the validation results into the Validation results table and executes the action script based on success failure criteria. ACTION Script: The script executes receives the ACTION code as the input parameter and looks up the action table and performs the action to performed. It could also point to another validation logic to be performed mentioned in the validation table. The framework could be made more generic by adding more parameters like source target database names, table names etc. But this could make it more complex and difficult to maintain.