Defining a Common ETL Framework
SQL Server Integration Services (SSIS) provides a mechanism to Extract data from some source, Transform it as necessary, and Load it into the tables in a target database (ETL).
Each SSIS package should implement a common framework for logging, executing, monitoring and error handling so that these details are abstracted from the SSIS developer and are “just part of the template” that they should use for SSIS package creation.
Package automation takes this one step further by automating the creation of packages based on metadata that is configurable at design time. I have selected Business Intelligence Markup Language (BIML) to perform this automation. I will demonstrate how to use BIML to configure and generate packages in future articles in this series.
The Framework needs to provide the services and behaviors that are common to all ETL packages. These include:
- Auditing/Logging. Execution Start, Stop and Error auditing for packages and tasks as they are executed. Auditing/logging verbosity should be configurable so that we can enhance the logging when we need to debug something.
- Error Handling. Appropriate error handling. Reroute bad data on read and on write. Break the process only when necessary. Log data, connectivity and process problems.
- Notifications and Routing. Intelligent routing so that the appropriate systems and system owners are notified of a failure without spamming the whole department.
- Multi-Threading. Master packages need to manage the multi-threading and dependencies of their child packages, watching for errors and handling them appropriately (i.e. Stop or Continue execution of other child packages upon a failure).
- Restartability. Packages should be able to pick up from the task prior to the failure point and run properly (where this makes sense). I.e. No need to stage data we’ve already staged.
- Re-run ability. Packages should be re-runnable if the data has changed and we need to reload it. For example, bad data was detected, logged and handled. Now the bad data has been fixed; we need to re-run the package for today’s data. This implies that the package would know that its last run had bad data, that the package actions need to be reversed and then the package can reload the data for the day.
- Load Strategies. The Framework should handle a variety of data Extract and Load strategies:
- Truncate and Replace loads. Where appropriate, table data can be truncated and reloaded. This technique is common for staging tables and for smaller datasets with no dependencies.
- Incremental loads. Where appropriate, data can be extracted incrementally and/or loaded incrementally, rather than a truncate and reload. This is often used to keep history. Incremental loads come in several flavors, some use a hash, others a date flag, others a set of change flags. Deletions require special handling in incremental loads.
- Cycle loads. Where appropriate, a dataset can be loaded with a Days Old flag set to -1; then a simple update can increment the Days Old flag by 1 to cycle the data by a day. A delete query can then remove rows that are too old to be kept.
- Data Retention. The Framework should also handle data retention during the ETL process. There are three strategies for removing data from the primary dataset: deletion, archiving, and flagging. A typical retention plan will employ all three to varying degrees.
- Stop/Cancel Job. The Framework should be able to interrupt a job, stopping it immediately, or canceling it at the next step. This also implies that the job is already Restartable or Re-runable.