The Common ETL Framework Part 2

Auditing/Logging

Each package is responsible for logging at two different levels:  Audit and Debug.

The Audit level of logging needs to record the following for each package execution:

  1. Start Date and Time
  2. Executing Machine name
  3. Executing as Username
  4. Records Affected (if appropriate) – Can include Records Inserted, Updated, Deleted, Archived, Errored, etc.
  5. Error data if an error occurs – Includes SQL Error, Call Stack, etc.
  6. Presence of Bad Data, if any, with specifics
  7. End Date and Time

The Debug level of logging records all of the Audit-level items, plus Control Flow Task-level logging and the specifics of any Bad Data encountered.
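As a point of reference, the Audit-level items above map naturally onto a single log table.  The following is a minimal sketch, assuming a hypothetical Audit.PackageExecution table; the table and column names are illustrative, not the Framework's final schema.

```sql
-- Hypothetical Audit-level log table; names and columns are illustrative only.
CREATE TABLE Audit.PackageExecution
(
    PackageExecutionId  INT IDENTITY(1,1) PRIMARY KEY,
    PackageName         NVARCHAR(260)  NOT NULL,
    MachineName         NVARCHAR(128)  NOT NULL,   -- executing machine name
    ExecutedBy          NVARCHAR(128)  NOT NULL,   -- executing as username
    StartDateTime       DATETIME2(3)   NOT NULL,
    EndDateTime         DATETIME2(3)   NULL,       -- stamped when the package completes
    RowsInserted        INT            NULL,       -- affected row counts, where appropriate
    RowsUpdated         INT            NULL,
    RowsDeleted         INT            NULL,
    RowsArchived        INT            NULL,
    RowsErrored         INT            NULL,
    HadBadData          BIT            NOT NULL DEFAULT (0),
    ErrorMessage        NVARCHAR(MAX)  NULL        -- SQL error, call stack, etc.
);
```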

In SSIS, this level of logging can be handled by adding Package and Task level event handlers.  The BIML code to create them looks like this.  There are several placeholders in this code that will be expanded upon in a later post.

Package Level Event Handlers:

[Image: package-level-event-handlers]

Main Control Flow Level Task Event Handlers:

[Image: task-level-event-handlers]
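One way to wire these handlers, purely as a sketch, is to have the Execute SQL Tasks inside them run parameterized statements against the hypothetical Audit.PackageExecution table above, with the parameters mapped from SSIS system variables (System::PackageName, System::MachineName, System::UserName, System::StartTime).  None of this is the final Framework code; it only illustrates the shape of the logging calls.

```sql
-- Run when the package starts: open an audit row and capture its key.
INSERT INTO Audit.PackageExecution (PackageName, MachineName, ExecutedBy, StartDateTime)
VALUES (@PackageName, @MachineName, @UserName, @StartTime);

SELECT SCOPE_IDENTITY() AS PackageExecutionId;   -- stash in a package variable

-- Run by the package OnPostExecute handler: stamp the end time and row counts.
UPDATE Audit.PackageExecution
SET    EndDateTime  = SYSDATETIME(),
       RowsInserted = @RowsInserted,
       RowsUpdated  = @RowsUpdated
WHERE  PackageExecutionId = @PackageExecutionId;
```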

Error Handling

There is a Package Level Error Handler in the example above.  We need to capture errors at the package and task level and log them to an Audit.PackageExecutionError table.  Note: Bad Data that is redirected to an Audit.BadDataLog table will not raise an error.  The Framework needs to handle both kinds of conditions, true errors and Bad Data, and report on either of them in the notification portion of the Framework.  The Bad Data data flow is shown in the diagram below.

[Image: data-flow-error-handlers]
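For reference, here is one possible shape for the two logging targets named above.  The Audit.PackageExecutionError and Audit.BadDataLog table names come from the Framework description; the columns are an assumption for illustration.

```sql
-- Hypothetical columns; only the table names are taken from the Framework description.
CREATE TABLE Audit.PackageExecutionError
(
    PackageExecutionErrorId INT IDENTITY(1,1) PRIMARY KEY,
    PackageExecutionId      INT            NOT NULL,   -- ties back to Audit.PackageExecution
    TaskName                NVARCHAR(260)  NULL,
    ErrorCode               INT            NULL,       -- e.g. System::ErrorCode
    ErrorDescription        NVARCHAR(MAX)  NULL,       -- e.g. System::ErrorDescription
    LoggedAt                DATETIME2(3)   NOT NULL DEFAULT (SYSDATETIME())
);

CREATE TABLE Audit.BadDataLog
(
    BadDataLogId        INT IDENTITY(1,1) PRIMARY KEY,
    PackageExecutionId  INT            NOT NULL,
    ComponentName       NVARCHAR(260)  NULL,           -- Data Flow component that redirected the row
    RowData             NVARCHAR(MAX)  NULL,           -- offending row, serialized
    FailureReason       NVARCHAR(1000) NULL,
    LoggedAt            DATETIME2(3)   NOT NULL DEFAULT (SYSDATETIME())
);
```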

Notification/Routing

Depending on the Success, Failure/Error, Unusual Behavior, or Bad Data condition of the Package or Task, different people need to be alerted.  The first pass of implementing the Routing will use SMTP Email as the notification routing mechanism.  A later pass might add HTTP or Web Service notifications.

All notification contact info will be in the Audit schema and will be maintainable through simple table updates.  Email addresses and notification recipients must not be hard coded into the packages, neither by a developer nor by BIML.  Notification recipients must be determined dynamically at run-time.
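A minimal sketch of such a contact table follows, assuming a hypothetical Audit.NotificationRecipient table; the columns anticipate the flags and frequency options described in the conditions below and are illustrative only.

```sql
-- Hypothetical contact table read at run-time; no addresses live in the packages.
CREATE TABLE Audit.NotificationRecipient
(
    NotificationRecipientId INT IDENTITY(1,1) PRIMARY KEY,
    PackageName             NVARCHAR(260) NOT NULL,
    RecipientName           NVARCHAR(128) NOT NULL,
    EmailAddress            NVARCHAR(320) NOT NULL,
    SendSuccessNotice       BIT           NOT NULL DEFAULT (0),
    SuccessFrequency        VARCHAR(10)   NULL,   -- Daily, Weekly, Bi-weekly, Monthly, or NULL for Not at All
    SendErrorNotice         BIT           NOT NULL DEFAULT (0),
    SendBadDataNotice       BIT           NOT NULL DEFAULT (0)
);
```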

In the Success condition, an optional statistical email to the specified notification recipients would suffice.  This email would show the Success of the Package and include Package Name, Start Time, End Time and Elapsed Time, Data Source, Target DB, and the Affected Row Counts.  Notification recipients should be configurable to receive notifications Daily, Weekly, Bi-weekly, Monthly or Not at All.

In the Failure/Error condition, a detailed error message needs to be sent to the notification recipients that are indicated as SendErrorNotice in their contact profile.  Every package must name at least one SendErrorNotice contact.  These contacts will receive an email that shows that there was an error, the Package Name, Start Time, End Time and Elapsed Time, Data Source, Target DB, Schema, and Table Name, Affected Row Counts, Error Message(s), Stack Trace (if available), Bad Data Condition (if there is one), and as much other information as we can harvest from SSIS.  The difference between a Failure and an Error is that a Failure is expected in certain circumstances, but an Error is not expected.

In the Unusual Behavior condition, a message needs to be sent to the SendErrorNotice recipients, as in the Failure condition above.  However, the Unusual Behavior notification will document the Package Name, Start Time, End Time and Elapsed Time, Data Source, Target DB, Schema, and Table Name, Affected Row Counts, and Bad Data Condition (if there is one), as well as the Unusual Behavior that triggered the notification.  For example, if a package that normally processes 100,000 rows suddenly processes 500,000 rows, that is Unusual Behavior.  Likewise, a package that typically takes 3 minutes to run taking 20 minutes is Unusual Behavior.
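Detecting Unusual Behavior implies comparing the current run against the package's own history.  The following is a rough sketch using the hypothetical Audit.PackageExecution table; the 30-run window and 3x thresholds are arbitrary illustrations, not Framework rules.

```sql
-- Flag the current run if its row count or duration is more than 3x the recent average.
DECLARE @PackageName NVARCHAR(260) = N'LoadCustomerStage';   -- hypothetical package
DECLARE @CurrentRows INT = 500000;                            -- this run's row count
DECLARE @CurrentSecs INT = 1200;                              -- this run's duration in seconds

SELECT CASE WHEN @CurrentRows > 3 * AVG(CAST(RowsInserted AS FLOAT))
              OR @CurrentSecs > 3 * AVG(DATEDIFF(SECOND, StartDateTime, EndDateTime))
            THEN 1 ELSE 0 END AS IsUnusualBehavior
FROM
(
    SELECT TOP (30) RowsInserted, StartDateTime, EndDateTime
    FROM   Audit.PackageExecution
    WHERE  PackageName = @PackageName
      AND  EndDateTime IS NOT NULL
    ORDER BY StartDateTime DESC
) AS RecentRuns;
```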

In the Bad Data condition, the package has not Failed, Errored, or experienced Unusual Behavior.  It has experienced rows being routed to the Bad Data path(s) in the Data Flow.  Details of the Bad Data need to be sent to the notification recipients that are indicated as SendBadDataNotice in their contact profile.  These notifications should include the Package Name, Start Time, End Time and Elapsed Time, Data Source, Target DB, Schema, and Table Name, Affected Row Counts, Bad Data Row Counts, and Bad Data Row Details.
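In every non-Success condition, the recipient list is resolved at run-time from the contact table rather than from anything hard-coded in the package.  A sketch of that lookup, again using the hypothetical Audit.NotificationRecipient table:

```sql
-- Who gets the error email for this package?
SELECT RecipientName, EmailAddress
FROM   Audit.NotificationRecipient
WHERE  PackageName = @PackageName
  AND  SendErrorNotice = 1;

-- Who gets the Bad Data email for this package?
SELECT RecipientName, EmailAddress
FROM   Audit.NotificationRecipient
WHERE  PackageName = @PackageName
  AND  SendBadDataNotice = 1;
```

The resulting list could feed an SSIS Send Mail Task or msdb.dbo.sp_send_dbmail, whichever the SMTP pass of the Routing ends up using.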

Multi-threading

Multi-threading allows the package to run faster when there are tasks that can be run in parallel.  There are two multi-threading conditions that the Framework needs to handle:

  • Master Package Multi-threading of multiple child packages
  • Parallel Task execution within a single package

Re-run ability

Packages should be re-runnable if the data has changed and we need to reload it.  For example, bad data was detected, logged and handled.  Now that the bad data has been fixed, we need to re-run the package for today’s data.  This implies that the package would know that its last run had bad data, that the package actions need to be reversed, and that the package can then reload the data for the day.  This is most often implemented as a clean-up step at the beginning of the ETL package that checks whether any of the data it is about to load is already in the database and removes it if it is.  For example, suppose the package was loading data into the final tables when the database ran out of space and threw an error.  Once the space issue is fixed, the package would be re-run.  First it would remove the data it loaded in the first run, then it would load the data normally as if it had not been run before.
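The clean-up step usually amounts to a single delete keyed on whatever identifies the current load.  A minimal sketch, assuming an illustrative target table with a LoadDate column:

```sql
-- Clean-up at the start of the package: remove anything this run is about to reload,
-- so a re-run cannot double-load the day. Table and column names are illustrative.
DECLARE @LoadDate DATE = CAST(SYSDATETIME() AS DATE);

DELETE FROM dbo.FactSales
WHERE  LoadDate = @LoadDate;

-- The package then loads today's data exactly as if it had never run before.
```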

Restartability

Packages should be able to pick up from the task prior to the failure point and run properly (where this makes sense).  There’s no need to stage data we’ve already staged.  Typically, when a package fails, it has already done some processing.  Packages should be able to roll back the failed task and restart from that point.  This is an advanced ETL framework activity.  It can be very tricky to deal with restarting a failed package, especially if it has any complexity in it.
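SSIS checkpoints are the built-in option here; a simpler, more transparent approach is to record each completed task in the audit log and consult it on start-up.  The sketch below assumes a hypothetical Audit.TaskExecution table with a sequence number per task; precedence-constraint expressions in the package would then skip tasks at or below the restart point.

```sql
-- Find the restart point for a failed run. Audit.TaskExecution is hypothetical.
SELECT MAX(TaskSequence) AS LastCompletedTask
FROM   Audit.TaskExecution
WHERE  PackageExecutionId = @FailedPackageExecutionId
  AND  CompletedSuccessfully = 1;

-- Each task's precedence constraint can test its own sequence number against
-- LastCompletedTask and skip work that already finished.
```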

The Common ETL Framework Part 1

Defining a Common ETL Framework

SQL Server Integration Services (SSIS) provides a mechanism to Extract data from some source, Transform it as necessary, and Load it into the tables in a target database (ETL).

Each SSIS package should implement a common framework for logging, executing, monitoring and error handling so that these details are abstracted from the SSIS developer and are “just part of the template” that they should use for SSIS package creation.

Package automation takes this one step further by automating the creation of packages based on metadata that is configurable at design time.  I have selected Business Intelligence Markup Language (BIML) to perform this automation.  I will demonstrate how to use BIML to configure and generate packages in future articles in this series.

What Does the Common ETL Framework Need to Do?

The Framework needs to provide the services and behaviors that are common to all ETL packages.  These include:

  1. Auditing/Logging.  Execution Start, Stop and Error auditing for packages and tasks as they are executed.  Auditing/logging verbosity should be configurable so that we can enhance the logging when we need to debug something.
  2. Error Handling.  Appropriate error handling. Reroute bad data on read and on write.  Break the process only when necessary.  Log data, connectivity and process problems.
  3. Notifications and Routing.  Intelligent routing so that the appropriate systems and system owners are notified of a failure without spamming the whole department.
  4. Multi-Threading.  Master packages need to manage the multi-threading and dependencies of their child packages, watching for errors and handling them appropriately (i.e. Stop or Continue execution of other child packages upon a failure).
  5. Restartability.  Packages should be able to pick up from the task prior to the failure point and run properly (where this makes sense), i.e. there is no need to stage data we’ve already staged.
  6. Re-run ability. Packages should be re-runnable if the data has changed and we need to reload it.  For example, bad data was detected, logged and handled.  Now the bad data has been fixed; we need to re-run the package for today’s data.  This implies that the package would know that its last run had bad data, that the package actions need to be reversed and then the package can reload the data for the day.
  7. Load Strategies.  The Framework should handle a variety of data Extract and Load strategies:
    1. Truncate and Replace loads. Where appropriate, table data can be truncated and reloaded.  This technique is common for staging tables and for smaller datasets with no dependencies.
    2. Incremental loads. Where appropriate, data can be extracted incrementally and/or loaded incrementally, rather than a truncate and reload.  This is often used to keep history.  Incremental loads come in several flavors, some use a hash, others a date flag, others a set of change flags.  Deletions require special handling in incremental loads.
    3. Cycle loads. Where appropriate, a dataset can be loaded with a Days Old flag set to -1; then a simple update can increment the Days Old flag by 1 to cycle the data by a day.  A delete query can then remove rows that are too old to be kept.  (A sketch of this pattern follows this list.)
  8. Data Retention.  The Framework should also handle data retention during the ETL process.  There are three strategies for removing data from the primary dataset:  deletion, archiving, and flagging.  A typical retention plan will employ all three to varying degrees.
  9. Stop/Cancel Job.  The Framework should be able to interrupt a job, stopping it immediately or canceling it at the next step.  This also implies that the job is already Restartable or Re-runnable.
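For the Cycle load strategy in item 7.3, the whole pattern is three set-based statements, and the final delete doubles as an example of item 8's deletion strategy.  A sketch with illustrative table, column, and threshold names:

```sql
-- 1. Load the new day's rows flagged as DaysOld = -1. Names are illustrative.
INSERT INTO dbo.SalesSnapshot (SaleId, Amount, DaysOld)
SELECT SaleId, Amount, -1
FROM   Staging.SalesSnapshot;

-- 2. Cycle the table by one day: the new rows become 0, older rows age by 1.
UPDATE dbo.SalesSnapshot
SET    DaysOld = DaysOld + 1;

-- 3. Retention via deletion: remove rows that are too old to be kept.
DELETE FROM dbo.SalesSnapshot
WHERE  DaysOld > 90;   -- the 90-day window is an arbitrary example
```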