SQL Saturday Atlanta!

I am excited to announce that I will be giving my BIML seminar in Atlanta in July!!! Here’s the abstract for my session:

SQL Saturday 652

Using Biml to Automate the Generation of SSIS Packages

So, you need to move data from 75 tables on the mainframe to new tables in SQL Azure. Do you: a) hand code one package to load all 75 tables, b) hand code 75 packages that move a table each, or c) wish there was a better way?
There is! Business Intelligence Markup Language (Biml) can automate the creation of packages so that they all follow the same script. In this session, we will create some simple metadata to be able to generate multiple packages and their associated connection managers. You will see Biml in action. You will see the XML that Biml uses to define packages and connections. You will see the C# code that Biml uses to fetch metadata and dynamically generate packages in SSIS. And you will see packages and connection managers generated from Biml before your very eyes.

SQL Saturday!!!

Yay!!! I’ll be speaking about Biml at SQL Saturday in Chattanooga in June!
I’m very excited. In typical fashion, I will probably cram too much into the hour, but I’ll try not to. Here’s the abstract for my session.

SQL Saturday

Using Biml to Automate the Generation of SSIS Packages

So, you need to move data from 75 tables on the mainframe to new tables in SQL Azure. Do you: a) hand code one package to load all 75 tables, b) hand code 75 packages that move a table each, or c) wish there was a better way?
There is! Business Intelligence Markup Language (Biml) can automate the creation of packages so that they all follow the same script. In this session, we will create some simple metadata to be able to generate multiple packages and their associated connection managers. You will see Biml in action. You will see the XML that Biml uses to define packages and connections. You will see the C# code that Biml uses to fetch metadata and dynamically generate packages in SSIS. And you will see packages and connection managers generated from Biml before your very eyes.

Biml Tutorial Part 3 – Adding Connection Managers to our Package Generator

Now that we can generate simple empty packages, the next step is to add connection managers to the project and package.  After that, we will add a Data Flow.  This article will focus on the connections, though, and we will start with a pair of standard OleDb connections.  Recall the metadata tables from the last article:

Setting up the Source and Target Databases

To insert the connection information, we’ll need to add a source, a target, the type (which is Ole Db) and then we’ll need to hook them up to the Package.
For the source, I’ll use the AdventureWorks2014 Database for SQL Server 2014.  Any DB will do though.
For the target, I’ve created an empty database called MySqlDatabase and added the Person table (from AdventureWorks2014) with the following SQL:

CREATE DATABASE MySqlDatabase;
GO
USE MySqlDatabase;
GO
CREATE TABLE dbo.[Person](
	[BusinessEntityID] [int] NOT NULL CONSTRAINT [PK_Person_BusinessEntityID] PRIMARY KEY CLUSTERED,
	[PersonType] [nchar](2) NOT NULL,
	[NameStyle] bit NOT NULL CONSTRAINT [DF_Person_NameStyle] DEFAULT 0,
	[Title] [nvarchar](8) NULL,
	[FirstName] [nvarchar](50) NOT NULL,
	[MiddleName] [nvarchar](50) NULL,
	[LastName] [nvarchar](50) NOT NULL,
	[Suffix] [nvarchar](10) NULL,
	[EmailPromotion] [int] NOT NULL CONSTRAINT [DF_Person_EmailPromotion] DEFAULT 0,
	[AdditionalContactInfo] [xml] NULL,
	[Demographics] [xml] NULL,
	[rowguid] [uniqueidentifier] ROWGUIDCOL  NOT NULL CONSTRAINT [DF_Person_rowguid] DEFAULT (newid()),
	[ModifiedDate] [datetime] NOT NULL CONSTRAINT [DF_Person_ModifiedDate] DEFAULT (getdate()))
GO

We need to fashion an OleDb Connection String for each database, like so:

Source:  Data Source=.;Initial Catalog=AdventureWorks2014;Provider=SQLNCLI11.1;Integrated Security=SSPI;Auto Translate=False;
Target:  Data Source=.;Initial Catalog=MySqlDatabase;Provider=SQLNCLI11.1;Integrated Security=SSPI;Auto Translate=False;
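
If you are recreating the metadata tables from the last article from scratch, a minimal pair of definitions consistent with the columns used below might look like this (the data types and constraint names here are my assumptions, not the originals):

CREATE TABLE Biml.ConnectionType (
    ConnectionTypeId int IDENTITY(1,1) NOT NULL CONSTRAINT PK_ConnectionType PRIMARY KEY,
    ConnectionTypeName nvarchar(50) NOT NULL);
GO
CREATE TABLE Biml.Connection (
    ConnectionId int IDENTITY(1,1) NOT NULL CONSTRAINT PK_Connection PRIMARY KEY,
    ConnectionName nvarchar(200) NOT NULL,
    CreateInProject bit NOT NULL CONSTRAINT DF_Connection_CreateInProject DEFAULT 0,
    ConnectionTypeId int NOT NULL CONSTRAINT FK_Connection_ConnectionType
        FOREIGN KEY REFERENCES Biml.ConnectionType (ConnectionTypeId),
    OLEDB_ConnectionString nvarchar(500) NULL,
    OLEDB_DatabaseName nvarchar(128) NULL,
    ConnectionGuid uniqueidentifier NOT NULL CONSTRAINT DF_Connection_ConnectionGuid DEFAULT (NEWID()));
GO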

Then we can insert the connection records, starting by adding a Connection Type of OleDb.

INSERT INTO Biml.ConnectionType ([ConnectionTypeName]) VALUES ('OleDb');
INSERT INTO Biml.Connection ([ConnectionName], [CreateInProject], [ConnectionTypeId],
    [OLEDB_ConnectionString],
    [OLEDB_DatabaseName], [ConnectionGuid])
VALUES ('My First Source', 1, (SELECT ConnectionTypeId FROM Biml.ConnectionType WHERE ConnectionTypeName = 'OleDb'),
    'Data Source=.;Initial Catalog=AdventureWorks2014;Provider=SQLNCLI11.1;Integrated Security=SSPI;Auto Translate=False;',
    'AdventureWorks2014', NEWID());
INSERT INTO Biml.Connection ([ConnectionName], [CreateInProject], [ConnectionTypeId],
    [OLEDB_ConnectionString],
    [OLEDB_DatabaseName], [ConnectionGuid])
VALUES ('My First Target', 1, (SELECT ConnectionTypeId FROM Biml.ConnectionType WHERE ConnectionTypeName = 'OleDb'),
    'Data Source=.;Initial Catalog=MySqlDatabase;Provider=SQLNCLI11.1;Integrated Security=SSPI;Auto Translate=False;',
    'MySqlDatabase', NEWID());

Now that we have the connection strings in the database, we need to hook them up to the package.  In the PackageConnection table, we also need to provide the TableSchema and TableName or SQL Statement.  In this case, we’ll pull from Person.Person.  Here are the SQL statements to insert the required rows into the PackageConnection table:

INSERT INTO Biml.PackageConnection (PackageId, ConnectionId,
    IsSource, IsTarget, TableSchema, TableName, DirectInputSql)
VALUES ((SELECT PackageId FROM Biml.Package WHERE PackageName = 'My First Biml Package'),
    (SELECT c.ConnectionId FROM Biml.Connection c WHERE c.ConnectionName = 'My First Source'),
    1,0,'Person','Person',NULL);
INSERT INTO Biml.PackageConnection (PackageId, ConnectionId,
    IsSource, IsTarget, TableSchema, TableName, DirectInputSql)
VALUES ((SELECT PackageId FROM Biml.Package WHERE PackageName = 'My First Biml Package'),
    (SELECT c.ConnectionId FROM Biml.Connection c WHERE c.ConnectionName = 'My First Target'),
    0,1,'dbo','Person',NULL);
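
A quick sanity check (my own addition, not required by the generator) confirms that the package is hooked up to both connections:

SELECT p.PackageName, c.ConnectionName, pc.IsSource, pc.IsTarget, pc.TableSchema, pc.TableName
FROM Biml.Package p
INNER JOIN Biml.PackageConnection pc ON p.PackageId = pc.PackageId
INNER JOIN Biml.Connection c ON pc.ConnectionId = c.ConnectionId
WHERE p.PackageName = 'My First Biml Package';

You should get back two rows: 'My First Source' with IsSource = 1, and 'My First Target' with IsTarget = 1.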

Configuring the Biml file for Connections

Because we are using the OleDb drivers, and the source and target tables have the same structure, we do not need to map the fields; Biml and SSIS will map them automatically.
So, we have populated the metadata.  Now, we need to modify the Biml so that it includes the Connection information.  In the structure of a Biml file, the connections are fully described in the root

<Connections></Connections>

section, and are then referenced inside the package

<Connections></Connections>

section.  Here’s what this would look like if we hard-coded the connections in the Biml file:

<Biml xmlns="http://schemas.varigence.com/biml.xsd">

	<!--Root level Connections-->
    <Connections>
        <Connection Name="My First Source"
    		ConnectionString="Provider=SQLNCLI11;Server=.;Database=AdventureWorks2014;Trusted_Connection=yes;"
			CreateInProject="true"
			CreatePackageConfiguration="false" RetainSameConnection="true" />
        <Connection Name="My First Target"
    		ConnectionString="Provider=SQLNCLI11;Server=.;Database=MySqlDatabase;Trusted_Connection=yes;"
			CreateInProject="true"
			CreatePackageConfiguration="false" RetainSameConnection="true" />
    </Connections>

    <Packages>
        <Package Name="My First Biml Package" ConstraintMode="Parallel"
			ProtectionLevel="EncryptSensitiveWithUserKey">
			
			<!--Package level Connection references-->
            <Connections>
                <Connection ConnectionName="My First Source" />
                <Connection ConnectionName="My First Target" />
             </Connections>
             <Tasks>
                 <Container Name="SEQC - Main Control Flow" ConstraintMode="Linear">
                     <Variables></Variables>
                     <Tasks>
                         <Expression Name="Placeholder" Expression="1" />
                     </Tasks>
                 </Container>
             </Tasks>
        </Package>
    </Packages>
</Biml>

Now that we’ve looked at what the Biml code would look like if we hard coded everything, let’s get back to the version of the code we had at the end of the last article:

<#@ template language="C#" hostspecific="true" tier="0"#>
<#@ import namespace="System" #>
<#@ import namespace="System.Data" #>

<Biml xmlns="http://schemas.varigence.com/biml.xsd">
    <# var bimlConfigConnection = (AstDbConnectionNode)RootNode.Connections["MainBimlConnection"]; #>
    <Packages>
        <# var packagesToCreate = ExternalDataAccess.GetDataTable(bimlConfigConnection.ConnectionString, 
        "SELECT PackageId, PackageName, ConstraintMode, ProtectionLevel, GeneratePackage FROM Biml.Package p "
        + " WHERE GeneratePackage = 1 ORDER BY p.PackageName"); 
        foreach(DataRow package in packagesToCreate.Rows) { 
        #>
            <!-- Build Package Here -->
            <Package Name="<#=package["PackageName"] #>" ConstraintMode="<#=package["ConstraintMode"] #>" 
                     ProtectionLevel="<#=package["ProtectionLevel"] #>">
                <Tasks>
    		    <Container Name="SEQC - Main Control Flow" ConstraintMode="Linear">
                        <Variables></Variables>
            	        <Tasks>
            	            <Expression Name="Placeholder" Expression="1" />
                        </Tasks>
                    </Container> 
                </Tasks>
            </Package>
        <# } #>
    </Packages>
</Biml>

Adding the Dynamic Connections to the Package

Root Level Connections

With this as the starting point, we’ll need to add a SQL Statement and a foreach() loop to pull in the connection info at the root level, like so:

<#@ template language="C#" hostspecific="true" tier="0"#>
<#@ import namespace="System" #>
<#@ import namespace="System.Data" #>

<Biml xmlns="http://schemas.varigence.com/biml.xsd">
    <# var bimlConfigConnection = (AstDbConnectionNode)RootNode.Connections["MainBimlConnection"]; #>
    <Connections>
	<# var connectionsToCreate = ExternalDataAccess.GetDataTable(bimlConfigConnection.ConnectionString, 
        "SELECT ConnectionId, ConnectionName, OLEDB_ConnectionString, CreateInProject FROM Biml.Connection Where ConnectionTypeId = 1 order by ConnectionName"); 
        foreach(DataRow connection in connectionsToCreate.Rows) { #>
           <Connection Name="<#=connection["ConnectionName"] #>" ConnectionString="<#=connection["OLEDB_ConnectionString"] #>" 
		   CreateInProject="<#=connection["CreateInProject"] #>" CreatePackageConfiguration="false" 
		   RetainSameConnection="true" />
        <# } #>
    </Connections> 

Package Level Connection References

We also need a SQL statement and a foreach() loop to pull in the connection info at the package level. Here we’ll use the PackageId from the Package loop as a criterion in the WHERE clause:

<Package Name="<#=package["PackageName"] #>" ConstraintMode="<#=package["ConstraintMode"] #>" 
    ProtectionLevel="<#=package["ProtectionLevel"] #>">
    <Connections>
        <# connectionsToCreate = ExternalDataAccess.GetDataTable(bimlConfigConnection.ConnectionString,
            "SELECT mbc.ConnectionName, CreateInProject, ConnectionGuid "
            + " FROM Biml.Connection mbc"
            + " INNER JOIN Biml.PackageConnection mbpc ON mbc.ConnectionId = mbpc.ConnectionId"
            + " INNER JOIN Biml.Package mbp ON mbpc.PackageId = mbp.PackageId"
            + " WHERE mbp.GeneratePackage = 1 and mbpc.PackageId = " + package["PackageId"].ToString()
            + " ORDER BY mbc.ConnectionTypeId;");
        foreach(DataRow connection in connectionsToCreate.Rows) { 
        #>
            <Connection ConnectionName="<#=connection["ConnectionName"] #>" />
        <# } #>
    </Connections>

Gotcha! #1

Now, here’s a nice little Gotcha! In SQL Server, in the Biml.Connection table, we store the CreateInProject flag as a bit. Biml rightly converts this to a bool, with values of True or False. However, the CreateInProject attribute requires values of true or false, with a lowercase first letter. To correct this, we need a function that converts the bool value into a string that is either “true” or “false”. This is accomplished by adding a new Biml file, let’s call it -code.biml, and then creating a function in that file:

<#+ 
string FixBool(object inBool)
{
    // The boxed bit comes through as True/False (or 0/1); the Biml attribute
    // needs a lowercase "true" or "false". Compare via ToString() -- comparing
    // the boxed object directly to a string would never match.
    if (inBool.ToString() == "False" || inBool.ToString() == "0")
    {
        return "false";
    }
    else
    {
        return "true";
    }
}
#>

Notice the symbols at the top and bottom of the function. These allow Biml to use this function as an include file. In the main biml file, we need to add a line:

<#@ include file="-code.biml" #>

And now we can wrap the CreateInProject flag inside the FixBool() function, like so:

   foreach(DataRow connection in connectionsToCreate.Rows) { #>
		<Connection Name="<#=connection["ConnectionName"] #>" ConnectionString="<#=connection["OLEDB_ConnectionString"] #>" 
		CreateInProject="<#=FixBool((bool)connection["CreateInProject"]) #>" 
		CreatePackageConfiguration="false" RetainSameConnection="true" />
	<# } #>
</Connections>

Testing it all: Generating the package.

Let’s test it all.

  1. Shift-click to select the MainBimlConnection.biml file and the My Second Package.biml file.
  2. Right-click and Check Biml for Errors. Fix any errors.
  3. Right-click and Generate SSIS Packages.

If everything went well, you should now have two new Package Level connection managers and a new SSIS Package.
So, it looks like everything worked.

Gotcha! #2

There is one last Gotcha! though. This one has to do with the Guids of the Connection Managers. If three different developers generate the packages you’ve defined here, they will get three different Guids for each connection manager.
To fix this, we need everyone to use connection managers with the same Guid.
Which Guid should it be? It should be the one in the Connection table.
How do I change it? That field is grayed out. Open the connection manager in Code view instead of Design view and you can paste in the Guid that was used in the Connection table.
Changing the Guid for a Connection Manager
Once you’ve changed the connection manager’s Guid to match the one in the Connection table, we need to tell Biml to use that Guid when it’s generating the package, like so:

        foreach(DataRow connection in connectionsToCreate.Rows) { 
        #>
            <Connection ConnectionName="<#=connection["ConnectionName"] #>" Id="<#=connection["ConnectionGuid"] #>" />
        <# } #>
    </Connections>

Then regenerate the packages. Don’t let it regenerate the connection managers though. You can prevent this by unchecking the connection managers in the confirmation box:
Biml Confirmation Dialog
As you increase the number of packages, Biml will generate many more connection managers. The easiest way I have found to keep it from regenerating them is to check the ones you want to keep into source code management (TFS, Git, etc.), let Biml generate the new connection managers, and then undo the pending changes, so that you revert to the saved, checked-in versions of the connection managers.

In Conclusion

Putting it all together, we get this Biml file:

<#@ template language="C#" hostspecific="true" tier="0"#>
<#@ import namespace="System" #>
<#@ import namespace="System.Data" #>
<#@ include file="-code.biml" #>

<Biml xmlns="http://schemas.varigence.com/biml.xsd">
    <# var bimlConfigConnection = (AstDbConnectionNode)RootNode.Connections["MainBimlConnection"]; #>
    
	<Connections>
	    <# var connectionsToCreate = ExternalDataAccess.GetDataTable(bimlConfigConnection.ConnectionString, 
        "SELECT ConnectionId, ConnectionName, OLEDB_ConnectionString, CreateInProject FROM Biml.Connection Where ConnectionTypeId = 1 order by ConnectionName"); 
       foreach(DataRow connection in connectionsToCreate.Rows) { #>
            <Connection Name="<#=connection["ConnectionName"] #>" ConnectionString="<#=connection["OLEDB_ConnectionString"] #>" 
            CreateInProject="<#=FixBool((bool)connection["CreateInProject"]) #>" 
            CreatePackageConfiguration="false" RetainSameConnection="true" />
		<# } #>
    </Connections>    
    
    <Packages>
        
        <# var packagesToCreate = ExternalDataAccess.GetDataTable(bimlConfigConnection.ConnectionString, 
        "SELECT PackageId, PackageName, ConstraintMode, ProtectionLevel, GeneratePackage FROM Biml.Package p "
        + " WHERE GeneratePackage = 1 ORDER BY p.PackageName"); 
       foreach(DataRow package in packagesToCreate.Rows) { 
            #>
            
        	<!-- Build Package Here -->
        	<Package Name="<#=package["PackageName"] #>" ConstraintMode="<#=package["ConstraintMode"] #>" 
        	        ProtectionLevel="<#=package["ProtectionLevel"] #>">
        	        
    	        <Connections>
            		<# connectionsToCreate = ExternalDataAccess.GetDataTable(bimlConfigConnection.ConnectionString,
            		   "SELECT mbc.ConnectionName, CreateInProject, ConnectionGuid "
            			+ " FROM Biml.Connection mbc"
            			+ " INNER JOIN Biml.PackageConnection mbpc ON mbc.ConnectionId = mbpc.ConnectionId"
            			+ " INNER JOIN Biml.Package mbp ON mbpc.PackageId = mbp.PackageId"
            			+ " WHERE mbp.GeneratePackage = 1 and mbpc.PackageId = " + package["PackageId"].ToString()
            			+ " ORDER BY mbc.ConnectionTypeId;");
                   foreach(DataRow connection in connectionsToCreate.Rows) { 
            		   #>
                    <Connection ConnectionName="<#=connection["ConnectionName"] #>" Id="<#=connection["ConnectionGuid"] #>" />
            		<# } #>
                 </Connections>
        	        
                <Tasks>
            		<Container Name="SEQC - Main Control Flow" ConstraintMode="Linear">
            			<Variables></Variables>
            			<Tasks>
            				<Expression Name="Placeholder" Expression="1" />
            			</Tasks>
                    </Container> 
                </Tasks>
            </Package>
        <# } #>
    </Packages>
</Biml>

And it will generate this package (notice the Connection Managers added at the bottom):
Generated Package - After Adding Connections


Biml Tutorial 2 – Creating a Simple Package Generator

So I spent yesterday morning entering the metadata to describe some 70 or so packages that are all part of a big data load that moves data off the mainframe (DB2) into SQL Azure.  Occasionally, I would build the packages with BIML Express to ensure I hadn’t introduced bad metadata.  Today, I tweaked the BIML to allow for column level mappings between source and target tables.  Almost all of the tables are being staged into SQL Azure in the same format that they appear in DB2, so this was not necessary for those packages as BIML and SSIS will sort out the mappings automatically.

Before I get into the advanced material though, I need to finish the introduction that I started before the new job and move to Tennessee.

♦  ♦  ♦

In the last post, I wrote about the metadata we need so that we can create ETL Packages.

We’re going to need at least the following tables:

  • Package
  • Connection
  • PackageConnection
  • ConnectionType

Additionally, Packages can have a Parent Package.

A Data Model Diagram might look like this:

With these tables, we can:

  • Define a Package to be Generated
  • Store the connection strings we need for our Development connections
  • Identify which source and target connections to use
  • Identify the Type of a Connection so we generate the right Source and Target components
  • Choose which packages to generate
  • Identify Parent-Child package relationships
  • Select whether a connection is at the Package or Project level

Defining a Package in MetaData

By populating the Package table, we can define a package to be generated.

Package
PackageId Int Identity, Primary Key, Not Null
PackageName Nvarchar(200) Not Null
ConstraintMode Nvarchar(10) Not Null, Default: Parallel
ProtectionLevel Nvarchar(50) Not Null, Default: EncryptSensitiveWithUserKey
ParentPackageId Int Nullable, Foreign Key to Package.PackageId
GeneratePackage Bit Not Null, Default: 1
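
The definition above translates into DDL along these lines (the Biml schema creation and the constraint names are my choices; adjust to taste):

CREATE SCHEMA Biml;
GO
CREATE TABLE Biml.Package (
    PackageId int IDENTITY(1,1) NOT NULL CONSTRAINT PK_Package PRIMARY KEY,
    PackageName nvarchar(200) NOT NULL,
    ConstraintMode nvarchar(10) NOT NULL CONSTRAINT DF_Package_ConstraintMode DEFAULT 'Parallel',
    ProtectionLevel nvarchar(50) NOT NULL CONSTRAINT DF_Package_ProtectionLevel DEFAULT 'EncryptSensitiveWithUserKey',
    ParentPackageId int NULL CONSTRAINT FK_Package_ParentPackage
        FOREIGN KEY REFERENCES Biml.Package (PackageId),
    GeneratePackage bit NOT NULL CONSTRAINT DF_Package_GeneratePackage DEFAULT 1);
GO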

Inserting a row into the table can be as simple as:

INSERT INTO Biml.Package (PackageName) VALUES ('My First Package');

With the table populated, we can now look at the Biml Code to Generate a Package from the MetaData defined so far.

To get started:

  1. Make sure you have Visual Studio 2015 and the latest SSDT, and that BIML Express is installed.
  2. Create a new Integration Services project named “My First Biml Project”
  3. Delete the Package1.dtsx that is created for you
  4. Right-click on the Project and select Add New Biml File
  5. Right-click again to add a second Biml file and rename the file to be: “My First Biml”

Now that we have two blank Biml files, we need to set up a connection to the Biml MetaData tables.

In the Main Biml Connection file, enter this code:

<Biml xmlns="http://schemas.varigence.com/biml.xsd">
    <Connections>
        <Connection Name="MainBimlConnection" ConnectionString="Data Source=MySqlServerInstance;Initial Catalog=BimlMetaDataDB;Provider=SQLNCLI11.1;Integrated Security=SSPI;" />
    </Connections>
</Biml>

That’s it. This file is done and won’t need to be changed unless you move your BimlMetaData database to a new server.

In the second file, we’ll be getting a bit more code in.

We’ll start by declaring our code language as C# and importing a couple of namespaces:

<#@ template language="C#" hostspecific="true" tier="0" #>
<#@ import namespace="System" #>
<#@ import namespace="System.Data" #>

Next, comes the Root Biml tag:

<Biml xmlns="http://schemas.varigence.com/biml.xsd">

Now, we need to use the connection we defined in the other file:

<# var bimlConfigConnection = (AstDbConnectionNode)RootNode.Connections["MainBimlConnection"]; #>

Then we’ll start the Packages tag:

<Packages>

Now, we pull in the records from the Package table to define our package(s) and close the Packages tag:

<# var packagesToCreate = ExternalDataAccess.GetDataTable(bimlConfigConnection.ConnectionString,
    "SELECT PackageId, PackageName, ConstraintMode, ProtectionLevel, GeneratePackage FROM Biml.Package p WHERE GeneratePackage = 1 ORDER BY p.PackageName");
    foreach(DataRow package in packagesToCreate.Rows) { #>
        <!-- Build Package Here -->
    <# } #>
</Packages>

Where we put the comment, we need to put the code that builds the package itself, such as:

<Package Name="<#=package["PackageName"] #>" ConstraintMode="<#=package["ConstraintMode"] #>" ProtectionLevel="<#=package["ProtectionLevel"] #>">
    <Tasks>
        <Container Name="SEQC - Main Control Flow" ConstraintMode="Linear">
            <Variables></Variables>
            <Tasks>
                <Expression Name="Placeholder" Expression="1" />
             </Tasks>
         </Container>
    </Tasks>
</Package>

The final Biml file should look like this:

First Package Code 1

Now, to generate the package.  First, select both Biml files (shift-click).  Then right-click and choose “Check Biml For Errors”:

Check for Errors

If there are any errors, read the error and check the Biml against the sample.

If there are no errors, then right-click again and choose “Generate SSIS Packages”:

Generate Package

In a few seconds, the Biml engine should generate the package we defined:

Generated Package

That’s it for now.  I’ll be back on as soon as I can to add connections to our package.


Biml Tutorial 1 – Introducing BIML

Intro to Automated Package and Template Generation with BIML

In order to simplify and automate the template creation and package creation process, we will create a Proof-of-concept (POC) that uses the Business Intelligence Markup Language (BIML, pronounced bim-ul).  BIML provides a language with which to generate SSIS packages using highly configurable metadata.

Business Intelligence Markup Language (BIML)

BIML has been a free component of BIDS Helper for years.  BIDS Helper is a well-established open-source product that has been part of the SSIS developer’s toolkit through many versions of SQL Server, up to and including SQL Server 2014.  BIML is an XML-based language, with code embedded in the XML document to perform specific functions.  It looks and feels a lot like working with standard ASP pages.

example-of-biml

BIDS Helper is available for SQL Server 2008 R2 through SQL Server 2014 and Visual Studio 2010 through Visual Studio 2013.  For Visual Studio 2015, SQL Server 2014, and SQL Server 2016, a new tool, BIML Express, is available.  Like BIDS Helper, it is a Visual Studio add-in.  However, BIML Express gives us the ability to edit BIML pages inside Visual Studio, with Intellisense and syntax highlighting, something that was very awkward with BIDS Helper.

The high-end editor for BIML is Varigence’s Mist product (available for $249 per developer per month).  Varigence also hosts a free online editor called BIML Online.

If your development shop can afford it, Mist is the way to go.  If not, you will be using a combination of Visual Studio 2015 with BIML Express to edit/update BIML files, BIML Online to reverse engineer *.dtsx packages into BIML files, and Notepad++ (with the XML Plugin).

SSIS Template Generation

BIML works to automate template generation by using a combination of metadata (stored in SQL Server tables) and C# code, stored in the *.biml files themselves.

The basic structure of a package in BIML is as follows:

sample-package

You can see the tag that declares the Package Name.  Then it lists any package-level Connections.  (Project-level connections are automatically available.)  Then it lists the Variables that are declared in the Package.  Then we see the Control Flow Tasks, which we have divided into three Containers.  Within each container are the Control Flow Tasks for that container.  In the “Main Control Flow” container, you would put the logic for this particular package; both Control Flow Tasks and Data Flow Tasks can be defined in this container.
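
In markup form, that structure is roughly the following (a sketch; the two container names besides “Main Control Flow” are my own placeholders):

<Biml xmlns="http://schemas.varigence.com/biml.xsd">
    <Packages>
        <Package Name="My First Package" ConstraintMode="Parallel"
                 ProtectionLevel="EncryptSensitiveWithUserKey">
            <Connections></Connections>
            <Variables></Variables>
            <Tasks>
                <Container Name="SEQC - Pre-Execute" ConstraintMode="Linear"></Container>
                <Container Name="SEQC - Main Control Flow" ConstraintMode="Linear"></Container>
                <Container Name="SEQC - Post-Execute" ConstraintMode="Linear"></Container>
            </Tasks>
        </Package>
    </Packages>
</Biml>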

If I run the BIML script above, it will output a standard *.dtsx SSIS package, shown at right.

Given the static XML above, and the standard package at right, what if I wanted to Generate several packages, each with a different name, but conforming to all our standards?  To accomplish this with BIML, we first introduce a table in the BIML schema that will hold package specifications, and then we modify the XML above to use parameters to generate the packages.

The table could look like this:

biml-package

The change to the BIML file is to add some C# code to loop through the packages in the metadata table defined above, and then to add some BIML specific tags that evaluate the expressions.  The tags are written inside the quotes for the attributes like Name and ProtectionLevel.  Like so:

sample-package-bimlized

Why Bother?

Why bother with the parameters?  Why not just run the script, rename the package, run the script again and rename the second package, run the script again and rename the third package?  (Sounds tedious, doesn’t it?)  The reason we bother is so that we can generate a lot of package templates, and then regenerate the package templates when the ETL Framework is updated.

For example, say we modified the ETL Framework a little to accommodate a new method for delivering notifications.  Without the metadata for BIML to use, we could ignore this change and only apply it to template packages going forward, using the script-and-rename technique.  Or we could manually edit each package in our solution to retrofit it so that the framework change can be used in the existing packages as well as new ones.  Or we could regenerate all the packages with BIML, and then edit each to copy the Main Control Flow container from the existing package to the new one.  This is, debatably, a better way of handling changes to the Framework, but it still incurs the tedium of copying the Main Control Flow container from the old to the new.  Not ideal.

Perhaps we could generate all or most of the main package, and move the custom tasks into a separate, child package?

It turns out, we can.

SSIS Package / Template Generation

Most packages do much the same thing.  They fetch some data from some source and load it into a database into a staging table.  Then they perform some transformations and load the staged data into some final tables.  The transformations in the middle might be custom, but the rest is not.  The Extract from the source to the staging tables can be configured with metadata and coded for in BIML.  The Load from intermediary tables to the final tables can be configured too.  The only bit that is data specific is the Transformation happening in the middle of the process.  If we isolate that in its own package, we can fully automate the master package generation and automate the child template generation.

This is accomplished by capturing connection data for the different sources, mapping data from the source to staging table, and mapping data from the intermediate table to the final table.

A simple data model can accommodate the metadata we need.  The Connection table needs to contain all the parameters for the different kinds of connections we might encounter, and BIML will identify which columns are needed for each connection type.  Packages need to know which connections they use.  The Extract into staging tables needs to know the field mappings from source to target.  Likewise, the Load from the intermediary tables to the final tables needs to know its field mappings from source to target.

Continued here:  Introducing BIML – Continued

The Common ETL Framework Part 2

Auditing/Logging

Each package is responsible for logging at two different levels:  Audit and Debug.

The Audit level of logging needs to record the following for each package execution:

  1. Start Day and Time
  2. Executing Machine name
  3. Executing as Username
  4. Records Affected (if appropriate) – Can include Records Inserted, Updated, Deleted, Archived, Errored, etc.
  5. Error data if an error occurs – Includes SQL Error, Call Stack, etc.
  6. Presence of Bad Data, if any, with specifics
  7. End Day and Time

The Debug level of logging records all the Audit level items, plus Control Flow Task level logging, plus reporting the Bad Data, if any, with specifics.

In SSIS, this level of logging can be handled by adding Package and Task level event handlers.  The BIML code to create them looks like this.  There are several placeholders in this code that will be expanded upon in a later post.

Package Level Event Handlers:

package-level-event-handlers
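
As a rough sketch of the shape (not the framework’s actual code from the screenshot; the connection name and the Audit table columns here are my assumptions), a package-level OnError handler in Biml might look like this:

<Events>
    <Event EventType="OnError" Name="OnError Handler" ConstraintMode="Linear">
        <Tasks>
            <ExecuteSQL Name="SQL - Log Package Error" ConnectionName="AuditConnection">
                <DirectInput>
                    INSERT INTO Audit.PackageExecutionError
                        (PackageName, ErrorCode, ErrorDescription, ErrorDateTime)
                    VALUES (?, ?, ?, GETDATE());
                </DirectInput>
                <Parameters>
                    <Parameter Name="0" VariableName="System.PackageName" DataType="String" />
                    <Parameter Name="1" VariableName="System.ErrorCode" DataType="Int32" />
                    <Parameter Name="2" VariableName="System.ErrorDescription" DataType="String" />
                </Parameters>
            </ExecuteSQL>
        </Tasks>
    </Event>
</Events>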

Main Control Flow Level Task Event Handlers:

task-level-event-handlers

Error Handling

There is a Package Level Error Handler in the example above.  Basically, we need to capture the errors at the package and task level and log them to an Audit.PackageExecutionError table.  Note: Bad Data that is being redirected to an Audit.BadDataLog table is not going to raise an error.  The Framework needs to be able to handle both sorts of errors and report on either of them in the notification portion of the Framework.  The Bad Data data flow is shown in the diagram below.

data-flow-error-handlers

Notification/Routing

Depending on the Success, Failure/Error, Unusual Behavior, or Bad Data condition of the Package or Task, there are different people that need to be alerted.  The first pass of implementing the Routing will look at SMTP Email as the notification routing mechanism.  A later pass might look at HTTP or Web Service notifications.

All notification contact info will be in the Audit schema and will be maintainable through simple table updates.  Email addresses and notification recipients must not be hard coded into the packages, neither by a developer nor by BIML.  Notification recipients must be determined dynamically at run-time.

In the Success condition, an optional statistical email to the specified notification recipients would suffice.  This email would show the Success of the Package and include Package Name, Start Time, End Time and Elapsed Time, Data Source, Target DB, and the Affected Row Counts.  Notification recipients should be configurable to receive notifications Daily, Weekly, Bi-weekly, Monthly or Not at All.

In the Failure/Error condition, a detailed error message needs to be sent to the notification recipients that are indicated as SendErrorNotice in their contact profile.  Every package must name at least one SendErrorNotice contact.  These contacts will receive an email that shows that there was an error, the Package Name, Start Time, End Time and Elapsed Time, Data Source, Target DB, Schema, and Table Name, Affected Row Counts, Error Message(s), Stack Trace (if available), Bad Data Condition (if there is one), and as much other information as we can harvest from SSIS.  The difference between a Failure and an Error is that a Failure is expected in certain circumstances, but an Error is not expected.

In the Unusual Behavior condition, a message needs to be sent to the SendErrorNotice recipients, as in the Failure condition above.  However, the Unusual Behavior notification will document the Package Name, Start Time, End Time and Elapsed Time, Data Source, Target DB, Schema, and Table Name, Affected Row Counts, and Bad Data Condition (if there is one), as well as documenting the Unusual Behavior that triggered this condition.  For example, if a package that normally processes 100,000 rows suddenly processes 500,000 rows, that is Unusual Behavior.  Likewise, if a package that typically takes 3 minutes to run takes 20 minutes, that is Unusual Behavior.

In the Bad Data condition, the package has not Failed, Errored, or experienced Unusual Behavior.  It has experienced rows being routed to the Bad Data path(s) in the Data Flow.  Details of the Bad Data need to be sent to the notification recipients that are indicated as SendBadDataNotice in their contact profile.  These notifications should include the Package Name, Start Time, End Time and Elapsed Time, Data Source, Target DB, Schema, and Table Name, Affected Row Counts, Bad Data Row Counts, and Bad Data Row Details.

Multi-threading

Multi-threading allows the package to run faster when there are tasks that can be run in parallel.  There are two multi-threading conditions that the Framework needs to handle:

  • Master Package Multi-threading of multiple child packages
  • Package Task execution within a package
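A hedged sketch of the first condition: a master package runs its children in parallel by placing them in a container with ConstraintMode="Parallel", then continues serially once all children finish.  The package names and the Audit.MasterEnd procedure are hypothetical, and the child-package reference element can differ by deployment model.

```xml
<Package Name="Master" ConstraintMode="Linear">
  <Tasks>
    <!-- No precedence constraints inside this container, so children run in parallel -->
    <Container Name="SEQC Parallel Loads" ConstraintMode="Parallel">
      <Tasks>
        <ExecutePackage Name="EPT Load Customer">
          <ExternalProjectPackage Package="LoadCustomer.dtsx" />
        </ExecutePackage>
        <ExecutePackage Name="EPT Load Product">
          <ExternalProjectPackage Package="LoadProduct.dtsx" />
        </ExecutePackage>
      </Tasks>
    </Container>
    <!-- Linear mode at the package level makes this run after both children finish -->
    <ExecuteSQL Name="SQL Audit Master End" ConnectionName="Audit">
      <!-- Audit.MasterEnd is a hypothetical logging procedure -->
      <DirectInput>EXEC Audit.MasterEnd @PackageName = ?;</DirectInput>
      <Parameters>
        <Parameter Name="0" VariableName="System.PackageName" DataType="String" />
      </Parameters>
    </ExecuteSQL>
  </Tasks>
</Package>
```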

Re-run ability

Packages should be re-runnable if the data has changed and we need to reload it.  For example, bad data was detected, logged and handled.  Now that the bad data has been fixed, we need to re-run the package for today’s data.  This implies that the package would know that its last run had bad data, that the package actions need to be reversed, and that the package can then reload the data for the day.  This is most often implemented as a clean-up step at the beginning of the ETL package that checks whether any of the data it is about to load is already in the database and removes it if it is.  For example, suppose the package was loading data into the final tables when the database ran out of space and threw an error.  Once the space issue is fixed, the package would be re-run.  First, it would remove the data it loaded in the first run; then it would load the data normally, as if it had not been run before.
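That clean-up-then-load shape can be sketched in Biml as below.  The table and column names are hypothetical; the clean-up key here is the load date, but a batch ID would work the same way.

```xml
<Package Name="LoadSales" ConstraintMode="Linear">
  <Tasks>
    <!-- Clean-up step: remove anything a failed earlier run already loaded today -->
    <ExecuteSQL Name="SQL Remove Current Day Rows" ConnectionName="Target">
      <DirectInput>DELETE FROM dbo.SalesStage WHERE LoadDate = CAST(GETDATE() AS date);</DirectInput>
    </ExecuteSQL>
    <!-- Normal load; Linear mode wires it to run after the clean-up succeeds -->
    <Dataflow Name="DFT Load Sales">
      <Transformations>
        <OleDbSource Name="SRC Sales" ConnectionName="Source">
          <DirectInput>SELECT SaleId, Amount, CAST(GETDATE() AS date) AS LoadDate FROM dbo.Sales;</DirectInput>
        </OleDbSource>
        <OleDbDestination Name="DST Sales Stage" ConnectionName="Target">
          <ExternalTableOutput Table="dbo.SalesStage" />
        </OleDbDestination>
      </Transformations>
    </Dataflow>
  </Tasks>
</Package>
```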

Restartability

Packages should be able to pick up from the task prior to the failure point and run properly (where this makes sense).  There’s no need to stage data we’ve already staged.  Typically, when a package fails, it has already done some processing.  Packages should be able to roll back the failed task and restart from that point.  This is an advanced ETL framework activity.  It can be very tricky to deal with restarting a failed package, especially if it has any complexity in it.

The Common ETL Framework Part 1

Defining a Common ETL Framework

SQL Server Integration Services (SSIS) provides a mechanism to Extract data from some source, Transform it as necessary, and Load it into the tables in a target database (ETL).

Each SSIS package should implement a common framework for logging, executing, monitoring and error handling so that these details are abstracted from the SSIS developer and are “just part of the template” that they should use for SSIS package creation.

Package automation takes this one step further by automating the creation of packages based on metadata that is configurable at design time.  I have selected Business Intelligence Markup Language (BIML) to perform this automation.  I will demonstrate how to use BIML to configure and generate packages in future articles in this series.

What Does the Common ETL Framework Need to Do?

The Framework needs to provide the services and behaviors that are common to all ETL packages.  These include:

  1. Auditing/Logging.  Execution Start, Stop and Error auditing for packages and tasks as they are executed.  Auditing/logging verbosity should be configurable so that we can enhance the logging when we need to debug something.
  2. Error Handling.  Appropriate error handling. Reroute bad data on read and on write.  Break the process only when necessary.  Log data, connectivity and process problems.
  3. Notifications and Routing.  Intelligent routing so that the appropriate systems and system owners are notified of a failure without spamming the whole department.
  4. Multi-Threading.  Master packages need to manage the multi-threading and dependencies of their child packages, watching for errors and handling them appropriately (i.e. Stop or Continue execution of other child packages upon a failure).
  5. Restartability.  Packages should be able to pick up from the task prior to the failure point and run properly (where this makes sense).  I.e.  No need to stage data we’ve already staged.
  6. Re-run ability. Packages should be re-runnable if the data has changed and we need to reload it.  For example, bad data was detected, logged and handled.  Now the bad data has been fixed; we need to re-run the package for today’s data.  This implies that the package would know that its last run had bad data, that the package actions need to be reversed and then the package can reload the data for the day.
  7. Load Strategies.  The Framework should handle a variety of data Extract and Load strategies:
    1. Truncate and Replace loads. Where appropriate, table data can be truncated and reloaded.  This technique is common for staging tables and for smaller datasets with no dependencies.
    2. Incremental loads. Where appropriate, data can be extracted incrementally and/or loaded incrementally, rather than a truncate and reload.  This is often used to keep history.  Incremental loads come in several flavors, some use a hash, others a date flag, others a set of change flags.  Deletions require special handling in incremental loads.
    3. Cycle loads. Where appropriate, a dataset can be loaded with a Days Old flag set to -1; then a simple update can increment the Days Old flag by 1 to cycle the data by a day.  A delete query can then remove rows that are too old to be kept.
  8. Data Retention.  The Framework should also handle data retention during the ETL process.  There are three strategies for removing data from the primary dataset:  deletion, archiving, and flagging.  A typical retention plan will employ all three to varying degrees.
  9. Stop/Cancel Job.  The Framework should be able to interrupt a job, stopping it immediately or canceling it at the next step.  This also implies that the job is already Restartable or Re-runnable.
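As one worked example from the list above, the Cycle load (7.3) reduces to three steps: load today's data flagged -1, increment every flag by 1, and delete rows past the retention window.  This is a hedged sketch with hypothetical table names and a 90-day window.

```xml
<Package Name="CycleDailySnapshot" ConstraintMode="Linear">
  <Tasks>
    <!-- Load today's rows flagged DaysOld = -1 -->
    <Dataflow Name="DFT Load Today">
      <Transformations>
        <OleDbSource Name="SRC Snapshot" ConnectionName="Source">
          <DirectInput>SELECT AccountId, Balance, DaysOld = -1 FROM dbo.AccountBalance;</DirectInput>
        </OleDbSource>
        <OleDbDestination Name="DST Snapshot" ConnectionName="Target">
          <ExternalTableOutput Table="dbo.DailySnapshot" />
        </OleDbDestination>
      </Transformations>
    </Dataflow>
    <!-- A simple update cycles the data: today's -1 becomes 0, yesterday's 0 becomes 1, and so on -->
    <ExecuteSQL Name="SQL Cycle Days Old" ConnectionName="Target">
      <DirectInput>UPDATE dbo.DailySnapshot SET DaysOld = DaysOld + 1;</DirectInput>
    </ExecuteSQL>
    <!-- Enforce retention by deleting rows that are too old to keep -->
    <ExecuteSQL Name="SQL Trim Old Rows" ConnectionName="Target">
      <DirectInput>DELETE FROM dbo.DailySnapshot WHERE DaysOld > 90;</DirectInput>
    </ExecuteSQL>
  </Tasks>
</Package>
```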