Quickly Processing One Million Transactions in Azure SQL Database

I’ve had to solve an interesting problem this week. I started the week with an SSIS package that ran in 2 minutes. It extracts a million transaction records from DB2 to a file, uploads the file to Azure Blob Storage, and BULK INSERTs the file into an Azure SQL Database staging table. All in 2 minutes. This was acceptable performance and faster than any of the other options we were considering.

Each of the DB2 records represents an Insert, Update or Delete to the base table in DB2. I get the Transaction Type (M – Merge (Insert/Update) or D – Delete), and a time stamp in addition to the columns from the row.

So a typical set of rows might look like this:

Id (Identity) | Field1   | Field2 | Field3    | Field4 | TrxType | RowCreated
1             | 13940001 | 1      | 247857794 | 1      | M       | 11:37:31.650
2             | 13940001 | 2      | 247857792 | 0      | M       | 11:37:32.070
3             | 13940001 | 3      | 247857792 | 6      | M       | 11:37:32.185
4             | 13940001 | 2      | 247857792 | 1      | M       | 11:37:32.205
5             | 13940001 | 3      | 247857794 |        | M       | 11:37:32.265
6             | 13940001 | 2      | 247857792 | 9      | D       | 11:37:32.391
7             | 13940001 | 2      | 247857791 | 8      | M       | 11:37:33.392

In the example above, the rows are all for the same document (see Field1: value 13940001). Rows with Field2 = 1, 2, 3 were added. Then rows 2 and 3 were changed (Ids 4 and 5). Then row 2 was deleted (Id 6) and a new row 2 was inserted (Id 7).

Here is the definition of the source table in the Azure SQL Database:

CREATE TABLE ETLSource(
       Field1 [numeric](12, 0) NULL,
       Field2 [smallint] NULL,
       Field3 [int] NULL,
       Field4 [smallint] NULL,
       Field5 [nvarchar](1) NULL,
       [UpdateType] [nchar](1) NOT NULL,
       [RowCreated] [datetime2](7) NOT NULL,
       [Id] BIGINT IDENTITY NOT NULL
) WITH (DATA_COMPRESSION = PAGE)
GO

CREATE UNIQUE CLUSTERED INDEX ETLSourcePK ON ETLSource (Id) WITH (DATA_COMPRESSION = PAGE);
CREATE UNIQUE NONCLUSTERED  INDEX ETLSourceIdx1 ON ETLSource (UpdateType, RowCreated, Field1, Field2) WITH (DATA_COMPRESSION = PAGE);
CREATE UNIQUE NONCLUSTERED  INDEX ETLSourceIdx2 ON ETLSource (Field1, Field2, UpdateType, RowCreated) WITH (DATA_COMPRESSION = PAGE);
GO

And here is the definition of the target table in the Azure SQL Database:

CREATE TABLE ETLTarget(
       Field1 [numeric](12, 0) NULL,
       Field2 [smallint] NULL,
       Field3 [int] NULL,
       Field4 [smallint] NULL,
       Field5 [nvarchar](1) NULL,
       [BatchDate] [datetime2](7) NULL
) WITH (DATA_COMPRESSION = PAGE)
GO

CREATE CLUSTERED INDEX ETLTargetPK ON ETLTarget (Field1, Field2) WITH (DATA_COMPRESSION = PAGE);
GO

At first, I tried a cursor. I know how to write them and it was easy enough to create a cursor that looped through the rows and used either a DELETE statement or a MERGE statement to deal with each one. Here’s what that looked like:

DECLARE @BatchDate DATETIME2(7) = SYSUTCDATETIME();

DECLARE @Field1 NUMERIC(12, 0)
DECLARE @Field2 SMALLINT
DECLARE @Field3 INT
DECLARE @Field4 SMALLINT
DECLARE @Field5 NVARCHAR(1)
DECLARE @UpdateType	CHAR(1)
DECLARE @RowCreated	DATETIME2(7)

DECLARE cur CURSOR LOCAL FAST_FORWARD FOR SELECT
	   Field1 
	  ,Field2 
	  ,Field3 
	  ,Field4 
	  ,Field5 
	  ,UpdateType
	  ,RowCreated
    FROM ETLSource
    ORDER BY id

OPEN cur

FETCH NEXT FROM cur INTO 
	 @Field1 
    , @Field2 
    , @Field3 
    , @Field4 
    , @Field5 
    , @UpdateType
    , @RowCreated

WHILE @@fetch_status = 0
BEGIN

    IF @UpdateType = 'D'
    BEGIN
	   DELETE FROM dbo.ETLTarget
	   WHERE Field1 = @Field1
		  AND Field2 = @Field2;
    END
    IF @UpdateType = 'M'
    BEGIN
	   --Merge the changes that are left
	   MERGE ETLTarget AS target 
	   USING (
		  VALUES(
		    @Field1 
		  , @Field2 
		  , @Field3 
		  , @Field4 
		  , @Field5 
		  )
	   ) AS source (
		Field1 
	   , Field2 
	   , Field3 
	   , Field4 
	   , Field5 )
	   ON (target.Field1 = source.Field1
		  AND target.Field2 = source.Field2)
	   WHEN MATCHED
		  THEN UPDATE
			 SET target.Field3 = source.Field3
			    ,target.Field4 = source.Field4
			    ,target.Field5 = source.Field5
			    ,target.BatchDate = @BatchDate
	   WHEN NOT MATCHED BY target
		  THEN INSERT (
			   Field1 
			 , Field2 
			 , Field3 
			 , Field4 
			 , Field5
			 , BatchDate)
		  VALUES (@Field1 
			 , @Field2 
			 , @Field3 
			 , @Field4 
			 , @Field5 
			 , @BatchDate);
    END;

    FETCH NEXT FROM cur INTO 
		@Field1 
	   , @Field2 
	   , @Field3 
	   , @Field4 
	   , @Field5 
	   , @UpdateType
	   , @RowCreated
END

CLOSE cur
DEALLOCATE cur

Unfortunately, this solution was TERRIBLY slow. Cursors are notorious for being slow. This one worked fine for 1,000 transaction rows, but, after running for an hour and only processing a small portion of the million rows, I killed it and went looking for a set-based alternative.

Next, I tried a set-based MERGE statement. This was problematic because it kept complaining that multiple source records were trying to change the same target record. The complaint made sense once I realized that a row might be inserted and then updated in the same day, so it would have two source transactions. I needed to get rid of the extras. It turns out that I really only care about the latest change for each key: if it’s an insert or update, MERGE will insert or update the target row appropriately; if it’s a delete, MERGE can handle that too. But how do I select only the most recent row for each key? The standard de-duplication CTE query served as a model. Here is the final statement that worked:

WITH sourceRows AS (
    SELECT *, RN  = ROW_NUMBER() OVER (PARTITION BY
	   Field1, Field2
	   ORDER BY Field1, Field2, RowCreated DESC)
    FROM ETLSourceStagingTable)

INSERT INTO ETLSource (
      Field1 
    , Field2 
    , Field3 
    , Field4 
    , Field5
    , UpdateType
    , RowCreated)
SELECT       
      Field1 
    , Field2 
    , Field3 
    , Field4 
    , Field5
    , UpdateType
    , RowCreated 
FROM sourceRows
WHERE RN = 1
ORDER BY RowCreated;

Note the introduction of a Staging Table. The SSIS package now uses BULK INSERT to load the Staging Table from the file in Blob Storage, and the query above loads only the relevant rows (the most recent per key) into the ETLSource table. The Staging Table has the same structure as the ETLSource table, minus the Id column, and has an index on it like this:

CREATE INDEX ETLSourceStagingTableSort ON ETLSourceStagingTable
(Field1, Field2, RowCreated DESC) WITH (DATA_COMPRESSION = PAGE)
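
For completeness, here is a minimal sketch of that staging table, plus a hypothetical T-SQL version of the Blob Storage load the package performs; the external data source name and file path are placeholders, not the ones from my package:

CREATE TABLE ETLSourceStagingTable(
       Field1 [numeric](12, 0) NULL,
       Field2 [smallint] NULL,
       Field3 [int] NULL,
       Field4 [smallint] NULL,
       Field5 [nvarchar](1) NULL,
       [UpdateType] [nchar](1) NOT NULL,
       [RowCreated] [datetime2](7) NOT NULL
) WITH (DATA_COMPRESSION = PAGE);
GO

-- Hypothetical load from Blob Storage: 'MyBlobStorage' would be an external data
-- source of TYPE = BLOB_STORAGE, and the file name is illustrative only.
BULK INSERT ETLSourceStagingTable
FROM 'transactions/daily-extract.csv'
WITH (DATA_SOURCE = 'MyBlobStorage', FORMAT = 'CSV', TABLOCK);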

Using the Staging Table and the CTE query above means that, of the original 7 rows in the example, only three are relevant:

Id (Identity) | Field1   | Field2 | Field3    | Field4 | TrxType | RowCreated   | Relevant
1             | 13940001 | 1      | 247857794 | 1      | M       | 11:37:31.650 | YES
2             | 13940001 | 2      | 247857792 | 0      | M       | 11:37:32.070 |
3             | 13940001 | 3      | 247857792 | 6      | M       | 11:37:32.185 |
4             | 13940001 | 2      | 247857792 | 1      | M       | 11:37:32.205 |
5             | 13940001 | 3      | 247857794 |        | M       | 11:37:32.265 | YES
6             | 13940001 | 2      | 247857792 | 9      | D       | 11:37:32.391 |
7             | 13940001 | 2      | 247857791 | 8      | M       | 11:37:33.392 | YES

Now, I just needed to craft the MERGE statement to handle all three transaction types properly. When I did, this is what I had:

MERGE ETLTarget AS target USING (
    SELECT 
	     Field1 
	   , Field2 
	   , Field3 
	   , Field4 
	   , Field5
	   , UpdateType
    FROM ETLSource
    ) AS source (Field1 
	   , Field2 
	   , Field3 
	   , Field4 
	   , Field5
	   , UpdateType)
ON (target.Field1 = source.Field1
    AND target.Field2 = source.Field2)
WHEN MATCHED AND source.UpdateType = 'M'
    THEN UPDATE
	   SET target.Field3 = source.Field3
		  ,target.Field4 = source.Field4
		  ,target.Field5 = source.Field5
		  ,target.BatchDate = @BatchDate
WHEN MATCHED AND source.UpdateType = 'D'
    THEN DELETE  
WHEN NOT MATCHED BY TARGET AND source.UpdateType = 'M'
    THEN INSERT (Field1 
	   , Field2 
	   , Field3 
	   , Field4 
	   , Field5
	   , BatchDate)
    VALUES (Field1 
	   , Field2 
	   , Field3 
	   , Field4 
	   , Field5
	   , @BatchDate);

That was fine for a small data set but crawled on a big one, so I added batching so that the MERGE only had to deal with a small set of rows at a time. Since the clustered index is on an identity column, and since I truncate ETLSource before loading it, I am guaranteed that the Id column will hold the values 1…n, where n is the total number of rows. So, I initialize an @rows variable right after inserting the rows into ETLSource:

SET @rows = @@rowcount;

Next, I create a while loop for each batch:

DECLARE @batchSize INT = 10000;
DECLARE @start INT = 1;
DECLARE @end INT = @batchSize;

WHILE (@start <= @rows)
BEGIN

    MERGE ETLTarget AS target USING (
    ...;

    SET @start = @start + @batchSize;
    SET @end = @end + @batchSize;
END

Then I add the @start and @end to the MERGE statement source:

MERGE ETLTarget AS target USING (
    SELECT 
	     Field1 
	   , Field2 
	   , Field3 
	   , Field4 
	   , Field5
	   , UpdateType
    FROM ETLSource
    WHERE id BETWEEN @start AND @end
    ) AS source

And this worked!!! I was able to process a million rows in 1 minute. Yay!

Then I tried 10 million rows. Ugh. Now the MERGE was only processing 10,000 rows per minute. Crap. What changed? Same code. Same data, just more of it. A look at the query plan explained it all. Here it is with 1 million rows:

And here it is with 10 million rows:

The ETLTarget table has 80 million rows in it. When I had 10 million rows in the ETLSource table, the query optimizer decided that it would be easier to do an Index SCAN instead of a SEEK. In this case, however, the SEEK would have been A LOT faster.

So how do we force it to use a seek? It turns out the optimizer has no idea how many rows we’re processing in a batch, so it bases its optimization on the entire table. We could use the LOOP JOIN hint at the end of the MERGE statement:

	MERGE	...	 , @BatchDate)
	   OPTION (LOOP JOIN);

But most folks like to avoid these kinds of hints, so we needed another way. Someone suggested putting a TOP clause in the source SELECT statement. That worked. Here’s how the MERGE looks now:

MERGE ETLTarget AS target USING (
    SELECT TOP 10000
	     Field1 
	   , Field2 
	   , Field3 
	   , Field4 
	   , Field5
	   , UpdateType
    FROM ETLSource
    WHERE id BETWEEN @start AND @end
    ORDER BY id
    ) AS source

With this in place, I was able to process the 10 million rows in 10 minutes. Woohoo! Just to be sure, I re-ran the process with a thousand, a million and 10 million rows and it was consistent. I was able to process a million rows a minute.
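
Putting the pieces together, here is a condensed sketch of the whole load as described above. The declarations and the INSERT are pulled from the earlier snippets, and TOP (@batchSize) stands in for the hard-coded TOP 10000:

DECLARE @BatchDate DATETIME2(7) = SYSUTCDATETIME();
DECLARE @batchSize INT = 10000;
DECLARE @start INT = 1;
DECLARE @end INT = @batchSize;
DECLARE @rows INT;

-- Start from an empty ETLSource so the Id values run 1..n.
TRUNCATE TABLE ETLSource;

-- Keep only the most recent transaction per key (the CTE shown earlier).
INSERT INTO ETLSource (Field1, Field2, Field3, Field4, Field5, UpdateType, RowCreated)
SELECT Field1, Field2, Field3, Field4, Field5, UpdateType, RowCreated
FROM (SELECT *, RN = ROW_NUMBER() OVER (PARTITION BY Field1, Field2 ORDER BY RowCreated DESC)
      FROM ETLSourceStagingTable) AS s
WHERE RN = 1
ORDER BY RowCreated;

SET @rows = @@ROWCOUNT;

WHILE (@start <= @rows)
BEGIN
    MERGE ETLTarget AS target USING (
        SELECT TOP (@batchSize)
               Field1, Field2, Field3, Field4, Field5, UpdateType
        FROM ETLSource
        WHERE Id BETWEEN @start AND @end
        ORDER BY Id
        ) AS source
    ON (target.Field1 = source.Field1 AND target.Field2 = source.Field2)
    WHEN MATCHED AND source.UpdateType = 'M'
        THEN UPDATE SET target.Field3 = source.Field3
                       ,target.Field4 = source.Field4
                       ,target.Field5 = source.Field5
                       ,target.BatchDate = @BatchDate
    WHEN MATCHED AND source.UpdateType = 'D'
        THEN DELETE
    WHEN NOT MATCHED BY TARGET AND source.UpdateType = 'M'
        THEN INSERT (Field1, Field2, Field3, Field4, Field5, BatchDate)
             VALUES (source.Field1, source.Field2, source.Field3, source.Field4, source.Field5, @BatchDate);

    SET @start = @start + @batchSize;
    SET @end = @end + @batchSize;
END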

When Loading Data, Should I Drop Indexes or Not?

I just ran a few simple tests in Azure SQL DB to see how each approach would perform.

I have a target table with an identity column that is the clustered primary key, and two other indexes that are the same except for the field order. (Whether having both is useful is a question for another day.) Here’s the DDL for the target table:

CREATE TABLE [ETL1](
	Field1 [numeric](12, 0) NULL,
	Field2 [smallint] NULL,
	Field3 [int] NULL,
	Field4 [smallint] NULL,
	Field5 [nvarchar](1) NULL,
	Field6 [nvarchar](2) NULL,
	Field7 [numeric](4, 0) NULL,
	Field8 [numeric](2, 0) NULL,
	Field9 [nvarchar](2) NULL,
	Field10 [nvarchar](8) NULL,
	Field11 [datetime2](7) NULL,
	Field12 [nvarchar](8) NULL,
	Field13 [datetime2](7) NULL,
	[UpdateType] [nchar](1) NOT NULL,
	[RowCreated] [datetime2](7) NOT NULL,
	Id BIGINT IDENTITY NOT NULL
) ON [PRIMARY]
GO

CREATE UNIQUE CLUSTERED INDEX iscdpk ON ETL1 (Id) WITH (DATA_COMPRESSION = PAGE);
CREATE UNIQUE NONCLUSTERED  INDEX iscd1 ON ETL1 (UpdateType, RowCreated, Field1, Field2) WITH (DATA_COMPRESSION = PAGE);
CREATE UNIQUE NONCLUSTERED  INDEX iscd2 ON ETL1 (Field1, Field2, UpdateType, RowCreated) WITH (DATA_COMPRESSION = PAGE);
GO

Test 1: Truncate the Target, Drop the Indexes, Insert the Data, Recreate the Indexes.

Test 2: Drop the Indexes, Truncate the Target, Insert the Data, Recreate the Indexes

Test 3: Just Truncate the Target and Insert the Data

Test 4: Truncate the Target, Drop the non-clustered Indexes (leaving the clustered index on the identity column), Insert the Data, Recreate the non-clustered Indexes.
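
To make the steps concrete, here is a rough sketch of the Test 4 sequence (the other tests simply rearrange or omit these steps); the index names match the DDL above, and the actual 1.84 M-row insert is left as a placeholder:

-- Test 4: keep the clustered index, drop only the non-clustered indexes.
TRUNCATE TABLE ETL1;

DROP INDEX iscd1 ON ETL1;
DROP INDEX iscd2 ON ETL1;

-- (The 1.84 M-row INSERT ... SELECT from the source goes here.)

CREATE UNIQUE NONCLUSTERED INDEX iscd1 ON ETL1 (UpdateType, RowCreated, Field1, Field2) WITH (DATA_COMPRESSION = PAGE);
CREATE UNIQUE NONCLUSTERED INDEX iscd2 ON ETL1 (Field1, Field2, UpdateType, RowCreated) WITH (DATA_COMPRESSION = PAGE);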

Here are the results. All timings are in milliseconds. These were run on a PRS1 instance of Azure SQL Database.

Step               | Test 1: Trunc then Drop Idxs | Test 2: Drop before Trunc | Test 3: No Drop/Create | Test 4: Trunc but don't drop clustered index
Truncate           | 4                            | 2                         | 0                      | 4
Drop PK            | 8                            | 4                         | n/a                    | n/a
Drop Index 1       | 5                            | 23,630                    | n/a                    | 2
Drop Index 2       | 6                            | 2                         | n/a                    | 2
Insert 1.84 M rows | 83,033                       | 82,315                    | 161,706                | 83,205
Create PK          | 20,454                       | 21,205                    | n/a                    | n/a
Create Index 1     | 12,149                       | 12,264                    | n/a                    | 12,265
Create Index 2     | 11,142                       | 11,313                    | n/a                    | 11,247
Total Time (ms)    | 126,801                      | 150,735                   | 161,706                | 106,725
Total Time (mins)  | 2.11                         | 2.51                      | 2.70                   | 1.78
Delta (ms)         | 0                            | 23,934                    | 34,905                 | (20,076)

Test 4 was the clear winner as it avoided the cost of recreating the clustered index, which makes sense: the clustered index was being filled in order by the identity column as the rows were added. Test 1 came in second, so if your clustered index is not on an identity column, or you have no clustered index, you are still better off dropping and recreating the indexes.

Conclusion: When inserting larger data sets into an empty table, drop the indexes before inserting the data, unless the index is clustered on an identity column.
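
As an aside, the same pattern can be scripted without dropping the index definitions by disabling the non-clustered indexes and rebuilding them after the load (a disabled non-clustered index is not maintained during inserts); this variant was not part of the timings above. A sketch against the table above:

-- Disable the non-clustered indexes before the load...
ALTER INDEX iscd1 ON ETL1 DISABLE;
ALTER INDEX iscd2 ON ETL1 DISABLE;

-- ...insert the data here...

-- ...then rebuild them afterwards (REBUILD re-enables a disabled index).
ALTER INDEX iscd1 ON ETL1 REBUILD WITH (DATA_COMPRESSION = PAGE);
ALTER INDEX iscd2 ON ETL1 REBUILD WITH (DATA_COMPRESSION = PAGE);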

SQL Saturday Charlotte 2017

I am excited to announce I am finally ready to present my advanced ETL Framework with Biml seminar at SQL Saturday Charlotte on Oct 14.  And I have a 10AM slot!!!   Wooo!


Implementing an SSIS Framework and enforcing SSIS Patterns (with Biml).

(Using Biml to Automate the Implementation of an SSIS Framework)

Let’s use Biml to automate the implementation of a standard SSIS framework: logging, error handling, etc. Business Intelligence Markup Language (Biml) is great at automating the creation of SSIS packages. With Biml we can generate a template package that implements a standard SSIS framework. In this fast-paced session, we will create the tables and load some metadata, write the Biml to implement logging and error handling, and generate some packages that implement our standard framework. In this session, you don’t need to know Biml, but some familiarity with XML, T-SQL, or C# will help. By the end of this session, you will know how to use Biml to automatically generate packages that implement a simple SSIS Framework with Audit Logging and Error Handling.

SQL Saturday Atlanta!

I am excited to announce that I will be giving my BIML seminar in Atlanta in July!!! Here’s the abstract for my session:

SQL Saturday 652

How to move a ton of data from the Mainframe to the Cloud (with Biml).

So, you need to move data from 75 tables on the mainframe to new tables in SQL Azure. Do you: a) hand code one package to load all 75 tables, b) hand code 75 packages that move a table each, or c) wish there was a better way?
There is! Business Intelligence Markup Language (Biml) can automate the creation of packages so that they all follow the same script. In this session, we will create some simple metadata to be able to generate multiple packages and their associated connection managers. You will see Biml in action. You will see the XML that Biml uses to define packages and connections. You will see the C# code that Biml uses to fetch metadata and dynamically generate packages in SSIS. And you will see packages and connection managers generated from Biml before your very eyes.

SQL Saturday!!!

Yay!!! I’ll be speaking about Biml at SQL Saturday in Chattanooga in June!
I’m very excited. In typical fashion, I will probably cram too much into the hour, but I’ll try not to. Here’s the abstract for my session.

SQL Saturday

Using Biml to Automate the Generation of SSIS Packages

So, you need to move data from 75 tables on the mainframe to new tables in SQL Azure. Do you: a) hand code one package to load all 75 tables, b) hand code 75 packages that move a table each, or c) wish there was a better way?
There is! Business Intelligence Markup Language (Biml) can automate the creation of packages so that they all follow the same script. In this session, we will create some simple metadata to be able to generate multiple packages and their associated connection managers. You will see Biml in action. You will see the XML that Biml uses to define packages and connections. You will see the C# code that Biml uses to fetch metadata and dynamically generate packages in SSIS. And you will see packages and connection managers generated from Biml before your very eyes.

Biml Tutorial Part 3 – Adding Connection Managers to our Package Generator

Now that we can generate simple empty packages, the next step is to add connection managers to the project and package.  After that, we will add a DataFlow.  This article will focus on the Connections though, and we will start with a pair of standard OleDb Connections.  Recall the metadata tables from the last article:
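
The data model diagram isn’t reproduced here, so below is a minimal sketch of the connection-related metadata tables. The column names come from the INSERT statements later in this post; the data types, defaults, and foreign keys are my assumptions, and the Package table from the previous article is assumed to already exist:

CREATE TABLE Biml.ConnectionType(
    ConnectionTypeId INT IDENTITY NOT NULL PRIMARY KEY,
    ConnectionTypeName NVARCHAR(50) NOT NULL
);

CREATE TABLE Biml.Connection(
    ConnectionId INT IDENTITY NOT NULL PRIMARY KEY,
    ConnectionName NVARCHAR(200) NOT NULL,
    CreateInProject BIT NOT NULL DEFAULT 0,
    ConnectionTypeId INT NOT NULL REFERENCES Biml.ConnectionType(ConnectionTypeId),
    OLEDB_ConnectionString NVARCHAR(1000) NULL,
    OLEDB_DatabaseName NVARCHAR(128) NULL,
    ConnectionGuid UNIQUEIDENTIFIER NOT NULL DEFAULT NEWID()
);

CREATE TABLE Biml.PackageConnection(
    PackageId INT NOT NULL REFERENCES Biml.Package(PackageId),
    ConnectionId INT NOT NULL REFERENCES Biml.Connection(ConnectionId),
    IsSource BIT NOT NULL DEFAULT 0,
    IsTarget BIT NOT NULL DEFAULT 0,
    TableSchema NVARCHAR(128) NULL,
    TableName NVARCHAR(128) NULL,
    DirectInputSql NVARCHAR(MAX) NULL
);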

Setting up the Source and Target Databases

To insert the connection information, we’ll need to add a source, a target, the type (which is Ole Db) and then we’ll need to hook them up to the Package.
For the source, I’ll use the AdventureWorks2014 Database for SQL Server 2014.  Any DB will do though.
For the target, I’ve created an empty database called MySqlDatabase and added the Person table (from AdventureWorks2014) with the following SQL:

CREATE DATABASE MySqlDatabase;
GO
USE MySqlDatabase;
GO
CREATE TABLE dbo.[Person](
	[BusinessEntityID] [int] NOT NULL CONSTRAINT [PK_Person_BusinessEntityID] PRIMARY KEY CLUSTERED,
	[PersonType] [nchar](2) NOT NULL,
	[NameStyle] bit NOT NULL CONSTRAINT [DF_Person_NameStyle] DEFAULT 0,
	[Title] [nvarchar](8) NULL,
	[FirstName] [nvarchar](50) NOT NULL,
	[MiddleName] [nvarchar](50) NULL,
	[LastName] [nvarchar](50) NOT NULL,
	[Suffix] [nvarchar](10) NULL,
	[EmailPromotion] [int] NOT NULL CONSTRAINT [DF_Person_EmailPromotion] DEFAULT 0,
	[AdditionalContactInfo] [xml] NULL,
	[Demographics] [xml] NULL,
	[rowguid] [uniqueidentifier] ROWGUIDCOL  NOT NULL CONSTRAINT [DF_Person_rowguid] DEFAULT (newid()),
	[ModifiedDate] [datetime] NOT NULL CONSTRAINT [DF_Person_ModifiedDate] DEFAULT (getdate()))
GO

We need to fashion an OleDb Connection String for each database, like so:

Source:  Data Source=.;Initial Catalog=AdventureWorks2014;Provider=SQLNCLI11.1;Integrated Security=SSPI;Auto Translate=False;
Target:  Data Source=.;Initial Catalog=MySqlDatabase;Provider=SQLNCLI11.1;Integrated Security=SSPI;Auto Translate=False;

Then we can insert the connection records, starting by adding a Connection Type of OleDb.

INSERT INTO Biml.ConnectionType ([ConnectionTypeName]) VALUES ('OleDb');
INSERT INTO Biml.Connection ([ConnectionName], [CreateInProject], [ConnectionTypeId],
    [OLEDB_ConnectionString],
    [OLEDB_DatabaseName], [ConnectionGuid])
VALUES ('My First Source', 1, (SELECT ConnectionTypeId FROM Biml.ConnectionType WHERE ConnectionTypeName = 'OleDb'),
    'Data Source=.;Initial Catalog=AdventureWorks2014;Provider=SQLNCLI11.1;Integrated Security=SSPI;Auto Translate=False;',
    'AdventureWorks2014', NEWID());
INSERT INTO Biml.Connection ([ConnectionName], [CreateInProject], [ConnectionTypeId],
    [OLEDB_ConnectionString],
    [OLEDB_DatabaseName], [ConnectionGuid])
VALUES ('My First Target', 1, (SELECT ConnectionTypeId FROM Biml.ConnectionType WHERE ConnectionTypeName = 'OleDb'),
    'Data Source=.;Initial Catalog=MySqlDatabase;Provider=SQLNCLI11.1;Integrated Security=SSPI;Auto Translate=False;',
    'MySqlDatabase', NEWID());

Now that we have the connection strings in the database, we need to hook them up to the package.  In the PackageConnection table, we also need to provide the TableSchema and TableName or SQL Statement.  In this case, we’ll pull from Person.Person.  Here are the SQL statements to insert the required rows into the PackageConnection table:

INSERT INTO Biml.PackageConnection (PackageId, ConnectionId,
    IsSource, IsTarget, TableSchema, TableName, DirectInputSql)
VALUES ((SELECT PackageId FROM Biml.Package WHERE PackageName = 'My First Biml Package'),
    (SELECT c.ConnectionId FROM Biml.Connection c WHERE c.ConnectionName = 'My First Source'),
    1,0,'Person','Person',NULL);
INSERT INTO Biml.PackageConnection (PackageId, ConnectionId,
    IsSource, IsTarget, TableSchema, TableName, DirectInputSql)
VALUES ((SELECT PackageId FROM Biml.Package WHERE PackageName = 'My First Biml Package'),
    (SELECT c.ConnectionId FROM Biml.Connection c WHERE c.ConnectionName = 'My First Target'),
    0,1,'dbo','Person',NULL);

Configuring the Biml file for Connections

Because we are using the OLEDB drivers, and have tables in the source and target that have the same structure, we do not need to map the fields.  Biml and SSIS will automatically map those.
So, we have populated the metadata. Now, we need to modify the Biml so that it includes the Connection information. In the structure of a Biml file, the connections are fully described in the root-level <Connections></Connections> section, and are then referenced inside each package’s <Connections></Connections> section. Here’s what this would look like if we hard-coded the connections in the Biml file:

<Biml xmlns="http://schemas.varigence.com/biml.xsd">

	<!--Root level Connections-->
    <Connections>
        <Connection Name="My First Source"
    		ConnectionString="Provider=SQLNCLI11;Server=.;Database=AdventureWorks2014;Trusted_Connection=yes;"
			CreateInProject="true"
			CreatePackageConfiguration="false" RetainSameConnection="true" />
        <Connection Name="My First Target"
    		ConnectionString="Provider=SQLNCLI11;Server=.;Database=MySqlDatabase;Trusted_Connection=yes;"
			CreateInProject="true"
			CreatePackageConfiguration="false" RetainSameConnection="true" />
    </Connections>

    <Packages>
        <Package Name="My First Biml Package" ConstraintMode="Parallel"
			ProtectionLevel="EncryptSensitiveWithUserKey">
			
			<!--Package level Connection references-->
            <Connections>
                <Connection ConnectionName="My First Source" />
                <Connection ConnectionName="My First Target" />
             </Connections>
             <Tasks>
                 <Container Name="SEQC - Main Control Flow" ConstraintMode="Linear">
                     <Variables></Variables>
                     <Tasks>
                         <Expression Name="Placeholder" Expression="1" />
                     </Tasks>
                 </Container>
             </Tasks>
        </Package>
    </Packages>
</Biml>

Now that we’ve looked at what the Biml code would look like if we hard coded everything, let’s get back to the version of the code we had at the end of the last article:

<#@ template language="C#" hostspecific="true" tier="0"#>
<#@ import namespace="System" #>
<#@ import namespace="System.Data" #>

<Biml xmlns="http://schemas.varigence.com/biml.xsd">
    <# var bimlConfigConnection = (AstDbConnectionNode)RootNode.Connections["MainBimlConnection"]; #>
    <Packages>
        <# var packagesToCreate = ExternalDataAccess.GetDataTable(bimlConfigConnection.ConnectionString, 
        "SELECT PackageId, PackageName, ConstraintMode, ProtectionLevel, GeneratePackage FROM Biml.Package p "
        + " WHERE GeneratePackage = 1 ORDER BY p.PackageName"); 
        foreach(DataRow package in packagesToCreate.Rows) { 
        #>
            <!-- Build Package Here -->
            <Package Name="<#=package["PackageName"] #>" ConstraintMode="<#=package["ConstraintMode"] #>" 
                     ProtectionLevel="<#=package["ProtectionLevel"] #>">
                <Tasks>
    		    <Container Name="SEQC - Main Control Flow" ConstraintMode="Linear">
                        <Variables></Variables>
            	        <Tasks>
            	            <Expression Name="Placeholder" Expression="1" />
                        </Tasks>
                    </Container> 
                </Tasks>
            </Package>
        <# } #>
    </Packages>
</Biml>

Adding the Dynamic Connections to the Package

Root Level Connections

With this as the starting point, we’ll need to add a SQL Statement and a foreach() loop to pull in the connection info at the root level, like so:

<#@ template language="C#" hostspecific="true" tier="0"#>
<#@ import namespace="System" #>
<#@ import namespace="System.Data" #>

<Biml xmlns="http://schemas.varigence.com/biml.xsd">
    <# var bimlConfigConnection = (AstDbConnectionNode)RootNode.Connections["MainBimlConnection"]; #>
    <Connections>
	<# var connectionsToCreate = ExternalDataAccess.GetDataTable(bimlConfigConnection.ConnectionString, 
        "SELECT ConnectionId, ConnectionName, OLEDB_ConnectionString, CreateInProject FROM Biml.Connection Where ConnectionTypeId = 1 order by ConnectionName"); 
        foreach(DataRow connection in connectionsToCreate.Rows) { #>
           <Connection Name="<#=connection["ConnectionName"] #>" ConnectionString="<#=connection["OLEDB_ConnectionString"] #>" 
		   CreateInProject="<#=connection["CreateInProject"] #>" CreatePackageConfiguration="false" 
		   RetainSameConnection="true" />
        <# } #>
    </Connections> 

Package Level Connection References

We also need a SQL statement and a foreach() loop to pull in the connection info at the package level. Here we’ll use the PackageId from the Package loop as a criterion in the WHERE clause:

<Package Name="<#=package["PackageName"] #>" ConstraintMode="<#=package["ConstraintMode"] #>" 
    ProtectionLevel="<#=package["ProtectionLevel"] #>">
    <Connections>
        <# connectionsToCreate = ExternalDataAccess.GetDataTable(bimlConfigConnection,
            "SELECT mbc.ConnectionName, CreateInProject, ConnectionGuid "
            + " FROM Biml.Connection mbc"
            + " INNER JOIN Biml.PackageConnection mbpc ON mbc.ConnectionId = mbpc.ConnectionId"
            + " INNER JOIN Biml.Package mbp ON mbpc.PackageId = mbp.PackageId"
            + " WHERE mbp.GeneratePackage = 1 and mbpc.PackageId = " + package["PackageId"].ToString()
            + " ORDER BY mbc.ConnectionTypeId;");
        foreach(DataRow connection in connectionsToCreate.Rows) { 
        #>
            <Connection ConnectionName="<#=connection["ConnectionName"] #>" />
        <# } #>
    </Connections>

Gotcha! #1

Now, here’s a nice little Gotcha! In SQL Server, in the Biml.Connection table, we store the CreateInProject flag as a bit. Biml rightly converts this to a bool, with values of True or False. However, the CreateInProject attribute requires values of true or false, with lowercase first letters. To correct this, we need to add a function that converts the bool value into a string that is either “true” or “false”. This is accomplished by adding a new Biml file, let’s call it -code.biml, and then creating a function in that file:

<#+ 
string FixBool(object inBool)
{
    // Convert the bit/bool metadata value into the lowercase "true"/"false" Biml expects.
    if (inBool == null || inBool.ToString() == "False" || inBool.ToString() == "0")
    {
        return "false";
    }
    else
    {
        return "true";
    }
}
#>

Notice the <#+ and #> markers at the top and bottom of the function. They allow Biml to use this file as an include file. In the main Biml file, we need to add a line:

<#@ include file="-code.biml" #>

And now we can wrap the CreateInProject flag inside the FixBool() function, like so:

   foreach(DataRow connection in connectionsToCreate.Rows) { #>
		<Connection Name="<#=connection["ConnectionName"] #>" ConnectionString="<#=connection["OLEDB_ConnectionString"] #>" 
		CreateInProject="<#=FixBool((bool)connection["CreateInProject"]) #>" 
		CreatePackageConfiguration="false" RetainSameConnection="true" />
	<# } #>
</Connections>

Testing it all: Generating the package.

Let’s test it all.

  1. Shift-click to select the MainBimlConnection.biml file and the My Second Package.biml file.
  2. Right-click and Check Biml for Errors. Fix any errors.
  3. Right-click and Generate SSIS Packages.

If everything went well, you should now have two new Package Level connection managers and a new SSIS Package.
So, it looks like everything worked.

Gotcha! #2

There is one last Gotcha! though. This one has to do with the Guids of the Connection Managers. If three different developers generate the packages you’ve defined here, they will get three different Guids for each connection manager.
To fix this, we need everyone to use connection managers with the same Guid.
Which Guid should it be? It should be the one in the Connection table.
How do I change it? That field is grayed out. Open the connection manager in Code view instead of Design view and you can paste in the Guid that was used in the Connection table.
[Image: Changing the Guid for a Connection Manager]
Once you’ve changed the connection manager’s Guid to match the one in the Connection table, we need to tell Biml to use that Guid when it’s generating the package, like so:

        foreach(DataRow connection in connectionsToCreate.Rows) { 
        #>
            <Connection ConnectionName="<#=connection["ConnectionName"] #>" Id="<#=connection["ConnectionGuid"] #>" />
        <# } #>
    </Connections>

Then regenerate the packages. Don’t let it regenerate the connection managers though. You can prevent this by unchecking the connection managers in the confirmation box:
[Image: Biml Confirmation Dialog]
As you increase the number of packages, you will get a lot more connection managers generated. The easiest way I have found to keep it from regenerating them is to check the ones you want to keep into Source Code Management (TFS, Git, etc.), let Biml generate the new connection managers, and then Undo Pending Changes so that you revert to the saved, checked-in version of the connection managers.

In Conclusion

Putting it all together, we get this Biml file:

<#@ template language="C#" hostspecific="true" tier="0"#>
<#@ import namespace="System" #>
<#@ import namespace="System.Data" #>
<#@ include file="-code.biml" #>

<Biml xmlns="http://schemas.varigence.com/biml.xsd">
    <# var bimlConfigConnection = (AstDbConnectionNode)RootNode.Connections["MainBimlConnection"]; #>
    
	<Connections>
	    <# var connectionsToCreate = ExternalDataAccess.GetDataTable(bimlConfigConnection.ConnectionString, 
        "SELECT ConnectionId, ConnectionName, OLEDB_ConnectionString, CreateInProject FROM Biml.Connection Where ConnectionTypeId = 1 order by ConnectionName"); 
       foreach(DataRow connection in connectionsToCreate.Rows) { #>
            <Connection Name="<#=connection["ConnectionName"] #>" ConnectionString="<#=connection["OLEDB_ConnectionString"] #>" 
            CreateInProject="<#=FixBool((bool)connection["CreateInProject"]) #>" 
            CreatePackageConfiguration="false" RetainSameConnection="true" />
		<# } #>
    </Connections>    
    
    <Packages>
        
        <# var packagesToCreate = ExternalDataAccess.GetDataTable(bimlConfigConnection.ConnectionString, 
        "SELECT PackageId, PackageName, ConstraintMode, ProtectionLevel, GeneratePackage FROM Biml.Package p "
        + " WHERE GeneratePackage = 1 ORDER BY p.PackageName"); 
       foreach(DataRow package in packagesToCreate.Rows) { 
            #>
            
        	<!-- Build Package Here -->
        	<Package Name="<#=package["PackageName"] #>" ConstraintMode="<#=package["ConstraintMode"] #>" 
        	        ProtectionLevel="<#=package["ProtectionLevel"] #>">
        	        
    	        <Connections>
            		<# connectionsToCreate = ExternalDataAccess.GetDataTable(bimlConfigConnection,
            		   "SELECT mbc.ConnectionName, CreateInProject, ConnectionGuid "
            			+ " FROM Biml.Connection mbc"
            			+ " INNER JOIN Biml.PackageConnection mbpc ON mbc.ConnectionId = mbpc.ConnectionId"
            			+ " INNER JOIN Biml.Package mbp ON mbpc.PackageId = mbp.PackageId"
            			+ " WHERE mbp.GeneratePackage = 1 and mbpc.PackageId = " + package["PackageId"].ToString()
            			+ " ORDER BY mbc.ConnectionTypeId;");
                   foreach(DataRow connection in connectionsToCreate.Rows) { 
            		   #>
                    <Connection ConnectionName="<#=connection["ConnectionName"] #>" Id="<#=connection["ConnectionGuid"] #>" />
            		<# } #>
                 </Connections>
        	        
                <Tasks>
            		<Container Name="SEQC - Main Control Flow" ConstraintMode="Linear">
            			<Variables></Variables>
            			<Tasks>
            				<Expression Name="Placeholder" Expression="1" />
            			</Tasks>
                    </Container> 
                </Tasks>
            </Package>
        <# } #>
    </Packages>
</Biml>

And it will generate this package (notice the Connection Managers added at the bottom):
[Image: Generated Package - After Adding Connections]


Biml Tutorial 2 – Creating a Simple Package Generator

So I spent yesterday morning entering the metadata to describe some 70 or so packages that are all part of a big data load that moves data off the mainframe (DB2) into SQL Azure. Occasionally, I would build the packages with BimlExpress to ensure I hadn’t introduced bad metadata. Today, I tweaked the Biml to allow for column-level mappings between source and target tables. Almost all of the tables are being staged into SQL Azure in the same format that they appear in DB2, so this was not necessary for those packages, as Biml and SSIS will sort out the mappings automatically.

Before I get into the advanced material though, I need to finish the introduction that I started before the new job and the move to Tennessee.

♦  ♦  ♦

 In the last post, I wrote about the metadata we need so that we can create ETL Packages.

We’re going to need at least the following tables:

  • Package
  • Connection
  • PackageConnection
  • ConnectionType

Additionally, Packages can have a Parent Package.

A Data Model Diagram might look like this:

With these tables, we can:

  • Define a Package to be Generated
  • Store the connection strings we need for our Development connections
  • Identify which source and target connections to use
  • Identify the Type of a Connection so we generate the right Source and Target components
  • Choose which packages to generate
  • Identify Parent-Child package relationships
  • Select whether a connection is at the Package or Project level

Defining a Package in MetaData

By populating the Package table, we can define a package to be generated.

Package
PackageId       | int           | Identity, Primary Key, Not Null
PackageName     | nvarchar(200) | Not Null
ConstraintMode  | nvarchar(10)  | Not Null, Default: Parallel
ProtectionLevel | nvarchar(50)  | Not Null, Default: EncryptSensitiveWithUserKey
ParentPackageId | int           | Nullable, Foreign Key to Package.PackageId
GeneratePackage | bit           | Not Null, Default: 1
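
A minimal DDL sketch of that table (the Biml schema is assumed to already exist, and any constraint details beyond those listed above are assumptions):

CREATE TABLE Biml.Package(
    PackageId INT IDENTITY NOT NULL PRIMARY KEY,
    PackageName NVARCHAR(200) NOT NULL,
    ConstraintMode NVARCHAR(10) NOT NULL DEFAULT 'Parallel',
    ProtectionLevel NVARCHAR(50) NOT NULL DEFAULT 'EncryptSensitiveWithUserKey',
    ParentPackageId INT NULL REFERENCES Biml.Package(PackageId),
    GeneratePackage BIT NOT NULL DEFAULT 1
);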

Inserting a row into the table can be as simple as:

INSERT INTO Biml.Package (PackageName) VALUES ('My First Package');

With the table populated, we can now look at the Biml Code to Generate a Package from the MetaData defined so far.

To get started:

  1. Make sure you have Visual Studio 2015 and the latest SSDT, and that BIML Express is installed.
  2. Create a new Integration Services project named “My First Biml Project”
  3. Delete the Package1.dtsx that is created for you
  4. Right-click on the Project and select Add New Biml File
  5. Right-click again to add a second Biml file and rename the file to be: “My First Biml”

Now that we have two blank Biml files, we need to set up a connection to the Biml metadata tables.

In the Main Biml Connection file, enter this code:

<Biml xmlns="http://schemas.varigence.com/biml.xsd">
    <Connections>
        <Connection Name="MainBimlConnection" ConnectionString="Data Source=MySqlServerInstance;Initial Catalog=BimlMetaDataDB;Provider=SQLNCLI11.1;Integrated Security=SSPI;" />
    </Connections>
</Biml>

That’s it. This file is done and won’t need to be changed unless you move your BimlMetaData database to a new server.

The second file will contain a bit more code.

We’ll start by declaring our code language as C# and importing a couple of namespaces:

<#@ template language="C#" hostspecific="true" tier="0" #>
<#@ import namespace="System" #>
<#@ import namespace="System.Data" #>

Next, comes the Root Biml tag:

<Biml xmlns="http://schemas.varigence.com/biml.xsd">

Now, we need to use the connection we defined in the other file:

<# var bimlConfigConnection = (AstDbConnectionNode)RootNode.Connections["MainBimlConnection"]; #>

Then we’ll start the Packages tag:

<Packages>

Now, we pull in the records from the Package table to define our package(s) and close the Packages tag:

<# var packagesToCreate = ExternalDataAccess.GetDataTable(bimlConfigConnection.ConnectionString,
    "SELECT PackageId, PackageName, ConstraintMode, ProtectionLevel, GeneratePackage FROM Biml.Package p WHERE GeneratePackage = 1 ORDER BY p.PackageName");
   foreach(DataRow package in packagesToCreate.Rows) { #>
        <!-- Build Package Here -->
    <# } #>
</Packages>

Where we put the comment, we need to put the code that builds the package itself, such as:

<Package Name="<#=package["PackageName"] #>" ConstraintMode="<#=package["ConstraintMode"] #>" ProtectionLevel="<#=package["ProtectionLevel"] #>">
    <Tasks>
        <Container Name="SEQC - Main Control Flow" ConstraintMode="Linear">
            <Variables></Variables>
            <Tasks>
                <Expression Name="Placeholder" Expression="1" />
             </Tasks>
         </Container>
    </Tasks>
</Package>

The final Biml file should look like this:

[Image: First Package Code 1]

Now, to generate the package.  First, select both Biml files (shift-click).  Then right-click and choose “Check Biml For Errors”:

[Image: Check for Errors]

If there are any errors, read the error and check the Biml against the sample.

If there are no errors, then right-click again and choose “Generate SSIS Packages”:

[Image: Generate Package]

In a few seconds, the Biml engine should generate the package we defined:

[Image: Generated Package]

That’s it for now.  I’ll be back on as soon as I can to add connections to our package.
