Showing posts with label Introduction to SSIS. Show all posts
Showing posts with label Introduction to SSIS. Show all posts

Sunday, 15 July 2012

SSIS : Event Handlers

SSIS is not just about control flows and dataflows, there are other options available in SSIS which extends functionality of SSIS package, and provides a great flexibility in desiging a successfull ETL solution. Lets take a deep insight of what are those options and how we can take benefit of them.

There are four options which extends functionality of SSIS packages.

1. Event Handlers
2. Variables
3. Expressions
4. SQL Queries

In this post i will be covering Event handlers, and in later series i will cover Variables, Expressions and SQL queries


Event Handlers:

Like any programming languages (C# or JAVA) SSIS also provides Event handlers, which provides functionality of performing some task on a specific event during runtime. At runtime of any SSIS task, series of event takes place, and SSIS Desginer has provided with one tab (Event Handler) to program any action that can be performed on occurance of that event.

Event handlers are created in the same manner as control flow but in different tab as shown.



How we create events?

Its as simple as creating control flow of package.

1. Select the executable container to which the handler will be assigned.
2. Select the event to which you wish the event handler to react to.
3. Drag control flow container and task and connect them together with precedence constraints.

How Events are actually gets handled in SSIS?

Events can be handled at task or container or package level, following is the order it will be handled.

1. Task
2. Container
3. Package

For example:  If an event is triggered  at the task level and no event handler is defined, the event is passed to container level, if no event handler is defined at container level then it will be passed to package level.

How many events are available?

There are 12 different events that can be handled in SSIS package. Below is the list..

OnError The OnError event is caused by the occurrence of an error. 

OnExecStatusChanged The OnExecStatusChanged event occurs when an executable changes its status. 

OnInformation When an executable reports information during validation or execution, the OnInformation event occurs. 

OnPostExecute OnPostExecute is an event that occurs immediately after an executable has completed running. 

OnPostValidate
The OnPostValidate event occurs immediately after an executable has finished its validation. 

OnPreExecute The OnPreExecute event occurs immediately before an executable runs. 

OnPreValidate The OnPreValidate event occurs immediately before an executable begins its validation. 

OnProgress 
OnProgress is an event that occurs when an executable makes measurable progress. 


OnQueryCancel The OnQueryCancel event occurs automatically at regular intervals to determine if the package should continue running. 

OnTaskFailed When a task has failed execution, the OnTaskFailed event occurs.

OnVariableValueChanged
The OnVariableValueChanged event occurs when a variable's value is changed. 


OnWarning The OnWarning event is raised when a warning occurs. 

Event handlers have a number of properties that allow you to 

• assign a name and description to the event handler
• enable or disable the event handler
• determine whether the package fails if the event handler fails
• determine the number of errors that can occur before the event handler fails
• override the execution result that would normally be returned at runtime
• determine the transaction isolation level for operations performed by the event handler, and
• determine the logging mode used by the event handler



SSIS : Connection Managers



Connection managers represent a layer of abstraction between the SSIS package runtime environment and the data sources. Connection managers contain the data source connection string and other related properties. At package execution time the connection managers manage the physical connectivity to data sources and destinations. Multiple tasks can share the same connection manager. In fact, multiple tasks can use the same connection to the database. On the other hand, you could also create a separate connection manager to force each task to use a dedicated connection.

You can define one or multiple data sources within your SSIS project. You could create connection managers without any dependency to the data source. However, if you define a project level data source, you can re-use the same source for multiple packages within your project ("project" and "solution" terms can be used interchangeably within BIDS). After you define a data source within your project, you can also create a data source view. A data source view is a layer of abstraction between the relational database and the row sets that you can make available for your transformations. For example, you can define a named query (similar to a relational database view) to join multiple tables, filter the output with the WHERE clause or define new columns based on the existing columns. Similarly, you could add named calculations to your tables, combining the first, middle and last names into a single column called "full_name". You are also allowed to add a friendly name to the tables, so that instead of calling a table "AdventureWorks.dbo.warehouse_sales" you can simply reference it as "sales". Why couldn't you define the similar views and aliases in the relational database? Well, sometimes you could have access to the relational database through an OLEDB driver, but you might NOT have permissions to create and alter database objects. This is when data source views are particularly helpful. 



List of Connection Managers

S No
Name
Short Description
1
ADOUsed to connect to ActiveX Data Objects (ADO) objects, like a recordset.
2
ADO.NETUsed to access data sources by using a .NET provider and Microsoft SQL Server and XML
3
CacheUsed to read data from the cached server or from a cache file(.caw) so that the data is stored in the memory.
4
ExcelUsed to connect to an existing Microsoft Excel workbook file for both Source and Destination processing of package
5
FileUsed to connect to an existing file or to create a new file and use as a source or a destination.
6
FlatFileUsed to connect to a Flat file which acts like a source or a destination for the package to access and process the data across the platform.
7
FTPUsed to connect to a File Transfer Protocol Server to fetch the data or to update the data to the server.
8
HTTPUsed to access the webserver using the Hyper text transfer protocol to send and receive files across the servers
9
MSMQUsed to connect to the Microsoft Message queuing server to access the messages as a source or to update the message as a destination
10
MultiFileUsed to reference to the existing file or folders or to create a new file and use it as a reference at runtime.
11
MultiFlatFileUsed to access the file using the flat file as multiple data source like using inside a loop container to loop through the file and access the data
12
OLEDBUsed to connect to the different data source using the OLEDB provider specifically used to connect to Microsoft SQL Server.
13
ODBCUsed to connect to different relational database system using the open connectivity provider
14
SMOServerUsed to connect to a SQL Management server objects to access for as a source or to update as a destination
15
SMTPUsed to connect to a Simple Mail transfer Protocol server to access and send mail or to receive mails
16
SQL CompactUsed to connect to SQL Server Compact database for light weight accessing of the server.
17
WMIUsed to connect to the Windows Management Instrumentation (WMI) in order to connect to the enterprise server for management.






SSIS - Data Flow Task


Data Flow Task that literally means a task where DATA flows. That’s exactly what ETL does, Extract Transform & Load of data. Most of the ETL is done in the Data Flow Task.
A Data Flow Task has to be a part of a ‘Control Flow’. One control flow can have a number of DFTs in it, as the Control Flow is the place where we can decide the ways the multiple DFTs will be placed in our solution.
Each DFT has to have a Source, may or may not have a Transform, may or may not have a Destination. However a DFT needs to have at least a transform or Destination followed by the Source. Source cannot just get data from somewhere and do nothing with it. If that’s the case, DFT will fail.
Ideally, DFT has a Source, Transform (optional) and a Destination. Source is used to extract the data, Transform is used to transform the data whileDestination is used to Load the data to a destination. That's how ETL is implemented in DFT.
The way the Control Flow contains tasks, DFT contains Source,Transforms and destinations and they are not called Tasks as many people call them. Let list down a few examples from each:

Source:
  • OLEDB Source
  • Flat File Source
  • XML Source
  • Excel Source
  • ADO NETSource
Transformations:
  • Aggregate
  • Conditional Split
  • Derived Column
  • Fuzzy Lookup
  • Merge
  • Merge Join
  • Look up
  • Row Count
  • Sort
  • Union All
  • Script Component
  • OLEDB Command
Destinations:
  • OLEDB Destination
  • Flat File Destination
  • Excel Destination
  • Recordset Destination
  • Raw File Destination
As you can see from the list above, there are many sources & transforms, which can help you to achieve your requirement. I have selected just a few elements of DFT. There are others as well. We would talk about DFT and parallel execution later in the advanced section where we talk about the SSIS Execution tree.
In the figure below, you can have a look at how a DFT looks once we have various Sources, Transforms and Destinations attached. You would notice the Red circle on each transform, it is because they are not yet configured and SSIS is showing that this will error out if executed now.


Just remember the way we connect the tasks in Control Flow with the use of precedence constraints. Similarly, we use arrows in DFT but they are not termed as precedence constraints. Here they are available in just two colours: Green for valid records and Red in case the data is incorrect as per our setup.
Unlike Control Flow in Data Flow, the various transforms cannot have any number of inputs or outputs from them. They are predefined barring the Multicast Transform which can have any n number of outputs configured. OLEDB Source takes no input and gives one success (green) and one error output (red), we have similar rule for each source, transform and destination.








SSIS - Control Flow


In Control Flow, we set up the flow of jobs in an ETL process, like fetching the files from ftp location to local and then make the file ready for ETL or at times, we clear the staging tables before the ETL process. All such tasks are done before ETL commences.
Once the ETL completes, we use Control Flow for tasks like sending mail updates, achieving files, etc.
Control Flow primarily contains various tasks and precedence constraints, which connects these tasks in a logical sequence. This creates the flow of control among the tasks.


Few of the various tasks available in Control Flow are:
  • For Loop For Each loop Script Task
  • Send Mail Task Execute SQL Task Data
  • Flow Task FTP Task ActiveX Script
  • Task Execute DTS 2000 Package Task
  • Execute Package Task Execute Process
  • Task XML Task
  • Transfer Database Task
........ etc.
In most of the cases, the names are self-explanatory. For others we will discuss in details.
A meaningful package would contain at least one task. In most of the cases, we have a number of the above tasks in a package with a particular task used more than once based on the requirement. Now we can have the tasks linked to each other or each task independent of the other or a group of tasks that need to follow a sequence and other group of tasks need to follow a sequence. However, these 2 groups are mutually exclusive and are not dependent on each other. In such cases, we need to join the 2 groups of tasks.
Single Flow in a package
Multiple Groups, which run in parallel

Precedence Constraint
Now how do we group create a flow between two tasks or link two Tasks?
When you drag a particular task from the Tasks pane onto the designer, you will see a GREEN arrow coming from the task. All you need to do is drag this arrow and connect to another task on the designer. Therefore, the Flow would be from the task, which had the arrow initially to the Task to which it is connected. This Green arrow is called the PRECEDENCE CONSTRAINT and this comes very handy while designing packages.
One Task could link with multiple tasks.
Similarly, multiple tasks could lead to one task.
Precedence constraint need not always be green. It can have colours as Green, Blue or Red. It need not always be a continuous line. It could also be a dotted line. We will talk about all this later in detail.





Getting Started with SSIS : BIDS Overview - SQL Server 2008 Installation

BIDS stands for Business Intelligence Development Studio.

"Business Intelligence Development Studio is Microsoft Visual Studio 2008 with additional project types that are specific to SQL Server business intelligence. Business Intelligence Development Studio is the primary environment that you will use to develop business solutions that include Analysis Services, Integration Services, and Reporting Services projects. Each project type supplies templates for creating the objects required for Business Intelligence Solutions, and provides a variety of designers, tools, and wizards to work with the objects."


Why do I need it?
BIDS facilitates the entire development of solutions mentioned above. All you need to know is the business requirement and the basics of BI suite. The development need is highly reduced as a developer’s job can be summarized by DRAG, DROP, LINK & CONFIGURE the various tasks or transforms, as may be the case (I will elaborate them later). Talking specifically about SSIS, BIDS is the only way where you develop and test the SSIS package. SSIS package is nothing but a complex XML file. To make this xml file (package) and to modify the complex XML, we need BIDS. BIDS has lots of other features as well which we will be covering in the next chapter.

Where can I find BIDS?
You can get BIDS while installing MS SQL Server on your machine. While SQL Server installation on your machine follow the below steps and you will have BIDS on your machine.



Put a valid product key for your SQL server.
Accept the license terms

Setup the role as per your need

While selecting the features select SQL Server Integration Services (You do not see here as it is already installed on my machine) and then from the shared services select the Business Intelligence Development Studio.


After this step, you could proceed with your SQL Server installation. Once installed, you could start BIDS from Start->MS SQL Server -> SQL Server Business Intelligence Studio
When you open BIDS, it appears just like Visual Studios.


The difference comes when you try creating a new project (File->New->Project).


Here you will find the Business Intelligence Projects on the left pane and the various templates for MS BI development available. As of now, we would be concentrating on Integration Services Projects.






Saturday, 14 July 2012

Getting started with SSIS - Introduction to SSIS


SSIS is a tool used for ETL. Before we talk on SSIS, let me walk you through what is ETL. ETL stands for Extract Transform and Load. These are simple day-to-day words we use in our daily lives. The figure below depicts ETL in real world scenario

E - Extract data from various homogeneous or non-homogeneous source systems. Data could be stored in any of the following forms though not limited to them: Flat file, Database, XML, Web queries etc. When we can have sources of such variety, the job of extraction is to fetch the data from these sources and make it available for the next step.

T - Transform Data: As already discussed that the data are coming from various sources and we cannot assume that the data is structured in the same way across all the sources. Therefore, we need to transform the data to a common format so that the other transformations can be done on them. Once we have the data we need to perform various activities like:
  • Data cleansing
  • Mandatory check
  • Data type check
  • Check for foreign key constraints
  • Check for business rules and apply business rules
  • Creation of surrogate keys
  • Sorting the data
  • Aggregating the data
  • Transposing the data
  • Trim the data to remove blanks.
The list can go on as the business requirements get complex day by day and hence the transformations get complex. While transformations are on, we need to log the anomalies in data for reporting and corrective action to be taken.

L Load Data: Once the transformations are done and the data takes the form as per the requirement, we have to load the data to the destination systems. The destinations can also be as varied as the sources. Once the data reaches the destination, it is consumed by other systems, which either stores it as historical data, generate reports out of it, build modes to take business decisions etc.

SSIS stands for SQL Server Integration Services. Microsoft introduced Business Intelligence Suite, which includes SSIS, SSAS (SQL Server Analysis Server) and SSRS (SQL Server Reporting Services).

Now what’s this Business Intelligence (BI)? Let me take some time to explain that. As the name suggests, it helps Business run across the globe. It provides the business with data and ways to look into the data and make business decisions to improve the business.

So, how do the 3 products work in the BI world or how are they organized? To start any business analysis we need data, and as I explained earlier, ETL would be used here to get the data from varied sources and put the data to tables or create Cubes for data warehouse. To do this, we make use of SSIS.

Once we have the data with us, SSAS comes into picture to organize the data and store them to cubes.

Next, we need to report the data so that it makes sense to the end user. This is where SSRS comes into picture for report generation.

The order of SSIS and SSAS could change, as both can come first. Having said this, SSIS makes the backbone of this entire domain as all the data is assembled using SSIS.
Now we dive in to what exactly SSIS is. Is it another coding language? The answer to the question is NO. SSIS is not another coding language. In fact very little coding is required in SSIS that too in very few cases. SSIS is a tool used for ETL. It is based on the paradigm shift where we need to focus more onto the business requirement and less on the actual coding to achieve the goals.

SSIS is a visual tool with drag and drop feature, which enables us to create SSIS, packages to perform ETL in a very short amount of time. The development time is widely reduced as compared to legacy systems where each aspect of ETL had to be coded and then of course, tested. Once the package is created, the visual look is good enough to give us an idea of what the ETL is doing.

SSIS provides us many tasks and transforms which help us build our ETL packages. A few examples would be Execute SQL Task which enables us to execute any T-SQL script from SSIS, Send Mail Task which enables us to send mail to any recipient. Likewise, there are lots of tasks and transforms available in SSIS to take care of most of the ETL scenarios that you can think of.

Note: I will explain what is Tasks and transforms in later chapters.

However, what if I have a scenario that cannot be accommodated in the existing tasks and transforms? SSIS has an answer to that too. In that case you can use the entire .NET library and code in either C# or VB to achieve your requirement. Though such scenarios would be very few and even if required will be of very basic nature where you need to know the basics of any programming language to get going. Let’s say that you are not satisfied with the logging provided by SSIS, you can always go ahead and write code to log the errors or activities the way you choose. Taking this a step further, if you want this new type of logging in many packages, you could convert your code to create a custom component and add it to all SSIS packages. It will now work as any other task or transform for you. Is that not great!!!

Note: C# is available only in SSIS 2008 and not 2005.
Let us peep into a simple SSIS package and get to know how SSIS looks:


This is how an SSIS package looks and feels. I will get into the details of what you see above in the coming chapters.