Tuesday, February 14, 2012

Data flow task multiple destinations

Hi,

The further I get with my current SSIS package, the more I am starting to wonder about best practices and performance.

My current package loops through CSV files in a specified location and extracts events from these files. Each file contains multiple events, which are a mixture of different types. Depending on the event, there is a different number of comma-separated values. In the package I first read each event as a single column with a comma delimiter. I then split the event on the delimiter into an array. In a script I weed out all the elements of the array that are common to all events and put the remaining values into another array.

After some processing I come to my conditional split transformation, which splits the processing of each event based on the EventID. This is where I'm having doubts about whether I have approached the package correctly. There are approximately 60 different events, so each one of these has a separate pipeline to process the remaining parameters in the array and output them to the destination table. The destination table is different for each ID. Is it viable to have this number of conditions and paths when creating the package, and is this likely to have any detrimental effect on performance? Is there possibly another way that I could approach this problem?
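To give an idea of the splitting step, it is roughly along these lines (a simplified sketch, not my actual code; the column names, and the assumption that the first two fields are the common ones, are just for illustration):

    // Inside the script component's generated ScriptMain class.
    // Assumes one input column, EventLine, holding the whole CSV record.
    public override void Input0_ProcessInputRow(Input0Buffer Row)
    {
        // Split the single-column event on the comma delimiter.
        string[] allFields = Row.EventLine.Split(',');

        // Weed out the fields common to every event type (assumed here
        // to be the first two: the EventID and a timestamp).
        Row.EventID = int.Parse(allFields[0]);
        Row.EventTime = allFields[1];

        // Carry the remaining, event-specific parameters forward for the
        // conditional split to route on EventID.
        string[] parameters = new string[allFields.Length - 2];
        Array.Copy(allFields, 2, parameters, 0, parameters.Length);
        Row.Parameters = string.Join(",", parameters);
    }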

Many thanks, I hope that made sense.

Grant


Grant,

I have to be honest, it sounds like a very good approach. Importing as a single column and then splitting within the pipeline is definitely a good idea.

Granted, 60 data paths is a lot, but that does not mean the SSIS pipeline cannot handle them, so yes, I do think it is viable. In terms of performance, well yes, of course as the number of paths increases SSIS has more work to do, and this will add to the time taken to execute, but that is not the same as being detrimental to performance. You're not going to get away from an increase in execution time whatever your solution may be, regardless of whether you use SSIS or not. Whether the performance is acceptable or not is entirely down to your discretion, and the determining factors are:

Amount of data

Hi Jamie,

Many thanks for this. It has put things in perspective. I still believe this solution, in its current state, to be quicker than the existing stored procedure, which consists of a hell of a lot of nested if statements. The goal is to speed up this process, as we receive these CSV files daily (approx. 2,000 a day) and the current methods are showing limitations.

I'll hopefully get some throughput and performance stats once I have the package completed and would be happy to let you cast your eye over them.

On a slightly different note: I am aware that when I process the remaining parameters there may be times where the event row will cause errors. Instead of using a row redirect and setting up the same destination multiple times, would I be correct in saying that the redirects should point to a Union All transformation, which can then feed them all into one error destination table? Thinking about this, would it be possible to have multiple inputs into a script which standardises all the events so that they can be directed to an error table?

Thanks,

Grant

Grant Swan wrote:

Hi Jamie,

Many thanks for this. It has put things in perspective. I still believe this solution, in its current state, to be quicker than the existing stored procedure, which consists of a hell of a lot of nested if statements. The goal is to speed up this process, as we receive these CSV files daily (approx. 2,000 a day) and the current methods are showing limitations.

I'll hopefully get some throughput and performance stats once I have the package completed and would be happy to let you cast your eye over them.

If resources start to get squeezed on performance, then perhaps your conditional split could push EventIDs 1-30 into a raw file (e.g. with a condition like EventID <= 30) and EventIDs 31-60 into a second raw file. You can then process each "group" in isolation in separate data flows, thereby splitting out the processing. Variations on this theme (i.e. more groups) are of course an option :)

As always, the way to find the optimum solution is test and measure, test and measure, test and measure!

Grant Swan wrote:

On a slightly different note: I am aware that when I process the remaining parameters there may be times where the event row will cause errors. Instead of using a row redirect and setting up the same destination multiple times, would I be correct in saying that the redirects should point to a Union All transformation, which can then feed them all into one error destination table? Thinking about this, would it be possible to have multiple inputs into a script which standardises all the events so that they can be directed to an error table?

Yeah, I think they should point to a UNION ALL. Maybe you could push that data into a raw file (which is stupendously quick compared to a flat file or DB table) and then push the contents of the raw file to wherever you want it in a separate data flow. Again, it would be an interesting test to find out. I dread to think what your data flow will look like with 120 data paths in it, though.
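Each redirected path just needs massaging into one common shape before the UNION ALL, either with a derived column or a small script. A sketch of the idea, with made-up column names (ErrorEventID, Parameters and so on are illustrative, not your schema):

    // Sketch of a per-path normalisation before the UNION ALL.
    // All column names here are illustrative.
    public override void Input0_ProcessInputRow(Input0Buffer Row)
    {
        Row.ErrorEventID = Row.EventID;       // which event failed
        Row.ErrorRawData = Row.Parameters;    // the original parameter string
        Row.ErrorSource = "Event17Pipeline";  // identifies the failing path
        Row.ErrorTime = DateTime.Now;         // when it was redirected
    }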

-Jamie

Jamie,

It looks bad enough with nearly 60 :)
I need to look at the layout and set it out a bit better, I think.

Thanks again,

Grant

Jamie,

I've been thinking and believe I have come up with a possibly better approach to this problem. I've found that error redirection from a script component isn't just a simple matter of dragging the red path to a separate output, although if you could help on how to do this (as a separate question) I'd be interested in finding out.

Anyway, I believe I can get rid of all the script tasks following the conditional split by having just one before the split. In this I would split up my comma-delimited list into generic parameter variables. Each event in the conditional split would then use only specific parameters from the array. This will work as long as I use a fixed-length array with empty strings where no parameters exist. The problem I think I will encounter with this method is that I still need to convert the parameter data from strings into the destination table data types. I assume for this I should be able to use the Data Conversion transformation in place of the script? Will the Data Conversion transformation allow me to do complex conversions that might involve changing a hex value in the source data to an integer value for the destination?
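To illustrate the idea, the single pre-split script would do something along these lines (a rough sketch; MaxParams and the Param column names are made up for illustration):

    // Sketch of the single pre-split script: pad the event-specific values
    // out to a fixed set of generic output columns, with empty strings
    // where an event has fewer parameters. Names here are illustrative.
    private const int MaxParams = 10; // assumed upper bound across all events

    public override void Input0_ProcessInputRow(Input0Buffer Row)
    {
        string[] fields = Row.EventLine.Split(',');

        // Pad out to the fixed length with empty strings.
        string[] padded = new string[MaxParams];
        for (int i = 0; i < MaxParams; i++)
        {
            padded[i] = (i < fields.Length) ? fields[i] : string.Empty;
        }

        // Each downstream event type reads only the generic columns it needs.
        Row.Param0 = padded[0];
        Row.Param1 = padded[1];
        // ...assignments continue up to...
        Row.Param9 = padded[9];
    }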

Any help or opinions on this would be appreciated.

Thanks,

Grant

Complicated conversions like that are more readily achieved in a Derived Column component or a script component.
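For the hex example, inside a script component it might look something like this (a sketch; HexValue and IntValue are assumed column names, not your schema):

    // Sketch of the hex-to-integer conversion inside a script component.
    public override void Input0_ProcessInputRow(Input0Buffer Row)
    {
        try
        {
            // Base-16 parse, e.g. "1A3F" -> 6719.
            Row.IntValue = Convert.ToInt32(Row.HexValue, 16);
        }
        catch (FormatException)
        {
            // A bad value can be flagged here and the row redirected.
            Row.IntValue_IsNull = true;
        }
    }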

-Jamie

Thanks Jamie,

That's what I thought. At least with the Derived Column and Data Conversion transformations I can use the error redirect as soon as a problem arises. As I mentioned, the script component seems to be sadly missing this functionality, for whatever reason.

Grant
