As input we have a set of rows, each with a column containing a string (e.g. "AAOOOOAAAOOA").
We'd like to split this string (using a VB.NET data flow script component) into tokens of consecutive identical characters (e.g. "AA", then "OOOO", "AAA", "OO", "A"), and to output one row per token.
How can we achieve that with a VB.NET script component? (Or anything else?)
best regards
Thibaut|||Just take your source and feed it into a Derived Column transformation, where you perform your string split to create a new column for each "token." Then go into an Unpivot transformation, where you turn those new columns into rows.
Why use a script component when you can use the optimized data flow tools as-is?|||
You need to create an asynchronous transform. Add the Script Component transform and set up the input and output columns as required. Ensure the SynchronousInputID of the output is set to 0, making it asynchronous.
Then read the rows and add them to the output as required. Some dummy code:
Public Overrides Sub Input_ProcessInput(ByVal Buffer As InputBuffer)
    ' Read the rows in...
    While Buffer.NextRow() ' This happens once for each input row
        ' Do something here and, as required, add rows to the output; use a loop or whatever
        For i As Integer = 0 To 9
            With OutputBuffer
                .AddRow() ' Add a new output row and set its values. We can do this as many times as we like, 0 or more times, in each iteration of the input row loop
                .Asset = Buffer.InputColumn
                .Product = i
            End With
        Next
    End While
    If Buffer.EndOfRowset() Then
        OutputBuffer.SetEndOfRowset()
    End If
End Sub
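For the splitting logic itself, which the dummy code above leaves out, here is a minimal run-length tokenizer in plain VB.NET. This is a sketch, not part of Darren's sample: the module and function names are hypothetical, and inside the component you would call it once per input row, adding one output row per token.

```vbnet
Imports System.Collections.Generic

Module Tokenizer
    ' Split a string into runs of consecutive identical characters,
    ' e.g. "AAOOOOAAAOOA" -> "AA", "OOOO", "AAA", "OO", "A".
    Public Function SplitRuns(ByVal input As String) As List(Of String)
        Dim tokens As New List(Of String)()
        If String.IsNullOrEmpty(input) Then Return tokens
        Dim start As Integer = 0
        For i As Integer = 1 To input.Length - 1
            If input(i) <> input(start) Then
                ' Character changed: the run that started at 'start' is complete.
                tokens.Add(input.Substring(start, i - start))
                start = i
            End If
        Next
        tokens.Add(input.Substring(start)) ' The final run.
        Return tokens
    End Function
End Module
```

In the transform you would then loop `For Each token In Tokenizer.SplitRuns(Buffer.InputColumn)`, calling `OutputBuffer.AddRow()` and setting the output column to `token` on each pass.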
|||Thanks a lot Darren! That's really perfect (and thanks for the detailed sample, I really appreciate it).
regards,
Thibaut|||Just be careful, because some have tested the script component and found it performs slower than the other, native data flow components.|||Phil, I agree in principle that the native stock components should be faster as a rule, but have you got any references?|||
DarrenSQLIS wrote:
Phil, I agree in principle that the native stock components should be faster as a rule, but have you got any references?
Yep, no problem... http://forums.microsoft.com/MSDN/ShowPost.aspx?PostID=857796&SiteID=1|||Hi Phil
thanks for your input as well. Like Darren, I agree that a good rule of thumb is that a pipeline component will generally be faster than a corresponding script component.
A script component is generally one-shot code, which doesn't get the same level of testing and optimization. In the article you point to, the script benchmark shows an average CPU usage of 5%, which is often a sign of synchronization issues, contention, etc. Here, as one commenter suggested, a change to the implementation (such as handling sets of rows instead of one row at a time) would most likely boost performance a lot.
So I wouldn't draw conclusions based on a single script example, given that each implementation is likely to vary a lot in terms of performance and memory consumption - just like any kind of code!
But anyway, it seems that I have two solutions for my problem now. I don't hesitate to use script components when they prove useful, but I'll first try to stick to the pipeline components.
I'll keep you posted!
thanks again for the input
Thibaut Barrère|||Hi
I've finished the job using a script task (I will report back the details later).
I'm curious about how it would have been possible to implement this using a derived column transformation (at first sight I couldn't figure it out).
Phil, could you give a bit more detail? I could not find any real string-splitting functions in SSIS; were you thinking of using a FIND function recursively (or some other feature I've missed)?
cheers
Thibaut