Tuesday, February 14, 2012

Data flow task running very slow

Hello,

I developed an SSIS package doing a nightly load into a data warehouse. We have an 8 hour loading window - currently the package takes 16 hours to complete.

I isolated the problem to a Data Flow task where +-35% of the time is spent. This task is pretty straight forward:

- OLE DB source, reading +- 800,000 rows from a SQL server database

- 13 Lookups in sequence, to get surrogate keys from dimension tables. Lookups are all on GUIDS.

- An aggregation

- OLEDB target, fact table in a SQL server database.

It seems unreasonable for the this task to take over 5 hours. It spends the majority of time on the lookups - not so much at target, source and aggregation.

Any comments and advice will be greatly appreciated.

Thanks.

(PS some machine details:

OS Name Microsoft(R) Windows(R) Server 2003, Standard Edition
Version 5.2.3790 Service Pack 1 Build 3790
Other OS Description Not Available
OS Manufacturer Microsoft Corporation
System Name ARK-SQL
System Manufacturer HP
System Model ProLiant DL380 G5
System Type X86-based PC
Processor x86 Family 6 Model 15 Stepping 6 GenuineIntel ~1866 Mhz
Processor x86 Family 6 Model 15 Stepping 6 GenuineIntel ~1866 Mhz
BIOS Version/Date HP P56, 9/18/2006
SMBIOS Version 2.3
Windows Directory C:\WINDOWS
System Directory C:\WINDOWS\system32
Boot Device \Device\HarddiskVolume1
Locale United States
Hardware Abstraction Layer Version = "5.2.3790.1830 (srv03_sp1_rtm.050324-1447)"
User Name Not Available
Time Zone South Africa Standard Time
Total Physical Memory 3,327.30 MB
Available Physical Memory 938.20 MB
Total Virtual Memory 1.10 GB
Available Virtual Memory 2.78 GB
Page File Space 2.00 GB
Page File C:\pagefile.sys)

How many rows are the lookups caching? Are you selecting the whole table (All columns) or are you using a SQL query to specify only the columns you need.

Are you sure it is not your dest that is slow receiving the rows?
If it is, SSIS will slow down the rows retrieved from the source making it appear as if it is something else slowing it down.

Check the lookups. Are as few columns as possible being selected.
Push your rows into nothing such as a Konesans Trash Destination. Does it appear faster?

Also, if your lookup is caching alot of rows and the key you are caching is a GUID, that's a chunk of work to do.|||

You need to find where the bottleneck is. You could start measuring how fast the dataflow 'reads from source'; then how fast it does the transformations (since you have a fair amount of transformations, I would measure a several points); and finally measure the how fast it writes into the destination.

few tips:

The lookups could slow down performance if you are using partial cache. So, limit the number of columns and rows (by providing a select statement with a where clause if possible) and use full cache mode. This approach could generate another problem if memory in server is limited.

See if you can move the aggregation upstream and or limit the number of columns to be used as 'group by'. In general aggregations will perform better if the number of columns with fewer columns in the 'group by'.

If you use an OLE DB destination, try to stick with 'fast load' and set the interval commit to an acceptable range

Use DB profiler tools to measure the performance of each SQL statement use in the data flow (OLE DB source, lookups, etc)

This white paper has some other information

http://www.microsoft.com/technet/prodtechnol/sql/2005/ssisperf.mspx|||

I would also highly recommend taking a read of this:

Donald Farmer's Technet webcast

(http://blogs.conchango.com/jamiethomson/archive/2006/06/14/SSIS_3A00_-Donald-Farmer_2700_s-Technet-webcast.aspx)

-Jamie

|||

Hi Crispin,

Thanks for your response.

I solved the problem last night. The key was in the caching.

I modify the SQL statement in most of my lookups to handle inferred fact entries. I.e. a sale against an unknown customer gets linked to the 'UNKNOWN CUTOMER' dimension table entry. It seems that SSIS sets the caching to partial when you modify the lookup SQL. I assume the lookup had to go back to disk quite often and therefore high cost for big dimension tables (...of which I have quite a few).

I am achieving a +-500% performance increase by keeping the Lookup transform as it is and by setting the cahcing to full. I treat the inferred fact entries at a later stage in stead of at the point of lookup.

|||Good to hear it has approved, however...

The lookup does not change to partial cache when you specify a statement. Not sure why you say it has done this. Was the checkbox checked?

On the partial cache thing, you are correct. The lookup will check it's cache, if the key is found, use it, else run the query against SQL. This can become dog slow. Only time it has been a good thing, for me at least, is when you know your facts are going to be a very small portion of a really large dimension. That way, the first few queries are slow but you don't have to cache 13 million rows only to use 1000.

No comments:

Post a Comment