Tuesday, March 20, 2012

Data Purging

Hi

We have a requirement to purge historical data based on specified criteria. We have a complex data model with a parent-child hierarchy that goes down multiple levels.

Does anybody know how data purging is done in production environments with characteristics such as:

Data volume is huge (500 GB)

Multiple levels of relationships (it can go up to 6 levels)

Physical relationships exist between tables

Thanks in advance :)


Regards
Samsudeen B

My 20 cents worth:

One of the reasons data purging can be a pain is the dilemma of associated indexes. You need them to locate the records to delete, yet they can make the whole process slow because every deleted row has to be removed from each index too, and the indexes may need rebuilding afterwards (not forgetting any locking issues).
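If it comes to that, one option is to disable the nonclustered indexes before the purge and rebuild them afterwards, rather than scripting DROP/CREATE by hand. This is only a sketch with made-up index and table names; note that disabling a clustered index would take the whole table offline, so this trick is for nonclustered indexes only:

-- Hypothetical index and table names, for illustration only.
ALTER INDEX IX_orders_date ON dbo.orders DISABLE;
GO

-- ... run the deletes here ...

-- Rebuilding re-enables the index and refreshes its statistics.
ALTER INDEX IX_orders_date ON dbo.orders REBUILD;
GO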

The other issue would be whether you have cascading referential integrity set up.
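For reference, cascading referential integrity is declared on the foreign key itself. A minimal sketch, with invented table and column names:

-- Child rows are deleted automatically when their parent row is deleted.
ALTER TABLE dbo.order_items
ADD CONSTRAINT FK_order_items_orders
FOREIGN KEY (order_id)
REFERENCES dbo.orders (order_id)
ON DELETE CASCADE;
GO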

I would not embark on an all-at-once deletion. I would do small/medium range deletions, and I might also disable any indexes on the underlying tables. If you have cascading referential integrity, all you have to do is delete the relevant parent records:

DELETE FROM parent_table

WHERE date_column BETWEEN a AND b

GO

DELETE FROM parent_table

WHERE date_column BETWEEN b AND c

GO

.......................

........................

DELETE FROM parent_table

WHERE date_column BETWEEN y AND z

|||

Hi Nathan,

Thanks for the response.

I don't have cascading referential integrity set up in my model. As I said earlier, my model is complex, with child tables referenced through multiple foreign keys.

Do you have any idea how these kinds of situations are handled when purging data in a production environment?

Thanks,

Samsudeen B

|||

Without cascading referential integrity, you will have to delete data in the correct order to avoid constraint violations. Another headache.

Anyway, in a production environment you will have to be much more cautious, as it is very easy to slow the server to a crawl through table locking and excessive transaction log growth.

Here is what I have done before:

1. Script out the DROP and CREATE index statements for the tables concerned.

2. Script out the DROP and CREATE foreign key constraints for all child tables.

3. Make a script containing piecemeal delete statements (deleting a few thousand rows per table per batch, starting with the child tables first). You need to understand the dependencies between your tables; a batched delete sketch follows this list.

Then, during my "off-peak" hours:

4. Change the database recovery model to "simple"

5. Disable all Triggers on the tables.

6. Drop the indexes and foreign keys (all in a script)

7. Run the "delete" script. Delete as many rows as possible within my "free" time window.

8. Enable the triggers

9. Run the script to recreate the indexes for the tables

10. Shrink the log file (DBCC SHRINKFILE)

11. Revert to "Full" Recovery model

12. Run a script to update statistics on the tables.
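Here is roughly what the piecemeal delete script from step 3 can look like. This is only a sketch: the table names, the join column, the 5000-row batch size and the cutoff date are all invented for illustration, and the children must go before their parents:

-- Purge the child table first, in small batches, to keep locking and log growth manageable.
WHILE 1 = 1
BEGIN
    DELETE TOP (5000) ci
    FROM dbo.child_items AS ci
    JOIN dbo.parent_records AS p ON p.id = ci.parent_id
    WHERE p.date_column < '20080101';

    IF @@ROWCOUNT = 0 BREAK; -- this table is done
END
GO

-- Then the parent table, once nothing references the doomed rows.
WHILE 1 = 1
BEGIN
    DELETE TOP (5000) FROM dbo.parent_records
    WHERE date_column < '20080101';

    IF @@ROWCOUNT = 0 BREAK;
END
GO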

You don't have to delete all the data at once. You can do this as many times as necessary if you have a very busy server. And you can put all that work (4-12) into a script and make some coffee. My server is not used for at least 4 hours at night.
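A minimal sketch of that steps 4-12 wrapper script, assuming a database called MyDB whose log file's logical name is MyDB_log (both names invented here):

-- 4. Simple recovery so the purge doesn't bloat the transaction log.
ALTER DATABASE MyDB SET RECOVERY SIMPLE;
GO

-- 5. Disable the triggers on each affected table.
DISABLE TRIGGER ALL ON dbo.parent_records;
GO

-- 6 and 7: drop the indexes/foreign keys and run the delete batches here.

-- 8. Re-enable the triggers.
ENABLE TRIGGER ALL ON dbo.parent_records;
GO

-- 9. Recreate the indexes here (from the scripts prepared in step 1).

-- 10. Shrink the log file back down (target size in MB).
DBCC SHRINKFILE (MyDB_log, 1024);
GO

-- 11. Back to full recovery. Take a full (or differential) backup afterwards,
-- because switching recovery models breaks the log backup chain.
ALTER DATABASE MyDB SET RECOVERY FULL;
GO

-- 12. Refresh statistics on the purged tables.
UPDATE STATISTICS dbo.parent_records;
UPDATE STATISTICS dbo.child_items;
GO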

If your server never "chills out" then a carefully scripted delete operation and a simple recovery model may be of help. You may also save yourself a few headaches by adding cascading referential integrity to the tables. That way you script only the delete statements for the parent tables.
