Sunday, February 19, 2012

Data Integrity Across Tables - Please evaluate.

Hi,

I have the following situation...

Products with attributes. e.g. colour, size.

The attributes have options. e.g. green, red.

The options form combinations which can be priced separately.

e.g. red/green = £3

I figure that a good design would put each entity in its own table.

A table for products, attributes, options and combinations.

The problem is that the combinations derive directly from the options. So if a stored procedure is reading the options while a concurrent process changes the options (and therefore the combinations), the stored procedure will then go on to read combinations which are out of sync with the options it has already read.

So my idea is to put all the data into a single table and then, using a locking hint, place a shared lock on this table when the data is being read.

Each row in this table is capable of storing all of the different types of data - attributes, options and combinations. There is also a flag which marks what type of data is being stored in the row - much like variables in a scripting language.
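
To make that concrete, here is a rough sketch of the sort of thing I mean (all table and column names here are made up for illustration):

-- Hypothetical single-table layout: one row per attribute, option or combination,
-- distinguished by a rowType flag.
CREATE TABLE productData (
    rowID     INT IDENTITY PRIMARY KEY,
    productID INT NOT NULL,
    rowType   CHAR(1) NOT NULL,   -- 'A' = attribute, 'O' = option, 'C' = combination
    parentID  INT NULL,           -- e.g. the attribute an option belongs to
    name      NVARCHAR(100) NULL,
    price     MONEY NULL          -- only populated on combination rows
);

-- Read everything for one product under a shared table lock held for the whole transaction.
BEGIN TRANSACTION;
    SELECT rowID, rowType, parentID, name, price
    FROM productData WITH (TABLOCK, HOLDLOCK)
    WHERE productID = 42;
COMMIT TRANSACTION;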

What do you think of this design?

Also, if there are many processes reading from this table and locking it, will an update be able to get to the data?

Cheers,

I.|||Go for separate tables. Better design and better for locking.|||Hi Pierre,

Thanks for your reply.

When the data is in separate tables, it's the locking that's the problem.

" if a stored procedure is reading options and then a concurrent process changes the options and thus the combinations, the stored procedure doing the reading will then go on to read combinations which are out of synch with the options it read. "

WT.|||Hi WT,

Separating entities into different tables is invariably better database design: your database is much more scalable this way. For example, it makes it much easier to add another colour to your list, it saves you from having to store the same data in multiple places, and it allows you to refer to rows of data by index, all of which will speed up your database.
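
As a rough sketch, the separate-table layout might look something like this (the names are purely illustrative):

-- One table per entity; child rows reference their parents by key.
CREATE TABLE products (
    productID   INT IDENTITY PRIMARY KEY,
    name        NVARCHAR(100) NOT NULL
);

CREATE TABLE attributes (
    attributeID INT IDENTITY PRIMARY KEY,
    productID   INT NOT NULL REFERENCES products(productID),
    name        NVARCHAR(50) NOT NULL        -- e.g. 'colour', 'size'
);

CREATE TABLE options (
    optionID    INT IDENTITY PRIMARY KEY,
    attributeID INT NOT NULL REFERENCES attributes(attributeID),
    name        NVARCHAR(50) NOT NULL        -- e.g. 'green', 'red'
);

CREATE TABLE combinations (
    comboID     INT IDENTITY PRIMARY KEY,
    productID   INT NOT NULL REFERENCES products(productID),
    price       MONEY NOT NULL               -- e.g. red/green = 3.00
);

CREATE TABLE comboOptions (
    comboID     INT NOT NULL REFERENCES combinations(comboID),
    optionID    INT NOT NULL REFERENCES options(optionID),
    PRIMARY KEY (comboID, optionID)
);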

If you need to guarantee read consistency across your data then one option is to gather all your data in a single select statement using the READ UNCOMMITTED transaction isolation level in SQL Server (this is the default, so you don't need any code beyond your select).

If you need more than one database select within your procedure you can start a transaction at the start of the procedure and then set the transaction isolation level to REPEATABLE READ. This will prevent any rows which you have read from any tables during your transaction from being updated until after you have finished your transaction.
In SQL Server you do this with the command:
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ
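
For example, a minimal sketch with illustrative table and column names:

DECLARE @productID INT = 42;

SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
BEGIN TRANSACTION;

    -- Rows read by any of these selects stay locked against updates until COMMIT.
    SELECT attributeID, name FROM attributes WHERE productID = @productID;

    SELECT o.optionID, o.name
    FROM options o
    JOIN attributes a ON a.attributeID = o.attributeID
    WHERE a.productID = @productID;

    SELECT comboID, price FROM combinations WHERE productID = @productID;

COMMIT TRANSACTION;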

If a concurrent insert into one of your tables while you are reading could cause problems, you will need a transaction isolation level of SERIALIZABLE.

The downside to these forms of locking is that they reduce concurrency - i.e. fewer people can access the same resources at the same time - which may or may not be a problem in your particular application. You have to aim for a balance between concurrency and data consistency.

Maracatu|||Don't let some locking issues put you off from using better DB design. Correctness of your design is always your first priority. It's easy to get around performance / locking issues later if your design is correct and normalised. You can for example build indexes, cache datasets or build trigger-updated help tables. But if your design is bad your options are considerably reduced.|||Hey, thanks for the replies.

I agree that putting things in separate tables is better design. However...

"it makes it much easier to add another colour to your list, and also it saves you from having to store the same data in multiple places, allows you to refer to rows of data by index etc etc which will speed up your database. "

Putting all the entities in the same table as I described doesn't stop any of this from happening, although some rows do end up with lots of NULL columns when the row type doesn't require anything in them.

"gather all your data in a single select statement"

How do you read from 3 different tables in a single select? I suppose one could use a union, but then the rows from some of the tables would have to be padded with nulls to give all of the result rows the same shape (a sketch of what I mean is below). More importantly, couldn't a concurrent update occur in the middle of the union read? If so, then from a locking point of view this is no different from 3 separate selects. Also, if all the data is returned in the same result set, it becomes harder to bind the data to different ASP.net web controls, because they need distinct data sources. This last point is a problem with my design too.
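
Here's the padded union I have in mind (column names are illustrative):

-- Each branch pads the columns it lacks with typed NULLs so the shapes match.
SELECT 'attribute' AS rowType, attributeID AS id, name, CAST(NULL AS MONEY) AS price
FROM attributes
WHERE productID = 42

UNION ALL

SELECT 'option', o.optionID, o.name, CAST(NULL AS MONEY)
FROM options o
JOIN attributes a ON a.attributeID = o.attributeID
WHERE a.productID = 42

UNION ALL

SELECT 'combination', comboID, CAST(NULL AS NVARCHAR(100)), price
FROM combinations
WHERE productID = 42;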

READ UNCOMMITTED is not the default isolation level. The default is READ COMMITTED.

"If you need more than one database select within your procedure you can start a transaction at the start of the procedure and then set the transaction isolation level to REPEATABLE READ. This will prevent any rows which you have read from any tables during your transaction being updated until after you have finished your transaction."

Having things changed which have already been read isn't the problem. The problem is having things changed which have NOT yet been read, and then having those things be out of sync with what has been read.

"If a concurrent Insert into one of your tables whilst reading could cause problems you will need a trans isolation level of SERIALIZABLE. "

I don't understand how this will help.

"Don't let some locking issues put you off from using better DB design."

OK, so assuming that I have data normalized in separate tables, how does one read data in table A, then table B and then table C without a concurrent update modifying what's in table B or C before the read has completed?

I can think of a way but it involves using a transaction on a common read - surely this isn't good database design?

Have the read of a product wrapped in a transaction and make it REPEATABLE READ. The first thing the read reads is the main product record in the products table - then it proceeds to all the combinations data etc. Then make sure that any updates always start by attempting to update the main product row. This way, the updates won't be able to proceed to any further updates until the read has finished its transaction.
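
Something like this is what I have in mind (sketch only, illustrative names):

-- Reader: lock the main product row first, then read the dependent rows.
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
BEGIN TRANSACTION;
    SELECT productID, name   FROM products     WHERE productID = 42;   -- the anchor row
    SELECT attributeID, name FROM attributes   WHERE productID = 42;
    SELECT comboID, price    FROM combinations WHERE productID = 42;
COMMIT TRANSACTION;

-- Writer: always "touch" the main product row first. If a reader still holds its
-- shared lock on that row, the update blocks here and never reaches the child rows.
BEGIN TRANSACTION;
    UPDATE products     SET name = name  WHERE productID = 42;
    UPDATE combinations SET price = 3.00 WHERE comboID = 7;
COMMIT TRANSACTION;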

Finally, if you were a user, would you rather have integrated, consistent data on a web page you requested, or take comfort that the garbled and senseless data on the page is at least stored in the 'correct' format? Perhaps the theory is imperfect.

WT.|||By the way, in order to get the different types of data from the single table in my unconventional/bad design into different result sets so that they can be bound to different web controls, I selected all the relevant data from the single table into a table variable and then did 3 separate selects from this table variable - one for each type of data.
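
It looks roughly like this (a sketch, using the single-table design and illustrative names):

DECLARE @rows TABLE (rowID INT, rowType CHAR(1), parentID INT, name NVARCHAR(100), price MONEY);

-- One locked read pulls everything for the product into the table variable...
INSERT INTO @rows (rowID, rowType, parentID, name, price)
SELECT rowID, rowType, parentID, name, price
FROM productData WITH (TABLOCK)
WHERE productID = 42;

-- ...then three separate selects split it back out, one result set per web control.
SELECT rowID, name           FROM @rows WHERE rowType = 'A';  -- attributes
SELECT rowID, parentID, name FROM @rows WHERE rowType = 'O';  -- options
SELECT rowID, price          FROM @rows WHERE rowType = 'C';  -- combinations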

Is this ugly or what?

But at least the data will be consistent because the read from the single table uses a table lock hint.|||Please don't do this - you say 'at least the data will be consistent ...' This is stupid. You can get consistent reads from your normalised tables just as easily using the same locking hints if you wish. Also, in your case there's no reason to lock the entire table. Just lock the rows you are reading; that's good enough in your case and will allow other processes to read other data from the table at the same time.

About your other post. I sense that you have some misconceptions about database design. I'm not going to give you DB Design 101 here, but you say 'How do you read from 3 different tables in a single select?' WHAT? Are you serious? Ever heard of 'join' or 'union'?

'More importantly, couldn't a concurrent update occur within the middle of the union read?' This has NOTHING to do with normalising your data-structure. This is a non-issue either way.

'The problem is having things changed which have NOT yet been read and then having those things being out of synch with what has been read.' So you want to bind several controls on your page with data from your tables. Right? And you want to make sure that your data on your page at the end is consistent. Right? So instead of messing up your DB design, simply write a stored proc to return the resultsets (all 3 or 4 or how many you need) inside a transaction with some row-level locking hints. I'll repeat: Do NOT screw up your database design!
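
For example, something along these lines (a sketch only, with illustrative names):

-- One procedure, one transaction, several result sets - consistent with each other.
CREATE PROCEDURE dbo.GetProductPageData
    @productID INT
AS
BEGIN
    SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
    BEGIN TRANSACTION;

    SELECT productID, name   FROM products WITH (ROWLOCK)     WHERE productID = @productID;
    SELECT attributeID, name FROM attributes WITH (ROWLOCK)   WHERE productID = @productID;

    SELECT o.optionID, o.name
    FROM options o WITH (ROWLOCK)
    JOIN attributes a ON a.attributeID = o.attributeID
    WHERE a.productID = @productID;

    SELECT comboID, price    FROM combinations WITH (ROWLOCK) WHERE productID = @productID;

    COMMIT TRANSACTION;
END;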

'I can think of a way but it involves using a transaction on a common read - surely this isn't good database design?' And shoving several entities into one table is? I give up. Do whatever you feel like.

'Perhaps the theory is imperfect.' Perhaps you should study some of the theory before you make quick judgements on the life's work of some of the most brilliant mathematicians of our time.|||Hi Pierre,

Thanks for your reply.

"I can think of a way but it involves using a transaction on a common read - surely this isn't good database design? And shoving several entities into one table is? I give up. Do whatever you feel like."

Using a transaction on a read seemed to be what Maracatu was suggesting I do - that's why I was questioning its traditional design merit. It also seems to be what you're suggesting...

"simply write a stored proc to return the resultsets (all 3 or 4 or how many you need) inside a transaction with some row-level locking hints"

My point is that, if you're going to resist my badly designed database, and assuming that using transactions for common reads (e.g. a shopper browsing a catalog) is bad design, why are you proposing it as an alternative to my design? Is it just unfortunate that a transaction is needed in this situation?

"You can get consistent reads from your normalised tables just as easily using the same locking hints if you wish."

Well this is exactly what I'm after!

Actually, I think I get it now. Even if you only lock a fraction of the records which need to be consistent using REPEATABLE READ, an update can't make the data further down in the read inconsistent, even though those records are unlocked, because it can't get at the records which are locked and so can't finish its transaction. Aha!

The only way an update could leave the database in an inconsistent state would be if you let it.

OK, so just use a transaction during the reading of the attributes and their derivations! In fact, I'd have to include the reading of the main product row in the transaction because it contains a field saying where to get the price from - in the main row, in price bands or in the combinations.

"Just lock the rows you are reading, that's good enough in your case and will allow other process to read other data from the table at the same time."

Other processes can read from a locked table if it's a shared lock, can't they? I think the locking hint TABLOCK puts a shared lock on the whole table, so it would only be modifications which were locked out.

"I give up."

Please don't do that! I've been banging on about this for ages and you're the first person who has given me a decent response.

WT|||Look, you're worrying too much about this issue. Remember that even if you lock the tables during the read you will still have this much more real problem: What if you do all the selects, populate the screen and display it to the user. Then the user changes some dropdown boxes or whatever and saves the changes. BUT while h'e looking at the screen some other user changes that record in the background, so that when you do the the update the data in the database does not match what was on the screen anymore. This is a much more real problem than updates during the few milliseconds when the actual read takes place. The net result, however, is the same. Notice that no amount of locking or whatever will prevent this new problem. Therefore I suggest you do not worry about those few milliseconds of the actual read. It pales comparison to this problem. The correct way to address both of these problems at the same time is to check your data before you do updates.

Let me give you an example: let's say you have a screen that displays a single product and a list of its attributes. User 1 requests this screen, and at the same time User 2 deletes an attribute.

Now let's say you spent 2 months perfecting the locking of those tables and thus User 1's select starts before User 2's delete. User 2 will wait for the select to complete, and then the delete will remove the attribute. Now User 1 is looking at a screen that no longer matches what's in the database - he's looking at an attribute that does not exist on that product. The other possibility is that User 2's delete started before User 1's select, and in that case the user will be looking at what's actually in the database.

Now let's say you dropped this silliness and left out the locking. Now the results are like this: either the select will include the deleted row (if the delete happened too late) or it will exclude it (if the delete happened early enough).

So... What do you see? The results are EXACTLY THE SAME! Bottom line is that you're worrying about nonsense. Leave the locking and transactions around the reads alone. Check your data on updates.|||Hey Pierre,

Thanks again for your reply. I think I'm beginning to get some bearing.

"Notice that no amount of locking or whatever will prevent this new problem."

Hmmm, surely you could have Pessimistic Concurrency if you saved the open database connection in session state? Of course, I'm not going to do that.

But sure, I think I have dealt with the problem you are talking about. I added a time stamp field to the main product row. At the start of the update transaction, it reads the current time stamp value using a repeatable read and compares it with the one gathered with the rest of the data sent to the update page.

If they're different, it sends a message to the user suggesting that they refresh/rebind the page; otherwise it continues with the update. So all updates of a product - even if it's just to change a single option name - need to update the main product record in order to register a change. I'm not sure if this is lazy or not.
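
Roughly what the update check looks like (a sketch with illustrative names; it assumes a rowversion column, here called rowStamp, on the main product row, sent to the page as binary(8) and posted back with the update):

CREATE PROCEDURE dbo.UpdateOptionName
    @productID     INT,
    @optionID      INT,
    @newName       NVARCHAR(50),
    @originalStamp BINARY(8)       -- the rowversion value the page was rendered from
AS
BEGIN
    SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
    BEGIN TRANSACTION;

    -- Re-read the current stamp; the repeatable read keeps this row locked until COMMIT.
    DECLARE @currentStamp BINARY(8);
    SELECT @currentStamp = rowStamp FROM products WHERE productID = @productID;

    IF @currentStamp <> @originalStamp
    BEGIN
        ROLLBACK TRANSACTION;
        RAISERROR('Product has changed since the page was loaded - please refresh.', 16, 1);
        RETURN;
    END;

    -- "Touch" the main row so its rowversion changes, then do the real update.
    UPDATE products SET name = name      WHERE productID = @productID;
    UPDATE options  SET name = @newName  WHERE optionID  = @optionID;

    COMMIT TRANSACTION;
END;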

I figured that this was a decent way of dealing with the problem, because if another user has added or removed an attribute (which is quite hard to detect anyway), all of the first user's combinations are now invalid - which would make completing the update more work than it seems worth.

There's another design issue that I wanted some feedback on...

That is, how to identify a combination given the primary keys of the options which comprise it.

At the moment, I've given each combination an id string. So if a combination is made up of option 0 (pk = 356) from attribute 0 and option 0 (pk = 234) of attribute 1, the id string would be "356_234". The column also has an index on it.

So when the shopper adds the product to the cart, one can look up the combination's price and code using:


SELECT * FROM productCombos
WHERE idStr = '356' + '_' + '234';

(The primary key values of the options would come from the drop down lists in the 'add to cart' form.)

I figure that this is much better than doing an indefinite number of self joins (one for each attribute) on the table which joins the options table to the combinations table, like so (this would be valid if there were just 2 attributes):


SELECT * FROM productCombos pC
INNER JOIN productVarDetails_Combos pVDC ON pVDC.productComboID = pC.productComboID
INNER JOIN productVarDetails_Combos pVDC2 ON pVDC2.productComboID = pVDC.productComboID
WHERE pVDC2.varDetailID = 356 AND pVDC.varDetailID = 234;

In fact, this join table doesn't really seem worth having.

Any thoughts?

WT.
