SQL Server: Is CHECKDB A Necessity?

From: SQL Server Customer
Sent: Friday, February 20, 2009 5:08 AM

 

I’m often asked by professionals whether CHECKDB is still recommended on SQL Server 2008. SQL Server has been calculating a checksum to detect physical page corruption since 2005. However, in my experience (starting with SQL Server 7.0) I’ve never been able to say clearly that CHECKDB is not necessary. Does anyone have good advice?

_____________________________________________
From: Robert Dorr
Sent: Friday, February 20, 2009 9:04 AM

 

This question has been raised many times. You are correct that the addition of the database page audit CHECKSUM (the PAGE_VERIFY CHECKSUM option) has reduced some of the need for constant CHECKDB execution. History has also shown that storage engine quality has improved significantly since SQL Server 7.0, reducing the need even further.
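
For reference, the page audit is controlled per database. A minimal sketch, assuming a placeholder database named SalesDB:

    -- CHECKSUM is the default page audit for databases created on SQL Server 2005
    -- and later, but databases upgraded from earlier versions may still be set to
    -- TORN_PAGE_DETECTION or NONE.
    ALTER DATABASE SalesDB SET PAGE_VERIFY CHECKSUM;

    -- Confirm the current page verification option for every database.
    SELECT name, page_verify_option_desc
    FROM sys.databases;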

However, features like the checksum only ensure that what was written to disk is exactly what is read back. They do not validate many of the other integrity issues CHECKDB is designed to catch. For example, CHECKDB validates the system table metadata, page linkage, index ordering, and many other facets.
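
A periodic check might look like the following. This is a minimal sketch; SalesDB is again a placeholder name:

    -- Full logical and physical consistency check: allocation structures, system
    -- table metadata, page linkage, index ordering, and so on.
    DBCC CHECKDB (SalesDB) WITH NO_INFOMSGS, ALL_ERRORMSGS;

    -- A lighter-weight, physical-only pass that some sites run on very large
    -- databases between full logical checks.
    DBCC CHECKDB (SalesDB) WITH PHYSICAL_ONLY;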

The answer is that the need has been reduced since SQL Server 7.0, and it falls somewhat to the comfort of the customer. An airtight recovery plan will include a full restore and a CHECKDB to make sure all portions of the plan are working. For example, you could have a clean database, back it up, restore it, and find the restored copy damaged because something happened during the backup media transfer. BACKUP WITH CHECKSUM in SQL Server 2008 helps with this scenario.
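
A verification pass along those lines might look like this. It is a sketch only; the database name, logical file names, and paths are placeholders:

    -- Recompute and verify page checksums as the backup is written, and store a
    -- checksum over the backup stream itself.
    BACKUP DATABASE SalesDB
        TO DISK = N'X:\Backups\SalesDB.bak'
        WITH CHECKSUM, INIT;

    -- Re-read the media and validate the stored checksums without restoring.
    RESTORE VERIFYONLY
        FROM DISK = N'X:\Backups\SalesDB.bak'
        WITH CHECKSUM;

    -- Only a real restore followed by CHECKDB proves the whole plan works.
    RESTORE DATABASE SalesDB_Verify
        FROM DISK = N'X:\Backups\SalesDB.bak'
        WITH MOVE N'SalesDB' TO N'X:\Verify\SalesDB_Verify.mdf',
             MOVE N'SalesDB_log' TO N'X:\Verify\SalesDB_Verify_log.ldf',
             CHECKSUM;

    DBCC CHECKDB (SalesDB_Verify) WITH NO_INFOMSGS;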

I feel other vendors can be a bit short-sighted in saying it is never needed. The database engine can't control everything. Examples:

  • I worked a case where the tape drive was misaligned. The backup would succeed, but the restore could not read the tape. Until a restore was attempted, the engine had no way to know the hardware had caused an issue. This customer lost the production server, had not been doing test restores, and spent hundreds of thousands of dollars with a tape recovery company.

  • I have reviewed studies on memory and ECC failure rates. One of the leading causes of data corruption, according to those studies, is single-bit changes during memory transfers. This is why SQL Server 2005 added constant page scanning to the buffer pool. Constant page scanning samples the checksum while a page is in memory to make sure a read-only page was not changed by just such a failure. This technique helps stabilize the data but can't catch everything, so it is still possible for an ECC failure, while the page is in the buffer pool, to change the data, which then gets written to disk with a successfully calculated checksum.

    I worked a case with this very problem, where a bit was being changed in the middle of a data value. The column happened to be the primary key, with foreign key references. The value of the primary key was 9 and would get changed to 19. The table had a (silent) primary key violation because of the duplicate 19 values, and foreign key issues because 9 no longer existed. The data was not changed as part of a transaction affecting the primary key value; it only took a change to any other value on the page. To muddy the waters further, the database was set up for log shipping, and the standby server would not show the corruption. There was no log record showing the corruption because SQL Server never directly changed the primary key, so even after a clean restore the CHECKDB would come back clean. DBCC CHECKDB on the primary database is the only way to locate such an issue before it migrates into a full or differential backup and is weeks old before it is discovered (see the sketch after this list).

  • Similar to the ECC failures, SQL Server bugs can lead to the same kind of issue, though more often it is XPROC and COM object bugs. If the code in an XPROC incorrectly updates memory, the same pattern can occur.

  • Over the last 10 years our number one problem with storage has been stale reads (the hardware cache returning a previous version of the page instead of the newest version on disk). A stale read often does not fail a checksum test because the page is simply an older, internally consistent version. Here is a scenario: my wife deposits $500 into our checking account, and a few minutes later I withdraw the $500. The hardware cache was loaded on the first account balance lookup, before my wife's deposit. The deposit is committed and written to disk, but the hardware cache is not updated. When I withdraw the $500, the update does not see the deposit and incorrectly updates the account using the original balance. SQL Server may not be directly aware of this, so it becomes a logical transaction loss.

    This is why SQL Server added stale read check sampling. Sampling helps but can't prevent all occurrences. In this case even DBCC can't catch the issue, but a restore from the transaction log backups does identify that the page was updated twice while holding the same LSN, which is an invalid condition (see the trace flag sketch after this list).
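
For the primary key case above, a regularly scheduled check on the log shipping primary is the safety net. A minimal sketch with placeholder names; DBCC CHECKCONSTRAINTS is shown only as a supplement to surface the silent foreign key breakage, not as a replacement for CHECKDB:

    USE SalesDB;

    -- Run on the primary, since the log shipping standby will not show the damage.
    DBCC CHECKDB (SalesDB) WITH NO_INFOMSGS, ALL_ERRORMSGS;

    -- Supplemental: revalidate foreign key and check constraints to surface rows
    -- that reference a value (like the missing 9) that no longer exists.
    DBCC CHECKCONSTRAINTS WITH ALL_CONSTRAINTS;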
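
For the stale read scenario, SQL Server also has write-tracking diagnostics that can report a stale read or lost write when the page is read back. A minimal sketch; trace flag 818 enables the diagnostics (it can also be added as a -T818 startup parameter), and it is a detection aid, not a preventative:

    -- Enable the stale read / lost write ring buffer diagnostics server-wide.
    DBCC TRACEON (818, -1);

    -- Report which trace flags are currently active.
    DBCC TRACESTATUS (-1);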

I would never say DBCC CHECKDB is not needed; you should run it now and then, as well as restore from backups. A lot of this comes down to the faith the customer has in the overall system and the importance of the data.

Bob Dorr
SQL Server Principal Escalation Engineer
