A check-list for using Microsoft SQL Server as a reporting database. This is a ‘be aware’ list for myself in first place, since I forget some details as time passes.
Overall database settings
Use simple recovery mode
Set up simple recovery mode for database to minimize performance issues with transaction log records. Usually you can use daily backups with DWH, since data is loaded daily. In more complex situations you can use ETL-based ‘checkpoints’, but in general, daily backups are more than enough.
Separate log and data io channels
Use decent I\O systems and separate data and log I\O channels. This is obvious, just reiterating.
RAID 1+0 seems to be preferable choice of disk configuration for DWH loads. See
this entry,
this one or test it yourself )
2008 r2
Use SQL Server 2008r2 Enterprise if you have a choice. 2008 adds table\index compression (which is great) and get’s partitioning to work way better in queries. And I’ve 20% performance speed-up against 2005 on the same query set (with same execution plans and that surprised me).
Auto shrink Off
Table configuration
identity columns as primary keys
Use
identity columns as table primary keys. And using this column as clustered index will help I\O in most cases.
Set up foreign keys
Again, obvious, but I’ve seen this violated more than once. Use foreign keys for fact-dimension table reference. This will help optimizer a lot.
You can disable these keys to speed up inserts, but in 90% cases it’s not needed.
Table partitioning
Always use partitioned tables (even fake, with 1 partition) to:
Table compression in 2008
SQL Server 2008 introduces
table compression which helps to reduce I\O (but increases CPU load), which is very good for reporting tasks. Use it for fact tables.
ETL
partition switching as default load strategy
More benefits of partition switching for loading:
- while you’re loading data in ‘new_data’ partition, base table is accessible and consistent
- after you’ve prepared ‘new_data’ partition you can check it’s data for quality (planning to write about later)
You can implement switching by creating another temporary partition, which I like to call ‘last load’. So you switch:
- current -> last_load
- new_data -> current
in one transaction
After that you have ‘last_load’ to:
- switch back in emergency
- check data from current\previous load for some more quality checks
Filtered indexes for calculations
Another 2008-only feature is called
filtered indexes. These indexes can be essentially smaller than common ones if you know for sure what region of data you’ll be accessing. And this turns out to be very efficient for SQL code you’re using for different calculations. Since I mostly work with
Oracle Data Integrator as an ETL tool and it’s an SQL-generating machine: I get up to10 times calculation speed up by building small, calculation-specific filtered indexes.
Tablock hint
All inserts should be done with
TABLOCK hint, which speeds up insert by acquiring a full table lock during the whole transaction. Since in DWH main inserts are performed by ETL there should be no conflicts. TABLOCK also speeds up logging during bcp.
Query tuning
Indexes
Really a vast topic, but to sum it up ) :
Be very-very careful, ‘your mileage may vary’:
- creating a huge number of large indexes may actually slow system down due to I\O overload
- you should try to create indexes that can help a number of reports\query instead of tuning a single request
See Books session below, skip through at least one of them before starting it out wild.
Filtered indexes
As with ETL calculations — filtered indexes may be used for tuning specific queries.
Index compression
Indexes can be compressed in the same way as tables (which is obvious given that a table is a clustered index) and this speeds queries by reducing IO
Books to read
I’d recommend following
And if you have time and will