In relational databases, a clustered index is an index that determines the organization of records in a table.  This is as opposed to non-clustered indexes, which exist separate from the table data and contain a reference to the row of data corresponding to each node in the index.

Internally in SQL Server, data rows for tables without clustered indexes are stored as a heap.  A heap (not to be confused with the heap tree structure) is simply an unorganized list of rows.  If you insert five rows into a new table, they are added in order.  If you delete row three, its space is not reclaimed.  If you then add another row, it will try to jam it into row three’s former home if space is available and otherwise it will put it at the end.

Data for tables with clustered indexes is stored using a different structure – a B-tree.  A B-tree both stores data rows sequentially and provides a rapid search mechanism.  A clustered index seek is among the fastest actions in SQL Server.

Microsoft’s recommendation is that every SQL Server table should have a clustered index.  In fact, SQL Azure (Microsoft’s cloud-based database hosting service) requires all tables to have a clustered index.   Obviously, tables can have no more than one clustered index since you could never order the table in more than one way.

Clustered indexes provide are most efficient where some or all of these conditions are true:

  • The key is unique.  (If the key is not unique, Microsoft internally adds a row ID to make it unique.)  In fact, if you define a key as a primary key (a special case of a unique key), it will automatically become a clustered index if one does not already exist.
  • The key is constantly increasing, for example, an identity field or a transaction ID based in part on a timestamp.
  • The key is frequently used to return a range of values, such as viewing a list of log entries for a particular range.
  • The key is used to order or group results.  For example, suppose you want to display a list of transactions in date order.  If instead of ordering them by the date added, you order them by the ID field (and that ID field is used in a cluster index), the database engine does not need to do any work to order the list.

Clustered indexes are less efficient when:

  • The primary key is not sequential.  A non-sequential clustered index necessarily causes fragmentation, or, wasted space in the tree structure.
  • The key is wide (has more columns than necessary or has columns made up of varchars).
  • Some engineers swear by GUIDs (the uniqueidentifier data type) as primary keys.  There is almost never an advantage in doing so and there are plenty of reasons not to (it requires 16 bytes instead of a 4-byte integer to store; because of the random order, you can’t judge the age of a record just by seeing the unique ID nor is it obvious if there is a hole in the sequence; you can’t use >, <, or between).  If you do have your heart set on a GUID for your primary key, you can set the default to NEWSEQUENTIALID() instead of NEWID() and your GUID will be sequentially ordered (though not guaranteed to always be unique across all tables).

For help or more tips with your database design issues, contact our technology team.

image credit:: Margarit Ralev


Comments are closed.