SQL 2005+

SQL Server Grouping

by Richard Carr, published at http://www.blackwasp.co.uk/SQLGroupBy.aspx

The twenty-eighth part of the SQL Server Programming Fundamentals tutorial adds to the description of the use of aggregate functions within queries. In this article the GROUP BY and HAVING clauses are used to summarise groups of similar result rows.

Previous: SQL Server Basic Aggregation Functions

Download Script

Grouping Results

Is the previous article in this tutorial we examined some of the aggregate functions provided by SQL Server. These functions allow us to count rows, calculate totals and averages, and find the maximum and minimum values across a series of rows returned by a query. In each case, we performed an aggregation over a query's entire result set. This allowed us to answer questions such as "What is the entire sales income for the business?"

A common requirement is to perform similar functions over groups of related rows in a result set, calculating the aggregates for each group to provide a summary. This allows us to refine the questions asked into, for example, "What is the sales income by region for the business?" The results of this query would include one row per group with totals in each.

This type of query can be created using the GROUP BY clause within a SELECT statement. The GROUP BY clause defines one or more columns or expressions that should be used to determine which rows belong to each group. The groups can then be summarised using the appropriate aggregate functions.

JoBS Database

The examples in this article use the JoBS database. This is the database that has been created and populated with information throughout the course of the tutorial. If you do not have an up-to-date copy of the database you should create one with the script that may be downloaded using the link at the top of this page.

Using GROUP BY

The GROUP BY clause is added to the end of a query's SELECT statement. The clause is followed by a comma-separated list of the expressions for which you wish to create grouped sets. Often each item in the list will be a simple column name. However, you can create expressions that combine column names, literal values and functions to create groups.

GROUP BY expression1, expression2, ..., expressionN

As an example, we may wish to calculate the number of units of stock held by each engineer in the JoBS database. To calculate the total number of stocked parts for all of the engineers combined, we would use the following statement:

SELECT sum(UnitsHeld) FROM EngineerStock

To modify the statement to show a subtotal for each engineer individually we can add the engineer ID to the list of returned columns and to a GROUP BY clause. NB: The EngineerId is not required in the column list but without it you will not be able to see the link between each total and the engineer.

SELECT EngineerId, sum(UnitsHeld) FROM EngineerStock GROUP BY EngineerId

As with previous aggregation operations you will see that the calculated column in the results has no column name assigned. As before, we can add a column alias if desired:

SELECT EngineerId, sum(UnitsHeld) AS Units FROM EngineerStock GROUP BY EngineerId

When creating queries that use grouping, the list of columns to be returned in the results is more restricted than for a standard query. The column list may only include aggregated values and columns or expressions that appear in the GROUP BY clause's list. The following query attempts to select the PartNumber column, which meets neither criterion, so causes an error when executed.

SELECT
    EngineerId,
    PartNumber,
    sum(UnitsHeld)
FROM
    EngineerStock
GROUP BY
    EngineerId

Linking Tables

A GROUP BY clause operates on any data that is gathered by the preceding query. This makes it appropriate for queries that gather information from more than one linked table. However, there are some common mistakes that can occur when aggregating data from joined tables with grouping. When using joins for one-to-many or many-to-many relationships, queries often return some duplicated data from the joined tables. If the duplicated data includes values that are to be included within sum or avg functions they can return unexpected results due to double counting. Similarly, when using the count function, duplicates can be counted twice unless the DISTINCT clause is included.

When using small amounts of data for testing purposes it is possible that you will not notice such calculation errors. You should always pay extra attention when testing aggregates in grouped queries. It can be useful to execute queries without the grouping to help identify areas of potential duplication that are likely to cause miscalculations.

In the following query we join the Engineers and EngineerStock tables to improve the previous query. In this example we include the engineer's name in addition to their unique ID. As both the ID and name are displayed, they are both included in the GROUP BY clause. This does not generate extra groups because there is a one-to-one relationship between an engineer's name and ID.

SELECT
    E.EngineerName,
    S.EngineerId,
    sum(S.UnitsHeld)
FROM
    EngineerStock S
INNER JOIN
    Engineers E
ON
    S.EngineerId = E.EngineerId
GROUP BY
    S.EngineerId,
    E.EngineerName

Next: Page 2

28 May 2009