Within Group Aggregates

From Zanecorpwiki

Jump to: navigation, search

This uses the SQL Example Data Setup.

Contents

Basic Idea

A "within group aggregate" means using the results of an aggregate function as input to. Example:[1] you have a list of suppliers and you want the cheapest supplier for each item. The fill list might be:

supplier   | item           | price
------------------------------------
Acme       | widget         |  3.00
Acme       | dongle         |  8.00
Acme       | thingy         | 10.00
Corp       | widget         |  4.50
Corp       | dongle         |  6.00
Ma and Pop | thingy         |  9.75

The desired output would be:

supplier   | item           | price
------------------------------------
Acme       | widget         |  3.00
Corp       | dongle         |  6.00
Ma and Pop | thingy         |  9.75

Conceptually, we want to "for each widget, select the supplier whose price is lowest." If we think in terms of SQL, we might naturally think that the 'MIN()' function would be involved and specifically the term 'MIN(price)' will probably show up somewhere.

You might think you could do something like:

SELECT item, MIN(price) as minprice
  FROM itemsupplier
  GROUP BY item;

But this will fail because the 'GROUP BY' happens before the 'MIN(...)', rendering the aggregate function impotent.[notes 1] Instead, you need:

SELECT is1.supplier, is1.item, is1.price
  FROM itemsupplier is1
  JOIN (SELECT item, MIN(price) as minprice
          FROM itemsupplier
          GROUP BY item) ms
  ON is1.item=ms.item AND is1.price=ms.minprice;

You could also put the ON-clause conditions in the WHERE-clause, but I prefer this form and I don't think it should matter.

There is another, more esoteric approach involving left self exclusion joins:

SELECT is1.item, is1.supplier, is1.price
  FROM itemsupplier is1
  LEFT JOIN itemsupplier is2 
    ON is1.item=is2.item AND is1.price > is2.price
  WHERE is2.item IS NULL;

The first method is probably clearer in this case, but there may be times where the second is clearer and /or faster.

You could also do the query with a correlated sub-query, but don't. The following is very expensive and will quickly choke up even on moderate sized data sets:

-- AVOID !! DO NOT USE
SELECT is1.supplier, is1.item, is1.price
  FROM itemsupplier is1
  WHERE price=
    (SELECT MIN(is2.price)
       FROM itemsupplier is2
       WHERE is2.item=is1.item);
-- AVOID !! DO NOT USE

Discussion

Why is the correlated sub-query bad? This is called a correlated sub-query, and they're not always bad, but they usually scale poorly.

In this case you're asking the DB to kick off a new query for each item in the DB. 1,000 items means 1,001 queries.

TODO: To be honest, I'm not 100% sure why join-on-aggregate works. If the is2-select results in a bad result in the first case, it seems you just be joining on that bad result. Empirically this is not the same as running the contra-indicated correlated sub-query, and obviously the values from the join feed into the aggregation query. There's some indication that in some cases, the performance of a join-on-aggregate approach would equal that of a correlated sub-query, but I'm not certain.

In general, the left exclusion join approach is the fastest/bests way, though not always the most obvious. If a more obvious approach works for a particular problem, though, don't use be a show off.

Notes

  1. TODO: I think.

References

  1. Adapted from Common MySQL Queries
Personal tools