Available in versions: Dev (3.19) | Latest (3.18) | 3.17 | 3.16 | 3.15 | 3.14 | 3.13 | 3.12 | 3.11 | 3.10 | 3.9

SQL: SELECT DISTINCT

Applies to ✅ Open Source Edition   ✅ Express Edition   ✅ Professional Edition   ✅ Enterprise Edition

The SELECT DISTINCT syntax can occasionally be useful to remove duplicates in case your schema isn't normalised. For example, in a log table, you may want to query for distinct IP addresses:

SELECT DISTINCT ip FROM access_log
create.selectDistinct(ACCESS_LOG.IP).from(ACCESS_LOG).fetch();

This is especially useful for ad-hoc querying.

A lot of times, when removing such duplicates in application queries, you'll want to add aggregate functions (e.g. number of log entries per IP, number of distinct logins per IP, etc.), so you'll use GROUP BY, instead of the DISTINCT clause.

However, be wary of using DISTINCT for removing duplicates that you're not entirely sure why they happen, i.e. typically in very large and complex queries, but also a lot of times when there are (INNER) JOIN expressions, when in fact there should be SEMI JOIN expressions. For example, the following query looks for all authors of books available in book stores starting with the letter A:

SELECT DISTINCT
  author.first_name,
  author.last_name
FROM author
JOIN book ON author.id = book.author_id
JOIN book_to_book_store
  ON book.id = book_to_book_store.book_id
WHERE book_to_book_store.name LIKE 'A%'
create.selectDistinct(
          AUTHOR.FIRST_NAME,
          AUTHOR.LAST_NAME)
      .from(AUTHOR)
      .join(BOOK).on(AUTHOR.ID.eq(BOOK.AUTHOR_ID))
      .join(BOOK_TO_BOOK_STORE)
        .on(BOOK.ID.eq(BOOK_TO_BOOK_STORE.BOOK_ID))
      .where(BOOK_TO_BOOK_STORE.NAME.like("A%")).fetch();

Now, due to the nature of our schema:

  • Authors have written many books
  • Books are available from multiple book stores

Hence, it is very likely that the same author appears in the result multiple times, thus DISTINCT. From a database performance perspective, this may be very bad, because the authors are duplicated into some internal data structure before they're removed again from the result. Besides, depending on the optimiser, we might have to scan all the books and all the book stores even if we already found an author. Optimisers are very different, so this may or may not be a problem of equal severity in each RDBMS, but certainly something to keep an eye out for.

A much better solution here is to use a SEMI JOIN, e.g. using the IN predicate or EXISTS predicate (they're semantically equivalent):

SELECT
  author.first_name,
  author.last_name
FROM author
WHERE author.id IN (
  SELECT book.id
  FROM book
  JOIN book_to_book_store
    ON book.id = book_to_book_store.book_id
  WHERE book_to_book_store.name LIKE 'A%'
)
create.select(
          AUTHOR.FIRST_NAME,
          AUTHOR.LAST_NAME)
      .from(AUTHOR)
      .where(AUTHOR.ID).in(
          select(BOOK.ID)
          .from(BOOK)
          .join(BOOK_TO_BOOK_STORE)
            .on(BOOK.ID.eq(BOOK_TO_BOOK_STORE.BOOK_ID))
          .where(BOOK_TO_BOOK_STORE.NAME.like("A%")
      ).fetch();

Now, we'll never get any duplicate authors.

If in doubt, check your execution plan and execution metrics about what is the best query here, but in a lot of cases, DISTINCT is a hint that something unnecessary is being done, and then duplicates are removed again.

Feedback

Do you have any feedback about this page? We'd love to hear it!

The jOOQ Logo