A quick note: we’re running SQL Server 2012 in-house, but the problem also seems to occur in 2008 and 2008 R2, and perhaps older versions as well.
I’ve been investigating a performance issue in some code of ours, and I’ve tracked down the problem to the following very simple query:
SELECT min(document_id)
FROM document
WHERE document_id IN (SELECT TOP 5000 document_id
                      FROM document
                      WHERE document_id > 442684)
I’ve noticed that this query takes an absurdly long time (between 18 and 70 seconds, depending on the resources of the machine running it) to return when the final value (after the greater-than) is roughly 442000 or larger. For anything below that, the query returns nearly instantly.
I’ve since tweaked the query to look like this:
SELECT min(t.document_id)
FROM (SELECT TOP 5000 document_id
      FROM document
      WHERE document_id > 442684) t
This version returns immediately for every threshold value I’ve tested.
I’ve solved the performance issue at hand, so I’m largely happy, but I’m still puzzling over why the original query performed so poorly at roughly 442000 and above, yet runs quickly for virtually any value below that (400000, 350000, etc.).
Can anyone explain this?
EDIT: Fixed the 2nd query to be min instead of max (that was a typo)
The secret to understanding performance in SQL Server (and other databases) is the execution plan. You would need to look at the execution plans for the two queries to understand what is going on.
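As a sketch of how you might compare the two plans (table and column names are taken from the question; this assumes you are running the batch from SSMS or sqlcmd):

```sql
-- Return the actual plan as an extra resultset alongside each query
SET STATISTICS PROFILE ON;

SELECT min(document_id)
FROM document
WHERE document_id IN (SELECT TOP 5000 document_id
                      FROM document
                      WHERE document_id > 442684);

SELECT min(t.document_id)
FROM (SELECT TOP 5000 document_id
      FROM document
      WHERE document_id > 442684) t;

SET STATISTICS PROFILE OFF;
```

In SSMS you can get the same information graphically via "Include Actual Execution Plan"; the point is to see which physical operators differ between the fast and slow plans.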
The first version of your query contains a join operation: IN with a subquery is just another way to express a join (specifically, a semi-join). SQL Server has several physical operators for implementing joins, such as hash match, merge join, and nested loops (the last often driven by index seeks). The optimizer chooses the one it estimates is cheapest.
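To make the semi-join explicit, the original query can be rewritten like this (an illustrative rewrite of what the optimizer sees, not necessarily the plan it chooses):

```sql
-- IN (subquery) is logically equivalent to an EXISTS semi-join
SELECT min(d.document_id)
FROM document d
WHERE EXISTS (SELECT 1
              FROM (SELECT TOP 5000 document_id
                    FROM document
                    WHERE document_id > 442684) s
              WHERE s.document_id = d.document_id);
```

The rewritten second query avoids the semi-join entirely, which is why its plan is stable: the MIN is computed directly over the derived table.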
Without seeing the execution plans, my best guess is that somewhere around 442000 the optimizer changes its mind about the best algorithm to use for the IN. In my experience, that usually means it switched from a more reasonable algorithm to nested loops.
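One way to test that hypothesis (a diagnostic experiment, not a recommended fix): pin the join strategy with a query hint and see whether the slow case becomes fast again.

```sql
-- If the slow plan really is a nested-loops semi-join, forcing a
-- hash join should restore the fast behavior for the original query
SELECT min(document_id)
FROM document
WHERE document_id IN (SELECT TOP 5000 document_id
                      FROM document
                      WHERE document_id > 442684)
OPTION (HASH JOIN);
```

If the hinted query runs quickly at 442684, that strongly suggests a cardinality-estimation tipping point is pushing the optimizer into nested loops; the derived-table rewrite you already found is the cleaner long-term fix than hard-coding a hint.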
Answered By – Gordon Linoff