Theoretically EXISTS must be faster because it just checks for whether a row exists or not (True or False), on the other hand, IN has to collect the result set before testing the condition.
Is the above statement true?
Based on my labs, the answer would be “It depends”, basically the optimizer treats EXISTS and IN the same way when possible, but it depends of your table schema, data and the way you wrote the query.
How to use it?
IN: (subquery | expression [ ,...n ])
Expression : you can pass a list of literal values IN (1,2,3,N) or a single value IN (1)
Subquery: Is a subquery that has a result set of one column. IN (SELECT A FROM B)
EXISTS: (Subquery)
Subquery: Is a subquery to compare the existence of the row (TRUE or FALSE), the result set does not matter.
The Lab (IN vs EXISTS)
Let’s prepare the environment, to do so run the following script to create two tables:
We have two tables, the “MyMainTable” (300K rows) is our main table and “MySubTable” (25% random data from “MyMainTable”) this the table that we are going to use to filter the data.
Let’s see how EXISTS and IN compare.
Please noted the division by zero in the EXISTS statement, at this point you are probably expecting an error, but the cool thing is that the field list in an EXISTS statement is not going to be executed and query will be executed.
I have forced those queries to do not run in parallel, this is to get a very simple execution plan that we can compare easily.
Here are the execution plans, noted they are identical, two clustered index scan and a hash match.
The IOs are also exactly the same.
So the query optimizer treated both queries the same way. No performance differences here.
Let’s add some indexes and see how it goes.
The execution plans are identical, though now we have an index scan, a stream aggregate and a merge join due to the new indexes.
The IOs are also exactly the same.
So far the IN and EXISTS appear to perform identically, not matter if there is an index or not, even when there are nulls in either the subquery or in the outer table.
What about NOT IN vs NOT EXISTS?
Let’s see how NOT EXISTS and NOT IN compare. Please noted that I have created some indexes already.
One more time, IOs and execution plans are also exactly the same.
So far they appear to perform identically.
Hang on… The potential issue with NULL values.
If you have a NULL value in the result of the subquery or expression, IN and EXISTS will give the same result set, but NOT IN behaves differently.
Let’s alter the columns on our tables.
Let’s see how NOT EXISTS and NOT IN compare one more time.
Very different execution plans, you can see the NOT IN has bad performance.
The IOs shows exactly the same, very bad performance using the NOT IN, please noted that even with the bad performance we are still retrieving the same number of rows.
What if we insert a null value?
Let’s insert a NULL value.
The plan did not change, we still have a bad plan for the NOT IN.
The IOs are the same… Hang on, what happened here?
The NOT INT did not return rows!!! How is this possible? The answer is very simple, please see below.
IN Resolves to OR (=).
NOT IN Resolved to AND (<>).
On the other hand, NOT EXISTS work just fine since it just checks for whether a row exists or not (True or False).
Rule of thumb.
If you are going to use a subquery, then you must use EXISTS/NOT EXISTS.
If you are going to use a list of literal values, then you must use IN/NOT IN.
Enjoy!!