Issue
I have a query that I run hourly and I am processing a certain dataset from this query. While processing this dataset, I need to ignore some IDs, I am currently doing this with NOT IN
, but the number of IDs I need to ignore is around 50.
The question I am wondering is, I am creating a text file in a certain pattern with the data I am processing, should I use this ignore operation directly in the query or inside the foreach pattern for better performance?
Query returns around 5000-7000 data in a dataset consists of 10M record, and I need to ignore around 50 ID from resultset.
Lets say;
$blacklist_arr = array(1,10,20,30,40,50,60,70,80,90,100); //around 50 element in array~
What I use now;
...QUERY...
resultSet.ID NOT IN (\'' . implode( "', '" , $blacklist_arr ) . '\')
What I’m planning to use;
foreach ($final_dataset as $final_data) {
...
if (!in_array($final_data, $blacklist_arr )) {
//write to file
...
edit*
Query structure is below;
SELECT *
FROM
(
(
SELECT DISTINCT a.col1, a.col2, a.col3, a.col4,..., a.coln
FROM
`a`
INNER JOIN ( SELECT MAX( b.col4 ) AS X, b.col2 FROM `a` AS `b` GROUP BY b.col2 ORDER BY NULL ) sub ON ( sub.X = a.col4 )
WHERE
( a.someColumn > NOW( ) - INTERVAL 2 HOUR )
AND ( a.col3 < DATE_HERE )
) UNION
(
SELECT a.col1, a.col2, a.col3, a.col4,..., a.coln
FROM
`a`
WHERE
( a.someColumn >= DATE_SUB( NOW( ), INTERVAL 3 MONTH ) AND a.col4 IS NULL )
AND ( a.col3 < DATE_HERE )
)
) AS resultSet
WHERE
resultSet.col1 NOT IN ( 1,10,20,30,40,50,60,70,80,90,100 )
ORDER BY
resultSet.col3 ASC,
resultSet.col2 ASC,
resultSet.col4 ASC,
resultSet.col1 DESC
Solution
A variety of points:
-
I have a "Rule of Thumb": "If a possible optimization is estimated to improve things by less than 10%, move on. That is, don’t spend extra effort on it. Instead, look for something better to work on." According to your numbers, the optimization decreases the result set by only about 1%.
-
There is a standard programming rule: "KISS". Which is simpler to code — the
NOT IN
or the PHP filtering? A variant: "Which approach is fewer keystrokes?" That comes from "A Programmer’s time is much more valuable than computer time. -
Moving the
NOT IN
into each subquery may speed it up slightly. This is because it would decrease (slightly) the intermediate tables involved in the query. (However, this fails the 10% and KISS rules.) On the other hand, it could eliminate the outermost Select. Note: This works:(SELECT ...) UNION (SELECT ...) ORDER BY...
. -
Potential bug: The innermost Select may be picking a date & time from one of the excluded col1’s.
-
UNION
defaults toUNION DISTINCT
, which is slower thanUNION ALL
. Consider this as a bigger optimization. -
ON ( sub.X = a.col4)
probably needs to mentioncol2
. -
Is
DATE_HERE
somehow related toNOW()
? Perhaps you needTIMESTAMP
instead ofDATETIME
or vice versa? -
I suspect that the
DISTINCT
is not needed. Anyway, it is redundant with theUNION
. -
Consider whether the "blacklist" should be a table, not a config file. As a table,
NOT EXISTS(..)
orLEFT JOIN .. IS NOT NULL
would need to be added to the query. This would be slower than what you have now but might be "cleaner". -
WHERE 1=1
is an artifact of lazy programming; it is not an optimization; the Optimizer will simply toss it. -
Often, better indexes provide the most improvement. Maybe the following would help. Note: Separate, single-column indexes are not as good. Also, when adding
INDEX(a,b)
, dropINDEX(a)
.a (as b): INDEX(col2, col4) -- this order a: INDEX(col4, col3, someColumn) -- col4 first
Answered By – Rick James
Answer Checked By – Candace Johnson (BugsFixing Volunteer)