[SOLVED] MySQL NOT IN (array[]) vs PHP in_array(array[])?

Issue

I have a query that I run hourly and I am processing a certain dataset from this query. While processing this dataset, I need to ignore some IDs, I am currently doing this with NOT IN, but the number of IDs I need to ignore is around 50.

The question I am wondering is, I am creating a text file in a certain pattern with the data I am processing, should I use this ignore operation directly in the query or inside the foreach pattern for better performance?

Query returns around 5000-7000 data in a dataset consists of 10M record, and I need to ignore around 50 ID from resultset.

Lets say;

$blacklist_arr = array(1,10,20,30,40,50,60,70,80,90,100); //around 50 element in array~

What I use now;

...QUERY...
resultSet.ID NOT IN (\'' . implode( "', '" , $blacklist_arr ) . '\')

What I’m planning to use;

foreach ($final_dataset as $final_data) {
    ...
    if (!in_array($final_data, $blacklist_arr )) {
    //write to file
    ...

edit*
Query structure is below;

SELECT * 
FROM
    (
        (
        SELECT DISTINCT a.col1, a.col2, a.col3, a.col4,..., a.coln
        FROM
            `a`
            INNER JOIN ( SELECT MAX( b.col4 ) AS X, b.col2 FROM `a` AS `b` GROUP BY b.col2 ORDER BY NULL ) sub ON ( sub.X = a.col4 ) 
        WHERE
            ( a.someColumn > NOW( ) - INTERVAL 2 HOUR ) 
            AND ( a.col3 < DATE_HERE ) 
        ) UNION
        (
        SELECT  a.col1, a.col2, a.col3, a.col4,..., a.coln
        FROM
            `a` 
        WHERE
            ( a.someColumn >= DATE_SUB( NOW( ), INTERVAL 3 MONTH ) AND a.col4 IS NULL ) 
            AND ( a.col3 < DATE_HERE ) 
        ) 
    ) AS resultSet 
WHERE
    resultSet.col1 NOT IN ( 1,10,20,30,40,50,60,70,80,90,100 ) 
ORDER BY
    resultSet.col3 ASC,
    resultSet.col2 ASC,
    resultSet.col4 ASC,
    resultSet.col1 DESC

Solution

A variety of points:

  • I have a "Rule of Thumb": "If a possible optimization is estimated to improve things by less than 10%, move on. That is, don’t spend extra effort on it. Instead, look for something better to work on." According to your numbers, the optimization decreases the result set by only about 1%.

  • There is a standard programming rule: "KISS". Which is simpler to code — the NOT IN or the PHP filtering? A variant: "Which approach is fewer keystrokes?" That comes from "A Programmer’s time is much more valuable than computer time.

  • Moving the NOT IN into each subquery may speed it up slightly. This is because it would decrease (slightly) the intermediate tables involved in the query. (However, this fails the 10% and KISS rules.) On the other hand, it could eliminate the outermost Select. Note: This works: (SELECT ...) UNION (SELECT ...) ORDER BY....

  • Potential bug: The innermost Select may be picking a date & time from one of the excluded col1’s.

  • UNION defaults to UNION DISTINCT, which is slower than UNION ALL. Consider this as a bigger optimization.

  • ON ( sub.X = a.col4) probably needs to mention col2.

  • Is DATE_HERE somehow related to NOW()? Perhaps you need TIMESTAMP instead of DATETIME or vice versa?

  • I suspect that the DISTINCT is not needed. Anyway, it is redundant with the UNION.

  • Consider whether the "blacklist" should be a table, not a config file. As a table, NOT EXISTS(..) or LEFT JOIN .. IS NOT NULL would need to be added to the query. This would be slower than what you have now but might be "cleaner".

  • WHERE 1=1 is an artifact of lazy programming; it is not an optimization; the Optimizer will simply toss it.

  • Often, better indexes provide the most improvement. Maybe the following would help. Note: Separate, single-column indexes are not as good. Also, when adding INDEX(a,b), drop INDEX(a).

    a (as b):  INDEX(col2,  col4)  -- this order
    a:  INDEX(col4, col3, someColumn)  -- col4 first
    

Answered By – Rick James

Answer Checked By – Candace Johnson (BugsFixing Volunteer)

Leave a Reply

Your email address will not be published. Required fields are marked *