[SOLVED] Faster way to delete matching rows?

Issue

I’m a relative novice when it comes to databases. We are using MySQL and I’m currently trying to speed up a SQL statement that seems to take a while to run. I looked around on SO for a similar question but didn’t find one.

The goal is to remove all the rows in table A that have a matching id in table B.

I’m currently doing the following:

DELETE FROM a WHERE EXISTS (SELECT b.id FROM b WHERE b.id = a.id);

There are approximately 100K rows in table a and about 22K rows in table b. The column ‘id’ is the PK for both tables.

This statement takes about 3 minutes to run on my test box – Pentium D, XP SP3, 2GB ram, MySQL 5.0.67. This seems slow to me. Maybe it isn’t, but I was hoping to speed things up. Is there a better/faster way to accomplish this?


EDIT:

Some additional information that might be helpful. Tables A and B have the same structure as I’ve done the following to create table B:

CREATE TABLE b LIKE a;

Table a (and thus table b) has a few indexes to help speed up queries that are made against it. Again, I’m a relative novice at DB work and still learning. I don’t know how much of an effect, if any, this has on things. I assume that it does have an effect as the indexes have to be cleaned up too, right? I was also wondering if there were any other DB settings that might affect the speed.

Also, I’m using INNO DB.


Here is some additional info that might be helpful to you.

Table A has a structure similar to this (I’ve sanitized this a bit):

DROP TABLE IF EXISTS `frobozz`.`a`;
CREATE TABLE  `frobozz`.`a` (
  `id` bigint(20) unsigned NOT NULL auto_increment,
  `fk_g` varchar(30) NOT NULL,
  `h` int(10) unsigned default NULL,
  `i` longtext,
  `j` bigint(20) NOT NULL,
  `k` bigint(20) default NULL,
  `l` varchar(45) NOT NULL,
  `m` int(10) unsigned default NULL,
  `n` varchar(20) default NULL,
  `o` bigint(20) NOT NULL,
  `p` tinyint(1) NOT NULL,
  PRIMARY KEY  USING BTREE (`id`),
  KEY `idx_l` (`l`),
  KEY `idx_h` USING BTREE (`h`),
  KEY `idx_m` USING BTREE (`m`),
  KEY `idx_fk_g` USING BTREE (`fk_g`),
  KEY `fk_g_frobozz` (`id`,`fk_g`),
  CONSTRAINT `fk_g_frobozz` FOREIGN KEY (`fk_g`) REFERENCES `frotz` (`g`)
) ENGINE=InnoDB AUTO_INCREMENT=179369 DEFAULT CHARSET=utf8 ROW_FORMAT=DYNAMIC;

I suspect that part of the issue is there are a number of indexes for this table.
Table B looks similar to table B, though it only contains the columns id and h.

Also, the profiling results are as follows:

starting 0.000018
checking query cache for query 0.000044
checking permissions 0.000005
Opening tables 0.000009
init 0.000019
optimizing 0.000004
executing 0.000043
end 0.000005
end 0.000002
query end 0.000003
freeing items 0.000007
logging slow query 0.000002
cleaning up 0.000002

SOLVED

Thanks to all the responses and comments. They certainly got me to think about the problem. Kudos to dotjoe for getting me to step away from the problem by asking the simple question "Do any other tables reference a.id?"

The problem was that there was a DELETE TRIGGER on table A which called a stored procedure to update two other tables, C and D. Table C had a FK back to a.id and after doing some stuff related to that id in the stored procedure, it had the statement,

DELETE FROM c WHERE c.id = theId;

I looked into the EXPLAIN statement and rewrote this as,

EXPLAIN SELECT * FROM c WHERE c.other_id = 12345;

So, I could see what this was doing and it gave me the following info:

id            1
select_type   SIMPLE
table         c
type          ALL
possible_keys NULL
key           NULL
key_len       NULL
ref           NULL
rows          2633
Extra         using where

This told me that it was a painful operation to make and since it was going to get called 22500 times (for the given set of data being deleted), that was the problem. Once I created an INDEX on that other_id column and reran the EXPLAIN, I got:

id            1
select_type   SIMPLE
table         c
type          ref
possible_keys Index_1
key           Index_1
key_len       8
ref           const
rows          1
Extra         

Much better, in fact really great.

I added that Index_1 and my delete times are in line with the times reported by mattkemp. This was a really subtle error on my part due to shoe-horning some additional functionality at the last minute. It turned out that most of the suggested alternative DELETE/SELECT statements, as Daniel stated, ended up taking essentially the same amount of time and as soulmerge mentioned, the statement was pretty much the best I was going to be able to construct based on what I needed to do. Once I provided an index for this other table C, my DELETEs were fast.

Postmortem:
Two lessons learned came out of this exercise. First, it is clear that I didn’t leverage the power of the EXPLAIN statement to get a better idea of the impact of my SQL queries. That’s a rookie mistake, so I’m not going to beat myself up about that one. I’ll learn from that mistake. Second, the offending code was the result of a ‘get it done quick’ mentality and inadequate design/testing led to this problem not showing up sooner. Had I generated several sizable test data sets to use as test input for this new functionality, I’d have not wasted my time nor yours. My testing on the DB side was lacking the depth that my application side has in place. Now I’ve got the opportunity to improve that.

Reference: EXPLAIN Statement

Solution

Deleting data from InnoDB is the most expensive operation you can request of it. As you already discovered the query itself is not the problem – most of them will be optimized to the same execution plan anyway.

While it may be hard to understand why DELETEs of all cases are the slowest, there is a rather simple explanation. InnoDB is a transactional storage engine. That means that if your query was aborted halfway-through, all records would still be in place as if nothing happened. Once it is complete, all will be gone in the same instant. During the DELETE other clients connecting to the server will see the records until your DELETE completes.

To achieve this, InnoDB uses a technique called MVCC (Multi Version Concurrency Control). What it basically does is to give each connection a snapshot view of the whole database as it was when the first statement of the transaction started. To achieve this, every record in InnoDB internally can have multiple values – one for each snapshot. This is also why COUNTing on InnoDB takes some time – it depends on the snapshot state you see at that time.

For your DELETE transaction, each and every record that is identified according to your query conditions, gets marked for deletion. As other clients might be accessing the data at the same time, it cannot immediately remove them from the table, because they have to see their respective snapshot to guarantee the atomicity of the deletion.

Once all records have been marked for deletion, the transaction is successfully committed. And even then they cannot be immediately removed from the actual data pages, before all other transactions that worked with a snapshot value before your DELETE transaction, have ended as well.

So in fact your 3 minutes are not really that slow, considering the fact that all records have to be modified in order to prepare them for removal in a transaction safe way. Probably you will “hear” your hard disk working while the statement runs. This is caused by accessing all the rows.
To improve performance you can try to increase InnoDB buffer pool size for your server and try to limit other access to the database while you DELETE, thereby also reducing the number of historic versions InnoDB has to maintain per record.
With the additional memory InnoDB might be able to read your table (mostly) into memory and avoid some disk seeking time.

Answered By – Daniel Schneller

Answer Checked By – Terry (BugsFixing Volunteer)

Leave a Reply

Your email address will not be published. Required fields are marked *