Mastering Indexes in MySQL Through Simple Examples

Małgorzata Luchter

Introduction

In this article I’m going to explore possible mysql index misconceptions, basing on examples with simple sql queries.
I believe readers have a basic familiarity with the index topic and believe this article will be most useful for those who want to verify their knowledge. Nevertheless, in the first part I will present a brief summary of terminology and concepts that are going to be crucial for understanding the examples presented in consecutive sections.

Index, Explain and Explain Analyze

Indexes are structures used to accelerate finding rows with specific data. Most known types of index (Primary key, unique, index) are stored in B-trees[1]. A B-tree is a type of self-balancing tree structure designed to keep data sorted while supporting efficient searches, insertions and deletions— all in logarithmic time. Typically the primary key is used as a clustered index. The clustered index defines the physical order of data in a table. This means the table’s data is stored in the B-Tree along with the primary key. As for other indexes, secondary indexes, their node contains an index value and a pointer to original table. It is possible to create multi-column (level) indexes. In that case each B-tree node stores combined values, ordered lexicographically, based on the order of the columns defined in the index[2]. Indexes are utilized best for the columns with high uniqueness and for a very selective queries. Indexes create overhead for write, update or delete operations, as well as for maintenance, so indexing most of the columns is not possible. When running queries that retrieve the majority of rows, sequential reading directly from the table can be faster[3]. There are two helpful sql commands to retrieve information about the MySQL query execution path or rather the so-called execution plan that is chosen by a database optimizer: “EXPLAIN” and “EXPLAIN ANALYZE”. The main difference between them is that the latter actually runs the query, it also contains the actual execution time and the number of the searched rows. If you haven’t had the chance to use them, please see the enclosed links at the end of this article[4,5,6].

The EXPLAIN response contains among others:

type: describes access methods to collect data, some possible results:

  • const: when using a primary key or a unique index with the “=” operator
  • Ref: when retrieving data using the indexed column and the “=” operator
  • Range: retrieving data using the indexed column and the range operators like “>”,”<“, “IN”, “BETWEEN”, “LIKE”
  • Index: retrieving data using the indexed column, the process requires scanning the whole index b-tree structure.
  • All: retrieving data using the full table scan.

possible keys and key are numbers of the considered and actually used keys. rows predicted numbers of rows to be scanned.

EXPLAIN ANALYZE shows the stages of conducting the query. The highest level shows the total time of execution. Let’s look at the example output of this command:

-> Filter: (student.id < 4)  (cost=0.861 rows=3) (actual time=0.0165..0.0182 rows=3 loops=1)
    -> Covering index range scan on student using PRIMARY over (id < 4)  (cost=0.861 rows=3) (actual time=0.0153..0.0165 rows=3 loops=1)

(cost=0.861 rows=3)– predicted cost and rows to scan.
(actual time=0.0165..0.0182 rows=3 loops=1)– 0.0165 time when query started searching rows, 0.0182– total time, rows– actual number of rows scanned

Configuration

For the following examples I used Mysql version 9.2.0 with the InnoDb engine. I created a table called “students” and populated it with the 5000 records, crafted with Datafaker library for java[7].

 CREATE TABLE `student` (
   `id` bigint NOT NULL,
   `last_name` varchar(50) NOT NULL,
   `first_name` varchar(50) NOT NULL,
   `birthday` date DEFAULT NULL,
   `is_promoted` bit(1) DEFAULT NULL,
   `words` varchar(255) DEFAULT NULL,  
    PRIMARY KEY (`id`)
 ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;

Single column index examples

So, let’s begin with the first, rather easy example.

Create an index for “first_name” column:

 ALTER TABLE student ADD INDEX (first_name);

Column “first_name” is of the “varchar” type. Suppose we aim to retrieve data about students whose names do not begin with the letter “S”. Will our index help us with the query that uses the wildcard “%” and the “NOT LIKE” constraint?

SELECT first_name, last_name FROM student WHERE first_name NOT LIKE "S%";

The answer is NO. See the EXPLAIN output below:

+--+-----------+-------+----------+----+-------------+----+-------+----+----+--------+-----------+
|id|select_type|table  |partitions|type|possible_keys|key |key_len|ref |rows|filtered|Extra      |
+--+-----------+-------+----------+----+-------------+----+-------+----+----+--------+-----------+
|1 |SIMPLE     |student|null      |ALL |null         |null|null   |null|4995|88.89   |Using where|
+--+-----------+-------+----------+----+-------------+----+-------+----+----+--------+-----------+

Type the ALL means approach with a full table scan. Does this imply that wildcards queries aren’t optimized by indexes, or is there another explanation??

Let’s consider the opposite query:

SELECT first_name, last_name FROM student WHERE first_name  LIKE "S%";

And output from EXPLAIN:

+--+-----------+-------+----------+-----+-------------+------------+-------+----+----+--------+---------------------+
|id|select_type|table  |partitions|type |possible_keys|key         |key_len|ref |rows|filtered|Extra                |
+--+-----------+-------+----------+-----+-------------+------------+-------+----+----+--------+---------------------+
|1 |SIMPLE     |student|null      |range| first_name  | first_name |202    |null|330 |100     |Using index condition|
+--+-----------+-------+----------+-----+-------------+------------+-------+----+----+--------+---------------------+

Now we can see that indexes can be used with wildcards. There’s just no point in using them for a previous query that will need to retrieve most of the rows from the original table anyway. Note that if the searched phrase starts with a wildcard e.g “%s”, the MySQL uses the full table scan as well. The Range type indicates that MySQL utilizes the “first_name” index to fetch rows falling within a defined range of values, limited by the “LIKE” operator in this example.

Staying with “first_name” filtering, which query will be faster?

1.

SELECT first_name, count(first_name) FROM student  WHERE first_name like "S%" GROUP BY first_name;

Or

SELECT first_name, count(first_name)  FROM student WHERE first_name LIKE "S%" AND is_promoted=1 GROUP BY first_name;

With additional filtering (which, in our example, eliminates around 50% more rows), the query has the potential to execute more quickly. However, since the “is_promoted” column is not part of the index, the execution plan changes significantly. The first query operates solely on the index structure, performing the so-called COVERING index scan. This type of scan eliminates the need to access the actual “Student” table. Conversely, retrieving data from the “is_promoted” column requires fetching it from the original table. In this example, the first query achieved the speed of 0.39 ms, while the second query was completed in 1.23 ms, so the difference is huge. Since the column used for aggregation is part of the index and index B-structure is already sorted, the grouping phase is significantly faster and avoids the need for a temporary table to gather results.

See the EXPLAIN ANALYZE results for both queries:

1.

-> Group aggregate: count(first_name)  (cost=150 rows=330) (actual time=0.0323..0.39 rows=192 loops=1)
    -> Filter: (student.`first_name` like 'S%')  (cost=74.2 rows=330) (actual time=0.0282..0.288 rows=330 loops=1)
        -> Covering index range scan on student using first_name over ('S' <= first_name <= 'S??????????????????????????????????????????????????????')  
        (cost=74.2 rows=330) (actual time=0.0261..0.204 rows=330 loops=1)

2.

-> Group aggregate: count(first_name)  (cost=225 rows=330) (actual time=0.162..1.23 rows=120 loops=1)
    -> Filter: (student.is_promoted=1)  (cost=149 rows=330) (actual time=0.154..1.15 rows=174 loops=1)
        -> Index range scan on student using first_name over ('S' <= first_name <= 'S??????????????????????????????????????????????????????'), 
        with index condition: (student.`name` like 'S%')  (cost=149 rows=330) (actual time=0.152..1.1 rows=330 loops=1)

Finishing with our simple index, how will it handle a query with a function in the “where” clause?

SELECT last_name  FROM student WHERE upper(first_name)="VINCENT";

Unfortunately MYSQL is helpless here and falls back to a full table scan. Using a function on the left side of the condition in the ‘WHERE’ clause prevents the index from being utilized. Maybe changing that part to “where first_name=’VINCENT’ or first_name=’Vincent'” could fulfill one’s needs and benefit from the index presence.

Multi column index examples

Let’s drop our index and create a composite (multi column) one.

DROP INDEX first_name ON student
ALTER TABLE student ADD INDEX multi (first_name, last_name);

In mysql documentation we can read “If the table has a multiple-column index, any leftmost prefix of the index can be used by the optimizer to look up rows”[8]. So the answer is NO. And because we need the “birthday” column as well, a full table scan is performed.

 -> Filter: (student.last_name like 'Collins')  (cost=506 rows=555) (actual time=0.179..3.41 rows=8 loops=1)
    -> Table scan on student  (cost=506 rows=4995) (actual time=0.14..2.71 rows=5000 loops=1)

Moving to a little bit more complex “WHERE” clause:

Does the order of conditions in the “WHERE” query part matter?

SELECT first_name, last_name FROM student  WHERE student.last_name = "Collins" AND student.first_name="Russ";

Luckily for us, no:

+--+-----------+-------+----------+----+-------------+-----+-------+-----------+----+--------+-----+
|id|select_type|table  |partitions|type|possible_keys|key  |key_len|ref        |rows|filtered|Extra|
+--+-----------+-------+----------+----+-------------+-----+-------+-----------+----+--------+-----+
|1 |SIMPLE     |student|null      |ref |multi        |multi|404    |const,const|1   |100     |null |
+--+-----------+-------+----------+----+-------------+-----+-------+-----------+----+--------+-----+

As you can see, the ref type means that our multi-column index was used.

What happens if we change “AND” to “OR”?

SELECT first_name, last_name FROM student  WHERE student.last_name = "Collins" OR student.first_name="Russ";

As you can assume, now mysql needs to go through the Covering index full scan.

+--+-----------+-------+----------+------+-------------+----+-------+----+----+--------+-----------+
|id|select_type|table  |partitions|type  |possible_keys|key |key_len|ref |rows|filtered|Extra      |
+--+-----------+-------+----------+------+-------------+----+-------+----+----+--------+-----------+
|1 |SIMPLE     |student|null      |index |multi        |null|null   |null|4995|10.03   |Using where|
+--+-----------+-------+----------+------+-------------+----+-------+----+----+--------+-----------+

Here, only a separate index on both tables could help.

And, finally, the last riddle.

Let’s drop the previous index and add a new one:

DROP INDEX multi ON student;

ALTER TABLE student ADD INDEX multi (last_name, is_promoted);

And consider the following query:

SELECT last_name, count(last_name) FROM student WHERE student.is_promoted=1 GROUP BY last_name ORDER BY last_name;

As we know, MySQL first processes the ‘WHERE’ clause and then aggregates the results using the information from the column specified in the ‘GROUP BY’ statement. Note that the ‘WHERE’ clause condition relies on the ‘is_promoted’ column, which corresponds to the second segment of the multi-index. Is it possible for this query to utilize the index, considering the left part rule?

This is the answer from EXPLAIN ANALYZE, when we use the previously created multi index:

-> Group aggregate: count(student.last_name)  (cost=1082 rows=473) (actual time=0.124..2.83 rows=468 loops=1)
    -> Filter: (student.is_promoted=1)  (cost=506 rows=2498) (actual time=0.115..2.24 rows=2523 loops=1)
        -> Covering index scan on student using multi  (cost=506 rows=4995) (actual time=0.112..1.68 rows=5000 loops=1)

And let’s compare it with the result, when we use the IGNORE INDEX hint[9].

This is a query that uses the hint above:

SELECT last_name, count( last_name) FROM student IGNORE INDEX (multi) WHERE student.is_promoted=1 GROUP BY last_name order by last_name

And the response from EXPLAIN ANALYZE:

-> Sort: student.last_name  (actual time=4.5..4.53 rows=468 loops=1)
    -> Table scan on <temporary>  (actual time=4.2..4.3 rows=468 loops=1)
        -> Aggregate using temporary table  (actual time=4.2..4.2 rows=468 loops=1)
            -> Filter: (student.is_promoted=1)  (cost=506 rows=2498) (actual time=0.139..2.51 rows=2523 loops=1)
                -> Table scan on student  (cost=506 rows=4995) (actual time=0.137..1.93 rows=5000 loops=1)

We can observe that without an index,the MySQL had to resort to a table scan, use a temporary table, and perform an additional sorting step, leading to an increase in processing time of nearly 2 milliseconds. Actually, there is a whole paragraph in mysql documentation on the optimization of GROUP BY queries[10].

Summary

As we could see, the MySQL query optimization process using indexes is quite well crafted and sophisticated. There are some rules of choosing the right index that must be take into account. And it is always safer to double check the expected result with the EXPLAIN and/or the EXPLAIN ANALYZE options before introducing the index into the table.

References

  1. https://dev.mysql.com/doc/refman/9.0/en/mysql-indexes.html
  2. https://use-the-index-luke.com/sql/where-clause/the-equals-operator/concatenated-keys
  3. https://funivan.com/post/index-scan-vs-sequential-scan/
  4. https://dev.mysql.com/doc/refman/9.0/en/execution-plan-information.html
  5. https://planetscale.com/learn/courses/mysql-for-developers/queries/explain-access-types
  6. https://planetscale.com/learn/courses/mysql-for-developers/queries/explain-analyze
  7. https://github.com/datafaker-net/datafaker/
  8. https://dev.mysql.com/doc/refman/9.0/en/multiple-column-indexes.html
  9. https://dev.mysql.com/doc/refman/9.0/en/index-hints.html
  10. https://dev.mysql.com/doc/refman/9.0/en/group-by-optimization.html

Meet the geek-tastic people, and allow us to amaze you with what it's like to work with j‑labs!

Contact us