I heard a talk concerning MySQL by Joshua Thijssen some weeks ago where he spoke on Selectivity and MySQL for some minutes. I found this especially interesting and researched some more. This small blogpost is the result.
The Selectivity of a column is defined as the total amount of records devided by the cardinality (amount of unique records in a column) of the column. In "MySQL-speak" this means:
( count(column) / count(distinct column) ) * 100
This will net you a percentage, it is not uncommon to not multiply by 100 and describe the cardinality as a number between 0 and 1. Lets assume a user table with 63290 records. The cardinality of the (autoincremented, unique) id column will be 63290. The selectivity is 63290 / 63290 or 1. If however in the same user table there is a user_status column with only 3 different values the fomula would be totally different: 3 / 63290 or 0.000047401.
So why should I care?
Columns with a low selectivity are often columns which are used in queries alot (select user_name from users where user_status = ....) and it therefore may sound logical to add a key to this column. The opposite is true. I couldn't find a definitive answer but the internet agrees that a key on a column with a selectivity lower then 25-30 % is useless, or even worse slowing the queries. For a column with a selectivity below those percentages a full table scan would be faster and more efficient than using keyed columns. This sounded so contra-intuitive I had to try.
For this experiment I used a table with users on one of our live implementations. It held 65400 at the moment of querying and contains a column with a type_id which had only 4 different possible options. The type_id has a key on it as well. A simple select count(type_id) from users took 0.06 seconds. When I performed the same query with ignore key(type_id) added at the end it was twice at fast, only 0.03 seconds this time. So when I explicitly instruct MySQL to ignore the key, the query is twice as fast!
The above seems to insinuate MySQL's query optimizer does not take selectivity into account when calculating the fastes query. The internet however seems to think otherwise. I don't know what is really happening, I just know I will think twice before adding a key to a column the next time the problem arises.