this is a write up of info I gathered from Jay Pipes, High Performance MySQL
, Colin Charles and others
(see References below) for a Sun Tech Days MySQL presentation.
MySQL powers many high-volume Web sites, including industry leaders such as Yahoo!, Alcatel-Lucent, Google, Nokia, YouTube, and Zappos.com.
A key difference between MySQL and other database platforms is the pluggable storage engine architecture of MySQL, which allows you to select a specialized storage engine for a particular application need such as data warehousing, transaction processing, high availability...
A storage engine is responsible for storing and retrieving all the data stored . The storage engines have different functionality, capabilities and performance characteristics, in many applications choosing the right storage engine can greatly improve performance. Here is a brief summary of some of the more common engines:
Whole Numbers:
Real Numbers:
Other:
VARCHAR(n) stores variable-length character strings. VARCHAR uses only as much space as it needs, which helps performance because it saves disk space. Use when the maximum column length is larger than the average length and when updates to the field are rare, so fragmentation is not a problem. VARCHAR gives more efficient disk usage, however MySQL internal memory buffers are fixed width, so internal memory usage isn't more efficient, therefore try to keep VARCHAR Max Length short.
CHAR(n) is fixed-length: MySQL allocates enough space for the specified number of characters. Useful to store very short strings, when all the values are nearly the same length, and for data that's changed frequently.
InnoDB's clustered indexes store the row data in the leaf nodes, it's called clustered because rows with close primary key values are stored close to each other. This can make retrieving indexed data fast, since the data is in the index. But this can be slower for updates , secondary indexes, and for full table scans.
Covering Indexes are indexes that contain all the data values needed for a query, these queries can improve performance because the row does not have to be read.
Know Every SQL Statement for Hibernate/JPA developers
If you are using Hibernate, JPA, or another ORM framework you want to look at the generated SQL-statements. If you are using Hibernate, enable the Hibernate SQL log. If you are using Glassfish with Toplink as the JPA provider, you can log the SQL that is being sent to the database by adding this property in the definition of the persistence unit in the
To use it just precede a
With the help of
When do you get a full table scan?
In the example above, a fast range "access strategy" is chosen by the optimizer, and the index scan on title is used to winnow the query results down.
Below is an example where an index can not be used :
A slow full table scan (the ALL"access strategy") is used because a function (LEFT) is operating on the title column. Operating on an indexed column with a function (in this case the LEFT() function) means the optimizer cannot use the index to satisfy the query. Typically, you can rewrite queries in order to not operate on an indexed column with a function.
Although we rewrote the WHERE expression to remove the function on the index, we still have a non-deterministic function
We replaced the function with a constant (probably using our application programming language). However, we are specifying SELECT * instead of the actual fields we need from the table. What if there is a TEXT field in Orders called order_memo that we don't need to see? Well, having it included in the result means a larger result set which may not fit into the query cache and may force a disk-based temporary table. – let's fix that:
SELECT order_id, customer_id, order_total, order_created
FROM Orders WHERE order_created >= '2008-01-11' - INTERVAL 7 DAY;
If you want to temporarily override the LAZY fetch type, you could use Fetch Join. For example this query would eagerly load the employee addresses:
MySQL Pluggable Storage Engine Architecture
MySQL Storage Engine Architecture, Part 2: An In-Depth Look
Optimizing Queries with EXPLAIN
Java Persistence with Hibernate book
Jay Pipes blog
Colin Charles blog
mysql performance blog
Ronald Bradford blog
Taking JPA for a Test Drive
Pro EJB 3: Java Persistence API
Pro MySQL, Chapter 6: Benchmarking and Profiling
MySQL powers many high-volume Web sites, including industry leaders such as Yahoo!, Alcatel-Lucent, Google, Nokia, YouTube, and Zappos.com.
MySQL for Developers
If you are a developer using MySQL, you should learn enough to take advantage of its strengths, because having an understanding of the database can help you develop better-performing applications. This can be especially important for Hibernate or JPA developers, since ORM frameworks tend to obscure the database schema and SQL for the developer, which can lead to poorly-performing index and schema strategies and inefficient SQL.MySQL Storage Engine Architecture
A key difference between MySQL and other database platforms is the pluggable storage engine architecture of MySQL, which allows you to select a specialized storage engine for a particular application need such as data warehousing, transaction processing, high availability...
A storage engine is responsible for storing and retrieving all the data stored . The storage engines have different functionality, capabilities and performance characteristics, in many applications choosing the right storage engine can greatly improve performance. Here is a brief summary of some of the more common engines:
- MyISAM
- offers high-speed query and
insert capability, is nontransactional,
provides table-level locking, full-text indexing, compression, GIS
functions, and supports indexes. MyISAM is a good choice for read-only
or read-mostly applications that don't require transactions and issue
primarily either SELECT or INSERT queries, such as Web Catalogs or
listing of jobs, auctions, real estate, etc., or data warehousing
applications. You can use multiple storage engines in a single
application, sometimes specific components of an application (such as
logging) fall into this category. MyISAM is not a good general purpose
storage engine for high concurrent UPDATEs or DELETEs since these
obtain exclusive write locks, however new rows can be inserted with
shared read locks.
- InnoDB
- supports ACID
transactions, multi-versioning, row-level locking, foreign key
constraints, crash recovery, and good query performance
depending on indexes. InnoDB uses row-level locking with multiversion
concurrency control (MVCC). MVCC can allow fewer row locks by keeping
data snapshots. Depending on the isolation level, InnoDB does not
require any locking for a
SELECT. This makes high concurrency possible, with some trade-offs: InnoDB requires more disk space compared to MyISAM, and for the best performance, lots of memory is required for the InnoDB buffer pool. InnoDB is a good choice for any order processing application, any application where transactions are required. - Memory
- stores all data in RAM for
extremely fast access. Useful when you need fast access to data that
doesn't change or doesn't need to persist after a restart. Good
for "lookup" or "mapping" tables, for caching the results of
periodically aggregated data, for intermediate results when analyzing
data.
MEMORY tables do not support variable-length columns, this means that VARCHAR columns get implicitly converted to CHAR columns, and it is impossible to create a MEMORY table with TEXT or BLOB columns.
- Merge - allows to logically group together a series of identical MyISAM tables and reference them as one object. Good for very large DBs like data warehousing.
- Archive - provides for storing and retrieving large amounts of seldom-referenced historical, archived, or security audit information.
- Federated - Allows a
remote server's tables to be used as if they
were local, by linking together separate MySQL servers as one logical
database. Can be good for distributed or data mart environments.
- Cluster/NDB -Highly-available clustered storage engine. Should be used where very high speed, availability and redundancy are absolute requirements. Other tables (even in the same database) should use one or more of the other storage engines.
- CSV-references comma-separated files as database tables.Useful for large bulk imports or exports
- Blackhole- the /dev/null storage engine for temporarily disabling application input to the database . Useful for benchmarking and some replication scenarios.
Schema, the basic foundation of performance
Database normalization minimizes duplication of information, this makes updates simpler and faster because the same information doesn't have to be updated in multiple tables. In a denormalized database, information is duplicated, or stored in multiple places. With a normalized database:
- updates are usually faster.
- there's less data to change.
- tables are usually smaller, use less memory, which can give better performance.
- better performance for distinct or group by queries
Data Types
In general, try to use the smallest data type that you can. Small and simple data types usually give better performance because it means fewer disk accesses (less I/O), more data in memory, and less CPU to process operations.Numeric Data Types
MySQL has 9 numeric data types.Whole Numbers:
TINYINT, SMALLINT, MEDIUMINT, INT,
BIGINT: require 8, 16, 24, 32, and 64 bits of storage
space, respectively. They can store values from –2(n-1)
to 2(n-1)–1, where N is the number of bits of
storage space they use.
Use UNSIGNED when you don't need negative numbers, this doubles the bits of storage space. BIGINT is not needed for AUTO_INCREMENT, INT UNSIGNED stores 4.3 billion values!
INT(1) does not mean 1 digit! The number in parentheses is the ZEROFILL argument, and specifies the number of characters some tools reserve for display purposes. For storage and computational purposes, INT(1) is identical to INT(20).
Integer data types work best for primary key data types.
Use UNSIGNED when you don't need negative numbers, this doubles the bits of storage space. BIGINT is not needed for AUTO_INCREMENT, INT UNSIGNED stores 4.3 billion values!
INT(1) does not mean 1 digit! The number in parentheses is the ZEROFILL argument, and specifies the number of characters some tools reserve for display purposes. For storage and computational purposes, INT(1) is identical to INT(20).
Integer data types work best for primary key data types.
Real Numbers:
FLOAT, DOUBLE: supports approximate
calculations with standard floating-point math.
DECIMAL: use DECIMAL when you need exact results, always use for monetary/currency fields.
DECIMAL: use DECIMAL when you need exact results, always use for monetary/currency fields.
Other:
BIT: to store 0,1 values. BIT(1)
stores 1 bit, BIT(2) stores 2... max is 64.
Character Data Types
The CHAR and VARCHAR types are declared with a length that indicates the maximum number of characters to store.VARCHAR(n) stores variable-length character strings. VARCHAR uses only as much space as it needs, which helps performance because it saves disk space. Use when the maximum column length is larger than the average length and when updates to the field are rare, so fragmentation is not a problem. VARCHAR gives more efficient disk usage, however MySQL internal memory buffers are fixed width, so internal memory usage isn't more efficient, therefore try to keep VARCHAR Max Length short.
CHAR(n) is fixed-length: MySQL allocates enough space for the specified number of characters. Useful to store very short strings, when all the values are nearly the same length, and for data that's changed frequently.
Use NOT NULL
Always define columns as NOT NULL unless there is a very good reason not to do so:- can save up to a byte per column per row of data
- nullable columns make indexes, index statistics, and value comparisons more complicated.
Indexes
Indexes are data structures that help retrieve row data with specific column values faster. Indexes can especially improve performance for larger data bases. Most MySQL storage engines support B-tree indexes. a B-tree is (remember data strucutres 101 ?) a tree data structure that sorts data values, tree nodes define the upper and lower bounds of the values in the child nodes. Leaf nodes have pointers to the data instead of pointers to child nodes. B-trees are kept balanced by requiring that all leaf nodes are at the same depth. MyISAM leaf nodes refer to the indexed row by the position of the row, InnoDB leaf nodes refers to the index by its primary key values.InnoDB's clustered indexes store the row data in the leaf nodes, it's called clustered because rows with close primary key values are stored close to each other. This can make retrieving indexed data fast, since the data is in the index. But this can be slower for updates , secondary indexes, and for full table scans.
Covering Indexes are indexes that contain all the data values needed for a query, these queries can improve performance because the row does not have to be read.
Know Every SQL Statement
You need to understand the SQL queries your application makes and evaluate their performance. For this you can harness the MySQL slow query log and use EXPLAIN.Consider adding
an index for queries
that are taking too long, or cause a lot of load. Know how your query
is executed by MySQL, consider disk I/O in
particular. Basically you want to make your queries access less
data, Indexes are a good way to reduce data access. Know Every SQL Statement for Hibernate/JPA developers
If you are using Hibernate, JPA, or another ORM framework you want to look at the generated SQL-statements. If you are using Hibernate, enable the Hibernate SQL log. If you are using Glassfish with Toplink as the JPA provider, you can log the SQL that is being sent to the database by adding this property in the definition of the persistence unit in the
persistence.xml
file, as follows: <property name="toplink.logging.level"
value="FINE">. Then run a use case of your application and
examine the SQL statements which are executed by JPA or your ORM
framework. You want to make sure you are only retrieving the data your
application needs, that is you want to optimize the number of SQL
statements executed (see lazy loading below), and you want to examine
the execution plan for queries.
Optimize data access
Basically you want to make your queries access less data:- is your application retrieving more data than it needs, are
queries accessing too many rows or columns?
- is MySQL analyzing more rows than it needs?
Understanding EXPLAIN
EXPLAIN provides the execution plan chosen by the MySQL optimiser for a specific SELECT statement. It is important to run EXPLAIN on all SELECT statements that your code is executing against the database. This step ensures that missing indexes are picked up early in the development process and gives developers insight into how the MySQL optimizer has chosen to execute the query.To use it just precede a
SELECT statement
with the keyword EXPLAIN and MySQL
will display information from the optimizer about the query execution
plan, which explains how it would process the SELECT,
including information about how tables are joined and in which order.
The EXPLAIN EXTENDED SELECT variant, followed by SHOW WARNINGS gives
some additional information, including the final rewritten query as
used by the optimizer.
With the help of
EXPLAIN, you can see
where you should add indexes to tables to get a faster SELECT that uses indexes to find rows. You can
also use EXPLAIN to check whether the
optimizer joins the tables in an optimal order.
EXPLAIN returns a row of
information for each "table"
used in the SELECT statement, which shows
each part and the order of the execution plan. The "table" can
mean a real schema table, a derived or temporary table, a subquery, a
union result. Here is an explanation of some of the output for each
row:
- table: the real table name or an alias.
- select_type: shows
whether the row is a simple or complex
SELECT.If complex it can be: PRIMARY (Outermost SELECT), UNION, SUBQUERY, DERIVED, or UNCACHEABLE. - type: shows the "access strategy" type to find rows.
The different access types ordered from the best to worst:
- system, or const: very fast because the table has at most one matching row (For example a primary key used in the WHERE)
- eq_ref: index lookup returning one value
- ref: index access
- fulltext: FULLTEXT index.
- ref_or_null: like ref, but with
an
extra search for rows that contain NULL values.
- range: index scan to select the rows in given
range
- index: index tree is scanned.
- ALL: A full table scan
- key: the index that MySQL
decided to use, from the possible_keys, to look up rows.
- rows: the number of rows MySQL estimates it must examine to execute the query.
- ref: columns or constants which are compared to key to select rows .
- Extra: additional
information about how MySQL resolves the query. Watch out for Extra
values of
Using filesortandUsing temporary.Using indexmeans information is retrieved from the table using only information in the index tree without having to do an additional seek to read the actual row. This strategy can be used when the query uses only columns that are part of a single index (Covering Index).
When do you get a full table scan?
- No WHERE condition
- No index on any field in WHERE condition
- When your range returns a large number of rows, i.e. too many records in WHERE condition
- When optimizer sees that the condition will return > ~20% of the records in a table, it will prefer a scan versus many seeks
- Pre-MySQL 5, using OR in a WHERE clause
- now fixed with an index merge, so the optimiser can use more than one index to satisfy a join condition
In the EXPLAIN output, the “type” for the table/set will be
“ALL” or “index”. “ALL” means a full table data record scan is
performed. “index” means a full index record scan. Avoid
this by ensuring indexes are on columns that are used in the WHERE, ON,
and GROUP BY clauses.
Scans and seeks
A seek jumps into a random place (on disk or in memory) to fetch data. A scan will jump to the start of the data, and sequentially read (from either disk or memory) until the end of the data. With large amounts of data sequentially scanning through contiguous data on disk or in memory is faster than performing many random seek operations. But scans can also be a sign of poor indexing, if an index contains many rows with the same value it is not very useful. MySQL keeps stats about the uniqueness of values in an index in order to the estimated rows returned (rows in the explain output). If the rows is greater than a certain % of rows in the table, then MySQL will do a scan.
Indexed columns and functions don't mix
Indexes can quickly find the rows that match a WHERE clause, however this works only if the index is NOT used in a function or expression in the WHERE clause. Below is an example where an index is used :mysql> EXPLAIN SELECT * FROM film WHERE title LIKE 'Tr%'\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: film
type: range
possible_keys: idx_title
key: idx_title
key_len: 767
ref: NULL
rows: 15
Extra: Using whereIn the example above, a fast range "access strategy" is chosen by the optimizer, and the index scan on title is used to winnow the query results down.
Below is an example where an index can not be used :
mysql> EXPLAIN SELECT * FROM film WHERE LEFT(title,2)
= 'Tr' \G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: film
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 951
Extra: Using whereA slow full table scan (the ALL"access strategy") is used because a function (LEFT) is operating on the title column. Operating on an indexed column with a function (in this case the LEFT() function) means the optimizer cannot use the index to satisfy the query. Typically, you can rewrite queries in order to not operate on an indexed column with a function.
Solving multiple problems in one query
Here is an example of improving a query:
SELECT * FROM Orders WHERE TO_DAYS(CURRENT_DATE()) – TO_DAYS(order_created) <= 7;
First, we are operating on an indexed column (order_created)
with a function TO_DAYS – let's fix that:SELECT * FROM Orders WHERE order_created >= CURRENT_DATE() - INTERVAL 7 DAY;Although we rewrote the WHERE expression to remove the function on the index, we still have a non-deterministic function
CURRENT_DATE()
in the statement, which eliminates this query from being placed in the
query cache. Any time a
non-deterministic function is used in a SELECT statement, the query
cache ignores the query. In read-intensive applications, this can
be a significant performance problem. –
let's fix that:SELECT * FROM Orders WHERE order_created >= '2008-01-11' - INTERVAL 7 DAY;We replaced the function with a constant (probably using our application programming language). However, we are specifying SELECT * instead of the actual fields we need from the table. What if there is a TEXT field in Orders called order_memo that we don't need to see? Well, having it included in the result means a larger result set which may not fit into the query cache and may force a disk-based temporary table. – let's fix that:
SELECT order_id, customer_id, order_total, order_created
FROM Orders WHERE order_created >= '2008-01-11' - INTERVAL 7 DAY;
A Few Things to consider for JPA/Hibernate devlopers:
Lazy loading and JPA
With JPA many-to-one and many-to-many relationships lazy load by default , meaning they will be loaded when the entity in the relationship is accessed. Lazy loading is usually good, but if you need to access all of the "many" objects in a relationship, it will cause n+1 selects where n is the number of "many" objects. You can change the relationship to be loaded eagerly as follows :public class Employee{
@OneToMany(mappedBy = "employee", fetch = FetchType.EAGER)
private Collection<Address> addresses;
.....
}
However you should be careful with eager loading which could cause
SELECT statements that fetch too much data. It can cause a Cartesian
product if you eagerly load entities with several related
collections. If you want to temporarily override the LAZY fetch type, you could use Fetch Join. For example this query would eagerly load the employee addresses:
@NamedQueries({ @NamedQuery(name="getItEarly",
query="SELECT e FROM Employee e JOIN FETCH e.addresses")})
public class Employee{
.....
}
Optimistic locking and JPA
The Java Persistence API 1.0 persistence providers support optimistic locking, where your application will get an exception whenever a transaction tries to commit an object that was updated in the database since this transaction began. To enable this for a entity, you need to add a version attribute:@Entity
public class Item {
...
@Version
private int version;
...
}In Conclusion
- Understand the storage engines
- Keep data types small and compact
- Understand your query execution plans with the EXPLAIN output
- Understand the scan vs. seek choice the optimizer must make
- Don't mix functions and indexed columns
References
High Performance MySQL bookMySQL Pluggable Storage Engine Architecture
MySQL Storage Engine Architecture, Part 2: An In-Depth Look
Optimizing Queries with EXPLAIN
Java Persistence with Hibernate book
Jay Pipes blog
Colin Charles blog
mysql performance blog
Ronald Bradford blog
Taking JPA for a Test Drive
Pro EJB 3: Java Persistence API
Pro MySQL, Chapter 6: Benchmarking and Profiling