Wed 21 Sep 2005
Posted by Anton Titov under SQLNo Comments
Quite often people use queries like:
SELECT quote FROM quotes ORDER BY RAND() LIMIT 1
to get a random row (rows) from a table. That’s quite bad idea. For big tables, if your table have just
50-100 rows, use whatever you want.
What happens when you run such a query? Let’s say you run this query on a table with 10000 rows, than the SQL server generates 10000 random numbers, scans this numbers for the smallest one and gives you this row. Generating random numbers is relatively expensive operation, scaning them for the lowest one (if you have LIMIT 10, it will need to find 10 smallest numbers) is also not so fast (if quote is text it’s slower, if it’s something with fixed size it is faster, maybe because of need to create temporary table).
So. How can you select random row (rows) without this overhead? There is no easy drop in replacement. You can use something like:
SELECT COUNT(*) AS cnt FROM quotes
generate random number between 0 and cnt-1 in your programming language and run the query:
SELECT quote FROM quotes LIMIT $generated_number, 1
Yes. This are two queries, but they are MUCH faster than the first one. This option is good if you need just one random row. If you need more rows, you can still use this trick, just substract X (X is number of rows you need) from cnt when generating random number and modify query to:
SELECT quote FROM quotes LIMIT $generated_number, X
But this will give you X subsequent rows, starting from a random position. If that’s not an option, than you can use another approach: most of the time you have unique numeric field in tables, that start from 1 and continue to grow, so you can do something like:
SELECT MAX(id) AS maxid FROM quotes
generate X random numbers between 1 and X, join this numbers in string with ‘,’ for separator and run this query:
SELECT quote FROM quotes WHERE id IN ($idlist)
Yes. I know, you may have some deleted id’s and the query than may return less rows than you need. But you may generate 10 times more random ids and modify the query to look like:
SELECT quote FROM quotes WHERE id IN ($list_with_10_times_more_ids_than_x) LIMIT $x
now if you do not have too many rows deleted from the table the chances that you will not find let’s say 10 existing ids from list with 100 random numbers are near zero and you can include a code in your program, that will check if you have $x rows in the result and if not (let’s say once in 10000 times) it will run the equivalent ORDER BY RAND() code. You can also generate 100 times more ids than you need - the only overhead you’re introducing is generating 100 times more random numbers than $x (if you need 5 random rows from table with 10000 rows it’s better to generate 500 random numbers than 10000) and parsing of the query will be a bit slower, but the table scan will not be slower - the SQL server will stop searching for rows with corresponding ids from the list as soon
as it finds $x existing rows.
Thu 15 Sep 2005
If you have apache server on linux and want to be able to activate front page extensions for virtual hosts but do not want to install mod_frontpage, than you can download this file
extract it to /usr/local with
tar -xzf frontpage.tar.gz -C /usr/local
and than you can install front page extensions like (on a single line):
/usr/local/frontpage/install /www/titov.net www.titov.net firstname.lastname@example.org 2000 titovadmin titovpass
“/www/titov.net” is DocumentRoot of the virtual host
“www.titov.net” is ServerName of the virtual host
“email@example.com” is the ServerAdmin
“2000″ is the uid (user id) of the user in the filesystem (e.g. id of the ftp user of this host from /etc/passwd) all installed files will be chown-ed to 2000:2000, so you shoud have group with the same id/name, you should be able to easyly modify this script to take username as parameter
“titovadmin” and “titovpass” are username and password that you should use for publishing from frontpage
/usr/local/frontpage/uninstall /www/titov.net www.titov.net
will uninstall the frontpage extensions from this virtualhost.
You should have Options +ExecCGI in the virtualhost and most probably you will want to have suexec.
Thu 15 Sep 2005
Posted by Anton Titov under SQL Comments
Real life example:
SELECT username FROM users WHERE (ADDDATE( last_in_chat, INTERVAL 2 MINUTE) > NOW()) ORDER BY username LIMIT 30
NEVER do that. The user in question runs this query on table with 3000+ rows, which means, that SQL server (MySQL in
this case) is doing 3000+ date/time additions and will never use any index. If you write the query this way:
SELECT username FROM users WHERE last_in_chat> NOW()-interval 2 minute ORDER BY username LIMIT 30
the server will do only one date/time substraction and will probably use index if there is such on last_in_chat. The second query runs much faster, even without index. In this case not having index on last_in_chat maybe is a better option as this field probably updates on user actions in chat room, which may happen more ofthen than the displaing of this list.
Mind the Query cache.
MySQL have a nice feature - query cache, which stores query results and if you rerun your query, you will have your result served from the cache, which is quite faster. The cache expires at the first update/insert/delete query on the table (or any of the tables in case of join), so you will not get outdated results. The downside is that when you use functions like NOW(), CURRENT_DATE(), UNIXTIME() and so on, the server will not put your query in the cache, as next time the query will be executed this functions may return different value.
You can rewrite the query from the example above to use the query cache:
SELECT username FROM users WHERE last_in_chat>'2005-09-15 16:44:00' ORDER BY username LIMIT 30
where “2005-09-15 16:44:00″ is computed in the programming language, from which you run query like this (PHP):
$time=date("Y-m-d H:i:00", time()-120);
As you see, the seconds are hardcoded to be 00. This sacrifices correctness of the query and now you will see users which was last in chat 2 to 3 minutes. You shoud decide if you care about this. In this case this may be not a good option, but if you want users, that was active in last 15 minutes, you will most probably will not care about 1 minute.
so when you run this query, it will be stored in the cache and the cached result will be used during this minute as all the queries will be the same and will not have one of “bad” functions in it. On the next minute, query will be:
SELECT username FROM users WHERE last_in_chat>'2005-09-15 16:45:00' ORDER BY username LIMIT 30
and this will generate new results.
Of course you shoud use this trick only if you think that this query will be run more then once per minute, but if not you maybe don’t need any performance tips.
This trick is better illustrated when you substitute CURRENT_DATE() with current date, generated from your program as this changes quite less often than NOW() (once in 24 hours actually) and you can have a query served from the cache for a whole day.