The SPL Deserves Some Reiteration | Blue Parabola, LLC
some benchmarks of the php spl library structures compared to the equivalent pure php code using the php array. Interesting
some benchmarks of the php spl library structures compared to the equivalent pure php code using the php array. Interesting
We've decided to do full pagination at the bottom of Noopsi instead of just the next and previous buttons. I originally didn't implement full pagination because I didn't want to worry about scaling it but Dave really wants to be able to bounce to the middle of the result set and my response of "why don't you just search for it?" didn't present the solution he was hoping for.
Our first brainstorm involved keeping a count table, and anytime you made a change to your stuff, the counts would be updated. The problem was that with about 30 different ways to view your stuff (public, private, tagged, etc.) it just wasn't going to be possible to keep all the counts up to date reliably.
So I did what any self-respecting programmer does and turned to Google. I found some interesting info on the subject of getting row counts, including some optimizations that I felt might be a tad excessive for us at this juncture.
I felt really disatisfied with the benchmarks some of the posts had. First, they often didn't use InnoDB tables, which Noopsi does. Second, they usually only had about 100,000 rows in the table when they ran their tests. Third, they usually just had a table that consisted of the primary key, another random integer, and maybe a 32 character random string. The tests didn't seem to mimic a real world scenario at all. So I decided to run my own benchmarks.
I wanted the table to be as close to a real world table as I could, so my table was InnoDB and consisted of:
I ran the tests with 2 counting scenerios: either using a SELECT then a COUNT(*) or by using SQL_CALC_FOUND_ROWS followed by SELECT FOUND_ROWS().
A typical query was as follows:
SELECT * FROM table_name WHERE user id equals a random user from 1 to 20 ORDER BY timestamp DESC LIMIT 10
I ran all the tests on two different versions of the table, the first had an index on user id and one on the timestamp. The second had an index on the user id and the timestamp together.
I ran the tests on tables with 100,000, 500,000, and 1.5 million rows. Each query was ran 30 times and an average was taken. (I know, 30 times isn't that much, I originally had 1000 but I just ended up waiting longer and the average time wasn't all that different). I also added 20 rows (one for each user) between each query. I guess to mimick adding then reading, it seemed like a good idea at the time.
For the 100,000 and 500,000 row tables the differences were miniscule so I am just going to show results for the half a million row table.
500,000 rows, index on (user id,timestamp)...
SQL_CALC_FOUND_ROWS query - 2.56 seconds average SELECT then COUNT(*) query - 19.78 milliseconds average
500,000 rows, index on (user id) and one on (timestamp)...
SQL_CALC_FOUND_ROWS query - 2.68 seconds average
SELECT then COUNT(*) query - 24.33 milliseconds average
Once I got passed half a million rows, the results got more interesting.
1.5 million rows, index on (user id) and (timestamp)...
SQL_CALC_FOUND_ROWS query - 21.41 seconds average
SELECT then COUNT(*) query - 12.11 seconds average
1.5 million rows, index on (user id,timestamp)
SQL_CALC_FOUND_ROWS query - 53.91 seconds average
SELECT then COUNT(*) query - 0.36 seconds average
That's right, the index on (user id, timestamp) was much much (that's 2 muches!) faster than having separate indexes when the row count got higher. And, to be quite honest, three-tenths of a second to perform the two queries is probably something we can live with for now. And hopefully, by the time we get to million plus row databases we will have the resources to fix the bottlenecks and better optimze.
Honestly, they say InnoDb tables are really slow for counting queries, but for tables around the half million row mark, they still seemed lightning fast.
Most benchmark posts include the code used. The problem is because I was trying to make the tests as true-to-life as I could, I used a db abstraction library, and a profiling library, and some other files to perform the benchmarks. I don't feel comfortable releasing the full code right now, and just the main php file that calls everything would be of no help to anyone. Sorry.
While I definitely see that counting queries can become a problem as the rows increase, I think for most small apps and websites they will almost never have to worry about it. I was surprised by how poorly SQL_CALC_FOUND_ROWS performed though, it was slower in every one of the tests, though probably not really noticeable in a real world scenario until the row count got over half a million rows.
I wrangle code for Undrip and sling words for StartupGrind. Previously, I was Co-Founder and CTO of Plancast.
About me: About.me
My Plans: Plancast.com
My Notes: Noopsi.com
My Tweets: Twitter.com
My Code: Github.com
My Resume: LinkedIn.com
My Facebook: Facebook.com
My Google: Google.com