Optimizing a big 6.5 GB MySQL database


Comments

  • You need ~4GB for the DB. So say 16GB? RAM is cheap, why not go 64GB?

  • When a DB is purely InnoDB, I've seen recommendations to give the majority of RAM to its buffer pool: reserve enough for the OS, the webserver and whatever else you have running, and allocate as much of the remaining memory as possible to the DB (see the my.cnf sketch at the end of this comment).

    Again though, 'it depends', especially on whether the most intensively used queries/tables are InnoDB or MyISAM. Caching indexes/data in memory certainly helps performance.

    SSDs tend to be talked about for their low access times, but they're also good for overcoming table fragmentation, particularly from non-ordered inserts.
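
    For illustration only, a my.cnf sketch along those lines; the box size and every number below are assumptions, not recommendations:

        [mysqld]
        # assumes an ~8 GB box that is mostly InnoDB; leave headroom for the OS and webserver
        innodb_buffer_pool_size = 5G
        innodb_log_file_size    = 256M
        innodb_flush_method     = O_DIRECT   # InnoDB caches data itself, so skip the OS page cache
        key_buffer_size         = 64M        # only matters for MyISAM indexes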

  • jlay Member
    edited January 2015

    A few things to consider for MySQL performance:

    1. Use a "tmpfs" mount for MySQL's tmpdir if you have the memory to spare. Ideally at least as big as your biggest table for fastest repairs/optimizations.

    2. Use InnoDB instead of MyISAM. The locking differences alone are significant.

    3. Make sure "innodb_file_per_table" is enabled in my.cnf. Otherwise you may be in for trouble in the future.

    The rest is just making sure you're using as much memory as possible for MySQL (caching/buffering) and making sure your I/O is quick enough. A minimal illustration of points 1 and 3 is below.
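
    Purely as a sketch (the mount point, size and paths are assumptions; adapt them to your box):

        # /etc/fstab: tmpfs mount for MySQL's temp area
        tmpfs  /var/mysqltmp  tmpfs  rw,size=4G,mode=1777  0 0

        # my.cnf
        [mysqld]
        tmpdir                = /var/mysqltmp
        innodb_file_per_table = 1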

    With InnoDB and especially bleeding edge optimizations (where you're leaning heavily on memory), having good backups is very important. Create and validate backups regularly.

    Edit:
    One more thing -- If "innodb_file_per_table" was previously disabled and has since been enabled, it doesn't work retroactively. Any databases with existing InnoDB tables will need to be dumped, dropped, and re-imported for the per-table files to be created.

    This is important for two reasons:

    1. Prevent "ballooning" of InnoDB data files on the disk. The InnoDB storage engine only expands disk usage of the files. They don't readily shrink unless you jump through a lot of hoops.

    2. Say something happens and you need to do InnoDB recovery. It's much better for each table to have its own file than to have all of the InnoDB data in one monolithic ibdata file; your chances of success in an emergency are considerably higher. (A sketch of the dump/drop/re-import cycle is below.)
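
    A minimal sketch of that cycle; the database name is a placeholder, and you should have a separately verified backup before dropping anything:

        # assumes innodb_file_per_table=1 is already set in my.cnf and MySQL has been restarted
        mysqldump -u root -p --routines --triggers mydb > mydb.sql
        mysql -u root -p -e "DROP DATABASE mydb; CREATE DATABASE mydb;"
        mysql -u root -p mydb < mydb.sql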

    Thanked by: aglodek, mehargags
  • 7 million coupons... Oh?

  • @mehargags said:
    As a quick start, how much RAM... would you advise for this database? The app front end claims to have 40-60,000 visits with up to 100K pageviews a DAY

    ... a high-end VPS Ramnode or Vultr or should I go dedicated ?

    Reading again through this thread I'm under the impression that you are somewhat overwhelmed, confused and lost in a maze of factors.
    So, let's roll this thing up and look at it again, shall we ;)

    • No, don't go dedicated. Some say that a VPS is just a way for hosters to earn more on a dedi. And that's right, but (an important "but") you also get something valuable, namely ease of change and a growth path. Changing, for instance, the processor power or the RAM is easy with a VPS. So, go VPS.

    • You mentioned Vultr. My advice: don't. Maybe they are a perfectly fine provider, but if both performance and reliability are important to you, you should choose a transparent provider, one that will (truthfully) tell you about e.g. their disks and RAID.

    • High-end? Nope. "good quality" yes but high-end, no. Your 4-5 GB DB is small cake.

    • My advice (and this might well trigger a religious war): stay away from OpenVZ if node and VPS stability and reliability matter to you. Go KVM.

    • Maybe MySQL is the right choice. But at least look at other options! Risking yet another war (sorry, guys, but my priority here is to guide this guy a little so he finds a nice, safe place in the jungle): MySQL, uhm, how to put that, uhm, is not the first DB system that comes to mind when I'm asked about reliability. I strongly advise that you at least take a good long look at PostgreSQL.

    • 40k-60k visits and 100k page views? Piece of cake. That's about 1 view/s on average and, depending on the kind of site, maybe between 20 and 200 views/s at high noon.

    So, maybe the first thing to understand, to get you somewhat better balanced in this whole thing, is: your site is not a toy site, no, but it's still a relatively small thing and by no means a heavyweight. So, don't worry.

    My take is: a VPS with two cores and 2 GB RAM could easily run that thing. Just be sure that your provider has a decent disk setup and gives you reliable, reasonably fast storage. You are not yet very experienced and don't know how to fine-tune everything? No prob. Add RAM. Say, another 2 GB. You want more performance and a reserve? Fine, add RAM again.
    Be sure to configure your DB system smartly and well. Same for the web server (which, I hope, you wisely chose rather than simply installing Apache). Finally, be sure to choose a smart PHP setup and config, including, probably, some kind of accelerator/opcode cache (a small example follows below).
    And btw. keep in mind that funny disk caches (like SSD caches) can actually work against you in a DB-centric scenario. So go either SSDs (software or hardware RAIDed, no big matter) or spindles with HW RAID 10.
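
    (By "accelerator" I mean an opcode cache; with PHP 5.5+ the bundled OPcache is just a few php.ini lines. The values below are illustrative, not tuned numbers:)

        ; php.ini
        opcache.enable=1
        opcache.memory_consumption=128        ; MB of shared memory for compiled scripts
        opcache.max_accelerated_files=10000
        opcache.validate_timestamps=1         ; recheck files for changes (safe default)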

    That whole package should be easy to find for under $50/month. I guess even $20-25 isn't unreasonable.

    Btw., where is your client? West coast or east coast USA, western Europe, Singapore, or where? And where are his clients/visitors?

    Thanked by: aglodek, goinsj2010
  • NickMNXio Member, Host Rep

    I think this comment is quite relevant:

    @jcaleb said:
    you may need to have specific problem first before trying to find a solution. find which part of the application is slow and which queries are slow. then tune from there. if there is no performance problem with the application, then why not just sit back and relax.

    Since I had a chance to discuss this with you personally via chat, and to review the existing server configs, sar data, etc., I feel like you are spending a bit too much energy trying to optimize something without much direction.

    For reference, this app was running on a 4-core / 6 GB machine, with peak CPU utilization of less than 50% and very little to no I/O wait time (over a short analyzed period of 2 days).

    I do think it is wise to ensure that you optimize your MySQL parameters. I like to use the Percona configuration wizard for a base config and then tweak as required: https://tools.percona.com/wizard

    Also, about the database size -- the size is not concerning at all (except at an app design level: do they really need 7 million rows for coupons? :)). We routinely work with 5TB+ MySQL databases for some of our managed customers without issue.

  • jcaleb Member
    edited January 2015

    I believe it is a misconception that database performance is a DBA-only problem, i.e. just tuning database config and such.

    Database performance relies on the totality of a project: good schema design, good programming, and good DBA tuning, all working together as a team.

  • @bsdguy, @tehmaggot, @Jar, @MCHPhil:

    Not to hijack the thread or anything, but thought it better to pop my question here instead of opening a new thread...

    I have a bunch of XLS files totalling 1 million+ records to convert into one contact DB backend for a custom CRM web application under Drupal 8. They contain company contact details plus industry/product classification, but with different fields in some of the XLS files (the data comes from different partners). My questions to you are as follows:

    • first, what type of DB backend to choose here? Seems to me a NoSQL store like MongoDB would be a better fit here than SQL, but I'm not sure.

    • another option (which, frankly, I feel more comfortable with) is a MariaDB/Galera cluster. I think a master-master replication backend would be a good fit here, but again, not sure. What do you guys think?

    • one other thing: after importing them into the new DB, the records will need to be parsed and cleaned up, as some of them will be records of the same companies repeated across the different XLS files. By cleaning up, I mean merging data from all fields of the repeated records into one record. I plan to use a custom PHP script here, but again, any other suggestions are very welcome.

    The apps mentioned above notwithstanding, I have no religion here, so anything goes as long as it makes sense. Thanks!

  • First: 1 million+ records is a piece of cake DB-wise. Don't worry about the DB. Just use whatever you like and know well.

    Second: NO. First normalization/cleaning - then DB.

    I personally would use an intermediate DB that is easy to script around, something like SQLite plus Python (a rough sketch follows below). Why? Because experience shows that usually there are things like 1-to-many relationships which should be properly ID-indexed.
    Example: 5,000 companies with 23,000 relevant products which again are (to be) ordered/indexed into product groups.
    With SQLite you have real SQL as a tool and at the same time just a simple file which can be copied, etc.
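
    A rough sketch of what I mean; the file name and column names are made up, and it assumes the XLS sheets have first been exported to CSV:

        import csv, sqlite3

        con = sqlite3.connect("staging.db")
        con.execute("""CREATE TABLE IF NOT EXISTS contacts
                       (source TEXT, company TEXT, product_group TEXT, email TEXT)""")

        with open("fair_2014.csv", newline="") as f:
            for row in csv.DictReader(f):
                con.execute("INSERT INTO contacts VALUES (?, ?, ?, ?)",
                            ("fair_2014",
                             row.get("Company", "").strip(),
                             row.get("Products", "").strip().lower(),
                             row.get("E-mail", "").strip().lower()))
        con.commit()

        # now the cleaning/merging can be done with plain SQL, e.g. spotting duplicates:
        dupes = con.execute("SELECT company, COUNT(*) FROM contacts "
                            "GROUP BY company HAVING COUNT(*) > 1")
        for company, n in dupes:
            print(company, n)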

    And, no, I don't think that a NoSQL approach would be optimal here (there are, btw., quite different NoSQLs and approaches). I might have misunderstood you, but to me it sounds like a boring, standard, classical relational case.
    So, just use whateverSQL you're fine with, preferably a smaller one fitting the job.

    Maybe too much for the task, but I'll mention it anyway as a hint: have a look at MonetDB. Unlike the usual DBs it's a column-oriented DB, which comes in very handy for jobs like this, particularly larger-scale ones, and it is blazingly fast in its area. And yes, it does have a SQL interface ;)

    Thanked by: aglodek, vimalware
  • jar Patron Provider, Top Host, Veteran
    edited January 2015

    At the end of the day it's less about how you plan to store it and more about how you use it. A million records isn't bad until you throw them in a table and force every page load of a website to write new data to it and compare it against all existing records, and people do this...without any caching...it gets ugly fast. Right about the time a crawler says "lol what's robots.txt?"

    If your use of it is optimized, your storage of it will be less important. Storage conversions are easier than logic changes.

    Nearly every day I see someone run multiple tables with millions of rows behind a web page that joins the data into temp tables on every single page load, while adding new rows and comparing its data to all existing ones. One hit by Google and Bing at the same time and it's OOM, I don't care how much memory you have. It's all about optimizing your access logic. A lot of people can't do that with a basic CMS and a little knowledge; those people need to hire strong developers. As a sysadmin there's only so much optimization I can do.

    Thanked by: netomx, aglodek
  • aglodek Member
    edited January 2015

    @bsdguy said: Second: NO. First normalization/cleaning - then DB. I personally would use an intermediate DB that is easy to script around, something like SQLite and Python.

    What an interesting idea, thanks :) In fact, I did think of an intermediary DB myself, but more like intermediary tables/fields in the target DB, didn't think to use a different DB as a tool. No experience with SQLite or Python, but thanks all the same for the pointer - time to look into this...

    Why? Because experience shows that usually there are things like 1-to-many relationships which should be properly ID-indexed. Example: 5,000 companies with 23,000 relevant products which again are (to be) ordered/indexed into product groups.

    They are buyers-visitors to different trade fairs in China. And it's a big mess index- and classification-wise. Each trade fair organizer has a different system (even between editions of the same show), hence different fields in the different XLS files. On top of that, some come with additional info included, like fields with text of product enquiries. Hence my first thought was: KISS and maybe go the NoSQL route...?

    And, no, I don't think that a NoSQL approach would be optimal here (there are btw. quite different NoSQLs and approaches).

    Could you elaborate on that?

    Maybe too much for the task, but I'll mention it anyway as a hint: have a look at MonetDB. Unlike the usual DBs it's a column-oriented DB, which comes in very handy for jobs like this, particularly larger-scale ones, and it is blazingly fast in its area. And yes, it does have a SQL interface ;)

    Interesting, never heard of MonetDB or column-oriented DBs before - will take a look at this, thanks :)

    EDIT: at ~1 million records, MonetDB seems a bit of overkill.

  • @bsdguy said:
    And btw. keep in mind that funny disk caches (like SSD caches) can actually run against you in a DB centric scenario. So go either SSDs (software or hardware RAIDed, no big matter) or spindles with HW RAID 10.

    1)
    Could you elaborate on the impact of SSD caches on DB-only workloads?

    Is your case against SSD caches based on the (much) greater read latency (for a cache miss) compared to an HW RAID 10 array of HDDs?

    2)
    I've been reading up on FB's flashcache module and the 3 modes it supports: writethrough (safest), writearound (archival/logging workloads?), and writeback (risky for the last writes, but fastest write commits).

    Am I right in understanding that even HW RAID controllers have supported these modes (at least 2 out of the 3) using their own superfast DRAM/flash caches?

    And writeback is only recommended for battery-backed HW RAID controllers?

    So, in conclusion, SSD-cached volumes should be fine for a DB workload if you have the whole machine to yourself

    (and your DB's 'working set' fits inside the SSD cache's max size).

    So, this is only an option I would pursue for a client with a freaking HUGE (>100 GB forum) database with a small 'hot' dataset?

    But in the case of an SSD-cached VPS, read latency would be entirely dependent on the total average write volume of your co-tenants and the cache-eviction algorithm in use?

    Just trying to get my mental model of the various combinations right.

    Understanding all behaviour+failure modes is important for capacity planning.

  • mikho Member, Host Rep

    I'm just gonna leave this link here: http://dev.mysql.com/doc/refman/5.0/en/mysql-indexes.html

    In many cases, by using a correct index, you can speed up database queries considerably compared with no index or even a bad one.
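
    For instance (table and column names are made up):

        -- without an index on `email`, this has to scan the whole table:
        EXPLAIN SELECT * FROM contacts WHERE email = 'foo@example.com';

        -- add one and MySQL can jump straight to the matching rows:
        ALTER TABLE contacts ADD INDEX idx_email (email);
        EXPLAIN SELECT * FROM contacts WHERE email = 'foo@example.com';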

  • @aglodek said:
    "NoSQL ... elaborate"

    What is NoSQL? Well, the best definition is probably the moniker. They are not SQL/classical relational. Beyond that there's little unity. Everything from graph-based approaches to hash tables with some gadgets added.

    Putting it differently: What you describe as your problem strongly indicates to me that you do not want an unordered, cluttered agglomeration of data; otherwise you could save the effort and stay with XLS.

    I may be wrong, but I understand your desire as wanting something that is reasonable to work with, something around which forms and maybe at least ad hoc applications can be built. I mean, after all, you don't have those data for the fun of it, there is valuable information in them, but at a count of 1+ million working by hand isn't an option.

    So, no matter which way you choose, there will have to happen some cleaning, sorting and ordering anyway. Now, having done that, having made that effort, it would seem logical and reasonable to a) keep the routines written for that preparation for future trade show data and b) put those data into a proper DB (which btw. also opens the doors to a gazillion of tools to further use those data).

    Finally, chances are that you would end up using the "hash tables plus some gadgets around" kind of NoSQL. Oh well, that just happens to be a forte of scripting (strong hint: Python).

    An example. Many vendors use different names for basically the same product groups. Say, one calls vacuum cleaners just that, vacuum cleaners. Another one has them in "cleaning machines". Yet another one in "cleaners (vacuum)", "cleaners electrical", etc.

    Now say, your designation would be "vacuum cleaners". Using scripting you may walk all rows and do things like

    if item.group in ("vacuum cleaners", "cleaning machines", "cleaners (vacuum)", "cleaners electrical"):
        item.group = "vacuum cleaners"

    Another typical problem is ordering. In different XLS areas (from different trade fairs) the columns are often ordered differently. Yet another easy normalization job for a decent scripting language.

    If, on the other hand, you just throw everything into NoSQL you might as well just keep it in XLS.

    Thanked by: aglodek, vimalware
  • @bsdguy said: If, on the other hand, you just throw everything into NoSQL you might as well just keep it in XLS.

    Well, when you put it like that... ;) Point taken, thanks.

    Another typical problem is ordering. In different XLS areas (from different trade fairs) the columns are often ordered differently. Yet another easy normalization job for a decent scripting language.

    And by "decent", you mean Python, I take it? Okay, I'm game, but what's Python got that PHP doesn't?

  • @vimalware

    OK.

    SSD caches and DB workloads. The problem is that there are several players involved: the spindles, the SSD cache, the OS, and the DB, and each one has its own ways. With one important attribute: unlike normal file handling, DBs "tick" quite differently and in other "portions".
    In fact, any decent DB will make quite some effort to adapt to the underlying OS and file system. Some even went so far as to avoid the OS completely, because a DB's needs are different from what a typical OS provides.

    To add to the problem, the spindles and controllers have their own cache and optimal transfer chunk size, which usually will not match the SSD's. Moreover, while SSDs are fast beasts, the I/O still takes time and, worse, adds interrupt and I/O handling (often doing the OS's job once more, but on a slower(!) medium). For standard file systems that might be fine and worth the while, but not for DB chunks. Don't forget that the vast majority of file I/O is largely sequential; a DB, however, particularly a large one, will happily jump all over the place with random access, thrashing caches along the way. All of which together may actually lead to SSD caches working against you.

    To make it short: One rule for DBs is to keep middlemen out of the way.

    "modes"

    Well, yes and no. For the simple reason that most cached disk controllers work along the attitude of "what happens beyond the data handover to me is none of your business". And that is also how the OS accepts it: it writes to the controller, which merrily and quickly says "done", which the OS happily accepts. In reality and behind the scenes the controller may internally use whatever strategy it sees fit. And if it is battery backed up you need not even really care. From the user's and OS's standpoint it basically comes down to write-through at the (low) cost of write-back. In the end the vendor may tell you a lot and offer you funny switches to toggle, but that's what it boils down to, with maybe some variations such as asking for write-through actually using write-through (but you won't find that in their benchmarks ...).

    Short version: the write-through vs. write-back question only poses itself at the OS level, and only if you do not have a BBU.

    And while the point of having a machine to oneself or not is indeed a relevant one, for DBs the major point is to not have unnecessary layers such as third-party caching. Either go spindles (preferably with a cached and BBU'd hardware controller) or go SSD, but don't put SSD caches or other funny stuff in the middle. And yes, even read latency suffers from third-party caching schemes, at least with DBs of considerable size.

    And, being at that, @MikHo correctly hinted at it: the major factor with DBs is indices and damn enough RAM (and then some more). Read my lips: you almost never gain on record caching, at least not on a decent OS.
    A reasonable approach is to (professionally and reasonably!) guesstimate the size of your indices, to include join tables (many forget that), and to then add the number of records in the DB times 1 KB on top (see the back-of-envelope sketch below). Be generous but don't go maniac.
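
    (A back-of-envelope version of that guesstimate; every figure below is a made-up assumption, just to show the arithmetic:)

        # rough RAM guesstimate, illustrative numbers only
        index_size_gb = 1.5        # estimated total size of all indices, incl. join tables
        records       = 7000000    # rows in the DB
        per_record_kb = 1          # the "1 KB per record on top" rule of thumb

        ram_for_db_gb = index_size_gb + records * per_record_kb / (1024 * 1024)
        print(round(ram_for_db_gb, 1), "GB for the DB, plus OS/webserver headroom")
        # -> 8.2 GB with these made-up numbers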

    And no, dedi vs. VPS doesn't change that, other than that with a VPS I would considerably increase RAM (to compensate). SSD caches for a DB server, virtual or dedi, no matter, will usually be just a waste of money (money that would be better invested in RAM).

    What I just wrote is basically always true, but, that said, you can happily ignore it for all those tiny DBs that typically sit behind websites. For mid to large size DBs, say anything beyond 10 or 20 GB and more than a few (low) millions of records, you might, however, want to think about what I said.

    Thanked by: aglodek, vimalware
  • @aglodek said:
    And by "decent", you mean Python, I take it? Okay, I'm game, but what's Python got that PHP doesn't?

    Pardon my strong inclination not to start a religious war. Suffice it to say that I'm using Python for scripting jobs and that I do not, no matter what, touch PHP (hint: I'm in security professionally and rather sensitive and picky when confronted with gaping security holes and repeated sheer idiocy).

    Thanked by: aglodek, vimalware
  • mehargags Member
    edited January 2015

    @bsdguy -- pls accept my humble regards for being so descriptive and sharing detailed insight.

    @bsdguy said:
    You are somewhat overwhelmed, confused and lost in a maze of factors.

    You are absolutely right! Couldn't agree more. Yes, I'm overwhelmed, as this is my first time dealing with a 4 GB "cake". However, I'm "not" confused, but rather reading and learning to be proactive! Yes, there isn't a specific "problem" right now, and I could "sit back & relax", but I don't want to live in suspense and don't like to work in crisis mode. Heck... I'm gathering knowledge from all you seasoned veterans so that I do not get lost in the maze, but rather find my way through!

    @NickMNXio and others... I'm extremely thankful for the help you gave; yes, discussing it all reassures me that this isn't a big enough site to be scared of.

    Just that I'm excited and I want to put in my 110% before I take up the responsibility.

    @Jar, yes, I agree with you too: the DB handling & app architecture is an important factor.

    And guys... there are no coupons this DB/app sells ;)

    As an update, I spoke with the developer; I know he is experienced and well versed. He said the app has not been maintained by him during the last 4 years, and he sees that some JOINs and other row-handling mechanisms may be out of order. He was certain that the DB as he designed it should not be in the GBs. Luckily he is back with this client for another sub-project attached to this domain, so he will look into the internals.

    Thanks again guys... I'm an experienced Windows admin, but enjoying my Linux learning curve patiently.

    Thanked by: NickMNXio, vimalware
  • @mehargags

    You are warmly welcome. I take it that everyone here contributes as best they can to this community, and this is my part ;)

    Thanked by: mehargags
  • @aglodek said:
    ... MonetDB ...

    Apologies, forgot that one. So, a little follow up.

    Relational DBs are row (~ record) based. Which is usually fine, because the normal job is "find me the records (rows) out of a gazillion records WHERE (condition)".

    Sometimes, however, for example in financial applications, a column (~ field) based approach is better. And MonetDB is amazingly good at that (and still pretty good at normal DB jobs).
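
    (A typical example of the kind of query a column store shines at; table and column names are made up:)

        -- touches only two columns of a possibly very wide table, which a column
        -- store can scan without reading any of the other fields
        SELECT product_group, AVG(order_value)
        FROM   orders
        GROUP  BY product_group;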

    Thanked by: vimalware, aglodek
  • gbshouse Member, Host Rep

    From my experience:

    • change all tables to InnoDB (see the SQL sketch below)
    • move from MySQL to Percona Server
    • optimize indexes, then optimize indexes again
    • tune Percona for reads (a RAM upgrade may be required)
    • use logs and extra functions to check the performance
    • for large tables consider partitioning (depends on the data too)
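
    A small illustration of the first point (database and table names are placeholders; conversions rebuild the table, so do them in a quiet window):

        -- list remaining MyISAM tables
        SELECT table_name FROM information_schema.tables
        WHERE  table_schema = 'mydb' AND engine = 'MyISAM';

        -- convert one of them
        ALTER TABLE mydb.some_table ENGINE=InnoDB;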
    Thanked by: vimalware, mehargags
  • Hello Everyone...
    I have some updates: I've been migrating and sifting through the client's 6 websites and 4 databases. I've observed that the bigger databases are being hastily populated not by the primary web app, but rather by a Perl application used to send out their massive newsletters (200K a day). The "lousy" sysadmin patched their newsletter app into the primary DB for ease of use, and hence the DBs keep bloating. I know mailer and link-tracking apps can fill up DBs quite awfully.

    Anyway, I'm working with the developer and the staff to sift through the usable tables and chop out the unwanted ones.

    Secondly... I need your opinion on a replication scenario, or whether I should go with a cluster.

    I'd like to set up 2 VPSs now, in a mutual failover scenario. They host different websites and keep file backups of each other. That's something I can set up easily! Regarding databases:
    say I have 2 VPSs ...
    VPS-1 has a 2 GB database and 3 websites
    VPS-2 has a 4 GB database and 2 websites

    What I want is for each server to be a failover for the other: VPS-1's DB is replicated to VPS-2 and vice versa.
    So I need master-slave replication, right? But can I set the 2 servers up as master and slave of each other, or do I need to set up a cluster? Something like the sketch below is what I have in mind, if I understand the concept correctly.
    Can you cluster MySQL DBs on 2 VPSs that are running other web services alongside?
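
    (For illustration, what I think the "each one is the other's master" setup would roughly look like; hostnames, user and password are placeholders:)

        # my.cnf on VPS-1 (VPS-2 mirrors this with server-id = 2 and auto_increment_offset = 2)
        [mysqld]
        server-id                = 1
        log_bin                  = mysql-bin
        auto_increment_increment = 2   # so the two masters never hand out the same auto-increment id
        auto_increment_offset    = 1

    and then on VPS-1 something like (mirror image on VPS-2):

        CHANGE MASTER TO MASTER_HOST='vps2.example.com', MASTER_USER='repl',
                         MASTER_PASSWORD='secret', MASTER_LOG_FILE='mysql-bin.000001',
                         MASTER_LOG_POS=4;
        START SLAVE;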

    Kindly enlighten me!
