Best Databases for Complex Analytics

Wanted to get some opinions on this, as the landscape is rapidly evolving and it's tough to make sure we are evaluating everything. The situation: we're bootstrapping a startup doing big-ish data analytics with some decent geospatial needs as well. We are currently ingesting ~4mm GPS broadcasts per day (each is only ~100 bytes). Our current setup, which worked well at my old job and works fairly well now, is as follows:

MariaDB as main "single truth". All new data is first ingested there and then replicated to other sources. This simplifies things for a few reasons: ACID, hot backups with xtrabackup, and it handles complex queries better than other No/New-SQL solutions.

Memsql cluster as read cache and parallel query engine. The main benefit here is twofold: it's protocol compatible with MySQL/MariaDB, and parallel execution of queries greatly speeds up some of our complex ad-hoc analytics (though we run into a number of queries that actually run faster in a single thread on MariaDB).

And of course a small-ish Hadoop cluster (mostly Impala, though Hawq looks interesting).

Use cases: this data is generated by roughly 20k actors, and each one is largely independent of the others. Essentially 20k time series. I feel as though this is not remotely close to a burdensome amount of data, but problems arise in the pattern of access that I don't think can be solved with simple fixes like more RAM or better indices. Roughly 75% of our access is fairly predictable, but the other 25% is really unpredictable and tough to optimize with denormalization, etc.

Take a typical query: we self join the table on the next and previous events (we store the prev/next timestamps in each row for easy joins) to determine when an actor performed some behavior inside a geospatial polygon. Then we need to query the next time they performed some other action at some point in the past/present in another polygon. So we end up with hundreds of thousands of candidate rows with an equal number of subqueries that are scanning another hundred thousand or million rows.
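
For concreteness, here is a rough sketch of the shape such a query takes. Table and column names are made up (not our real schema), and the polygon test assumes a spatial predicate like ST_Contains is available:

    -- Illustrative only: find when an actor entered polygon A (previous fix
    -- outside, current fix inside), then the next time it shows up inside
    -- polygon B. The correlated subquery is what blows up: one scan per
    -- candidate row.
    SELECT cur.actor_id,
           cur.ts AS entered_a_ts,
           (SELECT MIN(nxt.ts)
              FROM broadcasts nxt
             WHERE nxt.actor_id = cur.actor_id
               AND nxt.ts > cur.ts
               AND ST_Contains(ST_GeomFromText(@polygon_b), nxt.pos)) AS next_in_b_ts
      FROM broadcasts cur
      JOIN broadcasts prv
        ON prv.actor_id = cur.actor_id
       AND prv.ts = cur.prev_ts                        -- prev/next ts stored per row
     WHERE cur.ts BETWEEN @from_ts AND @to_ts          -- prune to a date range first
       AND ST_Contains(ST_GeomFromText(@polygon_a), cur.pos)
       AND NOT ST_Contains(ST_GeomFromText(@polygon_a), prv.pos);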

We generate event data upon ingestion that can greatly speed up most use cases like the above. But occasionally we run across something where we need to hit the raw database and the query ends up taking much too long. One of the problems is that we are part trading firm, so whereas other businesses might run a query that takes 5 minutes, that is an unacceptable amount of time when I have a broker on the phone trying to line up real volume.

I think the high grade commercial OLAP databases are out of our current budget, but if they are performant in a way that others are not, we will bite the bullet. We also need part OLTP because we are streaming real time data that needs to be queried in real time (i.e. if the market changes, we can't wait for an overnight batch to run analytics on fresh data). And we have admittedly not spent a lot of time playing with other alternatives like VoltDB or CockroachDB due to the added complexity.

So, who has the patience and energy to kindly provide their thoughts and experiences on the above? TIA

Comments

  • vfuse Member, Host Rep

    I used mongodb for something similar (a GPS positioning app for about 5000 ships), stored all the data in mongodb with a geospatial index. Queries are very fast and so is inserting data.

  • raindog308 Administrator, Veteran

    vfuse said: I used mongodb for something similar (a GPS positioning app for about 5000 ships), stored all the data in mongodb with a geospatial index. Queries are very fast and so is inserting data.

    Mongo's great. I mean, as long as you don't care about your data integrity. Don't make me page @joepie91 for the standard Mongo lecture...

    But regardless, Mongo really isn't built for analytics, is it?

    iwaswrongonce said: I think the high grade commercial OLAP databases are out of our current budget, but if they are performant in a way that others are not, we will bite the bullet.

    Disclaimer: I manage teams of DBAs in the F500, so my opinions are jaded by what big enterprises do, and they are overwhelmingly Oracle/MSSQL/DB2 people. Also, I'm just on the operational side, not the data modeling, code, etc.

    When you talk commercial DBs (Oracle, MSSQL) you get features and capabilities that FOSS can't even dream of, but I am less certain on the performance part...until you get into that last 10%, which is only available with either extreme engineering (e.g., Facebook) or a commercial RDBMS (e.g., Oracle/MSSQL on high-grade hardware, Exadata, Teradata, etc.). The big engineered systems (e.g., Exadata) push a lot of basic DB logic into firmware, so it's hard to get faster than that.

    But even there, it's not apples-apples. e.g., Oracle in-memory is dual format, allowing both fast analytics and OLTP at full speed, while MySQL is single format regardless of disk or memory. But is it worth $48K/core list price for this Oracle EE? Depends what you're doing.

    What the commercial RDBMSes give you are awesome features around recoverability, replication/mirroring, disaster recovery, auditing, active/active no-CAP-problem systems, and ecosystem (Java or .Net) integration that MySQL/Postgres/etc. cannot even dream of. We couldn't run our business with MySQL - it lacks entire feature categories.

    But if your problem is straight performance...really, if you're running everything in memory already...well, you probably already have a lot of metrics. I'd be tempted to call vendors and say "here are our current stats. if you could double/triple performance, what would it cost?"

    I will say that even in my commercial-as-default world, we recently invested in an Elasticsearch farm and it's performed very well. But it's a coat check, not for analytics - sounds like you need something SQL or SQL-ish.

  • rincewind Member
    edited June 2016

    R-trees allow efficient range queries. SQLite, for instance, comes with an R* Tree module and is commonly used for geospatial and CAD systems. If your actors are really independent, store each of them in a separate SQLite file to get localized locking and lower fragmentation.

    For a faster LMDB backend, use SQLightning or ActorDB.

    If your queries are really complex, check out Tarantool. It allows entire Lua programs to be submitted to the database for processing. The embedded LuaJIT gets close to native C performance.

    BTW, https://percona.com/software/mongo-database/percona-server-for-mongodb is a drop-in replacement for Mongo.

    Other options here - both for DB and stream processing.
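
    To make the SQLite idea above concrete, here is a minimal sketch of the per-actor layout (one database file per actor; schema and names are purely illustrative):

        -- Positions plus an R*Tree index; points are stored as degenerate
        -- boxes (min = max).
        CREATE TABLE fixes (
            id      INTEGER PRIMARY KEY,
            ts      INTEGER NOT NULL,
            lat     REAL, lon REAL,
            speed   REAL, heading REAL
        );

        CREATE VIRTUAL TABLE fixes_rtree USING rtree(
            id,                          -- same id as fixes.id
            min_lat, max_lat,
            min_lon, max_lon
        );

        -- Range query: candidate fixes inside a bounding box, to be refined
        -- against the exact polygon afterwards in the application.
        SELECT f.*
          FROM fixes_rtree r
          JOIN fixes f ON f.id = r.id
         WHERE r.min_lat >= 40.5 AND r.max_lat <= 40.9
           AND r.min_lon >= -74.3 AND r.max_lon <= -73.6;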

  • @vfuse said:
    I used mongodb for something similar (a GPS positioning app for about 5000 ships), stored all the data in mongodb with a geospatial index. Queries are very fast and so is inserting data.

    But the geospatial is just one component. We have a few other things (speed, heading, destination, ETA, etc.) that we are looking at. So we wind up with some complex joins across a number of tables, often with correlated subqueries (which Memsql and others choke on).

  • @rincewind said:
    R-trees allow efficient range queries. SQLite, for instance, comes with an R* Tree module and is commonly used for geospatial and CAD systems. If your actors are really independent, store each of them in a separate SQLite file to get localized locking and lower fragmentation. [...] If your queries are really complex, check out Tarantool. It allows entire Lua programs to be submitted to the database for processing.

    I have actually looked at Tarantool before, and aside from knowing nothing about Lua, it seems cool but I rarely see it discussed anywhere else. As for your comment re: sharding SQLite, I actually did build something similar to this in my old job to shard per actor (basically a poor man's mapred). But it doesn't scale well and you end up building too many tools in-house (which we may end up having to do anyway). Memsql is nice for its ability to do this automatically.

  • iwaswrongonce Member
    edited June 2016

    @raindog308 said:

    When you talk commercial DBs (Oracle, MSSQL) you get features and capabilities that FOSS can't even dream of, but I am less certain on the performance part... [...] But if your problem is straight performance... well, you probably already have a lot of metrics. I'd be tempted to call vendors and say "here are our current stats. if you could double/triple performance, what would it cost?"

    Great post. Appreciate your time. You are spot on. Our volume of data is quite low (which is why I call it big-ish data) in that the entire dataset can easily fit in memory on a single 256gb box (we have a number of these at OVH). However, the frequency of the data is rather high (many rows of small size) and thus we still bump into performance issues. Perhaps we will have to call vendors but I am really a proponent of FOSS (or as close as one can get).

    So does anyone have experience with the likes of Druid, Greenplum, Vertica, Volt or Hawq? These are the ones that have smaller user bases or don't have much info out there regarding our use cases.

    Thanked by raindog308
  • zllovesuki Member
    edited June 2016

    Maybe you can try looking into RethinkDB? Join their Slack channel and ask questions; the devs are very active and helpful.

    Thanked by michaelphan
  • raindog308 Administrator, Veteran

    iwaswrongonce said: Perhaps we will have to call vendors but I am really a proponent of FOSS (or as close as one can get).

    iwaswrongonce said: single 256gb box (we have a number of these at OVH).

    Unfortunately, the commercial RDBMS prices are stratospheric. Oracle is $48K/core list (though no one pays that - still, you could pay $35-40K/core if you're not a big customer) and MSSQL is about half that. They do count cores 2:1 for licensing, but still, a 16-core box isn't that unusual these days and they'd be a quarter million each. Your large memory will push you into the enterprise editions, alas.

    Sorry, don't have anything else to recommend - will be watching this thread though :-)

  • ricardo Member
    edited June 2016

    Interesting thread. I can't offer recommendations, but I'd suggest spending a day seeing if rolling your own (if you program) is feasible.

    Often with pre-made things, their extensibility comes at the cost of performance. In many cases I've used Judy arrays in C, which offer performance comparable to hash tables but with the benefit of ordering. Having a rough gist of your use case, it sounds like a GPU would offer the most performance, loading straight off an SSD. Just throwing that out as an alternative view; most often folks stick with the tried and tested, but going by raindog's guidance on cost, it's certainly worth thinking through all the alternatives.

    Do let us know what you end up going with.

  • raindog308 Administrator, Veteran

    Druid's page looks cool, though I'd note they wrote off RDBMS solutions with a single MySQL test.

  • ehab Member

    For a time-series DB, someone suggested InfluxData. I haven't used it yet but am planning to.

    Thanked by Nomad
  • rincewind Member
    edited June 2016

    iwaswrongonce said: I have actually looked at Tarantool before, and aside from knowing nothing about Lua, it seems cool but I rarely see it discussed anywhere else.

    I've been using Lua for about 2 months, and it's good so far. The two largest user bases for Lua are Torch (a scientific computing library) and Love2D (a game engine). It's a good option for an embedded scripting language, and LuaJIT performance is just amazing.

    For storage engines, RocksDB does well for high write speeds and LMDB is optimized for reads. If you are rolling your own DB, you can decide on a query language and select the right storage. From what I hear, Facebook uses Mongo but with a RocksDB backend.

    If you have everything in memory and are still getting poor performance, take another look at your data structures. If geospatial indices don't work, you could look at space partitioning or LSH (locality-sensitive hashing) to efficiently encode neighborhoods.

    I've had bad experiences with Hadoop because of the JVM's garbage collection (EDIT: to be fair, this was back in 2011, so things might have changed). For your amount of data you might do better with a custom query language and tricking out your DB.

    Default MySQL/Maria has poor JOIN performance. Postgres would be a better option for "single truth".
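
    One crude form of the space-partitioning idea, expressed in SQL (illustrative only; assumes MariaDB-style persistent generated columns and made-up names): bucket every row into a fixed grid cell at write time, index the cell, and only run exact geometry tests on rows whose cells overlap the polygon's bounding box.

        -- Roughly 0.01-degree grid cells; the lat bucket is scaled far past the
        -- lon bucket's range so cell ids never collide.
        ALTER TABLE broadcasts
          ADD COLUMN cell_id INT AS (FLOOR(lat * 100) * 100000 + FLOOR(lon * 100)) PERSISTENT;

        CREATE INDEX idx_actor_cell_ts ON broadcasts (actor_id, cell_id, ts);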

  • jcaleb Member

    No technology will replace good design. You need to understand the execution plans of your queries and you may need to introduce new tables or maybe redesign your whole structure.

    Thanked by deadbeef
  • jcaleb Member

    iwaswrongonce said: Take a typical query: we self join the table on the next and previous events (we store the prev/next timestamps in each row for easy joins) to determine when an actor performed some behavior inside a geospatial polygon. Then we need to query the next time they performed some other action at some point in the past/present in another polygon. So we end up with hundreds of thousands of candidate rows with an equal number of subqueries that are scanning another hundred thousand or million rows.

    Try to partition your relevant tables by actor. Partitioning plus a local index may do wonders for you.
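
    For instance (purely illustrative MariaDB/MySQL syntax, with a made-up table name; note the partitioning key has to be part of every unique key on the table):

        -- Hash-partition by actor so single-actor queries are pruned to one
        -- partition; indexes on MySQL/MariaDB partitioned tables are local.
        ALTER TABLE broadcasts
          PARTITION BY HASH(actor_id)
          PARTITIONS 64;

        CREATE INDEX idx_actor_ts ON broadcasts (actor_id, ts);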

  • @jcaleb said:
    No technology will replace good design. You need to understand the execution plans of your queries and you may need to introduce new tables or maybe redesign your whole structure.

    This. I too have the feeling that @iwaswrongonce is trying to throw more horsepower at what sounds like a design issue.

  • Appreciate all the feedback. Re: design and access patterns, we understand this very well for ~75% of queries. But doing ad-hoc analysis the other 25% of the time we often break out of the typical pattern, even if it's just momentarily.

    However, I am also wary of over-engineering. I can double my computing resources far more cheaply than I can re-engineer some of the core design.

  • @iwaswrongonce said:
    Appreciate all the feedback. Re: design and access patterns, we understand this very well for ~75% of queries. But doing ad-hoc analysis the other 25% of the time we often break out of the typical pattern, even if it's just momentarily.

    However, I am also wary of over-engineering. I can double my computing resources far more cheaply than I can re-engineer some of the core design.

    Exactly my thoughts. You have a mountain of technical debt and you're in the ugly position of having to do something about it. The technical debt was a result of bad design decisions, and now you're doubling down with more of it. Guess how that's going to end.

    Also, if 25% of your queries are outside of the patterns you've designed for, you don't have a problem of over-engineering... you have a problem of "ain't nobody got time fo' dat" ... "engineering".

  • Hxxx Member

    A nice and "worth the time" thread. Very interesting suggestions. +1

  • Have you looked at Postgres? It has PostGIS for spatial queries (with a pretty good feature set) and it has big data and analytics extensions.

    I have not used it in either role, but it is a very nice RDBMS in lots of other ways and is very flexible, so it might well work for you.

  • iwaswrongonce Member
    edited June 2016

    @deadbeef said:
    Exactly my thoughts. You have a mountain of technical debt and you're in the ugly position of having to do something about it. The technical debt was a result of bad design decisions, and now you're doubling down with more of it. Guess how that's going to end.

    Also, if 25% of your queries are outside of the patterns you've designed for, you don't have a problem of over-engineering... you have a problem of "ain't nobody got time fo' dat" ... "engineering".

    Perhaps I overstated the situation. We are not anywhere near having material performance issues. But in trading, if you stop trying to get better, faster, or more accurate... you've already lost. That's where we are today. I look at our performance and we are pleased with it, but I just keep feeling we can and should be better. You are most certainly right that there could have been better design decisions. However, we have no customers, so to speak. If we wanted to scrap it all and re-engineer, we certainly could. My comment about hardware vs. people is of course simply a comment on premature optimization. Not that I am keen to go down a path that doesn't scale and yields compound issues later on. So all points well taken.

    Thanked by deadbeef
  • @graemep said:
    Have you looked at Postgres? It has PostGIS for spatial queries (with a pretty good feature set) and it has big data and analytics extensions.

    I have not used it in either role, but it is a very nice RDBMS in lots of other ways and is very flexible, so it might well work for you.

    Yes, but the problem is that parallel execution is still very young and (as I understand it) mostly applicable to full table scans right now. So basically it overlaps with Impala.

    I have started to play around with Greenplum (which is MPP based on PG). Will report my thoughts and findings on that later.

  • jcaleb Member

    As I mentioned above, try partition and local index. At least by actors. Depending on the situation, it may give you a thousandfold performance increase.

  • @jcaleb said:
    As I mentioned above, try partition and local index. At least by actors. Depending on the situation, it may give you a thousandfold performance increase.

    That is what we do currently in Memsql. Which platform do you suggest doing this with?

  • jcaleb Member
    edited June 2016

    @iwaswrongonce said:

    @jcaleb said:
    As I mentioned above, try partition and local index. At least by actors. Depending on the situation, it may give you a thousandfold performance increase.

    That is what we do currently in Memsql. Which platform do you suggest doing this with?

    Your MySQL/MariaDB. You have partitioning for an in-memory DB? That's possible?

    Also enable logging of query execution times and spend time on the slower queries. Study their execution plans to learn how to optimize them.
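
    For example (MariaDB/MySQL; the threshold and the query are just examples):

        -- Log anything slower than 1 second, plus queries not using indexes.
        SET GLOBAL slow_query_log = ON;
        SET GLOBAL long_query_time = 1;
        SET GLOBAL log_queries_not_using_indexes = ON;

        -- Then study the plan of whatever shows up in the log:
        EXPLAIN SELECT * FROM broadcasts
                 WHERE actor_id = 123
              ORDER BY ts DESC LIMIT 100;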

  • jcaleb Member

    But of course most advice here is blind, because we can't really get more info about your system.

  • iwaswrongonce said: Take a typical query: we self join the table on the next and previous events (we store the prev/next timestamps in each row for easy joins) to determine when an actor performed some behavior inside a geospatial polygon. Then we need to query the next time they performed some other action at some point in the past/present in another polygon. So we end up with hundreds of thousands of candidate rows with an equal number of subqueries that are scanning another hundred thousand or million rows.

    I was reading through your sample query again, and the join (t-1, t, t+1) looks really wasteful. You should be able to prune entries much before you compute the join, instead of first computing a join and then pruning out tuples. Maybe you would do better to consider your actor data just as the vector sequence it is.

    You could check for multiple conditions on a single pass through the sequence. Even otherwise, multiple passes would still be in linear time. Skip the whole Memsql thing and go with a hashmap of C++ vectors ;)
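
    For what it's worth, the same single-pass idea can also be written in SQL on an engine that has window functions (e.g. Postgres): partition by actor, order by time, and read neighbours with LAG/LEAD instead of materializing a self-join. Purely illustrative:

        -- One ordered pass per actor; no explicit self-join on prev/next rows.
        SELECT actor_id, ts,
               LAG(ts)  OVER w AS prev_ts,
               LEAD(ts) OVER w AS next_ts
          FROM broadcasts
        WINDOW w AS (PARTITION BY actor_id ORDER BY ts);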

  • @rincewind said:
    I was reading through your sample query again, and the join (t-1, t, t+1) looks really wasteful. You should be able to prune entries much before you compute the join, instead of first computing a join and then pruning out tuples. Maybe you would do better to consider your actor data just as the vector sequence it is.

    You could check for multiple conditions on a single pass through the sequence. Even otherwise, multiple passes would still be in linear time. Skip the whole Memsql thing and go with a hashmap of C++ vectors ;)

    We of course do filter prior to the join, typically down to a date range of interest. But it's often not until after the join that we know which rows are of interest.

  • iwaswrongonce said: after the join

    I suggested that you don't need to compute the join explicitly. Your time series is already in memory sorted by time. Just view the data as a vector - a C++ vector of records per actor might be a better data model than a MemSQL table.

  • iwaswrongonce Member
    edited June 2016

    @rincewind said:

    iwaswrongonce said: after the join

    I suggested that you don't need to compute the join explicitly. Your time series is already in memory sorted by time. Just view the data as a vector - a C++ vector of records per actor might be a better data model than a MemSQL table.

    Fully agree. Fair point. One issue for us as well is that keeping compatibility with existing tools is a plus when you bring in analysts who are familiar with a specific technology. Memsql is nice for its protocol compatibility with MySQL, and obviously loads of people are comfortable using SQL to do analysis.
