Mongo for relatively big amounts of documents
I'm planning to use MongoDB as document storage for a project I'm working on. The data will rarely be read, but will be written in large volumes (over 1 million documents per day).
Does anyone here have experience with such environments? What kind of setup/servers do you have for this?
Thanks in advance.
Comments
Sharding is your friend.
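Not a substitute for real benchmarks, but as a rough illustration of why a hashed shard key suits a write-heavy load like this, here's a small Python sketch (the shard count and ID format are made up; MongoDB does this server-side when you shard on a hashed key):

```python
import hashlib
from collections import Counter

def shard_for(doc_id: str, num_shards: int = 4) -> int:
    # Hash the ID so even sequential IDs spread evenly across shards,
    # mimicking the behaviour of a hashed shard key.
    digest = hashlib.md5(doc_id.encode()).hexdigest()
    return int(digest, 16) % num_shards

# Simulate 100k inserts and count how many land on each shard.
counts = Counter(shard_for(f"click-{i}") for i in range(100_000))
print(dict(counts))  # roughly 25,000 per shard
```

The point is that a monotonically increasing key (timestamps, ObjectIds) would send every write to the same shard; hashing avoids that hotspot.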
Thanks for your reply. Yes, I was thinking of using sharding, but I was hoping to get ideas on specific configurations suitable for handling this amount of data, based on real-life experience.
Of course, I will stress test the system before going live, but I want to have an idea of what would be a good starting point.
Rarely read -> disk based database. Cassandra is probably a good choice but it depends on the specifics.
I had sensors pushing tens of values per minute, 24 hours a day, using sharding and replication, and didn't have any problems. A normal CentOS server with at least 4GB of RAM is good enough.
It's just information about the client performing the request. The only things I require are being able to find a record by a unique ID, and being able to search by exact field match - although those searches would only be done occasionally, and manually, so it's no problem if the results have to be queued back to the user.
The data is ephemeral (kept for 30 days), so it should only need to handle 30-60 million rows at a time without problems.
I think this data is a bit larger than plain sensor readings, but your reply still helps me get an idea of the kind of hardware I'd need.
Ah! Well, Mongo will do then
I've scaled Mongo quite large but you need to think a lot about memory.
Another option is Elasticsearch
Just be careful with it, because it is not a database in the sense that it does NOT guarantee data integrity. Its docs say it cannot be used as a "source of truth" and that you need to keep the data in a real database for that.
@joepie91 will be along shortly to explain why using Mongo is always a mistake.
And what is your opinion? Is it a mistake? I think there's a use case for every storage system.
I personally have no use for a storage system that admits unreliability.
And what would be your suggestion?
I wouldn't use elasticsearch as a general purpose db. It's tailored to search.
But if you had to search, es is awesome.
Yes.
Mongo...
... so realistically, there's nothing it's good at, and a bunch of stuff it's outright bad at.
That's nonsense. There is absolutely nothing that prevents a piece of software from being objectively bad. And MongoDB is such an objectively bad piece of software - there are no use cases that aren't better solved by alternative options. It lives purely off hype.
PostgreSQL, most likely. That, or Cassandra. It depends on the kind of data.
i agree with ^ 100%
i think the hype comes from the fact that its shell and queries use js
@joepie91 Thanks a lot for your insights regarding MongoDB and your advice on storage systems. It's been really useful.
I think I now have enough information to start making trials. More input is welcome of course, and thanks a lot to everyone who posted in this thread so far.
What are you actually writing? You said "documents." If it's actually documents, then Cassandra is not good for you. If it's just a location of a document, or some sort of URL that points to it, then you're fine. Here are the data types supported by Cassandra: http://docs.datastax.com/en/cql/3.0/cql/cql_reference/cql_data_types_c.html. Make sure you understand the difference between NoSQL and SQL-like languages in terms of what you're able to query.
Cassandra is pretty fast and scales linearly. For a distributed system, it's fairly easy to manage.
I need to save the data from each click (hit), and be able to recover it later by a unique ID. The click information is an array of data.
Cassandra should be fine then.
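For the access pattern described (store a click record under a unique ID, look it up later), any key-value-ish store works. A minimal sketch of the pattern, using stdlib sqlite3 only for illustration - PostgreSQL with a JSONB column works the same way, and the IDs and fields here are made up:

```python
import json
import sqlite3
from typing import Optional

# Hypothetical schema: one JSON click record per unique ID.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE clicks (id TEXT PRIMARY KEY, data TEXT)")

def save_click(click_id: str, data: dict) -> None:
    # Store the click payload as a JSON document keyed by its ID.
    db.execute("INSERT INTO clicks VALUES (?, ?)",
               (click_id, json.dumps(data)))

def find_click(click_id: str) -> Optional[dict]:
    # Primary-key lookup: the only fast path this workload needs.
    row = db.execute("SELECT data FROM clicks WHERE id = ?",
                     (click_id,)).fetchone()
    return json.loads(row[0]) if row else None

save_click("abc123", {"ip": "203.0.113.7", "ua": "Mozilla/5.0"})
print(find_click("abc123")["ip"])  # 203.0.113.7
```

The occasional exact-field-match searches mentioned earlier would be full scans in this shape, which is fine if they're rare and manual; in PostgreSQL a GIN index on the JSONB column would cover them.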
it is good for doing small experiments. not good for anything used in real life
Well, no, not even really that.
A small experiment is generally one of two things: a throwaway learning exercise, or a prototype that may end up growing into a real project.
In both cases, you're better off using something that is either already production-ready, or likely will be production-ready in the near future. There's not really a point in experimenting with something that you can't use in production anyway.