Loading and reporting on 180 million records

I have a csv file with the following data:

GUID,low,2016-08-25 05:05:01
GUID,N/A,2016-08-25 05:05:02
GUID,Medium,2016-08-25 05:05:01
123String,High,2016-08-25 05:06:01
String,low,2016-08-25 05:05:01
String abc,N/A,2016-08-25 06:05:01

Around 180 million records and growing... I need to consume it and run some basic time-based reports (last column) and impact reports (middle column). Most of the entries in the first column are unique (around 5% repeated).

Every day, I will be adding at least 100,000+ records to this existing set.

I have always worked with an RDBMS and the LAMP stack... but I want to get my hands dirty with big-data technologies (not sure if this qualifies as big data)...

What will be the best way to load this data-set and run a few reports?

Any pointers are appreciated.

Comments

  • What have you done? What's the problem?

  • MySQL should be able to handle this quite easily. You need either enough RAM to load your entire data + indexes into memory, or to run your database from an SSD. Use InnoDB and make sure that columns that you are ordering by, or that are part of a WHERE clause, are indexed (you don't want to constantly crawl through 200 million records).

  • I'd say it depends on how often you want to query the data. If it's just an occasional thing, I'd just leave the data on disk and even a simple grep & cut into a RAM disk may give you respectable performance.

    Though if you wanted to get fancy, you'd want to use 32-bit timestamps. I'd put them in Judy arrays, but for something pre-rolled, I'd agree MySQL would be absolutely fine. Totally depends on what you need it for (e.g. high availability, lots of concurrent users).

  • @tommy said:
    What have you done? What's the problem?

    I have worked with a similar amount of data with RDBMSs like SQL Server/MySQL. I am trying to get some info on getting into NoSQL/big-data tech systems which can consume this data.

  • @EdgeWeb said:
    MySQL should be able to handle this quite easily. You need either enough RAM to load your entire data + indexes into memory, or to run your database from an SSD. Use InnoDB and make sure that columns that you are ordering by, or that are part of a WHERE clause, are indexed (you don't want to constantly crawl through 200 million records).

    Sure, I have worked with similar data in SQL Server. But, as I said, I want to see if I can use big-data (NoSQL) technologies to consume/report off this.

  • @ricardo said:
    I'd say it depends on how often you want to query the data. If it's just an occasional thing, I'd just leave the data on disk and even a simple grep & cut into a RAM disk may give you respectable performance.

    Though if you wanted to get fancy, you'd want to use 32-bit timestamps. I'd put them in Judy arrays, but for something pre-rolled, I'd agree MySQL would be absolutely fine. Totally depends on what you need it for (e.g. high availability, lots of concurrent users).

    Mostly, it's for my internal use, so no need for high availability or anything. I have an i5 processor with 12 GB of RAM, I believe, and a regular 7200 RPM HDD.

    I just want to consume the data and experiment with it.

    The information on the internet is slightly overwhelming for a person who has always worked with the LAMP stack and wants to dive into new technologies, but can't because there are too many things available to start with.

  • A relatively simple setup could be something along the lines of this, which is of course only one of many ways you could do it.

    I'd put your CSV into an InnoDB table, using the timestamp you're going to query on as the primary key. Make it a BIGINT: 4 bytes for the timestamp data, plus an auto-increment value added on to avoid collisions between rows with the same timestamp.

    You could partition it, perhaps considering date ranges you'd often query for. e.g. If you query over time spans of weeks or months, it makes sense to partition it monthly.

    Since the PK is always in order, even with new insertions, you'll avoid fragmentation issues. You should stick with the default BTREE index type.

    Create a separate table (or tables) mapping the 'impact' value to the PK of the other table. It looks like you only have a small range of 'impact' values, so it may make sense to create a separate table for each so you can forego storing the value at all in the main table.

    If you set innodb_buffer_pool_size to a decent value, your primary key(s) will get cached in memory and queries should return at decent speed. It's perhaps not the 'best' way, but it seems like one of the easier routes and should give good performance.
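
    A rough sketch of that kind of schema, driven from Python with the mysql-connector-python package (the table, column and partition names are invented, and this variant uses a composite (timestamp, auto-increment) primary key rather than packing both into one BIGINT):

    # Sketch only: assumes a local MySQL server, the mysql-connector-python
    # package, and made-up credentials/table names.
    import mysql.connector

    conn = mysql.connector.connect(user="report", password="secret",
                                   database="events_db")
    cur = conn.cursor()
    cur.execute("""
        CREATE TABLE IF NOT EXISTS events (
            id         BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
            guid       VARCHAR(64) NOT NULL,
            impact     ENUM('N/A','low','Medium','High') NOT NULL,
            event_time DATETIME NOT NULL,
            PRIMARY KEY (event_time, id),  -- clustered on time; id breaks ties
            KEY idx_id (id),               -- InnoDB wants the auto-increment column to lead an index
            KEY idx_guid (guid),
            KEY idx_impact (impact)
        ) ENGINE=InnoDB
        PARTITION BY RANGE (TO_DAYS(event_time)) (
            PARTITION p2016_08 VALUES LESS THAN (TO_DAYS('2016-09-01')),
            PARTITION p2016_09 VALUES LESS THAN (TO_DAYS('2016-10-01')),
            PARTITION p_future VALUES LESS THAN MAXVALUE
        )""")
    conn.commit()
    cur.close()
    conn.close()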

  • When you're starting with CSV, you can find all sorts of adaptors (SQL and NoSQL) that allow you to easily use or import from the file. Like ricardo said, though, what you do really comes down to how you want to both access and update the data. If you're constantly reprocessing hundreds of millions of records, it'll be slow going compared to just doing updates based on the 100K new records. If it's all just doing some aggregate calculations without any real future need to process the data, you could get away with handling it MapReduce-style with a PHP (or whatever) script that just parses the CSV directly.
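
    For example, a short Python script (file name and column layout assumed from the sample rows above) can stream the CSV once and build those kinds of aggregates without any database:

    # Sketch: one streaming pass over the CSV, counting per day and per impact.
    # Assumes a file named events.csv with rows like: GUID,impact,YYYY-MM-DD HH:MM:SS
    import csv
    from collections import Counter

    per_day = Counter()
    per_impact = Counter()

    with open("events.csv", newline="") as f:
        for row in csv.reader(f):
            if len(row) != 3:
                continue                 # skip malformed lines
            guid, impact, ts = row
            per_day[ts[:10]] += 1        # "2016-08-25"
            per_impact[impact] += 1

    for day, count in sorted(per_day.items()):
        print(day, count)
    print(per_impact.most_common())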

  • Okay, so it seems nobody uses big-data technology yet? Because the guy just wants to try some big-data technology on this big dataset, as I understand it...

  • If I assume 512 bytes per record, that comes out to 85GB, which though large, is still Ok for any Database. Even SQLite can handle that.

    Helps if you can give more info to narrow down your options:

    • Which is more critical for performance: read or write/update speeds?

    • Will insertions (always) be in sorted order?

    • Is your key (GUID?) fixed length? How long?

    • What are the type of queries you expect? Do you need JOIN or any other complex operations? Or is it mostly search/lookup like in a hash-map? Do you need range-queries : all records between x and y? If your queries are simple enough, you can use a hash-map (or key-value store) and don't need SQL.

    B-Tree type databases are good for reads and lookups, but write-amplification can degrade write performance. LSM-tree based DBs have better write-speeds. If you can explain the sort of trade-offs you are willing to make, then it becomes easier to evaluate your options.

  • rds100 Member
    edited August 2016

    Just keep it as csv (if you only add records at the end and don't modify the already existing records).

    You can easily parse the whole csv with a simple perl script and process it entirely in memory. 180k records x about 30 bytes = 5 gigabytes - not impossible to get such a server.
    Even if the records are larger than 30 bytes, it's not impossible to process this in memory. It's not that big.
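
    The same idea in Python for illustration (a Perl version would look much the same; the file name is assumed): slurp everything into memory and answer ad-hoc questions from there.

    # Sketch: load the whole CSV into an in-memory index keyed by GUID.
    # Feasible as long as the dataset fits in RAM, as estimated above.
    import csv
    from collections import defaultdict

    by_guid = defaultdict(list)
    with open("events.csv", newline="") as f:
        for row in csv.reader(f):
            if len(row) == 3:
                guid, impact, ts = row
                by_guid[guid].append((impact, ts))

    print("distinct GUIDs:", len(by_guid))
    print("GUIDs seen more than once:",
          sum(1 for hits in by_guid.values() if len(hits) > 1))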

  • William Member
    edited August 2016

    Sure, you could load that into DynamoDB/Cassandra (mhm...) or MongoDB (eh...) (or pretty much anything else) and it will likely run faster than some fs/shell-based regex/sort, but at that size (much less than 100GB, at most?) none of this makes sense. MySQL (if you want to be fancy, build something like a MariaDB-based cluster or use something like MemSQL) is probably easier to handle.

    Your table design needs work for the import (you should clearly not keep it like that), but performance later should be pretty much "good". CloudSQL (Google), AWS RDS and Azure SQL work pretty well but are not cheap to start with (e.g. the price of a dedicated instance).
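
    For the import itself, MySQL's LOAD DATA is usually the quickest route from CSV; a rough sketch (file path, credentials and the events table are assumptions, again using mysql-connector-python):

    # Sketch: bulk-load the CSV. Both the client connection and the MySQL
    # server need local_infile enabled for LOAD DATA LOCAL to work.
    import mysql.connector

    conn = mysql.connector.connect(user="report", password="secret",
                                   database="events_db",
                                   allow_local_infile=True)
    cur = conn.cursor()
    cur.execute("""
        LOAD DATA LOCAL INFILE 'events.csv'
        INTO TABLE events
        FIELDS TERMINATED BY ','
        LINES TERMINATED BY '\\n'
        (guid, impact, event_time)""")
    conn.commit()
    cur.close()
    conn.close()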

  • vfuse Member, Host Rep

    It would be easier to advise what kind of database would work best if you tell us more about the type of queries you will run.

  • Dormeo said: Okay, so it seems nobody uses big-data technology yet? Because the guy just wants to try some big-data technology on this big dataset, as I understand it...

    Big data is just a fuzzy term like 'the Cloud'. This dataset seems small enough that it'd fit on one machine comfortably and is for private use.

    If it were something that was receiving thousands of queries a second, then perhaps a more scalable solution would warrant further thought, IMO.

  • @rds100 said:
    Just keep it as csv (if you only add records at the end and don't modify the already existing records).

    You can easily parse the whole csv with a simple perl script and process it entirely in memory. 180k records x about 30 bytes = 5 gigabytes - not impossible to get such a server.
    Even if the records are larger than 30 bytes, it's not impossible to process this in memory. It's not that big.

    my current dataset is 180 million and counting

  • @vfuse said:
    It would be easier to advise what kind of database would work best if you tell us more about the type of queries you will run.

    I am simply going to run some simple reporting queries, like counting events by date-time, grouping recurring GUIDs, trends in the high/low/medium impact (at certain times of the day), etc.

    Nothing fancy. So, I felt this may be a good segue into seeing how big-data technologies are used...
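
    In SQL terms these would just be a handful of GROUP BY queries, roughly like the following sketch (SQLite used purely as a stand-in, with an assumed events(guid, impact, event_time) table):

    # Sketch of the reporting queries against SQLite (standard library only).
    import sqlite3

    conn = sqlite3.connect("events.db")

    # Events per day
    for row in conn.execute(
            "SELECT date(event_time) AS day, COUNT(*) FROM events "
            "GROUP BY date(event_time) ORDER BY day"):
        print(row)

    # Recurring GUIDs (the ~5% that repeat)
    for row in conn.execute(
            "SELECT guid, COUNT(*) AS n FROM events "
            "GROUP BY guid HAVING COUNT(*) > 1 ORDER BY n DESC LIMIT 20"):
        print(row)

    # Impact mix by hour of day
    for row in conn.execute(
            "SELECT strftime('%H', event_time) AS hour, impact, COUNT(*) "
            "FROM events GROUP BY strftime('%H', event_time), impact ORDER BY hour"):
        print(row)

    conn.close()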

  • @rincewind said:
    If I assume 512 bytes per record, that comes out to 85GB, which though large, is still Ok for any Database. Even SQLite can handle that.

    Helps if you can give more info to narrow down your options:

    • Which is more critical for performance: read or write/update speeds?

    It's purely for personal learning, so speed is not a concern.

    • Will insertions (always) be in sorted order?

    They will be inserted in sequential time order.

    • Is your key (GUID?) fixed length? How long?

    32 chars. But it may vary with some non-standard GUIDs.

    • What are the type of queries you expect? Do you need JOIN or any other complex operations? Or is it mostly search/lookup like in a hash-map? Do you need range-queries : all records between x and y? If your queries are simple enough, you can use a hash-map (or key-value store) and don't need SQL.

    range queries, time queries, trending...

    Again, all for the sole purpose of understanding how big-data technologies work and trying something different.

    B-Tree type databases are good for reads and lookups, but write-amplification can degrade write performance. LSM-tree based DBs have better write-speeds. If you can explain the sort of trade-offs you are willing to make, then it becomes easier to evaluate your options.

  • I see a database as having

    • A storage engine
    • Meta functions (Index/views/triggers) and query language
    • Safety guarantees : ACID/CAP

    NewSQL/NoSQL customizes them to handle special types of data. Eg- Document stores, Time-series databases, Graph databases etc. Some weaken ACID, but that's usually a bad idea.

    For time-series data, InfluxDB is a good option. But if you really want to blow your mind, try Datomic. Kdb+ is a commercial DB with an unusual query language.

    For graph and document stores, there's RethinkDB, ArangoDB, TinkerPop.

    TiDB is a good starting point if you want to try out different storage engines while remaining MySQL compatible. Among storage engines: LMDB is the fastest B-Tree implementation I know, and there's RocksDB for LSM-trees.

    If you want to dig deeper, search github for awesome lists. Eg- https://github.com/erictleung/awesome-nosql and https://github.com/numetriclabz/awesome-db

  • @rincewind said:
    I see a database as having

    • A storage engine
    • Meta functions (Index/views/triggers) and query language
    • Safety guarantees : ACID/CAP

    NewSQL/NoSQL customizes them to handle special types of data. Eg- Document stores, Time-series databases, Graph databases etc. Some weaken ACID, but that's usually a bad idea.

    For time-series data, InfluxDB is a good option. But if you really want to blow your mind, try Datomic. Kdb+ is a commercial DB with an unusual query language.

    For graph and document stores, there's RethinkDB, ArangoDB, TinkerPop.

    TiDB is a good starting point if you want to try out different storage engines while remaining MySQL compatible. Among storage engines: LMDB is the fastest B-Tree implementation I know, and there's RocksDB for LSM-trees.

    If you want to dig deeper, search github for awesome lists. Eg- https://github.com/erictleung/awesome-nosql and https://github.com/numetriclabz/awesome-db

    Thanks. Will check out the options and the "awesome" lists.

  • lewekleonek Member
    edited August 2016

    @dwnewyork5 said:
    I have a csv file with the following data:

    GUID,low,2016-08-25 05:05:01
    GUID,N/A,2016-08-25 05:05:02
    GUID,Medium,2016-08-25 05:05:01
    123String,High,2016-08-25 05:06:01
    String,low,2016-08-25 05:05:01
    String abc,N/A,2016-08-25 06:05:01

    Around 180 million records and growing... I need to consume it and run some basic time-based reports (last column) and impact reports (middle column). Most of the entries in the first column are unique (around 5% repeated).

    Every day, I will be adding at least 100,000+ records to this existing set.

    I have always worked with an RDBMS and the LAMP stack... but I want to get my hands dirty with big-data technologies (not sure if this qualifies as big data)...

    What will be the best way to load this data-set and run a few reports?

    Any pointers are appreciated.

    If you really would like to get familiar with big data, then I would start with Hadoop MapReduce; this way you'll be able to handle processing of large data chunks in quite an efficient way (in parallel, given that you can run your MapReduce jobs on multiple nodes).
    Hortonworks has a pretty good big data sandbox and tutorials - http://hortonworks.com/tutorials/

    No project is too small for big data if you want to learn, but from a practical perspective an HDFS block size is usually set by default to 128MB (megabytes), or 64MB for vanilla Apache Hadoop.

    BTW: I agree with @ricardo - the data set you're working with would most likely be a pet project if you're going to go the big-data way, but it's a good way to learn.
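
    If you go that route, Hadoop Streaming lets you write the MapReduce job as two plain scripts; a minimal events-per-hour count in Python (script names, HDFS paths and the streaming jar location are placeholders) could look like this:

    # mapper.py - Hadoop Streaming mapper: emit "YYYY-MM-DD HH\t1" per CSV row.
    # Example run (paths and jar location are placeholders):
    #   hadoop jar hadoop-streaming.jar -input /data/events.csv -output /out/hourly \
    #       -mapper mapper.py -reducer reducer.py
    import sys

    for line in sys.stdin:
        parts = line.rstrip("\n").split(",")
        if len(parts) == 3:
            print(parts[2][:13] + "\t1")   # key = timestamp truncated to the hour

    # reducer.py - Hadoop Streaming reducer: sum the counts per hour key.
    # (Streaming delivers the mapper output sorted by key.)
    import sys

    current, total = None, 0
    for line in sys.stdin:
        key, _, value = line.rstrip("\n").partition("\t")
        if key != current:
            if current is not None:
                print(current + "\t" + str(total))
            current, total = key, 0
        total += int(value)
    if current is not None:
        print(current + "\t" + str(total))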

  • @lewekleonek said:

    If you really would like to get familiar with big data, then I would start with Hadoop MapReduce; this way you'll be able to handle processing of large data chunks in quite an efficient way (in parallel, given that you can run your MapReduce jobs on multiple nodes).
    Hortonworks has a pretty good big data sandbox and tutorials - http://hortonworks.com/tutorials/

    No project is too small for big data if you want to learn, but from a practical perspective an HDFS block size is usually set by default to 128MB (megabytes), or 64MB for vanilla Apache Hadoop.

    BTW: I agree with @ricardo - the data set you're working with would most likely be a pet project if you're going to go the big-data way, but it's a good way to learn.

    Thanks, will check Hortonworks as a starting point.
