When the NoSQL movement began years ago, there was intense debate on which NoSQL databases were best for heavy-lifting applications with lots of data. Today, that discussion is settled, with those in the know acknowledging that Apache Cassandra™ is the top NoSQL engine for tackling large volumes of operational data.
Cassandra's masterless, divide-and-conquer architecture easily sails past all other databases that are either master-slave or multi-master in design, and ensures that your applications are future-proofed where scale is concerned.
That said, open source Cassandra has a recommended storage limitation of around one terabyte of data per machine (or virtual machine for those carving up larger hardware). This is due to the overhead involved in streaming operations when nodes are added, removed or replaced, or for standard operational activities such as compaction and repairs. For extraordinarily large databases, this restriction can lead to negative management and cost implications.
At DataStax, we're acutely aware of this limitation and have a number of projects in the works to address it. In fact, some of our recent internal testing has shown that, with DataStax Enterprise and the improvements delivered in our advanced performance suite, you can store 2-4X the amount of compressed data per node (physical or virtual) and therefore realize some nice productivity and cost savings when building out your on-premises and cloud applications, all the while maintaining optimized levels of performance.
Our technical teams have confirmed that, thanks to these advanced performance enhancements and storage engine optimizations, out-of-the-box DSE can store more than double the compressed data per node of open source Cassandra for general applications. For time-series-style applications (e.g., IoT), DSE can handle 4X the compressed data of open source.
Note that this is compressed data and not raw data size. Compression in DSE can reduce your overall data footprint by 25-33%, plus net you some nice read/write performance benefits as well. Keep that in mind when doing the deployment math for your clusters (i.e., you can store more data per node than you think).
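To make that deployment math concrete, here's a minimal back-of-the-envelope sizing sketch. All the numbers are illustrative assumptions drawn from the figures above: a ~1 TB compressed-data comfort limit per open source node, a 30% compression savings (within the 25-33% range), a replication factor of 3, and a 2X density multiplier for DSE; your actual limits will depend on your workload and hardware.

```python
import math

def nodes_needed(raw_tb, replication_factor=3, compression_savings=0.30,
                 per_node_limit_tb=1.0, density_multiplier=1.0):
    """Rough node-count estimate from raw data size in terabytes.

    Assumptions (illustrative, not official guidance): data compresses by
    `compression_savings`, every byte is stored `replication_factor` times,
    and each node comfortably holds `per_node_limit_tb` of compressed data,
    scaled by `density_multiplier` for a denser storage engine.
    """
    compressed_tb = raw_tb * (1 - compression_savings) * replication_factor
    return math.ceil(compressed_tb / (per_node_limit_tb * density_multiplier))

# 100 TB of raw data at RF=3:
oss_nodes = nodes_needed(100)                          # baseline density
dse_nodes = nodes_needed(100, density_multiplier=2.0)  # 2X denser nodes
print(oss_nodes, dse_nodes)  # → 210 105
```

The point of the sketch is simply that doubling per-node density halves the node count for the same data set, which is where the infrastructure savings come from.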
How does DSE pull this off? Long story short, it involves a number of the performance architecture enhancements and storage engine optimizations made in DSE 6, which you can read about here. We've got even more goodness coming on this front in soon-to-be-released versions of DSE, but that's all I'll say about that right now.
So what are some real-life examples of this in action?
One of our customers is a large media content delivery company. Like many, they started on open source Cassandra and cobbled together other supporting open source technologies while taking advantage of cloud hosting from one of the "big 3" cloud service providers.
As they grew, so did their environment, along with the operational expense of maintaining more than 300 open source Cassandra nodes. Worse, they expected their cloud costs to triple in the next three years.
They came to DataStax in hopes of reducing their forecasted expenses and were not disappointed. With DSE's advanced performance suite and its ability to store more data per node than open source, they saved $3.2 million. OpsCenter and other management automation saved them another $900K, and as an added bonus, DSE's integrated search let them eliminate a MongoDB search cluster, saving a further $2.7 million.
A nearly $7 million savings over three years: not bad!
We're currently doing something similar with another customer, a major name in the oil and gas industry. As part of their focus on moving to standardized technologies, they have been comparing the true cost of open source Cassandra vs. a solution like DSE.
We were brought in to conduct a collaborative multi-year build vs. partner analysis that looked at multiple areas, with some eye-popping conclusions:
- They will be able to reduce cloud spend by 30-40% thanks to DSE's advanced performance and improved storage density, an estimated savings of approximately $13M over five-plus years.
- Reduced development costs, plus self-management tools to manage, monitor, and provision Apache Cassandra, yield $3M in savings over five years.
- Formal support and services provide another couple of million in savings, coupled with a six- to nine-month reduction in time to market for their applications.
Again, nothing to sneeze at.
Scale Out vs. Up Considerations
One caveat on this topic is worth mentioning. While it's tempting to put as much data as possible on every machine in a cluster, you need to ensure you don't jeopardize other aspects of your deployment such as uptime and overall capacity potential.
For example, maybe you can get away with only three or four nodes from a data volume standpoint, but you should keep in mind that if one of those nodes goes down, you immediately lose 25-33% of your capacity, and that could be a big deal breaker where your application is concerned.
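The trade-off above is easy to quantify. A minimal sketch of the capacity math (a simplification that assumes evenly distributed data and sets aside replication, which lets surviving replicas absorb some of the load):

```python
def capacity_lost_pct(node_count):
    """Percent of total cluster capacity taken offline by one node failure,
    assuming data is spread evenly across nodes."""
    return 100.0 / node_count

for n in (3, 4, 10, 20):
    print(f"{n} nodes: one failure removes {capacity_lost_pct(n):.1f}% of capacity")
# 3 nodes: one failure removes 33.3% of capacity
# 4 nodes: one failure removes 25.0% of capacity
# 10 nodes: one failure removes 10.0% of capacity
# 20 nodes: one failure removes 5.0% of capacity
```

In other words, denser nodes shrink your cluster, but a smaller cluster makes each individual failure proportionally more painful, so size for availability as well as for data volume.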
Today, the smart database choice for managing large amounts of distributed data at scale remains Cassandra. With DataStax Enterprise, you can manage more data per node than open source Cassandra, saving yourself both time and money.