To get the most out of your Big Data in the cloud and maximize your Return on Analytics (ROA), here are some simple ideas that will help you leverage the full potential of the cloud:
Enhance Your Data
Having good data is always better than having lots of data. Incorrect or inconsistent data can lead to skewed results. For example, when you have to analyze data from hundreds of disparate sources, inconsistency in the structure and format of the datasets often leads to biased insights, especially when the data is not transposed or transformed into a common format. To get accurate and consistent data, it's important to enhance it. Enhancing data typically refers to cleansing, validating, normalizing, de-duplicating and collating the data.
Performing “data hygiene” and cleaning up massive amounts of data is not a trivial task, especially when you have disparate sources and large amounts of data collected at different time intervals. You can only achieve a certain percentage of “data enhancement” programmatically through scripts and programs (like running ETL jobs). For higher accuracy, you will quickly realize that you need a human eye to validate, normalize or collate the data. Enhancements like identifying contextual similarities between columns (geo code vs. street address) or normalizing catalogs can be done better by humans than by computers. Using the cloud, you can get access to a massive human workforce and split your big job into short tasks such as validating the suggestions made by a computer program, normalizing datasets, mapping data elements, providing metadata, cleaning up irrelevant data, and transcribing audio/video files.
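The programmatic portion of data enhancement can be sketched in a few lines. This is a minimal illustration only, using hypothetical field names (`email`, `city`); a real ETL job would handle many more formats and edge cases, and would hand ambiguous rows off to human reviewers as described above.

```python
import re

def enhance(records):
    """Cleanse, validate, normalize and de-duplicate a list of raw records.

    A minimal sketch of the scripted part of "data enhancement";
    rows it cannot confidently fix would go to human review.
    """
    seen = set()
    clean = []
    for rec in records:
        # Validate: skip rows missing the fields we need.
        if not rec.get("email") or not rec.get("city"):
            continue
        # Normalize: lower-case emails, collapse whitespace, title-case cities.
        email = rec["email"].strip().lower()
        city = re.sub(r"\s+", " ", rec["city"].strip()).title()
        # De-duplicate on the normalized email.
        if email in seen:
            continue
        seen.add(email)
        clean.append({"email": email, "city": city})
    return clean
```

Run over a batch of raw rows, this yields a consistent, duplicate-free dataset ready for analysis.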
Point Your Data Source to the Cloud
When your philosophy is to collect as much data as possible and measure everything, you will need massive storage capacity. When you have to collect and store massive amounts of data, the cloud makes it easy. Cloud storage is scalable, durable, reliable, and highly available; most importantly, it is inexpensive, and uploading data is often free. Instead of moving data in periodic batches, you can point your sources of data (your web clients, log generators, and so on) directly at the cloud, which brings the data closer to the compute resources for analysis.
Moreover, storing data in the cloud allows you to easily share and collaborate with both your partners and the consumers of your data, because they too can leverage compute and storage resources by the hour. They can analyze and extract relevant information from your data.
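As a concrete sketch of pointing a log generator at cloud storage, the snippet below uses boto3 (the AWS SDK for Python) to ship a log file to Amazon S3; the bucket name and the date-partitioned key layout are illustrative assumptions, not requirements of any particular store.

```python
import datetime

def object_key(host, ts):
    # Partition log objects by date and hour so downstream batch jobs
    # can scan only the time range they need (a common convention,
    # not mandated by any cloud store).
    return ts.strftime(f"logs/%Y/%m/%d/%H/{host}.log")

def ship_logs(path, host, bucket="my-log-bucket"):  # hypothetical bucket name
    # boto3 is the AWS SDK for Python; other object stores work similarly.
    import boto3
    s3 = boto3.client("s3")
    key = object_key(host, datetime.datetime.utcnow())
    s3.upload_file(path, bucket, key)
    return key
```

Because the data already lives next to the compute resources, an analysis cluster can read it directly without a separate bulk transfer step.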
Analyze your Data in Parallel using the Elastic Supercomputer
The open source Hadoop framework and its ecosystem of tools bring massively parallel computing to mainstream developers. Hadoop enables the ad-hoc full-scan queries that we discussed earlier. It lets developers break away from pre-optimized data warehouses and do exploratory analytics.
Hadoop in the cloud gives any developer or business the power to do analytics without the capital outlay. Today, you can spawn a Hadoop cluster in the cloud within minutes on the latest high performance computing hardware and network without making a capital investment to purchase the hardware. You can expand and shrink a running cluster and decide how soon you need your answers. With some cloud pricing models, you can bid for unused capacity at even lower prices and reduce your costs even further.
Companies are realizing that analytics and processing massive datasets in parallel is a sweet spot for the cloud, and that the cloud doesn't limit their choice of analytical functionality. You are not limited to using Hadoop; you can run Open MPI or any other commercial tools on your dataset and get better insight. You have the power to choose from a range of different analytics software – the Big Data Stacks – or even use a combination of open source and commercial tools on the same dataset to get the insight you desire and evolve your analytics over time. Moreover, the cloud is elastic and cost-efficient, while at the same time offering a range of price and performance alternatives that can be tailored to how soon you need an answer. It's a truly elastic supercomputer.
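To make the parallel model concrete, here is the classic word-count job in the mapper/reducer style that Hadoop Streaming supports (Hadoop Streaming lets any executable act as a mapper or reducer). This is a local, single-process sketch of the logic; on a cluster, Hadoop would run many mapper instances in parallel and sort their output by key before the reduce phase.

```python
from itertools import groupby

def mapper(lines):
    # Map phase: emit a (word, 1) pair for every word.
    # On a cluster, many mappers run in parallel, one per input split.
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def reducer(pairs):
    # Reduce phase: Hadoop sorts mapper output by key, so all counts
    # for a given word arrive together; here we sort locally to mimic that.
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)
```

The same two functions, unchanged, scale from a laptop test to a full-scan query over terabytes once a cluster handles the partitioning and shuffling.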
Access Aggregated Data in Real-Time with a 2-Tier Processing Model
Delivering business insight whenever you need it across massive amounts of data is not easy. To optimize results, many companies leverage a 2-tier processing model. First, they use a Batch Tier to analyze massive datasets in parallel, and store the aggregated data in a separate data store (the Query Tier). This pattern has two advantages. First, it leverages the power of the cloud by analyzing the batch in parallel, which greatly reduces processing time. Second, by storing pre-computed data in a scalable NoSQL data store (like Apache Cassandra, HBase, Amazon SimpleDB, Amazon DynamoDB, MongoDB, Riak, etc.), continuous querying of the aggregated data is possible. Since the data is automatically indexed on input, it can be queried in near real time. This is especially useful when you want to visualize Big Data.
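The 2-tier pattern can be sketched as follows. The event schema and key layout are illustrative assumptions, and a plain dict stands in for the NoSQL store; in production, the Batch Tier would be a parallel Hadoop job and the Query Tier one of the stores named above.

```python
from collections import Counter

def batch_tier(events):
    # Batch Tier: aggregate raw events in one pass.
    # In production this would be a parallel job over the full dataset.
    totals = Counter()
    for event in events:
        totals[(event["page"], event["day"])] += 1
    return totals

def load_query_tier(totals, store):
    # Query Tier: write pre-computed aggregates into a key-value store.
    # The dict here stands in for Cassandra, DynamoDB, etc.
    for (page, day), count in totals.items():
        store[f"{page}#{day}"] = count

def page_views(store, page, day):
    # Serving a dashboard becomes a single indexed lookup,
    # which is why the aggregates can be queried in near real time.
    return store.get(f"{page}#{day}", 0)
```

The expensive full scan happens once per batch; every subsequent query against the aggregates is a cheap key lookup.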
The cloud accelerates Big Data analytics. It gives enormous power to data scientists – practitioners of a new and emerging discipline – to work with Big Data without limits. Since the cost of experimentation is low in the cloud, they can experiment often and respond to complex business questions quickly. The cloud makes it easy to absorb Big Data datasets and process them in parallel, and it offers a variety of Big Data Stacks to choose from. Customers can choose to apply the technology most appropriate for their needs.
Most importantly, moving your Big Data to the cloud allows you to gain more insight from the data quickly and at a price point unmatched by older technologies. The cloud provides instant scalability and elasticity and lets you focus on analytics instead of infrastructure. It enhances your ability and capability to ask interesting questions about your data and get meaningful answers. This changes the game.