It detected the flu faster than the CDC and proved more accurate than Gallup polls in the 2012 election, no wonder Big Data is a big topic these days. If you aren't familiar with this emerging science and the technology behind it, then this primer should get you started.
On a high-level, the Big Data advantage is easy to grasp. The more data they collect about you, the more they know you, and the better they can sell to you. Of course it isn't just about selling; it’s about anything that can be monitored, codified and logged. This can range from the most popular topic on Twitter all the way down to the vibrations on a UPS truck. And it’s making some people big money.
So, what's the big deal?
The amount of data collected is inversely proportional to the cost of storage. Thus, storage being so cheap these days, we find ourselves with way more data than we can handle. The few that have figured out how to read it, have accessed a treasure trove of information and insights never seen before and placed themselves in the forefront of this science (and its opportunities). And those that haven't quite gotten there but realize its importance are building infrastructures to save every bit of data that crosses their path.
The bottom line is that there is a lot of data and it’s only going to get worse (or better depending on your job). The old tools, mostly decades of relational databases (RDBMS), aren’t up to the task and the commercial analytical workhorses like SAS and SAP are struggling to keep up. Even standard theories are being revisited, for example, who needs a random sample to measure a population, when Big Data is the entire population? Google, Facebook, and Yahoo, just to name the big ones, have been busy developing new approaches and tools to extract value out of this new matter.
"What's exciting about data isn't how much of it there is, but how quickly it is growing... if we use the word 'big' today, we'll quickly run out of adjectives: ‘big data’, ‘bigger data’, ‘even bigger data’, and ‘biggest data’” (Eric Siegel - Predictive Analytics)
In order to find the largest value in a data set with conventional data tools, we can either sort it in descending order and pick the first value, or call an aggregate function to return the maximum value. This is trivial to do with small amounts of data and is done all the time. The database simply loads all the data in memory, orders it, and returns the largest value. But with petabytes of data, this cannot be done with an 'order by' or a 'max' function anymore. For starters, big data doesn't live on one machine but on thousands, and bringing all that data onto a single machine's memory to sort is often not possible or would take inordinate amounts of time.
"Big data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications." (Wikipedia)
The solutions lie with parallel and distributed processing. Parallel processing spreads the work equally amongst a computer's CPUs (or cores). This ensures that a machine is used to its fullest potential and that no single unit of work slows down the others (avoiding a bottleneck). Distributed processing is similar but across many computers and not only allows extensibility (i.e. the ability of growing your crunching power by throwing in more machines), but redundancy, allowing the work to continue even if a percentage of the machines stop responding. Redundancy is a big issue when you have 10,000 machines sharing a single job; you want to make sure that a failing machine doesn’t corrupt your data set. This usually requires even more machines to account for the duplication of work.
"Often, big data is messy, varies in quality, and is distributed among countless servers around the world" (Big Data - Viktor Mayer-Shoneberger & Kenneth Cukier)
Messy and poorly defined data is another challenge of Big Data. Due to its size, it tends to circumvent relational database schemas (i.e. it has none). So when reading it, let alone analyzing it in real time (think of the Firehose twitter feed that spits thousands of long, dense lines of concatenated key-pair values every millisecond) requires a lot of work. We need to find and only extract relevant data out of a huge ocean of comma delimited text. This may entail building a schema on the fly while extracting the data, meaning you check the data each time it is read that it fits a set of parameters and data types. Anything that doesn't fit is ignored, assumed different, or erroneous. Fault tolerance has to be built-in as the process of sifting through billions of rows cannot afford to stop and ask why one line of data looks different.
Big Data Protocols
Right behind "Big Data", the next buzz word in this arena is MapReduce. This has actually been around since mid-2000. It came out of Google and has since been used in one form or another in a variety of tools. A popular open-source version is Apache's Hadoop.
MapReduce offers an interesting solution in the analysis of large data sets. Looking for the largest value, for example the person with the most friends on Facebook, MapReduce will break up the data into packages, send those packages to however many machines are available on the cluster (all the computers available for a task), also known as the mapping phase. Each machine will then look for the largest value in its subset of data and return only the largest value, known as the reducing phase. A final processing pass will figure out the maximum value of the maximum values returned from the cluster, i.e. the Facebook user with the most friends.
So if a thousand machines each processed a million rows of data, we managed to boost the processing power and reduce the processing time by a factor of one thousand (guestimation alert). Now, imagine doing that on a trillion rows! You can map, reduce, then map again, then reduce again, as many times depending on the complexity of the job at hand and the quantity of machines available in the cluster.
For full implantation of MapReduce, check out two python code bases:
"Pig is a high-level platform for creating MapReduce programs used with Hadoop. The language for this platform is called Pig Latin... Pig was originally developed at Yahoo Research around 2006 for researchers to have an ad-hoc way of creating and executing map-reduce jobs on very large data sets." (Wikipedia)
Pig is a programming language that can drastically simplify coding of complex manipulation tasks in MapReduce.
This methodology offers data storage and retrieval without being bound by data consistency like a traditional relational database. By freeing itself from transactional guarantees and absolute accuracy, it can offer an 'eventual' accuracy that allows it to scale to thousands of machines. This may not be appropriate for all data needs (i.e. banking), but great for data exploration, scientific experimentation, and just massive-scale environment management.
Taking Facebook as an example, with billions of users, and thousands of simultaneous updates, there is no way of locking anything down - updates have to be allowed at all times by all users. If you ‘like’ something on Facebook, folks on the other side of the world will have to wait for that ‘eventual’ synchronization to be dazzled by your latest update.
Big Data tend not to rely on a SQL querying language, instead really on JSON, which facilitates the reading and manipulating of data directly from strings, xml, text, and the Web.
Database Paradigm Shift
A strong selling point for relational databases (i.e. Microsoft MSSQL or Oracle Database) is that they can guarantee accuracy. If a user updates a field, the database can lock that field to make sure nobody else alters it until the update is finished. Only once confirmed, does it unlock the field so others can view or change it. This is called Transactional Processing and is great for banking - we like an accurate balance when we visit the ATM. Unfortunately, this feature doesn't translate so well to Big Data; imagine locking up Twitter each time somebody replies to a comment. So, transactional processing may not be big with Big Data yet, but it can do things that relational databases can only dream of:
Made by Facebook!
"Apache Cassandra is an open source distributed database management system. It is an Apache Software Foundation top-level project designed to handle very large amounts of data spread out across many commodity servers while providing a highly available service with no single point of failure. It is a NoSQL solution that was initially developed by Facebook" (Wikipedia)
Made by LinkedIn!
“Voldemort is a distributed data store that is designed as a key-value store used by LinkedIn for high-scalability storage. It is named after the fictional Harry Potter villain Lord Voldemort." (Wikipedia)
Used by MTV Networks, Craigslist, and Foursquare!
"MongoDB (from "humongous") is an open source document-oriented database system developed and supported by 10gen. It is part of the NoSQL family of database systems. Instead of storing data in tables as is done in a "classical" relational database, MongoDB stores structured data as JSON-like documents with dynamic schemas (MongoDB calls the format BSON), making the integration of data in certain types of applications easier and faster." (Wikipedia)
Embraces the web!
"Apache CouchDB is a scalable, fault-tolerant, and schema-free document-oriented database written in Erlang. It's used in large and small organizations for a variety of applications where a traditional SQL database isn't the best solution for the problem at hand. " (http://wiki.apache.org/couchdb/)
Spanner is Google's globally distributed relational database management system (RDBMS). It is a scalable, multi-version, globally-distributed, and synchronously-replicated database. It is the first system to distribute data at global scale and support externally-consistent distributed transactions (i.e. like a conventional database but for Big Data). It also offers a SQL like querying language.
Popular Open Source Analytical Languages
Here are some of the popular open-source languages used in analytical research (and if you take a related MOOC, you will undoubtedly get acquainted with one of these). These are JSON compliant for easy data manipulation:
"R is a free software programming language and a software environment for statistical computing and graphics. The R language is widely used among statisticians and data miners for developing statistical software and data analysis. Polls and surveys of data miners are showing R's popularity has increased substantially in recent years. " (Wikipedia)
The Swiss-army knife of analytical programming!
"Python is a general-purpose, high-level programming language whose design philosophy emphasizes code readability. Python's syntax allows programmers to express concepts in fewer lines of code than would be possible in languages such as C, and the language provides constructs intended to enable clear programs on both a small and large scale." (Wikipedia)
"GNU Octave is a high-level programming language, primarily intended for numerical computations. It provides a command-line interface (CLI) for solving linear and nonlinear problems numerically, and for performing other numerical experiments using a language that is mostly compatible with MATLAB. It may also be used as a batch-oriented language. As part of the GNU Project, it is free software under the terms of the GNU General Public License (GPL)." (Wikipedia)
I want to thank the sources mentioned in each quote along with Coursera.com classes for most of the material here.