BigData & Analytics

Where on Earth to begin describing a topic such as BigData? That is probably one of the biggest early challenges in starting to trawl this subject. Even Cloud, MMOG and Gamification pale into insignificance when set against the breadth and depth of complexity involved in BigData. Indeed, we touched upon an element of BigData in Gamification (see Gamification 2.0: A Concept for more details), alluding to the fact that until the advent of BigData (and in many cases its full depth has yet to be realised) we did not have the ability, in anything more than a rudimentary sense, to understand, contextualise and predict the behaviour of customers at all. The evolution of computing, covered very briefly under the Cloud heading of this website, brought with it a new age of innovative capabilities and efficiencies, with first governments, then companies and latterly individuals the world over able to harness computer processing to drive better services and be more organised than ever before. The recent arrival of smartphones has taken the individual element of this evolution to new heights, with the millennial generation expecting immediacy of knowledge or facts wherever they are, at whatever time. Even as recently as 2010, some commentators mused that we were living through a stagnant, boring age. Now, though, any such thoughts are about to change irrevocably and stupendously. We are truly witnessing "The Beginning of Infinity".

So what is BigData?

At the highest level of aggregation, BigData can be viewed as comprising five specific functions, each of which requires careful design across the domains of people, process and technology. In the simplest terms those functions are: Capture, Collate, Collect, Compute and Change.

Let's dive into more detail on each of the five functions.

The Five C's

Each of the five functions is described in more detail below, but it is worth pointing out at this juncture that the size of each segment of the pie above is indicative of the size of the challenge that function poses for the implementing organisation.

Capture – the switch from analogue to digital data which began during the 1990s provided the first portents of today's tsunami of data growth. From 1996 to 2007, digital storage in enterprises went from 0.3 Exabytes to 276 Exabytes. This foundation was then further fuelled by the growth in mobile devices, laptops, sensors & actuators, Machine-to-Machine (M2M) communication, the availability of cheap storage disks, and latterly the promiscuous sharing of data within social media, resulting in an estimated total data storage need of approximately 35 Zettabytes by 2020. To further compound the difficulties of managing this level of data, the vast majority of it (~85%) is unstructured data such as photographs, videos, tweets, posts, blogs, logs, email and webpages. Even conceptualising this growth is difficult. By way of tangible examples, the number of live websites rose from 1.2 Million to 367 Million between 1996 and 2011, and in the same period the number of mobile phones grew from 145 Million to ~6 Billion. The bottom line is that simply capturing a subset of this data is a monumental task for any company.

Collate – of course, having the data is one thing; having it in a useful format is quite another. The second function of the lifecycle is the ordering of this data into some form of computer-ingestible structure. This is the domain (in business intelligence vernacular) of Extract, Transform & Load (ETL). In the structured data world the problems of duplication, erroneous inputs and absent cells are well known, and to a greater or lesser extent modern ETL systems are capable of handling them. But as we have seen, the true prize in BigData is handling the immensity of unstructured data, which is now beginning to be addressed through textual analysis techniques, advanced data cleansing and pre-processing, contextual analysis techniques, and advances in natural language processing. To make this more concrete: when scanning unstructured inputs, a common issue is acronyms. To a human, the “context” of the document, post, blog or article suggests the probable expansion, but this is far more difficult for a computer, especially at the desired processing speed.
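
As a purely illustrative sketch (the records, field names and acronym table below are invented for this example, not drawn from any particular ETL product), the following Python fragment shows the flavour of a simple collation step: dropping duplicates, filling absent cells with defaults, and using the surrounding words as crude “context” to expand an ambiguous acronym.

# A minimal, illustrative collation (ETL-style) sketch in Python.
# The records, field names and acronym table are hypothetical examples.

RAW_RECORDS = [
    {"id": 1, "text": "The ATM withdrew cash at the branch", "country": "UK"},
    {"id": 1, "text": "The ATM withdrew cash at the branch", "country": "UK"},  # duplicate
    {"id": 2, "text": "Packets were dropped on the ATM network link", "country": None},  # absent cell
]

# Crude context table: which neighbouring words suggest which expansion.
ACRONYM_CONTEXT = {
    "ATM": {
        "Automated Teller Machine": {"cash", "withdrew", "branch", "bank"},
        "Asynchronous Transfer Mode": {"packets", "network", "link", "switch"},
    }
}

def expand_acronyms(text):
    """Pick the expansion whose context words overlap most with the sentence."""
    words = set(text.lower().split())
    out = []
    for token in text.split():
        candidates = ACRONYM_CONTEXT.get(token)
        if candidates:
            best = max(candidates, key=lambda exp: len(candidates[exp] & words))
            out.append(f"{token} ({best})")
        else:
            out.append(token)
    return " ".join(out)

def collate(records):
    seen, cleaned = set(), []
    for rec in records:
        if rec["id"] in seen:                              # drop duplicates
            continue
        seen.add(rec["id"])
        rec = dict(rec)
        rec["country"] = rec["country"] or "UNKNOWN"       # fill absent cells
        rec["text"] = expand_acronyms(rec["text"])         # contextual acronym expansion
        cleaned.append(rec)
    return cleaned

if __name__ == "__main__":
    for rec in collate(RAW_RECORDS):
        print(rec)

A production ETL pipeline would of course use proper natural language models rather than a hand-built lookup, but the sketch illustrates why context is essential when collating unstructured text.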

Collect – once we have bounded the expansiveness of the capture and decided the specifics of collation, we then have to stage the data, in vast quantities, in some form of computer- and human-usable containers. The choice of container greatly affects how efficiently the data can be trawled. The physical choices come first, answering questions such as: How much should be cached? How much should sit on what sort of disks? Should an offline archive be kept for later retrieval? Are we using de-duplication and compression? Beyond the physical layer come the decisions concerning which data store to use and which collated data should go into which store. The first data store is the traditional relational database, which remains a key asset in the BigData arsenal today. Developed during the 1970s from the initial work of E. F. Codd, Donald Chamberlin and Raymond Boyce at IBM, the Relational Database Management System (RDBMS) is a powerful tool for structuring and querying data. Today's databases are highly optimised systems that can search and find data using the Structured Query Language (SQL) within the clean, carefully crafted, filled, indexed, non-duplicated table structure of an RDBMS. The second data store is the NoSQL database, which emerged in 1998 but has really gained traction in the last five years. It departs from the relational approach: it does not use tables and does not use SQL to manipulate data. Instead it relies on key-value stores, document stores and graphs. For the kinds of work with unstructured data that today's BigData demands, NoSQL is much faster and scales to enormous sizes.
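
To make the contrast concrete, here is a minimal sketch in Python; the table layout and keys are invented for illustration. The structured side uses the standard-library sqlite3 module as a stand-in for an RDBMS queried with SQL, while the unstructured side is approximated by a simple key-value store (a real NoSQL product would distribute the same access pattern across many nodes).

# Illustrative only: a toy contrast between a relational store (SQL) and a
# key-value store. Table, column and key names here are hypothetical.
import sqlite3

# --- Relational side: structured, indexed, queried with SQL ---
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, country TEXT)")
db.executemany("INSERT INTO customers VALUES (?, ?, ?)",
               [(1, "Alice", "UK"), (2, "Bob", "DE"), (3, "Carol", "UK")])
uk_customers = db.execute(
    "SELECT name FROM customers WHERE country = ?", ("UK",)).fetchall()
print("SQL result:", [row[0] for row in uk_customers])

# --- Key-value side: schema-less blobs addressed by key ---
# A plain dictionary is enough to show the access pattern.
kv_store = {}
kv_store["tweet:42"] = {"user": "alice", "text": "Loving the new phone! #happy"}
kv_store["blog:2012-05-01"] = "<html>...long unstructured post...</html>"
print("KV result:", kv_store["tweet:42"]["text"])

The relational query can filter and join because the schema is fixed in advance; the key-value store asks no such questions of its data, which is exactly what makes it straightforward to scale out over unstructured content.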

Compute – if you were being pernickety, or strict about the linkage between the title ‘BigData’ and its function, you could arguably stop at the three functions described thus far. That would be to miss the point though: so far we only have data, lots of data but data nonetheless, and data is not in and of itself very useful to human beings. What we actually want at the end of this is information, and better still ‘actionable’ information. This is where the compute function comes into play. Without getting into a debate on taxonomy, from a business perspective compute is really the juncture of the processes, programs, mathematics and visualisation techniques employed to deliver meaningful insights, models and recommendations. Together these techniques have come to be called “Business Analytics”, which is very closely related to and, depending on your viewpoint, either overlaps with or is a subset of “Business Intelligence”.
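
As a flavour of what compute means in practice, the sketch below (hypothetical records and a hypothetical "complaint word" heuristic, in Python) reduces a pile of collected text to a small, actionable summary, in the spirit of the map-and-reduce style of aggregation touched on in the longer paper below.

# Illustrative compute step: turning collected data into actionable information.
# The records and the "complaint" heuristic are hypothetical examples.
from collections import Counter

RECORDS = [
    {"product": "PhoneX", "text": "battery died after a day, very unhappy"},
    {"product": "PhoneX", "text": "great camera, love it"},
    {"product": "TabletY", "text": "screen cracked, want a refund"},
    {"product": "PhoneX", "text": "battery drains too fast"},
]

COMPLAINT_WORDS = {"died", "unhappy", "cracked", "refund", "drains"}

def map_phase(record):
    """Emit (product, 1 if the text looks like a complaint, else 0)."""
    words = set(record["text"].lower().split())
    return record["product"], int(bool(words & COMPLAINT_WORDS))

def reduce_phase(pairs):
    """Aggregate complaint counts and totals per product."""
    complaints, totals = Counter(), Counter()
    for product, is_complaint in pairs:
        complaints[product] += is_complaint
        totals[product] += 1
    return {product: complaints[product] / totals[product] for product in totals}

if __name__ == "__main__":
    rates = reduce_phase(map(map_phase, RECORDS))
    for product, rate in sorted(rates.items()):
        print(f"{product}: {rate:.0%} of mentions are complaints")

An output such as "67% of PhoneX mentions are complaints" is the kind of actionable signal that the final function, change, can actually act upon.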

Change – at this stage the scene is set for the final function of BigData. We have captured, collated, collected and computed the data into information. Change is about using that information to take decisions that result in action. All of the preceding steps are for nought if the information obtained cannot be translated into actual transformations within the business that deliver efficiencies, optimisations and ultimately profits. The goal of the change function, and the rationale for depicting the five functions graphically as increasingly large segments of pie, is to deliver programmes of organisational change. This is by far the most difficult of the five functions of BigData to achieve. However, McKinsey has estimated the impact of these changes as capable of delivering up to $300 Billion of value in US Healthcare, €250 Billion in the European Public Sector and a $600 Billion consumer surplus globally.

Further reading

BigData & Analytics Briefing

This short paper broadens the introduction above to outline in more detail the essential characteristics of BigData. In addition, it provides some examples of the effect of BigData on businesses and their operational profits.

Read More

The Relevance of Videogames to BigData & Analytics

This longer paper introduces the concepts touched upon above in much greater detail, covering the topics of CPU and GPU architectures, the impact of 3D, pipelining, superscalar, multi-threading, SISD, SIMD, stream processing, OpenGL, GPGPUs and a convergence example based on BigData & MapReduce.

Read More