Big data is a popular term used to describe the exponential growth, availability and use of information, both structured and unstructured. Much has been written on the big data trend and how it can serve as the basis for innovation, differentiation and growth.
According to IDC, it is imperative that organizations and IT leaders focus on the ever-increasing volume, variety and velocity of information that forms big data.1
- Volume. Many factors contribute to the increase in data volume – transaction-based data stored through the years, text data constantly streaming in from social media, increasing amounts of sensor data being collected, etc. In the past, excessive data volume created a storage issue. But with today’s decreasing storage costs, other issues emerge, including how to determine relevance amidst the large volumes of data and how to create value from data that is relevant.
- Variety. Data today comes in all types of formats – from traditional databases to hierarchical data stores created by end users and OLAP systems, to text documents, email, meter-collected data, video, audio, stock ticker data and financial transactions. By some estimates, 80 percent of an organization’s data is not numeric! But it still must be included in analyses and decision making.
- Velocity. According to Gartner, velocity “means both how fast data is being produced and how fast the data must be processed to meet demand.” RFID tags and smart metering are driving an increasing need to deal with torrents of data in near-real time. Reacting quickly enough to deal with velocity is a challenge to most organizations.
Small data is gone. Data is just going to get bigger and bigger and bigger, and people just have to think differently about how they manage it.
Big data according to SAS
At SAS, we consider two other dimensions when thinking about big data:
- Variability. In addition to the increasing velocities and varieties of data, data flows can be highly inconsistent with periodic peaks. Is something big trending in the social media? Perhaps there is a high-profile IPO looming. Maybe swimming with pigs in the Bahamas is suddenly the must-do vacation activity. Daily, seasonal and event-triggered peak data loads can be challenging to manage – especially with social media involved.
- Complexity. When you deal with huge volumes of data, it comes from multiple sources. It is quite an undertaking to link, match, cleanse and transform data across systems. However, it is necessary to connect and correlate relationships, hierarchies and multiple data linkages or your data can quickly spiral out of control. Data governance can help you determine how disparate data relates to common definitions and how to systematically integrate structured and unstructured data assets to produce high-quality information that is useful, appropriate and up-to-date.
Ultimately, regardless of the factors involved, we believe that the term big data is relative; it applies (per Gartner’s assessment) whenever the data an organization needs to handle, store and analyze exceeds its current capacity to do so.
Examples of big data
- RFID (radio frequency ID) systems generate up to 1,000 times the data of conventional bar code systems.
- 10,000 payment card transactions are made every second around the world.2
- Walmart handles more than 1 million customer transactions an hour.3
- 340 million tweets are sent per day. That’s nearly 4,000 tweets per second.4
- Facebook has more than 901 million active users generating social interaction data.5
- More than 5 billion people are calling, texting, tweeting and browsing websites on mobile phones.
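The figures above mix per-second, per-hour and per-day rates, which makes them hard to compare. A minimal sketch of the conversion, using only the counts quoted in the list (the function name `per_second` is illustrative, not from the source):

```python
SECONDS_PER_HOUR = 60 * 60
SECONDS_PER_DAY = 24 * SECONDS_PER_HOUR


def per_second(count: float, period_seconds: int) -> float:
    """Average number of events per second for a count observed over a period."""
    return count / period_seconds


# Counts taken from the list above.
tweets_per_second = per_second(340_000_000, SECONDS_PER_DAY)
walmart_per_second = per_second(1_000_000, SECONDS_PER_HOUR)

print(f"Tweets: {tweets_per_second:,.0f} per second")
print(f"Walmart transactions: {walmart_per_second:,.0f} per second")
```

On these numbers, 340 million tweets a day averages out to roughly 3,935 per second, which matches the "nearly 4,000" in the list, and 1 million transactions an hour is about 278 per second.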
Uses for big data
So the real issue is not that you are acquiring large amounts of data (because we are clearly already in the era of big data). It’s what you do with your big data that matters. The hopeful vision for big data is that organizations will be able to harness relevant data and use it to make the best decisions.
Technologies today not only support the collection and storage of large amounts of data, they provide the ability to understand and take advantage of its full value, which helps organizations run more efficiently and profitably. For instance, with big data and big data analytics, it is possible to:
- Analyze millions of SKUs to determine optimal prices that maximize profit and clear inventory.
- Recalculate entire risk portfolios in minutes and understand future possibilities to mitigate risk.
- Mine customer data for insights that drive new strategies for customer acquisition, retention, campaign optimization and next best offers.
- Quickly identify customers who matter the most.
- Generate retail coupons at the point of sale based on the customer’s current and past purchases, ensuring a higher redemption rate.
- Send tailored recommendations to mobile devices at just the right time, while customers are in the right location to take advantage of offers.
- Analyze data from social media to detect new market trends and changes in demand.
- Use clickstream analysis and data mining to detect fraudulent behavior.
- Determine root causes of failures, issues and defects by investigating user sessions, network logs and machine sensors.
High-performance analytics, coupled with the ability to score every record and feed it into the system electronically, can identify fraud faster and more accurately.
Many organizations are concerned that the amount of amassed data is becoming so large that it is difficult to find the most valuable pieces of information.
- What if your data volume gets so large and varied you don’t know how to deal with it?
- Do you store all your data?
- Do you analyze it all?
- How can you find out which data points are really important?
- How can you use it to your best advantage?
Until recently, organizations have been limited to using subsets of their data, or they were constrained to simplistic analyses because the sheer volumes of data overwhelmed their processing platforms. What is the point of collecting and storing terabytes of data if you can’t analyze it in full context, or if you have to wait hours or days to get results? On the other hand, not all business questions are better answered by bigger data.
You now have two choices:
- Incorporate massive data volumes in analysis. If the answers you are seeking will be better provided by analyzing all of your data, go for it. The game-changing technologies that extract true value from big data – all of it – are here today. One approach is to apply high-performance analytics to analyze the massive amounts of data using technologies such as grid computing, in-database processing and in-memory analytics.
- Determine upfront which big data is relevant. Traditionally, the trend has been to store everything (some call it data hoarding) and only when you query the data do you discover what is relevant. We now have the ability to apply analytics on the front end to determine data relevance based on context. This analysis can be used to determine which data should be included in analytical processes and which can be placed in low-cost storage for later availability if needed.
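The second choice, deciding relevance up front, can be pictured as a scoring step followed by routing. The sketch below is a toy illustration only: the `Record` type, the keyword set, the scoring rule and the threshold are all assumptions made for the example, not a description of any SAS product.

```python
from dataclasses import dataclass


@dataclass
class Record:
    source: str
    text: str


# Hypothetical context: the terms the current analysis cares about.
RELEVANT_TERMS = {"refund", "outage", "fraud"}


def relevance(record: Record) -> float:
    """Toy score: fraction of the context terms that appear in the record."""
    words = set(record.text.lower().split())
    return len(words & RELEVANT_TERMS) / len(RELEVANT_TERMS)


def triage(records, threshold=0.3):
    """Split records into an analysis set and a low-cost archive set."""
    analyze, archive = [], []
    for record in records:
        target = analyze if relevance(record) >= threshold else archive
        target.append(record)
    return analyze, archive


records = [
    Record("social", "possible fraud and refund request after outage"),
    Record("sensor", "temperature nominal"),
    Record("email", "customer asks about a refund"),
]
analyze, archive = triage(records)
print("analyze:", [r.source for r in analyze])
print("archive:", [r.source for r in archive])
```

In practice the scoring function would be a real model rather than a keyword match; the point is only that archived records stay available for later use instead of being discarded.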
Now you can run hundreds and thousands of models at the product level – at the SKU level – because you have the big data and analytics to support those models at that level.
A number of recent technology advancements are enabling organizations to make the most of big data and big data analytics:
- Cheap, abundant storage and server processing capacity.
- Faster processors.
- Affordable large-memory capabilities and distributed processing frameworks, such as Hadoop.
- New storage and processing technologies designed specifically for large data volumes, including unstructured data.
- Parallel processing, clustering, MPP, virtualization, large grid environments, high connectivity and high throughputs.
- Cloud computing and other flexible resource allocation arrangements.
Big data technologies not only support the ability to collect large amounts of data, they provide the ability to understand it and take advantage of its value. The goal of all organizations with access to large data collections should be to harness the most relevant data and use it for optimized decision making.
It is very important to understand that not all of your data will be relevant or useful. But how can you find the data points that matter most? It is a problem that is widely acknowledged. “Most businesses have made slow progress in extracting value from big data. And some companies attempt to use traditional data management practices on big data, only to learn that the old rules no longer apply,” says Dan Briody, in the 2011 Economist Intelligence Unit’s publication, “Big Data: Harnessing a Game-Changing Asset.”
Big data solutions from SAS
How can you make the most of all that data, now and in the future? It is a twofold proposition. You can only optimize your success if you weave analytics into your big data solution. But you also need analytics to help you manage the big data itself.
There are several key technologies that can help you get a handle on your big data, and more important, extract meaningful value from it.
- Information management for big data. Many vendors look at big data as a discussion related to technologies such as Hadoop, NoSQL, etc. SAS takes a more comprehensive data management/data governance approach by providing a strategy and solutions that allow big data to be managed and used more effectively.
- High-performance analytics. By taking advantage of the latest parallel processing power, high-performance analytics lets you do things you never thought possible because the data volumes were just too large.
- High-performance visual analytics. High-performance visual analytics lets you explore huge volumes of data in mere seconds so you can quickly identify opportunities for further analysis. Because it’s not just that you have big data, it’s the decisions you make with the data that will create organizational gains.
- Flexible deployment options for big data. Flexible deployment models bring choice. High-performance analytics from SAS can analyze billions of variables, and those solutions can be deployed in the cloud (with SAS or another provider), on a dedicated high-performance analytics appliance or within your existing IT infrastructure, whichever best suits your organization’s requirements.
1 Source: IDC. “Big Data Analytics: Future Architectures, Skills and Roadmaps for the CIO,” September 2011.
2 Source: American Bankers Association, March 2009
3 Source: http://www.economist.com
4 Source: http://blog.twitter.com
5 Source: http://newsroom.fb.com/
What is big data?
Big data is not a precise term; rather it’s a characterization of the never-ending accumulation of all kinds of data, most of it unstructured. It describes data sets that are growing exponentially and that are too large, too raw or too unstructured for analysis using relational database techniques. Whether terabytes or petabytes, the precise amount is less the issue than where the data ends up and how it is used.
“My belief is that data is a terrible thing to waste. Information is valuable. In running our business, we want to make sure that we’re not leaving value on the table—value that can create better experiences for customers or better financial results for the company.”
—Johann Schleier-Smith, Tagged.com
“Data growth is a factor that everybody is trying to deal with. We’re seeing tremendous growth in the size of data, year on year, as I think everyone is. Finding effective approaches for containing cost so that it doesn’t run away with your budget is an issue. Another major challenge is dealing with unstructured data. How do you manage that data effectively? How do you control its growth? And how do you actually make that data part of the information fabric that people can draw upon to make decisions or look
—Rich Aducci, Boston Scientific
“Being able to look at big data can shorten your time to information, which has immediate value. For instance, if I want to know how a new product launch is going, I can analyze millions of social media conversations and know if we’re successful instantly, instead of waiting months for a customer satisfaction survey.”
—Guy Chiarello, JPMorgan Chase
“It’s more than a technology shift. There has to be a mindset shift about what you can do with data. For years, what CIOs had to deal with is managing information. It was all about managing information efficiently: how much can you compress it, de-dupe it, take snapshots and act upon it. Planning for big data extends that efficiency so you can do more with that data very quickly. The classic IT organizations are used to data warehousing, business intelligence where data is updated once, twice, maybe four times a month. Now we’re at the point where you’ve got access to everything all the time through real and uptime data.”
—Sanjay Mirchandani, EMC
Big data in the cloud
Cloud models tame big data while extracting business value from it. This delivery model gives organizations a flexible option for achieving the efficiencies, scalability, data portability and affordability they need for big data analytics. Cloud models encourage access to data and provide an elastic pool of resources to handle massive scale, solving the problem of how to store huge data volumes and how to amass the computing resources required to manipulate it. In the cloud, data is provisioned and spread across multiple sites, allowing it to sit closer to the users who require it, speeding response times and boosting productivity. And, because cloud makes IT resources more efficient and IT teams more productive, enterprise resources are freed to be
Cloud services specifically designed for big data analysis are starting to emerge, providing platforms and tools designed to perform analytics quickly and efficiently. Companies that recognize the importance of big data but don’t have the resources to build the required infrastructure or acquire the necessary tools to exploit it will benefit from considering these cloud services.
“Real-time data will continue to grow at a faster pace than the capability to move it. Unless we change the way we address the problem, we are going to find ourselves constantly struggling to squeeze information through very narrow tubes. I believe that we’re going to be faced with a situation where more and more we have to do the analytics where the data resides. Instead of moving the data for processing, we are going to move analytics closer to the data.”
—Dimitris Mavroyiannis, Eurobank EFG Group
“The cloud will play an important role in big data. I think it’s going to be increasingly rare that you’re going to be able to run all this [infrastructure] at home. And why would you, in some cases?”
—Deirdre Woods, The Wharton School of the University of Pennsylvania
“Because it has become so cheap and easy to store data, a lot of companies have operated under this idea of, ‘let me just store it, I’ll deal with it when I figure out how to deal with it.’ But now the velocity of growth is increasing. The amount of storage we’re using is proliferating. All of that compels us to bring a business discipline to this ecosystem that helps us understand what needs to be retained and for how long. Big data raises the stakes on why content management needs to be promptly and successfully woven into the business operation. Then it’s as simple as basic records management blocking and tackling.”
—John Chickering, Fidelity Investments
“With compute power being what it is … you don’t need to build big tables and land them on disks and keep them on disks. You build them on the fly. That has reduced data storage needs dramatically. It’s a form of, I guess, intellectual compression, not algorithmic compression. It’s just smart data modeling and using the power of what you’ve got.”
—Ian Willson, The Boeing Company
“Big data calls for a lot more creativity in how you use data. You have to be way more creative about where you look for business value: if I combined this data with this data, what could it tell me?”
—Joe Solimando, Disney Consumer Products
“Instead of waiting for big data to stop operations, we should better organize or archive our data, manage it over its lifecycle and actually get rid of it. You can move out of mitigation mode by doing a better job of managing your information up front—in other words putting the data to more efficient use.”
—David Blue, The Boeing Company
“When people here have an idea, and they see they could do something differently if we make [processes] more real time in the future, and they [make changes to the service] and the numbers go up by 10%, people get really excited. So what I want to do is create that type of energy and enthusiasm. That’s what I want to be the dominant dynamic of the workplace. We really have that here. It’s pretty fun. People are getting results on a routine basis, and it’s because we’ve created frictionless access to data.”
—Johann Schleier-Smith, Tagged.com
“Our Wharton Research Data Services experience has shown us the value of organizing data so you can look at multiple data sources to analyze and draw conclusions. We’ve seen over years—and it’s a trend that’s certainly increasing—when people use the word ‘data’ they want to see three or four and now five, six data sets joined.”
—Deirdre Woods, The Wharton School of the University of Pennsylvania
“Every vendor wants to bundle everything and provide a one-stop shop. This makes sense, and I don’t have a problem with that, as long as the vendor doesn’t lock me into a specific solution. And I think the only way that this can be avoided is if the vendor follows industry standards. We’ve moved into a world where standards prevail, especially in big data analytics, where data originate from multiple sources. Vendors that provide standards-based big data solutions are much more likely to be preferred.”
—Dimitris Mavroyiannis, Eurobank EFG Group
with business users will expand the capabilities of IT workers, bringing them closer to the strategic goal of aligning business and IT. Business workers will gain a better understanding of the capabilities and limitations of technology.
“At EMC we are developing roles for what we call the data scientist: people with a good amount of data competence who have skill sets partitioning information to make it easier to work with. The capabilities people in this role bring in the value chain of an organization are pretty tremendous. The central function is to add core business value and mask (for business users) the heavy lifting that happens behind the scenes in IT.”
—Sanjay Mirchandani, EMC
“The biggest challenge is the people. Big data requires non-traditional IT skill sets. We’re bringing in more Ph.D.s and people with expertise in outside areas to help our business users work with information.”
—Guy Chiarello, JPMorgan Chase
“A few years ago, I would’ve said the value in BI overshadowed that of big data, but now I’d say the relationship is even, if not reversed. There’s more outside information now that can be digitized—about user behaviors and external conditions—that can be layered on top of structured data. This type of analysis opens a window into not just what happened and why, but it also helps you see what’s
—Guy Chiarello, JPMorgan Chase
What is big data?
Every day, we create 2.5 quintillion bytes of data — so much that 90% of the data in the world today has been created in the last two years alone. This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals to name a few. This data is big data.
Big data spans four dimensions: Volume, Velocity, Variety, and Veracity.
Volume: Enterprises are awash with ever-growing data of all types, easily amassing terabytes—even petabytes—of information.
- Turn 12 terabytes of Tweets created each day into improved product sentiment analysis
- Convert 350 billion annual meter readings to better predict power consumption
Velocity: Sometimes 2 minutes is too late. For time-sensitive processes such as catching fraud, big data must be used as it streams into your enterprise in order to maximize its value.
- Scrutinize 5 million trade events created each day to identify potential fraud
- Analyze 500 million daily call detail records in real-time to predict customer churn faster
Variety: Big data is any type of data – structured and unstructured data such as text, sensor data, audio, video, click streams, log files and more. New insights are found when analyzing these data types together.
- Monitor hundreds of live video feeds from surveillance cameras to target points of interest
- Exploit the 80% data growth in images, video and documents to improve customer satisfaction
Veracity: 1 in 3 business leaders don’t trust the information they use to make decisions. How can you act upon information if you don’t trust it? Establishing trust in big data presents a huge challenge as the variety and number of sources grow.
Big data is more than simply a matter of size; it is an opportunity to find insights in new and emerging types of data and content, to make your business more agile, and to answer questions that were previously considered beyond your reach. Until now, there was no practical way to harvest this opportunity. Today, IBM’s platform for big data uses state of the art technologies including patented advanced analytics to open the door to a world of possibilities.
“At IBM, big data is about ‘the art of the possible.’ . . . The company is certainly a leader in this space.”
“Four Vendor Views on Big Data and Big Data Analytics: IBM”
Hurwitz & Associates, Fern Halper, January 2012
From Wikipedia, the free encyclopedia
A visualization created by IBM of Wikipedia edits. At multiple terabytes in size, the text and images of Wikipedia are a classic example of big data.
In information technology, big data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, analysis, and visualization. The trend to larger data sets is due to the additional information derivable from analysis of a single large set of related data, as compared to separate smaller sets with the same total amount of data, allowing correlations to be found to “spot business trends, determine quality of research, prevent diseases, link legal citations, combat crime, and determine real-time roadway traffic conditions.”
As of 2012, limits on the size of data sets that are feasible to process in a reasonable amount of time were on the order of exabytes of data. Scientists regularly encounter limitations due to large data sets in many areas, including meteorology, genomics, connectomics, complex physics simulations, and biological and environmental research. The limitations also affect Internet search, finance and business informatics. Data sets grow in size in part because they are increasingly being gathered by ubiquitous information-sensing mobile devices, aerial sensory technologies (remote sensing), software logs, cameras, microphones, radio-frequency identification readers, and wireless sensor networks. The world’s technological per-capita capacity to store information has roughly doubled every 40 months since the 1980s; as of 2012, every day 2.5 quintillion (2.5×10^18) bytes of data were created. The challenge for large enterprises is deciding who should own big data initiatives that straddle the entire organization.
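A doubling time is easier to reason about once it is converted into a yearly growth rate. The following sketch does that conversion for the 40-month figure quoted above; the function names are illustrative, and it assumes smooth exponential growth, which the source states only approximately ("roughly doubled").

```python
def annual_growth_factor(doubling_months: float) -> float:
    """Yearly multiplier implied by a doubling time given in months."""
    return 2 ** (12 / doubling_months)


def growth_over(years: float, doubling_months: float) -> float:
    """Total multiplier after a number of years at that doubling time."""
    return 2 ** (years * 12 / doubling_months)


factor = annual_growth_factor(40)
print(f"Implied growth: {(factor - 1) * 100:.1f}% per year")
print(f"Growth over 10 years: {growth_over(10, 40):.0f}x")
```

Doubling every 40 months works out to about 23% growth per year, or an eightfold increase per decade.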
Big data is difficult to work with using relational databases and desktop statistics and visualization packages, requiring instead “massively parallel software running on tens, hundreds, or even thousands of servers”. What is considered “big data” varies depending on the capabilities of the organization managing the set, and on the capabilities of the applications that are traditionally used to process and analyze the data set in its domain. “For some organizations, facing hundreds of gigabytes of data for the first time may trigger a need to reconsider data management options. For others, it may take tens or hundreds of terabytes before data size becomes a significant consideration.”
Big data usually includes data sets with sizes beyond the ability of commonly-used software tools to capture, curate, manage, and process the data within a tolerable elapsed time. Big data sizes are a constantly moving target, as of 2012 ranging from a few dozen terabytes to many petabytes of data in a single data set. With this difficulty, a new platform of “big data” tools has arisen to handle sensemaking over large quantities of data, as in the Apache Hadoop Big Data Platform.
In a 2001 research report and related lectures, META Group (now Gartner) analyst Doug Laney defined data growth challenges and opportunities as being three-dimensional, i.e. increasing volume (amount of data), velocity (speed of data in and out), and variety (range of data types and sources). Gartner, and now much of the industry, continue to use this “3Vs” model for describing big data. In 2012, Gartner updated its definition as follows: “Big data are high-volume, high-velocity, and/or high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization.”
Examples include Big Science, web logs, RFID, sensor networks, social networks, social data (due to the social data revolution), Internet text and documents, Internet search indexing, call detail records, astronomy, atmospheric science, genomics, biogeochemical, biological, and other complex and often interdisciplinary scientific research, military surveillance, medical records, photography archives, video archives, and large-scale e-commerce.
The Large Hadron Collider (LHC) experiments represent about 150 million sensors delivering data 40 million times per second. There are nearly 600 million collisions per second. After filtering and not recording more than 99.999% of these streams, there are 100 collisions of interest per second.   
- As a result, only working with less than 0.001% of the sensor stream data, the data flow from all four LHC experiments represents 25 petabytes annual rate before replication (as of 2012). This becomes nearly 200 petabytes after replication.
- If all sensor data were to be recorded in LHC, the data flow would be extremely hard to work with. The data flow would exceed 150 million petabytes annual rate, or nearly 500 exabytes per day, before replication. To put the number in perspective, this is equivalent to 500 quintillion (5×10^20) bytes per day, almost 200 times higher than all the other sources combined in the world.
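The LHC figures can be sanity-checked with a few lines of arithmetic. The sketch below assumes decimal units (1 petabyte = 10^15 bytes, 1 exabyte = 10^18 bytes) and a 365-day year; neither assumption is stated in the text, and the variable names are illustrative.

```python
PETABYTE = 10 ** 15   # decimal units, assumed
EXABYTE = 10 ** 18
DAYS_PER_YEAR = 365   # assumed

# Unfiltered flow quoted above: 150 million petabytes per year.
unfiltered_pb_per_year = 150_000_000
bytes_per_day = unfiltered_pb_per_year * PETABYTE / DAYS_PER_YEAR
exabytes_per_day = bytes_per_day / EXABYTE

# Filtering quoted above: 100 collisions of interest out of ~600 million per second.
collisions_per_second = 600_000_000
kept_per_second = 100
kept_fraction = kept_per_second / collisions_per_second

print(f"Unfiltered flow: about {exabytes_per_day:.0f} exabytes per day")
print(f"Collisions kept: {kept_fraction * 100:.7f}% of the total")
```

Under these assumptions the unfiltered flow comes to roughly 411 exabytes per day, the same order of magnitude as the "nearly 500 exabytes" quoted, and the retained collisions are well under the 0.001% bound given in the text.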
Science and research
- When the Sloan Digital Sky Survey (SDSS) began collecting astronomical data in 2000, it amassed more in its first few weeks than all data collected in the history of astronomy. Continuing at a rate of about 200 GB per night, SDSS has amassed more than 140 terabytes of information. When the Large Synoptic Survey Telescope, successor to SDSS, comes online in 2016 it is anticipated to acquire that amount of data every five days.
- Decoding the human genome originally took 10 years to process; now it can be achieved in one week.
- Computational social science: Tobias Preis et al. used Google Trends data to demonstrate that Internet users from countries with a higher per capita gross domestic product (GDP) are more likely to search for information about the future than information about the past. The findings suggest there may be a link between online behaviour and real-world economic indicators. The authors of the study examined Google query logs made by Internet users in 45 different countries in 2010 and calculated the ratio of the volume of searches for the coming year (‘2011’) to the volume of searches for the previous year (‘2009’), which they call the ‘future orientation index’. They compared the future orientation index to the per capita GDP of each country and found a strong tendency for countries in which Google users enquire more about the future to exhibit a higher GDP. The results hint that there may potentially be a relationship between the economic success of a country and the information-seeking behavior of its citizens captured in big data.
- Walmart handles more than 1 million customer transactions every hour, which are imported into databases estimated to contain more than 2.5 petabytes of data – the equivalent of 167 times the information contained in all the books in the US Library of Congress.
- Facebook handles 40 billion photos from its user base.
- FICO Falcon Credit Card Fraud Detection System protects 2.1 billion active accounts world-wide.
- The volume of business data worldwide, across all companies, doubles every 1.2 years, according to estimates.
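The ‘future orientation index’ described in the computational social science item above is a simple ratio, and the comparison with GDP is a correlation. The sketch below shows the calculation on invented numbers: the three countries, their search volumes and their GDP figures are hypothetical, and Pearson correlation is used as a stand-in because the text does not say which statistic the authors applied.

```python
from math import sqrt

# Hypothetical search volumes and per capita GDP, invented for illustration only.
countries = {
    "A": {"next_year": 120, "prev_year": 100, "gdp": 40_000},
    "B": {"next_year": 90, "prev_year": 100, "gdp": 20_000},
    "C": {"next_year": 60, "prev_year": 100, "gdp": 5_000},
}


def future_orientation_index(next_year: float, prev_year: float) -> float:
    """Ratio of searches for the coming year to searches for the previous year."""
    return next_year / prev_year


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


indices = [
    future_orientation_index(c["next_year"], c["prev_year"])
    for c in countries.values()
]
gdps = [c["gdp"] for c in countries.values()]
r = pearson(indices, gdps)

print("Future orientation indices:", indices)
print(f"Correlation with per capita GDP: {r:.3f}")
```

An index above 1 means more searches about the coming year than the previous one. The strong correlation here is a property of the made-up data, not a reproduction of the study's result.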
Following decades of work in the area of the effective usage of information and communication technologies for development (or ICT4D), it has been suggested that Big Data can make important contributions to international development. On the one hand, the advent of Big Data delivers the cost-effective prospect to improve decision-making in critical development areas such as health care, employment, economic productivity, crime and security, and natural disaster and resource management. On the other hand, all the well-known concerns of the Big Data debate, such as privacy, interoperability challenges, and the almighty power of imperfect algorithms, are aggravated in developing countries by long-standing development challenges like lacking technological infrastructure and economic and human resource scarcity. “This has the potential to result in a new kind of digital divide: a divide in data-based intelligence to inform decision-making.”
“Big data” has increased the demand for information management specialists: Software AG, Oracle Corporation, IBM, Microsoft, SAP, EMC, and HP have spent more than $15 billion on software firms specializing only in data management and analytics. This industry on its own is worth more than $100 billion and growing at almost 10 percent a year: about twice as fast as the software business as a whole.
Developed economies make increasing use of data-intensive technologies. There are 4.6 billion mobile-phone subscriptions worldwide and there are between 1 billion and 2 billion people accessing the internet. Between 1990 and 2005, more than 1 billion people worldwide entered the middle class, which means that more people with rising incomes will become more literate, which in turn leads to information growth. The world’s effective capacity to exchange information through telecommunication networks was 281 petabytes in 1986, 471 petabytes in 1993, 2.2 exabytes in 2000 and 65 exabytes in 2007, and it is predicted that the amount of traffic flowing over the internet will reach 667 exabytes annually by 2013.
DARPA’s Topological Data Analysis program seeks the fundamental structure of massive data sets.
Big data requires exceptional technologies to efficiently process large quantities of data within tolerable elapsed times. A 2011 McKinsey report suggests suitable technologies include A/B testing, association rule learning, classification, cluster analysis, crowdsourcing, data fusion and integration, ensemble learning, genetic algorithms, machine learning,