Big data buzzwords: A to Z
Words to know in alphabetical order.
Nov 29, 2012
Big data is one of the biggest trends in IT today, and it has spawned a whole new generation of technology to handle it. And with new technologies come new buzzwords: acronyms, technical terms, product names, etc. Even the phrase "big data" itself can be confusing. Many think of "lots of data" when they hear it, but big data is much more than just data volume.
An acronym for atomicity, consistency, isolation and durability, ACID is a set of requirements or properties that, when adhered to, ensure the data integrity of database transactions during processing. While ACID has been around for a while, the explosion in transaction data volumes has focused more attention on the need for meeting ACID provisions when working with big data.
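As a minimal sketch of the atomicity property described above, the following uses Python's built-in sqlite3 module. The accounts table, the names in it and the transfer function are hypothetical, made up for illustration only. A transfer either applies both of its updates or neither of them.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one all-or-nothing transaction."""
    try:
        with conn:  # commits if the block succeeds, rolls back if it raises
            # Credit first, so a failed debit has something to roll back.
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected an overdraft; the credit is undone too.
        return False


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()

ok = transfer(conn, "alice", "bob", 30)       # succeeds
failed = transfer(conn, "alice", "bob", 500)  # would overdraw alice
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(ok, failed, balances)
```

The second transfer fails partway through, and the rollback leaves both balances exactly as the first transfer left them. The CHECK constraint plays the role of a consistency rule here; isolation and durability are not exercised by this small example.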
IT systems today pump out data that's "big" on volume, velocity and variety. IDC estimates that the volume of world information will reach 2.7 zettabytes this year (that's 2.7 billion terabytes), a figure that is doubling every two years. It's not just the amount of data that's causing headaches for IT managers, but the increasingly rapid speed at which data is flowing from financial systems, retail systems, websites, sensors, RFID chips and social networks like Facebook and Twitter. Going back five, maybe 10 years, IT mostly dealt with alphanumeric data that was easy to store in neat rows and columns in relational databases. No longer. Today, unstructured data, such as tweets and Facebook posts, documents, web content and so on, is all part of the big data mix.
Columnar (or Column-Oriented) Database
Some new-generation databases (such as the open-source Cassandra and HP's Vertica) are designed to store data by column rather than by row as traditional SQL databases do. Because an analytical query typically reads only a few columns, keeping each column's values together lets the database skip the data it doesn't need, which cuts disk access and improves performance when handling big data. Columnar databases are especially popular for data-intensive business analytics applications.
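The difference between the two layouts can be illustrated with plain Python data structures. This is a toy sketch, not how any particular product stores data, and the order records are invented for the example.

```python
# Row-oriented layout: each record keeps all of its fields together.
rows = [
    {"order_id": 1, "region": "NSW", "amount": 120.0},
    {"order_id": 2, "region": "VIC", "amount": 80.0},
    {"order_id": 3, "region": "NSW", "amount": 200.0},
]

# Column-oriented layout: one list per column, built from the same records.
columns = {key: [row[key] for row in rows] for key in rows[0]}


def total_row_store(records):
    """Sum one field, but visit every whole record to get at it."""
    return sum(record["amount"] for record in records)


def total_column_store(cols):
    """Sum one field by reading only that column's values."""
    return sum(cols["amount"])


print(total_row_store(rows), total_column_store(columns))
```

Both functions return the same total. The point is that the column version never touches order_id or region, which is the saving an analytics query over a few columns of a very wide table would see on disk.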
The concept of data warehousing, copying data from multiple operational IT systems into a secondary, off-line database for business analytics applications, has been around for about 25 years. But as data volumes explode, data warehouse systems are rapidly changing. They need to store more data -- and more kinds of data -- making their management a challenge. And where 10 or 20 years ago data might have been copied into a data warehouse system on a weekly or monthly basis, data warehouses today are refreshed far more frequently with some even updated in real time.
Extract, transform and load (ETL) software is used when moving data from one database, such as one supporting a bank's transaction processing system, to another, such as a data warehouse system used for business analytics. Data often needs to be reformatted and cleaned up when being transferred from one database to another. The performance demands on ETL tools have increased as data volumes have grown and data processing speeds have accelerated.
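The three ETL stages can be sketched in a few lines of Python using only the standard library. The sample CSV export, the sales table and the clean-up rules below are all assumptions made for illustration; real ETL tools handle far larger volumes and many more formats.

```python
import csv
import io
import sqlite3

# Hypothetical export from an operational system, with messy values.
RAW = """txn_id,customer,amount,date
1, Alice ,100.50,29/11/2012
2,bob,,30/11/2012
3,Carol,75,01/12/2012
"""


def extract(text):
    """Extract: read the raw export into dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(records):
    """Transform: drop incomplete rows, tidy names, use ISO dates."""
    cleaned = []
    for r in records:
        if not r["amount"].strip():
            continue  # skip rows with a missing amount
        day, month, year = r["date"].split("/")
        cleaned.append(
            (
                int(r["txn_id"]),
                r["customer"].strip().title(),
                float(r["amount"]),
                f"{year}-{month}-{day}",
            )
        )
    return cleaned


def load(conn, records):
    """Load: write the cleaned rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales "
        "(txn_id INTEGER PRIMARY KEY, customer TEXT, amount REAL, sale_date TEXT)"
    )
    with conn:
        conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", records)


warehouse = sqlite3.connect(":memory:")
load(warehouse, transform(extract(RAW)))
loaded = warehouse.execute("SELECT * FROM sales ORDER BY txn_id").fetchall()
print(loaded)
```

The row with the missing amount is dropped during the transform step, and the two surviving rows arrive in the warehouse with trimmed names and sortable dates.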
Flume, a technology in the Apache Hadoop family (others include HBase, Hive, Oozie, Pig and Whirr), is a framework for populating Hadoop with data. The technology uses agents scattered across application servers, web servers, mobile devices and other systems to collect data and transfer it to a Hadoop system. A business, for example, could use Apache Flume running on a web server to collect data from Twitter posts for analysis.
One trend fueling big data is the increasing volume of geospatial data being generated and collected by IT systems today. A picture may be worth 1000 words, so it's no surprise the growing number of maps, charts, photographs and other geographic-based content is a major driver of today's big data explosion. Geospatial analysis is a specific form of data visualisation that overlays data on geographical maps to help users better understand the results of big data analysis.
Hadoop is an open-source platform for developing distributed, data-intensive applications. It's controlled by the Apache Software Foundation. Hadoop was created by Yahoo developer Doug Cutting, who based it on Google Labs' MapReduce concept and named it after his infant son's toy elephant. Bonus "H" entries: HBase is a non-relational database developed as part of the Hadoop project, the Hadoop Distributed Filesystem (HDFS) is a key component of Hadoop, and Hive is a data warehouse system built on Hadoop.
Computers generally retrieve data from disk drives as they process transactions or perform queries. But that can be too slow when IT systems are working with big data. In-memory database systems utilise a computer's main memory to store frequently used data, greatly reducing processing times. In-memory database products include SAP HANA and the Oracle TimesTen In-Memory Database.
Java is a programming language developed at Sun Microsystems and released in 1995. Hadoop and a number of other big data technologies were built using Java, and it remains a dominant development technology in the big data world.
Kafka is a high-throughput, distributed messaging system originally developed at LinkedIn to manage the service's activity stream (data about a website's usage) and operational data processing pipeline (data about the performance of server components). Kafka is effective for processing large volumes of streaming data -- a key issue in many big data computing environments. The Apache Software Foundation has taken Kafka on as an open-source project. Storm, developed by Twitter, is another stream-processing technology that's catching on.
Latency is the delay in delivering data from one point to another, or the time one system, such as an application, takes to respond to another. While the term isn't new, you're hearing it more often today as data volumes grow and IT systems struggle to keep up. "Low latency" is good; "high latency" is bad.
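Latency is simply elapsed time, so it can be measured by timing a call from request to response. The helper below is a hypothetical sketch using Python's standard time module; measure_latency is not part of any library.

```python
import time


def measure_latency(func, *args):
    """Call func(*args) and return its result plus the elapsed seconds."""
    start = time.perf_counter()
    result = func(*args)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Time a stand-in "request": summing a million numbers.
result, elapsed = measure_latency(sum, range(1_000_000))
print(f"result={result}, latency={elapsed * 1000:.2f} ms")
```

The figure printed will vary from machine to machine and run to run, which is why latency is usually reported as an average or a percentile over many measurements rather than as a single number.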
Map/reduce is a way of breaking up a complex problem into smaller chunks, distributing them across many computers and then reassembling the partial results into a single answer. Google's search system utilises map/reduce concepts and the company has a framework with the brand name MapReduce. In 2004, Google released a white paper describing its use of map/reduce. Doug Cutting recognised its potential and developed the first release of Hadoop, which also incorporates map/reduce concepts.
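The classic illustration of the map/reduce idea is a word count. The sketch below runs on one machine in plain Python and is a simplification: it is not Hadoop's or Google's API, and it folds the per-key grouping step into the map function. The names map_chunk and reduce_counts are made up for the example.

```python
from collections import Counter
from functools import reduce


def map_chunk(chunk):
    """Map step: count the words in one chunk, independently of the others."""
    return Counter(chunk.lower().split())


def reduce_counts(left, right):
    """Reduce step: merge two partial counts into one."""
    return left + right


# Each chunk stands in for a slice of input that a separate machine would get.
chunks = ["big data is big", "data about data"]

partials = [map_chunk(chunk) for chunk in chunks]
word_counts = reduce(reduce_counts, partials, Counter())
print(word_counts)
```

Because each chunk is counted without reference to any other, the map step can run on as many computers as there are chunks. Only the small partial counts need to be brought together for the final answer.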
Most mainstream databases (such as the Oracle Database and Microsoft SQL Server) are based on a relational architecture and use structured query language (SQL) for development and data management. But a new generation of database systems dubbed "NoSQL" (which some now say stands for "Not only SQL") is based on architectures that proponents argue are better for handling big data. Some NoSQL databases are designed for scalability and flexibility whereas others are more efficient at handling documents and other unstructured data. Examples include Hadoop/HBase, Cassandra, MongoDB and CouchDB, while some big vendors like Oracle have launched their own NoSQL products.
Apache Oozie is an open-source workflow engine that's used to help manage processing jobs for Hadoop. Using Oozie, a series of jobs can be defined in multiple languages, such as Pig and MapReduce, and then linked to each other. That allows a programmer to launch a data analysis query once a job to collect data from an operational application has finished, for example.
Pig, another Apache Software Foundation project, is a platform for analysing huge data sets. At its core, it's a programming language for developing parallel computation queries that run on Hadoop.
Quantitative data analysis
Quantitative data analysis is the use of complex mathematical or statistical modeling to explain financial and business behavior or even predict future behavior. With the exploding volumes of data being collected today, quantitative data analysis has become more complex. But more data also holds the promise of more data analysis opportunities for companies that know how to use it to gain better visibility and insights into their businesses and spot market trends.
Relational database management systems (RDBMSs), including IBM's DB2, Microsoft's SQL Server and the Oracle Database, are the most widely used type of database today. Most corporate transaction processing systems run on RDBMSs, from banking applications to retail point-of-sale systems to inventory management applications. But some argue that relational databases may be unable to keep up with today's exploding volume and variety of data. RDBMSs, for example, were designed with alphanumeric data in mind and aren't as effective when working with unstructured data.
As databases become ever larger, they become more difficult to work with. Sharding is a form of database partitioning that breaks a database up into smaller, more easily managed parts. Specifically, a database is partitioned horizontally to separately manage rows in a database table. Sharding allows segments of a huge database to be distributed across multiple servers, improving the overall speed and performance of the database. Bonus "S" entry: Sqoop is an open-source tool for moving data from non-Hadoop sources, such as relational databases, into Hadoop.
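One common way to decide which shard holds a given row is to hash the row's key. The sketch below shows that idea in plain Python, assuming four shards and made-up customer IDs. It is only one of several partitioning schemes and is not tied to any particular database product.

```python
import hashlib

NUM_SHARDS = 4  # assumed number of servers for this example


def shard_for(key, num_shards=NUM_SHARDS):
    """Map a row key to a shard number using a stable hash."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


# Each list stands in for the rows held by one database server.
shards = {i: [] for i in range(NUM_SHARDS)}
customer_ids = ["cust-001", "cust-002", "cust-003", "cust-004", "cust-005"]

for customer_id in customer_ids:
    shards[shard_for(customer_id)].append(customer_id)

print(shards)
```

The same key always hashes to the same shard, so a lookup needs to query only one server. A drawback of this simple modulo approach is that changing the number of shards moves most keys to a different shard, which is why production systems often use more elaborate schemes.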
One of the contributors to the big data problem is the increasing amount of text being collected for analysis from social media sites like Twitter and Facebook, external news feeds and even from within a company. Because text is unstructured (unlike structured data typically stored in relational databases), mainstream business analytics tools often falter when faced with text. Text analytics uses a range of techniques -- from keyword search to statistical analysis to linguistic approaches -- to derive insight from text-based data.
Until recent years, most data was structured, the kind of alphanumeric information (such as financial data from sales transactions) that could be easily stored in a relational database and analysed by business intelligence tools. But a big chunk of the 2.7 zettabytes of stored data today is unstructured, such as text-based documents, tweets, photos posted on Flickr, videos posted on YouTube and so on. (Fun fact: Thirty-five hours of content are uploaded to YouTube every minute.) Processing, storing and analysing all that messy unstructured stuff is often a challenge for today's IT systems.
As the volume of data grows, it becomes increasingly difficult for people to understand it using static charts and graphs. That's led to the development of a new generation of data visualisation and analysis tools that present data in new ways to help people make sense of huge amounts of information. These tools include colour-coded heat maps, three-dimensional graphs, animated visualisations that show changes over time and geospatial representations that overlay data on geographical maps. Today's advanced data visualisation tools are also more interactive, for example allowing a user to zoom in on a data subset for closer inspection.
Apache Whirr is a set of libraries for running big data cloud services. More specifically, it speeds up the development of Hadoop clusters on virtual infrastructure such as Amazon EC2 and Rackspace.
A yottabyte is a unit of data storage that's equal to 1,000 zettabytes. The total amount of data stored worldwide is expected to reach 2.7 zettabytes this year, up 48 percent from 2011, according to an IDC calculation. So we're a long way from reaching the yottabyte threshold -- although with the rate of big data growth, it might come sooner than we think. Just to review, a zettabyte is one sextillion bytes of data. It's equal to 1,000 exabytes, 1 million petabytes and 1 billion terabytes.
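The unit arithmetic above can be checked with a few lines of Python. This sketch assumes the decimal (power-of-1,000) definitions used in the text; BYTES and convert are illustrative names, not part of any library.

```python
UNITS = ["terabyte", "petabyte", "exabyte", "zettabyte", "yottabyte"]

# A terabyte is 10**12 bytes and each step up multiplies by 1,000.
BYTES = {name: 10 ** (12 + 3 * i) for i, name in enumerate(UNITS)}


def convert(value, src, dst):
    """Convert a quantity from one unit to another via bytes."""
    return value * BYTES[src] / BYTES[dst]


zettabytes_per_yottabyte = BYTES["yottabyte"] // BYTES["zettabyte"]
terabytes_today = convert(2.7, "zettabyte", "terabyte")
share_of_yottabyte = convert(2.7, "zettabyte", "yottabyte")

print(zettabytes_per_yottabyte)
print(f"{terabytes_today:.3g} terabytes")
print(f"{share_of_yottabyte:.2%} of a yottabyte")
```

On these definitions, 2.7 zettabytes is roughly 2.7 billion terabytes, and still well under 1 percent of a single yottabyte.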
ZooKeeper was created by the Apache Software Foundation to help Hadoop users manage and coordinate Hadoop nodes across a distributed network. Closely integrated with HBase, the database associated with Hadoop, ZooKeeper is a centralised service for maintaining configuration information, naming services, distributed synchronisation and other group services. IT managers use it to implement reliable messaging, synchronise process execution and implement redundant services.
Copyright © CRN Australia. All rights reserved.