The information technology industry has never seen the likes of the data tsunami, or more appropriately, the perpetual data hurricane that is raining down on us. Many of the cloud pundits talk about Infrastructure as a Service, Platform as a Service and Software as a Service, but very few discuss a critical aspect of the cloud: big data. Tim O'Reilly calls it the network effect in data, and Amazon recently gave their nod to big data by putting public sets of census, genome, economics and 3-D chemical data online. Google has been indexing public books ( Google Scholar ) for quite a while now.
So how big is this data, and how fast is it growing? I did some research, and the results are astounding!
( For reference: a petabyte (PB) is 1000 terabytes (TB), where 1 TB = 1000 GB )
| DATA | SIZE ( Plus Compound Annual Growth Rate ) |
| Wikipedia | 10GB ( 100% CAGR ) |
| Merck Bio Research DB | 1.5TB/quarter |
| Wal-Mart Transaction DB | 600TB |
| UPMC Hospitals Imaging Data | 500TB/year |
| Typical Oil Company data per oil field | 350 TB |
| One day of Instant Messaging in 2002 | 750GB |
| World Wide Web | 1 PB |
| Internet Archive | 1 PB + |
| Terashake earthquake model of LA Basin | 1PB |
| MIT BabyTalk Speech Experiment | 1.5PB |
| Estimated Online RAM in Google | 8PB+ |
| Large Hadron Collider | 15 PB per run ( 300 Exabytes per year! ) |
| Annual Email traffic ( no spam ) | 300 PB |
| Personal digital photos | 1000 PB+ ( 100% CAGR ) |
| Human Genomics | 7000 PB ( 1GB per person / 200 PB + captured ) ( 200% CAGR ) |
| Total Digital Data created in 2007 ( IDC ) | 281,000 PB ( 281 Exabytes ) with 10% CAGR |
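
Since the rows mix GB, TB, PB and exabytes, it helps to put everything on one scale before comparing. Here is a minimal sketch that does so for a handful of rows copied from the table, using the decimal units from the reference note. The helper name `to_gb` and the choice of rows are mine, purely for illustration:

```python
# Decimal units, as in the reference note: 1 PB = 1000 TB, 1 TB = 1000 GB.
GB_PER_UNIT = {"GB": 1, "TB": 10**3, "PB": 10**6, "EB": 10**9}

def to_gb(value, unit):
    """Convert a size to gigabytes (hypothetical helper for this sketch)."""
    return value * GB_PER_UNIT[unit]

# A few rows copied from the table; growth rates are ignored here.
sizes = {
    "Wikipedia": (10, "GB"),
    "World Wide Web": (1, "PB"),
    "Personal digital photos": (1000, "PB"),
    "Human Genomics": (7000, "PB"),
    "Total digital data, 2007 (IDC)": (281, "EB"),
}

# Express every entry as a multiple of the World Wide Web estimate.
www_gb = to_gb(*sizes["World Wide Web"])
ratios = {name: to_gb(*size) / www_gb for name, size in sizes.items()}

for name, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{name}: {ratio:g}x the size of the WWW")
```

With the table's figures, personal photos come out at 1,000 times the WWW estimate and human genomics at 7,000 times.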
There are some interesting data points to dwell upon here:
a. The TOTAL size of the documents on the World Wide Web is only around 1 PB. Compare that to personal digital photos, at 1,000 times the size of the WWW, or to human genomics, which the table above puts at 7,000 times the size of the WWW.
b. Many of the large data sets are not created by the social web but by large institutions. High Performance Computing involving audio/video analysis and simulations produces data sets that dwarf the size of others.
c. The IDC report quoted above estimates that by the year 2011 there will be 1,773 exabytes of digital data in the world! The report contains many jewels of information, one being that only 5% of this data is generated from the enterprise and only 35% emanates from workers overall ( from their workstations ). The rest of it is created by consumers themselves or by workers in enterprises capturing personal information for their customers. In fact, if you evenly divide the data by the world population, each person is assumed to have about 45GB of data. I know I have probably created much more than that over the last year.
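
The per-person figure and the growth between the two IDC numbers are easy to sanity-check. This is only a rough sketch: the world population value is my own assumption (roughly 6.6 billion in 2007), not a number from the report, and the function names are made up for illustration.

```python
# Figures quoted from the IDC report in this post.
TOTAL_2007_EB = 281
PROJECTED_2011_EB = 1773

# Assumption (not from the post): roughly 6.6 billion people in 2007.
WORLD_POPULATION_2007 = 6.6e9

GB_PER_EB = 1e9  # decimal units: 1 EB = 1000 PB = 10^9 GB

def per_capita_gb(total_eb, population):
    """Evenly divide a total measured in exabytes across a population."""
    return total_eb * GB_PER_EB / population

def implied_cagr(start, end, years):
    """Compound annual growth rate that takes `start` to `end` in `years`."""
    return (end / start) ** (1 / years) - 1

share = per_capita_gb(TOTAL_2007_EB, WORLD_POPULATION_2007)
growth = implied_cagr(TOTAL_2007_EB, PROJECTED_2011_EB, years=4)

print(f"Data per person in 2007: {share:.1f} GB")
print(f"Implied growth 2007-2011: {growth:.1%} per year")
```

Under that population assumption the share lands a little under 43 GB per person, in the same ballpark as the 45GB quoted. The two IDC totals imply growth of a bit over 58% a year, which is well above the 10% shown in the last row of the table, so that table entry is worth double-checking against the report.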
This data points to some interesting trends -
1. The storage market will continue to grow by double digits over the next 5 years. Systems that are bigger, better, faster, more cost effective and easier to manage and operate will do better. ( Note: to store 280 exabytes of data, it would take 560,000 Sun Storage 7410 servers. )
2. The applications that can mine this data and expose meaningful information will become widely popular, as will the best data mining providers that want to offer services on top of these data sets. Data warehousing technologies and new distributed analytics models ( like Hadoop ) will thrive.
3. Security of this data will remain relevant. With tons of PII involved, privacy and security regulations ( think Sarbanes-Oxley, HIPAA, GLB etc ) will continue to force enterprises and guardians of this data to address security at all levels.
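
The server count in point 1 is a back-of-the-envelope number, and it can be reproduced like this. The 500 TB per server is simply what the post's own figures imply ( 280 exabytes spread over 560,000 boxes ); treat it as an assumption, not a vendor spec. The function name is made up for this sketch.

```python
TB_PER_EB = 10**6  # decimal units: 1 EB = 1000 PB = 10^6 TB

# Assumption: capacity implied by the post's numbers, not a published spec.
ASSUMED_TB_PER_SERVER = 500

def servers_needed(total_eb, tb_per_server=ASSUMED_TB_PER_SERVER):
    """Number of servers needed to hold `total_eb` exabytes, rounded up."""
    total_tb = total_eb * TB_PER_EB
    return -(-total_tb // tb_per_server)  # ceiling division on integers

for label, exabytes in [("2007 total (rounded)", 280), ("2011 projection", 1773)]:
    print(f"{label}: {servers_needed(exabytes):,} servers")
```

Swapping in a different per-server capacity is a one-argument change, for example `servers_needed(280, tb_per_server=250)`.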
I do see a bright future for companies that are in the data management, retention and safeguarding business, as well as for those companies that can corral this data hurricane and offer meaningful analysis and services on top of it.
Thoughts/comments - please fire away!
