Cloud Analytics should leverage Sun's compelling storage architecture.
Hadoop Distributed File System (HDFS) is scalable, with high availability and high performance. HDFS runs on a cluster of at least 3 nodes (1 master node and 2 slave nodes). Data blocks are 64 MB by default (128 MB is a common alternative), and every block is replicated 3 times by default. The NameNode holds the metadata of the file system. Files are split into blocks and distributed across the DataNodes.
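To make the block size and replication figures above concrete, here is a small sketch that estimates how many blocks and replicas a single file produces. The function name hdfs_footprint and the 1000 MB example file are illustrative assumptions, not part of Hadoop.

```python
MB = 1024 * 1024

def hdfs_footprint(file_size_bytes, block_size_bytes=64 * MB, replication=3):
    """Return (block_count, replica_count, raw_bytes) for one file.

    Hypothetical helper for back-of-the-envelope sizing only.
    """
    # Ceiling division: a partially filled last block still counts as a block.
    blocks = -(-file_size_bytes // block_size_bytes)
    # The last block is not padded, so raw usage is simply size * replication.
    raw_bytes = file_size_bytes * replication
    return blocks, blocks * replication, raw_bytes

# Example: a 1000 MB file with the default 64 MB blocks and 3 replicas.
blocks, replicas, raw = hdfs_footprint(1000 * MB)
print(blocks, replicas, raw // MB)

# Same file with 128 MB blocks: fewer blocks, same raw storage.
print(hdfs_footprint(1000 * MB, block_size_bytes=128 * MB))
```

Larger blocks mean fewer entries for the NameNode to track, while the raw storage consumed depends only on the file size and the replication factor.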
MapReduce is a data processing framework designed to process extremely large datasets in batch. It is not intended for realtime querying and does not support random access. The JobTracker schedules and manages jobs, and a TaskTracker executes individual map() and reduce() tasks on each cluster node.
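The map() and reduce() model can be illustrated without a cluster. The sketch below is a single-process word count in plain Python; it does not use the Hadoop API, and the names map_fn, reduce_fn and run_job are hypothetical. On a real cluster the map and reduce calls would be spread across the TaskTrackers.

```python
from collections import defaultdict

def map_fn(line):
    """map(): emit a (word, 1) pair for every word in one input line."""
    for word in line.split():
        yield word.lower(), 1

def reduce_fn(word, counts):
    """reduce(): sum all the counts emitted for one word."""
    return word, sum(counts)

def run_job(lines):
    """Run map, group by key (the 'shuffle'), then reduce, in one process."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            groups[key].append(value)
    return dict(reduce_fn(key, values) for key, values in sorted(groups.items()))

result = run_job(["Hadoop stores data", "Hadoop processes data in batch"])
print(result)
```

The job reads every input line once and produces its result only at the end, which is why this model suits batch processing rather than realtime queries.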
HBase is a distributed, column-oriented, multi-dimensional storage system. It is well suited to managing very large structured data sets for the semantic web. HBase can manage billions of rows, millions of columns, thousands of versions and petabytes of data across thousands of servers. It supports realtime querying.
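The "multi-dimensional" model can be pictured as a map from (row, column, timestamp) to a value. The following is a toy in-memory sketch of that idea only; TinyTable is a hypothetical class, not the HBase client API, and the row and column names are made up for the example.

```python
from collections import defaultdict

class TinyTable:
    """Toy model of a versioned, column-oriented table (not HBase itself)."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        # row key -> column name -> list of (timestamp, value), newest first
        self.rows = defaultdict(lambda: defaultdict(list))

    def put(self, row, column, timestamp, value):
        cells = self.rows[row][column]
        cells.append((timestamp, value))
        cells.sort(key=lambda cell: cell[0], reverse=True)
        del cells[self.max_versions:]  # drop versions beyond the limit

    def get(self, row, column, timestamp=None):
        """Return the newest value at or before timestamp, or None."""
        cells = self.rows.get(row, {}).get(column, [])
        for ts, value in cells:
            if timestamp is None or ts <= timestamp:
                return value
        return None

table = TinyTable(max_versions=2)
table.put("page1", "meta:lang", 1, "en")
table.put("page1", "meta:lang", 2, "fr")
table.put("page1", "meta:lang", 3, "de")
print(table.get("page1", "meta:lang"))               # newest version
print(table.get("page1", "meta:lang", timestamp=2))  # value as of timestamp 2
```

Because a lookup goes straight to one row and column, this kind of store can answer point queries in realtime, unlike a batch job that scans the whole data set.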
Hive is a data warehousing tool built on top of Hadoop for managing and querying structured data with a SQL-like language. It does not support realtime querying.
High Availability
- The NameNode is a single point of failure (SPOF). Its transaction log can be stored in multiple directories, each on the local file system or on a remote file system (NFS/CIFS).
- The secondary NameNode periodically copies the FsImage and transaction log from the NameNode to a temporary directory and merges them into a new checkpoint.
- To increase the availability of the Hadoop cluster, two master node servers can be interconnected in an active/passive configuration with Solaris Cluster.
Security
- To secure the Hadoop cluster, you should encrypt the data in order to safeguard all transactions on the web.
Proof Of Concept
- Create an architecture with a minimum of three nodes and test the performance and feasibility of Hadoop.
- To test Hadoop quickly, you can use the OpenSolaris Hadoop Live CD.
- The OpenSolaris LiveHadoop setup installs a three-node virtual Hadoop cluster.
- Once OpenSolaris boots, two virtual servers are created using Zones.
- Zones are very lightweight, minimizing virtualization overhead and leaving more memory for your application.
- The "Global" zone hosts the NameNode and JobTracker, and two "Local" zones each host a DataNode and TaskTracker.
- Interface your application with HDFS and implement the "Save as Cloud..." and "Open from Cloud..." functions. Use the Hadoop Java API for your development.
Service and Support
- HDFS, MapReduce, HBase and Hive are open source software and are supported on OpenSolaris.
- In the US, you can contact Cloudera, which brings big data to the enterprise with Hadoop.
- Who supports Hadoop across the globe? See http://wiki.apache.org/hadoop/Support
Architecture Overview
Configuration
- Application Server (2 servers minimum for load balancing) with 2 quad-core CPUs / 32 GB RAM / 2 x 146 GB SAS disks / 4 GbE ports
- Data Server (3 servers minimum for data replication) with 2 quad-core CPUs / 32 GB RAM / 48 x 1 TB SATA II disks / 4 GbE ports
- No Fibre Channel storage connection is needed because the Hadoop nodes communicate over Ethernet.
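As a rough sizing check for the data servers above, the sketch below computes raw and usable capacity under the default replication factor of 3. The function name usable_capacity_tb is hypothetical, and the estimate is an assumption-laden upper bound: it ignores operating system disks, temporary MapReduce space and file system overhead.

```python
def usable_capacity_tb(servers, disks_per_server, disk_tb, replication=3):
    """Return (raw_tb, usable_tb) for a set of identical data servers.

    Rough upper bound only: every disk is assumed to be fully
    available to HDFS, which will not hold in practice.
    """
    raw_tb = servers * disks_per_server * disk_tb
    return raw_tb, raw_tb / replication

# Minimum configuration from the list above: 3 servers, 48 disks of 1 TB each.
raw, usable = usable_capacity_tb(servers=3, disks_per_server=48, disk_tb=1)
print(raw, usable)
```

With three servers and three replicas, each server ends up holding one copy of every block, so the usable capacity of the minimum configuration is roughly that of a single server.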
Hadoop Presentation
Download
http://developer.yahoo.com/hadoop
http://wiki.apache.org/hadoop/Support
http://wiki.apache.org/hadoop/FrontPage
http://hadoop.apache.org/core/docs/current/hdfs_design.html
http://opensolaris.org/os/project/livehadoop
http://hadoop.apache.org/

