
The Road to Data Warehousing in China

清 张

www.chinadwonline.com
China Data Warehouse Online
June 25

Open Source Cloud Computing Technology Series (5): Sector/Sphere, the Rising Dark Horse (Hands-On)

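At a time when Java-based Hadoop is at its height, the open source cloud computing world has a C++-based dark horse: Sector/Sphere, which challenges Hadoop on performance. In software tests on the Open Cloud Testbed (OCT) built by the Open Cloud Consortium (OCC), Sector is about twice as fast as Hadoop.

This post takes the dark horse for a hands-on trial run to get a first feel for it. The next post will dig into its design principles and discuss what cloud computing is really about.

OCT is a compute cluster that spans multiple data centers connected over a 10 Gb/s core research and education network.

It is being built in two phases: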

 

[Figure: opencloud-09-v6, diagram of the Open Cloud Testbed]

Phase 1. Phase 1 was operational in June 2008 and consists of 240 cores distributed across four cities in the U.S. This was upgraded in September, 2008 to 480 cores.

Here is a diagram of the testbed. The Phase 1 equipment consists of four racks. Each rack contains 30 nodes. Each node has 4 cores. The racks are located in:

  • University of Illinois at Chicago (Chicago)
  • StarLight (Chicago)
  • Calit2 (La Jolla)
  • Johns Hopkins University (Baltimore)

All the racks are connected by a wide area 10 Gb/s network.
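
As a quick check on the Phase 1 numbers quoted above, the rack description (four racks, 30 nodes per rack, 4 cores per node) works out to the 480-core figure given for the September 2008 upgrade, not the initial 240. The variable names below are only illustrative.

```python
# Phase 1 figures as quoted above: four racks, 30 nodes per rack, 4 cores per node.
racks = 4
nodes_per_rack = 30
cores_per_node = 4

total_nodes = racks * nodes_per_rack
total_cores = total_nodes * cores_per_node

print(total_nodes, "nodes,", total_cores, "cores")
```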

Phase 2. Phase 2 of the Open Cloud Testbed is planned to be operational by June, 2009. The testbed will add 4 racks of equipment for a total of 8 racks containing over 1000 cores. In addition, two more sites will be connected by 10 Gb/s networks. Phase 2 racks will be located at:

  • Johns Hopkins University (Baltimore)
  • Calit2 (La Jolla)
  • MIT Lincoln Lab (Cambridge)
  • Pittsburgh Supercomputing Center/Carnegie Mellon University (Pittsburgh)
  • StarLight (Chicago)
  • University of Illinois at Chicago (Chicago)

In addition, in Phase 2, the Open Cloud Testbed will add shared, non-dedicated resources.

 

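With companies and universities working together, the open source cloud computing field keeps growing. Hadoop, mentioned earlier, is one of the software packages used on OCT. Here we focus on another dark horse, Sector/Sphere, which is also one of the core packages adopted by OCT, and that says something about its weight. Sector/Sphere is designed to run across the public Internet, emphasizes the security of core data, and gives developers who are fluent in C++ an open source cloud computing framework. Let's take a look at this dark horse, which ran about twice as fast as Hadoop in the performance tests.

Sector/Sphere has a clear design, but documentation and other material are still scarce, which makes large-scale adoption harder.

Before trying it out, let's look at the Sector/Sphere architecture diagram.

[Figure: Sector/Sphere architecture diagram]

One notable feature of Sector is its security server, which provides a measure of security for cloud computing over a wide area network.

The software is very compact. Download the latest version, codeblue.1.23c.tar.gz. Questions can be discussed on the forum.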

http://sourceforge.net/forum/?group_id=172838

Before building with make, check that a few basic packages are installed on Debian:

libssl-dev, gcc, g++, plus libfuse-dev if you plan to try the FUSE feature.

debian:~# tar xvzf codeblue.1.23c.tar.gz

debian:~/codeblue2/conf# ls
client.conf  master_node.cert  masters.list        security_node.key  slave.conf   topology.conf
master.conf  master_node.key   security_node.cert  slave_acl.conf     slaves.list  users
debian:~/codeblue2/conf# pwd
/root/codeblue2/conf

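Edit the security, master, slave, and client configuration files to match your deployment environment.

The configuration files are very clear. In most cases you only need to change the host names and the data directory.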

debian:~/codeblue2/conf# more master.conf
#SECTOR server port number
SECTOR_PORT
        6000

#security server address
SECURITY_SERVER
        localhost:5000

debian:~/codeblue2/conf# more slave.conf
#Master address
MASTER_ADDRESS
        localhost:6000

#Data directory
DATA_DIRECTORY
        /root/data/

debian:~/codeblue2/conf# more client.conf
#Master address
MASTER_ADDRESS
        localhost:6000
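
The listings above suggest a simple layout: a `#` comment, a key on its own line, then one or more indented value lines. The following is a minimal sketch of a reader for that layout, inferred only from the three files shown here. It is not Sector's own parser, and `parse_sector_conf` is a hypothetical helper name.

```python
def parse_sector_conf(text):
    """Read KEY lines followed by indented value lines; '#' starts a comment.

    The layout is inferred from the listings in this post, not from Sector's source.
    """
    settings = {}
    current_key = None
    for raw_line in text.splitlines():
        stripped = raw_line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and comments
        if raw_line[0] in " \t":
            # Indented line: a value belonging to the most recent key.
            if current_key is None:
                raise ValueError("value line before any key: %r" % raw_line)
            settings[current_key].append(stripped)
        else:
            # Line starting in column 0: a new key.
            current_key = stripped
            settings[current_key] = []
    return settings


# Contents of master.conf as shown above.
MASTER_CONF = """#SECTOR server port number
SECTOR_PORT
        6000

#security server address
SECURITY_SERVER
        localhost:5000
"""

master = parse_sector_conf(MASTER_CONF)
print(master)
```

A reader like this makes it easy to check, before starting the services, that the port in slave.conf and client.conf matches SECTOR_PORT in master.conf.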

Compile. Once make completes successfully, the services can be started.

Start the services:

debian:~/codeblue2/security# ./sserver &
[1] 8637
debian:~/codeblue2/security# Sector Security server running at port 5000

The server is started successfully; there is no further output from this program. Please do not shutdown the security server; otherwise no client may be able to login. If the server is down for any reason, you can restart it without restarting the masters and the slaves

 

debian:~/codeblue2/security# cd ../master/
debian:~/codeblue2/master# ./start_master &
[2] 8638
debian:~/codeblue2/master# Sector master is successfully running now. check sector.log for more details.
There is no further screen output from this program.

 

debian:~/codeblue2/master# cd ../slave/
debian:~/codeblue2/slave# ls
COPYING   serv_file.cpp  serv_spe.cpp  slave.cpp  slave.o      start_slave.cpp
Makefile  serv_file.o    serv_spe.o    slave.h    start_slave
debian:~/codeblue2/slave# ./start_slave &
[3] 8652
debian:~/codeblue2/slave# scaning /root/data/
This Sector slave is successfully initialized and running now.
slave process: GMP 47087 DATA 42064

debian:~/codeblue2/slave#

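By default Sector reserves 10 GB of disk space, and the generated test data is also 10 GB. To verify the setup with a smaller data set, change the source code.

For example, to generate 100 MB of test data for sorting, edit randwriter.cpp:

vi randwriter.cpp

Drop the trailing 00 from the loop bounds, which reduces the test data from 10 GB to 100 MB.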

   // 100 MB = 100 bytes * 1000000 records (the original bound, 100000000, gives 10 GB)
   for (long long int i = 0; i < 1000000; ++ i)
   {
      keygen(record);
      ofs.write(record, 100);
   }

   // the index holds one 8-byte offset per record boundary, hence one extra entry
   for (long long int i = 0; i < 1000001; ++ i)
   {
      long long int d = i * 100;
      idx.write((char*)&d, 8);
   }
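
The effect of the two loop bounds can be worked out directly: each record is 100 bytes and each index entry is 8 bytes, with one more index entry than there are records. A small sketch of that arithmetic follows. The function name is illustrative, and the constants are read off the loops above.

```python
RECORD_SIZE = 100      # bytes written per record by ofs.write(record, 100)
INDEX_ENTRY_SIZE = 8   # bytes written per offset by idx.write((char*)&d, 8)


def sort_input_sizes(num_records):
    """Return (data file bytes, index file bytes) for a given record count."""
    data_bytes = num_records * RECORD_SIZE
    # The index loop runs one step past the record count, so it has num_records + 1 entries.
    index_bytes = (num_records + 1) * INDEX_ENTRY_SIZE
    return data_bytes, index_bytes


print(sort_input_sizes(1000000))    # modified bounds
print(sort_input_sizes(100000000))  # original bounds
```

With the modified bounds this gives 100000000 and 8000008 bytes, which are the sizes reported for sort_input.0.dat and sort_input.0.dat.idx in the transcripts further down.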

In mrsort.cpp, one block also has to be commented out, otherwise the example will not run.

debian:~/codeblue2/client/examples# vi mrsort.cpp

/*   if (3 != argc)
   {
      cout << "usage: mrsort" << endl;
      return 0;
   }
*/

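Then run make, or go to the codeblue2 directory and run make clean followed by make.

Now the tests below can begin without filling up your disk. That said, if you are going to play with cloud computing, set aside plenty of disk space: many benchmark programs need their default data volume to reach a certain scale before the results are representative, and that scale is also what shows how big a cloud can be.

Generate the test data.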

debian:~/codeblue2/client/examples# ./testfs
recv cmd 127.0.0.1 6000 type 105
recv cmd 127.0.0.1 6000 type 103
recv cmd 127.0.0.1 6000 type 110
===> start file server 127.0.0.1 6000
open file tmp/guide.dat 127.0.0.1 60833
rendezvous connect source 127.0.0.1 45180 /root/data//tmp/guide.dat
connected
file server closed 127.0.0.1 45180 0
report 127.0.0.1 6000 14,/tmp/guide.dat,0,1245914942,4
recv cmd 127.0.0.1 6000 type 110
===> start file server 127.0.0.1 6000
rendezvous connect source 127.0.0.1 45180 /root/data//tmp/guide.dat.idx
connected
open file tmp/guide.dat.idx 127.0.0.1 60833
file server closed 127.0.0.1 45180 0
report 127.0.0.1 6000 18,/tmp/guide.dat.idx,0,1245914943,16
start time 1245914943
JOB 4 1
1 spes found! 1 data seg total.
recv cmd 127.0.0.1 6000 type 203
starting SPE ... 0 45180 randwriter 3
rendezvous connect 127.0.0.1 45180
connected
connect SPE 127.0.0.1 3
new job /tmp/guide.dat 0 1
completed 100 127.0.0.1 46922
sending data back... 0
report 127.0.0.1 6000 21,test/sort_input.0.dat,0,1245914946,100000000
report 127.0.0.1 6000 25,test/sort_input.0.dat.idx,0,1245914946,8000008
recv cmd 127.0.0.1 6000 type 105
comp server closed 127.0.0.1 46922 2
reportSphere 127.0.0.1 6000 3

Use ./sysinfo to view Sector system information.

debian:~/codeblue2/client/tools# ./sysinfo
Sector System Information:
Running since Thu Jun 25 03:28:39 2009
Available Disk Size 27413 MB
Total File Size 102 MB
Total Number of Files 2
Total Number of Slave Nodes 1
------------------------------------------------------------
Total number of clusters 4
Cluster_ID  Total_Nodes  AvailDisk(MB)  FileSize(MB)  NetIn(MB)  NetOut(MB)
0:  1  27413  102  0  0
1:  0  0  0  0  0
2:  0  0  0  0  0
3:  0  0  0  0  0
------------------------------------------------------------
SLAVE_ID  IP  TS(us)  AvailDisk(MB)  TotalFile(MB)  Mem(MB)  CPU(us)  NetIn(MB)  NetOut(MB)
1:  127.0.0.1  1245915399257411  27413  102  0  3440000  0  0

debian:~/codeblue2/client/tools# ./ls /
test                                            <dir>
debian:~/codeblue2/client/tools# ./ls /test
sort_input.0.dat                                100000000 bytes         Thu Jun 25 03:29:06 2009
sort_input.0.dat.idx                            8000008 bytes   Thu Jun 25 03:29:06 2009

The test data has been generated.

Run the sort experiment with testdc.

debian:~/codeblue2/client/examples# ./testdc
start time 1245915520
JOB 100000000 1000000
request shuffler 127.0.0.1 41406
1 spes found! 1 data seg total.
connect SPE 127.0.0.1 5
stage 1 accomplished 1245915552
JOB 100000000 1000000
2 spes found! 16 data seg total.
connect SPE 127.0.0.1 6
connect SPE 127.0.0.1 7
stage 2 accomplished 1245915557
SPE COMPLETED
debian:~/codeblue2/client/examples#
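
The testdc transcript prints a start time and a completion time for each stage. Assuming those numbers are Unix timestamps in seconds (an assumption based on their magnitude, not something the tool documents here), the per-stage durations can be read off with a short sketch like this. `stage_durations` is a hypothetical helper.

```python
import re

# Relevant lines copied from the testdc transcript above.
TESTDC_LOG = """start time 1245915520
JOB 100000000 1000000
stage 1 accomplished 1245915552
JOB 100000000 1000000
stage 2 accomplished 1245915557
SPE COMPLETED
"""


def stage_durations(log_text):
    """Map stage number to elapsed seconds since the previous timestamp."""
    durations = {}
    previous = None
    for line in log_text.splitlines():
        line = line.strip()
        start = re.fullmatch(r"start time (\d+)", line)
        if start:
            previous = int(start.group(1))
            continue
        stage = re.fullmatch(r"stage (\d+) accomplished (\d+)", line)
        if stage and previous is not None:
            finished = int(stage.group(2))
            durations[int(stage.group(1))] = finished - previous
            previous = finished
    return durations


print(stage_durations(TESTDC_LOG))
```

Under that assumption, stage 1 took 32 seconds and stage 2 took 5 seconds for the 100 MB input on this single-node setup.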

Next, run a wordcount example. Hadoop ships a corresponding example as well.

debian:~/codeblue2/client/tools# ./mkdir html
debian:~/codeblue2/client/tools# ./upload mv.cpp
usage: upload <src file/dir> <dst dir>
debian:~/codeblue2/client/tools# ./upload mv.cpp /html
uploading mv.cpp of 1821 bytes
open file /html/mv.cpp 127.0.0.1 60833
Uploading accomplished! AVG speed 0.0121632 Mb/s.

debian:~/codeblue2/client/tools# cd ../examples/
debian:~/codeblue2/client/examples# ./wordcount
start time 1245915644
JOB 1821 -1
request shuffler 127.0.0.1 41406
1 spes found! 1 data seg total.
connect SPE 127.0.0.1 10
stage 1 accomplished 1245915645
SPE COMPLETED
debian:~/codeblue2/client/examples#

Interested readers can visit http://sector.sourceforge.net/ for more information.

June 24

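Open Source Cloud Computing Technology Series (4): Installing and Configuring the Latest Hadoop 0.20 with Cloudera

Continuing from the previous post, we keep exploring the latest Cloudera release based on Hadoop 0.20.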

wget hadoop-0.20-conf-pseudo_0.20.0-1cloudera0.5.0~lenny_all.deb

wget hadoop-0.20_0.20.0-1cloudera0.5.0~lenny_all.deb

debian:~# dpkg -i hadoop-0.20-conf-pseudo_0.20.0-1cloudera0.5.0~lenny_all.deb

dpkg -i hadoop-0.20_0.20.0-1cloudera0.5.0~lenny_all.deb

It's that simple.

If you are not sure where the files were installed, run

debian:~# dpkg -L hadoop-0.20

to see the installation directory layout.

Start the services:

debian:~# cd /etc/init.d/hadoop-0.20-
hadoop-0.20-datanode           hadoop-0.20-namenode           hadoop-0.20-tasktracker
hadoop-0.20-jobtracker         hadoop-0.20-secondarynamenode 
debian:~# cd /etc/init.d/hadoop-0.20-

debian:~# /etc/init.d/hadoop-0.20-namenode start

debian:~# /etc/init.d/hadoop-0.20-namenode status
hadoop-0.20-namenode is running

debian:~# /etc/init.d/hadoop-0.20-datanode start

debian:~# /etc/init.d/hadoop-0.20-datanode status
hadoop-0.20-datanode is running

 

debian:~# /etc/init.d/hadoop-0.20-jobtracker start

debian:~# /etc/init.d/hadoop-0.20-jobtracker status
hadoop-0.20-jobtracker is running

 

debian:~# /etc/init.d/hadoop-0.20-tasktracker start

debian:~# /etc/init.d/hadoop-0.20-tasktracker status
hadoop-0.20-tasktracker is running

Startup is complete.
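
The start-then-status sequence above follows the same pattern for each daemon. As a rough sketch, the command list can be generated rather than typed by hand. The init script names come from the tab-completion listing above, while the helper name and the choice to only build (not execute) the commands are assumptions for illustration.

```python
INIT_DIR = "/etc/init.d"
# Order used in the transcript above; the secondarynamenode script is not started there.
DAEMONS = ["namenode", "datanode", "jobtracker", "tasktracker"]


def start_and_check_commands():
    """Build the start and status command for each daemon, in transcript order."""
    commands = []
    for name in DAEMONS:
        script = "%s/hadoop-0.20-%s" % (INIT_DIR, name)
        commands.append([script, "start"])
        commands.append([script, "status"])
    return commands


for command in start_and_check_commands():
    print(" ".join(command))
```

The lists could be handed to `subprocess.run` one at a time as root; the sketch only prints them so it is safe to run anywhere.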

Next you can run the usual example tests.

One tool worth testing is sqoop:

debian:~# sqoop --help
Usage: hadoop sqoop.jar org.apache.hadoop.sqoop.Sqoop (options)

Database connection options:
--connect (jdbc-uri)         Specify JDBC connect string
--driver (class-name)        Manually specify JDBC driver class to use
--username (username)        Set authentication username
--password (password)        Set authentication password
--local                      Use local import fast path (mysql only)

Import control options:
--table (tablename)          Table to read
--columns (col,col,col...)   Columns to export from table
--order-by (column-name)     Column of the table used to order results
--hadoop-home (dir)          Override $HADOOP_HOME
--warehouse-dir (dir)        HDFS path for table destination
--as-sequencefile            Imports data to SequenceFiles
--as-textfile                Imports data as plain text (default)
--all-tables                 Import all tables in database
                             (Ignores --table, --columns and --order-by)

Code generation options:
--outdir (dir)               Output directory for generated code
--bindir (dir)               Output directory for compiled objects
--generate-only              Stop after code generation; do not import

Additional commands:
--list-tables                List tables in database and exit
--list-databases             List all databases available and exit
--debug-sql (statement)      Execute 'statement' in SQL and exit

Generic Hadoop command-line options:
Generic options supported are
-conf <configuration file>     specify an application configuration file
-D <property=value>            use value for given property
-fs <local|namenode:port>      specify a namenode
-jt <local|jobtracker:port>    specify a job tracker
-files <comma separated list of files>    specify comma separated files to be copied to the map reduce cluster
-libjars <comma separated list of jars>    specify comma separated jar files to include in the classpath.
-archives <comma separated list of archives>    specify comma separated archives to be unarchived on the compute machines.

The general command line syntax is
bin/hadoop command [genericOptions] [commandOptions]

At minimum, you must specify --connect and either --table or --all-tables.
Alternatively, you can specify --generate-only or one of the additional
commands.
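
The last paragraph of the help text states a minimum set of options. The sketch below encodes one reading of that rule: `--connect` is always required, together with `--table`, `--all-tables`, `--generate-only`, or one of the additional commands. Whether sqoop really insists on `--connect` in the alternative cases is an assumption here, and `has_minimum_sqoop_args` is a hypothetical helper, not part of sqoop.

```python
# Options that can accompany --connect, taken from the help text above.
TARGET_OPTIONS = {
    "--table",
    "--all-tables",
    "--generate-only",
    "--list-tables",
    "--list-databases",
    "--debug-sql",
}


def has_minimum_sqoop_args(args):
    """Check an argument list against one reading of the rule in the help text."""
    flags = {arg for arg in args if arg.startswith("--")}
    if "--connect" not in flags:
        return False
    return bool(flags & TARGET_OPTIONS)


print(has_minimum_sqoop_args(["--connect", "jdbc:mysql://localhost/test", "--table", "USERS"]))
print(has_minimum_sqoop_args(["--table", "USERS"]))
```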

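You can install MySQL on Debian with apt-get install mysql-server and test the two together.

See the earlier post for the hands-on test. At this point you can go through a complete trial. Cloudera has indeed greatly simplified Hadoop configuration and pushed open source cloud computing forward.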

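Open Source Cloud Computing Technology Series (4): Installing and Configuring the Stable 0.18.3 Release with Cloudera

To save space, let's get straight to the point.

First, set up a Debian 5.0 machine in a VirtualBox virtual machine.

Among open source Linux distributions, Debian has always had the purest Linux pedigree: convenient to use and efficient to run. Taking a fresh look at the latest 5.0 release feels like meeting an old friend again.

You only need to download debian-501-i386-CD-1.iso for the installation. Everything else can be set up easily thanks to Debian's strong network-based package management. The details are omitted here; all the information you need is at www.debian.org.

Now let's see how convenient and simple the stable 0.18.3 release is.

Step 1. Configure the Cloudera repository

Create a new configuration file: vi /etc/apt/sources.list.d/cloudera.list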

more /etc/apt/sources.list.d/cloudera.list
deb http://archive.cloudera.com/debian lenny contrib
deb-src http://archive.cloudera.com/debian lenny contrib

Add the Cloudera key

debian:~# curl -s http://archive.cloudera.com/debian/archive.key | apt-key add -
OK

Update the APT index

debian:~# apt-get update
Ign cdrom://[Debian GNU/Linux 5.0.1 _Lenny_ - Official i386 CD Binary-1 20090413-00:10] lenny Release.gpg
Ign cdrom://[Debian GNU/Linux 5.0.1 _Lenny_ - Official i386 CD Binary-1 20090413-00:10] lenny/main Translation-en_US
Ign cdrom://[Debian GNU/Linux 5.0.1 _Lenny_ - Official i386 CD Binary-1 20090413-00:10] lenny Release 
Ign cdrom://[Debian GNU/Linux 5.0.1 _Lenny_ - Official i386 CD Binary-1 20090413-00:10] lenny/main Packages/DiffIndex
Get:1 http://archive.cloudera.com lenny Release.gpg [197B]                                            
Get:2 http://volatile.debian.org lenny/volatile Release.gpg [189B]                                    
Ign http://volatile.debian.org lenny/volatile/main Translation-en_US                                  
Hit http://ftp.us.debian.org lenny Release.gpg                                                        
Ign http://archive.cloudera.com lenny/contrib Translation-en_US                           
Hit http://security.debian.org lenny/updates Release.gpg                                  
Ign http://security.debian.org lenny/updates/main Translation-en_US 
Get:3 http://volatile.debian.org lenny/volatile Release [40.7kB]    
Ign http://ftp.us.debian.org lenny/main Translation-en_US                                       
Hit http://security.debian.org lenny/updates Release                                            
Get:4 http://archive.cloudera.com lenny Release [2391B]                                        
Hit http://ftp.us.debian.org lenny Release                                                      
Ign http://security.debian.org lenny/updates/main Packages/DiffIndex                           
Ign http://archive.cloudera.com lenny/contrib Packages                     
Ign http://security.debian.org lenny/updates/main Sources/DiffIndex        
Ign http://ftp.us.debian.org lenny/main Packages/DiffIndex                 
Ign http://ftp.us.debian.org lenny/main Sources/DiffIndex                                  
Hit http://security.debian.org lenny/updates/main Packages          
Hit http://ftp.us.debian.org lenny/main Packages                    
Ign http://archive.cloudera.com lenny/contrib Sources               
Ign http://volatile.debian.org lenny/volatile/main Packages/DiffIndex
Hit http://security.debian.org lenny/updates/main Sources           
Ign http://volatile.debian.org lenny/volatile/main Sources/DiffIndex
Hit http://ftp.us.debian.org lenny/main Sources                     
Get:5 http://archive.cloudera.com lenny/contrib Packages [4480B]
Get:6 http://volatile.debian.org lenny/volatile/main Packages [7471B]
Get:7 http://volatile.debian.org lenny/volatile/main Sources [2350B]     
Get:8 http://archive.cloudera.com lenny/contrib Sources [1431B]
Fetched 59.2kB in 4s (12.5kB/s)
Reading package lists... Done
debian:~#

List the Cloudera packages

debian:~# apt-cache search hadoop
hadoop - A software platform for processing vast amounts of data
hadoop-conf-pseudo - Pseudo-distributed Hadoop configuration
hadoop-datanode - Data Node for Hadoop
hadoop-doc - Documentation for Hadoop
hadoop-jobtracker - Job Tracker for Hadoop
hadoop-namenode - Name Node for Hadoop
hadoop-native - Native libraries for Hadoop (e.g., compression)
hadoop-pipes - Interface to author Hadoop MapReduce jobs in C++
hadoop-secondarynamenode - Secondary Name Node for Hadoop
hadoop-tasktracker - Task Tracker for Hadoop
hive - A data warehouse infrastructure built on top of Hadoop
libhdfs0 - JNI Bindings to access Hadoop HDFS from C
pig - A platform for analyzing large data sets using Hadoop
debian:~#

 

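OK, that completes the preparation. Now the actual installation begins, and it is just as convenient.

We choose to install Hadoop in pseudo-distributed mode, which lets us try out Hadoop's full functionality.

Yesterday we tried hadoop-conf-pseudo 0.18.3-0cloudera0.3.0~intrepid. Today Cloudera released a trial package based on the latest Hadoop 0.20, so let's take the chance to try something fresh. This is the pace of open source software: something new every day.

Java 6 is required.

Configure the APT sources: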

debian:~/codeblue2/client/examples# more /etc/apt/sources.list
#
# deb cdrom:[Debian GNU/Linux 5.0.1 _Lenny_ - Official i386 CD Binary-1 20090413-00:10]/ lenny main

deb cdrom:[Debian GNU/Linux 5.0.1 _Lenny_ - Official i386 CD Binary-1 20090413-00:10]/ lenny main

deb http://ftp.us.debian.org/debian/ lenny main contrib non-free
deb-src http://ftp.us.debian.org/debian/ lenny main contrib non-free

deb http://security.debian.org/ lenny/updates main contrib non-free
deb-src http://security.debian.org/ lenny/updates main contrib non-free

deb http://volatile.debian.org/debian-volatile lenny/volatile main contrib non-free
deb-src http://volatile.debian.org/debian-volatile lenny/volatile main contrib non-free

 

Then run apt-get update.

debian:~# apt-get install sun-java6-jre

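The installation needs almost no effort, so the output is omitted here.

Before trying 0.20, let's cover the installation of 0.18.3 first, since it is the stable version.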

apt-get -y install hadoop-conf-pseudo
Reading package lists... Done
Building dependency tree      
Reading state information... Done
The following extra packages will be installed:
  hadoop hadoop-native liblzo2-2
The following NEW packages will be installed:
  hadoop hadoop-conf-pseudo hadoop-native liblzo2-2
0 upgraded, 4 newly installed, 0 to remove and 0 not upgraded.
Need to get 12.0MB/12.1MB of archives.
After this operation, 21.5MB of additional disk space will be used.
Get:1 http://archive.cloudera.com lenny/contrib hadoop 0.18.3-4cloudera0.3.0~lenny [11.9MB]
Get:2 http://archive.cloudera.com lenny/contrib hadoop-conf-pseudo 0.18.3-4cloudera0.3.0~lenny [93.1kB]
Get:3 http://archive.cloudera.com lenny/contrib hadoop-native 0.18.3-4cloudera0.3.0~lenny [92.7kB]    
Fetched 4336kB in 23s (184kB/s)                                                                       
Selecting previously deselected package liblzo2-2.
(Reading database ... 103556 files and directories currently installed.)
Unpacking liblzo2-2 (from .../lzo2/liblzo2-2_2.03-1_i386.deb) ...
Selecting previously deselected package hadoop.
Unpacking hadoop (from .../hadoop_0.18.3-4cloudera0.3.0~lenny_all.deb) ...
Selecting previously deselected package hadoop-conf-pseudo.
Unpacking hadoop-conf-pseudo (from .../hadoop-conf-pseudo_0.18.3-4cloudera0.3.0~lenny_all.deb) ...
Selecting previously deselected package hadoop-native.
Unpacking hadoop-native (from .../hadoop-native_0.18.3-4cloudera0.3.0~lenny_i386.deb) ...
Processing triggers for man-db ...
Setting up liblzo2-2 (2.03-1) ...
Setting up hadoop (0.18.3-4cloudera0.3.0~lenny) ...
Setting up hadoop-conf-pseudo (0.18.3-4cloudera0.3.0~lenny) ...
Setting up hadoop-native (0.18.3-4cloudera0.3.0~lenny) ...

 

Check where it was installed.

debian:~# dpkg -L hadoop-conf-pseudo
/.
/etc
/etc/hadoop
/etc/hadoop/conf.pseudo
/etc/hadoop/conf.pseudo/hadoop-default.xml
/etc/hadoop/conf.pseudo/configuration.xsl
/etc/hadoop/conf.pseudo/log4j.properties
/etc/hadoop/conf.pseudo/slaves
/etc/hadoop/conf.pseudo/sslinfo.xml.example
/etc/hadoop/conf.pseudo/hadoop-env.sh
/etc/hadoop/conf.pseudo/masters
/etc/hadoop/conf.pseudo/hadoop-metrics.properties
/etc/hadoop/conf.pseudo/commons-logging.properties
/etc/hadoop/conf.pseudo/hadoop-site.xml
/usr
/usr/share
/usr/share/doc
/usr/share/doc/hadoop-conf-pseudo
/usr/share/doc/hadoop-conf-pseudo/copyright
/usr/share/doc/hadoop-conf-pseudo/changelog.Debian.gz
/usr/share/doc/hadoop-conf-pseudo/changelog.gz
/usr/share/lintian
/usr/share/lintian/overrides
/usr/share/lintian/overrides/hadoop-conf-pseudo

 

debian:~# ls -l /var/lib/hadoop/cache/hadoop/dfs/name
total 8
drwxr-xr-x 2 hadoop hadoop 4096 2009-06-24 02:58 current
drwxr-xr-x 2 hadoop hadoop 4096 2009-06-24 02:58 image

 

Start the Hadoop services:

debian:~# /etc/init.d/hadoop-namenode start
Starting Hadoop namenode daemon: starting namenode, logging to /var/log/hadoop/hadoop-hadoop-namenode-debian.out
hadoop-namenode.

 

/etc/init.d/hadoop-datanode start
Starting Hadoop datanode daemon: starting datanode, logging to /var/log/hadoop/hadoop-hadoop-datanode-debian.out
hadoop-datanode.
debian:~# /etc/init.d/hadoop-jobtracker start
Starting Hadoop jobtracker daemon: starting jobtracker, logging to /var/log/hadoop/hadoop-hadoop-jobtracker-debian.out

hadoop-jobtracker.

 

Check that the processes are running normally

hadoop    7926     1  0 03:01 ?        00:00:12 /usr/lib/jvm/java-6-sun//bin/java -Xmx100m -Dcom.sun.man
hadoop    8007     1  1 03:02 ?        00:00:14 /usr/lib/jvm/java-6-sun//bin/java -Xmx100m -Dcom.sun.man
hadoop    8053     1  0 03:02 ?        00:00:13 /usr/lib/jvm/java-6-sun//bin/java -Xmx100m -Dcom.sun.man
hadoop    8108     1  0 03:02 ?        00:00:11 /usr/lib/jvm/java-6-sun//bin/java -Xmx100m -Dhadoop.log

 

Installing hive and pig also takes just one command each.

apt-get install hive

apt-get install pig

OK, let's autoremove 0.18.3 and try the latest 0.20.

debian:~# apt-get autoremove hadoop-conf-pseudo

 

debian:~# wget http://archive.cloudera.com/hadoop-summit-09/hadoop-20-debs/deb_lenny_i386/hadoop-0.20_0.20.0-1cloudera0.5.0~lenny_all.deb

debian:~# dpkg -i hadoop-0.20_0.20.0-1cloudera0.5.0~lenny_all.deb
Selecting previously deselected package hadoop-0.20.
(Reading database ... 103589 files and directories currently installed.)
Unpacking hadoop-0.20 (from hadoop-0.20_0.20.0-1cloudera0.5.0~lenny_all.deb) ...
Setting up hadoop-0.20 (0.20.0-1cloudera0.5.0~lenny) ...
Processing triggers for man-db ...

We will keep following new developments in 0.20.

June 23

Open Source Cloud Computing Technology Series (4): Cloudera Hands-On

Cloudera positions itself as

Bringing Big Data to the Enterprise with Hadoop

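To standardize Hadoop configuration, Cloudera helps enterprises install, configure, and run Hadoop for processing and analyzing enterprise data at scale.

Because it targets enterprises, Cloudera's distribution does not use the latest Hadoop 0.20. It packages Hadoop 0.18.3-12.cloudera.CH0_3 instead, and integrates Hadoop-based SQL-style interfaces such as Hive, contributed by Facebook, and Pig, contributed by Yahoo. This lowers the cost of installing, configuring, and using these tools and standardizes them. Besides integrating and packaging these mature tools, Cloudera offers one rather interesting tool of its own, sqoop. It is not currently available separately, so it is one of the starting points for this full Cloudera trial: to see how convenient sqoop is.

Sqoop ("SQL-to-Hadoop") is a tool designed to easily import information from SQL databases into your Hadoop cluster. With sqoop, it is easy to import data from a traditional RDBMS such as MySQL or Oracle into a Hadoop cluster: export and import are done with a single command, and you can select which tables to bring over. Compared with the more established approach of going through text files or pipes, the gain in development efficiency and the simplicity of configuration are what set this tool apart.

Sqoop can: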

  • Imports individual tables or entire databases to files in HDFS
  • Generates Java classes to allow you to interact with your imported data
  • Provides the ability to import from SQL databases straight into your Hive data warehouse

After setting up an import job in Sqoop, you can get started working with SQL database-backed data from your Hadoop MapReduce cluster in minutes.

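Here we first try sqoop right away through an example, and then walk through the full configuration of this cloud computing environment.

The example shows how to bring customer table data onto the Hadoop cluster for analysis: export the USERS table, import it into Hive automatically, and then run ad-hoc SQL queries through Hive. This shows off Hadoop's data processing power without affecting the production database.

First create the test USERS table: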

mysql> CREATE TABLE USERS (
    ->   user_id INTEGER NOT NULL PRIMARY KEY,
    ->   first_name VARCHAR(32) NOT NULL,
    ->   last_name VARCHAR(32) NOT NULL,
    ->   join_date DATE NOT NULL,
    ->   zip INTEGER,
    ->   state CHAR(2),
    ->   email VARCHAR(128),
    ->   password_hash CHAR(64));
Query OK, 0 rows affected (0.00 sec)

 

Insert one row of test data

insert into USERS (user_id,first_name,last_name,join_date,zip,state,email,password_hash) values (1,'a','b','20080808',330440,'ha','test@test.com','xxxx');       
Query OK, 1 row affected, 1 warning (0.00 sec)

mysql> select * from USERS;
+---------+------------+-----------+------------+--------+-------+---------------+---------------+
| user_id | first_name | last_name | join_date  | zip    | state | email         | password_hash |
+---------+------------+-----------+------------+--------+-------+---------------+---------------+
|       1 | a          | b         | 2008-08-08 | 330440 | ha    | test@test.com | xxxx          |
+---------+------------+-----------+------------+--------+-------+---------------+---------------+
1 row in set (0.00 sec)

Then we use sqoop to import the USERS table from the MySQL database into Hive.

sqoop --connect jdbc:mysql://localhost/test --username root --password xxx --local --table USERS --hive-import
09/06/20 18:43:50 INFO sqoop.Sqoop: Beginning code generation
09/06/20 18:43:50 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM USERS AS t WHERE 1 = 1
09/06/20 18:43:50 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM USERS AS t WHERE 1 = 1
09/06/20 18:43:50 INFO orm.CompilationManager: HADOOP_HOME is /usr/lib/hadoop
09/06/20 18:43:50 INFO orm.CompilationManager: Found hadoop core jar at: /usr/lib/hadoop/hadoop-0.18.3-12.cloudera.CH0_3-core.jar
09/06/20 18:43:50 INFO orm.CompilationManager: Invoking javac with args: -sourcepath ./ -d /tmp/sqoop/compile/ -classpath /etc/hadoop/conf:/home/hadoop/jdk1.6/lib/tools.jar:/usr/lib/hadoop:/usr/lib/hadoop/hadoop-0.18.3-12.cloudera.CH0_3-core.jar:/usr/lib/hadoop/lib/commons-cli-2.0-SNAPSHOT.jar:/usr/lib/hadoop/lib/commons-codec-1.3.jar:/usr/lib/hadoop/lib/commons-httpclient-3.0.1.jar:/usr/lib/hadoop/lib/commons-logging-1.0.4.jar:/usr/lib/hadoop/lib/commons-logging-api-1.0.4.jar:/usr/lib/hadoop/lib/commons-net-1.4.1.jar:/usr/lib/hadoop/lib/hadoop-0.18.3-12.cloudera.CH0_3-fairscheduler.jar:/usr/lib/hadoop/lib/hadoop-0.18.3-12.cloudera.CH0_3-scribe-log4j.jar:/usr/lib/hadoop/lib/hsqldb.jar:/usr/lib/hadoop/lib/jets3t-0.6.1.jar:/usr/lib/hadoop/lib/jetty-5.1.4.jar:/usr/lib/hadoop/lib/junit-4.5.jar:/usr/lib/hadoop/lib/kfs-0.1.3.jar:/usr/lib/hadoop/lib/libfb303.jar:/usr/lib/hadoop/lib/libthrift.jar:/usr/lib/hadoop/lib/log4j-1.2.15.jar:/usr/lib/hadoop/lib/mysql-connector-java-5.0.8-bin.jar:/usr/lib/hadoop/lib/oro-2.0.8.jar:/usr/lib/hadoop/lib/servlet-api.jar:/usr/lib/hadoop/lib/slf4j-api-1.4.3.jar:/usr/lib/hadoop/lib/slf4j-log4j12-1.4.3.jar:/usr/lib/hadoop/lib/xmlenc-0.52.jar:/usr/lib/hadoop/lib/jetty-ext/commons-el.jar:/usr/lib/hadoop/lib/jetty-ext/jasper-compiler.jar:/usr/lib/hadoop/lib/jetty-ext/jasper-runtime.jar:/usr/lib/hadoop/lib/jetty-ext/jsp-api.jar:/usr/lib/hadoop/hadoop-0.18.3-12.cloudera.CH0_3-core.jar:/usr/lib/hadoop/contrib/sqoop/hadoop-0.18.3-12.cloudera.CH0_3-sqoop.jar ./USERS.java
09/06/20 18:43:51 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop/compile/USERS.jar
09/06/20 18:43:51 INFO manager.LocalMySQLManager: Beginning mysqldump fast path import
09/06/20 18:43:51 INFO manager.LocalMySQLManager: Performing import of table USERS from database test
09/06/20 18:43:52 INFO manager.LocalMySQLManager: Transfer loop complete.
09/06/20 18:43:52 INFO hive.HiveImport: Loading uploaded data into Hive
09/06/20 18:43:52 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM USERS AS t WHERE 1 = 1
09/06/20 18:43:52 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM USERS AS t WHERE 1 = 1
09/06/20 18:43:52 WARN hive.TableDefWriter: Column join_date had to be cast to a less precise type in Hive
09/06/20 18:43:53 INFO hive.HiveImport: Hive history file=/tmp/root/hive_job_log_root_200906201843_1606494848.txt
09/06/20 18:44:00 INFO hive.HiveImport: OK
09/06/20 18:44:00 INFO hive.HiveImport: Time taken: 5.916 seconds
09/06/20 18:44:00 INFO hive.HiveImport: Loading data to table users
09/06/20 18:44:00 INFO hive.HiveImport: OK
09/06/20 18:44:00 INFO hive.HiveImport: Time taken: 0.344 seconds
09/06/20 18:44:01 INFO hive.HiveImport: Hive import complete.
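
For repeated imports it can help to assemble the sqoop command from its parts. The sketch below rebuilds the exact command line used above from a JDBC URI, credentials, and a table name. The options are the ones that appear in that command, while `build_sqoop_hive_import` is a hypothetical helper and nothing here is sqoop's own API.

```python
import shlex


def build_sqoop_hive_import(jdbc_uri, username, password, table):
    """Return the argument list for a local fast-path import into Hive."""
    return [
        "sqoop",
        "--connect", jdbc_uri,
        "--username", username,
        "--password", password,
        "--local",          # mysqldump fast path, MySQL only per the help text
        "--table", table,
        "--hive-import",
    ]


args = build_sqoop_hive_import("jdbc:mysql://localhost/test", "root", "xxx", "USERS")
# Quote each argument so the printed line is safe to paste into a shell.
command_line = " ".join(shlex.quote(arg) for arg in args)
print(command_line)
```

Keeping the arguments as a list avoids shell-quoting problems if a password or table name contains special characters.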

The import succeeded. Let's verify in Hive that the data was imported correctly.

hive
Hive history file=/tmp/root/hive_job_log_root_200906201844_376630602.txt
hive> select * from USERS;
OK
1       'a'     'b'     '2008-08-08'    330440  'ha'    'test@test.com' 'xxxx'
Time taken: 5.019 seconds
hive>

The row matches the data in the MySQL database.
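
One visible difference in the Hive output is that string fields appear wrapped in single quotes, while the MySQL listing shows them bare. A minimal sketch for comparing the two rows, assuming fields are separated by whitespace and contain no spaces themselves (true for this test row, not in general):

```python
# Row as printed by Hive above.
HIVE_ROW = "1       'a'     'b'     '2008-08-08'    330440  'ha'    'test@test.com' 'xxxx'"

# Values as shown in the MySQL SELECT output above.
MYSQL_ROW = ["1", "a", "b", "2008-08-08", "330440", "ha", "test@test.com", "xxxx"]


def normalize_hive_row(line):
    """Split a Hive output row on whitespace and drop surrounding single quotes."""
    return [field.strip("'") for field in line.split()]


print(normalize_hive_row(HIVE_ROW) == MYSQL_ROW)
```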

This completes the import from the MySQL database into HDFS.

Sqoop also produces an auto-generated USERS.java class for use in MapReduce analysis.

more USERS.java
// ORM class for USERS
// WARNING: This class is AUTO-GENERATED. Modify at your own risk.
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.lib.db.DBWritable;
import org.apache.hadoop.sqoop.lib.JdbcWritableBridge;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.Date;
import java.sql.Time;
import java.sql.Timestamp;
public class USERS implements DBWritable, Writable {
  public static final int PROTOCOL_VERSION = 1;
  private Integer user_id;
  public Integer get_user_id() {
    return user_id;
  }
  private String first_name;
  public String get_first_name() {
    return first_name;
  }
  private String last_name;
  public String get_last_name() {
    return last_name;
  }
  private java.sql.Date join_date;
  public java.sql.Date get_join_date() {
    return join_date;
  }
  private Integer zip;
  public Integer get_zip() {
    return zip;
  }
  private String state;
  public String get_state() {
    return state;
  }
  private String email;
  public String get_email() {
    return email;
  }
  private String password_hash;
  public String get_password_hash() {
    return password_hash;
  }
  public void readFields(ResultSet __dbResults) throws SQLException {
    this.user_id = JdbcWritableBridge.readInteger(1, __dbResults);
    this.first_name = JdbcWritableBridge.readString(2, __dbResults);
    this.last_name = JdbcWritableBridge.readString(3, __dbResults);
    this.join_date = JdbcWritableBridge.readDate(4, __dbResults);
    this.zip = JdbcWritableBridge.readInteger(5, __dbResults);
    this.state = JdbcWritableBridge.readString(6, __dbResults);
    this.email = JdbcWritableBridge.readString(7, __dbResults);
    this.password_hash = JdbcWritableBridge.readString(8, __dbResults);
  }
  public void write(PreparedStatement __dbStmt) throws SQLException {
    JdbcWritableBridge.writeInteger(user_id, 1, 4, __dbStmt);
    JdbcWritableBridge.writeString(first_name, 2, 12, __dbStmt);
    JdbcWritableBridge.writeString(last_name, 3, 12, __dbStmt);
    JdbcWritableBridge.writeDate(join_date, 4, 91, __dbStmt);
    JdbcWritableBridge.writeInteger(zip, 5, 4, __dbStmt);
    JdbcWritableBridge.writeString(state, 6, 1, __dbStmt);
    JdbcWritableBridge.writeString(email, 7, 12, __dbStmt);
    JdbcWritableBridge.writeString(password_hash, 8, 1, __dbStmt);
  }
  public void readFields(DataInput __dataIn) throws IOException {
    if (__dataIn.readBoolean()) {
        this.user_id = null;
    } else {
    this.user_id = Integer.valueOf(__dataIn.readInt());
    }
    if (__dataIn.readBoolean()) {
        this.first_name = null;
    } else {
    this.first_name = Text.readString(__dataIn);
    }
    if (__dataIn.readBoolean()) {
        this.last_name = null;
    } else {
    this.last_name = Text.readString(__dataIn);
    }
    if (__dataIn.readBoolean()) {
        this.join_date = null;
    } else {
    this.join_date = new Date(__dataIn.readLong());
    }
    if (__dataIn.readBoolean()) {
        this.zip = null;
    } else {
    this.zip = Integer.valueOf(__dataIn.readInt());
    }
    if (__dataIn.readBoolean()) {
        this.state = null;
    } else {
    this.state = Text.readString(__dataIn);
    }
    if (__dataIn.readBoolean()) {
        this.email = null;
    } else {
    this.email = Text.readString(__dataIn);
    }
    if (__dataIn.readBoolean()) {
        this.password_hash = null;
    } else {
    this.password_hash = Text.readString(__dataIn);
    }
  }
  public void write(DataOutput __dataOut) throws IOException {
    if (null == this.user_id) {
        __dataOut.writeBoolean(true);
    } else {
        __dataOut.writeBoolean(false);
    __dataOut.writeInt(this.user_id);
    }
    if (null == this.first_name) {
        __dataOut.writeBoolean(true);
    } else {
        __dataOut.writeBoolean(false);
    Text.writeString(__dataOut, first_name);
    }
    if (null == this.last_name) {
        __dataOut.writeBoolean(true);
    } else {
        __dataOut.writeBoolean(false);
    Text.writeString(__dataOut, last_name);
    }
    if (null == this.join_date) {
        __dataOut.writeBoolean(true);
    } else {
        __dataOut.writeBoolean(false);
    __dataOut.writeLong(this.join_date.getTime());
    }
    if (null == this.zip) {
        __dataOut.writeBoolean(true);
    } else {
        __dataOut.writeBoolean(false);
    __dataOut.writeInt(this.zip);
    }
    if (null == this.state) {
        __dataOut.writeBoolean(true);
    } else {
        __dataOut.writeBoolean(false);
    Text.writeString(__dataOut, state);
    }
    if (null == this.email) {
        __dataOut.writeBoolean(true);
    } else {
        __dataOut.writeBoolean(false);
    Text.writeString(__dataOut, email);
    }
    if (null == this.password_hash) {
        __dataOut.writeBoolean(true);
    } else {
        __dataOut.writeBoolean(false);
    Text.writeString(__dataOut, password_hash);
    }
  }
  public String toString() {
    StringBuilder sb = new StringBuilder();
    sb.append("" + user_id);
    sb.append(",");
    sb.append(first_name);
    sb.append(",");
    sb.append(last_name);
    sb.append(",");
    sb.append("" + join_date);
    sb.append(",");
    sb.append("" + zip);
    sb.append(",");
    sb.append(state);
    sb.append(",");
    sb.append(email);
    sb.append(",");
    sb.append(password_hash);
    return sb.toString();
  }
}

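As you can see, the automatically generated code is quite readable and can serve as a starting point for custom secondary development.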
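The null handling in readFields and write(DataOutput) above follows one pattern: a boolean flag records whether the field is null, and the value follows only when it is not. The sketch below reproduces that byte layout for a single nullable integer in Python, purely as an illustration. The function names are hypothetical and not part of the generated class; the sketch relies only on the documented behavior of Java's DataOutput (one byte for a boolean, four big-endian bytes for an int).

```python
import io
import struct
from typing import Optional


def write_nullable_int(out: io.BytesIO, value: Optional[int]) -> None:
    """Write a one-byte null flag, then a 4-byte big-endian int if present."""
    if value is None:
        out.write(struct.pack(">?", True))   # flag = true means "field is null"
    else:
        out.write(struct.pack(">?", False))  # flag = false, the value follows
        out.write(struct.pack(">i", value))


def read_nullable_int(inp: io.BytesIO) -> Optional[int]:
    """Mirror of write_nullable_int: read the flag first, then maybe the value."""
    (is_null,) = struct.unpack(">?", inp.read(1))
    if is_null:
        return None
    (value,) = struct.unpack(">i", inp.read(4))
    return value


if __name__ == "__main__":
    buf = io.BytesIO()
    write_nullable_int(buf, 42)
    write_nullable_int(buf, None)
    buf.seek(0)
    print(read_nullable_int(buf), read_nullable_int(buf))
```

Writing the flag before the value lets the reader decide how many bytes to consume next, which is why the read and write methods must list the fields in exactly the same order.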

June 16

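Open Source Cloud Computing Technology Series, Part 3 (10gen): Installation and Configuration

10gen is a cloud computing platform that provides scalable, high-performance data storage for web applications. 10gen's open source project is mongoDB. It mainly targets operational data storage for websites, session object storage, data caching, and efficient real-time counting (for example PV and UV statistics), and it supports many web development languages, including Ruby, Python, Java, C++, and PHP.

MongoDB's main features: storing data is very convenient because it no longer relies on the traditional object-relational mapping model; it is high-performance; it can store large objects such as video; and it supports automatic replication and failover.

Technology has to be practiced, so let's get hands-on and experience mongoDB's strengths for ourselves.

First, set up a RHEL 5.2 virtual machine.

Download the build that matches your platform: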

http://www.mongodb.org/display/DOCS/Downloads

curl -O http://downloads.mongodb.org/linux/mongodb-linux-i686-latest.tgz

Installation is simple: unpack the archive and it is ready to use.

tar xvzf mongodb-linux-i686-latest.tgz

The directory structure after unpacking is as follows:

|-- bin
|   |-- mongo                              (the database shell)
|   |-- mongod                             (the database)
|   |-- mongodump                          (dump/export utility)
|   `-- mongorestore                       (restore/import utility)
|-- include                                (c++ driver include files)
|   `-- mongo
|       |-- client
|       |-- db
|       |-- grid
|       `-- util
|-- lib
|-- lib64

Before starting, create the directory where the database files will be stored.

mkdir -p /data/db

Next, start mongoDB in the background:

bin/mongod run &
[1] 5673
[root@rac01 mongodb-linux-i686-2009-06-14]# Mon Jun 15 20:27:32 Mongo DB : starting : pid = 5673 port = 27017 dbpath = /data/db/ master = 0 slave = 0
Mon Jun 15 20:27:32 db version v0.9.4+, pdfile version 4.4
Mon Jun 15 20:27:32 git version: 004cd26deee50b7fdf060c06605bbce37bc09794
Mon Jun 15 20:27:32 sys info: Linux domU-12-31-39-01-70-B4 2.6.21.7-2.fc8xen #1 SMP Fri Feb 15 12:39:36 EST 2008 i686
Mon Jun 15 20:27:32 waiting for connections on port 27017
Mon Jun 15 20:27:32 web admin interface listening on port 28017
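As a quick sanity check that mongod is really accepting connections on the port shown in the log, a few lines of Python can attempt a TCP connection. This is only a sketch: is_port_open is a hypothetical helper, and the host and port are taken from the log output above.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


if __name__ == "__main__":
    # 27017 is the port reported by "waiting for connections on port 27017".
    print("mongod reachable:", is_port_open("127.0.0.1", 27017))
```

A successful connection only shows that something is listening on the port; the mongo shell session that follows is the real functional test.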

 

OK, startup is complete. Now connect with the client that ships with mongoDB.

 

bin/mongo      
url: test
connecting to: test
type "help" for help
Mon Jun 15 20:28:09 connection accepted from 127.0.0.1:19943
> help
HELP
        show dbs                     show database names
        show collections             show collections in current database
        show users                   show users in current database
        show profile                 show most recent system.profile entries with time >= 1ms
        use <db name>                set curent database to <db name>
        db.help()                    help on DB methods
        db.foo.help()                help on collection methods
        db.foo.find()                list objects in collection foo
        db.foo.find( { a : 1 } )     list objects in foo where a == 1
        it                           result of the last line evaluated; use to further iterate
> show dbs
admin
local
test

 

The help output is clearly organized.

Let's check that mongoDB is working properly.

bin/mongo
url: test
connecting to: test
type "help" for help
Mon Jun 15 20:28:56 connection accepted from 127.0.0.1:31975
> db.foo.save( { a : 1 } )
> db.foo.findOne()
{"_id" : "4a3631b14ae1a7d3e24cab82" , "a" : 1}

 

At this point mongoDB is installed and configured. In the next post we will take a closer look at its key features.

June 15

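Open Source Cloud Computing Technology Series, Part 2 (Usage) (Enomaly)

Continuing from the previous post, let's start using ECP.

After logging in, we see the overall interface.

The dashboard clearly shows the operations performed on the platform and their results.

[Screenshot: step8]

Under Virtual Infrastructure you can see the virtual OS images, here Ubuntu 8.04 Server and NetBSD 4.0. The interface is very friendly. Each virtual machine is configured with 256 MB of memory, one CPU, and a 1 GB disk. The machines start quickly, and you can connect to a virtual OS image through the built-in VNC viewer or with a separate VNC client; using one is essentially no different from using a real machine. You can also manage network devices.

[Screenshots: step18_netbsd_vm, step17_unlock_vm, step19_netbsd_vm]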

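You may wonder how these virtual OS systems get installed. Don't worry, the next menu covers it. Virtual OS management lives under Repository. There you can manage local and remote appliances and create your own VM images from an ISO, a CD-ROM, and so on. For ISOs larger than 1 GB, you can also transfer the file directly via FTP to /opt/enomalism2/iso/, which makes installation easier.

The admin menu contains documentation for many of the APIs.

You can also customize the configuration.

The user menu handles user and group management.

You can also manage each user's permissions for specific resources.

Management is very convenient.

There are many more features for you to explore.

If you're interested, refer to the official documentation, and let's learn from each other.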

http://src.enomaly.com/wiki

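Open Source Cloud Computing Technology Series, Part 2 (Installation and Configuration) (Enomaly)

Enomaly, maker of the Elastic Computing Platform (ECP), is one of ten cloud computing companies currently worth watching. ECP integrates the enterprise data center with commercial cloud computing services, so IT staff can manage internal and external resources from a single console and can easily move virtual machines from one data center to another. From this we can see that the essence of cloud computing is to virtualize an enterprise's surplus computing resources and deliver them as services on demand. The virtual server a user sees can meet that user's elastic computing needs, so the user doesn't have to buy a pile of hardware and install the operating system and application software. When the business grows, the user simply requests more cloud capacity, which greatly reduces IT investment as well as maintenance and tuning effort. Cloud computing services therefore provide more than virtual hosts and application services: through the cloud platform they also deliver the enterprise's IT service capability.

We will work through Enomaly's ECP end to end to understand the essence of cloud computing.

Playing with cloud computing calls for fairly powerful hardware. Without a lab of dozens of machines you can't really run it at scale, but if you just want to experience the technology, a single well-equipped machine is enough. Below we deploy a complete cloud computing environment on one machine so that everyone can try it out.

Cloud computing builds on many individual foundational technologies. If you have spent time wrestling with Oracle RAC, with server operating systems (RHEL, SUSE, Ubuntu, FreeBSD), and with web services (Apache, Tomcat, PHP, JBoss), and you have some understanding of virtualization (Xen, KVM, QEMU), please read on. If not, I suggest searching for these keywords first, which will help you deploy a complete cloud computing environment on your own.

A complete cloud computing environment can be configured in many ways, and only one is described here. Many components can be swapped out, and swapping them is good hands-on practice. To get everyone through one working path quickly, we first prepare the base environment: use Sun's VirtualBox to create a virtual machine running RHEL 5.2, with 3.6 GB of memory and a 32 GB disk. More is better if you can afford it. As you will see later, a cloud environment routinely involves large ISO files and virtual OS images, which consume a lot of memory and disk.

Next, let's install and configure ECP.

Some preparation is needed before installation.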

step1 Download enomalism.public and import it with rpm: rpm --import enomalism.public

step2 Download enomalism001.pubkey.asc and import it with rpm: rpm --import enomalism001.pubkey.asc

step3 Install libvirt 0.4.1.

Building it manually ensures the installation succeeds:

yum install libxml2-devel openssl-devel cyrus-sasl-devel xen-devel gnutls-devel gcc

wget http://libvirt.org/sources/libvirt-0.4.1.tar.gz

tar -xvzf libvirt-0.4.1.tar.gz

cd libvirt-0.4.1

./configure --prefix=/usr && make && make install

step4  Download the latest enomalism2-2.2.3.noarch.PY2.4.rpm from SourceForge and install it (use the file name of the package you actually downloaded):

yum install Enomalism2-2.2.2-r4157.noarch.PY2.4.rpm

 

step 5 Install a hypervisor

Xen:
yum install kernel-xen xen
Or KVM/QEMU:
Because we chose RHEL 5.2, these packages must be downloaded and installed manually.
  • Install CentOS 5.x Public Key: rpm --import http://mirror.centos.org/centos/RPM-GPG-KEY-CentOS-5

     

  • 32 Bit (x86)
    wget http://mirrors.kernel.org/centos/5.2/extras/i386/RPMS/qemu-0.9.0-4.i386.rpm
    wget http://mirrors.kernel.org/centos/5.2/extras/i386/RPMS/kvm-36-1.i386.rpm
    wget http://mirrors.kernel.org/centos/5.2/extras/i386/RPMS/kmod-kvm-36-2.2.6.18_92.1.10.el5.i686.rpm
    wget http://mirrors.kernel.org/centos/5.2/updates/i386/RPMS/kernel-2.6.18-92.1.10.el5.i686.rpm
    yum install qemu-0.9.0-4.i386.rpm kvm-36-1.i386.rpm kmod-kvm-36-2.2.6.18_92.1.10.el5.i686.rpm kernel-2.6.18-92.1.10.el5.i686.rpm
    ln -s /usr/bin/qemu-kvm /usr/bin/kvm

    If no errors have appeared up to this point, reboot the virtual machine.

     

    step 6. Start the MySQL service

    /etc/init.d/mysqld start
    Set the MySQL password:
    mysqladmin password <password>

     

    Make the MySQL service start automatically at boot:

    chkconfig mysqld on

    With these six steps done, we move on to configuring ECP.

    step 7:

    cd /opt/enomalism2
    scripts/init-db.sh <mysql root password> <new ecp user> <new ecp password>

  • cp default.cfg config/$HOSTNAME.cfg
  • Edit config/$HOSTNAME.cfg
    • Change sqlobject.dburi="mysql://enomalism2:zx45qw12@localhost/enomalism2" to reflect your proper MySQL username and password.
    • Change enomalism2.self="5fe6f05e-7ee0-11dc-ba7c-0011d88b8e81" to reflect a unique identifier for your cluster environment (each instance needs to be unique)
      • Most distros have a utility such as uuid or uuidgen that can generate a number for you. The only valid value in this field is a uuid.
    • Change enomalism2.baseurl="http://127.0.0.1:8080/rest/" to the IP/hostname you use to access the ECP web interface.
    • Change enomalism2.ip_addr="1.2.3.4" to the IP/hostname you use to access the ECP web interface, this will be used later for clustering
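The edits listed above can also be scripted. The sketch below rests on stated assumptions: it assumes the config file consists of key="value" lines exactly as quoted in the bullets, patch_cfg is a hypothetical helper that is not part of ECP, and the MySQL credentials and IP address in the demo are placeholders. It uses Python's uuid module to produce the unique identifier required for enomalism2.self.

```python
import re
import uuid


def patch_cfg(text: str, settings: dict) -> str:
    """Replace the quoted value of every `key="..."` line named in settings."""
    for key, value in settings.items():
        pattern = re.compile(
            r'^([ \t]*' + re.escape(key) + r'[ \t]*=[ \t]*)"[^"]*"',
            re.MULTILINE,
        )
        # A function replacement avoids backslash surprises in the new value.
        text = pattern.sub(lambda m, v=value: m.group(1) + '"' + v + '"', text)
    return text


if __name__ == "__main__":
    # Sample lines copied from the bullets above; a real file has more settings.
    original = (
        'sqlobject.dburi="mysql://enomalism2:zx45qw12@localhost/enomalism2"\n'
        'enomalism2.self="5fe6f05e-7ee0-11dc-ba7c-0011d88b8e81"\n'
        'enomalism2.baseurl="http://127.0.0.1:8080/rest/"\n'
        'enomalism2.ip_addr="1.2.3.4"\n'
    )
    print(patch_cfg(original, {
        "sqlobject.dburi": "mysql://ecpuser:ecppass@localhost/enomalism2",  # placeholder
        "enomalism2.self": str(uuid.uuid4()),          # fresh unique identifier
        "enomalism2.baseurl": "http://192.0.2.10:8080/rest/",  # placeholder address
        "enomalism2.ip_addr": "192.0.2.10",                    # placeholder address
    }))
```

Generating the identifier with uuid4 gives each instance its own value, which is what the note about uniqueness in the list asks for.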

    Verify the setup and take a look at the metadata tables stored in MySQL.

    mysql> use enomalism2;
    Reading table information for completion of table and column names
    You can turn off this feature to get a quicker startup with -A

    Database changed
    mysql> show tables;
    +----------------------+
    | Tables_in_enomalism2 |
    +----------------------+
    | clusters             |
    | clusters_machine     |
    | e2_perm              |
    | enomalism_group      |
    | enomalism_user       |
    | exception            |
    | hypervisor           |
    | locker               |
    | machine              |
    | machine_definition   |
    | networks             |
    | packages             |
    | qmessage             |
    | queue                |
    | repo_entry           |
    | repo_feed            |
    | static_ip_range      |
    | static_ip_used       |
    | static_network       |
    | tg_group             |
    | tg_group_permission  |
    | tg_permission        |
    | tg_user              |
    | tg_user_group        |
    | transactions         |
    | variables            |
    | visit                |
    | visit_identity       |
    +----------------------+
    28 rows in set (0.00 sec)

     

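    Next we configure VNC. If this step is skipped, starting a virtual image from the web interface later will report that it cannot connect to the image. You can also configure it after running into the problem, but here we set it up in advance.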

     

    • /etc/libvirt/qemu.conf (NOTE: If this file is missing or is a directory, you probably did not install the 0.4.1 version of libvirt!)
      # VNC is configured to listen on 127.0.0.1 by default.
      # To make it listen on all public interfaces, uncomment
      # this next option.
      #
      # NB, strong recommendation to enable TLS + x509 certificate
      # verification when allowing public access
      #
      vnc_listen = "0.0.0.0"
    • /etc/xen/xend-config.sxp
      # The interface for VNC servers to listen on. Defaults
      # to 127.0.0.1  To restore old 'listen everywhere' behaviour
      # set this to 0.0.0.0
      (vnc-listen '0.0.0.0')

    Now we start ECP, which is the key step.

    /etc/init.d/enomalism2.sh start

    Starting enomalism2
    Setting up KVM/Qemu Networking
    Configuring Virtual Bridge on eth0 IP x.x.x.x

    OK, it started successfully!

    Take a first look at the web page; the next post covers the features in detail.

  • http://x.x.x.x:8080

  • Default username and password: admin / password

    [Screenshot: step1]


  • June 14

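    Open Source Cloud Computing Technology Series, Part 1 (abiquo)

    The open source cloud platform abiCloud released version 0.7.0 on June 11. With cloud computing on the rise, let's take a close-up look at the latest open source cloud technology.

    This post uses the Windows version for the demonstration and, by looking at a complete open source cloud package, analyzes what problems cloud computing actually has to solve.

    First download abiCloud-0.7.0-windows-installer.exe. Installation is fully automatic, but the installer downloads the companion packages mysql-noinstall-5.1.31-win32.zip and apache-tomcat-6.0.18.zip during setup, which can take a while depending on network speed. To use the latest companion versions instead, manually create the directory C:\external and download: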

    http://mysql.west.mirrors.airband.net/Downloads/MySQL-5.1/mysql-noinstall-5.1.35-win32.zip

    http://labs.xiaonei.com/apache-mirror/tomcat/tomcat-6/v6.0.20/bin/apache-tomcat-6.0.20.zip

    and rename them to mysql.zip and tomcat.zip. This preparation speeds up the subsequent installation and configuration.

    abiCloud requires JRE 1.6 and supports the following:

    Virtualization technologies (Supported technologies)

    • Virtualbox (2.2.x versions) installed on each cloud node.
    • KVM (With libvirt)
    • XEN (With libvirt) (Not tested YET)

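    Add JAVA_HOME=D:\jdk1.6.0_14 to the environment variables (replace it with your own JDK 1.6 installation directory).

    Next, run abiCloud-0.7.0-windows-installer.exe to install.

    [Installer screenshots: init2, license, installDirectory, hyperType, database1, database2]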

    Database configuration: You must create a database named kinton. One database user has to be able to write to this kinton database.

    [Installer screenshots: tomcat, tomcat2, domain, internalAdress, readyInstall]

    Installation is complete.

    Run C:\Program Files\abiCloud-0.7.0\run.bat

    If it starts normally, you can open the web management page.

    http://localhost:5050/abicloud/AbiCloud.html

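    The default usernames are admin and user; the password is xabiquo.

    [Screenshots: start, infras, infras1, vd1, vd2, user1]

    Through hands-on use, we can see that abiCloud can manage an enterprise's global data centers and virtual images such as database servers, application servers, and web servers. It can also manage virtual applications (Virtual application: a simple or complex system developed inside the virtual [DataCenter]), and you can define additional categories to manage.

    Virtual Data Center: an isolated cloud infrastructure in a physical [DataCenter] where a company deploys its cloud applications.

    Here is the feature chart for the latest 0.7.0 release:

    [Figure: features]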

    April 27

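    On Keeping Enterprise Data Warehouse Architecture Stable Amid Change

    Since the EDW concept arrived in China, many enterprises have built enterprise data warehouses: banking, securities, telecom, mobile, and Internet companies have all launched EDW projects. EDW construction is usually done in phases, but once live, the EDW is a platform that must continuously support the business. As time passes and the business develops rapidly, maintaining, optimizing, and changing the EDW becomes an ongoing process, and the faster the business changes, the greater the pressure on the EDW architecture. Many hastily launched EDW projects have short lifespans, and very few survive several rounds of business restructuring. The cause is usually unsound architecture design at launch. So what characterizes a mature EDW?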

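    1. Clear layering. The layers of an EDW are closely connected, but cross-layer interference should be kept as small as possible. This helps confine a change in the business architecture to the appropriate layer, rather than letting one small change ripple through the whole system.

    2. Modularity. Modularity is a well-worn topic, but its essence is a deep understanding of the business and of the business's underlying logical structure: partition modules sensibly, control each module's complexity, and keep the atomic-level segments inside a module uniform in form. These principles look simple, but applied well they give the system the resilience to absorb pressure and change when the business moves quickly.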

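    3. Metadata-driven. Technical metadata and business metadata are brought into a single EDW metadata system, and accumulating metadata requires good standards and a proper technical platform. What is the benefit of unified metadata? Consider this scenario: a business rule changes, so how many places in the EDW need to be modified? This question comes up constantly when building an EDW. In an EDW that has been running for years, if the metadata is not solid and complete, the question becomes disastrous, because being unable to determine the affected scope fully and accurately means any change is essentially aimless and untargeted. The basic job of metadata is to pinpoint the EDW's affected scope in exactly this situation. It has many deeper uses as well, which are not covered in detail here.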

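    4. A flexible scheduling system. The scheduling system is the skeleton and the circulatory system of an EDW. It organizes the EDW's modules flexibly according to the dependencies recorded in the metadata, and it is a fully dynamic system. It must support load balancing and parallel scheduling. Most fundamentally, scheduling accuracy is a crucial prerequisite for data accuracy: accurate scheduling and accurate module business logic are the two foundations for producing complete and accurate analytical data. A sound, flexible scheduling system makes full use of machine resources and minimizes the ETL time window, so a good scheduler matters a great deal to the ROI of building an EDW.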

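    5. An automatic monitoring system. Because the EDW is the enterprise data warehouse, it holds a great many metrics and KPIs, and these are often important inputs to business decisions. How do we guarantee data accuracy at the final layer? A data alerting system is an important foundation for producing high-quality results and for detecting business problems as early as possible.

    6. An automatic handling system. One source of complexity in building an EDW is the complexity of the enterprise IT environment: there are many points where problems can occur. Experience from manual handling must be abstracted into rules and built into the EDW's automatic handling system, so that the EDW can handle as many problems as possible by itself.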
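Points 3 and 4 above can be made concrete with a small sketch. It assumes the metadata is available as a simple mapping from each table or job to the upstream objects it reads; the table names and both function names are hypothetical. The first function answers the impact question ("what is affected if this object changes?"), and the second groups jobs into batches that could run in parallel because nothing inside a batch depends on anything else in it. It is an illustration of the idea, not a description of any particular EDW product.

```python
from collections import defaultdict, deque

# Hypothetical lineage metadata: each object maps to the upstream objects it reads.
DEPENDS_ON = {
    "ods_orders": [],
    "ods_users": [],
    "dw_order_fact": ["ods_orders", "ods_users"],
    "dw_user_dim": ["ods_users"],
    "rpt_daily_sales": ["dw_order_fact", "dw_user_dim"],
}


def downstream_of(changed, depends_on):
    """Impact analysis: everything that directly or transitively reads `changed`."""
    readers = defaultdict(list)
    for node, parents in depends_on.items():
        for parent in parents:
            readers[parent].append(node)
    affected, queue = set(), deque([changed])
    while queue:
        for reader in readers[queue.popleft()]:
            if reader not in affected:
                affected.add(reader)
                queue.append(reader)
    return affected


def parallel_batches(depends_on):
    """Scheduling: split jobs into ordered batches with no dependency inside a batch."""
    remaining = {node: set(parents) for node, parents in depends_on.items()}
    batches = []
    while remaining:
        ready = sorted(n for n, parents in remaining.items() if not parents)
        if not ready:
            raise ValueError("cycle detected in dependency metadata")
        batches.append(ready)
        for node in ready:
            del remaining[node]
        for parents in remaining.values():
            parents.difference_update(ready)
    return batches


if __name__ == "__main__":
    print(sorted(downstream_of("ods_users", DEPENDS_ON)))
    print(parallel_batches(DEPENDS_ON))
```

Both functions read the same dependency mapping, which is the practical meaning of "metadata-driven": the impact report and the schedule cannot drift apart because they come from one source.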

    April 09

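    Stream Data Management: Key Techniques and a Hands-on Prototype

    Background: stream data management is a relatively new technology trend in data warehousing (DW) and plays an important role wherever timeliness and freshness requirements are high. Take the securities industry, which was a hot topic last year: current stream data and historical data deliver the greatest reference value only when combined. A DW normally deals with historical data, but when real-time requirements are high, users want to see the latest data. In the Internet industry, people pay close attention to information about their friends and business partners. What further latent business need does this lead to? Real-time analysis and aggregation of the latest data, which at the technical level means being able to run analytical operations such as join, sum, and group on the newest data.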

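    Admittedly, DW technology got started abroad many years earlier than in China. For basic research on stream data, Stanford University has a dedicated data warehousing research group, Data Warehousing at Stanford, which started very early and covers extraction from complex heterogeneous data sources, efficient data warehouse optimization, physical and logical design, query processing, recovery of massive data sets, data mining, and more. One of its projects is today's topic: stream data management, the Stanford Stream Data Manager.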

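    The motivation for this research comes from network monitoring, telecommunications data management, clickstream monitoring, and sensor data management. When users of such data need long-running continuous queries rather than traditional one-time queries, they need stream query technology of this kind. In other words, it applies whenever one must continuously follow multiple, continuous, rapid, time-varying data streams.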

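    Stanford studied this technology as early as ten years ago and built a prototype system. Its theoretical work on hard problems, such as parallel data streams and joining unbounded streams within limited memory, still offers important guidance for today's technology. This post does not go deep into theory; its focus is an overall picture of stream data management, to give readers an intuitive feel for it.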

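    Below we walk step by step through building a prototype stream data environment. The steps assume some familiarity with Linux, virtual machines, Java, and Ant.

    Step 1: Prepare a virtual machine running RHEL 5.2, 5.3, 4.5, or 4.7, and install most of the development tools, such as gcc. If unsure, select everything.

    Step 2: Get the key stream software (server and client) along with Java and Ant.

    The list:

    Server: http://infolab.stanford.edu/stream/code/stream-0.6.0.tar.gz

    Client: http://infolab.stanford.edu/stream/code/stream-vis-0.3.0.tar.gz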

    ant 1.7   http://labs.xiaonei.com/apache-mirror/ant/binaries/apache-ant-1.7.1-bin.tar.gz

    Step 3: Configuration.

     

    Install Java and Ant, then proceed as follows:

    Building the Server

    [root@test stream-0.6.0]# pwd
    /root/stream/stream-0.6.0

  • ./configure --prefix=/root/stream
  • make
  • make install

    test the server:

    [root@test test]# pwd
    /root/stream/stream-0.6.0/test

    [root@test test]# ./test.sh
    Test 1 ok
    Test 2 ok
    Test 3 ok
    Test 4 ok
    Test 5 ok
    Test 6 ok
    Test 7 ok
    Test 8 ok
    Test 9 ok
    Test 10 ok
    [root@test test]# ./cleanup.sh
    [root@test test]#

    Building the Client

    [root@test stream-vis-0.3.0]# pwd
    /root/stream/stream-vis-0.3.0

  • chmod +x geninit.sh
  • ./geninit.sh
  • ant

     

    [root@test stream-vis-0.3.0]# cd lib
    [root@test lib]# ls
    STREAMvis.jar
    [root@test lib]# ls -ltr
    total 196
    -rw-r--r--  1 root root 190680 Mar 29 09:49 STREAMvis.jar

     

    Step 4: Using the stream data system.

    vi ~/.bash_profile

    export ANT_HOME=/root/stream/apache-ant-1.7.1

    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/root/stream/lib/

    PATH=$PATH:$HOME/bin:$JAVA_HOME/bin:$ANT_HOME/bin:/root/stream/bin/

    export PATH
    unset USERNAME

    [root@test lib]# source ~/.bash_profile

    Download http://infolab.stanford.edu/stream/code/config and put it in the following directory:

    [root@test stream]# pwd
    /root/stream
    [root@test stream]# more config
    #
    # Config file for the STREAM server
    #
    # None of the config parameters are strictly necessary.  If the value for some
    # parameter is not specified, the system assumes reasonable default values
    #

    #
    # The size of the memory in bytes that is used during the execution of the system.
    #
    # 32 MB
    #

    MEMORY_SIZE = 33554432

    #
    # Queues have fixed sizes.  A smaller value of QUEUE_SIZE means that the operators
    # execute is a more tightly coupled manner.  This should be an integer value > 1.
    # The queue size is specified in number of pages.  A page is the atomic unit of memory
    # It is set to 4096 bytes.
    #
    #

    QUEUE_SIZE = 1

    #
    # Shared queue size in pages.  A shared queue is a queue which has one writer operator
    # and many read operators.  It is useful to set this value higher than QUEUE_SIZE.
    #

    SHARED_QUEUE_SIZE = 30

    #
    # This should be a fraction (between 0 & 1).  It is similar to the threshold value
    # used in a disk-based linear hash table.  A smaller value leads to cheaper index updates
    # but lookups could be costlier and vice-versa.

    INDEX_THRESHOLD = 0.85

    #
    # Long long int value that roughly translates to the duration for which the system is run.
    # The special values '0' indicates that the server should run forever.  For net_server
    # program, the value of this parameter should always be 0
    #

    RUN_TIME = 0

    #
    # The CPU clock speed (MHz) of the machine on which the server is run.  It is very important
    # to set this number to the correct value, since it is used to calibrate various internal
    # timers used in the system.
    #

    CPU_SPEED = 2000

    [root@test stream]#
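The config file shown above is a plain list of KEY = VALUE lines with # comments, and two of its values are easier to read after a little arithmetic. The sketch below is a hypothetical reader written for this post, not part of the STREAM distribution. It parses the same parameter values and converts MEMORY_SIZE to megabytes and SHARED_QUEUE_SIZE to bytes, using the 4096-byte page size stated in the file's comments.

```python
PAGE_BYTES = 4096  # page size stated in the config file's comments


def parse_stream_config(text: str) -> dict:
    """Parse `KEY = VALUE` lines, skipping blank lines and `#` comments."""
    params = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        value = value.strip()
        # Values in this file are either integers or a decimal fraction.
        params[key.strip()] = float(value) if "." in value else int(value)
    return params


# Parameter values copied from the config file listed above.
SAMPLE = """
# 32 MB
MEMORY_SIZE = 33554432
QUEUE_SIZE = 1
SHARED_QUEUE_SIZE = 30
INDEX_THRESHOLD = 0.85
RUN_TIME = 0
CPU_SPEED = 2000
"""

cfg = parse_stream_config(SAMPLE)
print(cfg["MEMORY_SIZE"] // (1024 * 1024), "MB of memory")
print(cfg["SHARED_QUEUE_SIZE"] * PAGE_BYTES, "bytes per shared queue")
```

If the server runs on a machine with a different clock speed, CPU_SPEED is the value the file's own comments single out as important to correct.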

  • mkdir logs
  • net_server -c config -l logs/log -p 9000

    [root@test bin]# ./net_server
    Usage: ./net_server -l [logFilePref] -c [configFile] -p [portNo]
    [root@test bin]# pwd
    /root/stream/bin

    [root@test stream-vis-0.3.0]# pwd
    /root/stream/stream-vis-0.3.0

  • chmod +x vis.sh
  • ./vis.sh

     

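    Simple sum:

    Simple join:

    While a query runs you can monitor CPU usage, and you can pause any of the data streams at any time.

    Click any node in the query plan diagram to see its details, which is very intuitive.

    This SQL query finds, from the two data streams R and S, the records that have the same name but where R's value is greater than S's value.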
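The join described above can be sketched in a few lines. The sketch follows only the prose description (same name, R's value greater than S's value) and makes several assumptions of its own: tuples carry a timestamp and arrive in timestamp order, each stream keeps only the tuples from the last `window` time units, and the function and field names are hypothetical. It is not the plan that STREAM itself executes.

```python
from collections import deque


def window_join(events, window):
    """Join streams R and S on equal names where R.value > S.value.

    events: iterable of (timestamp, stream, name, value), stream is "R" or "S",
    sorted by timestamp. Yields (name, r_value, s_value) for each match.
    """
    recent = {"R": deque(), "S": deque()}  # per-stream window contents
    for ts, stream, name, value in events:
        # Drop tuples older than the window so memory stays bounded.
        for buf in recent.values():
            while buf and buf[0][0] <= ts - window:
                buf.popleft()
        other = "S" if stream == "R" else "R"
        # Probe the other stream's window with the newly arrived tuple.
        for _, o_name, o_value in recent[other]:
            if o_name != name:
                continue
            r_val, s_val = (value, o_value) if stream == "R" else (o_value, value)
            if r_val > s_val:
                yield name, r_val, s_val
        recent[stream].append((ts, name, value))


if __name__ == "__main__":
    events = [
        (1, "R", "a", 10),
        (2, "S", "a", 5),
        (3, "S", "a", 20),
        (4, "R", "b", 7),
        (9, "S", "b", 1),
    ]
    print(list(window_join(events, window=10)))
    print(list(window_join(events, window=5)))
```

With the smaller window, the R tuple for "b" has already expired when the matching S tuple arrives, so that pair is not produced; this is the price of joining unbounded streams in limited memory.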

    Imagine how powerful this technique can be for real-time monitoring.

    If you're interested, see

    http://infolab.stanford.edu/stream/

    to study it in more depth.