Nutch-Related Frameworks Video Tutorial, 杨尚川 (Yang Shangchuan), 281032878@qq.com
Lecture 18
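1. Preparing the data for the compression tests
Download the URL library from dmoz:
wget http://rdf.dmoz.org/rdf/content.rdf.u8.gz
gunzip content.rdf.u8.gz
Get Nutch 1.6:
svn co https://svn.apache.org/repos/asf/nutch/tags/release-1.6/
Create the site configuration from the template and edit it:
cp release-1.6/conf/nutch-site.xml.template release-1.6/conf/nutch-site.xml
vi release-1.6/conf/nutch-site.xml
Add:
Use DmozParser to parse the dmoz URL library into plain text (the command runs in the background):
release-1.6/runtime/local/bin/nutch org.apache.nutch.tools.DmozParser content.rdf.u8 > urls &
Once the parser has finished, put the URL text file on HDFS:
hadoop fs -put urls urls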
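2. Injecting URLs with different compression methods
Change to the Nutch home directory:
cd release-1.6
Inject the URLs without compression:
runtime/deploy/bin/nutch inject data_no_compress/crawldb urls
Inject the URLs with the default compression:
vi conf/nutch-site.xml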
runtime/deploy/bin/nutch inject data_default_compress/crawldb urls
Inject the URLs with Gzip compression:
vi conf/nutch-site.xml
runtime/deploy/bin/nutch inject data_gzip_compress/crawldb urls
Inject the URLs with BZip2 compression:
vi conf/nutch-site.xml
runtime/deploy/bin/nutch inject data_bzip2_compress/crawldb urls
Inject the URLs with Snappy compression:
vi conf/nutch-site.xml
runtime/deploy/bin/nutch inject data_snappy_compress/crawldb urls
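The settings edited in conf/nutch-site.xml before each inject run are not listed in this text. As a rough sketch only, assuming the standard Hadoop 1.x job output compression properties (mapred.output.compress, mapred.output.compression.type, mapred.output.compression.codec) are what gets changed, the fragment for a given codec could be generated as follows. The helper name compress_props is hypothetical, and the choice of BLOCK as the compression type is an assumption.

```shell
#!/bin/sh
# Hypothetical helper: print a nutch-site.xml fragment that enables
# compressed job output for one codec.
# Assumption: the lecture changes the standard Hadoop 1.x output
# compression properties; the exact settings are not shown in the notes.
# $1: fully qualified codec class name
compress_props() {
  codec="$1"
  cat <<EOF
<property>
  <name>mapred.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapred.output.compression.type</name>
  <value>BLOCK</value>
</property>
<property>
  <name>mapred.output.compression.codec</name>
  <value>${codec}</value>
</property>
EOF
}

# Example: fragment for the Gzip run.
compress_props org.apache.hadoop.io.compress.GzipCodec
```

The other runs would swap in org.apache.hadoop.io.compress.DefaultCodec, BZip2Codec or SnappyCodec from the same package; the printed properties belong inside the <configuration> element of the file.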
Effect of the compression type
Effect of the block size
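One simple way to see these effects is to compare how much space each injected crawldb takes on HDFS. The sketch below assumes the five output directories used by the inject commands in section 2 and a working hadoop client on the PATH; the function names crawldb_dirs and report_sizes are hypothetical.

```shell
#!/bin/sh
# List the five crawldb directories written by the inject runs.
crawldb_dirs() {
  for name in no default gzip bzip2 snappy; do
    echo "data_${name}_compress/crawldb"
  done
}

# Print the total size of each crawldb.
# "hadoop fs -dus" is the Hadoop 1.x summary form of "du".
report_sizes() {
  for dir in $(crawldb_dirs); do
    hadoop fs -dus "$dir"
  done
}
```

Calling report_sizes on a cluster node prints one line per directory, so the sizes can be compared side by side.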
3. Configuring Snappy compression for Hadoop
Download and extract:
wget https://snappy.googlecode.com/files/snappy-1.1.0.tar.gz
tar -xzvf snappy-1.1.0.tar.gz
cd snappy-1.1.0
Build:
./configure
make
make install
Copy the library files to the other cluster nodes:
scp /usr/local/lib/libsnappy* host2:/home/hadoop/hadoop-1.1.2/lib/native/Linux-amd64-64/
scp /usr/local/lib/libsnappy* host6:/home/hadoop/hadoop-1.1.2/lib/native/Linux-amd64-64/
scp /usr/local/lib/libsnappy* host8:/home/hadoop/hadoop-1.1.2/lib/native/Linux-amd64-64/
On every machine in the cluster, update the environment variables:
vi /home/hadoop/.bashrc
Append:
export LD_LIBRARY_PATH=/home/hadoop/hadoop-1.1.2/lib/native/Linux-amd64-64
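Before running the Snappy inject, it can help to confirm on each node that the copied library actually sits in the native directory. The following is a minimal sketch; the function name has_snappy is hypothetical, and it assumes LD_LIBRARY_PATH holds a single directory, as in the export line above.

```shell
#!/bin/sh
# Hypothetical check: succeed if the directory holds at least one
# libsnappy shared object.
# $1: native library directory (defaults to LD_LIBRARY_PATH, assumed
#     to contain a single directory rather than a colon-separated list)
has_snappy() {
  dir="${1:-$LD_LIBRARY_PATH}"
  for f in "$dir"/libsnappy.so*; do
    # If the glob matched nothing, $f is the literal pattern and -e fails.
    [ -e "$f" ] && return 0
  done
  return 1
}
```

On a node, has_snappy /home/hadoop/hadoop-1.1.2/lib/native/Linux-amd64-64 returns success only when the scp step has delivered the files.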

