Installing LZO Compression for Hadoop

Building and installing lzo and lzop

lzo and lzop must be built and installed on every host in the cluster!!!

  1. Download the lzo source (the latest version is fine)

    wget http://www.oberhumer.com/opensource/lzo/download/lzo-2.10.tar.gz
  2. Build and install (make sure gcc and g++ are present on the host)

    tar -xvzf lzo-2.10.tar.gz
    cd lzo-2.10
    ./configure --enable-shared
    make -j 10
    make install
    cp /usr/local/lib/*lzo* /usr/lib

    After the install completes, the libraries must be copied into /usr/lib (the cp above). Skipping this step causes: lzop: error while loading shared libraries: liblzo2.so.2: cannot open shared object file: No such file or directory

  3. Download and build lzop (again, the latest version is fine)

    wget http://www.lzop.org/download/lzop-1.04.tar.gz
    tar -xvzf lzop-1.04.tar.gz 
    cd lzop-1.04
    ./configure
    make -j 10
    make install
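
As an alternative to copying the libraries into /usr/lib (the cp in step 2), the same loader error can be fixed by registering /usr/local/lib with the dynamic linker. A minimal sketch, assuming root privileges; the file name lzo.conf is arbitrary:

```shell
# Alternative to "cp /usr/local/lib/*lzo* /usr/lib":
# tell ld.so where the lzo libraries live, then rebuild its cache.
echo "/usr/local/lib" > /etc/ld.so.conf.d/lzo.conf   # lzo.conf is an arbitrary name
ldconfig                                             # rebuild the shared-library cache
ldconfig -p | grep liblzo2                           # should now list liblzo2.so.2
```

Either approach works; the ld.so.conf route avoids scattering files into /usr/lib.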

Building and installing hadoop-lzo-master

hadoop-lzo must be built in a Linux environment; it does not compile on Windows.

wget https://github.com/twitter/hadoop-lzo/archive/master.zip
  1. Unpack

    unzip master.zip 
    cd hadoop-lzo-master/
  2. Edit pom.xml and change the Hadoop version number to match the Hadoop version in your cluster

    <hadoop.current.version>2.6.0-cdh5.16.2</hadoop.current.version>

    Add the Cloudera repository to the pom file:

    <repository>
      <id>cloudera</id>
      <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
      <!-- allow release versions, disallow snapshots -->
      <releases>
        <enabled>true</enabled>
      </releases>
      <snapshots>
        <enabled>false</enabled>
      </snapshots>
    </repository>
  3. Check whether Maven is available on the host; if not, install it as follows (skip this step if Maven is already installed):

    wget http://mirrors.hust.edu.cn/apache/maven/maven-3/3.5.4/binaries/apache-maven-3.5.4-bin.tar.gz

    tar -zxvf apache-maven-3.5.4-bin.tar.gz

    vim /etc/profile

    Add the environment variables:
    MAVEN_HOME=/usr/local/apache-maven-3.5.4
    export MAVEN_HOME
    export PATH=${PATH}:${MAVEN_HOME}/bin

    Save and exit profile, then reload it:

    source /etc/profile
  4. Configure the Aliyun and Cloudera repositories in Maven's settings.xml (the mirrorOf pattern excludes cloudera, so those requests go to the Cloudera repository added above)

    <mirror>
      <id>nexus-aliyun</id>
      <mirrorOf>*,!cloudera</mirrorOf>
      <name>Nexus aliyun</name>
      <url>http://maven.aliyun.com/nexus/content/groups/public</url>
    </mirror>
  5. Export the path information the hadoop-lzo build needs (pointing at your lzo include and lib directories)

    export CFLAGS=-m64
    export CXXFLAGS=-m64
    export C_INCLUDE_PATH=/home/hadoop/app/hadoop/lzo/include
    export LIBRARY_PATH=/home/hadoop/app/hadoop/lzo/lib
  6. Build with Maven

    mvn clean package -Dmaven.test.skip=true

    Continue once the build finishes without errors. PS: if mvn fails here, fix the problem before continuing, and pay attention to permission issues.

  7. A successful build produces a target directory

    cd target/native/Linux-amd64-64/
    mkdir ~/hadoop-lzo-files
    tar -cBf - -C lib . | tar -xBvf - -C ~/hadoop-lzo-files
  8. Several files are produced under ~/hadoop-lzo-files; copy them into Hadoop's native directory

    cp ~/hadoop-lzo-files/libgplcompression*  $HADOOP_HOME/lib/native/

    Note!!! The files copied in this step must also be synced to the corresponding Hadoop directory on every other host in the cluster.

  9. Copy the hadoop-lzo jar into the Hadoop directory

    cp target/hadoop-lzo-0.4.21-SNAPSHOT.jar $HADOOP_HOME/share/hadoop/common/

    Note!!! The jar copied in this step must also be synced to the corresponding Hadoop directory on every other host in the cluster.
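
The tar pipeline in step 7 is a general idiom for copying the contents of one directory into another while preserving permissions. A self-contained demonstration on throwaway directories (all names below are illustrative stand-ins, not the real build artifacts):

```shell
# Demonstrate the "tar -c | tar -x" copy idiom from step 7.
src=$(mktemp -d)
dst=$(mktemp -d)
echo "native lib stand-in" > "$src/libgplcompression.so"

# -C changes directory before archiving/extracting; -B reblocks the pipe stream.
tar -cBf - -C "$src" . | tar -xBf - -C "$dst"

ls "$dst"    # libgplcompression.so
```

For the "sync to other hosts" notes in steps 8 and 9, running scp or rsync against each host's $HADOOP_HOME accomplishes the same thing.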

Configuring Hadoop

  1. Edit core-site.xml (skip if this is already configured)

    <property>
      <name>io.compression.codecs</name>
      <value>
        org.apache.hadoop.io.compress.GzipCodec,
        org.apache.hadoop.io.compress.DefaultCodec,
        org.apache.hadoop.io.compress.BZip2Codec,
        com.hadoop.compression.lzo.LzoCodec,
        com.hadoop.compression.lzo.LzopCodec,
        org.apache.hadoop.io.compress.Lz4Codec,
        org.apache.hadoop.io.compress.SnappyCodec
      </value>
    </property>
  2. Configure map-output compression in mapred-site.xml

    <!-- enable map-output compression -->
    <property>
      <name>mapred.compress.map.output</name>
      <value>true</value>
    </property>
    <!-- set the compression codec -->
    <property>
      <name>mapred.map.output.compression.codec</name>
      <value>com.hadoop.compression.lzo.LzoCodec</value>
    </property>
  3. Run a wordcount job and check that its output files are compressed

  4. Create the index (so that MapReduce can split the .lzo file)

    hadoop jar $HADOOP_HOME/share/hadoop/common/hadoop-lzo-0.4.21-SNAPSHOT.jar com.hadoop.compression.lzo.DistributedLzoIndexer /data/lzo-index/large_wc.txt.lzo
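
The two property names in mapred-site.xml above are the old Hadoop 1.x names; they still work in Hadoop 2.x through the deprecation mapping, but the current equivalents (a drop-in alternative with the same semantics) are:

```xml
<!-- Hadoop 2.x names for the two map-output compression settings -->
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
```
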
Author: Tunan
Link: http://yerias.github.io/2018/10/15/hadoop/13/
Copyright Notice: All articles in this blog are licensed under CC BY-NC-SA 4.0 unless otherwise stated.