Order By&Sort By&Distribute By&Cluster By

目标

  1. 全局排序(Order By)
  2. Reduce内部排序(Sort By)
  3. 分区排序(Distribute By)
  4. Cluster By

准备工作

  1. 准备测试数据

    7369	SMITH	CLERK	7902	1980-12-17	800.00		20
    7499 ALLEN SALESMAN 7698 1981-2-20 1600.00 300.00 30
    7521 WARD SALESMAN 7698 1981-2-22 1250.00 500.00 30
    7566 JONES MANAGER 7839 1981-4-2 2975.00 20
    7654 MARTIN SALESMAN 7698 1981-9-28 1250.00 1400.00 30
    7698 BLAKE MANAGER 7839 1981-5-1 2850.00 30
    7782 CLARK MANAGER 7839 1981-6-9 2450.00 10
    7788 SCOTT ANALYST 7566 1987-4-19 3000.00 20
    7839 KING PRESIDENT 1981-11-17 5000.00 10
    7844 TURNER SALESMAN 7698 1981-9-8 1500.00 0.00 30
    7876 ADAMS CLERK 7788 1987-5-23 1100.00 20
    7900 JAMES CLERK 7698 1981-12-3 950.00 30
    7902 FORD ANALYST 7566 1981-12-3 3000.00 20
    7934 MILLER CLERK 7782 1982-1-23 1300.00 10
  2. 创建emp

    create external table emp(
    empno int,
    ename string,
    job string,
    mgr int,
    hiredate string,
    sal double,
    comm double,
    deptno int
    )
    row format delimited
    fields terminated by "\t";
  3. 加载数据

    load data local inpath '/home/hadoop/data/emp.txt' into table emp;
  4. 验证数据

    select * from emp;

    empno|ename |job      |mgr |hiredate  |sal |comm|deptno|
    -----|------|---------|----|----------|----|----|------|
    7369|SMITH |CLERK |7902|1980-12-17| 800| | 20|
    7499|ALLEN |SALESMAN |7698|1981-2-20 |1600| 300| 30|
    7521|WARD |SALESMAN |7698|1981-2-22 |1250| 500| 30|
    7566|JONES |MANAGER |7839|1981-4-2 |2975| | 20|
    7654|MARTIN|SALESMAN |7698|1981-9-28 |1250|1400| 30|
    7698|BLAKE |MANAGER |7839|1981-5-1 |2850| | 30|
    7782|CLARK |MANAGER |7839|1981-6-9 |2450| | 10|
    7788|SCOTT |ANALYST |7566|1987-4-19 |3000| | 20|
    7839|KING |PRESIDENT| |1981-11-17|5000| | 10|
    7844|TURNER|SALESMAN |7698|1981-9-8 |1500| 0| 30|
    7876|ADAMS |CLERK |7788|1987-5-23 |1100| | 20|
    7900|JAMES |CLERK |7698|1981-12-3 | 950| | 30|
    7902|FORD |ANALYST |7566|1981-12-3 |3000| | 20|
    7934|MILLER|CLERK |7782|1982-1-23 |1300| | 10|

全局排序(Order By)[慎用]

语句格式: order by col1,col2 asc/desc

使用order by语句排序的是全局排序,只能有一个reduce作用来完成,与此对应的是sort by由多个reduce来完成

案例实操

  1. 查询员工信息按工资升序排列

    select * from emp order by sal;

    empno|ename |job      |mgr |hiredate  |sal |comm|deptno|
    -----|------|---------|----|----------|----|----|------|
    7369|SMITH |CLERK |7902|1980-12-17| 800| | 20|
    7900|JAMES |CLERK |7698|1981-12-3 | 950| | 30|
    7876|ADAMS |CLERK |7788|1987-5-23 |1100| | 20|
    7521|WARD |SALESMAN |7698|1981-2-22 |1250| 500| 30|
    7654|MARTIN|SALESMAN |7698|1981-9-28 |1250|1400| 30|
    7934|MILLER|CLERK |7782|1982-1-23 |1300| | 10|
    7844|TURNER|SALESMAN |7698|1981-9-8 |1500| 0| 30|
    7499|ALLEN |SALESMAN |7698|1981-2-20 |1600| 300| 30|
    7782|CLARK |MANAGER |7839|1981-6-9 |2450| | 10|
    7698|BLAKE |MANAGER |7839|1981-5-1 |2850| | 30|
    7566|JONES |MANAGER |7839|1981-4-2 |2975| | 20|
    7788|SCOTT |ANALYST |7566|1987-4-19 |3000| | 20|
    7902|FORD |ANALYST |7566|1981-12-3 |3000| | 20|
    7839|KING |PRESIDENT| |1981-11-17|5000| | 10|
  2. 查询员工信息按工资降序排列(增加一个替换null值的功能)

    select *,nvl(comm,'-1') from emp order by sal desc;

    empno|ename |job      |mgr |hiredate  |sal |comm|deptno|_c1   |
    -----|------|---------|----|----------|----|----|------|------|
    7839|KING |PRESIDENT| |1981-11-17|5000| | 10|-1 |
    7902|FORD |ANALYST |7566|1981-12-3 |3000| | 20|-1 |
    7788|SCOTT |ANALYST |7566|1987-4-19 |3000| | 20|-1 |
    7566|JONES |MANAGER |7839|1981-4-2 |2975| | 20|-1 |
    7698|BLAKE |MANAGER |7839|1981-5-1 |2850| | 30|-1 |
    7782|CLARK |MANAGER |7839|1981-6-9 |2450| | 10|-1 |
    7499|ALLEN |SALESMAN |7698|1981-2-20 |1600| 300| 30|300.0 |
    7844|TURNER|SALESMAN |7698|1981-9-8 |1500| 0| 30|0.0 |
    7934|MILLER|CLERK |7782|1982-1-23 |1300| | 10|-1 |
    7654|MARTIN|SALESMAN |7698|1981-9-28 |1250|1400| 30|1400.0|
    7521|WARD |SALESMAN |7698|1981-2-22 |1250| 500| 30|500.0 |
    7876|ADAMS |CLERK |7788|1987-5-23 |1100| | 20|-1 |
    7900|JAMES |CLERK |7698|1981-12-3 | 950| | 30|-1 |
    7369|SMITH |CLERK |7902|1980-12-17| 800| | 20|-1 |
  3. 按照员工薪水的2倍排序

    select *,sal+nvl(comm,0) as salandcomm from emp order by salandcomm;

    empno|ename |job      |mgr |hiredate  |sal |comm|deptno|salandcomm|
    -----|------|---------|----|----------|----|----|------|----------|
    7369|SMITH |CLERK |7902|1980-12-17| 800| | 20| 800|
    7900|JAMES |CLERK |7698|1981-12-3 | 950| | 30| 950|
    7876|ADAMS |CLERK |7788|1987-5-23 |1100| | 20| 1100|
    7934|MILLER|CLERK |7782|1982-1-23 |1300| | 10| 1300|
    7844|TURNER|SALESMAN |7698|1981-9-8 |1500| 0| 30| 1500|
    7521|WARD |SALESMAN |7698|1981-2-22 |1250| 500| 30| 1750|
    7499|ALLEN |SALESMAN |7698|1981-2-20 |1600| 300| 30| 1900|
    7782|CLARK |MANAGER |7839|1981-6-9 |2450| | 10| 2450|
    7654|MARTIN|SALESMAN |7698|1981-9-28 |1250|1400| 30| 2650|
    7698|BLAKE |MANAGER |7839|1981-5-1 |2850| | 30| 2850|
    7566|JONES |MANAGER |7839|1981-4-2 |2975| | 20| 2975|
    7788|SCOTT |ANALYST |7566|1987-4-19 |3000| | 20| 3000|
    7902|FORD |ANALYST |7566|1981-12-3 |3000| | 20| 3000|
    7839|KING |PRESIDENT| |1981-11-17|5000| | 10| 5000|
  4. 按照部门和工资升序排序

    select * from emp order by deptno,sal+nvl(comm,0);

    empno|ename |job      |mgr |hiredate  |sal |comm|deptno|
    -----|------|---------|----|----------|----|----|------|
    7934|MILLER|CLERK |7782|1982-1-23 |1300| | 10|
    7782|CLARK |MANAGER |7839|1981-6-9 |2450| | 10|
    7839|KING |PRESIDENT| |1981-11-17|5000| | 10|
    7369|SMITH |CLERK |7902|1980-12-17| 800| | 20|
    7876|ADAMS |CLERK |7788|1987-5-23 |1100| | 20|
    7566|JONES |MANAGER |7839|1981-4-2 |2975| | 20|
    7788|SCOTT |ANALYST |7566|1987-4-19 |3000| | 20|
    7902|FORD |ANALYST |7566|1981-12-3 |3000| | 20|
    7900|JAMES |CLERK |7698|1981-12-3 | 950| | 30|
    7844|TURNER|SALESMAN |7698|1981-9-8 |1500| 0| 30|
    7521|WARD |SALESMAN |7698|1981-2-22 |1250| 500| 30|
    7499|ALLEN |SALESMAN |7698|1981-2-20 |1600| 300| 30|
    7654|MARTIN|SALESMAN |7698|1981-9-28 |1250|1400| 30|
    7698|BLAKE |MANAGER |7839|1981-5-1 |2850| | 30|

Reduce内部排序(Sort By)

对于大规模的数据集order by的效率非常低。在很多情况下,并不需要全局排序,此时可以使用sort by

Sort by为每个reducer产生一个排序文件。每个Reducer内部进行排序,对全局结果集来说不是排序。

  1. 设置reduce个数,不然一个一个reduce没有效果

    set mapreduce.job.reduces=3;
  2. 根据部门编号降序查看员工信息

    select * from emp order by deptno desc;

    empno|ename |job      |mgr |hiredate  |sal |comm|deptno|
    -----|------|---------|----|----------|----|----|------|
    7521|WARD |SALESMAN |7698|1981-2-22 |1250| 500| 30|
    7499|ALLEN |SALESMAN |7698|1981-2-20 |1600| 300| 30|
    7900|JAMES |CLERK |7698|1981-12-3 | 950| | 30|
    7844|TURNER|SALESMAN |7698|1981-9-8 |1500| 0| 30|
    7698|BLAKE |MANAGER |7839|1981-5-1 |2850| | 30|
    7654|MARTIN|SALESMAN |7698|1981-9-28 |1250|1400| 30|
    7876|ADAMS |CLERK |7788|1987-5-23 |1100| | 20|
    7902|FORD |ANALYST |7566|1981-12-3 |3000| | 20|
    7788|SCOTT |ANALYST |7566|1987-4-19 |3000| | 20|
    7369|SMITH |CLERK |7902|1980-12-17| 800| | 20|
    7566|JONES |MANAGER |7839|1981-4-2 |2975| | 20|
    7934|MILLER|CLERK |7782|1982-1-23 |1300| | 10|
    7839|KING |PRESIDENT| |1981-11-17|5000| | 10|
    7782|CLARK |MANAGER |7839|1981-6-9 |2450| | 10|
  3. 将查询结果导入到文件中(按照部门编号降序排序)[指定了格式]

    insert overwrite local directory '/home/hadoop/emp' row format delimited fields terminated by "\t" select * from emp order by deptno desc;

    打开/home/hadoop/emp下的文件

    [hadoop@aliyun emp.txt]$ cat 000000_0 
    7521 WARD SALESMAN 7698 1981-2-22 1250.0 500.0 30
    7499 ALLEN SALESMAN 7698 1981-2-20 1600.0 300.0 30
    7900 JAMES CLERK 7698 1981-12-3 950.0 \N 30
    7844 TURNER SALESMAN 7698 1981-9-8 1500.0 0.0 30
    7698 BLAKE MANAGER 7839 1981-5-1 2850.0 \N 30
    7654 MARTIN SALESMAN 7698 1981-9-28 1250.0 1400.0 30
    7876 ADAMS CLERK 7788 1987-5-23 1100.0 \N 20
    7902 FORD ANALYST 7566 1981-12-3 3000.0 \N 20
    7788 SCOTT ANALYST 7566 1987-4-19 3000.0 \N 20
    7369 SMITH CLERK 7902 1980-12-17 800.0 \N 20
    7566 JONES MANAGER 7839 1981-4-2 2975.0 \N 20
    7934 MILLER CLERK 7782 1982-1-23 1300.0 \N 10
    7839 KING PRESIDENT \N 1981-11-17 5000.0 \N 10
    7782 CLARK MANAGER 7839 1981-6-9 2450.0 \N 10

发送分区排序(Distribute By)

在有些情况下,我们需要控制某个特定行应该到哪个reducer,通常是为了进行后续的聚集操作。distribute by 子句可以做这件事。distribute by类似MR中partition(自定义分区),进行分区,结合sort by使用。

对于distribute by进行测试,一定要分配多reduce进行处理,否则无法看到distribute by的效果。

  1. 先按照部门编号分区,再按照员工编号降序排序。

    select * from emp distribute by deptno sort by empno desc;

    empno|ename |job      |mgr |hiredate  |sal |comm|deptno|
    -----|------|---------|----|----------|----|----|------|
    7900|JAMES |CLERK |7698|1981-12-3 | 950| | 30|
    7844|TURNER|SALESMAN |7698|1981-9-8 |1500| 0| 30|
    7698|BLAKE |MANAGER |7839|1981-5-1 |2850| | 30|
    7654|MARTIN|SALESMAN |7698|1981-9-28 |1250|1400| 30|
    7521|WARD |SALESMAN |7698|1981-2-22 |1250| 500| 30|
    7499|ALLEN |SALESMAN |7698|1981-2-20 |1600| 300| 30|
    7934|MILLER|CLERK |7782|1982-1-23 |1300| | 10|
    7839|KING |PRESIDENT| |1981-11-17|5000| | 10|
    7782|CLARK |MANAGER |7839|1981-6-9 |2450| | 10|
    7902|FORD |ANALYST |7566|1981-12-3 |3000| | 20|
    7876|ADAMS |CLERK |7788|1987-5-23 |1100| | 20|
    7788|SCOTT |ANALYST |7566|1987-4-19 |3000| | 20|
    7566|JONES |MANAGER |7839|1981-4-2 |2975| | 20|
    7369|SMITH |CLERK |7902|1980-12-17| 800| | 20|

    注意:

    1. distribute by的分区规则是根据分区字段的hash码与reduce的个数进行模除后,余数相同的分到一个区。
    2. Hive要求DISTRIBUTE BY语句要写在SORT BY语句之前。

Cluster By

distribute bysorts by字段相同时,可以使用cluster by方式。但是排序只能是升序排序,不能指定排序规则为ASC或者DESC。

当分区字段和排序字段都是部门编号的时候我们可以这么做

select * from emp cluster by deptno;

empno|ename |job      |mgr |hiredate  |sal |comm|deptno|
-----|------|---------|----|----------|----|----|------|
7654|MARTIN|SALESMAN |7698|1981-9-28 |1250|1400| 30|
7900|JAMES |CLERK |7698|1981-12-3 | 950| | 30|
7698|BLAKE |MANAGER |7839|1981-5-1 |2850| | 30|
7521|WARD |SALESMAN |7698|1981-2-22 |1250| 500| 30|
7844|TURNER|SALESMAN |7698|1981-9-8 |1500| 0| 30|
7499|ALLEN |SALESMAN |7698|1981-2-20 |1600| 300| 30|
7934|MILLER|CLERK |7782|1982-1-23 |1300| | 10|
7839|KING |PRESIDENT| |1981-11-17|5000| | 10|
7782|CLARK |MANAGER |7839|1981-6-9 |2450| | 10|
7788|SCOTT |ANALYST |7566|1987-4-19 |3000| | 20|
7566|JONES |MANAGER |7839|1981-4-2 |2975| | 20|
7876|ADAMS |CLERK |7788|1987-5-23 |1100| | 20|
7902|FORD |ANALYST |7566|1981-12-3 |3000| | 20|
7369|SMITH |CLERK |7902|1980-12-17| 800| | 20|

注意: 注意:按照部门编号分区,不一定就是固定死的数值,可以是20号和30号部门分到一个分区里面去。

Author: Tunan
Link: http://yerias.github.io/2018/11/06/hive/6/
Copyright Notice: All articles in this blog are licensed under CC BY-NC-SA 4.0 unless stating additionally.