目标
- 全局排序(Order By)
- Reduce内部排序(Sort By)
- 分区排序(Distribute By)
- Cluster By
准备工作
准备测试数据
7369 SMITH CLERK 7902 1980-12-17 800.00 20
7499 ALLEN SALESMAN 7698 1981-2-20 1600.00 300.00 30
7521 WARD SALESMAN 7698 1981-2-22 1250.00 500.00 30
7566 JONES MANAGER 7839 1981-4-2 2975.00 20
7654 MARTIN SALESMAN 7698 1981-9-28 1250.00 1400.00 30
7698 BLAKE MANAGER 7839 1981-5-1 2850.00 30
7782 CLARK MANAGER 7839 1981-6-9 2450.00 10
7788 SCOTT ANALYST 7566 1987-4-19 3000.00 20
7839 KING PRESIDENT 1981-11-17 5000.00 10
7844 TURNER SALESMAN 7698 1981-9-8 1500.00 0.00 30
7876 ADAMS CLERK 7788 1987-5-23 1100.00 20
7900 JAMES CLERK 7698 1981-12-3 950.00 30
7902 FORD ANALYST 7566 1981-12-3 3000.00 20
7934 MILLER CLERK 7782 1982-1-23 1300.00 10创建emp
create external table emp(
empno int,
ename string,
job string,
mgr int,
hiredate string,
sal double,
comm double,
deptno int
)
row format delimited
fields terminated by "\t";加载数据
load data local inpath '/home/hadoop/data/emp.txt' into table emp;
验证数据
select * from emp;
empno|ename |job |mgr |hiredate |sal |comm|deptno|
-----|------|---------|----|----------|----|----|------|
7369|SMITH |CLERK |7902|1980-12-17| 800| | 20|
7499|ALLEN |SALESMAN |7698|1981-2-20 |1600| 300| 30|
7521|WARD |SALESMAN |7698|1981-2-22 |1250| 500| 30|
7566|JONES |MANAGER |7839|1981-4-2 |2975| | 20|
7654|MARTIN|SALESMAN |7698|1981-9-28 |1250|1400| 30|
7698|BLAKE |MANAGER |7839|1981-5-1 |2850| | 30|
7782|CLARK |MANAGER |7839|1981-6-9 |2450| | 10|
7788|SCOTT |ANALYST |7566|1987-4-19 |3000| | 20|
7839|KING |PRESIDENT| |1981-11-17|5000| | 10|
7844|TURNER|SALESMAN |7698|1981-9-8 |1500| 0| 30|
7876|ADAMS |CLERK |7788|1987-5-23 |1100| | 20|
7900|JAMES |CLERK |7698|1981-12-3 | 950| | 30|
7902|FORD |ANALYST |7566|1981-12-3 |3000| | 20|
7934|MILLER|CLERK |7782|1982-1-23 |1300| | 10|
全局排序(Order By)[慎用]
语句格式: order by col1,col2 asc/desc
使用order by语句排序的是全局排序,只能有一个reduce作用来完成,与此对应的是sort by由多个reduce来完成
案例实操
查询员工信息按工资升序排列
select * from emp order by sal;
empno|ename |job |mgr |hiredate |sal |comm|deptno|
-----|------|---------|----|----------|----|----|------|
7369|SMITH |CLERK |7902|1980-12-17| 800| | 20|
7900|JAMES |CLERK |7698|1981-12-3 | 950| | 30|
7876|ADAMS |CLERK |7788|1987-5-23 |1100| | 20|
7521|WARD |SALESMAN |7698|1981-2-22 |1250| 500| 30|
7654|MARTIN|SALESMAN |7698|1981-9-28 |1250|1400| 30|
7934|MILLER|CLERK |7782|1982-1-23 |1300| | 10|
7844|TURNER|SALESMAN |7698|1981-9-8 |1500| 0| 30|
7499|ALLEN |SALESMAN |7698|1981-2-20 |1600| 300| 30|
7782|CLARK |MANAGER |7839|1981-6-9 |2450| | 10|
7698|BLAKE |MANAGER |7839|1981-5-1 |2850| | 30|
7566|JONES |MANAGER |7839|1981-4-2 |2975| | 20|
7788|SCOTT |ANALYST |7566|1987-4-19 |3000| | 20|
7902|FORD |ANALYST |7566|1981-12-3 |3000| | 20|
7839|KING |PRESIDENT| |1981-11-17|5000| | 10|查询员工信息按工资降序排列(增加一个替换null值的功能)
select *,nvl(comm,'-1') from emp order by sal desc;
empno|ename |job |mgr |hiredate |sal |comm|deptno|_c1 |
-----|------|---------|----|----------|----|----|------|------|
7839|KING |PRESIDENT| |1981-11-17|5000| | 10|-1 |
7902|FORD |ANALYST |7566|1981-12-3 |3000| | 20|-1 |
7788|SCOTT |ANALYST |7566|1987-4-19 |3000| | 20|-1 |
7566|JONES |MANAGER |7839|1981-4-2 |2975| | 20|-1 |
7698|BLAKE |MANAGER |7839|1981-5-1 |2850| | 30|-1 |
7782|CLARK |MANAGER |7839|1981-6-9 |2450| | 10|-1 |
7499|ALLEN |SALESMAN |7698|1981-2-20 |1600| 300| 30|300.0 |
7844|TURNER|SALESMAN |7698|1981-9-8 |1500| 0| 30|0.0 |
7934|MILLER|CLERK |7782|1982-1-23 |1300| | 10|-1 |
7654|MARTIN|SALESMAN |7698|1981-9-28 |1250|1400| 30|1400.0|
7521|WARD |SALESMAN |7698|1981-2-22 |1250| 500| 30|500.0 |
7876|ADAMS |CLERK |7788|1987-5-23 |1100| | 20|-1 |
7900|JAMES |CLERK |7698|1981-12-3 | 950| | 30|-1 |
7369|SMITH |CLERK |7902|1980-12-17| 800| | 20|-1 |按照员工薪水的2倍排序
select *,sal+nvl(comm,0) as salandcomm from emp order by salandcomm;
empno|ename |job |mgr |hiredate |sal |comm|deptno|salandcomm|
-----|------|---------|----|----------|----|----|------|----------|
7369|SMITH |CLERK |7902|1980-12-17| 800| | 20| 800|
7900|JAMES |CLERK |7698|1981-12-3 | 950| | 30| 950|
7876|ADAMS |CLERK |7788|1987-5-23 |1100| | 20| 1100|
7934|MILLER|CLERK |7782|1982-1-23 |1300| | 10| 1300|
7844|TURNER|SALESMAN |7698|1981-9-8 |1500| 0| 30| 1500|
7521|WARD |SALESMAN |7698|1981-2-22 |1250| 500| 30| 1750|
7499|ALLEN |SALESMAN |7698|1981-2-20 |1600| 300| 30| 1900|
7782|CLARK |MANAGER |7839|1981-6-9 |2450| | 10| 2450|
7654|MARTIN|SALESMAN |7698|1981-9-28 |1250|1400| 30| 2650|
7698|BLAKE |MANAGER |7839|1981-5-1 |2850| | 30| 2850|
7566|JONES |MANAGER |7839|1981-4-2 |2975| | 20| 2975|
7788|SCOTT |ANALYST |7566|1987-4-19 |3000| | 20| 3000|
7902|FORD |ANALYST |7566|1981-12-3 |3000| | 20| 3000|
7839|KING |PRESIDENT| |1981-11-17|5000| | 10| 5000|按照部门和工资升序排序
select * from emp order by deptno,sal+nvl(comm,0);
empno|ename |job |mgr |hiredate |sal |comm|deptno|
-----|------|---------|----|----------|----|----|------|
7934|MILLER|CLERK |7782|1982-1-23 |1300| | 10|
7782|CLARK |MANAGER |7839|1981-6-9 |2450| | 10|
7839|KING |PRESIDENT| |1981-11-17|5000| | 10|
7369|SMITH |CLERK |7902|1980-12-17| 800| | 20|
7876|ADAMS |CLERK |7788|1987-5-23 |1100| | 20|
7566|JONES |MANAGER |7839|1981-4-2 |2975| | 20|
7788|SCOTT |ANALYST |7566|1987-4-19 |3000| | 20|
7902|FORD |ANALYST |7566|1981-12-3 |3000| | 20|
7900|JAMES |CLERK |7698|1981-12-3 | 950| | 30|
7844|TURNER|SALESMAN |7698|1981-9-8 |1500| 0| 30|
7521|WARD |SALESMAN |7698|1981-2-22 |1250| 500| 30|
7499|ALLEN |SALESMAN |7698|1981-2-20 |1600| 300| 30|
7654|MARTIN|SALESMAN |7698|1981-9-28 |1250|1400| 30|
7698|BLAKE |MANAGER |7839|1981-5-1 |2850| | 30|
Reduce内部排序(Sort By)
对于大规模的数据集order by的效率非常低。在很多情况下,并不需要全局排序,此时可以使用sort by。
Sort by为每个reducer产生一个排序文件。每个Reducer内部进行排序,对全局结果集来说不是排序。
设置reduce个数,不然一个一个reduce没有效果
set mapreduce.job.reduces=3;
根据部门编号降序查看员工信息
select * from emp order by deptno desc;
empno|ename |job |mgr |hiredate |sal |comm|deptno|
-----|------|---------|----|----------|----|----|------|
7521|WARD |SALESMAN |7698|1981-2-22 |1250| 500| 30|
7499|ALLEN |SALESMAN |7698|1981-2-20 |1600| 300| 30|
7900|JAMES |CLERK |7698|1981-12-3 | 950| | 30|
7844|TURNER|SALESMAN |7698|1981-9-8 |1500| 0| 30|
7698|BLAKE |MANAGER |7839|1981-5-1 |2850| | 30|
7654|MARTIN|SALESMAN |7698|1981-9-28 |1250|1400| 30|
7876|ADAMS |CLERK |7788|1987-5-23 |1100| | 20|
7902|FORD |ANALYST |7566|1981-12-3 |3000| | 20|
7788|SCOTT |ANALYST |7566|1987-4-19 |3000| | 20|
7369|SMITH |CLERK |7902|1980-12-17| 800| | 20|
7566|JONES |MANAGER |7839|1981-4-2 |2975| | 20|
7934|MILLER|CLERK |7782|1982-1-23 |1300| | 10|
7839|KING |PRESIDENT| |1981-11-17|5000| | 10|
7782|CLARK |MANAGER |7839|1981-6-9 |2450| | 10|将查询结果导入到文件中(按照部门编号降序排序)[指定了格式]
insert overwrite local directory '/home/hadoop/emp' row format delimited fields terminated by "\t" select * from emp order by deptno desc;
打开/home/hadoop/emp下的文件
[hadoop@aliyun emp.txt]$ cat 000000_0
7521 WARD SALESMAN 7698 1981-2-22 1250.0 500.0 30
7499 ALLEN SALESMAN 7698 1981-2-20 1600.0 300.0 30
7900 JAMES CLERK 7698 1981-12-3 950.0 \N 30
7844 TURNER SALESMAN 7698 1981-9-8 1500.0 0.0 30
7698 BLAKE MANAGER 7839 1981-5-1 2850.0 \N 30
7654 MARTIN SALESMAN 7698 1981-9-28 1250.0 1400.0 30
7876 ADAMS CLERK 7788 1987-5-23 1100.0 \N 20
7902 FORD ANALYST 7566 1981-12-3 3000.0 \N 20
7788 SCOTT ANALYST 7566 1987-4-19 3000.0 \N 20
7369 SMITH CLERK 7902 1980-12-17 800.0 \N 20
7566 JONES MANAGER 7839 1981-4-2 2975.0 \N 20
7934 MILLER CLERK 7782 1982-1-23 1300.0 \N 10
7839 KING PRESIDENT \N 1981-11-17 5000.0 \N 10
7782 CLARK MANAGER 7839 1981-6-9 2450.0 \N 10
发送分区排序(Distribute By)
在有些情况下,我们需要控制某个特定行应该到哪个reducer,通常是为了进行后续的聚集操作。distribute by 子句可以做这件事。distribute by类似MR中partition(自定义分区),进行分区,结合sort by使用。
对于distribute by进行测试,一定要分配多reduce进行处理,否则无法看到distribute by的效果。
先按照部门编号分区,再按照员工编号降序排序。
select * from emp distribute by deptno sort by empno desc;
empno|ename |job |mgr |hiredate |sal |comm|deptno|
-----|------|---------|----|----------|----|----|------|
7900|JAMES |CLERK |7698|1981-12-3 | 950| | 30|
7844|TURNER|SALESMAN |7698|1981-9-8 |1500| 0| 30|
7698|BLAKE |MANAGER |7839|1981-5-1 |2850| | 30|
7654|MARTIN|SALESMAN |7698|1981-9-28 |1250|1400| 30|
7521|WARD |SALESMAN |7698|1981-2-22 |1250| 500| 30|
7499|ALLEN |SALESMAN |7698|1981-2-20 |1600| 300| 30|
7934|MILLER|CLERK |7782|1982-1-23 |1300| | 10|
7839|KING |PRESIDENT| |1981-11-17|5000| | 10|
7782|CLARK |MANAGER |7839|1981-6-9 |2450| | 10|
7902|FORD |ANALYST |7566|1981-12-3 |3000| | 20|
7876|ADAMS |CLERK |7788|1987-5-23 |1100| | 20|
7788|SCOTT |ANALYST |7566|1987-4-19 |3000| | 20|
7566|JONES |MANAGER |7839|1981-4-2 |2975| | 20|
7369|SMITH |CLERK |7902|1980-12-17| 800| | 20|注意:
- distribute by的分区规则是根据分区字段的hash码与reduce的个数进行模除后,余数相同的分到一个区。
- Hive要求DISTRIBUTE BY语句要写在SORT BY语句之前。
Cluster By
当distribute by
和sorts by
字段相同时,可以使用cluster by
方式。但是排序只能是升序排序,不能指定排序规则为ASC或者DESC。
当分区字段和排序字段都是部门编号的时候我们可以这么做
select * from emp cluster by deptno;
empno|ename |job |mgr |hiredate |sal |comm|deptno| |
注意:
注意:按照部门编号分区,不一定就是固定死的数值,可以是20号和30号部门分到一个分区里面去。