Goals
- HS2
- Complex data types
- Converting between rows and columns
- Common functions
- Static and dynamic partitioned tables
- Bucketed tables
HS2
HS2 is short for HiveServer2.
HS2: the server side; the default port is 10000.
The port can be changed by setting hive.server2.thrift.port.
beeline: the client side.
Connect with:
./beeline -u jdbc:hive2://hadoop001:10000/matestore -n hadoop
Complex data types
Hive has three complex data types: array, map, and struct.
array
DDL pattern for an array column:
create table hive_array(name string, work_locations array<string>)
row format delimited
fields terminated by '\t'
collection items terminated by ',';
Note: COLLECTION ITEMS TERMINATED BY ',' specifies the separator between the elements of the array.
Preparing the data
Step 1: prepare the data
pk beijing,shanghai,tianjin,hangzhou
jepson changchu,chengdu,wuhan,beijing
Step 2: create the table
create table array_(
name string,
address array<string>
)
row format delimited
fields terminated by "\t"
collection items terminated by ",";
Step 3: load the data
load data local inpath "/home/hadoop/data/hive_array.txt" into table array_;
Step 4: verify the data loaded correctly
select * from array_;
name |address |
------|-------------------------------------------|
pk |["beijing","shanghai","tianjin","hangzhou"]|
jepson|["changchu","chengdu","wuhan","beijing"] |
Reading from an array
Take an element by its index; indexes start at 0
select name,address[1] as address from array_;
name |address |
------|--------|
pk |shanghai|
jepson|chengdu |
Get the number of elements in the array
select name,size(address) as size from array_;
name |size|
------|----|
pk | 4|
jepson|   4|
Get the records whose array contains a given element
select name,address from array_ where array_contains(address,"shanghai");
name|address |
----|-------------------------------------------|
pk |["beijing","shanghai","tianjin","hangzhou"]|
map
DDL pattern for a map column:
create table hive_map(id int, name string, family map<string,string>, age int)
row format delimited
fields terminated by ','
collection items terminated by '#'
map keys terminated by ':';
Note: COLLECTION ITEMS TERMINATED BY '#' specifies the separator between the key-value pairs of the map, and MAP KEYS TERMINATED BY ':' specifies the separator between each key and its value.
Preparing the data
Step 1: prepare the data
1,zhangsan,father:xiaoming#mother:xiaohuang#brother:xiaoxu,28
2,lisi,father:mayun#mother:huangyi#brother:guanyu,22
3,wangwu,father:wangjianlin#mother:ruhua#sister:jingtian,29
4,mayun,father:mayongzhen#mother:angelababy,26
Step 2: create the table
create table map_(
id int,
name string,
family map<String,String>,
age int
)
row format delimited
fields terminated by ","
collection items terminated by "#"
map keys terminated by ":";
Step 3: load the data
load data local inpath '/home/hadoop/data/hive_map.txt' into table map_;
Step 4: verify the data loaded correctly
select * from map_;
id|name |family |age|
--|--------|-------------------------------------------------------------|---|
1|zhangsan|{"father":"xiaoming","mother":"xiaohuang","brother":"xiaoxu"}| 28|
2|lisi |{"father":"mayun","mother":"huangyi","brother":"guanyu"} | 22|
3|wangwu |{"father":"wangjianlin","mother":"ruhua","sister":"jingtian"}| 29|
4|mayun |{"father":"mayongzhen","mother":"angelababy"} | 26|
Reading from a map
Get a value by key
select name,family['father'] as father,age from map_;
name |father |age|
--------|-----------|---|
zhangsan|xiaoming | 28|
lisi |mayun | 22|
wangwu |wangjianlin| 29|
mayun   |mayongzhen | 26|
Get the set of all keys
select name,map_keys(family) as map_keys,age from map_;
name |map_keys |age|
--------|-----------------------------|---|
zhangsan|["father","mother","brother"]| 28|
lisi |["father","mother","brother"]| 22|
wangwu |["father","mother","sister"] | 29|
mayun   |["father","mother"]          | 26|
Get the set of all values
select name,map_values(family) as map_values,age from map_;
name |map_values |age|
--------|----------------------------------|---|
zhangsan|["xiaoming","xiaohuang","xiaoxu"] | 28|
lisi |["mayun","huangyi","guanyu"] | 22|
wangwu |["wangjianlin","ruhua","jingtian"]| 29|
mayun   |["mayongzhen","angelababy"]       | 26|
Count the keys (take the size of the key set)
select name,size(map_keys(family)) as key_size,age from map_;
name |key_size|age|
--------|--------|---|
zhangsan| 3| 28|
lisi | 3| 22|
wangwu | 3| 29|
mayun   |       2| 26|
Check whether the keys contain a given element (array_contains on the key set)
select name,family,age from map_ where array_contains((map_keys(family)),"brother");
name |family |age|
--------|-------------------------------------------------------------|---|
zhangsan|{"father":"xiaoming","mother":"xiaohuang","brother":"xiaoxu"}| 28|
lisi |{"father":"mayun","mother":"huangyi","brother":"guanyu"} | 22|
struct
DDL pattern for a struct column:
create table hive_struct(address string, `user` struct<name:string,age:int>)
row format delimited
fields terminated by '#'
collection items terminated by ':';
Note: COLLECTION ITEMS TERMINATED BY ':' specifies the separator between the fields of the struct.
Preparing the data
Step 1: prepare the data
192.168.1.1#zhangsan:40
192.168.1.2#lisi:50
192.168.1.3#wangwu:60
192.168.1.4#zhaoliu:70
Step 2: create the table
create table structs_(
address string,
`user` struct<name:string,age:int>
)
row format delimited
fields terminated by "#"
collection items terminated by ":";
Step 3: load the data
load data local inpath "/home/hadoop/data/hive_struct.txt" into table structs_;
Step 4: verify the data loaded correctly
select * from structs_;
address |user |
-----------|------------------------------|
192.168.1.1|[{"name":"zhangsan","age":40}]|
192.168.1.2|[{"name":"lisi","age":50}] |
192.168.1.3|[{"name":"wangwu","age":60}] |
192.168.1.4|[{"name":"zhaoliu","age":70}] |
Reading from a struct
Get a field from the struct
select address,user.name as name from structs_;
address |name |
-----------|--------|
192.168.1.1|zhangsan|
192.168.1.2|lisi |
192.168.1.3|wangwu |
192.168.1.4|zhaoliu |
Converting between rows and columns
This is a combined exercise on complex data types.
Preparing the data
Step 1: prepare the data
Data set 1: ad-click records per session
11 ad_101 2014-05-01 06:01:12.334+01
22 ad_102 2014-05-01 07:28:12.342+01
33 ad_103 2014-05-01 07:50:12.33+01
11 ad_104 2014-05-01 09:27:12.33+01
22 ad_103 2014-05-01 09:03:12.324+01
33 ad_102 2014-05-02 19:10:12.343+01
11 ad_101 2014-05-02 09:07:12.344+01
35 ad_105 2014-05-03 11:07:12.339+01
22 ad_104 2014-05-03 12:59:12.743+01
77 ad_103 2014-05-03 18:04:12.355+01
99 ad_102 2014-05-04 00:36:39.713+01
33 ad_101 2014-05-04 19:10:12.343+01
11 ad_101 2014-05-05 09:07:12.344+01
35 ad_102 2014-05-05 11:07:12.339+01
22 ad_103 2014-05-05 12:59:12.743+01
77 ad_104 2014-05-05 18:04:12.355+01
99 ad_105 2014-05-05 20:36:39.713+01
Data set 2: ad details
ad_101 http://www.google.com catalog8|catalog1
ad_102 http://www.sohu.com catalog6|catalog3
ad_103 http://www.baidu.com catalog7
ad_104 http://www.qq.com catalog5|catalog1|catalog4|catalog9
ad_105 http://sina.com
Step 2: create the tables
Table 1: the click table
create table click_tbl(
cookie_id string,
ad_id int,
time string
)
row format delimited
fields terminated by "\t";
Table 2: the ad table
create table ad_tbl(
ad_id string,
url string,
catalogs array<String>
)
row format delimited
fields terminated by "\t"
collection items terminated by "|";
Step 3: load the data
Load data into the click table
load data local inpath "/home/hadoop/data/click_log.txt" into table click_tbl;
Load data into the ad table
load data local inpath "/home/hadoop/data/ad_list.txt" into table ad_tbl;
Step 4: verify the data loaded correctly
Query the click table
select * from click_tbl;
cookie_id|ad_id |time |
---------|------|--------------------------|
11|ad_101|2014-05-01 06:01:12.334+01|
22|ad_102|2014-05-01 07:28:12.342+01|
33|ad_103|2014-05-01 07:50:12.33+01 |
11|ad_104|2014-05-01 09:27:12.33+01 |
22|ad_103|2014-05-01 09:03:12.324+01|
33|ad_102|2014-05-02 19:10:12.343+01|
11|ad_101|2014-05-02 09:07:12.344+01|
35|ad_105|2014-05-03 11:07:12.339+01|
22|ad_104|2014-05-03 12:59:12.743+01|
77|ad_103|2014-05-03 18:04:12.355+01|
99|ad_102|2014-05-04 00:36:39.713+01|
33|ad_101|2014-05-04 19:10:12.343+01|
11|ad_101|2014-05-05 09:07:12.344+01|
35|ad_102|2014-05-05 11:07:12.339+01|
22|ad_103|2014-05-05 12:59:12.743+01|
77|ad_104|2014-05-05 18:04:12.355+01|
        99|ad_105|2014-05-05 20:36:39.713+01|
Query the ad table
select * from ad_tbl;
ad_id |url |catalogs |
------|---------------------|---------------------------------------|
ad_101|http://www.google.com|["catalog8|catalog1"] |
ad_102|http://www.sohu.com |["catalog6|catalog3"] |
ad_103|http://www.baidu.com |["catalog7"] |
ad_104|http://www.qq.com |["catalog5|catalog1|catalog4|catalog9"]|
ad_105|http://sina.com |NULL |
Requirements
Rows to columns
Query the ads each visitor (cookie_id) clicked
Deduplicated (collect_set):
select cookie_id,collect_set(ad_id) ad_id from click_tbl group by cookie_id;
Result:
cookie_id|ad_id |
---------|----------------------------|
11|["ad_101","ad_104"] |
22|["ad_102","ad_103","ad_104"]|
33|["ad_103","ad_102","ad_101"]|
35|["ad_105","ad_102"] |
77|["ad_103","ad_104"] |
99|["ad_102","ad_105"] |
Without deduplication (collect_list):
select cookie_id,collect_list(ad_id) ad_id from click_tbl group by cookie_id;
Result:
cookie_id|ad_id |
---------|-------------------------------------|
11|["ad_101","ad_104","ad_101","ad_101"]|
22|["ad_102","ad_103","ad_104","ad_103"]|
33|["ad_103","ad_102","ad_101"] |
35|["ad_105","ad_102"] |
77|["ad_103","ad_104"] |
99|["ad_102","ad_105"] |
Aggregate query
Count how many times each visitor clicked each ad
select cookie_id,ad_id,count(1) amount from click_tbl group by cookie_id,ad_id;
Result:
cookie_id|ad_id |amount|
---------|------|------|
11|ad_101| 3|
11|ad_104| 1|
22|ad_102| 1|
22|ad_103| 2|
22|ad_104| 1|
33|ad_101| 1|
33|ad_102| 1|
33|ad_103| 1|
35|ad_102| 1|
35|ad_105| 1|
77|ad_103| 1|
77|ad_104| 1|
99|ad_102| 1|
       99|ad_105|     1|
Query the details of the ads each visitor clicked
select
click_tmp.cookie_id,click_tmp.ad_id,ad_tbl.url,ad_tbl.catalogs
from
ad_tbl
JOIN
(select cookie_id,ad_id,count(1) amount from click_tbl group by cookie_id,ad_id) click_tmp
on click_tmp.ad_id=ad_tbl.ad_id;
Result:
cookie_id|ad_id |url |catalogs |
---------|------|---------------------|---------------------------------------------|
11|ad_101|http://www.google.com|["catalog8","catalog1"] |
11|ad_104|http://www.qq.com |["catalog5","catalog1","catalog4","catalog9"]|
22|ad_102|http://www.sohu.com |["catalog6","catalog3"] |
22|ad_103|http://www.baidu.com |["catalog7"] |
22|ad_104|http://www.qq.com |["catalog5","catalog1","catalog4","catalog9"]|
33|ad_101|http://www.google.com|["catalog8","catalog1"] |
33|ad_102|http://www.sohu.com |["catalog6","catalog3"] |
33|ad_103|http://www.baidu.com |["catalog7"] |
35|ad_102|http://www.sohu.com |["catalog6","catalog3"] |
35|ad_105|http://sina.com |NULL |
77|ad_103|http://www.baidu.com |["catalog7"] |
77|ad_104|http://www.qq.com |["catalog5","catalog1","catalog4","catalog9"]|
99|ad_102|http://www.sohu.com |["catalog6","catalog3"] |
       99|ad_105|http://sina.com      |NULL                                         |
Sort the elements in catalogs
select ad_id,url,sort_array(catalogs) as catalogs from ad_tbl;
Result:
ad_id |url |catalogs |
------|---------------------|---------------------------------------------|
ad_101|http://www.google.com|["catalog1","catalog8"] |
ad_102|http://www.sohu.com |["catalog3","catalog6"] |
ad_103|http://www.baidu.com |["catalog7"] |
ad_104|http://www.qq.com |["catalog1","catalog4","catalog5","catalog9"]|
ad_105|http://sina.com |NULL |
Columns to rows
Query the details of each ad, one catalog per row:
select ad_id, catalog
from ad_tbl
lateral view outer explode(catalogs) t as catalog;
Result (one row per ad_id/catalog pair; ad_105 gets a NULL catalog because of the outer explode):
ad_id |catalog |
Note:
If the catalogs column of ad_tbl were a string type, it would have to be converted to an array when exploding, i.e. use split to cut the field on '|' and return an array:
lateral view outer explode(split(catalogs,'\\|')) t as catalog
Common functions
To look up how a function is used: desc function extended <function name>;
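For example, to see the full documentation and built-in examples for split:
desc function extended split;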
Date and time functions
current_date:
Returns the current date
select current_date; --2019-12-21
current_timestamp:
Returns the current timestamp
select current_timestamp; --2019-12-21 02:23:44
unix_timestamp:
Converts a timestamp to seconds since the epoch
select unix_timestamp('2019-08-15 16:40:00','yyyy-MM-dd HH:mm:ss'); --1565858400
from_unixtime:
Converts epoch seconds to a timestamp
select from_unixtime(1565858389,'yyyy-MM-dd HH:mm:ss'); --2019-08-15 16:39:49
to_date:
Returns the date part
SELECT to_date('2009-07-30 04:17:52') FROM src LIMIT 1; --2009-07-30
year:
Returns the year
SELECT year('2009-07-30') FROM src LIMIT 1; --2009
month:
Returns the month
day:
Returns the day of the month
hour:
Returns the hour
minute:
Returns the minute (a combined example for month, day, hour, and minute follows date_sub below)
second:
Returns the second
SELECT second('2009-07-30 12:58:59') FROM src LIMIT 1; --59
date_add:
Adds the given number of days
SELECT date_add('2009-07-30', 1) FROM src LIMIT 1; --2009-07-31
date_sub:
Subtracts the given number of days
SELECT date_sub('2009-07-30', 1) FROM src LIMIT 1; --2009-07-29
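The remaining date-part functions above follow the same pattern; a quick sketch in the same style:
SELECT month('2009-07-30'), day('2009-07-30') FROM src LIMIT 1; --7  30
SELECT hour('2009-07-30 12:58:59'), minute('2009-07-30 12:58:59') FROM src LIMIT 1; --12  58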
Numeric functions
round:
Rounds to the given number of decimal places (a combined example follows the least example below)
ceil:
Ceiling: the smallest integer not less than the value
floor:
Floor: the largest integer not greater than the value
abs:
Returns the absolute value
least:
Returns the smallest of its arguments (min across the argument list)
SELECT least(2, 3, 1) FROM src LIMIT 1; --1
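A quick combined sketch of round, ceil, floor, and abs with arbitrary literals:
SELECT round(3.1415, 2), ceil(3.2), floor(3.8), abs(-7) FROM src LIMIT 1; --3.14  4  3  7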
String functions
substr:
Returns a substring (a combined example for the functions without one follows lower below)
concat:
Concatenates strings
concat_ws:
Joins strings (and arrays of strings) with the given separator
SELECT concat_ws('.', 'www', array('facebook', 'com')) FROM src LIMIT 1; --www.facebook.com
length:
Returns the length of a string as an integer
split:
Splits a string and returns an array
upper:
Converts to upper case
lower:
Converts to lower case
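A quick combined sketch of the string functions above that have no example, using arbitrary literals:
SELECT substr('facebook', 1, 4), concat('face', 'book'), length('facebook') FROM src LIMIT 1; --face  facebook  8
SELECT split('www.facebook.com', '\\.'), upper('face'), lower('BOOK') FROM src LIMIT 1; --["www","facebook","com"]  FACE  book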
JSON processing functions
Step 1: prepare the data
{"movie":"1","rate":"5","time":"978300760","userid":"1"}
{"movie":"2","rate":"4","time":"978702109","userid":"1"}
{"movie":"3","rate":"3","time":"978401968","userid":"1"}
{"movie":"4","rate":"4","time":"978300275","userid":"1"}
{"movie":"5","rate":"3","time":"978801091","userid":"1"}第二步建表
create table rating_json(
json string
);
Step 3: load the data
load data local inpath '/home/hadoop/data/rating.txt' into table rating_json;
Step 4: view the data (select * from rating_json;)
json |
--------------------------------------------------------|
{"movie":"1","rate":"5","time":"978300760","userid":"1"}|
{"movie":"2","rate":"4","time":"978702109","userid":"1"}|
{"movie":"3","rate":"3","time":"978401968","userid":"1"}|
{"movie":"4","rate":"4","time":"978300275","userid":"1"}|
{"movie":"5","rate":"3","time":"978801091","userid":"1"}|对数据进行处理
json_tuple:
Extracts the given fields from a JSON string; the results are returned as strings.
select json_tuple(json,"movie","rate","time","userid") as (movie,rate,time,userid) from rating_json;
movie|rate|time |userid|
-----|----|---------|------|
1 |5 |978300760|1 |
2 |4 |978702109|1 |
3 |3 |978401968|1 |
4 |4 |978300275|1 |
5    |3   |978801091|1     |
parse_url_tuple:
Extracts the given parts from a URL; the results are returned as strings.
select parse_url_tuple("https://www.baidu.com/bigdate/spark?cookie_id=10",'HOST','PATH','QUERY','QUERY:cookie_id');
c0 |c1 |c2 |c3|
-------------|--------------|------------|--|
www.baidu.com|/bigdate/spark|cookie_id=10|10|
Null-handling functions
isnull:
Returns true if the value is null (an example follows nvl below)
isnotnull:
Returns true if the value is not null
elt:
Returns the n-th of the following arguments
SELECT elt(1, 'face', 'book') FROM src LIMIT 1; --face
nvl:
Replaces a null value with a default value
SELECT nvl(null,'bla') FROM src LIMIT 1; --bla
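A quick sketch of isnull and isnotnull in the same style:
SELECT isnull(null), isnotnull('face') FROM src LIMIT 1; --true  true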
Other functions
cast:
Converts between data types
cast(time as bigint) --convert a time value to a number
Note:
binary can only be cast to string
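A minimal sketch of cast with arbitrary literal values, including the binary restriction noted above:
SELECT cast('1565858400' as bigint) FROM src LIMIT 1; --1565858400
SELECT cast(cast('face' as binary) as string) FROM src LIMIT 1; --face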
Static and dynamic partitioned tables
Preparation
Source data
1,jack,shanghai,20190129
2,kevin,beijing,20190130
3,lucas,hangzhou,20190129
4,lily,hangzhou,20190130
Create the source data table (external table)
create external table test_tbl(
id int,
name string,
address string,
day string
)
row format delimited
fields terminated by ",";
Load the data:
load data local inpath '/home/hadoop/data/partition.txt' into table test_tbl;
Create the partitioned table (external table)
create external table prit_tbl(
id int,
name string,
address string
)
partitioned by (day string)
row format delimited
fields terminated by ",";
Loading data with static partitions
Drawback of static partitions: every write has to name the partition date explicitly.
insert overwrite table prit_tbl partition(day="20190129") select id,name,address from test_tbl where day = "20190129" ;
Note:
The select list must not include the partition column day, otherwise the statement fails.
Loading data with dynamic partitions
View the table's partitions
show partitions prit_tbl;
partition |
------------|
day=20190129|
day=20190130|
Partitions are recognized automatically and do not need to be named explicitly.
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
insert into table prit_tbl partition(day) select * from test_tbl;
Verify with queries:
select * from prit_tbl;
select * from prit_tbl where day = 20190129;
select * from prit_tbl where day = 20190130;
Verify in the HDFS web UI
Notes on partitioning
Avoid dynamic partitions where you can: with dynamic partitioning, reducers are allocated per partition, so as the number of partitions grows the number of reducers grows with it, which can overwhelm the cluster.
Difference between dynamic and static partitions: a static partition is created whether or not it has data; a dynamic partition is created only when the result set actually produces rows for it.
Hive's dynamic-partition strict mode is not the same thing as the strict mode provided by hive.mapred.mode, which exists to stop users from accidentally submitting dangerous HQL.
hive.mapred.mode = nonstrict | strict
If this mode is set to strict, the following three kinds of queries are blocked (a short sketch follows this list):
- Queries on a partitioned table whose where clause does not filter on a partition column.
- Cartesian-product joins, i.e. join statements without an on condition or a where condition.
- order by queries without a limit clause.
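A minimal sketch of strict mode in action, reusing the partitioned table prit_tbl from above (the exact error message varies by Hive version):
set hive.mapred.mode=strict;
select * from prit_tbl;                        --rejected: no partition filter in the where clause
select * from prit_tbl where day = '20190129'; --accepted: filters on the partition column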
Bucketed tables
The idea of bucketing
Partitioning offers a convenient way to isolate data and optimize queries, but not every data set can be partitioned sensibly. For a table or a partition, Hive can further organize the data into buckets, a finer-grained division of the data.
Bucketing is another technique for breaking a data set into more manageable pieces.
Partitioning works on the storage path of the data; bucketing works on the data files.
Benefits of bucketing
- Bucketing rule: hash the value of the bucketing column and take the remainder of dividing the hash by the number of buckets; the remainder decides which bucket the record lands in, so records with the same remainder share a bucket (see the worked sketch after this list).
- Advantages:
- Faster join queries
- Faster sampling
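A worked sketch of that rule, assuming 4 buckets on an int id column (matching the bucket_tbl DDL below); for int columns hash(id) is the value itself, and pmod performs the hash-then-modulo step:
select id, pmod(hash(id), 4) as bucket_idx
from (select explode(array(1,2,3,4,5,6,7,8,9)) as id) t;
--ids 4 and 8 land in bucket 0; ids 1, 5, 9 in bucket 1; ids 2, 6 in bucket 2; ids 3, 7 in bucket 3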
Bucketing in practice
Prepare the data
1,name1
2,name2
3,name3
4,name4
5,name5
6,name6
7,name7
8,name8
9,name9
Create the bucketed table
create external table bucket_tbl(
id int,
name string
)
clustered by(id)
into 4 buckets
row format delimited
fields terminated by ",";
Create an intermediate table
create external table mid_tbl(
id int,
name string
)
row format delimited
fields terminated by ",";
Load data into the intermediate table
load data local inpath '/home/hadoop/data/bucket.txt' into table mid_tbl;
Enforce bucketing
set hive.enforce.bucketing = true; --not needed on Hive 2.x
set mapreduce.job.reduces = -1;
On Hive 0.x and 1.x, hive.enforce.bucketing = true must be set; it enforces bucketing and lets Hive pick the correct number of reducers and the cluster-by column automatically from the table definition.
Insert the data into the bucketed table
insert into table bucket_tbl select * from mid_tbl;
View the result
HDFS:
Buckets are stored as files, not as directories the way partitions are.
Sorted bucketed tables
To keep each bucket sorted by id in ascending order, create the table like this:
create table test_bucket_sorted (
id int comment 'ID',
name string comment 'name'
)
comment 'bucketing test'
clustered by(id)
sorted by (id)
into 4 buckets
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ;
create external table bucket_tbl(
id int,
name string
)
clustered by(id) sorted by (id) into 4 buckets
row format delimited
fields terminated by ",";
Note:
As before, the data has to be inserted through an intermediate table.
Benefit:
Because the data inside each bucket is sorted, joining bucket against bucket becomes an efficient sort-merge operation (a hedged sketch of the relevant settings follows).
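As a hedged sketch, these are the settings commonly used so that Hive can turn such a join into a sort-merge bucket join; test_bucket_sorted_2 is a hypothetical second table, assumed to be bucketed and sorted on id into the same number of buckets:
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
select a.id, a.name, b.name
from test_bucket_sorted a
join test_bucket_sorted_2 b on a.id = b.id; --test_bucket_sorted_2 is hypothetical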
Sampling queries on bucketed tables
For very large data sets, users sometimes need a representative query result rather than the full result. Hive supports this by sampling the table.
SQL examples
hive> select * from test_bucket tablesample (bucket 1 out of 2);
hive> select * from test tablesample (bucket 1 out of 2 on id);
Differences:
- On a bucketed table the on <column> clause is optional; when omitted, the bucketing column is used by default. On a non-bucketed table the on clause is required.
- When sampling on the bucketing column, Hive reads directly from the matching bucket files, which makes sampling on large tables much faster.
Syntax:
tablesample (bucket x out of y on id);
x is the bucket to start from and y is the number of groups the table is divided into; think of x as the numerator and y as the denominator: the table is split into y parts (buckets) and the x-th part is taken.
For a bucketed table this puts a constraint on y: it must be a multiple or a factor of the number of buckets.
Assuming a table with 4 buckets:
x=1, y=2: takes 4/y = 2 buckets, namely bucket 1 and bucket 3 (1+y)
x=1, y=4: takes 4/y = 1 bucket, namely bucket 1
x=2, y=8: takes 4/y = 1/2 of a bucket, namely half of bucket 2
x must be less than or equal to y
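The three scenarios above, written as concrete statements against the 4-bucket bucket_tbl built earlier (which rows come back depends on how the ids hashed into the buckets):
select * from bucket_tbl tablesample (bucket 1 out of 2 on id); --buckets 1 and 3
select * from bucket_tbl tablesample (bucket 1 out of 4 on id); --bucket 1 only
select * from bucket_tbl tablesample (bucket 2 out of 8 on id); --half of bucket 2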