HS2 & Hive Complex Data Types & Row-Column Conversion & Common Functions & Static/Dynamic Partitioned Tables & Bucketed Tables

Goals

  1. HS2
  2. Complex data types
  3. Row-column conversion
  4. Common functions
  5. Static and dynamic partitioned tables
  6. Bucketed tables

HS2

HS2 is short for HiveServer2.

  • HS2: the server side; listens on port 10000 by default

    The port is changed by setting hive.server2.thrift.port (see the sketch after this list)

  • beeline: the client side

    Connect with: ./beeline -u jdbc:hive2://hadoop001:10000/matestore -n hadoop
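
A hedged sketch of overriding the default port: HiveServer2 accepts --hiveconf overrides at startup, so hive-site.xml does not have to be edited (the port 10010 below is only an example value):

# Start HS2 on a non-default Thrift port
hive --service hiveserver2 --hiveconf hive.server2.thrift.port=10010 &

# Point beeline at the new port
./beeline -u jdbc:hive2://hadoop001:10010/matestore -n hadoop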

Complex Data Types

There are three complex data types: array, map, and struct.

array

Table definition with an array column:

create table hive_array(name string, work_locations array<string>)
row format delimited
fields terminated by '\t'
COLLECTION ITEMS TERMINATED BY ',';

Note: COLLECTION ITEMS TERMINATED BY ',' specifies the separator between the elements of the array.

Preparing the data

  1. Step 1: prepare the data

    pk      beijing,shanghai,tianjin,hangzhou
    jepson changchu,chengdu,wuhan,beijing
  2. Step 2: create the table

    create table array_(
    name string,
    address array<string>
    )
    row format delimited
    fields terminated by "\t"
    collection items terminated by ",";
  3. Step 3: load the data

    load data local inpath "/home/hadoop/data/hive_array.txt" into table array_;
  4. Step 4: verify that the data loaded

    select * from array_;

    name  |address                                    |
    ------|-------------------------------------------|
    pk |["beijing","shanghai","tianjin","hangzhou"]|
    jepson|["changchu","chengdu","wuhan","beijing"] |

Accessing array elements

  1. Access an element by its index; indexes start at 0

    select name,address[1] as address from array_;

    name  |address |
    ------|--------|
    pk |shanghai|
    jepson|chengdu |
  2. Get the number of elements in the array

    select name,size(address) as size from array_;

    name  |size|
    ------|----|
    pk | 4|
    jepson| 4|
  3. Get the rows whose array contains a given element

    select name,address from array_ where array_contains(address,"shanghai");

    name|address                                    |
    ----|-------------------------------------------|
    pk |["beijing","shanghai","tianjin","hangzhou"]|

map

Table definition with a map column:

create table hive_map(
id int, name string, members map<string,string>, age int
)
row format delimited
fields terminated by ','
COLLECTION ITEMS TERMINATED BY '#'
MAP KEYS TERMINATED BY ':';

Note: COLLECTION ITEMS TERMINATED BY '#' specifies the separator between the map's key-value pairs, and MAP KEYS TERMINATED BY ':' specifies the separator between a key and its value.

Preparing the data

  1. Step 1: prepare the data

    1,zhangsan,father:xiaoming#mother:xiaohuang#brother:xiaoxu,28
    2,lisi,father:mayun#mother:huangyi#brother:guanyu,22
    3,wangwu,father:wangjianlin#mother:ruhua#sister:jingtian,29
    4,mayun,father:mayongzhen#mother:angelababy,26
  2. Step 2: create the table

    create table map_(
    id int,
    name string,
    family map<String,String>,
    age int
    )
    row format delimited
    fields terminated by ","
    collection items terminated by "#"
    map keys terminated by ":";
  3. Step 3: load the data

    load data local inpath '/home/hadoop/data/hive_map.txt' into table map_;
  4. Step 4: verify that the data loaded

    select * from map_;

    id|name    |family                                                       |age|
    --|--------|-------------------------------------------------------------|---|
    1|zhangsan|{"father":"xiaoming","mother":"xiaohuang","brother":"xiaoxu"}| 28|
    2|lisi |{"father":"mayun","mother":"huangyi","brother":"guanyu"} | 22|
    3|wangwu |{"father":"wangjianlin","mother":"ruhua","sister":"jingtian"}| 29|
    4|mayun |{"father":"mayongzhen","mother":"angelababy"} | 26|

Accessing map entries

  1. Get a value by its key

    select name,family['father'] as father,age from map_;

    name    |father     |age|
    --------|-----------|---|
    zhangsan|xiaoming | 28|
    lisi |mayun | 22|
    wangwu |wangjianlin| 29|
    mayun |mayongzhen | 26|
  2. Get the array of all keys

    select name,map_keys(family) as map_keys,age from map_;

    name    |map_keys                     |age|
    --------|-----------------------------|---|
    zhangsan|["father","mother","brother"]| 28|
    lisi |["father","mother","brother"]| 22|
    wangwu |["father","mother","sister"] | 29|
    mayun |["father","mother"] | 26|
  3. Get the array of all values

    select name,map_values(family) as map_values,age from map_;

    name    |map_values                        |age|
    --------|----------------------------------|---|
    zhangsan|["xiaoming","xiaohuang","xiaoxu"] | 28|
    lisi |["mayun","huangyi","guanyu"] | 22|
    wangwu |["wangjianlin","ruhua","jingtian"]| 29|
    mayun |["mayongzhen","angelababy"] | 26|
  4. Count the keys (size of the key array)

    select name,size(map_keys(family)) as key_size,age from map_;

    name    |key_size|age|
    --------|--------|---|
    zhangsan| 3| 28|
    lisi | 3| 22|
    wangwu | 3| 29|
    mayun | 2| 26|
  5. Check whether the keys contain a given element (array_contains on the key array)

    select name,family,age from map_ where array_contains((map_keys(family)),"brother");

    name    |family                                                       |age|
    --------|-------------------------------------------------------------|---|
    zhangsan|{"father":"xiaoming","mother":"xiaohuang","brother":"xiaoxu"}| 28|
    lisi |{"father":"mayun","mother":"huangyi","brother":"guanyu"} | 22|

structs

Table definition with a struct column:

create table hive_struct(
ip string, info struct<name:string,age:int>
)
row format delimited
fields terminated by '#'
COLLECTION ITEMS TERMINATED BY ':';

Note: COLLECTION ITEMS TERMINATED BY ':' specifies the separator between the struct's fields.

Preparing the data

  1. Step 1: prepare the data

    192.168.1.1#zhangsan:40
    192.168.1.2#lisi:50
    192.168.1.3#wangwu:60
    192.168.1.4#zhaoliu:70
  2. Step 2: create the table

    create table structs_(
    address string,
    `user` struct<name:string,age:int>
    )
    row format delimited
    fields terminated by "#"
    collection items terminated by ":";
  3. Step 3: load the data

    load data local inpath "/home/hadoop/data/hive_struct.txt" into table structs_;
  4. Step 4: verify that the data loaded

    select * from structs_;

    address    |user                          |
    -----------|------------------------------|
    192.168.1.1|[{"name":"zhangsan","age":40}]|
    192.168.1.2|[{"name":"lisi","age":50}] |
    192.168.1.3|[{"name":"wangwu","age":60}] |
    192.168.1.4|[{"name":"zhaoliu","age":70}] |

Accessing struct fields

  1. Access a field of the struct

    select address,user.name as name from structs_;

    address    |name    |
    -----------|--------|
    192.168.1.1|zhangsan|
    192.168.1.2|lisi |
    192.168.1.3|wangwu |
    192.168.1.4|zhaoliu |
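
    The age field can be read the same way; a minimal sketch against the table above (backticks because the column user was quoted in the DDL):

    select address, `user`.name as name, `user`.age as age from structs_;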

Row-Column Conversion

This is a comprehensive exercise that puts the complex data types to use.

Preparing the data

  1. Step 1: prepare the data

    Dataset 1: ad click records per session

    11      ad_101  2014-05-01 06:01:12.334+01
    22 ad_102 2014-05-01 07:28:12.342+01
    33 ad_103 2014-05-01 07:50:12.33+01
    11 ad_104 2014-05-01 09:27:12.33+01
    22 ad_103 2014-05-01 09:03:12.324+01
    33 ad_102 2014-05-02 19:10:12.343+01
    11 ad_101 2014-05-02 09:07:12.344+01
    35 ad_105 2014-05-03 11:07:12.339+01
    22 ad_104 2014-05-03 12:59:12.743+01
    77 ad_103 2014-05-03 18:04:12.355+01
    99 ad_102 2014-05-04 00:36:39.713+01
    33 ad_101 2014-05-04 19:10:12.343+01
    11 ad_101 2014-05-05 09:07:12.344+01
    35 ad_102 2014-05-05 11:07:12.339+01
    22 ad_103 2014-05-05 12:59:12.743+01
    77 ad_104 2014-05-05 18:04:12.355+01
    99 ad_105 2014-05-05 20:36:39.713+01

    Dataset 2: ad details

    ad_101  http://www.google.com   catalog8|catalog1
    ad_102 http://www.sohu.com catalog6|catalog3
    ad_103 http://www.baidu.com catalog7
    ad_104 http://www.qq.com catalog5|catalog1|catalog4|catalog9
    ad_105 http://sina.com
  2. Step 2: create the tables

    Table 1: click table

    create table click_tbl(
    cookie_id string,
    ad_id string,
    time string
    )
    row format delimited
    fields terminated by "\t";

    Table 2: ad table

    create table ad_tbl(
    ad_id string,
    url string,
    catalogs array<String>
    )
    row format delimited
    fields terminated by "\t"
    collection items terminated by "|";
  3. Step 3: load the data

    Load the click table

    load data local inpath "/home/hadoop/data/click_log.txt" into table click_tbl;

    Load the ad table

    load data local inpath "/home/hadoop/data/ad_list.txt" into table ad_tbl;
  4. Step 4: verify that the data loaded

    Query the click table

    select * from click_tbl;

    cookie_id|ad_id |time                      |
    ---------|------|--------------------------|
    11|ad_101|2014-05-01 06:01:12.334+01|
    22|ad_102|2014-05-01 07:28:12.342+01|
    33|ad_103|2014-05-01 07:50:12.33+01 |
    11|ad_104|2014-05-01 09:27:12.33+01 |
    22|ad_103|2014-05-01 09:03:12.324+01|
    33|ad_102|2014-05-02 19:10:12.343+01|
    11|ad_101|2014-05-02 09:07:12.344+01|
    35|ad_105|2014-05-03 11:07:12.339+01|
    22|ad_104|2014-05-03 12:59:12.743+01|
    77|ad_103|2014-05-03 18:04:12.355+01|
    99|ad_102|2014-05-04 00:36:39.713+01|
    33|ad_101|2014-05-04 19:10:12.343+01|
    11|ad_101|2014-05-05 09:07:12.344+01|
    35|ad_102|2014-05-05 11:07:12.339+01|
    22|ad_103|2014-05-05 12:59:12.743+01|
    77|ad_104|2014-05-05 18:04:12.355+01|
    99|ad_105|2014-05-05 20:36:39.713+01|

    Query the ad table

    select * from ad_tbl;

    ad_id |url                  |catalogs                               |
    ------|---------------------|---------------------------------------|
    ad_101|http://www.google.com|["catalog8|catalog1"] |
    ad_102|http://www.sohu.com |["catalog6|catalog3"] |
    ad_103|http://www.baidu.com |["catalog7"] |
    ad_104|http://www.qq.com |["catalog5|catalog1|catalog4|catalog9"]|
    ad_105|http://sina.com |NULL |

Requirements

Rows to columns

Find the ads each user clicked

  1. With deduplication (collect_set):

    select cookie_id,collect_set(ad_id) ad_id from click_tbl group by cookie_id;

  • Result:

    cookie_id|ad_id                       |
    ---------|----------------------------|
    11|["ad_101","ad_104"] |
    22|["ad_102","ad_103","ad_104"]|
    33|["ad_103","ad_102","ad_101"]|
    35|["ad_105","ad_102"] |
    77|["ad_103","ad_104"] |
    99|["ad_102","ad_105"] |
  2. Without deduplication (collect_list):

    select cookie_id,collect_list(ad_id) ad_id from click_tbl group by cookie_id;

  • Result:

    cookie_id|ad_id                                |
    ---------|-------------------------------------|
    11|["ad_101","ad_104","ad_101","ad_101"]|
    22|["ad_102","ad_103","ad_104","ad_103"]|
    33|["ad_103","ad_102","ad_101"] |
    35|["ad_105","ad_102"] |
    77|["ad_103","ad_104"] |
    99|["ad_102","ad_105"] |
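
A common follow-up to rows-to-columns is collapsing the collected array into a single string per user with concat_ws; a minimal sketch on the same click_tbl:

select cookie_id, concat_ws(',', collect_set(ad_id)) as ad_ids from click_tbl group by cookie_id;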

Aggregate queries

  1. Count how many times each user clicked the same ad

    select cookie_id,ad_id,count(1) amount from click_tbl group by cookie_id,ad_id;

    Result:

    cookie_id|ad_id |amount|
    ---------|------|------|
    11|ad_101| 3|
    11|ad_104| 1|
    22|ad_102| 1|
    22|ad_103| 2|
    22|ad_104| 1|
    33|ad_101| 1|
    33|ad_102| 1|
    33|ad_103| 1|
    35|ad_102| 1|
    35|ad_105| 1|
    77|ad_103| 1|
    77|ad_104| 1|
    99|ad_102| 1|
    99|ad_105| 1|
  2. Query the details of the ads each user clicked

    select 
    click_tmp.cookie_id,click_tmp.ad_id,ad_tbl.url,ad_tbl.catalogs
    from
    ad_tbl
    JOIN
    (select cookie_id,ad_id,count(1) amount from click_tbl group by cookie_id,ad_id) click_tmp
    on click_tmp.ad_id=ad_tbl.ad_id;

    Result:

    cookie_id|ad_id |url                  |catalogs                                     |
    ---------|------|---------------------|---------------------------------------------|
    11|ad_101|http://www.google.com|["catalog8","catalog1"] |
    11|ad_104|http://www.qq.com |["catalog5","catalog1","catalog4","catalog9"]|
    22|ad_102|http://www.sohu.com |["catalog6","catalog3"] |
    22|ad_103|http://www.baidu.com |["catalog7"] |
    22|ad_104|http://www.qq.com |["catalog5","catalog1","catalog4","catalog9"]|
    33|ad_101|http://www.google.com|["catalog8","catalog1"] |
    33|ad_102|http://www.sohu.com |["catalog6","catalog3"] |
    33|ad_103|http://www.baidu.com |["catalog7"] |
    35|ad_102|http://www.sohu.com |["catalog6","catalog3"] |
    35|ad_105|http://sina.com |NULL |
    77|ad_103|http://www.baidu.com |["catalog7"] |
    77|ad_104|http://www.qq.com |["catalog5","catalog1","catalog4","catalog9"]|
    99|ad_102|http://www.sohu.com |["catalog6","catalog3"] |
    99|ad_105|http://sina.com |NULL |
  3. Sort the elements in catalogs

    select ad_id,url,sort_array(catalogs) as catalogs from ad_tbl;

    Result:

    ad_id |url                  |catalogs                                     |
    ------|---------------------|---------------------------------------------|
    ad_101|http://www.google.com|["catalog1","catalog8"] |
    ad_102|http://www.sohu.com |["catalog3","catalog6"] |
    ad_103|http://www.baidu.com |["catalog7"] |
    ad_104|http://www.qq.com |["catalog1","catalog4","catalog5","catalog9"]|
    ad_105|http://sina.com |NULL |

Columns to rows

Query each ad's catalog details, one catalog per row

select
ad_id,catalog
from
ad_tbl
lateral view explode(catalogs) t as catalog;

Result:

ad_id |catalog |
------|--------|
ad_101|catalog8|
ad_101|catalog1|
ad_102|catalog6|
ad_102|catalog3|
ad_103|catalog7|
ad_104|catalog5|
ad_104|catalog1|
ad_104|catalog4|
ad_104|catalog9|

Note: if the catalogs column of ad_tbl were a string, it would have to be converted to an array before exploding, i.e. use split to break the field on '|' so that it returns an array:

lateral view outer explode(split(catalogs,'\\|')) t as catalog
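
A minimal sketch of the full query under that assumption, where ad_tbl_str is a hypothetical table identical to ad_tbl except that catalogs is a plain string:

select ad_id, catalog
from ad_tbl_str
lateral view outer explode(split(catalogs, '\\|')) t as catalog;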

Common Functions

Look up how to use a function with: desc function extended <function name>;

Date and time functions

current_date: returns the current date

select current_date		--2019-12-21

current_timestamp: returns the current timestamp

select current_timestamp	--2019-12-21 02:23:44

unix_timestamp: converts a time string to epoch seconds

select unix_timestamp('2019-08-15 16:40:00','yyyy-MM-dd HH:mm:ss')   --1565858400

from_unixtime: converts epoch seconds to a time string

select from_unixtime(1565858389,'yyyy-MM-dd HH:mm:ss')  --2019-08-15 16:39:49
select from_unixtime(cast(substr(1553184000488,1,10) as int),'yyyy-MM-dd HH:mm:ss') --2019-03-22 00:00:00
select from_unixtime(unix_timestamp(),'yyyy-MM-dd HH:mm:ss') -- 2019-08-15 17:18:55

to_date: returns the date part of a timestamp

SELECT to_date('2009-07-30 04:17:52') FROM src LIMIT 1;		--2009-07-30

year: returns the year

SELECT year('2009-07-30') FROM src LIMIT 1;		--2009

month: returns the month

day: returns the day of the month

hour: returns the hour

minute: returns the minute

second: returns the second

SELECT second('2009-07-30 12:58:59') FROM src LIMIT 1; 		--59
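
The remaining date-part functions follow the same pattern; a minimal sketch with the expected results as comments:

SELECT month('2009-07-30 12:58:59');	--7
SELECT day('2009-07-30 12:58:59');	--30
SELECT hour('2009-07-30 12:58:59');	--12
SELECT minute('2009-07-30 12:58:59');	--58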

date_add: adds the given number of days to a date

SELECT date_add('2009-07-30', 1) FROM src LIMIT 1;		--2009-07-31

date_sub: subtracts the given number of days from a date

SELECT date_sub('2009-07-30', 1) FROM src LIMIT 1;		--2009-07-29

Numeric functions

round: rounds a number to the given number of decimal places

ceil: rounds up, returning the smallest integer that is not less than the value

floor: rounds down, returning the largest integer that is not greater than the value

abs: returns the absolute value

least: returns the smallest of its arguments (a row-wise min)

SELECT least(2, 3, 1) FROM src LIMIT 1;		--1
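
A hedged sketch of the numeric functions above, with the expected results as comments:

SELECT round(3.567, 2);		--3.57
SELECT ceil(3.1);		--4
SELECT floor(3.9);		--3
SELECT abs(-7);			--7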

String functions

substr: returns a substring

concat: concatenates strings

concat_ws: joins strings using the given separator

SELECT concat_ws('.', 'www', array('facebook', 'com')) FROM src LIMIT 1;	--www.facebook.com

length: returns the length of a string as an integer

split: splits a string and returns an array

upper: converts a string to upper case

lower: converts a string to lower case
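
A hedged sketch of the string functions above, with the expected results as comments:

SELECT substr('facebook', 1, 4);		--face
SELECT concat('www', '.', 'facebook.com');	--www.facebook.com
SELECT length('facebook');			--8
SELECT split('a,b,c', ',');			--["a","b","c"]
SELECT upper('abc');				--ABC
SELECT lower('ABC');				--abc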

JSON processing functions

  1. Step 1: prepare the data

    {"movie":"1","rate":"5","time":"978300760","userid":"1"}
    {"movie":"2","rate":"4","time":"978702109","userid":"1"}
    {"movie":"3","rate":"3","time":"978401968","userid":"1"}
    {"movie":"4","rate":"4","time":"978300275","userid":"1"}
    {"movie":"5","rate":"3","time":"978801091","userid":"1"}
  2. Step 2: create the table

    create table rating_json(
    json string
    );
  3. Step 3: load the data

    load data local inpath '/home/hadoop/data/rating.txt' into table rating_json;
  4. Step 4: view the data

    json                                                    |
    --------------------------------------------------------|
    {"movie":"1","rate":"5","time":"978300760","userid":"1"}|
    {"movie":"2","rate":"4","time":"978702109","userid":"1"}|
    {"movie":"3","rate":"3","time":"978401968","userid":"1"}|
    {"movie":"4","rate":"4","time":"978300275","userid":"1"}|
    {"movie":"5","rate":"3","time":"978801091","userid":"1"}|
  5. Process the data

    json_tuple: extracts the named fields from a JSON string; the results are strings

    select json_tuple(json,"movie","rate","time","userid") as (movie,rate,time,userid) from rating_json;

    movie|rate|time     |userid|
    -----|----|---------|------|
    1 |5 |978300760|1 |
    2 |4 |978702109|1 |
    3 |3 |978401968|1 |
    4 |4 |978300275|1 |
    5 |3 |978801091|1 |

    parse_url_tuple: extracts the named parts of a URL; the results are strings

    select parse_url_tuple("https://www.baidu.com/bigdate/spark?cookie_id=10",'HOST','PATH','QUERY','QUERY:cookie_id');

    c0           |c1            |c2          |c3|
    -------------|--------------|------------|--|
    www.baidu.com|/bigdate/spark|cookie_id=10|10|

NULL handling functions

isnull: returns true if the value is null

isnotnull: returns true if the value is not null
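
A minimal sketch, with the expected results as comments:

SELECT isnull(cast(null as string));	--true
SELECT isnotnull('abc');		--true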

elt: returns the n-th of the string arguments that follow the index

SELECT elt(1, 'face', 'book') FROM src LIMIT 1;		--face

nvl: replaces a null value with a default value

SELECT nvl(null,'bla') FROM src LIMIT 1;	--bla

Other functions

cast: converts between data types

cast(time as bigint)	-- time to a number
cast(string as date)	-- string to a date

Note: binary can only be cast to string.
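
A minimal runnable sketch of the cast patterns above (the literals are only examples):

select cast('100' as bigint);			--string to number: 100
select cast('2019-12-21' as date);		--string to date: 2019-12-21
select cast(cast('abc' as binary) as string);	--binary back to string: abc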

Static and Dynamic Partitioned Tables

Preparation

  1. Source data

    1,jack,shanghai,20190129
    2,kevin,beijing,20190130
    3,lucas,hangzhou,20190129
    4,lily,hangzhou,20190130
  2. Create the source table (external table)

    create external table test_tbl(
    id int,
    name string,
    address string,
    day string
    )
    row format delimited
    fields terminated by ",";
  3. Load the data:

    load data local inpath '/home/hadoop/data/partition.txt' into table test_tbl;
  4. Create the partitioned table (external table)

    create external table prit_tbl(
    id int,
    name string,
    address string
    )
    partitioned by (day string)
    row format delimited
    fields terminated by ",";

Loading data into static partitions

  • Drawback of static partitions: every write must explicitly specify the partition date.

    insert overwrite table prit_tbl partition(day="20190129") select id,name,address from test_tbl where day = "20190129" ;

    Note: the select list must not include the partition column day; otherwise the statement fails.
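
    Loading the other date means repeating the statement with only the partition value changed, which is exactly the drawback described above:

    insert overwrite table prit_tbl partition(day="20190130") select id,name,address from test_tbl where day = "20190130";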

Loading data with dynamic partitions

  • View the table's partitions

    show partitions prit_tbl;

    partition   |
    ------------|
    day=20190129|
    day=20190130|
  • Partitions are recognized automatically and do not need to be specified explicitly

    set hive.exec.dynamic.partition=true;
    set hive.exec.dynamic.partition.mode=nonstrict;
    insert into table prit_tbl partition(day) select * from test_tbl;
  • Verify with queries:

    select * from prit_tbl;
    select * from prit_tbl where day = 20190129;
    select * from prit_tbl where day = 20190130;

    Verify in the HDFS web UI

Notes on partitioning

  1. Avoid dynamic partitioning where possible: during a dynamic-partition insert, reducers are allocated for each partition, so when there are many partitions the reducer count grows, which is a disaster for the servers.

  2. Difference between dynamic and static partitions: a static partition is created whether or not there is data for it; a dynamic partition is created only when the result set contains data for it.

  3. Hive's dynamic-partition strict mode, and the strict mode provided by hive.mapred.mode, exist to keep users from accidentally submitting harmful HQL.

    hive.mapred.mode = nonstrict | strict

    When the value is strict, the following three kinds of queries are blocked (a sketch follows the list):

    1. Queries on a partitioned table whose where clause does not filter on a partition column.
    2. Cartesian-product joins, i.e. join statements without an on condition or a where condition.
    3. order by queries that do not have a limit clause.
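
    A hedged sketch of the effect, using the partitioned table from the previous section:

    set hive.mapred.mode=strict;
    -- blocked: scans a partitioned table without a filter on the partition column
    select * from prit_tbl;
    -- allowed: the where clause filters on the partition column day
    select * from prit_tbl where day = '20190129';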

Bucketed Tables

The concept of bucketing

Partitioning is a convenient way to isolate data and optimize queries, but not every dataset can be split into sensible partitions. For a table or a partition, Hive can further organize the data into buckets, a finer-grained division of the data range.

Bucketing is another technique for breaking a dataset into parts that are easier to manage.

Partitioning works on the storage path of the data; bucketing works on the data files.

Benefits of bucketing

  • Bucketing rule: the value of the bucketing column is hashed, and the hash is divided by the number of buckets; the remainder determines which bucket the record goes to, so records with the same remainder end up in the same bucket.
  • Benefits:
    1. More efficient join queries
    2. More efficient sampling

Bucketing in practice

  1. Prepare the data

    1,name1
    2,name2
    3,name3
    4,name4
    5,name5
    6,name6
    7,name7
    8,name8
    9,name9
  2. Create the bucketed table

    create external table bucket_tbl(
    id int,
    name string
    )
    clustered by(id)
    into 4 buckets
    row format delimited
    fields terminated by ",";
  3. Create a staging table

    create external table mid_tbl(
    id int,
    name string
    )
    row format delimited
    fields terminated by ",";
  4. Load the data into the staging table

    load data local inpath '/home/hadoop/data/bucket.txt' into table mid_tbl;
  5. Enable enforced bucketing

    set hive.enforce.bucketing = true; -- not needed in Hive 2.x
    set mapreduce.job.reduces=-1;

    In Hive 0.x and 1.x you must set hive.enforce.bucketing = true to enforce bucketing; it allows Hive to choose the correct number of reducers and the cluster by column automatically based on the table definition.

  6. Insert the data into the bucketed table

    insert into table bucket_tbl select * from mid_tbl;
  7. Inspect the result

    In HDFS, buckets exist as files, not as directories the way partitions do.

    Bucket files
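
    A hedged sketch of inspecting the bucket files from the Hive CLI (the warehouse path is an assumption; substitute the actual table location). With 4 buckets and an int key, hash(id) is the value itself, so id % 4 decides the file:

    dfs -ls /user/hive/warehouse/bucket_tbl;
    -- expected: 000000_0 holds 4,8; 000001_0 holds 1,5,9; 000002_0 holds 2,6; 000003_0 holds 3,7
    dfs -cat /user/hive/warehouse/bucket_tbl/000000_0;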

Sorted bucketed tables

  • To keep each bucket sorted by id in ascending order, create the table like this

    create table test_bucket_sorted (
    id int comment 'ID',
    name string comment 'name'
    )
    comment 'test bucketed table'
    clustered by(id)
    sorted by (id)
    into 4 buckets
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ;

    create external table bucket_tbl(
    id int,
    name string
    )
    clustered by(id) sorted by (id) into 4 buckets
    row format delimited
    fields terminated by ",";

    Note: as before, the data must be inserted through the staging table (see the sketch after this list).

  • Benefit:

    Because the data in each bucket is sorted, joining bucket against bucket becomes an efficient sort-merge join.
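
A minimal sketch of loading the sorted bucketed table through the same staging table:

insert into table test_bucket_sorted select id, name from mid_tbl;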

Sampling queries on bucketed tables

For very large datasets, users sometimes need a representative sample of the query results rather than the full result set. Hive supports this by sampling the table.

SQL example

hive> select * from test_bucket tablesample (bucket 1 out of 2);
OK
8 name8
4 name4
2 name2
6 name6
hive> select * from test tablesample (bucket 1 out of 2 on id);
OK
2 name2
8 name8
4 name4
6 name6

Differences:

  • For a bucketed table, the trailing on <column> can be omitted, in which case the bucketing column is used by default; it can also be written explicitly. A non-bucketed table must specify it.
  • When sampling on the bucketing column, a bucketed table fetches data directly from the corresponding buckets, which makes sampling faster on large tables.

Syntax:

tablesample (bucket x out of y on id);

x is the bucket to start from and y is the number of parts to divide into; you can also read x as the numerator and y as the denominator: the table is divided into y parts (buckets), and the x-th part (bucket) is taken.

This puts a requirement on the bucketed table: y must be a multiple or a factor of the number of buckets.

  • x=1, y=2: takes data from 2 (4/y) buckets, namely bucket 1 and bucket 3 (1+y)

  • x=1, y=4: takes data from 1 (4/y) bucket, namely bucket 1

  • x=2, y=8: takes 1/2 (4/y) of one bucket's data, namely half of bucket 2

    x must be less than or equal to y

Author: Tunan
Link: http://yerias.github.io/2018/11/05/hive/5/
Copyright Notice: All articles in this blog are licensed under CC BY-NC-SA 4.0 unless stated otherwise.