Hive函数——窗口函数
准备数据
[caixiaoyu@gw3 ~]$ cat tmp_order.txt
1001,20170101,20170101,创建,150
1001,20170101,20170104,支付,150
1001,20170101,20170106,支付,150
1001,20170101,20170107,完成,150
1001,20170101,20170108,完成,150
1005,20170101,20170101,创建,270
1005,20170101,20170102,支付,270
1005,20170101,20170103,支付,270
1005,20170101,20170104,支付,270
1005,20170101,20170105,取消,270
1005,20170101,20170109,取消,270
1008,20170102,20170102,创建,300
1008,20170102,20170105,取消,300
建表
CREATE EXTERNAL TABLE `tmp_order`( `id` int, `order_date` string, `update_date` string, `state` string, `fee` int) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' WITH SERDEPROPERTIES ( 'field.delim'=',', 'line.delim'='\n') STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' OUTPUTFORMAT ' org.apache.hadoop.hive.ql.io .HiveIgnoreKeyTextOutputFormat' LOCATION ' hdfs://ns1/tmp/caixiaoyu/tmp_order '
| tmp_order.order_date | tmp_order.update_date | tmp_order.state | tmp_order.fee | |
---|---|---|---|---|---|
1 | 1001 | 20170101 | 20170101 | 创建 | 150 |
2 | 1001 | 20170101 | 20170104 | 支付 | 150 |
3 | 1001 | 20170101 | 20170106 | 支付 | 150 |
4 | 1001 | 20170101 | 20170107 | 完成 | 150 |
5 | 1001 | 20170101 | 20170108 | 完成 | 150 |
6 | 1005 | 20170101 | 20170101 | 创建 | 270 |
7 | 1005 | 20170101 | 20170102 | 支付 | 270 |
8 | 1005 | 20170101 | 20170103 | 支付 | 270 |
9 | 1005 | 20170101 | 20170104 | 支付 | 270 |
10 | 1005 | 20170101 | 20170105 | 取消 | 270 |
11 | 1005 | 20170101 | 20170109 | 取消 | 270 |
12 | 1008 | 20170102 | 20170102 | 创建 | 300 |
13 | 1008 | 20170102 | 20170105 | 取消 | 300 |
窗口函数
LAG
LAG(col,n,DEFAULT) 用于统计窗口内往上第n行值
第一个参数为列名,第二个参数为往上第n行(可选,默认为1),第三个参数为默认值(当往上第n行为NULL时候,取默认值,如不指定,则为NULL)
SELECT id, order_date,update_date, state, fee, ROW_NUMBER() OVER(PARTITION BY id ORDER BY update_date) AS num, LAG(update_date, 1) OVER(PARTITION BY id ORDER BY update_date) AS end_date FROM tmp_order
| id | order_date | update_date | state | fee | num | end_date |
---|---|---|---|---|---|---|---|
1 | 1001 | 20170101 | 20170101 | 创建 | 150 | 1 | NULL |
2 | 1001 | 20170101 | 20170104 | 支付 | 150 | 2 | 20170101 |
3 | 1001 | 20170101 | 20170106 | 支付 | 150 | 3 | 20170104 |
4 | 1001 | 20170101 | 20170107 | 完成 | 150 | 4 | 20170106 |
5 | 1001 | 20170101 | 20170108 | 完成 | 150 | 5 | 20170107 |
6 | 1005 | 20170101 | 20170101 | 创建 | 270 | 1 | NULL |
7 | 1005 | 20170101 | 20170102 | 支付 | 270 | 2 | 20170101 |
8 | 1005 | 20170101 | 20170103 | 支付 | 270 | 3 | 20170102 |
9 | 1005 | 20170101 | 20170104 | 支付 | 270 | 4 | 20170103 |
10 | 1005 | 20170101 | 20170105 | 取消 | 270 | 5 | 20170104 |
11 | 1005 | 20170101 | 20170109 | 取消 | 270 | 6 | 20170105 |
12 | 1008 | 20170102 | 20170102 | 创建 | 300 | 1 | NULL |
13 | 1008 | 20170102 | 20170105 | 取消 | 300 | 2 | 20170102 |
LEAD
与LAG相反
LEAD(col,n,DEFAULT) 用于统计窗口内往下第n行值
第一个参数为列名,第二个参数为往下第n行(可选,默认为1),第三个参数为默认值(当往下第n行为NULL时候,取默认值,如不指定,则为NULL)
SELECT id, order_date,update_date, state, fee, ROW_NUMBER() OVER(PARTITION BY id ORDER BY update_date) AS num, LEAD(update_date, 1) OVER(PARTITION BY id ORDER BY update_date) AS lead_date FROM tmp_order
| id | order_date | update_date | state | fee | num | lead_date |
---|---|---|---|---|---|---|---|
1 | 1001 | 20170101 | 20170101 | 创建 | 150 | 1 | 20170104 |
2 | 1001 | 20170101 | 20170104 | 支付 | 150 | 2 | 20170106 |
3 | 1001 | 20170101 | 20170106 | 支付 | 150 | 3 | 20170107 |
4 | 1001 | 20170101 | 20170107 | 完成 | 150 | 4 | 20170108 |
5 | 1001 | 20170101 | 20170108 | 完成 | 150 | 5 | NULL |
6 | 1005 | 20170101 | 20170101 | 创建 | 270 | 1 | 20170102 |
7 | 1005 | 20170101 | 20170102 | 支付 | 270 | 2 | 20170103 |
8 | 1005 | 20170101 | 20170103 | 支付 | 270 | 3 | 20170104 |
9 | 1005 | 20170101 | 20170104 | 支付 | 270 | 4 | 20170105 |
10 | 1005 | 20170101 | 20170105 | 取消 | 270 | 5 | 20170109 |
11 | 1005 | 20170101 | 20170109 | 取消 | 270 | 6 | NULL |
12 | 1008 | 20170102 | 20170102 | 创建 | 300 | 1 | 20170105 |
13 | 1008 | 20170102 | 20170105 | 取消 | 300 | 2 | NULL |
FIRST_VALUE
取分组内排序后,截止到当前行,第一个值
SELECT id, order_date, update_date, state, fee, ROW_NUMBER() OVER(PARTITION BY id ORDER BY update_date) AS num, FIRST_VALUE(update_date) OVER(PARTITION BY id ORDER BY update_date) AS first_date FROM tmp_order
| id | order_date | update_date | state | fee | num | first_date |
---|---|---|---|---|---|---|---|
1 | 1001 | 20170101 | 20170101 | 创建 | 150 | 1 | 20170101 |
2 | 1001 | 20170101 | 20170104 | 支付 | 150 | 2 | 20170101 |
3 | 1001 | 20170101 | 20170106 | 支付 | 150 | 3 | 20170101 |
4 | 1001 | 20170101 | 20170107 | 完成 | 150 | 4 | 20170101 |
5 | 1001 | 20170101 | 20170108 | 完成 | 150 | 5 | 20170101 |
6 | 1005 | 20170101 | 20170101 | 创建 | 270 | 1 | 20170101 |
7 | 1005 | 20170101 | 20170102 | 支付 | 270 | 2 | 20170101 |
8 | 1005 | 20170101 | 20170103 | 支付 | 270 | 3 | 20170101 |
9 | 1005 | 20170101 | 20170104 | 支付 | 270 | 4 | 20170101 |
10 | 1005 | 20170101 | 20170105 | 取消 | 270 | 5 | 20170101 |
11 | 1005 | 20170101 | 20170109 | 取消 | 270 | 6 | 20170101 |
12 | 1008 | 20170102 | 20170102 | 创建 | 300 | 1 | 20170102 |
13 | 1008 | 20170102 | 20170105 | 取消 | 300 | 2 | 20170102 |
LAST_VALUE
取分组内排序后,截止到当前行,最后一个值
SELECT id, order_date, update_date, state, fee, ROW_NUMBER() OVER(PARTITION BY id ORDER BY update_date) AS num, LAST_VALUE(update_date) OVER(PARTITION BY id ORDER BY update_date) AS last_date FROM tmp_order
OVER从句
用法
使用标准的聚合函数COUNT、SUM、MIN、MAX、AVG
使用PARTITION BY语句,使用一个或者多个原始数据类型的列
使用PARTITION BY与ORDER BY语句,使用一个或者多个数据类型的分区或者排序列
使用窗口规范,语法如下
窗口规范
基本语法
ROWS betweenCURRENT ROW | UNBOUNDED PRECEDING | [num] PRECEDING AND UNBOUNDED FOLLOWING | [num] FOLLOWING| CURRENT ROW
或
RANGE between [num] PRECEDING AND [num]FOLLOWING
ROWS是物理窗口,从行数上控制窗口的尺寸的;
RANGE是逻辑窗口,从列值上控制窗口的尺寸。
窗口从句条件
当ORDER BY后面缺少窗口从句条件,窗口规范默认是 RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.
当ORDER BY和窗口从句都缺失, 窗口规范默认是 ROW BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING.
OVER从句支持以下函数, 但是并不支持和窗口一起使用它们。
Ranking函数: Rank, NTile, DenseRank, CumeDist, PercentRank.
Lead 和 Lag 函数.
实例
SELECT id, order_date, update_date, state, fee, MAX(update_date) OVER(PARTITION BY id ORDER BY update_date) AS date1, -- 默认为从起点到当前行 MAX(update_date) OVER(PARTITION BY id ORDER BY update_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS date2, --从起点到当前行,结果同date1 MAX(update_date) OVER(PARTITION BY id) AS date3, --分组内所有行 MAX(update_date) OVER(PARTITION BY id ORDER BY update_date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS date4, --当前行+往前2行 MAX(update_date) OVER(PARTITION BY id ORDER BY update_date ROWS BETWEEN 2 PRECEDING AND 1 FOLLOWING) AS date5, --当前行+往前2行+往后1行 MAX(update_date) OVER(PARTITION BY id ORDER BY update_date ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS date6 ---当前行+往后所有行 FROM tmp_order ORDER BY id, update_date
