问题
nnnn下面是某电商网站的订单数据,包括order_id,user_id,order_status和operate_time四个字段,我们需要找出所有恶意购买的用户。恶意购买的用户定义是:同一个用户,在任意半小时内(含),取消订单次数>=2次的就被视为恶意买家。比如该样例数据中c用户就是恶意买家。
nnnnorder_id user_id order_status operate_timen1101 a 已支付 2023-01-01 10:00:00n1102 a 已取消 2023-01-01 10:10:00n1103 a 待支付 2023-01-01 10:20:00n1104 b 已取消 2023-01-01 10:30:00n1105 a 待确认 2023-01-01 10:50:00n1106 a 已取消 2023-01-01 11:00:00n1107 b 已取消 2023-01-01 11:40:00n1108 b 已取消 2023-01-01 11:50:00n1109 b 已支付 2023-01-01 12:00:00n1110 b 已取消 2023-01-01 12:11:00n1111 c 已取消 2023-01-01 12:20:00n1112 c 已取消 2023-01-01 12:30:00n1113 c 已取消 2023-01-01 12:55:00n1114 c 已取消 2023-01-01 13:00:00
nnnn解答
nnnn典型的滑动窗口的场景。Hive中也是有滑动窗口的功能的(按数据范围开窗,range between and )。针对这个例子,窗口大小就是半小时,然后按每条数据进行滑动,在窗口内判断该条数据对应的用户是否是恶意用户。
nnnnwith temp as (nselect 1101 as order_id ,'a' as user_id ,"已支付" as order_status, "2023-01-01 10:00:00" as operate_timenunion all nselect 1102 as order_id ,'a' as user_id ,"已取消" as order_status, "2023-01-01 10:10:00" as operate_timenunion all nselect 1103 as order_id ,'a' as user_id ,"待支付" as order_status, "2023-01-01 10:20:00" as operate_timenunion all nselect 1104 as order_id ,'b' as user_id ,"已取消" as order_status, "2023-01-01 10:30:00" as operate_timenunion all nselect 1105 as order_id ,'a' as user_id ,"待确认" as order_status, "2023-01-01 10:50:00" as operate_timenunion all nselect 1106 as order_id ,'a' as user_id ,"已取消" as order_status, "2023-01-01 11:00:00" as operate_timenunion all nselect 1107 as order_id ,'b' as user_id ,"已取消" as order_status, "2023-01-01 11:40:00" as operate_timenunion all nselect 1108 as order_id ,'b' as user_id ,"已取消" as order_status, "2023-01-01 11:50:00" as operate_timenunion all nselect 1109 as order_id ,'b' as user_id ,"已支付" as order_status, "2023-01-01 12:00:00" as operate_timenunion all nselect 1110 as order_id ,'b' as user_id ,"已取消" as order_status, "2023-01-01 12:11:00" as operate_timenunion all nselect 1111 as order_id ,'c' as user_id ,"已取消" as order_status, "2023-01-01 12:20:00" as operate_timenunion all nselect 1112 as order_id ,'c' as user_id ,"已取消" as order_status, "2023-01-01 12:30:00" as operate_timenunion all nselect 1113 as order_id ,'c' as user_id ,"已取消" as order_status, "2023-01-01 12:55:00" as operate_timenunion all nselect 1114 as order_id ,'c' as user_id ,"已取消" as order_status, "2023-01-01 13:00:00" as operate_timen)n---这里设定tmp表中放的就是上面提供的样例数据nselectndistinct user_idnfromn(n selectn order_idn ,user_idn ,order_statusn --通过range between以当前行为锚点,圈定数据范围为operate_time为近30分钟内,然后算该范围内的取消订单数n ,count(case when order_status='已取消' then order_id end) over(partition by user_id order by operate_time range between 1800 preceding and current row) as cancel_order_cntn fromn (n selectn order_idn ,user_idn ,order_statusn ,unix_timestamp(operate_time) as operate_time ---由于range...between只能整数比较,这里先转换为秒n from tempn )t1n)t1nwhere cancel_order_cnt>=2n;n----结果nuser_idnbnc
n
Comments NOTHING