一、问题现象

  1. hive 执行 insert overwrite table dwd_prod_info_frequent_di partition(dt='20230225',rgn='cn') 双分区的形式。
INFO  : Moving data to directory hdfs://nameservice1/data/hive/warehouse/2c/dwd/dwd_prod_info_frequent_di/dt=20230225/rgn=cn/.hive-staging_hive_2023-02-26_22-18-23_057_6764590617411203663-4519/-ext-10000 from hdfs://nameservice1/data/hive/warehouse/ninebot_dw_cn_2c/dwd/dwd_prod_t_bms_info_frequent_di/dt=20230225/rgn=cn/.hive-staging_hive_2023-02-26_22-18-23_057_6764590617411203663-4519/-ext-10002
INFO  : Starting task [Stage-0:MOVE] in serial mode
INFO  : Loading data to table ninebot_dw_cn_2c.dwd_prod_t_bms_info_frequent_di partition (dt=20230225, rgn=cn) from hdfs://nameservice1/data/hive/warehouse/ninebot_dw_cn_2c/dwd/dwd_prod_t_bms_info_frequent_di/dt=20230225/rgn=cn/.hive-staging_hive_2023-02-26_22-18-23_057_6764590617411203663-4519/-ext-10000
ERROR : FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MoveTask. Unable to move source hdfs://nameservice1/data/hive/warehouse/ninebot_dw_cn_2c/dwd/dwd_prod_t_bms_info_frequent_di/dt=20230225/rgn=cn/.hive-staging_hive_2023-02-26_22-18-23_057_6764590617411203663-4519/-ext-10000 to destination hdfs://nameservice1/data/hive/warehouse/ninebot_dw_cn_2c/dwd/dwd_prod_t_bms_info_frequent_di/dt=20230225/rgn=cn
INFO  : MapReduce Jobs Launched: 
INFO  : Stage-Stage-1: Map: 17   Cumulative CPU: 94943.39 sec   HDFS Read: 4486454836 HDFS Write: 15886513875 HDFS EC Read: 0 SUCCESS
INFO  : Total MapReduce CPU Time Spent: 1 days 2 hours 22 minutes 23 seconds 390 msec
INFO  : Completed executing command(queryId=hive_20230226221823_ff9fc631-15fc-4ebf-871f-36265e72b8fd); Time taken: 6032.278 seconds
Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MoveTask. Unable to move source hdfs://nameservice1/data/hive/warehouse/ninebot_dw_cn_2c/dwd/dwd_prod_t_bms_info_frequent_di/dt=20230225/rgn=cn/.hive-staging_hive_2023-02-26_22-18-23_057_6764590617411203663-4519/-ext-10000 to destination hdfs://nameservice1/data/hive/warehouse/ninebot_dw_cn_2c/dwd/dwd_prod_t_bms_info_frequent_di/dt=20230225/rgn=cn
taskId-3318086,error

二、持续时间

连续好几天的任务都报了相同的错误

三、问题原因

使用的CDH hive 2.1.1-cdh6.1.1 版本,上网查资源说是bug,并没有修复

四、解决问题

-- 这个参数默认了cdh的版本并没有开启,所以手动开启一下
set hive.warehouse.subdir.inherit.perms = false;
-- 
set hive.insert.into.multilevel.dirs= true;

最终每天的周期调度还是失败,但是手动删除文件后就在重跑就可以了,问题比较奇怪,所以我从hive改成 spark ,改成spark 之后任务不失败了,并且发现了问题的端倪。

突然间发现不仅有spark 写的文件 ,还有hive生成的文件,所以猜测到有可能是谁的表location写错了。导致写文件冲突了,由于hive每次生成的文件名称都是固定的按照编号,那么如果两个hive同时写一个文件的话,就会出现move过去相同的文件,也就导致move报错。
既然猜测是 location 写错了,那么就要定位到底是那个表的location写错了。于是登录hive得元数据库。


MySQL [metastore]> select
    -> t1.TBL_NAME,
    -> t2.LOCATION
    -> from (
    -> select * from TBLS
    -> ) t1 join (
    -> select * from SDS
    -> where LOCATION like '%data/hive/warehouse/2c/dwd/dwd_prod_info_frequent_di%'
    -> ) t2 on t1.SD_ID=t2.SD_ID;
+------------------------------------+----------------------------------------------------------------------------------------------+
| TBL_NAME                           | LOCATION                                                                                     |
+------------------------------------+----------------------------------------------------------------------------------------------+
| dwd_prod_bms_info_frequent_v_di | hdfs://nameservice1/data/hive/warehouse/2c/dwd/dwd_prod_info_frequent_di |
| dwd_prod_info_frequent_di    | hdfs://nameservice1/data/hive/warehouse/2c/dwd/dwd_prod_info_frequent_di |
+------------------------------------+----------------------------------------------------------------------------------------------+

查询得到 有两张表的location相同了;
查询一下 创建时间;

MySQL [metastore]> select * from TBLS where TBL_NAME = 'dwd_prod_bms_info_frequent_v_di';
+--------+-------------+--------+------------------+-------+------------+-----------+---------+------------------------------------+----------------+--------------------+--------------------+
| TBL_ID | CREATE_TIME | DB_ID  | LAST_ACCESS_TIME | OWNER | OWNER_TYPE | RETENTION | SD_ID   | TBL_NAME                           | TBL_TYPE       | VIEW_EXPANDED_TEXT | VIEW_ORIGINAL_TEXT |
+--------+-------------+--------+------------------+-------+------------+-----------+---------+------------------------------------+----------------+--------------------+--------------------+
| 737871 |  1677150410 | 386004 |                0 | simba | USER       |         0 | 4460486 | dwd_prod_bms_info_frequent_v_di | EXTERNAL_TABLE | NULL               | NULL               |
+--------+-------------+--------+------------------+-------+------------+-----------+---------+------------------------------------+----------------+--------------------+--------------------+
1 row in set (0.00 sec)

正好和出现问题的时间差不多。所以根本问题是move文件的时候文件名称相同冲突导致的。
于是修改表的location 问题解决。

五、参考资料

https://www.notion.so/20230227-hive-move-8b1ecfbe901547faa7011d27ddc94e74?pvs=4#3be0a81c98924d47a4b2507b1b523402
https://www.notion.so/20230227-hive-move-8b1ecfbe901547faa7011d27ddc94e74?pvs=4#5eebdf49984d499588226ff2c087fb7c