Case study: precision loss in Spark/Hive when 'String = numeric type' is compared
Problem description: when executing SQL in Spark/Hive, rows were joined even though their keys are not equal:
select a.id, b.clue_id, dt, `timestamp`
from a
left join b on a.id = b.clue_id  -- a.id type: bigint, b.clue_id type: string
where a.id = 1734933297158217731
a.id                 b.clue_id            dt          timestamp
1734933297158217731  1734933297158217731  2023-12-14  2023-12-14 16:51:32
1734933297158217731  1734933297158217729  2023-12-14  2023-12-14 16:53:06
1734933297158217731  1734933297158217730  2023-12-14  2023-12-14 16:51:32
Root cause
When Hive/Spark executes SQL with a join condition of the form String type = numeric type (double, int, bigint, etc.), the engine by default casts both sides to double, which loses precision. The execution plan looks like this:
Filter (id#27L = 1734933297158217731)
+- Join LeftOuter, (if ((isnull(followup_kfid#12) || (followup_kfid#12 = ))) kf_id#33 else followup_kfid#12 = kf_id#136)
:- Join LeftOuter, (cast(id#27L as double) = cast(clue_id#63 as double))
: :- SubqueryAlias `a`
Let's verify: in the plan above, the join condition (cast(id#27L as double) = cast(clue_id#63 as double)) shows that both sides are compared as double rather than as their original types.
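The precision loss can also be reproduced directly in a spark-sql (or Hive) session; a minimal sketch, using the literals from the case above:

-- both literals round to the same double, so the comparison returns true
select cast(1734933297158217731 as double) = cast('1734933297158217729' as double);
-- comparing as bigint keeps full precision, so the comparison returns false
select cast(1734933297158217731 as bigint) = cast('1734933297158217729' as bigint);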
Solution: adjust the SQL so the string side is explicitly cast to the numeric type of the other side:
Before: a left join b on a.id = b.clue_id where a.id = 1734933297158217731
After:  a left join b on a.id = cast(b.clue_id as bigint) where a.id = 1734933297158217731
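To confirm the fix, re-run explain on the adjusted query (a sketch, assuming the same tables a and b as above); the double casts should be gone from the join condition:

explain
select a.id, b.clue_id, dt, `timestamp`
from a
left join b on a.id = cast(b.clue_id as bigint)
where a.id = 1734933297158217731;
-- the join condition in the plan should now read something like
--   Join LeftOuter, (id#27L = cast(clue_id#63 as bigint))
-- so only exact bigint key matches are joined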