【踩坑】SparkSQL union/unionAll 函数的去重问题
case class Employee(first_name:String)
val employeeDF1 = spark.createDataset(Seq(
Employee("Mary"),
Employee("Mandy"),
Employee("Kurt")
))
val employeeDF2 = spark.createDataset(Seq(
Employee("Mary"),
Employee("Julie"),
Employee("Mandy"),
Employee("Julie"),
Employee("Kurt")
))
employeeDF1.union(employeeDF2).show
employeeDF1.unionAll(employeeDF2).show
- 当通过
spark.sql
执行方式时,union
可以去重
employeeDF1.createOrReplaceTempView("ds1")
employeeDF2.createOrReplaceTempView("ds2")
spark.sql("select * from ds1 union select * from ds2").show
spark.sql("select * from ds1 union all select * from ds2").show
- 误区
- SQL标准查询语言 层面(如hive环境):union去重,unionAll简单合并性能较好
- Spark union 默认按列的位置直接合并,很可能字段错误合并。可使用unionByName作为替代
- 最新官方集合操作文档:https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-setops.html#set-operators