Spark SQL dropDuplicates
Spark SQL data deduplication
To remove duplicate rows from a DataFrame in Spark SQL, you can use the dropDuplicates() method.
dropDuplicates() has four overloads.
The first overload: def dropDuplicates(): Dataset[T] = dropDuplicates(this.columns)
This overload takes no arguments. It deduplicates on all columns by default and keeps the first occurrence of each row, in row order.

/**
 * Returns a new Dataset that contains only the unique rows from this Dataset.
 * This is an alias for `distinct`.
 *
 * For a static batch [[Dataset]], it just drops duplicate rows. For a streaming [[Dataset]], it
 * will keep all data across triggers as intermediate state to drop duplicates rows. You can use
 * [[withWatermark]] to limit how late the duplicate data can be and system will accordingly limit
 * the state. In addition, too late data older than watermark will be dropped to avoid any
 * possibility of duplicates.
 *
 * @group typedrel
 * @since 2.0.0
 */
def dropDuplicates(): Dataset[T] = dropDuplicates(this.columns)
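As a quick usage sketch (assuming a running SparkSession named `spark` and a made-up DataFrame), the no-argument form behaves like `distinct()`:

```scala
import spark.implicits._  // assumes an existing SparkSession in scope named `spark`

// hypothetical sample data containing one fully duplicated row
val df = Seq(("a", 1), ("a", 1), ("b", 2)).toDF("name", "value")

// deduplicates on all columns; equivalent to df.distinct()
df.dropDuplicates().show()
```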
The second overload: def dropDuplicates(colNames: Seq[String])
This overload takes a sequence of column names. Deduplication considers only the columns you specify, and again the first occurrence of each row is kept.

/**
 * (Scala-specific) Returns a new Dataset with duplicate rows removed, considering only
 * the subset of columns.
 *
 * For a static batch [[Dataset]], it just drops duplicate rows. For a streaming [[Dataset]], it
 * will keep all data across triggers as intermediate state to drop duplicates rows. You can use
 * [[withWatermark]] to limit how late the duplicate data can be and system will accordingly limit
 * the state. In addition, too late data older than watermark will be dropped to avoid any
 * possibility of duplicates.
 *
 * @group typedrel
 * @since 2.0.0
 */
def dropDuplicates(colNames: Seq[String]): Dataset[T] = withTypedPlan {
  val resolver = sparkSession.sessionState.analyzer.resolver
  val allColumns = queryExecution.analyzed.output
  val groupCols = colNames.toSet.toSeq.flatMap { (colName: String) =>
    // It is possibly there are more than one columns with the same name,
    // so we call filter instead of find.
    val cols = allColumns.filter(col => resolver(col.name, colName))
    if (cols.isEmpty) {
      throw new AnalysisException(
        s"""Cannot resolve column name "$colName" among (${schema.fieldNames.mkString(", ")})""")
    }
    cols
  }
  Deduplicate(groupCols, planWithBarrier)
}
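For example (with hypothetical column names), passing a Seq deduplicates on just that subset of columns, and an unresolvable name raises the AnalysisException seen in the source above:

```scala
import spark.implicits._  // assumes an existing SparkSession in scope named `spark`

val df = Seq(("a", 1), ("a", 2), ("b", 3)).toDF("name", "value")

// keeps one row per distinct "name"; the "value" that survives is whichever
// row is encountered first
df.dropDuplicates(Seq("name")).show()

// df.dropDuplicates(Seq("no_such_col"))  // would throw AnalysisException
```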
The third overload: def dropDuplicates(colNames: Array[String])
This overload takes an array; the method converts the array to a sequence and then calls the second overload.
/**
 * Returns a new Dataset with duplicate rows removed, considering only
 * the subset of columns.
 *
 * For a static batch [[Dataset]], it just drops duplicate rows. For a streaming [[Dataset]], it
 * will keep all data across triggers as intermediate state to drop duplicates rows. You can use
 * [[withWatermark]] to limit how late the duplicate data can be and system will accordingly limit
 * the state. In addition, too late data older than watermark will be dropped to avoid any
 * possibility of duplicates.
 *
 * @group typedrel
 * @since 2.0.0
 */
def dropDuplicates(colNames: Array[String]): Dataset[T] = dropDuplicates(colNames.toSeq)
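This overload is handy when the column list already comes as an array, for example from df.columns (a sketch with a hypothetical DataFrame `df`):

```scala
// deduplicate on every column except a hypothetical "value" column
val keyCols: Array[String] = df.columns.filter(_ != "value")
df.dropDuplicates(keyCols)
```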
The fourth overload: def dropDuplicates(col1: String, cols: String*)
This overload takes column names as individual strings (varargs); the method body combines them into a sequence and then calls the second overload.

/**
 * Returns a new [[Dataset]] with duplicate rows removed, considering only
 * the subset of columns.
 *
 * For a static batch [[Dataset]], it just drops duplicate rows. For a streaming [[Dataset]], it
 * will keep all data across triggers as intermediate state to drop duplicates rows. You can use
 * [[withWatermark]] to limit how late the duplicate data can be and system will accordingly limit
 * the state. In addition, too late data older than watermark will be dropped to avoid any
 * possibility of duplicates.
 *
 * @group typedrel
 * @since 2.0.0
 */
@scala.annotation.varargs
def dropDuplicates(col1: String, cols: String*): Dataset[T] = {
  val colNames: Seq[String] = col1 +: cols
  dropDuplicates(colNames)
}
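Since all three subset-column overloads converge on the second one, the following calls (on a hypothetical DataFrame `df` with columns `name` and `value`) are equivalent:

```scala
df.dropDuplicates("name", "value")        // varargs
df.dropDuplicates(Seq("name", "value"))   // Seq
df.dropDuplicates(Array("name", "value")) // Array, converted to a Seq internally
```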
The third and fourth overloads ultimately call the second one, so when you need to deduplicate on specific columns you can simply pass in a Seq. The first overload, which deduplicates on all columns by default, also calls the second one, passing this.columns, i.e. a Seq of all the column names.
So if you want to dig into the core deduplication code of dropDuplicates(), you only need to study the second overload. I will add a walkthrough of the core deduplication source code when I have time.
The pitfall of dropDuplicates()!
When using dropDuplicates(), I found that duplicate rows sometimes still appear in the result. I analyzed the possible causes:
1. The data lives on multiple executors.
Because Spark computes in a distributed fashion, the data is spread across different executors during computation. When dropDuplicates is applied, deduplication may happen only within a single executor, while duplicate rows may remain on other executors.
2. The data is stored in different partitions.
Because Spark computes in a distributed fashion, the data is scattered across different partitions during computation; after dropDuplicates, identical rows may still exist in different partitions.
I tested with a single executor and multiple partitions and saw no duplicates; with multiple executors, even merging the data into one partition before deduplicating still left duplicates. So the first guess seems the more likely one. But running only one executor throws away the whole point and advantage of distributed computing, so another solution is needed. If you have a good solution, feel free to share it in the comments!
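One cheap sanity check (a sketch, assuming a hypothetical DataFrame `df` with a key column `id`) is to compare counts after deduplicating; if the two numbers differ, rows with duplicate keys survived the dropDuplicates call:

```scala
val deduped = df.dropDuplicates(Seq("id"))

// if total != distinctKeys, duplicate "id" values remain in the result
val total = deduped.count()
val distinctKeys = deduped.select("id").distinct().count()
```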