pyspark.sql.functions.tuple_union_double

pyspark.sql.functions.tuple_union_double(col1, col2, lgNomEntries=None, mode=None)

Returns the union of two DataSketches TupleSketch objects with double summaries. The summaries of keys present in both sketches are combined according to mode.

New in version 4.2.0.

Parameters

col1 (Column or column name): The first TupleSketch column.
col2 (Column or column name): The second TupleSketch column.
lgNomEntries (Column or int, optional): The log-base-2 of the nominal number of entries. Must be between 4 and 26; defaults to 12.
mode (Column or str, optional): The summary mode: "sum" (default), "min", "max", or "alwaysone".

Returns

Column: The binary representation of the merged TupleSketch.

See also

pyspark.sql.functions.tuple_sketch_agg_double()
pyspark.sql.functions.tuple_union_agg_double()
pyspark.sql.functions.tuple_intersection_double()

Examples

>>> from pyspark.sql import functions as sf
>>> df = spark.createDataFrame([(1, 10.0, 3, 30.0), (2, 20.0, 4, 40.0)], ["key1", "v1", "key2", "v2"])  # noqa
>>> df = df.agg(
...     sf.tuple_sketch_agg_double("key1", "v1").alias("sketch1"),
...     sf.tuple_sketch_agg_double("key2", "v2").alias("sketch2")
... )
>>> df.select(sf.tuple_sketch_estimate_double(sf.tuple_union_double(df.sketch1, "sketch2"))).show()  # noqa
+---------------------------------------------------------------------------+
|tuple_sketch_estimate_double(tuple_union_double(sketch1, sketch2, 12, sum))|
+---------------------------------------------------------------------------+
|                                                                        4.0|
+---------------------------------------------------------------------------+
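
The effect of the mode and lgNomEntries parameters can be pictured with a small plain-Python model. This is only an illustrative sketch: it uses exact dictionaries rather than the probabilistic DataSketches structure, the names COMBINERS, model_union, and nominal_entries are hypothetical and not part of PySpark, and it assumes that "alwaysone" sets every summary to 1.0.

```python
from typing import Callable, Dict

# Hypothetical combiners, one per documented summary mode.
COMBINERS: Dict[str, Callable[[float, float], float]] = {
    "sum": lambda a, b: a + b,
    "min": min,
    "max": max,
    "alwaysone": lambda a, b: 1.0,
}


def model_union(left: Dict[int, float], right: Dict[int, float],
                mode: str = "sum") -> Dict[int, float]:
    """Exact (non-probabilistic) model of a tuple-sketch union."""
    combine = COMBINERS[mode]
    merged: Dict[int, float] = {}
    for key in left.keys() | right.keys():
        if key in left and key in right:
            # Key seen on both sides: combine the two summaries.
            merged[key] = combine(left[key], right[key])
        else:
            # Key seen on one side only: carry its summary over.
            merged[key] = left[key] if key in left else right[key]
        if mode == "alwaysone":
            # Assumption: in this mode every summary is 1.0.
            merged[key] = 1.0
    return merged


def nominal_entries(lg_nom_entries: int = 12) -> int:
    """Convert lgNomEntries to the nominal entry count, 2 ** lgNomEntries."""
    if not 4 <= lg_nom_entries <= 26:
        raise ValueError("lgNomEntries must be between 4 and 26")
    return 1 << lg_nom_entries


a = {1: 10.0, 2: 20.0}
b = {2: 5.0, 3: 30.0}
print(sorted(model_union(a, b, "sum").items()))
print(sorted(model_union(a, b, "min").items()))
print(nominal_entries(12))
```

With disjoint keys, as in the Spark example above (keys 1, 2 and 3, 4), the model keeps all four keys, which agrees with the 4.0 estimate shown. The default lgNomEntries of 12 corresponds to 4096 nominal entries.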