pyspark.pandas.DataFrame.nunique#
- DataFrame.nunique(axis=0, dropna=True, approx=False, rsd=0.05)[source]#
Return number of unique elements in the object.
Excludes NA values by default.
- Parameters
- axis{0 or ‘index’, 1 or ‘columns’}, default 0
The axis to use. 0 or ‘index’ for row-wise (count unique values per column), 1 or ‘columns’ for column-wise (count unique values per row).
- dropnabool, default True
Don’t include NaN in the count.
- approx: bool, default False
If False, the exact algorithm is used and the exact number of unique values is returned. If True, the HyperLogLog approximate algorithm is used, which is significantly faster for large amounts of data. Note: This parameter is specific to pandas-on-Spark and does not exist in pandas. For axis=1, this parameter is ignored and exact counting is always used.
- rsd: float, default 0.05
Maximum estimation error allowed in the HyperLogLog algorithm. Note: Just like approx, this parameter is specific to pandas-on-Spark. For axis=1, this parameter is ignored.
- Returns
- Series
The number of unique values per column (axis=0) or per row (axis=1).
Examples
>>> df = ps.DataFrame({'A': [1, 2, 3], 'B': [np.nan, 3, np.nan]})
>>> df.nunique()
A    3
B    1
dtype: int64
>>> df.nunique(dropna=False)
A    3
B    2
dtype: int64
>>> df.nunique(axis=1)
0    1
1    2
2    1
dtype: int32
>>> df.nunique(axis=1, dropna=False)
0    2
1    2
2    2
dtype: int32
On big data, we recommend using the approximate algorithm to speed up this function. The result will be very close to the exact unique count, within the error bound set by rsd.
>>> df.nunique(approx=True)
A    3
B    1
dtype: int64
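For the exact path (approx=False), pandas-on-Spark mirrors the semantics of pandas' own DataFrame.nunique. A minimal sketch of the same dropna and axis behavior using plain pandas, for comparison:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [np.nan, 3, np.nan]})

# Per-column unique counts; NaN excluded by default,
# so column B counts only the single non-NaN value 3.
per_column = df.nunique()

# dropna=False counts NaN as one distinct value,
# so column B now counts {NaN, 3} -> 2.
with_nan = df.nunique(dropna=False)

# axis=1 counts unique values per row instead of per column.
per_row = df.nunique(axis=1)

print(per_column.tolist())  # [3, 1]
print(with_nan.tolist())    # [3, 2]
print(per_row.tolist())     # [1, 2, 1]
```

The approx and rsd parameters have no pandas counterpart; with plain pandas the count is always exact.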