pyspark.pandas.DataFrame.nunique#
- DataFrame.nunique(axis=0, dropna=True, approx=False, rsd=0.05)[source]#
Return number of unique elements in the object.
Excludes NA values by default.
- Parameters
- axis{0 or ‘index’, 1 or ‘columns’}, default 0
The axis to use. 0 or ‘index’ for row-wise (count unique values per column), 1 or ‘columns’ for column-wise (count unique values per row).
- dropnabool, default True
Don’t include NaN in the count.
- approx: bool, default False
If False, the exact algorithm is used and the exact number of unique values is returned. If True, the HyperLogLog approximate algorithm is used, which is significantly faster for large amounts of data. Note: This parameter is specific to pandas-on-Spark and does not exist in pandas. For axis=1, this parameter is ignored and exact counting is always used.
- rsd: float, default 0.05
Maximum estimation error allowed in the HyperLogLog algorithm. Note: Just like approx, this parameter is specific to pandas-on-Spark. For axis=1, this parameter is ignored.
- Returns
- Series
The number of unique values per column (axis=0) or per row (axis=1).
Examples
>>> df = ps.DataFrame({'A': [1, 2, 3], 'B': [np.nan, 3, np.nan]})
>>> df.nunique()
A    3
B    1
dtype: int64
>>> df.nunique(dropna=False)
A    3
B    2
dtype: int64
>>> df.nunique(axis=1)
0    1
1    2
2    1
dtype: int32
>>> df.nunique(axis=1, dropna=False)
0    2
1    2
2    2
dtype: int32
On big data, we recommend using the approximate algorithm to speed up this function. The result will be very close to the exact unique count, within the error bound set by rsd.
>>> df.nunique(approx=True)
A    3
B    1
dtype: int64
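For the exact path (approx=False), pandas-on-Spark mirrors the semantics of pandas' own DataFrame.nunique. A minimal sketch of the same dropna and axis behavior using plain pandas, for comparison:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [np.nan, 3, np.nan]})

# Per-column unique counts; NaN excluded by default,
# so column B counts only the single non-NaN value 3.
per_column = df.nunique()

# dropna=False counts NaN as one distinct value,
# so column B now counts {NaN, 3} -> 2.
with_nan = df.nunique(dropna=False)

# axis=1 counts unique values per row instead of per column.
per_row = df.nunique(axis=1)

print(per_column.tolist())  # [3, 1]
print(with_nan.tolist())    # [3, 2]
print(per_row.tolist())     # [1, 2, 1]
```

The approx and rsd parameters have no pandas counterpart; with plain pandas the count is always exact.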