pyspark.pandas.groupby.DataFrameGroupBy.aggregate

DataFrameGroupBy.aggregate(func_or_funcs=None, *args, **kwargs)

Aggregate each group using one or more operations.

Parameters
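func_or_funcs : dict, str or list

A dict mapping each column name (string) to an aggregate function or a list of aggregate functions (string or list of strings). A single function name or a list of function names is also accepted and is applied to the non-grouping columns.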

Returns
Series or DataFrame

The return can be:

  • Series : when DataFrame.agg is called with a single function

  • DataFrame : when DataFrame.agg is called with several functions

Notes

agg is an alias for aggregate. Use the alias.

Examples

>>> import pyspark.pandas as ps
>>> df = ps.DataFrame({'A': [1, 1, 2, 2],
...                    'B': [1, 2, 3, 4],
...                    'C': [0.362, 0.227, 1.267, -0.562]},
...                   columns=['A', 'B', 'C'])
>>> df
   A  B      C
0  1  1  0.362
1  1  2  0.227
2  2  3  1.267
3  2  4 -0.562

Different aggregations per column

>>> aggregated = df.groupby('A').agg({'B': 'min', 'C': 'sum'})
>>> aggregated[['B', 'C']].sort_index()  
   B      C
A
1  1  0.589
2  3  0.705
>>> aggregated = df.groupby('A').agg({'B': ['min', 'max']})
>>> aggregated.sort_index()  
     B
   min  max
A
1    1    2
2    3    4
>>> aggregated = df.groupby('A').agg('min')
>>> aggregated.sort_index()  
     B      C
A
1    1  0.227
2    3 -0.562
>>> aggregated = df.groupby('A').agg(['min', 'max'])
>>> aggregated.sort_index()  
     B           C
   min  max    min    max
A
1    1    2  0.227  0.362
2    3    4 -0.562  1.267
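
The examples above pass func_or_funcs in three shapes: a dict, a single string, and a list of strings. As a rough illustration of how those shapes relate, the plain-Python sketch below expands each one into a mapping from column name to function names. The helper normalize_spec is hypothetical and is not part of pyspark.pandas; it only models which functions reach which columns and ignores the column layout of the result (flat for a string, two-level for a list).

```python
def normalize_spec(spec, value_columns):
    """Expand a str, list, or dict spec into {column: [function names]}.

    Hypothetical helper for illustration only; not a pyspark.pandas API.
    """
    if isinstance(spec, str):
        # One function name: applied to every non-grouping column.
        return {col: [spec] for col in value_columns}
    if isinstance(spec, list):
        # Several function names: each applied to every non-grouping column.
        return {col: list(spec) for col in value_columns}
    if isinstance(spec, dict):
        # Per-column spec: each value is one name or a list of names.
        return {
            col: [funcs] if isinstance(funcs, str) else list(funcs)
            for col, funcs in spec.items()
        }
    raise TypeError(f"unsupported spec type: {type(spec).__name__}")


# Non-grouping columns of the example frame (A is the grouping column).
value_columns = ["B", "C"]

from_str = normalize_spec("min", value_columns)
from_list = normalize_spec(["min", "max"], value_columns)
from_dict = normalize_spec({"B": ["min", "max"], "C": "sum"}, value_columns)
```

Under this model, 'min' and ['min', 'max'] both cover B and C, while the dict form touches only the columns it names.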

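To control the output column names when applying different aggregations per column, pandas-on-Spark also supports ‘named aggregation’ (nested renaming) in .agg. Each keyword argument names an output column, and its value is either a ps.NamedAgg or a (column, aggfunc) tuple. The same form can be used to apply multiple aggregation functions to a specific column.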

>>> aggregated = df.groupby('A').agg(b_max=ps.NamedAgg(column='B', aggfunc='max'))
>>> aggregated.sort_index()  
     b_max
A
1        2
2        4
>>> aggregated = df.groupby('A').agg(b_max=('B', 'max'), b_min=('B', 'min'))
>>> aggregated.sort_index()  
     b_max   b_min
A
1        2       1
2        4       3
>>> aggregated = df.groupby('A').agg(b_max=('B', 'max'), c_min=('C', 'min'))
>>> aggregated.sort_index()  
     b_max   c_min
A
1        2   0.227
2        4  -0.562
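
To make the semantics of named aggregation concrete, here is a minimal plain-Python sketch that reproduces the last result on a list of row dicts. It is an illustration under stated assumptions, not how pandas-on-Spark computes the result: the names named_agg and AGG_FUNCS are hypothetical, only 'min', 'max', and 'sum' are modelled, and the data is held in memory rather than in Spark.

```python
from collections import defaultdict

# Hypothetical lookup table for illustration; the real library supports more.
AGG_FUNCS = {"min": min, "max": max, "sum": sum}


def named_agg(rows, key, **named):
    """Mimic groupby(key).agg(out_name=(column, func)) on a list of dicts."""
    # Bucket the rows by the value of the grouping column.
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    # For each group, compute one value per requested output column.
    result = {}
    for group_key in sorted(groups):
        members = groups[group_key]
        result[group_key] = {
            out_name: AGG_FUNCS[func](r[column] for r in members)
            for out_name, (column, func) in named.items()
        }
    return result


# Same data as the example DataFrame above.
rows = [
    {"A": 1, "B": 1, "C": 0.362},
    {"A": 1, "B": 2, "C": 0.227},
    {"A": 2, "B": 3, "C": 1.267},
    {"A": 2, "B": 4, "C": -0.562},
]

summary = named_agg(rows, "A", b_max=("B", "max"), c_min=("C", "min"))
print(summary)
```

Each keyword becomes an output column, and the tuple says which input column and which function feed it, matching the b_max and c_min columns shown above.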