Sunday, September 10, 2017

Julia - Language - The Split Apply Combine Strategy - by, aggregate, groupby and view functions

A common data analysis task is to split dataset into groups, apply some functions to each group and then combining the results. Hadley Wickham has described this in his paper (http://www.jstatsoft.org/v40/i01) a standardised framework for handling this sort of computation. 

Refer: http://juliadata.github.io/DataFrames.jl/stable/man/split_apply_combine/#The-Split-Apply-Combine-Strategy-1

Julia Language provides the following functions in DataFrames package to handle splicing and grouping of data:

  1. by
  2. aggregate
  3. groupby
  4. view 


by function:


$ julia
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: https://docs.julialang.org
   _ _   _| |_  __ _   |  Type "?help" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.6.0 (2017-06-19 13:05 UTC)
 _/ |\__'_|_|_|\__'_|  |  Official http://julialang.org/ release
|__/                   |  x86_64-pc-linux-gnu

julia> using DataFrames;

julia> city_df = DataFrame(Group=rand(["Agra","Bengaluru","Chandigarh"],15), Variable1 = randn(15), Variable2=rand(15));

julia> city_df
15×3 DataFrames.DataFrame
│ Row │ Group        │ Variable1   │ Variable2  │
├─────┼──────────────┼─────────────┼────────────┤
│ 1   │ "Bengaluru"  │ -0.586793   │ 0.463548   │
│ 2   │ "Chandigarh" │ 1.13232     │ 0.630437   │
│ 3   │ "Bengaluru"  │ 0.309367    │ 0.233593   │
│ 4   │ "Chandigarh" │ -1.17079    │ 0.562659   │
│ 5   │ "Agra"       │ -1.66022    │ 0.923693   │
│ 6   │ "Bengaluru"  │ 0.40091     │ 4.30822e-5 │
│ 7   │ "Bengaluru"  │ -2.5064     │ 0.329671   │
│ 8   │ "Agra"       │ -1.30047    │ 0.78777    │
│ 9   │ "Agra"       │ 0.469538    │ 0.171885   │
│ 10  │ "Chandigarh" │ -0.00710071 │ 0.567401   │
│ 11  │ "Chandigarh" │ 1.1994      │ 0.606546   │
│ 12  │ "Agra"       │ -0.420354   │ 0.658603   │
│ 13  │ "Bengaluru"  │ 0.980519    │ 0.956216   │
│ 14  │ "Chandigarh" │ -1.02094    │ 0.829364   │
│ 15  │ "Agra"       │ -0.799467   │ 0.886456   │

julia> 

Now, let us apply the "by" function. The "by" function takes 3 arguments:
1) DataFrame
2) One or more column names to split the DataFrame on.
3) a function or expression to apply to each subset of the DataFrame

julia> by(city_df,:Group,size)
3×2 DataFrames.DataFrame
│ Row │ Group        │ x1     │
├─────┼──────────────┼────────┤
│ 1   │ "Agra"       │ (5, 3) │
│ 2   │ "Bengaluru"  │ (5, 3) │
│ 3   │ "Chandigarh" │ (5, 3) │

julia> 


We can also use stabby functions as the third parameter:

julia> by(city_df,:Group,stabbyDF1 -> mean(stabbyDF1[:Variable1]))
3×2 DataFrames.DataFrame
│ Row │ Group        │ x1        │
├─────┼──────────────┼───────────┤
│ 1   │ "Agra"       │ -0.742193 │
│ 2   │ "Bengaluru"  │ -0.280479 │
│ 3   │ "Chandigarh" │ 0.0265779 │

julia> 

To get a custom column name for the 3rd column:
julia> by(city_df,:Group,stabbyDF1 -> DataFrame(Mean = mean(stabbyDF1[:Variable1])))
3×2 DataFrames.DataFrame
│ Row │ Group        │ Mean      │
├─────┼──────────────┼───────────┤
│ 1   │ "Agra"       │ -0.742193 │
│ 2   │ "Bengaluru"  │ -0.280479 │
│ 3   │ "Chandigarh" │ 0.0265779 │

julia> 



julia> by(city_df,:Group,stabbyDF2 -> DataFrame(N= size(stabbyDF2,1)))
3×2 DataFrames.DataFrame
│ Row │ Group        │ N │
├─────┼──────────────┼───┤
│ 1   │ "Agra"       │ 5 │
│ 2   │ "Bengaluru"  │ 5 │
│ 3   │ "Chandigarh" │ 5 │

julia> 


We can also use the by function in the following way:

julia> by(city_df, :Group) do stabbyDF3
       DataFrame(Mean = mean(stabbyDF3[:Variable1]), Variance_σ² = var(stabbyDF3[:Variable1]))
       end
3×3 DataFrames.DataFrame
│ Row │ Group        │ Mean      │ Variance_σ² │
├─────┼──────────────┼───────────┼─────────────┤
│ 1   │ "Agra"       │ -0.742193 │ 0.682398    │
│ 2   │ "Bengaluru"  │ -0.280479 │ 1.86272     │
│ 3   │ "Chandigarh" │ 0.0265779 │ 1.28258     │

julia> 

Aggregate function:


Aggregate is the second approach to the Split-Apply-Combine strategy. The aggregate function also groups (spits) the dataframe by a column and then calculates a function or function on the rest of the columns. The new columns are named from the old with an underscore suffix followed by the applied function.

This function takes three arguments:

  1. DataFrame
  2. one or more columns to split the DataFrame on
  3. one or more functions that will be used to compute the summary of each subset of the DataFrame. Each function is applied to each column, that was not used to split the DataFrame, creating new columns of the form $name_$function 


# Calculating the mean and standard deviation of variables 1 and 2 grouped by 
# data point, values in the Group column

julia> aggregate(city_df, :Group, [mean,std,var])
3×7 DataFrames.DataFrame
│ Row │ Group        │ Variable1_mean │ Variable1_std │ Variable1_var │ Variable2_mean │ Variable2_std │ Variable2_var │
├─────┼──────────────┼────────────────┼───────────────┼───────────────┼────────────────┼───────────────┼───────────────┤
│ 1   │ "Agra"       │ -0.742193      │ 0.826074      │ 0.682398      │ 0.685681       │ 0.305001      │ 0.0930255     │
│ 2   │ "Bengaluru"  │ -0.280479      │ 1.36481       │ 1.86272       │ 0.396614       │ 0.355654      │ 0.12649       │
│ 3   │ "Chandigarh" │ 0.0265779      │ 1.13251       │ 1.28258       │ 0.639281       │ 0.109908      │ 0.0120798     │

julia> 


groupby function


The groupby() function simply splits the dataframe as specified:

julia> groupby(city_df,:Group)
DataFrames.GroupedDataFrame  3 groups with keys: Symbol[:Group]
First Group:
5×3 DataFrames.SubDataFrame{Array{Int64,1}}
│ Row │ Group  │ Variable1 │ Variable2 │
├─────┼────────┼───────────┼───────────┤
│ 1   │ "Agra" │ -1.66022  │ 0.923693  │
│ 2   │ "Agra" │ -1.30047  │ 0.78777   │
│ 3   │ "Agra" │ 0.469538  │ 0.171885  │
│ 4   │ "Agra" │ -0.420354 │ 0.658603  │
│ 5   │ "Agra" │ -0.799467 │ 0.886456  │

Last Group:
5×3 DataFrames.SubDataFrame{Array{Int64,1}}
│ Row │ Group        │ Variable1   │ Variable2 │
├─────┼──────────────┼─────────────┼───────────┤
│ 1   │ "Chandigarh" │ 1.13232     │ 0.630437  │
│ 2   │ "Chandigarh" │ -1.17079    │ 0.562659  │
│ 3   │ "Chandigarh" │ -0.00710071 │ 0.567401  │
│ 4   │ "Chandigarh" │ 1.1994      │ 0.606546  │
│ 5   │ "Chandigarh" │ -1.02094    │ 0.829364  │

julia>

To check selected groups:

julia> groupby(city_df,:Group)[1]
5×3 DataFrames.SubDataFrame{Array{Int64,1}}
│ Row │ Group  │ Variable1 │ Variable2 │
├─────┼────────┼───────────┼───────────┤
│ 1   │ "Agra" │ -1.66022  │ 0.923693  │
│ 2   │ "Agra" │ -1.30047  │ 0.78777   │
│ 3   │ "Agra" │ 0.469538  │ 0.171885  │
│ 4   │ "Agra" │ -0.420354 │ 0.658603  │
│ 5   │ "Agra" │ -0.799467 │ 0.886456  │

julia>

Using for loop with groupby() function:

julia> for i in groupby(city_df,:Group)
       println(i)
       end
5×3 DataFrames.SubDataFrame{Array{Int64,1}}
│ Row │ Group  │ Variable1 │ Variable2 │
├─────┼────────┼───────────┼───────────┤
│ 1   │ "Agra" │ -1.66022  │ 0.923693  │
│ 2   │ "Agra" │ -1.30047  │ 0.78777   │
│ 3   │ "Agra" │ 0.469538  │ 0.171885  │
│ 4   │ "Agra" │ -0.420354 │ 0.658603  │
│ 5   │ "Agra" │ -0.799467 │ 0.886456  │
5×3 DataFrames.SubDataFrame{Array{Int64,1}}
│ Row │ Group       │ Variable1 │ Variable2  │
├─────┼─────────────┼───────────┼────────────┤
│ 1   │ "Bengaluru" │ -0.586793 │ 0.463548   │
│ 2   │ "Bengaluru" │ 0.309367  │ 0.233593   │
│ 3   │ "Bengaluru" │ 0.40091   │ 4.30822e-5 │
│ 4   │ "Bengaluru" │ -2.5064   │ 0.329671   │
│ 5   │ "Bengaluru" │ 0.980519  │ 0.956216   │
5×3 DataFrames.SubDataFrame{Array{Int64,1}}
│ Row │ Group        │ Variable1   │ Variable2 │
├─────┼──────────────┼─────────────┼───────────┤
│ 1   │ "Chandigarh" │ 1.13232     │ 0.630437  │
│ 2   │ "Chandigarh" │ -1.17079    │ 0.562659  │
│ 3   │ "Chandigarh" │ -0.00710071 │ 0.567401  │
│ 4   │ "Chandigarh" │ 1.1994      │ 0.606546  │
│ 5   │ "Chandigarh" │ -1.02094    │ 0.829364  │

julia> 


View Function:

Using View function we can select certain rows only and therefore splice our DataFrame. Let us now add a third column to our dataframe:

julia> city_df[:Variable3] = randn(15)
15-element Array{Float64,1}:
  1.22496 
 -2.40136 
 -0.270867
  1.43093 
  1.87503 
 -2.10986 
  0.871126
  1.43062 
  1.15698 
 -1.55353 
  0.912977
 -0.222599
  0.281066
 -0.670763
 -0.324178

julia> city_df
15×4 DataFrames.DataFrame
│ Row │ Group        │ Variable1   │ Variable2  │ Variable3 │
├─────┼──────────────┼─────────────┼────────────┼───────────┤
│ 1   │ "Bengaluru"  │ -0.586793   │ 0.463548   │ 1.22496   │
│ 2   │ "Chandigarh" │ 1.13232     │ 0.630437   │ -2.40136  │
│ 3   │ "Bengaluru"  │ 0.309367    │ 0.233593   │ -0.270867 │
│ 4   │ "Chandigarh" │ -1.17079    │ 0.562659   │ 1.43093   │
│ 5   │ "Agra"       │ -1.66022    │ 0.923693   │ 1.87503   │
│ 6   │ "Bengaluru"  │ 0.40091     │ 4.30822e-5 │ -2.10986  │
│ 7   │ "Bengaluru"  │ -2.5064     │ 0.329671   │ 0.871126  │
│ 8   │ "Agra"       │ -1.30047    │ 0.78777    │ 1.43062   │
│ 9   │ "Agra"       │ 0.469538    │ 0.171885   │ 1.15698   │
│ 10  │ "Chandigarh" │ -0.00710071 │ 0.567401   │ -1.55353  │
│ 11  │ "Chandigarh" │ 1.1994      │ 0.606546   │ 0.912977  │
│ 12  │ "Agra"       │ -0.420354   │ 0.658603   │ -0.222599 │
│ 13  │ "Bengaluru"  │ 0.980519    │ 0.956216   │ 0.281066  │
│ 14  │ "Chandigarh" │ -1.02094    │ 0.829364   │ -0.670763 │
│ 15  │ "Agra"       │ -0.799467   │ 0.886456   │ -0.324178 │

julia> 


Using view function, Retrieving rows 5 to 9 in the dataframe:
julia> view(city_df,5:9)
5×4 DataFrames.SubDataFrame{Array{Int64,1}}
│ Row │ Group       │ Variable1 │ Variable2  │ Variable3 │
├─────┼─────────────┼───────────┼────────────┼───────────┤
│ 1   │ "Agra"      │ -1.66022  │ 0.923693   │ 1.87503   │
│ 2   │ "Bengaluru" │ 0.40091   │ 4.30822e-5 │ -2.10986  │
│ 3   │ "Bengaluru" │ -2.5064   │ 0.329671   │ 0.871126  │
│ 4   │ "Agra"      │ -1.30047  │ 0.78777    │ 1.43062   │
│ 5   │ "Agra"      │ 0.469538  │ 0.171885   │ 1.15698   │

julia> 

To retrive specific row numbers:

julia> view(city_df,[5,9])
2×4 DataFrames.SubDataFrame{Array{Int64,1}}
│ Row │ Group  │ Variable1 │ Variable2 │ Variable3 │
├─────┼────────┼───────────┼───────────┼───────────┤
│ 1   │ "Agra" │ -1.66022  │ 0.923693  │ 1.87503   │
│ 2   │ "Agra" │ 0.469538  │ 0.171885  │ 1.15698   │

julia> 

To evaluate each row:

## the dot in .> indicates that we want to evaluate element by element of each row. This is a nice shortcut in place of "for loop":

julia> view(city_df, city_df[:Variable1] .> 0.5)
3×4 DataFrames.SubDataFrame{Array{Int64,1}}
│ Row │ Group        │ Variable1 │ Variable2 │ Variable3 │
├─────┼──────────────┼───────────┼───────────┼───────────┤
│ 1   │ "Chandigarh" │ 1.13232   │ 0.630437  │ -2.40136  │
│ 2   │ "Chandigarh" │ 1.1994    │ 0.606546  │ 0.912977  │
│ 3   │ "Bengaluru"  │ 0.980519  │ 0.956216  │ 0.281066  │

julia> 

If you notice, it went down :Variable1 and returned only those rows which are larger than 0.5.

Lastly, let call the view function to only return rows for "Bengaluru":

julia> view(city_df, city_df[:Group] .== "Bengaluru")
5×4 DataFrames.SubDataFrame{Array{Int64,1}}
│ Row │ Group       │ Variable1 │ Variable2  │ Variable3 │
├─────┼─────────────┼───────────┼────────────┼───────────┤
│ 1   │ "Bengaluru" │ -0.586793 │ 0.463548   │ 1.22496   │
│ 2   │ "Bengaluru" │ 0.309367  │ 0.233593   │ -0.270867 │
│ 3   │ "Bengaluru" │ 0.40091   │ 4.30822e-5 │ -2.10986  │
│ 4   │ "Bengaluru" │ -2.5064   │ 0.329671   │ 0.871126  │
│ 5   │ "Bengaluru" │ 0.980519  │ 0.956216   │ 0.281066  │

julia> 




No comments:

Post a Comment