A common data analysis task is to split dataset into groups, apply some functions to each group and then combining the results. Hadley Wickham has described this in his paper (http://www.jstatsoft.org/v40/i01) a standardised framework for handling this sort of computation.
Refer: http://juliadata.github.io/DataFrames.jl/stable/man/split_apply_combine/#The-Split-Apply-Combine-Strategy-1
Julia Language provides the following functions in DataFrames package to handle splicing and grouping of data:
$ julia
_
_ _ _(_)_ | A fresh approach to technical computing
(_) | (_) (_) | Documentation: https://docs.julialang.org
_ _ _| |_ __ _ | Type "?help" for help.
| | | | | | |/ _` | |
| | |_| | | | (_| | | Version 0.6.0 (2017-06-19 13:05 UTC)
_/ |\__'_|_|_|\__'_| | Official http://julialang.org/ release
|__/ | x86_64-pc-linux-gnu
julia> using DataFrames;
julia> city_df = DataFrame(Group=rand(["Agra","Bengaluru","Chandigarh"],15), Variable1 = randn(15), Variable2=rand(15));
julia> city_df
15×3 DataFrames.DataFrame
│ Row │ Group │ Variable1 │ Variable2 │
├─────┼──────────────┼─────────────┼────────────┤
│ 1 │ "Bengaluru" │ -0.586793 │ 0.463548 │
│ 2 │ "Chandigarh" │ 1.13232 │ 0.630437 │
│ 3 │ "Bengaluru" │ 0.309367 │ 0.233593 │
│ 4 │ "Chandigarh" │ -1.17079 │ 0.562659 │
│ 5 │ "Agra" │ -1.66022 │ 0.923693 │
│ 6 │ "Bengaluru" │ 0.40091 │ 4.30822e-5 │
│ 7 │ "Bengaluru" │ -2.5064 │ 0.329671 │
│ 8 │ "Agra" │ -1.30047 │ 0.78777 │
│ 9 │ "Agra" │ 0.469538 │ 0.171885 │
│ 10 │ "Chandigarh" │ -0.00710071 │ 0.567401 │
│ 11 │ "Chandigarh" │ 1.1994 │ 0.606546 │
│ 12 │ "Agra" │ -0.420354 │ 0.658603 │
│ 13 │ "Bengaluru" │ 0.980519 │ 0.956216 │
│ 14 │ "Chandigarh" │ -1.02094 │ 0.829364 │
│ 15 │ "Agra" │ -0.799467 │ 0.886456 │
julia>
Now, let us apply the "by" function. The "by" function takes 3 arguments:
1) DataFrame
2) One or more column names to split the DataFrame on.
3) a function or expression to apply to each subset of the DataFrame
julia> by(city_df,:Group,size)
3×2 DataFrames.DataFrame
│ Row │ Group │ x1 │
├─────┼──────────────┼────────┤
│ 1 │ "Agra" │ (5, 3) │
│ 2 │ "Bengaluru" │ (5, 3) │
│ 3 │ "Chandigarh" │ (5, 3) │
julia>
We can also use stabby functions as the third parameter:
julia> by(city_df,:Group,stabbyDF1 -> mean(stabbyDF1[:Variable1]))
3×2 DataFrames.DataFrame
│ Row │ Group │ x1 │
├─────┼──────────────┼───────────┤
│ 1 │ "Agra" │ -0.742193 │
│ 2 │ "Bengaluru" │ -0.280479 │
│ 3 │ "Chandigarh" │ 0.0265779 │
julia>
To get a custom column name for the 3rd column:
julia> by(city_df,:Group,stabbyDF1 -> DataFrame(Mean = mean(stabbyDF1[:Variable1])))
3×2 DataFrames.DataFrame
│ Row │ Group │ Mean │
├─────┼──────────────┼───────────┤
│ 1 │ "Agra" │ -0.742193 │
│ 2 │ "Bengaluru" │ -0.280479 │
│ 3 │ "Chandigarh" │ 0.0265779 │
julia>
julia> by(city_df,:Group,stabbyDF2 -> DataFrame(N= size(stabbyDF2,1)))
3×2 DataFrames.DataFrame
│ Row │ Group │ N │
├─────┼──────────────┼───┤
│ 1 │ "Agra" │ 5 │
│ 2 │ "Bengaluru" │ 5 │
│ 3 │ "Chandigarh" │ 5 │
julia>
We can also use the by function in the following way:
julia> by(city_df, :Group) do stabbyDF3
DataFrame(Mean = mean(stabbyDF3[:Variable1]), Variance_σ² = var(stabbyDF3[:Variable1]))
end
3×3 DataFrames.DataFrame
│ Row │ Group │ Mean │ Variance_σ² │
├─────┼──────────────┼───────────┼─────────────┤
│ 1 │ "Agra" │ -0.742193 │ 0.682398 │
│ 2 │ "Bengaluru" │ -0.280479 │ 1.86272 │
│ 3 │ "Chandigarh" │ 0.0265779 │ 1.28258 │
julia>
Aggregate is the second approach to the Split-Apply-Combine strategy. The aggregate function also groups (spits) the dataframe by a column and then calculates a function or function on the rest of the columns. The new columns are named from the old with an underscore suffix followed by the applied function.
This function takes three arguments:
# Calculating the mean and standard deviation of variables 1 and 2 grouped by
# data point, values in the Group column
julia> aggregate(city_df, :Group, [mean,std,var])
3×7 DataFrames.DataFrame
│ Row │ Group │ Variable1_mean │ Variable1_std │ Variable1_var │ Variable2_mean │ Variable2_std │ Variable2_var │
├─────┼──────────────┼────────────────┼───────────────┼───────────────┼────────────────┼───────────────┼───────────────┤
│ 1 │ "Agra" │ -0.742193 │ 0.826074 │ 0.682398 │ 0.685681 │ 0.305001 │ 0.0930255 │
│ 2 │ "Bengaluru" │ -0.280479 │ 1.36481 │ 1.86272 │ 0.396614 │ 0.355654 │ 0.12649 │
│ 3 │ "Chandigarh" │ 0.0265779 │ 1.13251 │ 1.28258 │ 0.639281 │ 0.109908 │ 0.0120798 │
julia>
The groupby() function simply splits the dataframe as specified:
julia> groupby(city_df,:Group)
DataFrames.GroupedDataFrame 3 groups with keys: Symbol[:Group]
First Group:
5×3 DataFrames.SubDataFrame{Array{Int64,1}}
│ Row │ Group │ Variable1 │ Variable2 │
├─────┼────────┼───────────┼───────────┤
│ 1 │ "Agra" │ -1.66022 │ 0.923693 │
│ 2 │ "Agra" │ -1.30047 │ 0.78777 │
│ 3 │ "Agra" │ 0.469538 │ 0.171885 │
│ 4 │ "Agra" │ -0.420354 │ 0.658603 │
│ 5 │ "Agra" │ -0.799467 │ 0.886456 │
⋮
Last Group:
5×3 DataFrames.SubDataFrame{Array{Int64,1}}
│ Row │ Group │ Variable1 │ Variable2 │
├─────┼──────────────┼─────────────┼───────────┤
│ 1 │ "Chandigarh" │ 1.13232 │ 0.630437 │
│ 2 │ "Chandigarh" │ -1.17079 │ 0.562659 │
│ 3 │ "Chandigarh" │ -0.00710071 │ 0.567401 │
│ 4 │ "Chandigarh" │ 1.1994 │ 0.606546 │
│ 5 │ "Chandigarh" │ -1.02094 │ 0.829364 │
julia>
To check selected groups:
julia> groupby(city_df,:Group)[1]
5×3 DataFrames.SubDataFrame{Array{Int64,1}}
│ Row │ Group │ Variable1 │ Variable2 │
├─────┼────────┼───────────┼───────────┤
│ 1 │ "Agra" │ -1.66022 │ 0.923693 │
│ 2 │ "Agra" │ -1.30047 │ 0.78777 │
│ 3 │ "Agra" │ 0.469538 │ 0.171885 │
│ 4 │ "Agra" │ -0.420354 │ 0.658603 │
│ 5 │ "Agra" │ -0.799467 │ 0.886456 │
julia>
Using for loop with groupby() function:
julia> for i in groupby(city_df,:Group)
println(i)
end
5×3 DataFrames.SubDataFrame{Array{Int64,1}}
│ Row │ Group │ Variable1 │ Variable2 │
├─────┼────────┼───────────┼───────────┤
│ 1 │ "Agra" │ -1.66022 │ 0.923693 │
│ 2 │ "Agra" │ -1.30047 │ 0.78777 │
│ 3 │ "Agra" │ 0.469538 │ 0.171885 │
│ 4 │ "Agra" │ -0.420354 │ 0.658603 │
│ 5 │ "Agra" │ -0.799467 │ 0.886456 │
5×3 DataFrames.SubDataFrame{Array{Int64,1}}
│ Row │ Group │ Variable1 │ Variable2 │
├─────┼─────────────┼───────────┼────────────┤
│ 1 │ "Bengaluru" │ -0.586793 │ 0.463548 │
│ 2 │ "Bengaluru" │ 0.309367 │ 0.233593 │
│ 3 │ "Bengaluru" │ 0.40091 │ 4.30822e-5 │
│ 4 │ "Bengaluru" │ -2.5064 │ 0.329671 │
│ 5 │ "Bengaluru" │ 0.980519 │ 0.956216 │
5×3 DataFrames.SubDataFrame{Array{Int64,1}}
│ Row │ Group │ Variable1 │ Variable2 │
├─────┼──────────────┼─────────────┼───────────┤
│ 1 │ "Chandigarh" │ 1.13232 │ 0.630437 │
│ 2 │ "Chandigarh" │ -1.17079 │ 0.562659 │
│ 3 │ "Chandigarh" │ -0.00710071 │ 0.567401 │
│ 4 │ "Chandigarh" │ 1.1994 │ 0.606546 │
│ 5 │ "Chandigarh" │ -1.02094 │ 0.829364 │
julia>
View Function:
Using View function we can select certain rows only and therefore splice our DataFrame. Let us now add a third column to our dataframe:
julia> city_df[:Variable3] = randn(15)
15-element Array{Float64,1}:
1.22496
-2.40136
-0.270867
1.43093
1.87503
-2.10986
0.871126
1.43062
1.15698
-1.55353
0.912977
-0.222599
0.281066
-0.670763
-0.324178
julia> city_df
15×4 DataFrames.DataFrame
│ Row │ Group │ Variable1 │ Variable2 │ Variable3 │
├─────┼──────────────┼─────────────┼────────────┼───────────┤
│ 1 │ "Bengaluru" │ -0.586793 │ 0.463548 │ 1.22496 │
│ 2 │ "Chandigarh" │ 1.13232 │ 0.630437 │ -2.40136 │
│ 3 │ "Bengaluru" │ 0.309367 │ 0.233593 │ -0.270867 │
│ 4 │ "Chandigarh" │ -1.17079 │ 0.562659 │ 1.43093 │
│ 5 │ "Agra" │ -1.66022 │ 0.923693 │ 1.87503 │
│ 6 │ "Bengaluru" │ 0.40091 │ 4.30822e-5 │ -2.10986 │
│ 7 │ "Bengaluru" │ -2.5064 │ 0.329671 │ 0.871126 │
│ 8 │ "Agra" │ -1.30047 │ 0.78777 │ 1.43062 │
│ 9 │ "Agra" │ 0.469538 │ 0.171885 │ 1.15698 │
│ 10 │ "Chandigarh" │ -0.00710071 │ 0.567401 │ -1.55353 │
│ 11 │ "Chandigarh" │ 1.1994 │ 0.606546 │ 0.912977 │
│ 12 │ "Agra" │ -0.420354 │ 0.658603 │ -0.222599 │
│ 13 │ "Bengaluru" │ 0.980519 │ 0.956216 │ 0.281066 │
│ 14 │ "Chandigarh" │ -1.02094 │ 0.829364 │ -0.670763 │
│ 15 │ "Agra" │ -0.799467 │ 0.886456 │ -0.324178 │
julia>
Using view function, Retrieving rows 5 to 9 in the dataframe:
julia> view(city_df,5:9)
5×4 DataFrames.SubDataFrame{Array{Int64,1}}
│ Row │ Group │ Variable1 │ Variable2 │ Variable3 │
├─────┼─────────────┼───────────┼────────────┼───────────┤
│ 1 │ "Agra" │ -1.66022 │ 0.923693 │ 1.87503 │
│ 2 │ "Bengaluru" │ 0.40091 │ 4.30822e-5 │ -2.10986 │
│ 3 │ "Bengaluru" │ -2.5064 │ 0.329671 │ 0.871126 │
│ 4 │ "Agra" │ -1.30047 │ 0.78777 │ 1.43062 │
│ 5 │ "Agra" │ 0.469538 │ 0.171885 │ 1.15698 │
julia>
To retrive specific row numbers:
julia> view(city_df,[5,9])
2×4 DataFrames.SubDataFrame{Array{Int64,1}}
│ Row │ Group │ Variable1 │ Variable2 │ Variable3 │
├─────┼────────┼───────────┼───────────┼───────────┤
│ 1 │ "Agra" │ -1.66022 │ 0.923693 │ 1.87503 │
│ 2 │ "Agra" │ 0.469538 │ 0.171885 │ 1.15698 │
julia>
To evaluate each row:
## the dot in .> indicates that we want to evaluate element by element of each row. This is a nice shortcut in place of "for loop":
julia> view(city_df, city_df[:Variable1] .> 0.5)
3×4 DataFrames.SubDataFrame{Array{Int64,1}}
│ Row │ Group │ Variable1 │ Variable2 │ Variable3 │
├─────┼──────────────┼───────────┼───────────┼───────────┤
│ 1 │ "Chandigarh" │ 1.13232 │ 0.630437 │ -2.40136 │
│ 2 │ "Chandigarh" │ 1.1994 │ 0.606546 │ 0.912977 │
│ 3 │ "Bengaluru" │ 0.980519 │ 0.956216 │ 0.281066 │
julia>
If you notice, it went down :Variable1 and returned only those rows which are larger than 0.5.
Lastly, let call the view function to only return rows for "Bengaluru":
julia> view(city_df, city_df[:Group] .== "Bengaluru")
5×4 DataFrames.SubDataFrame{Array{Int64,1}}
│ Row │ Group │ Variable1 │ Variable2 │ Variable3 │
├─────┼─────────────┼───────────┼────────────┼───────────┤
│ 1 │ "Bengaluru" │ -0.586793 │ 0.463548 │ 1.22496 │
│ 2 │ "Bengaluru" │ 0.309367 │ 0.233593 │ -0.270867 │
│ 3 │ "Bengaluru" │ 0.40091 │ 4.30822e-5 │ -2.10986 │
│ 4 │ "Bengaluru" │ -2.5064 │ 0.329671 │ 0.871126 │
│ 5 │ "Bengaluru" │ 0.980519 │ 0.956216 │ 0.281066 │
julia>
Refer: http://juliadata.github.io/DataFrames.jl/stable/man/split_apply_combine/#The-Split-Apply-Combine-Strategy-1
Julia Language provides the following functions in DataFrames package to handle splicing and grouping of data:
- by
- aggregate
- groupby
- view
by function:
$ julia
_
_ _ _(_)_ | A fresh approach to technical computing
(_) | (_) (_) | Documentation: https://docs.julialang.org
_ _ _| |_ __ _ | Type "?help" for help.
| | | | | | |/ _` | |
| | |_| | | | (_| | | Version 0.6.0 (2017-06-19 13:05 UTC)
_/ |\__'_|_|_|\__'_| | Official http://julialang.org/ release
|__/ | x86_64-pc-linux-gnu
julia> using DataFrames;
julia> city_df = DataFrame(Group=rand(["Agra","Bengaluru","Chandigarh"],15), Variable1 = randn(15), Variable2=rand(15));
julia> city_df
15×3 DataFrames.DataFrame
│ Row │ Group │ Variable1 │ Variable2 │
├─────┼──────────────┼─────────────┼────────────┤
│ 1 │ "Bengaluru" │ -0.586793 │ 0.463548 │
│ 2 │ "Chandigarh" │ 1.13232 │ 0.630437 │
│ 3 │ "Bengaluru" │ 0.309367 │ 0.233593 │
│ 4 │ "Chandigarh" │ -1.17079 │ 0.562659 │
│ 5 │ "Agra" │ -1.66022 │ 0.923693 │
│ 6 │ "Bengaluru" │ 0.40091 │ 4.30822e-5 │
│ 7 │ "Bengaluru" │ -2.5064 │ 0.329671 │
│ 8 │ "Agra" │ -1.30047 │ 0.78777 │
│ 9 │ "Agra" │ 0.469538 │ 0.171885 │
│ 10 │ "Chandigarh" │ -0.00710071 │ 0.567401 │
│ 11 │ "Chandigarh" │ 1.1994 │ 0.606546 │
│ 12 │ "Agra" │ -0.420354 │ 0.658603 │
│ 13 │ "Bengaluru" │ 0.980519 │ 0.956216 │
│ 14 │ "Chandigarh" │ -1.02094 │ 0.829364 │
│ 15 │ "Agra" │ -0.799467 │ 0.886456 │
julia>
Now, let us apply the "by" function. The "by" function takes 3 arguments:
1) DataFrame
2) One or more column names to split the DataFrame on.
3) a function or expression to apply to each subset of the DataFrame
julia> by(city_df,:Group,size)
3×2 DataFrames.DataFrame
│ Row │ Group │ x1 │
├─────┼──────────────┼────────┤
│ 1 │ "Agra" │ (5, 3) │
│ 2 │ "Bengaluru" │ (5, 3) │
│ 3 │ "Chandigarh" │ (5, 3) │
julia>
We can also use stabby functions as the third parameter:
julia> by(city_df,:Group,stabbyDF1 -> mean(stabbyDF1[:Variable1]))
3×2 DataFrames.DataFrame
│ Row │ Group │ x1 │
├─────┼──────────────┼───────────┤
│ 1 │ "Agra" │ -0.742193 │
│ 2 │ "Bengaluru" │ -0.280479 │
│ 3 │ "Chandigarh" │ 0.0265779 │
julia>
To get a custom column name for the 3rd column:
julia> by(city_df,:Group,stabbyDF1 -> DataFrame(Mean = mean(stabbyDF1[:Variable1])))
3×2 DataFrames.DataFrame
│ Row │ Group │ Mean │
├─────┼──────────────┼───────────┤
│ 1 │ "Agra" │ -0.742193 │
│ 2 │ "Bengaluru" │ -0.280479 │
│ 3 │ "Chandigarh" │ 0.0265779 │
julia>
julia> by(city_df,:Group,stabbyDF2 -> DataFrame(N= size(stabbyDF2,1)))
3×2 DataFrames.DataFrame
│ Row │ Group │ N │
├─────┼──────────────┼───┤
│ 1 │ "Agra" │ 5 │
│ 2 │ "Bengaluru" │ 5 │
│ 3 │ "Chandigarh" │ 5 │
julia>
We can also use the by function in the following way:
julia> by(city_df, :Group) do stabbyDF3
DataFrame(Mean = mean(stabbyDF3[:Variable1]), Variance_σ² = var(stabbyDF3[:Variable1]))
end
3×3 DataFrames.DataFrame
│ Row │ Group │ Mean │ Variance_σ² │
├─────┼──────────────┼───────────┼─────────────┤
│ 1 │ "Agra" │ -0.742193 │ 0.682398 │
│ 2 │ "Bengaluru" │ -0.280479 │ 1.86272 │
│ 3 │ "Chandigarh" │ 0.0265779 │ 1.28258 │
julia>
Aggregate function:
Aggregate is the second approach to the Split-Apply-Combine strategy. The aggregate function also groups (spits) the dataframe by a column and then calculates a function or function on the rest of the columns. The new columns are named from the old with an underscore suffix followed by the applied function.
This function takes three arguments:
- DataFrame
- one or more columns to split the DataFrame on
- one or more functions that will be used to compute the summary of each subset of the DataFrame. Each function is applied to each column, that was not used to split the DataFrame, creating new columns of the form $name_$function
# Calculating the mean and standard deviation of variables 1 and 2 grouped by
# data point, values in the Group column
julia> aggregate(city_df, :Group, [mean,std,var])
3×7 DataFrames.DataFrame
│ Row │ Group │ Variable1_mean │ Variable1_std │ Variable1_var │ Variable2_mean │ Variable2_std │ Variable2_var │
├─────┼──────────────┼────────────────┼───────────────┼───────────────┼────────────────┼───────────────┼───────────────┤
│ 1 │ "Agra" │ -0.742193 │ 0.826074 │ 0.682398 │ 0.685681 │ 0.305001 │ 0.0930255 │
│ 2 │ "Bengaluru" │ -0.280479 │ 1.36481 │ 1.86272 │ 0.396614 │ 0.355654 │ 0.12649 │
│ 3 │ "Chandigarh" │ 0.0265779 │ 1.13251 │ 1.28258 │ 0.639281 │ 0.109908 │ 0.0120798 │
julia>
groupby function
The groupby() function simply splits the dataframe as specified:
julia> groupby(city_df,:Group)
DataFrames.GroupedDataFrame 3 groups with keys: Symbol[:Group]
First Group:
5×3 DataFrames.SubDataFrame{Array{Int64,1}}
│ Row │ Group │ Variable1 │ Variable2 │
├─────┼────────┼───────────┼───────────┤
│ 1 │ "Agra" │ -1.66022 │ 0.923693 │
│ 2 │ "Agra" │ -1.30047 │ 0.78777 │
│ 3 │ "Agra" │ 0.469538 │ 0.171885 │
│ 4 │ "Agra" │ -0.420354 │ 0.658603 │
│ 5 │ "Agra" │ -0.799467 │ 0.886456 │
⋮
Last Group:
5×3 DataFrames.SubDataFrame{Array{Int64,1}}
│ Row │ Group │ Variable1 │ Variable2 │
├─────┼──────────────┼─────────────┼───────────┤
│ 1 │ "Chandigarh" │ 1.13232 │ 0.630437 │
│ 2 │ "Chandigarh" │ -1.17079 │ 0.562659 │
│ 3 │ "Chandigarh" │ -0.00710071 │ 0.567401 │
│ 4 │ "Chandigarh" │ 1.1994 │ 0.606546 │
│ 5 │ "Chandigarh" │ -1.02094 │ 0.829364 │
julia>
To check selected groups:
julia> groupby(city_df,:Group)[1]
5×3 DataFrames.SubDataFrame{Array{Int64,1}}
│ Row │ Group │ Variable1 │ Variable2 │
├─────┼────────┼───────────┼───────────┤
│ 1 │ "Agra" │ -1.66022 │ 0.923693 │
│ 2 │ "Agra" │ -1.30047 │ 0.78777 │
│ 3 │ "Agra" │ 0.469538 │ 0.171885 │
│ 4 │ "Agra" │ -0.420354 │ 0.658603 │
│ 5 │ "Agra" │ -0.799467 │ 0.886456 │
julia>
Using for loop with groupby() function:
julia> for i in groupby(city_df,:Group)
println(i)
end
5×3 DataFrames.SubDataFrame{Array{Int64,1}}
│ Row │ Group │ Variable1 │ Variable2 │
├─────┼────────┼───────────┼───────────┤
│ 1 │ "Agra" │ -1.66022 │ 0.923693 │
│ 2 │ "Agra" │ -1.30047 │ 0.78777 │
│ 3 │ "Agra" │ 0.469538 │ 0.171885 │
│ 4 │ "Agra" │ -0.420354 │ 0.658603 │
│ 5 │ "Agra" │ -0.799467 │ 0.886456 │
5×3 DataFrames.SubDataFrame{Array{Int64,1}}
│ Row │ Group │ Variable1 │ Variable2 │
├─────┼─────────────┼───────────┼────────────┤
│ 1 │ "Bengaluru" │ -0.586793 │ 0.463548 │
│ 2 │ "Bengaluru" │ 0.309367 │ 0.233593 │
│ 3 │ "Bengaluru" │ 0.40091 │ 4.30822e-5 │
│ 4 │ "Bengaluru" │ -2.5064 │ 0.329671 │
│ 5 │ "Bengaluru" │ 0.980519 │ 0.956216 │
5×3 DataFrames.SubDataFrame{Array{Int64,1}}
│ Row │ Group │ Variable1 │ Variable2 │
├─────┼──────────────┼─────────────┼───────────┤
│ 1 │ "Chandigarh" │ 1.13232 │ 0.630437 │
│ 2 │ "Chandigarh" │ -1.17079 │ 0.562659 │
│ 3 │ "Chandigarh" │ -0.00710071 │ 0.567401 │
│ 4 │ "Chandigarh" │ 1.1994 │ 0.606546 │
│ 5 │ "Chandigarh" │ -1.02094 │ 0.829364 │
julia>
View Function:
Using View function we can select certain rows only and therefore splice our DataFrame. Let us now add a third column to our dataframe:
julia> city_df[:Variable3] = randn(15)
15-element Array{Float64,1}:
1.22496
-2.40136
-0.270867
1.43093
1.87503
-2.10986
0.871126
1.43062
1.15698
-1.55353
0.912977
-0.222599
0.281066
-0.670763
-0.324178
julia> city_df
15×4 DataFrames.DataFrame
│ Row │ Group │ Variable1 │ Variable2 │ Variable3 │
├─────┼──────────────┼─────────────┼────────────┼───────────┤
│ 1 │ "Bengaluru" │ -0.586793 │ 0.463548 │ 1.22496 │
│ 2 │ "Chandigarh" │ 1.13232 │ 0.630437 │ -2.40136 │
│ 3 │ "Bengaluru" │ 0.309367 │ 0.233593 │ -0.270867 │
│ 4 │ "Chandigarh" │ -1.17079 │ 0.562659 │ 1.43093 │
│ 5 │ "Agra" │ -1.66022 │ 0.923693 │ 1.87503 │
│ 6 │ "Bengaluru" │ 0.40091 │ 4.30822e-5 │ -2.10986 │
│ 7 │ "Bengaluru" │ -2.5064 │ 0.329671 │ 0.871126 │
│ 8 │ "Agra" │ -1.30047 │ 0.78777 │ 1.43062 │
│ 9 │ "Agra" │ 0.469538 │ 0.171885 │ 1.15698 │
│ 10 │ "Chandigarh" │ -0.00710071 │ 0.567401 │ -1.55353 │
│ 11 │ "Chandigarh" │ 1.1994 │ 0.606546 │ 0.912977 │
│ 12 │ "Agra" │ -0.420354 │ 0.658603 │ -0.222599 │
│ 13 │ "Bengaluru" │ 0.980519 │ 0.956216 │ 0.281066 │
│ 14 │ "Chandigarh" │ -1.02094 │ 0.829364 │ -0.670763 │
│ 15 │ "Agra" │ -0.799467 │ 0.886456 │ -0.324178 │
julia>
Using view function, Retrieving rows 5 to 9 in the dataframe:
julia> view(city_df,5:9)
5×4 DataFrames.SubDataFrame{Array{Int64,1}}
│ Row │ Group │ Variable1 │ Variable2 │ Variable3 │
├─────┼─────────────┼───────────┼────────────┼───────────┤
│ 1 │ "Agra" │ -1.66022 │ 0.923693 │ 1.87503 │
│ 2 │ "Bengaluru" │ 0.40091 │ 4.30822e-5 │ -2.10986 │
│ 3 │ "Bengaluru" │ -2.5064 │ 0.329671 │ 0.871126 │
│ 4 │ "Agra" │ -1.30047 │ 0.78777 │ 1.43062 │
│ 5 │ "Agra" │ 0.469538 │ 0.171885 │ 1.15698 │
julia>
To retrive specific row numbers:
julia> view(city_df,[5,9])
2×4 DataFrames.SubDataFrame{Array{Int64,1}}
│ Row │ Group │ Variable1 │ Variable2 │ Variable3 │
├─────┼────────┼───────────┼───────────┼───────────┤
│ 1 │ "Agra" │ -1.66022 │ 0.923693 │ 1.87503 │
│ 2 │ "Agra" │ 0.469538 │ 0.171885 │ 1.15698 │
julia>
To evaluate each row:
## the dot in .> indicates that we want to evaluate element by element of each row. This is a nice shortcut in place of "for loop":
julia> view(city_df, city_df[:Variable1] .> 0.5)
3×4 DataFrames.SubDataFrame{Array{Int64,1}}
│ Row │ Group │ Variable1 │ Variable2 │ Variable3 │
├─────┼──────────────┼───────────┼───────────┼───────────┤
│ 1 │ "Chandigarh" │ 1.13232 │ 0.630437 │ -2.40136 │
│ 2 │ "Chandigarh" │ 1.1994 │ 0.606546 │ 0.912977 │
│ 3 │ "Bengaluru" │ 0.980519 │ 0.956216 │ 0.281066 │
julia>
If you notice, it went down :Variable1 and returned only those rows which are larger than 0.5.
Lastly, let call the view function to only return rows for "Bengaluru":
julia> view(city_df, city_df[:Group] .== "Bengaluru")
5×4 DataFrames.SubDataFrame{Array{Int64,1}}
│ Row │ Group │ Variable1 │ Variable2 │ Variable3 │
├─────┼─────────────┼───────────┼────────────┼───────────┤
│ 1 │ "Bengaluru" │ -0.586793 │ 0.463548 │ 1.22496 │
│ 2 │ "Bengaluru" │ 0.309367 │ 0.233593 │ -0.270867 │
│ 3 │ "Bengaluru" │ 0.40091 │ 4.30822e-5 │ -2.10986 │
│ 4 │ "Bengaluru" │ -2.5064 │ 0.329671 │ 0.871126 │
│ 5 │ "Bengaluru" │ 0.980519 │ 0.956216 │ 0.281066 │
julia>
No comments:
Post a Comment