The DataFrames library provides stackdf(), a function similar to stack(). Its output looks the same as that of stack().
So why does stackdf() exist at all? The important difference is that stackdf() returns a view into the original dataframe, whereas stack() returns actual copies of the data.
Refer: http://juliastats.github.io/DataFrames.jl/latest/lib/manipulation/#DataFrames.stackdf
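The view-versus-copy distinction can be illustrated with plain arrays, without DataFrames. The following is a minimal sketch that uses only Base Julia (range indexing and view()); it is an analogy for what stackdf() and stack() do, not their implementation, and the variable names are made up for this example.

```julia
# A small vector standing in for one column of a dataframe.
original = [5.1, 4.9, 4.7, 4.6]

# Indexing with a range allocates a new array (a copy).
copied = original[1:3]

# view() returns a SubArray that refers back to `original`.
viewed = view(original, 1:3)

# Change the underlying data after both were created.
original[1] = 0.0

println(copied[1])   # the copy keeps the old value, 5.1
println(viewed[1])   # the view reflects the change, 0.0
```

A view costs almost no extra memory, but it stays tied to the data it was built from.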
Now, let us see how to put the stackdf() function to use.
First, start Julia and import the necessary packages:
$ julia
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: https://docs.julialang.org
   _ _   _| |_  __ _   |  Type "?help" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.6.0 (2017-06-19 13:05 UTC)
 _/ |\__'_|_|_|\__'_|  |  Official http://julialang.org/ release
|__/                   |  x86_64-pc-linux-gnu
julia> using RDatasets,DataFrames
julia>
Let us have a look at the packages available in the RDatasets library:
julia> RDatasets.packages()
33×2 DataFrames.DataFrame
│ Row │ Package │ Title │
├─────┼────────────────┼───────────────────────────────────────────────────────────────────────────┤
│ 1 │ "COUNT" │ "Functions, data and code for count data." │
│ 2 │ "Ecdat" │ "Data sets for econometrics" │
│ 3 │ "HSAUR" │ "A Handbook of Statistical Analyses Using R (1st Edition)" │
│ 4 │ "HistData" │ "Data sets from the history of statistics and data visualization" │
│ 5 │ "ISLR" │ "Data for An Introduction to Statistical Learning with Applications in R" │
│ 6 │ "KMsurv" │ "Data sets from Klein and Moeschberger (1997), Survival Analysis" │
│ 7 │ "MASS" │ "Support Functions and Datasets for Venables and Ripley's MASS" │
│ 8 │ "SASmixed" │ "Data sets from \"SAS System for Mixed Models\"" │
│ 9 │ "Zelig" │ "Everyone's Statistical Software" │
│ 10 │ "adehabitatLT" │ "Analysis of Animal Movements" │
│ 11 │ "boot" │ "Bootstrap Functions (Originally by Angelo Canty for S)" │
│ 12 │ "car" │ "Companion to Applied Regression" │
│ 13 │ "cluster" │ "Cluster Analysis Extended Rousseeuw et al." │
│ 14 │ "datasets" │ "The R Datasets Package" │
│ 15 │ "gap" │ "Genetic analysis package" │
│ 16 │ "ggplot2" │ "An Implementation of the Grammar of Graphics" │
│ 17 │ "lattice" │ "Lattice Graphics" │
│ 18 │ "lme4" │ "Linear mixed-effects models using Eigen and S4" │
│ 19 │ "mgcv" │ "Mixed GAM Computation Vehicle with GCV/AIC/REML smoothness estimation" │
│ 20 │ "mlmRev" │ "Examples from Multilevel Modelling Software Review" │
│ 21 │ "nlreg" │ "Higher Order Inference for Nonlinear Heteroscedastic Models" │
│ 22 │ "plm" │ "Linear Models for Panel Data" │
│ 23 │ "plyr" │ "Tools for splitting, applying and combining data" │
│ 24 │ "pscl" │ "Political Science Computational Laboratory, Stanford University" │
│ 25 │ "psych" │ "Procedures for Psychological, Psychometric, and Personality Research" │
│ 26 │ "quantreg" │ "Quantile Regression" │
│ 27 │ "reshape2" │ "Flexibly Reshape Data: A Reboot of the Reshape Package." │
│ 28 │ "robustbase" │ "Basic Robust Statistics" │
│ 29 │ "rpart" │ "Recursive Partitioning and Regression Trees" │
│ 30 │ "sandwich" │ "Robust Covariance Matrix Estimators" │
│ 31 │ "sem" │ "Structural Equation Models" │
│ 32 │ "survival" │ "Survival Analysis" │
│ 33 │ "vcd" │ "Visualizing Categorical Data" │
julia>
The package of interest here is the datasets package. Let us fetch the famous iris dataset from this package into a dataframe:
julia> iris_df = dataset("datasets","iris");
julia> head(iris_df)
6×5 DataFrames.DataFrame
│ Row │ SepalLength │ SepalWidth │ PetalLength │ PetalWidth │ Species │
├─────┼─────────────┼────────────┼─────────────┼────────────┼──────────┤
│ 1 │ 5.1 │ 3.5 │ 1.4 │ 0.2 │ "setosa" │
│ 2 │ 4.9 │ 3.0 │ 1.4 │ 0.2 │ "setosa" │
│ 3 │ 4.7 │ 3.2 │ 1.3 │ 0.2 │ "setosa" │
│ 4 │ 4.6 │ 3.1 │ 1.5 │ 0.2 │ "setosa" │
│ 5 │ 5.0 │ 3.6 │ 1.4 │ 0.2 │ "setosa" │
│ 6 │ 5.4 │ 3.9 │ 1.7 │ 0.4 │ "setosa" │
julia>
Using the stackdf() function with default arguments:
julia> iris_df_stackdf = stackdf(iris_df);
julia> head(iris_df_stackdf)
6×3 DataFrames.DataFrame
│ Row │ variable │ value │ Species │
├─────┼─────────────┼───────┼──────────┤
│ 1 │ SepalLength │ 5.1 │ "setosa" │
│ 2 │ SepalLength │ 4.9 │ "setosa" │
│ 3 │ SepalLength │ 4.7 │ "setosa" │
│ 4 │ SepalLength │ 4.6 │ "setosa" │
│ 5 │ SepalLength │ 5.0 │ "setosa" │
│ 6 │ SepalLength │ 5.4 │ "setosa" │
Stack 2 columns only:
julia> iris_df_stackdf2 = stackdf(iris_df,1:2);
julia> unique(iris_df_stackdf2[1])
2-element Array{Symbol,1}:
:SepalLength
:SepalWidth
julia>
Stack 3 columns only:
julia> iris_df_stackdf3 = stack(iris_df,1:3);
julia> unique(iris_df_stackdf3[1])
3-element Array{Symbol,1}:
:SepalLength
:SepalWidth
:PetalLength
Stack 4 columns only:
julia> iris_df_stackdf4 = stack(iris_df,1:4);
julia> unique(iris_df_stackdf4[1])
4-element Array{Symbol,1}:
:SepalLength
:SepalWidth
:PetalLength
:PetalWidth
julia>
Checking the size of the original DataFrame:
julia> size(iris_df)
(150, 5)
Checking the sizes of the stacked DataFrames:
julia> size(iris_df_stackdf)
(600, 3)
julia> size(iris_df_stackdf2)
(300, 5)
julia> size(iris_df_stackdf3)
(450, 4)
julia> size(iris_df_stackdf4)
(600, 3)
julia>
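The sizes above follow a simple rule: each stacked column contributes one block of rows, and the stacked columns are replaced by the two columns variable and value. The following sketch spells out that arithmetic. The helper stacked_size is a hypothetical name introduced here, not part of DataFrames, and it assumes every column that is not stacked is carried along as an identifier column.

```julia
# Hypothetical helper (not a DataFrames function): expected size after stacking.
# nrows    - rows in the original dataframe
# ncols    - columns in the original dataframe
# nmeasure - number of columns being stacked
# Assumes all remaining columns are kept as identifier columns.
stacked_size(nrows, ncols, nmeasure) = (nrows * nmeasure, ncols - nmeasure + 2)

println(stacked_size(150, 5, 4))  # (600, 3)
println(stacked_size(150, 5, 2))  # (300, 5)
println(stacked_size(150, 5, 3))  # (450, 4)
```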
We can see that the stacked results have the same shape whether they come from stackdf() or stack(). The difference lies in how the columns are stored.
To see how the columns are stored, we can use the dump() function.
Dump of the DataFrame returned by stackdf():
julia> dump(iris_df_stackdf)
DataFrames.DataFrame 600 observations of 3 variables
variable: DataFrames.RepeatedVector{Symbol}
parent: Array{Symbol}((4,))
1: Symbol SepalLength
2: Symbol SepalWidth
3: Symbol PetalLength
4: Symbol PetalWidth
inner: Int64 150
outer: Int64 1
value: DataFrames.StackedVector
components: Array{Any}((4,))
1: DataArrays.DataArray{Float64,1}(150) [5.1, 4.9, 4.7, 4.6]
2: DataArrays.DataArray{Float64,1}(150) [3.5, 3.0, 3.2, 3.1]
3: DataArrays.DataArray{Float64,1}(150) [1.4, 1.4, 1.3, 1.5]
4: DataArrays.DataArray{Float64,1}(150) [0.2, 0.2, 0.2, 0.2]
Species: DataFrames.RepeatedVector{String}
parent: DataArrays.PooledDataArray{String,UInt8,1}(150) String["setosa", "setosa", "setosa", "setosa"]
inner: Int64 1
outer: Int64 4
julia>
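The parent, inner and outer fields in this dump suggest how a repeated column can be represented without materializing all 600 entries. The following is a minimal sketch of that idea in Base Julia. The type LazyRepeat is hypothetical and written only for illustration; it is not the DataFrames.RepeatedVector source, and the meaning given to inner and outer is an assumption read off the dump above.

```julia
# Hypothetical illustration type, not the DataFrames implementation.
struct LazyRepeat{T} <: AbstractVector{T}
    parent::Vector{T}  # the distinct values, stored once
    inner::Int         # each value is repeated this many times in a row
    outer::Int         # the whole sequence is repeated this many times
end

Base.size(v::LazyRepeat) = (length(v.parent) * v.inner * v.outer,)
Base.IndexStyle(::Type{<:LazyRepeat}) = IndexLinear()

function Base.getindex(v::LazyRepeat, i::Int)
    1 <= i <= length(v) || throw(BoundsError(v, i))
    block = length(v.parent) * v.inner  # length of one outer repetition
    pos = (i - 1) % block               # zero-based position inside that block
    return v.parent[div(pos, v.inner) + 1]
end

# Mirrors the `variable` column: 4 names, inner = 150, outer = 1.
variable = LazyRepeat([:SepalLength, :SepalWidth, :PetalLength, :PetalWidth], 150, 1)
println(length(variable))  # 600
println(variable[150])     # SepalLength
println(variable[151])     # SepalWidth
```

Only four symbols are stored, yet the object behaves like a vector of length 600.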
Dump of the DataFrame returned by stack(), created here with the same default arguments:
julia> iris_df_stack = stack(iris_df);
julia> dump(iris_df_stack)
DataFrames.DataFrame 600 observations of 3 variables
variable: Array{Symbol}((600,))
1: Symbol SepalLength
2: Symbol SepalLength
3: Symbol SepalLength
4: Symbol SepalLength
5: Symbol SepalLength
...
596: Symbol PetalWidth
597: Symbol PetalWidth
598: Symbol PetalWidth
599: Symbol PetalWidth
600: Symbol PetalWidth
value: DataArrays.DataArray{Float64,1}(600) [5.1, 4.9, 4.7, 4.6]
Species: DataArrays.PooledDataArray{String,UInt8,1}(600) String["setosa", "setosa", "setosa", "setosa"]
julia>
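To understand the difference, compare the column types in the two dumps. With stackdf(), the columns are RepeatedVector and StackedVector objects that refer to the original 150-element columns. With stack(), they are ordinary 600-element arrays that hold copied data.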
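The value column can be pictured the same way: a stacked view keeps references to the original columns, while a copy concatenates them into a new array. The following sketch contrasts the two in Base Julia. LazyStack is a hypothetical type written for this illustration, not the DataFrames.StackedVector source, and the three-element vectors are shortened stand-ins for the iris columns.

```julia
# Hypothetical illustration type, not the DataFrames implementation.
struct LazyStack{T} <: AbstractVector{T}
    components::Vector{Vector{T}}  # references to the original columns
end

Base.size(v::LazyStack) = (sum(length, v.components),)
Base.IndexStyle(::Type{<:LazyStack}) = IndexLinear()

function Base.getindex(v::LazyStack, i::Int)
    1 <= i <= length(v) || throw(BoundsError(v, i))
    # Walk the components until the index falls inside one of them.
    for c in v.components
        i <= length(c) && return c[i]
        i -= length(c)
    end
    error("unreachable: index was bounds-checked above")
end

sepal_length = [5.1, 4.9, 4.7]
sepal_width  = [3.5, 3.0, 3.2]

lazy   = LazyStack([sepal_length, sepal_width])  # keeps references, no copy
copied = vcat(sepal_length, sepal_width)         # allocates a new 6-element array

sepal_width[1] = 9.9  # modify an original column afterwards

println(lazy[4])    # 9.9, the stacked view sees the change
println(copied[4])  # 3.5, the copy does not
```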