Technology Works: Julia - Language - Reshaping of Datasets - stackdf() Function

The DataFrames library provides stackdf(), a function similar to stack(). Its output looks the same as that of stack().

So why do we have stackdf() at all? The important difference is that stackdf() returns a view into the original dataframe, whereas stack() returns actual copies of the data.

Refer: http://juliastats.github.io/DataFrames.jl/latest/lib/manipulation/#DataFrames.stackdf
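
To make the view-versus-copy distinction concrete, here is a minimal sketch. It does not use the Julia 0.6 API shown in the rest of this post. It assumes a recent DataFrames.jl release (1.x), where, as far as I know, stackdf() has been replaced by stack() with a view=true keyword. The small data frame df and its column names a, b and id are made up for illustration.

```julia
using DataFrames  # assumption: a recent DataFrames.jl release (1.x), not the 0.6-era one

# Small stand-in for the iris data: two measurement columns and one id column.
df = DataFrame(a = [1.0, 2.0], b = [3.0, 4.0], id = ["x", "y"])

copied = stack(df, [:a, :b])               # like stack(): allocates new columns
viewed = stack(df, [:a, :b]; view = true)  # like stackdf(): wraps the original columns

# Both results hold the same rows at this point.
println(copied == viewed)

# Change the original data in place, then look at both results again.
df.a[1] = 100.0
println(copied.value[1])  # the copy keeps the value it was built with
println(viewed.value[1])  # the view follows the original column
```

Because the view only wraps the original columns, it is cheap to create, but later changes to the original dataframe show through it.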

Now, let us see how we can put the stackdf() function to use:

Let us import the necessary packages:

$ julia
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: https://docs.julialang.org
   _ _   _| |_  __ _   |  Type "?help" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.6.0 (2017-06-19 13:05 UTC)
 _/ |\__'_|_|_|\__'_|  |  Official http://julialang.org/ release
|__/                   |  x86_64-pc-linux-gnu

julia> using RDatasets,DataFrames

julia>

Let us have a look at the RDatasets library:

julia> RDatasets.packages()
33×2 DataFrames.DataFrame
│ Row │ Package │ Title │
├─────┼────────────────┼───────────────────────────────────────────────────────────────────────────┤
│ 1 │ "COUNT" │ "Functions, data and code for count data." │
│ 2 │ "Ecdat" │ "Data sets for econometrics" │
│ 3 │ "HSAUR" │ "A Handbook of Statistical Analyses Using R (1st Edition)" │
│ 4 │ "HistData" │ "Data sets from the history of statistics and data visualization" │
│ 5 │ "ISLR" │ "Data for An Introduction to Statistical Learning with Applications in R" │
│ 6 │ "KMsurv" │ "Data sets from Klein and Moeschberger (1997), Survival Analysis" │
│ 7 │ "MASS" │ "Support Functions and Datasets for Venables and Ripley's MASS" │
│ 8 │ "SASmixed" │ "Data sets from \"SAS System for Mixed Models\"" │
│ 9 │ "Zelig" │ "Everyone's Statistical Software" │
│ 10 │ "adehabitatLT" │ "Analysis of Animal Movements" │
│ 11 │ "boot" │ "Bootstrap Functions (Originally by Angelo Canty for S)" │
│ 12 │ "car" │ "Companion to Applied Regression" │
│ 13 │ "cluster" │ "Cluster Analysis Extended Rousseeuw et al." │
│ 14 │ "datasets" │ "The R Datasets Package" │
│ 15 │ "gap" │ "Genetic analysis package" │
│ 16 │ "ggplot2" │ "An Implementation of the Grammar of Graphics" │
│ 17 │ "lattice" │ "Lattice Graphics" │
│ 18 │ "lme4" │ "Linear mixed-effects models using Eigen and S4" │
│ 19 │ "mgcv" │ "Mixed GAM Computation Vehicle with GCV/AIC/REML smoothness estimation" │
│ 20 │ "mlmRev" │ "Examples from Multilevel Modelling Software Review" │
│ 21 │ "nlreg" │ "Higher Order Inference for Nonlinear Heteroscedastic Models" │
│ 22 │ "plm" │ "Linear Models for Panel Data" │
│ 23 │ "plyr" │ "Tools for splitting, applying and combining data" │
│ 24 │ "pscl" │ "Political Science Computational Laboratory, Stanford University" │
│ 25 │ "psych" │ "Procedures for Psychological, Psychometric, and Personality Research" │
│ 26 │ "quantreg" │ "Quantile Regression" │
│ 27 │ "reshape2" │ "Flexibly Reshape Data: A Reboot of the Reshape Package." │
│ 28 │ "robustbase" │ "Basic Robust Statistics" │
│ 29 │ "rpart" │ "Recursive Partitioning and Regression Trees" │
│ 30 │ "sandwich" │ "Robust Covariance Matrix Estimators" │
│ 31 │ "sem" │ "Structural Equation Models" │
│ 32 │ "survival" │ "Survival Analysis" │
│ 33 │ "vcd" │ "Visualizing Categorical Data" │

julia>

The package we are interested in is the datasets package. Let us fetch the famous iris dataset from this package into a dataframe:

julia> iris_df = dataset("datasets","iris");

julia> head(iris_df)
6×5 DataFrames.DataFrame
│ Row │ SepalLength │ SepalWidth │ PetalLength │ PetalWidth │ Species │
├─────┼─────────────┼────────────┼─────────────┼────────────┼──────────┤
│ 1 │ 5.1 │ 3.5 │ 1.4 │ 0.2 │ "setosa" │
│ 2 │ 4.9 │ 3.0 │ 1.4 │ 0.2 │ "setosa" │
│ 3 │ 4.7 │ 3.2 │ 1.3 │ 0.2 │ "setosa" │
│ 4 │ 4.6 │ 3.1 │ 1.5 │ 0.2 │ "setosa" │
│ 5 │ 5.0 │ 3.6 │ 1.4 │ 0.2 │ "setosa" │
│ 6 │ 5.4 │ 3.9 │ 1.7 │ 0.4 │ "setosa" │

julia>

Using the stackdf() function:

julia> iris_df_stackdf = stackdf(iris_df);

julia> head(iris_df_stackdf)
6×3 DataFrames.DataFrame
│ Row │ variable │ value │ Species │
├─────┼─────────────┼───────┼──────────┤
│ 1 │ SepalLength │ 5.1 │ "setosa" │
│ 2 │ SepalLength │ 4.9 │ "setosa" │
│ 3 │ SepalLength │ 4.7 │ "setosa" │
│ 4 │ SepalLength │ 4.6 │ "setosa" │
│ 5 │ SepalLength │ 5.0 │ "setosa" │
│ 6 │ SepalLength │ 5.4 │ "setosa" │

Stack 2 columns only:

julia> iris_df_stackdf2 = stackdf(iris_df,1:2);

julia> unique(iris_df_stackdf2[1])
2-element Array{Symbol,1}:
:SepalLength
:SepalWidth

julia>

Stack 3 columns only:

julia> iris_df_stackdf3 = stackdf(iris_df,1:3);

julia> unique(iris_df_stackdf3[1])
3-element Array{Symbol,1}:
:SepalLength
:SepalWidth
:PetalLength

Stack 4 columns only:

julia> iris_df_stackdf4 = stackdf(iris_df,1:4);

julia> unique(iris_df_stackdf4[1])
4-element Array{Symbol,1}:
:SepalLength
:SepalWidth
:PetalLength
:PetalWidth

julia>

Checking the size of the original DataFrame:

julia> size(iris_df)
(150, 5)

Checking the size after stacking the original DataFrame:

julia> size(iris_df_stackdf)
(600, 3)

julia> size(iris_df_stackdf2)
(300, 5)

julia> size(iris_df_stackdf3)
(450, 4)

julia> size(iris_df_stackdf4)
(600, 3)

julia>
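
The sizes above follow a simple rule: every stacked column contributes one full copy of the original rows, and the stacked columns are replaced by the two columns variable and value. Here is a small sketch of that arithmetic. The helper stacked_size is hypothetical (it is not part of DataFrames) and needs no packages.

```julia
# Hypothetical helper, not part of DataFrames: predicts the size of a stacked
# dataframe from the size of the original and the number of stacked columns.
function stacked_size(nrows::Integer, ncols::Integer, nstacked::Integer)
    1 <= nstacked <= ncols || throw(ArgumentError("nstacked must be between 1 and ncols"))
    rows = nrows * nstacked        # one block of the original rows per stacked column
    cols = (ncols - nstacked) + 2  # remaining id columns, plus variable and value
    return (rows, cols)
end

# The iris dataframe has 150 rows and 5 columns.
for k in 2:4
    println(k, " stacked columns -> ", stacked_size(150, 5, k))
end
```

For 2, 3 and 4 stacked columns this gives (300, 5), (450, 4) and (600, 3), matching the sizes reported above.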

We can see that stackdf() produces the same output as the stack() function. The only difference is in how the underlying columns are stored.

To see how the columns are stored, we need to use the dump() function.

Dump of the DataFrame returned by the stackdf() function:

julia> dump(iris_df_stackdf)
DataFrames.DataFrame 600 observations of 3 variables
  variable: DataFrames.RepeatedVector{Symbol}
    parent: Array{Symbol}((4,))
      1: Symbol SepalLength
      2: Symbol SepalWidth
      3: Symbol PetalLength
      4: Symbol PetalWidth
    inner: Int64 150
    outer: Int64 1
  value: DataFrames.StackedVector
    components: Array{Any}((4,))
      1: DataArrays.DataArray{Float64,1}(150) [5.1, 4.9, 4.7, 4.6]
      2: DataArrays.DataArray{Float64,1}(150) [3.5, 3.0, 3.2, 3.1]
      3: DataArrays.DataArray{Float64,1}(150) [1.4, 1.4, 1.3, 1.5]
      4: DataArrays.DataArray{Float64,1}(150) [0.2, 0.2, 0.2, 0.2]
  Species: DataFrames.RepeatedVector{String}
    parent: DataArrays.PooledDataArray{String,UInt8,1}(150) String["setosa", "setosa", "setosa", "setosa"]
    inner: Int64 1
    outer: Int64 4

julia>

Dump of the DataFrame returned by the stack() function:

julia> iris_df_stack = stack(iris_df);

julia> dump(iris_df_stack)
DataFrames.DataFrame 600 observations of 3 variables
  variable: Array{Symbol}((600,))
    1: Symbol SepalLength
    2: Symbol SepalLength
    3: Symbol SepalLength
    4: Symbol SepalLength
    5: Symbol SepalLength
    ...
    596: Symbol PetalWidth
    597: Symbol PetalWidth
    598: Symbol PetalWidth
    599: Symbol PetalWidth
    600: Symbol PetalWidth
  value: DataArrays.DataArray{Float64,1}(600) [5.1, 4.9, 4.7, 4.6]
  Species: DataArrays.PooledDataArray{String,UInt8,1}(600) String["setosa", "setosa", "setosa", "setosa"]

julia>

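To understand the difference, compare the column types in the two dumps. The columns created by stackdf() are RepeatedVector and StackedVector wrappers around the original 150-element columns, whereas the columns created by stack() are ordinary arrays with 600 elements each.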
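
If the full dump() output is more than you need, the column types alone already tell the story. The sketch below is one way to list them. Like any modern example, it assumes a recent DataFrames.jl release (1.x) where the view is requested with stack(...; view=true) rather than stackdf(). The helper column_types and the toy dataframe are made up for illustration.

```julia
using DataFrames  # assumption: a recent DataFrames.jl release (1.x)

# Hypothetical helper: pair each column name with the type of its storage.
column_types(d) = [name => typeof(d[!, name]) for name in names(d)]

# Toy data standing in for iris: two measurement columns and one id column.
df = DataFrame(a = [1.0, 2.0], b = [3.0, 4.0], id = ["x", "y"])

copy_types = column_types(stack(df, [:a, :b]))
view_types = column_types(stack(df, [:a, :b]; view = true))

println("copy:")
foreach(p -> println("  ", p), copy_types)
println("view:")
foreach(p -> println("  ", p), view_types)
```

In the view, the value column should show up as a StackedVector and the other columns as RepeatedVector wrappers, mirroring what dump() reports above.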

Monday, August 21, 2017
