Sometimes we need to work with a dataset in a different form, and for that we reshape the data. Let us illustrate this using the iris dataset from the RDatasets package:
https://github.com/johnmyleswhite/RDatasets.jl#rdatasetsjl
julia> using DataFrames
julia> using RDatasets
julia> iris_df = dataset("datasets","iris");
julia> head(iris_df)
6×5 DataFrames.DataFrame
│ Row │ SepalLength │ SepalWidth │ PetalLength │ PetalWidth │ Species │
├─────┼─────────────┼────────────┼─────────────┼────────────┼──────────┤
│ 1 │ 5.1 │ 3.5 │ 1.4 │ 0.2 │ "setosa" │
│ 2 │ 4.9 │ 3.0 │ 1.4 │ 0.2 │ "setosa" │
│ 3 │ 4.7 │ 3.2 │ 1.3 │ 0.2 │ "setosa" │
│ 4 │ 4.6 │ 3.1 │ 1.5 │ 0.2 │ "setosa" │
│ 5 │ 5.0 │ 3.6 │ 1.4 │ 0.2 │ "setosa" │
│ 6 │ 5.4 │ 3.9 │ 1.7 │ 0.4 │ "setosa" │
julia>
Using the stack() function to reshape the dataset:
julia> iris_df_stack = stack(iris_df);
julia> head(iris_df_stack)
6×3 DataFrames.DataFrame
│ Row │ variable │ value │ Species │
├─────┼─────────────┼───────┼──────────┤
│ 1 │ SepalLength │ 5.1 │ "setosa" │
│ 2 │ SepalLength │ 4.9 │ "setosa" │
│ 3 │ SepalLength │ 4.7 │ "setosa" │
│ 4 │ SepalLength │ 4.6 │ "setosa" │
│ 5 │ SepalLength │ 5.0 │ "setosa" │
│ 6 │ SepalLength │ 5.4 │ "setosa" │
julia>
Observe how the DataFrame iris_df has been transformed into iris_df_stack. The following columns of iris_df:
- SepalLength
- SepalWidth
- PetalLength
- PetalWidth
now appear as row values in iris_df_stack under the column variable, and the values that were stored in these columns are now listed under the column value.
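To see the same transformation on a table small enough to read in full, here is a minimal sketch. It assumes a recent DataFrames.jl release (1.x), and the three-row DataFrame toy is a hypothetical stand-in for iris_df, not part of the original example. In recent releases the identifier column is placed before variable and value, so the column order may differ from the output shown above; the sketch therefore refers to columns by name.

```julia
using DataFrames

# Hypothetical three-row table standing in for iris_df
toy = DataFrame(SepalLength = [5.1, 4.9, 4.7],
                SepalWidth  = [3.5, 3.0, 3.2],
                Species     = ["setosa", "setosa", "setosa"])

# Stack the two measurement columns; Species is carried along as an identifier
toy_long = stack(toy, [:SepalLength, :SepalWidth])

println(size(toy_long))     # (6, 3): 3 rows x 2 stacked columns
println(toy_long.value)     # [5.1, 4.9, 4.7, 3.5, 3.0, 3.2]
println(toy_long.variable)  # the name of the column each value came from
```

Each original row contributes one row per stacked column, which is why the three-row table becomes a six-row table.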
We can now say that our dataset has been stacked: all four measurement columns have been stacked, while Species is kept as an identifier column. To get a better grasp of the stack() function, let us stack only specific columns and observe the output:
Stacking 2 columns:
julia> iris_df_stack2 = stack(iris_df,1:2);
Get the rows with unique values in the first column (variable) to verify:
julia> unique(iris_df_stack2,1)
2×5 DataFrames.DataFrame
│ Row │ variable │ value │ PetalLength │ PetalWidth │ Species │
├─────┼─────────────┼───────┼─────────────┼────────────┼──────────┤
│ 1 │ SepalLength │ 5.1 │ 1.4 │ 0.2 │ "setosa" │
│ 2 │ SepalWidth │ 3.5 │ 1.4 │ 0.2 │ "setosa" │
Stacking 3 columns:
julia> iris_df_stack3 = stack(iris_df,1:3);
Get the rows with unique values in the first column (variable) to verify:
julia> unique(iris_df_stack3,1)
3×4 DataFrames.DataFrame
│ Row │ variable │ value │ PetalWidth │ Species │
├─────┼─────────────┼───────┼────────────┼──────────┤
│ 1 │ SepalLength │ 5.1 │ 0.2 │ "setosa" │
│ 2 │ SepalWidth │ 3.5 │ 0.2 │ "setosa" │
│ 3 │ PetalLength │ 1.4 │ 0.2 │ "setosa" │
Stacking 4 columns:
julia> iris_df_stack4 = stack(iris_df,1:4);
Get the rows with unique values in the first column (variable) to verify:
julia> unique(iris_df_stack4,1)
4×3 DataFrames.DataFrame
│ Row │ variable │ value │ Species │
├─────┼─────────────┼───────┼──────────┤
│ 1 │ SepalLength │ 5.1 │ "setosa" │
│ 2 │ SepalWidth │ 3.5 │ "setosa" │
│ 3 │ PetalLength │ 1.4 │ "setosa" │
│ 4 │ PetalWidth │ 0.2 │ "setosa" │
julia>
We can see that the columns we passed to the stack function are the ones whose names appear in the variable column of the output DataFrame, while the columns we did not stack are carried along unchanged.
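The same check can be sketched on a small table. The example below assumes DataFrames.jl 1.x, and the DataFrame toy is a hypothetical stand-in for iris_df. It selects the columns once by position and once by name, then lists which column names ended up in variable and which columns were left unstacked.

```julia
using DataFrames

# Hypothetical two-row table standing in for iris_df
toy = DataFrame(SepalLength = [5.1, 4.9],
                SepalWidth  = [3.5, 3.0],
                PetalLength = [1.4, 1.4],
                Species     = ["setosa", "setosa"])

# Selecting the columns to stack by position or by name gives the same result
by_position = stack(toy, 1:2)
by_name     = stack(toy, [:SepalLength, :SepalWidth])
println(by_position == by_name)

# Names of the columns that were stacked
stacked_names = unique(string.(by_position.variable))
println(stacked_names)

# Columns that were not stacked and are carried along as identifiers
remaining = setdiff(string.(names(by_position)), ["variable", "value"])
println(remaining)
```

Selecting by name tends to be more robust than selecting by position, because it keeps working if the column order of the input changes.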
Let us observe the size of these DataFrames.
julia> size(iris_df)
(150, 5)
julia> size(iris_df_stack)
(600, 3)
julia> size(iris_df_stack2)
(300, 5)
julia> size(iris_df_stack3)
(450, 4)
julia> size(iris_df_stack4)
(600, 3)
julia>
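Since the original DataFrame has 150 rows, each column added to the stack adds 150 rows to the result. Stacking k of the 5 columns therefore gives 150 × k rows and 5 - k + 2 columns: the k stacked columns are replaced by the two columns variable and value.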
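This size rule can be checked with a short sketch. It assumes DataFrames.jl 1.x; the DataFrame toy and its column names are hypothetical placeholders with the same shape as iris_df (150 rows, four numeric columns and one label column).

```julia
using DataFrames

n = 150

# Hypothetical table with the same shape as iris_df: (150, 5)
toy = DataFrame(a = fill(1.0, n), b = fill(2.0, n),
                c = fill(3.0, n), d = fill(4.0, n),
                label = fill("x", n))

# Stack the first k columns for k = 2, 3, 4 and record the resulting sizes
sizes = [size(stack(toy, 1:k)) for k in 2:4]

# Sizes predicted by the rule: n * k rows, ncol - k + 2 columns
expected = [(n * k, ncol(toy) - k + 2) for k in 2:4]

println(sizes)              # [(300, 5), (450, 4), (600, 3)]
println(sizes == expected)  # true
```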