Monday, August 21, 2017

Julia - Language - Reshaping of Datasets - Stack Function

Sometimes we need to work with a dataset in a different shape, and for that we reshape the data. Let us illustrate this using the iris dataset from RDatasets:

https://github.com/johnmyleswhite/RDatasets.jl#rdatasetsjl

julia> using DataFrames

julia> using RDatasets

julia> iris_df = dataset("datasets","iris");

julia> head(iris_df)
6×5 DataFrames.DataFrame
│ Row │ SepalLength │ SepalWidth │ PetalLength │ PetalWidth │ Species  │
├─────┼─────────────┼────────────┼─────────────┼────────────┼──────────┤
│ 1   │ 5.1         │ 3.5        │ 1.4         │ 0.2        │ "setosa" │
│ 2   │ 4.9         │ 3.0        │ 1.4         │ 0.2        │ "setosa" │
│ 3   │ 4.7         │ 3.2        │ 1.3         │ 0.2        │ "setosa" │
│ 4   │ 4.6         │ 3.1        │ 1.5         │ 0.2        │ "setosa" │
│ 5   │ 5.0         │ 3.6        │ 1.4         │ 0.2        │ "setosa" │
│ 6   │ 5.4         │ 3.9        │ 1.7         │ 0.4        │ "setosa" │

julia> 

Using the stack() function to reshape the dataset:

julia> iris_df_stack = stack(iris_df);

julia> head(iris_df_stack)
6×3 DataFrames.DataFrame
│ Row │ variable    │ value │ Species  │
├─────┼─────────────┼───────┼──────────┤
│ 1   │ SepalLength │ 5.1   │ "setosa" │
│ 2   │ SepalLength │ 4.9   │ "setosa" │
│ 3   │ SepalLength │ 4.7   │ "setosa" │
│ 4   │ SepalLength │ 4.6   │ "setosa" │
│ 5   │ SepalLength │ 5.0   │ "setosa" │
│ 6   │ SepalLength │ 5.4   │ "setosa" │

julia> 

Observe how the DataFrame iris_df has been transformed into iris_df_stack. The following columns of iris_df:
  1. SepalLength
  2. SepalWidth
  3. PetalLength
  4. PetalWidth

now appear as row entries in iris_df_stack under the column variable, and the values those columns held are now listed under the column value. The Species column is carried along unchanged.
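To make the reshaping concrete, here is a minimal sketch of the same wide-to-long idea in plain Julia, without DataFrames. The helper stack_columns is hypothetical (it is not part of any package), and the three rows are copied from the iris preview above. It only illustrates the idea and is not how DataFrames implements stack().

```julia
# Wide table stored as a NamedTuple of equal-length column vectors.
wide = (SepalLength = [5.1, 4.9, 4.7],
        SepalWidth  = [3.5, 3.0, 3.2],
        Species     = ["setosa", "setosa", "setosa"])

# Hypothetical helper: turn the `measure` columns into (variable, value) rows
# and carry every other column along unchanged.
function stack_columns(table::NamedTuple, measure::Vector{Symbol})
    n = length(first(table))                                   # rows in the wide table
    ids = filter(k -> !(k in measure), collect(keys(table)))   # columns left untouched
    rows = []
    # Outer loop over stacked columns, inner loop over original rows,
    # so all SepalLength rows come before all SepalWidth rows.
    for col in measure, i in 1:n
        idvals = NamedTuple{Tuple(ids)}(Tuple(table[k][i] for k in ids))
        push!(rows, merge((variable = col, value = table[col][i]), idvals))
    end
    return rows
end

long = stack_columns(wide, [:SepalLength, :SepalWidth])
println(length(long))   # 6 rows: 3 original rows × 2 stacked columns
println(long[1])
```

Each original row shows up once per stacked column, which is why the stacked result is longer and narrower than the input.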

We can now say that our dataset has been stacked, since all four measurement columns have been stacked. To get a better feel for the stack() function, let us stack only specific columns and observe the output:

Stacking 2 columns:

julia> iris_df_stack2 = stack(iris_df,1:2);

Get the rows with unique values in the first column (variable) to verify:

julia> unique(iris_df_stack2,1)
2×5 DataFrames.DataFrame
│ Row │ variable    │ value │ PetalLength │ PetalWidth │ Species  │
├─────┼─────────────┼───────┼─────────────┼────────────┼──────────┤
│ 1   │ SepalLength │ 5.1   │ 1.4         │ 0.2        │ "setosa" │
│ 2   │ SepalWidth  │ 3.5   │ 1.4         │ 0.2        │ "setosa" │


Stacking 3 columns:

julia> iris_df_stack3 = stack(iris_df,1:3);

Get the rows with unique values in the first column (variable) to verify:

julia> unique(iris_df_stack3,1)
3×4 DataFrames.DataFrame
│ Row │ variable    │ value │ PetalWidth │ Species  │
├─────┼─────────────┼───────┼────────────┼──────────┤
│ 1   │ SepalLength │ 5.1   │ 0.2        │ "setosa" │
│ 2   │ SepalWidth  │ 3.5   │ 0.2        │ "setosa" │
│ 3   │ PetalLength │ 1.4   │ 0.2        │ "setosa" │

Stacking 4 columns:

julia> iris_df_stack4 = stack(iris_df,1:4);

Get the rows with unique values in the first column (variable) to verify:

julia> unique(iris_df_stack4,1)
4×3 DataFrames.DataFrame
│ Row │ variable    │ value │ Species  │
├─────┼─────────────┼───────┼──────────┤
│ 1   │ SepalLength │ 5.1   │ "setosa" │
│ 2   │ SepalWidth  │ 3.5   │ "setosa" │
│ 3   │ PetalLength │ 1.4   │ "setosa" │
│ 4   │ PetalWidth  │ 0.2   │ "setosa" │

julia> 

We can see that only the columns we passed to stack() show up as values of variable in the output DataFrame; the remaining columns are carried along unchanged.
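The transcripts in this post come from the DataFrames version that was current in 2017. As a rough sketch for readers on a newer release, the snippet below assumes DataFrames.jl 1.x is installed, where, to my knowledge, head(df) has given way to first(df, n) and stack() places the untouched columns before variable and value. The DataFrame mini is a hand-built, three-row stand-in for iris so that RDatasets is not needed.

```julia
using DataFrames  # assumption: DataFrames.jl 1.x

# Three-row stand-in for iris (values copied from the preview above).
mini = DataFrame(SepalLength = [5.1, 4.9, 4.7],
                 SepalWidth  = [3.5, 3.0, 3.2],
                 PetalLength = [1.4, 1.4, 1.3],
                 Species     = ["setosa", "setosa", "setosa"])

# Stack two columns, selected by name rather than by position.
long = stack(mini, [:SepalLength, :SepalWidth])

println(first(long, 3))          # first(df, n) in place of head(df)
println(unique(long.variable))   # names of the columns that were stacked
println(size(long))              # (6, 4)
```

Selecting columns by name and checking long.variable directly avoids relying on column positions, which is where the older and newer output layouts differ.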

Let us observe the size of these DataFrames. 

julia> size(iris_df)
(150, 5)

julia> size(iris_df_stack)
(600, 3)

julia> size(iris_df_stack2)
(300, 5)

julia> size(iris_df_stack3)
(450, 4)

julia> size(iris_df_stack4)
(600, 3)

julia> 

The original DataFrame has 150 rows, so each column added to the stack contributes another 150 rows to the stacked DataFrame.
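The sizes above follow a simple rule, sketched below in plain Julia. The helper stacked_size is hypothetical and only restates the arithmetic: stacking k columns of a DataFrame multiplies the row count by k, and replaces those k columns with the two columns variable and value.

```julia
# Hypothetical helper, not a DataFrames function.
# nrows, ncols: size of the wide DataFrame; k: number of columns being stacked.
function stacked_size(nrows::Integer, ncols::Integer, k::Integer)
    rows = nrows * k          # each stacked column contributes one copy of every original row
    cols = (ncols - k) + 2    # untouched columns plus `variable` and `value`
    return (rows, cols)
end

for k in 2:4
    println("stacking $k columns -> ", stacked_size(150, 5, k))
end
```

For iris_df with size (150, 5), this gives (300, 5), (450, 4) and (600, 3) for 2, 3 and 4 stacked columns, matching the sizes reported above.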

