Technology Works: Julia - Language - DataFrames - Spreadsheet like data structures

In the real world, the items in a dataset are often related to one another, so the information needs to be represented in a tabular format such as a spreadsheet or a table. A DataArray on its own cannot express this. To store related data in a single data structure, we can use DataFrames in Julia.

The corresponding data structures in Python and R are:

Julia	Python	R
DataFrames	Pandas	data.frame

Consider the following example:

$ julia
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: https://docs.julialang.org
   _ _   _| |_  __ _   |  Type "?help" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.6.0 (2017-06-19 13:05 UTC)
 _/ |\__'_|_|_|\__'_|  |  Official http://julialang.org/ release
|__/                   |  x86_64-pc-linux-gnu

julia> using DataFrames
INFO: Precompiling module DataFrames.

julia> waugh_df = DataFrame(Name=["Steve Waugh","Mark Waugh"], Age = [52,52], Country = ["Australia","Australia"])
2×3 DataFrames.DataFrame
│ Row │ Name          │ Age │ Country     │
├─────┼───────────────┼─────┼─────────────┤
│ 1   │ "Steve Waugh" │ 52  │ "Australia" │
│ 2   │ "Mark Waugh"  │ 52  │ "Australia" │

julia>

We cannot create the above dataset using a DataArray alone. The dataset has the following features:

Labelled columns.
Different data types in different columns. Recall that a matrix can only hold values of a single element type.
Each data element is related to the other elements in the same row across the columns. One could argue that several separate vectors would do the job, with the relations between them maintained by hand. However, nothing forces separate vectors to have the same length.
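The contrast between loose vectors and a DataFrame can be sketched as follows. This is a minimal illustration that assumes a recent DataFrames.jl release (1.x) rather than the 0.6-era version in the transcript; the variable names and the extra age value are made up for the example.

```julia
using DataFrames

# Two independent vectors: nothing stops them from drifting out of sync.
player_names = ["Steve Waugh", "Mark Waugh"]
player_ages  = [52, 52, 40]            # hypothetical extra value
same_length  = length(player_names) == length(player_ages)

# The DataFrame constructor refuses columns of different lengths.
rejected = try
    DataFrame(Name = player_names, Age = player_ages)
    false
catch
    true
end

# With matching lengths, each column keeps its own element type.
ok_df = DataFrame(Name = player_names, Age = [52, 52])
column_types = eltype.(eachcol(ok_df))
```

Here `same_length` is false while `rejected` is true: the mismatch goes unnoticed with plain vectors but is caught when the table is built.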

Note: In the DataFrames version used here, each column of a DataFrame is represented by a DataArray.

Technique 2:

Create an empty DataFrame by calling the DataFrame constructor with no arguments, then add the columns one at a time:

julia> cricket_score_df = DataFrame()
0×0 DataFrames.DataFrame

Add the first column, say Player:

julia> cricket_score_df[:Player] = ["Adam Gilchrist", "Mahendra Singh Dhoni","Jonty Rhodes","Sanath Jayasurya","Arjuna Ranatunga","Wasim Akram","Bret Lee","Brian Lara","Lance Klusner","Yuvaraj Singh"]
10-element Array{String,1}:
 "Adam Gilchrist"
 "Mahendra Singh Dhoni"
 "Jonty Rhodes"
 "Sanath Jayasurya"
 "Arjuna Ranatunga"
 "Wasim Akram"
 "Bret Lee"
 "Brian Lara"
 "Lance Klusner"
 "Yuvaraj Singh"

Add a second column, say Score:

julia> cricket_score_df[:Score] = [100,101,102,103,104,105,106,107,108,109]
10-element Array{Int64,1}:
 100
 101
 102
 103
 104
 105
 106
 107
 108
 109

Now, let us check the contents of the DataFrame:

julia> cricket_score_df
10×2 DataFrames.DataFrame
│ Row │ Player                 │ Score │
├─────┼────────────────────────┼───────┤
│ 1   │ "Adam Gilchrist"       │ 100   │
│ 2   │ "Mahendra Singh Dhoni" │ 101   │
│ 3   │ "Jonty Rhodes"         │ 102   │
│ 4   │ "Sanath Jayasurya"     │ 103   │
│ 5   │ "Arjuna Ranatunga"     │ 104   │
│ 6   │ "Wasim Akram"          │ 105   │
│ 7   │ "Bret Lee"             │ 106   │
│ 8   │ "Brian Lara"           │ 107   │
│ 9   │ "Lance Klusner"        │ 108   │
│ 10  │ "Yuvaraj Singh"        │ 109   │

julia>

We have created a DataFrame.
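The `df[:Column] = ...` form in the transcript belongs to the DataFrames version that shipped alongside Julia 0.6. As a hedged sketch, assuming a recent DataFrames.jl (1.x), the same column-by-column construction might look like this; `scores_df` and its three rows are a shortened, illustrative stand-in for the table above.

```julia
using DataFrames

scores_df = DataFrame()                 # empty 0×0 DataFrame

# Property syntax adds a new column.
scores_df.Player = ["Adam Gilchrist", "Jonty Rhodes", "Brian Lara"]

# Equivalent indexing syntax; `!` refers to the column without copying the table.
scores_df[!, :Score] = [100, 102, 107]

dims = size(scores_df)
```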

Finding the size of a DataFrame:

julia> size(cricket_score_df)
(10, 2)
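The tuple returned by size is ordered as (rows, columns). A small sketch, assuming a recent DataFrames.jl (1.x) and an illustrative three-row table, shows the per-dimension helpers that sit next to it:

```julia
using DataFrames

scores_df = DataFrame(Player = ["Adam Gilchrist", "Jonty Rhodes", "Brian Lara"],
                      Score  = [100, 102, 107])

dims   = size(scores_df)      # (rows, columns)
n_rows = nrow(scores_df)      # same as size(scores_df, 1)
n_cols = ncol(scores_df)      # same as size(scores_df, 2)
```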

Use the head and tail functions to view the first few and the last few rows of a DataFrame, respectively:

julia> head(cricket_score_df)
6×2 DataFrames.DataFrame
│ Row │ Player                 │ Score │
├─────┼────────────────────────┼───────┤
│ 1   │ "Adam Gilchrist"       │ 100   │
│ 2   │ "Mahendra Singh Dhoni" │ 101   │
│ 3   │ "Jonty Rhodes"         │ 102   │
│ 4   │ "Sanath Jayasurya"     │ 103   │
│ 5   │ "Arjuna Ranatunga"     │ 104   │
│ 6   │ "Wasim Akram"          │ 105   │

julia> tail(cricket_score_df)
6×2 DataFrames.DataFrame
│ Row │ Player             │ Score │
├─────┼────────────────────┼───────┤
│ 1   │ "Arjuna Ranatunga" │ 104   │
│ 2   │ "Wasim Akram"      │ 105   │
│ 3   │ "Bret Lee"         │ 106   │
│ 4   │ "Brian Lara"       │ 107   │
│ 5   │ "Lance Klusner"    │ 108   │
│ 6   │ "Yuvaraj Singh"    │ 109   │
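Later DataFrames.jl releases dropped head and tail in favour of Base's first and last with an explicit row count. The following sketch assumes DataFrames.jl 1.x and rebuilds the ten scores with placeholder player labels (the labels are invented for brevity):

```julia
using DataFrames

scores_df = DataFrame(Player = ["Player $i" for i in 1:10],   # placeholder labels
                      Score  = collect(100:109))

top_rows    = first(scores_df, 6)   # first six rows, like head above
bottom_rows = last(scores_df, 6)    # last six rows, like tail above
```

As in the transcript, the two results overlap when the table has fewer than twelve rows.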

To access the data in a particular column:

julia> cricket_score_df[:Player]
10-element DataArrays.DataArray{String,1}:
 "Adam Gilchrist"
 "Mahendra Singh Dhoni"
 "Jonty Rhodes"
 "Sanath Jayasurya"
 "Arjuna Ranatunga"
 "Wasim Akram"
 "Bret Lee"
 "Brian Lara"
 "Lance Klusner"
 "Yuvaraj Singh"

julia>

Because the columns carry meaningful names, we no longer need to remember their numeric indices; each column can be accessed by its name.
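Name-based access can be sketched as below for a recent DataFrames.jl (1.x), where the single-index form `df[:Player]` from the transcript is no longer accepted. The table is an illustrative three-row subset.

```julia
using DataFrames

scores_df = DataFrame(Player = ["Adam Gilchrist", "Jonty Rhodes", "Brian Lara"],
                      Score  = [100, 102, 107])

by_property = scores_df.Player         # column by name, no copy
by_bang     = scores_df[!, :Player]    # same column, no copy
by_copy     = scores_df[:, :Player]    # an independent copy of the column
by_position = scores_df[!, 1]          # numeric index still works, but is harder to read
```

The `!` row selector returns the stored vector itself, while `:` returns a copy, which matters if the result is later mutated.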

We can rename the columns by using the rename!() function:

julia> rename!(cricket_score_df,:Player,:TeamPlayer)
10×2 DataFrames.DataFrame
│ Row │ TeamPlayer             │ Score │
├─────┼────────────────────────┼───────┤
│ 1   │ "Adam Gilchrist"       │ 100   │
│ 2   │ "Mahendra Singh Dhoni" │ 101   │
│ 3   │ "Jonty Rhodes"         │ 102   │
│ 4   │ "Sanath Jayasurya"     │ 103   │
│ 5   │ "Arjuna Ranatunga"     │ 104   │
│ 6   │ "Wasim Akram"          │ 105   │
│ 7   │ "Bret Lee"             │ 106   │
│ 8   │ "Brian Lara"           │ 107   │
│ 9   │ "Lance Klusner"        │ 108   │
│ 10  │ "Yuvaraj Singh"        │ 109   │

julia>
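In recent DataFrames.jl releases the old and new names are passed as a pair rather than as two separate arguments. A minimal sketch, assuming DataFrames.jl 1.x and an illustrative three-row table:

```julia
using DataFrames

scores_df = DataFrame(Player = ["Adam Gilchrist", "Jonty Rhodes", "Brian Lara"],
                      Score  = [100, 102, 107])

# rename! modifies the DataFrame in place; rename (no `!`) would return a new one.
rename!(scores_df, :Player => :TeamPlayer)

new_names = names(scores_df)
```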

The describe function summarizes each column of the dataset:

julia> describe(cricket_score_df)
TeamPlayer
Summary Stats:
Length: 10
Type: String
Number Unique: 10
Number Missing: 0
% Missing: 0.000000

Score
Summary Stats:
Mean: 104.500000
Minimum: 100.000000
1st Quartile: 102.250000
Median: 104.500000
3rd Quartile: 106.750000
Maximum: 109.000000
Length: 10
Type: Int64
Number Missing: 0
% Missing: 0.000000
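In recent DataFrames.jl releases, describe returns its summary as another DataFrame with one row per column, rather than printing a text report. The sketch below assumes DataFrames.jl 1.x and uses placeholder player labels with the same ten scores:

```julia
using DataFrames

scores_df = DataFrame(TeamPlayer = ["Player $i" for i in 1:10],   # placeholder labels
                      Score      = collect(100:109))

summary_df = describe(scores_df)        # one row per column of scores_df

# Row 2 describes the Score column.
score_mean   = summary_df[2, :mean]
score_median = summary_df[2, :median]
```

Statistics that do not apply to a column, such as the mean of a string column, are filled with `nothing`.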

To list the column names, use the names() function:

julia> names(cricket_score_df)
2-element Array{Symbol,1}:
 :TeamPlayer
 :Score
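The transcript shows names returning symbols. In recent DataFrames.jl releases (1.x assumed here) names returns strings, and propertynames gives the symbol form. A short sketch with an illustrative one-row table:

```julia
using DataFrames

scores_df = DataFrame(TeamPlayer = ["Adam Gilchrist"], Score = [100])

name_strings = names(scores_df)           # column names as strings
name_symbols = propertynames(scores_df)   # column names as symbols
```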

Technology Works

Monday, August 14, 2017
