In the real world, datasets are often related to one another, so the information is best represented in a tabular format such as a spreadsheet or a table. DataArrays alone cannot capture this. To store related data in a single data structure, we can use DataFrames in Julia.
Counterparts of Julia's DataFrames in Python and R:
Julia | Python | R |
DataFrames | Pandas | data.frame |
Consider the example below:
$ julia
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: https://docs.julialang.org
   _ _   _| |_  __ _   |  Type "?help" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.6.0 (2017-06-19 13:05 UTC)
 _/ |\__'_|_|_|\__'_|  |  Official http://julialang.org/ release
|__/                   |  x86_64-pc-linux-gnu
julia> using DataFrames
INFO: Precompiling module DataFrames.
julia> waugh_df = DataFrame(Name=["Steve Waugh","Mark Waugh"], Age = [52,52], Country = ["Australia","Australia"])
2×3 DataFrames.DataFrame
│ Row │ Name          │ Age │ Country     │
├─────┼───────────────┼─────┼─────────────┤
│ 1   │ "Steve Waugh" │ 52  │ "Australia" │
│ 2   │ "Mark Waugh"  │ 52  │ "Australia" │
julia>
We cannot build the dataset above with DataArrays alone. This dataset has the following features:
- Labelled columns.
- Different data types in different columns. Remember that a matrix can only hold values of a single type.
- Each data element is related to the other elements in the same row. One might argue that several vectors could be used instead, with the relations between them maintained by hand, but nothing forces separate vectors to have the same length.
Note: In this version of DataFrames, each column is represented by a DataArray.
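The features listed above can be checked directly. The following is a minimal sketch, assuming a current DataFrames.jl release (1.x) rather than the 0.6-era version shown in the transcript; the helper names col_types and mismatch_rejected are illustrative only.

```julia
using DataFrames

# Columns with different element types sit side by side in one table.
waugh_df = DataFrame(Name = ["Steve Waugh", "Mark Waugh"],
                     Age = [52, 52],
                     Country = ["Australia", "Australia"])

# Element type of each column, in column order.
col_types = eltype.(eachcol(waugh_df))

# Unlike a set of loose vectors, the constructor rejects columns
# whose lengths differ (two names, three ages).
mismatch_rejected = try
    DataFrame(Name = ["Steve Waugh", "Mark Waugh"], Age = [52, 52, 52])
    false
catch
    true
end
```

The length check is what separates a DataFrame from a hand-maintained group of vectors: the rows cannot drift out of step.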
Technique 2:
Create an empty DataFrame by calling the DataFrame constructor with no arguments, then add the columns one at a time:
julia> cricket_score_df = DataFrame()
0×0 DataFrames.DataFrame
Add one column, say Player:
julia> cricket_score_df[:Player] = ["Adam Gilchrist", "Mahendra Singh Dhoni","Jonty Rhodes","Sanath Jayasurya","Arjuna Ranatunga","Wasim Akram","Bret Lee","Brian Lara","Lance Klusner","Yuvaraj Singh"]
10-element Array{String,1}:
"Adam Gilchrist"
"Mahendra Singh Dhoni"
"Jonty Rhodes"
"Sanath Jayasurya"
"Arjuna Ranatunga"
"Wasim Akram"
"Bret Lee"
"Brian Lara"
"Lance Klusner"
"Yuvaraj Singh"
Add a second column, say Score:
julia> cricket_score_df[:Score] = [100,101,102,103,104,105,106,107,108,109]
10-element Array{Int64,1}:
100
101
102
103
104
105
106
107
108
109
Now, let us check the contents of the DataFrame:
julia> cricket_score_df
10×2 DataFrames.DataFrame
│ Row │ Player                 │ Score │
├─────┼────────────────────────┼───────┤
│ 1   │ "Adam Gilchrist"       │ 100   │
│ 2   │ "Mahendra Singh Dhoni" │ 101   │
│ 3   │ "Jonty Rhodes"         │ 102   │
│ 4   │ "Sanath Jayasurya"     │ 103   │
│ 5   │ "Arjuna Ranatunga"     │ 104   │
│ 6   │ "Wasim Akram"          │ 105   │
│ 7   │ "Bret Lee"             │ 106   │
│ 8   │ "Brian Lara"           │ 107   │
│ 9   │ "Lance Klusner"        │ 108   │
│ 10  │ "Yuvaraj Singh"        │ 109   │
julia>
We have created a DataFrame.
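On recent DataFrames.jl releases the df[:Player] = ... form used in the transcript is no longer accepted. A rough equivalent of the column-by-column technique, assuming DataFrames.jl 1.x and using a shortened version of the data above, might look like this:

```julia
using DataFrames

cricket_score_df = DataFrame()   # empty, 0×0

# Property syntax adds a new column; the first column sets the row count.
cricket_score_df.Player = ["Adam Gilchrist", "Jonty Rhodes", "Brian Lara"]

# Indexing form: `!` refers to the column itself, without copying.
cricket_score_df[!, :Score] = [100, 102, 107]
```

Both forms add a column in place; which one to use is largely a matter of taste.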
Finding the size of a DataFrame:
julia> size(cricket_score_df)
(10, 2)
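Besides size, the row and column counts can be queried separately. A small sketch, assuming DataFrames.jl 1.x and a shortened dataset:

```julia
using DataFrames

df = DataFrame(Player = ["Adam Gilchrist", "Jonty Rhodes", "Brian Lara"],
               Score = [100, 102, 107])

dims = size(df)   # tuple of (rows, columns)
rows = nrow(df)   # same value as size(df, 1)
cols = ncol(df)   # same value as size(df, 2)
```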
Use the head and tail functions to view the first few and the last few rows of a DataFrame, respectively:
julia> head(cricket_score_df)
6×2 DataFrames.DataFrame
│ Row │ Player                 │ Score │
├─────┼────────────────────────┼───────┤
│ 1   │ "Adam Gilchrist"       │ 100   │
│ 2   │ "Mahendra Singh Dhoni" │ 101   │
│ 3   │ "Jonty Rhodes"         │ 102   │
│ 4   │ "Sanath Jayasurya"     │ 103   │
│ 5   │ "Arjuna Ranatunga"     │ 104   │
│ 6   │ "Wasim Akram"          │ 105   │
julia> tail(cricket_score_df)
6×2 DataFrames.DataFrame
│ Row │ Player             │ Score │
├─────┼────────────────────┼───────┤
│ 1   │ "Arjuna Ranatunga" │ 104   │
│ 2   │ "Wasim Akram"      │ 105   │
│ 3   │ "Bret Lee"         │ 106   │
│ 4   │ "Brian Lara"       │ 107   │
│ 5   │ "Lance Klusner"    │ 108   │
│ 6   │ "Yuvaraj Singh"    │ 109   │
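In DataFrames.jl 1.x, head and tail are no longer available; first and last with an explicit row count play the same role. A minimal sketch under that assumption, with placeholder player labels standing in for the real names:

```julia
using DataFrames

# "Player 1" ... "Player 10" are placeholders, not the names used above.
df = DataFrame(Player = ["Player $i" for i in 1:10],
               Score = collect(100:109))

top = first(df, 6)      # first six rows
bottom = last(df, 6)    # last six rows
```

Both calls return a new DataFrame, so the results can be inspected or modified without touching df.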
To access the data in a particular column:
julia> cricket_score_df[:Player]
10-element DataArrays.DataArray{String,1}:
"Adam Gilchrist"
"Mahendra Singh Dhoni"
"Jonty Rhodes"
"Sanath Jayasurya"
"Arjuna Ranatunga"
"Wasim Akram"
"Bret Lee"
"Brian Lara"
"Lance Klusner"
"Yuvaraj Singh"
julia>
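Current DataFrames.jl releases offer several ways to reach a column, and they differ in whether the data is copied. The sketch below assumes DataFrames.jl 1.x; the variable names are illustrative.

```julia
using DataFrames

df = DataFrame(Player = ["Adam Gilchrist", "Jonty Rhodes", "Brian Lara"],
               Score = [100, 102, 107])

players = df.Player           # the stored column itself, no copy
same = df[!, :Player]         # the same vector as df.Player
copied = df[:, :Player]       # a fresh copy of the column
by_string = df[!, "Score"]    # column names may also be given as strings
```

Use the copying form when the result will be mutated and the original table should stay unchanged.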
Because the columns have meaningful names, we no longer need to remember their numeric indices; each column can be accessed by its name.
We can rename the columns by using the rename!() function:
julia> rename!(cricket_score_df,:Player,:TeamPlayer)
10×2 DataFrames.DataFrame
│ Row │ TeamPlayer             │ Score │
├─────┼────────────────────────┼───────┤
│ 1   │ "Adam Gilchrist"       │ 100   │
│ 2   │ "Mahendra Singh Dhoni" │ 101   │
│ 3   │ "Jonty Rhodes"         │ 102   │
│ 4   │ "Sanath Jayasurya"     │ 103   │
│ 5   │ "Arjuna Ranatunga"     │ 104   │
│ 6   │ "Wasim Akram"          │ 105   │
│ 7   │ "Bret Lee"             │ 106   │
│ 8   │ "Brian Lara"           │ 107   │
│ 9   │ "Lance Klusner"        │ 108   │
│ 10  │ "Yuvaraj Singh"        │ 109   │
julia>
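The three-argument rename! call shown above belongs to the older API. In DataFrames.jl 1.x the rename is written as an old => new pair; a brief sketch under that assumption:

```julia
using DataFrames

df = DataFrame(Player = ["Adam Gilchrist", "Jonty Rhodes", "Brian Lara"],
               Score = [100, 102, 107])

# rename! changes the column name in place and returns the same DataFrame.
rename!(df, :Player => :TeamPlayer)
```

Several pairs can be passed in one call to rename more than one column at once.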
The describe function summarizes each column of the dataset for us:
julia> describe(cricket_score_df)
TeamPlayer
Summary Stats:
Length: 10
Type: String
Number Unique: 10
Number Missing: 0
% Missing: 0.000000
Score
Summary Stats:
Mean: 104.500000
Minimum: 100.000000
1st Quartile: 102.250000
Median: 104.500000
3rd Quartile: 106.750000
Maximum: 109.000000
Length: 10
Type: Int64
Number Missing: 0
% Missing: 0.000000
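In DataFrames.jl 1.x, describe returns its summary as a DataFrame rather than printing a report, and the statistics to compute can be named explicitly. The sketch below assumes that version and uses placeholder player labels with the same ten scores:

```julia
using DataFrames

# "Player 1" ... "Player 10" are placeholders for the real names.
df = DataFrame(TeamPlayer = ["Player $i" for i in 1:10],
               Score = collect(100:109))

# One row per column of df, one column per requested statistic.
summary_df = describe(df, :mean, :min, :q25, :median, :q75, :max)

# Pick out the row that describes the Score column.
score_row = summary_df[summary_df.variable .== :Score, :]
```

Statistics that do not apply to a column, such as the mean of the string column, come back as nothing.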
To list the column names, use the names() function:
julia> names(cricket_score_df)
2-element Array{Symbol,1}:
:TeamPlayer
:Score
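Note that on DataFrames.jl 1.x, names returns strings rather than symbols; propertynames gives the symbol form seen in the output above. A short sketch under that assumption:

```julia
using DataFrames

df = DataFrame(TeamPlayer = ["Adam Gilchrist", "Jonty Rhodes"],
               Score = [100, 102])

col_strings = names(df)           # column names as a Vector{String}
col_symbols = propertynames(df)   # column names as a Vector{Symbol}
```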