Monday, August 14, 2017

Julia - Language - DataFrames - Spreadsheet-like data structures

In the real world, datasets often have relations between them, so the information needs to be represented in formats such as spreadsheets or tables. A DataArray on its own cannot express these relations. To store related data in a single data structure, we can use DataFrames in Julia.

Corresponding data structures in Python and R:

Julia         Python     R
DataFrames    pandas     data.frame

Technique 1:

Pass the columns to the DataFrame constructor as keyword arguments. Consider the following example:

$ julia
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: https://docs.julialang.org
   _ _   _| |_  __ _   |  Type "?help" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.6.0 (2017-06-19 13:05 UTC)
 _/ |\__'_|_|_|\__'_|  |  Official http://julialang.org/ release
|__/                   |  x86_64-pc-linux-gnu

julia> using DataFrames
INFO: Precompiling module DataFrames.

julia> waugh_df = DataFrame(Name=["Steve Waugh","Mark Waugh"], Age = [52,52], Country = ["Australia","Australia"])
2×3 DataFrames.DataFrame
│ Row │ Name          │ Age │ Country     │
├─────┼───────────────┼─────┼─────────────┤
│ 1   │ "Steve Waugh" │ 52  │ "Australia" │
│ 2   │ "Mark Waugh"  │ 52  │ "Australia" │

julia> 


This dataset cannot be created with a DataArray alone. It has the following features:
  1. Labelled columns.
  2. Different data types in different columns. Remember that a matrix has a single element type for all of its values.
  3. Each data element is related to the other elements in the same row. One could argue that several separate vectors would do the job, but nothing forces separate vectors to have the same length.
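Points 2 and 3 can be checked directly. The following is a minimal sketch that assumes a current DataFrames release (1.x), which is newer than the Julia 0.6 session shown above; the variable names (names_vec, ages_vec, mixed, rejected) are made up for illustration.

```julia
using DataFrames

# Two loose vectors: nothing stops them from drifting out of sync.
names_vec = ["Steve Waugh", "Mark Waugh"]
ages_vec  = [52, 52, 37]      # one entry too many, and Julia does not complain

# A matrix has one element type; mixing strings and integers falls back to Any.
mixed = ["Steve Waugh" 52; "Mark Waugh" 52]
println(eltype(mixed))

# The DataFrame constructor refuses columns of unequal length.
rejected = try
    DataFrame(Name = names_vec, Age = ages_vec)
    false
catch err
    println("constructor failed with: ", typeof(err))
    true
end
```

The exact exception type may differ between releases, so the sketch only checks that construction fails.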

Note: In the DataFrames version used here, each column of a DataFrame is stored as a DataArray.


Technique 2:

Create an empty DataFrame by calling the DataFrame constructor with no arguments, then add the columns one at a time:
julia> cricket_score_df = DataFrame()
0×0 DataFrames.DataFrame

Add the first column, Player:

julia> cricket_score_df[:Player] = ["Adam Gilchrist", "Mahendra Singh Dhoni","Jonty Rhodes","Sanath Jayasurya","Arjuna Ranatunga","Wasim Akram","Bret Lee","Brian Lara","Lance Klusner","Yuvaraj Singh"]
10-element Array{String,1}:
 "Adam Gilchrist"      
 "Mahendra Singh Dhoni"
 "Jonty Rhodes"        
 "Sanath Jayasurya"    
 "Arjuna Ranatunga"    
 "Wasim Akram"         
 "Bret Lee"            
 "Brian Lara"          
 "Lance Klusner"       
 "Yuvaraj Singh"       

Add a second column, Score:

julia> cricket_score_df[:Score] = [100,101,102,103,104,105,106,107,108,109]
10-element Array{Int64,1}:
 100
 101
 102
 103
 104
 105
 106
 107
 108
 109

Now, let us check the contents of the DataFrame:

julia> cricket_score_df
10×2 DataFrames.DataFrame
│ Row │ Player                 │ Score │
├─────┼────────────────────────┼───────┤
│ 1   │ "Adam Gilchrist"       │ 100   │
│ 2   │ "Mahendra Singh Dhoni" │ 101   │
│ 3   │ "Jonty Rhodes"         │ 102   │
│ 4   │ "Sanath Jayasurya"     │ 103   │
│ 5   │ "Arjuna Ranatunga"     │ 104   │
│ 6   │ "Wasim Akram"          │ 105   │
│ 7   │ "Bret Lee"             │ 106   │
│ 8   │ "Brian Lara"           │ 107   │
│ 9   │ "Lance Klusner"        │ 108   │
│ 10  │ "Yuvaraj Singh"        │ 109   │

julia> 

We have created a DataFrame. 
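The cricket_score_df[:Player] = ... form above belongs to the DataFrames version that shipped for Julia 0.6. On a current DataFrames release (1.x, an assumption about what the reader has installed) the same column-by-column build would look roughly like the sketch below. The shortened data and the name scores are illustrative only.

```julia
using DataFrames

# Start from an empty DataFrame, as in the session above.
scores = DataFrame()

# Property syntax adds a column when the name does not exist yet.
scores.Player = ["Adam Gilchrist", "Jonty Rhodes", "Brian Lara"]

# Indexing with `!` (whole column) and a column name does the same job.
scores[!, :Score] = [100, 102, 107]

println(size(scores))   # number of rows and columns
```

Every column added after the first must match the existing row count, which is how the DataFrame keeps the rows aligned.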

Finding the size of a DataFrame, as (rows, columns):

julia> size(cricket_score_df)
(10, 2)

Use the head and tail functions to view the first six and the last six rows of a DataFrame, respectively:

julia> head(cricket_score_df)
6×2 DataFrames.DataFrame
│ Row │ Player                 │ Score │
├─────┼────────────────────────┼───────┤
│ 1   │ "Adam Gilchrist"       │ 100   │
│ 2   │ "Mahendra Singh Dhoni" │ 101   │
│ 3   │ "Jonty Rhodes"         │ 102   │
│ 4   │ "Sanath Jayasurya"     │ 103   │
│ 5   │ "Arjuna Ranatunga"     │ 104   │
│ 6   │ "Wasim Akram"          │ 105   │

julia> tail(cricket_score_df)
6×2 DataFrames.DataFrame
│ Row │ Player             │ Score │
├─────┼────────────────────┼───────┤
│ 1   │ "Arjuna Ranatunga" │ 104   │
│ 2   │ "Wasim Akram"      │ 105   │
│ 3   │ "Bret Lee"         │ 106   │
│ 4   │ "Brian Lara"       │ 107   │
│ 5   │ "Lance Klusner"    │ 108   │
│ 6   │ "Yuvaraj Singh"    │ 109   │
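As far as we know, later DataFrames releases dropped head and tail in favour of first and last, which take the number of rows explicitly. A small sketch under that assumption (DataFrames 1.x, with placeholder player names rather than the real ones):

```julia
using DataFrames

# Ten placeholder rows with the same scores as the session above.
scores = DataFrame(Player = ["Player $i" for i in 1:10], Score = collect(100:109))

top    = first(scores, 6)   # first six rows, like head
bottom = last(scores, 6)    # last six rows, like tail

println(top)
println(bottom)
```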


To access the data in a particular column:

julia> cricket_score_df[:Player]
10-element DataArrays.DataArray{String,1}:
 "Adam Gilchrist"      
 "Mahendra Singh Dhoni"
 "Jonty Rhodes"        
 "Sanath Jayasurya"    
 "Arjuna Ranatunga"    
 "Wasim Akram"         
 "Bret Lee"            
 "Brian Lara"          
 "Lance Klusner"       
 "Yuvaraj Singh"       

julia> 

Because the columns carry meaningful names, we no longer need to remember their numeric indices; each column can be accessed by its name.
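For readers on a current DataFrames release (1.x is assumed here), access by name is usually written in one of the forms below rather than as df[:Player]. This is a sketch with a shortened, illustrative table named scores.

```julia
using DataFrames

scores = DataFrame(Player = ["Adam Gilchrist", "Brian Lara"], Score = [100, 107])

by_property = scores.Player          # the stored column itself
by_bang     = scores[!, :Player]     # the same vector, no copy
by_colon    = scores[:, :Player]     # an independent copy of the column
by_string   = scores[!, "Player"]    # strings work as column names too

println(by_property)
```

The difference between `!` and `:` matters when the result is modified: changes to by_bang show up in the DataFrame, while changes to by_colon do not.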

We can rename a column with the rename!() function, passing the old name followed by the new one:


julia> rename!(cricket_score_df,:Player,:TeamPlayer)
10×2 DataFrames.DataFrame
│ Row │ TeamPlayer             │ Score │
├─────┼────────────────────────┼───────┤
│ 1   │ "Adam Gilchrist"       │ 100   │
│ 2   │ "Mahendra Singh Dhoni" │ 101   │
│ 3   │ "Jonty Rhodes"         │ 102   │
│ 4   │ "Sanath Jayasurya"     │ 103   │
│ 5   │ "Arjuna Ranatunga"     │ 104   │
│ 6   │ "Wasim Akram"          │ 105   │
│ 7   │ "Bret Lee"             │ 106   │
│ 8   │ "Brian Lara"           │ 107   │
│ 9   │ "Lance Klusner"        │ 108   │
│ 10  │ "Yuvaraj Singh"        │ 109   │

julia> 
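On current DataFrames releases (1.x assumed), the old and new names are passed as a Pair instead of as two separate arguments. A minimal sketch, using an illustrative table and a hypothetical new column name Runs:

```julia
using DataFrames

scores = DataFrame(Player = ["Adam Gilchrist", "Brian Lara"], Score = [100, 107])

# rename! changes the DataFrame in place; old => new is a Pair.
rename!(scores, :Player => :TeamPlayer)

# rename (without the bang) returns a renamed copy and leaves the original alone.
renamed_copy = rename(scores, :Score => :Runs)

println(names(scores))
println(names(renamed_copy))
```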


The describe function summarizes each column of the dataset for us:

julia> describe(cricket_score_df)
TeamPlayer
Summary Stats:
Length:         10
Type:           String
Number Unique:  10
Number Missing: 0
% Missing:      0.000000

Score
Summary Stats:
Mean:           104.500000
Minimum:        100.000000
1st Quartile:   102.250000
Median:         104.500000
3rd Quartile:   106.750000
Maximum:        109.000000
Length:         10
Type:           Int64
Number Missing: 0
% Missing:      0.000000
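In current DataFrames releases (1.x assumed), describe returns its summary as a DataFrame instead of printing it, so individual statistics can be picked out. The sketch below reuses the ten scores from above with placeholder player names; stats and score_row are illustrative names.

```julia
using DataFrames

scores = DataFrame(Player = ["Player $i" for i in 1:10], Score = collect(100:109))

# Ask only for the statistics of interest; one row per column comes back.
stats = describe(scores, :mean, :min, :median, :max)

# Pick the row that describes the Score column.
score_row = stats[stats.variable .== :Score, :]
println(score_row)
```

The mean and median should agree with the 104.5 shown in the output above.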



To list the column names, use the names() function:

julia> names(cricket_score_df)
2-element Array{Symbol,1}:
 :TeamPlayer
 :Score     
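One difference to be aware of on current DataFrames releases (1.x assumed): names returns strings there, and propertynames returns the symbols seen above. A short sketch with an illustrative table:

```julia
using DataFrames

scores = DataFrame(TeamPlayer = ["Adam Gilchrist", "Brian Lara"], Score = [100, 107])

as_strings = names(scores)           # column names as a Vector{String}
as_symbols = propertynames(scores)   # column names as a Vector{Symbol}

println(as_strings)
println(as_symbols)
```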




