Friday, August 11, 2017

Julia - Language - NA datatype and DataArrays Package

In Julia, consider an array "a" having 4 values:


$ julia
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: https://docs.julialang.org
   _ _   _| |_  __ _   |  Type "?help" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.6.0 (2017-06-19 13:05 UTC)
 _/ |\__'_|_|_|\__'_|  |  Official http://julialang.org/ release
|__/                   |  x86_64-pc-linux-gnu

julia> a = [1,2,3,4]
4-element Array{Int64,1}:
 1
 2
 3
 4


Now consider missing values. Suppose we want to set element 1 of array "a" to "no value". How would we do that? Let us try the following:

julia> a[1] = ""
ERROR: MethodError: Cannot `convert` an object of type String to an object of type Int64
This may have arisen from a call to the constructor Int64(...),
since type constructors fall back to convert methods.
Stacktrace:
 [1] setindex!(::Array{Int64,1}, ::String, ::Int64) at ./array.jl:549

julia> a[1] =
       ;

ERROR: syntax: unexpected ;

julia>
We find that it is not possible to assign "no value" to an element of a regular Julia array: an Array{Int64,1} can hold only Int64 values. To represent this concept of "no value", or a missing value, Julia (through the DataArrays package, as we will see) offers a singleton object, NA. Let us try to use NA on a regular Julia array:

julia> a[1] = NA
ERROR: UndefVarError: NA not defined

We find that NA is not available to use directly. To be able to use NA, we load the DataArrays package:

julia> using DataArrays

julia> a[1] = NA
ERROR: MethodError: Cannot `convert` an object of type DataArrays.NAtype to an object of type Int64
This may have arisen from a call to the constructor Int64(...),
since type constructors fall back to convert methods.
Stacktrace:
 [1] setindex!(::Array{Int64,1}, ::DataArrays.NAtype, ::Int64) at ./array.jl:549

Even with the package loaded, a regular Julia array does not accept NA ("no value / not available / null / missing value") as an element, because NA cannot be converted to Int64. Hence, we need to use a DataArray instead of a regular Julia array:

julia> b = DataArray([1,2,3,4])
4-element DataArrays.DataArray{Int64,1}:
 1
 2
 3
 4

Now, let us assign the value NA.

julia> b[1] = NA
NA

julia> b
4-element DataArrays.DataArray{Int64,1}:
  NA
 2 
 3 
 4 

julia>

Success!
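
To check programmatically whether a value is NA, DataArrays offers the isna and anyna functions. The following is a minimal sketch, assuming the same setup as the session above (Julia 0.6 with the DataArrays package installed); the variable names first_is_na, second_is_na and has_na are only illustrative.

```julia
# Assumes Julia 0.6 with the DataArrays package, as in the session above.
using DataArrays

b = DataArray([1, 2, 3, 4])
b[1] = NA

# isna on a single element reports whether that element is NA.
first_is_na  = isna(b[1])   # true
second_is_na = isna(b[2])   # false

# anyna reports whether the array holds at least one NA.
has_na = anyna(b)           # true
```

This kind of check helps when we do not know in advance which positions are missing.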

Now, let us try to calculate the mean over the elements of vector "b":

julia> mean(b)
NA
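
The NA result follows from how NA behaves in arithmetic: an operation that involves NA yields NA, so a single missing element makes the whole mean unknown. A small sketch of this behaviour, assuming Julia 0.6 with DataArrays as above (the names s and m are only illustrative):

```julia
# Assumes Julia 0.6 with the DataArrays package, as in the session above.
using DataArrays

b = DataArray([1, 2, 3, 4])
b[1] = NA

s = NA + 1      # arithmetic with NA gives NA
m = mean(b)     # NA, as shown above

isna(s)         # true
isna(m)         # true
```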


We see that mean does not raise an error. It returns NA, because the vector "b" contains an NA element and the result is therefore unknown. One workaround is to compute the mean over a slice of "b" that skips the NA element:

julia> mean(b[2:end])
3.0


But this technique is very inconvenient: to slice the vector we must know in advance which elements are NA. It is better to use the dropna function, which returns only the non-NA elements:

julia> mean(dropna(b))
3.0

We can see that dropna overcomes this drawback of slicing: the mean is calculated over the remaining 3 elements (2, 3 and 4), whichever positions hold NA.
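
If this pattern is needed often, it can be wrapped in a small helper. The sketch below assumes Julia 0.6 with DataArrays as above. mean_skipna is a hypothetical name, not part of the package, and the @data macro is used here only as a shorter way to build a DataArray that already contains NA.

```julia
# Assumes Julia 0.6 with the DataArrays package, as in the session above.
using DataArrays

# Hypothetical helper (not part of DataArrays): mean of the non-NA elements.
mean_skipna(x) = mean(dropna(x))

# Build a DataArray with an NA element in one step.
c = @data([NA, 2, 3, 4])

mean_skipna(c)       # 3.0, the mean of 2, 3 and 4
length(dropna(c))    # 3, the number of non-NA elements
```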
