Tuesday, September 19, 2017

Julia - Language - Comparing Samples of Random Numbers using the Distributions Package

The purpose of studying the Distributions package is to create arrays of data point values that are picked at random, each drawn from a specific distribution.

Let us imagine that we are comparing the results of two sets of experiments or observed outcomes. For example, consider the examination results of a particular course for two consecutive years. The advantage of using the Distributions package in Julia is that we do not need the actual results. We can simply simulate some results and store them in arrays.

Say 100 students take the examination each year. We can use the rand() function to generate random values. As its first argument, we pass a distribution object. As an example, we will use the Normal distribution with a given mean and standard deviation.

julia> using Distributions

Refer: https://juliastats.github.io/Distributions.jl/latest/univariate.html#Distributions.Normal

This says:
Normal(mu, sig)   # Normal distribution with mean mu and variance sig^2
This means that the Normal function takes two input parameters:

  1. Mean
  2. Standard Deviation

For usage of the rand() function, refer to: https://docs.julialang.org/en/latest/stdlib/numbers/#Base.Random.rand

In this example, we use the rand function to generate 100 random numbers representing the exam results for year 1, drawn from a Normal distribution with a mean of 67 and a standard deviation of 10.

julia> year1 = rand(Normal(67,10),100)
100-element Array{Float64,1}:
 62.5691
 58.6375
 73.3923
 59.4574
 63.5709
 76.1989
 54.7919
 75.6863
 72.857 
 74.2429
 67.6712
 68.6269
 70.7106
 81.5927
 59.1536
 62.5137
 63.9478
 67.7302
 73.0909
  ⋮     
 70.4392
 58.6405
 73.1801
 59.3546
 56.1787
 66.7647
 84.3317
 58.4741
 70.2783
 65.5608
 54.6659
 73.1065
 40.2822
 65.9016
 72.601 
 63.2549
 68.9232
 76.1037

julia> year2 = rand(Normal(71,15),100)
100-element Array{Float64,1}:
 75.7926
 69.0364
 63.3757
 79.114 
 74.4915
 59.5234
 76.6792
 61.4812
 46.3396
 68.1444
 94.0939
 78.3967
 55.9258
 93.2746
 73.0782
 52.2191
 69.3642
 41.075 
 79.629 
  ⋮     
 60.7618
 64.5384
 85.7592
 46.9339
 74.0685
 44.9415
 79.3021
 70.3854
 96.3139
 81.2064
 58.7041
 54.1915
 85.4515
 40.0752
 85.0923
 79.9331
 79.632 
 64.337 

julia> 
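As a rough cross-check of what rand(Normal(67,10),100) does, the same kind of sample can be sketched without the Distributions package by shifting and scaling standard normal draws. This is only a sketch and assumes Julia 1.x, where mean and std live in the Statistics standard library. The helper name simulate_scores and the seed 2017 are illustrative choices, not part of the session above.

```julia
using Random, Statistics

# Hypothetical helper: draw n scores from a normal distribution with the
# given mean and standard deviation by scaling standard normal values.
function simulate_scores(rng, mean_score, sd, n)
    return mean_score .+ sd .* randn(rng, n)
end

rng = MersenneTwister(2017)          # fixed seed so the run is repeatable
year1_sim = simulate_scores(rng, 67, 10, 100)
year2_sim = simulate_scores(rng, 71, 15, 100)

println("year 1: mean = ", mean(year1_sim), ", std = ", std(year1_sim))
println("year 2: mean = ", mean(year2_sim), ", std = ", std(year2_sim))
```

The sample means and standard deviations will be close to, but not exactly, the values we asked for, because each array holds only 100 random draws.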


Our aim is to compare these. The average score differs between the two years. Are they statistically different, though? This is not a course in statistics, but we will see how easily Julia can tell us whether there is such a difference.

For now, let us plot the theoretical distributions. Note that these are the density curves of the two distributions, not the sampled values.

$ julia
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: https://docs.julialang.org
   _ _   _| |_  __ _   |  Type "?help" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.6.0 (2017-06-19 13:05 UTC)
 _/ |\__'_|_|_|\__'_|  |  Official http://julialang.org/ release
|__/                   |  x86_64-pc-linux-gnu

julia> using Distributions

julia> using Plots

julia> plotlyjs()
Plots.PlotlyJSBackend()

julia> using StatPlots

julia> plot(Normal(67,10), fillrange=0,fillalpha=0.5,fillcolor=:orange, label = "Year 1", title = "Density"  )




julia> plot!(Normal(71,15), fillrange=0,fillalpha=0.5,fillcolor=:blue, label = "Year 2"  )





We can get more statistical information from these two samples.

julia> skewness(year1) , skewness(year2)
(-0.024341675858438033, 0.3728199943398096)

julia> kurtosis(year1), kurtosis(year2)
(-0.3391716502155475, -0.3560651067956062)
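To make these two numbers less of a black box, here is a minimal sketch of how skewness and kurtosis can be computed by hand. It assumes Julia 1.x with the Statistics standard library, and it assumes the usual moment-based definitions, with kurtosis reported as excess kurtosis so that a normal distribution scores 0. The function names sample_skewness and excess_kurtosis are made up for this illustration.

```julia
using Statistics

# Sample skewness: third central moment divided by the 1.5 power of the
# second central moment (moments divide by n, not n - 1).
function sample_skewness(x)
    μ = mean(x)
    m2 = mean((x .- μ) .^ 2)
    m3 = mean((x .- μ) .^ 3)
    return m3 / m2^1.5
end

# Excess kurtosis: fourth central moment over the squared second central
# moment, minus 3 so that a normal distribution scores 0.
function excess_kurtosis(x)
    μ = mean(x)
    m2 = mean((x .- μ) .^ 2)
    m4 = mean((x .- μ) .^ 4)
    return m4 / m2^2 - 3
end

data = [1.0, 2.0, 3.0, 4.0, 5.0]     # small symmetric toy sample
println(sample_skewness(data))
println(excess_kurtosis(data))
```

Values near 0 for both measures, as in the output above, are what we would expect from samples drawn from a normal distribution.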

julia> using HypothesisTests

julia> EqualVarianceTTest(year1,year2)
Two sample t-test (equal variance)
----------------------------------
Population details:
    parameter of interest:   Mean difference
    value under h_0:         0
    point estimate:          -6.401225641166363
    95% confidence interval: (-10.17335143893694, -2.6290998433957857)

Test summary:
    outcome with 95% confidence: reject h_0
    two-sided p-value:           0.0009792968343477447

Details:
    number of observations:   [100,100]
    t-statistic:              -3.34647610411475
    degrees of freedom:       198
    empirical standard error: 1.9128257432633577


julia> 

The output says that the parameter of interest is the mean difference, that is, the difference between the two sample means.
h_0 denotes the null hypothesis, which here states that the mean difference is 0. The test summary says that we can reject the null hypothesis at the 95% confidence level, as the two-sided p-value is about 0.00098, well below 0.05. The t-statistic was -3.346 and there were 198 degrees of freedom: 200 values in total, minus one for each of the two groups. We also see a standard error of 1.91.
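The quantities in this report can be reproduced by hand. The sketch below assumes Julia 1.x with the Statistics standard library and uses the textbook pooled-variance formula for an equal-variance two-sample t-test. The function name pooled_t is hypothetical and is not part of HypothesisTests.

```julia
using Statistics

# Equal-variance (pooled) two-sample t-statistic, computed by hand.
function pooled_t(x, y)
    nx, ny = length(x), length(y)
    df = nx + ny - 2                                   # one lost per group
    pooled_var = ((nx - 1) * var(x) + (ny - 1) * var(y)) / df
    se = sqrt(pooled_var * (1 / nx + 1 / ny))          # standard error
    t = (mean(x) - mean(y)) / se
    return (t = t, df = df, se = se)
end

# Consistency check on the numbers printed by EqualVarianceTTest:
# the t-statistic is the point estimate divided by the standard error.
t_from_output = -6.401225641166363 / 1.9128257432633577
println(t_from_output)

# Degrees of freedom for two groups of 100 observations each.
println(100 + 100 - 2)
```

Calling pooled_t(year1, year2) on the two arrays should give the same t-statistic, degrees of freedom and standard error as the report, up to floating point rounding.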

So the hypothesis test gives us useful information: we can quickly determine whether the values stored in two Julia arrays are statistically different from each other.

Let us now put these two arrays in a DataFrame.

julia> using DataFrames

julia> dataDF = DataFrame(one=year1,two=year2)
100×2 DataFrames.DataFrame
│ Row │ one     │ two     │
├─────┼─────────┼─────────┤
│ 1   │ 72.3084 │ 63.1657 │
│ 2   │ 58.4573 │ 69.4973 │
│ 3   │ 70.8097 │ 94.1317 │
│ 4   │ 70.7864 │ 57.8267 │
│ 5   │ 65.2744 │ 82.1781 │
│ 6   │ 72.3527 │ 71.8708 │
│ 7   │ 68.5187 │ 56.9388 │
│ 8   │ 58.8689 │ 82.4909 │
│ 9   │ 62.0201 │ 43.9521 │
│ 10  │ 69.8665 │ 56.1039 │
│ 11  │ 70.3609 │ 66.7739 │
│ 12  │ 62.2294 │ 45.2597 │
│ 13  │ 70.4675 │ 65.4719 │
│ 14  │ 77.4113 │ 109.628 │
│ 15  │ 51.4468 │ 61.2482 │
│ 16  │ 62.5706 │ 88.3665 │
│ 17  │ 64.395  │ 84.3294 │

│ 83  │ 66.0341 │ 75.6371 │
│ 84  │ 51.3127 │ 56.022  │
│ 85  │ 73.9547 │ 55.9642 │
│ 86  │ 54.6271 │ 74.2315 │
│ 87  │ 79.5011 │ 46.2863 │
│ 88  │ 45.8135 │ 87.8725 │
│ 89  │ 68.2583 │ 72.3256 │
│ 90  │ 57.9371 │ 67.3443 │
│ 91  │ 74.6372 │ 78.1656 │
│ 92  │ 65.5897 │ 62.3794 │
│ 93  │ 91.3915 │ 72.8167 │
│ 94  │ 55.0672 │ 67.158  │
│ 95  │ 76.6133 │ 62.5635 │
│ 96  │ 54.8723 │ 69.6539 │
│ 97  │ 75.1471 │ 58.0032 │
│ 98  │ 59.0352 │ 69.4477 │
│ 99  │ 72.7119 │ 64.9528 │
│ 100 │ 71.0054 │ 44.5209 │

julia> 

Imagine this time that these were the same set of students going on from year 1 to year 2. As these are the aggregate marks at the end of year 1 and year 2, we now want to know if there is a correlation between the marks in the two years.

Since we are dealing with linear regression and least squares, this is easy to accomplish with functions from the GLM (Generalized Linear Models) package. Let us fit an ordinary least squares model. The first parameter, @formula(one~two), asks whether column one can be described as a linear function of column two of the DataFrame dataDF, which is passed as the second parameter.

We fit it with the Normal distribution and the IdentityLink. With this combination, the generalized linear model reduces to ordinary least squares.

Refer: https://github.com/JuliaStats/GLM.jl

julia> glm(@formula(one~two),dataDF,Normal(),IdentityLink())
DataFrames.DataFrameRegressionModel{GLM.GeneralizedLinearModel{GLM.GlmResp{Array{Float64,1},Distributions.Normal{Float64},GLM.IdentityLink},GLM.DensePredChol{Float64,Base.LinAlg.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

Formula: one ~ 1 + two

Coefficients:
             Estimate Std.Error z value Pr(>|z|)
(Intercept)   59.2152   4.34291 13.6349   <1e-41
two          0.106299 0.0613117 1.73376   0.0830


julia> 


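We see that we have an intercept of 59.2.

The slope of our line is 0.106299. Its p-value of 0.0830 is above 0.05, so the slope is not significantly different from zero at the 5% level.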
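For a straight-line fit, the intercept and slope can also be obtained directly with a least-squares solve. The following is a minimal sketch, assuming Julia 1.x. The function name ols_line and the toy data are made up for illustration, and it does not compute the standard errors or p-values that glm reports.

```julia
using LinearAlgebra

# Ordinary least squares for y ≈ intercept + slope * x, solved with the
# backslash operator on a design matrix whose first column is all ones.
function ols_line(x, y)
    X = hcat(ones(length(x)), x)     # design matrix: [1 x]
    β = X \ y                        # least-squares solution
    return (intercept = β[1], slope = β[2])
end

# Toy data lying exactly on the line y = 2 + 3x (not the exam marks).
x = [1.0, 2.0, 3.0, 4.0]
y = 2 .+ 3 .* x
fit_result = ols_line(x, y)
println(fit_result)
```

For the formula one~two, the column two plays the role of x and the column one plays the role of y.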

Saturday, September 16, 2017

Julia - Language - Plots - The standard normal distribution using Distributions Package

We can ask for random data point values for a variable to be taken from any number of discrete or continuous distributions. This greatly expands on the standard normal distribution provided by the randn() function. In this lesson we will concentrate on the continuous random variables and start with the normal distribution. 

Plotting the Standard Normal Distribution:
The Normal() function from the Distributions package takes two arguments. The first is the mean and the second is the standard deviation. We use it in conjunction with the rand() function so that we can specify how many data point values we want.

Refer: http://juliastats.github.io/Distributions.jl/stable/univariate.html#Distributions.Normal

$ julia
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: https://docs.julialang.org
   _ _   _| |_  __ _   |  Type "?help" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.6.0 (2017-06-19 13:05 UTC)
 _/ |\__'_|_|_|\__'_|  |  Official http://julialang.org/ release
|__/                   |  x86_64-pc-linux-gnu

julia> using Distributions

Let us check the Normal function

julia> Normal()
Distributions.Normal{Float64}(μ=0.0, σ=1.0)

julia> arr_norm2 = rand(Normal(0,1),10000)
10000-element Array{Float64,1}:
  0.617306 
  0.916475 
  0.374644 
 -0.361471 
 -0.214228 
  0.528817 
  0.56169  
  1.01749  
 -0.787201 
  1.54954  
  1.90827  
  1.1112   
 -0.319775 
 -0.146447 
 -1.23947  
 -1.50368  
  2.18589  
  0.204094 
  1.51182  
  ⋮        
 -0.424308 
 -0.100826 
 -1.95461  
 -0.63926  
 -0.899714 
  0.369235 
  0.382081 
 -0.118614 
 -0.649621 
 -1.3039   
 -0.524052 
  1.30383  
  1.00772  
 -0.47544  
 -0.62028  
 -0.0306563
  0.269642 
 -0.924341 

julia> 

Just to expand on the point, let us use Plots to show us the theoretical normal distribution. We must import the following packages: 

julia> using StatPlots

julia> using Plots

julia> plotlyjs()
Plots.PlotlyJSBackend()

Let us plot the Standard Normal Distribution 

julia> plot(Normal(0,1), fillrange=0, fillalpha=0.5 , fillcolor=:blue, label="Standard Normal Distribution", title="Density")

julia>




We can also fit a distribution to some data. In the example below we create the arr_norm1 array and fit a Normal distribution to it, which estimates the mean and standard deviation from the data.

julia> arr_norm1 = randn(1000);

julia> fit(Normal,arr_norm1)
Distributions.Normal{Float64}(μ=-0.042955933897932966, σ=1.0144309652412293)

julia> 
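To see roughly what such a fit computes, here is a minimal sketch. It assumes that the fit is a maximum-likelihood estimate, which for a normal distribution is the sample mean together with the standard deviation computed with divisor n. It also assumes Julia 1.x with the Statistics standard library, and the function name normal_mle is hypothetical.

```julia
using Statistics

# Maximum-likelihood estimates for a normal distribution: the sample mean
# and the standard deviation computed with divisor n (not n - 1).
function normal_mle(x)
    μ = mean(x)
    σ = sqrt(mean((x .- μ) .^ 2))
    return (μ = μ, σ = σ)
end

draws = randn(1000)                  # 1000 standard normal values
est = normal_mle(draws)
println(est)
```

With 1000 standard normal draws, the estimates should land close to 0 and 1, as they do in the fit shown above.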

We can plot other distributions as well. Here is the χ² distribution with different degrees of freedom. 

julia> plot(Chisq(3), fillrange=0,fillalpha=0.5,fillcolor=:blue, label = "3 degrees of freedom")

julia> 



Now, let us use the plot!() function. This is the plot function with a bang (!) sign. It draws on top of the existing plot instead of starting a new one.

julia> plot!(Chisq(5), fillrange=0,fillalpha=0.25,fillcolor=:orange, label = "5 degrees of freedom")

julia>





julia> plot!(Chisq(10), fillrange=0,fillalpha=0.25,fillcolor=:deepskyblue, label = "10 degrees of freedom")

julia> 




As seen above, each plot!() call draws over the existing plot.







Julia - Language - PlotlyJS - The standard normal distribution

For an explanation of the normal distribution, refer to the Khan Academy video on the topic.







Function: randn()
Purpose: Returns an array of randomly selected values from the standard normal distribution. The values cluster around a mean of 0, with a standard deviation of 1.


First, import the following packages:

$ julia
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: https://docs.julialang.org
   _ _   _| |_  __ _   |  Type "?help" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.6.0 (2017-06-19 13:05 UTC)
 _/ |\__'_|_|_|\__'_|  |  Official http://julialang.org/ release
|__/                   |  x86_64-pc-linux-gnu

julia> using Distributions

julia> using Plots

julia> plotlyjs()
Plots.PlotlyJSBackend()

julia> using StatPlots
INFO: Precompiling module TableTraits.
INFO: Precompiling module DataValues.
INFO: Precompiling module KernelDensity.
INFO: Precompiling module Loess.

julia> using HypothesisTests
INFO: Precompiling module HypothesisTests.

julia> using DataFrames

julia> using GLM
INFO: Precompiling module GLM.

julia>

Let us get 10000 such values and attach this array to the variable arr_norm1.

julia> arr_norm1 = randn(10000);

julia> arr_norm1
10000-element Array{Float64,1}:
 -0.330113
 -0.526716
 -0.159486
 -1.50848
  0.0444939
 -2.24241
 -0.413104
  1.15788
 -0.74063
  1.3488
  0.404499
 -0.148054
  2.38449
  0.0247912
 -0.846278
 -0.116104
  1.91943
  0.0293767
  0.649929
  ⋮      
 -0.231188
 -0.243903
  0.145286
 -1.68122
  1.39924
 -2.27001
  1.67537
 -0.595119
 -1.23359
  0.249857
 -0.332047
  0.485967
 -1.65663
  0.182978
 -0.220901
 -0.412765
  1.47191
 -0.547864

julia>


We can plot this as a histogram. In the example below, we use the keyword argument bins. Setting it to 20 means that the interval between the minimum and maximum values is divided into 20 equally sized ranges, and we count how many values occur in each range.

julia> histogram(arr_norm1,bins=20,label="Standard Normal Distribution",title="Histogram 01")
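The binning described above can be sketched by hand as follows. This is an illustration of equal-width binning only, and the plotting backend may adjust the exact number of bins it draws. The function name bin_counts is made up, and the sketch assumes the values are not all identical.

```julia
# Count how many values fall into each of nbins equal-width ranges
# between the minimum and maximum of the data.
function bin_counts(values, nbins)
    lo, hi = extrema(values)
    width = (hi - lo) / nbins
    counts = zeros(Int, nbins)
    for v in values
        # The maximum value would land one past the end, so clamp it
        # into the last bin.
        idx = min(nbins, floor(Int, (v - lo) / width) + 1)
        counts[idx] += 1
    end
    return counts
end

counts = bin_counts(randn(10_000), 20)
println(counts)
println(sum(counts))                 # every value is counted exactly once
```

For standard normal data, the largest counts appear in the middle bins and the counts fall away towards both ends.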






These values were selected at random. We can check how close we came to a true mean of 0 and a standard deviation of 1.


# finding mean
julia> mean(arr_norm1)
0.014639565332684248

#standard deviation using std()
julia> std(arr_norm1)
0.9959800381020006

# variance using var()
julia> var(arr_norm1)
0.9919762362976626

julia>
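The variance is the square of the standard deviation, which we can confirm with a short check. This sketch assumes Julia 1.x, where mean, std and var come from the Statistics standard library.

```julia
using Statistics

draws = randn(10_000)

# var() and std() agree up to floating point rounding.
println(var(draws) ≈ std(draws)^2)

# The same relationship holds for the numbers printed above:
# squaring the reported standard deviation gives the reported variance.
println(0.9959800381020006^2)
```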

Tuesday, September 12, 2017

Julia - Language - Handling Package Build Errors

Today, while trying to install and build the Rmath package in Julia, I received the following error:

julia> Pkg.build("Rmath")
INFO: Building Rmath
INFO: Attempting to Create directory /home/ubuntu/.julia/v0.6/Rmath/deps/downloads
INFO: Directory /home/ubuntu/.julia/v0.6/Rmath/deps/downloads already created
INFO: Downloading file https://github.com/JuliaLang/Rmath-julia/archive/v0.2.0.tar.gz
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0curl: (77) error setting certificate verify locations:
  CAfile: /etc/pki/tls/certs/ca-bundle.crt
  CApath: none
============================================================================[ ERROR: Rmath ]============================================================================

LoadError: failed process: Process(`curl -f -o /home/ubuntu/.julia/v0.6/Rmath/deps/downloads/v0.2.0.tar.gz -L https://github.com/JuliaLang/Rmath-julia/archive/v0.2.0.tar.gz`, ProcessExited(77)) [77]
while loading /home/ubuntu/.julia/v0.6/Rmath/deps/build.jl, in expression starting on line 42

========================================================================================================================================================================

============================================================================[ BUILD ERRORS ]============================================================================

WARNING: Rmath had build errors.

 - packages with build errors remain installed in /home/ubuntu/.julia/v0.6
 - build the package(s) and all dependencies with `Pkg.build("Rmath")`
 - build a single package by running its `deps/build.jl` script

========================================================================================================================================================================

julia>

On investigating, I found that the root cause was the curl error in the output above (curl: (77) error setting certificate verify locations). Hence, I searched for the error on Google and found the following Stack Overflow answer helpful:

https://stackoverflow.com/questions/3160909/how-do-i-deal-with-certificates-using-curl-while-trying-to-access-an-https-url/31424970#31424970

So, I followed the instructions in it and created the following file with content in my home directory. 

~ $ vi .curlrc
~ $ more .curlrc 
cacert=/etc/ssl/certs/ca-certificates.crt
~ $ 

This was done to inform curl about the location of the certificates in Ubuntu 14.04.
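A quick way to see which of the two certificate bundle paths exists on a machine is to check from Julia itself. This is only a diagnostic sketch. The two paths are the ones that appear in the error message and in .curlrc above, and the helper name existing_bundles is made up.

```julia
# Return the subset of candidate CA bundle paths that exist on this machine.
existing_bundles(paths) = filter(isfile, paths)

candidates = [
    "/etc/pki/tls/certs/ca-bundle.crt",      # path curl was looking for
    "/etc/ssl/certs/ca-certificates.crt",    # path used in ~/.curlrc above
]

for path in candidates
    println(rpad(path, 40), isfile(path) ? "found" : "missing")
end
```

If the first path is reported missing and the second is found, pointing curl at the second path as done above is a reasonable fix.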

After this, I ran the build command in Julia again:

julia> Pkg.build("Rmath")
INFO: Building Rmath
INFO: Attempting to Create directory /home/ubuntu/.julia/v0.6/Rmath/deps/downloads
INFO: Directory /home/ubuntu/.julia/v0.6/Rmath/deps/downloads already created
INFO: Downloading file https://github.com/JuliaLang/Rmath-julia/archive/v0.2.0.tar.gz
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   129    0   129    0     0    118      0 --:--:--  0:00:01 --:--:--   133
100  155k  100  155k    0     0  60274      0  0:00:02  0:00:02 --:--:--  118k
INFO: Done downloading file https://github.com/JuliaLang/Rmath-julia/archive/v0.2.0.tar.gz
INFO: Attempting to Create directory /home/ubuntu/.julia/v0.6/Rmath/deps/src
INFO: Attempting to Create directory /home/ubuntu/.julia/v0.6/Rmath/deps
INFO: Directory /home/ubuntu/.julia/v0.6/Rmath/deps already created
INFO: Attempting to Create directory /home/ubuntu/.julia/v0.6/Rmath/deps/usr/lib
INFO: Changing Directory to /home/ubuntu/.julia/v0.6/Rmath/deps/src/Rmath-julia-0.2.0
INFO: Changing Directory to /home/ubuntu/.julia/v0.6/Rmath/deps/src/Rmath-julia-0.2.0


julia>


This time, the build succeeded without any error message. I have posted this to share the technique and thought process I used to solve this error.

Monday, September 11, 2017

Julia - Language - Converting columns to Julia arrays

At times, we might need to work with Julia Arrays instead of columns in a DataFrame. For this purpose, we have the convert() function. Here my_small_df is the DataFrame built in the post on renaming columns below:

julia> my_small_array = convert(Array , my_small_df[:Col01]);

julia> my_small_array
10-element Array{Int64,1}:
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10

julia> 


Julia - Language - DataFrames - Renaming Columns

One more common problem in data science is the naming convention some data collectors use for their column names (variables). It is often necessary to rename these, at times even to help with de-identifying data to comply with regulations. The rename() function and its in-place counterpart rename!() can help us achieve just this.

$ julia
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: https://docs.julialang.org
   _ _   _| |_  __ _   |  Type "?help" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.6.0 (2017-06-19 13:05 UTC)
 _/ |\__'_|_|_|\__'_|  |  Official http://julialang.org/ release
|__/                   |  x86_64-pc-linux-gnu

julia> using DataFrames;

julia> my_small_df = DataFrame(x=1:10,y=rand(10),z=rand(["Afghanistan","Brazil","China","Denmark","England","Fiji","Guatemala"],10));

julia> my_small_df
10×3 DataFrames.DataFrame
│ Row │ x  │ y         │ z           │
├─────┼────┼───────────┼─────────────┤
│ 1   │ 1  │ 0.22024   │ "Guatemala" │
│ 2   │ 2  │ 0.0271676 │ "Denmark"   │
│ 3   │ 3  │ 0.757901  │ "China"     │
│ 4   │ 4  │ 0.605231  │ "China"     │
│ 5   │ 5  │ 0.779193  │ "Guatemala" │
│ 6   │ 6  │ 0.01555   │ "Brazil"    │
│ 7   │ 7  │ 0.441247  │ "England"   │
│ 8   │ 8  │ 0.35073   │ "Guatemala" │
│ 9   │ 9  │ 0.63757   │ "Denmark"   │
│ 10  │ 10 │ 0.922693  │ "China"     │

julia> 


The rename() function returns a renamed copy and leaves the original DataFrame unchanged, as the second output below shows.

julia> rename(my_small_df,:z,:Countries)
10×3 DataFrames.DataFrame
│ Row │ x  │ y         │ Countries   │
├─────┼────┼───────────┼─────────────┤
│ 1   │ 1  │ 0.22024   │ "Guatemala" │
│ 2   │ 2  │ 0.0271676 │ "Denmark"   │
│ 3   │ 3  │ 0.757901  │ "China"     │
│ 4   │ 4  │ 0.605231  │ "China"     │
│ 5   │ 5  │ 0.779193  │ "Guatemala" │
│ 6   │ 6  │ 0.01555   │ "Brazil"    │
│ 7   │ 7  │ 0.441247  │ "England"   │
│ 8   │ 8  │ 0.35073   │ "Guatemala" │
│ 9   │ 9  │ 0.63757   │ "Denmark"   │
│ 10  │ 10 │ 0.922693  │ "China"     │

julia> my_small_df
10×3 DataFrames.DataFrame
│ Row │ x  │ y         │ z           │
├─────┼────┼───────────┼─────────────┤
│ 1   │ 1  │ 0.22024   │ "Guatemala" │
│ 2   │ 2  │ 0.0271676 │ "Denmark"   │
│ 3   │ 3  │ 0.757901  │ "China"     │
│ 4   │ 4  │ 0.605231  │ "China"     │
│ 5   │ 5  │ 0.779193  │ "Guatemala" │
│ 6   │ 6  │ 0.01555   │ "Brazil"    │
│ 7   │ 7  │ 0.441247  │ "England"   │
│ 8   │ 8  │ 0.35073   │ "Guatemala" │
│ 9   │ 9  │ 0.63757   │ "Denmark"   │
│ 10  │ 10 │ 0.922693  │ "China"     │

To make the change permanent, we need to use the rename!() function. Let us use a dictionary, built with Dict(), to rename the columns.

julia> rename!(my_small_df,Dict(:x => :Col1, :y => :Col2, :z=>:Countries));

julia> my_small_df
10×3 DataFrames.DataFrame
│ Row │ Col1 │ Col2      │ Countries   │
├─────┼──────┼───────────┼─────────────┤
│ 1   │ 1    │ 0.22024   │ "Guatemala" │
│ 2   │ 2    │ 0.0271676 │ "Denmark"   │
│ 3   │ 3    │ 0.757901  │ "China"     │
│ 4   │ 4    │ 0.605231  │ "China"     │
│ 5   │ 5    │ 0.779193  │ "Guatemala" │
│ 6   │ 6    │ 0.01555   │ "Brazil"    │
│ 7   │ 7    │ 0.441247  │ "England"   │
│ 8   │ 8    │ 0.35073   │ "Guatemala" │
│ 9   │ 9    │ 0.63757   │ "Denmark"   │
│ 10  │ 10   │ 0.922693  │ "China"     │

julia> 


We can also use the names!() function to rename the columns:

julia> names!(my_small_df,[:Col01,:Col02,:Country]);

julia> my_small_df
10×3 DataFrames.DataFrame
│ Row │ Col01 │ Col02     │ Country     │
├─────┼───────┼───────────┼─────────────┤
│ 1   │ 1     │ 0.22024   │ "Guatemala" │
│ 2   │ 2     │ 0.0271676 │ "Denmark"   │
│ 3   │ 3     │ 0.757901  │ "China"     │
│ 4   │ 4     │ 0.605231  │ "China"     │
│ 5   │ 5     │ 0.779193  │ "Guatemala" │
│ 6   │ 6     │ 0.01555   │ "Brazil"    │
│ 7   │ 7     │ 0.441247  │ "England"   │
│ 8   │ 8     │ 0.35073   │ "Guatemala" │
│ 9   │ 9     │ 0.63757   │ "Denmark"   │
│ 10  │ 10    │ 0.922693  │ "China"     │

julia>

Julia - Language - DataFrames - Dealing with NA values

Another common problem in data entry is that of missing values. This results in data point values of the NA type. They create havoc when we try to work with the values in a DataFrame. Fortunately, we can get rid of rows that contain NA values in a few ways.

$ julia
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: https://docs.julialang.org
   _ _   _| |_  __ _   |  Type "?help" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.6.0 (2017-06-19 13:05 UTC)
 _/ |\__'_|_|_|\__'_|  |  Official http://julialang.org/ release
|__/                   |  x86_64-pc-linux-gnu

julia> using DataFrames

julia> numeric_df = DataFrame(Col1= 1:10, Col2 = 11:20, Col3 = 21:30)
10×3 DataFrames.DataFrame
│ Row │ Col1 │ Col2 │ Col3 │
├─────┼──────┼──────┼──────┤
│ 1   │ 1    │ 11   │ 21   │
│ 2   │ 2    │ 12   │ 22   │
│ 3   │ 3    │ 13   │ 23   │
│ 4   │ 4    │ 14   │ 24   │
│ 5   │ 5    │ 15   │ 25   │
│ 6   │ 6    │ 16   │ 26   │
│ 7   │ 7    │ 17   │ 27   │
│ 8   │ 8    │ 18   │ 28   │
│ 9   │ 9    │ 19   │ 29   │
│ 10  │ 10   │ 20   │ 30   │

julia> 


Let us set some NA values:

julia> numeric_df[3,:Col1] = NA
NA

julia> numeric_df[4,:Col2] = NA
NA

julia> numeric_df[[5,9],:Col3] = NA  # rows 5 and 9 in column Col3
NA

julia> numeric_df
10×3 DataFrames.DataFrame
│ Row │ Col1 │ Col2 │ Col3 │
├─────┼──────┼──────┼──────┤
│ 1   │ 1    │ 11   │ 21   │
│ 2   │ 2    │ 12   │ 22   │
│ 3   │ NA   │ 13   │ 23   │
│ 4   │ 4    │ NA   │ 24   │
│ 5   │ 5    │ 15   │ NA   │
│ 6   │ 6    │ 16   │ 26   │
│ 7   │ 7    │ 17   │ 27   │
│ 8   │ 8    │ 18   │ 28   │
│ 9   │ 9    │ 19   │ NA   │
│ 10  │ 10   │ 20   │ 30   │

julia> 


The completecases() function returns a Boolean value for each row, which is false if the row contains an NA value.

julia> completecases(numeric_df)
10-element DataArrays.DataArray{Bool,1}:
  true
  true
 false
 false
 false
  true
  true
  true
 false
  true

julia> 
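The idea behind this row mask can be sketched with plain arrays. In Julia 1.x the NA value was replaced by missing, which is part of Base, so the sketch below uses missing and ismissing() rather than the NA and isna() of this session. The function name complete_rows is hypothetical.

```julia
# Columns mirroring numeric_df, with `missing` standing in for NA.
col1 = [1, 2, missing, 4, 5, 6, 7, 8, 9, 10]
col2 = [11, 12, 13, missing, 15, 16, 17, 18, 19, 20]
col3 = [21, 22, 23, 24, missing, 26, 27, 28, missing, 30]

# A row is complete when none of its entries is missing.
function complete_rows(columns...)
    n = length(first(columns))
    return [all(!ismissing(col[i]) for col in columns) for i in 1:n]
end

mask = complete_rows(col1, col2, col3)
println(mask)

# Indexing with the mask keeps only the complete rows of a column.
println(col1[mask])
```

The mask is false for rows 3, 4, 5 and 9, the same rows that completecases() flags above.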

The completecases!() function permanently deletes rows with NA values.

julia> completecases!(numeric_df)
6×3 DataFrames.DataFrame
│ Row │ Col1 │ Col2 │ Col3 │
├─────┼──────┼──────┼──────┤
│ 1   │ 1    │ 11   │ 21   │
│ 2   │ 2    │ 12   │ 22   │
│ 3   │ 6    │ 16   │ 26   │
│ 4   │ 7    │ 17   │ 27   │
│ 5   │ 8    │ 18   │ 28   │
│ 6   │ 10   │ 20   │ 30   │

julia> numeric_df
6×3 DataFrames.DataFrame
│ Row │ Col1 │ Col2 │ Col3 │
├─────┼──────┼──────┼──────┤
│ 1   │ 1    │ 11   │ 21   │
│ 2   │ 2    │ 12   │ 22   │
│ 3   │ 6    │ 16   │ 26   │
│ 4   │ 7    │ 17   │ 27   │
│ 5   │ 8    │ 18   │ 28   │
│ 6   │ 10   │ 20   │ 30   │

julia> 

We observe that the rows containing NA values are deleted. Let us now recreate the DataFrame numeric_df to use another way of deleting rows with NA values.

julia> numeric_df = DataFrame(Col1= 1:10, Col2 = 11:20, Col3 = 21:30);

julia> numeric_df
10×3 DataFrames.DataFrame
│ Row │ Col1 │ Col2 │ Col3 │
├─────┼──────┼──────┼──────┤
│ 1   │ 1    │ 11   │ 21   │
│ 2   │ 2    │ 12   │ 22   │
│ 3   │ 3    │ 13   │ 23   │
│ 4   │ 4    │ 14   │ 24   │
│ 5   │ 5    │ 15   │ 25   │
│ 6   │ 6    │ 16   │ 26   │
│ 7   │ 7    │ 17   │ 27   │
│ 8   │ 8    │ 18   │ 28   │
│ 9   │ 9    │ 19   │ 29   │
│ 10  │ 10   │ 20   │ 30   │


Also, let us set the NA values again:

julia> numeric_df[3,:Col1] = NA
NA

julia> numeric_df[4,:Col2] = NA
NA

julia> numeric_df[[5,9],:Col3] = NA  # rows 5 and 9 in column Col3
NA

julia> 

Back with the original DataFrame, we can use the isna() function to show whether each value is NA:

julia> isna(numeric_df[:Col1])
10-element BitArray{1}:
 false
 false
  true
 false
 false
 false
 false
 false
 false
 false

julia> 

By wrapping this in the findin() function we can list only the indices of the NA rows. The findin() function lets us specify which values to look for, here the Boolean value true returned by isna().

julia> findin(isna(numeric_df[:Col1]),true)
1-element Array{Int64,1}:
 3

julia> 

We can also use the find() function to simply find the rows with NA values. 

julia> find(isna(numeric_df[:Col1]))
1-element Array{Int64,1}:
 3

julia> 
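In Julia 1.x, find() was renamed findall(), and missing together with ismissing() took over the role of NA and isna(). A minimal sketch of the same lookup on a plain array, under those assumptions:

```julia
# One column with a missing value in row 3, as in numeric_df above.
col1 = [1, 2, missing, 4, 5, 6, 7, 8, 9, 10]

flags = ismissing.(col1)          # element-wise test, like isna()
na_rows = findall(flags)          # indices where the flag is true
println(na_rows)

# Equivalent one-step form: pass the predicate straight to findall.
println(findall(ismissing, col1))
```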


This presents us with a way to delete all the rows that contain NA values. 

julia> rows,cols = size(numeric_df)
(10, 3)

julia> rows
10

julia> cols
3

julia> 


Let us create a for loop that goes through all the columns and deletes the rows with NA values:

julia> for i in 1:cols
       deleterows!(numeric_df, find(isna(numeric_df[:,i])))
       end

julia> numeric_df
6×3 DataFrames.DataFrame
│ Row │ Col1 │ Col2 │ Col3 │
├─────┼──────┼──────┼──────┤
│ 1   │ 1    │ 11   │ 21   │
│ 2   │ 2    │ 12   │ 22   │
│ 3   │ 6    │ 16   │ 26   │
│ 4   │ 7    │ 17   │ 27   │
│ 5   │ 8    │ 18   │ 28   │
│ 6   │ 10   │ 20   │ 30   │

julia>