Technology Works: September 2017

Tuesday, September 19, 2017

Julia - Language - Comparing Samples of Random Numbers using the Distributions Package

We study the Distributions package because it lets us create arrays of values picked at random, each drawn from a specific distribution.

Let us imagine that we want to compare the results of two sets of experiments or observed outcomes. For example, consider the examination results of a particular course for two consecutive years. The advantage of using the Distributions package in Julia is that we do not need the actual results: we can simply simulate some results and store them in arrays.

Say 100 students take the examination each year. We can use the rand() function to generate random values, passing a specific distribution as the first argument. As an example, we will use the Normal distribution with a given mean and standard deviation.

julia> using Distributions

Refer: https://juliastats.github.io/Distributions.jl/latest/univariate.html#Distributions.Normal

This says:
Normal(mu, sig) # Normal distribution with mean mu and variance sig^2
This means that the Normal function takes two input parameters:

Mean
Standard Deviation

For usage of the rand() function, refer: https://docs.julialang.org/en/latest/stdlib/numbers/#Base.Random.rand

In this example, we use the rand function to generate 100 random numbers representing the exam results for year 1, drawn from a Normal distribution with a mean of 67 and a standard deviation of 10. For year 2 we use a mean of 71 and a standard deviation of 15.

julia> year1 = rand(Normal(67,10),100)
100-element Array{Float64,1}:
62.5691
58.6375
73.3923
59.4574
63.5709
76.1989
54.7919
75.6863
72.857
74.2429
67.6712
68.6269
70.7106
81.5927
59.1536
62.5137
63.9478
67.7302
73.0909
⋮
70.4392
58.6405
73.1801
59.3546
56.1787
66.7647
84.3317
58.4741
70.2783
65.5608
54.6659
73.1065
40.2822
65.9016
72.601
63.2549
68.9232
76.1037

julia> year2 = rand(Normal(71,15),100)
100-element Array{Float64,1}:
75.7926
69.0364
63.3757
79.114
74.4915
59.5234
76.6792
61.4812
46.3396
68.1444
94.0939
78.3967
55.9258
93.2746
73.0782
52.2191
69.3642
41.075
79.629
⋮
60.7618
64.5384
85.7592
46.9339
74.0685
44.9415
79.3021
70.3854
96.3139
81.2064
58.7041
54.1915
85.4515
40.0752
85.0923
79.9331
79.632
64.337

julia>
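The two samples above can also be sketched with the standard library alone, since a draw from Normal(mu, sig) is a standard normal draw scaled by sig and shifted by mu. The snippet below is only an illustration, assuming Julia 1.x (where mean and std live in the Statistics standard library); the helper name simulate_scores and the seed are made up for this example.

```julia
using Random, Statistics

# Hypothetical helper: n draws from a normal distribution with mean mu and
# standard deviation sigma, built from standard normal draws.
simulate_scores(rng, mu, sigma, n) = mu .+ sigma .* randn(rng, n)

rng = MersenneTwister(2017)                  # arbitrary seed, for repeatability
year1 = simulate_scores(rng, 67, 10, 100)    # same parameters as Normal(67,10)
year2 = simulate_scores(rng, 71, 15, 100)    # same parameters as Normal(71,15)

println("year 1: mean = ", mean(year1), ", std = ", std(year1))
println("year 2: mean = ", mean(year2), ", std = ", std(year2))
```

With only 100 draws per year, the sample means and standard deviations will land near, but not exactly on, the parameters we asked for.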

Our aim is to compare these two samples. The average score differs between the years, but is the difference statistically significant? This is not a course in statistics, but we will see how easily Julia can tell us whether there is such a difference.

For now, let us plot the theoretical distributions. Note that these are the density curves of Normal(67,10) and Normal(71,15), not the sampled values.

$ julia
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: https://docs.julialang.org
   _ _   _| |_  __ _   |  Type "?help" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.6.0 (2017-06-19 13:05 UTC)
 _/ |\__'_|_|_|\__'_|  |  Official http://julialang.org/ release
|__/                   |  x86_64-pc-linux-gnu

julia> using Distributions

julia> using Plots

julia> plotlyjs()
Plots.PlotlyJSBackend()

julia> using StatPlots

julia> plot(Normal(67,10), fillrange=0,fillalpha=0.5,fillcolor=:orange, label = "Year 1", title = "Density" )

julia> plot!(Normal(71,15), fillrange=0,fillalpha=0.5,fillcolor=:blue, label = "Year 2" )
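The two curves are normal density functions. As a rough sketch of what is being drawn, the density can be evaluated directly from its formula; the function name normal_pdf is made up for this illustration and is not part of the Distributions package.

```julia
# Density of a normal distribution with mean mu and standard deviation sigma.
normal_pdf(x, mu, sigma) = exp(-((x - mu) / sigma)^2 / 2) / (sigma * sqrt(2pi))

# Peak heights of the two curves (each density is highest at its mean).
peak1 = normal_pdf(67, 67, 10)
peak2 = normal_pdf(71, 71, 15)
println("peak of year 1 curve: ", peak1)
println("peak of year 2 curve: ", peak2)
```

The year 2 curve is lower and wider than the year 1 curve because its standard deviation is larger.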

We can get more statistical information from the two samples, such as their skewness and kurtosis.

julia> skewness(year1) , skewness(year2)
(-0.024341675858438033, 0.3728199943398096)

julia> kurtosis(year1), kurtosis(year2)
(-0.3391716502155475, -0.3560651067956062)
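For reference, here is a minimal sketch of the moment-based definitions of skewness and excess kurtosis, assuming Julia 1.x with the Statistics standard library. The helper names are hypothetical, and we are assuming (not confirming) that the skewness() and kurtosis() calls above use the same definitions. A normal distribution has skewness 0 and excess kurtosis 0, so values close to 0 are what we hope to see for our samples.

```julia
using Statistics

# Central moment of order k (divides by n, not n - 1).
central_moment(x, k) = mean((x .- mean(x)) .^ k)

sample_skewness(x) = central_moment(x, 3) / central_moment(x, 2)^1.5
excess_kurtosis(x) = central_moment(x, 4) / central_moment(x, 2)^2 - 3

x = [1.0, 2.0, 3.0, 4.0, 5.0]     # small made-up, symmetric data set
println("skewness = ", sample_skewness(x))
println("excess kurtosis = ", excess_kurtosis(x))
```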

julia> using HypothesisTests

julia> EqualVarianceTTest(year1,year2)
Two sample t-test (equal variance)
----------------------------------
Population details:
parameter of interest: Mean difference
value under h_0: 0
point estimate: -6.401225641166363
95% confidence interval: (-10.17335143893694, -2.6290998433957857)

Test summary:
outcome with 95% confidence: reject h_0
two-sided p-value: 0.0009792968343477447

Details:
number of observations: [100,100]
t-statistic: -3.34647610411475
degrees of freedom: 198
empirical standard error: 1.9128257432633577

julia>

The output says that the parameter of interest is the mean difference, that is, the difference between the two sample means.
h_0 denotes the null hypothesis, here that the mean difference is 0. The test summary says that we can reject the null hypothesis at the 95% confidence level, as the two-sided p-value is about 0.00098. The t-statistic is -3.346 with 198 degrees of freedom: there are 200 values in total, minus one for each of the two groups. We also see a standard error of 1.91.

So the hypothesis test quickly tells us whether the values stored in the two Julia arrays are statistically different from each other.
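To make the numbers in the test output less of a black box, here is a sketch of the textbook pooled (equal-variance) t-statistic, assuming Julia 1.x with the Statistics standard library. The helper pooled_t and the small data set are made up for illustration; this is not the HypothesisTests implementation.

```julia
using Statistics

# Hypothetical helper: pooled (equal-variance) two-sample t-statistic.
function pooled_t(x, y)
    n1, n2 = length(x), length(y)
    df = n1 + n2 - 2                                    # one lost per group
    pooled_var = ((n1 - 1) * var(x) + (n2 - 1) * var(y)) / df
    se = sqrt(pooled_var * (1 / n1 + 1 / n2))           # standard error
    diff = mean(x) - mean(y)                            # point estimate
    return (diff = diff, se = se, t = diff / se, df = df)
end

r = pooled_t([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
println(r)

# Cross-check with the figures reported above: point estimate / standard error.
t_reported = -6.401225641166363 / 1.9128257432633577
println(t_reported)
```

The last value should come out close to the reported t-statistic of -3.346. Note that our two samples were simulated with different standard deviations (10 and 15), so the equal-variance assumption is only roughly met; the HypothesisTests package also provides UnequalVarianceTTest for that situation.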

Let us now put these two arrays in a DataFrame.

julia> using DataFrames

julia> dataDF = DataFrame(one=year1,two=year2)
100×2 DataFrames.DataFrame
│ Row │ one │ two │
├─────┼─────────┼─────────┤
│ 1 │ 72.3084 │ 63.1657 │
│ 2 │ 58.4573 │ 69.4973 │
│ 3 │ 70.8097 │ 94.1317 │
│ 4 │ 70.7864 │ 57.8267 │
│ 5 │ 65.2744 │ 82.1781 │
│ 6 │ 72.3527 │ 71.8708 │
│ 7 │ 68.5187 │ 56.9388 │
│ 8 │ 58.8689 │ 82.4909 │
│ 9 │ 62.0201 │ 43.9521 │
│ 10 │ 69.8665 │ 56.1039 │
│ 11 │ 70.3609 │ 66.7739 │
│ 12 │ 62.2294 │ 45.2597 │
│ 13 │ 70.4675 │ 65.4719 │
│ 14 │ 77.4113 │ 109.628 │
│ 15 │ 51.4468 │ 61.2482 │
│ 16 │ 62.5706 │ 88.3665 │
│ 17 │ 64.395 │ 84.3294 │
⋮
│ 83 │ 66.0341 │ 75.6371 │
│ 84 │ 51.3127 │ 56.022 │
│ 85 │ 73.9547 │ 55.9642 │
│ 86 │ 54.6271 │ 74.2315 │
│ 87 │ 79.5011 │ 46.2863 │
│ 88 │ 45.8135 │ 87.8725 │
│ 89 │ 68.2583 │ 72.3256 │
│ 90 │ 57.9371 │ 67.3443 │
│ 91 │ 74.6372 │ 78.1656 │
│ 92 │ 65.5897 │ 62.3794 │
│ 93 │ 91.3915 │ 72.8167 │
│ 94 │ 55.0672 │ 67.158 │
│ 95 │ 76.6133 │ 62.5635 │
│ 96 │ 54.8723 │ 69.6539 │
│ 97 │ 75.1471 │ 58.0032 │
│ 98 │ 59.0352 │ 69.4477 │
│ 99 │ 72.7119 │ 64.9528 │
│ 100 │ 71.0054 │ 44.5209 │

julia>

Imagine this time that these were the same set of students going on from year 1 to year 2, so that the columns hold their aggregate marks at the end of each year. We now want to know whether there is a correlation between the marks in the two years.

Since we are dealing with linear regression and least squares, this is easy to accomplish with the GLM (Generalized Linear Models) package. Let us fit an ordinary least squares model. The first argument, @formula(one~two), models column one as a linear function of column two; the second argument is the DataFrame dataDF that holds both columns.

We specify the Normal distribution and the IdentityLink, which together make the generalized linear model equivalent to ordinary least squares.

Refer: https://github.com/JuliaStats/GLM.jl

julia> glm(@formula(one~two),dataDF,Normal(),IdentityLink())
DataFrames.DataFrameRegressionModel{GLM.GeneralizedLinearModel{GLM.GlmResp{Array{Float64,1},Distributions.Normal{Float64},GLM.IdentityLink},GLM.DensePredChol{Float64,Base.LinAlg.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

Formula: one ~ 1 + two

Coefficients:
Estimate Std.Error z value Pr(>|z|)
(Intercept) 59.2152 4.34291 13.6349 <1e-41
two 0.106299 0.0613117 1.73376 0.0830

julia>

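We see that the intercept is 59.2 and the slope of the line is 0.106299. The p-value for the slope is 0.083, so at the 5% level we cannot conclude that the slope differs from zero, which is what we would expect for two independently simulated samples.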
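For a single predictor, the least-squares intercept and slope have a simple closed form. The sketch below assumes Julia 1.x with the Statistics standard library; the helper simple_ols and the data are made up for illustration and this is not how GLM computes its fit internally.

```julia
using Statistics

# Hypothetical helper: least-squares line y = intercept + slope * x.
function simple_ols(x, y)
    slope = cov(x, y) / var(x)
    intercept = mean(y) - slope * mean(x)
    return (intercept = intercept, slope = slope, r = cor(x, y))
end

two = [1.0, 2.0, 3.0, 4.0]        # predictor (made-up values)
one = 1.0 .+ 2.0 .* two           # response lying exactly on a line
fit_line = simple_ols(two, one)
println(fit_line)
```

Because the made-up response lies exactly on a line, the sketch recovers an intercept of 1, a slope of 2 and a correlation of 1 (up to rounding).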


Saturday, September 16, 2017

Julia - Language - Plots - The standard normal distribution using Distributions Package

We can ask for random data point values for a variable to be taken from any number of discrete or continuous distributions. This greatly expands on the standard normal distribution provided by the randn() function. In this lesson we will concentrate on the continuous random variables and start with the normal distribution.

Plotting the Standard Normal Distribution:
The Normal() function from the Distributions package takes two arguments: the first is the mean and the second is the standard deviation. We use it in conjunction with the rand() function so that we can specify how many data point values we want.

Refer: http://juliastats.github.io/Distributions.jl/stable/univariate.html#Distributions.Normal

$ julia
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: https://docs.julialang.org
   _ _   _| |_  __ _   |  Type "?help" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.6.0 (2017-06-19 13:05 UTC)
 _/ |\__'_|_|_|\__'_|  |  Official http://julialang.org/ release
|__/                   |  x86_64-pc-linux-gnu

julia> using Distributions

Let us check the Normal function

julia> Normal()
Distributions.Normal{Float64}(μ=0.0, σ=1.0)

julia> arr_norm2 = rand(Normal(0,1),1000)
1000-element Array{Float64,1}:
0.617306
0.916475
0.374644
-0.361471
-0.214228
0.528817
0.56169
1.01749
-0.787201
1.54954
1.90827
1.1112
-0.319775
-0.146447
-1.23947
-1.50368
2.18589
0.204094
1.51182
⋮
-0.424308
-0.100826
-1.95461
-0.63926
-0.899714
0.369235
0.382081
-0.118614
-0.649621
-1.3039
-0.524052
1.30383
1.00772
-0.47544
-0.62028
-0.0306563
0.269642
-0.924341

julia>
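Since it is easy to mix up the standard deviation and the variance, here is a small check, assuming Julia 1.x with the Random and Statistics standard libraries. We scale standard normal draws by an arbitrary sigma of 3 (a value chosen only for this illustration) and look at both quantities.

```julia
using Random, Statistics

rng = MersenneTwister(1)              # arbitrary seed
sigma = 3.0
x = sigma .* randn(rng, 100_000)      # like draws from a normal with mean 0, sd 3

println("std = ", std(x))             # should be near sigma
println("var = ", var(x))             # should be near sigma^2
```

The standard deviation comes out near 3 and the variance near 9, which is why the second argument of Normal() must be read as a standard deviation.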

Just to expand on the point, let us use Plots to show us the theoretical normal distribution. We must import the following packages:

julia> using StatPlots

julia> using Plots

julia> plotlyjs()
Plots.PlotlyJSBackend()

Let us plot the Standard Normal Distribution

julia> plot(Normal(0,1), fillrange=0, fillalpha=0.5 , fillcolor=:blue, label="Standard Normal Distribution", title="Density")

julia>

We can also fit a distribution to some data. In the example below we generate the arr_norm1 array with randn() and fit a Normal distribution to it; the estimated parameters come out close to the standard normal values of 0 and 1.

julia> arr_norm1 = randn(1000);

julia> fit(Normal,arr_norm1)
Distributions.Normal{Float64}(μ=-0.042955933897932966, σ=1.0144309652412293)

julia>
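As a sketch of what such a fit amounts to, assuming it is a maximum-likelihood fit (an assumption on our part), the estimates are the sample mean and the square root of the average squared deviation. The helper fit_normal_mle is hypothetical, and the snippet assumes Julia 1.x with the Statistics standard library.

```julia
using Statistics

# Hypothetical helper mirroring a maximum-likelihood normal fit.
function fit_normal_mle(x)
    mu = mean(x)
    sigma = sqrt(mean((x .- mu) .^ 2))   # divides by n, not n - 1
    return (mu = mu, sigma = sigma)
end

est = fit_normal_mle([1.0, 2.0, 3.0, 4.0, 5.0])   # made-up data
println(est)
```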

We can plot other distributions as well. Here is the χ² distribution, starting with 3 degrees of freedom.

julia> plot(Chisq(3), fillrange=0,fillalpha=0.5,fillcolor=:blue, label = "3 degrees of freedom")

julia>

Now, let us use the plot!() function, that is, the plot function with a bang (!) sign. It draws on top of the existing plot instead of starting a new one.

julia> plot!(Chisq(5), fillrange=0,fillalpha=0.25,fillcolor=:orange, label = "5 degrees of freedom")

julia>

julia> plot!(Chisq(10), fillrange=0,fillalpha=0.25,fillcolor=:deepskyblue, label = "10 degrees of freedom")

julia>

Each plot!() call adds its curve to the existing figure, so the three χ² densities end up in the same plot.
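To get a feel for where the χ² curves come from, we can simulate the distribution: a χ² variable with k degrees of freedom is the sum of k squared standard normal draws, with mean k and variance 2k. The sketch below assumes Julia 1.x with the Random and Statistics standard libraries; the helper chisq_draws and the seed are made up for illustration.

```julia
using Random, Statistics

# Hypothetical helper: n draws of a chi-square variable with k degrees of
# freedom, each the sum of k squared standard normal draws.
chisq_draws(rng, k, n) = vec(sum(abs2, randn(rng, k, n); dims = 1))

rng = MersenneTwister(42)             # arbitrary seed
for k in (3, 5, 10)
    x = chisq_draws(rng, k, 50_000)
    println("k = ", k, ": mean = ", mean(x), ", var = ", var(x))
end
```

The sample means should land near 3, 5 and 10, matching the way the plotted curves shift to the right as the degrees of freedom increase.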


Julia - Language - PlotlyJS - The standard normal distribution

For an explanation of the normal distribution, refer to the Khan Academy video on the subject.

Function	Purpose
randn()	Returns an array of values drawn at random from the standard normal distribution, which has a mean of 0 and a standard deviation of 1. The majority of values cluster around the mean.

First, import the following packages:

$ julia
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: https://docs.julialang.org
   _ _   _| |_  __ _   |  Type "?help" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.6.0 (2017-06-19 13:05 UTC)
 _/ |\__'_|_|_|\__'_|  |  Official http://julialang.org/ release
|__/                   |  x86_64-pc-linux-gnu

julia> using Distributions

julia> using Plots

julia> plotlyjs()
Plots.PlotlyJSBackend()

julia> using StatPlots
INFO: Precompiling module TableTraits.
INFO: Precompiling module DataValues.
INFO: Precompiling module KernelDensity.
INFO: Precompiling module Loess.

julia> using HypothesisTests
INFO: Precompiling module HypothesisTests.

julia> using DataFrames

julia> using GLM
INFO: Precompiling module GLM.

julia>

Let us draw 10000 such values and assign the array to the variable arr_norm1.

julia> arr_norm1 = randn(10000);

julia> arr_norm1
10000-element Array{Float64,1}:
-0.330113
-0.526716
-0.159486
-1.50848
0.0444939
-2.24241
-0.413104
1.15788
-0.74063
1.3488
0.404499
-0.148054
2.38449
0.0247912
-0.846278
-0.116104
1.91943
0.0293767
0.649929
⋮
-0.231188
-0.243903
0.145286
-1.68122
1.39924
-2.27001
1.67537
-0.595119
-1.23359
0.249857
-0.332047
0.485967
-1.65663
0.182978
-0.220901
-0.412765
1.47191
-0.547864

julia>

We can plot this as a histogram. In the example below, we use the keyword argument bins. Setting it to 20 means that the range between the minimum and maximum value is divided into 20 equally sized intervals, and we count how many values fall in each.

julia> histogram(arr_norm1,bins=20,label="Standard Normal Distribution",title="Histogram 01")
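The binning described above can be sketched by hand. The helper bin_counts below is hypothetical and only illustrates the counting idea; it is not how the Plots package computes its bins, and it assumes the values are not all identical.

```julia
# Hypothetical helper: count values in nbins equally sized ranges between the
# minimum and the maximum (assumes the values are not all identical).
function bin_counts(values, nbins)
    lo, hi = extrema(values)
    width = (hi - lo) / nbins
    counts = zeros(Int, nbins)
    for v in values
        # the maximum value is placed in the last bin
        idx = min(nbins, floor(Int, (v - lo) / width) + 1)
        counts[idx] += 1
    end
    return counts
end

counts = bin_counts([0.0, 0.1, 0.5, 0.9, 1.0], 2)   # made-up data, 2 bins
println(counts)
```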

These values were selected at random. We can check how close we came to a true mean of 0 and a standard deviation of 1.

# finding mean
julia> mean(arr_norm1)
0.014639565332684248

#standard deviation using std()
julia> std(arr_norm1)
0.9959800381020006

# variance using var()
julia> var(arr_norm1)
0.9919762362976626

julia>
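How close is close enough? One rough yardstick is the standard error of the mean, std / sqrt(n). The sketch below plugs in the values reported above; only the variable names are ours.

```julia
n = 10_000
sample_mean = 0.014639565332684248    # mean(arr_norm1) reported above
sample_std = 0.9959800381020006       # std(arr_norm1) reported above
sample_var = 0.9919762362976626       # var(arr_norm1) reported above

se = sample_std / sqrt(n)             # standard error of the mean
println("standard error = ", se)
println("mean in units of standard error = ", sample_mean / se)
println("std^2 - var = ", sample_std^2 - sample_var)
```

The sample mean is within about one and a half standard errors of 0, which is unremarkable for random data, and the squared standard deviation agrees with the reported variance.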

Tuesday, September 12, 2017

Julia - Language - Handling Package Build Errors

Today, while trying to install and build the Rmath package in Julia, I received the following error:

julia> Pkg.build("Rmath")
INFO: Building Rmath
INFO: Attempting to Create directory /home/ubuntu/.julia/v0.6/Rmath/deps/downloads
INFO: Directory /home/ubuntu/.julia/v0.6/Rmath/deps/downloads already created
INFO: Downloading file https://github.com/JuliaLang/Rmath-julia/archive/v0.2.0.tar.gz
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0curl: (77) error setting certificate verify locations:
CAfile: /etc/pki/tls/certs/ca-bundle.crt
CApath: none
============================================================================[ ERROR: Rmath ]============================================================================

LoadError: failed process: Process(`curl -f -o /home/ubuntu/.julia/v0.6/Rmath/deps/downloads/v0.2.0.tar.gz -L https://github.com/JuliaLang/Rmath-julia/archive/v0.2.0.tar.gz`, ProcessExited(77)) [77]
while loading /home/ubuntu/.julia/v0.6/Rmath/deps/build.jl, in expression starting on line 42

========================================================================================================================================================================

============================================================================[ BUILD ERRORS ]============================================================================

WARNING: Rmath had build errors.

- packages with build errors remain installed in /home/ubuntu/.julia/v0.6
- build the package(s) and all dependencies with `Pkg.build("Rmath")`
- build a single package by running its `deps/build.jl` script

========================================================================================================================================================================

julia>

On investigating, I found that the root cause was the curl error in the output above: "curl: (77) error setting certificate verify locations". I searched for this error on Google and found the following Stack Overflow answer helpful:

https://stackoverflow.com/questions/3160909/how-do-i-deal-with-certificates-using-curl-while-trying-to-access-an-https-url/31424970#31424970

So, I followed the instructions in it and created the following file in my home directory.

~ $ vi .curlrc
~ $ more .curlrc
cacert=/etc/ssl/certs/ca-certificates.crt
~ $

This tells curl where the certificates are located on Ubuntu 14.04.
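Before editing .curlrc, it can help to confirm which certificate bundle actually exists on the machine. The sketch below checks the two paths that appear in this post; the helper existing_bundles is made up for illustration.

```julia
# Candidate CA bundle locations: the first is the path curl complained about,
# the second is the one written to ~/.curlrc above.
candidates = ["/etc/pki/tls/certs/ca-bundle.crt",
              "/etc/ssl/certs/ca-certificates.crt"]

# Hypothetical helper: keep only the paths that exist as regular files.
existing_bundles(paths) = filter(isfile, paths)

for path in existing_bundles(candidates)
    println("found CA bundle: ", path)
end
```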

After this, I ran the build command in Julia again:

julia> Pkg.build("Rmath")
INFO: Building Rmath
INFO: Attempting to Create directory /home/ubuntu/.julia/v0.6/Rmath/deps/downloads
INFO: Directory /home/ubuntu/.julia/v0.6/Rmath/deps/downloads already created
INFO: Downloading file https://github.com/JuliaLang/Rmath-julia/archive/v0.2.0.tar.gz
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 129 0 129 0 0 118 0 --:--:-- 0:00:01 --:--:-- 133
100 155k 100 155k 0 0 60274 0 0:00:02 0:00:02 --:--:-- 118k
INFO: Done downloading file https://github.com/JuliaLang/Rmath-julia/archive/v0.2.0.tar.gz
INFO: Attempting to Create directory /home/ubuntu/.julia/v0.6/Rmath/deps/src
INFO: Attempting to Create directory /home/ubuntu/.julia/v0.6/Rmath/deps
INFO: Directory /home/ubuntu/.julia/v0.6/Rmath/deps already created
INFO: Attempting to Create directory /home/ubuntu/.julia/v0.6/Rmath/deps/usr/lib
INFO: Changing Directory to /home/ubuntu/.julia/v0.6/Rmath/deps/src/Rmath-julia-0.2.0
INFO: Changing Directory to /home/ubuntu/.julia/v0.6/Rmath/deps/src/Rmath-julia-0.2.0

julia>

This time, the build succeeded without any error message. I have posted this to share the technique and thought process I used to solve this error.

Monday, September 11, 2017

Julia - Language - Converting columns to Julia arrays

At times, we might need to work with Julia arrays instead of columns in a DataFrame. For this purpose, we have the convert() function. Here my_small_df is the DataFrame from the post on renaming columns below:

julia> my_small_array = convert(Array , my_small_df[:Col01]);

julia> my_small_array
10-element Array{Int64,1}:
1
2
3
4
5
6
7
8
9
10

julia>
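The same idea can be tried without a DataFrame. As a small stand-in (not the DataFrames API), the sketch below turns a range into a plain Julia array with collect(); the variable names are ours.

```julia
column = 1:10                 # a range standing in for a DataFrame column
as_array = collect(column)    # materialize into a plain one-dimensional array

println(typeof(as_array))
println(as_array)
```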

Julia - Language - DataFrames - Renaming Columns

Another common problem in data science is the naming convention that some data collectors use for their column names (variables). It is often necessary to rename these, at times even to help de-identify data to comply with regulations. The rename() function and its in-place counterpart rename!() help us achieve just this.

$ julia
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: https://docs.julialang.org
   _ _   _| |_  __ _   |  Type "?help" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.6.0 (2017-06-19 13:05 UTC)
 _/ |\__'_|_|_|\__'_|  |  Official http://julialang.org/ release
|__/                   |  x86_64-pc-linux-gnu

julia> using DataFrames;

julia> my_small_df = DataFrame(x=1:10,y=rand(10),z=rand(["Afghanistan","Brazil","China","Denmark","England","Fiji","Guatemala"],10));

julia> my_small_df
10×3 DataFrames.DataFrame
│ Row │ x │ y │ z │
├─────┼────┼───────────┼─────────────┤
│ 1 │ 1 │ 0.22024 │ "Guatemala" │
│ 2 │ 2 │ 0.0271676 │ "Denmark" │
│ 3 │ 3 │ 0.757901 │ "China" │
│ 4 │ 4 │ 0.605231 │ "China" │
│ 5 │ 5 │ 0.779193 │ "Guatemala" │
│ 6 │ 6 │ 0.01555 │ "Brazil" │
│ 7 │ 7 │ 0.441247 │ "England" │
│ 8 │ 8 │ 0.35073 │ "Guatemala" │
│ 9 │ 9 │ 0.63757 │ "Denmark" │
│ 10 │ 10 │ 0.922693 │ "China" │

julia>

The rename() function returns a renamed copy and leaves the original DataFrame unchanged, as the second output below shows.

julia> rename(my_small_df,:z,:Countries)
10×3 DataFrames.DataFrame
│ Row │ x │ y │ Countries │
├─────┼────┼───────────┼─────────────┤
│ 1 │ 1 │ 0.22024 │ "Guatemala" │
│ 2 │ 2 │ 0.0271676 │ "Denmark" │
│ 3 │ 3 │ 0.757901 │ "China" │
│ 4 │ 4 │ 0.605231 │ "China" │
│ 5 │ 5 │ 0.779193 │ "Guatemala" │
│ 6 │ 6 │ 0.01555 │ "Brazil" │
│ 7 │ 7 │ 0.441247 │ "England" │
│ 8 │ 8 │ 0.35073 │ "Guatemala" │
│ 9 │ 9 │ 0.63757 │ "Denmark" │
│ 10 │ 10 │ 0.922693 │ "China" │

julia> my_small_df
10×3 DataFrames.DataFrame
│ Row │ x │ y │ z │
├─────┼────┼───────────┼─────────────┤
│ 1 │ 1 │ 0.22024 │ "Guatemala" │
│ 2 │ 2 │ 0.0271676 │ "Denmark" │
│ 3 │ 3 │ 0.757901 │ "China" │
│ 4 │ 4 │ 0.605231 │ "China" │
│ 5 │ 5 │ 0.779193 │ "Guatemala" │
│ 6 │ 6 │ 0.01555 │ "Brazil" │
│ 7 │ 7 │ 0.441247 │ "England" │
│ 8 │ 8 │ 0.35073 │ "Guatemala" │
│ 9 │ 9 │ 0.63757 │ "Denmark" │
│ 10 │ 10 │ 0.922693 │ "China" │
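The unchanged output above follows a general Julia naming convention: a function without a bang returns a new object, while its bang (!) counterpart modifies its argument in place. A small sketch with sort() and sort!() from Base, used here only as a stand-in for rename() and rename!():

```julia
scores = [3, 1, 2]                     # made-up data

sorted_copy = sort(scores)             # returns a new array
unchanged = (scores == [3, 1, 2])      # the original is untouched
println(scores, " ", sorted_copy, " ", unchanged)

sort!(scores)                          # modifies scores in place
println(scores)
```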

To make the change permanent, we need to use the rename!() function. Let us pass a Dict() (dictionary) that maps the old column names to the new ones.

julia> rename!(my_small_df,Dict(:x => :Col1, :y => :Col2, :z=>:Countries));

julia> my_small_df
10×3 DataFrames.DataFrame
│ Row │ Col1 │ Col2 │ Countries │
├─────┼──────┼───────────┼─────────────┤
│ 1 │ 1 │ 0.22024 │ "Guatemala" │
│ 2 │ 2 │ 0.0271676 │ "Denmark" │
│ 3 │ 3 │ 0.757901 │ "China" │
│ 4 │ 4 │ 0.605231 │ "China" │
│ 5 │ 5 │ 0.779193 │ "Guatemala" │
│ 6 │ 6 │ 0.01555 │ "Brazil" │
│ 7 │ 7 │ 0.441247 │ "England" │
│ 8 │ 8 │ 0.35073 │ "Guatemala" │
│ 9 │ 9 │ 0.63757 │ "Denmark" │
│ 10 │ 10 │ 0.922693 │ "China" │

julia>

We can also use the names!() function to rename the columns:

julia> names!(my_small_df,[:Col01,:Col02,:Country]);

julia> my_small_df
10×3 DataFrames.DataFrame
│ Row │ Col01 │ Col02 │ Country │
├─────┼───────┼───────────┼─────────────┤
│ 1 │ 1 │ 0.22024 │ "Guatemala" │
│ 2 │ 2 │ 0.0271676 │ "Denmark" │
│ 3 │ 3 │ 0.757901 │ "China" │
│ 4 │ 4 │ 0.605231 │ "China" │
│ 5 │ 5 │ 0.779193 │ "Guatemala" │
│ 6 │ 6 │ 0.01555 │ "Brazil" │
│ 7 │ 7 │ 0.441247 │ "England" │
│ 8 │ 8 │ 0.35073 │ "Guatemala" │
│ 9 │ 9 │ 0.63757 │ "Denmark" │
│ 10 │ 10 │ 0.922693 │ "China" │

julia>

Julia - Language - DataFrames - Dealing with NA values

Another common problem in data entry is that of missing values, which show up as data point values of the NA type. They create havoc when we try to work with values in a DataFrame. Fortunately, we can get rid of rows that contain NA values in a few ways.

$ julia
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: https://docs.julialang.org
   _ _   _| |_  __ _   |  Type "?help" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.6.0 (2017-06-19 13:05 UTC)
 _/ |\__'_|_|_|\__'_|  |  Official http://julialang.org/ release
|__/                   |  x86_64-pc-linux-gnu

julia> using DataFrames

julia> numeric_df = DataFrame(Col1= 1:10, Col2 = 11:20, Col3 = 21:30)
10×3 DataFrames.DataFrame
│ Row │ Col1 │ Col2 │ Col3 │
├─────┼──────┼──────┼──────┤
│ 1 │ 1 │ 11 │ 21 │
│ 2 │ 2 │ 12 │ 22 │
│ 3 │ 3 │ 13 │ 23 │
│ 4 │ 4 │ 14 │ 24 │
│ 5 │ 5 │ 15 │ 25 │
│ 6 │ 6 │ 16 │ 26 │
│ 7 │ 7 │ 17 │ 27 │
│ 8 │ 8 │ 18 │ 28 │
│ 9 │ 9 │ 19 │ 29 │
│ 10 │ 10 │ 20 │ 30 │

julia>

Let us set some NA values:

julia> numeric_df[3,:Col1] = NA
NA

julia> numeric_df[4,:Col2] = NA
NA

julia> numeric_df[[5,9],:Col3] = NA # rows 5 and 9 in column Col3
NA

julia> numeric_df
10×3 DataFrames.DataFrame
│ Row │ Col1 │ Col2 │ Col3 │
├─────┼──────┼──────┼──────┤
│ 1 │ 1 │ 11 │ 21 │
│ 2 │ 2 │ 12 │ 22 │
│ 3 │ NA │ 13 │ 23 │
│ 4 │ 4 │ NA │ 24 │
│ 5 │ 5 │ 15 │ NA │
│ 6 │ 6 │ 16 │ 26 │
│ 7 │ 7 │ 17 │ 27 │
│ 8 │ 8 │ 18 │ 28 │
│ 9 │ 9 │ 19 │ NA │
│ 10 │ 10 │ 20 │ 30 │

julia>

The completecases() function returns a Boolean value for each row: false if the row contains an NA value, true otherwise.

julia> completecases(numeric_df)
10-element DataArrays.DataArray{Bool,1}:
true
true
false
false
false
true
true
true
false
true

julia>
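The logic behind this result can be sketched on plain vectors. The snippet below assumes Julia 1.x, where, to our knowledge, the built-in missing value plays the role that NA plays in this post; it is an illustration of the idea, not the DataFrames implementation.

```julia
# Plain vectors standing in for the three columns; `missing` plays the role of NA.
col1 = [1, 2, missing, 4, 5, 6, 7, 8, 9, 10]
col2 = [11, 12, 13, missing, 15, 16, 17, 18, 19, 20]
col3 = [21, 22, 23, 24, missing, 26, 27, 28, missing, 30]

# A row is complete when none of its entries is missing.
complete = .!ismissing.(col1) .& .!ismissing.(col2) .& .!ismissing.(col3)
println(complete)

# Keep only the complete rows of one column.
println(col1[complete])
```

The mask is false for rows 3, 4, 5 and 9, the same rows flagged by completecases() above.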

The completecases!() function permanently deletes rows with NA values.

julia> completecases!(numeric_df)
6×3 DataFrames.DataFrame
│ Row │ Col1 │ Col2 │ Col3 │
├─────┼──────┼──────┼──────┤
│ 1 │ 1 │ 11 │ 21 │
│ 2 │ 2 │ 12 │ 22 │
│ 3 │ 6 │ 16 │ 26 │
│ 4 │ 7 │ 17 │ 27 │
│ 5 │ 8 │ 18 │ 28 │
│ 6 │ 10 │ 20 │ 30 │

julia> numeric_df
6×3 DataFrames.DataFrame
│ Row │ Col1 │ Col2 │ Col3 │
├─────┼──────┼──────┼──────┤
│ 1 │ 1 │ 11 │ 21 │
│ 2 │ 2 │ 12 │ 22 │
│ 3 │ 6 │ 16 │ 26 │
│ 4 │ 7 │ 17 │ 27 │
│ 5 │ 8 │ 18 │ 28 │
│ 6 │ 10 │ 20 │ 30 │

julia>

We observe that the rows containing NA values are deleted. Let us now recreate the DataFrame numeric_df to use another way of deleting rows with NA values.

julia> numeric_df = DataFrame(Col1= 1:10, Col2 = 11:20, Col3 = 21:30);

julia> numeric_df
10×3 DataFrames.DataFrame
│ Row │ Col1 │ Col2 │ Col3 │
├─────┼──────┼──────┼──────┤
│ 1 │ 1 │ 11 │ 21 │
│ 2 │ 2 │ 12 │ 22 │
│ 3 │ 3 │ 13 │ 23 │
│ 4 │ 4 │ 14 │ 24 │
│ 5 │ 5 │ 15 │ 25 │
│ 6 │ 6 │ 16 │ 26 │
│ 7 │ 7 │ 17 │ 27 │
│ 8 │ 8 │ 18 │ 28 │
│ 9 │ 9 │ 19 │ 29 │
│ 10 │ 10 │ 20 │ 30 │

Also, let us set the NA values again:

julia> numeric_df[3,:Col1] = NA
NA

julia> numeric_df[4,:Col2] = NA
NA

julia> numeric_df[[5,9],:Col3] = NA # rows 5 and 9 in column Col3
NA

julia>

Back in the original DataFrame, we can use the isna() function to show whether each value is of NA type:

julia> isna(numeric_df[:Col1])
10-element BitArray{1}:
false
false
true
false
false
false
false
false
false
false

julia>

By adding the findin() function, we can get the indices of only the NA rows. The findin() function lets us specify what we want to find, i.e., the true Boolean values returned by isna() in this case.

julia> findin(isna(numeric_df[:Col1]),true)
1-element Array{Int64,1}:
3

julia>

We can also use the find() function to simply find the rows with NA values.

julia> find(isna(numeric_df[:Col1]))
1-element Array{Int64,1}:
3

julia>
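For readers on Julia 1.x, where, to our knowledge, find() was renamed findall() and missing takes the place of NA, the same lookup can be sketched on a plain vector. The variable names are ours.

```julia
col1 = [1, 2, missing, 4, 5, 6, 7, 8, 9, 10]   # `missing` stands in for NA

na_rows = findall(ismissing, col1)     # indices of the missing entries
println(na_rows)

kept = deleteat!(copy(col1), na_rows)  # drop those positions from a copy
println(kept)
```

Working on a copy keeps the original vector intact, so the indices found earlier stay valid.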

This presents us with a way to delete all the rows that contain NA values. First, we get the number of rows and columns:

julia> rows,cols = size(numeric_df)
(10, 3)

julia> rows
10

julia> cols
3

julia>

We create a for loop that goes through all the columns and deletes the rows with NA values:

julia> for i in 1:cols
           deleterows!(numeric_df, find(isna(numeric_df[:,i])))
       end

julia> numeric_df
6×3 DataFrames.DataFrame
│ Row │ Col1 │ Col2 │ Col3 │
├─────┼──────┼──────┼──────┤
│ 1 │ 1 │ 11 │ 21 │
│ 2 │ 2 │ 12 │ 22 │
│ 3 │ 6 │ 16 │ 26 │
│ 4 │ 7 │ 17 │ 27 │
│ 5 │ 8 │ 18 │ 28 │
│ 6 │ 10 │ 20 │ 30 │

julia>