Tuesday, September 19, 2017

Julia - Language - Comparing Samples of Random Numbers using the Distributions Package

The purpose of studying the Distributions package is that we want to create arrays of data point values or elements that are picked at random, each drawn from a specific distribution.

Let us imagine, that we compare the results of two sets of experiments or observed outcomes. For example, consider the examination results of a particular course for two consecutive years. The advantage of using the distribution package in Julia is that we do not need the actual results. We can simply simulate some results. These results can be stored in arrays.

Say, 100 students take the examination each year, we can use the rand() function to generate random values. As argument, we pass the name of specific distribution. As an example, we will use the Normal distribution with a mean and standard deviation. 

julia> using Distributions

Referhttps://juliastats.github.io/Distributions.jl/latest/univariate.html#Distributions.Normal

This says:
Normal(mu, sig)   # Normal distribution with mean mu and variance sig^2
This means, that the Normal function takes two input parameters:

  1. Mean
  2. Standard Deviation

For usage of rand() function, refer: https://docs.julialang.org/en/latest/stdlib/numbers/#Base.Random.rand

In this example, we use the rand function to generate 100 random numbers representing the exam results for year 1 which adhere to Normal distribution. The numbers have a mean of 67 and standard deviation of 10.

julia> year1 = rand(Normal(67,10),100)
100-element Array{Float64,1}:
 62.5691
 58.6375
 73.3923
 59.4574
 63.5709
 76.1989
 54.7919
 75.6863
 72.857 
 74.2429
 67.6712
 68.6269
 70.7106
 81.5927
 59.1536
 62.5137
 63.9478
 67.7302
 73.0909
  ⋮     
 70.4392
 58.6405
 73.1801
 59.3546
 56.1787
 66.7647
 84.3317
 58.4741
 70.2783
 65.5608
 54.6659
 73.1065
 40.2822
 65.9016
 72.601 
 63.2549
 68.9232
 76.1037

julia> year2 = rand(Normal(71,15),100)
100-element Array{Float64,1}:
 75.7926
 69.0364
 63.3757
 79.114 
 74.4915
 59.5234
 76.6792
 61.4812
 46.3396
 68.1444
 94.0939
 78.3967
 55.9258
 93.2746
 73.0782
 52.2191
 69.3642
 41.075 
 79.629 
  ⋮     
 60.7618
 64.5384
 85.7592
 46.9339
 74.0685
 44.9415
 79.3021
 70.3854
 96.3139
 81.2064
 58.7041
 54.1915
 85.4515
 40.0752
 85.0923
 79.9331
 79.632 
 64.337 

julia> 


Our aim will be to compare these. The average scored for each year differs. Are they statistically different, though? This is not a course in statistics, but using Julia, we will see how easy it is to tell us if there is such a difference. 

For now, let us plot theoretical distributions. Note that these are not the actual values.

$ julia
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: https://docs.julialang.org
   _ _   _| |_  __ _   |  Type "?help" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.6.0 (2017-06-19 13:05 UTC)
 _/ |\__'_|_|_|\__'_|  |  Official http://julialang.org/ release
|__/                   |  x86_64-pc-linux-gnu

julia> using Distributions

julia> using Plots

julia> plotlyjs()
Plots.PlotlyJSBackend()

julia> using StatPlots

julia> plot(Normal(67,10), fillrange=0,fillalpha=0.5,fillcolor=:orange, label = "Year 1", title = "Boxplot"  )




julia> plot!(Normal(71,15), fillrange=0,fillalpha=0.5,fillcolor=:blue, label = "Year 2"  )





We can get more statistical information from these two distributions.

julia> skewness(year1) , skewness(year2)
(-0.024341675858438033, 0.3728199943398096)

julia> kurtosis(year1), kurtosis(year2)
(-0.3391716502155475, -0.3560651067956062)

julia> using HypothesisTests

julia> EqualVarianceTTest(year1,year2)
Two sample t-test (equal variance)
----------------------------------
Population details:
    parameter of interest:   Mean difference
    value under h_0:         0
    point estimate:          -6.401225641166363
    95% confidence interval: (-10.17335143893694, -2.6290998433957857)

Test summary:
    outcome with 95% confidence: reject h_0
    two-sided p-value:           0.0009792968343477447

Details:
    number of observations:   [100,100]
    t-statistic:              -3.34647610411475
    degrees of freedom:       198
    empirical standard error: 1.9128257432633577


julia> 

It says, that the parameter of interest, is the Mean Difference - that means the difference between the means.
h_0 is indicating the NULL hypothesis. The test summary says, that we can reject the NULL hypothesis as the two sided p-value is 0.0009. The t-statistic was -3.346 and the degrees of freedom was 198. Remember that there are 200 set of values minus the two groups which gives us 198. We also see the standard error of 1.91

So, we conclude we have received beautiful information from Hypothesis test as we can quickly determine, if the values stored in the two Julia arrays are statistically different from each other.

Let us now, put these two in a DataFrame.

julia> using DataFrames

julia> dataDF = DataFrame(one=year1,two=year2)
100×2 DataFrames.DataFrame
│ Row │ one     │ two     │
├─────┼─────────┼─────────┤
│ 1   │ 72.3084 │ 63.1657 │
│ 2   │ 58.4573 │ 69.4973 │
│ 3   │ 70.8097 │ 94.1317 │
│ 4   │ 70.7864 │ 57.8267 │
│ 5   │ 65.2744 │ 82.1781 │
│ 6   │ 72.3527 │ 71.8708 │
│ 7   │ 68.5187 │ 56.9388 │
│ 8   │ 58.8689 │ 82.4909 │
│ 9   │ 62.0201 │ 43.9521 │
│ 10  │ 69.8665 │ 56.1039 │
│ 11  │ 70.3609 │ 66.7739 │
│ 12  │ 62.2294 │ 45.2597 │
│ 13  │ 70.4675 │ 65.4719 │
│ 14  │ 77.4113 │ 109.628 │
│ 15  │ 51.4468 │ 61.2482 │
│ 16  │ 62.5706 │ 88.3665 │
│ 17  │ 64.395  │ 84.3294 │

│ 83  │ 66.0341 │ 75.6371 │
│ 84  │ 51.3127 │ 56.022  │
│ 85  │ 73.9547 │ 55.9642 │
│ 86  │ 54.6271 │ 74.2315 │
│ 87  │ 79.5011 │ 46.2863 │
│ 88  │ 45.8135 │ 87.8725 │
│ 89  │ 68.2583 │ 72.3256 │
│ 90  │ 57.9371 │ 67.3443 │
│ 91  │ 74.6372 │ 78.1656 │
│ 92  │ 65.5897 │ 62.3794 │
│ 93  │ 91.3915 │ 72.8167 │
│ 94  │ 55.0672 │ 67.158  │
│ 95  │ 76.6133 │ 62.5635 │
│ 96  │ 54.8723 │ 69.6539 │
│ 97  │ 75.1471 │ 58.0032 │
│ 98  │ 59.0352 │ 69.4477 │
│ 99  │ 72.7119 │ 64.9528 │
│ 100 │ 71.0054 │ 44.5209 │

julia> 

Imagine this time that these were the same set of students going on from year1 to year2. As, these are the aggregate marks at the end of Year1 and Year2 - We now want to know if there is a correlation between the marks in their two years.

Since, we are dealing with linear regression and least squares. Hence, this is easy to accomplish with GLM package functions - the Generalized Linear Models package. Let us call the ordinary least squares. The first parameter (@formula(one~two)) which we pass, asks, is there a correlation between the column1 and column2 of the dataframe dataDF, passed as the second parameter.

We want it from the Normal distribution and the IdentityLink. 

Referhttps://github.com/JuliaStats/GLM.jl

julia> glm(@formula(one~two),dataDF,Normal(),IdentityLink())
DataFrames.DataFrameRegressionModel{GLM.GeneralizedLinearModel{GLM.GlmResp{Array{Float64,1},Distributions.Normal{Float64},GLM.IdentityLink},GLM.DensePredChol{Float64,Base.LinAlg.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

Formula: one ~ 1 + two

Coefficients:
             Estimate Std.Error z value Pr(>|z|)
(Intercept)   59.2152   4.34291 13.6349   <1e-41
two          0.106299 0.0613117 1.73376   0.0830


julia> 


We see that we have an intercept of 59.2

The slope of our line is 0.106299

Refer:



No comments:

Post a Comment