We study the Distributions package because we want to create arrays of data values picked at random, each drawn from a specific distribution.
Let us imagine that we want to compare two sets of experimental results or observed outcomes. For example, consider the examination results of a particular course for two consecutive years. The advantage of the Distributions package in Julia is that we do not need the actual results: we can simply simulate some and store them in arrays.
Say 100 students take the examination each year. We can use the rand() function to generate random values; as its argument, we pass a specific distribution. As an example, we will use the Normal distribution with a given mean and standard deviation.
julia> using Distributions
Refer: https://juliastats.github.io/Distributions.jl/latest/univariate.html#Distributions.Normal
This says:
Normal(mu, sig) # Normal distribution with mean mu and variance sig^2
This means that the Normal function takes two input parameters:
- Mean
- Standard Deviation
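The two parameters are the mean and the standard deviation, and we can confirm this by querying a distribution object directly. A small sketch using the Distributions.jl accessor functions mean, std and var:

```julia
using Distributions

# Construct the distribution we will sample from below.
d = Normal(67, 10)

mean(d)   # 67.0  -- the first parameter is the mean
std(d)    # 10.0  -- the second parameter is the standard deviation
var(d)    # 100.0 -- the variance is the square of the standard deviation
```

This also makes clear why the documentation describes Normal(mu, sig) as having variance sig^2.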
For usage of rand() function, refer: https://docs.julialang.org/en/latest/stdlib/numbers/#Base.Random.rand
In this example, we use the rand function to generate 100 random numbers representing the exam results for year 1, drawn from a Normal distribution with a mean of 67 and a standard deviation of 10.
julia> year1 = rand(Normal(67,10),100)
100-element Array{Float64,1}:
62.5691
58.6375
73.3923
59.4574
63.5709
76.1989
54.7919
75.6863
72.857
74.2429
67.6712
68.6269
70.7106
81.5927
59.1536
62.5137
63.9478
67.7302
73.0909
⋮
70.4392
58.6405
73.1801
59.3546
56.1787
66.7647
84.3317
58.4741
70.2783
65.5608
54.6659
73.1065
40.2822
65.9016
72.601
63.2549
68.9232
76.1037
julia> year2 = rand(Normal(71,15),100)
100-element Array{Float64,1}:
75.7926
69.0364
63.3757
79.114
74.4915
59.5234
76.6792
61.4812
46.3396
68.1444
94.0939
78.3967
55.9258
93.2746
73.0782
52.2191
69.3642
41.075
79.629
⋮
60.7618
64.5384
85.7592
46.9339
74.0685
44.9415
79.3021
70.3854
96.3139
81.2064
58.7041
54.1915
85.4515
40.0752
85.0923
79.9331
79.632
64.337
julia>
Our aim will be to compare these. The average score for each year differs, but are the two years statistically different? This is not a course in statistics, but we will see how easy Julia makes it to tell whether there is such a difference.
For now, let us plot the theoretical distributions. Note that these are the distributions themselves, not the simulated values.
$ julia
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: https://docs.julialang.org
   _ _   _| |_  __ _   |  Type "?help" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.6.0 (2017-06-19 13:05 UTC)
 _/ |\__'_|_|_|\__'_|  |  Official http://julialang.org/ release
|__/                   |  x86_64-pc-linux-gnu
julia> using Distributions
julia> using Plots
julia> plotlyjs()
Plots.PlotlyJSBackend()
julia> using StatPlots
julia> plot(Normal(67,10), fillrange=0,fillalpha=0.5,fillcolor=:orange, label = "Year 1", title = "Theoretical distributions" )
julia> plot!(Normal(71,15), fillrange=0,fillalpha=0.5,fillcolor=:blue, label = "Year 2" )
We can get more statistical information from these two distributions.
julia> skewness(year1) , skewness(year2)
(-0.024341675858438033, 0.3728199943398096)
julia> kurtosis(year1), kurtosis(year2)
(-0.3391716502155475, -0.3560651067956062)
julia> using HypothesisTests
julia> EqualVarianceTTest(year1,year2)
Two sample t-test (equal variance)
----------------------------------
Population details:
parameter of interest: Mean difference
value under h_0: 0
point estimate: -6.401225641166363
95% confidence interval: (-10.17335143893694, -2.6290998433957857)
Test summary:
outcome with 95% confidence: reject h_0
two-sided p-value: 0.0009792968343477447
Details:
number of observations: [100,100]
t-statistic: -3.34647610411475
degrees of freedom: 198
empirical standard error: 1.9128257432633577
julia>
It says that the parameter of interest is the mean difference, that is, the difference between the two means. h_0 denotes the null hypothesis. The test summary says that we can reject the null hypothesis, since the two-sided p-value is about 0.001. The t-statistic is -3.346 and there are 198 degrees of freedom: 200 observations minus one estimated mean per group gives 200 - 2 = 198. We also see an empirical standard error of 1.91.
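The numbers in the report can be reproduced by hand. As a sketch (the function name pooled_ttest is our own illustration, not part of HypothesisTests.jl), the pooled two-sample t-statistic, degrees of freedom and standard error are:

```julia
using Statistics  # mean and var (in Base on older Julia versions)

# Equal-variance two-sample t-test statistics, computed by hand.
function pooled_ttest(x, y)
    n1, n2 = length(x), length(y)
    df  = n1 + n2 - 2                                      # 100 + 100 - 2 = 198
    sp2 = ((n1 - 1) * var(x) + (n2 - 1) * var(y)) / df     # pooled variance
    se  = sqrt(sp2 * (1 / n1 + 1 / n2))                    # empirical standard error
    t   = (mean(x) - mean(y)) / se                         # t-statistic
    return t, df, se
end
```

Calling pooled_ttest(year1, year2) should reproduce the t-statistic, degrees of freedom and standard error printed by EqualVarianceTTest above.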
So the hypothesis test gives us a great deal of information, and we can quickly determine whether the values stored in the two Julia arrays are statistically different from each other.
Let us now put these two arrays in a DataFrame.
julia> using DataFrames
julia> dataDF = DataFrame(one=year1,two=year2)
100×2 DataFrames.DataFrame
│ Row │ one │ two │
├─────┼─────────┼─────────┤
│ 1 │ 72.3084 │ 63.1657 │
│ 2 │ 58.4573 │ 69.4973 │
│ 3 │ 70.8097 │ 94.1317 │
│ 4 │ 70.7864 │ 57.8267 │
│ 5 │ 65.2744 │ 82.1781 │
│ 6 │ 72.3527 │ 71.8708 │
│ 7 │ 68.5187 │ 56.9388 │
│ 8 │ 58.8689 │ 82.4909 │
│ 9 │ 62.0201 │ 43.9521 │
│ 10 │ 69.8665 │ 56.1039 │
│ 11 │ 70.3609 │ 66.7739 │
│ 12 │ 62.2294 │ 45.2597 │
│ 13 │ 70.4675 │ 65.4719 │
│ 14 │ 77.4113 │ 109.628 │
│ 15 │ 51.4468 │ 61.2482 │
│ 16 │ 62.5706 │ 88.3665 │
│ 17 │ 64.395 │ 84.3294 │
⋮
│ 83 │ 66.0341 │ 75.6371 │
│ 84 │ 51.3127 │ 56.022 │
│ 85 │ 73.9547 │ 55.9642 │
│ 86 │ 54.6271 │ 74.2315 │
│ 87 │ 79.5011 │ 46.2863 │
│ 88 │ 45.8135 │ 87.8725 │
│ 89 │ 68.2583 │ 72.3256 │
│ 90 │ 57.9371 │ 67.3443 │
│ 91 │ 74.6372 │ 78.1656 │
│ 92 │ 65.5897 │ 62.3794 │
│ 93 │ 91.3915 │ 72.8167 │
│ 94 │ 55.0672 │ 67.158 │
│ 95 │ 76.6133 │ 62.5635 │
│ 96 │ 54.8723 │ 69.6539 │
│ 97 │ 75.1471 │ 58.0032 │
│ 98 │ 59.0352 │ 69.4477 │
│ 99 │ 72.7119 │ 64.9528 │
│ 100 │ 71.0054 │ 44.5209 │
julia>
Imagine this time that these were the same set of students going on from Year 1 to Year 2, and that these are their aggregate marks at the end of each year. We now want to know whether there is a correlation between the marks in the two years.
Since we are dealing with linear regression and least squares, this is easy to accomplish with the GLM package, the Generalized Linear Models package. Let us fit an ordinary least-squares model. The first parameter we pass, @formula(one~two), asks whether there is a linear relationship between column one and column two of the DataFrame dataDF, which is passed as the second parameter.
We also specify the Normal distribution and the IdentityLink, which together make the generalized linear model equivalent to ordinary least squares.
Refer: https://github.com/JuliaStats/GLM.jl
julia> glm(@formula(one~two),dataDF,Normal(),IdentityLink())
DataFrames.DataFrameRegressionModel{GLM.GeneralizedLinearModel{GLM.GlmResp{Array{Float64,1},Distributions.Normal{Float64},GLM.IdentityLink},GLM.DensePredChol{Float64,Base.LinAlg.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}
Formula: one ~ 1 + two
Coefficients:
Estimate Std.Error z value Pr(>|z|)
(Intercept) 59.2152 4.34291 13.6349 <1e-41
two 0.106299 0.0613117 1.73376 0.0830
julia>
We see that we have an intercept of about 59.2 and a slope of 0.106299. Note, though, that the p-value for the slope (0.0830) is above 0.05, so this particular simulated sample shows no significant linear relationship between the two years, which is what we would expect, since the two arrays were drawn independently.
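As a quick sanity check, we can sketch a prediction from the fitted line by hand. The helper predict_year1 and the input mark 70.0 are our own illustration; in a live session, coef(model) from GLM returns the same intercept and slope programmatically:

```julia
# Using the coefficients reported above: y-hat = intercept + slope * x.
# predict_year1 is a hypothetical helper for illustration only.
intercept = 59.2152
slope = 0.106299

predict_year1(mark2) = intercept + slope * mark2

predict_year1(70.0)   # a Year 2 mark of 70 predicts a Year 1 mark of about 66.66
```

The weak slope means the predicted Year 1 mark barely moves as the Year 2 mark changes, consistent with the non-significant p-value.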
Refer:
- https://github.com/JuliaStats/GLM.jl
- http://juliastats.github.io/
- https://www.micahsmith.com/blog/2016/11/why-im-bailing-on-julia-for-machine-learning/
- https://juliaeconomics.com/tag/glm/