Technology Works: Julia - Language - Read and Write from files into DataFrames

Check the current working directory, which is where we will download the iris.csv file:

$ julia
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: https://docs.julialang.org
   _ _   _| |_  __ _   |  Type "?help" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.6.0 (2017-06-19 13:05 UTC)
 _/ |\__'_|_|_|\__'_|  |  Official http://julialang.org/ release
|__/                   |  x86_64-pc-linux-gnu

julia>

julia> ;

shell> pwd
/home/ubuntu

Download the iris.csv dataset file from the GitHub URL:

julia> ;

shell> wget https://raw.githubusercontent.com/scikit-learn/scikit-learn/master/sklearn/datasets/data/iris.csv
--2017-07-12 15:00:08-- https://raw.githubusercontent.com/scikit-learn/scikit-learn/master/sklearn/datasets/data/iris.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.8.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.8.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2734 (2.7K) [text/plain]
Saving to: iris.csv

iris.csv 100%[=====================================================================================>] 2.67K --.-KB/s in 0s

2017-07-12 15:00:08 (32.1 MB/s) - iris.csv saved [2734/2734]
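
The same download can also be done from Julia itself instead of shell mode. The following is a minimal sketch using Base's download() function; fetch_iris is a hypothetical helper name introduced here for illustration, and the URL is the one from the wget command above.

```julia
# URL copied from the wget command above
const IRIS_URL = "https://raw.githubusercontent.com/scikit-learn/scikit-learn/master/sklearn/datasets/data/iris.csv"

# fetch_iris is a hypothetical helper, not part of Julia or DataFrames.
# It saves the file into the current working directory by default and
# skips the download when the destination file already exists.
function fetch_iris(dest = joinpath(pwd(), "iris.csv"))
    if !isfile(dest)
        download(IRIS_URL, dest)   # Base.download writes the URL contents to dest
    end
    return dest
end

# Usage (needs network access):
# path = fetch_iris()
```

Skipping the download when the file is already present keeps repeated runs of a script from fetching the same data again.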

In real data science problems, data is loaded from files or some other external source. Delimited text files commonly come in one of these formats:

CSV - comma separated values
TSV - tab separated values
WSV - whitespace separated values
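
The three formats differ only in the character that separates the fields. As a rough sketch, assuming the same DataFrames release used in this post (the one that provides readtable()), we can write the same two rows in each format and read them back by passing the matching separator. The file names and column names here are made up for the illustration.

```julia
using DataFrames   # assumes the DataFrames release used in this post, which provides readtable()

dir = mktempdir()  # scratch directory for the hypothetical sample files

# File name and separator character for each of the three formats
formats = [("sample.csv", ','), ("sample.tsv", '\t'), ("sample.wsv", ' ')]

shapes = []
for (name, sep) in formats
    path = joinpath(dir, name)
    open(path, "w") do io
        println(io, join(["sepal_length", "sepal_width"], sep))   # header row
        println(io, join(["5.1", "3.5"], sep))
        println(io, join(["4.9", "3.0"], sep))
    end
    # Same call as for iris.csv, only the separator changes
    df = readtable(path, header=true, separator=sep)
    push!(shapes, (name, size(df)))
end

println(shapes)
```

Each file should produce a DataFrame of the same shape, since only the delimiter differs.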

Now let us load the iris dataset into a DataFrame using the readtable() function. Although the iris dataset is also available in the RDatasets package, we use the downloaded CSV file here to practice working with external datasets.

julia> using DataFrames

julia> df_iris = readtable("iris.csv",header=true,separator=',');

julia> df_iris
150x5 DataFrames.DataFrame
| Row | x150 | x4  | setosa | versicolor | virginica |
|-----|------|-----|--------|------------|-----------|
| 1   | 5.1  | 3.5 | 1.4    | 0.2        | 0         |
| 2   | 4.9  | 3.0 | 1.4    | 0.2        | 0         |
| 3   | 4.7  | 3.2 | 1.3    | 0.2        | 0         |
| 4   | 4.6  | 3.1 | 1.5    | 0.2        | 0         |
| 5   | 5.0  | 3.6 | 1.4    | 0.2        | 0         |
| 6   | 5.4  | 3.9 | 1.7    | 0.4        | 0         |
| 7   | 4.6  | 3.4 | 1.4    | 0.3        | 0         |
| 8   | 5.0  | 3.4 | 1.5    | 0.2        | 0         |
| 9   | 4.4  | 2.9 | 1.4    | 0.2        | 0         |
| 10  | 4.9  | 3.1 | 1.5    | 0.1        | 0         |
| 11  | 5.4  | 3.7 | 1.5    | 0.2        | 0         |
| 12  | 4.8  | 3.4 | 1.6    | 0.2        | 0         |
| 13  | 4.8  | 3.0 | 1.4    | 0.1        | 0         |
| 14  | 4.3  | 3.0 | 1.1    | 0.1        | 0         |
| 15  | 5.8  | 4.0 | 1.2    | 0.2        | 0         |
| 16  | 5.7  | 4.4 | 1.5    | 0.4        | 0         |
| 17  | 5.4  | 3.9 | 1.3    | 0.4        | 0         |
⋮
| 133 | 6.4  | 2.8 | 5.6    | 2.2        | 2         |
| 134 | 6.3  | 2.8 | 5.1    | 1.5        | 2         |
| 135 | 6.1  | 2.6 | 5.6    | 1.4        | 2         |
| 136 | 7.7  | 3.0 | 6.1    | 2.3        | 2         |
| 137 | 6.3  | 3.4 | 5.6    | 2.4        | 2         |
| 138 | 6.4  | 3.1 | 5.5    | 1.8        | 2         |
| 139 | 6.0  | 3.0 | 4.8    | 1.8        | 2         |
| 140 | 6.9  | 3.1 | 5.4    | 2.1        | 2         |
| 141 | 6.7  | 3.1 | 5.6    | 2.4        | 2         |
| 142 | 6.9  | 3.1 | 5.1    | 2.3        | 2         |
| 143 | 5.8  | 2.7 | 5.1    | 1.9        | 2         |
| 144 | 6.8  | 3.2 | 5.9    | 2.3        | 2         |
| 145 | 6.7  | 3.3 | 5.7    | 2.5        | 2         |
| 146 | 6.7  | 3.0 | 5.2    | 2.3        | 2         |
| 147 | 6.3  | 2.5 | 5.0    | 1.9        | 2         |
| 148 | 6.5  | 3.0 | 5.2    | 2.0        | 2         |
| 149 | 6.2  | 3.4 | 5.4    | 2.3        | 2         |
| 150 | 5.9  | 3.0 | 5.1    | 1.8        | 2         |
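
The column names x150 and x4 suggest that the first line of this particular file holds the sample count, the feature count and the class names rather than real column labels. If that is the case, the names keyword argument of readtable() can supply our own labels. The sketch below works on a three-line stand-in for iris.csv; the header line is an assumption based on the output above, and the column labels are illustrative choices, not names stored in the file.

```julia
using DataFrames   # assumes the DataFrames release used in this post

# Stand-in for iris.csv: assumed header line plus the first two data rows shown above
path = joinpath(mktempdir(), "iris_head.csv")
open(path, "w") do io
    println(io, "150,4,setosa,versicolor,virginica")
    println(io, "5.1,3.5,1.4,0.2,0")
    println(io, "4.9,3.0,1.4,0.2,0")
end

# Illustrative column labels (our own choice, not taken from the file)
cols = [:sepal_length, :sepal_width, :petal_length, :petal_width, :species]

# header=true still skips the first line; names replaces the labels parsed from it
df = readtable(path, header=true, separator=',', names=cols)
println(names(df))
```

Keeping header=true matters here: the first line is still consumed as a header, so it does not end up as a data row.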

The readtable() function is a generic function with several methods, so Julia's multiple dispatch selects the right one based on the types of the arguments:

julia> methods(readtable)
# 3 methods for generic function "readtable":
readtable(io::IO) in DataFrames at /home/ubuntu/.julia/v0.6/DataFrames/src/dataframe/io.jl:820
readtable(io::IO, nbytes::Integer; header, separator, quotemark, decimal, nastrings, truestrings, falsestrings, makefactors, nrows, names, eltypes, allowcomments, commentmark, ignorepadding, skipstart, skiprows, skipblanks, encoding, allowescapes, normalizenames) in DataFrames at /home/ubuntu/.julia/v0.6/DataFrames/src/dataframe/io.jl:820
readtable(pathname::AbstractString; header, separator, quotemark, decimal, nastrings, truestrings, falsestrings, makefactors, nrows, names, eltypes, allowcomments, commentmark, ignorepadding, skipstart, skiprows, skipblanks, encoding, allowescapes, normalizenames) in DataFrames at /home/ubuntu/.julia/v0.6/DataFrames/src/dataframe/io.jl:930

We can see that there are three methods for the readtable() function. They differ in how the input is supplied: an open IO stream, an IO stream with a byte count, or a file pathname. For the meanings of the keyword arguments, refer to the link:
https://juliastats.github.io/DataFrames.jl/stable/man/io/
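
As a small illustration of one of these keyword arguments, nrows limits how many rows are read, which is handy for previewing a large file. This is a sketch under the assumption that the DataFrames release from this post is installed; the file preview.csv and its contents are made up for the example.

```julia
using DataFrames   # assumes the DataFrames release used in this post

# Hypothetical ten-row file written to a scratch directory
path = joinpath(mktempdir(), "preview.csv")
open(path, "w") do io
    println(io, "a,b")
    for i in 1:10
        println(io, i, ",", i * 0.5)
    end
end

# Read only the first three data rows
preview = readtable(path, header=true, separator=',', nrows=3)
println(size(preview))
```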

We may want to write the results to a file. We do this with the writetable() function:

julia> writetable("output_iris.csv",df_iris,header=true,separator=',')

julia> ;

shell> ls -l | grep iris
-rw-rw-r-- 1 ubuntu ubuntu 2734 Jul 12 15:00 iris.csv
-rw-rw-r-- 1 ubuntu ubuntu 2746 Jul 12 16:00 output_iris.csv
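
The output file is 12 bytes larger than the input. This is consistent with writetable() writing the header as quoted, normalized column names such as "x150" instead of the original header text, assuming the original first line is the one discussed earlier. A quick way to check a written file is to read it back and compare it with the original DataFrame. The sketch below does this on a small hypothetical frame, again assuming the DataFrames release used in this post.

```julia
using DataFrames   # assumes the DataFrames release used in this post

# Small hypothetical frame reusing two of the column names from the output above
df = DataFrame(x150 = [5.1, 4.9], x4 = [3.5, 3.0])

path = joinpath(mktempdir(), "roundtrip.csv")
writetable(path, df, header=true, separator=',')

# Read the file back and compare shape and column names
df_back = readtable(path, header=true, separator=',')
println(size(df_back) == size(df))
println(names(df_back) == names(df))

# Print the first line of the file to see how the header was written
println(open(readline, path))
```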

julia>

Wednesday, July 12, 2017
