Wednesday, July 12, 2017

Julia - Language - Reading Files into DataFrames and Writing DataFrames to Files



Check the current working directory, which is where we will download the iris.csv file. Typing ; at the julia> prompt switches the REPL to shell mode:

$ julia
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: https://docs.julialang.org
   _ _   _| |_  __ _   |  Type "?help" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.6.0 (2017-06-19 13:05 UTC)
 _/ |\__'_|_|_|\__'_|  |  Official http://julialang.org/ release
|__/                   |  x86_64-pc-linux-gnu

julia> 

julia> ;

shell> pwd
/home/ubuntu

Download the iris.csv dataset file from its GitHub URL:

julia> ;

shell> wget https://raw.githubusercontent.com/scikit-learn/scikit-learn/master/sklearn/datasets/data/iris.csv
--2017-07-12 15:00:08--  https://raw.githubusercontent.com/scikit-learn/scikit-learn/master/sklearn/datasets/data/iris.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.8.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.8.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2734 (2.7K) [text/plain]
Saving to: iris.csv

iris.csv                                  100%[=====================================================================================>]   2.67K  --.-KB/s    in 0s      

2017-07-12 15:00:08 (32.1 MB/s) - iris.csv saved [2734/2734]
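
The same two steps can also be done without switching to shell mode. The sketch below uses Julia's built-in pwd() and download() functions. The helper name fetch_iris is made up for this example, and on Julia 0.6 download() relies on an external tool such as curl or wget being installed.

```julia
# URL of the iris dataset used in this post.
const IRIS_URL = "https://raw.githubusercontent.com/scikit-learn/scikit-learn/master/sklearn/datasets/data/iris.csv"

# Hypothetical helper: download the dataset into `dir` and return the file path.
function fetch_iris(dir = pwd())
    dest = joinpath(dir, "iris.csv")
    download(IRIS_URL, dest)   # download(url, localfile) saves the file locally
    return dest
end

println(pwd())   # current working directory, same as `pwd` in shell mode
```

Calling fetch_iris() would then save iris.csv in the current working directory, much like the wget command above.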

In real data science problems, data is usually loaded from files or some other external source. Delimited text files commonly come in one of these formats:

  • CSV - comma separated values
  • TSV - tab separated values
  • WSV - whitespace separated values
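
The three formats differ only in the character that separates the fields. As a rough sketch, the extension can be mapped to a delimiter and used to split a record. The helper guess_separator is hypothetical and not a DataFrames function, although according to the DataFrames documentation readtable() makes a similar guess from the file extension when no separator is passed.

```julia
# Hypothetical helper: pick the field delimiter from the file extension.
function guess_separator(filename)
    ext = lowercase(splitext(filename)[2])
    ext == ".csv" && return ','
    ext == ".tsv" && return '\t'
    ext == ".wsv" && return ' '
    error("unrecognized extension: $ext")
end

# Split one record of each kind with the matching delimiter.
csv_fields = split("5.1,3.5,1.4,0.2,0", guess_separator("iris.csv"))
tsv_fields = split("5.1\t3.5\t1.4\t0.2\t0", guess_separator("iris.tsv"))
wsv_fields = split("5.1 3.5 1.4 0.2 0", guess_separator("iris.wsv"))
```

All three calls yield the same five fields; only the delimiter changes.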


Now, let us load the iris dataset into a DataFrame using the readtable() function. Even though the iris dataset is also available in the RDatasets package, we will use the downloaded CSV file to practice working with external datasets.

julia> using DataFrames

julia> df_iris = readtable("iris.csv",header=true,separator=',');

julia> df_iris
150x5 DataFrames.DataFrame
| Row | x150 | x4  | setosa | versicolor | virginica |
-------------------------------------------------------
| 1   | 5.1  | 3.5 | 1.4    | 0.2        | 0         |
| 2   | 4.9  | 3.0 | 1.4    | 0.2        | 0         |
| 3   | 4.7  | 3.2 | 1.3    | 0.2        | 0         |
| 4   | 4.6  | 3.1 | 1.5    | 0.2        | 0         |
| 5   | 5.0  | 3.6 | 1.4    | 0.2        | 0         |
| 6   | 5.4  | 3.9 | 1.7    | 0.4        | 0         |
| 7   | 4.6  | 3.4 | 1.4    | 0.3        | 0         |
| 8   | 5.0  | 3.4 | 1.5    | 0.2        | 0         |
| 9   | 4.4  | 2.9 | 1.4    | 0.2        | 0         |
| 10  | 4.9  | 3.1 | 1.5    | 0.1        | 0         |
| 11  | 5.4  | 3.7 | 1.5    | 0.2        | 0         |
| 12  | 4.8  | 3.4 | 1.6    | 0.2        | 0         |
| 13  | 4.8  | 3.0 | 1.4    | 0.1        | 0         |
| 14  | 4.3  | 3.0 | 1.1    | 0.1        | 0         |
| 15  | 5.8  | 4.0 | 1.2    | 0.2        | 0         |
| 16  | 5.7  | 4.4 | 1.5    | 0.4        | 0         |
| 17  | 5.4  | 3.9 | 1.3    | 0.4        | 0         |
⋮
| 133 | 6.4  | 2.8 | 5.6    | 2.2        | 2         |
| 134 | 6.3  | 2.8 | 5.1    | 1.5        | 2         |
| 135 | 6.1  | 2.6 | 5.6    | 1.4        | 2         |
| 136 | 7.7  | 3.0 | 6.1    | 2.3        | 2         |
| 137 | 6.3  | 3.4 | 5.6    | 2.4        | 2         |
| 138 | 6.4  | 3.1 | 5.5    | 1.8        | 2         |
| 139 | 6.0  | 3.0 | 4.8    | 1.8        | 2         |
| 140 | 6.9  | 3.1 | 5.4    | 2.1        | 2         |
| 141 | 6.7  | 3.1 | 5.6    | 2.4        | 2         |
| 142 | 6.9  | 3.1 | 5.1    | 2.3        | 2         |
| 143 | 5.8  | 2.7 | 5.1    | 1.9        | 2         |
| 144 | 6.8  | 3.2 | 5.9    | 2.3        | 2         |
| 145 | 6.7  | 3.3 | 5.7    | 2.5        | 2         |
| 146 | 6.7  | 3.0 | 5.2    | 2.3        | 2         |
| 147 | 6.3  | 2.5 | 5.0    | 1.9        | 2         |
| 148 | 6.5  | 3.0 | 5.2    | 2.0        | 2         |
| 149 | 6.2  | 3.4 | 5.4    | 2.3        | 2         |
| 150 | 5.9  | 3.0 | 5.1    | 1.8        | 2         |
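
The column names x150 and x4 look odd because the first line of this particular file is not a conventional header: judging by the names above, it holds the number of samples, the number of features, and the three class names. With header=true, readtable() treats that line as column names and appears to prefix the numeric ones with x to make them valid identifiers. The sketch below recreates the first two lines of the file in a temporary directory, so it does not depend on the download, and inspects them with plain Base functions. The file name iris_head.csv is made up for this example.

```julia
# Recreate the first two lines of the dataset in a temporary file.
path = joinpath(mktempdir(), "iris_head.csv")
open(path, "w") do io
    println(io, "150,4,setosa,versicolor,virginica")
    println(io, "5.1,3.5,1.4,0.2,0")
end

# Read the lines back and split them on the separator.
lines = readlines(path)
header = split(lines[1], ',')                                   # raw header fields
first_row = [parse(Float64, s) for s in split(lines[2], ',')]   # first data record
```

Looking at the raw header this way before loading a file makes it easier to decide whether the first line should really be used for column names.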

The readtable() function is a generic function with several methods. Julia's multiple dispatch selects the method that matches the types of the arguments passed in:

julia> methods(readtable)
# 3 methods for generic function "readtable":
readtable(io::IO) in DataFrames at /home/ubuntu/.julia/v0.6/DataFrames/src/dataframe/io.jl:820
readtable(io::IO, nbytes::Integer; header, separator, quotemark, decimal, nastrings, truestrings, falsestrings, makefactors, nrows, names, eltypes, allowcomments, commentmark, ignorepadding, skipstart, skiprows, skipblanks, encoding, allowescapes, normalizenames) in DataFrames at /home/ubuntu/.julia/v0.6/DataFrames/src/dataframe/io.jl:820
readtable(pathname::AbstractString; header, separator, quotemark, decimal, nastrings, truestrings, falsestrings, makefactors, nrows, names, eltypes, allowcomments, commentmark, ignorepadding, skipstart, skiprows, skipblanks, encoding, allowescapes, normalizenames) in DataFrames at /home/ubuntu/.julia/v0.6/DataFrames/src/dataframe/io.jl:930

We can see that there are three methods for the readtable() function. They differ in the input they accept: an open IO stream, an IO stream together with a byte count, or the path to a file. For the meanings of the keyword arguments, refer to:
https://juliastats.github.io/DataFrames.jl/stable/man/io/
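
To illustrate how one function name can serve both a stream and a path, here is a minimal sketch of the same pattern. The function count_rows is hypothetical and not part of DataFrames; it only mirrors the IO and AbstractString signatures listed above.

```julia
# Hypothetical function, not part of DataFrames: count the lines of a data source.

# Method 1: read from any open IO stream.
count_rows(io::IO) = length(readlines(io))

# Method 2: given a path, open the file and delegate to the IO method.
count_rows(pathname::AbstractString) = open(count_rows, pathname)

# An in-memory buffer dispatches to the IO method.
n_buffer = count_rows(IOBuffer("5.1,3.5\n4.9,3.0\n"))

# A file path dispatches to the AbstractString method.
path = joinpath(mktempdir(), "two_rows.csv")
open(io -> write(io, "5.1,3.5\n4.9,3.0\n"), path, "w")
n_file = count_rows(path)
```

Both calls go through the same name, and Julia picks the method from the argument type, which is the mechanism the methods(readtable) listing reflects.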

We may also want to save a DataFrame to a file. We do this with the writetable() function:

julia> writetable("output_iris.csv",df_iris,header=true,separator=',')

julia> ;

shell> ls -l | grep iris
-rw-rw-r--  1 ubuntu ubuntu          2734 Jul 12 15:00 iris.csv
-rw-rw-r--  1 ubuntu ubuntu          2746 Jul 12 16:00 output_iris.csv

julia> 
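
Note that output_iris.csv is 12 bytes larger than iris.csv. One plausible explanation, assuming writetable() writes the normalized column names wrapped in double quotes and leaves the data rows byte-for-byte unchanged, is that the whole difference comes from the header line. The arithmetic can be checked with a short sketch; the written header shown here is an assumption, not copied from the actual file.

```julia
# First line of the downloaded file, as implied by the column names shown earlier.
original_header = "150,4,setosa,versicolor,virginica"

# Assumed first line of output_iris.csv: normalized names, each wrapped in quotes.
column_names = ["x150", "x4", "setosa", "versicolor", "virginica"]
written_header = join(["\"$name\"" for name in column_names], ",")

# Extra bytes introduced by the header alone (all characters are ASCII).
extra_bytes = sizeof(written_header) - sizeof(original_header)
```

The result equals the difference between 2746 and 2734 bytes in the listing above, which is consistent with this explanation but does not prove it.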


