Friday, August 18, 2017

Julia - Language - Understanding how to use Joins with DataFrames

In order to work with multiple datasets, we often need to merge the datasets in a particular fashion to make the analysis easier or to use it with a particular function.

We will be using the Road Safety Data published by the Department for Transport, UK, and it is open under the OGL - Open Government License.

The datasets can be found here:
https://data.gov.uk/dataset/road-accidents-safety-data

Datasets:
 - Accidents 2015 - http://data.dft.gov.uk/road-accidents-safety-data/RoadSafetyData_Accidents_2015.zip
 - Vehicles 2015  - http://data.dft.gov.uk/road-accidents-safety-data/RoadSafetyData_Vehicles_2015.zip

Let us Download these DataSets:

$ wget http://data.dft.gov.uk/road-accidents-safety-data/RoadSafetyData_Accidents_2015.zip
--2017-08-18 19:07:14--  http://data.dft.gov.uk/road-accidents-safety-data/RoadSafetyData_Accidents_2015.zip
Resolving data.dft.gov.uk (data.dft.gov.uk)... 54.231.131.114
Connecting to data.dft.gov.uk (data.dft.gov.uk)|54.231.131.114|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4123639 (3.9M) [application/x-zip-compressed]
Saving to: 'RoadSafetyData_Accidents_2015.zip'

100%[==============================================================================================================================>] 4,123,639    744KB/s   in 15s   

2017-08-18 19:07:30 (267 KB/s) - 'RoadSafetyData_Accidents_2015.zip' saved [4123639/4123639]

$ wget http://data.dft.gov.uk/road-accidents-safety-data/RoadSafetyData_Vehicles_2015.zip
--2017-08-18 19:07:36--  http://data.dft.gov.uk/road-accidents-safety-data/RoadSafetyData_Vehicles_2015.zip
Resolving data.dft.gov.uk (data.dft.gov.uk)... 52.218.20.17
Connecting to data.dft.gov.uk (data.dft.gov.uk)|52.218.20.17|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2527196 (2.4M) [application/x-zip-compressed]
Saving to: 'RoadSafetyData_Vehicles_2015.zip'

100%[==============================================================================================================================>] 2,527,196    696KB/s   in 4.2s  

2017-08-18 19:07:42 (591 KB/s) - 'RoadSafetyData_Vehicles_2015.zip' saved [2527196/2527196]

$

We must now uncompress the zip files:

$ unzip RoadSafetyData_Accidents_2015.zip
Archive:  RoadSafetyData_Accidents_2015.zip
  inflating: Accidents_2015.csv     
$ unzip RoadSafetyData_Vehicles_2015.zip
Archive:  RoadSafetyData_Vehicles_2015.zip
  inflating: Vehicles_2015.csv      
$

Let us now load these datasets using the DataFrames package in Julia:

$ julia
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: https://docs.julialang.org
   _ _   _| |_  __ _   |  Type "?help" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.6.0 (2017-06-19 13:05 UTC)
 _/ |\__'_|_|_|\__'_|  |  Official http://julialang.org/ release
|__/                   |  x86_64-pc-linux-gnu

julia> using DataFrames

julia> df_accidents_2015 = readtable("Accidents_2015.csv",header=true,separator=',');

julia> df_vehicles_2015 = readtable("Vehicles_2015.csv",header=true,separator=',');

julia> 


Let us list the column names:

julia> names(df_accidents_2015)
32-element Array{Symbol,1}:
 :Accident_Index                            
 :Location_Easting_OSGR                     
 :Location_Northing_OSGR                    
 :Longitude                                 
 :Latitude                                  
 :Police_Force                              
 :Accident_Severity                         
 :Number_of_Vehicles                        
 :Number_of_Casualties                      
 :Date                                      
 :Day_of_Week                               
 :Time                                      
 :Local_Authority_District_                 
 :Local_Authority_Highway_                  
 :x1st_Road_Class                           
 :x1st_Road_Number                          
 :Road_Type                                 
 :Speed_limit                               
 :Junction_Detail                           
 :Junction_Control                          
 :x2nd_Road_Class                           
 :x2nd_Road_Number                          
 :Pedestrian_Crossing_Human_Control         
 :Pedestrian_Crossing_Physical_Facilities   
 :Light_Conditions                          
 :Weather_Conditions                        
 :Road_Surface_Conditions                   
 :Special_Conditions_at_Site                
 :Carriageway_Hazards                       
 :Urban_or_Rural_Area                       
 :Did_Police_Officer_Attend_Scene_of_Accident
 :LSOA_of_Accident_Location                 

julia> names(df_vehicles_2015)
23-element Array{Symbol,1}:
 :Accident_Index                 
 :Vehicle_Reference              
 :Vehicle_Type                   
 :Towing_and_Articulation        
 :Vehicle_Manoeuvre              
 :Vehicle_Location_Restricted_Lane
 :Junction_Location              
 :Skidding_and_Overturning       
 :Hit_Object_in_Carriageway      
 :Vehicle_Leaving_Carriageway    
 :Hit_Object_off_Carriageway     
 :x1st_Point_of_Impact           
 :Was_Vehicle_Left_Hand_Drive_   
 :Journey_Purpose_of_Driver      
 :Sex_of_Driver                  
 :Age_of_Driver                  
 :Age_Band_of_Driver             
 :Engine_Capacity_CC_            
 :Propulsion_Code                
 :Age_of_Vehicle                 
 :Driver_IMD_Decile              
 :Driver_Home_Area_Type          
 :Vehicle_IMD_Decile             

julia>


The common column between the two files is: Accident_Index. This is unique and used as an index in this dataset.

First, we will be making the DataFrames package available and then, we will load the data. We load the data into two different dataframes using the readtable function that we discussed earlier:

Let us Perform a full join:

julia> df_full_vehicles_accident = join(df_vehicles_2015,df_accidents_2015, on = :Accident_Index);

julia>

Note: In full join operation, the kind parameter is not passed to the join function.

To see all the types of Joins offered by DataFrames package in Julia, refer the following:
https://juliastats.github.io/DataFrames.jl/stable/lib/manipulation/#joins


Julia Language -  Inner Join:

julia> df_inner_vehicles_accident = join(df_vehicles_2015,df_accidents_2015, on = :Accident_Index, kind=:inner);

julia> 

Julia Language - Outer Join:

julia> df_outer_vehicles_accident = join(df_vehicles_2015,df_accidents_2015, on = :Accident_Index, kind=:outer);

julia> 


Julia Language - Left Outer Join:

julia> df_left_vehicles_accident = join(df_vehicles_2015,df_accidents_2015, on = :Accident_Index, kind=:left);

julia> 

Julia Language - Right Outer Join:

julia> df_right_vehicles_accident = join(df_vehicles_2015,df_accidents_2015, on = :Accident_Index, kind=:right);

julia>

Julia Language - Semi Join:


julia> df_semi_vehicles_accident = join(df_vehicles_2015,df_accidents_2015, on = :Accident_Index, kind=:semi);

julia>

Julia Language - Anti Join:

julia> df_anti_vehicles_accident = join(df_vehicles_2015,df_accidents_2015, on = :Accident_Index, kind=:anti);

julia>

Julia Language - Cross Join:

julia> df_cross_vehicles_accident = join(df_vehicles_2015,df_accidents_2015,  kind=:cross);

julia>




Out of Memory Issues with Cross Join:

We received the following error while performing cross join over these two datasets:

julia> df_cross_vehicles_accident = join(df_vehicles_2015,df_accidents_2015,  kind=:cross);
ERROR: OutOfMemoryError()
Stacktrace:
 [1] _repeat at ./abstractarraymath.jl:396 [inlined]
 [2] #repeat#117(::Int64, ::Tuple{Int64}, ::Function, ::Array{String,1}) at ./abstractarraymath.jl:366
 [3] (::Base.#kw##repeat)(::Array{Any,1}, ::Base.#repeat, ::Array{String,1}) at ./<missing>:0 (repeats 2 times)
 [4] collect(::Base.Generator{Array{Any,1},DataFrames.##53#55{Int64}}) at ./array.jl:441
 [5] crossjoin(::DataFrames.DataFrame, ::DataFrames.DataFrame) at /home/shree/.julia/v0.6/DataFrames/src/abstractdataframe/join.jl:251
 [6] #join#52(::Array{Symbol,1}, ::Symbol, ::Function, ::DataFrames.DataFrame, ::DataFrames.DataFrame) at /home/shree/.julia/v0.6/DataFrames/src/abstractdataframe/join.jl:197
 [7] (::Base.#kw##join)(::Array{Any,1}, ::Base.#join, ::DataFrames.DataFrame, ::DataFrames.DataFrame) at ./<missing>:0


julia>

On checking the number of rows:

julia> nrow(df_vehicles_2015)
257845

julia> nrow(df_accidents_2015)
140056


julia> 

We find that the cross join should contain ( 257845 x 140056 = 36112739320 ) rows. This more than 36 billion rows. This is huge set of rows and cannot fit into the main memory. Hence, Let us reduce the rows by creating two new datasets with less rows and then we will be able to see a working cross join.

julia> ; 

shell> head -10 Accidents_2015.csv > Accidents_2015_10.csv

shell> head -10 Vehicles_2015.csv > Vehicles_2015_10.csv

Now, let us load these two datasets.

julia> df_accidents_2015_10 = readtable("Accidents_2015_10.csv",header=true,separator=',');

julia> df_vehicles_2015_10 = readtable("Vehicles_2015_10.csv",header=true,separator=',');

julia> 

Now, let us check the number of rows for these reduced datasets:

julia> nrow(df_accidents_2015_10)
9

julia> nrow(df_vehicles_2015_10)
9

Since these are having 9 rows each. Hence, we can expect atleast (9 x 9 =) 81 rows on performing cross join. Let us now perform cross join.

julia> df_cross_vehicles_accident_10 = join(df_vehicles_2015_10,df_accidents_2015_10,  kind=:cross);

julia> nrow(df_cross_vehicles_accident_10)
81

julia> 

The resultant dataset of cross join contains 81 rows - this is what we were expecting from cross join. What we can note from this is that cross join will have a huge dataset and is not used generally in practice as the result of join has no meaning.


No comments:

Post a Comment