Tuesday, 3 January 2017

Processing STATS19 data (big CSV files) using Linux

The STATS19 road collision data published by the UK government consists of large 'comma separated value' (.csv) files, with hundreds of thousands of lines. In order to create the maps on this blog I used the command-line tools built in to the Linux operating system (files of this size are impractical to open in ordinary spreadsheet programmes such as Excel).

Here are some examples of the commands I put together using various websites. This may be helpful to others working with STATS19 data or other large data-sets.

The data:

The first line of the file is the header with column titles. Each subsequent line represents a separate collision, a casualty, or a vehicle, depending on which file we're looking at.

To avoid losing the header, 'head' and 'tail' can be used. For instance, sorting looks like "(head -n 1 file.csv && tail -n +2 file.csv | sort -t, -k1,1) > sorted.csv" - the first line is kept separate and the remainder of the file is sorted. For awk commands 'NR == 1 ||' can be added to the logic to ensure the first line is printed. I have ignored headers in some of the examples to make them easier to follow.
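As a minimal sketch of the header-preserving sort (the file name and columns here are invented for illustration):

```shell
# Invented three-line CSV: a header plus two out-of-order records
printf 'Accident_Index,Date\nC2,2015-02-01\nC1,2015-01-01\n' > demo.csv

# Keep line 1 aside, sort the rest on column 1, write both to a new file
(head -n 1 demo.csv && tail -n +2 demo.csv | sort -t, -k1,1) > demo_sorted.csv

cat demo_sorted.csv
# Accident_Index,Date
# C1,2015-01-01
# C2,2015-02-01
```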

Match entries from two data-sets using 'join'

'Join' can be used to join the casualty data with the collision record. Both files must be sorted on the join field, otherwise the output is unreliable. Join has the following options:
  • -a1 - keep all records from the first file, even where there is no match 
  • -e "0" - put zeros in any missing fields - only works if -o is also set 
  • -t, - sets the delimiter to a comma 
  • -v1 - display the non-matching records of file one 
  • -o - sets the format of the output line; the "auto" option (GNU join) works it out for you 
  • By default join matches on the first field, but this can be changed with -1 and -2.
  • More info: manpages.ubuntu.com/manpages/xenial/en/man1/join.1posix.html
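To make the options concrete, here is a toy example (data invented; '-o auto' needs GNU join):

```shell
# Two tiny files, already sorted on the join field (column 1)
printf 'C1,Serious\nC2,Slight\n' > left.csv
printf 'C1,Male\n' > right.csv

# -a1 keeps the unmatched C2 record; -e fills its missing field with "empty"
join -t, -a1 -e empty -o auto left.csv right.csv
# C1,Serious,Male
# C2,Slight,empty
```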

I followed these steps:
  1. Check to see if any lines don't match - "join -t, -v1 <(tail -n +2 file1.csv) <(tail -n +2 file2.csv)"
  2. Join, keeping the headers intact - "(join -t, <(head -n 1 file1.csv) <(head -n 1 file2.csv) && join -t, <(tail -n +2 file1.csv) <(tail -n +2 file2.csv)) > combinedfile.csv"
  3. After joining, carriage returns need to be removed from the files, otherwise lines split - "sed 's/\r//' combinedfile.csv > cleanfile.csv"
To join the make and model information (which came in a separate file) on to the main vehicle data file a different join command was used. In this case the make and model information was incomplete, and only the join field (written '0' in the -o format) plus the 24th and 25th fields of the second file were needed: "join -t, -a1 -o 0,2.24,2.25 -e empty <(tail -n +2 vehsort.csv) <(tail -n +2 mmsort.csv) > mmjoinfilt.csv" (here '-e empty' puts the literal word 'empty' in any missing fields).

Some files started with an incorrect first column header; this uses sed to replace the first header entry: "(head -n 1 file.csv | sed 's/[^,]*,/Key,/' && tail -n +2 file.csv) > output.csv". '[^,]*,' matches everything up to and including the first comma, and replaces it with "Key,". By default sed only substitutes the first occurrence on each line.
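A quick check of the header rewrite on invented data:

```shell
printf 'Acc_Index,Date\nC1,2015-01-01\n' > in.csv

# Rewrite only the header line; pass the data lines through untouched
(head -n 1 in.csv | sed 's/[^,]*,/Key,/' && tail -n +2 in.csv) > out.csv

cat out.csv
# Key,Date
# C1,2015-01-01
```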

Linking vehicle data on to the casualty record

After the collision data is appended to the casualty data, vehicle data can be appended to this. This produces a file with details of the casualty and the circumstances of the collision and relevant vehicle information combined.

The vehicle data was joined differently depending on the situation. For pedestrians, the data records which vehicle hit them first. All other road users have vehicle records of their own (in STATS19, ridden horses count as vehicles), and I inferred a collision where there was just one other vehicle involved. When there were more vehicles it was not possible to infer which vehicles collided with which. When there was only one vehicle involved there was no other vehicle record to link.

In order to match data relevant to a specific vehicle, I added a column which was a combination of the unique accident reference (column 1) and the vehicle identifier (column 3): "awk -F ',' -v OFS=',' '{print $1$3,$0}' file.csv > filewithkey.csv"
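On an invented two-row file (accident reference in column 1, vehicle identifier in column 3), the key column comes out like this:

```shell
printf 'C100,2015,1\nC100,2015,2\n' > veh.csv

# $1$3 concatenates the two references into a single join key
awk -F ',' -v OFS=',' '{print $1$3,$0}' veh.csv
# C1001,C100,2015,1
# C1002,C100,2015,2
```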

The pedestrian data records the vehicle which first hit them, so a similar key can be added and the files can be joined directly.

For two vehicles, the vehicle reference of the colliding vehicle is three minus the current vehicle reference: because the vehicles are numbered '1' and '2', this gives the other one. The command to add the key field is: "awk -F ',' -v OFS=',' '$22 == 2 || NR == 1 { print $1(3-$2),$0 }' file.csv". Join it with the vehicle data and remove carriage returns: "join -t, -a1 -o auto -e empty <(tail -n +2 2veh0515kst.csv) <(tail -n +2 vehmm0515fix.csv) | sed 's/\r//' > 2vehlinked0515.csv"
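A sketch of the 'three minus' trick on invented rows (here the vehicle reference is column 2 and the vehicle count is column 3; the real STATS19 column positions differ):

```shell
# One collision, two vehicle records
printf 'C100,1,2\nC100,2,2\n' > twoveh.csv

# Key each record with the *other* vehicle's reference: 3-1=2, 3-2=1
awk -F ',' -v OFS=',' '$3 == 2 { print $1(3-$2),$0 }' twoveh.csv
# C1002,C100,1,2
# C1001,C100,2,2
```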

For three or more vehicles involved (recorded in column 22) we can only match the casualty to every vehicle involved: "awk -F ',' -v OFS=',' '$22 >= 3 { print $0 }' file.csv | join -t, -1 1 -2 2 - vehicles.csv" (the '-' tells join to read its first file from the pipe). It is important to note that if we include this data we describe it as collisions where certain vehicle types 'were involved in the incident' rather than describing them as colliding with the casualty. The alternative is not to join this data, which means these casualties would not show up on the maps.

Look-up values from a list (column 1) and replace with text (column 2)

In order to make the output more human-readable I wanted to replace the numeric codes with text wherever possible. This can be done using a look-up file which has the codes in the first column and the text entries in the second column (the header text should be in both columns to avoid messing up the header).

In this case the code is taken from column 15 and replaced with the corresponding text:

awk -F ',' -v OFS=',' 'NR==FNR{a[$1]=$2} NR>FNR{$15=a[$15];print}' journeypurpose.csv file.csv > file2.csv 

(Note: write to a new file - redirecting straight back to file.csv would truncate it before awk has read it.)

Two files are passed to awk. Firstly the look-up file is loaded into an array, and secondly this array is used to replace the values in column 15. 'NR' is the total number of lines processed, and 'FNR' is the number of lines processed in the current file, therefore the first statement triggers while reading the first file and the second one triggers for the second file.
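A minimal sketch with invented codes (replacing column 2 rather than column 15, to keep the rows short):

```shell
# Look-up file: code in column 1, text in column 2
printf '1,Work\n2,School\n' > lookup.csv
printf 'C1,1\nC2,2\n' > data.csv

# First pass (NR==FNR) fills the array; second pass swaps codes for text
awk -F ',' -v OFS=',' 'NR==FNR{a[$1]=$2} NR>FNR{$2=a[$2];print}' lookup.csv data.csv
# C1,Work
# C2,School
```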

Alternatively the looked up values can be added as an additional column: "awk -F ',' -v OFS=',' 'NR==FNR{a[$1]=$2} NR>FNR{print $0,a[$7]}' castype2.csv val2.csv > val3.csv"

Lookup all the files in a folder: "for file in *.csv; do awk -F ',' -v OFS=',' 'NR==FNR{a[$1]=$2} NR>FNR{print $0,a[$16]}' vehtype.csv "$file" > "$(basename "$file" .csv)_1.csv"; done"

More information on using awk to do a 'look-up' between two data-sets: unix.stackexchange.com/questions/126485/replacing-the-values-in-one-file-with-the-values-in-another-file-in-bash (note that this is for data using a different delimiter, not comma separated values)

Google Fusion Tables - marker key

In order to display different markers based on severity/casualty type/vehicle type, we need to add a column to tell Google Fusion Tables which marker to use for each data point. Available markers: https://support.google.com/fusiontables/answer/2679986?hl=en

First add a column for the markers: "awk -F , -v OFS=, '{print $0,"Fusion Tables icon"}' file.csv"

Secondly set values depending on what you want to display, for instance to add man and woman icons based on the gender in column 4: "awk -F , -v OFS=, '$4=="Female"{$NF="woman"} $4=="Male"{$NF="man"} 1' file.csv"
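Putting the two steps together on invented rows (gender in column 4):

```shell
printf 'C1,x,x,Female\nC2,x,x,Male\n' > cas.csv

# Step 1: append a placeholder marker column
awk -F , -v OFS=, '{print $0,"Fusion Tables icon"}' cas.csv > cas_1.csv

# Step 2: overwrite the last field ($NF) based on column 4; the bare 1 prints every line
awk -F , -v OFS=, '$4=="Female"{$NF="woman"} $4=="Male"{$NF="man"} 1' cas_1.csv
# C1,x,x,Female,woman
# C2,x,x,Male,man
```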

Here are some of the other values which could be used:
  • $X=="Cyclist"{$NF="cycling"} 
  • $X=="?"{$NF="motorcycling"} 
  • $X=="Horse rider"{$NF="horsebackriding"} 
  • $X=="?"{$NF="truck"} 
  • $X=="Bus or coach (17 or more pass seats)"{$NF="bus"} 
  • $X=="Fatal"{$NF="large_red"} 
  • $X=="Serious"{$NF="large_yellow"} 
  • $X=="Slight"{$NF="small_blue"} 
  • $X=="?"{$NF="small_green"} 
  • $X=="?"{$NF="small_purple"}
Examples (operate on all the files in the folder):

  • for file in *.csv; do awk -F , -v OFS=, '{print $0,"Fusion Tables icon"}' "$file" > "$(basename "$file" .csv)_1.csv"; done 
  • for file in *_1.csv; do awk -F , -v OFS=, '$4=="Female"{$NF="woman"} $4=="Male"{$NF="man"} $2=="Cyclist"{$NF="cycling"} $2=="Horse rider"{$NF="horsebackriding"} 1' "$file" > "$(basename "$file" .csv)_2.csv"; done 
  • for file in *_2.csv; do awk -F , -v OFS=, '$3=="Serious"{$NF="small_yellow"} $3=="Slight"{$NF="small_blue"} 1' "$file" > "$(basename "$file" .csv)_3.csv"; done
Update - Fusion Tables vehicle icons added using a look-up list:
awk -F ',' -v OFS=',' 'NR==FNR{a[$1]=$2} NR>FNR{print $0,a[$8]}' vehicles2.csv file.csv > file2.csv 

vehicles2.csv (last line represents the header):
Agricultural vehicle,large_green
Bus or coach (17 or more pass seats),bus
Data missing or out of range,large_blue
Goods 7.5 tonnes mgw and over,truck
Goods over 3.5 tonnes and under 7.5 tonnes,truck
Goods vehicle - unknown weight,truck
Minibus (8 - 16 passenger seats),bus
Mobility scooter,large_purple
Motorcycle 125cc and under,motorcycling
Motorcycle 50cc and under,motorcycling
Motorcycle over 125cc and up to 500cc,motorcycling
Motorcycle over 500cc,motorcycling
Motorcycle - unknown cc,motorcycling
Other vehicle,large_blue
Pedal cycle,cycling
Ridden horse,horsebackriding
Taxi/Private hire car,cabs
Van / Goods 3.5 tonnes mgw or under,truck
Vehicle Type,Fusion Tables icon

Uploading the data to Google Maps

Checking files

It's good to check files as you go along:
  • less - Pipe data to the 'less' command to see it one screen at a time and scroll up and down 
  • View non-printing characters - "cat -v" 
  • Compare file headers - For all the .csv files in the directory, put the headers into one file "headers.csv" so you can review differences: "find . -name "*.csv" -exec awk -F ',' -v OFS=',' 'NR==1{print FILENAME,$0}' {} \; >> headers.csv" 
  • Count the fields - Print out how many fields there are in each line; if two or more numbers are output it indicates different lines are different lengths (which is probably not good): "awk -F ',' -v OFS=',' '{print NF}' file.csv | uniq" 
  • Count the fields for each file in the folder - Prints out how many fields there are in each line for all the files in the folder: "find . -name "*.csv" -exec awk -F ',' -v OFS=',' '{print FILENAME,NF}' {} \; | uniq" 
  • Count the lines for each file in the folder - "find . -name "*.csv" -exec wc -l {} \;" 
  • Check the unique values in a column (column 16 in this case): "cut -d , -f 16 file.csv | sort | uniq"
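For example, the field-count check flags an invented ragged file like this:

```shell
# The third line is missing a field
printf 'a,b,c\n1,2,3\n1,2\n' > check.csv

# More than one number in the output means inconsistent line lengths
awk -F ',' '{print NF}' check.csv | uniq
# 3
# 2
```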