Thursday, 12 July 2018

The International E-road Network and Neo4j

I was having fun recently with some E-road data that I found. E-roads are highly intuitive for spatial/ graph-y stuff: you will be on one regularly, and they will lead you to Rome, eventually. Or Aarhus. And because it is graph-y and spatial at the same time, it is obvious to try some shortest path queries on it, which Neo4j has built-in. Think route planning.

This particular dataset has 895 reference places and 1250 sections of road between them. Roads have a distance attribute, which will come in handy. And there is some more metadata to play with like country codes and whether a road is a water crossing.

The data needed some massaging into CSV format before it can be imported into Neo4j. I ended up with one line of CSV per section of E-road, so some duplication of reference places, but meh - Neo4j can merge them back together:


Oh and please go ahead and use the dataset if you find it interesting.

Let's go from Århus to Rome:

Cypher has a feature called variable length pattern matching. Here is the simplest possible Cypher query for finding the path we want:

MATCH p=((aarhus {name: "Århus"})-[:EROAD*]-(rome {name: "Roma"}))
RETURN p

Well, that didn't work, Spinner-of-Death™. The dataset is too large, or the query is too broad, or my laptop is too small. Aha! But I happen to know from playing with the dataset that there exists a part of length 28 between Aarhus and Rome, so we can give the path finder a maximum:

MATCH p=((aarhus {name: "Århus"})-[:EROAD*28]-(rome {name: "Roma"}))
RETURN p

Meh, Spinner-of-Death™ again...

Waypoints! Let's make it even bit easier and constrain the query by inserting waypoints. The path I found while exploring goes via Stuttgart and Milan:

MATCH p=((aarhus {name: "Århus"})-[:EROAD*11]-(stuttgart {name: "Stuttgart"})-[:EROAD*11]-(milan {name: "Milano"})-[:EROAD*6]-(rome {name: "Roma"}))
RETURN p


Result!

We can even find the length of the paths:

MATCH p=(aarhus {name: "Århus"})-[:EROAD*11]-(stuttgart {name: "Stuttgart"})-[:EROAD*11]-(milan {name: "Milano"})-[:EROAD*6]-(rome {name: "Roma"})
RETURN REDUCE(s = 0, r IN relationships(p) | s + TOINT(r.distance)) AS total_distance ORDER BY total_distance ASC

It comes up to 2329 km for the shortest path, 3491 km for the longest, and there are 48 paths that fit the pattern.

We need a shortest path algorithm

Alright, the trouble with the above approach is, the waypoints I chose are probably be sub-optimal, and therefore the path isn't as short as it could be. Also it is a jumble to look at, there are several paths between Stuttgart and Milan of length 11 for example, and it is hard to get an intuition of this intuitive spatial/ graph-y data.

Luckily, Cypher has a shortest path algorithm built in:

MATCH p=shortestPath((aarhus {name: "Århus"})-[rels:EROAD*]-(rome {name: "Roma"}))
RETURN p, length(p), REDUCE(s = 0, r IN rels | s + TOINT(r.distance)) AS total_distance


So there is the shortest path from Aarhus to Rome - in terms of hops. 22 hops and 2948 km, including sailing across the Adriatic Sea. Let's call it the scenic route.

We know that is suboptimal, but at least the query looks sane now without the waypoints. Oh and some graphics skills, we can edit the graph in Neo4j Browser, drag and arrange the nodes so they neatly overlay locations on a map.

Weighted shortest path FTW!

Right. Last refinement, promise. Weighted shortest path is supported in Neo4j using Dijkstra from the APOC procedure library plugin, and we need that so we can minimise distance instead of just hops:

MATCH (aarhus {name: 'Århus'}), (rome {name: 'Roma'})
CALL apoc.algo.dijkstra(aarhus, rome, 'EROAD', 'distance') YIELD path, weight
RETURN path, length(path), weight


Neat and simple query giving us a path with 26 hops, 2147 km, and quite straight-looking on the map. We have a winner.