# A statistical approach

Hello All,

I am not new to OSM, but this is my first post on this forum.

Some time ago an idea came to my mind which I would like to discuss here.

It’s a matter of fact that “mass data implies power”, and I think the OSM project could benefit from that fact. Wikipedia is based on mass data, and it’s powerful. Google acquires mass data every minute, and the power of Google is incredible. I think OSM basically has the same opportunity to benefit from the power of mass data.

With more and more GPS hardware becoming available at comparatively low prices, the capacity to acquire data for OSM is increasing rapidly. Unfortunately, it is all raw data which has to be cultivated before it can be used as an accurate OSM database. But the sheer mass of data not only poses a mass of manual work: it is a strength.

I dare say that the statistical information stored in all that raw data offers so many analysis capabilities that it should be possible to do half the work of cultivating that data without manual editing, just by putting it all together.

To sketch my idea, let’s have a look at some roads in and around my village.

1. The road where my house is located is a dead-end street.
2. There are three cross-town highways (Bundesstraßen; I am German) in my village.

Whenever I record GPS data while driving on these roads, I increase the amount of raw data for these roads, and I get a lot of information, for free:

a) in my dead-end street, my speed is always 10-40 km/h.
b) I never leave that street at the dead end.
c) the trackpoints in that street differ somewhat from track to track, but statistically they increase precision. The more such points I collect, the lower the average error will be.
d) when I drive along the cross-town highways, my speed is anything from 0 km/h to 60 km/h (illegally), but the average speed (collected from a number of tracks/probes) is simply significantly higher than in my home street or in the housing area.
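Point (c) can be illustrated with a tiny simulation. The coordinates and noise level below are invented for the sketch; the point is only that averaging n independent fixes shrinks the error roughly by a factor of sqrt(n):

```python
import math
import random

# Hypothetical spot and GPS noise level, chosen only for illustration.
TRUE_LAT, TRUE_LON = 50.0, 8.0
SIGMA = 0.0001  # per-fix Gaussian noise, in degrees

def noisy_fix():
    """One simulated GPS fix of the same spot."""
    return (random.gauss(TRUE_LAT, SIGMA), random.gauss(TRUE_LON, SIGMA))

def average_fix(fixes):
    """Average a list of (lat, lon) fixes."""
    n = len(fixes)
    return (sum(f[0] for f in fixes) / n, sum(f[1] for f in fixes) / n)

def error(fix):
    """Distance (in degrees) from the true position."""
    return math.hypot(fix[0] - TRUE_LAT, fix[1] - TRUE_LON)

random.seed(42)
err_few = error(average_fix([noisy_fix() for _ in range(5)]))
err_many = error(average_fix([noisy_fix() for _ in range(500)]))
print(err_few, err_many)  # the 500-fix average is typically far closer
```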

If we could find a way to put all that data together statistically (and automatically), we could establish a lot of facts regarding the above roads.

• every road which is used in only one direction most of the time (> 98% or so) is presumably a one-way street
• every road which is never left at one of its ends is most likely a dead-end street
• every road which is passed at an average speed of no more than 20-30 km/h (or a similar speed; it’s just an example) might be a street in a housing area
• roads passed at an average speed of 30-40 km/h are most likely bigger than the ones in the housing area

… and so on.
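The bullet points above amount to a simple threshold classifier. A minimal sketch, where the cut-off values are placeholders that would have to be tuned against real data:

```python
def classify_road(avg_speeds_kmh):
    """Guess a road type from per-track average speeds (km/h).

    The 30 and 50 km/h cut-offs are invented placeholders, not
    calibrated values; real thresholds would come from the data.
    """
    overall = sum(avg_speeds_kmh) / len(avg_speeds_kmh)
    if overall <= 30:
        return "housing area street"
    if overall <= 50:
        return "bigger road"
    return "cross-town highway"

print(classify_road([12, 25, 18]))  # -> housing area street
print(classify_road([55, 70, 62]))  # -> cross-town highway
```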

The more data we have for a particular road, the more precisely we could calculate the type of road. But the attributes which can be obtained with statistical procedures go far beyond “Is it a road? Which kind of road? Is it a one-way street?” …

Let me outline some surprising capabilities of statistical mass data analysis.

Consider 1000 tracks recorded on a cross-town highway. Let’s say the tracks were recorded by some pedestrians, some cyclists and quite a number of cars. The tracks would show different ranges of speed within the stretch in question: some between 0 km/h and 5 km/h, some up to 30 km/h and some up to 60 km/h or more. It is very likely that the 60 km/h tracks were no bicycles or pedestrians, and it is also very likely that the tracks with up to 30 km/h were no pedestrians. With some heuristic rules, it might be possible, with some precision (i.e. with some average error as well), to identify pedestrian, car and bicycle tracks. The more data we have for such an analysis, the better the result. Any track containing data very far from the average “profiles” (pedestrian/bicycle/car) could be excluded from the calculation, e.g. by means of “invalidation triggers”, thus gaining precision in our analysis.
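As a sketch, that heuristic separation could look like the following, using the 5 and 30 km/h cut-offs from the paragraph and an “invalidation trigger” that throws out implausible tracks (the 200 km/h limit is a made-up example value):

```python
def classify_mode(speeds_kmh):
    """Guess the transport mode of one track from its speed samples,
    using the cut-offs from the text: <=5 pedestrian, <=30 bicycle."""
    top = max(speeds_kmh)
    if top <= 5:
        return "pedestrian"
    if top <= 30:
        return "bicycle"
    return "car"

def drop_implausible(tracks, max_kmh=200):
    """'Invalidation trigger': exclude tracks whose samples lie far
    outside any plausible profile (here: anything above max_kmh)."""
    return [t for t in tracks if max(t) <= max_kmh]

tracks = [
    [3, 4, 5],      # pedestrian-like
    [12, 28, 20],   # bicycle-like
    [45, 60, 0],    # car-like (with a stop)
    [30, 900, 50],  # GPS glitch -> excluded
]
valid = drop_implausible(tracks)
print([classify_mode(t) for t in valid])  # -> ['pedestrian', 'bicycle', 'car']
```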

Now let’s filter the tracks which were identified as “car” tracks and apply some rules. In these rules, I use the term “trigger”, which means that a configurable number of probes fulfils a particular criterion (which is supposed to be a configurable value as well):

• if the percentage of tracks following the same direction “fires” a trigger, the street in question is most likely a one-way street
• if a high percentage of tracks on the road do not show any stop, there are possibly not many parking spots in the street, or parking is forbidden
• if we encounter a significant percentage of stops (0 km/h) at a certain position (with no adjacent junction node), there might be a zebra crossing or a pedestrian light
• if the average speed is above 70 km/h (or “triggers” a specific limit), the road is most likely not a residential street

Or:

• a number of tracks was recorded for a particular route, and all of them show a speed below 5-9 km/h. It is a safe bet that this route is no highway; possibly it is even unusable for cars.
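A “trigger” in the sense used above can be written as a tiny helper in which both the minimum number of probes and the required fraction are configurable (the default values here are arbitrary examples):

```python
def trigger(matching, total, min_probes=50, min_fraction=0.98):
    """Fire only if enough probes exist AND a configurable fraction
    of them meets the criterion. Defaults are example values."""
    return total >= min_probes and matching / total >= min_fraction

# One-way detection: 990 of 1000 car tracks went the same direction.
print(trigger(990, 1000))                      # True -> likely one-way
print(trigger(990, 1000, min_fraction=0.995))  # False: stricter threshold
print(trigger(9, 9))                           # False: too few probes
```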

Just some offhand ideas; I am quite sure much more is possible.

But is all that really sufficient for accurate maps?

No.

But: it could be a step towards pushing all OSM raw data to a higher level of quality. It could help to develop a procedure to check all manual work for plausibility.

If it is really possible to make all raw data more than just raw data by statistical methods, I could imagine the following scenario:

• raw data (tracks) is loaded up to OSM continuously
• a nightly “build” procedure extracts all the information from the various formats (gpx etc.) and loads everything into a database
• the statistical analysis is run and extracts all the information which can be calculated with a certain degree of probability and writes the results to an “alpha” stage database
• “alpha” data shall then serve as the basis for JOSM and other map editors. With that data available, such editors could offer some kind of a “commit” facility, i.e.: the “alpha” stage data is interpreted as kind of an alpha-map, already in the format of the final map but still having “alpha” status. All “alpha” roads and other map content could then be made “beta” data by just clicking a “confirm” or “commit” button (and/or applying the actions already supported by the editor in question).
• “beta” or “unstable” data could then be the “pre-production” version of the OSM, and be accepted as “stable” or “final” after additional attributes (street-names etc) have been added or the data was reviewed.

It’s just an idea to get the most out of all raw data automatically with the aid of statistical methods. There are of course a lot of possible improvements, e.g. comparing raw data against existing beta and/or final data (for plausibility checks), or deciding whether and how younger data is weighted more strongly than old data (because roads and traffic rules change constantly), but those are details. The principle is statistics: a kind of “profiling” for raw data with the intent to simplify further processing.

That’s the idea. What’s your opinion?

Discussions are welcome!

emax.

It is pretty easy to do this already, but what is needed is some kind of tool to process the GPX files available on the OSM site. At the moment it’s a lot of work to select the right GPX files, extract the correct data, etc.

It can be done, and everything you say is a great idea. I even did this back in 2005: on 15 of my tracks I got this:
speed: 4km/h - 110km/h
direction: both.

then if I did the speed for either direction:
west: 4km/h - 80km/h
east: 40km/h-110km/h

Nice isn’t it…

Now, if you code something that will select the parts of all GPX traces that are closest to a specific way, then you have come a long way towards what you describe. I did this manually.

PS: those numbers are completely out of the blue, but it’s more or less what I got. DS.

Well, at first that sounds somewhat frustrating. But if you look closer, it still makes sense to think about it.

• the results strongly depend on the number of probes. Of course, a 100 m width for a road is not very encouraging. But statistics does not just use min/max values; it considers things like the Gaussian distribution, and then it might look quite different.

• the same is true for the speed values. It will be necessary to create filters to confine the range of the values. The policy is simple: “ignore implausible values”. The challenge, however, is to “identify plausible ones”.

But I agree: with only 15 tracks there is not much of a choice; more would be better. But even if speed does not deliver trustworthy information for a particular track, there is still other valuable information hidden, the “dead-end street” pattern for example, or maybe others as well. Though I am a C++ developer, I unfortunately do not have enough statistics skills, or I would try it myself. I hope that my idea encourages one or another reader (with statistics knowledge) to give it a try.

I think this is partly the thinking behind the philosophy that everyone should upload as many tracks as possible, without cleaning them up or trying to exclude bits that have already been mapped. One issue at the moment is that gathering and uploading tracks still requires a lot of manual processes, so people often only do it with particular mapping projects in mind.

As a starting point, I would suggest writing some very simple examples that focus on a particular issue and suggest corrections to the main map which could then be investigated by hand. I’m not sure exactly what to pick, but you could, for example, look for one way streets that are not correctly tagged as such? This could be used as a proof of concept and help tune the filters required (how many tracks before making a recommendation, how to separate cars, cycles and pedestrians, etc…). Don’t worry too much about your statistical skills for now - if you implement something simple, others can suggest improvements later.

It would be nice then to find a way to gather GPS tracks on a large scale, for example by convincing a taxi or trucking company to carry loggers in all their vehicles. This would provide a lot more data for validation and automated bug detection.

Oliver

Hi Oliver!

Starting with a proof of concept is a good idea. I will try to find some time to write something useful.

To all readers: there are for sure already some libraries available which can deal with geometric data and create a kind of “average curve” for a given set of points. Are there any recommendations?
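In case no ready-made library turns up, the “average curve” idea can at least be sketched by hand: resample each track to the same number of points along its length, then average position-wise. This naive version assumes the tracks are already matched to the same road and run in the same direction:

```python
import math

def resample(track, n=50):
    """Resample a polyline of (lat, lon) points to n points evenly
    spaced along its cumulative length, by linear interpolation."""
    d = [0.0]
    for a, b in zip(track, track[1:]):
        d.append(d[-1] + math.hypot(b[0] - a[0], b[1] - a[1]))
    total = d[-1]
    out, j = [], 0
    for i in range(n):
        target = total * i / (n - 1)
        while j < len(d) - 2 and d[j + 1] < target:
            j += 1
        seg = d[j + 1] - d[j]
        t = 0.0 if seg == 0 else (target - d[j]) / seg
        p, q = track[j], track[j + 1]
        out.append((p[0] + t * (q[0] - p[0]), p[1] + t * (q[1] - p[1])))
    return out

def average_curve(tracks, n=50):
    """Average several roughly parallel tracks into one centre line."""
    rs = [resample(t, n) for t in tracks]
    return [(sum(r[i][0] for r in rs) / len(rs),
             sum(r[i][1] for r in rs) / len(rs)) for i in range(n)]

# Two noisy passes along the same (hypothetical) road:
a = [(0.0, 0.0), (1.0, 0.1), (2.0, 0.0)]
b = [(0.0, 0.2), (1.0, 0.1), (2.0, 0.2)]
mid = average_curve([a, b], n=3)
print(mid)  # roughly [(0.0, 0.1), (1.0, 0.1), (2.0, 0.1)]
```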

Very good idea. Or look for oneway=yes streets and see if the GPX tracks going through them are really one-way.

+1

This has already been done, e.g. with eCourier. It can be hard to get those deals, though.

I think the problem is that there are too many libraries that do this, and usually they are packed into other big GIS analysis packages.

Choose one area, create tools for that, and then publish. This is very important: I’ve done a lot of scripting and released almost none of it because I thought it was too hackish, and that is bad. I should have released all my scripts.

It’s also chicken and egg - until we’re using the data in some structured way, there isn’t a big incentive to go and do these deals. Are the details of the eCourier deal somewhere on the mailing list? It seems to be the only deal of its kind. I would like to set up a wiki page to encourage people to do something similar.

I have contacts at a taxi firm, but unfortunately they’re in London, so while it’s still worth doing, it won’t add that much to the data we already have.

Oliver

Actually, I think the eCourier data is anonymized. So sure, it would be useful, but we have a lot of data (GPS logs) and no tools. Developing tools to handle them would be wonderful; if you can show that GPS logs can be helpful, then sure…

The standard API calls return track points within a bbox, but these don’t have dates / times / sequence information, so aren’t much use for this application. It is possible to retrieve individual tracks if you know their id, but not by location, AFAIK.

At the moment, I’m considering selecting an area, then cycling through all the tracks using GET /api/0.5/gpx//details to check whether to download each track’s full data. Is this the best way, and will it cause server load issues?

Sadly, the GPX track details don’t contain the bounding box, so you wouldn’t get much more information that way. Perhaps a single coordinate is enough to select tracks to cycle through…

There is a need for a tool that selects GPS tracks by bounding box, so a better path would be to write Rails code for that.

I was planning on downloading all the available GPX tracks instead and parsing them locally, to extract bounding boxes among other things. Of a thousand randomly selected tracks I downloaded, about half had a bounding box at the top of the GPX file, and about 10 were bad XML…