What's the future of OSM's database?

i_am_a_burger · February 9, 2018, 9:03am

At the moment Planet.osm XML uncompressed is close to a TB, but I assume with all cities* on this planet mapped “properly” this could easily be 25 TB. Of course, when starting a project like OSM you don’t know where you end up, but putting all information into one database now no longer looks like a decision that can work forever. So, are there any plans to change OSM’s database model?

https://en.wikipedia.org/wiki/List_of_largest_cities

kocio · February 9, 2018, 11:24am

It sounds like a specific question that is more likely to be discussed on the dev list:

https://lists.openstreetmap.org/listinfo/dev

SK53 · February 9, 2018, 2:24pm

I very much doubt that 25 Tbyte is a significant limit. In the early '90s I was working on databases of 10 Gbyte which seemed huge. When I first worked on Tbyte sized databases we needed specialist parallel processors, but these days 10 TByte is a relatively small database and readily managed by quite modest servers.

There is ample scope to beef up servers to cope with increasing data volumes, particularly as hardware itself continues to improve in performance. I think year-on-year growth in storage is at most on the order of 40-50%, which is perfectly manageable as part of regular upgrades to hardware.
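As a rough illustration of what that growth rate implies, here is a minimal sketch. It assumes the 40-50% figure applies to the data volume itself and borrows the approximate 1 TB and 25 TB figures from the opening post; the function name is made up for this example, and none of this is a measurement of OSM's actual growth.

```python
import math


def years_to_reach(current_tb: float, target_tb: float, annual_growth: float) -> float:
    """Years of compound growth needed to go from current_tb to target_tb.

    annual_growth is a fraction, e.g. 0.40 for 40% per year.
    """
    if current_tb <= 0 or target_tb <= 0 or annual_growth <= 0:
        raise ValueError("sizes and growth rate must be positive")
    return math.log(target_tb / current_tb) / math.log(1.0 + annual_growth)


# Assumed inputs: ~1 TB today, 25 TB as the hypothetical target from the first post.
for rate in (0.40, 0.50):
    years = years_to_reach(1.0, 25.0, rate)
    print(f"{rate:.0%} growth per year: about {years:.1f} years from 1 TB to 25 TB")
```

Under those assumptions the 25 TB mark is roughly eight to ten years away, which leaves several ordinary hardware refresh cycles in between.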

Additionally, database software, like PostgreSQL, continues to acquire new capabilities. In particular, because so many people are running large databases these days, there is significant demand for such functionality. One area where PostgreSQL is a little weak is in the management of partitioned tables, but declarative partitioning is coming in 10.x.

None of these factors affect the decision to use XML as a method for sharing data. Most serious users only make use of the .PBF files, and it will be a long time before those approach a Terabyte in size. Remember the XML file is not the database.

Probably the area of OSM where performance matters most is in building a new tile server from planet using osm2pgsql. It may be that partitioned tables could reduce the time to build indexes (although choosing a suitable partition key for geographical data may be difficult). Making the osm2pgsql program restartable at various points would be useful too. However, the main OSM infrastructure is optimised for minutely updates rather than being built de novo, so such changes, whilst useful to others, may not be such a high priority for the developers.

OverThere · February 9, 2018, 2:42pm

E.g., PostgreSQL:

Limit                        Value
Maximum Database Size        Unlimited
Maximum Table Size           32 TB
Maximum Row Size             1.6 TB
Maximum Field Size           1 GB
Maximum Rows per Table       Unlimited
Maximum Columns per Table    250 - 1600 depending on column types
Maximum Indexes per Table    Unlimited

From personal knowledge, high energy physics databases can grow into the petabyte range. I don’t see 25TB as a technical problem.

A RAID 50 array of ten 10TB disks (about USD3500), a hardware RAID controller, case and dual power supplies would be under USD5000. That would give about 80TB of usable space. I would consider that to be affordable.
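The 80TB figure can be checked with a small sketch. It assumes the ten disks are split into two RAID 5 groups of five that are then striped together, which is one common RAID 50 layout but is not stated above. The function name is made up for this example, and filesystem overhead and decimal versus binary units are ignored.

```python
def raid50_usable_tb(disk_count: int, disk_size_tb: float, groups: int = 2) -> float:
    """Usable capacity of a RAID 50 array: each RAID 5 group gives up one disk to parity."""
    if groups < 2:
        raise ValueError("RAID 50 stripes across at least two RAID 5 groups")
    if disk_count % groups != 0:
        raise ValueError("disks must divide evenly between the groups")
    disks_per_group = disk_count // groups
    if disks_per_group < 3:
        raise ValueError("each RAID 5 group needs at least three disks")
    return groups * (disks_per_group - 1) * disk_size_tb


# Ten 10 TB disks in two groups of five (assumed layout): 2 * (5 - 1) * 10
print(raid50_usable_tb(10, 10))  # 80
```

With more, smaller groups the same disks would yield less usable space, since every extra group costs one more disk of parity.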

i_am_a_burger · February 10, 2018, 7:32am

Thank you. For the most part I just wanted to know whether there are any proposals being discussed.

SK53 · February 10, 2018, 3:31pm

@OverThere: thanks for these.

For comparison, a 10 Gbyte db in the 1990s was a 10 disk array too. Larger databases, e.g., the first iteration of Teradata machines, used Fujitsu Eagles which had 500 MB of storage (or 650 MB in the SuperEagle). So for a Terabyte of storage one needed 4000 disks with an MTBF of perhaps 1000 days at most. In addition the disk form factor was vast, 19 inches across and 10.5 high (see https://en.wikipedia.org/wiki/Fujitsu_Eagle). I think this puts into perspective the challenges of modern massive db systems.

By the early 2000s one could stick 100 TB on a single EMC CLARiiON.

OverThere · February 10, 2018, 5:22pm

OK, I admit that this is off topic.

I started with a 28MB IBM 1301 hard drive attached to an IBM 7044 mainframe computer. The drive was about 1.6m high, 1m thick and 3m long and cost USD115K then, the equivalent of about USD900K now. I have at hand a portable 4TB disk that is 2cm high, 8cm thick and 11cm long and cost about USD100.

I am an American. I used metric as an assistance to the Europeans in the forum.

SK53 · February 11, 2018, 1:26pm

Way off topic, but fun.

I’m a mere newcomer. My first computer purchase was when I bought a SuperEagle in 1987 as disk for a network of Sun 3s. It was around GBP 15k at the time. A year later we were buying Wren Vs for around GBP 500 and really missed an opportunity to build cheap RAID-like devices.

A more serious point, though, is that many modern scaling issues are tiny in comparison with what some people have already experienced. Also, many contemporary issues tend to have analogous problems in the past. I first became aware of this when a colleague who wrote the code for the first CT scanner (on a DG Nova) felt his boss (Godfrey Hounsfield) didn’t understand the problems he had with memory management in (IIRC) 16k. It turned out Hounsfield knew a vast amount about the subject because he’d worked on digital computers (the EMIDEC series) in the '50s which had mercury delay line storage. (As an even more off topic aside: these computers had a wonderful machine level instruction to convert monetary values in old British pennies to pounds, shillings and pence (see http://www.ourcomputerheritage.org/ccs-m1x3.pdf).)
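For anyone who never met pre-decimal currency, the conversion that instruction performed is simple to sketch: 12 old pence to the shilling and 20 shillings to the pound. The code below is only an illustration of the arithmetic, with a made-up function name, and says nothing about how the EMIDEC hardware actually implemented it.

```python
from typing import Tuple


def pence_to_lsd(pence: int) -> Tuple[int, int, int]:
    """Split an amount in old pence into (pounds, shillings, pence)."""
    if pence < 0:
        raise ValueError("amount must be non-negative")
    shillings, d = divmod(pence, 12)   # 12 old pence in a shilling
    pounds, s = divmod(shillings, 20)  # 20 shillings in a pound
    return pounds, s, d


print(pence_to_lsd(1000))  # (4, 3, 4), i.e. 4 pounds, 3 shillings and 4 pence
```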

SimonPoole · February 11, 2018, 1:40pm

No war stories from my side, but just the general observation that not only has technology been able to keep up with our (roughly linear with time) database growth, it has actually become much easier to deal with OSM data in many ways.

You can easily import and run a complete planet on rather modest hardware (perhaps $2k) without having to wait a week or two for the import to complete as you would have 5 years back.

muralito · February 12, 2018, 4:22am

Wow! Living history… thanks for sharing.
Also off topic, but I need to ask…

Did you meet von Neumann?