(gelöst) PostGIS: Wie Gebäude in einem Stadtteil selektieren?

toc-rox · June 17, 2022, 3:49pm

Ich möchte via PostGIS die Gebäude in einem Stadtteil selektieren (DB: OSM-Standard).

Dies selektiert die Relation / den Stadtteil “Sentruper Höhe”:
SELECT way FROM planet_osm_polygon WHERE osm_id = -3858975

Dies selektiert viele (jedoch nicht alle) Gebäude:
SELECT way FROM planet_osm_polygon WHERE (building IS NOT NULL)

Wie vereinigt man die beiden Queries jetzt zu einer performanten Abfrage?

dooley · June 17, 2022, 6:37pm

SELECT b.way 
FROM planet_osm_polygon p
JOIN planet_osm_polygon b ON ST_Contains(p.way, b.way)
WHERE p.osm_id = -3858975
AND b.building IS NOT NULL

Statt ST_Contains kann man ST_Intersects verwenden, wenn man die buildings mitnehmen möchte, die “auf der Grenze” liegen.

toc-rox · June 17, 2022, 7:21pm

Danke erstmal für die Select-Anweisung. Ich hatte zwischenzeitlich auch eine Lösung finden können:

SELECT way FROM planet_osm_polygon WHERE (building IS NOT NULL) 
AND ST_Within(way, (SELECT way FROM planet_osm_polygon WHERE osm_id = -3858975))

Das Problem bei beiden Lösungen ist die Performanz (dooley = 1201 sec, toc-rox = 1173). Es stellt sich die Frage, ob das irgendwie noch besser geht.

dooley · June 17, 2022, 8:33pm

Hast du den kompletten planet in der Datenbank? Oder einen Extrakt (welchen)? Importierst du immer komplett oder arbeitest du mit Changesets?

Auf meinem System - Ryzen 9 3900X, 126 GM RAM, NVME 1 TB, PostGres 14/PostGIS 3.3, Geofabrik-DACH, mit Changeset-Verarbeitung und ab und zu ein vacuum analyze - laufen beide Queries deutlich schneller:


osm_dach=# \timing on
Timing is on.
osm_dach=# SELECT way FROM dach_polygon WHERE (building IS NOT NULL) 
AND ST_Within(way, (SELECT way FROM dach_polygon WHERE osm_id = -3858975));
Time: 76.005 ms
osm_dach=# SELECT b.way 
osm_dach-# FROM dach_polygon p
osm_dach-# JOIN dach_polygon b ON ST_Intersects(p.way, b.way)
osm_dach-# WHERE p.osm_id = -3858975
osm_dach-# AND b.building IS NOT NULL;
Time: 60.505 ms

Was sagt denn ein explain analyze mit deinen Queries?

toc-rox · June 18, 2022, 5:52am

Eine Umkreisselektion (2500 Meter) ist im Vergleich zur Flächenselektion um Größenordungen performanter. Die nachfolgende Operation läuft, einschließlich meines weiteren Kontext, circa 3 Sekunden.

SELECT way FROM planet_osm_polygon WHERE (building IS NOT NULL)
AND ST_DWITHIN(way, ST_TRANSFORM(ST_SETSRID(ST_MAKEPOINT(7.59411,51.95742),4326),3857), 2500)

Zu deinen Fragen: Planet = ja, Komplettimport = ja, keine Updates, Systemunterschied = Festplatten anstatt SSD

Die Analysedaten erstelle ich noch. Vielleicht kommen wir damit weiter. Irgendetwas paßt bei der Flächenoperation noch nicht.

dooley · June 24, 2022, 5:22am

Problem gelöst?

toc-rox · June 26, 2022, 3:45pm

ST_Intersects() wäre für mich auch passend. Laut Dokumentation kann ST_Intersects() einen spatialen Index verwenden.

SELECT b.way FROM planet_osm_polygon p JOIN planet_osm_polygon b ON ST_Intersects(p.way, b.way) WHERE p.osm_id = -3858975 AND b.building IS NOT NULL

Jetzt fehlt nur noch der passende Index oder die passenden Indizes. Angelegt habe ich folgendes:

CREATE INDEX planet_osm_polygon_building ON planet_osm_polygon USING GIST (way) WHERE building IS NOT NULL

Dieser Index hat aber hinsichtlich der Performanz nichts gebracht.

Anders dieser Index:

CREATE INDEX planet_osm_polygon_3858975 ON planet_osm_polygon USING GIST (way) WHERE osm_id = -3858975

Jetzt läuft die Abfrage bei mir nur noch 3 Sekunden (!) statt 1200 Sekunden.

Möglicherweise ist aber nur der zweite Index (auf die osm_id -3858975) erforderlich.

Frage: Könntest du das bei dir auch mal probieren? Reicht nur der zweite Index?

dooley · June 26, 2022, 5:55pm

Ein unspezifischer Index auf “building” wird in der Tat nicht bringen. Zumindest in DACH ist der weit überwiegende Anteil ein building. Dein Index nutzt nur, wenn du auf spezifische building-Values einschränken möchtest.

SELECT count(*), 'building' FROM dach_polygon WHERE building IS NOT NULL
union all
SELECT count(*), 'other' FROM dach_polygon WHERE building IS NULL;

  count   | ?column? 
----------+----------
  8996857 | other
 41972293 | building

Ich frage mich gerade, ob bei dir überhaupt die osm2pgsql-“Standardindices” vorhanden sind? Also auf osm_id und auf die Geometriespalte.
Das ist alles stochern im Nebel - poste doch bitte mal die Ausgaben von

SELECT tablename, indexname, indexdef
FROM pg_indexes
WHERE schemaname = 'public'
AND tablename = 'planet_osm_polygon';

und

EXPLAIN ANALYZE
SELECT b.way 
FROM planet_osm_polygon p 
JOIN planet_osm_polygon b ON ST_Intersects(p.way, b.way) 
WHERE p.osm_id = -3858975 
AND b.building IS NOT NULL;

toc-rox · June 27, 2022, 4:23am

Wie ist denn bei dir Performanz wenn du ST_Intersects() nur mit den Standard-Indizes verwendest? Waren das bei dir mit ST_Contains() nicht 70 Sekunden? Irgendwie ist diese Information verschwunden.

Auf jeden Fall fehlen Indizes, soviel scheint ja schon mal klar zu sein. Ich importiere den Planet ohne die Update-Option. Vermutung: In diesem Zusammenhang werden Indizes angelegt, deren Fehlen sich bei der Abfrage negativ auswirkt. Der Vergleich wird es hoffentlich zeigen …

$ psql -U postgres -d osmcarto4 -c "SELECT tablename, indexname, indexdef FROM pg_indexes WHERE schemaname = 'public' AND tablename = 'planet_osm_polygon';" >out.txt
$ cat out.txt
     tablename      |            indexname            |                                                                                                                                            indexdef                                                                                                                                            
--------------------+---------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 planet_osm_polygon | planet_osm_polygon_way_idx      | CREATE INDEX planet_osm_polygon_way_idx ON public.planet_osm_polygon USING gist (way) WITH (fillfactor='100')
 planet_osm_polygon | planet_osm_polygon_admin        | CREATE INDEX planet_osm_polygon_admin ON public.planet_osm_polygon USING gist (st_pointonsurface(way)) WHERE ((name IS NOT NULL) AND (boundary = 'administrative'::text) AND (admin_level = ANY (ARRAY['0'::text, '1'::text, '2'::text, '3'::text, '4'::text])))
 planet_osm_polygon | planet_osm_polygon_military     | CREATE INDEX planet_osm_polygon_military ON public.planet_osm_polygon USING gist (way) WHERE (((landuse = 'military'::text) OR (military = 'danger_area'::text)) AND (building IS NULL))
 planet_osm_polygon | planet_osm_polygon_name         | CREATE INDEX planet_osm_polygon_name ON public.planet_osm_polygon USING gist (st_pointonsurface(way)) WHERE (name IS NOT NULL)
 planet_osm_polygon | planet_osm_polygon_nobuilding   | CREATE INDEX planet_osm_polygon_nobuilding ON public.planet_osm_polygon USING gist (way) WHERE (building IS NULL)
 planet_osm_polygon | planet_osm_polygon_water        | CREATE INDEX planet_osm_polygon_water ON public.planet_osm_polygon USING gist (way) WHERE ((waterway = ANY (ARRAY['dock'::text, 'riverbank'::text, 'canal'::text])) OR (landuse = ANY (ARRAY['reservoir'::text, 'basin'::text])) OR ("natural" = ANY (ARRAY['water'::text, 'glacier'::text])))
 planet_osm_polygon | planet_osm_polygon_way_area_z10 | CREATE INDEX planet_osm_polygon_way_area_z10 ON public.planet_osm_polygon USING gist (way) WHERE (way_area > (23300)::double precision)
 planet_osm_polygon | planet_osm_polygon_way_area_z6  | CREATE INDEX planet_osm_polygon_way_area_z6 ON public.planet_osm_polygon USING gist (way) WHERE (way_area > (5980000)::double precision)
 planet_osm_polygon | planet_osm_polygon_building     | CREATE INDEX planet_osm_polygon_building ON public.planet_osm_polygon USING gist (way) WHERE (building IS NOT NULL)
 planet_osm_polygon | planet_osm_polygon_3858975      | CREATE INDEX planet_osm_polygon_3858975 ON public.planet_osm_polygon USING gist (way) WHERE (osm_id = '-3858975'::integer)
(10 rows)


$ psql -U postgres -d osmcarto4 -c "EXPLAIN ANALYZE SELECT b.way FROM planet_osm_polygon p JOIN planet_osm_polygon b ON ST_Intersects(p.way, b.way) WHERE p.osm_id = -3858975 AND b.building IS NOT NULL;" >out2.txt
$ cat out2.txt
                                                                               QUERY PLAN                                                                               
------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.68..1485536.99 rows=1080197 width=220) (actual time=425.459..445.050 rows=2812 loops=1)
   ->  Index Scan using planet_osm_polygon_3858975 on planet_osm_polygon p  (cost=0.12..8.14 rows=1 width=220) (actual time=0.097..0.099 rows=1 loops=1)
   ->  Index Scan using planet_osm_polygon_building on planet_osm_polygon b  (cost=0.55..1485017.47 rows=51138 width=220) (actual time=1.210..20.664 rows=2812 loops=1)
         Index Cond: (way && p.way)
         Filter: st_intersects(p.way, way)
         Rows Removed by Filter: 6366
 Planning Time: 8.496 ms
 JIT:
   Functions: 7
   Options: Inlining true, Optimization true, Expressions true, Deforming true
   Timing: Generation 1.518 ms, Inlining 38.109 ms, Optimization 246.647 ms, Emission 139.067 ms, Total 425.341 ms
 Execution Time: 473.589 ms
(12 rows)

dooley · June 27, 2022, 6:29am

Nein, 70 Sekunden waren es nie - immer Millisekunden.

Ich habe eigentlich nur 2 für so eine Query relevante Indizes, osm_id und way:

  tablename   |        indexname         |                                                                                           indexdef                                                                                            
--------------+--------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 dach_polygon | dach_polygon_landuse_idx | CREATE INDEX dach_polygon_landuse_idx ON public.dach_polygon USING btree (landuse) WHERE (landuse = ANY (ARRAY['residential'::text, 'industrial'::text, 'commercial'::text, 'retail'::text]))
 dach_polygon | dach_polygon_osm_id_idx  | CREATE INDEX dach_polygon_osm_id_idx ON public.dach_polygon USING btree (osm_id)
 dach_polygon | dach_polygon_tags_idx    | CREATE INDEX dach_polygon_tags_idx ON public.dach_polygon USING gin (tags)
 dach_polygon | dach_polygon_way_idx     | CREATE INDEX dach_polygon_way_idx ON public.dach_polygon USING gist (way)
(4 rows)

                                                                       QUERY PLAN                                                                        
---------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.98..133212.43 rows=18682 width=194) (actual time=24.547..56.918 rows=2818 loops=1)
   ->  Index Scan using dach_polygon_osm_id_idx on dach_polygon p  (cost=0.56..2.78 rows=1 width=194) (actual time=0.087..0.089 rows=1 loops=1)
         Index Cond: (osm_id = '-3858975'::integer)
   ->  Index Scan using dach_polygon_way_idx on dach_polygon b  (cost=0.42..133167.52 rows=4213 width=194) (actual time=0.377..32.565 rows=2818 loops=1)
         Index Cond: (way && p.way)
         Filter: ((building IS NOT NULL) AND st_intersects(p.way, way))
         Rows Removed by Filter: 8590
 Planning Time: 8.128 ms
 JIT:
   Functions: 9
   Options: Inlining false, Optimization false, Expressions true, Deforming true
   Timing: Generation 3.841 ms, Inlining 0.000 ms, Optimization 2.639 ms, Emission 21.133 ms, Total 27.612 ms
 Execution Time: 92.880 ms
(13 rows)

Vorschlag: Lösche den Index planet_osm_polygon_3858975 und lege den neu an, aber ohne Einschränkung auf diese eine osm_id. Du möchtest ja wahrscheinlich nicht nur diese eine Abfrage machen.

Clustern nach “way”, wenn noch nicht ausgeführt, wird ziemlich sicher die Abfragezeit verringern, je langsamer der Storage ist, um so mehr. Bitte mal durchlesen vor Ausführung, wegen exklusivem Zugriff (wenn andere Jobs währenddessen laufen müssen). Clustern dauert ziemlich, ist aber wegen fehlender Tabellen-Updates bei dir ja nur ein Onetime-Job.

CLUSTER planet_osm_polygon USING planet_osm_polygon_way_idx;

Ansonsten sehe ich jetzt keine Auffälligkeiten, Indizes werden verwendet, keine Mehrfach-Loops, der Queryplan sieht gut aus, passt alles soweit. Es kann eigentlich nur an den langsamen HDDs liegen.

toc-rox · June 27, 2022, 12:54pm

Danke für die Vorschläge. Vermutlich wird der Index auf alle Polygon-Ways das Nutzen-Aufwand-Verhältnis sprengen. Der Building-Index alleine belegt schon 26 GB. Da werde ich vermutlich in geeigneter Art und Weise einschränken müssen.

dooley · June 27, 2022, 5:44pm

Irgendwie stehe ich jetzt auf dem Schlauch? Du hast doch schon einen Index über ALLE Geometrien:

 planet_osm_polygon | planet_osm_polygon_way_idx      | CREATE INDEX planet_osm_polygon_way_idx ON public.planet_osm_polygon USING gist (way) WITH (fillfactor='100')

Was du NICHT hast, ist ein Index über alle osm_id. Dieser wäre aber extrem nützlich, wenn du Abfragen mit Bedingung nach osm_id hast (WHERE osm_id = xxx). Ein Index über die osm_id wird auch nicht sooo groß (es wird ja bigint statt polygon indiziert).

Ohne osm_id-Index wirst du entweder für jede zukünftig abzufragende Gemeinde einen Index anlegen müssen (wie du es schon für -3858975 gemacht hast) oder jedes mal einen sequentiellen Scan durchführen (laaaaaangsam). Beides möchtest du nicht haben.

Schau die die Unterschiede genauer an, USING gist(way) vs. USING btree(osm_id):


CREATE INDEX planet_osm_polygon_3858975 ON public.planet_osm_polygon USING gist (way) WHERE (osm_id = '-3858975'::integer)
vs.
CREATE INDEX planet_osm_polygon_osm_id_idx ON public.planet_osm_polygon USING btree (osm_id)

Wenn du planet_osm_polygon_osm_id_idx angelegt hast, kannst du planet_osm_polygon_3858975 löschen.
Danach clusterst du, um die Geometrien räumlich auf der Platte zusammenzulegen, damit werden die Plattenzugriffe insbesondere bei HDD schneller, weil halt weniger Kopfbewegungen nötig sind.

CLUSTER planet_osm_polygon USING planet_osm_polygon_way_idx;

Nicht vergessen, nach dem Clustern

ANALYZE VERBOSE public.dach_polygon;

auszuführen.

Wenn du den 26GB-Index planet_osm_polygon_nobuilding unbedingt behalten möchtest, kann du auch nach diesem statt planet_osm_polygon_way_idx clustern.

toc-rox · June 28, 2022, 4:39am

Da habe ich deinen Vorschlag wohl missverstanden. Du meintest nicht

CREATE INDEX planet_osm_polygon_3858975 ON planet_osm_polygon USING GIST (way) WHERE osm_id = -3858975

sondern das

CREATE INDEX planet_osm_polygon_idx_osm_id ON planet_osm_polygon USING btree (osm_id)

dooley · June 28, 2022, 5:14am

Genau.

toc-rox · July 8, 2022, 8:31am

Nach einem planmäßigen Planet-Neuimport habe ich jetzt zusätzlich folgenden Index definiert:

CREATE INDEX planet_osm_polygon_idx_osm_id ON planet_osm_polygon USING btree (osm_id)

Der Index belegt 13 GB und war nach 26 Minuten erstellt. Damit läuft jetzt diese Abfrage sehr performant (<3 Sekunden):

SELECT b.way FROM planet_osm_polygon p JOIN planet_osm_polygon b ON ST_Intersects(p.way, b.way) WHERE p.osm_id = -3858975 AND b.building IS NOT NULL

Vielleicht auch interessant:

SELECT relname as table_name, pg_size_pretty(pg_total_relation_size(relid)) As "Total Size", pg_size_pretty(pg_indexes_size(relid)) as "Index Size", pg_size_pretty(pg_relation_size(relid)) as "Actual Size" FROM pg_catalog.pg_statio_user_tables ORDER BY pg_total_relation_size(relid) DESC
 
            table_name              | Total Size | Index Size | Actual Size 
-------------------------------------+------------+------------+-------------
 planet_osm_polygon                  | 285 GB     | 110 GB     | 157 GB
 planet_osm_line                     | 137 GB     | 41 GB      | 84 GB
 planet_osm_point                    | 33 GB      | 8719 MB    | 25 GB
 planet_osm_roads                    | 15 GB      | 2143 MB    | 8795 MB
 water_polygons                      | 1107 MB    | 672 kB     | 8368 kB
 icesheet_outlines                   | 84 MB      | 3728 kB    | 69 MB
 icesheet_polygons                   | 72 MB      | 792 kB     | 13 MB
 simplified_water_polygons           | 34 MB      | 624 kB     | 10 MB
 spatial_ref_sys                     | 7272 kB    | 304 kB     | 6936 kB
 ne_110m_admin_0_boundary_lines_land | 160 kB     | 24 kB      | 96 kB
 external_data                       | 32 kB      | 16 kB      | 8192 bytes
(11 rows)

Danke für deine Unterstützung.

dooley · July 8, 2022, 10:42am

Danke für deine Rückmeldung und freut mich, dass deine Queries jetzt um einiges schneller sind.