Debugging UTF-8 errors with OSM .pbf files

… so I was running a database update on a raster tile server that uses and osm2pgsql database, and I got this error:

Processing: Node(277964k 263.0k/s) Way(41046k 9.61k/s) Relation(395040 1258.1/s)
2026-03-28 17:08:08  ERROR: DB copy thread failed: Ending COPY mode for 'planet_osm_roads' failed: ERROR:  invalid byte sequence for encoding "UTF8": 0xc3 0x2e
CONTEXT:  COPY planet_osm_roads, line 2459
.
Database gis3 load Error

The last part of the error is from my script; the first part from osm2pgsql (version 2.1.1, as shipped with Debian 13).

Now 0xc3 0x2e is indeed an invalid UTF-8 sequence. It’s a à followed by a ., and the à character perhaps suggests that someone has tried to add old Windows-format data to OSM as if it was UTF-8 (like here, although I don’t think that those are generating the error for me). The error happens when osm2pgsql is processing relations, so I’m guessing it’s a relation. The error happens with planet_osm_roads, so it’s something written to that table (such as a name).

The data is OSM via Geofabrik, specifically these files and dates:

guernsey-jersey_2026-03-27T21:21:27Z.osm.pbf
united-kingdom_2026-03-27T21:21:27Z.osm.pbf
isle-of-man_2026-03-27T21:21:27Z.osm.pbf
ireland-and-northern-ireland_2026-03-27T21:21:27Z.osm.pbf

Test loads of Ireland, IoM, Guernsey/Jersey, Wales and Scotland work without errors, suggesting that the issue is with English data (but not Kent, Yorkshire or Rutland which have been tested separately). The actual .pbf data sent to osm2pgsql has some changes from the input files (mostly names), and the output from osm2pgsql is obviously different again.

  • How do I go about finding what data in OSM (and/or transformation that I am doing) is causing the error? Using osmium cat I can output in .opl format and search for anything that grep can find, but I’ve no idea what an invalid UTF-8 sequence would be output as, if anything.

  • Is it possible to get more detail (even just written to STDOUT) of what osm2pgsql is doing at the time the error occurs?

maybe the : --log-sql-data option :

1 Like

That worked - I was able to find out what the data immediately before the one that failed was, and hence work out what the failing data was and what was special about it.

1 Like

I’m curious what exactly you did to get invalid UTF-8. While the ref tags mentioned in the other thread are clearly garbage, they are still valid UTF-8 when I download them from the OSM API. And the data in the extracts looks okay as well. Do you do any kind of special preprocessing?

Yes - and that was the problem. The relation it fell over on was this one. That’s in west Wales, and when I “tested” Wales above I only tested the English name not the Welsh one, and it was the Welsh name that triggered the error. The name Llwybr ‘Capability’ Brown yn Ninefwr, Dinefwr is longer, and those fancy quotes are I suspect UTF-8. That only became a problem when I used the code now commented out here to try and truncate it. The problem with that is that that lua is a byte offset not a character offset. The truncation was after the 0xc3 (in some ASCII dialects Ã) and the next byte after was . (0x2e) and those two bytes together aren’t a valid UTF-8 character.

There are some suggestions here for doing it properly, and I’ll look at that after I’ve had a look at some of the dafter long names currently in the database:

  -1703772 | ldpnwn   | ACW Circular Walk 13C (an even shorter [4 mile total] short-cut route than route A, devised for use on the WWF 1-Jun-2006 event)
  -1702809 | ldpnwn   | ACW Circular Walk 12C (an even shorter [4 mile total] short-cut route than route A, devised for use on the WWF 7-Jun-2005 event)
 -19745897 | ldpnwn   | Dunstable Downs Five Knolls Wildlife and Heritage All Access Route, Dunstable Downs and Whipsnade Estate
 -19745897 | ldpnwn   | Dunstable Downs Five Knolls Wildlife and Heritage All Access Route, Dunstable Downs and Whipsnade Estate
 -19745897 | ldpnwn   | Dunstable Downs Five Knolls Wildlife and Heritage All Access Route, Dunstable Downs and Whipsnade Estate
  -2975858 | ldpmtb   | Cliddesden Duck Pond (Cycling Discoveries for Mountain Bikes in and Around North Hampshire) (m)
  -2975858 | ldpmtb   | Cliddesden Duck Pond (Cycling Discoveries for Mountain Bikes in and Around North Hampshire) (m)
  -2975858 | ldpmtb   | Cliddesden Duck Pond (Cycling Discoveries for Mountain Bikes in and Around North Hampshire) (m)
  -4869659 | ldpmtb   | Watership Down (Cycling Discoveries for Mountain Bikes in and Around North Hampshire) (m)
  -4869659 | ldpmtb   | Watership Down (Cycling Discoveries for Mountain Bikes in and Around North Hampshire) (m)
  -4869659 | ldpmtb   | Watership Down (Cycling Discoveries for Mountain Bikes in and Around North Hampshire) (m)
  -9853050 | ldpnwn   | Circular Meadow Mobility Trail & Linear Duncombe Terrace Mobility Trail, Ashridge Estate
  -9853050 | ldpnwn   | Circular Meadow Mobility Trail & Linear Duncombe Terrace Mobility Trail, Ashridge Estate
  -9853050 | ldpnwn   | Circular Meadow Mobility Trail & Linear Duncombe Terrace Mobility Trail, Ashridge Estate
 -10411148 | ldpnwn   | Pennine Way (Tan Hill to Yeltholm) - deviation to The Cheviot via Scotsman's Cairn
 -19611748 | ldpnwn   | Route to the Royal Burial Ground and Viewing Tower (Purple Route), Sutton Hoo

For completeness, the fix is here.

if ( utf8.len( passedt.name ) > 28 ) then
    start33 = utf8.offset( passedt.name, 29 )
    passedt.name = string.sub( passedt.name, 1, start33-1 ) .. "..."
end

Basically, find the byte number of the start of the 29th UTF-8 character, and truncate up the character immediately before it - the end of the previous UTF-8 character.

The “utf8” code is dependent on lua 5.3, so that’s now a prerequisite and (for completeness) Tilemaker (used for the vector tiles that match these raster ones) is built against that lua library.

1 Like