Parsing OSM data in Python: issue with broken characters

Has anyone parsed OSM files in Python? I am trying to get a script to work with a BZ2-compressed file, but it fails at some point:

from bz2 import BZ2File
from xml.parsers import expat

p = expat.ParserCreate()
p.returns_unicode = True

earth = BZ2File('planet-060818.osm.bz2', 'r')
p.ParseFile(earth)

Outputs: xml.parsers.expat.ExpatError: not well-formed (invalid token): line 610127, column 37

Here’s the line:
(The forum replaced the bad characters with “?”. In hex, the character codes are \xd1.)

Is the problem in the file itself, which contains bad characters, or is something wrong with my bz2 or expat?

The planet date, 06-08-18, seems pretty old to me, and the dump is certainly not compatible with the current API version 0.6. I can imagine that there were some export problems back then. AFAIK, every character <=32 is currently encoded to comply with the XML standard.

Well, I think it’s an encoding issue, in particular in how the bz2 decoder and expat handle this file together. I made a workaround for this line, but parsing then stopped at another line:

<tag k="name" v="Cin\x8e? Rex" />
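
One way to tell whether the file itself is at fault, rather than bz2 or expat, is to check the decompressed bytes directly without involving the XML parser at all. Below is a minimal sketch for Python 3, assuming the dump is supposed to be UTF-8; the function name find_invalid_utf8 and its limit parameter are made up for illustration.

```python
import bz2

def find_invalid_utf8(path, limit=10):
    """Return up to `limit` (line_number, raw_line) pairs that are not valid UTF-8."""
    bad = []
    # Read raw bytes; no decoding happens until we try it ourselves below.
    with bz2.open(path, 'rb') as stream:
        for lineno, raw in enumerate(stream, start=1):
            try:
                raw.decode('utf-8')
            except UnicodeDecodeError:
                bad.append((lineno, raw))
                if len(bad) >= limit:
                    break
    return bad
```

If this reports the same line numbers that expat complains about, the invalid bytes are in the dump itself and the bz2 and expat modules are behaving correctly.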

I’ve had similar problems with this forum (which used to be non-UTF8 compatible) and Cyrillic characters. Some look like plain ASCII but aren’t. I’m not an expat or Python expert, but have you had a look at this:
http://docs.python.org/library/pyexpat.html

This planet dump is very old. At that time there were some issues with the UTF-8 character encoding, which all text should have been in. Back then, not every editor checked that only valid UTF-8 text was uploaded, and neither did the OSM server API, so these invalid UTF-8 byte sequences ended up in the planet.osm XML file(s). I fear the only way around it is some pre-processing (manual or automatic) that checks for and eliminates these encoding errors before the file is read by a standards-conforming XML parser.

Micha H.
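
For anyone who still needs to read such an old dump, the automatic pre-processing suggested above could look roughly like the sketch below (Python 3). It is only one possible approach, not a tested recipe for the 2006 planet file: it decodes the decompressed stream chunk by chunk, replaces every invalid UTF-8 byte with U+FFFD, replaces the control characters that XML 1.0 forbids, and feeds the cleaned text to expat. The names parse_lenient, start_element and chunk_size are hypothetical.

```python
import bz2
import codecs
from xml.parsers import expat

# Code points XML 1.0 forbids: C0 controls other than tab, LF and CR,
# plus U+FFFE and U+FFFF. Each one is mapped to U+FFFD.
_ILLEGAL = {c: 0xFFFD for c in range(0x20) if c not in (0x09, 0x0A, 0x0D)}
_ILLEGAL.update({0xFFFE: 0xFFFD, 0xFFFF: 0xFFFD})

def parse_lenient(path, start_element, chunk_size=1 << 20):
    """Parse a .bz2 XML file, replacing invalid bytes instead of failing."""
    parser = expat.ParserCreate('utf-8')
    parser.StartElementHandler = start_element
    # The incremental decoder copes with multi-byte sequences that are
    # split across chunk boundaries.
    decoder = codecs.getincrementaldecoder('utf-8')(errors='replace')
    with bz2.open(path, 'rb') as stream:
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:
                break
            text = decoder.decode(chunk).translate(_ILLEGAL)
            if text:
                parser.Parse(text.encode('utf-8'), False)
        # Flush whatever the decoder still holds and finish the document.
        tail = decoder.decode(b'', final=True).translate(_ILLEGAL)
        parser.Parse(tail.encode('utf-8'), True)
```

A call such as parse_lenient('planet-060818.osm.bz2', handler) would invoke handler(name, attrs) for every element. Note that the original bytes are lost: a name like Cin\x8e Rex comes out with a replacement character, so this only helps when approximate tag values are acceptable.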

Ok, I switched to newer dumps, and they were parsed without issues.