The planet date, 06-08-18, seems pretty old to me, and the file is certainly not compatible with the current API version 0.6. I can imagine that there were some export problems back then. AFAIK, every character <=32 is currently encoded to comply with XML standards.
Well, I think it’s an encoding issue, in particular in how the bz2 decoder and expat parse this file together. I’ve made a workaround for this line, but it then stopped at another line further on:
I’ve had similar problems with this forum (which used to be non-UTF-8 compatible) and Cyrillic characters. Some look like plain ASCII but aren’t. I’m not an expat or Python expert, but have you had a look at this: http://docs.python.org/library/pyexpat.html
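In case it helps to narrow down where the parser gives up, here is a minimal sketch of asking pyexpat for the position of the first error. The function name find_first_error is made up for illustration, and it assumes the suspect data fits in memory, so you would feed it a small extract rather than the whole planet.

```python
import xml.parsers.expat

def find_first_error(data):
    """Return (line, column, message) for the first parse error, or None if data parses."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        # Second argument True marks this as the final (and only) chunk.
        parser.Parse(data, True)
    except xml.parsers.expat.ExpatError as err:
        # ExpatError carries the error code plus the line and column where parsing stopped.
        return err.lineno, err.offset, xml.parsers.expat.ErrorString(err.code)
    return None
```

The reported line is 1-based, so it can be compared directly with what a text editor shows for the extract.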
This planet dump is very old. At that time there were some issues with the UTF-8 character encoding, which all text should have been in. Neither the editors nor the OSM server API checked that only valid UTF-8 text was uploaded, so invalid UTF-8 byte sequences ended up in the planet.osm XML file(s). I fear the only way around this is some pre-processing (either manual or automatic) that checks for and eliminates these encoding errors before the file is read by a standards-conformant XML parser.
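A rough sketch of what such automatic pre-processing could look like in Python, under a few assumptions: the dump is a .bz2 file, invalid byte sequences may simply be replaced with U+FFFD, and control characters that XML 1.0 forbids may be dropped. The names clean_chunks, read_bz2 and parse_planet are made up for this example, and whether replacing or dropping is acceptable depends on what you need from the data.

```python
import bz2
import codecs
import re
import xml.parsers.expat

# Characters XML 1.0 does not allow: control characters other than
# tab, LF and CR, plus U+FFFE and U+FFFF.
_ILLEGAL_XML = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\ufffe\uffff]")

def clean_chunks(raw_chunks):
    """Yield valid UTF-8 bytes from an iterable of raw byte chunks."""
    # The incremental decoder keeps a multi-byte sequence that is split
    # across two chunks intact; invalid bytes become U+FFFD.
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    for chunk in raw_chunks:
        text = decoder.decode(chunk)
        if text:
            yield _ILLEGAL_XML.sub("", text).encode("utf-8")
    # Flush whatever is still buffered (e.g. a truncated sequence at EOF).
    tail = decoder.decode(b"", final=True)
    if tail:
        yield _ILLEGAL_XML.sub("", tail).encode("utf-8")

def read_bz2(path, chunk_size=1 << 20):
    """Yield decompressed chunks so the whole planet never sits in memory."""
    with bz2.open(path, "rb") as fh:
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:
                break
            yield chunk

def parse_planet(path, start_element):
    """Feed the cleaned stream to expat, calling start_element(name, attrs)."""
    parser = xml.parsers.expat.ParserCreate("UTF-8")
    parser.StartElementHandler = start_element
    for cleaned in clean_chunks(read_bz2(path)):
        parser.Parse(cleaned, False)
    parser.Parse(b"", True)  # signal end of input
```

Doing the cleaning on the decompressed stream, chunk by chunk, means nothing has to be written back to disk. The cost is that any damaged text ends up with replacement characters in it, so names that were broken in the dump stay broken.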