Content on test import site is about to be reset.
I will shortly be running another test import attempt, this time with the first cut at user account merging.
The account merging test import has failed. Looking in @Zverik's direction.
The merging code needs some additional edge case handling. Thank you to @TomH for helping me out with his Ruby wizardry.
Yeah, when I first registered, I logged in as “zverik” lowercase, which was another user. Then after half a year I logged in again as “Zverik” uppercase. I believe I’m not the only one with that issue.
Although I won’t complain if the older messages are attributed to another user.
Not to worry, we’re working out all the edge cases. Your user accounts were just the first edge case to hit the test importer.
There are some interesting “conflicts” in the old forum (fluxbb) database. E.g.: Multiple user accounts with the same email address and different OSM IDs. (Example Scenario: User 1 logged into forum, User 1 email changed on osm.org, User 2 is created on osm.org with old User 1 email, User 2 logs into forum.)
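As an illustration of the conflict described above, here is a minimal sketch that flags email addresses attached to more than one OSM ID. The `ForumUser` record and its field names are hypothetical; the real fluxbb schema and the importer's actual merge rules are not shown in this thread.

```python
from collections import defaultdict
from typing import NamedTuple


class ForumUser(NamedTuple):
    # Hypothetical record shape, not the real fluxbb table layout.
    forum_id: int
    username: str
    email: str
    osm_id: int


def find_email_conflicts(users):
    """Return {email: [users]} for emails tied to more than one OSM ID."""
    by_email = defaultdict(list)
    for user in users:
        # Compare emails case-insensitively and ignore stray whitespace.
        by_email[user.email.strip().lower()].append(user)
    return {
        email: group
        for email, group in by_email.items()
        if len({u.osm_id for u in group}) > 1
    }
```

Accounts flagged this way probably need a rule of their own (or manual review), since the email alone cannot say which OSM account owns the old posts.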
The test import account merging now appears to be working well.
The logic is approximately:
A test import to https://forum-import-test.openstreetmap.org/ is running now and is expected to be completed in around 12 hours. Update: Full import done. The import is using the latest forum.osm.org and community.osm.org snapshots, from 2nd Feb 2023.
NOTE https://forum-import-test.openstreetmap.org/ currently does NOT allow login. This will be enabled later during testing.
Work still pending:
Thanks for all the work you are doing here, mate! One can’t say that often enough.
@Zverik Would you mind asking the Russian speaking community to review the test import site?
Particularly: users: Russia - Test Import Site - OpenStreetMap Community
Wow! There is a single thread with 14713 posts in it!?!? Как обозначать? - #51 by Ezhick - users: Russia - Test Import Site - OpenStreetMap Community
User accounts have been merged and created per the rules described here.
Content formatting will not be perfect, as a perfect conversion from fluxbb bbcode → discourse’s flavour of markdown is impossible, but the formatting now looks “good enough” to me.
The old forum.osm.org has a few old “mojibake” posts. I am unsure what happened here; it is likely a mysql unicode issue. These posts are broken in the fluxbb database and it is unlikely they could be easily recovered.
Note the test site does NOT currently allow login.
Do you know the original intended encoding? If so, it’s sometimes possible to recover the text from mojibake, though not always. If it’s only a few posts, then maybe their authors could patch them up after the fact if there’s any confusion.
Here is an example post: Как обозначать? (Page 5) / users: Russia / OpenStreetMap Forum
I don’t think the text can be recovered, but happy for others to try.
It is already destroyed in the HTML output of the old forum, i.e. all those ? chars are real 0x3f ASCII chars, e.g.:
% curl -s 'https://forum.openstreetmap.org/viewtopic.php?pid=33413#p33413' | fgrep -A15 #104 | tail -n1 | hd
00000000 09 09 09 09 09 09 3c 64 69 76 20 63 6c 61 73 73 |......<div class|
00000010 3d 22 71 75 6f 74 65 62 6f 78 22 3e 3c 63 69 74 |="quotebox"><cit|
00000020 65 3e 43 61 6c 69 62 72 61 74 6f 72 20 77 72 6f |e>Calibrator wro|
00000030 74 65 3a 3c 2f 63 69 74 65 3e 3c 62 6c 6f 63 6b |te:</cite><block|
00000040 71 75 6f 74 65 3e 3c 64 69 76 3e 3c 70 3e 3f 3f |quote><div><p>??|
00000050 3f 3f 3f 3f 3f 3f 3f 20 34 20 3f 3f 3f 3f 3f 3f |??????? 4 ??????|
00000060 3f 20 3f 3f 3f 3f 3f 3f 3f 20 3f 20 3f 3f 3f 3f |? ??????? ? ????|
00000070 3f 3f 3f 20 28 6c 61 6e 64 75 73 65 3d 72 65 73 |??? (landuse=res|
00000080 69 64 65 6e 74 61 29 6c 3c 2f 70 3e 3c 2f 64 69 |identa)l</p></di|
However, in my experience mysql itself often keeps enough information to reconstruct such messages (usually by dumping to a file, editing the file to change CHARSET=xxxx where needed (utf8? utf8mb4?), and then importing again). It is often the connecting client that mangles it: if one sets SET NAMES to the same thing as the database/table/column uses, one can usually get the data out in raw form, which can then be converted. If the client does not issue the correct commands on startup, however, it will get gibberish or ?.
I’ve had some experience with cp1250 and utf8 being stuffed into a database marked as charset=latin1 (which showed as all kinds of corruption in the app/web) and recovering it without too much trouble; but I have no experience with Russian charsets (was it UTF8 or KOI8-R or something else initially?), and I have a feeling the multi-charset nature of the database might make it more problematic.
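To make the difference between the two situations concrete, here is a small sketch. It assumes the original text was UTF-8 and uses Python's `latin-1` codec as a stand-in for the mislabelled charset (MySQL's latin1 is closer to cp1252, so real data may need a different codec). The first case is reversible because every byte survives; the second matches the 0x3f bytes in the dump above and cannot be undone.

```python
original = "Как обозначать?"

# Recoverable case: UTF-8 bytes were stored as-is and later read back as latin1.
# No byte was lost, so reversing the wrong decode restores the text.
garbled = original.encode("utf-8").decode("latin-1")
recovered = garbled.encode("latin-1").decode("utf-8")

# Lossy case: the text was converted to a charset with no Cyrillic letters.
# Each unmappable character became a literal "?" (0x3f) and is gone for good.
lossy = original.encode("latin-1", errors="replace")

print(recovered == original)  # the round trip restores the text
print(lossy)                  # only "?" and spaces remain
```

If the bytes stored in the database look like the first case, a dump-and-reimport with the right charset may work; if they are already 0x3f in storage, only the post authors can restore the text.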
Perhaps a file with the result of mysqldump --hex-blob --no-opt --where=.... containing several problematic messages (as well as a few non-problematic messages in other languages/charsets) might inspire someone to take a look, while not being problematic from the privacy side (which giving mysql access to the test instance probably would be).
Test site back after outage of a few hours.
Thanks Grant, I’ll post the announcement today.
One thing I see that’s wrong is links did not get made into links, they stay markdown-ed: Откаты правок - users: Russia - Test Import Site - OpenStreetMap Community I also found double-asterisk for bold that didn’t make the text bold.
Some other topics are correct though: Москва и Московская область (обсуждение) - users: Russia - Test Import Site - OpenStreetMap Community
OK, I think I’ve found the issue. It seems that if bbcode is converted to <ol><li>Item 1</li>...</ol> syntax, then markdown inside the elements is not parsed. If the 1. Item 1 style is used, then all is good.
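A rough sketch of the second approach, emitting `1. Item` lines instead of `<ol>` HTML so that markdown inside the items is still parsed. It assumes the fluxbb source uses `[list=1]` with `[*]...[/*]` items and no nested lists; the function name is made up for illustration and this is not the importer's actual code.

```python
import re

# Assumed fluxbb syntax: [list=1][*]item[/*]...[/list]; the closing [/*] is optional here.
LIST_RE = re.compile(r"\[list=1\](.*?)\[/list\]", re.DOTALL | re.IGNORECASE)
ITEM_RE = re.compile(r"\[\*\](.*?)(?:\[/\*\]|(?=\[\*\])|\Z)", re.DOTALL)


def ordered_list_to_markdown(text):
    """Rewrite numbered bbcode lists as markdown numbered lists."""

    def convert(match):
        items = [m.group(1).strip() for m in ITEM_RE.finditer(match.group(1))]
        lines = [f"{n}. {item}" for n, item in enumerate(items, start=1)]
        # Blank-line padding keeps the list separate from surrounding text.
        return "\n" + "\n".join(lines) + "\n"

    return LIST_RE.sub(convert, text)


print(ordered_list_to_markdown("[list=1][*]Item **one**[/*][*]Item two[/*][/list]"))
```

Nested lists and unnumbered `[list]` blocks would need extra handling that this sketch leaves out.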
Note we are testing topic migrations on the import test site. We have migrated topics in the imported Sweden, Brazil, Germany and Russian categories to the existing respective community sections.
So far all good. We are using the method described here (CLI): Bulk move many topics from one category to another - #34 by tshenry - support - Discourse Meta.
Wait, there is more. I’ve got multiple issues reported by mappers who looked into the imported forum:
"---" sequence in the text. Another case: old, new. “Equals” signs do that as well: old, new.
Looks like 1 and 3 come from applying markdown markup to old posts, which obviously should not be done.
Erm - at what point do we think the migration process is “good enough” for what by definition are old messages?
I agree that we don’t have to make the import ideal. But I think at least issues 3 and 5 are very important: asterisks are common in tagging discussions, and multi-level quotations are important. 4 was announced as finished by Harry (?) and I’m puzzled at why that didn’t work.
I think “good enough” will be the migrated message keeps reasonable formatting and the ability to understand the message is not changed.
[color=gray] (bbcode source) is not supported by discourse. Unlikely to be fixed.
[building=*] output is [building=*], but discourse is swallowing the *. Likely not “good enough”, but unsure how I’d approach this. https://forum-import-test.openstreetmap.org/viewtopic.php?pid=145548#p145548
You can use escaping (backslash) for symbols used in markdown:
“[building=] [building=]” vs “[building=*] [building=*]”
OK, now do that as a regex that doesn’t conflict with any other bbcode
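One possible starting point, offered as a sketch rather than a tested importer patch: escape only a single `*` that directly follows `=`, which covers `building=*` with or without brackets. The bbcode list marker `[*]` is left alone because its asterisk follows `[`, and `**bold**` is skipped by the lookahead. The helper name is hypothetical, and it assumes the text has already had code blocks set aside.

```python
import re

# A single "*" right after "=", e.g. building=* or [building=*].
# Not matched: "[*]" (bbcode list item) and "**" (markdown bold).
TAG_WILDCARD_RE = re.compile(r"(?<==)\*(?!\*)")


def escape_tag_wildcards(text):
    """Backslash-escape tag wildcards so markdown keeps the asterisk."""
    return TAG_WILDCARD_RE.sub(r"\\*", text)


print(escape_tag_wildcards("[building=*] [building=*] and [*]list item"))
```

Asterisks used in other positions (for example `*=yes`) are not covered and would need their own rule.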