Migrating content from old forums

Content on the test import site is about to be reset.

I will shortly be running another test import attempt, this time with the first cut at user account merging.

The account merging test import has failed :frowning: Looking in @Zverik's direction :wink:

The merging code needs some additional edge case handling. Thank you to @TomH for helping me out with his Ruby wizardry.

Yeah, when I first registered, I logged in as “zverik” lowercase, which was another user. And after half a year I re-logged-in as “Zverik” uppercase. I believe I’m not the only one with that issue :slight_smile:

Although I won’t complain if the older messages are attributed to another user.

Not to worry, we’re working out all the edge cases. Your user accounts were just the first edge case to hit the test importer.

There are some interesting “conflicts” in the old forum (fluxbb) database. E.g.: Multiple user accounts with the same email address and different OSM IDs. (Example Scenario: User 1 logged into forum, User 1 email changed on osm.org, User 2 is created on osm.org with old User 1 email, User 2 logs into forum.)

The test import account merging now appears to be working well.

The logic is approximately:

  1. If no known osm.org id (uid) and no known email for user: Create New community.osm.org account with an invalid email address. New account immediately suspended. (Can be manually merged post import by admins. Posts are imported OK and displayed normally)
  2. If no known osm.org id (uid) and an email matches with an existing community.osm.org account then: Use existing community.osm.org account.
  3. If no known osm.org id (uid) and email does not match with an existing community.osm.org account then: Create New community.osm.org account (? NEEDS TESTING - Will login work post import ?)
  4. If an osm.org id (uid) exists with a matching community.osm.org account: Use existing community.osm.org account.
  5. If an osm.org id (uid) exists without a matching community.osm.org account: Create New community.osm.org auth linked to osm.org id to allow seamless login on community.osm.org
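
The five rules above can be sketched as a small decision function. This is only an illustration of the logic as described, not the importer's actual code (which is Ruby); the names `ForumUser`, `Action` and `decide`, and the idea of passing in sets of known uids and emails, are assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Set


class Action(Enum):
    CREATE_SUSPENDED = "create suspended account with an invalid email"  # rule 1
    USE_EXISTING = "use existing community.osm.org account"              # rules 2 and 4
    CREATE_NEW = "create new community.osm.org account"                  # rule 3
    CREATE_LINKED = "create new auth linked to the osm.org id"           # rule 5


@dataclass
class ForumUser:
    """Hypothetical view of one old-forum user record."""
    osm_uid: Optional[int]
    email: Optional[str]


def decide(user: ForumUser, known_uids: Set[int], known_emails: Set[str]) -> Action:
    """Pick an action for one old-forum user, following rules 1 to 5."""
    if user.osm_uid is not None:
        # Rules 4 and 5: an osm.org id is known, so the email is not consulted.
        if user.osm_uid in known_uids:
            return Action.USE_EXISTING
        return Action.CREATE_LINKED
    if not user.email:
        # Rule 1: nothing to match on at all.
        return Action.CREATE_SUSPENDED
    if user.email in known_emails:
        # Rule 2: fall back to matching on email (exact match assumed here).
        return Action.USE_EXISTING
    # Rule 3: email is known but has no existing account.
    return Action.CREATE_NEW
```

For example, `decide(ForumUser(None, None), set(), set())` gives `Action.CREATE_SUSPENDED`, which is the case that admins can merge manually after the import.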

A test import to https://forum-import-test.openstreetmap.org/ is running now and is expected to complete in around 12 hours. Update: Full import done. The import is using the latest (2nd Feb 2023) forum.osm.org and community.osm.org snapshots.

NOTE https://forum-import-test.openstreetmap.org/ currently does NOT allow login. This will be enabled later during testing.

Work still pending:

  1. Fix outstanding account merge issues.
  2. Import of avatars - Unlikely to happen.
  3. Import of user signatures - Unlikely to happen.
  4. Fix permalink for forum topics.
  5. Schedule migration window.
  6. POST IMPORT: Merge imported categories and topics
  7. POST IMPORT: Fix topic descriptions (currently boilerplate text)
  8. POST IMPORT: Admins merge accounts on request (only “import suspended” users)

:+1:
Thanks for all the work you are doing here, mate! One can’t say that often enough.

@Zverik Would you mind asking the Russian speaking community to review the test import site?
Particularly: users: Russia - Test Import Site - OpenStreetMap Community

Wow! There is a single thread with 14713 posts in it!?!? Как обозначать? - #51 by Ezhick - users: Russia - Test Import Site - OpenStreetMap Community

User accounts have been merged and created per the rules described here.

Content formatting will not be perfect, as a perfect conversion from fluxbb bbcode → discourse’s flavour of markdown is impossible, but the formatting now looks “good enough” to me.

The old forum.osm.org has a few old “mojibake” posts. I am unsure what happened here; it is likely a mysql unicode issue. These posts are broken in the fluxbb database and it is unlikely they could be easily recovered.

Note the test site does NOT currently allow login.

Do you know the original intended encoding? If so, it’s sometimes possible to recover the text from mojibake, though not always. If it’s only a few posts, then maybe their authors could patch them up after the fact if there’s any confusion.

Here is an example post: Как обозначать? (Page 5) / users: Russia / OpenStreetMap Forum
I don’t think the text can be recovered, but happy for others to try.

The text is already destroyed in the HTML output of the old forum, i.e. all those ? chars are real 0x3f ASCII chars, e.g.:

% curl -s 'https://forum.openstreetmap.org/viewtopic.php?pid=33413#p33413' | fgrep -A15 '#104' | tail -n1 | hd
00000000  09 09 09 09 09 09 3c 64  69 76 20 63 6c 61 73 73  |......<div class|
00000010  3d 22 71 75 6f 74 65 62  6f 78 22 3e 3c 63 69 74  |="quotebox"><cit|
00000020  65 3e 43 61 6c 69 62 72  61 74 6f 72 20 77 72 6f  |e>Calibrator wro|
00000030  74 65 3a 3c 2f 63 69 74  65 3e 3c 62 6c 6f 63 6b  |te:</cite><block|
00000040  71 75 6f 74 65 3e 3c 64  69 76 3e 3c 70 3e 3f 3f  |quote><div><p>??|
00000050  3f 3f 3f 3f 3f 3f 3f 20  34 20 3f 3f 3f 3f 3f 3f  |??????? 4 ??????|
00000060  3f 20 3f 3f 3f 3f 3f 3f  3f 20 3f 20 3f 3f 3f 3f  |? ??????? ? ????|
00000070  3f 3f 3f 20 28 6c 61 6e  64 75 73 65 3d 72 65 73  |??? (landuse=res|
00000080  69 64 65 6e 74 61 29 6c  3c 2f 70 3e 3c 2f 64 69  |identa)l</p></di|

However, in my experience mysql itself often keeps enough information to reconstruct such messages, usually by dumping to a file, editing the file to change CHARSET=xxxx where needed (utf8? utf8mb4?), and then importing again. It is often the connecting client that mangles the data: if one sets SET NAMES to the same thing the database/table/column uses, one can usually get the data out in raw form, which can then be converted. If the client does not issue the correct commands on startup, however, it will get gibberish or ?.

I’ve had some experience with cp1250 and utf8 being stuffed into a database marked as charset=latin1 (which showed up as all kinds of corruption in the app/web) and recovering it without too much trouble, but I have no experience with Russian charsets (was it UTF8 or KOI8-R or something else initially?), and I have a feeling the multi-charset nature of the database might make it more problematic.

Perhaps a file with the result of mysqldump --hex-blob --skip-opt --where=.... containing several problematic messages (as well as a few non-problematic messages in other languages/charsets) might inspire someone to take a look, while not being problematic from the privacy side (which giving mysql access to the test instance probably would be).
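
To make the "utf8 stuffed into latin1" case concrete, here is a minimal sketch of the round trip in Python. It assumes exactly one layer of mis-decoding (UTF-8 bytes read as latin-1); the function name `repair_mojibake` is made up for the example, and whether the forum data was mangled this way is not known. It also shows why the posts in the hexdump are a different case: once the characters have become literal `?` (0x3f), there is nothing left to convert.

```python
def repair_mojibake(text: str, wrong: str = "latin-1", right: str = "utf-8") -> str:
    """Undo one round of 'bytes in `right` were decoded as `wrong`'.

    Returns the text unchanged if it does not fit that pattern.
    """
    try:
        return text.encode(wrong).decode(right)
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not this kind of mojibake (or already correct text): leave it alone.
        return text


original = "Как обозначать?"

# Simulate the damage: UTF-8 bytes interpreted as latin-1 characters.
mangled = original.encode("utf-8").decode("latin-1")

# The bytes are all still there, so the round trip restores the text.
assert repair_mojibake(mangled) == original

# Correct text is passed through untouched.
assert repair_mojibake(original) == original

# Literal question marks carry no information about the original letters.
assert repair_mojibake("??? 4 ???") == "??? 4 ???"
```

Python's `latin-1` codec maps every byte value to a character, which is what makes the round trip lossless here; other charset pairs (cp1250, KOI8-R) would need the `wrong` and `right` arguments changed and may not round-trip as cleanly.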

Test site back after outage of a few hours.

Thanks Grant, I’ll post the announcement today.

One thing I see that’s wrong is that links were not turned into links; they stay as raw markdown: Откаты правок - users: Russia - Test Import Site - OpenStreetMap Community. I also found double asterisks for bold that didn’t make the text bold.

Some other topics are correct though: Москва и Московская область (обсуждение) - users: Russia - Test Import Site - OpenStreetMap Community

OK, I think I’ve found the issue. It seems that if bbcode is converted to <ol><li>Item 1</li>...</ol> syntax, then markdown inside the elements is not parsed. If the 1. Item 1 style is used, all is good.
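
A minimal sketch of the second style, converting a numbered bbcode list into `1. Item` lines so that markdown inside the items still gets parsed. This is not the importer's code; it assumes the `[list=1][*]...[/*][/list]` form (with the closing `[/*]` optional) and does not handle nested lists. The function name is made up for the example.

```python
import re

# A numbered bbcode list and its items; the [/*] closer is treated as optional.
LIST_RE = re.compile(r"\[list=1\](.*?)\[/list\]", re.DOTALL | re.IGNORECASE)
ITEM_RE = re.compile(r"\[\*\](.*?)(?:\[/\*\]|(?=\[\*\])|\Z)", re.DOTALL)


def ordered_list_to_markdown(bbcode: str) -> str:
    """Rewrite each [list=1] block as markdown '1. item' lines."""

    def convert(match: re.Match) -> str:
        items = [m.group(1).strip() for m in ITEM_RE.finditer(match.group(1))]
        lines = [f"{n}. {item}" for n, item in enumerate(items, start=1)]
        # Surround the list with newlines so it is not glued to other text.
        return "\n" + "\n".join(lines) + "\n"

    return LIST_RE.sub(convert, bbcode)
```

With this, `[list=1][*]**bold**[/*][*]plain[/*][/list]` becomes two lines, `1. **bold**` and `2. plain`, and the bold markup is left for the markdown parser to handle.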

This new code is likely faulty.

Note we are testing topic migrations on the import test site. We have migrated topics in the imported Sweden, Brazil, Germany and Russia categories to the existing respective community sections.

So far all good. We are using the method described here (CLI): Bulk move many topics from one category to another - #34 by tshenry - support - Discourse Meta.

Wait, there is more :slight_smile: I’ve received multiple issue reports from mappers who looked into the imported forum:

  1. old, new: an entire block was made bold, likely because of the "---" sequence in the text. Another case: old, new. “Equals” signs do that as well: old, new.
  2. Same post, look for “брр как всё запутано” phrase: it’s gray in the original, but the color has been lost.
  3. old, new: asterisks in tags are mistaken for markup.
  4. old, new: http links to the old forum were not updated. Same for https.
  5. old, new: multi-level quoting is broken: text was stashed into the last quote.

Looks like 1 and 3 come from applying markdown markup to old posts, which obviously should not be done.
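
For issue 4, the mapping from an old forum URL to the test site can be sketched as a host swap that keeps the path, query and fragment. The two host names are the ones that appear in this thread; the function name is made up, and this says nothing about how the importer or the redirect code actually does it.

```python
from urllib.parse import urlsplit, urlunsplit

OLD_HOST = "forum.openstreetmap.org"
NEW_HOST = "forum-import-test.openstreetmap.org"


def to_new_host(url: str) -> str:
    """Point an old-forum URL (http or https) at the test import site."""
    parts = urlsplit(url)
    if parts.hostname != OLD_HOST:
        # Not an old-forum link: leave it untouched.
        return url
    # Keep path, query string and fragment; always use https on the new host.
    return urlunsplit(("https", NEW_HOST, parts.path, parts.query, parts.fragment))
```

Because the path and parameters are preserved, an old `viewtopic.php?pid=...` link keeps pointing at the same post as long as the new site understands those parameters.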

Erm - at what point do we think the migration process is “good enough” for what by definition are old messages? :slight_smile:

I agree that we don’t have to make the import ideal. But I think at least issues 3 and 5 are very important: asterisks are common in tagging discussions, and multi-level quotations are important. 4 was announced as finished by Harry (?) and I’m puzzled at why that didn’t work.

I think “good enough” means the migrated message keeps reasonable formatting and can still be understood the same way as the original.

  1. Weird parsing error, not caused by the importer. I will look into a workaround, but it is unlikely to be fixed.
  2. Colour-based markup [color=gray] (bbcode source) is not supported by discourse. Unlikely to be fixed.
  3. Difficult. The importer input is [building=*] and its output is [building=*], but discourse is swallowing the *. Likely not “good enough”, but I am unsure how I’d approach this.
  4. This is already handled by the permalink redirect code. Take the old url + parameters and use them on the test forum URL, e.g.: https://forum-import-test.openstreetmap.org/viewtopic.php?pid=145548#p145548
  5. Ouch. Not good enough: the message is changed and its meaning could be changed too. It is due to an over-eager regex and is likely very difficult to fix. Here is a pseudocode version: gist:88c45733449078f8fa061838f0eb899a · GitHub

You can use escaping (backslash) for symbols used in markdown:

“[building=] [building=]” (unescaped, the asterisks are swallowed) vs “[building=*] [building=*]” (each asterisk written as \*)
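
A rough sketch of what such escaping could look like on the bbcode source, before any other conversion. It assumes that every `*` in the old posts is a literal character (bbcode uses `[b]`/`[i]` for emphasis), that `[*]` list markers must be left alone for a later list conversion, and that `[code]` blocks should not be touched. The function name is made up, and other markdown-significant characters are not handled.

```python
import re

# Split out [code]...[/code] so their contents are never escaped.
CODE_RE = re.compile(r"(\[code\].*?\[/code\])", re.DOTALL | re.IGNORECASE)
# Either a bbcode list marker (kept as-is) or a '*' that is not already escaped.
TOKEN_RE = re.compile(r"\[\*\]|(?<!\\)\*")


def _escape(match: re.Match) -> str:
    token = match.group(0)
    return token if token == "[*]" else "\\*"


def escape_asterisks(bbcode: str) -> str:
    """Backslash-escape literal asterisks outside [code] blocks."""
    parts = CODE_RE.split(bbcode)
    # With one capture group, split() alternates: text, code, text, ...
    for i in range(0, len(parts), 2):
        parts[i] = TOKEN_RE.sub(_escape, parts[i])
    return "".join(parts)
```

So `[building=*]` becomes `[building=\*]`, while `[*]item` and the inside of `[code]a*b[/code]` are unchanged.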

OK, now do that as a regex that doesn’t conflict with any other bbcode :stuck_out_tongue: