Can you create a static version of the old help.openstreetmap.org site?

Firefishy · April 23, 2024, 12:02pm

Please could someone volunteer to help us create a static site from the old help.openstreetmap.org site?

Our friends from wireshark have existing code which could be adapted to the task: Wireshark Foundation / Old OSQA ask.wireshark.org · GitLab

The old site runs OSQA on an ancient version of django. The software is unsupported. The operations team are eager to turn off the django version of site and replace it with a static HTML site. The static HTML site can be generated using Hugo static site generator or similar, as per the wireshark converter.

Can you help?

Firefishy · April 23, 2024, 12:09pm

We have a ticket with some older details here: Migrate help.openstreetmap.org from OSQA to static html archive · Issue #149 · openstreetmap/operations · GitHub

rtnf · April 23, 2024, 1:07pm

I would like to help, but my current Internet connection is not strong enough to scrape every page and file on that site (that Wireshark code is basically a manual scrape of every page and file on the site)

I have another suggestion: since the database and files are still stored on the OSM server, can we expose that to the public through a simple API? From there, we can build a simple read-only interface to browse and navigate the Q&As.

tldr : keep the database and files on the OSM server, and replace Django with an ad-hoc API and a read-only user interface.

Another suggestion: share the osqa’s help.openstreetmap.org database dump here (focus on actual Q&A data; any private data could be simply left behind or anonymized first.). Maybe I can reconstruct a whole static Q&A site right from that database dump.

edit : well, somebody has already suggested this idea before.

Firefishy · April 23, 2024, 1:35pm

I cannot share the database dump. It contains private data (password hashes, messages, etc).

I do not want another framework to keep the database and site alive. I am strongly in favour of pure static HTML archive.

I am happy to run any created scraping code to create the static input data.

rtnf · April 23, 2024, 2:00pm

Ok…

import requests
from bs4 import BeautifulSoup
import sys
import json
sys.stdout.reconfigure(encoding='utf-8')

start_page = 1
end_page = 88268

url_template = "https://help.openstreetmap.org/questions/{page}"

data = []

for page in range(start_page, end_page + 1):
    print(page)
    url = url_template.format(page=page)
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    element = soup.find(id="CALeft")
    data.append({'id':page,'text':str(element)})    

with open('qnadump.json', 'w') as f:
    json.dump(data, f)

Then, copy qnadump.json, style.css, and r.py into the “questions” subfolder on htdocs.

r.py :

import json
import sys
import os
sys.stdout.reconfigure(encoding='utf-8')

with open('qnadump.json') as f:
    data = json.load(f)

for i in data:
    nya = '<meta charset="UTF-8"><link rel="stylesheet" type="text/css" href="../style.css">' + i['text']
    os.makedirs(str(i['id']), exist_ok=True)
    file_path = os.path.join(str(i['id']), 'index.html')
    with open(file_path, 'w', encoding='utf-8') as file:
        file.write(nya)

Run r.py there. It will generate a lot of folders, like this one (although I stopped the process after creating only two for this example below) :

Here’s the final result :

Note :

TODO : Since it will be converted into a static version, search and tag functionalities will be quite challenging to reimplement.
Fortunately, the user’s avatar is stored externally on Gravatar, so we don’t have to manually scrape them one by one.

Stereo · April 23, 2024, 3:15pm

@rtnf would a shell account on our dev server help you? I use my account on that server to run a few python scripts too.

rtnf · April 23, 2024, 3:36pm

Thank you for pointing that out.

By using that shell account, I can create a static site clone. But where should I submit that static site so it can be deployed properly? (probably replacing the old Django-based help.openstreetmap.org with a static site)

Firefishy · April 23, 2024, 4:05pm

Don’t worry about deploying the replacement site. I will handle that once a static copy has been created.

rtnf · April 23, 2024, 4:12pm

Ok. I have submitted the account request.

Starting the scraping process right now…

youngmapper · April 23, 2024, 8:36pm

Yes, I can

Firefishy · April 24, 2024, 8:30am

You likely want to run the scrape via a background session using GNU Screen or tmux.

rtnf · April 24, 2024, 9:21am

I use nohup (no hang out) to run the scraper in the background

RubenKelevra · April 24, 2024, 9:41am

I got spare time. I’m happy do a static copy by scraping.

LMK if there’s someone else working on it or if I should give it a go.

Harry_Wood · April 24, 2024, 2:18pm

Can I cash in my reputation points?

rtnf · April 24, 2024, 2:47pm

Update : 100% scraped. It’s time to reconstruct everything into a static web :

I want to upload the scraped data to Github (so I can download it, and process it locally), but apparently it’s way too large.

So, I have to process everything on the server.

And it’s now done…

https://altilunium.github.io/help.openstreetmap.org-archive/88266/ (just play around with the thread ID there)

Additional notes :

A. Not every thread ID is accessible, since some of them are already deleted from the original help.openstreetmap.org

B. Low quality answers are not scrapped. Only answers shown on first page are scrapped.

C. Some of the files are still hosted on help.openstreetmap.org (and not scrapped). Make sure that all of this files are not deleted when turning off the OSQA django instance.

rtnf · April 24, 2024, 4:28pm

Update : Homepage and (title-based) search functionality now already reimplemented :

https://altilunium.github.io/help.openstreetmap.org-archive/

Firefishy · April 25, 2024, 3:30pm

Feel free to have a go. May the best implementation win.

Firefishy · April 25, 2024, 3:31pm

You could use something like this to “compress” the json. The format remains valid, just whitespace is removed.

jq -c . < input.json >output.json

InsertUser · April 28, 2024, 9:18pm

Wasn’t all of this going to be imported to this forum at one point?

julcnx · April 29, 2024, 1:42pm

I asked the same question before and here’s what I got back: