Can you create a static version of the old help.openstreetmap.org site?

Please could someone volunteer to help us create a static site from the old help.openstreetmap.org site?

Our friends at Wireshark have existing code that could be adapted to the task: Wireshark Foundation / Old OSQA ask.wireshark.org · GitLab

The old site runs OSQA on an ancient version of Django. The software is unsupported. The operations team are eager to turn off the Django version of the site and replace it with a static HTML site. The static HTML site can be generated using the Hugo static site generator or similar, as per the Wireshark converter.

Can you help?

We have a ticket with some older details here: Migrate help.openstreetmap.org from OSQA to static html archive · Issue #149 · openstreetmap/operations · GitHub

I would like to help, but my current Internet connection is not good enough to scrape every page and file on that site (the Wireshark code is basically a manual scrape of every page and file on the site).

I have another suggestion: since the database and files are still stored on the OSM server, can we expose that to the public through a simple API? From there, we can build a simple read-only interface to browse and navigate the Q&As.

TL;DR: keep the database and files on the OSM server, and replace Django with an ad-hoc API and a read-only user interface.


Another suggestion: share the OSQA database dump for help.openstreetmap.org here (focusing on the actual Q&A data; any private data could simply be left out or anonymized first). Maybe I can reconstruct a whole static Q&A site directly from that database dump.

edit : well, somebody has already suggested this idea before.

I cannot share the database dump. It contains private data (password hashes, messages, etc).

I do not want another framework to keep the database and site alive. I am strongly in favour of a pure static HTML archive.

I am happy to run any created scraping code to create the static input data.

Ok…

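import json
import sys

import requests
from bs4 import BeautifulSoup

# Make console output UTF-8 safe (mainly matters on Windows).
sys.stdout.reconfigure(encoding='utf-8')

start_page = 1
end_page = 88268

url_template = "https://help.openstreetmap.org/questions/{page}"

data = []

for page in range(start_page, end_page + 1):
    print(page)
    url = url_template.format(page=page)
    try:
        response = requests.get(url, timeout=30)
    except requests.RequestException as exc:
        print(f"  request failed: {exc}")
        continue
    # Skip IDs that do not return a page (for example deleted questions).
    if response.status_code != 200:
        continue
    soup = BeautifulSoup(response.content, 'html.parser')
    # Keep only the element with id "CALeft"; skip pages that lack it.
    element = soup.find(id="CALeft")
    if element is None:
        continue
    data.append({'id': page, 'text': str(element)})

with open('qnadump.json', 'w', encoding='utf-8') as f:
    json.dump(data, f)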

Then, copy qnadump.json, style.css, and r.py into the “questions” subfolder on htdocs.

r.py :

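import json
import os
import sys

sys.stdout.reconfigure(encoding='utf-8')

with open('qnadump.json', encoding='utf-8') as f:
    data = json.load(f)

# Prepended to every page: declares the encoding and links the shared
# stylesheet, which sits one level above each question folder.
head = '<meta charset="UTF-8"><link rel="stylesheet" type="text/css" href="../style.css">'

for i in data:
    # str(None) is 'None': skip entries where no content element was found.
    if i['text'] == 'None':
        continue
    # Each question becomes <id>/index.html.
    os.makedirs(str(i['id']), exist_ok=True)
    file_path = os.path.join(str(i['id']), 'index.html')
    with open(file_path, 'w', encoding='utf-8') as file:
        file.write(head + i['text'])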

Run r.py there. It will generate a lot of folders, like this one (although I stopped the process after creating only two for this example below) :

image

Here’s the final result :

Note :

  1. TODO : Since it will be converted into a static version, search and tag functionalities will be quite challenging to reimplement.
  2. Fortunately, the user’s avatar is stored externally on Gravatar, so we don’t have to manually scrape them one by one.

@rtnf would a shell account on our dev server help you? I use my account on that server to run a few python scripts too.

Thank you for pointing that out.

By using that shell account, I can create a static site clone. But where should I submit that static site so it can be deployed properly? (probably replacing the old Django-based help.openstreetmap.org with a static site)

Don’t worry about deploying the replacement site. I will handle that once a static copy has been created.

Ok. I have submitted the account request.

Starting the scraping process right now…

image

Yes, I can

You likely want to run the scrape via a background session using GNU Screen or tmux.

I use nohup (no hang up) to run the scraper in the background.
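Whichever way the scrape is kept running, a crawl of this length may still get interrupted. Below is a minimal sketch of making it resumable, assuming the results are written as one JSON object per line instead of a single big list. The file layout and the function names (`append_record`, `load_done_ids`, `pending_ids`) are made up for illustration and are not part of the scraper posted above.

```python
import json
import os


def append_record(path, record):
    """Append one scraped page to the checkpoint file as a single JSON line."""
    with open(path, "a", encoding="utf-8") as f:
        # Writing the newline first keeps a half-written line from a previous
        # crash separate from this record.
        f.write("\n" + json.dumps(record))


def load_done_ids(path):
    """Return the set of page IDs already stored in the checkpoint file."""
    done = set()
    if not os.path.exists(path):
        return done
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                done.add(json.loads(line)["id"])
            except (json.JSONDecodeError, KeyError, TypeError):
                # Truncated or malformed line: skip it, so that ID is
                # simply fetched again on the next run.
                continue
    return done


def pending_ids(start, end, done):
    """List the IDs in [start, end] that still need to be fetched."""
    return [i for i in range(start, end + 1) if i not in done]
```

On restart, the loop would iterate over `pending_ids(1, 88268, load_done_ids("scrape.jsonl"))` and call `append_record` after each page, so earlier progress is not lost.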

I have spare time. I’m happy to do a static copy by scraping.

LMK if there’s someone else working on it or if I should give it a go.

Can I cash in my reputation points? :slight_smile:

Update : 100% scraped. It’s time to reconstruct everything into a static website :

image

I wanted to upload the scraped data to GitHub (so I could download it and process it locally), but apparently it’s way too large.

image

So, I have to process everything on the server.

And it’s now done…

https://altilunium.github.io/help.openstreetmap.org-archive/88266/ (just play around with the thread ID there)


Additional notes :

A. Not every thread ID is accessible, since some of them are already deleted from the original help.openstreetmap.org

B. Low-quality answers are not scraped. Only answers shown on the first page are scraped.

C. Some of the files are still hosted on help.openstreetmap.org (and not scraped). Make sure that none of these files are deleted when turning off the OSQA Django instance.
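For note C, one way to find out which files the archive still depends on is to list every link in the scraped HTML that points back at the old host. The sketch below uses only the standard library; the helper name `old_host_urls` is hypothetical, and it assumes that relative links in the scraped pages resolve against the site root. The result will also include ordinary links to other questions, so it would need filtering by path before use.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

OLD_HOST = "help.openstreetmap.org"
BASE_URL = f"https://{OLD_HOST}/"


class LinkCollector(HTMLParser):
    """Collect src/href values that resolve to the old host."""

    def __init__(self):
        super().__init__()
        self.urls = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name not in ("src", "href") or not value:
                continue
            if value.startswith("#"):
                # In-page anchors are not separate resources.
                continue
            absolute = urljoin(BASE_URL, value)
            if urlparse(absolute).hostname == OLD_HOST:
                self.urls.add(absolute)


def old_host_urls(html):
    """Return the set of absolute URLs in `html` served by the old host."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.urls
```

Running this over each `text` entry of the scraped data and taking the union would give a list of URLs to check before the OSQA instance is switched off.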

Update : Homepage and (title-based) search functionality have now been reimplemented :

https://altilunium.github.io/help.openstreetmap.org-archive/
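The thread does not show how the archive's search works, so the following is only a sketch of what title-based matching can look like: a query matches a thread when every word of the query appears somewhere in the title, ignoring case. The `titles` mapping (thread ID to title) and the function name `search_titles` are assumptions for illustration.

```python
def search_titles(titles, query):
    """Return the sorted IDs of threads whose title contains every query word."""
    words = query.casefold().split()
    if not words:
        return []
    return sorted(
        thread_id
        for thread_id, title in titles.items()
        if all(word in title.casefold() for word in words)
    )


# Hypothetical sample data, not taken from the real site.
titles = {
    1: "How do I tag a bridge?",
    2: "Rendering of bridges",
    3: "Tagging footpaths",
}
print(search_titles(titles, "tag bridge"))
```

On a static site the same matching would have to run in the browser, with the ID-to-title mapping shipped as a JSON file generated at build time.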

Feel free to have a go. May the best implementation win.

You could use something like this to “compress” the JSON. The format remains valid; only whitespace is removed.

jq -c . < input.json >output.json
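If jq is not available on the machine, roughly the same effect can be had from Python's standard `json` module. This is a sketch rather than an exact equivalent of `jq -c`: it loads the whole file into memory, and the function names are made up for illustration.

```python
import json


def compact_json(text):
    """Re-serialize JSON text without optional whitespace."""
    data = json.loads(text)
    # separators=(",", ":") drops the spaces json.dumps adds by default;
    # ensure_ascii=False keeps non-ASCII characters as-is instead of \u escapes.
    return json.dumps(data, separators=(",", ":"), ensure_ascii=False)


def compact_file(src, dst):
    """Read JSON from `src` and write the compact form to `dst`."""
    with open(src, encoding="utf-8") as f:
        data = json.load(f)
    with open(dst, "w", encoding="utf-8") as f:
        json.dump(data, f, separators=(",", ":"), ensure_ascii=False)
```

For example, `compact_file("input.json", "output.json")` mirrors the command above. The saving is modest when most of the file is HTML stored inside strings, since only whitespace between JSON tokens is removed.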

Wasn’t all of this going to be imported to this forum at one point?

I asked the same question before and here’s what I got back:
