Can you create a static version of the old help.openstreetmap.org site?

I cannot share the database dump. It contains private data (password hashes, messages, etc).

I do not want another framework to keep the database and site alive. I am strongly in favour of pure static HTML archive.

I am happy to run any created scraping code to create the static input data.

Ok…

import requests
from bs4 import BeautifulSoup
import sys
import json

# Make sure print() can handle non-ASCII output
sys.stdout.reconfigure(encoding='utf-8')

start_page = 1
end_page = 88268

url_template = "https://help.openstreetmap.org/questions/{page}"

data = []

for page in range(start_page, end_page + 1):
    print(page)
    url = url_template.format(page=page)
    response = requests.get(url, timeout=30)
    soup = BeautifulSoup(response.content, 'html.parser')
    # "CALeft" is the OSQA content column holding the question and its answers
    element = soup.find(id="CALeft")
    data.append({'id': page, 'text': str(element)})

with open('qnadump.json', 'w', encoding='utf-8') as f:
    json.dump(data, f)

Then, copy qnadump.json, style.css, and r.py into the “questions” subfolder on htdocs.

r.py :

import json
import os

with open('qnadump.json', encoding='utf-8') as f:
    data = json.load(f)

for i in data:
    # Prepend a charset declaration and the shared stylesheet to the scraped fragment
    nya = '<meta charset="UTF-8"><link rel="stylesheet" type="text/css" href="../style.css">' + i['text']
    # One folder per question ID, each containing a single index.html
    os.makedirs(str(i['id']), exist_ok=True)
    file_path = os.path.join(str(i['id']), 'index.html')
    with open(file_path, 'w', encoding='utf-8') as file:
        file.write(nya)

Run r.py there. It will generate a lot of folders, like the ones below (although I stopped the process after creating only two for this example) :

[screenshot : two generated question folders]

Here’s the final result :

Note :

  1. TODO : Since the site will be converted into a static version, the search and tag functionality will be quite challenging to reimplement (a rough sketch of one possible approach is shown after this list).
  2. Fortunately, user avatars are stored externally on Gravatar, so we don’t have to scrape them manually one by one.
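
For the tag pages specifically, here is a rough sketch (not from the original posts) of how a static tag index could be generated from qnadump.json. It assumes the scraped CALeft fragment renders tags as <a> elements whose href starts with “/questions/tagged/”, which should be verified against the real OSQA markup, and it is meant to be run from the same “questions” folder as r.py.

import json
import os
from collections import defaultdict
from bs4 import BeautifulSoup

# Hypothetical sketch: group question IDs by tag, then emit one static page per tag.
# Assumes tags appear as <a href="/questions/tagged/<tag>"> links in the scraped HTML.
with open('qnadump.json', encoding='utf-8') as f:
    data = json.load(f)

questions_by_tag = defaultdict(set)
for item in data:
    soup = BeautifulSoup(item['text'], 'html.parser')
    for a in soup.find_all('a', href=True):
        if a['href'].startswith('/questions/tagged/'):
            tag = a['href'].rstrip('/').rsplit('/', 1)[-1]
            questions_by_tag[tag].add(item['id'])

for tag, ids in questions_by_tag.items():
    os.makedirs(os.path.join('tagged', tag), exist_ok=True)
    links = ''.join(f'<li><a href="../../{qid}/">{qid}</a></li>' for qid in sorted(ids))
    with open(os.path.join('tagged', tag, 'index.html'), 'w', encoding='utf-8') as out:
        out.write(f'<meta charset="UTF-8"><h1>{tag}</h1><ul>{links}</ul>')

Tag names taken straight from URLs may need sanitising before being used as directory names.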

@rtnf would a shell account on our dev server help you? I use my account on that server to run a few python scripts too.

Thank you for pointing that out.

By using that shell account, I can create a static site clone. But where should I submit that static site so it can be deployed properly? (probably replacing the old Django-based help.openstreetmap.org with a static site)

Don’t worry about deploying the replacement site. I will handle that once a static copy has been created.


Ok. I have submitted the account request.

Starting the scraping process right now…


Yes, I can

You likely want to run the scrape via a background session using GNU Screen or tmux.

I use nohup (no hangup) to run the scraper in the background.

I’ve got spare time. I’m happy to do a static copy by scraping.

LMK if there’s someone else working on it or if I should give it a go.

Can I cash in my reputation points? :)


Update : 100% scraped. It’s time to reconstruct everything into a static site.


I want to upload the scraped data to GitHub (so I can download it and process it locally), but apparently it’s way too large.


So, I have to process everything on the server.

And it’s now done…

https://altilunium.github.io/help.openstreetmap.org-archive/88266/ (just play around with the thread ID there)


Additional notes :

A. Not every thread ID is accessible, since some of them were already deleted from the original help.openstreetmap.org.

B. Low-quality answers are not scraped. Only answers shown on the first page are scraped.

C. Some of the files are still hosted on help.openstreetmap.org (and were not scraped). Make sure that none of these files are deleted when turning off the OSQA Django instance (a sketch for finding such references follows below).
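
To help with that inventory, here is a rough sketch (not from the original posts) that scans the scraped data for absolute URLs still pointing at help.openstreetmap.org. The output will include links to other questions as well as uploaded files, so it still needs manual filtering; the qnadump.json file name matches the scraper above.

import json
import re

# Hypothetical helper: list every absolute URL in the scraped HTML that still
# points at help.openstreetmap.org, so those files can be checked before the
# OSQA instance is switched off.
with open('qnadump.json', encoding='utf-8') as f:
    data = json.load(f)

pattern = re.compile(r"""https?://help\.openstreetmap\.org/[^\s"'<>]+""")
remote_urls = set()
for item in data:
    remote_urls.update(pattern.findall(item['text']))

for url in sorted(remote_urls):
    print(url)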


Update : The homepage and a (title-based) search functionality have now been reimplemented :

https://altilunium.github.io/help.openstreetmap.org-archive/
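
The posts don’t show how the homepage was rebuilt, so here is a minimal sketch of one way to generate a title listing (which doubles as a crude search target) from qnadump.json. It assumes the question title is the first <h1> inside each scraped CALeft fragment and that the page is written one level above the “questions” folder; both are assumptions to verify, not a description of the actual implementation.

import json
from html import escape
from bs4 import BeautifulSoup

# Hypothetical sketch: build a homepage listing every question title with a link
# to its static page. Assumes the title is the first <h1> in the scraped fragment.
with open('qnadump.json', encoding='utf-8') as f:
    data = json.load(f)

rows = []
for item in data:
    soup = BeautifulSoup(item['text'], 'html.parser')
    h1 = soup.find('h1')
    if h1 is None:
        continue  # deleted / missing threads carry no content
    title = h1.get_text(strip=True)
    rows.append(f'<li><a href="questions/{item["id"]}/">{escape(title)}</a></li>')

with open('index.html', 'w', encoding='utf-8') as out:
    out.write('<meta charset="UTF-8"><h1>help.openstreetmap.org archive</h1><ul>'
              + ''.join(rows) + '</ul>')

A small client-side filter over this list (or just the browser’s find-in-page) is enough for a simple title-based search.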


Feel free to have a go. May the best implementation win.

You could use something like this to “compress” the JSON. The format remains valid; only the whitespace is removed.

jq -c . < input.json > output.json

Wasn’t all of this going to be imported to this forum at one point?


I asked the same question before and here’s what I got back:


@rtnf How are you getting on with creating a static clone of https://help.openstreetmap.org/ ?

Please try to ensure that the functionality and URLs do not change, as far as is remotely possible.

Something looking and working very similarly to https://osqa-ask.wireshark.org/ is the ultimate goal.

Note it is also possible to host web content directly under your user on the dev server, e.g. https://rtnf.dev.openstreetmap.org/ serves the contents of the /home/rtnf/public_html/ directory.

I noticed that both the user profile and tagging functionality are disabled in the archived version of the Wireshark forum (reimplementing both features using static HTML is quite a challenging task, and they seem to have given up on it altogether).

The last problem in my implementation is likely to be the URL change, which could probably be fixed by using Apache’s mod_rewrite (for example, automatically redirecting “/questions/88266/osm-carto-multilingual-tags” to “/questions/88266”). Additionally, I need to (1) add a link to the homepage on each answer page, allowing users to navigate back, and (2) make visual improvements to the homepage.

I’ll try to do it soon & post some updates here.


https://rtnf.dev.openstreetmap.org/help.openstreetmap.org-archive/

Update 1 : Homepage

Update 2 : Added a link to the homepage on each answer page (click the OSM logo to go back)

Update 3 : All of the old URLs can now be preserved (by automatically redirecting the “/questions/88266/osm-carto-multilingual-tags” URL format to “/questions/88266”)

https://rtnf.dev.openstreetmap.org/help.openstreetmap.org-archive/questions/88266/osm-carto-multilingual-tags (this will be automatically redirected to https://rtnf.dev.openstreetmap.org/help.openstreetmap.org-archive/questions/88266)

.htaccess mod_rewrite config :

# Redirect old slugged question URLs to the plain numeric form
RewriteEngine On
RewriteBase /help.openstreetmap.org-archive
RewriteRule ^questions/([0-9]+)/[^/]+$ questions/$1 [R=301,L]

If we aren’t going to archive user pages, is there going to be an email to all users telling them they have XXX days to go there and manually archive their contributions?

I had a go at this for mine before the original shutdown date, and a really janky version of it kinda sorta works with httrack (it would be way easier if the number of results per page was increased significantly).