An unscheduled power outage just took out my Overpass server. I thought I might have gotten lucky, but no. When I try to restart it, I’m getting “Unsupported index file format version” errors for all the .idx files.
Has anyone ever managed to recover from this state without downloading a fresh database copy?
I have encountered this and I’ve never solved it. I always re-download the database and deal with a temporary outage for my use case. I would love it if overpass were resilient to this issue.
I’ve never had exactly that issue. But I’ve often had weird errors like that. Like @ZeLonewolf, I just reimport the Overpass database and forget about it.
If you still have them, could you please send me the *.idx files and the *.idx.shadow files, and/or run

for i in $DB_DIR/*.idx $DB_DIR/*.idx.shadow; do echo "$(basename $i) $(hexdump -C <$i | tail -n 1)"; done
[side note: the command above originally contained a typo and has since been corrected]
and send me the result (its output is much smaller and could in principle even be pasted here, as opposed to the larger files themselves)?
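For example, assuming the database lives under /opt/overpass/db (an assumption; adjust to your setup), that would be:

# Example invocation; /opt/overpass/db is an assumed database location.
DB_DIR=/opt/overpass/db
for i in $DB_DIR/*.idx $DB_DIR/*.idx.shadow; do echo "$(basename $i) $(hexdump -C <$i | tail -n 1)"; done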
In principle, the update process is designed such that at any given moment in time either the *.idx files or the *.idx.shadow files form a valid index, and I would be highly interested in seeing a state where this is no longer the case, in order to harden against it.
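A minimal sketch of such a check (not part of Overpass itself; it simply assumes that a healthy index never starts with 16 zero bytes) could look like this:

# Hypothetical sanity check: warn about index files whose first 16 bytes are all zero.
for i in "$DB_DIR"/*.idx "$DB_DIR"/*.idx.shadow; do
  [ -e "$i" ] || continue
  head=$(hexdump -n 16 -e '16/1 "%02x"' "$i")
  if [ "$head" = "00000000000000000000000000000000" ]; then
    echo "WARNING: all-zero header in $(basename "$i")"
  fi
done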
Unfortunately, I no longer have the files. But I will keep the offer in mind for next time. We’re on a 100-year-old power circuit, and although the local utility tries to keep it up, we do have unscheduled outages a couple of times a year.
So, I’m sure we’ll have another opportunity to debug this situation.
I’m sorry for the late answer. The database files are fine, but this is a new restriction policy intended for the public instances. Please turn it off by passing the parameter --duplicate-queries=yes to the dispatcher.
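For illustration only, in a typical launch script the flag would simply be appended to the dispatcher start line; the other options shown here are assumptions about such a script, not taken from this thread:

# Hypothetical dispatcher start line; --osm-base and --db-dir are the usual options,
# the duplicate-queries flag name is as given above.
"$EXEC_DIR"/dispatcher --osm-base --db-dir="$DB_DIR" --duplicate-queries=yes &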
Some background: there have been multiple occasions where people have sent the exact same request from dozens of IP addresses in a public cloud, multiple times per minute or even per second. As the public instances distribute resources per IP address, this can substantially clog them. Treating IP addresses from the same C-net as related is not a solution, because in many cases they are used independently, and many cloud providers use multiple disjoint IP address blocks.
For the time being, duplicate queries are also accepted on the public instances, as there is currently enough capacity.
We had another power outage last night. This time I was able to recover and get Overpass back up and running. So far it looks like it’s in good shape.
At first I had some trouble because the names of the lock files had changed and they weren’t being deleted properly, so I had to edit my scripts, and in the process I took the time to make them more robust.
The launch.sh script now checks whether each of the processes is running before starting them. If any process dies immediately after launch, the script shuts everything back down. And launch.sh doesn’t delete the shared memory or lock files unless the associated dispatcher process isn’t running.
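A rough sketch of that pattern, with assumed paths and process names rather than the actual script, might look like this:

#!/usr/bin/env bash
# Hypothetical launch sketch: only start the base dispatcher if it isn't already running,
# only remove leftover socket/shared-memory files when no dispatcher owns them,
# and shut back down if the process dies right after launch.
DB_DIR=/opt/overpass/db      # assumed location
EXEC_DIR=/opt/overpass/bin   # assumed location

if ! pgrep -f "dispatcher --osm-base" >/dev/null; then
  rm -f "$DB_DIR"/osm3s_osm_base /dev/shm/osm3s_osm_base   # file names vary by version
  "$EXEC_DIR"/dispatcher --osm-base --db-dir="$DB_DIR" &
  sleep 5
  if ! pgrep -f "dispatcher --osm-base" >/dev/null; then
    echo "dispatcher died immediately after launch, shutting everything down" >&2
    ./shutdown.sh
    exit 1
  fi
fi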
The shutdown.sh script is also more robust. I’ve cleaned up some of the scripting to better handle some edge cases.
I had posted the scripts in my diary, but I’ll be moving them to a wiki page sometime soon.
It would be great if you could turn all of this into a GitHub project for administering Overpass. It would help a lot with all the ad hoc scripting everyone is doing to keep Overpass running!
We already have one GitHub project for Overpass. If @drolbr would be interested in having some of this as contributions or combining it with existing scripts, maybe that would be a good place for it?
I’m highly enthusiastic to collect all the scripts that fix real-life problems.
I’d prefer to have them as separate scripts in pull requests for the time being, so that we have all of them in parallel and can reconcile them once we understand the details. Otherwise, we risk having some improvements in apply_to_osc.sh, some in fetch_osc_and_apply.sh, and none containing all of them.
@drolbr I’m using the docker image v0.7.61.8 and the docker healthcheck seems to run into this error about duplicate queries, so the container stays “unhealthy” and may get restarted by the docker daemon. I didn’t find any option to pass “--duplicate-queries” through an ENV var; maybe that one got lost? Is there a workaround?
To revive this old thread, I had another unscheduled power outage today and after restarting my Overpass server, updates are not being applied and I’m getting this error from apply_osc_to_db.sh:
File error caught: /opt/overpass/db/node_tags_global.bin.idx File_Blocks_Index: Unsupported index file format version 0 outside range [7512, 7600]
Reading XML file ... finished reading nodes. File error caught: /opt/overpass/db/nodes.bin.idx.shadow File_Blocks_Index: Unsupported index file format version 0 outside range [7512, 7600]
I shut down all the processes but otherwise left the server untouched.
Here’s the result of the hexdump command that was suggested earlier.
There are apparently no .shadow files other than osm_base_version.shadow and I can confirm that if there had been any .idx.shadow files, they would not have been deleted.
This is with Overpass version osm-3s_v0.7.61.8.
@drolbr do you have any suggestions for analysis or recovery?
I just noticed something that looks odd. In apply_osc_to_db.log, it appears to have skipped one diff when it restarted from the power outage (16:27:05 to 16:55:12).
2024-05-29 16:25:13: updating to 6110693
2024-05-29 16:25:16: update complete 6110693
2024-05-29 16:25:16: updating from 6110693
2024-05-29 16:25:21: updating from 6110693
2024-05-29 16:25:26: updating from 6110693
2024-05-29 16:25:31: updating from 6110693
2024-05-29 16:25:36: updating from 6110693
2024-05-29 16:25:41: updating from 6110693
2024-05-29 16:25:46: updating from 6110693
2024-05-29 16:25:51: updating from 6110693
2024-05-29 16:25:56: updating from 6110693
2024-05-29 16:26:01: updating from 6110693
2024-05-29 16:26:06: updating from 6110693
2024-05-29 16:26:06: updating to 6110694
2024-05-29 16:26:10: update complete 6110694
2024-05-29 16:26:10: updating from 6110694
2024-05-29 16:26:15: updating from 6110694
2024-05-29 16:26:20: updating from 6110694
2024-05-29 16:26:25: updating from 6110694
2024-05-29 16:26:30: updating from 6110694
2024-05-29 16:26:35: updating from 6110694
2024-05-29 16:26:40: updating from 6110694
2024-05-29 16:26:45: updating from 6110694
2024-05-29 16:26:50: updating from 6110694
2024-05-29 16:26:55: updating from 6110694
2024-05-29 16:27:00: updating from 6110694
2024-05-29 16:27:05: updating from 6110694
2024-05-29 16:55:12: updating from 6110695
2024-05-29 16:55:17: updating from 6110695
2024-05-29 16:55:22: updating from 6110695
2024-05-29 16:55:27: updating from 6110695
2024-05-29 16:55:32: updating from 6110695
2024-05-29 16:55:37: updating from 6110695
2024-05-29 16:55:37: updating to 6110696
Thank you for the quick reply, and once again I’m sorry for the recovery work this caused.
Please use this corrected command for the hexdump:

for i in $DB_DIR/*.idx $DB_DIR/*.idx.shadow; do echo "$(basename $i) $(hexdump -C <$i | tail -n 1)"; done
The one in the post above (now fixed) had a typo and showed one byte instead of one line.
I have no idea yet what exactly has happened. What is indeed a fatal problem is that there are zeros in the headers of most idx files (in fact, all idx files touched at 09:27), which should not be there.
I’ll link this thread from an issue in the issue tracker on GitHub to ensure it gets further scrutiny. Things that can and should be improved are ensuring that the replicate_id is set atomically, so that no diff can be skipped, and ensuring that shadow files are kept if the idx files are for some reason all-zero.
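To illustrate the atomic-update idea with a generic write-then-rename pattern (this is not the actual Overpass code, and the file layout is assumed):

# Hypothetical sketch: update the replicate_id state file atomically by writing to a
# temporary file, flushing it to disk, and renaming it into place.
new_id=6110695                       # example sequence number
tmp="$DB_DIR/replicate_id.tmp"
echo "$new_id" > "$tmp"
sync "$tmp"                          # flush the file data (coreutils >= 8.24)
mv "$tmp" "$DB_DIR/replicate_id"     # rename() is atomic within one filesystem
sync "$DB_DIR"                       # persist the directory entry as well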
I appreciate the assistance with debugging. If everything has gone well, I should have a backup of the db that is less than one week old. So, even in the worst case, recovery should not be difficult.
Thank you again. I have a hypothesis about what has happened. If it is true, then this time we cannot recover from the damaged idx files, but the next version of Overpass API shall get better protections in place.
It looks like copy_file(..) in the source code, in combination with the power outage, has written the same number of zero bytes as the amount of data it should have copied. This can happen if the file metadata has been set but the write cache for the actual file contents has never been flushed to disk.
If this is the problem, then more paranoia during the restart can avoid future repetitions of it.
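As an illustration of what “more paranoia” could mean at the shell level (a generic durable-copy pattern, not the fix that will actually go into Overpass), the key point is to flush the copied data before the new file replaces the old one:

# Hypothetical durable copy: copy to a temporary name, force the contents to disk,
# then rename into place and sync the directory entry so metadata and data agree.
safe_copy() {
  src=$1; dst=$2
  tmp="$dst.tmp.$$"
  cp "$src" "$tmp"
  sync "$tmp"                    # flush file contents (coreutils >= 8.24)
  mv "$tmp" "$dst"
  sync "$(dirname "$dst")"       # flush the directory entry
}

# Direction shown only as an example:
# safe_copy "$DB_DIR/nodes.bin.idx.shadow" "$DB_DIR/nodes.bin.idx"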