An unscheduled power outage just took out my Overpass server. I thought I might have gotten lucky, but no. When I try to restart it, I’m getting “Unsupported index file format version” errors for all the .idx files.
Has anyone ever managed to recover from this state without downloading a fresh database copy?
I have encountered this and I’ve never solved it. I always re-download the database and deal with a temporary outage for my use case. I would love it if overpass were resilient to this issue.
I’ve never had exactly that issue. But I’ve often had weird errors like that. Like @ZeLonewolf, I just reimport the Overpass database and forget about it.
If you still have them, could you please send me the *.idx files and the *.idx.shadow files, and/or run

for i in $DB_DIR/*.idx $DB_DIR/*.idx.shadow; do echo "$(basename $i) $(hexdump -C <$i | tail -n 1)"; done
[side note: the command above originally contained a typo and has since been corrected]
and send me the result (its output is much smaller and could in principle even be pasted here, as opposed to the larger files themselves)?
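For example, assuming the database lives under /opt/overpass/db (an assumption; adjust to your setup), that would be:

# Example invocation; /opt/overpass/db is an assumed database location.
DB_DIR=/opt/overpass/db
for i in $DB_DIR/*.idx $DB_DIR/*.idx.shadow; do echo "$(basename $i) $(hexdump -C <$i | tail -n 1)"; done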
In principle, the update process is designed such that at any given moment in time either the *.idx files or the *.idx.shadow files form a valid index, and I would be highly interested in seeing a state where this is no longer the case, in order to harden against it.
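A minimal sketch of such a check (not part of Overpass itself; it simply assumes that a healthy index never starts with 16 zero bytes) could look like this:

# Hypothetical sanity check: warn about index files whose first 16 bytes are all zero.
for i in "$DB_DIR"/*.idx "$DB_DIR"/*.idx.shadow; do
  [ -e "$i" ] || continue
  head=$(hexdump -n 16 -e '16/1 "%02x"' "$i")
  if [ "$head" = "00000000000000000000000000000000" ]; then
    echo "WARNING: all-zero header in $(basename "$i")"
  fi
done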
Unfortunately, I no longer have the files. But I will keep the offer in mind for next time. We’re on a 100-year-old power circuit, and although the local utility tries to keep it up, we do have unscheduled outages a couple of times a year.
So, I’m sure we’ll have another opportunity to debug this situation.
I’m sorry for the late answer. The database files are fine, but this is a new restriction policy intended for the public instances. Please turn it off by passing the parameter --duplicate-queries=yes to the dispatcher.
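For illustration only, in a typical launch script the flag would simply be appended to the dispatcher start line; the other options shown here are assumptions about such a script, not taken from this thread:

# Hypothetical dispatcher start line; --osm-base and --db-dir are the usual options,
# the duplicate-queries flag name is as given above.
"$EXEC_DIR"/dispatcher --osm-base --db-dir="$DB_DIR" --duplicate-queries=yes &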
Some background: there have been multiple occasions where people have sent the exact same request from dozens of IP addresses in a public cloud, multiple times per minute or even per second. As the public instances distribute resources per IP address, this can substantially clog them. Treating IP addresses from the same C-net as related is not a solution, because in many cases they are used independently, and many cloud providers use multiple disjoint IP address blocks.
For the time being, duplicate queries are also accepted on the public instances, as there is currently enough capacity.
We had another power outage last night. This time I was able to recover and get Overpass back up and running. So far it looks like it’s in good shape.
At first I had some trouble because the names of the lock files had changed and they weren’t being deleted properly, so I had to edit my scripts, and in the process I took the time to make them more robust.
The launch.sh script now checks whether each of the processes is running before starting them. If any process dies immediately after launch, the script shuts everything back down. And launch.sh doesn’t delete the shared memory or lock files unless the associated dispatcher process isn’t running.
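A rough sketch of that pattern, with assumed paths and process names rather than the actual script, might look like this:

#!/usr/bin/env bash
# Hypothetical launch sketch: only start the base dispatcher if it isn't already running,
# only remove leftover socket/shared-memory files when no dispatcher owns them,
# and shut back down if the process dies right after launch.
DB_DIR=/opt/overpass/db      # assumed location
EXEC_DIR=/opt/overpass/bin   # assumed location

if ! pgrep -f "dispatcher --osm-base" >/dev/null; then
  rm -f "$DB_DIR"/osm3s_osm_base /dev/shm/osm3s_osm_base   # file names vary by version
  "$EXEC_DIR"/dispatcher --osm-base --db-dir="$DB_DIR" &
  sleep 5
  if ! pgrep -f "dispatcher --osm-base" >/dev/null; then
    echo "dispatcher died immediately after launch, shutting everything down" >&2
    ./shutdown.sh
    exit 1
  fi
fi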
The shutdown.sh script is also more robust. I’ve cleaned up some of the scripting to better handle some edge cases.
I had posted the scripts in my diary, but I’ll be moving them to a wiki page sometime soon.
It would be great if you could turn all of this into a GitHub project for administering Overpass. It would help a lot with all the ad hoc scripting everyone is doing to keep Overpass running!
We already have one GitHub project for Overpass. If @drolbr would be interested in having some of this as contributions or combining it with existing scripts, maybe that would be a good place for it?
I’m highly enthusiastic to collect all the scripts that fix real-life problems.
I’d prefer to have them as separate scripts in pull requests for the time being, so that we have all of them in parallel and can reconcile them once we understand the details. Otherwise, we risk having some improvements in apply_to_osc.sh, some in fetch_osc_and_apply.sh, and none containing all of them.
@drolbr I’m using the docker image v0.7.61.8 and the docker healthcheck seems to run into this error about duplicate queries, so the container stays “unhealthy” and may get restarted by the docker daemon. I didn’t find any option to pass “--duplicate-queries” through an ENV var; maybe that one got lost? Is there a workaround?
To revive this old thread, I had another unscheduled power outage today and after restarting my Overpass server, updates are not being applied and I’m getting this error from apply_osc_to_db.sh:
File error caught: /opt/overpass/db/node_tags_global.bin.idx File_Blocks_Index: Unsupported index file format version 0 outside range [7512, 7600]
Reading XML file ... finished reading nodes. File error caught: /opt/overpass/db/nodes.bin.idx.shadow File_Blocks_Index: Unsupported index file format version 0 outside range [7512, 7600]
I shut down all the processes but otherwise left the server untouched.
Here’s the result of the hexdump command that was suggested earlier.
There are apparently no .shadow files other than osm_base_version.shadow and I can confirm that if there had been any .idx.shadow files, they would not have been deleted.
This is with Overpass version osm-3s_v0.7.61.8.
@drolbr do you have any suggestions for analysis or recovery?
I just noticed something that looks odd. In apply_osc_to_db.log, it appears to have skipped one diff when it restarted from the power outage (16:27:05 to 16:55:12).
2024-05-29 16:25:13: updating to 6110693
2024-05-29 16:25:16: update complete 6110693
2024-05-29 16:25:16: updating from 6110693
2024-05-29 16:25:21: updating from 6110693
2024-05-29 16:25:26: updating from 6110693
2024-05-29 16:25:31: updating from 6110693
2024-05-29 16:25:36: updating from 6110693
2024-05-29 16:25:41: updating from 6110693
2024-05-29 16:25:46: updating from 6110693
2024-05-29 16:25:51: updating from 6110693
2024-05-29 16:25:56: updating from 6110693
2024-05-29 16:26:01: updating from 6110693
2024-05-29 16:26:06: updating from 6110693
2024-05-29 16:26:06: updating to 6110694
2024-05-29 16:26:10: update complete 6110694
2024-05-29 16:26:10: updating from 6110694
2024-05-29 16:26:15: updating from 6110694
2024-05-29 16:26:20: updating from 6110694
2024-05-29 16:26:25: updating from 6110694
2024-05-29 16:26:30: updating from 6110694
2024-05-29 16:26:35: updating from 6110694
2024-05-29 16:26:40: updating from 6110694
2024-05-29 16:26:45: updating from 6110694
2024-05-29 16:26:50: updating from 6110694
2024-05-29 16:26:55: updating from 6110694
2024-05-29 16:27:00: updating from 6110694
2024-05-29 16:27:05: updating from 6110694
2024-05-29 16:55:12: updating from 6110695
2024-05-29 16:55:17: updating from 6110695
2024-05-29 16:55:22: updating from 6110695
2024-05-29 16:55:27: updating from 6110695
2024-05-29 16:55:32: updating from 6110695
2024-05-29 16:55:37: updating from 6110695
2024-05-29 16:55:37: updating to 6110696
Thank you for the quick reply, and once again I’m sorry for the recovery work this caused.
Please use this corrected command for the hexdump:

for i in $DB_DIR/*.idx $DB_DIR/*.idx.shadow; do echo "$(basename $i) $(hexdump -C <$i | tail -n 1)"; done
The one in the post above (now fixed) had a typo and showed one byte instead of one line.
I have no idea yet what exactly has happened. What is indeed a fatal problem is that there are zeros in the headers of most idx files (in fact, all idx files touched at 09:27), which should not be there.
I’ll link this thread from an issue in the issue tracker on GitHub to ensure it gets further scrutiny. Things that can and should be improved are ensuring that the replicate_id is set atomically, so that no diff can be skipped, and ensuring that shadow files are kept if the idx files are for some reason all-zero.
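To illustrate the atomic-update idea with a generic write-then-rename pattern (this is not the actual Overpass code, and the file layout is assumed):

# Hypothetical sketch: update the replicate_id state file atomically by writing to a
# temporary file, flushing it to disk, and renaming it into place.
new_id=6110695                       # example sequence number
tmp="$DB_DIR/replicate_id.tmp"
echo "$new_id" > "$tmp"
sync "$tmp"                          # flush the file data (coreutils >= 8.24)
mv "$tmp" "$DB_DIR/replicate_id"     # rename() is atomic within one filesystem
sync "$DB_DIR"                       # persist the directory entry as well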
I appreciate the assistance with debugging. If everything has gone well, I should have a backup of the db that is less than one week old. So, even in the worst case, recovery should not be difficult.
Thank you again. I have a hypothesis about what has happened. If it is true, then this time we cannot recover from the damaged idx files, but the next version of Overpass API shall get better protections in place.
It looks like copy_file(..) in the source code, in combination with the power outage, has written the same number of zero bytes as the amount of data it should have copied. This can happen if the file metadata has been set but the write cache for the actual file contents has never been flushed to disk.
If this is the problem, then more paranoia during the restart can avoid future repetitions of it.
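As an illustration of what “more paranoia” could mean at the shell level (a generic durable-copy pattern, not the fix that will actually go into Overpass), the key point is to flush the copied data before the new file replaces the old one:

# Hypothetical durable copy: copy to a temporary name, force the contents to disk,
# then rename into place and sync the directory entry so metadata and data agree.
safe_copy() {
  src=$1; dst=$2
  tmp="$dst.tmp.$$"
  cp "$src" "$tmp"
  sync "$tmp"                    # flush file contents (coreutils >= 8.24)
  mv "$tmp" "$dst"
  sync "$(dirname "$dst")"       # flush the directory entry
}

# Direction shown only as an example:
# safe_copy "$DB_DIR/nodes.bin.idx.shadow" "$DB_DIR/nodes.bin.idx"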