Improving Overpass resilience during uncontrolled shutdowns

I’ve been running my own Overpass server for a while, and in the past I’ve had some trouble recovering from power outages.

One issue I’ve identified is that an uncontrolled shutdown while the fetch_osc.sh script is downloading data can leave an incomplete file in the file system. When Overpass starts up again, the script begins downloading the next update and leaves the incomplete file in place. The apply_osc_to_db.sh script then attempts to process the incomplete file, and the update fails.

So I recently spent some time improving the fetch_osc.sh and apply_osc_to_db.sh scripts to make them more resilient, and while I was at it, I made some performance improvements as well.

Improvements to fetch_osc.sh

  • Better resilience
    • Downloads *.osc.gz and state.txt files with .tmp extensions, removing the .tmp extension only after verifying that the file format is correct (see the sketch after this list)
    • Uses curl’s automated retries instead of implementing its own retry mechanism
    • Fails quickly if the replication source is not available at startup (likely a configuration error)
    • Falls back gracefully if the replication source is temporarily unavailable or not producing updates at the expected frequency
    • Log messages have UTC timestamps for event correlation
    • Logs extended error information for download failures
    • Traps interrupts for controlled shutdowns
  • More efficient downloads
    • Uses curl’s TLS connection keepalive to reduce the number of TLS handshakes
    • Downloads updates in larger batches if the database is significantly behind the replication source to further reduce TLS handshakes
  • Lower latency
    • Aligns polling with the server’s update frequency instead of using a fixed period, reducing download latency from ~15 seconds to ~1 second
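
A minimal sketch of the download pattern used for each file, with an illustrative URL and file name (the real script batches many files through a single curl invocation driven by a config file):

#!/usr/bin/env bash
# Illustrative URL and output name; fetch_osc.sh derives these from the
# replicate ID.
URL="https://planet.openstreetmap.org/replication/minute/000/000/001.osc.gz"
OUT="001.osc.gz"

# Download to a .tmp name so an interrupted transfer never leaves a
# partial file under the final name; curl's built-in retries handle
# transient failures, and --keepalive-time lets consecutive requests
# reuse the TLS connection.
curl -fsSL \
  --retry 20 --retry-delay 15 \
  --connect-timeout 30 \
  --keepalive-time 20 \
  -o "$OUT.tmp" "$URL" || exit 1

# Verify the gzip structure before trusting the file, then rename;
# mv within the same filesystem is atomic.
if gunzip -t <"$OUT.tmp" 2>/dev/null; then
  mv "$OUT.tmp" "$OUT"
else
  rm -f "$OUT.tmp"
  exit 1
fi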

Improvements to apply_osc_to_db.sh

  • Better resilience
    • Error checking when unzipping update files
    • Upper limit on update batch size to reduce the potential penalty if a database commit fails
    • Log messages have UTC timestamps for event correlation
    • Cleans up orphaned directories in /tmp
    • Traps interrupts for controlled shutdowns
  • Lower latency
    • Aligns polling with the server’s update frequency instead of using a fixed period, reducing update processing latency from ~5 seconds to ~1 second (a sketch of this timing logic follows the list)
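
Both scripts share the same timing approach: instead of sleeping for a fixed period, they record the wall-clock time of the last successful update and sleep only until the next one is expected. A minimal sketch of the idea (51 seconds is the interval fetch_osc.sh uses):

EXPECTED_UPDATE_INTERVAL=51          # expected seconds between server updates
LAST_UPDATE_WALL_CLOCK=$(date +%s)   # recorded after each successful update

# Wake up when the next update should be available rather than polling
# on a fixed schedule.
NOW=$(date +%s)
SLEEP_TIME=$((LAST_UPDATE_WALL_CLOCK + EXPECTED_UPDATE_INTERVAL - NOW))
[[ $SLEEP_TIME -lt 0 ]] && SLEEP_TIME=0
sleep "$SLEEP_TIME"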

I think the revisions to the scripts do a lot to improve resilience during uncontrolled shutdowns. They also seem to improve throughput when the database is significantly behind the replication source and to reduce latency when the database is caught up.

I’d be happy for anyone else to use these scripts, and I welcome feedback or suggestions for improvements.
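
For reference, this is roughly how I invoke the scripts (the replicate directory path is illustrative):

./fetch_osc.sh auto https://planet.openstreetmap.org/replication/minute /path/to/replicate
./apply_osc_to_db.sh /path/to/replicate auto --meta=no

The script source code follows below.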


fetch_osc.sh

#!/usr/bin/env bash

# Copyright 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018 Roland Olbricht et al.
# With improvements in 2025 by Kai Johnson
#
# This file is part of Overpass_API.
#
# Overpass_API is free software: you can redistribute it and/or modify
# it under the terms of the GNU Affero General Public License as
# published by the Free Software Foundation, either version 3 of the
# License, or (at your option) any later version.
#
# Overpass_API is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU Affero General Public License
# along with Overpass_API. If not, see <https://www.gnu.org/licenses/>.

# ============================================================================
# Script: fetch_osc.sh
# Purpose: Downloads minutely OpenStreetMap change files from remote source
#          with atomic operations, integrity checking, and connection reuse
# ============================================================================

if [[ -z $3 ]]; then
{
  echo "Usage: $0 replicate_id source_url local_dir [sleep]"
  echo ""
  echo "  replicate_id: Starting replicate ID or 'auto' to resume from last fetch"
  echo "  source_url:   Remote replication source (e.g., https://planet.openstreetmap.org/replication/minute)"
  echo "  local_dir:    Local directory for downloaded files"
  echo "  sleep:        (Optional, ignored - kept for compatibility)"
  exit 0
}; fi

# ============================================================================
# CONFIGURATION
# ============================================================================

START_ID="$1"
SOURCE_URL="$2"
LOCAL_DIR="$3"

# Check for deprecated sleep parameter
if [[ -n "$4" ]]; then
  echo "WARNING: Sleep parameter is ignored (timing is now automatic)"
fi

# Download retry configuration
CURL_MAX_RETRIES=20       # Max download attempts before giving up
CURL_RETRY_DELAY=15       # Seconds between retry attempts
CURL_CONNECT_TIMEOUT=30   # Connection timeout in seconds
CURL_KEEPALIVE_TIME=20    # Seconds to retain previous connections
MAX_BATCH_SIZE=360        # Maximum OSC files per batch download

# Network outage handling
OUTAGE_RETRY_DELAY=60     # Seconds to wait during detected network outages
SOURCE_VERIFIED=false     # Flag to track if source URL has been verified

# Update timing configuration
EXPECTED_UPDATE_INTERVAL=51  # Seconds to wait before checking for next update
QUICK_RETRY_DELAY=1          # Seconds between quick retries
QUICK_RETRY_COUNT=10         # Number of quick retries before slow retry
SLOW_RETRY_DELAY=60          # Seconds between retries when updates delayed

# Timestamp tracking
LAST_UPDATE_WALL_CLOCK=      # Wall clock time when last update was downloaded

# Get execution directory (where binaries are located)
EXEC_DIR="$(dirname $0)/"
if [[ ! "${EXEC_DIR:0:1}" == "/" ]]; then
  EXEC_DIR="$(pwd)/$EXEC_DIR"
fi

# Get database directory from dispatcher
DB_DIR=$($EXEC_DIR/dispatcher --show-dir)

if [[ ! -d "$DB_DIR" ]]; then
  echo "ERROR: Database directory '$DB_DIR' does not exist"
  exit 1
fi

# Database state file (tracks what's been applied)
DB_STATE_FILE="$DB_DIR/replicate_id"

# Create local directory if needed
mkdir -p "$LOCAL_DIR"

# Log file
LOG_FILE="$LOCAL_DIR/fetch_osc.log"

# Fetch state tracking file
FETCH_STATE_FILE="$LOCAL_DIR/replicate_id"

# ============================================================================
# LOGGING
# ============================================================================

log_message()
{
  echo "$(date -u '+%F %T'): $1" >> "$LOG_FILE"
}

log_error()
{
  echo "$(date -u '+%F %T'): ERROR: $1" >> "$LOG_FILE"
}

# ============================================================================
# TIMING
# ============================================================================

calculate_sleep_time()
{
  if [[ -z "$LAST_UPDATE_WALL_CLOCK" ]]; then
    echo 15
    return
  fi

  local NOW=$(date +%s)
  local NEXT_CHECK=$((LAST_UPDATE_WALL_CLOCK + EXPECTED_UPDATE_INTERVAL))
  local SLEEP_TIME=$((NEXT_CHECK - NOW))

  if [[ $SLEEP_TIME -lt 0 ]]; then
    echo 0
  else
    echo $SLEEP_TIME
  fi
}

# ============================================================================
# SLEEPING
# ============================================================================

# Run sleep in the background and wait on it: bash delivers trapped
# signals immediately during `wait`, but would defer them until a
# foreground `sleep` finished.
sleep_with_interrupts()
{
  sleep $1 &
  wait $!
}

# ============================================================================
# REMOTE STATE CHECKING
# ============================================================================

# Fetch and parse remote state.txt to get latest available replicate ID
# During network outages (after source has been verified), waits patiently
# Outputs the sequence number to stdout, or nothing on failure
get_latest_available_id()
{
  local REMOTE_STATE="$LOCAL_DIR/state.txt"

  while true; do
    rm -f "$REMOTE_STATE"
    curl -fsSL \
      --keepalive-time $CURL_KEEPALIVE_TIME \
      --connect-timeout "$CURL_CONNECT_TIMEOUT" \
      --retry 3 \
      --retry-delay 5 \
      -o "$REMOTE_STATE" "$SOURCE_URL/state.txt" 2>/dev/null

    local CURL_EXIT=$?

    # If download succeeded, parse and return
    if [[ $CURL_EXIT -eq 0 && -s "$REMOTE_STATE" ]]; then
      local SEQ_LINE=$(grep -E '^sequenceNumber' "$REMOTE_STATE")
      if [[ -n "$SEQ_LINE" ]]; then
        # Parse the number (format is "sequenceNumber=12345")
        echo $((${SEQ_LINE:15} + 0))
        return 0
      fi
    fi

    # Download failed or file invalid
    # Check the SOURCE_VERIFIED flag (set in main loop, not here due to subshell)
    if [[ "$SOURCE_VERIFIED" == "true" ]]; then
      # Source was previously working - this is likely a network outage
      # Wait patiently and retry
      log_message "Unable to reach replication source (likely network outage), retrying in ${OUTAGE_RETRY_DELAY}s..."
      sleep_with_interrupts "$OUTAGE_RETRY_DELAY"
      # Continue loop to retry
    else
      # Source has never worked - might be a configuration error
      log_error "Cannot reach replication source: $SOURCE_URL"
      log_error "Please verify the source URL is correct"
      return 1
    fi
  done
}

# ============================================================================
# PATH CONVERSION
# ============================================================================

get_replicate_path()
{
  local ID=$1
  printf -v DIGIT3 %03u $(($ID % 1000))
  local ARG=$(($ID / 1000))
  printf -v DIGIT2 %03u $(($ARG % 1000))
  ARG=$(($ARG / 1000))
  printf -v DIGIT1 %03u $ARG
  REPLICATE_PATH="$DIGIT1/$DIGIT2/$DIGIT3"
}

# ============================================================================
# FILE VERIFICATION
# ============================================================================

verify_file()
{
  local FILE="$1"
  local TYPE="$2"

  if [[ ! -s "$FILE" ]]; then
    return 1
  fi

  if [[ "$TYPE" == "gzip" ]]; then
    gunzip -t <"$FILE" 2>/dev/null
    return $?
  elif [[ "$TYPE" == "text" ]]; then
    if ! grep -q "^sequenceNumber=" "$FILE" 2>/dev/null; then
      return 1
    fi
    if ! grep -q "^timestamp=" "$FILE" 2>/dev/null; then
      return 1
    fi
    return 0
  fi

  return 1
}

# ============================================================================
# BATCH DOWNLOAD WITH CONNECTION REUSE
# ============================================================================

download_replicate_batch()
{
  local START=$1
  local END=$2
  local BATCH_COUNT=$(($END - $START))

  local URL_LIST=""
  local STATE_FILES=""
  local OSC_FILES=""

  for (( ID=$START+1; ID<=$END; ID++ )); do
    get_replicate_path $ID

    local REMOTE_BASE="$SOURCE_URL/$REPLICATE_PATH"
    local LOCAL_DIR_PATH="$LOCAL_DIR/$DIGIT1/$DIGIT2"
    mkdir -p "$LOCAL_DIR_PATH"

    local OSC_FILE="$LOCAL_DIR_PATH/$DIGIT3.osc.gz"
    local STATE_FILE_LOCAL="$LOCAL_DIR_PATH/$DIGIT3.state.txt"

    if ! verify_file "$STATE_FILE_LOCAL" "text"; then
      URL_LIST="$URL_LIST $REMOTE_BASE.state.txt"
      STATE_FILES="$STATE_FILES $STATE_FILE_LOCAL"
    fi

    if ! verify_file "$OSC_FILE" "gzip"; then
      URL_LIST="$URL_LIST $REMOTE_BASE.osc.gz"
      OSC_FILES="$OSC_FILES $OSC_FILE"
    fi
  done

  if [[ -z "$URL_LIST" ]]; then
    if [[ $BATCH_COUNT -eq 1 ]]; then
      log_message "Downloaded OSC file $END (cached)"
    else
      log_message "Downloaded $BATCH_COUNT OSC files (all cached)"
    fi
    return 0
  fi

  local TEMP_CONFIG="$LOCAL_DIR/curl_batch.txt"
  rm -f "$TEMP_CONFIG"

  local URL_ARRAY=($URL_LIST)
  local STATE_ARRAY=($STATE_FILES)
  local OSC_ARRAY=($OSC_FILES)
  local STATE_IDX=0
  local OSC_IDX=0

  for URL in "${URL_ARRAY[@]}"; do
    if [[ $URL == *.state.txt ]]; then
      echo "url = \"$URL\"" >> "$TEMP_CONFIG"
      echo "output = \"${STATE_ARRAY[$STATE_IDX]}.tmp\"" >> "$TEMP_CONFIG"
      STATE_IDX=$((STATE_IDX + 1))
    else
      echo "url = \"$URL\"" >> "$TEMP_CONFIG"
      echo "output = \"${OSC_ARRAY[$OSC_IDX]}.tmp\"" >> "$TEMP_CONFIG"
      OSC_IDX=$((OSC_IDX + 1))
    fi
  done

  # Download all files with connection reuse
  local CURL_ERROR_LOG="$LOCAL_DIR/curl_error_$$.log"

  curl -fsSL \
    --keepalive-time $CURL_KEEPALIVE_TIME \
    --connect-timeout "$CURL_CONNECT_TIMEOUT" \
    --retry "$CURL_MAX_RETRIES" \
    --retry-delay "$CURL_RETRY_DELAY" \
    --parallel \
    --parallel-max 4 \
    --config "$TEMP_CONFIG" 2>"$CURL_ERROR_LOG"

  local CURL_EXIT=$?
  rm -f "$TEMP_CONFIG"

  if [[ $CURL_EXIT -ne 0 ]]; then
    log_error "Batch download failed (exit code: $CURL_EXIT)"

    # Log curl error details if available
    if [[ -s "$CURL_ERROR_LOG" ]]; then
      log_error "Curl error output:"
      while IFS= read -r line; do
        log_error "  $line"
      done < "$CURL_ERROR_LOG"
    fi

    # Provide context for common error codes
    case $CURL_EXIT in
      6)  log_error "Exit 6: Could not resolve host" ;;
      7)  log_error "Exit 7: Failed to connect to host" ;;
      16) log_error "Exit 16: HTTP/2 protocol error (connection reset or framing issue)" ;;
      18) log_error "Exit 18: Partial file transfer" ;;
      22) log_error "Exit 22: HTTP response code indicated failure" ;;
      23) log_error "Exit 23: Write error" ;;
      28) log_error "Exit 28: Operation timeout" ;;
      35) log_error "Exit 35: SSL connect error" ;;
      52) log_error "Exit 52: Empty reply from server" ;;
      55) log_error "Exit 55: Failed sending network data" ;;
      56) log_error "Exit 56: Failed receiving network data" ;;
    esac

    rm -f "$CURL_ERROR_LOG"

    # Clean up temp files
    for (( ID=$START+1; ID<=$END; ID++ )); do
      get_replicate_path $ID
      rm -f "$LOCAL_DIR/$REPLICATE_PATH.state.txt.tmp"
      rm -f "$LOCAL_DIR/$REPLICATE_PATH.osc.gz.tmp"
    done
    return 1
  fi

  # Clean up error log if successful
  rm -f "$CURL_ERROR_LOG"

  local SUCCESS=1
  for (( ID=$START+1; ID<=$END; ID++ )); do
    get_replicate_path $ID
    local LOCAL_DIR_PATH="$LOCAL_DIR/$DIGIT1/$DIGIT2"
    local STATE_FILE="$LOCAL_DIR_PATH/$DIGIT3.state.txt"
    local OSC_FILE="$LOCAL_DIR_PATH/$DIGIT3.osc.gz"

    if [[ -f "$STATE_FILE.tmp" ]]; then
      if verify_file "$STATE_FILE.tmp" "text"; then
        mv "$STATE_FILE.tmp" "$STATE_FILE"
      else
        log_error "State file failed verification: ID $ID"
        rm -f "$STATE_FILE.tmp"
        SUCCESS=0
      fi
    fi

    if [[ -f "$OSC_FILE.tmp" ]]; then
      if verify_file "$OSC_FILE.tmp" "gzip"; then
        mv "$OSC_FILE.tmp" "$OSC_FILE"
      else
        log_error "OSC file failed verification: ID $ID"
        rm -f "$OSC_FILE.tmp"
        SUCCESS=0
      fi
    fi
  done

  if [[ $SUCCESS -eq 1 ]]; then
    if [[ $BATCH_COUNT -eq 1 ]]; then
      log_message "Downloaded OSC file $END"
    else
      log_message "Downloaded $BATCH_COUNT OSC files ($(($START + 1)) to $END)"
    fi
  else
    if [[ $BATCH_COUNT -eq 1 ]]; then
      log_error "File failed verification: OSC file $END"
    else
      log_error "Some files failed verification in batch $(($START + 1)) to $END"
    fi
  fi

  return $((1 - SUCCESS))
}

# ============================================================================
# FETCH STATE MANAGEMENT
# ============================================================================

read_fetch_state()
{
  if [[ -f "$FETCH_STATE_FILE" && -s "$FETCH_STATE_FILE" ]]; then
    cat "$FETCH_STATE_FILE"
  else
    echo "0"
  fi
}

read_db_state()
{
  if [[ -f "$DB_STATE_FILE" && -s "$DB_STATE_FILE" ]]; then
    cat "$DB_STATE_FILE"
  else
    echo "0"
  fi
}

update_fetch_state()
{
  local NEW_ID=$1
  echo "$NEW_ID" > "$FETCH_STATE_FILE.tmp"
  mv "$FETCH_STATE_FILE.tmp" "$FETCH_STATE_FILE"
}

# ============================================================================
# SIGNAL HANDLERS
# ============================================================================

shutdown()
{
  log_message "Shutdown signal received, cleaning up..."
  rm -f "$LOCAL_DIR"/*.tmp
  log_message "Shutdown complete"
  exit 0
}

trap shutdown SIGTERM SIGINT

# ============================================================================
# MAIN EXECUTION
# ============================================================================

log_message "=========================================="
log_message "Starting fetch process"
log_message "Source: $SOURCE_URL"
log_message "Local directory: $LOCAL_DIR"
log_message "=========================================="

if [[ "$START_ID" == "auto" ]]; then
  # In auto mode, start from the database state (what's been applied)
  DB_ID=$(read_db_state)
  FETCH_ID=$(read_fetch_state)

  # Use whichever is higher (in case fetch was ahead when restarted)
  if [[ $FETCH_ID -gt $DB_ID ]]; then
    CURRENT_ID=$FETCH_ID
    log_message "Auto mode: resuming fetch from OSC file $CURRENT_ID (fetch ahead of apply)"
  else
    CURRENT_ID=$DB_ID
    log_message "Auto mode: resuming from OSC file $CURRENT_ID (database state)"
  fi
else
  CURRENT_ID=$START_ID
  log_message "Starting from OSC file $CURRENT_ID"
fi

while true; do
  MAX_AVAILABLE=$(get_latest_available_id)

  # get_latest_available_id will wait during outages if source was previously verified
  # If it returns without a value and source was never verified, there's a config error
  if [[ -z "$MAX_AVAILABLE" ]]; then
    log_error "Fatal: Cannot initialize - replication source unreachable"
    exit 1
  fi

  # Mark source as verified after first successful fetch
  if [[ "$SOURCE_VERIFIED" != "true" ]]; then
    SOURCE_VERIFIED=true
    log_message "Replication source verified and operational"
  fi

  if [[ $MAX_AVAILABLE -le $CURRENT_ID ]]; then
    SLEEP_TIME=$(calculate_sleep_time)

    if [[ $SLEEP_TIME -gt 0 ]]; then
      log_message "No new OSC files available (current: $CURRENT_ID), sleeping ${SLEEP_TIME}s"
      sleep_with_interrupts $SLEEP_TIME
      continue
    fi

    RETRY_COUNT=0
    while [[ $RETRY_COUNT -lt $QUICK_RETRY_COUNT ]]; do
      sleep_with_interrupts $QUICK_RETRY_DELAY
      RETRY_COUNT=$(($RETRY_COUNT + 1))

      MAX_AVAILABLE=$(get_latest_available_id)
      # get_latest_available_id will retry internally if SOURCE_VERIFIED=true
      # If it returns empty, there's a serious problem, but we should still
      # fall back to slow retry rather than exiting

      if [[ -n "$MAX_AVAILABLE" && $MAX_AVAILABLE -gt $CURRENT_ID ]]; then
        # New data found!
        break
      fi

      if [[ $RETRY_COUNT -lt $QUICK_RETRY_COUNT ]]; then
        log_message "Waiting for OSC file $((CURRENT_ID + 1)) (quick retry $RETRY_COUNT/$QUICK_RETRY_COUNT)"
      fi
    done

    # If still no new data after quick retries, fall back to slow retry
    if [[ -z "$MAX_AVAILABLE" || $MAX_AVAILABLE -le $CURRENT_ID ]]; then
      log_message "OSC file $((CURRENT_ID + 1)) not available, falling back to ${SLOW_RETRY_DELAY}s delays"
      sleep_with_interrupts $SLOW_RETRY_DELAY
      continue
    fi
  fi

  # Determine batch size
  BATCH_END=$CURRENT_ID
  for (( TEST_ID=$CURRENT_ID+1; TEST_ID<=$MAX_AVAILABLE && TEST_ID<=$CURRENT_ID+$MAX_BATCH_SIZE; TEST_ID++ )); do
    BATCH_END=$TEST_ID
  done

  BATCH_COUNT=$(($BATCH_END - $CURRENT_ID))
  if [[ $BATCH_COUNT -eq 1 ]]; then
    log_message "Fetching OSC file $BATCH_END"
  else
    log_message "Fetching $BATCH_COUNT OSC files ($(($CURRENT_ID + 1)) to $BATCH_END)"
  fi

  if download_replicate_batch $CURRENT_ID $BATCH_END; then
    LAST_UPDATE_WALL_CLOCK=$(date +%s)
    update_fetch_state $BATCH_END
    CURRENT_ID=$BATCH_END
  else
    log_error "Batch download failed, retrying"
    sleep_with_interrupts 60
  fi
done

apply_osc_to_db.sh

#!/usr/bin/env bash

# Copyright 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018 Roland Olbricht et al.
# With improvements in 2025 by Kai Johnson
#
# This file is part of Overpass_API.
#
# Overpass_API is free software: you can redistribute it and/or modify
# it under the terms of the GNU Affero General Public License as
# published by the Free Software Foundation, either version 3 of the
# License, or (at your option) any later version.
#
# Overpass_API is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU Affero General Public License
# along with Overpass_API. If not, see <https://www.gnu.org/licenses/>.

# ============================================================================
# Script: apply_osc_to_db.sh
# Purpose: Applies downloaded OSM change files to the Overpass database
#          in continuous batches with robust error handling
# ============================================================================

if [[ -z $3 ]]; then
{
  echo "Usage: $0 replicate_dir start_id --meta=(attic|yes|no)"
  echo ""
  echo "  replicate_dir: Directory containing downloaded replicate files"
  echo "  start_id:      Starting replicate ID or 'auto' to resume from database state"
  echo "  --meta:        Metadata handling mode (attic=full history, yes=metadata, no=current only)"
  exit 0
}; fi

# ============================================================================
# CONFIGURATION
# ============================================================================

REPLICATE_DIR="$1"
START_ID="$2"
META_ARG="$3"

# Parse metadata argument
META=
if [[ $META_ARG == "--meta=attic" ]]; then
  META="--keep-attic"
elif [[ $META_ARG == "--meta=yes" || $META_ARG == "--meta" ]]; then
  META="--meta"
elif [[ $META_ARG == "--meta=no" ]]; then
  META=
else
  echo "ERROR: You must specify --meta=attic, --meta=yes, or --meta=no"
  exit 1
fi

# Batch configuration
MAX_BATCH_SIZE=360          # Maximum OSC files per batch (6 hours)

# Timestamp tracking
EXPECTED_UPDATE_INTERVAL=57 # Seconds to wait before checking for next update
LAST_UPDATE_WALL_CLOCK=     # Wall clock time when last update was collected

# Get execution directory
EXEC_DIR="$(dirname $0)/"
if [[ ! ${EXEC_DIR:0:1} == "/" ]]; then
  EXEC_DIR="$(pwd)/$EXEC_DIR"
fi

# Convert replicate dir to absolute path
if [[ ! ${REPLICATE_DIR:0:1} == "/" ]]; then
  REPLICATE_DIR="$(pwd)/$REPLICATE_DIR"
fi

# Get database directory
DB_DIR=$($EXEC_DIR/dispatcher --show-dir)

if [[ ! -d "$DB_DIR" ]]; then
  echo "ERROR: Database directory '$DB_DIR' does not exist"
  exit 1
fi

# State file
STATE_FILE="$DB_DIR/replicate_id"

# Log file
LOG_FILE="$DB_DIR/apply_osc_to_db.log"

# Working directory for decompressed files
WORK_DIR=$(mktemp -d /tmp/osm-3s_update_XXXXXX)

# ============================================================================
# LOGGING
# ============================================================================

log_message()
{
  echo "$(date -u '+%F %T'): $1" | tee -a "$LOG_FILE"
}

log_error()
{
  echo "$(date -u '+%F %T'): ERROR: $1" | tee -a "$LOG_FILE" >&2
}

# ============================================================================
# STATE MANAGEMENT
# ============================================================================

read_current_state()
{
  if [[ -f "$STATE_FILE" && -s "$STATE_FILE" ]]; then
    cat "$STATE_FILE"
  else
    echo "0"
  fi
}

update_state()
{
  local NEW_ID=$1
  echo "$NEW_ID" > "$STATE_FILE.tmp"
  mv "$STATE_FILE.tmp" "$STATE_FILE"
}

# ============================================================================
# PATH CONVERSION
# ============================================================================

get_replicate_path()
{
  local ID=$1
  printf -v DIGIT3 %03u $(($ID % 1000))
  local ARG=$(($ID / 1000))
  printf -v DIGIT2 %03u $(($ARG % 1000))
  ARG=$(($ARG / 1000))
  printf -v DIGIT1 %03u $ARG
  REPLICATE_PATH="$DIGIT1/$DIGIT2/$DIGIT3"
}

# ============================================================================
# BATCH COLLECTION
# ============================================================================

collect_batch()
{
  local START=$1
  local CURRENT=$(($START + 1))

  BATCH_END=$START

  # Find contiguous downloaded files up to MAX_BATCH_SIZE
  for (( ID=$CURRENT; ID<=$START+$MAX_BATCH_SIZE; ID++ )); do
    get_replicate_path $ID

    local OSC_FILE="$REPLICATE_DIR/$REPLICATE_PATH.osc.gz"
    local STATE_FILE_LOCAL="$REPLICATE_DIR/$REPLICATE_PATH.state.txt"

    if [[ -f "$OSC_FILE" && -f "$STATE_FILE_LOCAL" ]]; then
      BATCH_END=$ID
    else
      break
    fi
  done

  if [[ $BATCH_END -le $START ]]; then
    return 1
  fi

  local COUNT=$(($BATCH_END - $START))
  if [[ $COUNT -eq 1 ]]; then
    log_message "Collected one OSC file: $BATCH_END"
  else
    log_message "Collected batch: $START to $BATCH_END ($COUNT OSC files)"
  fi
  return 0
}

# ============================================================================
# BATCH PREPARATION
# ============================================================================

prepare_batch()
{
  local START=$1
  local END=$2
  local COUNT=$(($END - $START))
  local OUT_DIR="$3"

  mkdir -p "$OUT_DIR"

  if [[ $COUNT -eq 1 ]]; then
    log_message "Decompressing OSC file: $END"
  else
    log_message "Decompressing batch: $START to $END ($COUNT OSC files)"
  fi

  for (( ID=$START+1; ID<=$END; ID++ )); do
    get_replicate_path $ID

    local OSC_GZ="$REPLICATE_DIR/$REPLICATE_PATH.osc.gz"

    if [[ ! -f "$OSC_GZ" ]]; then
      log_error "Missing file: $OSC_GZ"
      return 1
    fi

    printf -v OUT_FILE %09u $ID

    gunzip <"$OSC_GZ" >"$OUT_DIR/$OUT_FILE.osc"

    if [[ $? -ne 0 ]]; then
      log_error "Failed to decompress $OSC_GZ"
      return 1
    fi
  done

  return 0
}

# ============================================================================
# TIMESTAMP EXTRACTION
# ============================================================================

get_timestamp()
{
  local ID=$1
  get_replicate_path $ID

  local STATE_FILE_LOCAL="$REPLICATE_DIR/$REPLICATE_PATH.state.txt"

  local TIMESTAMP_LINE=""
  local WAIT_COUNT=0
  while [[ -z "$TIMESTAMP_LINE" && $WAIT_COUNT -lt 10 ]]; do
    TIMESTAMP_LINE=$(grep "^timestamp" <"$STATE_FILE_LOCAL" 2>/dev/null)
    if [[ -z "$TIMESTAMP_LINE" ]]; then
      sleep_with_interrupts 1
      WAIT_COUNT=$(($WAIT_COUNT + 1))
    fi
  done

  if [[ -z "$TIMESTAMP_LINE" ]]; then
    log_error "Could not extract timestamp from $STATE_FILE_LOCAL"
    return 1
  fi

  DATA_VERSION=${TIMESTAMP_LINE:10}
  return 0
}

# ============================================================================
# BATCH APPLICATION
# ============================================================================

apply_batch()
{
  local OSC_DIR="$1"

  log_message "Applying batch to database (version: ${DATA_VERSION//\\/})"

  cd "$EXEC_DIR"

  local SUCCESS=0
  local RETRY_COUNT=0
  local MAX_RETRIES=5

  while [[ $SUCCESS -eq 0 && $RETRY_COUNT -lt $MAX_RETRIES ]]; do
    ./update_from_dir --osc-dir="$OSC_DIR" --version="$DATA_VERSION" $META --flush-size=0
    local EXITCODE=$?

    if [[ $EXITCODE -eq 0 ]]; then
      SUCCESS=1
    elif [[ $EXITCODE -eq 15 ]]; then
      log_message "Received SIGTERM, shutting down gracefully"
      exit 15
    else
      RETRY_COUNT=$(($RETRY_COUNT + 1))
      log_error "update_from_dir failed (exit code: $EXITCODE), retry $RETRY_COUNT/$MAX_RETRIES"

      if [[ $RETRY_COUNT -lt $MAX_RETRIES ]]; then
        sleep_with_interrupts 60
      fi
    fi
  done

  cd - >/dev/null

  if [[ $SUCCESS -eq 0 ]]; then
    log_error "Failed to apply batch after $MAX_RETRIES attempts"
    return 1
  fi

  return 0
}

# ============================================================================
# TIMING
# ============================================================================

calculate_sleep_time()
{
  if [[ -z "$LAST_UPDATE_WALL_CLOCK" ]]; then
    echo 5
    return
  fi

  local NOW=$(date +%s)
  local NEXT_CHECK=$((LAST_UPDATE_WALL_CLOCK + EXPECTED_UPDATE_INTERVAL))
  local SLEEP_TIME=$((NEXT_CHECK - NOW))

  if [[ $SLEEP_TIME -lt 1 ]]; then
    echo 1
  else
    echo $SLEEP_TIME
  fi
}

# ============================================================================
# SLEEPING
# ============================================================================

# Run sleep in the background and wait on it: bash delivers trapped
# signals immediately during `wait`, but would defer them until a
# foreground `sleep` finished.
sleep_with_interrupts()
{
  sleep $1 &
  wait $!
}

# ============================================================================
# SIGNAL HANDLERS
# ============================================================================

shutdown()
{
  log_message "Shutdown signal received, cleaning up..."

  if [[ -n "$CHILD_PID" && $CHILD_PID -gt 0 ]]; then
    kill $CHILD_PID 2>/dev/null
  fi

  rm -rf "$WORK_DIR"

  log_message "Shutdown complete"
  exit 0
}

trap shutdown SIGTERM SIGINT

# ============================================================================
# MAIN EXECUTION
# ============================================================================

log_message "=========================================="
log_message "Starting apply process"
log_message "Replicate directory: $REPLICATE_DIR"
log_message "Metadata mode: $META_ARG"
log_message "=========================================="

if [[ "$START_ID" == "auto" ]]; then
  CURRENT_ID=$(read_current_state)
  log_message "Auto mode: resuming from OSC file $CURRENT_ID"
else
  CURRENT_ID=$START_ID
  log_message "Starting from OSC file $CURRENT_ID"
fi

# Run database migration
log_message "Running database migration"
cd "$EXEC_DIR"
./migrate_database --migrate &
CHILD_PID=$!
wait "$CHILD_PID"
CHILD_PID=
cd - >/dev/null

# Delete old temporary directories, but keep the one we just created
log_message "Deleting old temporary files and directories"
for OLD_DIR in /tmp/osm-3s_update_*; do
  [[ "$OLD_DIR" != "$WORK_DIR" ]] && rm -rf "$OLD_DIR"
done

while true; do
  # Try to collect a batch
  if ! collect_batch $CURRENT_ID; then
    SLEEP_TIME=$(calculate_sleep_time)
    log_message "No new OSC files available, waiting $SLEEP_TIME s"
    sleep_with_interrupts $SLEEP_TIME
    continue
  fi

  LAST_UPDATE_WALL_CLOCK=$(date +%s)

  # Prepare processing directory
  PROCESS_DIR="$WORK_DIR/process_$BATCH_END"
  rm -rf "$PROCESS_DIR"
  mkdir -p "$PROCESS_DIR"

  # Decompress batch
  if ! prepare_batch $CURRENT_ID $BATCH_END "$PROCESS_DIR"; then
    log_error "Failed to prepare batch, skipping"
    rm -rf "$PROCESS_DIR"
    sleep_with_interrupts 60
    continue
  fi

  # Get timestamp
  if ! get_timestamp $BATCH_END; then
    log_error "Failed to get timestamp, skipping batch"
    rm -rf "$PROCESS_DIR"
    sleep_with_interrupts 60
    continue
  fi

  # Apply batch
  if ! apply_batch "$PROCESS_DIR"; then
    log_error "Failed to apply batch, will retry"
    rm -rf "$PROCESS_DIR"
    sleep_with_interrupts 60
    continue
  fi

  # Success - update state
  update_state $BATCH_END
  CURRENT_ID=$BATCH_END

  log_message "Successfully applied batch up to OSC file $CURRENT_ID"

  # Clean up
  rm -rf "$PROCESS_DIR"
done