Schema Validation for Spatial Records

Schema validation for spatial records is the foundational control in any OGC-compliant data publishing pipeline. When geographic datasets move from ingestion to catalog publication, structural integrity, coordinate reference system (CRS) compliance, and metadata alignment must be verified before indexing. Without deterministic validation, malformed geometries, missing mandatory fields, and non-conforming XML/JSON payloads corrupt downstream services, break spatial queries, and trigger compliance failures in government and enterprise environments.

This guide outlines a production-ready validation workflow for spatial records, focusing on OGC API Features, GeoJSON, and metadata schemas. It provides tested Python patterns, step-by-step execution logic, and troubleshooting strategies for platform engineers and spatial data publishers.

Prerequisites & Environment Baseline

Before implementing validation pipelines, ensure your environment meets the following baseline requirements:

  • Python 3.10+ with virtual environment isolation
  • Core Libraries: pydantic>=2.0, jsonschema, shapely>=2.0, pyproj, lxml, fastjsonschema (optional for high-throughput validation)
  • Schema References: RFC 7946: The GeoJSON Format, JSON Schema Official Specification, ISO 19115 XSD, DCAT-AP RDF/JSON-LD profiles
  • CRS Registry Access: EPSG.io or pyproj bundled database for URN resolution
  • Catalog Integration Context: Familiarity with how validation outputs feed into Spatial Metadata & Catalog Integration pipelines, particularly for automated rejection routing and retry queues

Familiarity with OGC API Features conformance classes and with metadata publishing standards (see Implementing ISO 19115 Metadata Standards) will significantly reduce integration friction. Before deploying validation logic to production, engineers should also provision structured logging in JSON format and a dead-letter queue (DLQ) for failed payloads.

Deterministic Validation Workflow

A robust spatial validation pipeline follows a deterministic sequence. Each stage isolates failure modes so that downstream systems receive only verified payloads.

1. Ingest & Normalize

Accept raw payloads (GeoJSON FeatureCollection, XML metadata, or OGC API JSON responses). Normalize encoding, strip BOMs, and enforce UTF-8. Parse into a structured intermediate representation. At this stage, validate content-type headers and reject payloads exceeding configured size thresholds to prevent memory exhaustion.
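The normalization step can be sketched with the standard library alone. This is a minimal illustration, not a definitive implementation: the size limit, the content-type allow-list, and the names normalize_payload and IngestError are assumptions introduced here.

```python
import codecs
import json
from typing import Any

MAX_PAYLOAD_BYTES = 50 * 1024 * 1024  # hypothetical limit; tune per deployment
ALLOWED_CONTENT_TYPES = {"application/geo+json", "application/json"}  # assumed allow-list


class IngestError(ValueError):
    """Raised when a payload is rejected before any schema validation."""


def normalize_payload(raw: bytes, content_type: str) -> Any:
    # Compare only the media type, ignoring parameters such as "; charset=utf-8".
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type not in ALLOWED_CONTENT_TYPES:
        raise IngestError(f"unsupported content type: {media_type!r}")
    # Reject oversized payloads before decoding or parsing them.
    if len(raw) > MAX_PAYLOAD_BYTES:
        raise IngestError(f"payload exceeds {MAX_PAYLOAD_BYTES} bytes")
    # Strip a leading UTF-8 byte order mark if present.
    if raw.startswith(codecs.BOM_UTF8):
        raw = raw[len(codecs.BOM_UTF8):]
    try:
        text = raw.decode("utf-8")  # strict decoding rejects non-UTF-8 input
    except UnicodeDecodeError as exc:
        raise IngestError(f"payload is not valid UTF-8: {exc}") from exc
    try:
        return json.loads(text)
    except json.JSONDecodeError as exc:
        raise IngestError(f"payload is not valid JSON: {exc}") from exc
```

XML metadata would need an equivalent path through an XML parser; only the JSON branch is shown.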

2. Structural Schema Validation

Validate against the appropriate JSON Schema or XSD. For vector features, enforce the GeoJSON structural rules defined in RFC 7946. For metadata, validate against ISO 19115 or the DCAT-AP profile (see DCAT-AP for Spatial Data Portals). Reject early if mandatory top-level members (type, features), per-feature members (type, geometry, properties), or required XML namespaces are missing. Structural validation should run before geometry parsing so that no geometry computation is spent on fundamentally broken documents.
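As a rough sketch of this stage, the jsonschema library can report every structural violation in one pass through a validator's iter_errors method. The schema below is deliberately minimal and illustrative, not the full RFC 7946 schema, and structural_errors is a name assumed for this example.

```python
from typing import Any, Dict, List

from jsonschema import Draft7Validator

# Illustrative schema only: it checks the mandatory members and nothing else.
FEATURE_COLLECTION_SCHEMA: Dict[str, Any] = {
    "type": "object",
    "required": ["type", "features"],
    "properties": {
        "type": {"const": "FeatureCollection"},
        "features": {
            "type": "array",
            "items": {
                "type": "object",
                "required": ["type", "geometry", "properties"],
                "properties": {"type": {"const": "Feature"}},
            },
        },
    },
}


def structural_errors(
    payload: Any, schema: Dict[str, Any] = FEATURE_COLLECTION_SCHEMA
) -> List[str]:
    validator = Draft7Validator(schema)
    # Sort by document path so reports are stable between runs.
    found = sorted(
        validator.iter_errors(payload),
        key=lambda e: [str(p) for p in e.absolute_path],
    )
    return [
        f"{'/'.join(str(p) for p in e.absolute_path) or '<root>'}: {e.message}"
        for e in found
    ]
```

An empty list means the document is structurally sound and can proceed to geometry parsing.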

3. Geometry & Topology Verification

Extract geometry objects and validate using Shapely. Check for:

  • Valid coordinate sequences (at least 2 positions for a LineString; at least 4 positions for each Polygon ring, with the first and last identical)
  • Self-intersections, duplicate vertices, or ring orientation violations
  • Coordinate bounds within reasonable geographic limits (e.g., longitude between -180 and 180, latitude between -90 and 90)
  • Empty or null geometries flagged explicitly rather than silently dropped
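The coordinate-level checks can be expressed without any geometry library. The sketch below follows RFC 7946, which requires two or more positions for a LineString and four or more for a closed linear ring. It assumes longitude/latitude input, covers only LineString and Polygon, and leaves topology checks such as self-intersection to Shapely; coordinate_errors is a hypothetical helper name.

```python
from typing import Any, Dict, List


def coordinate_errors(geometry: Dict[str, Any]) -> List[str]:
    """Check position counts, ring closure, and lon/lat ranges."""
    errors: List[str] = []
    gtype = geometry.get("type")
    coords = geometry.get("coordinates") or []

    if gtype == "LineString":
        lines = [coords]
        if len(coords) < 2:
            errors.append("LineString needs at least 2 positions")
    elif gtype == "Polygon":
        lines = coords
        for i, ring in enumerate(coords):
            if len(ring) < 4:
                errors.append(f"ring {i}: needs at least 4 positions")
            elif ring[0] != ring[-1]:
                errors.append(f"ring {i}: first and last positions differ")
    else:
        return [f"geometry type not covered by this sketch: {gtype!r}"]

    # Range check, assuming positions are (longitude, latitude[, elevation]).
    for line in lines:
        for lon, lat, *_ in line:
            if not (-180 <= lon <= 180 and -90 <= lat <= 90):
                errors.append(f"position ({lon}, {lat}) outside lon/lat range")
    return errors
```

Running these cheap checks first means obviously malformed features never reach the more expensive topology validation.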

4. CRS & Coordinate Integrity

Verify that the declared CRS matches the coordinate values. Use pyproj to resolve EPSG codes, URNs (urn:ogc:def:crs:EPSG::4326), or WKT strings. Reject records with mismatched axis orders (e.g., lat/long vs long/lat) or deprecated CRS identifiers. Axis order deserves particular care: GeoJSON (RFC 7946) stores positions as longitude, latitude, whereas EPSG:4326 formally defines its axes as latitude, longitude. Normalize all coordinates to WGS84 (EPSG:4326) if your catalog requires a single canonical projection.
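To make the axis-order concern concrete, the following sketch resolves an identifier with pyproj and reports the direction of its first axis. The helper name describe_crs and the shape of the returned dictionary are assumptions for illustration.

```python
from typing import Any, Dict

from pyproj import CRS


def describe_crs(identifier: str) -> Dict[str, Any]:
    # from_user_input accepts EPSG codes, URNs, WKT, and PROJ strings,
    # and raises pyproj.exceptions.CRSError for unresolvable input.
    crs = CRS.from_user_input(identifier)
    first_axis = crs.axis_info[0].direction  # e.g. "north" or "east"
    return {
        "epsg": crs.to_epsg(),
        "is_geographic": crs.is_geographic,
        "first_axis": first_axis,
        # True when the CRS definition lists the east-pointing axis first.
        "east_first": first_axis == "east",
    }
```

For example, describe_crs("urn:ogc:def:crs:EPSG::4326") reports a north-pointing first axis, while describe_crs("OGC:CRS84") reports an east-pointing one. A validator can compare this flag with the order in which a source actually wrote its coordinates.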

5. Metadata Alignment & Catalog Handoff

Cross-reference spatial bounds with declared metadata extents. Ensure temporal coverage, licensing, and attribution fields comply with organizational policies. Once validation passes, emit a structured validation report alongside the payload for catalog ingestion. Failed records route to quarantine with explicit error codes for remediation.
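A minimal sketch of the extent cross-check and the resulting report follows. It assumes both extents are (west, south, east, north) tuples in the same CRS and does not handle extents that cross the antimeridian; the function names and report fields are hypothetical.

```python
from typing import Any, Dict, List, Sequence


def extent_errors(
    feature_bounds: Sequence[float],
    declared_bbox: Sequence[float],
    tolerance: float = 1e-9,
) -> List[str]:
    minx, miny, maxx, maxy = feature_bounds
    west, south, east, north = declared_bbox
    if west > east:
        return ["declared extent crosses the antimeridian; not handled in this sketch"]
    errors: List[str] = []
    if minx < west - tolerance or maxx > east + tolerance:
        errors.append("features extend beyond the declared longitude extent")
    if miny < south - tolerance or maxy > north + tolerance:
        errors.append("features extend beyond the declared latitude extent")
    return errors


def build_report(record_id: str, stage_errors: Dict[str, List[str]]) -> Dict[str, Any]:
    # Keep only stages that actually produced errors.
    failed = {stage: errs for stage, errs in stage_errors.items() if errs}
    return {
        "record_id": record_id,
        "status": "rejected" if failed else "accepted",
        "errors": failed,
    }
```

The report travels with the payload, so the catalog or the quarantine queue can act on it without re-running validation.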

Production-Ready Python Implementation

The following implementation demonstrates a type-annotated validation function built on jsonschema, Shapely, and pyproj. It separates structural, geometric, and CRS checks, and it collects errors across all stages before raising a single exception.

from typing import Any, Dict, List

from jsonschema.validators import validator_for
from pyproj import CRS, Transformer
from shapely.geometry import shape
from shapely.validation import make_valid


class SpatialValidationError(Exception):
    def __init__(self, errors: List[str]):
        self.errors = errors
        super().__init__(f"Validation failed: {'; '.join(errors)}")


def validate_spatial_record(
    payload: Dict[str, Any],
    json_schema: Dict[str, Any],
    target_crs: str = "EPSG:4326",
) -> Dict[str, Any]:
    errors: List[str] = []

    # 1. Structural JSON Schema validation. iter_errors reports every
    #    violation, whereas jsonschema.validate raises only one.
    validator = validator_for(json_schema)(json_schema)
    for err in validator.iter_errors(payload):
        errors.append(f"Schema violation: {err.message}")

    # 2. CRS validation & transformation readiness. This runs before the
    #    geometry loop because the bounds check depends on the CRS type.
    #    The "crs" member is a legacy (pre-RFC 7946) GeoJSON convention;
    #    when absent, coordinates are assumed to be WGS84 lon/lat.
    crs_member = payload.get("crs") or {}
    declared_crs = (crs_member.get("properties") or {}).get("name")
    is_geographic = True
    if declared_crs:
        try:
            crs_obj = CRS.from_user_input(declared_crs)
            is_geographic = crs_obj.is_geographic
            if not crs_obj.equals(CRS.from_user_input(target_crs)):
                # Building the transformer confirms that a path to the
                # target CRS exists; no coordinates are transformed here.
                Transformer.from_crs(crs_obj, target_crs, always_xy=True)
        except Exception as e:
            errors.append(f"CRS resolution failed: {declared_crs} - {e}")

    # 3. Geometry validation
    features = payload.get("features", [])
    if not isinstance(features, list):
        features = []
    for idx, feat in enumerate(features):
        geom = feat.get("geometry") if isinstance(feat, dict) else None
        if not geom:
            errors.append(f"Feature {idx}: missing geometry")
            continue

        try:
            shp = shape(geom)
            if shp.is_empty:
                errors.append(f"Feature {idx}: empty geometry")
                continue
            if not shp.is_valid:
                # make_valid only tests repairability here; the repaired
                # geometry is not written back to the payload.
                shp = make_valid(shp)
                if not shp.is_valid:
                    errors.append(f"Feature {idx}: irreparable invalid geometry")
                    continue
            # Enforce lon/lat bounds only for geographic coordinates
            if is_geographic:
                minx, miny, maxx, maxy = shp.bounds
                if not (-180 <= minx <= maxx <= 180 and -90 <= miny <= maxy <= 90):
                    errors.append(f"Feature {idx}: coordinates out of geographic bounds")
        except Exception as e:
            errors.append(f"Feature {idx}: geometry parse error - {e}")

    if errors:
        raise SpatialValidationError(errors)

    return payload

Reliability Considerations

  • Fail-Fast vs. Accumulate: The function accumulates errors rather than failing on the first issue. This enables batch remediation and reduces round-trip retries.
  • Geometry Repair: make_valid attempts to fix common topology issues (e.g., self-intersecting polygons) before rejecting. Use cautiously; log all auto-repairs for audit trails.
  • CRS Transformation: Constructing a pyproj.Transformer confirms that a transformation to the target CRS exists, without transforming any coordinates or mutating the original payload. Actual reprojection should occur downstream during catalog indexing.
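One way to keep auto-repairs auditable is to record why a geometry was invalid and how its type changed. This is a sketch under stated assumptions: the logger name, the audit field names, and the helper repair_with_audit are hypothetical, while explain_validity and make_valid are Shapely's own functions.

```python
import logging
from typing import Any, Dict, Optional, Tuple

from shapely.geometry import shape
from shapely.geometry.base import BaseGeometry
from shapely.validation import explain_validity, make_valid

logger = logging.getLogger("spatial.validation")  # assumed logger name


def repair_with_audit(
    geometry: Dict[str, Any], feature_id: str
) -> Tuple[BaseGeometry, Optional[Dict[str, str]]]:
    geom = shape(geometry)
    if geom.is_valid:
        return geom, None  # nothing repaired, nothing to audit
    repaired = make_valid(geom)
    audit = {
        "feature_id": feature_id,
        "reason": explain_validity(geom),  # human-readable cause
        "type_before": geom.geom_type,
        "type_after": repaired.geom_type,  # repairs can change the type
    }
    logger.warning("geometry auto-repaired", extra={"audit": audit})
    return repaired, audit
```

Because a repair can turn a Polygon into a MultiPolygon or a GeometryCollection, the recorded type change lets reviewers decide whether the repaired geometry is acceptable for publication.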

Error Routing & Retry Architecture

Validation failures should never block the entire ingestion pipeline. Implement a tiered routing strategy:

  1. Hard Failures: Structural schema violations, unparseable payloads, or missing mandatory metadata route immediately to a DLQ with status: rejected. These require manual intervention or upstream source correction.
  2. Soft Failures: Recoverable geometry issues, deprecated CRS identifiers, or missing optional fields route to a retry queue with exponential backoff. Attach a remediation script that attempts normalization before re-ingestion.
  3. Audit Logging: Emit structured JSON logs containing record_id, error_codes, validation_stage, and payload_hash. Integrate with observability stacks (OpenTelemetry, Prometheus, or ELK) to track validation success rates over time.
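The tiers above might be wired together roughly as follows. The stage names treated as hard failures, the queue labels, and the backoff parameters are all assumptions for illustration; a production version would typically add random jitter to the delay.

```python
import json
from typing import Dict, List

HARD_STAGES = {"structure", "metadata"}  # assumed names of non-retryable stages


def route(stage_errors: Dict[str, List[str]]) -> str:
    """Return 'dlq', 'retry', or 'catalog' for a validated record."""
    failed = {stage for stage, errs in stage_errors.items() if errs}
    if failed & HARD_STAGES:
        return "dlq"
    if failed:
        return "retry"
    return "catalog"


def backoff_delay(attempt: int, base: float = 2.0, cap: float = 300.0) -> float:
    # Exponential backoff: base * 2^attempt seconds, limited to the cap.
    return min(cap, base * (2 ** attempt))


def audit_log_line(
    record_id: str, stage_errors: Dict[str, List[str]], payload_hash: str
) -> str:
    # One JSON object per record, using the field names listed above.
    return json.dumps(
        {
            "record_id": record_id,
            "validation_stage": sorted(s for s, e in stage_errors.items() if e),
            "error_codes": sorted(c for errs in stage_errors.values() for c in errs),
            "payload_hash": payload_hash,
        },
        sort_keys=True,
    )
```

Keeping the routing decision a pure function of the per-stage errors makes it straightforward to test and to replay against quarantined records.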

Government and enterprise environments often require immutable audit trails. Store validation reports alongside catalog entries using content-addressable storage (e.g., S3 with SHA-256 keys) to satisfy compliance audits and data lineage requirements.
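A content-addressable key can be derived by hashing a canonical serialization of the report, as in this sketch. The key layout and the prefix are assumptions, and the upload itself (for example to S3) is left out.

```python
import hashlib
import json
from typing import Any, Dict


def content_key(report: Dict[str, Any], prefix: str = "validation-reports") -> str:
    # Canonical form: sorted keys and fixed separators, so logically equal
    # reports always serialize to the same bytes.
    canonical = json.dumps(
        report, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    digest = hashlib.sha256(canonical).hexdigest()
    # Sharding by the first two hex characters keeps key prefixes balanced.
    return f"{prefix}/{digest[:2]}/{digest}.json"
```

Since the key is a function of the content, writing the same report twice is idempotent, and any later modification yields a different key rather than overwriting the audited version.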

Catalog Integration & Indexing Readiness

Once records pass validation, they must be prepared for spatial indexing. Catalog systems expect consistent bounding boxes, normalized CRS, and machine-readable metadata. Validation outputs should include:

  • Pre-computed WKT bounding boxes
  • Standardized property dictionaries stripped of validation artifacts
  • Provenance tags indicating validation version and timestamp
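A handoff record carrying these three outputs could be assembled as in the sketch below. The "_validation" key prefix used to mark validation artifacts, the version string, and the field names are hypothetical.

```python
from datetime import datetime, timezone
from typing import Any, Dict, Sequence

VALIDATOR_VERSION = "0.1.0"  # hypothetical version tag


def bbox_wkt(minx: float, miny: float, maxx: float, maxy: float) -> str:
    # Closed rectangular ring, counterclockwise from the lower-left corner.
    return (
        f"POLYGON (({minx} {miny}, {maxx} {miny}, "
        f"{maxx} {maxy}, {minx} {maxy}, {minx} {miny}))"
    )


def catalog_entry(properties: Dict[str, Any], bounds: Sequence[float]) -> Dict[str, Any]:
    # Drop keys that the validator added for its own bookkeeping.
    clean = {k: v for k, v in properties.items() if not k.startswith("_validation")}
    return {
        "bbox_wkt": bbox_wkt(*bounds),
        "properties": clean,
        "provenance": {
            "validator_version": VALIDATOR_VERSION,
            "validated_at": datetime.now(timezone.utc).isoformat(),
        },
    }
```

Here bounds would usually come from the validated geometry, for instance Shapely's bounds tuple.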

These artifacts feed directly into Spatial Metadata & Catalog Integration workflows, enabling automated search indexing, facet generation, and spatial query optimization. By decoupling validation from indexing, platform teams can scale horizontally: validation nodes handle CPU-intensive geometry checks, while catalog nodes focus on Elasticsearch/PostGIS ingestion.

Conclusion

Schema validation for spatial records is not a single gate but a continuous, multi-stage control process. By enforcing structural compliance, verifying geometry integrity, resolving CRS ambiguities, and routing failures deterministically, engineering teams prevent downstream corruption and maintain OGC compliance. Implementing the patterns outlined here reduces catalog downtime, accelerates metadata publishing, and ensures spatial datasets remain query-ready across enterprise and government platforms.