What is the difference between CSW 2.0.2 and OGC API – Records?

CSW 2.0.2 exchanges XML over HTTP GET (key-value pairs), HTTP POST, or SOAP, with operations like GetRecords and GetRecordById. OGC API – Records is its REST/JSON successor, described with OpenAPI and using path-based resource IDs and GeoJSON or JSON-LD responses. Both can expose the same ISO 19115 metadata; only the protocol layer differs.

How do I handle records that fail ISO 19115 XSD validation?

Route failed records to a quarantine queue with a structured error envelope (record ID, field path, XSD error message, timestamp). Apply fallback template rules for mandatory-but-absent fields such as identificationInfo and spatialRepresentationInfo, then re-queue for validation rather than silently dropping records.

Which search engine is best for spatial catalog queries?

Elasticsearch and OpenSearch both support geo_shape (BKD-tree backed) and geo_point field types suitable for bounding-box and point-in-polygon queries. Choose OpenSearch for fully open-source deployments or AWS managed service parity; choose Elasticsearch if you are already on the Elastic stack. Both have Python clients with bulk indexing helpers.

Spatial Metadata & Catalog Integration

Effective spatial data publishing depends on a robust, discoverable metadata layer that connects raw geospatial assets to enterprise discovery portals, government open-data registries, and federated search networks. For GIS platform engineers and Python backend developers, this means building and operating a metadata pipeline that can extract, normalize, validate, index, and synchronize records at scale — while remaining compliant with ISO 19115, DCAT-AP, CSW 2.0.2, and the emerging OGC API – Records standard. Without systematic automation across every stage, spatial catalogs fragment into inconsistent, non-compliant collections that block downstream consumers and accumulate technical debt faster than any manual stewardship team can clear.

This guide covers the complete engineering picture: pipeline architecture, metadata standards and their mapping relationships, Python implementation patterns, search index configuration, cross-portal synchronization strategies, and operational readiness requirements.

1. Architecture Overview

A production spatial catalog is a distributed metadata pipeline, not a monolithic database. The architecture must accommodate heterogeneous source formats — PostGIS tables, GeoPackage files, cloud object storage containing GeoTIFFs or COG files, and legacy shapefiles alongside live OGC service endpoints like WMS and WFS — while serving multiple downstream consumers through standards-aligned APIs.

The five functional layers are:

Ingestion Layer — format-aware connectors that detect source types, manage credentials, issue GetCapabilities requests against live OGC services, and pull embedded metadata (GDAL tags, ISO XML sidecars, FGDC headers).
Transformation and Validation Engine — normalizes raw extracts into target schemas (ISO 19115, DCAT-AP), enforces mandatory field constraints, applies vocabulary mappings, and routes failures to quarantine queues.
Catalog Backend — stores canonical records with versioning, lineage tracking, and provenance chains; exposes CSW 2.0.2 and OGC API – Records endpoints.
Search Index — decoupled from the backend for independent scaling; optimized for multidimensional spatial, temporal, and faceted queries using BKD-tree-backed geo_shape fields.
Synchronization Orchestrator — manages scheduled harvests, webhook-triggered incremental updates, and cross-portal replication with idempotent execution and structured audit logs.

Treating metadata as event-driven data — pushed onto message brokers (RabbitMQ or Kafka) between layers — rather than writing synchronously to a single database is what prevents ingestion bottlenecks during peak data registration bursts and enables independent horizontal scaling of each layer.

2. Metadata Standards and Their Relationships

ISO 19115 — the canonical dataset descriptor

The ISO 19115 family defines comprehensive schemas for dataset identification, spatial representation, distribution, and lineage. ISO 19115-1:2014 is the current base standard; ISO 19115-3 packages the XML implementation schemas (replacing the older ISO 19139:2007 encoding that most legacy catalogs still produce). The key top-level XML element is MD_Metadata, which contains:

identificationInfo (MD_DataIdentification) — title, abstract, keywords, geographic extent, temporal extent
spatialRepresentationInfo (MD_VectorSpatialRepresentation or MD_GridSpatialRepresentation) — geometry type, cell size
referenceSystemInfo (MD_ReferenceSystem) — the CRS authority code (e.g. EPSG:4326)
distributionInfo (MD_Distribution) — format, transfer options, online resources
dataQualityInfo (DQ_DataQuality) — lineage statements, conformance reports

Implementing ISO 19115 Metadata Standards covers the full element hierarchy and Python serialization patterns using lxml.

DCAT-AP — bridging GIS metadata to open-data ecosystems

DCAT-AP (Data Catalog Vocabulary Application Profile) is an RDF-based standard used by European national portals, the EU Open Data Portal, and increasingly by US federal FGDA reporting. It models catalogs as dcat:Catalog, datasets as dcat:Dataset, and distributions as dcat:Distribution. A critical mapping challenge is that ISO 19115’s identificationInfo/citation/title maps to dct:title, while distributionInfo/transferOptions/onLine/linkage maps to dcat:accessURL — but the CRS information in referenceSystemInfo has no direct DCAT-AP equivalent and must be expressed as a dct:conformsTo reference to the EPSG registry URI.

DCAT-AP for Spatial Data Portals details the RDF mapping and how to produce application/ld+json serializations that satisfy both INSPIRE and OGC API – Records clients.
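
As a rough illustration of the mapping described above, the sketch below builds a small JSON-LD dictionary from three ISO-derived values. The function name `to_dcat_jsonld` and its arguments are hypothetical, and the EPSG URI pattern is an assumption based on the OGC definitions server; a real DCAT-AP export carries many more properties.

```python
from typing import Any

# Assumed URI pattern for EPSG codes on the OGC definitions server.
EPSG_URI_TEMPLATE = "http://www.opengis.net/def/crs/EPSG/0/{code}"


def to_dcat_jsonld(title: str, access_url: str, crs_auth_code: str) -> dict[str, Any]:
    """Map a few ISO 19115 values onto DCAT terms (illustrative subset only)."""
    authority, _, code = crs_auth_code.partition(":")
    dataset: dict[str, Any] = {
        "@context": {
            "dcat": "http://www.w3.org/ns/dcat#",
            "dct": "http://purl.org/dc/terms/",
        },
        "@type": "dcat:Dataset",
        # identificationInfo/citation/title -> dct:title
        "dct:title": title,
        # distributionInfo/transferOptions/onLine/linkage -> dcat:accessURL
        "dcat:distribution": [
            {"@type": "dcat:Distribution", "dcat:accessURL": {"@id": access_url}}
        ],
    }
    # referenceSystemInfo has no direct equivalent, so point at the CRS definition.
    if authority.upper() == "EPSG" and code.isdigit():
        dataset["dct:conformsTo"] = {"@id": EPSG_URI_TEMPLATE.format(code=code)}
    return dataset
```

Records whose CRS code is not a recognizable EPSG code simply omit dct:conformsTo here; whether to quarantine them instead is a pipeline decision.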

CSW 2.0.2 vs OGC API – Records

The Catalogue Service for the Web (CSW) 2.0.2 is the incumbent OGC catalog protocol. It accepts key-value-pair requests over HTTP GET or XML request bodies over HTTP POST (optionally wrapped in SOAP). The four operations that matter most to read-only clients are:

Operation	Purpose	Key parameters
`GetCapabilities`	Describe service and supported filter encodings	`SERVICE=CSW`, `REQUEST=GetCapabilities`
`GetRecords`	Query records using OGC Filter Encoding or CQL	`SERVICE`, `VERSION`, `REQUEST`, `TYPENAMES`, `OUTPUTSCHEMA`
`GetRecordById`	Retrieve a single record by identifier	`SERVICE`, `VERSION`, `REQUEST`, `ID`, `OUTPUTSCHEMA`
`DescribeRecord`	Return the schema for a record type	`SERVICE`, `VERSION`, `REQUEST`, `TYPENAME`

OUTPUTSCHEMA controls which metadata profile is returned: http://www.isotc211.org/2005/gmd for ISO 19139, http://www.opengis.net/cat/csw/2.0.2 for Dublin Core, or a DCAT-AP URI for RDF output.

OGC API – Records (the REST successor) replaces XML envelopes with OpenAPI paths. Records are addressed at /collections/{collectionId}/items/{recordId} and queried with URL query parameters (bbox, datetime, q, type). Responses default to GeoJSON or JSON-LD. Content negotiation via Accept headers allows the same endpoint to serve legacy XML consumers and modern JSON clients without a separate adapter layer.

Dimension	CSW 2.0.2	OGC API – Records
Protocol	HTTP GET (KVP) / POST (XML), optional SOAP	REST / OpenAPI
Record addressing	`GetRecordById?ID=…`	`/items/{recordId}`
Spatial filter	OGC Filter Encoding XML	`bbox=minx,miny,maxx,maxy`
Temporal filter	`ogc:PropertyIsGreaterThan` on `apiso:Modified`	`datetime=2024-01-01/..`
Metadata profiles	ISO 19139 XML, Dublin Core, DCAT RDF	GeoJSON + JSON-LD
Content negotiation	`OUTPUTSCHEMA` parameter	HTTP `Accept` header
Auth model	IP-allow or HTTP Basic	OAuth2 / OIDC

Both protocols can expose the same underlying ISO 19115 records — the difference is entirely in the transport and serialization contract. Most production deployments must support both during a transition period, which favours a backend that serializes on the fly rather than storing format-specific copies.
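
To make the OGC API – Records side concrete, here is a minimal sketch of how the query parameters described above (bbox, datetime, q) could be assembled into an items URL. The helper name `build_records_url` and the example host are hypothetical; only the path template and parameter names come from the text.

```python
from __future__ import annotations

from urllib.parse import urlencode


def build_records_url(
    base_url: str,
    collection_id: str,
    bbox: tuple[float, float, float, float] | None = None,
    datetime_interval: str | None = None,
    q: str | None = None,
    limit: int = 10,
) -> str:
    """Compose an /items URL; bbox is (minx, miny, maxx, maxy), longitude first."""
    params: dict[str, str] = {"limit": str(limit)}
    if bbox is not None:
        params["bbox"] = ",".join(str(v) for v in bbox)
    if datetime_interval:
        params["datetime"] = datetime_interval   # e.g. "2024-01-01/.."
    if q:
        params["q"] = q
    # Keep commas, slashes, and colons readable instead of percent-encoding them.
    query = urlencode(params, safe=",/:")
    return f"{base_url.rstrip('/')}/collections/{collection_id}/items?{query}"
```

For example, `build_records_url("https://catalog.example.org", "main", bbox=(10.0, 47.0, 15.0, 52.0), q="flood risk")` yields a URL that can be passed directly to an HTTP client.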

3. Automated Ingestion and Validation Pipelines

Extraction strategies by source type

Source	Primary extraction tool	Metadata location
PostGIS / PostgreSQL	`psycopg2` + geometry queries	Column comments, `geometry_columns` view
GeoPackage	GDAL `ogrinfo` or `fiona`	`gpkg_contents`, `gpkg_metadata` tables
GeoTIFF / COG	`rasterio` + GDAL	TIFF tags, XML sidecar `.aux.xml`
Shapefile	`ogrinfo` or `geopandas`	`.prj` for CRS, `.cpg` for encoding; no native metadata
WMS / WFS endpoint	HTTP `GetCapabilities`	`Layer` elements in capabilities XML
S3 / object storage	`boto3` list + per-object `HEAD`	Object metadata headers, sidecar JSON

The automated metadata harvesting workflows guide covers each connector in depth, including how to reconstruct missing attributes using deterministic fallback rules when source data lacks embedded documentation.

Python ingestion skeleton

The following pattern shows the extraction and normalization core. It targets Python 3.10+ and depends on requests, lxml, pyproj, and rasterio, all installable via pip.

from __future__ import annotations

import hashlib
import json
from dataclasses import dataclass, field
from datetime import datetime, timezone
from pathlib import Path
from typing import Any

import rasterio
from lxml import etree
from pyproj import CRS


@dataclass
class RawMetadataRecord:
    source_uri: str
    source_format: str          # "geotiff" | "wms_capabilities" | "geopackage" | ...
    extracted_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    raw_payload: dict[str, Any] = field(default_factory=dict)


@dataclass
class CanonicalRecord:
    """Normalized ISO 19115-aligned record ready for validation and indexing."""
    record_id: str
    title: str
    abstract: str
    keywords: list[str]
    bbox: tuple[float, float, float, float]   # (west, south, east, north) in EPSG:4326
    crs_auth_code: str                         # e.g. "EPSG:32633"
    date_published: str                         # ISO 8601
    source_uri: str
    lineage: str = ""
    contact_email: str = ""
    use_constraints: str = ""


def extract_from_geotiff(path: Path) -> RawMetadataRecord:
    from pyproj import Transformer  # local import keeps the function self-contained

    with rasterio.open(path) as ds:
        crs_auth = None
        bbox_wgs84 = None
        if ds.crs is not None:
            crs = CRS.from_user_input(ds.crs)
            crs_auth = crs.to_authority()   # e.g. ("EPSG", "32633"); None if no match
            wgs84 = CRS.from_epsg(4326)
            if crs.equals(wgs84):
                bbox_wgs84 = tuple(ds.bounds)
            else:
                # Reproject the whole extent to EPSG:4326. transform_bounds (pyproj 3.1+)
                # densifies the edges, so it does not under-report extents the way
                # transforming only two corners can for projected CRSs.
                transformer = Transformer.from_crs(crs, wgs84, always_xy=True)
                bbox_wgs84 = transformer.transform_bounds(*ds.bounds)

        return RawMetadataRecord(
            source_uri=str(path),
            source_format="geotiff",
            raw_payload={
                "title": path.stem,
                "crs_auth": crs_auth,
                "bbox_wgs84": bbox_wgs84,   # None when the file carries no CRS
                "driver": ds.driver,
                "count": ds.count,
                "tags": ds.tags(),
            },
        )


def normalize(raw: RawMetadataRecord, fallback_contact: str = "") -> CanonicalRecord:
    """Apply transformation rules to produce a CanonicalRecord.

    Missing mandatory fields trigger fallback template values rather than
    raising exceptions — the validation stage decides whether to admit or quarantine.
    """
    p = raw.raw_payload
    auth = p.get("crs_auth") or ("EPSG", "4326")
    return CanonicalRecord(
        record_id=hashlib.sha256(raw.source_uri.encode()).hexdigest()[:16],
        title=p.get("title") or Path(raw.source_uri).stem,
        abstract=p.get("abstract") or f"Dataset extracted from {raw.source_format} source.",
        keywords=p.get("keywords") or [],
        bbox=p.get("bbox_wgs84") or (-180.0, -90.0, 180.0, 90.0),
        crs_auth_code=f"{auth[0]}:{auth[1]}",
        date_published=p.get("date_published") or raw.extracted_at.date().isoformat(),
        source_uri=raw.source_uri,
        contact_email=p.get("contact_email") or fallback_contact,
        use_constraints=p.get("use_constraints") or "otherRestrictions",
    )

The SRS and Coordinate Reference System Handling guide explains why passing always_xy=True to Transformer.from_crs is mandatory: without it, pyproj follows the axis order declared in the CRS definition, which for EPSG:4326 is (latitude, longitude) rather than (longitude, latitude), silently inverting every bounding box you write to the index.

Validation before admission

Every CanonicalRecord must pass structural and semantic checks before entering the catalog backend. Mandatory field presence, valid EPSG authority codes, and non-degenerate bounding boxes can be checked in pure Python. XSD validation against the ISO 19139 schema requires lxml with the schema document tree:

from lxml import etree


def validate_iso19139_xml(xml_bytes: bytes, schema_path: Path) -> list[str]:
    """Return a list of XSD error messages; empty list means valid."""
    with schema_path.open("rb") as f:
        schema_doc = etree.parse(f)
    schema = etree.XMLSchema(schema_doc)
    doc = etree.fromstring(xml_bytes)
    schema.validate(doc)
    return [str(e) for e in schema.error_log]
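
The pure-Python checks mentioned before the XSD validator could look roughly like the sketch below. It returns error envelopes in the shape used by the quarantine queue. The function name `check_record` and every error code other than MISSING_MANDATORY are hypothetical, and the CRS check only verifies the AUTHORITY:CODE format, not that the code exists in a registry.

```python
import re
from datetime import datetime, timezone

AUTH_CODE_RE = re.compile(r"^[A-Z]+:\d+$")   # format check only, e.g. "EPSG:32633"


def check_record(
    record_id: str,
    title: str,
    bbox: tuple[float, float, float, float],
    crs_auth_code: str,
) -> list[dict[str, str]]:
    """Return structured error envelopes; an empty list means the record passes."""
    errors: list[dict[str, str]] = []

    def fail(field_path: str, error_code: str) -> None:
        errors.append({
            "record_id": record_id,
            "field_path": field_path,
            "error_code": error_code,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        })

    if not title.strip():
        fail("identificationInfo/citation/title", "MISSING_MANDATORY")
    if not AUTH_CODE_RE.match(crs_auth_code):
        fail("referenceSystemInfo", "INVALID_CRS_CODE")

    west, south, east, north = bbox   # expected in EPSG:4326, longitude first
    in_range = (
        -180.0 <= west <= 180.0 and -180.0 <= east <= 180.0
        and -90.0 <= south <= 90.0 and -90.0 <= north <= 90.0
    )
    if not in_range:
        fail("identificationInfo/extent", "BBOX_OUT_OF_RANGE")
    elif west == east or south >= north:
        # west > east is allowed: it can describe an antimeridian-crossing extent.
        fail("identificationInfo/extent", "BBOX_DEGENERATE")
    return errors
```

Point datasets legitimately have zero-area extents, so a real pipeline may want to relax the degenerate check for them.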

Records that fail validation are routed to a quarantine queue — a simple PostgreSQL table works well — with a structured error envelope: {"record_id": "…", "field_path": "identificationInfo/citation/title", "error_code": "MISSING_MANDATORY", "timestamp": "…"}. This allows data stewards to triage failures without blocking the ingestion of valid records. The schema validation for spatial records page details how to add JSON Schema validation for OGC API – Records payloads alongside XSD validation for ISO 19139.

4. Search Index Configuration and Query Optimization

Decoupling the catalog backend (system of record) from the search index (query engine) allows independent scaling and specialized mapping configurations. Elasticsearch and OpenSearch both provide native geo_shape and geo_point field types backed by BKD trees, which handle catalog bounding-box queries efficiently and combine them with full-text and faceted filters in a single request, something that is harder to achieve with PostGIS spatial indexes alone once a catalog grows past the 10,000-record scale.

Index mapping for spatial records

{
  "mappings": {
    "properties": {
      "record_id":       {"type": "keyword"},
      "title":           {"type": "text", "analyzer": "english"},
      "abstract":        {"type": "text", "analyzer": "english"},
      "keywords":        {"type": "keyword"},
      "bbox":            {"type": "geo_shape"},
      "date_published":  {"type": "date", "format": "strict_date"},
      "crs_auth_code":   {"type": "keyword", "doc_values": true},
      "source_format":   {"type": "keyword", "doc_values": true},
      "use_constraints": {"type": "keyword"}
    }
  }
}

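Keep doc_values enabled on faceted fields (crs_auth_code, source_format, keywords); it is already the default for keyword fields, so the explicit setting above mainly documents intent. Keep _source enabled for record reconstruction: with _source disabled the update API is unavailable, so every change to a mutable catalog record requires re-indexing the full document rather than applying a partial update.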
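
The canonical bounding box is a plain (west, south, east, north) tuple, so it has to be converted into a geo_shape envelope before indexing. The sketch below shows one way to do that and to build a newline-delimited bulk request body. The helper names and the index name "catalog-records" are hypothetical.

```python
import json
from typing import Any


def to_index_document(
    record_id: str,
    title: str,
    bbox: tuple[float, float, float, float],
    date_published: str,
) -> dict[str, Any]:
    """Convert canonical fields into a document matching the mapping above."""
    west, south, east, north = bbox
    return {
        "record_id": record_id,
        "title": title,
        # geo_shape envelopes are [[minLon, maxLat], [maxLon, minLat]].
        "bbox": {"type": "envelope", "coordinates": [[west, north], [east, south]]},
        "date_published": date_published,
    }


def to_bulk_body(index_name: str, documents: list[dict[str, Any]]) -> str:
    """Build an NDJSON bulk body: one action line followed by one source line."""
    lines: list[str] = []
    for doc in documents:
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc["record_id"]}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"   # the bulk API expects a trailing newline
```

Using record_id as the document _id keeps re-indexing idempotent: harvesting the same source twice overwrites the document rather than duplicating it.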

Query routing pattern

Multidimensional catalog queries should apply filters in selectivity order to prune the candidate set before scoring:

Spatial filter (geo_shape intersection with the user’s bounding box) — most selective for spatially heterogeneous catalogs
Temporal filter (date_published range) — eliminates stale records early
Keyword/facet filters (terms on keywords, source_format) — cheap bucket pruning
Full-text relevance (BM25 on title + abstract) — applied last across the pruned set

def build_catalog_query(
    bbox: tuple[float, float, float, float] | None,
    date_from: str | None,
    date_to: str | None,
    keyword: str | None,
    free_text: str | None,
) -> dict:
    """Compose an Elasticsearch/OpenSearch query dict from catalog filter inputs."""
    filters: list[dict] = []

    if bbox:
        west, south, east, north = bbox
        filters.append({
            "geo_shape": {
                "bbox": {
                    "shape": {
                        "type": "envelope",
                        "coordinates": [[west, north], [east, south]]
                    },
                    "relation": "intersects"
                }
            }
        })

    if date_from or date_to:
        date_range: dict = {}
        if date_from:
            date_range["gte"] = date_from
        if date_to:
            date_range["lte"] = date_to
        filters.append({"range": {"date_published": date_range}})

    if keyword:
        filters.append({"term": {"keywords": keyword}})

    must: list[dict] = []
    if free_text:
        must.append({
            "multi_match": {
                "query": free_text,
                "fields": ["title^2", "abstract"],
                "type": "best_fields"
            }
        })

    return {
        "query": {
            "bool": {
                "filter": filters,
                "must": must or [{"match_all": {}}]
            }
        }
    }

Boosting title by a factor of 2 (title^2) is a standard practice for catalog search because title matches are more precise signals than abstract matches. Pre-compute geohash aggregation buckets at indexing time if your portal renders a heatmap of dataset density — computing them on-the-fly per query is expensive once the catalog exceeds 50,000 records.

5. Protocol Comparison: CSW 2.0.2 vs OGC API – Records

The table in Section 2 covers the protocol-level differences. This section focuses on the implementation trade-offs that affect backend engineers.

Axis order: CSW 2.0.2 GetRecords spatial filters use gml:Envelope with srsName="EPSG:4326". The envelope comes from GML 3.1.1 (the version referenced by Filter Encoding 1.1), and its coordinates follow the latitude-first axis order defined for EPSG:4326. This is the same trap as WMS 1.3.0 BBOX, documented in detail in the handling spatial reference mismatches in OGC requests guide. OGC API – Records bbox parameters use longitude-first order (minLon,minLat,maxLon,maxLat) by default, removing this ambiguity.

Authentication: CSW 2.0.2 has no normative authentication mechanism — implementations use IP allowlists, HTTP Basic, or custom token headers that vary between vendors (GeoServer, pycsw, deegree). OGC API – Records is designed for OAuth2/OIDC from the outset, with token introspection at the API gateway level.

Paging: CSW 2.0.2 uses startPosition and maxRecords parameters (1-indexed). OGC API – Records uses limit together with a standardized next link in the response; implementations that also expose an offset parameter count from 0. Mapping between the two in a federated harvester therefore requires shifting offsets by one.

Filter languages: CSW 2.0.2 supports OGC Filter Encoding 1.1 (XML) and optionally CQL (Common Query Language) as a text-based alternative. OGC API – Records supports CQL2-Text and CQL2-JSON as its normative filter languages. CQL2-JSON is more machine-friendly for Python clients:

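import json
import requests

# Property names ("bbox", "date_published") mirror the index mapping above;
# check the server's queryables for the names it actually exposes.
cql2_filter = {
    "op": "and",
    "args": [
        {
            "op": "s_intersects",
            "args": [
                {"property": "bbox"},
                {"type": "Polygon", "coordinates": [[[10,47],[15,47],[15,52],[10,52],[10,47]]]}
            ]
        },
        {
            "op": ">=",
            # CQL2-JSON wraps date literals so they are not compared as plain strings.
            "args": [{"property": "date_published"}, {"date": "2023-01-01"}]
        }
    ]
}

# The filter travels as a URL query parameter holding the serialized JSON.
resp = requests.get(
    "https://catalog.example.org/collections/main/items",
    params={"filter": json.dumps(cql2_filter), "filter-lang": "cql2-json"},
    headers={"Accept": "application/geo+json"},
    timeout=30,
)
resp.raise_for_status()
records = resp.json()["features"]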
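
The paging translation mentioned above comes down to an off-by-one shift. A minimal sketch, assuming the OGC API – Records server exposes a 0-indexed offset parameter; the helper names are hypothetical:

```python
from collections.abc import Iterator


def offset_to_csw(offset: int, limit: int) -> dict[str, int]:
    """Translate 0-indexed offset/limit into 1-indexed CSW paging parameters."""
    if offset < 0 or limit < 1:
        raise ValueError("offset must be >= 0 and limit must be >= 1")
    return {"startPosition": offset + 1, "maxRecords": limit}


def csw_to_offset(start_position: int, max_records: int) -> dict[str, int]:
    """Translate 1-indexed CSW paging parameters into offset/limit."""
    if start_position < 1 or max_records < 1:
        raise ValueError("startPosition and maxRecords must be >= 1")
    return {"offset": start_position - 1, "limit": max_records}


def iter_pages(total: int, limit: int) -> Iterator[tuple[dict[str, int], dict[str, int]]]:
    """Yield matching (OGC API, CSW) paging parameters for every page."""
    for offset in range(0, total, limit):
        yield {"offset": offset, "limit": limit}, offset_to_csw(offset, limit)
```

Where a server provides a next link, following it is generally safer than computing offsets by hand; the translation matters mainly when one harvester state has to drive both protocols.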

6. Production Implementation Patterns

Layered Python architecture

A production catalog backend follows a four-layer request path:

HTTP Request
  → Gateway (auth, rate limiting, content negotiation)
  → Request Validator (parameter parsing, CQL filter compilation)
  → Query Router (dispatch to search index or catalog backend)
  → Serializer (ISO 19139 XML | GeoJSON | JSON-LD | Dublin Core)
  → HTTP Response

The serializer layer is where most protocol-specific complexity lives. Writing ISO 19139 XML requires careful namespace management — the gmd, gco, gml, and srv prefixes must all be registered, and lxml ElementMaker objects help avoid manual {namespace}localName concatenation:

from lxml import etree
from lxml.builder import ElementMaker

GMD = "http://www.isotc211.org/2005/gmd"
GCO = "http://www.isotc211.org/2005/gco"
GML = "http://www.opengis.net/gml/3.2"

gmd = ElementMaker(namespace=GMD, nsmap={"gmd": GMD, "gco": GCO, "gml": GML})
gco = ElementMaker(namespace=GCO, nsmap={"gco": GCO})


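def record_to_iso19139(record: "CanonicalRecord") -> bytes:
    """Serialize a CanonicalRecord to ISO 19139 XML bytes.

    Minimal skeleton: a record that passes XSD validation also needs
    gmd:contact and gmd:dateStamp on MD_Metadata and gmd:language on
    MD_DataIdentification.
    """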
    root = gmd.MD_Metadata(
        gmd.fileIdentifier(gco.CharacterString(record.record_id)),
        gmd.identificationInfo(
            gmd.MD_DataIdentification(
                gmd.citation(
                    gmd.CI_Citation(
                        gmd.title(gco.CharacterString(record.title)),
                        gmd.date(
                            gmd.CI_Date(
                                gmd.date(gco.Date(record.date_published)),
                                gmd.dateType(
                                    gmd.CI_DateTypeCode(
                                        "publication",  # element text mirrors codeListValue, not the date
                                        codeList="http://standards.iso.org/ittf/PubliclyAvailableStandards/ISO_19139_Schemas/resources/codelist/ML_gmxCodelists.xml#CI_DateTypeCode",
                                        codeListValue="publication",
                                    )
                                ),
                            )
                        ),
                    )
                ),
                gmd.abstract(gco.CharacterString(record.abstract)),
            )
        ),
    )
    return etree.tostring(root, pretty_print=True, xml_declaration=True, encoding="UTF-8")

For GeoJSON output (OGC API – Records), a CanonicalRecord maps directly to a GeoJSON Feature with a geometry of type Polygon (the bounding box) and the remaining fields in properties. Adding @context and @type fields converts this to JSON-LD without duplicating any data.
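
A minimal sketch of that GeoJSON mapping follows. `MiniRecord` is a trimmed, hypothetical stand-in for CanonicalRecord so the example stays self-contained, and the JSON-LD context URL is whatever the deployment publishes; both are assumptions for illustration.

```python
from __future__ import annotations

from dataclasses import asdict, dataclass
from typing import Any


@dataclass
class MiniRecord:
    """Hypothetical, reduced stand-in for CanonicalRecord."""
    record_id: str
    title: str
    abstract: str
    bbox: tuple[float, float, float, float]   # (west, south, east, north)


def to_geojson_feature(record: MiniRecord, jsonld_context: str | None = None) -> dict[str, Any]:
    """Map a record to a GeoJSON Feature whose geometry is the bbox polygon."""
    west, south, east, north = record.bbox
    # Exterior ring in counterclockwise order, closed on the first vertex.
    ring = [[west, south], [east, south], [east, north], [west, north], [west, south]]
    properties = asdict(record)
    properties.pop("record_id")   # exposed as the Feature id instead
    properties.pop("bbox")        # exposed as the geometry instead
    feature: dict[str, Any] = {
        "type": "Feature",
        "id": record.record_id,
        "geometry": {"type": "Polygon", "coordinates": [ring]},
        "properties": properties,
    }
    if jsonld_context is not None:
        # "@type" maps to "type" in the context, so only "@context" is added here.
        feature["@context"] = jsonld_context
    return feature
```

Extents that cross the antimeridian (west greater than east) need to be split into two polygons; this sketch does not handle that case.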

Orchestration with Apache Airflow

Scheduled harvesting jobs benefit from Airflow’s DAG-level retry semantics and XCom for passing record counts between tasks. A minimal harvesting DAG has four tasks: detect_changes → extract_and_normalize → validate_and_admit → reindex. The validate_and_admit task should push quarantined record IDs to an XCom key so downstream alerting tasks can report validation failure rates without scanning the quarantine table directly.

Automating GeoServer with the Python REST API covers adjacent Python orchestration patterns for layer publishing workflows that feed into catalog ingestion.

7. Operational Considerations

Caching strategy

CSW GetCapabilities responses are expensive to generate (they enumerate all supported schemas, filter capabilities, and output formats) and change infrequently. Cache them at the reverse proxy level with a Cache-Control: public, max-age=3600 header. GetRecords responses that include a bbox filter should not be cached at the CDN because spatial query results are highly variable, but single-record GetRecordById responses are safe to cache by record ID with a reasonable TTL (15–60 minutes depending on update frequency).

For the search index, pre-warm frequently used filter combinations (e.g. source_format=wms or crs_auth_code=EPSG:4326) using index aliases that point to pre-filtered index views. This reduces query latency for the most common portal browsing patterns.

Request coalescing for harvesting bursts

Catalog harvesters, especially those driven by a change detection event, can issue thousands of GetRecordById or /items/{id} requests in a short window. Implement request coalescing in your harvesting client: group IDs into batches of 50–100, use bulk fetch operations (CSW GetRecords with an ID filter list, or the OGC API – Records ids parameter with a comma-separated list where the server supports it), and back off exponentially on HTTP 429 or 503 responses. Use connection pooling via requests.Session with HTTPAdapter(max_retries=…, pool_connections=4, pool_maxsize=20) to reuse connections and avoid exhausting file descriptors.
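
The batching and backoff logic can be sketched independently of the HTTP client. The helper names, the one-second base delay, and the 60-second cap below are assumptions, and only the delta-seconds form of Retry-After is handled.

```python
from __future__ import annotations

from collections.abc import Iterator, Sequence


def batched_ids(ids: Sequence[str], batch_size: int = 50) -> Iterator[list[str]]:
    """Split record IDs into fixed-size batches for bulk fetch requests."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    for start in range(0, len(ids), batch_size):
        yield list(ids[start:start + batch_size])


def backoff_delay(
    attempt: int,
    retry_after: str | None = None,
    base: float = 1.0,
    cap: float = 60.0,
) -> float:
    """Seconds to wait before retry number `attempt` (0-based) after a 429/503.

    A numeric Retry-After header takes precedence over the computed delay;
    the HTTP-date form of the header is ignored in this sketch.
    """
    if retry_after is not None and retry_after.strip().isdigit():
        return float(retry_after.strip())
    return min(base * 2 ** attempt, cap)
```

A harvester loop would iterate over `batched_ids(...)`, issue one bulk request per batch, and sleep for `backoff_delay(attempt, response.headers.get("Retry-After"))` whenever the server answers 429 or 503. Adding random jitter to the delay helps when several workers retry at once.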

Memory management for large harvests

When extracting metadata from large GeoPackage files or bulk shapefile directories, avoid loading all features into memory to derive geometry extents. Use streaming reads via fiona.open and process one feature at a time to accumulate a bounding box:

import fiona
from shapely.geometry import shape


def get_bbox_from_geopackage(path: str) -> tuple[float, float, float, float]:
    """Stream through a GeoPackage layer to compute the full extent."""
    west = south = east = north = None
    with fiona.open(path) as src:
        for feature in src:
            if feature["geometry"] is None:
                continue  # attribute-only rows carry no extent
            geom = shape(feature["geometry"])
            if geom.is_empty:
                continue
            b = geom.bounds  # (minx, miny, maxx, maxy)
            west = b[0] if west is None else min(west, b[0])
            south = b[1] if south is None else min(south, b[1])
            east = b[2] if east is None else max(east, b[2])
            north = b[3] if north is None else max(north, b[3])
    if west is None:
        # No usable geometries: fall back to the global extent.
        return (-180.0, -90.0, 180.0, 90.0)
    # Compare against None explicitly; a legitimate 0.0 coordinate is falsy.
    return (west, south, east, north)

Rate limiting on public CSW endpoints

Public CSW endpoints (national INSPIRE nodes, USGS, Copernicus Open Access Hub) enforce aggressive rate limits — commonly 60–120 requests per minute. Build rate-limit-aware clients with token bucket logic and respect Retry-After headers. Cache capability documents locally for at least 24 hours and implement conditional GET using Last-Modified / If-Modified-Since where the server supports it.
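
The token bucket logic mentioned above can be sketched in a few lines. The class name, method names, and the injectable clock are illustrative choices rather than part of any particular library.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate_per_minute` tokens."""

    def __init__(
        self,
        rate_per_minute: float,
        capacity: int,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate = rate_per_minute / 60.0   # tokens per second
        self.capacity = float(capacity)
        self.tokens = float(capacity)        # start with a full bucket
        self.clock = clock
        self.updated = clock()

    def _refill(self) -> None:
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now

    def try_acquire(self) -> bool:
        """Consume one token if available; return False when the caller must wait."""
        self._refill()
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

    def seconds_until_ready(self) -> float:
        """How long to sleep before the next token becomes available."""
        self._refill()
        return max(0.0, (1.0 - self.tokens) / self.rate)
```

A client calls `try_acquire()` before each request and sleeps for `seconds_until_ready()` when it returns False. When the server sends a Retry-After header, that value should override the bucket's own estimate.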

8. Compliance and Validation

OGC CITE compliance for CSW

The OGC Compliance Testing Program provides an online CITE test suite for CSW 2.0.2 at https://cite.opengeospatial.org/teamengine/. Running the test suite against a local catalog instance requires a publicly accessible endpoint or a reverse-tunnelled local service. The CITE engine sends a fixed set of GetCapabilities, GetRecords, GetRecordById, and DescribeRecord requests and validates both the HTTP response codes and the XML payload structure against the CSW schema.

For CI pipelines, the TEAM Engine Docker image can be run locally:

docker run --rm -p 8080:8080 ogccite/teamengine:latest
# Then POST a session against http://localhost:8080/teamengine/rest/suites/csw/2.0.2/run

Schema validation test harness

A pytest-based validation harness catches regressions in serialization output before deployment:

import pytest
from lxml import etree
from pathlib import Path

from your_catalog.serializers import record_to_iso19139
from your_catalog.models import CanonicalRecord


SCHEMA_PATH = Path("schemas/iso19139/gmd/gmd.xsd")


@pytest.fixture
def sample_record() -> CanonicalRecord:
    return CanonicalRecord(
        record_id="abc123",
        title="Flood Risk Zones — Rhine Basin",
        abstract="Vector dataset of 100-year flood inundation extents.",
        keywords=["flood", "risk", "Rhine", "hydrology"],
        bbox=(6.0, 47.0, 15.0, 52.0),
        crs_auth_code="EPSG:4326",
        date_published="2024-03-15",
        source_uri="s3://hydro-data/flood_risk_rhine.gpkg",
        contact_email="metadata-steward@example.org",  # placeholder address
        use_constraints="license",
    )


def test_iso19139_validates_against_xsd(sample_record: CanonicalRecord) -> None:
    xml_bytes = record_to_iso19139(sample_record)
    with SCHEMA_PATH.open("rb") as f:
        schema = etree.XMLSchema(etree.parse(f))
    doc = etree.fromstring(xml_bytes)
    # validate() populates error_log; without this call the log stays empty
    # and the assertion below passes vacuously.
    schema.validate(doc)
    errors = [str(e) for e in schema.error_log]
    assert not errors, "XSD validation failed:\n" + "\n".join(errors)


def test_title_roundtrips(sample_record: CanonicalRecord) -> None:
    xml_bytes = record_to_iso19139(sample_record)
    doc = etree.fromstring(xml_bytes)
    ns = {"gmd": "http://www.isotc211.org/2005/gmd", "gco": "http://www.isotc211.org/2005/gco"}
    title = doc.findtext(
        ".//gmd:title/gco:CharacterString", namespaces=ns
    )
    assert title == sample_record.title

For OGC API – Records payloads, validate the GeoJSON Feature output against the JSON Schema published at https://schemas.opengis.net/ogcapi/records/part1/1.0/openapi/schemas/recordGeoJSON.yaml using jsonschema with $ref resolution.

Why does my CSW GetRecords return zero results even though records exist?

The most common cause is an OUTPUTSCHEMA mismatch. If you request OUTPUTSCHEMA=http://www.isotc211.org/2005/gmd but the catalog was populated with Dublin Core records (the CSW default), the server returns an empty result set rather than an error. Check the capabilities document for supported OutputSchema values and ensure your ingestion pipeline writes records in the schema you intend to query.

How do I handle a catalog that mixes ISO 19139 and ISO 19115-3 records?

ISO 19115-3 uses a different XML namespace (http://standards.iso.org/iso/19115/-3/mdb/1.0) and reorganizes several element paths (e.g. mdb:MD_Metadata instead of gmd:MD_Metadata). Detect the namespace on the root element before applying XPath and route to namespace-aware extraction functions. Store a metadata_profile field in your canonical store so queries can filter by profile version.
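
Namespace detection on the root element can be sketched as follows. The example uses the standard library's ElementTree to stay dependency-free (lxml exposes the same `{namespace}localName` tag format); the function name and the profile labels are hypothetical.

```python
import xml.etree.ElementTree as ET

GMD_NS = "http://www.isotc211.org/2005/gmd"
# Prefix match so that different mdb schema versions are all recognized.
MDB_NS_PREFIX = "http://standards.iso.org/iso/19115/-3/mdb/"


def detect_metadata_profile(xml_bytes: bytes) -> str:
    """Return 'iso19139', 'iso19115-3', or 'unknown' based on the root namespace."""
    root = ET.fromstring(xml_bytes)
    # ElementTree encodes qualified names as "{namespace}localName".
    namespace = root.tag[1:].split("}", 1)[0] if root.tag.startswith("{") else ""
    if namespace == GMD_NS:
        return "iso19139"
    if namespace.startswith(MDB_NS_PREFIX):
        return "iso19115-3"
    return "unknown"
```

The returned label is a natural value for the metadata_profile field, and a dispatch dictionary keyed on it can route each document to the matching extraction function.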

What is the correct bounding box axis order for CSW 2.0.2 spatial filters?

CSW 2.0.2 uses GML 3.1.1 in its Filter Encoding, where gml:Envelope with srsName="EPSG:4326" uses latitude-first order: <gml:lowerCorner>lat_min lon_min</gml:lowerCorner>. OGC API – Records bbox uses longitude-first order. The handling spatial reference mismatches in OGC requests page documents this in full with corrected examples.
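
A small helper makes the axis swap explicit when converting a longitude-first bounding box into gml:Envelope corner strings. The function name and the `lat_first` switch are hypothetical; the switch exists because some servers are configured to expect longitude-first coordinates, so check the target catalog's behaviour before relying on either order.

```python
def bbox_to_gml_corners(
    bbox: tuple[float, float, float, float],
    lat_first: bool = True,
) -> tuple[str, str]:
    """Return (lowerCorner, upperCorner) text for a (west, south, east, north) bbox."""
    west, south, east, north = bbox
    if lat_first:
        # Latitude-first order, as described above for EPSG:4326 envelopes.
        return f"{south} {west}", f"{north} {east}"
    return f"{west} {south}", f"{east} {north}"
```

For the bbox `(10.0, 47.0, 15.0, 52.0)` this produces `"47.0 10.0"` as the lower corner and `"52.0 15.0"` as the upper corner.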


Related

Implementing ISO 19115 Metadata Standards — element hierarchy, mandatory fields, and Python serialization with lxml
DCAT-AP for Spatial Data Portals — RDF mappings and JSON-LD generation for INSPIRE and open-data compliance
Automated Metadata Harvesting Workflows — per-source connector patterns and fallback generation for incomplete records
Schema Validation for Spatial Records — XSD and JSON Schema validation pipelines with quarantine queue design
OGC Standards Architecture & Service Fundamentals — WMS, WFS, WMTS, and OGC API service contracts that feed into catalog ingestion