Effective spatial data publishing depends on a robust, discoverable metadata layer that connects raw geospatial assets to enterprise discovery portals, government open-data registries, and federated search networks. For GIS platform engineers and Python backend developers, this means building and operating a metadata pipeline that can extract, normalize, validate, index, and synchronize records at scale — while remaining compliant with ISO 19115, DCAT-AP, CSW 2.0.2, and the emerging OGC API – Records standard. Without systematic automation across every stage, spatial catalogs fragment into inconsistent, non-compliant collections that block downstream consumers and accumulate technical debt faster than any manual stewardship team can clear.
This guide covers the complete engineering picture: pipeline architecture, metadata standards and their mapping relationships, Python implementation patterns, search index configuration, cross-portal synchronization strategies, and operational readiness requirements.
A production spatial catalog is a distributed metadata pipeline, not a monolithic database. The architecture must accommodate heterogeneous source formats — PostGIS tables, GeoPackage files, cloud object storage containing GeoTIFFs or COG files, and legacy shapefiles alongside live OGC service endpoints like WMS and WFS — while serving multiple downstream consumers through standards-aligned APIs.
The five functional layers are:
geo_shape fields.
Treating metadata as event-driven data, pushed onto message brokers (RabbitMQ or Kafka) between layers rather than written synchronously to a single database, is what prevents ingestion bottlenecks during peak data registration bursts and lets each layer scale horizontally on its own.
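A minimal sketch of that decoupling, assuming an in-process `queue.Queue` as a stand-in for a RabbitMQ queue or Kafka topic. The layer function names and the event shape are hypothetical, chosen only to show the producer and consumer running independently.

```python
import queue
import threading

# Stand-in for a broker topic between the extraction and normalization layers.
# In production this would be a RabbitMQ queue or Kafka topic (assumption).
raw_records: queue.Queue = queue.Queue(maxsize=1000)
normalized: list = []


def extraction_layer(uris: list) -> None:
    """Publish one event per discovered asset, then a sentinel to signal completion."""
    for uri in uris:
        raw_records.put({"source_uri": uri})
    raw_records.put(None)  # sentinel: no more events


def normalization_layer() -> None:
    """Consume events at the consumer's own pace, independent of the producer."""
    while True:
        event = raw_records.get()
        if event is None:
            break
        title = event["source_uri"].rsplit("/", 1)[-1]
        normalized.append({**event, "title": title})


consumer = threading.Thread(target=normalization_layer)
consumer.start()
extraction_layer(["s3://bucket/a.tif", "s3://bucket/b.gpkg"])
consumer.join()
```

The bounded queue applies back-pressure to the producer when the consumer falls behind, which is the same property a broker with a bounded queue length would provide between real services.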
The ISO 19115 family defines comprehensive schemas for dataset identification, spatial representation, distribution, and lineage. ISO 19115-1:2014 is the current base standard; ISO 19115-3 packages the XML implementation schemas (replacing the older ISO 19139:2007 encoding that most legacy catalogs still produce). The key top-level XML element is MD_Metadata, which contains:
- identificationInfo (MD_DataIdentification): title, abstract, keywords, geographic extent, temporal extent
- spatialRepresentationInfo (MD_VectorSpatialRepresentation or MD_GridSpatialRepresentation): geometry type, cell size
- referenceSystemInfo (MD_ReferenceSystem): the CRS authority code (e.g. EPSG:4326)
- distributionInfo (MD_Distribution): format, transfer options, online resources
- dataQualityInfo (DQ_DataQuality): lineage statements, conformance reports
Implementing ISO 19115 Metadata Standards covers the full element hierarchy and Python serialization patterns using lxml.
DCAT-AP (Data Catalog Vocabulary Application Profile) is an RDF-based standard used by European national portals, the EU Open Data Portal, and increasingly by US federal FGDA reporting. It models catalogs as dcat:Catalog, datasets as dcat:Dataset, and distributions as dcat:Distribution. A critical mapping challenge is that ISO 19115’s identificationInfo/citation/title maps to dct:title, while distributionInfo/transferOptions/onLine/linkage maps to dcat:accessURL — but the CRS information in referenceSystemInfo has no direct DCAT-AP equivalent and must be expressed as a dct:conformsTo reference to the EPSG registry URI.
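The mapping described above can be sketched as a small function. This is an illustration under stated assumptions, not a complete DCAT-AP serializer: the input dict keys (`title`, `linkage`, `crs_auth_code`) are hypothetical, and the OGC definition-server URI pattern used for `dct:conformsTo` is one common way to reference an EPSG code, assumed here rather than taken from the source.

```python
def iso_to_dcat(record: dict) -> dict:
    """Map a handful of ISO 19115 fields onto DCAT-AP terms as a JSON-LD dict."""
    authority, code = record["crs_auth_code"].split(":")
    return {
        "@context": {
            "dcat": "http://www.w3.org/ns/dcat#",
            "dct": "http://purl.org/dc/terms/",
        },
        "@type": "dcat:Dataset",
        # identificationInfo/citation/title -> dct:title
        "dct:title": record["title"],
        # referenceSystemInfo has no direct equivalent, so reference the CRS by URI.
        # The URI pattern below is an assumption for illustration.
        "dct:conformsTo": {
            "@id": f"http://www.opengis.net/def/crs/{authority}/0/{code}"
        },
        # distributionInfo/transferOptions/onLine/linkage -> dcat:accessURL
        "dcat:distribution": [
            {
                "@type": "dcat:Distribution",
                "dcat:accessURL": {"@id": record["linkage"]},
            }
        ],
    }
```

Passing the result to `json.dumps` yields a body that could be served as application/ld+json.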
DCAT-AP for Spatial Data Portals details the RDF mapping and how to produce application/ld+json serializations that satisfy both INSPIRE and OGC API – Records clients.
The Catalogue Service for the Web (CSW) 2.0.2 is the incumbent OGC catalog protocol. It uses HTTP-GET or HTTP-POST with XML envelopes and defines four core operations:
| Operation | Purpose | Mandatory parameters |
|---|---|---|
| GetCapabilities | Describe service and supported filter encodings | SERVICE=CSW, REQUEST=GetCapabilities |
| GetRecords | Query records using OGC Filter Encoding or CQL | SERVICE, REQUEST, TYPENAMES, OUTPUTSCHEMA |
| GetRecordById | Retrieve a single record by identifier | SERVICE, REQUEST, ID, OUTPUTSCHEMA |
| DescribeRecord | Return the schema for a record type | SERVICE, REQUEST, TYPENAME |
OUTPUTSCHEMA controls which metadata profile is returned: http://www.isotc211.org/2005/gmd for ISO 19139, http://www.opengis.net/cat/csw/2.0.2 for Dublin Core, or a DCAT-AP URI for RDF output.
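As a rough illustration, a GetRecords request in key-value form can be assembled with the standard library. The endpoint URL is a placeholder, the function name is hypothetical, and the choice of `csw:Record` for typeNames is an assumption; some servers expect `gmd:MD_Metadata` plus a namespace declaration when querying ISO records.

```python
from urllib.parse import urlencode

ISO_SCHEMA = "http://www.isotc211.org/2005/gmd"


def build_getrecords_url(endpoint, output_schema=ISO_SCHEMA,
                         start_position=1, max_records=10):
    """Build a CSW 2.0.2 GetRecords KVP URL that requests a given output schema."""
    params = {
        "service": "CSW",
        "version": "2.0.2",
        "request": "GetRecords",
        "typeNames": "csw:Record",      # assumption: accepted by the target server
        "outputSchema": output_schema,  # selects the metadata profile returned
        "elementSetName": "full",
        "resultType": "results",
        "startPosition": start_position,
        "maxRecords": max_records,
    }
    return f"{endpoint}?{urlencode(params)}"


url = build_getrecords_url("https://catalog.example.org/csw")
```

Switching `output_schema` to the Dublin Core namespace returns the same records in the CSW default profile.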
OGC API – Records (the REST successor) replaces XML envelopes with OpenAPI paths. Records are addressed at /collections/{collectionId}/items/{recordId} and queried with URL query parameters (bbox, datetime, q, type). Responses default to GeoJSON or JSON-LD. Content negotiation via Accept headers allows the same endpoint to serve legacy XML consumers and modern JSON clients without a separate adapter layer.
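The equivalent query against an OGC API – Records endpoint is a plain URL with query parameters. The sketch below assumes a placeholder base URL and a hypothetical helper name; it only builds the URL and leaves the HTTP call to the caller.

```python
from urllib.parse import urlencode


def build_items_url(base_url, collection_id, bbox=None,
                    datetime_range=None, q=None, limit=10):
    """Build an OGC API - Records items URL from optional filter inputs."""
    params = {"limit": limit}
    if bbox is not None:
        # minx,miny,maxx,maxy in longitude-first order
        params["bbox"] = ",".join(str(v) for v in bbox)
    if datetime_range is not None:
        params["datetime"] = datetime_range  # e.g. "2024-01-01/.." for an open interval
    if q:
        params["q"] = q  # free-text search term
    return f"{base_url}/collections/{collection_id}/items?{urlencode(params)}"


items_url = build_items_url(
    "https://catalog.example.org", "main",
    bbox=(10, 47, 15, 52), datetime_range="2024-01-01/..", q="flood",
)
```

Sending this URL with an `Accept: application/geo+json` header would request GeoJSON; a different Accept value would ask the same endpoint for another serialization.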
| Dimension | CSW 2.0.2 | OGC API – Records |
|---|---|---|
| Protocol | HTTP GET (KVP) or POST (XML), optionally SOAP | REST / OpenAPI |
| Record addressing | GetRecordById?ID=… | /items/{recordId} |
| Spatial filter | OGC Filter Encoding XML | bbox=minx,miny,maxx,maxy |
| Temporal filter | ogc:PropertyIsGreaterThan on apiso:Modified | datetime=2024-01-01/.. |
| Metadata profiles | ISO 19139 XML, Dublin Core, DCAT RDF | GeoJSON + JSON-LD |
| Content negotiation | OUTPUTSCHEMA parameter | HTTP Accept header |
| Auth model | IP-allow or HTTP Basic | OAuth2 / OIDC |
Both protocols can expose the same underlying ISO 19115 records — the difference is entirely in the transport and serialization contract. Most production deployments must support both during a transition period, which favours a backend that serializes on the fly rather than storing format-specific copies.
| Source | Primary extraction tool | Metadata location |
|---|---|---|
| PostGIS / PostgreSQL | psycopg2 + geometry queries | Column comments, geometry_columns view |
| GeoPackage | GDAL ogrinfo or fiona | gpkg_contents, gpkg_metadata tables |
| GeoTIFF / COG | rasterio + GDAL | TIFF tags, XML sidecar .aux.xml |
| Shapefile | ogrinfo or geopandas | .prj for CRS, .cpg for encoding; no native metadata |
| WMS / WFS endpoint | HTTP GetCapabilities | Layer elements in capabilities XML |
| S3 / object storage | boto3 list + per-object HEAD | Object metadata headers, sidecar JSON |
The automated metadata harvesting workflows guide covers each connector in depth, including how to reconstruct missing attributes using deterministic fallback rules when source data lacks embedded documentation.
The following pattern shows the extraction and normalization core. It targets Python 3.10+ and depends on requests, lxml, pyproj, and rasterio, all installable via pip.
from __future__ import annotations

import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone
from pathlib import Path
from typing import Any

import rasterio
from pyproj import CRS, Transformer


@dataclass
class RawMetadataRecord:
    source_uri: str
    source_format: str  # "geotiff" | "wms_capabilities" | "geopackage" | ...
    extracted_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    raw_payload: dict[str, Any] = field(default_factory=dict)


@dataclass
class CanonicalRecord:
    """Normalized ISO 19115-aligned record ready for validation and indexing."""

    record_id: str
    title: str
    abstract: str
    keywords: list[str]
    bbox: tuple[float, float, float, float]  # (west, south, east, north) in EPSG:4326
    crs_auth_code: str  # e.g. "EPSG:32633"
    date_published: str  # ISO 8601
    source_uri: str
    lineage: str = ""
    contact_email: str = ""
    use_constraints: str = ""


def extract_from_geotiff(path: Path) -> RawMetadataRecord:
    # Read everything needed while the dataset is still open.
    with rasterio.open(path) as ds:
        crs = CRS.from_user_input(ds.crs) if ds.crs else None
        bounds = ds.bounds
        driver, band_count, tags = ds.driver, ds.count, ds.tags()

    bbox_wgs84 = None  # stays unset when the CRS is unknown; normalize() applies a fallback
    if crs is not None:
        wgs84 = CRS.from_epsg(4326)
        if not crs.equals(wgs84):
            # Reproject bounds to EPSG:4326 for the canonical bounding box.
            # transform_bounds (pyproj >= 3.1) densifies the edges, so the result
            # covers the full extent rather than only two projected corners.
            transformer = Transformer.from_crs(crs, wgs84, always_xy=True)
            bbox_wgs84 = transformer.transform_bounds(
                bounds.left, bounds.bottom, bounds.right, bounds.top
            )
        else:
            bbox_wgs84 = (bounds.left, bounds.bottom, bounds.right, bounds.top)

    return RawMetadataRecord(
        source_uri=str(path),
        source_format="geotiff",
        raw_payload={
            "title": path.stem,
            "crs_auth": crs.to_authority() if crs else None,  # e.g. ("EPSG", "32633")
            "bbox_wgs84": bbox_wgs84,
            "driver": driver,
            "count": band_count,
            "tags": tags,
        },
    )


def normalize(raw: RawMetadataRecord, fallback_contact: str = "") -> CanonicalRecord:
    """Apply transformation rules to produce a CanonicalRecord.

    Missing mandatory fields trigger fallback template values rather than
    raising exceptions; the validation stage decides whether to admit or quarantine.
    """
    p = raw.raw_payload
    auth = p.get("crs_auth") or ("EPSG", "4326")
    return CanonicalRecord(
        record_id=hashlib.sha256(raw.source_uri.encode()).hexdigest()[:16],
        title=p.get("title") or Path(raw.source_uri).stem,
        abstract=p.get("abstract") or f"Dataset extracted from {raw.source_format} source.",
        keywords=p.get("keywords") or [],
        bbox=p.get("bbox_wgs84") or (-180.0, -90.0, 180.0, 90.0),
        crs_auth_code=f"{auth[0]}:{auth[1]}",
        date_published=p.get("date_published") or raw.extracted_at.date().isoformat(),
        source_uri=raw.source_uri,
        contact_email=p.get("contact_email") or fallback_contact,
        use_constraints=p.get("use_constraints") or "otherRestrictions",
    )
The SRS and Coordinate Reference System Handling guide explains why passing always_xy=True to Transformer.from_crs is mandatory: without it, pyproj follows the axis order declared by the CRS definition, which for EPSG:4326 is (latitude, longitude) rather than (longitude, latitude), silently inverting every bounding box you write to the index.
Every CanonicalRecord must pass structural and semantic checks before entering the catalog backend. Mandatory field presence, valid EPSG authority codes, and non-degenerate bounding boxes can be checked in pure Python. XSD validation against the ISO 19139 schema requires lxml with the schema document tree:
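from pathlib import Path

from lxml import etree


def validate_iso19139_xml(xml_bytes: bytes, schema_path: Path) -> list[str]:
    """Return a list of XSD error messages; empty list means valid."""
    with schema_path.open("rb") as f:
        schema_doc = etree.parse(f)
    schema = etree.XMLSchema(schema_doc)
    try:
        doc = etree.fromstring(xml_bytes)
    except etree.XMLSyntaxError as exc:
        # Malformed XML never reaches XSD validation; report it as a single error.
        return [f"XML is not well-formed: {exc}"]
    schema.validate(doc)
    return [str(e) for e in schema.error_log]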
Records that fail validation are routed to a quarantine queue — a simple PostgreSQL table works well — with a structured error envelope: {"record_id": "…", "field_path": "identificationInfo/citation/title", "error_code": "MISSING_MANDATORY", "timestamp": "…"}. This allows data stewards to triage failures without blocking the ingestion of valid records. The schema validation for spatial records page details how to add JSON Schema validation for OGC API – Records payloads alongside XSD validation for ISO 19139.
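The pure-Python checks and the error envelope described above could look roughly like this. It is a sketch under assumptions: the record is a plain dict with the CanonicalRecord field names, and the error codes other than MISSING_MANDATORY (INVALID_CRS, DEGENERATE_BBOX) are hypothetical names introduced for illustration.

```python
import re
from datetime import datetime, timezone

EPSG_CODE = re.compile(r"^EPSG:\d{4,6}$")


def check_record(record: dict) -> list:
    """Return one quarantine envelope per failed check; an empty list means admit."""

    def envelope(field_path: str, error_code: str) -> dict:
        return {
            "record_id": record.get("record_id", ""),
            "field_path": field_path,
            "error_code": error_code,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        }

    errors = []
    # Mandatory field presence.
    for name in ("title", "abstract"):
        if not record.get(name):
            errors.append(envelope(name, "MISSING_MANDATORY"))
    # Syntactic check of the authority code; it does not query the EPSG registry.
    if not EPSG_CODE.match(record.get("crs_auth_code", "")):
        errors.append(envelope("crs_auth_code", "INVALID_CRS"))
    # Bounding box must be in range and enclose a non-zero area.
    bbox = record.get("bbox")
    if bbox is None or len(bbox) != 4:
        errors.append(envelope("bbox", "MISSING_MANDATORY"))
    else:
        west, south, east, north = bbox
        in_range = (
            -180 <= west <= 180
            and -180 <= east <= 180
            and -90 <= south <= north <= 90
        )
        if not in_range or west == east or south == north:
            errors.append(envelope("bbox", "DEGENERATE_BBOX"))
    return errors
```

The check deliberately allows west to be greater than east so that boxes crossing the antimeridian are not rejected; whether point datasets with a zero-area box should be quarantined is a policy decision this sketch does not settle.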
Decoupling the catalog backend (system of record) from the search index (query engine) allows independent scaling and specialized mapping configurations. Elasticsearch and OpenSearch both provide native geo_shape and geo_point field types backed by BKD trees, which outperform PostGIS spatial indexes for catalog bounding-box queries at the 10,000+ record scale.
{
  "mappings": {
    "properties": {
      "record_id": {"type": "keyword"},
      "title": {"type": "text", "analyzer": "english"},
      "abstract": {"type": "text", "analyzer": "english"},
      "keywords": {"type": "keyword"},
      "bbox": {"type": "geo_shape"},
      "date_published": {"type": "date", "format": "strict_date"},
      "crs_auth_code": {"type": "keyword", "doc_values": true},
      "source_format": {"type": "keyword", "doc_values": true},
      "use_constraints": {"type": "keyword"}
    }
  }
}
Set doc_values: true on faceted fields (crs_auth_code, source_format, keywords) and keep _source enabled for record reconstruction — disabling _source on mutable catalog records forces re-indexing on every update rather than partial updates.
Multidimensional catalog queries should apply filters in selectivity order to prune the candidate set before scoring:
1. Spatial filter (geo_shape intersection with the user’s bounding box): most selective for spatially heterogeneous catalogs
2. Temporal filter (date_published range): eliminates stale records early
3. Facet filters (terms on keywords, source_format): cheap bucket pruning
4. Full-text match (title + abstract): applied last across the pruned set
def build_catalog_query(
    bbox: tuple[float, float, float, float] | None,
    date_from: str | None,
    date_to: str | None,
    keyword: str | None,
    free_text: str | None,
) -> dict:
    """Compose an Elasticsearch/OpenSearch query dict from catalog filter inputs."""
    filters: list[dict] = []
    if bbox:
        west, south, east, north = bbox
        filters.append({
            "geo_shape": {
                "bbox": {
                    "shape": {
                        # Envelope corners are [top-left, bottom-right].
                        "type": "envelope",
                        "coordinates": [[west, north], [east, south]]
                    },
                    "relation": "intersects"
                }
            }
        })
    if date_from or date_to:
        date_range: dict = {}
        if date_from:
            date_range["gte"] = date_from
        if date_to:
            date_range["lte"] = date_to
        filters.append({"range": {"date_published": date_range}})
    if keyword:
        filters.append({"term": {"keywords": keyword}})
    # Filters run in non-scoring context; only the free-text clause is scored.
    must: list[dict] = []
    if free_text:
        must.append({
            "multi_match": {
                "query": free_text,
                "fields": ["title^2", "abstract"],
                "type": "best_fields"
            }
        })
    return {
        "query": {
            "bool": {
                "filter": filters,
                "must": must or [{"match_all": {}}]
            }
        }
    }
Boosting title by a factor of 2 (title^2) is a standard practice for catalog search because title matches are more precise signals than abstract matches. Pre-compute geohash aggregation buckets at indexing time if your portal renders a heatmap of dataset density — computing them on-the-fly per query is expensive once the catalog exceeds 50,000 records.
The table in Section 2 covers the protocol-level differences. This section focuses on the implementation trade-offs that affect backend engineers.
Axis order: CSW 2.0.2 GetRecords spatial filters use gml:Envelope with srsName="EPSG:4326". CSW 2.0.2 builds on Filter Encoding 1.1 and GML 3.1.1, where EPSG:4326 coordinates inside that envelope are written latitude-first. This is the same trap as WMS 1.3.0 BBOX, documented in detail in the handling spatial reference mismatches in OGC requests guide. OGC API – Records bbox parameters use longitude-first CRS84 order (minLon,minLat,maxLon,maxLat), removing this ambiguity.
Authentication: CSW 2.0.2 has no normative authentication mechanism — implementations use IP allowlists, HTTP Basic, or custom token headers that vary between vendors (GeoServer, pycsw, deegree). OGC API – Records is designed for OAuth2/OIDC from the outset, with token introspection at the API gateway level.
Paging: CSW 2.0.2 uses startPosition and maxRecords parameters (1-indexed). OGC API – Records uses limit with a standardized next link in the response envelope, and many servers also accept a 0-indexed offset. Mapping between these in a federated harvester requires careful offset translation.
Filter languages: CSW 2.0.2 supports OGC Filter Encoding 1.1 (XML) and optionally CQL (Common Query Language) as a text-based alternative. OGC API – Records supports CQL2-Text and CQL2-JSON as the normative filter language. CQL2-JSON is more machine-friendly for Python clients:
import requests

cql2_filter = {
    "op": "and",
    "args": [
        {
            "op": "s_intersects",
            "args": [
                {"property": "bbox"},
                {"type": "Polygon", "coordinates": [[[10, 47], [15, 47], [15, 52], [10, 52], [10, 47]]]}
            ]
        },
        {
            "op": ">=",
            "args": [{"property": "date_published"}, "2023-01-01"]
        }
    ]
}

resp = requests.post(
    "https://catalog.example.org/collections/main/items",
    json={"filter": cql2_filter, "filter-lang": "cql2-json"},
    headers={"Accept": "application/geo+json"},
    timeout=30,
)
resp.raise_for_status()
records = resp.json()["features"]
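The paging difference noted earlier in this section comes down to an off-by-one translation. A minimal sketch, assuming the harvester tracks a 0-indexed offset internally and converts at the CSW boundary; the function names are hypothetical.

```python
def offset_to_start_position(offset: int) -> int:
    """Convert a 0-indexed offset to a 1-indexed CSW startPosition."""
    if offset < 0:
        raise ValueError("offset must be >= 0")
    return offset + 1


def start_position_to_offset(start_position: int) -> int:
    """Convert a 1-indexed CSW startPosition to a 0-indexed offset."""
    if start_position < 1:
        raise ValueError("startPosition must be >= 1")
    return start_position - 1


def csw_pages(total_matched: int, page_size: int):
    """Yield (startPosition, offset) pairs that together cover total_matched records."""
    for offset in range(0, total_matched, page_size):
        yield offset_to_start_position(offset), offset
```

Keeping the conversion in one place makes it harder for a skipped or duplicated record at a page boundary to slip into a federated harvest.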
A production catalog backend follows a four-layer request path:
HTTP Request
→ Gateway (auth, rate limiting, content negotiation)
→ Request Validator (parameter parsing, CQL filter compilation)
→ Query Router (dispatch to search index or catalog backend)
→ Serializer (ISO 19139 XML | GeoJSON | JSON-LD | Dublin Core)
→ HTTP Response
The serializer layer is where most protocol-specific complexity lives. Writing ISO 19139 XML requires careful namespace management — the gmd, gco, gml, and srv prefixes must all be registered, and lxml ElementMaker objects help avoid manual {namespace}localName concatenation:
from lxml import etree
from lxml.builder import ElementMaker

GMD = "http://www.isotc211.org/2005/gmd"
GCO = "http://www.isotc211.org/2005/gco"
GML = "http://www.opengis.net/gml/3.2"

gmd = ElementMaker(namespace=GMD, nsmap={"gmd": GMD, "gco": GCO, "gml": GML})
gco = ElementMaker(namespace=GCO, nsmap={"gco": GCO})


def record_to_iso19139(record: "CanonicalRecord") -> bytes:
    """Serialize a CanonicalRecord to ISO 19139 XML bytes.

    Only the identifier, title, publication date, and abstract are written here;
    a schema-valid record needs further mandatory elements (contact, dateStamp, ...).
    """
    root = gmd.MD_Metadata(
        gmd.fileIdentifier(gco.CharacterString(record.record_id)),
        gmd.identificationInfo(
            gmd.MD_DataIdentification(
                gmd.citation(
                    gmd.CI_Citation(
                        gmd.title(gco.CharacterString(record.title)),
                        gmd.date(
                            gmd.CI_Date(
                                gmd.date(gco.Date(record.date_published)),
                                gmd.dateType(
                                    gmd.CI_DateTypeCode(
                                        # Element text mirrors codeListValue, not the date.
                                        "publication",
                                        codeList="http://standards.iso.org/ittf/PubliclyAvailableStandards/ISO_19139_Schemas/resources/codelist/ML_gmxCodelists.xml#CI_DateTypeCode",
                                        codeListValue="publication",
                                    )
                                ),
                            )
                        ),
                    )
                ),
                gmd.abstract(gco.CharacterString(record.abstract)),
            )
        ),
    )
    return etree.tostring(root, pretty_print=True, xml_declaration=True, encoding="UTF-8")
For GeoJSON output (OGC API – Records), a CanonicalRecord maps directly to a GeoJSON Feature with a geometry of type Polygon (the bounding box) and the remaining fields in properties. Adding @context and @type fields converts this to JSON-LD without duplicating any data.
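A minimal sketch of that mapping, assuming the record is available as a plain dict with the CanonicalRecord field names. The function names, the `@type` value, and the context URL passed in are hypothetical placeholders, not values defined by OGC API – Records.

```python
def record_to_feature(record: dict) -> dict:
    """Build a GeoJSON Feature whose geometry is the record's bounding box."""
    west, south, east, north = record["bbox"]
    # Closed exterior ring in counter-clockwise order, longitude first.
    ring = [
        [west, south], [east, south], [east, north], [west, north], [west, south],
    ]
    properties = {
        key: value for key, value in record.items()
        if key not in ("record_id", "bbox")
    }
    return {
        "type": "Feature",
        "id": record["record_id"],
        "geometry": {"type": "Polygon", "coordinates": [ring]},
        "properties": properties,
    }


def feature_to_jsonld(feature: dict, context_url: str) -> dict:
    """Add JSON-LD keywords on top of the same Feature without copying its data."""
    return {"@context": context_url, "@type": "dcat:Dataset", **feature}
```

Because the JSON-LD variant only layers two keys on top of the Feature, one serializer can serve both media types and pick the variant from the negotiated Accept header.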
Scheduled harvesting jobs benefit from Airflow’s DAG-level retry semantics and XCom for passing record counts between tasks. A minimal harvesting DAG has four tasks: detect_changes → extract_and_normalize → validate_and_admit → reindex. The validate_and_admit task should push quarantined record IDs to an XCom key so downstream alerting tasks can report validation failure rates without scanning the quarantine table directly.
Automating GeoServer with the Python REST API covers adjacent Python orchestration patterns for layer publishing workflows that feed into catalog ingestion.
CSW GetCapabilities responses are expensive to generate (they enumerate all supported schemas, filter capabilities, and output formats) and change infrequently. Cache them at the reverse proxy level with a Cache-Control: public, max-age=3600 header. GetRecords responses that include a bbox filter should not be cached at the CDN because spatial query results are highly variable, but single-record GetRecordById responses are safe to cache by record ID with a reasonable TTL (15–60 minutes depending on update frequency).
For the search index, pre-warm frequently used filter combinations (e.g. source_format=wms or crs_auth_code=EPSG:4326) using index aliases that point to pre-filtered index views. This reduces query latency for the most common portal browsing patterns.
Catalog harvesters, especially those driven by a change detection event, can issue thousands of GetRecordById or /items/{id} requests in a short window. Implement request coalescing in your harvesting client: group IDs into batches of 50–100, use bulk fetch operations (CSW GetRecords with an ID filter list, or the OGC API – Records ids parameter with an ID list), and back off exponentially on HTTP 429 or 503 responses. Use connection pooling via requests.Session with HTTPAdapter(max_retries=…, pool_connections=4, pool_maxsize=20) to avoid exhausting file descriptors.
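The batching and backoff logic can be kept separate from the HTTP client, which makes it easy to test. The sketch below assumes nothing about the transport; the function names, the base delay, and the cap are illustrative choices rather than recommended values.

```python
import random


def batched(ids: list, size: int = 50):
    """Yield consecutive slices of at most `size` IDs."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0,
                  retry_after=None, jitter: float = 0.0) -> float:
    """Seconds to wait before retry number `attempt` (0-based).

    A server-supplied Retry-After value (in seconds) takes precedence over the
    computed exponential delay.
    """
    if retry_after is not None:
        return float(retry_after)
    delay = min(cap, base * 2 ** attempt)
    # Optional jitter spreads out retries from many concurrent workers.
    return delay + random.uniform(0, jitter)
```

A harvesting loop would call `backoff_delay` after each 429 or 503 response and sleep for the returned number of seconds before retrying the same batch.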
When extracting metadata from large GeoPackage files or bulk shapefile directories, avoid loading all features into memory to derive geometry extents. Use streaming reads via fiona.open, which iterates features lazily, and process one feature at a time to accumulate a bounding box:
import fiona
from shapely.geometry import shape


def get_bbox_from_geopackage(path: str) -> tuple[float, float, float, float]:
    """Stream through a GeoPackage layer to compute the full extent."""
    west = south = east = north = None
    with fiona.open(path) as src:
        for feature in src:
            if feature["geometry"] is None:
                continue  # skip features without geometry
            b = shape(feature["geometry"]).bounds  # (minx, miny, maxx, maxy)
            west = b[0] if west is None else min(west, b[0])
            south = b[1] if south is None else min(south, b[1])
            east = b[2] if east is None else max(east, b[2])
            north = b[3] if north is None else max(north, b[3])
    if west is None:
        # No geometries found: fall back to the global extent.
        return (-180.0, -90.0, 180.0, 90.0)
    # Compare against None rather than using `or`, so a legitimate 0.0 bound is kept.
    return (west, south, east, north)
Public CSW endpoints (national INSPIRE nodes, USGS, Copernicus Open Access Hub) enforce aggressive rate limits — commonly 60–120 requests per minute. Build rate-limit-aware clients with token bucket logic and respect Retry-After headers. Cache capability documents locally for at least 24 hours and implement conditional GET using Last-Modified / If-Modified-Since where the server supports it.
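A token bucket is small enough to write by hand. The version below is a sketch with an injectable clock so it can be tested without sleeping; the class and method names are hypothetical, and Retry-After handling is left to the calling code.

```python
import time


class TokenBucket:
    """Allow up to `capacity` requests in a burst, refilled at `rate_per_minute`."""

    def __init__(self, rate_per_minute: float, capacity=None, clock=time.monotonic):
        self.rate = rate_per_minute / 60.0  # tokens added per second
        self.capacity = float(capacity if capacity is not None else rate_per_minute)
        self.tokens = self.capacity
        self.clock = clock
        self.updated = clock()

    def _refill(self) -> None:
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now

    def try_acquire(self) -> bool:
        """Consume one token if available; return False when the caller must wait."""
        self._refill()
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

    def wait_time(self) -> float:
        """Seconds until the next token becomes available (0.0 if one is ready)."""
        self._refill()
        return max(0.0, (1.0 - self.tokens) / self.rate)
```

A client would call `try_acquire()` before each request and sleep for `wait_time()` when it returns False, taking the larger of that value and any Retry-After header the server sent.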
The OGC Compliance Testing Program provides an online CITE test suite for CSW 2.0.2 at https://cite.opengeospatial.org/teamengine/. Running the test suite against a local catalog instance requires a publicly accessible endpoint or a reverse-tunnelled local service. The CITE engine sends a fixed set of GetCapabilities, GetRecords, GetRecordById, and DescribeRecord requests and validates both the HTTP response codes and the XML payload structure against the CSW schema.
For CI pipelines, the TEAM Engine Docker image can be run locally:
docker run --rm -p 8080:8080 ogccite/teamengine:latest
# Then POST a session against http://localhost:8080/teamengine/rest/suites/csw/2.0.2/run
A pytest-based validation harness catches regressions in serialization output before deployment:
from pathlib import Path

import pytest
from lxml import etree

from your_catalog.models import CanonicalRecord
from your_catalog.serializers import record_to_iso19139

SCHEMA_PATH = Path("schemas/iso19139/gmd/gmd.xsd")


@pytest.fixture
def sample_record() -> CanonicalRecord:
    return CanonicalRecord(
        record_id="abc123",
        title="Flood Risk Zones: Rhine Basin",
        abstract="Vector dataset of 100-year flood inundation extents.",
        keywords=["flood", "risk", "Rhine", "hydrology"],
        bbox=(6.0, 47.0, 15.0, 52.0),
        crs_auth_code="EPSG:4326",
        date_published="2024-03-15",
        source_uri="s3://hydro-data/flood_risk_rhine.gpkg",
        contact_email="contact@example.org",  # placeholder address
        use_constraints="license",
    )


def test_iso19139_validates_against_xsd(sample_record: CanonicalRecord) -> None:
    xml_bytes = record_to_iso19139(sample_record)
    with SCHEMA_PATH.open("rb") as f:
        schema = etree.XMLSchema(etree.parse(f))
    doc = etree.fromstring(xml_bytes)
    # error_log is only populated once validate() has run.
    schema.validate(doc)
    errors = [str(e) for e in schema.error_log]
    assert not errors, "XSD validation failed:\n" + "\n".join(errors)


def test_title_roundtrips(sample_record: CanonicalRecord) -> None:
    xml_bytes = record_to_iso19139(sample_record)
    doc = etree.fromstring(xml_bytes)
    ns = {"gmd": "http://www.isotc211.org/2005/gmd", "gco": "http://www.isotc211.org/2005/gco"}
    title = doc.findtext(
        ".//gmd:title/gco:CharacterString", namespaces=ns
    )
    assert title == sample_record.title
For OGC API – Records payloads, validate the GeoJSON Feature output against the JSON Schema published at https://schemas.opengis.net/ogcapi/records/part1/1.0/openapi/schemas/recordGeoJSON.yaml using jsonschema with $ref resolution.
The most common cause is an OUTPUTSCHEMA mismatch. If you request OUTPUTSCHEMA=http://www.isotc211.org/2005/gmd but the catalog was populated with Dublin Core records (the CSW default), the server returns an empty result set rather than an error. Check the capabilities document for supported OutputSchema values and ensure your ingestion pipeline writes records in the schema you intend to query.
ISO 19115-3 uses a different XML namespace (http://standards.iso.org/iso/19115/-3/mdb/1.0) and reorganizes several element paths (e.g. mdb:MD_Metadata instead of gmd:MD_Metadata). Detect the namespace on the root element before applying XPath and route to namespace-aware extraction functions. Store a metadata_profile field in your canonical store so queries can filter by profile version.
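Namespace detection on the root element needs only the standard library. The sketch below assumes the two namespaces named above are the only profiles of interest; the profile labels and function name are hypothetical.

```python
import xml.etree.ElementTree as ET

# Root-element namespace -> label stored in the metadata_profile field.
PROFILES = {
    "http://www.isotc211.org/2005/gmd": "iso19139",
    "http://standards.iso.org/iso/19115/-3/mdb/1.0": "iso19115-3",
}


def detect_metadata_profile(xml_bytes: bytes) -> str:
    """Return a profile label based on the namespace of the root element."""
    root = ET.fromstring(xml_bytes)
    # ElementTree renders qualified names as "{namespace}localName".
    if root.tag.startswith("{"):
        namespace = root.tag[1:].split("}", 1)[0]
    else:
        namespace = ""
    return PROFILES.get(namespace, "unknown")
```

The returned label can then select the matching extraction function, so gmd: and mdb: XPath expressions never run against the wrong document type.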
CSW 2.0.2 uses GML 3.1.1 in its Filter Encoding, where gml:Envelope with srsName="EPSG:4326" uses latitude-first order: <gml:lowerCorner>lat_min lon_min</gml:lowerCorner>. OGC API – Records bbox uses longitude-first order. The handling spatial reference mismatches in OGC requests page documents this in full with corrected examples.
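A minimal sketch of the two encodings, assuming a canonical (west, south, east, north) tuple as input. The function names are hypothetical, and the GML namespace declaration is omitted for brevity, so the fragment is meant to be embedded in a filter document that declares the gml prefix.

```python
def bbox_to_gml_envelope(bbox) -> str:
    """Render a (west, south, east, north) bbox as a latitude-first gml:Envelope."""
    west, south, east, north = bbox
    return (
        '<gml:Envelope srsName="EPSG:4326">'
        f"<gml:lowerCorner>{south} {west}</gml:lowerCorner>"  # lat_min lon_min
        f"<gml:upperCorner>{north} {east}</gml:upperCorner>"  # lat_max lon_max
        "</gml:Envelope>"
    )


def bbox_to_records_param(bbox) -> str:
    """Render the same bbox as a longitude-first OGC API - Records bbox value."""
    return ",".join(str(value) for value in bbox)
```

Keeping one canonical tuple order internally and swapping axes only in these two functions confines the axis-order risk to a single, testable place.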