Validating ISO 19115 XML Against XSD Schemas with lxml
TL;DR: Use lxml.etree.XMLSchema to compile the root gmd.xsd file from a locally hosted schema bundle, then call .validate() on your parsed document. Keep no_network=True and recover=False on the parser; both are already the etree.XMLParser defaults, so setting them explicitly documents intent and guards against a relaxed parser being passed in. Iterate schema.error_log for structured line/column error output. The most common failure is compiling the XSD from a string or bytes object that carries no base URL instead of loading it from its file path with etree.parse(): without a base URL, relative <xs:import> paths cannot resolve.
The Core Challenge: Modular Schema Imports
ISO 19115 metadata is expressed as gmd:MD_Metadata XML, a document format governed by the ISO 19139 encoding rules. The corresponding XSD schema is not a single file. It is a deeply modularized tree: gmd.xsd imports gco.xsd (Geographic Common Objects), which in turn imports fragments from gml (Geography Markup Language), gmx (Extended Types), and srv (Service Metadata). Every <xs:include> and <xs:import> directive uses relative paths that must resolve at compile time.
This modular design, while necessary for reuse across the ISO 19115 metadata standards family, creates three concrete obstacles for Python validation:
- Base URL resolution. lxml derives relative-import paths from the base URL of the root XSD document. If the root XSD is loaded without one (for example, parsed from a string instead of from its file path with etree.parse()), every relative import either fails silently or raises a cryptic Failed to load external entity error.
- Network blocking. In CI/CD pipelines and air-gapped agency environments, XSD files must be hosted locally. no_network=True on the XML parser prevents lxml from fetching any remote entity, but you must ensure the entire bundle is present locally or schema compilation will fail on a missing import.
- Version drift. The schema bundles for ISO 19115:2003 and ISO 19115-1:2014/2019 are structurally different. Mixing a 2003 bundle with a 2014 payload triggers Element '...' is not valid errors for every element type introduced in the later revisions.
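The "entire bundle is present locally" requirement can be checked before compilation. The sketch below is one possible pre-flight check, not part of lxml: the function name find_unresolved_schema_refs is hypothetical, and it assumes that any schemaLocation containing "://" should be reported because it could not be fetched with networking disabled.

```python
from pathlib import Path

from lxml import etree

XS_NS = "http://www.w3.org/2001/XMLSchema"


def find_unresolved_schema_refs(root_xsd: str) -> list[str]:
    """Follow xs:import / xs:include / xs:redefine references from root_xsd
    and return every referenced schema that is not available as a local file.
    (Hypothetical helper for illustration.)"""
    unresolved: list[str] = []
    seen: set[Path] = set()
    queue = [Path(root_xsd).resolve()]
    while queue:
        xsd_path = queue.pop()
        if xsd_path in seen:
            continue
        seen.add(xsd_path)
        if not xsd_path.is_file():
            unresolved.append(str(xsd_path))
            continue
        tree = etree.parse(str(xsd_path))
        for tag in ("import", "include", "redefine"):
            for node in tree.iter(f"{{{XS_NS}}}{tag}"):
                location = node.get("schemaLocation")
                if not location:
                    continue  # namespace-only import, nothing to load
                if "://" in location:
                    # Assumption: remote locations count as unresolved offline.
                    unresolved.append(location)
                    continue
                # Relative paths are resolved against the referencing file.
                queue.append((xsd_path.parent / location).resolve())
    return sorted(set(unresolved))
```

An empty list means every reference reachable from the root resolves to a file on disk; anything else is worth fixing before the schema is compiled.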
Solving all three correctly is the difference between a validation step that acts as a real quality gate and one that passes invalid documents silently. Proper schema validation is also the first gate described in the broader Spatial Metadata & Catalog Integration workflow before ingestion into GeoNetwork, CKAN, or OGC CSW endpoints.
Validation Pipeline Architecture
The pipeline has three components, each with a distinct role: the parser, the schema compiler, and the error log.
Production-Ready Code
The function below handles namespace-aware parsing, local schema resolution, and granular error reporting. It assumes you have downloaded the complete ISO 19139 XSD bundle from https://schemas.opengis.net/iso/19139/ and extracted it to a local directory.
import sys
from pathlib import Path

from lxml import etree


def validate_iso19115(xml_path: str, xsd_root_dir: str) -> tuple[bool, list[str]]:
    """
    Validate an ISO 19115 XML document against its XSD schema using lxml.
    Returns (is_valid, error_messages).
    xsd_root_dir must contain the full schema bundle so that relative
    <xs:import> paths (e.g. ../gco/gco.xsd) resolve correctly.
    """
    xml_path = Path(xml_path)
    xsd_root = Path(xsd_root_dir)

    # 1. Strict parser: recover=False ensures malformed XML raises immediately
    #    rather than being silently auto-corrected into invalid metadata.
    #    huge_tree=True prevents crashes on large payloads with long lineage blocks.
    #    no_network=True blocks all external entity fetches.
    parser = etree.XMLParser(recover=False, no_network=True, huge_tree=True)
    try:
        xml_doc = etree.parse(str(xml_path), parser)
    except (etree.XMLSyntaxError, OSError) as e:
        # OSError covers a missing or unreadable input file.
        return False, [f"XML Parse Error: {e}"]

    # 2. Locate root XSD: gmd.xsd is the entry point for ISO 19115-1:2014/2019.
    #    For ISO 19115:2003 the same filename is used but the schema bundle differs.
    root_xsd = xsd_root / "gmd" / "gmd.xsd"
    if not root_xsd.exists():
        return False, [f"Root XSD not found at {root_xsd}. Download the schema bundle."]

    # 3. Compile schema: lxml resolves relative imports from the base URL of
    #    the parsed XSD document. Parsing from the file path with etree.parse()
    #    sets that base URL. An XSD parsed from a string without base_url has
    #    none, so its relative imports cannot resolve.
    try:
        xsd_doc = etree.parse(str(root_xsd))
        schema = etree.XMLSchema(xsd_doc)
    except (etree.XMLSyntaxError, etree.XMLSchemaParseError) as e:
        return False, [f"XSD Compilation Error: {e}"]

    # 4. Validate and extract structured errors from the error log.
    is_valid = schema.validate(xml_doc)
    errors: list[str] = []
    if not is_valid:
        for error in schema.error_log:
            errors.append(
                f"[Line {error.line}, Col {error.column}] {error.message.strip()}"
            )
    return is_valid, errors


if __name__ == "__main__":
    # Usage: python validate.py <metadata.xml> <schema_bundle_dir>
    if len(sys.argv) != 3:
        print("Usage: python validate.py <metadata.xml> <schema_bundle_dir>")
        sys.exit(2)
    xml_file, schema_dir = sys.argv[1], sys.argv[2]
    valid, msgs = validate_iso19115(xml_file, schema_dir)
    if valid:
        print("Validation passed: ISO 19115 structure conforms to schema.")
    else:
        print("Validation failed:")
        for m in msgs:
            print(f" - {m}")
        sys.exit(1)
Step-by-Step Walkthrough
1. Strict XML Parsing
The XMLParser is configured with three flags. recover=False is the most important: with recover=True, lxml silently repairs malformed XML (unclosed or mismatched tags, for example) and hands the validator a corrected tree that may no longer match the original document. Any downstream system (a GeoNetwork harvester, a CSW endpoint, or a DCAT catalog as discussed in the DCAT-AP for Spatial Data Portals guide) will encounter the original broken bytes, not lxml's repaired version. recover=False is already the etree.XMLParser default; setting it explicitly keeps the strictness visible in the code.
no_network=True blocks outbound fetches at the parser level: it covers <!DOCTYPE> declarations and external entity references in the XML document itself. It is also the lxml default, so the explicit flag mainly documents intent. Schema compilation has no separate switch in this function; it stays offline only because every schemaLocation in the bundle resolves to a local XSD file.
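huge_tree=True lifts libxml2's built-in size limits, which matters for geospatial metadata records that embed large polygon geometries in gmd:EX_BoundingPolygon or extensive data-lineage descriptions in gmd:LI_Lineage. Without it, lxml raises XMLSyntaxError: xmlSAX2Characters: huge text node when a single text node exceeds roughly 10 MB. Because the flag disables those safety limits, enable it only for records from trusted sources.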
2. Root XSD Location
gmd.xsd is the composition root for both ISO 19115:2003 and ISO 19115-1:2014/2019 bundles. The two bundles share this filename but differ in content: the 2014/2019 bundle adds the srv namespace (service metadata, ISO 19119) and the gmx namespace (extended code lists and type definitions). When the path does not exist, the function returns an actionable error rather than letting etree.parse() fail with a generic file-read OSError.
3. Schema Compilation
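The two-step pattern, etree.parse(str(root_xsd)) followed by etree.XMLSchema(xsd_doc), keeps the schema's base URL explicit. When lxml parses an XSD file from its path, it records that location as the document's base URL. Every subsequent relative <xs:import schemaLocation="../gco/gco.xsd"> is resolved by joining that base URL with the relative path. The failure case is an XSD loaded from a string or bytes object (etree.fromstring() or etree.XML()) without a base_url argument: the document then has no base URL, and relative imports are not resolved against the bundle directory. etree.XMLSchema(file=str(root_xsd)) also loads the schema from its path, whereas passing the path as the first positional argument raises a TypeError because that argument must be a parsed tree.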
etree.XMLSchemaParseError is raised at this step when the schema cannot be compiled, which is what typically happens if an imported XSD file is missing from the bundle. The failure is explicit and early, and the exception message usually names the file or schema component that could not be resolved.
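The effect of the base URL can be reproduced with a small schema pair. The sketch below is illustrative only: the helper names compile_from_path and compile_from_bytes are hypothetical, and it assumes a root XSD whose relative imports exist next to it on disk.

```python
from pathlib import Path

from lxml import etree


def compile_from_path(root_xsd: Path) -> etree.XMLSchema:
    """Parse the XSD from its file path; lxml records that path as the base URL."""
    xsd_doc = etree.parse(str(root_xsd))
    return etree.XMLSchema(xsd_doc)


def compile_from_bytes(root_xsd: Path, with_base_url: bool) -> etree.XMLSchema:
    """Parse the XSD from bytes, optionally telling lxml where it came from."""
    data = root_xsd.read_bytes()
    # Without base_url the parsed document has no location of its own, so a
    # relative schemaLocation cannot be joined to the bundle directory.
    base_url = str(root_xsd.resolve()) if with_base_url else None
    xsd_root = etree.fromstring(data, base_url=base_url)
    return etree.XMLSchema(xsd_root)
```

With a schema that references types from a relative import, compile_from_path() and compile_from_bytes(..., with_base_url=True) should both compile, while compile_from_bytes(..., with_base_url=False) is expected to fail with XMLSchemaParseError unless the current working directory happens to contain the imported files.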
4. Error Log Extraction
schema.error_log is an lxml.etree._ListErrorLog containing _LogEntry objects. Each entry exposes line, column, and message attributes. The line and column coordinates map directly to positions in the source XML file, making it straightforward to route these messages into structured logging systems, CI annotation APIs, or API response payloads for catalog ingestion endpoints.
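For routing into structured logging or an API response, the log entries can be flattened into plain dictionaries. The following is a minimal sketch; the function names error_log_to_records and error_log_to_json and the choice of keys are assumptions of this example, while line, column, level_name, type_name, and message are attributes of lxml's log entries.

```python
import json

from lxml import etree


def error_log_to_records(schema: etree.XMLSchema) -> list[dict]:
    """Convert the schema's error log (filled by the last validate() call)
    into JSON-serialisable dictionaries."""
    return [
        {
            "line": entry.line,
            "column": entry.column,
            "level": entry.level_name,   # e.g. "ERROR"
            "type": entry.type_name,     # libxml2 error type name
            "message": entry.message.strip(),
        }
        for entry in schema.error_log
    ]


def error_log_to_json(schema: etree.XMLSchema) -> str:
    """Serialise the records, for example for a CI annotation step."""
    return json.dumps(error_log_to_records(schema), indent=2)
```

Call these immediately after schema.validate(doc) returns False, since the log reflects the most recent validation run on that schema object.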
Verification
Run the validator against a known-good ISO 19115 record, then against one with a deliberate error:
# Expected: validation passed
python validate.py valid_metadata.xml /opt/iso19115-schemas/2014
# Inject a typo in a required element, then run:
python validate.py broken_metadata.xml /opt/iso19115-schemas/2014
Expected output for a broken record:
Validation failed:
- [Line 14, Col 5] Element '{http://www.isotc211.org/2005/gmd}fileIdentiferr': This element is not expected.
- [Line 29, Col 3] Element '{http://www.isotc211.org/2005/gmd}MD_Metadata': Missing child element(s).
The namespace URI in braces (Clark notation) confirms lxml is validating against the correct gmd namespace and not falling back to an unqualified element check.
Gotchas and Edge Cases
Namespace URI vs. Prefix
lxml validates against namespace URIs, not prefix labels. A document using xmlns:myprefix="http://www.isotc211.org/2005/gmd" validates identically to one using xmlns:gmd="http://www.isotc211.org/2005/gmd". The common source of confusion is an outdated URI from a pre-2005 draft (for example, http://www.opengis.net/gmd). The validator rejects the root element immediately when the URI is wrong, and the error message names the unexpected element — not the URI mismatch — which makes the root cause non-obvious.
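The prefix-versus-URI behaviour is easy to confirm in isolation. The sketch below uses a deliberately tiny stand-in schema that only borrows the gmd namespace URI; it is not the real gmd.xsd, and MINI_XSD, DOC_TEMPLATE, and check are hypothetical names for this example.

```python
from lxml import etree

GMD_NS = "http://www.isotc211.org/2005/gmd"

# Miniature stand-in schema (NOT the real gmd.xsd): one root element with
# one string child, declared in the gmd namespace.
MINI_XSD = f"""
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="{GMD_NS}" elementFormDefault="qualified">
  <xs:element name="MD_Metadata">
    <xs:complexType><xs:sequence>
      <xs:element name="fileIdentifier" type="xs:string"/>
    </xs:sequence></xs:complexType>
  </xs:element>
</xs:schema>
""".strip().encode()

SCHEMA = etree.XMLSchema(etree.fromstring(MINI_XSD))

DOC_TEMPLATE = (
    '<{p}:MD_Metadata xmlns:{p}="{uri}">'
    "<{p}:fileIdentifier>abc</{p}:fileIdentifier>"
    "</{p}:MD_Metadata>"
)


def check(prefix: str, uri: str) -> tuple[bool, list[str]]:
    """Validate a document built with the given prefix and namespace URI."""
    doc = etree.fromstring(DOC_TEMPLATE.format(p=prefix, uri=uri).encode())
    ok = SCHEMA.validate(doc)
    return ok, [entry.message for entry in SCHEMA.error_log]


if __name__ == "__main__":
    print(check("gmd", GMD_NS))                        # prefix matches convention
    print(check("myprefix", GMD_NS))                   # different prefix, same URI
    print(check("gmd", "http://www.opengis.net/gmd"))  # wrong URI
```

The first two calls should validate identically, while the third is rejected at the root element with a message that names the element and leaves the URI mismatch for the reader to spot.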
Version Alignment
The ISO 19115 standard underwent structural changes between revisions. The 2003 bundle is smaller; 2014/2019 introduced extended srv and gmx modules. Mixing a 2003 XSD bundle with a 2014 payload triggers Element '...' is not valid errors for every complex type introduced in the later revision. Always match the bundle version to the xsi:schemaLocation declared in the document. The schema validation for spatial records page covers version detection patterns in more depth.
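One way to act on that advice is to read the xsi:schemaLocation attribute before choosing a bundle. This is a minimal sketch: declared_schema_locations is a hypothetical helper, and how the returned locations map onto local bundle directories is left to the caller.

```python
from lxml import etree

XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"


def declared_schema_locations(source) -> dict[str, str]:
    """Return {namespace URI: schema location} from the root element's
    xsi:schemaLocation attribute. `source` is a file path or file-like object."""
    root = etree.parse(source).getroot()
    # The attribute holds whitespace-separated pairs: namespace, then location.
    tokens = root.get(f"{{{XSI_NS}}}schemaLocation", "").split()
    return dict(zip(tokens[0::2], tokens[1::2]))
```

A record with no xsi:schemaLocation yields an empty dictionary, in which case the bundle version has to be decided some other way.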
recover=True Silent Pass
If recover=True is set on the parser (it is not the etree.XMLParser default, so it has to be switched on explicitly somewhere), syntactically broken XML is silently repaired before validation. The validator may return True for a document that will fail in any other parser. This is the most common cause of “validation passed but the record was rejected by GeoNetwork.” Always use recover=False in production pipelines.
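The silent pass can be demonstrated with a truncated document. The sketch below assumes a toy one-element schema and a record whose closing root tag is missing; TOY_SCHEMA, TRUNCATED, and validates are names made up for this example.

```python
from lxml import etree

# Toy schema for illustration: <record> must contain exactly one <title>.
TOY_SCHEMA = etree.XMLSchema(etree.fromstring(b"""
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="record">
    <xs:complexType><xs:sequence>
      <xs:element name="title" type="xs:string"/>
    </xs:sequence></xs:complexType>
  </xs:element>
</xs:schema>
""".strip()))

# Not well-formed: the closing </record> tag is missing.
TRUNCATED = b"<record><title>Roads</title>"


def validates(data: bytes, recover: bool) -> bool:
    """Parse with the given recover setting, then validate against TOY_SCHEMA."""
    parser = etree.XMLParser(recover=recover, no_network=True)
    doc = etree.fromstring(data, parser)
    return TOY_SCHEMA.validate(doc)


if __name__ == "__main__":
    print("recover=True :", validates(TRUNCATED, recover=True))
    try:
        validates(TRUNCATED, recover=False)
    except etree.XMLSyntaxError as exc:
        print("recover=False:", exc)
```

With recover=True the parser closes the root element on its own and the repaired tree satisfies the schema, while the strict parser raises XMLSyntaxError on the same bytes.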
Large Lineage Payloads
Records generated by automated metadata harvesting workflows sometimes embed multi-level gmd:LI_Lineage blocks that push document size past the default lxml limit. Set huge_tree=True to avoid xmlSAX2Characters: huge text node errors on otherwise valid records.
Integrating into CI/CD Pipelines
Wrap the validation function in a CLI entry point that returns a non-zero exit code on failure. This lets any CI runner (GitHub Actions, GitLab CI, Jenkins) treat schema validation as a blocking gate:
# In a GitHub Actions step:
# - run: python validate.py ${{ github.workspace }}/records/ /opt/iso19115-schemas/2014
For batch processing, lxml releases the GIL while parsing and validating, so concurrent.futures.ThreadPoolExecutor can parallelise validation across large record sets without spawning additional processes. The compiled etree.XMLSchema object can be shared across threads once created; compile it once before submitting tasks to the pool. Keep in mind that error_log belongs to the schema object, so threads that validate through one shared instance should not assume the log holds only their own document's errors.
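A directory-level runner along these lines might look like the sketch below. It is one possible arrangement, not a prescribed one: validate_directory, validate_one, and exit_code_for are hypothetical names, and the sketch compiles one schema per worker thread so that each thread reads its own error_log, which is a conservative choice of this example.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

from lxml import etree

_local = threading.local()  # holds one compiled schema per worker thread


def _schema_for_thread(root_xsd: str) -> etree.XMLSchema:
    """Compile the schema once per thread and reuse it for later records."""
    if getattr(_local, "root_xsd", None) != root_xsd:
        _local.schema = etree.XMLSchema(etree.parse(root_xsd))
        _local.root_xsd = root_xsd
    return _local.schema


def validate_one(xml_path: Path, root_xsd: str) -> tuple[str, list[str]]:
    """Return (path, errors); an empty error list means the record is valid."""
    parser = etree.XMLParser(recover=False, no_network=True)
    try:
        doc = etree.parse(str(xml_path), parser)
    except (etree.XMLSyntaxError, OSError) as exc:
        return str(xml_path), [f"XML Parse Error: {exc}"]
    schema = _schema_for_thread(root_xsd)
    if schema.validate(doc):
        return str(xml_path), []
    return str(xml_path), [
        f"[Line {e.line}, Col {e.column}] {e.message.strip()}"
        for e in schema.error_log
    ]


def validate_directory(records_dir: str, root_xsd: str, workers: int = 4) -> dict[str, list[str]]:
    """Validate every *.xml file directly inside records_dir."""
    paths = sorted(Path(records_dir).glob("*.xml"))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(lambda p: validate_one(p, root_xsd), paths))


def exit_code_for(results: dict[str, list[str]]) -> int:
    """Non-zero when any record has errors, so a CI runner blocks the build."""
    return 1 if any(results.values()) else 0
```

A CLI wrapper would print the per-file errors and pass exit_code_for(results) to sys.exit(). Compiling per thread costs extra start-up time on a large schema bundle, which is the trade-off for keeping each worker's error log separate.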
Related
- Implementing ISO 19115 Metadata Standards — core structure, element hierarchy, and mandatory fields
- Schema Validation for Spatial Records — broader validation patterns beyond ISO 19115
- Automated Metadata Harvesting Workflows — integrating validation into harvest pipelines
- DCAT-AP for Spatial Data Portals — mapping validated ISO 19115 records to DCAT-AP for open data portals
- Spatial Metadata & Catalog Integration — full coverage of metadata standards, harvesting, and catalog publishing