pyspark.sql.functions.schema_of_xml

pyspark.sql.functions.schema_of_xml(xml, options=None)

Parses an XML string and infers its schema in DDL format.

New in version 4.0.0.

Parameters
xml : Column or str

An XML string, or a foldable string column containing an XML string.

options : dict, optional

Options to control parsing. Accepts the same options as the XML data source. See Data Source Option for the version you use.

Returns
Column

A string representation of a StructType inferred from the given XML.

Examples

Example 1: Parsing a simple XML string with a single element

>>> from pyspark.sql import functions as sf
>>> df = spark.range(1)
>>> df.select(sf.schema_of_xml(sf.lit('<p><a>1</a></p>')).alias("xml")).collect()
[Row(xml='STRUCT<a: BIGINT>')]

Example 2: Parsing XML with a repeated element, inferred as an array

>>> from pyspark.sql import functions as sf
>>> df.select(sf.schema_of_xml(sf.lit('<p><a>1</a><a>2</a></p>')).alias("xml")).collect()
[Row(xml='STRUCT<a: ARRAY<BIGINT>>')]

Example 3: Parsing XML with options to exclude attributes

>>> from pyspark.sql import functions as sf
>>> schema = sf.schema_of_xml('<p><a attr="2">1</a></p>', {'excludeAttribute':'true'})
>>> df.select(schema.alias("xml")).collect()
[Row(xml='STRUCT<a: BIGINT>')]

Example 4: Parsing XML with complex structure

>>> from pyspark.sql import functions as sf
>>> df.select(
...     sf.schema_of_xml(
...         sf.lit('<root><person><name>Alice</name><age>30</age></person></root>')
...     ).alias("xml")
... ).collect()
[Row(xml='STRUCT<person: STRUCT<age: BIGINT, name: STRING>>')]

Example 5: Parsing XML with an array nested inside a struct

>>> from pyspark.sql import functions as sf
>>> df.select(
...     sf.schema_of_xml(
...         sf.lit('<data><values><value>1</value><value>2</value></values></data>')
...     ).alias("xml")
... ).collect()
[Row(xml='STRUCT<values: STRUCT<value: ARRAY<BIGINT>>>')]