How to Work with XML in Python Using the ElementTree Library

August 29, 2024 | 10 min read

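In this tutorial, we will learn how to parse, modify, and populate XML files with Python's ElementTree package. To make sense of the data, we will also look at XPath expressions and XML trees.

Let's start with a brief introduction to XML. If you are already familiar with XML concepts, you can skip this section and go straight to the next one.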

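What is XML?

XML stands for "Extensible Markup Language". It is a markup language for storing and exchanging structured data, including the data behind web pages, in a form that both people and programs can read.

A file written in XML is called an XML document. An XML document forms a tree, a simple structure that naturally supports hierarchy. Let's look at some important properties of XML.

An XML document is made up of sections called elements. Each element is delimited by a start tag such as <title> and a matching end tag such as </title>, and the characters between the two tags are the element's content. The content can include markup, including other elements, which are called "child elements". The top-level element is called the root, and it contains all the other elements in the document.
A start tag or an empty-element tag can carry key-value pairs called attributes.

Below is the text of a sample XML file.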

XML

<?xml version="1.0"?>
<catalog>
   <book id="bk101">
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <genre>Computer</genre> 
      <price>44.95</price>
      <publish_date>2000-10-01</publish_date>
      <description>An in-depth look at creating applications 
      with XML.</description>
   </book>
   <book id="bk102">
      <author>Ralls, Kim</author>
      <title>Midnight Rain</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2000-12-16</publish_date>
      <description>A former architect battles corporate zombies, 
      an evil sorceress, and her own childhood to become queen 
      of the world.</description>
   </book>
   <book id="bk103">
      <author>Corets, Eva</author>
      <title>Maeve Ascendant</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2000-11-17</publish_date>
      <description>After the collapse of a nanotechnology 
      society in England, the young survivors lay the 
      foundation for a new society.</description>
   </book>
   <book id="bk104">
      <author>Corets, Eva</author>
      <title>Oberon's Legacy</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2001-03-10</publish_date>
      <description>In post-apocalypse England, the mysterious 
      agent known only as Oberon helps to create a new life 
      for the inhabitants of London. Sequel to Maeve 
      Ascendant.</description>
   </book>
   <book id="bk105">
      <author>Corets, Eva</author>
      <title>The Sundered Grail</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2001-09-10</publish_date>
      <description>The two daughters of Maeve, half-sisters, 
      battle one another for control of England. Sequel to 
      Oberon's Legacy.</description>
   </book>
   <book id="bk106">
      <author>Randall, Cynthia</author>
      <title>Lover Birds</title>
      <genre>Romance</genre>
      <price>4.95</price>
      <publish_date>2000-09-02</publish_date>
      <description>When Carla meets Paul at an ornithology 
      conference, tempers fly as feathers get ruffled.</description>
   </book>
   <book id="bk107">
      <author>Thurman, Paula</author>
      <title>Splish Splash</title>
      <genre>Romance</genre>
      <price>4.95</price>
      <publish_date>2000-11-02</publish_date>
      <description>A deep sea diver finds true love twenty 
      thousand leagues beneath the sea.</description>
   </book>
   <book id="bk108">
      <author>Knorr, Stefan</author>
      <title>Creepy Crawlies</title>
      <genre>Horror</genre>
      <price>4.95</price>
      <publish_date>2000-12-06</publish_date>
      <description>An anthology of horror stories about roaches,
      centipedes, scorpions  and other insects.</description>
   </book>
   <book id="bk109">
      <author>Kress, Peter</author>
      <title>Paradox Lost</title>
      <genre>Science Fiction</genre>
      <price>6.95</price>
      <publish_date>2000-11-02</publish_date>
      <description>After an inadvertant trip through a Heisenberg
      Uncertainty Device, James Salway discovers the problems 
      of being quantum.</description>
   </book>
</catalog>

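As we can see in the sample XML above -

<catalog> is the single root element. It contains all the other elements, such as <book> or <title>.
The child elements sit inside <catalog>, and we can see that they are nested.
Each <book> element has one attribute, id, and several child elements, such as author, title, and so on.

Note - Child elements can contain child elements of their own, sometimes called "sub-child elements".

Now, let's move on to the ElementTree library.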

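What is ElementTree?

The tree structure of XML makes it easy to navigate, modify, and delete data. Python's standard library includes the ElementTree library (xml.etree.ElementTree), which provides functions for reading and manipulating XML. It is used for parsing, that is, reading the information in a file and breaking it down into a tree. The table below summarizes the properties of each element in that tree.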

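Property	Description
Tag (tag)	A string that identifies what kind of data the element represents.
Attributes (attrib)	The element's attributes, stored as a dictionary.
Text (text)	A string holding the text content of the element.
Tail (tail)	An optional string holding any text that follows the element's end tag.
Child elements	Any number of child elements, stored as a sequence.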
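
As a minimal sketch of the properties in the table, the snippet below parses a small inline XML string and reads each property from one element. The string is a made-up one-book fragment modeled on the sample catalog (the "trailing text" is added only to show the tail), not the book.xml file itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical one-book fragment modeled on the sample catalog above.
xml_text = """<catalog><book id="bk101"><title>XML Developer's Guide</title> trailing text</book></catalog>"""

root = ET.fromstring(xml_text)
book = root[0]    # child elements behave like a sequence
title = book[0]

print(book.tag)          # the tag, a string
print(book.attrib)       # the attributes, a dictionary
print(title.text)        # the text inside the element
print(repr(title.tail))  # the text that follows the element's end tag
print(len(book))         # the number of child elements
```

Indexing with root[0] and calling len() work because an element stores its children as a sequence.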

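To use the ElementTree module, we import it at the top of our program, conventionally under the alias ET, with the statement import xml.etree.ElementTree as ET.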

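Parsing XML Data

The main goal of this tutorial is to read and understand an XML file with Python. Our sample file holds details about a number of books. In practice, data like this is often entered by different people, each in their own way, which leads to inconsistencies, so it is worth inspecting the structure before relying on it.

Let's look at the example below.

Example -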

import xml.etree.ElementTree as ET
tree = ET.parse('book.xml')
root = tree.getroot()
print(root)

Output

<Element 'catalog' at 0x000001FAD52C44A0>

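The code above parses the file into a tree and prints the root element object. We can now print each part of the tree to understand its structure.

As mentioned earlier, every part of the tree has a tag that identifies the element. An element may also carry attributes, which give additional information about it. Let's print the tag of the XML root with root.tag.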
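
The statement for this step is not shown in the text; it is presumably a print of root.tag. A self-contained sketch, which assumes a shortened inline copy of the catalog in place of book.xml:

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for book.xml (an assumption, so the sketch runs on its own).
xml_text = """<catalog>
   <book id="bk101"><title>XML Developer's Guide</title></book>
</catalog>"""

root = ET.fromstring(xml_text)
print(root.tag)  # the name of the root element
```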

Output

catalog

Looking at the top level of the XML file, we see that the document is rooted at the catalog tag. Let's look at the root's attributes with root.attrib.
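
The statement for this step is also missing from the text. A likely form, again sketched against a shortened inline stand-in for book.xml, prints the attrib dictionary of the root:

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for book.xml (an assumption, so the sketch runs on its own).
xml_text = """<catalog>
   <book id="bk101"><title>XML Developer's Guide</title></book>
</catalog>"""

root = ET.fromstring(xml_text)
# attrib is a plain dictionary; it is empty when the start tag has no attributes
print("Attributes are:", root.attrib)
```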

Output

Attributes are: {}

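As we can see, the root has no attributes.

Parsing with a for Loop

We can use a for loop to iterate over the child elements (child nodes) of the root. Let's look at the example below.

Example -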

import xml.etree.ElementTree as ET
tree = ET.parse('book.xml')
root = tree.getroot()

for ch in root:
    print(ch.tag, ch.attrib)

Output

book {'id': 'bk101'}
book {'id': 'bk102'}
book {'id': 'bk103'}
book {'id': 'bk104'}
book {'id': 'bk105'}
book {'id': 'bk106'}
book {'id': 'bk107'}
book {'id': 'bk108'}
book {'id': 'bk109'}

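As we can see, every book element is a direct child of the root, catalog. The id attribute identifies each book, and every book has a different ID.

It is often useful to get information about the elements in the whole tree. For that we use the root.iter() method, which walks the whole tree and yields every element in document order, starting with the root itself. The list of tags we build from it does not show the attributes or the level of each element in the tree, however.

Example -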

import xml.etree.ElementTree as ET
tree = ET.parse('book.xml')
root = tree.getroot()

# Collect the tag of every element in the tree, root included
tags = [elem.tag for elem in root.iter()]
print(tags)

Output

['catalog', 'book', 'author', 'title', 'genre', 'price', 'publish_date', 'description', 'book', 'author', 'title', 'genre', 'price', 'publish_date', 'description', 'book', 'author', 'title', 'genre', 'price', 'publish_date', 'description', 'book', 'author', 'title', 'genre', 'price', 'publish_date', 'description', 'book', 'author', 'title', 'genre', 'price', 'publish_date', 'description', 'book', 'author', 'title', 'genre', 'price', 'publish_date', 'description', 'book', 'author', 'title', 'genre', 'price', 'publish_date', 'description', 'book', 'author', 'title', 'genre', 'price', 'publish_date', 'description', 'book', 'author', 'title', 'genre', 'price', 'publish_date', 'description']

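ElementTree can also print the whole document with the tostring() function. We pass it the root element. When an explicit encoding is given, it returns bytes, so we decode the result back into a string before printing. This example uses 'utf8'.

The call takes the form ET.tostring(root, encoding='utf8').decode('utf8').

Example -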
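
The code for this example is missing from the text. A minimal sketch of the call described above, assuming a shortened inline stand-in for book.xml so that it runs on its own:

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for book.xml (an assumption, so the sketch runs on its own).
xml_text = """<catalog>
   <book id="bk101">
      <title>XML Developer's Guide</title>
   </book>
</catalog>"""

root = ET.fromstring(xml_text)

# With an explicit encoding, tostring() returns bytes, so decode before printing.
xml_bytes = ET.tostring(root, encoding='utf8')
document = xml_bytes.decode('utf8')
print(document)
```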

Output

<?xml version='1.0' encoding='utf8'?>
<catalog>   
   <book id="bk101">
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <genre>Computer</genre>
      <price>44.95</price>
      <publish_date>2000-10-01</publish_date>
      <description>An in-depth look at creating applications
      with XML.</description>
   </book>
   <book id="bk102">
      <author>Ralls, Kim</author>
      <title>Midnight Rain</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2000-12-16</publish_date>
      <description>A former architect battles corporate zombies,
      an evil sorceress, and her own childhood to become queen
      of the world.</description>
   </book>
   <book id="bk103">
      <author>Corets, Eva</author>
      <title>Maeve Ascendant</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2000-11-17</publish_date>
      <description>After the collapse of a nanotechnology
      society in England, the young survivors lay the
      foundation for a new society.</description>
   </book>
   <book id="bk104">
      <author>Corets, Eva</author>
      <title>Oberon's Legacy</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2001-03-10</publish_date>
      <description>In post-apocalypse England, the mysterious
      agent known only as Oberon helps to create a new life
      for the inhabitants of London. Sequel to Maeve
      Ascendant.</description>
   </book>
   <book id="bk105">
      <author>Corets, Eva</author>
      <title>The Sundered Grail</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2001-09-10</publish_date>
      <description>The two daughters of Maeve, half-sisters,
      battle one another for control of England. Sequel to
      Oberon's Legacy.</description>
   </book>
   <book id="bk106">
      <author>Randall, Cynthia</author>
      <title>Lover Birds</title>
      <genre>Romance</genre>
      <price>4.95</price>
      <publish_date>2000-09-02</publish_date>
      <description>When Carla meets Paul at an ornithology
      conference, tempers fly as feathers get ruffled.</description>
   </book>
   <book id="bk107">
      <author>Thurman, Paula</author>
      <title>Splish Splash</title>
      <genre>Romance</genre>
      <price>4.95</price>
      <publish_date>2000-11-02</publish_date>
      <description>A deep sea diver finds true love twenty
      thousand leagues beneath the sea.</description>
   </book>
   <book id="bk108">
      <author>Knorr, Stefan</author>
      <title>Creepy Crawlies</title>
      <genre>Horror</genre>
      <price>4.95</price>
      <publish_date>2000-12-06</publish_date>
      <description>An anthology of horror stories about roaches,
      centipedes, scorpions  and other insects.</description>
   </book>
   <book id="bk109">
      <author>Kress, Peter</author>
      <title>Paradox Lost</title>
      <genre>Science Fiction</genre>
      <price>6.95</price>
      <publish_date>2000-11-02</publish_date>
      <description>After an inadvertant trip through a Heisenberg
      Uncertainty Device, James Salway discovers the problems
      of being quantum.</description>
   </book>
</catalog>

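The root.iter() method also helps us find the elements we are particularly interested in. When given a tag name, it yields every element under the root, at any depth, whose tag matches. Let's look at the code below.

Example -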

for book in root.iter('book'):
    print(book.attrib)

Output

{'id': 'bk101'}
{'id': 'bk102'}
{'id': 'bk103'}
{'id': 'bk104'}
{'id': 'bk105'}
{'id': 'bk106'}
{'id': 'bk107'}
{'id': 'bk108'}
{'id': 'bk109'}

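XPath Expressions

Sometimes an element has no attributes, only text content. We can use the .text attribute to print that text. Let's look at the example below.

Example -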

import xml.etree.ElementTree as ET
tree = ET.parse('book.xml')
root = tree.getroot()

print("Description Values:")
for description in root.iter('description'):
    print(description.text)

Output

Description Values:
An in-depth look at creating applications
with XML.
A former architect battles corporate zombies,
an evil sorceress, and her own childhood to become queen
of the world.
After the collapse of a nanotechnology
society in England, the young survivors lay the
foundation for a new society.
In post-apocalypse England, the mysterious
agent known only as Oberon helps to create a new life
for the inhabitants of London. Sequel to Maeve
Ascendant.
The two daughters of Maeve, half-sisters,
battle one another for control of England. Sequel to
Oberon's Legacy.
When Carla meets Paul at an ornithology
conference, tempers fly as feathers get ruffled.
A deep sea diver finds true love twenty
thousand leagues beneath the sea.
An anthology of horror stories about roaches,
centipedes, scorpions and other insects.
After an inadvertant trip through a Heisenberg
Uncertainty Device, James Salway discovers the problems
of being quantum.

Using the .text attribute, we can get the text content of any element.
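
The description text in the sample file spans several lines, and .text returns it with those line breaks and the indentation intact. One possible way to collapse it onto a single line is sketched below; the inline XML is a one-book stand-in for book.xml, and the whitespace cleanup is an addition for illustration, not part of the original example.

```python
import xml.etree.ElementTree as ET

# One-book stand-in for book.xml (an assumption, so the sketch runs on its own).
xml_text = """<catalog>
   <book id="bk101">
      <description>An in-depth look at creating applications
      with XML.</description>
   </book>
</catalog>"""

root = ET.fromstring(xml_text)

for description in root.iter('description'):
    # split() with no arguments breaks on any run of whitespace, newlines included
    one_line = " ".join(description.text.split())
    print(one_line)
```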

Example - 2

print("Title Values:")
for title in root.iter('title'):
    print(title.text)

Output

Title Values:
XML Developer's Guide
Midnight Rain
Maeve Ascendant
Oberon's Legacy
The Sundered Grail
Lover Birds
Splish Splash
Creepy Crawlies
Paradox Lost

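Looping over every element like this is not the recommended way to search an XML file. XPath is the more common and recommended approach. XPath stands for XML Path Language. It is a query language for searching XML quickly and easily, with a path-like syntax for identifying and navigating the nodes of an XML document. ElementTree supports a limited subset of XPath.

ElementTree provides the findall() method, which returns every element that matches a path expression, evaluated relative to the element it is called on. A bare tag name matches only direct children.

Let's look at the example below.

Example -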

import xml.etree.ElementTree as ET
tree = ET.parse('book.xml')
root = tree.getroot()

for val in root.findall("./book[price='5.95']"):
    print(val.attrib)

Output

{'id': 'bk102'}
{'id': 'bk103'}
{'id': 'bk104'}
{'id': 'bk105'}

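Four books have a price equal to 5.95. This approach finds specific results in a large XML file quickly and efficiently. Now, let's find the books whose genre is Romance.

Example - 2

for val in root.findall("./book[genre='Romance']"):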
    print(val.attrib)

Output

{'id': 'bk106'}
{'id': 'bk107'}
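
The two queries above use the [tag='text'] predicate. A few other forms from the XPath subset that ElementTree supports are sketched below against a three-book inline stand-in for book.xml; the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# Three-book stand-in for book.xml (an assumption, so the sketch runs on its own).
xml_text = """<catalog>
   <book id="bk101"><title>XML Developer's Guide</title><genre>Computer</genre><price>44.95</price></book>
   <book id="bk102"><title>Midnight Rain</title><genre>Fantasy</genre><price>5.95</price></book>
   <book id="bk106"><title>Lover Birds</title><genre>Romance</genre><price>4.95</price></book>
</catalog>"""
root = ET.fromstring(xml_text)

# [@attr='value'] selects elements by attribute value
by_id = root.find("./book[@id='bk102']")

# .// searches all descendants, not just direct children
all_titles = [t.text for t in root.findall(".//title")]

# [n] selects by position, counting from 1
first_book = root.find("./book[1]")

# findtext() returns the text of the first match
romance_title = root.findtext("./book[genre='Romance']/title")

# Numeric comparisons are not part of the supported subset, so filter in Python
cheap_ids = [b.get('id') for b in root.findall('book')
             if float(b.findtext('price')) < 6]

print(by_id.attrib, all_titles, first_book.attrib, romance_title, cheap_ids)
```

Predicates in this subset only test for equality, which is why the price filter is written as an ordinary list comprehension.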

Modifying XML

We can modify an XML file as needed. Let's look at the example below.

Example - Printing the book titles again

for title in root.iter('title'):
    print(title.text)

Output

XML Developer's Guide
Midnight Rain
Maeve Ascendant
Oberon's Legacy
The Sundered Grail
Lover Birds
Splish Splash
Creepy Crawlies
Paradox Lost

Now let's replace the title 'Midnight Rain' with 'The Alchemist'.

book = root.find("./book[title='Midnight Rain']")
print(book)

# The title is a child element of <book>, so we change its text, not an attribute
book.find('title').text = "The Alchemist"
print(book.find('title').text)

Output

<Element 'book' at 0x0000024822762770>
The Alchemist

Once we have modified the tree, we write the changes back to the XML file. Let's look at the example below.

Example -

tree.write("book.xml")

tree = ET.parse('book.xml')
root = tree.getroot()

for title in root.iter('title'):
    print(title.text)

Output

XML Developer's Guide
The Alchemist
Maeve Ascendant
Oberon's Legacy
The Sundered Grail
Lover Birds
Splish Splash
Creepy Crawlies
Paradox Lost

Example - 2

for description in root.iter('description'):
    # Append a note to the existing text and flag the element as updated
    new_desc = (description.text or '') + ' This is an author view.'
    description.text = new_desc
    description.set('updated', 'yes')

tree.write('book.xml')

Output

<catalog>
   <book id="bk101">
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <genre>Computer</genre>
      <price>44.95</price>
      <publish_date>2000-10-01</publish_date>
      <description updated="yes">An in-depth look at creating applications
      with XML. This is an author view.</description>
   </book>
   <book id="bk102">
      <author>Ralls, Kim</author>
      <title>The Alchemist</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2000-12-16</publish_date>
      <description updated="yes">A former architect battles corporate zombies,
      an evil sorceress, and her own childhood to become queen
      of the world. This is an author view.</description>
   </book>

The code above appends a note to every description in book.xml and marks each one with an updated="yes" attribute. Only the first two books are shown here, but the change applies to every book in the file.
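
Called with only a file name, tree.write() overwrites that file and leaves out the XML declaration. As a hedged sketch of one alternative, the snippet below writes to an in-memory buffer with an explicit encoding and declaration. The one-book tree is a stand-in for book.xml, and the buffer stands in for a separate output file such as a hypothetical book_updated.xml.

```python
import io
import xml.etree.ElementTree as ET

# One-book stand-in for book.xml (an assumption, so the sketch runs on its own).
xml_text = """<catalog><book id="bk102"><title>Midnight Rain</title></book></catalog>"""
tree = ET.ElementTree(ET.fromstring(xml_text))

# Change the title text in the tree
tree.getroot().find("./book/title").text = "The Alchemist"

# Write to a bytes buffer; passing a path to a new file would work the same way
# and would leave the original file untouched.
buffer = io.BytesIO()
tree.write(buffer, encoding='utf-8', xml_declaration=True)

written = buffer.getvalue().decode('utf-8')
print(written)

# Parse the written bytes again to confirm the change round-trips
reparsed = ET.fromstring(buffer.getvalue())
print(reparsed.findtext("./book/title"))
```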

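Conclusion

In this tutorial, we covered several important concepts. An XML file follows a tree structure built from tags, which specify where each value belongs. This structure makes XML easy to read and write. Nesting one element between another element's start and end tags expresses the parent-child relationship.

Attributes describe an element further, for example by attaching an identifier or a flag to it. As discussed in this tutorial, ElementTree is a powerful Python library for parsing and navigating XML documents. It breaks an XML document down into a tree structure, which gives us a simple way to work with it. We can now start using this library in our projects to parse documents.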
