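XPath Basics and Applications

What Is XPath?

XPath (XML Path Language) is a query language for selecting nodes in an XML document, and it works equally well on an HTML page once the page has been parsed into a tree. In web scraping, XPath lets us locate and extract data from a page precisely.

 Advantages of XPath
 
 Locates elements precisely, even in deeply nested structures
 
 Supports several ways of locating elements, such as by tag name, attribute, or text content
 
 Allows multiple conditions to be combined with logical operators
 
 Works with the lxml library, which parses quickly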
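To make the point about logical operators concrete, here is a minimal sketch using lxml. The markup and the `data-stock` attribute are invented for this illustration and do not come from any real site.

```python
from lxml import etree

# Hypothetical markup, made up for this illustration.
HTML = """
<ul>
  <li class="book" data-stock="yes">Python Basics</li>
  <li class="book" data-stock="no">Web Scraping</li>
  <li class="video" data-stock="yes">Data Analysis</li>
</ul>
"""

tree = etree.HTML(HTML)

# and: both conditions must hold
in_stock_books = tree.xpath('//li[@class="book" and @data-stock="yes"]/text()')

# or: either condition is enough
video_or_sold_out = tree.xpath('//li[@class="video" or @data-stock="no"]/text()')

# not(): negate a condition
not_books = tree.xpath('//li[not(@class="book")]/text()')

print(in_stock_books)
print(video_or_sold_out)
print(not_books)
```

The operators `and`, `or`, and the function `not()` are part of XPath 1.0, so the same predicates should work in any conforming XPath engine.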

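Common XPath Expressions

//div → selects all div elements

//div[@id='content'] → selects the div element whose id is content

//a[contains(@href, 'python')] → selects a elements whose href attribute contains 'python'

//h1/text() → selects the text nodes directly inside every h1 element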
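The four expressions above can be tried with lxml as follows. This is a small sketch; the page content is a made-up example.

```python
from lxml import etree

# Made-up page used only to exercise the four expressions.
HTML = """
<html><body>
  <h1>Learning XPath</h1>
  <div id="content"><a href="/tutorials/python-intro">Intro</a></div>
  <div id="sidebar"><a href="/about">About</a></div>
</body></html>
"""

tree = etree.HTML(HTML)

all_divs = tree.xpath('//div')                              # list of elements
content = tree.xpath("//div[@id='content']")                # list of elements
python_links = tree.xpath("//a[contains(@href, 'python')]") # list of elements
h1_text = tree.xpath('//h1/text()')                         # list of strings

print(len(all_divs))
print(content[0].get('id'))
print([a.get('href') for a in python_links])
print(h1_text)
```

Note that `xpath()` returns a list in each case: a list of elements when the expression selects elements, and a list of strings when it ends in `text()` or an attribute such as `@href`.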

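XPath Path Types

Absolute Paths

An absolute path starts at the root node of the document and begins with a single slash (/). It points to the same elements regardless of the current context node.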


/html/body/div/div[1]/h1
/html/body/table/tbody/tr[2]/td[3]

Note: an absolute path depends on the overall structure of the document, so it breaks easily when the page layout changes.
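One common way this fragility shows up: browsers usually insert a tbody element when they build the DOM for a table, while lxml's HTML parser keeps the table as it appears in the source. An absolute path copied from browser developer tools may therefore match nothing in lxml. The sketch below uses a made-up table to illustrate this.

```python
from lxml import etree

# Made-up table; note that the source contains no <tbody>.
HTML = """
<html><body>
  <table>
    <tr><td>Name</td><td>Price</td><td>Stock</td></tr>
    <tr><td>Pen</td><td>99</td><td>12</td></tr>
  </table>
</body></html>
"""

tree = etree.HTML(HTML)

# Path as a browser might report it (with tbody): no match here.
with_tbody = tree.xpath('/html/body/table/tbody/tr[2]/td[3]/text()')

# Path that follows the actual source structure.
without_tbody = tree.xpath('/html/body/table/tr[2]/td[3]/text()')

# A looser path that does not care whether tbody is present.
anywhere = tree.xpath('//table//tr[2]/td[3]/text()')

print(with_tbody, without_tbody, anywhere)
```

The last form trades a little precision for robustness, which is often the better deal in a scraper.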

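Relative Paths

A relative path is matched from the current node or from any matching node, and usually begins with a double slash (//) or a dot (.). Strictly speaking, an expression that begins with // is still evaluated from the document root, but because it matches at any depth it does not depend on the exact chain of ancestors. An expression that begins with . is evaluated from the current context node. Both forms are more flexible than a full absolute path and suit most scraping scenarios.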


//div[@class="product-list"]
.//a[contains(text(), "购买")]
//span[@id="price"]/following-sibling::span

Advantage: relative paths are more flexible and tolerate changes in page structure better.
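The difference between a leading `//` and a leading `.//` matters most when an expression is run on an element rather than on the whole tree. The following sketch, with made-up markup, shows that `//` searches the entire document even when called on a single element.

```python
from lxml import etree

# Made-up markup for this illustration.
HTML = """
<div class="product-list">
  <div class="product"><a href="/buy/1">Buy laptop</a></div>
  <div class="product"><a href="/buy/2">Buy phone</a></div>
</div>
"""

tree = etree.HTML(HTML)
first_product = tree.xpath('//div[@class="product"]')[0]

# .// is scoped to the context element
scoped = first_product.xpath('.//a/@href')

# // ignores the context element and searches the whole document
unscoped = first_product.xpath('//a/@href')

print(scoped)
print(unscoped)
```

When looping over items, starting inner expressions with `.` keeps each lookup inside the current item.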

Path Selection Example


<html>
  <body>
    <div class="container">
      <h1>产品列表</h1>
      <div class="product-list">
        <div class="product">
          <h2>笔记本电脑</h2>
          <p class="price">¥5999</p>
          <a href="#buy">立即购买</a>
        </div>
        <div class="product">
          <h2>智能手机</h2>
          <p class="price">¥3999</p>
          <a href="#buy">立即购买</a>
        </div>
      </div>
    </div>
  </body>
</html>

Get the product list with an absolute path:

/html/body/div/div[@class="product-list"]

Get all product titles with a relative path:

//div[@class="product"]/h2/text()
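As a quick check, both expressions can be run against the sample page with lxml. This sketch copies the sample HTML into a string; the extra price lookup at the end is an addition for illustration.

```python
from lxml import etree

# The sample page from this section, copied into a string.
HTML = """
<html>
  <body>
    <div class="container">
      <h1>产品列表</h1>
      <div class="product-list">
        <div class="product">
          <h2>笔记本电脑</h2>
          <p class="price">¥5999</p>
          <a href="#buy">立即购买</a>
        </div>
        <div class="product">
          <h2>智能手机</h2>
          <p class="price">¥3999</p>
          <a href="#buy">立即购买</a>
        </div>
      </div>
    </div>
  </body>
</html>
"""

tree = etree.HTML(HTML)

# Absolute path: returns the product-list element
product_list = tree.xpath('/html/body/div/div[@class="product-list"]')

# Relative path: returns the title strings
titles = tree.xpath('//div[@class="product"]/h2/text()')

# Extra illustration: search for prices inside the product list only
prices = product_list[0].xpath('.//p[@class="price"]/text()')

print(len(product_list), titles, prices)
```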

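Relationships Between Nodes

In XML and HTML documents, elements are related to one another in several ways. Understanding these relationships helps you write more precise XPath expressions.

 Types of node relationships
 
 1. Parent: a node that directly contains another node
 2. Child: a node that is directly contained by another node
 3. Sibling: nodes that share the same parent
 4. Ancestor: a node's parent, grandparent, and so on
 5. Descendant: a node's children, grandchildren, and so on

Node relationship expressions

parent::node() → selects the parent of the current node

child::node() → selects all children of the current node

following-sibling::node() → selects all siblings after the current node

preceding-sibling::node() → selects all siblings before the current node

ancestor::node() → selects all ancestors of the current node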

Node Relationship Examples


<div class="order">
  <div class="order-header">
    <h3>订单信息</h3>
    <p class="order-id">订单号: 20231219001</p>
  </div>
  <div class="order-items">
    <div class="item">
      <span class="name">商品A</span>
      <span class="price">¥99.00</span>
      <span class="quantity">x2</span>
    </div>
    <div class="item">
      <span class="name">商品B</span>
      <span class="price">¥199.00</span>
      <span class="quantity">x1</span>
    </div>
  </div>
  <div class="order-footer">
    <span class="total-label">总计:</span>
    <span class="total-price">¥397.00</span>
  </div>
</div>

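Get the price of 商品A:

//span[text()="商品A"]/following-sibling::span[@class="price"]

Get the span siblings that come before the total price (here, only the total-label span):

//span[@class="total-price"]/preceding-sibling::span

Get the parent node of the item elements:

//div[@class="item"]/parent::node()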
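The three expressions above can be checked against the order markup with lxml. The sketch below copies the sample into a string and adds one `ancestor` lookup as an extra illustration.

```python
from lxml import etree

# The order sample from this section, copied into a string.
HTML = """
<div class="order">
  <div class="order-header">
    <h3>订单信息</h3>
    <p class="order-id">订单号: 20231219001</p>
  </div>
  <div class="order-items">
    <div class="item">
      <span class="name">商品A</span>
      <span class="price">¥99.00</span>
      <span class="quantity">x2</span>
    </div>
    <div class="item">
      <span class="name">商品B</span>
      <span class="price">¥199.00</span>
      <span class="quantity">x1</span>
    </div>
  </div>
  <div class="order-footer">
    <span class="total-label">总计:</span>
    <span class="total-price">¥397.00</span>
  </div>
</div>
"""

tree = etree.HTML(HTML)

price_a = tree.xpath(
    '//span[text()="商品A"]/following-sibling::span[@class="price"]/text()')
before_total = tree.xpath(
    '//span[@class="total-price"]/preceding-sibling::span/text()')
item_parents = tree.xpath('//div[@class="item"]/parent::node()')

# Extra illustration: class names of every div that encloses 商品A
ancestor_classes = tree.xpath('//span[text()="商品A"]/ancestor::div/@class')

print(price_a, before_total)
print([el.get('class') for el in item_parents])
print(ancestor_classes)
```

Both items share one parent, and an XPath node-set holds each node only once, so `item_parents` contains a single element.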

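Practical Examples

Extracting E-commerce Product Information with XPath

Xiao Ming needs to extract product names, prices, and review counts from an e-commerce site, which he does with a few XPath expressions.

# Extract product information with XPath
import requests
from lxml import etree

url = 'https://example-shop.com/products'
response = requests.get(url, timeout=10)
response.raise_for_status()  # stop early on an HTTP error status
html = etree.HTML(response.text)

# Product names
titles = html.xpath('//div[@class="product-title"]/text()')

# Prices
prices = html.xpath('//div[@class="product-price"]/span/text()')

# Review counts
ratings = html.xpath('//div[@class="product-rating"]/span/text()')

# Combine the three lists; zip assumes every product has all three fields
for title, price, rating in zip(titles, prices, ratings):
    print(f"Product: {title.strip()}")
    print(f"Price: {price.strip()}")
    print(f"Reviews: {rating.strip()}")
    print("---")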
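Combining three separately extracted lists with `zip` only lines up when every product has every field. A possible alternative is to select each product container first and read its fields relative to that container. The sketch below assumes a wrapping `div class="product"` around each product, which the original example does not specify. The inline HTML and the `first_text` helper are likewise made up for illustration.

```python
from lxml import etree

# Hypothetical listing: the second product has no rating block.
HTML = """
<div class="product">
  <div class="product-title"> Laptop </div>
  <div class="product-price"><span>5999</span></div>
  <div class="product-rating"><span>120</span></div>
</div>
<div class="product">
  <div class="product-title"> Phone </div>
  <div class="product-price"><span>3999</span></div>
</div>
<div class="product">
  <div class="product-title"> Tablet </div>
  <div class="product-price"><span>2999</span></div>
  <div class="product-rating"><span>64</span></div>
</div>
"""


def first_text(node, expr, default=None):
    """Return the first stripped string matched by expr, or default."""
    values = node.xpath(expr)
    return values[0].strip() if values else default


tree = etree.HTML(HTML)

# Approach 1: separate lists joined with zip (misaligns on missing data)
titles = tree.xpath('//div[@class="product-title"]/text()')
ratings = tree.xpath('//div[@class="product-rating"]/span/text()')
zipped = [(t.strip(), r.strip()) for t, r in zip(titles, ratings)]

# Approach 2: one record per product container
products = []
for node in tree.xpath('//div[@class="product"]'):
    products.append({
        "title": first_text(node, './div[@class="product-title"]/text()'),
        "price": first_text(node, './div[@class="product-price"]/span/text()'),
        "rating": first_text(node, './div[@class="product-rating"]/span/text()'),
    })

print(zipped)
print(products)
```

In the zipped result the phone is paired with the tablet's rating and the tablet is dropped, while the per-container records keep the missing rating as `None`.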

Parsing a News Site with XPath

Xiao Hong is a data analyst who uses XPath to extract headlines, publication times, and summaries from a news site.

# Parse a news site with XPath
import requests
from lxml import etree

url = 'https://example-news.com/latest'
headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get(url, headers=headers, timeout=10)
html = etree.HTML(response.text)

# Select every news entry
news_items = html.xpath('//div[contains(@class, "news-item")]')

for item in news_items:
    # Each lookup starts with . so it stays inside the current entry
    title = item.xpath('.//h2[@class="news-title"]/a/text()')
    link = item.xpath('.//h2[@class="news-title"]/a/@href')
    publish_time = item.xpath('.//span[@class="publish-time"]/text()')
    summary = item.xpath('.//div[@class="news-summary"]/text()')

    # Skip entries without a title or link; time and summary are optional
    if title and link:
        print(f"Title: {title[0].strip()}")
        print(f"Link: {link[0]}")
        if publish_time:
            print(f"Time: {publish_time[0].strip()}")
        if summary:
            print(f"Summary: {summary[0].strip()}")
        print("---")
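Two details of the loop above are worth a closer look. First, `contains(@class, "news-item")` is a substring test, so it would also match a class such as `news-item-ad`. Second, extracted `href` values may be relative. The following sketch shows one common way to handle both, using made-up markup in which the class `news-item-ad` and the link paths are assumptions.

```python
from urllib.parse import urljoin

from lxml import etree

PAGE_URL = "https://example-news.com/latest"  # URL from the example above

# Hypothetical markup for this illustration.
HTML = """
<div class="news-item featured">
  <h2 class="news-title"><a href="/2023/story-1">First story</a></h2>
</div>
<div class="news-item">
  <h2 class="news-title"><a href="https://example-news.com/2023/story-2">Second story</a></h2>
</div>
<div class="news-item-ad">
  <h2 class="news-title"><a href="/ads/1">Sponsored</a></h2>
</div>
"""

tree = etree.HTML(HTML)

# Substring match: also picks up news-item-ad
loose = tree.xpath('//div[contains(@class, "news-item")]')

# Whole-token match: pad the class list with spaces, then look for " news-item "
TOKEN = 'contains(concat(" ", normalize-space(@class), " "), " news-item ")'
strict = tree.xpath(f'//div[{TOKEN}]')

# Resolve each href against the page URL; absolute links are left unchanged
links = [
    urljoin(PAGE_URL, href)
    for item in strict
    for href in item.xpath('.//h2[@class="news-title"]/a/@href')
]

print(len(loose), len(strict))
print(links)
```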

Interactive Practice

Try your own XPath expressions against the sample structure below. It is a simple setting for learning basic XPath syntax.

Sample HTML structure:

<div class="products">
  <div class="product">
    <h3 class="title">Python入门教程</h3>
    <p class="price">¥99.00</p>
    <span class="rating">4.8 (120人评价)</span>
  </div>
  <div class="product">
    <h3 class="title">Web爬虫实战</h3>
    <p class="price">¥129.00</p>
    <span class="rating">4.9 (86人评价)</span>
  </div>
  <div class="product">
    <h3 class="title">数据分析与可视化</h3>
    <p class="price">¥159.00</p>
    <span class="rating">4.7 (64人评价)</span>
  </div>
</div>


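Exercises

Exercise 1: Basic Locating (level: basic)

Write an XPath expression that extracts the titles of all products from the following HTML structure: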

<div class="shop">
  <div class="item">
    <h2>笔记本电脑</h2>
    <p>高性能办公利器</p>
  </div>
  <div class="item">
    <h2>智能手机</h2>
    <p>5G全网通</p>
  </div>
</div>

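Exercise 2: Attribute Filtering (level: intermediate)

Write an XPath expression that extracts the names of products priced above 100 yuan from the following HTML structure: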

<div class="product-list">
 <product price="99">钢笔</product>
 <product price="199">智能手表</product>
 <product price="299">无线耳机</product>
 <product price="89">笔记本</product>
</div>

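Exercise 3: Text Content (level: intermediate)

Write an XPath expression that extracts the course titles containing the word "Python" from the following HTML structure: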

<div class="courses">
 <h3>Python数据分析</h3>
 <h3>Java程序设计</h3>
 <h3>Python网络爬虫</h3>
 <h3>Web前端开发</h3>
</div>

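Exercise 4: Combined Locating (level: advanced)

Write an XPath expression that extracts the links (src values) of all images inside the div whose id is "featured" from the following HTML structure: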

<div id="featured">
  <div class="item">
    <img src="img1.jpg" alt="产品1">
  </div>
  <div class="item">
    <img src="img2.jpg" alt="产品2">
  </div>
</div>
<div class="normal">
  <img src="img3.jpg" alt="普通产品">
</div>