Python 3 Basics Tutorial (93)

BeautifulSoup is a Python library whose main purpose is extracting data from web pages.

1. Introduction to BeautifulSoup

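Beautiful Soup provides a few simple, Pythonic functions for navigating, searching, and modifying a parse tree. It is a toolkit that parses a document and hands you the data you want to extract; because the API is simple, a complete application takes very little code.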
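The three kinds of operation mentioned above (navigating, searching, modifying) can be sketched on a tiny document. The markup and the variable names here are made up for illustration and are not part of the book page used later.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only to illustrate the three kinds of operation
html = '<div id="app"><p class="note">first</p><p>second</p></div>'
soup = BeautifulSoup(html, "html.parser")

# Navigate: attribute access walks down to the first matching descendant
first_p = soup.div.p
first_text = first_p.string

# Search: find_all returns every match, find returns only the first one
all_p = soup.find_all("p")
note = soup.find("p", class_="note")

# Modify: replace the text and an attribute of a tag in place
note.string = "updated"
note["class"] = "done"

result = str(soup)
print(result)
```

Attribute access such as soup.div.p is convenient for quick exploration, while find and find_all are usually the better choice once the search needs conditions on attributes.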
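Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You do not need to think about encodings unless the document does not declare one and Beautiful Soup cannot detect it; in that case you only need to state the original encoding.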
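As a small sketch of stating the original encoding yourself: the bytes below are GBK-encoded and carry no charset declaration, so the encoding is passed through the from_encoding argument of the BeautifulSoup constructor. The sample markup is an assumption made for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical input: GBK-encoded bytes without any <meta charset> declaration
raw = "<html><body><p>图书列表</p></body></html>".encode("gbk")

# from_encoding tells Beautiful Soup how to decode the incoming bytes
soup = BeautifulSoup(raw, "html.parser", from_encoding="gbk")

# Inside the tree everything is Unicode (a Python str)
text = soup.p.string
print(text)

# Encoding a tag produces bytes, UTF-8 by default
utf8_bytes = soup.p.encode("utf-8")
print(utf8_bytes)
```

from_encoding only matters when the input is bytes; a str passed to the constructor is already Unicode and needs no decoding.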
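Beautiful Soup sits on top of Python parsers such as lxml and html5lib, in addition to the standard library's html.parser, so you can choose between different parsing strategies or trade flexibility for speed.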
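The parser is chosen by the second argument of the constructor. Since lxml and html5lib are separate packages that may not be installed, one possible approach is to try them in order of preference and fall back to html.parser. The helper name make_soup and the preference order are assumptions made for this sketch.

```python
from bs4 import BeautifulSoup, FeatureNotFound

def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Hypothetical helper: build a soup with the first parser that is installed."""
    for name in preferred:
        try:
            return BeautifulSoup(markup, name), name
        except FeatureNotFound:
            # This parser is not installed; try the next one
            continue
    raise RuntimeError("no usable parser found")

soup, parser_name = make_soup("<p>hello</p>")
print(parser_name, soup.p.string)
```

Different parsers can build slightly different trees from invalid markup, so it is usually safer to name one parser explicitly in production code than to rely on whichever happens to be installed.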

2. BeautifulSoup static page scraping example

Suppose we have the following static book page.

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Book List</title>
    <style>
        h1 {
            text-align: center;
            color: #666;
            text-shadow: #999999;
        }

        #app {
            width: 80%;
            height: 1960px;
            margin: 20px auto;
            /*outline: 2px solid lightskyblue;*/
            position: relative;
        }

        .book-container {
            width: 50%;
            height: auto;
            position: absolute;
            background: #fafafa;
            top: 0;
            left: 0;
            bottom: 0;
            right: 0;
            margin: 0 auto;
            border: 2px solid #ccc;
            border-radius: 10px;
        }

        .book-container dl {
            width: 80%;
            margin-left: 20px;
        }

        .book-pic {
            width: 280px;
            height: 360px;
            position: relative;
            top: 50%;
            left: 25%;
            margin-top: 12px;

        }

        dl dd {
            text-align: center;
            border-bottom: 1px dashed #999;
            position: relative;

        }

        .book-name {

            font-weight: bolder;
            color: #7a1723;
        }

        .book-author {
            margin-top: 12px;
            font-size: smaller;
            color: #666;
            font-family: 楷体;
        }

        .book-publisher {
            margin-top: 12px;
            font-size: smaller;
            color: #666;
            font-family: 楷体;
        }
    </style>
</head>
<body>
<h1>Bestselling Book List</h1>
<hr>
<div id="app">
    <div class="book-container">
        <dl>
            <dt><img class="book-pic" src="http://media.simoniu.com/水浒封面001.jpg"></dt>
            <dd class="book-name">水浒</dd>
            <dd class="book-author">施耐庵</dd>
            <dd class="book-publisher">人民文学出版社</dd>
        </dl>
        <dl>
            <dt><img class="book-pic" src="http://media.simoniu.com/西游记封面001.jpeg"></dt>
            <dd class="book-name">西游记</dd>
            <dd class="book-author">吴承恩</dd>
            <dd class="book-publisher">人民文学出版社</dd>
        </dl>
        <dl>
            <dt><img class="book-pic" src="http://media.simoniu.com/三国演义封面001.png"></dt>
            <dd class="book-name">三国演义</dd>
            <dd class="book-author">罗贯中</dd>
            <dd class="book-publisher">北京大学出版社</dd>
        </dl>
        <dl>
            <dt><img class="book-pic" src="http://media.simoniu.com/红楼梦封面001.png"></dt>
            <dd class="book-name">红楼梦</dd>
            <dd class="book-author">曹雪芹</dd>
            <dd class="book-publisher">人民教育出版社</dd>
        </dl>
    </div>
</div>

<div style="text-align: center;margin: 0 auto">
    <hr>
    <div style="font-size: xx-small;color: darkcyan">2020-2023 &copy; 华清远见 Author: 西蒙牛</div>
</div>
</body>
</html>
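Before fetching anything over the network, a fragment of a page like this one can be parsed directly from a string to see how the elements are reached. This is only a sketch: the fragment holds a single dl entry copied from the page, and the variable names are chosen for illustration.

```python
from bs4 import BeautifulSoup

# One <dl> entry copied from the page, so the example runs offline
fragment = """
<dl>
    <dt><img class="book-pic" src="http://media.simoniu.com/水浒封面001.jpg"></dt>
    <dd class="book-name">水浒</dd>
    <dd class="book-author">施耐庵</dd>
    <dd class="book-publisher">人民文学出版社</dd>
</dl>
"""

soup = BeautifulSoup(fragment, "html.parser")
dl = soup.find("dl")

# The text of a <dd> is selected by its class attribute
name = dl.find("dd", attrs={"class": "book-name"}).string

# Attributes are read with dictionary-style access on the tag
cover = dl.find("img")["src"]

# class is a multi-valued attribute, so Beautiful Soup returns it as a list
dd_classes = [dd["class"][0] for dd in dl.find_all("dd")]

print(name, cover, dd_classes)
```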

The BeautifulSoup scraper for this page:

# -*- coding: utf-8 -*-
# @Author: simoniu
# @Time : 2023/4/2 11:11
# @File : beautifulsoup_demo.py
# @Software : PyCharm

# A minimal BeautifulSoup4 scraper: extract book information from an online page

# The most common way to import BeautifulSoup
from bs4 import BeautifulSoup
import urllib.request, urllib.error

# Book class
class Books(object):
    def __init__(self, name, author, publisher, pic):
        self.name = name
        self.author = author
        self.publisher = publisher
        self.pic = pic

    def __str__(self):
        return "Title: %s, Author: %s, Publisher: %s, Cover: %s" % (self.name, self.author, self.publisher, self.pic)

# Send the HTTP request and return the HTML
def askURL(url):
    # Send a browser-like User-Agent; some servers reject urllib's default one
    head = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36"
    }
    request = urllib.request.Request(url=url, headers=head)
    html = ""
    try:
        response = urllib.request.urlopen(request)
        html = response.read().decode("utf-8")
    except urllib.error.URLError as e:
        # HTTPError (a subclass of URLError) carries a status code
        if hasattr(e, "code"):
            print(e.code)
        if hasattr(e, "reason"):
            print(e.reason)
    # An empty string is returned if the request failed
    return html

# URL of the book page to scrape
booksUrl = "http://182.44.62.244:8099/books.html"
html = askURL(booksUrl)
soup = BeautifulSoup(html, 'html.parser')
# print(soup)

# Each <dl> element holds one book
# print(soup.find_all("dl"))
book_list = soup.find_all("dl")
books = []

for book in book_list:
    book_name = book.find("dd", attrs={"class": "book-name"}).string
    book_author = book.find("dd", attrs={"class": "book-author"}).string
    book_publisher = book.find("dd", attrs={"class": "book-publisher"}).string
    book_pic = book.find("dt").find("img", attrs={"class": "book-pic"})['src']
    # print(book_name, ',', book_author, ',', book_publisher, ',', book_pic)
    b = Books(book_name, book_author, book_publisher, book_pic)
    books.append(b)

for b in books:
    print(b)

Output:

Title: 水浒, Author: 施耐庵, Publisher: 人民文学出版社, Cover: http://media.simoniu.com/水浒封面001.jpg
Title: 西游记, Author: 吴承恩, Publisher: 人民文学出版社, Cover: http://media.simoniu.com/西游记封面001.jpeg
Title: 三国演义, Author: 罗贯中, Publisher: 北京大学出版社, Cover: http://media.simoniu.com/三国演义封面001.png
Title: 红楼梦, Author: 曹雪芹, Publisher: 人民教育出版社, Cover: http://media.simoniu.com/红楼mega封面001.png
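The scraper above assumes every dl contains all four elements; find returns None when an element is missing, and the following attribute access then raises an error. A possible variation, sketched here with CSS selectors (select and select_one), tolerates missing fields. The names Book, text_of, parse_books and the sample markup are hypothetical and introduced only for this example; it parses a local string, so no network access is needed.

```python
from dataclasses import dataclass
from bs4 import BeautifulSoup

@dataclass
class Book:
    # Hypothetical record type for this sketch, not the Books class above
    name: str
    author: str
    publisher: str
    pic: str

def text_of(parent, selector):
    """Return the stripped text of the first match, or "" if nothing matches."""
    node = parent.select_one(selector)
    return node.get_text(strip=True) if node is not None else ""

def parse_books(html):
    soup = BeautifulSoup(html, "html.parser")
    books = []
    # One <dl> per book inside the container div
    for dl in soup.select("div.book-container dl"):
        img = dl.select_one("img.book-pic")
        books.append(Book(
            name=text_of(dl, "dd.book-name"),
            author=text_of(dl, "dd.book-author"),
            publisher=text_of(dl, "dd.book-publisher"),
            pic=img.get("src", "") if img is not None else "",
        ))
    return books

# Sample markup: the second entry deliberately lacks a publisher and a cover
sample = """
<div class="book-container">
  <dl>
    <dt><img class="book-pic" src="http://media.simoniu.com/水浒封面001.jpg"></dt>
    <dd class="book-name">水浒</dd>
    <dd class="book-author">施耐庵</dd>
    <dd class="book-publisher">人民文学出版社</dd>
  </dl>
  <dl>
    <dd class="book-name">西游记</dd>
    <dd class="book-author">吴承恩</dd>
  </dl>
</div>
"""

parsed = parse_books(sample)
for book in parsed:
    print(book)
```

Returning an empty string for a missing field is only one option; skipping the incomplete entry or logging a warning may suit a real scraper better.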