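1. Introduction to BeautifulSoup
BeautifulSoup is a Python library for parsing HTML and XML; its most common use is extracting data from web pages. It does not fetch pages itself, so it is usually paired with an HTTP client such as urllib.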
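To make the idea concrete, here is a minimal sketch of parsing a string and pulling out elements. The HTML snippet and the names `langs` and `lang` are made up for illustration; only `BeautifulSoup`, `find`, `find_all`, and `get_text` come from the library.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, used only for illustration.
html = """
<ul id="langs">
  <li class="lang">Python</li>
  <li class="lang">Go</li>
</ul>
"""

# Parse with Python's built-in parser; no extra parser package is needed.
soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching tag, or None if nothing matches.
first = soup.find("li", class_="lang")

# find_all() returns every matching tag.
names = [li.get_text(strip=True) for li in soup.find_all("li", class_="lang")]

print(first.get_text(strip=True))
print(names)
```

The same two calls, `find` and `find_all`, are enough for most simple extraction tasks.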
2. Scraping a static page with BeautifulSoup
Consider the following static page, which lists a few books.
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>图书列表</title>
    <style>
        h1 {
            text-align: center;
            color: #666;
            text-shadow: #999999;
        }
        #app {
            width: 80%;
            height: 1960px;
            margin: 20px auto;
            /*outline: 2px solid lightskyblue;*/
            position: relative;
        }
        .book-container {
            width: 50%;
            height: auto;
            position: absolute;
            background: #fafafa;
            top: 0;
            left: 0;
            bottom: 0;
            right: 0;
            margin: 0 auto;
            border: 2px solid #ccc;
            border-radius: 10px;
        }
        .book-container dl {
            width: 80%;
            margin-left: 20px;
        }
        .book-pic {
            width: 280px;
            height: 360px;
            position: relative;
            top: 50%;
            left: 25%;
            margin-top: 12px;
        }
        dl dd {
            text-align: center;
            border-bottom: 1px dashed #999;
            position: relative;
        }
        .book-name {
            font-weight: bolder;
            color: #7a1723;
        }
        .book-author {
            margin-top: 12px;
            font-size: smaller;
            color: #666;
            font-family: 楷体;
        }
        .book-publisher {
            margin-top: 12px;
            font-size: smaller;
            color: #666;
            font-family: 楷体;
        }
    </style>
</head>
<body>
<h1>畅销图书列表</h1>
<hr>
<div id="app">
    <div class="book-container">
        <dl>
            <dt><img class="book-pic" src="http://media.simoniu.com/水浒封面001.jpg"></dt>
            <dd class="book-name">水浒</dd>
            <dd class="book-author">施耐庵</dd>
            <dd class="book-publisher">人民文学出版社</dd>
        </dl>
        <dl>
            <dt><img class="book-pic" src="http://media.simoniu.com/西游记封面001.jpeg"></dt>
            <dd class="book-name">西游记</dd>
            <dd class="book-author">吴承恩</dd>
            <dd class="book-publisher">人民文学出版社</dd>
        </dl>
        <dl>
            <dt><img class="book-pic" src="http://media.simoniu.com/三国演义封面001.png"></dt>
            <dd class="book-name">三国演义</dd>
            <dd class="book-author">罗贯中</dd>
            <dd class="book-publisher">北京大学出版社</dd>
        </dl>
        <dl>
            <dt><img class="book-pic" src="http://media.simoniu.com/红楼梦封面001.png"></dt>
            <dd class="book-name">红楼梦</dd>
            <dd class="book-author">曹雪芹</dd>
            <dd class="book-publisher">人民教育出版社</dd>
        </dl>
    </div>
</div>
<div style="text-align: center;margin: 0 auto">
    <hr>
    <div style="font-size: xx-small;color: darkcyan">2020-2023 © 华清远见 作者:西蒙牛</div>
</div>
</body>
</html>
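Each book on this page is one dl element: the cover image sits inside the dt, and the title, author, and publisher are dd elements with the classes book-name, book-author, and book-publisher. The sketch below shows how that structure maps to lookups, working on an inline copy of a single entry so that no network access is needed. It uses CSS selectors (`select` and `select_one`), which is one possible approach rather than the only one; the `records` list and its dictionary keys are names chosen here for illustration.

```python
from bs4 import BeautifulSoup

# Inline copy of one entry from the page above, so no network is needed.
html = """
<div class="book-container">
  <dl>
    <dt><img class="book-pic" src="http://media.simoniu.com/水浒封面001.jpg"></dt>
    <dd class="book-name">水浒</dd>
    <dd class="book-author">施耐庵</dd>
    <dd class="book-publisher">人民文学出版社</dd>
  </dl>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

records = []
# One <dl> per book; a CSS selector picks out each field inside it.
for dl in soup.select("div.book-container dl"):
    records.append({
        "name": dl.select_one("dd.book-name").get_text(strip=True),
        "author": dl.select_one("dd.book-author").get_text(strip=True),
        "publisher": dl.select_one("dd.book-publisher").get_text(strip=True),
        # Tag attributes are read with dictionary-style access.
        "pic": dl.select_one("dt img.book-pic")["src"],
    })

print(records)
```

Testing selectors against a saved fragment like this is a convenient way to get the extraction logic right before pointing it at a live URL.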
The following script fetches this page and extracts the book data with BeautifulSoup.
# -*- coding: utf-8 -*-
# @Author: simoniu
# @Time : 2023/4/2 11:11
# @File : beautifulsoup_demo.py
# @Software : PyCharm
# A minimal BeautifulSoup4 scraper: extract book data from an online page.

# The most common way to import BeautifulSoup
from bs4 import BeautifulSoup
import urllib.request, urllib.error


# Book record
class Books(object):
    def __init__(self, name, author, publisher, pic):
        self.name = name
        self.author = author
        self.publisher = publisher
        self.pic = pic

    def __str__(self):
        # Labels: 书名 = title, 作者 = author, 出版社 = publisher, 封面 = cover
        return "书名:%s , 作者:%s, 出版社:%s ,封面:%s " % (self.name, self.author, self.publisher, self.pic)


# Send an HTTP request and return the response body as HTML text.
# Returns an empty string if the request fails.
def askURL(url):
    head = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36"
    }
    request = urllib.request.Request(url=url, headers=head)
    html = ""
    try:
        # The timeout keeps the script from hanging on an unresponsive server.
        with urllib.request.urlopen(request, timeout=10) as response:
            html = response.read().decode("utf-8")
    except urllib.error.URLError as e:
        # HTTPError carries a status code; a plain URLError only has a reason.
        if hasattr(e, "code"):
            print(e.code)
        if hasattr(e, "reason"):
            print(e.reason)
    return html


# URL of the book page to scrape
booksUrl = "http://182.44.62.244:8099/books.html"
html = askURL(booksUrl)
soup = BeautifulSoup(html, 'html.parser')
# print(soup)
# print(soup.find_all("dl"))

# Each <dl> element holds one book.
book_list = soup.find_all("dl")
books = []
for book in book_list:
    book_name = book.find("dd", attrs={"class": "book-name"}).string
    book_author = book.find("dd", attrs={"class": "book-author"}).string
    book_publisher = book.find("dd", attrs={"class": "book-publisher"}).string
    book_pic = book.find("dt").find("img", attrs={"class": "book-pic"})['src']
    # print(book_name, ',', book_author, ',', book_publisher, ',', book_pic)
    b = Books(book_name, book_author, book_publisher, book_pic)
    books.append(b)

for b in books:
    print(b)
Output:
书名:水浒 , 作者:施耐庵, 出版社:人民文学出版社 ,封面:http://media.simoniu.com/水浒封面001.jpg
书名:西游记 , 作者:吴承恩, 出版社:人民文学出版社 ,封面:http://media.simoniu.com/西游记封面001.jpeg
书名:三国演义 , 作者:罗贯中, 出版社:北京大学出版社 ,封面:http://media.simoniu.com/三国演义封面001.png
书名:红楼梦 , 作者:曹雪芹, 出版社:人民教育出版社 ,封面:http://media.simoniu.com/红楼梦封面001.png
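The script works on this page because every dd contains plain text only and every field is present. On less tidy pages two things can go wrong: `find` returns None when an element is missing, and `.string` returns None when a tag has more than one child. A minimal defensive sketch follows; the helper `text_of` and the sample entry are hypothetical and not part of the original script.

```python
from bs4 import BeautifulSoup

# Hypothetical entry: the title contains nested markup and the author is missing.
html = '<dl><dd class="book-name">Example Title <b>(vol. 1)</b></dd></dl>'
entry = BeautifulSoup(html, "html.parser").find("dl")


def text_of(parent, css_class):
    """Return the stripped text of the first dd with this class, or None."""
    tag = parent.find("dd", class_=css_class)
    if tag is None:
        return None
    # get_text() joins all nested strings, so nested tags do not break it.
    return tag.get_text(" ", strip=True)


title_tag = entry.find("dd", class_="book-name")
print(title_tag.string)              # None: the tag has two children
print(text_of(entry, "book-name"))   # text gathered across the nested <b>
print(text_of(entry, "book-author")) # None: no such element in this entry
```

Returning None for a missing field lets the caller decide whether to skip the record or fill in a default.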