文本分类 - 博客详情

文本分类，也被称为文本标签或文档分类，是将文本数据分配到一个或多个类别或标签的任务。这通常涉及将文本数据与预定义的类别进行匹配，以确定文本属于哪个类别。文本分类可以应用于许多领域，如新闻分类、垃圾邮件识别、法律文件分类等。

1.文本分类

文本分类就是将文本映射到预先设定的一个或多个类别上，从类别数来看，可以分为二分类与多分类，从标签数来看，可以分为单标签与多标签。其中，最典型的一种分类技术就是词云图。

下面是一个使用 NLP 技术生成词云的简单案例代码。我们将使用 Python 的 jieba 库进行中文分词，wordcloud 库生成词云图，并结合 matplotlib 进行可视化展示。

# 导入必要的库
import jieba
import matplotlib.pyplot as plt
from wordcloud import WordCloud
from collections import Counter

# 设置 可视化风格
plt.style.use('tableau-colorblind10')
# 以下代码从全局设置字体为SimHei（黑体），解决显示中文问题
plt.rcParams['font.sans-serif'] = ['SimHei']
# 解决中文字体下坐标轴负数的负号显示问题
plt.rcParams['axes.unicode_minus'] = False

# 示例文本
text = """
自然语言处理（NLP）是人工智能领域的一个重要分支，主要研究如何让计算机理解和处理人类语言。
通过NLP技术，我们可以实现机器翻译、情感分析、文本分类、问答系统等多种应用。
近年来，深度学习在NLP领域取得了显著进展，例如Transformer模型和预训练语言模型（如BERT、GPT）。
这些技术推动了NLP在工业界和学术界的广泛应用。
"""

# 使用jieba进行中文分词
words = jieba.lcut(text)

# 去除停用词（可以根据需要扩展停用词列表）
stopwords = {'的', '了', '是', '在', '和', '就', '也', '而', '与', '对于', '可以', '我们', '如何', '这些', '一个', '一种', '例如'}
filtered_words = [word for word in words if word not in stopwords and len(word) > 1]

# 统计词频
word_counts = Counter(filtered_words)

# 生成词云
wordcloud = WordCloud(
    font_path='simhei.ttf',  # 指定中文字体路径（确保系统中有该字体）
    background_color='white',  # 背景颜色
    width=800,  # 图片宽度
    height=600,  # 图片高度
    max_font_size=100,  # 最大字体大小
    min_font_size=10,  # 最小字体大小
    max_words=100  # 最多显示的词汇数
).generate_from_frequencies(word_counts)

# 显示词云
plt.figure(figsize=(10, 8))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')  # 不显示坐标轴
plt.title('NLP词云示例', fontsize=16)
plt.tight_layout()
plt.show()

# 保存词云图（可选）
wordcloud.to_file('nlp_wordcloud.png')

首先我们来看一下文本分类的词云，除了text classification本身以外，还有不少NLPer们应该比较熟悉的词，比如分词、词向量、一些传统模型、深度学习等等，大体上是比较符合文本分类的发展情况的。

2.文本分类的主要应用

下来我们看一下文本分类的主要应用场景，从新闻分类、主题分类、情感分析到问答分类、自然语言推理等，应用非常广泛。

新闻分类：news classification (NC)
情感分析：sentiment analysis ( SA)
话题标记：topic labeling(TL)
问答系统：question answering(QA)
对话行为分类：dialog act classification (DAC)
自然语言推理：natural language inference (NLD),
关系分类：relation classification (RC)
事件预测：event prediction (EP)

3.使用HuggingFace的transformers库实现新闻分类

AG_NEWS数据集简介：

AG_NEWS：新闻语料库，包含4个大类新闻：World、Sports、Business、Sci/Tec。AG_NEWS共包含120000条训练样本集（train.csv)， 7600测试样本数据集(test.csv)。每个类别分别拥有 30000 个训练样本及 1900 个测试样本。

数据集文件内容如下：

类别序号是0、1、2、3对应着World、Sports、Business、Sci/Tec

确保已安装必要的库：

pip install transformers datasets torch

torch最好安装是cuda版本，可以加快模型的训练速度。

完整代码实现如下：

from datasets import load_dataset
from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
import torch

# 1.加载AG News数据集
dataset = load_dataset('ag_news')

# 查看数据集结构
print(dataset)

# 采样10%的数据用于快速验证
train_dataset = dataset["train"].shuffle(seed=42).select(range(int(len(dataset["train"]) * 0.1)))
test_dataset = dataset["test"].shuffle(seed=42).select(range(int(len(dataset["test"]) * 0.1)))

# 2.数据预处理
# 使用预训练的DistilBERT模型进行分词

# 指定本地目录路径（确保模型和分词器文件已保存到该目录）
local_model_path = "./distilbert-base-uncased"  # 替换为你的本地路径

model_name = "distilbert/distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(local_model_path)


# 3.预处理函数
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True, padding="max_length", max_length=128)

"""
train_dataset = dataset["train"].map(preprocess_function, batched=True)
test_dataset = dataset["test"].map(preprocess_function, batched=True)
"""

train_dataset = train_dataset.map(preprocess_function, batched=True)
test_dataset = test_dataset.map(preprocess_function, batched=True)

# 4.设置PyTorch tensor格式
train_dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "label"])
test_dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "label"])

# 5.模型加载与训练
# 4. 模型加载与训练
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=4)
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    logging_steps=100,
    do_train=True,
    do_eval=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)

trainer.train()
trainer.save_model("./saved_model")  # 保存模型
import evaluate

# 6.模型评估
accuracy = evaluate.load("accuracy")


def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = predictions.argmax(axis=-1)
    return accuracy.compute(predictions=predictions, references=labels)


trainer.evaluate()

# 7.推断示例
# 使用训练好的模型进行新闻分类
sample_text = "The government announced new policies to boost economic growth."

# 8. 加载保存的模型进行预测
# 加载保存的模型
loaded_model = AutoModelForSequenceClassification.from_pretrained("./saved_model").to(device)
loaded_model.eval()

# 使用加载的模型进行预测
loaded_inputs = tokenizer(sample_text, return_tensors="pt").to(device)
loaded_outputs = loaded_model(**loaded_inputs)
loaded_prediction = loaded_outputs.logits.argmax(dim=-1).item()

# 打印加载模型后的预测结果
print("Loaded model predicted class:", loaded_prediction)
print("Loaded model class label:", dataset["train"].features["label"].names[loaded_prediction])

运行效果：

D:\miniconda3\envs\nlp_gpu_env\python.exe D:\AI_Course_bak\nlp_project\chapter08\text_classify_demo1.py 
DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 120000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 7600
    })
})
...
D:\miniconda3\envs\nlp_gpu_env\Lib\site-packages\transformers\training_args.py:1611: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of Transformers. Use `eval_strategy` instead
  warnings.warn(
  0%|          | 0/2250 [00:00<?, ?it/s]D:\miniconda3\envs\nlp_gpu_env\Lib\site-packages\transformers\models\distilbert\modeling_distilbert.py:402: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:555.)
  attn_output = torch.nn.functional.scaled_dot_product_attention(
  4%|▍         | 100/2250 [01:08<24:33,  1.46it/s]{'loss': 0.5551, 'grad_norm': 3.400158405303955, 'learning_rate': 4.7777777777777784e-05, 'epoch': 0.13}
  9%|▉         | 200/2250 [02:17<23:26,  1.46it/s]{'loss': 0.3699, 'grad_norm': 4.7743425369262695, 'learning_rate': 4.555555555555556e-05, 'epoch': 0.27}
 13%|█▎        | 300/2250 [03:25<22:15,  1.46it/s]{'loss': 0.3816, 'grad_norm': 2.223951578140259, 'learning_rate': 4.3333333333333334e-05, 'epoch': 0.4}
 18%|█▊        | 400/2250 [04:34<21:07,  1.46it/s]{'loss': 0.3522, 'grad_norm': 5.6071648597717285, 'learning_rate': 4.111111111111111e-05, 'epoch': 0.53}
 22%|██▏       | 500/2250 [05:42<19:57,  1.46it/s]{'loss': 0.2955, 'grad_norm': 5.823001861572266, 'learning_rate': 3.888888888888889e-05, 'epoch': 0.67}
 27%|██▋       | 600/2250 [06:51<18:52,  1.46it/s]{'loss': 0.2974, 'grad_norm': 4.644488334655762, 'learning_rate': 3.6666666666666666e-05, 'epoch': 0.8}
 31%|███       | 700/2250 [08:00<17:43,  1.46it/s]{'loss': 0.2846, 'grad_norm': 4.980098724365234, 'learning_rate': 3.444444444444445e-05, 'epoch': 0.93}
 33%|███▎      | 750/2250 [08:34<17:09,  1.46it/s]
  ...
100%|██████████| 2250/2250 [26:23<00:00,  1.46it/s]
100%|██████████| 48/48 [00:12<00:00,  3.91it/s]
                                               {'eval_loss': 0.3427163362503052, 'eval_runtime': 12.2074, 'eval_samples_per_second': 62.257, 'eval_steps_per_second': 3.932, 'epoch': 3.0}
{'train_runtime': 1583.9171, 'train_samples_per_second': 22.728, 'train_steps_per_second': 1.421, 'train_loss': 0.21262090089586047, 'epoch': 3.0}
100%|██████████| 2250/2250 [26:23<00:00,  1.42it/s]
100%|██████████| 48/48 [00:12<00:00,  3.92it/s]

Loaded model predicted class: 2
Loaded model class label: Business

Process finished with exit code 0