Text Preprocessing

1. Text Preprocessing

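Text preprocessing is a key step in natural language processing (NLP). It is much like washing, chopping, and preparing ingredients before cooking: raw text is cleaned up and organized into a form that is easier for a computer to understand and process.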

Text preprocessing usually includes the following tasks:

Text cleaning
Lowercasing
Lemmatization
Stemming
Stop word removal
Part-of-speech (POS) tagging

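To continue the cooking analogy, the prep work looks like this:

Cleaning: remove impurities (punctuation, special characters).
Splitting: cut sentences into words (tokenization).
Normalizing: unify letter case and reduce words to their base forms.
Filtering: drop words that carry little information (stop words).

After these steps the text is "clean" and "tidy", which makes it much easier for a computer to process.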
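
The four steps above can be sketched in plain Python without any NLP library. This is a minimal illustration only: the `preprocess` function and the tiny `STOP_WORDS` set are hypothetical names introduced here, the stop list is far shorter than a real one, and the normalization step only lowercases (lemmatization is left out).

```python
import re

# Tiny illustrative stop list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "is", "a", "an", "in", "this", "off"}

def preprocess(text):
    """Clean, tokenize, lowercase, and filter a piece of English text."""
    # Cleaning: replace anything that is not a letter, digit, or whitespace with a space.
    cleaned = re.sub(r"[^A-Za-z0-9\s]", " ", text)
    # Splitting: break the text on runs of whitespace.
    tokens = cleaned.split()
    # Normalizing: lowercase every token.
    tokens = [t.lower() for t in tokens]
    # Filtering: drop stop words.
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("This is a sample sentence, showing off the stop words filtration."))
# ['sample', 'sentence', 'showing', 'stop', 'words', 'filtration']
```

In practice each step would normally be handled by a library such as spaCy or NLTK, which tokenize far more carefully than a whitespace split.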

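2. Stop Word Removal

Stop word removal drops the "filler" words that carry little meaning on their own, like sifting away the sand so that only the gold remains. In practice this means removing words such as "the", "is", "a", "an", and "in".

Stop word removal with spaCy: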

import spacy

# Load the English model
nlp = spacy.load("en_core_web_md")

# Sample text
text = "This is a sample sentence, showing off the stop words filtration."

# Process the text
doc = nlp(text)

# Remove stop words
filtered_sentence = [token.text for token in doc if not token.is_stop]

# Print the result
print("Original Sentence: ", text)
print("Filtered Sentence: ", " ".join(filtered_sentence))

Output:

Original Sentence:  This is a sample sentence, showing off the stop words filtration.
Filtered Sentence:  sample sentence , showing stop words filtration .

Stop word removal with NLTK:

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

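# Download the stop word list and the tokenizer data (if not already present)
# Note: recent NLTK releases may ask for 'punkt_tab' instead of 'punkt'.
nltk.download('stopwords')
nltk.download('punkt')

# Sample text
text = "This is a sample sentence, showing off the stop words filtration."

# Load the English stop word list
stop_words = set(stopwords.words('english'))

# Tokenize
word_tokens = word_tokenize(text)

# Filter out stop words (compare in lowercase so "This" matches "this")
filtered_sentence = [word for word in word_tokens if word.lower() not in stop_words]

# Print the result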
print("Original Sentence: ", text)
print("Filtered Sentence: ", " ".join(filtered_sentence))
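
General-purpose stop lists often contain words that matter for some tasks, negations such as "not" being a common example. The sketch below shows one way to filter against a stop list while keeping selected words. The `remove_stop_words` helper, its `keep` parameter, and the small sample list are hypothetical and written only for illustration; with NLTK the second argument could be `stopwords.words('english')` instead.

```python
def remove_stop_words(tokens, stop_words, keep=frozenset()):
    """Drop tokens whose lowercase form is a stop word, unless it is listed in `keep`."""
    effective = {w.lower() for w in stop_words} - {w.lower() for w in keep}
    return [t for t in tokens if t.lower() not in effective]

# Hand-written sample list (an assumption, not a real library list).
sample_stop_words = {"this", "is", "a", "not", "the"}
tokens = ["This", "is", "not", "a", "good", "movie"]

print(remove_stop_words(tokens, sample_stop_words))
# ['good', 'movie']
print(remove_stop_words(tokens, sample_stop_words, keep={"not"}))
# ['not', 'good', 'movie']
```

Without the `keep` set, the two filtered results for "not a good movie" and "a good movie" would be identical, which is usually not what a sentiment task wants.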

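3. Lemmatization

Lemmatization is an NLP technique that reduces the different inflected forms of a word to its base, or dictionary, form, known as the lemma. Put simply, it turns a word back into its simplest form.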

Examples:

Verb tenses and participles:
- "running" (progressive) → "run" (base form)
- "ran" (past tense) → "run"
- "eaten" (past participle) → "eat"

Plural nouns:
- "apples" (plural) → "apple"
- "books" (plural) → "book"

Comparative and superlative adjectives:
- "better" (comparative) → "good"
- "worst" (superlative) → "bad"
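
Conceptually, lemmatization is a mapping from inflected forms to lemmas. The toy sketch below hard-codes the examples above in a lookup table to make that idea concrete. `LEMMA_TABLE` and `toy_lemmatize` are hypothetical names; real lemmatizers rely on large dictionaries and morphological rules rather than a handful of entries.

```python
# Hand-written lookup table built from the examples above (illustrative only).
LEMMA_TABLE = {
    "running": "run", "ran": "run", "eaten": "eat",
    "apples": "apple", "books": "book",
    "better": "good", "worst": "bad",
}

def toy_lemmatize(word):
    """Return the lemma from the table, or the lowercased word if it is unknown."""
    key = word.lower()
    return LEMMA_TABLE.get(key, key)

print([toy_lemmatize(w) for w in ["Ran", "apples", "better", "paper"]])
# ['run', 'apple', 'good', 'paper']
```

Irregular pairs such as "ran" → "run" or "better" → "good" are the reason a dictionary is needed: no suffix-stripping rule would recover them.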

Lemmatization with spaCy:

import spacy

# Load the English model
nlp = spacy.load("en_core_web_md")

# Sample text
text = "I was reading the paper."

# Process the text
doc = nlp(text)

# Lemmatize every token
lemmatized_sentence = [token.lemma_ for token in doc]

# Print the result
print("Original Sentence: ", text)
print("Lemmatized Sentence: ", " ".join(lemmatized_sentence))

Output:

Original Sentence:  I was reading the paper.
Lemmatized Sentence:  I be read the paper .

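In this example, spaCy reduces "was" to "be" and "reading" to "read", while "paper" is unchanged because it is already a singular noun. spaCy already takes part of speech into account when it lemmatizes. If you want finer control, for example lemmatizing only the tokens tagged as verbs, you can filter on the POS tag as in the following code: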

import spacy

# Load the English model
nlp = spacy.load("en_core_web_md")

# Sample text
text = "I was reading the paper."

# Process the text
doc = nlp(text)

# Lemmatize only tokens tagged VERB; keep every other token as written
lemmatized_sentence = []
for token in doc:
    if token.pos_ == "VERB":
        lemmatized_sentence.append(token.lemma_)
    else:
        lemmatized_sentence.append(token.text)

# Print the result
print("Original Sentence: ", text)
print("Lemmatized Sentence: ", " ".join(lemmatized_sentence))

Output ("was" is left as written because it is tagged as an auxiliary, AUX, rather than VERB):

Original Sentence:  I was reading the paper.
Lemmatized Sentence:  I was read the paper .
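
The effect of the POS filter can be reproduced without loading a model. In the sketch below the `(text, pos, lemma)` triples are written by hand and are assumed to mirror what spaCy reports as `token.text`, `token.pos_`, and `token.lemma_` for the sample sentence. `lemmatize_by_pos` is a hypothetical helper, not part of spaCy.

```python
def lemmatize_by_pos(tokens, pos_to_lemmatize):
    """tokens: iterable of (text, pos, lemma) triples; lemmatize only the chosen POS tags."""
    return [lemma if pos in pos_to_lemmatize else text for text, pos, lemma in tokens]

# Hand-written triples (an assumption standing in for a parsed spaCy Doc).
tokens = [
    ("I", "PRON", "I"),
    ("was", "AUX", "be"),
    ("reading", "VERB", "read"),
    ("the", "DET", "the"),
    ("paper", "NOUN", "paper"),
    (".", "PUNCT", "."),
]

print(" ".join(lemmatize_by_pos(tokens, {"VERB"})))
# I was read the paper .
print(" ".join(lemmatize_by_pos(tokens, {"VERB", "AUX"})))
# I be read the paper .
```

Adding "AUX" to the set is what it takes to lemmatize auxiliaries such as "was" along with main verbs.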

Lemmatization with NLTK:

import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk.corpus import wordnet

# Download the required resources (if not already present)
# Note: recent NLTK releases may ask for 'punkt_tab' and
# 'averaged_perceptron_tagger_eng' instead of the names below.
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

# Sample text
text = "I was reading the paper."

# Initialize the lemmatizer
lemmatizer = WordNetLemmatizer()

# Tokenize
word_tokens = word_tokenize(text)

# Get part-of-speech tags (Penn Treebank tag set)
pos_tags = nltk.pos_tag(word_tokens)

# Convert a Penn Treebank POS tag to the corresponding WordNet POS constant
def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN  # default to noun

# Lemmatize each token using its POS tag
lemmatized_sentence = []
for word, pos in pos_tags:
    wordnet_pos = get_wordnet_pos(pos)  # always returns a value; falls back to NOUN
    lemmatized_sentence.append(lemmatizer.lemmatize(word, pos=wordnet_pos))

# Print the result
print("Original Sentence: ", text)
print("Lemmatized Sentence: ", " ".join(lemmatized_sentence))
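
The tag conversion matters because the WordNet lemmatizer treats a word as a noun unless told otherwise, so "reading" is only reduced to "read" when it is passed as a verb. The sketch below shows the same prefix-based mapping as a dictionary lookup, without importing NLTK. It uses WordNet's single-letter POS codes directly ("a", "v", "n", "r", the values behind `wordnet.ADJ`, `wordnet.VERB`, `wordnet.NOUN`, and `wordnet.ADV`). `TAG_PREFIX_TO_WORDNET` and `to_wordnet_pos` are hypothetical names.

```python
# First letter of a Penn Treebank tag -> WordNet single-letter POS code.
TAG_PREFIX_TO_WORDNET = {"J": "a", "V": "v", "N": "n", "R": "r"}

def to_wordnet_pos(treebank_tag, default="n"):
    """Map a Penn Treebank tag to a WordNet POS code, defaulting to noun."""
    return TAG_PREFIX_TO_WORDNET.get(treebank_tag[:1], default)

for tag in ["VBD", "VBG", "NN", "JJR", "RB", "PRP", "DT"]:
    print(tag, "->", to_wordnet_pos(tag))
# VBD -> v
# VBG -> v
# NN -> n
# JJR -> a
# RB -> r
# PRP -> n
# DT -> n
```

Tags with no WordNet counterpart, such as pronouns (PRP) and determiners (DT), fall through to the noun default, which matches the behavior of the `else` branch in the code above.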