Text preprocessing is a key step in natural language processing (NLP). It is like washing, cutting, and preparing ingredients before cooking: in short, it "cleans" and "organizes" raw text into a form that is easier for computers to understand and process.
Text preprocessing typically includes tasks such as:
- Tokenization: splitting text into words or sentences
- Lowercasing and punctuation removal
- Stop word removal: dropping common filler words
- Lemmatization: reducing words to their base forms
Like the prep work before cooking, these steps make the text "clean" and "tidy" so the computer can understand and process it more easily.
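As a minimal illustration of the cleaning steps above (lowercasing, punctuation removal, whitespace normalization), here is a small sketch using only the Python standard library; the sample sentence and the `clean_text` helper are made up for illustration:

```python
import re

def clean_text(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    text = text.lower()                       # normalize case
    text = re.sub(r"[^\w\s]", " ", text)      # replace punctuation with spaces
    text = re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace
    return text

print(clean_text("Hello, World!  This is   NLP."))
# hello world this is nlp
```

Real pipelines usually combine such basic cleaning with the library-based steps shown below.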
Stop word removal drops the "filler" words that carry little meaning on their own, like sifting gold out of sand: discard the sand, keep the gold. In practice, this means removing words such as "the", "is", "a", "an", and "in".
Stop word removal with spaCy:
import spacy
# Load the English model
nlp = spacy.load("en_core_web_md")
# Sample text
text = "This is a sample sentence, showing off the stop words filtration."
# Process the text
doc = nlp(text)
# Remove stop words
filtered_sentence = [token.text for token in doc if not token.is_stop]
# Print the results
print("Original Sentence: ", text)
print("Filtered Sentence: ", " ".join(filtered_sentence))
Output:
Original Sentence: This is a sample sentence, showing off the stop words filtration.
Filtered Sentence: sample sentence , showing stop words filtration .
Stop word removal with NLTK:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
# Download the stop word list (if not already downloaded)
nltk.download('stopwords')
nltk.download('punkt')
# Sample text
text = "This is a sample sentence, showing off the stop words filtration."
# Load the English stop word list
stop_words = set(stopwords.words('english'))
# Tokenize
word_tokens = word_tokenize(text)
# Filter out stop words
filtered_sentence = [word for word in word_tokens if word.lower() not in stop_words]
# Print the results
print("Original Sentence: ", text)
print("Filtered Sentence: ", " ".join(filtered_sentence))
Lemmatization is an NLP technique whose goal is to reduce the different inflected forms of a word to its base form, called the lemma. Simply put, it turns a word back into its simplest form.
Examples:
- Verbs (past tense, progressive, and so on): "running" → "run"; "ran" → "run"; "eaten" → "eat"
- Plural nouns: "apples" → "apple"; "books" → "book"
- Comparative and superlative adjectives: "better" → "good"; "worst" → "bad"
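Before turning to library support, the mappings above can be pictured as a lookup from inflected form to lemma. This toy sketch (with a hand-made table covering only the examples above, not a real lemmatizer) just illustrates the idea:

```python
# Toy lemma table for the examples above; a real lemmatizer
# relies on large vocabularies and morphological rules instead.
LEMMAS = {
    "running": "run", "ran": "run", "eaten": "eat",
    "apples": "apple", "books": "book",
    "better": "good", "worst": "bad",
}

def toy_lemmatize(word: str) -> str:
    # Fall back to the word itself when it is not in the table.
    return LEMMAS.get(word.lower(), word)

print(toy_lemmatize("running"))  # run
print(toy_lemmatize("better"))   # good
print(toy_lemmatize("paper"))    # paper
```

A hard-coded table obviously does not scale; the spaCy and NLTK code below does the same job with full morphological knowledge.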
Lemmatization with spaCy:
import spacy
# Load the English model
nlp = spacy.load("en_core_web_md")
# Sample text
text = "I was reading the paper."
# Process the text
doc = nlp(text)
# Lemmatize every token
lemmatized_sentence = [token.lemma_ for token in doc]
# Print the results
print("Original Sentence: ", text)
print("Lemmatized Sentence: ", " ".join(lemmatized_sentence))
Output:
Original Sentence: I was reading the paper.
Lemmatized Sentence: I be read the paper .
In this example, spaCy lemmatizes "was" to "be" and "reading" to "read", while "paper" stays unchanged because it is already a singular noun. If you want to take part-of-speech tags into account, for example lemmatizing only the verbs, you can use the following code:
import spacy
# Load the English model
nlp = spacy.load("en_core_web_md")
# Sample text
text = "I was reading the paper."
# Process the text
doc = nlp(text)
# Lemmatize only the verbs, keeping all other tokens as-is
lemmatized_sentence = []
for token in doc:
    if token.pos_ == "VERB":
        lemmatized_sentence.append(token.lemma_)
    else:
        lemmatized_sentence.append(token.text)
# Print the results
print("Original Sentence: ", text)
print("Lemmatized Sentence: ", " ".join(lemmatized_sentence))
Output:
Original Sentence: I was reading the paper.
Lemmatized Sentence: I was read the paper .
Lemmatization with NLTK:
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk.corpus import wordnet
# Download the required corpora (if not already downloaded)
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
# Sample text
text = "I was reading the paper."
# Initialize the lemmatizer
lemmatizer = WordNetLemmatizer()
# Tokenize
word_tokens = word_tokenize(text)
# Part-of-speech tagging
pos_tags = nltk.pos_tag(word_tokens)
# Map Penn Treebank POS tags to WordNet POS tags
def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN  # default to noun
# Lemmatize each word using its POS tag
lemmatized_sentence = []
for word, pos in pos_tags:
    wordnet_pos = get_wordnet_pos(pos)
    lemmatized_sentence.append(lemmatizer.lemmatize(word, pos=wordnet_pos))
# Print the results
print("Original Sentence: ", text)
print("Lemmatized Sentence: ", " ".join(lemmatized_sentence))