Integrating a Chinese Word Segmentation Plugin into Elasticsearch

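When searching Chinese text, Elasticsearch (ES) needs a Chinese analyzer to return good results. The default analyzer follows word-boundary rules designed for languages such as English, and for Chinese it effectively splits the text into single characters, which works poorly for Chinese search. Earlier ES releases shipped no Chinese analyzer. The official distribution now offers some options, but IK remains one of the most widely used choices for Chinese segmentation, so this section shows how to integrate the IK analyzer into ES.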

First, download the es-ik plugin from GitHub: https://github.com/medcl/elasticsearch-analysis-ik

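The download URL used here is: https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.13.4/elasticsearch-analysis-ik-7.13.4.zip

The plugin version must match the ES version exactly (7.13.4 in this walkthrough).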
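
Because the release URL follows a fixed pattern, it can be derived from the version number. The following is a minimal sketch, assuming the v<version> tag and file-name pattern of the URL above also holds for other releases. The variable names are illustrative and not part of the plugin.

```shell
# Build the IK release URL from the ES version (pattern taken from the URL above).
ES_VERSION="7.13.4"
IK_ZIP="elasticsearch-analysis-ik-${ES_VERSION}.zip"
IK_URL="https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v${ES_VERSION}/${IK_ZIP}"

# Print the URL; pass it to a download tool of your choice.
echo "$IK_URL"
```

Keeping the version in one variable makes it harder to download a plugin build that does not match the installed ES version.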

Note: the IK plugin must be installed on every node of the ES cluster.

1. Installation steps

1). Upload the downloaded elasticsearch-analysis-ik-7.13.4.zip to /root/tools/ on master.

2). Copy elasticsearch-analysis-ik-7.13.4.zip into the Elasticsearch installation directory.

[root@master tools]# cp elasticsearch-analysis-ik-7.13.4.zip /usr/local/elasticsearch

3). Copy elasticsearch-analysis-ik-7.13.4.zip to slave1 and slave2 with scp.

[root@master elasticsearch]# scp -rp elasticsearch-analysis-ik-7.13.4.zip slave1:/usr/local/elasticsearch
elasticsearch-analysis-ik-7.13.4.zip                                                                                                            100% 4399KB  72.3MB/s   00:00    
[root@master elasticsearch]# scp -rp elasticsearch-analysis-ik-7.13.4.zip slave2:/usr/local/elasticsearch
elasticsearch-analysis-ik-7.13.4.zip
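
With more than a couple of nodes, the per-host copy can be written as a loop. This is a sketch only: the function name copy_cmds is hypothetical, and it prints the scp commands as a dry run instead of executing them. The host names and paths are the ones used in this walkthrough.

```shell
ZIP="elasticsearch-analysis-ik-7.13.4.zip"
ES_HOME="/usr/local/elasticsearch"

# Print one scp command per host given as an argument (dry run).
# -p preserves timestamps and modes; -r is not needed for a single file.
copy_cmds() {
  for host in "$@"; do
    echo "scp -p ${ZIP} ${host}:${ES_HOME}"
  done
}

copy_cmds slave1 slave2
```

Once the printed commands look right, they can be run by piping the output to sh or by replacing echo with the command itself.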

4). Install the IK plugin offline on the master node.

[root@master elasticsearch]# bin/elasticsearch-plugin install file:///usr/local/elasticsearch/elasticsearch-analysis-ik-7.13.4.zip 
-> Installing file:///usr/local/elasticsearch/elasticsearch-analysis-ik-7.13.4.zip
-> Downloading file:///usr/local/elasticsearch/elasticsearch-analysis-ik-7.13.4.zip
[=================================================] 100%   
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@     WARNING: plugin requires additional permissions     @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
* java.net.SocketPermission * connect,resolve
See http://docs.oracle.com/javase/8/docs/technotes/guides/security/permissions.html
for descriptions of what these permissions allow and the associated risks.

Continue with installation? [y/N]y
-> Installed analysis-ik
-> Please restart Elasticsearch to activate any plugins installed

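Note: after a successful installation, an analysis-ik directory appears under both the config and plugins directories of the Elasticsearch installation. config/analysis-ik holds IK's configuration files, and plugins/analysis-ik holds IK's core jar files.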
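
The two directories can be checked with a small helper before moving on to the other nodes. This is a minimal sketch; check_ik_dirs is a hypothetical helper name, and the default path is the installation directory used in this walkthrough.

```shell
# Return 0 if both analysis-ik directories exist under the given ES home.
check_ik_dirs() {
  es_home="$1"
  for sub in config plugins; do
    if [ ! -d "${es_home}/${sub}/analysis-ik" ]; then
      echo "missing: ${es_home}/${sub}/analysis-ik"
      return 1
    fi
  done
  echo "analysis-ik directories present under ${es_home}"
}

# "|| true" keeps the script going if the check fails on this machine.
check_ik_dirs "${ES_HOME:-/usr/local/elasticsearch}" || true
```

Running bin/elasticsearch-plugin list from the installation directory is another way to confirm that analysis-ik is registered.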

5). Install the IK plugin on slave1 and slave2 as well.

[root@slave1 elasticsearch]#  bin/elasticsearch-plugin install  file:///usr/local/elasticsearch/elasticsearch-analysis-ik-7.13.4.zip
...

[root@slave2 elasticsearch]# bin/elasticsearch-plugin install  file:///usr/local/elasticsearch/elasticsearch-analysis-ik-7.13.4.zip
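
When the same installation is repeated on several nodes, the confirmation prompt for additional permissions can be skipped with the --batch option of elasticsearch-plugin install. The sketch below only assembles and prints the command as a dry run. The variable names are illustrative, and the paths are the ones used in this walkthrough.

```shell
ES_HOME="/usr/local/elasticsearch"
IK_ZIP="${ES_HOME}/elasticsearch-analysis-ik-7.13.4.zip"

# --batch answers the permission prompt automatically,
# so review the requested permissions once by hand first.
INSTALL_CMD="${ES_HOME}/bin/elasticsearch-plugin install --batch file://${IK_ZIP}"

echo "would run: ${INSTALL_CMD}"
```

The interactive form shown above is the safer default for a first installation, since it displays the permissions the plugin requests.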

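6). Fix the permissions of the analysis-ik subdirectory under plugins in the Elasticsearch installation directory.

The plugin was installed as root, but ES cannot run as root (the later commands use the es user), so ES must be able to read the new files. The quick option used here is to change the permissions of the whole /usr/local/elasticsearch directory. Note that 777 is very permissive; giving ownership of the new directories to the user that runs ES is a tighter alternative.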

[root@master elasticsearch]# chmod -R 777 /usr/local/elasticsearch
[root@slave1 elasticsearch]# chmod -R 777 /usr/local/elasticsearch
[root@slave2 elasticsearch]# chmod -R 777 /usr/local/elasticsearch

7). If the ES cluster is running, stop it and start the Elasticsearch cluster again so that the plugin is loaded.

8). Verify IK's segmentation.

First, test Chinese segmentation with the default analyzer. The java_test index used in the URL must already exist.

[es@master elasticsearch]$ curl -H "Content-Type: application/json" -XPOST  'http://master:9200/java_test/_analyze?pretty' -d '{"text":"我们是中国人"}'
{
  "tokens" : [
    {
      "token" : "我",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "<IDEOGRAPHIC>",
      "position" : 0
    },
    {
      "token" : "们",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "<IDEOGRAPHIC>",
      "position" : 1
    },
    {
      "token" : "是",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "<IDEOGRAPHIC>",
      "position" : 2
    },
    {
      "token" : "中",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "<IDEOGRAPHIC>",
      "position" : 3
    },
    {
      "token" : "国",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "<IDEOGRAPHIC>",
      "position" : 4
    },
    {
      "token" : "人",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "<IDEOGRAPHIC>",
      "position" : 5
    }
  ]
}

Next, test Chinese segmentation with the IK tokenizer.

[es@master elasticsearch]$ curl -H "Content-Type: application/json" -XPOST  'http://master:9200/java_test/_analyze?pretty' -d '{"text":"我们是中国人","tokenizer":"ik_max_word"}'
{
  "tokens" : [
    {
      "token" : "我们",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "是",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "中国人",
      "start_offset" : 3,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "中国",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "国人",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 4
    }
  ]
}
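
To compare the two results at a glance, the token strings can be pulled out of the pretty-printed response. This is a rough sketch that relies on the exact "token" : "..." layout produced by ?pretty and is not a general JSON parser. extract_tokens is a hypothetical helper name, and the sample input is the IK response above with the other fields trimmed.

```shell
# Read a pretty-printed _analyze response on stdin and
# print the token values on one line, separated by spaces.
extract_tokens() {
  sed -n 's/.*"token" : "\([^"]*\)".*/\1/p' | paste -sd ' ' -
}

extract_tokens <<'EOF'
{
  "tokens" : [
    { "token" : "我们", "position" : 0 },
    { "token" : "是", "position" : 1 },
    { "token" : "中国人", "position" : 2 },
    { "token" : "中国", "position" : 3 },
    { "token" : "国人", "position" : 4 }
  ]
}
EOF
```

In practice the output of the curl command can be piped straight into extract_tokens. For anything beyond a quick check, a real JSON tool is a more robust choice.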

The tokens include "是", which can be treated as a stop word and does not need to appear in the output. Because it was emitted, it was not removed during stop-word filtering.

Which words does IK's stop-word dictionary contain?

[es@master elasticsearch]$ more /usr/local/elasticsearch/config/analysis-ik/stopword.dic 
a
an
and
are
as
at
be
but
by
for
if
in
into
is
it
no
not
of
on
or
such
that
the
their
then
there
these
they
this
to
was
will
with

IK's stop-word dictionary is the stopword.dic file, and it currently contains only English stop words. Chinese stop words can be added to this file by hand; start by adding "是".

a
an
and
are
as
at
be
but
by
for
if
in
into
is
it
no
not
of
on
or
such
that
the
their
then
there
these
they
this
to
was
will
with
是
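
Instead of editing the file by hand, the word can be appended from a script. The sketch below adds a stop word only if it is not already present, so it is safe to run more than once. add_stopword is a hypothetical helper name, and the demo works on a temporary file, not on the real stopword.dic.

```shell
# Append a stop word to a dictionary file unless it is already listed.
add_stopword() {
  dic="$1"
  word="$2"
  # -x: match the whole line, -F: literal string, -q: no output
  if grep -qxF -- "$word" "$dic"; then
    return 0
  fi
  # Make sure the file ends with a newline before appending.
  if [ -s "$dic" ] && [ -n "$(tail -c 1 "$dic")" ]; then
    printf '\n' >> "$dic"
  fi
  printf '%s\n' "$word" >> "$dic"
}

# Demo on a temporary copy with a few of the English entries.
demo_dic="$(mktemp)"
printf 'a\nan\nwith\n' > "$demo_dic"
add_stopword "$demo_dic" "是"
add_stopword "$demo_dic" "是"   # second call changes nothing
cat "$demo_dic"
```

For the real dictionary, the first argument would be /usr/local/elasticsearch/config/analysis-ik/stopword.dic.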

Note: sync the modified file to every node in the cluster, then restart the ES cluster for the change to take effect.

Test Chinese segmentation with the IK tokenizer again.

[es@master elasticsearch]$ curl -H "Content-Type: application/json" -XPOST  'http://master:9200/java_test/_analyze?pretty' -d '{"text":"我们是中国人","tokenizer":"ik_max_word"}'
{
  "tokens" : [
    {
      "token" : "我们",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "中国人",
      "start_offset" : 3,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "中国",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "国人",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 3
    }
  ]
}

The token "是" no longer appears, which means it was removed during stop-word filtering.