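Goal: configure synonyms for the core terms in product titles to manually improve the accuracy of product search.

Current state

We currently use elasticsearch as our data store, mainly for product data. We have not tuned or customized elasticsearch itself: everything runs on the default configuration, including the default standard analyzer, and the index has no custom settings.

Current query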

GET /shop/item/_search?pretty
{
  "query": {
    "bool": {
      "must": {
        "match": {
          "title": "韩国•悦诗风吟（innisfree） 绿茶精萃保湿洁面膏 150ml"
        }
      }
    }
  }
}

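Please ignore the redundant bool and must here. In the real query we add other conditions to narrow the results; the test data used in this post is the clean data left after those conditions have already been applied.

The result set is: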
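
As a rough illustration of why the bool wrapper is kept, the sketch below builds the same request body in Python and attaches one extra clause. The `status` field and its `on_sale` value are hypothetical stand-ins for our real filtering conditions, which are not shown in this post, and `build_title_query` is a helper name made up for this example.

```python
import json


def build_title_query(title, extra_filters=None):
    """Build a bool query: `must` carries the title match, `filter` the extra conditions."""
    bool_clause = {"must": {"match": {"title": title}}}
    if extra_filters:
        # filter clauses restrict the result set without contributing to _score
        bool_clause["filter"] = list(extra_filters)
    return {"query": {"bool": bool_clause}}


# Hypothetical extra condition; the real ones are business-specific.
body = build_title_query(
    "韩国•悦诗风吟（innisfree） 绿茶精萃保湿洁面膏 150ml",
    extra_filters=[{"term": {"status": "on_sale"}}],
)
print(json.dumps(body, ensure_ascii=False, indent=2))
```

With no extra filters the helper produces exactly the body used in the query above, so the bool and must wrapper costs nothing while leaving room for more clauses.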

{
    "_shards": {
        "failed": 0,
        "successful": 5,
        "total": 5
    },
    "hits": {
        "hits": [
            {
                "_id": "75",
                "_index": "shop",
                "_score": 10.844498,
                "_source": {
                    "title": "innisfree 悦诗风吟 绿茶精粹绿茶籽保湿精油30毫升/瓶"
                },
                "_type": "item"
            },
            {
                "_id": "77",
                "_index": "shop",
                "_score": 10.140213,
                "_source": {
                    "title": "innisfree 悦诗风吟绿茶精萃保湿卸妆油 150毫升"
                },
                "_type": "item"
            },
            {
                "_id": "50",
                "_index": "shop",
                "_score": 10.133192,
                "_source": {
                    "title": "innisfree 悦诗风吟 绿茶保湿精华 50毫升"
                },
                "_type": "item"
            },
            {
                "_id": "20",
                "_index": "shop",
                "_score": 9.213395,
                "_source": {
                    "title": "innisfree 悦诗风吟 绿茶精粹平衡保湿套装"
                },
                "_type": "item"
            },
            {
                "_id": "3",
                "_index": "shop",
                "_score": 8.365414,
                "_source": {
                    "title": "innisfree 悦诗风吟绿茶保湿卸妆水 300毫升"
                },
                "_type": "item"
            },
            {
                "_id": "44",
                "_index": "shop",
                "_score": 7.76646,
                "_source": {
                    "title": "innisfree 悦诗风吟 绿茶保湿面霜 50毫升"
                },
                "_type": "item"
            },
            {
                "_id": "145",
                "_index": "shop",
                "_score": 7.06796,
                "_source": {
                    "title": "innisfree 悦诗风吟 绿茶精萃平衡面霜 [环保手帕限量版] 100毫升"
                },
                "_type": "item"
            },
            {
                "_id": "72",
                "_index": "shop",
                "_score": 6.9805746,
                "_source": {
                    "title": "innisfree 悦诗风吟 绿茶平衡水乳洁面+菁露 4件套"
                },
                "_type": "item"
            },
            {
                "_id": "48",
                "_index": "shop",
                "_score": 6.76891,
                "_source": {
                    "title": "innisfree 悦诗风吟 绿茶保湿爽肤水 200毫升"
                },
                "_type": "item"
            },
            {
                "_id": "19",
                "_index": "shop",
                "_score": 6.375983,
                "_source": {
                    "title": "innisfree 悦诗风吟 绿茶精萃水乳两件套装"
                },
                "_type": "item"
            }
        ],
        "max_score": 10.844498,
        "total": 176
    },
    "timed_out": false,
    "took": 856
}

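This matches what our production search returns. The production search limits the result to 10 items, but the item we actually want (id 30, shown next) is pushed further down the list by its match score.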

GET /shop/item/30?pretty 
{
    "_id": "30",
    "_index": "shop",
    "_source": {
        "title": "innisfree 悦诗风吟 绿茶洗面奶 150毫升"
    },
    "_type": "item",
    "_version": 1,
    "found": true
}

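Problem analysis

First, the default analyzer splits Chinese text into single characters, which discards the meaning carried by whole words. Under the relevance scoring that es uses by default (TF-IDF; BM25 from 5.0 onward) this introduces long-tail noise, because titles that merely share many individual characters with the query score highly.
Second, synonym pairs such as 洗面奶 -> 洁面膏 (facial cleanser -> cleansing cream) and ml -> 毫升 (millilitre) should be recognized as equivalent.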
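
To make the first point concrete, here is a minimal sketch that counts how many distinct tokens the query shares with two of the titles when Chinese is split character by character. The regular expression is only our approximation of the standard analyzer on this data (an assumption, not the analyzer's real implementation), and real scoring also weighs term frequency and rarity, so the counts are indicative only.

```python
import re

# Rough approximation of the standard analyzer on this data (an assumption):
# lowercase, keep runs of ASCII letters/digits together, emit each CJK character alone.
TOKEN_RE = re.compile(r"[a-z0-9]+|[\u4e00-\u9fff]")


def char_tokens(text):
    """Split text into lowercase ASCII words and single CJK characters."""
    return TOKEN_RE.findall(text.lower())


def shared_tokens(query, title):
    """Distinct tokens that appear in both the query and the title."""
    return set(char_tokens(query)) & set(char_tokens(title))


QUERY = "韩国•悦诗风吟（innisfree） 绿茶精萃保湿洁面膏 150ml"
TOP_HIT = "innisfree 悦诗风吟 绿茶精粹绿茶籽保湿精油30毫升/瓶"  # id 75, ranked first
TARGET = "innisfree 悦诗风吟 绿茶洗面奶 150毫升"  # id 30, the item we want

for name, title in (("id 75", TOP_HIT), ("id 30", TARGET)):
    common = shared_tokens(QUERY, title)
    print(name, len(common), sorted(common))
```

The essence oil (id 75) shares more single-character tokens with the query than the cleanser (id 30) does, even though it is the wrong product, and neither 洁面膏 nor 150ml matches anything in the target title as a whole word.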

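Solution

Introducing the ik analyzer

IKAnalyzer is an open-source, lightweight Chinese word-segmentation toolkit written in Java. It started as a Chinese segmentation component built around Lucene, combining dictionary-based segmentation with grammar analysis. Since version 3.0, IK has been a general-purpose segmentation component for Java that is independent of the Lucene project, while still providing a default optimized implementation for Lucene.

Installing the ik analyzer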

You can download the prebuilt package.
Unpack the plugin package into your-es-root/plugins; renaming the folder to analysis-ik is recommended. The resulting layout:
└─plugins
    └─analysis-ik
        └─config
            └─custom

Note: the plugin version you download should match the es version you are running.

Alternatively, you can install it with elasticsearch-plugin (supported only from v5.5.1 onward):

./bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v5.6.2/elasticsearch-analysis-ik-5.6.2.zip

Restart elasticsearch.
Verifying the installation

Startup log
After the restart, the log shows loaded plugin [analysis-ik].

Verify by calling the analyzer

GET _analyze?analyzer=ik_smart&pretty=true&text="中华人民共和国国旗 150ml"
{
    "tokens": [
        {
            "end_offset": 7,
            "position": 0,
            "start_offset": 0,
            "token": "中华人民共和国",
            "type": "CN_WORD"
        },
        {
            "end_offset": 9,
            "position": 1,
            "start_offset": 7,
            "token": "国旗",
            "type": "CN_WORD"
        },
        {
            "end_offset": 15,
            "position": 2,
            "start_offset": 10,
            "token": "150ml",
            "type": "LETTER"
        }
    ]
}

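For comparison, try the default analyzer:

GET _analyze?analyzer=standard&pretty=true&text="中华人民共和国国旗 150ml"

The ik analyzer has two segmentation modes: the fine-grained ik_max_word and the smart ik_smart. Given our use case, we use smart segmentation by default.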
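
For intuition about dictionary-based segmentation, the following sketch implements forward maximum matching, a classic textbook technique. It is not IK's actual algorithm, and `TOY_DICT` is a tiny dictionary made up for this example; it only shows how a dictionary lets a segmenter emit whole words instead of single characters.

```python
# Toy dictionary for illustration only; IK ships its own, much larger dictionaries.
TOY_DICT = {"中华", "人民", "共和国", "中华人民共和国", "国旗"}


def forward_max_match(text, dictionary):
    """Greedy segmentation: at each position take the longest dictionary word."""
    max_len = max(len(word) for word in dictionary)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, falling back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


print(forward_max_match("中华人民共和国国旗", TOY_DICT))
```

On this input the greedy longest match yields the same two words that ik_smart returned above; a fine-grained mode in the spirit of ik_max_word would additionally emit the shorter dictionary words such as 中华 and 人民.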

Rebuilding the index

DELETE /shop

PUT /shop

POST /shop/item/_mapping
{
  "properties": {
    "title": {
      "analyzer": "ik_smart",
      "search_analyzer": "ik_smart",
      "type": "text"
    }
  }
}
Re-insert the data.

Verifying the tokenization
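
This post does not show how the data was re-inserted. One option is the _bulk API; the sketch below only builds the newline-delimited JSON body for it and does not send anything. The helper name `bulk_index_body` and the two sample documents (taken from the results in this post) are for illustration only.

```python
import json


def bulk_index_body(index, doc_type, docs):
    """Return an NDJSON body for _bulk: one action line plus one source line per document."""
    lines = []
    for doc_id, source in docs.items():
        action = {"index": {"_index": index, "_type": doc_type, "_id": doc_id}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(source, ensure_ascii=False))
    # The bulk body must end with a newline.
    return "\n".join(lines) + "\n"


sample_docs = {
    "30": {"title": "innisfree 悦诗风吟 绿茶洗面奶 150毫升"},
    "77": {"title": "innisfree 悦诗风吟绿茶精萃保湿卸妆油 150毫升"},
}
bulk_body = bulk_index_body("shop", "item", sample_docs)
print(bulk_body)
```

The resulting text would be sent as the body of POST /_bulk with the Content-Type application/x-ndjson.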

GET /shop/item/_search?pretty
{
  "query": {
    "bool": {
      "must": {
        "match": {
          "title": "韩国•悦诗风吟（innisfree） 绿茶精萃保湿洁面膏 150ml"
        }
      }
    }
  },
  "highlight": {
    "fields": {
      "title": {}
    }
  }
}

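The result is now tokenized into Chinese words. We can also see that 150ml did not match anything.

Configuring ik synonyms

Create a synonyms.txt configuration file under elastic-root-path/config/analysis.
Add the following line to synonyms.txt: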

洁面膏,洗面奶
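
The synonym token filter reads this file in Solr synonym format by default: a comma-separated line declares equivalent terms, and `a => b` declares a one-way mapping. The sketch below is our own simplified parser for single-token terms, written to show what the line above means; it is not how elasticsearch loads the file, and the `ml => 毫升` rule in the usage note is a hypothetical example that is not part of our synonyms.txt.

```python
def parse_synonym_rules(lines):
    """Parse Solr-style synonym lines into {input term: [output terms]}."""
    rules = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if "=>" in line:
            # Explicit mapping: terms on the left are replaced by terms on the right.
            left, right = line.split("=>", 1)
            inputs = [t.strip() for t in left.split(",") if t.strip()]
            outputs = [t.strip() for t in right.split(",") if t.strip()]
        else:
            # Equivalent terms: every term expands to the whole group.
            inputs = outputs = [t.strip() for t in line.split(",") if t.strip()]
        for term in inputs:
            targets = rules.setdefault(term, [])
            for out in outputs:
                if out not in targets:
                    targets.append(out)
    return rules


rules = parse_synonym_rules(["洁面膏,洗面奶"])
print(rules)
```

Calling `parse_synonym_rules(["ml => 毫升"])` would give a one-way rule instead. Whether such a rule helps in practice depends on the tokenizer, since ik_smart emitted 150ml as a single token in the test above.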

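Reconfiguring the index

Since the index already exists at this point, delete it first with DELETE /shop, recreate it with the settings below, and then re-insert the data.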

PUT /shop
{
  "mappings": {
    "item": {
      "properties": {
        "title": {
          "analyzer": "by_iksmart",
          "search_analyzer": "by_iksmart",
          "type": "text"
        }
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "by_iksmart": {
          "filter": [
            "lowercase",
            "my_synonym_filter"
          ],
          "tokenizer": "ik_smart"
        }
      },
      "filter": {
        "my_synonym_filter": {
          "synonyms_path": "analysis/synonyms.txt",
          "type": "synonym"
        }
      }
    }
  }
}
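
The order of the filters in by_iksmart matters: tokens from the tokenizer are lowercased first and only then expanded with synonyms. The sketch below mimics that chain on hand-written token lists. The token lists are assumptions (the real ik_smart output depends on its dictionary), and the flat list it returns is a simplification, since the real filter places synonyms at the same token position.

```python
# Same rule as in synonyms.txt, already expanded into a lookup table.
SYNONYMS = {"洁面膏": ["洁面膏", "洗面奶"], "洗面奶": ["洁面膏", "洗面奶"]}


def lowercase_filter(tokens):
    return [token.lower() for token in tokens]


def synonym_filter(tokens, synonyms):
    expanded = []
    for token in tokens:
        # Tokens without a rule pass through unchanged.
        expanded.extend(synonyms.get(token, [token]))
    return expanded


def by_iksmart_filters(tokens):
    """Apply the filters in the order listed in the settings: lowercase, then synonyms."""
    return synonym_filter(lowercase_filter(tokens), SYNONYMS)


# Assumed tokenizer output for a query fragment and a title fragment.
query_tokens = ["Innisfree", "洁面膏"]
title_tokens = ["innisfree", "洗面奶"]

print(by_iksmart_filters(query_tokens))
print(by_iksmart_filters(title_tokens))
```

Because the same analyzer is used at index time and at search time, both sides end up with the same set of terms, which is what lets a query for 洁面膏 hit a title containing 洗面奶.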

Verifying again

GET /shop/item/_search?pretty
{
  "query": {
    "bool": {
      "must": {
        "match": {
          "title": "韩国•悦诗风吟（innisfree） 绿茶精萃保湿洁面膏 150ml"
        }
      }
    }
  },
  "highlight": {
    "fields": {
      "title": {}
    }
  }
}

Result

{
    "_shards": {
        "failed": 0,
        "successful": 5,
        "total": 5
    },
    "hits": {
        "hits": [
            {
                "_id": "77",
                "_index": "shop",
                "_score": 8.554801,
                "_source": {
                    "title": "innisfree 悦诗风吟绿茶精萃保湿卸妆油 150毫升"
                },
                "_type": "item",
                "highlight": {
                    "title": [
                        "<em>innisfree</em> <em>悦诗风吟</em><em>绿</em><em>茶精</em><em>萃</em><em>保湿</em>卸妆油 150毫升"
                    ]
                }
            },
            {
                "_id": "87",
                "_index": "shop",
                "_score": 7.466588,
                "_source": {
                    "title": "innisfree 悦诗风吟 绿茶精萃矿物质喷雾 50毫升"
                },
                "_type": "item",
                "highlight": {
                    "title": [
                        "<em>innisfree</em> <em>悦诗风吟</em> <em>绿</em><em>茶精</em><em>萃</em>矿物质喷雾 50毫升"
                    ]
                }
            },
            {
                "_id": "112",
                "_index": "shop",
                "_score": 6.4020305,
                "_source": {
                    "title": "innisfree 悦诗风吟 绿茶精萃矿物质喷雾 150毫升"
                },
                "_type": "item",
                "highlight": {
                    "title": [
                        "<em>innisfree</em> <em>悦诗风吟</em> <em>绿</em><em>茶精</em><em>萃</em>矿物质喷雾 150毫升"
                    ]
                }
            },
            {
                "_id": "19",
                "_index": "shop",
                "_score": 5.85256,
                "_source": {
                    "title": "innisfree 悦诗风吟 绿茶精萃水乳两件套装"
                },
                "_type": "item",
                "highlight": {
                    "title": [
                        "<em>innisfree</em> <em>悦诗风吟</em> <em>绿</em><em>茶精</em><em>萃</em>水乳两件套装"
                    ]
                }
            },
            {
                "_id": "98",
                "_index": "shop",
                "_score": 5.85256,
                "_source": {
                    "title": "innisfree 悦诗风吟绿茶精萃清爽卸妆油 150毫升"
                },
                "_type": "item",
                "highlight": {
                    "title": [
                        "<em>innisfree</em> <em>悦诗风吟</em><em>绿</em><em>茶精</em><em>萃</em>清爽卸妆油 150毫升"
                    ]
                }
            },
            {
                "_id": "145",
                "_index": "shop",
                "_score": 4.69012,
                "_source": {
                    "title": "innisfree 悦诗风吟 绿茶精萃平衡面霜 [环保手帕限量版] 100毫升"
                },
                "_type": "item",
                "highlight": {
                    "title": [
                        "<em>innisfree</em> <em>悦诗风吟</em> <em>绿</em><em>茶精</em><em>萃</em>平衡面霜 [环保手帕限量版] 100毫升"
                    ]
                }
            },
            {
                "_id": "144",
                "_index": "shop",
                "_score": 3.6665916,
                "_source": {
                    "title": "innisfree 悦诗风吟 火山泥毛孔收敛水+毛孔清洁慕斯面膜+清洁洗面奶 "
                },
                "_type": "item",
                "highlight": {
                    "title": [
                        "<em>innisfree</em> <em>悦诗风吟</em> 火山泥毛孔收敛水+毛孔清洁慕斯面膜+清洁<em>洗面奶</em> "
                    ]
                }
            },
            {
                "_id": "30",
                "_index": "shop",
                "_score": 3.3598409,
                "_source": {
                    "title": "innisfree 悦诗风吟 绿茶洗面奶 150毫升"
                },
                "_type": "item",
                "highlight": {
                    "title": [
                        "<em>innisfree</em> <em>悦诗风吟</em> 绿茶<em>洗面奶</em> 150毫升"
                    ]
                }
            },
            {
                "_id": "64",
                "_index": "shop",
                "_score": 3.2793133,
                "_source": {
                    "title": "innisfree 悦诗风吟 海盐泡沫洁面膏 温和清爽 130毫升"
                },
                "_type": "item",
                "highlight": {
                    "title": [
                        "<em>innisfree</em> <em>悦诗风吟</em> 海盐泡沫<em>洁面膏</em> 温和清爽 130毫升"
                    ]
                }
            },
            {
                "_id": "69",
                "_index": "shop",
                "_score": 3.2793133,
                "_source": {
                    "title": "innisfree 悦诗风吟 火山泥清洁洗面奶 150毫升 油皮福音"
                },
                "_type": "item",
                "highlight": {
                    "title": [
                        "<em>innisfree</em> <em>悦诗风吟</em> 火山泥清洁<em>洗面奶</em> 150毫升 油皮福音"
                    ]
                }
            }
        ],
        "max_score": 8.554801,
        "total": 176
    },
    "timed_out": false,
    "took": 87
}

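The result shows that the item with id 30 now appears within the top 10. The goal is met; what remains is to maintain the custom dictionary and the synonym list to keep improving search quality.

Appendix

Download link for the experiment data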
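
When the dictionary and synonym list change, it helps to re-check that known target items still rank within the limit. A minimal sketch of such a check follows, using the hit order of the two result sets in this post; the helper `rank_of` and the idea of keeping these checks as regression cases are our own suggestion, not an elasticsearch feature.

```python
def rank_of(response, doc_id):
    """Return the 1-based position of doc_id in the hits, or None if it is absent."""
    for position, hit in enumerate(response["hits"]["hits"], start=1):
        if hit["_id"] == doc_id:
            return position
    return None


def as_response(ids):
    """Wrap a list of ids in the minimal shape of a search response."""
    return {"hits": {"hits": [{"_id": doc_id} for doc_id in ids]}}


# Hit order before and after the change, copied from the results in this post.
before = as_response(["75", "77", "50", "20", "3", "44", "145", "72", "48", "19"])
after = as_response(["77", "87", "112", "19", "98", "145", "144", "30", "64", "69"])

print(rank_of(before, "30"), rank_of(after, "30"))
```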

Chaos's Blog

Notes on configuring elasticsearch for Chinese search
