I. Analysis
1. Analysis
- First, a block of text is tokenized into the individual terms that go into the inverted index.
- Then those terms are normalized into a standard form to improve their "searchability" (recall). Both steps are performed by an analyzer.
2. Analyzer
- Character filters: pre-process the string (for example, stripping extra whitespace) so that it is "tidier" before tokenization. An analyzer may contain zero or more character filters.
- Tokenizer: breaks the string into individual terms (for example, splitting on whitespace into single words). An analyzer must contain exactly one tokenizer.
- Token filters: every term passes through the token filters, which may modify, add, or remove tokens.
An analyzer is only applied to full-text fields; when a field holds an exact value, it is not analyzed.
- Full-text fields: e.g. string, text
- Exact values: e.g. numbers, dates
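To see the tokenize-then-normalize pipeline in action, the _analyze API can be called with one of the built-in analyzers. This is only an illustrative request (the sample text is made up); the standard analyzer splits on word boundaries, drops most punctuation, and lower-cases the tokens:
POST _analyze
{
  "analyzer": "standard",
  "text": "The QUICK Brown-Foxes!"
}
The response should contain roughly the terms the, quick, brown, foxes, each with its character offsets and position.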
II. Custom analyzers
1. char_filter (character filters)
- html_strip (strips HTML tags)
  Parameter: escaped_tags - an array of HTML tags that should not be stripped from the original text.
- mapping (custom character mapping)
  Parameters: mappings - an array of mappings, each entry in the form key => value.
  mappings_path - the absolute path, or the path relative to the config directory, of a UTF-8 encoded file containing one key => value mapping per line.
- pattern_replace (matches characters with a regular expression and replaces them with the given string)
  Parameters: pattern - a Java regular expression.
  replacement - the string to substitute for each match.
  flags - Java regular expression flags.
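Char filters can be tried on their own with the _analyze API before being baked into an analyzer. The request below is only a sketch: it pairs html_strip (keeping <b> via escaped_tags, an illustrative choice) with the keyword tokenizer so the whole filtered string comes back as a single token:
POST _analyze
{
  "tokenizer": "keyword",
  "char_filter": [
    {
      "type": "html_strip",
      "escaped_tags": ["b"]
    }
  ],
  "text": "<p>I am so <b>happy</b>!</p>"
}
The single token in the response should keep the <b> tags but lose the <p> wrapper.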
2. tokenizer (tokenizers)
Only the most common ones are listed here; see the official documentation for the full set.
- standard (the default tokenizer; splits text on the word boundaries defined by the Unicode Text Segmentation algorithm and removes most punctuation, so it is a good choice for most languages)
  Parameter: max_token_length - the maximum token length; a token longer than this is split. Defaults to 255.
- letter (splits whenever it meets a character that is not a letter)
  Parameters: none.
- lowercase (like letter, but also lower-cases every token)
  Parameters: none.
- whitespace (splits on whitespace)
  Parameters: none.
- keyword (effectively no tokenization: it outputs whatever it receives as a single token)
  Parameter: buffer_size - the term buffer size, 256 by default; the buffer grows in increments of this size until all of the text has been consumed. Changing this setting is not recommended.
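The effect of a tokenizer parameter is also easiest to see through _analyze. A sketch using an inline standard tokenizer with the same max_token_length used later in this post (the sample text is made up):
POST _analyze
{
  "tokenizer": {
    "type": "standard",
    "max_token_length": 5
  },
  "text": "Elasticsearch tokenizers"
}
A long word such as Elasticsearch comes back chopped into 5-character pieces (Elast, icsea, rch), which is exactly what happens to people and banana in the custom-analyzer example below.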
3. filter (token filters)
There are far too many token filters to cover one by one here; see the official documentation.
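Token filters can likewise be combined ad hoc in an _analyze request. The sketch below chains the built-in lowercase filter with an inline stop filter (the same stopwords used in the custom analyzer that follows); the text is just an illustration:
POST _analyze
{
  "tokenizer": "whitespace",
  "filter": [
    "lowercase",
    {
      "type": "stop",
      "stopwords": ["the", "a"]
    }
  ],
  "text": "The Quick Fox and a Lazy Dog"
}
After lower-casing, the stop filter removes the and a, leaving quick, fox, and, lazy, dog.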
4. Custom analyzer
Putting the pieces together, create an index with a custom analyzer:
PUT newindex
{
"settings": {
"analysis": {
"char_filter": {
"my_char_filter": {
"type": "mapping",
"mappings": [
"&=>and",
":)=>happy",
":(=>sad"
]
}
},
"tokenizer": {
"my_tokenizer": {
"type": "standard",
"max_token_length": 5
}
},
"filter": {
"my_filter": {
"type": "stop",
"stopwords": [
"the",
"a"
]
}
},
"analyzer": {
"my_analyzer": {
"type": "custom",
"char_filter": [
"html_strip",
"my_char_filter"
],
"tokenizer": "my_tokenizer",
"filter": [
"lowercase",
"my_filter"
]
}
}
}
}
}
Then analyze a piece of text with the custom analyzer:
POST newindex/_analyze
{
"analyzer": "my_analyzer",
"text": "<span>If you are :(, I will be :).</span> The people & a banana",
"explain": true
}
The response shows every stage of the analysis:
{
"detail": {
"custom_analyzer": true,
"charfilters": [
{
"name": "html_strip",
"filtered_text": [
"if you are :(, I will be :). the people & a banana"
]
},
{
"name": "my_char_filter",
"filtered_text": [
"if you are sad, I will be happy. the people and a banana"
]
}
],
"tokenizer": {
"name": "my_tokenizer",
"tokens": [
{
"token": "if",
"start_offset": 6,
"end_offset": 8,
"type": "<ALPHANUM>",
"position": 0,
"bytes": "[69 66]",
"positionLength": 1
},
{
"token": "you",
"start_offset": 9,
"end_offset": 12,
"type": "<ALPHANUM>",
"position": 1,
"bytes": "[79 6f 75]",
"positionLength": 1
},
{
"token": "are",
"start_offset": 13,
"end_offset": 16,
"type": "<ALPHANUM>",
"position": 2,
"bytes": "[61 72 65]",
"positionLength": 1
},
{
"token": "sad",
"start_offset": 17,
"end_offset": 19,
"type": "<ALPHANUM>",
"position": 3,
"bytes": "[73 61 64]",
"positionLength": 1
},
{
"token": "I",
"start_offset": 21,
"end_offset": 22,
"type": "<ALPHANUM>",
"position": 4,
"bytes": "[49]",
"positionLength": 1
},
{
"token": "will",
"start_offset": 23,
"end_offset": 27,
"type": "<ALPHANUM>",
"position": 5,
"bytes": "[77 69 6c 6c]",
"positionLength": 1
},
{
"token": "be",
"start_offset": 28,
"end_offset": 30,
"type": "<ALPHANUM>",
"position": 6,
"bytes": "[62 65]",
"positionLength": 1
},
{
"token": "happy",
"start_offset": 31,
"end_offset": 33,
"type": "<ALPHANUM>",
"position": 7,
"bytes": "[68 61 70 70 79]",
"positionLength": 1
},
{
"token": "the",
"start_offset": 42,
"end_offset": 45,
"type": "<ALPHANUM>",
"position": 8,
"bytes": "[74 68 65]",
"positionLength": 1
},
{
"token": "peopl",
"start_offset": 46,
"end_offset": 51,
"type": "<ALPHANUM>",
"position": 9,
"bytes": "[70 65 6f 70 6c]",
"positionLength": 1
},
{
"token": "e",
"start_offset": 51,
"end_offset": 52,
"type": "<ALPHANUM>",
"position": 10,
"bytes": "[65]",
"positionLength": 1
},
{
"token": "and",
"start_offset": 53,
"end_offset": 54,
"type": "<ALPHANUM>",
"position": 11,
"bytes": "[61 6e 64]",
"positionLength": 1
},
{
"token": "a",
"start_offset": 55,
"end_offset": 56,
"type": "<ALPHANUM>",
"position": 12,
"bytes": "[61]",
"positionLength": 1
},
{
"token": "banan",
"start_offset": 57,
"end_offset": 62,
"type": "<ALPHANUM>",
"position": 13,
"bytes": "[62 61 6e 61 6e]",
"positionLength": 1
},
{
"token": "a",
"start_offset": 62,
"end_offset": 63,
"type": "<ALPHANUM>",
"position": 14,
"bytes": "[61]",
"positionLength": 1
}
]
},
"tokenfilters": [
{
"name": "lowercase",
"tokens": [
{
"token": "if",
"start_offset": 6,
"end_offset": 8,
"type": "<ALPHANUM>",
"position": 0,
"bytes": "[69 66]",
"positionLength": 1
},
{
"token": "you",
"start_offset": 9,
"end_offset": 12,
"type": "<ALPHANUM>",
"position": 1,
"bytes": "[79 6f 75]",
"positionLength": 1
},
{
"token": "are",
"start_offset": 13,
"end_offset": 16,
"type": "<ALPHANUM>",
"position": 2,
"bytes": "[61 72 65]",
"positionLength": 1
},
{
"token": "sad",
"start_offset": 17,
"end_offset": 19,
"type": "<ALPHANUM>",
"position": 3,
"bytes": "[73 61 64]",
"positionLength": 1
},
{
"token": "i",
"start_offset": 21,
"end_offset": 22,
"type": "<ALPHANUM>",
"position": 4,
"bytes": "[69]",
"positionLength": 1
},
{
"token": "will",
"start_offset": 23,
"end_offset": 27,
"type": "<ALPHANUM>",
"position": 5,
"bytes": "[77 69 6c 6c]",
"positionLength": 1
},
{
"token": "be",
"start_offset": 28,
"end_offset": 30,
"type": "<ALPHANUM>",
"position": 6,
"bytes": "[62 65]",
"positionLength": 1
},
{
"token": "happy",
"start_offset": 31,
"end_offset": 33,
"type": "<ALPHANUM>",
"position": 7,
"bytes": "[68 61 70 70 79]",
"positionLength": 1
},
{
"token": "the",
"start_offset": 42,
"end_offset": 45,
"type": "<ALPHANUM>",
"position": 8,
"bytes": "[74 68 65]",
"positionLength": 1
},
{
"token": "peopl",
"start_offset": 46,
"end_offset": 51,
"type": "<ALPHANUM>",
"position": 9,
"bytes": "[70 65 6f 70 6c]",
"positionLength": 1
},
{
"token": "e",
"start_offset": 51,
"end_offset": 52,
"type": "<ALPHANUM>",
"position": 10,
"bytes": "[65]",
"positionLength": 1
},
{
"token": "and",
"start_offset": 53,
"end_offset": 54,
"type": "<ALPHANUM>",
"position": 11,
"bytes": "[61 6e 64]",
"positionLength": 1
},
{
"token": "a",
"start_offset": 55,
"end_offset": 56,
"type": "<ALPHANUM>",
"position": 12,
"bytes": "[61]",
"positionLength": 1
},
{
"token": "banan",
"start_offset": 57,
"end_offset": 62,
"type": "<ALPHANUM>",
"position": 13,
"bytes": "[62 61 6e 61 6e]",
"positionLength": 1
},
{
"token": "a",
"start_offset": 62,
"end_offset": 63,
"type": "<ALPHANUM>",
"position": 14,
"bytes": "[61]",
"positionLength": 1
}
]
},
{
"name": "my_filter",
"tokens": [
{
"token": "if",
"start_offset": 6,
"end_offset": 8,
"type": "<ALPHANUM>",
"position": 0,
"bytes": "[69 66]",
"positionLength": 1
},
{
"token": "you",
"start_offset": 9,
"end_offset": 12,
"type": "<ALPHANUM>",
"position": 1,
"bytes": "[79 6f 75]",
"positionLength": 1
},
{
"token": "are",
"start_offset": 13,
"end_offset": 16,
"type": "<ALPHANUM>",
"position": 2,
"bytes": "[61 72 65]",
"positionLength": 1
},
{
"token": "sad",
"start_offset": 17,
"end_offset": 19,
"type": "<ALPHANUM>",
"position": 3,
"bytes": "[73 61 64]",
"positionLength": 1
},
{
"token": "i",
"start_offset": 21,
"end_offset": 22,
"type": "<ALPHANUM>",
"position": 4,
"bytes": "[69]",
"positionLength": 1
},
{
"token": "will",
"start_offset": 23,
"end_offset": 27,
"type": "<ALPHANUM>",
"position": 5,
"bytes": "[77 69 6c 6c]",
"positionLength": 1
},
{
"token": "be",
"start_offset": 28,
"end_offset": 30,
"type": "<ALPHANUM>",
"position": 6,
"bytes": "[62 65]",
"positionLength": 1
},
{
"token": "happy",
"start_offset": 31,
"end_offset": 33,
"type": "<ALPHANUM>",
"position": 7,
"bytes": "[68 61 70 70 79]",
"positionLength": 1
},
{
"token": "peopl",
"start_offset": 46,
"end_offset": 51,
"type": "<ALPHANUM>",
"position": 9,
"bytes": "[70 65 6f 70 6c]",
"positionLength": 1
},
{
"token": "e",
"start_offset": 51,
"end_offset": 52,
"type": "<ALPHANUM>",
"position": 10,
"bytes": "[65]",
"positionLength": 1
},
{
"token": "and",
"start_offset": 53,
"end_offset": 54,
"type": "<ALPHANUM>",
"position": 11,
"bytes": "[61 6e 64]",
"positionLength": 1
},
{
"token": "banan",
"start_offset": 57,
"end_offset": 62,
"type": "<ALPHANUM>",
"position": 13,
"bytes": "[62 61 6e 61 6e]",
"positionLength": 1
}
]
}
]
}
}
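To actually use the analyzer at index time, it has to be referenced from a full-text field in the mapping. A minimal sketch, assuming Elasticsearch 7+ (no mapping types) and a hypothetical content field:
PUT newindex/_mapping
{
  "properties": {
    "content": {
      "type": "text",
      "analyzer": "my_analyzer"
    }
  }
}
Documents indexed into content will then pass through html_strip, my_char_filter, my_tokenizer, lowercase, and my_filter before their terms reach the inverted index.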