Common terms query

Original English version: https://www.elastic.co/guide/en/elasticsearch/reference/7.7/query-dsl-common-terms-query.html (copyright of the original documentation belongs to www.elastic.co)
Local English version: ../en/query-dsl-common-terms-query.html

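Important: No additional bug fixes or documentation updates will be released for this version. For the latest information, see the current release documentation.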


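Deprecated in 7.3.0.

Use the match query instead. It skips blocks of documents efficiently, without any configuration, provided that the total number of hits is not tracked.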
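As a rough illustration of the suggested replacement, the sketch below builds a search body that uses match and switches off total hit tracking through the track_total_hits search option. The helper name match_request is hypothetical, and the field name "body" is borrowed from the examples on this page.

```python
import json


def match_request(field, text):
    """Build a search body that uses `match` where `common` was used before.

    `track_total_hits` is set to False because the block-skipping optimisation
    only applies when the total number of hits is not tracked.
    """
    return {
        "track_total_hits": False,
        "query": {"match": {field: {"query": text}}},
    }


# Print the body that would be sent to GET /_search.
print(json.dumps(match_request("body", "this is bonsai cool"), indent=2))
```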

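The common terms query is a modern alternative to stopwords. It improves the precision and recall of search results (by taking stopwords into account) without sacrificing performance.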

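The problem

Every term in a query has a cost. A search for "The brown fox" requires three term queries, one each for "the", "brown" and "fox", all of which are executed against all documents in the index. The query for "the" is likely to match many documents and therefore has a much smaller impact on relevance than the other two terms.

Previously, the solution to this problem was to ignore high frequency terms. Treating "the" as a stopword reduces the index size and the number of term queries that need to be executed.

The problem with this approach is that, while stopwords have a small impact on relevance, they are still important. If we remove stopwords, we lose precision (for example, we are unable to distinguish between "happy" and "not happy") and we lose recall (for example, text like "The The" or "To be or not to be" would simply not exist in the index).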

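The solution

The common terms query divides the query terms into two groups: more important (that is, low frequency terms) and less important (that is, high frequency terms which would previously have been stopwords).

First, it searches for documents which match the more important terms (the first query). These terms appear in fewer documents and have a greater impact on relevance.

Then, it executes a second query for the less important terms, which appear frequently and have a low impact on relevance. Instead of calculating the relevance score for all matching documents, it only calculates the _score for documents already matched by the first query. In this way the high frequency terms can improve the relevance calculation without paying the cost of poor performance.

If a query consists only of high frequency terms, then a single query is executed as an AND (conjunction) query. In other words, all terms are required. Even though each individual term matches many documents, the combination of terms narrows the result set down to only the most relevant documents. The single query can also be executed as an OR by specifying minimum_should_match, in which case a high enough value should be used.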
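The two-pass idea described above can be sketched with a toy in-memory inverted index. This is only an illustration under simplifying assumptions, not how Elasticsearch or Lucene implements the query: the function names are hypothetical, the two term groups are passed in already split, and the "score" is a plain count of matching terms rather than a real _score.

```python
from collections import defaultdict


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def common_terms_search(index, low_freq_terms, high_freq_terms):
    """Toy two-pass evaluation; scores are match counts, not real _score values."""
    if low_freq_terms:
        # Pass 1: documents matching at least one low frequency term (OR).
        candidates = set()
        for term in low_freq_terms:
            candidates |= index.get(term, set())
    else:
        # Only high frequency terms: require all of them (AND).
        postings = [index.get(term, set()) for term in high_freq_terms]
        candidates = set.intersection(*postings) if postings else set()

    scores = {}
    for doc_id in candidates:
        low = sum(doc_id in index.get(t, set()) for t in low_freq_terms)
        # Pass 2: high frequency terms only adjust documents already matched.
        high = sum(doc_id in index.get(t, set()) for t in high_freq_terms)
        scores[doc_id] = low + high
    return scores
```

With three small documents such as "the quick brown fox", "the lazy dog" and "a brown bear", searching for the low frequency terms "brown" and "fox" plus the high frequency term "the" never visits the second document, even though it contains "the".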

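Terms are allocated to the high or low frequency group based on cutoff_frequency, which can be specified as an absolute frequency (>=1) or as a relative frequency (0.0 to 1.0). (Remember that document frequencies are computed at the per-shard level, as explained in the blog post "Relevance is broken".)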
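A minimal sketch of how such a cutoff could split terms is shown below. The function name split_by_cutoff and its inputs are hypothetical, and the exact boundary handling (a term counts as high frequency only when its document frequency is strictly greater than the threshold, with relative thresholds rounded up) is an assumption made for illustration.

```python
import math


def split_by_cutoff(doc_freqs, max_doc, cutoff_frequency):
    """Split terms into (low_freq, high_freq) lists using a cutoff.

    doc_freqs maps each term to the number of documents containing it on one
    shard. A cutoff >= 1 is read as an absolute document count; a cutoff below
    1 is read as a fraction of max_doc.
    """
    if cutoff_frequency >= 1:
        threshold = cutoff_frequency
    else:
        threshold = math.ceil(cutoff_frequency * max_doc)

    low, high = [], []
    for term, doc_freq in doc_freqs.items():
        # Assumption: strictly greater than the threshold means "high frequency".
        (high if doc_freq > threshold else low).append(term)
    return low, high
```

Because the frequencies are per shard, the same term can land in different groups on different shards when the shards hold different data.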

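Perhaps the most interesting property of this query is that it adapts to domain-specific stopwords automatically. For example, on a video hosting site, common terms like "clip" or "video" automatically behave as stopwords without the need to maintain a manual list.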

Examples

In this example, words that have a document frequency greater than 0.1% (for example, "this" and "is") are treated as common terms.

GET /_search
{
    "query": {
        "common": {
            "body": {
                "query": "this is bonsai cool",
                "cutoff_frequency": 0.001
            }
        }
    }
}

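The number of terms which should match can be controlled with the minimum_should_match (high_freq, low_freq), low_freq_operator (default "or") and high_freq_operator (default "or") parameters.

For low frequency terms, set low_freq_operator to "and" to make all terms required: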

GET /_search
{
    "query": {
        "common": {
            "body": {
                "query": "nelly the elephant as a cartoon",
                "cutoff_frequency": 0.001,
                "low_freq_operator": "and"
            }
        }
    }
}

This is roughly equivalent to:

GET /_search
{
    "query": {
        "bool": {
            "must": [
                { "term": { "body": "nelly"}},
                { "term": { "body": "elephant"}},
                { "term": { "body": "cartoon"}}
            ],
            "should": [
                { "term": { "body": "the"}},
                { "term": { "body": "as"}},
                { "term": { "body": "a"}}
            ]
        }
    }
}

Alternatively, use minimum_should_match to specify the minimum number or percentage of low frequency terms which must be present, for instance:

GET /_search
{
    "query": {
        "common": {
            "body": {
                "query": "nelly the elephant as a cartoon",
                "cutoff_frequency": 0.001,
                "minimum_should_match": 2
            }
        }
    }
}

This is roughly equivalent to:

GET /_search
{
    "query": {
        "bool": {
            "must": {
                "bool": {
                    "should": [
                        { "term": { "body": "nelly"}},
                        { "term": { "body": "elephant"}},
                        { "term": { "body": "cartoon"}}
                    ],
                    "minimum_should_match": 2
                }
            },
            "should": [
                { "term": { "body": "the"}},
                { "term": { "body": "as"}},
                { "term": { "body": "a"}}
            ]
        }
    }
}

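A different minimum_should_match can be applied to low and high frequency terms with the additional low_freq and high_freq parameters. Here is an example that provides these additional parameters (note the change in structure):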

GET /_search
{
    "query": {
        "common": {
            "body": {
                "query": "nelly the elephant not as a cartoon",
                "cutoff_frequency": 0.001,
                "minimum_should_match": {
                    "low_freq" : 2,
                    "high_freq" : 3
                }
            }
        }
    }
}

This is roughly equivalent to:

GET /_search
{
    "query": {
        "bool": {
            "must": {
                "bool": {
                    "should": [
                        { "term": { "body": "nelly"}},
                        { "term": { "body": "elephant"}},
                        { "term": { "body": "cartoon"}}
                    ],
                    "minimum_should_match": 2
                }
            },
            "should": {
                "bool": {
                    "should": [
                        { "term": { "body": "the"}},
                        { "term": { "body": "not"}},
                        { "term": { "body": "as"}},
                        { "term": { "body": "a"}}
                    ],
                    "minimum_should_match": 3
                }
            }
        }
    }
}
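The pattern shared by the three "roughly equivalent" bool queries above can be summarised in a small helper. This is a sketch for illustration only: the names term_clauses and equivalent_bool are hypothetical, the two term groups are passed in already split, and the case where a query contains only high frequency terms is not covered.

```python
def term_clauses(field, terms):
    """One term clause per query term."""
    return [{"term": {field: term}} for term in terms]


def equivalent_bool(field, low_terms, high_terms,
                    low_freq_operator="or", low_msm=None, high_msm=None):
    """Build the approximate bool query for a mixed low/high frequency query."""
    low = term_clauses(field, low_terms)
    high = term_clauses(field, high_terms)

    # High frequency terms are optional; they are wrapped in a nested bool
    # only when they carry their own minimum_should_match.
    if high_msm is not None:
        should = {"bool": {"should": high, "minimum_should_match": high_msm}}
    else:
        should = high

    if low_freq_operator == "and":
        # Every low frequency term is required.
        must = low
    else:
        inner = {"should": low}
        if low_msm is not None:
            inner["minimum_should_match"] = low_msm
        must = {"bool": inner}

    return {"query": {"bool": {"must": must, "should": should}}}
```

For instance, equivalent_bool("body", ["nelly", "elephant", "cartoon"], ["the", "not", "as", "a"], low_msm=2, high_msm=3) yields the same structure as the bool query shown just above.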

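In this case, the high frequency terms only have an impact on relevance when at least three of them are present. The most interesting use of minimum_should_match for high frequency terms, however, is when the query contains only high frequency terms: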

GET /_search
{
    "query": {
        "common": {
            "body": {
                "query": "how not to be",
                "cutoff_frequency": 0.001,
                "minimum_should_match": {
                    "low_freq" : 2,
                    "high_freq" : 3
                }
            }
        }
    }
}

This is roughly equivalent to:

GET /_search
{
    "query": {
        "bool": {
            "should": [
                { "term": { "body": "how"}},
                { "term": { "body": "not"}},
                { "term": { "body": "to"}},
                { "term": { "body": "be"}}
            ],
            "minimum_should_match": "3<50%"
        }
    }
}

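The query generated for the high frequency terms is therefore slightly less restrictive than it would be with an AND.

The common terms query also supports boost and analyzer as parameters.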
