Verónica Martínez, Jun 26, 2017

Elasticsearch decay functions

In one of our recent projects we had to present users with a list of business establishments, usually filtered by category. This could simply be achieved with a filter, but it would hardly be useful for a user to receive a randomly ordered list of places. Considering all this information was stored in an Elasticsearch index, we decided to take advantage of Elasticsearch’s relevance scoring.

The places have some interesting attributes that help us present them in a more useful order. In this case we’ll focus on the rating, price range and location. The idea is to give the users the full list of establishments of the selected type, ranked by a combination of those three attributes. The reasoning is that users would prefer to hear first about the best-rated places that are both close to them and reasonably priced.

For this we will use decay functions: we define the ideal value for each of those three attributes and penalize the score of documents the further they are from that ideal. We use decay functions instead of filters because a filter is a hard cut-off. It discards everything outside its range, so it cannot express the trade-offs different people are willing to make between those attributes. Perhaps the user is willing to go a little further for a place with great ratings and/or better prices.

In the following snippet we’ll show an example of how to modify the score of a query. The bool section is the filter by business type and is just a normal query; user.latitude and user.longitude are placeholders for the user’s coordinates. Our focus today is on the function_score section.

GET an_index/a_type/_search
{
  "query": {
    "function_score": {
      "query": {
        "bool": {
          "must": { "term": { "business_type": "restaurant" } }
        }
      },
      "score_mode": "sum",
      "functions": [
        {
          "gauss": {
            "location": {
              "origin": {"lat": user.latitude, "lon": user.longitude},
              "offset": "2km",
              "scale": "2km",
              "decay": 0.75
            }
          },
          "weight": 2
        },
        {
          "filter": {
            "exists": {
              "field": "rating"
            }
          },
          "weight": 2
        },
        {
          "gauss": {
            "rating": {
              "origin": 5,
              "offset": 0.5,
              "scale": 0.5
            }
          },
          "weight": 1.5
        },
        {
          "gauss": {
            "price_range": {
              "origin": 1,
              "offset": 0,
              "scale": 1
            }
          },
          "weight": 0.5
        }
      ]
    }
  }
}

Important! Remember that the location field in the index must be mapped as a geo_point, otherwise the query will fail.

The first function of the query scores the distance between the user and each restaurant. Every restaurant within 2km of the user’s position (the offset, which covers the range [origin - offset, origin + offset]) receives the full score of 1, which is then doubled because the function has a weight of 2. Beyond that radius the score decays: 2km past the offset (the scale) it drops to three quarters of the full score (the decay, a softer penalty than the default of one half). Further out it keeps falling along the Gaussian curve rather than by a fixed factor per 2km.
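To see what the location function does at different distances, here is a minimal Python sketch of the Gaussian decay formula given in the Elasticsearch function_score documentation. It is an illustration, not Elasticsearch code: the name gauss_decay is made up for this example, and the geo distance is simplified to a plain number of kilometres.

```python
import math

def gauss_decay(value, origin, scale, offset=0.0, decay=0.5):
    """Gaussian decay score in [0, 1] (hypothetical helper, not an ES API)."""
    # Anything inside the offset band counts as distance 0 (full score).
    distance = max(0.0, abs(value - origin) - offset)
    # sigma^2 is chosen so the score equals `decay` when distance == scale.
    sigma_sq = -scale ** 2 / (2.0 * math.log(decay))
    return math.exp(-distance ** 2 / (2.0 * sigma_sq))

# Distances in km from the user, with the settings of the location function.
for km in (1, 2, 4, 6, 8):
    score = gauss_decay(km, origin=0, scale=2, offset=2, decay=0.75)
    print(f"{km} km -> {score:.3f}")
```

With these settings a restaurant 4km away scores 0.75, while one 6km away scores about 0.32, noticeably less than the 0.56 a fixed "three quarters per 2km" rule would give.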

The second and third functions deal with the ratings. The second gives priority to restaurants that have a rating at all, by adding its weight of 2 to their score. The third gives full points for ratings between 5 (the origin, and the maximum possible rating) and 4.5 (origin - offset). A rating another 0.5 (the scale) below that, 4.0, gets half the score (the default decay), and lower ratings fall off along the Gaussian curve.

Finally, the query accounts for the price range, where 1 is the ideal because it is the least expensive. The offset is 0 because only a price range of 1 should get full points. A price range of 2 (one scale away from the origin) gets half the score (the default decay), and higher price ranges are penalized progressively along the same curve.

In all cases we use the gauss function for the decay curve, instead of either linear (linear) or exponential (exp), because we feel it’s a better fit for this case. See a graphic example of the difference between the three in the official documentation.
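For a rough numeric feel of how the three curves differ, the sketch below implements the linear, exp and gauss formulas as they are described in the function_score documentation and evaluates them with the settings of the rating function. The function names are invented for this example.

```python
import math

def _distance(value, origin, offset):
    # Distance from the origin once the offset band has been subtracted.
    return max(0.0, abs(value - origin) - offset)

def linear_decay(value, origin, scale, offset=0.0, decay=0.5):
    s = scale / (1.0 - decay)
    return max((s - _distance(value, origin, offset)) / s, 0.0)

def exp_decay(value, origin, scale, offset=0.0, decay=0.5):
    lam = math.log(decay) / scale
    return math.exp(lam * _distance(value, origin, offset))

def gauss_decay(value, origin, scale, offset=0.0, decay=0.5):
    sigma_sq = -scale ** 2 / (2.0 * math.log(decay))
    return math.exp(-_distance(value, origin, offset) ** 2 / (2.0 * sigma_sq))

# Same settings as the rating function: origin 5, offset 0.5, scale 0.5.
print("rating  linear  exp     gauss")
for rating in (5.0, 4.5, 4.0, 3.5, 3.0):
    row = [f(rating, origin=5.0, scale=0.5, offset=0.5)
           for f in (linear_decay, exp_decay, gauss_decay)]
    print(f"{rating:<7} " + "  ".join(f"{v:.4f}" for v in row))
```

All three agree at a rating of 4.0, where each returns the decay value of 0.5. They differ past that point: linear reaches 0 at 3.5, exp keeps halving every 0.5, and gauss drops off more steeply than exp.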

As a final note, we can specify how the scores of the individual functions are combined by setting the score_mode, which in this case is sum. This means the function score of a document is the sum of each weighted score calculated, which worked best in our case, but there are other options available. That combined value is then merged with the score of the original query according to boost_mode, which defaults to multiply.
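As a worked example of the sum, the following Python sketch adds up the four weighted functions of the query for a single document. It is a simplification under stated assumptions: the helper names are hypothetical, distance is a plain number of kilometres, and the score of the bool query itself is left out.

```python
import math

def gauss_decay(value, origin, scale, offset=0.0, decay=0.5):
    distance = max(0.0, abs(value - origin) - offset)
    sigma_sq = -scale ** 2 / (2.0 * math.log(decay))
    return math.exp(-distance ** 2 / (2.0 * sigma_sq))

def combined_function_score(distance_km, rating, price_range):
    """Sum of the weighted function scores, mirroring score_mode "sum"."""
    parts = []
    # Location: weight 2, full score within 2 km, 0.75 at 4 km.
    parts.append(2.0 * gauss_decay(distance_km, 0.0, 2.0, offset=2.0, decay=0.75))
    if rating is not None:
        # The exists filter matches, so its weight is added as-is.
        parts.append(2.0)
        parts.append(1.5 * gauss_decay(rating, 5.0, 0.5, offset=0.5))
    else:
        # Per the ES docs, a decay function returns 1 when the field is missing.
        parts.append(1.5 * 1.0)
    # Price range: weight 0.5, ideal value 1, no offset.
    parts.append(0.5 * gauss_decay(price_range, 1.0, 1.0))
    return sum(parts)

# A restaurant 4 km away, rated 4.0, with price range 2.
print(combined_function_score(4.0, 4.0, 2))
```

For that restaurant the parts are 1.5 (location), 2 (has a rating), 0.75 (rating) and 0.25 (price), for a total of about 4.5. The best possible document gets 6. The unrated branch also suggests why the exists function is useful: without its extra 2 points, a restaurant with no rating would score as well on the rating function as one rated 5.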

Decay functions are a bit tricky, and some trial and error might be required before getting the results we want. A great tool to help in the development process is the explain parameter, which makes the queries take a little longer, but in exchange gives a detailed view of each of the scores calculated for each document. More information about this parameter can be found in the official Elasticsearch documentation.
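As a small illustration of switching explanations on, the sketch below adds "explain": true to a search body before it is sent. The helper with_explain and the shortened body are made up for this example, and sending the request to a cluster is left out.

```python
import json

def with_explain(search_body):
    """Return a copy of a search body with scoring explanations enabled."""
    body = dict(search_body)
    body["explain"] = True
    return body

# A shortened, hypothetical version of the query discussed above.
search_body = {
    "query": {
        "function_score": {
            "query": {"term": {"business_type": "restaurant"}},
            "score_mode": "sum",
            "functions": [
                {"gauss": {"rating": {"origin": 5, "offset": 0.5, "scale": 0.5}},
                 "weight": 1.5}
            ],
        }
    }
}

# This JSON is what would be sent as the body of the _search request.
print(json.dumps(with_explain(search_body), indent=2))
```

With explain enabled, each hit in the response carries an _explanation section that breaks the score down, which makes it easier to see which function dominates for a given document.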