A Little Performance Story
For each event, we attach the recipient email and display the events nicely in our campaign dashboard, including the Unique Opens column shown above. To calculate that column, we need to fetch recipient emails from all of the open events and deduplicate them. As you can imagine, this query is quite inefficient: it took upwards of 5 seconds to fetch 500,000 events, despite gigabit connections between our app server and Elasticsearch.
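To make the bottleneck concrete, here's a minimal sketch of the client-side approach (the event shape and field names are illustrative, not our actual schema): pull every open event back to the app server, then deduplicate the emails there.

```scala
// Hypothetical event shape; field names are illustrative only.
case class OpenEvent(campaignId: Int, recipientEmail: String)

// Client-side unique-opens count: every event crosses the wire
// before we deduplicate, which is what made the query so slow.
def uniqueOpens(events: Seq[OpenEvent]): Int =
  events.map(_.recipientEmail).distinct.size

val events = Seq(
  OpenEvent(1, "a@example.com"),
  OpenEvent(1, "b@example.com"),
  OpenEvent(1, "a@example.com") // repeat open from the same recipient
)
println(uniqueOpens(events)) // 2
```

The dedup itself is cheap; the cost is shipping half a million documents over the network just to throw most of them away.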
This sluggishness showed up in our Elasticsearch slow log, excerpted above.
To solve this slowness, we had to offload the deduplication to the Elasticsearch server and return only the counts, though this came with a few challenges.
The Power of the Scroll
How tokenization affects your searches
By default, Elasticsearch applies the standard analyzer to string fields. For the email address email@example.com, the standard analyzer splits the address into multiple tokens rather than keeping it as one. If we search the field with a terms facet to get unique email counts, the result will include an entry for each token, which inflates the unique count.
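As a rough, purely illustrative approximation (the real tokenizer follows Unicode word-break rules, not a simple split), you can see the effect that matters here: one address breaking into several terms.

```scala
// Crude stand-in for the standard analyzer: split on '@'.
// Lucene's actual tokenizer is far more sophisticated; this only
// mimics the relevant effect of an email breaking into pieces.
def roughTokens(email: String): List[String] =
  email.split("@").filter(_.nonEmpty).toList

println(roughTokens("email@example.com")) // List(email, example.com)
```

A terms facet over these tokens would count "email" and "example.com" separately, rather than the address once.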
The Elasticsearch inquisitor plugin has a handy tool that shows tokenizations with different analyzers.
Taming the scroll with Scala
We must set the mapping of the email field to not_analyzed for Elasticsearch to treat the email as a single token. However, changing a mapping requires reindexing the data for the new mapping to take effect.
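The mapping on the new index looks something like this (the type name "events" here is a placeholder, not necessarily our actual type):

```json
{
  "events": {
    "properties": {
      "email": { "type": "string", "index": "not_analyzed" }
    }
  }
}
```

With index set to not_analyzed, the field is stored as one exact token, so a terms facet counts whole addresses.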
As of Elasticsearch 0.90.3, there is no built-in reindexing support. Instead, we rely on a scroll query, which essentially maintains a consistent snapshot of the query results so documents are neither skipped nor repeated as we page through them. Any writes that arrive while the reindex runs will be missing from the new index, but they can be picked up with another scroll query afterward.
Here’s the code to do it in our favorite language, Scala:
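(The original listing was lost; below is a sketch of the scroll-and-reindex loop against the Java client API of that era. The index names, batch size, and scroll timeout are placeholders, not our production values.)

```scala
import org.elasticsearch.client.Client
import org.elasticsearch.action.search.SearchType
import org.elasticsearch.common.unit.TimeValue
import org.elasticsearch.index.query.QueryBuilders

// Sketch of a scroll-driven reindex; "oldIndex" and "newIndex"
// are placeholders. Assumes the new index's mapping is already set.
def reindex(client: Client, oldIndex: String, newIndex: String): Unit = {
  val keepAlive = new TimeValue(60000)

  // Start a scan-type scroll: cheap, unordered iteration over every document.
  var response = client.prepareSearch(oldIndex)
    .setSearchType(SearchType.SCAN)
    .setScroll(keepAlive)
    .setQuery(QueryBuilders.matchAllQuery())
    .setSize(500) // hits per shard, per scroll round trip
    .execute().actionGet()

  var done = false
  while (!done) {
    response = client.prepareSearchScroll(response.getScrollId)
      .setScroll(keepAlive)
      .execute().actionGet()

    val hits = response.getHits.getHits
    if (hits.isEmpty) {
      done = true
    } else {
      // Bulk-index each batch's _source into the new index.
      val bulk = client.prepareBulk()
      hits.foreach { hit =>
        bulk.add(client.prepareIndex(newIndex, hit.getType, hit.getId)
          .setSource(hit.getSourceAsString))
      }
      bulk.execute().actionGet()
    }
  }
}
```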
The code above fetches the _source field from each entry and reindexes it into newIndex. Be sure to set up the new mapping on the new index first; otherwise, all that effort is for naught.
Once the data is in the new index, the last step is to delete the old index and create an alias pointing to the new one.
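(This listing was also lost; a sketch of the swap through the Java admin client, with placeholder index names, might look like the following.)

```scala
import org.elasticsearch.client.Client

// Delete the old index, then alias the new index under the old name
// so existing queries keep working unchanged. Names are placeholders.
def swap(client: Client, oldIndex: String, newIndex: String): Unit = {
  client.admin().indices().prepareDelete(oldIndex)
    .execute().actionGet()

  client.admin().indices().prepareAliases()
    .addAlias(newIndex, oldIndex)
    .execute().actionGet()
}
```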
You can also create the alias through the Elasticsearch Head plugin, which nicely displays the aliases on an index as well.
Coming up next…
I’ll go into how we offloaded the unique queries to the Elasticsearch server, now that the email field is properly treated as a single token in the new index.
If this post piqued your interest in what we’re solving at Iterable, drop me a note: justin at iterable.com.