Introduction
Welcome to the third post in our search relevance series! In our previous posts, we explored text-based search and vector-based search separately. We saw how each approach has its strengths: text search excels at exact matches and handling specific terms, while vector search captures semantic meaning and handles synonyms and misspellings well. In this post, we’ll combine these approaches to create a hybrid search system that leverages the best of both worlds.
Our hybrid approach will aim to:
- Combine the precision of text-based search with the semantic understanding of vector-based search.
- Improve our ability to handle a wider range of queries, including those with misspellings or without exact matches.
- Further enhance our search relevance metrics.
We’ll cover the following key topics:
- Implementing a basic hybrid search by combining text and vector queries
- Tuning the hybrid search through boosting
- Evaluating the performance of our hybrid approach
- Exploring the advantages of hybrid search, particularly for handling edge cases
Let’s dive in and see how we can create a more robust and effective search system through hybridization.
Running Hybrid Searches
With Elasticsearch’s kNN query, we can simply add the kNN matching clause into the list of text matches that we had originally used for our text search. One thing to note is that the text match scores are often orders of magnitude larger than the vector match scores. In order to give the vector match a fair chance in the final rankings, we will need to boost their scores. Similar to how we tuned the text scores, we will iterate over boost values for the vector scores in search of the highest-ranking evaluation metrics.
We can create our new hybrid search queries in the following manner:
python
Copy
query_body = {
"size": num_results,
"query": {
"bool": {
"should": [
{
"match": {
"product_title": {
"query": search_text,
}
}
},
{
"match": {
"product_description": {
"query": search_text,
}
}
},
{
"nested": {
"path": "product_attributes",
"query": {
"match": {
"product_attributes.name_value": {
"query": search_text,
}
}
}
}
},
{
"knn": {
"field": "product_vector",
"query_vector": search_vector,
"num_candidates": 50,
}
}
]
}
}
}
By tuning our vector match boosts, we’re effectively able to balance the quality of the text matches between the original text and the products with the semantic relationship captured by the vector embeddings.
Hybrid Search Results
Let’s take a look at a sample query in our three methods so far. The user query in this case is “hanging shelves”
Text
Position | Score | Product ID | Product Title | Relevance |
---|---|---|---|---|
1 | 115.4625 | 107182 | Ameriwood Wardrobe Storage Closet with Hanging Rod and 2-Shelves in American Cherry | 2.67 |
2 | 87.12547 | 117672 | Whitmor 19.50 in. x 45.38 in. x 68.00 in. Double Rod Closet Shelves | Â |
3 | 84.42163 | 174420 | 4-Shelves Tier Pole Caddy in Bronze | Â |
4 | 83.25287 | 106280 | Rolling Shelves 17 in. Express Pullout Shelf | Â |
5 | 83.25287 | 114168 | Rolling Shelves 21 in. Express Pullout Shelf | Â |
6 | 81.89681 | 116589 | Zenith Premium Bathtub and Shower Pole Caddy with 4 Shelves in White | Â |
7 | 80.95274 | 193107 | Stack-On 42 in. DIY Workbench with Full Length Steel Shelves | Â |
8 | 80.55455 | 192343 | Command Picture Hanging Solution Kit | Â |
9 | 80.42712 | 190110 | Martha Stewart Living 24 in. Espresso Shelves (2-Pack) | Â |
10 | 80.26959 | 181845 | Fresca Allier 16 in. W Bathroom Linen Cabinet with 2 Glass Shelves in White | Â |
The top result here for the “Ameriwood Wardrobe Storage Closet with Hanging Rod and 2-Shelves” comes in with a relatively high relevance score of 2.67 (out of three). The rest of the results seem to be for mostly shelves and are missing our “hanging” component.
Vector
Position | Score | Product ID | Product Title | Relevance |
---|---|---|---|---|
1 | 0.702164 | 128711 | Honey-Can-Do 8-Shelf PEVA hanging organizer | 3 |
2 | 0.700011 | 104901 | 4D Concepts Hanging Wall Corner Shelf Storage | 3 |
3 | 0.699165 | 197453 | Martha Stewart Living Solutions 70 in. Silhouette 2-entryway Shelf with Hooks | Â |
4 | 0.699132 | 119985 | Design House 12 in. x 10-3/16 in. White Shelf-Hanging Rod Bracket | Â |
5 | 0.699011 | 139090 | 4D Concepts Hanging Wall Corner Shelf Storage | 2.67 |
6 | 0.696995 | 134901 | Prepac 36 in. W Hanging Entryway Shelf | 3 |
7 | 0.696304 | 136407 | Prepac 60 in. Wall-Mounted Coat Rack in White | Â |
8 | 0.696301 | 188206 | New Age Industrial 15 in. D x 48 in. L 12-Gauge Aluminum Wall Shelf | Â |
9 | 0.695971 | 136535 | Prepac 48.5 in. x 19.25 in. Floating Entryway Shelf and Coat Rack in Black | Â |
10 | 0.695865 | 161429 | Houseworks 34 in. x 5-1/4 in. Unfinished Wood Decor Shelf with Pegs | Â |
Our vector query is able to embed the “hanging shelves” into a single representation, and because of that is able to return more products that look like a hanging shelf solution.
Hybrid
Position | Score | Product ID | Product Title | Relevance |
---|---|---|---|---|
1 | 115.4625 | 107182 | Ameriwood Wardrobe Storage Closet with Hanging Rod and 2-Shelves in American Cherry | 2.67 |
2 | 110.6256 | 163204 | ClosetMaid 54 in. Canteen 8-Shelf Hanging Organizer | Â |
3 | 110.371 | 119437 | ClosetMaid 54 in. Mocha 8-Shelf Hanging Organizer | 2.67 |
4 | 109.9637 | 186249 | ClosetMaid 54 in. Canteen 10-Shelf Hanging Organizer | Â |
5 | 109.8537 | 141074 | ClosetMaid 54 in. Mocha 10-Shelf Hanging Organizer | 3 |
6 | 107.4734 | 124425 | ClosetMaid 24 in. White Versatile Hanging Shelf | 3 |
7 | 106.2367 | 109858 | ClosetMaid 24 in. Hanging Wire Shelf | 3 |
8 | 104.0418 | 154365 | Home Decorators Collection 2-Shelves and Towel Rack in Chrome | Â |
9 | 103.2512 | 128711 | Honey-Can-Do 8-Shelf PEVA hanging organizer | 3 |
10 | 101.6648 | 212563 | Martha Stewart Living Garage 6 in. H x 24 in. W White Metal Shelves | Â |
Our hybrid solution shows us that we really can get the best of both worlds. The top result is our top result from the text search. The 9th result is our top result from the vector search. In between these two are four other high-relevance products that weren’t in the top ten results of our text or vector searches alone.
Evaluation
We’ve run a lot of additional queries up to this point, let’s check back into the evaluation scores to see how we’re shaping up. Recall from the first post that we will be using Mean Recipricol Rank, Mean Average Precision, and Normalized Discounted Cumulative Gain.
Name | MRR | MAP | NDCG | Run Time |
---|---|---|---|---|
textsearch | 0.261 | 0.113 | 0.170 | 178.5 |
textsearch_boosted | 0.318 | 0.149 | 0.218 | 207.4 |
vectorsearch | 0.331 | 0.159 | 0.237 | 509.6 |
vectorsearch_multifield | 0.241 | 0.097 | 0.156 | 623.0 |
vectorsearch_multifield_tuned | 0.255 | 0.106 | 0.168 | 662.4 |
hybrid | 0.325 | 0.163 | 0.238 | 662.6 |
hybrid_boosted | 0.342 | 0.170 | 0.251 | 716.3 |
Increase over Text | 7.7% | 14.1% | 15.5% | 245.4% |
Increase over Vector | 3.4% | 6.9% | 6.2% | 40.6% |
We can see that our tuned hybrid really does give us that “best of both worlds” and the best performance across all three metrics. By tuning our hybrid search we were able to make a 7%-16% gain across all three of our metrics from the tuned text search and a 3%-7% gain across all three of our metrics from the vector search. It worth noting, however, that we take a 245% hit on the time it takes to run these queries over the basic text query.
Conclusion
In this post, we combined the additional capability of vector based search to our text searches across the search relevance dataset. We tuned our hybrid queries using a similar method to how we tuned our text search.
This added capability allowed us to surface better results across our set of queries and even surface results where there weren’t any previously available due to the limitation of text based search.
While Hybrid approaches offer clear advantages in terms of relevance and semantic understanding, it’s once again important to note that these techniques often come with increased computational complexity and indexing costs. The trade-off between performance and relevance should be carefully evaluated based on the specific requirements of the search application.
In the next post, we’ll take our exploration of hybrid search one step further by exploring an additional ranking method called Reciprocal Rank Fusion.
If you want to dive deeper into the code, the notebooks for all of the work above can be found here: