Performance Comparison of Float vs Byte vs Binary Vectors on OpenSearch

I have been working on an application where, given an input string, the objective is to recommend an output string that is similar to the input string, for some notion of similarity. A machine learning model, in this case a SentenceTransformers model, is taught this notion of similarity by showing it many examples of input-output pairs. The model's weights are then used to encode the part to be recommended as a vector, which is written out into a search index, in this case OpenSearch. At inference time, the same model is used to encode the input string, and OpenSearch's vector search is used to find and recommend the closest string to the input in vector space.

My dataset consisted of about 10,000 input-output pairs. I split it into a 90% training set (approx 9,000 pairs) and a 10% test set (approx 1,000 pairs). I chose the sentence-transformers/all-MiniLM-L6-v2 model, a good pre-trained general-purpose model for vector search in its own right, that maps sentences into a dense 384-dimensional space. Sentence Transformer models are trained using Contrastive Learning, which means I need both positive and negative pairs to train one, but my training set consists of all positive pairs by definition. Rather than try to generate negative pairs on my own, I used the built-in MultipleNegativesRankingLoss (MNR), which takes positive pairs and generates negative pairs by sampling from within the batch. I trained for 10 epochs, using the AdamW optimizer (the default) with learning rate 2e-5 (also the default), saving the best checkpoint (based on similarity over the validation set).
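For reference, the fine-tuning setup looks roughly like the sketch below. This is not my exact code: the batch size, example pairs and output path are placeholders, but the loss and the fit() call mirror the setup described above.

from torch.utils.data import DataLoader

from sentence_transformers import InputExample, SentenceTransformer, losses

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# each training example is a positive (input, output) pair; MNR loss
# treats the outputs of the other pairs in a batch as negatives
train_pairs = [
    ("some input string", "its expected output string"),
    ("another input string", "another expected output string"),
]
train_examples = [InputExample(texts=[inp, out]) for inp, out in train_pairs]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)
train_loss = losses.MultipleNegativesRankingLoss(model)

# AdamW with learning rate 2e-5 is the default optimizer configuration
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=10,
    output_path="models/all-MiniLM-L6-v2-finetuned",
)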

To evaluate, I generated the top-100 nearest neighbors for each input of the test set pairs, and then computed Recall@k and MRR (Mean Reciprocal Rank)@k for k = 1, 3, 5, 10, 20, 50, 100. For recall, I would score a match @k as successful if the output of the pair appeared within the top k nearest neighbors returned from the vector search. This score is then averaged across all 1,000 test set pairs for each value of k. MRR is similar, except that the recall score for each test set pair and k is divided by the (1-based) position that matched (and is 0 if no match).
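In code, the per-pair scores work out to something like the sketch below, where neighbors is the ranked list of output strings returned by the vector search (the averaging across test pairs is omitted).

def recall_at_k(expected: str, neighbors: list[str], k: int) -> float:
    # 1 if the expected output appears in the top k neighbors, else 0
    return 1.0 if expected in neighbors[:k] else 0.0

def mrr_at_k(expected: str, neighbors: list[str], k: int) -> float:
    # reciprocal of the 1-based rank of the first match, 0 if no match
    for rank, neighbor in enumerate(neighbors[:k], start=1):
        if neighbor == expected:
            return 1.0 / rank
    return 0.0

# both scores are averaged across test pairs for k in (1, 3, 5, 10, 20, 50, 100)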

The baseline for the experiment was computed using an index of encodings of the input part created using the stock all-MiniLM-L6-v2 model.

I had also recently read Jo Kristian Bergum's blog posts Billion Scale Vector Search with Vespa, parts one and two, and more recently Matryoshka Binary Vectors: Slash Vector Search Costs with Vespa on the Vespa blog, where he compares the performance of vectors of different storage types, among other things. Vespa allows for many different storage types, but I am using OpenSearch, which offers support for float (float32), byte (int8) and binary (bool) storage types. I was curious to see (a) if I could replace my float based vectors with these, and if so, (b) how their performance would compare. That is what this post is about.

The index mappings for the float, byte and binary vector fields are as follows. These need to be set during index creation; a sketch of creating an index with one of these mappings follows the three snippets below.

Float Vector

{
    "type": "knn_vector",
    "dimension": 384,
    "method": {
        "name": "hnsw",
        "engine": "lucene",
        "space_type": "cosinesimil",
        "parameters": {
            "ef_construction": 128,
            "m": 24
        }
    }
}

Byte Vector

{
    "type": "knn_vector",
    "dimension": 384,
    "data_type": "byte",
    "method": {
        "name": "hnsw",
        "engine": "lucene",
        "space_type": "cosinesimil",
        "parameters": {
            "ef_construction": 128,
            "m": 24
        }
    }
}

Binary Vector

{
    "type": "binary",
    "doc_values": "true"
}
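As promised, here is a minimal sketch of index creation with the opensearch-py client, assuming a local cluster; the index name (pairs-float) and field names are placeholders, not the ones I actually used.

from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

client.indices.create(
    index="pairs-float",
    body={
        "mappings": {
            "properties": {
                "output_text": {"type": "text"},
                "output_vector": {
                    "type": "knn_vector",
                    "dimension": 384,
                    "method": {
                        "name": "hnsw",
                        "engine": "lucene",
                        "space_type": "cosinesimil",
                        "parameters": {"ef_construction": 128, "m": 24},
                    },
                },
            }
        }
    },
)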

To generate the vectors, I used the fine-tuned version of the all-MiniLM-L6-v2 model as my encoder, and post-processed the float32 vectors returned from the encoder to int8 and binary using the following functions.

import binascii

import numpy as np

def convert_to_bytes(arr: np.ndarray):
    # scale float components (in [-1, 1] for normalized embeddings) to
    # signed 8-bit integers; the clip guards against int8 overflow when
    # a component is exactly 1.0
    return np.clip(np.floor(arr * 128), -128, 127).astype(np.int8)

def convert_to_bits(arr: np.ndarray):
    # one bit per dimension (1 if the component is positive, else 0),
    # packed 8 to a byte and hex-encoded, one string per input row
    bits = np.packbits(np.where(arr > 0, 1, 0)).astype(np.int8)
    arr_b = bits.reshape(arr.shape[0], -1)
    hex_bits = []
    for row in arr_b:
        hex_bits.append(str(binascii.hexlify(row), "utf-8"))
    return hex_bits
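Hypothetical usage, assuming model is the fine-tuned SentenceTransformer from earlier and that embeddings are normalized to unit length so components fall in [-1, 1]:

# encode a batch of strings, then derive the byte and binary
# representations destined for the respective indexes
test_inputs = ["some input string", "another input string"]
vectors = model.encode(test_inputs, normalize_embeddings=True)
byte_vectors = convert_to_bytes(vectors)
binary_vectors = convert_to_bits(vectors)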

At the end of this indexing process, I had three separate indexes of roughly 10,000 records each: one with one part of the pair encoded as a float32 vector, another encoded as a byte (int8) vector, and the last as a binary vector. To give you a rough idea of the storage requirements (rough because each index contains fields other than the vectors), the sizes reported by /_cat/indices are shown below.

Vector Storage Type    Index Size
float                  184.9 MB
byte                    39.6 MB
binary                  15.6 MB
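(As an aside, a quick way to pull these numbers with the Python client, reusing the client and placeholder index naming from the earlier sketch:)

# list index sizes; "pairs-*" is a placeholder index name pattern
print(client.cat.indices(index="pairs-*", h="index,store.size", v=True))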

On the query side, I use the following Script Score queries as described in the Byte-quantized vectors in OpenSearch blog post and the Exact k-NN with scoring script documentation pages. The queries are all scoring scripts as shown below; a sketch of running them from Python follows the two snippets. The queries for float and byte vectors are identical, the only difference being that the float vector is quantized down to int8 in the second case.

Float and Byte Vector

{
    "script_score": {
        "query": {
            "match_all": {}
        },
        "script": {
            "source": "knn_score",
            "lang": "knn",
            "params": {
                "field": "{field_name}",
                "query_value": "{float_or_byte_vector}",
                "space_type": "cosinesimil"
            }
        }
    }
}

Binary Vector

{
    "script_score": {
        "query": {
            "match_all": {}
        },
        "script": {
            "source": "knn_score",
            "lang": "knn",
            "params": {
                "field": "{field_name}",
                "query_value": "{binary_vector}",
                "space_type": "hammingbit"
            }
        }
    }
}
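A minimal sketch of running these from Python, reusing the client from before; the index and field names remain placeholders, and the templated values above are filled in through the params dictionary.

def knn_query(field_name: str, query_value, space_type: str) -> dict:
    # exact k-NN scoring script over all documents in the index,
    # returning the top-100 neighbors
    return {
        "size": 100,
        "query": {
            "script_score": {
                "query": {"match_all": {}},
                "script": {
                    "source": "knn_score",
                    "lang": "knn",
                    "params": {
                        "field": field_name,
                        "query_value": query_value,
                        "space_type": space_type,
                    },
                },
            }
        },
    }

# float and byte indexes use cosinesimil; the binary index uses hammingbit
response = client.search(
    index="pairs-float",
    body=knn_query("output_vector", vectors[0].tolist(), "cosinesimil"),
)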

The charts below show the Recall and MRR @k for various values of k as described above.

The first thing to note is that fine-tuning helps, or at least it helped a lot in this case, probably because the notion of similarity I was working with was more nuanced than plain cosine similarity. With respect to float vectors versus byte vectors, float vectors have a slight edge, as you can see from the table below (if you look really hard you can also see the float vector (orange line) and byte vector (green line) almost overlaid on each other in the charts). While binary vectors do not perform as well, they still perform better than the baseline, and they are much faster, so they can be useful as the first stage of a two-stage retrieval pipeline.

[Chart: Mean Reciprocal Rank @k]

[Chart: Recall @k]

In terms of response time, binary vectors are the fastest, followed by byte vectors, followed by float vectors. I measured the response time for each of the 1,000 test set queries to extract 100 nearest neighbors from OpenSearch, then calculated the mean and standard deviation for each vector storage type. Here are the results, followed by a sketch of how such timings can be taken.

Vector Storage Type    Mean Response Time    Standard Deviation
float                  0.382 s               0.097 s
byte                   0.231 s               0.050 s
binary                 0.176 s               0.048 s
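The timing itself is straightforward; a sketch, assuming queries are issued sequentially through the same client and knn_query helper as above:

import statistics
import time

def timed_search(index: str, body: dict) -> float:
    # wall-clock time for a single search call, in seconds
    start = time.perf_counter()
    client.search(index=index, body=body)
    return time.perf_counter() - start

# one measurement per test query, then mean and standard deviation
elapsed = [
    timed_search("pairs-float",
                 knn_query("output_vector", v.tolist(), "cosinesimil"))
    for v in vectors
]
print(statistics.mean(elapsed), statistics.stdev(elapsed))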

I initially brought this up in the Relevance Slack channel because I had made some mistakes in my evaluation and was getting results that did not agree with common sense. Thanks to all my "non-work colleagues" on the channel who helped me out by validating my observations and being sounding boards, which eventually helped me find the mistakes in my analysis. In any case, I figured that the final results may be useful to the broader community of vector search enthusiasts and professionals, so I am sharing this. Hope you find it useful.
