当前位置：首页 > article >正文

向量数据库FAISS之一：官方简单教程

article 2024/11/20 14:27:57

1.安装

1.conda安装

# CPU-only version --> Linux (x86_64 and arm64), OSX (arm64 only), and Windows (x86_64)
$ conda install -c pytorch faiss-cpu=1.8.0

# GPU(+CPU) version --> Linux (x86_64 only) for CUDA 11.4 and 12.1
$ conda install -c pytorch -c nvidia faiss-gpu=1.8.0

# GPU(+CPU) version with NVIDIA RAFT --> Linux (x86_64 only) for CUDA 11.8 and 12.1.
$ conda install -c pytorch -c nvidia -c rapidsai -c conda-forge faiss-gpu-raft=1.8.0

# CPU version
$ conda install -c conda-forge faiss-cpu

# GPU version
$ conda install -c conda-forge faiss-gpu

2.总览

什么是相似度搜索
- 给出一组向量d维 ${x_1, …,x_n \}$ ，Fassi 在 RAM 中建立数据结构。
- 结构构建完成后，如果给定一个新的 d 维度向量 x 时，它就能高效地执行运算：
  
  $i = argmin_i||x - x_i||$
  - || · || 是 L2 距离
- Faiss 术语中，数据结构是一个索引，一个具有 add 方法去增加 $x_i$ 向量的对象
- Note : $x_i$ 的维度固定
- 计算 argmin 就是对索引进行 search 操作。
Fsiss 还可以
- 还可以返回第 k 相似的邻居
- 一次性查找几个向量，而不是一次一个
- 以精度黄速度。以 10% 的精度使用快 10% 的方法或少 10 倍的内存
- 使用最大化内积搜索 $argmax_i<x, x_i>$ 代替最小化欧式搜索
- 存索引到磁盘而不是 RAM
- 检索二进制向量而不是浮点向量

2.开始使用

1.获取数据

Faiss 处理固定维度 d 的向量集合，通常为几十到几百。

这些集合可以存储在矩阵中。

我们假设行优先存储，即向量编号 i 的第 j 个分量存储在矩阵的第 i 行、j 列中。

Faiss 仅使用 32 位浮点矩阵。

需要两个指标

xb 给数据库：包含所有的必须被索引化的、需要去查找的向量

xq 给查询向量：需要去查找最接近的邻居。

在下面的示例中，我们将使用在 d=64 维度上形成均匀分布的向量。只是为了好玩，我们沿着第一个维度添加取决于向量索引的小平移。

import numpy as np
d = 64
nb = 100_000
nq = 10_000
np.random.seed(1234)
xb = np.random.random((nb, d)).astype("float32")
xb[:, 0] += np.arange(nb) / 1000
xq = np.random.random((nq, d)).astype("float32")
xq[:, 0] += np.arange(nq) / 1000

print(xb.shape)
print(xq.shape)
'''
(100000, 64)
(10000, 64)
'''

2.建造索引并加入向量

Faiss 是围绕 Index 对象构建的。

它封装了数据库向量集，并可选择对它们进行预处理以提高搜索效率。

索引有很多种类型，我们将使用最简单的版本，仅对它们执行强力的 L2 距离搜索：IndexFlatL2。

当 index 被建立和训练后，两个操作能在索引上执行：add 和 search
- 为了添加元素给 index，需要在 xb 上调用 add。也可以展示 index 的两个状态变量：
  - is_trained: 指向是否需要训练
  - ntotal: 索引的向量的总数

import faiss
index = faiss.IndexFlatL2(d)
print(index.is_trained)
index.add(xb)
print(index.ntotal)
'''
True
100000
'''

3.Searching

最基本的搜索操作能在 index 上执行，通过 knn 搜索实现。

k = 4
D, I = index.search(xb[:5], k)
print(I)
print(D)
'''
[[  0 393 363  78]
 [  1 555 277 364]
 [  2 304 101  13]
 [  3 173  18 182]
 [  4 288 370 531]]
[[0.        7.1751738 7.20763   7.2511625]
 [0.        6.323565  6.684581  6.799946 ]
 [0.        5.7964087 6.391736  7.2815123]
 [0.        7.2779055 7.527987  7.6628466]
 [0.        6.7638035 7.2951202 7.3688145]]
 '''

D：距离
I：索引
- I[0] = [0, 393, 363, 78] 表示查询点 0 的最近邻点的索引：
  •最近的是自身，索引为 0。
  •第二个最近邻是索引为 393 的点。
  •第三个最近邻是索引为 363 的点。
  •第四个最近邻是索引为 78 的点。

3.更快的搜索

为了加快搜索速度，可以将数据集分段。我们在 d 维空间中定义 Voronoi 单元，每个数据库向量落在其中一个单元中。
在搜索时，仅将查询 x 所在单元中包含的数据库向量 y 以及一些相邻的向量与查询向量进行比较。
这是通过 IndexIVFFlat 索引实现的。这种索引需要一个训练阶段，可以在任何与数据库向量分布相同的向量集合上进行。
IndexIVFFlat 还需要另一个索引，即量化器，它将向量分配给 Voronoi 单元。每个单元都由一个质心定义，找到向量所属的 Voronoi 单元就是在质心集中找到该向量的最近邻居。
这是另一个索引的任务，该索引通常是 IndexFlatL2。
search 有两个参数
1. nlist：cells 数量
2. nprobe：搜索时要访问的 cells 数量(nlist 之外)

nlist = 100
k = 4
quantizer = faiss.IndexFlatL2(d)  # the other index
index = faiss.IndexIVFFlat(quantizer, d, nlist)
assert not index.is_trained
index.train(xb)
assert index.is_trained

index.add(xb)                  # add may be a bit slower as well
D, I = index.search(xq, k)     # actual search
print(I[-5:])                  # neighbors of the 5 last queries
index.nprobe = 10              # default nprobe is 1, try a few more
D, I = index.search(xq, k)
print(I[-5:])                  # neighbors of the 5 last queries

4.更低的内存占用

1.如何所见内存呢？

正如我们看见的，IndexFlatL2 和 IndexIVFFlat 都存储全部的向量。

为了扩展到非常大的数据集，Faiss 提供了基于乘积量化的有损压缩来压缩存储的向量的变体。

向量仍然存储在 Voronoi 单元中，但它们的大小减小到可配置的字节数 m（d 必须是 m 的倍数）。

nlist = 100
m = 8                             # number of subquantizers
k = 4
quantizer = faiss.IndexFlatL2(d)  # this remains the same
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, 8)
                                    # 8 specifies that each sub-vector is encoded as 8 bits
index.train(xb)
index.add(xb)
D, I = index.search(xb[:5], k) # sanity check
print(I)
print(D)
index.nprobe = 10              # make comparable with experiment above
D, I = index.search(xq, k)     # search
print(I[-5:])

5.运行在 GPU 上

使用单个GPU

res = faiss.StandardGpuResources()  # use a single GPU

把 cpu 上的资源放到 GPU 上

# build a flat (CPU) index
index_flat = faiss.IndexFlatL2(d)
# make it into a gpu index
gpu_index_flat = faiss.index_cpu_to_gpu(res, 0, index_flat)

用法和在CPU上一样

gpu_index_flat.add(xb)         # add vectors to the index
print(gpu_index_flat.ntotal)

k = 4                          # we want to see 4 nearest neighbors
D, I = gpu_index_flat.search(xq, k)  # actual search
print(I[:5])                   # neighbors of the 5 first queries
print(I[-5:])                  # neighbors of the 5 last queries

多GPU情况

ngpus = faiss.get_num_gpus()

print("number of GPUs:", ngpus)

cpu_index = faiss.IndexFlatL2(d)

gpu_index = faiss.index_cpu_to_all_gpus(  # build the index
    cpu_index
)

gpu_index.add(xb)              # add vectors to the index
print(gpu_index.ntotal)

k = 4                          # we want to see 4 nearest neighbors
D, I = gpu_index.search(xq, k) # actual search
print(I[:5])                   # neighbors of the 5 first queries
print(I[-5:])                  # neighbors of the 5 last queries