当前位置：首页 > article >正文

在海量数据中精准定位：BloomFilter的工作原理与实战指南

article 2025/3/31 20:32:14

1.基本概念

布隆过滤器（Bloom Filter）是1970年由 Burton Horward Bloom 提出的一种非常节省空间的概率型数据结构，运行速度快，占用内存小，但有一定误判率且无法删除元素。它实际上是一个很长的位数组（bitmap）和一系列随机哈希函数组成，主要用于判断一个元素是否在一个集合中。

适用于需要高效判断大量元素是否存在、以及允许一定false positive rate(假阳率)存在的业务场景：

解决Redis缓存穿透问题；
黑名单过滤（垃圾邮件地址、手机号、IP地址、域名等等）；
解决推荐过的数据不再推荐（如新闻、视频推荐等）；
部分数据库内置布隆过滤器以判断数据是否存在，从而减少数据库的IO请求，比如HBase。

2.基本原理

1.数据结构

布隆过滤器是由一个固定长度m的位数组和k个哈希函数组成的数据结构，其空间复杂度为O(m)。

位数组

- 初始时，数组的每位均为0；
- 存储元素后，用k个值为1的位点（不是唯一）标识某个元素是否存在；
- 位数组是布隆过滤器节省内存的核心所在。申请一个100w个元素的位数组只占用1000000bit / 8 = 125000Byte = 125000/1024 KB ≈ 122KB的空间。

哈希函数：用于将输入元素映射位数组的n个点位，以后续判断该元素是否不存在/可能存在。

实际运用中，我们可以通过指定预期插入元素个数（expectedInsertions）和误判率（fpp）来初始化一个布隆过滤器，例如Guava中创建一个布隆过滤器

/**
 * Creates a {@link BloomFilter} with the expected number of insertions and expected false
 * positive probability.
 *
 * <p>Note that overflowing a {@code BloomFilter} with significantly more elements than specified,
 * will result in its saturation, and a sharp deterioration of its false positive probability.
 *
 * <p>The constructed {@code BloomFilter} will be serializable if the provided {@code Funnel<T>}
 * is.
 *
 * <p>It is recommended that the funnel be implemented as a Java enum. This has the benefit of
 * ensuring proper serialization and deserialization, which is important since {@link #equals}
 * also relies on object identity of funnels.
 *
 * @param funnel the funnel of T's that the constructed {@code BloomFilter} will use
 * @param expectedInsertions the number of expected insertions to the constructed {@code
 *     BloomFilter}; must be positive
 * @param fpp the desired false positive probability (must be positive and less than 1.0)
 * @return a {@code BloomFilter}
 */
public static <T extends @Nullable Object> BloomFilter<T> create(
    Funnel<? super T> funnel, int expectedInsertions, double fpp) {
  return create(funnel, (long) expectedInsertions, fpp);
}

Google Guava BloomFilter的位数组又称位向量，是自定义了一个BitArray类来实现的，其本质上是一个Long类型数组；

@VisibleForTesting
static <T extends @Nullable Object> BloomFilter<T> create(
    Funnel<? super T> funnel, long expectedInsertions, double fpp, Strategy strategy) {
  checkNotNull(funnel);
  checkArgument(
      expectedInsertions >= 0, "Expected insertions (%s) must be >= 0", expectedInsertions);
  checkArgument(fpp > 0.0, "False positive probability (%s) must be > 0.0", fpp);
  checkArgument(fpp < 1.0, "False positive probability (%s) must be < 1.0", fpp);
  checkNotNull(strategy);

  if (expectedInsertions == 0) {
    expectedInsertions = 1;
  }
  /*
   * TODO(user): Put a warning in the javadoc about tiny fpp values, since the resulting size
   * is proportional to -log(p), but there is not much of a point after all, e.g.
   * optimalM(1000, 0.0000000000000001) = 76680 which is less than 10kb. Who cares!
   */
  long numBits = optimalNumOfBits(expectedInsertions, fpp);
  int numHashFunctions = optimalNumOfHashFunctions(fpp);
  try {
      // 核心生成逻辑
    return new BloomFilter<>(new LockFreeBitArray(numBits), numHashFunctions, funnel, strategy);
  } catch (IllegalArgumentException e) {
    throw new IllegalArgumentException("Could not create BloomFilter of " + numBits + " bits", e);
  }
}

LockFreeBitArray(long bits) {
  checkArgument(bits > 0, "data length is zero!");
  // Avoid delegating to this(long[]), since AtomicLongArray(long[]) will clone its input and
  // thus double memory usage.
  //对于长度为m的位向量来说，对应的long数组的长度应为m/64向上取整。
  this.data =
      new AtomicLongArray(Ints.checkedCast(LongMath.divide(bits, 64, RoundingMode.CEILING)));
  this.bitCount = LongAddables.create();
}

2. 插入元素
布隆过滤器插入元素的时间复杂度为o(k)。当一个元素e1插入布隆过滤器时，会进行如下操作：

基于k个哈希函数，计算出k个元素对应的哈希值：h1,h2....hk，其中hi=hashi(e1)；
根据得到的哈希值，在位数组中把对应下标的值置为1，即bitmap[hi]=1;

接着插入一个元素e2，可能会有部分插入位点产生冲突：

Google Guava 插入元素相关源码
值得关注的点是，Google Guava BloomFilter中k个哈希函数的生成遵循如下公式：

上述公式中gi(x)为第i（0 < i < k）个哈希函数，初始时需要定义已知的哈希函数h1(x)和h2(x)，求得k个哈希函数的计算过程如下：

//准备阶段 h1 = hash1(input), h2 = hash2(input) 
// 求出k个hash值 g0(x) = h1 
第0个hash函数求出的hash值 g1(x) = h1+h2+1 
第1个hash函数求出的hash值 g2(x) = h1+2*h2+4 
第2个hash函数求出的hash值 ... gk-1(x) = h1+(k-1)*h2+(k-1)^2 
第k-1个hash函数求出的hash值

遵循上述计算过程，Google Guava BloomFilter生成k个哈希函数的步骤如下。
1.首先根据murmur3_128这个哈希计算输入元素的64位哈希值，将其分为两段：后32位为hash1、前32位为hash2；
2.基于公式计算combinedHash值，此处省去了加式中的第三项；
3.取combinedHash对位数组长度的mod，以获得真正插入的index。

public <T> boolean put(
        T object, Funnel<? super T> funnel, int numHashFunctions, LockFreeBitArray bits) {
      long bitSize = bits.bitSize();
      long hash64 = Hashing.murmur3_128().hashObject(object, funnel).asLong();
      int hash1 = (int) hash64;
      int hash2 = (int) (hash64 >>> 32);

      boolean bitsChanged = false;
      for (int i = 1; i <= numHashFunctions; i++) {
        int combinedHash = hash1 + (i * hash2);
        // Flip all the bits if it's negative (guaranteed positive number)
        if (combinedHash < 0) {
          combinedHash = ~combinedHash;
        }
        bitsChanged |= bits.set(combinedHash % bitSize);
      }
      return bitsChanged;
    }

// 位数组bits将bitIndex位设置为1
boolean set(long bitIndex) {
      if (get(bitIndex)) {
        return false;
      }

      //取出bitIndex在位数组中的实际索引值
      int longIndex = (int) (bitIndex >>> 6);
      // 与原位数组取并集获得最终结果
      long mask = 1L << bitIndex;

      long oldValue;
      long newValue;
      do {
        // oldValue 0011...1000...1010
        // mask     0000...0100...0000
        //          ------------------
        // newValue 0011...1100...1010    
        oldValue = data.get(longIndex);
        newValue = oldValue | mask;
        if (oldValue == newValue) {
          return false;
        }
        // 多线程下CAS操作
      } while (!data.compareAndSet(longIndex, oldValue, newValue));

      // We turned the bit on, so increment bitCount.
      bitCount.increment();
      return true;
    }

3 查找元素
布隆过滤器查找元素的时间复杂度为O(k)O(k)。查找给定元素eiei是否存在于布隆过滤器中时，会进行如下操作：
1.基于k个哈希函数，计算出k个元素对应的哈希值：h1,h2....hk，其中 hi=hashi(ei)；
2.判断位数组对应位置的值是否为均等于1，即bitmap[hi]=1。
查找结果存在以下两种情况：

存在某一位点的值不为1，此时元素ei一定不存在于布隆过滤器中。
计算出所有位点的值均为1，此时元素ei可能存在于布隆过滤器中。

可能多个元素的点位组合后将ei的点位全部占据导致误判。存储复用、哈希冲突都会导致误判。所以随着元素的增多其误判率应该会不断升高，直到趋近100%。若恰好存在ei计算出的所有哈希值和e1相等的情况，则ei就会被误判为存在于布隆过滤器中，而降低布隆过滤器误判率的方法无非为两种：

增大位数组的长度；
增加哈希函数个数或采用冲突更小的哈希函数。

Google Guava 查找元素相关源码：

public <T> boolean mightContain(
  T object, Funnel<? super T> funnel, int numHashFunctions,
  LockFreeBitArray bits) {

  long bitSize = bits.bitSize();
  byte[] bytes = Hashing.murmur3_128().hashObject(object, funnel).getBytesInternal();
  long hash1 = lowerEight(bytes);
  long hash2 = upperEight(bytes);

  long combinedHash = hash1;
  for (int i = 0; i < numHashFunctions; i++) {
    // Make the combined hash positive and indexable
    if (!bits.get((combinedHash & Long.MAX_VALUE) % bitSize)) {
      return false;
    }
    combinedHash += hash2;
  }
  return true;
}

// 位数组bits判断bitIndex位的值是否为1
// 1. bitIndex >>> 6：取出bitIndex处对应的所有64位元素；
// 2. & 1L << bitIndex：获取bitIndex处对应的值
// 0011...1100...1010
// 0000...0100...0000
//         1
boolean get(long bitIndex) {
  return (data.get((int) (bitIndex >>> 6)) & (1L << bitIndex)) != 0;
}

4.删除元素
普通的布隆过滤器无法进行元素删除。
道理很简单，当位数组的部分位点被不同元素所复用时，若删除其中一个元素，其所有映射位点均被置位0（包括被复用的位点），这会导致复用位点的元素在查找时也会被判断为不存在。
例如，若我们删除元素e1，位点0、4、8均被置为0，再查找元素e2时，由于位点4值为0，e2会被判断为不存在，尽管我们并没有删除它。

5.布隆过滤器各项参数间的关系
布隆过滤器存在如下参数：

存储元素个数 n
误判率 p
位数组长度 m
哈希函数个数 k

在实际使用中，一般提前设定预期的n和p，来确定最佳的位数组长度m和哈希函数个数k，m和k的计算公式如下：

1.m=−(nlnp/(ln2)2)
2.k=(m/n)ln2

可以看到Guava BloomFilter中m和k的取值也是根据上述公式计算的：

// 计算最佳位数组长度m
static long optimalNumOfBits(long n, double p) {
  if (p == 0) {
    p = Double.MIN_VALUE;
  }
  return (long) (-n * Math.log(p) / (Math.log(2) * Math.log(2)));
}

// 计算最佳哈希函数个数k
static int optimalNumOfHashFunctions(long n, long m) {
  // (m / n) * log(2), but avoid truncation due to division!
  return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
}

假设布隆过滤器需要存储1亿个元素，且要求误判率为亿分之一，可以计算出所需要的位数组长度和哈希函数个数，可以看到存储如此大量的元素、且误判率很低的情况下，布隆过滤器也只占用了457MB内存：

计算布隆过滤器相关参数的在线网站:

https://hur.st/bloomfilter/?n=100000000&p=1.0E-7&m=&k=

3.如何删除元素

普通的布隆过滤器由于位数组的点位仅能用0/1来表示，因此删除会导致部分复用点位的信息被清空，从而使该点位的其他存储元素收到影响。

因此，布隆过滤器要实现删除元素的功能，需要将位数组的每个点位都改造成一个计数器（counter），从而诞生了可以删除元素的计数布隆过滤器（Counting Bloom Filter，CBF）。

1.基本原理

CBF将基本布隆过滤器位数组的每一位改造成一个计数器（Counter），每个计数器本身相当于一个位数组，来表示该点位被占用次数，一般来说计数器取4位就够用了。

插入元素：CBF插入元素时，通过哈希函数映射到每个位点的计数器均加1。

查找元素：CBF查找元素时，判断对应点位计数器的取值：

若所有点位计数器的值均大于0，则元素可能存在。
若存在计数器的值等于0，则元素一定不存在。

删除元素：CBF删除元素时，通过哈希函数映射到每个位点的计数器均减去1。

4.布隆过滤器使用

1.Google Guava BloomFilter

Google Guava包提供的布隆过滤器适用于简单过滤场景下的单实例应用。

引入pom依赖:

<dependency>
    <groupId>com.google.guava</groupId>
    <artifactId>guava</artifactId>
    <version>33.4.0-jre</version> <!-- 使用最新稳定版本 -->
</dependency>

package com.ds.data.bloomfilter;

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

import java.nio.charset.StandardCharsets;

/**
 * @author: xxx
 * @date:2025/3/8 下午4:46
 * @desc:
 */
public class HBaseBloom {

    public static void main(String[] args) {

        // 创建布隆过滤器（预期插入1千万，误判率0.000001%）
        BloomFilter<CharSequence> bloomFilter = BloomFilter.create(Funnels.stringFunnel(StandardCharsets.UTF_8),
                10000000,
                0.00000001
        );

        // 模拟写入HBase数据
        for(int i=0; i<10000000; i++){
            String rowKey = "user_"+i;
            bloomFilter.put(rowKey);
        }
        // 查询验证
        System.out.println(bloomFilter.mightContain("user_999"));  // true
        System.out.println(bloomFilter.mightContain("user_1000001")); // false
        int count = 0;
        for (int i = 0; i < 20000000; i++) {
            // mightContain方法判断数据是否存在
            if (bloomFilter.mightContain("user_"+ i)) {
                //误判
                count++;
            }
        }
        System.out.println("总误判数:" + count);

    }
}

2.TairBloom

TairBloom是Tair自带的BloomFilter，适用于多实例下的分布式应用。其作为一种可扩展布隆过滤器（Scalable Bloom Filter，SBF）的实现，具有动态扩容的能力，同时保证false positive rate不变。

引入pom依赖:

<dependency>
  <groupId>com.aliyun.tair</groupId>
  <artifactId>tairjedis-mc-sdk</artifactId>
  <version>0.0.1-SNAPSHOT</version>
</dependency>

@Service
@Slf4j
public class TairBloomFilterServiceImpl implements TairBloomFilterService {
    @Autowired
    private JedisCluster jedisCluster;

    private TairBloomCluster bloomCluster;

    @PostConstruct
    public void init() {
        bloomCluster = new TairBloomCluster(jedisCluster);
    }

  /**
  * 初始化一个 TairBloom
  **/
    @Override
    public Boolean bfCreate(BloomParameter bloomParameter) {
        log.info("bfCreate.bloomParameter:{}", JSON.toJSONString(bloomParameter));

        try {
            String bloomKey = bloomParameter.getBloomKey();
            // 如果指定的 Key 已经创建过，则不能重复创建
            Boolean exists = jedisCluster.exists(bloomKey);
            if (exists) {
                log.info("bfCreate.exists, key:{}, exists:{}", bloomKey, exists);
                return false;
            }
            // 创建一个指定容量和容错率的TairBloom
            String result = bloomCluster.bfreserve(bloomKey, bloomParameter.getInitCapacity(), bloomParameter.getErrorRate());
            if (!SUCCESS.equals(result)) {
                log.info("bfCreate.fail, key:{}, result:{}", bloomKey, result);
                return false;
            }
            log.info("bfCreate.success, key:{}, result:{}", bloomKey, result);
            return true;
        } catch (Exception e) {
            log.error("bfCreate.error, bloomParameter:{}", JSON.toJSONString(bloomParameter), e);
        }

        return false;
    }

  /**
  * 删除指定key的 TairBloom
  **/
    @Override
    public Boolean keyDelete(String key) {
        log.info("keyDelete.key:{}", key);

        try {
            if (null == key || key.isEmpty()) {
                return false;
            }

            Boolean exists = jedisCluster.exists(key);
            if (!exists) {
                log.info("keyDelete, key:{}, exists:{}", key, exists);
                return false;
            }

            Long result = jedisCluster.del(key);
            log.info("keyDelete.result:{}, key:{}", result, key);
            if(result > 0) {
                return true;
            }
        } catch (Exception e) {
            log.error("keyExists.error, key:{}", key, e);
        }

        return false;
    }

  /**
  * 判断指定key的 TairBloom 是否存在
  **/
    @Override
    public Boolean keyExists(String key) {
        log.info("keyExists.key:{}", key);

        try {
            return jedisCluster.exists(key);
        } catch (Exception e) {
            log.error("keyExists.error, key:{}", key, e);
        }

        return false;
    }

  /**
  * TairBloom插入元素
  **/
    @Override
    public Boolean bfAdd(String key, Long item) {
        log.info("bfAdd.key:{}, item:{}", key, item);

        try {
            if (null == key || key.isEmpty()) {
                return false;
            }
            Boolean exists = jedisCluster.exists(key);
            if (!exists) {
                return false;
            }
            Boolean result = bloomCluster.bfadd(key, item.toString());
            log.info("bfAdd.key:{}, item:{}, result:{}, ", key, item, result);
            if (!result) {
                return false;
            }
            return true;
        } catch (Exception e) {
            log.error("bfAdd.error, key:{}, item:{}", key, item, e);
        }

        return false;
    }

  /**
  * TairBloom查找元素
  **/
    @Override
    public Boolean bfExists(String key, Long item) {
        log.info("bfAdd.key:{}, item:{}", key, item);

        try {
            if (null == key || key.isEmpty()) {
                return false;
            }
            Boolean exists = jedisCluster.exists(key);
            if (!exists) {
                return false;
            }
            Boolean result = bloomCluster.bfexists(key, item.toString());
            log.info("bfexists.key:{}, item:{}, result:{}, ", key, item, result);
            if (!result) {
                return false;
            }
            return true;
        } catch (Exception e) {
            log.error("bfexists.error, key:{}, item:{}", key, item, e);
        }

        return false;
    }
}

5.布隆过滤器在Hbase中的应用

Bloom过滤器在HBase中应用，比如HBase的读操作中，每个HFile都有一个Bloom过滤器，用来在读取时快速判断某个行键是否可能存在于该HFile中，避免不必要的磁盘IO。

在HBase的存储架构中，Bloom Filter主要优化两种场景：

1. StoreFile级别过滤

通过以下配置优化读取性能：

<!-- HBase表配置示例 -->
<HColumnDescriptor>
  <NAME>user_data</NAME>
  <BLOOMFILTER>ROW</BLOOMFILTER> <!-- ROW/ROWCOL模式 -->
  <COMPRESSION>SNAPPY</COMPRESSION>
</HColumnDescriptor>

ROW模式：适用点查为主的场景，针对行键（RowKey）建立过滤器
ROWCOL模式：适用列级查询，针对行键+列限定符建立过滤器

特别提醒：全表扫描场景禁用Bloom Filter工作流程：

客户端发起Get请求
RegionServer先检查MemStore
若MemStore未命中，检查Bloom Filter：

- 过滤器返回"不存在" → 直接跳过该StoreFile
- 过滤器返回"可能存在" → 继续扫描HFile

2. 布隆过滤器索引

在HBase的存储架构中，布隆过滤器作为二级索引存在，核心在于通过概率判断快速过滤无关数据块。

HFile内部结构：

HFile
├── Data Blocks (实际数据存储)
├── Meta Blocks (元数据)
├── Bloom Block (布隆过滤器数据)
└── Trailer (索引指针)
 
→ 通过布隆过滤器快速定位数据块

工作流程：

客户端发起Get(rowkey)请求
RegionServer首先检查MemStore
若MemStore未命中，遍历HFile时：

- 读取HFile的Trailer获取Bloom Block位置
- 加载Bloom Filter到内存（LRU缓存）
- 通过哈希计算判断rowkey是否可能存在于该HFile
- 若Bloom返回false，跳过该HFile的扫描

// 创建表时指定布隆过滤器类型
HTableDescriptor tableDesc = new HTableDescriptor(TableName.valueOf("user_profile"));
HColumnDescriptor cfDesc = new HColumnDescriptor("cf");
// 设置布隆过滤器为ROWCOL类型
cfDesc.setBloomFilterType(BloomType.ROWCOL);
tableDesc.addFamily(cfDesc);
admin.createTable(tableDesc);


// 底层实现关键代码, HBase的Bloom Filter写入逻辑核心片段：HFile.Writer内部实现
public void append(Cell cell) {
    if (bloomFilter != null) {
        byte[] row = CellUtil.cloneRow(cell);
        bloomFilter.add(row);  // ROW模式只添加行键
        if (bloomType == BloomType.ROWCOL) {
            byte[] qualifier = CellUtil.cloneQualifier(cell);
            Bytes.putBytes(row, 0, qualifier, 0, qualifier.length);
            bloomFilter.add(row);
        }
    }

}
// 写入实际数据
public class HBaseWriter {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("user_profile"))) {

            Put put = new Put(Bytes.toBytes("user_1001"));
            put.addColumn(Bytes.toBytes("cf"),
                         Bytes.toBytes("tags"),
                         Bytes.toBytes("sports,music"));

            // 写入时自动更新布隆过滤器
            table.put(put);
         }
    }
}

写入过程数据流：