从一到无穷大 #39: Weighing Vectorized Mode vs. Code Gen as the Criteria for Choosing an Execution Engine for a Specific Scenario
This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
This work (a blog post by 李兆龙, written and confirmed by 李兆龙) must credit the author when reposted.
Table of Contents
- Introduction
- MICRO-ARCHITECTURAL ANALYSIS
- Single-Threaded Performance
- Interpretation and Instruction Cache
- Vector Size
- DATA-PARALLEL EXECUTION (SIMD)
- Data-Parallel Selection
- Data-Parallel Hash Table Probing
- Compiler Auto-Vectorization
- INTRA-QUERY PARALLELIZATION
- Exchange vs. Morsel-Driven Parallelism
- Multi-Threaded Execution
- OTHER FACTORS
- OLTP and Multi-Language Support
- Compilation Time
- Profiling and Debuggability
- Adaptivity
- SUMMARY
Introduction
The paper uses five TPC-H queries to compare the strengths and weaknesses of the two popular modern execution-engine models under OLAP workloads. To isolate differences between systems in data structures, query optimization, and query processing algorithms, the authors did not use existing open-source projects directly; instead they reimplemented two prototypes that are identical in everything except the execution model: the Code Gen engine is called Typer, and the Vectorized Mode engine is called Tectorwise.
The details are covered below. What I care about most is this: when choosing an execution-engine technology, and setting engineering effort aside, what is actually the core of the decision? I think it comes down to a few key choices:
| Decision | Options |
|---|---|
| Workload | OLAP / OLTP |
| Execution Mode | Vectorized Mode / Code Gen / Hybrid (Vectorized Mode + Code Gen) |
| Query Parallelization | Exchange / Morsel-Driven |
| Data Stream | Push / Pull |
These considerations matter. For the vast majority of teams, building an execution engine in-house is, in terms of return on engineering effort, basically a dead end; there is no reason not to use an open-source engine. But if you lock yourself into a single engine when making the choice, changing circumstances may later turn that engine into a suboptimal option.
Even the mature engines in industry do not necessarily get all of this right. Take Velox: it implements Vectorized Mode with a very high level of engineering quality, but it does not mix in Code Gen, and its parallelization model is Exchange rather than Morsel-Driven; on the plus side, it implements both Push and Pull.
From this perspective, projects like Apache Gluten and Substrait have very broad prospects (even if there is no money to be made from them). The future paradigm will basically be:
- Convert the logical plan into a distributed execution plan.
- Convert the distributed execution plan into Substrait.
- Let each engine convert Substrait into its local plan; the mainstream execution engines already support this.
What is the obvious benefit of doing this? In a distributed query, the aggregation (coordinator) node is usually dominated by computation, and its input is small (it may not even fill a single Vectorized Mode batch), so Code Gen is the better fit there. The execution nodes, in contrast, sit directly on the data; even with various indexes the data volume is typically very large and full of Group by and hash join operations, which suits Vectorized Mode better. A rough sketch of such a routing policy follows.
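To make this division of labor concrete, below is a minimal C++ sketch (not tied to Substrait, Gluten, or any real engine's API) of how a planner might route plan fragments: computation-heavy coordinator fragments with little input go to a code-generating backend, while scan- and hash-heavy leaf fragments go to a vectorized backend. All names (`PlanFragment`, `chooseBackend`) and thresholds are hypothetical.

```cpp
#include <cstdint>
#include <initializer_list>
#include <iostream>
#include <string>

// Hypothetical description of a distributed plan fragment; field names and
// thresholds are illustrative only.
struct PlanFragment {
    std::string name;
    uint64_t    estimatedInputRows;  // optimizer cardinality estimate
    bool        isLeafScan;          // reads base data directly
    bool        hasHeavyHashOps;     // dominated by group-by / hash-join work
};

enum class Backend { CodeGen, Vectorized };

// Routing heuristic sketched from the reasoning above: coordinator-side
// fragments with tiny inputs favor code generation, while scan/hash-heavy
// leaf fragments favor a vectorized engine.
Backend chooseBackend(const PlanFragment& f, uint64_t vectorSize = 1000) {
    if (f.isLeafScan || f.hasHeavyHashOps) return Backend::Vectorized;
    if (f.estimatedInputRows < vectorSize) return Backend::CodeGen;  // may not even fill one batch
    return Backend::Vectorized;
}

int main() {
    PlanFragment coordinatorAgg{"final-aggregation", 200, false, false};
    PlanFragment leafScanJoin{"scan-hashjoin", 50'000'000, true, true};
    for (const auto& f : {coordinatorAgg, leafScanJoin})
        std::cout << f.name << " -> "
                  << (chooseBackend(f) == Backend::CodeGen ? "CodeGen" : "Vectorized")
                  << "\n";
}
```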
MICRO-ARCHITECTURAL ANALYSIS
Single-Threaded Performance
Instructions and cache misses:
- Table 1 shows some important CPU statistics, from which a number of observations can be made.
- First, Tectorwise executes significantly more instructions (up to 2.4×) and usually has more L1 data cache misses (up to 3.3×). Tectorwise breaks all operations into simple steps and must materialize intermediate results between these steps, which results in additional instructions and cache accesses.
- Typer, in contrast, can often keep intermediate results in CPU registers and thus perform the same operations with fewer instructions.
- In Tectorwise intermediate results must be materialized, which is similarly expensive as the computation itself. Thus, one key difference between the two models is that Typer is more efficient for computational queries that can hold intermediate results in CPU registers and have few cache misses.
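To illustrate the difference, here is a hedged sketch of a toy filter-and-sum pipeline written in both styles (my own illustration, not code from Typer or Tectorwise): the fused, data-centric loop keeps the running sum in a register, while the vectorized version runs two pre-compiled primitives per batch and has to materialize the selection vector in between.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// "Typer-style": one fused loop per pipeline; the filtered value is consumed
// immediately and the running sum never leaves a register.
int64_t fusedFilterSum(const std::vector<int64_t>& qty,
                       const std::vector<int64_t>& price) {
    int64_t sum = 0;
    for (size_t i = 0; i < qty.size(); ++i)
        if (qty[i] < 24)
            sum += price[i];
    return sum;
}

// "Tectorwise-style": each primitive does one simple step over a batch and
// materializes its result (here: a selection vector) in memory.
size_t selLessInt64(const int64_t* col, size_t n, int64_t constant, uint32_t* selOut) {
    size_t found = 0;
    for (size_t i = 0; i < n; ++i)
        if (col[i] < constant) selOut[found++] = static_cast<uint32_t>(i);  // materialization
    return found;
}

int64_t sumInt64Sel(const int64_t* col, const uint32_t* sel, size_t n) {
    int64_t sum = 0;
    for (size_t i = 0; i < n; ++i) sum += col[sel[i]];  // extra loads via the selection vector
    return sum;
}

int64_t vectorizedFilterSum(const std::vector<int64_t>& qty,
                            const std::vector<int64_t>& price,
                            size_t vectorSize = 1000) {
    std::vector<uint32_t> sel(vectorSize);
    int64_t sum = 0;
    for (size_t base = 0; base < qty.size(); base += vectorSize) {
        size_t n    = std::min(vectorSize, qty.size() - base);
        size_t hits = selLessInt64(qty.data() + base, n, 24, sel.data());
        sum += sumInt64Sel(price.data() + base, sel.data(), hits);
    }
    return sum;
}
```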
Memory stalls [6]:
- for Q3 and Q9, whose performance is determined by the efficiency of hash table probing, Tectorwise is faster than Typer (by 4% and 32%). This might be surprising given the fact that both engines use exactly the same hash table layout and therefore also have an almost identical number of last level cache (LLC) misses.
- As Figure 4 shows, Tectorwise’s join advantage increases up to 40% for larger data (and hash table) sizes. The reason is that vectorization is better at hiding cache miss latency, as observed from the memory stall counter that measures the number of cycles during which the CPU is stalled waiting for memory.
- This counter explains the performance difference. On the one hand, Tectorwise’s hash table probing code is only a simple loop. It executes only hash table probes thus the CPU’s out-of-order engine can speculate far ahead and generate many outstanding loads. These can even be executed out of order.
- On the other hand, Typer’s code has more complex loops. Each loop can contain code for a scan, selection, hash-table probe, aggregation and more. The out-of-order window of the CPU fills up more quickly with complex loops, so they generate fewer outstanding loads. In addition, every branch miss is more expensive in a complex loop, as more work performed under speculative execution is discarded and must be repeated on a miss. Overall, Tectorwise’s simpler loops enable better latency hiding.
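As an illustration of the "simple loop" argument, here is a hedged sketch of what a vectorized probe primitive can look like (the hash-table layout and names are simplified assumptions, not the engines' actual code): each iteration only masks a precomputed hash and loads one bucket head, and iterations are independent, so the out-of-order engine can keep many cache-missing loads in flight.

```cpp
#include <cstddef>
#include <cstdint>

// Simplified chaining hash table: an array of bucket-head pointers, each
// pointing to a chain of entries (layout is an illustrative assumption).
struct Entry { uint64_t key; Entry* next; /* payload omitted */ };

struct HashTable {
    Entry**  heads;   // 2^k bucket heads
    uint64_t mask;    // 2^k - 1
    Entry* head(uint64_t hash) const { return heads[hash & mask]; }
};

// Vectorized probe primitive: a tight loop that only masks a precomputed hash
// and issues one (likely cache-missing) load per key. Because iterations do
// not depend on each other, the CPU can have many outstanding loads at once.
size_t probeHeads(const HashTable& ht, const uint64_t* hashes, size_t n,
                  Entry** headsOut, uint32_t* selOut) {
    size_t found = 0;
    for (size_t i = 0; i < n; ++i) {
        Entry* e = ht.head(hashes[i]);
        if (e != nullptr) {             // non-empty bucket: remember it for the
            headsOut[found] = e;        // follow-up key-comparison primitive
            selOut[found++] = static_cast<uint32_t>(i);
        }
    }
    return found;
}
```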
Sensitivity regarding the hash function:
- After trying different hash functions, we settled on Murmur2 for Tectorwise, and a CRC-based hash function, which combines two 32-bit CRC results into a single 64-bit hash, for Typer.
- Murmur2 requires twice as many instructions as CRC hashing, but has higher throughput and is therefore slightly faster in Tectorwise, which separates hash computation from probing.
- For Typer, in contrast, the CRC hash function improves the performance up to 40% on larger scale factors, even though most time is spent waiting for cache misses. The lower latency and smaller number of instructions for CRC significantly improve the speculative, pipelined execution of consecutive loop iterations, thereby enabling more concurrent outstanding loads (a sketch of such a CRC-based hash follows after this list).
- As a note of caution, we remark that one may observe from Table 1 that Tectorwise generally executes more instructions per cycle (IPC) and deduce that Tectorwise performs better. However, this is not necessarily correct. While IPC is a measure of CPU utilization, having a higher IPC is not always better: As can be observed in Q1, Tectorwise’s IPC is 40% higher, but it is still 74% slower due to executing almost twice the number of instructions. This means that one has to be cautious when using IPC to compare database systems’ performance. It is a valid measure of the amount of free processing resources, but should not be used as the sole proxy for overall query processing performance.
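Returning to the hash functions above, here is a minimal sketch of a CRC-based hash in the spirit described for Typer: two 32-bit CRC results over the same 64-bit key, computed with different seeds, combined into a single 64-bit hash. The seeds and the final mixing step are my own assumptions; compile with SSE4.2 support (e.g. -msse4.2).

```cpp
#include <cstdint>
#include <nmmintrin.h>   // SSE4.2 CRC32 intrinsics

// Two 32-bit CRC results over the same key, started from different seeds,
// combined into one 64-bit hash. Seeds and final mix are illustrative only.
inline uint64_t crcHash64(uint64_t key) {
    uint32_t lo = static_cast<uint32_t>(_mm_crc32_u64(0x8648DBDBu, key));
    uint32_t hi = static_cast<uint32_t>(_mm_crc32_u64(0x2F88A3B5u, key));
    uint64_t h  = (static_cast<uint64_t>(hi) << 32) | lo;
    return h * 0x2545F4914F6CDD1DULL;   // cheap multiplicative mix to spread the bits
}
```

The point of the sketch is latency: the whole hash is a handful of low-latency instructions, which keeps consecutive, speculatively executed probe iterations short and lets more loads overlap.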
Summary:
- Both engines are efficient and fairly close in performance.
- Typer is more efficient for computational queries with few cache misses.
- Tectorwise is slightly better at hiding cache-miss latency.
Interpretation and Instruction Cache
- Systems based on Volcano-style iteration perform expensive virtual function calls and type dispatch for each processed tuple. This is a form of interpretation overhead as it does not contribute to the actual query processing work.
- Using a profiler, we determined that across our query set the interpreted part is less than 1.5% of the query runtime (measured at scale factor 10). Thus, the DBMS spends 98.5% of its time in primitives doing query processing work.
- As primitives know all involved types at compile time, we conclude that the extra instructions are not interpretation code that is concerned with interpretation decisions and virtual function calls. It is rather due to the load/store instructions for materializing primitive results into vectors.
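The following sketch shows why the interpreted share stays so small in the vectorized model (names and the tag-based dispatch are illustrative; real engines typically resolve primitives to function pointers when the plan is built): type dispatch happens once per vector of roughly a thousand values, and the primitive itself is fully typed code with no interpretation in its inner loop.

```cpp
#include <cstddef>
#include <cstdint>

// Fully typed primitive: the compiler knows all types, so the inner loop
// contains only the comparisons plus the stores into the selection vector.
template <typename T>
size_t selLessThan(const T* col, size_t n, T constant, uint32_t* selOut) {
    size_t found = 0;
    for (size_t i = 0; i < n; ++i)
        if (col[i] < constant) selOut[found++] = static_cast<uint32_t>(i);
    return found;
}

enum class TypeTag { Int32, Int64 };

// "Interpretation" in the vectorized model: one dispatch per vector, so its
// cost is amortized over ~1,000 tuples instead of being paid per tuple.
size_t selLessThanDispatch(TypeTag tag, const void* col, size_t n,
                           const void* constant, uint32_t* selOut) {
    switch (tag) {
        case TypeTag::Int32:
            return selLessThan(static_cast<const int32_t*>(col), n,
                               *static_cast<const int32_t*>(constant), selOut);
        case TypeTag::Int64:
            return selLessThan(static_cast<const int64_t*>(col), n,
                               *static_cast<const int64_t*>(constant), selOut);
    }
    return 0;
}
```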
Vector Size
The vector size is an important parameter for any vectorized engine. So far, our Tectorwise experiments used a value of 1,000 tuples, which is also the default in VectorWise. Figure 5 shows normalized query runtimes for vector sizes from 1 to the maximum (i.e., full materialization). We observe that small (<64) and large vector sizes (>64 K) decrease performance significantly. With a vector size of 1, Tectorwise is a Volcano-style interpreter with its large CPU overhead. Large vectors do not fit into the CPU caches and therefore cause cache misses. The other end of the spectrum is to process the query one column at a time; this approach is used in MonetDB [9]. Generally, a vector size of 1,000 seems to be a good setting for all queries. The only exception is Q3, which executes 15% faster using a vector size of 64K.
DATA-PARALLEL EXECUTION (SIMD)
- We found with AVX-512 it is often straightforward to translate scalar code to data-parallel code, and observed performance gains of up to 8.4× in micro-benchmarks.
- However, for the more complicated TPC-H queries, the performance gains are quite small (around 10% for join queries). Fundamentally, this is because most OLAP queries are bound by data access, which does not (yet) benefit much from SIMD, and not by computation, which is the strength of SIMD.
- Coming back to the comparison between data-centric compilation and vectorization, we therefore argue that SIMD does not shift the balance in favor of vectorization much.
Data-Parallel Selection
- Method: A vectorized selection primitive produces a selection vector containing the indexes of all matching tuples.
- Results for a best-case scenario, in which all consumed data are 32-bit integers, resident in the L1 cache, and the input is a contiguous vector, are shown in Figure 6a. The observed performance gain for this micro-benchmark is 8.4×.
- As Figure 6c shows, in a realistic query with multiple expensive selections like Q6, we only observe a speedup of 1.4×, even though almost 90% of the processing time is spent in SIMD primitives. Our experiments revealed two effects that account for this discrepancy: sparse data loading due to selection vectors, and cache misses due to varying stride. Sparse data loading occurs in all selection primitives except for the first one. From the second selection primitive on, all primitives receive a selection vector that determines the elements to consider for comparison. These elements must be gathered from noncontiguous memory locations.
- Figure 7 shows the interplay of selection performance and input sparsity on a 4 GB data set. Note that the performance drops for selectivities below 100%, while the scalar and SIMD variants are nearly equal when the selectivity is below 50%. We also show an estimate of how many cycles on average are spent resolving cache misses. We observe that most of the time is spent waiting for data. Thus the memory subsystem becomes the bottleneck of the selection operation and the positive effect of utilizing SIMD instructions disappears. In the selection cascade of Q6, only the first selection primitive benefits from SIMD and selects 43% of the tuples. This leaves all subsequent selections to operate in a selectivity range where the scalar variant is just as fast.
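For reference, here is a hedged sketch of the kind of AVX-512 selection primitive described above, for the best case of dense 32-bit input without an incoming selection vector: it compares 16 values per iteration and uses a masked compress store to emit the indexes of qualifying tuples. This illustrates the technique, not Tectorwise's actual primitive; it requires AVX-512F (e.g. -mavx512f) and C++20 for std::popcount.

```cpp
#include <bit>           // std::popcount (C++20)
#include <cstddef>
#include <cstdint>
#include <immintrin.h>   // AVX-512F intrinsics

// Produce a selection vector with the indexes of all elements < constant.
size_t selLessInt32Avx512(const int32_t* col, size_t n, int32_t constant,
                          uint32_t* selOut) {
    size_t found = 0;
    const __m512i cmp  = _mm512_set1_epi32(constant);
    const __m512i step = _mm512_set1_epi32(16);
    // Lane indexes 0..15, advanced by 16 every iteration.
    __m512i idx = _mm512_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7,
                                    8, 9, 10, 11, 12, 13, 14, 15);
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m512i   vals = _mm512_loadu_si512(col + i);
        __mmask16 hit  = _mm512_cmplt_epi32_mask(vals, cmp);
        // Store only the indexes of matching lanes, contiguously.
        _mm512_mask_compressstoreu_epi32(selOut + found, hit, idx);
        found += static_cast<size_t>(std::popcount(static_cast<unsigned>(hit)));
        idx = _mm512_add_epi32(idx, step);
    }
    for (; i < n; ++i)   // scalar tail
        if (col[i] < constant) selOut[found++] = static_cast<uint32_t>(i);
    return found;
}
```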
Data-Parallel Hash Table Probing
- In TPC-H, most of the query processing time is spent in hash joins. There are two opportunities to apply SIMD: computing the hash values, and the actual lookups in the hash table. For hashing we use Murmur2, which consists of arithmetic operations like integer shifts and multiplications that are available in AVX-512. We can also apply SIMD to lookups into hash tables by using gather, compress store, and masking.
- Figure 8(a) shows that for hashing alone a gain of 2.3× is possible.
- In Figure 8(b), we observe an improvement of 1.1× (in the best case). This is because the memory system of the test machine can perform at most two load operations per cycle, regardless of whether SIMD gather or scalar loads are used.
- Figure 8(c) shows that when employing gather and other SIMD instructions in the Tectorwise probe primitive, a best-case performance gain of 1.4× can be achieved. With a SIMD speedup of 2.3× for hashing and 1.4× for probing, one may expect an overall speedup in between.
- However, as is shown in Figure 8(d) the performance gains almost vanish for TPC-H join queries. This happens even though the majority of the time (55% and 65%) is spent in SIMD-optimized primitives.
- The reason for this behavior can be found in Figure 9. With a growing working set, gains from SIMD diminish and the execution costs are dominated by memory latency. SIMD is only beneficial when all data fits into the cache. We are not the first to observe this phenomenon: Polychroniou et al. [7] found this effect in their study of application of SIMD to database operators.
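Below is a hedged sketch of gather-based probing against a simplified chaining hash table (the layout, names, and use of a 64-bit selection vector are my own simplifications, not Tectorwise's code): eight bucket heads are fetched per gather, and compress stores keep only the positions whose bucket is non-empty.

```cpp
#include <bit>           // std::popcount (C++20)
#include <cstddef>
#include <cstdint>
#include <immintrin.h>   // AVX-512F intrinsics

// Same simplified chaining layout as in the earlier sketch.
struct Entry;
struct HashTable {
    Entry**  heads;   // 2^k bucket-head pointers
    uint64_t mask;    // 2^k - 1
};

// Gather 8 bucket heads per iteration; compress-store non-empty hits.
// Compile with -mavx512f; this illustrates the technique only.
size_t probeHeadsAvx512(const HashTable& ht, const uint64_t* hashes, size_t n,
                        Entry** headsOut, uint64_t* selOut) {
    size_t found = 0;
    const __m512i mask = _mm512_set1_epi64(static_cast<long long>(ht.mask));
    const __m512i zero = _mm512_setzero_si512();
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m512i h      = _mm512_loadu_si512(hashes + i);
        __m512i bucket = _mm512_and_si512(h, mask);
        // Fetch 8 bucket-head pointers in one gather (scale = 8 bytes).
        __m512i  heads = _mm512_i64gather_epi64(bucket, ht.heads, 8);
        __mmask8 hit   = _mm512_cmpneq_epi64_mask(heads, zero);
        __m512i  pos   = _mm512_add_epi64(
            _mm512_set1_epi64(static_cast<long long>(i)),
            _mm512_setr_epi64(0, 1, 2, 3, 4, 5, 6, 7));
        _mm512_mask_compressstoreu_epi64(headsOut + found, hit, heads);
        _mm512_mask_compressstoreu_epi64(selOut + found, hit, pos);
        found += static_cast<size_t>(std::popcount(static_cast<unsigned>(hit)));
    }
    for (; i < n; ++i) {  // scalar tail
        Entry* e = ht.heads[hashes[i] & ht.mask];
        if (e) { headsOut[found] = e; selOut[found++] = i; }
    }
    return found;
}
```

This matches the observation above: the gather still has to go through the same memory system that serves at most two loads per cycle, so once the table no longer fits in cache the gain is bounded by memory latency rather than instruction count.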
Compiler Auto-Vectorization
- We tested the GCC 7.2, Clang 5.0, and ICC 18 compilers. Of these, only ICC was able to auto-vectorize a fair amount of primitives (and only with AVX-512).
- Figure 10 shows how successful ICC was in relevant paths for query processing. Its vectorized variant reduces the observed number of instructions executed per tuple by between 20% to 60%. By inspecting traces of the executed code, we confirmed that automatic vectorization was applied to hashing, selection, and projection primitives. Hash table probing and aggregation, however, were not transformed.
- We also show a variant with automatic and manual SIMD application combined, which has a benefit for Q3 and Q9. Unfortunately, these automatic SIMD optimizations do not yield any significant improvements in query runtime.
- Automatic vectorization alone hardly creates any gains and even introduces cases where the optimized code becomes slower. This means that even though primitives can be auto-vectorized, this is not yet a fire-and-forget solution.
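Whether a primitive auto-vectorizes depends heavily on how it is written. The sketch below contrasts the two cases (my own illustration, using standard portable hints rather than whatever build flags the authors used): a straight-line projection/hashing-style loop that compilers can vectorize, and a pointer-chasing probe loop with data-dependent control flow that they typically cannot.

```cpp
#include <cstddef>
#include <cstdint>

// Auto-vectorization friendly: straight-line body, no data-dependent branches
// or stores, independent iterations, restrict-qualified pointers. With e.g.
// -O3 -mavx512f, compilers can turn this loop into SIMD code.
void mulAddInt64(const int64_t* __restrict a,
                 const int64_t* __restrict b,
                 const int64_t* __restrict c,
                 int64_t* __restrict out, size_t n) {
    for (size_t i = 0; i < n; ++i)
        out[i] = a[i] * b[i] + c[i];
}

// Probing-style primitive: the early exit and the pointer chase make the loop
// hard for compilers to vectorize, matching the observation that probing and
// aggregation were not transformed automatically.
struct Entry { uint64_t key; Entry* next; };

bool containsKey(Entry* const* heads, uint64_t mask, uint64_t hash, uint64_t key) {
    for (Entry* e = heads[hash & mask]; e != nullptr; e = e->next)
        if (e->key == key) return true;
    return false;
}
```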
INTRA-QUERY PARALLELIZATION
Given the decade-long trends of stagnating single-threaded performance and growing number of CPU cores—Intel is selling 28 cores (56 hyper-threads) on a single Xeon chip—any modern query engine must make good use of all available cores.
Exchange vs. Morsel-Driven Parallelism
- The parallelization framework is, however, orthogonal to the query processing model and we implemented morsel-driven parallelization in both Tectorwise and Typer, as it has been shown to scale better than exchange operators [9].
- Morsel-driven parallelism was developed for HyPer and can therefore be implemented quite straightforwardly in Typer: The table scan loop is replaced with a parallel loop and shared data structures like hash tables are appropriately synchronized similar to HyPer’s implementation.
- For Tectorwise, it is less obvious how to use morsel-driven parallelism. The runtime system of Tectorwise creates an operator tree and exclusive resources for every worker. So that workers can cooperate on one query, every operator can have shared state. For each operator, a single instance of shared state is created; all workers have access to it and use it to communicate. For example, the shared state for a hash join contains the hash table for the build side, and all workers insert tuples into it. In general, the shared state of each operator is used to share results and coordinate work distribution. Additionally, pipeline-breaking operators use a barrier to enforce a global order of sub-tasks. The hash join operator uses this barrier to enforce that all workers first consume the build side and insert results into a shared hash table; only after that can the probe phase of the join start. With shared state and a barrier, the Tectorwise implementation exhibits the same workload-balancing parallelization behavior as Typer.
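A hedged sketch of the scheme just described (C++20; the names are illustrative, and the shared hash table is replaced by an atomic counter to keep the example short): workers pull fixed-size morsels from a shared atomic cursor, the build pipeline populates shared state, and a barrier acts as the pipeline breaker before probing starts.

```cpp
#include <algorithm>
#include <atomic>
#include <barrier>       // C++20
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Shared per-pipeline state: a cursor that hands out morsels of ~100k tuples.
struct MorselSource {
    std::atomic<size_t> next{0};
    size_t total      = 0;
    size_t morselSize = 100'000;
    // Returns [begin, end) of the next morsel, or false when the input is done.
    bool nextMorsel(size_t& begin, size_t& end) {
        begin = next.fetch_add(morselSize);
        if (begin >= total) return false;
        end = std::min(begin + morselSize, total);
        return true;
    }
};

int main() {
    const size_t numWorkers = 4;
    std::vector<uint64_t> buildKeys(1'000'000, 42), probeKeys(4'000'000, 42);

    MorselSource buildSrc{.total = buildKeys.size()};
    MorselSource probeSrc{.total = probeKeys.size()};
    std::atomic<uint64_t> sharedBuildCount{0};   // stand-in for a shared hash table
    std::atomic<uint64_t> probedTuples{0};
    std::barrier<> sync(static_cast<std::ptrdiff_t>(numWorkers));  // pipeline breaker

    auto worker = [&] {
        size_t b, e;
        // Build phase: every worker inserts its morsels into the shared state.
        while (buildSrc.nextMorsel(b, e))
            sharedBuildCount.fetch_add(e - b, std::memory_order_relaxed);
        sync.arrive_and_wait();                  // all builds finish before probing
        // Probe phase: same workload-balancing morsel scheme on the probe side.
        while (probeSrc.nextMorsel(b, e))
            probedTuples.fetch_add(e - b, std::memory_order_relaxed);
    };

    std::vector<std::thread> threads;
    for (size_t i = 0; i < numWorkers; ++i) threads.emplace_back(worker);
    for (auto& t : threads) t.join();
}
```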
Multi-Threaded Execution
We executed our TPC-H workload on scale factor 100 (ca. 100 GB of data).
- Table 3 shows runtimes and speedups in comparison with single-threaded execution.
- Nevertheless, the “Ratio” column of Table 3, which is the quotient of the runtimes of both systems, reveals an interesting effect: For all but one query, the performance gap between the two systems becomes smaller when all 20 hyper-threads are used.
OTHER FACTORS
As a consequence of their architecture and code structure, compilation and vectorization have distinct qualities that are not directly related to OLAP performance:
- Compiled queries allow for fast OLTP stored procedures and seamless integration of different programming languages; on the other hand, code generation introduces an additional indirection.
- Vectorization offers very low query compile times, as primitives are precompiled; as a result of this structure, parts of a vectorized query can be swapped adaptively at runtime, and profiling is easier. On the other hand, vectorization comes with a set of constraints on the code, which can be complicated to handle.
OLTP and Multi-Language Support
- The vectorized execution model achieves efficiency when many vectors of values are processed, which is almost always the case in OLAP, but not in OLTP, where a query might only touch a single tuple. For OLTP workloads, vectorization has little benefit over traditional Volcano-style iteration.
- With compilation, in contrast, it is possible to compile all queries of a stored procedure into a single, efficient machine code fragment.
- This is a major benefit of compilation for OLTP and HTAP systems. Despite already having a modern vectorized engine (Apollo), the Microsoft SQL Server team felt compelled to additionally integrate the compilation-based engine Hekaton. Compilation can also be highly beneficial for integrating user-defined functions and multiple languages into the same execution environment.
Compilation Time
- A disadvantage of code generation is the risk of compilation time dominating execution time.
- This can be an issue in OLTP queries, though in transactional workloads it can be countered by relying on stored procedures, in which case code generation can be done ahead of time. However, compilation time can also become large if the generated code is large, because (optimizing) LLVM compile time is often super-linear in code size. OLAP queries that consist of many operators will generate large amounts of code, but even a small SQL query such as SELECT * FROM T can produce a lot of code if table T has thousands of columns, as each column leads to some code generation.
- Adaptive compilation, as used by HyPer (first interpreting the generated code and switching to optimized machine code only when a query runs long enough), largely obviates this downside of compilation, but it comes at the cost of additional system complexity. Spark falls back to interpreted tuple-at-a-time execution if a pipeline generates more than 8 KB of Java bytecode [11].
Profiling and Debuggability
- A practical advantage of vectorized execution is that detailed profiling is possible without slowing down queries, since getting clock cycle counts for each primitive adds only marginal overhead, as each call to the function works on a thousand values.
- For data-centric compilation, it is hard to separate the contributions of the individual relational operators to the final execution time of a pipeline, though it could be done using sample-based code profiling if the system can map generated code lines back to the relational operator in the query plan responsible for them. For this reason it is currently not possible in Spark SQL to know the individual contributions of relational operators to execution time, since the system can only measure performance on a per-pipeline basis.
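A hedged sketch of how cheap per-primitive profiling is in the vectorized model (the `PrimitiveStats` bookkeeping and `timedCall` wrapper are illustrative; `__rdtsc` is the GCC/Clang x86 intrinsic): each primitive call is bracketed by two timestamp reads, and since one call covers a whole vector, the overhead is marginal while cycles are attributed exactly to primitives.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <utility>
#include <x86intrin.h>   // __rdtsc (GCC/Clang on x86)

// Per-primitive counters; one entry per primitive in the query plan.
struct PrimitiveStats {
    const char* name;
    uint64_t    cycles = 0;
    uint64_t    calls  = 0;
};

// Wrap a primitive invocation with two timestamp reads. Because each call
// processes a whole vector (~1,000 values), the overhead is marginal while
// yielding an exact per-primitive cycle breakdown.
template <typename Fn, typename... Args>
auto timedCall(PrimitiveStats& stats, Fn&& primitive, Args&&... args) {
    uint64_t start = __rdtsc();
    auto result = primitive(std::forward<Args>(args)...);
    stats.cycles += __rdtsc() - start;
    stats.calls  += 1;
    return result;
}

// Example primitive to attribute time to.
size_t selLessInt32(const int32_t* col, size_t n, int32_t c, uint32_t* sel) {
    size_t found = 0;
    for (size_t i = 0; i < n; ++i)
        if (col[i] < c) sel[found++] = static_cast<uint32_t>(i);
    return found;
}

int main() {
    PrimitiveStats stats{"sel_less_int32"};
    int32_t  col[1000];
    uint32_t sel[1000];
    for (int i = 0; i < 1000; ++i) col[i] = i;
    size_t hits = timedCall(stats, selLessInt32, col, size_t{1000}, 500, sel);
    std::printf("%s: %llu cycles over %llu call(s), %zu hits\n", stats.name,
                static_cast<unsigned long long>(stats.cycles),
                static_cast<unsigned long long>(stats.calls), hits);
}
```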
Adaptivity
Adaptive query execution, for instance re-ordering the evaluation of conjunctive filter predicates or even joins, is a technique for improving robustness that can compensate for (inevitable) estimation errors in query optimization.
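A hedged sketch of one such adaptive technique in the vectorized model (the statistics and reordering policy are illustrative, and per-predicate cost is ignored for brevity): track the observed selectivity of each predicate over recent vectors and periodically reorder the selection cascade so the most selective predicate runs first.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// One conjunctive filter in the cascade: a pre-compiled selection primitive
// (bound to its input column via closure) plus runtime statistics.
// Primitive signature: (selection vector in, count in, selection vector out) -> count out.
struct Predicate {
    std::function<size_t(const uint32_t*, size_t, uint32_t*)> prim;
    uint64_t in  = 0;   // tuples seen
    uint64_t out = 0;   // tuples that passed
    double selectivity() const { return in ? double(out) / double(in) : 1.0; }
};

// Run the cascade on one vector; every `reorderEvery` vectors, sort the
// predicates so that the lowest observed pass rate runs first.
size_t runCascade(std::vector<Predicate>& preds, size_t vectorSize,
                  std::vector<uint32_t>& selA, std::vector<uint32_t>& selB,
                  uint64_t vectorNo, uint64_t reorderEvery = 64) {
    selA.resize(vectorSize);
    selB.resize(vectorSize);
    for (size_t i = 0; i < vectorSize; ++i) selA[i] = static_cast<uint32_t>(i);
    uint32_t* in  = selA.data();
    uint32_t* out = selB.data();
    size_t n = vectorSize;
    for (auto& p : preds) {
        size_t m = p.prim(in, n, out);   // apply primitive, observe its selectivity
        p.in += n; p.out += m;
        std::swap(in, out);
        n = m;
        if (n == 0) break;
    }
    if (vectorNo % reorderEvery == 0)    // mid-flight swap of the primitive order
        std::sort(preds.begin(), preds.end(),
                  [](const Predicate& a, const Predicate& b) {
                      return a.selectivity() < b.selectivity();
                  });
    return n;   // number of qualifying tuples in this vector
}
```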
SUMMARY
OLAP:
- Computation: Data-centric compiled code is better at computationally-intensive queries, as it is able to keep data in registers and thus needs to execute fewer instructions.
- Parallel data access: Vectorized execution is slightly better at generating parallel cache misses, and thus has some advantage in memory-bound queries that access large hash tables for aggregation or join.
- SIMD: has lately been one of the prime mechanisms employed by hardware architects to increase CPU performance. In theory, vectorized query execution is in a better position to profit from that trend. In practice, we find that the benefits are small as most operations are dominated by memory access cost.
- Parallelization: We find that with morsel-driven parallelism, both vectorized and compilation-based engines can scale very well on multi-core CPUs.
- Hardware platforms: We performed all experiments on Intel Skylake, Intel Knights Landing, and AMD Ryzen. The effects listed above occur on all of the machines and neither vectorization nor data-centric compilation dominates on any hardware platform.
Other factors:
- Compilation-based engines have advantages for OLTP, as they can create fast stored procedures, and for language support, as they can seamlessly integrate code written in different languages.
- Vectorized engines have advantages in terms of compile time, as primitives are pre-compiled; profiling, as runtime can be attributed to primitives; and adaptivity, as execution primitives can be swapped mid-flight.
References:
- Everything You Always Wanted to Know About Compiled and Vectorized Queries But Were Afraid to Ask, VLDB 2018
- 数据库内核杂谈 (十五): 执行器之 code generation vs vectorized execution
- Everything You Always Wanted to Know About Compiled and Vectorized Queries But Were Afraid to Ask 笔记
- 列存数据库 Code Generation & Vectorized Mode
- 深度解读|Spark 中 CodeGen 与向量化技术的研究
- Memory Stall Analysis
- Rethinking SIMD Vectorization for In-Memory Databases, SIGMOD 2015
- Breaking the memory wall in MonetDB
- Morsel-Driven Parallelism: 一种NUMA感知的并行Query Execution框架
- Multi-core parallelization of vectorized query execution Page 17
- 从一到无穷大 #37 Databricks Photon:打响 Spark Native Engine 第一枪
- The Snowflake Elastic Data Warehouse 2016
- Query Engines: Push vs. Pull