arif-arman/origami-sort

This repository contains the implementation of the components and algorithms of the Origami framework described in the Origami paper. The framework was developed by the IRL @ Texas A&M University, published in PVLDB Volume 15, and presented at VLDB 2022.

Mergesort is a popular algorithm for sorting real-world workloads as it is immune to data skewness, suitable for parallelization using vectorized intrinsics, and relatively simple to multithread. We introduce Origami, an in-memory mergesort framework that is optimized for scalar execution as well as all current SIMD (single-instruction multiple-data) CPU architectures. For each vector-extension set (e.g., SSE, AVX2, AVX-512), we present an in-register sorter for small sequences that is up to 8X faster than prior methods and a branchless merger that achieves up to a 1.5X speed-up over the naive merge. In addition, we introduce a cache-residing quad-merge tree to avoid bottlenecking on memory bandwidth and a parallel partitioning scheme to maximize thread-level concurrency. We develop an end-to-end sort with these components and produce a highly utilized mergesort pipeline by reducing the synchronization overhead between threads. Single-threaded Origami performs up to 2X faster than the closest competitor and achieves a nearly perfect speed-up in multi-core environments.
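The branchless-merge idea mentioned above can be illustrated with a minimal scalar sketch (this is our own illustration, not the repo's SIMD kernel; the function name and structure are assumptions):

```cpp
#include <cstddef>
#include <cstdint>

// Merge two sorted runs [a, a+na) and [b, b+nb) into out without a
// data-dependent branch in the hot loop: the comparison result is used
// as an index increment, which typically compiles to a conditional move
// instead of a branch the predictor would mispredict on random keys.
void branchless_merge(const uint32_t* a, size_t na,
                      const uint32_t* b, size_t nb, uint32_t* out) {
    size_t i = 0, j = 0, k = 0;
    while (i < na && j < nb) {
        uint32_t va = a[i], vb = b[j];
        int take_b = vb < va;            // 0 or 1, no branch
        out[k++] = take_b ? vb : va;
        i += 1 - take_b;                 // advance exactly one input
        j += take_b;
    }
    while (i < na) out[k++] = a[i++];    // drain leftover run
    while (j < nb) out[k++] = b[j++];
}
```

The vectorized mergers in the paper apply the same principle with SIMD min/max networks over registers rather than one key at a time.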

A quick preview of Origami sort performance is given in the tables below. For a deeper analysis and discussion, please refer to the paper. All benchmark code was compiled with Visual Studio 2019 and run on Windows Server 2016 on an 8-core Intel Skylake-X (i7-7820X) CPU with a fixed 4.7 GHz clock on all cores, 1 MB L2 cache, and 32 GB of DDR4-3200 quad-channel RAM. When AVX-512 was used, the BIOS defaulted to a 400 MHz lower clock (i.e., 4.3 GHz), known as the AVX offset, which many motherboard manufacturers implement to keep core temperatures under control. We enable the maximum level of optimization and use the appropriate flags (e.g., /arch:AVX512) to ensure the compiler uses all available SIMD registers.

The following table compares Origami against prior work on the corresponding SIMD architectures, showing sort speed for different chunk sizes within a 1 GB array of 32-bit random keys.

Single Thread Chunked Sort Speed (M keys/s, 32-bit keys)
| Chunk size | AA-sort (SSE) | Origami (SSE) | ASPaS-sort (AVX2) | Peloton-sort (AVX2) | Origami (AVX2) | Watkins-sort (AVX-512) | Xiaochen-sort (AVX-512) | Yin-sort (AVX-512) | Origami (AVX-512) |
|---|---|---|---|---|---|---|---|---|---|
| 128 K | 63 | 176 | 53 | 139 | 228 | 40 | 198 | 140 | 295 |
| 256 K | 61 | 147 | 47 | 128 | 210 | 33 | 184 | 130 | 269 |
| 512 K | 59 | 138 | 44 | 120 | 195 | 30 | 172 | 113 | 249 |
| 1 M | 57 | 131 | 41 | 109 | 183 | 28 | 160 | 102 | 232 |
| 2 M | 55 | 124 | 39 | 92 | 174 | 25 | 150 | 95 | 216 |
| 4 M | 54 | 118 | 37 | 81 | 168 | 23 | 140 | 88 | 203 |
| 8 M | 52 | 112 | 35 | 77 | 162 | 21 | 131 | 83 | 191 |
| 16 M | 50 | 107 | 33 | 73 | 153 | 20 | 122 | 78 | 181 |
| 32 M | 48 | 102 | 32 | 70 | 145 | 19 | 115 | 72 | 172 |
| 64 M | 47 | 98 | 30 | 67 | 138 | 18 | 109 | 69 | 163 |
| 128 M | 45 | 95 | 29 | 65 | 132 | 17 | 103 | 66 | 156 |
| 256 M | 44 | 91 | 28 | 63 | 126 | 17 | 97 | 64 | 149 |


Origami is distribution-insensitive, i.e., it retains almost constant speed on all inputs. The next table shows this characteristic on 1 GB of data for the following distributions:

D1: Uniformly random, generated by Mersenne Twister
D2: All same keys
D3: Sorted
D4: Reverse sorted
D5: Almost sorted, where every 7th key is set to KEY_MAX
D6: Pareto keys, generated as min(ceil(beta * (1/(1-u) - 1)), 10000), where beta = 7 and u ~ uniform[0, 1]
D7: Bursts of same keys, where the length of each subsequence is drawn from D6 and key from D1
D8: Random shuffle, generated by randomly permuting D7
D9: Fibonacci, wrapped around when it overflows the number of items to sort
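As a rough illustration, D1 and D6 above could be generated as follows (a sketch based on the formulas in the list; the repo's actual generators and function names may differ):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <random>
#include <vector>

// D1: uniformly random 32-bit keys from a Mersenne Twister.
std::vector<uint32_t> gen_d1(size_t n, uint32_t seed = 0) {
    std::mt19937 g(seed);
    std::vector<uint32_t> v(n);
    for (auto& x : v) x = g();
    return v;
}

// D6: Pareto keys, min(ceil(beta * (1/(1-u) - 1)), 10000) with beta = 7
// and u drawn uniformly from [0, 1).
std::vector<uint32_t> gen_d6(size_t n, uint32_t seed = 0) {
    std::mt19937 g(seed);
    std::uniform_real_distribution<double> uni(0.0, 1.0);
    const double beta = 7.0;
    std::vector<uint32_t> v(n);
    for (auto& x : v) {
        double u = uni(g);
        double key = std::ceil(beta * (1.0 / (1.0 - u) - 1.0));
        x = static_cast<uint32_t>(std::min(key, 10000.0));  // cap at 10000
    }
    return v;
}
```

D7-D9 can then be derived from these, e.g., D7 draws run lengths from the D6 generator and key values from D1.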

Single Thread Sort Speed for Different Distributions (M keys/s)
| ISA | Key size | D1 | D2 | D3 | D4 | D5 | D6 | D7 | D8 | D9 |
|---|---|---|---|---|---|---|---|---|---|---|
| Scalar | 32 | 47 | 52 | 52 | 52 | 47 | 47 | 49 | 47 | 47 |
| Scalar | 64 | 43 | 47 | 47 | 47 | 43 | 44 | 45 | 43 | 44 |
| Scalar | 64+64 | 25 | 27 | 27 | 27 | 25 | 25 | 25 | 25 | 25 |
| SSE | 32 | 91 | 90 | 90 | 90 | 91 | 91 | 92 | 91 | 90 |
| SSE | 64 | 50 | 49 | 49 | 50 | 50 | 50 | 50 | 50 | 49 |
| SSE | 64+64 | 35 | 34 | 34 | 34 | 35 | 35 | 35 | 35 | 34 |
| AVX2 | 32 | 126 | 126 | 125 | 124 | 126 | 126 | 127 | 126 | 125 |
| AVX2 | 64 | 48 | 48 | 48 | 48 | 48 | 48 | 48 | 48 | 47 |
| AVX2 | 64+64 | 34 | 34 | 34 | 34 | 34 | 34 | 34 | 34 | 34 |
| AVX-512 | 32 | 149 | 145 | 145 | 145 | 148 | 149 | 149 | 149 | 146 |
| AVX-512 | 64 | 65 | 63 | 64 | 63 | 64 | 64 | 65 | 64 | 63 |
| AVX-512 | 64+64 | 27 | 26 | 26 | 26 | 26 | 27 | 26 | 26 | 26 |


Origami achieves a near-perfect speed-up in multi-core environments. The next table shows its scalability on 1 GB of random data:

Parallel Sort Speed on Skylake-X (M/s)
| ISA | Key size | Speed 1C | Speed 2C | Speed 4C | Speed 8C | Speed-up 2C | Speed-up 4C | Speed-up 8C |
|---|---|---|---|---|---|---|---|---|
| Scalar | 32 | 47 | 74 | 147 | 282 | 1.6 | 3.1 | 6.0 |
| Scalar | 64 | 43 | 75 | 149 | 273 | 1.7 | 3.5 | 6.4 |
| Scalar | 64+64 | 25 | 44 | 84 | 162 | 1.8 | 3.4 | 6.5 |
| SSE | 32 | 91 | 179 | 352 | 687 | 2.0 | 3.9 | 7.6 |
| SSE | 64 | 50 | 94 | 185 | 361 | 1.9 | 3.7 | 7.2 |
| SSE | 64+64 | 35 | 70 | 139 | 260 | 2.0 | 4.0 | 7.4 |
| AVX2 | 32 | 126 | 248 | 495 | 950 | 2.0 | 3.9 | 7.5 |
| AVX2 | 64 | 48 | 95 | 189 | 369 | 2.0 | 3.9 | 7.7 |
| AVX2 | 64+64 | 34 | 70 | 137 | 254 | 2.1 | 4.0 | 7.5 |
| AVX-512 | 32 | 149 | 286 | 565 | 1062 | 1.9 | 3.8 | 7.1 |
| AVX-512 | 64 | 65 | 122 | 242 | 462 | 1.9 | 3.7 | 7.1 |
| AVX-512 | 64+64 | 27 | 53 | 105 | 197 | 2.0 | 3.9 | 7.3 |

Recommended Setup:

  • OS: Windows Server 2019 or Server 2016
  • Compiler: MSVC++ 14.29 (Visual Studio 2019 16.11)
  1. Make sure to check that the project is set for x64 Release.
  2. Set C++ standard to C++17 or later under Project > Properties > Configuration Properties > General.
  3. Set register type (Scalar, SSE, AVX2 or AVX-512) and sort key type (uint32, int64 or <int64, int64> key-value pair) in config.h.
  4. Set appropriate compiler flags (e.g., /arch:AVX2, /arch:AVX512 etc.) under Project > Properties > Configuration Properties > C/C++ > Command Line.
  5. Make sure no Windows update from 03/21 or later is installed, as these updates make the AVX2 and AVX-512 bmerge up to 20% slower.
  6. Update parameters in config.h if necessary (details below). Current tuned parameters are for the Skylake-X mentioned above.
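As a purely hypothetical illustration of the kind of choices steps 3 and 6 refer to, a config.h configuration might look roughly like the fragment below. Every name here except _P1_NREG is invented for illustration; check the repo's actual config.h for the real macros and types.

```cpp
#include <cstdint>
#include <utility>

// Hypothetical sketch of config.h choices (names are illustrative only):
#define ORIGAMI_ARCH_AVX2                    // register type: Scalar, SSE, AVX2, or AVX-512
using Key = uint32_t;                        // sort key type: uint32 or int64, or
// using Item = std::pair<int64_t, int64_t>; // <int64, int64> key-value pairs

// Tuned parameter example (see the Parameters list below):
#define _P1_NREG 16                          // registers available on this ISA
```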

Parameters:

  1. _P1_NREG: Number of registers available, typically 16 or 32
  2. _P1_SWITCH: Switch point from mcmerge to mrmerge, get this from bench_sort_phases:phase1_matrix_merge_test()
  3. _P1_N: Switch point from in register sort to bmerge, get this from bench_sort_phases:phase2_switch_point_test()
  4. _P2_MERGE_UNROLL: In-cache bmerge unroll, get this from bench_bmerge:bmerge_test()
  5. _P2_MERGE_NREG_1x: Number of registers per stream in bmerge no unroll
  6. _P2_MERGE_NREG_2x: Number of registers per stream in bmerge 2X unroll
  7. _P2_MERGE_NREG_3x: Number of registers per stream in bmerge 3X unroll
  8. _P3_MERGE_UNROLL: Out-of-cache bmerge unroll, get this from bench_bmerge:bmerge_test()
  9. _P3_MERGE_NREG_1x: Number of registers per stream in bmerge no unroll
  10. _P3_MERGE_NREG_2x: Number of registers per stream in bmerge 2X unroll
  11. _P3_MERGE_NREG_3x: Number of registers per stream in bmerge 3X unroll
  12. _MTREE_NREG: Number of registers per stream while merging within the mtree, get this from bench_mtree:mtree_single_thread_test()
  13. _MT_L1_BUFF_N: Buffer size at the internal node of each 4-way node in mtree
  14. _MT_L2_BUFF_N: Buffer size at the root of each 4-way node in mtree
  15. _MIN_K: Minimum way of merge to avoid memory bandwidth bottleneck, get this from bench_mtree:mtree_multi_thread_test()

Run:

Once all parameters are set, the Origami sort API can be used as shown in bench_sort:sort_bench().
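A minimal harness for reproducing the M keys/s metric used in the tables above could look like this (a sketch: std::sort stands in for the Origami entry point, whose actual signature is shown in bench_sort:sort_bench()):

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <random>
#include <vector>

// Measures throughput in M keys/s for any sorter callable taking a
// (begin, end) pair of uint32_t pointers, on n random 32-bit keys.
template <typename Sorter>
double mkeys_per_sec(size_t n, Sorter sort_fn) {
    std::mt19937 g(42);
    std::vector<uint32_t> keys(n);
    for (auto& k : keys) k = g();                 // D1-style random input
    auto t0 = std::chrono::steady_clock::now();
    sort_fn(keys.data(), keys.data() + n);
    auto t1 = std::chrono::steady_clock::now();
    double secs = std::chrono::duration<double>(t1 - t0).count();
    return n / secs / 1e6;                        // million keys per second
}
```

For a real measurement, pass a lambda that calls the Origami sort instead of std::sort and use a 1 GB input (n = 1 << 28 for 32-bit keys) as in the tables.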

This project is licensed under the GPLv3 license - see the LICENSE file for details.

Arif Arman, Dmitri Loguinov

About

Implementation of Origami: A High-Performance Mergesort Framework
