Update: I have included the results for when PCID is disabled, for comparison, as a worst-case scenario.
After learning about Meltdown and Spectre, I waited patiently to get a fix from my OS vendor. However, there were several reports of performance impact due to the kernel mitigation- for example, on the PostgreSQL developers mailing list there were reports of up to 23% throughput loss; Red Hat engineers reported a regression range of 1-20%, identifying OLTP systems as the worst-affected type of workload. As the impact will be highly dependent on hardware and workload, I decided to do some tests myself for the use cases I need.
My setup
It is similar to that of my previous tests:
Hardware - desktop grade, no Xeon or proper RAID:
- Intel(R) Core(TM) i7-4790K CPU @ 4.0GHz (x86_64 Quad-core with hyperthreading) with PCID support (disabling pcid with “nopcid” kernel command line will also be tested)
- 32 GB of RAM
- Single, desktop-grade, Samsung SSD 850 PRO 512GB
OS and configuration:
- Debian GNU/Linux 9.3 “Stretch”, comparing kernels:
- 4.9.0-4-amd64 #1 SMP Debian 4.9.65-3+deb9u1 (no mitigation)
- 4.9.0-5-amd64 #1 SMP Debian 4.9.65-3+deb9u2 (latest kernel with security updates backported, including PTI enabled according to the security announcement; see the quick checks after this list)
- datadir formatted as xfs, mounted with noatime option, all on top of LVM
- MariaDB Server 10.1.30 compiled from source, queried locally through unix socket
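To double-check the relevant bits on a given host- whether the CPU advertises PCID and whether the patched kernel actually enables page table isolation- something like the following can be used (a rough sketch; the exact dmesg wording varies between kernel versions and backports, and the sysfs vulnerabilities files only exist on newer kernels):

# does the CPU advertise PCID? (flags line of /proc/cpuinfo)
grep -qw pcid /proc/cpuinfo && echo "pcid supported" || echo "pcid not supported"
# is kernel page table isolation (PTI/KPTI) active? message wording may differ per backport
dmesg | grep -i isolation
# on kernels that already expose the sysfs interface (not all backports do):
cat /sys/devices/system/cpu/vulnerabilities/meltdown 2>/dev/null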
The tests performed:
- A single-thread write test with LOAD DATA (a quick sketch follows this list)
- A read-only sysbench with 8 and 64 threads
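For the first test, a minimal sketch of the kind of statement used is shown here (the table name, file path and CSV options are illustrative placeholders, not the exact ones from the test):

LOAD DATA INFILE '/path/to/osm_nodes.csv'
INTO TABLE nodes
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n';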
The results
LOAD DATA (single thread)
We have been measuring LOAD DATA performance of a single OpenStreetMap table (CSV file) in several previous tests, as we had detected a regression in some MySQL versions under single-thread write load. I believe it is an interesting place to start. I tested both the default configuration and another one closer to the WMF production configuration:
| Setup | Load time | rows/s |
|---|---|---|
| Unpatched Kernel, default configuration | 229.4±1 s | 203754 |
| Patched Kernel, default configuration | 227.8±2.5 s | 205185 |
| Patched Kernel, nopcid, default configuration | 227.9±1.6 s | 205099 |
| Unpatched Kernel, WMF configuration | 163.5±1 s | 285878 |
| Patched Kernel, WMF configuration | 163.3±1 s | 286229 |
| Patched Kernel, nopcid, WMF configuration | 165.1±1.3 s | 283108 |
No meaningful regression is observed in this case between the unpatched kernel and the patched kernel with PCID enabled (the default)- the variability is within the measurement error. The nopcid runs could be showing some overhead, but that overhead (around 1%) is barely above the measurement error. The nopcid option is interesting not because of the hardware support, but because of the kernel support- backporting it may not be an option for older distro versions, as Moritz says in the comments.
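For reference, this is roughly how the nopcid runs can be set up on Debian (a sketch; keep whatever options are already on the kernel command line, then reboot):

# /etc/default/grub: append nopcid to the kernel command line
GRUB_CMDLINE_LINUX_DEFAULT="quiet nopcid"
# regenerate the grub configuration and reboot
update-grub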
It is interesting to notice, although off-topic, that while the results with the WMF "optimized" configuration have improved compared to previous years' results (most likely due to improved CPU and memory resources), the defaults have become worse- a reminder that defaults are not a good baseline for comparison.
This is not a surprising result: a single thread is not a real OLTP workload, and more time is spent on IO waits than on the extra syscall overhead.
RO-OLTP
Let’s try a different workload- let’s use a proper benchmarking tool, create a table, and perform point selects on it with two different levels of concurrency: 8 threads and 64 threads:
sysbench --test=oltp --oltp-table-size=1000000 --mysql-db=test --mysql-user=test prepare
sysbench --test=oltp --oltp-table-size=1000000 --mysql-db=test --mysql-user=test --max-time=120 --oltp-read-only=on --max-requests=0 --num-threads={8, 64} run
| Setup | TPS | SELECTs/s | 95th percentile latency (ms) |
|---|---|---|---|
| Unpatched Kernel, 8 threads | 7333±30 | 100953±1000 | 1.15±0.05 |
| Patched Kernel, 8 threads | 6867±150 | 96140±2000 | 1.20±0.01 |
| Patched Kernel, nopcid, 8 threads | 6637±20 | 92915±200 | 1.27±0.05 |
| Unpatched Kernel, 64 threads | 7298±50 | 102176±1000 | 43.21±0.15 |
| Patched Kernel, 64 threads | 6768±40 | 94747±1000 | 43.66±0.15 |
| Patched Kernel, nopcid, 64 threads | 6648±10 | 93073±100 | 43.96±0.10 |
In this case we can observe around a 4-7% regression in throughput if PCID is enabled. If PCID is disabled, the regression increases to 9-10%- bad, but not as bad as the "up to 20%" some had warned about. If you are in my situation, an upgrade to stretch would be worthwhile just to get the PCID support.
Further testing would be required to check at what level of concurrency, or with what kind of workload, things behave better or worse with the extra context-switch work. It would be interesting to measure it with production traffic, too, as some of the above could be nullified once network latencies are added to the mix. Further patches may also change the way the mitigation works, and things like PCID support are probably helping transparently on all modern hardware.
Have you detected a larger regression? Are you going to patch all your databases right away? Tell me at @jynus.
Thanks for running these tests! The patch set integrated in Debian stretch also backports support for PCID, so you are already making use of the PCID speedup, BTW. Older backports will usually not have PCID support (which was only introduced in Linux 4.14), e.g. the current patches for Linux 4.4.x do not support it.
Thanks! It would also be nice to see results with nopcid to compare with those published at https://www.postgresql.org/message-id/20180102222354.qikjmf7dvnjgbkxe@alap3.anarazel.de. The PCID numbers look very similar.
I will try to get some time and update the numbers with that suggestion. It would be very interesting because not all kernel versions (and with that, older distros like debian oldstable-jessie) will support PCID, as Moritz comments above.
Alexey, thanks for the suggestion, I have added those numbers, too.
Thank you for a nice perf report.
I assume the worst case (up to X%) occurs when you maximize context switches per query and one way to do that is to maximize mutex contention. sysbench update-index or update-nonindex with a 1-row table and concurrency might show more regression than what you get above.
My fork of sysbench has that (update one row) and a few read-only tests that also get contention with InnoDB. Some results are at http://smalldatum.blogspot.com/2017/12/sysbench-in-memory-and-fast-server-part.html but I have been too lazy to push my sysbench changes to Alexey.
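For reference, a rough equivalent of that suggestion using the legacy sysbench options from earlier in the post might look like the following (a sketch only; exact option names depend on the sysbench version or fork):

sysbench --test=oltp --oltp-table-size=1 --mysql-db=test --mysql-user=test prepare
sysbench --test=oltp --oltp-table-size=1 --oltp-test-mode=nontrx --oltp-nontrx-mode=update_key --mysql-db=test --mysql-user=test --max-time=120 --max-requests=0 --num-threads=64 run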
I apologize in advance (my university professors would be ashamed of me :-/) for not providing extra metrics to justify the numbers obtained- statistics of function calls and system calls, and a time-axis evolution of some metrics- but I may not have all the time I would like to prepare the graphs 🙁
I know InnoDB does active waiting (I think InnoDB calls them “spins”) instead of full-blown mutexes in certain cases; do you think that could alter the results somehow, or could we tune InnoDB with the existing configuration to optimize for fewer context switches? Like I said to Alexey, I will try to get the time for additional tests, but cannot promise anything! 🙂
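For reference, the InnoDB spin-related settings can be inspected and adjusted at runtime, for example (the defaults shown are the documented ones, not recommendations):

-- how much InnoDB spins before falling back to an OS wait (i.e. a context switch)
SHOW GLOBAL VARIABLES LIKE 'innodb_sync_spin_loops';  -- default: 30
SHOW GLOBAL VARIABLES LIKE 'innodb_spin_wait_delay';  -- default: 6
-- both are dynamic, so they can be changed to experiment with more or less spinning
SET GLOBAL innodb_sync_spin_loops = 60;
SET GLOBAL innodb_spin_wait_delay = 12;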