Below are some of the most relevant tunings applied to Ceph OSDs.
The RocksDB key-value store that BlueStore uses for its metadata plays an important role in the write performance of the OSD. The following RocksDB tunings were applied to minimize the write amplification caused by DB compaction.
bluestore_rocksdb_options = compression=kNoCompression,max_write_buffer_number=32,min_write_buffer_number_to_merge=2,recycle_log_file_num=32,compaction_style=kCompactionStyleLevel,write_buffer_size=67108864,target_file_size_base=67108864,max_background_compactions=31,level0_file_num_compaction_trigger=8,level0_slowdown_writes_trigger=32,level0_stop_writes_trigger=64,max_bytes_for_level_base=536870912,compaction_threads=32,max_bytes_for_level_multiplier=8,flusher_threads=8,compaction_readahead_size=2MB
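As a quick sanity check, the options a running OSD has actually picked up can be read back through its admin socket. A minimal sketch, assuming the admin socket is available on the OSD node; osd.0 is a placeholder ID:

# Read back the effective RocksDB options from a running OSD
ceph daemon osd.0 config get bluestore_rocksdb_options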
RHCS 3.2 introduces the BlueStore cache autotuning feature. It works well with most workloads, so we recommend using it; during our tests, however, we achieved better results by disabling cache autotuning and configuring the BlueStore cache options manually.
bluestore_cache_autotune = 0
In random small-block workloads, it is important to keep as much BlueStore metadata (onodes) cached as possible. If there is adequate memory on the OSD node, increasing the size of the BlueStore cache can improve performance; in our case we used 8 GB of the 12 GB available for the BlueStore cache. The default bluestore_cache_size_ssd is 4 GB.
bluestore_cache_size_ssd = 8G
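For reference, a minimal ceph.conf sketch grouping both cache settings under the [osd] section; depending on the release, the size may need to be given as a plain byte count rather than with a suffix:

[osd]
# Disable cache autotuning and pin the BlueStore cache for SSD-backed OSDs.
bluestore_cache_autotune = 0
# 8 GiB expressed in bytes; newer releases also accept suffixes such as 8G.
bluestore_cache_size_ssd = 8589934592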
When bluestore_cache_autotune is disabled and the bluestore_cache_size_ssd parameter is set, the BlueStore cache is subdivided into three caches:
cache_meta: used for BlueStore onodes and their associated data.
cache_kv: used for the RocksDB block cache, including indexes and bloom filters.
data cache: used for BlueStore data buffers.
The amount of space that goes to each cache is configured through ratios. For RBD workloads we increased bluestore_cache_meta_ratio so that a larger share of the cache is dedicated to the BlueStore onode cache; during the tests, the best results were achieved with the following ratios (a worked breakdown of the resulting split follows the settings):
bluestore_cache_autotune = 0
bluestore_cache_kv_ratio = 0.2
bluestore_cache_meta_ratio = 0.8
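With these ratios, the 8 GiB cache splits as follows (a worked example; BlueStore assigns whatever remains after the ratios to the data cache):

cache_meta = 0.8 x 8 GiB = 6.4 GiB
cache_kv   = 0.2 x 8 GiB = 1.6 GiB
data cache = (1 - 0.8 - 0.2) x 8 GiB = 0 GiB

In other words, the entire cache goes to onode metadata and RocksDB blocks; giving up the data cache entirely is consistent with prioritizing onode hits for a random small-block RBD workload.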
When RHCS starts a recovery process for a failed OSD, it uses log-based recovery. Small-block writes generate a long list of changes that have to be written to the PG log, which increases write amplification and hurts performance. During the tests we observed that reducing the number of PG log entries stored improved performance. There is a drawback to these settings, however: almost all recovery processes will fall back to backfill, which has to move incrementally through the entire PG's hash space and compare the source PGs with the destination PGs, increasing the recovery time.
Since storing fewer PG log entries reduced write amplification in our testing, the following PG log tunings were applied:
osd_min_pg_log_entries = 10
osd_max_pg_log_entries = 10
osd_pg_log_dups_tracked = 10
osd_pg_log_trim_min = 10
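These values can also be pushed to all running OSDs with injectargs; a sketch, noting that depending on the release some PG log settings may only take full effect after an OSD restart:

# Apply the PG log tunings cluster-wide at runtime
ceph tell osd.* injectargs '--osd_min_pg_log_entries 10 --osd_max_pg_log_entries 10 --osd_pg_log_dups_tracked 10 --osd_pg_log_trim_min 10'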
https://ceph.io/en/news/blog/2019/bluestore-default-vs-tuned-performance-comparison/