ceph_tuning/Tuning_advices
2024-11-04 14:44:21 +03:00

Below are some of the most relevant tunings applied to Ceph OSDs.
The RocksDB key-value store used for BlueStore metadata plays an important role in the write performance of the OSD. The following RocksDB tunings were applied to minimize the write amplification caused by DB compaction.
bluestore_rocksdb_options = compression=kNoCompression,max_write_buffer_number=32,min_write_buffer_number_to_merge=2,recycle_log_file_num=32,compaction_style=kCompactionStyleLevel,write_buffer_size=67108864,target_file_size_base=67108864,max_background_compactions=31,level0_file_num_compaction_trigger=8,level0_slowdown_writes_trigger=32,level0_stop_writes_trigger=64,max_bytes_for_level_base=536870912,compaction_threads=32,max_bytes_for_level_multiplier=8,flusher_threads=8,compaction_readahead_size=2MB
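The option string above packs many settings into one comma-separated value, and the byte-sized ones are hard to read at a glance. As a minimal sketch (the helper name parse_options and the shortened option string are illustrative, not part of Ceph), the following Python splits such a value into key/value pairs and converts a few of the sizes to MiB. The memtable figure is only a rough upper bound per column family, computed as write_buffer_size times max_write_buffer_number.

```python
# Sketch: split a bluestore_rocksdb_options value into key/value pairs
# and report some byte-sized settings in MiB. parse_options is a
# hypothetical helper, and the string below is a subset of the full value.

ROCKSDB_OPTIONS = (
    "compression=kNoCompression,max_write_buffer_number=32,"
    "min_write_buffer_number_to_merge=2,write_buffer_size=67108864,"
    "target_file_size_base=67108864,max_bytes_for_level_base=536870912,"
    "max_bytes_for_level_multiplier=8"
)

MIB = 1024 * 1024


def parse_options(value: str) -> dict[str, str]:
    """Split 'k1=v1,k2=v2' into a dict, rejecting malformed or repeated keys."""
    options: dict[str, str] = {}
    for item in value.split(","):
        key, sep, val = item.strip().partition("=")
        if not sep or not key or not val:
            raise ValueError(f"malformed option: {item!r}")
        if key in options:
            raise ValueError(f"duplicate option: {key}")
        options[key] = val
    return options


opts = parse_options(ROCKSDB_OPTIONS)
write_buffer_mib = int(opts["write_buffer_size"]) // MIB
level_base_mib = int(opts["max_bytes_for_level_base"]) // MIB
# Rough upper bound on memory held in memtables, per column family.
max_memtable_mib = write_buffer_mib * int(opts["max_write_buffer_number"])

print(f"write_buffer_size        = {write_buffer_mib} MiB")
print(f"max_bytes_for_level_base = {level_base_mib} MiB")
print(f"memtable upper bound     = {max_memtable_mib} MiB")
```

A check like this is mainly useful for catching a typo or a duplicated key before the value is pasted into a configuration file.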
RHCS 3.2 introduces the BlueStore cache autotuning feature. It works well with most workloads, and we recommend using it in general. During these tests, however, we achieved better results by disabling cache autotuning and configuring the BlueStore cache options manually.
bluestore_cache_autotune = 0
In random small-block workloads, it is important to keep as much BlueStore metadata (Onodes) cached as possible. If there is adequate memory on the OSD node, increasing the size of the BlueStore cache can improve performance. In our case we use 8 GB for the BlueStore cache out of the 12 GB available. The default bluestore_cache_size_ssd is 4GB.
bluestore_cache_size_ssd = 8G
When bluestore_cache_autotune is disabled and the bluestore_cache_size_ssd parameter is set, the BlueStore cache is subdivided into 3 different caches:
cache_meta: used for BlueStore Onodes and associated data.
cache_kv: used for the RocksDB block cache, including indexes and bloom filters.
data cache: used for BlueStore data buffers.
The amount of space that goes to each cache is configurable using ratios. For RBD workloads we increased bluestore_cache_meta_ratio to dedicate a larger share of the cache to BlueStore Onodes. During the tests the best results were achieved using the following ratios:
bluestore_cache_autotune = 0
bluestore_cache_kv_ratio = 0.2
bluestore_cache_meta_ratio = 0.8
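To make the effect of these ratios concrete, the following Python sketch works out how an 8 GiB cache would be divided. It assumes that "8G" means 8 GiB and that the data cache receives whatever fraction is left after the meta and kv ratios, which is our understanding of how BlueStore treats it when autotuning is off. The function split_cache is a hypothetical helper, not a Ceph API. Exact fractions are used because 1.0 - 0.8 - 0.2 does not come out to exactly zero in binary floating point.

```python
from fractions import Fraction

GIB = 1024 ** 3


def split_cache(total_bytes: int, meta_ratio: str, kv_ratio: str) -> dict[str, int]:
    """Divide a cache of total_bytes by the given ratios (hypothetical helper).

    Assumption: the data cache gets the remainder, 1 - meta_ratio - kv_ratio.
    Ratios are passed as strings so they convert to exact fractions.
    """
    meta = Fraction(meta_ratio)
    kv = Fraction(kv_ratio)
    data = 1 - meta - kv
    if meta < 0 or kv < 0 or data < 0:
        raise ValueError("ratios must be non-negative and sum to at most 1")
    return {
        "meta": int(total_bytes * meta),
        "kv": int(total_bytes * kv),
        "data": int(total_bytes * data),
    }


sizes = split_cache(8 * GIB, "0.8", "0.2")
for name, size in sizes.items():
    print(f"{name:>4}: {size / GIB:.1f} GiB")
```

Under these assumptions the two ratios sum to 1, so nothing is left for the data cache. That matches the stated goal of dedicating the cache to Onodes and RocksDB for small-block RBD workloads.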
When RHCS starts a recovery process for a failed OSD, it uses log-based recovery. Small-block writes generate a long list of changes that have to be written to the PG log, which increases write amplification and affects performance. During the tests we observed that reducing the number of PG log entries stored reduced write amplification and improved performance. There is a drawback associated with these settings, however: almost all recovery processes will use backfill. Backfill has to move incrementally through the entire PG's hash space and compare the source PGs with the destination PGs, which increases the recovery time.
As such, the following PG log tunings were applied.
osd_min_pg_log_entries = 10
osd_max_pg_log_entries = 10
osd_pg_log_dups_tracked = 10
osd_pg_log_trim_min = 10
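The trade-off behind these values can be illustrated with a deliberately simplified model. It is not how Ceph peering is implemented (the real decision compares PG log bounds between OSDs). The sketch assumes that log-based recovery is possible only while the writes an OSD missed still fit in the retained PG log, and that backfill is needed otherwise. The function recovery_mode and the comparison value of 3000 entries are hypothetical and chosen only for illustration.

```python
def recovery_mode(missed_writes: int, pg_log_entries: int) -> str:
    """Simplified model: which recovery path a PG would take.

    Assumption: the log can replay at most pg_log_entries missed writes.
    Anything beyond that forces a full backfill scan of the PG.
    """
    if missed_writes < 0 or pg_log_entries < 0:
        raise ValueError("counts must be non-negative")
    return "log-based" if missed_writes <= pg_log_entries else "backfill"


# An OSD that missed 500 writes to a PG while it was down.
MISSED = 500

for entries in (10, 3000):  # 10 as tuned above; 3000 is a hypothetical larger log
    print(f"pg log entries={entries:>4}: {recovery_mode(MISSED, entries)}")
```

With only 10 entries retained, almost any outage under write load exceeds the log, which is why nearly all recoveries fall back to backfill with these settings.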
https://ceph.io/en/news/blog/2019/bluestore-default-vs-tuned-performance-comparison/