shitty settings updates on MTS cluster

2024-11-04 14:44:21 +03:00 · 2024-11-04 14:44:21 +03:00 · 7f197a8d79
commit 7f197a8d79
parent 2a71cf2f38
4 changed files with 107 additions and 3 deletions
--- a/46
+++ b/46
@ -0,0 +1,46 @@
 - I/O scheduler: currently mq-deadline. Replace with noop (named none on blk-mq kernels). Reason: throughput is the priority, and under increased load the CPU is the constraint.
 - Disable the physical disks' caches
 - Disable kernel write caching (write_back -> write_through)
 - CPUs held in C1 state (deeper idle states disabled), i.e. tuned for maximum performance
 - As usual: MTU 9000
 - Our software (Rashid) works with files on a block device. To keep the file structure cached in host memory, so that it does not query the cluster for the file hierarchy: vm.vfs_cache_pressure=1 in sysctl.
 - General network settings:
 				Larger network buffers:
 								- net.core.rmem_max = 56623104 (maximum socket receive buffer size)
 								- net.core.wmem_max = 56623104 (maximum socket send buffer size)
 								- net.core.rmem_default = 56623104 (default socket receive buffer size)
 								- net.core.wmem_default = 56623104 (default socket send buffer size)
 								- net.core.optmem_max = 40960 (maximum ancillary/option memory buffer size per socket)
 								- net.ipv4.tcp_rmem = 4096 87380 56623104 (min, default and max size of the TCP socket receive buffer)
 								- net.ipv4.tcp_wmem = 4096 65536 56623104 (min, default and max size of the TCP socket send buffer)
 				Higher limits on the number of TCP connections:
 								- net.core.somaxconn = 1024 (maximum backlog of pending connections per listening socket)
 								- net.core.netdev_max_backlog = 50000 (maximum number of packets allowed to be queued on a network interface before the kernel starts dropping packets)
 				TCP tuning:
 								- net.ipv4.tcp_max_syn_backlog = 30000 (maximum number of remembered connection requests that have not yet been acknowledged by the connecting client)
 								- net.ipv4.tcp_max_tw_buckets = 2000000 (maximum number of TIME_WAIT sockets held by the system simultaneously)
 								- net.ipv4.tcp_tw_reuse = 1 (allow reusing TIME_WAIT sockets for new outbound connections)
 								- net.ipv4.tcp_tw_recycle = 1 (fast recycling of TIME_WAIT sockets for inbound and outbound connections; this sysctl was removed in Linux 4.12, so it does not exist on newer kernels)
 								- net.ipv4.tcp_fin_timeout = 10 (time an orphaned (unreferenced) connection waits before it is aborted at the local end)
 								- net.ipv4.tcp_slow_start_after_idle = 0 (do not fall back to slow start after a period of inactivity)
 - Since there is plenty of memory, osd_memory_target_autotune is turned off. Memory is assigned manually with bluestore_cache_size_ssd (5G) and bluestore_cache_size_hdd (5G)
 - osd_memory_target: 10G
 - 50% of the 8 goes to RocksDB: bluestore_cache_kv_ratio 0.5. This is RocksDB's internal cache and is not managed by Ceph directly
 - bluestore_cache_meta_ratio: what share to give to the metadata cache. Open question; currently set to 0.5
 - The default EC algorithm, Reed-Solomon, is fine. But with 2 coding shards (m=2), the best results come from the technique of Mario Blaum and Ron Roth: blaum_roth. cauchy_good is also worth trying. All of this concerns the open-source Jerasure library
 - For parallel tests, one more EC pool was created with a profile based on the ISA library, built specifically for Intel processors. It supports the Reed-Solomon and Cauchy techniques. We chose Cauchy (written as cauchy_good in these notes; the ISA plugin's own name for the technique is cauchy)
 - The fight for IOPS. Because of how EC works, it adds a read-modify-write cycle when overwriting existing objects, so as an experiment bluestore_prefer_deferred_size_hdd is raised from the default 64 KiB to 128 KiB (the option is specified in bytes).
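A minimal sketch, not part of the original notes, of how the sysctl values listed above could be rendered into a sysctl.d-style fragment. The helper names and the target file name mentioned afterwards are hypothetical. net.ipv4.tcp_tw_recycle is deliberately left out, because that sysctl was removed in Linux 4.12.

```python
# Hypothetical helpers: the values are copied from the notes above,
# the function names are illustrative only.

TUNING_SYSCTLS = {
    "vm.vfs_cache_pressure": "1",
    "net.core.rmem_max": "56623104",
    "net.core.wmem_max": "56623104",
    "net.core.rmem_default": "56623104",
    "net.core.wmem_default": "56623104",
    "net.core.optmem_max": "40960",
    "net.ipv4.tcp_rmem": "4096 87380 56623104",
    "net.ipv4.tcp_wmem": "4096 65536 56623104",
    "net.core.somaxconn": "1024",
    "net.core.netdev_max_backlog": "50000",
    "net.ipv4.tcp_max_syn_backlog": "30000",
    "net.ipv4.tcp_max_tw_buckets": "2000000",
    "net.ipv4.tcp_tw_reuse": "1",
    "net.ipv4.tcp_fin_timeout": "10",
    "net.ipv4.tcp_slow_start_after_idle": "0",
}


def render_sysctl_conf(settings):
    """Return the settings as 'key = value' lines, one per line."""
    return "".join(f"{key} = {value}\n" for key, value in settings.items())


def proc_path(key):
    """Map a sysctl key to its /proc/sys path (keys here contain no dotted interface names)."""
    return "/proc/sys/" + key.replace(".", "/")


if __name__ == "__main__":
    print(render_sysctl_conf(TUNING_SYSCTLS), end="")
```

The output could be saved as, for example, /etc/sysctl.d/90-ceph-tuning.conf (an assumed name) and loaded with `sysctl --system`. Reading the file at `proc_path(key)` afterwards shows the value the kernel actually accepted.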
 Questions:
 1) Is 4+2 suitable? Is it better than replication?
 	This cannot be answered unambiguously within the current discussion, for several reasons; the main one is that the test hardware differs. The tests run on one spec, while the target production machines are completely different.
 	It is important to understand that EC is not free space: you pay for it with high CPU load in normal operation and enormous load during a failure. A drop in IOPS is also to be expected, as a consequence of how EC itself works. (NOT SURE WHETHER TO MENTION THAT RASHID WRITES IN A SINGLE THREAD AND NEEDS A LOT OF IOPS, ESPECIALLY SINCE I DON'T ACTUALLY KNOW FOR SURE HOW RASHID WRITES)
 2) What is the current problem with the cluster? Where are the test results?
 	Previous internal tests on EC 4+2 showed a significant regression compared with replication. Because of this, it was decided to reinstall the cluster with different parameters. It also turned out that with large block devices on the Ceph storage system, the NT software becomes the bottleneck.
 	Work is currently under way to find the optimal way of working with RBD so as to meet the requirements of the order (an order number should probably be cited here).
 3) So where the hell are the results? Show them!
 	The results will definitely be published as soon as we find, through trial and error, a more or less optimal scheme. The near-default configuration proved inadequate.
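For question 1, the capacity side of the trade-off is simple arithmetic. The sketch below is an illustration, not a benchmark. It assumes 3-way replication as the comparison point, since the notes do not state the replica count. It says nothing about CPU load or IOPS, which are the costs the answer above warns about.

```python
# Illustrative arithmetic only; function names are hypothetical.

def ec_raw_per_usable(k, m):
    """Raw bytes stored per usable byte in a k+m erasure-coded pool."""
    return (k + m) / k


def replica_raw_per_usable(size):
    """Raw bytes stored per usable byte in a pool with `size` replicas."""
    return float(size)


def shards_written_per_object(k, m):
    """A full-object write to a k+m pool touches k data and m coding shards."""
    return k + m


if __name__ == "__main__":
    print("EC 4+2 raw/usable:", ec_raw_per_usable(4, 2))          # 1.5
    print("3x replica raw/usable:", replica_raw_per_usable(3))    # 3.0
    print("shards per EC 4+2 write:", shards_written_per_object(4, 2))
```

Under these assumptions EC 4+2 stores 1.5 raw bytes per usable byte against 3 for triple replication, and both layouts survive the loss of two copies or shards. The price is that each write fans out to six OSDs instead of three.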
--- a/Seagate_HDD_datasheet.pdf
+++ b/Seagate_HDD_datasheet.pdf
--- a/34
+++ b/34
@ -0,0 +1,34 @@
 Below are some of the most relevant tunings applied to Ceph OSDs.
 The RocksDB key-value store used for BlueStore metadata plays an important role in write performance of the OSD. The following RocksDB tunings were applied to minimize the write amplification due to DB compaction.
 bluestore_rocksdb_options = compression=kNoCompression,max_write_buffer_number=32,min_write_buffer_number_to_merge=2,recycle_log_file_num=32,compaction_style=kCompactionStyleLevel,write_buffer_size=67108864,target_file_size_base=67108864,max_background_compactions=31,level0_file_num_compaction_trigger=8,level0_slowdown_writes_trigger=32,level0_stop_writes_trigger=64,max_bytes_for_level_base=536870912,compaction_threads=32,max_bytes_for_level_multiplier=8,flusher_threads=8,compaction_readahead_size=2MB
 RHCS 3.2 introduces the BlueStore cache autotuning feature. It works well with most workloads, so we recommend using it; however, during the tests we achieved better results by disabling cache autotuning and configuring the BlueStore cache options manually.
 bluestore_cache_autotune = 0
 In random small-block workloads, it is important to keep as much BlueStore metadata (Onodes) cached as possible. If there is adequate memory on the OSD node, increasing the size of the BlueStore cache can improve performance; in our case we use 8 GB for the BlueStore cache out of the 12 GB available. The default bluestore_cache_size_ssd is 4GB
 bluestore_cache_size_ssd = 8G
 When bluestore_cache_autotune is disabled and bluestore_cache_size_ssd parameter is set, BlueStore cache gets subdivided into 3 different caches:
 cache_meta: used for BlueStore Onode and associated data.
 cache_kv: used for RocksDB block cache including indexes/bloom-filters
 data cache: used for BlueStore cache for data buffers.
 The amount of space that goes to each cache is configurable using ratios. For RBD workloads we increased bluestore_cache_meta_ratio to dedicate a larger share of the cache to BlueStore Onodes. During the tests the best results were achieved using the following ratios:
 bluestore_cache_autotune = 0
 bluestore_cache_kv_ratio = 0.2
 bluestore_cache_meta_ratio = 0.8
 When RHCS starts a recovery process for a failed OSD, it uses log-based recovery. Small-block writes generate a long list of changes that have to be written to the PG log, which increases write amplification and affects performance. During the tests we observed that reducing the number of PG log entries stored improved performance. There is a drawback to these settings, however: almost all recovery processes will use backfill, which has to move incrementally through the entire PG's hash space and compare the source PGs with the destination PGs, increasing the recovery time.
 Because reducing the number of PG log entries reduced write amplification in our testing, the following PG log tunings were applied.
 osd_min_pg_log_entries = 10
 osd_max_pg_log_entries = 10
 osd_pg_log_dups_tracked = 10
 osd_pg_log_trim_min = 10
 https://ceph.io/en/news/blog/2019/bluestore-default-vs-tuned-performance-comparison/
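To make the three-way cache split described above concrete, here is a small sketch that turns a cache size and the two ratios into byte counts. It assumes the data cache receives whatever share is left after the kv and meta ratios; the function name is hypothetical, and `Fraction` is used only to avoid floating-point rounding in the subtraction.

```python
from fractions import Fraction

GIB = 1024 ** 3


def split_bluestore_cache(cache_bytes, kv_ratio, meta_ratio):
    """Split cache_bytes into kv/meta/data parts; data gets the remainder (assumption)."""
    kv = Fraction(str(kv_ratio))
    meta = Fraction(str(meta_ratio))
    data = 1 - kv - meta
    if data < 0:
        raise ValueError("kv_ratio + meta_ratio must not exceed 1")
    return {
        "kv": int(cache_bytes * kv),
        "meta": int(cache_bytes * meta),
        "data": int(cache_bytes * data),
    }


if __name__ == "__main__":
    # Values from the blog excerpt above: 8G cache, kv 0.2, meta 0.8.
    print(split_bluestore_cache(8 * GIB, 0.2, 0.8))
    # Values from the cluster notes: 5G cache, kv 0.5, meta 0.5.
    print(split_bluestore_cache(5 * GIB, 0.5, 0.5))
```

With either pair of ratios the kv and meta shares add up to 1, so under this assumption nothing is left for the data buffer cache. That may be worth keeping in mind when settling the open question about bluestore_cache_meta_ratio.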
--- a/30
+++ b/30
@ -1,3 +1,27 @@
-2.11
+4.11
-    CPU to performance - cpupower frequency-set -g performance
+    System:
-    Check - cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
+        CPU to performance - cpupower frequency-set -g performance
        Check - cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
        The HDD write caches won't turn off with hdparm -W for some reason, as if the feature were unavailable
        Disabling the HDD cache for the LVM partitions
        MTU 9000
        vm.vfs_cache_pressure = 1
    Network:
        net.core.rmem_max = 56623104
        net.core.wmem_max = 56623104
        net.core.rmem_default = 56623104
        net.core.wmem_default = 56623104
        net.core.optmem_max =  40960
        net.ipv4.tcp_rmem = 4096 87380 56623104
        net.ipv4.tcp_wmem = 4096 65536 56623104
        net.core.somaxconn = 1024
        net.core.netdev_max_backlog = 50000
        net.ipv4.tcp_max_syn_backlog = 30000
        net.ipv4.tcp_max_tw_buckets = 2000000
        net.ipv4.tcp_tw_reuse = 1
        net.ipv4.tcp_fin_timeout = 10
        net.ipv4.tcp_slow_start_after_idle = 0
    Ceph:
        leaving osd_memory_target_autotune at true, because each OSD has been given 29 GB
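The governor check from the System block above can also be scripted so that only the CPUs that are not in performance mode are reported. This is a sketch under the assumption that the standard cpufreq sysfs layout is present; the function name and the `sysfs_root` parameter are hypothetical and exist mainly so the check can be pointed at a test directory.

```python
import glob
import os


def non_performance_cpus(sysfs_root="/sys/devices/system/cpu"):
    """Return {cpu_name: governor} for every CPU whose governor is not 'performance'."""
    pattern = os.path.join(sysfs_root, "cpu[0-9]*", "cpufreq", "scaling_governor")
    offenders = {}
    for path in sorted(glob.glob(pattern)):
        with open(path) as handle:
            governor = handle.read().strip()
        if governor != "performance":
            # Path layout: <root>/cpuN/cpufreq/scaling_governor
            cpu_name = path.split(os.sep)[-3]
            offenders[cpu_name] = governor
    return offenders


if __name__ == "__main__":
    bad = non_performance_cpus()
    print("all CPUs in performance mode" if not bad else bad)
```

An empty result means every CPU that exposes cpufreq is already set as intended. On hosts without a cpufreq driver the glob matches nothing, so an empty result there does not prove anything.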