Hi,

Here are some stats about OSD memory usage on our lab cluster
(110 hdd * 1T + 36 ssd * 200G):

https://drive.google.com/drive/folders/0B1s9jTJ0z59JcmtEX283WUd5bzg?usp=sharing

osd_mem_stats.txt contains stats for hdd and ssd OSDs:
- stats for process memory usage from /proc/pid/status,
- mempool stats,
- and heap stats.

We use luminous 12.2.1.
We set a 9G memory limit in /lib/systemd/system/ceph-osd@.service.d/override.conf,
but that is not enough: OSDs are still being killed because they use more than
that (with bluestore_cache_size_ssd=3G).
OSDs on hdd (with bluestore_cache_size_hdd=1G) use up to 4.6G.

And, btw, it seems that systemd's OOM killer looks at VmData, not at VmRSS.
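In case it helps to reproduce these numbers, here is a rough sketch of how such
per-OSD stats can be gathered on one host (it assumes default admin socket
paths and systemd-managed OSDs; adjust for your environment):

#!/usr/bin/env python
# Sketch: print VmRSS/VmData, mempool accounting, and tcmalloc heap stats
# for every OSD whose admin socket is found on this host.
# Assumes default socket paths (/var/run/ceph/ceph-osd.<id>.asok) and
# systemd-managed OSDs; run as root.
import glob
import subprocess

for sock in sorted(glob.glob('/var/run/ceph/ceph-osd.*.asok')):
    osd_id = sock.split('.')[-2]

    # MainPID of the ceph-osd@<id> unit
    out = subprocess.check_output(
        ['systemctl', 'show', '-p', 'MainPID', 'ceph-osd@%s' % osd_id])
    pid = out.decode().strip().split('=')[1]

    print('=== osd.%s (pid %s) ===' % (osd_id, pid))

    # process memory as the kernel (and the OOM killer) sees it
    with open('/proc/%s/status' % pid) as f:
        for line in f:
            if line.startswith(('VmSize', 'VmRSS', 'VmData')):
                print(line.strip())

    # bluestore/osd mempool accounting
    print(subprocess.check_output(
        ['ceph', 'daemon', 'osd.%s' % osd_id, 'dump_mempools']).decode())

    # tcmalloc heap stats (where the output lands varies by release,
    # so just let the command write to the inherited stdout/stderr)
    subprocess.call(['ceph', 'tell', 'osd.%s' % osd_id, 'heap', 'stats'])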
Thanks.

On 08/30/2017 06:17 PM, Sage Weil wrote:
> Hi Aleksei,
>
> On Wed, 30 Aug 2017, Aleksei Gutikov wrote:
>> Hi.
>>
>> I'm trying to synchronize osd daemon memory limits and bluestore cache
>> settings.
>> For 12.1.4 we see hdd osd usage of about 4G with default settings.
>> For ssds we have a 6G limit and they are being OOM-killed periodically.
>
> So,
>
>> While
>> osd_op_num_threads_per_shard_hdd=1
>> osd_op_num_threads_per_shard_ssd=2
>> and
>> osd_op_num_shards_hdd=5
>> osd_op_num_shards_ssd=8
>
> aren't relevant to memory usage.  The _per_shard is about how many bytes
> are stored in each rocksdb key, and the num_shards is about how many
> threads we use.
>
> This is the one that matters:
>
>> bluestore_cache_size_hdd=1G
>> bluestore_cache_size_ssd=3G
>
> It governs how much memory bluestore limits itself to.  The bad news is
> that bluestore counts what it allocates, not how much memory the allocator
> uses, so there is some overhead.  From what I've anecdotally seen it's
> something like 1.5x, which kind of sucks; there is more to be done here.
>
> On top of that is usage by the OSD outside of bluestore,
> which is somewhere in the 500m to 1.5g range.
>
> We're very interested in hearing what observed RSS users see relative to
> the configured bluestore size and pg count, along with a dump of the
> mempool metadata (ceph daemon osd.NNN dump_mempools).
>
>> Does anybody have an idea about the equation for the upper bound of
>> memory consumption?
>
> Very roughly, something like: osd_overhead + bluestore_cache_size * 1.5 ?
>
>> Can bluestore_cache_size be decreased safely, for example to 2G, or to 1G?
>
> Yes, you can/should change this to whatever you like (big or small).
>
>> I want to calculate the maximum expected size of bluestore metadata (which
>> must definitely fit into the cache) using the size of raw space, the
>> average object size, and rocksdb space amplification.
>> I thought it should be something simple like
>> raw_space / avg_obj_size * obj_overhead * rocksdb_space_amp.
>> For example, if obj_overhead=1k, hdd size=1T, rocksdb space amplification
>> is 2, and avg obj size=4M, then 1T/4M*1k*2=500M, so I need at least 512M
>> for cache.
>> But wise guys said that I also have to take the number of extents into
>> account.
>> But bluestore_extent_map_shard_max_size=1200; I hope this number is not a
>> multiplier...
>
> Nope, just a shard size...
>
>> What would be the correct approach for calculating this minimum cache size?
>> What is the expected size of the key-values stored in rocksdb per rados
>> object?
>
> This depends, unfortunately, on what the write pattern is for the objects.
> If they're written by RGW in big chunks, the overhead will be smaller.
> If it comes out of a 4k random write pattern it will be bigger.  Again,
> very interested in hearing user reports of what is observed in real world
> situations.  I would trust that more than a calculation from first
> principles like the above.
>
>> Default bluestore_cache_kv_ratio * bluestore_cache_size_ssd = 0.99 * 3G,
>> while the default bluestore_cache_kv_max=512M.
>> It looks like BlueStore::_set_cache_sizes() will set cache_kv_ratio to 1/6
>> in the default case.  Is 512M enough for bluestore metadata?
>
> In Mark's testing he found that we got more performance benefit when
> small caches were devoted to rocksdb and large caches were devoted mostly
> to the bluestore metadata cache (parsed onodes vs caching the encoded
> on-disk content).  You can always adjust the 512m value upwards (and that
> may make sense for large cache sizes).  Again, very interested in hearing
> whether that works better or worse for your workload!
>
> Thanks-
> sage
>

-- 
Best regards,
Aleksei Gutikov
Software Engineer | synesis.ru | Minsk. BY
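For reference, the two rules of thumb discussed in this thread reduce to a few
lines of arithmetic. This is only a sketch: the helper names are made up for
illustration, and the 1.5x allocator overhead and the 0.5-1.5G non-bluestore
OSD overhead are the anecdotal figures quoted above, not guarantees.

# Back-of-the-envelope estimates from this thread; treat results as rough.
GiB = 1024 ** 3
MiB = 1024 ** 2
TiB = 1024 ** 4

def osd_rss_upper_bound(bluestore_cache_size, osd_overhead=1.5 * GiB,
                        allocator_overhead=1.5):
    # "osd_overhead + bluestore_cache_size * 1.5"
    return osd_overhead + bluestore_cache_size * allocator_overhead

def min_metadata_cache(raw_space, avg_obj_size=4 * MiB, obj_overhead=1024,
                       rocksdb_space_amp=2):
    # "raw_space / avg_obj_size * obj_overhead * rocksdb_space_amp"
    # (per-object overhead only; extent/onode detail is ignored)
    return raw_space / avg_obj_size * obj_overhead * rocksdb_space_amp

print(osd_rss_upper_bound(3 * GiB) / float(GiB))  # ssd OSD, 3G cache -> ~6.0 GiB
print(osd_rss_upper_bound(1 * GiB) / float(GiB))  # hdd OSD, 1G cache -> ~3.0 GiB
print(min_metadata_cache(1 * TiB) / float(MiB))   # 1T hdd -> 512.0 MiB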