* Dramatic performance drop at certain number of objects in pool
@ 2016-06-16 12:14 Wade Holler
  2016-06-16 12:48 ` Blair Bethwaite
  2016-06-16 13:38 ` Wido den Hollander
  0 siblings, 2 replies; 34+ messages in thread
From: Wade Holler @ 2016-06-16 12:14 UTC (permalink / raw)
  To: ceph-devel

Hi All,

I have a repeatable condition: when the object count in a pool gets to
320-330 million, the object write time dramatically and almost
instantly increases by as much as 10X, exhibited by fs_apply_latency
going from 10ms to 100s of ms.
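
For reference, fs_apply_latency here is the per-OSD apply latency, e.g. as
reported by something like

  ceph osd perf        # fs_commit_latency(ms) / fs_apply_latency(ms) per OSD

rather than anything exotic.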

Can someone point me in a direction / offer an explanation?

I can add a new pool and it performs normally.

Config is generally:
3 nodes, 24 physical cores each, 768GB RAM each, 16 OSDs / node, all SSD
with NVMe for journals. CentOS 7.2, XFS.

Jewel is the release; I'm inserting objects with librados via some Python
test code.

Best Regards
Wade

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Dramatic performance drop at certain number of objects in pool
  2016-06-16 12:14 Dramatic performance drop at certain number of objects in pool Wade Holler
@ 2016-06-16 12:48 ` Blair Bethwaite
       [not found]   ` <CA+z5Dsz=e1N9RxRoF5Wao8Dogf_S1UstNZaCJ=oj-efj83HBig-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2016-06-16 13:38 ` Wido den Hollander
  1 sibling, 1 reply; 34+ messages in thread
From: Blair Bethwaite @ 2016-06-16 12:48 UTC (permalink / raw)
  To: Wade Holler; +Cc: Ceph Development, ceph-users

Hi Wade,

What IO are you seeing on the OSD devices when this happens (see e.g.
iostat)? Are there short periods of high read IOPS where (almost) no
writes occur? What does your memory usage look like (including slab)?
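
If it helps, the sort of thing I'd be watching while the load runs is
roughly this (just a sketch, adjust to taste):

  iostat -x 5                                    # read vs write IOPS per OSD device
  grep -E 'MemFree|Cached|Slab|SReclaimable' /proc/meminfo
  slabtop -o | head -30                          # which slab caches are biggest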

Cheers,

On 16 June 2016 at 22:14, Wade Holler <wade.holler@gmail.com> wrote:
> Hi All,
>
> I have a repeatable condition when the object count in a pool gets to
> 320-330 million the object write time dramatically and almost
> instantly increases as much as 10X, exhibited by fs_apply_latency
> going from 10ms to 100s of ms.
>
> Can someone point me in a direction / have an explanation ?
>
> I can add a new pool and it performs normally.
>
> Config is generally
> 3 Nodes 24 physical core each, 768GB Ram each, 16 OSD / node , all SSD
> with NVME for journals. Centos 7.2, XFS
>
> Jewell is the release; inserting objects with librados via some Python
> test code.
>
> Best Regards
> Wade
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Cheers,
~Blairo

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Dramatic performance drop at certain number of objects in pool
  2016-06-16 12:14 Dramatic performance drop at certain number of objects in pool Wade Holler
  2016-06-16 12:48 ` Blair Bethwaite
@ 2016-06-16 13:38 ` Wido den Hollander
  2016-06-16 14:47   ` Wade Holler
  2016-06-19 23:21   ` Blair Bethwaite
  1 sibling, 2 replies; 34+ messages in thread
From: Wido den Hollander @ 2016-06-16 13:38 UTC (permalink / raw)
  To: Wade Holler, ceph-devel


> On 16 June 2016 at 14:14, Wade Holler <wade.holler@gmail.com> wrote:
> 
> 
> Hi All,
> 
> I have a repeatable condition when the object count in a pool gets to
> 320-330 million the object write time dramatically and almost
> instantly increases as much as 10X, exhibited by fs_apply_latency
> going from 10ms to 100s of ms.
> 

My first guess is the filestore splitting and the number of files per directory.

You have 3*16=48 OSDs, is that correct? With roughly 100 PGs per OSD that would be, let's say, 4800 PGs in total?

That means you have ~66k objects per PG.

> Can someone point me in a direction / have an explanation ?

If you take a look at one of the OSDs, is there a huge number of files in a single directory? Look inside the 'current' directory on that OSD.
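
Something like this (just a sketch, adjust the OSD path) will show the
per-PG file counts and make any outliers obvious:

  cd /var/lib/ceph/osd/ceph-0/current   # example OSD path
  for d in *_head; do
      echo "$(find "$d" -type f | wc -l)  $d"
  done | sort -n | tail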

Wido

> 
> I can add a new pool and it performs normally.
> 
> Config is generally
> 3 Nodes 24 physical core each, 768GB Ram each, 16 OSD / node , all SSD
> with NVME for journals. Centos 7.2, XFS
> 
> Jewell is the release; inserting objects with librados via some Python
> test code.
> 
> Best Regards
> Wade
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Dramatic performance drop at certain number of objects in pool
       [not found]   ` <CA+z5Dsz=e1N9RxRoF5Wao8Dogf_S1UstNZaCJ=oj-efj83HBig-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2016-06-16 14:20     ` Mykola
  2016-06-16 14:30     ` Dramatic performance drop at certain number of objects " Wade Holler
  2016-06-16 14:32     ` Wade Holler
  2 siblings, 0 replies; 34+ messages in thread
From: Mykola @ 2016-06-16 14:20 UTC (permalink / raw)
  To: Blair Bethwaite, Wade Holler
  Cc: Ceph Development, ceph-users-idqoXFIVOFJgJs9I8MT0rw


[-- Attachment #1.1: Type: text/plain, Size: 1840 bytes --]

I see the same behavior, with a threshold of around 20M objects, on a 4-node, 16-OSD, 32TB, HDD-based cluster. The issue dates back to Hammer.

Sent from my Windows 10 phone

From: Blair Bethwaite
Sent: Thursday, June 16, 2016 2:48 PM
To: Wade Holler
Cc: Ceph Development; ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
Subject: Re: [ceph-users] Dramatic performance drop at certain number of objects in pool

Hi Wade,

What IO are you seeing on the OSD devices when this happens (see e.g.
iostat), are there short periods of high read IOPS where (almost) no
writes occur? What does your memory usage look like (including slab)?

Cheers,

On 16 June 2016 at 22:14, Wade Holler <wade.holler-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> Hi All,
>
> I have a repeatable condition when the object count in a pool gets to
> 320-330 million the object write time dramatically and almost
> instantly increases as much as 10X, exhibited by fs_apply_latency
> going from 10ms to 100s of ms.
>
> Can someone point me in a direction / have an explanation ?
>
> I can add a new pool and it performs normally.
>
> Config is generally
> 3 Nodes 24 physical core each, 768GB Ram each, 16 OSD / node , all SSD
> with NVME for journals. Centos 7.2, XFS
>
> Jewell is the release; inserting objects with librados via some Python
> test code.
>
> Best Regards
> Wade
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Cheers,
~Blairo
_______________________________________________
ceph-users mailing list
ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[-- Attachment #1.2: Type: text/html, Size: 5010 bytes --]

[-- Attachment #2: Type: text/plain, Size: 178 bytes --]

_______________________________________________
ceph-users mailing list
ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Dramatic performance drop at certain number of objects in pool
       [not found]   ` <CA+z5Dsz=e1N9RxRoF5Wao8Dogf_S1UstNZaCJ=oj-efj83HBig-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2016-06-16 14:20     ` Dramatic performance drop at certain number of objects " Mykola
@ 2016-06-16 14:30     ` Wade Holler
  2016-06-16 14:32     ` Wade Holler
  2 siblings, 0 replies; 34+ messages in thread
From: Wade Holler @ 2016-06-16 14:30 UTC (permalink / raw)
  To: Blair Bethwaite; +Cc: Ceph Development, ceph-users-idqoXFIVOFJgJs9I8MT0rw


[-- Attachment #1.1: Type: text/plain, Size: 12955 bytes --]

Blairo,

That's right, I do see "lots" of READ IO!  If I compare the "bad" (330 mil)
pool with the new test ("good") pool:

iostat while running to the "good" pool shows almost all writes.
iostat while running to the "bad" pool has VERY large read spikes, with
almost no writes.

Sounds like you have an idea about what causes this.  I'm happy to hear it!

slabinfo is below.  Dropping caches has no effect.

slabinfo - version: 2.1

# name            <active_objs> <num_objs> <objsize> <objperslab>
<pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata
<active_slabs> <num_slabs> <sharedavail>

blk_io_mits         4674   4769   1664   19    8 : tunables    0    0    0
: slabdata    251    251      0

rpc_inode_cache        0      0    640   51    8 : tunables    0    0    0
: slabdata      0      0      0

t10_alua_tg_pt_gp_cache      0      0    408   40    4 : tunables    0    0
  0 : slabdata      0      0      0

t10_pr_reg_cache       0      0    696   47    8 : tunables    0    0    0
: slabdata      0      0      0

se_sess_cache          0      0    896   36    8 : tunables    0    0    0
: slabdata      0      0      0

kvm_vcpu               0      0  16256    2    8 : tunables    0    0    0
: slabdata      0      0      0

kvm_mmu_page_header     48     48    168   48    2 : tunables    0    0    0
: slabdata      1      1      0

xfs_dqtrx              0      0    528   62    8 : tunables    0    0    0
: slabdata      0      0      0

xfs_dquot              0      0    472   69    8 : tunables    0    0    0
: slabdata      0      0      0

xfs_icr                0      0    144   56    2 : tunables    0    0    0
: slabdata      0      0      0

xfs_ili           96974261 97026835    152   53    2 : tunables    0    0
  0 : slabdata 1830695 1830695      0

xfs_inode         97120263 97120263   1088   30    8 : tunables    0    0
  0 : slabdata 3237631 3237631      0

xfs_efd_item        6280   6360    400   40    4 : tunables    0    0    0
: slabdata    159    159      0

xfs_da_state        3264   3264    480   68    8 : tunables    0    0    0
: slabdata     48     48      0

xfs_btree_cur       1872   1872    208   39    2 : tunables    0    0    0
: slabdata     48     48      0

xfs_log_ticket     23980  23980    184   44    2 : tunables    0    0    0
: slabdata    545    545      0

scsi_cmd_cache      4536   4644    448   36    4 : tunables    0    0    0
: slabdata    129    129      0

kcopyd_job             0      0   3312    9    8 : tunables    0    0    0
: slabdata      0      0      0

dm_uevent              0      0   2608   12    8 : tunables    0    0    0
: slabdata      0      0      0

dm_rq_target_io        0      0    136   60    2 : tunables    0    0    0
: slabdata      0      0      0

UDPLITEv6              0      0   1152   28    8 : tunables    0    0    0
: slabdata      0      0      0

UDPv6                980    980   1152   28    8 : tunables    0    0    0
: slabdata     35     35      0

tw_sock_TCPv6          0      0    256   64    4 : tunables    0    0    0
: slabdata      0      0      0

TCPv6                510    510   2112   15    8 : tunables    0    0    0
: slabdata     34     34      0

uhci_urb_priv       6132   6132     56   73    1 : tunables    0    0    0
: slabdata     84     84      0

cfq_queue          64153  97300    232   70    4 : tunables    0    0    0
: slabdata   1390   1390      0

bsg_cmd                0      0    312   52    4 : tunables    0    0    0
: slabdata      0      0      0

mqueue_inode_cache     36     36    896   36    8 : tunables    0    0    0
: slabdata      1      1      0

hugetlbfs_inode_cache    106    106    608   53    8 : tunables    0    0
  0 : slabdata      2      2      0

configfs_dir_cache     46     46     88   46    1 : tunables    0    0    0
: slabdata      1      1      0

dquot                  0      0    256   64    4 : tunables    0    0    0
: slabdata      0      0      0

kioctx              1512   1512    576   56    8 : tunables    0    0    0
: slabdata     27     27      0

userfaultfd_ctx_cache      0      0    128   64    2 : tunables    0    0
  0 : slabdata      0      0      0

pid_namespace          0      0   2176   15    8 : tunables    0    0    0
: slabdata      0      0      0

user_namespace         0      0    280   58    4 : tunables    0    0    0
: slabdata      0      0      0

posix_timers_cache      0      0    248   66    4 : tunables    0    0    0
: slabdata      0      0      0

UDP-Lite               0      0   1024   32    8 : tunables    0    0    0
: slabdata      0      0      0

RAW                 1972   1972    960   34    8 : tunables    0    0    0
: slabdata     58     58      0

UDP                 1472   1504   1024   32    8 : tunables    0    0    0
: slabdata     47     47      0

tw_sock_TCP         6272   6400    256   64    4 : tunables    0    0    0
: slabdata    100    100      0

TCP                 5236   5457   1920   17    8 : tunables    0    0    0
: slabdata    321    321      0

blkdev_queue         421    465   2088   15    8 : tunables    0    0    0
: slabdata     31     31      0

blkdev_requests   36137670 39504234    384   42    4 : tunables    0    0
  0 : slabdata 940577 940577      0

blkdev_ioc          2106   2106    104   39    1 : tunables    0    0    0
: slabdata     54     54      0

fsnotify_event_holder   8160   8160     24  170    1 : tunables    0    0
  0 : slabdata     48     48      0

fsnotify_event     37128  37128    120   68    2 : tunables    0    0    0
: slabdata    546    546      0

sock_inode_cache   11985  11985    640   51    8 : tunables    0    0    0
: slabdata    235    235      0

net_namespace          0      0   4608    7    8 : tunables    0    0    0
: slabdata      0      0      0

shmem_inode_cache   5040   5040    680   48    8 : tunables    0    0    0
: slabdata    105    105      0

Acpi-ParseExt     116256 116256     72   56    1 : tunables    0    0    0
: slabdata   2076   2076      0

Acpi-Namespace     14586  14586     40  102    1 : tunables    0    0    0
: slabdata    143    143      0

taskstats           2352   2352    328   49    4 : tunables    0    0    0
: slabdata     48     48      0

proc_inode_cache  146512 146706    656   49    8 : tunables    0    0    0
: slabdata   2994   2994      0

sigqueue            2448   2448    160   51    2 : tunables    0    0    0
: slabdata     48     48      0

bdev_cache          1872   1872    832   39    8 : tunables    0    0    0
: slabdata     48     48      0

sysfs_dir_cache   172296 172296    112   36    1 : tunables    0    0    0
: slabdata   4786   4786      0

inode_cache        17550  17820    592   55    8 : tunables    0    0    0
: slabdata    324    324      0

dentry            63799847 86138682    192   42    2 : tunables    0    0
  0 : slabdata 2050921 2050921      0

iint_cache             0      0     80   51    1 : tunables    0    0    0
: slabdata      0      0      0

selinux_inode_security  41920  42636     80   51    1 : tunables    0    0
  0 : slabdata    836    836      0

buffer_head       28851697 32477250    104   39    1 : tunables    0    0
  0 : slabdata 832750 832750      0

vm_area_struct     36548  38665    216   37    2 : tunables    0    0    0
: slabdata   1045   1045      0

mm_struct           1120   1120   1600   20    8 : tunables    0    0    0
: slabdata     56     56      0

files_cache         2703   2703    640   51    8 : tunables    0    0    0
: slabdata     53     53      0

signal_cache        5109   5376   1152   28    8 : tunables    0    0    0
: slabdata    192    192      0

sighand_cache       3241   3345   2112   15    8 : tunables    0    0    0
: slabdata    223    223      0

task_xstate        14118  14937    832   39    8 : tunables    0    0    0
: slabdata    383    383      0

task_struct         9295  10538   2944   11    8 : tunables    0    0    0
: slabdata    958    958      0

anon_vma           30400  30400     64   64    1 : tunables    0    0    0
: slabdata    475    475      0

shared_policy_node   5780   5780     48   85    1 : tunables    0    0    0
: slabdata     68     68      0

numa_policy          620    620    264   62    4 : tunables    0    0    0
: slabdata     10     10      0

radix_tree_node   10364872 10364872    584   56    8 : tunables    0    0
  0 : slabdata 185087 185087      0

idr_layer_cache     1185   1185   2112   15    8 : tunables    0    0    0
: slabdata     79     79      0

dma-kmalloc-8192       0      0   8192    4    8 : tunables    0    0    0
: slabdata      0      0      0

dma-kmalloc-4096       0      0   4096    8    8 : tunables    0    0    0
: slabdata      0      0      0

dma-kmalloc-2048       0      0   2048   16    8 : tunables    0    0    0
: slabdata      0      0      0

dma-kmalloc-1024       0      0   1024   32    8 : tunables    0    0    0
: slabdata      0      0      0

dma-kmalloc-512        0      0    512   64    8 : tunables    0    0    0
: slabdata      0      0      0

dma-kmalloc-256        0      0    256   64    4 : tunables    0    0    0
: slabdata      0      0      0

dma-kmalloc-128        0      0    128   64    2 : tunables    0    0    0
: slabdata      0      0      0

dma-kmalloc-64         0      0     64   64    1 : tunables    0    0    0
: slabdata      0      0      0

dma-kmalloc-32         0      0     32  128    1 : tunables    0    0    0
: slabdata      0      0      0

dma-kmalloc-16         0      0     16  256    1 : tunables    0    0    0
: slabdata      0      0      0

dma-kmalloc-8          0      0      8  512    1 : tunables    0    0    0
: slabdata      0      0      0

dma-kmalloc-192        0      0    192   42    2 : tunables    0    0    0
: slabdata      0      0      0

dma-kmalloc-96         0      0     96   42    1 : tunables    0    0    0
: slabdata      0      0      0

kmalloc-8192         588    680   8192    4    8 : tunables    0    0    0
: slabdata    170    170      0

kmalloc-4096        6374   6424   4096    8    8 : tunables    0    0    0
: slabdata    803    803      0

kmalloc-2048       60889  63744   2048   16    8 : tunables    0    0    0
: slabdata   3984   3984      0

kmalloc-1024       27406  32448   1024   32    8 : tunables    0    0    0
: slabdata   1014   1014      0

kmalloc-512       96841607 96891536    512   64    8 : tunables    0    0
  0 : slabdata 1513967 1513967      0

kmalloc-256        73414 108736    256   64    4 : tunables    0    0    0
: slabdata   1699   1699      0

kmalloc-192        32870  33432    192   42    2 : tunables    0    0    0
: slabdata    796    796      0

kmalloc-128        64128  92736    128   64    2 : tunables    0    0    0
: slabdata   1449   1449      0

kmalloc-96          9350   9618     96   42    1 : tunables    0    0    0
: slabdata    229    229      0

kmalloc-64        159325477 194832256     64   64    1 : tunables    0    0
  0 : slabdata 3044254 3044254      0

kmalloc-32         24960  24960     32  128    1 : tunables    0    0    0
: slabdata    195    195      0

kmalloc-16         45312  45312     16  256    1 : tunables    0    0    0
: slabdata    177    177      0

kmalloc-8          51712  51712      8  512    1 : tunables    0    0    0
: slabdata    101    101      0

kmem_cache_node      741    768     64   64    1 : tunables    0    0    0
: slabdata     12     12      0

kmem_cache           640    640    256   64    4 : tunables    0    0    0
: slabdata     10     10      0

On Thu, Jun 16, 2016 at 8:48 AM Blair Bethwaite <blair.bethwaite-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
wrote:

> Hi Wade,
>
> What IO are you seeing on the OSD devices when this happens (see e.g.
> iostat), are there short periods of high read IOPS where (almost) no
> writes occur? What does your memory usage look like (including slab)?
>
> Cheers,
>
> On 16 June 2016 at 22:14, Wade Holler <wade.holler-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> > Hi All,
> >
> > I have a repeatable condition when the object count in a pool gets to
> > 320-330 million the object write time dramatically and almost
> > instantly increases as much as 10X, exhibited by fs_apply_latency
> > going from 10ms to 100s of ms.
> >
> > Can someone point me in a direction / have an explanation ?
> >
> > I can add a new pool and it performs normally.
> >
> > Config is generally
> > 3 Nodes 24 physical core each, 768GB Ram each, 16 OSD / node , all SSD
> > with NVME for journals. Centos 7.2, XFS
> >
> > Jewell is the release; inserting objects with librados via some Python
> > test code.
> >
> > Best Regards
> > Wade
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>
>
> --
> Cheers,
> ~Blairo
>

[-- Attachment #1.2: Type: text/html, Size: 82799 bytes --]

[-- Attachment #2: Type: text/plain, Size: 178 bytes --]

_______________________________________________
ceph-users mailing list
ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Dramatic performance drop at certain number of objects in pool
       [not found]   ` <CA+z5Dsz=e1N9RxRoF5Wao8Dogf_S1UstNZaCJ=oj-efj83HBig-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2016-06-16 14:20     ` Dramatic performance drop at certain number of objects " Mykola
  2016-06-16 14:30     ` Dramatic performance drop at certain number of objects " Wade Holler
@ 2016-06-16 14:32     ` Wade Holler
  2 siblings, 0 replies; 34+ messages in thread
From: Wade Holler @ 2016-06-16 14:32 UTC (permalink / raw)
  To: Blair Bethwaite; +Cc: Ceph Development, ceph-users-idqoXFIVOFJgJs9I8MT0rw

Blairo,

That's right, I do see "lots" of READ IO!  If I compare the "bad"
(330 mil) pool with the new test ("good") pool:

iostat while running to the "good" pool shows almost all writes.
iostat while running to the "bad" pool has VERY large read spikes,
with almost no writes.

Sounds like you have an idea about what causes this.  I'm happy to hear it!

slabinfo is below.  Dropping caches has no effect.

slabinfo - version: 2.1

# name            <active_objs> <num_objs> <objsize> <objperslab>
<pagesperslab> : tunables <limit> <batchcount> <sharedfactor> :
slabdata <active_slabs> <num_slabs> <sharedavail>

blk_io_mits         4674   4769   1664   19    8 : tunables    0    0
  0 : slabdata    251    251      0

rpc_inode_cache        0      0    640   51    8 : tunables    0    0
  0 : slabdata      0      0      0

t10_alua_tg_pt_gp_cache      0      0    408   40    4 : tunables    0
   0    0 : slabdata      0      0      0

t10_pr_reg_cache       0      0    696   47    8 : tunables    0    0
  0 : slabdata      0      0      0

se_sess_cache          0      0    896   36    8 : tunables    0    0
  0 : slabdata      0      0      0

kvm_vcpu               0      0  16256    2    8 : tunables    0    0
  0 : slabdata      0      0      0

kvm_mmu_page_header     48     48    168   48    2 : tunables    0
0    0 : slabdata      1      1      0

xfs_dqtrx              0      0    528   62    8 : tunables    0    0
  0 : slabdata      0      0      0

xfs_dquot              0      0    472   69    8 : tunables    0    0
  0 : slabdata      0      0      0

xfs_icr                0      0    144   56    2 : tunables    0    0
  0 : slabdata      0      0      0

xfs_ili           96974261 97026835    152   53    2 : tunables    0
 0    0 : slabdata 1830695 1830695      0

xfs_inode         97120263 97120263   1088   30    8 : tunables    0
 0    0 : slabdata 3237631 3237631      0

xfs_efd_item        6280   6360    400   40    4 : tunables    0    0
  0 : slabdata    159    159      0

xfs_da_state        3264   3264    480   68    8 : tunables    0    0
  0 : slabdata     48     48      0

xfs_btree_cur       1872   1872    208   39    2 : tunables    0    0
  0 : slabdata     48     48      0

xfs_log_ticket     23980  23980    184   44    2 : tunables    0    0
  0 : slabdata    545    545      0

scsi_cmd_cache      4536   4644    448   36    4 : tunables    0    0
  0 : slabdata    129    129      0

kcopyd_job             0      0   3312    9    8 : tunables    0    0
  0 : slabdata      0      0      0

dm_uevent              0      0   2608   12    8 : tunables    0    0
  0 : slabdata      0      0      0

dm_rq_target_io        0      0    136   60    2 : tunables    0    0
  0 : slabdata      0      0      0

UDPLITEv6              0      0   1152   28    8 : tunables    0    0
  0 : slabdata      0      0      0

UDPv6                980    980   1152   28    8 : tunables    0    0
  0 : slabdata     35     35      0

tw_sock_TCPv6          0      0    256   64    4 : tunables    0    0
  0 : slabdata      0      0      0

TCPv6                510    510   2112   15    8 : tunables    0    0
  0 : slabdata     34     34      0

uhci_urb_priv       6132   6132     56   73    1 : tunables    0    0
  0 : slabdata     84     84      0

cfq_queue          64153  97300    232   70    4 : tunables    0    0
  0 : slabdata   1390   1390      0

bsg_cmd                0      0    312   52    4 : tunables    0    0
  0 : slabdata      0      0      0

mqueue_inode_cache     36     36    896   36    8 : tunables    0    0
   0 : slabdata      1      1      0

hugetlbfs_inode_cache    106    106    608   53    8 : tunables    0
 0    0 : slabdata      2      2      0

configfs_dir_cache     46     46     88   46    1 : tunables    0    0
   0 : slabdata      1      1      0

dquot                  0      0    256   64    4 : tunables    0    0
  0 : slabdata      0      0      0

kioctx              1512   1512    576   56    8 : tunables    0    0
  0 : slabdata     27     27      0

userfaultfd_ctx_cache      0      0    128   64    2 : tunables    0
 0    0 : slabdata      0      0      0

pid_namespace          0      0   2176   15    8 : tunables    0    0
  0 : slabdata      0      0      0

user_namespace         0      0    280   58    4 : tunables    0    0
  0 : slabdata      0      0      0

posix_timers_cache      0      0    248   66    4 : tunables    0    0
   0 : slabdata      0      0      0

UDP-Lite               0      0   1024   32    8 : tunables    0    0
  0 : slabdata      0      0      0

RAW                 1972   1972    960   34    8 : tunables    0    0
  0 : slabdata     58     58      0

UDP                 1472   1504   1024   32    8 : tunables    0    0
  0 : slabdata     47     47      0

tw_sock_TCP         6272   6400    256   64    4 : tunables    0    0
  0 : slabdata    100    100      0

TCP                 5236   5457   1920   17    8 : tunables    0    0
  0 : slabdata    321    321      0

blkdev_queue         421    465   2088   15    8 : tunables    0    0
  0 : slabdata     31     31      0

blkdev_requests   36137670 39504234    384   42    4 : tunables    0
 0    0 : slabdata 940577 940577      0

blkdev_ioc          2106   2106    104   39    1 : tunables    0    0
  0 : slabdata     54     54      0

fsnotify_event_holder   8160   8160     24  170    1 : tunables    0
 0    0 : slabdata     48     48      0

fsnotify_event     37128  37128    120   68    2 : tunables    0    0
  0 : slabdata    546    546      0

sock_inode_cache   11985  11985    640   51    8 : tunables    0    0
  0 : slabdata    235    235      0

net_namespace          0      0   4608    7    8 : tunables    0    0
  0 : slabdata      0      0      0

shmem_inode_cache   5040   5040    680   48    8 : tunables    0    0
  0 : slabdata    105    105      0

Acpi-ParseExt     116256 116256     72   56    1 : tunables    0    0
  0 : slabdata   2076   2076      0

Acpi-Namespace     14586  14586     40  102    1 : tunables    0    0
  0 : slabdata    143    143      0

taskstats           2352   2352    328   49    4 : tunables    0    0
  0 : slabdata     48     48      0

proc_inode_cache  146512 146706    656   49    8 : tunables    0    0
  0 : slabdata   2994   2994      0

sigqueue            2448   2448    160   51    2 : tunables    0    0
  0 : slabdata     48     48      0

bdev_cache          1872   1872    832   39    8 : tunables    0    0
  0 : slabdata     48     48      0

sysfs_dir_cache   172296 172296    112   36    1 : tunables    0    0
  0 : slabdata   4786   4786      0

inode_cache        17550  17820    592   55    8 : tunables    0    0
  0 : slabdata    324    324      0

dentry            63799847 86138682    192   42    2 : tunables    0
 0    0 : slabdata 2050921 2050921      0

iint_cache             0      0     80   51    1 : tunables    0    0
  0 : slabdata      0      0      0

selinux_inode_security  41920  42636     80   51    1 : tunables    0
  0    0 : slabdata    836    836      0

buffer_head       28851697 32477250    104   39    1 : tunables    0
 0    0 : slabdata 832750 832750      0

vm_area_struct     36548  38665    216   37    2 : tunables    0    0
  0 : slabdata   1045   1045      0

mm_struct           1120   1120   1600   20    8 : tunables    0    0
  0 : slabdata     56     56      0

files_cache         2703   2703    640   51    8 : tunables    0    0
  0 : slabdata     53     53      0

signal_cache        5109   5376   1152   28    8 : tunables    0    0
  0 : slabdata    192    192      0

sighand_cache       3241   3345   2112   15    8 : tunables    0    0
  0 : slabdata    223    223      0

task_xstate        14118  14937    832   39    8 : tunables    0    0
  0 : slabdata    383    383      0

task_struct         9295  10538   2944   11    8 : tunables    0    0
  0 : slabdata    958    958      0

anon_vma           30400  30400     64   64    1 : tunables    0    0
  0 : slabdata    475    475      0

shared_policy_node   5780   5780     48   85    1 : tunables    0    0
   0 : slabdata     68     68      0

numa_policy          620    620    264   62    4 : tunables    0    0
  0 : slabdata     10     10      0

radix_tree_node   10364872 10364872    584   56    8 : tunables    0
 0    0 : slabdata 185087 185087      0

idr_layer_cache     1185   1185   2112   15    8 : tunables    0    0
  0 : slabdata     79     79      0

dma-kmalloc-8192       0      0   8192    4    8 : tunables    0    0
  0 : slabdata      0      0      0

dma-kmalloc-4096       0      0   4096    8    8 : tunables    0    0
  0 : slabdata      0      0      0

dma-kmalloc-2048       0      0   2048   16    8 : tunables    0    0
  0 : slabdata      0      0      0

dma-kmalloc-1024       0      0   1024   32    8 : tunables    0    0
  0 : slabdata      0      0      0

dma-kmalloc-512        0      0    512   64    8 : tunables    0    0
  0 : slabdata      0      0      0

dma-kmalloc-256        0      0    256   64    4 : tunables    0    0
  0 : slabdata      0      0      0

dma-kmalloc-128        0      0    128   64    2 : tunables    0    0
  0 : slabdata      0      0      0

dma-kmalloc-64         0      0     64   64    1 : tunables    0    0
  0 : slabdata      0      0      0

dma-kmalloc-32         0      0     32  128    1 : tunables    0    0
  0 : slabdata      0      0      0

dma-kmalloc-16         0      0     16  256    1 : tunables    0    0
  0 : slabdata      0      0      0

dma-kmalloc-8          0      0      8  512    1 : tunables    0    0
  0 : slabdata      0      0      0

dma-kmalloc-192        0      0    192   42    2 : tunables    0    0
  0 : slabdata      0      0      0

dma-kmalloc-96         0      0     96   42    1 : tunables    0    0
  0 : slabdata      0      0      0

kmalloc-8192         588    680   8192    4    8 : tunables    0    0
  0 : slabdata    170    170      0

kmalloc-4096        6374   6424   4096    8    8 : tunables    0    0
  0 : slabdata    803    803      0

kmalloc-2048       60889  63744   2048   16    8 : tunables    0    0
  0 : slabdata   3984   3984      0

kmalloc-1024       27406  32448   1024   32    8 : tunables    0    0
  0 : slabdata   1014   1014      0

kmalloc-512       96841607 96891536    512   64    8 : tunables    0
 0    0 : slabdata 1513967 1513967      0

kmalloc-256        73414 108736    256   64    4 : tunables    0    0
  0 : slabdata   1699   1699      0

kmalloc-192        32870  33432    192   42    2 : tunables    0    0
  0 : slabdata    796    796      0

kmalloc-128        64128  92736    128   64    2 : tunables    0    0
  0 : slabdata   1449   1449      0

kmalloc-96          9350   9618     96   42    1 : tunables    0    0
  0 : slabdata    229    229      0

kmalloc-64        159325477 194832256     64   64    1 : tunables    0
   0    0 : slabdata 3044254 3044254      0

kmalloc-32         24960  24960     32  128    1 : tunables    0    0
  0 : slabdata    195    195      0

kmalloc-16         45312  45312     16  256    1 : tunables    0    0
  0 : slabdata    177    177      0

kmalloc-8          51712  51712      8  512    1 : tunables    0    0
  0 : slabdata    101    101      0

kmem_cache_node      741    768     64   64    1 : tunables    0    0
  0 : slabdata     12     12      0

kmem_cache           640    640    256   64    4 : tunables    0    0
  0 : slabdata     10     10      0


On Thu, Jun 16, 2016 at 8:48 AM, Blair Bethwaite
<blair.bethwaite-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> Hi Wade,
>
> What IO are you seeing on the OSD devices when this happens (see e.g.
> iostat), are there short periods of high read IOPS where (almost) no
> writes occur? What does your memory usage look like (including slab)?
>
> Cheers,
>
> On 16 June 2016 at 22:14, Wade Holler <wade.holler-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>> Hi All,
>>
>> I have a repeatable condition when the object count in a pool gets to
>> 320-330 million the object write time dramatically and almost
>> instantly increases as much as 10X, exhibited by fs_apply_latency
>> going from 10ms to 100s of ms.
>>
>> Can someone point me in a direction / have an explanation ?
>>
>> I can add a new pool and it performs normally.
>>
>> Config is generally
>> 3 Nodes 24 physical core each, 768GB Ram each, 16 OSD / node , all SSD
>> with NVME for journals. Centos 7.2, XFS
>>
>> Jewell is the release; inserting objects with librados via some Python
>> test code.
>>
>> Best Regards
>> Wade
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>
>
> --
> Cheers,
> ~Blairo

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Dramatic performance drop at certain number of objects in pool
  2016-06-16 13:38 ` Wido den Hollander
@ 2016-06-16 14:47   ` Wade Holler
  2016-06-16 16:08     ` Wade Holler
  2016-06-19 23:21   ` Blair Bethwaite
  1 sibling, 1 reply; 34+ messages in thread
From: Wade Holler @ 2016-06-16 14:47 UTC (permalink / raw)
  To: Wido den Hollander; +Cc: Ceph Development

Wido,

I am walking an example OSD now and counting the files. There are 3096 PGs for this pool.
So far the file counts inside the pool.pg_head directories are all coming
in around ~80k.

Is this an issue?

I will report back with all pg_head file counts in this example OSD
once it finishes.

Best Regards,
Wade


On Thu, Jun 16, 2016 at 9:38 AM, Wido den Hollander <wido@42on.com> wrote:
>
>> On 16 June 2016 at 14:14, Wade Holler <wade.holler@gmail.com> wrote:
>>
>>
>> Hi All,
>>
>> I have a repeatable condition when the object count in a pool gets to
>> 320-330 million the object write time dramatically and almost
>> instantly increases as much as 10X, exhibited by fs_apply_latency
>> going from 10ms to 100s of ms.
>>
>
> My first guess is the filestore splitting and the amount of files per directory.
>
> You have 3*16=48 OSDs, is that correct? With roughly 100 PGs per OSD you have let's say 4800 PGs in total?
>
> That means you have ~66k objects per PG.
>
>> Can someone point me in a direction / have an explanation ?
>
> If you take a look at one of the OSDs, are there a huge amount of files in a single directory? Look inside the 'current' directory on that OSDs.
>
> Wido
>
>>
>> I can add a new pool and it performs normally.
>>
>> Config is generally
>> 3 Nodes 24 physical core each, 768GB Ram each, 16 OSD / node , all SSD
>> with NVME for journals. Centos 7.2, XFS
>>
>> Jewell is the release; inserting objects with librados via some Python
>> test code.
>>
>> Best Regards
>> Wade
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Dramatic performance drop at certain number of objects in pool
  2016-06-16 14:47   ` Wade Holler
@ 2016-06-16 16:08     ` Wade Holler
  2016-06-17  8:49       ` Wido den Hollander
  0 siblings, 1 reply; 34+ messages in thread
From: Wade Holler @ 2016-06-16 16:08 UTC (permalink / raw)
  To: Wido den Hollander; +Cc: Ceph Development

OK.  Of the 202 PGs on this example OSD:

65 of them have around ~160k files
137 (the rest) have around ~80k files



On Thu, Jun 16, 2016 at 10:47 AM, Wade Holler <wade.holler@gmail.com> wrote:
> Wido,
>
> I am walking an Example OSD now and counting the files. 3096 PGs for this pool.
> So far file counts inside the pool.pg_head directories are all coming
> in around ~80k.
>
> Is this an issue ?
>
> I will report back with all pg_head file counts in this example OSD
> once it finishes.
>
> Best Regards,
> Wade
>
>
> On Thu, Jun 16, 2016 at 9:38 AM, Wido den Hollander <wido@42on.com> wrote:
>>
>>> On 16 June 2016 at 14:14, Wade Holler <wade.holler@gmail.com> wrote:
>>>
>>>
>>> Hi All,
>>>
>>> I have a repeatable condition when the object count in a pool gets to
>>> 320-330 million the object write time dramatically and almost
>>> instantly increases as much as 10X, exhibited by fs_apply_latency
>>> going from 10ms to 100s of ms.
>>>
>>
>> My first guess is the filestore splitting and the amount of files per directory.
>>
>> You have 3*16=48 OSDs, is that correct? With roughly 100 PGs per OSD you have let's say 4800 PGs in total?
>>
>> That means you have ~66k objects per PG.
>>
>>> Can someone point me in a direction / have an explanation ?
>>
>> If you take a look at one of the OSDs, are there a huge amount of files in a single directory? Look inside the 'current' directory on that OSDs.
>>
>> Wido
>>
>>>
>>> I can add a new pool and it performs normally.
>>>
>>> Config is generally
>>> 3 Nodes 24 physical core each, 768GB Ram each, 16 OSD / node , all SSD
>>> with NVME for journals. Centos 7.2, XFS
>>>
>>> Jewell is the release; inserting objects with librados via some Python
>>> test code.
>>>
>>> Best Regards
>>> Wade
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Dramatic performance drop at certain number of objects in pool
  2016-06-16 16:08     ` Wade Holler
@ 2016-06-17  8:49       ` Wido den Hollander
  0 siblings, 0 replies; 34+ messages in thread
From: Wido den Hollander @ 2016-06-17  8:49 UTC (permalink / raw)
  To: Wade Holler; +Cc: Ceph Development


> On 16 June 2016 at 18:08, Wade Holler <wade.holler@gmail.com> wrote:
> 
> 
> Ok.  Of the 202 pgs on this example OSD,
> 
> 65 of them have around ~160k files
> 137 ( the rest ) -have around ~80k files
> 

Are those files in the same directory or spread out over multiple subdirectories?

You might want to take a look at: http://docs.ceph.com/docs/jewel/rados/configuration/filestore-config-ref/

"filestore split multiple"

Wido

> 
> 
> On Thu, Jun 16, 2016 at 10:47 AM, Wade Holler <wade.holler@gmail.com> wrote:
> > Wido,
> >
> > I am walking an Example OSD now and counting the files. 3096 PGs for this pool.
> > So far file counts inside the pool.pg_head directories are all coming
> > in around ~80k.
> >
> > Is this an issue ?
> >
> > I will report back with all pg_head file counts in this example OSD
> > once it finishes.
> >
> > Best Regards,
> > Wade
> >
> >
> > On Thu, Jun 16, 2016 at 9:38 AM, Wido den Hollander <wido@42on.com> wrote:
> >>
> >>> On 16 June 2016 at 14:14, Wade Holler <wade.holler@gmail.com> wrote:
> >>>
> >>>
> >>> Hi All,
> >>>
> >>> I have a repeatable condition when the object count in a pool gets to
> >>> 320-330 million the object write time dramatically and almost
> >>> instantly increases as much as 10X, exhibited by fs_apply_latency
> >>> going from 10ms to 100s of ms.
> >>>
> >>
> >> My first guess is the filestore splitting and the amount of files per directory.
> >>
> >> You have 3*16=48 OSDs, is that correct? With roughly 100 PGs per OSD you have let's say 4800 PGs in total?
> >>
> >> That means you have ~66k objects per PG.
> >>
> >>> Can someone point me in a direction / have an explanation ?
> >>
> >> If you take a look at one of the OSDs, are there a huge amount of files in a single directory? Look inside the 'current' directory on that OSDs.
> >>
> >> Wido
> >>
> >>>
> >>> I can add a new pool and it performs normally.
> >>>
> >>> Config is generally
> >>> 3 Nodes 24 physical core each, 768GB Ram each, 16 OSD / node , all SSD
> >>> with NVME for journals. Centos 7.2, XFS
> >>>
> >>> Jewell is the release; inserting objects with librados via some Python
> >>> test code.
> >>>
> >>> Best Regards
> >>> Wade
> >>> --
> >>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> >>> the body of a message to majordomo@vger.kernel.org
> >>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Dramatic performance drop at certain number of objects in pool
  2016-06-16 13:38 ` Wido den Hollander
  2016-06-16 14:47   ` Wade Holler
@ 2016-06-19 23:21   ` Blair Bethwaite
       [not found]     ` <CA+z5DszqHuevkAF3W01R=7AAeqVcyuHZTX0+bAvThgihvOjwuA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2016-06-20  6:32     ` Blair Bethwaite
  1 sibling, 2 replies; 34+ messages in thread
From: Blair Bethwaite @ 2016-06-19 23:21 UTC (permalink / raw)
  To: Wido den Hollander, Wade Holler; +Cc: Ceph Development, ceph-users

Hi Wade,

(Apologies for the slowness - AFK for the weekend).

On 16 June 2016 at 23:38, Wido den Hollander <wido@42on.com> wrote:
>
>> On 16 June 2016 at 14:14, Wade Holler <wade.holler@gmail.com> wrote:
>>
>>
>> Hi All,
>>
>> I have a repeatable condition when the object count in a pool gets to
>> 320-330 million the object write time dramatically and almost
>> instantly increases as much as 10X, exhibited by fs_apply_latency
>> going from 10ms to 100s of ms.
>>
>
> My first guess is the filestore splitting and the amount of files per directory.

I concur with Wido and suggest you try upping your filestore split and
merge threshold config values.

I've seen this issue a number of times now with write-heavy workloads,
and would love to at least write some docs about it, because it must
happen to a lot of users running RBD workloads on largish drives.
However, I'm not sure how to definitively diagnose the issue and
pinpoint the problem. The gist of it is the number of files and/or
directories on your OSD filesystems: at some system-dependent
threshold you reach a point where you can no longer sufficiently
cache inodes and/or dentries, so IOs on those files(ystems) have to
incur extra disk IOPS to read the filesystem structure from disk (I
believe that's the small read IO you're seeing, and unfortunately it
seems to effectively choke writes - we've seen all sorts of related
slow-request issues). If you watch your xfs stats you'll likely get
further confirmation. In my experience xs_dir_lookups balloons (which
means directory lookups are missing cache and going to disk).
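
A rough way to watch that on an OSD node (treat this as a sketch, the
field layout is from memory) is:

  # xs_dir_lookup is the first number on the "dir" line of the global XFS stats;
  # run as root so /proc/slabinfo is readable
  watch -d 'grep ^dir /proc/fs/xfs/stat; grep -E "^(xfs_inode|dentry) " /proc/slabinfo'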

What I'm not clear on is whether there are two different pathologies
at play here, i.e., specifically dentry cache issues versus inode
cache issues. In the former case making Ceph's directory structure
shallower with more files per directory may help (or perhaps
increasing the number of PGs - more top-level directories), but in the
latter case you're likely to need various system tuning (lower vfs
cache pressure, more memory?, fewer files (larger object size))
depending on your workload.
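
The vfs cache pressure knob I mean is just the usual sysctl, e.g. (value
purely an example):

  sysctl vm.vfs_cache_pressure           # default is 100
  sysctl -w vm.vfs_cache_pressure=10     # favour keeping dentries/inodes cached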

-- 
Cheers,
~Blairo

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Dramatic performance drop at certain number of objects in pool
       [not found]     ` <CA+z5DszqHuevkAF3W01R=7AAeqVcyuHZTX0+bAvThgihvOjwuA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2016-06-20  0:52       ` Christian Balzer
  0 siblings, 0 replies; 34+ messages in thread
From: Christian Balzer @ 2016-06-20  0:52 UTC (permalink / raw)
  To: Blair Bethwaite; +Cc: Ceph Development, ceph-users-idqoXFIVOFJgJs9I8MT0rw


Hello Blair,

On Mon, 20 Jun 2016 09:21:27 +1000 Blair Bethwaite wrote:

> Hi Wade,
> 
> (Apologies for the slowness - AFK for the weekend).
> 
> On 16 June 2016 at 23:38, Wido den Hollander <wido-fspyXLx8qC4@public.gmane.org> wrote:
> >
> >> On 16 June 2016 at 14:14, Wade Holler <wade.holler-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> >>
> >>
> >> Hi All,
> >>
> >> I have a repeatable condition when the object count in a pool gets to
> >> 320-330 million the object write time dramatically and almost
> >> instantly increases as much as 10X, exhibited by fs_apply_latency
> >> going from 10ms to 100s of ms.
> >>
> >
> > My first guess is the filestore splitting and the amount of files per
> > directory.
> 
> I concur with Wido and suggest you try upping your filestore split and
> merge threshold config values.
>
This is probably a good idea, but as mentioned/suggested below, it would
be something that eventually settles down into a new equilibrium -
something I don't think is happening here.
 
> I've seen this issue a number of times now with write heavy workload,
> and would love to at least write some docs about it, because it must
> happen to a lot of users running RBD workloads on largish drives.
> However, I'm not sure how to definitively diagnose the issue and
> pinpoint the problem. The gist of the issue is the number of files
> and/or directories on your OSD filesystems, at some system dependent
> threshold you get to a point where you can no longer sufficiently
> cache inodes and/or dentrys, so IOs on those files(ystems) have to
> incur extra disk IOPS to read the filesystem structure from disk (I
> believe that's the small read IO you're seeing, and unfortunately it
> seems to effectively choke writes - we've seen all sorts of related
> slow request issues). If you watch your xfs stats you'll likely get
> further confirmation. In my experience xs_dir_lookups balloons (which
> means directory lookups are missing cache and going to disk).
> 
> What I'm not clear on is whether there are two different pathologies
> at play here, i.e., specifically dentry cache issues versus inode
> cache issues. In the former case making Ceph's directory structure
> shallower with more files per directory may help (or perhaps
> increasing the number of PGs - more top-level directories), but in the
> latter case you're likely to need various system tuning (lower vfs
> cache pressure, more memory?, fewer files (larger object size))
> depending on your workload.
> 
I can very much confirm this from the days when on my main production
cluster all 1024 PGs (but only about 6GB of data and 1.6 million objects)
were on just 4 OSDs (25TB each).

Once SLAB ran out of steam and couldn't hold all the respective entries
(Ext4 here, but same diff), things became very slow.

My litmus test is that a "ls -R /var/lib/ceph/osd/ceph-nn/ >/dev/null"
should be pretty much instantaneous and not having to access the disk at
all.

More RAM and proper tuning as well as smaller OSDs are all ways forward to
alleviate/prevent this issue.

It would be interesting to see/know how bluestore fares in this kind of
scenario.

Christian
-- 
Christian Balzer        Network/Systems Engineer                
chibi-FW+hd8ioUD0@public.gmane.org   	Global OnLine Japan/Rakuten Communications
http://www.gol.com/

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Dramatic performance drop at certain number of objects in pool
  2016-06-19 23:21   ` Blair Bethwaite
       [not found]     ` <CA+z5DszqHuevkAF3W01R=7AAeqVcyuHZTX0+bAvThgihvOjwuA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2016-06-20  6:32     ` Blair Bethwaite
       [not found]       ` <CA+z5Dsy4tbyiL71C8CQCTQ66tY1=9thSWdNA4BSn6=tNfGUE6w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  1 sibling, 1 reply; 34+ messages in thread
From: Blair Bethwaite @ 2016-06-20  6:32 UTC (permalink / raw)
  To: Wido den Hollander, Wade Holler; +Cc: Ceph Development, ceph-users

On 20 June 2016 at 09:21, Blair Bethwaite <blair.bethwaite@gmail.com> wrote:
> slow request issues). If you watch your xfs stats you'll likely get
> further confirmation. In my experience xs_dir_lookups balloons (which
> means directory lookups are missing cache and going to disk).

Murphy's a bitch. Today we upgraded a cluster to latest Hammer in
preparation for Jewel/RHCS2. Turns out that when we last hit this very
problem we had only set the new filestore merge/split values
ephemerally - oops. Here's what started happening when we upgraded and
restarted a bunch of OSDs:
https://au-east.erc.monash.edu.au/swift/v1/public/grafana-ceph-xs_dir_lookup.png

Seemed to cause lots of slow requests :-/. We corrected it at about
12:30, but it still took a while to settle.

-- 
Cheers,
~Blairo

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Dramatic performance drop at certain number of objects in pool
       [not found]       ` <CA+z5Dsy4tbyiL71C8CQCTQ66tY1=9thSWdNA4BSn6=tNfGUE6w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2016-06-20 18:48         ` Wade Holler
       [not found]           ` <CA+e22Sc3iY5Lvp4oGwJ_wwpJsOJsWdB1thaHWEAuYP=bbGHAeg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 34+ messages in thread
From: Wade Holler @ 2016-06-20 18:48 UTC (permalink / raw)
  To: Blair Bethwaite, Wido den Hollander
  Cc: Ceph Development, ceph-users-idqoXFIVOFJgJs9I8MT0rw


[-- Attachment #1.1: Type: text/plain, Size: 1258 bytes --]

Thanks everyone for your replies.  I sincerely appreciate it. We are
testing with different pg_num and filestore_split_multiple settings.  Early
indications are .... well, not great. Regardless, it is nice to understand
the symptoms better so we can try to design around it.

Best Regards,
Wade


On Mon, Jun 20, 2016 at 2:32 AM Blair Bethwaite <blair.bethwaite-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
wrote:

> On 20 June 2016 at 09:21, Blair Bethwaite <blair.bethwaite-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
> wrote:
> > slow request issues). If you watch your xfs stats you'll likely get
> > further confirmation. In my experience xs_dir_lookups balloons (which
> > means directory lookups are missing cache and going to disk).
>
> Murphy's a bitch. Today we upgraded a cluster to latest Hammer in
> preparation for Jewel/RHCS2. Turns out when we last hit this very
> problem we had only ephemerally set the new filestore merge/split
> values - oops. Here's what started happening when we upgraded and
> restarted a bunch of OSDs:
>
> https://au-east.erc.monash.edu.au/swift/v1/public/grafana-ceph-xs_dir_lookup.png
>
> Seemed to cause lots of slow requests :-/. We corrected it about
> 12:30, then still took a while to settle.
>
> --
> Cheers,
> ~Blairo
>

[-- Attachment #1.2: Type: text/html, Size: 1854 bytes --]

[-- Attachment #2: Type: text/plain, Size: 178 bytes --]

_______________________________________________
ceph-users mailing list
ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Dramatic performance drop at certain number of objects in pool
       [not found]           ` <CA+e22Sc3iY5Lvp4oGwJ_wwpJsOJsWdB1thaHWEAuYP=bbGHAeg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2016-06-20 20:47             ` Warren Wang - ISD
       [not found]               ` <D38DCB57.131AE%warren.wang-dFwxUrggiyBBDgjK7y7TUQ@public.gmane.org>
  0 siblings, 1 reply; 34+ messages in thread
From: Warren Wang - ISD @ 2016-06-20 20:47 UTC (permalink / raw)
  To: Wade Holler, Blair Bethwaite, Wido den Hollander
  Cc: Ceph Development, ceph-users-idqoXFIVOFJgJs9I8MT0rw


[-- Attachment #1.1: Type: text/plain, Size: 3321 bytes --]

Sorry, late to the party here. I agree, up the merge and split thresholds. We're as high as 50/12. I chimed in on an RH ticket here. One of those things you just have to find out as an operator since it's not well documented :(

https://bugzilla.redhat.com/show_bug.cgi?id=1219974

We have over 200 million objects in this cluster, and it's still doing over 15000 write IOPS all day long with 302 spinning drives + SATA SSD journals. Having enough memory and dropping your vfs_cache_pressure should also help.

Keep in mind that if you change the values, it won't take effect immediately. It only merges them back if the directory is under the calculated threshold and a write occurs (maybe a read, I forget).
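
For what it's worth, rolling that kind of change out looks roughly like
this for us (values are examples only, not a recommendation):

  ceph tell osd.\* injectargs '--filestore-merge-threshold 40 --filestore-split-multiple 8'
  # plus the same values under [osd] in ceph.conf so restarts keep them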

Warren


From: ceph-users <ceph-users-bounces-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org<mailto:ceph-users-bounces-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org>> on behalf of Wade Holler <wade.holler-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org<mailto:wade.holler-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>>
Date: Monday, June 20, 2016 at 2:48 PM
To: Blair Bethwaite <blair.bethwaite-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org<mailto:blair.bethwaite@gmail.com>>, Wido den Hollander <wido-fspyXLx8qC4@public.gmane.org<mailto:wido-fspyXLx8qC4@public.gmane.org>>
Cc: Ceph Development <ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org<mailto:ceph-devel-u79uwXL29TY@public.gmane.orgnel.org>>, "ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org<mailto:ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org>" <ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org<mailto:ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org>>
Subject: Re: [ceph-users] Dramatic performance drop at certain number of objects in pool

Thanks everyone for your replies.  I sincerely appreciate it. We are testing with different pg_num and filestore_split_multiple settings.  Early indications are .... well not great. Regardless it is nice to understand the symptoms better so we try to design around it.

Best Regards,
Wade


On Mon, Jun 20, 2016 at 2:32 AM Blair Bethwaite <blair.bethwaite-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org<mailto:blair.bethwaite-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>> wrote:
On 20 June 2016 at 09:21, Blair Bethwaite <blair.bethwaite-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org<mailto:blair.bethwaite-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>> wrote:
> slow request issues). If you watch your xfs stats you'll likely get
> further confirmation. In my experience xs_dir_lookups balloons (which
> means directory lookups are missing cache and going to disk).

Murphy's a bitch. Today we upgraded a cluster to latest Hammer in
preparation for Jewel/RHCS2. Turns out when we last hit this very
problem we had only ephemerally set the new filestore merge/split
values - oops. Here's what started happening when we upgraded and
restarted a bunch of OSDs:
https://au-east.erc.monash.edu.au/swift/v1/public/grafana-ceph-xs_dir_lookup.png

Seemed to cause lots of slow requests :-/. We corrected it about
12:30, then still took a while to settle.

--
Cheers,
~Blairo


[-- Attachment #1.2: Type: text/html, Size: 5282 bytes --]

[-- Attachment #2: Type: text/plain, Size: 178 bytes --]

_______________________________________________
ceph-users mailing list
ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Dramatic performance drop at certain number of objects in pool
       [not found]               ` <D38DCB57.131AE%warren.wang-dFwxUrggiyBBDgjK7y7TUQ@public.gmane.org>
@ 2016-06-20 22:58                 ` Christian Balzer
  2016-06-23  1:26                   ` [ceph-users] " Wade Holler
  0 siblings, 1 reply; 34+ messages in thread
From: Christian Balzer @ 2016-06-20 22:58 UTC (permalink / raw)
  To: Warren Wang - ISD
  Cc: Ceph Development, Blair Bethwaite, ceph-users-idqoXFIVOFJgJs9I8MT0rw


Hello,

On Mon, 20 Jun 2016 20:47:32 +0000 Warren Wang - ISD wrote:

> Sorry, late to the party here. I agree, up the merge and split
> thresholds. We're as high as 50/12. I chimed in on an RH ticket here.
> One of those things you just have to find out as an operator since it's
> not well documented :(
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=1219974
> 
> We have over 200 million objects in this cluster, and it's still doing
> over 15000 write IOPS all day long with 302 spinning drives + SATA SSD
> journals. Having enough memory and dropping your vfs_cache_pressure
> should also help.
> 
Indeed.

Since it was asked in that bug report and was also my first suspicion, it
is probably a good time to clarify that it isn't the splits themselves that
cause the performance degradation, but the resulting inflation of dir
entries and exhaustion of the SLAB caches, and thus having to go to disk
for things that normally would be in memory.
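
A quick way to see this happening is to watch the dentry and xfs_inode
slabs while the object count grows (slab names from memory, so verify
against your own /proc/slabinfo):

    slabtop -o | egrep 'dentry|xfs_inode'

If those entries start churning or shrinking while xs_dir_lookups
climbs, the metadata no longer fits in memory.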

Looking at Blair's graph from yesterday pretty much makes that clear; a
purely split-caused degradation should have relented much more quickly.


> Keep in mind that if you change the values, it won't take effect
> immediately. It only merges them back if the directory is under the
> calculated threshold and a write occurs (maybe a read, I forget).
> 
If it's a read, a plain scrub might do the trick.
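Something like "ceph osd scrub <osd-id>" (or a more targeted "ceph pg
scrub <pgid>") should generate those reads; the exact commands are from
memory, so double-check before relying on them.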

Christian
> Warren
> 
> 
> From: ceph-users
> <ceph-users-bounces-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org<mailto:ceph-users-bounces-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org>>
> on behalf of Wade Holler
> <wade.holler-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org<mailto:wade.holler-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>> Date: Monday, June
> 20, 2016 at 2:48 PM To: Blair Bethwaite
> <blair.bethwaite-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org<mailto:blair.bethwaite-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>>, Wido den
> Hollander <wido-fspyXLx8qC4@public.gmane.org<mailto:wido-fspyXLx8qC4@public.gmane.org>> Cc: Ceph Development
> <ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org<mailto:ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>>,
> "ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org<mailto:ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org>"
> <ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org<mailto:ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org>> Subject:
> Re: [ceph-users] Dramatic performance drop at certain number of objects
> in pool
> 
> Thanks everyone for your replies.  I sincerely appreciate it. We are
> testing with different pg_num and filestore_split_multiple settings.
> Early indications are .... well not great. Regardless it is nice to
> understand the symptoms better so we try to design around it.
> 
> Best Regards,
> Wade
> 
> 
> On Mon, Jun 20, 2016 at 2:32 AM Blair Bethwaite
> <blair.bethwaite-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org<mailto:blair.bethwaite-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>> wrote: On
> 20 June 2016 at 09:21, Blair Bethwaite
> <blair.bethwaite-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org<mailto:blair.bethwaite-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>> wrote:
> > slow request issues). If you watch your xfs stats you'll likely get
> > further confirmation. In my experience xs_dir_lookups balloons (which
> > means directory lookups are missing cache and going to disk).
> 
> Murphy's a bitch. Today we upgraded a cluster to latest Hammer in
> preparation for Jewel/RHCS2. Turns out when we last hit this very
> problem we had only ephemerally set the new filestore merge/split
> values - oops. Here's what started happening when we upgraded and
> restarted a bunch of OSDs:
> https://au-east.erc.monash.edu.au/swift/v1/public/grafana-ceph-xs_dir_lookup.png
> 
> Seemed to cause lots of slow requests :-/. We corrected it about
> 12:30, then still took a while to settle.
> 
> --
> Cheers,
> ~Blairo
> 
> This email and any files transmitted with it are confidential and
> intended solely for the individual or entity to whom they are addressed.
> If you have received this email in error destroy it immediately. ***
> Walmart Confidential ***


-- 
Christian Balzer        Network/Systems Engineer                
chibi-FW+hd8ioUD0@public.gmane.org   	Global OnLine Japan/Rakuten Communications
http://www.gol.com/

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [ceph-users] Dramatic performance drop at certain number of objects in pool
  2016-06-20 22:58                 ` Christian Balzer
@ 2016-06-23  1:26                   ` Wade Holler
       [not found]                     ` <CA+e22SdrwRHmAD=67MpVtUXVyCOmidcoUXrANZVeDJc2tcJfnQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 34+ messages in thread
From: Wade Holler @ 2016-06-23  1:26 UTC (permalink / raw)
  To: Christian Balzer
  Cc: Warren Wang - ISD, Blair Bethwaite, Wido den Hollander,
	Ceph Development, ceph-users

Based on everyone's suggestions: the first modification, to 50 / 16,
enabled our config to get to ~645 million objects before the behavior in
question was observed (~330 million was the previous ceiling). A subsequent
modification to 50 / 24 has enabled us to get to 1.1 billion+.
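
For reference, the change itself is just the two filestore tunables in
ceph.conf; the snippet below is a sketch of our current values (the
[osd] section placement and the split-point formula are my reading of
the docs rather than something verified in the source):

    [osd]
    filestore_merge_threshold = 50
    filestore_split_multiple = 24

As I understand it a PG subdirectory splits once it holds more than
roughly filestore_split_multiple * abs(filestore_merge_threshold) * 16
objects, so 50 / 24 pushes the first split out to ~19200 objects per
directory.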

Thank you all very much for your support and assistance.

Best Regards,
Wade


On Mon, Jun 20, 2016 at 6:58 PM, Christian Balzer <chibi@gol.com> wrote:
>
> Hello,
>
> On Mon, 20 Jun 2016 20:47:32 +0000 Warren Wang - ISD wrote:
>
>> Sorry, late to the party here. I agree, up the merge and split
>> thresholds. We're as high as 50/12. I chimed in on an RH ticket here.
>> One of those things you just have to find out as an operator since it's
>> not well documented :(
>>
>> https://bugzilla.redhat.com/show_bug.cgi?id=1219974
>>
>> We have over 200 million objects in this cluster, and it's still doing
>> over 15000 write IOPS all day long with 302 spinning drives + SATA SSD
>> journals. Having enough memory and dropping your vfs_cache_pressure
>> should also help.
>>
> Indeed.
>
> Since it was asked in that bug report and also my first suspicion, it
> would probably be good time to clarify that it isn't the splits that cause
> the performance degradation, but the resulting inflation of dir entries
> and exhaustion of SLAB and thus having to go to disk for things that
> normally would be in memory.
>
> Looking at Blair's graph from yesterday pretty much makes that clear, a
> purely split caused degradation should have relented much quicker.
>
>
>> Keep in mind that if you change the values, it won't take effect
>> immediately. It only merges them back if the directory is under the
>> calculated threshold and a write occurs (maybe a read, I forget).
>>
> If it's a read a plain scrub might do the trick.
>
> Christian
>> Warren
>>
>>
>> From: ceph-users
>> <ceph-users-bounces@lists.ceph.com<mailto:ceph-users-bounces@lists.ceph.com>>
>> on behalf of Wade Holler
>> <wade.holler@gmail.com<mailto:wade.holler@gmail.com>> Date: Monday, June
>> 20, 2016 at 2:48 PM To: Blair Bethwaite
>> <blair.bethwaite@gmail.com<mailto:blair.bethwaite@gmail.com>>, Wido den
>> Hollander <wido@42on.com<mailto:wido@42on.com>> Cc: Ceph Development
>> <ceph-devel@vger.kernel.org<mailto:ceph-devel@vger.kernel.org>>,
>> "ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>"
>> <ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>> Subject:
>> Re: [ceph-users] Dramatic performance drop at certain number of objects
>> in pool
>>
>> Thanks everyone for your replies.  I sincerely appreciate it. We are
>> testing with different pg_num and filestore_split_multiple settings.
>> Early indications are .... well not great. Regardless it is nice to
>> understand the symptoms better so we try to design around it.
>>
>> Best Regards,
>> Wade
>>
>>
>> On Mon, Jun 20, 2016 at 2:32 AM Blair Bethwaite
>> <blair.bethwaite@gmail.com<mailto:blair.bethwaite@gmail.com>> wrote: On
>> 20 June 2016 at 09:21, Blair Bethwaite
>> <blair.bethwaite@gmail.com<mailto:blair.bethwaite@gmail.com>> wrote:
>> > slow request issues). If you watch your xfs stats you'll likely get
>> > further confirmation. In my experience xs_dir_lookups balloons (which
>> > means directory lookups are missing cache and going to disk).
>>
>> Murphy's a bitch. Today we upgraded a cluster to latest Hammer in
>> preparation for Jewel/RHCS2. Turns out when we last hit this very
>> problem we had only ephemerally set the new filestore merge/split
>> values - oops. Here's what started happening when we upgraded and
>> restarted a bunch of OSDs:
>> https://au-east.erc.monash.edu.au/swift/v1/public/grafana-ceph-xs_dir_lookup.png
>>
>> Seemed to cause lots of slow requests :-/. We corrected it about
>> 12:30, then still took a while to settle.
>>
>> --
>> Cheers,
>> ~Blairo
>>
>> This email and any files transmitted with it are confidential and
>> intended solely for the individual or entity to whom they are addressed.
>> If you have received this email in error destroy it immediately. ***
>> Walmart Confidential ***
>
>
> --
> Christian Balzer        Network/Systems Engineer
> chibi@gol.com           Global OnLine Japan/Rakuten Communications
> http://www.gol.com/

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Dramatic performance drop at certain number of objects in pool
       [not found]                     ` <CA+e22SdrwRHmAD=67MpVtUXVyCOmidcoUXrANZVeDJc2tcJfnQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2016-06-23  1:33                       ` Blair Bethwaite
  2016-06-23  1:41                         ` [ceph-users] " Wade Holler
  2016-06-23  2:37                         ` [ceph-users] " Christian Balzer
  0 siblings, 2 replies; 34+ messages in thread
From: Blair Bethwaite @ 2016-06-23  1:33 UTC (permalink / raw)
  To: Wade Holler; +Cc: Ceph Development, ceph-users-idqoXFIVOFJgJs9I8MT0rw

Wade, good to know.

For the record, what does this work out to roughly per OSD? And how
much RAM and how many PGs per OSD do you have?

What's your workload? I wonder whether for certain workloads (e.g.
RBD) it's better to increase the default object size somewhat before
pushing the split/merge thresholds up a lot...

Cheers,

On 23 June 2016 at 11:26, Wade Holler <wade.holler-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> Based on everyones suggestions; The first modification to 50 / 16
> enabled our config to get to ~645Mill objects before the behavior in
> question was observed (~330 was the previous ceiling).  Subsequent
> modification to 50 / 24 has enabled us to get to 1.1 Billion+
>
> Thank you all very much for your support and assistance.
>
> Best Regards,
> Wade
>
>
> On Mon, Jun 20, 2016 at 6:58 PM, Christian Balzer <chibi-FW+hd8ioUD0@public.gmane.org> wrote:
>>
>> Hello,
>>
>> On Mon, 20 Jun 2016 20:47:32 +0000 Warren Wang - ISD wrote:
>>
>>> Sorry, late to the party here. I agree, up the merge and split
>>> thresholds. We're as high as 50/12. I chimed in on an RH ticket here.
>>> One of those things you just have to find out as an operator since it's
>>> not well documented :(
>>>
>>> https://bugzilla.redhat.com/show_bug.cgi?id=1219974
>>>
>>> We have over 200 million objects in this cluster, and it's still doing
>>> over 15000 write IOPS all day long with 302 spinning drives + SATA SSD
>>> journals. Having enough memory and dropping your vfs_cache_pressure
>>> should also help.
>>>
>> Indeed.
>>
>> Since it was asked in that bug report and also my first suspicion, it
>> would probably be good time to clarify that it isn't the splits that cause
>> the performance degradation, but the resulting inflation of dir entries
>> and exhaustion of SLAB and thus having to go to disk for things that
>> normally would be in memory.
>>
>> Looking at Blair's graph from yesterday pretty much makes that clear, a
>> purely split caused degradation should have relented much quicker.
>>
>>
>>> Keep in mind that if you change the values, it won't take effect
>>> immediately. It only merges them back if the directory is under the
>>> calculated threshold and a write occurs (maybe a read, I forget).
>>>
>> If it's a read a plain scrub might do the trick.
>>
>> Christian
>>> Warren
>>>
>>>
>>> From: ceph-users
>>> <ceph-users-bounces-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org<mailto:ceph-users-bounces-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org>>
>>> on behalf of Wade Holler
>>> <wade.holler-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org<mailto:wade.holler-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>> Date: Monday, June
>>> 20, 2016 at 2:48 PM To: Blair Bethwaite
>>> <blair.bethwaite-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org<mailto:blair.bethwaite-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>>, Wido den
>>> Hollander <wido-fspyXLx8qC4@public.gmane.org<mailto:wido-fspyXLx8qC4@public.gmane.org>> Cc: Ceph Development
>>> <ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org<mailto:ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>>,
>>> "ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org<mailto:ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org>"
>>> <ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org<mailto:ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org>> Subject:
>>> Re: [ceph-users] Dramatic performance drop at certain number of objects
>>> in pool
>>>
>>> Thanks everyone for your replies.  I sincerely appreciate it. We are
>>> testing with different pg_num and filestore_split_multiple settings.
>>> Early indications are .... well not great. Regardless it is nice to
>>> understand the symptoms better so we try to design around it.
>>>
>>> Best Regards,
>>> Wade
>>>
>>>
>>> On Mon, Jun 20, 2016 at 2:32 AM Blair Bethwaite
>>> <blair.bethwaite-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org<mailto:blair.bethwaite-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>> wrote: On
>>> 20 June 2016 at 09:21, Blair Bethwaite
>>> <blair.bethwaite-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org<mailto:blair.bethwaite-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>> wrote:
>>> > slow request issues). If you watch your xfs stats you'll likely get
>>> > further confirmation. In my experience xs_dir_lookups balloons (which
>>> > means directory lookups are missing cache and going to disk).
>>>
>>> Murphy's a bitch. Today we upgraded a cluster to latest Hammer in
>>> preparation for Jewel/RHCS2. Turns out when we last hit this very
>>> problem we had only ephemerally set the new filestore merge/split
>>> values - oops. Here's what started happening when we upgraded and
>>> restarted a bunch of OSDs:
>>> https://au-east.erc.monash.edu.au/swift/v1/public/grafana-ceph-xs_dir_lookup.png
>>>
>>> Seemed to cause lots of slow requests :-/. We corrected it about
>>> 12:30, then still took a while to settle.
>>>
>>> --
>>> Cheers,
>>> ~Blairo
>>>
>>> This email and any files transmitted with it are confidential and
>>> intended solely for the individual or entity to whom they are addressed.
>>> If you have received this email in error destroy it immediately. ***
>>> Walmart Confidential ***
>>
>>
>> --
>> Christian Balzer        Network/Systems Engineer
>> chibi-FW+hd8ioUD0@public.gmane.org           Global OnLine Japan/Rakuten Communications
>> http://www.gol.com/



-- 
Cheers,
~Blairo

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [ceph-users] Dramatic performance drop at certain number of objects in pool
  2016-06-23  1:33                       ` Blair Bethwaite
@ 2016-06-23  1:41                         ` Wade Holler
  2016-06-23  2:01                           ` Blair Bethwaite
       [not found]                           ` <CA+e22SfaiBUQ9Wanay6_oji9t7131o67B2oDtaEW_zXwqCJfbQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2016-06-23  2:37                         ` [ceph-users] " Christian Balzer
  1 sibling, 2 replies; 34+ messages in thread
From: Wade Holler @ 2016-06-23  1:41 UTC (permalink / raw)
  To: Blair Bethwaite
  Cc: Christian Balzer, Warren Wang - ISD, Wido den Hollander,
	Ceph Development, ceph-users

Blairo,

We'll speak in pre-replication numbers; replication for this pool is 3.

23.3 Million Objects / OSD
pg_num 2048
16 OSDs / Server
3 Servers
660 GB RAM Total, 179 GB Used (free -t) / Server
vm.swappiness = 1
vm.vfs_cache_pressure = 100

Workload is native librados with Python. ALL 4k objects.
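
If it helps to picture the write path, the test loop is essentially the
pattern below - a simplified sketch rather than our actual test code,
and the pool name and object naming scheme are made up:

    import os
    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('objtest')   # hypothetical pool name

    payload = os.urandom(4096)              # 4k payload, as above
    for i in range(1000000):                # scale the count up to taste
        # one small RADOS object per iteration
        ioctx.write_full('obj-%010d' % i, payload)

    ioctx.close()
    cluster.shutdown()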

Best Regards,
Wade


On Wed, Jun 22, 2016 at 9:33 PM, Blair Bethwaite
<blair.bethwaite@gmail.com> wrote:
> Wade, good to know.
>
> For the record, what does this work out to roughly per OSD? And how
> much RAM and how many PGs per OSD do you have?
>
> What's your workload? I wonder whether for certain workloads (e.g.
> RBD) it's better to increase default object size somewhat before
> pushing the split/merge up a lot...
>
> Cheers,
>
> On 23 June 2016 at 11:26, Wade Holler <wade.holler@gmail.com> wrote:
>> Based on everyones suggestions; The first modification to 50 / 16
>> enabled our config to get to ~645Mill objects before the behavior in
>> question was observed (~330 was the previous ceiling).  Subsequent
>> modification to 50 / 24 has enabled us to get to 1.1 Billion+
>>
>> Thank you all very much for your support and assistance.
>>
>> Best Regards,
>> Wade
>>
>>
>> On Mon, Jun 20, 2016 at 6:58 PM, Christian Balzer <chibi@gol.com> wrote:
>>>
>>> Hello,
>>>
>>> On Mon, 20 Jun 2016 20:47:32 +0000 Warren Wang - ISD wrote:
>>>
>>>> Sorry, late to the party here. I agree, up the merge and split
>>>> thresholds. We're as high as 50/12. I chimed in on an RH ticket here.
>>>> One of those things you just have to find out as an operator since it's
>>>> not well documented :(
>>>>
>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1219974
>>>>
>>>> We have over 200 million objects in this cluster, and it's still doing
>>>> over 15000 write IOPS all day long with 302 spinning drives + SATA SSD
>>>> journals. Having enough memory and dropping your vfs_cache_pressure
>>>> should also help.
>>>>
>>> Indeed.
>>>
>>> Since it was asked in that bug report and also my first suspicion, it
>>> would probably be good time to clarify that it isn't the splits that cause
>>> the performance degradation, but the resulting inflation of dir entries
>>> and exhaustion of SLAB and thus having to go to disk for things that
>>> normally would be in memory.
>>>
>>> Looking at Blair's graph from yesterday pretty much makes that clear, a
>>> purely split caused degradation should have relented much quicker.
>>>
>>>
>>>> Keep in mind that if you change the values, it won't take effect
>>>> immediately. It only merges them back if the directory is under the
>>>> calculated threshold and a write occurs (maybe a read, I forget).
>>>>
>>> If it's a read a plain scrub might do the trick.
>>>
>>> Christian
>>>> Warren
>>>>
>>>>
>>>> From: ceph-users
>>>> <ceph-users-bounces@lists.ceph.com<mailto:ceph-users-bounces@lists.ceph.com>>
>>>> on behalf of Wade Holler
>>>> <wade.holler@gmail.com<mailto:wade.holler@gmail.com>> Date: Monday, June
>>>> 20, 2016 at 2:48 PM To: Blair Bethwaite
>>>> <blair.bethwaite@gmail.com<mailto:blair.bethwaite@gmail.com>>, Wido den
>>>> Hollander <wido@42on.com<mailto:wido@42on.com>> Cc: Ceph Development
>>>> <ceph-devel@vger.kernel.org<mailto:ceph-devel@vger.kernel.org>>,
>>>> "ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>"
>>>> <ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>> Subject:
>>>> Re: [ceph-users] Dramatic performance drop at certain number of objects
>>>> in pool
>>>>
>>>> Thanks everyone for your replies.  I sincerely appreciate it. We are
>>>> testing with different pg_num and filestore_split_multiple settings.
>>>> Early indications are .... well not great. Regardless it is nice to
>>>> understand the symptoms better so we try to design around it.
>>>>
>>>> Best Regards,
>>>> Wade
>>>>
>>>>
>>>> On Mon, Jun 20, 2016 at 2:32 AM Blair Bethwaite
>>>> <blair.bethwaite@gmail.com<mailto:blair.bethwaite@gmail.com>> wrote: On
>>>> 20 June 2016 at 09:21, Blair Bethwaite
>>>> <blair.bethwaite@gmail.com<mailto:blair.bethwaite@gmail.com>> wrote:
>>>> > slow request issues). If you watch your xfs stats you'll likely get
>>>> > further confirmation. In my experience xs_dir_lookups balloons (which
>>>> > means directory lookups are missing cache and going to disk).
>>>>
>>>> Murphy's a bitch. Today we upgraded a cluster to latest Hammer in
>>>> preparation for Jewel/RHCS2. Turns out when we last hit this very
>>>> problem we had only ephemerally set the new filestore merge/split
>>>> values - oops. Here's what started happening when we upgraded and
>>>> restarted a bunch of OSDs:
>>>> https://au-east.erc.monash.edu.au/swift/v1/public/grafana-ceph-xs_dir_lookup.png
>>>>
>>>> Seemed to cause lots of slow requests :-/. We corrected it about
>>>> 12:30, then still took a while to settle.
>>>>
>>>> --
>>>> Cheers,
>>>> ~Blairo
>>>>
>>>> This email and any files transmitted with it are confidential and
>>>> intended solely for the individual or entity to whom they are addressed.
>>>> If you have received this email in error destroy it immediately. ***
>>>> Walmart Confidential ***
>>>
>>>
>>> --
>>> Christian Balzer        Network/Systems Engineer
>>> chibi@gol.com           Global OnLine Japan/Rakuten Communications
>>> http://www.gol.com/
>
>
>
> --
> Cheers,
> ~Blairo

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [ceph-users] Dramatic performance drop at certain number of objects in pool
  2016-06-23  1:41                         ` [ceph-users] " Wade Holler
@ 2016-06-23  2:01                           ` Blair Bethwaite
  2016-06-23  2:28                             ` Christian Balzer
  2016-06-23  2:31                             ` Wade Holler
       [not found]                           ` <CA+e22SfaiBUQ9Wanay6_oji9t7131o67B2oDtaEW_zXwqCJfbQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  1 sibling, 2 replies; 34+ messages in thread
From: Blair Bethwaite @ 2016-06-23  2:01 UTC (permalink / raw)
  To: Wade Holler
  Cc: Christian Balzer, Warren Wang - ISD, Wido den Hollander,
	Ceph Development, ceph-users

On 23 June 2016 at 11:41, Wade Holler <wade.holler@gmail.com> wrote:
> Workload is native librados with python.  ALL 4k objects.

Was that meant to be 4MB?

-- 
Cheers,
~Blairo

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [ceph-users] Dramatic performance drop at certain number of objects in pool
  2016-06-23  2:01                           ` Blair Bethwaite
@ 2016-06-23  2:28                             ` Christian Balzer
  2016-06-23  2:36                               ` Blair Bethwaite
  2016-06-23  2:31                             ` Wade Holler
  1 sibling, 1 reply; 34+ messages in thread
From: Christian Balzer @ 2016-06-23  2:28 UTC (permalink / raw)
  To: Blair Bethwaite
  Cc: Wade Holler, Warren Wang - ISD, Wido den Hollander,
	Ceph Development, ceph-users

On Thu, 23 Jun 2016 12:01:38 +1000 Blair Bethwaite wrote:

> On 23 June 2016 at 11:41, Wade Holler <wade.holler@gmail.com> wrote:
> > Workload is native librados with python.  ALL 4k objects.
> 
> Was that meant to be 4MB?
> 
Nope, he means 4K; he's putting lots of small objects into the cluster
via a Python script to test for exactly this problem.

See his original post.


Christian
-- 
Christian Balzer        Network/Systems Engineer                
chibi@gol.com   	Global OnLine Japan/Rakuten Communications
http://www.gol.com/

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [ceph-users] Dramatic performance drop at certain number of objects in pool
  2016-06-23  2:01                           ` Blair Bethwaite
  2016-06-23  2:28                             ` Christian Balzer
@ 2016-06-23  2:31                             ` Wade Holler
  1 sibling, 0 replies; 34+ messages in thread
From: Wade Holler @ 2016-06-23  2:31 UTC (permalink / raw)
  To: Blair Bethwaite
  Cc: Christian Balzer, Warren Wang - ISD, Wido den Hollander,
	Ceph Development, ceph-users

No.  Our application writes very small objects.


On Wed, Jun 22, 2016 at 10:01 PM, Blair Bethwaite
<blair.bethwaite@gmail.com> wrote:
> On 23 June 2016 at 11:41, Wade Holler <wade.holler@gmail.com> wrote:
>> Workload is native librados with python.  ALL 4k objects.
>
> Was that meant to be 4MB?
>
> --
> Cheers,
> ~Blairo

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [ceph-users] Dramatic performance drop at certain number of objects in pool
  2016-06-23  2:28                             ` Christian Balzer
@ 2016-06-23  2:36                               ` Blair Bethwaite
  0 siblings, 0 replies; 34+ messages in thread
From: Blair Bethwaite @ 2016-06-23  2:36 UTC (permalink / raw)
  To: Christian Balzer, Wade Holler
  Cc: Warren Wang - ISD, Wido den Hollander, Ceph Development, ceph-users

Hi Christian,

Ah ok, I didn't see object size mentioned earlier. But I guess direct
rados small objects are a rarish use-case, which explains the very
high object counts.

I'm interested in finding the right balance for RBD given object size
is another variable that can be tweaked there. I recall the
UnitedStack folks using 32MB.

Cheers,

On 23 June 2016 at 12:28, Christian Balzer <chibi@gol.com> wrote:
> On Thu, 23 Jun 2016 12:01:38 +1000 Blair Bethwaite wrote:
>
>> On 23 June 2016 at 11:41, Wade Holler <wade.holler@gmail.com> wrote:
>> > Workload is native librados with python.  ALL 4k objects.
>>
>> Was that meant to be 4MB?
>>
> Nope, he means 4K, he's putting lots of small objects via a python script
> into the cluster to test for exactly this problem.
>
> See his original post.
>
>
> Christian
> --
> Christian Balzer        Network/Systems Engineer
> chibi@gol.com           Global OnLine Japan/Rakuten Communications
> http://www.gol.com/



-- 
Cheers,
~Blairo

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [ceph-users] Dramatic performance drop at certain number of objects in pool
  2016-06-23  1:33                       ` Blair Bethwaite
  2016-06-23  1:41                         ` [ceph-users] " Wade Holler
@ 2016-06-23  2:37                         ` Christian Balzer
       [not found]                           ` <20160623113717.446a1f9d-9yhXNL7Kh0lSCLKNlHTxZM8NsWr+9BEh@public.gmane.org>
  1 sibling, 1 reply; 34+ messages in thread
From: Christian Balzer @ 2016-06-23  2:37 UTC (permalink / raw)
  To: ceph-users
  Cc: Blair Bethwaite, Wade Holler, Warren Wang - ISD,
	Wido den Hollander, Ceph Development

On Thu, 23 Jun 2016 11:33:05 +1000 Blair Bethwaite wrote:

> Wade, good to know.
> 
> For the record, what does this work out to roughly per OSD? And how
> much RAM and how many PGs per OSD do you have?
> 
> What's your workload? I wonder whether for certain workloads (e.g.
> RBD) it's better to increase default object size somewhat before
> pushing the split/merge up a lot...
> 
I'd posit that RBD is _least_ likely to encounter this issue in a
moderately balanced setup.
Think about it: a 4MB RBD object can hold literally hundreds of files.

With CephFS or RGW, by contrast, each file or S3 object is going to cost
you about 2 RADOS objects.

Case in point, my main cluster (RBD images only) with 18 5+TB OSDs on 3
servers (64GB RAM each) has 1.8 million 4MB RBD objects using about 7% of
the available space. 
Don't think I could hit this problem before running out of space.
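
(Rough arithmetic, for anyone checking: 1.8 million x 4MB is ~7TB of
data, or about 100k objects per OSD across the 18 OSDs - a couple of
orders of magnitude below the ~23 million per OSD Wade reported.)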

Christian

> Cheers,
> 
> On 23 June 2016 at 11:26, Wade Holler <wade.holler@gmail.com> wrote:
> > Based on everyones suggestions; The first modification to 50 / 16
> > enabled our config to get to ~645Mill objects before the behavior in
> > question was observed (~330 was the previous ceiling).  Subsequent
> > modification to 50 / 24 has enabled us to get to 1.1 Billion+
> >
> > Thank you all very much for your support and assistance.
> >
> > Best Regards,
> > Wade
> >
> >
> > On Mon, Jun 20, 2016 at 6:58 PM, Christian Balzer <chibi@gol.com>
> > wrote:
> >>
> >> Hello,
> >>
> >> On Mon, 20 Jun 2016 20:47:32 +0000 Warren Wang - ISD wrote:
> >>
> >>> Sorry, late to the party here. I agree, up the merge and split
> >>> thresholds. We're as high as 50/12. I chimed in on an RH ticket here.
> >>> One of those things you just have to find out as an operator since
> >>> it's not well documented :(
> >>>
> >>> https://bugzilla.redhat.com/show_bug.cgi?id=1219974
> >>>
> >>> We have over 200 million objects in this cluster, and it's still
> >>> doing over 15000 write IOPS all day long with 302 spinning drives +
> >>> SATA SSD journals. Having enough memory and dropping your
> >>> vfs_cache_pressure should also help.
> >>>
> >> Indeed.
> >>
> >> Since it was asked in that bug report and also my first suspicion, it
> >> would probably be good time to clarify that it isn't the splits that
> >> cause the performance degradation, but the resulting inflation of dir
> >> entries and exhaustion of SLAB and thus having to go to disk for
> >> things that normally would be in memory.
> >>
> >> Looking at Blair's graph from yesterday pretty much makes that clear,
> >> a purely split caused degradation should have relented much quicker.
> >>
> >>
> >>> Keep in mind that if you change the values, it won't take effect
> >>> immediately. It only merges them back if the directory is under the
> >>> calculated threshold and a write occurs (maybe a read, I forget).
> >>>
> >> If it's a read a plain scrub might do the trick.
> >>
> >> Christian
> >>> Warren
> >>>
> >>>
> >>> From: ceph-users
> >>> <ceph-users-bounces@lists.ceph.com<mailto:ceph-users-bounces@lists.ceph.com>>
> >>> on behalf of Wade Holler
> >>> <wade.holler@gmail.com<mailto:wade.holler@gmail.com>> Date: Monday,
> >>> June 20, 2016 at 2:48 PM To: Blair Bethwaite
> >>> <blair.bethwaite@gmail.com<mailto:blair.bethwaite@gmail.com>>, Wido
> >>> den Hollander <wido@42on.com<mailto:wido@42on.com>> Cc: Ceph
> >>> Development
> >>> <ceph-devel@vger.kernel.org<mailto:ceph-devel@vger.kernel.org>>,
> >>> "ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>"
> >>> <ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>>
> >>> Subject: Re: [ceph-users] Dramatic performance drop at certain
> >>> number of objects in pool
> >>>
> >>> Thanks everyone for your replies.  I sincerely appreciate it. We are
> >>> testing with different pg_num and filestore_split_multiple settings.
> >>> Early indications are .... well not great. Regardless it is nice to
> >>> understand the symptoms better so we try to design around it.
> >>>
> >>> Best Regards,
> >>> Wade
> >>>
> >>>
> >>> On Mon, Jun 20, 2016 at 2:32 AM Blair Bethwaite
> >>> <blair.bethwaite@gmail.com<mailto:blair.bethwaite@gmail.com>> wrote:
> >>> On 20 June 2016 at 09:21, Blair Bethwaite
> >>> <blair.bethwaite@gmail.com<mailto:blair.bethwaite@gmail.com>> wrote:
> >>> > slow request issues). If you watch your xfs stats you'll likely get
> >>> > further confirmation. In my experience xs_dir_lookups balloons
> >>> > (which means directory lookups are missing cache and going to
> >>> > disk).
> >>>
> >>> Murphy's a bitch. Today we upgraded a cluster to latest Hammer in
> >>> preparation for Jewel/RHCS2. Turns out when we last hit this very
> >>> problem we had only ephemerally set the new filestore merge/split
> >>> values - oops. Here's what started happening when we upgraded and
> >>> restarted a bunch of OSDs:
> >>> https://au-east.erc.monash.edu.au/swift/v1/public/grafana-ceph-xs_dir_lookup.png
> >>>
> >>> Seemed to cause lots of slow requests :-/. We corrected it about
> >>> 12:30, then still took a while to settle.
> >>>
> >>> --
> >>> Cheers,
> >>> ~Blairo
> >>>
> >>> This email and any files transmitted with it are confidential and
> >>> intended solely for the individual or entity to whom they are
> >>> addressed. If you have received this email in error destroy it
> >>> immediately. *** Walmart Confidential ***
> >>
> >>
> >> --
> >> Christian Balzer        Network/Systems Engineer
> >> chibi@gol.com           Global OnLine Japan/Rakuten Communications
> >> http://www.gol.com/
> 
> 
> 


-- 
Christian Balzer        Network/Systems Engineer                
chibi@gol.com   	Global OnLine Japan/Rakuten Communications
http://www.gol.com/

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Dramatic performance drop at certain number of objects in pool
       [not found]                           ` <20160623113717.446a1f9d-9yhXNL7Kh0lSCLKNlHTxZM8NsWr+9BEh@public.gmane.org>
@ 2016-06-23  2:55                             ` Blair Bethwaite
       [not found]                               ` <CA+z5DszcLqV32NnWeuu+WsRZoZwM493Jfy7WcSpVtaDyArwFAQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 34+ messages in thread
From: Blair Bethwaite @ 2016-06-23  2:55 UTC (permalink / raw)
  To: Christian Balzer; +Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw, Ceph Development

On 23 June 2016 at 12:37, Christian Balzer <chibi-FW+hd8ioUD0@public.gmane.org> wrote:
> Case in point, my main cluster (RBD images only) with 18 5+TB OSDs on 3
> servers (64GB RAM each) has 1.8 million 4MB RBD objects using about 7% of
> the available space.
> Don't think I could hit this problem before running out of space.

Perhaps. However, ~30TB per server is pretty low given present HDD
sizes. In the pool on our large cluster where we've seen this issue we
have 24x 4TB OSDs per server, and we first hit the problem in pre-prod
testing at about 20% usage (with default 4MB objects). We went to 40 / 8.
Then, as I reported the other day, we hit the issue again at somewhere
around 50% usage. Now we're at 50 / 12.

The boxes mentioned above are a couple of years old. Today we're
buying 2RU servers with 128TB in them (16x 8TB)!

Replacing our current NAS-on-RBD setup with CephFS is now starting to
scare me...

-- 
Cheers,
~Blairo

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Dramatic performance drop at certain number of objects in pool
       [not found]                               ` <CA+z5DszcLqV32NnWeuu+WsRZoZwM493Jfy7WcSpVtaDyArwFAQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2016-06-23  3:38                                 ` Christian Balzer
  0 siblings, 0 replies; 34+ messages in thread
From: Christian Balzer @ 2016-06-23  3:38 UTC (permalink / raw)
  To: Blair Bethwaite, Wade Holler
  Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw, Ceph Development


Hello Blair, hello Wade (see below),

On Thu, 23 Jun 2016 12:55:17 +1000 Blair Bethwaite wrote:

> On 23 June 2016 at 12:37, Christian Balzer <chibi-FW+hd8ioUD0@public.gmane.org> wrote:
> > Case in point, my main cluster (RBD images only) with 18 5+TB OSDs on 3
> > servers (64GB RAM each) has 1.8 million 4MB RBD objects using about 7%
> > of the available space.
> > Don't think I could hit this problem before running out of space.
> 
> Perhaps. However ~30TB per server is pretty low with present HDD
> sizes. 
These are in fact 24 3TB HDDs per server, but in 6 RAID10s with 4 HDDs
each.

>In the pool on our large cluster where we've seen this issue we
> have 24x 4TB OSDs per server, and we first hit the problem in pre-prod
> testing at about 20% usage (with default 4MB objects). We went to 40 /
> 8. Then as I reported the other day we hit the issue again at
> somewhere around 50% usage. Now we're at 50 / 12.
> 
High-density storage servers have a number of other gotchas and tuning
requirements; I'd consider this simply another one.

As for increasing the default RBD object size, I'd be wary of
performance impacts, especially if you are ever going to have a cache-tier.

If there is no cache-tier in your future for certain, striping might
counteract larger objects.
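
Something along these lines, purely for illustration (flag names are
from memory, so check them against your rbd version before relying on
them):

    rbd create rbd/stripe-test --size 102400 --image-format 2 \
        --order 23 --stripe-unit 1048576 --stripe-count 8

That is 8MB objects, but striped in 1MB units across 8 objects, so
smallish sequential writes still fan out across OSDs.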

> The boxes mentioned above are a couple of years old. Today we're
> buying 2RU servers with 128TB in them (16x 8TB)!
> 
As people (myself included) have noted, large OSDs are pushing things,
in more ways than just this issue.

I know very well how attractive it is from a cost and rack-space (also
a cost factor, of course) perspective to build dense storage nodes, but
most people need more IOPS than storage space, and that's where smaller,
faster OSDs are better suited, as the Ceph docs have pointed out for a
long time.

> Replacing our current NAS on RBD setup with CephFS is now starting to
> scare me...
> 
If this is going to happen when Bluestore is stable, this _particular_
problem should hopefully be a non-issue.
I'm sure Murphy will find other amusing ways to keep us entertained
and high-stressed, though.
If nothing else, CephFS would scare me more than a by-now well-known
problem that can be tuned away.


A question/request for Wade: would it be possible to reformat your OSDs
with Ext4 (I know, deprecated, but if you know what you're doing...) or
BTRFS? I'm wondering whether either avoids this behavior, or at least
hits it at a different point.

Christian
-- 
Christian Balzer        Network/Systems Engineer                
chibi-FW+hd8ioUD0@public.gmane.org   	Global OnLine Japan/Rakuten Communications
http://www.gol.com/

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Dramatic performance drop at certain number of objects in pool
       [not found]                           ` <CA+e22SfaiBUQ9Wanay6_oji9t7131o67B2oDtaEW_zXwqCJfbQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2016-06-23 22:09                             ` Warren Wang - ISD
       [not found]                               ` <D391D1A4.145D6%warren.wang-dFwxUrggiyBBDgjK7y7TUQ@public.gmane.org>
  0 siblings, 1 reply; 34+ messages in thread
From: Warren Wang - ISD @ 2016-06-23 22:09 UTC (permalink / raw)
  To: Wade Holler, Blair Bethwaite
  Cc: Ceph Development, ceph-users-idqoXFIVOFJgJs9I8MT0rw

	vm.vfs_cache_pressure = 100

Go the other direction on that. You'll want to keep it low to help keep
inode/dentry info in memory. We use 10, and haven't had a problem.
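
If you want to try it, it's the usual sysctl dance; the file name below
is arbitrary, so use whatever fits your config management:

    sysctl -w vm.vfs_cache_pressure=10
    echo 'vm.vfs_cache_pressure = 10' > /etc/sysctl.d/99-vfs-cache.conf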


Warren Wang




On 6/22/16, 9:41 PM, "Wade Holler" <wade.holler-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:

>Blairo,
>
>We'll speak in pre-replication numbers, replication for this pool is 3.
>
>23.3 Million Objects / OSD
>pg_num 2048
>16 OSDs / Server
>3 Servers
>660 GB RAM Total, 179 GB Used (free -t) / Server
>vm.swappiness = 1
>vm.vfs_cache_pressure = 100
>
>Workload is native librados with python.  ALL 4k objects.
>
>Best Regards,
>Wade
>
>
>On Wed, Jun 22, 2016 at 9:33 PM, Blair Bethwaite
><blair.bethwaite-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>> Wade, good to know.
>>
>> For the record, what does this work out to roughly per OSD? And how
>> much RAM and how many PGs per OSD do you have?
>>
>> What's your workload? I wonder whether for certain workloads (e.g.
>> RBD) it's better to increase default object size somewhat before
>> pushing the split/merge up a lot...
>>
>> Cheers,
>>
>> On 23 June 2016 at 11:26, Wade Holler <wade.holler-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>>> Based on everyones suggestions; The first modification to 50 / 16
>>> enabled our config to get to ~645Mill objects before the behavior in
>>> question was observed (~330 was the previous ceiling).  Subsequent
>>> modification to 50 / 24 has enabled us to get to 1.1 Billion+
>>>
>>> Thank you all very much for your support and assistance.
>>>
>>> Best Regards,
>>> Wade
>>>
>>>
>>> On Mon, Jun 20, 2016 at 6:58 PM, Christian Balzer <chibi-FW+hd8ioUD0@public.gmane.org>
>>>wrote:
>>>>
>>>> Hello,
>>>>
>>>> On Mon, 20 Jun 2016 20:47:32 +0000 Warren Wang - ISD wrote:
>>>>
>>>>> Sorry, late to the party here. I agree, up the merge and split
>>>>> thresholds. We're as high as 50/12. I chimed in on an RH ticket here.
>>>>> One of those things you just have to find out as an operator since
>>>>>it's
>>>>> not well documented :(
>>>>>
>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1219974
>>>>>
>>>>> We have over 200 million objects in this cluster, and it's still
>>>>>doing
>>>>> over 15000 write IOPS all day long with 302 spinning drives + SATA
>>>>>SSD
>>>>> journals. Having enough memory and dropping your vfs_cache_pressure
>>>>> should also help.
>>>>>
>>>> Indeed.
>>>>
>>>> Since it was asked in that bug report and also my first suspicion, it
>>>> would probably be good time to clarify that it isn't the splits that
>>>>cause
>>>> the performance degradation, but the resulting inflation of dir
>>>>entries
>>>> and exhaustion of SLAB and thus having to go to disk for things that
>>>> normally would be in memory.
>>>>
>>>> Looking at Blair's graph from yesterday pretty much makes that clear,
>>>>a
>>>> purely split caused degradation should have relented much quicker.
>>>>
>>>>
>>>>> Keep in mind that if you change the values, it won't take effect
>>>>> immediately. It only merges them back if the directory is under the
>>>>> calculated threshold and a write occurs (maybe a read, I forget).
>>>>>
>>>> If it's a read a plain scrub might do the trick.
>>>>
>>>> Christian
>>>>> Warren
>>>>>
>>>>>
>>>>> From: ceph-users
>>>>> 
>>>>><ceph-users-bounces-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org<mailto:ceph-users-bounces-idqoXFIVOFJ4Eiagz67IpQ@public.gmane.org
>>>>>h.com>>
>>>>> on behalf of Wade Holler
>>>>> <wade.holler-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org<mailto:wade.holler-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>> Date: Monday,
>>>>>June
>>>>> 20, 2016 at 2:48 PM To: Blair Bethwaite
>>>>> <blair.bethwaite-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org<mailto:blair.bethwaite-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>>, Wido
>>>>>den
>>>>> Hollander <wido-fspyXLx8qC4@public.gmane.org<mailto:wido-fspyXLx8qC4@public.gmane.org>> Cc: Ceph Development
>>>>> <ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org<mailto:ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>>,
>>>>> "ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org<mailto:ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org>"
>>>>> <ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org<mailto:ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org>>
>>>>>Subject:
>>>>> Re: [ceph-users] Dramatic performance drop at certain number of
>>>>>objects
>>>>> in pool
>>>>>
>>>>> Thanks everyone for your replies.  I sincerely appreciate it. We are
>>>>> testing with different pg_num and filestore_split_multiple settings.
>>>>> Early indications are .... well not great. Regardless it is nice to
>>>>> understand the symptoms better so we try to design around it.
>>>>>
>>>>> Best Regards,
>>>>> Wade
>>>>>
>>>>>
>>>>> On Mon, Jun 20, 2016 at 2:32 AM Blair Bethwaite
>>>>> <blair.bethwaite-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org<mailto:blair.bethwaite-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>> wrote:
>>>>>On
>>>>> 20 June 2016 at 09:21, Blair Bethwaite
>>>>> <blair.bethwaite-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org<mailto:blair.bethwaite-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>> wrote:
>>>>> > slow request issues). If you watch your xfs stats you'll likely get
>>>>> > further confirmation. In my experience xs_dir_lookups balloons
>>>>>(which
>>>>> > means directory lookups are missing cache and going to disk).
>>>>>
>>>>> Murphy's a bitch. Today we upgraded a cluster to latest Hammer in
>>>>> preparation for Jewel/RHCS2. Turns out when we last hit this very
>>>>> problem we had only ephemerally set the new filestore merge/split
>>>>> values - oops. Here's what started happening when we upgraded and
>>>>> restarted a bunch of OSDs:
>>>>> 
>>>>>https://au-east.erc.monash.edu.au/swift/v1/public/grafana-ceph-xs_dir_
>>>>>lookup.png
>>>>>
>>>>> Seemed to cause lots of slow requests :-/. We corrected it about
>>>>> 12:30, then still took a while to settle.
>>>>>
>>>>> --
>>>>> Cheers,
>>>>> ~Blairo
>>>>>
>>>>> This email and any files transmitted with it are confidential and
>>>>> intended solely for the individual or entity to whom they are
>>>>>addressed.
>>>>> If you have received this email in error destroy it immediately. ***
>>>>> Walmart Confidential ***
>>>>
>>>>
>>>> --
>>>> Christian Balzer        Network/Systems Engineer
>>>> chibi-FW+hd8ioUD0@public.gmane.org           Global OnLine Japan/Rakuten Communications
>>>> http://www.gol.com/
>>
>>
>>
>> --
>> Cheers,
>> ~Blairo

This email and any files transmitted with it are confidential and intended solely for the individual or entity to whom they are addressed. If you have received this email in error destroy it immediately. *** Walmart Confidential ***

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Dramatic performance drop at certain number of objects in pool
       [not found]                               ` <D391D1A4.145D6%warren.wang-dFwxUrggiyBBDgjK7y7TUQ@public.gmane.org>
@ 2016-06-23 22:24                                 ` Somnath Roy
       [not found]                                   ` <BL2PR02MB2115BD5C173011A0CB92F964F42D0-TNqo25UYn65rzea/mugEKanrV9Ap65cLvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  0 siblings, 1 reply; 34+ messages in thread
From: Somnath Roy @ 2016-06-23 22:24 UTC (permalink / raw)
  To: Warren Wang - ISD, Wade Holler, Blair Bethwaite
  Cc: Ceph Development, ceph-users-idqoXFIVOFJgJs9I8MT0rw

Or even vm.vfs_cache_pressure = 0 if you have sufficient memory to *pin* inode/dentries in memory.
We have been using that for a long time now (with 128 TB node memory) and it seems to help, especially for the random write workload, by saving the xattr reads in between.

Thanks & Regards
Somnath

-----Original Message-----
From: ceph-users [mailto:ceph-users-bounces-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org] On Behalf Of Warren Wang - ISD
Sent: Thursday, June 23, 2016 3:09 PM
To: Wade Holler; Blair Bethwaite
Cc: Ceph Development; ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
Subject: Re: [ceph-users] Dramatic performance drop at certain number of objects in pool

vm.vfs_cache_pressure = 100

Go the other direction on that. You'll want to keep it low to help keep inode/dentry info in memory. We use 10, and haven't had a problem.


Warren Wang




On 6/22/16, 9:41 PM, "Wade Holler" <wade.holler-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:

>Blairo,
>
>We'll speak in pre-replication numbers, replication for this pool is 3.
>
>23.3 Million Objects / OSD
>pg_num 2048
>16 OSDs / Server
>3 Servers
>660 GB RAM Total, 179 GB Used (free -t) / Server vm.swappiness = 1
>vm.vfs_cache_pressure = 100
>
>Workload is native librados with python.  ALL 4k objects.
>
>Best Regards,
>Wade
>
>
>On Wed, Jun 22, 2016 at 9:33 PM, Blair Bethwaite
><blair.bethwaite-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>> Wade, good to know.
>>
>> For the record, what does this work out to roughly per OSD? And how
>> much RAM and how many PGs per OSD do you have?
>>
>> What's your workload? I wonder whether for certain workloads (e.g.
>> RBD) it's better to increase default object size somewhat before
>> pushing the split/merge up a lot...
>>
>> Cheers,
>>
>> On 23 June 2016 at 11:26, Wade Holler <wade.holler-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>>> Based on everyones suggestions; The first modification to 50 / 16
>>> enabled our config to get to ~645Mill objects before the behavior in
>>> question was observed (~330 was the previous ceiling).  Subsequent
>>> modification to 50 / 24 has enabled us to get to 1.1 Billion+
>>>
>>> Thank you all very much for your support and assistance.
>>>
>>> Best Regards,
>>> Wade
>>>
>>>
>>> On Mon, Jun 20, 2016 at 6:58 PM, Christian Balzer <chibi-FW+hd8ioUD0@public.gmane.org>
>>>wrote:
>>>>
>>>> Hello,
>>>>
>>>> On Mon, 20 Jun 2016 20:47:32 +0000 Warren Wang - ISD wrote:
>>>>
>>>>> Sorry, late to the party here. I agree, up the merge and split
>>>>>thresholds. We're as high as 50/12. I chimed in on an RH ticket here.
>>>>> One of those things you just have to find out as an operator since
>>>>>it's  not well documented :(
>>>>>
>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1219974
>>>>>
>>>>> We have over 200 million objects in this cluster, and it's still
>>>>>doing  over 15000 write IOPS all day long with 302 spinning drives
>>>>>+ SATA SSD  journals. Having enough memory and dropping your
>>>>>vfs_cache_pressure  should also help.
>>>>>
>>>> Indeed.
>>>>
>>>> Since it was asked in that bug report and also my first suspicion,
>>>>it  would probably be good time to clarify that it isn't the splits
>>>>that cause  the performance degradation, but the resulting inflation
>>>>of dir entries  and exhaustion of SLAB and thus having to go to disk
>>>>for things that  normally would be in memory.
>>>>
>>>> Looking at Blair's graph from yesterday pretty much makes that
>>>>clear, a  purely split caused degradation should have relented much
>>>>quicker.
>>>>
>>>>
>>>>> Keep in mind that if you change the values, it won't take effect
>>>>> immediately. It only merges them back if the directory is under
>>>>> the calculated threshold and a write occurs (maybe a read, I forget).
>>>>>
>>>> If it's a read a plain scrub might do the trick.
>>>>
>>>> Christian
>>>>> Warren
>>>>>
>>>>>
>>>>> From: ceph-users
>>>>>
>>>>><ceph-users-bounces-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org<mailto:ceph-users-bounces@lists.
>>>>>cep
>>>>>h.com>>
>>>>> on behalf of Wade Holler
>>>>> <wade.holler-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org<mailto:wade.holler-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>> Date:
>>>>>Monday, June  20, 2016 at 2:48 PM To: Blair Bethwaite
>>>>><blair.bethwaite-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org<mailto:blair.bethwaite-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>>, Wido
>>>>>den  Hollander <wido-fspyXLx8qC4@public.gmane.org<mailto:wido-fspyXLx8qC4@public.gmane.org>> Cc: Ceph
>>>>>Development
>>>>><ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org<mailto:ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>>,
>>>>> "ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org<mailto:ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org>"
>>>>> <ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org<mailto:ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org>>
>>>>>Subject:
>>>>> Re: [ceph-users] Dramatic performance drop at certain number of
>>>>>objects  in pool
>>>>>
>>>>> Thanks everyone for your replies.  I sincerely appreciate it. We
>>>>> are testing with different pg_num and filestore_split_multiple settings.
>>>>> Early indications are .... well not great. Regardless it is nice
>>>>> to understand the symptoms better so we try to design around it.
>>>>>
>>>>> Best Regards,
>>>>> Wade
>>>>>
>>>>>
>>>>> On Mon, Jun 20, 2016 at 2:32 AM Blair Bethwaite
>>>>><blair.bethwaite-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org<mailto:blair.bethwaite-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>> wrote:
>>>>>On
>>>>> 20 June 2016 at 09:21, Blair Bethwaite
>>>>><blair.bethwaite-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org<mailto:blair.bethwaite-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>> wrote:
>>>>> > slow request issues). If you watch your xfs stats you'll likely
>>>>> > get further confirmation. In my experience xs_dir_lookups
>>>>> > balloons
>>>>>(which
>>>>> > means directory lookups are missing cache and going to disk).
>>>>>
>>>>> Murphy's a bitch. Today we upgraded a cluster to latest Hammer in
>>>>> preparation for Jewel/RHCS2. Turns out when we last hit this very
>>>>> problem we had only ephemerally set the new filestore merge/split
>>>>> values - oops. Here's what started happening when we upgraded and
>>>>> restarted a bunch of OSDs:
>>>>>
>>>>>https://au-east.erc.monash.edu.au/swift/v1/public/grafana-ceph-xs_d
>>>>>ir_
>>>>>lookup.png
>>>>>
>>>>> Seemed to cause lots of slow requests :-/. We corrected it about
>>>>> 12:30, then still took a while to settle.
>>>>>
>>>>> --
>>>>> Cheers,
>>>>> ~Blairo
>>>>>
>>>>> This email and any files transmitted with it are confidential and
>>>>>intended solely for the individual or entity to whom they are
>>>>>addressed.
>>>>> If you have received this email in error destroy it immediately.
>>>>>***  Walmart Confidential ***
>>>>
>>>>
>>>> --
>>>> Christian Balzer        Network/Systems Engineer
>>>> chibi-FW+hd8ioUD0@public.gmane.org           Global OnLine Japan/Rakuten Communications
>>>> http://www.gol.com/
>>
>>
>>
>> --
>> Cheers,
>> ~Blairo

This email and any files transmitted with it are confidential and intended solely for the individual or entity to whom they are addressed. If you have received this email in error destroy it immediately. *** Walmart Confidential *** _______________________________________________
ceph-users mailing list
ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies).

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Dramatic performance drop at certain number of objects in pool
       [not found]                                   ` <BL2PR02MB2115BD5C173011A0CB92F964F42D0-TNqo25UYn65rzea/mugEKanrV9Ap65cLvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
@ 2016-06-24  0:08                                     ` Christian Balzer
       [not found]                                       ` <20160624090806.1246b1ff-9yhXNL7Kh0lSCLKNlHTxZM8NsWr+9BEh@public.gmane.org>
  0 siblings, 1 reply; 34+ messages in thread
From: Christian Balzer @ 2016-06-24  0:08 UTC (permalink / raw)
  To: ceph-users-idqoXFIVOFJgJs9I8MT0rw; +Cc: Blair Bethwaite, Ceph Development


Hello,

On Thu, 23 Jun 2016 22:24:59 +0000 Somnath Roy wrote:

> Or even vm.vfs_cache_pressure = 0 if you have sufficient memory to *pin*
> inode/dentries in memory. We are using that for long now (with 128 TB
> node memory) and it seems helping specially for the random write
> workload and saving xattrs read in between.
>
128TB node memory, really?
Can I have some of those, too? ^o^
And here I was thinking that Wade's 660GB machines were on the excessive
side.

There's something to be said (and optimized for) when your storage nodes
have as much or more RAM than your compute nodes...

As for Warren, well spotted. 
I personally use vm.vfs_cache_pressure = 1; this avoids the potential
fireworks if your memory is really needed elsewhere, while normally
keeping things in memory.

Christian

> Thanks & Regards
> Somnath
> 
> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@lists.ceph.com] On Behalf Of
> Warren Wang - ISD Sent: Thursday, June 23, 2016 3:09 PM
> To: Wade Holler; Blair Bethwaite
> Cc: Ceph Development; ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Dramatic performance drop at certain number of
> objects in pool
> 
> vm.vfs_cache_pressure = 100
> 
> Go the other direction on that. You'll want to keep it low to help keep
> inode/dentry info in memory. We use 10, and haven't had a problem.
> 
> 
> Warren Wang
> 
> 
> 
> 
> On 6/22/16, 9:41 PM, "Wade Holler" <wade.holler@gmail.com> wrote:
> 
> >Blairo,
> >
> >We'll speak in pre-replication numbers, replication for this pool is 3.
> >
> >23.3 Million Objects / OSD
> >pg_num 2048
> >16 OSDs / Server
> >3 Servers
> >660 GB RAM Total, 179 GB Used (free -t) / Server vm.swappiness = 1
> >vm.vfs_cache_pressure = 100
> >
> >Workload is native librados with python.  ALL 4k objects.
> >
> >Best Regards,
> >Wade
> >
> >
> >On Wed, Jun 22, 2016 at 9:33 PM, Blair Bethwaite
> ><blair.bethwaite@gmail.com> wrote:
> >> Wade, good to know.
> >>
> >> For the record, what does this work out to roughly per OSD? And how
> >> much RAM and how many PGs per OSD do you have?
> >>
> >> What's your workload? I wonder whether for certain workloads (e.g.
> >> RBD) it's better to increase default object size somewhat before
> >> pushing the split/merge up a lot...
> >>
> >> Cheers,
> >>
> >> On 23 June 2016 at 11:26, Wade Holler <wade.holler@gmail.com> wrote:
> >>> Based on everyones suggestions; The first modification to 50 / 16
> >>> enabled our config to get to ~645Mill objects before the behavior in
> >>> question was observed (~330 was the previous ceiling).  Subsequent
> >>> modification to 50 / 24 has enabled us to get to 1.1 Billion+
> >>>
> >>> Thank you all very much for your support and assistance.
> >>>
> >>> Best Regards,
> >>> Wade
> >>>
> >>>
> >>> On Mon, Jun 20, 2016 at 6:58 PM, Christian Balzer <chibi@gol.com>
> >>>wrote:
> >>>>
> >>>> Hello,
> >>>>
> >>>> On Mon, 20 Jun 2016 20:47:32 +0000 Warren Wang - ISD wrote:
> >>>>
> >>>>> Sorry, late to the party here. I agree, up the merge and split
> >>>>>thresholds. We're as high as 50/12. I chimed in on an RH ticket
> >>>>>here.
> >>>>> One of those things you just have to find out as an operator since
> >>>>>it's  not well documented :(
> >>>>>
> >>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1219974
> >>>>>
> >>>>> We have over 200 million objects in this cluster, and it's still
> >>>>>doing  over 15000 write IOPS all day long with 302 spinning drives
> >>>>>+ SATA SSD  journals. Having enough memory and dropping your
> >>>>>vfs_cache_pressure  should also help.
> >>>>>
> >>>> Indeed.
> >>>>
> >>>> Since it was asked in that bug report and also my first suspicion,
> >>>>it  would probably be good time to clarify that it isn't the splits
> >>>>that cause  the performance degradation, but the resulting inflation
> >>>>of dir entries  and exhaustion of SLAB and thus having to go to disk
> >>>>for things that  normally would be in memory.
> >>>>
> >>>> Looking at Blair's graph from yesterday pretty much makes that
> >>>>clear, a  purely split caused degradation should have relented much
> >>>>quicker.
> >>>>
> >>>>
> >>>>> Keep in mind that if you change the values, it won't take effect
> >>>>> immediately. It only merges them back if the directory is under
> >>>>> the calculated threshold and a write occurs (maybe a read, I
> >>>>> forget).
> >>>>>
> >>>> If it's a read a plain scrub might do the trick.
> >>>>
> >>>> Christian
> >>>>> Warren
> >>>>>
> >>>>>
> >>>>> From: ceph-users
> >>>>>
> >>>>><ceph-users-bounces@lists.ceph.com<mailto:ceph-users-bounces@lists.
> >>>>>cep
> >>>>>h.com>>
> >>>>> on behalf of Wade Holler
> >>>>> <wade.holler@gmail.com<mailto:wade.holler@gmail.com>> Date:
> >>>>>Monday, June  20, 2016 at 2:48 PM To: Blair Bethwaite
> >>>>><blair.bethwaite@gmail.com<mailto:blair.bethwaite@gmail.com>>, Wido
> >>>>>den  Hollander <wido@42on.com<mailto:wido@42on.com>> Cc: Ceph
> >>>>>Development
> >>>>><ceph-devel@vger.kernel.org<mailto:ceph-devel@vger.kernel.org>>,
> >>>>> "ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>"
> >>>>> <ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>>
> >>>>>Subject:
> >>>>> Re: [ceph-users] Dramatic performance drop at certain number of
> >>>>>objects  in pool
> >>>>>
> >>>>> Thanks everyone for your replies.  I sincerely appreciate it. We
> >>>>> are testing with different pg_num and filestore_split_multiple
> >>>>> settings. Early indications are .... well not great. Regardless it
> >>>>> is nice to understand the symptoms better so we try to design
> >>>>> around it.
> >>>>>
> >>>>> Best Regards,
> >>>>> Wade
> >>>>>
> >>>>>
> >>>>> On Mon, Jun 20, 2016 at 2:32 AM Blair Bethwaite
> >>>>><blair.bethwaite@gmail.com<mailto:blair.bethwaite@gmail.com>> wrote:
> >>>>>On
> >>>>> 20 June 2016 at 09:21, Blair Bethwaite
> >>>>><blair.bethwaite@gmail.com<mailto:blair.bethwaite@gmail.com>> wrote:
> >>>>> > slow request issues). If you watch your xfs stats you'll likely
> >>>>> > get further confirmation. In my experience xs_dir_lookups
> >>>>> > balloons
> >>>>>(which
> >>>>> > means directory lookups are missing cache and going to disk).
> >>>>>
> >>>>> Murphy's a bitch. Today we upgraded a cluster to latest Hammer in
> >>>>> preparation for Jewel/RHCS2. Turns out when we last hit this very
> >>>>> problem we had only ephemerally set the new filestore merge/split
> >>>>> values - oops. Here's what started happening when we upgraded and
> >>>>> restarted a bunch of OSDs:
> >>>>>
> >>>>>https://au-east.erc.monash.edu.au/swift/v1/public/grafana-ceph-xs_d
> >>>>>ir_
> >>>>>lookup.png
> >>>>>
> >>>>> Seemed to cause lots of slow requests :-/. We corrected it about
> >>>>> 12:30, then still took a while to settle.
> >>>>>
> >>>>> --
> >>>>> Cheers,
> >>>>> ~Blairo
> >>>>>
> >>>>> This email and any files transmitted with it are confidential and
> >>>>>intended solely for the individual or entity to whom they are
> >>>>>addressed.
> >>>>> If you have received this email in error destroy it immediately.
> >>>>>***  Walmart Confidential ***
> >>>>
> >>>>
> >>>> --
> >>>> Christian Balzer        Network/Systems Engineer
> >>>> chibi@gol.com           Global OnLine Japan/Rakuten Communications
> >>>> http://www.gol.com/
> >>
> >>
> >>
> >> --
> >> Cheers,
> >> ~Blairo
> 
> This email and any files transmitted with it are confidential and
> intended solely for the individual or entity to whom they are addressed.
> If you have received this email in error destroy it immediately. ***
> Walmart Confidential *** _______________________________________________
> ceph-users mailing list ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com PLEASE NOTE: The
> information contained in this electronic mail message is intended only
> for the use of the designated recipient(s) named above. If the reader of
> this message is not the intended recipient, you are hereby notified that
> you have received this message in error and that any review,
> dissemination, distribution, or copying of this message is strictly
> prohibited. If you have received this communication in error, please
> notify the sender by telephone or e-mail (as shown above) immediately
> and destroy any and all copies of this message in your possession
> (whether hard copies or electronically stored copies).
> _______________________________________________ ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 


-- 
Christian Balzer        Network/Systems Engineer                
chibi@gol.com   	Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Dramatic performance drop at certain number of objects in pool
       [not found]                                       ` <20160624090806.1246b1ff-9yhXNL7Kh0lSCLKNlHTxZM8NsWr+9BEh@public.gmane.org>
@ 2016-06-24  0:09                                         ` Somnath Roy
  2016-06-24 14:23                                           ` [ceph-users] " Wade Holler
  0 siblings, 1 reply; 34+ messages in thread
From: Somnath Roy @ 2016-06-24  0:09 UTC (permalink / raw)
  To: Christian Balzer, ceph-users-idqoXFIVOFJgJs9I8MT0rw
  Cc: Blair Bethwaite, Ceph Development

Oops, typo, 128 GB :-)...

-----Original Message-----
From: Christian Balzer [mailto:chibi@gol.com]
Sent: Thursday, June 23, 2016 5:08 PM
To: ceph-users@lists.ceph.com
Cc: Somnath Roy; Warren Wang - ISD; Wade Holler; Blair Bethwaite; Ceph Development
Subject: Re: [ceph-users] Dramatic performance drop at certain number of objects in pool


Hello,

On Thu, 23 Jun 2016 22:24:59 +0000 Somnath Roy wrote:

> Or even vm.vfs_cache_pressure = 0 if you have sufficient memory to
> *pin* inode/dentries in memory. We are using that for long now (with
> 128 TB node memory) and it seems helping specially for the random
> write workload and saving xattrs read in between.
>
128TB node memory, really?
Can I have some of those, too? ^o^
And here I was thinking that Wade's 660GB machines were on the excessive side.

There's something to be said (and optimized) when your storage nodes have the same or more RAM as your compute nodes...

As for Warren, well spotted.
I personally use vm.vfs_cache_pressure = 1, this avoids the potential fireworks if your memory is really needed elsewhere, while keeping things in memory normally.

Christian

> Thanks & Regards
> Somnath
>
> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@lists.ceph.com] On Behalf
> Of Warren Wang - ISD Sent: Thursday, June 23, 2016 3:09 PM
> To: Wade Holler; Blair Bethwaite
> Cc: Ceph Development; ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Dramatic performance drop at certain number
> of objects in pool
>
> vm.vfs_cache_pressure = 100
>
> Go the other direction on that. You'll want to keep it low to help
> keep inode/dentry info in memory. We use 10, and haven't had a problem.
>
>
> Warren Wang
>
>
>
>
> On 6/22/16, 9:41 PM, "Wade Holler" <wade.holler@gmail.com> wrote:
>
> >Blairo,
> >
> >We'll speak in pre-replication numbers, replication for this pool is 3.
> >
> >23.3 Million Objects / OSD
> >pg_num 2048
> >16 OSDs / Server
> >3 Servers
> >660 GB RAM Total, 179 GB Used (free -t) / Server vm.swappiness = 1
> >vm.vfs_cache_pressure = 100
> >
> >Workload is native librados with python.  ALL 4k objects.
> >
> >Best Regards,
> >Wade
> >
> >
> >On Wed, Jun 22, 2016 at 9:33 PM, Blair Bethwaite
> ><blair.bethwaite@gmail.com> wrote:
> >> Wade, good to know.
> >>
> >> For the record, what does this work out to roughly per OSD? And how
> >> much RAM and how many PGs per OSD do you have?
> >>
> >> What's your workload? I wonder whether for certain workloads (e.g.
> >> RBD) it's better to increase default object size somewhat before
> >> pushing the split/merge up a lot...
> >>
> >> Cheers,
> >>
> >> On 23 June 2016 at 11:26, Wade Holler <wade.holler@gmail.com> wrote:
> >>> Based on everyones suggestions; The first modification to 50 / 16
> >>> enabled our config to get to ~645Mill objects before the behavior
> >>> in question was observed (~330 was the previous ceiling).
> >>> Subsequent modification to 50 / 24 has enabled us to get to 1.1
> >>> Billion+
> >>>
> >>> Thank you all very much for your support and assistance.
> >>>
> >>> Best Regards,
> >>> Wade
> >>>
> >>>
> >>> On Mon, Jun 20, 2016 at 6:58 PM, Christian Balzer <chibi@gol.com>
> >>>wrote:
> >>>>
> >>>> Hello,
> >>>>
> >>>> On Mon, 20 Jun 2016 20:47:32 +0000 Warren Wang - ISD wrote:
> >>>>
> >>>>> Sorry, late to the party here. I agree, up the merge and split
> >>>>>thresholds. We're as high as 50/12. I chimed in on an RH ticket
> >>>>>here.
> >>>>> One of those things you just have to find out as an operator
> >>>>>since it's  not well documented :(
> >>>>>
> >>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1219974
> >>>>>
> >>>>> We have over 200 million objects in this cluster, and it's still
> >>>>>doing  over 15000 write IOPS all day long with 302 spinning
> >>>>>drives
> >>>>>+ SATA SSD  journals. Having enough memory and dropping your
> >>>>>vfs_cache_pressure  should also help.
> >>>>>
> >>>> Indeed.
> >>>>
> >>>> Since it was asked in that bug report and also my first
> >>>>suspicion, it  would probably be good time to clarify that it
> >>>>isn't the splits that cause  the performance degradation, but the
> >>>>resulting inflation of dir entries  and exhaustion of SLAB and
> >>>>thus having to go to disk for things that  normally would be in memory.
> >>>>
> >>>> Looking at Blair's graph from yesterday pretty much makes that
> >>>>clear, a  purely split caused degradation should have relented
> >>>>much quicker.
> >>>>
> >>>>
> >>>>> Keep in mind that if you change the values, it won't take effect
> >>>>> immediately. It only merges them back if the directory is under
> >>>>> the calculated threshold and a write occurs (maybe a read, I
> >>>>> forget).
> >>>>>
> >>>> If it's a read a plain scrub might do the trick.
> >>>>
> >>>> Christian
> >>>>> Warren
> >>>>>
> >>>>>
> >>>>> From: ceph-users
> >>>>>
> >>>>><ceph-users-bounces@lists.ceph.com<mailto:ceph-users-bounces@lists.
> >>>>>cep
> >>>>>h.com>>
> >>>>> on behalf of Wade Holler
> >>>>> <wade.holler@gmail.com<mailto:wade.holler@gmail.com>> Date:
> >>>>>Monday, June  20, 2016 at 2:48 PM To: Blair Bethwaite
> >>>>><blair.bethwaite@gmail.com<mailto:blair.bethwaite@gmail.com>>,
> >>>>>Wido den  Hollander <wido@42on.com<mailto:wido@42on.com>> Cc:
> >>>>>Ceph Development
> >>>>><ceph-devel@vger.kernel.org<mailto:ceph-devel@vger.kernel.org>>,
> >>>>> "ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>"
> >>>>> <ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>>
> >>>>>Subject:
> >>>>> Re: [ceph-users] Dramatic performance drop at certain number of
> >>>>>objects  in pool
> >>>>>
> >>>>> Thanks everyone for your replies.  I sincerely appreciate it. We
> >>>>> are testing with different pg_num and filestore_split_multiple
> >>>>> settings. Early indications are .... well not great. Regardless
> >>>>> it is nice to understand the symptoms better so we try to design
> >>>>> around it.
> >>>>>
> >>>>> Best Regards,
> >>>>> Wade
> >>>>>
> >>>>>
> >>>>> On Mon, Jun 20, 2016 at 2:32 AM Blair Bethwaite
> >>>>><blair.bethwaite@gmail.com<mailto:blair.bethwaite@gmail.com>> wrote:
> >>>>>On
> >>>>> 20 June 2016 at 09:21, Blair Bethwaite
> >>>>><blair.bethwaite@gmail.com<mailto:blair.bethwaite@gmail.com>> wrote:
> >>>>> > slow request issues). If you watch your xfs stats you'll
> >>>>> > likely get further confirmation. In my experience
> >>>>> > xs_dir_lookups balloons
> >>>>>(which
> >>>>> > means directory lookups are missing cache and going to disk).
> >>>>>
> >>>>> Murphy's a bitch. Today we upgraded a cluster to latest Hammer
> >>>>> in preparation for Jewel/RHCS2. Turns out when we last hit this
> >>>>> very problem we had only ephemerally set the new filestore
> >>>>> merge/split values - oops. Here's what started happening when we
> >>>>> upgraded and restarted a bunch of OSDs:
> >>>>>
> >>>>>https://au-east.erc.monash.edu.au/swift/v1/public/grafana-ceph-xs
> >>>>>_d
> >>>>>ir_
> >>>>>lookup.png
> >>>>>
> >>>>> Seemed to cause lots of slow requests :-/. We corrected it about
> >>>>> 12:30, then still took a while to settle.
> >>>>>
> >>>>> --
> >>>>> Cheers,
> >>>>> ~Blairo
> >>>>>
> >>>>> This email and any files transmitted with it are confidential
> >>>>>and intended solely for the individual or entity to whom they are
> >>>>>addressed.
> >>>>> If you have received this email in error destroy it immediately.
> >>>>>***  Walmart Confidential ***
> >>>>
> >>>>
> >>>> --
> >>>> Christian Balzer        Network/Systems Engineer
> >>>> chibi@gol.com           Global OnLine Japan/Rakuten Communications
> >>>> http://www.gol.com/
> >>
> >>
> >>
> >> --
> >> Cheers,
> >> ~Blairo
>
> This email and any files transmitted with it are confidential and
> intended solely for the individual or entity to whom they are addressed.
> If you have received this email in error destroy it immediately. ***
> Walmart Confidential ***
> _______________________________________________
> ceph-users mailing list ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com PLEASE NOTE:
> The information contained in this electronic mail message is intended
> only for the use of the designated recipient(s) named above. If the
> reader of this message is not the intended recipient, you are hereby
> notified that you have received this message in error and that any
> review, dissemination, distribution, or copying of this message is
> strictly prohibited. If you have received this communication in error,
> please notify the sender by telephone or e-mail (as shown above)
> immediately and destroy any and all copies of this message in your
> possession (whether hard copies or electronically stored copies).
> _______________________________________________ ceph-users mailing
> list ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


--
Christian Balzer        Network/Systems Engineer
chibi@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [ceph-users] Dramatic performance drop at certain number of objects in pool
  2016-06-24  0:09                                         ` Somnath Roy
@ 2016-06-24 14:23                                           ` Wade Holler
       [not found]                                             ` <CA+e22SdmGJVzJX9+63T41UGsfFcxs9R=xZqniQyTgu-yG=h0cA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
       [not found]                                             ` <CAFMfnwoqbr+_c913oyxpvzHNS+NPdXX17dMdXoC1ZiuZM1GzPw@mail.gmail.com>
  0 siblings, 2 replies; 34+ messages in thread
From: Wade Holler @ 2016-06-24 14:23 UTC (permalink / raw)
  To: Somnath Roy
  Cc: Christian Balzer, ceph-users, Warren Wang - ISD, Blair Bethwaite,
	Ceph Development

On vm.vfs_cache_pressure = 1: we had this initially and I still think
it is the best choice for most configs.  However, with our large
memory footprint, vfs_cache_pressure=1 increased the likelihood of
hitting an issue where our write response time would double; a drop of
caches would then return response time to normal.  I don't claim to
totally understand this yet and only have speculation at the moment.
Again, thanks for this suggestion; I do think it is best for boxes
that don't have very large memory.
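
For reference, the "drop of caches" here is just the usual proc
interface, roughly (as root; 2 drops reclaimable slab such as
dentries/inodes, 3 would drop the page cache as well):

  sync
  echo 2 > /proc/sys/vm/drop_caches
  # watch the dentry / xfs inode slabs shrink
  grep -E 'dentry|xfs_inode' /proc/slabinfo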

@ Christian - reformatting to btrfs or ext4 is an option in my test
cluster.  I thought about that but needed to sort out xfs first (that's
what production will run right now).  You all have helped me do that,
thank you again.  I will circle back and test btrfs under the same
conditions.  I suspect that it will behave similarly, but it's only a
day and a half's work or so to test.
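
Roughly, per OSD, it would be something like this (ceph-disk syntax
from memory and the device names are placeholders, so treat it as a
sketch):

  ceph-disk zap /dev/sdX
  ceph-disk prepare --fs-type btrfs /dev/sdX /dev/nvmeXnYpZ   # data dev, journal dev
  ceph-disk activate /dev/sdX1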

Best Regards,
Wade


On Thu, Jun 23, 2016 at 8:09 PM, Somnath Roy <Somnath.Roy@sandisk.com> wrote:
> Oops , typo , 128 GB :-)...
>
> -----Original Message-----
> From: Christian Balzer [mailto:chibi@gol.com]
> Sent: Thursday, June 23, 2016 5:08 PM
> To: ceph-users@lists.ceph.com
> Cc: Somnath Roy; Warren Wang - ISD; Wade Holler; Blair Bethwaite; Ceph Development
> Subject: Re: [ceph-users] Dramatic performance drop at certain number of objects in pool
>
>
> Hello,
>
> On Thu, 23 Jun 2016 22:24:59 +0000 Somnath Roy wrote:
>
>> Or even vm.vfs_cache_pressure = 0 if you have sufficient memory to
>> *pin* inode/dentries in memory. We are using that for long now (with
>> 128 TB node memory) and it seems helping specially for the random
>> write workload and saving xattrs read in between.
>>
> 128TB node memory, really?
> Can I have some of those, too? ^o^
> And here I was thinking that Wade's 660GB machines were on the excessive side.
>
> There's something to be said (and optimized) when your storage nodes have the same or more RAM as your compute nodes...
>
> As for Warren, well spotted.
> I personally use vm.vfs_cache_pressure = 1, this avoids the potential fireworks if your memory is really needed elsewhere, while keeping things in memory normally.
>
> Christian
>
>> Thanks & Regards
>> Somnath
>>
>> -----Original Message-----
>> From: ceph-users [mailto:ceph-users-bounces@lists.ceph.com] On Behalf
>> Of Warren Wang - ISD Sent: Thursday, June 23, 2016 3:09 PM
>> To: Wade Holler; Blair Bethwaite
>> Cc: Ceph Development; ceph-users@lists.ceph.com
>> Subject: Re: [ceph-users] Dramatic performance drop at certain number
>> of objects in pool
>>
>> vm.vfs_cache_pressure = 100
>>
>> Go the other direction on that. You'll want to keep it low to help
>> keep inode/dentry info in memory. We use 10, and haven't had a problem.
>>
>>
>> Warren Wang
>>
>>
>>
>>
>> On 6/22/16, 9:41 PM, "Wade Holler" <wade.holler@gmail.com> wrote:
>>
>> >Blairo,
>> >
>> >We'll speak in pre-replication numbers, replication for this pool is 3.
>> >
>> >23.3 Million Objects / OSD
>> >pg_num 2048
>> >16 OSDs / Server
>> >3 Servers
>> >660 GB RAM Total, 179 GB Used (free -t) / Server vm.swappiness = 1
>> >vm.vfs_cache_pressure = 100
>> >
>> >Workload is native librados with python.  ALL 4k objects.
>> >
>> >Best Regards,
>> >Wade
>> >
>> >
>> >On Wed, Jun 22, 2016 at 9:33 PM, Blair Bethwaite
>> ><blair.bethwaite@gmail.com> wrote:
>> >> Wade, good to know.
>> >>
>> >> For the record, what does this work out to roughly per OSD? And how
>> >> much RAM and how many PGs per OSD do you have?
>> >>
>> >> What's your workload? I wonder whether for certain workloads (e.g.
>> >> RBD) it's better to increase default object size somewhat before
>> >> pushing the split/merge up a lot...
>> >>
>> >> Cheers,
>> >>
>> >> On 23 June 2016 at 11:26, Wade Holler <wade.holler@gmail.com> wrote:
>> >>> Based on everyones suggestions; The first modification to 50 / 16
>> >>> enabled our config to get to ~645Mill objects before the behavior
>> >>> in question was observed (~330 was the previous ceiling).
>> >>> Subsequent modification to 50 / 24 has enabled us to get to 1.1
>> >>> Billion+
>> >>>
>> >>> Thank you all very much for your support and assistance.
>> >>>
>> >>> Best Regards,
>> >>> Wade
>> >>>
>> >>>
>> >>> On Mon, Jun 20, 2016 at 6:58 PM, Christian Balzer <chibi@gol.com>
>> >>>wrote:
>> >>>>
>> >>>> Hello,
>> >>>>
>> >>>> On Mon, 20 Jun 2016 20:47:32 +0000 Warren Wang - ISD wrote:
>> >>>>
>> >>>>> Sorry, late to the party here. I agree, up the merge and split
>> >>>>>thresholds. We're as high as 50/12. I chimed in on an RH ticket
>> >>>>>here.
>> >>>>> One of those things you just have to find out as an operator
>> >>>>>since it's  not well documented :(
>> >>>>>
>> >>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1219974
>> >>>>>
>> >>>>> We have over 200 million objects in this cluster, and it's still
>> >>>>>doing  over 15000 write IOPS all day long with 302 spinning
>> >>>>>drives
>> >>>>>+ SATA SSD  journals. Having enough memory and dropping your
>> >>>>>vfs_cache_pressure  should also help.
>> >>>>>
>> >>>> Indeed.
>> >>>>
>> >>>> Since it was asked in that bug report and also my first
>> >>>>suspicion, it  would probably be good time to clarify that it
>> >>>>isn't the splits that cause  the performance degradation, but the
>> >>>>resulting inflation of dir entries  and exhaustion of SLAB and
>> >>>>thus having to go to disk for things that  normally would be in memory.
>> >>>>
>> >>>> Looking at Blair's graph from yesterday pretty much makes that
>> >>>>clear, a  purely split caused degradation should have relented
>> >>>>much quicker.
>> >>>>
>> >>>>
>> >>>>> Keep in mind that if you change the values, it won't take effect
>> >>>>> immediately. It only merges them back if the directory is under
>> >>>>> the calculated threshold and a write occurs (maybe a read, I
>> >>>>> forget).
>> >>>>>
>> >>>> If it's a read a plain scrub might do the trick.
>> >>>>
>> >>>> Christian
>> >>>>> Warren
>> >>>>>
>> >>>>>
>> >>>>> From: ceph-users
>> >>>>>
>> >>>>><ceph-users-bounces@lists.ceph.com<mailto:ceph-users-bounces@lists.
>> >>>>>cep
>> >>>>>h.com>>
>> >>>>> on behalf of Wade Holler
>> >>>>> <wade.holler@gmail.com<mailto:wade.holler@gmail.com>> Date:
>> >>>>>Monday, June  20, 2016 at 2:48 PM To: Blair Bethwaite
>> >>>>><blair.bethwaite@gmail.com<mailto:blair.bethwaite@gmail.com>>,
>> >>>>>Wido den  Hollander <wido@42on.com<mailto:wido@42on.com>> Cc:
>> >>>>>Ceph Development
>> >>>>><ceph-devel@vger.kernel.org<mailto:ceph-devel@vger.kernel.org>>,
>> >>>>> "ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>"
>> >>>>> <ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>>
>> >>>>>Subject:
>> >>>>> Re: [ceph-users] Dramatic performance drop at certain number of
>> >>>>>objects  in pool
>> >>>>>
>> >>>>> Thanks everyone for your replies.  I sincerely appreciate it. We
>> >>>>> are testing with different pg_num and filestore_split_multiple
>> >>>>> settings. Early indications are .... well not great. Regardless
>> >>>>> it is nice to understand the symptoms better so we try to design
>> >>>>> around it.
>> >>>>>
>> >>>>> Best Regards,
>> >>>>> Wade
>> >>>>>
>> >>>>>
>> >>>>> On Mon, Jun 20, 2016 at 2:32 AM Blair Bethwaite
>> >>>>><blair.bethwaite@gmail.com<mailto:blair.bethwaite@gmail.com>> wrote:
>> >>>>>On
>> >>>>> 20 June 2016 at 09:21, Blair Bethwaite
>> >>>>><blair.bethwaite@gmail.com<mailto:blair.bethwaite@gmail.com>> wrote:
>> >>>>> > slow request issues). If you watch your xfs stats you'll
>> >>>>> > likely get further confirmation. In my experience
>> >>>>> > xs_dir_lookups balloons
>> >>>>>(which
>> >>>>> > means directory lookups are missing cache and going to disk).
>> >>>>>
>> >>>>> Murphy's a bitch. Today we upgraded a cluster to latest Hammer
>> >>>>> in preparation for Jewel/RHCS2. Turns out when we last hit this
>> >>>>> very problem we had only ephemerally set the new filestore
>> >>>>> merge/split values - oops. Here's what started happening when we
>> >>>>> upgraded and restarted a bunch of OSDs:
>> >>>>>
>> >>>>>https://au-east.erc.monash.edu.au/swift/v1/public/grafana-ceph-xs
>> >>>>>_d
>> >>>>>ir_
>> >>>>>lookup.png
>> >>>>>
>> >>>>> Seemed to cause lots of slow requests :-/. We corrected it about
>> >>>>> 12:30, then still took a while to settle.
>> >>>>>
>> >>>>> --
>> >>>>> Cheers,
>> >>>>> ~Blairo
>> >>>>>
>> >>>>> This email and any files transmitted with it are confidential
>> >>>>>and intended solely for the individual or entity to whom they are
>> >>>>>addressed.
>> >>>>> If you have received this email in error destroy it immediately.
>> >>>>>***  Walmart Confidential ***
>> >>>>
>> >>>>
>> >>>> --
>> >>>> Christian Balzer        Network/Systems Engineer
>> >>>> chibi@gol.com           Global OnLine Japan/Rakuten Communications
>> >>>> http://www.gol.com/
>> >>
>> >>
>> >>
>> >> --
>> >> Cheers,
>> >> ~Blairo
>>
>> This email and any files transmitted with it are confidential and
>> intended solely for the individual or entity to whom they are addressed.
>> If you have received this email in error destroy it immediately. ***
>> Walmart Confidential ***
>> _______________________________________________
>> ceph-users mailing list ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com PLEASE NOTE:
>> The information contained in this electronic mail message is intended
>> only for the use of the designated recipient(s) named above. If the
>> reader of this message is not the intended recipient, you are hereby
>> notified that you have received this message in error and that any
>> review, dissemination, distribution, or copying of this message is
>> strictly prohibited. If you have received this communication in error,
>> please notify the sender by telephone or e-mail (as shown above)
>> immediately and destroy any and all copies of this message in your
>> possession (whether hard copies or electronically stored copies).
>> _______________________________________________ ceph-users mailing
>> list ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
>
> --
> Christian Balzer        Network/Systems Engineer
> chibi@gol.com   Global OnLine Japan/Rakuten Communications
> http://www.gol.com/
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Dramatic performance drop at certain number of objects in pool
       [not found]                                             ` <CA+e22SdmGJVzJX9+63T41UGsfFcxs9R=xZqniQyTgu-yG=h0cA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2016-06-24 16:24                                               ` Warren Wang - ISD
       [not found]                                                 ` <D392D6EB.146C6%warren.wang-dFwxUrggiyBBDgjK7y7TUQ@public.gmane.org>
  0 siblings, 1 reply; 34+ messages in thread
From: Warren Wang - ISD @ 2016-06-24 16:24 UTC (permalink / raw)
  To: Wade Holler, Somnath Roy
  Cc: Blair Bethwaite, ceph-users-idqoXFIVOFJgJs9I8MT0rw, Ceph Development

Oops, that reminds me, do you have min_free_kbytes set to something
reasonable like at least 2-4GB?
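
Something like this is what I mean (4GB shown; pick a value that fits
the box - it's a sketch rather than a recommendation, and the file name
is arbitrary):

  # value is in KiB
  sysctl vm.min_free_kbytes
  sysctl -w vm.min_free_kbytes=4194304
  echo 'vm.min_free_kbytes = 4194304' > /etc/sysctl.d/90-min-free.conf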

Warren Wang



On 6/24/16, 10:23 AM, "Wade Holler" <wade.holler@gmail.com> wrote:

>On the vm.vfs_cace_pressure = 1 :   We had this initially and I still
>think it is the best choice for most configs.  However with our large
>memory footprint, vfs_cache_pressure=1 increased the likelihood of
>hitting an issue where our write response time would double; then a
>drop of caches would return response time to normal.  I don't claim to
>totally understand this and I only have speculation at the moment.
>Again thanks for this suggestion, I do think it is best for boxes that
>don't have very large memory.
>
>@ Christian - reformatting to btrfs or ext4 is an option in my test
>cluster.  I thought about that but needed to sort xfs first. (thats
>what production will run right now) You all have helped me do that and
>thank you again.  I will circle back and test btrfs under the same
>conditions.  I suspect that it will behave similarly but it's only a
>day and half's work or so to test.
>
>Best Regards,
>Wade
>
>
>On Thu, Jun 23, 2016 at 8:09 PM, Somnath Roy <Somnath.Roy@sandisk.com>
>wrote:
>> Oops , typo , 128 GB :-)...
>>
>> -----Original Message-----
>> From: Christian Balzer [mailto:chibi@gol.com]
>> Sent: Thursday, June 23, 2016 5:08 PM
>> To: ceph-users@lists.ceph.com
>> Cc: Somnath Roy; Warren Wang - ISD; Wade Holler; Blair Bethwaite; Ceph
>>Development
>> Subject: Re: [ceph-users] Dramatic performance drop at certain number
>>of objects in pool
>>
>>
>> Hello,
>>
>> On Thu, 23 Jun 2016 22:24:59 +0000 Somnath Roy wrote:
>>
>>> Or even vm.vfs_cache_pressure = 0 if you have sufficient memory to
>>> *pin* inode/dentries in memory. We are using that for long now (with
>>> 128 TB node memory) and it seems helping specially for the random
>>> write workload and saving xattrs read in between.
>>>
>> 128TB node memory, really?
>> Can I have some of those, too? ^o^
>> And here I was thinking that Wade's 660GB machines were on the
>>excessive side.
>>
>> There's something to be said (and optimized) when your storage nodes
>>have the same or more RAM as your compute nodes...
>>
>> As for Warren, well spotted.
>> I personally use vm.vfs_cache_pressure = 1, this avoids the potential
>>fireworks if your memory is really needed elsewhere, while keeping
>>things in memory normally.
>>
>> Christian
>>
>>> Thanks & Regards
>>> Somnath
>>>
>>> -----Original Message-----
>>> From: ceph-users [mailto:ceph-users-bounces@lists.ceph.com] On Behalf
>>> Of Warren Wang - ISD Sent: Thursday, June 23, 2016 3:09 PM
>>> To: Wade Holler; Blair Bethwaite
>>> Cc: Ceph Development; ceph-users@lists.ceph.com
>>> Subject: Re: [ceph-users] Dramatic performance drop at certain number
>>> of objects in pool
>>>
>>> vm.vfs_cache_pressure = 100
>>>
>>> Go the other direction on that. You'll want to keep it low to help
>>> keep inode/dentry info in memory. We use 10, and haven't had a problem.
>>>
>>>
>>> Warren Wang
>>>
>>>
>>>
>>>
>>> On 6/22/16, 9:41 PM, "Wade Holler" <wade.holler@gmail.com> wrote:
>>>
>>> >Blairo,
>>> >
>>> >We'll speak in pre-replication numbers, replication for this pool is
>>>3.
>>> >
>>> >23.3 Million Objects / OSD
>>> >pg_num 2048
>>> >16 OSDs / Server
>>> >3 Servers
>>> >660 GB RAM Total, 179 GB Used (free -t) / Server vm.swappiness = 1
>>> >vm.vfs_cache_pressure = 100
>>> >
>>> >Workload is native librados with python.  ALL 4k objects.
>>> >
>>> >Best Regards,
>>> >Wade
>>> >
>>> >
>>> >On Wed, Jun 22, 2016 at 9:33 PM, Blair Bethwaite
>>> ><blair.bethwaite@gmail.com> wrote:
>>> >> Wade, good to know.
>>> >>
>>> >> For the record, what does this work out to roughly per OSD? And how
>>> >> much RAM and how many PGs per OSD do you have?
>>> >>
>>> >> What's your workload? I wonder whether for certain workloads (e.g.
>>> >> RBD) it's better to increase default object size somewhat before
>>> >> pushing the split/merge up a lot...
>>> >>
>>> >> Cheers,
>>> >>
>>> >> On 23 June 2016 at 11:26, Wade Holler <wade.holler@gmail.com> wrote:
>>> >>> Based on everyones suggestions; The first modification to 50 / 16
>>> >>> enabled our config to get to ~645Mill objects before the behavior
>>> >>> in question was observed (~330 was the previous ceiling).
>>> >>> Subsequent modification to 50 / 24 has enabled us to get to 1.1
>>> >>> Billion+
>>> >>>
>>> >>> Thank you all very much for your support and assistance.
>>> >>>
>>> >>> Best Regards,
>>> >>> Wade
>>> >>>
>>> >>>
>>> >>> On Mon, Jun 20, 2016 at 6:58 PM, Christian Balzer <chibi@gol.com>
>>> >>>wrote:
>>> >>>>
>>> >>>> Hello,
>>> >>>>
>>> >>>> On Mon, 20 Jun 2016 20:47:32 +0000 Warren Wang - ISD wrote:
>>> >>>>
>>> >>>>> Sorry, late to the party here. I agree, up the merge and split
>>> >>>>>thresholds. We're as high as 50/12. I chimed in on an RH ticket
>>> >>>>>here.
>>> >>>>> One of those things you just have to find out as an operator
>>> >>>>>since it's  not well documented :(
>>> >>>>>
>>> >>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1219974
>>> >>>>>
>>> >>>>> We have over 200 million objects in this cluster, and it's still
>>> >>>>>doing  over 15000 write IOPS all day long with 302 spinning
>>> >>>>>drives
>>> >>>>>+ SATA SSD  journals. Having enough memory and dropping your
>>> >>>>>vfs_cache_pressure  should also help.
>>> >>>>>
>>> >>>> Indeed.
>>> >>>>
>>> >>>> Since it was asked in that bug report and also my first
>>> >>>>suspicion, it  would probably be good time to clarify that it
>>> >>>>isn't the splits that cause  the performance degradation, but the
>>> >>>>resulting inflation of dir entries  and exhaustion of SLAB and
>>> >>>>thus having to go to disk for things that  normally would be in
>>>memory.
>>> >>>>
>>> >>>> Looking at Blair's graph from yesterday pretty much makes that
>>> >>>>clear, a  purely split caused degradation should have relented
>>> >>>>much quicker.
>>> >>>>
>>> >>>>
>>> >>>>> Keep in mind that if you change the values, it won't take effect
>>> >>>>> immediately. It only merges them back if the directory is under
>>> >>>>> the calculated threshold and a write occurs (maybe a read, I
>>> >>>>> forget).
>>> >>>>>
>>> >>>> If it's a read a plain scrub might do the trick.
>>> >>>>
>>> >>>> Christian
>>> >>>>> Warren
>>> >>>>>
>>> >>>>>
>>> >>>>> From: ceph-users
>>> >>>>>
>>> 
>>>>>>>><ceph-users-bounces@lists.ceph.com<mailto:ceph-users-bounces@lists.
>>> >>>>>cep
>>> >>>>>h.com>>
>>> >>>>> on behalf of Wade Holler
>>> >>>>> <wade.holler@gmail.com<mailto:wade.holler@gmail.com>> Date:
>>> >>>>>Monday, June  20, 2016 at 2:48 PM To: Blair Bethwaite
>>> >>>>><blair.bethwaite@gmail.com<mailto:blair.bethwaite@gmail.com>>,
>>> >>>>>Wido den  Hollander <wido@42on.com<mailto:wido@42on.com>> Cc:
>>> >>>>>Ceph Development
>>> >>>>><ceph-devel@vger.kernel.org<mailto:ceph-devel@vger.kernel.org>>,
>>> >>>>> "ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>"
>>> >>>>> <ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>>
>>> >>>>>Subject:
>>> >>>>> Re: [ceph-users] Dramatic performance drop at certain number of
>>> >>>>>objects  in pool
>>> >>>>>
>>> >>>>> Thanks everyone for your replies.  I sincerely appreciate it. We
>>> >>>>> are testing with different pg_num and filestore_split_multiple
>>> >>>>> settings. Early indications are .... well not great. Regardless
>>> >>>>> it is nice to understand the symptoms better so we try to design
>>> >>>>> around it.
>>> >>>>>
>>> >>>>> Best Regards,
>>> >>>>> Wade
>>> >>>>>
>>> >>>>>
>>> >>>>> On Mon, Jun 20, 2016 at 2:32 AM Blair Bethwaite
>>> >>>>><blair.bethwaite@gmail.com<mailto:blair.bethwaite@gmail.com>>
>>>wrote:
>>> >>>>>On
>>> >>>>> 20 June 2016 at 09:21, Blair Bethwaite
>>> >>>>><blair.bethwaite@gmail.com<mailto:blair.bethwaite@gmail.com>>
>>>wrote:
>>> >>>>> > slow request issues). If you watch your xfs stats you'll
>>> >>>>> > likely get further confirmation. In my experience
>>> >>>>> > xs_dir_lookups balloons
>>> >>>>>(which
>>> >>>>> > means directory lookups are missing cache and going to disk).
>>> >>>>>
>>> >>>>> Murphy's a bitch. Today we upgraded a cluster to latest Hammer
>>> >>>>> in preparation for Jewel/RHCS2. Turns out when we last hit this
>>> >>>>> very problem we had only ephemerally set the new filestore
>>> >>>>> merge/split values - oops. Here's what started happening when we
>>> >>>>> upgraded and restarted a bunch of OSDs:
>>> >>>>>
>>> >>>>>https://au-east.erc.monash.edu.au/swift/v1/public/grafana-ceph-xs
>>> >>>>>_d
>>> >>>>>ir_
>>> >>>>>lookup.png
>>> >>>>>
>>> >>>>> Seemed to cause lots of slow requests :-/. We corrected it about
>>> >>>>> 12:30, then still took a while to settle.
>>> >>>>>
>>> >>>>> --
>>> >>>>> Cheers,
>>> >>>>> ~Blairo
>>> >>>>>
>>> >>>>> This email and any files transmitted with it are confidential
>>> >>>>>and intended solely for the individual or entity to whom they are
>>> >>>>>addressed.
>>> >>>>> If you have received this email in error destroy it immediately.
>>> >>>>>***  Walmart Confidential ***
>>> >>>>
>>> >>>>
>>> >>>> --
>>> >>>> Christian Balzer        Network/Systems Engineer
>>> >>>> chibi@gol.com           Global OnLine Japan/Rakuten Communications
>>> >>>> http://www.gol.com/
>>> >>
>>> >>
>>> >>
>>> >> --
>>> >> Cheers,
>>> >> ~Blairo
>>>
>>> This email and any files transmitted with it are confidential and
>>> intended solely for the individual or entity to whom they are
>>>addressed.
>>> If you have received this email in error destroy it immediately. ***
>>> Walmart Confidential ***
>>> _______________________________________________
>>> ceph-users mailing list ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com PLEASE NOTE:
>>> The information contained in this electronic mail message is intended
>>> only for the use of the designated recipient(s) named above. If the
>>> reader of this message is not the intended recipient, you are hereby
>>> notified that you have received this message in error and that any
>>> review, dissemination, distribution, or copying of this message is
>>> strictly prohibited. If you have received this communication in error,
>>> please notify the sender by telephone or e-mail (as shown above)
>>> immediately and destroy any and all copies of this message in your
>>> possession (whether hard copies or electronically stored copies).
>>> _______________________________________________ ceph-users mailing
>>> list ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>
>>
>> --
>> Christian Balzer        Network/Systems Engineer
>> chibi@gol.com   Global OnLine Japan/Rakuten Communications
>> http://www.gol.com/
>> PLEASE NOTE: The information contained in this electronic mail message
>>is intended only for the use of the designated recipient(s) named above.
>>If the reader of this message is not the intended recipient, you are
>>hereby notified that you have received this message in error and that
>>any review, dissemination, distribution, or copying of this message is
>>strictly prohibited. If you have received this communication in error,
>>please notify the sender by telephone or e-mail (as shown above)
>>immediately and destroy any and all copies of this message in your
>>possession (whether hard copies or electronically stored copies).


_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Dramatic performance drop at certain number of objects in pool
       [not found]                                                 ` <D392D6EB.146C6%warren.wang-dFwxUrggiyBBDgjK7y7TUQ@public.gmane.org>
@ 2016-06-24 19:45                                                   ` Wade Holler
  2016-06-25  3:07                                                     ` [ceph-users] " Christian Balzer
  0 siblings, 1 reply; 34+ messages in thread
From: Wade Holler @ 2016-06-24 19:45 UTC (permalink / raw)
  To: Warren Wang - ISD
  Cc: Blair Bethwaite, ceph-users-idqoXFIVOFJgJs9I8MT0rw, Ceph Development

Not reasonable, as you say:

vm.min_free_kbytes = 90112

We're in recovery post-expansion (48->54 OSDs) right now, but free -t shows:

#free -t
              total        used        free      shared  buff/cache   available
Mem:      693097104   378383384    36870080      369292   277843640   250931372
Swap:       1048572         956     1047616
Total:    694145676   378384340    37917696


On Fri, Jun 24, 2016 at 12:24 PM, Warren Wang - ISD
<Warren.Wang@walmart.com> wrote:
> Oops, that reminds me, do you have min_free_kbytes set to something
> reasonable like at least 2-4GB?
>
> Warren Wang
>
>
>
> On 6/24/16, 10:23 AM, "Wade Holler" <wade.holler@gmail.com> wrote:
>
>>On the vm.vfs_cace_pressure = 1 :   We had this initially and I still
>>think it is the best choice for most configs.  However with our large
>>memory footprint, vfs_cache_pressure=1 increased the likelihood of
>>hitting an issue where our write response time would double; then a
>>drop of caches would return response time to normal.  I don't claim to
>>totally understand this and I only have speculation at the moment.
>>Again thanks for this suggestion, I do think it is best for boxes that
>>don't have very large memory.
>>
>>@ Christian - reformatting to btrfs or ext4 is an option in my test
>>cluster.  I thought about that but needed to sort xfs first. (thats
>>what production will run right now) You all have helped me do that and
>>thank you again.  I will circle back and test btrfs under the same
>>conditions.  I suspect that it will behave similarly but it's only a
>>day and half's work or so to test.
>>
>>Best Regards,
>>Wade
>>
>>
>>On Thu, Jun 23, 2016 at 8:09 PM, Somnath Roy <Somnath.Roy@sandisk.com>
>>wrote:
>>> Oops , typo , 128 GB :-)...
>>>
>>> -----Original Message-----
>>> From: Christian Balzer [mailto:chibi@gol.com]
>>> Sent: Thursday, June 23, 2016 5:08 PM
>>> To: ceph-users@lists.ceph.com
>>> Cc: Somnath Roy; Warren Wang - ISD; Wade Holler; Blair Bethwaite; Ceph
>>>Development
>>> Subject: Re: [ceph-users] Dramatic performance drop at certain number
>>>of objects in pool
>>>
>>>
>>> Hello,
>>>
>>> On Thu, 23 Jun 2016 22:24:59 +0000 Somnath Roy wrote:
>>>
>>>> Or even vm.vfs_cache_pressure = 0 if you have sufficient memory to
>>>> *pin* inode/dentries in memory. We are using that for long now (with
>>>> 128 TB node memory) and it seems helping specially for the random
>>>> write workload and saving xattrs read in between.
>>>>
>>> 128TB node memory, really?
>>> Can I have some of those, too? ^o^
>>> And here I was thinking that Wade's 660GB machines were on the
>>>excessive side.
>>>
>>> There's something to be said (and optimized) when your storage nodes
>>>have the same or more RAM as your compute nodes...
>>>
>>> As for Warren, well spotted.
>>> I personally use vm.vfs_cache_pressure = 1, this avoids the potential
>>>fireworks if your memory is really needed elsewhere, while keeping
>>>things in memory normally.
>>>
>>> Christian
>>>
>>>> Thanks & Regards
>>>> Somnath
>>>>
>>>> -----Original Message-----
>>>> From: ceph-users [mailto:ceph-users-bounces@lists.ceph.com] On Behalf
>>>> Of Warren Wang - ISD Sent: Thursday, June 23, 2016 3:09 PM
>>>> To: Wade Holler; Blair Bethwaite
>>>> Cc: Ceph Development; ceph-users@lists.ceph.com
>>>> Subject: Re: [ceph-users] Dramatic performance drop at certain number
>>>> of objects in pool
>>>>
>>>> vm.vfs_cache_pressure = 100
>>>>
>>>> Go the other direction on that. You'll want to keep it low to help
>>>> keep inode/dentry info in memory. We use 10, and haven't had a problem.
>>>>
>>>>
>>>> Warren Wang
>>>>
>>>>
>>>>
>>>>
>>>> On 6/22/16, 9:41 PM, "Wade Holler" <wade.holler@gmail.com> wrote:
>>>>
>>>> >Blairo,
>>>> >
>>>> >We'll speak in pre-replication numbers, replication for this pool is
>>>>3.
>>>> >
>>>> >23.3 Million Objects / OSD
>>>> >pg_num 2048
>>>> >16 OSDs / Server
>>>> >3 Servers
>>>> >660 GB RAM Total, 179 GB Used (free -t) / Server vm.swappiness = 1
>>>> >vm.vfs_cache_pressure = 100
>>>> >
>>>> >Workload is native librados with python.  ALL 4k objects.
>>>> >
>>>> >Best Regards,
>>>> >Wade
>>>> >
>>>> >
>>>> >On Wed, Jun 22, 2016 at 9:33 PM, Blair Bethwaite
>>>> ><blair.bethwaite@gmail.com> wrote:
>>>> >> Wade, good to know.
>>>> >>
>>>> >> For the record, what does this work out to roughly per OSD? And how
>>>> >> much RAM and how many PGs per OSD do you have?
>>>> >>
>>>> >> What's your workload? I wonder whether for certain workloads (e.g.
>>>> >> RBD) it's better to increase default object size somewhat before
>>>> >> pushing the split/merge up a lot...
>>>> >>
>>>> >> Cheers,
>>>> >>
>>>> >> On 23 June 2016 at 11:26, Wade Holler <wade.holler@gmail.com> wrote:
>>>> >>> Based on everyones suggestions; The first modification to 50 / 16
>>>> >>> enabled our config to get to ~645Mill objects before the behavior
>>>> >>> in question was observed (~330 was the previous ceiling).
>>>> >>> Subsequent modification to 50 / 24 has enabled us to get to 1.1
>>>> >>> Billion+
>>>> >>>
>>>> >>> Thank you all very much for your support and assistance.
>>>> >>>
>>>> >>> Best Regards,
>>>> >>> Wade
>>>> >>>
>>>> >>>
>>>> >>> On Mon, Jun 20, 2016 at 6:58 PM, Christian Balzer <chibi@gol.com>
>>>> >>>wrote:
>>>> >>>>
>>>> >>>> Hello,
>>>> >>>>
>>>> >>>> On Mon, 20 Jun 2016 20:47:32 +0000 Warren Wang - ISD wrote:
>>>> >>>>
>>>> >>>>> Sorry, late to the party here. I agree, up the merge and split
>>>> >>>>>thresholds. We're as high as 50/12. I chimed in on an RH ticket
>>>> >>>>>here.
>>>> >>>>> One of those things you just have to find out as an operator
>>>> >>>>>since it's  not well documented :(
>>>> >>>>>
>>>> >>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1219974
>>>> >>>>>
>>>> >>>>> We have over 200 million objects in this cluster, and it's still
>>>> >>>>>doing  over 15000 write IOPS all day long with 302 spinning
>>>> >>>>>drives
>>>> >>>>>+ SATA SSD  journals. Having enough memory and dropping your
>>>> >>>>>vfs_cache_pressure  should also help.
>>>> >>>>>
>>>> >>>> Indeed.
>>>> >>>>
>>>> >>>> Since it was asked in that bug report and also my first
>>>> >>>>suspicion, it  would probably be good time to clarify that it
>>>> >>>>isn't the splits that cause  the performance degradation, but the
>>>> >>>>resulting inflation of dir entries  and exhaustion of SLAB and
>>>> >>>>thus having to go to disk for things that  normally would be in
>>>>memory.
>>>> >>>>
>>>> >>>> Looking at Blair's graph from yesterday pretty much makes that
>>>> >>>>clear, a  purely split caused degradation should have relented
>>>> >>>>much quicker.
>>>> >>>>
>>>> >>>>
>>>> >>>>> Keep in mind that if you change the values, it won't take effect
>>>> >>>>> immediately. It only merges them back if the directory is under
>>>> >>>>> the calculated threshold and a write occurs (maybe a read, I
>>>> >>>>> forget).
>>>> >>>>>
>>>> >>>> If it's a read a plain scrub might do the trick.
>>>> >>>>
>>>> >>>> Christian
>>>> >>>>> Warren
>>>> >>>>>
>>>> >>>>>
>>>> >>>>> From: ceph-users
>>>> >>>>>
>>>>
>>>>>>>>><ceph-users-bounces@lists.ceph.com<mailto:ceph-users-bounces@lists.
>>>> >>>>>cep
>>>> >>>>>h.com>>
>>>> >>>>> on behalf of Wade Holler
>>>> >>>>> <wade.holler@gmail.com<mailto:wade.holler@gmail.com>> Date:
>>>> >>>>>Monday, June  20, 2016 at 2:48 PM To: Blair Bethwaite
>>>> >>>>><blair.bethwaite@gmail.com<mailto:blair.bethwaite@gmail.com>>,
>>>> >>>>>Wido den  Hollander <wido@42on.com<mailto:wido@42on.com>> Cc:
>>>> >>>>>Ceph Development
>>>> >>>>><ceph-devel@vger.kernel.org<mailto:ceph-devel@vger.kernel.org>>,
>>>> >>>>> "ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>"
>>>> >>>>> <ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>>
>>>> >>>>>Subject:
>>>> >>>>> Re: [ceph-users] Dramatic performance drop at certain number of
>>>> >>>>>objects  in pool
>>>> >>>>>
>>>> >>>>> Thanks everyone for your replies.  I sincerely appreciate it. We
>>>> >>>>> are testing with different pg_num and filestore_split_multiple
>>>> >>>>> settings. Early indications are .... well not great. Regardless
>>>> >>>>> it is nice to understand the symptoms better so we try to design
>>>> >>>>> around it.
>>>> >>>>>
>>>> >>>>> Best Regards,
>>>> >>>>> Wade
>>>> >>>>>
>>>> >>>>>
>>>> >>>>> On Mon, Jun 20, 2016 at 2:32 AM Blair Bethwaite
>>>> >>>>><blair.bethwaite@gmail.com<mailto:blair.bethwaite@gmail.com>>
>>>>wrote:
>>>> >>>>>On
>>>> >>>>> 20 June 2016 at 09:21, Blair Bethwaite
>>>> >>>>><blair.bethwaite@gmail.com<mailto:blair.bethwaite@gmail.com>>
>>>>wrote:
>>>> >>>>> > slow request issues). If you watch your xfs stats you'll
>>>> >>>>> > likely get further confirmation. In my experience
>>>> >>>>> > xs_dir_lookups balloons
>>>> >>>>>(which
>>>> >>>>> > means directory lookups are missing cache and going to disk).
>>>> >>>>>
>>>> >>>>> Murphy's a bitch. Today we upgraded a cluster to latest Hammer
>>>> >>>>> in preparation for Jewel/RHCS2. Turns out when we last hit this
>>>> >>>>> very problem we had only ephemerally set the new filestore
>>>> >>>>> merge/split values - oops. Here's what started happening when we
>>>> >>>>> upgraded and restarted a bunch of OSDs:
>>>> >>>>>
>>>> >>>>>https://au-east.erc.monash.edu.au/swift/v1/public/grafana-ceph-xs
>>>> >>>>>_d
>>>> >>>>>ir_
>>>> >>>>>lookup.png
>>>> >>>>>
>>>> >>>>> Seemed to cause lots of slow requests :-/. We corrected it about
>>>> >>>>> 12:30, then still took a while to settle.
>>>> >>>>>
>>>> >>>>> --
>>>> >>>>> Cheers,
>>>> >>>>> ~Blairo
>>>> >>>>>
>>>> >>>>> This email and any files transmitted with it are confidential
>>>> >>>>>and intended solely for the individual or entity to whom they are
>>>> >>>>>addressed.
>>>> >>>>> If you have received this email in error destroy it immediately.
>>>> >>>>>***  Walmart Confidential ***
>>>> >>>>
>>>> >>>>
>>>> >>>> --
>>>> >>>> Christian Balzer        Network/Systems Engineer
>>>> >>>> chibi@gol.com           Global OnLine Japan/Rakuten Communications
>>>> >>>> http://www.gol.com/
>>>> >>
>>>> >>
>>>> >>
>>>> >> --
>>>> >> Cheers,
>>>> >> ~Blairo
>>>>
>>>> This email and any files transmitted with it are confidential and
>>>> intended solely for the individual or entity to whom they are
>>>>addressed.
>>>> If you have received this email in error destroy it immediately. ***
>>>> Walmart Confidential ***
>>>> _______________________________________________
>>>> ceph-users mailing list ceph-users@lists.ceph.com
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com PLEASE NOTE:
>>>> The information contained in this electronic mail message is intended
>>>> only for the use of the designated recipient(s) named above. If the
>>>> reader of this message is not the intended recipient, you are hereby
>>>> notified that you have received this message in error and that any
>>>> review, dissemination, distribution, or copying of this message is
>>>> strictly prohibited. If you have received this communication in error,
>>>> please notify the sender by telephone or e-mail (as shown above)
>>>> immediately and destroy any and all copies of this message in your
>>>> possession (whether hard copies or electronically stored copies).
>>>> _______________________________________________ ceph-users mailing
>>>> list ceph-users@lists.ceph.com
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>
>>>
>>>
>>> --
>>> Christian Balzer        Network/Systems Engineer
>>> chibi@gol.com   Global OnLine Japan/Rakuten Communications
>>> http://www.gol.com/
>>> PLEASE NOTE: The information contained in this electronic mail message
>>>is intended only for the use of the designated recipient(s) named above.
>>>If the reader of this message is not the intended recipient, you are
>>>hereby notified that you have received this message in error and that
>>>any review, dissemination, distribution, or copying of this message is
>>>strictly prohibited. If you have received this communication in error,
>>>please notify the sender by telephone or e-mail (as shown above)
>>>immediately and destroy any and all copies of this message in your
>>>possession (whether hard copies or electronically stored copies).
>
>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [ceph-users] Dramatic performance drop at certain number of objects in pool
  2016-06-24 19:45                                                   ` Wade Holler
@ 2016-06-25  3:07                                                     ` Christian Balzer
  0 siblings, 0 replies; 34+ messages in thread
From: Christian Balzer @ 2016-06-25  3:07 UTC (permalink / raw)
  To: Wade Holler
  Cc: Warren Wang - ISD, Somnath Roy, ceph-users, Blair Bethwaite,
	Ceph Development


Hello,

On Fri, 24 Jun 2016 15:45:52 -0400 Wade Holler wrote:

> Not reasonable as you say :
> 
> vm.min_free_kbytes = 90112
>
Yeah, my nodes with IB adapters all have that set to at least 512MB, or
1GB if they have more than 64GB of RAM.
 
> we're in recovery post expansion (48->54 OSDs) right now but free -t is:
> 
> #free -t
> 
Free can be very misleading when it comes to the actual state of things
with regard to memory fragmentation.

Take a look at "cat /proc/buddyinfo" and read up on Linux memory
fragmentation.
Also this:
https://www.mail-archive.com/ceph-users@lists.ceph.com/msg22214.html
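
A quick way to eyeball both fragmentation and slab pressure, as a rough
sketch:

  # free blocks per order; sparse right-hand columns = fragmented memory
  cat /proc/buddyinfo
  # reclaimable vs unreclaimable slab
  grep -E 'Slab|SReclaimable|SUnreclaim' /proc/meminfo
  # the caches that matter here
  grep -E 'dentry|xfs_inode' /proc/slabinfo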

Christian

>               total        used        free      shared  buff/cache
> available
> 
> Mem:      693097104   378383384    36870080      369292   277843640
> 250931372
> 
> Swap:       1048572         956     1047616
> 
> Total:    694145676   378384340    37917696
> 
> 
> On Fri, Jun 24, 2016 at 12:24 PM, Warren Wang - ISD
> <Warren.Wang@walmart.com> wrote:
> > Oops, that reminds me, do you have min_free_kbytes set to something
> > reasonable like at least 2-4GB?
> >
> > Warren Wang
> >
> >
> >
> > On 6/24/16, 10:23 AM, "Wade Holler" <wade.holler@gmail.com> wrote:
> >
> >>On the vm.vfs_cace_pressure = 1 :   We had this initially and I still
> >>think it is the best choice for most configs.  However with our large
> >>memory footprint, vfs_cache_pressure=1 increased the likelihood of
> >>hitting an issue where our write response time would double; then a
> >>drop of caches would return response time to normal.  I don't claim to
> >>totally understand this and I only have speculation at the moment.
> >>Again thanks for this suggestion, I do think it is best for boxes that
> >>don't have very large memory.
> >>
> >>@ Christian - reformatting to btrfs or ext4 is an option in my test
> >>cluster.  I thought about that but needed to sort xfs first. (thats
> >>what production will run right now) You all have helped me do that and
> >>thank you again.  I will circle back and test btrfs under the same
> >>conditions.  I suspect that it will behave similarly but it's only a
> >>day and half's work or so to test.
> >>
> >>Best Regards,
> >>Wade
> >>
> >>
> >>On Thu, Jun 23, 2016 at 8:09 PM, Somnath Roy <Somnath.Roy@sandisk.com>
> >>wrote:
> >>> Oops , typo , 128 GB :-)...
> >>>
> >>> -----Original Message-----
> >>> From: Christian Balzer [mailto:chibi@gol.com]
> >>> Sent: Thursday, June 23, 2016 5:08 PM
> >>> To: ceph-users@lists.ceph.com
> >>> Cc: Somnath Roy; Warren Wang - ISD; Wade Holler; Blair Bethwaite;
> >>> Ceph
> >>>Development
> >>> Subject: Re: [ceph-users] Dramatic performance drop at certain number
> >>>of objects in pool
> >>>
> >>>
> >>> Hello,
> >>>
> >>> On Thu, 23 Jun 2016 22:24:59 +0000 Somnath Roy wrote:
> >>>
> >>>> Or even vm.vfs_cache_pressure = 0 if you have sufficient memory to
> >>>> *pin* inode/dentries in memory. We are using that for long now (with
> >>>> 128 TB node memory) and it seems helping specially for the random
> >>>> write workload and saving xattrs read in between.
> >>>>
> >>> 128TB node memory, really?
> >>> Can I have some of those, too? ^o^
> >>> And here I was thinking that Wade's 660GB machines were on the
> >>>excessive side.
> >>>
> >>> There's something to be said (and optimized) when your storage nodes
> >>>have the same or more RAM as your compute nodes...
> >>>
> >>> As for Warren, well spotted.
> >>> I personally use vm.vfs_cache_pressure = 1, this avoids the potential
> >>>fireworks if your memory is really needed elsewhere, while keeping
> >>>things in memory normally.
> >>>
> >>> Christian
> >>>
> >>>> Thanks & Regards
> >>>> Somnath
> >>>>
> >>>> -----Original Message-----
> >>>> From: ceph-users [mailto:ceph-users-bounces@lists.ceph.com] On
> >>>> Behalf Of Warren Wang - ISD Sent: Thursday, June 23, 2016 3:09 PM
> >>>> To: Wade Holler; Blair Bethwaite
> >>>> Cc: Ceph Development; ceph-users@lists.ceph.com
> >>>> Subject: Re: [ceph-users] Dramatic performance drop at certain
> >>>> number of objects in pool
> >>>>
> >>>> vm.vfs_cache_pressure = 100
> >>>>
> >>>> Go the other direction on that. You'll want to keep it low to help
> >>>> keep inode/dentry info in memory. We use 10, and haven't had a
> >>>> problem.
> >>>>
> >>>>
> >>>> Warren Wang
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> On 6/22/16, 9:41 PM, "Wade Holler" <wade.holler@gmail.com> wrote:
> >>>>
> >>>> >Blairo,
> >>>> >
> >>>> >We'll speak in pre-replication numbers, replication for this pool
> >>>> >is
> >>>>3.
> >>>> >
> >>>> >23.3 Million Objects / OSD
> >>>> >pg_num 2048
> >>>> >16 OSDs / Server
> >>>> >3 Servers
> >>>> >660 GB RAM Total, 179 GB Used (free -t) / Server vm.swappiness = 1
> >>>> >vm.vfs_cache_pressure = 100
> >>>> >
> >>>> >Workload is native librados with python.  ALL 4k objects.
> >>>> >
> >>>> >Best Regards,
> >>>> >Wade
> >>>> >
> >>>> >
> >>>> >On Wed, Jun 22, 2016 at 9:33 PM, Blair Bethwaite
> >>>> ><blair.bethwaite@gmail.com> wrote:
> >>>> >> Wade, good to know.
> >>>> >>
> >>>> >> For the record, what does this work out to roughly per OSD? And
> >>>> >> how much RAM and how many PGs per OSD do you have?
> >>>> >>
> >>>> >> What's your workload? I wonder whether for certain workloads
> >>>> >> (e.g. RBD) it's better to increase default object size somewhat
> >>>> >> before pushing the split/merge up a lot...
> >>>> >>
> >>>> >> Cheers,
> >>>> >>
> >>>> >> On 23 June 2016 at 11:26, Wade Holler <wade.holler@gmail.com>
> >>>> >> wrote:
> >>>> >>> Based on everyone's suggestions, the first modification to 50 /
> >>>> >>> 16 enabled our config to get to ~645Mill objects before the
> >>>> >>> behavior in question was observed (~330 was the previous
> >>>> >>> ceiling). Subsequent modification to 50 / 24 has enabled us to
> >>>> >>> get to 1.1 Billion+
> >>>> >>>
> >>>> >>> Thank you all very much for your support and assistance.
> >>>> >>>
> >>>> >>> Best Regards,
> >>>> >>> Wade
> >>>> >>>
> >>>> >>>
> >>>> >>> On Mon, Jun 20, 2016 at 6:58 PM, Christian Balzer
> >>>> >>> <chibi@gol.com>
> >>>> >>>wrote:
> >>>> >>>>
> >>>> >>>> Hello,
> >>>> >>>>
> >>>> >>>> On Mon, 20 Jun 2016 20:47:32 +0000 Warren Wang - ISD wrote:
> >>>> >>>>
> >>>> >>>>> Sorry, late to the party here. I agree, up the merge and split
> >>>> >>>>>thresholds. We're as high as 50/12. I chimed in on an RH ticket
> >>>> >>>>>here.
> >>>> >>>>> One of those things you just have to find out as an operator
> >>>> >>>>>since it's  not well documented :(
> >>>> >>>>>
> >>>> >>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1219974
> >>>> >>>>>
> >>>> >>>>> We have over 200 million objects in this cluster, and it's
> >>>> >>>>> still
> >>>> >>>>>doing  over 15000 write IOPS all day long with 302 spinning
> >>>> >>>>>drives
> >>>> >>>>>+ SATA SSD  journals. Having enough memory and dropping your
> >>>> >>>>>vfs_cache_pressure  should also help.
> >>>> >>>>>
> >>>> >>>> Indeed.
> >>>> >>>>
> >>>> >>>> Since it was asked in that bug report and also my first
> >>>> >>>>suspicion, it  would probably be good time to clarify that it
> >>>> >>>>isn't the splits that cause  the performance degradation, but
> >>>> >>>>the resulting inflation of dir entries  and exhaustion of SLAB
> >>>> >>>>and thus having to go to disk for things that  normally would
> >>>> >>>>be in
> >>>>memory.
> >>>> >>>>
> >>>> >>>> Looking at Blair's graph from yesterday pretty much makes that
> >>>> >>>>clear, a  purely split caused degradation should have relented
> >>>> >>>>much quicker.
> >>>> >>>>
> >>>> >>>>
> >>>> >>>>> Keep in mind that if you change the values, it won't take
> >>>> >>>>> effect immediately. It only merges them back if the directory
> >>>> >>>>> is under the calculated threshold and a write occurs (maybe a
> >>>> >>>>> read, I forget).
> >>>> >>>>>
> >>>> >>>> If it's a read a plain scrub might do the trick.
> >>>> >>>>
> >>>> >>>> Christian
> >>>> >>>>> Warren
> >>>> >>>>>
> >>>> >>>>>
> >>>> >>>>> From: ceph-users
> >>>> >>>>>
> >>>>
> >>>>>>>>><ceph-users-bounces@lists.ceph.com<mailto:ceph-users-bounces@lists.ceph.com>>
> >>>> >>>>> on behalf of Wade Holler
> >>>> >>>>> <wade.holler@gmail.com<mailto:wade.holler@gmail.com>> Date:
> >>>> >>>>>Monday, June  20, 2016 at 2:48 PM To: Blair Bethwaite
> >>>> >>>>><blair.bethwaite@gmail.com<mailto:blair.bethwaite@gmail.com>>,
> >>>> >>>>>Wido den  Hollander <wido@42on.com<mailto:wido@42on.com>> Cc:
> >>>> >>>>>Ceph Development
> >>>> >>>>><ceph-devel@vger.kernel.org<mailto:ceph-devel@vger.kernel.org>>,
> >>>> >>>>> "ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>"
> >>>> >>>>> <ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>>
> >>>> >>>>>Subject:
> >>>> >>>>> Re: [ceph-users] Dramatic performance drop at certain number
> >>>> >>>>> of
> >>>> >>>>>objects  in pool
> >>>> >>>>>
> >>>> >>>>> Thanks everyone for your replies.  I sincerely appreciate it.
> >>>> >>>>> We are testing with different pg_num and
> >>>> >>>>> filestore_split_multiple settings. Early indications are ....
> >>>> >>>>> well not great. Regardless it is nice to understand the
> >>>> >>>>> symptoms better so we try to design around it.
> >>>> >>>>>
> >>>> >>>>> Best Regards,
> >>>> >>>>> Wade
> >>>> >>>>>
> >>>> >>>>>
> >>>> >>>>> On Mon, Jun 20, 2016 at 2:32 AM Blair Bethwaite
> >>>> >>>>><blair.bethwaite@gmail.com<mailto:blair.bethwaite@gmail.com>>
> >>>>wrote:
> >>>> >>>>>On
> >>>> >>>>> 20 June 2016 at 09:21, Blair Bethwaite
> >>>> >>>>><blair.bethwaite@gmail.com<mailto:blair.bethwaite@gmail.com>>
> >>>>wrote:
> >>>> >>>>> > slow request issues). If you watch your xfs stats you'll
> >>>> >>>>> > likely get further confirmation. In my experience
> >>>> >>>>> > xs_dir_lookups balloons
> >>>> >>>>>(which
> >>>> >>>>> > means directory lookups are missing cache and going to
> >>>> >>>>> > disk).
> >>>> >>>>>
> >>>> >>>>> Murphy's a bitch. Today we upgraded a cluster to latest Hammer
> >>>> >>>>> in preparation for Jewel/RHCS2. Turns out when we last hit
> >>>> >>>>> this very problem we had only ephemerally set the new
> >>>> >>>>> filestore merge/split values - oops. Here's what started
> >>>> >>>>> happening when we upgraded and restarted a bunch of OSDs:
> >>>> >>>>>
> >>>> >>>>>https://au-east.erc.monash.edu.au/swift/v1/public/grafana-ceph-xs_dir_lookup.png
> >>>> >>>>>
> >>>> >>>>> Seemed to cause lots of slow requests :-/. We corrected it
> >>>> >>>>> about 12:30, then still took a while to settle.
> >>>> >>>>>
> >>>> >>>>> --
> >>>> >>>>> Cheers,
> >>>> >>>>> ~Blairo
> >>>> >>>>>
> >>>> >>>>
> >>>> >>>>
> >>>> >>>> --
> >>>> >>>> Christian Balzer        Network/Systems Engineer
> >>>> >>>> chibi@gol.com           Global OnLine Japan/Rakuten
> >>>> >>>> Communications http://www.gol.com/
> >>>> >>
> >>>> >>
> >>>> >>
> >>>> >> --
> >>>> >> Cheers,
> >>>> >> ~Blairo
> >>>>
> >>>>
> >>>
> >>>
> >>> --
> >>> Christian Balzer        Network/Systems Engineer
> >>> chibi@gol.com   Global OnLine Japan/Rakuten Communications
> >>> http://www.gol.com/
> >
> >
> 


-- 
Christian Balzer        Network/Systems Engineer                
chibi@gol.com   	Global OnLine Japan/Rakuten Communications
http://www.gol.com/

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Dramatic performance drop at certain number of objects in pool
       [not found]                                               ` <CAFMfnwoqbr+_c913oyxpvzHNS+NPdXX17dMdXoC1ZiuZM1GzPw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2016-06-27  8:12                                                 ` Blair Bethwaite
  0 siblings, 0 replies; 34+ messages in thread
From: Blair Bethwaite @ 2016-06-27  8:12 UTC (permalink / raw)
  To: Kyle Bader; +Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw, Ceph Development


On 25 Jun 2016 6:02 PM, "Kyle Bader" <kyle.bader-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> fdatasync takes longer when you have more inodes in the slab caches, it's
> the double-edged sword of vfs_cache_pressure.

That's a bit sad when, iiuc, it's only journals doing fdatasync in the Ceph
write path. I'd have expected the vfs to handle this on a per fs basis (and
a journal filesystem would have very little in the inode cache).
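
If anyone wants to sanity check that on their own cluster, one rough way
(illustrative only - strace adds overhead, so don't leave it attached to a
busy OSD) is to count the fdatasync calls and their wall time per OSD
process:

  # summarise fdatasync calls from one ceph-osd process for ~30 seconds
  # (<osd-pid> is a placeholder for the PID of the OSD you want to watch)
  timeout 30 strace -c -f -e trace=fdatasync -p <osd-pid>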

It's somewhat annoying there isn't a way to favor dentries (and perhaps
dentry inodes) over other inodes in the vfs cache. Our experience shows
that it's dentry misses that cause the major performance issues (which makes
sense when you consider the OSD stores all its data in the leaves of the
on-disk PG structure).
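
For what it's worth, a rough way to see that on a node (purely illustrative,
and note that /proc/fs/xfs/stat aggregates every XFS filesystem on the host)
is to sample the directory counters and watch whether lookups keep climbing
under steady load:

  # the first field on the "dir" line is xs_dir_lookup; if it grows quickly
  # the directory lookups are going to disk instead of the dentry cache
  grep '^dir ' /proc/fs/xfs/stat; sleep 10; grep '^dir ' /proc/fs/xfs/stat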

This is another discussion that seems to back up the choice to implement
BlueStore.

Cheers,
Blair


^ permalink raw reply	[flat|nested] 34+ messages in thread
