* Dramatic performance drop at certain number of objects in pool
@ 2016-06-16 12:14 Wade Holler
2016-06-16 12:48 ` Blair Bethwaite
2016-06-16 13:38 ` Wido den Hollander
0 siblings, 2 replies; 34+ messages in thread
From: Wade Holler @ 2016-06-16 12:14 UTC (permalink / raw)
To: ceph-devel
Hi All,
I have a repeatable condition: when the object count in a pool reaches
320-330 million, the object write time increases dramatically and almost
instantly, by as much as 10X, as shown by fs_apply_latency going from
10 ms to hundreds of ms.
Can someone point me in a direction or offer an explanation?
I can add a new pool and it performs normally.
The config is roughly:
3 nodes with 24 physical cores each, 768 GB RAM each, 16 OSDs per node, all SSD
with NVMe for journals. CentOS 7.2, XFS.
Jewel is the release; I am inserting objects with librados via some Python
test code.
Best Regards
Wade
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: Dramatic performance drop at certain number of objects in pool
2016-06-16 12:14 Dramatic performance drop at certain number of objects in pool Wade Holler
@ 2016-06-16 12:48 ` Blair Bethwaite
[not found] ` <CA+z5Dsz=e1N9RxRoF5Wao8Dogf_S1UstNZaCJ=oj-efj83HBig-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2016-06-16 13:38 ` Wido den Hollander
1 sibling, 1 reply; 34+ messages in thread
From: Blair Bethwaite @ 2016-06-16 12:48 UTC (permalink / raw)
To: Wade Holler; +Cc: Ceph Development, ceph-users
Hi Wade,
What IO are you seeing on the OSD devices when this happens (see e.g.
iostat)? Are there short periods of high read IOPS where (almost) no
writes occur? What does your memory usage look like (including slab)?
Cheers,
On 16 June 2016 at 22:14, Wade Holler <wade.holler@gmail.com> wrote:
> Hi All,
>
> I have a repeatable condition when the object count in a pool gets to
> 320-330 million the object write time dramatically and almost
> instantly increases as much as 10X, exhibited by fs_apply_latency
> going from 10ms to 100s of ms.
>
> Can someone point me in a direction / have an explanation ?
>
> I can add a new pool and it performs normally.
>
> Config is generally
> 3 Nodes 24 physical core each, 768GB Ram each, 16 OSD / node , all SSD
> with NVME for journals. Centos 7.2, XFS
>
> Jewell is the release; inserting objects with librados via some Python
> test code.
>
> Best Regards
> Wade
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
Cheers,
~Blairo
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: Dramatic performance drop at certain number of objects in pool
2016-06-16 12:14 Dramatic performance drop at certain number of objects in pool Wade Holler
2016-06-16 12:48 ` Blair Bethwaite
@ 2016-06-16 13:38 ` Wido den Hollander
2016-06-16 14:47 ` Wade Holler
2016-06-19 23:21 ` Blair Bethwaite
1 sibling, 2 replies; 34+ messages in thread
From: Wido den Hollander @ 2016-06-16 13:38 UTC (permalink / raw)
To: Wade Holler, ceph-devel
> Op 16 juni 2016 om 14:14 schreef Wade Holler <wade.holler@gmail.com>:
>
>
> Hi All,
>
> I have a repeatable condition when the object count in a pool gets to
> 320-330 million the object write time dramatically and almost
> instantly increases as much as 10X, exhibited by fs_apply_latency
> going from 10ms to 100s of ms.
>
My first guess is filestore splitting and the number of files per directory.
You have 3*16=48 OSDs, is that correct? With roughly 100 PGs per OSD, you have, let's say, 4800 PGs in total?
That means you have ~66k objects per PG.
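For reference, the estimate above can be worked through as a small Python sketch. The 100 PGs per OSD figure is only the rule-of-thumb guess from this message, not a measured value, and the calculation ignores replication.

```python
# Rough objects-per-PG estimate using the figures quoted in this thread.
nodes = 3
osds_per_node = 16
pgs_per_osd = 100              # assumption: rule-of-thumb guess, not measured
objects_in_pool = 320_000_000  # lower end of the reported 320-330 million

osds = nodes * osds_per_node
total_pgs = osds * pgs_per_osd
objects_per_pg = objects_in_pool / total_pgs

print(f"{osds} OSDs, {total_pgs} PGs, ~{objects_per_pg:,.0f} objects per PG")
```

With these inputs the result is about 66.7k objects per PG, which matches the ~66k figure above.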
> Can someone point me in a direction / have an explanation ?
If you take a look at one of the OSDs, is there a huge number of files in a single directory? Look inside the 'current' directory on that OSD.
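One way to answer that question is to walk the OSD's data directory and count the entries in each directory. The following is a minimal sketch, not an official Ceph tool; the function names are hypothetical, and the path in the usage note is an assumption about a typical filestore layout.

```python
import os
from collections import Counter

def count_files_per_dir(root):
    """Map each directory under root to the number of non-directory entries directly inside it."""
    counts = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        counts[dirpath] = len(filenames)
    return counts

def largest_dirs(root, top=10):
    """Return the `top` directories with the most files as (path, count) pairs."""
    return count_files_per_dir(root).most_common(top)
```

For example, `largest_dirs("/var/lib/ceph/osd/ceph-0/current")` would list the ten fullest directories on that OSD, assuming that is where the OSD's data lives. On an OSD holding tens of millions of files the walk itself generates a lot of read IO, so it is best run on one OSD at a time.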
Wido
>
> I can add a new pool and it performs normally.
>
> Config is generally
> 3 Nodes 24 physical core each, 768GB Ram each, 16 OSD / node , all SSD
> with NVME for journals. Centos 7.2, XFS
>
> Jewell is the release; inserting objects with librados via some Python
> test code.
>
> Best Regards
> Wade
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: Dramatic performance drop at certain number of objects in pool
[not found] ` <CA+z5Dsz=e1N9RxRoF5Wao8Dogf_S1UstNZaCJ=oj-efj83HBig-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2016-06-16 14:20 ` Mykola
2016-06-16 14:30 ` Dramatic performance drop at certain number of objects " Wade Holler
2016-06-16 14:32 ` Wade Holler
2 siblings, 0 replies; 34+ messages in thread
From: Mykola @ 2016-06-16 14:20 UTC (permalink / raw)
To: Blair Bethwaite, Wade Holler
Cc: Ceph Development, ceph-users-idqoXFIVOFJgJs9I8MT0rw
I see the same behavior, with a threshold of around 20M objects, on a 4-node, 16-OSD, 32 TB, HDD-based cluster. The issue dates back to Hammer.
From: Blair Bethwaite
Sent: Thursday, June 16, 2016 2:48 PM
To: Wade Holler
Cc: Ceph Development; ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
Subject: Re: [ceph-users] Dramatic performance drop at certain number of objects in pool
Hi Wade,
What IO are you seeing on the OSD devices when this happens (see e.g.
iostat), are there short periods of high read IOPS where (almost) no
writes occur? What does your memory usage look like (including slab)?
Cheers,
On 16 June 2016 at 22:14, Wade Holler <wade.holler-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> Hi All,
>
> I have a repeatable condition when the object count in a pool gets to
> 320-330 million the object write time dramatically and almost
> instantly increases as much as 10X, exhibited by fs_apply_latency
> going from 10ms to 100s of ms.
>
> Can someone point me in a direction / have an explanation ?
>
> I can add a new pool and it performs normally.
>
> Config is generally
> 3 Nodes 24 physical core each, 768GB Ram each, 16 OSD / node , all SSD
> with NVME for journals. Centos 7.2, XFS
>
> Jewell is the release; inserting objects with librados via some Python
> test code.
>
> Best Regards
> Wade
--
Cheers,
~Blairo
_______________________________________________
ceph-users mailing list
ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: Dramatic performance drop at certain number of objects in pool
[not found] ` <CA+z5Dsz=e1N9RxRoF5Wao8Dogf_S1UstNZaCJ=oj-efj83HBig-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2016-06-16 14:20 ` Dramatic performance drop at certain number of objects " Mykola
@ 2016-06-16 14:30 ` Wade Holler
2016-06-16 14:32 ` Wade Holler
2 siblings, 0 replies; 34+ messages in thread
From: Wade Holler @ 2016-06-16 14:30 UTC (permalink / raw)
To: Blair Bethwaite; +Cc: Ceph Development, ceph-users-idqoXFIVOFJgJs9I8MT0rw
Blairo,
That's right, I do see "lots" of READ IO! If I compare the "bad (330Mil)"
pool with the new test (good) pool:
iostat while running to the "good" pool shows almost all writes.
iostat while running to the "bad" pool has VERY large read spikes, with
almost no writes.
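The read-versus-write pattern described here can also be captured numerically by sampling /proc/diskstats twice and comparing the counters. This is a minimal sketch assuming the standard Linux diskstats layout (major, minor, device name, then reads completed as the first statistic and writes completed as the fifth); the helper names are hypothetical.

```python
def parse_diskstats(text):
    """Map device name -> (reads completed, writes completed) from /proc/diskstats text."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 14:
            continue  # skip blank or unexpectedly short lines
        stats[fields[2]] = (int(fields[3]), int(fields[7]))
    return stats

def read_write_delta(before, after):
    """Per-device (reads, writes) completed between two parse_diskstats() samples."""
    return {
        dev: (after[dev][0] - before[dev][0], after[dev][1] - before[dev][1])
        for dev in after
        if dev in before
    }
```

Reading `/proc/diskstats` a few seconds apart and passing both samples to `read_write_delta` gives per-device counts; a device showing thousands of reads and almost no writes in an interval would correspond to the read spikes reported for the "bad" pool.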
Sounds like you have an idea about what causes this. I'm happy to hear it!
slabinfo is below. Dropping caches has no effect.
slabinfo - version: 2.1
# name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
blk_io_mits 4674 4769 1664 19 8 : tunables 0 0 0 : slabdata 251 251 0
rpc_inode_cache 0 0 640 51 8 : tunables 0 0 0 : slabdata 0 0 0
t10_alua_tg_pt_gp_cache 0 0 408 40 4 : tunables 0 0 0 : slabdata 0 0 0
t10_pr_reg_cache 0 0 696 47 8 : tunables 0 0 0 : slabdata 0 0 0
se_sess_cache 0 0 896 36 8 : tunables 0 0 0 : slabdata 0 0 0
kvm_vcpu 0 0 16256 2 8 : tunables 0 0 0 : slabdata 0 0 0
kvm_mmu_page_header 48 48 168 48 2 : tunables 0 0 0 : slabdata 1 1 0
xfs_dqtrx 0 0 528 62 8 : tunables 0 0 0 : slabdata 0 0 0
xfs_dquot 0 0 472 69 8 : tunables 0 0 0 : slabdata 0 0 0
xfs_icr 0 0 144 56 2 : tunables 0 0 0 : slabdata 0 0 0
xfs_ili 96974261 97026835 152 53 2 : tunables 0 0 0 : slabdata 1830695 1830695 0
xfs_inode 97120263 97120263 1088 30 8 : tunables 0 0 0 : slabdata 3237631 3237631 0
xfs_efd_item 6280 6360 400 40 4 : tunables 0 0 0 : slabdata 159 159 0
xfs_da_state 3264 3264 480 68 8 : tunables 0 0 0 : slabdata 48 48 0
xfs_btree_cur 1872 1872 208 39 2 : tunables 0 0 0 : slabdata 48 48 0
xfs_log_ticket 23980 23980 184 44 2 : tunables 0 0 0 : slabdata 545 545 0
scsi_cmd_cache 4536 4644 448 36 4 : tunables 0 0 0 : slabdata 129 129 0
kcopyd_job 0 0 3312 9 8 : tunables 0 0 0 : slabdata 0 0 0
dm_uevent 0 0 2608 12 8 : tunables 0 0 0 : slabdata 0 0 0
dm_rq_target_io 0 0 136 60 2 : tunables 0 0 0 : slabdata 0 0 0
UDPLITEv6 0 0 1152 28 8 : tunables 0 0 0 : slabdata 0 0 0
UDPv6 980 980 1152 28 8 : tunables 0 0 0 : slabdata 35 35 0
tw_sock_TCPv6 0 0 256 64 4 : tunables 0 0 0 : slabdata 0 0 0
TCPv6 510 510 2112 15 8 : tunables 0 0 0 : slabdata 34 34 0
uhci_urb_priv 6132 6132 56 73 1 : tunables 0 0 0 : slabdata 84 84 0
cfq_queue 64153 97300 232 70 4 : tunables 0 0 0 : slabdata 1390 1390 0
bsg_cmd 0 0 312 52 4 : tunables 0 0 0 : slabdata 0 0 0
mqueue_inode_cache 36 36 896 36 8 : tunables 0 0 0 : slabdata 1 1 0
hugetlbfs_inode_cache 106 106 608 53 8 : tunables 0 0 0 : slabdata 2 2 0
configfs_dir_cache 46 46 88 46 1 : tunables 0 0 0 : slabdata 1 1 0
dquot 0 0 256 64 4 : tunables 0 0 0 : slabdata 0 0 0
kioctx 1512 1512 576 56 8 : tunables 0 0 0 : slabdata 27 27 0
userfaultfd_ctx_cache 0 0 128 64 2 : tunables 0 0 0 : slabdata 0 0 0
pid_namespace 0 0 2176 15 8 : tunables 0 0 0 : slabdata 0 0 0
user_namespace 0 0 280 58 4 : tunables 0 0 0 : slabdata 0 0 0
posix_timers_cache 0 0 248 66 4 : tunables 0 0 0 : slabdata 0 0 0
UDP-Lite 0 0 1024 32 8 : tunables 0 0 0 : slabdata 0 0 0
RAW 1972 1972 960 34 8 : tunables 0 0 0 : slabdata 58 58 0
UDP 1472 1504 1024 32 8 : tunables 0 0 0 : slabdata 47 47 0
tw_sock_TCP 6272 6400 256 64 4 : tunables 0 0 0 : slabdata 100 100 0
TCP 5236 5457 1920 17 8 : tunables 0 0 0 : slabdata 321 321 0
blkdev_queue 421 465 2088 15 8 : tunables 0 0 0 : slabdata 31 31 0
blkdev_requests 36137670 39504234 384 42 4 : tunables 0 0 0 : slabdata 940577 940577 0
blkdev_ioc 2106 2106 104 39 1 : tunables 0 0 0 : slabdata 54 54 0
fsnotify_event_holder 8160 8160 24 170 1 : tunables 0 0 0 : slabdata 48 48 0
fsnotify_event 37128 37128 120 68 2 : tunables 0 0 0 : slabdata 546 546 0
sock_inode_cache 11985 11985 640 51 8 : tunables 0 0 0 : slabdata 235 235 0
net_namespace 0 0 4608 7 8 : tunables 0 0 0 : slabdata 0 0 0
shmem_inode_cache 5040 5040 680 48 8 : tunables 0 0 0 : slabdata 105 105 0
Acpi-ParseExt 116256 116256 72 56 1 : tunables 0 0 0 : slabdata 2076 2076 0
Acpi-Namespace 14586 14586 40 102 1 : tunables 0 0 0 : slabdata 143 143 0
taskstats 2352 2352 328 49 4 : tunables 0 0 0 : slabdata 48 48 0
proc_inode_cache 146512 146706 656 49 8 : tunables 0 0 0 : slabdata 2994 2994 0
sigqueue 2448 2448 160 51 2 : tunables 0 0 0 : slabdata 48 48 0
bdev_cache 1872 1872 832 39 8 : tunables 0 0 0 : slabdata 48 48 0
sysfs_dir_cache 172296 172296 112 36 1 : tunables 0 0 0 : slabdata 4786 4786 0
inode_cache 17550 17820 592 55 8 : tunables 0 0 0 : slabdata 324 324 0
dentry 63799847 86138682 192 42 2 : tunables 0 0 0 : slabdata 2050921 2050921 0
iint_cache 0 0 80 51 1 : tunables 0 0 0 : slabdata 0 0 0
selinux_inode_security 41920 42636 80 51 1 : tunables 0 0 0 : slabdata 836 836 0
buffer_head 28851697 32477250 104 39 1 : tunables 0 0 0 : slabdata 832750 832750 0
vm_area_struct 36548 38665 216 37 2 : tunables 0 0 0 : slabdata 1045 1045 0
mm_struct 1120 1120 1600 20 8 : tunables 0 0 0 : slabdata 56 56 0
files_cache 2703 2703 640 51 8 : tunables 0 0 0 : slabdata 53 53 0
signal_cache 5109 5376 1152 28 8 : tunables 0 0 0 : slabdata 192 192 0
sighand_cache 3241 3345 2112 15 8 : tunables 0 0 0 : slabdata 223 223 0
task_xstate 14118 14937 832 39 8 : tunables 0 0 0 : slabdata 383 383 0
task_struct 9295 10538 2944 11 8 : tunables 0 0 0 : slabdata 958 958 0
anon_vma 30400 30400 64 64 1 : tunables 0 0 0 : slabdata 475 475 0
shared_policy_node 5780 5780 48 85 1 : tunables 0 0 0 : slabdata 68 68 0
numa_policy 620 620 264 62 4 : tunables 0 0 0 : slabdata 10 10 0
radix_tree_node 10364872 10364872 584 56 8 : tunables 0 0 0 : slabdata 185087 185087 0
idr_layer_cache 1185 1185 2112 15 8 : tunables 0 0 0 : slabdata 79 79 0
dma-kmalloc-8192 0 0 8192 4 8 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-4096 0 0 4096 8 8 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-2048 0 0 2048 16 8 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-1024 0 0 1024 32 8 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-512 0 0 512 64 8 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-256 0 0 256 64 4 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-128 0 0 128 64 2 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-64 0 0 64 64 1 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-32 0 0 32 128 1 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-16 0 0 16 256 1 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-8 0 0 8 512 1 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-192 0 0 192 42 2 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-96 0 0 96 42 1 : tunables 0 0 0 : slabdata 0 0 0
kmalloc-8192 588 680 8192 4 8 : tunables 0 0 0 : slabdata 170 170 0
kmalloc-4096 6374 6424 4096 8 8 : tunables 0 0 0 : slabdata 803 803 0
kmalloc-2048 60889 63744 2048 16 8 : tunables 0 0 0 : slabdata 3984 3984 0
kmalloc-1024 27406 32448 1024 32 8 : tunables 0 0 0 : slabdata 1014 1014 0
kmalloc-512 96841607 96891536 512 64 8 : tunables 0 0 0 : slabdata 1513967 1513967 0
kmalloc-256 73414 108736 256 64 4 : tunables 0 0 0 : slabdata 1699 1699 0
kmalloc-192 32870 33432 192 42 2 : tunables 0 0 0 : slabdata 796 796 0
kmalloc-128 64128 92736 128 64 2 : tunables 0 0 0 : slabdata 1449 1449 0
kmalloc-96 9350 9618 96 42 1 : tunables 0 0 0 : slabdata 229 229 0
kmalloc-64 159325477 194832256 64 64 1 : tunables 0 0 0 : slabdata 3044254 3044254 0
kmalloc-32 24960 24960 32 128 1 : tunables 0 0 0 : slabdata 195 195 0
kmalloc-16 45312 45312 16 256 1 : tunables 0 0 0 : slabdata 177 177 0
kmalloc-8 51712 51712 8 512 1 : tunables 0 0 0 : slabdata 101 101 0
kmem_cache_node 741 768 64 64 1 : tunables 0 0 0 : slabdata 12 12 0
kmem_cache 640 640 256 64 4 : tunables 0 0 0 : slabdata 10 10 0
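To make a dump like this easier to read, the per-cache memory can be approximated as num_objs * objsize. The sketch below assumes the slabinfo 2.1 column order shown in the header above and expects one cache per line, as in /proc/slabinfo itself; `slab_usage` is a hypothetical helper name, and the figure slightly undercounts because it ignores per-slab padding.

```python
def slab_usage(slabinfo_text):
    """Return {cache name: approximate bytes}, computed as num_objs * objsize."""
    usage = {}
    for line in slabinfo_text.splitlines():
        fields = line.split()
        # Skip the version banner, the '#' column header, and malformed lines.
        if len(fields) < 4 or not (fields[2].isdigit() and fields[3].isdigit()):
            continue
        usage[fields[0]] = int(fields[2]) * int(fields[3])
    return usage

def top_caches(slabinfo_text, top=5):
    """Largest caches first, as (name, GiB) pairs."""
    ranked = sorted(slab_usage(slabinfo_text).items(), key=lambda kv: kv[1], reverse=True)
    return [(name, size / 2**30) for name, size in ranked[:top]]
```

Applied to the numbers above, xfs_inode alone (97120263 objects of 1088 bytes) comes to roughly 98 GiB by this measure, with kmalloc-512, dentry, xfs_ili, blkdev_requests and kmalloc-64 each in the range of roughly 10 to 50 GiB.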
On Thu, Jun 16, 2016 at 8:48 AM Blair Bethwaite <blair.bethwaite-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
wrote:
> Hi Wade,
>
> What IO are you seeing on the OSD devices when this happens (see e.g.
> iostat), are there short periods of high read IOPS where (almost) no
> writes occur? What does your memory usage look like (including slab)?
>
> Cheers,
>
> On 16 June 2016 at 22:14, Wade Holler <wade.holler-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> > Hi All,
> >
> > I have a repeatable condition when the object count in a pool gets to
> > 320-330 million the object write time dramatically and almost
> > instantly increases as much as 10X, exhibited by fs_apply_latency
> > going from 10ms to 100s of ms.
> >
> > Can someone point me in a direction / have an explanation ?
> >
> > I can add a new pool and it performs normally.
> >
> > Config is generally
> > 3 Nodes 24 physical core each, 768GB Ram each, 16 OSD / node , all SSD
> > with NVME for journals. Centos 7.2, XFS
> >
> > Jewell is the release; inserting objects with librados via some Python
> > test code.
> >
> > Best Regards
> > Wade
>
>
>
> --
> Cheers,
> ~Blairo
>
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: Dramatic performance drop at certain number of objects in pool
2016-06-16 13:38 ` Wido den Hollander
@ 2016-06-16 14:47 ` Wade Holler
2016-06-16 16:08 ` Wade Holler
2016-06-19 23:21 ` Blair Bethwaite
1 sibling, 1 reply; 34+ messages in thread
From: Wade Holler @ 2016-06-16 14:47 UTC (permalink / raw)
To: Wido den Hollander; +Cc: Ceph Development
Wido,
I am walking an example OSD now and counting the files. There are 3096 PGs for this pool.
So far the file counts inside the pool.pg_head directories are all coming
in at around 80k.
Is this an issue?
I will report back with all pg_head file counts in this example OSD
once it finishes.
Best Regards,
Wade
On Thu, Jun 16, 2016 at 9:38 AM, Wido den Hollander <wido@42on.com> wrote:
>
>> Op 16 juni 2016 om 14:14 schreef Wade Holler <wade.holler@gmail.com>:
>>
>>
>> Hi All,
>>
>> I have a repeatable condition when the object count in a pool gets to
>> 320-330 million the object write time dramatically and almost
>> instantly increases as much as 10X, exhibited by fs_apply_latency
>> going from 10ms to 100s of ms.
>>
>
> My first guess is the filestore splitting and the amount of files per directory.
>
> You have 3*16=48 OSDs, is that correct? With roughly 100 PGs per OSD you have let's say 4800 PGs in total?
>
> That means you have ~66k objects per PG.
>
>> Can someone point me in a direction / have an explanation ?
>
> If you take a look at one of the OSDs, are there a huge amount of files in a single directory? Look inside the 'current' directory on that OSDs.
>
> Wido
>
>>
>> I can add a new pool and it performs normally.
>>
>> Config is generally
>> 3 Nodes 24 physical core each, 768GB Ram each, 16 OSD / node , all SSD
>> with NVME for journals. Centos 7.2, XFS
>>
>> Jewell is the release; inserting objects with librados via some Python
>> test code.
>>
>> Best Regards
>> Wade
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: Dramatic performance drop at certain number of objects in pool
2016-06-16 14:47 ` Wade Holler
@ 2016-06-16 16:08 ` Wade Holler
2016-06-17 8:49 ` Wido den Hollander
0 siblings, 1 reply; 34+ messages in thread
From: Wade Holler @ 2016-06-16 16:08 UTC (permalink / raw)
To: Wido den Hollander; +Cc: Ceph Development
OK. Of the 202 PGs on this example OSD:
65 have around 160k files
137 (the rest) have around 80k files
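As a sanity check, these per-PG counts can be compared against the pool-wide figures mentioned earlier in the thread (about 330 million objects, 3096 PGs). This sketch assumes roughly one file per object and treats the 160k and 80k figures as exact, which they are not.

```python
# Approximate files per PG -> number of PGs with that count on the example OSD.
pgs_on_osd = {160_000: 65, 80_000: 137}

total_pgs_on_osd = sum(pgs_on_osd.values())
files_on_osd = sum(files * n for files, n in pgs_on_osd.items())
avg_files_per_pg = files_on_osd / total_pgs_on_osd

# Pool-wide estimate from the figures given elsewhere in the thread.
pool_estimate = 330_000_000 / 3096

print(f"{total_pgs_on_osd} PGs, ~{files_on_osd:,} files on this OSD")
print(f"average per PG: ~{avg_files_per_pg:,.0f}, pool-wide estimate: ~{pool_estimate:,.0f}")
```

Both numbers land near 106k files per PG, so the counts on this OSD look consistent with the pool as a whole. The two distinct sizes (80k and 160k) would be expected if the PG count is not a power of two, since some PGs then cover twice the hash range of others.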
On Thu, Jun 16, 2016 at 10:47 AM, Wade Holler <wade.holler@gmail.com> wrote:
> Wido,
>
> I am walking an Example OSD now and counting the files. 3096 PGs for this pool.
> So far file counts inside the pool.pg_head directories are all coming
> in around ~80k.
>
> Is this an issue ?
>
> I will report back with all pg_head file counts in this example OSD
> once it finishes.
>
> Best Regards,
> Wade
>
>
> On Thu, Jun 16, 2016 at 9:38 AM, Wido den Hollander <wido@42on.com> wrote:
>>
>>> Op 16 juni 2016 om 14:14 schreef Wade Holler <wade.holler@gmail.com>:
>>>
>>>
>>> Hi All,
>>>
>>> I have a repeatable condition when the object count in a pool gets to
>>> 320-330 million the object write time dramatically and almost
>>> instantly increases as much as 10X, exhibited by fs_apply_latency
>>> going from 10ms to 100s of ms.
>>>
>>
>> My first guess is the filestore splitting and the amount of files per directory.
>>
>> You have 3*16=48 OSDs, is that correct? With roughly 100 PGs per OSD you have let's say 4800 PGs in total?
>>
>> That means you have ~66k objects per PG.
>>
>>> Can someone point me in a direction / have an explanation ?
>>
>> If you take a look at one of the OSDs, are there a huge amount of files in a single directory? Look inside the 'current' directory on that OSDs.
>>
>> Wido
>>
>>>
>>> I can add a new pool and it performs normally.
>>>
>>> Config is generally
>>> 3 Nodes 24 physical core each, 768GB Ram each, 16 OSD / node , all SSD
>>> with NVME for journals. Centos 7.2, XFS
>>>
>>> Jewell is the release; inserting objects with librados via some Python
>>> test code.
>>>
>>> Best Regards
>>> Wade
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: Dramatic performance drop at certain number of objects in pool
2016-06-16 16:08 ` Wade Holler
@ 2016-06-17 8:49 ` Wido den Hollander
0 siblings, 0 replies; 34+ messages in thread
From: Wido den Hollander @ 2016-06-17 8:49 UTC (permalink / raw)
To: Wade Holler; +Cc: Ceph Development
> Op 16 juni 2016 om 18:08 schreef Wade Holler <wade.holler@gmail.com>:
>
>
> Ok. Of the 202 pgs on this example OSD,
>
> 65 of them have around ~160k files
> 137 ( the rest ) -have around ~80k files
>
Are those files in the same directory or spread out over multiple subdirectories?
You might want to take a look at "filestore split multiple" in:
http://docs.ceph.com/docs/jewel/rados/configuration/filestore-config-ref/
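A minimal sketch of how that setting plays out, under stated assumptions: the filestore documentation gives the split point as filestore_split_multiple * abs(filestore_merge_threshold) * 16 files per directory, and the defaults used here (2 and 10) are assumed for a stock Jewel install and should be checked against the running config. The depth estimate further assumes objects hash evenly and each split fans out into 16 subdirectories; the function names are hypothetical.

```python
def split_threshold(split_multiple=2, merge_threshold=10):
    """Files per directory above which filestore splits the directory (assumed defaults)."""
    return split_multiple * abs(merge_threshold) * 16

def estimated_depth(files_in_pg, threshold):
    """Smallest subdirectory depth at which an even spread stays at or under the threshold."""
    depth = 0
    while files_in_pg / (16 ** depth) > threshold:
        depth += 1
    return depth

threshold = split_threshold()
for files in (80_000, 160_000):
    depth = estimated_depth(files, threshold)
    per_leaf = files / 16 ** depth
    print(f"{files} files: depth {depth}, {16 ** depth} leaf dirs, ~{per_leaf:.0f} files each")
```

Under these assumptions the threshold is 320 files, a PG with 80k files sits at about 312 files per leaf directory across 256 leaves, and a PG with 160k files needs a third level. That would place the 80k PGs right at the edge of another round of splitting, but this is only an estimate and not a confirmed diagnosis.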
Wido
>
>
> On Thu, Jun 16, 2016 at 10:47 AM, Wade Holler <wade.holler@gmail.com> wrote:
> > Wido,
> >
> > I am walking an Example OSD now and counting the files. 3096 PGs for this pool.
> > So far file counts inside the pool.pg_head directories are all coming
> > in around ~80k.
> >
> > Is this an issue ?
> >
> > I will report back with all pg_head file counts in this example OSD
> > once it finishes.
> >
> > Best Regards,
> > Wade
> >
> >
> > On Thu, Jun 16, 2016 at 9:38 AM, Wido den Hollander <wido@42on.com> wrote:
> >>
> >>> Op 16 juni 2016 om 14:14 schreef Wade Holler <wade.holler@gmail.com>:
> >>>
> >>>
> >>> Hi All,
> >>>
> >>> I have a repeatable condition when the object count in a pool gets to
> >>> 320-330 million the object write time dramatically and almost
> >>> instantly increases as much as 10X, exhibited by fs_apply_latency
> >>> going from 10ms to 100s of ms.
> >>>
> >>
> >> My first guess is the filestore splitting and the amount of files per directory.
> >>
> >> You have 3*16=48 OSDs, is that correct? With roughly 100 PGs per OSD you have let's say 4800 PGs in total?
> >>
> >> That means you have ~66k objects per PG.
> >>
> >>> Can someone point me in a direction / have an explanation ?
> >>
> >> If you take a look at one of the OSDs, are there a huge amount of files in a single directory? Look inside the 'current' directory on that OSDs.
> >>
> >> Wido
> >>
> >>>
> >>> I can add a new pool and it performs normally.
> >>>
> >>> Config is generally
> >>> 3 Nodes 24 physical core each, 768GB Ram each, 16 OSD / node , all SSD
> >>> with NVME for journals. Centos 7.2, XFS
> >>>
> >>> Jewell is the release; inserting objects with librados via some Python
> >>> test code.
> >>>
> >>> Best Regards
> >>> Wade
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: Dramatic performance drop at certain number of objects in pool
2016-06-16 13:38 ` Wido den Hollander
2016-06-16 14:47 ` Wade Holler
@ 2016-06-19 23:21 ` Blair Bethwaite
[not found] ` <CA+z5DszqHuevkAF3W01R=7AAeqVcyuHZTX0+bAvThgihvOjwuA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2016-06-20 6:32 ` Blair Bethwaite
1 sibling, 2 replies; 34+ messages in thread
From: Blair Bethwaite @ 2016-06-19 23:21 UTC (permalink / raw)
To: Wido den Hollander, Wade Holler; +Cc: Ceph Development, ceph-users
Hi Wade,
(Apologies for the slowness - AFK for the weekend).
On 16 June 2016 at 23:38, Wido den Hollander <wido@42on.com> wrote:
>
>> On 16 June 2016 at 14:14, Wade Holler <wade.holler@gmail.com> wrote:
>>
>>
>> Hi All,
>>
>> I have a repeatable condition when the object count in a pool gets to
>> 320-330 million the object write time dramatically and almost
>> instantly increases as much as 10X, exhibited by fs_apply_latency
>> going from 10ms to 100s of ms.
>>
>
> My first guess is the filestore splitting and the amount of files per directory.
I concur with Wido and suggest you try upping your filestore split and
merge threshold config values.
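
For reference, the filestore config docs give the point at which a subdirectory is split as filestore_split_multiple * abs(filestore_merge_threshold) * 16 files. A minimal Python sketch of that arithmetic follows. The function name is made up for illustration, and the 2 and 10 passed in are what I understand the Jewel defaults to be, so check your own running config before relying on them:

```python
def split_threshold(split_multiple: int, merge_threshold: int) -> int:
    """Files a filestore subdirectory may hold before it gets split.

    Formula as given in the filestore config reference:
    filestore_split_multiple * abs(filestore_merge_threshold) * 16
    """
    return split_multiple * abs(merge_threshold) * 16


if __name__ == "__main__":
    # Assumed defaults (split multiple 2, merge threshold 10).
    print("assumed defaults:", split_threshold(2, 10))
    # Purely illustrative larger values, not a recommendation.
    print("illustrative 8/40:", split_threshold(8, 40))
```

Raising either value raises the per-directory file count at which splitting starts, which means fewer but larger directories for the same number of objects.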
I've seen this issue a number of times now with write heavy workload,
and would love to at least write some docs about it, because it must
happen to a lot of users running RBD workloads on largish drives.
However, I'm not sure how to definitively diagnose the issue and
pinpoint the problem. The gist of the issue is the number of files
and/or directories on your OSD filesystems: at some system-dependent
threshold you get to a point where you can no longer sufficiently
cache inodes and/or dentries, so IOs on those files(ystems) have to
incur extra disk IOPS to read the filesystem structure from disk (I
believe that's the small read IO you're seeing, and unfortunately it
seems to effectively choke writes - we've seen all sorts of related
slow request issues). If you watch your xfs stats you'll likely get
further confirmation. In my experience xs_dir_lookups balloons (which
means directory lookups are missing cache and going to disk).
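
A minimal sketch of watching that counter without a metrics stack, assuming a Linux host that exposes the global XFS counters in /proc/fs/xfs/stat and that the first field of its "dir" line is the directory-lookup counter (named xs_dir_lookup in the kernel's XFS stats). The function names are hypothetical, and the counters are global to the host, not per OSD:

```python
from pathlib import Path

XFS_STAT = Path("/proc/fs/xfs/stat")  # host-wide XFS counters on Linux

# Assumed field order of the "dir" line.
DIR_FIELDS = ("xs_dir_lookup", "xs_dir_create", "xs_dir_remove", "xs_dir_getdents")


def dir_counters(stat_text: str) -> dict:
    """Parse the 'dir' line of /proc/fs/xfs/stat into a name -> value dict."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "dir":
            return dict(zip(DIR_FIELDS, (int(v) for v in fields[1:5])))
    raise ValueError("no 'dir' line found in XFS stats")


def lookup_rate(before: dict, after: dict, seconds: float) -> float:
    """Directory lookups per second between two samples."""
    return (after["xs_dir_lookup"] - before["xs_dir_lookup"]) / seconds
```

In use, read XFS_STAT.read_text() twice a known number of seconds apart, pass both samples through dir_counters(), and compare the lookup_rate() result before and after the slowdown starts.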
What I'm not clear on is whether there are two different pathologies
at play here, i.e., specifically dentry cache issues versus inode
cache issues. In the former case making Ceph's directory structure
shallower with more files per directory may help (or perhaps
increasing the number of PGs - more top-level directories), but in the
latter case you're likely to need various system tuning (lower vfs
cache pressure, more memory?, fewer files (larger object size))
depending on your workload.
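
One way to tell the two cases apart is to look at how large the dentry and inode slab caches actually are. The sketch below parses text in /proc/slabinfo format and reports, per cache, the active object count and the approximate bytes held by all allocated objects. It assumes the usual slabinfo 2.x column order (name, active_objs, num_objs, objsize, ...) and that the XFS inode cache is called "xfs_inode"; the function name is made up for illustration:

```python
def slab_usage(slabinfo_text: str, caches=("dentry", "xfs_inode")) -> dict:
    """Map cache name -> (active objects, approx bytes for all objects).

    Expects /proc/slabinfo-style text; header lines are skipped because
    their first field is not one of the requested cache names.
    """
    usage = {}
    for line in slabinfo_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[0] in caches:
            active, total, objsize = (int(v) for v in fields[1:4])
            usage[fields[0]] = (active, total * objsize)
    return usage
```

Reading /proc/slabinfo normally needs root. Sampling this while the object count grows should show whether either cache stops growing around the time latency jumps, though that is only indirect evidence.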
--
Cheers,
~Blairo
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: Dramatic performance drop at certain number of objects in pool
[not found] ` <CA+z5DszqHuevkAF3W01R=7AAeqVcyuHZTX0+bAvThgihvOjwuA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2016-06-20 0:52 ` Christian Balzer
0 siblings, 0 replies; 34+ messages in thread
From: Christian Balzer @ 2016-06-20 0:52 UTC (permalink / raw)
To: Blair Bethwaite; +Cc: Ceph Development, ceph-users-idqoXFIVOFJgJs9I8MT0rw
Hello Blair,
On Mon, 20 Jun 2016 09:21:27 +1000 Blair Bethwaite wrote:
> Hi Wade,
>
> (Apologies for the slowness - AFK for the weekend).
>
> On 16 June 2016 at 23:38, Wido den Hollander <wido-fspyXLx8qC4@public.gmane.org> wrote:
> >
> >> On 16 June 2016 at 14:14, Wade Holler <wade.holler-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> >>
> >>
> >> Hi All,
> >>
> >> I have a repeatable condition when the object count in a pool gets to
> >> 320-330 million the object write time dramatically and almost
> >> instantly increases as much as 10X, exhibited by fs_apply_latency
> >> going from 10ms to 100s of ms.
> >>
> >
> > My first guess is the filestore splitting and the amount of files per
> > directory.
>
> I concur with Wido and suggest you try upping your filestore split and
> merge threshold config values.
>
This is probably a good idea, but as mentioned/suggested below, it would
be something that eventually settles down in a new equilibrium, which I
don't think is happening here.
> I've seen this issue a number of times now with write heavy workload,
> and would love to at least write some docs about it, because it must
> happen to a lot of users running RBD workloads on largish drives.
> However, I'm not sure how to definitively diagnose the issue and
> pinpoint the problem. The gist of the issue is the number of files
> and/or directories on your OSD filesystems, at some system dependent
> threshold you get to a point where you can no longer sufficiently
> cache inodes and/or dentrys, so IOs on those files(ystems) have to
> incur extra disk IOPS to read the filesystem structure from disk (I
> believe that's the small read IO you're seeing, and unfortunately it
> seems to effectively choke writes - we've seen all sorts of related
> slow request issues). If you watch your xfs stats you'll likely get
> further confirmation. In my experience xs_dir_lookups balloons (which
> means directory lookups are missing cache and going to disk).
>
> What I'm not clear on is whether there are two different pathologies
> at play here, i.e., specifically dentry cache issues versus inode
> cache issues. In the former case making Ceph's directory structure
> shallower with more files per directory may help (or perhaps
> increasing the number of PGs - more top-level directories), but in the
> latter case you're likely to need various system tuning (lower vfs
> cache pressure, more memory?, fewer files (larger object size))
> depending on your workload.
>
I can very much confirm this from the days when on my main production
cluster all 1024 PGs (but only about 6GB of data and 1.6 million objects)
were on just 4 OSDs (25TB each).
Once SLAB ran out of steam and couldn't hold all the respective entries
(Ext4 here, but same diff), things became very slow.
My litmus test is that a "ls -R /var/lib/ceph/osd/ceph-nn/ >/dev/null"
should be pretty much instantaneous and should not have to access the disk
at all.
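
A rough Python equivalent of that litmus test, for anyone who wants a number rather than a feeling. It walks a directory tree, counts the entries, and reports the wall-clock time taken. The function name is hypothetical, and the path you pass in is whatever your OSD data directory is:

```python
import os
import time


def timed_walk(root: str):
    """Walk `root` recursively; return (entries seen, seconds elapsed)."""
    start = time.monotonic()
    entries = 0
    for _dirpath, dirnames, filenames in os.walk(root):
        entries += len(dirnames) + len(filenames)
    return entries, time.monotonic() - start
```

Run it twice in a row against the same OSD directory. If the second run is not dramatically faster than the first, the directory metadata is probably not staying in cache. Like ls -R, this mostly reads directories rather than inodes, so it says more about the dentry side than the inode side.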
More RAM and proper tuning as well as smaller OSDs are all ways forward to
alleviate/prevent this issue.
It would be interesting to see/know how bluestore fares in this kind of
scenario.
Christian
--
Christian Balzer Network/Systems Engineer
chibi-FW+hd8ioUD0@public.gmane.org Global OnLine Japan/Rakuten Communications
http://www.gol.com/
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: Dramatic performance drop at certain number of objects in pool
2016-06-19 23:21 ` Blair Bethwaite
[not found] ` <CA+z5DszqHuevkAF3W01R=7AAeqVcyuHZTX0+bAvThgihvOjwuA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2016-06-20 6:32 ` Blair Bethwaite
[not found] ` <CA+z5Dsy4tbyiL71C8CQCTQ66tY1=9thSWdNA4BSn6=tNfGUE6w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
1 sibling, 1 reply; 34+ messages in thread
From: Blair Bethwaite @ 2016-06-20 6:32 UTC (permalink / raw)
To: Wido den Hollander, Wade Holler; +Cc: Ceph Development, ceph-users
On 20 June 2016 at 09:21, Blair Bethwaite <blair.bethwaite@gmail.com> wrote:
> slow request issues). If you watch your xfs stats you'll likely get
> further confirmation. In my experience xs_dir_lookups balloons (which
> means directory lookups are missing cache and going to disk).
Murphy's a bitch. Today we upgraded a cluster to latest Hammer in
preparation for Jewel/RHCS2. Turns out when we last hit this very
problem we had only ephemerally set the new filestore merge/split
values - oops. Here's what started happening when we upgraded and
restarted a bunch of OSDs:
https://au-east.erc.monash.edu.au/swift/v1/public/grafana-ceph-xs_dir_lookup.png
Seemed to cause lots of slow requests :-/. We corrected it about
12:30, then still took a while to settle.
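
To avoid the same trap, the values need to live in ceph.conf as well as in the running daemons. A small sketch of rendering the relevant [osd] fragment with Python's configparser follows. The option names are the ones from the filestore config reference; the function name and the numbers passed in are placeholders for illustration only:

```python
import configparser
import io


def osd_filestore_fragment(merge_threshold: int, split_multiple: int) -> str:
    """Render an [osd] section holding the two filestore settings."""
    cfg = configparser.ConfigParser()
    cfg["osd"] = {
        "filestore merge threshold": str(merge_threshold),
        "filestore split multiple": str(split_multiple),
    }
    buf = io.StringIO()
    cfg.write(buf)
    return buf.getvalue()


if __name__ == "__main__":
    # Placeholder values; substitute whatever you settled on.
    print(osd_filestore_fragment(40, 8))
```

The rendered text still has to be merged by hand (or by your config management) into the existing [osd] section on every OSD host.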
--
Cheers,
~Blairo
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: Dramatic performance drop at certain number of objects in pool
[not found] ` <CA+z5Dsy4tbyiL71C8CQCTQ66tY1=9thSWdNA4BSn6=tNfGUE6w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2016-06-20 18:48 ` Wade Holler
[not found] ` <CA+e22Sc3iY5Lvp4oGwJ_wwpJsOJsWdB1thaHWEAuYP=bbGHAeg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
0 siblings, 1 reply; 34+ messages in thread
From: Wade Holler @ 2016-06-20 18:48 UTC (permalink / raw)
To: Blair Bethwaite, Wido den Hollander
Cc: Ceph Development, ceph-users-idqoXFIVOFJgJs9I8MT0rw
Thanks everyone for your replies. I sincerely appreciate it. We are
testing with different pg_num and filestore_split_multiple settings. Early
indications are ... well, not great. Regardless, it is nice to understand
the symptoms better so we can try to design around them.
Best Regards,
Wade
On Mon, Jun 20, 2016 at 2:32 AM Blair Bethwaite <blair.bethwaite-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
wrote:
> On 20 June 2016 at 09:21, Blair Bethwaite <blair.bethwaite-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
> wrote:
> > slow request issues). If you watch your xfs stats you'll likely get
> > further confirmation. In my experience xs_dir_lookups balloons (which
> > means directory lookups are missing cache and going to disk).
>
> Murphy's a bitch. Today we upgraded a cluster to latest Hammer in
> preparation for Jewel/RHCS2. Turns out when we last hit this very
> problem we had only ephemerally set the new filestore merge/split
> values - oops. Here's what started happening when we upgraded and
> restarted a bunch of OSDs:
>
> https://au-east.erc.monash.edu.au/swift/v1/public/grafana-ceph-xs_dir_lookup.png
>
> Seemed to cause lots of slow requests :-/. We corrected it about
> 12:30, then still took a while to settle.
>
> --
> Cheers,
> ~Blairo
>
_______________________________________________
ceph-users mailing list
ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: Dramatic performance drop at certain number of objects in pool
[not found] ` <CA+e22Sc3iY5Lvp4oGwJ_wwpJsOJsWdB1thaHWEAuYP=bbGHAeg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2016-06-20 20:47 ` Warren Wang - ISD
[not found] ` <D38DCB57.131AE%warren.wang-dFwxUrggiyBBDgjK7y7TUQ@public.gmane.org>
0 siblings, 1 reply; 34+ messages in thread
From: Warren Wang - ISD @ 2016-06-20 20:47 UTC (permalink / raw)
To: Wade Holler, Blair Bethwaite, Wido den Hollander
Cc: Ceph Development, ceph-users-idqoXFIVOFJgJs9I8MT0rw
Sorry, late to the party here. I agree, up the merge and split thresholds. We're as high as 50/12. I chimed in on an RH ticket here. One of those things you just have to find out as an operator since it's not well documented :(
https://bugzilla.redhat.com/show_bug.cgi?id=1219974
We have over 200 million objects in this cluster, and it's still doing over 15000 write IOPS all day long with 302 spinning drives + SATA SSD journals. Having enough memory and dropping your vfs_cache_pressure should also help.
Keep in mind that if you change the values, it won't take effect immediately. It only merges them back if the directory is under the calculated threshold and a write occurs (maybe a read, I forget).
Warren
From: ceph-users <ceph-users-bounces-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org<mailto:ceph-users-bounces-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org>> on behalf of Wade Holler <wade.holler-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org<mailto:wade.holler-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>>
Date: Monday, June 20, 2016 at 2:48 PM
To: Blair Bethwaite <blair.bethwaite-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org<mailto:blair.bethwaite@gmail.com>>, Wido den Hollander <wido-fspyXLx8qC4@public.gmane.org<mailto:wido-fspyXLx8qC4@public.gmane.org>>
Cc: Ceph Development <ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org<mailto:ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>>, "ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org<mailto:ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org>" <ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org<mailto:ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org>>
Subject: Re: [ceph-users] Dramatic performance drop at certain number of objects in pool
Thanks everyone for your replies. I sincerely appreciate it. We are testing with different pg_num and filestore_split_multiple settings. Early indications are .... well not great. Regardless it is nice to understand the symptoms better so we try to design around it.
Best Regards,
Wade
On Mon, Jun 20, 2016 at 2:32 AM Blair Bethwaite <blair.bethwaite-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org<mailto:blair.bethwaite-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>> wrote:
On 20 June 2016 at 09:21, Blair Bethwaite <blair.bethwaite-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org<mailto:blair.bethwaite-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>> wrote:
> slow request issues). If you watch your xfs stats you'll likely get
> further confirmation. In my experience xs_dir_lookups balloons (which
> means directory lookups are missing cache and going to disk).
Murphy's a bitch. Today we upgraded a cluster to latest Hammer in
preparation for Jewel/RHCS2. Turns out when we last hit this very
problem we had only ephemerally set the new filestore merge/split
values - oops. Here's what started happening when we upgraded and
restarted a bunch of OSDs:
https://au-east.erc.monash.edu.au/swift/v1/public/grafana-ceph-xs_dir_lookup.png
Seemed to cause lots of slow requests :-/. We corrected it about
12:30, then still took a while to settle.
--
Cheers,
~Blairo
This email and any files transmitted with it are confidential and intended solely for the individual or entity to whom they are addressed. If you have received this email in error destroy it immediately. *** Walmart Confidential ***
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: Dramatic performance drop at certain number of objects in pool
[not found] ` <D38DCB57.131AE%warren.wang-dFwxUrggiyBBDgjK7y7TUQ@public.gmane.org>
@ 2016-06-20 22:58 ` Christian Balzer
2016-06-23 1:26 ` [ceph-users] " Wade Holler
0 siblings, 1 reply; 34+ messages in thread
From: Christian Balzer @ 2016-06-20 22:58 UTC (permalink / raw)
To: Warren Wang - ISD
Cc: Ceph Development, Blair Bethwaite, ceph-users-idqoXFIVOFJgJs9I8MT0rw
Hello,
On Mon, 20 Jun 2016 20:47:32 +0000 Warren Wang - ISD wrote:
> Sorry, late to the party here. I agree, up the merge and split
> thresholds. We're as high as 50/12. I chimed in on an RH ticket here.
> One of those things you just have to find out as an operator since it's
> not well documented :(
>
> https://bugzilla.redhat.com/show_bug.cgi?id=1219974
>
> We have over 200 million objects in this cluster, and it's still doing
> over 15000 write IOPS all day long with 302 spinning drives + SATA SSD
> journals. Having enough memory and dropping your vfs_cache_pressure
> should also help.
>
Indeed.
Since it was asked in that bug report and was also my first suspicion, it
would probably be a good time to clarify that it isn't the splits that cause
the performance degradation, but the resulting inflation of dir entries
and exhaustion of SLAB, and thus having to go to disk for things that
normally would be in memory.
Looking at Blair's graph from yesterday pretty much makes that clear; a
purely split-caused degradation should have relented much quicker.
> Keep in mind that if you change the values, it won't take effect
> immediately. It only merges them back if the directory is under the
> calculated threshold and a write occurs (maybe a read, I forget).
>
If it's a read, a plain scrub might do the trick.
Christian
> Warren
>
>
> From: ceph-users
> <ceph-users-bounces-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org<mailto:ceph-users-bounces-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org>>
> on behalf of Wade Holler
> <wade.holler-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org<mailto:wade.holler-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>> Date: Monday, June
> 20, 2016 at 2:48 PM To: Blair Bethwaite
> <blair.bethwaite-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org<mailto:blair.bethwaite-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>>, Wido den
> Hollander <wido-fspyXLx8qC4@public.gmane.org<mailto:wido-fspyXLx8qC4@public.gmane.org>> Cc: Ceph Development
> <ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org<mailto:ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>>,
> "ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org<mailto:ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org>"
> <ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org<mailto:ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org>> Subject:
> Re: [ceph-users] Dramatic performance drop at certain number of objects
> in pool
>
> Thanks everyone for your replies. I sincerely appreciate it. We are
> testing with different pg_num and filestore_split_multiple settings.
> Early indications are .... well not great. Regardless it is nice to
> understand the symptoms better so we try to design around it.
>
> Best Regards,
> Wade
>
>
> On Mon, Jun 20, 2016 at 2:32 AM Blair Bethwaite
> <blair.bethwaite-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org<mailto:blair.bethwaite-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>> wrote: On
> 20 June 2016 at 09:21, Blair Bethwaite
> <blair.bethwaite-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org<mailto:blair.bethwaite-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>> wrote:
> > slow request issues). If you watch your xfs stats you'll likely get
> > further confirmation. In my experience xs_dir_lookups balloons (which
> > means directory lookups are missing cache and going to disk).
>
> Murphy's a bitch. Today we upgraded a cluster to latest Hammer in
> preparation for Jewel/RHCS2. Turns out when we last hit this very
> problem we had only ephemerally set the new filestore merge/split
> values - oops. Here's what started happening when we upgraded and
> restarted a bunch of OSDs:
> https://au-east.erc.monash.edu.au/swift/v1/public/grafana-ceph-xs_dir_lookup.png
>
> Seemed to cause lots of slow requests :-/. We corrected it about
> 12:30, then still took a while to settle.
>
> --
> Cheers,
> ~Blairo
--
Christian Balzer Network/Systems Engineer
chibi-FW+hd8ioUD0@public.gmane.org Global OnLine Japan/Rakuten Communications
http://www.gol.com/
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [ceph-users] Dramatic performance drop at certain number of objects in pool
2016-06-20 22:58 ` Christian Balzer
@ 2016-06-23 1:26 ` Wade Holler
[not found] ` <CA+e22SdrwRHmAD=67MpVtUXVyCOmidcoUXrANZVeDJc2tcJfnQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
0 siblings, 1 reply; 34+ messages in thread
From: Wade Holler @ 2016-06-23 1:26 UTC (permalink / raw)
To: Christian Balzer
Cc: Warren Wang - ISD, Blair Bethwaite, Wido den Hollander,
Ceph Development, ceph-users
Based on everyone's suggestions: the first modification to 50 / 16
enabled our config to get to ~645 million objects before the behavior in
question was observed (~330 million was the previous ceiling). A subsequent
modification to 50 / 24 has enabled us to get to 1.1 billion+.
Thank you all very much for your support and assistance.
Best Regards,
Wade
On Mon, Jun 20, 2016 at 6:58 PM, Christian Balzer <chibi@gol.com> wrote:
>
> Hello,
>
> On Mon, 20 Jun 2016 20:47:32 +0000 Warren Wang - ISD wrote:
>
>> Sorry, late to the party here. I agree, up the merge and split
>> thresholds. We're as high as 50/12. I chimed in on an RH ticket here.
>> One of those things you just have to find out as an operator since it's
>> not well documented :(
>>
>> https://bugzilla.redhat.com/show_bug.cgi?id=1219974
>>
>> We have over 200 million objects in this cluster, and it's still doing
>> over 15000 write IOPS all day long with 302 spinning drives + SATA SSD
>> journals. Having enough memory and dropping your vfs_cache_pressure
>> should also help.
>>
> Indeed.
>
> Since it was asked in that bug report and also my first suspicion, it
> would probably be good time to clarify that it isn't the splits that cause
> the performance degradation, but the resulting inflation of dir entries
> and exhaustion of SLAB and thus having to go to disk for things that
> normally would be in memory.
>
> Looking at Blair's graph from yesterday pretty much makes that clear, a
> purely split caused degradation should have relented much quicker.
>
>
>> Keep in mind that if you change the values, it won't take effect
>> immediately. It only merges them back if the directory is under the
>> calculated threshold and a write occurs (maybe a read, I forget).
>>
> If it's a read a plain scrub might do the trick.
>
> Christian
>> Warren
>>
>>
>> From: ceph-users
>> <ceph-users-bounces@lists.ceph.com<mailto:ceph-users-bounces@lists.ceph.com>>
>> on behalf of Wade Holler
>> <wade.holler@gmail.com<mailto:wade.holler@gmail.com>> Date: Monday, June
>> 20, 2016 at 2:48 PM To: Blair Bethwaite
>> <blair.bethwaite@gmail.com<mailto:blair.bethwaite@gmail.com>>, Wido den
>> Hollander <wido@42on.com<mailto:wido@42on.com>> Cc: Ceph Development
>> <ceph-devel@vger.kernel.org<mailto:ceph-devel@vger.kernel.org>>,
>> "ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>"
>> <ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>> Subject:
>> Re: [ceph-users] Dramatic performance drop at certain number of objects
>> in pool
>>
>> Thanks everyone for your replies. I sincerely appreciate it. We are
>> testing with different pg_num and filestore_split_multiple settings.
>> Early indications are .... well not great. Regardless it is nice to
>> understand the symptoms better so we try to design around it.
>>
>> Best Regards,
>> Wade
>>
>>
>> On Mon, Jun 20, 2016 at 2:32 AM Blair Bethwaite
>> <blair.bethwaite@gmail.com<mailto:blair.bethwaite@gmail.com>> wrote: On
>> 20 June 2016 at 09:21, Blair Bethwaite
>> <blair.bethwaite@gmail.com<mailto:blair.bethwaite@gmail.com>> wrote:
>> > slow request issues). If you watch your xfs stats you'll likely get
>> > further confirmation. In my experience xs_dir_lookups balloons (which
>> > means directory lookups are missing cache and going to disk).
>>
>> Murphy's a bitch. Today we upgraded a cluster to latest Hammer in
>> preparation for Jewel/RHCS2. Turns out when we last hit this very
>> problem we had only ephemerally set the new filestore merge/split
>> values - oops. Here's what started happening when we upgraded and
>> restarted a bunch of OSDs:
>> https://au-east.erc.monash.edu.au/swift/v1/public/grafana-ceph-xs_dir_lookup.png
>>
>> Seemed to cause lots of slow requests :-/. We corrected it about
>> 12:30, then still took a while to settle.
>>
>> --
>> Cheers,
>> ~Blairo
>
>
> --
> Christian Balzer Network/Systems Engineer
> chibi@gol.com Global OnLine Japan/Rakuten Communications
> http://www.gol.com/
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: Dramatic performance drop at certain number of objects in pool
[not found] ` <CA+e22SdrwRHmAD=67MpVtUXVyCOmidcoUXrANZVeDJc2tcJfnQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2016-06-23 1:33 ` Blair Bethwaite
2016-06-23 1:41 ` [ceph-users] " Wade Holler
2016-06-23 2:37 ` [ceph-users] " Christian Balzer
0 siblings, 2 replies; 34+ messages in thread
From: Blair Bethwaite @ 2016-06-23 1:33 UTC (permalink / raw)
To: Wade Holler; +Cc: Ceph Development, ceph-users-idqoXFIVOFJgJs9I8MT0rw
Wade, good to know.
For the record, what does this work out to roughly per OSD? And how
much RAM and how many PGs per OSD do you have?
What's your workload? I wonder whether for certain workloads (e.g.
RBD) it's better to increase default object size somewhat before
pushing the split/merge up a lot...
Cheers,
On 23 June 2016 at 11:26, Wade Holler <wade.holler-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> Based on everyones suggestions; The first modification to 50 / 16
> enabled our config to get to ~645Mill objects before the behavior in
> question was observed (~330 was the previous ceiling). Subsequent
> modification to 50 / 24 has enabled us to get to 1.1 Billion+
>
> Thank you all very much for your support and assistance.
>
> Best Regards,
> Wade
>
>
> On Mon, Jun 20, 2016 at 6:58 PM, Christian Balzer <chibi-FW+hd8ioUD0@public.gmane.org> wrote:
>>
>> Hello,
>>
>> On Mon, 20 Jun 2016 20:47:32 +0000 Warren Wang - ISD wrote:
>>
>>> Sorry, late to the party here. I agree, up the merge and split
>>> thresholds. We're as high as 50/12. I chimed in on an RH ticket here.
>>> One of those things you just have to find out as an operator since it's
>>> not well documented :(
>>>
>>> https://bugzilla.redhat.com/show_bug.cgi?id=1219974
>>>
>>> We have over 200 million objects in this cluster, and it's still doing
>>> over 15000 write IOPS all day long with 302 spinning drives + SATA SSD
>>> journals. Having enough memory and dropping your vfs_cache_pressure
>>> should also help.
>>>
>> Indeed.
>>
>> Since it was asked in that bug report and also my first suspicion, it
>> would probably be good time to clarify that it isn't the splits that cause
>> the performance degradation, but the resulting inflation of dir entries
>> and exhaustion of SLAB and thus having to go to disk for things that
>> normally would be in memory.
>>
>> Looking at Blair's graph from yesterday pretty much makes that clear, a
>> purely split caused degradation should have relented much quicker.
>>
>>
>>> Keep in mind that if you change the values, it won't take effect
>>> immediately. It only merges them back if the directory is under the
>>> calculated threshold and a write occurs (maybe a read, I forget).
>>>
>> If it's a read a plain scrub might do the trick.
>>
>> Christian
>>> Warren
>>>
>>>
>>> From: ceph-users
>>> <ceph-users-bounces-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org<mailto:ceph-users-bounces-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org>>
>>> on behalf of Wade Holler
>>> <wade.holler-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org<mailto:wade.holler-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>> Date: Monday, June
>>> 20, 2016 at 2:48 PM To: Blair Bethwaite
>>> <blair.bethwaite-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org<mailto:blair.bethwaite-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>>, Wido den
>>> Hollander <wido-fspyXLx8qC4@public.gmane.org<mailto:wido-fspyXLx8qC4@public.gmane.org>> Cc: Ceph Development
>>> <ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org<mailto:ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>>,
>>> "ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org<mailto:ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org>"
>>> <ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org<mailto:ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org>> Subject:
>>> Re: [ceph-users] Dramatic performance drop at certain number of objects
>>> in pool
>>>
>>> Thanks everyone for your replies. I sincerely appreciate it. We are
>>> testing with different pg_num and filestore_split_multiple settings.
>>> Early indications are .... well not great. Regardless it is nice to
>>> understand the symptoms better so we try to design around it.
>>>
>>> Best Regards,
>>> Wade
>>>
>>>
>>> On Mon, Jun 20, 2016 at 2:32 AM Blair Bethwaite
>>> <blair.bethwaite-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org<mailto:blair.bethwaite-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>> wrote: On
>>> 20 June 2016 at 09:21, Blair Bethwaite
>>> <blair.bethwaite-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org<mailto:blair.bethwaite-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>> wrote:
>>> > slow request issues). If you watch your xfs stats you'll likely get
>>> > further confirmation. In my experience xs_dir_lookups balloons (which
>>> > means directory lookups are missing cache and going to disk).
>>>
>>> Murphy's a bitch. Today we upgraded a cluster to latest Hammer in
>>> preparation for Jewel/RHCS2. Turns out when we last hit this very
>>> problem we had only ephemerally set the new filestore merge/split
>>> values - oops. Here's what started happening when we upgraded and
>>> restarted a bunch of OSDs:
>>> https://au-east.erc.monash.edu.au/swift/v1/public/grafana-ceph-xs_dir_lookup.png
>>>
>>> Seemed to cause lots of slow requests :-/. We corrected it about
>>> 12:30, then still took a while to settle.
>>>
>>> --
>>> Cheers,
>>> ~Blairo
>>
>>
>> --
>> Christian Balzer Network/Systems Engineer
>> chibi-FW+hd8ioUD0@public.gmane.org Global OnLine Japan/Rakuten Communications
>> http://www.gol.com/
--
Cheers,
~Blairo
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [ceph-users] Dramatic performance drop at certain number of objects in pool
2016-06-23 1:33 ` Blair Bethwaite
@ 2016-06-23 1:41 ` Wade Holler
2016-06-23 2:01 ` Blair Bethwaite
[not found] ` <CA+e22SfaiBUQ9Wanay6_oji9t7131o67B2oDtaEW_zXwqCJfbQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2016-06-23 2:37 ` [ceph-users] " Christian Balzer
1 sibling, 2 replies; 34+ messages in thread
From: Wade Holler @ 2016-06-23 1:41 UTC (permalink / raw)
To: Blair Bethwaite
Cc: Christian Balzer, Warren Wang - ISD, Wido den Hollander,
Ceph Development, ceph-users
Blairo,
We'll speak in pre-replication numbers, replication for this pool is 3.
23.3 Million Objects / OSD
pg_num 2048
16 OSDs / Server
3 Servers
660 GB RAM Total, 179 GB Used (free -t) / Server
vm.swappiness = 1
vm.vfs_cache_pressure = 100
Workload is native librados with Python. ALL 4k objects.
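
A quick sanity check of how the numbers above fit together, as a small Python sketch. It assumes objects and PGs are spread evenly across OSDs and that each object replica is stored as one file on its OSD; both are simplifications, and the names are made up for illustration:

```python
def cluster_numbers(objects_per_osd: int, osds: int, pg_num: int, replicas: int) -> dict:
    """Derive cluster-wide figures from per-OSD, pre-replication numbers."""
    total_objects = objects_per_osd * osds
    return {
        "total_objects": total_objects,                   # pre-replication
        "objects_per_pg": total_objects / pg_num,         # per PG, any replica
        "pg_copies_per_osd": pg_num * replicas / osds,    # PG directories per OSD
        "files_per_osd": objects_per_osd * replicas,      # on-disk, post-replication
    }


if __name__ == "__main__":
    # Figures as reported: 23.3M objects/OSD, 16 OSDs x 3 servers,
    # pg_num 2048, replication 3.
    for name, value in cluster_numbers(23_300_000, 16 * 3, 2048, 3).items():
        print(f"{name}: {value:,.2f}")
```

On these assumptions the total comes out a little over 1.1 billion objects, which matches the figure reported earlier in the thread, with roughly half a million objects in each PG's directory tree.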
Best Regards,
Wade
On Wed, Jun 22, 2016 at 9:33 PM, Blair Bethwaite
<blair.bethwaite@gmail.com> wrote:
> Wade, good to know.
>
> For the record, what does this work out to roughly per OSD? And how
> much RAM and how many PGs per OSD do you have?
>
> What's your workload? I wonder whether for certain workloads (e.g.
> RBD) it's better to increase default object size somewhat before
> pushing the split/merge up a lot...
>
> Cheers,
>
> On 23 June 2016 at 11:26, Wade Holler <wade.holler@gmail.com> wrote:
>> Based on everyones suggestions; The first modification to 50 / 16
>> enabled our config to get to ~645Mill objects before the behavior in
>> question was observed (~330 was the previous ceiling). Subsequent
>> modification to 50 / 24 has enabled us to get to 1.1 Billion+
>>
>> Thank you all very much for your support and assistance.
>>
>> Best Regards,
>> Wade
>>
>>
>> On Mon, Jun 20, 2016 at 6:58 PM, Christian Balzer <chibi@gol.com> wrote:
>>>
>>> Hello,
>>>
>>> On Mon, 20 Jun 2016 20:47:32 +0000 Warren Wang - ISD wrote:
>>>
>>>> Sorry, late to the party here. I agree, up the merge and split
>>>> thresholds. We're as high as 50/12. I chimed in on an RH ticket here.
>>>> One of those things you just have to find out as an operator since it's
>>>> not well documented :(
>>>>
>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1219974
>>>>
>>>> We have over 200 million objects in this cluster, and it's still doing
>>>> over 15000 write IOPS all day long with 302 spinning drives + SATA SSD
>>>> journals. Having enough memory and dropping your vfs_cache_pressure
>>>> should also help.
>>>>
>>> Indeed.
>>>
>>> Since it was asked in that bug report and also my first suspicion, it
>>> would probably be good time to clarify that it isn't the splits that cause
>>> the performance degradation, but the resulting inflation of dir entries
>>> and exhaustion of SLAB and thus having to go to disk for things that
>>> normally would be in memory.
>>>
>>> Looking at Blair's graph from yesterday pretty much makes that clear, a
>>> purely split caused degradation should have relented much quicker.
>>>
>>>
>>>> Keep in mind that if you change the values, it won't take effect
>>>> immediately. It only merges them back if the directory is under the
>>>> calculated threshold and a write occurs (maybe a read, I forget).
>>>>
>>> If it's a read a plain scrub might do the trick.
>>>
>>> Christian
>>>> Warren
>>>>
>>>>
>>>> From: ceph-users
>>>> <ceph-users-bounces@lists.ceph.com<mailto:ceph-users-bounces@lists.ceph.com>>
>>>> on behalf of Wade Holler
>>>> <wade.holler@gmail.com<mailto:wade.holler@gmail.com>> Date: Monday, June
>>>> 20, 2016 at 2:48 PM To: Blair Bethwaite
>>>> <blair.bethwaite@gmail.com<mailto:blair.bethwaite@gmail.com>>, Wido den
>>>> Hollander <wido@42on.com<mailto:wido@42on.com>> Cc: Ceph Development
>>>> <ceph-devel@vger.kernel.org<mailto:ceph-devel@vger.kernel.org>>,
>>>> "ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>"
>>>> <ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>> Subject:
>>>> Re: [ceph-users] Dramatic performance drop at certain number of objects
>>>> in pool
>>>>
>>>> Thanks everyone for your replies. I sincerely appreciate it. We are
>>>> testing with different pg_num and filestore_split_multiple settings.
>>>> Early indications are .... well not great. Regardless it is nice to
>>>> understand the symptoms better so we try to design around it.
>>>>
>>>> Best Regards,
>>>> Wade
>>>>
>>>>
>>>> On Mon, Jun 20, 2016 at 2:32 AM Blair Bethwaite
>>>> <blair.bethwaite@gmail.com<mailto:blair.bethwaite@gmail.com>> wrote: On
>>>> 20 June 2016 at 09:21, Blair Bethwaite
>>>> <blair.bethwaite@gmail.com<mailto:blair.bethwaite@gmail.com>> wrote:
>>>> > slow request issues). If you watch your xfs stats you'll likely get
>>>> > further confirmation. In my experience xs_dir_lookups balloons (which
>>>> > means directory lookups are missing cache and going to disk).
>>>>
>>>> Murphy's a bitch. Today we upgraded a cluster to latest Hammer in
>>>> preparation for Jewel/RHCS2. Turns out when we last hit this very
>>>> problem we had only ephemerally set the new filestore merge/split
>>>> values - oops. Here's what started happening when we upgraded and
>>>> restarted a bunch of OSDs:
>>>> https://au-east.erc.monash.edu.au/swift/v1/public/grafana-ceph-xs_dir_lookup.png
>>>>
>>>> Seemed to cause lots of slow requests :-/. We corrected it about
>>>> 12:30, then still took a while to settle.
>>>>
>>>> --
>>>> Cheers,
>>>> ~Blairo
>>>>
>>>> This email and any files transmitted with it are confidential and
>>>> intended solely for the individual or entity to whom they are addressed.
>>>> If you have received this email in error destroy it immediately. ***
>>>> Walmart Confidential ***
>>>
>>>
>>> --
>>> Christian Balzer Network/Systems Engineer
>>> chibi@gol.com Global OnLine Japan/Rakuten Communications
>>> http://www.gol.com/
>
>
>
> --
> Cheers,
> ~Blairo
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [ceph-users] Dramatic performance drop at certain number of objects in pool
2016-06-23 1:41 ` [ceph-users] " Wade Holler
@ 2016-06-23 2:01 ` Blair Bethwaite
2016-06-23 2:28 ` Christian Balzer
2016-06-23 2:31 ` Wade Holler
[not found] ` <CA+e22SfaiBUQ9Wanay6_oji9t7131o67B2oDtaEW_zXwqCJfbQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
1 sibling, 2 replies; 34+ messages in thread
From: Blair Bethwaite @ 2016-06-23 2:01 UTC (permalink / raw)
To: Wade Holler
Cc: Christian Balzer, Warren Wang - ISD, Wido den Hollander,
Ceph Development, ceph-users
On 23 June 2016 at 11:41, Wade Holler <wade.holler@gmail.com> wrote:
> Workload is native librados with python. ALL 4k objects.
Was that meant to be 4MB?
--
Cheers,
~Blairo
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [ceph-users] Dramatic performance drop at certain number of objects in pool
2016-06-23 2:01 ` Blair Bethwaite
@ 2016-06-23 2:28 ` Christian Balzer
2016-06-23 2:36 ` Blair Bethwaite
2016-06-23 2:31 ` Wade Holler
1 sibling, 1 reply; 34+ messages in thread
From: Christian Balzer @ 2016-06-23 2:28 UTC (permalink / raw)
To: Blair Bethwaite
Cc: Wade Holler, Warren Wang - ISD, Wido den Hollander,
Ceph Development, ceph-users
On Thu, 23 Jun 2016 12:01:38 +1000 Blair Bethwaite wrote:
> On 23 June 2016 at 11:41, Wade Holler <wade.holler@gmail.com> wrote:
> > Workload is native librados with python. ALL 4k objects.
>
> Was that meant to be 4MB?
>
Nope, he means 4K. He's putting lots of small objects into the cluster via
a Python script to test for exactly this problem.
See his original post.
Christian
--
Christian Balzer Network/Systems Engineer
chibi@gol.com Global OnLine Japan/Rakuten Communications
http://www.gol.com/
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [ceph-users] Dramatic performance drop at certain number of objects in pool
2016-06-23 2:01 ` Blair Bethwaite
2016-06-23 2:28 ` Christian Balzer
@ 2016-06-23 2:31 ` Wade Holler
1 sibling, 0 replies; 34+ messages in thread
From: Wade Holler @ 2016-06-23 2:31 UTC (permalink / raw)
To: Blair Bethwaite
Cc: Christian Balzer, Warren Wang - ISD, Wido den Hollander,
Ceph Development, ceph-users
No. Our application writes very small objects.
On Wed, Jun 22, 2016 at 10:01 PM, Blair Bethwaite
<blair.bethwaite@gmail.com> wrote:
> On 23 June 2016 at 11:41, Wade Holler <wade.holler@gmail.com> wrote:
>> Workload is native librados with python. ALL 4k objects.
>
> Was that meant to be 4MB?
>
> --
> Cheers,
> ~Blairo
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [ceph-users] Dramatic performance drop at certain number of objects in pool
2016-06-23 2:28 ` Christian Balzer
@ 2016-06-23 2:36 ` Blair Bethwaite
0 siblings, 0 replies; 34+ messages in thread
From: Blair Bethwaite @ 2016-06-23 2:36 UTC (permalink / raw)
To: Christian Balzer, Wade Holler
Cc: Warren Wang - ISD, Wido den Hollander, Ceph Development, ceph-users
Hi Christian,
Ah ok, I didn't see object size mentioned earlier. But I guess direct
rados small objects would be a rarish use-case and explains the very
high object counts.
I'm interested in finding the right balance for RBD given object size
is another variable that can be tweaked there. I recall the
UnitedStack folks using 32MB.
Cheers,
On 23 June 2016 at 12:28, Christian Balzer <chibi@gol.com> wrote:
> On Thu, 23 Jun 2016 12:01:38 +1000 Blair Bethwaite wrote:
>
>> On 23 June 2016 at 11:41, Wade Holler <wade.holler@gmail.com> wrote:
>> > Workload is native librados with python. ALL 4k objects.
>>
>> Was that meant to be 4MB?
>>
> Nope, he means 4K, he's putting lots of small objects via a python script
> into the cluster to test for exactly this problem.
>
> See his original post.
>
>
> Christian
> --
> Christian Balzer Network/Systems Engineer
> chibi@gol.com Global OnLine Japan/Rakuten Communications
> http://www.gol.com/
--
Cheers,
~Blairo
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [ceph-users] Dramatic performance drop at certain number of objects in pool
2016-06-23 1:33 ` Blair Bethwaite
2016-06-23 1:41 ` [ceph-users] " Wade Holler
@ 2016-06-23 2:37 ` Christian Balzer
[not found] ` <20160623113717.446a1f9d-9yhXNL7Kh0lSCLKNlHTxZM8NsWr+9BEh@public.gmane.org>
1 sibling, 1 reply; 34+ messages in thread
From: Christian Balzer @ 2016-06-23 2:37 UTC (permalink / raw)
To: ceph-users
Cc: Blair Bethwaite, Wade Holler, Warren Wang - ISD,
Wido den Hollander, Ceph Development
On Thu, 23 Jun 2016 11:33:05 +1000 Blair Bethwaite wrote:
> Wade, good to know.
>
> For the record, what does this work out to roughly per OSD? And how
> much RAM and how many PGs per OSD do you have?
>
> What's your workload? I wonder whether for certain workloads (e.g.
> RBD) it's better to increase default object size somewhat before
> pushing the split/merge up a lot...
>
I'd posit that RBD is _least_ likely to encounter this issue in a
moderately balanced setup.
Think about it, a 4MB RBD object can hold literally hundreds of files.
While with CephFS or RGW, a file or S3 object is going to cost you about 2
RADOS objects each.
Case in point, my main cluster (RBD images only) with 18 5+TB OSDs on 3
servers (64GB RAM each) has 1.8 million 4MB RBD objects using about 7% of
the available space.
Don't think I could hit this problem before running out of space.
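
To put the numbers above in perspective, here is a minimal sketch that extrapolates this cluster to 100% usage. It assumes the object count scales linearly with used space and that objects are spread evenly over the OSDs; whether the 1.8 million figure is counted before or after replication is not stated, so treat the result as an order-of-magnitude estimate.

```python
# Extrapolate the RBD-only cluster described above to a full cluster.
# Assumptions: object count grows linearly with space used, and objects
# are evenly spread over the OSDs.
rbd_objects = 1.8e6      # 4MB RBD objects currently stored
space_used = 0.07        # "about 7%" of the available space
osds = 18

objects_when_full = rbd_objects / space_used
objects_per_osd_when_full = objects_when_full / osds

print(f"Objects at 100% usage: {objects_when_full / 1e6:.1f} million")
print(f"Per OSD at 100% usage: {objects_per_osd_when_full / 1e6:.2f} million")
```

Even a completely full cluster would hold on the order of 1.4 million objects per OSD, far below the 23.3 million per OSD reported for the 4k-object test in this thread.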
Christian
> Cheers,
>
> On 23 June 2016 at 11:26, Wade Holler <wade.holler@gmail.com> wrote:
> > Based on everyones suggestions; The first modification to 50 / 16
> > enabled our config to get to ~645Mill objects before the behavior in
> > question was observed (~330 was the previous ceiling). Subsequent
> > modification to 50 / 24 has enabled us to get to 1.1 Billion+
> >
> > Thank you all very much for your support and assistance.
> >
> > Best Regards,
> > Wade
> >
> >
> > On Mon, Jun 20, 2016 at 6:58 PM, Christian Balzer <chibi@gol.com>
> > wrote:
> >>
> >> Hello,
> >>
> >> On Mon, 20 Jun 2016 20:47:32 +0000 Warren Wang - ISD wrote:
> >>
> >>> Sorry, late to the party here. I agree, up the merge and split
> >>> thresholds. We're as high as 50/12. I chimed in on an RH ticket here.
> >>> One of those things you just have to find out as an operator since
> >>> it's not well documented :(
> >>>
> >>> https://bugzilla.redhat.com/show_bug.cgi?id=1219974
> >>>
> >>> We have over 200 million objects in this cluster, and it's still
> >>> doing over 15000 write IOPS all day long with 302 spinning drives +
> >>> SATA SSD journals. Having enough memory and dropping your
> >>> vfs_cache_pressure should also help.
> >>>
> >> Indeed.
> >>
> >> Since it was asked in that bug report and also my first suspicion, it
> >> would probably be good time to clarify that it isn't the splits that
> >> cause the performance degradation, but the resulting inflation of dir
> >> entries and exhaustion of SLAB and thus having to go to disk for
> >> things that normally would be in memory.
> >>
> >> Looking at Blair's graph from yesterday pretty much makes that clear,
> >> a purely split caused degradation should have relented much quicker.
> >>
> >>
> >>> Keep in mind that if you change the values, it won't take effect
> >>> immediately. It only merges them back if the directory is under the
> >>> calculated threshold and a write occurs (maybe a read, I forget).
> >>>
> >> If it's a read a plain scrub might do the trick.
> >>
> >> Christian
> >>> Warren
> >>>
> >>>
> >>> From: ceph-users
> >>> <ceph-users-bounces@lists.ceph.com<mailto:ceph-users-bounces@lists.ceph.com>>
> >>> on behalf of Wade Holler
> >>> <wade.holler@gmail.com<mailto:wade.holler@gmail.com>> Date: Monday,
> >>> June 20, 2016 at 2:48 PM To: Blair Bethwaite
> >>> <blair.bethwaite@gmail.com<mailto:blair.bethwaite@gmail.com>>, Wido
> >>> den Hollander <wido@42on.com<mailto:wido@42on.com>> Cc: Ceph
> >>> Development
> >>> <ceph-devel@vger.kernel.org<mailto:ceph-devel@vger.kernel.org>>,
> >>> "ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>"
> >>> <ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>>
> >>> Subject: Re: [ceph-users] Dramatic performance drop at certain
> >>> number of objects in pool
> >>>
> >>> Thanks everyone for your replies. I sincerely appreciate it. We are
> >>> testing with different pg_num and filestore_split_multiple settings.
> >>> Early indications are .... well not great. Regardless it is nice to
> >>> understand the symptoms better so we try to design around it.
> >>>
> >>> Best Regards,
> >>> Wade
> >>>
> >>>
> >>> On Mon, Jun 20, 2016 at 2:32 AM Blair Bethwaite
> >>> <blair.bethwaite@gmail.com<mailto:blair.bethwaite@gmail.com>> wrote:
> >>> On 20 June 2016 at 09:21, Blair Bethwaite
> >>> <blair.bethwaite@gmail.com<mailto:blair.bethwaite@gmail.com>> wrote:
> >>> > slow request issues). If you watch your xfs stats you'll likely get
> >>> > further confirmation. In my experience xs_dir_lookups balloons
> >>> > (which means directory lookups are missing cache and going to
> >>> > disk).
> >>>
> >>> Murphy's a bitch. Today we upgraded a cluster to latest Hammer in
> >>> preparation for Jewel/RHCS2. Turns out when we last hit this very
> >>> problem we had only ephemerally set the new filestore merge/split
> >>> values - oops. Here's what started happening when we upgraded and
> >>> restarted a bunch of OSDs:
> >>> https://au-east.erc.monash.edu.au/swift/v1/public/grafana-ceph-xs_dir_lookup.png
> >>>
> >>> Seemed to cause lots of slow requests :-/. We corrected it about
> >>> 12:30, then still took a while to settle.
> >>>
> >>> --
> >>> Cheers,
> >>> ~Blairo
> >>>
> >>> This email and any files transmitted with it are confidential and
> >>> intended solely for the individual or entity to whom they are
> >>> addressed. If you have received this email in error destroy it
> >>> immediately. *** Walmart Confidential ***
> >>
> >>
> >> --
> >> Christian Balzer Network/Systems Engineer
> >> chibi@gol.com Global OnLine Japan/Rakuten Communications
> >> http://www.gol.com/
>
>
>
--
Christian Balzer Network/Systems Engineer
chibi@gol.com Global OnLine Japan/Rakuten Communications
http://www.gol.com/
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: Dramatic performance drop at certain number of objects in pool
[not found] ` <20160623113717.446a1f9d-9yhXNL7Kh0lSCLKNlHTxZM8NsWr+9BEh@public.gmane.org>
@ 2016-06-23 2:55 ` Blair Bethwaite
[not found] ` <CA+z5DszcLqV32NnWeuu+WsRZoZwM493Jfy7WcSpVtaDyArwFAQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
0 siblings, 1 reply; 34+ messages in thread
From: Blair Bethwaite @ 2016-06-23 2:55 UTC (permalink / raw)
To: Christian Balzer; +Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw, Ceph Development
On 23 June 2016 at 12:37, Christian Balzer <chibi-FW+hd8ioUD0@public.gmane.org> wrote:
> Case in point, my main cluster (RBD images only) with 18 5+TB OSDs on 3
> servers (64GB RAM each) has 1.8 million 4MB RBD objects using about 7% of
> the available space.
> Don't think I could hit this problem before running out of space.
Perhaps. However ~30TB per server is pretty low with present HDD
sizes. In the pool on our large cluster where we've seen this issue we
have 24x 4TB OSDs per server, and we first hit the problem in pre-prod
testing at about 20% usage (with default 4MB objects). We went to 40 /
8. Then as I reported the other day we hit the issue again at
somewhere around 50% usage. Now we're at 50 / 12.
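
For readers following the numbers, here is a small sketch of what those pairs imply per directory. It assumes the pairs in this thread are written as filestore_merge_threshold / filestore_split_multiple, and it uses the rule from the Ceph filestore configuration reference that a subdirectory is split once it holds more than filestore_split_multiple * abs(filestore_merge_threshold) * 16 files. The helper name is made up for illustration.

```python
def split_threshold(merge_threshold: int, split_multiple: int) -> int:
    """Files a filestore subdirectory may hold before it is split.

    Assumed formula: split_multiple * abs(merge_threshold) * 16.
    """
    return split_multiple * abs(merge_threshold) * 16


# (merge_threshold, split_multiple): the defaults, then values from this thread.
for merge, split in [(10, 2), (40, 8), (50, 12), (50, 16), (50, 24)]:
    limit = split_threshold(merge, split)
    print(f"{merge:>2} / {split:<2} -> {limit:>6} files per directory")
```

Raising either value delays the next round of splits, at the cost of larger directories.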
The boxes mentioned above are a couple of years old. Today we're
buying 2RU servers with 128TB in them (16x 8TB)!
Replacing our current NAS on RBD setup with CephFS is now starting to
scare me...
--
Cheers,
~Blairo
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: Dramatic performance drop at certain number of objects in pool
[not found] ` <CA+z5DszcLqV32NnWeuu+WsRZoZwM493Jfy7WcSpVtaDyArwFAQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2016-06-23 3:38 ` Christian Balzer
0 siblings, 0 replies; 34+ messages in thread
From: Christian Balzer @ 2016-06-23 3:38 UTC (permalink / raw)
To: Blair Bethwaite, Wade Holler
Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw, Ceph Development
Hello Blair, hello Wade (see below),
On Thu, 23 Jun 2016 12:55:17 +1000 Blair Bethwaite wrote:
> On 23 June 2016 at 12:37, Christian Balzer <chibi-FW+hd8ioUD0@public.gmane.org> wrote:
> > Case in point, my main cluster (RBD images only) with 18 5+TB OSDs on 3
> > servers (64GB RAM each) has 1.8 million 4MB RBD objects using about 7%
> > of the available space.
> > Don't think I could hit this problem before running out of space.
>
> Perhaps. However ~30TB per server is pretty low with present HDD
> sizes.
These are in fact 24 3TB HDDs per server, but in 6 RAID10s with 4 HDDs
each.
> In the pool on our large cluster where we've seen this issue we
> have 24x 4TB OSDs per server, and we first hit the problem in pre-prod
> testing at about 20% usage (with default 4MB objects). We went to 40 /
> 8. Then as I reported the other day we hit the issue again at
> somewhere around 50% usage. Now we're at 50 / 12.
>
High density storage servers have a number of other gotchas and tuning
requirements, I'd consider this simply another one.
As for increasing the default RBD object size, I'd be wary about
performance impacts, especially if you ever are going to have a cache-tier.
If there is no cache-tier in your future for certain, striping might
counteract larger objects.
> The boxes mentioned above are a couple of years old. Today we're
> buying 2RU servers with 128TB in them (16x 8TB)!
>
As people including me have noticed and noted, large OSDs are pushing things,
in more ways than just this issue.
I know very well how attractive it is from a cost and rack space (also
a cost factor of course) perspective to build dense storage nodes, but
most people need more IOPS than storage space and that's where smaller,
faster OSDs are better suited, as pointed out in the Ceph docs for a long
time.
> Replacing our current NAS on RBD setup with CephFS is now starting to
> scare me...
>
If this is going to happen when Bluestore is stable, this _particular_
problem should be a non-issue hopefully.
I'm sure Murphy will find other amusing ways to keep us entertained
and high-stressed, though.
If nothing else, CephFS would scare me more than a by now well known
problem that can be tuned away.
A question/request for Wade: would it be possible to reformat your OSDs
with Ext4 (I know, deprecated, but if you know what you're doing...) or BTRFS?
I'm wondering whether either avoids this behavior or, if not, exhibits it
at a different point.
Christian
--
Christian Balzer Network/Systems Engineer
chibi-FW+hd8ioUD0@public.gmane.org Global OnLine Japan/Rakuten Communications
http://www.gol.com/
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: Dramatic performance drop at certain number of objects in pool
[not found] ` <CA+e22SfaiBUQ9Wanay6_oji9t7131o67B2oDtaEW_zXwqCJfbQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2016-06-23 22:09 ` Warren Wang - ISD
[not found] ` <D391D1A4.145D6%warren.wang-dFwxUrggiyBBDgjK7y7TUQ@public.gmane.org>
0 siblings, 1 reply; 34+ messages in thread
From: Warren Wang - ISD @ 2016-06-23 22:09 UTC (permalink / raw)
To: Wade Holler, Blair Bethwaite
Cc: Ceph Development, ceph-users-idqoXFIVOFJgJs9I8MT0rw
vm.vfs_cache_pressure = 100
Go the other direction on that. You'll want to keep it low to help keep
inode/dentry info in memory. We use 10, and haven't had a problem.
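
A minimal sketch for checking the current value on a node before changing it. It assumes a Linux host exposing sysctls under /proc/sys/vm; the helper name and its `root` parameter are made up for illustration and are not part of any Ceph tooling.

```python
from pathlib import Path


def read_vm_sysctl(name: str, root: str = "/proc/sys/vm") -> int:
    """Return the integer value of a vm.* sysctl, e.g. 'vfs_cache_pressure'."""
    return int(Path(root, name).read_text().strip())


if __name__ == "__main__":
    try:
        print("vm.vfs_cache_pressure =", read_vm_sysctl("vfs_cache_pressure"))
    except OSError as exc:
        # Not a Linux host, or /proc is not mounted.
        print("could not read sysctl:", exc)
```

To make a lower value survive reboots, a line such as `vm.vfs_cache_pressure = 10` can go in /etc/sysctl.conf.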
Warren Wang
On 6/22/16, 9:41 PM, "Wade Holler" <wade.holler-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>Blairo,
>
>We'll speak in pre-replication numbers, replication for this pool is 3.
>
>23.3 Million Objects / OSD
>pg_num 2048
>16 OSDs / Server
>3 Servers
>660 GB RAM Total, 179 GB Used (free -t) / Server
>vm.swappiness = 1
>vm.vfs_cache_pressure = 100
>
>Workload is native librados with python. ALL 4k objects.
>
>Best Regards,
>Wade
>
>
>On Wed, Jun 22, 2016 at 9:33 PM, Blair Bethwaite
><blair.bethwaite-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>> Wade, good to know.
>>
>> For the record, what does this work out to roughly per OSD? And how
>> much RAM and how many PGs per OSD do you have?
>>
>> What's your workload? I wonder whether for certain workloads (e.g.
>> RBD) it's better to increase default object size somewhat before
>> pushing the split/merge up a lot...
>>
>> Cheers,
>>
>> On 23 June 2016 at 11:26, Wade Holler <wade.holler-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>>> Based on everyones suggestions; The first modification to 50 / 16
>>> enabled our config to get to ~645Mill objects before the behavior in
>>> question was observed (~330 was the previous ceiling). Subsequent
>>> modification to 50 / 24 has enabled us to get to 1.1 Billion+
>>>
>>> Thank you all very much for your support and assistance.
>>>
>>> Best Regards,
>>> Wade
>>>
>>>
>>> On Mon, Jun 20, 2016 at 6:58 PM, Christian Balzer <chibi-FW+hd8ioUD0@public.gmane.org>
>>>wrote:
>>>>
>>>> Hello,
>>>>
>>>> On Mon, 20 Jun 2016 20:47:32 +0000 Warren Wang - ISD wrote:
>>>>
>>>>> Sorry, late to the party here. I agree, up the merge and split
>>>>> thresholds. We're as high as 50/12. I chimed in on an RH ticket here.
>>>>> One of those things you just have to find out as an operator since
>>>>>it's
>>>>> not well documented :(
>>>>>
>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1219974
>>>>>
>>>>> We have over 200 million objects in this cluster, and it's still
>>>>>doing
>>>>> over 15000 write IOPS all day long with 302 spinning drives + SATA
>>>>>SSD
>>>>> journals. Having enough memory and dropping your vfs_cache_pressure
>>>>> should also help.
>>>>>
>>>> Indeed.
>>>>
>>>> Since it was asked in that bug report and also my first suspicion, it
>>>> would probably be good time to clarify that it isn't the splits that
>>>>cause
>>>> the performance degradation, but the resulting inflation of dir
>>>>entries
>>>> and exhaustion of SLAB and thus having to go to disk for things that
>>>> normally would be in memory.
>>>>
>>>> Looking at Blair's graph from yesterday pretty much makes that clear,
>>>>a
>>>> purely split caused degradation should have relented much quicker.
>>>>
>>>>
>>>>> Keep in mind that if you change the values, it won't take effect
>>>>> immediately. It only merges them back if the directory is under the
>>>>> calculated threshold and a write occurs (maybe a read, I forget).
>>>>>
>>>> If it's a read a plain scrub might do the trick.
>>>>
>>>> Christian
>>>>> Warren
>>>>>
>>>>>
>>>>> From: ceph-users
>>>>>
>>>>><ceph-users-bounces-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org<mailto:ceph-users-bounces-idqoXFIVOFJ4Eiagz67IpQ@public.gmane.org
>>>>>h.com>>
>>>>> on behalf of Wade Holler
>>>>> <wade.holler-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org<mailto:wade.holler-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>> Date: Monday,
>>>>>June
>>>>> 20, 2016 at 2:48 PM To: Blair Bethwaite
>>>>> <blair.bethwaite-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org<mailto:blair.bethwaite-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>>, Wido
>>>>>den
>>>>> Hollander <wido-fspyXLx8qC4@public.gmane.org<mailto:wido-fspyXLx8qC4@public.gmane.org>> Cc: Ceph Development
>>>>> <ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org<mailto:ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>>,
>>>>> "ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org<mailto:ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org>"
>>>>> <ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org<mailto:ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org>>
>>>>>Subject:
>>>>> Re: [ceph-users] Dramatic performance drop at certain number of
>>>>>objects
>>>>> in pool
>>>>>
>>>>> Thanks everyone for your replies. I sincerely appreciate it. We are
>>>>> testing with different pg_num and filestore_split_multiple settings.
>>>>> Early indications are .... well not great. Regardless it is nice to
>>>>> understand the symptoms better so we try to design around it.
>>>>>
>>>>> Best Regards,
>>>>> Wade
>>>>>
>>>>>
>>>>> On Mon, Jun 20, 2016 at 2:32 AM Blair Bethwaite
>>>>> <blair.bethwaite-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org<mailto:blair.bethwaite-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>> wrote:
>>>>>On
>>>>> 20 June 2016 at 09:21, Blair Bethwaite
>>>>> <blair.bethwaite-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org<mailto:blair.bethwaite-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>> wrote:
>>>>> > slow request issues). If you watch your xfs stats you'll likely get
>>>>> > further confirmation. In my experience xs_dir_lookups balloons
>>>>>(which
>>>>> > means directory lookups are missing cache and going to disk).
>>>>>
>>>>> Murphy's a bitch. Today we upgraded a cluster to latest Hammer in
>>>>> preparation for Jewel/RHCS2. Turns out when we last hit this very
>>>>> problem we had only ephemerally set the new filestore merge/split
>>>>> values - oops. Here's what started happening when we upgraded and
>>>>> restarted a bunch of OSDs:
>>>>>
>>>>>https://au-east.erc.monash.edu.au/swift/v1/public/grafana-ceph-xs_dir_lookup.png
>>>>>
>>>>> Seemed to cause lots of slow requests :-/. We corrected it about
>>>>> 12:30, then still took a while to settle.
>>>>>
>>>>> --
>>>>> Cheers,
>>>>> ~Blairo
>>>>>
>>>>> This email and any files transmitted with it are confidential and
>>>>> intended solely for the individual or entity to whom they are
>>>>>addressed.
>>>>> If you have received this email in error destroy it immediately. ***
>>>>> Walmart Confidential ***
>>>>
>>>>
>>>> --
>>>> Christian Balzer Network/Systems Engineer
>>>> chibi-FW+hd8ioUD0@public.gmane.org Global OnLine Japan/Rakuten Communications
>>>> http://www.gol.com/
>>
>>
>>
>> --
>> Cheers,
>> ~Blairo
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: Dramatic performance drop at certain number of objects in pool
[not found] ` <D391D1A4.145D6%warren.wang-dFwxUrggiyBBDgjK7y7TUQ@public.gmane.org>
@ 2016-06-23 22:24 ` Somnath Roy
[not found] ` <BL2PR02MB2115BD5C173011A0CB92F964F42D0-TNqo25UYn65rzea/mugEKanrV9Ap65cLvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
0 siblings, 1 reply; 34+ messages in thread
From: Somnath Roy @ 2016-06-23 22:24 UTC (permalink / raw)
To: Warren Wang - ISD, Wade Holler, Blair Bethwaite
Cc: Ceph Development, ceph-users-idqoXFIVOFJgJs9I8MT0rw
Or even vm.vfs_cache_pressure = 0 if you have sufficient memory to *pin* inode/dentries in memory.
We have been using that for a long time now (with 128 TB node memory) and it seems to help, especially for random write workloads, by saving xattr reads in between.
Thanks & Regards
Somnath
-----Original Message-----
From: ceph-users [mailto:ceph-users-bounces-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org] On Behalf Of Warren Wang - ISD
Sent: Thursday, June 23, 2016 3:09 PM
To: Wade Holler; Blair Bethwaite
Cc: Ceph Development; ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
Subject: Re: [ceph-users] Dramatic performance drop at certain number of objects in pool
vm.vfs_cache_pressure = 100
Go the other direction on that. You'll want to keep it low to help keep inode/dentry info in memory. We use 10, and haven't had a problem.
Warren Wang
On 6/22/16, 9:41 PM, "Wade Holler" <wade.holler-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>Blairo,
>
>We'll speak in pre-replication numbers, replication for this pool is 3.
>
>23.3 Million Objects / OSD
>pg_num 2048
>16 OSDs / Server
>3 Servers
>660 GB RAM Total, 179 GB Used (free -t) / Server
>vm.swappiness = 1
>vm.vfs_cache_pressure = 100
>
>Workload is native librados with python. ALL 4k objects.
>
>Best Regards,
>Wade
>
>
>On Wed, Jun 22, 2016 at 9:33 PM, Blair Bethwaite
><blair.bethwaite-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>> Wade, good to know.
>>
>> For the record, what does this work out to roughly per OSD? And how
>> much RAM and how many PGs per OSD do you have?
>>
>> What's your workload? I wonder whether for certain workloads (e.g.
>> RBD) it's better to increase default object size somewhat before
>> pushing the split/merge up a lot...
>>
>> Cheers,
>>
>> On 23 June 2016 at 11:26, Wade Holler <wade.holler-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>>> Based on everyones suggestions; The first modification to 50 / 16
>>> enabled our config to get to ~645Mill objects before the behavior in
>>> question was observed (~330 was the previous ceiling). Subsequent
>>> modification to 50 / 24 has enabled us to get to 1.1 Billion+
>>>
>>> Thank you all very much for your support and assistance.
>>>
>>> Best Regards,
>>> Wade
>>>
>>>
>>> On Mon, Jun 20, 2016 at 6:58 PM, Christian Balzer <chibi-FW+hd8ioUD0@public.gmane.org>
>>>wrote:
>>>>
>>>> Hello,
>>>>
>>>> On Mon, 20 Jun 2016 20:47:32 +0000 Warren Wang - ISD wrote:
>>>>
>>>>> Sorry, late to the party here. I agree, up the merge and split
>>>>>thresholds. We're as high as 50/12. I chimed in on an RH ticket here.
>>>>> One of those things you just have to find out as an operator since
>>>>>it's not well documented :(
>>>>>
>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1219974
>>>>>
>>>>> We have over 200 million objects in this cluster, and it's still
>>>>>doing over 15000 write IOPS all day long with 302 spinning drives
>>>>>+ SATA SSD journals. Having enough memory and dropping your
>>>>>vfs_cache_pressure should also help.
>>>>>
>>>> Indeed.
>>>>
>>>> Since it was asked in that bug report and also my first suspicion,
>>>>it would probably be good time to clarify that it isn't the splits
>>>>that cause the performance degradation, but the resulting inflation
>>>>of dir entries and exhaustion of SLAB and thus having to go to disk
>>>>for things that normally would be in memory.
>>>>
>>>> Looking at Blair's graph from yesterday pretty much makes that
>>>>clear, a purely split caused degradation should have relented much
>>>>quicker.
>>>>
>>>>
>>>>> Keep in mind that if you change the values, it won't take effect
>>>>> immediately. It only merges them back if the directory is under
>>>>> the calculated threshold and a write occurs (maybe a read, I forget).
>>>>>
>>>> If it's a read a plain scrub might do the trick.
>>>>
>>>> Christian
>>>>> Warren
>>>>>
>>>>>
>>>>> From: ceph-users
>>>>> <ceph-users-bounces-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org>
>>>>> on behalf of Wade Holler
>>>>> <wade.holler-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
>>>>> Date: Monday, June 20, 2016 at 2:48 PM
>>>>> To: Blair Bethwaite
>>>>> <blair.bethwaite-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>,
>>>>> Wido den Hollander <wido-fspyXLx8qC4@public.gmane.org>
>>>>> Cc: Ceph Development
>>>>> <ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>,
>>>>> "ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org"
>>>>> <ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org>
>>>>> Subject: Re: [ceph-users] Dramatic performance drop at certain
>>>>> number of objects in pool
>>>>>
>>>>> Thanks everyone for your replies. I sincerely appreciate it. We
>>>>> are testing with different pg_num and filestore_split_multiple settings.
>>>>> Early indications are .... well not great. Regardless it is nice
>>>>> to understand the symptoms better so we try to design around it.
>>>>>
>>>>> Best Regards,
>>>>> Wade
>>>>>
>>>>>
>>>>> On Mon, Jun 20, 2016 at 2:32 AM Blair Bethwaite
>>>>><blair.bethwaite-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org<mailto:blair.bethwaite-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>> wrote:
>>>>>On
>>>>> 20 June 2016 at 09:21, Blair Bethwaite
>>>>><blair.bethwaite-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org<mailto:blair.bethwaite-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>> wrote:
>>>>> > slow request issues). If you watch your xfs stats you'll likely
>>>>> > get further confirmation. In my experience xs_dir_lookups
>>>>> > balloons
>>>>>(which
>>>>> > means directory lookups are missing cache and going to disk).
>>>>>
>>>>> Murphy's a bitch. Today we upgraded a cluster to latest Hammer in
>>>>> preparation for Jewel/RHCS2. Turns out when we last hit this very
>>>>> problem we had only ephemerally set the new filestore merge/split
>>>>> values - oops. Here's what started happening when we upgraded and
>>>>> restarted a bunch of OSDs:
>>>>>
>>>>>https://au-east.erc.monash.edu.au/swift/v1/public/grafana-ceph-xs_dir_lookup.png
>>>>>
>>>>> Seemed to cause lots of slow requests :-/. We corrected it about
>>>>> 12:30, then still took a while to settle.
>>>>>
>>>>> --
>>>>> Cheers,
>>>>> ~Blairo
>>>>>
>>>>> This email and any files transmitted with it are confidential and
>>>>>intended solely for the individual or entity to whom they are
>>>>>addressed.
>>>>> If you have received this email in error destroy it immediately.
>>>>>*** Walmart Confidential ***
>>>>
>>>>
>>>> --
>>>> Christian Balzer Network/Systems Engineer
>>>> chibi-FW+hd8ioUD0@public.gmane.org Global OnLine Japan/Rakuten Communications
>>>> http://www.gol.com/
>>
>>
>>
>> --
>> Cheers,
>> ~Blairo
_______________________________________________
ceph-users mailing list
ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: Dramatic performance drop at certain number of objects in pool
[not found] ` <BL2PR02MB2115BD5C173011A0CB92F964F42D0-TNqo25UYn65rzea/mugEKanrV9Ap65cLvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
@ 2016-06-24 0:08 ` Christian Balzer
[not found] ` <20160624090806.1246b1ff-9yhXNL7Kh0lSCLKNlHTxZM8NsWr+9BEh@public.gmane.org>
0 siblings, 1 reply; 34+ messages in thread
From: Christian Balzer @ 2016-06-24 0:08 UTC (permalink / raw)
To: ceph-users-idqoXFIVOFJgJs9I8MT0rw; +Cc: Blair Bethwaite, Ceph Development
Hello,
On Thu, 23 Jun 2016 22:24:59 +0000 Somnath Roy wrote:
> Or even vm.vfs_cache_pressure = 0 if you have sufficient memory to *pin*
> inode/dentries in memory. We are using that for long now (with 128 TB
> node memory) and it seems helping specially for the random write
> workload and saving xattrs read in between.
>
128TB node memory, really?
Can I have some of those, too? ^o^
And here I was thinking that Wade's 660GB machines were on the excessive
side.
There's something to be said (and optimized) when your storage nodes have
as much RAM as your compute nodes, or more...
As for Warren, well spotted.
I personally use vm.vfs_cache_pressure = 1. Unlike 0, this still lets the
kernel reclaim dentries and inodes if your memory is really needed
elsewhere, which avoids the potential fireworks, while normally keeping
things in memory.
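For anyone who wants to see where their own nodes stand, here is a
minimal Python sketch (an illustration only, not taken from anyone's test
code in this thread) that reads the current vm.vfs_cache_pressure and
summarises slab usage from /proc/meminfo. The function names are made up
for this example; the /proc paths and the MemTotal, Slab and SReclaimable
fields are the standard Linux ones.

```python
from pathlib import Path


def read_vfs_cache_pressure(path="/proc/sys/vm/vfs_cache_pressure"):
    """Return the current vm.vfs_cache_pressure as an int."""
    return int(Path(path).read_text().strip())


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field_name: numeric_value}."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def slab_summary(meminfo_text):
    """Summarise slab usage (kB) and its share of total RAM (percent)."""
    info = parse_meminfo(meminfo_text)
    slab = info.get("Slab", 0)
    return {
        "slab_kb": slab,
        "reclaimable_kb": info.get("SReclaimable", 0),
        "slab_pct_of_ram": round(100.0 * slab / info["MemTotal"], 1),
    }
```

On a Linux OSD host, calling read_vfs_cache_pressure() and
slab_summary(Path("/proc/meminfo").read_text()) before and after the
slowdown should show whether the dentry/inode slab is what is being
squeezed.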
Christian
> Thanks & Regards
> Somnath
>
> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@lists.ceph.com] On Behalf Of
> Warren Wang - ISD Sent: Thursday, June 23, 2016 3:09 PM
> To: Wade Holler; Blair Bethwaite
> Cc: Ceph Development; ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Dramatic performance drop at certain number of
> objects in pool
>
> vm.vfs_cache_pressure = 100
>
> Go the other direction on that. You'll want to keep it low to help keep
> inode/dentry info in memory. We use 10, and haven't had a problem.
>
>
> Warren Wang
>
>
>
>
> On 6/22/16, 9:41 PM, "Wade Holler" <wade.holler@gmail.com> wrote:
>
> >Blairo,
> >
> >We'll speak in pre-replication numbers, replication for this pool is 3.
> >
> >23.3 Million Objects / OSD
> >pg_num 2048
> >16 OSDs / Server
> >3 Servers
> >660 GB RAM Total, 179 GB Used (free -t) / Server vm.swappiness = 1
> >vm.vfs_cache_pressure = 100
> >
> >Workload is native librados with python. ALL 4k objects.
> >
> >Best Regards,
> >Wade
> >
> >
> >On Wed, Jun 22, 2016 at 9:33 PM, Blair Bethwaite
> ><blair.bethwaite@gmail.com> wrote:
> >> Wade, good to know.
> >>
> >> For the record, what does this work out to roughly per OSD? And how
> >> much RAM and how many PGs per OSD do you have?
> >>
> >> What's your workload? I wonder whether for certain workloads (e.g.
> >> RBD) it's better to increase default object size somewhat before
> >> pushing the split/merge up a lot...
> >>
> >> Cheers,
> >>
> >> On 23 June 2016 at 11:26, Wade Holler <wade.holler@gmail.com> wrote:
> >>> Based on everyones suggestions; The first modification to 50 / 16
> >>> enabled our config to get to ~645Mill objects before the behavior in
> >>> question was observed (~330 was the previous ceiling). Subsequent
> >>> modification to 50 / 24 has enabled us to get to 1.1 Billion+
> >>>
> >>> Thank you all very much for your support and assistance.
> >>>
> >>> Best Regards,
> >>> Wade
> >>>
> >>>
> >>> On Mon, Jun 20, 2016 at 6:58 PM, Christian Balzer <chibi@gol.com>
> >>>wrote:
> >>>>
> >>>> Hello,
> >>>>
> >>>> On Mon, 20 Jun 2016 20:47:32 +0000 Warren Wang - ISD wrote:
> >>>>
> >>>>> Sorry, late to the party here. I agree, up the merge and split
> >>>>>thresholds. We're as high as 50/12. I chimed in on an RH ticket
> >>>>>here.
> >>>>> One of those things you just have to find out as an operator since
> >>>>>it's not well documented :(
> >>>>>
> >>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1219974
> >>>>>
> >>>>> We have over 200 million objects in this cluster, and it's still
> >>>>>doing over 15000 write IOPS all day long with 302 spinning drives
> >>>>>+ SATA SSD journals. Having enough memory and dropping your
> >>>>>vfs_cache_pressure should also help.
> >>>>>
> >>>> Indeed.
> >>>>
> >>>> Since it was asked in that bug report and also my first suspicion,
> >>>>it would probably be good time to clarify that it isn't the splits
> >>>>that cause the performance degradation, but the resulting inflation
> >>>>of dir entries and exhaustion of SLAB and thus having to go to disk
> >>>>for things that normally would be in memory.
> >>>>
> >>>> Looking at Blair's graph from yesterday pretty much makes that
> >>>>clear, a purely split caused degradation should have relented much
> >>>>quicker.
> >>>>
> >>>>
> >>>>> Keep in mind that if you change the values, it won't take effect
> >>>>> immediately. It only merges them back if the directory is under
> >>>>> the calculated threshold and a write occurs (maybe a read, I
> >>>>> forget).
> >>>>>
> >>>> If it's a read a plain scrub might do the trick.
> >>>>
> >>>> Christian
> >>>>> Warren
> >>>>>
> >>>>>
> >>>>> From: ceph-users <ceph-users-bounces@lists.ceph.com>
> >>>>> on behalf of Wade Holler <wade.holler@gmail.com>
> >>>>> Date: Monday, June 20, 2016 at 2:48 PM
> >>>>> To: Blair Bethwaite <blair.bethwaite@gmail.com>,
> >>>>> Wido den Hollander <wido@42on.com>
> >>>>> Cc: Ceph Development <ceph-devel@vger.kernel.org>,
> >>>>> "ceph-users@lists.ceph.com" <ceph-users@lists.ceph.com>
> >>>>> Subject: Re: [ceph-users] Dramatic performance drop at certain
> >>>>> number of objects in pool
> >>>>>
> >>>>> Thanks everyone for your replies. I sincerely appreciate it. We
> >>>>> are testing with different pg_num and filestore_split_multiple
> >>>>> settings. Early indications are .... well not great. Regardless it
> >>>>> is nice to understand the symptoms better so we try to design
> >>>>> around it.
> >>>>>
> >>>>> Best Regards,
> >>>>> Wade
> >>>>>
> >>>>>
> >>>>> On Mon, Jun 20, 2016 at 2:32 AM Blair Bethwaite
> >>>>><blair.bethwaite@gmail.com<mailto:blair.bethwaite@gmail.com>> wrote:
> >>>>>On
> >>>>> 20 June 2016 at 09:21, Blair Bethwaite
> >>>>><blair.bethwaite@gmail.com<mailto:blair.bethwaite@gmail.com>> wrote:
> >>>>> > slow request issues). If you watch your xfs stats you'll likely
> >>>>> > get further confirmation. In my experience xs_dir_lookups
> >>>>> > balloons
> >>>>>(which
> >>>>> > means directory lookups are missing cache and going to disk).
> >>>>>
> >>>>> Murphy's a bitch. Today we upgraded a cluster to latest Hammer in
> >>>>> preparation for Jewel/RHCS2. Turns out when we last hit this very
> >>>>> problem we had only ephemerally set the new filestore merge/split
> >>>>> values - oops. Here's what started happening when we upgraded and
> >>>>> restarted a bunch of OSDs:
> >>>>>
> >>>>>https://au-east.erc.monash.edu.au/swift/v1/public/grafana-ceph-xs_dir_lookup.png
> >>>>>
> >>>>> Seemed to cause lots of slow requests :-/. We corrected it about
> >>>>> 12:30, then still took a while to settle.
> >>>>>
> >>>>> --
> >>>>> Cheers,
> >>>>> ~Blairo
> >>>>>
> >>>>> This email and any files transmitted with it are confidential and
> >>>>>intended solely for the individual or entity to whom they are
> >>>>>addressed.
> >>>>> If you have received this email in error destroy it immediately.
> >>>>>*** Walmart Confidential ***
> >>>>
> >>>>
> >>>> --
> >>>> Christian Balzer Network/Systems Engineer
> >>>> chibi@gol.com Global OnLine Japan/Rakuten Communications
> >>>> http://www.gol.com/
> >>
> >>
> >>
> >> --
> >> Cheers,
> >> ~Blairo
>
--
Christian Balzer Network/Systems Engineer
chibi@gol.com Global OnLine Japan/Rakuten Communications
http://www.gol.com/
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: Dramatic performance drop at certain number of objects in pool
[not found] ` <20160624090806.1246b1ff-9yhXNL7Kh0lSCLKNlHTxZM8NsWr+9BEh@public.gmane.org>
@ 2016-06-24 0:09 ` Somnath Roy
2016-06-24 14:23 ` [ceph-users] " Wade Holler
0 siblings, 1 reply; 34+ messages in thread
From: Somnath Roy @ 2016-06-24 0:09 UTC (permalink / raw)
To: Christian Balzer, ceph-users-idqoXFIVOFJgJs9I8MT0rw
Cc: Blair Bethwaite, Ceph Development
Oops, typo: that should be 128 GB :-)
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [ceph-users] Dramatic performance drop at certain number of objects in pool
2016-06-24 0:09 ` Somnath Roy
@ 2016-06-24 14:23 ` Wade Holler
[not found] ` <CA+e22SdmGJVzJX9+63T41UGsfFcxs9R=xZqniQyTgu-yG=h0cA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
[not found] ` <CAFMfnwoqbr+_c913oyxpvzHNS+NPdXX17dMdXoC1ZiuZM1GzPw@mail.gmail.com>
0 siblings, 2 replies; 34+ messages in thread
From: Wade Holler @ 2016-06-24 14:23 UTC (permalink / raw)
To: Somnath Roy
Cc: Christian Balzer, ceph-users, Warren Wang - ISD, Blair Bethwaite,
Ceph Development
On vm.vfs_cache_pressure = 1: we had this initially, and I still
think it is the best choice for most configs. However, with our large
memory footprint, vfs_cache_pressure = 1 increased the likelihood of
hitting an issue where our write response time would double; dropping
caches would then return response time to normal. I don't claim to
totally understand this, and I only have speculation at the moment.
Again, thanks for the suggestion; I do think it is best for boxes that
don't have very large memory.
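The "drop of caches" above can be sketched roughly as follows, assuming
it means writing to the Linux /proc/sys/vm/drop_caches sysctl. The
function name and its parameters are hypothetical and only for
illustration; this is not necessarily the exact procedure used here, and
on a real host it needs root.

```python
import os

# Values accepted by /proc/sys/vm/drop_caches (Linux kernel sysctl docs):
# 1 = page cache, 2 = reclaimable slab (dentries and inodes), 3 = both.
VALID_MODES = (1, 2, 3)


def drop_caches(mode=2, path="/proc/sys/vm/drop_caches", sync_first=True):
    """Ask the kernel to drop clean caches; returns the mode written."""
    if mode not in VALID_MODES:
        raise ValueError("mode must be 1, 2 or 3, got %r" % (mode,))
    if sync_first:
        # Dirty objects cannot be freed, so flush them to disk first.
        os.sync()
    with open(path, "w") as f:
        f.write("%d\n" % mode)
    return mode
```

Mode 2 is the one relevant to this thread, since it targets dentries and
inodes rather than the page cache. The path parameter exists only so the
function can be pointed at a scratch file when trying it out.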
@ Christian - reformatting to btrfs or ext4 is an option in my test
cluster. I thought about that but needed to sort out xfs first (that's
what production will run right now). You all have helped me do that, and
thank you again. I will circle back and test btrfs under the same
conditions. I suspect that it will behave similarly, but it's only a
day and a half's work or so to test.
Best Regards,
Wade
On Thu, Jun 23, 2016 at 8:09 PM, Somnath Roy <Somnath.Roy@sandisk.com> wrote:
> Oops , typo , 128 GB :-)...
>
> -----Original Message-----
> From: Christian Balzer [mailto:chibi@gol.com]
> Sent: Thursday, June 23, 2016 5:08 PM
> To: ceph-users@lists.ceph.com
> Cc: Somnath Roy; Warren Wang - ISD; Wade Holler; Blair Bethwaite; Ceph Development
> Subject: Re: [ceph-users] Dramatic performance drop at certain number of objects in pool
>
>
> Hello,
>
> On Thu, 23 Jun 2016 22:24:59 +0000 Somnath Roy wrote:
>
>> Or even vm.vfs_cache_pressure = 0 if you have sufficient memory to
>> *pin* inode/dentries in memory. We are using that for long now (with
>> 128 TB node memory) and it seems helping specially for the random
>> write workload and saving xattrs read in between.
>>
> 128TB node memory, really?
> Can I have some of those, too? ^o^
> And here I was thinking that Wade's 660GB machines were on the excessive side.
>
> There's something to be said (and optimized) when your storage nodes have the same or more RAM as your compute nodes...
>
> As for Warren, well spotted.
> I personally use vm.vfs_cache_pressure = 1, this avoids the potential fireworks if your memory is really needed elsewhere, while keeping things in memory normally.
>
> Christian
>
>> Thanks & Regards
>> Somnath
>>
>> -----Original Message-----
>> From: ceph-users [mailto:ceph-users-bounces@lists.ceph.com] On Behalf
>> Of Warren Wang - ISD Sent: Thursday, June 23, 2016 3:09 PM
>> To: Wade Holler; Blair Bethwaite
>> Cc: Ceph Development; ceph-users@lists.ceph.com
>> Subject: Re: [ceph-users] Dramatic performance drop at certain number
>> of objects in pool
>>
>> vm.vfs_cache_pressure = 100
>>
>> Go the other direction on that. You¹ll want to keep it low to help
>> keep inode/dentry info in memory. We use 10, and haven¹t had a problem.
>>
>>
>> Warren Wang
>>
>>
>>
>>
>> On 6/22/16, 9:41 PM, "Wade Holler" <wade.holler@gmail.com> wrote:
>>
>> >Blairo,
>> >
>> >We'll speak in pre-replication numbers, replication for this pool is 3.
>> >
>> >23.3 Million Objects / OSD
>> >pg_num 2048
>> >16 OSDs / Server
>> >3 Servers
>> >660 GB RAM Total, 179 GB Used (free -t) / Server vm.swappiness = 1
>> >vm.vfs_cache_pressure = 100
>> >
>> >Workload is native librados with python. ALL 4k objects.
>> >
>> >Best Regards,
>> >Wade
>> >
>> >
>> >On Wed, Jun 22, 2016 at 9:33 PM, Blair Bethwaite
>> ><blair.bethwaite@gmail.com> wrote:
>> >> Wade, good to know.
>> >>
>> >> For the record, what does this work out to roughly per OSD? And how
>> >> much RAM and how many PGs per OSD do you have?
>> >>
>> >> What's your workload? I wonder whether for certain workloads (e.g.
>> >> RBD) it's better to increase default object size somewhat before
>> >> pushing the split/merge up a lot...
>> >>
>> >> Cheers,
>> >>
>> >> On 23 June 2016 at 11:26, Wade Holler <wade.holler@gmail.com> wrote:
>> >>> Based on everyones suggestions; The first modification to 50 / 16
>> >>> enabled our config to get to ~645Mill objects before the behavior
>> >>> in question was observed (~330 was the previous ceiling).
>> >>> Subsequent modification to 50 / 24 has enabled us to get to 1.1
>> >>> Billion+
>> >>>
>> >>> Thank you all very much for your support and assistance.
>> >>>
>> >>> Best Regards,
>> >>> Wade
>> >>>
>> >>>
>> >>> On Mon, Jun 20, 2016 at 6:58 PM, Christian Balzer <chibi@gol.com>
>> >>>wrote:
>> >>>>
>> >>>> Hello,
>> >>>>
>> >>>> On Mon, 20 Jun 2016 20:47:32 +0000 Warren Wang - ISD wrote:
>> >>>>
>> >>>>> Sorry, late to the party here. I agree, up the merge and split
>> >>>>>thresholds. We're as high as 50/12. I chimed in on an RH ticket
>> >>>>>here.
>> >>>>> One of those things you just have to find out as an operator
>> >>>>>since it's not well documented :(
>> >>>>>
>> >>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1219974
>> >>>>>
>> >>>>> We have over 200 million objects in this cluster, and it's still
>> >>>>>doing over 15000 write IOPS all day long with 302 spinning
>> >>>>>drives
>> >>>>>+ SATA SSD journals. Having enough memory and dropping your
>> >>>>>vfs_cache_pressure should also help.
>> >>>>>
>> >>>> Indeed.
>> >>>>
>> >>>> Since it was asked in that bug report and also my first
>> >>>>suspicion, it would probably be good time to clarify that it
>> >>>>isn't the splits that cause the performance degradation, but the
>> >>>>resulting inflation of dir entries and exhaustion of SLAB and
>> >>>>thus having to go to disk for things that normally would be in memory.
>> >>>>
>> >>>> Looking at Blair's graph from yesterday pretty much makes that
>> >>>>clear, a purely split caused degradation should have relented
>> >>>>much quicker.
>> >>>>
>> >>>>
>> >>>>> Keep in mind that if you change the values, it won't take effect
>> >>>>> immediately. It only merges them back if the directory is under
>> >>>>> the calculated threshold and a write occurs (maybe a read, I
>> >>>>> forget).
>> >>>>>
>> >>>> If it's a read a plain scrub might do the trick.
>> >>>>
>> >>>> Christian
>> >>>>> Warren
>> >>>>>
>> >>>>>
>> >>>>> From: ceph-users
>> >>>>>
>> >>>>><ceph-users-bounces@lists.ceph.com<mailto:ceph-users-bounces@lists.
>> >>>>>cep
>> >>>>>h.com>>
>> >>>>> on behalf of Wade Holler
>> >>>>> <wade.holler@gmail.com<mailto:wade.holler@gmail.com>> Date:
>> >>>>>Monday, June 20, 2016 at 2:48 PM To: Blair Bethwaite
>> >>>>><blair.bethwaite@gmail.com<mailto:blair.bethwaite@gmail.com>>,
>> >>>>>Wido den Hollander <wido@42on.com<mailto:wido@42on.com>> Cc:
>> >>>>>Ceph Development
>> >>>>><ceph-devel@vger.kernel.org<mailto:ceph-devel@vger.kernel.org>>,
>> >>>>> "ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>"
>> >>>>> <ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>>
>> >>>>>Subject:
>> >>>>> Re: [ceph-users] Dramatic performance drop at certain number of
>> >>>>>objects in pool
>> >>>>>
>> >>>>> Thanks everyone for your replies. I sincerely appreciate it. We
>> >>>>> are testing with different pg_num and filestore_split_multiple
>> >>>>> settings. Early indications are .... well not great. Regardless
>> >>>>> it is nice to understand the symptoms better so we try to design
>> >>>>> around it.
>> >>>>>
>> >>>>> Best Regards,
>> >>>>> Wade
>> >>>>>
>> >>>>>
>> >>>>> On Mon, Jun 20, 2016 at 2:32 AM Blair Bethwaite
>> >>>>><blair.bethwaite@gmail.com<mailto:blair.bethwaite@gmail.com>> wrote:
>> >>>>>On
>> >>>>> 20 June 2016 at 09:21, Blair Bethwaite
>> >>>>><blair.bethwaite@gmail.com<mailto:blair.bethwaite@gmail.com>> wrote:
>> >>>>> > slow request issues). If you watch your xfs stats you'll
>> >>>>> > likely get further confirmation. In my experience
>> >>>>> > xs_dir_lookups balloons
>> >>>>>(which
>> >>>>> > means directory lookups are missing cache and going to disk).
>> >>>>>
>> >>>>> Murphy's a bitch. Today we upgraded a cluster to latest Hammer
>> >>>>> in preparation for Jewel/RHCS2. Turns out when we last hit this
>> >>>>> very problem we had only ephemerally set the new filestore
>> >>>>> merge/split values - oops. Here's what started happening when we
>> >>>>> upgraded and restarted a bunch of OSDs:
>> >>>>>
>> >>>>>https://au-east.erc.monash.edu.au/swift/v1/public/grafana-ceph-xs
>> >>>>>_d
>> >>>>>ir_
>> >>>>>lookup.png
>> >>>>>
>> >>>>> Seemed to cause lots of slow requests :-/. We corrected it about
>> >>>>> 12:30, then still took a while to settle.
>> >>>>>
>> >>>>> --
>> >>>>> Cheers,
>> >>>>> ~Blairo
>> >>>>>
>> >>>>> This email and any files transmitted with it are confidential
>> >>>>>and intended solely for the individual or entity to whom they are
>> >>>>>addressed.
>> >>>>> If you have received this email in error destroy it immediately.
>> >>>>>*** Walmart Confidential ***
>> >>>>
>> >>>>
>> >>>> --
>> >>>> Christian Balzer Network/Systems Engineer
>> >>>> chibi@gol.com Global OnLine Japan/Rakuten Communications
>> >>>> http://www.gol.com/
>> >>
>> >>
>> >>
>> >> --
>> >> Cheers,
>> >> ~Blairo
>>
>> This email and any files transmitted with it are confidential and
>> intended solely for the individual or entity to whom they are addressed.
>> If you have received this email in error destroy it immediately. ***
>> Walmart Confidential ***
>> _______________________________________________
>> ceph-users mailing list ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com PLEASE NOTE:
>> The information contained in this electronic mail message is intended
>> only for the use of the designated recipient(s) named above. If the
>> reader of this message is not the intended recipient, you are hereby
>> notified that you have received this message in error and that any
>> review, dissemination, distribution, or copying of this message is
>> strictly prohibited. If you have received this communication in error,
>> please notify the sender by telephone or e-mail (as shown above)
>> immediately and destroy any and all copies of this message in your
>> possession (whether hard copies or electronically stored copies).
>> _______________________________________________ ceph-users mailing
>> list ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
>
> --
> Christian Balzer Network/Systems Engineer
> chibi@gol.com Global OnLine Japan/Rakuten Communications
> http://www.gol.com/
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: Dramatic performance drop at certain number of objects in pool
[not found] ` <CA+e22SdmGJVzJX9+63T41UGsfFcxs9R=xZqniQyTgu-yG=h0cA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2016-06-24 16:24 ` Warren Wang - ISD
[not found] ` <D392D6EB.146C6%warren.wang-dFwxUrggiyBBDgjK7y7TUQ@public.gmane.org>
0 siblings, 1 reply; 34+ messages in thread
From: Warren Wang - ISD @ 2016-06-24 16:24 UTC (permalink / raw)
To: Wade Holler, Somnath Roy
Cc: Blair Bethwaite, ceph-users-idqoXFIVOFJgJs9I8MT0rw, Ceph Development
Oops, that reminds me, do you have min_free_kbytes set to something
reasonable like at least 2-4GB?
Warren Wang
On 6/24/16, 10:23 AM, "Wade Holler" <wade.holler@gmail.com> wrote:
>On the vm.vfs_cache_pressure = 1 : We had this initially and I still
>think it is the best choice for most configs. However with our large
>memory footprint, vfs_cache_pressure=1 increased the likelihood of
>hitting an issue where our write response time would double; then a
>drop of caches would return response time to normal. I don't claim to
>totally understand this and I only have speculation at the moment.
>Again thanks for this suggestion, I do think it is best for boxes that
>don't have very large memory.
>
>@ Christian - reformatting to btrfs or ext4 is an option in my test
>cluster. I thought about that but needed to sort xfs first. (that's
>what production will run right now) You all have helped me do that and
>thank you again. I will circle back and test btrfs under the same
>conditions. I suspect that it will behave similarly but it's only a
>day and a half's work or so to test.
>
>Best Regards,
>Wade
>
>
>On Thu, Jun 23, 2016 at 8:09 PM, Somnath Roy <Somnath.Roy@sandisk.com>
>wrote:
>> Oops , typo , 128 GB :-)...
>>
>> -----Original Message-----
>> From: Christian Balzer [mailto:chibi@gol.com]
>> Sent: Thursday, June 23, 2016 5:08 PM
>> To: ceph-users@lists.ceph.com
>> Cc: Somnath Roy; Warren Wang - ISD; Wade Holler; Blair Bethwaite; Ceph
>>Development
>> Subject: Re: [ceph-users] Dramatic performance drop at certain number
>>of objects in pool
>>
>>
>> Hello,
>>
>> On Thu, 23 Jun 2016 22:24:59 +0000 Somnath Roy wrote:
>>
>>> Or even vm.vfs_cache_pressure = 0 if you have sufficient memory to
>>> *pin* inode/dentries in memory. We are using that for long now (with
>>> 128 TB node memory) and it seems helping specially for the random
>>> write workload and saving xattrs read in between.
>>>
>> 128TB node memory, really?
>> Can I have some of those, too? ^o^
>> And here I was thinking that Wade's 660GB machines were on the
>>excessive side.
>>
>> There's something to be said (and optimized) when your storage nodes
>>have the same or more RAM as your compute nodes...
>>
>> As for Warren, well spotted.
>> I personally use vm.vfs_cache_pressure = 1, this avoids the potential
>>fireworks if your memory is really needed elsewhere, while keeping
>>things in memory normally.
>>
>> Christian
>>
>>> Thanks & Regards
>>> Somnath
>>>
>>> -----Original Message-----
>>> From: ceph-users [mailto:ceph-users-bounces@lists.ceph.com] On Behalf
>>> Of Warren Wang - ISD Sent: Thursday, June 23, 2016 3:09 PM
>>> To: Wade Holler; Blair Bethwaite
>>> Cc: Ceph Development; ceph-users@lists.ceph.com
>>> Subject: Re: [ceph-users] Dramatic performance drop at certain number
>>> of objects in pool
>>>
>>> vm.vfs_cache_pressure = 100
>>>
>>> Go the other direction on that. You'll want to keep it low to help
>>> keep inode/dentry info in memory. We use 10, and haven't had a problem.
>>>
>>>
>>> Warren Wang
>>>
>>>
>>>
>>>
>>> On 6/22/16, 9:41 PM, "Wade Holler" <wade.holler@gmail.com> wrote:
>>>
>>> >Blairo,
>>> >
>>> >We'll speak in pre-replication numbers, replication for this pool is
>>>3.
>>> >
>>> >23.3 Million Objects / OSD
>>> >pg_num 2048
>>> >16 OSDs / Server
>>> >3 Servers
>>> >660 GB RAM Total, 179 GB Used (free -t) / Server vm.swappiness = 1
>>> >vm.vfs_cache_pressure = 100
>>> >
>>> >Workload is native librados with python. ALL 4k objects.
>>> >
>>> >Best Regards,
>>> >Wade
>>> >
>>> >
>>> >On Wed, Jun 22, 2016 at 9:33 PM, Blair Bethwaite
>>> ><blair.bethwaite@gmail.com> wrote:
>>> >> Wade, good to know.
>>> >>
>>> >> For the record, what does this work out to roughly per OSD? And how
>>> >> much RAM and how many PGs per OSD do you have?
>>> >>
>>> >> What's your workload? I wonder whether for certain workloads (e.g.
>>> >> RBD) it's better to increase default object size somewhat before
>>> >> pushing the split/merge up a lot...
>>> >>
>>> >> Cheers,
>>> >>
>>> >> On 23 June 2016 at 11:26, Wade Holler <wade.holler@gmail.com> wrote:
>>> >>> Based on everyones suggestions; The first modification to 50 / 16
>>> >>> enabled our config to get to ~645Mill objects before the behavior
>>> >>> in question was observed (~330 was the previous ceiling).
>>> >>> Subsequent modification to 50 / 24 has enabled us to get to 1.1
>>> >>> Billion+
>>> >>>
>>> >>> Thank you all very much for your support and assistance.
>>> >>>
>>> >>> Best Regards,
>>> >>> Wade
>>> >>>
>>> >>>
>>> >>> On Mon, Jun 20, 2016 at 6:58 PM, Christian Balzer <chibi@gol.com>
>>> >>>wrote:
>>> >>>>
>>> >>>> Hello,
>>> >>>>
>>> >>>> On Mon, 20 Jun 2016 20:47:32 +0000 Warren Wang - ISD wrote:
>>> >>>>
>>> >>>>> Sorry, late to the party here. I agree, up the merge and split
>>> >>>>>thresholds. We're as high as 50/12. I chimed in on an RH ticket
>>> >>>>>here.
>>> >>>>> One of those things you just have to find out as an operator
>>> >>>>>since it's not well documented :(
>>> >>>>>
>>> >>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1219974
>>> >>>>>
>>> >>>>> We have over 200 million objects in this cluster, and it's still
>>> >>>>>doing over 15000 write IOPS all day long with 302 spinning
>>> >>>>>drives
>>> >>>>>+ SATA SSD journals. Having enough memory and dropping your
>>> >>>>>vfs_cache_pressure should also help.
>>> >>>>>
>>> >>>> Indeed.
>>> >>>>
>>> >>>> Since it was asked in that bug report and also my first
>>> >>>>suspicion, it would probably be good time to clarify that it
>>> >>>>isn't the splits that cause the performance degradation, but the
>>> >>>>resulting inflation of dir entries and exhaustion of SLAB and
>>> >>>>thus having to go to disk for things that normally would be in
>>>memory.
>>> >>>>
>>> >>>> Looking at Blair's graph from yesterday pretty much makes that
>>> >>>>clear, a purely split caused degradation should have relented
>>> >>>>much quicker.
>>> >>>>
>>> >>>>
>>> >>>>> Keep in mind that if you change the values, it won't take effect
>>> >>>>> immediately. It only merges them back if the directory is under
>>> >>>>> the calculated threshold and a write occurs (maybe a read, I
>>> >>>>> forget).
>>> >>>>>
>>> >>>> If it's a read a plain scrub might do the trick.
>>> >>>>
>>> >>>> Christian
>>> >>>>> Warren
>>> >>>>>
>>> >>>>>
>>> >>>>> From: ceph-users <ceph-users-bounces@lists.ceph.com> on behalf
>>> >>>>> of Wade Holler <wade.holler@gmail.com>
>>> >>>>> Date: Monday, June 20, 2016 at 2:48 PM
>>> >>>>> To: Blair Bethwaite <blair.bethwaite@gmail.com>, Wido den
>>> >>>>> Hollander <wido@42on.com>
>>> >>>>> Cc: Ceph Development <ceph-devel@vger.kernel.org>,
>>> >>>>> "ceph-users@lists.ceph.com" <ceph-users@lists.ceph.com>
>>> >>>>> Subject: Re: [ceph-users] Dramatic performance drop at certain
>>> >>>>> number of objects in pool
>>> >>>>>
>>> >>>>> Thanks everyone for your replies. I sincerely appreciate it. We
>>> >>>>> are testing with different pg_num and filestore_split_multiple
>>> >>>>> settings. Early indications are .... well not great. Regardless
>>> >>>>> it is nice to understand the symptoms better so we try to design
>>> >>>>> around it.
>>> >>>>>
>>> >>>>> Best Regards,
>>> >>>>> Wade
>>> >>>>>
>>> >>>>>
>>> >>>>> On Mon, Jun 20, 2016 at 2:32 AM Blair Bethwaite
>>> >>>>> <blair.bethwaite@gmail.com> wrote:
>>> >>>>> On 20 June 2016 at 09:21, Blair Bethwaite
>>> >>>>> <blair.bethwaite@gmail.com> wrote:
>>> >>>>> > slow request issues). If you watch your xfs stats you'll
>>> >>>>> > likely get further confirmation. In my experience
>>> >>>>> > xs_dir_lookups balloons (which means directory lookups are
>>> >>>>> > missing cache and going to disk).
>>> >>>>>
>>> >>>>> Murphy's a bitch. Today we upgraded a cluster to latest Hammer
>>> >>>>> in preparation for Jewel/RHCS2. Turns out when we last hit this
>>> >>>>> very problem we had only ephemerally set the new filestore
>>> >>>>> merge/split values - oops. Here's what started happening when we
>>> >>>>> upgraded and restarted a bunch of OSDs:
>>> >>>>>
>>> >>>>> https://au-east.erc.monash.edu.au/swift/v1/public/grafana-ceph-xs_dir_lookup.png
>>> >>>>>
>>> >>>>> Seemed to cause lots of slow requests :-/. We corrected it about
>>> >>>>> 12:30, then still took a while to settle.
>>> >>>>>
>>> >>>>> --
>>> >>>>> Cheers,
>>> >>>>> ~Blairo
>>> >>>>>
>>> >>>>
>>> >>>>
>>> >>>> --
>>> >>>> Christian Balzer Network/Systems Engineer
>>> >>>> chibi@gol.com Global OnLine Japan/Rakuten Communications
>>> >>>> http://www.gol.com/
>>> >>
>>> >>
>>> >>
>>> >> --
>>> >> Cheers,
>>> >> ~Blairo
>>>
>>>
>>
>>
>> --
>> Christian Balzer Network/Systems Engineer
>> chibi@gol.com Global OnLine Japan/Rakuten Communications
>> http://www.gol.com/
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: Dramatic performance drop at certain number of objects in pool
[not found] ` <D392D6EB.146C6%warren.wang-dFwxUrggiyBBDgjK7y7TUQ@public.gmane.org>
@ 2016-06-24 19:45 ` Wade Holler
2016-06-25 3:07 ` [ceph-users] " Christian Balzer
0 siblings, 1 reply; 34+ messages in thread
From: Wade Holler @ 2016-06-24 19:45 UTC (permalink / raw)
To: Warren Wang - ISD
Cc: Blair Bethwaite, ceph-users-idqoXFIVOFJgJs9I8MT0rw, Ceph Development
Not reasonable, as you say:
vm.min_free_kbytes = 90112
We're in recovery after an expansion (48->54 OSDs) right now, but free -t shows:
# free -t
              total        used        free      shared  buff/cache   available
Mem:      693097104   378383384    36870080      369292   277843640   250931372
Swap:       1048572         956     1047616
Total:    694145676   378384340    37917696
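For scale, the figures above can be put side by side with the suggested 2-4GB. This is only a rough sketch: the two input numbers are copied from this message, the helper names are made up for illustration, and "GB" is assumed to mean GiB (1024 * 1024 kB).

```python
# Inputs copied from the message above; helper names are illustrative only.
MIN_FREE_KBYTES = 90112      # current vm.min_free_kbytes
MEM_TOTAL_KB = 693097104     # "Mem: total" column of free -t, in kB

KB_PER_GIB = 1024 * 1024     # assumption: "GB" in the thread means GiB


def gib_to_kb(gib):
    """Convert GiB to the kB unit that vm.min_free_kbytes expects."""
    return int(gib * KB_PER_GIB)


current_mib = MIN_FREE_KBYTES / 1024
share_of_ram = MIN_FREE_KBYTES / MEM_TOTAL_KB
suggested = {gib: gib_to_kb(gib) for gib in (2, 4)}

print(f"current reserve: {current_mib:.0f} MiB "
      f"({share_of_ram:.4%} of {MEM_TOTAL_KB / KB_PER_GIB:.0f} GiB RAM)")
for gib, kb in suggested.items():
    print(f"{gib} GiB -> vm.min_free_kbytes = {kb} "
          f"({kb / MEM_TOTAL_KB:.2%} of RAM)")
```

So the current reserve is 88 MiB on a box with roughly 661 GiB of RAM, while even the 4GB end of the suggested range would stay under 1% of memory.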
On Fri, Jun 24, 2016 at 12:24 PM, Warren Wang - ISD
<Warren.Wang@walmart.com> wrote:
> Oops, that reminds me, do you have min_free_kbytes set to something
> reasonable like at least 2-4GB?
>
> Warren Wang
>
>
>
> On 6/24/16, 10:23 AM, "Wade Holler" <wade.holler@gmail.com> wrote:
>
>>On the vm.vfs_cache_pressure = 1 : We had this initially and I still
>>think it is the best choice for most configs. However with our large
>>memory footprint, vfs_cache_pressure=1 increased the likelihood of
>>hitting an issue where our write response time would double; then a
>>drop of caches would return response time to normal. I don't claim to
>>totally understand this and I only have speculation at the moment.
>>Again thanks for this suggestion, I do think it is best for boxes that
>>don't have very large memory.
>>
>>@ Christian - reformatting to btrfs or ext4 is an option in my test
>>cluster. I thought about that but needed to sort xfs first. (that's
>>what production will run right now) You all have helped me do that and
>>thank you again. I will circle back and test btrfs under the same
>>conditions. I suspect that it will behave similarly but it's only a
>>day and a half's work or so to test.
>>
>>Best Regards,
>>Wade
>>
>>
>>On Thu, Jun 23, 2016 at 8:09 PM, Somnath Roy <Somnath.Roy@sandisk.com>
>>wrote:
>>> Oops , typo , 128 GB :-)...
>>>
>>> -----Original Message-----
>>> From: Christian Balzer [mailto:chibi@gol.com]
>>> Sent: Thursday, June 23, 2016 5:08 PM
>>> To: ceph-users@lists.ceph.com
>>> Cc: Somnath Roy; Warren Wang - ISD; Wade Holler; Blair Bethwaite; Ceph
>>>Development
>>> Subject: Re: [ceph-users] Dramatic performance drop at certain number
>>>of objects in pool
>>>
>>>
>>> Hello,
>>>
>>> On Thu, 23 Jun 2016 22:24:59 +0000 Somnath Roy wrote:
>>>
>>>> Or even vm.vfs_cache_pressure = 0 if you have sufficient memory to
>>>> *pin* inode/dentries in memory. We are using that for long now (with
>>>> 128 TB node memory) and it seems helping specially for the random
>>>> write workload and saving xattrs read in between.
>>>>
>>> 128TB node memory, really?
>>> Can I have some of those, too? ^o^
>>> And here I was thinking that Wade's 660GB machines were on the
>>>excessive side.
>>>
>>> There's something to be said (and optimized) when your storage nodes
>>>have the same or more RAM as your compute nodes...
>>>
>>> As for Warren, well spotted.
>>> I personally use vm.vfs_cache_pressure = 1, this avoids the potential
>>>fireworks if your memory is really needed elsewhere, while keeping
>>>things in memory normally.
>>>
>>> Christian
>>>
>>>> Thanks & Regards
>>>> Somnath
>>>>
>>>> -----Original Message-----
>>>> From: ceph-users [mailto:ceph-users-bounces@lists.ceph.com] On Behalf
>>>> Of Warren Wang - ISD Sent: Thursday, June 23, 2016 3:09 PM
>>>> To: Wade Holler; Blair Bethwaite
>>>> Cc: Ceph Development; ceph-users@lists.ceph.com
>>>> Subject: Re: [ceph-users] Dramatic performance drop at certain number
>>>> of objects in pool
>>>>
>>>> vm.vfs_cache_pressure = 100
>>>>
>>>> Go the other direction on that. You'll want to keep it low to help
>>>> keep inode/dentry info in memory. We use 10, and haven't had a problem.
>>>>
>>>>
>>>> Warren Wang
>>>>
>>>>
>>>>
>>>>
>>>> On 6/22/16, 9:41 PM, "Wade Holler" <wade.holler@gmail.com> wrote:
>>>>
>>>> >Blairo,
>>>> >
>>>> >We'll speak in pre-replication numbers, replication for this pool is
>>>>3.
>>>> >
>>>> >23.3 Million Objects / OSD
>>>> >pg_num 2048
>>>> >16 OSDs / Server
>>>> >3 Servers
>>>> >660 GB RAM Total, 179 GB Used (free -t) / Server vm.swappiness = 1
>>>> >vm.vfs_cache_pressure = 100
>>>> >
>>>> >Workload is native librados with python. ALL 4k objects.
>>>> >
>>>> >Best Regards,
>>>> >Wade
>>>> >
>>>> >
>>>> >On Wed, Jun 22, 2016 at 9:33 PM, Blair Bethwaite
>>>> ><blair.bethwaite@gmail.com> wrote:
>>>> >> Wade, good to know.
>>>> >>
>>>> >> For the record, what does this work out to roughly per OSD? And how
>>>> >> much RAM and how many PGs per OSD do you have?
>>>> >>
>>>> >> What's your workload? I wonder whether for certain workloads (e.g.
>>>> >> RBD) it's better to increase default object size somewhat before
>>>> >> pushing the split/merge up a lot...
>>>> >>
>>>> >> Cheers,
>>>> >>
>>>> >> On 23 June 2016 at 11:26, Wade Holler <wade.holler@gmail.com> wrote:
>>>> >>> Based on everyones suggestions; The first modification to 50 / 16
>>>> >>> enabled our config to get to ~645Mill objects before the behavior
>>>> >>> in question was observed (~330 was the previous ceiling).
>>>> >>> Subsequent modification to 50 / 24 has enabled us to get to 1.1
>>>> >>> Billion+
>>>> >>>
>>>> >>> Thank you all very much for your support and assistance.
>>>> >>>
>>>> >>> Best Regards,
>>>> >>> Wade
>>>> >>>
>>>> >>>
>>>> >>> On Mon, Jun 20, 2016 at 6:58 PM, Christian Balzer <chibi@gol.com>
>>>> >>>wrote:
>>>> >>>>
>>>> >>>> Hello,
>>>> >>>>
>>>> >>>> On Mon, 20 Jun 2016 20:47:32 +0000 Warren Wang - ISD wrote:
>>>> >>>>
>>>> >>>>> Sorry, late to the party here. I agree, up the merge and split
>>>> >>>>>thresholds. We're as high as 50/12. I chimed in on an RH ticket
>>>> >>>>>here.
>>>> >>>>> One of those things you just have to find out as an operator
>>>> >>>>>since it's not well documented :(
>>>> >>>>>
>>>> >>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1219974
>>>> >>>>>
>>>> >>>>> We have over 200 million objects in this cluster, and it's still
>>>> >>>>>doing over 15000 write IOPS all day long with 302 spinning
>>>> >>>>>drives
>>>> >>>>>+ SATA SSD journals. Having enough memory and dropping your
>>>> >>>>>vfs_cache_pressure should also help.
>>>> >>>>>
>>>> >>>> Indeed.
>>>> >>>>
>>>> >>>> Since it was asked in that bug report and also my first
>>>> >>>>suspicion, it would probably be good time to clarify that it
>>>> >>>>isn't the splits that cause the performance degradation, but the
>>>> >>>>resulting inflation of dir entries and exhaustion of SLAB and
>>>> >>>>thus having to go to disk for things that normally would be in
>>>>memory.
>>>> >>>>
>>>> >>>> Looking at Blair's graph from yesterday pretty much makes that
>>>> >>>>clear, a purely split caused degradation should have relented
>>>> >>>>much quicker.
>>>> >>>>
>>>> >>>>
>>>> >>>>> Keep in mind that if you change the values, it won't take effect
>>>> >>>>> immediately. It only merges them back if the directory is under
>>>> >>>>> the calculated threshold and a write occurs (maybe a read, I
>>>> >>>>> forget).
>>>> >>>>>
>>>> >>>> If it's a read a plain scrub might do the trick.
>>>> >>>>
>>>> >>>> Christian
>>>> >>>>> Warren
>>>> >>>>>
>>>> >>>>>
>>>> >>>>> From: ceph-users <ceph-users-bounces@lists.ceph.com> on behalf
>>>> >>>>> of Wade Holler <wade.holler@gmail.com>
>>>> >>>>> Date: Monday, June 20, 2016 at 2:48 PM
>>>> >>>>> To: Blair Bethwaite <blair.bethwaite@gmail.com>, Wido den
>>>> >>>>> Hollander <wido@42on.com>
>>>> >>>>> Cc: Ceph Development <ceph-devel@vger.kernel.org>,
>>>> >>>>> "ceph-users@lists.ceph.com" <ceph-users@lists.ceph.com>
>>>> >>>>> Subject: Re: [ceph-users] Dramatic performance drop at certain
>>>> >>>>> number of objects in pool
>>>> >>>>>
>>>> >>>>> Thanks everyone for your replies. I sincerely appreciate it. We
>>>> >>>>> are testing with different pg_num and filestore_split_multiple
>>>> >>>>> settings. Early indications are .... well not great. Regardless
>>>> >>>>> it is nice to understand the symptoms better so we try to design
>>>> >>>>> around it.
>>>> >>>>>
>>>> >>>>> Best Regards,
>>>> >>>>> Wade
>>>> >>>>>
>>>> >>>>>
>>>> >>>>> On Mon, Jun 20, 2016 at 2:32 AM Blair Bethwaite
>>>> >>>>> <blair.bethwaite@gmail.com> wrote:
>>>> >>>>> On 20 June 2016 at 09:21, Blair Bethwaite
>>>> >>>>> <blair.bethwaite@gmail.com> wrote:
>>>> >>>>> > slow request issues). If you watch your xfs stats you'll
>>>> >>>>> > likely get further confirmation. In my experience
>>>> >>>>> > xs_dir_lookups balloons (which means directory lookups are
>>>> >>>>> > missing cache and going to disk).
>>>> >>>>>
>>>> >>>>> Murphy's a bitch. Today we upgraded a cluster to latest Hammer
>>>> >>>>> in preparation for Jewel/RHCS2. Turns out when we last hit this
>>>> >>>>> very problem we had only ephemerally set the new filestore
>>>> >>>>> merge/split values - oops. Here's what started happening when we
>>>> >>>>> upgraded and restarted a bunch of OSDs:
>>>> >>>>>
>>>> >>>>> https://au-east.erc.monash.edu.au/swift/v1/public/grafana-ceph-xs_dir_lookup.png
>>>> >>>>>
>>>> >>>>> Seemed to cause lots of slow requests :-/. We corrected it about
>>>> >>>>> 12:30, then still took a while to settle.
>>>> >>>>>
>>>> >>>>> --
>>>> >>>>> Cheers,
>>>> >>>>> ~Blairo
>>>> >>>>>
>>>> >>>>
>>>> >>>>
>>>> >>>> --
>>>> >>>> Christian Balzer Network/Systems Engineer
>>>> >>>> chibi@gol.com Global OnLine Japan/Rakuten Communications
>>>> >>>> http://www.gol.com/
>>>> >>
>>>> >>
>>>> >>
>>>> >> --
>>>> >> Cheers,
>>>> >> ~Blairo
>>>>
>>>>
>>>
>>>
>>> --
>>> Christian Balzer Network/Systems Engineer
>>> chibi@gol.com Global OnLine Japan/Rakuten Communications
>>> http://www.gol.com/
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [ceph-users] Dramatic performance drop at certain number of objects in pool
2016-06-24 19:45 ` Wade Holler
@ 2016-06-25 3:07 ` Christian Balzer
0 siblings, 0 replies; 34+ messages in thread
From: Christian Balzer @ 2016-06-25 3:07 UTC (permalink / raw)
To: Wade Holler
Cc: Warren Wang - ISD, Somnath Roy, ceph-users, Blair Bethwaite,
Ceph Development
Hello,
On Fri, 24 Jun 2016 15:45:52 -0400 Wade Holler wrote:
> Not reasonable as you say :
>
> vm.min_free_kbytes = 90112
>
Yeah, my nodes with IB adapters all have that set to at least 512MB, or
1GB if they have more than 64GB of RAM.
> we're in recovery post expansion (48->54 OSDs) right now but free -t is:
>
> #free -t
>
free can be very misleading about the actual state of memory, because it
tells you nothing about fragmentation.
Take a look at "cat /proc/buddyinfo" and read up on Linux memory
fragmentation.
Also this:
https://www.mail-archive.com/ceph-users@lists.ceph.com/msg22214.html
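As a rough illustration of what to look for in /proc/buddyinfo: each line lists, per node and zone, how many free blocks exist at each order (order n is a block of 2^n contiguous pages). The sketch below is not from this thread. The SAMPLE text is invented for illustration and is not output from any of the clusters discussed, the function names are hypothetical, and a 4 KiB page size is assumed.

```python
def parse_buddyinfo(text):
    """Map (node, zone) -> list of free block counts, index = order."""
    zones = {}
    for line in text.splitlines():
        parts = line.split()
        # Expected shape: Node <n>, zone <name> <order-0 count> <order-1 count> ...
        if len(parts) < 5 or parts[0] != "Node" or parts[2] != "zone":
            continue
        node = int(parts[1].rstrip(","))
        zones[(node, parts[3])] = [int(count) for count in parts[4:]]
    return zones


def free_kb_by_order(counts, page_kb=4):
    """Free memory held at each order, in kB (page_kb is an assumption)."""
    return [count * (2 ** order) * page_kb for order, count in enumerate(counts)]


def high_order_share(counts, min_order):
    """Fraction of free pages that sit in blocks of at least min_order."""
    pages = [count * (2 ** order) for order, count in enumerate(counts)]
    total = sum(pages)
    return sum(pages[min_order:]) / total if total else 0.0


# Invented sample input, NOT real data from this thread.
SAMPLE = """\
Node 0, zone      DMA      1      1      0      0      2      1      1      0      1      1      3
Node 0, zone   Normal  52341  18020   2110     97      4      0      0      0      0      0      0
"""

zones = parse_buddyinfo(SAMPLE)
normal = zones[(0, "Normal")]
print(free_kb_by_order(normal))
print(f"share of free pages in order >= 4 blocks: {high_order_share(normal, 4):.4%}")
```

On a real node the input would come from reading /proc/buddyinfo instead of SAMPLE. Plenty of free memory overall combined with almost nothing in the higher orders is the fragmented pattern that free does not show.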
Christian
>               total        used        free      shared  buff/cache   available
> Mem:      693097104   378383384    36870080      369292   277843640   250931372
> Swap:       1048572         956     1047616
> Total:    694145676   378384340    37917696
>
>
> On Fri, Jun 24, 2016 at 12:24 PM, Warren Wang - ISD
> <Warren.Wang@walmart.com> wrote:
> > Oops, that reminds me, do you have min_free_kbytes set to something
> > reasonable like at least 2-4GB?
> >
> > Warren Wang
> >
> >
> >
> > On 6/24/16, 10:23 AM, "Wade Holler" <wade.holler@gmail.com> wrote:
> >
> >>On the vm.vfs_cache_pressure = 1 : We had this initially and I still
> >>think it is the best choice for most configs. However with our large
> >>memory footprint, vfs_cache_pressure=1 increased the likelihood of
> >>hitting an issue where our write response time would double; then a
> >>drop of caches would return response time to normal. I don't claim to
> >>totally understand this and I only have speculation at the moment.
> >>Again thanks for this suggestion, I do think it is best for boxes that
> >>don't have very large memory.
> >>
> >>@ Christian - reformatting to btrfs or ext4 is an option in my test
> >>cluster. I thought about that but needed to sort xfs first. (that's
> >>what production will run right now) You all have helped me do that and
> >>thank you again. I will circle back and test btrfs under the same
> >>conditions. I suspect that it will behave similarly but it's only a
> >>day and a half's work or so to test.
> >>
> >>Best Regards,
> >>Wade
> >>
> >>
> >>On Thu, Jun 23, 2016 at 8:09 PM, Somnath Roy <Somnath.Roy@sandisk.com>
> >>wrote:
> >>> Oops , typo , 128 GB :-)...
> >>>
> >>> -----Original Message-----
> >>> From: Christian Balzer [mailto:chibi@gol.com]
> >>> Sent: Thursday, June 23, 2016 5:08 PM
> >>> To: ceph-users@lists.ceph.com
> >>> Cc: Somnath Roy; Warren Wang - ISD; Wade Holler; Blair Bethwaite;
> >>> Ceph
> >>>Development
> >>> Subject: Re: [ceph-users] Dramatic performance drop at certain number
> >>>of objects in pool
> >>>
> >>>
> >>> Hello,
> >>>
> >>> On Thu, 23 Jun 2016 22:24:59 +0000 Somnath Roy wrote:
> >>>
> >>>> Or even vm.vfs_cache_pressure = 0 if you have sufficient memory to
> >>>> *pin* inode/dentries in memory. We are using that for long now (with
> >>>> 128 TB node memory) and it seems helping specially for the random
> >>>> write workload and saving xattrs read in between.
> >>>>
> >>> 128TB node memory, really?
> >>> Can I have some of those, too? ^o^
> >>> And here I was thinking that Wade's 660GB machines were on the
> >>>excessive side.
> >>>
> >>> There's something to be said (and optimized) when your storage nodes
> >>>have the same or more RAM as your compute nodes...
> >>>
> >>> As for Warren, well spotted.
> >>> I personally use vm.vfs_cache_pressure = 1, this avoids the potential
> >>>fireworks if your memory is really needed elsewhere, while keeping
> >>>things in memory normally.
> >>>
> >>> Christian
> >>>
> >>>> Thanks & Regards
> >>>> Somnath
> >>>>
> >>>> -----Original Message-----
> >>>> From: ceph-users [mailto:ceph-users-bounces@lists.ceph.com] On
> >>>> Behalf Of Warren Wang - ISD Sent: Thursday, June 23, 2016 3:09 PM
> >>>> To: Wade Holler; Blair Bethwaite
> >>>> Cc: Ceph Development; ceph-users@lists.ceph.com
> >>>> Subject: Re: [ceph-users] Dramatic performance drop at certain
> >>>> number of objects in pool
> >>>>
> >>>> vm.vfs_cache_pressure = 100
> >>>>
> >>>> Go the other direction on that. You'll want to keep it low to help
> >>>> keep inode/dentry info in memory. We use 10, and haven't had a
> >>>> problem.
> >>>>
> >>>>
> >>>> Warren Wang
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> On 6/22/16, 9:41 PM, "Wade Holler" <wade.holler@gmail.com> wrote:
> >>>>
> >>>> >Blairo,
> >>>> >
> >>>> >We'll speak in pre-replication numbers, replication for this pool
> >>>> >is
> >>>>3.
> >>>> >
> >>>> >23.3 Million Objects / OSD
> >>>> >pg_num 2048
> >>>> >16 OSDs / Server
> >>>> >3 Servers
> >>>> >660 GB RAM Total, 179 GB Used (free -t) / Server
> >>>> >vm.swappiness = 1
> >>>> >vm.vfs_cache_pressure = 100
> >>>> >
> >>>> >Workload is native librados with python. ALL 4k objects.
> >>>> >
> >>>> >Best Regards,
> >>>> >Wade
> >>>> >
> >>>> >
> >>>> >On Wed, Jun 22, 2016 at 9:33 PM, Blair Bethwaite
> >>>> ><blair.bethwaite@gmail.com> wrote:
> >>>> >> Wade, good to know.
> >>>> >>
> >>>> >> For the record, what does this work out to roughly per OSD? And
> >>>> >> how much RAM and how many PGs per OSD do you have?
> >>>> >>
> >>>> >> What's your workload? I wonder whether for certain workloads
> >>>> >> (e.g. RBD) it's better to increase default object size somewhat
> >>>> >> before pushing the split/merge up a lot...
> >>>> >>
> >>>> >> Cheers,
> >>>> >>
> >>>> >> On 23 June 2016 at 11:26, Wade Holler <wade.holler@gmail.com>
> >>>> >> wrote:
> >>>> >>> Based on everyone's suggestions, the first modification to 50 /
> >>>> >>> 16 enabled our config to get to ~645 million objects before the
> >>>> >>> behavior in question was observed (~330 million was the previous
> >>>> >>> ceiling). Subsequent modification to 50 / 24 has enabled us to
> >>>> >>> get to 1.1 billion+.
> >>>> >>>
> >>>> >>> Thank you all very much for your support and assistance.
> >>>> >>>
> >>>> >>> Best Regards,
> >>>> >>> Wade
> >>>> >>>
> >>>> >>>
> >>>> >>> On Mon, Jun 20, 2016 at 6:58 PM, Christian Balzer
> >>>> >>> <chibi@gol.com> wrote:
> >>>> >>>>
> >>>> >>>> Hello,
> >>>> >>>>
> >>>> >>>> On Mon, 20 Jun 2016 20:47:32 +0000 Warren Wang - ISD wrote:
> >>>> >>>>
> >>>> >>>>> Sorry, late to the party here. I agree, up the merge and split
> >>>> >>>>>thresholds. We're as high as 50/12. I chimed in on an RH ticket
> >>>> >>>>>here.
> >>>> >>>>> One of those things you just have to find out as an operator
> >>>> >>>>>since it's not well documented :(
> >>>> >>>>>
> >>>> >>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1219974
> >>>> >>>>>
> >>>> >>>>> We have over 200 million objects in this cluster, and it's
> >>>> >>>>> still doing over 15000 write IOPS all day long with 302
> >>>> >>>>> spinning drives + SATA SSD journals. Having enough memory and
> >>>> >>>>> dropping your vfs_cache_pressure should also help.
> >>>> >>>>>
> >>>> >>>> Indeed.
> >>>> >>>>
> >>>> >>>> Since it was asked in that bug report and also my first
> >>>> >>>>suspicion, it would probably be good time to clarify that it
> >>>> >>>>isn't the splits that cause the performance degradation, but
> >>>> >>>>the resulting inflation of dir entries and exhaustion of SLAB
> >>>> >>>>and thus having to go to disk for things that normally would
> >>>> >>>>be in memory.
> >>>> >>>>
> >>>> >>>> Looking at Blair's graph from yesterday pretty much makes that
> >>>> >>>>clear, a purely split caused degradation should have relented
> >>>> >>>>much quicker.
> >>>> >>>>
> >>>> >>>>
> >>>> >>>>> Keep in mind that if you change the values, it won't take
> >>>> >>>>> effect immediately. It only merges them back if the directory
> >>>> >>>>> is under the calculated threshold and a write occurs (maybe a
> >>>> >>>>> read, I forget).
> >>>> >>>>>
> >>>> >>>> If it's a read a plain scrub might do the trick.
> >>>> >>>>
> >>>> >>>> Christian
> >>>> >>>>> Warren
> >>>> >>>>>
> >>>> >>>>>
> >>>> >>>>> From: ceph-users <ceph-users-bounces@lists.ceph.com> on behalf
> >>>> >>>>> of Wade Holler <wade.holler@gmail.com>
> >>>> >>>>> Date: Monday, June 20, 2016 at 2:48 PM
> >>>> >>>>> To: Blair Bethwaite <blair.bethwaite@gmail.com>, Wido den
> >>>> >>>>> Hollander <wido@42on.com>
> >>>> >>>>> Cc: Ceph Development <ceph-devel@vger.kernel.org>,
> >>>> >>>>> "ceph-users@lists.ceph.com" <ceph-users@lists.ceph.com>
> >>>> >>>>> Subject: Re: [ceph-users] Dramatic performance drop at certain
> >>>> >>>>> number of objects in pool
> >>>> >>>>>
> >>>> >>>>> Thanks everyone for your replies. I sincerely appreciate it.
> >>>> >>>>> We are testing with different pg_num and
> >>>> >>>>> filestore_split_multiple settings. Early indications are ....
> >>>> >>>>> well not great. Regardless it is nice to understand the
> >>>> >>>>> symptoms better so we try to design around it.
> >>>> >>>>>
> >>>> >>>>> Best Regards,
> >>>> >>>>> Wade
> >>>> >>>>>
> >>>> >>>>>
> >>>> >>>>> On Mon, Jun 20, 2016 at 2:32 AM Blair Bethwaite
> >>>> >>>>> <blair.bethwaite@gmail.com> wrote:
> >>>> >>>>> On 20 June 2016 at 09:21, Blair Bethwaite
> >>>> >>>>> <blair.bethwaite@gmail.com> wrote:
> >>>> >>>>> > slow request issues). If you watch your xfs stats you'll
> >>>> >>>>> > likely get further confirmation. In my experience
> >>>> >>>>> > xs_dir_lookups balloons (which means directory lookups are
> >>>> >>>>> > missing cache and going to disk).
> >>>> >>>>>
> >>>> >>>>> Murphy's a bitch. Today we upgraded a cluster to latest Hammer
> >>>> >>>>> in preparation for Jewel/RHCS2. Turns out when we last hit
> >>>> >>>>> this very problem we had only ephemerally set the new
> >>>> >>>>> filestore merge/split values - oops. Here's what started
> >>>> >>>>> happening when we upgraded and restarted a bunch of OSDs:
> >>>> >>>>>
> >>>> >>>>> https://au-east.erc.monash.edu.au/swift/v1/public/grafana-ceph-xs_dir_lookup.png
> >>>> >>>>>
> >>>> >>>>> Seemed to cause lots of slow requests :-/. We corrected it
> >>>> >>>>> about 12:30, then still took a while to settle.
> >>>> >>>>>
> >>>> >>>>> --
> >>>> >>>>> Cheers,
> >>>> >>>>> ~Blairo
> >>>> >>>>>
> >>>> >>>>> This email and any files transmitted with it are confidential
> >>>> >>>>>and intended solely for the individual or entity to whom they
> >>>> >>>>>are addressed.
> >>>> >>>>> If you have received this email in error destroy it
> >>>> >>>>> immediately.
> >>>> >>>>>*** Walmart Confidential ***
> >>>> >>>>
> >>>> >>>>
> >>>> >>>> --
> >>>> >>>> Christian Balzer Network/Systems Engineer
> >>>> >>>> chibi@gol.com Global OnLine Japan/Rakuten
> >>>> >>>> Communications http://www.gol.com/
> >>>> >>
> >>>> >>
> >>>> >>
> >>>> >> --
> >>>> >> Cheers,
> >>>> >> ~Blairo
> >>>>
> >>>> _______________________________________________
> >>>> ceph-users mailing list ceph-users@lists.ceph.com
> >>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com PLEASE NOTE:
> >>>> The information contained in this electronic mail message is
> >>>> intended only for the use of the designated recipient(s) named
> >>>> above. If the reader of this message is not the intended recipient,
> >>>> you are hereby notified that you have received this message in
> >>>> error and that any review, dissemination, distribution, or copying
> >>>> of this message is strictly prohibited. If you have received this
> >>>> communication in error, please notify the sender by telephone or
> >>>> e-mail (as shown above) immediately and destroy any and all copies
> >>>> of this message in your possession (whether hard copies or
> >>>> electronically stored copies).
> >>>>
> >>>
> >>>
> >>> --
> >>> Christian Balzer Network/Systems Engineer
> >>> chibi@gol.com Global OnLine Japan/Rakuten Communications
> >>> http://www.gol.com/
> >
> >
>
--
Christian Balzer Network/Systems Engineer
chibi@gol.com Global OnLine Japan/Rakuten Communications
http://www.gol.com/
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: Dramatic performance drop at certain number of objects in pool
[not found] ` <CAFMfnwoqbr+_c913oyxpvzHNS+NPdXX17dMdXoC1ZiuZM1GzPw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2016-06-27 8:12 ` Blair Bethwaite
0 siblings, 0 replies; 34+ messages in thread
From: Blair Bethwaite @ 2016-06-27 8:12 UTC (permalink / raw)
To: Kyle Bader; +Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw, Ceph Development
[-- Attachment #1.1: Type: text/plain, Size: 851 bytes --]
On 25 Jun 2016 6:02 PM, "Kyle Bader" <kyle.bader-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> fdatasync takes longer when you have more inodes in the slab caches, it's
> the double edged sword of vfs_cache_pressure.
That's a bit sad when, iiuc, it's only journals doing fdatasync in the Ceph
write path. I'd have expected the vfs to handle this on a per fs basis (and
a journal filesystem would have very little in the inode cache).
It's somewhat annoying there isn't a way to favor dentries (and perhaps
dentry inodes) over other inodes in the vfs cache. Our experience shows
that it's dentry misses that cause the major performance issues (makes
sense when you consider the OSD is storing all its data in the leaves of the
on-disk PG structure).
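To see how the slab is actually split between dentries and inodes, something
like the following could be used. It is only a minimal sketch: the function
name parse_slabinfo is made up for illustration, the cache names "dentry" and
"xfs_inode" are an assumption about what an XFS-backed OSD host exposes, and
the column order (name, active_objs, num_objs, objsize) assumes the
slabinfo version 2.1 layout.

```python
def parse_slabinfo(text, names=("dentry", "xfs_inode")):
    """Return {cache_name: (active_objs, num_objs, approx_bytes)}.

    `text` is the content of /proc/slabinfo; `names` are the slab caches
    of interest (assumed names, adjust for your kernel and filesystem).
    """
    usage = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip the version line, the '# name ...' header and other caches.
        if len(fields) < 4 or fields[0] not in names:
            continue
        active_objs, num_objs, objsize = (int(f) for f in fields[1:4])
        # Approximate footprint: allocated objects times object size.
        usage[fields[0]] = (active_objs, num_objs, num_objs * objsize)
    return usage
```

Feeding it the contents of /proc/slabinfo (usually readable only by root)
before and after a latency jump might show whether the dentry cache is
shrinking while inode objects stay resident.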
This is another discussion that seems to back up the choice to implement
BlueStore.
Cheers,
Blair
^ permalink raw reply [flat|nested] 34+ messages in thread
end of thread, other threads:[~2016-06-27 8:12 UTC | newest]
Thread overview: 34+ messages
2016-06-16 12:14 Dramatic performance drop at certain number of objects in pool Wade Holler
2016-06-16 12:48 ` Blair Bethwaite
[not found] ` <CA+z5Dsz=e1N9RxRoF5Wao8Dogf_S1UstNZaCJ=oj-efj83HBig-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2016-06-16 14:20 ` Dramatic performance drop at certain number ofobjects " Mykola
2016-06-16 14:30 ` Dramatic performance drop at certain number of objects " Wade Holler
2016-06-16 14:32 ` Wade Holler
2016-06-16 13:38 ` Wido den Hollander
2016-06-16 14:47 ` Wade Holler
2016-06-16 16:08 ` Wade Holler
2016-06-17 8:49 ` Wido den Hollander
2016-06-19 23:21 ` Blair Bethwaite
[not found] ` <CA+z5DszqHuevkAF3W01R=7AAeqVcyuHZTX0+bAvThgihvOjwuA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2016-06-20 0:52 ` Christian Balzer
2016-06-20 6:32 ` Blair Bethwaite
[not found] ` <CA+z5Dsy4tbyiL71C8CQCTQ66tY1=9thSWdNA4BSn6=tNfGUE6w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2016-06-20 18:48 ` Wade Holler
[not found] ` <CA+e22Sc3iY5Lvp4oGwJ_wwpJsOJsWdB1thaHWEAuYP=bbGHAeg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2016-06-20 20:47 ` Warren Wang - ISD
[not found] ` <D38DCB57.131AE%warren.wang-dFwxUrggiyBBDgjK7y7TUQ@public.gmane.org>
2016-06-20 22:58 ` Christian Balzer
2016-06-23 1:26 ` [ceph-users] " Wade Holler
[not found] ` <CA+e22SdrwRHmAD=67MpVtUXVyCOmidcoUXrANZVeDJc2tcJfnQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2016-06-23 1:33 ` Blair Bethwaite
2016-06-23 1:41 ` [ceph-users] " Wade Holler
2016-06-23 2:01 ` Blair Bethwaite
2016-06-23 2:28 ` Christian Balzer
2016-06-23 2:36 ` Blair Bethwaite
2016-06-23 2:31 ` Wade Holler
[not found] ` <CA+e22SfaiBUQ9Wanay6_oji9t7131o67B2oDtaEW_zXwqCJfbQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2016-06-23 22:09 ` Warren Wang - ISD
[not found] ` <D391D1A4.145D6%warren.wang-dFwxUrggiyBBDgjK7y7TUQ@public.gmane.org>
2016-06-23 22:24 ` Somnath Roy
[not found] ` <BL2PR02MB2115BD5C173011A0CB92F964F42D0-TNqo25UYn65rzea/mugEKanrV9Ap65cLvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2016-06-24 0:08 ` Christian Balzer
[not found] ` <20160624090806.1246b1ff-9yhXNL7Kh0lSCLKNlHTxZM8NsWr+9BEh@public.gmane.org>
2016-06-24 0:09 ` Somnath Roy
2016-06-24 14:23 ` [ceph-users] " Wade Holler
[not found] ` <CA+e22SdmGJVzJX9+63T41UGsfFcxs9R=xZqniQyTgu-yG=h0cA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2016-06-24 16:24 ` Warren Wang - ISD
[not found] ` <D392D6EB.146C6%warren.wang-dFwxUrggiyBBDgjK7y7TUQ@public.gmane.org>
2016-06-24 19:45 ` Wade Holler
2016-06-25 3:07 ` [ceph-users] " Christian Balzer
[not found] ` <CAFMfnwoqbr+_c913oyxpvzHNS+NPdXX17dMdXoC1ZiuZM1GzPw@mail.gmail.com>
[not found] ` <CAFMfnwoqbr+_c913oyxpvzHNS+NPdXX17dMdXoC1ZiuZM1GzPw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2016-06-27 8:12 ` Blair Bethwaite
2016-06-23 2:37 ` [ceph-users] " Christian Balzer
[not found] ` <20160623113717.446a1f9d-9yhXNL7Kh0lSCLKNlHTxZM8NsWr+9BEh@public.gmane.org>
2016-06-23 2:55 ` Blair Bethwaite
[not found] ` <CA+z5DszcLqV32NnWeuu+WsRZoZwM493Jfy7WcSpVtaDyArwFAQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2016-06-23 3:38 ` Christian Balzer