* bluestore prefer wal size
@ 2017-02-01 15:34 Sage Weil
       [not found] ` <001001d27cbd$cee2a9f0$6ca7fdd0$@fisk.me.uk>
  0 siblings, 1 reply; 10+ messages in thread
From: Sage Weil @ 2017-02-01 15:34 UTC (permalink / raw)
  To: nick; +Cc: ceph-devel

Hey Nick,

I've updated/improved the prefer wal size PR (that sends small writes 
through the wal).  See

	https://github.com/ceph/ceph/pull/13217

if you want to try it out.

Thanks!
sage



^ permalink raw reply	[flat|nested] 10+ messages in thread

* RE: bluestore prefer wal size
       [not found]   ` <bb0bdeb6.AEMAHeJMfMsAAAAAAAAAAFklQUsAADNJBWwAAAAAAACRXwBYklNO@mailjet.com>
@ 2017-02-01 21:38     ` Sage Weil
       [not found]       ` <45ba3843.AEEAHxSt8JAAAAAAAAAAAFklQUsAADNJBWwAAAAAAACRXwBYkldP@mailjet.com>
  0 siblings, 1 reply; 10+ messages in thread
From: Sage Weil @ 2017-02-01 21:38 UTC (permalink / raw)
  To: Nick Fisk; +Cc: 'Mark Nelson', ceph-devel

On Wed, 1 Feb 2017, Nick Fisk wrote:
> Further update,
> 
> I set bluestore_debug_omit_block_device_write to true and this gave me 
> near-filestore performance, albeit still with very high write amp on the 
> SSDs. So it's definitely something around that part of the code waiting 
> on the writes to the spinning disks; I'm puzzled why the commit didn't help.
> 
> I also set debug logging to 20/20 for bdev, bluefs, osd and bluestore, and a 
> grep didn't reveal any of the debug output from that commit, e.g. "defering small 
> 0x". So possibly something isn't working as expected?

Oh, yeah, the option isn't working then.  I saw the line in my debug 
output... are you sure you set bluestore_prefer_wal_size?

sage

> 
> Nick
> 
> -----Original Message-----
> From: Nick Fisk [mailto:nick@fisk.me.uk] 
> Sent: 01 February 2017 19:03
> To: 'Sage Weil' <sweil@redhat.com>; 'Mark Nelson' <mnelson@redhat.com>
> Cc: ceph-devel@vger.kernel.org
> Subject: RE: bluestore prefer wal size
> 
> Hi Sage,
> 
> First results are not looking good. It looks like write IO to the SSDs (sdd and sdi) is now massively amplified, somewhere in the region of 10x. But I'm still only getting around 100 4 KB sequential write IOPS from the fio client, compared to 2500-3000 IOPS on the SSDs and ~200 IOPS per spinning disk (sdc, sde, sdg, sdh).
> 
> rbd engine: RBD version: 0.1.11
> Jobs: 1 (f=1): [W(1)] [100.0% done] [0KB/476KB/0KB /s] [0/119/0 iops] [eta 00m:00s]
> rbd_iodepth32: (groupid=0, jobs=1): err= 0: pid=31171: Wed Feb  1 18:56:51 2017
>   write: io=27836KB, bw=475020B/s, iops=115, runt= 60006msec
>     slat (usec): min=6, max=142, avg=10.98, stdev= 8.38
>     clat (msec): min=1, max=271, avg= 8.61, stdev= 3.63
>      lat (msec): min=1, max=271, avg= 8.62, stdev= 3.63
>     clat percentiles (msec):
>      |  1.00th=[    8],  5.00th=[    9], 10.00th=[    9], 20.00th=[    9],
>      | 30.00th=[    9], 40.00th=[    9], 50.00th=[    9], 60.00th=[    9],
>      | 70.00th=[    9], 80.00th=[    9], 90.00th=[    9], 95.00th=[    9],
>      | 99.00th=[   17], 99.50th=[   25], 99.90th=[   33], 99.95th=[   36],
>      | 99.99th=[  273]
>     bw (KB  /s): min=  191, max=  480, per=100.00%, avg=464.18, stdev=34.60
>     lat (msec) : 2=0.01%, 4=0.04%, 10=97.69%, 20=1.55%, 50=0.69%
>     lat (msec) : 500=0.01%
>   cpu          : usr=0.13%, sys=0.12%, ctx=7224, majf=0, minf=5
>   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued    : total=r=0/w=6959/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
>      latency   : target=0, window=0, percentile=100.00%, depth=1
> 
> I've checked via the admin socket and the prefer wal option is set to 8192.
> 
> Random capture of iostat
> 
> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> sda               0.00    77.00    0.00    5.50     0.00   338.00   122.91     0.05   11.64    0.00   11.64   8.73   4.80
> sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
> sdc               0.00     0.00    0.00  151.50     0.00   302.00     3.99     0.54    3.59    0.00    3.59   3.58  54.20
> sdd               0.00     0.00    0.00 1474.00     0.00  4008.00     5.44     0.09    0.06    0.00    0.06   0.06   9.20
> sde               0.00     0.00    0.00  217.00     0.00   434.00     4.00     0.91    4.20    0.00    4.20   4.20  91.20
> sdf               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
> sdg               0.00     0.00    0.00   66.50     0.00   134.00     4.03     0.18    2.68    0.00    2.68   2.68  17.80
> sdh               0.00     0.00    0.00  217.00     0.00   434.00     4.00     0.80    3.71    0.00    3.71   3.71  80.40
> sdi               0.00     0.00    0.00 1134.00     0.00  3082.00     5.44     0.09    0.08    0.00    0.08   0.07   8.40
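As a rough sanity check, the amplification Nick describes can be estimated directly from the figures above: fio averaged about 464 KB/s while the two SSDs (sdd and sdi) absorbed roughly 4008 and 3082 wkB/s. A minimal sketch of that arithmetic (illustrative only; these are the quoted sample values, not something Ceph reports):

```python
# Rough write-amplification estimate from the quoted fio and iostat samples.
# The figures are taken from the mail above; the calculation itself is
# illustrative, not a metric Ceph computes.

client_kbs = 464.18                          # fio avg bandwidth (KB/s)
ssd_wkbs = {"sdd": 4008.0, "sdi": 3082.0}    # iostat wkB/s for the SSDs

per_device_amp = {dev: w / client_kbs for dev, w in ssd_wkbs.items()}
combined_amp = sum(ssd_wkbs.values()) / client_kbs

print(per_device_amp)          # roughly 8.6x (sdd) and 6.6x (sdi)
print(round(combined_amp, 1))  # ~15.3x across both SSDs together
```

Per device this lands in the high-single-digit range Nick mentions; summed over both SSDs it is closer to 15x, though some of that traffic is legitimate WAL and metadata writes rather than pure amplification.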
> 
> -----Original Message-----
> From: Sage Weil [mailto:sweil@redhat.com] 
> Sent: 01 February 2017 15:34
> To: nick@fisk.me.uk
> Cc: ceph-devel@vger.kernel.org
> Subject: bluestore prefer wal size
> 
> Hey Nick,
> 
> I've updated/improved the prefer wal size PR (that sends small writes through the wal).  See
> 
> 	https://github.com/ceph/ceph/pull/13217
> 
> if you want to try it out.
> 
> Thanks!
> sage
> 
> 
> 
> 
> 


* RE: bluestore prefer wal size
       [not found]       ` <45ba3843.AEEAHxSt8JAAAAAAAAAAAFklQUsAADNJBWwAAAAAAACRXwBYkldP@mailjet.com>
@ 2017-02-01 21:47         ` Sage Weil
  2017-02-01 22:12           ` Sage Weil
  0 siblings, 1 reply; 10+ messages in thread
From: Sage Weil @ 2017-02-01 21:47 UTC (permalink / raw)
  To: Nick Fisk; +Cc: 'Mark Nelson', ceph-devel

On Wed, 1 Feb 2017, Nick Fisk wrote:
 
> > -----Original Message-----
> > From: Sage Weil [mailto:sweil@redhat.com]
> > Sent: 01 February 2017 21:39
> > To: Nick Fisk <nick@fisk.me.uk>
> > Cc: 'Mark Nelson' <mnelson@redhat.com>; ceph-devel@vger.kernel.org
> > Subject: RE: bluestore prefer wal size
> > 
> > On Wed, 1 Feb 2017, Nick Fisk wrote:
> > > Further update,
> > >
> > > I set bluestore_debug_omit_block_device_write to true and this gave me
> > > near filestore performance, albeit still with very high write amp on
> > > the SSD's. So its definitely something around that part of the code
> > > waiting on the writes to the spinning disks, puzzled why the commit didn't
> > help.
> > >
> > > I also set debug logging to 20/20 for bdev,bluefs,osd,bluestore and a
> > > grep didn't reveal any of the debug in that commit eg "defering small
> > > 0x". So possibly something isn't working as expected?
> > 
> > Oh, yeah, the option isn't working then.  I saw the line in my debug output..
> > are you sure you set bluestore_prefer_wal_size ?
> 
> Yep, 100%. I even did a config show on the admin socket to confirm the value. I will rebuild the OSDs and try again, and do a bit more digging to see if I can work out why it's not entering that if statement.

Oh, I know what the problem is.  I'll push a fixed patch shortly.

sage

> 
> > 
> > sage
> > 
> > >
> > > Nick
> > >
> > > -----Original Message-----
> > > From: Nick Fisk [mailto:nick@fisk.me.uk]
> > > Sent: 01 February 2017 19:03
> > > To: 'Sage Weil' <sweil@redhat.com>; 'Mark Nelson'
> > <mnelson@redhat.com>
> > > Cc: ceph-devel@vger.kernel.org
> > > Subject: RE: bluestore prefer wal size
> > >
> > > Hi Sage,
> > >
> > > First results not looking good. It looks like write IO to the SSD's (sdd and sdi)
> > is now massively amplified, by somewhere in the region of about 10x. But I'm
> > still only getting around 100 4kb's seq write iops from the fio client. This is in
> > comparison to 2500-3000 iops on the SSD's and ~200 iops per spinning disk
> > (sdc,sde,sdg,sdh).
> > >
> > > rbd engine: RBD version: 0.1.11
> > > Jobs: 1 (f=1): [W(1)] [100.0% done] [0KB/476KB/0KB /s] [0/119/0 iops]
> > > [eta 00m:00s]
> > > rbd_iodepth32: (groupid=0, jobs=1): err= 0: pid=31171: Wed Feb  1
> > 18:56:51 2017
> > >   write: io=27836KB, bw=475020B/s, iops=115, runt= 60006msec
> > >     slat (usec): min=6, max=142, avg=10.98, stdev= 8.38
> > >     clat (msec): min=1, max=271, avg= 8.61, stdev= 3.63
> > >      lat (msec): min=1, max=271, avg= 8.62, stdev= 3.63
> > >     clat percentiles (msec):
> > >      |  1.00th=[    8],  5.00th=[    9], 10.00th=[    9], 20.00th=[    9],
> > >      | 30.00th=[    9], 40.00th=[    9], 50.00th=[    9], 60.00th=[    9],
> > >      | 70.00th=[    9], 80.00th=[    9], 90.00th=[    9], 95.00th=[    9],
> > >      | 99.00th=[   17], 99.50th=[   25], 99.90th=[   33], 99.95th=[   36],
> > >      | 99.99th=[  273]
> > >     bw (KB  /s): min=  191, max=  480, per=100.00%, avg=464.18, stdev=34.60
> > >     lat (msec) : 2=0.01%, 4=0.04%, 10=97.69%, 20=1.55%, 50=0.69%
> > >     lat (msec) : 500=0.01%
> > >   cpu          : usr=0.13%, sys=0.12%, ctx=7224, majf=0, minf=5
> > >   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%,
> > >=64=0.0%
> > >      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
> > >=64=0.0%
> > >      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
> > >=64=0.0%
> > >      issued    : total=r=0/w=6959/d=0, short=r=0/w=0/d=0,
> > drop=r=0/w=0/d=0
> > >      latency   : target=0, window=0, percentile=100.00%, depth=1
> > >
> > > I've checked via the admin socket and the prefer wal option is set to 8192.
> > >
> > > Random capture of iostat
> > >
> > > Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz
> > await r_await w_await  svctm  %util
> > > sda               0.00    77.00    0.00    5.50     0.00   338.00   122.91     0.05   11.64
> > 0.00   11.64   8.73   4.80
> > > sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00
> > 0.00    0.00   0.00   0.00
> > > sdc               0.00     0.00    0.00  151.50     0.00   302.00     3.99     0.54    3.59
> > 0.00    3.59   3.58  54.20
> > > sdd               0.00     0.00    0.00 1474.00     0.00  4008.00     5.44     0.09    0.06
> > 0.00    0.06   0.06   9.20
> > > sde               0.00     0.00    0.00  217.00     0.00   434.00     4.00     0.91    4.20
> > 0.00    4.20   4.20  91.20
> > > sdf               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00
> > 0.00    0.00   0.00   0.00
> > > sdg               0.00     0.00    0.00   66.50     0.00   134.00     4.03     0.18    2.68
> > 0.00    2.68   2.68  17.80
> > > sdh               0.00     0.00    0.00  217.00     0.00   434.00     4.00     0.80    3.71
> > 0.00    3.71   3.71  80.40
> > > sdi               0.00     0.00    0.00 1134.00     0.00  3082.00     5.44     0.09    0.08
> > 0.00    0.08   0.07   8.40
> > >
> > > -----Original Message-----
> > > From: Sage Weil [mailto:sweil@redhat.com]
> > > Sent: 01 February 2017 15:34
> > > To: nick@fisk.me.uk
> > > Cc: ceph-devel@vger.kernel.org
> > > Subject: bluestore prefer wal size
> > >
> > > Hey Nick,
> > >
> > > I've updated/improved the prefer wal size PR (that sends small writes
> > > through the wal).  See
> > >
> > > 	https://github.com/ceph/ceph/pull/13217
> > >
> > > if you want to try it out.
> > >
> > > Thanks!
> > > sage
> > >
> > >
> > >
> > >
> > >
> 
> 
> 


* RE: bluestore prefer wal size
  2017-02-01 21:47         ` Sage Weil
@ 2017-02-01 22:12           ` Sage Weil
       [not found]             ` <00bb01d27d45$21a48050$64ed80f0$@fisk.me.uk>
  0 siblings, 1 reply; 10+ messages in thread
From: Sage Weil @ 2017-02-01 22:12 UTC (permalink / raw)
  To: Nick Fisk; +Cc: 'Mark Nelson', ceph-devel

On Wed, 1 Feb 2017, Sage Weil wrote:
> On Wed, 1 Feb 2017, Nick Fisk wrote:
>  
> > > -----Original Message-----
> > > From: Sage Weil [mailto:sweil@redhat.com]
> > > Sent: 01 February 2017 21:39
> > > To: Nick Fisk <nick@fisk.me.uk>
> > > Cc: 'Mark Nelson' <mnelson@redhat.com>; ceph-devel@vger.kernel.org
> > > Subject: RE: bluestore prefer wal size
> > > 
> > > On Wed, 1 Feb 2017, Nick Fisk wrote:
> > > > Further update,
> > > >
> > > > I set bluestore_debug_omit_block_device_write to true and this gave me
> > > > near filestore performance, albeit still with very high write amp on
> > > > the SSD's. So its definitely something around that part of the code
> > > > waiting on the writes to the spinning disks, puzzled why the commit didn't
> > > help.
> > > >
> > > > I also set debug logging to 20/20 for bdev,bluefs,osd,bluestore and a
> > > > grep didn't reveal any of the debug in that commit eg "defering small
> > > > 0x". So possibly something isn't working as expected?
> > > 
> > > Oh, yeah, the option isn't working then.  I saw the line in my debug output..
> > > are you sure you set bluestore_prefer_wal_size ?
> > 
> > Yep 100%, I even did a config show on the admin socket to confirm the value. I will rebuild the OSD's and try again and do a bit more digging to see if I can work out why it's not entering that if statement.
> 
> Oh, I know what the problem is.  Will push a fixed patch shortly.

Try now?

sage

> 
> sage
> 
> > 
> > > 
> > > sage
> > > 
> > > >
> > > > Nick
> > > >
> > > > -----Original Message-----
> > > > From: Nick Fisk [mailto:nick@fisk.me.uk]
> > > > Sent: 01 February 2017 19:03
> > > > To: 'Sage Weil' <sweil@redhat.com>; 'Mark Nelson'
> > > <mnelson@redhat.com>
> > > > Cc: ceph-devel@vger.kernel.org
> > > > Subject: RE: bluestore prefer wal size
> > > >
> > > > Hi Sage,
> > > >
> > > > First results not looking good. It looks like write IO to the SSD's (sdd and sdi)
> > > is now massively amplified, by somewhere in the region of about 10x. But I'm
> > > still only getting around 100 4kb's seq write iops from the fio client. This is in
> > > comparison to 2500-3000 iops on the SSD's and ~200 iops per spinning disk
> > > (sdc,sde,sdg,sdh).
> > > >
> > > > rbd engine: RBD version: 0.1.11
> > > > Jobs: 1 (f=1): [W(1)] [100.0% done] [0KB/476KB/0KB /s] [0/119/0 iops]
> > > > [eta 00m:00s]
> > > > rbd_iodepth32: (groupid=0, jobs=1): err= 0: pid=31171: Wed Feb  1
> > > 18:56:51 2017
> > > >   write: io=27836KB, bw=475020B/s, iops=115, runt= 60006msec
> > > >     slat (usec): min=6, max=142, avg=10.98, stdev= 8.38
> > > >     clat (msec): min=1, max=271, avg= 8.61, stdev= 3.63
> > > >      lat (msec): min=1, max=271, avg= 8.62, stdev= 3.63
> > > >     clat percentiles (msec):
> > > >      |  1.00th=[    8],  5.00th=[    9], 10.00th=[    9], 20.00th=[    9],
> > > >      | 30.00th=[    9], 40.00th=[    9], 50.00th=[    9], 60.00th=[    9],
> > > >      | 70.00th=[    9], 80.00th=[    9], 90.00th=[    9], 95.00th=[    9],
> > > >      | 99.00th=[   17], 99.50th=[   25], 99.90th=[   33], 99.95th=[   36],
> > > >      | 99.99th=[  273]
> > > >     bw (KB  /s): min=  191, max=  480, per=100.00%, avg=464.18, stdev=34.60
> > > >     lat (msec) : 2=0.01%, 4=0.04%, 10=97.69%, 20=1.55%, 50=0.69%
> > > >     lat (msec) : 500=0.01%
> > > >   cpu          : usr=0.13%, sys=0.12%, ctx=7224, majf=0, minf=5
> > > >   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%,
> > > >=64=0.0%
> > > >      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
> > > >=64=0.0%
> > > >      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
> > > >=64=0.0%
> > > >      issued    : total=r=0/w=6959/d=0, short=r=0/w=0/d=0,
> > > drop=r=0/w=0/d=0
> > > >      latency   : target=0, window=0, percentile=100.00%, depth=1
> > > >
> > > > I've checked via the admin socket and the prefer wal option is set to 8192.
> > > >
> > > > Random capture of iostat
> > > >
> > > > Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz
> > > await r_await w_await  svctm  %util
> > > > sda               0.00    77.00    0.00    5.50     0.00   338.00   122.91     0.05   11.64
> > > 0.00   11.64   8.73   4.80
> > > > sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00
> > > 0.00    0.00   0.00   0.00
> > > > sdc               0.00     0.00    0.00  151.50     0.00   302.00     3.99     0.54    3.59
> > > 0.00    3.59   3.58  54.20
> > > > sdd               0.00     0.00    0.00 1474.00     0.00  4008.00     5.44     0.09    0.06
> > > 0.00    0.06   0.06   9.20
> > > > sde               0.00     0.00    0.00  217.00     0.00   434.00     4.00     0.91    4.20
> > > 0.00    4.20   4.20  91.20
> > > > sdf               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00
> > > 0.00    0.00   0.00   0.00
> > > > sdg               0.00     0.00    0.00   66.50     0.00   134.00     4.03     0.18    2.68
> > > 0.00    2.68   2.68  17.80
> > > > sdh               0.00     0.00    0.00  217.00     0.00   434.00     4.00     0.80    3.71
> > > 0.00    3.71   3.71  80.40
> > > > sdi               0.00     0.00    0.00 1134.00     0.00  3082.00     5.44     0.09    0.08
> > > 0.00    0.08   0.07   8.40
> > > >
> > > > -----Original Message-----
> > > > From: Sage Weil [mailto:sweil@redhat.com]
> > > > Sent: 01 February 2017 15:34
> > > > To: nick@fisk.me.uk
> > > > Cc: ceph-devel@vger.kernel.org
> > > > Subject: bluestore prefer wal size
> > > >
> > > > Hey Nick,
> > > >
> > > > I've updated/improved the prefer wal size PR (that sends small writes
> > > > through the wal).  See
> > > >
> > > > 	https://github.com/ceph/ceph/pull/13217
> > > >
> > > > if you want to try it out.
> > > >
> > > > Thanks!
> > > > sage
> > > >
> > > >
> > > >
> > > >
> > > >
> > 
> > 
> > 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 


* Re: bluestore prefer wal size
       [not found]               ` <d0caeb89.AEUAHScduCcAAAAAAAAAAFPDZXcAADNJBWwAAAAAAACRXwBYk0Zd@mailjet.com>
@ 2017-02-02 20:17                 ` Mark Nelson
       [not found]                   ` <002101d27d99$5e266ad0$1a734070$@fisk.me.uk>
  0 siblings, 1 reply; 10+ messages in thread
From: Mark Nelson @ 2017-02-02 20:17 UTC (permalink / raw)
  To: nick, 'Sage Weil'; +Cc: ceph-devel

Hi Nick,

On 02/02/2017 08:46 AM, Nick Fisk wrote:
> Further update
>
> If I do random writes instead of sequential, I see the "defering small write via wal" action, but it comes from the _do_alloc_write function, not the small-write path. I.e. the log entry is:
> _do_alloc_write defering small 0x1000 write via wal

I just ran some tests using the branch with 4k random and sequential 
writes against a single OSD.  I see "deferring small write via wal" from 
_do_alloc_write for both sequential and random writes when 
bluestore_prefer_wal_size is set to be larger than the IO size. 
Otherwise, I don't see the messages.  I am also preconditioning with 4MB 
writes.
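The behaviour described above boils down to a simple size test. A minimal Python sketch of the decision (a deliberate simplification of the real BlueStore logic, with made-up function names; the actual code applies further alignment and allocation conditions):

```python
# Simplified sketch of the routing decision: writes smaller than
# bluestore_prefer_wal_size are deferred through the WAL, larger ones go
# straight to the block device. Function name is illustrative only.

def route_write(write_len: int, prefer_wal_size: int) -> str:
    """Return which path a write of write_len bytes would take."""
    if write_len < prefer_wal_size:
        return "wal"      # logged as "defering small 0x... write via wal"
    return "direct"       # written directly to the block device

# With prefer_wal_size = 8192 (Nick's setting), a 4 KB write is deferred
# while a 64 KB write goes direct:
print(route_write(4096, 8192))    # wal
print(route_write(65536, 8192))   # direct
```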

So far I've just been testing on NVMe, but I should be able to do some 
mixed tests soon to see if there are any interesting performance 
implications.

Mark

>
> And that's for a 4 KB random write on an RBD that has been fully preconditioned via 4 MB fio writes.
>
> So it's something in the sequential nature of the test which trips it up, as I don't see any sign of deferred writes.
>
> From looking at the code in the small write function, I can only see two possibilities that would stop it from reaching the prefer wal if statement:
>
>   while (ep != o->extent_map.extent_map.end()) {
>     if (ep->blob_start() >= end) {
>       break;
>
> and
>
>   if ((b_off % chunk_size == 0 && b_len % chunk_size == 0) &&
> 	b->get_blob().get_ondisk_length() >= b_off + b_len &&
> 	b->get_blob().is_unused(b_off, b_len) &&
> 	b->get_blob().is_allocated(b_off, b_len)) {
>
> Unfortunately, that’s about as far as I have got, before my knowledge of Bluestore becomes a limitation.
>
> Don't know if that triggers any thoughts?
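The alignment part of the second gate Nick quotes is easy to reason about in isolation. A loose Python transcription (the names mirror the C++ above, but the blob-state checks on on-disk length, unused and allocated ranges are deliberately left out):

```python
# Loose transcription of the chunk-alignment part of the quoted C++ gate.
# b_off/b_len are the write's offset and length within the blob; chunk_size
# is the blob's checksum chunk size. The remaining blob-state conditions
# (get_ondisk_length, is_unused, is_allocated) are not modelled here.

def passes_alignment_gate(b_off: int, b_len: int, chunk_size: int) -> bool:
    return b_off % chunk_size == 0 and b_len % chunk_size == 0

# The 0x7b000~0x1000 write from Nick's log passes with a 4 KB (0x1000) chunk:
print(passes_alignment_gate(0x7b000, 0x1000, 0x1000))   # True

# The same write would fail the gate against a 64 KB chunk size:
print(passes_alignment_gate(0x7b000, 0x1000, 0x10000))  # False
```

So for a 4 KB-checksummed blob the alignment test alone should not reject these writes, which points suspicion at the other conditions or at the extent-map loop above it.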
>
>>
>> 2017-02-02 10:39:31.988782 7ff7282ff700 10 bluestore(/var/lib/ceph/osd/ceph-4) queue_transactions existing 0x56102a5a4900
>> osr(9.73 0x56102a5db1a0)
>> 2017-02-02 10:39:31.988786 7ff7282ff700 20 bluestore(/var/lib/ceph/osd/ceph-4) _txc_create osr 0x56102a5a4900 = 0x56102ab4be40
>> seq 260
>> 2017-02-02 10:39:31.988797 7ff7282ff700 15 bluestore(/var/lib/ceph/osd/ceph-4) _setattrs 9.73_head
>> #9:cff290b9:::rbd_data.36646a74b0dc51.0000000000000001:head# 2 key s
>> 2017-02-02 10:39:31.988802 7ff7282ff700 10 bluestore(/var/lib/ceph/osd/ceph-4) _setattrs 9.73_head
>> #9:cff290b9:::rbd_data.36646a74b0dc51.0000000000000001:head# 2 key s = 0
>> 2017-02-02 10:39:31.988806 7ff7282ff700 15 bluestore(/var/lib/ceph/osd/ceph-4) _write 9.73_head
>> #9:cff290b9:::rbd_data.36646a74b0dc51.0000000000000001:head# 0xfb000~
>> 1000
>> 2017-02-02 10:39:31.988809 7ff7282ff700 20 bluestore(/var/lib/ceph/osd/ceph-4) _do_write
>> #9:cff290b9:::rbd_data.36646a74b0dc51.0000000000000001:head# 0xfb000~1000 - have 0x400000 (4194304) bytes fadvise_flags 0x0
>> 2017-02-02 10:39:31.988812 7ff7282ff700 20 bluestore(/var/lib/ceph/osd/ceph-4) _do_write prefer csum_order 12 target_blob_size
>> 0x80000
>> 2017-02-02 10:39:31.988814 7ff7282ff700 10 bluestore(/var/lib/ceph/osd/ceph-4) _do_write_small 0xfb000~1000
>> 2017-02-02 10:39:31.988816 7ff7282ff700 20 bluestore(/var/lib/ceph/osd/ceph-4) _do_write_small considering Blob(0x56102aac05a0
>> blob([0x256480000~80000] mutable+csum crc32c/0x1000) ref_map(0x0~80000=1) SharedBlob(0x56102aa76260)) bstart 0x80000
>> 2017-02-02 10:39:31.988821 7ff7282ff700 20 bluestore.BufferSpace(0x56102aa762b8 in 0x56102a1337a0) _discard 0x7b000~1000
>> 2017-02-02 10:39:31.988827 7ff7282ff700 20 bluestore(/var/lib/ceph/osd/ceph-4) _do_write_small  wal write 0x7b000~1000 of
>> mutable Blob(0x56102aac05a0 blob([0x256480000~80000] mutable+csum crc32c/0x1000) ref_map(0x0~80000=1)
>> SharedBlob(0x56102aa76260)) at [0x2564fb000~1000]
>> 2017-02-02 10:39:31.988832 7ff7282ff700 20 bluestore(/var/lib/ceph/osd/ceph-4) _do_write_small  lex 0xfb000~1000: 0x7b000~1000
>> Blob(0x56102aac05a0 blob([0x256480000~80000] mutable+csum crc32c/0x1000)
>> ref_map(0x0~7b000=1,0x7b000~1000=2,0x7c000~4000=1) SharedBlob(0x56102aa76260))
>> 2017-02-02 10:39:31.988836 7ff7282ff700 20 bluestore(/var/lib/ceph/osd/ceph-4) _do_alloc_write txc 0x56102ab4be40 0 blobs
>> 2017-02-02 10:39:31.988837 7ff7282ff700 10 bitmapalloc:reserve instance 94627427524464 num_used 345738 total 7627973
>> 2017-02-02 10:39:31.988840 7ff7282ff700 20 bluestore(/var/lib/ceph/osd/ceph-4) _wctx_finish lex_old 0xfb000~1000: 0x7b000~1000
>> Blob(0x56102aac05a0 blob([0x256480000~80000] mutable+csum crc32c/0x1000)
>> ref_map(0x0~7b000=1,0x7b000~1000=2,0x7c000~4000=1) SharedBlob(0x56102aa76260))
>> 2017-02-02 10:39:31.988844 7ff7282ff700 20 bluestore.extentmap(0x56102a8ee3f0) compress_extent_map 0xfb000~1000 next shard
>> 0x100000 merging 0x80000~7b000: 0x0~7b000 Blob(0x56102aac05a0 blob([0x256480000~80000] mutable+csum crc32c/0x1000)
>> ref_map(0x0~80000=1) SharedBlob(0x56102aa76260)) and 0xfb000~1000: 0x7b000~1000 Blob(0x56102aac05a0
>> blob([0x256480000~80000] mutable+csum crc32c/0x1000) ref_map(0x0~80000=1) SharedBlob(0x56102aa76260))
>> 2017-02-02 10:39:31.988850 7ff7282ff700 20 bluestore.extentmap(0x56102a8ee3f0) compress_extent_map 0xfb000~1000 next shard
>> 0x100000 merging 0x80000~7c000: 0x0~7c000 Blob(0x56102aac05a0 blob([0x256480000~80000] mutable+csum crc32c/0x1000)
>> ref_map(0x0~80000=1) SharedBlob(0x56102aa76260)) and 0xfc000~4000: 0x7c000~4000 Blob(0x56102aac05a0
>> blob([0x256480000~80000] mutable+csum crc32c/0x1000) ref_map(0x0~80000=1) SharedBlob(0x56102aa76260))
>> 2017-02-02 10:39:31.988868 7ff7282ff700 20 bluestore.extentmap(0x56102a8ee3f0) dirty_range mark shard 0x80000 dirty
>> 2017-02-02 10:39:31.988870 7ff7282ff700 10 bluestore(/var/lib/ceph/osd/ceph-4) _write 9.73_head
>> #9:cff290b9:::rbd_data.36646a74b0dc51.0000000000000001:head# 0xfb000~1000 = 0
>> 2017-02-02 10:39:31.988875 7ff7282ff700 15 bluestore(/var/lib/ceph/osd/ceph-4) _omap_setkeys 9.73_head #9:ce000000::::head#
>> 2017-02-02 10:39:31.988882 7ff7282ff700 10 bluestore(/var/lib/ceph/osd/ceph-4) _omap_setkeys 9.73_head #9:ce000000::::head# = 0
>> 2017-02-02 10:39:31.988885 7ff7282ff700 20 bluestore(/var/lib/ceph/osd/ceph-4) _txc_write_nodes txc 0x56102ab4be40 onodes
>> 0x56102a8ee300 shared_blobs
>> 2017-02-02 10:39:31.988894 7ff7282ff700 20 bluestore.extentmap(0x56102a8ee3f0) update shard 0x80000 is 531 bytes (was 531) from 1
>> extents
>> 2017-02-02 10:39:31.988902 7ff7282ff700 20 bluestore(/var/lib/ceph/osd/ceph-4)   onode
>> #9:cff290b9:::rbd_data.36646a74b0dc51.0000000000000001:head# is 424 (422 bytes onode + 2 bytes spanning blobs + 0 bytes inline
>> extents)
>> 2017-02-02 10:39:31.988914 7ff7282ff700 10 bluestore(/var/lib/ceph/osd/ceph-4) _txc_state_proc txc 0x56102ab4be40 prepare
>> 2017-02-02 10:39:31.988916 7ff7282ff700 20 bluestore(/var/lib/ceph/osd/ceph-4) _txc_finish_io 0x56102ab4be40
>> 2017-02-02 10:39:31.988917 7ff7282ff700 10 bluestore(/var/lib/ceph/osd/ceph-4) _txc_state_proc txc 0x56102ab4be40 io_done
>> 2017-02-02 10:39:31.988918 7ff7282ff700 20 bluestore.BufferSpace(0x56102aa762b8 in 0x56102a1337a0) finish_write
>> buffer(0x56102adb1830 space 0x56102aa762b8 0x7b000~1000 writing nocache)
>> 2017-02-02 10:39:31.988939 7ff72e30b700 20 bluestore(/var/lib/ceph/osd/ceph-4) _kv_sync_thread wake
>> 2017-02-02 10:39:31.988941 7ff72e30b700 20 bluestore(/var/lib/ceph/osd/ceph-4) _kv_sync_thread committing 1 submitting 1
>> cleaning 0
>> 2017-02-02 10:39:31.988952 7ff72e30b700 20 bluestore(/var/lib/ceph/osd/ceph-4) _txc_finalize_kv txc 0x56102ab4be40 allocated 0x[]
>> released 0x[]
>> 2017-02-02 10:39:31.988999 7ff72e30b700 10 bluestore(/var/lib/ceph/osd/ceph-4) _balance_bluefs_freespace bluefs 9535 M free
>> (0.999896) bluestore 222 G free (0.954675), bluefs_ratio 0.0402143
>> 2017-02-02 10:39:31.989400 7ff72e30b700 20 bluestore(/var/lib/ceph/osd/ceph-4) _kv_sync_thread committed 1 cleaned 0 in 0.000456
>> 2017-02-02 10:39:31.989405 7ff72e30b700 10 bluestore(/var/lib/ceph/osd/ceph-4) _txc_state_proc txc 0x56102ab4be40 kv_submitted
>> 2017-02-02 10:39:31.989407 7ff72e30b700 20 bluestore(/var/lib/ceph/osd/ceph-4) _txc_finish_kv txc 0x56102ab4be40
>> 2017-02-02 10:39:31.989413 7ff72e30b700 20 bluestore(/var/lib/ceph/osd/ceph-4) _wal_apply txc 0x56102ab4be40 seq 255
>> 2017-02-02 10:39:31.989414 7ff72e30b700 20 bluestore(/var/lib/ceph/osd/ceph-4) _do_wal_op write [0x2564fb000~1000]
>>
>>
>>>
>>> sage
>>>
>>>>
>>>> sage
>>>>
>>>>>
>>>>>>
>>>>>> sage
>>>>>>
>>>>>>>
>>>>>>> Nick
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Nick Fisk [mailto:nick@fisk.me.uk]
>>>>>>> Sent: 01 February 2017 19:03
>>>>>>> To: 'Sage Weil' <sweil@redhat.com>; 'Mark Nelson'
>>>>>> <mnelson@redhat.com>
>>>>>>> Cc: ceph-devel@vger.kernel.org
>>>>>>> Subject: RE: bluestore prefer wal size
>>>>>>>
>>>>>>> Hi Sage,
>>>>>>>
>>>>>>> First results not looking good. It looks like write IO to the
>>>>>>> SSD's (sdd and sdi)
>>>>>> is now massively amplified, by somewhere in the region of about
>>>>>> 10x. But I'm still only getting around 100 4kb's seq write iops
>>>>>> from the fio client. This is in comparison to 2500-3000 iops on
>>>>>> the SSD's and ~200 iops per spinning disk (sdc,sde,sdg,sdh).
>>>>>>>
>>>>>>> rbd engine: RBD version: 0.1.11
>>>>>>> Jobs: 1 (f=1): [W(1)] [100.0% done] [0KB/476KB/0KB /s]
>>>>>>> [0/119/0 iops] [eta 00m:00s]
>>>>>>> rbd_iodepth32: (groupid=0, jobs=1): err= 0: pid=31171: Wed Feb
>>>>>>> 1
>>>>>> 18:56:51 2017
>>>>>>>   write: io=27836KB, bw=475020B/s, iops=115, runt= 60006msec
>>>>>>>     slat (usec): min=6, max=142, avg=10.98, stdev= 8.38
>>>>>>>     clat (msec): min=1, max=271, avg= 8.61, stdev= 3.63
>>>>>>>      lat (msec): min=1, max=271, avg= 8.62, stdev= 3.63
>>>>>>>     clat percentiles (msec):
>>>>>>>      |  1.00th=[    8],  5.00th=[    9], 10.00th=[    9], 20.00th=[    9],
>>>>>>>      | 30.00th=[    9], 40.00th=[    9], 50.00th=[    9], 60.00th=[    9],
>>>>>>>      | 70.00th=[    9], 80.00th=[    9], 90.00th=[    9], 95.00th=[    9],
>>>>>>>      | 99.00th=[   17], 99.50th=[   25], 99.90th=[   33], 99.95th=[   36],
>>>>>>>      | 99.99th=[  273]
>>>>>>>     bw (KB  /s): min=  191, max=  480, per=100.00%, avg=464.18, stdev=34.60
>>>>>>>     lat (msec) : 2=0.01%, 4=0.04%, 10=97.69%, 20=1.55%, 50=0.69%
>>>>>>>     lat (msec) : 500=0.01%
>>>>>>>   cpu          : usr=0.13%, sys=0.12%, ctx=7224, majf=0, minf=5
>>>>>>>   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%,
>>>>>>> =64=0.0%
>>>>>>>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>>>>>>> =64=0.0%
>>>>>>>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%,
>>>>>>> 64=0.0%, =64=0.0%
>>>>>>>      issued    : total=r=0/w=6959/d=0, short=r=0/w=0/d=0,
>>>>>> drop=r=0/w=0/d=0
>>>>>>>      latency   : target=0, window=0, percentile=100.00%, depth=1
>>>>>>>
>>>>>>> I've checked via the admin socket and the prefer wal option is set to 8192.
>>>>>>>
>>>>>>> Random capture of iostat
>>>>>>>
>>>>>>> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz
>>>>>> await r_await w_await  svctm  %util
>>>>>>> sda               0.00    77.00    0.00    5.50     0.00   338.00   122.91     0.05   11.64
>>>>>> 0.00   11.64   8.73   4.80
>>>>>>> sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00
>>>>>> 0.00    0.00   0.00   0.00
>>>>>>> sdc               0.00     0.00    0.00  151.50     0.00   302.00     3.99     0.54    3.59
>>>>>> 0.00    3.59   3.58  54.20
>>>>>>> sdd               0.00     0.00    0.00 1474.00     0.00  4008.00     5.44     0.09    0.06
>>>>>> 0.00    0.06   0.06   9.20
>>>>>>> sde               0.00     0.00    0.00  217.00     0.00   434.00     4.00     0.91    4.20
>>>>>> 0.00    4.20   4.20  91.20
>>>>>>> sdf               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00
>>>>>> 0.00    0.00   0.00   0.00
>>>>>>> sdg               0.00     0.00    0.00   66.50     0.00   134.00     4.03     0.18    2.68
>>>>>> 0.00    2.68   2.68  17.80
>>>>>>> sdh               0.00     0.00    0.00  217.00     0.00   434.00     4.00     0.80    3.71
>>>>>> 0.00    3.71   3.71  80.40
>>>>>>> sdi               0.00     0.00    0.00 1134.00     0.00  3082.00     5.44     0.09    0.08
>>>>>> 0.00    0.08   0.07   8.40
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Sage Weil [mailto:sweil@redhat.com]
>>>>>>> Sent: 01 February 2017 15:34
>>>>>>> To: nick@fisk.me.uk
>>>>>>> Cc: ceph-devel@vger.kernel.org
>>>>>>> Subject: bluestore prefer wal size
>>>>>>>
>>>>>>> Hey Nick,
>>>>>>>
>>>>>>> I've updated/improved the prefer wal size PR (that sends small
>>>>>>> writes through the wal).  See
>>>>>>>
>>>>>>> 	https://github.com/ceph/ceph/pull/13217
>>>>>>>
>>>>>>> if you want to try it out.
>>>>>>>
>>>>>>> Thanks!
>>>>>>> sage
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>


* Re: bluestore prefer wal size
       [not found]                     ` <5ed3f3b2.ADsAAGRwZ_AAAAAAAAAAAFPDZXcAADNJBWwAAAAAAACRXwBYlFIS@mailjet.com>
@ 2017-02-03 15:20                       ` Mark Nelson
       [not found]                         ` <ba5b6d21.ADsAAGSJJ-oAAAAAAAAAAFJy4NwAADNJBWwAAAAAAACRXwBYlKUc@mailjet.com>
  0 siblings, 1 reply; 10+ messages in thread
From: Mark Nelson @ 2017-02-03 15:20 UTC (permalink / raw)
  To: nick, 'Sage Weil'; +Cc: ceph-devel

Hi Nick,

So I'm still catching up to your testing, but last night I ran through 
a number of iodepth=1 fio rbd tests with a single OSD and a single 
client:

https://drive.google.com/uc?export=download&id=0B2gTBZrkrnpZWE85OFI3Q2xQZ00

I tested HDD, HDD+NVMe, and NVMe configurations, comparing filestore 
against bluestore with 1k and 16k prefer wal sizes.  I believe I'm 
seeing results similar to yours.  On NVMe the two backends perform 
about the same, likely because of the raw throughput the NVMe drives 
can sustain, but on HDD bluestore is doing quite a bit worse than 
filestore for this specific use case.  During the bluestore HDD tests 
I watched disk access, and it appears we are saturating the disk with 
small writes despite the low client performance.  I'll be digging into 
this more today, but I wanted to send an update to let you know I 
believe I've reproduced your results and am looking into it.

Thanks!
Mark
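
As a rough cross-check on the amplification Nick reported, the figures he
quoted (fio at ~476KB/s client bandwidth, sdd/sdi as the SSDs) can be run
through a quick sketch.  This is illustrative only: the wkB/s columns are
taken at face value and the 3x replication factor is an assumption, since
the pool size isn't stated in the thread.

```python
# Back-of-envelope SSD write amplification from the iostat/fio numbers
# quoted earlier in the thread (illustrative; units assumed KB/s).
client_kbs = 476.0                    # fio client write bandwidth
ssd_kbs = 4008.0 + 3082.0             # wkB/s on the SSDs (sdd + sdi)

replicas = 3                          # assumed pool size (not stated)
expected_kbs = client_kbs * replicas  # client data fanned out to OSDs

ssd_amplification = ssd_kbs / expected_kbs
print(f"SSD write amplification vs replicated client IO: "
      f"{ssd_amplification:.1f}x")    # ~5x under these assumptions
```

Comparing against the unreplicated client bandwidth instead gives ~15x,
so the "region of about 10x" Nick mentions is plausible either way.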

On 02/03/2017 03:49 AM, Nick Fisk wrote:
>
>
>> -----Original Message-----
>> From: Nick Fisk [mailto:nick@fisk.me.uk]
>> Sent: 02 February 2017 21:15
>> To: 'Mark Nelson' <mnelson@redhat.com>; 'Sage Weil' <sage@newdream.net>
>> Cc: ceph-devel@vger.kernel.org
>> Subject: RE: bluestore prefer wal size
>>
>> Hi Mark,
>>> -----Original Message-----
>>> From: Mark Nelson [mailto:mnelson@redhat.com]
>>> Sent: 02 February 2017 20:17
>>> To: nick@fisk.me.uk; 'Sage Weil' <sage@newdream.net>
>>> Cc: ceph-devel@vger.kernel.org
>>> Subject: Re: bluestore prefer wal size
>>>
>>> Hi Nick,
>>>
>>> On 02/02/2017 08:46 AM, Nick Fisk wrote:
>>>> Further update
>>>>
>>>> If I do random writes instead of sequential, I see the "defering
>>>> small write
>>> via wal" action. But from the do_alloc_write function not small write.
>>>>  Ie. log entry is:
>>>> _do_alloc_write defering small 0x1000 write via wal
>>>
>>> I just ran some tests using the branch with 4k random and sequential
>>> writes against a single OSD.  I see deferring small write via wal from
>>> do_alloc_write for both sequential and random when
>>> bluestore_prefer_wal_size is set to be larger than the IO size.
>>> Otherwise, I don't see the messages.  I am also preconditioning with
>>> 4MB writes.
>>>
>>> So far I've just been testing on NVMe, but I should be able to do some
>>> mixed tests soon to see if there are any interesting performance implications.
>>
>> I've just looked at the RBD I ran those random tests on and I realized that it was only about 90% pre-conditioned due to me leaving the
>> time_based fio option on. After filling completely with writes, now both random and sequential are doing _do_write_small writes and
>> no deferring wal writes.
>>
>> From looking through the code today, as far as I understand it
>>
>> Write_small = write to an existing object under min_alloc_size
>> Write_big = write to an existing object over min_alloc_size
>> Alloc_write = write to a new object
>>
>> Is that correct?
>>
>> If it is, I'm interested in why you are seeing alloc_writes, as I don't see them if the RBD is fully pre-conditioned? Unless there is
>> something changing with the min_alloc_size as you are using non rotational media, which is causing this change?
>>
>> I'm adding a few extra dout's to the write_small function to try and see what's going on, hopefully I will know more tomorrow.
>>
>
> Morning,
>
> So, results are in. I added some debugging before the IF statement (https://github.com/liewegas/ceph/blob/944b4d3a3869a472c1e66a4fa73c12766c7801ac/src/os/bluestore/BlueStore.cc#L7885)
>
> So the following section now looks like below:
>
>     dout(20) << __func__ << "  before IF 0x" << std::hex << b_off << "~" << b_len << " - " << b->get_blob().get_ondisk_length() << " - " << b->get_blob().is_unused(b_off, b_len) << " - " << b->get_blob().is_allocated(b_off, b_len) << " - " << dendl;
>     if ((b_off % chunk_size == 0 && b_len % chunk_size == 0) &&
>         b->get_blob().get_ondisk_length() >= b_off + b_len &&
>         b->get_blob().is_unused(b_off, b_len) &&
>         b->get_blob().is_allocated(b_off, b_len)) {
>
> And the output was
>
> _do_write_small  before IF 0xf000~1000 - 10000 - 0 - 1 -
>
> So the " b->get_blob().is_unused(b_off, b_len)" call is returning false, which stops it going into the IF block and doing the prefer wal write. Is that expected?
>
> Nick
>
>> Nick
>>
>>>
>>> Mark
>>>
>>>>
>>>> And that’s for a 4kb random write on an RBD that has been fully
>>>> conditioned via fio 4MB writes
>>>>
>>>> So it's something in the sequential nature of the test which trips
>>>> it up as I don't see any sign of deferred writes
>>>>
>>>> From looking at the code in the small write function, I can only see
>>>> two possibilities that would stop it from reaching the prefer wal IF
>>>> statement
>>>>
>>>>   while (ep != o->extent_map.extent_map.end()) {
>>>>     if (ep->blob_start() >= end) {
>>>>       break;
>>>>
>>>> and
>>>>
>>>>   if ((b_off % chunk_size == 0 && b_len % chunk_size == 0) &&
>>>> 	b->get_blob().get_ondisk_length() >= b_off + b_len &&
>>>> 	b->get_blob().is_unused(b_off, b_len) &&
>>>> 	b->get_blob().is_allocated(b_off, b_len)) {
>>>>
>>>> Unfortunately, that’s about as far as I have got, before my
>>>> knowledge of
>>> Bluestore becomes a limitation.
>>>>
>>>> Don't know if that triggers any thoughts?
>>>>
>>>>>
>>>>> 2017-02-02 10:39:31.988782 7ff7282ff700 10
>>>>> bluestore(/var/lib/ceph/osd/ceph-4) queue_transactions existing
>>>>> 0x56102a5a4900
>>>>> osr(9.73 0x56102a5db1a0)
>>>>> 2017-02-02 10:39:31.988786 7ff7282ff700 20
>>>>> bluestore(/var/lib/ceph/osd/ceph-4) _txc_create osr 0x56102a5a4900
>>>>> =
>>>>> 0x56102ab4be40 seq 260
>>>>> 2017-02-02 10:39:31.988797 7ff7282ff700 15
>>>>> bluestore(/var/lib/ceph/osd/ceph-4) _setattrs 9.73_head
>>>>> #9:cff290b9:::rbd_data.36646a74b0dc51.0000000000000001:head# 2 key
>>> s
>>>>> 2017-02-02 10:39:31.988802 7ff7282ff700 10
>>>>> bluestore(/var/lib/ceph/osd/ceph-4) _setattrs 9.73_head
>>>>> #9:cff290b9:::rbd_data.36646a74b0dc51.0000000000000001:head# 2 key
>>> s
>>>>> = 0
>>>>> 2017-02-02 10:39:31.988806 7ff7282ff700 15
>>>>> bluestore(/var/lib/ceph/osd/ceph-4) _write 9.73_head
>>>>> #9:cff290b9:::rbd_data.36646a74b0dc51.0000000000000001:head#
>>> 0xfb000~
>>>>> 1000
>>>>> 2017-02-02 10:39:31.988809 7ff7282ff700 20
>>>>> bluestore(/var/lib/ceph/osd/ceph-4) _do_write
>>>>> #9:cff290b9:::rbd_data.36646a74b0dc51.0000000000000001:head#
>>>>> 0xfb000~1000 - have 0x400000 (4194304) bytes fadvise_flags 0x0
>>>>> 2017-02-02 10:39:31.988812 7ff7282ff700 20
>>>>> bluestore(/var/lib/ceph/osd/ceph-4) _do_write prefer csum_order 12
>>>>> target_blob_size
>>>>> 0x80000
>>>>> 2017-02-02 10:39:31.988814 7ff7282ff700 10
>>>>> bluestore(/var/lib/ceph/osd/ceph-4) _do_write_small 0xfb000~1000
>>>>> 2017-02-02 10:39:31.988816 7ff7282ff700 20
>>>>> bluestore(/var/lib/ceph/osd/ceph-4) _do_write_small considering
>>>>> Blob(0x56102aac05a0 blob([0x256480000~80000] mutable+csum
>>>>> crc32c/0x1000) ref_map(0x0~80000=1) SharedBlob(0x56102aa76260))
>>>>> bstart 0x80000
>>>>> 2017-02-02 10:39:31.988821 7ff7282ff700 20
>>>>> bluestore.BufferSpace(0x56102aa762b8 in 0x56102a1337a0) _discard
>>>>> 0x7b000~1000
>>>>> 2017-02-02 10:39:31.988827 7ff7282ff700 20
>>>>> bluestore(/var/lib/ceph/osd/ceph-4) _do_write_small  wal write
>>>>> 0x7b000~1000 of mutable Blob(0x56102aac05a0
>>> blob([0x256480000~80000]
>>>>> mutable+csum crc32c/0x1000) ref_map(0x0~80000=1)
>>>>> SharedBlob(0x56102aa76260)) at [0x2564fb000~1000]
>>>>> 2017-02-02 10:39:31.988832 7ff7282ff700 20
>>>>> bluestore(/var/lib/ceph/osd/ceph-4) _do_write_small  lex
>>>>> 0xfb000~1000: 0x7b000~1000
>>>>> Blob(0x56102aac05a0 blob([0x256480000~80000] mutable+csum
>>>>> crc32c/0x1000)
>>>>> ref_map(0x0~7b000=1,0x7b000~1000=2,0x7c000~4000=1)
>>>>> SharedBlob(0x56102aa76260))
>>>>> 2017-02-02 10:39:31.988836 7ff7282ff700 20
>>>>> bluestore(/var/lib/ceph/osd/ceph-4) _do_alloc_write txc
>>>>> 0x56102ab4be40 0 blobs
>>>>> 2017-02-02 10:39:31.988837 7ff7282ff700 10 bitmapalloc:reserve
>>>>> instance 94627427524464 num_used 345738 total 7627973
>>>>> 2017-02-02 10:39:31.988840 7ff7282ff700 20
>>>>> bluestore(/var/lib/ceph/osd/ceph-4) _wctx_finish lex_old
>>>>> 0xfb000~1000: 0x7b000~1000
>>>>> Blob(0x56102aac05a0 blob([0x256480000~80000] mutable+csum
>>>>> crc32c/0x1000)
>>>>> ref_map(0x0~7b000=1,0x7b000~1000=2,0x7c000~4000=1)
>>>>> SharedBlob(0x56102aa76260))
>>>>> 2017-02-02 10:39:31.988844 7ff7282ff700 20
>>>>> bluestore.extentmap(0x56102a8ee3f0) compress_extent_map
>>> 0xfb000~1000
>>>>> next shard
>>>>> 0x100000 merging 0x80000~7b000: 0x0~7b000 Blob(0x56102aac05a0
>>>>> blob([0x256480000~80000] mutable+csum crc32c/0x1000)
>>>>> ref_map(0x0~80000=1) SharedBlob(0x56102aa76260)) and 0xfb000~1000:
>>>>> 0x7b000~1000 Blob(0x56102aac05a0 blob([0x256480000~80000]
>>>>> mutable+csum crc32c/0x1000) ref_map(0x0~80000=1)
>>>>> SharedBlob(0x56102aa76260))
>>>>> 2017-02-02 10:39:31.988850 7ff7282ff700 20
>>>>> bluestore.extentmap(0x56102a8ee3f0) compress_extent_map
>>> 0xfb000~1000
>>>>> next shard
>>>>> 0x100000 merging 0x80000~7c000: 0x0~7c000 Blob(0x56102aac05a0
>>>>> blob([0x256480000~80000] mutable+csum crc32c/0x1000)
>>>>> ref_map(0x0~80000=1) SharedBlob(0x56102aa76260)) and 0xfc000~4000:
>>>>> 0x7c000~4000 Blob(0x56102aac05a0 blob([0x256480000~80000]
>>>>> mutable+csum crc32c/0x1000) ref_map(0x0~80000=1)
>>>>> SharedBlob(0x56102aa76260))
>>>>> 2017-02-02 10:39:31.988868 7ff7282ff700 20
>>>>> bluestore.extentmap(0x56102a8ee3f0) dirty_range mark shard 0x80000
>>>>> dirty
>>>>> 2017-02-02 10:39:31.988870 7ff7282ff700 10
>>>>> bluestore(/var/lib/ceph/osd/ceph-4) _write 9.73_head
>>>>> #9:cff290b9:::rbd_data.36646a74b0dc51.0000000000000001:head#
>>>>> 0xfb000~1000 = 0
>>>>> 2017-02-02 10:39:31.988875 7ff7282ff700 15
>>>>> bluestore(/var/lib/ceph/osd/ceph-4) _omap_setkeys 9.73_head
>>>>> #9:ce000000::::head#
>>>>> 2017-02-02 10:39:31.988882 7ff7282ff700 10
>>>>> bluestore(/var/lib/ceph/osd/ceph-4) _omap_setkeys 9.73_head
>>>>> #9:ce000000::::head# = 0
>>>>> 2017-02-02 10:39:31.988885 7ff7282ff700 20
>>>>> bluestore(/var/lib/ceph/osd/ceph-4) _txc_write_nodes txc
>>>>> 0x56102ab4be40 onodes
>>>>> 0x56102a8ee300 shared_blobs
>>>>> 2017-02-02 10:39:31.988894 7ff7282ff700 20
>>>>> bluestore.extentmap(0x56102a8ee3f0) update shard 0x80000 is 531
>>>>> bytes
>>> (was 531) from 1 extents
>>>>> 2017-02-02 10:39:31.988902 7ff7282ff700 20
>>> bluestore(/var/lib/ceph/osd/ceph-4)   onode
>>>>> #9:cff290b9:::rbd_data.36646a74b0dc51.0000000000000001:head# is 424
>>>>> (422 bytes onode + 2 bytes spanning blobs + 0 bytes inline
>>>>> extents)
>>>>> 2017-02-02 10:39:31.988914 7ff7282ff700 10
>>>>> bluestore(/var/lib/ceph/osd/ceph-4) _txc_state_proc txc
>>>>> 0x56102ab4be40 prepare
>>>>> 2017-02-02 10:39:31.988916 7ff7282ff700 20
>>>>> bluestore(/var/lib/ceph/osd/ceph-4) _txc_finish_io 0x56102ab4be40
>>>>> 2017-02-02 10:39:31.988917 7ff7282ff700 10
>>>>> bluestore(/var/lib/ceph/osd/ceph-4) _txc_state_proc txc
>>>>> 0x56102ab4be40 io_done
>>>>> 2017-02-02 10:39:31.988918 7ff7282ff700 20
>>>>> bluestore.BufferSpace(0x56102aa762b8 in 0x56102a1337a0)
>>>>> finish_write
>>>>> buffer(0x56102adb1830 space 0x56102aa762b8 0x7b000~1000 writing
>>>>> nocache)
>>>>> 2017-02-02 10:39:31.988939 7ff72e30b700 20
>>>>> bluestore(/var/lib/ceph/osd/ceph-4) _kv_sync_thread wake
>>>>> 2017-02-02 10:39:31.988941 7ff72e30b700 20
>>>>> bluestore(/var/lib/ceph/osd/ceph-4) _kv_sync_thread committing 1
>>>>> submitting 1 cleaning 0
>>>>> 2017-02-02 10:39:31.988952 7ff72e30b700 20
>>>>> bluestore(/var/lib/ceph/osd/ceph-4) _txc_finalize_kv txc
>>>>> 0x56102ab4be40 allocated 0x[] released 0x[]
>>>>> 2017-02-02 10:39:31.988999 7ff72e30b700 10
>>>>> bluestore(/var/lib/ceph/osd/ceph-4) _balance_bluefs_freespace
>>>>> bluefs
>>>>> 9535 M free
>>>>> (0.999896) bluestore 222 G free (0.954675), bluefs_ratio 0.0402143
>>>>> 2017-02-02 10:39:31.989400 7ff72e30b700 20
>>>>> bluestore(/var/lib/ceph/osd/ceph-4) _kv_sync_thread committed 1
>>>>> cleaned 0 in 0.000456
>>>>> 2017-02-02 10:39:31.989405 7ff72e30b700 10
>>>>> bluestore(/var/lib/ceph/osd/ceph-4) _txc_state_proc txc
>>>>> 0x56102ab4be40 kv_submitted
>>>>> 2017-02-02 10:39:31.989407 7ff72e30b700 20
>>>>> bluestore(/var/lib/ceph/osd/ceph-4) _txc_finish_kv txc
>>>>> 0x56102ab4be40
>>>>> 2017-02-02 10:39:31.989413 7ff72e30b700 20
>>>>> bluestore(/var/lib/ceph/osd/ceph-4) _wal_apply txc 0x56102ab4be40
>>>>> seq
>>>>> 255
>>>>> 2017-02-02 10:39:31.989414 7ff72e30b700 20
>>>>> bluestore(/var/lib/ceph/osd/ceph-4) _do_wal_op write
>>>>> [0x2564fb000~1000]
>>>>>
>>>>>
>>>>>>
>>>>>> sage
>>>>>>
>>>>>>>
>>>>>>> sage
>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> sage
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Nick
>>>>>>>>>>
>>>>>>>>>> -----Original Message-----
>>>>>>>>>> From: Nick Fisk [mailto:nick@fisk.me.uk]
>>>>>>>>>> Sent: 01 February 2017 19:03
>>>>>>>>>> To: 'Sage Weil' <sweil@redhat.com>; 'Mark Nelson'
>>>>>>>>> <mnelson@redhat.com>
>>>>>>>>>> Cc: ceph-devel@vger.kernel.org
>>>>>>>>>> Subject: RE: bluestore prefer wal size
>>>>>>>>>>
>>>>>>>>>> Hi Sage,
>>>>>>>>>>
>>>>>>>>>> First results not looking good. It looks like write IO to the
>>>>>>>>>> SSD's (sdd and sdi)
>>>>>>>>> is now massively amplified, by somewhere in the region of about
>>>>>>>>> 10x. But I'm still only getting around 100 4kb's seq write iops
>>>>>>>>> from the fio client. This is in comparison to 2500-3000 iops on
>>>>>>>>> the SSD's and ~200 iops per spinning disk (sdc,sde,sdg,sdh).
>>>>>>>>>>
>>>>>>>>>> rbd engine: RBD version: 0.1.11
>>>>>>>>>> Jobs: 1 (f=1): [W(1)] [100.0% done] [0KB/476KB/0KB /s]
>>>>>>>>>> [0/119/0 iops] [eta 00m:00s]
>>>>>>>>>> rbd_iodepth32: (groupid=0, jobs=1): err= 0: pid=31171: Wed Feb
>>>>>>>>>> 1
>>>>>>>>> 18:56:51 2017
>>>>>>>>>>   write: io=27836KB, bw=475020B/s, iops=115, runt= 60006msec
>>>>>>>>>>     slat (usec): min=6, max=142, avg=10.98, stdev= 8.38
>>>>>>>>>>     clat (msec): min=1, max=271, avg= 8.61, stdev= 3.63
>>>>>>>>>>      lat (msec): min=1, max=271, avg= 8.62, stdev= 3.63
>>>>>>>>>>     clat percentiles (msec):
>>>>>>>>>>      |  1.00th=[    8],  5.00th=[    9], 10.00th=[    9], 20.00th=[    9],
>>>>>>>>>>      | 30.00th=[    9], 40.00th=[    9], 50.00th=[    9], 60.00th=[    9],
>>>>>>>>>>      | 70.00th=[    9], 80.00th=[    9], 90.00th=[    9], 95.00th=[    9],
>>>>>>>>>>      | 99.00th=[   17], 99.50th=[   25], 99.90th=[   33], 99.95th=[   36],
>>>>>>>>>>      | 99.99th=[  273]
>>>>>>>>>>     bw (KB  /s): min=  191, max=  480, per=100.00%,
>>>>>>>>>> avg=464.18,
>>> stdev=34.60
>>>>>>>>>>     lat (msec) : 2=0.01%, 4=0.04%, 10=97.69%, 20=1.55%, 50=0.69%
>>>>>>>>>>     lat (msec) : 500=0.01%
>>>>>>>>>>   cpu          : usr=0.13%, sys=0.12%, ctx=7224, majf=0, minf=5
>>>>>>>>>>   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%,
>>> 32=0.0%,
>>>>>>>>>> =64=0.0%
>>>>>>>>>>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%,
>>> 64=0.0%,
>>>>>>>>>> =64=0.0%
>>>>>>>>>>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%,
>>>>>>>>>> 64=0.0%, =64=0.0%
>>>>>>>>>>      issued    : total=r=0/w=6959/d=0, short=r=0/w=0/d=0,
>>>>>>>>> drop=r=0/w=0/d=0
>>>>>>>>>>      latency   : target=0, window=0, percentile=100.00%, depth=1
>>>>>>>>>>
>>>>>>>>>> I've checked via the admin socket and the prefer wal option is
>>>>>>>>>> set to
>>> 8192.
>>>>>>>>>>
>>>>>>>>>> Random capture of iostat
>>>>>>>>>>
>>>>>>>>>> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz
>>> avgqu-sz
>>>>>>>>> await r_await w_await  svctm  %util
>>>>>>>>>> sda               0.00    77.00    0.00    5.50     0.00   338.00   122.91     0.05
>>> 11.64
>>>>>>>>> 0.00   11.64   8.73   4.80
>>>>>>>>>> sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00
>>> 0.00
>>>>>>>>> 0.00    0.00   0.00   0.00
>>>>>>>>>> sdc               0.00     0.00    0.00  151.50     0.00   302.00     3.99     0.54
>>> 3.59
>>>>>>>>> 0.00    3.59   3.58  54.20
>>>>>>>>>> sdd               0.00     0.00    0.00 1474.00     0.00  4008.00     5.44     0.09
>>> 0.06
>>>>>>>>> 0.00    0.06   0.06   9.20
>>>>>>>>>> sde               0.00     0.00    0.00  217.00     0.00   434.00     4.00     0.91
>>> 4.20
>>>>>>>>> 0.00    4.20   4.20  91.20
>>>>>>>>>> sdf               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00
>>> 0.00
>>>>>>>>> 0.00    0.00   0.00   0.00
>>>>>>>>>> sdg               0.00     0.00    0.00   66.50     0.00   134.00     4.03     0.18
>>> 2.68
>>>>>>>>> 0.00    2.68   2.68  17.80
>>>>>>>>>> sdh               0.00     0.00    0.00  217.00     0.00   434.00     4.00     0.80
>>> 3.71
>>>>>>>>> 0.00    3.71   3.71  80.40
>>>>>>>>>> sdi               0.00     0.00    0.00 1134.00     0.00  3082.00     5.44     0.09
>>> 0.08
>>>>>>>>> 0.00    0.08   0.07   8.40
>>>>>>>>>>
>>>>>>>>>> -----Original Message-----
>>>>>>>>>> From: Sage Weil [mailto:sweil@redhat.com]
>>>>>>>>>> Sent: 01 February 2017 15:34
>>>>>>>>>> To: nick@fisk.me.uk
>>>>>>>>>> Cc: ceph-devel@vger.kernel.org
>>>>>>>>>> Subject: bluestore prefer wal size
>>>>>>>>>>
>>>>>>>>>> Hey Nick,
>>>>>>>>>>
>>>>>>>>>> I've updated/improved the prefer wal size PR (that sends small
>>>>>>>>>> writes through the wal).  See
>>>>>>>>>>
>>>>>>>>>> 	https://github.com/ceph/ceph/pull/13217
>>>>>>>>>>
>>>>>>>>>> if you want to try it out.
>>>>>>>>>>
>>>>>>>>>> Thanks!
>>>>>>>>>> sage
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>
>
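
For anyone following the gate Nick instrumented, a minimal Python mock of
the quoted C++ condition makes the failure mode visible.  The parameters
stand in for the Blob calls, and chunk_size is assumed to be 0x1000; this
is a sketch of the logic, not the real API:

```python
# Mock of the _do_write_small gate discussed above.  The last three
# parameters stand in for get_ondisk_length()/is_unused()/is_allocated().
def can_defer_via_wal(b_off, b_len, chunk_size,
                      ondisk_length, unused, allocated):
    return (b_off % chunk_size == 0 and b_len % chunk_size == 0
            and ondisk_length >= b_off + b_len
            and unused
            and allocated)

# Values from Nick's trace: "before IF 0xf000~1000 - 10000 - 0 - 1"
ok = can_defer_via_wal(b_off=0xf000, b_len=0x1000, chunk_size=0x1000,
                       ondisk_length=0x10000, unused=False, allocated=True)
print(ok)  # False: only the is_unused() test fails, so the write falls
           # through to the non-deferred small-write path
```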

^ permalink raw reply	[flat|nested] 10+ messages in thread

* RE: bluestore prefer wal size
       [not found]                         ` <ba5b6d21.ADsAAGSJJ-oAAAAAAAAAAFJy4NwAADNJBWwAAAAAAACRXwBYlKUc@mailjet.com>
@ 2017-03-06 14:59                           ` Sage Weil
  2017-03-06 15:32                             ` Mark Nelson
  0 siblings, 1 reply; 10+ messages in thread
From: Sage Weil @ 2017-03-06 14:59 UTC (permalink / raw)
  To: Nick Fisk; +Cc: 'Mark Nelson', ceph-devel


On Fri, 3 Feb 2017, Nick Fisk wrote:
> Hi Mark,
> 
> > -----Original Message-----
> > From: Mark Nelson [mailto:mnelson@redhat.com]
> > Sent: 03 February 2017 15:20
> > To: nick@fisk.me.uk; 'Sage Weil' <sage@newdream.net>
> > Cc: ceph-devel@vger.kernel.org
> > Subject: Re: bluestore prefer wal size
> > 
> > Hi Nick,
> > 
> > So I'm still catching up to your testing, but last night I ran through a number of tests with a single OSD, a single client, and iodepth=1 fio
> > rbd tests:
> > 
> > https://drive.google.com/uc?export=download&id=0B2gTBZrkrnpZWE85OFI3Q2xQZ00
> > 
> > I tested HDD, HDD+NVMe, and NVMe configurations looking at filestore, bluestore with 1k and 16k prefer wal sizes.  I believe I'm
> > seeing similar results to what you saw.  On NVMe we are pretty similar likely due to the raw backend throughput the NVMe drives can
> > maintain but on HDD, bluestore is doing quite a bit worse than filestore for this specific use case.  During the bluestore HDD tests, I
> > watched disk access and it appears we are saturating the disk with small writes despite the low client performance.  I'll be digging into
> > this more today but I wanted to send out an update to let you know I believe I've reproduced your results and am looking into it.
> 
> Glad to hear that you can reproduce. After debugging that IF statement, 
> I think I'm at the point where I am out of my depth. But if there is 
> anything I can do, let me know.

I finally got my dev box HDD sorted out and retested this.  The IOPS were 
being halved because bluestore was triggering a rocksdb commit (and the 
associated latency) just to retire the WAL record.  I rebased and fixed it 
so that the WAL entries are deleted lazily (in the next commit round) and 
it's 2x faster than before.  I'm getting ~110 IOPS with rados bench 4k 
writes with queue depth 1 (6TB 7200rpm WD black).  That's in the 
neighborhood of what I'd expect from a spinner... it's basically seeking 
back and forth between two write positions (the rocksdb log and the new 
object data where the allocator position is).  We could probably bring 
this up a bit by batching the WAL work on HDD (build up several IOs' 
worth and dispatch them all at once to reduce seeks).  That'll take a 
bit more work, though.

With WAL on an NVMe, I get about ~450 IOPS.

Want to give it a try?
sage
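
A crude latency model shows why ~110 IOPS is about what a 7200rpm drive
can do when consecutive writes alternate between two regions (rocksdb log
and object data).  The seek figure below is a generic assumption, not a
measurement of the WD drive:

```python
# qd=1 write IOPS when every write lands in the "other" region and so
# pays a seek plus average rotational latency (generic 7200rpm figures).
rpm = 7200
half_rotation_ms = (60_000 / rpm) / 2      # ~4.17 ms rotational latency
seek_ms = 4.5                              # assumed short seek between regions

ms_per_write = seek_ms + half_rotation_ms  # each write changes region
print(f"modelled qd=1 write IOPS: {1000 / ms_per_write:.0f}")  # ~115
```

Batching several WAL records per seek, as suggested above, amortizes the
trip back to the log region, which is why it should help on HDD.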


 > 
> > 
> > Thanks!
> > Mark
> > 
> > On 02/03/2017 03:49 AM, Nick Fisk wrote:
> > >
> > >
> > >> -----Original Message-----
> > >> From: Nick Fisk [mailto:nick@fisk.me.uk]
> > >> Sent: 02 February 2017 21:15
> > >> To: 'Mark Nelson' <mnelson@redhat.com>; 'Sage Weil'
> > >> <sage@newdream.net>
> > >> Cc: ceph-devel@vger.kernel.org
> > >> Subject: RE: bluestore prefer wal size
> > >>
> > >> Hi Mark,
> > >>> -----Original Message-----
> > >>> From: Mark Nelson [mailto:mnelson@redhat.com]
> > >>> Sent: 02 February 2017 20:17
> > >>> To: nick@fisk.me.uk; 'Sage Weil' <sage@newdream.net>
> > >>> Cc: ceph-devel@vger.kernel.org
> > >>> Subject: Re: bluestore prefer wal size
> > >>>
> > >>> Hi Nick,
> > >>>
> > >>> On 02/02/2017 08:46 AM, Nick Fisk wrote:
> > >>>> Further update
> > >>>>
> > >>>> If I do random writes instead of sequential, I see the "defering
> > >>>> small write
> > >>> via wal" action. But from the do_alloc_write function not small write.
> > >>>>  Ie. log entry is:
> > >>>> _do_alloc_write defering small 0x1000 write via wal
> > >>>
> > >>> I just ran some tests using the branch with 4k random and sequential
> > >>> writes against a single OSD.  I see deferring small write via wal
> > >>> from do_alloc_write for both sequential and random when
> > >>> bluestore_prefer_wal_size is set to be larger than the IO size.
> > >>> Otherwise, I don't see the messages.  I am also preconditioning with
> > >>> 4MB writes.
> > >>>
> > >>> So far I've just been testing on NVMe, but I should be able to do
> > >>> some mixed tests soon to see if there are any interesting performance implications.
> > >>
> > >> I've just looked at the RBD I ran those random tests on and I
> > >> realized that it was only about 90% pre-conditioned due to me leaving
> > >> the time_based fio option on. After filling completely with writes, now both random and sequential are doing _do_write_small
> > writes and no deferring wal writes.
> > >>
> > >> From looking through the code today, as far as I understand it
> > >>
> > >> Write_small = write to an existing object under min_alloc_size
> > >> Write_big = write to an existing object over min_alloc_size
> > >> Alloc_write = write to a new object
> > >>
> > >> Is that correct?
> > >>
> > >> If it is, I'm interested in why you are seeing alloc_writes, as I
> > >> don't see them if the RBD is fully pre-conditioned? Unless there is something changing with the min_alloc_size as you are using non
> > rotational media, which is causing this change?
> > >>
> > >> I'm adding a few extra dout's to the write_small function to try and see what's going on, hopefully I will know more tomorrow.
> > >>
> > >
> > > Morning,
> > >
> > > So, results are in. I added some debugging before the IF statement
> > > (https://github.com/liewegas/ceph/blob/944b4d3a3869a472c1e66a4fa73c127
> > > 66c7801ac/src/os/bluestore/BlueStore.cc#L7885)
> > >
> > > So the following section now looks like below:
> > >
> > >     dout(20) << __func__ << "  before IF 0x" << std::hex << b_off << "~" << b_len << " - " << b->get_blob().get_ondisk_length() << " -
> > " << b->get_blob().is_unused(b_off, b_len) << " - " << b->get_blob().is_allocated(b_off, b_len) << " - " << dendl;
> > >     if ((b_off % chunk_size == 0 && b_len % chunk_size == 0) &&
> > >         b->get_blob().get_ondisk_length() >= b_off + b_len &&
> > >         b->get_blob().is_unused(b_off, b_len) &&
> > >         b->get_blob().is_allocated(b_off, b_len)) {
> > >
> > > And the output was
> > >
> > > _do_write_small  before IF 0xf000~1000 - 10000 - 0 - 1 -
> > >
> > > So the " b->get_blob().is_unused(b_off, b_len)" call is returning false, which stops it going into the IF block and doing the prefer wal
> > write. Is that expected?
> > >
> > > Nick
> > >
> > >> Nick
> > >>
> > >>>
> > >>> Mark
> > >>>
> > >>>>
> > >>>> And that’s for a 4kb random write on an RBD that has been fully
> > >>>> conditioned via fio 4MB writes
> > >>>>
> > >>>> So it's something in the sequential nature of the test which trips
> > >>>> it up as I don't see any sign of deferred writes
> > >>>>
> > >>>> From looking at the code in the small write function, I can only
> > >>>> see two possibilities that would stop it from reaching the prefer
> > >>>> wal IF statement
> > >>>>
> > >>>>   while (ep != o->extent_map.extent_map.end()) {
> > >>>>     if (ep->blob_start() >= end) {
> > >>>>       break;
> > >>>>
> > >>>> and
> > >>>>
> > >>>>   if ((b_off % chunk_size == 0 && b_len % chunk_size == 0) &&
> > >>>> 	b->get_blob().get_ondisk_length() >= b_off + b_len &&
> > >>>> 	b->get_blob().is_unused(b_off, b_len) &&
> > >>>> 	b->get_blob().is_allocated(b_off, b_len)) {
> > >>>>
> > >>>> Unfortunately, that’s about as far as I have got, before my
> > >>>> knowledge of
> > >>> Bluestore becomes a limitation.
> > >>>>
> > >>>> Don't know if that triggers any thoughts?
> > >>>>
> > >>>>>
> > >>>>> 2017-02-02 10:39:31.988782 7ff7282ff700 10
> > >>>>> bluestore(/var/lib/ceph/osd/ceph-4) queue_transactions existing
> > >>>>> 0x56102a5a4900
> > >>>>> osr(9.73 0x56102a5db1a0)
> > >>>>> 2017-02-02 10:39:31.988786 7ff7282ff700 20
> > >>>>> bluestore(/var/lib/ceph/osd/ceph-4) _txc_create osr 0x56102a5a4900
> > >>>>> =
> > >>>>> 0x56102ab4be40 seq 260
> > >>>>> 2017-02-02 10:39:31.988797 7ff7282ff700 15
> > >>>>> bluestore(/var/lib/ceph/osd/ceph-4) _setattrs 9.73_head
> > >>>>> #9:cff290b9:::rbd_data.36646a74b0dc51.0000000000000001:head# 2 key
> > >>> s
> > >>>>> 2017-02-02 10:39:31.988802 7ff7282ff700 10
> > >>>>> bluestore(/var/lib/ceph/osd/ceph-4) _setattrs 9.73_head
> > >>>>> #9:cff290b9:::rbd_data.36646a74b0dc51.0000000000000001:head# 2 key
> > >>> s
> > >>>>> = 0
> > >>>>> 2017-02-02 10:39:31.988806 7ff7282ff700 15
> > >>>>> bluestore(/var/lib/ceph/osd/ceph-4) _write 9.73_head
> > >>>>> #9:cff290b9:::rbd_data.36646a74b0dc51.0000000000000001:head#
> > >>> 0xfb000~
> > >>>>> 1000
> > >>>>> 2017-02-02 10:39:31.988809 7ff7282ff700 20
> > >>>>> bluestore(/var/lib/ceph/osd/ceph-4) _do_write
> > >>>>> #9:cff290b9:::rbd_data.36646a74b0dc51.0000000000000001:head#
> > >>>>> 0xfb000~1000 - have 0x400000 (4194304) bytes fadvise_flags 0x0
> > >>>>> 2017-02-02 10:39:31.988812 7ff7282ff700 20
> > >>>>> bluestore(/var/lib/ceph/osd/ceph-4) _do_write prefer csum_order 12
> > >>>>> target_blob_size
> > >>>>> 0x80000
> > >>>>> 2017-02-02 10:39:31.988814 7ff7282ff700 10
> > >>>>> bluestore(/var/lib/ceph/osd/ceph-4) _do_write_small 0xfb000~1000
> > >>>>> 2017-02-02 10:39:31.988816 7ff7282ff700 20
> > >>>>> bluestore(/var/lib/ceph/osd/ceph-4) _do_write_small considering
> > >>>>> Blob(0x56102aac05a0 blob([0x256480000~80000] mutable+csum
> > >>>>> crc32c/0x1000) ref_map(0x0~80000=1) SharedBlob(0x56102aa76260))
> > >>>>> bstart 0x80000
> > >>>>> 2017-02-02 10:39:31.988821 7ff7282ff700 20
> > >>>>> bluestore.BufferSpace(0x56102aa762b8 in 0x56102a1337a0) _discard
> > >>>>> 0x7b000~1000
> > >>>>> 2017-02-02 10:39:31.988827 7ff7282ff700 20
> > >>>>> bluestore(/var/lib/ceph/osd/ceph-4) _do_write_small  wal write
> > >>>>> 0x7b000~1000 of mutable Blob(0x56102aac05a0
> > >>> blob([0x256480000~80000]
> > >>>>> mutable+csum crc32c/0x1000) ref_map(0x0~80000=1)
> > >>>>> SharedBlob(0x56102aa76260)) at [0x2564fb000~1000]
> > >>>>> 2017-02-02 10:39:31.988832 7ff7282ff700 20
> > >>>>> bluestore(/var/lib/ceph/osd/ceph-4) _do_write_small  lex
> > >>>>> 0xfb000~1000: 0x7b000~1000
> > >>>>> Blob(0x56102aac05a0 blob([0x256480000~80000] mutable+csum
> > >>>>> crc32c/0x1000)
> > >>>>> ref_map(0x0~7b000=1,0x7b000~1000=2,0x7c000~4000=1)
> > >>>>> SharedBlob(0x56102aa76260))
> > >>>>> 2017-02-02 10:39:31.988836 7ff7282ff700 20
> > >>>>> bluestore(/var/lib/ceph/osd/ceph-4) _do_alloc_write txc
> > >>>>> 0x56102ab4be40 0 blobs
> > >>>>> 2017-02-02 10:39:31.988837 7ff7282ff700 10 bitmapalloc:reserve
> > >>>>> instance 94627427524464 num_used 345738 total 7627973
> > >>>>> 2017-02-02 10:39:31.988840 7ff7282ff700 20
> > >>>>> bluestore(/var/lib/ceph/osd/ceph-4) _wctx_finish lex_old
> > >>>>> 0xfb000~1000: 0x7b000~1000
> > >>>>> Blob(0x56102aac05a0 blob([0x256480000~80000] mutable+csum
> > >>>>> crc32c/0x1000)
> > >>>>> ref_map(0x0~7b000=1,0x7b000~1000=2,0x7c000~4000=1)
> > >>>>> SharedBlob(0x56102aa76260))
> > >>>>> 2017-02-02 10:39:31.988844 7ff7282ff700 20
> > >>>>> bluestore.extentmap(0x56102a8ee3f0) compress_extent_map
> > >>> 0xfb000~1000
> > >>>>> next shard
> > >>>>> 0x100000 merging 0x80000~7b000: 0x0~7b000 Blob(0x56102aac05a0
> > >>>>> blob([0x256480000~80000] mutable+csum crc32c/0x1000)
> > >>>>> ref_map(0x0~80000=1) SharedBlob(0x56102aa76260)) and 0xfb000~1000:
> > >>>>> 0x7b000~1000 Blob(0x56102aac05a0 blob([0x256480000~80000]
> > >>>>> mutable+csum crc32c/0x1000) ref_map(0x0~80000=1)
> > >>>>> SharedBlob(0x56102aa76260))
> > >>>>> 2017-02-02 10:39:31.988850 7ff7282ff700 20
> > >>>>> bluestore.extentmap(0x56102a8ee3f0) compress_extent_map
> > >>> 0xfb000~1000
> > >>>>> next shard
> > >>>>> 0x100000 merging 0x80000~7c000: 0x0~7c000 Blob(0x56102aac05a0
> > >>>>> blob([0x256480000~80000] mutable+csum crc32c/0x1000)
> > >>>>> ref_map(0x0~80000=1) SharedBlob(0x56102aa76260)) and 0xfc000~4000:
> > >>>>> 0x7c000~4000 Blob(0x56102aac05a0 blob([0x256480000~80000]
> > >>>>> mutable+csum crc32c/0x1000) ref_map(0x0~80000=1)
> > >>>>> SharedBlob(0x56102aa76260))
> > >>>>> 2017-02-02 10:39:31.988868 7ff7282ff700 20
> > >>>>> bluestore.extentmap(0x56102a8ee3f0) dirty_range mark shard 0x80000
> > >>>>> dirty
> > >>>>> 2017-02-02 10:39:31.988870 7ff7282ff700 10
> > >>>>> bluestore(/var/lib/ceph/osd/ceph-4) _write 9.73_head
> > >>>>> #9:cff290b9:::rbd_data.36646a74b0dc51.0000000000000001:head#
> > >>>>> 0xfb000~1000 = 0
> > >>>>> 2017-02-02 10:39:31.988875 7ff7282ff700 15
> > >>>>> bluestore(/var/lib/ceph/osd/ceph-4) _omap_setkeys 9.73_head
> > >>>>> #9:ce000000::::head#
> > >>>>> 2017-02-02 10:39:31.988882 7ff7282ff700 10
> > >>>>> bluestore(/var/lib/ceph/osd/ceph-4) _omap_setkeys 9.73_head
> > >>>>> #9:ce000000::::head# = 0
> > >>>>> 2017-02-02 10:39:31.988885 7ff7282ff700 20
> > >>>>> bluestore(/var/lib/ceph/osd/ceph-4) _txc_write_nodes txc
> > >>>>> 0x56102ab4be40 onodes
> > >>>>> 0x56102a8ee300 shared_blobs
> > >>>>> 2017-02-02 10:39:31.988894 7ff7282ff700 20
> > >>>>> bluestore.extentmap(0x56102a8ee3f0) update shard 0x80000 is 531
> > >>>>> bytes
> > >>> (was 531) from 1 extents
> > >>>>> 2017-02-02 10:39:31.988902 7ff7282ff700 20
> > >>> bluestore(/var/lib/ceph/osd/ceph-4)   onode
> > >>>>> #9:cff290b9:::rbd_data.36646a74b0dc51.0000000000000001:head# is
> > >>>>> 424
> > >>>>> (422 bytes onode + 2 bytes spanning blobs + 0 bytes inline
> > >>>>> extents)
> > >>>>> 2017-02-02 10:39:31.988914 7ff7282ff700 10
> > >>>>> bluestore(/var/lib/ceph/osd/ceph-4) _txc_state_proc txc
> > >>>>> 0x56102ab4be40 prepare
> > >>>>> 2017-02-02 10:39:31.988916 7ff7282ff700 20
> > >>>>> bluestore(/var/lib/ceph/osd/ceph-4) _txc_finish_io 0x56102ab4be40
> > >>>>> 2017-02-02 10:39:31.988917 7ff7282ff700 10
> > >>>>> bluestore(/var/lib/ceph/osd/ceph-4) _txc_state_proc txc
> > >>>>> 0x56102ab4be40 io_done
> > >>>>> 2017-02-02 10:39:31.988918 7ff7282ff700 20
> > >>>>> bluestore.BufferSpace(0x56102aa762b8 in 0x56102a1337a0)
> > >>>>> finish_write
> > >>>>> buffer(0x56102adb1830 space 0x56102aa762b8 0x7b000~1000 writing
> > >>>>> nocache)
> > >>>>> 2017-02-02 10:39:31.988939 7ff72e30b700 20
> > >>>>> bluestore(/var/lib/ceph/osd/ceph-4) _kv_sync_thread wake
> > >>>>> 2017-02-02 10:39:31.988941 7ff72e30b700 20
> > >>>>> bluestore(/var/lib/ceph/osd/ceph-4) _kv_sync_thread committing 1
> > >>>>> submitting 1 cleaning 0
> > >>>>> 2017-02-02 10:39:31.988952 7ff72e30b700 20
> > >>>>> bluestore(/var/lib/ceph/osd/ceph-4) _txc_finalize_kv txc
> > >>>>> 0x56102ab4be40 allocated 0x[] released 0x[]
> > >>>>> 2017-02-02 10:39:31.988999 7ff72e30b700 10
> > >>>>> bluestore(/var/lib/ceph/osd/ceph-4) _balance_bluefs_freespace
> > >>>>> bluefs
> > >>>>> 9535 M free
> > >>>>> (0.999896) bluestore 222 G free (0.954675), bluefs_ratio 0.0402143
> > >>>>> 2017-02-02 10:39:31.989400 7ff72e30b700 20
> > >>>>> bluestore(/var/lib/ceph/osd/ceph-4) _kv_sync_thread committed 1
> > >>>>> cleaned 0 in 0.000456
> > >>>>> 2017-02-02 10:39:31.989405 7ff72e30b700 10
> > >>>>> bluestore(/var/lib/ceph/osd/ceph-4) _txc_state_proc txc
> > >>>>> 0x56102ab4be40 kv_submitted
> > >>>>> 2017-02-02 10:39:31.989407 7ff72e30b700 20
> > >>>>> bluestore(/var/lib/ceph/osd/ceph-4) _txc_finish_kv txc
> > >>>>> 0x56102ab4be40
> > >>>>> 2017-02-02 10:39:31.989413 7ff72e30b700 20
> > >>>>> bluestore(/var/lib/ceph/osd/ceph-4) _wal_apply txc 0x56102ab4be40
> > >>>>> seq
> > >>>>> 255
> > >>>>> 2017-02-02 10:39:31.989414 7ff72e30b700 20
> > >>>>> bluestore(/var/lib/ceph/osd/ceph-4) _do_wal_op write
> > >>>>> [0x2564fb000~1000]
> > >>>>>
> > >>>>>
> > >>>>>>
> > >>>>>> sage
> > >>>>>>
> > >>>>>>>
> > >>>>>>> sage
> > >>>>>>>
> > >>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> sage
> > >>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>> Nick
> > >>>>>>>>>>
> > >>>>>>>>>> -----Original Message-----
> > >>>>>>>>>> From: Nick Fisk [mailto:nick@fisk.me.uk]
> > >>>>>>>>>> Sent: 01 February 2017 19:03
> > >>>>>>>>>> To: 'Sage Weil' <sweil@redhat.com>; 'Mark Nelson'
> > >>>>>>>>> <mnelson@redhat.com>
> > >>>>>>>>>> Cc: ceph-devel@vger.kernel.org
> > >>>>>>>>>> Subject: RE: bluestore prefer wal size
> > >>>>>>>>>>
> > >>>>>>>>>> Hi Sage,
> > >>>>>>>>>>
> > >>>>>>>>>> First results not looking good. It looks like write IO to the
> > >>>>>>>>>> SSDs (sdd and sdi) is now massively amplified, somewhere in the
> > >>>>>>>>>> region of 10x. But I'm still only getting around 100 4 kB seq
> > >>>>>>>>>> write IOPS from the fio client, compared to 2500-3000 IOPS on
> > >>>>>>>>>> the SSDs and ~200 IOPS per spinning disk (sdc, sde, sdg, sdh).
> > >>>>>>>>>>
> > >>>>>>>>>> rbd engine: RBD version: 0.1.11
> > >>>>>>>>>> Jobs: 1 (f=1): [W(1)] [100.0% done] [0KB/476KB/0KB /s]
> > >>>>>>>>>> [0/119/0 iops] [eta 00m:00s]
> > >>>>>>>>>> rbd_iodepth32: (groupid=0, jobs=1): err= 0: pid=31171: Wed
> > >>>>>>>>>> Feb
> > >>>>>>>>>> 1
> > >>>>>>>>> 18:56:51 2017
> > >>>>>>>>>>   write: io=27836KB, bw=475020B/s, iops=115, runt= 60006msec
> > >>>>>>>>>>     slat (usec): min=6, max=142, avg=10.98, stdev= 8.38
> > >>>>>>>>>>     clat (msec): min=1, max=271, avg= 8.61, stdev= 3.63
> > >>>>>>>>>>      lat (msec): min=1, max=271, avg= 8.62, stdev= 3.63
> > >>>>>>>>>>     clat percentiles (msec):
> > >>>>>>>>>>      |  1.00th=[    8],  5.00th=[    9], 10.00th=[    9], 20.00th=[    9],
> > >>>>>>>>>>      | 30.00th=[    9], 40.00th=[    9], 50.00th=[    9], 60.00th=[    9],
> > >>>>>>>>>>      | 70.00th=[    9], 80.00th=[    9], 90.00th=[    9], 95.00th=[    9],
> > >>>>>>>>>>      | 99.00th=[   17], 99.50th=[   25], 99.90th=[   33], 99.95th=[   36],
> > >>>>>>>>>>      | 99.99th=[  273]
> > >>>>>>>>>>     bw (KB  /s): min=  191, max=  480, per=100.00%, avg=464.18, stdev=34.60
> > >>>>>>>>>>     lat (msec) : 2=0.01%, 4=0.04%, 10=97.69%, 20=1.55%, 50=0.69%
> > >>>>>>>>>>     lat (msec) : 500=0.01%
> > >>>>>>>>>>   cpu          : usr=0.13%, sys=0.12%, ctx=7224, majf=0, minf=5
> > >>>>>>>>>>   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
> > >>>>>>>>>>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> > >>>>>>>>>>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> > >>>>>>>>>>      issued    : total=r=0/w=6959/d=0, short=r=0/w=0/d=0,
> > >>>>>>>>> drop=r=0/w=0/d=0
> > >>>>>>>>>>      latency   : target=0, window=0, percentile=100.00%, depth=1
> > >>>>>>>>>>
> > >>>>>>>>>> I've checked via the admin socket and the prefer wal option
> > >>>>>>>>>> is set to 8192.
> > >>>>>>>>>>
> > >>>>>>>>>> Random capture of iostat
> > >>>>>>>>>>
> > >>>>>>>>>> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> > >>>>>>>>>> sda               0.00    77.00    0.00    5.50     0.00   338.00   122.91     0.05   11.64    0.00   11.64   8.73   4.80
> > >>>>>>>>>> sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
> > >>>>>>>>>> sdc               0.00     0.00    0.00  151.50     0.00   302.00     3.99     0.54    3.59    0.00    3.59   3.58  54.20
> > >>>>>>>>>> sdd               0.00     0.00    0.00 1474.00     0.00  4008.00     5.44     0.09    0.06    0.00    0.06   0.06   9.20
> > >>>>>>>>>> sde               0.00     0.00    0.00  217.00     0.00   434.00     4.00     0.91    4.20    0.00    4.20   4.20  91.20
> > >>>>>>>>>> sdf               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
> > >>>>>>>>>> sdg               0.00     0.00    0.00   66.50     0.00   134.00     4.03     0.18    2.68    0.00    2.68   2.68  17.80
> > >>>>>>>>>> sdh               0.00     0.00    0.00  217.00     0.00   434.00     4.00     0.80    3.71    0.00    3.71   3.71  80.40
> > >>>>>>>>>> sdi               0.00     0.00    0.00 1134.00     0.00  3082.00     5.44     0.09    0.08    0.00    0.08   0.07   8.40
> > >>>>>>>>>>
> > >>>>>>>>>> -----Original Message-----
> > >>>>>>>>>> From: Sage Weil [mailto:sweil@redhat.com]
> > >>>>>>>>>> Sent: 01 February 2017 15:34
> > >>>>>>>>>> To: nick@fisk.me.uk
> > >>>>>>>>>> Cc: ceph-devel@vger.kernel.org
> > >>>>>>>>>> Subject: bluestore prefer wal size
> > >>>>>>>>>>
> > >>>>>>>>>> Hey Nick,
> > >>>>>>>>>>
> > >>>>>>>>>> I've updated/improved the prefer wal size PR (that sends
> > >>>>>>>>>> small writes through the wal).  See
> > >>>>>>>>>>
> > >>>>>>>>>> 	https://github.com/ceph/ceph/pull/13217
> > >>>>>>>>>>
> > >>>>>>>>>> if you want to try it out.
> > >>>>>>>>>>
> > >>>>>>>>>> Thanks!
> > >>>>>>>>>> sage
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>> --
> > >>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> > >>>>>>> in the body of a message to majordomo@vger.kernel.org More
> > >>>>>>> majordomo info at  http://vger.kernel.org/majordomo-info.html
> > >>>>>>>
> > >>>>>>>
> > >>>>
> > >
> 
> 
> 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: bluestore prefer wal size
  2017-03-06 14:59                           ` Sage Weil
@ 2017-03-06 15:32                             ` Mark Nelson
  2017-03-06 17:53                               ` Sage Weil
  0 siblings, 1 reply; 10+ messages in thread
From: Mark Nelson @ 2017-03-06 15:32 UTC (permalink / raw)
  To: Sage Weil, Nick Fisk; +Cc: ceph-devel



On 03/06/2017 08:59 AM, Sage Weil wrote:
> On Fri, 3 Feb 2017, Nick Fisk wrote:
>> Hi Mark,
>>
>>> -----Original Message-----
>>> From: Mark Nelson [mailto:mnelson@redhat.com]
>>> Sent: 03 February 2017 15:20
>>> To: nick@fisk.me.uk; 'Sage Weil' <sage@newdream.net>
>>> Cc: ceph-devel@vger.kernel.org
>>> Subject: Re: bluestore prefer wal size
>>>
>>> Hi Nick,
>>>
>>> So I'm still catching up to your testing, but last night I ran through a number of tests with a single OSD, a single client, and iodepth=1 fio
>>> rbd tests:
>>>
>>> https://drive.google.com/uc?export=download&id=0B2gTBZrkrnpZWE85OFI3Q2xQZ00
>>>
>>> I tested HDD, HDD+NVMe, and NVMe configurations looking at filestore, bluestore with 1k and 16k prefer wal sizes.  I believe I'm
>>> seeing similar results to what you saw.  On NVMe we are pretty similar likely due to the raw backend throughput the NVMe drives can
>>> maintain but on HDD, bluestore is doing quite a bit worse than filestore for this specific use case.  During the bluestore HDD tests, I
>>> watched disk access and it appears we are saturating the disk with small writes despite the low client performance.  I'll be digging into
>>> this more today but I wanted to send out an update to let you know I believe I've reproduced your results and am looking into it.
>>
>> Glad to hear that you can reproduce. After debugging that IF statement,
>> I think I'm at the point where I am out of my depth. But if there is
>> anything I can do, let me know.
>
> I finally got my dev box HDD sorted out and retested this.  The IOPS were
> being halved because bluestore was triggering a rocksdb commit (and the
> associated latency) just to retire the WAL record.  I rebased and fixed it
> so that the WAL entries are deleted lazily (in the next commit round) and
> it's 2x faster than before.  I'm getting ~110 IOPS with rados bench 4k
> writes with queue depth 1 (6TB 7200rpm WD black).  That's in the
> neighborhood of what I'd expect from a spinner... it's basically seeking
> back and forth between two write positions (the rocksdb log and the new
> object data where the allocator position is).  We could probably bring
> this up a bit by making the WAL work be batched on HDD (build up several
> IOs worth and dispatch it all at once to reduce seeks).  That'll take a
> bit more work, though.
>
> With WAL on an NVMe, I get about ~450 IOPS.
>
> Want to give it a try?

What's the branch?


> sage
>
>
>  >
>>>
>>> Thanks!
>>> Mark
>>>
>>> On 02/03/2017 03:49 AM, Nick Fisk wrote:
>>>>
>>>>
>>>>> -----Original Message-----
>>>>> From: Nick Fisk [mailto:nick@fisk.me.uk]
>>>>> Sent: 02 February 2017 21:15
>>>>> To: 'Mark Nelson' <mnelson@redhat.com>; 'Sage Weil'
>>>>> <sage@newdream.net>
>>>>> Cc: ceph-devel@vger.kernel.org
>>>>> Subject: RE: bluestore prefer wal size
>>>>>
>>>>> Hi Mark,
>>>>>> -----Original Message-----
>>>>>> From: Mark Nelson [mailto:mnelson@redhat.com]
>>>>>> Sent: 02 February 2017 20:17
>>>>>> To: nick@fisk.me.uk; 'Sage Weil' <sage@newdream.net>
>>>>>> Cc: ceph-devel@vger.kernel.org
>>>>>> Subject: Re: bluestore prefer wal size
>>>>>>
>>>>>> Hi Nick,
>>>>>>
>>>>>> On 02/02/2017 08:46 AM, Nick Fisk wrote:
>>>>>>> Further update
>>>>>>>
> >>>>>>> If I do random writes instead of sequential, I see the "defering
> >>>>>>> small write via wal" action, but it comes from the _do_alloc_write
> >>>>>>> function, not _do_write_small. I.e. the log entry is:
> >>>>>>> _do_alloc_write defering small 0x1000 write via wal
>>>>>>
>>>>>> I just ran some tests using the branch with 4k random and sequential
>>>>>> writes against a single OSD.  I see deferring small write via wal
> >>>>>> from do_alloc_write for both sequential and random when
>>>>>> bluestore_prefer_wal_size is set to be larger than the IO size.
>>>>>> Otherwise, I don't see the messages.  I am also preconditioning with
>>>>>> 4MB writes.
>>>>>>
>>>>>> So far I've just been testing on NVMe, but I should be able to do
>>>>>> some mixed tests soon to see if there are any interesting performance implications.
>>>>>
>>>>> I've just looked at the RBD I ran those random tests on and I
>>>>> realized that it was only about 90% pre-conditioned due to me leaving
> >>>>> the time_based fio option on. After filling completely with writes,
> >>>>> now both random and sequential are doing _do_write_small writes and
> >>>>> no deferred wal writes.
>>>>>
>>>>> From looking through the code today, as far as I understand it
>>>>>
> >>>>> Write_small = write to an existing object under min_alloc_size
> >>>>> Write_big   = write to an existing object over min_alloc_size
> >>>>> Alloc_write = write to a new object
>>>>>
>>>>> Is that correct?
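A minimal sketch of that three-way dispatch, with illustrative names only (`classify_write` and its arguments are assumptions for illustration, not the actual BlueStore code):

```cpp
#include <cstdint>

enum class WritePath { Small, Big, Alloc };

// min_alloc_size is the allocation granularity (e.g. 64 KiB on HDD).
// The writes_into_existing_blob flag stands in for BlueStore's check of
// whether the target range already has allocated extents.
WritePath classify_write(bool writes_into_existing_blob, uint64_t len,
                         uint64_t min_alloc_size) {
  if (!writes_into_existing_blob)
    return WritePath::Alloc;   // new object/blob: handled by _do_alloc_write
  if (len < min_alloc_size)
    return WritePath::Small;   // sub-alloc-unit overwrite: _do_write_small
  return WritePath::Big;       // aligned full-unit overwrite: _do_write_big
}
```

This matches the observation below: a fully pre-conditioned RBD always overwrites existing data, so 4 kB IOs land in the small-write path rather than alloc_write.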
>>>>>
>>>>> If it is, I'm interested in why you are seeing alloc_writes, as I
> >>>>> don't see them if the RBD is fully pre-conditioned? Unless there is
> >>>>> something changing with the min_alloc_size as you are using
> >>>>> non-rotational media, which is causing this change?
>>>>>
>>>>> I'm adding a few extra dout's to the write_small function to try and see what's going on, hopefully I will know more tomorrow.
>>>>>
>>>>
>>>> Morning,
>>>>
>>>> So, results are in. I added some debugging before the IF statement
>>>> (https://github.com/liewegas/ceph/blob/944b4d3a3869a472c1e66a4fa73c127
>>>> 66c7801ac/src/os/bluestore/BlueStore.cc#L7885)
>>>>
>>>> So the following section now looks like below:
>>>>
> >>>>     dout(20) << __func__ << "  before IF 0x" << std::hex << b_off << "~" << b_len
> >>>>              << " - " << b->get_blob().get_ondisk_length()
> >>>>              << " - " << b->get_blob().is_unused(b_off, b_len)
> >>>>              << " - " << b->get_blob().is_allocated(b_off, b_len) << " - " << dendl;
>>>>     if ((b_off % chunk_size == 0 && b_len % chunk_size == 0) &&
>>>>         b->get_blob().get_ondisk_length() >= b_off + b_len &&
>>>>         b->get_blob().is_unused(b_off, b_len) &&
>>>>         b->get_blob().is_allocated(b_off, b_len)) {
>>>>
>>>> And the output was
>>>>
>>>> _do_write_small  before IF 0xf000~1000 - 10000 - 0 - 1 -
>>>>
> >>>> So the "b->get_blob().is_unused(b_off, b_len)" call is returning false,
> >>>> which stops it going into the IF block and doing the prefer wal write.
> >>>> Is that expected?
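One way to picture why is_unused() can return false here: a toy blob that tracks one "unused" bit per chunk. This is a simplified stand-in for illustration, not BlueStore's actual blob structure:

```cpp
#include <cstdint>

// Toy blob: one "unused" bit per 4 KiB chunk (64 chunks = 256 KiB).
// A set bit means the chunk has never been written.
struct ToyBlob {
  static constexpr uint64_t chunk = 0x1000;  // 4 KiB chunk size
  uint64_t unused = ~0ull;                   // fresh blob: all chunks unused

  bool is_unused(uint64_t off, uint64_t len) const {
    for (uint64_t o = off; o < off + len; o += chunk)
      if (!(unused & (1ull << (o / chunk))))
        return false;                        // this chunk was written before
    return true;
  }

  void mark_used(uint64_t off, uint64_t len) {
    for (uint64_t o = off; o < off + len; o += chunk)
      unused &= ~(1ull << (o / chunk));
  }
};
```

Under this model, a full 4 MB precondition pass marks every chunk used, so any later overwrite sees is_unused() == false, the IF block is skipped, and the prefer-wal path is never taken — consistent with the trace above.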
>>>>
>>>> Nick
>>>>
>>>>> Nick
>>>>>
>>>>>>
>>>>>> Mark
>>>>>>
>>>>>>>
>>>>>>> And that’s for a 4kb random write on an RBD that has been fully
>>>>>>> conditioned via fio 4MB writes
>>>>>>>
>>>>>>> So it's something in the sequential nature of the test which trips
>>>>>>> it up as I don't see any sign of deferred writes
>>>>>>>
>>>>>>> From looking at the code in the small write function, I can only
>>>>>>> see two possibilities that would stop if from reaching the prefer
>>>>>>> wal IF statement
>>>>>>>
>>>>>>>   while (ep != o->extent_map.extent_map.end()) {
>>>>>>>     if (ep->blob_start() >= end) {
>>>>>>>       break;
>>>>>>>
>>>>>>> and
>>>>>>>
>>>>>>>   if ((b_off % chunk_size == 0 && b_len % chunk_size == 0) &&
>>>>>>> 	b->get_blob().get_ondisk_length() >= b_off + b_len &&
>>>>>>> 	b->get_blob().is_unused(b_off, b_len) &&
>>>>>>> 	b->get_blob().is_allocated(b_off, b_len)) {
>>>>>>>
>>>>>>> Unfortunately, that’s about as far as I have got, before my
>>>>>>> knowledge of
>>>>>> Bluestore becomes a limitation.
>>>>>>>
>>>>>>> Don't know if that triggers any thoughts?
>>>>>>>
>>>>>>>>
>>>>>>>> 2017-02-02 10:39:31.988782 7ff7282ff700 10
>>>>>>>> bluestore(/var/lib/ceph/osd/ceph-4) queue_transactions existing
>>>>>>>> 0x56102a5a4900
>>>>>>>> osr(9.73 0x56102a5db1a0)
>>>>>>>> 2017-02-02 10:39:31.988786 7ff7282ff700 20
>>>>>>>> bluestore(/var/lib/ceph/osd/ceph-4) _txc_create osr 0x56102a5a4900
>>>>>>>> =
>>>>>>>> 0x56102ab4be40 seq 260
>>>>>>>> 2017-02-02 10:39:31.988797 7ff7282ff700 15
>>>>>>>> bluestore(/var/lib/ceph/osd/ceph-4) _setattrs 9.73_head
>>>>>>>> #9:cff290b9:::rbd_data.36646a74b0dc51.0000000000000001:head# 2 key
>>>>>> s
>>>>>>>> 2017-02-02 10:39:31.988802 7ff7282ff700 10
>>>>>>>> bluestore(/var/lib/ceph/osd/ceph-4) _setattrs 9.73_head
>>>>>>>> #9:cff290b9:::rbd_data.36646a74b0dc51.0000000000000001:head# 2 key
>>>>>> s
>>>>>>>> = 0
>>>>>>>> 2017-02-02 10:39:31.988806 7ff7282ff700 15
>>>>>>>> bluestore(/var/lib/ceph/osd/ceph-4) _write 9.73_head
>>>>>>>> #9:cff290b9:::rbd_data.36646a74b0dc51.0000000000000001:head#
>>>>>> 0xfb000~
>>>>>>>> 1000
>>>>>>>> 2017-02-02 10:39:31.988809 7ff7282ff700 20
>>>>>>>> bluestore(/var/lib/ceph/osd/ceph-4) _do_write
>>>>>>>> #9:cff290b9:::rbd_data.36646a74b0dc51.0000000000000001:head#
>>>>>>>> 0xfb000~1000 - have 0x400000 (4194304) bytes fadvise_flags 0x0
>>>>>>>> 2017-02-02 10:39:31.988812 7ff7282ff700 20
>>>>>>>> bluestore(/var/lib/ceph/osd/ceph-4) _do_write prefer csum_order 12
>>>>>>>> target_blob_size
>>>>>>>> 0x80000
>>>>>>>> 2017-02-02 10:39:31.988814 7ff7282ff700 10
>>>>>>>> bluestore(/var/lib/ceph/osd/ceph-4) _do_write_small 0xfb000~1000
>>>>>>>> 2017-02-02 10:39:31.988816 7ff7282ff700 20
>>>>>>>> bluestore(/var/lib/ceph/osd/ceph-4) _do_write_small considering
>>>>>>>> Blob(0x56102aac05a0 blob([0x256480000~80000] mutable+csum
>>>>>>>> crc32c/0x1000) ref_map(0x0~80000=1) SharedBlob(0x56102aa76260))
>>>>>>>> bstart 0x80000
>>>>>>>> 2017-02-02 10:39:31.988821 7ff7282ff700 20
>>>>>>>> bluestore.BufferSpace(0x56102aa762b8 in 0x56102a1337a0) _discard
>>>>>>>> 0x7b000~1000
>>>>>>>> 2017-02-02 10:39:31.988827 7ff7282ff700 20
>>>>>>>> bluestore(/var/lib/ceph/osd/ceph-4) _do_write_small  wal write
>>>>>>>> 0x7b000~1000 of mutable Blob(0x56102aac05a0
>>>>>> blob([0x256480000~80000]
>>>>>>>> mutable+csum crc32c/0x1000) ref_map(0x0~80000=1)
>>>>>>>> SharedBlob(0x56102aa76260)) at [0x2564fb000~1000]
>>>>>>>> 2017-02-02 10:39:31.988832 7ff7282ff700 20
>>>>>>>> bluestore(/var/lib/ceph/osd/ceph-4) _do_write_small  lex
>>>>>>>> 0xfb000~1000: 0x7b000~1000
>>>>>>>> Blob(0x56102aac05a0 blob([0x256480000~80000] mutable+csum
>>>>>>>> crc32c/0x1000)
>>>>>>>> ref_map(0x0~7b000=1,0x7b000~1000=2,0x7c000~4000=1)
>>>>>>>> SharedBlob(0x56102aa76260))
>>>>>>>> 2017-02-02 10:39:31.988836 7ff7282ff700 20
>>>>>>>> bluestore(/var/lib/ceph/osd/ceph-4) _do_alloc_write txc
>>>>>>>> 0x56102ab4be40 0 blobs
>>>>>>>> 2017-02-02 10:39:31.988837 7ff7282ff700 10 bitmapalloc:reserve
>>>>>>>> instance 94627427524464 num_used 345738 total 7627973
>>>>>>>> 2017-02-02 10:39:31.988840 7ff7282ff700 20
>>>>>>>> bluestore(/var/lib/ceph/osd/ceph-4) _wctx_finish lex_old
>>>>>>>> 0xfb000~1000: 0x7b000~1000
>>>>>>>> Blob(0x56102aac05a0 blob([0x256480000~80000] mutable+csum
>>>>>>>> crc32c/0x1000)
>>>>>>>> ref_map(0x0~7b000=1,0x7b000~1000=2,0x7c000~4000=1)
>>>>>>>> SharedBlob(0x56102aa76260))
>>>>>>>> 2017-02-02 10:39:31.988844 7ff7282ff700 20
>>>>>>>> bluestore.extentmap(0x56102a8ee3f0) compress_extent_map
>>>>>> 0xfb000~1000
>>>>>>>> next shard
>>>>>>>> 0x100000 merging 0x80000~7b000: 0x0~7b000 Blob(0x56102aac05a0
>>>>>>>> blob([0x256480000~80000] mutable+csum crc32c/0x1000)
>>>>>>>> ref_map(0x0~80000=1) SharedBlob(0x56102aa76260)) and 0xfb000~1000:
>>>>>>>> 0x7b000~1000 Blob(0x56102aac05a0 blob([0x256480000~80000]
>>>>>>>> mutable+csum crc32c/0x1000) ref_map(0x0~80000=1)
>>>>>>>> SharedBlob(0x56102aa76260))
>>>>>>>> 2017-02-02 10:39:31.988850 7ff7282ff700 20
>>>>>>>> bluestore.extentmap(0x56102a8ee3f0) compress_extent_map
>>>>>> 0xfb000~1000
>>>>>>>> next shard
>>>>>>>> 0x100000 merging 0x80000~7c000: 0x0~7c000 Blob(0x56102aac05a0
>>>>>>>> blob([0x256480000~80000] mutable+csum crc32c/0x1000)
>>>>>>>> ref_map(0x0~80000=1) SharedBlob(0x56102aa76260)) and 0xfc000~4000:
>>>>>>>> 0x7c000~4000 Blob(0x56102aac05a0 blob([0x256480000~80000]
>>>>>>>> mutable+csum crc32c/0x1000) ref_map(0x0~80000=1)
>>>>>>>> SharedBlob(0x56102aa76260))
>>>>>>>> 2017-02-02 10:39:31.988868 7ff7282ff700 20
>>>>>>>> bluestore.extentmap(0x56102a8ee3f0) dirty_range mark shard 0x80000
>>>>>>>> dirty
>>>>>>>> 2017-02-02 10:39:31.988870 7ff7282ff700 10
>>>>>>>> bluestore(/var/lib/ceph/osd/ceph-4) _write 9.73_head
>>>>>>>> #9:cff290b9:::rbd_data.36646a74b0dc51.0000000000000001:head#
>>>>>>>> 0xfb000~1000 = 0
>>>>>>>> 2017-02-02 10:39:31.988875 7ff7282ff700 15
>>>>>>>> bluestore(/var/lib/ceph/osd/ceph-4) _omap_setkeys 9.73_head
>>>>>>>> #9:ce000000::::head#
>>>>>>>> 2017-02-02 10:39:31.988882 7ff7282ff700 10
>>>>>>>> bluestore(/var/lib/ceph/osd/ceph-4) _omap_setkeys 9.73_head
>>>>>>>> #9:ce000000::::head# = 0
>>>>>>>> 2017-02-02 10:39:31.988885 7ff7282ff700 20
>>>>>>>> bluestore(/var/lib/ceph/osd/ceph-4) _txc_write_nodes txc
>>>>>>>> 0x56102ab4be40 onodes
>>>>>>>> 0x56102a8ee300 shared_blobs
>>>>>>>> 2017-02-02 10:39:31.988894 7ff7282ff700 20
>>>>>>>> bluestore.extentmap(0x56102a8ee3f0) update shard 0x80000 is 531
>>>>>>>> bytes
>>>>>> (was 531) from 1 extents
>>>>>>>> 2017-02-02 10:39:31.988902 7ff7282ff700 20
>>>>>> bluestore(/var/lib/ceph/osd/ceph-4)   onode
>>>>>>>> #9:cff290b9:::rbd_data.36646a74b0dc51.0000000000000001:head# is
>>>>>>>> 424
>>>>>>>> (422 bytes onode + 2 bytes spanning blobs + 0 bytes inline
>>>>>>>> extents)
>>>>>>>> 2017-02-02 10:39:31.988914 7ff7282ff700 10
>>>>>>>> bluestore(/var/lib/ceph/osd/ceph-4) _txc_state_proc txc
>>>>>>>> 0x56102ab4be40 prepare
>>>>>>>> 2017-02-02 10:39:31.988916 7ff7282ff700 20
>>>>>>>> bluestore(/var/lib/ceph/osd/ceph-4) _txc_finish_io 0x56102ab4be40
>>>>>>>> 2017-02-02 10:39:31.988917 7ff7282ff700 10
>>>>>>>> bluestore(/var/lib/ceph/osd/ceph-4) _txc_state_proc txc
>>>>>>>> 0x56102ab4be40 io_done
>>>>>>>> 2017-02-02 10:39:31.988918 7ff7282ff700 20
>>>>>>>> bluestore.BufferSpace(0x56102aa762b8 in 0x56102a1337a0)
>>>>>>>> finish_write
>>>>>>>> buffer(0x56102adb1830 space 0x56102aa762b8 0x7b000~1000 writing
>>>>>>>> nocache)
>>>>>>>> 2017-02-02 10:39:31.988939 7ff72e30b700 20
>>>>>>>> bluestore(/var/lib/ceph/osd/ceph-4) _kv_sync_thread wake
>>>>>>>> 2017-02-02 10:39:31.988941 7ff72e30b700 20
>>>>>>>> bluestore(/var/lib/ceph/osd/ceph-4) _kv_sync_thread committing 1
>>>>>>>> submitting 1 cleaning 0
>>>>>>>> 2017-02-02 10:39:31.988952 7ff72e30b700 20
>>>>>>>> bluestore(/var/lib/ceph/osd/ceph-4) _txc_finalize_kv txc
>>>>>>>> 0x56102ab4be40 allocated 0x[] released 0x[]
>>>>>>>> 2017-02-02 10:39:31.988999 7ff72e30b700 10
>>>>>>>> bluestore(/var/lib/ceph/osd/ceph-4) _balance_bluefs_freespace
>>>>>>>> bluefs
>>>>>>>> 9535 M free
>>>>>>>> (0.999896) bluestore 222 G free (0.954675), bluefs_ratio 0.0402143
>>>>>>>> 2017-02-02 10:39:31.989400 7ff72e30b700 20
>>>>>>>> bluestore(/var/lib/ceph/osd/ceph-4) _kv_sync_thread committed 1
>>>>>>>> cleaned 0 in 0.000456
>>>>>>>> 2017-02-02 10:39:31.989405 7ff72e30b700 10
>>>>>>>> bluestore(/var/lib/ceph/osd/ceph-4) _txc_state_proc txc
>>>>>>>> 0x56102ab4be40 kv_submitted
>>>>>>>> 2017-02-02 10:39:31.989407 7ff72e30b700 20
>>>>>>>> bluestore(/var/lib/ceph/osd/ceph-4) _txc_finish_kv txc
>>>>>>>> 0x56102ab4be40
>>>>>>>> 2017-02-02 10:39:31.989413 7ff72e30b700 20
>>>>>>>> bluestore(/var/lib/ceph/osd/ceph-4) _wal_apply txc 0x56102ab4be40
>>>>>>>> seq
>>>>>>>> 255
>>>>>>>> 2017-02-02 10:39:31.989414 7ff72e30b700 20
>>>>>>>> bluestore(/var/lib/ceph/osd/ceph-4) _do_wal_op write
>>>>>>>> [0x2564fb000~1000]
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> sage
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> sage
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> sage
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Nick
>>>>>>>>>>>>>
>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>> From: Nick Fisk [mailto:nick@fisk.me.uk]
>>>>>>>>>>>>> Sent: 01 February 2017 19:03
>>>>>>>>>>>>> To: 'Sage Weil' <sweil@redhat.com>; 'Mark Nelson'
>>>>>>>>>>>> <mnelson@redhat.com>
>>>>>>>>>>>>> Cc: ceph-devel@vger.kernel.org
>>>>>>>>>>>>> Subject: RE: bluestore prefer wal size
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Sage,
>>>>>>>>>>>>>
> >>>>>>>>>>>>> First results not looking good. It looks like write IO to the
> >>>>>>>>>>>>> SSDs (sdd and sdi) is now massively amplified, somewhere in the
> >>>>>>>>>>>>> region of 10x. But I'm still only getting around 100 4 kB seq
> >>>>>>>>>>>>> write IOPS from the fio client, compared to 2500-3000 IOPS on
> >>>>>>>>>>>>> the SSDs and ~200 IOPS per spinning disk (sdc, sde, sdg, sdh).
>>>>>>>>>>>>>
>>>>>>>>>>>>> rbd engine: RBD version: 0.1.11
>>>>>>>>>>>>> Jobs: 1 (f=1): [W(1)] [100.0% done] [0KB/476KB/0KB /s]
>>>>>>>>>>>>> [0/119/0 iops] [eta 00m:00s]
>>>>>>>>>>>>> rbd_iodepth32: (groupid=0, jobs=1): err= 0: pid=31171: Wed
>>>>>>>>>>>>> Feb
>>>>>>>>>>>>> 1
>>>>>>>>>>>> 18:56:51 2017
>>>>>>>>>>>>>   write: io=27836KB, bw=475020B/s, iops=115, runt= 60006msec
>>>>>>>>>>>>>     slat (usec): min=6, max=142, avg=10.98, stdev= 8.38
>>>>>>>>>>>>>     clat (msec): min=1, max=271, avg= 8.61, stdev= 3.63
>>>>>>>>>>>>>      lat (msec): min=1, max=271, avg= 8.62, stdev= 3.63
>>>>>>>>>>>>>     clat percentiles (msec):
>>>>>>>>>>>>>      |  1.00th=[    8],  5.00th=[    9], 10.00th=[    9], 20.00th=[    9],
>>>>>>>>>>>>>      | 30.00th=[    9], 40.00th=[    9], 50.00th=[    9], 60.00th=[    9],
>>>>>>>>>>>>>      | 70.00th=[    9], 80.00th=[    9], 90.00th=[    9], 95.00th=[    9],
>>>>>>>>>>>>>      | 99.00th=[   17], 99.50th=[   25], 99.90th=[   33], 99.95th=[   36],
>>>>>>>>>>>>>      | 99.99th=[  273]
> >>>>>>>>>>>>>     bw (KB  /s): min=  191, max=  480, per=100.00%, avg=464.18, stdev=34.60
>>>>>>>>>>>>>     lat (msec) : 2=0.01%, 4=0.04%, 10=97.69%, 20=1.55%, 50=0.69%
>>>>>>>>>>>>>     lat (msec) : 500=0.01%
>>>>>>>>>>>>>   cpu          : usr=0.13%, sys=0.12%, ctx=7224, majf=0, minf=5
> >>>>>>>>>>>>>   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
> >>>>>>>>>>>>>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> >>>>>>>>>>>>>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>>>>>>>>>>>      issued    : total=r=0/w=6959/d=0, short=r=0/w=0/d=0,
>>>>>>>>>>>> drop=r=0/w=0/d=0
>>>>>>>>>>>>>      latency   : target=0, window=0, percentile=100.00%, depth=1
>>>>>>>>>>>>>
> >>>>>>>>>>>>> I've checked via the admin socket and the prefer wal option
> >>>>>>>>>>>>> is set to 8192.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Random capture of iostat
>>>>>>>>>>>>>
> >>>>>>>>>>>>> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> >>>>>>>>>>>>> sda               0.00    77.00    0.00    5.50     0.00   338.00   122.91     0.05   11.64    0.00   11.64   8.73   4.80
> >>>>>>>>>>>>> sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
> >>>>>>>>>>>>> sdc               0.00     0.00    0.00  151.50     0.00   302.00     3.99     0.54    3.59    0.00    3.59   3.58  54.20
> >>>>>>>>>>>>> sdd               0.00     0.00    0.00 1474.00     0.00  4008.00     5.44     0.09    0.06    0.00    0.06   0.06   9.20
> >>>>>>>>>>>>> sde               0.00     0.00    0.00  217.00     0.00   434.00     4.00     0.91    4.20    0.00    4.20   4.20  91.20
> >>>>>>>>>>>>> sdf               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
> >>>>>>>>>>>>> sdg               0.00     0.00    0.00   66.50     0.00   134.00     4.03     0.18    2.68    0.00    2.68   2.68  17.80
> >>>>>>>>>>>>> sdh               0.00     0.00    0.00  217.00     0.00   434.00     4.00     0.80    3.71    0.00    3.71   3.71  80.40
> >>>>>>>>>>>>> sdi               0.00     0.00    0.00 1134.00     0.00  3082.00     5.44     0.09    0.08    0.00    0.08   0.07   8.40
>>>>>>>>>>>>>
>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>> From: Sage Weil [mailto:sweil@redhat.com]
>>>>>>>>>>>>> Sent: 01 February 2017 15:34
>>>>>>>>>>>>> To: nick@fisk.me.uk
>>>>>>>>>>>>> Cc: ceph-devel@vger.kernel.org
>>>>>>>>>>>>> Subject: bluestore prefer wal size
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hey Nick,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I've updated/improved the prefer wal size PR (that sends
>>>>>>>>>>>>> small writes through the wal).  See
>>>>>>>>>>>>>
>>>>>>>>>>>>> 	https://github.com/ceph/ceph/pull/13217
>>>>>>>>>>>>>
>>>>>>>>>>>>> if you want to try it out.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>> sage
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>>>>>>>> in the body of a message to majordomo@vger.kernel.org More
>>>>>>>>>> majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>
>>>>
>>
>>

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: bluestore prefer wal size
  2017-03-06 15:32                             ` Mark Nelson
@ 2017-03-06 17:53                               ` Sage Weil
  2017-03-07  4:34                                 ` Mark Nelson
  0 siblings, 1 reply; 10+ messages in thread
From: Sage Weil @ 2017-03-06 17:53 UTC (permalink / raw)
  To: Mark Nelson; +Cc: Nick Fisk, ceph-devel

[-- Attachment #1: Type: TEXT/PLAIN, Size: 26652 bytes --]

On Mon, 6 Mar 2017, Mark Nelson wrote:
> On 03/06/2017 08:59 AM, Sage Weil wrote:
> > On Fri, 3 Feb 2017, Nick Fisk wrote:
> > > Hi Mark,
> > > 
> > > > -----Original Message-----
> > > > From: Mark Nelson [mailto:mnelson@redhat.com]
> > > > Sent: 03 February 2017 15:20
> > > > To: nick@fisk.me.uk; 'Sage Weil' <sage@newdream.net>
> > > > Cc: ceph-devel@vger.kernel.org
> > > > Subject: Re: bluestore prefer wal size
> > > > 
> > > > Hi Nick,
> > > > 
> > > > So I'm still catching up to your testing, but last night I ran through a
> > > > number of tests with a single OSD, a single client, and iodepth=1 fio
> > > > rbd tests:
> > > > 
> > > > https://drive.google.com/uc?export=download&id=0B2gTBZrkrnpZWE85OFI3Q2xQZ00
> > > > 
> > > > I tested HDD, HDD+NVMe, and NVMe configurations looking at filestore,
> > > > bluestore with 1k and 16k prefer wal sizes.  I believe I'm
> > > > seeing similar results to what you saw.  On NVMe we are pretty similar
> > > > likely due to the raw backend throughput the NVMe drives can
> > > > maintain but on HDD, bluestore is doing quite a bit worse than filestore
> > > > for this specific use case.  During the bluestore HDD tests, I
> > > > watched disk access and it appears we are saturating the disk with small
> > > > writes despite the low client performance.  I'll be digging into
> > > > this more today but I wanted to send out an update to let you know I
> > > > believe I've reproduced your results and am looking into it.
> > > 
> > > Glad to hear that you can reproduce. After debugging that IF statement,
> > > I think I'm at the point where I am out of my depth. But if there is
> > > anything I can do, let me know.
> > 
> > I finally got my dev box HDD sorted out and retested this.  The IOPS were
> > being halved because bluestore was triggering a rocksdb commit (and the
> > associated latency) just to retire the WAL record.  I rebased and fixed it
> > so that the WAL entries are deleted lazily (in the next commit round) and
> > it's 2x faster than before.  I'm getting ~110 IOPS with rados bench 4k
> > writes with queue depth 1 (6TB 7200rpm WD black).  That's in the
> > neighborhood of what I'd expect from a spinner... it's basically seeking
> > back and forth between two write positions (the rocksdb log and the new
> > object data where the allocator position is).  We could probably bring
> > this up a bit by making the WAL work be batched on HDD (build up several
> > IOs worth and dispatch it all at once to reduce seeks).  That'll take a
> > bit more work, though.

I've made this change too, and now I get about 400 IOPS out of a straight 
HDD (no SSD).  Much better!  I set up the tunables so that it does the 
batching behavior only for HDD and not for SSD; we may want to revisit 
that decision later.
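The batching behavior can be sketched roughly like this. WalBatcher, the queue() interface, and the threshold are illustrative assumptions, not the branch's actual code or tunables:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct WalOp { uint64_t off, len; };

// On rotational media, accumulate deferred WAL writes and dispatch them
// together to amortize seeks; on SSD submit immediately.
class WalBatcher {
  bool rotational;            // only batch on spinning disks
  size_t batch_max;           // dispatch once this many ops are queued
  std::vector<WalOp> pending;
public:
  WalBatcher(bool rot, size_t max) : rotational(rot), batch_max(max) {}

  // Returns the ops that should be submitted now (empty while batching).
  std::vector<WalOp> queue(WalOp op) {
    if (!rotational)
      return {op};            // SSD: no seek cost, submit right away
    pending.push_back(op);
    if (pending.size() < batch_max)
      return {};              // keep accumulating to reduce seeks
    std::vector<WalOp> out;
    out.swap(pending);        // one dispatch for the whole batch
    return out;
  }
};
```

A timer or commit-round hook would also be needed to flush a partial batch so low queue depths don't stall; that detail is omitted here.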

> > With WAL on an NVMe, I get about ~450 IOPS.
> > 
> > Want to give it a try?
> 
> What's the branch?

wip-bluestore-prefer-wal-size

sage


> 
> 
> > sage
> > 
> > 
> >  >
> > > > 
> > > > Thanks!
> > > > Mark
> > > > 
> > > > On 02/03/2017 03:49 AM, Nick Fisk wrote:
> > > > > 
> > > > > 
> > > > > > -----Original Message-----
> > > > > > From: Nick Fisk [mailto:nick@fisk.me.uk]
> > > > > > Sent: 02 February 2017 21:15
> > > > > > To: 'Mark Nelson' <mnelson@redhat.com>; 'Sage Weil'
> > > > > > <sage@newdream.net>
> > > > > > Cc: ceph-devel@vger.kernel.org
> > > > > > Subject: RE: bluestore prefer wal size
> > > > > > 
> > > > > > Hi Mark,
> > > > > > > -----Original Message-----
> > > > > > > From: Mark Nelson [mailto:mnelson@redhat.com]
> > > > > > > Sent: 02 February 2017 20:17
> > > > > > > To: nick@fisk.me.uk; 'Sage Weil' <sage@newdream.net>
> > > > > > > Cc: ceph-devel@vger.kernel.org
> > > > > > > Subject: Re: bluestore prefer wal size
> > > > > > > 
> > > > > > > Hi Nick,
> > > > > > > 
> > > > > > > On 02/02/2017 08:46 AM, Nick Fisk wrote:
> > > > > > > > Further update
> > > > > > > > 
> > > > > > > > If I do random writes instead of sequential, I see the "defering
> > > > > > > > small write via wal" action, but from the _do_alloc_write
> > > > > > > > function, not the small-write path. I.e. the log entry is:
> > > > > > > > _do_alloc_write defering small 0x1000 write via wal
> > > > > > > 
> > > > > > > I just ran some tests using the branch with 4k random and
> > > > > > > sequential writes against a single OSD.  I see "deferring small
> > > > > > > write via wal" from do_alloc_write for both sequential and random
> > > > > > > when bluestore_prefer_wal_size is set larger than the IO size.
> > > > > > > Otherwise, I don't see the messages.  I am also preconditioning
> > > > > > > with 4MB writes.
> > > > > > > 
> > > > > > > So far I've just been testing on NVMe, but I should be able to do
> > > > > > > some mixed tests soon to see if there are any interesting
> > > > > > > performance implications.
> > > > > > 
> > > > > > I've just looked at the RBD I ran those random tests on and I
> > > > > > realized that it was only about 90% pre-conditioned, due to me
> > > > > > leaving the time_based fio option on. After filling completely with
> > > > > > writes, both random and sequential are now doing _do_write_small
> > > > > > writes and no deferring wal writes.
> > > > > > 
> > > > > > From looking through the code today, as far as I understand it:
> > > > > > 
> > > > > > Write_small = write to an existing object under min_alloc_size
> > > > > > Write_big   = write to an existing object over min_alloc_size
> > > > > > Alloc_write = write to a new object
> > > > > > 
> > > > > > Is that correct?
> > > > > > 
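If that reading is right, the three cases reduce to something like the toy classifier below (names invented; the real dispatch in BlueStore::_do_write keys off extent and blob state rather than a single boolean, so this is only a sketch of the decision, not the implementation):

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Toy sketch of the write-path classification described above.
// "object_exists_here" stands in for the real extent/blob lookup;
// all names here are illustrative, not BlueStore's.
std::string classify_write(bool object_exists_here, uint64_t len,
                           uint64_t min_alloc_size) {
  if (!object_exists_here)
    return "alloc_write";   // write into newly allocated space
  if (len < min_alloc_size)
    return "write_small";   // sub-allocation-unit overwrite
  return "write_big";       // allocation-unit-or-larger overwrite
}
```

For example, a 4k write to a fresh object classifies as alloc_write, while the same 4k write to a fully pre-conditioned object classifies as write_small, which is consistent with what the debug output shows after preconditioning.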
> > > > > > If it is, I'm interested in why you are seeing alloc_writes, as I
> > > > > > don't see them if the RBD is fully pre-conditioned. Unless there is
> > > > > > something changing with min_alloc_size because you are using
> > > > > > non-rotational media, which is causing this change?
> > > > > > 
> > > > > > I'm adding a few extra dout's to the write_small function to try and
> > > > > > see what's going on, hopefully I will know more tomorrow.
> > > > > > 
> > > > > 
> > > > > Morning,
> > > > > 
> > > > > So, results are in. I added some debugging before the IF statement
> > > > > (https://github.com/liewegas/ceph/blob/944b4d3a3869a472c1e66a4fa73c127
> > > > > 66c7801ac/src/os/bluestore/BlueStore.cc#L7885)
> > > > > 
> > > > > So the following section now looks like below:
> > > > > 
> > > > >     dout(20) << __func__ << "  before IF 0x" << std::hex << b_off
> > > > >              << "~" << b_len
> > > > >              << " - " << b->get_blob().get_ondisk_length()
> > > > >              << " - " << b->get_blob().is_unused(b_off, b_len)
> > > > >              << " - " << b->get_blob().is_allocated(b_off, b_len)
> > > > >              << " - " << dendl;
> > > > >     if ((b_off % chunk_size == 0 && b_len % chunk_size == 0) &&
> > > > >         b->get_blob().get_ondisk_length() >= b_off + b_len &&
> > > > >         b->get_blob().is_unused(b_off, b_len) &&
> > > > >         b->get_blob().is_allocated(b_off, b_len)) {
> > > > > 
> > > > > And the output was
> > > > > 
> > > > > _do_write_small  before IF 0xf000~1000 - 10000 - 0 - 1 -
> > > > > 
> > > > > So the "b->get_blob().is_unused(b_off, b_len)" call is returning
> > > > > false, which stops it from going into the IF block and doing the
> > > > > prefer wal write. Is that expected?
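One plausible explanation, sketched as a toy model below: if is_unused() tracks which regions of a blob have never been written, then once the image is fully pre-conditioned every region is "used" and the check fails on any overwrite — which would account for the 0 in the debug line above. The struct and its fields here are invented for illustration, not the real bluestore_blob_t:

```cpp
#include <cassert>
#include <cstdint>

// Toy model of a blob "unused" bitmap (one bit per 4k chunk, 1 = never
// written). Names are invented; the real tracking lives in
// bluestore_blob_t. A full 4MB precondition pass clears every bit, so
// any later overwrite sees is_unused() == false.
struct ToyBlob {
  uint64_t unused_bits = ~0ull;   // all chunks start never-written
  uint64_t chunk = 0x1000;        // 4k chunks

  void mark_used(uint64_t off, uint64_t len) {
    for (uint64_t o = off; o < off + len; o += chunk)
      unused_bits &= ~(1ull << (o / chunk));
  }
  bool is_unused(uint64_t off, uint64_t len) const {
    for (uint64_t o = off; o < off + len; o += chunk)
      if (!(unused_bits & (1ull << (o / chunk))))
        return false;             // some chunk in the range was written
    return true;
  }
};
```

In this model a 0x1000 write at 0xf000 is "unused" on a fresh blob, but after a 0x0~0x10000 fill the same check returns false, so the prefer-wal IF branch would be skipped exactly as observed.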
> > > > > 
> > > > > Nick
> > > > > 
> > > > > > Nick
> > > > > > 
> > > > > > > 
> > > > > > > Mark
> > > > > > > 
> > > > > > > > 
> > > > > > > > And that’s for a 4kb random write on an RBD that has been fully
> > > > > > > > conditioned via fio 4MB writes.
> > > > > > > > 
> > > > > > > > So it's something in the sequential nature of the test which
> > > > > > > > trips it up, as I don't see any sign of deferred writes.
> > > > > > > > 
> > > > > > > > From looking at the code in the small write function, I can only
> > > > > > > > see two possibilities that would stop it from reaching the
> > > > > > > > prefer wal IF statement
> > > > > > > > 
> > > > > > > >   while (ep != o->extent_map.extent_map.end()) {
> > > > > > > >     if (ep->blob_start() >= end) {
> > > > > > > >       break;
> > > > > > > > 
> > > > > > > > and
> > > > > > > > 
> > > > > > > >   if ((b_off % chunk_size == 0 && b_len % chunk_size == 0) &&
> > > > > > > > 	b->get_blob().get_ondisk_length() >= b_off + b_len &&
> > > > > > > > 	b->get_blob().is_unused(b_off, b_len) &&
> > > > > > > > 	b->get_blob().is_allocated(b_off, b_len)) {
> > > > > > > > 
> > > > > > > > Unfortunately, that’s about as far as I have got before my
> > > > > > > > knowledge of Bluestore becomes a limitation.
> > > > > > > > 
> > > > > > > > Don't know if that triggers any thoughts?
> > > > > > > > 
> > > > > > > > > 
> > > > > > > > > 2017-02-02 10:39:31.988782 7ff7282ff700 10
> > > > > > > > > bluestore(/var/lib/ceph/osd/ceph-4) queue_transactions
> > > > > > > > > existing
> > > > > > > > > 0x56102a5a4900
> > > > > > > > > osr(9.73 0x56102a5db1a0)
> > > > > > > > > 2017-02-02 10:39:31.988786 7ff7282ff700 20
> > > > > > > > > bluestore(/var/lib/ceph/osd/ceph-4) _txc_create osr
> > > > > > > > > 0x56102a5a4900
> > > > > > > > > =
> > > > > > > > > 0x56102ab4be40 seq 260
> > > > > > > > > 2017-02-02 10:39:31.988797 7ff7282ff700 15
> > > > > > > > > bluestore(/var/lib/ceph/osd/ceph-4) _setattrs 9.73_head
> > > > > > > > > #9:cff290b9:::rbd_data.36646a74b0dc51.0000000000000001:head# 2 keys
> > > > > > > > > 2017-02-02 10:39:31.988802 7ff7282ff700 10
> > > > > > > > > bluestore(/var/lib/ceph/osd/ceph-4) _setattrs 9.73_head
> > > > > > > > > #9:cff290b9:::rbd_data.36646a74b0dc51.0000000000000001:head# 2 keys = 0
> > > > > > > > > 2017-02-02 10:39:31.988806 7ff7282ff700 15
> > > > > > > > > bluestore(/var/lib/ceph/osd/ceph-4) _write 9.73_head
> > > > > > > > > #9:cff290b9:::rbd_data.36646a74b0dc51.0000000000000001:head# 0xfb000~1000
> > > > > > > > > 2017-02-02 10:39:31.988809 7ff7282ff700 20
> > > > > > > > > bluestore(/var/lib/ceph/osd/ceph-4) _do_write
> > > > > > > > > #9:cff290b9:::rbd_data.36646a74b0dc51.0000000000000001:head#
> > > > > > > > > 0xfb000~1000 - have 0x400000 (4194304) bytes fadvise_flags 0x0
> > > > > > > > > 2017-02-02 10:39:31.988812 7ff7282ff700 20
> > > > > > > > > bluestore(/var/lib/ceph/osd/ceph-4) _do_write prefer
> > > > > > > > > csum_order 12
> > > > > > > > > target_blob_size
> > > > > > > > > 0x80000
> > > > > > > > > 2017-02-02 10:39:31.988814 7ff7282ff700 10
> > > > > > > > > bluestore(/var/lib/ceph/osd/ceph-4) _do_write_small
> > > > > > > > > 0xfb000~1000
> > > > > > > > > 2017-02-02 10:39:31.988816 7ff7282ff700 20
> > > > > > > > > bluestore(/var/lib/ceph/osd/ceph-4) _do_write_small
> > > > > > > > > considering
> > > > > > > > > Blob(0x56102aac05a0 blob([0x256480000~80000] mutable+csum
> > > > > > > > > crc32c/0x1000) ref_map(0x0~80000=1)
> > > > > > > > > SharedBlob(0x56102aa76260))
> > > > > > > > > bstart 0x80000
> > > > > > > > > 2017-02-02 10:39:31.988821 7ff7282ff700 20
> > > > > > > > > bluestore.BufferSpace(0x56102aa762b8 in 0x56102a1337a0)
> > > > > > > > > _discard
> > > > > > > > > 0x7b000~1000
> > > > > > > > > 2017-02-02 10:39:31.988827 7ff7282ff700 20
> > > > > > > > > bluestore(/var/lib/ceph/osd/ceph-4) _do_write_small  wal write
> > > > > > > > > 0x7b000~1000 of mutable Blob(0x56102aac05a0
> > > > > > > blob([0x256480000~80000]
> > > > > > > > > mutable+csum crc32c/0x1000) ref_map(0x0~80000=1)
> > > > > > > > > SharedBlob(0x56102aa76260)) at [0x2564fb000~1000]
> > > > > > > > > 2017-02-02 10:39:31.988832 7ff7282ff700 20
> > > > > > > > > bluestore(/var/lib/ceph/osd/ceph-4) _do_write_small  lex
> > > > > > > > > 0xfb000~1000: 0x7b000~1000
> > > > > > > > > Blob(0x56102aac05a0 blob([0x256480000~80000] mutable+csum
> > > > > > > > > crc32c/0x1000)
> > > > > > > > > ref_map(0x0~7b000=1,0x7b000~1000=2,0x7c000~4000=1)
> > > > > > > > > SharedBlob(0x56102aa76260))
> > > > > > > > > 2017-02-02 10:39:31.988836 7ff7282ff700 20
> > > > > > > > > bluestore(/var/lib/ceph/osd/ceph-4) _do_alloc_write txc
> > > > > > > > > 0x56102ab4be40 0 blobs
> > > > > > > > > 2017-02-02 10:39:31.988837 7ff7282ff700 10 bitmapalloc:reserve
> > > > > > > > > instance 94627427524464 num_used 345738 total 7627973
> > > > > > > > > 2017-02-02 10:39:31.988840 7ff7282ff700 20
> > > > > > > > > bluestore(/var/lib/ceph/osd/ceph-4) _wctx_finish lex_old
> > > > > > > > > 0xfb000~1000: 0x7b000~1000
> > > > > > > > > Blob(0x56102aac05a0 blob([0x256480000~80000] mutable+csum
> > > > > > > > > crc32c/0x1000)
> > > > > > > > > ref_map(0x0~7b000=1,0x7b000~1000=2,0x7c000~4000=1)
> > > > > > > > > SharedBlob(0x56102aa76260))
> > > > > > > > > 2017-02-02 10:39:31.988844 7ff7282ff700 20
> > > > > > > > > bluestore.extentmap(0x56102a8ee3f0) compress_extent_map
> > > > > > > 0xfb000~1000
> > > > > > > > > next shard
> > > > > > > > > 0x100000 merging 0x80000~7b000: 0x0~7b000 Blob(0x56102aac05a0
> > > > > > > > > blob([0x256480000~80000] mutable+csum crc32c/0x1000)
> > > > > > > > > ref_map(0x0~80000=1) SharedBlob(0x56102aa76260)) and
> > > > > > > > > 0xfb000~1000:
> > > > > > > > > 0x7b000~1000 Blob(0x56102aac05a0 blob([0x256480000~80000]
> > > > > > > > > mutable+csum crc32c/0x1000) ref_map(0x0~80000=1)
> > > > > > > > > SharedBlob(0x56102aa76260))
> > > > > > > > > 2017-02-02 10:39:31.988850 7ff7282ff700 20
> > > > > > > > > bluestore.extentmap(0x56102a8ee3f0) compress_extent_map
> > > > > > > 0xfb000~1000
> > > > > > > > > next shard
> > > > > > > > > 0x100000 merging 0x80000~7c000: 0x0~7c000 Blob(0x56102aac05a0
> > > > > > > > > blob([0x256480000~80000] mutable+csum crc32c/0x1000)
> > > > > > > > > ref_map(0x0~80000=1) SharedBlob(0x56102aa76260)) and
> > > > > > > > > 0xfc000~4000:
> > > > > > > > > 0x7c000~4000 Blob(0x56102aac05a0 blob([0x256480000~80000]
> > > > > > > > > mutable+csum crc32c/0x1000) ref_map(0x0~80000=1)
> > > > > > > > > SharedBlob(0x56102aa76260))
> > > > > > > > > 2017-02-02 10:39:31.988868 7ff7282ff700 20
> > > > > > > > > bluestore.extentmap(0x56102a8ee3f0) dirty_range mark shard
> > > > > > > > > 0x80000
> > > > > > > > > dirty
> > > > > > > > > 2017-02-02 10:39:31.988870 7ff7282ff700 10
> > > > > > > > > bluestore(/var/lib/ceph/osd/ceph-4) _write 9.73_head
> > > > > > > > > #9:cff290b9:::rbd_data.36646a74b0dc51.0000000000000001:head#
> > > > > > > > > 0xfb000~1000 = 0
> > > > > > > > > 2017-02-02 10:39:31.988875 7ff7282ff700 15
> > > > > > > > > bluestore(/var/lib/ceph/osd/ceph-4) _omap_setkeys 9.73_head
> > > > > > > > > #9:ce000000::::head#
> > > > > > > > > 2017-02-02 10:39:31.988882 7ff7282ff700 10
> > > > > > > > > bluestore(/var/lib/ceph/osd/ceph-4) _omap_setkeys 9.73_head
> > > > > > > > > #9:ce000000::::head# = 0
> > > > > > > > > 2017-02-02 10:39:31.988885 7ff7282ff700 20
> > > > > > > > > bluestore(/var/lib/ceph/osd/ceph-4) _txc_write_nodes txc
> > > > > > > > > 0x56102ab4be40 onodes
> > > > > > > > > 0x56102a8ee300 shared_blobs
> > > > > > > > > 2017-02-02 10:39:31.988894 7ff7282ff700 20
> > > > > > > > > bluestore.extentmap(0x56102a8ee3f0) update shard 0x80000 is
> > > > > > > > > 531
> > > > > > > > > bytes
> > > > > > > (was 531) from 1 extents
> > > > > > > > > 2017-02-02 10:39:31.988902 7ff7282ff700 20
> > > > > > > bluestore(/var/lib/ceph/osd/ceph-4)   onode
> > > > > > > > > #9:cff290b9:::rbd_data.36646a74b0dc51.0000000000000001:head#
> > > > > > > > > is
> > > > > > > > > 424
> > > > > > > > > (422 bytes onode + 2 bytes spanning blobs + 0 bytes inline
> > > > > > > > > extents)
> > > > > > > > > 2017-02-02 10:39:31.988914 7ff7282ff700 10
> > > > > > > > > bluestore(/var/lib/ceph/osd/ceph-4) _txc_state_proc txc
> > > > > > > > > 0x56102ab4be40 prepare
> > > > > > > > > 2017-02-02 10:39:31.988916 7ff7282ff700 20
> > > > > > > > > bluestore(/var/lib/ceph/osd/ceph-4) _txc_finish_io
> > > > > > > > > 0x56102ab4be40
> > > > > > > > > 2017-02-02 10:39:31.988917 7ff7282ff700 10
> > > > > > > > > bluestore(/var/lib/ceph/osd/ceph-4) _txc_state_proc txc
> > > > > > > > > 0x56102ab4be40 io_done
> > > > > > > > > 2017-02-02 10:39:31.988918 7ff7282ff700 20
> > > > > > > > > bluestore.BufferSpace(0x56102aa762b8 in 0x56102a1337a0)
> > > > > > > > > finish_write
> > > > > > > > > buffer(0x56102adb1830 space 0x56102aa762b8 0x7b000~1000
> > > > > > > > > writing
> > > > > > > > > nocache)
> > > > > > > > > 2017-02-02 10:39:31.988939 7ff72e30b700 20
> > > > > > > > > bluestore(/var/lib/ceph/osd/ceph-4) _kv_sync_thread wake
> > > > > > > > > 2017-02-02 10:39:31.988941 7ff72e30b700 20
> > > > > > > > > bluestore(/var/lib/ceph/osd/ceph-4) _kv_sync_thread committing
> > > > > > > > > 1
> > > > > > > > > submitting 1 cleaning 0
> > > > > > > > > 2017-02-02 10:39:31.988952 7ff72e30b700 20
> > > > > > > > > bluestore(/var/lib/ceph/osd/ceph-4) _txc_finalize_kv txc
> > > > > > > > > 0x56102ab4be40 allocated 0x[] released 0x[]
> > > > > > > > > 2017-02-02 10:39:31.988999 7ff72e30b700 10
> > > > > > > > > bluestore(/var/lib/ceph/osd/ceph-4) _balance_bluefs_freespace
> > > > > > > > > bluefs
> > > > > > > > > 9535 M free
> > > > > > > > > (0.999896) bluestore 222 G free (0.954675), bluefs_ratio
> > > > > > > > > 0.0402143
> > > > > > > > > 2017-02-02 10:39:31.989400 7ff72e30b700 20
> > > > > > > > > bluestore(/var/lib/ceph/osd/ceph-4) _kv_sync_thread committed
> > > > > > > > > 1
> > > > > > > > > cleaned 0 in 0.000456
> > > > > > > > > 2017-02-02 10:39:31.989405 7ff72e30b700 10
> > > > > > > > > bluestore(/var/lib/ceph/osd/ceph-4) _txc_state_proc txc
> > > > > > > > > 0x56102ab4be40 kv_submitted
> > > > > > > > > 2017-02-02 10:39:31.989407 7ff72e30b700 20
> > > > > > > > > bluestore(/var/lib/ceph/osd/ceph-4) _txc_finish_kv txc
> > > > > > > > > 0x56102ab4be40
> > > > > > > > > 2017-02-02 10:39:31.989413 7ff72e30b700 20
> > > > > > > > > bluestore(/var/lib/ceph/osd/ceph-4) _wal_apply txc
> > > > > > > > > 0x56102ab4be40
> > > > > > > > > seq
> > > > > > > > > 255
> > > > > > > > > 2017-02-02 10:39:31.989414 7ff72e30b700 20
> > > > > > > > > bluestore(/var/lib/ceph/osd/ceph-4) _do_wal_op write
> > > > > > > > > [0x2564fb000~1000]
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > sage
> > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > sage
> > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > > > 
> > > > > > > > > > > > > sage
> > > > > > > > > > > > > 
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Nick
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > -----Original Message-----
> > > > > > > > > > > > > > From: Nick Fisk [mailto:nick@fisk.me.uk]
> > > > > > > > > > > > > > Sent: 01 February 2017 19:03
> > > > > > > > > > > > > > To: 'Sage Weil' <sweil@redhat.com>; 'Mark Nelson'
> > > > > > > > > > > > > <mnelson@redhat.com>
> > > > > > > > > > > > > > Cc: ceph-devel@vger.kernel.org
> > > > > > > > > > > > > > Subject: RE: bluestore prefer wal size
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Hi Sage,
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > First results not looking good. It looks like write
> > > > > > > > > > > > > > IO to the SSD's (sdd and sdi) is now massively
> > > > > > > > > > > > > > amplified, by somewhere in the region of about 10x.
> > > > > > > > > > > > > > But I'm still only getting around 100 4kb seq write
> > > > > > > > > > > > > > iops from the fio client. This is in comparison to
> > > > > > > > > > > > > > 2500-3000 iops on the SSD's and ~200 iops per
> > > > > > > > > > > > > > spinning disk (sdc,sde,sdg,sdh).
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > rbd engine: RBD version: 0.1.11
> > > > > > > > > > > > > > Jobs: 1 (f=1): [W(1)] [100.0% done] [0KB/476KB/0KB /s] [0/119/0 iops] [eta 00m:00s]
> > > > > > > > > > > > > > rbd_iodepth32: (groupid=0, jobs=1): err= 0: pid=31171: Wed Feb  1 18:56:51 2017
> > > > > > > > > > > > > >   write: io=27836KB, bw=475020B/s, iops=115, runt= 60006msec
> > > > > > > > > > > > > >     slat (usec): min=6, max=142, avg=10.98, stdev= 8.38
> > > > > > > > > > > > > >     clat (msec): min=1, max=271, avg= 8.61, stdev= 3.63
> > > > > > > > > > > > > >      lat (msec): min=1, max=271, avg= 8.62, stdev= 3.63
> > > > > > > > > > > > > >     clat percentiles (msec):
> > > > > > > > > > > > > >      |  1.00th=[    8],  5.00th=[    9], 10.00th=[    9], 20.00th=[    9],
> > > > > > > > > > > > > >      | 30.00th=[    9], 40.00th=[    9], 50.00th=[    9], 60.00th=[    9],
> > > > > > > > > > > > > >      | 70.00th=[    9], 80.00th=[    9], 90.00th=[    9], 95.00th=[    9],
> > > > > > > > > > > > > >      | 99.00th=[   17], 99.50th=[   25], 99.90th=[   33], 99.95th=[   36],
> > > > > > > > > > > > > >      | 99.99th=[  273]
> > > > > > > > > > > > > >     bw (KB  /s): min=  191, max=  480, per=100.00%, avg=464.18, stdev=34.60
> > > > > > > > > > > > > >     lat (msec) : 2=0.01%, 4=0.04%, 10=97.69%, 20=1.55%, 50=0.69%
> > > > > > > > > > > > > >     lat (msec) : 500=0.01%
> > > > > > > > > > > > > >   cpu          : usr=0.13%, sys=0.12%, ctx=7224, majf=0, minf=5
> > > > > > > > > > > > > >   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
> > > > > > > > > > > > > >      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> > > > > > > > > > > > > >      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> > > > > > > > > > > > > >      issued    : total=r=0/w=6959/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
> > > > > > > > > > > > > >      latency   : target=0, window=0, percentile=100.00%, depth=1
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > I've checked via the admin socket and the prefer wal
> > > > > > > > > > > > > > option is set to 8192.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Random capture of iostat
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> > > > > > > > > > > > > > sda               0.00    77.00    0.00    5.50     0.00   338.00   122.91     0.05   11.64    0.00   11.64   8.73   4.80
> > > > > > > > > > > > > > sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
> > > > > > > > > > > > > > sdc               0.00     0.00    0.00  151.50     0.00   302.00     3.99     0.54    3.59    0.00    3.59   3.58  54.20
> > > > > > > > > > > > > > sdd               0.00     0.00    0.00 1474.00     0.00  4008.00     5.44     0.09    0.06    0.00    0.06   0.06   9.20
> > > > > > > > > > > > > > sde               0.00     0.00    0.00  217.00     0.00   434.00     4.00     0.91    4.20    0.00    4.20   4.20  91.20
> > > > > > > > > > > > > > sdf               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
> > > > > > > > > > > > > > sdg               0.00     0.00    0.00   66.50     0.00   134.00     4.03     0.18    2.68    0.00    2.68   2.68  17.80
> > > > > > > > > > > > > > sdh               0.00     0.00    0.00  217.00     0.00   434.00     4.00     0.80    3.71    0.00    3.71   3.71  80.40
> > > > > > > > > > > > > > sdi               0.00     0.00    0.00 1134.00     0.00  3082.00     5.44     0.09    0.08    0.00    0.08   0.07   8.40
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > -----Original Message-----
> > > > > > > > > > > > > > From: Sage Weil [mailto:sweil@redhat.com]
> > > > > > > > > > > > > > Sent: 01 February 2017 15:34
> > > > > > > > > > > > > > To: nick@fisk.me.uk
> > > > > > > > > > > > > > Cc: ceph-devel@vger.kernel.org
> > > > > > > > > > > > > > Subject: bluestore prefer wal size
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Hey Nick,
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > I've updated/improved the prefer wal size PR (that
> > > > > > > > > > > > > > sends
> > > > > > > > > > > > > > small writes through the wal).  See
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > 	https://github.com/ceph/ceph/pull/13217
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > if you want to try it out.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Thanks!
> > > > > > > > > > > > > > sage
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > --
> > > > > > > > > > > To unsubscribe from this list: send the line "unsubscribe
> > > > > > > > > > > ceph-devel"
> > > > > > > > > > > in the body of a message to majordomo@vger.kernel.org More
> > > > > > > > > > > majordomo info at
> > > > > > > > > > > http://vger.kernel.org/majordomo-info.html
> > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > 
> > > > > 
> > > 
> > > 
> 
> 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: bluestore prefer wal size
  2017-03-06 17:53                               ` Sage Weil
@ 2017-03-07  4:34                                 ` Mark Nelson
  0 siblings, 0 replies; 10+ messages in thread
From: Mark Nelson @ 2017-03-07  4:34 UTC (permalink / raw)
  To: Sage Weil; +Cc: Nick Fisk, ceph-devel



On 03/06/2017 11:53 AM, Sage Weil wrote:
> On Mon, 6 Mar 2017, Mark Nelson wrote:
>> On 03/06/2017 08:59 AM, Sage Weil wrote:
>>> On Fri, 3 Feb 2017, Nick Fisk wrote:
>>>> Hi Mark,
>>>>
>>>>> -----Original Message-----
>>>>> From: Mark Nelson [mailto:mnelson@redhat.com]
>>>>> Sent: 03 February 2017 15:20
>>>>> To: nick@fisk.me.uk; 'Sage Weil' <sage@newdream.net>
>>>>> Cc: ceph-devel@vger.kernel.org
>>>>> Subject: Re: bluestore prefer wal size
>>>>>
>>>>> Hi Nick,
>>>>>
>>>>> So I'm still catching up to your testing, but last night I ran through a
>>>>> number of tests with a single OSD, a single client, and iodepth=1 fio
>>>>> rbd tests:
>>>>>
>>>>> https://drive.google.com/uc?export=download&id=0B2gTBZrkrnpZWE85OFI3Q2xQZ00
>>>>>
>>>>> I tested HDD, HDD+NVMe, and NVMe configurations looking at filestore,
>>>>> bluestore with 1k and 16k prefer wal sizes.  I believe I'm
>>>>> seeing similar results to what you saw.  On NVMe we are pretty similar
>>>>> likely due to the raw backend throughput the NVMe drives can
>>>>> maintain but on HDD, bluestore is doing quite a bit worse than filestore
>>>>> for this specific use case.  During the bluestore HDD tests, I
>>>>> watched disk access and it appears we are saturating the disk with small
>>>>> writes despite the low client performance.  I'll be digging into
>>>>> this more today but I wanted to send out an update to let you know I
>>>>> believe I've reproduced your results and am looking into it.
>>>>
>>>> Glad to hear that you can reproduce. After debugging that IF statement,
>>>> I think I'm at the point where I am out of my depth. But if there is
>>>> anything I can do, let me know.
>>>
>>> I finally got my dev box HDD sorted out and retested this.  The IOPS were
>>> being halved because bluestore was triggering a rocksdb commit (and the
>>> associated latency) just to retire the WAL record.  I rebased and fixed it
>>> so that the WAL entries are deleted lazily (in the next commit round) and
>>> it's 2x faster than before.  I'm getting ~110 IOPS with rados bench 4k
>>> writes with queue depth 1 (6TB 7200rpm WD black).  That's in the
>>> neighborhood of what I'd expect from a spinner... it's basically seeking
>>> back and forth between two write positions (the rocksdb log and the new
>>> object data where the allocator position is).  We could probably bring
>>> this up a bit by making the WAL work be batched on HDD (build up several
>>> IOs worth and dispatch it all at once to reduce seeks).  That'll take a
>>> bit more work, though.
>
> I've made this change too, and now I get about 400 IOPS out of a straight
> HDD (no SSD).  Much better!  I set up the tunables so that it does the
> batching behavior only for HDD and not for SSD; we may want to revisit
> that decision later.
>
>>> With WAL on an NVMe, I get about ~450 IOPS.
>>>
>>> Want to give it a try?
>>
>> What's the branch?
>
> wip-bluestore-prefer-wal-size

Got a chance to give it a quick shot, but it's consistently stuck 
creating PGs, and eventually the monitor marks OSDs down due to no OSD 
or PG stats.  I'll give it a try with a single-OSD cluster tomorrow.

>
> sage

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2017-03-07  4:42 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-02-01 15:34 bluestore prefer wal size Sage Weil
     [not found] ` <001001d27cbd$cee2a9f0$6ca7fdd0$@fisk.me.uk>
     [not found]   ` <bb0bdeb6.AEMAHeJMfMsAAAAAAAAAAFklQUsAADNJBWwAAAAAAACRXwBYklNO@mailjet.com>
2017-02-01 21:38     ` Sage Weil
     [not found]       ` <45ba3843.AEEAHxSt8JAAAAAAAAAAAFklQUsAADNJBWwAAAAAAACRXwBYkldP@mailjet.com>
2017-02-01 21:47         ` Sage Weil
2017-02-01 22:12           ` Sage Weil
     [not found]             ` <00bb01d27d45$21a48050$64ed80f0$@fisk.me.uk>
     [not found]               ` <d0caeb89.AEUAHScduCcAAAAAAAAAAFPDZXcAADNJBWwAAAAAAACRXwBYk0Zd@mailjet.com>
2017-02-02 20:17                 ` Mark Nelson
     [not found]                   ` <002101d27d99$5e266ad0$1a734070$@fisk.me.uk>
     [not found]                     ` <5ed3f3b2.ADsAAGRwZ_AAAAAAAAAAAFPDZXcAADNJBWwAAAAAAACRXwBYlFIS@mailjet.com>
2017-02-03 15:20                       ` Mark Nelson
     [not found]                         ` <ba5b6d21.ADsAAGSJJ-oAAAAAAAAAAFJy4NwAADNJBWwAAAAAAACRXwBYlKUc@mailjet.com>
2017-03-06 14:59                           ` Sage Weil
2017-03-06 15:32                             ` Mark Nelson
2017-03-06 17:53                               ` Sage Weil
2017-03-07  4:34                                 ` Mark Nelson
