* Sparse file info in filestore not propagated to other OSDs
@ 2017-04-06 10:15 Piotr Dałek
  2017-04-06 13:25 ` Sage Weil
  0 siblings, 1 reply; 19+ messages in thread
From: Piotr Dałek @ 2017-04-06 10:15 UTC (permalink / raw)
  To: ceph-devel

Hello,

We recently had an interesting issue with RBD images and filestore on Jewel 
10.2.5:
We have a pool with RBD images, all of them mostly untouched (large areas of 
those images unused). Once we added 3 new OSDs to the cluster, objects 
representing these images grew substantially on the new OSDs: objects hosting 
unused areas of these images remained small on the original OSDs (~8K of 
space actually used, 4M allocated), but on the new OSDs they were large (4M 
allocated *and* actually used). After investigation we concluded that Ceph 
didn't propagate the sparse file information during the cluster rebalance, 
resulting in correct data contents on all OSDs but no sparseness on the new 
OSDs, hence the increased disk space usage there.

Example on test cluster, before growing it by one OSD:

ls:

osd-01-cluster: -rw-r--r-- 1 root root 4194304 Apr  6 09:18 
/var/lib/ceph/osd-01-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
osd-02-cluster: -rw-r--r-- 1 root root 4194304 Apr  6 09:18 
/var/lib/ceph/osd-02-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
osd-03-cluster: -rw-r--r-- 1 root root 4194304 Apr  6 09:18 
/var/lib/ceph/osd-03-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0

du:

osd-01-cluster: 12 
/var/lib/ceph/osd-01-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
osd-02-cluster: 12 
/var/lib/ceph/osd-02-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
osd-03-cluster: 12 
/var/lib/ceph/osd-03-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0


mon-01-cluster:~ # rbd diff test
Offset   Length  Type
8388608  4194304 data
16777216 4096    data
33554432 4194304 data
37748736 2048    data

And after growing it:

ls:

clush> find /var/lib/ceph/osd-*/current/0.*head/ -type f -name '*data*' 
-exec ls -l {} \+
osd-02-cluster: -rw-r--r-- 1 root root 4194304 Apr  6 09:18 
/var/lib/ceph/osd-02-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
osd-03-cluster: -rw-r--r-- 1 root root 4194304 Apr  6 09:18 
/var/lib/ceph/osd-03-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
osd-04-cluster: -rw-r--r-- 1 root root 4194304 Apr  6 09:25 
/var/lib/ceph/osd-04-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0

du:

clush> find /var/lib/ceph/osd-*/current/0.*head/ -type f -name '*data*' 
-exec du -k {} \+
osd-02-cluster: 12 
/var/lib/ceph/osd-02-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
osd-03-cluster: 12 
/var/lib/ceph/osd-03-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
osd-04-cluster: 4100 
/var/lib/ceph/osd-04-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0

Note that "rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0" grew 
from 12 to 4100KB when copied from other OSDs to osd-04.

Is this something to be expected? Is there any way to make Ceph propagate 
the sparse file info? Or should we think about submitting a "fallocate -d"-like 
patch for writes on filestore?

(We're using kernel 3.13.0-45-generic but on 4.4.0-31-generic the issue 
remains; our XFS uses 4K bsize).
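
To make the "fallocate -d" idea concrete, here is a rough sketch (not a 
FileStore patch; the fixed 4K block size is an assumption, real code would 
query the filesystem) of punching holes over zero-filled blocks after a write:

// Rough illustration only: after a write, punch holes over block-aligned
// regions that are entirely zero, similar in spirit to "fallocate -d".
// Assumes Linux; the 4K block size is an assumption.
#include <fcntl.h>
#include <linux/falloc.h>
#include <unistd.h>
#include <cstring>

static bool block_is_zero(const char *b, size_t n) {
  return n > 0 && b[0] == 0 && std::memcmp(b, b + 1, n - 1) == 0;
}

int punch_zero_blocks(int fd, off_t off, size_t len) {
  const off_t blk = 4096;
  char buf[4096];
  off_t end = off + static_cast<off_t>(len);
  for (off_t pos = off & ~(blk - 1); pos < end; pos += blk) {
    ssize_t r = pread(fd, buf, blk, pos);
    if (r <= 0)
      break;                                  // EOF or read error
    if (r == blk && block_is_zero(buf, blk)) {
      // Deallocate the block; reads still return zeroes, file size is kept.
      if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, pos, blk) < 0)
        return -1;
    }
  }
  return 0;
}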

-- 
Piotr Dałek
piotr.dalek@corp.ovh.com
https://www.ovh.com/us/


* Re: Sparse file info in filestore not propagated to other OSDs
  2017-04-06 10:15 Sparse file info in filestore not propagated to other OSDs Piotr Dałek
@ 2017-04-06 13:25 ` Sage Weil
  2017-04-06 13:30   ` Piotr Dałek
  2017-04-13 14:23   ` Piotr Dałek
  0 siblings, 2 replies; 19+ messages in thread
From: Sage Weil @ 2017-04-06 13:25 UTC (permalink / raw)
  To: Piotr Dałek; +Cc: ceph-devel


On Thu, 6 Apr 2017, Piotr Dałek wrote:
> Hello,
> 
> We recently had an interesting issue with RBD images and filestore on Jewel
> 10.2.5:
> We have a pool with RBD images, all of them mostly untouched (large areas of
> those images unused), and once we added 3 new OSDs to cluster, objects
> representing these images grew substantially on new OSDs: objects hosting
> unused areas of these images on original OSDs remained small (~8K of space
> actually used, 4M allocated), but on new OSDs were large (4M allocated *and*
> actually used). After investigation we concluded that Ceph didn't propagate
> sparse file information during cluster rebalance, resulting in correct data
> contents on all OSDs, but no sparse file data on new OSDs, hence disk space
> usage increase on those.
> 
> Example on test cluster, before growing it by one OSD:
> 
> ls:
> 
> osd-01-cluster: -rw-r--r-- 1 root root 4194304 Apr  6 09:18
> /var/lib/ceph/osd-01-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
> osd-02-cluster: -rw-r--r-- 1 root root 4194304 Apr  6 09:18
> /var/lib/ceph/osd-02-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
> osd-03-cluster: -rw-r--r-- 1 root root 4194304 Apr  6 09:18
> /var/lib/ceph/osd-03-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
> 
> du:
> 
> osd-01-cluster: 12
> /var/lib/ceph/osd-01-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
> osd-02-cluster: 12
> /var/lib/ceph/osd-02-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
> osd-03-cluster: 12
> /var/lib/ceph/osd-03-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
> 
> 
> mon-01-cluster:~ # rbd diff test
> Offset   Length  Type
> 8388608  4194304 data
> 16777216 4096    data
> 33554432 4194304 data
> 37748736 2048    data
> 
> And after growing it:
> 
> ls:
> 
> clush> find /var/lib/ceph/osd-*/current/0.*head/ -type f -name '*data*' -exec
> ls -l {} \+
> osd-02-cluster: -rw-r--r-- 1 root root 4194304 Apr  6 09:18
> /var/lib/ceph/osd-02-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
> osd-03-cluster: -rw-r--r-- 1 root root 4194304 Apr  6 09:18
> /var/lib/ceph/osd-03-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
> osd-04-cluster: -rw-r--r-- 1 root root 4194304 Apr  6 09:25
> /var/lib/ceph/osd-04-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
> 
> du:
> 
> clush> find /var/lib/ceph/osd-*/current/0.*head/ -type f -name '*data*' -exec
> du -k {} \+
> osd-02-cluster: 12
> /var/lib/ceph/osd-02-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
> osd-03-cluster: 12
> /var/lib/ceph/osd-03-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
> osd-04-cluster: 4100
> /var/lib/ceph/osd-04-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
> 
> Note that "rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0" grew
> from 12 to 4100KB when copied from other OSDs to osd-04.
> 
> Is this something to be expected? Is there any way to make it propagate the
> sparse file info? Or should we think about issuing a "fallocate -d"-like patch
> for writes on filestore?
> 
> (We're using kernel 3.13.0-45-generic but on 4.4.0-31-generic the issue
> remains; our XFS uses 4K bsize).

I think the solution here is to use sparse_read during recovery.  The 
PushOp data representation already supports it; it's just a matter of 
skipping the zeros.  The recovery code could also have an option to check 
for fully-zero regions of the data and turn those into holes as well.  For 
ReplicatedBackend, see build_push_op().
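
To make that concrete, here is a rough sketch (not the actual build_push_op() 
code; the names and the 4K granularity are assumptions) of how a chunk read 
during recovery could be reduced to its non-zero extents, so that zero runs 
become holes on the destination:

// Sketch: turn a freshly read recovery chunk into a list of non-zero extents
// (offset, length) so that zero runs become holes on the destination OSD.
// Granularity, names and types are illustrative, not Ceph's actual code.
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <utility>
#include <vector>

std::vector<std::pair<uint64_t, uint64_t>>
nonzero_extents(const char *buf, uint64_t len, uint64_t chunk_off,
                uint64_t gran = 4096) {
  std::vector<std::pair<uint64_t, uint64_t>> out;
  for (uint64_t pos = 0; pos < len; pos += gran) {
    uint64_t n = std::min(gran, len - pos);
    bool zero = buf[pos] == 0 &&
                std::memcmp(buf + pos, buf + pos + 1, n - 1) == 0;
    if (zero)
      continue;                          // becomes a hole on the target
    if (!out.empty() && out.back().first + out.back().second == chunk_off + pos)
      out.back().second += n;            // merge with the previous extent
    else
      out.emplace_back(chunk_off + pos, n);
  }
  return out;
}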

sage


* Re: Sparse file info in filestore not propagated to other OSDs
  2017-04-06 13:25 ` Sage Weil
@ 2017-04-06 13:30   ` Piotr Dałek
  2017-04-06 13:55     ` Sage Weil
  2017-04-13 14:23   ` Piotr Dałek
  1 sibling, 1 reply; 19+ messages in thread
From: Piotr Dałek @ 2017-04-06 13:30 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

On 04/06/2017 03:25 PM, Sage Weil wrote:
> On Thu, 6 Apr 2017, Piotr Dałek wrote:
>> Hello,
>>
>> We recently had an interesting issue with RBD images and filestore on Jewel
>> 10.2.5:
>> We have a pool with RBD images, all of them mostly untouched (large areas of
>> those images unused), and once we added 3 new OSDs to cluster, objects
>> representing these images grew substantially on new OSDs: objects hosting
>> unused areas of these images on original OSDs remained small (~8K of space
>> actually used, 4M allocated), but on new OSDs were large (4M allocated *and*
>> actually used). After investigation we concluded that Ceph didn't propagate
>> sparse file information during cluster rebalance, resulting in correct data
>> contents on all OSDs, but no sparse file data on new OSDs, hence disk space
>> usage increase on those.
>>
>> [..]
>
> I think the solution here is to use sparse_read during recovery.  The
> PushOp data representation already supports it; it's just a matter of
> skipping the zeros.  The recovery code could also have an option to check
> for fully-zero regions of the data and turn those into holes as well.  For
> ReplicatedBackend, see build_push_op().

Can we abuse that to reduce the amount of regular (client/inter-OSD) network 
traffic?

-- 
Piotr Dałek
piotr.dalek@corp.ovh.com
https://www.ovh.com/us/


* Re: Sparse file info in filestore not propagated to other OSDs
  2017-04-06 13:30   ` Piotr Dałek
@ 2017-04-06 13:55     ` Sage Weil
  2017-04-06 14:24       ` Piotr Dałek
  0 siblings, 1 reply; 19+ messages in thread
From: Sage Weil @ 2017-04-06 13:55 UTC (permalink / raw)
  To: Piotr Dałek; +Cc: ceph-devel


On Thu, 6 Apr 2017, Piotr Dałek wrote:
> On 04/06/2017 03:25 PM, Sage Weil wrote:
> > On Thu, 6 Apr 2017, Piotr Dałek wrote:
> > > Hello,
> > > 
> > > We recently had an interesting issue with RBD images and filestore on
> > > Jewel
> > > 10.2.5:
> > > We have a pool with RBD images, all of them mostly untouched (large areas
> > > of
> > > those images unused), and once we added 3 new OSDs to cluster, objects
> > > representing these images grew substantially on new OSDs: objects hosting
> > > unused areas of these images on original OSDs remained small (~8K of space
> > > actually used, 4M allocated), but on new OSDs were large (4M allocated
> > > *and*
> > > actually used). After investigation we concluded that Ceph didn't
> > > propagate
> > > sparse file information during cluster rebalance, resulting in correct
> > > data
> > > contents on all OSDs, but no sparse file data on new OSDs, hence disk
> > > space
> > > usage increase on those.
> > > 
> > > [..]
> > 
> > I think the solution here is to use sparse_read during recovery.  The
> > PushOp data representation already supports it; it's just a matter of
> > skipping the zeros.  The recovery code could also have an option to check
> > for fully-zero regions of the data and turn those into holes as well.  For
> > ReplicatedBackend, see build_push_op().
> 
> Can we abuse that to reduce amount of regular (client/inter-osd) network
> traffic?

Yeah... I wouldn't call it abuse :).  sparse_read() will use 
SEEK_HOLE/SEEK_DATA on filestore (if enabled).  On bluestore we have the 
metadata on-hand.  It may be a bit slower, though... more complexity 
and such.  They recently implemented something like this for the kernel 
NFS server and found it was faster for very sparse files but the rest of 
the time it was a fair bit slower.
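
For reference, the underlying mechanism is just lseek() with 
SEEK_DATA/SEEK_HOLE; a minimal sketch of enumerating a file's data extents 
with the generic Linux interface (not FileStore's actual sparse_read() code) 
looks like this:

// Minimal sketch: walk a file's data extents using SEEK_DATA/SEEK_HOLE.
// Generic Linux interface only, not FileStore's sparse_read() implementation;
// compile with -D_GNU_SOURCE if the SEEK_* constants are not visible.
#include <unistd.h>
#include <cstdio>

void dump_data_extents(int fd, off_t file_size) {
  off_t pos = 0;
  while (pos < file_size) {
    off_t data = lseek(fd, pos, SEEK_DATA);
    if (data < 0)
      break;                                 // no more data, or unsupported
    off_t hole = lseek(fd, data, SEEK_HOLE); // end of this data extent
    if (hole < 0)
      hole = file_size;
    std::printf("data: %lld..%lld\n", (long long)data, (long long)hole);
    pos = hole;
  }
}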

sage


* Re: Sparse file info in filestore not propagated to other OSDs
  2017-04-06 13:55     ` Sage Weil
@ 2017-04-06 14:24       ` Piotr Dałek
  2017-04-06 14:27         ` Sage Weil
  0 siblings, 1 reply; 19+ messages in thread
From: Piotr Dałek @ 2017-04-06 14:24 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

On 04/06/2017 03:55 PM, Sage Weil wrote:
> On Thu, 6 Apr 2017, Piotr Dałek wrote:
>> On 04/06/2017 03:25 PM, Sage Weil wrote:
>>> On Thu, 6 Apr 2017, Piotr Dałek wrote:
>>>> Hello,
>>>>
>>>> We recently had an interesting issue with RBD images and filestore on
>>>> Jewel
>>>> 10.2.5:
>>>> We have a pool with RBD images, all of them mostly untouched (large areas
>>>> of
>>>> those images unused), and once we added 3 new OSDs to cluster, objects
>>>> representing these images grew substantially on new OSDs: objects hosting
>>>> unused areas of these images on original OSDs remained small (~8K of space
>>>> actually used, 4M allocated), but on new OSDs were large (4M allocated
>>>> *and*
>>>> actually used). After investigation we concluded that Ceph didn't
>>>> propagate
>>>> sparse file information during cluster rebalance, resulting in correct
>>>> data
>>>> contents on all OSDs, but no sparse file data on new OSDs, hence disk
>>>> space
>>>> usage increase on those.
>>>>
>>>> [..]
>>>
>>> I think the solution here is to use sparse_read during recovery.  The
>>> PushOp data representation already supports it; it's just a matter of
>>> skipping the zeros.  The recovery code could also have an option to check
>>> for fully-zero regions of the data and turn those into holes as well.  For
>>> ReplicatedBackend, see build_push_op().
>>
>> Can we abuse that to reduce amount of regular (client/inter-osd) network
>> traffic?
>
> Yeah... I wouldn't call it abuse :).  sparse_read() will use
> SEEK_HOLE/SEEK_DATA on filestore (if enabled).  On bluestore we have the
> metadata on-hand.  It may be a bit slower, though... more complexity
> and such.  They recently implemented something like this for the kernel
> NFS server and found it was faster for very sparse files but the rest of
> the time it was a fair bit slower.

I was wondering if we could modify regular reads so that they keep behaving 
as they do today, but don't transmit zeroed-out pages/blocks/objects 
(in other words, you would still get bufferptrs full of zeroes, but they 
wouldn't be transmitted as such over the wire; a specialized case of RLE 
compression). That shouldn't be much slower. But I don't really see how 
that would work without a protocol change... Well, at least it's possible to 
replace some of the calls to read with sparse read, letting the filesystem / 
file store metadata do the heavy lifting for us.

-- 
Piotr Dałek
piotr.dalek@corp.ovh.com
https://www.ovh.com/us/


* Re: Sparse file info in filestore not propagated to other OSDs
  2017-04-06 14:24       ` Piotr Dałek
@ 2017-04-06 14:27         ` Sage Weil
  2017-04-06 15:50           ` Jason Dillaman
  2017-04-07  6:46           ` Piotr Dałek
  0 siblings, 2 replies; 19+ messages in thread
From: Sage Weil @ 2017-04-06 14:27 UTC (permalink / raw)
  To: Piotr Dałek; +Cc: ceph-devel


On Thu, 6 Apr 2017, Piotr Dałek wrote:
> On 04/06/2017 03:55 PM, Sage Weil wrote:
> > On Thu, 6 Apr 2017, Piotr Dałek wrote:
> > > On 04/06/2017 03:25 PM, Sage Weil wrote:
> > > > On Thu, 6 Apr 2017, Piotr Dałek wrote:
> > > > > Hello,
> > > > > 
> > > > > We recently had an interesting issue with RBD images and filestore on
> > > > > Jewel
> > > > > 10.2.5:
> > > > > We have a pool with RBD images, all of them mostly untouched (large
> > > > > areas
> > > > > of
> > > > > those images unused), and once we added 3 new OSDs to cluster, objects
> > > > > representing these images grew substantially on new OSDs: objects
> > > > > hosting
> > > > > unused areas of these images on original OSDs remained small (~8K of
> > > > > space
> > > > > actually used, 4M allocated), but on new OSDs were large (4M allocated
> > > > > *and*
> > > > > actually used). After investigation we concluded that Ceph didn't
> > > > > propagate
> > > > > sparse file information during cluster rebalance, resulting in correct
> > > > > data
> > > > > contents on all OSDs, but no sparse file data on new OSDs, hence disk
> > > > > space
> > > > > usage increase on those.
> > > > > 
> > > > > [..]
> > > > 
> > > > I think the solution here is to use sparse_read during recovery.  The
> > > > PushOp data representation already supports it; it's just a matter of
> > > > skipping the zeros.  The recovery code could also have an option to
> > > > check
> > > > for fully-zero regions of the data and turn those into holes as well.
> > > > For
> > > > ReplicatedBackend, see build_push_op().
> > > 
> > > Can we abuse that to reduce amount of regular (client/inter-osd) network
> > > traffic?
> > 
> > Yeah... I wouldn't call it abuse :).  sparse_read() will use
> > SEEK_HOLE/SEEK_DATA on filestore (if enabled).  On bluestore we have the
> > metadata on-hand.  It may be a bit slower, though... more complexity
> > and such.  They recently implemented something like this for the kernel
> > NFS server and found it was faster for very sparse files but the rest of
> > the time it was a fair bit slower.
> 
> I was wondering if we could modify regular reads in a way that makes them work
> as it used to work, but not transmit zeroed out pages/blocks/objects (in other
> words, you still would get bufferptrs full of zeroes, but they wouldn't be
> transmitted as such over the wire; specialized case of RLE compression). That
> shouldn't be so much slower. But I don't really see how that would work
> without protocol change... Well, at least it's possible to replace some of
> calls to read with sparse read, utilizing filesystem/file store metadata to do
> heavy lifting for us.

IIRC librbd used to have an option to do sparse-read all the time instead 
of read (I think this was in ObjectCacher somewhere?) but I think it got 
turned off for some reason?  Memory is very fuzzy here.  In any case, 
changing the client to use sparse-read is the way to do it, I think.  
I'm a bit skeptical that this will have much of an impact, though.

sage


* Re: Sparse file info in filestore not propagated to other OSDs
  2017-04-06 14:27         ` Sage Weil
@ 2017-04-06 15:50           ` Jason Dillaman
  2017-04-06 17:52             ` Josh Durgin
  2017-04-07  6:46           ` Piotr Dałek
  1 sibling, 1 reply; 19+ messages in thread
From: Jason Dillaman @ 2017-04-06 15:50 UTC (permalink / raw)
  To: Sage Weil; +Cc: Piotr Dałek, ceph-devel

I don't recall a configuration option, but librbd always uses sparse
reads when the cache is disabled and never uses sparse reads for
cache-based reads. I'm pretty sure there wasn't a rationale for the
split -- instead, I've always assumed it was an oversight when that
feature was added years ago.

On Thu, Apr 6, 2017 at 10:27 AM, Sage Weil <sage@newdream.net> wrote:
> On Thu, 6 Apr 2017, Piotr Dałek wrote:
>> On 04/06/2017 03:55 PM, Sage Weil wrote:
>> > On Thu, 6 Apr 2017, Piotr Dałek wrote:
>> > > On 04/06/2017 03:25 PM, Sage Weil wrote:
>> > > > On Thu, 6 Apr 2017, Piotr Dałek wrote:
>> > > > > Hello,
>> > > > >
>> > > > > We recently had an interesting issue with RBD images and filestore on
>> > > > > Jewel
>> > > > > 10.2.5:
>> > > > > We have a pool with RBD images, all of them mostly untouched (large
>> > > > > areas
>> > > > > of
>> > > > > those images unused), and once we added 3 new OSDs to cluster, objects
>> > > > > representing these images grew substantially on new OSDs: objects
>> > > > > hosting
>> > > > > unused areas of these images on original OSDs remained small (~8K of
>> > > > > space
>> > > > > actually used, 4M allocated), but on new OSDs were large (4M allocated
>> > > > > *and*
>> > > > > actually used). After investigation we concluded that Ceph didn't
>> > > > > propagate
>> > > > > sparse file information during cluster rebalance, resulting in correct
>> > > > > data
>> > > > > contents on all OSDs, but no sparse file data on new OSDs, hence disk
>> > > > > space
>> > > > > usage increase on those.
>> > > > >
>> > > > > [..]
>> > > >
>> > > > I think the solution here is to use sparse_read during recovery.  The
>> > > > PushOp data representation already supports it; it's just a matter of
>> > > > skipping the zeros.  The recovery code could also have an option to
>> > > > check
>> > > > for fully-zero regions of the data and turn those into holes as well.
>> > > > For
>> > > > ReplicatedBackend, see build_push_op().
>> > >
>> > > Can we abuse that to reduce amount of regular (client/inter-osd) network
>> > > traffic?
>> >
>> > Yeah... I wouldn't call it abuse :).  sparse_read() will use
>> > SEEK_HOLE/SEEK_DATA on filestore (if enabled).  On bluestore we have the
>> > metadata on-hand.  It may be a bit slower, though... more complexity
>> > and such.  They recently implemented something like this for the kernel
>> > NFS server and found it was faster for very sparse files but the rest of
>> > the time it was a fair bit slower.
>>
>> I was wondering if we could modify regular reads in a way that makes them work
>> as it used to work, but not transmit zeroed out pages/blocks/objects (in other
>> words, you still would get bufferptrs full of zeroes, but they wouldn't be
>> transmitted as such over the wire; specialized case of RLE compression). That
>> shouldn't be so much slower. But I don't really see how that would work
>> without protocol change... Well, at least it's possible to replace some of
>> calls to read with sparse read, utilizing filesystem/file store metadata to do
>> heavy lifting for us.
>
> IIRC librbd used to have an option to do sparse-read all the time instead
> of read (I think this was in ObjectCacher somewhere?) but I think it got
> turned off for some reason?  Memory is very fuzzy here.  In any case,
> changing the client to use sparse-read is the way to do it, I think.
> I'm a bit skeptical that this will have much of an impact, though.
>
> sage



-- 
Jason


* Re: Sparse file info in filestore not propagated to other OSDs
  2017-04-06 15:50           ` Jason Dillaman
@ 2017-04-06 17:52             ` Josh Durgin
  0 siblings, 0 replies; 19+ messages in thread
From: Josh Durgin @ 2017-04-06 17:52 UTC (permalink / raw)
  To: dillaman, Sage Weil; +Cc: Piotr Dałek, ceph-devel

On 04/06/2017 08:50 AM, Jason Dillaman wrote:
> I don't recall a configuration option, but librbd always uses sparse
> reads when the cache is disabled and never uses sparse reads for
> cache-based reads. I'm pretty sure there wasn't a rationale for the
> split -- instead, I've always assumed it was an oversight when that
> feature was added years ago.

It was turned off on the osd side (with the 'filestore fiemap' option)
due to (at the time) buggy kernel behavior with fiemap.

I think I didn't bother implementing it in the cache because of this -
it wouldn't have been safe to use it at the time, and it would have
required more work in the cache to support sparse data at that point
too.

Nowadays seek_data/hole should be reliable (there have been xfstests
for it for a few years) so it's worth looking into for the
recovery case. The rbd cache could be pretty easily modified to support
sparse reads at this point too, which would help the keep the
copy-on-write case sparse.

Josh


* Re: Sparse file info in filestore not propagated to other OSDs
  2017-04-06 14:27         ` Sage Weil
  2017-04-06 15:50           ` Jason Dillaman
@ 2017-04-07  6:46           ` Piotr Dałek
  1 sibling, 0 replies; 19+ messages in thread
From: Piotr Dałek @ 2017-04-07  6:46 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

On 04/06/2017 04:27 PM, Sage Weil wrote:
> On Thu, 6 Apr 2017, Piotr Dałek wrote:
>> On 04/06/2017 03:55 PM, Sage Weil wrote:
>>> On Thu, 6 Apr 2017, Piotr Dałek wrote:
>>>> On 04/06/2017 03:25 PM, Sage Weil wrote:
>>>>> On Thu, 6 Apr 2017, Piotr Dałek wrote:
>>>>>> [..]
>>>>>
>>>>> I think the solution here is to use sparse_read during recovery.  The
>>>>> PushOp data representation already supports it; it's just a matter of
>>>>> skipping the zeros.  The recovery code could also have an option to
>>>>> check
>>>>> for fully-zero regions of the data and turn those into holes as well.
>>>>> For
>>>>> ReplicatedBackend, see build_push_op().
>>>>
>>>> Can we abuse that to reduce amount of regular (client/inter-osd) network
>>>> traffic?
>>>
>>> Yeah... I wouldn't call it abuse :).  sparse_read() will use
>>> SEEK_HOLE/SEEK_DATA on filestore (if enabled).  On bluestore we have the
>>> metadata on-hand.  It may be a bit slower, though... more complexity
>>> and such.  They recently implemented something like this for the kernel
>>> NFS server and found it was faster for very sparse files but the rest of
>>> the time it was a fair bit slower.
>>
>> I was wondering if we could modify regular reads in a way that makes them work
>> as it used to work, but not transmit zeroed out pages/blocks/objects (in other
>> words, you still would get bufferptrs full of zeroes, but they wouldn't be
>> transmitted as such over the wire; specialized case of RLE compression). That
>> shouldn't be so much slower. But I don't really see how that would work
>> without protocol change... Well, at least it's possible to replace some of
>> calls to read with sparse read, utilizing filesystem/file store metadata to do
>> heavy lifting for us.
>
> IIRC librbd used to have an option to do sparse-read all the time instead
> of read (I think this was in ObjectCacher somewhere?) but I think it got
> turned off for some reason?  Memory is very fuzzy here.  In any case,
> changing the client to use sparse-read is the way to do it, I think.
> I'm a bit skeptical that this will have much of an impact, though.

I don't expect it to be a big win either; having even a simple RLE 
compressor would be more useful (and would, in particular, make "rados bench" 
useless), but if sparse reads are also less bandwidth-intensive, it could be 
meaningful for many large cluster operators and would also be easier to 
implement without breaking too much.

-- 
Piotr Dałek
piotr.dalek@corp.ovh.com
https://www.ovh.com/us/


* Re: Sparse file info in filestore not propagated to other OSDs
  2017-04-06 13:25 ` Sage Weil
  2017-04-06 13:30   ` Piotr Dałek
@ 2017-04-13 14:23   ` Piotr Dałek
       [not found]     ` <d4bde447-f179-aeca-bac5-636fa40ccba5-Rm6v+N6rxxBWk0Htik3J/w@public.gmane.org>
  1 sibling, 1 reply; 19+ messages in thread
From: Piotr Dałek @ 2017-04-13 14:23 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

On 04/06/2017 03:25 PM, Sage Weil wrote:
> On Thu, 6 Apr 2017, Piotr Dałek wrote:
>> Hello,
>>
>> We recently had an interesting issue with RBD images and filestore on Jewel
>> 10.2.5:
>> We have a pool with RBD images, all of them mostly untouched (large areas of
>> those images unused), and once we added 3 new OSDs to cluster, objects
>> representing these images grew substantially on new OSDs: objects hosting
>> unused areas of these images on original OSDs remained small (~8K of space
>> actually used, 4M allocated), but on new OSDs were large (4M allocated *and*
>> actually used). After investigation we concluded that Ceph didn't propagate
>> sparse file information during cluster rebalance, resulting in correct data
>> contents on all OSDs, but no sparse file data on new OSDs, hence disk space
>> usage increase on those.
>>
>> Example on test cluster, before growing it by one OSD:
>>
>> ls:
>>
>> osd-01-cluster: -rw-r--r-- 1 root root 4194304 Apr  6 09:18
>> /var/lib/ceph/osd-01-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
>> osd-02-cluster: -rw-r--r-- 1 root root 4194304 Apr  6 09:18
>> /var/lib/ceph/osd-02-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
>> osd-03-cluster: -rw-r--r-- 1 root root 4194304 Apr  6 09:18
>> /var/lib/ceph/osd-03-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
>>
>> du:
>>
>> osd-01-cluster: 12
>> /var/lib/ceph/osd-01-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
>> osd-02-cluster: 12
>> /var/lib/ceph/osd-02-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
>> osd-03-cluster: 12
>> /var/lib/ceph/osd-03-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
>>
>>
>> mon-01-cluster:~ # rbd diff test
>> Offset   Length  Type
>> 8388608  4194304 data
>> 16777216 4096    data
>> 33554432 4194304 data
>> 37748736 2048    data
>>
>> And after growing it:
>>
>> ls:
>>
>> clush> find /var/lib/ceph/osd-*/current/0.*head/ -type f -name '*data*' -exec
>> ls -l {} \+
>> osd-02-cluster: -rw-r--r-- 1 root root 4194304 Apr  6 09:18
>> /var/lib/ceph/osd-02-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
>> osd-03-cluster: -rw-r--r-- 1 root root 4194304 Apr  6 09:18
>> /var/lib/ceph/osd-03-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
>> osd-04-cluster: -rw-r--r-- 1 root root 4194304 Apr  6 09:25
>> /var/lib/ceph/osd-04-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
>>
>> du:
>>
>> clush> find /var/lib/ceph/osd-*/current/0.*head/ -type f -name '*data*' -exec
>> du -k {} \+
>> osd-02-cluster: 12
>> /var/lib/ceph/osd-02-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
>> osd-03-cluster: 12
>> /var/lib/ceph/osd-03-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
>> osd-04-cluster: 4100
>> /var/lib/ceph/osd-04-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
>>
>> Note that "rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0" grew
>> from 12 to 4100KB when copied from other OSDs to osd-04.
>>
>> Is this something to be expected? Is there any way to make it propagate the
>> sparse file info? Or should we think about issuing a "fallocate -d"-like patch
>> for writes on filestore?
>>
>> (We're using kernel 3.13.0-45-generic but on 4.4.0-31-generic the issue
>> remains; our XFS uses 4K bsize).
>
> I think the solution here is to use sparse_read during recovery.  The
> PushOp data representation already supports it; it's just a matter of
> skipping the zeros.  The recovery code could also have an option to check
> for fully-zero regions of the data and turn those into holes as well.  For
> ReplicatedBackend, see build_push_op().

So far it turns out that there's an even easier solution: we just enabled 
"filestore seek hole" on a test cluster, and that seems to fix the problem 
for us. We'll see if fiemap works too.
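
For reference, enabling it amounts to a ceph.conf change plus an OSD restart, 
roughly along these lines; the exact option name should be verified against 
the release in use (newer code spells it filestore_seek_data_hole):

[osd]
# assumption: verify the exact option name against your release;
# newer code spells it filestore_seek_data_hole
filestore seek data hole = true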

-- 
Piotr Dałek
piotr.dalek@corp.ovh.com
https://www.ovh.com/us/


* Re: Sparse file info in filestore not propagated to other OSDs
       [not found]     ` <d4bde447-f179-aeca-bac5-636fa40ccba5-Rm6v+N6rxxBWk0Htik3J/w@public.gmane.org>
@ 2017-06-14  6:30       ` Paweł Sadowski
  2017-06-14 13:44         ` Sage Weil
  0 siblings, 1 reply; 19+ messages in thread
From: Paweł Sadowski @ 2017-06-14  6:30 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel, ceph-users



On 04/13/2017 04:23 PM, Piotr Dałek wrote:
> On 04/06/2017 03:25 PM, Sage Weil wrote:
>> On Thu, 6 Apr 2017, Piotr Dałek wrote:
>>> Hello,
>>>
>>> We recently had an interesting issue with RBD images and filestore
>>> on Jewel
>>> 10.2.5:
>>> We have a pool with RBD images, all of them mostly untouched (large
>>> areas of
>>> those images unused), and once we added 3 new OSDs to cluster, objects
>>> representing these images grew substantially on new OSDs: objects
>>> hosting
>>> unused areas of these images on original OSDs remained small (~8K of
>>> space
>>> actually used, 4M allocated), but on new OSDs were large (4M
>>> allocated *and*
>>> actually used). After investigation we concluded that Ceph didn't
>>> propagate
>>> sparse file information during cluster rebalance, resulting in
>>> correct data
>>> contents on all OSDs, but no sparse file data on new OSDs, hence
>>> disk space
>>> usage increase on those.
>>>
>>> Example on test cluster, before growing it by one OSD:
>>>
>>> ls:
>>>
>>> osd-01-cluster: -rw-r--r-- 1 root root 4194304 Apr  6 09:18
>>> /var/lib/ceph/osd-01-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
>>>
>>> osd-02-cluster: -rw-r--r-- 1 root root 4194304 Apr  6 09:18
>>> /var/lib/ceph/osd-02-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
>>>
>>> osd-03-cluster: -rw-r--r-- 1 root root 4194304 Apr  6 09:18
>>> /var/lib/ceph/osd-03-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
>>>
>>>
>>> du:
>>>
>>> osd-01-cluster: 12
>>> /var/lib/ceph/osd-01-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
>>>
>>> osd-02-cluster: 12
>>> /var/lib/ceph/osd-02-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
>>>
>>> osd-03-cluster: 12
>>> /var/lib/ceph/osd-03-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
>>>
>>>
>>>
>>> mon-01-cluster:~ # rbd diff test
>>> Offset   Length  Type
>>> 8388608  4194304 data
>>> 16777216 4096    data
>>> 33554432 4194304 data
>>> 37748736 2048    data
>>>
>>> And after growing it:
>>>
>>> ls:
>>>
>>> clush> find /var/lib/ceph/osd-*/current/0.*head/ -type f -name
>>> '*data*' -exec
>>> ls -l {} \+
>>> osd-02-cluster: -rw-r--r-- 1 root root 4194304 Apr  6 09:18
>>> /var/lib/ceph/osd-02-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
>>>
>>> osd-03-cluster: -rw-r--r-- 1 root root 4194304 Apr  6 09:18
>>> /var/lib/ceph/osd-03-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
>>>
>>> osd-04-cluster: -rw-r--r-- 1 root root 4194304 Apr  6 09:25
>>> /var/lib/ceph/osd-04-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
>>>
>>>
>>> du:
>>>
>>> clush> find /var/lib/ceph/osd-*/current/0.*head/ -type f -name
>>> '*data*' -exec
>>> du -k {} \+
>>> osd-02-cluster: 12
>>> /var/lib/ceph/osd-02-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
>>>
>>> osd-03-cluster: 12
>>> /var/lib/ceph/osd-03-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
>>>
>>> osd-04-cluster: 4100
>>> /var/lib/ceph/osd-04-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
>>>
>>>
>>> Note that
>>> "rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0" grew
>>> from 12 to 4100KB when copied from other OSDs to osd-04.
>>>
>>> Is this something to be expected? Is there any way to make it
>>> propagate the
>>> sparse file info? Or should we think about issuing a "fallocate
>>> -d"-like patch
>>> for writes on filestore?
>>>
>>> (We're using kernel 3.13.0-45-generic but on 4.4.0-31-generic the issue
>>> remains; our XFS uses 4K bsize).
>>
>> I think the solution here is to use sparse_read during recovery.  The
>> PushOp data representation already supports it; it's just a matter of
>> skipping the zeros.  The recovery code could also have an option to
>> check
>> for fully-zero regions of the data and turn those into holes as
>> well.  For
>> ReplicatedBackend, see build_push_op().
>
> So far it turns out that there's even easier solution, we just enabled
> "filestore seek hole" on some test cluster and that seems to fix the
> problem for us. We'll see if fiemap works too.
>

Is it safe to enable "filestore seek hole"? Are there any tests that
verify that everything related to RBD works fine with this enabled?
Can we make this enabled by default?

I tested a few of our production images and it seems that about 30% of the
data is sparse. This sparseness will be lost on any cluster-wide event
(adding/removing nodes, PG growth, recovery).
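
For anyone who wants to check their own data: one way to estimate per-object 
sparseness is to compare the allocated size with the apparent size of the 
filestore object files, roughly like the hedged sketch below (illustrative 
only; a du/ls comparison gives the same information):

// Sketch: estimate how sparse a file is by comparing allocated size
// (st_blocks * 512) with apparent size (st_size). Illustrative only.
#include <sys/stat.h>
#include <cstdio>

double sparse_fraction(const char *path) {
  struct stat st;
  if (stat(path, &st) != 0 || st.st_size == 0)
    return 0.0;
  double allocated = (double)st.st_blocks * 512.0;  // POSIX 512-byte units
  double apparent  = (double)st.st_size;
  return allocated >= apparent ? 0.0 : 1.0 - allocated / apparent;
}

int main(int argc, char **argv) {
  for (int i = 1; i < argc; ++i)
    std::printf("%s: %.0f%% sparse\n",
                argv[i], 100.0 * sparse_fraction(argv[i]));
  return 0;
}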

How is/will this be handled in BlueStore?


(added ceph-users as it might interest others also).

-- 
PS


* Re: Sparse file info in filestore not propagated to other OSDs
  2017-06-14  6:30       ` Paweł Sadowski
@ 2017-06-14 13:44         ` Sage Weil
       [not found]           ` <alpine.DEB.2.11.1706141340520.3646-qHenpvqtifaMSRpgCs4c+g@public.gmane.org>
  0 siblings, 1 reply; 19+ messages in thread
From: Sage Weil @ 2017-06-14 13:44 UTC (permalink / raw)
  To: Paweł Sadowski; +Cc: Piotr Dałek, ceph-devel, ceph-users


On Wed, 14 Jun 2017, Paweł Sadowski wrote:
> On 04/13/2017 04:23 PM, Piotr Dałek wrote:
> > On 04/06/2017 03:25 PM, Sage Weil wrote:
> >> On Thu, 6 Apr 2017, Piotr Dałek wrote:
> >>> Hello,
> >>>
> >>> We recently had an interesting issue with RBD images and filestore
> >>> on Jewel
> >>> 10.2.5:
> >>> We have a pool with RBD images, all of them mostly untouched (large
> >>> areas of
> >>> those images unused), and once we added 3 new OSDs to cluster, objects
> >>> representing these images grew substantially on new OSDs: objects
> >>> hosting
> >>> unused areas of these images on original OSDs remained small (~8K of
> >>> space
> >>> actually used, 4M allocated), but on new OSDs were large (4M
> >>> allocated *and*
> >>> actually used). After investigation we concluded that Ceph didn't
> >>> propagate
> >>> sparse file information during cluster rebalance, resulting in
> >>> correct data
> >>> contents on all OSDs, but no sparse file data on new OSDs, hence
> >>> disk space
> >>> usage increase on those.
> >>>
> >>> Example on test cluster, before growing it by one OSD:
> >>>
> >>> ls:
> >>>
> >>> osd-01-cluster: -rw-r--r-- 1 root root 4194304 Apr  6 09:18
> >>> /var/lib/ceph/osd-01-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
> >>>
> >>> osd-02-cluster: -rw-r--r-- 1 root root 4194304 Apr  6 09:18
> >>> /var/lib/ceph/osd-02-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
> >>>
> >>> osd-03-cluster: -rw-r--r-- 1 root root 4194304 Apr  6 09:18
> >>> /var/lib/ceph/osd-03-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
> >>>
> >>>
> >>> du:
> >>>
> >>> osd-01-cluster: 12
> >>> /var/lib/ceph/osd-01-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
> >>>
> >>> osd-02-cluster: 12
> >>> /var/lib/ceph/osd-02-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
> >>>
> >>> osd-03-cluster: 12
> >>> /var/lib/ceph/osd-03-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
> >>>
> >>>
> >>>
> >>> mon-01-cluster:~ # rbd diff test
> >>> Offset   Length  Type
> >>> 8388608  4194304 data
> >>> 16777216 4096    data
> >>> 33554432 4194304 data
> >>> 37748736 2048    data
> >>>
> >>> And after growing it:
> >>>
> >>> ls:
> >>>
> >>> clush> find /var/lib/ceph/osd-*/current/0.*head/ -type f -name
> >>> '*data*' -exec
> >>> ls -l {} \+
> >>> osd-02-cluster: -rw-r--r-- 1 root root 4194304 Apr  6 09:18
> >>> /var/lib/ceph/osd-02-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
> >>>
> >>> osd-03-cluster: -rw-r--r-- 1 root root 4194304 Apr  6 09:18
> >>> /var/lib/ceph/osd-03-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
> >>>
> >>> osd-04-cluster: -rw-r--r-- 1 root root 4194304 Apr  6 09:25
> >>> /var/lib/ceph/osd-04-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
> >>>
> >>>
> >>> du:
> >>>
> >>> clush> find /var/lib/ceph/osd-*/current/0.*head/ -type f -name
> >>> '*data*' -exec
> >>> du -k {} \+
> >>> osd-02-cluster: 12
> >>> /var/lib/ceph/osd-02-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
> >>>
> >>> osd-03-cluster: 12
> >>> /var/lib/ceph/osd-03-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
> >>>
> >>> osd-04-cluster: 4100
> >>> /var/lib/ceph/osd-04-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
> >>>
> >>>
> >>> Note that
> >>> "rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0" grew
> >>> from 12 to 4100KB when copied from other OSDs to osd-04.
> >>>
> >>> Is this something to be expected? Is there any way to make it
> >>> propagate the
> >>> sparse file info? Or should we think about issuing a "fallocate
> >>> -d"-like patch
> >>> for writes on filestore?
> >>>
> >>> (We're using kernel 3.13.0-45-generic but on 4.4.0-31-generic the issue
> >>> remains; our XFS uses 4K bsize).
> >>
> >> I think the solution here is to use sparse_read during recovery.  The
> >> PushOp data representation already supports it; it's just a matter of
> >> skipping the zeros.  The recovery code could also have an option to
> >> check
> >> for fully-zero regions of the data and turn those into holes as
> >> well.  For
> >> ReplicatedBackend, see build_push_op().
> >
> > So far it turns out that there's even easier solution, we just enabled
> > "filestore seek hole" on some test cluster and that seems to fix the
> > problem for us. We'll see if fiemap works too.
> >
> 
> Is it safe to enable "filestore seek hole", are there any tests that
> verifies that everything related to RBD works fine with this enabled?
> Can we make this enabled by default?

We would need to enable it in the qa environment first.  The risk here is 
that users run a broad range of kernels and we are exposing ourselves to 
any bugs in any kernel version they may run.  I'd prefer to leave it off 
by default.  We can enable it in the qa suite, though, which covers 
centos7 (latest kernel) and ubuntu xenial and trusty.
 
> I tested on few of our production images and it seems that about 30% is
> sparse. This will be lost on any cluster wide event (add/remove nodes,
> PG grow, recovery).
> 
> How this is/will be handled in BlueStore?

BlueStore exposes the same sparseness metadata that enabling the 
filestore seek hole or fiemap options does, so it won't be a problem 
there.

I think the only thing that we could potentially add is zero detection 
on writes (so that explicitly writing zeros consumes no space).  We'd 
have to be a bit careful measuring the performance impact of that check on 
non-zero writes.
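
The zero check itself can be kept cheap; a hedged sketch of the usual idiom 
(the function name is made up, and measuring its cost on non-zero writes is 
exactly the open question):

// Sketch of a cheap "is this chunk all zeroes?" test: check the first byte,
// then compare the buffer against itself shifted by one byte. On typical
// non-zero data this bails out early, which keeps the cost on non-zero
// writes small. Function name and usage are illustrative only.
#include <cstring>
#include <cstddef>

inline bool chunk_is_zero(const char *buf, size_t len) {
  return len == 0 || (buf[0] == 0 && std::memcmp(buf, buf + 1, len - 1) == 0);
}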

sage


* Re: Sparse file info in filestore not propagated to other OSDs
       [not found]           ` <alpine.DEB.2.11.1706141340520.3646-qHenpvqtifaMSRpgCs4c+g@public.gmane.org>
@ 2017-06-21  7:05             ` Piotr Dałek
  2017-06-21 13:24               ` Sage Weil
  2017-06-21 13:35               ` [ceph-users] " Jason Dillaman
  0 siblings, 2 replies; 19+ messages in thread
From: Piotr Dałek @ 2017-06-21  7:05 UTC (permalink / raw)
  To: Sage Weil, Paweł Sadowski; +Cc: ceph-devel, ceph-users

On 17-06-14 03:44 PM, Sage Weil wrote:
> On Wed, 14 Jun 2017, Paweł Sadowski wrote:
>> On 04/13/2017 04:23 PM, Piotr Dałek wrote:
>>> On 04/06/2017 03:25 PM, Sage Weil wrote:
>>>> On Thu, 6 Apr 2017, Piotr Dałek wrote:
>>>>> [snip]
>>>>
>>>> I think the solution here is to use sparse_read during recovery.  The
>>>> PushOp data representation already supports it; it's just a matter of
>>>> skipping the zeros.  The recovery code could also have an option to
>>>> check
>>>> for fully-zero regions of the data and turn those into holes as
>>>> well.  For
>>>> ReplicatedBackend, see build_push_op().
>>>
>>> So far it turns out that there's even easier solution, we just enabled
>>> "filestore seek hole" on some test cluster and that seems to fix the
>>> problem for us. We'll see if fiemap works too.
>>>
>>
>> Is it safe to enable "filestore seek hole", are there any tests that
>> verifies that everything related to RBD works fine with this enabled?
>> Can we make this enabled by default?
> 
> We would need to enable it in the qa environment first.  The risk here is
> that users run a broad range of kernels and we are exposing ourselves to
> any bugs in any kernel version they may run.  I'd prefer to leave it off
> by default.

That's a common regression? If not, we could blacklist particular kernels 
and call it a day.
> We can enable it in the qa suite, though, which covers
> centos7 (latest kernel) and ubuntu xenial and trusty.

+1. Do you need some particular PR for that?

>> I tested on few of our production images and it seems that about 30% is
>> sparse. This will be lost on any cluster wide event (add/remove nodes,
>> PG grow, recovery).
>>
>> How this is/will be handled in BlueStore?
> 
> BlueStore exposes the same sparseness metadata that enabling the
> filestore seek hole or fiemap options does, so it won't be a problem
> there.
> 
> I think the only thing that we could potentially add is zero detection
> on writes (so that explicitly writing zeros consumes no space).  We'd
> have to be a bit careful measuring the performance impact of that check on
> non-zero writes.

I saw that RBD (librbd) does that - replacing writes with discards when the 
buffer contains only zeros. Some code that does the same in librados could 
be added and it shouldn't impact performance much; the current implementation 
of mem_is_zero is fast and shouldn't be a big problem.
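
A rough sketch of what that client-side translation could look like with the 
librados C++ API (the exact calls, and whether zero() matches write() 
semantics for object creation and size extension, are assumptions to verify; 
this is not a proposed patch):

// Sketch only: replace an all-zero write with a zero (deallocate) op before
// it goes to the OSDs. Whether zero() matches write() semantics exactly
// (object creation, size extension) is an assumption that must be verified.
#include <rados/librados.hpp>
#include <cstdint>
#include <string>

int write_maybe_sparse(librados::IoCtx& ioctx, const std::string& oid,
                       librados::bufferlist& bl, uint64_t off) {
  librados::ObjectWriteOperation op;
  if (bl.is_zero())
    op.zero(off, bl.length());   // all zeroes: deallocate instead of writing
  else
    op.write(off, bl);           // regular write
  return ioctx.operate(oid, &op);
}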

-- 
Piotr Dałek
piotr.dalek@corp.ovh.com
https://www.ovh.com/us/


* Re: Sparse file info in filestore not propagated to other OSDs
  2017-06-21  7:05             ` Piotr Dałek
@ 2017-06-21 13:24               ` Sage Weil
  2017-06-21 13:46                 ` Piotr Dałek
  2017-06-26 11:59                 ` Piotr Dalek
  2017-06-21 13:35               ` [ceph-users] " Jason Dillaman
  1 sibling, 2 replies; 19+ messages in thread
From: Sage Weil @ 2017-06-21 13:24 UTC (permalink / raw)
  To: Piotr Dałek; +Cc: Paweł Sadowski, ceph-devel, ceph-users


On Wed, 21 Jun 2017, Piotr Dałek wrote:
> On 17-06-14 03:44 PM, Sage Weil wrote:
> > On Wed, 14 Jun 2017, Paweł Sadowski wrote:
> > > On 04/13/2017 04:23 PM, Piotr Dałek wrote:
> > > > On 04/06/2017 03:25 PM, Sage Weil wrote:
> > > > > On Thu, 6 Apr 2017, Piotr Dałek wrote:
> > > > > > [snip]
> > > > > 
> > > > > I think the solution here is to use sparse_read during recovery.  The
> > > > > PushOp data representation already supports it; it's just a matter of
> > > > > skipping the zeros.  The recovery code could also have an option to
> > > > > check
> > > > > for fully-zero regions of the data and turn those into holes as
> > > > > well.  For
> > > > > ReplicatedBackend, see build_push_op().
> > > > 
> > > > So far it turns out that there's even easier solution, we just enabled
> > > > "filestore seek hole" on some test cluster and that seems to fix the
> > > > problem for us. We'll see if fiemap works too.
> > > > 
> > > 
> > > Is it safe to enable "filestore seek hole", are there any tests that
> > > verifies that everything related to RBD works fine with this enabled?
> > > Can we make this enabled by default?
> > 
> > We would need to enable it in the qa environment first.  The risk here is
> > that users run a broad range of kernels and we are exposing ourselves to
> > any bugs in any kernel version they may run.  I'd prefer to leave it off
> > by default.
> 
> That's a common regression? If not, we could blacklist particular kernels and
> call it a day.
>  > We can enable it in the qa suite, though, which covers
> > centos7 (latest kernel) and ubuntu xenial and trusty.
> 
> +1. Do you need some particular PR for that?

Sure.  How about a patch that adds the config option to several of the 
files in qa/suites/rados/thrash/thrashers?

> > > I tested on few of our production images and it seems that about 30% is
> > > sparse. This will be lost on any cluster wide event (add/remove nodes,
> > > PG grow, recovery).
> > > 
> > > How this is/will be handled in BlueStore?
> > 
> > BlueStore exposes the same sparseness metadata that enabling the
> > filestore seek hole or fiemap options does, so it won't be a problem
> > there.
> > 
> > I think the only thing that we could potentially add is zero detection
> > on writes (so that explicitly writing zeros consumes no space).  We'd
> > have to be a bit careful measuring the performance impact of that check on
> > non-zero writes.
> 
> I saw that RBD (librbd) does that - replacing writes with discards when buffer
> contains only zeros. Some code that does the same in librados could be added
> and it shouldn't impact performance much, current implementation of
> mem_is_zero is fast and shouldn't be a big problem.

I'd rather not have librados silently translating requests; I think it 
makes more sense to do any zero checking in bluestore.  _do_write_small 
and _do_write_big already break writes into (aligned) chunks; that would 
be an easy place to add the check.

sage


* Re: [ceph-users] Sparse file info in filestore not propagated to other OSDs
  2017-06-21  7:05             ` Piotr Dałek
  2017-06-21 13:24               ` Sage Weil
@ 2017-06-21 13:35               ` Jason Dillaman
       [not found]                 ` <CA+aFP1DJ3L3Pg0r4Pj3o7JoNTNnBRRs0u_nnb2JYz4nGxafUTA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  1 sibling, 1 reply; 19+ messages in thread
From: Jason Dillaman @ 2017-06-21 13:35 UTC (permalink / raw)
  To: Piotr Dałek; +Cc: Sage Weil, Paweł Sadowski, ceph-devel, ceph-users

On Wed, Jun 21, 2017 at 3:05 AM, Piotr Dałek <piotr.dalek@corp.ovh.com> wrote:
> I saw that RBD (librbd) does that - replacing writes with discards when
> buffer contains only zeros. Some code that does the same in librados could
> be added and it shouldn't impact performance much, current implementation of
> mem_is_zero is fast and shouldn't be a big problem.

I'm pretty sure the only place where librbd converts a write to a
discard is actually the specialized "writesame" operation used by
tcmu-runner, as an optimization for ESX's initialization of a new
image.

-- 
Jason


* Re: Sparse file info in filestore not propagated to other OSDs
  2017-06-21 13:24               ` Sage Weil
@ 2017-06-21 13:46                 ` Piotr Dałek
       [not found]                   ` <898546b4-b9b2-5413-27ab-74534cc77eed-Rm6v+N6rxxBWk0Htik3J/w@public.gmane.org>
  2017-06-26 11:59                 ` Piotr Dalek
  1 sibling, 1 reply; 19+ messages in thread
From: Piotr Dałek @ 2017-06-21 13:46 UTC (permalink / raw)
  To: Sage Weil; +Cc: Paweł Sadowski, ceph-devel, ceph-users

On 17-06-21 03:24 PM, Sage Weil wrote:
> On Wed, 21 Jun 2017, Piotr Dałek wrote:
>> On 17-06-14 03:44 PM, Sage Weil wrote:
>>> On Wed, 14 Jun 2017, Paweł Sadowski wrote:
>>>> On 04/13/2017 04:23 PM, Piotr Dałek wrote:
>>>>> On 04/06/2017 03:25 PM, Sage Weil wrote:
>>>>>> On Thu, 6 Apr 2017, Piotr Dałek wrote:
>>>>>>> [snip]
>>>>>>
>>>>>> I think the solution here is to use sparse_read during recovery.  The
>>>>>> PushOp data representation already supports it; it's just a matter of
>>>>>> skipping the zeros.  The recovery code could also have an option to
>>>>>> check
>>>>>> for fully-zero regions of the data and turn those into holes as
>>>>>> well.  For
>>>>>> ReplicatedBackend, see build_push_op().
>>>>>
>>>>> So far it turns out that there's even easier solution, we just enabled
>>>>> "filestore seek hole" on some test cluster and that seems to fix the
>>>>> problem for us. We'll see if fiemap works too.
>>>>>
>>>>
>>>> Is it safe to enable "filestore seek hole", are there any tests that
>>>> verifies that everything related to RBD works fine with this enabled?
>>>> Can we make this enabled by default?
>>>
>>> We would need to enable it in the qa environment first.  The risk here is
>>> that users run a broad range of kernels and we are exposing ourselves to
>>> any bugs in any kernel version they may run.  I'd prefer to leave it off
>>> by default.
>>
>> That's a common regression? If not, we could blacklist particular kernels and
>> call it a day.
 >>
>>> We can enable it in the qa suite, though, which covers
>>> centos7 (latest kernel) and ubuntu xenial and trusty.
>>
>> +1. Do you need some particular PR for that?
> 
> Sure.  How about a patch that adds the config option to several of the
> files in qa/suites/rados/thrash/thrashers?

OK.

>>>> I tested on few of our production images and it seems that about 30% is
>>>> sparse. This will be lost on any cluster wide event (add/remove nodes,
>>>> PG grow, recovery).
>>>>
>>>> How this is/will be handled in BlueStore?
>>>
>>> BlueStore exposes the same sparseness metadata that enabling the
>>> filestore seek hole or fiemap options does, so it won't be a problem
>>> there.
>>>
>>> I think the only thing that we could potentially add is zero detection
>>> on writes (so that explicitly writing zeros consumes no space).  We'd
>>> have to be a bit careful measuring the performance impact of that check on
>>> non-zero writes.
>>
>> I saw that RBD (librbd) does that - replacing writes with discards when buffer
>> contains only zeros. Some code that does the same in librados could be added
>> and it shouldn't impact performance much, current implementation of
>> mem_is_zero is fast and shouldn't be a big problem.
> 
> I'd rather not have librados silently translating requests; I think it
> makes more sense to do any zero checking in bluestore.  _do_write_small
> and _do_write_big already break writes into (aligned) chunks; that would
> be an easy place to add the check.

That leaves out filestore.

And while I get your point, doing it at the librados level would reduce network 
usage for zeroed-out regions as well, and the check could be done just once, not 
replica_size times...

-- 
Piotr Dałek
piotr.dalek@corp.ovh.com
https://www.ovh.com/us/


* Re: Sparse file info in filestore not propagated to other OSDs
       [not found]                 ` <CA+aFP1DJ3L3Pg0r4Pj3o7JoNTNnBRRs0u_nnb2JYz4nGxafUTA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2017-06-21 13:47                   ` Piotr Dałek
  0 siblings, 0 replies; 19+ messages in thread
From: Piotr Dałek @ 2017-06-21 13:47 UTC (permalink / raw)
  To: dillaman-H+wXaHxf7aLQT0dZR+AlfA; +Cc: ceph-devel, ceph-users

On 17-06-21 03:35 PM, Jason Dillaman wrote:
> On Wed, Jun 21, 2017 at 3:05 AM, Piotr Dałek <piotr.dalek@corp.ovh.com> wrote:
>> I saw that RBD (librbd) does that - replacing writes with discards when
>> buffer contains only zeros. Some code that does the same in librados could
>> be added and it shouldn't impact performance much, current implementation of
>> mem_is_zero is fast and shouldn't be a big problem.
> 
> I'm pretty sure the only place where librbd converts a write to a
> discard is actually the specialized "writesame" operation used by
> tcmu-runner, as an optimization for ESX's initialization of a new
> image.

Still, I saw it! ;-)

-- 
Piotr Dałek
piotr.dalek@corp.ovh.com
https://www.ovh.com/us/


* Re: Sparse file info in filestore not propagated to other OSDs
       [not found]                   ` <898546b4-b9b2-5413-27ab-74534cc77eed-Rm6v+N6rxxBWk0Htik3J/w@public.gmane.org>
@ 2017-06-21 13:56                     ` Sage Weil
  0 siblings, 0 replies; 19+ messages in thread
From: Sage Weil @ 2017-06-21 13:56 UTC (permalink / raw)
  To: Piotr Dałek; +Cc: ceph-devel, ceph-users


On Wed, 21 Jun 2017, Piotr Dałek wrote:
> > > > > I tested on few of our production images and it seems that about 30%
> > > > > is
> > > > > sparse. This will be lost on any cluster wide event (add/remove nodes,
> > > > > PG grow, recovery).
> > > > > 
> > > > > How this is/will be handled in BlueStore?
> > > > 
> > > > BlueStore exposes the same sparseness metadata that enabling the
> > > > filestore seek hole or fiemap options does, so it won't be a problem
> > > > there.
> > > > 
> > > > I think the only thing that we could potentially add is zero detection
> > > > on writes (so that explicitly writing zeros consumes no space).  We'd
> > > > have to be a bit careful measuring the performance impact of that check
> > > > on
> > > > non-zero writes.
> > > 
> > > I saw that RBD (librbd) does that - replacing writes with discards when
> > > buffer
> > > contains only zeros. Some code that does the same in librados could be
> > > added
> > > and it shouldn't impact performance much, current implementation of
> > > mem_is_zero is fast and shouldn't be a big problem.
> > 
> > I'd rather not have librados silently translating requests; I think it
> > makes more sense to do any zero checking in bluestore.  _do_write_small
> > and _do_write_big already break writes into (aligned) chunks; that would
> > be an easy place to add the check.
> 
> That leaves out filestore.
> 
> And while I get your point, doing it on librados level would reduce network
> usage for zeroed out regions as well, and check could be done just once, not
> replica_size times...

In the librbd case I think a client-side check makes sense.

For librados, it's a low level interface with complicated semantics.  
Silently translating a write op to a zero op feels dangerous to me.  
Would a zero range extend the object size, for example?  Or implicitly 
create an object that doesn't exist?  I can't remember.  (It would need to 
match write perfectly for this to be safe.)  The user might also have a 
compound op of multiple operations, which would make swapping one out in 
the middle stranger.  And probably half the librados unit tests would 
stop testing what we thought they were testing.  Etc.

It seems more natural to do this a layer up in librbd or rgw...

sage



* Re: Sparse file info in filestore not propagated to other OSDs
  2017-06-21 13:24               ` Sage Weil
  2017-06-21 13:46                 ` Piotr Dałek
@ 2017-06-26 11:59                 ` Piotr Dalek
  1 sibling, 0 replies; 19+ messages in thread
From: Piotr Dalek @ 2017-06-26 11:59 UTC (permalink / raw)
  To: Sage Weil; +Cc: Paweł Sadowski, ceph-devel, ceph-users

On 17-06-21 03:24 PM, Sage Weil wrote:
> On Wed, 21 Jun 2017, Piotr Dałek wrote:
>> On 17-06-14 03:44 PM, Sage Weil wrote:
>>> On Wed, 14 Jun 2017, Paweł Sadowski wrote:
>>>> [snip]
>>>>
>>>> Is it safe to enable "filestore seek hole", are there any tests that
>>>> verifies that everything related to RBD works fine with this enabled?
>>>> Can we make this enabled by default?
>>>
>>> We would need to enable it in the qa environment first.  The risk here is
>>> that users run a broad range of kernels and we are exposing ourselves to
>>> any bugs in any kernel version they may run.  I'd prefer to leave it off
>>> by default.
>>
>> That's a common regression? If not, we could blacklist particular kernels and
>> call it a day.

>>> We can enable it in the qa suite, though, which covers
>>> centos7 (latest kernel) and ubuntu xenial and trusty.
>>
>> +1. Do you need some particular PR for that?
> 
> Sure.  How about a patch that adds the config option to several of the
> files in qa/suites/rados/thrash/thrashers?

Does 
https://github.com/ovh/ceph/commit/fe65e3a19470eea16c9d273d1aac1c7eff7d2ff1 
look reasonable?

-- 
Piotr Dałek
piotr.dalek@corp.ovh.com
https://www.ovh.com/us/


