* Sparse file info in filestore not propagated to other OSDs
@ 2017-04-06 10:15 Piotr Dałek
2017-04-06 13:25 ` Sage Weil
0 siblings, 1 reply; 19+ messages in thread
From: Piotr Dałek @ 2017-04-06 10:15 UTC (permalink / raw)
To: ceph-devel
Hello,
We recently had an interesting issue with RBD images and filestore on Jewel
10.2.5:
We have a pool with RBD images, all of them mostly untouched (large
areas of those images unused). Once we added 3 new OSDs to the cluster,
the objects representing these images grew substantially on the new
OSDs: objects hosting unused areas of these images remained small on
the original OSDs (~8K of space actually used, 4M allocated), but were
large on the new OSDs (4M allocated *and* actually used). After
investigation we concluded that Ceph didn't propagate the sparse file
information during cluster rebalance, resulting in correct data
contents on all OSDs but no sparseness on the new ones, hence the
increase in disk space usage there.
Example on test cluster, before growing it by one OSD:
ls:
osd-01-cluster: -rw-r--r-- 1 root root 4194304 Apr 6 09:18
/var/lib/ceph/osd-01-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
osd-02-cluster: -rw-r--r-- 1 root root 4194304 Apr 6 09:18
/var/lib/ceph/osd-02-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
osd-03-cluster: -rw-r--r-- 1 root root 4194304 Apr 6 09:18
/var/lib/ceph/osd-03-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
du:
osd-01-cluster: 12
/var/lib/ceph/osd-01-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
osd-02-cluster: 12
/var/lib/ceph/osd-02-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
osd-03-cluster: 12
/var/lib/ceph/osd-03-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
mon-01-cluster:~ # rbd diff test
Offset Length Type
8388608 4194304 data
16777216 4096 data
33554432 4194304 data
37748736 2048 data
And after growing it:
ls:
clush> find /var/lib/ceph/osd-*/current/0.*head/ -type f -name '*data*'
-exec ls -l {} \+
osd-02-cluster: -rw-r--r-- 1 root root 4194304 Apr 6 09:18
/var/lib/ceph/osd-02-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
osd-03-cluster: -rw-r--r-- 1 root root 4194304 Apr 6 09:18
/var/lib/ceph/osd-03-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
osd-04-cluster: -rw-r--r-- 1 root root 4194304 Apr 6 09:25
/var/lib/ceph/osd-04-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
du:
clush> find /var/lib/ceph/osd-*/current/0.*head/ -type f -name '*data*'
-exec du -k {} \+
osd-02-cluster: 12
/var/lib/ceph/osd-02-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
osd-03-cluster: 12
/var/lib/ceph/osd-03-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
osd-04-cluster: 4100
/var/lib/ceph/osd-04-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
Note that "rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0"
grew from 12KB to 4100KB when copied from the other OSDs to osd-04.
Is this expected? Is there any way to make Ceph propagate the sparse
file info? Or should we think about writing a "fallocate -d"-like
patch for writes on filestore?
(We're running kernel 3.13.0-45-generic, but the issue persists on
4.4.0-31-generic; our XFS uses a 4K bsize.)
--
Piotr Dałek
piotr.dalek@corp.ovh.com
https://www.ovh.com/us/
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Sparse file info in filestore not propagated to other OSDs
From: Sage Weil @ 2017-04-06 13:25 UTC (permalink / raw)
To: Piotr Dałek; +Cc: ceph-devel
On Thu, 6 Apr 2017, Piotr Dałek wrote:
> Hello,
>
> We recently had an interesting issue with RBD images and filestore on Jewel
> 10.2.5:
> We have a pool with RBD images, all of them mostly untouched (large areas of
> those images unused), and once we added 3 new OSDs to cluster, objects
> representing these images grew substantially on new OSDs: objects hosting
> unused areas of these images on original OSDs remained small (~8K of space
> actually used, 4M allocated), but on new OSDs were large (4M allocated *and*
> actually used). After investigation we concluded that Ceph didn't propagate
> sparse file information during cluster rebalance, resulting in correct data
> contents on all OSDs, but no sparse file data on new OSDs, hence disk space
> usage increase on those.
>
> [..]
I think the solution here is to use sparse_read during recovery. The
PushOp data representation already supports it; it's just a matter of
skipping the zeros. The recovery code could also have an option to check
for fully-zero regions of the data and turn those into holes as well. For
ReplicatedBackend, see build_push_op().
sage
* Re: Sparse file info in filestore not propagated to other OSDs
From: Piotr Dałek @ 2017-04-06 13:30 UTC (permalink / raw)
To: Sage Weil; +Cc: ceph-devel
On 04/06/2017 03:25 PM, Sage Weil wrote:
> On Thu, 6 Apr 2017, Piotr Dałek wrote:
>> Hello,
>>
>> We recently had an interesting issue with RBD images and filestore on Jewel
>> 10.2.5:
>> We have a pool with RBD images, all of them mostly untouched (large areas of
>> those images unused), and once we added 3 new OSDs to cluster, objects
>> representing these images grew substantially on new OSDs: objects hosting
>> unused areas of these images on original OSDs remained small (~8K of space
>> actually used, 4M allocated), but on new OSDs were large (4M allocated *and*
>> actually used). After investigation we concluded that Ceph didn't propagate
>> sparse file information during cluster rebalance, resulting in correct data
>> contents on all OSDs, but no sparse file data on new OSDs, hence disk space
>> usage increase on those.
>>
>> [..]
>
> I think the solution here is to use sparse_read during recovery. The
> PushOp data representation already supports it; it's just a matter of
> skipping the zeros. The recovery code could also have an option to check
> for fully-zero regions of the data and turn those into holes as well. For
> ReplicatedBackend, see build_push_op().
Can we abuse that to reduce the amount of regular (client/inter-OSD)
network traffic?
--
Piotr Dałek
piotr.dalek@corp.ovh.com
https://www.ovh.com/us/
* Re: Sparse file info in filestore not propagated to other OSDs
From: Sage Weil @ 2017-04-06 13:55 UTC (permalink / raw)
To: Piotr Dałek; +Cc: ceph-devel
On Thu, 6 Apr 2017, Piotr Dałek wrote:
> On 04/06/2017 03:25 PM, Sage Weil wrote:
> > On Thu, 6 Apr 2017, Piotr Dałek wrote:
> > > [..]
> >
> > I think the solution here is to use sparse_read during recovery. The
> > PushOp data representation already supports it; it's just a matter of
> > skipping the zeros. The recovery code could also have an option to check
> > for fully-zero regions of the data and turn those into holes as well. For
> > ReplicatedBackend, see build_push_op().
>
> Can we abuse that to reduce amount of regular (client/inter-osd) network
> traffic?
Yeah... I wouldn't call it abuse :). sparse_read() will use
SEEK_HOLE/SEEK_DATA on filestore (if enabled). On bluestore we have
the metadata on hand. It may be a bit slower, though... more
complexity and such. They recently implemented something like this for
the kernel NFS server and found it was faster for very sparse files,
but the rest of the time it was a fair bit slower.
sage
* Re: Sparse file info in filestore not propagated to other OSDs
From: Piotr Dałek @ 2017-04-06 14:24 UTC (permalink / raw)
To: Sage Weil; +Cc: ceph-devel
On 04/06/2017 03:55 PM, Sage Weil wrote:
> On Thu, 6 Apr 2017, Piotr Dałek wrote:
>> On 04/06/2017 03:25 PM, Sage Weil wrote:
>>> On Thu, 6 Apr 2017, Piotr Dałek wrote:
>>>> [..]
>>>
>>> I think the solution here is to use sparse_read during recovery. The
>>> PushOp data representation already supports it; it's just a matter of
>>> skipping the zeros. The recovery code could also have an option to check
>>> for fully-zero regions of the data and turn those into holes as well. For
>>> ReplicatedBackend, see build_push_op().
>>
>> Can we abuse that to reduce amount of regular (client/inter-osd) network
>> traffic?
>
> Yeah... I wouldn't call it abuse :). sparse_read() will use
> SEEK_HOLE/SEEK_DATA on filestore (if enabled). On bluestore we have the
> metadata on-hand. It may be a bit slower, though... more complexity
> and such. They recently implemented something like this for the kernel
> NFS server and found it was faster for very sparse files but the rest of
> the time it was a fair bit slower.
I was wondering if we could modify regular reads so that they behave
as before, but don't transmit zeroed-out pages/blocks/objects (in
other words, you would still get bufferptrs full of zeroes, but they
wouldn't be transmitted as such over the wire; a specialized case of
RLE compression). That shouldn't be much slower. But I don't really
see how that would work without a protocol change... Well, at least
it's possible to replace some of the calls to read with sparse read,
letting the filesystem/filestore metadata do the heavy lifting for us.
--
Piotr Dałek
piotr.dalek@corp.ovh.com
https://www.ovh.com/us/
* Re: Sparse file info in filestore not propagated to other OSDs
From: Sage Weil @ 2017-04-06 14:27 UTC (permalink / raw)
To: Piotr Dałek; +Cc: ceph-devel
On Thu, 6 Apr 2017, Piotr Dałek wrote:
> On 04/06/2017 03:55 PM, Sage Weil wrote:
> > On Thu, 6 Apr 2017, Piotr Dałek wrote:
> > > On 04/06/2017 03:25 PM, Sage Weil wrote:
> > > > On Thu, 6 Apr 2017, Piotr Dałek wrote:
> > > > > [..]
> > > >
> > > > I think the solution here is to use sparse_read during recovery. The
> > > > PushOp data representation already supports it; it's just a matter of
> > > > skipping the zeros. The recovery code could also have an option to
> > > > check
> > > > for fully-zero regions of the data and turn those into holes as well.
> > > > For
> > > > ReplicatedBackend, see build_push_op().
> > >
> > > Can we abuse that to reduce amount of regular (client/inter-osd) network
> > > traffic?
> >
> > Yeah... I wouldn't call it abuse :). sparse_read() will use
> > SEEK_HOLE/SEEK_DATA on filestore (if enabled). On bluestore we have the
> > metadata on-hand. It may be a bit slower, though... more complexity
> > and such. They recently implemented something like this for the kernel
> > NFS server and found it was faster for very sparse files but the rest of
> > the time it was a fair bit slower.
>
> I was wondering if we could modify regular reads in a way that makes them work
> as it used to work, but not transmit zeroed out pages/blocks/objects (in other
> words, you still would get bufferptrs full of zeroes, but they wouldn't be
> transmitted as such over the wire; specialized case of RLE compression). That
> shouldn't be so much slower. But I don't really see how that would work
> without protocol change... Well, at least it's possible to replace some of
> calls to read with sparse read, utilizing filesystem/file store metadata to do
> heavy lifting for us.
IIRC librbd used to have an option to do sparse-read all the time instead
of read (I think this was in ObjectCacher somewhere?) but I think it got
turned off for some reason? Memory is very fuzzy here. In any case,
changing the client to use sparse-read is the way to do it, I think.
I'm a bit skeptical that this will have much of an impact, though.
sage
* Re: Sparse file info in filestore not propagated to other OSDs
From: Jason Dillaman @ 2017-04-06 15:50 UTC (permalink / raw)
To: Sage Weil; +Cc: Piotr Dałek, ceph-devel
I don't recall a configuration option, but librbd always uses sparse
reads when the cache is disabled and never uses sparse reads for
cache-based reads. I'm pretty sure there wasn't a rationale for the
split -- instead, I've always assumed it was an oversight when that
feature was added years ago.
On Thu, Apr 6, 2017 at 10:27 AM, Sage Weil <sage@newdream.net> wrote:
> On Thu, 6 Apr 2017, Piotr Dałek wrote:
>> On 04/06/2017 03:55 PM, Sage Weil wrote:
>> > On Thu, 6 Apr 2017, Piotr Dałek wrote:
>> > > On 04/06/2017 03:25 PM, Sage Weil wrote:
>> > > > On Thu, 6 Apr 2017, Piotr Dałek wrote:
>> > > > > [..]
>> > > >
>> > > > I think the solution here is to use sparse_read during recovery. The
>> > > > PushOp data representation already supports it; it's just a matter of
>> > > > skipping the zeros. The recovery code could also have an option to
>> > > > check
>> > > > for fully-zero regions of the data and turn those into holes as well.
>> > > > For
>> > > > ReplicatedBackend, see build_push_op().
>> > >
>> > > Can we abuse that to reduce amount of regular (client/inter-osd) network
>> > > traffic?
>> >
>> > Yeah... I wouldn't call it abuse :). sparse_read() will use
>> > SEEK_HOLE/SEEK_DATA on filestore (if enabled). On bluestore we have the
>> > metadata on-hand. It may be a bit slower, though... more complexity
>> > and such. They recently implemented something like this for the kernel
>> > NFS server and found it was faster for very sparse files but the rest of
>> > the time it was a fair bit slower.
>>
>> I was wondering if we could modify regular reads in a way that makes them work
>> as it used to work, but not transmit zeroed out pages/blocks/objects (in other
>> words, you still would get bufferptrs full of zeroes, but they wouldn't be
>> transmitted as such over the wire; specialized case of RLE compression). That
>> shouldn't be so much slower. But I don't really see how that would work
>> without protocol change... Well, at least it's possible to replace some of
>> calls to read with sparse read, utilizing filesystem/file store metadata to do
>> heavy lifting for us.
>
> IIRC librbd used to have an option to do sparse-read all the time instead
> of read (I think this was in ObjectCacher somewhere?) but I think it got
> turned off for some reason? Memory is very fuzzy here. In any case,
> changing the client to use sparse-read is the way to do it, I think.
> I'm a bit skeptical that this will have much of an impact, though.
>
> sage
--
Jason
* Re: Sparse file info in filestore not propagated to other OSDs
From: Josh Durgin @ 2017-04-06 17:52 UTC (permalink / raw)
To: dillaman, Sage Weil; +Cc: Piotr Dałek, ceph-devel
On 04/06/2017 08:50 AM, Jason Dillaman wrote:
> I don't recall a configuration option, but librbd always uses sparse
> reads when the cache is disabled and never uses sparse reads for
> cache-based reads. I'm pretty sure there wasn't a rationale for the
> split -- instead, I've always assumed it was an oversight when that
> feature was added years ago.
It was turned off on the osd side (with the 'filestore fiemap' option)
due to (at the time) buggy kernel behavior with fiemap.
I think I didn't bother implementing it in the cache because of this -
it wouldn't have been safe to use it at the time, and it would have
required more work in the cache to support sparse data at that point
too.
Nowadays seek_data/hole should be reliable (there have been xfstests
for it for a few years) so it's worth looking into for the
recovery case. The rbd cache could be pretty easily modified to support
sparse reads at this point too, which would help keep the
copy-on-write case sparse.
Josh
* Re: Sparse file info in filestore not propagated to other OSDs
From: Piotr Dałek @ 2017-04-07 6:46 UTC (permalink / raw)
To: Sage Weil; +Cc: ceph-devel
On 04/06/2017 04:27 PM, Sage Weil wrote:
> On Thu, 6 Apr 2017, Piotr Dałek wrote:
>> On 04/06/2017 03:55 PM, Sage Weil wrote:
>>> On Thu, 6 Apr 2017, Piotr Dałek wrote:
>>>> On 04/06/2017 03:25 PM, Sage Weil wrote:
>>>>> On Thu, 6 Apr 2017, Piotr Dałek wrote:
>>>>>> [..]
>>>>>
>>>>> I think the solution here is to use sparse_read during recovery. The
>>>>> PushOp data representation already supports it; it's just a matter of
>>>>> skipping the zeros. The recovery code could also have an option to
>>>>> check
>>>>> for fully-zero regions of the data and turn those into holes as well.
>>>>> For
>>>>> ReplicatedBackend, see build_push_op().
>>>>
>>>> Can we abuse that to reduce amount of regular (client/inter-osd) network
>>>> traffic?
>>>
>>> Yeah... I wouldn't call it abuse :). sparse_read() will use
>>> SEEK_HOLE/SEEK_DATA on filestore (if enabled). On bluestore we have the
>>> metadata on-hand. It may be a bit slower, though... more complexity
>>> and such. They recently implemented something like this for the kernel
>>> NFS server and found it was faster for very sparse files but the rest of
>>> the time it was a fair bit slower.
>>
>> I was wondering if we could modify regular reads in a way that makes them work
>> as it used to work, but not transmit zeroed out pages/blocks/objects (in other
>> words, you still would get bufferptrs full of zeroes, but they wouldn't be
>> transmitted as such over the wire; specialized case of RLE compression). That
>> shouldn't be so much slower. But I don't really see how that would work
>> without protocol change... Well, at least it's possible to replace some of
>> calls to read with sparse read, utilizing filesystem/file store metadata to do
>> heavy lifting for us.
>
> IIRC librbd used to have an option to do sparse-read all the time instead
> of read (I think this was in ObjectCacher somewhere?) but I think it got
> turned off for some reason? Memory is very fuzzy here. In any case,
> changing the client to use sparse-read is the way to do it, I think.
> I'm a bit skeptical that this will have much of an impact, though.
I don't expect it to be a big win either; having even a simple RLE
compressor would be more useful (and would, in particular, make "rados
bench" useless), but if sparse reads are also less bandwidth-intensive,
it could be meaningful for many large-cluster operators and easier to
implement without breaking too much.
--
Piotr Dałek
piotr.dalek@corp.ovh.com
https://www.ovh.com/us/
* Re: Sparse file info in filestore not propagated to other OSDs
From: Piotr Dałek @ 2017-04-13 14:23 UTC (permalink / raw)
To: Sage Weil; +Cc: ceph-devel
On 04/06/2017 03:25 PM, Sage Weil wrote:
> On Thu, 6 Apr 2017, Piotr Dałek wrote:
>> Hello,
>>
>> We recently had an interesting issue with RBD images and filestore on Jewel
>> 10.2.5:
>> We have a pool with RBD images, all of them mostly untouched (large areas of
>> those images unused), and once we added 3 new OSDs to cluster, objects
>> representing these images grew substantially on new OSDs: objects hosting
>> unused areas of these images on original OSDs remained small (~8K of space
>> actually used, 4M allocated), but on new OSDs were large (4M allocated *and*
>> actually used). After investigation we concluded that Ceph didn't propagate
>> sparse file information during cluster rebalance, resulting in correct data
>> contents on all OSDs, but no sparse file data on new OSDs, hence disk space
>> usage increase on those.
>>
>> [..]
>
> I think the solution here is to use sparse_read during recovery. The
> PushOp data representation already supports it; it's just a matter of
> skipping the zeros. The recovery code could also have an option to check
> for fully-zero regions of the data and turn those into holes as well. For
> ReplicatedBackend, see build_push_op().
So far it turns out that there's an even easier solution: we just
enabled "filestore seek hole" on a test cluster and that seems to fix
the problem for us. We'll see if fiemap works too.
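For reference, the option referred to here is set in ceph.conf; the
option names below are the Jewel-era FileStore ones and should be
verified against your release:

```ini
[osd]
# prefer SEEK_DATA/SEEK_HOLE when serving sparse reads (off by default)
filestore seek data hole = true
# FIEMAP-based alternative; historically unreliable on some kernels
filestore fiemap = false
```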
--
Piotr Dałek
piotr.dalek@corp.ovh.com
https://www.ovh.com/us/
* Re: Sparse file info in filestore not propagated to other OSDs
From: Paweł Sadowski @ 2017-06-14 6:30 UTC (permalink / raw)
To: Sage Weil; +Cc: ceph-devel, ceph-users
On 04/13/2017 04:23 PM, Piotr Dałek wrote:
> On 04/06/2017 03:25 PM, Sage Weil wrote:
>> On Thu, 6 Apr 2017, Piotr Dałek wrote:
>>> Hello,
>>>
>>> We recently had an interesting issue with RBD images and filestore
>>> on Jewel
>>> 10.2.5:
>>> We have a pool with RBD images, all of them mostly untouched (large
>>> areas of
>>> those images unused), and once we added 3 new OSDs to cluster, objects
>>> representing these images grew substantially on new OSDs: objects
>>> hosting
>>> unused areas of these images on original OSDs remained small (~8K of
>>> space
>>> actually used, 4M allocated), but on new OSDs were large (4M
>>> allocated *and*
>>> actually used). After investigation we concluded that Ceph didn't
>>> propagate
>>> sparse file information during cluster rebalance, resulting in
>>> correct data
>>> contents on all OSDs, but no sparse file data on new OSDs, hence
>>> disk space
>>> usage increase on those.
>>>
>>> Example on test cluster, before growing it by one OSD:
>>>
>>> ls:
>>>
>>> osd-01-cluster: -rw-r--r-- 1 root root 4194304 Apr 6 09:18
>>> /var/lib/ceph/osd-01-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
>>>
>>> osd-02-cluster: -rw-r--r-- 1 root root 4194304 Apr 6 09:18
>>> /var/lib/ceph/osd-02-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
>>>
>>> osd-03-cluster: -rw-r--r-- 1 root root 4194304 Apr 6 09:18
>>> /var/lib/ceph/osd-03-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
>>>
>>>
>>> du:
>>>
>>> osd-01-cluster: 12
>>> /var/lib/ceph/osd-01-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
>>>
>>> osd-02-cluster: 12
>>> /var/lib/ceph/osd-02-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
>>>
>>> osd-03-cluster: 12
>>> /var/lib/ceph/osd-03-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
>>>
>>>
>>>
>>> mon-01-cluster:~ # rbd diff test
>>> Offset Length Type
>>> 8388608 4194304 data
>>> 16777216 4096 data
>>> 33554432 4194304 data
>>> 37748736 2048 data
>>>
>>> And after growing it:
>>>
>>> ls:
>>>
>>> clush> find /var/lib/ceph/osd-*/current/0.*head/ -type f -name
>>> '*data*' -exec
>>> ls -l {} \+
>>> osd-02-cluster: -rw-r--r-- 1 root root 4194304 Apr 6 09:18
>>> /var/lib/ceph/osd-02-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
>>>
>>> osd-03-cluster: -rw-r--r-- 1 root root 4194304 Apr 6 09:18
>>> /var/lib/ceph/osd-03-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
>>>
>>> osd-04-cluster: -rw-r--r-- 1 root root 4194304 Apr 6 09:25
>>> /var/lib/ceph/osd-04-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
>>>
>>>
>>> du:
>>>
>>> clush> find /var/lib/ceph/osd-*/current/0.*head/ -type f -name
>>> '*data*' -exec
>>> du -k {} \+
>>> osd-02-cluster: 12
>>> /var/lib/ceph/osd-02-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
>>>
>>> osd-03-cluster: 12
>>> /var/lib/ceph/osd-03-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
>>>
>>> osd-04-cluster: 4100
>>> /var/lib/ceph/osd-04-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
>>>
>>>
>>> Note that
>>> "rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0" grew
>>> from 12 to 4100KB when copied from other OSDs to osd-04.
>>>
>>> Is this something to be expected? Is there any way to make it
>>> propagate the
>>> sparse file info? Or should we think about issuing a "fallocate
>>> -d"-like patch
>>> for writes on filestore?
>>>
>>> (We're using kernel 3.13.0-45-generic but on 4.4.0-31-generic the issue
>>> remains; our XFS uses 4K bsize).
>>
>> I think the solution here is to use sparse_read during recovery. The
>> PushOp data representation already supports it; it's just a matter of
>> skipping the zeros. The recovery code could also have an option to
>> check
>> for fully-zero regions of the data and turn those into holes as
>> well. For
>> ReplicatedBackend, see build_push_op().
>
> So far it turns out that there's even easier solution, we just enabled
> "filestore seek hole" on some test cluster and that seems to fix the
> problem for us. We'll see if fiemap works too.
>
Is it safe to enable "filestore seek hole"? Are there any tests that
verify that everything related to RBD works fine with this enabled?
Can we make this enabled by default?
I tested a few of our production images and it seems that about 30% are
sparse. This will be lost on any cluster-wide event (add/remove nodes,
PG grow, recovery).
How is/will this be handled in BlueStore?
(Adding ceph-users, as it might interest others as well.)
--
PS
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
* Re: Sparse file info in filestore not propagated to other OSDs
2017-06-14 6:30 ` Paweł Sadowski
@ 2017-06-14 13:44 ` Sage Weil
[not found] ` <alpine.DEB.2.11.1706141340520.3646-qHenpvqtifaMSRpgCs4c+g@public.gmane.org>
0 siblings, 1 reply; 19+ messages in thread
From: Sage Weil @ 2017-06-14 13:44 UTC (permalink / raw)
To: Paweł Sadowski; +Cc: Piotr Dałek, ceph-devel, ceph-users
On Wed, 14 Jun 2017, Paweł Sadowski wrote:
> On 04/13/2017 04:23 PM, Piotr Dałek wrote:
> > On 04/06/2017 03:25 PM, Sage Weil wrote:
> >> On Thu, 6 Apr 2017, Piotr Dałek wrote:
> >>> Hello,
> >>>
> >>> We recently had an interesting issue with RBD images and filestore
> >>> on Jewel
> >>> 10.2.5:
> >>> We have a pool with RBD images, all of them mostly untouched (large
> >>> areas of
> >>> those images unused), and once we added 3 new OSDs to cluster, objects
> >>> representing these images grew substantially on new OSDs: objects
> >>> hosting
> >>> unused areas of these images on original OSDs remained small (~8K of
> >>> space
> >>> actually used, 4M allocated), but on new OSDs were large (4M
> >>> allocated *and*
> >>> actually used). After investigation we concluded that Ceph didn't
> >>> propagate
> >>> sparse file information during cluster rebalance, resulting in
> >>> correct data
> >>> contents on all OSDs, but no sparse file data on new OSDs, hence
> >>> disk space
> >>> usage increase on those.
> >>>
> >>> Example on test cluster, before growing it by one OSD:
> >>>
> >>> ls:
> >>>
> >>> osd-01-cluster: -rw-r--r-- 1 root root 4194304 Apr 6 09:18
> >>> /var/lib/ceph/osd-01-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
> >>>
> >>> osd-02-cluster: -rw-r--r-- 1 root root 4194304 Apr 6 09:18
> >>> /var/lib/ceph/osd-02-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
> >>>
> >>> osd-03-cluster: -rw-r--r-- 1 root root 4194304 Apr 6 09:18
> >>> /var/lib/ceph/osd-03-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
> >>>
> >>>
> >>> du:
> >>>
> >>> osd-01-cluster: 12
> >>> /var/lib/ceph/osd-01-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
> >>>
> >>> osd-02-cluster: 12
> >>> /var/lib/ceph/osd-02-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
> >>>
> >>> osd-03-cluster: 12
> >>> /var/lib/ceph/osd-03-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
> >>>
> >>>
> >>>
> >>> mon-01-cluster:~ # rbd diff test
> >>> Offset Length Type
> >>> 8388608 4194304 data
> >>> 16777216 4096 data
> >>> 33554432 4194304 data
> >>> 37748736 2048 data
> >>>
> >>> And after growing it:
> >>>
> >>> ls:
> >>>
> >>> clush> find /var/lib/ceph/osd-*/current/0.*head/ -type f -name
> >>> '*data*' -exec
> >>> ls -l {} \+
> >>> osd-02-cluster: -rw-r--r-- 1 root root 4194304 Apr 6 09:18
> >>> /var/lib/ceph/osd-02-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
> >>>
> >>> osd-03-cluster: -rw-r--r-- 1 root root 4194304 Apr 6 09:18
> >>> /var/lib/ceph/osd-03-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
> >>>
> >>> osd-04-cluster: -rw-r--r-- 1 root root 4194304 Apr 6 09:25
> >>> /var/lib/ceph/osd-04-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
> >>>
> >>>
> >>> du:
> >>>
> >>> clush> find /var/lib/ceph/osd-*/current/0.*head/ -type f -name
> >>> '*data*' -exec
> >>> du -k {} \+
> >>> osd-02-cluster: 12
> >>> /var/lib/ceph/osd-02-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
> >>>
> >>> osd-03-cluster: 12
> >>> /var/lib/ceph/osd-03-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
> >>>
> >>> osd-04-cluster: 4100
> >>> /var/lib/ceph/osd-04-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
> >>>
> >>>
> >>> Note that
> >>> "rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0" grew
> >>> from 12 to 4100KB when copied from other OSDs to osd-04.
> >>>
> >>> Is this something to be expected? Is there any way to make it
> >>> propagate the
> >>> sparse file info? Or should we think about issuing a "fallocate
> >>> -d"-like patch
> >>> for writes on filestore?
> >>>
> >>> (We're using kernel 3.13.0-45-generic but on 4.4.0-31-generic the issue
> >>> remains; our XFS uses 4K bsize).
> >>
> >> I think the solution here is to use sparse_read during recovery. The
> >> PushOp data representation already supports it; it's just a matter of
> >> skipping the zeros. The recovery code could also have an option to
> >> check
> >> for fully-zero regions of the data and turn those into holes as
> >> well. For
> >> ReplicatedBackend, see build_push_op().
> >
> > So far it turns out that there's even easier solution, we just enabled
> > "filestore seek hole" on some test cluster and that seems to fix the
> > problem for us. We'll see if fiemap works too.
> >
>
> Is it safe to enable "filestore seek hole", are there any tests that
> verifies that everything related to RBD works fine with this enabled?
> Can we make this enabled by default?
We would need to enable it in the qa environment first. The risk here is
that users run a broad range of kernels and we are exposing ourselves to
any bugs in any kernel version they may run. I'd prefer to leave it off
by default. We can enable it in the qa suite, though, which covers
centos7 (latest kernel) and ubuntu xenial and trusty.
> I tested on few of our production images and it seems that about 30% is
> sparse. This will be lost on any cluster wide event (add/remove nodes,
> PG grow, recovery).
>
> How this is/will be handled in BlueStore?
BlueStore exposes the same sparseness metadata that enabling the
filestore seek hole or fiemap options does, so it won't be a problem
there.
I think the only thing that we could potentially add is zero detection
on writes (so that explicitly writing zeros consumes no space). We'd
have to be a bit careful measuring the performance impact of that check on
non-zero writes.
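[Editor's note: the zero detection described here can be sketched as follows — a toy model that splits an incoming write into aligned chunks and turns all-zero chunks into hole ("zero") operations instead of data writes. The chunk size and op names are invented for illustration; this is not the actual _do_write_small/_do_write_big code, and the cost of scanning non-zero chunks is precisely the overhead that would need measuring:]

```python
CHUNK = 64 * 1024  # illustrative alignment, not BlueStore's real chunking

def plan_write(offset, data, chunk=CHUNK):
    """Split a write into chunk-aligned pieces and classify each piece as a
    'write' (contains data) or a 'zero' (all zeros, can become a hole)."""
    ops = []
    pos = 0
    while pos < len(data):
        # length to the next chunk boundary, relative to the object offset
        span = chunk - ((offset + pos) % chunk)
        piece = data[pos:pos + span]
        kind = "zero" if piece.count(0) == len(piece) else "write"
        ops.append((kind, offset + pos, len(piece)))
        pos += len(piece)
    return ops

# 64 KiB of zeros followed by a little real data: the zero chunk becomes
# a hole op, only the tail is written as data.
ops = plan_write(0, b"\x00" * 65536 + b"abc" + b"\x00" * 100)
print(ops)  # [('zero', 0, 65536), ('write', 65536, 103)]
```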
sage
* Re: Sparse file info in filestore not propagated to other OSDs
[not found] ` <alpine.DEB.2.11.1706141340520.3646-qHenpvqtifaMSRpgCs4c+g@public.gmane.org>
@ 2017-06-21 7:05 ` Piotr Dałek
2017-06-21 13:24 ` Sage Weil
2017-06-21 13:35 ` [ceph-users] " Jason Dillaman
0 siblings, 2 replies; 19+ messages in thread
From: Piotr Dałek @ 2017-06-21 7:05 UTC (permalink / raw)
To: Sage Weil, Paweł Sadowski; +Cc: ceph-devel, ceph-users
On 17-06-14 03:44 PM, Sage Weil wrote:
> On Wed, 14 Jun 2017, Paweł Sadowski wrote:
>> On 04/13/2017 04:23 PM, Piotr Dałek wrote:
>>> On 04/06/2017 03:25 PM, Sage Weil wrote:
>>>> On Thu, 6 Apr 2017, Piotr Dałek wrote:
>>>>> [snip]
>>>>
>>>> I think the solution here is to use sparse_read during recovery. The
>>>> PushOp data representation already supports it; it's just a matter of
>>>> skipping the zeros. The recovery code could also have an option to
>>>> check
>>>> for fully-zero regions of the data and turn those into holes as
>>>> well. For
>>>> ReplicatedBackend, see build_push_op().
>>>
>>> So far it turns out that there's even easier solution, we just enabled
>>> "filestore seek hole" on some test cluster and that seems to fix the
>>> problem for us. We'll see if fiemap works too.
>>>
>>
>> Is it safe to enable "filestore seek hole", are there any tests that
>> verifies that everything related to RBD works fine with this enabled?
>> Can we make this enabled by default?
>
> We would need to enable it in the qa environment first. The risk here is
> that users run a broad range of kernels and we are exposing ourselves to
> any bugs in any kernel version they may run. I'd prefer to leave it off
> by default.
Is that a common regression? If not, we could blacklist particular kernels
and call it a day.
> We can enable it in the qa suite, though, which covers
> centos7 (latest kernel) and ubuntu xenial and trusty.
+1. Do you need some particular PR for that?
>> I tested on few of our production images and it seems that about 30% is
>> sparse. This will be lost on any cluster wide event (add/remove nodes,
>> PG grow, recovery).
>>
>> How this is/will be handled in BlueStore?
>
> BlueStore exposes the same sparseness metadata that enabling the
> filestore seek hole or fiemap options does, so it won't be a problem
> there.
>
> I think the only thing that we could potentially add is zero detection
> on writes (so that explicitly writing zeros consumes no space). We'd
> have to be a bit careful measuring the performance impact of that check on
> non-zero writes.
I saw that RBD (librbd) does that - replacing writes with discards when the
buffer contains only zeros. Code that does the same could be added in
librados, and it shouldn't impact performance much; the current
implementation of mem_is_zero is fast and shouldn't be a big problem.
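[Editor's note: the client-side translation described here could look roughly like the sketch below. The dispatch function and op tuples are hypothetical, invented for illustration; mem_is_zero itself is Ceph C++ that compares machine-sized words, for which bytes.count is only a rough Python stand-in:]

```python
def buffer_is_zero(buf: bytes) -> bool:
    """Rough stand-in for Ceph's mem_is_zero: bytes.count runs as a single
    C-level scan, so the check stays cheap relative to a network round trip."""
    return buf.count(0) == len(buf)

def submit(oid, offset, data):
    """Hypothetical client-side dispatch: an all-zero buffer is sent as a
    zero (discard) request instead of shipping the zeros over the wire."""
    if buffer_is_zero(data):
        return ("zero", oid, offset, len(data))  # length only, no payload
    return ("write", oid, offset, data)

print(submit("rbd_data.12a474b0dc51.0000000000000008", 0, b"\x00" * 4096))
# -> ('zero', 'rbd_data.12a474b0dc51.0000000000000008', 0, 4096)
```

[This is where the network saving mentioned below comes from: the zero op carries a length instead of a 4 KiB payload, and the check runs once on the client rather than once per replica.]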
--
Piotr Dałek
piotr.dalek@corp.ovh.com
https://www.ovh.com/us/
* Re: Sparse file info in filestore not propagated to other OSDs
2017-06-21 7:05 ` Piotr Dałek
@ 2017-06-21 13:24 ` Sage Weil
2017-06-21 13:46 ` Piotr Dałek
2017-06-26 11:59 ` Piotr Dalek
2017-06-21 13:35 ` [ceph-users] " Jason Dillaman
1 sibling, 2 replies; 19+ messages in thread
From: Sage Weil @ 2017-06-21 13:24 UTC (permalink / raw)
To: Piotr Dałek; +Cc: Paweł Sadowski, ceph-devel, ceph-users
On Wed, 21 Jun 2017, Piotr Dałek wrote:
> On 17-06-14 03:44 PM, Sage Weil wrote:
> > On Wed, 14 Jun 2017, Paweł Sadowski wrote:
> > > On 04/13/2017 04:23 PM, Piotr Dałek wrote:
> > > > On 04/06/2017 03:25 PM, Sage Weil wrote:
> > > > > On Thu, 6 Apr 2017, Piotr Dałek wrote:
> > > > > > [snip]
> > > > >
> > > > > I think the solution here is to use sparse_read during recovery. The
> > > > > PushOp data representation already supports it; it's just a matter of
> > > > > skipping the zeros. The recovery code could also have an option to
> > > > > check
> > > > > for fully-zero regions of the data and turn those into holes as
> > > > > well. For
> > > > > ReplicatedBackend, see build_push_op().
> > > >
> > > > So far it turns out that there's even easier solution, we just enabled
> > > > "filestore seek hole" on some test cluster and that seems to fix the
> > > > problem for us. We'll see if fiemap works too.
> > > >
> > >
> > > Is it safe to enable "filestore seek hole", are there any tests that
> > > verifies that everything related to RBD works fine with this enabled?
> > > Can we make this enabled by default?
> >
> > We would need to enable it in the qa environment first. The risk here is
> > that users run a broad range of kernels and we are exposing ourselves to
> > any bugs in any kernel version they may run. I'd prefer to leave it off
> > by default.
>
> That's a common regression? If not, we could blacklist particular kernels and
> call it a day.
> > We can enable it in the qa suite, though, which covers
> > centos7 (latest kernel) and ubuntu xenial and trusty.
>
> +1. Do you need some particular PR for that?
Sure. How about a patch that adds the config option to several of the
files in qa/suites/rados/thrash/thrashers?
> > > I tested on few of our production images and it seems that about 30% is
> > > sparse. This will be lost on any cluster wide event (add/remove nodes,
> > > PG grow, recovery).
> > >
> > > How this is/will be handled in BlueStore?
> >
> > BlueStore exposes the same sparseness metadata that enabling the
> > filestore seek hole or fiemap options does, so it won't be a problem
> > there.
> >
> > I think the only thing that we could potentially add is zero detection
> > on writes (so that explicitly writing zeros consumes no space). We'd
> > have to be a bit careful measuring the performance impact of that check on
> > non-zero writes.
>
> I saw that RBD (librbd) does that - replacing writes with discards when buffer
> contains only zeros. Some code that does the same in librados could be added
> and it shouldn't impact performance much, current implementation of
> mem_is_zero is fast and shouldn't be a big problem.
I'd rather not have librados silently translating requests; I think it
makes more sense to do any zero checking in bluestore. _do_write_small
and _do_write_big already break writes into (aligned) chunks; that would
be an easy place to add the check.
sage
* Re: [ceph-users] Sparse file info in filestore not propagated to other OSDs
2017-06-21 7:05 ` Piotr Dałek
2017-06-21 13:24 ` Sage Weil
@ 2017-06-21 13:35 ` Jason Dillaman
[not found] ` <CA+aFP1DJ3L3Pg0r4Pj3o7JoNTNnBRRs0u_nnb2JYz4nGxafUTA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
1 sibling, 1 reply; 19+ messages in thread
From: Jason Dillaman @ 2017-06-21 13:35 UTC (permalink / raw)
To: Piotr Dałek; +Cc: Sage Weil, Paweł Sadowski, ceph-devel, ceph-users
On Wed, Jun 21, 2017 at 3:05 AM, Piotr Dałek <piotr.dalek@corp.ovh.com> wrote:
> I saw that RBD (librbd) does that - replacing writes with discards when
> buffer contains only zeros. Some code that does the same in librados could
> be added and it shouldn't impact performance much, current implementation of
> mem_is_zero is fast and shouldn't be a big problem.
I'm pretty sure the only place where librbd converts a write to a
discard is actually the specialized "writesame" operation used by
tcmu-runner, as an optimization for ESX's initialization of a new
image.
--
Jason
* Re: Sparse file info in filestore not propagated to other OSDs
2017-06-21 13:24 ` Sage Weil
@ 2017-06-21 13:46 ` Piotr Dałek
[not found] ` <898546b4-b9b2-5413-27ab-74534cc77eed-Rm6v+N6rxxBWk0Htik3J/w@public.gmane.org>
2017-06-26 11:59 ` Piotr Dalek
1 sibling, 1 reply; 19+ messages in thread
From: Piotr Dałek @ 2017-06-21 13:46 UTC (permalink / raw)
To: Sage Weil; +Cc: Paweł Sadowski, ceph-devel, ceph-users
On 17-06-21 03:24 PM, Sage Weil wrote:
> On Wed, 21 Jun 2017, Piotr Dałek wrote:
>> On 17-06-14 03:44 PM, Sage Weil wrote:
>>> On Wed, 14 Jun 2017, Paweł Sadowski wrote:
>>>> On 04/13/2017 04:23 PM, Piotr Dałek wrote:
>>>>> On 04/06/2017 03:25 PM, Sage Weil wrote:
>>>>>> On Thu, 6 Apr 2017, Piotr Dałek wrote:
>>>>>>> [snip]
>>>>>>
>>>>>> I think the solution here is to use sparse_read during recovery. The
>>>>>> PushOp data representation already supports it; it's just a matter of
>>>>>> skipping the zeros. The recovery code could also have an option to
>>>>>> check
>>>>>> for fully-zero regions of the data and turn those into holes as
>>>>>> well. For
>>>>>> ReplicatedBackend, see build_push_op().
>>>>>
>>>>> So far it turns out that there's even easier solution, we just enabled
>>>>> "filestore seek hole" on some test cluster and that seems to fix the
>>>>> problem for us. We'll see if fiemap works too.
>>>>>
>>>>
>>>> Is it safe to enable "filestore seek hole", are there any tests that
>>>> verifies that everything related to RBD works fine with this enabled?
>>>> Can we make this enabled by default?
>>>
>>> We would need to enable it in the qa environment first. The risk here is
>>> that users run a broad range of kernels and we are exposing ourselves to
>>> any bugs in any kernel version they may run. I'd prefer to leave it off
>>> by default.
>>
>> That's a common regression? If not, we could blacklist particular kernels and
>> call it a day.
>>
>>> We can enable it in the qa suite, though, which covers
>>> centos7 (latest kernel) and ubuntu xenial and trusty.
>>
>> +1. Do you need some particular PR for that?
>
> Sure. How about a patch that adds the config option to several of the
> files in qa/suites/rados/thrash/thrashers?
OK.
>>>> I tested on few of our production images and it seems that about 30% is
>>>> sparse. This will be lost on any cluster wide event (add/remove nodes,
>>>> PG grow, recovery).
>>>>
>>>> How this is/will be handled in BlueStore?
>>>
>>> BlueStore exposes the same sparseness metadata that enabling the
>>> filestore seek hole or fiemap options does, so it won't be a problem
>>> there.
>>>
>>> I think the only thing that we could potentially add is zero detection
>>> on writes (so that explicitly writing zeros consumes no space). We'd
>>> have to be a bit careful measuring the performance impact of that check on
>>> non-zero writes.
>>
>> I saw that RBD (librbd) does that - replacing writes with discards when buffer
>> contains only zeros. Some code that does the same in librados could be added
>> and it shouldn't impact performance much, current implementation of
>> mem_is_zero is fast and shouldn't be a big problem.
>
> I'd rather not have librados silently translating requests; I think it
> makes more sense to do any zero checking in bluestore. _do_write_small
> and _do_write_big already break writes into (aligned) chunks; that would
> be an easy place to add the check.
That leaves out filestore.
And while I get your point, doing it at the librados level would reduce
network usage for zeroed-out regions as well, and the check could be done
just once, not replica_size times...
--
Piotr Dałek
piotr.dalek@corp.ovh.com
https://www.ovh.com/us/
* Re: Sparse file info in filestore not propagated to other OSDs
[not found] ` <CA+aFP1DJ3L3Pg0r4Pj3o7JoNTNnBRRs0u_nnb2JYz4nGxafUTA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2017-06-21 13:47 ` Piotr Dałek
0 siblings, 0 replies; 19+ messages in thread
From: Piotr Dałek @ 2017-06-21 13:47 UTC (permalink / raw)
To: dillaman-H+wXaHxf7aLQT0dZR+AlfA; +Cc: ceph-devel, ceph-users
On 17-06-21 03:35 PM, Jason Dillaman wrote:
> On Wed, Jun 21, 2017 at 3:05 AM, Piotr Dałek <piotr.dalek@corp.ovh.com> wrote:
>> I saw that RBD (librbd) does that - replacing writes with discards when
>> buffer contains only zeros. Some code that does the same in librados could
>> be added and it shouldn't impact performance much, current implementation of
>> mem_is_zero is fast and shouldn't be a big problem.
>
> I'm pretty sure the only place where librbd converts a write to a
> discard is actually the specialized "writesame" operation used by
> tcmu-runner, as an optimization for ESX's initialization of a new
> image.
Still, I saw it! ;-)
--
Piotr Dałek
piotr.dalek@corp.ovh.com
https://www.ovh.com/us/
* Re: Sparse file info in filestore not propagated to other OSDs
[not found] ` <898546b4-b9b2-5413-27ab-74534cc77eed-Rm6v+N6rxxBWk0Htik3J/w@public.gmane.org>
@ 2017-06-21 13:56 ` Sage Weil
0 siblings, 0 replies; 19+ messages in thread
From: Sage Weil @ 2017-06-21 13:56 UTC (permalink / raw)
To: Piotr Dałek; +Cc: ceph-devel, ceph-users
On Wed, 21 Jun 2017, Piotr Dałek wrote:
> > > > > I tested on few of our production images and it seems that about 30%
> > > > > is
> > > > > sparse. This will be lost on any cluster wide event (add/remove nodes,
> > > > > PG grow, recovery).
> > > > >
> > > > > How this is/will be handled in BlueStore?
> > > >
> > > > BlueStore exposes the same sparseness metadata that enabling the
> > > > filestore seek hole or fiemap options does, so it won't be a problem
> > > > there.
> > > >
> > > > I think the only thing that we could potentially add is zero detection
> > > > on writes (so that explicitly writing zeros consumes no space). We'd
> > > > have to be a bit careful measuring the performance impact of that check
> > > > on
> > > > non-zero writes.
> > >
> > > I saw that RBD (librbd) does that - replacing writes with discards when
> > > buffer
> > > contains only zeros. Some code that does the same in librados could be
> > > added
> > > and it shouldn't impact performance much, current implementation of
> > > mem_is_zero is fast and shouldn't be a big problem.
> >
> > I'd rather not have librados silently translating requests; I think it
> > makes more sense to do any zero checking in bluestore. _do_write_small
> > and _do_write_big already break writes into (aligned) chunks; that would
> > be an easy place to add the check.
>
> That leaves out filestore.
>
> And while I get your point, doing it on librados level would reduce network
> usage for zeroed out regions as well, and check could be done just once, not
> replica_size times...
In the librbd case I think a client-side check makes sense.
For librados, it's a low-level interface with complicated semantics.
Silently translating a write op to a zero op feels dangerous to me.
Would a zero range extend the object size, for example? Or implicitly
create an object that doesn't exist? I can't remember. (It would need to
match write perfectly for this to be safe.) The user might also have a
compound op of multiple operations, which would make swapping one out in
the middle stranger. And probably half the librados unit tests would
stop testing what we thought they were testing. Etc.
It seems more natural to do this a layer up in librbd or rgw...
sage
* Re: Sparse file info in filestore not propagated to other OSDs
2017-06-21 13:24 ` Sage Weil
2017-06-21 13:46 ` Piotr Dałek
@ 2017-06-26 11:59 ` Piotr Dalek
1 sibling, 0 replies; 19+ messages in thread
From: Piotr Dalek @ 2017-06-26 11:59 UTC (permalink / raw)
To: Sage Weil; +Cc: Paweł Sadowski, ceph-devel, ceph-users
On 17-06-21 03:24 PM, Sage Weil wrote:
> On Wed, 21 Jun 2017, Piotr Dałek wrote:
>> On 17-06-14 03:44 PM, Sage Weil wrote:
>>> On Wed, 14 Jun 2017, Paweł Sadowski wrote:
>>>> [snip]
>>>>
>>>> Is it safe to enable "filestore seek hole", are there any tests that
>>>> verifies that everything related to RBD works fine with this enabled?
>>>> Can we make this enabled by default?
>>>
>>> We would need to enable it in the qa environment first. The risk here is
>>> that users run a broad range of kernels and we are exposing ourselves to
>>> any bugs in any kernel version they may run. I'd prefer to leave it off
>>> by default.
>>
>> That's a common regression? If not, we could blacklist particular kernels and
>> call it a day.
>>> We can enable it in the qa suite, though, which covers
>>> centos7 (latest kernel) and ubuntu xenial and trusty.
>>
>> +1. Do you need some particular PR for that?
>
> Sure. How about a patch that adds the config option to several of the
> files in qa/suites/rados/thrash/thrashers?
Does
https://github.com/ovh/ceph/commit/fe65e3a19470eea16c9d273d1aac1c7eff7d2ff1
look reasonable?
--
Piotr Dałek
piotr.dalek@corp.ovh.com
https://www.ovh.com/us/
end of thread, other threads: [~2017-06-26 11:59 UTC | newest]
Thread overview: 19+ messages
2017-04-06 10:15 Sparse file info in filestore not propagated to other OSDs Piotr Dałek
2017-04-06 13:25 ` Sage Weil
2017-04-06 13:30 ` Piotr Dałek
2017-04-06 13:55 ` Sage Weil
2017-04-06 14:24 ` Piotr Dałek
2017-04-06 14:27 ` Sage Weil
2017-04-06 15:50 ` Jason Dillaman
2017-04-06 17:52 ` Josh Durgin
2017-04-07 6:46 ` Piotr Dałek
2017-04-13 14:23 ` Piotr Dałek
[not found] ` <d4bde447-f179-aeca-bac5-636fa40ccba5-Rm6v+N6rxxBWk0Htik3J/w@public.gmane.org>
2017-06-14 6:30 ` Paweł Sadowski
2017-06-14 13:44 ` Sage Weil
[not found] ` <alpine.DEB.2.11.1706141340520.3646-qHenpvqtifaMSRpgCs4c+g@public.gmane.org>
2017-06-21 7:05 ` Piotr Dałek
2017-06-21 13:24 ` Sage Weil
2017-06-21 13:46 ` Piotr Dałek
[not found] ` <898546b4-b9b2-5413-27ab-74534cc77eed-Rm6v+N6rxxBWk0Htik3J/w@public.gmane.org>
2017-06-21 13:56 ` Sage Weil
2017-06-26 11:59 ` Piotr Dalek
2017-06-21 13:35 ` [ceph-users] " Jason Dillaman
[not found] ` <CA+aFP1DJ3L3Pg0r4Pj3o7JoNTNnBRRs0u_nnb2JYz4nGxafUTA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2017-06-21 13:47 ` Piotr Dałek