* cls_rbd copyup and write
@ 2017-06-27  8:22 Ning Yao
  2017-06-27 13:40 ` Jason Dillaman
  2017-06-27 13:42 ` Sage Weil
  0 siblings, 2 replies; 6+ messages in thread
From: Ning Yao @ 2017-06-27  8:22 UTC (permalink / raw)
  To: ceph-devel

Hi, all

Currently I find that, when doing copy-on-write for a cloned image, librbd
calls the cls copyup function to write the data, read from its parent, to
the child.

However, there is an issue here: if an object in the parent image has data
in [0, 8192] and no data in [8192, end], then after the COW operation the
whole object [0, end] is written to the child object, with [8192, end] all
zeros. The same thing happens when flattening images.

Actually, we already have sparse_read to read just the data without the
holes. However, the copyup function does not support writing several
fragments such as {[0, 8192], [16384, 20480]}.

So is it possible to directly send OSDOp {[cow write], [cow write],
[user write]} instead of OSDOp {[copyup], [user write]}?
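
A rough librados-style sketch of what I have in mind (hypothetical
client-side code, not the current librbd copyup path; the object name,
extent map and helper are made up):

  #include <rados/librados.hpp>
  #include <map>
  #include <string>

  // Write only the parent's allocated extents plus the user's write, all in
  // one compound operation, instead of shipping the fully-filled object.
  int sparse_cow_write(librados::IoCtx &ioctx,
                       const std::string &oid,   // child object (hypothetical name)
                       std::map<uint64_t, librados::bufferlist> parent_extents,
                       uint64_t user_off, librados::bufferlist &user_bl)
  {
    librados::ObjectWriteOperation op;
    for (auto &e : parent_extents)
      op.write(e.first, e.second);    // one [cow write] per allocated extent
    op.write(user_off, user_bl);      // the [user write]
    return ioctx.operate(oid, &op);   // all ops travel in a single request
  }

(The real copyup also has to guard against the child object already
existing, which this sketch ignores.)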



Regards
Ning Yao


* Re: cls_rbd copyup and write
  2017-06-27  8:22 cls_rbd copyup and write Ning Yao
@ 2017-06-27 13:40 ` Jason Dillaman
  2017-06-28  4:06   ` Ning Yao
  2017-06-27 13:42 ` Sage Weil
  1 sibling, 1 reply; 6+ messages in thread
From: Jason Dillaman @ 2017-06-27 13:40 UTC (permalink / raw)
  To: Ning Yao; +Cc: ceph-devel

This is definitely an optimization we can test after the Luminous release,
once BlueStore is the de facto OSD object store. Of course, even BlueStore
won't track holes down to 8KiB -- only 16KiB or 64KiB, depending on your
backing device and settings. I am pretty sure Luminous already has an
optimization to skip the copy-up if the full parent object is zeroed.
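
For reference, I think the knobs in question are the BlueStore allocation
sizes; roughly (the exact defaults here are my recollection of the
Luminous-era values, so treat them as an assumption):

  [osd]
  # BlueStore tracks allocations -- and therefore holes -- at this granularity
  bluestore_min_alloc_size_hdd = 65536    # 64KiB on rotational devices
  bluestore_min_alloc_size_ssd = 16384    # 16KiB on flash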

I do remember a presentation about surprising results when implementing
NFS v4.2 READ_PLUS sparse support, where it actually degraded performance
due to the need to seek over the file holes. There might be a performance
trade-off to consider when objects have lots of holes, due to increased
metadata plus decreased data locality.

On Tue, Jun 27, 2017 at 4:22 AM, Ning Yao <zay11022@gmail.com> wrote:
> Hi, all
>
> Currently I find that, when doing copy-on-write for a cloned image, librbd
> calls the cls copyup function to write the data, read from its parent, to
> the child.
>
> However, there is an issue here: if an object in the parent image has data
> in [0, 8192] and no data in [8192, end], then after the COW operation the
> whole object [0, end] is written to the child object, with [8192, end] all
> zeros. The same thing happens when flattening images.
>
> Actually, we already have sparse_read to read just the data without the
> holes. However, the copyup function does not support writing several
> fragments such as {[0, 8192], [16384, 20480]}.
>
> So is it possible to directly send OSDOp {[cow write], [cow write],
> [user write]} instead of OSDOp {[copyup], [user write]}?
>
>
>
> Regards
> Ning Yao



-- 
Jason


* Re: cls_rbd copyup and write
  2017-06-27  8:22 cls_rbd copyup and write Ning Yao
  2017-06-27 13:40 ` Jason Dillaman
@ 2017-06-27 13:42 ` Sage Weil
       [not found]   ` <CACJqLyZqdbe4dNpSOOG-q4iXWE2Kkk6W-y3FACyP2x0rkm6drw@mail.gmail.com>
  2017-06-28  4:09   ` Ning Yao
  1 sibling, 2 replies; 6+ messages in thread
From: Sage Weil @ 2017-06-27 13:42 UTC (permalink / raw)
  To: Ning Yao; +Cc: ceph-devel

On Tue, 27 Jun 2017, Ning Yao wrote:
> Hi, all
> 
> Currently I find that, when doing copy-on-write for a cloned image, librbd
> calls the cls copyup function to write the data, read from its parent, to
> the child.
>
> However, there is an issue here: if an object in the parent image has data
> in [0, 8192] and no data in [8192, end], then after the COW operation the
> whole object [0, end] is written to the child object, with [8192, end] all
> zeros. The same thing happens when flattening images.

Note that BlueStore (luminous) doesn't have this issue: the clone is an 
O(1) metadata operation and subsequent writes are basically copy-no-write.

> Actually, we already have sparse_read to read just the data without the
> holes. However, the copyup function does not support writing several
> fragments such as {[0, 8192], [16384, 20480]}.
>
> So is it possible to directly send OSDOp {[cow write], [cow write],
> [user write]} instead of OSDOp {[copyup], [user write]}?

It seems like the better fix for FileStore is to make the copyup operation 
do a sparse_read and write only the allocated ranges.  I think the only 
issue there is that the two mechanisms for making sparse_read actually 
sparse are fiemap and seek_hole_data, both of which are disabled by 
default because they rely on newish or buggy-in-the-past kernel APIs and 
we want to avoid hard to diagnose breakage.  They should be enabled with 
caution.
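
To make that concrete, a sparse-aware copyup could take an extent map
instead of a single blob; a minimal cls-side sketch (a hypothetical method,
not the existing cls_rbd copyup, and untested) might look like:

  #include <errno.h>
  #include <map>
  #include "include/types.h"
  #include "objclass/objclass.h"

  // input: offset -> data for each allocated extent the client sparse_read
  // from the parent, e.g. {0: 8192 bytes, 16384: 4096 bytes}
  static int sparse_copyup(cls_method_context_t hctx, bufferlist *in, bufferlist *out)
  {
    std::map<uint64_t, bufferlist> extents;
    try {
      bufferlist::iterator it = in->begin();
      ::decode(extents, it);
    } catch (const ceph::buffer::error &) {
      return -EINVAL;
    }

    // only fill the child if it does not already exist (as I believe the
    // existing copyup does)
    uint64_t size;
    if (cls_cxx_stat(hctx, &size, NULL) == 0)
      return 0;

    for (auto &e : extents) {
      int r = cls_cxx_write(hctx, e.first, e.second.length(), &e.second);
      if (r < 0)
        return r;
    }
    return 0;
  }

On the FileStore side, if I remember the option names right, the switches
involved are filestore_fiemap and filestore_seek_data_hole, both false by
default for exactly the reasons above.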

sage


* Re: cls_rbd copyup and write
       [not found]   ` <CACJqLyZqdbe4dNpSOOG-q4iXWE2Kkk6W-y3FACyP2x0rkm6drw@mail.gmail.com>
@ 2017-06-27 13:49     ` Sage Weil
  0 siblings, 0 replies; 6+ messages in thread
From: Sage Weil @ 2017-06-27 13:49 UTC (permalink / raw)
  To: Haomai Wang; +Cc: Ning Yao, ceph-devel

On Tue, 27 Jun 2017, Haomai Wang wrote:
> It's not related to objectstore clone... it's an rbd-side clone, so it won't
> invoke "clone".

Oh right, nevermind! :)

sage


* Re: cls_rbd copyup and write
  2017-06-27 13:40 ` Jason Dillaman
@ 2017-06-28  4:06   ` Ning Yao
  0 siblings, 0 replies; 6+ messages in thread
From: Ning Yao @ 2017-06-28  4:06 UTC (permalink / raw)
  To: dillaman; +Cc: ceph-devel

2017-06-27 21:40 GMT+08:00 Jason Dillaman <jdillama@redhat.com>:
> This is definitely an optimization we can test after the Luminous release,
> once BlueStore is the de facto OSD object store. Of course, even BlueStore
> won't track holes down to 8KiB -- only 16KiB or 64KiB, depending on your
> backing device and settings. I am pretty sure Luminous already has an
> optimization to skip the copy-up if the full parent object is zeroed.
You mean that if the full parent object is zeroed, it will not copy up?
But what about a 4M object in BlueStore with only a few 16KiB or 64KiB
holes? It seems those objects are still read to the rbd-client side and a
copy-up request is still sent to the osd side, and I do not see that
BlueStore will treat a whole 64KiB allocated extent as a hole if its data
is all zeros.


> I do remember a presentation about surprising results when implementing
> NFS v4.2 READ_PLUS sparse support, where it actually degraded performance
> due to the need to seek over the file holes. There might be a performance
> trade-off to consider when objects have lots of holes, due to increased
> metadata plus decreased data locality.
Yeah, but I think we can send a single MOSDOp containing several OSDOps.
It will then be treated as a single transaction on the osd side and handled
much more efficiently. If we send several MOSDOps instead, things get bad,
since each transaction on the osd side is queued and processed serially
because of the pg_lock and the per-object rw_lock. Actually, we face the
same issue when a VM flushes in-memory data to disk and lots of adjacent
but non-contiguous write ops are submitted to the osd side as separate
MOSDOps simultaneously, so that a single PG processes each transaction one
by one, which leads to bad latency for the ops at the end of the pg_wq
queue.
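
In librados terms the contrast is roughly this (sketch only; the object
name and data are illustrative):

  #include <rados/librados.hpp>

  void batched_vs_separate(librados::IoCtx &ioctx,
                           librados::bufferlist &bl_a,   // data for [0, 16384)
                           librados::bufferlist &bl_b)   // data for [16384, ...)
  {
    // one MOSDOp carrying several OSDOps: a single transaction, one pass
    // through the PG queue on the osd side
    librados::ObjectWriteOperation op;
    op.write(0, bl_a);
    op.write(16384, bl_b);
    ioctx.operate("some_object", &op);

    // versus two MOSDOps: each is queued separately and serialized behind
    // the pg_lock and the object's rw_lock
    ioctx.write("some_object", bl_a, bl_a.length(), 0);
    ioctx.write("some_object", bl_b, bl_b.length(), 16384);
  }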


> On Tue, Jun 27, 2017 at 4:22 AM, Ning Yao <zay11022@gmail.com> wrote:
>> Hi, all
>>
>> Currently I find that, when doing copy-on-write for a cloned image, librbd
>> calls the cls copyup function to write the data, read from its parent, to
>> the child.
>>
>> However, there is an issue here: if an object in the parent image has data
>> in [0, 8192] and no data in [8192, end], then after the COW operation the
>> whole object [0, end] is written to the child object, with [8192, end] all
>> zeros. The same thing happens when flattening images.
>>
>> Actually, we already have sparse_read to read just the data without the
>> holes. However, the copyup function does not support writing several
>> fragments such as {[0, 8192], [16384, 20480]}.
>>
>> So is it possible to directly send OSDOp {[cow write], [cow write],
>> [user write]} instead of OSDOp {[copyup], [user write]}?
>>
>>
>>
>> Regards
>> Ning Yao
>
>
>
> --
> Jason


* Re: cls_rbd copyup and write
  2017-06-27 13:42 ` Sage Weil
       [not found]   ` <CACJqLyZqdbe4dNpSOOG-q4iXWE2Kkk6W-y3FACyP2x0rkm6drw@mail.gmail.com>
@ 2017-06-28  4:09   ` Ning Yao
  1 sibling, 0 replies; 6+ messages in thread
From: Ning Yao @ 2017-06-28  4:09 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

2017-06-27 21:42 GMT+08:00 Sage Weil <sage@newdream.net>:
> On Tue, 27 Jun 2017, Ning Yao wrote:
>> Hi, all
>>
>> Currently I find that, when doing copy-on-write for a cloned image, librbd
>> calls the cls copyup function to write the data, read from its parent, to
>> the child.
>>
>> However, there is an issue here: if an object in the parent image has data
>> in [0, 8192] and no data in [8192, end], then after the COW operation the
>> whole object [0, end] is written to the child object, with [8192, end] all
>> zeros. The same thing happens when flattening images.
>
> Note that BlueStore (luminous) doesn't have this issue: the clone is an
> O(1) metadata operation and subsequent writes are basically copy-no-write.
Are we talking about the same thing? The osd-side clone op only occurs for
rbd snapshots. What I am describing is an rbd clone, which is the layering
feature on the rbd-client side.


>> Actually, we already have sparse_read to read just the data without the
>> holes. However, the copyup function does not support writing several
>> fragments such as {[0, 8192], [16384, 20480]}.
>>
>> So is it possible to directly send OSDOp {[cow write], [cow write],
>> [user write]} instead of OSDOp {[copyup], [user write]}?
>
> It seems like the better fix for FileStore is to make the copyup operation
> do a sparse_read and write only the allocated ranges.  I think the only
> issue there is that the two mechanisms for making sparse_read actually
> sparse are fiemap and seek_hole_data, both of which are disabled by
> default because they rely on newish or buggy-in-the-past kernel APIs and
> we want to avoid hard to diagnose breakage.  They should be enabled with
> caution.
>
> sage

