* cls_rbd copyup and write
@ 2017-06-27 8:22 Ning Yao
2017-06-27 13:40 ` Jason Dillaman
2017-06-27 13:42 ` Sage Weil
0 siblings, 2 replies; 6+ messages in thread
From: Ning Yao @ 2017-06-27 8:22 UTC (permalink / raw)
To: ceph-devel
Hi, all
Currently, when librbd performs copy-on-write for a cloned image, it
calls the cls copyup function to write the data read from the parent
into the child.

However, there is an issue here: if an object in the parent image has
data in [0, 8192] and a hole in [8192, end], then after the COW
operation the whole object [0, end] is written to the child object,
with [8192, end] filled with zeros. The same happens when flattening
images.

Actually, we already have sparse_read to read only the data and skip
the holes. However, the copyup function does not support writing
several fragments such as {[0, 8192], [16384, 20480]}.

So would it be possible to directly send OSDOp {[cow write], [cow
write], [user write]} instead of OSDOp {[copyup], [user write]}?
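For illustration, a rough librados sketch of the idea (hypothetical
names throughout -- sparse_copyup, parent_oid, object_size, user_off
and so on are placeholders, and the guards and error handling the real
copyup path would need are omitted):

  #include <map>
  #include <string>
  #include <rados/librados.hpp>

  // Sketch: sparse_read the parent object, then send ONE compound
  // write op to the child -- one write per allocated extent plus the
  // trailing user write -- instead of one full-object copyup.
  void sparse_copyup(librados::IoCtx& parent_io, librados::IoCtx& child_io,
                     const std::string& parent_oid,
                     const std::string& child_oid, uint64_t object_size,
                     uint64_t user_off, librados::bufferlist& user_bl)
  {
    std::map<uint64_t, uint64_t> extents;  // offset -> length, holes skipped
    librados::bufferlist data;             // extent payloads, concatenated
    int rval = 0;
    librados::ObjectReadOperation rd;
    rd.sparse_read(0, object_size, &extents, &data, &rval);
    parent_io.operate(parent_oid, &rd, nullptr);

    librados::ObjectWriteOperation wr;     // one MOSDOp, several OSDOps
    uint64_t data_off = 0;
    for (const auto& ext : extents) {
      librados::bufferlist chunk;
      chunk.substr_of(data, data_off, ext.second);
      wr.write(ext.first, chunk);          // a "cow write" per extent
      data_off += ext.second;
    }
    wr.write(user_off, user_bl);           // the original user write
    child_io.operate(child_oid, &wr);
  }

The holes in the child would then stay unwritten instead of being
filled with zeros.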
Regards
Ning Yao
* Re: cls_rbd copyup and write
2017-06-27 8:22 cls_rbd copyup and write Ning Yao
@ 2017-06-27 13:40 ` Jason Dillaman
2017-06-28 4:06 ` Ning Yao
2017-06-27 13:42 ` Sage Weil
1 sibling, 1 reply; 6+ messages in thread
From: Jason Dillaman @ 2017-06-27 13:40 UTC (permalink / raw)
To: Ning Yao; +Cc: ceph-devel
This is definitely an optimization we can test post-Luminous release
once bluestore is the de facto OSD object store. Of course, even
bluestore won't track holes down to 8KiB -- only 16KiB or 64KiB
depending on your backing device and settings. I am pretty sure
Luminous already has an optimization to not copy-up if the full parent
object is zeroed.
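For reference, that granularity comes from the min-alloc-size options;
a ceph.conf sketch with what I believe are the Luminous-era defaults
(check your release):

  [osd]
  bluestore_min_alloc_size_hdd = 65536   # 64 KiB on rotational media
  bluestore_min_alloc_size_ssd = 16384   # 16 KiB on flash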
I do remember a presentation about surprising results when
implementing NFS v4.2 READ_PLUS sparse support, where it actually
degraded performance due to the need to seek out the holes in the
file. There might be a performance trade-off to consider when objects
have lots of holes, due to increased metadata plus decreased data
locality.
On Tue, Jun 27, 2017 at 4:22 AM, Ning Yao <zay11022@gmail.com> wrote:
> Hi, all
>
> Currently, when librbd performs copy-on-write for a cloned image, it
> calls the cls copyup function to write the data read from the parent
> into the child.
>
> However, there is an issue here: if an object in the parent image has
> data in [0, 8192] and a hole in [8192, end], then after the COW
> operation the whole object [0, end] is written to the child object,
> with [8192, end] filled with zeros. The same happens when flattening
> images.
>
> Actually, we already have sparse_read to read only the data and skip
> the holes. However, the copyup function does not support writing
> several fragments such as {[0, 8192], [16384, 20480]}.
>
> So would it be possible to directly send OSDOp {[cow write], [cow
> write], [user write]} instead of OSDOp {[copyup], [user write]}?
>
>
>
> Regards
> Ning Yao
--
Jason
* Re: cls_rbd copyup and write
2017-06-27 8:22 cls_rbd copyup and write Ning Yao
2017-06-27 13:40 ` Jason Dillaman
@ 2017-06-27 13:42 ` Sage Weil
[not found] ` <CACJqLyZqdbe4dNpSOOG-q4iXWE2Kkk6W-y3FACyP2x0rkm6drw@mail.gmail.com>
2017-06-28 4:09 ` Ning Yao
1 sibling, 2 replies; 6+ messages in thread
From: Sage Weil @ 2017-06-27 13:42 UTC (permalink / raw)
To: Ning Yao; +Cc: ceph-devel
On Tue, 27 Jun 2017, Ning Yao wrote:
> Hi, all
>
> Currently, when librbd performs copy-on-write for a cloned image, it
> calls the cls copyup function to write the data read from the parent
> into the child.
>
> However, there is an issue here: if an object in the parent image has
> data in [0, 8192] and a hole in [8192, end], then after the COW
> operation the whole object [0, end] is written to the child object,
> with [8192, end] filled with zeros. The same happens when flattening
> images.
Note that BlueStore (luminous) doesn't have this issue: the clone is an
O(1) metadata operation and subsequent writes are basically copy-no-write.
> Actually, we already have sparse_read to read only the data and skip
> the holes. However, the copyup function does not support writing
> several fragments such as {[0, 8192], [16384, 20480]}.
>
> So would it be possible to directly send OSDOp {[cow write], [cow
> write], [user write]} instead of OSDOp {[copyup], [user write]}?
It seems like the better fix for FileStore is to make the copyup
operation do a sparse_read and write only the allocated ranges. I
think the only issue there is that the two mechanisms for making
sparse_read actually sparse are fiemap and seek_hole_data, both of
which are disabled by default because they rely on newish or
buggy-in-the-past kernel APIs and we want to avoid hard-to-diagnose
breakage. They should be enabled with caution.
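Concretely, the toggles in question -- both off by default -- would be
enabled via something like:

  [osd]
  filestore fiemap = true            # use the FIEMAP ioctl to find holes
  filestore seek data hole = true    # use lseek(SEEK_DATA/SEEK_HOLE)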
sage
* Re: cls_rbd copyup and write
[not found] ` <CACJqLyZqdbe4dNpSOOG-q4iXWE2Kkk6W-y3FACyP2x0rkm6drw@mail.gmail.com>
@ 2017-06-27 13:49 ` Sage Weil
0 siblings, 0 replies; 6+ messages in thread
From: Sage Weil @ 2017-06-27 13:49 UTC (permalink / raw)
To: Haomai Wang; +Cc: Ning Yao, ceph-devel
On Tue, 27 Jun 2017, Haomai Wang wrote:
> It's not related to the objectstore clone... it's an rbd-side clone,
> so it won't invoke "clone".
Oh right, nevermind! :)
sage
* Re: cls_rbd copyup and write
2017-06-27 13:40 ` Jason Dillaman
@ 2017-06-28 4:06 ` Ning Yao
0 siblings, 0 replies; 6+ messages in thread
From: Ning Yao @ 2017-06-28 4:06 UTC (permalink / raw)
To: dillaman; +Cc: ceph-devel
2017-06-27 21:40 GMT+08:00 Jason Dillaman <jdillama@redhat.com>:
> This is definitely an optimization we can test post-Luminous release
> once bluestore is the de facto OSD object store. Of course, even
> bluestore won't track holes down to 8KiB -- only 16KiB or 64KiB
> depending on your backing device and settings. I am pretty sure
> Luminous already has an optimization to not copy-up if the full parent
> object is zeroed.
You mean that if the full parent object is zeroed, it will not copy
up? But what about a 4M object with only a few 16KiB or 64KiB holes in
BlueStore? It seems those objects are still read to the rbd client
side and a copy-up request is still sent to the OSD side, and I do not
find that BlueStore treats a whole 64KiB allocated extent as a hole if
its data is all zeros.
> I do remember a presentation about surprising results when
> implementing NFS v4.2 READ_PLUS sparse support, where it actually
> degraded performance due to the need to seek out the holes in the
> file. There might be a performance trade-off to consider when objects
> have lots of holes, due to increased metadata plus decreased data
> locality.
Yeah, but I think we can send a single MOSDOp containing several
OSDOps, so it will be treated as a single transaction on the OSD side
and handled much more efficiently. Sending several MOSDOps would be
worse, since each transaction on the OSD side is queued and processed
serially because of the pg_lock and the per-object rw_lock.

Actually, we face the same issue when a VM flushes in-memory data to
disk: lots of adjacent but discontiguous write ops are submitted to
the OSD side, each in its own MOSDOp, so a single PG processes each
transaction one by one, which leads to bad latency for the ops at the
end of the pg_wq queue.
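Roughly (ioctx is an open librados::IoCtx; object name and payloads
are hypothetical):

  // One ObjectWriteOperation == one MOSDOp == a single pass through
  // the PG queue under pg_lock, rather than N serialized transactions:
  librados::bufferlist bl_a, bl_b;       // hypothetical payloads
  bl_a.append("A"); bl_b.append("B");
  librados::ObjectWriteOperation op;
  op.write(0, bl_a);                     // adjacent but discontiguous
  op.write(16384, bl_b);                 // extents share one transaction
  int r = ioctx.operate("some_object", &op);  // vs. two operate() calls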
> On Tue, Jun 27, 2017 at 4:22 AM, Ning Yao <zay11022@gmail.com> wrote:
>> Hi, all
>>
>> Currently, when librbd performs copy-on-write for a cloned image, it
>> calls the cls copyup function to write the data read from the parent
>> into the child.
>>
>> However, there is an issue here: if an object in the parent image has
>> data in [0, 8192] and a hole in [8192, end], then after the COW
>> operation the whole object [0, end] is written to the child object,
>> with [8192, end] filled with zeros. The same happens when flattening
>> images.
>>
>> Actually, we already have sparse_read to read only the data and skip
>> the holes. However, the copyup function does not support writing
>> several fragments such as {[0, 8192], [16384, 20480]}.
>>
>> So would it be possible to directly send OSDOp {[cow write], [cow
>> write], [user write]} instead of OSDOp {[copyup], [user write]}?
>>
>>
>>
>> Regards
>> Ning Yao
>
>
>
> --
> Jason
* Re: cls_rbd copyup and write
2017-06-27 13:42 ` Sage Weil
[not found] ` <CACJqLyZqdbe4dNpSOOG-q4iXWE2Kkk6W-y3FACyP2x0rkm6drw@mail.gmail.com>
@ 2017-06-28 4:09 ` Ning Yao
1 sibling, 0 replies; 6+ messages in thread
From: Ning Yao @ 2017-06-28 4:09 UTC (permalink / raw)
To: Sage Weil; +Cc: ceph-devel
2017-06-27 21:42 GMT+08:00 Sage Weil <sage@newdream.net>:
> On Tue, 27 Jun 2017, Ning Yao wrote:
>> Hi, all
>>
>> Currently, when librbd performs copy-on-write for a cloned image, it
>> calls the cls copyup function to write the data read from the parent
>> into the child.
>>
>> However, there is an issue here: if an object in the parent image has
>> data in [0, 8192] and a hole in [8192, end], then after the COW
>> operation the whole object [0, end] is written to the child object,
>> with [8192, end] filled with zeros. The same happens when flattening
>> images.
>
> Note that BlueStore (luminous) doesn't have this issue: the clone is an
> O(1) metadata operation and subsequent writes are basically copy-no-write.
Are we talking about the same thing? The OSD-side clone op only occurs
for rbd snapshots. What I am describing is rbd clone, i.e. the
layering feature on the rbd client side.
>> Actually, we already have sparse_read to read only the data and skip
>> the holes. However, the copyup function does not support writing
>> several fragments such as {[0, 8192], [16384, 20480]}.
>>
>> So would it be possible to directly send OSDOp {[cow write], [cow
>> write], [user write]} instead of OSDOp {[copyup], [user write]}?
>
> It seems like the better fix for FileStore is to make the copyup
> operation do a sparse_read and write only the allocated ranges. I
> think the only issue there is that the two mechanisms for making
> sparse_read actually sparse are fiemap and seek_hole_data, both of
> which are disabled by default because they rely on newish or
> buggy-in-the-past kernel APIs and we want to avoid hard-to-diagnose
> breakage. They should be enabled with caution.
>
> sage