All of lore.kernel.org
 help / color / mirror / Atom feed
* Question about writeback performance and content address object for deduplication
@ 2017-01-26 11:04 myoungwon oh
  2017-01-31 14:24 ` Sage Weil
  0 siblings, 1 reply; 16+ messages in thread
From: myoungwon oh @ 2017-01-26 11:04 UTC (permalink / raw)
  To: ceph-devel, 오명원

I have two questions.

1. I would like to ask about the CAS location. Our current implementation stores
the content-addressed object (CAO) in the storage tier. However, if we store the
CAO in the cache tier, we could get a performance advantage. Do you think we can
create the CAO in the cache tier, or should we create a separate storage pool for CAS?


2. The results below are performance results for our current implementation.
Experiment setup:
PROXY (inline dedup), WRITEBACK (lazy dedup, target_max_bytes: 50MB),
ORIGINAL (no dedup feature and no cache tier),
fio, 512K block, seq. I/O, single thread
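
(For reference, the workload is roughly equivalent to an fio job like the one
below; the device path is only a placeholder, not our exact command line:

  fio --name=seqwrite --filename=/dev/rbd0 --ioengine=libaio --direct=1 \
      --rw=write --bs=512k --iodepth=1 --numjobs=1

and --rw=read for the read runs.)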

One thing to note is that the writeback case is slower than the proxy case.
We think there are three problems, as follows.

A. The current implementation creates a fingerprint by reading the entire
object at flush time. Therefore reads and writes end up mixed.

B. When a client issues a read, the promote_object function reads the object
and writes it back to the cache tier, which also mixes reads and
writes.

C. When flushing, unchanged parts are rewritten because the flush operation
works on whole objects.

Is there something I am doing wrong, or could you give me suggestions to improve
performance?


a. Write performance (KB/s)

dedup_ratio  0 20 40 60 80 100

PROXY  45586 47804 51120 52844 56167 55302

WRITEBACK  13151 11078 9531 13010 9518 8319

ORIGINAL  121209 124786 122140 121195 122540 132363


b. Read performance (KB/s)

dedup_ratio  0 20 40 60 80 100

PROXY  112231 118994 118070 120071 117884 132748

WRITEBACK  34040 29109 19104 26677 24756 21695

ORIGINAL  285482 284398 278063 277989 271793 285094


thanks,
Myoungwon Oh

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Question about writeback performance and content address object for deduplication
  2017-01-26 11:04 Question about writeback performance and content address object for deduplication myoungwon oh
@ 2017-01-31 14:24 ` Sage Weil
  2017-02-07 11:03   ` myoungwon oh
  0 siblings, 1 reply; 16+ messages in thread
From: Sage Weil @ 2017-01-31 14:24 UTC (permalink / raw)
  To: myoungwon oh; +Cc: ceph-devel, 오명원

On Thu, 26 Jan 2017, myoungwon oh wrote:
> I have two questions.
> 
> 1. I would like to ask about CAS location. current our implementation store
> content address object in storage tier.However, If we store the CAO in the
> cache tier, we can get a performance advantage. Do you think we can create
> CAO in cachetier? or create a separate storage pool for CAS?

It depends on the design.  If you are naming the objects at the 
librados client side, then you can use the rados cluster itself 
unmodified (with or without a cache tier).  This is roughly how I have
anticipated implementing the CAS storage portion.  If you are doing the 
chunking and hashing within the OSD itself, then you can't do the CAS 
at the first tier because the requests won't be directed at the right OSD.

> 2. The results below are performance result for our current implementation.
> experiment setup:
> PROXY (inline dedup), WRITEBACK (lazy dedup, target_max_bytes: 50MB),
> ORIGINAL(without dedup feature and cache tier),
> fio, 512K block, seq. I/O, single thread
> 
> One thing to note is that the writeback case is slower than the proxy.
> We think there are three problems as follows.
> 
> A. The current implementation creates a fingerprint by reading the entire
> object when flushing. Therefore, there is a problem that read and write are
> mixed.

I expect this is a small factor compared to the fact that in writeback 
mode you have to *write* to the cache tier, which is 3x replicated, 
whereas in proxy mode those writes don't happen at all.

> B. When client request read, the promote_object function reads the object
> and writes it back to the cache tier, which also causes a mix of read and
> write.

This can be mitigated by setting the min_read_recency_for_promote pool 
property to something >1.  Then reads will be proxied unless the object 
appears to be hot (because it has been touched over multiple 
hitset intervals).
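
For example (the cache pool name here is just a placeholder):

  ceph osd pool set <cache-pool> min_read_recency_for_promote 2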

> C. When flushing, the unchanged part is rewritten because flush operation
> perform per-object based.

Yes.

Is there a description of your overall approach somewhere?

sage


> 
> Do I have something wrong? or Could you give me a suggestion to improve
> performance?
> 
> 
> a. Write performance (KB/s)
> 
> dedup_ratio  0 20 40 60 80 100
> 
> PROXY  45586 47804 51120 52844 56167 55302
> 
> WRITEBACK  13151 11078 9531 13010 9518 8319
> 
> ORIGINAL  121209 124786 122140 121195 122540 132363
> 
> 
> b. Read performance (KB/s)
> 
> dedup_ratio  0 20 40 60 80 100
> 
> PROXY  112231 118994 118070 120071 117884 132748
> 
> WRITEBACK  34040 29109 19104 26677 24756 21695
> 
> ORIGINAL  285482 284398 278063 277989 271793 285094
> 
> 
> thanks,
> Myoungwon Oh
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Question about writeback performance and content address object for deduplication
  2017-01-31 14:24 ` Sage Weil
@ 2017-02-07 11:03   ` myoungwon oh
  2017-02-07 14:50     ` Sage Weil
  0 siblings, 1 reply; 16+ messages in thread
From: myoungwon oh @ 2017-02-07 11:03 UTC (permalink / raw)
  To: ceph-devel, sweil, 오명원

Hi Sage.

I uploaded a document that describes my overall approach.
Please take a look and give me feedback.
slide: https://www.slideshare.net/secret/JZcy3yYEDIHPyg

thanks


2017-01-31 23:24 GMT+09:00 Sage Weil <sage@newdream.net>:
> On Thu, 26 Jan 2017, myoungwon oh wrote:
>> I have two questions.
>>
>> 1. I would like to ask about CAS location. current our implementation store
>> content address object in storage tier.However, If we store the CAO in the
>> cache tier, we can get a performance advantage. Do you think we can create
>> CAO in cachetier? or create a separate storage pool for CAS?
>
> It depends on the design.  If the you are naming the objects at the
> librados client side, then you can use the rados cluster itself
> unmodified (with or without a cache tier).  This is roughly how I have
> anticipated implementing the CAS storage portion.  If you are doing the
> chunking hashing and within the OSD itself, then you can't do the CAS
> at the first tier because the requests won't be directed at the right OSD.
>
>> 2. The results below are performance result for our current implementation.
>> experiment setup:
>> PROXY (inline dedup), WRITEBACK (lazy dedup, target_max_bytes: 50MB),
>> ORIGINAL(without dedup feature and cache tier),
>> fio, 512K block, seq. I/O, single thread
>>
>> One thing to note is that the writeback case is slower than the proxy.
>> We think there are three problems as follows.
>>
>> A. The current implementation creates a fingerprint by reading the entire
>> object when flushing. Therefore, there is a problem that read and write are
>> mixed.
>
> I expect this is a small factor compared to the fact that in writeback
> mode you have to *write* to the cache tier, which is 3x replicated,
> whereas in proxy mode those writes don't happen at all.
>
>> B. When client request read, the promote_object function reads the object
>> and writes it back to the cache tier, which also causes a mix of read and
>> write.
>
> This can be mitigated by setting the min_read_recency_for_promote pool
> property to something >1.  Then reads will be proxied unless the object
> appears to be hot (because it has been touched over multiple
> hitset intervals).
>
>> C. When flushing, the unchanged part is rewritten because flush operation
>> perform per-object based.
>
> Yes.
>
> Is there a description of your overall approach somewhere?
>
> sage
>
>
>>
>> Do I have something wrong? or Could you give me a suggestion to improve
>> performance?
>>
>>
>> a. Write performance (KB/s)
>>
>> dedup_ratio  0 20 40 60 80 100
>>
>> PROXY  45586 47804 51120 52844 56167 55302
>>
>> WRITEBACK  13151 11078 9531 13010 9518 8319
>>
>> ORIGINAL  121209 124786 122140 121195 122540 132363
>>
>>
>> b. Read performance (KB/s)
>>
>> dedup_ratio  0 20 40 60 80 100
>>
>> PROXY  112231 118994 118070 120071 117884 132748
>>
>> WRITEBACK  34040 29109 19104 26677 24756 21695
>>
>> ORIGINAL  285482 284398 278063 277989 271793 285094
>>
>>
>> thanks,
>> Myoungwon Oh
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Question about writeback performance and content address object for deduplication
  2017-02-07 11:03   ` myoungwon oh
@ 2017-02-07 14:50     ` Sage Weil
  2017-03-14  6:25       ` myoungwon oh
  0 siblings, 1 reply; 16+ messages in thread
From: Sage Weil @ 2017-02-07 14:50 UTC (permalink / raw)
  To: myoungwon oh; +Cc: ceph-devel, 오명원

On Tue, 7 Feb 2017, myoungwon oh wrote:
> Hi sage.
> 
> I uploaded the document which describe my overall appoach.
> please see it and give me feedback.
> slide: https://www.slideshare.net/secret/JZcy3yYEDIHPyg

This approach looks pretty close to what we have been planning.  A few 
comments:

1) I think it may be better to view the tier/pool that has the object 
metadata as the "base" pool, and the CAS pool with the refcounted 
object chunks as a tier below that.

2) I think we can use an object class or a handful of new native rados 
operations to make the CAS pool read/write operations more efficient.  In 
your slides you describe a process something like

  rados(getattr)
  if exists
     rados(increment ref count)
  else
     rados(write object and set ref count to 1)

This could be collapsed into a single optimistic operation that sends the 
data and a command that says "create or increment ref count" so that the 
conditional behavior is handled at the OSD.  This will be more efficient 
for small chunks.  (For large chunks, or in cases where we have some 
confidence that the chunk probably already exists, the pessimistic 
approach might still make sense.)  Either way, we should probably support 
both.
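
To make that concrete, here is a rough sketch of what such a "create or
increment" handler could look like as an object class method.  This is only an
illustration: the class/method names and the refcount xattr encoding are made
up, and this is not the existing cls_refcount interface.

  // Hedged sketch: combined "create or increment ref count" cls method.
  // The input payload is assumed to be the raw chunk data for simplicity.
  #include "objclass/objclass.h"
  #include <cerrno>
  #include <cstdlib>
  #include <string>

  CLS_VER(1,0)
  CLS_NAME(refcount_sketch)

  static int write_or_get(cls_method_context_t hctx, bufferlist *in, bufferlist *out)
  {
    (void)out;  // unused in this sketch
    uint64_t size;
    time_t mtime;
    int r = cls_cxx_stat(hctx, &size, &mtime);
    if (r == -ENOENT) {
      // Chunk does not exist yet: store the payload and start the refcount at 1.
      r = cls_cxx_write_full(hctx, in);
      if (r < 0)
        return r;
      bufferlist one;
      one.append("1", 1);
      return cls_cxx_setxattr(hctx, "refcount", &one);
    }
    if (r < 0)
      return r;
    // Chunk already exists: ignore the payload and bump the refcount.
    bufferlist cur;
    r = cls_cxx_getxattr(hctx, "refcount", &cur);
    if (r < 0)
      return r;
    uint64_t refs = strtoull(std::string(cur.c_str(), cur.length()).c_str(), nullptr, 10);
    std::string next = std::to_string(refs + 1);
    bufferlist bl;
    bl.append(next.c_str(), next.size());
    return cls_cxx_setxattr(hctx, "refcount", &bl);
  }

  CLS_INIT(refcount_sketch)
  {
    cls_handle_t h_class;
    cls_method_handle_t h_method;
    cls_register("refcount_sketch", &h_class);
    cls_register_cxx_method(h_class, "write_or_get",
                            CLS_METHOD_RD | CLS_METHOD_WR,
                            write_or_get, &h_method);
  }

The client would then send the chunk payload and the object name (the
fingerprint) in a single call instead of doing the getattr round trip first.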

3) We'd like to generalize the first pool behavior so that it is just a 
special case of the new tiering functionality.  The idea is that an 
object_info_t can have a 'manifest' that describes where and how the
object is really stored instead of the object data itself (much like it 
can already be a whiteout, etc.).  In the simplest case, the manifest 
would just say "this object is stored in pool X" (simple tiering).  In 
this case, the manifest would be a structure like

  map<offset, tuple<length, cas object, pool>>

I think it'll be worth the effort to build a general structure here that we 
can use for basic tiering (not just dedup).
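
As a minimal sketch (the names are placeholders, not an actual object_info_t
field), the chunked flavor of such a manifest might look like:

  #include <cstdint>
  #include <map>
  #include <string>

  // Hypothetical per-extent entry: length, CAS object name, and its pool.
  struct chunk_ref_t {
    uint64_t length;       // length of this extent in the logical object
    std::string cas_oid;   // content-addressed object name, e.g. the chunk's fingerprint
    int64_t pool;          // pool holding the refcounted chunk
  };

  // logical offset -> where that extent is really stored
  using manifest_sketch_t = std::map<uint64_t, chunk_ref_t>;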

sage



> 
> thanks
> 
> 
> 2017-01-31 23:24 GMT+09:00 Sage Weil <sage@newdream.net>:
> > On Thu, 26 Jan 2017, myoungwon oh wrote:
> >> I have two questions.
> >>
> >> 1. I would like to ask about CAS location. current our implementation store
> >> content address object in storage tier.However, If we store the CAO in the
> >> cache tier, we can get a performance advantage. Do you think we can create
> >> CAO in cachetier? or create a separate storage pool for CAS?
> >
> > It depends on the design.  If the you are naming the objects at the
> > librados client side, then you can use the rados cluster itself
> > unmodified (with or without a cache tier).  This is roughly how I have
> > anticipated implementing the CAS storage portion.  If you are doing the
> > chunking hashing and within the OSD itself, then you can't do the CAS
> > at the first tier because the requests won't be directed at the right OSD.
> >
> >> 2. The results below are performance result for our current implementation.
> >> experiment setup:
> >> PROXY (inline dedup), WRITEBACK (lazy dedup, target_max_bytes: 50MB),
> >> ORIGINAL(without dedup feature and cache tier),
> >> fio, 512K block, seq. I/O, single thread
> >>
> >> One thing to note is that the writeback case is slower than the proxy.
> >> We think there are three problems as follows.
> >>
> >> A. The current implementation creates a fingerprint by reading the entire
> >> object when flushing. Therefore, there is a problem that read and write are
> >> mixed.
> >
> > I expect this is a small factor compared to the fact that in writeback
> > mode you have to *write* to the cache tier, which is 3x replicated,
> > whereas in proxy mode those writes don't happen at all.
> >
> >> B. When client request read, the promote_object function reads the object
> >> and writes it back to the cache tier, which also causes a mix of read and
> >> write.
> >
> > This can be mitigated by setting the min_read_recency_for_promote pool
> > property to something >1.  Then reads will be proxied unless the object
> > appears to be hot (because it has been touched over multiple
> > hitset intervals).
> >
> >> C. When flushing, the unchanged part is rewritten because flush operation
> >> perform per-object based.
> >
> > Yes.
> >
> > Is there a description of your overall approach somewhere?
> >
> > sage
> >
> >
> >>
> >> Do I have something wrong? or Could you give me a suggestion to improve
> >> performance?
> >>
> >>
> >> a. Write performance (KB/s)
> >>
> >> dedup_ratio  0 20 40 60 80 100
> >>
> >> PROXY  45586 47804 51120 52844 56167 55302
> >>
> >> WRITEBACK  13151 11078 9531 13010 9518 8319
> >>
> >> ORIGINAL  121209 124786 122140 121195 122540 132363
> >>
> >>
> >> b. Read performance (KB/s)
> >>
> >> dedup_ratio  0 20 40 60 80 100
> >>
> >> PROXY  112231 118994 118070 120071 117884 132748
> >>
> >> WRITEBACK  34040 29109 19104 26677 24756 21695
> >>
> >> ORIGINAL  285482 284398 278063 277989 271793 285094
> >>
> >>
> >> thanks,
> >> Myoungwon Oh
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> >> the body of a message to majordomo@vger.kernel.org
> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >>
> >>
> 
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Question about writeback performance and content address object for deduplication
  2017-02-07 14:50     ` Sage Weil
@ 2017-03-14  6:25       ` myoungwon oh
  2017-03-16 13:42         ` Sage Weil
  0 siblings, 1 reply; 16+ messages in thread
From: myoungwon oh @ 2017-03-14  6:25 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel, 오명원

Hi Sage


I addressed all of your concerns (I applied the CAS pool and put the dedup
metadata in object_info_t) and created a public repository to show the
prototype implementation
(https://github.com/myoungwon/ceph/commit/13597f62405d1c5a4977d630e69331407ef3a07a;
it supports non-aligned I/O, but only for (K)RBD). This code is based on Jewel
and is not well cleaned up, but you can see the basic flow (start_flush(),
maybe_handle_cache_detail()). It would be nice if you could give me some
comments.

I have some questions below on which I would greatly appreciate your feedback.

1. dedup metadata in object_info_t

You mentioned that it would be nice to keep a tuple in object_info_t
such as map<offset, tuple<length, cas object, pool>>. However, I made a
dedup_chunk_info_t in object_info_t instead, because I need one more
parameter (chunk_state) and want room for extension. The purpose is to
avoid reading and fingerprinting the object at flush time. chunk_state
represents three states in writeback mode: CLEAN (data and fingerprint
are both unmodified), MODIFIED (data is modified but the fingerprint has
not been recalculated), and CALCULATED (data is modified and the
fingerprint has also been recalculated). The chunk_state is set when the
data is stored in the cache tier, so reading the data and fingerprinting
it can be skipped during flush.
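
A rough sketch of what I mean (the field names here are illustrative, not the
exact prototype code):

  #include <cstdint>
  #include <string>

  // Per-chunk state tracked in the cache tier so that flush can skip re-reading
  // and re-fingerprinting chunks whose fingerprint is already known.
  enum class chunk_state_t : uint8_t {
    CLEAN,       // data and fingerprint both unmodified
    MODIFIED,    // data overwritten, fingerprint not yet recalculated
    CALCULATED,  // data modified and fingerprint already recalculated
  };

  struct dedup_chunk_info_t {
    uint64_t offset = 0;        // chunk offset within the object
    uint64_t length = 0;        // chunk length
    std::string fingerprint;    // e.g. SHA-1 of the chunk contents
    int64_t cas_pool = -1;      // pool holding the content-addressed chunk
    chunk_state_t state = chunk_state_t::MODIFIED;
  };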


2. Single Rados Operation

You mentioned a RADOS operation that can read the reference count and
write the data at the same time. Do you want that API in the Objecter
class? (for example, objecter->read_ref_and_write())
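
For illustration, if it ended up as an object class method rather than a native
op, the client side could look roughly like this (the class and method names
are hypothetical, following the earlier discussion in this thread):

  #include <rados/librados.hpp>
  #include <string>

  // Hedged sketch: store one chunk in the CAS pool in a single round trip,
  // creating it with refcount 1 or bumping the refcount if it already exists.
  int put_chunk(librados::IoCtx& cas_ioctx, const std::string& fingerprint,
                librados::bufferlist chunk)
  {
    librados::ObjectWriteOperation op;
    op.exec("refcount_sketch", "write_or_get", chunk);  // hypothetical cls method
    return cas_ioctx.operate(fingerprint, &op);         // object name = chunk fingerprint
  }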


3. Write sequence for performance.

Current write sequence (proxy mode) is

a. Read metadata (promote_object)
b. Send data to OSD (in CAS pool) and send dedup metadata to OSD (in
original pool)
c. Once the data and metadata are stored, the proxy OSD issues a message
to decrease the reference count (for the previous chunk) to the OSD (in the
CAS pool) and updates the local object metadata (via simple_opc_submit)
d. If the reference count update succeeds, send an ack to the client

As you can see, the number of operations increases due to the reference
count and metadata updates, which can degrade performance. My question
is: can we send the ack to the client at step (c) above? (But I am worried
about leaving the reference count in an inconsistent state.)



Write sequence (writeback mode) is

a. Read the object data and do fingerprinting (if the fingerprint is not yet calculated).
b. Send a reference count decrement message (for the previous chunk) to the OSD
(in the CAS pool) and update the local object metadata
c. Send a copy_from message to the OSD (in the CAS pool) and send a copy_from
message (to copy the dedup metadata) to an OSD (in the original
pool)

Writeback mode also increases the number of operations. Can we reduce this?



4. Performance.

Performance is improved compared to the previous results, but there still
seems to be room for improvement. (512KB block, seq. workload, fio, KRBD,
single thread, target_max_objects = 4)

My major concerns are, first, fingerprint overhead and, second,
writeback performance in the cache tier. When the chunk size is large
(>512KB), SHA1 takes more than 3ms. (This can be reduced if we use
smaller chunks.)
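
For what it's worth, the hashing cost itself is easy to sanity-check in
isolation; a throwaway sketch using OpenSSL's SHA-1 (assuming SHA-1 is the
fingerprint function in use):

  #include <openssl/sha.h>
  #include <chrono>
  #include <iostream>
  #include <vector>

  int main()
  {
    std::vector<unsigned char> chunk(512 * 1024, 0xab);  // one 512 KB chunk
    unsigned char digest[SHA_DIGEST_LENGTH];
    auto t0 = std::chrono::steady_clock::now();
    SHA1(chunk.data(), chunk.size(), digest);
    auto t1 = std::chrono::steady_clock::now();
    std::cout << "SHA-1 over " << chunk.size() << " bytes took "
              << std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count()
              << " us\n";
    return 0;
  }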

Regarding writeback performance, a flush needs two more operations than
proxy mode: first, "marking the clean state"; second, "reading the dedup
metadata and data from storage". Therefore actual reads and writes
occur, which delay flush completion.

Small-chunk performance in writeback mode is significantly
degraded because a single flush thread handles multiple copy_from
messages. It seems that we should improve the basic flushing performance.


Write performance (MB/s)

Dedup ratio     0         60       100

Proxy             55       64       73

Writeback       48       50       50

Original           120      120      122



Read performance (MB/s)

Dedup ratio     0         60       100

Proxy             117      130      141

Writeback       198      197      200

Original           280      276      285




5. Commands to enable dedup

ceph osd pool create sds-hot 1024
ceph osd pool create sds-cas 1024
ceph osd tier add_cas rbd sds-hot sds-cas
ceph osd tier sds-hot (proxy or writeback)
ceph osd tier dedup_block rbd sds-hot sds-cas (chunk size, e.g. 65536, 131072, ...)
ceph osd tier set-overlay rbd sds-hot



Thanks
Myoungwon Oh
(omwmw@sk.com)

2017-02-07 23:50 GMT+09:00 Sage Weil <sweil@redhat.com>:
> On Tue, 7 Feb 2017, myoungwon oh wrote:
>> Hi sage.
>>
>> I uploaded the document which describe my overall appoach.
>> please see it and give me feedback.
>> slide: https://www.slideshare.net/secret/JZcy3yYEDIHPyg
>
> This approach looks pretty close to what we have been planning.  A few
> comments:
>
> 1) I think it may be better to view the tier/pool that has the object
> metadata as the "base" pool, and the CAS pool with the refcounted
> object chunks as as tier below that.
>
> 2) I think we can use an object class or a handful of new native rados
> operations to make the CAS pool read/write operations more efficient.  In
> your slides you describe a process something like
>
>   rados(getattr)
>   if exists
>      rados(increment ref count)
>   else
>      rados(write object and set ref count to 1)
>
> This could be collapsed into a single optimistic operation that sends the
> data and a command that says "create or increment ref count" so that the
> conditional behavior is handled at the OSD.  This will be more efficient
> for small chunks.  (For large chunks, or in cases where we have some
> confidence that the chunk probably already exists, the pessimistic
> approach might still make sense.)  Either way, we should probably support
> both.
>
> 3) We'd like to generalize the first pool behavior so that it is just a
> special case of the new tiering functionality.  The idea is that an
> object_info_t can have a 'manifest' that described where and how the
> object is really stored instead of the object data itself (much like it
> can already be a whiteout, etc.).  In the simplest case, the manifest
> would just say "this object is stored in pool X" (simple tiering).  In
> this case, the manifest would a structure like
>
>   map<offset, tuple<length, cas object, pool>>
>
> I think it'll be worth the effort to build a general struture here that we
> can use for basic tiering (not just dedup).
>
> sage
>
>
>
>>
>> thanks
>>
>>
>> 2017-01-31 23:24 GMT+09:00 Sage Weil <sage@newdream.net>:
>> > On Thu, 26 Jan 2017, myoungwon oh wrote:
>> >> I have two questions.
>> >>
>> >> 1. I would like to ask about CAS location. current our implementation store
>> >> content address object in storage tier.However, If we store the CAO in the
>> >> cache tier, we can get a performance advantage. Do you think we can create
>> >> CAO in cachetier? or create a separate storage pool for CAS?
>> >
>> > It depends on the design.  If the you are naming the objects at the
>> > librados client side, then you can use the rados cluster itself
>> > unmodified (with or without a cache tier).  This is roughly how I have
>> > anticipated implementing the CAS storage portion.  If you are doing the
>> > chunking hashing and within the OSD itself, then you can't do the CAS
>> > at the first tier because the requests won't be directed at the right OSD.
>> >
>> >> 2. The results below are performance result for our current implementation.
>> >> experiment setup:
>> >> PROXY (inline dedup), WRITEBACK (lazy dedup, target_max_bytes: 50MB),
>> >> ORIGINAL(without dedup feature and cache tier),
>> >> fio, 512K block, seq. I/O, single thread
>> >>
>> >> One thing to note is that the writeback case is slower than the proxy.
>> >> We think there are three problems as follows.
>> >>
>> >> A. The current implementation creates a fingerprint by reading the entire
>> >> object when flushing. Therefore, there is a problem that read and write are
>> >> mixed.
>> >
>> > I expect this is a small factor compared to the fact that in writeback
>> > mode you have to *write* to the cache tier, which is 3x replicated,
>> > whereas in proxy mode those writes don't happen at all.
>> >
>> >> B. When client request read, the promote_object function reads the object
>> >> and writes it back to the cache tier, which also causes a mix of read and
>> >> write.
>> >
>> > This can be mitigated by setting the min_read_recency_for_promote pool
>> > property to something >1.  Then reads will be proxied unless the object
>> > appears to be hot (because it has been touched over multiple
>> > hitset intervals).
>> >
>> >> C. When flushing, the unchanged part is rewritten because flush operation
>> >> perform per-object based.
>> >
>> > Yes.
>> >
>> > Is there a description of your overall approach somewhere?
>> >
>> > sage
>> >
>> >
>> >>
>> >> Do I have something wrong? or Could you give me a suggestion to improve
>> >> performance?
>> >>
>> >>
>> >> a. Write performance (KB/s)
>> >>
>> >> dedup_ratio  0 20 40 60 80 100
>> >>
>> >> PROXY  45586 47804 51120 52844 56167 55302
>> >>
>> >> WRITEBACK  13151 11078 9531 13010 9518 8319
>> >>
>> >> ORIGINAL  121209 124786 122140 121195 122540 132363
>> >>
>> >>
>> >> b. Read performance (KB/s)
>> >>
>> >> dedup_ratio  0 20 40 60 80 100
>> >>
>> >> PROXY  112231 118994 118070 120071 117884 132748
>> >>
>> >> WRITEBACK  34040 29109 19104 26677 24756 21695
>> >>
>> >> ORIGINAL  285482 284398 278063 277989 271793 285094
>> >>
>> >>
>> >> thanks,
>> >> Myoungwon Oh
>> >> --
>> >> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> >> the body of a message to majordomo@vger.kernel.org
>> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> >>
>> >>
>>
>>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Question about writeback performance and content address object for deduplication
  2017-03-14  6:25       ` myoungwon oh
@ 2017-03-16 13:42         ` Sage Weil
  2017-03-20 12:43           ` myoungwon oh
  0 siblings, 1 reply; 16+ messages in thread
From: Sage Weil @ 2017-03-16 13:42 UTC (permalink / raw)
  To: myoungwon oh; +Cc: ceph-devel, 오명원

Hi Myoungwon,

This is quite a patch!  Sorry for the slow reply.

On Tue, 14 Mar 2017, myoungwon oh wrote:
> Hi Sage
> 
> 
> I addressed all of your concerns (I applied CAS pool and dedup
> metadata in object_info_t) and created public repository in order to
> show the prototype implementation
> (https://github.com/myoungwon/ceph/commit/13597f62405d1c5a4977d630e69331407ef3a07a,
> support non-aligned I/O, but for (K)RBD). This code is based on Jewel
> and is not cleaned well but you can see the basic flow (start_flush(),
> maybe_handle_cache_detail() ). It would be nice if you give me some
> comments.
> 
> I have some queries mentioned below on which your feedback is highly required.
> 
> 1. dedup metadata in object_info_t
> 
> You mentioned that it would be nice to make tuple in object_info_t
> such as map<offset, tuple<length, cas object, pool>> But, I made
> dedup_chunk_info_t in object_info_t because I need one more parameter
> (chunk_state) and for extensibility.

Yes, we definitely want an extensible approach to the state in 
object_info_t that will support

 - a simple redirect ("the object is in that other pool")
 - a dedup object ("the object consists of these N lumps, each one 
referencing an object named X_i in pool Y_i")
 - an external system (external archive, like a backup system, external 
object store, whatever)
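
Purely as an illustration (not a proposed encoding), one tagged structure could
cover all three of those cases:

  #include <cstdint>
  #include <map>
  #include <string>

  struct object_manifest_sketch_t {
    enum type_t { TYPE_NONE, TYPE_REDIRECT, TYPE_CHUNKED, TYPE_EXTERNAL } type = TYPE_NONE;

    // TYPE_REDIRECT: the whole object lives somewhere else
    int64_t redirect_pool = -1;
    std::string redirect_oid;

    // TYPE_CHUNKED: logical offset -> (length, cas object, pool)
    struct chunk_t { uint64_t length; std::string cas_oid; int64_t pool; };
    std::map<uint64_t, chunk_t> chunk_map;

    // TYPE_EXTERNAL: opaque locator into an external archive/backup system
    std::string external_locator;
  };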

I think we should try to come up with a general notion, like "redirect" or 
"object map" or something that covers other options... not just dedup!

> This is because to avoid read and
> fingerprinting during flush time. chunk_state represents three states
> in writeback mode. First is CLEAN (data and fingerprint are not
> modified). Second is MODIFIED (data is modified but fingerprint is not
> calculated). Third is CALCULATED (data is modified and fingerprint is
> also calculated). When data is stored in cache tier, chunk_state will
> be defined. Therefore, reading data and fingerprinting can be removed
> during flush.

I'm not following this, though.  I think "clean" would just mean we are 
storing the normal object in the pool.  "modified" would mean that the 
FLAG_DIRTY is set.  And "calculated" would mean we have successfully 
chunked the object, stored or taken refs on the chunks, and written the 
chunk map into object_info_t? 


> 2. Single Rados Operation
> 
> You mentioned a Rados operation which can concurrently read the
> reference count and write data. Do you want that API in objecter
> class? (for example, objector->read_ref_and_write())

We may not need to make it a first-class rados operation.  For example, 
cls_refcount could probably be extended with a write_or_get operation.  
But it might also be advantageous to make it a native op.  The main thing 
I'm worried about here is that we probably want to make the refs 
reliable and auditable, which means backpointers (so you can look at a 
chunk and see which dedup objects are using it).  That means that a 
popular sequence of bytes might have a huge number of references, and that 
will need to scale gracefully.  Or, we just use counters, accept that 
failure conditions could make us leak dedup chunks, and make all of our 
failure paths fail-safe.

> 3. Write sequence for performance.
> 
> Current write sequence (proxy mode) is
> 
> a. Read metadata (promote_object)
> b. Send data to OSD (in CAS pool) and send dedup metadata to OSD (in
> original pool)
> c. If data and metadata are stored then, proxy osd will issue message
> to decrease the reference count (for previous chunk) to OSD (in CAS
> pool) and update local object metadata (via simple_opc_submit)
> d. If reference count is successful, send Ack to client
> 
> As you can see, the number of operations increased due to reference
> count and metadata updates. This can degrade performance. My question
> is that can we send ack to client at (c) above? (But I am worried
> about inconsistent reference count state.)

I'm worried that if we focus on inline dedup immediately we'll end up with 
something that is less general and more fragile.  It's also harder.  
Instead, we can consider the inline and async dedup separately.  Async:

writeback:
a. normal write into object.  ack client.
...
b. dedup agent: read object (from cache), chunk
c. dedup agent: write/refcount chunks
d. replace object with dedup manifest

This could happen with or without a delay.  I don't think it makes sense 
to consider "promote" here at all; it sounds like you're assuming the 
initial dedup tier is a cache tier, and we should try not to assume that 
(even though it might be possible).  Instead, I think a "basic" setup 
would probably be

1. base pool (all ssd; contains all metadata for all objects, and absorbs 
   writes).
2. dedup pool(s) contain refcounted chunks

If we want to do inline dedup, it would be some complex code that combines 
all of the steps above into one, at the expense of client latency.
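
A hedged end-to-end sketch of that async path (chunk_and_fingerprint() and
set_chunk_manifest() are hypothetical placeholders for interfaces that do not
exist yet, and the cls method name is the made-up one from earlier):

  #include <rados/librados.hpp>
  #include <cstdint>
  #include <string>
  #include <vector>

  struct chunk_t {
    uint64_t offset;
    std::string fingerprint;      // becomes the CAS object name
    librados::bufferlist data;
  };

  std::vector<chunk_t> chunk_and_fingerprint(librados::bufferlist& obj);   // hypothetical
  void set_chunk_manifest(librados::IoCtx& base, const std::string& oid,
                          const std::vector<chunk_t>& chunks);             // hypothetical

  void dedup_object(librados::IoCtx& base, librados::IoCtx& cas,
                    const std::string& oid)
  {
    // (a) the client write was already applied and acked in the base pool
    // (b) read the object back and split it into chunks
    librados::bufferlist obj;
    base.read(oid, obj, 0, 0);                    // len 0 reads the whole object
    std::vector<chunk_t> chunks = chunk_and_fingerprint(obj);

    // (c) write/refcount each chunk in the CAS pool
    for (auto& c : chunks) {
      librados::ObjectWriteOperation op;
      op.exec("refcount_sketch", "write_or_get", c.data);
      cas.operate(c.fingerprint, &op);
    }

    // (d) replace the object data with a manifest pointing at the chunks
    set_chunk_manifest(base, oid, chunks);
  }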


In any case, it's awesome that you have a working prototype.  However, 
it's not going to be practical to take a huge patch(set) like this and 
merge it all at once.  It's too much code to review, too complex, and too 
hard to test.  Also, it's changing 5000 lines in ReplicatedPG.cc (since renamed 
PrimaryLogPG.cc), which is slated for a big refactor right after luminous.

The way to approach this to get it upstream is to break this down into 
different logical components and design/review/test/merge each of them 
independently.  Having a prototype is useful in that it will be easier to 
answer a lot of the questions we'll have deciding how each part should 
work and what it needs to be able to handle, but don't expect that most 
of that code will end up in the final version!

I'm guessing we can break this down into a few logical components:

1) How do we store chunks.  We know we want refcounted objects for each 
chunk.  We don't know how we'll manage the refcounts, whether we want/need 
backpointers, whether we are willing to tolerate "leaking" references in 
failure cases (so that we fail to clean up all chunks if we e.g. delete 
all data), whether we want to implement it as a rados class or a native 
rados op, whether we want to support EC, compression, etc.  This whole 
discussion is a great place to start because it is self-contained and 
doesn't break anything else.

2) How do we do the dedup manifests (and redirects) in object_info_t.  We 
want the solution to include or be compatible with simpler tiering, like 
having the object_info_t simply be a pointer to a different (colder) pool.  
In fact, I think this is the thing to do first because it will make us 
fix/solve all the basic problems with flush and promote.  And extending 
this to include dedup (object is composed of many little bits in other 
pools) is then a matter of making that 'manifest' (or whatever we call it) 
a generic and extensible description.  Remember we also want to support 
pushing objects into external systems (say, glacier, or some other 
external object store like a backup system).

3) How do we chunk.  You have some classes that handle aligned chunking.  
We'll probably eventually want content-based chunking (based on Rabin 
fingerprinting or whatever the new hotness is).  Real users will probably 
want adjustable policies based on what they know of the content they're 
storing, and the system will probably want to support multiple CAS pools 
based on which policy is being used (as that determines chunk sizes 
etc and whether we'll actually have any dedup happening).

4) How to drive the dedup process itself.  An async agent that's part of 
the existing tier_agent?  An external process?  Something inline in the 
write path?  This is the hardest question to answer, and the one that is 
most likely to collide with other planned OSD work.  It can also come 
last, IMO!  We can start with a simple offline agent and perhaps 
eventually do something more clever or efficient.

In any case, I think #1 and #2 are the key discussions we should have now.  
I suggest starting a pad and email thread for each (pad.ceph.com) so we 
can brainstorm design options, weigh trade-offs, and come to some 
consensus.  (I had some thoughts, for example, on a hybrid scheme 
somewhere between explicit backpointers and a simple refcount that could 
consume fixed overhead but still provide information that would enable a 
moderately efficient scrub/audit.)
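
For example, such a hybrid could be as simple as an exact counter plus a
bounded sample of referrers: fixed overhead, but enough breadcrumbs for a
scrub to spot-check.  A rough sketch:

  #include <cstddef>
  #include <cstdint>
  #include <string>
  #include <vector>

  // Hypothetical per-chunk refcount record.
  struct chunk_refs_sketch_t {
    uint64_t count = 0;                         // total references to this chunk
    static const size_t max_sample = 16;        // fixed overhead cap
    std::vector<std::string> sample_referrers;  // up to max_sample referring objects
  };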

Thanks!
sage




> Write sequence (writeback mode) is
> 
> a.  Read object data and do fingerprinting (if data is not calculated).
> b. Send reference count decrement message (for previous chunk) to osd
> (in CAS pool) and updates local object metadata
> c. Send copy_from message to osd (in CAS pool) and send copy_from
> message (in order to copy the dedup metadata) to a osd (in original
> pool)
> 
> Writeback mode also increase the number of operation. Can we reduce?
> 
> 
> 
> 4. Performance.
> 
> Performance is improved compared to previous results. But It still
> seems to be improving. (512KB block, Seq. workload, fio, KRBD, single
> thread, target_max_objects = 4)
> 
> Major concerns are first is fingerprint overhead and second is
> writeback performance in cache tier. When the chunk size is large
> (>512KB), SHA1 takes more than 3ms. (This can be reduced if we use
> small chunk.)
> 
> Regarding writeback performance, Flush need two more operations than
> proxy mode. First is "marking clean state". Second is "reading dedup
> metadata and data from storage". Therefore, actual read and write
> occur. These cause that flush completion is delayed.
> 
> Small chunk performance in the writeback mode is significantly
> degraded because single flush thread handles multiple copy_from
> message. It seems that we should improve basic flushing performance.
> 
> 
> Write performance (MB/s)
> 
> Dedup ratio     0         60       100
> 
> Proxy             55       64       73
> 
> Writeback       48       50       50
> 
> Original           120      120      122
> 
> 
> 
> Read performance (MB/s)
> 
> Dedup ratio     0         60       100
> 
> Proxy             117      130      141
> 
> Writeback       198      197      200
> 
> Original           280      276      285
> 
> 
> 
> 
> 5. Command to enable dedup
> 
> Ceph osd pool create sds-hot 1024
> Ceph osd pool create sds-cas 1024
> Ceph osd tier add_cas rbd sds-hot sds-cas
> Ceph osd tier sds-hot (proxy or writeback)
> Ceph osd tier dedup_block rbd sds-hot sds-cas (chunk size. e.g. 65536, 131072..)
> Ceph osd tier set-overlay rbd sds-hot
> 
> 
> 
> Thanks
> Myoungwon Oh
> (omwmw@sk.com)
> 
> 2017-02-07 23:50 GMT+09:00 Sage Weil <sweil@redhat.com>:
> > On Tue, 7 Feb 2017, myoungwon oh wrote:
> >> Hi sage.
> >>
> >> I uploaded the document which describe my overall appoach.
> >> please see it and give me feedback.
> >> slide: https://www.slideshare.net/secret/JZcy3yYEDIHPyg
> >
> > This approach looks pretty close to what we have been planning.  A few
> > comments:
> >
> > 1) I think it may be better to view the tier/pool that has the object
> > metadata as the "base" pool, and the CAS pool with the refcounted
> > object chunks as as tier below that.
> >
> > 2) I think we can use an object class or a handful of new native rados
> > operations to make the CAS pool read/write operations more efficient.  In
> > your slides you describe a process something like
> >
> >   rados(getattr)
> >   if exists
> >      rados(increment ref count)
> >   else
> >      rados(write object and set ref count to 1)
> >
> > This could be collapsed into a single optimistic operation that sends the
> > data and a command that says "create or increment ref count" so that the
> > conditional behavior is handled at the OSD.  This will be more efficient
> > for small chunks.  (For large chunks, or in cases where we have some
> > confidence that the chunk probably already exists, the pessimistic
> > approach might still make sense.)  Either way, we should probably support
> > both.
> >
> > 3) We'd like to generalize the first pool behavior so that it is just a
> > special case of the new tiering functionality.  The idea is that an
> > object_info_t can have a 'manifest' that described where and how the
> > object is really stored instead of the object data itself (much like it
> > can already be a whiteout, etc.).  In the simplest case, the manifest
> > would just say "this object is stored in pool X" (simple tiering).  In
> > this case, the manifest would a structure like
> >
> >   map<offset, tuple<length, cas object, pool>>
> >
> > I think it'll be worth the effort to build a general struture here that we
> > can use for basic tiering (not just dedup).
> >
> > sage
> >
> >
> >
> >>
> >> thanks
> >>
> >>
> >> 2017-01-31 23:24 GMT+09:00 Sage Weil <sage@newdream.net>:
> >> > On Thu, 26 Jan 2017, myoungwon oh wrote:
> >> >> I have two questions.
> >> >>
> >> >> 1. I would like to ask about CAS location. current our implementation store
> >> >> content address object in storage tier.However, If we store the CAO in the
> >> >> cache tier, we can get a performance advantage. Do you think we can create
> >> >> CAO in cachetier? or create a separate storage pool for CAS?
> >> >
> >> > It depends on the design.  If the you are naming the objects at the
> >> > librados client side, then you can use the rados cluster itself
> >> > unmodified (with or without a cache tier).  This is roughly how I have
> >> > anticipated implementing the CAS storage portion.  If you are doing the
> >> > chunking hashing and within the OSD itself, then you can't do the CAS
> >> > at the first tier because the requests won't be directed at the right OSD.
> >> >
> >> >> 2. The results below are performance result for our current implementation.
> >> >> experiment setup:
> >> >> PROXY (inline dedup), WRITEBACK (lazy dedup, target_max_bytes: 50MB),
> >> >> ORIGINAL(without dedup feature and cache tier),
> >> >> fio, 512K block, seq. I/O, single thread
> >> >>
> >> >> One thing to note is that the writeback case is slower than the proxy.
> >> >> We think there are three problems as follows.
> >> >>
> >> >> A. The current implementation creates a fingerprint by reading the entire
> >> >> object when flushing. Therefore, there is a problem that read and write are
> >> >> mixed.
> >> >
> >> > I expect this is a small factor compared to the fact that in writeback
> >> > mode you have to *write* to the cache tier, which is 3x replicated,
> >> > whereas in proxy mode those writes don't happen at all.
> >> >
> >> >> B. When client request read, the promote_object function reads the object
> >> >> and writes it back to the cache tier, which also causes a mix of read and
> >> >> write.
> >> >
> >> > This can be mitigated by setting the min_read_recency_for_promote pool
> >> > property to something >1.  Then reads will be proxied unless the object
> >> > appears to be hot (because it has been touched over multiple
> >> > hitset intervals).
> >> >
> >> >> C. When flushing, the unchanged part is rewritten because flush operation
> >> >> perform per-object based.
> >> >
> >> > Yes.
> >> >
> >> > Is there a description of your overall approach somewhere?
> >> >
> >> > sage
> >> >
> >> >
> >> >>
> >> >> Do I have something wrong? or Could you give me a suggestion to improve
> >> >> performance?
> >> >>
> >> >>
> >> >> a. Write performance (KB/s)
> >> >>
> >> >> dedup_ratio  0 20 40 60 80 100
> >> >>
> >> >> PROXY  45586 47804 51120 52844 56167 55302
> >> >>
> >> >> WRITEBACK  13151 11078 9531 13010 9518 8319
> >> >>
> >> >> ORIGINAL  121209 124786 122140 121195 122540 132363
> >> >>
> >> >>
> >> >> b. Read performance (KB/s)
> >> >>
> >> >> dedup_ratio  0 20 40 60 80 100
> >> >>
> >> >> PROXY  112231 118994 118070 120071 117884 132748
> >> >>
> >> >> WRITEBACK  34040 29109 19104 26677 24756 21695
> >> >>
> >> >> ORIGINAL  285482 284398 278063 277989 271793 285094
> >> >>
> >> >>
> >> >> thanks,
> >> >> Myoungwon Oh
> >> >> --
> >> >> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> >> >> the body of a message to majordomo@vger.kernel.org
> >> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >> >>
> >> >>
> >>
> >>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Question about writeback performance and content address object for deduplication
  2017-03-16 13:42         ` Sage Weil
@ 2017-03-20 12:43           ` myoungwon oh
  2017-03-24 19:32             ` Sage Weil
  0 siblings, 1 reply; 16+ messages in thread
From: myoungwon oh @ 2017-03-20 12:43 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel, 오명원

Hi Sage.

Thanks for your comments!
I created pads to brainstorm design options for #1 and #2 first.

#1 http://pad.ceph.com/p/deduplication_how_dedup_manifists
#2 http://pad.ceph.com/p/deduplication_how_do_we_store_chunk


Thanks.

2017-03-16 22:42 GMT+09:00 Sage Weil <sweil@redhat.com>:
> Hi Myoungwon,
>
> This is quite a patch!  Sorry for the slow reply.
>
> On Tue, 14 Mar 2017, myoungwon oh wrote:
>> Hi Sage
>>
>>
>> I addressed all of your concerns (I applied CAS pool and dedup
>> metadata in object_info_t) and created public repository in order to
>> show the prototype implementation
>> (https://github.com/myoungwon/ceph/commit/13597f62405d1c5a4977d630e69331407ef3a07a,
>> support non-aligned I/O, but for (K)RBD). This code is based on Jewel
>> and is not cleaned well but you can see the basic flow (start_flush(),
>> maybe_handle_cache_detail() ). It would be nice if you give me some
>> comments.
>>
>> I have some queries mentioned below on which your feedback is highly required.
>>
>> 1. dedup metadata in object_info_t
>>
>> You mentioned that it would be nice to make tuple in object_info_t
>> such as map<offset, tuple<length, cas object, pool>> But, I made
>> dedup_chunk_info_t in object_info_t because I need one more parameter
>> (chunk_state) and for extensibility.
>
> Yes, we definitely want an extensible approach to the state in
> object_info_t that will support
>
>  - a simple redirect ("the object is in that other pool")
>  - a dedup object ("the object consists of these N lumps, each one
> referencing an object named X_i in pool Y_i")
>  - an external system (extenral archive, like a backup system, external
> object store, whatever)
>
> I think we should try to come up with a general notion, like "redirect" or
> "object map" or something that covers other options... not just dedup!
>
>> This is because to avoid read and
>> fingerprinting during flush time. chunk_state represents three states
>> in writeback mode. First is CLEAN (data and fingerprint are not
>> modified). Second is MODIFIED (data is modified but fingerprint is not
>> calculated). Third is CALCULATED (data is modified and fingerprint is
>> also calculated). When data is stored in cache tier, chunk_state will
>> be defined. Therefore, reading data and fingerprinting can be removed
>> during flush.
>
> I'm not following this, though.  I think "clean" would just mean we are
> storing the normal object in the pool.  "modified" would mean that the
> FLAG_DIRTY is set.  And "calculated" would mean we have successfully
> chunked the object, stored or taken refs on the chunks, and written the
> chunk map into object_info_t?
>
>
>> 2. Single Rados Operation
>>
>> You mentioned a Rados operation which can concurrently read the
>> reference count and write data. Do you want that API in objecter
>> class? (for example, objector->read_ref_and_write())
>
> We may not need to make it a first-class rados operation.  For example,
> cls_refcount could probably be extended with a write_or_get operation.
> But it might also be advantageous to make it a native op.  The main thing
> I'm worried about here is that we probably want to make the refs
> reliable and autitable, which means backpointers (so you can look at a
> chunk and see which dedup objects are using it).  That means that a
> popular sequence of bytes might have a huge number of references, and that
> will need to scale gracefully.  Or, we just use counters, accept that
> failure conditions could make us leak dedup chunks, and make all of our
> failure paths fail-safe.
>
>> 3. Write sequence for performance.
>>
>> Current write sequence (proxy mode) is
>>
>> a. Read metadata (promote_object)
>> b. Send data to OSD (in CAS pool) and send dedup metadata to OSD (in
>> original pool)
>> c. If data and metadata are stored then, proxy osd will issue message
>> to decrease the reference count (for previous chunk) to OSD (in CAS
>> pool) and update local object metadata (via simple_opc_submit)
>> d. If reference count is successful, send Ack to client
>>
>> As you can see, the number of operations increased due to reference
>> count and metadata updates. This can degrade performance. My question
>> is that can we send ack to client at (c) above? (But I am worried
>> about inconsistent reference count state.)
>
> I'm worried that if we focus on inline dedup immediately we'll end up with
> something that is less general and more fragile.  It's also harder.
> Instead, we can consider the inline and async dedup separately.  Async:
>
> writeback:
> a. normal write into object.  ack client.
> ...
> b. dedup agent: read object (from cache), chunk
> c. dedup agent: write/refcount chunks
> d. replace object with dedup manifest
>
> This could happen with or without a delay.  I don't think it makes sense
> to consider "promote" here at all; it sounds like you're assuming the
> initial dedup tier is a cache tier, and we should try not to assume that
> (even though it might be possible).  Instead, I think a "basic" setup
> would probably be
>
> 1. base pool (all ssd; contains all metadata for all objects, and absorbs
>    writes).
> 2. dedup pool(s) contain refcounted chunks
>
> If we want to do inline dedup, it would be some complex code that combines
> all of the steps above into one, at the expense of client latency.
>
>
> In any case, it's awesome that you have a working prototype.  However,
> it's not going to be practical to take a huge patch(set) like this and
> merge it all at once.  It's too much code to review, too complex, and too
> hard to test.  Also, it's changing 5000 in ReplicatedPG.cc (since renamed
> PrimaryLogPG.cc), which is slated for a big refactor right after luminous.
>
> The way to approach this to get it upstream is to break this down into
> different logical components and design/review/test/merge each of them
> indepdendently.  Having a prototype is useful in that it will be easier to
> answer a lot of the questions we'll have deciding how each part should
> work and what it needs to be able to handle, but don't expect that most
> of that code will end up in the final version!
>
> I'm guessing we can break this down into a few logical components:
>
> 1) How do we store chunks.  We know we want refcounted objects for each
> chunk.  We don't know how we'll manage the refcounts, whether we want/need
> backpointers, whether we are willing to tolerate "leaking" references in
> failure cases (so that we fail to clean up all chunks if we e.g. delete
> all data), whether we want to implement it as a rados class or a native
> rados op, whether we want to support EC, compression, etc.  This whole
> discussion one is a great place to start because it is self-contained and
> doesn't break anything else.
>
> 2) How do we do the dedup manifists (and redirects) in object_info_t.  We
> want the solution to include or be compatible with simpler tiering, like
> having the object_info_t simply be a pointer to a different (colder) pool.
> In fact, I think this is the thing to do first becuase it will make us
> fix/solve all the basic problems with flush and promote.  And extending
> this to include dedup (object is composed of many little bits in other
> pools) is then a matter of making that 'manifest' (or whatever we call it)
> a generic and extensible description.  Remember we also want to support
> pushing objects into external systems (say, glacier, or some other
> external object store like a backup system).
>
> 3) How do we chunk.  You have some classes that handle aligned chunking.
> We'll probably eventually want content-based chunking (based on Rabin
> fingerprinting or whatever the new hotness is).  Real users will probably
> want adjustable policies based on what they know of the content they're
> storing, and the system will probably want to support multiple CAS pools
> based on which policy is being used (as that determines chunk sizes
> etc and whether we'll actually have any dedup happening).
>
> 4) How to drive the dedup process itself.  An async agent that's part of
> the exiting tier_agent?  An external process?  Something inline in the
> write path?  This is the hardest question to answer, and the one that is
> most likely to collide with other planned OSD work.  It can also come
> last, IMO!  We can start with a simple offline agent and perhaps
> eventually do something more clever or efficient.
>
> In any case, I think #1 and #2 are the key discussions we should have now.
> I suggest starting a pad and email thread for each (pad.ceph.com) so we
> can brainstorm design options, weight trade-offs, and come to some
> consensus.  (I had some thoughts, for example, on a hybrid scheme
> somewhere between explicit backpointers and a simple refcount that could
> consume fixed overhead but still provide information that would enable a
> moderately efficient scrub/audit.)
>
> Thanks!
> sage
>
>
>
>
>> Write sequence (writeback mode) is
>>
>> a.  Read object data and do fingerprinting (if data is not calculated).
>> b. Send reference count decrement message (for previous chunk) to osd
>> (in CAS pool) and updates local object metadata
>> c. Send copy_from message to osd (in CAS pool) and send copy_from
>> message (in order to copy the dedup metadata) to a osd (in original
>> pool)
>>
>> Writeback mode also increase the number of operation. Can we reduce?
>>
>>
>>
>> 4. Performance.
>>
>> Performance is improved compared to previous results. But It still
>> seems to be improving. (512KB block, Seq. workload, fio, KRBD, single
>> thread, target_max_objects = 4)
>>
>> Major concerns are first is fingerprint overhead and second is
>> writeback performance in cache tier. When the chunk size is large
>> (>512KB), SHA1 takes more than 3ms. (This can be reduced if we use
>> small chunk.)
>>
>> Regarding writeback performance, Flush need two more operations than
>> proxy mode. First is "marking clean state". Second is "reading dedup
>> metadata and data from storage". Therefore, actual read and write
>> occur. These cause that flush completion is delayed.
>>
>> Small chunk performance in the writeback mode is significantly
>> degraded because single flush thread handles multiple copy_from
>> message. It seems that we should improve basic flushing performance.
>>
>>
>> Write performance (MB/s)
>>
>> Dedup ratio     0         60       100
>>
>> Proxy             55       64       73
>>
>> Writeback       48       50       50
>>
>> Original           120      120      122
>>
>>
>>
>> Read performance (MB/s)
>>
>> Dedup ratio     0         60       100
>>
>> Proxy             117      130      141
>>
>> Writeback       198      197      200
>>
>> Original           280      276      285
>>
>>
>>
>>
>> 5. Command to enable dedup
>>
>> Ceph osd pool create sds-hot 1024
>> Ceph osd pool create sds-cas 1024
>> Ceph osd tier add_cas rbd sds-hot sds-cas
>> Ceph osd tier sds-hot (proxy or writeback)
>> Ceph osd tier dedup_block rbd sds-hot sds-cas (chunk size. e.g. 65536, 131072..)
>> Ceph osd tier set-overlay rbd sds-hot
>>
>>
>>
>> Thanks
>> Myoungwon Oh
>> (omwmw@sk.com)
>>
>> 2017-02-07 23:50 GMT+09:00 Sage Weil <sweil@redhat.com>:
>> > On Tue, 7 Feb 2017, myoungwon oh wrote:
>> >> Hi sage.
>> >>
>> >> I uploaded the document which describe my overall appoach.
>> >> please see it and give me feedback.
>> >> slide: https://www.slideshare.net/secret/JZcy3yYEDIHPyg
>> >
>> > This approach looks pretty close to what we have been planning.  A few
>> > comments:
>> >
>> > 1) I think it may be better to view the tier/pool that has the object
>> > metadata as the "base" pool, and the CAS pool with the refcounted
>> > object chunks as as tier below that.
>> >
>> > 2) I think we can use an object class or a handful of new native rados
>> > operations to make the CAS pool read/write operations more efficient.  In
>> > your slides you describe a process something like
>> >
>> >   rados(getattr)
>> >   if exists
>> >      rados(increment ref count)
>> >   else
>> >      rados(write object and set ref count to 1)
>> >
>> > This could be collapsed into a single optimistic operation that sends the
>> > data and a command that says "create or increment ref count" so that the
>> > conditional behavior is handled at the OSD.  This will be more efficient
>> > for small chunks.  (For large chunks, or in cases where we have some
>> > confidence that the chunk probably already exists, the pessimistic
>> > approach might still make sense.)  Either way, we should probably support
>> > both.
>> >
>> > 3) We'd like to generalize the first pool behavior so that it is just a
>> > special case of the new tiering functionality.  The idea is that an
>> > object_info_t can have a 'manifest' that described where and how the
>> > object is really stored instead of the object data itself (much like it
>> > can already be a whiteout, etc.).  In the simplest case, the manifest
>> > would just say "this object is stored in pool X" (simple tiering).  In
>> > this case, the manifest would a structure like
>> >
>> >   map<offset, tuple<length, cas object, pool>>
>> >
>> > I think it'll be worth the effort to build a general struture here that we
>> > can use for basic tiering (not just dedup).
>> >
>> > sage
>> >
>> >
>> >
>> >>
>> >> thanks
>> >>
>> >>
>> >> 2017-01-31 23:24 GMT+09:00 Sage Weil <sage@newdream.net>:
>> >> > On Thu, 26 Jan 2017, myoungwon oh wrote:
>> >> >> I have two questions.
>> >> >>
>> >> >> 1. I would like to ask about CAS location. current our implementation store
>> >> >> content address object in storage tier.However, If we store the CAO in the
>> >> >> cache tier, we can get a performance advantage. Do you think we can create
>> >> >> CAO in cachetier? or create a separate storage pool for CAS?
>> >> >
>> >> > It depends on the design.  If the you are naming the objects at the
>> >> > librados client side, then you can use the rados cluster itself
>> >> > unmodified (with or without a cache tier).  This is roughly how I have
>> >> > anticipated implementing the CAS storage portion.  If you are doing the
>> >> > chunking hashing and within the OSD itself, then you can't do the CAS
>> >> > at the first tier because the requests won't be directed at the right OSD.
>> >> >
>> >> >> 2. The results below are performance result for our current implementation.
>> >> >> experiment setup:
>> >> >> PROXY (inline dedup), WRITEBACK (lazy dedup, target_max_bytes: 50MB),
>> >> >> ORIGINAL(without dedup feature and cache tier),
>> >> >> fio, 512K block, seq. I/O, single thread
>> >> >>
>> >> >> One thing to note is that the writeback case is slower than the proxy.
>> >> >> We think there are three problems as follows.
>> >> >>
>> >> >> A. The current implementation creates a fingerprint by reading the entire
>> >> >> object when flushing. Therefore, there is a problem that read and write are
>> >> >> mixed.
>> >> >
>> >> > I expect this is a small factor compared to the fact that in writeback
>> >> > mode you have to *write* to the cache tier, which is 3x replicated,
>> >> > whereas in proxy mode those writes don't happen at all.
>> >> >
>> >> >> B. When client request read, the promote_object function reads the object
>> >> >> and writes it back to the cache tier, which also causes a mix of read and
>> >> >> write.
>> >> >
>> >> > This can be mitigated by setting the min_read_recency_for_promote pool
>> >> > property to something >1.  Then reads will be proxied unless the object
>> >> > appears to be hot (because it has been touched over multiple
>> >> > hitset intervals).
>> >> >
>> >> >> C. When flushing, the unchanged part is rewritten because flush operation
>> >> >> perform per-object based.
>> >> >
>> >> > Yes.
>> >> >
>> >> > Is there a description of your overall approach somewhere?
>> >> >
>> >> > sage
>> >> >
>> >> >
>> >> >>
>> >> >> Do I have something wrong? or Could you give me a suggestion to improve
>> >> >> performance?
>> >> >>
>> >> >>
>> >> >> a. Write performance (KB/s)
>> >> >>
>> >> >> dedup_ratio  0 20 40 60 80 100
>> >> >>
>> >> >> PROXY  45586 47804 51120 52844 56167 55302
>> >> >>
>> >> >> WRITEBACK  13151 11078 9531 13010 9518 8319
>> >> >>
>> >> >> ORIGINAL  121209 124786 122140 121195 122540 132363
>> >> >>
>> >> >>
>> >> >> b. Read performance (KB/s)
>> >> >>
>> >> >> dedup_ratio  0 20 40 60 80 100
>> >> >>
>> >> >> PROXY  112231 118994 118070 120071 117884 132748
>> >> >>
>> >> >> WRITEBACK  34040 29109 19104 26677 24756 21695
>> >> >>
>> >> >> ORIGINAL  285482 284398 278063 277989 271793 285094
>> >> >>
>> >> >>
>> >> >> thanks,
>> >> >> Myoungwon Oh
>> >> >> --
>> >> >> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> >> >> the body of a message to majordomo@vger.kernel.org
>> >> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> >> >>
>> >> >>
>> >>
>> >>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Question about writeback performance and content address obejct for deduplication
  2017-03-20 12:43           ` myoungwon oh
@ 2017-03-24 19:32             ` Sage Weil
  2017-03-27 13:46               ` myoungwon oh
  0 siblings, 1 reply; 16+ messages in thread
From: Sage Weil @ 2017-03-24 19:32 UTC (permalink / raw)
  To: myoungwon oh; +Cc: ceph-devel, 오명원

On Mon, 20 Mar 2017, myoungwon oh wrote:
> Hi sage.
> 
> Thanks for your comments!
> I created pads in order to brainstorm design options for #1 and #2 first.
> 
> #1 http://pad.ceph.com/p/deduplication_how_dedup_manifists
> #2 http://pad.ceph.com/p/deduplication_how_do_we_store_chunk

I made some comments in the pad!

sage

> 
> 
> Thanks.
> 
> 2017-03-16 22:42 GMT+09:00 Sage Weil <sweil@redhat.com>:
> > Hi Myoungwon,
> >
> > This is quite a patch!  Sorry for the slow reply.
> >
> > On Tue, 14 Mar 2017, myoungwon oh wrote:
> >> Hi Sage
> >>
> >>
> >> I addressed all of your concerns (I applied the CAS pool and dedup
> >> metadata in object_info_t) and created a public repository in order to
> >> show the prototype implementation
> >> (https://github.com/myoungwon/ceph/commit/13597f62405d1c5a4977d630e69331407ef3a07a,
> >> which supports non-aligned I/O, but only for (K)RBD). This code is based on Jewel
> >> and is not cleaned up well, but you can see the basic flow (start_flush(),
> >> maybe_handle_cache_detail() ). It would be nice if you could give me some
> >> comments.
> >>
> >> I have some queries below on which your feedback would be much appreciated.
> >>
> >> 1. dedup metadata in object_info_t
> >>
> >> You mentioned that it would be nice to keep a tuple in object_info_t,
> >> such as map<offset, tuple<length, cas object, pool>>. However, I made a
> >> dedup_chunk_info_t in object_info_t because I need one more parameter
> >> (chunk_state) and want extensibility.
> >
> > Yes, we definitely want an extensible approach to the state in
> > object_info_t that will support
> >
> >  - a simple redirect ("the object is in that other pool")
> >  - a dedup object ("the object consists of these N lumps, each one
> > referencing an object named X_i in pool Y_i")
> >  - an external system (external archive, like a backup system, external
> > object store, whatever)
> >
> > I think we should try to come up with a general notion, like "redirect" or
> > "object map" or something that covers other options... not just dedup!
> >
> >> This is to avoid reading and
> >> fingerprinting at flush time. chunk_state represents three states
> >> in writeback mode. The first is CLEAN (data and fingerprint are not
> >> modified). The second is MODIFIED (data is modified but the fingerprint is not
> >> calculated). The third is CALCULATED (data is modified and the fingerprint is
> >> also calculated). When data is stored in the cache tier, chunk_state will
> >> be set. Therefore, reading the data and fingerprinting can be skipped
> >> during flush.
> >
> > I'm not following this, though.  I think "clean" would just mean we are
> > storing the normal object in the pool.  "modified" would mean that the
> > FLAG_DIRTY is set.  And "calculated" would mean we have successfully
> > chunked the object, stored or taken refs on the chunks, and written the
> > chunk map into object_info_t?
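However the states end up being expressed, the per-chunk bookkeeping described
above amounts to something like this tiny sketch (illustrative names only, not
an existing Ceph type):

  enum class chunk_state_t {
    CLEAN,       // chunk data unchanged, fingerprint still valid
    MODIFIED,    // chunk data dirtied, fingerprint not yet recomputed
    CALCULATED   // fingerprint recomputed, chunk refs taken, chunk map written
  };
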
> >
> >
> >> 2. Single Rados Operation
> >>
> >> You mentioned a Rados operation which can concurrently read the
> >> reference count and write data. Do you want that API in the Objecter
> >> class? (for example, objecter->read_ref_and_write())
> >
> > We may not need to make it a first-class rados operation.  For example,
> > cls_refcount could probably be extended with a write_or_get operation.
> > But it might also be advantageous to make it a native op.  The main thing
> > I'm worried about here is that we probably want to make the refs
> > reliable and auditable, which means backpointers (so you can look at a
> > chunk and see which dedup objects are using it).  That means that a
> > popular sequence of bytes might have a huge number of references, and that
> > will need to scale gracefully.  Or, we just use counters, accept that
> > failure conditions could make us leak dedup chunks, and make all of our
> > failure paths fail-safe.
> >
> >> 3. Write sequence for performance.
> >>
> >> Current write sequence (proxy mode) is
> >>
> >> a. Read metadata (promote_object)
> >> b. Send data to OSD (in CAS pool) and send dedup metadata to OSD (in
> >> original pool)
> >> c. If data and metadata are stored then, proxy osd will issue message
> >> to decrease the reference count (for previous chunk) to OSD (in CAS
> >> pool) and update local object metadata (via simple_opc_submit)
> >> d. If reference count is successful, send Ack to client
> >>
> >> As you can see, the number of operations increased due to reference
> >> count and metadata updates. This can degrade performance. My question
> >> is: can we send the ack to the client at (c) above? (But I am worried
> >> about an inconsistent reference count state.)
> >
> > I'm worried that if we focus on inline dedup immediately we'll end up with
> > something that is less general and more fragile.  It's also harder.
> > Instead, we can consider the inline and async dedup separately.  Async:
> >
> > writeback:
> > a. normal write into object.  ack client.
> > ...
> > b. dedup agent: read object (from cache), chunk
> > c. dedup agent: write/refcount chunks
> > d. replace object with dedup manifest
> >
> > This could happen with or without a delay.  I don't think it makes sense
> > to consider "promote" here at all; it sounds like you're assuming the
> > initial dedup tier is a cache tier, and we should try not to assume that
> > (even though it might be possible).  Instead, I think a "basic" setup
> > would probably be
> >
> > 1. base pool (all ssd; contains all metadata for all objects, and absorbs
> >    writes).
> > 2. dedup pool(s) contain refcounted chunks
> >
> > If we want to do inline dedup, it would be some complex code that combines
> > all of the steps above into one, at the expense of client latency.
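To make the async flow above concrete, here is a rough, self-contained sketch
of what one pass of such an agent could do for a single object. It deliberately
cuts corners: fixed-size chunking, a std::hash placeholder instead of a real
fingerprint, a plain write_full instead of a create-or-increment-refcount op,
and the manifest written as an xattr purely for illustration; none of this is
the actual design.

  #include <rados/librados.hpp>
  #include <algorithm>
  #include <cstdint>
  #include <ctime>
  #include <functional>
  #include <string>
  #include <vector>

  // Toy fixed-size chunker; real code would likely use content-defined chunking.
  static std::vector<librados::bufferlist> chunk_fixed(librados::bufferlist& data,
                                                       size_t chunk_size)
  {
    std::vector<librados::bufferlist> chunks;
    for (size_t off = 0; off < data.length(); off += chunk_size) {
      librados::bufferlist c;
      c.substr_of(data, off, std::min(chunk_size, (size_t)(data.length() - off)));
      chunks.push_back(c);
    }
    return chunks;
  }

  // One pass of the async agent over a single (already acked) object.
  int dedup_one_object(librados::IoCtx& base, librados::IoCtx& cas,
                       const std::string& oid, size_t chunk_size)
  {
    uint64_t size = 0;
    time_t mtime = 0;
    int r = base.stat(oid, &size, &mtime);
    if (r < 0)
      return r;

    librados::bufferlist data;
    r = base.read(oid, data, size, 0);                  // b) read the object
    if (r < 0)
      return r;

    librados::bufferlist manifest;                      // stand-in for a real manifest
    for (auto& c : chunk_fixed(data, chunk_size)) {     //    chunk it
      // Placeholder fingerprint; the real thing would be SHA-1 or similar.
      std::string fp = "chunk_" + std::to_string(
          std::hash<std::string>{}(std::string(c.c_str(), c.length())));
      r = cas.write_full(fp, c);                        // c) store the chunk
      if (r < 0)                                        //    (create-or-incref in practice)
        return r;
      manifest.append(fp + "\n");
    }
    // d) replace the object data with its manifest; an xattr is used here only
    //    so the sketch stays runnable end to end.
    return base.setxattr(oid, "dedup.manifest", manifest);
  }
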
> >
> >
> > In any case, it's awesome that you have a working prototype.  However,
> > it's not going to be practical to take a huge patch(set) like this and
> > merge it all at once.  It's too much code to review, too complex, and too
> > hard to test.  Also, it's changing 5000 lines in ReplicatedPG.cc (since renamed
> > PrimaryLogPG.cc), which is slated for a big refactor right after luminous.
> >
> > The way to approach this to get it upstream is to break this down into
> > different logical components and design/review/test/merge each of them
> > independently.  Having a prototype is useful in that it will be easier to
> > answer a lot of the questions we'll have deciding how each part should
> > work and what it needs to be able to handle, but don't expect that most
> > of that code will end up in the final version!
> >
> > I'm guessing we can break this down into a few logical components:
> >
> > 1) How do we store chunks.  We know we want refcounted objects for each
> > chunk.  We don't know how we'll manage the refcounts, whether we want/need
> > backpointers, whether we are willing to tolerate "leaking" references in
> > failure cases (so that we fail to clean up all chunks if we e.g. delete
> > all data), whether we want to implement it as a rados class or a native
> > rados op, whether we want to support EC, compression, etc.  This whole
> > discussion is a great place to start because it is self-contained and
> > doesn't break anything else.
> >
> > 2) How do we do the dedup manifests (and redirects) in object_info_t.  We
> > want the solution to include or be compatible with simpler tiering, like
> > having the object_info_t simply be a pointer to a different (colder) pool.
> > In fact, I think this is the thing to do first because it will make us
> > fix/solve all the basic problems with flush and promote.  And extending
> > this to include dedup (object is composed of many little bits in other
> > pools) is then a matter of making that 'manifest' (or whatever we call it)
> > a generic and extensible description.  Remember we also want to support
> > pushing objects into external systems (say, glacier, or some other
> > external object store like a backup system).
> >
> > 3) How do we chunk.  You have some classes that handle aligned chunking.
> > We'll probably eventually want content-based chunking (based on Rabin
> > fingerprinting or whatever the new hotness is).  Real users will probably
> > want adjustable policies based on what they know of the content they're
> > storing, and the system will probably want to support multiple CAS pools
> > based on which policy is being used (as that determines chunk sizes
> > etc and whether we'll actually have any dedup happening).
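For flavor, here is a toy content-defined chunker in the gear-hash / FastCDC
spirit; it is deliberately oversimplified (no gear table, no minimum or
maximum chunk size) and is only meant to show why boundaries chosen from the
content itself survive inserts and deletes, unlike fixed alignment:

  #include <cstdint>
  #include <string>
  #include <vector>

  std::vector<std::string> cdc_chunks(const std::string& data,
                                      uint32_t mask = 0x1fff)  // ~8 KB average
  {
    std::vector<std::string> chunks;
    size_t start = 0;
    uint32_t h = 0;
    for (size_t i = 0; i < data.size(); ++i) {
      h = (h << 1) + static_cast<uint8_t>(data[i]);  // cheap rolling-style hash
      if ((h & mask) == mask) {                      // content-derived boundary
        chunks.push_back(data.substr(start, i + 1 - start));
        start = i + 1;
        h = 0;
      }
    }
    if (start < data.size())
      chunks.push_back(data.substr(start));          // tail chunk
    return chunks;
  }
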
> >
> > 4) How to drive the dedup process itself.  An async agent that's part of
> > the existing tier_agent?  An external process?  Something inline in the
> > write path?  This is the hardest question to answer, and the one that is
> > most likely to collide with other planned OSD work.  It can also come
> > last, IMO!  We can start with a simple offline agent and perhaps
> > eventually do something more clever or efficient.
> >
> > In any case, I think #1 and #2 are the key discussions we should have now.
> > I suggest starting a pad and email thread for each (pad.ceph.com) so we
> > can brainstorm design options, weigh trade-offs, and come to some
> > consensus.  (I had some thoughts, for example, on a hybrid scheme
> > somewhere between explicit backpointers and a simple refcount that could
> > consume fixed overhead but still provide information that would enable a
> > moderately efficient scrub/audit.)
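One possible reading of that hybrid idea, just to make it concrete: keep a
plain counter plus a small fixed-size sample of referrers, so a scrub can
spot-check some users of a chunk without paying for a full backpointer list.
This is purely illustrative, not a proposed on-disk format:

  #include <cstdint>
  #include <string>
  #include <vector>

  struct chunk_refs_t {
    static const size_t MAX_SAMPLES = 16;
    uint64_t count = 0;                 // total number of references
    std::vector<std::string> samples;   // bounded sample of referring objects

    void get(const std::string& referrer) {
      ++count;
      if (samples.size() < MAX_SAMPLES)
        samples.push_back(referrer);
    }
    void put() {
      if (count)
        --count;
      // samples may go stale; scrub should treat them as hints, not truth
    }
  };
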
> >
> > Thanks!
> > sage
> >
> >
> >
> >
> >> Write sequence (writeback mode) is
> >>
> >> a.  Read object data and do fingerprinting (if data is not calculated).
> >> b. Send reference count decrement message (for previous chunk) to osd
> >> (in CAS pool) and update local object metadata
> >> c. Send copy_from message to osd (in CAS pool) and send copy_from
> >> message (in order to copy the dedup metadata) to an osd (in the original
> >> pool)
> >>
> >> Writeback mode also increases the number of operations. Can we reduce it?
> >>
> >>
> >>
> >> 4. Performance.
> >>
> >> Performance is improved compared to the previous results, but there still
> >> seems to be room for improvement. (512KB block, seq. workload, fio, KRBD, single
> >> thread, target_max_objects = 4)
> >>
> >> The major concerns are, first, fingerprint overhead and, second,
> >> writeback performance in the cache tier. When the chunk size is large
> >> (>512KB), SHA1 takes more than 3ms. (This can be reduced if we use
> >> a small chunk size.)
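For what it's worth, the hash cost is easy to measure in isolation with a tiny
micro-benchmark like the one below (OpenSSL's SHA1, compile with -lcrypto); it
only times the digest itself, not the rest of the flush path, and the absolute
number will of course depend on the CPU:

  #include <openssl/sha.h>
  #include <chrono>
  #include <cstdio>
  #include <vector>

  int main()
  {
    const size_t chunk_size = 512 * 1024;              // one 512 KB chunk
    std::vector<unsigned char> buf(chunk_size, 0xab);  // dummy chunk data
    unsigned char digest[SHA_DIGEST_LENGTH];

    auto t0 = std::chrono::steady_clock::now();
    SHA1(buf.data(), buf.size(), digest);
    auto t1 = std::chrono::steady_clock::now();

    double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
    std::printf("SHA-1 over %zu bytes: %.3f ms\n", buf.size(), ms);
    return 0;
  }
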
> >>
> >> Regarding writeback performance, flush needs two more operations than
> >> proxy mode. The first is "marking the clean state". The second is "reading dedup
> >> metadata and data from storage". Therefore, actual reads and writes
> >> occur. These cause flush completion to be delayed.
> >>
> >> Small-chunk performance in writeback mode is significantly
> >> degraded because a single flush thread handles multiple copy_from
> >> messages. It seems that we should improve the basic flushing performance.
> >>
> >>
> >> Write performance (MB/s)
> >>
> >> Dedup ratio     0         60       100
> >>
> >> Proxy             55       64       73
> >>
> >> Writeback       48       50       50
> >>
> >> Original           120      120      122
> >>
> >>
> >>
> >> Read performance (MB/s)
> >>
> >> Dedup ratio     0         60       100
> >>
> >> Proxy             117      130      141
> >>
> >> Writeback       198      197      200
> >>
> >> Original           280      276      285
> >>
> >>
> >>
> >>
> >> 5. Commands to enable dedup
> >>
> >> ceph osd pool create sds-hot 1024
> >> ceph osd pool create sds-cas 1024
> >> ceph osd tier add_cas rbd sds-hot sds-cas
> >> ceph osd tier sds-hot (proxy or writeback)
> >> ceph osd tier dedup_block rbd sds-hot sds-cas (chunk size, e.g. 65536, 131072, ...)
> >> ceph osd tier set-overlay rbd sds-hot
> >>
> >>
> >>
> >> Thanks
> >> Myoungwon Oh
> >> (omwmw@sk.com)
> >>
> >> 2017-02-07 23:50 GMT+09:00 Sage Weil <sweil@redhat.com>:
> >> > On Tue, 7 Feb 2017, myoungwon oh wrote:
> >> >> Hi sage.
> >> >>
> >> >> I uploaded the document which describes my overall approach.
> >> >> Please see it and give me feedback.
> >> >> slide: https://www.slideshare.net/secret/JZcy3yYEDIHPyg
> >> >
> >> > This approach looks pretty close to what we have been planning.  A few
> >> > comments:
> >> >
> >> > 1) I think it may be better to view the tier/pool that has the object
> >> > metadata as the "base" pool, and the CAS pool with the refcounted
> >> > object chunks as a tier below that.
> >> >
> >> > 2) I think we can use an object class or a handful of new native rados
> >> > operations to make the CAS pool read/write operations more efficient.  In
> >> > your slides you describe a process something like
> >> >
> >> >   rados(getattr)
> >> >   if exists
> >> >      rados(increment ref count)
> >> >   else
> >> >      rados(write object and set ref count to 1)
> >> >
> >> > This could be collapsed into a single optimistic operation that sends the
> >> > data and a command that says "create or increment ref count" so that the
> >> > conditional behavior is handled at the OSD.  This will be more efficient
> >> > for small chunks.  (For large chunks, or in cases where we have some
> >> > confidence that the chunk probably already exists, the pessimistic
> >> > approach might still make sense.)  Either way, we should probably support
> >> > both.
> >> >
> >> > 3) We'd like to generalize the first pool behavior so that it is just a
> >> > special case of the new tiering functionality.  The idea is that an
> >> > object_info_t can have a 'manifest' that describes where and how the
> >> > object is really stored instead of the object data itself (much like it
> >> > can already be a whiteout, etc.).  In the simplest case, the manifest
> >> > would just say "this object is stored in pool X" (simple tiering).  In
> >> > this case, the manifest would be a structure like
> >> >
> >> >   map<offset, tuple<length, cas object, pool>>
> >> >
> >> > I think it'll be worth the effort to build a general structure here that we
> >> > can use for basic tiering (not just dedup).
> >> >
> >> > sage
> >> >
> >> >
> >> >
> >> >>
> >> >> thanks
> >> >>
> >> >>
> >> >> 2017-01-31 23:24 GMT+09:00 Sage Weil <sage@newdream.net>:
> >> >> > On Thu, 26 Jan 2017, myoungwon oh wrote:
> >> >> >> I have two questions.
> >> >> >>
> >> >> >> 1. I would like to ask about CAS location. current our implementation store
> >> >> >> content address object in storage tier.However, If we store the CAO in the
> >> >> >> cache tier, we can get a performance advantage. Do you think we can create
> >> >> >> CAO in cachetier? or create a separate storage pool for CAS?
> >> >> >
> >> >> > It depends on the design.  If you are naming the objects at the
> >> >> > librados client side, then you can use the rados cluster itself
> >> >> > unmodified (with or without a cache tier).  This is roughly how I have
> >> >> > anticipated implementing the CAS storage portion.  If you are doing the
> >> >> > chunking and hashing within the OSD itself, then you can't do the CAS
> >> >> > at the first tier because the requests won't be directed at the right OSD.
> >> >> >
> >> >> >> 2. The results below are performance result for our current implementation.
> >> >> >> experiment setup:
> >> >> >> PROXY (inline dedup), WRITEBACK (lazy dedup, target_max_bytes: 50MB),
> >> >> >> ORIGINAL(without dedup feature and cache tier),
> >> >> >> fio, 512K block, seq. I/O, single thread
> >> >> >>
> >> >> >> One thing to note is that the writeback case is slower than the proxy.
> >> >> >> We think there are three problems as follows.
> >> >> >>
> >> >> >> A. The current implementation creates a fingerprint by reading the entire
> >> >> >> object when flushing. Therefore, there is a problem that read and write are
> >> >> >> mixed.
> >> >> >
> >> >> > I expect this is a small factor compared to the fact that in writeback
> >> >> > mode you have to *write* to the cache tier, which is 3x replicated,
> >> >> > whereas in proxy mode those writes don't happen at all.
> >> >> >
> >> >> >> B. When client request read, the promote_object function reads the object
> >> >> >> and writes it back to the cache tier, which also causes a mix of read and
> >> >> >> write.
> >> >> >
> >> >> > This can be mitigated by setting the min_read_recency_for_promote pool
> >> >> > property to something >1.  Then reads will be proxied unless the object
> >> >> > appears to be hot (because it has been touched over multiple
> >> >> > hitset intervals).
> >> >> >
> >> >> >> C. When flushing, the unchanged part is rewritten because flush operation
> >> >> >> perform per-object based.
> >> >> >
> >> >> > Yes.
> >> >> >
> >> >> > Is there a description of your overall approach somewhere?
> >> >> >
> >> >> > sage
> >> >> >
> >> >> >
> >> >> >>
> >> >> >> Do I have something wrong? or Could you give me a suggestion to improve
> >> >> >> performance?
> >> >> >>
> >> >> >>
> >> >> >> a. Write performance (KB/s)
> >> >> >>
> >> >> >> dedup_ratio  0 20 40 60 80 100
> >> >> >>
> >> >> >> PROXY  45586 47804 51120 52844 56167 55302
> >> >> >>
> >> >> >> WRITEBACK  13151 11078 9531 13010 9518 8319
> >> >> >>
> >> >> >> ORIGINAL  121209 124786 122140 121195 122540 132363
> >> >> >>
> >> >> >>
> >> >> >> b. Read performance (KB/s)
> >> >> >>
> >> >> >> dedup_ratio  0 20 40 60 80 100
> >> >> >>
> >> >> >> PROXY  112231 118994 118070 120071 117884 132748
> >> >> >>
> >> >> >> WRITEBACK  34040 29109 19104 26677 24756 21695
> >> >> >>
> >> >> >> ORIGINAL  285482 284398 278063 277989 271793 285094
> >> >> >>
> >> >> >>
> >> >> >> thanks,
> >> >> >> Myoungwon Oh
> >> >> >> --
> >> >> >> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> >> >> >> the body of a message to majordomo@vger.kernel.org
> >> >> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >> >> >>
> >> >> >>
> >> >>
> >> >>
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> >> the body of a message to majordomo@vger.kernel.org
> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >>
> >>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Question about writeback performance and content address obejct for deduplication
  2017-03-24 19:32             ` Sage Weil
@ 2017-03-27 13:46               ` myoungwon oh
  2017-03-27 14:00                 ` Sage Weil
  0 siblings, 1 reply; 16+ messages in thread
From: myoungwon oh @ 2017-03-27 13:46 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel, 오명원

I added comments in the pad.
I will make pads in order to discuss #3 and #4 if you agree with #1 and #2.

thanks.

2017-03-25 4:32 GMT+09:00 Sage Weil <sweil@redhat.com>:
> On Mon, 20 Mar 2017, myoungwon oh wrote:
>> Hi sage.
>>
>> Thanks for your comments!
>> I created pads in order to brainstorm design option about #1, #2 first.
>>
>> #1 http://pad.ceph.com/p/deduplication_how_dedup_manifists
>> #2 http://pad.ceph.com/p/deduplication_how_do_we_store_chunk
>
> I made some comments in the pad!
>
> sage
>
>>
>>
>> Thanks.
>>
>> 2017-03-16 22:42 GMT+09:00 Sage Weil <sweil@redhat.com>:
>> > Hi Myoungwon,
>> >
>> > This is quite a patch!  Sorry for the slow reply.
>> >
>> > On Tue, 14 Mar 2017, myoungwon oh wrote:
>> >> Hi Sage
>> >>
>> >>
>> >> I addressed all of your concerns (I applied CAS pool and dedup
>> >> metadata in object_info_t) and created public repository in order to
>> >> show the prototype implementation
>> >> (https://github.com/myoungwon/ceph/commit/13597f62405d1c5a4977d630e69331407ef3a07a,
>> >> support non-aligned I/O, but for (K)RBD). This code is based on Jewel
>> >> and is not cleaned well but you can see the basic flow (start_flush(),
>> >> maybe_handle_cache_detail() ). It would be nice if you give me some
>> >> comments.
>> >>
>> >> I have some queries mentioned below on which your feedback is highly required.
>> >>
>> >> 1. dedup metadata in object_info_t
>> >>
>> >> You mentioned that it would be nice to make tuple in object_info_t
>> >> such as map<offset, tuple<length, cas object, pool>> But, I made
>> >> dedup_chunk_info_t in object_info_t because I need one more parameter
>> >> (chunk_state) and for extensibility.
>> >
>> > Yes, we definitely want an extensible approach to the state in
>> > object_info_t that will support
>> >
>> >  - a simple redirect ("the object is in that other pool")
>> >  - a dedup object ("the object consists of these N lumps, each one
>> > referencing an object named X_i in pool Y_i")
>> >  - an external system (extenral archive, like a backup system, external
>> > object store, whatever)
>> >
>> > I think we should try to come up with a general notion, like "redirect" or
>> > "object map" or something that covers other options... not just dedup!
>> >
>> >> This is because to avoid read and
>> >> fingerprinting during flush time. chunk_state represents three states
>> >> in writeback mode. First is CLEAN (data and fingerprint are not
>> >> modified). Second is MODIFIED (data is modified but fingerprint is not
>> >> calculated). Third is CALCULATED (data is modified and fingerprint is
>> >> also calculated). When data is stored in cache tier, chunk_state will
>> >> be defined. Therefore, reading data and fingerprinting can be removed
>> >> during flush.
>> >
>> > I'm not following this, though.  I think "clean" would just mean we are
>> > storing the normal object in the pool.  "modified" would mean that the
>> > FLAG_DIRTY is set.  And "calculated" would mean we have successfully
>> > chunked the object, stored or taken refs on the chunks, and written the
>> > chunk map into object_info_t?
>> >
>> >
>> >> 2. Single Rados Operation
>> >>
>> >> You mentioned a Rados operation which can concurrently read the
>> >> reference count and write data. Do you want that API in objecter
>> >> class? (for example, objector->read_ref_and_write())
>> >
>> > We may not need to make it a first-class rados operation.  For example,
>> > cls_refcount could probably be extended with a write_or_get operation.
>> > But it might also be advantageous to make it a native op.  The main thing
>> > I'm worried about here is that we probably want to make the refs
>> > reliable and autitable, which means backpointers (so you can look at a
>> > chunk and see which dedup objects are using it).  That means that a
>> > popular sequence of bytes might have a huge number of references, and that
>> > will need to scale gracefully.  Or, we just use counters, accept that
>> > failure conditions could make us leak dedup chunks, and make all of our
>> > failure paths fail-safe.
>> >
>> >> 3. Write sequence for performance.
>> >>
>> >> Current write sequence (proxy mode) is
>> >>
>> >> a. Read metadata (promote_object)
>> >> b. Send data to OSD (in CAS pool) and send dedup metadata to OSD (in
>> >> original pool)
>> >> c. If data and metadata are stored then, proxy osd will issue message
>> >> to decrease the reference count (for previous chunk) to OSD (in CAS
>> >> pool) and update local object metadata (via simple_opc_submit)
>> >> d. If reference count is successful, send Ack to client
>> >>
>> >> As you can see, the number of operations increased due to reference
>> >> count and metadata updates. This can degrade performance. My question
>> >> is that can we send ack to client at (c) above? (But I am worried
>> >> about inconsistent reference count state.)
>> >
>> > I'm worried that if we focus on inline dedup immediately we'll end up with
>> > something that is less general and more fragile.  It's also harder.
>> > Instead, we can consider the inline and async dedup separately.  Async:
>> >
>> > writeback:
>> > a. normal write into object.  ack client.
>> > ...
>> > b. dedup agent: read object (from cache), chunk
>> > c. dedup agent: write/refcount chunks
>> > d. replace object with dedup manifest
>> >
>> > This could happen with or without a delay.  I don't think it makes sense
>> > to consider "promote" here at all; it sounds like you're assuming the
>> > initial dedup tier is a cache tier, and we should try not to assume that
>> > (even though it might be possible).  Instead, I think a "basic" setup
>> > would probably be
>> >
>> > 1. base pool (all ssd; contains all metadata for all objects, and absorbs
>> >    writes).
>> > 2. dedup pool(s) contain refcounted chunks
>> >
>> > If we want to do inline dedup, it would be some complex code that combines
>> > all of the steps above into one, at the expense of client latency.
>> >
>> >
>> > In any case, it's awesome that you have a working prototype.  However,
>> > it's not going to be practical to take a huge patch(set) like this and
>> > merge it all at once.  It's too much code to review, too complex, and too
>> > hard to test.  Also, it's changing 5000 in ReplicatedPG.cc (since renamed
>> > PrimaryLogPG.cc), which is slated for a big refactor right after luminous.
>> >
>> > The way to approach this to get it upstream is to break this down into
>> > different logical components and design/review/test/merge each of them
>> > indepdendently.  Having a prototype is useful in that it will be easier to
>> > answer a lot of the questions we'll have deciding how each part should
>> > work and what it needs to be able to handle, but don't expect that most
>> > of that code will end up in the final version!
>> >
>> > I'm guessing we can break this down into a few logical components:
>> >
>> > 1) How do we store chunks.  We know we want refcounted objects for each
>> > chunk.  We don't know how we'll manage the refcounts, whether we want/need
>> > backpointers, whether we are willing to tolerate "leaking" references in
>> > failure cases (so that we fail to clean up all chunks if we e.g. delete
>> > all data), whether we want to implement it as a rados class or a native
>> > rados op, whether we want to support EC, compression, etc.  This whole
>> > discussion one is a great place to start because it is self-contained and
>> > doesn't break anything else.
>> >
>> > 2) How do we do the dedup manifists (and redirects) in object_info_t.  We
>> > want the solution to include or be compatible with simpler tiering, like
>> > having the object_info_t simply be a pointer to a different (colder) pool.
>> > In fact, I think this is the thing to do first becuase it will make us
>> > fix/solve all the basic problems with flush and promote.  And extending
>> > this to include dedup (object is composed of many little bits in other
>> > pools) is then a matter of making that 'manifest' (or whatever we call it)
>> > a generic and extensible description.  Remember we also want to support
>> > pushing objects into external systems (say, glacier, or some other
>> > external object store like a backup system).
>> >
>> > 3) How do we chunk.  You have some classes that handle aligned chunking.
>> > We'll probably eventually want content-based chunking (based on Rabin
>> > fingerprinting or whatever the new hotness is).  Real users will probably
>> > want adjustable policies based on what they know of the content they're
>> > storing, and the system will probably want to support multiple CAS pools
>> > based on which policy is being used (as that determines chunk sizes
>> > etc and whether we'll actually have any dedup happening).
>> >
>> > 4) How to drive the dedup process itself.  An async agent that's part of
>> > the exiting tier_agent?  An external process?  Something inline in the
>> > write path?  This is the hardest question to answer, and the one that is
>> > most likely to collide with other planned OSD work.  It can also come
>> > last, IMO!  We can start with a simple offline agent and perhaps
>> > eventually do something more clever or efficient.
>> >
>> > In any case, I think #1 and #2 are the key discussions we should have now.
>> > I suggest starting a pad and email thread for each (pad.ceph.com) so we
>> > can brainstorm design options, weight trade-offs, and come to some
>> > consensus.  (I had some thoughts, for example, on a hybrid scheme
>> > somewhere between explicit backpointers and a simple refcount that could
>> > consume fixed overhead but still provide information that would enable a
>> > moderately efficient scrub/audit.)
>> >
>> > Thanks!
>> > sage
>> >
>> >
>> >
>> >
>> >> Write sequence (writeback mode) is
>> >>
>> >> a.  Read object data and do fingerprinting (if data is not calculated).
>> >> b. Send reference count decrement message (for previous chunk) to osd
>> >> (in CAS pool) and updates local object metadata
>> >> c. Send copy_from message to osd (in CAS pool) and send copy_from
>> >> message (in order to copy the dedup metadata) to a osd (in original
>> >> pool)
>> >>
>> >> Writeback mode also increase the number of operation. Can we reduce?
>> >>
>> >>
>> >>
>> >> 4. Performance.
>> >>
>> >> Performance is improved compared to previous results. But It still
>> >> seems to be improving. (512KB block, Seq. workload, fio, KRBD, single
>> >> thread, target_max_objects = 4)
>> >>
>> >> Major concerns are first is fingerprint overhead and second is
>> >> writeback performance in cache tier. When the chunk size is large
>> >> (>512KB), SHA1 takes more than 3ms. (This can be reduced if we use
>> >> small chunk.)
>> >>
>> >> Regarding writeback performance, Flush need two more operations than
>> >> proxy mode. First is "marking clean state". Second is "reading dedup
>> >> metadata and data from storage". Therefore, actual read and write
>> >> occur. These cause that flush completion is delayed.
>> >>
>> >> Small chunk performance in the writeback mode is significantly
>> >> degraded because single flush thread handles multiple copy_from
>> >> message. It seems that we should improve basic flushing performance.
>> >>
>> >>
>> >> Write performance (MB/s)
>> >>
>> >> Dedup ratio     0         60       100
>> >>
>> >> Proxy             55       64       73
>> >>
>> >> Writeback       48       50       50
>> >>
>> >> Original           120      120      122
>> >>
>> >>
>> >>
>> >> Read performance (MB/s)
>> >>
>> >> Dedup ratio     0         60       100
>> >>
>> >> Proxy             117      130      141
>> >>
>> >> Writeback       198      197      200
>> >>
>> >> Original           280      276      285
>> >>
>> >>
>> >>
>> >>
>> >> 5. Command to enable dedup
>> >>
>> >> Ceph osd pool create sds-hot 1024
>> >> Ceph osd pool create sds-cas 1024
>> >> Ceph osd tier add_cas rbd sds-hot sds-cas
>> >> Ceph osd tier sds-hot (proxy or writeback)
>> >> Ceph osd tier dedup_block rbd sds-hot sds-cas (chunk size. e.g. 65536, 131072..)
>> >> Ceph osd tier set-overlay rbd sds-hot
>> >>
>> >>
>> >>
>> >> Thanks
>> >> Myoungwon Oh
>> >> (omwmw@sk.com)
>> >>
>> >> 2017-02-07 23:50 GMT+09:00 Sage Weil <sweil@redhat.com>:
>> >> > On Tue, 7 Feb 2017, myoungwon oh wrote:
>> >> >> Hi sage.
>> >> >>
>> >> >> I uploaded the document which describe my overall appoach.
>> >> >> please see it and give me feedback.
>> >> >> slide: https://www.slideshare.net/secret/JZcy3yYEDIHPyg
>> >> >
>> >> > This approach looks pretty close to what we have been planning.  A few
>> >> > comments:
>> >> >
>> >> > 1) I think it may be better to view the tier/pool that has the object
>> >> > metadata as the "base" pool, and the CAS pool with the refcounted
>> >> > object chunks as as tier below that.
>> >> >
>> >> > 2) I think we can use an object class or a handful of new native rados
>> >> > operations to make the CAS pool read/write operations more efficient.  In
>> >> > your slides you describe a process something like
>> >> >
>> >> >   rados(getattr)
>> >> >   if exists
>> >> >      rados(increment ref count)
>> >> >   else
>> >> >      rados(write object and set ref count to 1)
>> >> >
>> >> > This could be collapsed into a single optimistic operation that sends the
>> >> > data and a command that says "create or increment ref count" so that the
>> >> > conditional behavior is handled at the OSD.  This will be more efficient
>> >> > for small chunks.  (For large chunks, or in cases where we have some
>> >> > confidence that the chunk probably already exists, the pessimistic
>> >> > approach might still make sense.)  Either way, we should probably support
>> >> > both.
>> >> >
>> >> > 3) We'd like to generalize the first pool behavior so that it is just a
>> >> > special case of the new tiering functionality.  The idea is that an
>> >> > object_info_t can have a 'manifest' that described where and how the
>> >> > object is really stored instead of the object data itself (much like it
>> >> > can already be a whiteout, etc.).  In the simplest case, the manifest
>> >> > would just say "this object is stored in pool X" (simple tiering).  In
>> >> > this case, the manifest would a structure like
>> >> >
>> >> >   map<offset, tuple<length, cas object, pool>>
>> >> >
>> >> > I think it'll be worth the effort to build a general struture here that we
>> >> > can use for basic tiering (not just dedup).
>> >> >
>> >> > sage
>> >> >
>> >> >
>> >> >
>> >> >>
>> >> >> thanks
>> >> >>
>> >> >>
>> >> >> 2017-01-31 23:24 GMT+09:00 Sage Weil <sage@newdream.net>:
>> >> >> > On Thu, 26 Jan 2017, myoungwon oh wrote:
>> >> >> >> I have two questions.
>> >> >> >>
>> >> >> >> 1. I would like to ask about CAS location. current our implementation store
>> >> >> >> content address object in storage tier.However, If we store the CAO in the
>> >> >> >> cache tier, we can get a performance advantage. Do you think we can create
>> >> >> >> CAO in cachetier? or create a separate storage pool for CAS?
>> >> >> >
>> >> >> > It depends on the design.  If the you are naming the objects at the
>> >> >> > librados client side, then you can use the rados cluster itself
>> >> >> > unmodified (with or without a cache tier).  This is roughly how I have
>> >> >> > anticipated implementing the CAS storage portion.  If you are doing the
>> >> >> > chunking hashing and within the OSD itself, then you can't do the CAS
>> >> >> > at the first tier because the requests won't be directed at the right OSD.
>> >> >> >
>> >> >> >> 2. The results below are performance result for our current implementation.
>> >> >> >> experiment setup:
>> >> >> >> PROXY (inline dedup), WRITEBACK (lazy dedup, target_max_bytes: 50MB),
>> >> >> >> ORIGINAL(without dedup feature and cache tier),
>> >> >> >> fio, 512K block, seq. I/O, single thread
>> >> >> >>
>> >> >> >> One thing to note is that the writeback case is slower than the proxy.
>> >> >> >> We think there are three problems as follows.
>> >> >> >>
>> >> >> >> A. The current implementation creates a fingerprint by reading the entire
>> >> >> >> object when flushing. Therefore, there is a problem that read and write are
>> >> >> >> mixed.
>> >> >> >
>> >> >> > I expect this is a small factor compared to the fact that in writeback
>> >> >> > mode you have to *write* to the cache tier, which is 3x replicated,
>> >> >> > whereas in proxy mode those writes don't happen at all.
>> >> >> >
>> >> >> >> B. When client request read, the promote_object function reads the object
>> >> >> >> and writes it back to the cache tier, which also causes a mix of read and
>> >> >> >> write.
>> >> >> >
>> >> >> > This can be mitigated by setting the min_read_recency_for_promote pool
>> >> >> > property to something >1.  Then reads will be proxied unless the object
>> >> >> > appears to be hot (because it has been touched over multiple
>> >> >> > hitset intervals).
>> >> >> >
>> >> >> >> C. When flushing, the unchanged part is rewritten because flush operation
>> >> >> >> perform per-object based.
>> >> >> >
>> >> >> > Yes.
>> >> >> >
>> >> >> > Is there a description of your overall approach somewhere?
>> >> >> >
>> >> >> > sage
>> >> >> >
>> >> >> >
>> >> >> >>
>> >> >> >> Do I have something wrong? or Could you give me a suggestion to improve
>> >> >> >> performance?
>> >> >> >>
>> >> >> >>
>> >> >> >> a. Write performance (KB/s)
>> >> >> >>
>> >> >> >> dedup_ratio  0 20 40 60 80 100
>> >> >> >>
>> >> >> >> PROXY  45586 47804 51120 52844 56167 55302
>> >> >> >>
>> >> >> >> WRITEBACK  13151 11078 9531 13010 9518 8319
>> >> >> >>
>> >> >> >> ORIGINAL  121209 124786 122140 121195 122540 132363
>> >> >> >>
>> >> >> >>
>> >> >> >> b. Read performance (KB/s)
>> >> >> >>
>> >> >> >> dedup_ratio  0 20 40 60 80 100
>> >> >> >>
>> >> >> >> PROXY  112231 118994 118070 120071 117884 132748
>> >> >> >>
>> >> >> >> WRITEBACK  34040 29109 19104 26677 24756 21695
>> >> >> >>
>> >> >> >> ORIGINAL  285482 284398 278063 277989 271793 285094
>> >> >> >>
>> >> >> >>
>> >> >> >> thanks,
>> >> >> >> Myoungwon Oh
>> >> >> >> --
>> >> >> >> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> >> >> >> the body of a message to majordomo@vger.kernel.org
>> >> >> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> >> >> >>
>> >> >> >>
>> >> >>
>> >> >>
>> >> --
>> >> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> >> the body of a message to majordomo@vger.kernel.org
>> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> >>
>> >>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Question about writeback performance and content address obejct for deduplication
  2017-03-27 13:46               ` myoungwon oh
@ 2017-03-27 14:00                 ` Sage Weil
  2017-03-27 15:27                   ` myoungwon oh
  0 siblings, 1 reply; 16+ messages in thread
From: Sage Weil @ 2017-03-27 14:00 UTC (permalink / raw)
  To: myoungwon oh; +Cc: ceph-devel, 오명원

On Mon, 27 Mar 2017, myoungwon oh wrote:
> I added comments in the pad.

Looks good!  I made a few more comments.  If it looks good to you, I'd
update the pad to have just the proposed approach at the top (maybe keep 
the discussion of options at the bottom).

> I will make pads in order to discuss  #3 and #4 if you agree with #1, #2.

Sure!

sage



> 
> thanks.
> 
> 2017-03-25 4:32 GMT+09:00 Sage Weil <sweil@redhat.com>:
> > On Mon, 20 Mar 2017, myoungwon oh wrote:
> >> Hi sage.
> >>
> >> Thanks for your comments!
> >> I created pads in order to brainstorm design option about #1, #2 first.
> >>
> >> #1 http://pad.ceph.com/p/deduplication_how_dedup_manifists
> >> #2 http://pad.ceph.com/p/deduplication_how_do_we_store_chunk
> >
> > I made some comments in the pad!
> >
> > sage
> >
> >>
> >>
> >> Thanks.
> >>
> >> 2017-03-16 22:42 GMT+09:00 Sage Weil <sweil@redhat.com>:
> >> > Hi Myoungwon,
> >> >
> >> > This is quite a patch!  Sorry for the slow reply.
> >> >
> >> > On Tue, 14 Mar 2017, myoungwon oh wrote:
> >> >> Hi Sage
> >> >>
> >> >>
> >> >> I addressed all of your concerns (I applied CAS pool and dedup
> >> >> metadata in object_info_t) and created public repository in order to
> >> >> show the prototype implementation
> >> >> (https://github.com/myoungwon/ceph/commit/13597f62405d1c5a4977d630e69331407ef3a07a,
> >> >> support non-aligned I/O, but for (K)RBD). This code is based on Jewel
> >> >> and is not cleaned well but you can see the basic flow (start_flush(),
> >> >> maybe_handle_cache_detail() ). It would be nice if you give me some
> >> >> comments.
> >> >>
> >> >> I have some queries mentioned below on which your feedback is highly required.
> >> >>
> >> >> 1. dedup metadata in object_info_t
> >> >>
> >> >> You mentioned that it would be nice to make tuple in object_info_t
> >> >> such as map<offset, tuple<length, cas object, pool>> But, I made
> >> >> dedup_chunk_info_t in object_info_t because I need one more parameter
> >> >> (chunk_state) and for extensibility.
> >> >
> >> > Yes, we definitely want an extensible approach to the state in
> >> > object_info_t that will support
> >> >
> >> >  - a simple redirect ("the object is in that other pool")
> >> >  - a dedup object ("the object consists of these N lumps, each one
> >> > referencing an object named X_i in pool Y_i")
> >> >  - an external system (extenral archive, like a backup system, external
> >> > object store, whatever)
> >> >
> >> > I think we should try to come up with a general notion, like "redirect" or
> >> > "object map" or something that covers other options... not just dedup!
> >> >
> >> >> This is because to avoid read and
> >> >> fingerprinting during flush time. chunk_state represents three states
> >> >> in writeback mode. First is CLEAN (data and fingerprint are not
> >> >> modified). Second is MODIFIED (data is modified but fingerprint is not
> >> >> calculated). Third is CALCULATED (data is modified and fingerprint is
> >> >> also calculated). When data is stored in cache tier, chunk_state will
> >> >> be defined. Therefore, reading data and fingerprinting can be removed
> >> >> during flush.
> >> >
> >> > I'm not following this, though.  I think "clean" would just mean we are
> >> > storing the normal object in the pool.  "modified" would mean that the
> >> > FLAG_DIRTY is set.  And "calculated" would mean we have successfully
> >> > chunked the object, stored or taken refs on the chunks, and written the
> >> > chunk map into object_info_t?
> >> >
> >> >
> >> >> 2. Single Rados Operation
> >> >>
> >> >> You mentioned a Rados operation which can concurrently read the
> >> >> reference count and write data. Do you want that API in objecter
> >> >> class? (for example, objector->read_ref_and_write())
> >> >
> >> > We may not need to make it a first-class rados operation.  For example,
> >> > cls_refcount could probably be extended with a write_or_get operation.
> >> > But it might also be advantageous to make it a native op.  The main thing
> >> > I'm worried about here is that we probably want to make the refs
> >> > reliable and autitable, which means backpointers (so you can look at a
> >> > chunk and see which dedup objects are using it).  That means that a
> >> > popular sequence of bytes might have a huge number of references, and that
> >> > will need to scale gracefully.  Or, we just use counters, accept that
> >> > failure conditions could make us leak dedup chunks, and make all of our
> >> > failure paths fail-safe.
> >> >
> >> >> 3. Write sequence for performance.
> >> >>
> >> >> Current write sequence (proxy mode) is
> >> >>
> >> >> a. Read metadata (promote_object)
> >> >> b. Send data to OSD (in CAS pool) and send dedup metadata to OSD (in
> >> >> original pool)
> >> >> c. If data and metadata are stored then, proxy osd will issue message
> >> >> to decrease the reference count (for previous chunk) to OSD (in CAS
> >> >> pool) and update local object metadata (via simple_opc_submit)
> >> >> d. If reference count is successful, send Ack to client
> >> >>
> >> >> As you can see, the number of operations increased due to reference
> >> >> count and metadata updates. This can degrade performance. My question
> >> >> is that can we send ack to client at (c) above? (But I am worried
> >> >> about inconsistent reference count state.)
> >> >
> >> > I'm worried that if we focus on inline dedup immediately we'll end up with
> >> > something that is less general and more fragile.  It's also harder.
> >> > Instead, we can consider the inline and async dedup separately.  Async:
> >> >
> >> > writeback:
> >> > a. normal write into object.  ack client.
> >> > ...
> >> > b. dedup agent: read object (from cache), chunk
> >> > c. dedup agent: write/refcount chunks
> >> > d. replace object with dedup manifest
> >> >
> >> > This could happen with or without a delay.  I don't think it makes sense
> >> > to consider "promote" here at all; it sounds like you're assuming the
> >> > initial dedup tier is a cache tier, and we should try not to assume that
> >> > (even though it might be possible).  Instead, I think a "basic" setup
> >> > would probably be
> >> >
> >> > 1. base pool (all ssd; contains all metadata for all objects, and absorbs
> >> >    writes).
> >> > 2. dedup pool(s) contain refcounted chunks
> >> >
> >> > If we want to do inline dedup, it would be some complex code that combines
> >> > all of the steps above into one, at the expense of client latency.
> >> >
> >> >
> >> > In any case, it's awesome that you have a working prototype.  However,
> >> > it's not going to be practical to take a huge patch(set) like this and
> >> > merge it all at once.  It's too much code to review, too complex, and too
> >> > hard to test.  Also, it's changing 5000 in ReplicatedPG.cc (since renamed
> >> > PrimaryLogPG.cc), which is slated for a big refactor right after luminous.
> >> >
> >> > The way to approach this to get it upstream is to break this down into
> >> > different logical components and design/review/test/merge each of them
> >> > indepdendently.  Having a prototype is useful in that it will be easier to
> >> > answer a lot of the questions we'll have deciding how each part should
> >> > work and what it needs to be able to handle, but don't expect that most
> >> > of that code will end up in the final version!
> >> >
> >> > I'm guessing we can break this down into a few logical components:
> >> >
> >> > 1) How do we store chunks.  We know we want refcounted objects for each
> >> > chunk.  We don't know how we'll manage the refcounts, whether we want/need
> >> > backpointers, whether we are willing to tolerate "leaking" references in
> >> > failure cases (so that we fail to clean up all chunks if we e.g. delete
> >> > all data), whether we want to implement it as a rados class or a native
> >> > rados op, whether we want to support EC, compression, etc.  This whole
> >> > discussion one is a great place to start because it is self-contained and
> >> > doesn't break anything else.
> >> >
> >> > 2) How do we do the dedup manifists (and redirects) in object_info_t.  We
> >> > want the solution to include or be compatible with simpler tiering, like
> >> > having the object_info_t simply be a pointer to a different (colder) pool.
> >> > In fact, I think this is the thing to do first becuase it will make us
> >> > fix/solve all the basic problems with flush and promote.  And extending
> >> > this to include dedup (object is composed of many little bits in other
> >> > pools) is then a matter of making that 'manifest' (or whatever we call it)
> >> > a generic and extensible description.  Remember we also want to support
> >> > pushing objects into external systems (say, glacier, or some other
> >> > external object store like a backup system).
> >> >
> >> > 3) How do we chunk.  You have some classes that handle aligned chunking.
> >> > We'll probably eventually want content-based chunking (based on Rabin
> >> > fingerprinting or whatever the new hotness is).  Real users will probably
> >> > want adjustable policies based on what they know of the content they're
> >> > storing, and the system will probably want to support multiple CAS pools
> >> > based on which policy is being used (as that determines chunk sizes
> >> > etc and whether we'll actually have any dedup happening).
> >> >
> >> > 4) How to drive the dedup process itself.  An async agent that's part of
> >> > the exiting tier_agent?  An external process?  Something inline in the
> >> > write path?  This is the hardest question to answer, and the one that is
> >> > most likely to collide with other planned OSD work.  It can also come
> >> > last, IMO!  We can start with a simple offline agent and perhaps
> >> > eventually do something more clever or efficient.
> >> >
> >> > In any case, I think #1 and #2 are the key discussions we should have now.
> >> > I suggest starting a pad and email thread for each (pad.ceph.com) so we
> >> > can brainstorm design options, weight trade-offs, and come to some
> >> > consensus.  (I had some thoughts, for example, on a hybrid scheme
> >> > somewhere between explicit backpointers and a simple refcount that could
> >> > consume fixed overhead but still provide information that would enable a
> >> > moderately efficient scrub/audit.)
> >> >
> >> > Thanks!
> >> > sage
> >> >
> >> >
> >> >
> >> >
> >> >> Write sequence (writeback mode) is
> >> >>
> >> >> a.  Read object data and do fingerprinting (if data is not calculated).
> >> >> b. Send reference count decrement message (for previous chunk) to osd
> >> >> (in CAS pool) and updates local object metadata
> >> >> c. Send copy_from message to osd (in CAS pool) and send copy_from
> >> >> message (in order to copy the dedup metadata) to a osd (in original
> >> >> pool)
> >> >>
> >> >> Writeback mode also increase the number of operation. Can we reduce?
> >> >>
> >> >>
> >> >>
> >> >> 4. Performance.
> >> >>
> >> >> Performance is improved compared to previous results. But It still
> >> >> seems to be improving. (512KB block, Seq. workload, fio, KRBD, single
> >> >> thread, target_max_objects = 4)
> >> >>
> >> >> Major concerns are first is fingerprint overhead and second is
> >> >> writeback performance in cache tier. When the chunk size is large
> >> >> (>512KB), SHA1 takes more than 3ms. (This can be reduced if we use
> >> >> small chunk.)
> >> >>
> >> >> Regarding writeback performance, Flush need two more operations than
> >> >> proxy mode. First is "marking clean state". Second is "reading dedup
> >> >> metadata and data from storage". Therefore, actual read and write
> >> >> occur. These cause that flush completion is delayed.
> >> >>
> >> >> Small chunk performance in the writeback mode is significantly
> >> >> degraded because single flush thread handles multiple copy_from
> >> >> message. It seems that we should improve basic flushing performance.
> >> >>
> >> >>
> >> >> Write performance (MB/s)
> >> >>
> >> >> Dedup ratio     0         60       100
> >> >>
> >> >> Proxy             55       64       73
> >> >>
> >> >> Writeback       48       50       50
> >> >>
> >> >> Original           120      120      122
> >> >>
> >> >>
> >> >>
> >> >> Read performance (MB/s)
> >> >>
> >> >> Dedup ratio     0         60       100
> >> >>
> >> >> Proxy             117      130      141
> >> >>
> >> >> Writeback       198      197      200
> >> >>
> >> >> Original           280      276      285
> >> >>
> >> >>
> >> >>
> >> >>
> >> >> 5. Command to enable dedup
> >> >>
> >> >> Ceph osd pool create sds-hot 1024
> >> >> Ceph osd pool create sds-cas 1024
> >> >> Ceph osd tier add_cas rbd sds-hot sds-cas
> >> >> Ceph osd tier sds-hot (proxy or writeback)
> >> >> Ceph osd tier dedup_block rbd sds-hot sds-cas (chunk size. e.g. 65536, 131072..)
> >> >> Ceph osd tier set-overlay rbd sds-hot
> >> >>
> >> >>
> >> >>
> >> >> Thanks
> >> >> Myoungwon Oh
> >> >> (omwmw@sk.com)
> >> >>
> >> >> 2017-02-07 23:50 GMT+09:00 Sage Weil <sweil@redhat.com>:
> >> >> > On Tue, 7 Feb 2017, myoungwon oh wrote:
> >> >> >> Hi sage.
> >> >> >>
> >> >> >> I uploaded the document which describe my overall appoach.
> >> >> >> please see it and give me feedback.
> >> >> >> slide: https://www.slideshare.net/secret/JZcy3yYEDIHPyg
> >> >> >
> >> >> > This approach looks pretty close to what we have been planning.  A few
> >> >> > comments:
> >> >> >
> >> >> > 1) I think it may be better to view the tier/pool that has the object
> >> >> > metadata as the "base" pool, and the CAS pool with the refcounted
> >> >> > object chunks as as tier below that.
> >> >> >
> >> >> > 2) I think we can use an object class or a handful of new native rados
> >> >> > operations to make the CAS pool read/write operations more efficient.  In
> >> >> > your slides you describe a process something like
> >> >> >
> >> >> >   rados(getattr)
> >> >> >   if exists
> >> >> >      rados(increment ref count)
> >> >> >   else
> >> >> >      rados(write object and set ref count to 1)
> >> >> >
> >> >> > This could be collapsed into a single optimistic operation that sends the
> >> >> > data and a command that says "create or increment ref count" so that the
> >> >> > conditional behavior is handled at the OSD.  This will be more efficient
> >> >> > for small chunks.  (For large chunks, or in cases where we have some
> >> >> > confidence that the chunk probably already exists, the pessimistic
> >> >> > approach might still make sense.)  Either way, we should probably support
> >> >> > both.
> >> >> >
> >> >> > 3) We'd like to generalize the first pool behavior so that it is just a
> >> >> > special case of the new tiering functionality.  The idea is that an
> >> >> > object_info_t can have a 'manifest' that described where and how the
> >> >> > object is really stored instead of the object data itself (much like it
> >> >> > can already be a whiteout, etc.).  In the simplest case, the manifest
> >> >> > would just say "this object is stored in pool X" (simple tiering).  In
> >> >> > this case, the manifest would a structure like
> >> >> >
> >> >> >   map<offset, tuple<length, cas object, pool>>
> >> >> >
> >> >> > I think it'll be worth the effort to build a general struture here that we
> >> >> > can use for basic tiering (not just dedup).
> >> >> >
> >> >> > sage
> >> >> >
> >> >> >
> >> >> >
> >> >> >>
> >> >> >> thanks
> >> >> >>
> >> >> >>
> >> >> >> 2017-01-31 23:24 GMT+09:00 Sage Weil <sage@newdream.net>:
> >> >> >> > On Thu, 26 Jan 2017, myoungwon oh wrote:
> >> >> >> >> I have two questions.
> >> >> >> >>
> >> >> >> >> 1. I would like to ask about CAS location. current our implementation store
> >> >> >> >> content address object in storage tier.However, If we store the CAO in the
> >> >> >> >> cache tier, we can get a performance advantage. Do you think we can create
> >> >> >> >> CAO in cachetier? or create a separate storage pool for CAS?
> >> >> >> >
> >> >> >> > It depends on the design.  If the you are naming the objects at the
> >> >> >> > librados client side, then you can use the rados cluster itself
> >> >> >> > unmodified (with or without a cache tier).  This is roughly how I have
> >> >> >> > anticipated implementing the CAS storage portion.  If you are doing the
> >> >> >> > chunking hashing and within the OSD itself, then you can't do the CAS
> >> >> >> > at the first tier because the requests won't be directed at the right OSD.
> >> >> >> >
> >> >> >> >> 2. The results below are performance result for our current implementation.
> >> >> >> >> experiment setup:
> >> >> >> >> PROXY (inline dedup), WRITEBACK (lazy dedup, target_max_bytes: 50MB),
> >> >> >> >> ORIGINAL(without dedup feature and cache tier),
> >> >> >> >> fio, 512K block, seq. I/O, single thread
> >> >> >> >>
> >> >> >> >> One thing to note is that the writeback case is slower than the proxy.
> >> >> >> >> We think there are three problems as follows.
> >> >> >> >>
> >> >> >> >> A. The current implementation creates a fingerprint by reading the entire
> >> >> >> >> object when flushing. Therefore, there is a problem that read and write are
> >> >> >> >> mixed.
> >> >> >> >
> >> >> >> > I expect this is a small factor compared to the fact that in writeback
> >> >> >> > mode you have to *write* to the cache tier, which is 3x replicated,
> >> >> >> > whereas in proxy mode those writes don't happen at all.
> >> >> >> >
> >> >> >> >> B. When client request read, the promote_object function reads the object
> >> >> >> >> and writes it back to the cache tier, which also causes a mix of read and
> >> >> >> >> write.
> >> >> >> >
> >> >> >> > This can be mitigated by setting the min_read_recency_for_promote pool
> >> >> >> > property to something >1.  Then reads will be proxied unless the object
> >> >> >> > appears to be hot (because it has been touched over multiple
> >> >> >> > hitset intervals).
> >> >> >> >
> >> >> >> >> C. When flushing, the unchanged part is rewritten because flush operation
> >> >> >> >> perform per-object based.
> >> >> >> >
> >> >> >> > Yes.
> >> >> >> >
> >> >> >> > Is there a description of your overall approach somewhere?
> >> >> >> >
> >> >> >> > sage
> >> >> >> >
> >> >> >> >
> >> >> >> >>
> >> >> >> >> Do I have something wrong? or Could you give me a suggestion to improve
> >> >> >> >> performance?
> >> >> >> >>
> >> >> >> >>
> >> >> >> >> a. Write performance (KB/s)
> >> >> >> >>
> >> >> >> >> dedup_ratio  0 20 40 60 80 100
> >> >> >> >>
> >> >> >> >> PROXY  45586 47804 51120 52844 56167 55302
> >> >> >> >>
> >> >> >> >> WRITEBACK  13151 11078 9531 13010 9518 8319
> >> >> >> >>
> >> >> >> >> ORIGINAL  121209 124786 122140 121195 122540 132363
> >> >> >> >>
> >> >> >> >>
> >> >> >> >> b. Read performance (KB/s)
> >> >> >> >>
> >> >> >> >> dedup_ratio  0 20 40 60 80 100
> >> >> >> >>
> >> >> >> >> PROXY  112231 118994 118070 120071 117884 132748
> >> >> >> >>
> >> >> >> >> WRITEBACK  34040 29109 19104 26677 24756 21695
> >> >> >> >>
> >> >> >> >> ORIGINAL  285482 284398 278063 277989 271793 285094
> >> >> >> >>
> >> >> >> >>
> >> >> >> >> thanks,
> >> >> >> >> Myoungwon Oh
> >> >> >> >> --
> >> >> >> >> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> >> >> >> >> the body of a message to majordomo@vger.kernel.org
> >> >> >> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >> >> >> >>
> >> >> >> >>
> >> >> >>
> >> >> >>
> >> >> --
> >> >> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> >> >> the body of a message to majordomo@vger.kernel.org
> >> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >> >>
> >> >>
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> >> the body of a message to majordomo@vger.kernel.org
> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >>
> >>
> 
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Question about writeback performance and content address obejct for deduplication
  2017-03-27 14:00                 ` Sage Weil
@ 2017-03-27 15:27                   ` myoungwon oh
  2017-03-28 15:32                     ` myoungwon oh
  0 siblings, 1 reply; 16+ messages in thread
From: myoungwon oh @ 2017-03-27 15:27 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel, 오명원

Looks good to me (I made a comment about 2PC).
I will make pads for #3, #4.
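
For context, this is roughly how I picture the collapsed "create or
increment ref count" (write_or_get style) op from the earlier mails; it
is only a toy sketch with made-up names, not the real cls_refcount or
librados interface:

  #include <cstdint>
  #include <map>
  #include <string>
  #include <utility>
  #include <vector>

  // Toy in-memory stand-in for the CAS pool, only meant to show the
  // semantics of a single "write or increment refcount" op handled
  // entirely on the OSD that owns the chunk.
  struct cas_store_sketch_t {
    // fingerprint oid -> (chunk data, reference count)
    std::map<std::string,
             std::pair<std::vector<uint8_t>, uint64_t>> objects;

    // Create the chunk with refcount 1 if it does not exist, otherwise
    // just bump the refcount; one round trip instead of getattr followed
    // by a separate incref or write.
    uint64_t write_or_incref(const std::string& fp_oid,
                             const std::vector<uint8_t>& data) {
      auto it = objects.find(fp_oid);
      if (it == objects.end()) {
        objects[fp_oid] = {data, 1};
        return 1;
      }
      return ++it->second.second;
    }
  };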

Thanks!

2017-03-27 23:00 GMT+09:00 Sage Weil <sweil@redhat.com>:
> On Mon, 27 Mar 2017, myoungwon oh wrote:
>> I added comments in the pad.
>
> Looks good!  I made a few more comments.  If it looks good to you I'd
> update the pad to have just the proposed approach at the top (maybe keep
> the discussion of options at the bottom).
>
>> I will make pads in order to discuss  #3 and #4 if you agree with #1, #2.
>
> Sure!
>
> sage
>
>
>
>>
>> thanks.
>>
>> 2017-03-25 4:32 GMT+09:00 Sage Weil <sweil@redhat.com>:
>> > On Mon, 20 Mar 2017, myoungwon oh wrote:
>> >> Hi sage.
>> >>
>> >> Thanks for your comments!
>> >> I created pads in order to brainstorm design option about #1, #2 first.
>> >>
>> >> #1 http://pad.ceph.com/p/deduplication_how_dedup_manifists
>> >> #2 http://pad.ceph.com/p/deduplication_how_do_we_store_chunk
>> >
>> > I made some comments in the pad!
>> >
>> > sage
>> >
>> >>
>> >>
>> >> Thanks.
>> >>
>> >> 2017-03-16 22:42 GMT+09:00 Sage Weil <sweil@redhat.com>:
>> >> > Hi Myoungwon,
>> >> >
>> >> > This is quite a patch!  Sorry for the slow reply.
>> >> >
>> >> > On Tue, 14 Mar 2017, myoungwon oh wrote:
>> >> >> Hi Sage
>> >> >>
>> >> >>
>> >> >> I addressed all of your concerns (I applied CAS pool and dedup
>> >> >> metadata in object_info_t) and created public repository in order to
>> >> >> show the prototype implementation
>> >> >> (https://github.com/myoungwon/ceph/commit/13597f62405d1c5a4977d630e69331407ef3a07a,
>> >> >> support non-aligned I/O, but for (K)RBD). This code is based on Jewel
>> >> >> and is not cleaned well but you can see the basic flow (start_flush(),
>> >> >> maybe_handle_cache_detail() ). It would be nice if you give me some
>> >> >> comments.
>> >> >>
>> >> >> I have some queries mentioned below on which your feedback is highly required.
>> >> >>
>> >> >> 1. dedup metadata in object_info_t
>> >> >>
>> >> >> You mentioned that it would be nice to make tuple in object_info_t
>> >> >> such as map<offset, tuple<length, cas object, pool>> But, I made
>> >> >> dedup_chunk_info_t in object_info_t because I need one more parameter
>> >> >> (chunk_state) and for extensibility.
>> >> >
>> >> > Yes, we definitely want an extensible approach to the state in
>> >> > object_info_t that will support
>> >> >
>> >> >  - a simple redirect ("the object is in that other pool")
>> >> >  - a dedup object ("the object consists of these N lumps, each one
>> >> > referencing an object named X_i in pool Y_i")
>> >> >  - an external system (extenral archive, like a backup system, external
>> >> > object store, whatever)
>> >> >
>> >> > I think we should try to come up with a general notion, like "redirect" or
>> >> > "object map" or something that covers other options... not just dedup!
>> >> >
>> >> >> This is because to avoid read and
>> >> >> fingerprinting during flush time. chunk_state represents three states
>> >> >> in writeback mode. First is CLEAN (data and fingerprint are not
>> >> >> modified). Second is MODIFIED (data is modified but fingerprint is not
>> >> >> calculated). Third is CALCULATED (data is modified and fingerprint is
>> >> >> also calculated). When data is stored in cache tier, chunk_state will
>> >> >> be defined. Therefore, reading data and fingerprinting can be removed
>> >> >> during flush.
>> >> >
>> >> > I'm not following this, though.  I think "clean" would just mean we are
>> >> > storing the normal object in the pool.  "modified" would mean that the
>> >> > FLAG_DIRTY is set.  And "calculated" would mean we have successfully
>> >> > chunked the object, stored or taken refs on the chunks, and written the
>> >> > chunk map into object_info_t?
>> >> >
>> >> >
>> >> >> 2. Single Rados Operation
>> >> >>
>> >> >> You mentioned a Rados operation which can concurrently read the
>> >> >> reference count and write data. Do you want that API in objecter
>> >> >> class? (for example, objector->read_ref_and_write())
>> >> >
>> >> > We may not need to make it a first-class rados operation.  For example,
>> >> > cls_refcount could probably be extended with a write_or_get operation.
>> >> > But it might also be advantageous to make it a native op.  The main thing
>> >> > I'm worried about here is that we probably want to make the refs
>> >> > reliable and autitable, which means backpointers (so you can look at a
>> >> > chunk and see which dedup objects are using it).  That means that a
>> >> > popular sequence of bytes might have a huge number of references, and that
>> >> > will need to scale gracefully.  Or, we just use counters, accept that
>> >> > failure conditions could make us leak dedup chunks, and make all of our
>> >> > failure paths fail-safe.
>> >> >
>> >> >> 3. Write sequence for performance.
>> >> >>
>> >> >> Current write sequence (proxy mode) is
>> >> >>
>> >> >> a. Read metadata (promote_object)
>> >> >> b. Send data to OSD (in CAS pool) and send dedup metadata to OSD (in
>> >> >> original pool)
>> >> >> c. If data and metadata are stored then, proxy osd will issue message
>> >> >> to decrease the reference count (for previous chunk) to OSD (in CAS
>> >> >> pool) and update local object metadata (via simple_opc_submit)
>> >> >> d. If reference count is successful, send Ack to client
>> >> >>
>> >> >> As you can see, the number of operations increased due to reference
>> >> >> count and metadata updates. This can degrade performance. My question
>> >> >> is that can we send ack to client at (c) above? (But I am worried
>> >> >> about inconsistent reference count state.)
>> >> >
>> >> > I'm worried that if we focus on inline dedup immediately we'll end up with
>> >> > something that is less general and more fragile.  It's also harder.
>> >> > Instead, we can consider the inline and async dedup separately.  Async:
>> >> >
>> >> > writeback:
>> >> > a. normal write into object.  ack client.
>> >> > ...
>> >> > b. dedup agent: read object (from cache), chunk
>> >> > c. dedup agent: write/refcount chunks
>> >> > d. replace object with dedup manifest
>> >> >
>> >> > This could happen with or without a delay.  I don't think it makes sense
>> >> > to consider "promote" here at all; it sounds like you're assuming the
>> >> > initial dedup tier is a cache tier, and we should try not to assume that
>> >> > (even though it might be possible).  Instead, I think a "basic" setup
>> >> > would probably be
>> >> >
>> >> > 1. base pool (all ssd; contains all metadata for all objects, and absorbs
>> >> >    writes).
>> >> > 2. dedup pool(s) contain refcounted chunks
>> >> >
>> >> > If we want to do inline dedup, it would be some complex code that combines
>> >> > all of the steps above into one, at the expense of client latency.
>> >> >
>> >> >
>> >> > In any case, it's awesome that you have a working prototype.  However,
>> >> > it's not going to be practical to take a huge patch(set) like this and
>> >> > merge it all at once.  It's too much code to review, too complex, and too
>> >> > hard to test.  Also, it's changing 5000 in ReplicatedPG.cc (since renamed
>> >> > PrimaryLogPG.cc), which is slated for a big refactor right after luminous.
>> >> >
>> >> > The way to approach this to get it upstream is to break this down into
>> >> > different logical components and design/review/test/merge each of them
>> >> > indepdendently.  Having a prototype is useful in that it will be easier to
>> >> > answer a lot of the questions we'll have deciding how each part should
>> >> > work and what it needs to be able to handle, but don't expect that most
>> >> > of that code will end up in the final version!
>> >> >
>> >> > I'm guessing we can break this down into a few logical components:
>> >> >
>> >> > 1) How do we store chunks.  We know we want refcounted objects for each
>> >> > chunk.  We don't know how we'll manage the refcounts, whether we want/need
>> >> > backpointers, whether we are willing to tolerate "leaking" references in
>> >> > failure cases (so that we fail to clean up all chunks if we e.g. delete
>> >> > all data), whether we want to implement it as a rados class or a native
>> >> > rados op, whether we want to support EC, compression, etc.  This whole
>> >> > discussion one is a great place to start because it is self-contained and
>> >> > doesn't break anything else.
>> >> >
>> >> > 2) How do we do the dedup manifists (and redirects) in object_info_t.  We
>> >> > want the solution to include or be compatible with simpler tiering, like
>> >> > having the object_info_t simply be a pointer to a different (colder) pool.
>> >> > In fact, I think this is the thing to do first becuase it will make us
>> >> > fix/solve all the basic problems with flush and promote.  And extending
>> >> > this to include dedup (object is composed of many little bits in other
>> >> > pools) is then a matter of making that 'manifest' (or whatever we call it)
>> >> > a generic and extensible description.  Remember we also want to support
>> >> > pushing objects into external systems (say, glacier, or some other
>> >> > external object store like a backup system).
>> >> >
>> >> > 3) How do we chunk.  You have some classes that handle aligned chunking.
>> >> > We'll probably eventually want content-based chunking (based on Rabin
>> >> > fingerprinting or whatever the new hotness is).  Real users will probably
>> >> > want adjustable policies based on what they know of the content they're
>> >> > storing, and the system will probably want to support multiple CAS pools
>> >> > based on which policy is being used (as that determines chunk sizes
>> >> > etc and whether we'll actually have any dedup happening).
>> >> >
>> >> > 4) How to drive the dedup process itself.  An async agent that's part of
>> >> > the exiting tier_agent?  An external process?  Something inline in the
>> >> > write path?  This is the hardest question to answer, and the one that is
>> >> > most likely to collide with other planned OSD work.  It can also come
>> >> > last, IMO!  We can start with a simple offline agent and perhaps
>> >> > eventually do something more clever or efficient.
>> >> >
>> >> > In any case, I think #1 and #2 are the key discussions we should have now.
>> >> > I suggest starting a pad and email thread for each (pad.ceph.com) so we
>> >> > can brainstorm design options, weight trade-offs, and come to some
>> >> > consensus.  (I had some thoughts, for example, on a hybrid scheme
>> >> > somewhere between explicit backpointers and a simple refcount that could
>> >> > consume fixed overhead but still provide information that would enable a
>> >> > moderately efficient scrub/audit.)
>> >> >
>> >> > Thanks!
>> >> > sage
>> >> >
>> >> >
>> >> >
>> >> >
>> >> >> Write sequence (writeback mode) is
>> >> >>
>> >> >> a.  Read object data and do fingerprinting (if data is not calculated).
>> >> >> b. Send reference count decrement message (for previous chunk) to osd
>> >> >> (in CAS pool) and updates local object metadata
>> >> >> c. Send copy_from message to osd (in CAS pool) and send copy_from
>> >> >> message (in order to copy the dedup metadata) to a osd (in original
>> >> >> pool)
>> >> >>
>> >> >> Writeback mode also increase the number of operation. Can we reduce?
>> >> >>
>> >> >>
>> >> >>
>> >> >> 4. Performance.
>> >> >>
>> >> >> Performance is improved compared to previous results. But It still
>> >> >> seems to be improving. (512KB block, Seq. workload, fio, KRBD, single
>> >> >> thread, target_max_objects = 4)
>> >> >>
>> >> >> Major concerns are first is fingerprint overhead and second is
>> >> >> writeback performance in cache tier. When the chunk size is large
>> >> >> (>512KB), SHA1 takes more than 3ms. (This can be reduced if we use
>> >> >> small chunk.)
>> >> >>
>> >> >> Regarding writeback performance, Flush need two more operations than
>> >> >> proxy mode. First is "marking clean state". Second is "reading dedup
>> >> >> metadata and data from storage". Therefore, actual read and write
>> >> >> occur. These cause that flush completion is delayed.
>> >> >>
>> >> >> Small chunk performance in the writeback mode is significantly
>> >> >> degraded because single flush thread handles multiple copy_from
>> >> >> message. It seems that we should improve basic flushing performance.
>> >> >>
>> >> >>
>> >> >> Write performance (MB/s)
>> >> >>
>> >> >> Dedup ratio     0         60       100
>> >> >>
>> >> >> Proxy             55       64       73
>> >> >>
>> >> >> Writeback       48       50       50
>> >> >>
>> >> >> Original           120      120      122
>> >> >>
>> >> >>
>> >> >>
>> >> >> Read performance (MB/s)
>> >> >>
>> >> >> Dedup ratio     0         60       100
>> >> >>
>> >> >> Proxy             117      130      141
>> >> >>
>> >> >> Writeback       198      197      200
>> >> >>
>> >> >> Original           280      276      285
>> >> >>
>> >> >>
>> >> >>
>> >> >>
>> >> >> 5. Command to enable dedup
>> >> >>
>> >> >> Ceph osd pool create sds-hot 1024
>> >> >> Ceph osd pool create sds-cas 1024
>> >> >> Ceph osd tier add_cas rbd sds-hot sds-cas
>> >> >> Ceph osd tier sds-hot (proxy or writeback)
>> >> >> Ceph osd tier dedup_block rbd sds-hot sds-cas (chunk size. e.g. 65536, 131072..)
>> >> >> Ceph osd tier set-overlay rbd sds-hot
>> >> >>
>> >> >>
>> >> >>
>> >> >> Thanks
>> >> >> Myoungwon Oh
>> >> >> (omwmw@sk.com)
>> >> >>
>> >> >> 2017-02-07 23:50 GMT+09:00 Sage Weil <sweil@redhat.com>:
>> >> >> > On Tue, 7 Feb 2017, myoungwon oh wrote:
>> >> >> >> Hi sage.
>> >> >> >>
>> >> >> >> I uploaded the document which describe my overall appoach.
>> >> >> >> please see it and give me feedback.
>> >> >> >> slide: https://www.slideshare.net/secret/JZcy3yYEDIHPyg
>> >> >> >
>> >> >> > This approach looks pretty close to what we have been planning.  A few
>> >> >> > comments:
>> >> >> >
>> >> >> > 1) I think it may be better to view the tier/pool that has the object
>> >> >> > metadata as the "base" pool, and the CAS pool with the refcounted
>> >> >> > object chunks as as tier below that.
>> >> >> >
>> >> >> > 2) I think we can use an object class or a handful of new native rados
>> >> >> > operations to make the CAS pool read/write operations more efficient.  In
>> >> >> > your slides you describe a process something like
>> >> >> >
>> >> >> >   rados(getattr)
>> >> >> >   if exists
>> >> >> >      rados(increment ref count)
>> >> >> >   else
>> >> >> >      rados(write object and set ref count to 1)
>> >> >> >
>> >> >> > This could be collapsed into a single optimistic operation that sends the
>> >> >> > data and a command that says "create or increment ref count" so that the
>> >> >> > conditional behavior is handled at the OSD.  This will be more efficient
>> >> >> > for small chunks.  (For large chunks, or in cases where we have some
>> >> >> > confidence that the chunk probably already exists, the pessimistic
>> >> >> > approach might still make sense.)  Either way, we should probably support
>> >> >> > both.
>> >> >> >
>> >> >> > 3) We'd like to generalize the first pool behavior so that it is just a
>> >> >> > special case of the new tiering functionality.  The idea is that an
>> >> >> > object_info_t can have a 'manifest' that described where and how the
>> >> >> > object is really stored instead of the object data itself (much like it
>> >> >> > can already be a whiteout, etc.).  In the simplest case, the manifest
>> >> >> > would just say "this object is stored in pool X" (simple tiering).  In
>> >> >> > this case, the manifest would a structure like
>> >> >> >
>> >> >> >   map<offset, tuple<length, cas object, pool>>
>> >> >> >
>> >> >> > I think it'll be worth the effort to build a general struture here that we
>> >> >> > can use for basic tiering (not just dedup).
>> >> >> >
>> >> >> > sage
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> >>
>> >> >> >> thanks
>> >> >> >>
>> >> >> >>
>> >> >> >> 2017-01-31 23:24 GMT+09:00 Sage Weil <sage@newdream.net>:
>> >> >> >> > On Thu, 26 Jan 2017, myoungwon oh wrote:
>> >> >> >> >> I have two questions.
>> >> >> >> >>
>> >> >> >> >> 1. I would like to ask about CAS location. current our implementation store
>> >> >> >> >> content address object in storage tier.However, If we store the CAO in the
>> >> >> >> >> cache tier, we can get a performance advantage. Do you think we can create
>> >> >> >> >> CAO in cachetier? or create a separate storage pool for CAS?
>> >> >> >> >
>> >> >> >> > It depends on the design.  If the you are naming the objects at the
>> >> >> >> > librados client side, then you can use the rados cluster itself
>> >> >> >> > unmodified (with or without a cache tier).  This is roughly how I have
>> >> >> >> > anticipated implementing the CAS storage portion.  If you are doing the
>> >> >> >> > chunking hashing and within the OSD itself, then you can't do the CAS
>> >> >> >> > at the first tier because the requests won't be directed at the right OSD.
>> >> >> >> >
>> >> >> >> >> 2. The results below are performance result for our current implementation.
>> >> >> >> >> experiment setup:
>> >> >> >> >> PROXY (inline dedup), WRITEBACK (lazy dedup, target_max_bytes: 50MB),
>> >> >> >> >> ORIGINAL(without dedup feature and cache tier),
>> >> >> >> >> fio, 512K block, seq. I/O, single thread
>> >> >> >> >>
>> >> >> >> >> One thing to note is that the writeback case is slower than the proxy.
>> >> >> >> >> We think there are three problems as follows.
>> >> >> >> >>
>> >> >> >> >> A. The current implementation creates a fingerprint by reading the entire
>> >> >> >> >> object when flushing. Therefore, there is a problem that read and write are
>> >> >> >> >> mixed.
>> >> >> >> >
>> >> >> >> > I expect this is a small factor compared to the fact that in writeback
>> >> >> >> > mode you have to *write* to the cache tier, which is 3x replicated,
>> >> >> >> > whereas in proxy mode those writes don't happen at all.
>> >> >> >> >
>> >> >> >> >> B. When client request read, the promote_object function reads the object
>> >> >> >> >> and writes it back to the cache tier, which also causes a mix of read and
>> >> >> >> >> write.
>> >> >> >> >
>> >> >> >> > This can be mitigated by setting the min_read_recency_for_promote pool
>> >> >> >> > property to something >1.  Then reads will be proxied unless the object
>> >> >> >> > appears to be hot (because it has been touched over multiple
>> >> >> >> > hitset intervals).
>> >> >> >> >
>> >> >> >> >> C. When flushing, the unchanged part is rewritten because flush operation
>> >> >> >> >> perform per-object based.
>> >> >> >> >
>> >> >> >> > Yes.
>> >> >> >> >
>> >> >> >> > Is there a description of your overall approach somewhere?
>> >> >> >> >
>> >> >> >> > sage
>> >> >> >> >
>> >> >> >> >
>> >> >> >> >>
>> >> >> >> >> Do I have something wrong? or Could you give me a suggestion to improve
>> >> >> >> >> performance?
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >> a. Write performance (KB/s)
>> >> >> >> >>
>> >> >> >> >> dedup_ratio  0 20 40 60 80 100
>> >> >> >> >>
>> >> >> >> >> PROXY  45586 47804 51120 52844 56167 55302
>> >> >> >> >>
>> >> >> >> >> WRITEBACK  13151 11078 9531 13010 9518 8319
>> >> >> >> >>
>> >> >> >> >> ORIGINAL  121209 124786 122140 121195 122540 132363
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >> b. Read performance (KB/s)
>> >> >> >> >>
>> >> >> >> >> dedup_ratio  0 20 40 60 80 100
>> >> >> >> >>
>> >> >> >> >> PROXY  112231 118994 118070 120071 117884 132748
>> >> >> >> >>
>> >> >> >> >> WRITEBACK  34040 29109 19104 26677 24756 21695
>> >> >> >> >>
>> >> >> >> >> ORIGINAL  285482 284398 278063 277989 271793 285094
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >> thanks,
>> >> >> >> >> Myoungwon Oh
>> >> >> >> >> --
>> >> >> >> >> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> >> >> >> >> the body of a message to majordomo@vger.kernel.org
>> >> >> >> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >>
>> >> >> >>
>> >> >> --
>> >> >> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> >> >> the body of a message to majordomo@vger.kernel.org
>> >> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> >> >>
>> >> >>
>> >> --
>> >> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> >> the body of a message to majordomo@vger.kernel.org
>> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> >>
>> >>
>>
>>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Question about writeback performance and content address obejct for deduplication
  2017-03-27 15:27                   ` myoungwon oh
@ 2017-03-28 15:32                     ` myoungwon oh
  2017-04-12 15:51                       ` Sage Weil
  0 siblings, 1 reply; 16+ messages in thread
From: myoungwon oh @ 2017-03-28 15:32 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel, 오명원

Hi Sage,

I made two pads in order to discuss #3, #4.

#3: http://pad.ceph.com/p/deduplication_how_do_we_chunk
#4: http://pad.ceph.com/p/deduplication_how_to_drive_dedup_process


thanks.

2017-03-28 0:27 GMT+09:00 myoungwon oh <ohmyoungwon@gmail.com>:
> Looks good to me (I made a comment for 2pc).
> I will make pads for #3, #4.
>
> Thanks!
>
> 2017-03-27 23:00 GMT+09:00 Sage Weil <sweil@redhat.com>:
>> On Mon, 27 Mar 2017, myoungwon oh wrote:
>>> I added comments in the pad.
>>
>> Looks good!  I made a few more comments.  If it looks good to you I'd
>> update the pad to have just the proposed approach at the top (maybe keep
>> the discussion of options at the bottom).
>>
>>> I will make pads in order to discuss  #3 and #4 if you agree with #1, #2.
>>
>> Sure!
>>
>> sage
>>
>>
>>
>>>
>>> thanks.
>>>
>>> 2017-03-25 4:32 GMT+09:00 Sage Weil <sweil@redhat.com>:
>>> > On Mon, 20 Mar 2017, myoungwon oh wrote:
>>> >> Hi sage.
>>> >>
>>> >> Thanks for your comments!
>>> >> I created pads in order to brainstorm design option about #1, #2 first.
>>> >>
>>> >> #1 http://pad.ceph.com/p/deduplication_how_dedup_manifists
>>> >> #2 http://pad.ceph.com/p/deduplication_how_do_we_store_chunk
>>> >
>>> > I made some comments in the pad!
>>> >
>>> > sage
>>> >
>>> >>
>>> >>
>>> >> Thanks.
>>> >>
>>> >> 2017-03-16 22:42 GMT+09:00 Sage Weil <sweil@redhat.com>:
>>> >> > Hi Myoungwon,
>>> >> >
>>> >> > This is quite a patch!  Sorry for the slow reply.
>>> >> >
>>> >> > On Tue, 14 Mar 2017, myoungwon oh wrote:
>>> >> >> Hi Sage
>>> >> >>
>>> >> >>
>>> >> >> I addressed all of your concerns (I applied CAS pool and dedup
>>> >> >> metadata in object_info_t) and created public repository in order to
>>> >> >> show the prototype implementation
>>> >> >> (https://github.com/myoungwon/ceph/commit/13597f62405d1c5a4977d630e69331407ef3a07a,
>>> >> >> support non-aligned I/O, but for (K)RBD). This code is based on Jewel
>>> >> >> and is not cleaned well but you can see the basic flow (start_flush(),
>>> >> >> maybe_handle_cache_detail() ). It would be nice if you give me some
>>> >> >> comments.
>>> >> >>
>>> >> >> I have some queries mentioned below on which your feedback is highly required.
>>> >> >>
>>> >> >> 1. dedup metadata in object_info_t
>>> >> >>
>>> >> >> You mentioned that it would be nice to make tuple in object_info_t
>>> >> >> such as map<offset, tuple<length, cas object, pool>> But, I made
>>> >> >> dedup_chunk_info_t in object_info_t because I need one more parameter
>>> >> >> (chunk_state) and for extensibility.
>>> >> >
>>> >> > Yes, we definitely want an extensible approach to the state in
>>> >> > object_info_t that will support
>>> >> >
>>> >> >  - a simple redirect ("the object is in that other pool")
>>> >> >  - a dedup object ("the object consists of these N lumps, each one
>>> >> > referencing an object named X_i in pool Y_i")
>>> >> >  - an external system (extenral archive, like a backup system, external
>>> >> > object store, whatever)
>>> >> >
>>> >> > I think we should try to come up with a general notion, like "redirect" or
>>> >> > "object map" or something that covers other options... not just dedup!
>>> >> >
>>> >> >> This is because to avoid read and
>>> >> >> fingerprinting during flush time. chunk_state represents three states
>>> >> >> in writeback mode. First is CLEAN (data and fingerprint are not
>>> >> >> modified). Second is MODIFIED (data is modified but fingerprint is not
>>> >> >> calculated). Third is CALCULATED (data is modified and fingerprint is
>>> >> >> also calculated). When data is stored in cache tier, chunk_state will
>>> >> >> be defined. Therefore, reading data and fingerprinting can be removed
>>> >> >> during flush.
>>> >> >
>>> >> > I'm not following this, though.  I think "clean" would just mean we are
>>> >> > storing the normal object in the pool.  "modified" would mean that the
>>> >> > FLAG_DIRTY is set.  And "calculated" would mean we have successfully
>>> >> > chunked the object, stored or taken refs on the chunks, and written the
>>> >> > chunk map into object_info_t?
>>> >> >
>>> >> >
>>> >> >> 2. Single Rados Operation
>>> >> >>
>>> >> >> You mentioned a Rados operation which can concurrently read the
>>> >> >> reference count and write data. Do you want that API in objecter
>>> >> >> class? (for example, objector->read_ref_and_write())
>>> >> >
>>> >> > We may not need to make it a first-class rados operation.  For example,
>>> >> > cls_refcount could probably be extended with a write_or_get operation.
>>> >> > But it might also be advantageous to make it a native op.  The main thing
>>> >> > I'm worried about here is that we probably want to make the refs
>>> >> > reliable and autitable, which means backpointers (so you can look at a
>>> >> > chunk and see which dedup objects are using it).  That means that a
>>> >> > popular sequence of bytes might have a huge number of references, and that
>>> >> > will need to scale gracefully.  Or, we just use counters, accept that
>>> >> > failure conditions could make us leak dedup chunks, and make all of our
>>> >> > failure paths fail-safe.
>>> >> >
>>> >> >> 3. Write sequence for performance.
>>> >> >>
>>> >> >> Current write sequence (proxy mode) is
>>> >> >>
>>> >> >> a. Read metadata (promote_object)
>>> >> >> b. Send data to OSD (in CAS pool) and send dedup metadata to OSD (in
>>> >> >> original pool)
>>> >> >> c. If data and metadata are stored then, proxy osd will issue message
>>> >> >> to decrease the reference count (for previous chunk) to OSD (in CAS
>>> >> >> pool) and update local object metadata (via simple_opc_submit)
>>> >> >> d. If reference count is successful, send Ack to client
>>> >> >>
>>> >> >> As you can see, the number of operations increased due to reference
>>> >> >> count and metadata updates. This can degrade performance. My question
>>> >> >> is that can we send ack to client at (c) above? (But I am worried
>>> >> >> about inconsistent reference count state.)
>>> >> >
>>> >> > I'm worried that if we focus on inline dedup immediately we'll end up with
>>> >> > something that is less general and more fragile.  It's also harder.
>>> >> > Instead, we can consider the inline and async dedup separately.  Async:
>>> >> >
>>> >> > writeback:
>>> >> > a. normal write into object.  ack client.
>>> >> > ...
>>> >> > b. dedup agent: read object (from cache), chunk
>>> >> > c. dedup agent: write/refcount chunks
>>> >> > d. replace object with dedup manifest
>>> >> >
>>> >> > This could happen with or without a delay.  I don't think it makes sense
>>> >> > to consider "promote" here at all; it sounds like you're assuming the
>>> >> > initial dedup tier is a cache tier, and we should try not to assume that
>>> >> > (even though it might be possible).  Instead, I think a "basic" setup
>>> >> > would probably be
>>> >> >
>>> >> > 1. base pool (all ssd; contains all metadata for all objects, and absorbs
>>> >> >    writes).
>>> >> > 2. dedup pool(s) contain refcounted chunks
>>> >> >
>>> >> > If we want to do inline dedup, it would be some complex code that combines
>>> >> > all of the steps above into one, at the expense of client latency.
>>> >> >
>>> >> >
>>> >> > In any case, it's awesome that you have a working prototype.  However,
>>> >> > it's not going to be practical to take a huge patch(set) like this and
>>> >> > merge it all at once.  It's too much code to review, too complex, and too
>>> >> > hard to test.  Also, it's changing 5000 in ReplicatedPG.cc (since renamed
>>> >> > PrimaryLogPG.cc), which is slated for a big refactor right after luminous.
>>> >> >
>>> >> > The way to approach this to get it upstream is to break this down into
>>> >> > different logical components and design/review/test/merge each of them
>>> >> > indepdendently.  Having a prototype is useful in that it will be easier to
>>> >> > answer a lot of the questions we'll have deciding how each part should
>>> >> > work and what it needs to be able to handle, but don't expect that most
>>> >> > of that code will end up in the final version!
>>> >> >
>>> >> > I'm guessing we can break this down into a few logical components:
>>> >> >
>>> >> > 1) How do we store chunks.  We know we want refcounted objects for each
>>> >> > chunk.  We don't know how we'll manage the refcounts, whether we want/need
>>> >> > backpointers, whether we are willing to tolerate "leaking" references in
>>> >> > failure cases (so that we fail to clean up all chunks if we e.g. delete
>>> >> > all data), whether we want to implement it as a rados class or a native
>>> >> > rados op, whether we want to support EC, compression, etc.  This whole
>>> >> > discussion one is a great place to start because it is self-contained and
>>> >> > doesn't break anything else.
>>> >> >
>>> >> > 2) How do we do the dedup manifists (and redirects) in object_info_t.  We
>>> >> > want the solution to include or be compatible with simpler tiering, like
>>> >> > having the object_info_t simply be a pointer to a different (colder) pool.
>>> >> > In fact, I think this is the thing to do first becuase it will make us
>>> >> > fix/solve all the basic problems with flush and promote.  And extending
>>> >> > this to include dedup (object is composed of many little bits in other
>>> >> > pools) is then a matter of making that 'manifest' (or whatever we call it)
>>> >> > a generic and extensible description.  Remember we also want to support
>>> >> > pushing objects into external systems (say, glacier, or some other
>>> >> > external object store like a backup system).
>>> >> >
>>> >> > 3) How do we chunk.  You have some classes that handle aligned chunking.
>>> >> > We'll probably eventually want content-based chunking (based on Rabin
>>> >> > fingerprinting or whatever the new hotness is).  Real users will probably
>>> >> > want adjustable policies based on what they know of the content they're
>>> >> > storing, and the system will probably want to support multiple CAS pools
>>> >> > based on which policy is being used (as that determines chunk sizes
>>> >> > etc and whether we'll actually have any dedup happening).
>>> >> >
>>> >> > 4) How to drive the dedup process itself.  An async agent that's part of
>>> >> > the exiting tier_agent?  An external process?  Something inline in the
>>> >> > write path?  This is the hardest question to answer, and the one that is
>>> >> > most likely to collide with other planned OSD work.  It can also come
>>> >> > last, IMO!  We can start with a simple offline agent and perhaps
>>> >> > eventually do something more clever or efficient.
>>> >> >
>>> >> > In any case, I think #1 and #2 are the key discussions we should have now.
>>> >> > I suggest starting a pad and email thread for each (pad.ceph.com) so we
>>> >> > can brainstorm design options, weight trade-offs, and come to some
>>> >> > consensus.  (I had some thoughts, for example, on a hybrid scheme
>>> >> > somewhere between explicit backpointers and a simple refcount that could
>>> >> > consume fixed overhead but still provide information that would enable a
>>> >> > moderately efficient scrub/audit.)
>>> >> >
>>> >> > Thanks!
>>> >> > sage
>>> >> >
>>> >> >
>>> >> >
>>> >> >
>>> >> >> Write sequence (writeback mode) is
>>> >> >>
>>> >> >> a.  Read object data and do fingerprinting (if data is not calculated).
>>> >> >> b. Send reference count decrement message (for previous chunk) to osd
>>> >> >> (in CAS pool) and updates local object metadata
>>> >> >> c. Send copy_from message to osd (in CAS pool) and send copy_from
>>> >> >> message (in order to copy the dedup metadata) to a osd (in original
>>> >> >> pool)
>>> >> >>
>>> >> >> Writeback mode also increase the number of operation. Can we reduce?
>>> >> >>
>>> >> >>
>>> >> >>
>>> >> >> 4. Performance.
>>> >> >>
>>> >> >> Performance is improved compared to previous results. But It still
>>> >> >> seems to be improving. (512KB block, Seq. workload, fio, KRBD, single
>>> >> >> thread, target_max_objects = 4)
>>> >> >>
>>> >> >> Major concerns are first is fingerprint overhead and second is
>>> >> >> writeback performance in cache tier. When the chunk size is large
>>> >> >> (>512KB), SHA1 takes more than 3ms. (This can be reduced if we use
>>> >> >> small chunk.)
>>> >> >>
>>> >> >> Regarding writeback performance, Flush need two more operations than
>>> >> >> proxy mode. First is "marking clean state". Second is "reading dedup
>>> >> >> metadata and data from storage". Therefore, actual read and write
>>> >> >> occur. These cause that flush completion is delayed.
>>> >> >>
>>> >> >> Small chunk performance in the writeback mode is significantly
>>> >> >> degraded because single flush thread handles multiple copy_from
>>> >> >> message. It seems that we should improve basic flushing performance.
>>> >> >>
>>> >> >>
>>> >> >> Write performance (MB/s)
>>> >> >>
>>> >> >> Dedup ratio     0         60       100
>>> >> >>
>>> >> >> Proxy             55       64       73
>>> >> >>
>>> >> >> Writeback       48       50       50
>>> >> >>
>>> >> >> Original           120      120      122
>>> >> >>
>>> >> >>
>>> >> >>
>>> >> >> Read performance (MB/s)
>>> >> >>
>>> >> >> Dedup ratio     0         60       100
>>> >> >>
>>> >> >> Proxy             117      130      141
>>> >> >>
>>> >> >> Writeback       198      197      200
>>> >> >>
>>> >> >> Original           280      276      285
>>> >> >>
>>> >> >>
>>> >> >>
>>> >> >>
>>> >> >> 5. Command to enable dedup
>>> >> >>
>>> >> >> Ceph osd pool create sds-hot 1024
>>> >> >> Ceph osd pool create sds-cas 1024
>>> >> >> Ceph osd tier add_cas rbd sds-hot sds-cas
>>> >> >> Ceph osd tier sds-hot (proxy or writeback)
>>> >> >> Ceph osd tier dedup_block rbd sds-hot sds-cas (chunk size. e.g. 65536, 131072..)
>>> >> >> Ceph osd tier set-overlay rbd sds-hot
>>> >> >>
>>> >> >>
>>> >> >>
>>> >> >> Thanks
>>> >> >> Myoungwon Oh
>>> >> >> (omwmw@sk.com)
>>> >> >>
>>> >> >> 2017-02-07 23:50 GMT+09:00 Sage Weil <sweil@redhat.com>:
>>> >> >> > On Tue, 7 Feb 2017, myoungwon oh wrote:
>>> >> >> >> Hi sage.
>>> >> >> >>
>>> >> >> >> I uploaded the document which describe my overall appoach.
>>> >> >> >> please see it and give me feedback.
>>> >> >> >> slide: https://www.slideshare.net/secret/JZcy3yYEDIHPyg
>>> >> >> >
>>> >> >> > This approach looks pretty close to what we have been planning.  A few
>>> >> >> > comments:
>>> >> >> >
>>> >> >> > 1) I think it may be better to view the tier/pool that has the object
>>> >> >> > metadata as the "base" pool, and the CAS pool with the refcounted
>>> >> >> > object chunks as as tier below that.
>>> >> >> >
>>> >> >> > 2) I think we can use an object class or a handful of new native rados
>>> >> >> > operations to make the CAS pool read/write operations more efficient.  In
>>> >> >> > your slides you describe a process something like
>>> >> >> >
>>> >> >> >   rados(getattr)
>>> >> >> >   if exists
>>> >> >> >      rados(increment ref count)
>>> >> >> >   else
>>> >> >> >      rados(write object and set ref count to 1)
>>> >> >> >
>>> >> >> > This could be collapsed into a single optimistic operation that sends the
>>> >> >> > data and a command that says "create or increment ref count" so that the
>>> >> >> > conditional behavior is handled at the OSD.  This will be more efficient
>>> >> >> > for small chunks.  (For large chunks, or in cases where we have some
>>> >> >> > confidence that the chunk probably already exists, the pessimistic
>>> >> >> > approach might still make sense.)  Either way, we should probably support
>>> >> >> > both.
>>> >> >> >
>>> >> >> > 3) We'd like to generalize the first pool behavior so that it is just a
>>> >> >> > special case of the new tiering functionality.  The idea is that an
>>> >> >> > object_info_t can have a 'manifest' that described where and how the
>>> >> >> > object is really stored instead of the object data itself (much like it
>>> >> >> > can already be a whiteout, etc.).  In the simplest case, the manifest
>>> >> >> > would just say "this object is stored in pool X" (simple tiering).  In
>>> >> >> > this case, the manifest would a structure like
>>> >> >> >
>>> >> >> >   map<offset, tuple<length, cas object, pool>>
>>> >> >> >
>>> >> >> > I think it'll be worth the effort to build a general struture here that we
>>> >> >> > can use for basic tiering (not just dedup).
>>> >> >> >
>>> >> >> > sage
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> >>
>>> >> >> >> thanks
>>> >> >> >>
>>> >> >> >>
>>> >> >> >> 2017-01-31 23:24 GMT+09:00 Sage Weil <sage@newdream.net>:
>>> >> >> >> > On Thu, 26 Jan 2017, myoungwon oh wrote:
>>> >> >> >> >> I have two questions.
>>> >> >> >> >>
>>> >> >> >> >> 1. I would like to ask about CAS location. current our implementation store
>>> >> >> >> >> content address object in storage tier.However, If we store the CAO in the
>>> >> >> >> >> cache tier, we can get a performance advantage. Do you think we can create
>>> >> >> >> >> CAO in cachetier? or create a separate storage pool for CAS?
>>> >> >> >> >
>>> >> >> >> > It depends on the design.  If the you are naming the objects at the
>>> >> >> >> > librados client side, then you can use the rados cluster itself
>>> >> >> >> > unmodified (with or without a cache tier).  This is roughly how I have
>>> >> >> >> > anticipated implementing the CAS storage portion.  If you are doing the
>>> >> >> >> > chunking hashing and within the OSD itself, then you can't do the CAS
>>> >> >> >> > at the first tier because the requests won't be directed at the right OSD.
>>> >> >> >> >
>>> >> >> >> >> 2. The results below are performance result for our current implementation.
>>> >> >> >> >> experiment setup:
>>> >> >> >> >> PROXY (inline dedup), WRITEBACK (lazy dedup, target_max_bytes: 50MB),
>>> >> >> >> >> ORIGINAL(without dedup feature and cache tier),
>>> >> >> >> >> fio, 512K block, seq. I/O, single thread
>>> >> >> >> >>
>>> >> >> >> >> One thing to note is that the writeback case is slower than the proxy.
>>> >> >> >> >> We think there are three problems as follows.
>>> >> >> >> >>
>>> >> >> >> >> A. The current implementation creates a fingerprint by reading the entire
>>> >> >> >> >> object when flushing. Therefore, there is a problem that read and write are
>>> >> >> >> >> mixed.
>>> >> >> >> >
>>> >> >> >> > I expect this is a small factor compared to the fact that in writeback
>>> >> >> >> > mode you have to *write* to the cache tier, which is 3x replicated,
>>> >> >> >> > whereas in proxy mode those writes don't happen at all.
>>> >> >> >> >
>>> >> >> >> >> B. When client request read, the promote_object function reads the object
>>> >> >> >> >> and writes it back to the cache tier, which also causes a mix of read and
>>> >> >> >> >> write.
>>> >> >> >> >
>>> >> >> >> > This can be mitigated by setting the min_read_recency_for_promote pool
>>> >> >> >> > property to something >1.  Then reads will be proxied unless the object
>>> >> >> >> > appears to be hot (because it has been touched over multiple
>>> >> >> >> > hitset intervals).
>>> >> >> >> >
>>> >> >> >> >> C. When flushing, the unchanged part is rewritten because flush operation
>>> >> >> >> >> perform per-object based.
>>> >> >> >> >
>>> >> >> >> > Yes.
>>> >> >> >> >
>>> >> >> >> > Is there a description of your overall approach somewhere?
>>> >> >> >> >
>>> >> >> >> > sage
>>> >> >> >> >
>>> >> >> >> >
>>> >> >> >> >>
>>> >> >> >> >> Do I have something wrong? or Could you give me a suggestion to improve
>>> >> >> >> >> performance?
>>> >> >> >> >>
>>> >> >> >> >>
>>> >> >> >> >> a. Write performance (KB/s)
>>> >> >> >> >>
>>> >> >> >> >> dedup_ratio  0 20 40 60 80 100
>>> >> >> >> >>
>>> >> >> >> >> PROXY  45586 47804 51120 52844 56167 55302
>>> >> >> >> >>
>>> >> >> >> >> WRITEBACK  13151 11078 9531 13010 9518 8319
>>> >> >> >> >>
>>> >> >> >> >> ORIGINAL  121209 124786 122140 121195 122540 132363
>>> >> >> >> >>
>>> >> >> >> >>
>>> >> >> >> >> b. Read performance (KB/s)
>>> >> >> >> >>
>>> >> >> >> >> dedup_ratio  0 20 40 60 80 100
>>> >> >> >> >>
>>> >> >> >> >> PROXY  112231 118994 118070 120071 117884 132748
>>> >> >> >> >>
>>> >> >> >> >> WRITEBACK  34040 29109 19104 26677 24756 21695
>>> >> >> >> >>
>>> >> >> >> >> ORIGINAL  285482 284398 278063 277989 271793 285094
>>> >> >> >> >>
>>> >> >> >> >>
>>> >> >> >> >> thanks,
>>> >> >> >> >> Myoungwon Oh
>>> >> >> >> >> --
>>> >> >> >> >> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> >> >> >> >> the body of a message to majordomo@vger.kernel.org
>>> >> >> >> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>> >> >> >> >>
>>> >> >> >> >>
>>> >> >> >>
>>> >> >> >>
>>> >> >> --
>>> >> >> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> >> >> the body of a message to majordomo@vger.kernel.org
>>> >> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>> >> >>
>>> >> >>
>>> >> --
>>> >> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> >> the body of a message to majordomo@vger.kernel.org
>>> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>> >>
>>> >>
>>>
>>>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Question about writeback performance and content address obejct for deduplication
  2017-03-28 15:32                     ` myoungwon oh
@ 2017-04-12 15:51                       ` Sage Weil
  2017-04-18 10:04                         ` myoungwon oh
  0 siblings, 1 reply; 16+ messages in thread
From: Sage Weil @ 2017-04-12 15:51 UTC (permalink / raw)
  To: myoungwon oh; +Cc: ceph-devel, 오명원

On Wed, 29 Mar 2017, myoungwon oh wrote:
> Hi sage,
> 
> I made two pads in order to discuss #3, #4.
> 
> #3: http://pad.ceph.com/p/deduplication_how_do_we_chunk
> #4: http://pad.ceph.com/p/deduplication_how_to_drive_dedup_process

Updated!
sage

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Question about writeback performance and content address obejct for deduplication
  2017-04-12 15:51                       ` Sage Weil
@ 2017-04-18 10:04                         ` myoungwon oh
  2017-04-18 13:23                           ` Sage Weil
  0 siblings, 1 reply; 16+ messages in thread
From: myoungwon oh @ 2017-04-18 10:04 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel, 오명원

Hi Sage,

I am refactoring the source code in order to support the extensible
tier (redirect, dedup, and external systems).

I have a few questions.

1. My understanding is that the base pool contains all metadata for all
objects (it is never evicted), and the dedup pool or external systems
contain the object data.
Therefore, the big difference between the cache tier and the extensible
tier is whether the base pool keeps the metadata (such as object_info_t)
or not. Am I wrong? (See the sketch after question 3.)


2. Do you think the extensible tier should belong to cache_mode_t (in
maybe_handle_cache_detail()), or is it better for it to work
independently of the cache tier?


3. Regarding object promotion and flushing, I think start_copy() (in
promote_object()) and start_flush() can be reused in the simple
redirection case ("the object is in that other pool"), if we modify
things so that the object (metadata only) is not evicted (removed). Is
this the right way?
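
For question 1 (and the redirection case in question 3), the sketch
below is roughly what I picture staying in the base pool. The names are
placeholders only, not the actual object_info_t code:

  #include <cstdint>
  #include <map>
  #include <string>

  // One chunk of a deduplicated object, stored by fingerprint in a
  // CAS/dedup pool.
  struct chunk_ref_sketch_t {
    uint64_t length;       // length of this chunk
    std::string cas_oid;   // fingerprint-named object in the CAS pool
    int64_t pool;          // pool that actually stores the chunk
  };

  // Manifest kept with the metadata in the base pool; the data itself
  // may live somewhere else entirely.
  struct manifest_sketch_t {
    enum type_t : uint8_t { NONE, REDIRECT, CHUNKED };
    type_t type = NONE;
    int64_t redirect_pool = -1;       // "this object is stored in pool X"
    std::map<uint64_t, chunk_ref_sketch_t> chunk_map;  // offset -> chunk
  };

In the simple redirection case only type and redirect_pool would be
used; chunk_map only matters once dedup comes in.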


Thanks.
Myoungwon.

2017-04-13 0:51 GMT+09:00 Sage Weil <sweil@redhat.com>:
> On Wed, 29 Mar 2017, myoungwon oh wrote:
>> Hi sage,
>>
>> I made two pads in order to discuss #3, #4.
>>
>> #3: http://pad.ceph.com/p/deduplication_how_do_we_chunk
>> #4: http://pad.ceph.com/p/deduplication_how_to_drive_dedup_process
>
> Updated!
> sage

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Question about writeback performance and content address obejct for deduplication
  2017-04-18 10:04                         ` myoungwon oh
@ 2017-04-18 13:23                           ` Sage Weil
  2017-04-21 10:23                             ` myoungwon oh
  0 siblings, 1 reply; 16+ messages in thread
From: Sage Weil @ 2017-04-18 13:23 UTC (permalink / raw)
  To: myoungwon oh; +Cc: ceph-devel, 오명원

On Tue, 18 Apr 2017, myoungwon oh wrote:
> Hi sage
> 
> I am refactoring the source code in order to support the extensible
> tier (support redirect, dedup, external system)
> 
> I have a few question.
> 
> 1. My understanding is that base pool contains all metadata for all
> objects (do not evicted), and dedup pool or external systems contain
> the object data.
> Therefore, The big difference between the cache tier and the
> extensible tier is whether  base pool contains metadata (such as
> object_info_t ) or not. Am i wrong?

Yeah.  The base pool effectively becomes an index for everything.

> 2. Do you think the extensible tier should belong to cache_mode_t (in
> maybe_handle_cache_detail()) ? or is it better to work independently
> from cache tier.

It should be independent of the cache tiering.  We should probably use the 
tier_of and tiers fields in pg_pool_t but not cache_mode.
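
Roughly this kind of check, keyed off the tier relationship rather than
cache_mode (this uses a simplified stand-in for pg_pool_t, not the real
struct, so take the field names with a grain of salt):

  #include <cstdint>
  #include <set>

  // Simplified stand-in for the relevant pg_pool_t fields; the real
  // struct has more state and different accessors.
  struct pool_sketch_t {
    int64_t tier_of = -1;         // pool this one is a tier of (-1 = none)
    std::set<uint64_t> tiers;     // lower tiers hanging off this pool
    int cache_mode = 0;           // 0 = "none" in this sketch
  };

  // A base pool can push objects down to its tiers even with no cache
  // mode set; the redirect/dedup logic keys off the tier relationship,
  // not off cache_mode.
  bool may_redirect_to_tier(const pool_sketch_t& p) {
    return !p.tiers.empty();
  }

  bool is_lower_tier(const pool_sketch_t& p) {
    return p.tier_of >= 0;
  }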

> 3. Regarding promotion object and flushing work, I think start_copy() (in
> promote_object()) and start_fush() can be reused in a simple
> redirection case ("the object is in that other pool"), if we modify
> the object (only metadata) is not evicted (removed). Is this right
> way?

Yes!

I suggest focusing on step 1 being the simple redirection tiering case (no 
dedup) since it'll be tricky enough getting that part right.

Thanks!
sage

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Question about writeback performance and content address obejct for deduplication
  2017-04-18 13:23                           ` Sage Weil
@ 2017-04-21 10:23                             ` myoungwon oh
  0 siblings, 0 replies; 16+ messages in thread
From: myoungwon oh @ 2017-04-21 10:23 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel, 오명원

I created a public repository in order to show you the concept of step
1 (a simple redirection):
https://github.com/myoungwon/ceph/commit/47f3567a0d5894e9272bf3bcabc95eb71f736a81

Can you give me comments on whether this is the right way or not?

And I have a question.
Does a simple redirection need struct object_manifest_t?
I think a simple redirection does not need struct object_manifest_t
because promote_object() and start_flush() can be reused.
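
The flow I have in mind is roughly the following pseudocode; the helper
names are made up, and the real code would reuse start_flush() and
promote_object()/start_copy() in PrimaryLogPG:

  #include <cstdint>

  // Minimal stand-in for the per-object state kept in the base pool.
  struct obj_sketch_t {
    bool has_local_data = true;   // data bytes still live in the base pool
    int64_t redirect_pool = -1;   // where the data went, if redirected
  };

  // Hypothetical helper: copy the object's data to or from another pool.
  void copy_data(obj_sketch_t&, int64_t /*pool*/, bool /*to_pool*/) {}

  // "Flush": push the data down and keep only metadata + redirect here.
  void flush_to_lower_tier(obj_sketch_t& o, int64_t target_pool) {
    copy_data(o, target_pool, true);   // like start_flush()/copy_from
    o.has_local_data = false;          // local data can now be dropped
    o.redirect_pool = target_pool;     // metadata (object_info_t) stays
  }

  // "Promote": pull the data back when a client touches the object.
  void promote_from_lower_tier(obj_sketch_t& o) {
    copy_data(o, o.redirect_pool, false);  // like promote_object()/start_copy()
    o.has_local_data = true;
    o.redirect_pool = -1;                  // object is fully local again
  }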

Thanks.






2017-04-18 22:23 GMT+09:00 Sage Weil <sweil@redhat.com>:
> On Tue, 18 Apr 2017, myoungwon oh wrote:
>> Hi sage
>>
>> I am refactoring the source code in order to support the extensible
>> tier (support redirect, dedup, external system)
>>
>> I have a few question.
>>
>> 1. My understanding is that base pool contains all metadata for all
>> objects (do not evicted), and dedup pool or external systems contain
>> the object data.
>> Therefore, The big difference between the cache tier and the
>> extensible tier is whether  base pool contains metadata (such as
>> object_info_t ) or not. Am i wrong?
>
> Yeah.  The base pool effectively becomes and index for everything.
>
>> 2. Do you think the extensible tier should belong to cache_mode_t (in
>> maybe_handle_cache_detail()) ? or is it better to work independently
>> from cache tier.
>
> It should be independent of the cache tiering.  We should probably use the
> tier_of and tiers fields in pg_pool_t but not cache_mode.
>
>> 3. Regarding promotion object and flushing work, I think start_copy() (in
>> promote_object()) and start_fush() can be reused in a simple
>> redirection case ("the object is in that other pool"), if we modify
>> the object (only metadata) is not evicted (removed). Is this right
>> way?
>
> Yes!
>
> I suggest focusing on step 1 being the simple redirection tiering case (no
> dedup) since it'll be tricky enough getting that part right.
>
> Thanks!
> sage

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2017-04-21 10:23 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-01-26 11:04 Question about writeback performance and content address obejct for deduplication myoungwon oh
2017-01-31 14:24 ` Sage Weil
2017-02-07 11:03   ` myoungwon oh
2017-02-07 14:50     ` Sage Weil
2017-03-14  6:25       ` myoungwon oh
2017-03-16 13:42         ` Sage Weil
2017-03-20 12:43           ` myoungwon oh
2017-03-24 19:32             ` Sage Weil
2017-03-27 13:46               ` myoungwon oh
2017-03-27 14:00                 ` Sage Weil
2017-03-27 15:27                   ` myoungwon oh
2017-03-28 15:32                     ` myoungwon oh
2017-04-12 15:51                       ` Sage Weil
2017-04-18 10:04                         ` myoungwon oh
2017-04-18 13:23                           ` Sage Weil
2017-04-21 10:23                             ` myoungwon oh
