* About the problem "export_diff relies on clone_overlap, which is lost when cache tier is enabled"
@ 2017-08-04  3:41 Xuehan Xu
  2017-08-04  7:14 ` Xuehan Xu
  0 siblings, 1 reply; 14+ messages in thread
From: Xuehan Xu @ 2017-08-04  3:41 UTC (permalink / raw)
  To: Gregory Farnum, ceph-devel

Hi, Greg:-)

I finally got what you mean in https://github.com/ceph/ceph/pull/16790.

I agree with you that "clone overlap is supposed to be tracking
which data is the same on disk".

My thought is that "ObjectContext::new_snapset.clones" is already an
indicator of whether there are clone objects on disk. So, in the
"cache tier" scenario, although a clone oid may not correspond to a
"present clone" in the cache tier, as long as
"ObjectContext::new_snapset.clones" is not empty, there must be such a
clone object in the base tier. And, as long as
"ObjectContext::new_snapset.clones" has a strict one-to-one
correspondence to "ObjectContext::new_snapset.clone_overlap", passing
the condition check "if (ctx->new_snapset.clones.size() > 0)" is
enough to conclude that the clone object exists.

So, if I'm right, passing the condition check "if
(ctx->new_snapset.clones.size() > 0)" is already enough for us to do
"newest_overlap.subtract(ctx->modified_ranges)"; it shouldn't also have
to pass "is_present_clone".

Am I right about this? Or am I missing anything?

Please help us, thank you:-)


* Re: About the problem "export_diff relies on clone_overlap, which is lost when cache tier is enabled"
  2017-08-04  3:41 About the problem "export_diff relies on clone_overlap, which is lost when cache tier is enabled" Xuehan Xu
@ 2017-08-04  7:14 ` Xuehan Xu
  2017-08-05 15:56   ` Xuehan Xu
  0 siblings, 1 reply; 14+ messages in thread
From: Xuehan Xu @ 2017-08-04  7:14 UTC (permalink / raw)
  To: Gregory Farnum, ceph-devel

What I mean is that I think it's the condition check "is_present_clone"
that prevents the clone overlap from recording the modified range of
client write operations when the target "HEAD" object exists without
its most recent clone object. If I'm right, simply moving the clone
overlap modification out of the "is_present_clone" condition check
block is enough to solve this case, just like the PR
"https://github.com/ceph/ceph/pull/16790", and this fix shouldn't cause
other problems.

In our test, this fix solved the problem successfully; however, we
can't yet confirm that it won't cause new problems.

So if anyone sees this and knows the answer, please help us. Thank you:-)

On 4 August 2017 at 11:41, Xuehan Xu <xxhdx1985126@gmail.com> wrote:
> Hi, grep:-)
>
> I finally got what you mean in https://github.com/ceph/ceph/pull/16790.
>
> I agree with you in that " clone overlap is supposed to be tracking
> which data is the same on disk".
>
> My thought is that, "ObjectContext::new_snapset.clones" is already an
> indicator about whether there are clone objects on disk, so, in the
> scenario of "cache tier", although a clone oid does not corresponds to
> a "present clone" in cache tier, as long as
> "ObjectContext::new_snapset.clones" is not empty, there must a one
> such clone object in the base tier. And, as long as
> "ObjectContext::new_snapset.clones" has a strict "one-to-one"
> correspondence to "ObjectContext::new_snapset.clone_overlap", passing
> the condition check "if (ctx->new_snapset.clones.size() > 0)" is
> enough to make the judgement that the clone object exists.
>
> So, if I'm right, passing the condition check "if
> (ctx->new_snapset.clones.size() > 0)" is already enough for us to do
> "newest_overlap.subtract(ctx->modified_ranges)", it doesn't have to
> pass "is_present_clone".
>
> Am I right about this? Or am I missing anything?
>
> Please help us, thank you:-)


* Re: About the problem "export_diff relies on clone_overlap, which is lost when cache tier is enabled"
  2017-08-04  7:14 ` Xuehan Xu
@ 2017-08-05 15:56   ` Xuehan Xu
  2017-08-07 20:48     ` Jason Dillaman
  0 siblings, 1 reply; 14+ messages in thread
From: Xuehan Xu @ 2017-08-05 15:56 UTC (permalink / raw)
  To: Gregory Farnum, ceph-devel

Hi, everyone.

Trying to solve the issue "http://tracker.ceph.com/issues/20896", I
just did another test: I did some writes to an object
"rbd_data.1ebc6238e1f29.0000000000000000" to promote its "HEAD" object
into the cache tier, after which I specifically wrote 4 bytes of random
data to its offset 0x40. Then I used "ceph-objectstore-tool" to dump
its "HEAD" version in the base tier; the result is as follows (before I
promoted it to the cache tier, there were three snaps, the latest of
which is 26):

{
    "id": {
        "oid": "rbd_data.1ebc6238e1f29.0000000000000000",
        "key": "",
        "snapid": -2,
        "hash": 1655893237,
        "max": 0,
        "pool": 3,
        "namespace": "",
        "max": 0
    },
    "info": {
        "oid": {
            "oid": "rbd_data.1ebc6238e1f29.0000000000000000",
            "key": "",
            "snapid": -2,
            "hash": 1655893237,
            "max": 0,
            "pool": 3,
            "namespace": ""
        },
        "version": "4219'16423",
        "prior_version": "3978'16310",
        "last_reqid": "osd.70.4213:2359",
        "user_version": 17205,
        "size": 4194304,
        "mtime": "2017-08-03 22:07:34.656122",
        "local_mtime": "2017-08-05 23:02:33.628734",
        "lost": 0,
        "flags": 52,
        "snaps": [],
        "truncate_seq": 0,
        "truncate_size": 0,
        "data_digest": 2822203961,
        "omap_digest": 4294967295,
        "watchers": {}
    },
    "stat": {
        "size": 4194304,
        "blksize": 4096,
        "blocks": 8200,
        "nlink": 1
    },
    "SnapSet": {
        "snap_context": {
            "seq": 26,
            "snaps": [
                26,
                25,
                16
            ]
        },
        "head_exists": 1,
        "clones": [
            {
                "snap": 16,
                "size": 4194304,
                "overlap": "[4~4194300]"
            },
            {
                "snap": 25,
                "size": 4194304,
                "overlap": "[]"
            },
            {
                "snap": 26,
                "size": 4194304,
                "overlap": "[]"
            }
        ]
    }
}

As we can see, its clone_overlap for snap 26 is empty, which, combined
with the previous test described in
http://tracker.ceph.com/issues/20896, means that the writes' "modified
range" is recorded neither in the cache tier nor in the base tier.

I think maybe we really should move the clone overlap modification out
of the IF block guarded by the condition check "is_present_clone". For
now, I can't see any other way to fix this problem.

Am I right about this?

On 4 August 2017 at 03:14, Xuehan Xu <xxhdx1985126@gmail.com> wrote:
> I mean I think it's the condition check "is_present_clone" that
> prevent the clone overlap to record the client write operations
> modified range when the target "HEAD" object exists without its most
> recent clone object, and if I'm right, just move the clone overlap
> modification out of the "is_present_clone" condition check block is
> enough to solve this case, just like the PR
> "https://github.com/ceph/ceph/pull/16790", and this fix wouldn't cause
> other problems.
>
> In our test, this fix solved the problem successfully, however, we
> can't confirm it won't cause new problems yet.
>
> So if anyone see this and knows the answer, please help us. Thank you:-)
>
> On 4 August 2017 at 11:41, Xuehan Xu <xxhdx1985126@gmail.com> wrote:
>> Hi, grep:-)
>>
>> I finally got what you mean in https://github.com/ceph/ceph/pull/16790.
>>
>> I agree with you in that " clone overlap is supposed to be tracking
>> which data is the same on disk".
>>
>> My thought is that, "ObjectContext::new_snapset.clones" is already an
>> indicator about whether there are clone objects on disk, so, in the
>> scenario of "cache tier", although a clone oid does not corresponds to
>> a "present clone" in cache tier, as long as
>> "ObjectContext::new_snapset.clones" is not empty, there must a one
>> such clone object in the base tier. And, as long as
>> "ObjectContext::new_snapset.clones" has a strict "one-to-one"
>> correspondence to "ObjectContext::new_snapset.clone_overlap", passing
>> the condition check "if (ctx->new_snapset.clones.size() > 0)" is
>> enough to make the judgement that the clone object exists.
>>
>> So, if I'm right, passing the condition check "if
>> (ctx->new_snapset.clones.size() > 0)" is already enough for us to do
>> "newest_overlap.subtract(ctx->modified_ranges)", it doesn't have to
>> pass "is_present_clone".
>>
>> Am I right about this? Or am I missing anything?
>>
>> Please help us, thank you:-)


* Re: About the problem "export_diff relies on clone_overlap, which is lost when cache tier is enabled"
  2017-08-05 15:56   ` Xuehan Xu
@ 2017-08-07 20:48     ` Jason Dillaman
       [not found]       ` <CAJACTucaKm+nZUUkQYszZtjt7K=X1uvdpUucxL+pV1AHrmewEw@mail.gmail.com>
  0 siblings, 1 reply; 14+ messages in thread
From: Jason Dillaman @ 2017-08-07 20:48 UTC (permalink / raw)
  To: Xuehan Xu; +Cc: Gregory Farnum, ceph-devel

Could you just proxy the "list snaps" op from the cache tier down to
the base tier and combine the cache tier + base tier results? Reading
the associated ticket, it seems kludgy to me to attempt to work around
this within librbd by just refusing to provide intra-object diffs if
cache tiering is in use.
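
For example, one possible way to "combine" the two results (purely a
sketch, not an existing OSD helper; I'm assuming the merged per-clone
overlap should be the intersection, since a range is only really
unchanged if both tiers still agree on it):

  // hypothetical sketch: merge the overlap a clone reports in the cache
  // tier with the overlap it reports in the base tier by intersecting them
  interval_set<uint64_t> merge_overlap(const interval_set<uint64_t> &cache_overlap,
                                       const interval_set<uint64_t> &base_overlap)
  {
    interval_set<uint64_t> merged = cache_overlap;
    merged.intersection_of(base_overlap);
    return merged;
  }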

On Sat, Aug 5, 2017 at 11:56 AM, Xuehan Xu <xxhdx1985126@gmail.com> wrote:
> Hi, everyone.
>
> Trying to solve the issue "http://tracker.ceph.com/issues/20896", I
> just did another test: I did some writes to an object
> "rbd_data.1ebc6238e1f29.0000000000000000" to raise its "HEAD" object
> to the cache tier, after which I specifically write to its offset 0x40
> with 4 bytes of random data. Then I used "ceph-objectstore-tool" to
> dump its "HEAD" version in the base tier, the result is as
> follows(before I raise it to cache tier, there is three snaps the
> latest of which is 26):
>
> {
>     "id": {
>         "oid": "rbd_data.1ebc6238e1f29.0000000000000000",
>         "key": "",
>         "snapid": -2,
>         "hash": 1655893237,
>         "max": 0,
>         "pool": 3,
>         "namespace": "",
>         "max": 0
>     },
>     "info": {
>         "oid": {
>             "oid": "rbd_data.1ebc6238e1f29.0000000000000000",
>             "key": "",
>             "snapid": -2,
>             "hash": 1655893237,
>             "max": 0,
>             "pool": 3,
>             "namespace": ""
>         },
>         "version": "4219'16423",
>         "prior_version": "3978'16310",
>         "last_reqid": "osd.70.4213:2359",
>         "user_version": 17205,
>         "size": 4194304,
>         "mtime": "2017-08-03 22:07:34.656122",
>         "local_mtime": "2017-08-05 23:02:33.628734",
>         "lost": 0,
>         "flags": 52,
>         "snaps": [],
>         "truncate_seq": 0,
>         "truncate_size": 0,
>         "data_digest": 2822203961,
>         "omap_digest": 4294967295,
>         "watchers": {}
>     },
>     "stat": {
>         "size": 4194304,
>         "blksize": 4096,
>         "blocks": 8200,
>         "nlink": 1
>     },
>     "SnapSet": {
>         "snap_context": {
>             "seq": 26,
>             "snaps": [
>                 26,
>                 25,
>                 16
>             ]
>         },
>         "head_exists": 1,
>         "clones": [
>             {
>                 "snap": 16,
>                 "size": 4194304,
>                 "overlap": "[4~4194300]"
>             },
>             {
>                 "snap": 25,
>                 "size": 4194304,
>                 "overlap": "[]"
>             },
>             {
>                 "snap": 26,
>                 "size": 4194304,
>                 "overlap": "[]"
>             }
>         ]
>     }
> }
>
> As we can see, its clone_overlap for snap 26 is empty, which,
> combining with the previous test described in
> http://tracker.ceph.com/issues/20896, means that the writes' "modified
> range" is neither recorded in the cache tier nor in the base tier.
>
> I think maybe we really should move the clone overlap modification out
> of the IF block which has the condition check "is_present_clone". As
> for now, I can't see any other way to fix this problem.
>
> Am I right about this?
>
> On 4 August 2017 at 03:14, Xuehan Xu <xxhdx1985126@gmail.com> wrote:
>> I mean I think it's the condition check "is_present_clone" that
>> prevent the clone overlap to record the client write operations
>> modified range when the target "HEAD" object exists without its most
>> recent clone object, and if I'm right, just move the clone overlap
>> modification out of the "is_present_clone" condition check block is
>> enough to solve this case, just like the PR
>> "https://github.com/ceph/ceph/pull/16790", and this fix wouldn't cause
>> other problems.
>>
>> In our test, this fix solved the problem successfully, however, we
>> can't confirm it won't cause new problems yet.
>>
>> So if anyone see this and knows the answer, please help us. Thank you:-)
>>
>> On 4 August 2017 at 11:41, Xuehan Xu <xxhdx1985126@gmail.com> wrote:
>>> Hi, grep:-)
>>>
>>> I finally got what you mean in https://github.com/ceph/ceph/pull/16790.
>>>
>>> I agree with you in that " clone overlap is supposed to be tracking
>>> which data is the same on disk".
>>>
>>> My thought is that, "ObjectContext::new_snapset.clones" is already an
>>> indicator about whether there are clone objects on disk, so, in the
>>> scenario of "cache tier", although a clone oid does not corresponds to
>>> a "present clone" in cache tier, as long as
>>> "ObjectContext::new_snapset.clones" is not empty, there must a one
>>> such clone object in the base tier. And, as long as
>>> "ObjectContext::new_snapset.clones" has a strict "one-to-one"
>>> correspondence to "ObjectContext::new_snapset.clone_overlap", passing
>>> the condition check "if (ctx->new_snapset.clones.size() > 0)" is
>>> enough to make the judgement that the clone object exists.
>>>
>>> So, if I'm right, passing the condition check "if
>>> (ctx->new_snapset.clones.size() > 0)" is already enough for us to do
>>> "newest_overlap.subtract(ctx->modified_ranges)", it doesn't have to
>>> pass "is_present_clone".
>>>
>>> Am I right about this? Or am I missing anything?
>>>
>>> Please help us, thank you:-)



-- 
Jason


* Fwd: About the problem "export_diff relies on clone_overlap, which is lost when cache tier is enabled"
       [not found]           ` <CAJACTudkLVT=dcWo_FpgrVuxJ8smFD2VqLU6x2eVWyggh9UJAA@mail.gmail.com>
@ 2017-08-09  4:20             ` Xuehan Xu
  2017-08-09  4:24               ` Xuehan Xu
  0 siblings, 1 reply; 14+ messages in thread
From: Xuehan Xu @ 2017-08-09  4:20 UTC (permalink / raw)
  To: dillaman, Gregory Farnum, ceph-devel

Sorry, I didn't "reply all"....:-)


---------- Forwarded message ----------
From: Xuehan Xu <xxhdx1985126@gmail.com>
Date: 9 August 2017 at 12:14
Subject: Re: About the problem "export_diff relies on clone_overlap,
which is lost when cache tier is enabled"
To: dillaman@redhat.com


Um, no, I don't think they are related.

My problem is this:

At first, we tried to use "rbd export-diff" to do incremental backups
on a Jewel-version ceph cluster with cache tiering enabled. However,
when we compared the original rbd image and the backup rbd image, we
found that they were different. After a series of debugging, we found
that this is because a WRITE op's "modified_range" is not subtracted
from the clone overlap of its target object's HEAD version when that
object's HEAD version is in the cache tier and its most recent clone
object is not, which leads to a miscalculation in the
"calc_snap_set_diff" function.

For example, we did the following test: we first created a snap for an
rbd image "test.2.img" whose size is only 4MB, which means it contains
only one object; then we sent a series of AioWrites to "test.2.img" to
promote its HEAD object into the cache tier, while leaving its clone
object in the base tier only. At that point, we used
"ceph-objectstore-tool" to dump the object we wrote to, and the result
was as follows:

{
    "id": {
        "oid": "rbd_data.2aae62ae8944a.0000000000000000",
        "key": "",
        "snapid": -2,
        "hash": 2375431681,
        "max": 0,
        "pool": 4,
        "namespace": "",
        "max": 0
    },
    "info": {
        "oid": {
            "oid": "rbd_data.2aae62ae8944a.0000000000000000",
            "key": "",
            "snapid": -2,
            "hash": 2375431681,
            "max": 0,
            "pool": 4,
            "namespace": ""
        },
        "version": "4536'2728",
        "prior_version": "4536'2727",
        "last_reqid": "client.174858.0:10",
        "user_version": 14706,
        "size": 68,
        "mtime": "2017-08-09 11:30:18.263983",
        "local_mtime": "2017-08-09 11:30:18.264310",
        "lost": 0,
        "flags": 4,
        "snaps": [],
        "truncate_seq": 0,
        "truncate_size": 0,
        "data_digest": 4294967295,
        "omap_digest": 4294967295,
        "watchers": {}
    },
    "stat": {
        "size": 68,
        "blksize": 4096,
        "blocks": 16,
        "nlink": 1
    },
    "SnapSet": {
        "snap_context": {
            "seq": 28,
            "snaps": [
                28
            ]
        },
        "head_exists": 1,
        "clones": [
            {
                "snap": 28,
                "size": 68,
                "overlap": "[0~64]"
            }
        ]
    }
}

In this result, we found that the overlap for clone "28" is [0~64]. So
we specifically sent an AioWrite req to this object to write 4 bytes of
random data at offset 32, after which we expected the overlap to change
to [0~32, 36~64]. However, the result was as follows:

{
    "id": {
        "oid": "rbd_data.2aae62ae8944a.0000000000000000",
        "key": "",
        "snapid": -2,
        "hash": 2375431681,
        "max": 0,
        "pool": 4,
        "namespace": "",
        "max": 0
    },
    "info": {
        "oid": {
            "oid": "rbd_data.2aae62ae8944a.0000000000000000",
            "key": "",
            "snapid": -2,
            "hash": 2375431681,
            "max": 0,
            "pool": 4,
            "namespace": ""
        },
        "version": "4546'2730",
        "prior_version": "4538'2729",
        "last_reqid": "client.155555.0:10",
        "user_version": 14708,
        "size": 4096,
        "mtime": "2017-08-09 11:36:20.717910",
        "local_mtime": "2017-08-09 11:36:20.719039",
        "lost": 0,
        "flags": 4,
        "snaps": [],
        "truncate_seq": 0,
        "truncate_size": 0,
        "data_digest": 4294967295,
        "omap_digest": 4294967295,
        "watchers": {}
    },
    "stat": {
        "size": 4096,
        "blksize": 4096,
        "blocks": 16,
        "nlink": 1
    },
    "SnapSet": {
        "snap_context": {
            "seq": 28,
            "snaps": [
                28
            ]
        },
        "head_exists": 1,
        "clones": [
            {
                "snap": 28,
                "size": 68,
                "overlap": "[0~64]"
            }
        ]
    }
}

It's obvious that it didn't change at all. If we do export-diff under
this circumstance, the result will be wrong. Meanwhile, the
"ceph-objectstore-tool" dump in the base tier also showed that the
overlap recorded there didn't change either:
{
    "id": {
        "oid": "rbd_data.2aae62ae8944a.0000000000000000",
        "key": "",
        "snapid": -2,
        "hash": 2375431681,
        "max": 0,
        "pool": 3,
        "namespace": "",
        "max": 0
    },
    "info": {
        "oid": {
            "oid": "rbd_data.2aae62ae8944a.0000000000000000",
            "key": "",
            "snapid": -2,
            "hash": 2375431681,
            "max": 0,
            "pool": 3,
            "namespace": ""
        },
        "version": "4536'14459",
        "prior_version": "4536'14458",
        "last_reqid": "client.174834.0:10",
        "user_version": 14648,
        "size": 68,
        "mtime": "2017-08-09 11:30:01.412634",
        "local_mtime": "2017-08-09 11:30:01.413614",
        "lost": 0,
        "flags": 36,
        "snaps": [],
        "truncate_seq": 0,
        "truncate_size": 0,
        "data_digest": 4294967295,
        "omap_digest": 4294967295,
        "watchers": {}
    },
    "stat": {
        "size": 68,
        "blksize": 4096,
        "blocks": 16,
        "nlink": 1
    },
    "SnapSet": {
        "snap_context": {
            "seq": 28,
            "snaps": [
                28
            ]
        },
        "head_exists": 1,
        "clones": [
            {
                "snap": 28,
                "size": 68,
                "overlap": "[0~64]"
            }
        ]
    }
}
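
For reference, this is the arithmetic we expected to happen on that
overlap (a standalone snippet using Ceph's interval_set<uint64_t>; note
that these dumps print intervals as offset~length, so bytes 0-31 and
36-63 come out as [0~32,36~28]):

  interval_set<uint64_t> overlap;
  overlap.insert(0, 64);             // clone 28 overlap: [0~64]

  interval_set<uint64_t> modified;
  modified.insert(32, 4);            // the 4-byte AioWrite at offset 32

  modified.intersection_of(overlap); // only the part still overlapping matters
  overlap.subtract(modified);        // overlap should now be [0~32,36~28]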

Then we turned to the source code to find the reason for this. We found
that, in the ReplicatedPG::make_writeable method, when determining
whether the write op's modified range should be subtracted from the
clone overlap, the code has to pass two condition checks:
"ctx->new_snapset.clones.size() > 0" and
"is_present_clone(last_clone_oid)", as the code below shows.

  // update most recent clone_overlap and usage stats
  if (ctx->new_snapset.clones.size() > 0) {
    /* we need to check whether the most recent clone exists, if it's been
     * evicted, it's not included in the stats */
    hobject_t last_clone_oid = soid;
    last_clone_oid.snap = ctx->new_snapset.clone_overlap.rbegin()->first;
    if (is_present_clone(last_clone_oid)) {
      interval_set<uint64_t> &newest_overlap =
        ctx->new_snapset.clone_overlap.rbegin()->second;
      ctx->modified_ranges.intersection_of(newest_overlap);
      // modified_ranges is still in use by the clone
      add_interval_usage(ctx->modified_ranges, ctx->delta_stats);
      newest_overlap.subtract(ctx->modified_ranges);
    }
  }

We thought that the latter condition check
"is_present_clone(last_clone_oid)" might not be a reasonable basis for
deciding whether to subtract the write op's modified range from the
clone overlap, so we changed the code above to the following, which
moves the subtraction out of that condition check, and submitted a PR
for this: https://github.com/ceph/ceph/pull/16790:

  // update most recent clone_overlap and usage stats
  if (ctx->new_snapset.clones.size() > 0) {
    /* we need to check whether the most recent clone exists, if it's been
     * evicted, it's not included in the stats */
    hobject_t last_clone_oid = soid;
    last_clone_oid.snap = ctx->new_snapset.clone_overlap.rbegin()->first;
    interval_set<uint64_t> &newest_overlap =
      ctx->new_snapset.clone_overlap.rbegin()->second;
    ctx->modified_ranges.intersection_of(newest_overlap);
    newest_overlap.subtract(ctx->modified_ranges);

    if (is_present_clone(last_clone_oid)) {
      // modified_ranges is still in use by the clone
      add_interval_usage(ctx->modified_ranges, ctx->delta_stats);
    }
  }

In our test, this change solved the problem successfully; however, we
can't confirm that it won't cause any new problems. So here we are,
discussing how to solve this problem:-)

On 9 August 2017 at 09:26, Jason Dillaman <jdillama@redhat.com> wrote:
> Is this issue related to https://github.com/ceph/ceph/pull/10626?
>
> On Mon, Aug 7, 2017 at 8:07 PM, Xuehan Xu <xxhdx1985126@gmail.com> wrote:
>> OK, I'll try that. Thank you:-)
>>
>> On 8 August 2017 at 04:48, Jason Dillaman <jdillama@redhat.com> wrote:
>>> Could you just proxy the "list snaps" op from the cache tier down to
>>> the base tier and combine the cache tier + base tier results? Reading
>>> the associated ticket, it seems kludgy to me to attempt to work around
>>> this within librbd by just refusing the provide intra-object diffs if
>>> cache tiering is in-use.
>>>
>>> On Sat, Aug 5, 2017 at 11:56 AM, Xuehan Xu <xxhdx1985126@gmail.com> wrote:
>>>> Hi, everyone.
>>>>
>>>> Trying to solve the issue "http://tracker.ceph.com/issues/20896", I
>>>> just did another test: I did some writes to an object
>>>> "rbd_data.1ebc6238e1f29.0000000000000000" to raise its "HEAD" object
>>>> to the cache tier, after which I specifically write to its offset 0x40
>>>> with 4 bytes of random data. Then I used "ceph-objectstore-tool" to
>>>> dump its "HEAD" version in the base tier, the result is as
>>>> follows(before I raise it to cache tier, there is three snaps the
>>>> latest of which is 26):
>>>>
>>>> {
>>>>     "id": {
>>>>         "oid": "rbd_data.1ebc6238e1f29.0000000000000000",
>>>>         "key": "",
>>>>         "snapid": -2,
>>>>         "hash": 1655893237,
>>>>         "max": 0,
>>>>         "pool": 3,
>>>>         "namespace": "",
>>>>         "max": 0
>>>>     },
>>>>     "info": {
>>>>         "oid": {
>>>>             "oid": "rbd_data.1ebc6238e1f29.0000000000000000",
>>>>             "key": "",
>>>>             "snapid": -2,
>>>>             "hash": 1655893237,
>>>>             "max": 0,
>>>>             "pool": 3,
>>>>             "namespace": ""
>>>>         },
>>>>         "version": "4219'16423",
>>>>         "prior_version": "3978'16310",
>>>>         "last_reqid": "osd.70.4213:2359",
>>>>         "user_version": 17205,
>>>>         "size": 4194304,
>>>>         "mtime": "2017-08-03 22:07:34.656122",
>>>>         "local_mtime": "2017-08-05 23:02:33.628734",
>>>>         "lost": 0,
>>>>         "flags": 52,
>>>>         "snaps": [],
>>>>         "truncate_seq": 0,
>>>>         "truncate_size": 0,
>>>>         "data_digest": 2822203961,
>>>>         "omap_digest": 4294967295,
>>>>         "watchers": {}
>>>>     },
>>>>     "stat": {
>>>>         "size": 4194304,
>>>>         "blksize": 4096,
>>>>         "blocks": 8200,
>>>>         "nlink": 1
>>>>     },
>>>>     "SnapSet": {
>>>>         "snap_context": {
>>>>             "seq": 26,
>>>>             "snaps": [
>>>>                 26,
>>>>                 25,
>>>>                 16
>>>>             ]
>>>>         },
>>>>         "head_exists": 1,
>>>>         "clones": [
>>>>             {
>>>>                 "snap": 16,
>>>>                 "size": 4194304,
>>>>                 "overlap": "[4~4194300]"
>>>>             },
>>>>             {
>>>>                 "snap": 25,
>>>>                 "size": 4194304,
>>>>                 "overlap": "[]"
>>>>             },
>>>>             {
>>>>                 "snap": 26,
>>>>                 "size": 4194304,
>>>>                 "overlap": "[]"
>>>>             }
>>>>         ]
>>>>     }
>>>> }
>>>>
>>>> As we can see, its clone_overlap for snap 26 is empty, which,
>>>> combining with the previous test described in
>>>> http://tracker.ceph.com/issues/20896, means that the writes' "modified
>>>> range" is neither recorded in the cache tier nor in the base tier.
>>>>
>>>> I think maybe we really should move the clone overlap modification out
>>>> of the IF block which has the condition check "is_present_clone". As
>>>> for now, I can't see any other way to fix this problem.
>>>>
>>>> Am I right about this?
>>>>
>>>> On 4 August 2017 at 03:14, Xuehan Xu <xxhdx1985126@gmail.com> wrote:
>>>>> I mean I think it's the condition check "is_present_clone" that
>>>>> prevent the clone overlap to record the client write operations
>>>>> modified range when the target "HEAD" object exists without its most
>>>>> recent clone object, and if I'm right, just move the clone overlap
>>>>> modification out of the "is_present_clone" condition check block is
>>>>> enough to solve this case, just like the PR
>>>>> "https://github.com/ceph/ceph/pull/16790", and this fix wouldn't cause
>>>>> other problems.
>>>>>
>>>>> In our test, this fix solved the problem successfully, however, we
>>>>> can't confirm it won't cause new problems yet.
>>>>>
>>>>> So if anyone see this and knows the answer, please help us. Thank you:-)
>>>>>
>>>>> On 4 August 2017 at 11:41, Xuehan Xu <xxhdx1985126@gmail.com> wrote:
>>>>>> Hi, grep:-)
>>>>>>
>>>>>> I finally got what you mean in https://github.com/ceph/ceph/pull/16790.
>>>>>>
>>>>>> I agree with you in that " clone overlap is supposed to be tracking
>>>>>> which data is the same on disk".
>>>>>>
>>>>>> My thought is that, "ObjectContext::new_snapset.clones" is already an
>>>>>> indicator about whether there are clone objects on disk, so, in the
>>>>>> scenario of "cache tier", although a clone oid does not corresponds to
>>>>>> a "present clone" in cache tier, as long as
>>>>>> "ObjectContext::new_snapset.clones" is not empty, there must a one
>>>>>> such clone object in the base tier. And, as long as
>>>>>> "ObjectContext::new_snapset.clones" has a strict "one-to-one"
>>>>>> correspondence to "ObjectContext::new_snapset.clone_overlap", passing
>>>>>> the condition check "if (ctx->new_snapset.clones.size() > 0)" is
>>>>>> enough to make the judgement that the clone object exists.
>>>>>>
>>>>>> So, if I'm right, passing the condition check "if
>>>>>> (ctx->new_snapset.clones.size() > 0)" is already enough for us to do
>>>>>> "newest_overlap.subtract(ctx->modified_ranges)", it doesn't have to
>>>>>> pass "is_present_clone".
>>>>>>
>>>>>> Am I right about this? Or am I missing anything?
>>>>>>
>>>>>> Please help us, thank you:-)
>>>
>>>
>>>
>>> --
>>> Jason
>
>
>
> --
> Jason


* Re: About the problem "export_diff relies on clone_overlap, which is lost when cache tier is enabled"
  2017-08-09  4:20             ` Fwd: " Xuehan Xu
@ 2017-08-09  4:24               ` Xuehan Xu
  2017-08-09 15:26                 ` Jason Dillaman
  0 siblings, 1 reply; 14+ messages in thread
From: Xuehan Xu @ 2017-08-09  4:24 UTC (permalink / raw)
  To: dillaman, Gregory Farnum, ceph-devel

By the way, according to our test, since the modified range is recorded
in neither the cache tier nor the base tier, I think proxying the
"list-snaps" op down to the base tier might not work either. Is that
right?

On 9 August 2017 at 12:20, Xuehan Xu <xxhdx1985126@gmail.com> wrote:
> Sorry, I didn't "reply all"....:-)
>
>
> ---------- Forwarded message ----------
> From: Xuehan Xu <xxhdx1985126@gmail.com>
> Date: 9 August 2017 at 12:14
> Subject: Re: About the problem "export_diff relies on clone_overlap,
> which is lost when cache tier is enabled"
> To: dillaman@redhat.com
>
>
> Um, no, I don't think they are related.
>
> My problem is this:
>
> At first , we tried to use "rbd export-diff" to do incremental backup
> on Jewel verion ceph cluster which is cache-tier enabled. However,
> when we compare the original rbd image and the backup rbd image, we
> find that they are different. After a series of debugging, we found
> that this is because WRITE ops' "modified_range" is not substracted
> from the clone overlap of its targeting object's HEAD version when
> that object's HEAD verion is in cache iter and its most recent clone
> object is not, which led to the miscalculation of the
> "calc_snap_set_diff" function.
>
> For example, we did such a test: we first made created a snap for an
> rbd image "test.2.img" whose size is only 4MB which means it only
> contains one object; then, we sent a series of AioWrites to
> "test.2.img" to promote its HEAD object into cache tier, while leaving
> its clone object in the base tier only; at that time, we used
> "ceph-objectstore-tool" to dump the object we wrote to, and the result
> was as follows:
>
> {
>     "id": {
>         "oid": "rbd_data.2aae62ae8944a.0000000000000000",
>         "key": "",
>         "snapid": -2,
>         "hash": 2375431681,
>         "max": 0,
>         "pool": 4,
>         "namespace": "",
>         "max": 0
>     },
>     "info": {
>         "oid": {
>             "oid": "rbd_data.2aae62ae8944a.0000000000000000",
>             "key": "",
>             "snapid": -2,
>             "hash": 2375431681,
>             "max": 0,
>             "pool": 4,
>             "namespace": ""
>         },
>         "version": "4536'2728",
>         "prior_version": "4536'2727",
>         "last_reqid": "client.174858.0:10",
>         "user_version": 14706,
>         "size": 68,
>         "mtime": "2017-08-09 11:30:18.263983",
>         "local_mtime": "2017-08-09 11:30:18.264310",
>         "lost": 0,
>         "flags": 4,
>         "snaps": [],
>         "truncate_seq": 0,
>         "truncate_size": 0,
>         "data_digest": 4294967295,
>         "omap_digest": 4294967295,
>         "watchers": {}
>     },
>     "stat": {
>         "size": 68,
>         "blksize": 4096,
>         "blocks": 16,
>         "nlink": 1
>     },
>     "SnapSet": {
>         "snap_context": {
>             "seq": 28,
>             "snaps": [
>                 28
>             ]
>         },
>         "head_exists": 1,
>         "clones": [
>             {
>                 "snap": 28,
>                 "size": 68,
>                 "overlap": "[0~64]"
>             }
>         ]
>     }
> }
>
> In this result, we found that the overlap for clone "28" is [0~64]. So
> we specifically sent a AioWrite req to this object to write to the
> offset 32 with 4 bytes of ramdon data, which we thought the overlap
> should change to [0~32, 36~64]. However, the result is as follows:
>
> {
>     "id": {
>         "oid": "rbd_data.2aae62ae8944a.0000000000000000",
>         "key": "",
>         "snapid": -2,
>         "hash": 2375431681,
>         "max": 0,
>         "pool": 4,
>         "namespace": "",
>         "max": 0
>     },
>     "info": {
>         "oid": {
>             "oid": "rbd_data.2aae62ae8944a.0000000000000000",
>             "key": "",
>             "snapid": -2,
>             "hash": 2375431681,
>             "max": 0,
>             "pool": 4,
>             "namespace": ""
>         },
>         "version": "4546'2730",
>         "prior_version": "4538'2729",
>         "last_reqid": "client.155555.0:10",
>         "user_version": 14708,
>         "size": 4096,
>         "mtime": "2017-08-09 11:36:20.717910",
>         "local_mtime": "2017-08-09 11:36:20.719039",
>         "lost": 0,
>         "flags": 4,
>         "snaps": [],
>         "truncate_seq": 0,
>         "truncate_size": 0,
>         "data_digest": 4294967295,
>         "omap_digest": 4294967295,
>         "watchers": {}
>     },
>     "stat": {
>         "size": 4096,
>         "blksize": 4096,
>         "blocks": 16,
>         "nlink": 1
>     },
>     "SnapSet": {
>         "snap_context": {
>             "seq": 28,
>             "snaps": [
>                 28
>             ]
>         },
>         "head_exists": 1,
>         "clones": [
>             {
>                 "snap": 28,
>                 "size": 68,
>                 "overlap": "[0~64]"
>             }
>         ]
>     }
> }
>
> It's obvious that it didn't change at all. If we do export-diff under
> this circumstance, the result would be wrong. Meanwhile, in the base
> tier, the "ceph-objectstore-tool" dump's result also showed that the
> overlap recorded in the base tier didn't change also:
> {
>     "id": {
>         "oid": "rbd_data.2aae62ae8944a.0000000000000000",
>         "key": "",
>         "snapid": -2,
>         "hash": 2375431681,
>         "max": 0,
>         "pool": 3,
>         "namespace": "",
>         "max": 0
>     },
>     "info": {
>         "oid": {
>             "oid": "rbd_data.2aae62ae8944a.0000000000000000",
>             "key": "",
>             "snapid": -2,
>             "hash": 2375431681,
>             "max": 0,
>             "pool": 3,
>             "namespace": ""
>         },
>         "version": "4536'14459",
>         "prior_version": "4536'14458",
>         "last_reqid": "client.174834.0:10",
>         "user_version": 14648,
>         "size": 68,
>         "mtime": "2017-08-09 11:30:01.412634",
>         "local_mtime": "2017-08-09 11:30:01.413614",
>         "lost": 0,
>         "flags": 36,
>         "snaps": [],
>         "truncate_seq": 0,
>         "truncate_size": 0,
>         "data_digest": 4294967295,
>         "omap_digest": 4294967295,
>         "watchers": {}
>     },
>     "stat": {
>         "size": 68,
>         "blksize": 4096,
>         "blocks": 16,
>         "nlink": 1
>     },
>     "SnapSet": {
>         "snap_context": {
>             "seq": 28,
>             "snaps": [
>                 28
>             ]
>         },
>         "head_exists": 1,
>         "clones": [
>             {
>                 "snap": 28,
>                 "size": 68,
>                 "overlap": "[0~64]"
>             }
>         ]
>     }
> }
>
> Then we turn to the source code to find the reason for this. And we
> found that, the reason should be that, in the
> ReplicatedPG::make_writeable method, when determining whether the
> write op's modified range should be subtracted from the clone overlap,
> it has pass two condition check: "ctx->new_snapset.clones.size() > 0"
> and "is_present_clone(last_clone_oid)", as the code below shows.
>
>   // update most recent clone_overlap and usage stats
>   if (ctx->new_snapset.clones.size() > 0) {
>     /* we need to check whether the most recent clone exists, if it's
> been evicted,
>      * it's not included in the stats */
>     hobject_t last_clone_oid = soid;
>     last_clone_oid.snap = ctx->new_snapset.clone_overlap.rbegin()->first;
>     if (is_present_clone(last_clone_oid)) {
>       interval_set<uint64_t> &newest_overlap =
> ctx->new_snapset.clone_overlap.rbegin()->second;
>       ctx->modified_ranges.intersection_of(newest_overlap);
>       // modified_ranges is still in use by the clone
>       add_interval_usage(ctx->modified_ranges, ctx->delta_stats);
>       newest_overlap.subtract(ctx->modified_ranges);
>     }
>   }
>
> We thought that the latter condition check
> "is_present_clone(last_clone_oid)" might not be reasonable to be a
> judgement base for the determination of whether to subtract the clone
> overlap with write ops' modified range, so we changed to code above to
> the following, which moved the subtraction out of the latter condition
> check, and submitted a pr for this:
> https://github.com/ceph/ceph/pull/16790:
>
>   // update most recent clone_overlap and usage stats
>   if (ctx->new_snapset.clones.size() > 0) {
>     /* we need to check whether the most recent clone exists, if it's
> been evicted,
>      * it's not included in the stats */
>     hobject_t last_clone_oid = soid;
>     last_clone_oid.snap = ctx->new_snapset.clone_overlap.rbegin()->first;
>     interval_set<uint64_t> &newest_overlap =
> ctx->new_snapset.clone_overlap.rbegin()->second;
>     ctx->modified_ranges.intersection_of(newest_overlap);
>     newest_overlap.subtract(ctx->modified_ranges);
>
>     if (is_present_clone(last_clone_oid)) {
>       // modified_ranges is still in use by the clone
>       add_interval_usage(ctx->modified_ranges, ctx->delta_stats);
>     }
>   }
>
> In our test, this change solved the problem successfully, however, we
> can't confirm that this change won't cause any new problems. So, here
> we are discussing how to solve this problem:-)
>
> On 9 August 2017 at 09:26, Jason Dillaman <jdillama@redhat.com> wrote:
>> Is this issue related to https://github.com/ceph/ceph/pull/10626?
>>
>> On Mon, Aug 7, 2017 at 8:07 PM, Xuehan Xu <xxhdx1985126@gmail.com> wrote:
>>> OK, I'll try that. Thank you:-)
>>>
>>> On 8 August 2017 at 04:48, Jason Dillaman <jdillama@redhat.com> wrote:
>>>> Could you just proxy the "list snaps" op from the cache tier down to
>>>> the base tier and combine the cache tier + base tier results? Reading
>>>> the associated ticket, it seems kludgy to me to attempt to work around
>>>> this within librbd by just refusing the provide intra-object diffs if
>>>> cache tiering is in-use.
>>>>
>>>> On Sat, Aug 5, 2017 at 11:56 AM, Xuehan Xu <xxhdx1985126@gmail.com> wrote:
>>>>> Hi, everyone.
>>>>>
>>>>> Trying to solve the issue "http://tracker.ceph.com/issues/20896", I
>>>>> just did another test: I did some writes to an object
>>>>> "rbd_data.1ebc6238e1f29.0000000000000000" to raise its "HEAD" object
>>>>> to the cache tier, after which I specifically write to its offset 0x40
>>>>> with 4 bytes of random data. Then I used "ceph-objectstore-tool" to
>>>>> dump its "HEAD" version in the base tier, the result is as
>>>>> follows(before I raise it to cache tier, there is three snaps the
>>>>> latest of which is 26):
>>>>>
>>>>> {
>>>>>     "id": {
>>>>>         "oid": "rbd_data.1ebc6238e1f29.0000000000000000",
>>>>>         "key": "",
>>>>>         "snapid": -2,
>>>>>         "hash": 1655893237,
>>>>>         "max": 0,
>>>>>         "pool": 3,
>>>>>         "namespace": "",
>>>>>         "max": 0
>>>>>     },
>>>>>     "info": {
>>>>>         "oid": {
>>>>>             "oid": "rbd_data.1ebc6238e1f29.0000000000000000",
>>>>>             "key": "",
>>>>>             "snapid": -2,
>>>>>             "hash": 1655893237,
>>>>>             "max": 0,
>>>>>             "pool": 3,
>>>>>             "namespace": ""
>>>>>         },
>>>>>         "version": "4219'16423",
>>>>>         "prior_version": "3978'16310",
>>>>>         "last_reqid": "osd.70.4213:2359",
>>>>>         "user_version": 17205,
>>>>>         "size": 4194304,
>>>>>         "mtime": "2017-08-03 22:07:34.656122",
>>>>>         "local_mtime": "2017-08-05 23:02:33.628734",
>>>>>         "lost": 0,
>>>>>         "flags": 52,
>>>>>         "snaps": [],
>>>>>         "truncate_seq": 0,
>>>>>         "truncate_size": 0,
>>>>>         "data_digest": 2822203961,
>>>>>         "omap_digest": 4294967295,
>>>>>         "watchers": {}
>>>>>     },
>>>>>     "stat": {
>>>>>         "size": 4194304,
>>>>>         "blksize": 4096,
>>>>>         "blocks": 8200,
>>>>>         "nlink": 1
>>>>>     },
>>>>>     "SnapSet": {
>>>>>         "snap_context": {
>>>>>             "seq": 26,
>>>>>             "snaps": [
>>>>>                 26,
>>>>>                 25,
>>>>>                 16
>>>>>             ]
>>>>>         },
>>>>>         "head_exists": 1,
>>>>>         "clones": [
>>>>>             {
>>>>>                 "snap": 16,
>>>>>                 "size": 4194304,
>>>>>                 "overlap": "[4~4194300]"
>>>>>             },
>>>>>             {
>>>>>                 "snap": 25,
>>>>>                 "size": 4194304,
>>>>>                 "overlap": "[]"
>>>>>             },
>>>>>             {
>>>>>                 "snap": 26,
>>>>>                 "size": 4194304,
>>>>>                 "overlap": "[]"
>>>>>             }
>>>>>         ]
>>>>>     }
>>>>> }
>>>>>
>>>>> As we can see, its clone_overlap for snap 26 is empty, which,
>>>>> combining with the previous test described in
>>>>> http://tracker.ceph.com/issues/20896, means that the writes' "modified
>>>>> range" is neither recorded in the cache tier nor in the base tier.
>>>>>
>>>>> I think maybe we really should move the clone overlap modification out
>>>>> of the IF block which has the condition check "is_present_clone". As
>>>>> for now, I can't see any other way to fix this problem.
>>>>>
>>>>> Am I right about this?
>>>>>
>>>>> On 4 August 2017 at 03:14, Xuehan Xu <xxhdx1985126@gmail.com> wrote:
>>>>>> I mean I think it's the condition check "is_present_clone" that
>>>>>> prevent the clone overlap to record the client write operations
>>>>>> modified range when the target "HEAD" object exists without its most
>>>>>> recent clone object, and if I'm right, just move the clone overlap
>>>>>> modification out of the "is_present_clone" condition check block is
>>>>>> enough to solve this case, just like the PR
>>>>>> "https://github.com/ceph/ceph/pull/16790", and this fix wouldn't cause
>>>>>> other problems.
>>>>>>
>>>>>> In our test, this fix solved the problem successfully, however, we
>>>>>> can't confirm it won't cause new problems yet.
>>>>>>
>>>>>> So if anyone see this and knows the answer, please help us. Thank you:-)
>>>>>>
>>>>>> On 4 August 2017 at 11:41, Xuehan Xu <xxhdx1985126@gmail.com> wrote:
>>>>>>> Hi, grep:-)
>>>>>>>
>>>>>>> I finally got what you mean in https://github.com/ceph/ceph/pull/16790.
>>>>>>>
>>>>>>> I agree with you in that " clone overlap is supposed to be tracking
>>>>>>> which data is the same on disk".
>>>>>>>
>>>>>>> My thought is that, "ObjectContext::new_snapset.clones" is already an
>>>>>>> indicator about whether there are clone objects on disk, so, in the
>>>>>>> scenario of "cache tier", although a clone oid does not corresponds to
>>>>>>> a "present clone" in cache tier, as long as
>>>>>>> "ObjectContext::new_snapset.clones" is not empty, there must a one
>>>>>>> such clone object in the base tier. And, as long as
>>>>>>> "ObjectContext::new_snapset.clones" has a strict "one-to-one"
>>>>>>> correspondence to "ObjectContext::new_snapset.clone_overlap", passing
>>>>>>> the condition check "if (ctx->new_snapset.clones.size() > 0)" is
>>>>>>> enough to make the judgement that the clone object exists.
>>>>>>>
>>>>>>> So, if I'm right, passing the condition check "if
>>>>>>> (ctx->new_snapset.clones.size() > 0)" is already enough for us to do
>>>>>>> "newest_overlap.subtract(ctx->modified_ranges)", it doesn't have to
>>>>>>> pass "is_present_clone".
>>>>>>>
>>>>>>> Am I right about this? Or am I missing anything?
>>>>>>>
>>>>>>> Please help us, thank you:-)
>>>>
>>>>
>>>>
>>>> --
>>>> Jason
>>
>>
>>
>> --
>> Jason


* Re: About the problem "export_diff relies on clone_overlap, which is lost when cache tier is enabled"
  2017-08-09  4:24               ` Xuehan Xu
@ 2017-08-09 15:26                 ` Jason Dillaman
  2017-08-10  7:06                   ` Xuehan Xu
  0 siblings, 1 reply; 14+ messages in thread
From: Jason Dillaman @ 2017-08-09 15:26 UTC (permalink / raw)
  To: Xuehan Xu; +Cc: Gregory Farnum, ceph-devel

If you flush the object out of the cache tier so that its changes are
recorded in the base tier, is the overlap correctly recorded in the
base tier?

On Wed, Aug 9, 2017 at 12:24 AM, Xuehan Xu <xxhdx1985126@gmail.com> wrote:
> By the way, according to our test, since the modified range is not
> recorded either in the cache tier or in the base tier, I think
> proxying the "list-snaps" down to the base tier might not work as
> well, is that right?
>
> On 9 August 2017 at 12:20, Xuehan Xu <xxhdx1985126@gmail.com> wrote:
>> Sorry, I didn't "reply all"....:-)
>>
>>
>> ---------- Forwarded message ----------
>> From: Xuehan Xu <xxhdx1985126@gmail.com>
>> Date: 9 August 2017 at 12:14
>> Subject: Re: About the problem "export_diff relies on clone_overlap,
>> which is lost when cache tier is enabled"
>> To: dillaman@redhat.com
>>
>>
>> Um, no, I don't think they are related.
>>
>> My problem is this:
>>
>> At first , we tried to use "rbd export-diff" to do incremental backup
>> on Jewel verion ceph cluster which is cache-tier enabled. However,
>> when we compare the original rbd image and the backup rbd image, we
>> find that they are different. After a series of debugging, we found
>> that this is because WRITE ops' "modified_range" is not substracted
>> from the clone overlap of its targeting object's HEAD version when
>> that object's HEAD verion is in cache iter and its most recent clone
>> object is not, which led to the miscalculation of the
>> "calc_snap_set_diff" function.
>>
>> For example, we did such a test: we first made created a snap for an
>> rbd image "test.2.img" whose size is only 4MB which means it only
>> contains one object; then, we sent a series of AioWrites to
>> "test.2.img" to promote its HEAD object into cache tier, while leaving
>> its clone object in the base tier only; at that time, we used
>> "ceph-objectstore-tool" to dump the object we wrote to, and the result
>> was as follows:
>>
>> {
>>     "id": {
>>         "oid": "rbd_data.2aae62ae8944a.0000000000000000",
>>         "key": "",
>>         "snapid": -2,
>>         "hash": 2375431681,
>>         "max": 0,
>>         "pool": 4,
>>         "namespace": "",
>>         "max": 0
>>     },
>>     "info": {
>>         "oid": {
>>             "oid": "rbd_data.2aae62ae8944a.0000000000000000",
>>             "key": "",
>>             "snapid": -2,
>>             "hash": 2375431681,
>>             "max": 0,
>>             "pool": 4,
>>             "namespace": ""
>>         },
>>         "version": "4536'2728",
>>         "prior_version": "4536'2727",
>>         "last_reqid": "client.174858.0:10",
>>         "user_version": 14706,
>>         "size": 68,
>>         "mtime": "2017-08-09 11:30:18.263983",
>>         "local_mtime": "2017-08-09 11:30:18.264310",
>>         "lost": 0,
>>         "flags": 4,
>>         "snaps": [],
>>         "truncate_seq": 0,
>>         "truncate_size": 0,
>>         "data_digest": 4294967295,
>>         "omap_digest": 4294967295,
>>         "watchers": {}
>>     },
>>     "stat": {
>>         "size": 68,
>>         "blksize": 4096,
>>         "blocks": 16,
>>         "nlink": 1
>>     },
>>     "SnapSet": {
>>         "snap_context": {
>>             "seq": 28,
>>             "snaps": [
>>                 28
>>             ]
>>         },
>>         "head_exists": 1,
>>         "clones": [
>>             {
>>                 "snap": 28,
>>                 "size": 68,
>>                 "overlap": "[0~64]"
>>             }
>>         ]
>>     }
>> }
>>
>> In this result, we found that the overlap for clone "28" is [0~64]. So
>> we specifically sent a AioWrite req to this object to write to the
>> offset 32 with 4 bytes of ramdon data, which we thought the overlap
>> should change to [0~32, 36~64]. However, the result is as follows:
>>
>> {
>>     "id": {
>>         "oid": "rbd_data.2aae62ae8944a.0000000000000000",
>>         "key": "",
>>         "snapid": -2,
>>         "hash": 2375431681,
>>         "max": 0,
>>         "pool": 4,
>>         "namespace": "",
>>         "max": 0
>>     },
>>     "info": {
>>         "oid": {
>>             "oid": "rbd_data.2aae62ae8944a.0000000000000000",
>>             "key": "",
>>             "snapid": -2,
>>             "hash": 2375431681,
>>             "max": 0,
>>             "pool": 4,
>>             "namespace": ""
>>         },
>>         "version": "4546'2730",
>>         "prior_version": "4538'2729",
>>         "last_reqid": "client.155555.0:10",
>>         "user_version": 14708,
>>         "size": 4096,
>>         "mtime": "2017-08-09 11:36:20.717910",
>>         "local_mtime": "2017-08-09 11:36:20.719039",
>>         "lost": 0,
>>         "flags": 4,
>>         "snaps": [],
>>         "truncate_seq": 0,
>>         "truncate_size": 0,
>>         "data_digest": 4294967295,
>>         "omap_digest": 4294967295,
>>         "watchers": {}
>>     },
>>     "stat": {
>>         "size": 4096,
>>         "blksize": 4096,
>>         "blocks": 16,
>>         "nlink": 1
>>     },
>>     "SnapSet": {
>>         "snap_context": {
>>             "seq": 28,
>>             "snaps": [
>>                 28
>>             ]
>>         },
>>         "head_exists": 1,
>>         "clones": [
>>             {
>>                 "snap": 28,
>>                 "size": 68,
>>                 "overlap": "[0~64]"
>>             }
>>         ]
>>     }
>> }
>>
>> It's obvious that it didn't change at all. If we do export-diff under
>> this circumstance, the result would be wrong. Meanwhile, in the base
>> tier, the "ceph-objectstore-tool" dump's result also showed that the
>> overlap recorded in the base tier didn't change also:
>> {
>>     "id": {
>>         "oid": "rbd_data.2aae62ae8944a.0000000000000000",
>>         "key": "",
>>         "snapid": -2,
>>         "hash": 2375431681,
>>         "max": 0,
>>         "pool": 3,
>>         "namespace": "",
>>         "max": 0
>>     },
>>     "info": {
>>         "oid": {
>>             "oid": "rbd_data.2aae62ae8944a.0000000000000000",
>>             "key": "",
>>             "snapid": -2,
>>             "hash": 2375431681,
>>             "max": 0,
>>             "pool": 3,
>>             "namespace": ""
>>         },
>>         "version": "4536'14459",
>>         "prior_version": "4536'14458",
>>         "last_reqid": "client.174834.0:10",
>>         "user_version": 14648,
>>         "size": 68,
>>         "mtime": "2017-08-09 11:30:01.412634",
>>         "local_mtime": "2017-08-09 11:30:01.413614",
>>         "lost": 0,
>>         "flags": 36,
>>         "snaps": [],
>>         "truncate_seq": 0,
>>         "truncate_size": 0,
>>         "data_digest": 4294967295,
>>         "omap_digest": 4294967295,
>>         "watchers": {}
>>     },
>>     "stat": {
>>         "size": 68,
>>         "blksize": 4096,
>>         "blocks": 16,
>>         "nlink": 1
>>     },
>>     "SnapSet": {
>>         "snap_context": {
>>             "seq": 28,
>>             "snaps": [
>>                 28
>>             ]
>>         },
>>         "head_exists": 1,
>>         "clones": [
>>             {
>>                 "snap": 28,
>>                 "size": 68,
>>                 "overlap": "[0~64]"
>>             }
>>         ]
>>     }
>> }
>>
>> Then we turn to the source code to find the reason for this. And we
>> found that, the reason should be that, in the
>> ReplicatedPG::make_writeable method, when determining whether the
>> write op's modified range should be subtracted from the clone overlap,
>> it has pass two condition check: "ctx->new_snapset.clones.size() > 0"
>> and "is_present_clone(last_clone_oid)", as the code below shows.
>>
>>   // update most recent clone_overlap and usage stats
>>   if (ctx->new_snapset.clones.size() > 0) {
>>     /* we need to check whether the most recent clone exists, if it's
>> been evicted,
>>      * it's not included in the stats */
>>     hobject_t last_clone_oid = soid;
>>     last_clone_oid.snap = ctx->new_snapset.clone_overlap.rbegin()->first;
>>     if (is_present_clone(last_clone_oid)) {
>>       interval_set<uint64_t> &newest_overlap =
>> ctx->new_snapset.clone_overlap.rbegin()->second;
>>       ctx->modified_ranges.intersection_of(newest_overlap);
>>       // modified_ranges is still in use by the clone
>>       add_interval_usage(ctx->modified_ranges, ctx->delta_stats);
>>       newest_overlap.subtract(ctx->modified_ranges);
>>     }
>>   }
>>
>> We thought that the latter condition check
>> "is_present_clone(last_clone_oid)" might not be reasonable to be a
>> judgement base for the determination of whether to subtract the clone
>> overlap with write ops' modified range, so we changed to code above to
>> the following, which moved the subtraction out of the latter condition
>> check, and submitted a pr for this:
>> https://github.com/ceph/ceph/pull/16790:
>>
>>   // update most recent clone_overlap and usage stats
>>   if (ctx->new_snapset.clones.size() > 0) {
>>     /* we need to check whether the most recent clone exists, if it's
>> been evicted,
>>      * it's not included in the stats */
>>     hobject_t last_clone_oid = soid;
>>     last_clone_oid.snap = ctx->new_snapset.clone_overlap.rbegin()->first;
>>     interval_set<uint64_t> &newest_overlap =
>> ctx->new_snapset.clone_overlap.rbegin()->second;
>>     ctx->modified_ranges.intersection_of(newest_overlap);
>>     newest_overlap.subtract(ctx->modified_ranges);
>>
>>     if (is_present_clone(last_clone_oid)) {
>>       // modified_ranges is still in use by the clone
>>       add_interval_usage(ctx->modified_ranges, ctx->delta_stats);
>>     }
>>   }
>>
>> In our test, this change solved the problem successfully, however, we
>> can't confirm that this change won't cause any new problems. So, here
>> we are discussing how to solve this problem:-)
>>
>> On 9 August 2017 at 09:26, Jason Dillaman <jdillama@redhat.com> wrote:
>>> Is this issue related to https://github.com/ceph/ceph/pull/10626?
>>>
>>> On Mon, Aug 7, 2017 at 8:07 PM, Xuehan Xu <xxhdx1985126@gmail.com> wrote:
>>>> OK, I'll try that. Thank you:-)
>>>>
>>>> On 8 August 2017 at 04:48, Jason Dillaman <jdillama@redhat.com> wrote:
>>>>> Could you just proxy the "list snaps" op from the cache tier down to
>>>>> the base tier and combine the cache tier + base tier results? Reading
>>>>> the associated ticket, it seems kludgy to me to attempt to work around
>>>>> this within librbd by just refusing the provide intra-object diffs if
>>>>> cache tiering is in-use.
>>>>>
>>>>> On Sat, Aug 5, 2017 at 11:56 AM, Xuehan Xu <xxhdx1985126@gmail.com> wrote:
>>>>>> Hi, everyone.
>>>>>>
>>>>>> Trying to solve the issue "http://tracker.ceph.com/issues/20896", I
>>>>>> just did another test: I did some writes to an object
>>>>>> "rbd_data.1ebc6238e1f29.0000000000000000" to raise its "HEAD" object
>>>>>> to the cache tier, after which I specifically write to its offset 0x40
>>>>>> with 4 bytes of random data. Then I used "ceph-objectstore-tool" to
>>>>>> dump its "HEAD" version in the base tier, the result is as
>>>>>> follows(before I raise it to cache tier, there is three snaps the
>>>>>> latest of which is 26):
>>>>>>
>>>>>> {
>>>>>>     "id": {
>>>>>>         "oid": "rbd_data.1ebc6238e1f29.0000000000000000",
>>>>>>         "key": "",
>>>>>>         "snapid": -2,
>>>>>>         "hash": 1655893237,
>>>>>>         "max": 0,
>>>>>>         "pool": 3,
>>>>>>         "namespace": "",
>>>>>>         "max": 0
>>>>>>     },
>>>>>>     "info": {
>>>>>>         "oid": {
>>>>>>             "oid": "rbd_data.1ebc6238e1f29.0000000000000000",
>>>>>>             "key": "",
>>>>>>             "snapid": -2,
>>>>>>             "hash": 1655893237,
>>>>>>             "max": 0,
>>>>>>             "pool": 3,
>>>>>>             "namespace": ""
>>>>>>         },
>>>>>>         "version": "4219'16423",
>>>>>>         "prior_version": "3978'16310",
>>>>>>         "last_reqid": "osd.70.4213:2359",
>>>>>>         "user_version": 17205,
>>>>>>         "size": 4194304,
>>>>>>         "mtime": "2017-08-03 22:07:34.656122",
>>>>>>         "local_mtime": "2017-08-05 23:02:33.628734",
>>>>>>         "lost": 0,
>>>>>>         "flags": 52,
>>>>>>         "snaps": [],
>>>>>>         "truncate_seq": 0,
>>>>>>         "truncate_size": 0,
>>>>>>         "data_digest": 2822203961,
>>>>>>         "omap_digest": 4294967295,
>>>>>>         "watchers": {}
>>>>>>     },
>>>>>>     "stat": {
>>>>>>         "size": 4194304,
>>>>>>         "blksize": 4096,
>>>>>>         "blocks": 8200,
>>>>>>         "nlink": 1
>>>>>>     },
>>>>>>     "SnapSet": {
>>>>>>         "snap_context": {
>>>>>>             "seq": 26,
>>>>>>             "snaps": [
>>>>>>                 26,
>>>>>>                 25,
>>>>>>                 16
>>>>>>             ]
>>>>>>         },
>>>>>>         "head_exists": 1,
>>>>>>         "clones": [
>>>>>>             {
>>>>>>                 "snap": 16,
>>>>>>                 "size": 4194304,
>>>>>>                 "overlap": "[4~4194300]"
>>>>>>             },
>>>>>>             {
>>>>>>                 "snap": 25,
>>>>>>                 "size": 4194304,
>>>>>>                 "overlap": "[]"
>>>>>>             },
>>>>>>             {
>>>>>>                 "snap": 26,
>>>>>>                 "size": 4194304,
>>>>>>                 "overlap": "[]"
>>>>>>             }
>>>>>>         ]
>>>>>>     }
>>>>>> }
>>>>>>
>>>>>> As we can see, its clone_overlap for snap 26 is empty, which,
>>>>>> combining with the previous test described in
>>>>>> http://tracker.ceph.com/issues/20896, means that the writes' "modified
>>>>>> range" is neither recorded in the cache tier nor in the base tier.
>>>>>>
>>>>>> I think maybe we really should move the clone overlap modification out
>>>>>> of the IF block which has the condition check "is_present_clone". As
>>>>>> for now, I can't see any other way to fix this problem.
>>>>>>
>>>>>> Am I right about this?
>>>>>>
>>>>>> On 4 August 2017 at 03:14, Xuehan Xu <xxhdx1985126@gmail.com> wrote:
>>>>>>> I mean I think it's the condition check "is_present_clone" that
>>>>>>> prevent the clone overlap to record the client write operations
>>>>>>> modified range when the target "HEAD" object exists without its most
>>>>>>> recent clone object, and if I'm right, just move the clone overlap
>>>>>>> modification out of the "is_present_clone" condition check block is
>>>>>>> enough to solve this case, just like the PR
>>>>>>> "https://github.com/ceph/ceph/pull/16790", and this fix wouldn't cause
>>>>>>> other problems.
>>>>>>>
>>>>>>> In our test, this fix solved the problem successfully, however, we
>>>>>>> can't confirm it won't cause new problems yet.
>>>>>>>
>>>>>>> So if anyone see this and knows the answer, please help us. Thank you:-)
>>>>>>>
>>>>>>> On 4 August 2017 at 11:41, Xuehan Xu <xxhdx1985126@gmail.com> wrote:
>>>>>>>> Hi, grep:-)
>>>>>>>>
>>>>>>>> I finally got what you mean in https://github.com/ceph/ceph/pull/16790.
>>>>>>>>
>>>>>>>> I agree with you in that " clone overlap is supposed to be tracking
>>>>>>>> which data is the same on disk".
>>>>>>>>
>>>>>>>> My thought is that, "ObjectContext::new_snapset.clones" is already an
>>>>>>>> indicator about whether there are clone objects on disk, so, in the
>>>>>>>> scenario of "cache tier", although a clone oid does not corresponds to
>>>>>>>> a "present clone" in cache tier, as long as
>>>>>>>> "ObjectContext::new_snapset.clones" is not empty, there must a one
>>>>>>>> such clone object in the base tier. And, as long as
>>>>>>>> "ObjectContext::new_snapset.clones" has a strict "one-to-one"
>>>>>>>> correspondence to "ObjectContext::new_snapset.clone_overlap", passing
>>>>>>>> the condition check "if (ctx->new_snapset.clones.size() > 0)" is
>>>>>>>> enough to make the judgement that the clone object exists.
>>>>>>>>
>>>>>>>> So, if I'm right, passing the condition check "if
>>>>>>>> (ctx->new_snapset.clones.size() > 0)" is already enough for us to do
>>>>>>>> "newest_overlap.subtract(ctx->modified_ranges)", it doesn't have to
>>>>>>>> pass "is_present_clone".
>>>>>>>>
>>>>>>>> Am I right about this? Or am I missing anything?
>>>>>>>>
>>>>>>>> Please help us, thank you:-)
>>>>>> --
>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Jason
>>>
>>>
>>>
>>> --
>>> Jason



-- 
Jason

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: About the problem "export_diff relies on clone_overlap, which is lost when cache tier is enabled"
  2017-08-09 15:26                 ` Jason Dillaman
@ 2017-08-10  7:06                   ` Xuehan Xu
  2017-08-10  8:11                     ` Xuehan Xu
  2017-08-10 13:11                     ` Jason Dillaman
  0 siblings, 2 replies; 14+ messages in thread
From: Xuehan Xu @ 2017-08-10  7:06 UTC (permalink / raw)
  To: dillaman; +Cc: Gregory Farnum, ceph-devel

Hi, Jason.

I did a test, and it turned out that, after flushing the object out of
the cache tier, the clone overlap in the base tier changed to empty, as
is shown below. I think this may be because the flush operation simply
marks the whole range of the flushed object as modified, so if the
object's size hasn't changed, the overlap is subtracted down to empty.
Is this right?

Thank you:-)

{
    "id": {
        "oid": "test.obj",
        "key": "",
        "snapid": -2,
        "hash": 3575411564,
        "max": 0,
        "pool": 10,
        "namespace": "",
        "max": 0
    },
    "info": {
        "oid": {
            "oid": "test.obj",
            "key": "",
            "snapid": -2,
            "hash": 3575411564,
            "max": 0,
            "pool": 10,
            "namespace": ""
        },
        "version": "4876'9",
        "prior_version": "4854'8",
        "last_reqid": "osd.35.4869:1",
        "user_version": 16,
        "size": 4194303,
        "mtime": "2017-08-10 14:54:56.087387",
        "local_mtime": "2017-08-10 14:59:15.252755",
        "lost": 0,
        "flags": 52,
        "snaps": [],
        "truncate_seq": 0,
        "truncate_size": 0,
        "data_digest": 2827420887,
        "omap_digest": 4294967295,
        "watchers": {}
    },
    "stat": {
        "size": 4194303,
        "blksize": 4096,
        "blocks": 8200,
        "nlink": 1
    },
    "SnapSet": {
        "snap_context": {
            "seq": 3,
            "snaps": [
                3
            ]
        },
        "head_exists": 1,
        "clones": [
            {
                "snap": 3,
                "size": 4194303,
                "overlap": "[]"
            }
        ]
    }
}
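
To make the arithmetic behind this guess concrete, here is a minimal,
self-contained sketch (not Ceph code; the tiny IntervalSet stand-in and
all of its names are hypothetical, and it only mimics the small subset
of interval_set<uint64_t> behaviour needed here). It shows that if the
flush's modified range covers the whole object, subtracting it from the
clone overlap leaves the overlap empty, which would match the dump
above:

#include <cstdint>
#include <iostream>
#include <map>

// offset -> length, non-overlapping intervals
struct IntervalSet {
  std::map<uint64_t, uint64_t> m;
  void insert(uint64_t off, uint64_t len) { m[off] = len; }
  // remove [off, off+len) from every interval we hold
  void subtract(uint64_t off, uint64_t len) {
    std::map<uint64_t, uint64_t> out;
    for (const auto& [o, l] : m) {
      uint64_t end = o + l, send = off + len;
      if (send <= o || off >= end) { out[o] = l; continue; }  // no overlap
      if (o < off) out[o] = off - o;           // keep the left piece
      if (send < end) out[send] = end - send;  // keep the right piece
    }
    m.swap(out);
  }
  bool empty() const { return m.empty(); }
};

int main() {
  const uint64_t object_size = 4194303;
  IntervalSet overlap;                  // clone overlap before the flush
  overlap.insert(0, object_size);
  IntervalSet modified;                 // what the flush is assumed to dirty
  modified.insert(0, object_size);
  for (const auto& [off, len] : modified.m)
    overlap.subtract(off, len);
  std::cout << "overlap empty after flush: " << std::boolalpha
            << overlap.empty() << std::endl;   // prints: true
  return 0;
}

If, on the other hand, the flush only dirtied the ranges that actually
changed in the cache tier, the same subtraction would leave the
untouched ranges in the overlap.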

On 9 August 2017 at 23:26, Jason Dillaman <jdillama@redhat.com> wrote:
> If you flush the object out of the cache tier so that its changes are
> recorded in the base tier, is the overlap correctly recorded in the
> base tier?
>
> On Wed, Aug 9, 2017 at 12:24 AM, Xuehan Xu <xxhdx1985126@gmail.com> wrote:
>> By the way, according to our test, since the modified range is not
>> recorded either in the cache tier or in the base tier, I think
>> proxying the "list-snaps" down to the base tier might not work as
>> well, is that right?
>>
>> On 9 August 2017 at 12:20, Xuehan Xu <xxhdx1985126@gmail.com> wrote:
>>> Sorry, I didn't "reply all"....:-)
>>>
>>>
>>> ---------- Forwarded message ----------
>>> From: Xuehan Xu <xxhdx1985126@gmail.com>
>>> Date: 9 August 2017 at 12:14
>>> Subject: Re: About the problem "export_diff relies on clone_overlap,
>>> which is lost when cache tier is enabled"
>>> To: dillaman@redhat.com
>>>
>>>
>>> Um, no, I don't think they are related.
>>>
>>> My problem is this:
>>>
>>> At first , we tried to use "rbd export-diff" to do incremental backup
>>> on Jewel verion ceph cluster which is cache-tier enabled. However,
>>> when we compare the original rbd image and the backup rbd image, we
>>> find that they are different. After a series of debugging, we found
>>> that this is because WRITE ops' "modified_range" is not substracted
>>> from the clone overlap of its targeting object's HEAD version when
>>> that object's HEAD verion is in cache iter and its most recent clone
>>> object is not, which led to the miscalculation of the
>>> "calc_snap_set_diff" function.
>>>
>>> For example, we did such a test: we first made created a snap for an
>>> rbd image "test.2.img" whose size is only 4MB which means it only
>>> contains one object; then, we sent a series of AioWrites to
>>> "test.2.img" to promote its HEAD object into cache tier, while leaving
>>> its clone object in the base tier only; at that time, we used
>>> "ceph-objectstore-tool" to dump the object we wrote to, and the result
>>> was as follows:
>>>
>>> {
>>>     "id": {
>>>         "oid": "rbd_data.2aae62ae8944a.0000000000000000",
>>>         "key": "",
>>>         "snapid": -2,
>>>         "hash": 2375431681,
>>>         "max": 0,
>>>         "pool": 4,
>>>         "namespace": "",
>>>         "max": 0
>>>     },
>>>     "info": {
>>>         "oid": {
>>>             "oid": "rbd_data.2aae62ae8944a.0000000000000000",
>>>             "key": "",
>>>             "snapid": -2,
>>>             "hash": 2375431681,
>>>             "max": 0,
>>>             "pool": 4,
>>>             "namespace": ""
>>>         },
>>>         "version": "4536'2728",
>>>         "prior_version": "4536'2727",
>>>         "last_reqid": "client.174858.0:10",
>>>         "user_version": 14706,
>>>         "size": 68,
>>>         "mtime": "2017-08-09 11:30:18.263983",
>>>         "local_mtime": "2017-08-09 11:30:18.264310",
>>>         "lost": 0,
>>>         "flags": 4,
>>>         "snaps": [],
>>>         "truncate_seq": 0,
>>>         "truncate_size": 0,
>>>         "data_digest": 4294967295,
>>>         "omap_digest": 4294967295,
>>>         "watchers": {}
>>>     },
>>>     "stat": {
>>>         "size": 68,
>>>         "blksize": 4096,
>>>         "blocks": 16,
>>>         "nlink": 1
>>>     },
>>>     "SnapSet": {
>>>         "snap_context": {
>>>             "seq": 28,
>>>             "snaps": [
>>>                 28
>>>             ]
>>>         },
>>>         "head_exists": 1,
>>>         "clones": [
>>>             {
>>>                 "snap": 28,
>>>                 "size": 68,
>>>                 "overlap": "[0~64]"
>>>             }
>>>         ]
>>>     }
>>> }
>>>
>>> In this result, we found that the overlap for clone "28" is [0~64]. So
>>> we specifically sent a AioWrite req to this object to write to the
>>> offset 32 with 4 bytes of ramdon data, which we thought the overlap
>>> should change to [0~32, 36~64]. However, the result is as follows:
>>>
>>> {
>>>     "id": {
>>>         "oid": "rbd_data.2aae62ae8944a.0000000000000000",
>>>         "key": "",
>>>         "snapid": -2,
>>>         "hash": 2375431681,
>>>         "max": 0,
>>>         "pool": 4,
>>>         "namespace": "",
>>>         "max": 0
>>>     },
>>>     "info": {
>>>         "oid": {
>>>             "oid": "rbd_data.2aae62ae8944a.0000000000000000",
>>>             "key": "",
>>>             "snapid": -2,
>>>             "hash": 2375431681,
>>>             "max": 0,
>>>             "pool": 4,
>>>             "namespace": ""
>>>         },
>>>         "version": "4546'2730",
>>>         "prior_version": "4538'2729",
>>>         "last_reqid": "client.155555.0:10",
>>>         "user_version": 14708,
>>>         "size": 4096,
>>>         "mtime": "2017-08-09 11:36:20.717910",
>>>         "local_mtime": "2017-08-09 11:36:20.719039",
>>>         "lost": 0,
>>>         "flags": 4,
>>>         "snaps": [],
>>>         "truncate_seq": 0,
>>>         "truncate_size": 0,
>>>         "data_digest": 4294967295,
>>>         "omap_digest": 4294967295,
>>>         "watchers": {}
>>>     },
>>>     "stat": {
>>>         "size": 4096,
>>>         "blksize": 4096,
>>>         "blocks": 16,
>>>         "nlink": 1
>>>     },
>>>     "SnapSet": {
>>>         "snap_context": {
>>>             "seq": 28,
>>>             "snaps": [
>>>                 28
>>>             ]
>>>         },
>>>         "head_exists": 1,
>>>         "clones": [
>>>             {
>>>                 "snap": 28,
>>>                 "size": 68,
>>>                 "overlap": "[0~64]"
>>>             }
>>>         ]
>>>     }
>>> }
>>>
>>> It's obvious that it didn't change at all. If we do export-diff under
>>> this circumstance, the result would be wrong. Meanwhile, in the base
>>> tier, the "ceph-objectstore-tool" dump's result also showed that the
>>> overlap recorded in the base tier didn't change also:
>>> {
>>>     "id": {
>>>         "oid": "rbd_data.2aae62ae8944a.0000000000000000",
>>>         "key": "",
>>>         "snapid": -2,
>>>         "hash": 2375431681,
>>>         "max": 0,
>>>         "pool": 3,
>>>         "namespace": "",
>>>         "max": 0
>>>     },
>>>     "info": {
>>>         "oid": {
>>>             "oid": "rbd_data.2aae62ae8944a.0000000000000000",
>>>             "key": "",
>>>             "snapid": -2,
>>>             "hash": 2375431681,
>>>             "max": 0,
>>>             "pool": 3,
>>>             "namespace": ""
>>>         },
>>>         "version": "4536'14459",
>>>         "prior_version": "4536'14458",
>>>         "last_reqid": "client.174834.0:10",
>>>         "user_version": 14648,
>>>         "size": 68,
>>>         "mtime": "2017-08-09 11:30:01.412634",
>>>         "local_mtime": "2017-08-09 11:30:01.413614",
>>>         "lost": 0,
>>>         "flags": 36,
>>>         "snaps": [],
>>>         "truncate_seq": 0,
>>>         "truncate_size": 0,
>>>         "data_digest": 4294967295,
>>>         "omap_digest": 4294967295,
>>>         "watchers": {}
>>>     },
>>>     "stat": {
>>>         "size": 68,
>>>         "blksize": 4096,
>>>         "blocks": 16,
>>>         "nlink": 1
>>>     },
>>>     "SnapSet": {
>>>         "snap_context": {
>>>             "seq": 28,
>>>             "snaps": [
>>>                 28
>>>             ]
>>>         },
>>>         "head_exists": 1,
>>>         "clones": [
>>>             {
>>>                 "snap": 28,
>>>                 "size": 68,
>>>                 "overlap": "[0~64]"
>>>             }
>>>         ]
>>>     }
>>> }
>>>
>>> Then we turn to the source code to find the reason for this. And we
>>> found that, the reason should be that, in the
>>> ReplicatedPG::make_writeable method, when determining whether the
>>> write op's modified range should be subtracted from the clone overlap,
>>> it has pass two condition check: "ctx->new_snapset.clones.size() > 0"
>>> and "is_present_clone(last_clone_oid)", as the code below shows.
>>>
>>>   // update most recent clone_overlap and usage stats
>>>   if (ctx->new_snapset.clones.size() > 0) {
>>>     /* we need to check whether the most recent clone exists, if it's
>>> been evicted,
>>>      * it's not included in the stats */
>>>     hobject_t last_clone_oid = soid;
>>>     last_clone_oid.snap = ctx->new_snapset.clone_overlap.rbegin()->first;
>>>     if (is_present_clone(last_clone_oid)) {
>>>       interval_set<uint64_t> &newest_overlap =
>>> ctx->new_snapset.clone_overlap.rbegin()->second;
>>>       ctx->modified_ranges.intersection_of(newest_overlap);
>>>       // modified_ranges is still in use by the clone
>>>       add_interval_usage(ctx->modified_ranges, ctx->delta_stats);
>>>       newest_overlap.subtract(ctx->modified_ranges);
>>>     }
>>>   }
>>>
>>> We thought that the latter condition check
>>> "is_present_clone(last_clone_oid)" might not be reasonable to be a
>>> judgement base for the determination of whether to subtract the clone
>>> overlap with write ops' modified range, so we changed to code above to
>>> the following, which moved the subtraction out of the latter condition
>>> check, and submitted a pr for this:
>>> https://github.com/ceph/ceph/pull/16790:
>>>
>>>   // update most recent clone_overlap and usage stats
>>>   if (ctx->new_snapset.clones.size() > 0) {
>>>     /* we need to check whether the most recent clone exists, if it's
>>> been evicted,
>>>      * it's not included in the stats */
>>>     hobject_t last_clone_oid = soid;
>>>     last_clone_oid.snap = ctx->new_snapset.clone_overlap.rbegin()->first;
>>>     interval_set<uint64_t> &newest_overlap =
>>> ctx->new_snapset.clone_overlap.rbegin()->second;
>>>     ctx->modified_ranges.intersection_of(newest_overlap);
>>>     newest_overlap.subtract(ctx->modified_ranges);
>>>
>>>     if (is_present_clone(last_clone_oid)) {
>>>       // modified_ranges is still in use by the clone
>>>       add_interval_usage(ctx->modified_ranges, ctx->delta_stats);
>>>     }
>>>   }
>>>
>>> In our test, this change solved the problem successfully, however, we
>>> can't confirm that this change won't cause any new problems. So, here
>>> we are discussing how to solve this problem:-)
>>>
>>> On 9 August 2017 at 09:26, Jason Dillaman <jdillama@redhat.com> wrote:
>>>> Is this issue related to https://github.com/ceph/ceph/pull/10626?
>>>>
>>>> On Mon, Aug 7, 2017 at 8:07 PM, Xuehan Xu <xxhdx1985126@gmail.com> wrote:
>>>>> OK, I'll try that. Thank you:-)
>>>>>
>>>>> On 8 August 2017 at 04:48, Jason Dillaman <jdillama@redhat.com> wrote:
>>>>>> Could you just proxy the "list snaps" op from the cache tier down to
>>>>>> the base tier and combine the cache tier + base tier results? Reading
>>>>>> the associated ticket, it seems kludgy to me to attempt to work around
>>>>>> this within librbd by just refusing the provide intra-object diffs if
>>>>>> cache tiering is in-use.
>>>>>>
>>>>>> On Sat, Aug 5, 2017 at 11:56 AM, Xuehan Xu <xxhdx1985126@gmail.com> wrote:
>>>>>>> Hi, everyone.
>>>>>>>
>>>>>>> Trying to solve the issue "http://tracker.ceph.com/issues/20896", I
>>>>>>> just did another test: I did some writes to an object
>>>>>>> "rbd_data.1ebc6238e1f29.0000000000000000" to raise its "HEAD" object
>>>>>>> to the cache tier, after which I specifically write to its offset 0x40
>>>>>>> with 4 bytes of random data. Then I used "ceph-objectstore-tool" to
>>>>>>> dump its "HEAD" version in the base tier, the result is as
>>>>>>> follows(before I raise it to cache tier, there is three snaps the
>>>>>>> latest of which is 26):
>>>>>>>
>>>>>>> {
>>>>>>>     "id": {
>>>>>>>         "oid": "rbd_data.1ebc6238e1f29.0000000000000000",
>>>>>>>         "key": "",
>>>>>>>         "snapid": -2,
>>>>>>>         "hash": 1655893237,
>>>>>>>         "max": 0,
>>>>>>>         "pool": 3,
>>>>>>>         "namespace": "",
>>>>>>>         "max": 0
>>>>>>>     },
>>>>>>>     "info": {
>>>>>>>         "oid": {
>>>>>>>             "oid": "rbd_data.1ebc6238e1f29.0000000000000000",
>>>>>>>             "key": "",
>>>>>>>             "snapid": -2,
>>>>>>>             "hash": 1655893237,
>>>>>>>             "max": 0,
>>>>>>>             "pool": 3,
>>>>>>>             "namespace": ""
>>>>>>>         },
>>>>>>>         "version": "4219'16423",
>>>>>>>         "prior_version": "3978'16310",
>>>>>>>         "last_reqid": "osd.70.4213:2359",
>>>>>>>         "user_version": 17205,
>>>>>>>         "size": 4194304,
>>>>>>>         "mtime": "2017-08-03 22:07:34.656122",
>>>>>>>         "local_mtime": "2017-08-05 23:02:33.628734",
>>>>>>>         "lost": 0,
>>>>>>>         "flags": 52,
>>>>>>>         "snaps": [],
>>>>>>>         "truncate_seq": 0,
>>>>>>>         "truncate_size": 0,
>>>>>>>         "data_digest": 2822203961,
>>>>>>>         "omap_digest": 4294967295,
>>>>>>>         "watchers": {}
>>>>>>>     },
>>>>>>>     "stat": {
>>>>>>>         "size": 4194304,
>>>>>>>         "blksize": 4096,
>>>>>>>         "blocks": 8200,
>>>>>>>         "nlink": 1
>>>>>>>     },
>>>>>>>     "SnapSet": {
>>>>>>>         "snap_context": {
>>>>>>>             "seq": 26,
>>>>>>>             "snaps": [
>>>>>>>                 26,
>>>>>>>                 25,
>>>>>>>                 16
>>>>>>>             ]
>>>>>>>         },
>>>>>>>         "head_exists": 1,
>>>>>>>         "clones": [
>>>>>>>             {
>>>>>>>                 "snap": 16,
>>>>>>>                 "size": 4194304,
>>>>>>>                 "overlap": "[4~4194300]"
>>>>>>>             },
>>>>>>>             {
>>>>>>>                 "snap": 25,
>>>>>>>                 "size": 4194304,
>>>>>>>                 "overlap": "[]"
>>>>>>>             },
>>>>>>>             {
>>>>>>>                 "snap": 26,
>>>>>>>                 "size": 4194304,
>>>>>>>                 "overlap": "[]"
>>>>>>>             }
>>>>>>>         ]
>>>>>>>     }
>>>>>>> }
>>>>>>>
>>>>>>> As we can see, its clone_overlap for snap 26 is empty, which,
>>>>>>> combining with the previous test described in
>>>>>>> http://tracker.ceph.com/issues/20896, means that the writes' "modified
>>>>>>> range" is neither recorded in the cache tier nor in the base tier.
>>>>>>>
>>>>>>> I think maybe we really should move the clone overlap modification out
>>>>>>> of the IF block which has the condition check "is_present_clone". As
>>>>>>> for now, I can't see any other way to fix this problem.
>>>>>>>
>>>>>>> Am I right about this?
>>>>>>>
>>>>>>> On 4 August 2017 at 03:14, Xuehan Xu <xxhdx1985126@gmail.com> wrote:
>>>>>>>> I mean I think it's the condition check "is_present_clone" that
>>>>>>>> prevent the clone overlap to record the client write operations
>>>>>>>> modified range when the target "HEAD" object exists without its most
>>>>>>>> recent clone object, and if I'm right, just move the clone overlap
>>>>>>>> modification out of the "is_present_clone" condition check block is
>>>>>>>> enough to solve this case, just like the PR
>>>>>>>> "https://github.com/ceph/ceph/pull/16790", and this fix wouldn't cause
>>>>>>>> other problems.
>>>>>>>>
>>>>>>>> In our test, this fix solved the problem successfully, however, we
>>>>>>>> can't confirm it won't cause new problems yet.
>>>>>>>>
>>>>>>>> So if anyone see this and knows the answer, please help us. Thank you:-)
>>>>>>>>
>>>>>>>> On 4 August 2017 at 11:41, Xuehan Xu <xxhdx1985126@gmail.com> wrote:
>>>>>>>>> Hi, grep:-)
>>>>>>>>>
>>>>>>>>> I finally got what you mean in https://github.com/ceph/ceph/pull/16790.
>>>>>>>>>
>>>>>>>>> I agree with you in that " clone overlap is supposed to be tracking
>>>>>>>>> which data is the same on disk".
>>>>>>>>>
>>>>>>>>> My thought is that, "ObjectContext::new_snapset.clones" is already an
>>>>>>>>> indicator about whether there are clone objects on disk, so, in the
>>>>>>>>> scenario of "cache tier", although a clone oid does not corresponds to
>>>>>>>>> a "present clone" in cache tier, as long as
>>>>>>>>> "ObjectContext::new_snapset.clones" is not empty, there must a one
>>>>>>>>> such clone object in the base tier. And, as long as
>>>>>>>>> "ObjectContext::new_snapset.clones" has a strict "one-to-one"
>>>>>>>>> correspondence to "ObjectContext::new_snapset.clone_overlap", passing
>>>>>>>>> the condition check "if (ctx->new_snapset.clones.size() > 0)" is
>>>>>>>>> enough to make the judgement that the clone object exists.
>>>>>>>>>
>>>>>>>>> So, if I'm right, passing the condition check "if
>>>>>>>>> (ctx->new_snapset.clones.size() > 0)" is already enough for us to do
>>>>>>>>> "newest_overlap.subtract(ctx->modified_ranges)", it doesn't have to
>>>>>>>>> pass "is_present_clone".
>>>>>>>>>
>>>>>>>>> Am I right about this? Or am I missing anything?
>>>>>>>>>
>>>>>>>>> Please help us, thank you:-)
>>>>>>> --
>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Jason
>>>>
>>>>
>>>>
>>>> --
>>>> Jason
>
>
>
> --
> Jason

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: About the problem "export_diff relies on clone_overlap, which is lost when cache tier is enabled"
  2017-08-10  7:06                   ` Xuehan Xu
@ 2017-08-10  8:11                     ` Xuehan Xu
  2017-08-10  9:48                       ` Xuehan Xu
  2017-08-10 13:11                     ` Jason Dillaman
  1 sibling, 1 reply; 14+ messages in thread
From: Xuehan Xu @ 2017-08-10  8:11 UTC (permalink / raw)
  To: dillaman; +Cc: Gregory Farnum, ceph-devel

By the way, I'm running these tests on version 10.2.5.

On 10 August 2017 at 15:06, Xuehan Xu <xxhdx1985126@gmail.com> wrote:
> Hi, Jason.
>
> I did a test, it turned out that, after flushing the object out of the
> cache tier, the clone overlap in base tier changed to empty, as is
> shown below. I think this maybe because that the flush operation just
> mark the whole range of the object being flushed as modified, so if
> the object's size hasn't changed, the overlap becomes empty. Is this
> right?
>
> Thank you:-)
>
> {
>     "id": {
>         "oid": "test.obj",
>         "key": "",
>         "snapid": -2,
>         "hash": 3575411564,
>         "max": 0,
>         "pool": 10,
>         "namespace": "",
>         "max": 0
>     },
>     "info": {
>         "oid": {
>             "oid": "test.obj",
>             "key": "",
>             "snapid": -2,
>             "hash": 3575411564,
>             "max": 0,
>             "pool": 10,
>             "namespace": ""
>         },
>         "version": "4876'9",
>         "prior_version": "4854'8",
>         "last_reqid": "osd.35.4869:1",
>         "user_version": 16,
>         "size": 4194303,
>         "mtime": "2017-08-10 14:54:56.087387",
>         "local_mtime": "2017-08-10 14:59:15.252755",
>         "lost": 0,
>         "flags": 52,
>         "snaps": [],
>         "truncate_seq": 0,
>         "truncate_size": 0,
>         "data_digest": 2827420887,
>         "omap_digest": 4294967295,
>         "watchers": {}
>     },
>     "stat": {
>         "size": 4194303,
>         "blksize": 4096,
>         "blocks": 8200,
>         "nlink": 1
>     },
>     "SnapSet": {
>         "snap_context": {
>             "seq": 3,
>             "snaps": [
>                 3
>             ]
>         },
>         "head_exists": 1,
>         "clones": [
>             {
>                 "snap": 3,
>                 "size": 4194303,
>                 "overlap": "[]"
>             }
>         ]
>     }
> }
>
> On 9 August 2017 at 23:26, Jason Dillaman <jdillama@redhat.com> wrote:
>> If you flush the object out of the cache tier so that its changes are
>> recorded in the base tier, is the overlap correctly recorded in the
>> base tier?
>>
>> On Wed, Aug 9, 2017 at 12:24 AM, Xuehan Xu <xxhdx1985126@gmail.com> wrote:
>>> By the way, according to our test, since the modified range is not
>>> recorded either in the cache tier or in the base tier, I think
>>> proxying the "list-snaps" down to the base tier might not work as
>>> well, is that right?
>>>
>>> On 9 August 2017 at 12:20, Xuehan Xu <xxhdx1985126@gmail.com> wrote:
>>>> Sorry, I didn't "reply all"....:-)
>>>>
>>>>
>>>> ---------- Forwarded message ----------
>>>> From: Xuehan Xu <xxhdx1985126@gmail.com>
>>>> Date: 9 August 2017 at 12:14
>>>> Subject: Re: About the problem "export_diff relies on clone_overlap,
>>>> which is lost when cache tier is enabled"
>>>> To: dillaman@redhat.com
>>>>
>>>>
>>>> Um, no, I don't think they are related.
>>>>
>>>> My problem is this:
>>>>
>>>> At first , we tried to use "rbd export-diff" to do incremental backup
>>>> on Jewel verion ceph cluster which is cache-tier enabled. However,
>>>> when we compare the original rbd image and the backup rbd image, we
>>>> find that they are different. After a series of debugging, we found
>>>> that this is because WRITE ops' "modified_range" is not substracted
>>>> from the clone overlap of its targeting object's HEAD version when
>>>> that object's HEAD verion is in cache iter and its most recent clone
>>>> object is not, which led to the miscalculation of the
>>>> "calc_snap_set_diff" function.
>>>>
>>>> For example, we did such a test: we first made created a snap for an
>>>> rbd image "test.2.img" whose size is only 4MB which means it only
>>>> contains one object; then, we sent a series of AioWrites to
>>>> "test.2.img" to promote its HEAD object into cache tier, while leaving
>>>> its clone object in the base tier only; at that time, we used
>>>> "ceph-objectstore-tool" to dump the object we wrote to, and the result
>>>> was as follows:
>>>>
>>>> {
>>>>     "id": {
>>>>         "oid": "rbd_data.2aae62ae8944a.0000000000000000",
>>>>         "key": "",
>>>>         "snapid": -2,
>>>>         "hash": 2375431681,
>>>>         "max": 0,
>>>>         "pool": 4,
>>>>         "namespace": "",
>>>>         "max": 0
>>>>     },
>>>>     "info": {
>>>>         "oid": {
>>>>             "oid": "rbd_data.2aae62ae8944a.0000000000000000",
>>>>             "key": "",
>>>>             "snapid": -2,
>>>>             "hash": 2375431681,
>>>>             "max": 0,
>>>>             "pool": 4,
>>>>             "namespace": ""
>>>>         },
>>>>         "version": "4536'2728",
>>>>         "prior_version": "4536'2727",
>>>>         "last_reqid": "client.174858.0:10",
>>>>         "user_version": 14706,
>>>>         "size": 68,
>>>>         "mtime": "2017-08-09 11:30:18.263983",
>>>>         "local_mtime": "2017-08-09 11:30:18.264310",
>>>>         "lost": 0,
>>>>         "flags": 4,
>>>>         "snaps": [],
>>>>         "truncate_seq": 0,
>>>>         "truncate_size": 0,
>>>>         "data_digest": 4294967295,
>>>>         "omap_digest": 4294967295,
>>>>         "watchers": {}
>>>>     },
>>>>     "stat": {
>>>>         "size": 68,
>>>>         "blksize": 4096,
>>>>         "blocks": 16,
>>>>         "nlink": 1
>>>>     },
>>>>     "SnapSet": {
>>>>         "snap_context": {
>>>>             "seq": 28,
>>>>             "snaps": [
>>>>                 28
>>>>             ]
>>>>         },
>>>>         "head_exists": 1,
>>>>         "clones": [
>>>>             {
>>>>                 "snap": 28,
>>>>                 "size": 68,
>>>>                 "overlap": "[0~64]"
>>>>             }
>>>>         ]
>>>>     }
>>>> }
>>>>
>>>> In this result, we found that the overlap for clone "28" is [0~64]. So
>>>> we specifically sent a AioWrite req to this object to write to the
>>>> offset 32 with 4 bytes of ramdon data, which we thought the overlap
>>>> should change to [0~32, 36~64]. However, the result is as follows:
>>>>
>>>> {
>>>>     "id": {
>>>>         "oid": "rbd_data.2aae62ae8944a.0000000000000000",
>>>>         "key": "",
>>>>         "snapid": -2,
>>>>         "hash": 2375431681,
>>>>         "max": 0,
>>>>         "pool": 4,
>>>>         "namespace": "",
>>>>         "max": 0
>>>>     },
>>>>     "info": {
>>>>         "oid": {
>>>>             "oid": "rbd_data.2aae62ae8944a.0000000000000000",
>>>>             "key": "",
>>>>             "snapid": -2,
>>>>             "hash": 2375431681,
>>>>             "max": 0,
>>>>             "pool": 4,
>>>>             "namespace": ""
>>>>         },
>>>>         "version": "4546'2730",
>>>>         "prior_version": "4538'2729",
>>>>         "last_reqid": "client.155555.0:10",
>>>>         "user_version": 14708,
>>>>         "size": 4096,
>>>>         "mtime": "2017-08-09 11:36:20.717910",
>>>>         "local_mtime": "2017-08-09 11:36:20.719039",
>>>>         "lost": 0,
>>>>         "flags": 4,
>>>>         "snaps": [],
>>>>         "truncate_seq": 0,
>>>>         "truncate_size": 0,
>>>>         "data_digest": 4294967295,
>>>>         "omap_digest": 4294967295,
>>>>         "watchers": {}
>>>>     },
>>>>     "stat": {
>>>>         "size": 4096,
>>>>         "blksize": 4096,
>>>>         "blocks": 16,
>>>>         "nlink": 1
>>>>     },
>>>>     "SnapSet": {
>>>>         "snap_context": {
>>>>             "seq": 28,
>>>>             "snaps": [
>>>>                 28
>>>>             ]
>>>>         },
>>>>         "head_exists": 1,
>>>>         "clones": [
>>>>             {
>>>>                 "snap": 28,
>>>>                 "size": 68,
>>>>                 "overlap": "[0~64]"
>>>>             }
>>>>         ]
>>>>     }
>>>> }
>>>>
>>>> It's obvious that it didn't change at all. If we do export-diff under
>>>> this circumstance, the result would be wrong. Meanwhile, in the base
>>>> tier, the "ceph-objectstore-tool" dump's result also showed that the
>>>> overlap recorded in the base tier didn't change also:
>>>> {
>>>>     "id": {
>>>>         "oid": "rbd_data.2aae62ae8944a.0000000000000000",
>>>>         "key": "",
>>>>         "snapid": -2,
>>>>         "hash": 2375431681,
>>>>         "max": 0,
>>>>         "pool": 3,
>>>>         "namespace": "",
>>>>         "max": 0
>>>>     },
>>>>     "info": {
>>>>         "oid": {
>>>>             "oid": "rbd_data.2aae62ae8944a.0000000000000000",
>>>>             "key": "",
>>>>             "snapid": -2,
>>>>             "hash": 2375431681,
>>>>             "max": 0,
>>>>             "pool": 3,
>>>>             "namespace": ""
>>>>         },
>>>>         "version": "4536'14459",
>>>>         "prior_version": "4536'14458",
>>>>         "last_reqid": "client.174834.0:10",
>>>>         "user_version": 14648,
>>>>         "size": 68,
>>>>         "mtime": "2017-08-09 11:30:01.412634",
>>>>         "local_mtime": "2017-08-09 11:30:01.413614",
>>>>         "lost": 0,
>>>>         "flags": 36,
>>>>         "snaps": [],
>>>>         "truncate_seq": 0,
>>>>         "truncate_size": 0,
>>>>         "data_digest": 4294967295,
>>>>         "omap_digest": 4294967295,
>>>>         "watchers": {}
>>>>     },
>>>>     "stat": {
>>>>         "size": 68,
>>>>         "blksize": 4096,
>>>>         "blocks": 16,
>>>>         "nlink": 1
>>>>     },
>>>>     "SnapSet": {
>>>>         "snap_context": {
>>>>             "seq": 28,
>>>>             "snaps": [
>>>>                 28
>>>>             ]
>>>>         },
>>>>         "head_exists": 1,
>>>>         "clones": [
>>>>             {
>>>>                 "snap": 28,
>>>>                 "size": 68,
>>>>                 "overlap": "[0~64]"
>>>>             }
>>>>         ]
>>>>     }
>>>> }
>>>>
>>>> Then we turn to the source code to find the reason for this. And we
>>>> found that, the reason should be that, in the
>>>> ReplicatedPG::make_writeable method, when determining whether the
>>>> write op's modified range should be subtracted from the clone overlap,
>>>> it has pass two condition check: "ctx->new_snapset.clones.size() > 0"
>>>> and "is_present_clone(last_clone_oid)", as the code below shows.
>>>>
>>>>   // update most recent clone_overlap and usage stats
>>>>   if (ctx->new_snapset.clones.size() > 0) {
>>>>     /* we need to check whether the most recent clone exists, if it's
>>>> been evicted,
>>>>      * it's not included in the stats */
>>>>     hobject_t last_clone_oid = soid;
>>>>     last_clone_oid.snap = ctx->new_snapset.clone_overlap.rbegin()->first;
>>>>     if (is_present_clone(last_clone_oid)) {
>>>>       interval_set<uint64_t> &newest_overlap =
>>>> ctx->new_snapset.clone_overlap.rbegin()->second;
>>>>       ctx->modified_ranges.intersection_of(newest_overlap);
>>>>       // modified_ranges is still in use by the clone
>>>>       add_interval_usage(ctx->modified_ranges, ctx->delta_stats);
>>>>       newest_overlap.subtract(ctx->modified_ranges);
>>>>     }
>>>>   }
>>>>
>>>> We thought that the latter condition check
>>>> "is_present_clone(last_clone_oid)" might not be reasonable to be a
>>>> judgement base for the determination of whether to subtract the clone
>>>> overlap with write ops' modified range, so we changed to code above to
>>>> the following, which moved the subtraction out of the latter condition
>>>> check, and submitted a pr for this:
>>>> https://github.com/ceph/ceph/pull/16790:
>>>>
>>>>   // update most recent clone_overlap and usage stats
>>>>   if (ctx->new_snapset.clones.size() > 0) {
>>>>     /* we need to check whether the most recent clone exists, if it's
>>>> been evicted,
>>>>      * it's not included in the stats */
>>>>     hobject_t last_clone_oid = soid;
>>>>     last_clone_oid.snap = ctx->new_snapset.clone_overlap.rbegin()->first;
>>>>     interval_set<uint64_t> &newest_overlap =
>>>> ctx->new_snapset.clone_overlap.rbegin()->second;
>>>>     ctx->modified_ranges.intersection_of(newest_overlap);
>>>>     newest_overlap.subtract(ctx->modified_ranges);
>>>>
>>>>     if (is_present_clone(last_clone_oid)) {
>>>>       // modified_ranges is still in use by the clone
>>>>       add_interval_usage(ctx->modified_ranges, ctx->delta_stats);
>>>>     }
>>>>   }
>>>>
>>>> In our test, this change solved the problem successfully, however, we
>>>> can't confirm that this change won't cause any new problems. So, here
>>>> we are discussing how to solve this problem:-)
>>>>
>>>> On 9 August 2017 at 09:26, Jason Dillaman <jdillama@redhat.com> wrote:
>>>>> Is this issue related to https://github.com/ceph/ceph/pull/10626?
>>>>>
>>>>> On Mon, Aug 7, 2017 at 8:07 PM, Xuehan Xu <xxhdx1985126@gmail.com> wrote:
>>>>>> OK, I'll try that. Thank you:-)
>>>>>>
>>>>>> On 8 August 2017 at 04:48, Jason Dillaman <jdillama@redhat.com> wrote:
>>>>>>> Could you just proxy the "list snaps" op from the cache tier down to
>>>>>>> the base tier and combine the cache tier + base tier results? Reading
>>>>>>> the associated ticket, it seems kludgy to me to attempt to work around
>>>>>>> this within librbd by just refusing the provide intra-object diffs if
>>>>>>> cache tiering is in-use.
>>>>>>>
>>>>>>> On Sat, Aug 5, 2017 at 11:56 AM, Xuehan Xu <xxhdx1985126@gmail.com> wrote:
>>>>>>>> Hi, everyone.
>>>>>>>>
>>>>>>>> Trying to solve the issue "http://tracker.ceph.com/issues/20896", I
>>>>>>>> just did another test: I did some writes to an object
>>>>>>>> "rbd_data.1ebc6238e1f29.0000000000000000" to raise its "HEAD" object
>>>>>>>> to the cache tier, after which I specifically write to its offset 0x40
>>>>>>>> with 4 bytes of random data. Then I used "ceph-objectstore-tool" to
>>>>>>>> dump its "HEAD" version in the base tier, the result is as
>>>>>>>> follows(before I raise it to cache tier, there is three snaps the
>>>>>>>> latest of which is 26):
>>>>>>>>
>>>>>>>> {
>>>>>>>>     "id": {
>>>>>>>>         "oid": "rbd_data.1ebc6238e1f29.0000000000000000",
>>>>>>>>         "key": "",
>>>>>>>>         "snapid": -2,
>>>>>>>>         "hash": 1655893237,
>>>>>>>>         "max": 0,
>>>>>>>>         "pool": 3,
>>>>>>>>         "namespace": "",
>>>>>>>>         "max": 0
>>>>>>>>     },
>>>>>>>>     "info": {
>>>>>>>>         "oid": {
>>>>>>>>             "oid": "rbd_data.1ebc6238e1f29.0000000000000000",
>>>>>>>>             "key": "",
>>>>>>>>             "snapid": -2,
>>>>>>>>             "hash": 1655893237,
>>>>>>>>             "max": 0,
>>>>>>>>             "pool": 3,
>>>>>>>>             "namespace": ""
>>>>>>>>         },
>>>>>>>>         "version": "4219'16423",
>>>>>>>>         "prior_version": "3978'16310",
>>>>>>>>         "last_reqid": "osd.70.4213:2359",
>>>>>>>>         "user_version": 17205,
>>>>>>>>         "size": 4194304,
>>>>>>>>         "mtime": "2017-08-03 22:07:34.656122",
>>>>>>>>         "local_mtime": "2017-08-05 23:02:33.628734",
>>>>>>>>         "lost": 0,
>>>>>>>>         "flags": 52,
>>>>>>>>         "snaps": [],
>>>>>>>>         "truncate_seq": 0,
>>>>>>>>         "truncate_size": 0,
>>>>>>>>         "data_digest": 2822203961,
>>>>>>>>         "omap_digest": 4294967295,
>>>>>>>>         "watchers": {}
>>>>>>>>     },
>>>>>>>>     "stat": {
>>>>>>>>         "size": 4194304,
>>>>>>>>         "blksize": 4096,
>>>>>>>>         "blocks": 8200,
>>>>>>>>         "nlink": 1
>>>>>>>>     },
>>>>>>>>     "SnapSet": {
>>>>>>>>         "snap_context": {
>>>>>>>>             "seq": 26,
>>>>>>>>             "snaps": [
>>>>>>>>                 26,
>>>>>>>>                 25,
>>>>>>>>                 16
>>>>>>>>             ]
>>>>>>>>         },
>>>>>>>>         "head_exists": 1,
>>>>>>>>         "clones": [
>>>>>>>>             {
>>>>>>>>                 "snap": 16,
>>>>>>>>                 "size": 4194304,
>>>>>>>>                 "overlap": "[4~4194300]"
>>>>>>>>             },
>>>>>>>>             {
>>>>>>>>                 "snap": 25,
>>>>>>>>                 "size": 4194304,
>>>>>>>>                 "overlap": "[]"
>>>>>>>>             },
>>>>>>>>             {
>>>>>>>>                 "snap": 26,
>>>>>>>>                 "size": 4194304,
>>>>>>>>                 "overlap": "[]"
>>>>>>>>             }
>>>>>>>>         ]
>>>>>>>>     }
>>>>>>>> }
>>>>>>>>
>>>>>>>> As we can see, its clone_overlap for snap 26 is empty, which,
>>>>>>>> combining with the previous test described in
>>>>>>>> http://tracker.ceph.com/issues/20896, means that the writes' "modified
>>>>>>>> range" is neither recorded in the cache tier nor in the base tier.
>>>>>>>>
>>>>>>>> I think maybe we really should move the clone overlap modification out
>>>>>>>> of the IF block which has the condition check "is_present_clone". As
>>>>>>>> for now, I can't see any other way to fix this problem.
>>>>>>>>
>>>>>>>> Am I right about this?
>>>>>>>>
>>>>>>>> On 4 August 2017 at 03:14, Xuehan Xu <xxhdx1985126@gmail.com> wrote:
>>>>>>>>> I mean I think it's the condition check "is_present_clone" that
>>>>>>>>> prevent the clone overlap to record the client write operations
>>>>>>>>> modified range when the target "HEAD" object exists without its most
>>>>>>>>> recent clone object, and if I'm right, just move the clone overlap
>>>>>>>>> modification out of the "is_present_clone" condition check block is
>>>>>>>>> enough to solve this case, just like the PR
>>>>>>>>> "https://github.com/ceph/ceph/pull/16790", and this fix wouldn't cause
>>>>>>>>> other problems.
>>>>>>>>>
>>>>>>>>> In our test, this fix solved the problem successfully, however, we
>>>>>>>>> can't confirm it won't cause new problems yet.
>>>>>>>>>
>>>>>>>>> So if anyone see this and knows the answer, please help us. Thank you:-)
>>>>>>>>>
>>>>>>>>> On 4 August 2017 at 11:41, Xuehan Xu <xxhdx1985126@gmail.com> wrote:
>>>>>>>>>> Hi, grep:-)
>>>>>>>>>>
>>>>>>>>>> I finally got what you mean in https://github.com/ceph/ceph/pull/16790.
>>>>>>>>>>
>>>>>>>>>> I agree with you in that " clone overlap is supposed to be tracking
>>>>>>>>>> which data is the same on disk".
>>>>>>>>>>
>>>>>>>>>> My thought is that, "ObjectContext::new_snapset.clones" is already an
>>>>>>>>>> indicator about whether there are clone objects on disk, so, in the
>>>>>>>>>> scenario of "cache tier", although a clone oid does not corresponds to
>>>>>>>>>> a "present clone" in cache tier, as long as
>>>>>>>>>> "ObjectContext::new_snapset.clones" is not empty, there must a one
>>>>>>>>>> such clone object in the base tier. And, as long as
>>>>>>>>>> "ObjectContext::new_snapset.clones" has a strict "one-to-one"
>>>>>>>>>> correspondence to "ObjectContext::new_snapset.clone_overlap", passing
>>>>>>>>>> the condition check "if (ctx->new_snapset.clones.size() > 0)" is
>>>>>>>>>> enough to make the judgement that the clone object exists.
>>>>>>>>>>
>>>>>>>>>> So, if I'm right, passing the condition check "if
>>>>>>>>>> (ctx->new_snapset.clones.size() > 0)" is already enough for us to do
>>>>>>>>>> "newest_overlap.subtract(ctx->modified_ranges)", it doesn't have to
>>>>>>>>>> pass "is_present_clone".
>>>>>>>>>>
>>>>>>>>>> Am I right about this? Or am I missing anything?
>>>>>>>>>>
>>>>>>>>>> Please help us, thank you:-)
>>>>>>>> --
>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Jason
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Jason
>>
>>
>>
>> --
>> Jason

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: About the problem "export_diff relies on clone_overlap, which is lost when cache tier is enabled"
  2017-08-10  8:11                     ` Xuehan Xu
@ 2017-08-10  9:48                       ` Xuehan Xu
  0 siblings, 0 replies; 14+ messages in thread
From: Xuehan Xu @ 2017-08-10  9:48 UTC (permalink / raw)
  To: dillaman; +Cc: Gregory Farnum, ceph-devel

Is clone_overlap used only for calculating the changed subsets between
clones, or is it also used for other purposes?
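
For what it's worth, here is a rough, hypothetical sketch (not the
actual calc_snap_set_diff / export-diff code) of the kind of
calculation that depends on clone_overlap: the data that changed after
a clone was taken is, roughly, the clone's extent minus the overlap it
still shares with the newer version.

#include <cstdint>
#include <iostream>
#include <map>

// offset -> length, non-overlapping intervals
using Extents = std::map<uint64_t, uint64_t>;

// Everything in [0, object_size) that is NOT covered by the recorded
// overlap is treated as "changed since the clone was taken".
Extents changed_since_clone(uint64_t object_size, const Extents& overlap) {
  Extents changed;
  uint64_t cursor = 0;
  for (const auto& [off, len] : overlap) {
    if (off > cursor)
      changed[cursor] = off - cursor;       // gap before this overlap run
    cursor = off + len;
  }
  if (cursor < object_size)
    changed[cursor] = object_size - cursor; // tail after the last overlap run
  return changed;
}

int main() {
  // Clone 28 from the dumps earlier in the thread: size 68, overlap [0~64].
  Extents overlap{{0, 64}};
  for (const auto& [off, len] : changed_since_clone(68, overlap))
    std::cout << off << "~" << len << std::endl;   // prints: 64~4
  return 0;
}

With this kind of calculation, a write at offset 32 that is never
subtracted from the overlap simply does not show up in the diff, which
is consistent with the wrong export-diff results described earlier in
the thread.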

On 10 August 2017 at 16:11, Xuehan Xu <xxhdx1985126@gmail.com> wrote:
> By the way, Um..., I'm doing those test on version 10.2.5
>
> On 10 August 2017 at 15:06, Xuehan Xu <xxhdx1985126@gmail.com> wrote:
>> Hi, Jason.
>>
>> I did a test, it turned out that, after flushing the object out of the
>> cache tier, the clone overlap in base tier changed to empty, as is
>> shown below. I think this maybe because that the flush operation just
>> mark the whole range of the object being flushed as modified, so if
>> the object's size hasn't changed, the overlap becomes empty. Is this
>> right?
>>
>> Thank you:-)
>>
>> {
>>     "id": {
>>         "oid": "test.obj",
>>         "key": "",
>>         "snapid": -2,
>>         "hash": 3575411564,
>>         "max": 0,
>>         "pool": 10,
>>         "namespace": "",
>>         "max": 0
>>     },
>>     "info": {
>>         "oid": {
>>             "oid": "test.obj",
>>             "key": "",
>>             "snapid": -2,
>>             "hash": 3575411564,
>>             "max": 0,
>>             "pool": 10,
>>             "namespace": ""
>>         },
>>         "version": "4876'9",
>>         "prior_version": "4854'8",
>>         "last_reqid": "osd.35.4869:1",
>>         "user_version": 16,
>>         "size": 4194303,
>>         "mtime": "2017-08-10 14:54:56.087387",
>>         "local_mtime": "2017-08-10 14:59:15.252755",
>>         "lost": 0,
>>         "flags": 52,
>>         "snaps": [],
>>         "truncate_seq": 0,
>>         "truncate_size": 0,
>>         "data_digest": 2827420887,
>>         "omap_digest": 4294967295,
>>         "watchers": {}
>>     },
>>     "stat": {
>>         "size": 4194303,
>>         "blksize": 4096,
>>         "blocks": 8200,
>>         "nlink": 1
>>     },
>>     "SnapSet": {
>>         "snap_context": {
>>             "seq": 3,
>>             "snaps": [
>>                 3
>>             ]
>>         },
>>         "head_exists": 1,
>>         "clones": [
>>             {
>>                 "snap": 3,
>>                 "size": 4194303,
>>                 "overlap": "[]"
>>             }
>>         ]
>>     }
>> }
>>
>> On 9 August 2017 at 23:26, Jason Dillaman <jdillama@redhat.com> wrote:
>>> If you flush the object out of the cache tier so that its changes are
>>> recorded in the base tier, is the overlap correctly recorded in the
>>> base tier?
>>>
>>> On Wed, Aug 9, 2017 at 12:24 AM, Xuehan Xu <xxhdx1985126@gmail.com> wrote:
>>>> By the way, according to our test, since the modified range is not
>>>> recorded either in the cache tier or in the base tier, I think
>>>> proxying the "list-snaps" down to the base tier might not work as
>>>> well, is that right?
>>>>
>>>> On 9 August 2017 at 12:20, Xuehan Xu <xxhdx1985126@gmail.com> wrote:
>>>>> Sorry, I didn't "reply all"....:-)
>>>>>
>>>>>
>>>>> ---------- Forwarded message ----------
>>>>> From: Xuehan Xu <xxhdx1985126@gmail.com>
>>>>> Date: 9 August 2017 at 12:14
>>>>> Subject: Re: About the problem "export_diff relies on clone_overlap,
>>>>> which is lost when cache tier is enabled"
>>>>> To: dillaman@redhat.com
>>>>>
>>>>>
>>>>> Um, no, I don't think they are related.
>>>>>
>>>>> My problem is this:
>>>>>
>>>>> At first , we tried to use "rbd export-diff" to do incremental backup
>>>>> on Jewel verion ceph cluster which is cache-tier enabled. However,
>>>>> when we compare the original rbd image and the backup rbd image, we
>>>>> find that they are different. After a series of debugging, we found
>>>>> that this is because WRITE ops' "modified_range" is not substracted
>>>>> from the clone overlap of its targeting object's HEAD version when
>>>>> that object's HEAD verion is in cache iter and its most recent clone
>>>>> object is not, which led to the miscalculation of the
>>>>> "calc_snap_set_diff" function.
>>>>>
>>>>> For example, we did such a test: we first made created a snap for an
>>>>> rbd image "test.2.img" whose size is only 4MB which means it only
>>>>> contains one object; then, we sent a series of AioWrites to
>>>>> "test.2.img" to promote its HEAD object into cache tier, while leaving
>>>>> its clone object in the base tier only; at that time, we used
>>>>> "ceph-objectstore-tool" to dump the object we wrote to, and the result
>>>>> was as follows:
>>>>>
>>>>> {
>>>>>     "id": {
>>>>>         "oid": "rbd_data.2aae62ae8944a.0000000000000000",
>>>>>         "key": "",
>>>>>         "snapid": -2,
>>>>>         "hash": 2375431681,
>>>>>         "max": 0,
>>>>>         "pool": 4,
>>>>>         "namespace": "",
>>>>>         "max": 0
>>>>>     },
>>>>>     "info": {
>>>>>         "oid": {
>>>>>             "oid": "rbd_data.2aae62ae8944a.0000000000000000",
>>>>>             "key": "",
>>>>>             "snapid": -2,
>>>>>             "hash": 2375431681,
>>>>>             "max": 0,
>>>>>             "pool": 4,
>>>>>             "namespace": ""
>>>>>         },
>>>>>         "version": "4536'2728",
>>>>>         "prior_version": "4536'2727",
>>>>>         "last_reqid": "client.174858.0:10",
>>>>>         "user_version": 14706,
>>>>>         "size": 68,
>>>>>         "mtime": "2017-08-09 11:30:18.263983",
>>>>>         "local_mtime": "2017-08-09 11:30:18.264310",
>>>>>         "lost": 0,
>>>>>         "flags": 4,
>>>>>         "snaps": [],
>>>>>         "truncate_seq": 0,
>>>>>         "truncate_size": 0,
>>>>>         "data_digest": 4294967295,
>>>>>         "omap_digest": 4294967295,
>>>>>         "watchers": {}
>>>>>     },
>>>>>     "stat": {
>>>>>         "size": 68,
>>>>>         "blksize": 4096,
>>>>>         "blocks": 16,
>>>>>         "nlink": 1
>>>>>     },
>>>>>     "SnapSet": {
>>>>>         "snap_context": {
>>>>>             "seq": 28,
>>>>>             "snaps": [
>>>>>                 28
>>>>>             ]
>>>>>         },
>>>>>         "head_exists": 1,
>>>>>         "clones": [
>>>>>             {
>>>>>                 "snap": 28,
>>>>>                 "size": 68,
>>>>>                 "overlap": "[0~64]"
>>>>>             }
>>>>>         ]
>>>>>     }
>>>>> }
>>>>>
>>>>> In this result, we found that the overlap for clone "28" is [0~64]. So
>>>>> we specifically sent a AioWrite req to this object to write to the
>>>>> offset 32 with 4 bytes of ramdon data, which we thought the overlap
>>>>> should change to [0~32, 36~64]. However, the result is as follows:
>>>>>
>>>>> {
>>>>>     "id": {
>>>>>         "oid": "rbd_data.2aae62ae8944a.0000000000000000",
>>>>>         "key": "",
>>>>>         "snapid": -2,
>>>>>         "hash": 2375431681,
>>>>>         "max": 0,
>>>>>         "pool": 4,
>>>>>         "namespace": "",
>>>>>         "max": 0
>>>>>     },
>>>>>     "info": {
>>>>>         "oid": {
>>>>>             "oid": "rbd_data.2aae62ae8944a.0000000000000000",
>>>>>             "key": "",
>>>>>             "snapid": -2,
>>>>>             "hash": 2375431681,
>>>>>             "max": 0,
>>>>>             "pool": 4,
>>>>>             "namespace": ""
>>>>>         },
>>>>>         "version": "4546'2730",
>>>>>         "prior_version": "4538'2729",
>>>>>         "last_reqid": "client.155555.0:10",
>>>>>         "user_version": 14708,
>>>>>         "size": 4096,
>>>>>         "mtime": "2017-08-09 11:36:20.717910",
>>>>>         "local_mtime": "2017-08-09 11:36:20.719039",
>>>>>         "lost": 0,
>>>>>         "flags": 4,
>>>>>         "snaps": [],
>>>>>         "truncate_seq": 0,
>>>>>         "truncate_size": 0,
>>>>>         "data_digest": 4294967295,
>>>>>         "omap_digest": 4294967295,
>>>>>         "watchers": {}
>>>>>     },
>>>>>     "stat": {
>>>>>         "size": 4096,
>>>>>         "blksize": 4096,
>>>>>         "blocks": 16,
>>>>>         "nlink": 1
>>>>>     },
>>>>>     "SnapSet": {
>>>>>         "snap_context": {
>>>>>             "seq": 28,
>>>>>             "snaps": [
>>>>>                 28
>>>>>             ]
>>>>>         },
>>>>>         "head_exists": 1,
>>>>>         "clones": [
>>>>>             {
>>>>>                 "snap": 28,
>>>>>                 "size": 68,
>>>>>                 "overlap": "[0~64]"
>>>>>             }
>>>>>         ]
>>>>>     }
>>>>> }
>>>>>
>>>>> It's obvious that it didn't change at all. If we do export-diff under
>>>>> this circumstance, the result would be wrong. Meanwhile, in the base
>>>>> tier, the "ceph-objectstore-tool" dump's result also showed that the
>>>>> overlap recorded in the base tier didn't change also:
>>>>> {
>>>>>     "id": {
>>>>>         "oid": "rbd_data.2aae62ae8944a.0000000000000000",
>>>>>         "key": "",
>>>>>         "snapid": -2,
>>>>>         "hash": 2375431681,
>>>>>         "max": 0,
>>>>>         "pool": 3,
>>>>>         "namespace": "",
>>>>>         "max": 0
>>>>>     },
>>>>>     "info": {
>>>>>         "oid": {
>>>>>             "oid": "rbd_data.2aae62ae8944a.0000000000000000",
>>>>>             "key": "",
>>>>>             "snapid": -2,
>>>>>             "hash": 2375431681,
>>>>>             "max": 0,
>>>>>             "pool": 3,
>>>>>             "namespace": ""
>>>>>         },
>>>>>         "version": "4536'14459",
>>>>>         "prior_version": "4536'14458",
>>>>>         "last_reqid": "client.174834.0:10",
>>>>>         "user_version": 14648,
>>>>>         "size": 68,
>>>>>         "mtime": "2017-08-09 11:30:01.412634",
>>>>>         "local_mtime": "2017-08-09 11:30:01.413614",
>>>>>         "lost": 0,
>>>>>         "flags": 36,
>>>>>         "snaps": [],
>>>>>         "truncate_seq": 0,
>>>>>         "truncate_size": 0,
>>>>>         "data_digest": 4294967295,
>>>>>         "omap_digest": 4294967295,
>>>>>         "watchers": {}
>>>>>     },
>>>>>     "stat": {
>>>>>         "size": 68,
>>>>>         "blksize": 4096,
>>>>>         "blocks": 16,
>>>>>         "nlink": 1
>>>>>     },
>>>>>     "SnapSet": {
>>>>>         "snap_context": {
>>>>>             "seq": 28,
>>>>>             "snaps": [
>>>>>                 28
>>>>>             ]
>>>>>         },
>>>>>         "head_exists": 1,
>>>>>         "clones": [
>>>>>             {
>>>>>                 "snap": 28,
>>>>>                 "size": 68,
>>>>>                 "overlap": "[0~64]"
>>>>>             }
>>>>>         ]
>>>>>     }
>>>>> }
>>>>>
>>>>> Then we turn to the source code to find the reason for this. And we
>>>>> found that, the reason should be that, in the
>>>>> ReplicatedPG::make_writeable method, when determining whether the
>>>>> write op's modified range should be subtracted from the clone overlap,
>>>>> it has pass two condition check: "ctx->new_snapset.clones.size() > 0"
>>>>> and "is_present_clone(last_clone_oid)", as the code below shows.
>>>>>
>>>>>   // update most recent clone_overlap and usage stats
>>>>>   if (ctx->new_snapset.clones.size() > 0) {
>>>>>     /* we need to check whether the most recent clone exists, if it's
>>>>> been evicted,
>>>>>      * it's not included in the stats */
>>>>>     hobject_t last_clone_oid = soid;
>>>>>     last_clone_oid.snap = ctx->new_snapset.clone_overlap.rbegin()->first;
>>>>>     if (is_present_clone(last_clone_oid)) {
>>>>>       interval_set<uint64_t> &newest_overlap =
>>>>> ctx->new_snapset.clone_overlap.rbegin()->second;
>>>>>       ctx->modified_ranges.intersection_of(newest_overlap);
>>>>>       // modified_ranges is still in use by the clone
>>>>>       add_interval_usage(ctx->modified_ranges, ctx->delta_stats);
>>>>>       newest_overlap.subtract(ctx->modified_ranges);
>>>>>     }
>>>>>   }
>>>>>
>>>>> We thought that the latter condition check
>>>>> "is_present_clone(last_clone_oid)" might not be reasonable to be a
>>>>> judgement base for the determination of whether to subtract the clone
>>>>> overlap with write ops' modified range, so we changed to code above to
>>>>> the following, which moved the subtraction out of the latter condition
>>>>> check, and submitted a pr for this:
>>>>> https://github.com/ceph/ceph/pull/16790:
>>>>>
>>>>>   // update most recent clone_overlap and usage stats
>>>>>   if (ctx->new_snapset.clones.size() > 0) {
>>>>>     /* we need to check whether the most recent clone exists, if it's
>>>>> been evicted,
>>>>>      * it's not included in the stats */
>>>>>     hobject_t last_clone_oid = soid;
>>>>>     last_clone_oid.snap = ctx->new_snapset.clone_overlap.rbegin()->first;
>>>>>     interval_set<uint64_t> &newest_overlap =
>>>>> ctx->new_snapset.clone_overlap.rbegin()->second;
>>>>>     ctx->modified_ranges.intersection_of(newest_overlap);
>>>>>     newest_overlap.subtract(ctx->modified_ranges);
>>>>>
>>>>>     if (is_present_clone(last_clone_oid)) {
>>>>>       // modified_ranges is still in use by the clone
>>>>>       add_interval_usage(ctx->modified_ranges, ctx->delta_stats);
>>>>>     }
>>>>>   }
>>>>>
>>>>> In our test, this change solved the problem successfully, however, we
>>>>> can't confirm that this change won't cause any new problems. So, here
>>>>> we are discussing how to solve this problem:-)
>>>>>
>>>>> On 9 August 2017 at 09:26, Jason Dillaman <jdillama@redhat.com> wrote:
>>>>>> Is this issue related to https://github.com/ceph/ceph/pull/10626?
>>>>>>
>>>>>> On Mon, Aug 7, 2017 at 8:07 PM, Xuehan Xu <xxhdx1985126@gmail.com> wrote:
>>>>>>> OK, I'll try that. Thank you:-)
>>>>>>>
>>>>>>> On 8 August 2017 at 04:48, Jason Dillaman <jdillama@redhat.com> wrote:
>>>>>>>> Could you just proxy the "list snaps" op from the cache tier down to
>>>>>>>> the base tier and combine the cache tier + base tier results? Reading
>>>>>>>> the associated ticket, it seems kludgy to me to attempt to work around
>>>>>>>> this within librbd by just refusing the provide intra-object diffs if
>>>>>>>> cache tiering is in-use.
>>>>>>>>
>>>>>>>> On Sat, Aug 5, 2017 at 11:56 AM, Xuehan Xu <xxhdx1985126@gmail.com> wrote:
>>>>>>>>> Hi, everyone.
>>>>>>>>>
>>>>>>>>> Trying to solve the issue "http://tracker.ceph.com/issues/20896", I
>>>>>>>>> just did another test: I did some writes to an object
>>>>>>>>> "rbd_data.1ebc6238e1f29.0000000000000000" to raise its "HEAD" object
>>>>>>>>> to the cache tier, after which I specifically write to its offset 0x40
>>>>>>>>> with 4 bytes of random data. Then I used "ceph-objectstore-tool" to
>>>>>>>>> dump its "HEAD" version in the base tier, the result is as
>>>>>>>>> follows(before I raise it to cache tier, there is three snaps the
>>>>>>>>> latest of which is 26):
>>>>>>>>>
>>>>>>>>> {
>>>>>>>>>     "id": {
>>>>>>>>>         "oid": "rbd_data.1ebc6238e1f29.0000000000000000",
>>>>>>>>>         "key": "",
>>>>>>>>>         "snapid": -2,
>>>>>>>>>         "hash": 1655893237,
>>>>>>>>>         "max": 0,
>>>>>>>>>         "pool": 3,
>>>>>>>>>         "namespace": "",
>>>>>>>>>         "max": 0
>>>>>>>>>     },
>>>>>>>>>     "info": {
>>>>>>>>>         "oid": {
>>>>>>>>>             "oid": "rbd_data.1ebc6238e1f29.0000000000000000",
>>>>>>>>>             "key": "",
>>>>>>>>>             "snapid": -2,
>>>>>>>>>             "hash": 1655893237,
>>>>>>>>>             "max": 0,
>>>>>>>>>             "pool": 3,
>>>>>>>>>             "namespace": ""
>>>>>>>>>         },
>>>>>>>>>         "version": "4219'16423",
>>>>>>>>>         "prior_version": "3978'16310",
>>>>>>>>>         "last_reqid": "osd.70.4213:2359",
>>>>>>>>>         "user_version": 17205,
>>>>>>>>>         "size": 4194304,
>>>>>>>>>         "mtime": "2017-08-03 22:07:34.656122",
>>>>>>>>>         "local_mtime": "2017-08-05 23:02:33.628734",
>>>>>>>>>         "lost": 0,
>>>>>>>>>         "flags": 52,
>>>>>>>>>         "snaps": [],
>>>>>>>>>         "truncate_seq": 0,
>>>>>>>>>         "truncate_size": 0,
>>>>>>>>>         "data_digest": 2822203961,
>>>>>>>>>         "omap_digest": 4294967295,
>>>>>>>>>         "watchers": {}
>>>>>>>>>     },
>>>>>>>>>     "stat": {
>>>>>>>>>         "size": 4194304,
>>>>>>>>>         "blksize": 4096,
>>>>>>>>>         "blocks": 8200,
>>>>>>>>>         "nlink": 1
>>>>>>>>>     },
>>>>>>>>>     "SnapSet": {
>>>>>>>>>         "snap_context": {
>>>>>>>>>             "seq": 26,
>>>>>>>>>             "snaps": [
>>>>>>>>>                 26,
>>>>>>>>>                 25,
>>>>>>>>>                 16
>>>>>>>>>             ]
>>>>>>>>>         },
>>>>>>>>>         "head_exists": 1,
>>>>>>>>>         "clones": [
>>>>>>>>>             {
>>>>>>>>>                 "snap": 16,
>>>>>>>>>                 "size": 4194304,
>>>>>>>>>                 "overlap": "[4~4194300]"
>>>>>>>>>             },
>>>>>>>>>             {
>>>>>>>>>                 "snap": 25,
>>>>>>>>>                 "size": 4194304,
>>>>>>>>>                 "overlap": "[]"
>>>>>>>>>             },
>>>>>>>>>             {
>>>>>>>>>                 "snap": 26,
>>>>>>>>>                 "size": 4194304,
>>>>>>>>>                 "overlap": "[]"
>>>>>>>>>             }
>>>>>>>>>         ]
>>>>>>>>>     }
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> As we can see, its clone_overlap for snap 26 is empty, which,
>>>>>>>>> combining with the previous test described in
>>>>>>>>> http://tracker.ceph.com/issues/20896, means that the writes' "modified
>>>>>>>>> range" is neither recorded in the cache tier nor in the base tier.
>>>>>>>>>
>>>>>>>>> I think maybe we really should move the clone overlap modification out
>>>>>>>>> of the IF block which has the condition check "is_present_clone". As
>>>>>>>>> for now, I can't see any other way to fix this problem.
>>>>>>>>>
>>>>>>>>> Am I right about this?
>>>>>>>>>
>>>>>>>>> On 4 August 2017 at 03:14, Xuehan Xu <xxhdx1985126@gmail.com> wrote:
>>>>>>>>>> I mean I think it's the condition check "is_present_clone" that
>>>>>>>>>> prevent the clone overlap to record the client write operations
>>>>>>>>>> modified range when the target "HEAD" object exists without its most
>>>>>>>>>> recent clone object, and if I'm right, just move the clone overlap
>>>>>>>>>> modification out of the "is_present_clone" condition check block is
>>>>>>>>>> enough to solve this case, just like the PR
>>>>>>>>>> "https://github.com/ceph/ceph/pull/16790", and this fix wouldn't cause
>>>>>>>>>> other problems.
>>>>>>>>>>
>>>>>>>>>> In our test, this fix solved the problem successfully, however, we
>>>>>>>>>> can't confirm it won't cause new problems yet.
>>>>>>>>>>
>>>>>>>>>> So if anyone see this and knows the answer, please help us. Thank you:-)
>>>>>>>>>>
>>>>>>>>>> On 4 August 2017 at 11:41, Xuehan Xu <xxhdx1985126@gmail.com> wrote:
>>>>>>>>>>> Hi, grep:-)
>>>>>>>>>>>
>>>>>>>>>>> I finally got what you mean in https://github.com/ceph/ceph/pull/16790.
>>>>>>>>>>>
>>>>>>>>>>> I agree with you in that " clone overlap is supposed to be tracking
>>>>>>>>>>> which data is the same on disk".
>>>>>>>>>>>
>>>>>>>>>>> My thought is that, "ObjectContext::new_snapset.clones" is already an
>>>>>>>>>>> indicator about whether there are clone objects on disk, so, in the
>>>>>>>>>>> scenario of "cache tier", although a clone oid does not corresponds to
>>>>>>>>>>> a "present clone" in cache tier, as long as
>>>>>>>>>>> "ObjectContext::new_snapset.clones" is not empty, there must a one
>>>>>>>>>>> such clone object in the base tier. And, as long as
>>>>>>>>>>> "ObjectContext::new_snapset.clones" has a strict "one-to-one"
>>>>>>>>>>> correspondence to "ObjectContext::new_snapset.clone_overlap", passing
>>>>>>>>>>> the condition check "if (ctx->new_snapset.clones.size() > 0)" is
>>>>>>>>>>> enough to make the judgement that the clone object exists.
>>>>>>>>>>>
>>>>>>>>>>> So, if I'm right, passing the condition check "if
>>>>>>>>>>> (ctx->new_snapset.clones.size() > 0)" is already enough for us to do
>>>>>>>>>>> "newest_overlap.subtract(ctx->modified_ranges)", it doesn't have to
>>>>>>>>>>> pass "is_present_clone".
>>>>>>>>>>>
>>>>>>>>>>> Am I right about this? Or am I missing anything?
>>>>>>>>>>>
>>>>>>>>>>> Please help us, thank you:-)
>>>>>>>>> --
>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Jason
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Jason
>>>
>>>
>>>
>>> --
>>> Jason

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: About the problem "export_diff relies on clone_overlap, which is lost when cache tier is enabled"
  2017-08-10  7:06                   ` Xuehan Xu
  2017-08-10  8:11                     ` Xuehan Xu
@ 2017-08-10 13:11                     ` Jason Dillaman
  2017-08-10 15:11                       ` Xuehan Xu
  1 sibling, 1 reply; 14+ messages in thread
From: Jason Dillaman @ 2017-08-10 13:11 UTC (permalink / raw)
  To: Xuehan Xu; +Cc: Gregory Farnum, ceph-devel

Yes, it's quite possible that, for simplicity's sake, the cache tier OSD
writes back the full object to the base tier even if only a single
byte has changed. In that case, while not optimal behavior, at least
if you combine the base + cache tier snapsets together, RBD should be
able to export enough data to reconstruct the image.
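
One conservative way to "combine" them would be to intersect the
per-clone overlaps, so the merged overlap never over-states what a
clone shares with HEAD (export-diff might then copy more than strictly
necessary, but never less). A minimal sketch, assuming Ceph's
interval_set<> semantics; the helper and its signature are made up,
this is not actual librbd code:

  // Conservatively merge the per-clone overlaps from the two tiers'
  // list-snaps results: intersect them so the result can only shrink.
  // (Hypothetical helper; snap ids kept as plain uint64_t for brevity.)
  #include "include/interval_set.h"
  #include <map>

  using extents_t = interval_set<uint64_t>;

  std::map<uint64_t, extents_t> combine_overlaps(
      const std::map<uint64_t, extents_t> &cache_overlap,
      const std::map<uint64_t, extents_t> &base_overlap)
  {
    std::map<uint64_t, extents_t> out;
    for (const auto &p : cache_overlap) {
      auto it = base_overlap.find(p.first);
      if (it == base_overlap.end()) {
        out[p.first] = extents_t();        // seen in only one tier: assume no overlap
        continue;
      }
      extents_t merged = p.second;
      merged.intersection_of(it->second);  // shrink, never grow
      out[p.first] = merged;
    }
    return out;
  }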

On Thu, Aug 10, 2017 at 3:06 AM, Xuehan Xu <xxhdx1985126@gmail.com> wrote:
> Hi, Jason.
>
> I did a test, it turned out that, after flushing the object out of the
> cache tier, the clone overlap in base tier changed to empty, as is
> shown below. I think this maybe because that the flush operation just
> mark the whole range of the object being flushed as modified, so if
> the object's size hasn't changed, the overlap becomes empty. Is this
> right?
>
> Thank you:-)
>
> {
>     "id": {
>         "oid": "test.obj",
>         "key": "",
>         "snapid": -2,
>         "hash": 3575411564,
>         "max": 0,
>         "pool": 10,
>         "namespace": "",
>         "max": 0
>     },
>     "info": {
>         "oid": {
>             "oid": "test.obj",
>             "key": "",
>             "snapid": -2,
>             "hash": 3575411564,
>             "max": 0,
>             "pool": 10,
>             "namespace": ""
>         },
>         "version": "4876'9",
>         "prior_version": "4854'8",
>         "last_reqid": "osd.35.4869:1",
>         "user_version": 16,
>         "size": 4194303,
>         "mtime": "2017-08-10 14:54:56.087387",
>         "local_mtime": "2017-08-10 14:59:15.252755",
>         "lost": 0,
>         "flags": 52,
>         "snaps": [],
>         "truncate_seq": 0,
>         "truncate_size": 0,
>         "data_digest": 2827420887,
>         "omap_digest": 4294967295,
>         "watchers": {}
>     },
>     "stat": {
>         "size": 4194303,
>         "blksize": 4096,
>         "blocks": 8200,
>         "nlink": 1
>     },
>     "SnapSet": {
>         "snap_context": {
>             "seq": 3,
>             "snaps": [
>                 3
>             ]
>         },
>         "head_exists": 1,
>         "clones": [
>             {
>                 "snap": 3,
>                 "size": 4194303,
>                 "overlap": "[]"
>             }
>         ]
>     }
> }
>
> On 9 August 2017 at 23:26, Jason Dillaman <jdillama@redhat.com> wrote:
>> If you flush the object out of the cache tier so that its changes are
>> recorded in the base tier, is the overlap correctly recorded in the
>> base tier?
>>
>> On Wed, Aug 9, 2017 at 12:24 AM, Xuehan Xu <xxhdx1985126@gmail.com> wrote:
>>> By the way, according to our test, since the modified range is not
>>> recorded either in the cache tier or in the base tier, I think
>>> proxying the "list-snaps" down to the base tier might not work as
>>> well, is that right?
>>>
>>> On 9 August 2017 at 12:20, Xuehan Xu <xxhdx1985126@gmail.com> wrote:
>>>> Sorry, I didn't "reply all"....:-)
>>>>
>>>>
>>>> ---------- Forwarded message ----------
>>>> From: Xuehan Xu <xxhdx1985126@gmail.com>
>>>> Date: 9 August 2017 at 12:14
>>>> Subject: Re: About the problem "export_diff relies on clone_overlap,
>>>> which is lost when cache tier is enabled"
>>>> To: dillaman@redhat.com
>>>>
>>>>
>>>> Um, no, I don't think they are related.
>>>>
>>>> My problem is this:
>>>>
>>>> At first , we tried to use "rbd export-diff" to do incremental backup
>>>> on Jewel verion ceph cluster which is cache-tier enabled. However,
>>>> when we compare the original rbd image and the backup rbd image, we
>>>> find that they are different. After a series of debugging, we found
>>>> that this is because WRITE ops' "modified_range" is not substracted
>>>> from the clone overlap of its targeting object's HEAD version when
>>>> that object's HEAD verion is in cache iter and its most recent clone
>>>> object is not, which led to the miscalculation of the
>>>> "calc_snap_set_diff" function.
>>>>
>>>> For example, we did such a test: we first made created a snap for an
>>>> rbd image "test.2.img" whose size is only 4MB which means it only
>>>> contains one object; then, we sent a series of AioWrites to
>>>> "test.2.img" to promote its HEAD object into cache tier, while leaving
>>>> its clone object in the base tier only; at that time, we used
>>>> "ceph-objectstore-tool" to dump the object we wrote to, and the result
>>>> was as follows:
>>>>
>>>> {
>>>>     "id": {
>>>>         "oid": "rbd_data.2aae62ae8944a.0000000000000000",
>>>>         "key": "",
>>>>         "snapid": -2,
>>>>         "hash": 2375431681,
>>>>         "max": 0,
>>>>         "pool": 4,
>>>>         "namespace": "",
>>>>         "max": 0
>>>>     },
>>>>     "info": {
>>>>         "oid": {
>>>>             "oid": "rbd_data.2aae62ae8944a.0000000000000000",
>>>>             "key": "",
>>>>             "snapid": -2,
>>>>             "hash": 2375431681,
>>>>             "max": 0,
>>>>             "pool": 4,
>>>>             "namespace": ""
>>>>         },
>>>>         "version": "4536'2728",
>>>>         "prior_version": "4536'2727",
>>>>         "last_reqid": "client.174858.0:10",
>>>>         "user_version": 14706,
>>>>         "size": 68,
>>>>         "mtime": "2017-08-09 11:30:18.263983",
>>>>         "local_mtime": "2017-08-09 11:30:18.264310",
>>>>         "lost": 0,
>>>>         "flags": 4,
>>>>         "snaps": [],
>>>>         "truncate_seq": 0,
>>>>         "truncate_size": 0,
>>>>         "data_digest": 4294967295,
>>>>         "omap_digest": 4294967295,
>>>>         "watchers": {}
>>>>     },
>>>>     "stat": {
>>>>         "size": 68,
>>>>         "blksize": 4096,
>>>>         "blocks": 16,
>>>>         "nlink": 1
>>>>     },
>>>>     "SnapSet": {
>>>>         "snap_context": {
>>>>             "seq": 28,
>>>>             "snaps": [
>>>>                 28
>>>>             ]
>>>>         },
>>>>         "head_exists": 1,
>>>>         "clones": [
>>>>             {
>>>>                 "snap": 28,
>>>>                 "size": 68,
>>>>                 "overlap": "[0~64]"
>>>>             }
>>>>         ]
>>>>     }
>>>> }
>>>>
>>>> In this result, we found that the overlap for clone "28" is [0~64]. So
>>>> we specifically sent a AioWrite req to this object to write to the
>>>> offset 32 with 4 bytes of ramdon data, which we thought the overlap
>>>> should change to [0~32, 36~64]. However, the result is as follows:
>>>>
>>>> {
>>>>     "id": {
>>>>         "oid": "rbd_data.2aae62ae8944a.0000000000000000",
>>>>         "key": "",
>>>>         "snapid": -2,
>>>>         "hash": 2375431681,
>>>>         "max": 0,
>>>>         "pool": 4,
>>>>         "namespace": "",
>>>>         "max": 0
>>>>     },
>>>>     "info": {
>>>>         "oid": {
>>>>             "oid": "rbd_data.2aae62ae8944a.0000000000000000",
>>>>             "key": "",
>>>>             "snapid": -2,
>>>>             "hash": 2375431681,
>>>>             "max": 0,
>>>>             "pool": 4,
>>>>             "namespace": ""
>>>>         },
>>>>         "version": "4546'2730",
>>>>         "prior_version": "4538'2729",
>>>>         "last_reqid": "client.155555.0:10",
>>>>         "user_version": 14708,
>>>>         "size": 4096,
>>>>         "mtime": "2017-08-09 11:36:20.717910",
>>>>         "local_mtime": "2017-08-09 11:36:20.719039",
>>>>         "lost": 0,
>>>>         "flags": 4,
>>>>         "snaps": [],
>>>>         "truncate_seq": 0,
>>>>         "truncate_size": 0,
>>>>         "data_digest": 4294967295,
>>>>         "omap_digest": 4294967295,
>>>>         "watchers": {}
>>>>     },
>>>>     "stat": {
>>>>         "size": 4096,
>>>>         "blksize": 4096,
>>>>         "blocks": 16,
>>>>         "nlink": 1
>>>>     },
>>>>     "SnapSet": {
>>>>         "snap_context": {
>>>>             "seq": 28,
>>>>             "snaps": [
>>>>                 28
>>>>             ]
>>>>         },
>>>>         "head_exists": 1,
>>>>         "clones": [
>>>>             {
>>>>                 "snap": 28,
>>>>                 "size": 68,
>>>>                 "overlap": "[0~64]"
>>>>             }
>>>>         ]
>>>>     }
>>>> }
>>>>
>>>> It's obvious that it didn't change at all. If we do export-diff under
>>>> this circumstance, the result would be wrong. Meanwhile, in the base
>>>> tier, the "ceph-objectstore-tool" dump's result also showed that the
>>>> overlap recorded in the base tier didn't change also:
>>>> {
>>>>     "id": {
>>>>         "oid": "rbd_data.2aae62ae8944a.0000000000000000",
>>>>         "key": "",
>>>>         "snapid": -2,
>>>>         "hash": 2375431681,
>>>>         "max": 0,
>>>>         "pool": 3,
>>>>         "namespace": "",
>>>>         "max": 0
>>>>     },
>>>>     "info": {
>>>>         "oid": {
>>>>             "oid": "rbd_data.2aae62ae8944a.0000000000000000",
>>>>             "key": "",
>>>>             "snapid": -2,
>>>>             "hash": 2375431681,
>>>>             "max": 0,
>>>>             "pool": 3,
>>>>             "namespace": ""
>>>>         },
>>>>         "version": "4536'14459",
>>>>         "prior_version": "4536'14458",
>>>>         "last_reqid": "client.174834.0:10",
>>>>         "user_version": 14648,
>>>>         "size": 68,
>>>>         "mtime": "2017-08-09 11:30:01.412634",
>>>>         "local_mtime": "2017-08-09 11:30:01.413614",
>>>>         "lost": 0,
>>>>         "flags": 36,
>>>>         "snaps": [],
>>>>         "truncate_seq": 0,
>>>>         "truncate_size": 0,
>>>>         "data_digest": 4294967295,
>>>>         "omap_digest": 4294967295,
>>>>         "watchers": {}
>>>>     },
>>>>     "stat": {
>>>>         "size": 68,
>>>>         "blksize": 4096,
>>>>         "blocks": 16,
>>>>         "nlink": 1
>>>>     },
>>>>     "SnapSet": {
>>>>         "snap_context": {
>>>>             "seq": 28,
>>>>             "snaps": [
>>>>                 28
>>>>             ]
>>>>         },
>>>>         "head_exists": 1,
>>>>         "clones": [
>>>>             {
>>>>                 "snap": 28,
>>>>                 "size": 68,
>>>>                 "overlap": "[0~64]"
>>>>             }
>>>>         ]
>>>>     }
>>>> }
>>>>
>>>> Then we turn to the source code to find the reason for this. And we
>>>> found that, the reason should be that, in the
>>>> ReplicatedPG::make_writeable method, when determining whether the
>>>> write op's modified range should be subtracted from the clone overlap,
>>>> it has pass two condition check: "ctx->new_snapset.clones.size() > 0"
>>>> and "is_present_clone(last_clone_oid)", as the code below shows.
>>>>
>>>>   // update most recent clone_overlap and usage stats
>>>>   if (ctx->new_snapset.clones.size() > 0) {
>>>>     /* we need to check whether the most recent clone exists, if it's
>>>> been evicted,
>>>>      * it's not included in the stats */
>>>>     hobject_t last_clone_oid = soid;
>>>>     last_clone_oid.snap = ctx->new_snapset.clone_overlap.rbegin()->first;
>>>>     if (is_present_clone(last_clone_oid)) {
>>>>       interval_set<uint64_t> &newest_overlap =
>>>> ctx->new_snapset.clone_overlap.rbegin()->second;
>>>>       ctx->modified_ranges.intersection_of(newest_overlap);
>>>>       // modified_ranges is still in use by the clone
>>>>       add_interval_usage(ctx->modified_ranges, ctx->delta_stats);
>>>>       newest_overlap.subtract(ctx->modified_ranges);
>>>>     }
>>>>   }
>>>>
>>>> We thought that the latter condition check
>>>> "is_present_clone(last_clone_oid)" might not be reasonable to be a
>>>> judgement base for the determination of whether to subtract the clone
>>>> overlap with write ops' modified range, so we changed to code above to
>>>> the following, which moved the subtraction out of the latter condition
>>>> check, and submitted a pr for this:
>>>> https://github.com/ceph/ceph/pull/16790:
>>>>
>>>>   // update most recent clone_overlap and usage stats
>>>>   if (ctx->new_snapset.clones.size() > 0) {
>>>>     /* we need to check whether the most recent clone exists, if it's
>>>> been evicted,
>>>>      * it's not included in the stats */
>>>>     hobject_t last_clone_oid = soid;
>>>>     last_clone_oid.snap = ctx->new_snapset.clone_overlap.rbegin()->first;
>>>>     interval_set<uint64_t> &newest_overlap =
>>>> ctx->new_snapset.clone_overlap.rbegin()->second;
>>>>     ctx->modified_ranges.intersection_of(newest_overlap);
>>>>     newest_overlap.subtract(ctx->modified_ranges);
>>>>
>>>>     if (is_present_clone(last_clone_oid)) {
>>>>       // modified_ranges is still in use by the clone
>>>>       add_interval_usage(ctx->modified_ranges, ctx->delta_stats);
>>>>     }
>>>>   }
>>>>
>>>> In our test, this change solved the problem successfully, however, we
>>>> can't confirm that this change won't cause any new problems. So, here
>>>> we are discussing how to solve this problem:-)
>>>>
>>>> On 9 August 2017 at 09:26, Jason Dillaman <jdillama@redhat.com> wrote:
>>>>> Is this issue related to https://github.com/ceph/ceph/pull/10626?
>>>>>
>>>>> On Mon, Aug 7, 2017 at 8:07 PM, Xuehan Xu <xxhdx1985126@gmail.com> wrote:
>>>>>> OK, I'll try that. Thank you:-)
>>>>>>
>>>>>> On 8 August 2017 at 04:48, Jason Dillaman <jdillama@redhat.com> wrote:
>>>>>>> Could you just proxy the "list snaps" op from the cache tier down to
>>>>>>> the base tier and combine the cache tier + base tier results? Reading
>>>>>>> the associated ticket, it seems kludgy to me to attempt to work around
>>>>>>> this within librbd by just refusing the provide intra-object diffs if
>>>>>>> cache tiering is in-use.
>>>>>>>
>>>>>>> On Sat, Aug 5, 2017 at 11:56 AM, Xuehan Xu <xxhdx1985126@gmail.com> wrote:
>>>>>>>> Hi, everyone.
>>>>>>>>
>>>>>>>> Trying to solve the issue "http://tracker.ceph.com/issues/20896", I
>>>>>>>> just did another test: I did some writes to an object
>>>>>>>> "rbd_data.1ebc6238e1f29.0000000000000000" to raise its "HEAD" object
>>>>>>>> to the cache tier, after which I specifically write to its offset 0x40
>>>>>>>> with 4 bytes of random data. Then I used "ceph-objectstore-tool" to
>>>>>>>> dump its "HEAD" version in the base tier, the result is as
>>>>>>>> follows(before I raise it to cache tier, there is three snaps the
>>>>>>>> latest of which is 26):
>>>>>>>>
>>>>>>>> {
>>>>>>>>     "id": {
>>>>>>>>         "oid": "rbd_data.1ebc6238e1f29.0000000000000000",
>>>>>>>>         "key": "",
>>>>>>>>         "snapid": -2,
>>>>>>>>         "hash": 1655893237,
>>>>>>>>         "max": 0,
>>>>>>>>         "pool": 3,
>>>>>>>>         "namespace": "",
>>>>>>>>         "max": 0
>>>>>>>>     },
>>>>>>>>     "info": {
>>>>>>>>         "oid": {
>>>>>>>>             "oid": "rbd_data.1ebc6238e1f29.0000000000000000",
>>>>>>>>             "key": "",
>>>>>>>>             "snapid": -2,
>>>>>>>>             "hash": 1655893237,
>>>>>>>>             "max": 0,
>>>>>>>>             "pool": 3,
>>>>>>>>             "namespace": ""
>>>>>>>>         },
>>>>>>>>         "version": "4219'16423",
>>>>>>>>         "prior_version": "3978'16310",
>>>>>>>>         "last_reqid": "osd.70.4213:2359",
>>>>>>>>         "user_version": 17205,
>>>>>>>>         "size": 4194304,
>>>>>>>>         "mtime": "2017-08-03 22:07:34.656122",
>>>>>>>>         "local_mtime": "2017-08-05 23:02:33.628734",
>>>>>>>>         "lost": 0,
>>>>>>>>         "flags": 52,
>>>>>>>>         "snaps": [],
>>>>>>>>         "truncate_seq": 0,
>>>>>>>>         "truncate_size": 0,
>>>>>>>>         "data_digest": 2822203961,
>>>>>>>>         "omap_digest": 4294967295,
>>>>>>>>         "watchers": {}
>>>>>>>>     },
>>>>>>>>     "stat": {
>>>>>>>>         "size": 4194304,
>>>>>>>>         "blksize": 4096,
>>>>>>>>         "blocks": 8200,
>>>>>>>>         "nlink": 1
>>>>>>>>     },
>>>>>>>>     "SnapSet": {
>>>>>>>>         "snap_context": {
>>>>>>>>             "seq": 26,
>>>>>>>>             "snaps": [
>>>>>>>>                 26,
>>>>>>>>                 25,
>>>>>>>>                 16
>>>>>>>>             ]
>>>>>>>>         },
>>>>>>>>         "head_exists": 1,
>>>>>>>>         "clones": [
>>>>>>>>             {
>>>>>>>>                 "snap": 16,
>>>>>>>>                 "size": 4194304,
>>>>>>>>                 "overlap": "[4~4194300]"
>>>>>>>>             },
>>>>>>>>             {
>>>>>>>>                 "snap": 25,
>>>>>>>>                 "size": 4194304,
>>>>>>>>                 "overlap": "[]"
>>>>>>>>             },
>>>>>>>>             {
>>>>>>>>                 "snap": 26,
>>>>>>>>                 "size": 4194304,
>>>>>>>>                 "overlap": "[]"
>>>>>>>>             }
>>>>>>>>         ]
>>>>>>>>     }
>>>>>>>> }
>>>>>>>>
>>>>>>>> As we can see, its clone_overlap for snap 26 is empty, which,
>>>>>>>> combining with the previous test described in
>>>>>>>> http://tracker.ceph.com/issues/20896, means that the writes' "modified
>>>>>>>> range" is neither recorded in the cache tier nor in the base tier.
>>>>>>>>
>>>>>>>> I think maybe we really should move the clone overlap modification out
>>>>>>>> of the IF block which has the condition check "is_present_clone". As
>>>>>>>> for now, I can't see any other way to fix this problem.
>>>>>>>>
>>>>>>>> Am I right about this?
>>>>>>>>
>>>>>>>> On 4 August 2017 at 03:14, Xuehan Xu <xxhdx1985126@gmail.com> wrote:
>>>>>>>>> I mean I think it's the condition check "is_present_clone" that
>>>>>>>>> prevent the clone overlap to record the client write operations
>>>>>>>>> modified range when the target "HEAD" object exists without its most
>>>>>>>>> recent clone object, and if I'm right, just move the clone overlap
>>>>>>>>> modification out of the "is_present_clone" condition check block is
>>>>>>>>> enough to solve this case, just like the PR
>>>>>>>>> "https://github.com/ceph/ceph/pull/16790", and this fix wouldn't cause
>>>>>>>>> other problems.
>>>>>>>>>
>>>>>>>>> In our test, this fix solved the problem successfully, however, we
>>>>>>>>> can't confirm it won't cause new problems yet.
>>>>>>>>>
>>>>>>>>> So if anyone see this and knows the answer, please help us. Thank you:-)
>>>>>>>>>
>>>>>>>>> On 4 August 2017 at 11:41, Xuehan Xu <xxhdx1985126@gmail.com> wrote:
>>>>>>>>>> Hi, grep:-)
>>>>>>>>>>
>>>>>>>>>> I finally got what you mean in https://github.com/ceph/ceph/pull/16790.
>>>>>>>>>>
>>>>>>>>>> I agree with you in that " clone overlap is supposed to be tracking
>>>>>>>>>> which data is the same on disk".
>>>>>>>>>>
>>>>>>>>>> My thought is that, "ObjectContext::new_snapset.clones" is already an
>>>>>>>>>> indicator about whether there are clone objects on disk, so, in the
>>>>>>>>>> scenario of "cache tier", although a clone oid does not corresponds to
>>>>>>>>>> a "present clone" in cache tier, as long as
>>>>>>>>>> "ObjectContext::new_snapset.clones" is not empty, there must a one
>>>>>>>>>> such clone object in the base tier. And, as long as
>>>>>>>>>> "ObjectContext::new_snapset.clones" has a strict "one-to-one"
>>>>>>>>>> correspondence to "ObjectContext::new_snapset.clone_overlap", passing
>>>>>>>>>> the condition check "if (ctx->new_snapset.clones.size() > 0)" is
>>>>>>>>>> enough to make the judgement that the clone object exists.
>>>>>>>>>>
>>>>>>>>>> So, if I'm right, passing the condition check "if
>>>>>>>>>> (ctx->new_snapset.clones.size() > 0)" is already enough for us to do
>>>>>>>>>> "newest_overlap.subtract(ctx->modified_ranges)", it doesn't have to
>>>>>>>>>> pass "is_present_clone".
>>>>>>>>>>
>>>>>>>>>> Am I right about this? Or am I missing anything?
>>>>>>>>>>
>>>>>>>>>> Please help us, thank you:-)
>>>>>>>> --
>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Jason
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Jason
>>
>>
>>
>> --
>> Jason



-- 
Jason

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: About the problem "export_diff relies on clone_overlap, which is lost when cache tier is enabled"
  2017-08-10 13:11                     ` Jason Dillaman
@ 2017-08-10 15:11                       ` Xuehan Xu
  2017-08-10 15:54                         ` Jason Dillaman
  0 siblings, 1 reply; 14+ messages in thread
From: Xuehan Xu @ 2017-08-10 15:11 UTC (permalink / raw)
  To: dillaman; +Cc: Gregory Farnum, ceph-devel

Yes, that's right. However, as you say, that doesn't seem to be very
efficient, and there are strict time-window constraints on our backup
tasks.

On the other hand, I still don't quite understand why we can't simply
move the clone overlap subtraction out of the "is_present_clone"
condition check in the PrimaryLogPG::make_writeable method. I've
searched and read all the source code related to the clone overlap,
and it seems that the clone overlap is only used for calculating the
difference between clones, or the subsets of objects that need to be
recovered or backfilled. If this is true, wouldn't it be reasonable to
keep the overlap accurate no matter whether the clone objects are in
the cache tier or the base tier? Why leave it incomplete when the
clone objects are not in the cache tier?
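
Just to make the bookkeeping in question concrete, here is a minimal
standalone sketch of the subtraction we are proposing to always
perform, using the numbers from the earlier test (a 4-byte write at
offset 32 against an overlap of [0~64]); this is only an illustration
built on Ceph's interval_set<>, not the actual make_writeable code:

  // A 4-byte write at offset 32 should split the clone overlap
  // [0~64] into [0~32] plus [36~28] (offset~length notation).
  #include "include/interval_set.h"
  #include <iostream>

  int main() {
    interval_set<uint64_t> overlap;          // most recent clone's overlap
    overlap.insert(0, 64);                   // "[0~64]"

    interval_set<uint64_t> modified_ranges;  // the client write
    modified_ranges.insert(32, 4);

    modified_ranges.intersection_of(overlap);
    overlap.subtract(modified_ranges);       // overlap now covers [0,32) and [36,64)

    std::cout << overlap << std::endl;       // prints "[0~32,36~28]"
    return 0;
  }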

Could you tell us a bit more about the design principles behind this,
please? Maybe we can figure out a better solution if we can fully
understand the design intentions:-)

Thanks for helping us:-)

On 10 August 2017 at 21:11, Jason Dillaman <jdillama@redhat.com> wrote:
> Yes, it's quite possible that for simplicity sake the cache tier OSD
> writes back the full object to the base tier even if only a single
> byte has changed. In that case, while not optimal behavior, at least
> if you combine the base + cache tier snapsets together, RBD should be
> able to export enough data to reconstruct the image.
>
> On Thu, Aug 10, 2017 at 3:06 AM, Xuehan Xu <xxhdx1985126@gmail.com> wrote:
>> Hi, Jason.
>>
>> I did a test, it turned out that, after flushing the object out of the
>> cache tier, the clone overlap in base tier changed to empty, as is
>> shown below. I think this maybe because that the flush operation just
>> mark the whole range of the object being flushed as modified, so if
>> the object's size hasn't changed, the overlap becomes empty. Is this
>> right?
>>
>> Thank you:-)
>>
>> {
>>     "id": {
>>         "oid": "test.obj",
>>         "key": "",
>>         "snapid": -2,
>>         "hash": 3575411564,
>>         "max": 0,
>>         "pool": 10,
>>         "namespace": "",
>>         "max": 0
>>     },
>>     "info": {
>>         "oid": {
>>             "oid": "test.obj",
>>             "key": "",
>>             "snapid": -2,
>>             "hash": 3575411564,
>>             "max": 0,
>>             "pool": 10,
>>             "namespace": ""
>>         },
>>         "version": "4876'9",
>>         "prior_version": "4854'8",
>>         "last_reqid": "osd.35.4869:1",
>>         "user_version": 16,
>>         "size": 4194303,
>>         "mtime": "2017-08-10 14:54:56.087387",
>>         "local_mtime": "2017-08-10 14:59:15.252755",
>>         "lost": 0,
>>         "flags": 52,
>>         "snaps": [],
>>         "truncate_seq": 0,
>>         "truncate_size": 0,
>>         "data_digest": 2827420887,
>>         "omap_digest": 4294967295,
>>         "watchers": {}
>>     },
>>     "stat": {
>>         "size": 4194303,
>>         "blksize": 4096,
>>         "blocks": 8200,
>>         "nlink": 1
>>     },
>>     "SnapSet": {
>>         "snap_context": {
>>             "seq": 3,
>>             "snaps": [
>>                 3
>>             ]
>>         },
>>         "head_exists": 1,
>>         "clones": [
>>             {
>>                 "snap": 3,
>>                 "size": 4194303,
>>                 "overlap": "[]"
>>             }
>>         ]
>>     }
>> }
>>
>> On 9 August 2017 at 23:26, Jason Dillaman <jdillama@redhat.com> wrote:
>>> If you flush the object out of the cache tier so that its changes are
>>> recorded in the base tier, is the overlap correctly recorded in the
>>> base tier?
>>>
>>> On Wed, Aug 9, 2017 at 12:24 AM, Xuehan Xu <xxhdx1985126@gmail.com> wrote:
>>>> By the way, according to our test, since the modified range is not
>>>> recorded either in the cache tier or in the base tier, I think
>>>> proxying the "list-snaps" down to the base tier might not work as
>>>> well, is that right?
>>>>
>>>> On 9 August 2017 at 12:20, Xuehan Xu <xxhdx1985126@gmail.com> wrote:
>>>>> Sorry, I didn't "reply all"....:-)
>>>>>
>>>>>
>>>>> ---------- Forwarded message ----------
>>>>> From: Xuehan Xu <xxhdx1985126@gmail.com>
>>>>> Date: 9 August 2017 at 12:14
>>>>> Subject: Re: About the problem "export_diff relies on clone_overlap,
>>>>> which is lost when cache tier is enabled"
>>>>> To: dillaman@redhat.com
>>>>>
>>>>>
>>>>> Um, no, I don't think they are related.
>>>>>
>>>>> My problem is this:
>>>>>
>>>>> At first , we tried to use "rbd export-diff" to do incremental backup
>>>>> on Jewel verion ceph cluster which is cache-tier enabled. However,
>>>>> when we compare the original rbd image and the backup rbd image, we
>>>>> find that they are different. After a series of debugging, we found
>>>>> that this is because WRITE ops' "modified_range" is not substracted
>>>>> from the clone overlap of its targeting object's HEAD version when
>>>>> that object's HEAD verion is in cache iter and its most recent clone
>>>>> object is not, which led to the miscalculation of the
>>>>> "calc_snap_set_diff" function.
>>>>>
>>>>> For example, we did such a test: we first made created a snap for an
>>>>> rbd image "test.2.img" whose size is only 4MB which means it only
>>>>> contains one object; then, we sent a series of AioWrites to
>>>>> "test.2.img" to promote its HEAD object into cache tier, while leaving
>>>>> its clone object in the base tier only; at that time, we used
>>>>> "ceph-objectstore-tool" to dump the object we wrote to, and the result
>>>>> was as follows:
>>>>>
>>>>> {
>>>>>     "id": {
>>>>>         "oid": "rbd_data.2aae62ae8944a.0000000000000000",
>>>>>         "key": "",
>>>>>         "snapid": -2,
>>>>>         "hash": 2375431681,
>>>>>         "max": 0,
>>>>>         "pool": 4,
>>>>>         "namespace": "",
>>>>>         "max": 0
>>>>>     },
>>>>>     "info": {
>>>>>         "oid": {
>>>>>             "oid": "rbd_data.2aae62ae8944a.0000000000000000",
>>>>>             "key": "",
>>>>>             "snapid": -2,
>>>>>             "hash": 2375431681,
>>>>>             "max": 0,
>>>>>             "pool": 4,
>>>>>             "namespace": ""
>>>>>         },
>>>>>         "version": "4536'2728",
>>>>>         "prior_version": "4536'2727",
>>>>>         "last_reqid": "client.174858.0:10",
>>>>>         "user_version": 14706,
>>>>>         "size": 68,
>>>>>         "mtime": "2017-08-09 11:30:18.263983",
>>>>>         "local_mtime": "2017-08-09 11:30:18.264310",
>>>>>         "lost": 0,
>>>>>         "flags": 4,
>>>>>         "snaps": [],
>>>>>         "truncate_seq": 0,
>>>>>         "truncate_size": 0,
>>>>>         "data_digest": 4294967295,
>>>>>         "omap_digest": 4294967295,
>>>>>         "watchers": {}
>>>>>     },
>>>>>     "stat": {
>>>>>         "size": 68,
>>>>>         "blksize": 4096,
>>>>>         "blocks": 16,
>>>>>         "nlink": 1
>>>>>     },
>>>>>     "SnapSet": {
>>>>>         "snap_context": {
>>>>>             "seq": 28,
>>>>>             "snaps": [
>>>>>                 28
>>>>>             ]
>>>>>         },
>>>>>         "head_exists": 1,
>>>>>         "clones": [
>>>>>             {
>>>>>                 "snap": 28,
>>>>>                 "size": 68,
>>>>>                 "overlap": "[0~64]"
>>>>>             }
>>>>>         ]
>>>>>     }
>>>>> }
>>>>>
>>>>> In this result, we found that the overlap for clone "28" is [0~64]. So
>>>>> we specifically sent a AioWrite req to this object to write to the
>>>>> offset 32 with 4 bytes of ramdon data, which we thought the overlap
>>>>> should change to [0~32, 36~64]. However, the result is as follows:
>>>>>
>>>>> {
>>>>>     "id": {
>>>>>         "oid": "rbd_data.2aae62ae8944a.0000000000000000",
>>>>>         "key": "",
>>>>>         "snapid": -2,
>>>>>         "hash": 2375431681,
>>>>>         "max": 0,
>>>>>         "pool": 4,
>>>>>         "namespace": "",
>>>>>         "max": 0
>>>>>     },
>>>>>     "info": {
>>>>>         "oid": {
>>>>>             "oid": "rbd_data.2aae62ae8944a.0000000000000000",
>>>>>             "key": "",
>>>>>             "snapid": -2,
>>>>>             "hash": 2375431681,
>>>>>             "max": 0,
>>>>>             "pool": 4,
>>>>>             "namespace": ""
>>>>>         },
>>>>>         "version": "4546'2730",
>>>>>         "prior_version": "4538'2729",
>>>>>         "last_reqid": "client.155555.0:10",
>>>>>         "user_version": 14708,
>>>>>         "size": 4096,
>>>>>         "mtime": "2017-08-09 11:36:20.717910",
>>>>>         "local_mtime": "2017-08-09 11:36:20.719039",
>>>>>         "lost": 0,
>>>>>         "flags": 4,
>>>>>         "snaps": [],
>>>>>         "truncate_seq": 0,
>>>>>         "truncate_size": 0,
>>>>>         "data_digest": 4294967295,
>>>>>         "omap_digest": 4294967295,
>>>>>         "watchers": {}
>>>>>     },
>>>>>     "stat": {
>>>>>         "size": 4096,
>>>>>         "blksize": 4096,
>>>>>         "blocks": 16,
>>>>>         "nlink": 1
>>>>>     },
>>>>>     "SnapSet": {
>>>>>         "snap_context": {
>>>>>             "seq": 28,
>>>>>             "snaps": [
>>>>>                 28
>>>>>             ]
>>>>>         },
>>>>>         "head_exists": 1,
>>>>>         "clones": [
>>>>>             {
>>>>>                 "snap": 28,
>>>>>                 "size": 68,
>>>>>                 "overlap": "[0~64]"
>>>>>             }
>>>>>         ]
>>>>>     }
>>>>> }
>>>>>
>>>>> It's obvious that it didn't change at all. If we do export-diff under
>>>>> this circumstance, the result would be wrong. Meanwhile, in the base
>>>>> tier, the "ceph-objectstore-tool" dump's result also showed that the
>>>>> overlap recorded in the base tier didn't change also:
>>>>> {
>>>>>     "id": {
>>>>>         "oid": "rbd_data.2aae62ae8944a.0000000000000000",
>>>>>         "key": "",
>>>>>         "snapid": -2,
>>>>>         "hash": 2375431681,
>>>>>         "max": 0,
>>>>>         "pool": 3,
>>>>>         "namespace": "",
>>>>>         "max": 0
>>>>>     },
>>>>>     "info": {
>>>>>         "oid": {
>>>>>             "oid": "rbd_data.2aae62ae8944a.0000000000000000",
>>>>>             "key": "",
>>>>>             "snapid": -2,
>>>>>             "hash": 2375431681,
>>>>>             "max": 0,
>>>>>             "pool": 3,
>>>>>             "namespace": ""
>>>>>         },
>>>>>         "version": "4536'14459",
>>>>>         "prior_version": "4536'14458",
>>>>>         "last_reqid": "client.174834.0:10",
>>>>>         "user_version": 14648,
>>>>>         "size": 68,
>>>>>         "mtime": "2017-08-09 11:30:01.412634",
>>>>>         "local_mtime": "2017-08-09 11:30:01.413614",
>>>>>         "lost": 0,
>>>>>         "flags": 36,
>>>>>         "snaps": [],
>>>>>         "truncate_seq": 0,
>>>>>         "truncate_size": 0,
>>>>>         "data_digest": 4294967295,
>>>>>         "omap_digest": 4294967295,
>>>>>         "watchers": {}
>>>>>     },
>>>>>     "stat": {
>>>>>         "size": 68,
>>>>>         "blksize": 4096,
>>>>>         "blocks": 16,
>>>>>         "nlink": 1
>>>>>     },
>>>>>     "SnapSet": {
>>>>>         "snap_context": {
>>>>>             "seq": 28,
>>>>>             "snaps": [
>>>>>                 28
>>>>>             ]
>>>>>         },
>>>>>         "head_exists": 1,
>>>>>         "clones": [
>>>>>             {
>>>>>                 "snap": 28,
>>>>>                 "size": 68,
>>>>>                 "overlap": "[0~64]"
>>>>>             }
>>>>>         ]
>>>>>     }
>>>>> }
>>>>>
>>>>> Then we turn to the source code to find the reason for this. And we
>>>>> found that, the reason should be that, in the
>>>>> ReplicatedPG::make_writeable method, when determining whether the
>>>>> write op's modified range should be subtracted from the clone overlap,
>>>>> it has pass two condition check: "ctx->new_snapset.clones.size() > 0"
>>>>> and "is_present_clone(last_clone_oid)", as the code below shows.
>>>>>
>>>>>   // update most recent clone_overlap and usage stats
>>>>>   if (ctx->new_snapset.clones.size() > 0) {
>>>>>     /* we need to check whether the most recent clone exists, if it's
>>>>> been evicted,
>>>>>      * it's not included in the stats */
>>>>>     hobject_t last_clone_oid = soid;
>>>>>     last_clone_oid.snap = ctx->new_snapset.clone_overlap.rbegin()->first;
>>>>>     if (is_present_clone(last_clone_oid)) {
>>>>>       interval_set<uint64_t> &newest_overlap =
>>>>> ctx->new_snapset.clone_overlap.rbegin()->second;
>>>>>       ctx->modified_ranges.intersection_of(newest_overlap);
>>>>>       // modified_ranges is still in use by the clone
>>>>>       add_interval_usage(ctx->modified_ranges, ctx->delta_stats);
>>>>>       newest_overlap.subtract(ctx->modified_ranges);
>>>>>     }
>>>>>   }
>>>>>
>>>>> We thought that the latter condition check
>>>>> "is_present_clone(last_clone_oid)" might not be reasonable to be a
>>>>> judgement base for the determination of whether to subtract the clone
>>>>> overlap with write ops' modified range, so we changed to code above to
>>>>> the following, which moved the subtraction out of the latter condition
>>>>> check, and submitted a pr for this:
>>>>> https://github.com/ceph/ceph/pull/16790:
>>>>>
>>>>>   // update most recent clone_overlap and usage stats
>>>>>   if (ctx->new_snapset.clones.size() > 0) {
>>>>>     /* we need to check whether the most recent clone exists, if it's
>>>>> been evicted,
>>>>>      * it's not included in the stats */
>>>>>     hobject_t last_clone_oid = soid;
>>>>>     last_clone_oid.snap = ctx->new_snapset.clone_overlap.rbegin()->first;
>>>>>     interval_set<uint64_t> &newest_overlap =
>>>>> ctx->new_snapset.clone_overlap.rbegin()->second;
>>>>>     ctx->modified_ranges.intersection_of(newest_overlap);
>>>>>     newest_overlap.subtract(ctx->modified_ranges);
>>>>>
>>>>>     if (is_present_clone(last_clone_oid)) {
>>>>>       // modified_ranges is still in use by the clone
>>>>>       add_interval_usage(ctx->modified_ranges, ctx->delta_stats);
>>>>>     }
>>>>>   }
>>>>>
>>>>> In our test, this change solved the problem successfully, however, we
>>>>> can't confirm that this change won't cause any new problems. So, here
>>>>> we are discussing how to solve this problem:-)
>>>>>
>>>>> On 9 August 2017 at 09:26, Jason Dillaman <jdillama@redhat.com> wrote:
>>>>>> Is this issue related to https://github.com/ceph/ceph/pull/10626?
>>>>>>
>>>>>> On Mon, Aug 7, 2017 at 8:07 PM, Xuehan Xu <xxhdx1985126@gmail.com> wrote:
>>>>>>> OK, I'll try that. Thank you:-)
>>>>>>>
>>>>>>> On 8 August 2017 at 04:48, Jason Dillaman <jdillama@redhat.com> wrote:
>>>>>>>> Could you just proxy the "list snaps" op from the cache tier down to
>>>>>>>> the base tier and combine the cache tier + base tier results? Reading
>>>>>>>> the associated ticket, it seems kludgy to me to attempt to work around
>>>>>>>> this within librbd by just refusing the provide intra-object diffs if
>>>>>>>> cache tiering is in-use.
>>>>>>>>
>>>>>>>> On Sat, Aug 5, 2017 at 11:56 AM, Xuehan Xu <xxhdx1985126@gmail.com> wrote:
>>>>>>>>> Hi, everyone.
>>>>>>>>>
>>>>>>>>> Trying to solve the issue "http://tracker.ceph.com/issues/20896", I
>>>>>>>>> just did another test: I did some writes to an object
>>>>>>>>> "rbd_data.1ebc6238e1f29.0000000000000000" to raise its "HEAD" object
>>>>>>>>> to the cache tier, after which I specifically write to its offset 0x40
>>>>>>>>> with 4 bytes of random data. Then I used "ceph-objectstore-tool" to
>>>>>>>>> dump its "HEAD" version in the base tier, the result is as
>>>>>>>>> follows(before I raise it to cache tier, there is three snaps the
>>>>>>>>> latest of which is 26):
>>>>>>>>>
>>>>>>>>> {
>>>>>>>>>     "id": {
>>>>>>>>>         "oid": "rbd_data.1ebc6238e1f29.0000000000000000",
>>>>>>>>>         "key": "",
>>>>>>>>>         "snapid": -2,
>>>>>>>>>         "hash": 1655893237,
>>>>>>>>>         "max": 0,
>>>>>>>>>         "pool": 3,
>>>>>>>>>         "namespace": "",
>>>>>>>>>         "max": 0
>>>>>>>>>     },
>>>>>>>>>     "info": {
>>>>>>>>>         "oid": {
>>>>>>>>>             "oid": "rbd_data.1ebc6238e1f29.0000000000000000",
>>>>>>>>>             "key": "",
>>>>>>>>>             "snapid": -2,
>>>>>>>>>             "hash": 1655893237,
>>>>>>>>>             "max": 0,
>>>>>>>>>             "pool": 3,
>>>>>>>>>             "namespace": ""
>>>>>>>>>         },
>>>>>>>>>         "version": "4219'16423",
>>>>>>>>>         "prior_version": "3978'16310",
>>>>>>>>>         "last_reqid": "osd.70.4213:2359",
>>>>>>>>>         "user_version": 17205,
>>>>>>>>>         "size": 4194304,
>>>>>>>>>         "mtime": "2017-08-03 22:07:34.656122",
>>>>>>>>>         "local_mtime": "2017-08-05 23:02:33.628734",
>>>>>>>>>         "lost": 0,
>>>>>>>>>         "flags": 52,
>>>>>>>>>         "snaps": [],
>>>>>>>>>         "truncate_seq": 0,
>>>>>>>>>         "truncate_size": 0,
>>>>>>>>>         "data_digest": 2822203961,
>>>>>>>>>         "omap_digest": 4294967295,
>>>>>>>>>         "watchers": {}
>>>>>>>>>     },
>>>>>>>>>     "stat": {
>>>>>>>>>         "size": 4194304,
>>>>>>>>>         "blksize": 4096,
>>>>>>>>>         "blocks": 8200,
>>>>>>>>>         "nlink": 1
>>>>>>>>>     },
>>>>>>>>>     "SnapSet": {
>>>>>>>>>         "snap_context": {
>>>>>>>>>             "seq": 26,
>>>>>>>>>             "snaps": [
>>>>>>>>>                 26,
>>>>>>>>>                 25,
>>>>>>>>>                 16
>>>>>>>>>             ]
>>>>>>>>>         },
>>>>>>>>>         "head_exists": 1,
>>>>>>>>>         "clones": [
>>>>>>>>>             {
>>>>>>>>>                 "snap": 16,
>>>>>>>>>                 "size": 4194304,
>>>>>>>>>                 "overlap": "[4~4194300]"
>>>>>>>>>             },
>>>>>>>>>             {
>>>>>>>>>                 "snap": 25,
>>>>>>>>>                 "size": 4194304,
>>>>>>>>>                 "overlap": "[]"
>>>>>>>>>             },
>>>>>>>>>             {
>>>>>>>>>                 "snap": 26,
>>>>>>>>>                 "size": 4194304,
>>>>>>>>>                 "overlap": "[]"
>>>>>>>>>             }
>>>>>>>>>         ]
>>>>>>>>>     }
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> As we can see, its clone_overlap for snap 26 is empty, which,
>>>>>>>>> combining with the previous test described in
>>>>>>>>> http://tracker.ceph.com/issues/20896, means that the writes' "modified
>>>>>>>>> range" is neither recorded in the cache tier nor in the base tier.
>>>>>>>>>
>>>>>>>>> I think maybe we really should move the clone overlap modification out
>>>>>>>>> of the IF block which has the condition check "is_present_clone". As
>>>>>>>>> for now, I can't see any other way to fix this problem.
>>>>>>>>>
>>>>>>>>> Am I right about this?
>>>>>>>>>
>>>>>>>>> On 4 August 2017 at 03:14, Xuehan Xu <xxhdx1985126@gmail.com> wrote:
>>>>>>>>>> I mean I think it's the condition check "is_present_clone" that
>>>>>>>>>> prevent the clone overlap to record the client write operations
>>>>>>>>>> modified range when the target "HEAD" object exists without its most
>>>>>>>>>> recent clone object, and if I'm right, just move the clone overlap
>>>>>>>>>> modification out of the "is_present_clone" condition check block is
>>>>>>>>>> enough to solve this case, just like the PR
>>>>>>>>>> "https://github.com/ceph/ceph/pull/16790", and this fix wouldn't cause
>>>>>>>>>> other problems.
>>>>>>>>>>
>>>>>>>>>> In our test, this fix solved the problem successfully, however, we
>>>>>>>>>> can't confirm it won't cause new problems yet.
>>>>>>>>>>
>>>>>>>>>> So if anyone see this and knows the answer, please help us. Thank you:-)
>>>>>>>>>>
>>>>>>>>>> On 4 August 2017 at 11:41, Xuehan Xu <xxhdx1985126@gmail.com> wrote:
>>>>>>>>>>> Hi, grep:-)
>>>>>>>>>>>
>>>>>>>>>>> I finally got what you mean in https://github.com/ceph/ceph/pull/16790.
>>>>>>>>>>>
>>>>>>>>>>> I agree with you in that " clone overlap is supposed to be tracking
>>>>>>>>>>> which data is the same on disk".
>>>>>>>>>>>
>>>>>>>>>>> My thought is that, "ObjectContext::new_snapset.clones" is already an
>>>>>>>>>>> indicator about whether there are clone objects on disk, so, in the
>>>>>>>>>>> scenario of "cache tier", although a clone oid does not corresponds to
>>>>>>>>>>> a "present clone" in cache tier, as long as
>>>>>>>>>>> "ObjectContext::new_snapset.clones" is not empty, there must a one
>>>>>>>>>>> such clone object in the base tier. And, as long as
>>>>>>>>>>> "ObjectContext::new_snapset.clones" has a strict "one-to-one"
>>>>>>>>>>> correspondence to "ObjectContext::new_snapset.clone_overlap", passing
>>>>>>>>>>> the condition check "if (ctx->new_snapset.clones.size() > 0)" is
>>>>>>>>>>> enough to make the judgement that the clone object exists.
>>>>>>>>>>>
>>>>>>>>>>> So, if I'm right, passing the condition check "if
>>>>>>>>>>> (ctx->new_snapset.clones.size() > 0)" is already enough for us to do
>>>>>>>>>>> "newest_overlap.subtract(ctx->modified_ranges)", it doesn't have to
>>>>>>>>>>> pass "is_present_clone".
>>>>>>>>>>>
>>>>>>>>>>> Am I right about this? Or am I missing anything?
>>>>>>>>>>>
>>>>>>>>>>> Please help us, thank you:-)
>>>>>>>>> --
>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Jason
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Jason
>>>
>>>
>>>
>>> --
>>> Jason
>
>
>
> --
> Jason

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: About the problem "export_diff relies on clone_overlap, which is lost when cache tier is enabled"
  2017-08-10 15:11                       ` Xuehan Xu
@ 2017-08-10 15:54                         ` Jason Dillaman
  2017-08-21  3:06                           ` Xuehan Xu
  0 siblings, 1 reply; 14+ messages in thread
From: Jason Dillaman @ 2017-08-10 15:54 UTC (permalink / raw)
  To: Xuehan Xu; +Cc: Gregory Farnum, ceph-devel

I cannot speak to the design principles behind this since I didn't
write it. Whatever the solution is, the results should be consistent
regardless of the current promotion/eviction/clean/dirty status of an
object between the base and cache tier. For example, if the cache tier
always performs a full object write when it writes back to the base
tier, the base tier clone overlap would show zero overlap, so it would
be inconsistent for the cache tier to show an accurate clone overlap
up until the point the object is evicted.
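
To illustrate that last point: if the writeback rewrites every byte of
the object, the base tier ends up subtracting the whole object range
from whatever overlap the clone had, which is exactly the empty
overlap visible in the earlier dump. A sketch of that arithmetic,
again using interval_set<> and not the actual flush/promote code path:

  // Whatever overlap a clone had before the flush, a full-object
  // writeback subtracts the entire object range, leaving "[]".
  #include "include/interval_set.h"

  interval_set<uint64_t> overlap_after_full_flush(interval_set<uint64_t> overlap,
                                                  uint64_t object_size) {
    interval_set<uint64_t> modified;
    modified.insert(0, object_size);     // the flush touches every byte
    modified.intersection_of(overlap);
    overlap.subtract(modified);          // nothing left
    return overlap;
  }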

On Thu, Aug 10, 2017 at 11:11 AM, Xuehan Xu <xxhdx1985126@gmail.com> wrote:
> Yes, that's right. However, as you say, that doesn't seem to be very
> efficient, and there strict time window constrains to our backup
> tasks.
>
> On the other hand, I still don't quite understand why we can't just
> simply move the clone overlap subtraction out of the
> "is_present_clone" condition check in the PrimaryLogPG::make_writeable
> method. I've searched and read all the source codes that are related
> to the clone overlap, it seems that clone overlap is only used for
> calculating the difference between clones or object subsets that needs
> to be recovered or backfilled. If this is true, wouldn't it be
> reasonable to keep the overlap accurate no matter whether the clone
> objects are in the cache tier or the base tier? Why leave it
> incomplete when clone objects are not in the cache tier?
>
> Could you show us some more design principles behind this, please?
> Maybe we can figure out a better solution if we can fully understand
> the design intentions:-)
>
> Thanks for helping us:-)
>
> On 10 August 2017 at 21:11, Jason Dillaman <jdillama@redhat.com> wrote:
>> Yes, it's quite possible that for simplicity sake the cache tier OSD
>> writes back the full object to the base tier even if only a single
>> byte has changed. In that case, while not optimal behavior, at least
>> if you combine the base + cache tier snapsets together, RBD should be
>> able to export enough data to reconstruct the image.
>>
>> On Thu, Aug 10, 2017 at 3:06 AM, Xuehan Xu <xxhdx1985126@gmail.com> wrote:
>>> Hi, Jason.
>>>
>>> I did a test, it turned out that, after flushing the object out of the
>>> cache tier, the clone overlap in base tier changed to empty, as is
>>> shown below. I think this maybe because that the flush operation just
>>> mark the whole range of the object being flushed as modified, so if
>>> the object's size hasn't changed, the overlap becomes empty. Is this
>>> right?
>>>
>>> Thank you:-)
>>>
>>> {
>>>     "id": {
>>>         "oid": "test.obj",
>>>         "key": "",
>>>         "snapid": -2,
>>>         "hash": 3575411564,
>>>         "max": 0,
>>>         "pool": 10,
>>>         "namespace": "",
>>>         "max": 0
>>>     },
>>>     "info": {
>>>         "oid": {
>>>             "oid": "test.obj",
>>>             "key": "",
>>>             "snapid": -2,
>>>             "hash": 3575411564,
>>>             "max": 0,
>>>             "pool": 10,
>>>             "namespace": ""
>>>         },
>>>         "version": "4876'9",
>>>         "prior_version": "4854'8",
>>>         "last_reqid": "osd.35.4869:1",
>>>         "user_version": 16,
>>>         "size": 4194303,
>>>         "mtime": "2017-08-10 14:54:56.087387",
>>>         "local_mtime": "2017-08-10 14:59:15.252755",
>>>         "lost": 0,
>>>         "flags": 52,
>>>         "snaps": [],
>>>         "truncate_seq": 0,
>>>         "truncate_size": 0,
>>>         "data_digest": 2827420887,
>>>         "omap_digest": 4294967295,
>>>         "watchers": {}
>>>     },
>>>     "stat": {
>>>         "size": 4194303,
>>>         "blksize": 4096,
>>>         "blocks": 8200,
>>>         "nlink": 1
>>>     },
>>>     "SnapSet": {
>>>         "snap_context": {
>>>             "seq": 3,
>>>             "snaps": [
>>>                 3
>>>             ]
>>>         },
>>>         "head_exists": 1,
>>>         "clones": [
>>>             {
>>>                 "snap": 3,
>>>                 "size": 4194303,
>>>                 "overlap": "[]"
>>>             }
>>>         ]
>>>     }
>>> }
>>>
>>> On 9 August 2017 at 23:26, Jason Dillaman <jdillama@redhat.com> wrote:
>>>> If you flush the object out of the cache tier so that its changes are
>>>> recorded in the base tier, is the overlap correctly recorded in the
>>>> base tier?
>>>>
>>>> On Wed, Aug 9, 2017 at 12:24 AM, Xuehan Xu <xxhdx1985126@gmail.com> wrote:
>>>>> By the way, according to our test, since the modified range is not
>>>>> recorded either in the cache tier or in the base tier, I think
>>>>> proxying the "list-snaps" down to the base tier might not work as
>>>>> well, is that right?
>>>>>
>>>>> On 9 August 2017 at 12:20, Xuehan Xu <xxhdx1985126@gmail.com> wrote:
>>>>>> Sorry, I didn't "reply all"....:-)
>>>>>>
>>>>>>
>>>>>> ---------- Forwarded message ----------
>>>>>> From: Xuehan Xu <xxhdx1985126@gmail.com>
>>>>>> Date: 9 August 2017 at 12:14
>>>>>> Subject: Re: About the problem "export_diff relies on clone_overlap,
>>>>>> which is lost when cache tier is enabled"
>>>>>> To: dillaman@redhat.com
>>>>>>
>>>>>>
>>>>>> Um, no, I don't think they are related.
>>>>>>
>>>>>> My problem is this:
>>>>>>
>>>>>> At first, we tried to use "rbd export-diff" to do incremental backups
>>>>>> on a Jewel-version ceph cluster with cache tiering enabled. However,
>>>>>> when we compared the original rbd image and the backup rbd image, we
>>>>>> found that they were different. After a series of debugging, we found
>>>>>> that this is because WRITE ops' "modified_range" is not subtracted
>>>>>> from the clone overlap of the target object's HEAD version when that
>>>>>> object's HEAD version is in the cache tier and its most recent clone
>>>>>> object is not, which leads to a miscalculation in the
>>>>>> "calc_snap_set_diff" function.
>>>>>>
>>>>>> For example, we did the following test: we first created a snap for an
>>>>>> rbd image "test.2.img" whose size is only 4MB, which means it only
>>>>>> contains one object; then we sent a series of AioWrites to
>>>>>> "test.2.img" to promote its HEAD object into the cache tier, while
>>>>>> leaving its clone object in the base tier only; at that point we used
>>>>>> "ceph-objectstore-tool" to dump the object we wrote to, and the result
>>>>>> was as follows:
>>>>>>
>>>>>> {
>>>>>>     "id": {
>>>>>>         "oid": "rbd_data.2aae62ae8944a.0000000000000000",
>>>>>>         "key": "",
>>>>>>         "snapid": -2,
>>>>>>         "hash": 2375431681,
>>>>>>         "max": 0,
>>>>>>         "pool": 4,
>>>>>>         "namespace": "",
>>>>>>         "max": 0
>>>>>>     },
>>>>>>     "info": {
>>>>>>         "oid": {
>>>>>>             "oid": "rbd_data.2aae62ae8944a.0000000000000000",
>>>>>>             "key": "",
>>>>>>             "snapid": -2,
>>>>>>             "hash": 2375431681,
>>>>>>             "max": 0,
>>>>>>             "pool": 4,
>>>>>>             "namespace": ""
>>>>>>         },
>>>>>>         "version": "4536'2728",
>>>>>>         "prior_version": "4536'2727",
>>>>>>         "last_reqid": "client.174858.0:10",
>>>>>>         "user_version": 14706,
>>>>>>         "size": 68,
>>>>>>         "mtime": "2017-08-09 11:30:18.263983",
>>>>>>         "local_mtime": "2017-08-09 11:30:18.264310",
>>>>>>         "lost": 0,
>>>>>>         "flags": 4,
>>>>>>         "snaps": [],
>>>>>>         "truncate_seq": 0,
>>>>>>         "truncate_size": 0,
>>>>>>         "data_digest": 4294967295,
>>>>>>         "omap_digest": 4294967295,
>>>>>>         "watchers": {}
>>>>>>     },
>>>>>>     "stat": {
>>>>>>         "size": 68,
>>>>>>         "blksize": 4096,
>>>>>>         "blocks": 16,
>>>>>>         "nlink": 1
>>>>>>     },
>>>>>>     "SnapSet": {
>>>>>>         "snap_context": {
>>>>>>             "seq": 28,
>>>>>>             "snaps": [
>>>>>>                 28
>>>>>>             ]
>>>>>>         },
>>>>>>         "head_exists": 1,
>>>>>>         "clones": [
>>>>>>             {
>>>>>>                 "snap": 28,
>>>>>>                 "size": 68,
>>>>>>                 "overlap": "[0~64]"
>>>>>>             }
>>>>>>         ]
>>>>>>     }
>>>>>> }
>>>>>>
>>>>>> In this result, we found that the overlap for clone "28" is [0~64]. So
>>>>>> we specifically sent an AioWrite req to this object to write 4 bytes
>>>>>> of random data at offset 32, after which we expected the overlap to
>>>>>> change to [0~32, 36~28]. However, the result is as follows:
>>>>>>
>>>>>> {
>>>>>>     "id": {
>>>>>>         "oid": "rbd_data.2aae62ae8944a.0000000000000000",
>>>>>>         "key": "",
>>>>>>         "snapid": -2,
>>>>>>         "hash": 2375431681,
>>>>>>         "max": 0,
>>>>>>         "pool": 4,
>>>>>>         "namespace": "",
>>>>>>         "max": 0
>>>>>>     },
>>>>>>     "info": {
>>>>>>         "oid": {
>>>>>>             "oid": "rbd_data.2aae62ae8944a.0000000000000000",
>>>>>>             "key": "",
>>>>>>             "snapid": -2,
>>>>>>             "hash": 2375431681,
>>>>>>             "max": 0,
>>>>>>             "pool": 4,
>>>>>>             "namespace": ""
>>>>>>         },
>>>>>>         "version": "4546'2730",
>>>>>>         "prior_version": "4538'2729",
>>>>>>         "last_reqid": "client.155555.0:10",
>>>>>>         "user_version": 14708,
>>>>>>         "size": 4096,
>>>>>>         "mtime": "2017-08-09 11:36:20.717910",
>>>>>>         "local_mtime": "2017-08-09 11:36:20.719039",
>>>>>>         "lost": 0,
>>>>>>         "flags": 4,
>>>>>>         "snaps": [],
>>>>>>         "truncate_seq": 0,
>>>>>>         "truncate_size": 0,
>>>>>>         "data_digest": 4294967295,
>>>>>>         "omap_digest": 4294967295,
>>>>>>         "watchers": {}
>>>>>>     },
>>>>>>     "stat": {
>>>>>>         "size": 4096,
>>>>>>         "blksize": 4096,
>>>>>>         "blocks": 16,
>>>>>>         "nlink": 1
>>>>>>     },
>>>>>>     "SnapSet": {
>>>>>>         "snap_context": {
>>>>>>             "seq": 28,
>>>>>>             "snaps": [
>>>>>>                 28
>>>>>>             ]
>>>>>>         },
>>>>>>         "head_exists": 1,
>>>>>>         "clones": [
>>>>>>             {
>>>>>>                 "snap": 28,
>>>>>>                 "size": 68,
>>>>>>                 "overlap": "[0~64]"
>>>>>>             }
>>>>>>         ]
>>>>>>     }
>>>>>> }
>>>>>>
>>>>>> It's obvious that it didn't change at all. If we do export-diff under
>>>>>> this circumstance, the result would be wrong. Meanwhile, the
>>>>>> "ceph-objectstore-tool" dump also showed that the overlap recorded in
>>>>>> the base tier didn't change either:
>>>>>> {
>>>>>>     "id": {
>>>>>>         "oid": "rbd_data.2aae62ae8944a.0000000000000000",
>>>>>>         "key": "",
>>>>>>         "snapid": -2,
>>>>>>         "hash": 2375431681,
>>>>>>         "max": 0,
>>>>>>         "pool": 3,
>>>>>>         "namespace": "",
>>>>>>         "max": 0
>>>>>>     },
>>>>>>     "info": {
>>>>>>         "oid": {
>>>>>>             "oid": "rbd_data.2aae62ae8944a.0000000000000000",
>>>>>>             "key": "",
>>>>>>             "snapid": -2,
>>>>>>             "hash": 2375431681,
>>>>>>             "max": 0,
>>>>>>             "pool": 3,
>>>>>>             "namespace": ""
>>>>>>         },
>>>>>>         "version": "4536'14459",
>>>>>>         "prior_version": "4536'14458",
>>>>>>         "last_reqid": "client.174834.0:10",
>>>>>>         "user_version": 14648,
>>>>>>         "size": 68,
>>>>>>         "mtime": "2017-08-09 11:30:01.412634",
>>>>>>         "local_mtime": "2017-08-09 11:30:01.413614",
>>>>>>         "lost": 0,
>>>>>>         "flags": 36,
>>>>>>         "snaps": [],
>>>>>>         "truncate_seq": 0,
>>>>>>         "truncate_size": 0,
>>>>>>         "data_digest": 4294967295,
>>>>>>         "omap_digest": 4294967295,
>>>>>>         "watchers": {}
>>>>>>     },
>>>>>>     "stat": {
>>>>>>         "size": 68,
>>>>>>         "blksize": 4096,
>>>>>>         "blocks": 16,
>>>>>>         "nlink": 1
>>>>>>     },
>>>>>>     "SnapSet": {
>>>>>>         "snap_context": {
>>>>>>             "seq": 28,
>>>>>>             "snaps": [
>>>>>>                 28
>>>>>>             ]
>>>>>>         },
>>>>>>         "head_exists": 1,
>>>>>>         "clones": [
>>>>>>             {
>>>>>>                 "snap": 28,
>>>>>>                 "size": 68,
>>>>>>                 "overlap": "[0~64]"
>>>>>>             }
>>>>>>         ]
>>>>>>     }
>>>>>> }
>>>>>>
>>>>>> Then we turned to the source code to find the reason. We found that,
>>>>>> in the ReplicatedPG::make_writeable method, when determining whether
>>>>>> the write op's modified range should be subtracted from the clone
>>>>>> overlap, it has to pass two condition checks,
>>>>>> "ctx->new_snapset.clones.size() > 0" and
>>>>>> "is_present_clone(last_clone_oid)", as the code below shows.
>>>>>>
>>>>>>   // update most recent clone_overlap and usage stats
>>>>>>   if (ctx->new_snapset.clones.size() > 0) {
>>>>>>     /* we need to check whether the most recent clone exists, if it's
>>>>>> been evicted,
>>>>>>      * it's not included in the stats */
>>>>>>     hobject_t last_clone_oid = soid;
>>>>>>     last_clone_oid.snap = ctx->new_snapset.clone_overlap.rbegin()->first;
>>>>>>     if (is_present_clone(last_clone_oid)) {
>>>>>>       interval_set<uint64_t> &newest_overlap =
>>>>>> ctx->new_snapset.clone_overlap.rbegin()->second;
>>>>>>       ctx->modified_ranges.intersection_of(newest_overlap);
>>>>>>       // modified_ranges is still in use by the clone
>>>>>>       add_interval_usage(ctx->modified_ranges, ctx->delta_stats);
>>>>>>       newest_overlap.subtract(ctx->modified_ranges);
>>>>>>     }
>>>>>>   }
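A quick standalone illustration of what the subtract step above is expected
to do on the concrete numbers from this thread (a [0~64] overlap and a 4-byte
write at offset 32). This is not Ceph code: the plain std::map representation
and the subtract_write helper only mimic the relevant part of
interval_set<uint64_t>.

  // Standalone sketch (not Ceph code): extents kept as offset -> length,
  // mimicking the interval_set subtraction done on the newest clone_overlap.
  #include <algorithm>
  #include <cstdint>
  #include <iostream>
  #include <map>

  using Extents = std::map<uint64_t, uint64_t>;  // offset -> length

  // Remove the written extent [off, off+len) from every extent in 'overlap'.
  Extents subtract_write(const Extents& overlap, uint64_t off, uint64_t len) {
    Extents out;
    for (const auto& [start, length] : overlap) {
      uint64_t end = start + length;
      uint64_t lo = std::max(start, off);
      uint64_t hi = std::min(end, off + len);
      if (lo >= hi) { out[start] = length; continue; }  // write misses this extent
      if (start < lo) out[start] = lo - start;          // left remainder
      if (hi < end)   out[hi] = end - hi;               // right remainder
    }
    return out;
  }

  int main() {
    Extents overlap{{0, 64}};                           // clone overlap "[0~64]"
    for (const auto& [o, l] : subtract_write(overlap, 32, 4))
      std::cout << o << "~" << l << " ";                // prints "0~32 36~28"
    std::cout << "\n";   // i.e. bytes 0-31 and 36-63 stay shared with the clone
  }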
>>>>>>
>>>>>> We thought that the latter condition check,
>>>>>> "is_present_clone(last_clone_oid)", might not be a reasonable basis
>>>>>> for deciding whether to subtract the write op's modified range from
>>>>>> the clone overlap, so we changed the code above to the following,
>>>>>> which moves the subtraction out of the latter condition check, and
>>>>>> submitted a PR for this:
>>>>>> https://github.com/ceph/ceph/pull/16790:
>>>>>>
>>>>>>   // update most recent clone_overlap and usage stats
>>>>>>   if (ctx->new_snapset.clones.size() > 0) {
>>>>>>     /* we need to check whether the most recent clone exists, if it's
>>>>>> been evicted,
>>>>>>      * it's not included in the stats */
>>>>>>     hobject_t last_clone_oid = soid;
>>>>>>     last_clone_oid.snap = ctx->new_snapset.clone_overlap.rbegin()->first;
>>>>>>     interval_set<uint64_t> &newest_overlap =
>>>>>> ctx->new_snapset.clone_overlap.rbegin()->second;
>>>>>>     ctx->modified_ranges.intersection_of(newest_overlap);
>>>>>>     newest_overlap.subtract(ctx->modified_ranges);
>>>>>>
>>>>>>     if (is_present_clone(last_clone_oid)) {
>>>>>>       // modified_ranges is still in use by the clone
>>>>>>       add_interval_usage(ctx->modified_ranges, ctx->delta_stats);
>>>>>>     }
>>>>>>   }
>>>>>>
>>>>>> In our test, this change solved the problem successfully; however, we
>>>>>> can't confirm that it won't cause any new problems. So here we are,
>>>>>> discussing how to solve this problem:-)
>>>>>>
>>>>>> On 9 August 2017 at 09:26, Jason Dillaman <jdillama@redhat.com> wrote:
>>>>>>> Is this issue related to https://github.com/ceph/ceph/pull/10626?
>>>>>>>
>>>>>>> On Mon, Aug 7, 2017 at 8:07 PM, Xuehan Xu <xxhdx1985126@gmail.com> wrote:
>>>>>>>> OK, I'll try that. Thank you:-)
>>>>>>>>
>>>>>>>> On 8 August 2017 at 04:48, Jason Dillaman <jdillama@redhat.com> wrote:
>>>>>>>>> Could you just proxy the "list snaps" op from the cache tier down to
>>>>>>>>> the base tier and combine the cache tier + base tier results? Reading
>>>>>>>>> the associated ticket, it seems kludgy to me to attempt to work around
>>>>>>>>> this within librbd by just refusing the provide intra-object diffs if
>>>>>>>>> cache tiering is in-use.
>>>>>>>>>
>>>>>>>>> On Sat, Aug 5, 2017 at 11:56 AM, Xuehan Xu <xxhdx1985126@gmail.com> wrote:
>>>>>>>>>> Hi, everyone.
>>>>>>>>>>
>>>>>>>>>> Trying to solve the issue "http://tracker.ceph.com/issues/20896", I
>>>>>>>>>> just did another test: I did some writes to an object,
>>>>>>>>>> "rbd_data.1ebc6238e1f29.0000000000000000", to promote its "HEAD"
>>>>>>>>>> object to the cache tier, after which I specifically wrote 4 bytes
>>>>>>>>>> of random data to its offset 0x40. Then I used
>>>>>>>>>> "ceph-objectstore-tool" to dump its "HEAD" version in the base
>>>>>>>>>> tier; the result is as follows (before I promoted it to the cache
>>>>>>>>>> tier, there were three snaps, the latest of which is 26):
>>>>>>>>>>
>>>>>>>>>> {
>>>>>>>>>>     "id": {
>>>>>>>>>>         "oid": "rbd_data.1ebc6238e1f29.0000000000000000",
>>>>>>>>>>         "key": "",
>>>>>>>>>>         "snapid": -2,
>>>>>>>>>>         "hash": 1655893237,
>>>>>>>>>>         "max": 0,
>>>>>>>>>>         "pool": 3,
>>>>>>>>>>         "namespace": "",
>>>>>>>>>>         "max": 0
>>>>>>>>>>     },
>>>>>>>>>>     "info": {
>>>>>>>>>>         "oid": {
>>>>>>>>>>             "oid": "rbd_data.1ebc6238e1f29.0000000000000000",
>>>>>>>>>>             "key": "",
>>>>>>>>>>             "snapid": -2,
>>>>>>>>>>             "hash": 1655893237,
>>>>>>>>>>             "max": 0,
>>>>>>>>>>             "pool": 3,
>>>>>>>>>>             "namespace": ""
>>>>>>>>>>         },
>>>>>>>>>>         "version": "4219'16423",
>>>>>>>>>>         "prior_version": "3978'16310",
>>>>>>>>>>         "last_reqid": "osd.70.4213:2359",
>>>>>>>>>>         "user_version": 17205,
>>>>>>>>>>         "size": 4194304,
>>>>>>>>>>         "mtime": "2017-08-03 22:07:34.656122",
>>>>>>>>>>         "local_mtime": "2017-08-05 23:02:33.628734",
>>>>>>>>>>         "lost": 0,
>>>>>>>>>>         "flags": 52,
>>>>>>>>>>         "snaps": [],
>>>>>>>>>>         "truncate_seq": 0,
>>>>>>>>>>         "truncate_size": 0,
>>>>>>>>>>         "data_digest": 2822203961,
>>>>>>>>>>         "omap_digest": 4294967295,
>>>>>>>>>>         "watchers": {}
>>>>>>>>>>     },
>>>>>>>>>>     "stat": {
>>>>>>>>>>         "size": 4194304,
>>>>>>>>>>         "blksize": 4096,
>>>>>>>>>>         "blocks": 8200,
>>>>>>>>>>         "nlink": 1
>>>>>>>>>>     },
>>>>>>>>>>     "SnapSet": {
>>>>>>>>>>         "snap_context": {
>>>>>>>>>>             "seq": 26,
>>>>>>>>>>             "snaps": [
>>>>>>>>>>                 26,
>>>>>>>>>>                 25,
>>>>>>>>>>                 16
>>>>>>>>>>             ]
>>>>>>>>>>         },
>>>>>>>>>>         "head_exists": 1,
>>>>>>>>>>         "clones": [
>>>>>>>>>>             {
>>>>>>>>>>                 "snap": 16,
>>>>>>>>>>                 "size": 4194304,
>>>>>>>>>>                 "overlap": "[4~4194300]"
>>>>>>>>>>             },
>>>>>>>>>>             {
>>>>>>>>>>                 "snap": 25,
>>>>>>>>>>                 "size": 4194304,
>>>>>>>>>>                 "overlap": "[]"
>>>>>>>>>>             },
>>>>>>>>>>             {
>>>>>>>>>>                 "snap": 26,
>>>>>>>>>>                 "size": 4194304,
>>>>>>>>>>                 "overlap": "[]"
>>>>>>>>>>             }
>>>>>>>>>>         ]
>>>>>>>>>>     }
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> As we can see, its clone_overlap for snap 26 is empty, which,
>>>>>>>>>> combined with the previous test described in
>>>>>>>>>> http://tracker.ceph.com/issues/20896, means that the writes'
>>>>>>>>>> "modified range" is recorded neither in the cache tier nor in the
>>>>>>>>>> base tier.
>>>>>>>>>>
>>>>>>>>>> I think maybe we really should move the clone overlap modification
>>>>>>>>>> out of the IF block guarded by the condition check
>>>>>>>>>> "is_present_clone". For now, I can't see any other way to fix this
>>>>>>>>>> problem.
>>>>>>>>>>
>>>>>>>>>> Am I right about this?
>>>>>>>>>>
>>>>>>>>>> On 4 August 2017 at 03:14, Xuehan Xu <xxhdx1985126@gmail.com> wrote:
>>>>>>>>>>> I mean I think it's the condition check "is_present_clone" that
>>>>>>>>>>> prevent the clone overlap to record the client write operations
>>>>>>>>>>> modified range when the target "HEAD" object exists without its most
>>>>>>>>>>> recent clone object, and if I'm right, just move the clone overlap
>>>>>>>>>>> modification out of the "is_present_clone" condition check block is
>>>>>>>>>>> enough to solve this case, just like the PR
>>>>>>>>>>> "https://github.com/ceph/ceph/pull/16790", and this fix wouldn't cause
>>>>>>>>>>> other problems.
>>>>>>>>>>>
>>>>>>>>>>> In our test, this fix solved the problem successfully, however, we
>>>>>>>>>>> can't confirm it won't cause new problems yet.
>>>>>>>>>>>
>>>>>>>>>>> So if anyone see this and knows the answer, please help us. Thank you:-)
>>>>>>>>>>>
>>>>>>>>>>> On 4 August 2017 at 11:41, Xuehan Xu <xxhdx1985126@gmail.com> wrote:
>>>>>>>>>>>> Hi, grep:-)
>>>>>>>>>>>>
>>>>>>>>>>>> I finally got what you mean in https://github.com/ceph/ceph/pull/16790.
>>>>>>>>>>>>
>>>>>>>>>>>> I agree with you in that " clone overlap is supposed to be tracking
>>>>>>>>>>>> which data is the same on disk".
>>>>>>>>>>>>
>>>>>>>>>>>> My thought is that, "ObjectContext::new_snapset.clones" is already an
>>>>>>>>>>>> indicator about whether there are clone objects on disk, so, in the
>>>>>>>>>>>> scenario of "cache tier", although a clone oid does not corresponds to
>>>>>>>>>>>> a "present clone" in cache tier, as long as
>>>>>>>>>>>> "ObjectContext::new_snapset.clones" is not empty, there must a one
>>>>>>>>>>>> such clone object in the base tier. And, as long as
>>>>>>>>>>>> "ObjectContext::new_snapset.clones" has a strict "one-to-one"
>>>>>>>>>>>> correspondence to "ObjectContext::new_snapset.clone_overlap", passing
>>>>>>>>>>>> the condition check "if (ctx->new_snapset.clones.size() > 0)" is
>>>>>>>>>>>> enough to make the judgement that the clone object exists.
>>>>>>>>>>>>
>>>>>>>>>>>> So, if I'm right, passing the condition check "if
>>>>>>>>>>>> (ctx->new_snapset.clones.size() > 0)" is already enough for us to do
>>>>>>>>>>>> "newest_overlap.subtract(ctx->modified_ranges)", it doesn't have to
>>>>>>>>>>>> pass "is_present_clone".
>>>>>>>>>>>>
>>>>>>>>>>>> Am I right about this? Or am I missing anything?
>>>>>>>>>>>>
>>>>>>>>>>>> Please help us, thank you:-)
>>>>>>>>>> --
>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Jason
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Jason
>>>>
>>>>
>>>>
>>>> --
>>>> Jason
>>
>>
>>
>> --
>> Jason



-- 
Jason

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: About the problem "export_diff relies on clone_overlap, which is lost when cache tier is enabled"
  2017-08-10 15:54                         ` Jason Dillaman
@ 2017-08-21  3:06                           ` Xuehan Xu
  0 siblings, 0 replies; 14+ messages in thread
From: Xuehan Xu @ 2017-08-21  3:06 UTC (permalink / raw)
  To: dillaman; +Cc: Gregory Farnum, ceph-devel

Hi, everyone.

I've found that the reason clone overlap modifications have to pass the
"is_present_clone" condition check is to keep the cache tier's usage
stats right, as demonstrated by http://tracker.ceph.com/issues/7964.

So I think that if I add a similar clone overlap, say cache_clone_overlap,
that tracks the disk space usage while the object is in the cache tier,
let the original clone overlap keep tracking all modifications so that
export-diff stays correct, and make the "copy_get" operation also copy
the object's extended attributes, namely "object_info" and "snapset",
the problem should be solved. Does this sound reasonable to you?
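A minimal sketch of the idea, assuming a hypothetical cache_clone_overlap
field and made-up helper names (none of this is existing Ceph code); the
copy_get part of the proposal, copying the object_info and snapset attrs to
the base tier, is not shown:

  // Hypothetical sketch only: 'SnapSetSketch', 'cache_clone_overlap' and
  // 'on_write' are illustrative names, not Ceph code. The point is to keep
  // one overlap that always shrinks on writes (consumed by export-diff) and
  // a cache-local one that only shrinks when the newest clone is present in
  // the cache tier (consumed by the cache usage stats, cf. issue #7964).
  #include <cstdint>
  #include <map>
  #include <utility>

  using Extents = std::map<uint64_t, uint64_t>;            // offset -> length

  // Remove the written extent [off, off+len) from 'overlap'.
  static void subtract(Extents& overlap, uint64_t off, uint64_t len) {
    Extents out;
    for (const auto& [start, length] : overlap) {
      uint64_t end = start + length;
      if (off + len <= start || end <= off) { out[start] = length; continue; }
      if (start < off) out[start] = off - start;                // left remainder
      if (off + len < end) out[off + len] = end - (off + len);  // right remainder
    }
    overlap = std::move(out);
  }

  struct SnapSetSketch {
    Extents clone_overlap;        // authoritative; what export-diff relies on
    Extents cache_clone_overlap;  // hypothetical; used for cache stats only
  };

  // Called for every client write to the HEAD object.
  void on_write(SnapSetSketch& ss, uint64_t off, uint64_t len,
                bool newest_clone_present_in_cache) {
    subtract(ss.clone_overlap, off, len);                  // unconditional
    if (newest_clone_present_in_cache)
      subtract(ss.cache_clone_overlap, off, len);          // stats behave as today
  }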

Thank you:-)

On 10 August 2017 at 23:54, Jason Dillaman <jdillama@redhat.com> wrote:
> I cannot speak to the design principles behind this since I didn't
> write it. Whatever the solution is, the results should be consistent
> regardless of the current promotion/eviction/clean/dirty status of an
> object between the base and cache tier. For example, if the cache tier
> always performs a full object write when it writes back to the base
> tier, the base tier clone overlap would show zero overlap, so it would
> be inconsistent for the cache tier to show an accurate clone overlap
> up until the point the object is evicted.
>
> On Thu, Aug 10, 2017 at 11:11 AM, Xuehan Xu <xxhdx1985126@gmail.com> wrote:
>> Yes, that's right. However, as you say, that doesn't seem to be very
>> efficient, and there are strict time window constraints on our backup
>> tasks.
>>
>> On the other hand, I still don't quite understand why we can't just
>> move the clone overlap subtraction out of the "is_present_clone"
>> condition check in the PrimaryLogPG::make_writeable method. I've
>> searched and read all the source code related to the clone overlap; it
>> seems that the clone overlap is only used for calculating the
>> difference between clones, or the object subsets that need to be
>> recovered or backfilled. If this is true, wouldn't it be reasonable to
>> keep the overlap accurate regardless of whether the clone objects are
>> in the cache tier or the base tier? Why leave it incomplete when the
>> clone objects are not in the cache tier?
>>
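A loose sketch of that kind of consumer, under the assumption that the
changed extents between a clone and the head are simply "the object's range
minus the recorded overlap" (this is not the real calc_snap_set_diff;
diff_from_overlap is an illustrative name). It shows why a stale overlap
makes export-diff silently drop a modification:

  // Not librados code; a simplified model of how a diff consumer would use
  // the recorded clone overlap. A stale (too large) overlap hides real
  // modifications, while an empty overlap makes the whole object look changed.
  #include <algorithm>
  #include <cstdint>
  #include <iostream>
  #include <map>

  using Extents = std::map<uint64_t, uint64_t>;   // offset -> length

  Extents diff_from_overlap(uint64_t object_size, const Extents& overlap) {
    Extents diff;
    uint64_t pos = 0;
    for (const auto& [start, length] : overlap) { // extents sorted by offset
      if (start > pos) diff[pos] = start - pos;   // gap before this shared extent
      pos = std::max(pos, start + length);
    }
    if (pos < object_size) diff[pos] = object_size - pos;
    return diff;
  }

  int main() {
    Extents stale{{0, 64}};              // what the OSD still reports
    Extents correct{{0, 32}, {36, 28}};  // what it should report after the write
    // With the stale overlap the diff over the first 64 bytes is empty, so
    // the 4 bytes written at offset 32 would never be copied by export-diff.
    std::cout << diff_from_overlap(64, stale).size()   << " extent(s)\n";  // 0
    std::cout << diff_from_overlap(64, correct).size() << " extent(s)\n";  // 1 (32~4)
  }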
>> Could you tell us more about the design principles behind this, please?
>> Maybe we can figure out a better solution if we can fully understand
>> the design intentions:-)
>>
>> Thanks for helping us:-)
>>
>> On 10 August 2017 at 21:11, Jason Dillaman <jdillama@redhat.com> wrote:
>>> Yes, it's quite possible that, for simplicity's sake, the cache tier OSD
>>> writes back the full object to the base tier even if only a single
>>> byte has changed. In that case, while not optimal behavior, at least
>>> if you combine the base + cache tier snapsets together, RBD should be
>>> able to export enough data to reconstruct the image.
>>>
>>> On Thu, Aug 10, 2017 at 3:06 AM, Xuehan Xu <xxhdx1985126@gmail.com> wrote:
>>>> Hi, Jason.
>>>>
>>>> I did a test; it turned out that, after flushing the object out of the
>>>> cache tier, the clone overlap in the base tier changed to empty, as
>>>> shown below. I think this may be because the flush operation just
>>>> marks the whole range of the flushed object as modified, so if the
>>>> object's size hasn't changed, the overlap becomes empty. Is this
>>>> right?
>>>>
>>>> Thank you:-)
>>>>
>>>> {
>>>>     "id": {
>>>>         "oid": "test.obj",
>>>>         "key": "",
>>>>         "snapid": -2,
>>>>         "hash": 3575411564,
>>>>         "max": 0,
>>>>         "pool": 10,
>>>>         "namespace": "",
>>>>         "max": 0
>>>>     },
>>>>     "info": {
>>>>         "oid": {
>>>>             "oid": "test.obj",
>>>>             "key": "",
>>>>             "snapid": -2,
>>>>             "hash": 3575411564,
>>>>             "max": 0,
>>>>             "pool": 10,
>>>>             "namespace": ""
>>>>         },
>>>>         "version": "4876'9",
>>>>         "prior_version": "4854'8",
>>>>         "last_reqid": "osd.35.4869:1",
>>>>         "user_version": 16,
>>>>         "size": 4194303,
>>>>         "mtime": "2017-08-10 14:54:56.087387",
>>>>         "local_mtime": "2017-08-10 14:59:15.252755",
>>>>         "lost": 0,
>>>>         "flags": 52,
>>>>         "snaps": [],
>>>>         "truncate_seq": 0,
>>>>         "truncate_size": 0,
>>>>         "data_digest": 2827420887,
>>>>         "omap_digest": 4294967295,
>>>>         "watchers": {}
>>>>     },
>>>>     "stat": {
>>>>         "size": 4194303,
>>>>         "blksize": 4096,
>>>>         "blocks": 8200,
>>>>         "nlink": 1
>>>>     },
>>>>     "SnapSet": {
>>>>         "snap_context": {
>>>>             "seq": 3,
>>>>             "snaps": [
>>>>                 3
>>>>             ]
>>>>         },
>>>>         "head_exists": 1,
>>>>         "clones": [
>>>>             {
>>>>                 "snap": 3,
>>>>                 "size": 4194303,
>>>>                 "overlap": "[]"
>>>>             }
>>>>         ]
>>>>     }
>>>> }
>>>>
>>>> On 9 August 2017 at 23:26, Jason Dillaman <jdillama@redhat.com> wrote:
>>>>> If you flush the object out of the cache tier so that its changes are
>>>>> recorded in the base tier, is the overlap correctly recorded in the
>>>>> base tier?
>>>>>
>>>>> On Wed, Aug 9, 2017 at 12:24 AM, Xuehan Xu <xxhdx1985126@gmail.com> wrote:
>>>>>> By the way, according to our test, since the modified range is not
>>>>>> recorded either in the cache tier or in the base tier, I think
>>>>>> proxying the "list-snaps" op down to the base tier might not work
>>>>>> either; is that right?
>>>>>>
>>>>>> On 9 August 2017 at 12:20, Xuehan Xu <xxhdx1985126@gmail.com> wrote:
>>>>>>> Sorry, I didn't "reply all"....:-)
>>>>>>>
>>>>>>>
>>>>>>> ---------- Forwarded message ----------
>>>>>>> From: Xuehan Xu <xxhdx1985126@gmail.com>
>>>>>>> Date: 9 August 2017 at 12:14
>>>>>>> Subject: Re: About the problem "export_diff relies on clone_overlap,
>>>>>>> which is lost when cache tier is enabled"
>>>>>>> To: dillaman@redhat.com
>>>>>>>
>>>>>>>
>>>>>>> Um, no, I don't think they are related.
>>>>>>>
>>>>>>> My problem is this:
>>>>>>>
>>>>>>> At first, we tried to use "rbd export-diff" to do incremental backups
>>>>>>> on a Jewel-version ceph cluster with cache tiering enabled. However,
>>>>>>> when we compared the original rbd image and the backup rbd image, we
>>>>>>> found that they were different. After a series of debugging, we found
>>>>>>> that this is because WRITE ops' "modified_range" is not subtracted
>>>>>>> from the clone overlap of the target object's HEAD version when that
>>>>>>> object's HEAD version is in the cache tier and its most recent clone
>>>>>>> object is not, which leads to a miscalculation in the
>>>>>>> "calc_snap_set_diff" function.
>>>>>>>
>>>>>>> For example, we did the following test: we first created a snap for an
>>>>>>> rbd image "test.2.img" whose size is only 4MB, which means it only
>>>>>>> contains one object; then we sent a series of AioWrites to
>>>>>>> "test.2.img" to promote its HEAD object into the cache tier, while
>>>>>>> leaving its clone object in the base tier only; at that point we used
>>>>>>> "ceph-objectstore-tool" to dump the object we wrote to, and the result
>>>>>>> was as follows:
>>>>>>>
>>>>>>> {
>>>>>>>     "id": {
>>>>>>>         "oid": "rbd_data.2aae62ae8944a.0000000000000000",
>>>>>>>         "key": "",
>>>>>>>         "snapid": -2,
>>>>>>>         "hash": 2375431681,
>>>>>>>         "max": 0,
>>>>>>>         "pool": 4,
>>>>>>>         "namespace": "",
>>>>>>>         "max": 0
>>>>>>>     },
>>>>>>>     "info": {
>>>>>>>         "oid": {
>>>>>>>             "oid": "rbd_data.2aae62ae8944a.0000000000000000",
>>>>>>>             "key": "",
>>>>>>>             "snapid": -2,
>>>>>>>             "hash": 2375431681,
>>>>>>>             "max": 0,
>>>>>>>             "pool": 4,
>>>>>>>             "namespace": ""
>>>>>>>         },
>>>>>>>         "version": "4536'2728",
>>>>>>>         "prior_version": "4536'2727",
>>>>>>>         "last_reqid": "client.174858.0:10",
>>>>>>>         "user_version": 14706,
>>>>>>>         "size": 68,
>>>>>>>         "mtime": "2017-08-09 11:30:18.263983",
>>>>>>>         "local_mtime": "2017-08-09 11:30:18.264310",
>>>>>>>         "lost": 0,
>>>>>>>         "flags": 4,
>>>>>>>         "snaps": [],
>>>>>>>         "truncate_seq": 0,
>>>>>>>         "truncate_size": 0,
>>>>>>>         "data_digest": 4294967295,
>>>>>>>         "omap_digest": 4294967295,
>>>>>>>         "watchers": {}
>>>>>>>     },
>>>>>>>     "stat": {
>>>>>>>         "size": 68,
>>>>>>>         "blksize": 4096,
>>>>>>>         "blocks": 16,
>>>>>>>         "nlink": 1
>>>>>>>     },
>>>>>>>     "SnapSet": {
>>>>>>>         "snap_context": {
>>>>>>>             "seq": 28,
>>>>>>>             "snaps": [
>>>>>>>                 28
>>>>>>>             ]
>>>>>>>         },
>>>>>>>         "head_exists": 1,
>>>>>>>         "clones": [
>>>>>>>             {
>>>>>>>                 "snap": 28,
>>>>>>>                 "size": 68,
>>>>>>>                 "overlap": "[0~64]"
>>>>>>>             }
>>>>>>>         ]
>>>>>>>     }
>>>>>>> }
>>>>>>>
>>>>>>> In this result, we found that the overlap for clone "28" is [0~64]. So
>>>>>>> we specifically sent an AioWrite req to this object to write 4 bytes
>>>>>>> of random data at offset 32, after which we expected the overlap to
>>>>>>> change to [0~32, 36~28]. However, the result is as follows:
>>>>>>>
>>>>>>> {
>>>>>>>     "id": {
>>>>>>>         "oid": "rbd_data.2aae62ae8944a.0000000000000000",
>>>>>>>         "key": "",
>>>>>>>         "snapid": -2,
>>>>>>>         "hash": 2375431681,
>>>>>>>         "max": 0,
>>>>>>>         "pool": 4,
>>>>>>>         "namespace": "",
>>>>>>>         "max": 0
>>>>>>>     },
>>>>>>>     "info": {
>>>>>>>         "oid": {
>>>>>>>             "oid": "rbd_data.2aae62ae8944a.0000000000000000",
>>>>>>>             "key": "",
>>>>>>>             "snapid": -2,
>>>>>>>             "hash": 2375431681,
>>>>>>>             "max": 0,
>>>>>>>             "pool": 4,
>>>>>>>             "namespace": ""
>>>>>>>         },
>>>>>>>         "version": "4546'2730",
>>>>>>>         "prior_version": "4538'2729",
>>>>>>>         "last_reqid": "client.155555.0:10",
>>>>>>>         "user_version": 14708,
>>>>>>>         "size": 4096,
>>>>>>>         "mtime": "2017-08-09 11:36:20.717910",
>>>>>>>         "local_mtime": "2017-08-09 11:36:20.719039",
>>>>>>>         "lost": 0,
>>>>>>>         "flags": 4,
>>>>>>>         "snaps": [],
>>>>>>>         "truncate_seq": 0,
>>>>>>>         "truncate_size": 0,
>>>>>>>         "data_digest": 4294967295,
>>>>>>>         "omap_digest": 4294967295,
>>>>>>>         "watchers": {}
>>>>>>>     },
>>>>>>>     "stat": {
>>>>>>>         "size": 4096,
>>>>>>>         "blksize": 4096,
>>>>>>>         "blocks": 16,
>>>>>>>         "nlink": 1
>>>>>>>     },
>>>>>>>     "SnapSet": {
>>>>>>>         "snap_context": {
>>>>>>>             "seq": 28,
>>>>>>>             "snaps": [
>>>>>>>                 28
>>>>>>>             ]
>>>>>>>         },
>>>>>>>         "head_exists": 1,
>>>>>>>         "clones": [
>>>>>>>             {
>>>>>>>                 "snap": 28,
>>>>>>>                 "size": 68,
>>>>>>>                 "overlap": "[0~64]"
>>>>>>>             }
>>>>>>>         ]
>>>>>>>     }
>>>>>>> }
>>>>>>>
>>>>>>> It's obvious that it didn't change at all. If we do export-diff under
>>>>>>> this circumstance, the result would be wrong. Meanwhile, the
>>>>>>> "ceph-objectstore-tool" dump also showed that the overlap recorded in
>>>>>>> the base tier didn't change either:
>>>>>>> {
>>>>>>>     "id": {
>>>>>>>         "oid": "rbd_data.2aae62ae8944a.0000000000000000",
>>>>>>>         "key": "",
>>>>>>>         "snapid": -2,
>>>>>>>         "hash": 2375431681,
>>>>>>>         "max": 0,
>>>>>>>         "pool": 3,
>>>>>>>         "namespace": "",
>>>>>>>         "max": 0
>>>>>>>     },
>>>>>>>     "info": {
>>>>>>>         "oid": {
>>>>>>>             "oid": "rbd_data.2aae62ae8944a.0000000000000000",
>>>>>>>             "key": "",
>>>>>>>             "snapid": -2,
>>>>>>>             "hash": 2375431681,
>>>>>>>             "max": 0,
>>>>>>>             "pool": 3,
>>>>>>>             "namespace": ""
>>>>>>>         },
>>>>>>>         "version": "4536'14459",
>>>>>>>         "prior_version": "4536'14458",
>>>>>>>         "last_reqid": "client.174834.0:10",
>>>>>>>         "user_version": 14648,
>>>>>>>         "size": 68,
>>>>>>>         "mtime": "2017-08-09 11:30:01.412634",
>>>>>>>         "local_mtime": "2017-08-09 11:30:01.413614",
>>>>>>>         "lost": 0,
>>>>>>>         "flags": 36,
>>>>>>>         "snaps": [],
>>>>>>>         "truncate_seq": 0,
>>>>>>>         "truncate_size": 0,
>>>>>>>         "data_digest": 4294967295,
>>>>>>>         "omap_digest": 4294967295,
>>>>>>>         "watchers": {}
>>>>>>>     },
>>>>>>>     "stat": {
>>>>>>>         "size": 68,
>>>>>>>         "blksize": 4096,
>>>>>>>         "blocks": 16,
>>>>>>>         "nlink": 1
>>>>>>>     },
>>>>>>>     "SnapSet": {
>>>>>>>         "snap_context": {
>>>>>>>             "seq": 28,
>>>>>>>             "snaps": [
>>>>>>>                 28
>>>>>>>             ]
>>>>>>>         },
>>>>>>>         "head_exists": 1,
>>>>>>>         "clones": [
>>>>>>>             {
>>>>>>>                 "snap": 28,
>>>>>>>                 "size": 68,
>>>>>>>                 "overlap": "[0~64]"
>>>>>>>             }
>>>>>>>         ]
>>>>>>>     }
>>>>>>> }
>>>>>>>
>>>>>>> Then we turned to the source code to find the reason. We found that,
>>>>>>> in the ReplicatedPG::make_writeable method, when determining whether
>>>>>>> the write op's modified range should be subtracted from the clone
>>>>>>> overlap, it has to pass two condition checks,
>>>>>>> "ctx->new_snapset.clones.size() > 0" and
>>>>>>> "is_present_clone(last_clone_oid)", as the code below shows.
>>>>>>>
>>>>>>>   // update most recent clone_overlap and usage stats
>>>>>>>   if (ctx->new_snapset.clones.size() > 0) {
>>>>>>>     /* we need to check whether the most recent clone exists, if it's
>>>>>>> been evicted,
>>>>>>>      * it's not included in the stats */
>>>>>>>     hobject_t last_clone_oid = soid;
>>>>>>>     last_clone_oid.snap = ctx->new_snapset.clone_overlap.rbegin()->first;
>>>>>>>     if (is_present_clone(last_clone_oid)) {
>>>>>>>       interval_set<uint64_t> &newest_overlap =
>>>>>>> ctx->new_snapset.clone_overlap.rbegin()->second;
>>>>>>>       ctx->modified_ranges.intersection_of(newest_overlap);
>>>>>>>       // modified_ranges is still in use by the clone
>>>>>>>       add_interval_usage(ctx->modified_ranges, ctx->delta_stats);
>>>>>>>       newest_overlap.subtract(ctx->modified_ranges);
>>>>>>>     }
>>>>>>>   }
>>>>>>>
>>>>>>> We thought that the latter condition check,
>>>>>>> "is_present_clone(last_clone_oid)", might not be a reasonable basis
>>>>>>> for deciding whether to subtract the write op's modified range from
>>>>>>> the clone overlap, so we changed the code above to the following,
>>>>>>> which moves the subtraction out of the latter condition check, and
>>>>>>> submitted a PR for this:
>>>>>>> https://github.com/ceph/ceph/pull/16790:
>>>>>>>
>>>>>>>   // update most recent clone_overlap and usage stats
>>>>>>>   if (ctx->new_snapset.clones.size() > 0) {
>>>>>>>     /* we need to check whether the most recent clone exists, if it's
>>>>>>> been evicted,
>>>>>>>      * it's not included in the stats */
>>>>>>>     hobject_t last_clone_oid = soid;
>>>>>>>     last_clone_oid.snap = ctx->new_snapset.clone_overlap.rbegin()->first;
>>>>>>>     interval_set<uint64_t> &newest_overlap =
>>>>>>> ctx->new_snapset.clone_overlap.rbegin()->second;
>>>>>>>     ctx->modified_ranges.intersection_of(newest_overlap);
>>>>>>>     newest_overlap.subtract(ctx->modified_ranges);
>>>>>>>
>>>>>>>     if (is_present_clone(last_clone_oid)) {
>>>>>>>       // modified_ranges is still in use by the clone
>>>>>>>       add_interval_usage(ctx->modified_ranges, ctx->delta_stats);
>>>>>>>     }
>>>>>>>   }
>>>>>>>
>>>>>>> In our test, this change solved the problem successfully; however, we
>>>>>>> can't confirm that it won't cause any new problems. So here we are,
>>>>>>> discussing how to solve this problem:-)
>>>>>>>
>>>>>>> On 9 August 2017 at 09:26, Jason Dillaman <jdillama@redhat.com> wrote:
>>>>>>>> Is this issue related to https://github.com/ceph/ceph/pull/10626?
>>>>>>>>
>>>>>>>> On Mon, Aug 7, 2017 at 8:07 PM, Xuehan Xu <xxhdx1985126@gmail.com> wrote:
>>>>>>>>> OK, I'll try that. Thank you:-)
>>>>>>>>>
>>>>>>>>> On 8 August 2017 at 04:48, Jason Dillaman <jdillama@redhat.com> wrote:
>>>>>>>>>> Could you just proxy the "list snaps" op from the cache tier down to
>>>>>>>>>> the base tier and combine the cache tier + base tier results? Reading
>>>>>>>>>> the associated ticket, it seems kludgy to me to attempt to work around
>>>>>>>>>> this within librbd by just refusing the provide intra-object diffs if
>>>>>>>>>> cache tiering is in-use.
>>>>>>>>>>
>>>>>>>>>> On Sat, Aug 5, 2017 at 11:56 AM, Xuehan Xu <xxhdx1985126@gmail.com> wrote:
>>>>>>>>>>> Hi, everyone.
>>>>>>>>>>>
>>>>>>>>>>> Trying to solve the issue "http://tracker.ceph.com/issues/20896", I
>>>>>>>>>>> just did another test: I did some writes to an object,
>>>>>>>>>>> "rbd_data.1ebc6238e1f29.0000000000000000", to promote its "HEAD"
>>>>>>>>>>> object to the cache tier, after which I specifically wrote 4 bytes
>>>>>>>>>>> of random data to its offset 0x40. Then I used
>>>>>>>>>>> "ceph-objectstore-tool" to dump its "HEAD" version in the base
>>>>>>>>>>> tier; the result is as follows (before I promoted it to the cache
>>>>>>>>>>> tier, there were three snaps, the latest of which is 26):
>>>>>>>>>>>
>>>>>>>>>>> {
>>>>>>>>>>>     "id": {
>>>>>>>>>>>         "oid": "rbd_data.1ebc6238e1f29.0000000000000000",
>>>>>>>>>>>         "key": "",
>>>>>>>>>>>         "snapid": -2,
>>>>>>>>>>>         "hash": 1655893237,
>>>>>>>>>>>         "max": 0,
>>>>>>>>>>>         "pool": 3,
>>>>>>>>>>>         "namespace": "",
>>>>>>>>>>>         "max": 0
>>>>>>>>>>>     },
>>>>>>>>>>>     "info": {
>>>>>>>>>>>         "oid": {
>>>>>>>>>>>             "oid": "rbd_data.1ebc6238e1f29.0000000000000000",
>>>>>>>>>>>             "key": "",
>>>>>>>>>>>             "snapid": -2,
>>>>>>>>>>>             "hash": 1655893237,
>>>>>>>>>>>             "max": 0,
>>>>>>>>>>>             "pool": 3,
>>>>>>>>>>>             "namespace": ""
>>>>>>>>>>>         },
>>>>>>>>>>>         "version": "4219'16423",
>>>>>>>>>>>         "prior_version": "3978'16310",
>>>>>>>>>>>         "last_reqid": "osd.70.4213:2359",
>>>>>>>>>>>         "user_version": 17205,
>>>>>>>>>>>         "size": 4194304,
>>>>>>>>>>>         "mtime": "2017-08-03 22:07:34.656122",
>>>>>>>>>>>         "local_mtime": "2017-08-05 23:02:33.628734",
>>>>>>>>>>>         "lost": 0,
>>>>>>>>>>>         "flags": 52,
>>>>>>>>>>>         "snaps": [],
>>>>>>>>>>>         "truncate_seq": 0,
>>>>>>>>>>>         "truncate_size": 0,
>>>>>>>>>>>         "data_digest": 2822203961,
>>>>>>>>>>>         "omap_digest": 4294967295,
>>>>>>>>>>>         "watchers": {}
>>>>>>>>>>>     },
>>>>>>>>>>>     "stat": {
>>>>>>>>>>>         "size": 4194304,
>>>>>>>>>>>         "blksize": 4096,
>>>>>>>>>>>         "blocks": 8200,
>>>>>>>>>>>         "nlink": 1
>>>>>>>>>>>     },
>>>>>>>>>>>     "SnapSet": {
>>>>>>>>>>>         "snap_context": {
>>>>>>>>>>>             "seq": 26,
>>>>>>>>>>>             "snaps": [
>>>>>>>>>>>                 26,
>>>>>>>>>>>                 25,
>>>>>>>>>>>                 16
>>>>>>>>>>>             ]
>>>>>>>>>>>         },
>>>>>>>>>>>         "head_exists": 1,
>>>>>>>>>>>         "clones": [
>>>>>>>>>>>             {
>>>>>>>>>>>                 "snap": 16,
>>>>>>>>>>>                 "size": 4194304,
>>>>>>>>>>>                 "overlap": "[4~4194300]"
>>>>>>>>>>>             },
>>>>>>>>>>>             {
>>>>>>>>>>>                 "snap": 25,
>>>>>>>>>>>                 "size": 4194304,
>>>>>>>>>>>                 "overlap": "[]"
>>>>>>>>>>>             },
>>>>>>>>>>>             {
>>>>>>>>>>>                 "snap": 26,
>>>>>>>>>>>                 "size": 4194304,
>>>>>>>>>>>                 "overlap": "[]"
>>>>>>>>>>>             }
>>>>>>>>>>>         ]
>>>>>>>>>>>     }
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> As we can see, its clone_overlap for snap 26 is empty, which,
>>>>>>>>>>> combined with the previous test described in
>>>>>>>>>>> http://tracker.ceph.com/issues/20896, means that the writes'
>>>>>>>>>>> "modified range" is recorded neither in the cache tier nor in the
>>>>>>>>>>> base tier.
>>>>>>>>>>>
>>>>>>>>>>> I think maybe we really should move the clone overlap modification
>>>>>>>>>>> out of the IF block guarded by the condition check
>>>>>>>>>>> "is_present_clone". For now, I can't see any other way to fix this
>>>>>>>>>>> problem.
>>>>>>>>>>>
>>>>>>>>>>> Am I right about this?
>>>>>>>>>>>
>>>>>>>>>>> On 4 August 2017 at 03:14, Xuehan Xu <xxhdx1985126@gmail.com> wrote:
>>>>>>>>>>>> I mean I think it's the condition check "is_present_clone" that
>>>>>>>>>>>> prevent the clone overlap to record the client write operations
>>>>>>>>>>>> modified range when the target "HEAD" object exists without its most
>>>>>>>>>>>> recent clone object, and if I'm right, just move the clone overlap
>>>>>>>>>>>> modification out of the "is_present_clone" condition check block is
>>>>>>>>>>>> enough to solve this case, just like the PR
>>>>>>>>>>>> "https://github.com/ceph/ceph/pull/16790", and this fix wouldn't cause
>>>>>>>>>>>> other problems.
>>>>>>>>>>>>
>>>>>>>>>>>> In our test, this fix solved the problem successfully, however, we
>>>>>>>>>>>> can't confirm it won't cause new problems yet.
>>>>>>>>>>>>
>>>>>>>>>>>> So if anyone see this and knows the answer, please help us. Thank you:-)
>>>>>>>>>>>>
>>>>>>>>>>>> On 4 August 2017 at 11:41, Xuehan Xu <xxhdx1985126@gmail.com> wrote:
>>>>>>>>>>>>> Hi, grep:-)
>>>>>>>>>>>>>
>>>>>>>>>>>>> I finally got what you mean in https://github.com/ceph/ceph/pull/16790.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I agree with you in that " clone overlap is supposed to be tracking
>>>>>>>>>>>>> which data is the same on disk".
>>>>>>>>>>>>>
>>>>>>>>>>>>> My thought is that, "ObjectContext::new_snapset.clones" is already an
>>>>>>>>>>>>> indicator about whether there are clone objects on disk, so, in the
>>>>>>>>>>>>> scenario of "cache tier", although a clone oid does not corresponds to
>>>>>>>>>>>>> a "present clone" in cache tier, as long as
>>>>>>>>>>>>> "ObjectContext::new_snapset.clones" is not empty, there must a one
>>>>>>>>>>>>> such clone object in the base tier. And, as long as
>>>>>>>>>>>>> "ObjectContext::new_snapset.clones" has a strict "one-to-one"
>>>>>>>>>>>>> correspondence to "ObjectContext::new_snapset.clone_overlap", passing
>>>>>>>>>>>>> the condition check "if (ctx->new_snapset.clones.size() > 0)" is
>>>>>>>>>>>>> enough to make the judgement that the clone object exists.
>>>>>>>>>>>>>
>>>>>>>>>>>>> So, if I'm right, passing the condition check "if
>>>>>>>>>>>>> (ctx->new_snapset.clones.size() > 0)" is already enough for us to do
>>>>>>>>>>>>> "newest_overlap.subtract(ctx->modified_ranges)", it doesn't have to
>>>>>>>>>>>>> pass "is_present_clone".
>>>>>>>>>>>>>
>>>>>>>>>>>>> Am I right about this? Or am I missing anything?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Please help us, thank you:-)
>>>>>>>>>>> --
>>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Jason
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Jason
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Jason
>>>
>>>
>>>
>>> --
>>> Jason
>
>
>
> --
> Jason

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2017-08-21  3:06 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-08-04  3:41 About the problem "export_diff relies on clone_overlap, which is lost when cache tier is enabled" Xuehan Xu
2017-08-04  7:14 ` Xuehan Xu
2017-08-05 15:56   ` Xuehan Xu
2017-08-07 20:48     ` Jason Dillaman
     [not found]       ` <CAJACTucaKm+nZUUkQYszZtjt7K=X1uvdpUucxL+pV1AHrmewEw@mail.gmail.com>
     [not found]         ` <CA+aFP1Dki1hw+VGZo9Nm4e=qx2P6-pCcG4PLK3FeDHJ1-pAkGA@mail.gmail.com>
     [not found]           ` <CAJACTudkLVT=dcWo_FpgrVuxJ8smFD2VqLU6x2eVWyggh9UJAA@mail.gmail.com>
2017-08-09  4:20             ` Fwd: " Xuehan Xu
2017-08-09  4:24               ` Xuehan Xu
2017-08-09 15:26                 ` Jason Dillaman
2017-08-10  7:06                   ` Xuehan Xu
2017-08-10  8:11                     ` Xuehan Xu
2017-08-10  9:48                       ` Xuehan Xu
2017-08-10 13:11                     ` Jason Dillaman
2017-08-10 15:11                       ` Xuehan Xu
2017-08-10 15:54                         ` Jason Dillaman
2017-08-21  3:06                           ` Xuehan Xu
