* new scrub and repair discussion
@ 2015-11-11 11:44 kefu chai
  2015-11-11 13:25 ` Sage Weil
  2016-05-23  3:54 ` Shinobu Kinjo
  0 siblings, 2 replies; 15+ messages in thread
From: kefu chai @ 2015-11-11 11:44 UTC (permalink / raw)
  To: ceph-devel

currently, scrub and repair are pretty primitive. there are several
improvements which need to be made:

- the user should be able to initiate a scrub of a PG or an object
    - int scrub(pg_t, AioCompletion*)
    - int scrub(const string& pool, const string& nspace, const
string& locator, const string& oid, AioCompletion*)
- we need a way to query the result of the most recent scrub on a pg.
    - int get_inconsistent_pools(set<uint64_t>* pools);
    - int get_inconsistent_pgs(uint64_t pool, paged<pg_t>* pgs);
    - int get_inconsistent(pg_t pgid, epoch_t* cur_interval,
paged<inconsistent_t>*)
- the user should be able to query the content of the replica/shard
objects in the event of an inconsistency.
    - operate_on_shard(epoch_t interval, pg_shard_t pg_shard,
ObjectReadOperation *op, bool allow_inconsistent)
- the user should be able to perform the following fixes using a new
aio_operate_scrub(
                                          const std::string& oid,
                                          shard_id_t shard,
                                          AioCompletion *c,
                                          ObjectWriteOperation *op)
    - specify which replica to use for repairing a content inconsistency
    - delete an object if it can't exist
    - write_full
    - omap_set
    - setattrs
- the user should be able to repair snapset and object_info_t
    - ObjectWriteOperation::repair_snapset(...)
        - set/remove any property/attributes, for example,
            - to reset snapset.clone_overlap
            - to set snapset.clone_size
            - to reset the digests in object_info_t,
- repair will create a new version so that possibly corrupted copies
on down OSDs will get fixed naturally.

so librados will offer enough information and facilities, with which a
smart librados client/script will be able to fix the inconsistencies
found in the scrub.
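
to make the query side more concrete, here is a very rough sketch of what
an inconsistent_t entry might carry. the field names are only inferred
from the pseudo code below and nothing here is settled:

    struct shard_info_t {
      // the object metadata as seen by this particular OSD/shard,
      // including data_digest, omap_digest, size, ...
      object_info_t object_info;
      // per-shard error flags (missing, read error, ...) would go here
    };

    struct inconsistent_t {
      // version of the object the inconsistency was reported against
      uint64_t ver;
      // what every OSD/shard reported for this object
      std::map<int32_t /* osd */, shard_info_t> shards;
      // true if the replicas/shards disagree on the data digest; similar
      // predicates for omap/attr/size mismatches would follow
      bool is_data_digest_mismatch() const;
    };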

as an example, suppose we run into a data inconsistency where the 3
replicas fail to agree with each other after performing a deep
scrub; we'd probably like to hold an election to pick the auth copy. the
following pseudo code explains how we could implement this using the
new rados APIs for scrub and repair.

     # something is not necessarily better than nothing
     rados.aio_scrub(pg, completion)
     completion.wait_for_complete()
     for pool in rados.get_inconsistent_pools():
          for pg in rados.get_inconsistent_pgs(pool):
               # rados.get_inconsistent() throws if "epoch" expires

               for oid, inconsistent in rados.get_inconsistent(pg,
epoch).items():
                    if inconsistent.is_data_digest_mismatch():
                         votes = defaultdict(int)
                          for osd, shard_info in inconsistent.shards.items():
                              votes[shard_info.object_info.data_digest] += 1
                          digest, _ = max(votes.items(), key=operator.itemgetter(1))
                         auth_copy = None
                         for osd, shard_info in inconsistent.shards.items():
                              if shard_info.object_info.data_digest == digest:
                                   auth_copy = osd
                                   break
                         repair_op = librados.ObjectWriteOperation()
                         repair_op.repair_pick(auth_copy,
inconsistent.ver, epoch)
                         rados.aio_operate_scrub(oid, repair_op)

this plan was also discussed in the infernalis CDS. see
http://tracker.ceph.com/projects/ceph/wiki/Osd_-_Scrub_and_Repair.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: new scrub and repair discussion
  2015-11-11 11:44 new scrub and repair discussion kefu chai
@ 2015-11-11 13:25 ` Sage Weil
  2015-11-11 14:53   ` kefu chai
  2016-05-23  3:54 ` Shinobu Kinjo
  1 sibling, 1 reply; 15+ messages in thread
From: Sage Weil @ 2015-11-11 13:25 UTC (permalink / raw)
  To: kefu chai; +Cc: ceph-devel

On Wed, 11 Nov 2015, kefu chai wrote:
> currently, scrub and repair are pretty primitive. there are several
> improvements which need to be made:
> 
> - user should be able to initialize scrub of a PG or an object
>     - int scrub(pg_t, AioCompletion*)
>     - int scrub(const string& pool, const string& nspace, const
> string& locator, const string& oid, AioCompletion*)
> - we need a way to query the result of the most recent scrub on a pg.
>     - int get_inconsistent_pools(set<uint64_t>* pools);
>     - int get_inconsistent_pgs(uint64_t pool, paged<pg_t>* pgs);
>     - int get_inconsistent(pg_t pgid, epoch_t* cur_interval,
> paged<inconsistent_t>*)

What is paged<>?

> - the user should be able to query the content of the replica/shard
> objects in the event of an inconsistency.
>     - operate_on_shard(epoch_t interval, pg_shard_t pg_shard,
> ObjectReadOperation *op, bool allow_inconsistent)

This is exposing a bunch of internal types (pg_t, pg_shard_t, epoch_t) up 
through librados.  We might want to consider making them strings or just 
unsigned or similar?  I'm mostly worried about making it hard for us to 
change the types later...
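
For example (just to illustrate the idea, not a proposal for the exact
signatures), the same calls could take opaque strings and plain integers:

  // pgids as strings ("2.1a"), epochs/intervals as plain uint32_t, so the
  // internal types can change later without breaking the librados interface
  int get_inconsistent_pgs(uint64_t pool, paged<std::string>* pgids);
  int get_inconsistent(const std::string& pgid, uint32_t* cur_interval,
                       paged<inconsistent_t>* items);
  int operate_on_shard(uint32_t interval, const std::string& pgid,
                       int32_t osd, int32_t shard,
                       ObjectReadOperation* op, bool allow_inconsistent);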

> - the user should be able to perform following fixes using a new
> aio_operate_scrub(
>                                           const std::string& oid,
>                                           shard_id_t shard,
>                                           AioCompletion *c,
>                                           ObjectWriteOperation *op)
>     - specify which replica to use for repairing a content inconsistency
>     - delete an object if it can't exist
>     - write_full
>     - omap_set
>     - setattrs

For omap_set and setattrs do we want a _full-type equivalent, or would we 
support partial changes?  Partial updates won't necessarily resolve an 
inconsistency, but I think (?) in the ec case the full xattr set is in 
the log event?

> - the user should be able to repair snapset and object_info_t
>     - ObjectWriteOperation::repair_snapset(...)
>         - set/remove any property/attributes, for example,
>             - to reset snapset.clone_overlap
>             - to set snapset.clone_size
>             - to reset the digests in object_info_t,
> - repair will create a new version so that possibly corrupted copies
> on down OSDs will get fixed naturally.
> 
> so librados will offer enough information and facilities, with which a
> smart librados client/script will be able to fix the inconsistencies
> found in the scrub.
> 
> as an example, if we run into a data inconsistency where the 3
> replicas failed to agree with each other after performing a deep
> scrub. probably we'd like to have an election to get the auth copy.
> following pseudo code explains how we will implement this using the
> new rados APIs for scrub and repair.
> 
>      # something is not necessarily better than nothing
>      rados.aio_scrub(pg, completion)
>      completion.wait_for_complete()
>      for pool in rados.get_inconsistent_pools():
>           for pg in rados.get_inconsistent_pgs(pool):
>                # rados.get_inconsistent_pgs() throws if "epoch" expires
> 
>                for oid, inconsistent in rados.get_inconsistent_pgs(pg,
> epoch).items():
>                     if inconsistent.is_data_digest_mismatch():
>                          votes = defaultdict(int)
>                          for osd, shard_info in inconsistent.shards:
>                               votes[shard_info.object_info.data_digest] += 1
>                          digest, _ = max(votes.items(), key=operator.itemgetter(1))
>                          auth_copy = None
>                          for osd, shard_info in inconsistent.shards.items():
>                               if shard_info.object_info.data_digest == digest:
>                                    auth_copy = osd
>                                    break
>                          repair_op = librados.ObjectWriteOperation()
>                          repair_op.repair_pick(auth_copy,
> inconsistent.ver, epoch)
>                          rados.aio_operate_scrub(oid, repair_op)
> 
> this plan was also discussed in the infernalis CDS. see
> http://tracker.ceph.com/projects/ceph/wiki/Osd_-_Scrub_and_Repair.

We should definitely make sure these are surfaced in the python bindings 
from the start.  :)

Sounds good to me!
sage


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: new scrub and repair discussion
  2015-11-11 13:25 ` Sage Weil
@ 2015-11-11 14:53   ` kefu chai
  0 siblings, 0 replies; 15+ messages in thread
From: kefu chai @ 2015-11-11 14:53 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

On Wed, Nov 11, 2015 at 9:25 PM, Sage Weil <sage@newdream.net> wrote:
> On Wed, 11 Nov 2015, kefu chai wrote:
>> currently, scrub and repair are pretty primitive. there are several
>> improvements which need to be made:
>>
>> - user should be able to initialize scrub of a PG or an object
>>     - int scrub(pg_t, AioCompletion*)
>>     - int scrub(const string& pool, const string& nspace, const
>> string& locator, const string& oid, AioCompletion*)
>> - we need a way to query the result of the most recent scrub on a pg.
>>     - int get_inconsistent_pools(set<uint64_t>* pools);
>>     - int get_inconsistent_pgs(uint64_t pool, paged<pg_t>* pgs);
>>     - int get_inconsistent(pg_t pgid, epoch_t* cur_interval,
>> paged<inconsistent_t>*)
>
> What is paged<>?

it's a template supporting pagination for querying the scrub results.
something like:

template <typename T>
class Paged {
  const unsigned max_size;  // upper bound on the number of items per page
  uint64_t current;         // cursor: where this page starts in the result set
  uint64_t last;            // presumably the last position, so the caller knows when to stop
  vector<T> page;           // the items of the current page
};

>
>> - the user should be able to query the content of the replica/shard
>> objects in the event of an inconsistency.
>>     - operate_on_shard(epoch_t interval, pg_shard_t pg_shard,
>> ObjectReadOperation *op, bool allow_inconsistent)
>
> This is exposing a bunch of internal types (pg_t, pg_shard_t, epoch_t) up
> through librados.  We might want to consider making them strings or just
> unsigned or similar?  I'm mostly worried about making it hard for us to
> change the types later...

oh, agreed! we should try to expose fewer (or no) internal types here.
changing the interface later would be a pain.

>
>> - the user should be able to perform following fixes using a new
>> aio_operate_scrub(
>>                                           const std::string& oid,
>>                                           shard_id_t shard,
>>                                           AioCompletion *c,
>>                                           ObjectWriteOperation *op)
>>     - specify which replica to use for repairing a content inconsistency
>>     - delete an object if it can't exist
>>     - write_full
>>     - omap_set
>>     - setattrs
>
> For omap_set and setattrs do we want a _full-type equivalent, or would we
> support partial changes?  Partial updates won't necessary resolve an
> inconsistency, but I think (?) in the ec case the full xattr set is in
> the log event?

i think we will try to support most of the librados APIs (the methods
of librados::IoCtx) so the user is able to get/rewrite the omap and xattrs
while bypassing the checks posed by the OSD, i.e. be able to read
the data of an object even if it's missing!
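
for instance, reading one shard of a broken object could look like the
following (just a sketch on top of the operate_on_shard() proposed earlier
in this thread; the variable names are made up and error handling is
omitted):

    librados::ObjectReadOperation read_op;
    bufferlist data;
    int read_rval = 0;
    // read the object data from one specific shard, even if the object is
    // flagged missing/inconsistent there (a zero length is used here as
    // shorthand for "up to the end of the object")
    read_op.read(0 /* off */, 0 /* len */, &data, &read_rval);
    int r = ioctx.operate_on_shard(cur_interval, bad_pg_shard, &read_op,
                                   true /* allow_inconsistent */);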

>
>> - the user should be able to repair snapset and object_info_t
>>     - ObjectWriteOperation::repair_snapset(...)
>>         - set/remove any property/attributes, for example,
>>             - to reset snapset.clone_overlap
>>             - to set snapset.clone_size
>>             - to reset the digests in object_info_t,
>> - repair will create a new version so that possibly corrupted copies
>> on down OSDs will get fixed naturally.
>>
>> so librados will offer enough information and facilities, with which a
>> smart librados client/script will be able to fix the inconsistencies
>> found in the scrub.
>>
>> as an example, if we run into a data inconsistency where the 3
>> replicas failed to agree with each other after performing a deep
>> scrub. probably we'd like to have an election to get the auth copy.
>> following pseudo code explains how we will implement this using the
>> new rados APIs for scrub and repair.
>>
>>      # something is not necessarily better than nothing
>>      rados.aio_scrub(pg, completion)
>>      completion.wait_for_complete()
>>      for pool in rados.get_inconsistent_pools():
>>           for pg in rados.get_inconsistent_pgs(pool):
>>                # rados.get_inconsistent_pgs() throws if "epoch" expires
>>
>>                for oid, inconsistent in rados.get_inconsistent_pgs(pg,
>> epoch).items():
>>                     if inconsistent.is_data_digest_mismatch():
>>                          votes = defaultdict(int)
>>                          for osd, shard_info in inconsistent.shards:
>>                               votes[shard_info.object_info.data_digest] += 1
>>                          digest, _ = max(votes.items(), key=operator.itemgetter(1))
>>                          auth_copy = None
>>                          for osd, shard_info in inconsistent.shards.items():
>>                               if shard_info.object_info.data_digest == digest:
>>                                    auth_copy = osd
>>                                    break
>>                          repair_op = librados.ObjectWriteOperation()
>>                          repair_op.repair_pick(auth_copy,
>> inconsistent.ver, epoch)
>>                          rados.aio_operate_scrub(oid, repair_op)
>>
>> this plan was also discussed in the infernalis CDS. see
>> http://tracker.ceph.com/projects/ceph/wiki/Osd_-_Scrub_and_Repair.
>
> We should definitely make sure these are surfaced in the python bindings
> from the start.  :)
>
> Sounds good to me!
> sage
>



-- 
Regards
Kefu Chai

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: new scrub and repair discussion
  2015-11-11 11:44 new scrub and repair discussion kefu chai
  2015-11-11 13:25 ` Sage Weil
@ 2016-05-23  3:54 ` Shinobu Kinjo
  2016-05-25 14:34   ` kefu chai
  1 sibling, 1 reply; 15+ messages in thread
From: Shinobu Kinjo @ 2016-05-23  3:54 UTC (permalink / raw)
  To: kefu chai; +Cc: ceph-devel

On Wed, Nov 11, 2015 at 8:44 PM, kefu chai <tchaikov@gmail.com> wrote:
> currently, scrub and repair are pretty primitive. there are several
> improvements which need to be made:
>
[snip]
> - repair will create a new version so that possibly corrupted copies
> on down OSDs will get fixed naturally.

If this new feature is executed by end users manually, it may be
better to implement a dry-run mechanism so that the above process could
be skipped, and end users could initiate the scrub process with more
information, and perhaps more safely.

Make sense?

Cheers,
Shinobu

>
> so librados will offer enough information and facilities, with which a
> smart librados client/script will be able to fix the inconsistencies
> found in the scrub.
>
> as an example, if we run into a data inconsistency where the 3
> replicas failed to agree with each other after performing a deep
> scrub. probably we'd like to have an election to get the auth copy.
> following pseudo code explains how we will implement this using the
> new rados APIs for scrub and repair.
>
>      # something is not necessarily better than nothing
>      rados.aio_scrub(pg, completion)
>      completion.wait_for_complete()
>      for pool in rados.get_inconsistent_pools():
>           for pg in rados.get_inconsistent_pgs(pool):
>                # rados.get_inconsistent_pgs() throws if "epoch" expires
>
>                for oid, inconsistent in rados.get_inconsistent_pgs(pg,
> epoch).items():
>                     if inconsistent.is_data_digest_mismatch():
>                          votes = defaultdict(int)
>                          for osd, shard_info in inconsistent.shards:
>                               votes[shard_info.object_info.data_digest] += 1
>                          digest, _ = max(votes.items(), key=operator.itemgetter(1))
>                          auth_copy = None
>                          for osd, shard_info in inconsistent.shards.items():
>                               if shard_info.object_info.data_digest == digest:
>                                    auth_copy = osd
>                                    break
>                          repair_op = librados.ObjectWriteOperation()
>                          repair_op.repair_pick(auth_copy,
> inconsistent.ver, epoch)
>                          rados.aio_operate_scrub(oid, repair_op)
>
> this plan was also discussed in the infernalis CDS. see
> http://tracker.ceph.com/projects/ceph/wiki/Osd_-_Scrub_and_Repair.
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Email:
shinobu@linux.com
shinobu@redhat.com

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: new scrub and repair discussion
  2016-05-23  3:54 ` Shinobu Kinjo
@ 2016-05-25 14:34   ` kefu chai
  0 siblings, 0 replies; 15+ messages in thread
From: kefu chai @ 2016-05-25 14:34 UTC (permalink / raw)
  To: skinjo; +Cc: ceph-devel

On Mon, May 23, 2016 at 11:54 AM, Shinobu Kinjo <shinobu.kj@gmail.com> wrote:
> On Wed, Nov 11, 2015 at 8:44 PM, kefu chai <tchaikov@gmail.com> wrote:
>> currently, scrub and repair are pretty primitive. there are several
>> improvements which need to be made:
>>
> [snip]
>> - repair will create a new version so that possibly corrupted copies
>> on down OSDs will get fixed naturally.
>
> If this new feature is executed by end users manually, it may be
> better to implement dry-run mechanism so that the above process could
> be skipped, and end users initialize scrub process with more
> information, and maybe more safely.

to implement a dry-run, we have two possible ways:

1. export the inconsistency detection logic to the client, and expose the
full scrub map to the client, so the user can run the inconsistency
detection algorithm over the updated scrub map.
2. persist the proposed changes in the osd, and override the object
information with the proposed ones, if any, when running the
inconsistency detection logic.

imho, the first one is more viable, but it is much more complicated
than the current design. maybe we can do it after the repair-write API
is ready.
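
as a very rough illustration of the first option (every name below is made
up, the real scrub map structure is not defined yet), the client-side check
could be a pure function that proposes repairs without writing anything:

    #include <cstdint>
    #include <map>

    // stand-in for what one OSD/shard reported about an object
    struct shard_scrub_info {
      uint32_t data_digest;
      bool read_error;
    };

    // dry run: return true if the shards disagree on the data digest,
    // i.e. a repair would be proposed. nothing is modified.
    bool would_repair(const std::map<int, shard_scrub_info>& shards)
    {
      std::map<uint32_t, unsigned> votes;
      for (const auto& kv : shards) {
        if (!kv.second.read_error)
          votes[kv.second.data_digest]++;
      }
      return votes.size() > 1;
    }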

>
> Make sense?
>
> Cheers,
> Shinobu
>
>>
>> so librados will offer enough information and facilities, with which a
>> smart librados client/script will be able to fix the inconsistencies
>> found in the scrub.
>>
>> as an example, if we run into a data inconsistency where the 3
>> replicas failed to agree with each other after performing a deep
>> scrub. probably we'd like to have an election to get the auth copy.
>> following pseudo code explains how we will implement this using the
>> new rados APIs for scrub and repair.
>>
>>      # something is not necessarily better than nothing
>>      rados.aio_scrub(pg, completion)
>>      completion.wait_for_complete()
>>      for pool in rados.get_inconsistent_pools():
>>           for pg in rados.get_inconsistent_pgs(pool):
>>                # rados.get_inconsistent_pgs() throws if "epoch" expires
>>
>>                for oid, inconsistent in rados.get_inconsistent_pgs(pg,
>> epoch).items():
>>                     if inconsistent.is_data_digest_mismatch():
>>                          votes = defaultdict(int)
>>                          for osd, shard_info in inconsistent.shards:
>>                               votes[shard_info.object_info.data_digest] += 1
>>                          digest, _ = max(votes.items(), key=operator.itemgetter(1))
>>                          auth_copy = None
>>                          for osd, shard_info in inconsistent.shards.items():
>>                               if shard_info.object_info.data_digest == digest:
>>                                    auth_copy = osd
>>                                    break
>>                          repair_op = librados.ObjectWriteOperation()
>>                          repair_op.repair_pick(auth_copy,
>> inconsistent.ver, epoch)
>>                          rados.aio_operate_scrub(oid, repair_op)
>>
>> this plan was also discussed in the infernalis CDS. see
>> http://tracker.ceph.com/projects/ceph/wiki/Osd_-_Scrub_and_Repair.
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>
>
> --
> Email:
> shinobu@linux.com
> shinobu@redhat.com



-- 
Regards
Kefu Chai

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: new scrub and repair discussion
  2016-05-25 17:37         ` Samuel Just
@ 2016-06-07 13:13           ` kefu chai
  0 siblings, 0 replies; 15+ messages in thread
From: kefu chai @ 2016-06-07 13:13 UTC (permalink / raw)
  To: Samuel Just; +Cc: ceph-devel

On Thu, May 26, 2016 at 1:37 AM, Samuel Just <sjust@redhat.com> wrote:
> On Fri, May 20, 2016 at 4:30 AM, kefu chai <tchaikov@gmail.com> wrote:
>> On Fri, May 20, 2016 at 1:55 AM, Samuel Just <sjust@redhat.com> wrote:
>>> How would this work for an ec pool?  Maybe the osd argument should be
>>> a set of valid peers?
>>
>> maybe in the case of ec pool, we should ignore the osd argument.
>>
>> <quote from="http://pad.ceph.com/p/scrub_repair">
>> david points out that for EC pools, don't allow incorrect shards to be
>> selected as correct
>> - admin might decide to delete an object if it can't exist
>> - similar to unfound object case
>> - repair should have a use-your-best-judgement flag -- mandatory for
>> ec (if not deleting the object)
>> - for ec, if we want to read a shard, need to specify the shard id as
>> well since an osd might have two shards
>>   - request would indicate the shard
>> </quote>
>>
>> because the ec subread does not return the payload if the size or the
>> digest fails to match, and instead, an EIO is returned. on the primary
>> side, ECBackend::handle_sub_read_reply() will send more subread ops to
>> the shard(s) which is yet used if it is unable to reconstruct the
>> requested extent with shards already returned. if we want to
>> explicitly exclude the "bad" shards from being used, maybe the
>> simplest way is to remove it before calling repair-copy. and we can
>> offer an API for removing a shard. but I doubt that we need to do
>> this. as the chance of having a corrupted shard whose checksum matches
>> with its digest stored in its object-info xattr,  is very small. and
>> maybe we fix a corrupted shard in ec pool by reading the impacted
>> object, and then overwriting the original copy with the reconstructed
>> one from the good shards.
>
> I think it would be simpler to just allow the repair_write call to
> specify a set of bad shards.  For replicated pools, we simply choose
> one which is not in that set.  For EC pools, we use that information
> to avoid bad shards.

for the replicated pool, we will choose a random replica from the acting set,
as long as it's not listed in the blacklist, as the auth copy for the
repair.

for the ec pool, the OSD does not return shards with a wrong digest at all
when handling sub-read requests, so we should ignore the digest mismatch
error when reading shards, because:
 - we mark a shard inconsistent if its digest does not match the
one stored in the shard's hash_info (please note that each shard has
the digests of all shards of that object).
 - the user could put the consistent shard reported by
list-inconsistent-obj into the blacklist. it's a little bit scary
though.
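
so on the osd side, the source selection for a replicated pool could be as
simple as the following (a hypothetical helper, names made up):

    #include <optional>
    #include <set>
    #include <vector>

    // pick the replica to use as the auth copy for the repair: any member
    // of the acting set which the client did not blacklist
    std::optional<int> pick_auth_replica(const std::vector<int>& acting,
                                         const std::set<int>& blacklist)
    {
      for (int osd : acting) {
        if (blacklist.count(osd) == 0)
          return osd;   // good enough; could also pick randomly among these
      }
      return std::nullopt;  // everything was blacklisted, nothing to copy from
    }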

-- 
Regards
Kefu Chai

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: new scrub and repair discussion
  2016-06-07 10:44       ` kefu chai
@ 2016-06-07 13:03         ` Sage Weil
  0 siblings, 0 replies; 15+ messages in thread
From: Sage Weil @ 2016-06-07 13:03 UTC (permalink / raw)
  To: kefu chai; +Cc: Dan van der Ster, ceph-devel

On Tue, 7 Jun 2016, kefu chai wrote:
> Dan, your comments are more like feature requests related to current
> scrub, instead to the
> new scrub/repair feature design. reply inlined.
> 
> On Fri, May 27, 2016 at 8:03 PM, Dan van der Ster <dan@vanderster.com> wrote:
> > Hi all,
> >
> > I have some high-level feedback for scrub/repair. Apologies if some of
> > these are already taken into account.
> >
> > 1. ceph pg cancel-scrub <pgid>: For a variety of reasons it would be
> > useful to be able to cancel an ongoing (deep-)scrub on a PG. The
> > no(deep-)scrub flags work more like a pause, but today if I want to
> > stop a scrub it requires an OSD to be restarted.
> 
> it's a feature request. we did have this feature in rados API before, but it was
> not exposed by the rados cli, and hence removed. if you'd like to get it back,
> maybe you could file an issue over tracker?
> 
> > 2. ceph pg scrub/deep-scrub/repair often do not start because the
> > master OSD cannot get a reservation on all the replica/EC-part OSDs
> > (due to osd max scrubs). It is possible using some strange gymnastics
> > to force PG to start repairing/scrubbing immediately, but those are
> > not intuitive. IMHO, ceph pg scrub/deep-scrub/repair <pgid> should
> > start immediately regardless of the 'osd max scrubs' value.
> 
> i think it's more a design decision.

My concern with this one is that lots of people have written their own 
scrub scheduling scripts (e.g., because of scheduling problems in the 
past).  I'd favor adding a --force-now option or separate command for an 
immediate scrub.

> > 5. Do we even need the shallow scrub functionality? I'm very curious
> > how many problems that shallow scrubbing finds IRL compared with
> > deep-scrubbing. Does ceph track these stats independently?
> 
> i don't have any numbers to support your theory or against it. but i think
> having a light-weight scrub is necessary.

The lightweight scrub mostly catches replication/recovery bugs.  It's 
useful enough just as a testing/development tool.  I'm not sure that it is 
as useful for users, though...

sage

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: new scrub and repair discussion
  2016-05-27 12:03     ` Dan van der Ster
@ 2016-06-07 10:44       ` kefu chai
  2016-06-07 13:03         ` Sage Weil
  0 siblings, 1 reply; 15+ messages in thread
From: kefu chai @ 2016-06-07 10:44 UTC (permalink / raw)
  To: Dan van der Ster; +Cc: ceph-devel

Dan, your comments are more like feature requests related to the current
scrub, rather than to the new scrub/repair feature design. reply inlined.

On Fri, May 27, 2016 at 8:03 PM, Dan van der Ster <dan@vanderster.com> wrote:
> Hi all,
>
> I have some high-level feedback for scrub/repair. Apologies if some of
> these are already taken into account.
>
> 1. ceph pg cancel-scrub <pgid>: For a variety of reasons it would be
> useful to be able to cancel an ongoing (deep-)scrub on a PG. The
> no(deep-)scrub flags work more like a pause, but today if I want to
> stop a scrub it requires an OSD to be restarted.

it's a feature request. we did have this feature in the rados API before, but
it was not exposed by the rados cli, and hence was removed. if you'd like to
get it back, maybe you could file an issue on the tracker?

>
> 2. ceph pg scrub/deep-scrub/repair often do not start because the
> master OSD cannot get a reservation on all the replica/EC-part OSDs
> (due to osd max scrubs). It is possible using some strange gymnastics
> to force PG to start repairing/scrubbing immediately, but those are
> not intuitive. IMHO, ceph pg scrub/deep-scrub/repair <pgid> should
> start immediately regardless of the 'osd max scrubs' value.

i think it's more a design decision.

>
> 3. It should be possible to repair an object directly: e.g. couldn't
> we have rados repair <objectname> which reads then re-writes the whole
> object.

that's what we are discussing in this thread.

>
> 4. EC auto-repair on read/write. Surely there are some types of shard
> corruption that we can repair in-line with the IO, rather than waiting
> for the long scrub/repair cycle.

yeah, we are able to detect some shard corruption when reading. but we
1) won't do the repair on behalf of the user, and 2) want to offload the repair
work to the client side to avoid heuristics in the OSD. so i am afraid this
won't happen.

>
> 5. Do we even need the shallow scrub functionality? I'm very curious
> how many problems that shallow scrubbing finds IRL compared with
> deep-scrubbing. Does ceph track these stats independently?

i don't have any numbers to support or refute your theory, but i think
having a light-weight scrub is necessary.

> Could ceph-brag be used to gather this info?

yeah, but not yet. actually we call "ceph pg dump pools" in ceph-brag.

>
> Thanks!
>
> Dan
-- 
Regards
Kefu Chai

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: new scrub and repair discussion
  2016-05-19 13:09   ` kefu chai
  2016-05-19 17:55     ` Samuel Just
@ 2016-05-27 12:03     ` Dan van der Ster
  2016-06-07 10:44       ` kefu chai
  1 sibling, 1 reply; 15+ messages in thread
From: Dan van der Ster @ 2016-05-27 12:03 UTC (permalink / raw)
  To: kefu chai; +Cc: ceph-devel

Hi all,

I have some high-level feedback for scrub/repair. Apologies if some of
these are already taken into account.

1. ceph pg cancel-scrub <pgid>: For a variety of reasons it would be
useful to be able to cancel an ongoing (deep-)scrub on a PG. The
no(deep-)scrub flags work more like a pause, but today if I want to
stop a scrub it requires an OSD to be restarted.

2. ceph pg scrub/deep-scrub/repair often do not start because the
master OSD cannot get a reservation on all the replica/EC-part OSDs
(due to osd max scrubs). It is possible using some strange gymnastics
to force PG to start repairing/scrubbing immediately, but those are
not intuitive. IMHO, ceph pg scrub/deep-scrub/repair <pgid> should
start immediately regardless of the 'osd max scrubs' value.

3. It should be possible to repair an object directly: e.g. couldn't
we have rados repair <objectname> which reads then re-writes the whole
object.

4. EC auto-repair on read/write. Surely there are some types of shard
corruption that we can repair in-line with the IO, rather than waiting
for the long scrub/repair cycle.

5. Do we even need the shallow scrub functionality? I'm very curious
how many problems that shallow scrubbing finds IRL compared with
deep-scrubbing. Does ceph track these stats independently? Could
ceph-brag be used to gather this info?

Thanks!

Dan


On Thu, May 19, 2016 at 3:09 PM, kefu chai <tchaikov@gmail.com> wrote:
> hi cephers,
>
> I'd like to keep you guys posted on the progress of the scrub/repair
> feature. And I would also like to get valuable comments/suggestions on it from
> you! Now, I am working on the repair-write API for the scrub/repair
> feature.
>
> the API looks like:
>
>     /**
>      * Rewrite the object with the replica hosted by specified osd
>      *
>      * @param osd from which OSD we will copy the data
>      * @param version the version of rewritten object
>      * @param what the flags indicating what we will copy
>      */
>     int repair_copy(const std::string& oid, uint64_t version, uint32_t
> what, int32_t osd, uint32_t epoch);
>
> in which,
> - `version` is the version of the object you expect to be repairing in
> case of a racing write;
> - `what` is an OR'ed flags of follow enum:
> - `epoch` like the other scrub/repairing APIs, epoch indicating the
> scrub interval is passed in.
>
> struct repair_copy_t {
>   enum {
>     DATA = 1 << 0,
>     OMAP = 1 << 1,
>     ATTR = 1 << 2,
>   };
> };
>
> a new REPAIR_COPY OSD op is introduced to enable the OSD side to copy
> the shard/replica from specified source OSD to the acting set. and the
> machinery of copy_from is reused to implement this feature. so after
> rewriting the object, a version is increased, so that possibly corrupt
> copies on down OSDs will get fixed naturally.
>
> for the code, see
> - https://github.com/ceph/ceph/pull/9203
>
> for the draft design, see
> - http://tracker.ceph.com/issues/13508
> - http://pad.ceph.com/p/scrub_repair
>
> the API for fixing snapset will be added later.
>
> On Wed, Nov 11, 2015 at 11:43 PM, kefu chai <tchaikov@gmail.com> wrote:
>> On Wed, Nov 11, 2015 at 10:43 PM, 王志强 <wonzhq@gmail.com> wrote:
>>> 2015-11-11 19:44 GMT+08:00 kefu chai <tchaikov@gmail.com>:
>>>> currently, scrub and repair are pretty primitive. there are several
>>>> improvements which need to be made:
>>>>
>>>> - user should be able to initialize scrub of a PG or an object
>>>>     - int scrub(pg_t, AioCompletion*)
>>>>     - int scrub(const string& pool, const string& nspace, const
>>>> string& locator, const string& oid, AioCompletion*)
>>>> - we need a way to query the result of the most recent scrub on a pg.
>>>>     - int get_inconsistent_pools(set<uint64_t>* pools);
>>>>     - int get_inconsistent_pgs(uint64_t pool, paged<pg_t>* pgs);
>>>>     - int get_inconsistent(pg_t pgid, epoch_t* cur_interval,
>>>> paged<inconsistent_t>*)
>>>> - the user should be able to query the content of the replica/shard
>>>> objects in the event of an inconsistency.
>>>>     - operate_on_shard(epoch_t interval, pg_shard_t pg_shard,
>>>> ObjectReadOperation *op, bool allow_inconsistent)
>>>> - the user should be able to perform following fixes using a new
>>>> aio_operate_scrub(
>>>>                                           const std::string& oid,
>>>>                                           shard_id_t shard,
>>>>                                           AioCompletion *c,
>>>>                                           ObjectWriteOperation *op)
>>>>     - specify which replica to use for repairing a content inconsistency
>>>>     - delete an object if it can't exist
>>>>     - write_full
>>>>     - omap_set
>>>>     - setattrs
>>>> - the user should be able to repair snapset and object_info_t
>>>>     - ObjectWriteOperation::repair_snapset(...)
>>>>         - set/remove any property/attributes, for example,
>>>>             - to reset snapset.clone_overlap
>>>>             - to set snapset.clone_size
>>>>             - to reset the digests in object_info_t,
>>>> - repair will create a new version so that possibly corrupted copies
>>>> on down OSDs will get fixed naturally.
>>>>
>>>
>>> I think this exposes too much things to the user. Usually a user
>>> doesn't have knowledges like this. If we make it too much complicated,
>>> no one will use it at the end.
>>
>> well, i tend to agree with you to some degree. this is a set of very low
>> level APIs exposed to user, but we will accompany them with some
>> ready-to-use policies to repair the typical inconsistencies. like the
>> sample code attached at the end of this mail. but the point here is
>> that we will not burden the OSD daemon will all of these complicated
>> logic to fix and repair things. and let the magic happen out side of
>> the ceph-osd in a more flexible way. for the advanced users, if they
>> want to explore the possibilities to fix the inconsistencies in their own
>> way, they won't be disappointed also.
>>
>>>
>>>> so librados will offer enough information and facilities, with which a
>>>> smart librados client/script will be able to fix the inconsistencies
>>>> found in the scrub.
>>>>
>>>> as an example, if we run into a data inconsistency where the 3
>>>> replicas failed to agree with each other after performing a deep
>>>> scrub. probably we'd like to have an election to get the auth copy.
>>>> following pseudo code explains how we will implement this using the
>>>> new rados APIs for scrub and repair.
>>>>
>>>>      # something is not necessarily better than nothing
>>>>      rados.aio_scrub(pg, completion)
>>>>      completion.wait_for_complete()
>>>>      for pool in rados.get_inconsistent_pools():
>>>>           for pg in rados.get_inconsistent_pgs(pool):
>>>>                # rados.get_inconsistent_pgs() throws if "epoch" expires
>>>>
>>>>                for oid, inconsistent in rados.get_inconsistent_pgs(pg,
>>>> epoch).items():
>>>>                     if inconsistent.is_data_digest_mismatch():
>>>>                          votes = defaultdict(int)
>>>>                          for osd, shard_info in inconsistent.shards:
>>>>                               votes[shard_info.object_info.data_digest] += 1
>>>>                          digest, _ = max(votes.items(), key=operator.itemgetter(1))
>>>>                          auth_copy = None
>>>>                          for osd, shard_info in inconsistent.shards.items():
>>>>                               if shard_info.object_info.data_digest == digest:
>>>>                                    auth_copy = osd
>>>>                                    break
>>>>                          repair_op = librados.ObjectWriteOperation()
>>>>                          repair_op.repair_pick(auth_copy,
>>>> inconsistent.ver, epoch)
>>>>                          rados.aio_operate_scrub(oid, repair_op)
>>>>
>>>> this plan was also discussed in the infernalis CDS. see
>>>> http://tracker.ceph.com/projects/ceph/wiki/Osd_-_Scrub_and_Repair.
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>> the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>>
>>
>> --
>> Regards
>> Kefu Chai
>
>
>
> --
> Regards
> Kefu Chai
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: new scrub and repair discussion
  2016-05-20 11:30       ` kefu chai
@ 2016-05-25 17:37         ` Samuel Just
  2016-06-07 13:13           ` kefu chai
  0 siblings, 1 reply; 15+ messages in thread
From: Samuel Just @ 2016-05-25 17:37 UTC (permalink / raw)
  To: kefu chai; +Cc: ceph-devel

On Fri, May 20, 2016 at 4:30 AM, kefu chai <tchaikov@gmail.com> wrote:
> On Fri, May 20, 2016 at 1:55 AM, Samuel Just <sjust@redhat.com> wrote:
>> How would this work for an ec pool?  Maybe the osd argument should be
>> a set of valid peers?
>
> maybe in the case of ec pool, we should ignore the osd argument.
>
> <quote from="http://pad.ceph.com/p/scrub_repair">
> david points out that for EC pools, don't allow incorrect shards to be
> selected as correct
> - admin might decide to delete an object if it can't exist
> - similar to unfound object case
> - repair should have a use-your-best-judgement flag -- mandatory for
> ec (if not deleting the object)
> - for ec, if we want to read a shard, need to specify the shard id as
> well since an osd might have two shards
>   - request would indicate the shard
> </quote>
>
> because the ec subread does not return the payload if the size or the
> digest fails to match, and instead, an EIO is returned. on the primary
> side, ECBackend::handle_sub_read_reply() will send more subread ops to
> the shard(s) which is yet used if it is unable to reconstruct the
> requested extent with shards already returned. if we want to
> explicitly exclude the "bad" shards from being used, maybe the
> simplest way is to remove it before calling repair-copy. and we can
> offer an API for removing a shard. but I doubt that we need to do
> this. as the chance of having a corrupted shard whose checksum matches
> with its digest stored in its object-info xattr,  is very small. and
> maybe we fix a corrupted shard in ec pool by reading the impacted
> object, and then overwriting the original copy with the reconstructed
> one from the good shards.

I think it would be simpler to just allow the repair_write call to
specify a set of bad shards.  For replicated pools, we simply choose
one which is not in that set.  For EC pools, we use that information
to avoid bad shards.
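
Something along these lines, purely to illustrate the shape of the call
(the name and parameter types are not settled):

  // variant of the proposed repair_copy(): the caller passes the shards it
  // knows to be bad instead of naming the single good source OSD; the OSD
  // then picks any source not in the set (replicated) or avoids those
  // shards when reconstructing (EC)
  int repair_copy(const std::string& oid,
                  uint64_t version,                    // expected object version
                  uint32_t what,                       // DATA | OMAP | ATTR
                  const std::set<int32_t>& bad_shards,
                  uint32_t epoch);                     // scrub interval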

>
> <quote from="http://pad.ceph.com/p/scrub_repair">
> repair (need a flag to aio_operate  write variant to allow overwriting
> an unfound object) (needs to bypass snapshotting) (allow to write to a
> clone?) (require x cap bit?)
>  delete
>  writefull ...
>  omap_set_...
>  setattrs ...
> </quote>
>
> do we still need the REPAIR_WRITE flag for overwriting an unfound
> object? i removed an object in osd's store directory, and the
> repair-copy does fix it for me. or I misunderstand this line...
>

It may not be necessary for this mechanism to repair unfound objects;
I merged a new version of that system in the last cycle.  I guess it
depends on what you'd consider most convenient.
-Sam

>> -Sam
>>
>> On Thu, May 19, 2016 at 6:09 AM, kefu chai <tchaikov@gmail.com> wrote:
>>> hi cephers,
>>>
>>> I'd like to keep you guys posted on the progress of the scrub/repair
>>> feature. And I would also like to get valuable comments/suggestions on it from
>>> you! Now, I am working on the repair-write API for the scrub/repair
>>> feature.
>>>
>>> the API looks like:
>>>
>>>     /**
>>>      * Rewrite the object with the replica hosted by specified osd
>>>      *
>>>      * @param osd from which OSD we will copy the data
>>>      * @param version the version of rewritten object
>>>      * @param what the flags indicating what we will copy
>>>      */
>>>     int repair_copy(const std::string& oid, uint64_t version, uint32_t
>>> what, int32_t osd, uint32_t epoch);
>>>
>>> in which,
>>> - `version` is the version of the object you expect to be repairing in
>>> case of a racing write;
>>> - `what` is an OR'ed flags of follow enum:
>>> - `epoch` like the other scrub/repairing APIs, epoch indicating the
>>> scrub interval is passed in.
>>>
>>> struct repair_copy_t {
>>>   enum {
>>>     DATA = 1 << 0,
>>>     OMAP = 1 << 1,
>>>     ATTR = 1 << 2,
>>>   };
>>> };
>>>
>>> a new REPAIR_COPY OSD op is introduced to enable the OSD side to copy
>>> the shard/replica from specified source OSD to the acting set. and the
>>> machinery of copy_from is reused to implement this feature. so after
>>> rewriting the object, a version is increased, so that possibly corrupt
>>> copies on down OSDs will get fixed naturally.
>>>
>>> for the code, see
>>> - https://github.com/ceph/ceph/pull/9203
>>>
>>> for the draft design, see
>>> - http://tracker.ceph.com/issues/13508
>>> - http://pad.ceph.com/p/scrub_repair
>>>
>>> the API for fixing snapset will be added later.
>>>
>>> On Wed, Nov 11, 2015 at 11:43 PM, kefu chai <tchaikov@gmail.com> wrote:
>>>> On Wed, Nov 11, 2015 at 10:43 PM, 王志强 <wonzhq@gmail.com> wrote:
>>>>> 2015-11-11 19:44 GMT+08:00 kefu chai <tchaikov@gmail.com>:
>>>>>> currently, scrub and repair are pretty primitive. there are several
>>>>>> improvements which need to be made:
>>>>>>
>>>>>> - user should be able to initialize scrub of a PG or an object
>>>>>>     - int scrub(pg_t, AioCompletion*)
>>>>>>     - int scrub(const string& pool, const string& nspace, const
>>>>>> string& locator, const string& oid, AioCompletion*)
>>>>>> - we need a way to query the result of the most recent scrub on a pg.
>>>>>>     - int get_inconsistent_pools(set<uint64_t>* pools);
>>>>>>     - int get_inconsistent_pgs(uint64_t pool, paged<pg_t>* pgs);
>>>>>>     - int get_inconsistent(pg_t pgid, epoch_t* cur_interval,
>>>>>> paged<inconsistent_t>*)
>>>>>> - the user should be able to query the content of the replica/shard
>>>>>> objects in the event of an inconsistency.
>>>>>>     - operate_on_shard(epoch_t interval, pg_shard_t pg_shard,
>>>>>> ObjectReadOperation *op, bool allow_inconsistent)
>>>>>> - the user should be able to perform following fixes using a new
>>>>>> aio_operate_scrub(
>>>>>>                                           const std::string& oid,
>>>>>>                                           shard_id_t shard,
>>>>>>                                           AioCompletion *c,
>>>>>>                                           ObjectWriteOperation *op)
>>>>>>     - specify which replica to use for repairing a content inconsistency
>>>>>>     - delete an object if it can't exist
>>>>>>     - write_full
>>>>>>     - omap_set
>>>>>>     - setattrs
>>>>>> - the user should be able to repair snapset and object_info_t
>>>>>>     - ObjectWriteOperation::repair_snapset(...)
>>>>>>         - set/remove any property/attributes, for example,
>>>>>>             - to reset snapset.clone_overlap
>>>>>>             - to set snapset.clone_size
>>>>>>             - to reset the digests in object_info_t,
>>>>>> - repair will create a new version so that possibly corrupted copies
>>>>>> on down OSDs will get fixed naturally.
>>>>>>
>>>>>
>>>>> I think this exposes too much things to the user. Usually a user
>>>>> doesn't have knowledges like this. If we make it too much complicated,
>>>>> no one will use it at the end.
>>>>
>>>> well, i tend to agree with you to some degree. this is a set of very low
>>>> level APIs exposed to user, but we will accompany them with some
>>>> ready-to-use policies to repair the typical inconsistencies. like the
>>>> sample code attached at the end of this mail. but the point here is
>>>> that we will not burden the OSD daemon will all of these complicated
>>>> logic to fix and repair things. and let the magic happen out side of
>>>> the ceph-osd in a more flexible way. for the advanced users, if they
>>>> want to explore the possibilities to fix the inconsistencies in their own
>>>> way, they won't be disappointed also.
>>>>
>>>>>
>>>>>> so librados will offer enough information and facilities, with which a
>>>>>> smart librados client/script will be able to fix the inconsistencies
>>>>>> found in the scrub.
>>>>>>
>>>>>> as an example, if we run into a data inconsistency where the 3
>>>>>> replicas failed to agree with each other after performing a deep
>>>>>> scrub. probably we'd like to have an election to get the auth copy.
>>>>>> following pseudo code explains how we will implement this using the
>>>>>> new rados APIs for scrub and repair.
>>>>>>
>>>>>>      # something is not necessarily better than nothing
>>>>>>      rados.aio_scrub(pg, completion)
>>>>>>      completion.wait_for_complete()
>>>>>>      for pool in rados.get_inconsistent_pools():
>>>>>>           for pg in rados.get_inconsistent_pgs(pool):
>>>>>>                # rados.get_inconsistent_pgs() throws if "epoch" expires
>>>>>>
>>>>>>                for oid, inconsistent in rados.get_inconsistent_pgs(pg,
>>>>>> epoch).items():
>>>>>>                     if inconsistent.is_data_digest_mismatch():
>>>>>>                          votes = defaultdict(int)
>>>>>>                          for osd, shard_info in inconsistent.shards:
>>>>>>                               votes[shard_info.object_info.data_digest] += 1
>>>>>>                          digest, _ = max(votes.items(), key=operator.itemgetter(1))
>>>>>>                          auth_copy = None
>>>>>>                          for osd, shard_info in inconsistent.shards.items():
>>>>>>                               if shard_info.object_info.data_digest == digest:
>>>>>>                                    auth_copy = osd
>>>>>>                                    break
>>>>>>                          repair_op = librados.ObjectWriteOperation()
>>>>>>                          repair_op.repair_pick(auth_copy,
>>>>>> inconsistent.ver, epoch)
>>>>>>                          rados.aio_operate_scrub(oid, repair_op)
>>>>>>
>>>>>> this plan was also discussed in the infernalis CDS. see
>>>>>> http://tracker.ceph.com/projects/ceph/wiki/Osd_-_Scrub_and_Repair.
>>>>>> --
>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>> --
>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>> the body of a message to majordomo@vger.kernel.org
>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>
>>>>
>>>>
>>>> --
>>>> Regards
>>>> Kefu Chai
>>>
>>>
>>>
>>> --
>>> Regards
>>> Kefu Chai
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>
>
> --
> Regards
> Kefu Chai
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: new scrub and repair discussion
  2016-05-19 17:55     ` Samuel Just
@ 2016-05-20 11:30       ` kefu chai
  2016-05-25 17:37         ` Samuel Just
  0 siblings, 1 reply; 15+ messages in thread
From: kefu chai @ 2016-05-20 11:30 UTC (permalink / raw)
  To: Samuel Just; +Cc: ceph-devel

On Fri, May 20, 2016 at 1:55 AM, Samuel Just <sjust@redhat.com> wrote:
> How would this work for an ec pool?  Maybe the osd argument should be
> a set of valid peers?

maybe in the case of ec pool, we should ignore the osd argument.

<quote from="http://pad.ceph.com/p/scrub_repair">
david points out that for EC pools, don't allow incorrect shards to be
selected as correct
- admin might decide to delete an object if it can't exist
- similar to unfound object case
- repair should have a use-your-best-judgement flag -- mandatory for
ec (if not deleting the object)
- for ec, if we want to read a shard, need to specify the shard id as
well since an osd might have two shards
  - request would indicate the shard
</quote>

because the ec sub-read does not return the payload if the size or the
digest fails to match; instead, an EIO is returned. on the primary
side, ECBackend::handle_sub_read_reply() will send more sub-read ops to
the shard(s) not yet used if it is unable to reconstruct the
requested extent from the shards already returned. if we want to
explicitly exclude the "bad" shards from being used, maybe the
simplest way is to remove them before calling repair-copy. and we can
offer an API for removing a shard. but I doubt that we need to do
this, as the chance of having a corrupted shard whose checksum matches
the digest stored in its object-info xattr is very small. and
maybe we can fix a corrupted shard in an ec pool by reading the impacted
object, and then overwriting the original copy with the reconstructed
one from the good shards.
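
concretely, that last approach needs nothing scrub-specific, just the
ordinary librados calls (a sketch only, assuming the object fits in memory):

    uint64_t size = 0;
    time_t mtime;
    int r = ioctx.stat(oid, &size, &mtime);
    if (r == 0) {
      librados::bufferlist data;
      // a normal EC read reconstructs the data from the good shards ...
      r = ioctx.read(oid, data, size, 0);
      if (r >= 0)
        // ... and rewriting the object rewrites every shard, including the
        // corrupted one, and bumps the object version
        r = ioctx.write_full(oid, data);
    }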

<quote from="http://pad.ceph.com/p/scrub_repair">
repair (need a flag to aio_operate  write variant to allow overwriting
an unfound object) (needs to bypass snapshotting) (allow to write to a
clone?) (require x cap bit?)
 delete
 writefull ...
 omap_set_...
 setattrs ...
</quote>

do we still need the REPAIR_WRITE flag for overwriting an unfound
object? i removed an object from the osd's store directory, and the
repair-copy does fix it for me. or did I misunderstand this line...

> -Sam
>
> On Thu, May 19, 2016 at 6:09 AM, kefu chai <tchaikov@gmail.com> wrote:
>> hi cephers,
>>
>> I'd like to keep you guys posted on the progress of the scrub/repair
>> feature. And I would also like to get valuable comments/suggestions on it from
>> you! Now, I am working on the repair-write API for the scrub/repair
>> feature.
>>
>> the API looks like:
>>
>>     /**
>>      * Rewrite the object with the replica hosted by specified osd
>>      *
>>      * @param osd from which OSD we will copy the data
>>      * @param version the version of rewritten object
>>      * @param what the flags indicating what we will copy
>>      */
>>     int repair_copy(const std::string& oid, uint64_t version, uint32_t
>> what, int32_t osd, uint32_t epoch);
>>
>> in which,
>> - `version` is the version of the object you expect to be repairing in
>> case of a racing write;
>> - `what` is an OR'ed flags of follow enum:
>> - `epoch` like the other scrub/repairing APIs, epoch indicating the
>> scrub interval is passed in.
>>
>> struct repair_copy_t {
>>   enum {
>>     DATA = 1 << 0,
>>     OMAP = 1 << 1,
>>     ATTR = 1 << 2,
>>   };
>> };
>>
>> a new REPAIR_COPY OSD op is introduced to enable the OSD side to copy
>> the shard/replica from specified source OSD to the acting set. and the
>> machinery of copy_from is reused to implement this feature. so after
>> rewriting the object, a version is increased, so that possibly corrupt
>> copies on down OSDs will get fixed naturally.
>>
>> for the code, see
>> - https://github.com/ceph/ceph/pull/9203
>>
>> for the draft design, see
>> - http://tracker.ceph.com/issues/13508
>> - http://pad.ceph.com/p/scrub_repair
>>
>> the API for fixing snapset will be added later.
>>
>> On Wed, Nov 11, 2015 at 11:43 PM, kefu chai <tchaikov@gmail.com> wrote:
>>> On Wed, Nov 11, 2015 at 10:43 PM, 王志强 <wonzhq@gmail.com> wrote:
>>>> 2015-11-11 19:44 GMT+08:00 kefu chai <tchaikov@gmail.com>:
>>>>> currently, scrub and repair are pretty primitive. there are several
>>>>> improvements which need to be made:
>>>>>
>>>>> - user should be able to initialize scrub of a PG or an object
>>>>>     - int scrub(pg_t, AioCompletion*)
>>>>>     - int scrub(const string& pool, const string& nspace, const
>>>>> string& locator, const string& oid, AioCompletion*)
>>>>> - we need a way to query the result of the most recent scrub on a pg.
>>>>>     - int get_inconsistent_pools(set<uint64_t>* pools);
>>>>>     - int get_inconsistent_pgs(uint64_t pool, paged<pg_t>* pgs);
>>>>>     - int get_inconsistent(pg_t pgid, epoch_t* cur_interval,
>>>>> paged<inconsistent_t>*)
>>>>> - the user should be able to query the content of the replica/shard
>>>>> objects in the event of an inconsistency.
>>>>>     - operate_on_shard(epoch_t interval, pg_shard_t pg_shard,
>>>>> ObjectReadOperation *op, bool allow_inconsistent)
>>>>> - the user should be able to perform following fixes using a new
>>>>> aio_operate_scrub(
>>>>>                                           const std::string& oid,
>>>>>                                           shard_id_t shard,
>>>>>                                           AioCompletion *c,
>>>>>                                           ObjectWriteOperation *op)
>>>>>     - specify which replica to use for repairing a content inconsistency
>>>>>     - delete an object if it can't exist
>>>>>     - write_full
>>>>>     - omap_set
>>>>>     - setattrs
>>>>> - the user should be able to repair snapset and object_info_t
>>>>>     - ObjectWriteOperation::repair_snapset(...)
>>>>>         - set/remove any property/attributes, for example,
>>>>>             - to reset snapset.clone_overlap
>>>>>             - to set snapset.clone_size
>>>>>             - to reset the digests in object_info_t,
>>>>> - repair will create a new version so that possibly corrupted copies
>>>>> on down OSDs will get fixed naturally.
>>>>>
>>>>
>>>> I think this exposes too much things to the user. Usually a user
>>>> doesn't have knowledges like this. If we make it too much complicated,
>>>> no one will use it at the end.
>>>
>>> well, i tend to agree with you to some degree. this is a set of very low
>>> level APIs exposed to user, but we will accompany them with some
>>> ready-to-use policies to repair the typical inconsistencies. like the
>>> sample code attached at the end of this mail. but the point here is
>>> that we will not burden the OSD daemon will all of these complicated
>>> logic to fix and repair things. and let the magic happen out side of
>>> the ceph-osd in a more flexible way. for the advanced users, if they
>>> want to explore the possibilities to fix the inconsistencies in their own
>>> way, they won't be disappointed also.
>>>
>>>>
>>>>> so librados will offer enough information and facilities, with which a
>>>>> smart librados client/script will be able to fix the inconsistencies
>>>>> found in the scrub.
>>>>>
>>>>> as an example, if we run into a data inconsistency where the 3
>>>>> replicas failed to agree with each other after performing a deep
>>>>> scrub. probably we'd like to have an election to get the auth copy.
>>>>> following pseudo code explains how we will implement this using the
>>>>> new rados APIs for scrub and repair.
>>>>>
>>>>>      # something is not necessarily better than nothing
>>>>>      rados.aio_scrub(pg, completion)
>>>>>      completion.wait_for_complete()
>>>>>      for pool in rados.get_inconsistent_pools():
>>>>>           for pg in rados.get_inconsistent_pgs(pool):
>>>>>                # rados.get_inconsistent_pgs() throws if "epoch" expires
>>>>>
>>>>>                for oid, inconsistent in rados.get_inconsistent_pgs(pg,
>>>>> epoch).items():
>>>>>                     if inconsistent.is_data_digest_mismatch():
>>>>>                          votes = defaultdict(int)
>>>>>                          for osd, shard_info in inconsistent.shards.items():
>>>>>                               votes[shard_info.object_info.data_digest] += 1
>>>>>                          digest, _ = max(votes.items(), key=operator.itemgetter(1))
>>>>>                          auth_copy = None
>>>>>                          for osd, shard_info in inconsistent.shards.items():
>>>>>                               if shard_info.object_info.data_digest == digest:
>>>>>                                    auth_copy = osd
>>>>>                                    break
>>>>>                          repair_op = librados.ObjectWriteOperation()
>>>>>                          repair_op.repair_pick(auth_copy,
>>>>> inconsistent.ver, epoch)
>>>>>                          rados.aio_operate_scrub(oid, repair_op)
>>>>>
>>>>> this plan was also discussed in the infernalis CDS. see
>>>>> http://tracker.ceph.com/projects/ceph/wiki/Osd_-_Scrub_and_Repair.
>>>>> --
>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>> the body of a message to majordomo@vger.kernel.org
>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>> the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>>>
>>>
>>> --
>>> Regards
>>> Kefu Chai
>>
>>
>>
>> --
>> Regards
>> Kefu Chai
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Regards
Kefu Chai
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: new scrub and repair discussion
  2016-05-19 13:09   ` kefu chai
@ 2016-05-19 17:55     ` Samuel Just
  2016-05-20 11:30       ` kefu chai
  2016-05-27 12:03     ` Dan van der Ster
  1 sibling, 1 reply; 15+ messages in thread
From: Samuel Just @ 2016-05-19 17:55 UTC (permalink / raw)
  To: kefu chai; +Cc: ceph-devel

How would this work for an ec pool?  Maybe the osd argument should be
a set of valid peers?
-Sam
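
purely as an illustration of that suggestion (nothing like this is in
the current proposal), the signature might generalize along these lines
for an EC pool, with the single source osd replaced by a set of shards:

    // hypothetical variant of the proposed repair_copy(); pg_shard_t is
    // used because for an EC pool the shard id matters, not just the osd
    int repair_copy(const std::string& oid, uint64_t version, uint32_t what,
                    const std::set<pg_shard_t>& sources, uint32_t epoch);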

On Thu, May 19, 2016 at 6:09 AM, kefu chai <tchaikov@gmail.com> wrote:
> hi cephers,
>
> I'd like to keep you guys posted on the progress of the scrub/repair
> feature. And I also want to get valuable comments/suggestions on it
> from you! Now, I am working on the repair-write API for the scrub/repair
> feature.
>
> the API looks like:
>
>     /**
>      * Rewrite the object with the replica hosted by the specified osd
>      *
>      * @param osd the OSD from which we will copy the data
>      * @param version the version of the object to be rewritten
>      * @param what the flags indicating what we will copy
>      */
>     int repair_copy(const std::string& oid, uint64_t version, uint32_t
> what, int32_t osd, uint32_t epoch);
>
> in which,
> - `version` is the version of the object you expect to be repairing,
> to guard against a racing write;
> - `what` is an OR'ed combination of the flags in the following enum:
> - `epoch`, as in the other scrub/repair APIs, is the epoch indicating
> the scrub interval.
>
> struct repair_copy_t {
>   enum {
>     DATA = 1 << 0,
>     OMAP = 1 << 1,
>     ATTR = 1 << 2,
>   };
> };
>
> a new REPAIR_COPY OSD op is introduced to let the OSD side copy the
> shard/replica from the specified source OSD to the acting set, and the
> copy_from machinery is reused to implement this feature. after
> rewriting the object, its version is bumped, so that possibly corrupt
> copies on down OSDs will get fixed naturally.
>
> for the code, see
> - https://github.com/ceph/ceph/pull/9203
>
> for the draft design, see
> - http://tracker.ceph.com/issues/13508
> - http://pad.ceph.com/p/scrub_repair
>
> the API for fixing snapset will be added later.
>
> On Wed, Nov 11, 2015 at 11:43 PM, kefu chai <tchaikov@gmail.com> wrote:
>> On Wed, Nov 11, 2015 at 10:43 PM, 王志强 <wonzhq@gmail.com> wrote:
>>> 2015-11-11 19:44 GMT+08:00 kefu chai <tchaikov@gmail.com>:
>>>> currently, scrub and repair are pretty primitive. there are several
>>>> improvements which need to be made:
>>>>
>>>> - user should be able to initialize scrub of a PG or an object
>>>>     - int scrub(pg_t, AioCompletion*)
>>>>     - int scrub(const string& pool, const string& nspace, const
>>>> string& locator, const string& oid, AioCompletion*)
>>>> - we need a way to query the result of the most recent scrub on a pg.
>>>>     - int get_inconsistent_pools(set<uint64_t>* pools);
>>>>     - int get_inconsistent_pgs(uint64_t pool, paged<pg_t>* pgs);
>>>>     - int get_inconsistent(pg_t pgid, epoch_t* cur_interval,
>>>> paged<inconsistent_t>*)
>>>> - the user should be able to query the content of the replica/shard
>>>> objects in the event of an inconsistency.
>>>>     - operate_on_shard(epoch_t interval, pg_shard_t pg_shard,
>>>> ObjectReadOperation *op, bool allow_inconsistent)
>>>> - the user should be able to perform following fixes using a new
>>>> aio_operate_scrub(
>>>>                                           const std::string& oid,
>>>>                                           shard_id_t shard,
>>>>                                           AioCompletion *c,
>>>>                                           ObjectWriteOperation *op)
>>>>     - specify which replica to use for repairing a content inconsistency
>>>>     - delete an object if it can't exist
>>>>     - write_full
>>>>     - omap_set
>>>>     - setattrs
>>>> - the user should be able to repair snapset and object_info_t
>>>>     - ObjectWriteOperation::repair_snapset(...)
>>>>         - set/remove any property/attributes, for example,
>>>>             - to reset snapset.clone_overlap
>>>>             - to set snapset.clone_size
>>>>             - to reset the digests in object_info_t,
>>>> - repair will create a new version so that possibly corrupted copies
>>>> on down OSDs will get fixed naturally.
>>>>
>>>
>>> I think this exposes too many things to the user. Usually a user
>>> doesn't have knowledge like this. If we make it too complicated,
>>> no one will use it in the end.
>>
>> well, i tend to agree with you to some degree. this is a set of very low
>> level APIs exposed to the user, but we will accompany them with some
>> ready-to-use policies to repair the typical inconsistencies, like the
>> sample code attached at the end of this mail. but the point here is
>> that we will not burden the OSD daemon with all of this complicated
>> logic to fix and repair things, and instead let the magic happen outside
>> of ceph-osd in a more flexible way. advanced users who want to explore
>> the possibilities of fixing the inconsistencies in their own way won't
>> be disappointed either.
>>
>>>
>>>> so librados will offer enough information and facilities, with which a
>>>> smart librados client/script will be able to fix the inconsistencies
>>>> found in the scrub.
>>>>
>>>> as an example, if we run into a data inconsistency where the 3
>>>> replicas failed to agree with each other after performing a deep
>>>> scrub. probably we'd like to have an election to get the auth copy.
>>>> following pseudo code explains how we will implement this using the
>>>> new rados APIs for scrub and repair.
>>>>
>>>>      # something is not necessarily better than nothing
>>>>      rados.aio_scrub(pg, completion)
>>>>      completion.wait_for_complete()
>>>>      for pool in rados.get_inconsistent_pools():
>>>>           for pg in rados.get_inconsistent_pgs(pool):
>>>>                # rados.get_inconsistent_pgs() throws if "epoch" expires
>>>>
>>>>                for oid, inconsistent in rados.get_inconsistent_pgs(pg,
>>>> epoch).items():
>>>>                     if inconsistent.is_data_digest_mismatch():
>>>>                          votes = defaultdict(int)
>>>>                          for osd, shard_info in inconsistent.shards.items():
>>>>                               votes[shard_info.object_info.data_digest] += 1
>>>>                          digest, _ = max(votes.items(), key=operator.itemgetter(1))
>>>>                          auth_copy = None
>>>>                          for osd, shard_info in inconsistent.shards.items():
>>>>                               if shard_info.object_info.data_digest == digest:
>>>>                                    auth_copy = osd
>>>>                                    break
>>>>                          repair_op = librados.ObjectWriteOperation()
>>>>                          repair_op.repair_pick(auth_copy,
>>>> inconsistent.ver, epoch)
>>>>                          rados.aio_operate_scrub(oid, repair_op)
>>>>
>>>> this plan was also discussed in the infernalis CDS. see
>>>> http://tracker.ceph.com/projects/ceph/wiki/Osd_-_Scrub_and_Repair.
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>> the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>>
>>
>> --
>> Regards
>> Kefu Chai
>
>
>
> --
> Regards
> Kefu Chai
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: new scrub and repair discussion
  2015-11-11 15:43 ` kefu chai
@ 2016-05-19 13:09   ` kefu chai
  2016-05-19 17:55     ` Samuel Just
  2016-05-27 12:03     ` Dan van der Ster
  0 siblings, 2 replies; 15+ messages in thread
From: kefu chai @ 2016-05-19 13:09 UTC (permalink / raw)
  To: ceph-devel

hi cephers,

I'd like to keep you guys posted on the progress of the scrub/repair
feature. And I also want to get valuable comments/suggestions on it
from you! Now, I am working on the repair-write API for the scrub/repair
feature.

the API looks like:

    /**
     * Rewrite the object with the replica hosted by the specified osd
     *
     * @param osd the OSD from which we will copy the data
     * @param version the version of the object to be rewritten
     * @param what the flags indicating what we will copy
     */
    int repair_copy(const std::string& oid, uint64_t version, uint32_t
what, int32_t osd, uint32_t epoch);

in which,
- `version` is the version of the object you expect to be repairing,
to guard against a racing write;
- `what` is an OR'ed combination of the flags in the following enum:
- `epoch`, as in the other scrub/repair APIs, is the epoch indicating
the scrub interval.

struct repair_copy_t {
  enum {
    DATA = 1 << 0,
    OMAP = 1 << 1,
    ATTR = 1 << 2,
  };
};

a new REPAIR_COPY OSD op is introduced to let the OSD side copy the
shard/replica from the specified source OSD to the acting set, and the
copy_from machinery is reused to implement this feature. after
rewriting the object, its version is bumped, so that possibly corrupt
copies on down OSDs will get fixed naturally.
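
to make that concrete, here is a minimal usage sketch. it is only an
illustration: it assumes the proposed repair_copy() ends up as a method
on librados::IoCtx (where it actually lands is not settled here), that
repair_copy_t is the struct above, and that the object version and the
scrub epoch came back from the get_inconsistent* queries.

    #include <rados/librados.hpp>
    #include <string>

    // hypothetical sketch, not final API: rewrite the data and xattrs of
    // `oid' from the copy on `src_osd', leaving its omap untouched.  `ver'
    // guards against a racing write, `epoch' against an interval change.
    int repair_data_and_attrs(librados::IoCtx& ioctx, const std::string& oid,
                              uint64_t ver, int32_t src_osd, uint32_t epoch)
    {
      uint32_t what = repair_copy_t::DATA | repair_copy_t::ATTR;
      return ioctx.repair_copy(oid, ver, what, src_osd, epoch);
    }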

for the code, see
- https://github.com/ceph/ceph/pull/9203

for the draft design, see
- http://tracker.ceph.com/issues/13508
- http://pad.ceph.com/p/scrub_repair

the API for fixing snapset will be added later.

On Wed, Nov 11, 2015 at 11:43 PM, kefu chai <tchaikov@gmail.com> wrote:
> On Wed, Nov 11, 2015 at 10:43 PM, 王志强 <wonzhq@gmail.com> wrote:
>> 2015-11-11 19:44 GMT+08:00 kefu chai <tchaikov@gmail.com>:
>>> currently, scrub and repair are pretty primitive. there are several
>>> improvements which need to be made:
>>>
>>> - user should be able to initialize scrub of a PG or an object
>>>     - int scrub(pg_t, AioCompletion*)
>>>     - int scrub(const string& pool, const string& nspace, const
>>> string& locator, const string& oid, AioCompletion*)
>>> - we need a way to query the result of the most recent scrub on a pg.
>>>     - int get_inconsistent_pools(set<uint64_t>* pools);
>>>     - int get_inconsistent_pgs(uint64_t pool, paged<pg_t>* pgs);
>>>     - int get_inconsistent(pg_t pgid, epoch_t* cur_interval,
>>> paged<inconsistent_t>*)
>>> - the user should be able to query the content of the replica/shard
>>> objects in the event of an inconsistency.
>>>     - operate_on_shard(epoch_t interval, pg_shard_t pg_shard,
>>> ObjectReadOperation *op, bool allow_inconsistent)
>>> - the user should be able to perform following fixes using a new
>>> aio_operate_scrub(
>>>                                           const std::string& oid,
>>>                                           shard_id_t shard,
>>>                                           AioCompletion *c,
>>>                                           ObjectWriteOperation *op)
>>>     - specify which replica to use for repairing a content inconsistency
>>>     - delete an object if it can't exist
>>>     - write_full
>>>     - omap_set
>>>     - setattrs
>>> - the user should be able to repair snapset and object_info_t
>>>     - ObjectWriteOperation::repair_snapset(...)
>>>         - set/remove any property/attributes, for example,
>>>             - to reset snapset.clone_overlap
>>>             - to set snapset.clone_size
>>>             - to reset the digests in object_info_t,
>>> - repair will create a new version so that possibly corrupted copies
>>> on down OSDs will get fixed naturally.
>>>
>>
>> I think this exposes too many things to the user. Usually a user
>> doesn't have knowledge like this. If we make it too complicated,
>> no one will use it in the end.
>
> well, i tend to agree with you to some degree. this is a set of very low
> level APIs exposed to the user, but we will accompany them with some
> ready-to-use policies to repair the typical inconsistencies, like the
> sample code attached at the end of this mail. but the point here is
> that we will not burden the OSD daemon with all of this complicated
> logic to fix and repair things, and instead let the magic happen outside
> of ceph-osd in a more flexible way. advanced users who want to explore
> the possibilities of fixing the inconsistencies in their own way won't
> be disappointed either.
>
>>
>>> so librados will offer enough information and facilities, with which a
>>> smart librados client/script will be able to fix the inconsistencies
>>> found in the scrub.
>>>
>>> as an example, if we run into a data inconsistency where the 3
>>> replicas failed to agree with each other after performing a deep
>>> scrub. probably we'd like to have an election to get the auth copy.
>>> following pseudo code explains how we will implement this using the
>>> new rados APIs for scrub and repair.
>>>
>>>      # something is not necessarily better than nothing
>>>      rados.aio_scrub(pg, completion)
>>>      completion.wait_for_complete()
>>>      for pool in rados.get_inconsistent_pools():
>>>           for pg in rados.get_inconsistent_pgs(pool):
>>>                # rados.get_inconsistent_pgs() throws if "epoch" expires
>>>
>>>                for oid, inconsistent in rados.get_inconsistent_pgs(pg,
>>> epoch).items():
>>>                     if inconsistent.is_data_digest_mismatch():
>>>                          votes = defaultdict(int)
>>>                          for osd, shard_info in inconsistent.shards.items():
>>>                               votes[shard_info.object_info.data_digest] += 1
>>>                          digest, _ = max(votes.items(), key=operator.itemgetter(1))
>>>                          auth_copy = None
>>>                          for osd, shard_info in inconsistent.shards.items():
>>>                               if shard_info.object_info.data_digest == digest:
>>>                                    auth_copy = osd
>>>                                    break
>>>                          repair_op = librados.ObjectWriteOperation()
>>>                          repair_op.repair_pick(auth_copy,
>>> inconsistent.ver, epoch)
>>>                          rados.aio_operate_scrub(oid, repair_op)
>>>
>>> this plan was also discussed in the infernalis CDS. see
>>> http://tracker.ceph.com/projects/ceph/wiki/Osd_-_Scrub_and_Repair.
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>
>
> --
> Regards
> Kefu Chai



-- 
Regards
Kefu Chai
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: new scrub and repair discussion
  2015-11-11 14:43 王志强
@ 2015-11-11 15:43 ` kefu chai
  2016-05-19 13:09   ` kefu chai
  0 siblings, 1 reply; 15+ messages in thread
From: kefu chai @ 2015-11-11 15:43 UTC (permalink / raw)
  To: 王志强; +Cc: ceph-devel

On Wed, Nov 11, 2015 at 10:43 PM, 王志强 <wonzhq@gmail.com> wrote:
> 2015-11-11 19:44 GMT+08:00 kefu chai <tchaikov@gmail.com>:
>> currently, scrub and repair are pretty primitive. there are several
>> improvements which need to be made:
>>
>> - user should be able to initialize scrub of a PG or an object
>>     - int scrub(pg_t, AioCompletion*)
>>     - int scrub(const string& pool, const string& nspace, const
>> string& locator, const string& oid, AioCompletion*)
>> - we need a way to query the result of the most recent scrub on a pg.
>>     - int get_inconsistent_pools(set<uint64_t>* pools);
>>     - int get_inconsistent_pgs(uint64_t pool, paged<pg_t>* pgs);
>>     - int get_inconsistent(pg_t pgid, epoch_t* cur_interval,
>> paged<inconsistent_t>*)
>> - the user should be able to query the content of the replica/shard
>> objects in the event of an inconsistency.
>>     - operate_on_shard(epoch_t interval, pg_shard_t pg_shard,
>> ObjectReadOperation *op, bool allow_inconsistent)
>> - the user should be able to perform following fixes using a new
>> aio_operate_scrub(
>>                                           const std::string& oid,
>>                                           shard_id_t shard,
>>                                           AioCompletion *c,
>>                                           ObjectWriteOperation *op)
>>     - specify which replica to use for repairing a content inconsistency
>>     - delete an object if it can't exist
>>     - write_full
>>     - omap_set
>>     - setattrs
>> - the user should be able to repair snapset and object_info_t
>>     - ObjectWriteOperation::repair_snapset(...)
>>         - set/remove any property/attributes, for example,
>>             - to reset snapset.clone_overlap
>>             - to set snapset.clone_size
>>             - to reset the digests in object_info_t,
>> - repair will create a new version so that possibly corrupted copies
>> on down OSDs will get fixed naturally.
>>
>
> I think this exposes too many things to the user. Usually a user
> doesn't have knowledge like this. If we make it too complicated,
> no one will use it in the end.

well, i tend to agree with you to some degree. this is a set of very low
level APIs exposed to the user, but we will accompany them with some
ready-to-use policies to repair the typical inconsistencies, like the
sample code attached at the end of this mail. but the point here is
that we will not burden the OSD daemon with all of this complicated
logic to fix and repair things, and instead let the magic happen outside
of ceph-osd in a more flexible way. advanced users who want to explore
the possibilities of fixing the inconsistencies in their own way won't
be disappointed either.

>
>> so librados will offer enough information and facilities, with which a
>> smart librados client/script will be able to fix the inconsistencies
>> found in the scrub.
>>
>> as an example, if we run into a data inconsistency where the 3
>> replicas failed to agree with each other after performing a deep
>> scrub. probably we'd like to have an election to get the auth copy.
>> following pseudo code explains how we will implement this using the
>> new rados APIs for scrub and repair.
>>
>>      # something is not necessarily better than nothing
>>      rados.aio_scrub(pg, completion)
>>      completion.wait_for_complete()
>>      for pool in rados.get_inconsistent_pools():
>>           for pg in rados.get_inconsistent_pgs(pool):
>>                # rados.get_inconsistent_pgs() throws if "epoch" expires
>>
>>                for oid, inconsistent in rados.get_inconsistent_pgs(pg,
>> epoch).items():
>>                     if inconsistent.is_data_digest_mismatch():
>>                          votes = defaultdict(int)
>>                          for osd, shard_info in inconsistent.shards.items():
>>                               votes[shard_info.object_info.data_digest] += 1
>>                          digest, _ = max(votes.items(), key=operator.itemgetter(1))
>>                          auth_copy = None
>>                          for osd, shard_info in inconsistent.shards.items():
>>                               if shard_info.object_info.data_digest == digest:
>>                                    auth_copy = osd
>>                                    break
>>                          repair_op = librados.ObjectWriteOperation()
>>                          repair_op.repair_pick(auth_copy,
>> inconsistent.ver, epoch)
>>                          rados.aio_operate_scrub(oid, repair_op)
>>
>> this plan was also discussed in the infernalis CDS. see
>> http://tracker.ceph.com/projects/ceph/wiki/Osd_-_Scrub_and_Repair.
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Regards
Kefu Chai
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: new scrub and repair discussion
@ 2015-11-11 14:43 王志强
  2015-11-11 15:43 ` kefu chai
  0 siblings, 1 reply; 15+ messages in thread
From: 王志强 @ 2015-11-11 14:43 UTC (permalink / raw)
  To: ceph-devel

2015-11-11 19:44 GMT+08:00 kefu chai <tchaikov@gmail.com>:
> currently, scrub and repair are pretty primitive. there are several
> improvements which need to be made:
>
> - user should be able to initialize scrub of a PG or an object
>     - int scrub(pg_t, AioCompletion*)
>     - int scrub(const string& pool, const string& nspace, const
> string& locator, const string& oid, AioCompletion*)
> - we need a way to query the result of the most recent scrub on a pg.
>     - int get_inconsistent_pools(set<uint64_t>* pools);
>     - int get_inconsistent_pgs(uint64_t pool, paged<pg_t>* pgs);
>     - int get_inconsistent(pg_t pgid, epoch_t* cur_interval,
> paged<inconsistent_t>*)
> - the user should be able to query the content of the replica/shard
> objects in the event of an inconsistency.
>     - operate_on_shard(epoch_t interval, pg_shard_t pg_shard,
> ObjectReadOperation *op, bool allow_inconsistent)
> - the user should be able to perform following fixes using a new
> aio_operate_scrub(
>                                           const std::string& oid,
>                                           shard_id_t shard,
>                                           AioCompletion *c,
>                                           ObjectWriteOperation *op)
>     - specify which replica to use for repairing a content inconsistency
>     - delete an object if it can't exist
>     - write_full
>     - omap_set
>     - setattrs
> - the user should be able to repair snapset and object_info_t
>     - ObjectWriteOperation::repair_snapset(...)
>         - set/remove any property/attributes, for example,
>             - to reset snapset.clone_overlap
>             - to set snapset.clone_size
>             - to reset the digests in object_info_t,
> - repair will create a new version so that possibly corrupted copies
> on down OSDs will get fixed naturally.
>

I think this exposes too many things to the user. Usually a user
doesn't have knowledge like this. If we make it too complicated,
no one will use it in the end.

> so librados will offer enough information and facilities, with which a
> smart librados client/script will be able to fix the inconsistencies
> found in the scrub.
>
> as an example, if we run into a data inconsistency where the 3
> replicas failed to agree with each other after performing a deep
> scrub. probably we'd like to have an election to get the auth copy.
> following pseudo code explains how we will implement this using the
> new rados APIs for scrub and repair.
>
>      # something is not necessarily better than nothing
>      rados.aio_scrub(pg, completion)
>      completion.wait_for_complete()
>      for pool in rados.get_inconsistent_pools():
>           for pg in rados.get_inconsistent_pgs(pool):
>                # rados.get_inconsistent_pgs() throws if "epoch" expires
>
>                for oid, inconsistent in rados.get_inconsistent_pgs(pg,
> epoch).items():
>                     if inconsistent.is_data_digest_mismatch():
>                          votes = defaultdict(int)
>                          for osd, shard_info in inconsistent.shards.items():
>                               votes[shard_info.object_info.data_digest] += 1
>                          digest, _ = max(votes.items(), key=operator.itemgetter(1))
>                          auth_copy = None
>                          for osd, shard_info in inconsistent.shards.items():
>                               if shard_info.object_info.data_digest == digest:
>                                    auth_copy = osd
>                                    break
>                          repair_op = librados.ObjectWriteOperation()
>                          repair_op.repair_pick(auth_copy,
> inconsistent.ver, epoch)
>                          rados.aio_operate_scrub(oid, repair_op)
>
> this plan was also discussed in the infernalis CDS. see
> http://tracker.ceph.com/projects/ceph/wiki/Osd_-_Scrub_and_Repair.
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2016-06-07 13:13 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-11-11 11:44 new scrub and repair discussion kefu chai
2015-11-11 13:25 ` Sage Weil
2015-11-11 14:53   ` kefu chai
2016-05-23  3:54 ` Shinobu Kinjo
2016-05-25 14:34   ` kefu chai
2015-11-11 14:43 王志强
2015-11-11 15:43 ` kefu chai
2016-05-19 13:09   ` kefu chai
2016-05-19 17:55     ` Samuel Just
2016-05-20 11:30       ` kefu chai
2016-05-25 17:37         ` Samuel Just
2016-06-07 13:13           ` kefu chai
2016-05-27 12:03     ` Dan van der Ster
2016-06-07 10:44       ` kefu chai
2016-06-07 13:03         ` Sage Weil

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.