From: Sage Weil <sweil@redhat.com>
To: Stefan Priebe - Profihost AG <s.priebe@profihost.ag>
Cc: Gregory Farnum <gfarnum@redhat.com>,
	"ceph-devel@vger.kernel.org" <ceph-devel@vger.kernel.org>
Subject: Re: Ceph Luminous - pg is down due to src/osd/SnapMapper.cc: 246: FAILED assert(r == -2)
Date: Wed, 17 Jan 2018 18:48:50 +0000 (UTC)
Message-ID: <alpine.DEB.2.11.1801171848400.24931@piezo.novalocal>
In-Reply-To: <8e353d25-47e8-181f-0dc5-cea0f64cbb36@profihost.ag>

On Wed, 17 Jan 2018, Stefan Priebe - Profihost AG wrote:
> Hi Sage,
> 
> this gives me another crash while that pg is recovering:
> 
>      0> 2018-01-17 19:25:09.328935 7f48f8fff700 -1
> /build/ceph/src/osd/PrimaryLogPG.cc: In function 'virtual void
> PrimaryLogPG::on_local_recover(const hobject_t&, const ObjectRecoveryInfo&,
> ObjectContextRef, bool, ObjectStore::Transaction*)'
> thread 7f48f8fff700 time 2018-01-17 19:25:09.322287
> /build/ceph/src/osd/PrimaryLogPG.cc: 358: FAILED assert(p !=
> recovery_info.ss.clone_snaps.end())

Is this a cache tiering pool?

s
> 
>  ceph version 12.2.2-94-g92923ef
> (92923ef323d32d8321e86703ce1f9016f19472fb) luminous (stable)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x102) [0x55addb5eb1f2]
>  2: (PrimaryLogPG::on_local_recover(hobject_t const&, ObjectRecoveryInfo
> const&, std::shared_ptr<ObjectContext>, bool,
> ObjectStore::Transaction*)+0x11f0) [0x55addb1957a0]
>  3: (ReplicatedBackend::handle_push(pg_shard_t, PushOp const&,
> PushReplyOp*, ObjectStore::Transaction*)+0x31d) [0x55addb3071ed]
>  4: (ReplicatedBackend::_do_push(boost::intrusive_ptr<OpRequest>)+0x18f)
> [0x55addb30748f]
>  5:
> (ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x2d1)
> [0x55addb317531]
>  6: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x50)
> [0x55addb23cf10]
>  7: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&,
> ThreadPool::TPHandle&)+0x77b) [0x55addb1a91eb]
>  8: (OSD::dequeue_op(boost::intrusive_ptr<PG>,
> boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3f7)
> [0x55addb035bc7]
>  9: (PGQueueable::RunVis::operator()(boost::intrusive_ptr<OpRequest>
> const&)+0x57) [0x55addb2ad957]
>  10: (OSD::ShardedOpWQ::_process(unsigned int,
> ceph::heartbeat_handle_d*)+0x108c) [0x55addb064d1c]
>  11: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x88d)
> [0x55addb5f0e7d]
>  12: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55addb5f2e40]
>  13: (()+0x8064) [0x7f4955b68064]
>  14: (clone()+0x6d) [0x7f4954c5c62d]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is
> needed to interpret this.
> 
> Greets,
> Stefan
> 
> On 17.01.2018 at 15:28, Sage Weil wrote:
> > On Wed, 17 Jan 2018, Stefan Priebe - Profihost AG wrote:
> >> Hi,
> >>
> >> is there any chance to fix this instead of manually removing all the clones?
> > 
> > I believe you can avoid the immediate problem and get the PG up by 
> > commenting out the assert.  set_snaps() will overwrite the object->snap 
> > list mapping.
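> > 
> > For reference, the check that trips is near the top of SnapMapper::add_oid().
> > From memory (so the exact code in your 12.2.2 build may differ slightly --
> > treat this as an illustration, not a patch), the change would look roughly
> > like:
> > 
> >     {
> >       object_snaps out;
> >       int r = get_snaps(oid, &out);
> >       //assert(r == -2);   // -2 == -ENOENT; disabled so an existing
> >                            // mapping no longer aborts the OSD
> >     }
> >     object_snaps _snaps(oid, snaps);
> >     set_snaps(oid, _snaps, t);   // overwrites the old object->snaps mapping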
> > 
> > The problem is you'll probably still have a stray snapid -> object mapping, so 
> > when snaptrimming runs you might end up with a PG in the snaptrim_error 
> > state that won't trim (although from a quick look at the code it won't 
> > crash).  I'd probably remove the assert and deal with that if/when it 
> > happens.
> > 
> > I'm adding a ticket to relax these asserts for production but keep them 
> > enabled for qa.  This isn't something that needs to take down the OSD!
> > 
> > sage
> >
> >> Stefan
> >>
> >> On 16.01.2018 at 23:24, Gregory Farnum wrote:
> >>> On Mon, Jan 15, 2018 at 5:23 PM, Stefan Priebe - Profihost AG
> >>> <s.priebe@profihost.ag> wrote:
> >>>> Hello,
> >>>>
> >>>> currently one of my clusters is missing a whole pg due to all 3 osds
> >>>> being down.
> >>>>
> >>>> All of them fail with:
> >>>>     0> 2018-01-16 02:05:33.353293 7f944dbfe700 -1
> >>>> /build/ceph/src/osd/SnapMapper.cc: In function 'void
> >>>> SnapMapper::add_oid(const hobject_t&, const std::set<snapid_t>&,
> >>>> MapCacher::Transaction<std::basic_string<char>, ceph::buffer::list>*)'
> >>>> thread 7f944dbfe700 time 2018-01-16 02:05:33.349946
> >>>> /build/ceph/src/osd/SnapMapper.cc: 246: FAILED assert(r == -2)
> >>>>
> >>>>  ceph version 12.2.2-93-gd6da8d7
> >>>> (d6da8d77a4b2220e6bdd61e4bdd911a9cd91946c) luminous (stable)
> >>>>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> >>>> const*)+0x102) [0x561f9ff0b1e2]
> >>>>  2: (SnapMapper::add_oid(hobject_t const&, std::set<snapid_t,
> >>>> std::less<snapid_t>, std::allocator<snapid_t> > const&,
> >>>> MapCacher::Transaction<std::string, ceph::buffer::list>*)+0x64b)
> >>>> [0x561f9fb76f3b]
> >>>>  3: (PG::update_snap_map(std::vector<pg_log_entry_t,
> >>>> std::allocator<pg_log_entry_t> > const&,
> >>>> ObjectStore::Transaction&)+0x38f) [0x561f9fa0ae3f]
> >>>>  4: (PG::append_log(std::vector<pg_log_entry_t,
> >>>> std::allocator<pg_log_entry_t> > const&, eversion_t, eversion_t,
> >>>> ObjectStore::Transaction&, bool)+0x538) [0x561f9fa31018]
> >>>>  5: (PrimaryLogPG::log_operation(std::vector<pg_log_entry_t,
> >>>> std::allocator<pg_log_entry_t> > const&,
> >>>> boost::optional<pg_hit_set_history_t> const&, eversion_t const&,
> >>>> eversion_t const&, bool, ObjectStore::Transaction&)+0x64) [0x561f9fb25d64]
> >>>>  6: (ReplicatedBackend::do_repop(boost::intrusive_ptr<OpRequest>)+0xa92)
> >>>> [0x561f9fc314b2]
> >>>>  7:
> >>>> (ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x2a4)
> >>>> [0x561f9fc374f4]
> >>>>  8: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x50)
> >>>> [0x561f9fb5cf10]
> >>>>  9: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&,
> >>>> ThreadPool::TPHandle&)+0x77b) [0x561f9fac91eb]
> >>>>  10: (OSD::dequeue_op(boost::intrusive_ptr<PG>,
> >>>> boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3f7)
> >>>> [0x561f9f955bc7]
> >>>>  11: (PGQueueable::RunVis::operator()(boost::intrusive_ptr<OpRequest>
> >>>> const&)+0x57) [0x561f9fbcd947]
> >>>>  12: (OSD::ShardedOpWQ::_process(unsigned int,
> >>>> ceph::heartbeat_handle_d*)+0x108c) [0x561f9f984d1c]
> >>>>  13: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x88d)
> >>>> [0x561f9ff10e6d]
> >>>>  14: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x561f9ff12e30]
> >>>>  15: (()+0x8064) [0x7f949afcb064]
> >>>>  16: (clone()+0x6d) [0x7f949a0bf62d]
> >>>>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is
> >>>> needed to interpret this.
> >>>
> >>> By the time it gets there, something else has gone wrong. The OSD is
> >>> adding a snapid/object pair to its "SnapMapper", and discovering that
> >>> there are already entries (which it thinks there shouldn't be).
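> >>>
> >>> Roughly -- sketching from the backtrace rather than quoting the exact
> >>> 12.2.2 source, so the names here are illustrative -- the path is:
> >>>
> >>>     // PG::append_log() -> PG::update_snap_map(): for each new log entry
> >>>     // that creates a clone, record that clone's snaps in the SnapMapper
> >>>     snap_mapper.add_oid(entry.soid, snaps_for_the_clone, &_t);
> >>>
> >>>     // SnapMapper::add_oid() first looks the object up and expects to get
> >>>     // -ENOENT (-2, hence "assert(r == -2)"); here the lookup found
> >>>     // existing entries instead, so the assert fired and the OSD aborted.
> >>>
> >>> So the real question is why the SnapMapper already had entries for that
> >>> clone.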
> >>>
> >>> You'll need to post more of a log, along with background, if anybody's
> >>> going to diagnose it: is there cache tiering on the cluster? What is
> >>> this pool used for? Were there other errors on this PG in the past?
> >>>
> >>> I also notice a separate email about deleting the data; I don't have
> >>> any experience with this but you'd probably have to export the PG
> >>> using ceph-objectstore-tool and then find a way to delete the object
> >>> out of it. I see options both to remove an object outright and to run
> >>> "remove-clone-metadata" on a particular clone ID, but I've not used any of
> >>> them myself.
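> >>>
> >>> Untested sketch of what that might look like (from memory; double-check
> >>> against ceph-objectstore-tool --help on your build, stop the OSD first,
> >>> and add --journal-path for filestore OSDs; the paths, pgid and clone id
> >>> below are placeholders):
> >>>
> >>>     # safety copy of the whole PG first
> >>>     ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<id> \
> >>>         --pgid <pgid> --op export --file /root/<pgid>.export
> >>>
> >>>     # list the objects in the PG to get the JSON object spec
> >>>     ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<id> \
> >>>         --pgid <pgid> --op list
> >>>
> >>>     # then either remove the object outright, or drop just the clone
> >>>     # metadata for one clone id
> >>>     ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<id> \
> >>>         '<json-object-spec>' remove-clone-metadata <cloneid>
> >>>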
> >>> -Greg
> >>>

Thread overview: 50+ messages
2018-01-16  1:23 Ceph Luminous - pg is down due to src/osd/SnapMapper.cc: 246: FAILED assert(r == -2) Stefan Priebe - Profihost AG
2018-01-16 22:24 ` Gregory Farnum
2018-01-17  5:59   ` Stefan Priebe - Profihost AG
2018-01-17  7:43   ` Stefan Priebe - Profihost AG
2018-01-17 12:07   ` Stefan Priebe - Profihost AG
2018-01-17 14:28     ` Sage Weil
2018-01-17 18:28       ` Stefan Priebe - Profihost AG
2018-01-17 18:48         ` Sage Weil [this message]
2018-01-17 18:52           ` Stefan Priebe - Profihost AG
2018-01-17 18:56             ` Sage Weil
2018-01-17 20:45               ` Stefan Priebe - Profihost AG
2018-01-17 21:16                 ` Stefan Priebe - Profihost AG
2018-01-17 22:07               ` Stefan Priebe - Profihost AG
2018-01-18  8:08               ` Stefan Priebe - Profihost AG
2018-01-18 13:16                 ` Sage Weil
2018-01-18 13:26                   ` Stefan Priebe - Profihost AG
2018-01-18 14:24                     ` Sage Weil
2018-01-18 14:50                       ` Igor Fedotov
2018-01-18 20:01                       ` Stefan Priebe - Profihost AG
2018-01-18 22:17                       ` Stefan Priebe - Profihost AG
2018-01-19 20:16                       ` Stefan Priebe - Profihost AG
2018-01-19 20:19                         ` Sage Weil
2018-01-19 20:45                           ` Stefan Priebe - Profihost AG
2018-01-21 20:27                             ` Sage Weil
2018-01-22 13:22                               ` Stefan Priebe - Profihost AG
2018-01-22 14:30                                 ` Sage Weil
2018-01-22 18:49                                   ` Stefan Priebe - Profihost AG
2018-01-22 19:01                                     ` Sage Weil
2018-01-22 19:15                                       ` Stefan Priebe - Profihost AG
2018-01-23 20:48                                       ` Stefan Priebe - Profihost AG
2018-01-24  0:07                                         ` Sage Weil
2018-01-24  7:17                                           ` Stefan Priebe - Profihost AG
2018-01-24 10:16                                             ` Sage Weil
2018-01-29 15:33                                               ` Stefan Priebe - Profihost AG
2018-01-30 19:25                                               ` Stefan Priebe - Profihost AG
2018-02-02 19:19                                       ` Stefan Priebe - Profihost AG
2018-02-02 19:28                                         ` Sage Weil
2018-02-02 20:21                                           ` Stefan Priebe - Profihost AG
2018-02-02 21:05                                             ` Sage Weil
2018-02-02 21:54                                               ` Stefan Priebe - Profihost AG
2018-02-03 21:07                                               ` Stefan Priebe - Profihost AG
2018-02-05 12:27                                                 ` Sage Weil
2018-02-05  7:34                                               ` Stefan Priebe - Profihost AG
2018-02-05 13:39                                               ` Stefan Priebe - Profihost AG
2018-02-12 11:58                                                 ` Stefan Priebe - Profihost AG
2018-02-12 19:31                                                   ` Sage Weil
2018-02-12 20:06                                                     ` Stefan Priebe - Profihost AG
2018-01-18 12:02               ` Stefan Priebe - Profihost AG
2018-01-17 14:05   ` Stefan Priebe - Profihost AG
2018-01-17 14:19     ` Igor Fedotov
