From mboxrd@z Thu Jan 1 00:00:00 1970
From: Haomai Wang
Subject: Re: ceph issue
Date: Fri, 18 Nov 2016 19:26:12 +0800
Mime-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Sender: ceph-devel-owner@vger.kernel.org
To: Marov Aleksey
Cc: Sage Weil, "ceph-devel@vger.kernel.org"

Sorry, I got the issue. I submitted a PR (https://github.com/ceph/ceph/pull/12068). Please test with it.

On Fri, Nov 18, 2016 at 5:23 PM, Marov Aleksey wrote:
> I use ceph with the rdma/async messenger. I did the following steps:
>
> 1. ulimit -c unlimited (to enable core dumps)
> 2. fio -v reports 2.1.13. Run fio rbd.fio, where the rbd.fio config is:
>
> [global]
> ioengine=rbd
> clientname=admin
> pool=rbd
> rbdname=test_img1
> invalidate=0 # mandatory
> rw=randwrite
> bs=4k
> runtime=10m
> time_based
>
> [rbd_iodepth32]
> iodepth=32
> numjobs=1
>
> 3.
Got this fio crash:
>
> /mnt/ceph_src/rpmbuild/BUILD/ceph-11.0.2-1554-g19ca7fd/src/log/SubsystemMap.h: In function 'bool ceph::logging::SubsystemMap::should_gather(unsigned int, int)' thread 7fffd3fff700 time 2016-11-18 11:51:44.411997
> /mnt/ceph_src/rpmbuild/BUILD/ceph-11.0.2-1554-g19ca7fd/src/log/SubsystemMap.h: 62: FAILED assert(sub < m_subsys.size())
> ceph version 11.0.2-1554-g19ca7fd (19ca7fd92bb8813dcabcc57518932b3dbb553d4b)
> 1: (()+0x15ccd5) [0x7fffe6d9ccd5]
> 2: (()+0x75582) [0x7fffe6cb5582]
> 3: (()+0x3b7b07) [0x7fffe6ff7b07]
> 4: (()+0x215c36) [0x7fffe6e55c36]
> 5: (()+0x201b51) [0x7fffe6e41b51]
> 6: (()+0x1f93f4) [0x7fffe6e393f4]
> 7: (()+0x1e7035) [0x7fffe6e27035]
> 8: (()+0x1e733a) [0x7fffe6e2733a]
> 9: (librados::RadosClient::connect()+0x96) [0x7fffe6d0bbd6]
> 10: (rados_connect()+0x20) [0x7fffe6cbf2d0]
> 11: /usr/local/bin/fio() [0x45b579]
> 12: (td_io_init()+0x1b) [0x40d70b]
> 13: /usr/local/bin/fio() [0x449eb3]
> 14: (()+0x7dc5) [0x7fffe5ac9dc5]
> 15: (clone()+0x6d) [0x7fffe55f2ced]
> NOTE: a copy of the executable, or `objdump -rdS ` is needed to interpret this.
>
> 4. Run gdb on the core:
>
> gdb $(which fio) core.3860
> >>> thread apply all bt
> >>> run
>
> And got this bt:
> ...
> Thread 5 (Thread 0x7f1f54491880 (LWP 3860)):
> #0 0x00007f1f41a84efd in nanosleep () from /lib64/libc.so.6
> #1 0x00007f1f41ab5b34 in usleep () from /lib64/libc.so.6
> #2 0x000000000044c26f in do_usleep (usecs=10000) at backend.c:1727
> #3 run_threads () at backend.c:1965
> #4 0x000000000044c7ed in fio_backend () at backend.c:2068
> #5 0x00007f1f419e8b15 in __libc_start_main () from /lib64/libc.so.6
> #6 0x000000000040b8ad in _start ()
>
> Thread 4 (Thread 0x7f1f19ffb700 (LWP 3882)):
> #0 0x00007f1f41f986d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
> #1 0x00007f1f4326b54b in ceph::logging::Log::entry (this=0x7f1f0802b4d0) at /mnt/ceph_src/rpmbuild/BUILD/ceph-11.0.2-1554-g19ca7fd/src/log/Log.cc:451
> #2 0x00007f1f41f94dc5 in start_thread () from /lib64/libpthread.so.0
> #3 0x00007f1f41abdced in clone () from /lib64/libc.so.6
>
> Thread 3 (Thread 0x7f1f037fe700 (LWP 3883)):
> #0 0x00007f1f41f98a82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
> #1 0x00007f1f43395dca in WaitUntil (when=..., mutex=..., this=0x7f1f0807a460) at /mnt/ceph_src/rpmbuild/BUILD/ceph-11.0.2-1554-g19ca7fd/src/common/Cond.h:72
> #2 WaitInterval (interval=..., mutex=..., cct=, this=0x7f1f0807a460) at /mnt/ceph_src/rpmbuild/BUILD/ceph-11.0.2-1554-g19ca7fd/src/common/Cond.h:81
> #3 CephContextServiceThread::entry (this=0x7f1f0807a3e0) at /mnt/ceph_src/rpmbuild/BUILD/ceph-11.0.2-1554-g19ca7fd/src/common/ceph_context.cc:149
> #4 0x00007f1f41f94dc5 in start_thread () from /lib64/libpthread.so.0
> #5 0x00007f1f41abdced in clone () from /lib64/libc.so.6
>
> Thread 2 (Thread 0x7f1f34db5700 (LWP 3861)):
> #0 0x00007f1f41a84efd in nanosleep () from /lib64/libc.so.6
> #1 0x00007f1f41ab5b34 in usleep () from /lib64/libc.so.6
> #2 0x0000000000448500 in disk_thread_main (data=) at backend.c:1992
> #3 0x00007f1f41f94dc5 in start_thread () from /lib64/libpthread.so.0
> #4 0x00007f1f41abdced in clone () from /lib64/libc.so.6
>
> Thread 1 (Thread 0x7f1f345b4700 (LWP 3881)):
> #0 0x00007f1f419fc5f7 in raise () from /lib64/libc.so.6
> #1 0x00007f1f419fdce8 in abort () from /lib64/libc.so.6
> #2 0x00007f1f43267eb7 in ceph::__ceph_assert_fail (assertion=assertion@entry=0x7f1f4351d090 "sub < m_subsys.size()",
>     file=file@entry=0x7f1f4351cd48 "/mnt/ceph_src/rpmbuild/BUILD/ceph-11.0.2-1554-g19ca7fd/src/log/SubsystemMap.h", line=line@entry=62,
>     func=func@entry=0x7f1f4355f800 <_ZZN4ceph7logging12SubsystemMap13should_gatherEjiE19__PRETTY_FUNCTION__> "bool ceph::logging::SubsystemMap::should_gather(unsigned int, int)")
>     at /mnt/ceph_src/rpmbuild/BUILD/ceph-11.0.2-1554-g19ca7fd/src/common/assert.cc:78
> #3 0x00007f1f43180582 in ceph::logging::SubsystemMap::should_gather (level=20, sub=27, this=) at /mnt/ceph_src/rpmbuild/BUILD/ceph-11.0.2-1554-g19ca7fd/src/log/SubsystemMap.h:62
> #4 0x00007f1f434c2b07 in should_gather (level=20, sub=27, this=) at /mnt/ceph_src/rpmbuild/BUILD/ceph-11.0.2-1554-g19ca7fd/src/msg/async/rdma/Infiniband.cc:317
> #5 Infiniband::create_comp_channel (this=0xd43430) at /mnt/ceph_src/rpmbuild/BUILD/ceph-11.0.2-1554-g19ca7fd/src/msg/async/rdma/Infiniband.cc:310
> #6 0x00007f1f43320c36 in RDMADispatcher (s=0x7f1f0807c2a8, i=, c=0x7f1f08026f60, this=0x7f1f08102bb0) at /mnt/ceph_src/rpmbuild/BUILD/ceph-11.0.2-1554-g19ca7fd/src/msg/async/rdma/RDMAStack.h:90
> #7 RDMAStack::RDMAStack (this=0x7f1f0807c2a8, cct=0x7f1f08026f60, t=...) at /mnt/ceph_src/rpmbuild/BUILD/ceph-11.0.2-1554-g19ca7fd/src/msg/async/rdma/RDMAStack.cc:66
> #8 0x00007f1f4330cb51 in construct, std::allocator > const&> (__p=0x7f1f0807c2a8, this=) at /usr/include/c++/4.8.2/ext/new_allocator.h:120
> #9 _S_construct, std::allocator > const&> (__p=0x7f1f0807c2a8, __a=...) at /usr/include/c++/4.8.2/bits/alloc_traits.h:254
> #10 construct, std::allocator > const&> (__p=0x7f1f0807c2a8, __a=...) at /usr/include/c++/4.8.2/bits/alloc_traits.h:393
> #11 _Sp_counted_ptr_inplace, std::allocator > const&> (__a=..., this=0x7f1f0807c290) at /usr/include/c++/4.8.2/bits/shared_ptr_base.h:399
> #12 construct, (__gnu_cxx::_Lock_policy)2>, std::allocator const, CephContext*&, std::basic_string, std::allocator > const&> (__p=, this=) at /usr/include/c++/4.8.2/ext/new_allocator.h:120
> #13 _S_construct, (__gnu_cxx::_Lock_policy)2>, std::allocator const, CephContext*&, std::basic_string, std::allocator > const&> (__p=, __a=) at /usr/include/c++/4.8.2/bits/alloc_traits.h:254
> #14 construct, (__gnu_cxx::_Lock_policy)2>, std::allocator const, CephContext*&, std::basic_string, std::allocator > const&> (__p=, __a=) at /usr/include/c++/4.8.2/bits/alloc_traits.h:393
> #15 __shared_count, CephContext*&, std::basic_string, std::allocator > const&> (__a=..., this=) at /usr/include/c++/4.8.2/bits/shared_ptr_base.h:502
> #16 __shared_ptr, CephContext*&, std::basic_string, std::allocator > const&> (__a=..., __tag=..., this=) at /usr/include/c++/4.8.2/bits/shared_ptr_base.h:957
> #17 shared_ptr, CephContext*&, std::basic_string, std::allocator > const&> (__a=..., __tag=..., this=) at /usr/include/c++/4.8.2/bits/shared_ptr.h:316
> #18 allocate_shared, CephContext*&, std::basic_string, std::allocator > const&> (__a=...) at /usr/include/c++/4.8.2/bits/shared_ptr.h:598
> #19 make_shared, std::allocator > const&> () at /usr/include/c++/4.8.2/bits/shared_ptr.h:614
> #20 NetworkStack::create (c=c@entry=0x7f1f08026f60, t="rdma") at /mnt/ceph_src/rpmbuild/BUILD/ceph-11.0.2-1554-g19ca7fd/src/msg/async/Stack.cc:66
> #21 0x00007f1f433043f4 in StackSingleton (c=0x7f1f08026f60, this=0x7f1f0807abd0) at /mnt/ceph_src/rpmbuild/BUILD/ceph-11.0.2-1554-g19ca7fd/src/msg/async/AsyncMessenger.cc:244
> #22 lookup_or_create_singleton_object (name="AsyncMessenger::NetworkStack", p=, this=0x7f1f08026f60) at /mnt/ceph_src/rpmbuild/BUILD/ceph-11.0.2-1554-g19ca7fd/src/common/ceph_context.h:134
> #23 AsyncMessenger::AsyncMessenger (this=0x7f1f0807afd0, cct=0x7f1f08026f60, name=..., mname=..., _nonce=7528509425877766185) at /mnt/ceph_src/rpmbuild/BUILD/ceph-11.0.2-1554-g19ca7fd/src/msg/async/AsyncMessenger.cc:278
> #24 0x00007f1f432f2035 in Messenger::create (cct=cct@entry=0x7f1f08026f60, type="async", name=..., lname="radosclient", nonce=nonce@entry=7528509425877766185, cflags=cflags@entry=0) at /mnt/ceph_src/rpmbuild/BUILD/ceph-11.0.2-1554-g19ca7fd/src/msg/Messenger.cc:40
> #25 0x00007f1f432f233a in Messenger::create_client_messenger (cct=0x7f1f08026f60, lname="radosclient") at /mnt/ceph_src/rpmbuild/BUILD/ceph-11.0.2-1554-g19ca7fd/src/msg/Messenger.cc:20
> #26 0x00007f1f431d6bd6 in librados::RadosClient::connect (this=this@entry=0x7f1f0802ed00) at /mnt/ceph_src/rpmbuild/BUILD/ceph-11.0.2-1554-g19ca7fd/src/librados/RadosClient.cc:245
> #27 0x00007f1f4318a2d0 in rados_connect (cluster=0x7f1f0802ed00) at /mnt/ceph_src/rpmbuild/BUILD/ceph-11.0.2-1554-g19ca7fd/src/librados/librados.cc:2771
> #28 0x000000000045b579 in _fio_rbd_connect (td=) at engines/rbd.c:113
> #29 fio_rbd_init (td=) at engines/rbd.c:337
> #30 0x000000000040d70b in td_io_init (td=td@entry=0x7f1f34db6000) at ioengines.c:369
> #31 0x0000000000449eb3 in thread_main (data=0x7f1f34db6000) at backend.c:1433
> #32 0x00007f1f41f94dc5 in start_thread () from /lib64/libpthread.so.0
> #33 0x00007f1f41abdced in clone () from /lib64/libc.so.6
>
> Hope it'll help. If you need the core dump and fio binary I can send them. Maybe this problem relates to an old fio version? (though I don't think so)
>
> Best regards
> Alex
> ________________________________________
>
> hi Marov,
>
> Another person also met this problem when using rdma, but it works fine for me,
> so please give more info to help figure it out.
>
> On Thu, Nov 17, 2016 at 10:49 PM, Sage Weil wrote:
>> [adding ceph-devel]
>>
>> On Thu, 17 Nov 2016, Marov Aleksey wrote:
>>> Hello Sage
>>>
>>> My name is Alex. I need some help resolving an issue with ceph. I have
>>> been testing ceph with the rdma messenger and I got an error:
>>>
>>> src/log/SubsystemMap.h: 62: FAILED assert(sub < m_subsys.size())
>>>
>>> I have no idea what it means. I noticed that you were the last one who
>>> committed to SubsystemMap.h, so I think you have some understanding of this
>>> condition in the assert:
>>>
>>> bool should_gather(unsigned sub, int level) {
>>>   assert(sub < m_subsys.size());
>>>   return level <= m_subsys[sub].gather_level ||
>>>          level <= m_subsys[sub].log_level;
>>> }
>>>
>>> This error occurs only when I use the fio benchmark to test rbd. When I use "rbd
>>> bench-write ..." it is ok. But fio is much more flexible. In any case I
>>> think it is not good to hit any assert.
>>>
>>> Can you explain this to me please, or give a hint where to investigate my
>>> trouble?
>>
>> Can you generate a core file, and then use gdb to capture the output of
>> 'thread apply all bt'?
>>
>> Thanks-
>> sage