* Re: ceph issue

From: Sage Weil
Date: 2016-11-17 14:49 UTC
To: Marov Aleksey
Cc: ceph-devel

[adding ceph-devel]

On Thu, 17 Nov 2016, Marov Aleksey wrote:
> Hello Sage
>
> My name is Alex. I need some help resolving an issue with Ceph. I have
> been testing Ceph with the RDMA messenger and I got this error:
>
>   src/log/SubsystemMap.h: 62: FAILED assert(sub < m_subsys.size())
>
> I have no idea what it means. I noticed that you were the last one who
> committed to SubsystemMap.h, so I think you have some understanding of
> the condition in this assert:
>
>   bool should_gather(unsigned sub, int level) {
>     assert(sub < m_subsys.size());
>     return level <= m_subsys[sub].gather_level ||
>            level <= m_subsys[sub].log_level;
>   }
>
> This error occurs only when I use the fio benchmark to test rbd. When I
> use "rbd bench-write ..." it is fine, but fio is much more flexible. In
> any case I don't think it is good to hit any assert.
>
> Can you explain this to me, please, or give a hint on where to
> investigate my trouble?

Can you generate a core file, and then use gdb to capture the output of
'thread apply all bt'?

Thanks-
sage
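The assert that fires here is a plain bounds check: every logging call carries a subsystem id that indexes into the context's per-subsystem table. A minimal, self-contained C++ sketch of that contract (illustrative names only, not Ceph's actual class) shows why an unregistered id aborts:

```cpp
#include <cassert>
#include <vector>

// Stand-in for ceph::logging::SubsystemMap (illustrative only).  Each
// logging subsystem gets one entry in m_subsys; should_gather() is called
// with a compile-time subsystem id, so the assert fires whenever a caller
// logs against an id that was never registered in this particular map.
struct Subsystem {
  int log_level = 0;
  int gather_level = 0;
};

class SubsystemMapSketch {
  std::vector<Subsystem> m_subsys;

public:
  void add_subsys(int log_level, int gather_level) {
    m_subsys.push_back({log_level, gather_level});
  }

  // True iff the id has an entry, i.e. the assert would NOT fire.
  bool in_range(unsigned sub) const { return sub < m_subsys.size(); }

  // Mirrors the real should_gather(): an out-of-range id aborts.
  bool should_gather(unsigned sub, int level) const {
    assert(sub < m_subsys.size());
    return level <= m_subsys[sub].gather_level ||
           level <= m_subsys[sub].log_level;
  }
};
```

The backtraces later in the thread show the failing calls passing sub=27, which means the map those call sites consulted held fewer than 28 entries at that moment.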
* Re: ceph issue

From: Haomai Wang
Date: 2016-11-18 7:19 UTC
To: Sage Weil
Cc: Marov Aleksey, ceph-devel

hi Marov,

Someone else also hit this problem when using RDMA, but it works for me,
so please share more info to help figure it out.

On Thu, Nov 17, 2016 at 10:49 PM, Sage Weil <sweil@redhat.com> wrote:
> [adding ceph-devel]
>
> On Thu, 17 Nov 2016, Marov Aleksey wrote:
>> I have been testing Ceph with the RDMA messenger and I got this error:
>>
>>   src/log/SubsystemMap.h: 62: FAILED assert(sub < m_subsys.size())
>>
>> [...]
>
> Can you generate a core file, and then use gdb to capture the output of
> 'thread apply all bt'?
>
> Thanks-
> sage
* HA: ceph issue

From: Marov Aleksey
Date: 2016-11-18 9:23 UTC
To: Haomai Wang, Sage Weil
Cc: ceph-devel

I use Ceph with the rdma/async messenger. I took the following steps:

1. Enabled core dumps:

   ulimit -c unlimited

2. fio -v: 2.1.13. Ran fio rbd.fio, where the rbd.fio config is:

   [global]
   ioengine=rbd
   clientname=admin
   pool=rbd
   rbdname=test_img1
   invalidate=0    # mandatory
   rw=randwrite
   bs=4k
   runtime=10m
   time_based

   [rbd_iodepth32]
   iodepth=32
   numjobs=1

3. Got this fio crash:

   /mnt/ceph_src/rpmbuild/BUILD/ceph-11.0.2-1554-g19ca7fd/src/log/SubsystemMap.h: In function 'bool ceph::logging::SubsystemMap::should_gather(unsigned int, int)' thread 7fffd3fff700 time 2016-11-18 11:51:44.411997
   /mnt/ceph_src/rpmbuild/BUILD/ceph-11.0.2-1554-g19ca7fd/src/log/SubsystemMap.h: 62: FAILED assert(sub < m_subsys.size())
   ceph version 11.0.2-1554-g19ca7fd (19ca7fd92bb8813dcabcc57518932b3dbb553d4b)
   1: (()+0x15ccd5) [0x7fffe6d9ccd5]
   2: (()+0x75582) [0x7fffe6cb5582]
   3: (()+0x3b7b07) [0x7fffe6ff7b07]
   4: (()+0x215c36) [0x7fffe6e55c36]
   5: (()+0x201b51) [0x7fffe6e41b51]
   6: (()+0x1f93f4) [0x7fffe6e393f4]
   7: (()+0x1e7035) [0x7fffe6e27035]
   8: (()+0x1e733a) [0x7fffe6e2733a]
   9: (librados::RadosClient::connect()+0x96) [0x7fffe6d0bbd6]
   10: (rados_connect()+0x20) [0x7fffe6cbf2d0]
   11: /usr/local/bin/fio() [0x45b579]
   12: (td_io_init()+0x1b) [0x40d70b]
   13: /usr/local/bin/fio() [0x449eb3]
   14: (()+0x7dc5) [0x7fffe5ac9dc5]
   15: (clone()+0x6d) [0x7fffe55f2ced]
   NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

4. Ran gdb on the core:

   gdb $(which fio) core.3860
   (gdb) thread apply all bt

And got this backtrace:

...
Thread 5 (Thread 0x7f1f54491880 (LWP 3860)):
#0  0x00007f1f41a84efd in nanosleep () from /lib64/libc.so.6
#1  0x00007f1f41ab5b34 in usleep () from /lib64/libc.so.6
#2  0x000000000044c26f in do_usleep (usecs=10000) at backend.c:1727
#3  run_threads () at backend.c:1965
#4  0x000000000044c7ed in fio_backend () at backend.c:2068
#5  0x00007f1f419e8b15 in __libc_start_main () from /lib64/libc.so.6
#6  0x000000000040b8ad in _start ()

Thread 4 (Thread 0x7f1f19ffb700 (LWP 3882)):
#0  0x00007f1f41f986d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f1f4326b54b in ceph::logging::Log::entry (this=0x7f1f0802b4d0) at /mnt/ceph_src/rpmbuild/BUILD/ceph-11.0.2-1554-g19ca7fd/src/log/Log.cc:451
#2  0x00007f1f41f94dc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f1f41abdced in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7f1f037fe700 (LWP 3883)):
#0  0x00007f1f41f98a82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f1f43395dca in WaitUntil (when=..., mutex=..., this=0x7f1f0807a460) at /mnt/ceph_src/rpmbuild/BUILD/ceph-11.0.2-1554-g19ca7fd/src/common/Cond.h:72
#2  WaitInterval (interval=..., mutex=..., cct=<optimized out>, this=0x7f1f0807a460) at /mnt/ceph_src/rpmbuild/BUILD/ceph-11.0.2-1554-g19ca7fd/src/common/Cond.h:81
#3  CephContextServiceThread::entry (this=0x7f1f0807a3e0) at /mnt/ceph_src/rpmbuild/BUILD/ceph-11.0.2-1554-g19ca7fd/src/common/ceph_context.cc:149
#4  0x00007f1f41f94dc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f1f41abdced in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7f1f34db5700 (LWP 3861)):
#0  0x00007f1f41a84efd in nanosleep () from /lib64/libc.so.6
#1  0x00007f1f41ab5b34 in usleep () from /lib64/libc.so.6
#2  0x0000000000448500 in disk_thread_main (data=<optimized out>) at backend.c:1992
#3  0x00007f1f41f94dc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f1f41abdced in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7f1f345b4700 (LWP 3881)):
#0  0x00007f1f419fc5f7 in raise () from /lib64/libc.so.6
#1  0x00007f1f419fdce8 in abort () from /lib64/libc.so.6
#2  0x00007f1f43267eb7 in ceph::__ceph_assert_fail (assertion=assertion@entry=0x7f1f4351d090 "sub < m_subsys.size()",
    file=file@entry=0x7f1f4351cd48 "/mnt/ceph_src/rpmbuild/BUILD/ceph-11.0.2-1554-g19ca7fd/src/log/SubsystemMap.h", line=line@entry=62,
    func=func@entry=0x7f1f4355f800 <_ZZN4ceph7logging12SubsystemMap13should_gatherEjiE19__PRETTY_FUNCTION__> "bool ceph::logging::SubsystemMap::should_gather(unsigned int, int)")
    at /mnt/ceph_src/rpmbuild/BUILD/ceph-11.0.2-1554-g19ca7fd/src/common/assert.cc:78
#3  0x00007f1f43180582 in ceph::logging::SubsystemMap::should_gather (level=20, sub=27, this=<optimized out>) at /mnt/ceph_src/rpmbuild/BUILD/ceph-11.0.2-1554-g19ca7fd/src/log/SubsystemMap.h:62
#4  0x00007f1f434c2b07 in should_gather (level=20, sub=27, this=<optimized out>) at /mnt/ceph_src/rpmbuild/BUILD/ceph-11.0.2-1554-g19ca7fd/src/msg/async/rdma/Infiniband.cc:317
#5  Infiniband::create_comp_channel (this=0xd43430) at /mnt/ceph_src/rpmbuild/BUILD/ceph-11.0.2-1554-g19ca7fd/src/msg/async/rdma/Infiniband.cc:310
#6  0x00007f1f43320c36 in RDMADispatcher (s=0x7f1f0807c2a8, i=<optimized out>, c=0x7f1f08026f60, this=0x7f1f08102bb0) at /mnt/ceph_src/rpmbuild/BUILD/ceph-11.0.2-1554-g19ca7fd/src/msg/async/rdma/RDMAStack.h:90
#7  RDMAStack::RDMAStack (this=0x7f1f0807c2a8, cct=0x7f1f08026f60, t=...) at /mnt/ceph_src/rpmbuild/BUILD/ceph-11.0.2-1554-g19ca7fd/src/msg/async/rdma/RDMAStack.cc:66
#8  0x00007f1f4330cb51 in construct<RDMAStack, CephContext*&, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&> (__p=0x7f1f0807c2a8, this=<optimized out>) at /usr/include/c++/4.8.2/ext/new_allocator.h:120
#9  _S_construct<RDMAStack, CephContext*&, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&> (__p=0x7f1f0807c2a8, __a=...) at /usr/include/c++/4.8.2/bits/alloc_traits.h:254
#10 construct<RDMAStack, CephContext*&, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&> (__p=0x7f1f0807c2a8, __a=...) at /usr/include/c++/4.8.2/bits/alloc_traits.h:393
#11 _Sp_counted_ptr_inplace<CephContext*&, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&> (__a=..., this=0x7f1f0807c290) at /usr/include/c++/4.8.2/bits/shared_ptr_base.h:399
#12 construct<std::_Sp_counted_ptr_inplace<RDMAStack, std::allocator<RDMAStack>, (__gnu_cxx::_Lock_policy)2>, std::allocator<RDMAStack> const, CephContext*&, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&> (__p=<optimized out>, this=<synthetic pointer>) at /usr/include/c++/4.8.2/ext/new_allocator.h:120
#13 _S_construct<std::_Sp_counted_ptr_inplace<RDMAStack, std::allocator<RDMAStack>, (__gnu_cxx::_Lock_policy)2>, std::allocator<RDMAStack> const, CephContext*&, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&> (__p=<optimized out>, __a=<synthetic pointer>) at /usr/include/c++/4.8.2/bits/alloc_traits.h:254
#14 construct<std::_Sp_counted_ptr_inplace<RDMAStack, std::allocator<RDMAStack>, (__gnu_cxx::_Lock_policy)2>, std::allocator<RDMAStack> const, CephContext*&, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&> (__p=<optimized out>, __a=<synthetic pointer>) at /usr/include/c++/4.8.2/bits/alloc_traits.h:393
#15 __shared_count<RDMAStack, std::allocator<RDMAStack>, CephContext*&, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&> (__a=..., this=<optimized out>) at /usr/include/c++/4.8.2/bits/shared_ptr_base.h:502
#16 __shared_ptr<std::allocator<RDMAStack>, CephContext*&, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&> (__a=..., __tag=..., this=<optimized out>) at /usr/include/c++/4.8.2/bits/shared_ptr_base.h:957
#17 shared_ptr<std::allocator<RDMAStack>, CephContext*&, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&> (__a=..., __tag=..., this=<optimized out>) at /usr/include/c++/4.8.2/bits/shared_ptr.h:316
#18 allocate_shared<RDMAStack, std::allocator<RDMAStack>, CephContext*&, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&> (__a=...) at /usr/include/c++/4.8.2/bits/shared_ptr.h:598
#19 make_shared<RDMAStack, CephContext*&, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&> () at /usr/include/c++/4.8.2/bits/shared_ptr.h:614
#20 NetworkStack::create (c=c@entry=0x7f1f08026f60, t="rdma") at /mnt/ceph_src/rpmbuild/BUILD/ceph-11.0.2-1554-g19ca7fd/src/msg/async/Stack.cc:66
#21 0x00007f1f433043f4 in StackSingleton (c=0x7f1f08026f60, this=0x7f1f0807abd0) at /mnt/ceph_src/rpmbuild/BUILD/ceph-11.0.2-1554-g19ca7fd/src/msg/async/AsyncMessenger.cc:244
#22 lookup_or_create_singleton_object<StackSingleton> (name="AsyncMessenger::NetworkStack", p=<synthetic pointer>, this=0x7f1f08026f60) at /mnt/ceph_src/rpmbuild/BUILD/ceph-11.0.2-1554-g19ca7fd/src/common/ceph_context.h:134
#23 AsyncMessenger::AsyncMessenger (this=0x7f1f0807afd0, cct=0x7f1f08026f60, name=..., mname=..., _nonce=7528509425877766185) at /mnt/ceph_src/rpmbuild/BUILD/ceph-11.0.2-1554-g19ca7fd/src/msg/async/AsyncMessenger.cc:278
#24 0x00007f1f432f2035 in Messenger::create (cct=cct@entry=0x7f1f08026f60, type="async", name=..., lname="radosclient", nonce=nonce@entry=7528509425877766185, cflags=cflags@entry=0) at /mnt/ceph_src/rpmbuild/BUILD/ceph-11.0.2-1554-g19ca7fd/src/msg/Messenger.cc:40
#25 0x00007f1f432f233a in Messenger::create_client_messenger (cct=0x7f1f08026f60, lname="radosclient") at /mnt/ceph_src/rpmbuild/BUILD/ceph-11.0.2-1554-g19ca7fd/src/msg/Messenger.cc:20
#26 0x00007f1f431d6bd6 in librados::RadosClient::connect (this=this@entry=0x7f1f0802ed00) at /mnt/ceph_src/rpmbuild/BUILD/ceph-11.0.2-1554-g19ca7fd/src/librados/RadosClient.cc:245
#27 0x00007f1f4318a2d0 in rados_connect (cluster=0x7f1f0802ed00) at /mnt/ceph_src/rpmbuild/BUILD/ceph-11.0.2-1554-g19ca7fd/src/librados/librados.cc:2771
#28 0x000000000045b579 in _fio_rbd_connect (td=<optimized out>) at engines/rbd.c:113
#29 fio_rbd_init (td=<optimized out>) at engines/rbd.c:337
#30 0x000000000040d70b in td_io_init (td=td@entry=0x7f1f34db6000) at ioengines.c:369
#31 0x0000000000449eb3 in thread_main (data=0x7f1f34db6000) at backend.c:1433
#32 0x00007f1f41f94dc5 in start_thread () from /lib64/libpthread.so.0
#33 0x00007f1f41abdced in clone () from /lib64/libc.so.6

Hope it'll help. If you need the core dump and fio binary, I can send them.
Maybe this problem relates to the old fio version? (though I don't think so)

Best regards
Alex
________________________________________

hi Marov,

Someone else also hit this problem when using RDMA, but it works for me,
so please share more info to help figure it out.

On Thu, Nov 17, 2016 at 10:49 PM, Sage Weil <sweil@redhat.com> wrote:
> [adding ceph-devel]
>
> [...]
>
> Can you generate a core file, and then use gdb to capture the output of
> 'thread apply all bt'?
>
> Thanks-
> sage
* Re: ceph issue

From: Haomai Wang
Date: 2016-11-18 11:26 UTC
To: Marov Aleksey
Cc: Sage Weil, ceph-devel

sorry, I got the issue. I submitted a PR
(https://github.com/ceph/ceph/pull/12068). Please test with it.

On Fri, Nov 18, 2016 at 5:23 PM, Marov Aleksey <Marov.A@raidix.com> wrote:
> I use Ceph with the rdma/async messenger. I took the following steps:
>
> 1. Enabled core dumps: ulimit -c unlimited
> 2. fio -v: 2.1.13. Ran fio rbd.fio.
> 3. Got this fio crash:
>
>    /mnt/ceph_src/rpmbuild/BUILD/ceph-11.0.2-1554-g19ca7fd/src/log/SubsystemMap.h: 62: FAILED assert(sub < m_subsys.size())
>    ceph version 11.0.2-1554-g19ca7fd (19ca7fd92bb8813dcabcc57518932b3dbb553d4b)
>
> 4. Ran gdb on the core and captured 'thread apply all bt'.
>
> [...]
>
> Hope it'll help. If you need the core dump and fio binary, I can send them.
> Maybe this problem relates to the old fio version? (though I don't think so)
>
> Best regards
> Alex
* RE: ceph issue

From: Avner Ben Hanoch
Date: 2016-11-20 13:21 UTC
To: Haomai Wang, Marov Aleksey
Cc: Sage Weil, ceph-devel

This PR doesn't have any effect on the assertion. I still get it in the
same situation.

---
$ ./fio --ioengine=rbd --invalidate=0 --rw=write --bs=10M --numjobs=1 --clientname=admin --pool=rbd --iodepth=128 --rbdname=img2g --name=1
1: (g=0): rw=write, bs=10M-10M/10M-10M/10M-10M, ioengine=rbd, iodepth=128
fio-2.13-91-gb678
Starting 1 process
rbd engine: RBD version: 0.1.11
/mnt/data/avnerb/rpmbuild/BUILD/ceph-11.0.2-1611-geb25965/src/log/SubsystemMap.h: In function 'bool ceph::logging::SubsystemMap::should_gather(unsigned int, int)' thread 7f7c7b3a5700 time 2016-11-20 13:17:56.090289
/mnt/data/avnerb/rpmbuild/BUILD/ceph-11.0.2-1611-geb25965/src/log/SubsystemMap.h: 62: FAILED assert(sub < m_subsys.size())
ceph version 11.0.2-1611-geb25965 (eb25965b74aa1a0379d091169d80786f30c72a8b)
---

> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Haomai Wang
> Subject: Re: ceph issue
>
> sorry, I got the issue. I submitted a PR
> (https://github.com/ceph/ceph/pull/12068). Please test with it.
* RE: ceph issue

From: Avner Ben Hanoch
Date: 2016-11-20 14:29 UTC
To: Haomai Wang, Marov Aleksey
Cc: Sage Weil, ceph-devel

Perhaps a similar fix is needed in additional places. See my stack trace
below (it failed on the same assert(sub < m_subsys.size())).

--
#0  0x00007fffe55525f7 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x00007fffe5553ce8 in __GI_abort () at abort.c:90
#2  0x00007fffe6dbbd47 in ceph::__ceph_assert_fail (assertion=assertion@entry=0x7fffe70599d8 "sub < m_subsys.size()",
    file=file@entry=0x7fffe7059688 "/mnt/data/avnerb/rpmbuild/BUILD/ceph-11.0.2-1611-geb25965/src/log/SubsystemMap.h", line=line@entry=62,
    func=func@entry=0x7fffe7074040 <_ZZN4ceph7logging12SubsystemMap13should_gatherEjiE19__PRETTY_FUNCTION__> "bool ceph::logging::SubsystemMap::should_gather(unsigned int, int)")
    at /usr/src/debug/ceph-11.0.2-1611-geb25965/src/common/assert.cc:78
#3  0x00007fffe6cd215a in ceph::logging::SubsystemMap::should_gather (level=10, sub=27, this=<optimized out>) at /usr/src/debug/ceph-11.0.2-1611-geb25965/src/log/SubsystemMap.h:62
#4  0x00007fffe6e65865 in should_gather (level=10, sub=27, this=<optimized out>) at /usr/src/debug/ceph-11.0.2-1611-geb25965/src/msg/async/net_handler.cc:180
#5  ceph::NetHandler::generic_connect (this=0x86dc18, addr=..., nonblock=nonblock@entry=false) at /usr/src/debug/ceph-11.0.2-1611-geb25965/src/msg/async/net_handler.cc:174
#6  0x00007fffe6e65b17 in ceph::NetHandler::connect (this=<optimized out>, addr=...) at /usr/src/debug/ceph-11.0.2-1611-geb25965/src/msg/async/net_handler.cc:198
#7  0x00007fffe700105c in RDMAConnectedSocketImpl::try_connect (this=this@entry=0x7fffbc000ef0, peer_addr=..., opts=...) at /usr/src/debug/ceph-11.0.2-1611-geb25965/src/msg/async/rdma/RDMAConnectedSocketImpl.cc:111
#8  0x00007fffe6e68ed4 in RDMAWorker::connect (this=0x7fffa806e650, addr=..., opts=..., socket=0x7fffa00235b0) at /usr/src/debug/ceph-11.0.2-1611-geb25965/src/msg/async/rdma/RDMAStack.cc:48
#9  0x00007fffe6fee873 in AsyncConnection::_process_connection (this=this@entry=0x7fffa0023450) at /usr/src/debug/ceph-11.0.2-1611-geb25965/src/msg/async/AsyncConnection.cc:864
#10 0x00007fffe6ff5148 in AsyncConnection::process (this=0x7fffa0023450) at /usr/src/debug/ceph-11.0.2-1611-geb25965/src/msg/async/AsyncConnection.cc:812
#11 0x00007fffe6e5d6ac in EventCenter::process_events (this=this@entry=0x7fffa806e6d0, timeout_microseconds=<optimized out>, timeout_microseconds@entry=30000000) at /usr/src/debug/ceph-11.0.2-1611-geb25965/src/msg/async/Event.cc:430
#12 0x00007fffe6e5fbba in NetworkStack::__lambda1::operator() (__closure=0x7fffa80f5630) at /usr/src/debug/ceph-11.0.2-1611-geb25965/src/msg/async/Stack.cc:47
#13 0x00007fffe3e71220 in std::(anonymous namespace)::execute_native_thread_routine (__p=<optimized out>) at ../../../../../libstdc++-v3/src/c++11/thread.cc:84
#14 0x00007fffe5ae9dc5 in start_thread (arg=0x7fffcbb93700) at pthread_create.c:308
#15 0x00007fffe561321d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

> -----Original Message-----
> From: Avner Ben Hanoch
> Sent: Sunday, November 20, 2016 15:22
> To: 'Haomai Wang' <haomai@xsky.com>; Marov Aleksey <Marov.A@raidix.com>
> Cc: Sage Weil <sweil@redhat.com>; ceph-devel@vger.kernel.org
> Subject: RE: ceph issue
>
> This PR doesn't have any effect on the assertion. I still get it in the
> same situation.
>
> [...]
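Both backtraces in this thread abort inside should_gather() with sub=27 while the RDMA stack is being constructed or is making its first connection. One way to picture this class of failure (the names below are hypothetical, not Ceph's) is a component that hard-codes its logging subsystem id and consults a per-context table that was never populated that far:

```cpp
#include <cassert>
#include <vector>

// Illustrative sketch of the initialization-order hazard the backtraces
// suggest: the RDMA code logs via a context whose per-subsystem logging
// table may not yet contain an entry for the RDMA subsystem id (27 in
// the traces above).  All names here are invented for illustration.
struct LogTable {
  std::vector<int> gather_levels;  // one entry per registered subsystem
  bool has(unsigned sub) const { return sub < gather_levels.size(); }
};

struct ContextSketch {
  LogTable log;  // normally filled in while the context is initialized
};

struct RdmaStackSketch {
  static constexpr unsigned kRdmaSubsys = 27;  // hard-coded logging id
  bool log_id_valid;

  explicit RdmaStackSketch(const ContextSketch& cct)
      // Logging from here is only safe if the table already covers the
      // id; in the real code an uncovered id trips the bounds assert.
      : log_id_valid(cct.log.has(kRdmaSubsys)) {}
};
```

Under this reading, the two crashes are the same bug reached from two call sites (Infiniband::create_comp_channel and NetHandler::generic_connect), which would explain why fixing one path still leaves the other asserting.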
* Re: ceph issue 2016-11-20 14:29 ` Avner Ben Hanoch @ 2016-11-21 10:40 ` Haomai Wang 2016-11-21 16:20 ` HA: " Marov Aleksey 0 siblings, 1 reply; 16+ messages in thread From: Haomai Wang @ 2016-11-21 10:40 UTC (permalink / raw) To: Avner Ben Hanoch; +Cc: Marov Aleksey, Sage Weil, ceph-devel @Avner plz try again, I submit a new patch to fix leaks. On Sun, Nov 20, 2016 at 10:29 PM, Avner Ben Hanoch <avnerb@mellanox.com> wrote: > Perhaps similar fix needed in additional places. > See my stack trace below (failed on same assert(sub < m_subsys.size())) > > -- > #0 0x00007fffe55525f7 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56 > #1 0x00007fffe5553ce8 in __GI_abort () at abort.c:90 > #2 0x00007fffe6dbbd47 in ceph::__ceph_assert_fail (assertion=assertion@entry=0x7fffe70599d8 "sub < m_subsys.size()", > file=file@entry=0x7fffe7059688 "/mnt/data/avnerb/rpmbuild/BUILD/ceph-11.0.2-1611-geb25965/src/log/SubsystemMap.h", line=line@entry=62, > func=func@entry=0x7fffe7074040 <_ZZN4ceph7logging12SubsystemMap13should_gatherEjiE19__PRETTY_FUNCTION__> "bool ceph::logging::SubsystemMap::should_gather(unsigned int, int)") > at /usr/src/debug/ceph-11.0.2-1611-geb25965/src/common/assert.cc:78 > #3 0x00007fffe6cd215a in ceph::logging::SubsystemMap::should_gather (level=10, sub=27, this=<optimized out>) at /usr/src/debug/ceph-11.0.2-1611-geb25965/src/log/SubsystemMap.h:62 > #4 0x00007fffe6e65865 in should_gather (level=10, sub=27, this=<optimized out>) at /usr/src/debug/ceph-11.0.2-1611-geb25965/src/msg/async/net_handler.cc:180 > #5 ceph::NetHandler::generic_connect (this=0x86dc18, addr=..., nonblock=nonblock@entry=false) at /usr/src/debug/ceph-11.0.2-1611-geb25965/src/msg/async/net_handler.cc:174 > #6 0x00007fffe6e65b17 in ceph::NetHandler::connect (this=<optimized out>, addr=...) 
at /usr/src/debug/ceph-11.0.2-1611-geb25965/src/msg/async/net_handler.cc:198 > #7 0x00007fffe700105c in RDMAConnectedSocketImpl::try_connect (this=this@entry=0x7fffbc000ef0, peer_addr=..., opts=...) at /usr/src/debug/ceph-11.0.2-1611-geb25965/src/msg/async/rdma/RDMAConnectedSocketImpl.cc:111 > #8 0x00007fffe6e68ed4 in RDMAWorker::connect (this=0x7fffa806e650, addr=..., opts=..., socket=0x7fffa00235b0) at /usr/src/debug/ceph-11.0.2-1611-geb25965/src/msg/async/rdma/RDMAStack.cc:48 > #9 0x00007fffe6fee873 in AsyncConnection::_process_connection (this=this@entry=0x7fffa0023450) at /usr/src/debug/ceph-11.0.2-1611-geb25965/src/msg/async/AsyncConnection.cc:864 > #10 0x00007fffe6ff5148 in AsyncConnection::process (this=0x7fffa0023450) at /usr/src/debug/ceph-11.0.2-1611-geb25965/src/msg/async/AsyncConnection.cc:812 > #11 0x00007fffe6e5d6ac in EventCenter::process_events (this=this@entry=0x7fffa806e6d0, timeout_microseconds=<optimized out>, timeout_microseconds@entry=30000000) > at /usr/src/debug/ceph-11.0.2-1611-geb25965/src/msg/async/Event.cc:430 > #12 0x00007fffe6e5fbba in NetworkStack::__lambda1::operator() (__closure=0x7fffa80f5630) at /usr/src/debug/ceph-11.0.2-1611-geb25965/src/msg/async/Stack.cc:47 > #13 0x00007fffe3e71220 in std::(anonymous namespace)::execute_native_thread_routine (__p=<optimized out>) at ../../../../../libstdc++-v3/src/c++11/thread.cc:84 > #14 0x00007fffe5ae9dc5 in start_thread (arg=0x7fffcbb93700) at pthread_create.c:308 > #15 0x00007fffe561321d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113 > > >> -----Original Message----- >> From: Avner Ben Hanoch >> Sent: Sunday, November 20, 2016 15:22 >> To: 'Haomai Wang' <haomai@xsky.com>; Marov Aleksey >> <Marov.A@raidix.com> >> Cc: Sage Weil <sweil@redhat.com>; ceph-devel@vger.kernel.org >> Subject: RE: ceph issue >> >> This PR doesn't have any effect on the assertion. 
I still get it in same situation >> >> --- >> $ ./fio --ioengine=rbd --invalidate=0 --rw=write --bs=10M --numjobs=1 -- >> clientname=admin --pool=rbd --iodepth=128 --rbdname=img2g --name=1 >> 1: (g=0): rw=write, bs=10M-10M/10M-10M/10M-10M, ioengine=rbd, >> iodepth=128 >> fio-2.13-91-gb678 >> Starting 1 process >> rbd engine: RBD version: 0.1.11 >> /mnt/data/avnerb/rpmbuild/BUILD/ceph-11.0.2-1611- >> geb25965/src/log/SubsystemMap.h: In function 'bool >> ceph::logging::SubsystemMap::should_gather(unsigned int, int)' thread >> 7f7c7b3a5700 time 2016-11-20 13:17:56.090289 >> /mnt/data/avnerb/rpmbuild/BUILD/ceph-11.0.2-1611- >> geb25965/src/log/SubsystemMap.h: 62: FAILED assert(sub < m_subsys.size()) >> ceph version 11.0.2-1611-geb25965 >> (eb25965b74aa1a0379d091169d80786f30c72a8b) >> --- >> >> > -----Original Message----- >> > From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel- >> > owner@vger.kernel.org] On Behalf Of Haomai Wang >> > Subject: Re: ceph issue >> > >> > sorry, I got the issue. I submitted a >> > pr(https://github.com/ceph/ceph/pull/12068). plz tested with this. ^ permalink raw reply [flat|nested] 16+ messages in thread
* HA: ceph issue
2016-11-21 10:40 ` Haomai Wang
@ 2016-11-21 16:20 ` Marov Aleksey
2016-11-22 14:41 ` Avner Ben Hanoch
0 siblings, 1 reply; 16+ messages in thread
From: Marov Aleksey @ 2016-11-21 16:20 UTC (permalink / raw)
To: Haomai Wang, Avner Ben Hanoch; +Cc: Sage Weil, ceph-devel

It seems to me that your last patch fixed the problem. It works fine with fio 2.13 and fio 2.15. I think it may be merged into master.

Thanks a lot for your work. I'll do some performance tests next.

Best Regards
Alex Marov
________________________________________

@Avner plz try again, I submit a new patch to fix leaks.

On Sun, Nov 20, 2016 at 10:29 PM, Avner Ben Hanoch <avnerb@mellanox.com> wrote:
> Perhaps similar fix needed in additional places.
> See my stack trace below (failed on same assert(sub < m_subsys.size()))
>
> --
> #0 0x00007fffe55525f7 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
> #1 0x00007fffe5553ce8 in __GI_abort () at abort.c:90
> #2 0x00007fffe6dbbd47 in ceph::__ceph_assert_fail (assertion=assertion@entry=0x7fffe70599d8 "sub < m_subsys.size()",
>     file=file@entry=0x7fffe7059688 "/mnt/data/avnerb/rpmbuild/BUILD/ceph-11.0.2-1611-geb25965/src/log/SubsystemMap.h", line=line@entry=62,
>     func=func@entry=0x7fffe7074040 <_ZZN4ceph7logging12SubsystemMap13should_gatherEjiE19__PRETTY_FUNCTION__> "bool ceph::logging::SubsystemMap::should_gather(unsigned int, int)")
>     at /usr/src/debug/ceph-11.0.2-1611-geb25965/src/common/assert.cc:78
> #3 0x00007fffe6cd215a in ceph::logging::SubsystemMap::should_gather (level=10, sub=27, this=<optimized out>) at /usr/src/debug/ceph-11.0.2-1611-geb25965/src/log/SubsystemMap.h:62
> #4 0x00007fffe6e65865 in should_gather (level=10, sub=27, this=<optimized out>) at /usr/src/debug/ceph-11.0.2-1611-geb25965/src/msg/async/net_handler.cc:180
> #5 ceph::NetHandler::generic_connect (this=0x86dc18, addr=..., nonblock=nonblock@entry=false) at /usr/src/debug/ceph-11.0.2-1611-geb25965/src/msg/async/net_handler.cc:174
> #6
0x00007fffe6e65b17 in ceph::NetHandler::connect (this=<optimized out>, addr=...) at /usr/src/debug/ceph-11.0.2-1611-geb25965/src/msg/async/net_handler.cc:198 > #7 0x00007fffe700105c in RDMAConnectedSocketImpl::try_connect (this=this@entry=0x7fffbc000ef0, peer_addr=..., opts=...) at /usr/src/debug/ceph-11.0.2-1611-geb25965/src/msg/async/rdma/RDMAConnectedSocketImpl.cc:111 > #8 0x00007fffe6e68ed4 in RDMAWorker::connect (this=0x7fffa806e650, addr=..., opts=..., socket=0x7fffa00235b0) at /usr/src/debug/ceph-11.0.2-1611-geb25965/src/msg/async/rdma/RDMAStack.cc:48 > #9 0x00007fffe6fee873 in AsyncConnection::_process_connection (this=this@entry=0x7fffa0023450) at /usr/src/debug/ceph-11.0.2-1611-geb25965/src/msg/async/AsyncConnection.cc:864 > #10 0x00007fffe6ff5148 in AsyncConnection::process (this=0x7fffa0023450) at /usr/src/debug/ceph-11.0.2-1611-geb25965/src/msg/async/AsyncConnection.cc:812 > #11 0x00007fffe6e5d6ac in EventCenter::process_events (this=this@entry=0x7fffa806e6d0, timeout_microseconds=<optimized out>, timeout_microseconds@entry=30000000) > at /usr/src/debug/ceph-11.0.2-1611-geb25965/src/msg/async/Event.cc:430 > #12 0x00007fffe6e5fbba in NetworkStack::__lambda1::operator() (__closure=0x7fffa80f5630) at /usr/src/debug/ceph-11.0.2-1611-geb25965/src/msg/async/Stack.cc:47 > #13 0x00007fffe3e71220 in std::(anonymous namespace)::execute_native_thread_routine (__p=<optimized out>) at ../../../../../libstdc++-v3/src/c++11/thread.cc:84 > #14 0x00007fffe5ae9dc5 in start_thread (arg=0x7fffcbb93700) at pthread_create.c:308 > #15 0x00007fffe561321d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113 > > >> -----Original Message----- >> From: Avner Ben Hanoch >> Sent: Sunday, November 20, 2016 15:22 >> To: 'Haomai Wang' <haomai@xsky.com>; Marov Aleksey >> <Marov.A@raidix.com> >> Cc: Sage Weil <sweil@redhat.com>; ceph-devel@vger.kernel.org >> Subject: RE: ceph issue >> >> This PR doesn't have any effect on the assertion. 
I still get it in same situation >> >> --- >> $ ./fio --ioengine=rbd --invalidate=0 --rw=write --bs=10M --numjobs=1 -- >> clientname=admin --pool=rbd --iodepth=128 --rbdname=img2g --name=1 >> 1: (g=0): rw=write, bs=10M-10M/10M-10M/10M-10M, ioengine=rbd, >> iodepth=128 >> fio-2.13-91-gb678 >> Starting 1 process >> rbd engine: RBD version: 0.1.11 >> /mnt/data/avnerb/rpmbuild/BUILD/ceph-11.0.2-1611- >> geb25965/src/log/SubsystemMap.h: In function 'bool >> ceph::logging::SubsystemMap::should_gather(unsigned int, int)' thread >> 7f7c7b3a5700 time 2016-11-20 13:17:56.090289 >> /mnt/data/avnerb/rpmbuild/BUILD/ceph-11.0.2-1611- >> geb25965/src/log/SubsystemMap.h: 62: FAILED assert(sub < m_subsys.size()) >> ceph version 11.0.2-1611-geb25965 >> (eb25965b74aa1a0379d091169d80786f30c72a8b) >> --- >> >> > -----Original Message----- >> > From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel- >> > owner@vger.kernel.org] On Behalf Of Haomai Wang >> > Subject: Re: ceph issue >> > >> > sorry, I got the issue. I submitted a >> > pr(https://github.com/ceph/ceph/pull/12068). plz tested with this. ^ permalink raw reply [flat|nested] 16+ messages in thread
* RE: ceph issue
2016-11-21 16:20 ` HA: " Marov Aleksey
@ 2016-11-22 14:41 ` Avner Ben Hanoch
2016-11-22 15:59 ` HA: " Marov Aleksey
0 siblings, 1 reply; 16+ messages in thread
From: Avner Ben Hanoch @ 2016-11-22 14:41 UTC (permalink / raw)
To: Marov Aleksey, Haomai Wang; +Cc: Sage Weil, ceph-devel

Yup, same good status here. Thanks for the fix. I also recommend merging to master.

On a side note, executing "fio --blocksize=10M" brings my cluster to HEALTH_WARN with "8 requests are blocked > 32 sec". The cluster recovers from this situation only after I kill the "bad fio process".

Avner

> -----Original Message-----
> From: Marov Aleksey [mailto:Marov.A@raidix.com]
> Sent: Monday, November 21, 2016 18:20
> To: Haomai Wang <haomai@xsky.com>; Avner Ben Hanoch <avnerb@mellanox.com>
> Cc: Sage Weil <sweil@redhat.com>; ceph-devel@vger.kernel.org
> Subject: HA: ceph issue
>
> It seems for me that your last patch fixed the problem. It works fine with fio 2.13 and fio 2.15. I think it may be merged in master.
>
> Thanks a lot for your work. I'll do some performnace tests next.
>
> Best Regards
> Alex Marov
> ________________________________________
>
>
> @Avner plz try again, I submit a new patch to fix leaks.
>
> On Sun, Nov 20, 2016 at 10:29 PM, Avner Ben Hanoch <avnerb@mellanox.com> wrote:
> > Perhaps similar fix needed in additional places.
> > See my stack trace below (failed on same assert(sub < m_subsys.size())) > > > > -- > > #0 0x00007fffe55525f7 in __GI_raise (sig=sig@entry=6) at > ../nptl/sysdeps/unix/sysv/linux/raise.c:56 > > #1 0x00007fffe5553ce8 in __GI_abort () at abort.c:90 > > #2 0x00007fffe6dbbd47 in ceph::__ceph_assert_fail > (assertion=assertion@entry=0x7fffe70599d8 "sub < m_subsys.size()", > > file=file@entry=0x7fffe7059688 > "/mnt/data/avnerb/rpmbuild/BUILD/ceph-11.0.2-1611- > geb25965/src/log/SubsystemMap.h", line=line@entry=62, > > func=func@entry=0x7fffe7074040 > <_ZZN4ceph7logging12SubsystemMap13should_gatherEjiE19__PRETTY_FUNCT > ION__> "bool ceph::logging::SubsystemMap::should_gather(unsigned int, > int)") > > at /usr/src/debug/ceph-11.0.2-1611-geb25965/src/common/assert.cc:78 > > #3 0x00007fffe6cd215a in ceph::logging::SubsystemMap::should_gather > (level=10, sub=27, this=<optimized out>) at /usr/src/debug/ceph-11.0.2-1611- > geb25965/src/log/SubsystemMap.h:62 > > #4 0x00007fffe6e65865 in should_gather (level=10, sub=27, this=<optimized > out>) at /usr/src/debug/ceph-11.0.2-1611- > geb25965/src/msg/async/net_handler.cc:180 > > #5 ceph::NetHandler::generic_connect (this=0x86dc18, addr=..., > nonblock=nonblock@entry=false) at /usr/src/debug/ceph-11.0.2-1611- > geb25965/src/msg/async/net_handler.cc:174 > > #6 0x00007fffe6e65b17 in ceph::NetHandler::connect (this=<optimized > out>, addr=...) at /usr/src/debug/ceph-11.0.2-1611- > geb25965/src/msg/async/net_handler.cc:198 > > #7 0x00007fffe700105c in RDMAConnectedSocketImpl::try_connect > (this=this@entry=0x7fffbc000ef0, peer_addr=..., opts=...) 
at > /usr/src/debug/ceph-11.0.2-1611- > geb25965/src/msg/async/rdma/RDMAConnectedSocketImpl.cc:111 > > #8 0x00007fffe6e68ed4 in RDMAWorker::connect (this=0x7fffa806e650, > addr=..., opts=..., socket=0x7fffa00235b0) at /usr/src/debug/ceph-11.0.2-1611- > geb25965/src/msg/async/rdma/RDMAStack.cc:48 > > #9 0x00007fffe6fee873 in AsyncConnection::_process_connection > (this=this@entry=0x7fffa0023450) at /usr/src/debug/ceph-11.0.2-1611- > geb25965/src/msg/async/AsyncConnection.cc:864 > > #10 0x00007fffe6ff5148 in AsyncConnection::process (this=0x7fffa0023450) > at /usr/src/debug/ceph-11.0.2-1611- > geb25965/src/msg/async/AsyncConnection.cc:812 > > #11 0x00007fffe6e5d6ac in EventCenter::process_events > (this=this@entry=0x7fffa806e6d0, timeout_microseconds=<optimized out>, > timeout_microseconds@entry=30000000) > > at /usr/src/debug/ceph-11.0.2-1611-geb25965/src/msg/async/Event.cc:430 > > #12 0x00007fffe6e5fbba in NetworkStack::__lambda1::operator() > (__closure=0x7fffa80f5630) at /usr/src/debug/ceph-11.0.2-1611- > geb25965/src/msg/async/Stack.cc:47 > > #13 0x00007fffe3e71220 in std::(anonymous > namespace)::execute_native_thread_routine (__p=<optimized out>) at > ../../../../../libstdc++-v3/src/c++11/thread.cc:84 > > #14 0x00007fffe5ae9dc5 in start_thread (arg=0x7fffcbb93700) at > pthread_create.c:308 > > #15 0x00007fffe561321d in clone () at > ../sysdeps/unix/sysv/linux/x86_64/clone.S:113 > > > > > >> -----Original Message----- > >> From: Avner Ben Hanoch > >> Sent: Sunday, November 20, 2016 15:22 > >> To: 'Haomai Wang' <haomai@xsky.com>; Marov Aleksey > >> <Marov.A@raidix.com> > >> Cc: Sage Weil <sweil@redhat.com>; ceph-devel@vger.kernel.org > >> Subject: RE: ceph issue > >> > >> This PR doesn't have any effect on the assertion. 
I still get it in same > situation > >> > >> --- > >> $ ./fio --ioengine=rbd --invalidate=0 --rw=write --bs=10M --numjobs=1 -- > >> clientname=admin --pool=rbd --iodepth=128 --rbdname=img2g --name=1 > >> 1: (g=0): rw=write, bs=10M-10M/10M-10M/10M-10M, ioengine=rbd, > >> iodepth=128 > >> fio-2.13-91-gb678 > >> Starting 1 process > >> rbd engine: RBD version: 0.1.11 > >> /mnt/data/avnerb/rpmbuild/BUILD/ceph-11.0.2-1611- > >> geb25965/src/log/SubsystemMap.h: In function 'bool > >> ceph::logging::SubsystemMap::should_gather(unsigned int, int)' thread > >> 7f7c7b3a5700 time 2016-11-20 13:17:56.090289 > >> /mnt/data/avnerb/rpmbuild/BUILD/ceph-11.0.2-1611- > >> geb25965/src/log/SubsystemMap.h: 62: FAILED assert(sub < > m_subsys.size()) > >> ceph version 11.0.2-1611-geb25965 > >> (eb25965b74aa1a0379d091169d80786f30c72a8b) > >> --- > >> > >> > -----Original Message----- > >> > From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel- > >> > owner@vger.kernel.org] On Behalf Of Haomai Wang > >> > Subject: Re: ceph issue > >> > > >> > sorry, I got the issue. I submitted a > >> > pr(https://github.com/ceph/ceph/pull/12068). plz tested with this. ^ permalink raw reply [flat|nested] 16+ messages in thread
* HA: ceph issue
2016-11-22 14:41 ` Avner Ben Hanoch
@ 2016-11-22 15:59 ` Marov Aleksey
2016-11-23 9:30 ` Avner Ben Hanoch
0 siblings, 1 reply; 16+ messages in thread
From: Marov Aleksey @ 2016-11-22 15:59 UTC (permalink / raw)
To: Avner Ben Hanoch, Haomai Wang; +Cc: Sage Weil, ceph-devel

I didn't try this blocksize. But in my case fio crashed if I use more than one job. With one job everything works fine. Is it worth deeper investigation?

Alex
________________________________________
From: Avner Ben Hanoch [avnerb@mellanox.com]
Sent: 22 November 2016 17:41
To: Marov Aleksey; Haomai Wang
Cc: Sage Weil; ceph-devel@vger.kernel.org
Subject: RE: ceph issue

Yup. same good status here. Thanks for the fix. I also recommend merging to master.
On a side note, executing "fio --blocksize=10M" bring my cluster to HEALTH_WARN with 8 requests are blocked > 32 sec. The cluster recovers from this situation only after I kill the "bad fio process"
Avner

> -----Original Message-----
> From: Marov Aleksey [mailto:Marov.A@raidix.com]
> Sent: Monday, November 21, 2016 18:20
> To: Haomai Wang <haomai@xsky.com>; Avner Ben Hanoch <avnerb@mellanox.com>
> Cc: Sage Weil <sweil@redhat.com>; ceph-devel@vger.kernel.org
> Subject: HA: ceph issue
>
> It seems for me that your last patch fixed the problem. It works fine with fio 2.13 and fio 2.15. I think it may be merged in master.
>
> Thanks a lot for your work. I'll do some performnace tests next.
>
> Best Regards
> Alex Marov
> ________________________________________
>
>
> @Avner plz try again, I submit a new patch to fix leaks.
>
> On Sun, Nov 20, 2016 at 10:29 PM, Avner Ben Hanoch <avnerb@mellanox.com> wrote:
> > Perhaps similar fix needed in additional places.
> > See my stack trace below (failed on same assert(sub < m_subsys.size())) > > > > -- > > #0 0x00007fffe55525f7 in __GI_raise (sig=sig@entry=6) at > ../nptl/sysdeps/unix/sysv/linux/raise.c:56 > > #1 0x00007fffe5553ce8 in __GI_abort () at abort.c:90 > > #2 0x00007fffe6dbbd47 in ceph::__ceph_assert_fail > (assertion=assertion@entry=0x7fffe70599d8 "sub < m_subsys.size()", > > file=file@entry=0x7fffe7059688 > "/mnt/data/avnerb/rpmbuild/BUILD/ceph-11.0.2-1611- > geb25965/src/log/SubsystemMap.h", line=line@entry=62, > > func=func@entry=0x7fffe7074040 > <_ZZN4ceph7logging12SubsystemMap13should_gatherEjiE19__PRETTY_FUNCT > ION__> "bool ceph::logging::SubsystemMap::should_gather(unsigned int, > int)") > > at /usr/src/debug/ceph-11.0.2-1611-geb25965/src/common/assert.cc:78 > > #3 0x00007fffe6cd215a in ceph::logging::SubsystemMap::should_gather > (level=10, sub=27, this=<optimized out>) at /usr/src/debug/ceph-11.0.2-1611- > geb25965/src/log/SubsystemMap.h:62 > > #4 0x00007fffe6e65865 in should_gather (level=10, sub=27, this=<optimized > out>) at /usr/src/debug/ceph-11.0.2-1611- > geb25965/src/msg/async/net_handler.cc:180 > > #5 ceph::NetHandler::generic_connect (this=0x86dc18, addr=..., > nonblock=nonblock@entry=false) at /usr/src/debug/ceph-11.0.2-1611- > geb25965/src/msg/async/net_handler.cc:174 > > #6 0x00007fffe6e65b17 in ceph::NetHandler::connect (this=<optimized > out>, addr=...) at /usr/src/debug/ceph-11.0.2-1611- > geb25965/src/msg/async/net_handler.cc:198 > > #7 0x00007fffe700105c in RDMAConnectedSocketImpl::try_connect > (this=this@entry=0x7fffbc000ef0, peer_addr=..., opts=...) 
at > /usr/src/debug/ceph-11.0.2-1611- > geb25965/src/msg/async/rdma/RDMAConnectedSocketImpl.cc:111 > > #8 0x00007fffe6e68ed4 in RDMAWorker::connect (this=0x7fffa806e650, > addr=..., opts=..., socket=0x7fffa00235b0) at /usr/src/debug/ceph-11.0.2-1611- > geb25965/src/msg/async/rdma/RDMAStack.cc:48 > > #9 0x00007fffe6fee873 in AsyncConnection::_process_connection > (this=this@entry=0x7fffa0023450) at /usr/src/debug/ceph-11.0.2-1611- > geb25965/src/msg/async/AsyncConnection.cc:864 > > #10 0x00007fffe6ff5148 in AsyncConnection::process (this=0x7fffa0023450) > at /usr/src/debug/ceph-11.0.2-1611- > geb25965/src/msg/async/AsyncConnection.cc:812 > > #11 0x00007fffe6e5d6ac in EventCenter::process_events > (this=this@entry=0x7fffa806e6d0, timeout_microseconds=<optimized out>, > timeout_microseconds@entry=30000000) > > at /usr/src/debug/ceph-11.0.2-1611-geb25965/src/msg/async/Event.cc:430 > > #12 0x00007fffe6e5fbba in NetworkStack::__lambda1::operator() > (__closure=0x7fffa80f5630) at /usr/src/debug/ceph-11.0.2-1611- > geb25965/src/msg/async/Stack.cc:47 > > #13 0x00007fffe3e71220 in std::(anonymous > namespace)::execute_native_thread_routine (__p=<optimized out>) at > ../../../../../libstdc++-v3/src/c++11/thread.cc:84 > > #14 0x00007fffe5ae9dc5 in start_thread (arg=0x7fffcbb93700) at > pthread_create.c:308 > > #15 0x00007fffe561321d in clone () at > ../sysdeps/unix/sysv/linux/x86_64/clone.S:113 > > > > > >> -----Original Message----- > >> From: Avner Ben Hanoch > >> Sent: Sunday, November 20, 2016 15:22 > >> To: 'Haomai Wang' <haomai@xsky.com>; Marov Aleksey > >> <Marov.A@raidix.com> > >> Cc: Sage Weil <sweil@redhat.com>; ceph-devel@vger.kernel.org > >> Subject: RE: ceph issue > >> > >> This PR doesn't have any effect on the assertion. 
I still get it in same > situation > >> > >> --- > >> $ ./fio --ioengine=rbd --invalidate=0 --rw=write --bs=10M --numjobs=1 -- > >> clientname=admin --pool=rbd --iodepth=128 --rbdname=img2g --name=1 > >> 1: (g=0): rw=write, bs=10M-10M/10M-10M/10M-10M, ioengine=rbd, > >> iodepth=128 > >> fio-2.13-91-gb678 > >> Starting 1 process > >> rbd engine: RBD version: 0.1.11 > >> /mnt/data/avnerb/rpmbuild/BUILD/ceph-11.0.2-1611- > >> geb25965/src/log/SubsystemMap.h: In function 'bool > >> ceph::logging::SubsystemMap::should_gather(unsigned int, int)' thread > >> 7f7c7b3a5700 time 2016-11-20 13:17:56.090289 > >> /mnt/data/avnerb/rpmbuild/BUILD/ceph-11.0.2-1611- > >> geb25965/src/log/SubsystemMap.h: 62: FAILED assert(sub < > m_subsys.size()) > >> ceph version 11.0.2-1611-geb25965 > >> (eb25965b74aa1a0379d091169d80786f30c72a8b) > >> --- > >> > >> > -----Original Message----- > >> > From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel- > >> > owner@vger.kernel.org] On Behalf Of Haomai Wang > >> > Subject: Re: ceph issue > >> > > >> > sorry, I got the issue. I submitted a > >> > pr(https://github.com/ceph/ceph/pull/12068). plz tested with this. ^ permalink raw reply [flat|nested] 16+ messages in thread
* RE: ceph issue
2016-11-22 15:59 ` HA: " Marov Aleksey
@ 2016-11-23 9:30 ` Avner Ben Hanoch
2016-12-02 3:12 ` Haomai Wang
0 siblings, 1 reply; 16+ messages in thread
From: Avner Ben Hanoch @ 2016-11-23 9:30 UTC (permalink / raw)
To: Marov Aleksey, Haomai Wang; +Cc: Sage Weil, ceph-devel

I guess that, like the rest of ceph, the new rdma code must also support multiple applications in parallel.

I am also reproducing your error => 2 instances of fio can't run in parallel with ceph rdma:

* ceph -s shows HEALTH_WARN (with "9 requests are blocked > 32 sec")

* all osds print messages like "heartbeat_check: no reply from ..."

* the log files contain errors:
$ grep error ceph-osd.0.log
2016-11-23 09:20:46.988154 7f9b26260700 -1 Fail to open '/proc/0/cmdline' error = (2) No such file or directory
2016-11-23 09:20:54.090388 7f9b43951700 1 -- 36.0.0.2:6802/10634 >> 36.0.0.4:0/19587 conn(0x7f9b256a8000 :6802 s=STATE_OPEN pgs=1 cs=1 l=1).read_bulk reading from fd=139 : Unknown error -104
2016-11-23 09:20:58.411912 7f9b44953700 1 RDMAStack polling work request returned error for buffer(0x7f9b1fee21b0) status(12:RETRY_EXC_ERR
2016-11-23 09:20:58.411934 7f9b44953700 1 RDMAStack polling work request returned error for buffer(0x7f9b553d20d0) status(12:RETRY_EXC_ERR

Command lines that I used:
./fio --ioengine=rbd --invalidate=0 --rw=write --bs=128K --numjobs=1 --clientname=admin --pool=rbd --iodepth=128 --rbdname=img2g --name=1
./fio --ioengine=rbd --invalidate=0 --rw=write --bs=128K --numjobs=1 --clientname=admin --pool=rbd --iodepth=128 --rbdname=img2g2 --name=1

> -----Original Message-----
> From: Marov Aleksey
> Sent: Tuesday, November 22, 2016 17:59
>
> I didn't try this blocksize. But in my case fio crushed if I use more than one job. With one job everything works fine. Is it worth more deep investigating?

^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: ceph issue 2016-11-23 9:30 ` Avner Ben Hanoch @ 2016-12-02 3:12 ` Haomai Wang 2016-12-05 9:37 ` Avner Ben Hanoch 0 siblings, 1 reply; 16+ messages in thread From: Haomai Wang @ 2016-12-02 3:12 UTC (permalink / raw) To: Avner Ben Hanoch; +Cc: Marov Aleksey, Sage Weil, ceph-devel On Wed, Nov 23, 2016 at 5:30 PM, Avner Ben Hanoch <avnerb@mellanox.com> wrote: > > I guess that like the rest of ceph, the new rdma code must also support multiple applications in parallel. > > I am also reproducing your error => 2 instances of fio can't run in parallel with ceph rdma. > > * with ceph -s shows HEALTH_WARN (with "9 requests are blocked > 32 sec") > > * and with all osds printing messages like " heartbeat_check: no reply from ..." > > * And with log files contains errors: > $ grep error ceph-osd.0.log > 2016-11-23 09:20:46.988154 7f9b26260700 -1 Fail to open '/proc/0/cmdline' error = (2) No such file or directory > 2016-11-23 09:20:54.090388 7f9b43951700 1 -- 36.0.0.2:6802/10634 >> 36.0.0.4:0/19587 conn(0x7f9b256a8000 :6802 s=STATE_OPEN pgs=1 cs=1 l=1).read_bulk reading from fd=139 : Unknown error -104 > 2016-11-23 09:20:58.411912 7f9b44953700 1 RDMAStack polling work request returned error for buffer(0x7f9b1fee21b0) status(12:RETRY_EXC_ERR > 2016-11-23 09:20:58.411934 7f9b44953700 1 RDMAStack polling work request returned error for buffer(0x7f9b553d20d0) status(12:RETRY_EXC_ERR error is "IBV_WC_RETRY_EXC_ERR (12) - Transport Retry Counter Exceeded: The local transport timeout retry counter was exceeded while trying to send this message. This means that the remote side didn't send any Ack or Nack. If this happens when sending the first message, usually this mean that the connection attributes are wrong or the remote side isn't in a state that it can respond to messages. If this happens after sending the first message, usually it means that the remote QP isn't available anymore. Relevant for RC QPs." 
We set the QP retry_cnt to 7 and the timeout to 14:

// How long to wait before retrying if packet lost or server dead.
// Supposedly the timeout is 4.096us*2^timeout. However, the actual
// timeout appears to be 4.096us*2^(timeout+1), so the setting
// below creates a 135ms timeout.
qpa.timeout = 14;

// How many times to retry after timeouts before giving up.
qpa.retry_cnt = 7;

Does this mean the receiver side is short of memory, or is not polling work requests quickly enough?

>
>
>
> Command lines that I used:
> ./fio --ioengine=rbd --invalidate=0 --rw=write --bs=128K --numjobs=1 --clientname=admin --pool=rbd --iodepth=128 --rbdname=img2g --name=1
> ./fio --ioengine=rbd --invalidate=0 --rw=write --bs=128K --numjobs=1 --clientname=admin --pool=rbd --iodepth=128 --rbdname=img2g2 --name=1
>
> > -----Original Message-----
> > From: Marov Aleksey
> > Sent: Tuesday, November 22, 2016 17:59
> >
> > I didn't try this blocksize. But in my case fio crushed if I use more than one job. With one job everything works fine. Is it worth more deep investigating?

^ permalink raw reply [flat|nested] 16+ messages in thread
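[Editor's note] The arithmetic in the comment quoted above is easy to sanity-check. The sketch below is not Ceph code; `qp_timeout_seconds` is a hypothetical helper that applies the conversion described in the comment (4.096us * 2^(timeout+1), i.e. including the observed off-by-one), and the worst-case figure assumes the first send plus all retry_cnt retries each wait the full per-attempt timeout before IBV_WC_RETRY_EXC_ERR is reported:

```python
# Hypothetical helper (not part of Ceph): convert the ibv_qp_attr timeout
# field into seconds, using the 4.096us * 2^(timeout+1) behaviour that the
# Ceph source comment above observed in practice.
def qp_timeout_seconds(timeout_field: int, off_by_one: bool = True) -> float:
    exponent = timeout_field + 1 if off_by_one else timeout_field
    return 4.096e-6 * (2 ** exponent)

per_attempt = qp_timeout_seconds(14)   # the value Ceph sets
worst_case = per_attempt * (1 + 7)     # first send plus retry_cnt=7 retries
print(f"per-attempt timeout: {per_attempt * 1e3:.0f} ms")
print(f"worst case before RETRY_EXC_ERR: {worst_case:.2f} s")
```

So with these settings a peer that never ACKs should trip IBV_WC_RETRY_EXC_ERR after roughly a second, which points at a receiver that has stalled (for example, not draining its completion queue) rather than one that is merely slow.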
* RE: ceph issue
2016-12-02 3:12 ` Haomai Wang
@ 2016-12-05 9:37 ` Avner Ben Hanoch
2016-12-06 15:36 ` HA: " Marov Aleksey
0 siblings, 1 reply; 16+ messages in thread
From: Avner Ben Hanoch @ 2016-12-05 9:37 UTC (permalink / raw)
To: Haomai Wang; +Cc: Marov Aleksey, Sage Weil, ceph-devel

Hi Haomai, Alexey

With the latest async/rdma code I don't see the fio errors (neither for multiple fio instances nor for big block sizes) - thanks for your work Haomai.

Alexey - do you still see any issue with fio?

Regards,
Avner

> -----Original Message-----
> From: Haomai Wang [mailto:haomai@xsky.com]
> Sent: Friday, December 02, 2016 05:12
> To: Avner Ben Hanoch <avnerb@mellanox.com>
> Cc: Marov Aleksey <Marov.A@raidix.com>; Sage Weil <sweil@redhat.com>; ceph-devel@vger.kernel.org
> Subject: Re: ceph issue
>
> On Wed, Nov 23, 2016 at 5:30 PM, Avner Ben Hanoch <avnerb@mellanox.com> wrote:
> >
> > I guess that like the rest of ceph, the new rdma code must also support multiple applications in parallel.
> >
> > I am also reproducing your error => 2 instances of fio can't run in parallel with ceph rdma.
> >
> > * with ceph -s shows HEALTH_WARN (with "9 requests are blocked > 32 sec")
> >
> > * and with all osds printing messages like " heartbeat_check: no reply from ..."
> > > > * And with log files contains errors: > > $ grep error ceph-osd.0.log > > 2016-11-23 09:20:46.988154 7f9b26260700 -1 Fail to open '/proc/0/cmdline' > error = (2) No such file or directory > > 2016-11-23 09:20:54.090388 7f9b43951700 1 -- 36.0.0.2:6802/10634 >> > 36.0.0.4:0/19587 conn(0x7f9b256a8000 :6802 s=STATE_OPEN pgs=1 cs=1 > l=1).read_bulk reading from fd=139 : Unknown error -104 > > 2016-11-23 09:20:58.411912 7f9b44953700 1 RDMAStack polling work > request returned error for buffer(0x7f9b1fee21b0) status(12:RETRY_EXC_ERR > > 2016-11-23 09:20:58.411934 7f9b44953700 1 RDMAStack polling work > > request returned error for buffer(0x7f9b553d20d0) > > status(12:RETRY_EXC_ERR > > error is "IBV_WC_RETRY_EXC_ERR (12) - Transport Retry Counter > Exceeded: The local transport timeout retry counter was exceeded while > trying to send this message. This means that the remote side didn't send any > Ack or Nack. If this happens when sending the first message, usually this mean > that the connection attributes are wrong or the remote side isn't in a state > that it can respond to messages. If this happens after sending the first > message, usually it means that the remote QP isn't available anymore. > Relevant for RC QPs." > > we set qp retry_cnt to 7 and timeout is 14 > > // How long to wait before retrying if packet lost or server dead. > // Supposedly the timeout is 4.096us*2^timeout. However, the actual > // timeout appears to be 4.096us*2^(timeout+1), so the setting > // below creates a 135ms timeout. > qpa.timeout = 14; > > // How many times to retry after timeouts before giving up. > qpa.retry_cnt = 7; > > is this means the receiver side lack of memory or not polling work request > ASAP? 
> > > > > > > > > Command lines that I used: > > ./fio --ioengine=rbd --invalidate=0 --rw=write --bs=128K --numjobs=1 -- > clientname=admin --pool=rbd --iodepth=128 --rbdname=img2g --name=1 > > ./fio --ioengine=rbd --invalidate=0 --rw=write --bs=128K --numjobs=1 > > --clientname=admin --pool=rbd --iodepth=128 --rbdname=img2g2 --name=1 > > > > > -----Original Message----- > > > From: Marov Aleksey > > > Sent: Tuesday, November 22, 2016 17:59 > > > > > > I didn't try this blocksize. But in my case fio crushed if I use > > > more than one job. With one job everything works fine. Is it worth more > deep investigating? ^ permalink raw reply [flat|nested] 16+ messages in thread
* HA: ceph issue
2016-12-05 9:37 ` Avner Ben Hanoch
@ 2016-12-06 15:36 ` Marov Aleksey
2016-12-06 17:15 ` Haomai Wang
0 siblings, 1 reply; 16+ messages in thread
From: Marov Aleksey @ 2016-12-06 15:36 UTC (permalink / raw)
To: Avner Ben Hanoch, Haomai Wang; +Cc: Sage Weil, ceph-devel

I have tried the latest changes. It works fine for any blocksize and for a small number of fio jobs. But if I set numjobs >= 16 it crashes with this assert:

/mnt/ceph_src/rpmbuild/BUILD/ceph-11.0.2-2234-g19ca696/src/msg/async/rdma/RDMAStack.h: In function 'int RDMADispatcher::register_qp(RDMADispatcher::QueuePair*, RDMAConnectedSocketImpl*)' thread 7f3d64ff9700 time 2016-12-06 18:32:33.517932
/mnt/ceph_src/rpmbuild/BUILD/ceph-11.0.2-2234-g19ca696/src/msg/async/rdma/RDMAStack.h: 102: FAILED assert(fd >= 0)

A core dump showed me this:

Thread 1 (Thread 0x7f6aeb7fe700 (LWP 15151)):
#0 0x00007f6c3d68d5f7 in raise () from /lib64/libc.so.6
#1 0x00007f6c3d68ece8 in abort () from /lib64/libc.so.6
#2 0x00007f6c3eef95e7 in ceph::__ceph_assert_fail (assertion=assertion@entry=0x7f6c3f1c8722 "fd >= 0", file=file@entry=0x7f6c3f1cd100 "/mnt/ceph_src/rpmbuild/BUILD/ceph-11.0.2-2234-g19ca696/src/msg/async/rdma/RDMAStack.h", line=line@entry=102, func=func@entry=0x7f6c3f1cd8c0 <RDMADispatcher::register_qp(Infiniband::QueuePair*, RDMAConnectedSocketImpl*)::__PRETTY_FUNCTION__> "int RDMADispatcher::register_qp(RDMADispatcher::QueuePair*, RDMAConnectedSocketImpl*)") at /mnt/ceph_src/rpmbuild/BUILD/ceph-11.0.2-2234-g19ca696/src/common/assert.cc:78
#3 0x00007f6c3efb443e in register_qp (csi=0x7f6ac83e00d0, qp=0x7f6ac83e0650, this=0x7f6bec145560) at /mnt/ceph_src/rpmbuild/BUILD/ceph-11.0.2-2234-g19ca696/src/msg/async/rdma/RDMAStack.h:102
#4 RDMAConnectedSocketImpl (w=0x7f6bec0bee50, s=0x7f6bec145560, ib=<optimized out>, cct=0x7f6bec0b30f0, this=0x7f6ac83e00d0) at /mnt/ceph_src/rpmbuild/BUILD/ceph-11.0.2-2234-g19ca696/src/msg/async/rdma/RDMAStack.h:297
#5
RDMAWorker::connect (this=0x7f6bec0bee50, addr=..., opts=..., socket=0x7f69b409fef0) at /mnt/ceph_src/rpmbuild/BUILD/ceph-11.0.2-2234-g19ca696/src/msg/async/rdma/RDMAStack.cc:49 #6 0x00007f6c3f13bb03 in AsyncConnection::_process_connection (this=this@entry=0x7f69b409fd90) at /mnt/ceph_src/rpmbuild/BUILD/ceph-11.0.2-2234-g19ca696/src/msg/async/AsyncConnection.cc:864 #7 0x00007f6c3f1423b8 in AsyncConnection::process (this=0x7f69b409fd90) at /mnt/ceph_src/rpmbuild/BUILD/ceph-11.0.2-2234-g19ca696/src/msg/async/AsyncConnection.cc:812 #8 0x00007f6c3ef9b53c in EventCenter::process_events (this=this@entry=0x7f6bec0beed0, timeout_microseconds=<optimized out>, timeout_microseconds@entry=30000000) at /mnt/ceph_src/rpmbuild/BUILD/ceph-11.0.2-2234-g19ca696/src/msg/async/Event.cc:430 #9 0x00007f6c3ef9da4a in NetworkStack::__lambda1::operator() (__closure=0x7f6bec146030) at /mnt/ceph_src/rpmbuild/BUILD/ceph-11.0.2-2234-g19ca696/src/msg/async/Stack.cc:46 #10 0x00007f6c3bd51220 in ?? () from /lib64/libstdc++.so.6 #11 0x00007f6c3dc25dc5 in start_thread () from /lib64/libpthread.so.0 #12 0x00007f6c3d74eced in clone () from /lib64/libc.so.6 my fio config looks like this: [global] #logging #write_iops_log=write_iops_log #write_bw_log=write_bw_log #write_lat_log=write_lat_log ioengine=rbd direct=1 #clustername=ceph clientname=admin pool=rbd rbdname=test_img1 invalidate=0 # mandatory rw=randwrite bs=4K runtime=10m time_based randrepeat=0 [rbd_iodepth32] iodepth=128 numjobs=16 # 16 doesn't work But it works perfectly with 8 numjobs. If I am the only one hitting this problem, maybe I have a problem with the IB drivers or settings? Best regards Aleksei Marov ________________________________________ From: Avner Ben Hanoch [avnerb@mellanox.com] Sent: 5 December 2016 12:37 To: Haomai Wang Cc: Marov Aleksey; Sage Weil; ceph-devel@vger.kernel.org Subject: RE: ceph issue Hi Haomai, Alexey With the latest async/rdma code I don't see the fio errors (neither for multiple fio instances nor for big block sizes) - thanks for your work Haomai. Alexey - do you still see any issue with fio? Regards, Avner > -----Original Message----- > From: Haomai Wang [mailto:haomai@xsky.com] > Sent: Friday, December 02, 2016 05:12 > To: Avner Ben Hanoch <avnerb@mellanox.com> > Cc: Marov Aleksey <Marov.A@raidix.com>; Sage Weil <sweil@redhat.com>; > ceph-devel@vger.kernel.org > Subject: Re: ceph issue > > On Wed, Nov 23, 2016 at 5:30 PM, Avner Ben Hanoch > <avnerb@mellanox.com> wrote: > > > > I guess that like the rest of ceph, the new rdma code must also support > multiple applications in parallel. > > > > I am also reproducing your error => 2 instances of fio can't run in parallel > with ceph rdma. > > > > * with ceph -s shows HEALTH_WARN (with "9 requests are blocked > 32 > > sec") > > > > * and with all osds printing messages like " heartbeat_check: no reply from > ..." > > > > * And with log files contains errors: > > $ grep error ceph-osd.0.log > > 2016-11-23 09:20:46.988154 7f9b26260700 -1 Fail to open '/proc/0/cmdline' > error = (2) No such file or directory > > 2016-11-23 09:20:54.090388 7f9b43951700 1 -- 36.0.0.2:6802/10634 >> > 36.0.0.4:0/19587 conn(0x7f9b256a8000 :6802 s=STATE_OPEN pgs=1 cs=1 > l=1).read_bulk reading from fd=139 : Unknown error -104 > > 2016-11-23 09:20:58.411912 7f9b44953700 1 RDMAStack polling work > request returned error for buffer(0x7f9b1fee21b0) status(12:RETRY_EXC_ERR > > 2016-11-23 09:20:58.411934 7f9b44953700 1 RDMAStack polling work > > request returned error for buffer(0x7f9b553d20d0) > > status(12:RETRY_EXC_ERR > > error is "IBV_WC_RETRY_EXC_ERR (12) - Transport Retry Counter > Exceeded: The local transport timeout retry counter was exceeded while > trying to send this message. 
This means that the remote side didn't send any > Ack or Nack. If this happens when sending the first message, usually this mean > that the connection attributes are wrong or the remote side isn't in a state > that it can respond to messages. If this happens after sending the first > message, usually it means that the remote QP isn't available anymore. > Relevant for RC QPs." > > we set qp retry_cnt to 7 and timeout is 14 > > // How long to wait before retrying if packet lost or server dead. > // Supposedly the timeout is 4.096us*2^timeout. However, the actual > // timeout appears to be 4.096us*2^(timeout+1), so the setting > // below creates a 135ms timeout. > qpa.timeout = 14; > > // How many times to retry after timeouts before giving up. > qpa.retry_cnt = 7; > > is this means the receiver side lack of memory or not polling work request > ASAP? > > > > > > > > > Command lines that I used: > > ./fio --ioengine=rbd --invalidate=0 --rw=write --bs=128K --numjobs=1 -- > clientname=admin --pool=rbd --iodepth=128 --rbdname=img2g --name=1 > > ./fio --ioengine=rbd --invalidate=0 --rw=write --bs=128K --numjobs=1 > > --clientname=admin --pool=rbd --iodepth=128 --rbdname=img2g2 --name=1 > > > > > -----Original Message----- > > > From: Marov Aleksey > > > Sent: Tuesday, November 22, 2016 17:59 > > > > > > I didn't try this blocksize. But in my case fio crushed if I use > > > more than one job. With one job everything works fine. Is it worth more > deep investigating? ^ permalink raw reply [flat|nested] 16+ messages in thread
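The QP timeout comment quoted in the message above ("Supposedly the timeout is 4.096us*2^timeout. However, the actual timeout appears to be 4.096us*2^(timeout+1)") can be sanity-checked numerically. A small sketch (plain Python, not from the thread) reproducing that arithmetic:

```python
def qp_timeout_ms(timeout_field: int, off_by_one: bool = True) -> float:
    """InfiniBand QP local ACK timeout: nominally 4.096us * 2**timeout_field.

    off_by_one=True models the behaviour noted in the quoted comment,
    where the effective exponent is timeout_field + 1.
    """
    exp = timeout_field + 1 if off_by_one else timeout_field
    return 4.096e-6 * (2 ** exp) * 1000.0  # convert seconds to ms

# qpa.timeout = 14 from the quoted code:
print(qp_timeout_ms(14))                     # ~134.2 ms, the "135ms timeout"
print(qp_timeout_ms(14, off_by_one=False))   # ~67.1 ms, the nominal value
```

So with retry_cnt = 7, a dead peer is only given up on after roughly 7 × 134 ms ≈ 0.94 s of retries, which matches the comment's intent.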
* Re: ceph issue 2016-12-06 15:36 ` HA: " Marov Aleksey @ 2016-12-06 17:15 ` Haomai Wang 2016-12-07 8:57 ` HA: " Marov Aleksey 0 siblings, 1 reply; 16+ messages in thread From: Haomai Wang @ 2016-12-06 17:15 UTC (permalink / raw) To: Marov Aleksey; +Cc: Avner Ben Hanoch, Sage Weil, ceph-devel You need to increase the system fd limits; the rdma backend uses twice as many fds as before: one is a tcp socket fd, the other is a linux eventfd. On Tue, Dec 6, 2016 at 11:36 PM, Marov Aleksey <Marov.A@raidix.com> wrote: > I have tried the latest changes. It works fine for any blocksize and for small number of fio jobs. But if I set numjobs >=16 it crushes with the assert:: > /mnt/ceph_src/rpmbuild/BUILD/ceph-11.0.2-2234-g19ca696/src/msg/async/rdma/RDMAStack.h: In function 'int RDMADispatcher::register_qp(RDMADispatcher::QueuePair*, RDMAConnectedSocketImpl*)' thread 7f3d64ff9700 time 2016-12-06 18:32:33.517932 > /mnt/ceph_src/rpmbuild/BUILD/ceph-11.0.2-2234-g19ca696/src/msg/async/rdma/RDMAStack.h: 102: FAILED assert(fd >= 0) > > core dump showed me this: > Thread 1 (Thread 0x7f6aeb7fe700 (LWP 15151)): > #0 0x00007f6c3d68d5f7 in raise () from /lib64/libc.so.6 > #1 0x00007f6c3d68ece8 in abort () from /lib64/libc.so.6 > #2 0x00007f6c3eef95e7 in ceph::__ceph_assert_fail (assertion=assertion@entry=0x7f6c3f1c8722 "fd >= 0", > file=file@entry=0x7f6c3f1cd100 "/mnt/ceph_src/rpmbuild/BUILD/ceph-11.0.2-2234-g19ca696/src/msg/async/rdma/RDMAStack.h", line=line@entry=102, > func=func@entry=0x7f6c3f1cd8c0 <RDMADispatcher::register_qp(Infiniband::QueuePair*, RDMAConnectedSocketImpl*)::__PRETTY_FUNCTION__> "int RDMADispatcher::register_qp(RDMADispatcher::QueuePair*, RDMAConnectedSocketImpl*)") at /mnt/ceph_src/rpmbuild/BUILD/ceph-11.0.2-2234-g19ca696/src/common/assert.cc:78 > #3 0x00007f6c3efb443e in register_qp (csi=0x7f6ac83e00d0, qp=0x7f6ac83e0650, this=0x7f6bec145560) > at /mnt/ceph_src/rpmbuild/BUILD/ceph-11.0.2-2234-g19ca696/src/msg/async/rdma/RDMAStack.h:102 > #4 RDMAConnectedSocketImpl 
(w=0x7f6bec0bee50, s=0x7f6bec145560, ib=<optimized out>, cct=0x7f6bec0b30f0, > this=0x7f6ac83e00d0) > at /mnt/ceph_src/rpmbuild/BUILD/ceph-11.0.2-2234-g19ca696/src/msg/async/rdma/RDMAStack.h:297 > ---Type <return> to continue, or q <return> to quit--- > #5 RDMAWorker::connect (this=0x7f6bec0bee50, addr=..., opts=..., socket=0x7f69b409fef0) > at /mnt/ceph_src/rpmbuild/BUILD/ceph-11.0.2-2234-g19ca696/src/msg/async/rdma/RDMAStack.cc:49 > #6 0x00007f6c3f13bb03 in AsyncConnection::_process_connection (this=this@entry=0x7f69b409fd90) > at /mnt/ceph_src/rpmbuild/BUILD/ceph-11.0.2-2234-g19ca696/src/msg/async/AsyncConnection.cc:864 > #7 0x00007f6c3f1423b8 in AsyncConnection::process (this=0x7f69b409fd90) > at /mnt/ceph_src/rpmbuild/BUILD/ceph-11.0.2-2234-g19ca696/src/msg/async/AsyncConnection.cc:812 > #8 0x00007f6c3ef9b53c in EventCenter::process_events (this=this@entry=0x7f6bec0beed0, > timeout_microseconds=<optimized out>, timeout_microseconds@entry=30000000) > at /mnt/ceph_src/rpmbuild/BUILD/ceph-11.0.2-2234-g19ca696/src/msg/async/Event.cc:430 > #9 0x00007f6c3ef9da4a in NetworkStack::__lambda1::operator() (__closure=0x7f6bec146030) > at /mnt/ceph_src/rpmbuild/BUILD/ceph-11.0.2-2234-g19ca696/src/msg/async/Stack.cc:46 > #10 0x00007f6c3bd51220 in ?? () from /lib64/libstdc++.so.6 > #11 0x00007f6c3dc25dc5 in start_thread () from /lib64/libpthread.so.0 > #12 0x00007f6c3d74eced in clone () from /lib64/libc.so.6 > > my fio config looks like this : > [global] > #logging > #write_iops_log=write_iops_log > #write_bw_log=write_bw_log > #write_lat_log=write_lat_log > ioengine=rbd > direct=1 > #clustername=ceph > clientname=admin > pool=rbd > rbdname=test_img1 > invalidate=0 # mandatory > rw=randwrite > bs=4K > runtime=10m > time_based > randrepeat=0 > > [rbd_iodepth32] > iodepth=128 > numjobs=16 # 16 dosent work > > > But it works perfect with 8 numjobs. If it is only me who got this problem then may be I have some problems with th ib drivers or settings ? 
> > Best regards > Aleksei Marov > ________________________________________ > От: Avner Ben Hanoch [avnerb@mellanox.com] > Отправлено: 5 декабря 2016 г. 12:37 > Кому: Haomai Wang > Копия: Marov Aleksey; Sage Weil; ceph-devel@vger.kernel.org > Тема: RE: ceph issue > > Hi Haomai, Alexey > > With latest async/rdma code I don't see the fio errors (not for multiple fio instances neither to big block size) - thanks for your work Haomai. > > Alexey - do you still see any issue with fio? > > Regards, > Avner > >> -----Original Message----- >> From: Haomai Wang [mailto:haomai@xsky.com] >> Sent: Friday, December 02, 2016 05:12 >> To: Avner Ben Hanoch <avnerb@mellanox.com> >> Cc: Marov Aleksey <Marov.A@raidix.com>; Sage Weil <sweil@redhat.com>; >> ceph-devel@vger.kernel.org >> Subject: Re: ceph issue >> >> On Wed, Nov 23, 2016 at 5:30 PM, Avner Ben Hanoch >> <avnerb@mellanox.com> wrote: >> > >> > I guess that like the rest of ceph, the new rdma code must also support >> multiple applications in parallel. >> > >> > I am also reproducing your error => 2 instances of fio can't run in parallel >> with ceph rdma. >> > >> > * with ceph -s shows HEALTH_WARN (with "9 requests are blocked > 32 >> > sec") >> > >> > * and with all osds printing messages like " heartbeat_check: no reply from >> ..." 
>> > >> > * And with log files contains errors: >> > $ grep error ceph-osd.0.log >> > 2016-11-23 09:20:46.988154 7f9b26260700 -1 Fail to open '/proc/0/cmdline' >> error = (2) No such file or directory >> > 2016-11-23 09:20:54.090388 7f9b43951700 1 -- 36.0.0.2:6802/10634 >> >> 36.0.0.4:0/19587 conn(0x7f9b256a8000 :6802 s=STATE_OPEN pgs=1 cs=1 >> l=1).read_bulk reading from fd=139 : Unknown error -104 >> > 2016-11-23 09:20:58.411912 7f9b44953700 1 RDMAStack polling work >> request returned error for buffer(0x7f9b1fee21b0) status(12:RETRY_EXC_ERR >> > 2016-11-23 09:20:58.411934 7f9b44953700 1 RDMAStack polling work >> > request returned error for buffer(0x7f9b553d20d0) >> > status(12:RETRY_EXC_ERR >> >> error is "IBV_WC_RETRY_EXC_ERR (12) - Transport Retry Counter >> Exceeded: The local transport timeout retry counter was exceeded while >> trying to send this message. This means that the remote side didn't send any >> Ack or Nack. If this happens when sending the first message, usually this mean >> that the connection attributes are wrong or the remote side isn't in a state >> that it can respond to messages. If this happens after sending the first >> message, usually it means that the remote QP isn't available anymore. >> Relevant for RC QPs." >> >> we set qp retry_cnt to 7 and timeout is 14 >> >> // How long to wait before retrying if packet lost or server dead. >> // Supposedly the timeout is 4.096us*2^timeout. However, the actual >> // timeout appears to be 4.096us*2^(timeout+1), so the setting >> // below creates a 135ms timeout. >> qpa.timeout = 14; >> >> // How many times to retry after timeouts before giving up. >> qpa.retry_cnt = 7; >> >> is this means the receiver side lack of memory or not polling work request >> ASAP? 
>> >> > >> > >> > >> > Command lines that I used: >> > ./fio --ioengine=rbd --invalidate=0 --rw=write --bs=128K --numjobs=1 -- >> clientname=admin --pool=rbd --iodepth=128 --rbdname=img2g --name=1 >> > ./fio --ioengine=rbd --invalidate=0 --rw=write --bs=128K --numjobs=1 >> > --clientname=admin --pool=rbd --iodepth=128 --rbdname=img2g2 --name=1 >> > >> > > -----Original Message----- >> > > From: Marov Aleksey >> > > Sent: Tuesday, November 22, 2016 17:59 >> > > >> > > I didn't try this blocksize. But in my case fio crushed if I use >> > > more than one job. With one job everything works fine. Is it worth more >> deep investigating? ^ permalink raw reply [flat|nested] 16+ messages in thread
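Haomai's fd-limit advice above can be applied along these lines. This is a sketch: the value 65536 is an arbitrary illustrative choice, and the limits.conf / systemd entries shown in comments depend on how the daemons are managed.

```shell
# Show the current soft limit on open files for this shell
ulimit -n

# Raise the soft limit for the current session. This cannot exceed the
# hard limit (see `ulimit -Hn`); 65536 is an arbitrary example value.
ulimit -n 65536 2>/dev/null || echo "hard limit too low; raise it first"

# To make the change persistent for login sessions, illustrative entries
# in /etc/security/limits.conf would look like:
#   *  soft  nofile  65536
#   *  hard  nofile  65536
# For systemd-managed ceph daemons, LimitNOFILE= in the unit file
# serves the same purpose.

# Verify the limit now in effect
ulimit -n
```

Since each RDMA messenger connection consumes both a tcp socket fd and an eventfd, whatever limit was adequate for the plain async messenger should roughly be doubled.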
* HA: ceph issue 2016-12-06 17:15 ` Haomai Wang @ 2016-12-07 8:57 ` Marov Aleksey 0 siblings, 0 replies; 16+ messages in thread From: Marov Aleksey @ 2016-12-07 8:57 UTC (permalink / raw) To: Haomai Wang; +Cc: Avner Ben Hanoch, Sage Weil, ceph-devel You were right. Increasing the fd limits helped me. Thank you, Haomai, for the great work on the rdma async messenger. ________________________________________ From: Haomai Wang [haomai@xsky.com] Sent: 6 December 2016 20:15 To: Marov Aleksey Cc: Avner Ben Hanoch; Sage Weil; ceph-devel@vger.kernel.org Subject: Re: ceph issue You need to increase the system fd limits; the rdma backend uses twice as many fds as before: one is a tcp socket fd, the other is a linux eventfd. On Tue, Dec 6, 2016 at 11:36 PM, Marov Aleksey <Marov.A@raidix.com> wrote: > I have tried the latest changes. It works fine for any blocksize and for small number of fio jobs. But if I set numjobs >=16 it crushes with the assert:: > /mnt/ceph_src/rpmbuild/BUILD/ceph-11.0.2-2234-g19ca696/src/msg/async/rdma/RDMAStack.h: In function 'int RDMADispatcher::register_qp(RDMADispatcher::QueuePair*, RDMAConnectedSocketImpl*)' thread 7f3d64ff9700 time 2016-12-06 18:32:33.517932 > /mnt/ceph_src/rpmbuild/BUILD/ceph-11.0.2-2234-g19ca696/src/msg/async/rdma/RDMAStack.h: 102: FAILED assert(fd >= 0) > > core dump showed me this: > Thread 1 (Thread 0x7f6aeb7fe700 (LWP 15151)): > #0 0x00007f6c3d68d5f7 in raise () from /lib64/libc.so.6 > #1 0x00007f6c3d68ece8 in abort () from /lib64/libc.so.6 > #2 0x00007f6c3eef95e7 in ceph::__ceph_assert_fail (assertion=assertion@entry=0x7f6c3f1c8722 "fd >= 0", > file=file@entry=0x7f6c3f1cd100 "/mnt/ceph_src/rpmbuild/BUILD/ceph-11.0.2-2234-g19ca696/src/msg/async/rdma/RDMAStack.h", line=line@entry=102, > func=func@entry=0x7f6c3f1cd8c0 <RDMADispatcher::register_qp(Infiniband::QueuePair*, RDMAConnectedSocketImpl*)::__PRETTY_FUNCTION__> "int RDMADispatcher::register_qp(RDMADispatcher::QueuePair*, RDMAConnectedSocketImpl*)") at 
/mnt/ceph_src/rpmbuild/BUILD/ceph-11.0.2-2234-g19ca696/src/common/assert.cc:78 > #3 0x00007f6c3efb443e in register_qp (csi=0x7f6ac83e00d0, qp=0x7f6ac83e0650, this=0x7f6bec145560) > at /mnt/ceph_src/rpmbuild/BUILD/ceph-11.0.2-2234-g19ca696/src/msg/async/rdma/RDMAStack.h:102 > #4 RDMAConnectedSocketImpl (w=0x7f6bec0bee50, s=0x7f6bec145560, ib=<optimized out>, cct=0x7f6bec0b30f0, > this=0x7f6ac83e00d0) > at /mnt/ceph_src/rpmbuild/BUILD/ceph-11.0.2-2234-g19ca696/src/msg/async/rdma/RDMAStack.h:297 > ---Type <return> to continue, or q <return> to quit--- > #5 RDMAWorker::connect (this=0x7f6bec0bee50, addr=..., opts=..., socket=0x7f69b409fef0) > at /mnt/ceph_src/rpmbuild/BUILD/ceph-11.0.2-2234-g19ca696/src/msg/async/rdma/RDMAStack.cc:49 > #6 0x00007f6c3f13bb03 in AsyncConnection::_process_connection (this=this@entry=0x7f69b409fd90) > at /mnt/ceph_src/rpmbuild/BUILD/ceph-11.0.2-2234-g19ca696/src/msg/async/AsyncConnection.cc:864 > #7 0x00007f6c3f1423b8 in AsyncConnection::process (this=0x7f69b409fd90) > at /mnt/ceph_src/rpmbuild/BUILD/ceph-11.0.2-2234-g19ca696/src/msg/async/AsyncConnection.cc:812 > #8 0x00007f6c3ef9b53c in EventCenter::process_events (this=this@entry=0x7f6bec0beed0, > timeout_microseconds=<optimized out>, timeout_microseconds@entry=30000000) > at /mnt/ceph_src/rpmbuild/BUILD/ceph-11.0.2-2234-g19ca696/src/msg/async/Event.cc:430 > #9 0x00007f6c3ef9da4a in NetworkStack::__lambda1::operator() (__closure=0x7f6bec146030) > at /mnt/ceph_src/rpmbuild/BUILD/ceph-11.0.2-2234-g19ca696/src/msg/async/Stack.cc:46 > #10 0x00007f6c3bd51220 in ?? 
() from /lib64/libstdc++.so.6 > #11 0x00007f6c3dc25dc5 in start_thread () from /lib64/libpthread.so.0 > #12 0x00007f6c3d74eced in clone () from /lib64/libc.so.6 > > my fio config looks like this : > [global] > #logging > #write_iops_log=write_iops_log > #write_bw_log=write_bw_log > #write_lat_log=write_lat_log > ioengine=rbd > direct=1 > #clustername=ceph > clientname=admin > pool=rbd > rbdname=test_img1 > invalidate=0 # mandatory > rw=randwrite > bs=4K > runtime=10m > time_based > randrepeat=0 > > [rbd_iodepth32] > iodepth=128 > numjobs=16 # 16 dosent work > > > But it works perfect with 8 numjobs. If it is only me who got this problem then may be I have some problems with th ib drivers or settings ? > > Best regards > Aleksei Marov > ________________________________________ > От: Avner Ben Hanoch [avnerb@mellanox.com] > Отправлено: 5 декабря 2016 г. 12:37 > Кому: Haomai Wang > Копия: Marov Aleksey; Sage Weil; ceph-devel@vger.kernel.org > Тема: RE: ceph issue > > Hi Haomai, Alexey > > With latest async/rdma code I don't see the fio errors (not for multiple fio instances neither to big block size) - thanks for your work Haomai. > > Alexey - do you still see any issue with fio? > > Regards, > Avner > >> -----Original Message----- >> From: Haomai Wang [mailto:haomai@xsky.com] >> Sent: Friday, December 02, 2016 05:12 >> To: Avner Ben Hanoch <avnerb@mellanox.com> >> Cc: Marov Aleksey <Marov.A@raidix.com>; Sage Weil <sweil@redhat.com>; >> ceph-devel@vger.kernel.org >> Subject: Re: ceph issue >> >> On Wed, Nov 23, 2016 at 5:30 PM, Avner Ben Hanoch >> <avnerb@mellanox.com> wrote: >> > >> > I guess that like the rest of ceph, the new rdma code must also support >> multiple applications in parallel. >> > >> > I am also reproducing your error => 2 instances of fio can't run in parallel >> with ceph rdma. 
>> > >> > * with ceph -s shows HEALTH_WARN (with "9 requests are blocked > 32 >> > sec") >> > >> > * and with all osds printing messages like " heartbeat_check: no reply from >> ..." >> > >> > * And with log files contains errors: >> > $ grep error ceph-osd.0.log >> > 2016-11-23 09:20:46.988154 7f9b26260700 -1 Fail to open '/proc/0/cmdline' >> error = (2) No such file or directory >> > 2016-11-23 09:20:54.090388 7f9b43951700 1 -- 36.0.0.2:6802/10634 >> >> 36.0.0.4:0/19587 conn(0x7f9b256a8000 :6802 s=STATE_OPEN pgs=1 cs=1 >> l=1).read_bulk reading from fd=139 : Unknown error -104 >> > 2016-11-23 09:20:58.411912 7f9b44953700 1 RDMAStack polling work >> request returned error for buffer(0x7f9b1fee21b0) status(12:RETRY_EXC_ERR >> > 2016-11-23 09:20:58.411934 7f9b44953700 1 RDMAStack polling work >> > request returned error for buffer(0x7f9b553d20d0) >> > status(12:RETRY_EXC_ERR >> >> error is "IBV_WC_RETRY_EXC_ERR (12) - Transport Retry Counter >> Exceeded: The local transport timeout retry counter was exceeded while >> trying to send this message. This means that the remote side didn't send any >> Ack or Nack. If this happens when sending the first message, usually this mean >> that the connection attributes are wrong or the remote side isn't in a state >> that it can respond to messages. If this happens after sending the first >> message, usually it means that the remote QP isn't available anymore. >> Relevant for RC QPs." >> >> we set qp retry_cnt to 7 and timeout is 14 >> >> // How long to wait before retrying if packet lost or server dead. >> // Supposedly the timeout is 4.096us*2^timeout. However, the actual >> // timeout appears to be 4.096us*2^(timeout+1), so the setting >> // below creates a 135ms timeout. >> qpa.timeout = 14; >> >> // How many times to retry after timeouts before giving up. >> qpa.retry_cnt = 7; >> >> is this means the receiver side lack of memory or not polling work request >> ASAP? 
>> >> > >> > >> > >> > Command lines that I used: >> > ./fio --ioengine=rbd --invalidate=0 --rw=write --bs=128K --numjobs=1 -- >> clientname=admin --pool=rbd --iodepth=128 --rbdname=img2g --name=1 >> > ./fio --ioengine=rbd --invalidate=0 --rw=write --bs=128K --numjobs=1 >> > --clientname=admin --pool=rbd --iodepth=128 --rbdname=img2g2 --name=1 >> > >> > > -----Original Message----- >> > > From: Marov Aleksey >> > > Sent: Tuesday, November 22, 2016 17:59 >> > > >> > > I didn't try this blocksize. But in my case fio crushed if I use >> > > more than one job. With one job everything works fine. Is it worth more >> deep investigating? ^ permalink raw reply [flat|nested] 16+ messages in thread
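As a rough illustration of why 16 fio jobs can exhaust a default 1024-fd soft limit while 8 jobs do not, here is a back-of-the-envelope sketch. Only the two-fds-per-connection figure comes from the thread; the per-job connection count is a hypothetical stand-in, since the real number depends on the cluster (OSD count, monitors, librbd image watches, and so on).

```python
# Per Haomai's reply above: each RDMA messenger connection costs two fds
# (a tcp socket fd plus a linux eventfd).
FDS_PER_CONN = 2

def estimate_fds(num_jobs: int, conns_per_job: int = 30) -> int:
    """Very rough fd estimate for num_jobs fio/librbd client instances.

    conns_per_job is a hypothetical per-client connection count, NOT a
    number from the thread.
    """
    return num_jobs * conns_per_job * FDS_PER_CONN

print(estimate_fds(8))    # 480 - comfortably under a 1024 soft limit
print(estimate_fds(16))   # 960 - brushing against it even before counting
                          # the process's own log files, config fds, etc.
```

Under these assumed numbers the jump from 8 to 16 jobs is exactly what pushes the process past the default limit, which is consistent with the fix that closed the thread: raising the fd limits.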
end of thread, other threads:[~2016-12-07 8:58 UTC | newest] Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- [not found] <FEC85B105C5F644CA51BDB90657EEA9828425083@ddsm-mbx01.digdes.com> 2016-11-17 14:49 ` ceph issue Sage Weil 2016-11-18 7:19 ` Haomai Wang 2016-11-18 9:23 ` HA: " Marov Aleksey 2016-11-18 11:26 ` Haomai Wang 2016-11-20 13:21 ` Avner Ben Hanoch 2016-11-20 14:29 ` Avner Ben Hanoch 2016-11-21 10:40 ` Haomai Wang 2016-11-21 16:20 ` HA: " Marov Aleksey 2016-11-22 14:41 ` Avner Ben Hanoch 2016-11-22 15:59 ` HA: " Marov Aleksey 2016-11-23 9:30 ` Avner Ben Hanoch 2016-12-02 3:12 ` Haomai Wang 2016-12-05 9:37 ` Avner Ben Hanoch 2016-12-06 15:36 ` HA: " Marov Aleksey 2016-12-06 17:15 ` Haomai Wang 2016-12-07 8:57 ` HA: " Marov Aleksey