* Segfault when connecting to cluster using Rados API (problem with pick_a_shard()?)
@ 2017-12-18 14:28 Eric Cano
  2017-12-18 14:44 ` Sage Weil
  0 siblings, 1 reply; 3+ messages in thread
From: Eric Cano @ 2017-12-18 14:28 UTC (permalink / raw)
  To: ceph-devel

Hi everyone,

We are experiencing segfaults when connecting to the Rados cluster from our application. The problem first appeared when switching from 12.2.0 to 12.2.2. We downgraded to 12.2.1, which seemed to work better and helped for some time, but we now hit the problem in 12.2.1 as well. The crash below is from 12.2.1.

The crash occurred in a command-line tool of our application, so the context is rather simple. The segfault happens in a Rados thread, where pick_a_shard() apparently returned a bad address:

#0  operator+= (__i=1, this=0x8345dbd0a008) at /usr/include/c++/4.8.2/bits/atomic_base.h:420
#1  mempool::pool_t::adjust_count (this=0x8345dbd0a000, items=items@entry=1, bytes=bytes@entry=4008) at /usr/src/debug/ceph-12.2.1/src/common/mempool.cc:85
#2  0x00007f544efa0755 in reassign_to_mempool (this=<optimized out>, this=<optimized out>, pool=1026552624) at /usr/src/debug/ceph-12.2.1/src/common/buffer.cc:207
#3  ceph::buffer::list::append (this=this@entry=0x7f543d2ff330, data=0xf64d89 "", len=len@entry=272) at /usr/src/debug/ceph-12.2.1/src/common/buffer.cc:1915
#4  0x00007f54443c7976 in AsyncConnection::_process_connection (this=this@entry=0xf61740) at /usr/src/debug/ceph-12.2.1/src/msg/async/AsyncConnection.cc:962
#5  0x00007f54443ca8a8 in AsyncConnection::process (this=0xf61740) at /usr/src/debug/ceph-12.2.1/src/msg/async/AsyncConnection.cc:838
#6  0x00007f54443dceb9 in EventCenter::process_events (this=this@entry=0xf1dda0, timeout_microseconds=<optimized out>, timeout_microseconds@entry=30000000, working_dur=working_dur@entry=0x7f543d2ffaf0) at /usr/src/debug/ceph-12.2.1/src/msg/async/Event.cc:409
#7  0x00007f54443e05ee in NetworkStack::__lambda4::operator() (__closure=0xf4af60) at /usr/src/debug/ceph-12.2.1/src/msg/async/Stack.cc:51
#8  0x00007f544d5ed2b0 in std::(anonymous namespace)::execute_native_thread_routine (__p=<optimized out>) at ../../../../../libstdc++-v3/src/c++11/thread.cc:84
#9  0x00007f544e16ee25 in start_thread (arg=0x7f543d301700) at pthread_create.c:308
#10 0x00007f544cd5534d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

(gdb) frame 1
#1  mempool::pool_t::adjust_count (this=0x8345dbd0a000, items=items@entry=1, bytes=bytes@entry=4008) at /usr/src/debug/ceph-12.2.1/src/common/mempool.cc:85
85        shard->items += items;
(gdb) l
80      }
81
82      void mempool::pool_t::adjust_count(ssize_t items, ssize_t bytes)
83      {
84        shard_t *shard = pick_a_shard();
85        shard->items += items;
86        shard->bytes += bytes;
87      }
88
89      void mempool::pool_t::get_stats(
(gdb) p shard
$1 = (mempool::shard_t *) 0x8345dbd0a000
(gdb) p *shard
Cannot access memory at address 0x8345dbd0a000
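
For context, pick_a_shard() simply hashes the calling thread's id into a fixed array of shards embedded in the pool object, so a caller and a library that disagree on the pool layout will compute a pointer outside the real object. A simplified sketch of the mechanism (illustrative only, not the actual Ceph code; the shard count and the hashing are placeholders):

#include <atomic>
#include <cstddef>
#include <pthread.h>
#include <sys/types.h>   // ssize_t

// Simplified model of the mempool sharding (the real shard count,
// cache-line padding and hashing live in include/mempool.h).
struct shard_t {
  std::atomic<ssize_t> items{0};
  std::atomic<ssize_t> bytes{0};
};

struct pool_t {
  static const size_t num_shards = 64;   // placeholder value
  shard_t shard[num_shards];

  shard_t *pick_a_shard() {
    // Hash the thread id into the shard array. If the caller and the
    // library disagree on sizeof(shard_t) or num_shards, the resulting
    // pointer lands outside the object -- as in the crash above.
    size_t me = (size_t)pthread_self();
    return &shard[(me >> 3) % num_shards];
  }

  void adjust_count(ssize_t items, ssize_t bytes) {
    shard_t *s = pick_a_shard();
    s->items += items;   // the faulting line in frame 1
    s->bytes += bytes;
  }
};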

The user/main thread is as follows (the listing is for frame 4):

(gdb) thread 7
[Switching to thread 7 (Thread 0x7f545004e9c0 (LWP 31308))]
#0  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
238     62:     movq    %rax, %r14
(gdb) bt
#0  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#1  0x00007f54442a2e3c in WaitUntil (when=..., mutex=..., this=0xe74190) at /usr/src/debug/ceph-12.2.1/src/common/Cond.h:64
#2  MonClient::authenticate (this=this@entry=0xe73d58, timeout=300) at /usr/src/debug/ceph-12.2.1/src/mon/MonClient.cc:464
#3  0x00007f544ef9058c in librados::RadosClient::connect (this=0xe73d10) at /usr/src/debug/ceph-12.2.1/src/librados/RadosClient.cc:299
#4  0x00007f544f73a1ee in cta::objectstore::BackendRados::BackendRados (this=0xe08680, logger=..., userId="eoscta", pool="eoscta_metadata", radosNameSpace="cta-ns") at /usr/src/debug/cta-0.0-85/objectstore/BackendRados.cpp:100
#5  0x00007f544f75eabc in cta::objectstore::BackendFactory::createBackend (URL="rados://eoscta@eoscta_metadata:cta-ns", logger=...) at /usr/src/debug/cta-0.0-85/objectstore/BackendFactory.cpp:42
#6  0x000000000041024e in main (argc=2, argv=0x7ffdde466e78) at /usr/src/debug/cta-0.0-85/objectstore/cta-objectstore-dump-object.cpp:44

(gdb) l -
75      #define TIMESTAMPEDPRINT(A)
76      #define NOTIFYLOCKED()
77      #define NOTIFYRELEASED()
78      #endif
79
80      namespace cta { namespace objectstore {
81
82      cta::threading::Mutex BackendRados::RadosTimeoutLogger::g_mutex;
83
84      BackendRados::BackendRados(log::Logger & logger, const std::string & userId, const std::string & pool,
(gdb) l
85        const std::string &radosNameSpace) :
86      m_user(userId), m_pool(pool), m_namespace(radosNameSpace), m_cluster(), m_radosCtxPool() {
87        log::LogContext lc(logger);
88        cta::exception::Errnum::throwOnReturnedErrno(-m_cluster.init(userId.c_str()),
89            "In ObjectStoreRados::ObjectStoreRados, failed to m_cluster.init");
90        try {
91          RadosTimeoutLogger rtl;
92          cta::exception::Errnum::throwOnReturnedErrno(-m_cluster.conf_read_file(NULL),
93              "In ObjectStoreRados::ObjectStoreRados, failed to m_cluster.conf_read_file");
94          rtl.logIfNeeded("In BackendRados::BackendRados(): m_cluster.conf_read_file()", "no object");
(gdb) l
95          rtl.reset();
96          cta::exception::Errnum::throwOnReturnedErrno(-m_cluster.conf_parse_env(NULL),
97              "In ObjectStoreRados::ObjectStoreRados, failed to m_cluster.conf_parse_env");
98          rtl.logIfNeeded("In BackendRados::BackendRados(): m_cluster.conf_parse_env()", "no object");
99          rtl.reset();
100         cta::exception::Errnum::throwOnReturnedErrno(-m_cluster.connect(),
101             "In ObjectStoreRados::ObjectStoreRados, failed to m_cluster.connect");
102         rtl.logIfNeeded("In BackendRados::BackendRados(): m_cluster.connect()", "no object");
103         // Create the connection pool. One per CPU hardware thread.
104         for (size_t i=0; i<std::thread::hardware_concurrency(); i++) {
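
For reference, the constructor boils down to the standard librados connection sequence; here is a minimal standalone sketch of the same steps (same user, pool and namespace as our setup; error handling reduced to plain exceptions):

#include <rados/librados.hpp>
#include <stdexcept>

// Minimal equivalent of the connection sequence above.
int main() {
  librados::Rados cluster;
  if (cluster.init("eoscta") < 0)            // connects as client.eoscta
    throw std::runtime_error("init failed");
  if (cluster.conf_read_file(NULL) < 0)      // default ceph.conf search path
    throw std::runtime_error("conf_read_file failed");
  if (cluster.conf_parse_env(NULL) < 0)      // honours CEPH_ARGS
    throw std::runtime_error("conf_parse_env failed");
  if (cluster.connect() < 0)                 // where the async thread crashes
    throw std::runtime_error("connect failed");
  librados::IoCtx ioctx;
  if (cluster.ioctx_create("eoscta_metadata", ioctx) < 0)
    throw std::runtime_error("ioctx_create failed");
  ioctx.set_namespace("cta-ns");
  return 0;
}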

Are we doing something wrong, or is there a bug somewhere in Rados?

Thanks for any help,

Eric Cano



* Re: Segfault when connecting to cluster using Rados API (problem with pick_a_shard()?)
  2017-12-18 14:28 Segfault when connecting to cluster using Rados API (problem with pick_a_shard()?) Eric Cano
@ 2017-12-18 14:44 ` Sage Weil
  2017-12-18 15:56   ` Eric Cano
  0 siblings, 1 reply; 3+ messages in thread
From: Sage Weil @ 2017-12-18 14:44 UTC (permalink / raw)
  To: Eric Cano; +Cc: ceph-devel


On Mon, 18 Dec 2017, Eric Cano wrote:
> Hi everyone,
> 
> We are experiencing segfaults when connecting to the Rados cluster from 
> our application. The problem first appeared when switching from 12.2.0 
> to 12.2.2. We downgraded to 12.2.1, which seemed to work better and 
> helped for some time, but we now hit the problem in 12.2.1 as well. The 
> crash below is from 12.2.1.

It looks/sounds like the ABI for C++ linkage broke between the point 
releases.  This is really easy to trigger, unfortunately, due to the 
design of the C++ interface.  A recompile of the application 
against the updated headers should fix it.
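
The mechanism, in a nutshell: the application bakes the old object layout into its own inlined code, while the updated library reads and writes the same object at new offsets. A generic sketch of that failure mode (not the actual Ceph classes; the namespaces stand in for the two header versions):

#include <cstdio>

// Two versions of the "same" class: v1 is what the application was compiled
// against, v2 is what the updated shared library was built from. In reality
// they are one header that changed between point releases; the namespaces
// only exist so this illustration compiles as a single file.
namespace v1 { struct pool { long counters[4]; }; }
namespace v2 { struct pool { long scale; long counters[8]; }; }

// Stands in for code inside the new library, which assumes the v2 layout.
void lib_bump(v2::pool *p, int i) { p->counters[i]++; }

int main() {
  v1::pool p{};  // the application allocates with the v1 layout
  // The application hands its v1 object to the library, which then reads
  // and writes at v2 offsets -- past the end of the allocation. The link
  // succeeds because the mangled symbol names did not change.
  lib_bump(reinterpret_cast<v2::pool *>(&p), 7);
  std::printf("undefined behaviour: silent corruption or a segfault\n");
  return 0;
}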

I see two problematic commits:
 2ef222a58c3801eaac5a6d52dda2de1ffe37407b (mempool change)
 0048e6a58c7cdf3b3d98df575bc47db8397cd5a9 (buffer::list change)

I pushed a branch wip-abi-luminous to 
	https://shaman.ceph.com/builds/ceph/

You can either try that build and see if it fixes it, and/or rebuild your 
application.  Please let us know if either works!

Thanks-
sage




* RE: Segfault when connecting to cluster using Rados API (problem with pick_a_shard()?)
  2017-12-18 14:44 ` Sage Weil
@ 2017-12-18 15:56   ` Eric Cano
  0 siblings, 0 replies; 3+ messages in thread
From: Eric Cano @ 2017-12-18 15:56 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

Hi Sage,

Thanks for the tip. That was indeed the source of the problem. Moving everything to 12.2.2, including recompiling our application, solved the issue.

Would it be possible to bump the library's major version number each time the ABI changes (or even to generate a new one on each build)? RPM relies on it to generate package dependencies, and because it did not change, the mismatch slipped through:

# rpm -q cta-lib -R | grep rados
...
librados.so.2()(64bit)
...

#  ls -l /usr/lib64/librados*
lrwxrwxrwx. 1 root root      17 Dec 18 15:36 /usr/lib64/librados.so.2 -> librados.so.2.0.0
-rwxr-xr-x. 1 root root 1522232 Nov 30 17:15 /usr/lib64/librados.so.2.0.0
lrwxrwxrwx. 1 root root      24 Dec 18 15:37 /usr/lib64/libradosstriper.so.1 -> libradosstriper.so.1.0.0
-rwxr-xr-x. 1 root root 1066440 Nov 30 17:15 /usr/lib64/libradosstriper.so.1.0.0
lrwxrwxrwx. 1 root root      20 Dec 18 15:36 /usr/lib64/librados_tp.so.2 -> librados_tp.so.2.0.0
-rwxr-xr-x. 1 root root  878464 Nov 30 17:15 /usr/lib64/librados_tp.so.2.0.0

That would be really helpful for admins.
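
If the version were bumped on every ABI change, applications could also guard against skew at runtime with the version call librados already exposes. A sketch (it only catches a mismatch when the version numbers actually change between releases, which is precisely what we are asking for):

#include <rados/librados.hpp>
#include <cstdio>
#include <cstdlib>

// Refuse to run if the loaded librados differs from the headers the binary
// was built with. LIBRADOS_VER_* are compile-time macros from librados.h;
// Rados::version() reports the shared library's own version at runtime.
static void check_librados_version() {
  int major = 0, minor = 0, extra = 0;
  librados::Rados::version(&major, &minor, &extra);
  if (major != LIBRADOS_VER_MAJOR || minor != LIBRADOS_VER_MINOR ||
      extra != LIBRADOS_VER_EXTRA) {
    std::fprintf(stderr,
                 "librados skew: built against %d.%d.%d, loaded %d.%d.%d\n",
                 LIBRADOS_VER_MAJOR, LIBRADOS_VER_MINOR, LIBRADOS_VER_EXTRA,
                 major, minor, extra);
    std::abort();
  }
}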

Thanks for the quick answer!
Eric

> -----Original Message-----
> From: Sage Weil [mailto:sage@newdream.net]
> Sent: Monday, December 18, 2017 15:45
> To: Eric Cano <Eric.Cano@cern.ch>
> Cc: ceph-devel@vger.kernel.org
> Subject: Re: Segfault when connecting to cluster using Rados API (problem with pick_a_shard()?)
> 
> It looks/sounds like the ABI for C++ linkage broke between the point
> releases.  This is really easy to trigger, unfortunately, due to the
> design of the C++ interface.  A recompile of the application
> against the updated headers should fix it.
> 
> I see two problematic commits:
>  2ef222a58c3801eaac5a6d52dda2de1ffe37407b (mempool change)
>  0048e6a58c7cdf3b3d98df575bc47db8397cd5a9 (buffer::list change)
> 
> I pushed a branch wip-abi-luminous to
> 	https://shaman.ceph.com/builds/ceph/
> 
> You can either try that build and see if it fixes it, and/or rebuild your
> application.  Please let us know if either works!
> 
> Thanks-
> sage

