From: Linux Chips <linux.chips@gmail.com>
To: "ceph-devel@vger.kernel.org" <ceph-devel@vger.kernel.org>
Subject: send more reads on recovery
Date: Mon, 4 Sep 2017 15:09:23 +0300	[thread overview]
Message-ID: <4a5ef825-83bb-5e7c-7712-73c0a83dd4ac@gmail.com> (raw)

Hi,

since the last issue in our cluster:
https://marc.info/?t=150297924500005&r=1&w=2

we have had all sorts of shard files partially written to disk, due to lots of 
crashes or possibly other reasons we are not sure of. We reported it here:
http://tracker.ceph.com/issues/21173
and it is probably related to (or the same as) this one:
http://tracker.ceph.com/issues/14009
We hit exactly the same trace and the same assert.

We had thousands of them. We could have removed the corrupted shards 
manually, but that was not practical.

Having been hit by this before, I decided to try to patch it (my 
first time patching Ceph, so excuse my mistakes); any comments are 
welcome. I basically borrowed code from the redundant/fast-read block, 
and I am not sure whether I satisfied the "TODO" above it in the same function.


--- ceph-12.2.0/src/osd/ECBackend.cc    2017-08-28 12:30:20.000000000 -0400
+++ ceph-12.2.0/src/osd/ECBackend.cc    2017-09-01 08:55:38.209400108 -0400
@@ -402,10 +402,13 @@
      target[*i] = &(op.returned_data[*i]);
    }
    map<int, bufferlist> from;
+  uint64_t size = to_read.get<2>().begin()->second.length();
    for(map<pg_shard_t, bufferlist>::iterator i = to_read.get<2>().begin();
        i != to_read.get<2>().end();
        ++i) {
-    from[i->first.shard].claim(i->second);
+      if (size == i->second.length() ) {
+        from[i->first.shard].claim(i->second);
+      }
    }
    dout(10) << __func__ << ": " << from << dendl;
    int r = ECUtil::decode(sinfo, ec_impl, from, target);
@@ -1247,7 +1250,93 @@
   if (rop.in_progress.empty() || is_complete == rop.complete.size()) {
     dout(20) << __func__ << " Complete: " << rop << dendl;
     rop.trace.event("ec read complete");
-    complete_read_op(rop, m);
+
+    // send more reads if recovery needs them
+    bool size_mismatched = false;
+    bool sent_for_more = false;
+    for (map<hobject_t, read_result_t>::const_iterator iter =
+        rop.complete.begin();
+      iter != rop.complete.end();
+      ++iter) {
+      uint64_t total_data_size = iter->second.returned.front().get<2>().begin()->second.length();
+      set<int> have;
+      for (map<pg_shard_t, bufferlist>::const_iterator j =
+          iter->second.returned.front().get<2>().begin();
+        j != iter->second.returned.front().get<2>().end();
+        ++j) {
+
+       if ( j->second.length() == total_data_size ) {
+         have.insert(j->first.shard);
+       } else {
+         size_mismatched = true;
+         dout(0) << __func__ << " Checking Complete: [ERR] shard " << j->first.shard << " has mismatched size: " << j->second.length() << " != " << total_data_size << ". " << rop << dendl;
+       }
+      }
+
+      // one or more shards have different size. corrupted?! maybe a crashed node or oom kill.
+      set<int> want_to_read, dummy_minimum;
+      int err;
+      if (size_mismatched || true) {   // not sure if we always should check this, may be check for read op errors too
+//      if (size_mismatched) {
+        get_want_to_read_shards(&want_to_read);
+        if ((err = ec_impl->minimum_to_decode(want_to_read, have, &dummy_minimum)) < 0) {
+         dout(20) << __func__ << " minimum_to_decode failed, we only have shards " << have << dendl;
+          if (rop.in_progress.empty()) {
+         // If we don't have enough copies and we haven't sent reads for all shards
+         // we can send the rest of the reads, if any.
+           if (!rop.do_redundant_reads) {
+             dout(10) << __func__ << ": sending for more reads" << dendl;
+             int r = send_all_remaining_reads(iter->first, rop);
+             if (r == 0) {
+               sent_for_more = true;
+               continue;
+             }
+           // Couldn't read any additional shards so handle as completed with errors
+           }
+           // We don't want to confuse clients / RBD with objectstore error
+           // values in particular ENOENT.  We may have different error returns
+           // from different shards, so we'll return minimum_to_decode() error
+           // (usually EIO) to reader.  It is likely an error here is due to a
+           // damaged pg.
+           rop.complete[iter->first].r = err;
+         }
+        } else {
+         dout(20) << __func__ << " minimum_to_decode succeeded " << dummy_minimum << dendl;
+          assert(rop.complete[iter->first].r == 0);
+         if (!rop.complete[iter->first].errors.empty()) {
+           if (cct->_conf->osd_read_ec_check_for_errors) {
+             dout(10) << __func__ << ": Not ignoring errors, use one shard err=" << err << dendl;
+             err = rop.complete[iter->first].errors.begin()->second;
+              rop.complete[iter->first].r = err;
+             dout(10) << __func__ << ": Not ignoring errors 2, use one shard err=" << err << dendl;
+           } else {
+             get_parent()->clog_warn() << "Error(s) ignored for "
+                                      << iter->first << " enough copies available";
+             dout(10) << __func__ << " Error(s) ignored for " << iter->first << " enough copies available" << dendl;
+             rop.complete[iter->first].errors.clear();
+           }
+         }
+//       ++is_complete;
+        }
+      } else {
+       dout(10) << __func__ << " Checking Complete: all shards are the same size" << dendl;
+      }
+    }
+    if (!sent_for_more) {
+      complete_read_op(rop, m);
+    } else {
+      dout(10) << __func__ << " readop needed more reads: " << rop << dendl;
+    }
   } else {
     dout(10) << __func__ << " readop not complete: " << rop << dendl;
   }
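to make the intent of the first hunk easier to follow outside the OSD code, here is a 
simplified, self-contained sketch of the same idea; it uses plain std::map and 
std::string in place of pg_shard_t and bufferlist, which are stand-ins for illustration, 
not the real types:

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Take the length of the first returned shard as the reference size and
// only pass shards of exactly that length to the decoder, dropping
// partially written (truncated/corrupted) shards.
std::map<int, std::string> filter_by_first_size(
    const std::map<int, std::string>& returned) {
  std::map<int, std::string> from;
  if (returned.empty())
    return from;
  const std::size_t size = returned.begin()->second.length();
  for (const auto& [shard, data] : returned) {
    if (data.length() == size)  // skip shards whose size does not match
      from[shard] = data;
  }
  return from;
}
```

Note this inherits the same caveat as the patch: the first shard in the map is trusted 
as the reference, so a corrupted first shard poisons the comparison.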



The patch is not perfect; I had a few improvements in mind after 
deployment. But since I deployed it to our cluster and we do not 
want to interrupt the recovery (it has been recovering for the past 3 days 
with this patch), I will not have the chance to test it again, and I could 
not replicate the same error in a test cluster.
Truncating any file in our test cluster, or touching it in any way, 
generated a read error in the sub read op, while in the production cluster 
I can clearly see the sub read op reporting success with an incorrect 
size! If anyone has an idea of how to replicate this, I would very much 
like to finish this patch properly. If you need this in any other form, or 
as a pull request, I will be more than happy to do it.

The caveats we have seen in this patch so far: if the shard being 
compared against is the wrong/corrupted one, the op fails and we get a 
"failed_push" error; and if the primary OSD has a corrupted shard, 
reconstruction of the chunk fails and we get
FAILED assert(pop.data.length() == 
sinfo.aligned_logical_offset_to_chunk_offset( 
after_progress.data_recovered_to - op.recovery_progress.data_recovered_to))
but those cases are far fewer than having to deal with every shard.

Maybe it needs a more sophisticated way of choosing which shard to compare 
against (perhaps find the distinct sizes, count them, and take the size with 
the largest count).
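that majority-vote idea could look something like the following sketch; this is a 
hypothetical helper, not code from the patch, and the shard/size map type is a 
simplification for illustration:

```cpp
#include <cassert>
#include <cstddef>
#include <map>

// Hypothetical helper: count how often each shard length occurs and return
// the most common one, so that a single corrupted shard cannot become the
// reference size that all the other shards are compared against.
std::size_t majority_size(const std::map<int, std::size_t>& shard_sizes) {
  std::map<std::size_t, int> counts;
  for (const auto& [shard, len] : shard_sizes)
    ++counts[len];
  std::size_t best = 0;
  int best_count = 0;
  for (const auto& [len, count] : counts) {
    if (count > best_count) {  // keep the size with the largest count
      best_count = count;
      best = len;
    }
  }
  return best;
}
```

The patch would then build the "have" set from shards matching majority_size() 
instead of the first shard's length.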

thanks
ali


Thread overview: 3+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-09-04 12:09 Linux Chips [this message]
2017-09-06 10:14 ` send more reads on recovery Linux Chips
2017-09-06 13:25   ` Sage Weil
