From: Coly Li <colyli@suse.de>
To: linux-bcache@vger.kernel.org
Cc: linux-block@vger.kernel.org, Coly Li <colyli@suse.de>
Subject: [PATCH v2 00/12] bcache: device failure handling improvement
Date: Sun, 14 Jan 2018 01:10:38 +0800
Message-ID: <20180113171050.22467-1-colyli@suse.de>

Hi maintainers and folks,

This patch set tries to improve bcache device failure handling, including
cache device and backing device failures.

The basic idea for handling a failed cache device is:
- Unregister the cache set
- Detach all backing devices attached to this cache set
- Stop all bcache devices linked to this cache set
I call the above process 'cache set retire'. After a cache set is retired,
the cache set and its bcache devices are all removed, and subsequent I/O
requests fail immediately to notify the upper layer or user space code
that the cache device has failed or is disconnected.
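
The retire sequence above can be modelled in a few lines of user-space C.
This is only a sketch under stated assumptions, not the code from the
patches: struct cache_set_model, struct bdev_model, model_cache_set_retire()
and model_submit_io() are hypothetical names, and the io_disable field
merely stands in for the CACHE_SET_IO_DISABLE flag named in patch 07.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAX_DEVS 4

/* Hypothetical stand-in for a bcache device on top of a backing device. */
struct bdev_model {
	bool attached;	/* still attached to a cache set */
	bool stopped;	/* bcache device has been stopped */
};

/* Hypothetical stand-in for struct cache_set. */
struct cache_set_model {
	bool io_disable;	/* models the CACHE_SET_IO_DISABLE flag */
	bool registered;
	struct bdev_model *devs[MODEL_MAX_DEVS];
};

/* Retire: unregister the cache set, then detach and stop every device. */
static void model_cache_set_retire(struct cache_set_model *c)
{
	size_t i;

	assert(c != NULL);
	c->io_disable = true;
	c->registered = false;

	for (i = 0; i < MODEL_MAX_DEVS; i++) {
		struct bdev_model *d = c->devs[i];

		if (!d)
			continue;
		d->attached = false;
		d->stopped = true;
		c->devs[i] = NULL;
	}
}

/* After retire, I/O fails immediately instead of being queued. */
static int model_submit_io(const struct cache_set_model *c,
			   const struct bdev_model *d)
{
	if (c->io_disable || d->stopped)
		return -EIO;
	return 0;
}
```

In this model a caller sees model_submit_io() return 0 before the retire
and -EIO afterwards, which is the "fail immediately" behaviour described
above.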

For a failed backing device, there are two ways to handle it:
- If the device is disconnected, when the kernel thread
  dc->status_update_thread finds it has been offline for
  BACKING_DEV_OFFLINE_TIMEOUT (5) seconds, the kernel thread sets
  dc->io_disable and calls bcache_device_stop() to stop the bcache device
  and remove it from the system.
- If the device is connected but too many I/O errors happen, once the
  error count exceeds dc->error_limit, bch_cached_dev_error() is called
  to set dc->io_disable and stop the bcache device. The broken backing
  device and its bcache device are then removed from the system.
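
The two backing device paths can be sketched as a small user-space model.
Again this is an illustration under assumptions rather than the patch code:
struct cached_dev_model and the model_*() helpers are hypothetical, the
status thread is assumed to poll once per second, and only the timeout
value (5) and the "exceeds error_limit" condition come from the text above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Value quoted in the cover letter. */
#define BACKING_DEV_OFFLINE_TIMEOUT 5

/* Hypothetical, simplified stand-in for struct cached_dev. */
struct cached_dev_model {
	bool io_disable;		/* device declared failed */
	bool stopped;			/* models bcache_device_stop() */
	unsigned int offline_seconds;	/* consecutive seconds offline */
	unsigned int io_errors;
	unsigned int error_limit;
};

/* Common tail of both paths: disable I/O and stop the bcache device. */
static void model_fail_device(struct cached_dev_model *dc)
{
	dc->io_disable = true;
	dc->stopped = true;
}

/* Path 1: one pass of the status thread (assumed one pass per second). */
static void model_status_update_tick(struct cached_dev_model *dc, bool online)
{
	assert(dc != NULL);
	if (dc->io_disable)
		return;
	if (online) {
		dc->offline_seconds = 0;	/* device came back, reset */
		return;
	}
	if (++dc->offline_seconds >= BACKING_DEV_OFFLINE_TIMEOUT)
		model_fail_device(dc);
}

/* Path 2: called once per failed I/O on a still-connected device. */
static void model_count_io_error(struct cached_dev_model *dc)
{
	assert(dc != NULL);
	if (dc->io_disable)
		return;
	if (++dc->io_errors > dc->error_limit)
		model_fail_device(dc);
}
```

Resetting the offline counter when the device reappears is a choice made
for this sketch; whether the real thread does the same is not stated here.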

The v2 patch set fixes the problems raised in the v1 patch reviews and adds
failure handling for backing devices. It also includes a patch from
Junhui Tang, and it omits 2 patches which are already in bcache-for-next.

Basic testing covered writethrough, writeback and writearound modes with
read, write and read/write workloads; the cache set or bcache device can be
removed either by too many I/O errors or by deleting the device. For
physically unplugging disks, a kernel bug triggers an rcu oops in
__do_softirq() and locks up all following accesses to the disconnected
disk, which blocks my testing.

While v2 is posted for review, I am continuing to test the code on my side.
Any comments, questions and reviews are warmly welcome.

Open issues:
1, Detaching a backing device by writing to the sysfs detach file does not
   work, because the writeback thread does not drop the dc->count refcount
   when the cache device turns from dirty to clean. This issue will be
   fixed in the v3 patch set.
2, A kernel bug in __do_softirq() when unplugging a hard disk under heavy
   I/O blocks my physical disk disconnection test. If anyone knows about
   this bug, please give me a hint.

Changelog:
v2: fix all problems found in v1 review.
    add patches to handle backing device failure.
    add one more patch to set writeback_rate_update_seconds range.
    include a patch from Junhui Tang.
v1: initial version, only handles cache device failure.

Coly Li (11):
  bcache: set writeback_rate_update_seconds in range [1, 60] seconds
  bcache: properly set task state in bch_writeback_thread()
  bcache: set task properly in allocator_wait()
  bcache: fix cached_dev->count usage for bch_cache_set_error()
  bcache: stop dc->writeback_rate_update properly
  bcache: set error_limit correctly
  bcache: add CACHE_SET_IO_DISABLE to struct cache_set flags
  bcache: stop all attached bcache devices for a retired cache set
  bcache: add backing_request_endio() for bi_end_io of attached backing
    device I/O
  bcache: add io_disable to struct cached_dev
  bcache: stop bcache device when backing device is offline

Tang Junhui (1):
  bcache: fix inaccurate io state for detached bcache devices

 drivers/md/bcache/alloc.c     |   5 +-
 drivers/md/bcache/bcache.h    |  37 ++++++++-
 drivers/md/bcache/btree.c     |  10 ++-
 drivers/md/bcache/io.c        |  16 +++-
 drivers/md/bcache/journal.c   |   4 +-
 drivers/md/bcache/request.c   | 188 +++++++++++++++++++++++++++++++++++-------
 drivers/md/bcache/super.c     | 134 ++++++++++++++++++++++++++++--
 drivers/md/bcache/sysfs.c     |  45 +++++++++-
 drivers/md/bcache/util.h      |   6 --
 drivers/md/bcache/writeback.c |  79 +++++++++++++++---
 drivers/md/bcache/writeback.h |   5 +-
 11 files changed, 458 insertions(+), 71 deletions(-)

-- 
2.15.1

