linux-lvm.redhat.com archive mirror
* [linux-lvm] resend patch - bcache may mistakenly write data to another disk when writes error
@ 2019-10-22  9:47 Heming Zhao
  2019-10-23 21:31 ` Joe Thornber
  0 siblings, 1 reply; 15+ messages in thread
From: Heming Zhao @ 2019-10-22  9:47 UTC (permalink / raw)
  To: linux-lvm, David Teigland

Hello List & David,

This patch follows up on the earlier thread:
[linux-lvm] pvresize will cause a meta-data corruption with error message "Error writing device at 4096 length 512"

I sent it to our customer and the code ran as expected. I think this patch is enough to fix the issue.

Thanks
zhm

------(patch for branch stable-2.02) ----------
 From d0d77d0bdad6136c792c9664444d73dd47b809cb Mon Sep 17 00:00:00 2001
From: Zhao Heming <heming.zhao@suse.com>
Date: Tue, 22 Oct 2019 17:22:17 +0800
Subject: [PATCH] bcache may mistakenly write data to another disk when writes
  error

When a bcache write fails, the errored fd and its data are kept on
cache->errored, and the fd is then closed. lvm later reuses the closed
fd number for newly opened devices, but the data tied to the old fd is
still on cache->errored and flagged BF_DIRTY, so that data may
mistakenly be written to another disk.

Signed-off-by: Zhao Heming <heming.zhao@suse.com>
---
  .gitignore                    |  2 +-
  lib/cache/lvmcache.c          |  2 +-
  lib/device/bcache.c           | 44 ++++++++++++++++++++++++++++++-------------
  lib/format_text/format-text.c | 11 ++++-------
  lib/label/label.c             | 29 ++++++++++++++--------------
  lib/metadata/mirror.c         |  1 -
  6 files changed, 52 insertions(+), 37 deletions(-)

diff --git a/.gitignore b/.gitignore
index f51bb67fca..57bb007005 100644
--- a/.gitignore
+++ b/.gitignore
@@ -28,7 +28,7 @@ make.tmpl
  /config.log
  /config.status
  /configure.scan
-/cscope.out
+/cscope.*
  /tags
  /tmp/
  
diff --git a/lib/cache/lvmcache.c b/lib/cache/lvmcache.c
index 9890325d2e..9c6e8032d6 100644
--- a/lib/cache/lvmcache.c
+++ b/lib/cache/lvmcache.c
@@ -1429,7 +1429,7 @@ int lvmcache_label_rescan_vg(struct cmd_context *cmd, const char *vgname, const
   * incorrectly placed PVs should have been moved from the orphan vginfo
   * onto their correct vginfo's, and the orphan vginfo should (in theory)
   * represent only real orphan PVs.  (Note: if lvmcache_label_scan is run
- * after vg_read udpates to lvmcache state, then the lvmcache will be
+ * after vg_read updates to lvmcache state, then the lvmcache will be
   * incorrect again, so do not run lvmcache_label_scan during the
   * processing phase.)
   *
diff --git a/lib/device/bcache.c b/lib/device/bcache.c
index d487ca2a77..f0fe07f921 100644
--- a/lib/device/bcache.c
+++ b/lib/device/bcache.c
@@ -293,6 +293,10 @@ static bool _async_issue(struct io_engine *ioe, enum dir d, int fd,
  
  	if (r < 0) {
  		_cb_free(e->cbs, cb);
+		((struct block *)context)->error = r;
+		log_warn("io_submit <%c> off %llu bytes %llu return %d:%s",
+			(d == DIR_READ) ? 'R' : 'W', (long long unsigned)offset,
+			(long long unsigned)nbytes, r, strerror(-r));
  		return false;
  	}
  
@@ -869,7 +873,7 @@ static void _complete_io(void *context, int err)
  
  	if (b->error) {
  		dm_list_add(&cache->errored, &b->list);
-
+		log_warn("_complete_io fd: %d error: %d", b->fd, err);
  	} else {
  		_clear_flags(b, BF_DIRTY);
  		_link_block(b);
@@ -896,8 +900,7 @@ static void _issue_low_level(struct block *b, enum dir d)
  	dm_list_move(&cache->io_pending, &b->list);
  
  	if (!cache->engine->issue(cache->engine, d, b->fd, sb, se, b->data, b)) {
-		/* FIXME: if io_submit() set an errno, return that instead of EIO? */
-		_complete_io(b, -EIO);
+		_complete_io(b, b->error);
  		return;
  	}
  }
@@ -921,16 +924,20 @@ static bool _wait_io(struct bcache *cache)
   * High level IO handling
   *--------------------------------------------------------------*/
  
-static void _wait_all(struct bcache *cache)
+static bool _wait_all(struct bcache *cache)
  {
+	bool ret = true;
  	while (!dm_list_empty(&cache->io_pending))
-		_wait_io(cache);
+		ret = _wait_io(cache);
+	return ret;
  }
  
-static void _wait_specific(struct block *b)
+static bool _wait_specific(struct block *b)
  {
+	bool ret = true;
  	while (_test_flags(b, BF_IO_PENDING))
-		_wait_io(b->cache);
+		ret = _wait_io(b->cache);
+	return ret;
  }
  
  static unsigned _writeback(struct bcache *cache, unsigned count)
@@ -1290,10 +1297,7 @@ void bcache_put(struct block *b)
  
  bool bcache_flush(struct bcache *cache)
  {
-	// Only dirty data is on the errored list, since bad read blocks get
-	// recycled straight away.  So we put these back on the dirty list, and
-	// try and rewrite everything.
-	dm_list_splice(&cache->dirty, &cache->errored);
+	bool write_ret = true, wait_ret = true;
  
  	while (!dm_list_empty(&cache->dirty)) {
  		struct block *b = dm_list_item(_list_pop(&cache->dirty), struct block);
@@ -1303,11 +1307,16 @@ bool bcache_flush(struct bcache *cache)
  		}
  
  		_issue_write(b);
+		if (b->error) write_ret = false;
  	}
  
-	_wait_all(cache);
+	wait_ret = _wait_all(cache);
  
-	return dm_list_empty(&cache->errored);
+	if (write_ret == false || wait_ret == false ||
+			!dm_list_empty(&cache->errored))
+		return false;
+	else
+		return true;
  }
  
  /*
@@ -1352,6 +1361,15 @@ bool bcache_invalidate_fd(struct bcache *cache, int fd)
  	struct block *b, *tmp;
  	bool r = true;
  
+	// clean cache->errored list
+	dm_list_iterate_items_safe (b, tmp, &cache->errored) {
+		b->flags = 0;
+		b->index = 0;
+		b->ref_count = 0;
+		b->error = 0;
+		_recycle_block(cache, b);
+	}
+
  	// Start writing back any dirty blocks on this fd.
  	dm_list_iterate_items_safe (b, tmp, &cache->dirty)
  		if (b->fd == fd)
diff --git a/lib/format_text/format-text.c b/lib/format_text/format-text.c
index f39051cf02..b2221a21b0 100644
--- a/lib/format_text/format-text.c
+++ b/lib/format_text/format-text.c
@@ -402,8 +402,7 @@ static int _raw_write_mda_header(const struct format_type *fmt,
  	dev_set_last_byte(dev, start_byte + MDA_HEADER_SIZE);
  
  	if (!dev_write_bytes(dev, start_byte, MDA_HEADER_SIZE, mdah)) {
-		dev_unset_last_byte(dev);
-		log_error("Failed to write mda header to %s fd %d", dev_name(dev), dev->bcache_fd);
+		log_error("Failed to write mda header to %s", dev_name(dev));
  		return 0;
  	}
  	dev_unset_last_byte(dev);
@@ -690,8 +689,7 @@ static int _vg_write_raw(struct format_instance *fid, struct volume_group *vg,
  	if (!dev_write_bytes(mdac->area.dev, mdac->area.start + mdac->rlocn.offset,
  		                (size_t) (mdac->rlocn.size - new_wrap),
  		                fidtc->raw_metadata_buf)) {
-		log_error("Failed to write metadata to %s fd %d", dev_name(mdac->area.dev), mdac->area.dev->bcache_fd);
-		dev_unset_last_byte(mdac->area.dev);
+		log_error("Failed to write metadata to %s", dev_name(mdac->area.dev));
  		goto out;
  	}
  
@@ -704,8 +702,7 @@ static int _vg_write_raw(struct format_instance *fid, struct volume_group *vg,
  		if (!dev_write_bytes(mdac->area.dev, mdac->area.start + MDA_HEADER_SIZE,
  			                (size_t) new_wrap,
  			                fidtc->raw_metadata_buf + mdac->rlocn.size - new_wrap)) {
-			log_error("Failed to write metadata wrap to %s fd %d", dev_name(mdac->area.dev), mdac->area.dev->bcache_fd);
-			dev_unset_last_byte(mdac->area.dev);
+			log_error("Failed to write metadata wrap to %s", dev_name(mdac->area.dev));
  			goto out;
  		}
  	}
@@ -723,7 +720,7 @@ static int _vg_write_raw(struct format_instance *fid, struct volume_group *vg,
  
  	r = 1;
  
-      out:
+out:
  	if (!r) {
  		dm_free(fidtc->raw_metadata_buf);
  		fidtc->raw_metadata_buf = NULL;
diff --git a/lib/label/label.c b/lib/label/label.c
index e4a106854f..576e1baaf4 100644
--- a/lib/label/label.c
+++ b/lib/label/label.c
@@ -216,7 +216,7 @@ int label_write(struct device *dev, struct label *label)
  
  	if (!dev_write_bytes(dev, offset, LABEL_SIZE, buf)) {
  		log_debug_devs("Failed to write label to %s", dev_name(dev));
-		r = 0;
+		return 0;
  	}
  
  	dev_unset_last_byte(dev);
@@ -621,7 +621,6 @@ static int _scan_list(struct cmd_context *cmd, struct dev_filter *f,
  	int submit_count;
  	int scan_failed;
  	int is_lvm_device;
-	int error;
  	int ret;
  
  	dm_list_init(&wait_devs);
@@ -668,12 +667,12 @@ static int _scan_list(struct cmd_context *cmd, struct dev_filter *f,
  
  	dm_list_iterate_items_safe(devl, devl2, &wait_devs) {
  		bb = NULL;
-		error = 0;
  		scan_failed = 0;
  		is_lvm_device = 0;
  
  		if (!bcache_get(scan_bcache, devl->dev->bcache_fd, 0, 0, &bb)) {
-			log_debug_devs("Scan failed to read %s error %d.", dev_name(devl->dev), error);
+			log_debug_devs("Scan failed to read %s fd: %d bb %p error %d.",
+					dev_name(devl->dev), devl->dev->bcache_fd, bb, bb ? bb->error : 0);
  			scan_failed = 1;
  			scan_read_errors++;
  			scan_failed_count++;
@@ -992,11 +991,8 @@ int label_scan_pvscan_all(struct cmd_context *cmd, struct dm_list *scan_devs)
  	}
  
  	while ((dev = dev_iter_get(iter))) {
-		if (!(devl = dm_zalloc(sizeof(*devl)))) {
-			log_error("Failed to allocated device list.");
-			dev_iter_destroy(iter);
+		if (!(devl = dm_zalloc(sizeof(*devl))))
  			return 0;
-		}
  		devl->dev = dev;
  		dm_list_add(&all_devs, &devl->list);
  
@@ -1021,7 +1017,7 @@ int label_scan_pvscan_all(struct cmd_context *cmd, struct dm_list *scan_devs)
  
  	if (!scan_bcache) {
  		if (!_setup_bcache(dm_list_size(&all_devs)))
-			return_0;
+			return 0;
  	}
  
  	_scan_list(cmd, cmd->lvmetad_filter, &all_devs, NULL);
@@ -1358,7 +1354,8 @@ bool dev_write_bytes(struct device *dev, uint64_t start, size_t len, void *data)
  
  	if (!scan_bcache) {
  		/* Should not happen */
-		log_error("dev_write bcache not set up %s", dev_name(dev));
+		log_error("dev_write bcache not set up %s fd: %d", dev_name(dev),
+					dev->bcache_fd);
  		return false;
  	}
  
@@ -1383,15 +1380,19 @@ bool dev_write_bytes(struct device *dev, uint64_t start, size_t len, void *data)
  	}
  
  	if (!bcache_write_bytes(scan_bcache, dev->bcache_fd, start, len, data)) {
-		log_error("Error writing device %s at %llu length %u.",
-			  dev_name(dev), (unsigned long long)start, (uint32_t)len);
+		log_error("Error writing device %s at %llu length %u fd %d.",
+				dev_name(dev), (unsigned long long)start, (uint32_t)len,
+				dev->bcache_fd);
+		dev_unset_last_byte(dev);
  		label_scan_invalidate(dev);
  		return false;
  	}
  
  	if (!bcache_flush(scan_bcache)) {
-		log_error("Error writing device %s at %llu length %u.",
-			  dev_name(dev), (unsigned long long)start, (uint32_t)len);
+		log_error("Error writing device %s at %llu length %u fd %d.",
+				dev_name(dev), (unsigned long long)start, (uint32_t)len,
+				dev->bcache_fd);
+		dev_unset_last_byte(dev);
  		label_scan_invalidate(dev);
  		return false;
  	}
diff --git a/lib/metadata/mirror.c b/lib/metadata/mirror.c
index cd8ce1e9e1..9792df7eac 100644
--- a/lib/metadata/mirror.c
+++ b/lib/metadata/mirror.c
@@ -305,7 +305,6 @@ static int _write_log_header(struct cmd_context *cmd, struct logical_volume *lv)
  	dev_set_last_byte(dev, sizeof(log_header));
  
  	if (!dev_write_bytes(dev, UINT64_C(0), sizeof(log_header), &log_header)) {
-		dev_unset_last_byte(dev);
  		log_error("Failed to write log header to %s.", name);
  		return 0;
  	}
-- 
2.16.4

----------------

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [linux-lvm] resend patch - bcache may mistakenly write data to another disk when writes error
  2019-10-22  9:47 [linux-lvm] resend patch - bcache may mistakenly write data to another disk when writes error Heming Zhao
@ 2019-10-23 21:31 ` Joe Thornber
  2019-10-24  3:06   ` Heming Zhao
  2019-10-24  3:13   ` Heming Zhao
  0 siblings, 2 replies; 15+ messages in thread
From: Joe Thornber @ 2019-10-23 21:31 UTC (permalink / raw)
  To: LVM general discussion and development; +Cc: David Teigland

On Tue, Oct 22, 2019 at 09:47:32AM +0000, Heming Zhao wrote:
> Hello List & David,
> 
> This patch is responsible for legacy mail:
> [linux-lvm] pvresize will cause a meta-data corruption with error message "Error writing device at 4096 length 512"
> 
> I had send it to our customer, the code ran as expected. I think this code is enough to fix this issue.
> 
> Thanks
> zhm
> 
> ------(patch for branch stable-2.02) ----------
>  From d0d77d0bdad6136c792c9664444d73dd47b809cb Mon Sep 17 00:00:00 2001
> From: Zhao Heming <heming.zhao@suse.com>
> Date: Tue, 22 Oct 2019 17:22:17 +0800
> Subject: [PATCH] bcache may mistakenly write data to another disk when writes
>   error
> 
> When bcache write data error, the errored fd and its data is saved in
> cache->errored, then this fd is closed. Later lvm will reuse this
> closed fd to new opened devs, but the fd related data still in
> cache->errored and flags with BF_DIRTY. It make the data may mistakenly
> write to another disk.

I think the real issue here is that the flush fails, and the error path for that
calls invalidate dev, which also fails, but that return value is not checked.
The fd is subsequently closed, and reopened with data still in the cache.

So I think the correct fix is to have a variant of invalidate, that doesn't
bother retrying the IO, and just throws away the dirty data.  bcache_abort()?
This should be called when the flush() fails.
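For illustration, the discard semantics described above can be sketched with a toy model; the simplified list structures and the name `toy_bcache_abort` below are hypothetical stand-ins, not the real lvm2 bcache API:

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for the real bcache structures (hypothetical). */
struct block {
	int fd;
	int dirty;             /* models BF_DIRTY */
	struct block *next;
};

struct toy_cache {
	struct block *errored; /* blocks whose writes failed */
	struct block *dirty;   /* blocks awaiting writeback */
};

/* Unlink every block belonging to fd from a singly linked list and
 * clear its dirty flag; returns the number of blocks discarded. */
static int discard_fd(struct block **head, int fd)
{
	int n = 0;
	struct block **pp = head;

	while (*pp) {
		if ((*pp)->fd == fd) {
			struct block *victim = *pp;
			*pp = victim->next;
			victim->dirty = 0; /* no BF_DIRTY left to replay */
			n++;
		} else
			pp = &(*pp)->next;
	}
	return n;
}

/* Sketch of the suggested abort variant: throw away dirty data for fd
 * without retrying the IO, so a later reuse of the fd number cannot
 * pick up stale blocks. */
int toy_bcache_abort(struct toy_cache *cache, int fd)
{
	return discard_fd(&cache->errored, fd) + discard_fd(&cache->dirty, fd);
}
```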

- Joe


* Re: [linux-lvm] resend patch - bcache may mistakenly write data to another disk when writes error
  2019-10-23 21:31 ` Joe Thornber
@ 2019-10-24  3:06   ` Heming Zhao
  2019-10-28 15:43     ` Joe Thornber
  2019-10-24  3:13   ` Heming Zhao
  1 sibling, 1 reply; 15+ messages in thread
From: Heming Zhao @ 2019-10-24  3:06 UTC (permalink / raw)
  To: LVM general discussion and development, David Teigland

Hello Joe,

Please see my comments below. Thanks.

On 10/24/19 5:31 AM, Joe Thornber wrote:
> On Tue, Oct 22, 2019 at 09:47:32AM +0000, Heming Zhao wrote:
>> Hello List & David,
>>
>> This patch is responsible for legacy mail:
>> [linux-lvm] pvresize will cause a meta-data corruption with error message "Error writing device at 4096 length 512"
>>
>> I had send it to our customer, the code ran as expected. I think this code is enough to fix this issue.
>>
>> Thanks
>> zhm
>>
>> ------(patch for branch stable-2.02) ----------
>>   From d0d77d0bdad6136c792c9664444d73dd47b809cb Mon Sep 17 00:00:00 2001
>> From: Zhao Heming <heming.zhao@suse.com>
>> Date: Tue, 22 Oct 2019 17:22:17 +0800
>> Subject: [PATCH] bcache may mistakenly write data to another disk when writes
>>    error
>>
>> When bcache write data error, the errored fd and its data is saved in
>> cache->errored, then this fd is closed. Later lvm will reuse this
>> closed fd to new opened devs, but the fd related data still in
>> cache->errored and flags with BF_DIRTY. It make the data may mistakenly
>> write to another disk.
> 
> I think real issue here is that the flush fails, and the error path for that
> calls invalidate dev, which also fails, but that return value is not checked.
> The fd is subsequently closed, and reopened with data still in the cache.
> 
The first failure is in bcache_flush; bcache_invalidate_fd then does nothing,
because the data sits on cache->errored, which belongs to neither the dirty nor
the clean list. Later the data is mistakenly moved from cache->errored onto
cache->dirty by "bcache_get => _lookup_or_read_block" (because the block still
carries the BF_DIRTY flag).
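The fd-reuse hazard behind this is POSIX behaviour: open() returns the lowest-numbered free descriptor, so a number freed by close() is routinely handed to the next device opened. A minimal, self-contained illustration (using /dev/null and /dev/zero in place of real PVs):

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Open a file, close it, then open a different file: open() returns
 * the lowest-numbered free descriptor, so the second open gets the
 * same fd number back.  Any cache state still keyed on that number
 * now silently refers to the new file. */
int fd_number_is_reused(void)
{
	int fd1 = open("/dev/null", O_RDONLY);
	assert(fd1 >= 0);
	close(fd1);                            /* "device" goes away   */

	int fd2 = open("/dev/zero", O_RDONLY); /* a different "device" */
	assert(fd2 >= 0);

	int reused = (fd2 == fd1);
	close(fd2);
	return reused;
}
```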

> So I think the correct fix is to have a variant of invalidate, that doesn't
> bother retrying the IO, and just throws away the dirty data.  bcache_abort()?
> This should be called when the flush() fails.
> 
Currently lvm2 calls bcache_invalidate_fd when the flush fails, so I added the
abort action inside invalidate(); you can see the code handling the errored
list there. A new bcache_abort() function is not necessary.

So IMO both bcache_flush and bcache_invalidate_fd have bugs:
- bcache_flush should do more error/failure checking.
- invalidate should also do the abort job.

BTW, yesterday I found a mistake in my patch: it overwrote some correct code.
I sent a v2 patch, but only to David.


> - Joe
> 
> _______________________________________________
> linux-lvm mailing list
> linux-lvm@redhat.com
> https://www.redhat.com/mailman/listinfo/linux-lvm
> read the LVM HOW-TO at http://tldp.org/HOWTO/LVM-HOWTO/
> 


* Re: [linux-lvm] resend patch - bcache may mistakenly write data to another disk when writes error
  2019-10-23 21:31 ` Joe Thornber
  2019-10-24  3:06   ` Heming Zhao
@ 2019-10-24  3:13   ` Heming Zhao
  2019-10-28  8:38     ` Heming Zhao
  1 sibling, 1 reply; 15+ messages in thread
From: Heming Zhao @ 2019-10-24  3:13 UTC (permalink / raw)
  To: Joe Thornber, LVM general discussion and development; +Cc: David Teigland

Hello Joe,

Please see my comments below. Thanks.
(+Joe: please ignore the previous mail. Very weird: in the last mail, Thunderbird's reply-all did not include Joe.)

On 10/24/19 5:31 AM, Joe Thornber wrote:
> On Tue, Oct 22, 2019 at 09:47:32AM +0000, Heming Zhao wrote:
>> Hello List & David,
>>
>> This patch is responsible for legacy mail:
>> [linux-lvm] pvresize will cause a meta-data corruption with error message "Error writing device at 4096 length 512"
>>
>> I had send it to our customer, the code ran as expected. I think this code is enough to fix this issue.
>>
>> Thanks
>> zhm
>>
>> ------(patch for branch stable-2.02) ----------
>>   From d0d77d0bdad6136c792c9664444d73dd47b809cb Mon Sep 17 00:00:00 2001
>> From: Zhao Heming <heming.zhao@suse.com>
>> Date: Tue, 22 Oct 2019 17:22:17 +0800
>> Subject: [PATCH] bcache may mistakenly write data to another disk when writes
>>    error
>>
>> When bcache write data error, the errored fd and its data is saved in
>> cache->errored, then this fd is closed. Later lvm will reuse this
>> closed fd to new opened devs, but the fd related data still in
>> cache->errored and flags with BF_DIRTY. It make the data may mistakenly
>> write to another disk.
> 
> I think real issue here is that the flush fails, and the error path for that
> calls invalidate dev, which also fails, but that return value is not checked.
> The fd is subsequently closed, and reopened with data still in the cache.
> 
The first failure is in bcache_flush; bcache_invalidate_fd then does nothing,
because the data sits on cache->errored, which belongs to neither the dirty nor
the clean list. Later the data is mistakenly moved from cache->errored onto
cache->dirty by "bcache_get => _lookup_or_read_block" (because the block still
carries the BF_DIRTY flag).

> So I think the correct fix is to have a variant of invalidate, that doesn't
> bother retrying the IO, and just throws away the dirty data.  bcache_abort()?
> This should be called when the flush() fails.
> 
Currently lvm2 calls bcache_invalidate_fd when the flush fails, so I added the
abort action inside invalidate(); you can see the code handling the errored
list there. A new bcache_abort() function is not necessary.

So IMO both bcache_flush and bcache_invalidate_fd have bugs:
- bcache_flush should do more error/failure checking.
- invalidate should also do the abort job.

BTW, yesterday I found a mistake in my patch: it overwrote some correct code.
I sent a v2 patch, but only to David.

> - Joe
> 
> _______________________________________________
> linux-lvm mailing list
> linux-lvm@redhat.com
> https://www.redhat.com/mailman/listinfo/linux-lvm
> read the LVM HOW-TO at http://tldp.org/HOWTO/LVM-HOWTO/
> 


* Re: [linux-lvm] resend patch - bcache may mistakenly write data to another disk when writes error
  2019-10-24  3:13   ` Heming Zhao
@ 2019-10-28  8:38     ` Heming Zhao
  0 siblings, 0 replies; 15+ messages in thread
From: Heming Zhao @ 2019-10-28  8:38 UTC (permalink / raw)
  To: Joe Thornber, LVM general discussion and development; +Cc: David Teigland

Hello Joe & list,

This is the v2 patch.

The v1 patch overwrote some correct code in lib/label/label.c label_scan_pvscan_all():
```
     while ((dev = dev_iter_get(iter))) {
-       if (!(devl = dm_zalloc(sizeof(*devl)))) {
-           log_error("Failed to allocated device list.");
-           dev_iter_destroy(iter);
+       if (!(devl = dm_zalloc(sizeof(*devl))))
             return 0;
-       }
```

------ v2 patch ------
 From a4b53d6fd670c3046784b9cce3925ec5d187b138 Mon Sep 17 00:00:00 2001
From: Zhao Heming <heming.zhao@suse.com>
Date: Mon, 28 Oct 2019 16:31:01 +0800
Subject: [PATCH] bcache: fix mistakenly write data to another disk when writes
  error

When a bcache write fails, the errored fd and its data are kept on
cache->errored, and the fd is then closed. lvm later reuses the closed
fd number for newly opened devices, but the data tied to the old fd is
still on cache->errored and flagged BF_DIRTY, so that data may
mistakenly be written to another disk.

Signed-off-by: Zhao Heming <heming.zhao@suse.com>
---
  .gitignore                    |  2 +-
  lib/cache/lvmcache.c          |  2 +-
  lib/device/bcache.c           | 46 ++++++++++++++++++++++++++++++-------------
  lib/format_text/format-text.c | 11 ++++-------
  lib/label/label.c             | 22 ++++++++++++---------
  lib/metadata/mirror.c         |  1 -
  6 files changed, 51 insertions(+), 33 deletions(-)

diff --git a/.gitignore b/.gitignore
index f51bb67fca..57bb007005 100644
--- a/.gitignore
+++ b/.gitignore
@@ -28,7 +28,7 @@ make.tmpl
  /config.log
  /config.status
  /configure.scan
-/cscope.out
+/cscope.*
  /tags
  /tmp/
  
diff --git a/lib/cache/lvmcache.c b/lib/cache/lvmcache.c
index 9890325d2e..9c6e8032d6 100644
--- a/lib/cache/lvmcache.c
+++ b/lib/cache/lvmcache.c
@@ -1429,7 +1429,7 @@ int lvmcache_label_rescan_vg(struct cmd_context *cmd, const char *vgname, const
   * incorrectly placed PVs should have been moved from the orphan vginfo
   * onto their correct vginfo's, and the orphan vginfo should (in theory)
   * represent only real orphan PVs.  (Note: if lvmcache_label_scan is run
- * after vg_read udpates to lvmcache state, then the lvmcache will be
+ * after vg_read updates to lvmcache state, then the lvmcache will be
   * incorrect again, so do not run lvmcache_label_scan during the
   * processing phase.)
   *
diff --git a/lib/device/bcache.c b/lib/device/bcache.c
index d487ca2a77..1959b333ba 100644
--- a/lib/device/bcache.c
+++ b/lib/device/bcache.c
@@ -293,6 +293,10 @@ static bool _async_issue(struct io_engine *ioe, enum dir d, int fd,
  
  	if (r < 0) {
  		_cb_free(e->cbs, cb);
+		((struct block *)context)->error = r;
+		log_warn("io_submit <%c> off %llu bytes %llu return %d:%s",
+			(d == DIR_READ) ? 'R' : 'W', (long long unsigned)offset,
+			(long long unsigned)nbytes, r, strerror(-r));
  		return false;
  	}
  
@@ -869,7 +873,7 @@ static void _complete_io(void *context, int err)
  
  	if (b->error) {
  		dm_list_add(&cache->errored, &b->list);
-
+		log_warn("_complete_io fd: %d error: %d", b->fd, err);
  	} else {
  		_clear_flags(b, BF_DIRTY);
  		_link_block(b);
@@ -896,8 +900,7 @@ static void _issue_low_level(struct block *b, enum dir d)
  	dm_list_move(&cache->io_pending, &b->list);
  
  	if (!cache->engine->issue(cache->engine, d, b->fd, sb, se, b->data, b)) {
-		/* FIXME: if io_submit() set an errno, return that instead of EIO? */
-		_complete_io(b, -EIO);
+		_complete_io(b, b->error);
  		return;
  	}
  }
@@ -921,16 +924,20 @@ static bool _wait_io(struct bcache *cache)
   * High level IO handling
   *--------------------------------------------------------------*/
  
-static void _wait_all(struct bcache *cache)
+static bool _wait_all(struct bcache *cache)
  {
+	bool ret = true;
  	while (!dm_list_empty(&cache->io_pending))
-		_wait_io(cache);
+		ret = _wait_io(cache);
+	return ret;
  }
  
-static void _wait_specific(struct block *b)
+static bool _wait_specific(struct block *b)
  {
+	bool ret = true;
  	while (_test_flags(b, BF_IO_PENDING))
-		_wait_io(b->cache);
+		ret = _wait_io(b->cache);
+	return ret;
  }
  
  static unsigned _writeback(struct bcache *cache, unsigned count)
@@ -1125,7 +1132,7 @@ struct bcache *bcache_create(sector_t block_sectors, unsigned nr_cache_blocks,
  	unsigned max_io = engine->max_io(engine);
  	long pgsize = sysconf(_SC_PAGESIZE);
  
-	if ((pgsize = sysconf(_SC_PAGESIZE)) < 0) {
+	if (pgsize < 0) {
  		log_warn("bcache cannot read pagesize.");
  		return NULL;
  	}
@@ -1290,10 +1297,7 @@ void bcache_put(struct block *b)
  
  bool bcache_flush(struct bcache *cache)
  {
-	// Only dirty data is on the errored list, since bad read blocks get
-	// recycled straight away.  So we put these back on the dirty list, and
-	// try and rewrite everything.
-	dm_list_splice(&cache->dirty, &cache->errored);
+	bool write_ret = true, wait_ret = true;
  
  	while (!dm_list_empty(&cache->dirty)) {
  		struct block *b = dm_list_item(_list_pop(&cache->dirty), struct block);
@@ -1303,11 +1307,16 @@ bool bcache_flush(struct bcache *cache)
  		}
  
  		_issue_write(b);
+		if (b->error) write_ret = false;
  	}
  
-	_wait_all(cache);
+	wait_ret = _wait_all(cache);
  
-	return dm_list_empty(&cache->errored);
+	if (write_ret == false || wait_ret == false ||
+			!dm_list_empty(&cache->errored))
+		return false;
+	else
+		return true;
  }
  
  /*
@@ -1352,6 +1361,15 @@ bool bcache_invalidate_fd(struct bcache *cache, int fd)
  	struct block *b, *tmp;
  	bool r = true;
  
+	// clean cache->errored list
+	dm_list_iterate_items_safe (b, tmp, &cache->errored) {
+		b->flags = 0;
+		b->index = 0;
+		b->ref_count = 0;
+		b->error = 0;
+		_recycle_block(cache, b);
+	}
+
  	// Start writing back any dirty blocks on this fd.
  	dm_list_iterate_items_safe (b, tmp, &cache->dirty)
  		if (b->fd == fd)
diff --git a/lib/format_text/format-text.c b/lib/format_text/format-text.c
index 6eea07fe46..86e2caaad6 100644
--- a/lib/format_text/format-text.c
+++ b/lib/format_text/format-text.c
@@ -402,8 +402,7 @@ static int _raw_write_mda_header(const struct format_type *fmt,
  	dev_set_last_byte(dev, start_byte + MDA_HEADER_SIZE);
  
  	if (!dev_write_bytes(dev, start_byte, MDA_HEADER_SIZE, mdah)) {
-		dev_unset_last_byte(dev);
-		log_error("Failed to write mda header to %s fd %d", dev_name(dev), dev->bcache_fd);
+		log_error("Failed to write mda header to %s", dev_name(dev));
  		return 0;
  	}
  	dev_unset_last_byte(dev);
@@ -690,8 +689,7 @@ static int _vg_write_raw(struct format_instance *fid, struct volume_group *vg,
  	if (!dev_write_bytes(mdac->area.dev, mdac->area.start + mdac->rlocn.offset,
  		                (size_t) (mdac->rlocn.size - new_wrap),
  		                fidtc->raw_metadata_buf)) {
-		log_error("Failed to write metadata to %s fd %d", dev_name(mdac->area.dev), mdac->area.dev->bcache_fd);
-		dev_unset_last_byte(mdac->area.dev);
+		log_error("Failed to write metadata to %s", dev_name(mdac->area.dev));
  		goto out;
  	}
  
@@ -704,8 +702,7 @@ static int _vg_write_raw(struct format_instance *fid, struct volume_group *vg,
  		if (!dev_write_bytes(mdac->area.dev, mdac->area.start + MDA_HEADER_SIZE,
  			                (size_t) new_wrap,
  			                fidtc->raw_metadata_buf + mdac->rlocn.size - new_wrap)) {
-			log_error("Failed to write metadata wrap to %s fd %d", dev_name(mdac->area.dev), mdac->area.dev->bcache_fd);
-			dev_unset_last_byte(mdac->area.dev);
+			log_error("Failed to write metadata wrap to %s", dev_name(mdac->area.dev));
  			goto out;
  		}
  	}
@@ -723,7 +720,7 @@ static int _vg_write_raw(struct format_instance *fid, struct volume_group *vg,
  
  	r = 1;
  
-      out:
+out:
  	if (!r) {
  		dm_free(fidtc->raw_metadata_buf);
  		fidtc->raw_metadata_buf = NULL;
diff --git a/lib/label/label.c b/lib/label/label.c
index e4a106854f..904dd23395 100644
--- a/lib/label/label.c
+++ b/lib/label/label.c
@@ -216,7 +216,7 @@ int label_write(struct device *dev, struct label *label)
  
  	if (!dev_write_bytes(dev, offset, LABEL_SIZE, buf)) {
  		log_debug_devs("Failed to write label to %s", dev_name(dev));
-		r = 0;
+		return 0;
  	}
  
  	dev_unset_last_byte(dev);
@@ -621,7 +621,6 @@ static int _scan_list(struct cmd_context *cmd, struct dev_filter *f,
  	int submit_count;
  	int scan_failed;
  	int is_lvm_device;
-	int error;
  	int ret;
  
  	dm_list_init(&wait_devs);
@@ -668,12 +667,12 @@ static int _scan_list(struct cmd_context *cmd, struct dev_filter *f,
  
  	dm_list_iterate_items_safe(devl, devl2, &wait_devs) {
  		bb = NULL;
-		error = 0;
  		scan_failed = 0;
  		is_lvm_device = 0;
  
  		if (!bcache_get(scan_bcache, devl->dev->bcache_fd, 0, 0, &bb)) {
-			log_debug_devs("Scan failed to read %s error %d.", dev_name(devl->dev), error);
+			log_debug_devs("Scan failed to read %s fd: %d bb %p error %d.",
+					dev_name(devl->dev), devl->dev->bcache_fd, bb, bb ? bb->error : 0);
  			scan_failed = 1;
  			scan_read_errors++;
  			scan_failed_count++;
@@ -1358,7 +1357,8 @@ bool dev_write_bytes(struct device *dev, uint64_t start, size_t len, void *data)
  
  	if (!scan_bcache) {
  		/* Should not happen */
-		log_error("dev_write bcache not set up %s", dev_name(dev));
+		log_error("dev_write bcache not set up %s fd: %d", dev_name(dev),
+					dev->bcache_fd);
  		return false;
  	}
  
@@ -1383,15 +1383,19 @@ bool dev_write_bytes(struct device *dev, uint64_t start, size_t len, void *data)
  	}
  
  	if (!bcache_write_bytes(scan_bcache, dev->bcache_fd, start, len, data)) {
-		log_error("Error writing device %s at %llu length %u.",
-			  dev_name(dev), (unsigned long long)start, (uint32_t)len);
+		log_error("Error writing device %s at %llu length %u fd %d.",
+				dev_name(dev), (unsigned long long)start, (uint32_t)len,
+				dev->bcache_fd);
+		dev_unset_last_byte(dev);
  		label_scan_invalidate(dev);
  		return false;
  	}
  
  	if (!bcache_flush(scan_bcache)) {
-		log_error("Error writing device %s at %llu length %u.",
-			  dev_name(dev), (unsigned long long)start, (uint32_t)len);
+		log_error("Error writing device %s at %llu length %u fd %d.",
+				dev_name(dev), (unsigned long long)start, (uint32_t)len,
+				dev->bcache_fd);
+		dev_unset_last_byte(dev);
  		label_scan_invalidate(dev);
  		return false;
  	}
diff --git a/lib/metadata/mirror.c b/lib/metadata/mirror.c
index cd8ce1e9e1..9792df7eac 100644
--- a/lib/metadata/mirror.c
+++ b/lib/metadata/mirror.c
@@ -305,7 +305,6 @@ static int _write_log_header(struct cmd_context *cmd, struct logical_volume *lv)
  	dev_set_last_byte(dev, sizeof(log_header));
  
  	if (!dev_write_bytes(dev, UINT64_C(0), sizeof(log_header), &log_header)) {
-		dev_unset_last_byte(dev);
  		log_error("Failed to write log header to %s.", name);
  		return 0;
  	}
-- 
2.16.4

------------

Thanks.

On 10/24/19 11:13 AM, Heming Zhao wrote:
> Hello Joe,
> 
> Please see my comments in below. Thanks.
> (+joe, pls ignore previous mail. It is very wired: in last mail, Thunderbird reply all not include joe.)
> 
> On 10/24/19 5:31 AM, Joe Thornber wrote:
>> On Tue, Oct 22, 2019 at 09:47:32AM +0000, Heming Zhao wrote:
>>> Hello List & David,
>>>
>>> This patch is responsible for legacy mail:
>>> [linux-lvm] pvresize will cause a meta-data corruption with error message "Error writing device at 4096 length 512"
>>>
>>> I had send it to our customer, the code ran as expected. I think this code is enough to fix this issue.
>>>
>>> Thanks
>>> zhm
>>>
>>> ------(patch for branch stable-2.02) ----------
>>>    From d0d77d0bdad6136c792c9664444d73dd47b809cb Mon Sep 17 00:00:00 2001
>>> From: Zhao Heming <heming.zhao@suse.com>
>>> Date: Tue, 22 Oct 2019 17:22:17 +0800
>>> Subject: [PATCH] bcache may mistakenly write data to another disk when writes
>>>     error
>>>
>>> When bcache write data error, the errored fd and its data is saved in
>>> cache->errored, then this fd is closed. Later lvm will reuse this
>>> closed fd to new opened devs, but the fd related data still in
>>> cache->errored and flags with BF_DIRTY. It make the data may mistakenly
>>> write to another disk.
>>
>> I think real issue here is that the flush fails, and the error path for that
>> calls invalidate dev, which also fails, but that return value is not checked.
>> The fd is subsequently closed, and reopened with data still in the cache.
>>
> The first failure is in bcache_flush, then bcache_invalidate_fd does nothing because the data
> is in cache->errored, which belongs to neither the dirty nor the clean list. Then the data is mistakenly
> moved from cache->errored into cache->dirty by "bcache_get => _lookup_or_read_block"
> (because the data holds the BF_DIRTY flag).
> 
>> So I think the correct fix is to have a variant of invalidate, that doesn't
>> bother retrying the IO, and just throws away the dirty data.  bcache_abort()?
>> This should be called when the flush() fails.
>>
> Currently lvm2 calls bcache_invalidate_fd when flush fails. So I added the abort action
> in invalidate(); you can see the code that handles the errored list there. It's
> not necessary to add a new function bcache_abort().
> 
> So IMO both bcache_flush & bcache_invalidate_fd have bugs:
> - bcache_flush should do more error/failure checking.
> - invalidate should do the abort job.
> 
> BTW, yesterday I found a mistake in my patch: it overwrites some correct code. I
> only sent the V2 patch to David.
> 
>> - Joe
>>
>> _______________________________________________
>> linux-lvm mailing list
>> linux-lvm@redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-lvm
>> read the LVM HOW-TO at http://tldp.org/HOWTO/LVM-HOWTO/
>>
> 
> _______________________________________________
> linux-lvm mailing list
> linux-lvm@redhat.com
> https://www.redhat.com/mailman/listinfo/linux-lvm
> read the LVM HOW-TO at http://tldp.org/HOWTO/LVM-HOWTO/
> 
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [linux-lvm] resend patch - bcache may mistakenly write data to another disk when writes error
  2019-10-24  3:06   ` Heming Zhao
@ 2019-10-28 15:43     ` Joe Thornber
  2019-10-29  5:07       ` Heming Zhao
  0 siblings, 1 reply; 15+ messages in thread
From: Joe Thornber @ 2019-10-28 15:43 UTC (permalink / raw)
  To: LVM general discussion and development; +Cc: David Teigland

On Thu, Oct 24, 2019 at 03:06:18AM +0000, Heming Zhao wrote:
> First fail is in bcache_flush, then bcache_invalidate_fd does nothing because the data
> in cache->errored, which not belongs to dirty & clean list. Then the data mistakenly
> move from cache->errored into cache->dirty by "bcache_get => _lookup_or_read_block"
> (because the data holds BF_DIRTY flag).

I just pushed a couple of patches that will hopefully fix this issue for you:

commit 6b0d969b2a85ba69046afa26af4d7bcddddbccd5 (HEAD -> master, origin/master, origin/HEAD, 2019-10-11-bcache-purge)
Author: Joe Thornber <ejt@redhat.com>
Date:   Mon Oct 28 15:01:47 2019 +0000

    [label] Use bcache_abort_fd() to ensure blocks are no longer in the cache.
    
    The return value from bcache_invalidate_fd() was not being checked.
    
    So I've introduced a little function, _invalidate_fd() that always
    calls bcache_abort_fd() if the write fails.

commit 2938b4dcca0a1df661758abfab7f402ea7aab018
Author: Joe Thornber <ejt@redhat.com>
Date:   Mon Oct 28 14:29:47 2019 +0000

    [bcache] add bcache_abort()
    
    This gives us a way to cope with write failures.




Also there are big changes to bcache coming, that remove file descriptors from
the interface completely.  See the branch 2019-09-05-add-io-manager for more info
(bcache has been renamed io_manager).

- Joe

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [linux-lvm] resend patch - bcache may mistakenly write data to another disk when writes error
  2019-10-28 15:43     ` Joe Thornber
@ 2019-10-29  5:07       ` Heming Zhao
  2019-10-29  9:46         ` Heming Zhao
  2019-10-29 11:01         ` Joe Thornber
  0 siblings, 2 replies; 15+ messages in thread
From: Heming Zhao @ 2019-10-29  5:07 UTC (permalink / raw)
  To: LVM general discussion and development, Joe Thornber; +Cc: David Teigland

Hello Joe,

Please check my comments for your commit 2938b4dcc & 6b0d969b

1. If b->ref_count is non-zero and a write error happens, is the data never released?
    (there is no place that calls _unlink_block & _free_block)

2. When dev_write_bytes fails, calling dev_unset_last_byte with "fd=-1" is wrong.

3. I still think the error handling below should be added.
    The code below is based on stable-2.02, but the core idea is the same: the return values of io_submit & io_getevents should be checked.
    
```
static bool _async_issue(struct io_engine *ioe, enum dir d, int fd,
    ... ...
     if (r < 0) {
         _cb_free(e->cbs, cb);
+       ((struct block *)context)->error = r; <== assign errno & print warning
+       log_warn("io_submit <%c> off %llu bytes %llu return %d:%s",
+           (d == DIR_READ) ? 'R' : 'W', (long long unsigned)offset,
+           (long long unsigned)nbytes, r, strerror(-r));
         return false;
     }

static void _issue_low_level(struct block *b, enum dir d)
    ... ...
     dm_list_move(&cache->io_pending, &b->list);
  
     if (!cache->engine->issue(cache->engine, d, b->fd, sb, se, b->data, b)) {
-       /* FIXME: if io_submit() set an errno, return that instead of EIO? */
-       _complete_io(b, -EIO);
+       _complete_io(b, b->error); <=== this pass the right errno to caller.
         return;
     }
  }
  
-static void _wait_all(struct bcache *cache)
+static bool _wait_all(struct bcache *cache) <=== change to return error
  {
+   bool ret = true;
     while (!dm_list_empty(&cache->io_pending))
-       _wait_io(cache);
+       ret = _wait_io(cache);
+   return ret;
  }
  
-static void _wait_specific(struct block *b)
+static bool _wait_specific(struct block *b) <=== change to return error
  {
+   bool ret = true;
     while (_test_flags(b, BF_IO_PENDING))
-       _wait_io(b->cache);
+       ret = _wait_io(b->cache);
+   return ret;
  }

  bool bcache_flush(struct bcache *cache) <==== add more error handling
  {
+   bool write_ret = true, wait_ret = true;
  
     ... ...
         _issue_write(b);
+       if (b->error) write_ret = false;
     }
  
-   _wait_all(cache);
+   wait_ret = _wait_all(cache);
  
-   return dm_list_empty(&cache->errored);
+   if (write_ret == false || wait_ret == false ||
+           !dm_list_empty(&cache->errored))
+       return false;
+   else
+       return true;
  }
```

Thanks,
zhm

On 10/28/19 11:43 PM, Joe Thornber wrote:
> On Thu, Oct 24, 2019 at 03:06:18AM +0000, Heming Zhao wrote:
>> First fail is in bcache_flush, then bcache_invalidate_fd does nothing because the data
>> in cache->errored, which not belongs to dirty & clean list. Then the data mistakenly
>> move from cache->errored into cache->dirty by "bcache_get => _lookup_or_read_block"
>> (because the data holds BF_DIRTY flag).
> 
> I just pushed a couple of patches that will hopefully fix this issue for you:
> 
> commit 6b0d969b2a85ba69046afa26af4d7bcddddbccd5 (HEAD -> master, origin/master, origin/HEAD, 2019-10-11-bcache-purge)
> Author: Joe Thornber <ejt@redhat.com>
> Date:   Mon Oct 28 15:01:47 2019 +0000
> 
>      [label] Use bcache_abort_fd() to ensure blocks are no longer in the cache.
>      
>      The return value from bcache_invalidate_fd() was not being checked.
>      
>      So I've introduced a little function, _invalidate_fd() that always
>      calls bcache_abort_fd() if the write fails.
> 
> commit 2938b4dcca0a1df661758abfab7f402ea7aab018
> Author: Joe Thornber <ejt@redhat.com>
> Date:   Mon Oct 28 14:29:47 2019 +0000
> 
>      [bcache] add bcache_abort()
>      
>      This gives us a way to cope with write failures.
> 
> 
> 
> 
> Also there are big changes to bcache coming, that remove file descriptors from
> the interface completely.  See the branch 2019-09-05-add-io-manager for more info
> (bcache has been renamed io_manager).
> 
> - Joe
> 
> _______________________________________________
> linux-lvm mailing list
> linux-lvm@redhat.com
> https://www.redhat.com/mailman/listinfo/linux-lvm
> read the LVM HOW-TO at http://tldp.org/HOWTO/LVM-HOWTO/
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [linux-lvm] resend patch - bcache may mistakenly write data to another disk when writes error
  2019-10-29  5:07       ` Heming Zhao
@ 2019-10-29  9:46         ` Heming Zhao
  2019-10-29 11:05           ` Joe Thornber
  2019-10-29 11:01         ` Joe Thornber
  1 sibling, 1 reply; 15+ messages in thread
From: Heming Zhao @ 2019-10-29  9:46 UTC (permalink / raw)
  To: LVM general discussion and development, Joe Thornber; +Cc: David Teigland

Add another comment.

The key to commits 2938b4dcc & 6b0d969b is the function _invalidate_fd().
But from my analysis, it looks like your code does not fix this issue.

_invalidate_fd calls bcache_invalidate_fd & bcache_abort_fd.
bcache_invalidate_fd works as before, only returning true or false to indicate whether the fd is in cache->errored.
Then bcache_abort_fd calls _abort_v, and _abort_v calls _unlink_block & _free_block.
These two functions only delist the block from the current cache->errored & cache->free lists.
The data is still in the radix tree with the BF_DIRTY flag.
So when lvm next reuses the closed errored fd and calls _lookup_or_read_block (with the legacy fd), the data will be reused again.
The bug will be triggered.

Thanks,
zhm

On 10/29/19 1:07 PM, heming.zhao wrote:
> Hello Joe,
> 
> Please check my comments for your commit 2938b4dcc & 6b0d969b
> 
> 1. b->ref_count is non-zero, and write error happens, the data never release?
>     (no place to call _unlink_block & _free_block)
> 
> 2. when dev_write_bytes failed, call dev_unset_last_byte with "fd=-1" is wrong.
> 
> 3. I still think below error handling should be added.
>     Below base on stable-2.02, but the core idea is same, should access the return value of io_submit & io_getevents.
> ```
> static bool _async_issue(struct io_engine *ioe, enum dir d, int fd,
>     ... ...
>      if (r < 0) {
>          _cb_free(e->cbs, cb);
> +       ((struct block *)context)->error = r; <== assign errno & print warning
> +       log_warn("io_submit <%c> off %llu bytes %llu return %d:%s",
> +           (d == DIR_READ) ? 'R' : 'W', (long long unsigned)offset,
> +           (long long unsigned)nbytes, r, strerror(-r));
>          return false;
>      }
> 
> static void _issue_low_level(struct block *b, enum dir d)
>     ... ...
>      dm_list_move(&cache->io_pending, &b->list);
> 
>      if (!cache->engine->issue(cache->engine, d, b->fd, sb, se, b->data, b)) {
> -       /* FIXME: if io_submit() set an errno, return that instead of EIO? */
> -       _complete_io(b, -EIO);
> +       _complete_io(b, b->error); <=== this pass the right errno to caller.
>          return;
>      }
>  }
> 
> -static void _wait_all(struct bcache *cache)
> +static bool _wait_all(struct bcache *cache) <=== change to return error
>  {
> +   bool ret = true;
>      while (!dm_list_empty(&cache->io_pending))
> -       _wait_io(cache);
> +       ret = _wait_io(cache);
> +   return ret;
>  }
> 
> -static void _wait_specific(struct block *b)
> +static bool _wait_specific(struct block *b) <=== change to return error
>  {
> +   bool ret = true;
>      while (_test_flags(b, BF_IO_PENDING))
> -       _wait_io(b->cache);
> +       ret = _wait_io(b->cache);
> +   return ret;
>  }
> 
>  bool bcache_flush(struct bcache *cache) <==== add more error handling
>  {
> +   bool write_ret = true, wait_ret = true;
> 
>      ... ...
>          _issue_write(b);
> +       if (b->error) write_ret = false;
>      }
> 
> -   _wait_all(cache);
> +   wait_ret = _wait_all(cache);
> 
> -   return dm_list_empty(&cache->errored);
> +   if (write_ret == false || wait_ret == false ||
> +           !dm_list_empty(&cache->errored))
> +       return false;
> +   else
> +       return true;
>  }
> ```
> 
> Thanks,
> zhm
> 
> On 10/28/19 11:43 PM, Joe Thornber wrote:
>> On Thu, Oct 24, 2019 at 03:06:18AM +0000, Heming Zhao wrote:
>>> First fail is in bcache_flush, then bcache_invalidate_fd does nothing because the data
>>> in cache->errored, which not belongs to dirty & clean list. Then the data mistakenly
>>> move from cache->errored into cache->dirty by "bcache_get => _lookup_or_read_block"
>>> (because the data holds BF_DIRTY flag).
>>
>> I just pushed a couple of patches that will hopefully fix this issue for you:
>>
>> commit 6b0d969b2a85ba69046afa26af4d7bcddddbccd5 (HEAD -> master, origin/master, origin/HEAD, 2019-10-11-bcache-purge)
>> Author: Joe Thornber <ejt@redhat.com>
>> Date:   Mon Oct 28 15:01:47 2019 +0000
>>
>>      [label] Use bcache_abort_fd() to ensure blocks are no longer in the cache.
>>      The return value from bcache_invalidate_fd() was not being checked.
>>      So I've introduced a little function, _invalidate_fd() that always
>>      calls bcache_abort_fd() if the write fails.
>>
>> commit 2938b4dcca0a1df661758abfab7f402ea7aab018
>> Author: Joe Thornber <ejt@redhat.com>
>> Date:   Mon Oct 28 14:29:47 2019 +0000
>>
>>      [bcache] add bcache_abort()
>>      This gives us a way to cope with write failures.
>>
>>
>>
>>
>> Also there are big changes to bcache coming, that remove file descriptors from
>> the interface completely.  See the branch 2019-09-05-add-io-manager for more info
>> (bcache has been renamed io_manager).
>>
>> - Joe
>>
>> _______________________________________________
>> linux-lvm mailing list
>> linux-lvm@redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-lvm
>> read the LVM HOW-TO at http://tldp.org/HOWTO/LVM-HOWTO/
>>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [linux-lvm] resend patch - bcache may mistakenly write data to another disk when writes error
  2019-10-29  5:07       ` Heming Zhao
  2019-10-29  9:46         ` Heming Zhao
@ 2019-10-29 11:01         ` Joe Thornber
  2019-10-29 11:41           ` Heming Zhao
  1 sibling, 1 reply; 15+ messages in thread
From: Joe Thornber @ 2019-10-29 11:01 UTC (permalink / raw)
  To: LVM general discussion and development; +Cc: David Teigland

On Tue, Oct 29, 2019 at 05:07:30AM +0000, Heming Zhao wrote:
> Hello Joe,
> 
> Please check my comments for your commit 2938b4dcc & 6b0d969b
> 
> 1. b->ref_count is non-zero, and write error happens, the data never release?
>     (no place to call _unlink_block & _free_block)

Correct, the data will not be released until the client calls bcache_abort_fd(), to
indicate that it's giving up on the write.  That way the client is free to retry
io, eg, see this unit test:

static void test_write_bad_issue_stops_flush(void *context)
{
        struct fixture *f = context;
        struct mock_engine *me = f->me;
        struct bcache *cache = f->cache;
        struct block *b;
        int fd = 17;

        T_ASSERT(bcache_get(cache, fd, 0, GF_ZERO, &b));
        _expect_write_bad_issue(me, fd, 0);
        bcache_put(b);
        T_ASSERT(!bcache_flush(cache));

        // we'll let it succeed the second time
        _expect_write(me, fd, 0);
        _expect(me, E_WAIT);
        T_ASSERT(bcache_flush(cache));
}


> 2. when dev_write_bytes failed, call dev_unset_last_byte with "fd=-1" is wrong.

Quite possibly, this unset_last_byte stuff is a hack that Dave put in.  I'll look some more.


> 3. I still think below error handling should be added.
>     Below base on stable-2.02, but the core idea is same, should access the return value of io_submit & io_getevents.
>     
> ```
> static bool _async_issue(struct io_engine *ioe, enum dir d, int fd,
>     ... ...
>      if (r < 0) {
>          _cb_free(e->cbs, cb);
> +       ((struct block *)context)->error = r; <== assign errno & print warning
> +       log_warn("io_submit <%c> off %llu bytes %llu return %d:%s",
> +           (d == DIR_READ) ? 'R' : 'W', (long long unsigned)offset,
> +           (long long unsigned)nbytes, r, strerror(-r));
>          return false;
>      }
> 
> static void _issue_low_level(struct block *b, enum dir d)
>     ... ...
>      dm_list_move(&cache->io_pending, &b->list);
>   
>      if (!cache->engine->issue(cache->engine, d, b->fd, sb, se, b->data, b)) {
> -       /* FIXME: if io_submit() set an errno, return that instead of EIO? */
> -       _complete_io(b, -EIO);
> +       _complete_io(b, b->error); <=== this pass the right errno to caller.
>          return;
>      }
>   }

Yep, this is good. Added.


> -static void _wait_all(struct bcache *cache)
> +static bool _wait_all(struct bcache *cache) <=== change to return error
>   {
> +   bool ret = true;
>      while (!dm_list_empty(&cache->io_pending))
> -       _wait_io(cache);
> +       ret = _wait_io(cache);
> +   return ret;
>   }
>   
> -static void _wait_specific(struct block *b)
> +static bool _wait_specific(struct block *b) <=== change to return error
>   {
> +   bool ret = true;
>      while (_test_flags(b, BF_IO_PENDING))
> -       _wait_io(b->cache);
> +       ret = _wait_io(b->cache);
> +   return ret;
>   }

No, the wait functions just wait for io to complete, they're not interested
in whether it succeeded.  That's what b->error is for.


> 
>   bool bcache_flush(struct bcache *cache) <==== add more error handling
>   {
> +   bool write_ret = true, wait_ret = true;
>   
>      ... ...
>          _issue_write(b);
> +       if (b->error) write_ret = false;
>      }
>   
> -   _wait_all(cache);
> +   wait_ret = _wait_all(cache);
>   
> -   return dm_list_empty(&cache->errored);
> +   if (write_ret == false || wait_ret == false ||
> +           !dm_list_empty(&cache->errored))
> +       return false;
> +   else
> +       return true;
>   }

I don't understand how this changes the behaviour from just checking the
size of cache->errored.

- Joe

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [linux-lvm] resend patch - bcache may mistakenly write data to another disk when writes error
  2019-10-29  9:46         ` Heming Zhao
@ 2019-10-29 11:05           ` Joe Thornber
  2019-10-29 11:47             ` Heming Zhao
  0 siblings, 1 reply; 15+ messages in thread
From: Joe Thornber @ 2019-10-29 11:05 UTC (permalink / raw)
  To: Heming Zhao; +Cc: David Teigland, LVM general discussion and development

On Tue, Oct 29, 2019 at 09:46:56AM +0000, Heming Zhao wrote:
> Add another comment.
> 
> The key of commit 2938b4dcc & 6b0d969b is function _invalidate_fd().
> But from my analysis, it looks your codes do not fix this issue.
> 
> _invalidate_fd calls bcache_invalidate_fd & bcache_abort_fd.
> bcache_invalidate_fd work as before, only return true or false to indicate there is or isn't fd in cache->errored.
> Then bcache_abort_fd calls _abort_v, _abort_v calls _unlink_block & _free_block.
> These two functions only delist block from currently cache->errored & cache->free list.
> The data still in radix tree with flags BF_DIRTY.

In _abort_v():

        // We can't remove the block from the radix tree yet because
        // we're in the middle of an iteration.

and then after the iteration:

        radix_tree_remove_prefix(cache->rtree, k.bytes, k.bytes + sizeof(k.parts.fd));

I've added a unit test to demonstrate it does indeed work:

static void test_abort_forces_reread(void *context)
{
        struct fixture *f = context;
        struct mock_engine *me = f->me;
        struct bcache *cache = f->cache;
        struct block *b;
        int fd = 17;

        _expect_read(me, fd, 0);
        _expect(me, E_WAIT);
        T_ASSERT(bcache_get(cache, fd, 0, GF_DIRTY, &b));
        bcache_put(b);

        bcache_abort_fd(cache, fd);
        T_ASSERT(bcache_flush(cache));

        // Check the block is re-read
        _expect_read(me, fd, 0);
        _expect(me, E_WAIT);
        T_ASSERT(bcache_get(cache, fd, 0, 0, &b));
        bcache_put(b);
}

- Joe

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [linux-lvm] resend patch - bcache may mistakenly write data to another disk when writes error
  2019-10-29 11:01         ` Joe Thornber
@ 2019-10-29 11:41           ` Heming Zhao
  0 siblings, 0 replies; 15+ messages in thread
From: Heming Zhao @ 2019-10-29 11:41 UTC (permalink / raw)
  To: LVM general discussion and development, Joe Thornber; +Cc: David Teigland



On 10/29/19 7:01 PM, Joe Thornber wrote:
> On Tue, Oct 29, 2019 at 05:07:30AM +0000, Heming Zhao wrote:
>> Hello Joe,
>>
>> Please check my comments for your commit 2938b4dcc & 6b0d969b
>>
>> 1. b->ref_count is non-zero, and write error happens, the data never release?
>>      (no place to call _unlink_block & _free_block)
> 
> Correct, the data will not be released until the client calls bcache_abort_fd(), to
> indicate that it's giving up on the write.  That way the client is free to retry
> io, eg, see this unit test:
> 
>   689│static void test_write_bad_issue_stops_flush(void *context)
>     1│{
>     2│        struct fixture *f = context;
>     3│        struct mock_engine *me = f->me;
>     4│        struct bcache *cache = f->cache;
>     5│        struct block *b;
>     6│        int fd = 17;
>     7│
>     8│        T_ASSERT(bcache_get(cache, fd, 0, GF_ZERO, &b));
>     9│        _expect_write_bad_issue(me, fd, 0);
>    10│        bcache_put(b);
>    11│        T_ASSERT(!bcache_flush(cache));
>    12│
>    13│        // we'll let it succeed the second time
>    14│        _expect_write(me, fd, 0);
>    15│        _expect(me, E_WAIT);
>    16│        T_ASSERT(bcache_flush(cache));
>    17│}
> 
you are right.

> 
>> 2. when dev_write_bytes failed, call dev_unset_last_byte with "fd=-1" is wrong.
> 
> Quite possibly, this unset_last_byte stuff is a hack that Dave put in.  I'll look some more.
> 
> 
>> 3. I still think below error handling should be added.
>>      Below base on stable-2.02, but the core idea is same, should access the return value of io_submit & io_getevents.
>>      
>> ```
>> static bool _async_issue(struct io_engine *ioe, enum dir d, int fd,
>>      ... ...
>>       if (r < 0) {
>>           _cb_free(e->cbs, cb);
>> +       ((struct block *)context)->error = r; <== assign errno & print warning
>> +       log_warn("io_submit <%c> off %llu bytes %llu return %d:%s",
>> +           (d == DIR_READ) ? 'R' : 'W', (long long unsigned)offset,
>> +           (long long unsigned)nbytes, r, strerror(-r));
>>           return false;
>>       }
>>
>> static void _issue_low_level(struct block *b, enum dir d)
>>      ... ...
>>       dm_list_move(&cache->io_pending, &b->list);
>>    
>>       if (!cache->engine->issue(cache->engine, d, b->fd, sb, se, b->data, b)) {
>> -       /* FIXME: if io_submit() set an errno, return that instead of EIO? */
>> -       _complete_io(b, -EIO);
>> +       _complete_io(b, b->error); <=== this pass the right errno to caller.
>>           return;
>>       }
>>    }
> 
> Yep, this is good. Added.
> 
> 
>> -static void _wait_all(struct bcache *cache)
>> +static bool _wait_all(struct bcache *cache) <=== change to return error
>>    {
>> +   bool ret = true;
>>       while (!dm_list_empty(&cache->io_pending))
>> -       _wait_io(cache);
>> +       ret = _wait_io(cache);
>> +   return ret;
>>    }
>>    
>> -static void _wait_specific(struct block *b)
>> +static bool _wait_specific(struct block *b) <=== change to return error
>>    {
>> +   bool ret = true;
>>       while (_test_flags(b, BF_IO_PENDING))
>> -       _wait_io(b->cache);
>> +       ret = _wait_io(b->cache);
>> +   return ret;
>>    }
> 
> No, the wait functions just wait for io to complete, they're not interested
> in whether it succeeded.  That's what b->error is for.
> 
if io_getevents fails, what do you do? Just ignore it?
The data is still in cache->io_pending, not in cache->errored.

> 
>>
>>    bool bcache_flush(struct bcache *cache) <==== add more error handling
>>    {
>> +   bool write_ret = true, wait_ret = true;
>>    
>>       ... ...
>>           _issue_write(b);
>> +       if (b->error) write_ret = false;
>>       }
>>    
>> -   _wait_all(cache);
>> +   wait_ret = _wait_all(cache);
>>    
>> -   return dm_list_empty(&cache->errored);
>> +   if (write_ret == false || wait_ret == false ||
>> +           !dm_list_empty(&cache->errored))
>> +       return false;
>> +   else
>> +       return true;
>>    }
> 
> I don't understand how this changes the behaviour from just checking the
> size of cache->errored.
this is the stable-2.02 code; the master branch version is below.
the core idea is to check the return values of io_submit & io_getevents (see the code changes above).
```
bool bcache_flush(struct bcache *cache)
{
+   bool write_ret = true, wait_ret = true;
+
     // Only dirty data is on the errored list, since bad read blocks get
     // recycled straight away.  So we put these back on the dirty list, and
     // try and rewrite everything.
     dm_list_splice(&cache->dirty, &cache->errored);

     while (!dm_list_empty(&cache->dirty)) {
         struct block *b = dm_list_item(_list_pop(&cache->dirty), struct block);
         if (b->ref_count || _test_flags(b, BF_IO_PENDING)) {
             // The superblock may well be still locked.
             continue;
         }

         _issue_write(b);
+       if (b->error) write_ret = false;
     }

-    _wait_all(cache);
+   wait_ret = _wait_all(cache);

-   return dm_list_empty(&cache->errored);
+   if (write_ret == false || wait_ret == false ||
+           !dm_list_empty(&cache->errored))
+       return false;
+   else
+       return true;
}
```

> 
> - Joe
> 
> _______________________________________________
> linux-lvm mailing list
> linux-lvm@redhat.com
> https://www.redhat.com/mailman/listinfo/linux-lvm
> read the LVM HOW-TO at http://tldp.org/HOWTO/LVM-HOWTO/
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [linux-lvm] resend patch - bcache may mistakenly write data to another disk when writes error
  2019-10-29 11:05           ` Joe Thornber
@ 2019-10-29 11:47             ` Heming Zhao
  2019-10-29 14:41               ` Joe Thornber
  0 siblings, 1 reply; 15+ messages in thread
From: Heming Zhao @ 2019-10-29 11:47 UTC (permalink / raw)
  To: LVM general discussion and development, Joe Thornber; +Cc: David Teigland

You are right. I forgot the radix_tree_remove_prefix().
radix_tree_remove_prefix() is called by both bcache_invalidate_fd & bcache_abort_fd.
So the job will not work as expected in bcache_abort_fd, because the node has already been removed.
Please correct me if my thinking is wrong.

On 10/29/19 7:05 PM, Joe Thornber wrote:
> On Tue, Oct 29, 2019 at 09:46:56AM +0000, Heming Zhao wrote:
>> Add another comment.
>>
>> The key of commit 2938b4dcc & 6b0d969b is function _invalidate_fd().
>> But from my analysis, it looks your codes do not fix this issue.
>>
>> _invalidate_fd calls bcache_invalidate_fd & bcache_abort_fd.
>> bcache_invalidate_fd work as before, only return true or false to indicate there is or isn't fd in cache->errored.
>> Then bcache_abort_fd calls _abort_v, _abort_v calls _unlink_block & _free_block.
>> These two functions only delist block from currently cache->errored & cache->free list.
>> The data still in radix tree with flags BF_DIRTY.
> 
> In _abort_v():
> 
> 1402│        // We can't remove the block from the radix tree yet because
>     1│        // we're in the middle of an iteration.
> 
> and then after the iteration:
> 
> 1416│        radix_tree_remove_prefix(cache->rtree, k.bytes, k.bytes + sizeof(k.parts.fd));
> 
> I've added a unit test to demonstrate it does indeed work:
> 
>   840│static void test_abort_forces_reread(void *context)
>     1│{
>     2│        struct fixture *f = context;
>     3│        struct mock_engine *me = f->me;
>     4│        struct bcache *cache = f->cache;
>     5│        struct block *b;
>     6│        int fd = 17;
>     7│
>     8│        _expect_read(me, fd, 0);
>     9│        _expect(me, E_WAIT);
>    10│        T_ASSERT(bcache_get(cache, fd, 0, GF_DIRTY, &b));
>    11│        bcache_put(b);
>    12│
>    13│        bcache_abort_fd(cache, fd);
>    14│        T_ASSERT(bcache_flush(cache));
>    15│
>    16│        // Check the block is re-read
>    17│        _expect_read(me, fd, 0);
>    18│        _expect(me, E_WAIT);
>    19│        T_ASSERT(bcache_get(cache, fd, 0, 0, &b));
>    20│        bcache_put(b);
>    21│}
> 
> - Joe
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [linux-lvm] resend patch - bcache may mistakenly write data to another disk when writes error
  2019-10-29 11:47             ` Heming Zhao
@ 2019-10-29 14:41               ` Joe Thornber
  0 siblings, 0 replies; 15+ messages in thread
From: Joe Thornber @ 2019-10-29 14:41 UTC (permalink / raw)
  To: Heming Zhao; +Cc: David Teigland, LVM general discussion and development

On Tue, Oct 29, 2019 at 11:47:19AM +0000, Heming Zhao wrote:
> You are right. I forgot the radix_tree_remove_prefix().
> The radix_tree_remove_prefix() is called both bcache_invalidate_fd & bcache_abort_fd.
> So the job will not work as expected in bcache_abort_fd, because the node is already removed.
> Please correct me if my thinking is wrong.

Slightly wrong, the remove_prefix in invalidate should only be running if it.success is true.
Well spotted.

- Joe

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [linux-lvm] resend patch - bcache may mistakenly write data to another disk when writes error
       [not found]     ` <667efc9f-1001-37cc-c0af-b352ff366c03@suse.com>
@ 2019-11-13 15:41       ` David Teigland
  0 siblings, 0 replies; 15+ messages in thread
From: David Teigland @ 2019-11-13 15:41 UTC (permalink / raw)
  To: Heming Zhao; +Cc: thornber, linux-lvm

On Wed, Nov 13, 2019 at 02:23:42AM +0000, Heming Zhao wrote:
> label_write // @ lib/label/label.c
>   +-> dev_write_bytes //
>   |    +-> if (!bcache_flush()) //when failed, call below invalidate
>   |          label_scan_invalidate // will set fd=-1 in _scan_dev_close()
>   +-> dev_unset_last_byte // with fd: -1
>       +-> bcache_unset_last_byte
>             +-> _last_byte_fd != fd(-1), the _last_byte_xx never clean.
> ```
> So every bcache_unset_last_byte call that runs after dev_write_bytes should be modified.
> For details of the code fix, please check my V2 patch, which I sent on Oct 28th.

Got it, thanks for your work on that, please check I got all the necessary
lines from your patch.

https://sourceware.org/git/?p=lvm2.git;a=commitdiff;h=13c254fc05386d05ab6bbda2806f9ca4d3358a0c

> Would you have any plan/schedule to backport bcache bug fixes to stable-2.02 branch?

I'll try to apply them all today.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [linux-lvm] resend patch - bcache may mistakenly write data to another disk when writes error
       [not found] ` <872328cd-3d51-97bb-1c50-b54cc194c6f2@suse.com>
@ 2019-11-12 15:21   ` David Teigland
       [not found]     ` <667efc9f-1001-37cc-c0af-b352ff366c03@suse.com>
  0 siblings, 1 reply; 15+ messages in thread
From: David Teigland @ 2019-11-12 15:21 UTC (permalink / raw)
  To: Heming Zhao; +Cc: thornber, linux-lvm

On Tue, Nov 12, 2019 at 08:38:00AM +0000, Heming Zhao wrote:
> There is still buggy code waiting to be fixed.
> Please see item <2> below: the _last_byte_xx (in bcache.c) should be restored to ZERO on fail/error.
> Do you have any plan/schedule for this issue?
> Do you have any plan/schedule to backport these fixes to the stable-2.02 branch?

I don't think it should actually be a problem since bcache_unset_last_byte
checks for _last_byte_fd == fd.  But, I think it's better to not rely on that
check, so how about a change like this?

void dev_unset_last_byte(struct device *dev)
{
        if (dev->bcache_fd > 0)
                bcache_unset_last_byte(scan_bcache, dev->bcache_fd);
}

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2019-11-13 15:41 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-10-22  9:47 [linux-lvm] resend patch - bcache may mistakenly write data to another disk when writes error Heming Zhao
2019-10-23 21:31 ` Joe Thornber
2019-10-24  3:06   ` Heming Zhao
2019-10-28 15:43     ` Joe Thornber
2019-10-29  5:07       ` Heming Zhao
2019-10-29  9:46         ` Heming Zhao
2019-10-29 11:05           ` Joe Thornber
2019-10-29 11:47             ` Heming Zhao
2019-10-29 14:41               ` Joe Thornber
2019-10-29 11:01         ` Joe Thornber
2019-10-29 11:41           ` Heming Zhao
2019-10-24  3:13   ` Heming Zhao
2019-10-28  8:38     ` Heming Zhao
     [not found] <fc8ca0d7-23d9-8145-05e5-27a7ea2a7682@suse.com>
     [not found] ` <872328cd-3d51-97bb-1c50-b54cc194c6f2@suse.com>
2019-11-12 15:21   ` David Teigland
     [not found]     ` <667efc9f-1001-37cc-c0af-b352ff366c03@suse.com>
2019-11-13 15:41       ` David Teigland

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).