All of lore.kernel.org
* [PATCH 0/2] dm thin: Flush data device before committing metadata to avoid data corruption
@ 2019-12-04 14:07 Nikos Tsironis
  2019-12-04 14:07 ` [PATCH 1/2] dm thin metadata: Add support for a pre-commit callback Nikos Tsironis
                   ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Nikos Tsironis @ 2019-12-04 14:07 UTC (permalink / raw)
  To: snitzer, agk, dm-devel; +Cc: thornber, ntsironis

The thin provisioning target maintains per thin device mappings that map
virtual blocks to data blocks in the data device.

When we write to a shared block, in case of internal snapshots, or
provision a new block, in case of external snapshots, we copy the shared
block to a new data block (COW), update the mapping for the relevant
virtual block and then issue the write to the new data block.

Suppose the data device has a volatile write-back cache and the
following sequence of events occurs:

1. We write to a shared block
2. A new data block is allocated
3. We copy the shared block to the new data block using kcopyd (COW)
4. We insert the new mapping for the virtual block in the btree for that
   thin device.
5. The commit timeout expires and we commit the metadata, which now
   include the new mapping from step (4).
6. The system crashes and the data device's cache has not been flushed,
   meaning that the COWed data are lost.

The next time we read that virtual block of the thin device we read it
from the data block allocated in step (2), since the metadata have been
successfully committed. The data are lost due to the crash, so we read
garbage instead of the old, shared data.

Moreover, apart from internal and external snapshots, the same issue
exists for newly provisioned blocks, when block zeroing is enabled.
After the system recovers the provisioned blocks might contain garbage
instead of zeroes.

For more information regarding the implications of this, please see the
relevant commit.

To solve this and avoid the potential data corruption we have to flush
the pool's data device before committing its metadata.

This ensures that the data blocks of any newly inserted mappings are
properly written to non-volatile storage and won't be lost in case of a
crash.
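
The failure sequence and the fix can be illustrated with a small, hypothetical
model of a data device with a volatile write-back cache (the names below are
illustrative only and do not correspond to any kernel APIs):

```python
# Toy model of the commit-ordering bug; all names are illustrative,
# none of this corresponds to actual kernel APIs.

class DataDevice:
    def __init__(self):
        self.media = {}   # non-volatile storage
        self.cache = {}   # volatile write-back cache

    def write(self, block, data):
        self.cache[block] = data        # writes land in the volatile cache

    def flush(self):
        self.media.update(self.cache)   # destage the cache to durable media
        self.cache.clear()

    def crash(self):
        self.cache.clear()              # power loss: the volatile cache is lost

    def read(self, block):
        if block in self.cache:
            return self.cache[block]
        return self.media.get(block, "garbage")

def cow_then_commit(flush_before_commit):
    dev, metadata = DataDevice(), {"vblock": 0}
    dev.media[0] = "shared data"        # block 0: the old shared block
    dev.write(1, "shared data")         # COW copy into new block 1 (cached only)
    if flush_before_commit:
        dev.flush()                     # the fix: flush the data device first
    metadata["vblock"] = 1              # metadata commit: mapping is now durable
    dev.crash()                         # crash before the cache is flushed
    return dev.read(metadata["vblock"]) # what a reader sees after recovery

print(cow_then_commit(False))  # -> garbage
print(cow_then_commit(True))   # -> shared data
```

The invariant restored by the fix is that a committed mapping never points at
data that exist only in a volatile cache.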

Nikos Tsironis (2):
  dm thin metadata: Add support for a pre-commit callback
  dm thin: Flush data device before committing metadata

 drivers/md/dm-thin-metadata.c | 29 +++++++++++++++++++++++++++++
 drivers/md/dm-thin-metadata.h |  7 +++++++
 drivers/md/dm-thin.c          | 32 ++++++++++++++++++++++++++++++++
 3 files changed, 68 insertions(+)

-- 
2.11.0

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH 1/2] dm thin metadata: Add support for a pre-commit callback
  2019-12-04 14:07 [PATCH 0/2] dm thin: Flush data device before committing metadata to avoid data corruption Nikos Tsironis
@ 2019-12-04 14:07 ` Nikos Tsironis
  2019-12-05 19:40   ` Mike Snitzer
  2019-12-04 14:07 ` [PATCH 2/2] dm thin: Flush data device before committing metadata Nikos Tsironis
  2019-12-04 19:58 ` [PATCH 0/2] dm thin: Flush data device before committing metadata to avoid data corruption Eric Wheeler
  2 siblings, 1 reply; 18+ messages in thread
From: Nikos Tsironis @ 2019-12-04 14:07 UTC (permalink / raw)
  To: snitzer, agk, dm-devel; +Cc: thornber, ntsironis

Add support for one pre-commit callback which is run right before the
metadata are committed.

This allows the thin provisioning target to run a callback before the
metadata are committed and is required by the next commit.

Cc: stable@vger.kernel.org
Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
---
 drivers/md/dm-thin-metadata.c | 29 +++++++++++++++++++++++++++++
 drivers/md/dm-thin-metadata.h |  7 +++++++
 2 files changed, 36 insertions(+)

diff --git a/drivers/md/dm-thin-metadata.c b/drivers/md/dm-thin-metadata.c
index 4c68a7b93d5e..b88d6d701f5b 100644
--- a/drivers/md/dm-thin-metadata.c
+++ b/drivers/md/dm-thin-metadata.c
@@ -189,6 +189,15 @@ struct dm_pool_metadata {
 	sector_t data_block_size;
 
 	/*
+	 * Pre-commit callback.
+	 *
+	 * This allows the thin provisioning target to run a callback before
+	 * the metadata are committed.
+	 */
+	dm_pool_pre_commit_fn pre_commit_fn;
+	void *pre_commit_context;
+
+	/*
 	 * We reserve a section of the metadata for commit overhead.
 	 * All reported space does *not* include this.
 	 */
@@ -826,6 +835,14 @@ static int __commit_transaction(struct dm_pool_metadata *pmd)
 	if (unlikely(!pmd->in_service))
 		return 0;
 
+	if (pmd->pre_commit_fn) {
+		r = pmd->pre_commit_fn(pmd->pre_commit_context);
+		if (r < 0) {
+			DMERR("pre-commit callback failed");
+			return r;
+		}
+	}
+
 	r = __write_changed_details(pmd);
 	if (r < 0)
 		return r;
@@ -892,6 +909,8 @@ struct dm_pool_metadata *dm_pool_metadata_open(struct block_device *bdev,
 	pmd->in_service = false;
 	pmd->bdev = bdev;
 	pmd->data_block_size = data_block_size;
+	pmd->pre_commit_fn = NULL;
+	pmd->pre_commit_context = NULL;
 
 	r = __create_persistent_data_objects(pmd, format_device);
 	if (r) {
@@ -2044,6 +2063,16 @@ int dm_pool_register_metadata_threshold(struct dm_pool_metadata *pmd,
 	return r;
 }
 
+void dm_pool_register_pre_commit_callback(struct dm_pool_metadata *pmd,
+					  dm_pool_pre_commit_fn fn,
+					  void *context)
+{
+	pmd_write_lock_in_core(pmd);
+	pmd->pre_commit_fn = fn;
+	pmd->pre_commit_context = context;
+	pmd_write_unlock(pmd);
+}
+
 int dm_pool_metadata_set_needs_check(struct dm_pool_metadata *pmd)
 {
 	int r = -EINVAL;
diff --git a/drivers/md/dm-thin-metadata.h b/drivers/md/dm-thin-metadata.h
index f6be0d733c20..7ef56bd2a7e3 100644
--- a/drivers/md/dm-thin-metadata.h
+++ b/drivers/md/dm-thin-metadata.h
@@ -230,6 +230,13 @@ bool dm_pool_metadata_needs_check(struct dm_pool_metadata *pmd);
  */
 void dm_pool_issue_prefetches(struct dm_pool_metadata *pmd);
 
+/* Pre-commit callback */
+typedef int (*dm_pool_pre_commit_fn)(void *context);
+
+void dm_pool_register_pre_commit_callback(struct dm_pool_metadata *pmd,
+					  dm_pool_pre_commit_fn fn,
+					  void *context);
+
 /*----------------------------------------------------------------*/
 
 #endif
-- 
2.11.0


* [PATCH 2/2] dm thin: Flush data device before committing metadata
  2019-12-04 14:07 [PATCH 0/2] dm thin: Flush data device before committing metadata to avoid data corruption Nikos Tsironis
  2019-12-04 14:07 ` [PATCH 1/2] dm thin metadata: Add support for a pre-commit callback Nikos Tsironis
@ 2019-12-04 14:07 ` Nikos Tsironis
  2019-12-04 15:27   ` Joe Thornber
  2019-12-04 19:58 ` [PATCH 0/2] dm thin: Flush data device before committing metadata to avoid data corruption Eric Wheeler
  2 siblings, 1 reply; 18+ messages in thread
From: Nikos Tsironis @ 2019-12-04 14:07 UTC (permalink / raw)
  To: snitzer, agk, dm-devel; +Cc: thornber, ntsironis

The thin provisioning target maintains per thin device mappings that map
virtual blocks to data blocks in the data device.

When we write to a shared block, in case of internal snapshots, or
provision a new block, in case of external snapshots, we copy the shared
block to a new data block (COW), update the mapping for the relevant
virtual block and then issue the write to the new data block.

Suppose the data device has a volatile write-back cache and the
following sequence of events occurs:

1. We write to a shared block
2. A new data block is allocated
3. We copy the shared block to the new data block using kcopyd (COW)
4. We insert the new mapping for the virtual block in the btree for that
   thin device.
5. The commit timeout expires and we commit the metadata, which now
   include the new mapping from step (4).
6. The system crashes and the data device's cache has not been flushed,
   meaning that the COWed data are lost.

The next time we read that virtual block of the thin device we read it
from the data block allocated in step (2), since the metadata have been
successfully committed. The data are lost due to the crash, so we read
garbage instead of the old, shared data.

This has the following implications:

1. In case of writes to shared blocks, with size smaller than the pool's
   block size (which means we first copy the whole block and then issue
   the smaller write), we corrupt data that the user never touched.

2. In case of writes to shared blocks, with size equal to the device's
   logical block size, we fail to provide atomic sector writes. When the
   system recovers the user will read garbage from that sector instead
   of the old data or the new data.

3. Even for writes to shared blocks, with size equal to the pool's block
   size (overwrites), after the system recovers, the written sectors
   will contain garbage instead of a random mix of sectors containing
   either old data or new data, so we again fail to provide atomic
   sector writes.

4. Even when the user flushes the thin device, because we first commit
   the metadata and then pass down the flush, the same risk for
   corruption exists (if the system crashes after the metadata have been
   committed but before the flush is passed down to the data device).

The only case which is unaffected is that of writes with size equal to
the pool's block size and with the FUA flag set. But, because FUA writes
trigger metadata commits, this case can trigger the corruption
indirectly.

Moreover, apart from internal and external snapshots, the same issue
exists for newly provisioned blocks, when block zeroing is enabled.
After the system recovers the provisioned blocks might contain garbage
instead of zeroes.

To solve this and avoid the potential data corruption we flush the
pool's data device **before** committing its metadata.

This ensures that the data blocks of any newly inserted mappings are
properly written to non-volatile storage and won't be lost in case of a
crash.

Cc: stable@vger.kernel.org
Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
---
 drivers/md/dm-thin.c | 32 ++++++++++++++++++++++++++++++++
 1 file changed, 32 insertions(+)

diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
index 5a2c494cb552..e0be545080d0 100644
--- a/drivers/md/dm-thin.c
+++ b/drivers/md/dm-thin.c
@@ -3180,6 +3180,34 @@ static void metadata_low_callback(void *context)
 	dm_table_event(pool->ti->table);
 }
 
+/*
+ * We need to flush the data device **before** committing the metadata.
+ *
+ * This ensures that the data blocks of any newly inserted mappings are
+ * properly written to non-volatile storage and won't be lost in case of a
+ * crash.
+ *
+ * Failure to do so can result in data corruption in the case of internal or
+ * external snapshots and in the case of newly provisioned blocks, when block
+ * zeroing is enabled.
+ */
+static int metadata_pre_commit_callback(void *context)
+{
+	struct pool_c *pt = context;
+	struct bio bio;
+	int r;
+
+	bio_init(&bio, NULL, 0);
+	bio_set_dev(&bio, pt->data_dev->bdev);
+	bio.bi_opf = REQ_OP_WRITE | REQ_PREFLUSH;
+
+	r = submit_bio_wait(&bio);
+
+	bio_uninit(&bio);
+
+	return r;
+}
+
 static sector_t get_dev_size(struct block_device *bdev)
 {
 	return i_size_read(bdev->bd_inode) >> SECTOR_SHIFT;
@@ -3374,6 +3402,10 @@ static int pool_ctr(struct dm_target *ti, unsigned argc, char **argv)
 	if (r)
 		goto out_flags_changed;
 
+	dm_pool_register_pre_commit_callback(pt->pool->pmd,
+					     metadata_pre_commit_callback,
+					     pt);
+
 	pt->callbacks.congested_fn = pool_is_congested;
 	dm_table_add_target_callbacks(ti->table, &pt->callbacks);
 
-- 
2.11.0


* Re: [PATCH 2/2] dm thin: Flush data device before committing metadata
  2019-12-04 14:07 ` [PATCH 2/2] dm thin: Flush data device before committing metadata Nikos Tsironis
@ 2019-12-04 15:27   ` Joe Thornber
  2019-12-04 16:17     ` Nikos Tsironis
  0 siblings, 1 reply; 18+ messages in thread
From: Joe Thornber @ 2019-12-04 15:27 UTC (permalink / raw)
  To: Nikos Tsironis; +Cc: dm-devel, agk, snitzer

On Wed, Dec 04, 2019 at 04:07:42PM +0200, Nikos Tsironis wrote:
> The thin provisioning target maintains per thin device mappings that map
> virtual blocks to data blocks in the data device.


Ack.  But I think we're issuing the FLUSH twice with your patch.  Since the
original bio is still remapped and issued at the end of process_deferred_bios?

- Joe


* Re: [PATCH 2/2] dm thin: Flush data device before committing metadata
  2019-12-04 15:27   ` Joe Thornber
@ 2019-12-04 16:17     ` Nikos Tsironis
  2019-12-04 16:39       ` Mike Snitzer
  0 siblings, 1 reply; 18+ messages in thread
From: Nikos Tsironis @ 2019-12-04 16:17 UTC (permalink / raw)
  To: snitzer, agk, dm-devel, thornber

On 12/4/19 5:27 PM, Joe Thornber wrote:
> On Wed, Dec 04, 2019 at 04:07:42PM +0200, Nikos Tsironis wrote:
>> The thin provisioning target maintains per thin device mappings that map
>> virtual blocks to data blocks in the data device.
> 
> 
> Ack.  But I think we're issuing the FLUSH twice with your patch.  Since the
> original bio is still remapped and issued at the end of process_deferred_bios?
> 

Yes, that's correct. I thought of it, and of putting a check in
process_deferred_bios() to complete FLUSH bios immediately, but I had
one concern and preferred to be safe rather than sorry.

In __commit_transaction() there is the following check:

   if (unlikely(!pmd->in_service))
             return 0;

, which means we don't commit the metadata, and thus we don't flush the
data device, in case the pool is not in service.

Opening a thin device doesn't seem to put the pool in service, since
dm_pool_open_thin_device() uses pmd_write_lock_in_core().

Can I assume that the pool is in service if I/O can be mapped to a thin
device? If so, it's safe to put such a check in process_deferred_bios().

On second thought though, in order for a flush bio to end up in
deferred_flush_bios in the first place, someone must have changed the
metadata and thus put the pool in service. Otherwise, it would have been
submitted directly to the data device. So, it's probably safe to check
for flush bios after commit() in process_deferred_bios() and complete
them immediately.

If you confirm too that this is safe, I will send a second version of
the patch adding the check.

Thanks,
Nikos

> - Joe
> 


* Re: [PATCH 2/2] dm thin: Flush data device before committing metadata
  2019-12-04 16:17     ` Nikos Tsironis
@ 2019-12-04 16:39       ` Mike Snitzer
  2019-12-04 16:47         ` Nikos Tsironis
  0 siblings, 1 reply; 18+ messages in thread
From: Mike Snitzer @ 2019-12-04 16:39 UTC (permalink / raw)
  To: Nikos Tsironis; +Cc: dm-devel, thornber, agk

On Wed, Dec 04 2019 at 11:17am -0500,
Nikos Tsironis <ntsironis@arrikto.com> wrote:

> On 12/4/19 5:27 PM, Joe Thornber wrote:
> >On Wed, Dec 04, 2019 at 04:07:42PM +0200, Nikos Tsironis wrote:
> >>The thin provisioning target maintains per thin device mappings that map
> >>virtual blocks to data blocks in the data device.
> >
> >
> >Ack.  But I think we're issuing the FLUSH twice with your patch.  Since the
> >original bio is still remapped and issued at the end of process_deferred_bios?
> >
> 
> Yes, that's correct. I thought of it and of putting a check in
> process_deferred_bios() to complete FLUSH bios immediately, but I have
> one concern and I preferred to be safe than sorry.
> 
> In __commit_transaction() there is the following check:
> 
>   if (unlikely(!pmd->in_service))
>             return 0;
> 
> , which means we don't commit the metadata, and thus we don't flush the
> data device, in case the pool is not in service.
> 
> Opening a thin device doesn't seem to put the pool in service, since
> dm_pool_open_thin_device() uses pmd_write_lock_in_core().
> 
> Can I assume that the pool is in service if I/O can be mapped to a thin
> device? If so, it's safe to put such a check in process_deferred_bios().

In service means upper layer has issued a write to a thin device of a
pool.  The header for commit 873f258becca87 gets into more detail.

> On second thought though, in order for a flush bio to end up in
> deferred_flush_bios in the first place, someone must have changed the
> metadata and thus put the pool in service. Otherwise, it would have been
> submitted directly to the data device. So, it's probably safe to check
> for flush bios after commit() in process_deferred_bios() and complete
> them immediately.

Yes, I think so, which was Joe's original point.
 
> If you confirm too that this is safe, I will send a second version of
> the patch adding the check.

Not seeing why we need another in_service check.  After your changes are
applied, any commit will trigger a preceding flush, so the deferred
flushes are redundant.

By definition, these deferred bios imply the pool is in service.

I'd be fine with seeing a 3rd follow-on thinp patch that completes the
redundant flushes immediately.

Thanks,
Mike


* Re: [PATCH 2/2] dm thin: Flush data device before committing metadata
  2019-12-04 16:39       ` Mike Snitzer
@ 2019-12-04 16:47         ` Nikos Tsironis
  0 siblings, 0 replies; 18+ messages in thread
From: Nikos Tsironis @ 2019-12-04 16:47 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: dm-devel, thornber, agk

On 12/4/19 6:39 PM, Mike Snitzer wrote:
> On Wed, Dec 04 2019 at 11:17am -0500,
> Nikos Tsironis <ntsironis@arrikto.com> wrote:
> 
>> On 12/4/19 5:27 PM, Joe Thornber wrote:
>>> On Wed, Dec 04, 2019 at 04:07:42PM +0200, Nikos Tsironis wrote:
>>>> The thin provisioning target maintains per thin device mappings that map
>>>> virtual blocks to data blocks in the data device.
>>>
>>>
>>> Ack.  But I think we're issuing the FLUSH twice with your patch.  Since the
>>> original bio is still remapped and issued at the end of process_deferred_bios?
>>>
>>
>> Yes, that's correct. I thought of it and of putting a check in
>> process_deferred_bios() to complete FLUSH bios immediately, but I have
>> one concern and I preferred to be safe than sorry.
>>
>> In __commit_transaction() there is the following check:
>>
>>    if (unlikely(!pmd->in_service))
>>              return 0;
>>
>> , which means we don't commit the metadata, and thus we don't flush the
>> data device, in case the pool is not in service.
>>
>> Opening a thin device doesn't seem to put the pool in service, since
>> dm_pool_open_thin_device() uses pmd_write_lock_in_core().
>>
>> Can I assume that the pool is in service if I/O can be mapped to a thin
>> device? If so, it's safe to put such a check in process_deferred_bios().
> 
> In service means upper layer has issued a write to a thin device of a
> pool.  The header for commit 873f258becca87 gets into more detail.
> 
>> On second thought though, in order for a flush bio to end up in
>> deferred_flush_bios in the first place, someone must have changed the
>> metadata and thus put the pool in service. Otherwise, it would have been
>> submitted directly to the data device. So, it's probably safe to check
>> for flush bios after commit() in process_deferred_bios() and complete
>> them immediately.
> 
> Yes, I think so, which was Joe's original point.
>   
>> If you confirm too that this is safe, I will send a second version of
>> the patch adding the check.
> 
> Not seeing why we need another in_service check.  After your changes are
> applied, any commit will trigger a preceding flush, so the deferred
> flushes are redundant.
> 

Yes, I meant adding a check in process_deferred_bios(), after commit(),
that catches REQ_PREFLUSH bios and completes them immediately. I should
have clarified that.

> By definition, these deferred bios imply the pool is in service.
> 
> I'd be fine with seeing a 3rd follow-on thinp patch that completes the
> redundant flushes immediately.
> 

Ack, I will send another patch fixing this.

Nikos

> Thanks,
> Mike
> 


* Re: [PATCH 0/2] dm thin: Flush data device before committing metadata to avoid data corruption
  2019-12-04 14:07 [PATCH 0/2] dm thin: Flush data device before committing metadata to avoid data corruption Nikos Tsironis
  2019-12-04 14:07 ` [PATCH 1/2] dm thin metadata: Add support for a pre-commit callback Nikos Tsironis
  2019-12-04 14:07 ` [PATCH 2/2] dm thin: Flush data device before committing metadata Nikos Tsironis
@ 2019-12-04 19:58 ` Eric Wheeler
  2019-12-04 20:17   ` Mike Snitzer
  2 siblings, 1 reply; 18+ messages in thread
From: Eric Wheeler @ 2019-12-04 19:58 UTC (permalink / raw)
  To: Nikos Tsironis; +Cc: dm-devel, thornber, agk, snitzer

On Wed, 4 Dec 2019, Nikos Tsironis wrote:

> The thin provisioning target maintains per thin device mappings that map
> virtual blocks to data blocks in the data device.
> 
> When we write to a shared block, in case of internal snapshots, or
> provision a new block, in case of external snapshots, we copy the shared
> block to a new data block (COW), update the mapping for the relevant
> virtual block and then issue the write to the new data block.
> 
> Suppose the data device has a volatile write-back cache and the
> following sequence of events occur:

For those with NV caches, can the data disk flush be optional (maybe as a 
table flag)?

--
Eric Wheeler



> 
> 1. We write to a shared block
> 2. A new data block is allocated
> 3. We copy the shared block to the new data block using kcopyd (COW)
> 4. We insert the new mapping for the virtual block in the btree for that
>    thin device.
> 5. The commit timeout expires and we commit the metadata, that now
>    includes the new mapping from step (4).
> 6. The system crashes and the data device's cache has not been flushed,
>    meaning that the COWed data are lost.
> 
> The next time we read that virtual block of the thin device we read it
> from the data block allocated in step (2), since the metadata have been
> successfully committed. The data are lost due to the crash, so we read
> garbage instead of the old, shared data.
> 
> Moreover, apart from internal and external snapshots, the same issue
> exists for newly provisioned blocks, when block zeroing is enabled.
> After the system recovers the provisioned blocks might contain garbage
> instead of zeroes.
> 
> For more information regarding the implications of this please see the
> relevant commit.
> 
> To solve this and avoid the potential data corruption we have to flush
> the pool's data device before committing its metadata.
> 
> This ensures that the data blocks of any newly inserted mappings are
> properly written to non-volatile storage and won't be lost in case of a
> crash.
> 
> Nikos Tsironis (2):
>   dm thin metadata: Add support for a pre-commit callback
>   dm thin: Flush data device before committing metadata
> 
>  drivers/md/dm-thin-metadata.c | 29 +++++++++++++++++++++++++++++
>  drivers/md/dm-thin-metadata.h |  7 +++++++
>  drivers/md/dm-thin.c          | 32 ++++++++++++++++++++++++++++++++
>  3 files changed, 68 insertions(+)
> 
> -- 
> 2.11.0
> 
> 
> --
> dm-devel mailing list
> dm-devel@redhat.com
> https://www.redhat.com/mailman/listinfo/dm-devel
> 
> 


* Re: [PATCH 0/2] dm thin: Flush data device before committing metadata to avoid data corruption
  2019-12-04 19:58 ` [PATCH 0/2] dm thin: Flush data device before committing metadata to avoid data corruption Eric Wheeler
@ 2019-12-04 20:17   ` Mike Snitzer
  2019-12-05 15:31     ` Nikos Tsironis
  0 siblings, 1 reply; 18+ messages in thread
From: Mike Snitzer @ 2019-12-04 20:17 UTC (permalink / raw)
  To: Eric Wheeler; +Cc: dm-devel, thornber, Nikos Tsironis, agk

On Wed, Dec 04 2019 at  2:58pm -0500,
Eric Wheeler <dm-devel@lists.ewheeler.net> wrote:

> On Wed, 4 Dec 2019, Nikos Tsironis wrote:
> 
> > The thin provisioning target maintains per thin device mappings that map
> > virtual blocks to data blocks in the data device.
> > 
> > When we write to a shared block, in case of internal snapshots, or
> > provision a new block, in case of external snapshots, we copy the shared
> > block to a new data block (COW), update the mapping for the relevant
> > virtual block and then issue the write to the new data block.
> > 
> > Suppose the data device has a volatile write-back cache and the
> > following sequence of events occur:
> 
> For those with NV caches, can the data disk flush be optional (maybe as a 
> table flag)?

IIRC block core should avoid issuing the flush if not needed.  I'll have
a closer look to verify as much.

Mike


* Re: [PATCH 0/2] dm thin: Flush data device before committing metadata to avoid data corruption
  2019-12-04 20:17   ` Mike Snitzer
@ 2019-12-05 15:31     ` Nikos Tsironis
  2019-12-05 15:42       ` Mike Snitzer
  2019-12-05 22:34       ` Eric Wheeler
  0 siblings, 2 replies; 18+ messages in thread
From: Nikos Tsironis @ 2019-12-05 15:31 UTC (permalink / raw)
  To: Mike Snitzer, Eric Wheeler; +Cc: dm-devel, thornber, agk

On 12/4/19 10:17 PM, Mike Snitzer wrote:
> On Wed, Dec 04 2019 at  2:58pm -0500,
> Eric Wheeler <dm-devel@lists.ewheeler.net> wrote:
> 
>> On Wed, 4 Dec 2019, Nikos Tsironis wrote:
>>
>>> The thin provisioning target maintains per thin device mappings that map
>>> virtual blocks to data blocks in the data device.
>>>
>>> When we write to a shared block, in case of internal snapshots, or
>>> provision a new block, in case of external snapshots, we copy the shared
>>> block to a new data block (COW), update the mapping for the relevant
>>> virtual block and then issue the write to the new data block.
>>>
>>> Suppose the data device has a volatile write-back cache and the
>>> following sequence of events occur:
>>
>> For those with NV caches, can the data disk flush be optional (maybe as a
>> table flag)?
> 
> IIRC block core should avoid issuing the flush if not needed.  I'll have
> a closer look to verify as much.
> 

For devices without a volatile write-back cache, block core strips off
the REQ_PREFLUSH and REQ_FUA bits from requests with a payload and
completes empty REQ_PREFLUSH requests before entering the driver.

This happens in generic_make_request_checks():

		/*
		 * Filter flush bio's early so that make_request based
		 * drivers without flush support don't have to worry
		 * about them.
		 */
		if (op_is_flush(bio->bi_opf) &&
		    !test_bit(QUEUE_FLAG_WC, &q->queue_flags)) {
		        bio->bi_opf &= ~(REQ_PREFLUSH | REQ_FUA);
		        if (!nr_sectors) {
		                status = BLK_STS_OK;
		                goto end_io;
		        }
		}
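
As a rough sketch, the filtering above behaves like the following Python
model (the flag values are invented for illustration; only the logic mirrors
the kernel snippet):

```python
# Sketch of the flush filtering in generic_make_request_checks();
# the flag values here are made up for illustration.

REQ_PREFLUSH = 1 << 0
REQ_FUA      = 1 << 1

def filter_flush(bi_opf, nr_sectors, queue_has_writeback_cache):
    """Return (new_opf, fate) for a bio entering the block layer."""
    if (bi_opf & (REQ_PREFLUSH | REQ_FUA)) and not queue_has_writeback_cache:
        # No volatile cache: flush semantics are a no-op, so strip the flags.
        bi_opf &= ~(REQ_PREFLUSH | REQ_FUA)
        if nr_sectors == 0:
            # An empty flush has nothing left to do; complete it early.
            return bi_opf, "completed early"
    return bi_opf, "passed to driver"

# An empty flush to a device without a write-back cache never reaches the
# driver, while the same flush does reach it when a write-back cache exists:
print(filter_flush(REQ_PREFLUSH, 0, queue_has_writeback_cache=False))
print(filter_flush(REQ_PREFLUSH, 0, queue_has_writeback_cache=True))
```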

If I am not mistaken, it all depends on whether the underlying device
reports the existence of a write back cache or not.

You could check this by looking at /sys/block/<device>/queue/write_cache.
If it says "write back", then flushes will be issued.

In case the sysfs entry reports a "write back" cache for a device with a
non-volatile write cache, I think you can change the kernel's view of
the device by writing to this entry (you could also create a udev rule
for this).

This way you can set the write cache as write through. This will
eliminate the cache flushes issued by the kernel, without altering the
device state (Documentation/block/queue-sysfs.rst).

Nikos

> Mike
> 


* Re: [PATCH 0/2] dm thin: Flush data device before committing metadata to avoid data corruption
  2019-12-05 15:31     ` Nikos Tsironis
@ 2019-12-05 15:42       ` Mike Snitzer
  2019-12-05 16:02         ` Nikos Tsironis
  2019-12-05 22:34       ` Eric Wheeler
  1 sibling, 1 reply; 18+ messages in thread
From: Mike Snitzer @ 2019-12-05 15:42 UTC (permalink / raw)
  To: Nikos Tsironis; +Cc: Eric Wheeler, dm-devel, thornber, agk

On Thu, Dec 05 2019 at 10:31am -0500,
Nikos Tsironis <ntsironis@arrikto.com> wrote:

> On 12/4/19 10:17 PM, Mike Snitzer wrote:
> >On Wed, Dec 04 2019 at  2:58pm -0500,
> >Eric Wheeler <dm-devel@lists.ewheeler.net> wrote:
> >
> >>On Wed, 4 Dec 2019, Nikos Tsironis wrote:
> >>
> >>>The thin provisioning target maintains per thin device mappings that map
> >>>virtual blocks to data blocks in the data device.
> >>>
> >>>When we write to a shared block, in case of internal snapshots, or
> >>>provision a new block, in case of external snapshots, we copy the shared
> >>>block to a new data block (COW), update the mapping for the relevant
> >>>virtual block and then issue the write to the new data block.
> >>>
> >>>Suppose the data device has a volatile write-back cache and the
> >>>following sequence of events occur:
> >>
> >>For those with NV caches, can the data disk flush be optional (maybe as a
> >>table flag)?
> >
> >IIRC block core should avoid issuing the flush if not needed.  I'll have
> >a closer look to verify as much.
> >
> 
> For devices without a volatile write-back cache block core strips off
> the REQ_PREFLUSH and REQ_FUA bits from requests with a payload and
> completes empty REQ_PREFLUSH requests before entering the driver.
> 
> This happens in generic_make_request_checks():
> 
> 		/*
> 		 * Filter flush bio's early so that make_request based
> 		 * drivers without flush support don't have to worry
> 		 * about them.
> 		 */
> 		if (op_is_flush(bio->bi_opf) &&
> 		    !test_bit(QUEUE_FLAG_WC, &q->queue_flags)) {
> 		        bio->bi_opf &= ~(REQ_PREFLUSH | REQ_FUA);
> 		        if (!nr_sectors) {
> 		                status = BLK_STS_OK;
> 		                goto end_io;
> 		        }
> 		}
> 
> If I am not mistaken, it all depends on whether the underlying device
> reports the existence of a write back cache or not.

Yes, thanks for confirming my memory of the situation.

> You could check this by looking at /sys/block/<device>/queue/write_cache
> If it says "write back" then flushes will be issued.
> 
> In case the sysfs entry reports a "write back" cache for a device with a
> non-volatile write cache, I think you can change the kernel's view of
> the device by writing to this entry (you could also create a udev rule
> for this).
> 
> This way you can set the write cache as write through. This will
> eliminate the cache flushes issued by the kernel, without altering the
> device state (Documentation/block/queue-sysfs.rst).

Not delved into this aspect of Linux's capabilities but it strikes me as
"dangerous" to twiddle device capabilities like this.  Best to fix
driver to properly expose cache (or not, as the case may be).  It should
also be noted that with DM, the capabilities are stacked up at device
creation time.  So any changes to the underlying devices will _not_ be
reflected to the high level DM device.

Mike


* Re: [PATCH 0/2] dm thin: Flush data device before committing metadata to avoid data corruption
  2019-12-05 15:42       ` Mike Snitzer
@ 2019-12-05 16:02         ` Nikos Tsironis
  0 siblings, 0 replies; 18+ messages in thread
From: Nikos Tsironis @ 2019-12-05 16:02 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: Eric Wheeler, dm-devel, thornber, agk

On 12/5/19 5:42 PM, Mike Snitzer wrote:
> On Thu, Dec 05 2019 at 10:31am -0500,
> Nikos Tsironis <ntsironis@arrikto.com> wrote:
> 
>> On 12/4/19 10:17 PM, Mike Snitzer wrote:
>>> On Wed, Dec 04 2019 at  2:58pm -0500,
>>> Eric Wheeler <dm-devel@lists.ewheeler.net> wrote:
>>>
>>>> On Wed, 4 Dec 2019, Nikos Tsironis wrote:
>>>>
>>>>> The thin provisioning target maintains per thin device mappings that map
>>>>> virtual blocks to data blocks in the data device.
>>>>>
>>>>> When we write to a shared block, in case of internal snapshots, or
>>>>> provision a new block, in case of external snapshots, we copy the shared
>>>>> block to a new data block (COW), update the mapping for the relevant
>>>>> virtual block and then issue the write to the new data block.
>>>>>
>>>>> Suppose the data device has a volatile write-back cache and the
>>>>> following sequence of events occur:
>>>>
>>>> For those with NV caches, can the data disk flush be optional (maybe as a
>>>> table flag)?
>>>
>>> IIRC block core should avoid issuing the flush if not needed.  I'll have
>>> a closer look to verify as much.
>>>
>>
>> For devices without a volatile write-back cache block core strips off
>> the REQ_PREFLUSH and REQ_FUA bits from requests with a payload and
>> completes empty REQ_PREFLUSH requests before entering the driver.
>>
>> This happens in generic_make_request_checks():
>>
>> 		/*
>> 		 * Filter flush bio's early so that make_request based
>> 		 * drivers without flush support don't have to worry
>> 		 * about them.
>> 		 */
>> 		if (op_is_flush(bio->bi_opf) &&
>> 		    !test_bit(QUEUE_FLAG_WC, &q->queue_flags)) {
>> 		        bio->bi_opf &= ~(REQ_PREFLUSH | REQ_FUA);
>> 		        if (!nr_sectors) {
>> 		                status = BLK_STS_OK;
>> 		                goto end_io;
>> 		        }
>> 		}
>>
>> If I am not mistaken, it all depends on whether the underlying device
>> reports the existence of a write back cache or not.
> 
> Yes, thanks for confirming my memory of the situation.
> 
>> You could check this by looking at /sys/block/<device>/queue/write_cache
>> If it says "write back" then flushes will be issued.
>>
>> In case the sysfs entry reports a "write back" cache for a device with a
>> non-volatile write cache, I think you can change the kernel's view of
>> the device by writing to this entry (you could also create a udev rule
>> for this).
>>
>> This way you can set the write cache as write through. This will
>> eliminate the cache flushes issued by the kernel, without altering the
>> device state (Documentation/block/queue-sysfs.rst).
> 
> Not delved into this aspect of Linux's capabilities but it strikes me as
> "dangerous" to twiddle device capabilities like this.  Best to fix the
> driver to properly expose the cache (or not, as the case may be).  It
> should also be noted that with DM, the capabilities are stacked up at
> device creation time.  So any changes to the underlying devices will
> _not_ be reflected to the higher-level DM device.
> 

Yes, I agree completely. The queue-sysfs doc also mentions that it's not
safe to do that. I just mentioned it for completeness.

As far as DM is concerned, you are right. You would have to deactivate
and reactivate all DM devices for the change to propagate to upper
layers. That's why I mentioned udev, because that way the change will be
made to the lower level device when its queue is first created and it
will be properly propagated to upper layers.
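
Such a udev rule could look roughly like the following (purely
illustrative; the device match is an assumption that would need adapting
to the actual hardware, and the safety caveats discussed in this thread
still apply):

```
# /etc/udev/rules.d/99-write-cache.rules (hypothetical example)
# Mark sda's cache as "write through" as soon as its queue is created,
# so DM devices stacked on top of it inherit the setting.
ACTION=="add", SUBSYSTEM=="block", KERNEL=="sda", \
  ATTR{queue/write_cache}="write through"
```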

But, again, I agree that this is not something safe to do and it's
better to make sure the driver properly exposes the cache capabilities,
as you said.

Nikos

> Mike
> 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/2] dm thin metadata: Add support for a pre-commit callback
  2019-12-04 14:07 ` [PATCH 1/2] dm thin metadata: Add support for a pre-commit callback Nikos Tsironis
@ 2019-12-05 19:40   ` Mike Snitzer
  2019-12-05 21:33     ` Nikos Tsironis
  0 siblings, 1 reply; 18+ messages in thread
From: Mike Snitzer @ 2019-12-05 19:40 UTC (permalink / raw)
  To: Nikos Tsironis; +Cc: dm-devel, thornber, agk

On Wed, Dec 04 2019 at  9:07P -0500,
Nikos Tsironis <ntsironis@arrikto.com> wrote:

> Add support for one pre-commit callback which is run right before the
> metadata are committed.
> 
> This allows the thin provisioning target to run a callback before the
> metadata are committed and is required by the next commit.
> 
> Cc: stable@vger.kernel.org
> Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
> ---
>  drivers/md/dm-thin-metadata.c | 29 +++++++++++++++++++++++++++++
>  drivers/md/dm-thin-metadata.h |  7 +++++++
>  2 files changed, 36 insertions(+)
> 
> diff --git a/drivers/md/dm-thin-metadata.c b/drivers/md/dm-thin-metadata.c
> index 4c68a7b93d5e..b88d6d701f5b 100644
> --- a/drivers/md/dm-thin-metadata.c
> +++ b/drivers/md/dm-thin-metadata.c
> @@ -189,6 +189,15 @@ struct dm_pool_metadata {
>  	sector_t data_block_size;
>  
>  	/*
> +	 * Pre-commit callback.
> +	 *
> +	 * This allows the thin provisioning target to run a callback before
> +	 * the metadata are committed.
> +	 */
> +	dm_pool_pre_commit_fn pre_commit_fn;
> +	void *pre_commit_context;
> +
> +	/*
>  	 * We reserve a section of the metadata for commit overhead.
>  	 * All reported space does *not* include this.
>  	 */
> @@ -826,6 +835,14 @@ static int __commit_transaction(struct dm_pool_metadata *pmd)
>  	if (unlikely(!pmd->in_service))
>  		return 0;
>  
> +	if (pmd->pre_commit_fn) {
> +		r = pmd->pre_commit_fn(pmd->pre_commit_context);
> +		if (r < 0) {
> +			DMERR("pre-commit callback failed");
> +			return r;
> +		}
> +	}
> +
>  	r = __write_changed_details(pmd);
>  	if (r < 0)
>  		return r;
> @@ -892,6 +909,8 @@ struct dm_pool_metadata *dm_pool_metadata_open(struct block_device *bdev,
>  	pmd->in_service = false;
>  	pmd->bdev = bdev;
>  	pmd->data_block_size = data_block_size;
> +	pmd->pre_commit_fn = NULL;
> +	pmd->pre_commit_context = NULL;
>  
>  	r = __create_persistent_data_objects(pmd, format_device);
>  	if (r) {
> @@ -2044,6 +2063,16 @@ int dm_pool_register_metadata_threshold(struct dm_pool_metadata *pmd,
>  	return r;
>  }
>  
> +void dm_pool_register_pre_commit_callback(struct dm_pool_metadata *pmd,
> +					  dm_pool_pre_commit_fn fn,
> +					  void *context)
> +{
> +	pmd_write_lock_in_core(pmd);
> +	pmd->pre_commit_fn = fn;
> +	pmd->pre_commit_context = context;
> +	pmd_write_unlock(pmd);
> +}
> +
>  int dm_pool_metadata_set_needs_check(struct dm_pool_metadata *pmd)
>  {
>  	int r = -EINVAL;
> diff --git a/drivers/md/dm-thin-metadata.h b/drivers/md/dm-thin-metadata.h
> index f6be0d733c20..7ef56bd2a7e3 100644
> --- a/drivers/md/dm-thin-metadata.h
> +++ b/drivers/md/dm-thin-metadata.h
> @@ -230,6 +230,13 @@ bool dm_pool_metadata_needs_check(struct dm_pool_metadata *pmd);
>   */
>  void dm_pool_issue_prefetches(struct dm_pool_metadata *pmd);
>  
> +/* Pre-commit callback */
> +typedef int (*dm_pool_pre_commit_fn)(void *context);
> +
> +void dm_pool_register_pre_commit_callback(struct dm_pool_metadata *pmd,
> +					  dm_pool_pre_commit_fn fn,
> +					  void *context);
> +
>  /*----------------------------------------------------------------*/
>  
>  #endif
> -- 
> 2.11.0
> 

I have this incremental, not seeing need to avoid using blkdev_issue_flush

---
 drivers/md/dm-thin.c | 12 +-----------
 1 file changed, 1 insertion(+), 11 deletions(-)

diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
index 9c9a323c0c30..255a52f7bbf0 100644
--- a/drivers/md/dm-thin.c
+++ b/drivers/md/dm-thin.c
@@ -3203,18 +3203,8 @@ static void metadata_low_callback(void *context)
 static int metadata_pre_commit_callback(void *context)
 {
 	struct pool_c *pt = context;
-	struct bio bio;
-	int r;
-
-	bio_init(&bio, NULL, 0);
-	bio_set_dev(&bio, pt->data_dev->bdev);
-	bio.bi_opf = REQ_OP_WRITE | REQ_PREFLUSH;
-
-	r = submit_bio_wait(&bio);
 
-	bio_uninit(&bio);
-
-	return r;
+	return blkdev_issue_flush(pt->data_dev->bdev, GFP_NOIO, NULL);
 }
 
 static sector_t get_dev_size(struct block_device *bdev)
-- 
2.15.0
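
The callback pattern the patch wires up (register a single hook, run it
at the start of the commit path, abort the commit if it fails) can be
sketched as a toy Python model (names are hypothetical, not the kernel
API):

```python
class MetadataStore:
    """Toy transactional metadata store with an optional pre-commit
    callback, mirroring the pattern added to dm_pool_metadata."""

    def __init__(self):
        self.pre_commit_fn = None
        self.pre_commit_context = None
        self.pending = []      # uncommitted mappings
        self.committed = []    # durable mappings

    def register_pre_commit_callback(self, fn, context):
        self.pre_commit_fn = fn
        self.pre_commit_context = context

    def commit(self):
        # Run the hook first and abort on failure, just as
        # __commit_transaction() returns early when the callback fails.
        if self.pre_commit_fn is not None:
            r = self.pre_commit_fn(self.pre_commit_context)
            if r < 0:
                return r
        self.committed.extend(self.pending)
        self.pending.clear()
        return 0


def flush_data_device(dev):
    """Stand-in for the thin pool's callback, which flushes the data
    device so COWed data are stable before new mappings commit."""
    dev.append("FLUSH")
    return 0


data_dev = []
store = MetadataStore()
store.register_pre_commit_callback(flush_data_device, data_dev)
store.pending.append(("virt_blk", "data_blk"))
assert store.commit() == 0
assert data_dev == ["FLUSH"]   # flush ran before the commit
assert store.committed == [("virt_blk", "data_blk")]
```

The point of the hook is ordering: the data-device flush completes
before any new mapping becomes durable, so a crash can no longer expose
mappings that point at unflushed COW data.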

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/2] dm thin metadata: Add support for a pre-commit callback
  2019-12-05 19:40   ` Mike Snitzer
@ 2019-12-05 21:33     ` Nikos Tsironis
  0 siblings, 0 replies; 18+ messages in thread
From: Nikos Tsironis @ 2019-12-05 21:33 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: dm-devel, thornber, agk

On 12/5/19 9:40 PM, Mike Snitzer wrote:
> On Wed, Dec 04 2019 at  9:07P -0500,
> Nikos Tsironis <ntsironis@arrikto.com> wrote:
> 
>> Add support for one pre-commit callback which is run right before the
>> metadata are committed.
>>
>> This allows the thin provisioning target to run a callback before the
>> metadata are committed and is required by the next commit.
>>
>> Cc: stable@vger.kernel.org
>> Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
>> ---
>>   drivers/md/dm-thin-metadata.c | 29 +++++++++++++++++++++++++++++
>>   drivers/md/dm-thin-metadata.h |  7 +++++++
>>   2 files changed, 36 insertions(+)
>>
>> diff --git a/drivers/md/dm-thin-metadata.c b/drivers/md/dm-thin-metadata.c
>> index 4c68a7b93d5e..b88d6d701f5b 100644
>> --- a/drivers/md/dm-thin-metadata.c
>> +++ b/drivers/md/dm-thin-metadata.c
>> @@ -189,6 +189,15 @@ struct dm_pool_metadata {
>>   	sector_t data_block_size;
>>   
>>   	/*
>> +	 * Pre-commit callback.
>> +	 *
>> +	 * This allows the thin provisioning target to run a callback before
>> +	 * the metadata are committed.
>> +	 */
>> +	dm_pool_pre_commit_fn pre_commit_fn;
>> +	void *pre_commit_context;
>> +
>> +	/*
>>   	 * We reserve a section of the metadata for commit overhead.
>>   	 * All reported space does *not* include this.
>>   	 */
>> @@ -826,6 +835,14 @@ static int __commit_transaction(struct dm_pool_metadata *pmd)
>>   	if (unlikely(!pmd->in_service))
>>   		return 0;
>>   
>> +	if (pmd->pre_commit_fn) {
>> +		r = pmd->pre_commit_fn(pmd->pre_commit_context);
>> +		if (r < 0) {
>> +			DMERR("pre-commit callback failed");
>> +			return r;
>> +		}
>> +	}
>> +
>>   	r = __write_changed_details(pmd);
>>   	if (r < 0)
>>   		return r;
>> @@ -892,6 +909,8 @@ struct dm_pool_metadata *dm_pool_metadata_open(struct block_device *bdev,
>>   	pmd->in_service = false;
>>   	pmd->bdev = bdev;
>>   	pmd->data_block_size = data_block_size;
>> +	pmd->pre_commit_fn = NULL;
>> +	pmd->pre_commit_context = NULL;
>>   
>>   	r = __create_persistent_data_objects(pmd, format_device);
>>   	if (r) {
>> @@ -2044,6 +2063,16 @@ int dm_pool_register_metadata_threshold(struct dm_pool_metadata *pmd,
>>   	return r;
>>   }
>>   
>> +void dm_pool_register_pre_commit_callback(struct dm_pool_metadata *pmd,
>> +					  dm_pool_pre_commit_fn fn,
>> +					  void *context)
>> +{
>> +	pmd_write_lock_in_core(pmd);
>> +	pmd->pre_commit_fn = fn;
>> +	pmd->pre_commit_context = context;
>> +	pmd_write_unlock(pmd);
>> +}
>> +
>>   int dm_pool_metadata_set_needs_check(struct dm_pool_metadata *pmd)
>>   {
>>   	int r = -EINVAL;
>> diff --git a/drivers/md/dm-thin-metadata.h b/drivers/md/dm-thin-metadata.h
>> index f6be0d733c20..7ef56bd2a7e3 100644
>> --- a/drivers/md/dm-thin-metadata.h
>> +++ b/drivers/md/dm-thin-metadata.h
>> @@ -230,6 +230,13 @@ bool dm_pool_metadata_needs_check(struct dm_pool_metadata *pmd);
>>    */
>>   void dm_pool_issue_prefetches(struct dm_pool_metadata *pmd);
>>   
>> +/* Pre-commit callback */
>> +typedef int (*dm_pool_pre_commit_fn)(void *context);
>> +
>> +void dm_pool_register_pre_commit_callback(struct dm_pool_metadata *pmd,
>> +					  dm_pool_pre_commit_fn fn,
>> +					  void *context);
>> +
>>   /*----------------------------------------------------------------*/
>>   
>>   #endif
>> -- 
>> 2.11.0
>>
> 
> I have this incremental, not seeing need to avoid using blkdev_issue_flush
>

Ack,

Nikos.
  
> ---
>   drivers/md/dm-thin.c | 12 +-----------
>   1 file changed, 1 insertion(+), 11 deletions(-)
> 
> diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
> index 9c9a323c0c30..255a52f7bbf0 100644
> --- a/drivers/md/dm-thin.c
> +++ b/drivers/md/dm-thin.c
> @@ -3203,18 +3203,8 @@ static void metadata_low_callback(void *context)
>   static int metadata_pre_commit_callback(void *context)
>   {
>   	struct pool_c *pt = context;
> -	struct bio bio;
> -	int r;
> -
> -	bio_init(&bio, NULL, 0);
> -	bio_set_dev(&bio, pt->data_dev->bdev);
> -	bio.bi_opf = REQ_OP_WRITE | REQ_PREFLUSH;
> -
> -	r = submit_bio_wait(&bio);
>   
> -	bio_uninit(&bio);
> -
> -	return r;
> +	return blkdev_issue_flush(pt->data_dev->bdev, GFP_NOIO, NULL);
>   }
>   
>   static sector_t get_dev_size(struct block_device *bdev)
> 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 0/2] dm thin: Flush data device before committing metadata to avoid data corruption
  2019-12-05 15:31     ` Nikos Tsironis
  2019-12-05 15:42       ` Mike Snitzer
@ 2019-12-05 22:34       ` Eric Wheeler
  2019-12-06 15:14         ` Nikos Tsironis
  1 sibling, 1 reply; 18+ messages in thread
From: Eric Wheeler @ 2019-12-05 22:34 UTC (permalink / raw)
  To: Nikos Tsironis; +Cc: dm-devel, thornber, agk, Mike Snitzer

On Thu, 5 Dec 2019, Nikos Tsironis wrote:
> On 12/4/19 10:17 PM, Mike Snitzer wrote:
> > On Wed, Dec 04 2019 at  2:58pm -0500,
> > Eric Wheeler <dm-devel@lists.ewheeler.net> wrote:
> > 
> > > On Wed, 4 Dec 2019, Nikos Tsironis wrote:
> > >
> > > > The thin provisioning target maintains per thin device mappings that map
> > > > virtual blocks to data blocks in the data device.
> > > >
> > > > When we write to a shared block, in case of internal snapshots, or
> > > > provision a new block, in case of external snapshots, we copy the shared
> > > > block to a new data block (COW), update the mapping for the relevant
> > > > virtual block and then issue the write to the new data block.
> > > >
> > > > Suppose the data device has a volatile write-back cache and the
> > > > following sequence of events occur:
> > >
> > > For those with NV caches, can the data disk flush be optional (maybe as a
> > > table flag)?
> > 
> > IIRC block core should avoid issuing the flush if not needed.  I'll have
> > a closer look to verify as much.
> > 
> 
> For devices without a volatile write-back cache block core strips off
> the REQ_PREFLUSH and REQ_FUA bits from requests with a payload and
> completes empty REQ_PREFLUSH requests before entering the driver.
> 
> This happens in generic_make_request_checks():
> 
> 		/*
> 		 * Filter flush bio's early so that make_request based
> 		 * drivers without flush support don't have to worry
> 		 * about them.
> 		 */
> 		if (op_is_flush(bio->bi_opf) &&
> 		    !test_bit(QUEUE_FLAG_WC, &q->queue_flags)) {
> 		        bio->bi_opf &= ~(REQ_PREFLUSH | REQ_FUA);
> 		        if (!nr_sectors) {
> 		                status = BLK_STS_OK;
> 		                goto end_io;
> 		        }
> 		}
> 
> If I am not mistaken, it all depends on whether the underlying device
> reports the existence of a write back cache or not.
> 
> You could check this by looking at /sys/block/<device>/queue/write_cache
> If it says "write back" then flushes will be issued.
> 
> In case the sysfs entry reports a "write back" cache for a device with a
> non-volatile write cache, I think you can change the kernel's view of
> the device by writing to this entry (you could also create a udev rule
> for this).
> 
> This way you can set the write cache as write through. This will
> eliminate the cache flushes issued by the kernel, without altering the
> device state (Documentation/block/queue-sysfs.rst).

Interesting, I'll remember that. I think this is a documentation bug, isn't this backwards:
	'This means that it might not be safe to toggle the setting from 
	"write back" to "write through", since that will also eliminate
	cache flushes issued by the kernel.'
	[https://www.kernel.org/doc/Documentation/block/queue-sysfs.rst]


How does this work with stacking blockdevs?  Does it inherit from the 
lower-level dev? If an upper-level is misconfigured, would a writeback at 
higher levels clear the flush for lower levels?

--
Eric Wheeler



> Nikos
> 
> > Mike
> > 
> 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 0/2] dm thin: Flush data device before committing metadata to avoid data corruption
  2019-12-05 22:34       ` Eric Wheeler
@ 2019-12-06 15:14         ` Nikos Tsironis
  2019-12-06 20:06           ` Eric Wheeler
  0 siblings, 1 reply; 18+ messages in thread
From: Nikos Tsironis @ 2019-12-06 15:14 UTC (permalink / raw)
  To: Eric Wheeler; +Cc: dm-devel, thornber, agk, Mike Snitzer

On 12/6/19 12:34 AM, Eric Wheeler wrote:
> On Thu, 5 Dec 2019, Nikos Tsironis wrote:
>> On 12/4/19 10:17 PM, Mike Snitzer wrote:
>>> On Wed, Dec 04 2019 at  2:58pm -0500,
>>> Eric Wheeler <dm-devel@lists.ewheeler.net> wrote:
>>>
>>>> On Wed, 4 Dec 2019, Nikos Tsironis wrote:
>>>>
>>>>> The thin provisioning target maintains per thin device mappings that map
>>>>> virtual blocks to data blocks in the data device.
>>>>>
>>>>> When we write to a shared block, in case of internal snapshots, or
>>>>> provision a new block, in case of external snapshots, we copy the shared
>>>>> block to a new data block (COW), update the mapping for the relevant
>>>>> virtual block and then issue the write to the new data block.
>>>>>
>>>>> Suppose the data device has a volatile write-back cache and the
>>>>> following sequence of events occur:
>>>>
>>>> For those with NV caches, can the data disk flush be optional (maybe as a
>>>> table flag)?
>>>
>>> IIRC block core should avoid issuing the flush if not needed.  I'll have
>>> a closer look to verify as much.
>>>
>>
>> For devices without a volatile write-back cache block core strips off
>> the REQ_PREFLUSH and REQ_FUA bits from requests with a payload and
>> completes empty REQ_PREFLUSH requests before entering the driver.
>>
>> This happens in generic_make_request_checks():
>>
>> 		/*
>> 		 * Filter flush bio's early so that make_request based
>> 		 * drivers without flush support don't have to worry
>> 		 * about them.
>> 		 */
>> 		if (op_is_flush(bio->bi_opf) &&
>> 		    !test_bit(QUEUE_FLAG_WC, &q->queue_flags)) {
>> 		        bio->bi_opf &= ~(REQ_PREFLUSH | REQ_FUA);
>> 		        if (!nr_sectors) {
>> 		                status = BLK_STS_OK;
>> 		                goto end_io;
>> 		        }
>> 		}
>>
>> If I am not mistaken, it all depends on whether the underlying device
>> reports the existence of a write back cache or not.
>>
>> You could check this by looking at /sys/block/<device>/queue/write_cache
>> If it says "write back" then flushes will be issued.
>>
>> In case the sysfs entry reports a "write back" cache for a device with a
>> non-volatile write cache, I think you can change the kernel's view of
>> the device by writing to this entry (you could also create a udev rule
>> for this).
>>
>> This way you can set the write cache as write through. This will
>> eliminate the cache flushes issued by the kernel, without altering the
>> device state (Documentation/block/queue-sysfs.rst).
> 
> Interesting, I'll remember that. I think this is a documentation bug, isn't this backwards:
> 	'This means that it might not be safe to toggle the setting from
> 	"write back" to "write through", since that will also eliminate
> 	cache flushes issued by the kernel.'
> 	[https://www.kernel.org/doc/Documentation/block/queue-sysfs.rst]
> 
> 

If a device has a volatile cache then the write_cache sysfs entry will
be "write back" and we have to issue flushes to the device. In all other
cases write_cache will be "write through".

It's not safe to toggle write_cache from "write back" to "write through"
because this stops the kernel from sending flushes to the device, but
the device will continue caching the writes. So, in case something goes
wrong, you might lose your writes or end up with some kind of
corruption.

> How does this work with stacking blockdevs?  Does it inherit from the
> lower-level dev? If an upper-level is misconfigured, would a writeback at
> higher levels clear the flush for lower levels?
> 

As Mike already mentioned in another reply to this thread, the device
capabilities are stacked up when each device is created and are
inherited from component devices.

The logic for device stacking is implemented in various functions in
block/blk-settings.c (blk_set_stacking_limits(), blk_stack_limits(),
etc.), which are used also by DM core in dm-table.c to set the
capabilities of DM devices.

If an upper layer device reports a "write back" cache then flushes will
be issued to it by the kernel, no matter what the capabilities of the
underlying devices are.

Normally an upper layer device would report a "write back" cache if at
least one underlying device supports flushes. But, some DM devices
report a "write back" cache irrespective of the underlying devices,
e.g., dm-thin, dm-clone, dm-cache. This is required so they can flush
their own metadata. They then pass the flush request down to the
underlying device and rely on block core to do the right thing. Either
actually send the flush to the device, if it has a volatile cache, or
complete it immediately.
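
The filtering that makes this work can be modeled with a small Python
sketch of the quoted generic_make_request_checks() logic (a deliberate
simplification with made-up flag values, for illustration only, not
kernel code):

```python
REQ_PREFLUSH = 1 << 0   # illustrative flag values, not the kernel's
REQ_FUA = 1 << 1

def filter_flush(opf, nr_sectors, queue_has_writeback_cache):
    """Strip flush/FUA bits when the queue reports no volatile
    write-back cache; complete empty flushes without ever reaching
    the driver, as generic_make_request_checks() does."""
    if (opf & (REQ_PREFLUSH | REQ_FUA)) and not queue_has_writeback_cache:
        opf &= ~(REQ_PREFLUSH | REQ_FUA)
        if nr_sectors == 0:
            return opf, "completed"   # empty flush: nothing left to do
    return opf, "submitted"

# An empty flush to a "write through" queue is completed immediately...
print(filter_flush(REQ_PREFLUSH, 0, False))
# ...while the same flush to a "write back" queue reaches the driver.
print(filter_flush(REQ_PREFLUSH, 0, True))
```

So whether a flush passed down by a DM target actually reaches the
hardware is decided per queue, by this check, at submission time.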

Nikos

> --
> Eric Wheeler
> 
> 
> 
>> Nikos
>>
>>> Mike
>>>
>>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 0/2] dm thin: Flush data device before committing metadata to avoid data corruption
  2019-12-06 15:14         ` Nikos Tsironis
@ 2019-12-06 20:06           ` Eric Wheeler
  2019-12-09 14:25             ` Nikos Tsironis
  0 siblings, 1 reply; 18+ messages in thread
From: Eric Wheeler @ 2019-12-06 20:06 UTC (permalink / raw)
  To: Nikos Tsironis; +Cc: dm-devel, thornber, agk, Mike Snitzer

On Fri, 6 Dec 2019, Nikos Tsironis wrote:
> On 12/6/19 12:34 AM, Eric Wheeler wrote:
> > On Thu, 5 Dec 2019, Nikos Tsironis wrote:
> > > On 12/4/19 10:17 PM, Mike Snitzer wrote:
> > > > On Wed, Dec 04 2019 at  2:58pm -0500,
> > > > Eric Wheeler <dm-devel@lists.ewheeler.net> wrote:
> > > >
> > > > > On Wed, 4 Dec 2019, Nikos Tsironis wrote:
> > > > >
> > > > > > The thin provisioning target maintains per thin device mappings that
> > > > > > map
> > > > > > virtual blocks to data blocks in the data device.
> > > > > >
> > > > > > When we write to a shared block, in case of internal snapshots, or
> > > > > > provision a new block, in case of external snapshots, we copy the
> > > > > > shared
> > > > > > block to a new data block (COW), update the mapping for the relevant
> > > > > > virtual block and then issue the write to the new data block.
> > > > > >
> > > > > > Suppose the data device has a volatile write-back cache and the
> > > > > > following sequence of events occur:
> > > > >
> > > > > For those with NV caches, can the data disk flush be optional (maybe
> > > > > as a
> > > > > table flag)?
> > > >
> > > > IIRC block core should avoid issuing the flush if not needed.  I'll have
> > > > a closer look to verify as much.
> > > >
> > >
> > > For devices without a volatile write-back cache block core strips off
> > > the REQ_PREFLUSH and REQ_FUA bits from requests with a payload and
> > > completes empty REQ_PREFLUSH requests before entering the driver.
> > >
> > > This happens in generic_make_request_checks():
> > >
> > >   /*
> > >    * Filter flush bio's early so that make_request based
> > >    * drivers without flush support don't have to worry
> > >    * about them.
> > >    */
> > >   if (op_is_flush(bio->bi_opf) &&
> > >       !test_bit(QUEUE_FLAG_WC, &q->queue_flags)) {
> > >           bio->bi_opf &= ~(REQ_PREFLUSH | REQ_FUA);
> > >           if (!nr_sectors) {
> > >                   status = BLK_STS_OK;
> > >                   goto end_io;
> > >           }
> > >   }
> > >
> > > If I am not mistaken, it all depends on whether the underlying device
> > > reports the existence of a write back cache or not.
> > >
> > > You could check this by looking at /sys/block/<device>/queue/write_cache
> > > If it says "write back" then flushes will be issued.
> > >
> > > In case the sysfs entry reports a "write back" cache for a device with a
> > > non-volatile write cache, I think you can change the kernel's view of
> > > the device by writing to this entry (you could also create a udev rule
> > > for this).
> > >
> > > This way you can set the write cache as write through. This will
> > > eliminate the cache flushes issued by the kernel, without altering the
> > > device state (Documentation/block/queue-sysfs.rst).
> > 
> > Interesting, I'll remember that. I think this is a documentation bug, isn't
> > this backwards:
> >  'This means that it might not be safe to toggle the setting from
> >  "write back" to "write through", since that will also eliminate
> >  cache flushes issued by the kernel.'
> >  [https://www.kernel.org/doc/Documentation/block/queue-sysfs.rst]
> > 
> > 
> 
> If a device has a volatile cache then the write_cache sysfs entry will
> be "write back" and we have to issue flushes to the device. In all other
> cases write_cache will be "write through".

Forgive my misunderstanding, but if I have a RAID controller with a cache 
and BBU with the RAID volume set to write-back mode in the controller, are 
you saying that the sysfs entry should show "write through"? I had always 
understood that it was safe to disable flushes with a non-volatile cache 
and a non-volatile cache is called a write-back cache.

It is strange to me that this terminology in the kernel would be backwards 
from how it is expressed in a RAID controller. Incidentally, I have an 
Avago MegaRAID 9460 with 2 volumes. The first volume (sda) is in 
write-back mode and the second volume is write-through. In both cases 
sysfs reports "write through":

[root@hv1-he ~]# cat /sys/block/sda/queue/write_cache 
write through
[root@hv1-he ~]# cat /sys/block/sdb/queue/write_cache 
write through

This is running 4.19.75, so we can at least say that the 9460 does not 
support proper representation of the VD cache mode in sysfs, but which is 
correct? Should it not be that the sysfs entry reports the same cache mode 
as the RAID controller?

-Eric

> 
> It's not safe to toggle write_cache from "write back" to "write through"
> because this stops the kernel from sending flushes to the device, but
> the device will continue caching the writes. So, in case something goes
> wrong, you might lose your writes or end up with some kind of
> corruption.
> 
> > How does this work with stacking blockdevs?  Does it inherit from the
> > lower-level dev? If an upper-level is misconfigured, would a writeback at
> > higher levels clear the flush for lower levels?
> > 
> 
> As Mike already mentioned in another reply to this thread, the device
> capabilities are stacked up when each device is created and are
> inherited from component devices.
> 
> The logic for device stacking is implemented in various functions in
> block/blk-settings.c (blk_set_stacking_limits(), blk_stack_limits(),
> etc.), which are used also by DM core in dm-table.c to set the
> capabilities of DM devices.
> 
> If an upper layer device reports a "write back" cache then flushes will
> be issued to it by the kernel, no matter what the capabilities of the
> underlying devices are.
> 
> Normally an upper layer device would report a "write back" cache if at
> least one underlying device supports flushes. But, some DM devices
> report a "write back" cache irrespective of the underlying devices,
> e.g., dm-thin, dm-clone, dm-cache. This is required so they can flush
> their own metadata. They then pass the flush request down to the
> underlying device and rely on block core to do the right thing. Either
> actually send the flush to the device, if it has a volatile cache, or
> complete it immediately.
> 
> Nikos
> 
> > --
> > Eric Wheeler
> > 
> > 
> > 
> > > Nikos
> > >
> > > > Mike
> > > >
> > >
> 
> 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 0/2] dm thin: Flush data device before committing metadata to avoid data corruption
  2019-12-06 20:06           ` Eric Wheeler
@ 2019-12-09 14:25             ` Nikos Tsironis
  0 siblings, 0 replies; 18+ messages in thread
From: Nikos Tsironis @ 2019-12-09 14:25 UTC (permalink / raw)
  To: Eric Wheeler; +Cc: dm-devel, thornber, agk, Mike Snitzer

On 12/6/19 10:06 PM, Eric Wheeler wrote:
> On Fri, 6 Dec 2019, Nikos Tsironis wrote:
>> On 12/6/19 12:34 AM, Eric Wheeler wrote:
>>> On Thu, 5 Dec 2019, Nikos Tsironis wrote:
>>>> On 12/4/19 10:17 PM, Mike Snitzer wrote:
>>>>> On Wed, Dec 04 2019 at  2:58pm -0500,
>>>>> Eric Wheeler <dm-devel@lists.ewheeler.net> wrote:
>>>>>
>>>>>> On Wed, 4 Dec 2019, Nikos Tsironis wrote:
>>>>>>
>>>>>>> The thin provisioning target maintains per thin device mappings that
>>>>>>> map
>>>>>>> virtual blocks to data blocks in the data device.
>>>>>>>
>>>>>>> When we write to a shared block, in case of internal snapshots, or
>>>>>>> provision a new block, in case of external snapshots, we copy the
>>>>>>> shared
>>>>>>> block to a new data block (COW), update the mapping for the relevant
>>>>>>> virtual block and then issue the write to the new data block.
>>>>>>>
>>>>>>> Suppose the data device has a volatile write-back cache and the
>>>>>>> following sequence of events occur:
>>>>>>
>>>>>> For those with NV caches, can the data disk flush be optional (maybe
>>>>>> as a
>>>>>> table flag)?
>>>>>
>>>>> IIRC block core should avoid issuing the flush if not needed.  I'll have
>>>>> a closer look to verify as much.
>>>>>
>>>>
>>>> For devices without a volatile write-back cache block core strips off
>>>> the REQ_PREFLUSH and REQ_FUA bits from requests with a payload and
>>>> completes empty REQ_PREFLUSH requests before entering the driver.
>>>>
>>>> This happens in generic_make_request_checks():
>>>>
>>>>    /*
>>>>     * Filter flush bio's early so that make_request based
>>>>     * drivers without flush support don't have to worry
>>>>     * about them.
>>>>     */
>>>>    if (op_is_flush(bio->bi_opf) &&
>>>>        !test_bit(QUEUE_FLAG_WC, &q->queue_flags)) {
>>>>            bio->bi_opf &= ~(REQ_PREFLUSH | REQ_FUA);
>>>>            if (!nr_sectors) {
>>>>                    status = BLK_STS_OK;
>>>>                    goto end_io;
>>>>            }
>>>>    }
>>>>
>>>> If I am not mistaken, it all depends on whether the underlying device
>>>> reports the existence of a write back cache or not.
>>>>
>>>> You could check this by looking at /sys/block/<device>/queue/write_cache
>>>> If it says "write back" then flushes will be issued.
>>>>
>>>> In case the sysfs entry reports a "write back" cache for a device with a
>>>> non-volatile write cache, I think you can change the kernel's view of
>>>> the device by writing to this entry (you could also create a udev rule
>>>> for this).
>>>>
>>>> This way you can set the write cache as write through. This will
>>>> eliminate the cache flushes issued by the kernel, without altering the
>>>> device state (Documentation/block/queue-sysfs.rst).
>>>
>>> Interesting, I'll remember that. I think this is a documentation bug, isn't
>>> this backwards:
>>>   'This means that it might not be safe to toggle the setting from
>>>   "write back" to "write through", since that will also eliminate
>>>   cache flushes issued by the kernel.'
>>>   [https://www.kernel.org/doc/Documentation/block/queue-sysfs.rst]
>>>
>>>
>>
>> If a device has a volatile cache then the write_cache sysfs entry will
>> be "write back" and we have to issue flushes to the device. In all other
>> cases write_cache will be "write through".
> 
> Forgive my misunderstanding, but if I have a RAID controller with a cache
> and BBU with the RAID volume set to write-back mode in the controller, are
> you saying that the sysfs entry should show "write through"? I had always
> understood that it was safe to disable flushes with a non-volatile cache
> and a non-volatile cache is called a write-back cache.
> 

From the device perspective, a non-volatile cache operating in
write-back mode is indeed called a write-back cache.

But, from the OS perspective, a non-volatile cache (whether it operates
in write-back or write-through mode), for all intents and purposes, is
equivalent to a write-through cache: when the device acknowledges a
write it's guaranteed that the written data won't be lost in case of
power loss.

So, in the case of a controller with a BBU and/or a non-volatile cache,
you don't care what the device does internally. All that matters is that
acked writes won't be lost in case of power failure.

I believe that the sysfs entry reports exactly that: whether the kernel
should treat the device as having a volatile write-back cache, so we
have to issue flushes to ensure the data are properly persisted, or as
having no cache or a write-through cache, so flushes are not necessary.

> It is strange to me that this terminology in the kernel would be backwards
> from how it is expressed in a RAID controller. Incidentally, I have an
> Avago MegaRAID 9460 with 2 volumes. The first volume (sda) is in
> write-back mode and the second volume is write-through. In both cases
> sysfs reports "write through":
> 
> [root@hv1-he ~]# cat /sys/block/sda/queue/write_cache
> write through
> [root@hv1-he ~]# cat /sys/block/sdb/queue/write_cache
> write through
> 
> This is running 4.19.75, so we can at least say that the 9460 does not
> support proper representation of the VD cache mode in sysfs, but which is
> correct? Should it not be that the sysfs entry reports the same cache mode
> as the RAID controller?
> 

My guess is that the controller reports to the kernel that it has a
write-through cache (or no cache at all) on purpose, to avoid
unnecessary flushes. Since it can ensure the persistence of acked writes
by other means, e.g., a battery backup unit (BBU), as far as the kernel
is concerned the device can be treated as one with a write-through cache.

Moreover, I think that MegaRAID controllers, in the default write back
mode, automatically switch the write policy to write-through if the BBU
is low, has failed or is being charged.

So, I think it makes sense to report to the kernel that the device has a
write-through cache, even though internally the device operates the
cache in write-back mode.

Nikos

> -Eric
> 
>>
>> It's not safe to toggle write_cache from "write back" to "write through",
>> because this stops the kernel from sending flushes to the device while
>> the device continues caching the writes. So, if something goes wrong,
>> you might lose acknowledged writes or end up with on-disk corruption.
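>>
To make the hazard concrete, here is a hypothetical helper (a sketch,
not a real tool; sdX is a placeholder device name):

```shell
# Change the kernel's view of a device's cache via the write_cache
# sysfs attribute.  Switching "write back" -> "write through" stops
# the kernel from issuing flushes but does NOT change the device's
# own caching behaviour, so it is only safe when the cache is known
# to be non-volatile (e.g. battery backed).
set_cache_mode() {
    case "$2" in
        "write back"|"write through") ;;
        *) echo "invalid cache mode: $2" >&2; return 1 ;;
    esac
    echo "$2" > "/sys/block/$1/queue/write_cache"
}

# Example (uncomment only for a device with a non-volatile cache):
# set_cache_mode sdX "write through"
```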
>>
>>> How does this work with stacking blockdevs?  Does it inherit from the
>>> lower-level dev? If an upper-level is misconfigured, would a writeback at
>>> higher levels clear the flush for lower levels?
>>>
>>
>> As Mike already mentioned in another reply to this thread, the device
>> capabilities are stacked up when each device is created and are
>> inherited from component devices.
>>
>> The logic for device stacking is implemented in various functions in
>> block/blk-settings.c (blk_set_stacking_limits(), blk_stack_limits(),
>> etc.), which are used also by DM core in dm-table.c to set the
>> capabilities of DM devices.
>>
>> If an upper layer device reports a "write back" cache then flushes will
>> be issued to it by the kernel, no matter what the capabilities of the
>> underlying devices are.
>>
>> Normally an upper layer device would report a "write back" cache if at
>> least one underlying device supports flushes. But, some DM devices
>> report a "write back" cache irrespective of the underlying devices,
>> e.g., dm-thin, dm-clone, dm-cache. This is required so they can flush
>> their own metadata. They then pass the flush request down to the
>> underlying device and rely on block core to do the right thing: either
>> actually send the flush to the device, if it has a volatile cache, or
>> complete it immediately.
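>>
The stacking rule described above can be sketched as a toy model
(plain shell, not the kernel's implementation, which lives in
block/blk-settings.c and drivers/md/dm-table.c): a stacked device
advertises a volatile "write back" cache if any of its components
does.

```shell
# Toy model of cache-capability stacking: the stacked device needs
# flushes ("write back") if at least one component device does.
stack_write_cache() {
    for c in "$@"; do
        if [ "$c" = "write back" ]; then
            echo "write back"
            return 0
        fi
    done
    echo "write through"
}
```

DM targets with their own metadata (dm-thin, dm-clone, dm-cache) are
the exception: they report "write back" regardless of the components,
so they always receive the flushes they need to commit their metadata.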
>>
>> Nikos
>>
>>> --
>>> Eric Wheeler
>>>
>>>
>>>
>>>> Nikos
>>>>
>>>>> Mike
>>>>>
>>>>
>>
>>



Thread overview: 18+ messages
2019-12-04 14:07 [PATCH 0/2] dm thin: Flush data device before committing metadata to avoid data corruption Nikos Tsironis
2019-12-04 14:07 ` [PATCH 1/2] dm thin metadata: Add support for a pre-commit callback Nikos Tsironis
2019-12-05 19:40   ` Mike Snitzer
2019-12-05 21:33     ` Nikos Tsironis
2019-12-04 14:07 ` [PATCH 2/2] dm thin: Flush data device before committing metadata Nikos Tsironis
2019-12-04 15:27   ` Joe Thornber
2019-12-04 16:17     ` Nikos Tsironis
2019-12-04 16:39       ` Mike Snitzer
2019-12-04 16:47         ` Nikos Tsironis
2019-12-04 19:58 ` [PATCH 0/2] dm thin: Flush data device before committing metadata to avoid data corruption Eric Wheeler
2019-12-04 20:17   ` Mike Snitzer
2019-12-05 15:31     ` Nikos Tsironis
2019-12-05 15:42       ` Mike Snitzer
2019-12-05 16:02         ` Nikos Tsironis
2019-12-05 22:34       ` Eric Wheeler
2019-12-06 15:14         ` Nikos Tsironis
2019-12-06 20:06           ` Eric Wheeler
2019-12-09 14:25             ` Nikos Tsironis
