* [PATCH 0/3] mdraid sb and bitmap write alignment on 512e drives
@ 2020-10-23  3:31 Christopher Unkel
  2020-10-23  3:31 ` [PATCH 1/3] md: align superblock writes to physical blocks Christopher Unkel
                   ` (4 more replies)
  0 siblings, 5 replies; 11+ messages in thread
From: Christopher Unkel @ 2020-10-23  3:31 UTC (permalink / raw)
  To: linux-raid; +Cc: linux-kernel, Song Liu, cunkel

Hello all,

While investigating some performance issues on mdraid 10 volumes
formed with "512e" disks (4k native/physical sector size but with 512
byte sector emulation), I've found two cases where mdraid will
needlessly issue writes that start on a 4k boundary but are
shorter than 4k:

1. writes of the raid superblock; and
2. writes of the last page of the write-intent bitmap.

The following is an excerpt of a blktrace of one of the component
members of an mdraid 10 volume during a 4k write near the end of the
array:

  8,32  11        2     0.000001687   711  D  WS 2064 + 8 [kworker/11:1H]
* 8,32  11        5     0.001454119   711  D  WS 2056 + 1 [kworker/11:1H]
* 8,32  11        8     0.002847204   711  D  WS 2080 + 7 [kworker/11:1H]
  8,32  11       11     0.003700545  3094  D  WS 11721043920 + 8 [md127_raid1]
  8,32  11       14     0.308785692   711  D  WS 2064 + 8 [kworker/11:1H]
* 8,32  11       17     0.310201697   711  D  WS 2056 + 1 [kworker/11:1H]
  8,32  11       20     5.500799245   711  D  WS 2064 + 8 [kworker/11:1H]
* 8,32  11       23    15.740923558   711  D  WS 2080 + 7 [kworker/11:1H]

Note the starred transactions, which each start on a 4k boundary, but
are less than 4k in length, and so will use the 512-byte emulation.
Sector 2056 holds the superblock, and is written as a single 512-byte
write.  Sector 2086 holds the bitmap bit relevant to the written
sector.  When that bit is updated, the active bits of the last page of
the bitmap are written, starting at sector 2080 and padded out to the
end of the 512-byte logical sector as required.  This results in a
3.5KiB write, again using the 512-byte emulation.
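
(For reference, a trace like this can be captured on the member's
underlying disk with blktrace and decoded with blkparse; the device
name here is only illustrative:

  # blktrace -d /dev/sdc -o - | blkparse -i -

The fields of interest are the action ("D": issued to the driver),
the RWBS flags ("WS": synchronous write), and the trailing
"sector + length-in-sectors" pair.)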

Note that in some arrays the last page of the bitmap may be
sufficiently full that the bitmap-write issue does not arise.

As there can be a substantial penalty to using the 512-byte sector
emulation (turning writes into read-modify-write operations if the
relevant sector is not in the drive's cache), I believe it makes sense
to pad
these writes out to a 4k boundary.  The writes are already padded out
for "4k native" drives, where the short access is illegal.

The following patch set changes the superblock and bitmap writes to
respect the physical block size (e.g. 4k for today's 512e drives) when
possible.  In each case there is already logic for padding out to the
underlying logical sector size.  I reuse or repeat the logic for
padding out to the physical sector size, but treat the padding out as
optional rather than mandatory.
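
For illustration, the rounding arithmetic the patches rely on, as a
minimal userspace sketch (the values are hypothetical; this is not
the kernel code itself):

  #include <stdio.h>

  int main(void)
  {
  	int pbs = 4096;      /* physical block size of a 512e drive */
  	int bmask = pbs - 1; /* 0xfff */
  	int sb_size = 512;   /* write already padded to logical block */

  	/* Pad only if the size doesn't fill out the physical block. */
  	if (sb_size & bmask) {
  		int padded = (sb_size | bmask) + 1;  /* 512 -> 4096 */
  		printf("pad %d -> %d bytes\n", sb_size, padded);
  	}
  	return 0;
  }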

The corresponding block trace with these patches is:

   8,32   1        2     0.000003410   694  D  WS 2064 + 8 [kworker/1:1H]
   8,32   1        5     0.001368788   694  D  WS 2056 + 8 [kworker/1:1H]
   8,32   1        8     0.002727981   694  D  WS 2080 + 8 [kworker/1:1H]
   8,32   1       11     0.003533831  3063  D  WS 11721043920 + 8 [md127_raid1]
   8,32   1       14     0.253952321   694  D  WS 2064 + 8 [kworker/1:1H]
   8,32   1       17     0.255354215   694  D  WS 2056 + 8 [kworker/1:1H]
   8,32   1       20     5.337938486   694  D  WS 2064 + 8 [kworker/1:1H]
   8,32   1       23    15.577963062   694  D  WS 2080 + 8 [kworker/1:1H]

I do notice that the code for bitmap writes has a more sophisticated
and thorough check for overlap than the code for superblock writes.
(Compare write_sb_page in md-bitmap.c vs. super_1_load in md.c.)  As
far as I know the various structure starts have always been 4k-aligned
anyway, so it is always safe to pad the superblock write out to 4k (as
occurs on 4k native drives) but not necessarily further.

Feedback appreciated.

  --Chris


Christopher Unkel (3):
  md: align superblock writes to physical blocks
  md: factor sb write alignment check into function
  md: pad writes to end of bitmap to physical blocks

 drivers/md/md-bitmap.c | 80 +++++++++++++++++++++++++-----------------
 drivers/md/md.c        | 15 ++++++++
 2 files changed, 63 insertions(+), 32 deletions(-)

-- 
2.17.1



* [PATCH 1/3] md: align superblock writes to physical blocks
  2020-10-23  3:31 [PATCH 0/3] mdraid sb and bitmap write alignment on 512e drives Christopher Unkel
@ 2020-10-23  3:31 ` Christopher Unkel
  2020-10-23  5:46   ` Song Liu
  2020-10-23  8:29   ` Christoph Hellwig
  2020-10-23  3:31 ` [PATCH 2/3] md: factor sb write alignment check into function Christopher Unkel
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 11+ messages in thread
From: Christopher Unkel @ 2020-10-23  3:31 UTC (permalink / raw)
  To: linux-raid; +Cc: linux-kernel, Song Liu, cunkel

Writes of the md superblock are aligned to the logical blocks of the
containing device, but no attempt is made to align them to physical
block boundaries.  This means that on a "512e" device (4k physical, 512
logical) every superblock update hits the 512-byte emulation and the
possible associated performance penalty.

Respect the physical block alignment when possible.

Signed-off-by: Christopher Unkel <cunkel@drivescale.com>
---
 drivers/md/md.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 98bac4f304ae..2b42850acfb3 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -1732,6 +1732,21 @@ static int super_1_load(struct md_rdev *rdev, struct md_rdev *refdev, int minor_
 	    && rdev->new_data_offset < sb_start + (rdev->sb_size/512))
 		return -EINVAL;
 
+	/* Respect physical block size if feasible. */
+	bmask = queue_physical_block_size(rdev->bdev->bd_disk->queue)-1;
+	if (!((rdev->sb_start * 512) & bmask) && (rdev->sb_size & bmask)) {
+		int candidate_size = (rdev->sb_size | bmask) + 1;
+
+		if (minor_version) {
+			int sectors = candidate_size / 512;
+
+			if (rdev->data_offset >= sb_start + sectors
+			    && rdev->new_data_offset >= sb_start + sectors)
+				rdev->sb_size = candidate_size;
+		} else if (bmask <= 4095)
+			rdev->sb_size = candidate_size;
+	}
+
 	if (sb->level == cpu_to_le32(LEVEL_MULTIPATH))
 		rdev->desc_nr = -1;
 	else
-- 
2.17.1



* [PATCH 2/3] md: factor sb write alignment check into function
  2020-10-23  3:31 [PATCH 0/3] mdraid sb and bitmap write alignment on 512e drives Christopher Unkel
  2020-10-23  3:31 ` [PATCH 1/3] md: align superblock writes to physical blocks Christopher Unkel
@ 2020-10-23  3:31 ` Christopher Unkel
  2020-10-23  3:31 ` [PATCH 3/3] md: pad writes to end of bitmap to physical blocks Christopher Unkel
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 11+ messages in thread
From: Christopher Unkel @ 2020-10-23  3:31 UTC (permalink / raw)
  To: linux-raid; +Cc: linux-kernel, Song Liu, cunkel

Refactor in preparation for a second use of the logic.

Signed-off-by: Christopher Unkel <cunkel@drivescale.com>
---
 drivers/md/md-bitmap.c | 72 +++++++++++++++++++++++-------------------
 1 file changed, 40 insertions(+), 32 deletions(-)

diff --git a/drivers/md/md-bitmap.c b/drivers/md/md-bitmap.c
index 200c5d0f08bf..600b89d5a3ad 100644
--- a/drivers/md/md-bitmap.c
+++ b/drivers/md/md-bitmap.c
@@ -209,6 +209,44 @@ static struct md_rdev *next_active_rdev(struct md_rdev *rdev, struct mddev *mdde
 	return NULL;
 }
 
+static int sb_write_alignment_ok(struct mddev *mddev, struct md_rdev *rdev,
+				 struct page *page, int offset, int size)
+{
+	if (mddev->external) {
+		/* Bitmap could be anywhere. */
+		if (rdev->sb_start + offset + (page->index
+					       * (PAGE_SIZE/512))
+		    > rdev->data_offset
+		    &&
+		    rdev->sb_start + offset
+		    < (rdev->data_offset + mddev->dev_sectors
+		     + (PAGE_SIZE/512)))
+			return 0;
+	} else if (offset < 0) {
+		/* DATA  BITMAP METADATA  */
+		if (offset
+		    + (long)(page->index * (PAGE_SIZE/512))
+		    + size/512 > 0)
+			/* bitmap runs in to metadata */
+			return 0;
+		if (rdev->data_offset + mddev->dev_sectors
+		    > rdev->sb_start + offset)
+			/* data runs in to bitmap */
+			return 0;
+	} else if (rdev->sb_start < rdev->data_offset) {
+		/* METADATA BITMAP DATA */
+		if (rdev->sb_start
+		    + offset
+		    + page->index*(PAGE_SIZE/512) + size/512
+		    > rdev->data_offset)
+			/* bitmap runs in to data */
+			return 0;
+	} else {
+		/* DATA METADATA BITMAP - no problems */
+	}
+	return 1;
+}
+
 static int write_sb_page(struct bitmap *bitmap, struct page *page, int wait)
 {
 	struct md_rdev *rdev;
@@ -234,38 +272,8 @@ static int write_sb_page(struct bitmap *bitmap, struct page *page, int wait)
 		/* Just make sure we aren't corrupting data or
 		 * metadata
 		 */
-		if (mddev->external) {
-			/* Bitmap could be anywhere. */
-			if (rdev->sb_start + offset + (page->index
-						       * (PAGE_SIZE/512))
-			    > rdev->data_offset
-			    &&
-			    rdev->sb_start + offset
-			    < (rdev->data_offset + mddev->dev_sectors
-			     + (PAGE_SIZE/512)))
-				goto bad_alignment;
-		} else if (offset < 0) {
-			/* DATA  BITMAP METADATA  */
-			if (offset
-			    + (long)(page->index * (PAGE_SIZE/512))
-			    + size/512 > 0)
-				/* bitmap runs in to metadata */
-				goto bad_alignment;
-			if (rdev->data_offset + mddev->dev_sectors
-			    > rdev->sb_start + offset)
-				/* data runs in to bitmap */
-				goto bad_alignment;
-		} else if (rdev->sb_start < rdev->data_offset) {
-			/* METADATA BITMAP DATA */
-			if (rdev->sb_start
-			    + offset
-			    + page->index*(PAGE_SIZE/512) + size/512
-			    > rdev->data_offset)
-				/* bitmap runs in to data */
-				goto bad_alignment;
-		} else {
-			/* DATA METADATA BITMAP - no problems */
-		}
+		if (!sb_write_alignment_ok(mddev, rdev, page, offset, size))
+			goto bad_alignment;
 		md_super_write(mddev, rdev,
 			       rdev->sb_start + offset
 			       + page->index * (PAGE_SIZE/512),
-- 
2.17.1



* [PATCH 3/3] md: pad writes to end of bitmap to physical blocks
  2020-10-23  3:31 [PATCH 0/3] mdraid sb and bitmap write alignment on 512e drives Christopher Unkel
  2020-10-23  3:31 ` [PATCH 1/3] md: align superblock writes to physical blocks Christopher Unkel
  2020-10-23  3:31 ` [PATCH 2/3] md: factor sb write alignment check into function Christopher Unkel
@ 2020-10-23  3:31 ` Christopher Unkel
  2020-10-23  5:41 ` [PATCH 0/3] mdraid sb and bitmap write alignment on 512e drives Song Liu
  2020-11-02  7:04 ` Xiao Ni
  4 siblings, 0 replies; 11+ messages in thread
From: Christopher Unkel @ 2020-10-23  3:31 UTC (permalink / raw)
  To: linux-raid; +Cc: linux-kernel, Song Liu, cunkel

Writes of the last page of the bitmap are padded out to the next logical
block boundary.  However, they are not padded out to the next physical
block boundary, so the writes may be less than a physical block.  On a
"512e" disk (logical block 512 bytes, physical block 4k) and if the last
page of the bitmap is less than 3584 bytes, this means that writes of
the last bitmap page hit the 512-byte emulation.

Respect the physical block boundary as long as the resulting write
doesn't run into other data, and is no longer than a page.  (If the
physical block size is larger than a page, no bitmap write will
respect physical block boundaries.)

Signed-off-by: Christopher Unkel <cunkel@drivescale.com>
---
 drivers/md/md-bitmap.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/drivers/md/md-bitmap.c b/drivers/md/md-bitmap.c
index 600b89d5a3ad..21af5f94d495 100644
--- a/drivers/md/md-bitmap.c
+++ b/drivers/md/md-bitmap.c
@@ -264,10 +264,18 @@ static int write_sb_page(struct bitmap *bitmap, struct page *page, int wait)
 
 		if (page->index == store->file_pages-1) {
 			int last_page_size = store->bytes & (PAGE_SIZE-1);
+			int pb_aligned_size;
 			if (last_page_size == 0)
 				last_page_size = PAGE_SIZE;
 			size = roundup(last_page_size,
 				       bdev_logical_block_size(bdev));
+			pb_aligned_size = roundup(last_page_size,
+						  bdev_physical_block_size(bdev));
+			if (pb_aligned_size > size
+			    && pb_aligned_size <= PAGE_SIZE
+			    && sb_write_alignment_ok(mddev, rdev, page, offset,
+						     pb_aligned_size))
+				size = pb_aligned_size;
 		}
 		/* Just make sure we aren't corrupting data or
 		 * metadata
-- 
2.17.1



* Re: [PATCH 0/3] mdraid sb and bitmap write alignment on 512e drives
  2020-10-23  3:31 [PATCH 0/3] mdraid sb and bitmap write alignment on 512e drives Christopher Unkel
                   ` (2 preceding siblings ...)
  2020-10-23  3:31 ` [PATCH 3/3] md: pad writes to end of bitmap to physical blocks Christopher Unkel
@ 2020-10-23  5:41 ` Song Liu
  2020-10-23  7:03   ` Chris Unkel
  2020-11-02  7:04 ` Xiao Ni
  4 siblings, 1 reply; 11+ messages in thread
From: Song Liu @ 2020-10-23  5:41 UTC (permalink / raw)
  To: Christopher Unkel; +Cc: linux-raid, open list

On Thu, Oct 22, 2020 at 8:31 PM Christopher Unkel <cunkel@drivescale.com> wrote:
>
> Hello all,
>
> While investigating some performance issues on mdraid 10 volumes
> formed with "512e" disks (4k native/physical sector size but with 512
> byte sector emulation), I've found two cases where mdraid will
> needlessly issue writes that start on a 4k boundary but are
> shorter than 4k:
>
> 1. writes of the raid superblock; and
> 2. writes of the last page of the write-intent bitmap.
>
> The following is an excerpt of a blktrace of one of the component
> members of an mdraid 10 volume during a 4k write near the end of the
> array:
>
>   8,32  11        2     0.000001687   711  D  WS 2064 + 8 [kworker/11:1H]
> * 8,32  11        5     0.001454119   711  D  WS 2056 + 1 [kworker/11:1H]
> * 8,32  11        8     0.002847204   711  D  WS 2080 + 7 [kworker/11:1H]
>   8,32  11       11     0.003700545  3094  D  WS 11721043920 + 8 [md127_raid1]
>   8,32  11       14     0.308785692   711  D  WS 2064 + 8 [kworker/11:1H]
> * 8,32  11       17     0.310201697   711  D  WS 2056 + 1 [kworker/11:1H]
>   8,32  11       20     5.500799245   711  D  WS 2064 + 8 [kworker/11:1H]
> * 8,32  11       23    15.740923558   711  D  WS 2080 + 7 [kworker/11:1H]
>
> Note the starred transactions, which each start on a 4k boundary, but
> are less than 4k in length, and so will use the 512-byte emulation.
> Sector 2056 holds the superblock, and is written as a single 512-byte
> write.  Sector 2086 holds the bitmap bit relevant to the written
> sector.  When that bit is updated, the active bits of the last page of
> the bitmap are written, starting at sector 2080 and padded out to the
> end of the 512-byte logical sector as required.  This results in a
> 3.5KiB write, again using the 512-byte emulation.
>
> Note that in some arrays the last page of the bitmap may be
> sufficiently full that the bitmap-write issue does not arise.
>
> As there can be a substantial penalty to using the 512-byte sector
> emulation (turning writes into read-modify-write operations if the
> relevant sector is not in the drive's cache), I believe it makes sense
> to pad
> these writes out to a 4k boundary.  The writes are already padded out
> for "4k native" drives, where the short access is illegal.
>
> The following patch set changes the superblock and bitmap writes to
> respect the physical block size (e.g. 4k for today's 512e drives) when
> possible.  In each case there is already logic for padding out to the
> underlying logical sector size.  I reuse or repeat the logic for
> padding out to the physical sector size, but treat the padding out as
> optional rather than mandatory.
>
> The corresponding block trace with these patches is:
>
>    8,32   1        2     0.000003410   694  D  WS 2064 + 8 [kworker/1:1H]
>    8,32   1        5     0.001368788   694  D  WS 2056 + 8 [kworker/1:1H]
>    8,32   1        8     0.002727981   694  D  WS 2080 + 8 [kworker/1:1H]
>    8,32   1       11     0.003533831  3063  D  WS 11721043920 + 8 [md127_raid1]
>    8,32   1       14     0.253952321   694  D  WS 2064 + 8 [kworker/1:1H]
>    8,32   1       17     0.255354215   694  D  WS 2056 + 8 [kworker/1:1H]
>    8,32   1       20     5.337938486   694  D  WS 2064 + 8 [kworker/1:1H]
>    8,32   1       23    15.577963062   694  D  WS 2080 + 8 [kworker/1:1H]
>
> I do notice that the code for bitmap writes has a more sophisticated
> and thorough check for overlap than the code for superblock writes.
> (Compare write_sb_page in md-bitmap.c vs. super_1_load in md.c.)  As
> far as I know the various structure starts have always been 4k-aligned
> anyway, so it is always safe to pad the superblock write out to 4k (as
> occurs on 4k native drives) but not necessarily further.
>
> Feedback appreciated.
>
>   --Chris

Thanks for the patches. Do you have performance numbers before/after these
changes? Some micro benchmarks results would be great motivation.

Thanks,
Song


>
>
> Christopher Unkel (3):
>   md: align superblock writes to physical blocks
>   md: factor sb write alignment check into function
>   md: pad writes to end of bitmap to physical blocks
>
>  drivers/md/md-bitmap.c | 80 +++++++++++++++++++++++++-----------------
>  drivers/md/md.c        | 15 ++++++++
>  2 files changed, 63 insertions(+), 32 deletions(-)
>
> --
> 2.17.1
>


* Re: [PATCH 1/3] md: align superblock writes to physical blocks
  2020-10-23  3:31 ` [PATCH 1/3] md: align superblock writes to physical blocks Christopher Unkel
@ 2020-10-23  5:46   ` Song Liu
  2020-10-23  8:29   ` Christoph Hellwig
  1 sibling, 0 replies; 11+ messages in thread
From: Song Liu @ 2020-10-23  5:46 UTC (permalink / raw)
  To: Christopher Unkel; +Cc: linux-raid, open list

On Thu, Oct 22, 2020 at 8:31 PM Christopher Unkel <cunkel@drivescale.com> wrote:
>
> Writes of the md superblock are aligned to the logical blocks of the
> containing device, but no attempt is made to align them to physical
> block boundaries.  This means that on a "512e" device (4k physical, 512
> logical) every superblock update hits the 512-byte emulation and the
> possible associated performance penalty.
>
> Respect the physical block alignment when possible.
>
> Signed-off-by: Christopher Unkel <cunkel@drivescale.com>
> ---
>  drivers/md/md.c | 15 +++++++++++++++
>  1 file changed, 15 insertions(+)
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 98bac4f304ae..2b42850acfb3 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -1732,6 +1732,21 @@ static int super_1_load(struct md_rdev *rdev, struct md_rdev *refdev, int minor_
>             && rdev->new_data_offset < sb_start + (rdev->sb_size/512))
>                 return -EINVAL;
>
> +       /* Respect physical block size if feasible. */
> +       bmask = queue_physical_block_size(rdev->bdev->bd_disk->queue)-1;
> +       if (!((rdev->sb_start * 512) & bmask) && (rdev->sb_size & bmask)) {
> +               int candidate_size = (rdev->sb_size | bmask) + 1;
> +
> +               if (minor_version) {
> +                       int sectors = candidate_size / 512;
> +
> +                       if (rdev->data_offset >= sb_start + sectors
> +                           && rdev->new_data_offset >= sb_start + sectors)
> +                               rdev->sb_size = candidate_size;
> +               } else if (bmask <= 4095)
> +                       rdev->sb_size = candidate_size;
> +       }

In super_1_load() and super_1_sync(), we have

    bmask = queue_logical_block_size(rdev->bdev->bd_disk->queue)-1;

I think we should replace it with queue_physical_block_size() so the logic is
cleaner. Would this work?
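
For concreteness, the substitution would look like this (untested
sketch only):

-	bmask = queue_logical_block_size(rdev->bdev->bd_disk->queue)-1;
+	bmask = queue_physical_block_size(rdev->bdev->bd_disk->queue)-1;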

Thanks,
Song


* Re: [PATCH 0/3] mdraid sb and bitmap write alignment on 512e drives
  2020-10-23  5:41 ` [PATCH 0/3] mdraid sb and bitmap write alignment on 512e drives Song Liu
@ 2020-10-23  7:03   ` Chris Unkel
  0 siblings, 0 replies; 11+ messages in thread
From: Chris Unkel @ 2020-10-23  7:03 UTC (permalink / raw)
  To: Song Liu; +Cc: linux-raid, open list

I don't offhand, but point well taken.  I suspect that showing a
difference in a benchmark is dependent on finding one where the
metadata has been evicted from the drive cache.  Let me think about it
a bit.

Thanks,

  --Chris


On Thu, Oct 22, 2020 at 10:42 PM Song Liu <song@kernel.org> wrote:
>
> On Thu, Oct 22, 2020 at 8:31 PM Christopher Unkel <cunkel@drivescale.com> wrote:
> >
> > Hello all,
> >
> > While investigating some performance issues on mdraid 10 volumes
> > formed with "512e" disks (4k native/physical sector size but with 512
> > byte sector emulation), I've found two cases where mdraid will
> > needlessly issue writes that start on a 4k boundary but are
> > shorter than 4k:
> >
> > 1. writes of the raid superblock; and
> > 2. writes of the last page of the write-intent bitmap.
> >
> > The following is an excerpt of a blktrace of one of the component
> > members of an mdraid 10 volume during a 4k write near the end of the
> > array:
> >
> >   8,32  11        2     0.000001687   711  D  WS 2064 + 8 [kworker/11:1H]
> > * 8,32  11        5     0.001454119   711  D  WS 2056 + 1 [kworker/11:1H]
> > * 8,32  11        8     0.002847204   711  D  WS 2080 + 7 [kworker/11:1H]
> >   8,32  11       11     0.003700545  3094  D  WS 11721043920 + 8 [md127_raid1]
> >   8,32  11       14     0.308785692   711  D  WS 2064 + 8 [kworker/11:1H]
> > * 8,32  11       17     0.310201697   711  D  WS 2056 + 1 [kworker/11:1H]
> >   8,32  11       20     5.500799245   711  D  WS 2064 + 8 [kworker/11:1H]
> > * 8,32  11       23    15.740923558   711  D  WS 2080 + 7 [kworker/11:1H]
> >
> > Note the starred transactions, which each start on a 4k boundary, but
> > are less than 4k in length, and so will use the 512-byte emulation.
> > Sector 2056 holds the superblock, and is written as a single 512-byte
> > write.  Sector 2086 holds the bitmap bit relevant to the written
> > sector.  When that bit is updated, the active bits of the last page of
> > the bitmap are written, starting at sector 2080 and padded out to the
> > end of the 512-byte logical sector as required.  This results in a
> > 3.5KiB write, again using the 512-byte emulation.
> >
> > Note that in some arrays the last page of the bitmap may be
> > sufficiently full that the bitmap-write issue does not arise.
> >
> > As there can be a substantial penalty to using the 512-byte sector
> > emulation (turning writes into read-modify-write operations if the
> > relevant sector is not in the drive's cache), I believe it makes sense
> > to pad
> > these writes out to a 4k boundary.  The writes are already padded out
> > for "4k native" drives, where the short access is illegal.
> >
> > The following patch set changes the superblock and bitmap writes to
> > respect the physical block size (e.g. 4k for today's 512e drives) when
> > possible.  In each case there is already logic for padding out to the
> > underlying logical sector size.  I reuse or repeat the logic for
> > padding out to the physical sector size, but treat the padding out as
> > optional rather than mandatory.
> >
> > The corresponding block trace with these patches is:
> >
> >    8,32   1        2     0.000003410   694  D  WS 2064 + 8 [kworker/1:1H]
> >    8,32   1        5     0.001368788   694  D  WS 2056 + 8 [kworker/1:1H]
> >    8,32   1        8     0.002727981   694  D  WS 2080 + 8 [kworker/1:1H]
> >    8,32   1       11     0.003533831  3063  D  WS 11721043920 + 8 [md127_raid1]
> >    8,32   1       14     0.253952321   694  D  WS 2064 + 8 [kworker/1:1H]
> >    8,32   1       17     0.255354215   694  D  WS 2056 + 8 [kworker/1:1H]
> >    8,32   1       20     5.337938486   694  D  WS 2064 + 8 [kworker/1:1H]
> >    8,32   1       23    15.577963062   694  D  WS 2080 + 8 [kworker/1:1H]
> >
> > I do notice that the code for bitmap writes has a more sophisticated
> > and thorough check for overlap than the code for superblock writes.
> > (Compare write_sb_page in md-bitmap.c vs. super_1_load in md.c.)  As
> > far as I know the various structure starts have always been 4k-aligned
> > anyway, so it is always safe to pad the superblock write out to 4k (as
> > occurs on 4k native drives) but not necessarily further.
> >
> > Feedback appreciated.
> >
> >   --Chris
>
> Thanks for the patches. Do you have performance numbers before/after these
> changes? Some micro benchmarks results would be great motivation.
>
> Thanks,
> Song
>
>
> >
> >
> > Christopher Unkel (3):
> >   md: align superblock writes to physical blocks
> >   md: factor sb write alignment check into function
> >   md: pad writes to end of bitmap to physical blocks
> >
> >  drivers/md/md-bitmap.c | 80 +++++++++++++++++++++++++-----------------
> >  drivers/md/md.c        | 15 ++++++++
> >  2 files changed, 63 insertions(+), 32 deletions(-)
> >
> > --
> > 2.17.1
> >


* Re: [PATCH 1/3] md: align superblock writes to physical blocks
  2020-10-23  3:31 ` [PATCH 1/3] md: align superblock writes to physical blocks Christopher Unkel
  2020-10-23  5:46   ` Song Liu
@ 2020-10-23  8:29   ` Christoph Hellwig
  1 sibling, 0 replies; 11+ messages in thread
From: Christoph Hellwig @ 2020-10-23  8:29 UTC (permalink / raw)
  To: Christopher Unkel; +Cc: linux-raid, linux-kernel, Song Liu

> +	/* Respect physical block size if feasible. */
> +	bmask = queue_physical_block_size(rdev->bdev->bd_disk->queue)-1;
> +	if (!((rdev->sb_start * 512) & bmask) && (rdev->sb_size & bmask)) {
> +		int candidate_size = (rdev->sb_size | bmask) + 1;
> +
> +		if (minor_version) {
> +			int sectors = candidate_size / 512;
> +
> +			if (rdev->data_offset >= sb_start + sectors
> +			    && rdev->new_data_offset >= sb_start + sectors)

Linux coding style wants operators before the continuing line.

> +				rdev->sb_size = candidate_size;
> +		} else if (bmask <= 4095)
> +			rdev->sb_size = candidate_size;
> +	}

I also think this code would benefit from being factored into a helper.
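
Something like this, perhaps (helper name hypothetical; untested
sketch only, mirroring the logic of the patch above):

	/* Pad the superblock write to the physical block when the start
	 * is physically aligned and the padded write cannot reach the
	 * data area.
	 */
	static void super_1_pad_sb_size(struct md_rdev *rdev,
					sector_t sb_start, int minor_version)
	{
		int bmask;
		int candidate_size;

		bmask = queue_physical_block_size(rdev->bdev->bd_disk->queue) - 1;
		/* start unaligned, or size already physically aligned */
		if (((rdev->sb_start * 512) & bmask) || !(rdev->sb_size & bmask))
			return;

		candidate_size = (rdev->sb_size | bmask) + 1;
		if (minor_version) {
			int sectors = candidate_size / 512;

			if (rdev->data_offset >= sb_start + sectors &&
			    rdev->new_data_offset >= sb_start + sectors)
				rdev->sb_size = candidate_size;
		} else if (bmask <= 4095) {
			rdev->sb_size = candidate_size;
		}
	}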


* Re: [PATCH 0/3] mdraid sb and bitmap write alignment on 512e drives
  2020-10-23  3:31 [PATCH 0/3] mdraid sb and bitmap write alignment on 512e drives Christopher Unkel
                   ` (3 preceding siblings ...)
  2020-10-23  5:41 ` [PATCH 0/3] mdraid sb and bitmap write alignment on 512e drives Song Liu
@ 2020-11-02  7:04 ` Xiao Ni
  2020-11-02 18:59   ` Chris Unkel
  4 siblings, 1 reply; 11+ messages in thread
From: Xiao Ni @ 2020-11-02  7:04 UTC (permalink / raw)
  To: Christopher Unkel, linux-raid; +Cc: linux-kernel, Song Liu



On 10/23/2020 11:31 AM, Christopher Unkel wrote:
> Hello all,
>
> While investigating some performance issues on mdraid 10 volumes
> formed with "512e" disks (4k native/physical sector size but with 512
> byte sector emulation), I've found two cases where mdraid will
> needlessly issue writes that start on a 4k boundary but are
> shorter than 4k:
>
> 1. writes of the raid superblock; and
> 2. writes of the last page of the write-intent bitmap.
>
> The following is an excerpt of a blktrace of one of the component
> members of an mdraid 10 volume during a 4k write near the end of the
> array:
>
>    8,32  11        2     0.000001687   711  D  WS 2064 + 8 [kworker/11:1H]
> * 8,32  11        5     0.001454119   711  D  WS 2056 + 1 [kworker/11:1H]
> * 8,32  11        8     0.002847204   711  D  WS 2080 + 7 [kworker/11:1H]
>    8,32  11       11     0.003700545  3094  D  WS 11721043920 + 8 [md127_raid1]
>    8,32  11       14     0.308785692   711  D  WS 2064 + 8 [kworker/11:1H]
> * 8,32  11       17     0.310201697   711  D  WS 2056 + 1 [kworker/11:1H]
>    8,32  11       20     5.500799245   711  D  WS 2064 + 8 [kworker/11:1H]
> * 8,32  11       23    15.740923558   711  D  WS 2080 + 7 [kworker/11:1H]
>
> Note the starred transactions, which each start on a 4k boundary, but
> are less than 4k in length, and so will use the 512-byte emulation.
> Sector 2056 holds the superblock, and is written as a single 512-byte
> write.  Sector 2086 holds the bitmap bit relevant to the written
> sector.  When that bit is updated, the active bits of the last page of
> the bitmap are written, starting at sector 2080 and padded out to the
> end of the 512-byte logical sector as required.  This results in a
> 3.5KiB write, again using the 512-byte emulation.

Hi Christopher

Which superblock version do you use?  If it's super1.1, the superblock
starts at sector 0.  If it's super1.2, the superblock starts at sector
8.  If it's super1.0, the superblock starts at the end of the device
and the bitmap sits before the superblock.  As mentioned above, your
bitmap is behind the superblock, so it should not be super1.0.  So my
question is: why does sector 2056 hold the superblock?

Regards
Xiao

>
> Note that in some arrays the last page of the bitmap may be
> sufficiently full that the bitmap-write issue does not arise.
>
> As there can be a substantial penalty to using the 512-byte sector
> emulation (turning writes into read-modify-write operations if the
> relevant sector is not in the drive's cache), I believe it makes sense
> to pad
> these writes out to a 4k boundary.  The writes are already padded out
> for "4k native" drives, where the short access is illegal.
>
> The following patch set changes the superblock and bitmap writes to
> respect the physical block size (e.g. 4k for today's 512e drives) when
> possible.  In each case there is already logic for padding out to the
> underlying logical sector size.  I reuse or repeat the logic for
> padding out to the physical sector size, but treat the padding out as
> optional rather than mandatory.
>
> The corresponding block trace with these patches is:
>
>     8,32   1        2     0.000003410   694  D  WS 2064 + 8 [kworker/1:1H]
>     8,32   1        5     0.001368788   694  D  WS 2056 + 8 [kworker/1:1H]
>     8,32   1        8     0.002727981   694  D  WS 2080 + 8 [kworker/1:1H]
>     8,32   1       11     0.003533831  3063  D  WS 11721043920 + 8 [md127_raid1]
>     8,32   1       14     0.253952321   694  D  WS 2064 + 8 [kworker/1:1H]
>     8,32   1       17     0.255354215   694  D  WS 2056 + 8 [kworker/1:1H]
>     8,32   1       20     5.337938486   694  D  WS 2064 + 8 [kworker/1:1H]
>     8,32   1       23    15.577963062   694  D  WS 2080 + 8 [kworker/1:1H]
>
> I do notice that the code for bitmap writes has a more sophisticated
> and thorough check for overlap than the code for superblock writes.
> (Compare write_sb_page in md-bitmap.c vs. super_1_load in md.c.)  As
> far as I know the various structure starts have always been 4k-aligned
> anyway, so it is always safe to pad the superblock write out to 4k (as
> occurs on 4k native drives) but not necessarily further.
>
> Feedback appreciated.
>
>    --Chris
>
>
> Christopher Unkel (3):
>    md: align superblock writes to physical blocks
>    md: factor sb write alignment check into function
>    md: pad writes to end of bitmap to physical blocks
>
>   drivers/md/md-bitmap.c | 80 +++++++++++++++++++++++++-----------------
>   drivers/md/md.c        | 15 ++++++++
>   2 files changed, 63 insertions(+), 32 deletions(-)
>



* Re: [PATCH 0/3] mdraid sb and bitmap write alignment on 512e drives
  2020-11-02  7:04 ` Xiao Ni
@ 2020-11-02 18:59   ` Chris Unkel
  2020-11-03  8:32     ` Xiao Ni
  0 siblings, 1 reply; 11+ messages in thread
From: Chris Unkel @ 2020-11-02 18:59 UTC (permalink / raw)
  To: Xiao Ni; +Cc: linux-raid, open list, Song Liu

Hi Xiao,

That particular array is super1.2.  The block trace was captured on
the disk underlying the partition device on which the md array member
resides, not on the partition device itself.  The partition starts
2048 sectors into the disk (1MB).  So the 2048-sector offset to the
beginning of the partition, plus the 8-sector superblock offset for
super1.2, puts the superblock at sector 2056.
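
Laid out explicitly:

  partition start on disk:      2048 sectors  (1 MiB)
  super1.2 superblock offset:   +  8 sectors  (4 KiB into the member)
  --------------------------------------------
  superblock LBA on disk:       2056 sectors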

Sorry for the confusion there.

Regards,

 --Chris

On Sun, Nov 1, 2020 at 11:04 PM Xiao Ni <xni@redhat.com> wrote:
>
>
>
> On 10/23/2020 11:31 AM, Christopher Unkel wrote:
> > Hello all,
> >
> > While investigating some performance issues on mdraid 10 volumes
> > formed with "512e" disks (4k native/physical sector size but with 512
> > byte sector emulation), I've found two cases where mdraid will
> > needlessly issue writes that start on a 4k boundary but are
> > shorter than 4k:
> >
> > 1. writes of the raid superblock; and
> > 2. writes of the last page of the write-intent bitmap.
> >
> > The following is an excerpt of a blktrace of one of the component
> > members of an mdraid 10 volume during a 4k write near the end of the
> > array:
> >
> >    8,32  11        2     0.000001687   711  D  WS 2064 + 8 [kworker/11:1H]
> > * 8,32  11        5     0.001454119   711  D  WS 2056 + 1 [kworker/11:1H]
> > * 8,32  11        8     0.002847204   711  D  WS 2080 + 7 [kworker/11:1H]
> >    8,32  11       11     0.003700545  3094  D  WS 11721043920 + 8 [md127_raid1]
> >    8,32  11       14     0.308785692   711  D  WS 2064 + 8 [kworker/11:1H]
> > * 8,32  11       17     0.310201697   711  D  WS 2056 + 1 [kworker/11:1H]
> >    8,32  11       20     5.500799245   711  D  WS 2064 + 8 [kworker/11:1H]
> > * 8,32  11       23    15.740923558   711  D  WS 2080 + 7 [kworker/11:1H]
> >
> > Note the starred transactions, which each start on a 4k boundary, but
> > are less than 4k in length, and so will use the 512-byte emulation.
> > Sector 2056 holds the superblock, and is written as a single 512-byte
> > write.  Sector 2086 holds the bitmap bit relevant to the written
> > sector.  When that bit is updated, the active bits of the last page of
> > the bitmap are written, starting at sector 2080 and padded out to the
> > end of the 512-byte logical sector as required.  This results in a
> > 3.5KiB write, again using the 512-byte emulation.
>
> Hi Christopher
>
> Which superblock version do you use? If it's super1.1, superblock starts
> at 0 sector.
> If it's super1.2, superblock starts at 8 sector. If it's super1.0,
> superblock starts at the
> end of device and bitmap is before superblock. As mentioned above,
> bitmap is behind
> the superblock, so it should not be super1.0. So I have a question why
> does 2056 hold
> the superblock?
>
> Regards
> Xiao
>
> >
> > Note that in some arrays the last page of the bitmap may be
> > sufficiently full that the bitmap-write issue does not arise.
> >
> > As there can be a substantial penalty to using the 512-byte sector
> > emulation (turning writes into read-modify-write operations if the
> > relevant sector is not in the drive's cache), I believe it makes sense
> > to pad
> > these writes out to a 4k boundary.  The writes are already padded out
> > for "4k native" drives, where the short access is illegal.
> >
> > The following patch set changes the superblock and bitmap writes to
> > respect the physical block size (e.g. 4k for today's 512e drives) when
> > possible.  In each case there is already logic for padding out to the
> > underlying logical sector size.  I reuse or repeat the logic for
> > padding out to the physical sector size, but treat the padding out as
> > optional rather than mandatory.
> >
> > The corresponding block trace with these patches is:
> >
> >     8,32   1        2     0.000003410   694  D  WS 2064 + 8 [kworker/1:1H]
> >     8,32   1        5     0.001368788   694  D  WS 2056 + 8 [kworker/1:1H]
> >     8,32   1        8     0.002727981   694  D  WS 2080 + 8 [kworker/1:1H]
> >     8,32   1       11     0.003533831  3063  D  WS 11721043920 + 8 [md127_raid1]
> >     8,32   1       14     0.253952321   694  D  WS 2064 + 8 [kworker/1:1H]
> >     8,32   1       17     0.255354215   694  D  WS 2056 + 8 [kworker/1:1H]
> >     8,32   1       20     5.337938486   694  D  WS 2064 + 8 [kworker/1:1H]
> >     8,32   1       23    15.577963062   694  D  WS 2080 + 8 [kworker/1:1H]
> >
> > I do notice that the code for bitmap writes has a more sophisticated
> > and thorough check for overlap than the code for superblock writes.
> > (Compare write_sb_page in md-bitmap.c vs. super_1_load in md.c.)  As
> > far as I know the various structure starts have always been 4k-aligned
> > anyway, so it is always safe to pad the superblock write out to 4k (as
> > occurs on 4k native drives) but not necessarily further.
> >
> > Feedback appreciated.
> >
> >    --Chris
> >
> >
> > Christopher Unkel (3):
> >    md: align superblock writes to physical blocks
> >    md: factor sb write alignment check into function
> >    md: pad writes to end of bitmap to physical blocks
> >
> >   drivers/md/md-bitmap.c | 80 +++++++++++++++++++++++++-----------------
> >   drivers/md/md.c        | 15 ++++++++
> >   2 files changed, 63 insertions(+), 32 deletions(-)
> >
>


* Re: [PATCH 0/3] mdraid sb and bitmap write alignment on 512e drives
  2020-11-02 18:59   ` Chris Unkel
@ 2020-11-03  8:32     ` Xiao Ni
  0 siblings, 0 replies; 11+ messages in thread
From: Xiao Ni @ 2020-11-03  8:32 UTC (permalink / raw)
  To: Chris Unkel; +Cc: linux-raid, open list, Song Liu



On 11/03/2020 02:59 AM, Chris Unkel wrote:
> Hi Xiao,
>
> That particular array is super1.2.  The block trace was captured on
> the disk underlying the partition device on which the md array member
> resides, not on the partition device itself.  The partition starts
> 2048 sectors into the disk (1MB).  So the 2048-sector offset to the
> beginning of the partition, plus the 8-sector superblock offset for
> super1.2, puts the superblock at sector 2056.
>
> Sorry for the confusion there.
>
> Regards,
>
>   --Chris

Thanks for the explanation.  I still have some questions about your
patch in other emails.  Could you have a look when you are free?

Regards
Xiao



Thread overview: 11+ messages
2020-10-23  3:31 [PATCH 0/3] mdraid sb and bitmap write alignment on 512e drives Christopher Unkel
2020-10-23  3:31 ` [PATCH 1/3] md: align superblock writes to physical blocks Christopher Unkel
2020-10-23  5:46   ` Song Liu
2020-10-23  8:29   ` Christoph Hellwig
2020-10-23  3:31 ` [PATCH 2/3] md: factor sb write alignment check into function Christopher Unkel
2020-10-23  3:31 ` [PATCH 3/3] md: pad writes to end of bitmap to physical blocks Christopher Unkel
2020-10-23  5:41 ` [PATCH 0/3] mdraid sb and bitmap write alignment on 512e drives Song Liu
2020-10-23  7:03   ` Chris Unkel
2020-11-02  7:04 ` Xiao Ni
2020-11-02 18:59   ` Chris Unkel
2020-11-03  8:32     ` Xiao Ni
