* [PATCH 0/2] md: Add support for Multiple PPLs
@ 2017-08-16 15:13 Pawel Baldysiak
  2017-08-16 15:13 ` [PATCH 1/2] md: Runtime support for multiple ppls Pawel Baldysiak
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Pawel Baldysiak @ 2017-08-16 15:13 UTC (permalink / raw)
  To: linux-raid; +Cc: shli, Pawel Baldysiak

The current Partial Parity Log implementation allows only a single PPL to be
written to a particular member drive. In that case the PPL entry is
overwritten by every write, as it is stored at the same sector of the drive.
This approach is suboptimal for SSD drives, as it leads to an increased
write amplification factor due to write-after-write limitations.

This patchset changes the implementation by extending the PPL area to 1MB.
The PPL is no longer stored at a fixed location; instead, multiple PPLs are
written in a circular buffer. There is at most one valid PPL at any time
(the latest one). The recovery algorithm stays the same. The main benefits
of this approach are increased drive lifespan and, in some cases, improved
performance.
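To make the circular-buffer idea concrete, here is a toy model (not the
kernel code): it contrasts where successive PPL updates land under the old
single-PPL scheme versus the new one. The fixed 264-sector entry size is an
assumption for illustration only (real entries vary with how much partial
parity each write carries).

```c
#include <assert.h>

/* Toy model of the PPL area. Sizes assume the defaults from this
 * patchset: a 1 MiB area and, for simplicity, a fixed-size entry of
 * one 4 KiB header plus 128 KiB of partial parity data. */
#define AREA_SECTORS  2048	/* 1 MiB in 512-byte sectors */
#define ENTRY_SECTORS 264	/* (4096 + 128 * 1024) / 512 */

/* Old scheme: every PPL update overwrites the same fixed location. */
static unsigned long long single_ppl_sector(int i)
{
	(void)i;
	return 0;
}

/* New scheme: updates advance through the area and rewind to the
 * start once the next entry would no longer fit. */
static unsigned long long multi_ppl_sector(int i)
{
	int per_pass = AREA_SECTORS / ENTRY_SECTORS;	/* 7 entries per pass */

	return (unsigned long long)(i % per_pass) * ENTRY_SECTORS;
}
```

Spreading successive updates across seven slots instead of hammering one
fixed location is what eases the write-after-write pressure on a single
flash region.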

Pawel Baldysiak (2):
  md: Runtime support for multiple ppls
  raid5-ppl: Recovery support for multiple partial parity logs

 drivers/md/md.c                |  16 +++-
 drivers/md/md.h                |   1 +
 drivers/md/raid0.c             |   3 +-
 drivers/md/raid1.c             |   3 +-
 drivers/md/raid5-ppl.c         | 171 +++++++++++++++++++++++++++++++----------
 drivers/md/raid5.c             |   1 +
 include/uapi/linux/raid/md_p.h |   4 +-
 7 files changed, 152 insertions(+), 47 deletions(-)

-- 
2.13.4


^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH 1/2] md: Runtime support for multiple ppls
  2017-08-16 15:13 [PATCH 0/2] md: Add support for Multiple PPLs Pawel Baldysiak
@ 2017-08-16 15:13 ` Pawel Baldysiak
  2017-08-16 15:13 ` [PATCH 2/2] raid5-ppl: Recovery support for multiple partial parity logs Pawel Baldysiak
  2017-08-17 18:03 ` [PATCH 0/2] md: Add support for Multiple PPLs Shaohua Li
  2 siblings, 0 replies; 10+ messages in thread
From: Pawel Baldysiak @ 2017-08-16 15:13 UTC (permalink / raw)
  To: linux-raid; +Cc: shli, Pawel Baldysiak, Artur Paszkiewicz

Increase the PPL area to 1MB and use it as a circular buffer to store PPLs.
The entry with the highest generation number is the latest one. If the PPL
to be written is larger than the space left in the buffer, rewind the buffer
to the start (don't wrap it).
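The rewind rule can be sketched as a small standalone helper. This is a
simplification of what the patch does with next_io_sector in
ppl_submit_iounit, not the kernel code itself; sector_t and the 4 KiB
PPL_HEADER_SIZE are stubbed out here.

```c
#include <assert.h>

typedef unsigned long long sector_t;

/* Stand-in for the kernel's PPL header size (one 4 KiB page). */
#define PPL_HEADER_SIZE 4096

/*
 * Pick the sector for the next PPL write and advance the cursor.
 * If the entry (header + partial parity data) does not fit in the
 * space left before the end of the PPL area, rewind to the start of
 * the area rather than wrapping the entry around.
 */
static sector_t ppl_write_sector(sector_t ppl_start, sector_t ppl_size,
				 sector_t *next_io_sector,
				 unsigned int pp_size)
{
	sector_t need = (PPL_HEADER_SIZE + pp_size) >> 9;
	sector_t out;

	if (ppl_start + ppl_size - *next_io_sector < need)
		*next_io_sector = ppl_start;	/* rewind, don't wrap */
	out = *next_io_sector;
	*next_io_sector += need;
	return out;
}
```

With a 1 MiB area (2048 sectors) and full 128 KiB entries, seven entries
fit per pass before the cursor rewinds to the start of the area.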

Signed-off-by: Pawel Baldysiak <pawel.baldysiak@intel.com>
Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
---
 drivers/md/md.c                | 16 +++++++++++++---
 drivers/md/md.h                |  1 +
 drivers/md/raid0.c             |  3 ++-
 drivers/md/raid1.c             |  3 ++-
 drivers/md/raid5-ppl.c         | 43 +++++++++++++++++++++++++++++++++++++++---
 drivers/md/raid5.c             |  1 +
 include/uapi/linux/raid/md_p.h |  4 +++-
 7 files changed, 62 insertions(+), 9 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 8cdca0296749..31a87b9b548c 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -1536,7 +1536,8 @@ static int super_1_load(struct md_rdev *rdev, struct md_rdev *refdev, int minor_
 	} else if (sb->bblog_offset != 0)
 		rdev->badblocks.shift = 0;
 
-	if (le32_to_cpu(sb->feature_map) & MD_FEATURE_PPL) {
+	if ((le32_to_cpu(sb->feature_map) &
+	    (MD_FEATURE_PPL | MD_FEATURE_MULTIPLE_PPLS))) {
 		rdev->ppl.offset = (__s16)le16_to_cpu(sb->ppl.offset);
 		rdev->ppl.size = le16_to_cpu(sb->ppl.size);
 		rdev->ppl.sector = rdev->sb_start + rdev->ppl.offset;
@@ -1655,10 +1656,15 @@ static int super_1_validate(struct mddev *mddev, struct md_rdev *rdev)
 		if (le32_to_cpu(sb->feature_map) & MD_FEATURE_JOURNAL)
 			set_bit(MD_HAS_JOURNAL, &mddev->flags);
 
-		if (le32_to_cpu(sb->feature_map) & MD_FEATURE_PPL) {
+		if (le32_to_cpu(sb->feature_map) &
+		    (MD_FEATURE_PPL | MD_FEATURE_MULTIPLE_PPLS)) {
 			if (le32_to_cpu(sb->feature_map) &
 			    (MD_FEATURE_BITMAP_OFFSET | MD_FEATURE_JOURNAL))
 				return -EINVAL;
+			if ((le32_to_cpu(sb->feature_map) & MD_FEATURE_PPL) &&
+			    (le32_to_cpu(sb->feature_map) &
+					    MD_FEATURE_MULTIPLE_PPLS))
+				return -EINVAL;
 			set_bit(MD_HAS_PPL, &mddev->flags);
 		}
 	} else if (mddev->pers == NULL) {
@@ -1875,7 +1881,11 @@ static void super_1_sync(struct mddev *mddev, struct md_rdev *rdev)
 		sb->feature_map |= cpu_to_le32(MD_FEATURE_JOURNAL);
 
 	if (test_bit(MD_HAS_PPL, &mddev->flags)) {
-		sb->feature_map |= cpu_to_le32(MD_FEATURE_PPL);
+		if (test_bit(MD_HAS_MULTIPLE_PPLS, &mddev->flags))
+			sb->feature_map |=
+			    cpu_to_le32(MD_FEATURE_MULTIPLE_PPLS);
+		else
+			sb->feature_map |= cpu_to_le32(MD_FEATURE_PPL);
 		sb->ppl.offset = cpu_to_le16(rdev->ppl.offset);
 		sb->ppl.size = cpu_to_le16(rdev->ppl.size);
 	}
diff --git a/drivers/md/md.h b/drivers/md/md.h
index b50eb4ac1b82..43025baa6761 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -236,6 +236,7 @@ enum mddev_flags {
 				 * never cause the array to become failed.
 				 */
 	MD_HAS_PPL,		/* The raid array has PPL feature set */
+	MD_HAS_MULTIPLE_PPLS,	/* The raid array has multiple PPLs feature set */
 };
 
 enum mddev_sb_flags {
diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
index 94d9ae9b0fd0..d17ca1520a48 100644
--- a/drivers/md/raid0.c
+++ b/drivers/md/raid0.c
@@ -30,7 +30,8 @@
 	((1L << MD_HAS_JOURNAL) |	\
 	 (1L << MD_JOURNAL_CLEAN) |	\
 	 (1L << MD_FAILFAST_SUPPORTED) |\
-	 (1L << MD_HAS_PPL))
+	 (1L << MD_HAS_PPL) |		\
+	 (1L << MD_HAS_MULTIPLE_PPLS))
 
 static int raid0_congested(struct mddev *mddev, int bits)
 {
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 3febfc8391fb..3fb8db2b55ad 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -48,7 +48,8 @@
 #define UNSUPPORTED_MDDEV_FLAGS		\
 	((1L << MD_HAS_JOURNAL) |	\
 	 (1L << MD_JOURNAL_CLEAN) |	\
-	 (1L << MD_HAS_PPL))
+	 (1L << MD_HAS_PPL) |		\
+	 (1L << MD_HAS_MULTIPLE_PPLS))
 
 /*
  * Number of guaranteed r1bios in case of extreme VM load:
diff --git a/drivers/md/raid5-ppl.c b/drivers/md/raid5-ppl.c
index 44ad5baf3206..b313f17a6260 100644
--- a/drivers/md/raid5-ppl.c
+++ b/drivers/md/raid5-ppl.c
@@ -87,6 +87,8 @@
  * The current io_unit accepting new stripes is always at the end of the list.
  */
 
+#define PPL_SPACE_SIZE (128 * 1024)
+
 struct ppl_conf {
 	struct mddev *mddev;
 
@@ -122,6 +124,10 @@ struct ppl_log {
 					 * always at the end of io_list */
 	spinlock_t io_list_lock;
 	struct list_head io_list;	/* all io_units of this log */
+
+	sector_t next_io_sector;
+	unsigned int entry_space;
+	bool use_multippl;
 };
 
 #define PPL_IO_INLINE_BVECS 32
@@ -264,13 +270,12 @@ static int ppl_log_stripe(struct ppl_log *log, struct stripe_head *sh)
 	int i;
 	sector_t data_sector = 0;
 	int data_disks = 0;
-	unsigned int entry_space = (log->rdev->ppl.size << 9) - PPL_HEADER_SIZE;
 	struct r5conf *conf = sh->raid_conf;
 
 	pr_debug("%s: stripe: %llu\n", __func__, (unsigned long long)sh->sector);
 
 	/* check if current io_unit is full */
-	if (io && (io->pp_size == entry_space ||
+	if (io && (io->pp_size == log->entry_space ||
 		   io->entries_count == PPL_HDR_MAX_ENTRIES)) {
 		pr_debug("%s: add io_unit blocked by seq: %llu\n",
 			 __func__, io->seq);
@@ -451,12 +456,25 @@ static void ppl_submit_iounit(struct ppl_io_unit *io)
 	pplhdr->entries_count = cpu_to_le32(io->entries_count);
 	pplhdr->checksum = cpu_to_le32(~crc32c_le(~0, pplhdr, PPL_HEADER_SIZE));
 
+	/* Rewind the buffer if the current PPL is larger than the remaining space */
+	if (log->use_multippl &&
+	    log->rdev->ppl.sector + log->rdev->ppl.size - log->next_io_sector <
+	    (PPL_HEADER_SIZE + io->pp_size) >> 9)
+		log->next_io_sector = log->rdev->ppl.sector;
+
+
 	bio->bi_end_io = ppl_log_endio;
 	bio->bi_opf = REQ_OP_WRITE | REQ_FUA;
 	bio->bi_bdev = log->rdev->bdev;
-	bio->bi_iter.bi_sector = log->rdev->ppl.sector;
+	bio->bi_iter.bi_sector = log->next_io_sector;
 	bio_add_page(bio, io->header_page, PAGE_SIZE, 0);
 
+	pr_debug("%s: log->current_io_sector: %llu\n", __func__,
+	    (unsigned long long)log->next_io_sector);
+
+	if (log->use_multippl)
+		log->next_io_sector += (PPL_HEADER_SIZE + io->pp_size) >> 9;
+
 	list_for_each_entry(sh, &io->stripe_list, log_list) {
 		/* entries for full stripe writes have no partial parity */
 		if (test_bit(STRIPE_FULL_WRITE, &sh->state))
@@ -1031,6 +1049,7 @@ static int ppl_load(struct ppl_conf *ppl_conf)
 static void __ppl_exit_log(struct ppl_conf *ppl_conf)
 {
 	clear_bit(MD_HAS_PPL, &ppl_conf->mddev->flags);
+	clear_bit(MD_HAS_MULTIPLE_PPLS, &ppl_conf->mddev->flags);
 
 	kfree(ppl_conf->child_logs);
 
@@ -1099,6 +1118,22 @@ static int ppl_validate_rdev(struct md_rdev *rdev)
 	return 0;
 }
 
+static void ppl_init_child_log(struct ppl_log *log, struct md_rdev *rdev)
+{
+	if ((rdev->ppl.size << 9) >= (PPL_SPACE_SIZE +
+				      PPL_HEADER_SIZE) * 2) {
+		log->use_multippl = true;
+		set_bit(MD_HAS_MULTIPLE_PPLS,
+			&log->ppl_conf->mddev->flags);
+		log->entry_space = PPL_SPACE_SIZE;
+	} else {
+		log->use_multippl = false;
+		log->entry_space = (log->rdev->ppl.size << 9) -
+				   PPL_HEADER_SIZE;
+	}
+	log->next_io_sector = rdev->ppl.sector;
+}
+
 int ppl_init_log(struct r5conf *conf)
 {
 	struct ppl_conf *ppl_conf;
@@ -1196,6 +1231,7 @@ int ppl_init_log(struct r5conf *conf)
 			q = bdev_get_queue(rdev->bdev);
 			if (test_bit(QUEUE_FLAG_WC, &q->queue_flags))
 				need_cache_flush = true;
+			ppl_init_child_log(log, rdev);
 		}
 	}
 
@@ -1261,6 +1297,7 @@ int ppl_modify_log(struct r5conf *conf, struct md_rdev *rdev, bool add)
 		if (!ret) {
 			log->rdev = rdev;
 			ret = ppl_write_empty_header(log);
+			ppl_init_child_log(log, rdev);
 		}
 	} else {
 		log->rdev = NULL;
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index aeeb8d6854e2..c1b1b8e4f1af 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -7242,6 +7242,7 @@ static int raid5_run(struct mddev *mddev)
 		pr_warn("md/raid:%s: using journal device and PPL not allowed - disabling PPL\n",
 			mdname(mddev));
 		clear_bit(MD_HAS_PPL, &mddev->flags);
+		clear_bit(MD_HAS_MULTIPLE_PPLS, &mddev->flags);
 	}
 
 	if (mddev->private == NULL)
diff --git a/include/uapi/linux/raid/md_p.h b/include/uapi/linux/raid/md_p.h
index d500bd224979..b9197976b660 100644
--- a/include/uapi/linux/raid/md_p.h
+++ b/include/uapi/linux/raid/md_p.h
@@ -324,9 +324,10 @@ struct mdp_superblock_1 {
 #define	MD_FEATURE_RECOVERY_BITMAP	128 /* recovery that is happening
 					     * is guided by bitmap.
 					     */
-#define MD_FEATURE_CLUSTERED		256 /* clustered MD */
+#define	MD_FEATURE_CLUSTERED		256 /* clustered MD */
 #define	MD_FEATURE_JOURNAL		512 /* support write cache */
 #define	MD_FEATURE_PPL			1024 /* support PPL */
+#define	MD_FEATURE_MULTIPLE_PPLS	2048 /* support for multiple PPLs */
 #define	MD_FEATURE_ALL			(MD_FEATURE_BITMAP_OFFSET	\
 					|MD_FEATURE_RECOVERY_OFFSET	\
 					|MD_FEATURE_RESHAPE_ACTIVE	\
@@ -338,6 +339,7 @@ struct mdp_superblock_1 {
 					|MD_FEATURE_CLUSTERED		\
 					|MD_FEATURE_JOURNAL		\
 					|MD_FEATURE_PPL			\
+					|MD_FEATURE_MULTIPLE_PPLS	\
 					)
 
 struct r5l_payload_header {
-- 
2.13.4



* [PATCH 2/2] raid5-ppl: Recovery support for multiple partial parity logs
  2017-08-16 15:13 [PATCH 0/2] md: Add support for Multiple PPLs Pawel Baldysiak
  2017-08-16 15:13 ` [PATCH 1/2] md: Runtime support for multiple ppls Pawel Baldysiak
@ 2017-08-16 15:13 ` Pawel Baldysiak
  2017-08-17 18:03 ` [PATCH 0/2] md: Add support for Multiple PPLs Shaohua Li
  2 siblings, 0 replies; 10+ messages in thread
From: Pawel Baldysiak @ 2017-08-16 15:13 UTC (permalink / raw)
  To: linux-raid; +Cc: shli, Pawel Baldysiak

Search the PPL buffer in order to find the latest PPL header (the one with
the largest generation number) and use it for recovery. The PPL entry format
and recovery algorithm are the same as in the single-PPL approach.
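A minimal sketch of that scan follows. It is a simplified model, not the
kernel code: the struct below and its `valid` flag stand in for the real
header's checksum and signature checks, and the scan keys on the generation
number exactly as described above.

```c
#include <assert.h>

/* Simplified model of a PPL header as stored in the log area. */
struct hdr {
	unsigned long long generation;
	int valid;	/* stands in for the CRC/signature checks passing */
};

/*
 * Scan headers in buffer order and return the index of the one to
 * recover from: keep going while headers are valid and the generation
 * number keeps increasing; stop at the first invalid header or at a
 * generation drop (the previous header was the newest). Returns -1 if
 * no valid header is found at all.
 */
static int find_latest(const struct hdr *h, int n)
{
	unsigned long long best_gen = 0;
	int best = -1;
	int i;

	for (i = 0; i < n; i++) {
		if (!h[i].valid)
			break;		/* invalid header ends the scan */
		if (best >= 0 && best_gen > h[i].generation)
			break;		/* generation dropped: prev was newest */
		best = i;
		best_gen = h[i].generation;
	}
	return best;
}
```

In the real code the "next" header's offset is computed from the entry
sizes recorded in the current header, so the scan walks exactly the chain
of PPLs that were written.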

Signed-off-by: Pawel Baldysiak <pawel.baldysiak@intel.com>
---
 drivers/md/raid5-ppl.c | 128 ++++++++++++++++++++++++++++++++++---------------
 1 file changed, 90 insertions(+), 38 deletions(-)

diff --git a/drivers/md/raid5-ppl.c b/drivers/md/raid5-ppl.c
index b313f17a6260..a98ef172f8e8 100644
--- a/drivers/md/raid5-ppl.c
+++ b/drivers/md/raid5-ppl.c
@@ -831,12 +831,14 @@ static int ppl_recover_entry(struct ppl_log *log, struct ppl_header_entry *e,
 	return ret;
 }
 
-static int ppl_recover(struct ppl_log *log, struct ppl_header *pplhdr)
+static int ppl_recover(struct ppl_log *log, struct ppl_header *pplhdr,
+		       sector_t offset)
 {
 	struct ppl_conf *ppl_conf = log->ppl_conf;
 	struct md_rdev *rdev = log->rdev;
 	struct mddev *mddev = rdev->mddev;
-	sector_t ppl_sector = rdev->ppl.sector + (PPL_HEADER_SIZE >> 9);
+	sector_t ppl_sector = rdev->ppl.sector + offset +
+			      (PPL_HEADER_SIZE >> 9);
 	struct page *page;
 	int i;
 	int ret = 0;
@@ -920,6 +922,9 @@ static int ppl_write_empty_header(struct ppl_log *log)
 		return -ENOMEM;
 
 	pplhdr = page_address(page);
+	/* zero out PPL space to avoid collision with old PPLs */
+	blkdev_issue_zeroout(rdev->bdev, rdev->ppl.sector,
+			    log->rdev->ppl.size, GFP_NOIO, 0);
 	memset(pplhdr->reserved, 0xff, PPL_HDR_RESERVED);
 	pplhdr->signature = cpu_to_le32(log->ppl_conf->signature);
 	pplhdr->checksum = cpu_to_le32(~crc32c_le(~0, pplhdr, PAGE_SIZE));
@@ -940,63 +945,110 @@ static int ppl_load_distributed(struct ppl_log *log)
 	struct ppl_conf *ppl_conf = log->ppl_conf;
 	struct md_rdev *rdev = log->rdev;
 	struct mddev *mddev = rdev->mddev;
-	struct page *page;
-	struct ppl_header *pplhdr;
+	struct page *page, *page2, *tmp;
+	struct ppl_header *pplhdr = NULL, *prev_pplhdr = NULL;
 	u32 crc, crc_stored;
 	u32 signature;
-	int ret = 0;
+	int ret = 0, i;
+	sector_t pplhdr_offset = 0, prev_pplhdr_offset = 0;
 
 	pr_debug("%s: disk: %d\n", __func__, rdev->raid_disk);
-
-	/* read PPL header */
+	/* read PPL headers, find the recent one */
 	page = alloc_page(GFP_KERNEL);
 	if (!page)
 		return -ENOMEM;
 
-	if (!sync_page_io(rdev, rdev->ppl.sector - rdev->data_offset,
-			  PAGE_SIZE, page, REQ_OP_READ, 0, false)) {
-		md_error(mddev, rdev);
-		ret = -EIO;
-		goto out;
+	page2 = alloc_page(GFP_KERNEL);
+	if (!page2) {
+		__free_page(page);
+		return -ENOMEM;
 	}
-	pplhdr = page_address(page);
 
-	/* check header validity */
-	crc_stored = le32_to_cpu(pplhdr->checksum);
-	pplhdr->checksum = 0;
-	crc = ~crc32c_le(~0, pplhdr, PAGE_SIZE);
+	/* searching ppl area for latest ppl */
+	while (pplhdr_offset < rdev->ppl.size - (PPL_HEADER_SIZE >> 9)) {
+		if (!sync_page_io(rdev,
+				  rdev->ppl.sector - rdev->data_offset +
+				  pplhdr_offset, PAGE_SIZE, page, REQ_OP_READ,
+				  0, false)) {
+			md_error(mddev, rdev);
+			ret = -EIO;
+			/* if not able to read - don't recover any PPL */
+			pplhdr = NULL;
+			break;
+		}
+		pplhdr = page_address(page);
+
+		/* check header validity */
+		crc_stored = le32_to_cpu(pplhdr->checksum);
+		pplhdr->checksum = 0;
+		crc = ~crc32c_le(~0, pplhdr, PAGE_SIZE);
+
+		if (crc_stored != crc) {
+			pr_debug("%s: ppl header crc does not match: stored: 0x%x calculated: 0x%x (offset: %llu)\n",
+				 __func__, crc_stored, crc,
+				 (unsigned long long)pplhdr_offset);
+			pplhdr = prev_pplhdr;
+			pplhdr_offset = prev_pplhdr_offset;
+			break;
+		}
 
-	if (crc_stored != crc) {
-		pr_debug("%s: ppl header crc does not match: stored: 0x%x calculated: 0x%x\n",
-			 __func__, crc_stored, crc);
-		ppl_conf->mismatch_count++;
-		goto out;
-	}
+		signature = le32_to_cpu(pplhdr->signature);
 
-	signature = le32_to_cpu(pplhdr->signature);
+		if (mddev->external) {
+			/*
+			 * For external metadata the header signature is set and
+			 * validated in userspace.
+			 */
+			ppl_conf->signature = signature;
+		} else if (ppl_conf->signature != signature) {
+			pr_debug("%s: ppl header signature does not match: stored: 0x%x configured: 0x%x (offset: %llu)\n",
+				 __func__, signature, ppl_conf->signature,
+				 (unsigned long long)pplhdr_offset);
+			pplhdr = prev_pplhdr;
+			pplhdr_offset = prev_pplhdr_offset;
+			break;
+		}
 
-	if (mddev->external) {
-		/*
-		 * For external metadata the header signature is set and
-		 * validated in userspace.
-		 */
-		ppl_conf->signature = signature;
-	} else if (ppl_conf->signature != signature) {
-		pr_debug("%s: ppl header signature does not match: stored: 0x%x configured: 0x%x\n",
-			 __func__, signature, ppl_conf->signature);
-		ppl_conf->mismatch_count++;
-		goto out;
+		if (prev_pplhdr && le64_to_cpu(prev_pplhdr->generation) >
+		    le64_to_cpu(pplhdr->generation)) {
+			/* previous was newest */
+			pplhdr = prev_pplhdr;
+			pplhdr_offset = prev_pplhdr_offset;
+			break;
+		}
+
+		prev_pplhdr_offset = pplhdr_offset;
+		prev_pplhdr = pplhdr;
+
+		tmp = page;
+		page = page2;
+		page2 = tmp;
+
+		/* calculate next potential ppl offset */
+		for (i = 0; i < le32_to_cpu(pplhdr->entries_count); i++)
+			pplhdr_offset +=
+			    le32_to_cpu(pplhdr->entries[i].pp_size) >> 9;
+		pplhdr_offset += PPL_HEADER_SIZE >> 9;
 	}
 
+	/* no valid ppl found */
+	if (!pplhdr)
+		ppl_conf->mismatch_count++;
+	else
+		pr_debug("%s: latest PPL found at offset: %llu, with generation: %llu\n",
+		    __func__, (unsigned long long)pplhdr_offset,
+		    le64_to_cpu(pplhdr->generation));
+
 	/* attempt to recover from log if we are starting a dirty array */
-	if (!mddev->pers && mddev->recovery_cp != MaxSector)
-		ret = ppl_recover(log, pplhdr);
-out:
+	if (pplhdr && !mddev->pers && mddev->recovery_cp != MaxSector)
+		ret = ppl_recover(log, pplhdr, pplhdr_offset);
+
 	/* write empty header if we are starting the array */
 	if (!ret && !mddev->pers)
 		ret = ppl_write_empty_header(log);
 
 	__free_page(page);
+	__free_page(page2);
 
 	pr_debug("%s: return: %d mismatch_count: %d recovered_entries: %d\n",
 		 __func__, ret, ppl_conf->mismatch_count,
-- 
2.13.4



* Re: [PATCH 0/2] md: Add support for Multiple PPLs
  2017-08-16 15:13 [PATCH 0/2] md: Add support for Multiple PPLs Pawel Baldysiak
  2017-08-16 15:13 ` [PATCH 1/2] md: Runtime support for multiple ppls Pawel Baldysiak
  2017-08-16 15:13 ` [PATCH 2/2] raid5-ppl: Recovery support for multiple partial parity logs Pawel Baldysiak
@ 2017-08-17 18:03 ` Shaohua Li
  2017-08-21 14:54   ` Pawel Baldysiak
  2 siblings, 1 reply; 10+ messages in thread
From: Shaohua Li @ 2017-08-17 18:03 UTC (permalink / raw)
  To: Pawel Baldysiak; +Cc: linux-raid

On Wed, Aug 16, 2017 at 05:13:44PM +0200, Pawel Baldysiak wrote:
> Current Partial Parity Log implementation allows only single PPL to be
> written to particular member drive. In such case PPL entry is
> overwritten by every write as it's stored at the same sector of a drive.
> Such approach is suboptimal for SSD drives as it leads to increased
> write amplification factor due to write-after-write limitations.
> 
> This patchset changes the implementation by extending PPL area to 1MB.
> PPL is not stored at fixed location but multiple PPLs are written in a
> circular buffer. There is at most one valid PPL at any time (latest
> one). Recovery algorithm stays the same. The main benefits of this
> approach are increased drive lifespan and in some cases improved
> performance. 

The patches look good. Is this in the Intel RST standard?

I didn't get why this is helpful, though. You mentioned reduced write
amplification. Why does this reduce write amplification? As far as I know,
an SSD never does in-place writes, so writing to the same sector or a
different sector should make no difference.

Thanks,
Shaohua


* Re: [PATCH 0/2] md: Add support for Multiple PPLs
  2017-08-17 18:03 ` [PATCH 0/2] md: Add support for Multiple PPLs Shaohua Li
@ 2017-08-21 14:54   ` Pawel Baldysiak
  2017-08-21 16:14     ` Shaohua Li
  0 siblings, 1 reply; 10+ messages in thread
From: Pawel Baldysiak @ 2017-08-21 14:54 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid



On 08/17/2017 08:03 PM, Shaohua Li wrote:
> On Wed, Aug 16, 2017 at 05:13:44PM +0200, Pawel Baldysiak wrote:
>> Current Partial Parity Log implementation allows only single PPL to be
>> written to particular member drive. In such case PPL entry is
>> overwritten by every write as it's stored at the same sector of a drive.
>> Such approach is suboptimal for SSD drives as it leads to increased
>> write amplification factor due to write-after-write limitations.
>>
>> This patchset changes the implementation by extending PPL area to 1MB.
>> PPL is not stored at fixed location but multiple PPLs are written in a
>> circular buffer. There is at most one valid PPL at any time (latest
>> one). Recovery algorithm stays the same. The main benefits of this
>> approach are increased drive lifespan and in some cases improved
>> performance.
> 
> The patches look good. Is this in intel RST standard?

Hello,
Yes, this is part of the upcoming RSTe version.

> 
> I didn't get the point why this is helpful though. You mentioned reduce write
> amplification. Why does this reduce write amplification? As far as I know, SSD
> never does in-place write, write to the same sector or different sector has no
> difference.

Well, it depends on the SSD drive and its firmware implementation.

Thanks,
Pawel


* Re: [PATCH 0/2] md: Add support for Multiple PPLs
  2017-08-21 14:54   ` Pawel Baldysiak
@ 2017-08-21 16:14     ` Shaohua Li
  2017-08-25 20:02       ` Pawel Baldysiak
  0 siblings, 1 reply; 10+ messages in thread
From: Shaohua Li @ 2017-08-21 16:14 UTC (permalink / raw)
  To: Pawel Baldysiak; +Cc: linux-raid

On Mon, Aug 21, 2017 at 04:54:21PM +0200, Pawel Baldysiak wrote:
> 
> 
> On 08/17/2017 08:03 PM, Shaohua Li wrote:
> > On Wed, Aug 16, 2017 at 05:13:44PM +0200, Pawel Baldysiak wrote:
> > > Current Partial Parity Log implementation allows only single PPL to be
> > > written to particular member drive. In such case PPL entry is
> > > overwritten by every write as it's stored at the same sector of a drive.
> > > Such approach is suboptimal for SSD drives as it leads to increased
> > > write amplification factor due to write-after-write limitations.
> > > 
> > > This patchset changes the implementation by extending PPL area to 1MB.
> > > PPL is not stored at fixed location but multiple PPLs are written in a
> > > circular buffer. There is at most one valid PPL at any time (latest
> > > one). Recovery algorithm stays the same. The main benefits of this
> > > approach are increased drive lifespan and in some cases improved
> > > performance.
> > 
> > The patches look good. Is this in intel RST standard?
> 
> Hello,
> Yes, this is a part of upcoming RSTe version.
> 
> > 
> > I didn't get the point why this is helpful though. You mentioned reduce write
> > amplification. Why does this reduce write amplification? As far as I know, SSD
> > never does in-place write, write to the same sector or different sector has no
> > difference.
> 
> Well, it depends on SSD drive and firmware implementation.

Can you elaborate more?


* Re: [PATCH 0/2] md: Add support for Multiple PPLs
  2017-08-21 16:14     ` Shaohua Li
@ 2017-08-25 20:02       ` Pawel Baldysiak
  2017-08-25 21:53         ` Shaohua Li
  0 siblings, 1 reply; 10+ messages in thread
From: Pawel Baldysiak @ 2017-08-25 20:02 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid

On 08/21/2017 06:14 PM, Shaohua Li wrote:
> Can you elaborate more?
> 
Storage presented to the host may have RAID on the back-end. For NVMe
SSDs, the 1.3 specification added a field to advertise chunk boundaries:
the Identify Namespace NOIOB field. We need to span the entire stripe to
utilize the total available capacity; otherwise we'll repeatedly write to
only a subset, increasing the write amplification factor. The new PPL
approach is designed to access the full capacity.

Thanks
Pawel


* Re: [PATCH 0/2] md: Add support for Multiple PPLs
  2017-08-25 20:02       ` Pawel Baldysiak
@ 2017-08-25 21:53         ` Shaohua Li
  2017-08-28  7:44           ` Artur Paszkiewicz
  0 siblings, 1 reply; 10+ messages in thread
From: Shaohua Li @ 2017-08-25 21:53 UTC (permalink / raw)
  To: Pawel Baldysiak; +Cc: linux-raid

On Fri, Aug 25, 2017 at 10:02:07PM +0200, Pawel Baldysiak wrote:
> On 08/21/2017 06:14 PM, Shaohua Li wrote> Can you elaborate more?
> > 
> Storage presented to the host may have a RAID on the back-end. For NVMe
> SSDs, the 1.3 specification provided a field to advertise chunk
> boundaries: Identify Namespace NOIOB. We need to span the entire stripe
> to utilize the total available capacity, otherwise we'll repeatedly
> write to only a subset, increasing write amplification factor. The new
> PPL approach is designed to access the full capacity.

Ok, this makes sense. The 1MB area looks arbitrary though. What if the
chunk is 2MB? Could we make it configurable?

Thanks,
Shaohua


* Re: [PATCH 0/2] md: Add support for Multiple PPLs
  2017-08-25 21:53         ` Shaohua Li
@ 2017-08-28  7:44           ` Artur Paszkiewicz
  2017-08-28 14:54             ` Shaohua Li
  0 siblings, 1 reply; 10+ messages in thread
From: Artur Paszkiewicz @ 2017-08-28  7:44 UTC (permalink / raw)
  To: Shaohua Li, Pawel Baldysiak; +Cc: linux-raid

On 08/25/2017 11:53 PM, Shaohua Li wrote:
> On Fri, Aug 25, 2017 at 10:02:07PM +0200, Pawel Baldysiak wrote:
>> On 08/21/2017 06:14 PM, Shaohua Li wrote> Can you elaborate more?
>>>
>> Storage presented to the host may have a RAID on the back-end. For NVMe
>> SSDs, the 1.3 specification provided a field to advertise chunk
>> boundaries: Identify Namespace NOIOB. We need to span the entire stripe
>> to utilize the total available capacity, otherwise we'll repeatedly
>> write to only a subset, increasing write amplification factor. The new
>> PPL approach is designed to access the full capacity.
> 
> Ok, this makes sense. The 1MB area looks arbitrary though. What if the
> chunk is 2MB? Could we make it configurable?

Actually, it is configurable via 'ppl_size', which comes from mdadm. The
1MB is what the current RSTe standard defines, but it will be possible to
change it without affecting the kernel.

Thanks,
Artur


* Re: [PATCH 0/2] md: Add support for Multiple PPLs
  2017-08-28  7:44           ` Artur Paszkiewicz
@ 2017-08-28 14:54             ` Shaohua Li
  0 siblings, 0 replies; 10+ messages in thread
From: Shaohua Li @ 2017-08-28 14:54 UTC (permalink / raw)
  To: Artur Paszkiewicz; +Cc: Pawel Baldysiak, linux-raid

On Mon, Aug 28, 2017 at 09:44:20AM +0200, Artur Paszkiewicz wrote:
> On 08/25/2017 11:53 PM, Shaohua Li wrote:
> > On Fri, Aug 25, 2017 at 10:02:07PM +0200, Pawel Baldysiak wrote:
> >> On 08/21/2017 06:14 PM, Shaohua Li wrote> Can you elaborate more?
> >>>
> >> Storage presented to the host may have a RAID on the back-end. For NVMe
> >> SSDs, the 1.3 specification provided a field to advertise chunk
> >> boundaries: Identify Namespace NOIOB. We need to span the entire stripe
> >> to utilize the total available capacity, otherwise we'll repeatedly
> >> write to only a subset, increasing write amplification factor. The new
> >> PPL approach is designed to access the full capacity.
> > 
> > Ok, this makes sense. The 1MB area looks arbitrary though. What if the
> > chunk is 2MB? Could we make it configurable?
> 
> Actually, it is configurable by 'ppl_size' which comes from mdadm. The
> 1MB is what the current RSTe standard defines but it will be possible to
> change it without affecting the kernel.

ok, queued the patches, thanks!


