* [RFC PATCH 0/6] lightnvm: pblk: Introduce RAIL to enforce low tail read latency
@ 2018-09-17  5:29 Heiner Litz
  2018-09-17  5:29 ` [RFC PATCH 1/6] lightnvm: pblk: refactor read and write APIs Heiner Litz
                   ` (6 more replies)
  0 siblings, 7 replies; 17+ messages in thread
From: Heiner Litz @ 2018-09-17  5:29 UTC (permalink / raw)
  To: linux-block; +Cc: javier, mb, igor.j.konopko, marcin.dziegielewski

Hi All,
This patchset introduces RAIL, a mechanism to enforce low tail read latency for
lightnvm OCSSD devices. RAIL leverages redundancy to guarantee that reads are
always served from LUNs that are not serving a high-latency operation such as a
write or erase. This prevents reads from becoming serialized behind these
operations and reduces tail latency by ~10x. In particular, in the absence of
ECC read errors, it provides 99.99th percentile read latencies below 500us.
RAIL introduces a capacity overhead (7%-25%) due to its RAID-5-like striping
(which also provides fault tolerance) and reduces the maximum write bandwidth
to 110K IOPS on a CNEX SSD.
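
For context, the capacity overhead follows directly from the stride width (a
back-of-the-envelope sketch; the series defaults to PBLK_RAIL_STRIDE_WIDTH = 4,
other widths below are only illustrative):

  parity overhead = 1 / PBLK_RAIL_STRIDE_WIDTH
  stride width  4  ->  1/4  = 25% of the capacity reserved for parity
  stride width 16  ->  1/16 ~= 6-7%

so the 7%-25% range above corresponds to stride widths of roughly 16 down to 4.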

This patchset is based on pblk/core and requires two additional patches from
Javier to be applicable (let me know if you want me to rebase):

The 1st patch exposes some existing APIs so they can be used by RAIL
The 2nd patch introduces a configurable sector mapping function
The 3rd patch refactors the write path so the end_io_fn can be specified when
setting up the request
The 4th patch adds a new submit io function that acquires the write semaphore
The 5th patch introduces the RAIL feature and its API
The 6th patch integrates RAIL into pblk's read and write path

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [RFC PATCH 1/6] lightnvm: pblk: refactor read and write APIs
  2018-09-17  5:29 [RFC PATCH 0/6] lightnvm: pblk: Introduce RAIL to enforce low tail read latency Heiner Litz
@ 2018-09-17  5:29 ` Heiner Litz
  2018-09-17  5:29 ` [RFC PATCH 2/6] lightnvm: pblk: Add configurable mapping function Heiner Litz
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 17+ messages in thread
From: Heiner Litz @ 2018-09-17  5:29 UTC (permalink / raw)
  To: linux-block, hlitz; +Cc: javier, mb, igor.j.konopko, marcin.dziegielewski

In preparation for supporting RAIL, expose the read and write APIs so that
their functionality can be leveraged by RAIL.

Signed-off-by: Heiner Litz <hlitz@ucsc.edu>
---
 drivers/lightnvm/pblk-read.c  | 8 +++-----
 drivers/lightnvm/pblk-write.c | 4 ++--
 drivers/lightnvm/pblk.h       | 7 +++++++
 3 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/drivers/lightnvm/pblk-read.c b/drivers/lightnvm/pblk-read.c
index 6d13763f2f6a..67d44caefff4 100644
--- a/drivers/lightnvm/pblk-read.c
+++ b/drivers/lightnvm/pblk-read.c
@@ -170,8 +170,7 @@ static void pblk_end_user_read(struct bio *bio)
 	bio_endio(bio);
 }
 
-static void __pblk_end_io_read(struct pblk *pblk, struct nvm_rq *rqd,
-			       bool put_line)
+void __pblk_end_io_read(struct pblk *pblk, struct nvm_rq *rqd, bool put_line)
 {
 	struct nvm_tgt_dev *dev = pblk->dev;
 	struct pblk_g_ctx *r_ctx = nvm_rq_to_pdu(rqd);
@@ -285,10 +284,9 @@ static void pblk_end_partial_read(struct nvm_rq *rqd)
 	__pblk_end_io_read(pblk, rqd, false);
 }
 
-static int pblk_setup_partial_read(struct pblk *pblk, struct nvm_rq *rqd,
+int pblk_setup_partial_read(struct pblk *pblk, struct nvm_rq *rqd,
 			    unsigned int bio_init_idx,
-			    unsigned long *read_bitmap,
-			    int nr_holes)
+			    unsigned long *read_bitmap, int nr_holes)
 {
 	struct pblk_sec_meta *meta_list = rqd->meta_list;
 	struct pblk_g_ctx *r_ctx = nvm_rq_to_pdu(rqd);
diff --git a/drivers/lightnvm/pblk-write.c b/drivers/lightnvm/pblk-write.c
index 9554febee480..1ce03d7c873b 100644
--- a/drivers/lightnvm/pblk-write.c
+++ b/drivers/lightnvm/pblk-write.c
@@ -217,7 +217,7 @@ static void pblk_submit_rec(struct work_struct *work)
 }
 
 
-static void pblk_end_w_fail(struct pblk *pblk, struct nvm_rq *rqd)
+void pblk_end_w_fail(struct pblk *pblk, struct nvm_rq *rqd)
 {
 	struct pblk_rec_ctx *recovery;
 
@@ -500,7 +500,7 @@ static struct pblk_line *pblk_should_submit_meta_io(struct pblk *pblk,
 	return meta_line;
 }
 
-static int pblk_submit_io_set(struct pblk *pblk, struct nvm_rq *rqd)
+int pblk_submit_io_set(struct pblk *pblk, struct nvm_rq *rqd)
 {
 	struct ppa_addr erase_ppa;
 	struct pblk_line *meta_line;
diff --git a/drivers/lightnvm/pblk.h b/drivers/lightnvm/pblk.h
index 3596043332f2..eab50df70ae6 100644
--- a/drivers/lightnvm/pblk.h
+++ b/drivers/lightnvm/pblk.h
@@ -861,6 +861,8 @@ void pblk_lookup_l2p_seq(struct pblk *pblk, struct ppa_addr *ppas,
 int pblk_write_to_cache(struct pblk *pblk, struct bio *bio,
 			unsigned long flags);
 int pblk_write_gc_to_cache(struct pblk *pblk, struct pblk_gc_rq *gc_rq);
+void pblk_end_w_fail(struct pblk *pblk, struct nvm_rq *rqd);
+int pblk_submit_io_set(struct pblk *pblk, struct nvm_rq *rqd);
 
 /*
  * pblk map
@@ -886,6 +888,11 @@ void pblk_write_kick(struct pblk *pblk);
 extern struct bio_set pblk_bio_set;
 int pblk_submit_read(struct pblk *pblk, struct bio *bio);
 int pblk_submit_read_gc(struct pblk *pblk, struct pblk_gc_rq *gc_rq);
+void __pblk_end_io_read(struct pblk *pblk, struct nvm_rq *rqd, bool put_line);
+int pblk_setup_partial_read(struct pblk *pblk, struct nvm_rq *rqd,
+			    unsigned int bio_init_idx,
+			    unsigned long *read_bitmap, int nr_holes);
+
 /*
  * pblk recovery
  */
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [RFC PATCH 2/6] lightnvm: pblk: Add configurable mapping function
  2018-09-17  5:29 [RFC PATCH 0/6] lightnvm: pblk: Introduce RAIL to enforce low tail read latency Heiner Litz
  2018-09-17  5:29 ` [RFC PATCH 1/6] lightnvm: pblk: refactor read and write APIs Heiner Litz
@ 2018-09-17  5:29 ` Heiner Litz
  2018-09-17  5:29 ` [RFC PATCH 3/6] lightnvm: pblk: Refactor end_io function in pblk_submit_io_set Heiner Litz
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 17+ messages in thread
From: Heiner Litz @ 2018-09-17  5:29 UTC (permalink / raw)
  To: linux-block, hlitz; +Cc: javier, mb, igor.j.konopko, marcin.dziegielewski

In preparation for supporting RAIL, introduce a new function pointer so that
different mapping functions can be used to determine sector placement.

Signed-off-by: Heiner Litz <hlitz@ucsc.edu>
---
 drivers/lightnvm/pblk-init.c |  2 ++
 drivers/lightnvm/pblk-map.c  | 18 +++++++++---------
 drivers/lightnvm/pblk.h      | 13 +++++++++++++
 3 files changed, 24 insertions(+), 9 deletions(-)

diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c
index fb66bc84d5ca..2b9c6ebd9fac 100644
--- a/drivers/lightnvm/pblk-init.c
+++ b/drivers/lightnvm/pblk-init.c
@@ -411,6 +411,8 @@ static int pblk_core_init(struct pblk *pblk)
 	pblk->pad_rst_wa = 0;
 	pblk->gc_rst_wa = 0;
 
+	pblk->map_page = pblk_map_page_data;
+
 	atomic64_set(&pblk->nr_flush, 0);
 	pblk->nr_flush_rst = 0;
 
diff --git a/drivers/lightnvm/pblk-map.c b/drivers/lightnvm/pblk-map.c
index ff677ca6e4e1..9490601de3a5 100644
--- a/drivers/lightnvm/pblk-map.c
+++ b/drivers/lightnvm/pblk-map.c
@@ -18,11 +18,11 @@
 
 #include "pblk.h"
 
-static int pblk_map_page_data(struct pblk *pblk, unsigned int sentry,
-			      struct ppa_addr *ppa_list,
-			      unsigned long *lun_bitmap,
-			      struct pblk_sec_meta *meta_list,
-			      unsigned int valid_secs)
+int pblk_map_page_data(struct pblk *pblk, unsigned int sentry,
+		       struct ppa_addr *ppa_list,
+		       unsigned long *lun_bitmap,
+		       struct pblk_sec_meta *meta_list,
+		       unsigned int valid_secs)
 {
 	struct pblk_line *line = pblk_line_get_data(pblk);
 	struct pblk_emeta *emeta;
@@ -95,8 +95,8 @@ void pblk_map_rq(struct pblk *pblk, struct nvm_rq *rqd, unsigned int sentry,
 
 	for (i = off; i < rqd->nr_ppas; i += min) {
 		map_secs = (i + min > valid_secs) ? (valid_secs % min) : min;
-		if (pblk_map_page_data(pblk, sentry + i, &ppa_list[i],
-					lun_bitmap, &meta_list[i], map_secs)) {
+		if (pblk->map_page(pblk, sentry + i, &ppa_list[i], lun_bitmap,
+				   &meta_list[i], map_secs)) {
 			bio_put(rqd->bio);
 			pblk_free_rqd(pblk, rqd, PBLK_WRITE);
 			pblk_pipeline_stop(pblk);
@@ -121,8 +121,8 @@ void pblk_map_erase_rq(struct pblk *pblk, struct nvm_rq *rqd,
 
 	for (i = 0; i < rqd->nr_ppas; i += min) {
 		map_secs = (i + min > valid_secs) ? (valid_secs % min) : min;
-		if (pblk_map_page_data(pblk, sentry + i, &ppa_list[i],
-					lun_bitmap, &meta_list[i], map_secs)) {
+		if (pblk->map_page(pblk, sentry + i, &ppa_list[i], lun_bitmap,
+				   &meta_list[i], map_secs)) {
 			bio_put(rqd->bio);
 			pblk_free_rqd(pblk, rqd, PBLK_WRITE);
 			pblk_pipeline_stop(pblk);
diff --git a/drivers/lightnvm/pblk.h b/drivers/lightnvm/pblk.h
index eab50df70ae6..87dc24772dad 100644
--- a/drivers/lightnvm/pblk.h
+++ b/drivers/lightnvm/pblk.h
@@ -604,6 +604,12 @@ struct pblk_addrf {
 	int sec_ws_stripe;
 };
 
+typedef int (pblk_map_page_fn)(struct pblk *pblk, unsigned int sentry,
+			       struct ppa_addr *ppa_list,
+			       unsigned long *lun_bitmap,
+			       struct pblk_sec_meta *meta_list,
+			       unsigned int valid_secs);
+
 struct pblk {
 	struct nvm_tgt_dev *dev;
 	struct gendisk *disk;
@@ -709,6 +715,8 @@ struct pblk {
 	struct timer_list wtimer;
 
 	struct pblk_gc gc;
+
+	pblk_map_page_fn *map_page;
 };
 
 struct pblk_line_ws {
@@ -873,6 +881,11 @@ void pblk_map_erase_rq(struct pblk *pblk, struct nvm_rq *rqd,
 void pblk_map_rq(struct pblk *pblk, struct nvm_rq *rqd, unsigned int sentry,
 		 unsigned long *lun_bitmap, unsigned int valid_secs,
 		 unsigned int off);
+int pblk_map_page_data(struct pblk *pblk, unsigned int sentry,
+		       struct ppa_addr *ppa_list,
+		       unsigned long *lun_bitmap,
+		       struct pblk_sec_meta *meta_list,
+		       unsigned int valid_secs);
 
 /*
  * pblk write thread
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [RFC PATCH 3/6] lightnvm: pblk: Refactor end_io function in pblk_submit_io_set
  2018-09-17  5:29 [RFC PATCH 0/6] lightnvm: pblk: Introduce RAIL to enforce low tail read latency Heiner Litz
  2018-09-17  5:29 ` [RFC PATCH 1/6] lightnvm: pblk: refactor read and write APIs Heiner Litz
  2018-09-17  5:29 ` [RFC PATCH 2/6] lightnvm: pblk: Add configurable mapping function Heiner Litz
@ 2018-09-17  5:29 ` Heiner Litz
  2018-09-17  5:29 ` [RFC PATCH 4/6] lightnvm: pblk: Add pblk_submit_io_sem Heiner Litz
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 17+ messages in thread
From: Heiner Litz @ 2018-09-17  5:29 UTC (permalink / raw)
  To: linux-block, hlitz; +Cc: javier, mb, igor.j.konopko, marcin.dziegielewski

In preparation for supporting RAIL, refactor pblk_submit_io_set in the write
path so that the end_io function can be specified when setting up the
request.

Signed-off-by: Heiner Litz <hlitz@ucsc.edu>
---
 drivers/lightnvm/pblk-write.c | 11 ++++++-----
 drivers/lightnvm/pblk.h       |  3 ++-
 2 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/drivers/lightnvm/pblk-write.c b/drivers/lightnvm/pblk-write.c
index 1ce03d7c873b..6eba38b83acd 100644
--- a/drivers/lightnvm/pblk-write.c
+++ b/drivers/lightnvm/pblk-write.c
@@ -309,7 +309,7 @@ static int pblk_alloc_w_rq(struct pblk *pblk, struct nvm_rq *rqd,
 }
 
 static int pblk_setup_w_rq(struct pblk *pblk, struct nvm_rq *rqd,
-			   struct ppa_addr *erase_ppa)
+			   struct ppa_addr *erase_ppa, nvm_end_io_fn(*end_io))
 {
 	struct pblk_line_meta *lm = &pblk->lm;
 	struct pblk_line *e_line = pblk_line_get_erase(pblk);
@@ -325,7 +325,7 @@ static int pblk_setup_w_rq(struct pblk *pblk, struct nvm_rq *rqd,
 		return -ENOMEM;
 	c_ctx->lun_bitmap = lun_bitmap;
 
-	ret = pblk_alloc_w_rq(pblk, rqd, nr_secs, pblk_end_io_write);
+	ret = pblk_alloc_w_rq(pblk, rqd, nr_secs, end_io);
 	if (ret) {
 		kfree(lun_bitmap);
 		return ret;
@@ -500,7 +500,8 @@ static struct pblk_line *pblk_should_submit_meta_io(struct pblk *pblk,
 	return meta_line;
 }
 
-int pblk_submit_io_set(struct pblk *pblk, struct nvm_rq *rqd)
+int pblk_submit_io_set(struct pblk *pblk, struct nvm_rq *rqd,
+		       nvm_end_io_fn(*end_io))
 {
 	struct ppa_addr erase_ppa;
 	struct pblk_line *meta_line;
@@ -509,7 +510,7 @@ int pblk_submit_io_set(struct pblk *pblk, struct nvm_rq *rqd)
 	pblk_ppa_set_empty(&erase_ppa);
 
 	/* Assign lbas to ppas and populate request structure */
-	err = pblk_setup_w_rq(pblk, rqd, &erase_ppa);
+	err = pblk_setup_w_rq(pblk, rqd, &erase_ppa, end_io);
 	if (err) {
 		pblk_err(pblk, "could not setup write request: %d\n", err);
 		return NVM_IO_ERR;
@@ -631,7 +632,7 @@ static int pblk_submit_write(struct pblk *pblk)
 		goto fail_put_bio;
 	}
 
-	if (pblk_submit_io_set(pblk, rqd))
+	if (pblk_submit_io_set(pblk, rqd, pblk_end_io_write))
 		goto fail_free_bio;
 
 #ifdef CONFIG_NVM_PBLK_DEBUG
diff --git a/drivers/lightnvm/pblk.h b/drivers/lightnvm/pblk.h
index 87dc24772dad..64d9c206ec52 100644
--- a/drivers/lightnvm/pblk.h
+++ b/drivers/lightnvm/pblk.h
@@ -870,7 +870,8 @@ int pblk_write_to_cache(struct pblk *pblk, struct bio *bio,
 			unsigned long flags);
 int pblk_write_gc_to_cache(struct pblk *pblk, struct pblk_gc_rq *gc_rq);
 void pblk_end_w_fail(struct pblk *pblk, struct nvm_rq *rqd);
-int pblk_submit_io_set(struct pblk *pblk, struct nvm_rq *rqd);
+int pblk_submit_io_set(struct pblk *pblk, struct nvm_rq *rqd,
+		       nvm_end_io_fn(*end_io));
 
 /*
  * pblk map
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [RFC PATCH 4/6] lightnvm: pblk: Add pblk_submit_io_sem
  2018-09-17  5:29 [RFC PATCH 0/6] lightnvm: pblk: Introduce RAIL to enforce low tail read latency Heiner Litz
                   ` (2 preceding siblings ...)
  2018-09-17  5:29 ` [RFC PATCH 3/6] lightnvm: pblk: Refactor end_io function in pblk_submit_io_set Heiner Litz
@ 2018-09-17  5:29 ` Heiner Litz
  2018-09-17  5:29 ` [RFC PATCH 5/6] lightnvm: pblk: Add RAIL interface Heiner Litz
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 17+ messages in thread
From: Heiner Litz @ 2018-09-17  5:29 UTC (permalink / raw)
  To: linux-block, hlitz; +Cc: javier, mb, igor.j.konopko, marcin.dziegielewski

In preparation for supporting RAIL, add a new API, pblk_submit_io_sem(), which
takes the LUN semaphore before submitting the asynchronous request.

Signed-off-by: Heiner Litz <hlitz@ucsc.edu>
---
 drivers/lightnvm/pblk-core.c | 11 +++++++++++
 drivers/lightnvm/pblk.h      |  1 +
 2 files changed, 12 insertions(+)

diff --git a/drivers/lightnvm/pblk-core.c b/drivers/lightnvm/pblk-core.c
index 2e40666fdf80..a31bf359f905 100644
--- a/drivers/lightnvm/pblk-core.c
+++ b/drivers/lightnvm/pblk-core.c
@@ -492,6 +492,17 @@ int pblk_submit_io(struct pblk *pblk, struct nvm_rq *rqd)
 	return nvm_submit_io(dev, rqd);
 }
 
+int pblk_submit_io_sem(struct pblk *pblk, struct nvm_rq *rqd)
+{
+	struct ppa_addr *ppa_list = nvm_rq_to_ppa_list(rqd);
+	int ret;
+
+	pblk_down_chunk(pblk, ppa_list[0]);
+	ret = pblk_submit_io(pblk, rqd);
+
+	return ret;
+}
+
 void pblk_check_chunk_state_update(struct pblk *pblk, struct nvm_rq *rqd)
 {
 	struct ppa_addr *ppa_list = nvm_rq_to_ppa_list(rqd);
diff --git a/drivers/lightnvm/pblk.h b/drivers/lightnvm/pblk.h
index 64d9c206ec52..bd88784e51d9 100644
--- a/drivers/lightnvm/pblk.h
+++ b/drivers/lightnvm/pblk.h
@@ -797,6 +797,7 @@ struct nvm_chk_meta *pblk_chunk_get_off(struct pblk *pblk,
 void pblk_log_write_err(struct pblk *pblk, struct nvm_rq *rqd);
 void pblk_log_read_err(struct pblk *pblk, struct nvm_rq *rqd);
 int pblk_submit_io(struct pblk *pblk, struct nvm_rq *rqd);
+int pblk_submit_io_sem(struct pblk *pblk, struct nvm_rq *rqd);
 int pblk_submit_io_sync(struct pblk *pblk, struct nvm_rq *rqd);
 int pblk_submit_io_sync_sem(struct pblk *pblk, struct nvm_rq *rqd);
 int pblk_submit_meta_io(struct pblk *pblk, struct pblk_line *meta_line);
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [RFC PATCH 5/6] lightnvm: pblk: Add RAIL interface
  2018-09-17  5:29 [RFC PATCH 0/6] lightnvm: pblk: Introduce RAIL to enforce low tail read latency Heiner Litz
                   ` (3 preceding siblings ...)
  2018-09-17  5:29 ` [RFC PATCH 4/6] lightnvm: pblk: Add pblk_submit_io_sem Heiner Litz
@ 2018-09-17  5:29 ` Heiner Litz
  2018-09-18 11:28   ` Hans Holmberg
  2018-09-17  5:29 ` [RFC PATCH 6/6] lightnvm: pblk: Integrate RAIL Heiner Litz
  2018-09-18 11:46 ` [RFC PATCH 0/6] lightnvm: pblk: Introduce RAIL to enforce low tail read latency Hans Holmberg
  6 siblings, 1 reply; 17+ messages in thread
From: Heiner Litz @ 2018-09-17  5:29 UTC (permalink / raw)
  To: linux-block, hlitz; +Cc: javier, mb, igor.j.konopko, marcin.dziegielewski

In preparation for supporting RAIL, add the RAIL API.

Signed-off-by: Heiner Litz <hlitz@ucsc.edu>
---
 drivers/lightnvm/pblk-rail.c | 808 +++++++++++++++++++++++++++++++++++
 drivers/lightnvm/pblk.h      |  63 ++-
 2 files changed, 870 insertions(+), 1 deletion(-)
 create mode 100644 drivers/lightnvm/pblk-rail.c

diff --git a/drivers/lightnvm/pblk-rail.c b/drivers/lightnvm/pblk-rail.c
new file mode 100644
index 000000000000..a48ed31a0ba9
--- /dev/null
+++ b/drivers/lightnvm/pblk-rail.c
@@ -0,0 +1,808 @@
+/*
+ * Copyright (C) 2018 Heiner Litz
+ * Initial release: Heiner Litz <hlitz@ucsc.edu>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License version
+ * 2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * pblk-rail.c - pblk's RAIL path
+ */
+
+#include "pblk.h"
+
+#define PBLK_RAIL_EMPTY ~0x0
+#define PBLK_RAIL_PARITY_WRITE 0x8000
+
+/* RAIL auxiliary functions */
+static unsigned int pblk_rail_nr_parity_luns(struct pblk *pblk)
+{
+	struct pblk_line_meta *lm = &pblk->lm;
+
+	return lm->blk_per_line / PBLK_RAIL_STRIDE_WIDTH;
+}
+
+static unsigned int pblk_rail_nr_data_luns(struct pblk *pblk)
+{
+	struct pblk_line_meta *lm = &pblk->lm;
+
+	return lm->blk_per_line - pblk_rail_nr_parity_luns(pblk);
+}
+
+static unsigned int pblk_rail_sec_per_stripe(struct pblk *pblk)
+{
+	struct pblk_line_meta *lm = &pblk->lm;
+
+	return lm->blk_per_line * pblk->min_write_pgs;
+}
+
+static unsigned int pblk_rail_psec_per_stripe(struct pblk *pblk)
+{
+	return pblk_rail_nr_parity_luns(pblk) * pblk->min_write_pgs;
+}
+
+static unsigned int pblk_rail_dsec_per_stripe(struct pblk *pblk)
+{
+	return pblk_rail_sec_per_stripe(pblk) - pblk_rail_psec_per_stripe(pblk);
+}
+
+static unsigned int pblk_rail_wrap_lun(struct pblk *pblk, unsigned int lun)
+{
+	struct pblk_line_meta *lm = &pblk->lm;
+
+	return (lun & (lm->blk_per_line - 1));
+}
+
+bool pblk_rail_meta_distance(struct pblk_line *data_line)
+{
+	return (data_line->meta_distance % PBLK_RAIL_STRIDE_WIDTH) == 0;
+}
+
+/* Notify readers that LUN is serving high latency operation */
+static void pblk_rail_notify_reader_down(struct pblk *pblk, int lun)
+{
+	WARN_ON(test_and_set_bit(lun, pblk->rail.busy_bitmap));
+	/* Make sure that busy bit is seen by reader before proceeding */
+	smp_mb__after_atomic();
+}
+
+static void pblk_rail_notify_reader_up(struct pblk *pblk, int lun)
+{
+	/* Make sure that write is completed before releasing busy bit */
+	smp_mb__before_atomic();
+	WARN_ON(!test_and_clear_bit(lun, pblk->rail.busy_bitmap));
+}
+
+int pblk_rail_lun_busy(struct pblk *pblk, struct ppa_addr ppa)
+{
+	struct nvm_tgt_dev *dev = pblk->dev;
+	struct nvm_geo *geo = &dev->geo;
+	int lun_pos = pblk_ppa_to_pos(geo, ppa);
+
+	return test_bit(lun_pos, pblk->rail.busy_bitmap);
+}
+
+/* Enforces one writer per stride */
+int pblk_rail_down_stride(struct pblk *pblk, int lun_pos, int timeout)
+{
+	struct pblk_lun *rlun;
+	int strides = pblk_rail_nr_parity_luns(pblk);
+	int stride = lun_pos % strides;
+	int ret;
+
+	rlun = &pblk->luns[stride];
+	ret = down_timeout(&rlun->wr_sem, timeout);
+	pblk_rail_notify_reader_down(pblk, lun_pos);
+
+	return ret;
+}
+
+void pblk_rail_up_stride(struct pblk *pblk, int lun_pos)
+{
+	struct pblk_lun *rlun;
+	int strides = pblk_rail_nr_parity_luns(pblk);
+	int stride = lun_pos % strides;
+
+	pblk_rail_notify_reader_up(pblk, lun_pos);
+	rlun = &pblk->luns[stride];
+	up(&rlun->wr_sem);
+}
+
+/* Determine whether a sector holds data, meta or is bad */
+bool pblk_rail_valid_sector(struct pblk *pblk, struct pblk_line *line, int pos)
+{
+	struct pblk_line_meta *lm = &pblk->lm;
+	struct nvm_tgt_dev *dev = pblk->dev;
+	struct nvm_geo *geo = &dev->geo;
+	struct ppa_addr ppa;
+	int lun;
+
+	if (pos >= line->smeta_ssec && pos < (line->smeta_ssec + lm->smeta_sec))
+		return false;
+
+	if (pos >= line->emeta_ssec &&
+	    pos < (line->emeta_ssec + lm->emeta_sec[0]))
+		return false;
+
+	ppa = addr_to_gen_ppa(pblk, pos, line->id);
+	lun = pblk_ppa_to_pos(geo, ppa);
+
+	return !test_bit(lun, line->blk_bitmap);
+}
+
+/* Delay rb overwrite until whole stride has been written */
+int pblk_rail_rb_delay(struct pblk_rb *rb)
+{
+	struct pblk *pblk = container_of(rb, struct pblk, rwb);
+
+	return pblk_rail_sec_per_stripe(pblk);
+}
+
+static unsigned int pblk_rail_sec_to_stride(struct pblk *pblk, unsigned int sec)
+{
+	unsigned int sec_in_stripe = sec % pblk_rail_sec_per_stripe(pblk);
+	int page = sec_in_stripe / pblk->min_write_pgs;
+
+	return page % pblk_rail_nr_parity_luns(pblk);
+}
+
+static unsigned int pblk_rail_sec_to_idx(struct pblk *pblk, unsigned int sec)
+{
+	unsigned int sec_in_stripe = sec % pblk_rail_sec_per_stripe(pblk);
+
+	return sec_in_stripe / pblk_rail_psec_per_stripe(pblk);
+}
+
+static void pblk_rail_data_parity(void *dest, void *src)
+{
+	unsigned int i;
+
+	for (i = 0; i < PBLK_EXPOSED_PAGE_SIZE / sizeof(unsigned long); i++)
+		((unsigned long *)dest)[i] ^= ((unsigned long *)src)[i];
+}
+
+static void pblk_rail_lba_parity(u64 *dest, u64 *src)
+{
+	*dest ^= *src;
+}
+
+/* Tracks where a sector is located in the rwb */
+void pblk_rail_track_sec(struct pblk *pblk, struct pblk_line *line, int cur_sec,
+			 int sentry, int nr_valid)
+{
+	int stride, idx, pos;
+
+	stride = pblk_rail_sec_to_stride(pblk, cur_sec);
+	idx = pblk_rail_sec_to_idx(pblk, cur_sec);
+	pos = pblk_rb_wrap_pos(&pblk->rwb, sentry);
+	pblk->rail.p2b[stride][idx].pos = pos;
+	pblk->rail.p2b[stride][idx].nr_valid = nr_valid;
+}
+
+/* RAIL's sector mapping function */
+static void pblk_rail_map_sec(struct pblk *pblk, struct pblk_line *line,
+			      int sentry, struct pblk_sec_meta *meta_list,
+			      __le64 *lba_list, struct ppa_addr ppa)
+{
+	struct pblk_w_ctx *w_ctx;
+	__le64 addr_empty = cpu_to_le64(ADDR_EMPTY);
+
+	kref_get(&line->ref);
+
+	if (sentry & PBLK_RAIL_PARITY_WRITE) {
+		u64 *lba;
+
+		sentry &= ~PBLK_RAIL_PARITY_WRITE;
+		lba = &pblk->rail.lba[sentry];
+		meta_list->lba = cpu_to_le64(*lba);
+		*lba_list = cpu_to_le64(*lba);
+		line->nr_valid_lbas++;
+	} else {
+		w_ctx = pblk_rb_w_ctx(&pblk->rwb, sentry);
+		w_ctx->ppa = ppa;
+		meta_list->lba = cpu_to_le64(w_ctx->lba);
+		*lba_list = cpu_to_le64(w_ctx->lba);
+
+		if (*lba_list != addr_empty)
+			line->nr_valid_lbas++;
+		else
+			atomic64_inc(&pblk->pad_wa);
+	}
+}
+
+int pblk_rail_map_page_data(struct pblk *pblk, unsigned int sentry,
+			    struct ppa_addr *ppa_list,
+			    unsigned long *lun_bitmap,
+			    struct pblk_sec_meta *meta_list,
+			    unsigned int valid_secs)
+{
+	struct pblk_line *line = pblk_line_get_data(pblk);
+	struct pblk_emeta *emeta;
+	__le64 *lba_list;
+	u64 paddr;
+	int nr_secs = pblk->min_write_pgs;
+	int i;
+
+	if (pblk_line_is_full(line)) {
+		struct pblk_line *prev_line = line;
+
+		/* If we cannot allocate a new line, make sure to store metadata
+		 * on current line and then fail
+		 */
+		line = pblk_line_replace_data(pblk);
+		pblk_line_close_meta(pblk, prev_line);
+
+		if (!line)
+			return -EINTR;
+	}
+
+	emeta = line->emeta;
+	lba_list = emeta_to_lbas(pblk, emeta->buf);
+
+	paddr = pblk_alloc_page(pblk, line, nr_secs);
+
+	pblk_rail_track_sec(pblk, line, paddr, sentry, valid_secs);
+
+	for (i = 0; i < nr_secs; i++, paddr++) {
+		__le64 addr_empty = cpu_to_le64(ADDR_EMPTY);
+
+		/* ppa to be sent to the device */
+		ppa_list[i] = addr_to_gen_ppa(pblk, paddr, line->id);
+
+		/* Write context for target bio completion on write buffer. Note
+		 * that the write buffer is protected by the sync backpointer,
+		 * and a single writer thread has access to each specific entry
+		 * at a time. Thus, it is safe to modify the context for the
+		 * entry we are setting up for submission without taking any
+		 * lock or memory barrier.
+		 */
+		if (i < valid_secs) {
+			pblk_rail_map_sec(pblk, line, sentry + i, &meta_list[i],
+					  &lba_list[paddr], ppa_list[i]);
+		} else {
+			lba_list[paddr] = meta_list[i].lba = addr_empty;
+			__pblk_map_invalidate(pblk, line, paddr);
+		}
+	}
+
+	pblk_down_rq(pblk, ppa_list[0], lun_bitmap);
+	return 0;
+}
+
+/* RAIL Initialization and tear down */
+int pblk_rail_init(struct pblk *pblk)
+{
+	struct pblk_line_meta *lm = &pblk->lm;
+	int i, p2be;
+	unsigned int nr_strides;
+	unsigned int psecs;
+	void *kaddr;
+
+	if (!PBLK_RAIL_STRIDE_WIDTH)
+		return 0;
+
+	if (((lm->blk_per_line % PBLK_RAIL_STRIDE_WIDTH) != 0) ||
+	    (lm->blk_per_line < PBLK_RAIL_STRIDE_WIDTH)) {
+		pr_err("pblk: unsupported RAIL stride %i\n", lm->blk_per_line);
+		return -EINVAL;
+	}
+
+	psecs = pblk_rail_psec_per_stripe(pblk);
+	nr_strides = pblk_rail_sec_per_stripe(pblk) / PBLK_RAIL_STRIDE_WIDTH;
+
+	pblk->rail.p2b = kmalloc_array(nr_strides, sizeof(struct p2b_entry *),
+				       GFP_KERNEL);
+	if (!pblk->rail.p2b)
+		return -ENOMEM;
+
+	for (p2be = 0; p2be < nr_strides; p2be++) {
+		pblk->rail.p2b[p2be] = kmalloc_array(PBLK_RAIL_STRIDE_WIDTH - 1,
+					       sizeof(struct p2b_entry),
+					       GFP_KERNEL);
+		if (!pblk->rail.p2b[p2be])
+			goto free_p2b_entries;
+	}
+
+	pblk->rail.data = kmalloc(psecs * sizeof(void *), GFP_KERNEL);
+	if (!pblk->rail.data)
+		goto free_p2b_entries;
+
+	pblk->rail.pages = alloc_pages(GFP_KERNEL, get_count_order(psecs));
+	if (!pblk->rail.pages)
+		goto free_data;
+
+	kaddr = page_address(pblk->rail.pages);
+	for (i = 0; i < psecs; i++)
+		pblk->rail.data[i] = kaddr + i * PBLK_EXPOSED_PAGE_SIZE;
+
+	pblk->rail.lba = kmalloc_array(psecs, sizeof(u64 *), GFP_KERNEL);
+	if (!pblk->rail.lba)
+		goto free_pages;
+
+	/* Subtract parity bits from device capacity */
+	pblk->capacity = pblk->capacity * (PBLK_RAIL_STRIDE_WIDTH - 1) /
+		PBLK_RAIL_STRIDE_WIDTH;
+
+	pblk->map_page = pblk_rail_map_page_data;
+
+	return 0;
+
+free_pages:
+	free_pages((unsigned long)page_address(pblk->rail.pages),
+		   get_count_order(psecs));
+free_data:
+	kfree(pblk->rail.data);
+free_p2b_entries:
+	for (p2be = p2be - 1; p2be >= 0; p2be--)
+		kfree(pblk->rail.p2b[p2be]);
+	kfree(pblk->rail.p2b);
+
+	return -ENOMEM;
+}
+
+void pblk_rail_free(struct pblk *pblk)
+{
+	unsigned int i;
+	unsigned int nr_strides;
+	unsigned int psecs;
+
+	psecs = pblk_rail_psec_per_stripe(pblk);
+	nr_strides = pblk_rail_sec_per_stripe(pblk) / PBLK_RAIL_STRIDE_WIDTH;
+
+	kfree(pblk->rail.lba);
+	free_pages((unsigned long)page_address(pblk->rail.pages),
+		   get_count_order(psecs));
+	kfree(pblk->rail.data);
+	for (i = 0; i < nr_strides; i++)
+		kfree(pblk->rail.p2b[i]);
+	kfree(pblk->rail.p2b);
+}
+
+/* PBLK supports 64 ppas max. By performing RAIL reads, a sector is read using
+ * multiple ppas which can lead to violation of the 64 ppa limit. In this case,
+ * split the bio
+ */
+static void pblk_rail_bio_split(struct pblk *pblk, struct bio **bio, int sec)
+{
+	struct nvm_tgt_dev *dev = pblk->dev;
+	struct bio *split;
+
+	sec *= (dev->geo.csecs >> 9);
+
+	split = bio_split(*bio, sec, GFP_KERNEL, &pblk_bio_set);
+	/* there is no chance to merge the split bio */
+	split->bi_opf |= REQ_NOMERGE;
+	bio_set_flag(*bio, BIO_QUEUE_ENTERED);
+	bio_chain(split, *bio);
+	generic_make_request(*bio);
+	*bio = split;
+}
+
+/* RAIL's Write Path */
+static int pblk_rail_sched_parity(struct pblk *pblk)
+{
+	struct pblk_line *line = pblk_line_get_data(pblk);
+	unsigned int sec_in_stripe;
+
+	while (1) {
+		sec_in_stripe = line->cur_sec % pblk_rail_sec_per_stripe(pblk);
+
+		/* Schedule parity write at end of data section */
+		if (sec_in_stripe >= pblk_rail_dsec_per_stripe(pblk))
+			return 1;
+
+		/* Skip bad blocks and meta sectors until we find a valid sec */
+		if (test_bit(line->cur_sec, line->map_bitmap))
+			line->cur_sec += pblk->min_write_pgs;
+		else
+			break;
+	}
+
+	return 0;
+}
+
+/* Mark RAIL parity sectors as invalid sectors so they will be gc'ed */
+void pblk_rail_line_close(struct pblk *pblk, struct pblk_line *line)
+{
+	int off, bit;
+
+	for (off = pblk_rail_dsec_per_stripe(pblk);
+	     off < pblk->lm.sec_per_line;
+	     off += pblk_rail_sec_per_stripe(pblk)) {
+		for (bit = 0; bit < pblk_rail_psec_per_stripe(pblk); bit++)
+			set_bit(off + bit, line->invalid_bitmap);
+	}
+}
+
+void pblk_rail_end_io_write(struct nvm_rq *rqd)
+{
+	struct pblk *pblk = rqd->private;
+	struct pblk_c_ctx *c_ctx = nvm_rq_to_pdu(rqd);
+
+	if (rqd->error) {
+		pblk_log_write_err(pblk, rqd);
+		return pblk_end_w_fail(pblk, rqd);
+	}
+#ifdef CONFIG_NVM_DEBUG
+	else
+		WARN_ONCE(rqd->bio->bi_status, "pblk: corrupted write error\n");
+#endif
+
+	pblk_up_rq(pblk, c_ctx->lun_bitmap);
+
+	pblk_rq_to_line_put(pblk, rqd);
+	bio_put(rqd->bio);
+	pblk_free_rqd(pblk, rqd, PBLK_WRITE);
+
+	atomic_dec(&pblk->inflight_io);
+}
+
+static int pblk_rail_read_to_bio(struct pblk *pblk, struct nvm_rq *rqd,
+			  struct bio *bio, unsigned int stride,
+			  unsigned int nr_secs, unsigned int paddr)
+{
+	struct pblk_c_ctx *c_ctx = nvm_rq_to_pdu(rqd);
+	int sec, i;
+	int nr_data = PBLK_RAIL_STRIDE_WIDTH - 1;
+	struct pblk_line *line = pblk_line_get_data(pblk);
+
+	c_ctx->nr_valid = nr_secs;
+	/* sentry indexes rail page buffer, instead of rwb */
+	c_ctx->sentry = stride * pblk->min_write_pgs;
+	c_ctx->sentry |= PBLK_RAIL_PARITY_WRITE;
+
+	for (sec = 0; sec < pblk->min_write_pgs; sec++) {
+		void *pg_addr;
+		struct page *page;
+		u64 *lba;
+
+		lba = &pblk->rail.lba[stride * pblk->min_write_pgs + sec];
+		pg_addr = pblk->rail.data[stride * pblk->min_write_pgs + sec];
+		page = virt_to_page(pg_addr);
+
+		if (!page) {
+			pr_err("pblk: could not allocate RAIL bio page %p\n",
+			       pg_addr);
+			return -NVM_IO_ERR;
+		}
+
+		if (bio_add_page(bio, page, pblk->rwb.seg_size, 0) !=
+		    pblk->rwb.seg_size) {
+			pr_err("pblk: could not add page to RAIL bio\n");
+			return -NVM_IO_ERR;
+		}
+
+		*lba = 0;
+		memset(pg_addr, 0, PBLK_EXPOSED_PAGE_SIZE);
+
+		for (i = 0; i < nr_data; i++) {
+			struct pblk_rb_entry *entry;
+			struct pblk_w_ctx *w_ctx;
+			u64 lba_src;
+			unsigned int pos;
+			unsigned int cur;
+			int distance = pblk_rail_psec_per_stripe(pblk);
+
+			cur = paddr - distance * (nr_data - i) + sec;
+
+			if (!pblk_rail_valid_sector(pblk, line, cur))
+				continue;
+
+			pos = pblk->rail.p2b[stride][i].pos;
+			pos = pblk_rb_wrap_pos(&pblk->rwb, pos + sec);
+			entry = &pblk->rwb.entries[pos];
+			w_ctx = &entry->w_ctx;
+			lba_src = w_ctx->lba;
+
+			if (sec < pblk->rail.p2b[stride][i].nr_valid &&
+			    lba_src != ADDR_EMPTY) {
+				pblk_rail_data_parity(pg_addr, entry->data);
+				pblk_rail_lba_parity(lba, &lba_src);
+			}
+		}
+	}
+
+	return 0;
+}
+
+int pblk_rail_submit_write(struct pblk *pblk)
+{
+	int i;
+	struct nvm_rq *rqd;
+	struct bio *bio;
+	struct pblk_line *line = pblk_line_get_data(pblk);
+	int start, end, bb_offset;
+	unsigned int stride = 0;
+
+	if (!pblk_rail_sched_parity(pblk))
+		return 0;
+
+	start = line->cur_sec;
+	bb_offset = start % pblk_rail_sec_per_stripe(pblk);
+	end = start + pblk_rail_sec_per_stripe(pblk) - bb_offset;
+
+	for (i = start; i < end; i += pblk->min_write_pgs, stride++) {
+		/* Do not generate parity in this slot if the sec is bad
+		 * or reserved for meta.
+		 * We check on the read path and perform a conventional
+		 * read, to avoid reading parity from the bad block
+		 */
+		if (!pblk_rail_valid_sector(pblk, line, i))
+			continue;
+
+		rqd = pblk_alloc_rqd(pblk, PBLK_WRITE);
+		if (IS_ERR(rqd)) {
+			pr_err("pblk: cannot allocate parity write req.\n");
+			return -ENOMEM;
+		}
+
+		bio = bio_alloc(GFP_KERNEL, pblk->min_write_pgs);
+		if (!bio) {
+			pr_err("pblk: cannot allocate parity write bio\n");
+			pblk_free_rqd(pblk, rqd, PBLK_WRITE);
+			return -ENOMEM;
+		}
+
+		bio->bi_iter.bi_sector = 0; /* internal bio */
+		bio_set_op_attrs(bio, REQ_OP_WRITE, 0);
+		rqd->bio = bio;
+
+		pblk_rail_read_to_bio(pblk, rqd, bio, stride,
+				      pblk->min_write_pgs, i);
+
+		if (pblk_submit_io_set(pblk, rqd, pblk_rail_end_io_write)) {
+			bio_put(rqd->bio);
+			pblk_free_rqd(pblk, rqd, PBLK_WRITE);
+
+			return -NVM_IO_ERR;
+		}
+	}
+
+	return 0;
+}
+
+/* RAIL's Read Path */
+static void pblk_rail_end_io_read(struct nvm_rq *rqd)
+{
+	struct pblk *pblk = rqd->private;
+	struct pblk_g_ctx *r_ctx = nvm_rq_to_pdu(rqd);
+	struct pblk_pr_ctx *pr_ctx = r_ctx->private;
+	struct bio *new_bio = rqd->bio;
+	struct bio *bio = pr_ctx->orig_bio;
+	struct bio_vec src_bv, dst_bv;
+	struct pblk_sec_meta *meta_list = rqd->meta_list;
+	int bio_init_idx = pr_ctx->bio_init_idx;
+	int nr_secs = pr_ctx->orig_nr_secs;
+	__le64 *lba_list_mem, *lba_list_media;
+	__le64 addr_empty = cpu_to_le64(ADDR_EMPTY);
+	void *src_p, *dst_p;
+	int i, r, rail_ppa = 0;
+	unsigned char valid;
+
+	if (unlikely(rqd->nr_ppas == 1)) {
+		struct ppa_addr ppa;
+
+		ppa = rqd->ppa_addr;
+		rqd->ppa_list = pr_ctx->ppa_ptr;
+		rqd->dma_ppa_list = pr_ctx->dma_ppa_list;
+		rqd->ppa_list[0] = ppa;
+	}
+
+	/* Re-use allocated memory for intermediate lbas */
+	lba_list_mem = (((void *)rqd->ppa_list) + pblk_dma_ppa_size);
+	lba_list_media = (((void *)rqd->ppa_list) + 2 * pblk_dma_ppa_size);
+
+	for (i = 0; i < rqd->nr_ppas; i++)
+		lba_list_media[i] = meta_list[i].lba;
+	for (i = 0; i < nr_secs; i++)
+		meta_list[i].lba = lba_list_mem[i];
+
+	for (i = 0; i < nr_secs; i++) {
+		struct pblk_line *line;
+		u64 meta_lba = 0x0UL, mlba;
+
+		line = pblk_ppa_to_line(pblk, rqd->ppa_list[rail_ppa]);
+
+		valid = bitmap_weight(pr_ctx->bitmap, PBLK_RAIL_STRIDE_WIDTH);
+		bitmap_shift_right(pr_ctx->bitmap, pr_ctx->bitmap,
+				   PBLK_RAIL_STRIDE_WIDTH, PR_BITMAP_SIZE);
+
+		if (valid == 0) /* Skip cached reads */
+			continue;
+
+		kref_put(&line->ref, pblk_line_put);
+
+		dst_bv = bio->bi_io_vec[bio_init_idx + i];
+		dst_p = kmap_atomic(dst_bv.bv_page);
+
+		memset(dst_p + dst_bv.bv_offset, 0, PBLK_EXPOSED_PAGE_SIZE);
+		meta_list[i].lba = cpu_to_le64(0x0UL);
+
+		for (r = 0; r < valid; r++, rail_ppa++) {
+			src_bv = new_bio->bi_io_vec[rail_ppa];
+
+			if (lba_list_media[rail_ppa] != addr_empty) {
+				src_p = kmap_atomic(src_bv.bv_page);
+				pblk_rail_data_parity(dst_p + dst_bv.bv_offset,
+						      src_p + src_bv.bv_offset);
+				mlba = le64_to_cpu(lba_list_media[rail_ppa]);
+				pblk_rail_lba_parity(&meta_lba, &mlba);
+				kunmap_atomic(src_p);
+			}
+
+			mempool_free(src_bv.bv_page, &pblk->page_bio_pool);
+		}
+		meta_list[i].lba = cpu_to_le64(meta_lba);
+		kunmap_atomic(dst_p);
+	}
+
+	bio_put(new_bio);
+	rqd->nr_ppas = pr_ctx->orig_nr_secs;
+	kfree(pr_ctx);
+	rqd->bio = NULL;
+
+	bio_endio(bio);
+	__pblk_end_io_read(pblk, rqd, false);
+}
+
+/* Converts original ppa into ppa list of RAIL reads */
+static int pblk_rail_setup_ppas(struct pblk *pblk, struct ppa_addr ppa,
+				struct ppa_addr *rail_ppas,
+				unsigned char *pvalid, int *nr_rail_ppas,
+				int *rail_reads)
+{
+	struct nvm_tgt_dev *dev = pblk->dev;
+	struct nvm_geo *geo = &dev->geo;
+	struct ppa_addr rail_ppa = ppa;
+	unsigned int lun_pos = pblk_ppa_to_pos(geo, ppa);
+	unsigned int strides = pblk_rail_nr_parity_luns(pblk);
+	struct pblk_line *line;
+	unsigned int i;
+	int ppas = *nr_rail_ppas;
+	int valid = 0;
+
+	for (i = 1; i < PBLK_RAIL_STRIDE_WIDTH; i++) {
+		unsigned int neighbor, lun, chnl;
+		int laddr;
+
+		neighbor = pblk_rail_wrap_lun(pblk, lun_pos + i * strides);
+
+		lun = pblk_pos_to_lun(geo, neighbor);
+		chnl = pblk_pos_to_chnl(geo, neighbor);
+		pblk_dev_ppa_set_lun(&rail_ppa, lun);
+		pblk_dev_ppa_set_chnl(&rail_ppa, chnl);
+
+		line = pblk_ppa_to_line(pblk, rail_ppa);
+		laddr = pblk_dev_ppa_to_line_addr(pblk, rail_ppa);
+
+		/* Do not read from bad blocks */
+		if (!pblk_rail_valid_sector(pblk, line, laddr)) {
+			/* Perform regular read if parity sector is bad */
+			if (neighbor >= pblk_rail_nr_data_luns(pblk))
+				return 0;
+
+			/* If any other neighbor is bad we can just skip it */
+			continue;
+		}
+
+		rail_ppas[ppas++] = rail_ppa;
+		valid++;
+	}
+
+	if (valid == 1)
+		return 0;
+
+	*pvalid = valid;
+	*nr_rail_ppas = ppas;
+	(*rail_reads)++;
+	return 1;
+}
+
+static void pblk_rail_set_bitmap(struct pblk *pblk, struct ppa_addr *ppa_list,
+				 int ppa, struct ppa_addr *rail_ppa_list,
+				 int *nr_rail_ppas, unsigned long *read_bitmap,
+				 unsigned long *pvalid, int *rail_reads)
+{
+	unsigned char valid;
+
+	if (test_bit(ppa, read_bitmap))
+		return;
+
+	if (pblk_rail_lun_busy(pblk, ppa_list[ppa]) &&
+	    pblk_rail_setup_ppas(pblk, ppa_list[ppa],
+				 rail_ppa_list, &valid,
+				 nr_rail_ppas, rail_reads)) {
+		WARN_ON(test_and_set_bit(ppa, read_bitmap));
+		bitmap_set(pvalid, ppa * PBLK_RAIL_STRIDE_WIDTH, valid);
+	} else {
+		rail_ppa_list[(*nr_rail_ppas)++] = ppa_list[ppa];
+		bitmap_set(pvalid, ppa * PBLK_RAIL_STRIDE_WIDTH, 1);
+	}
+}
+
+int pblk_rail_read_bio(struct pblk *pblk, struct nvm_rq *rqd, int blba,
+		       unsigned long *read_bitmap, int bio_init_idx,
+		       struct bio **bio)
+{
+	struct pblk_g_ctx *r_ctx = nvm_rq_to_pdu(rqd);
+	struct pblk_pr_ctx *pr_ctx;
+	struct ppa_addr rail_ppa_list[NVM_MAX_VLBA];
+	DECLARE_BITMAP(pvalid, PR_BITMAP_SIZE);
+	int nr_secs = rqd->nr_ppas;
+	bool read_empty = bitmap_empty(read_bitmap, nr_secs);
+	int nr_rail_ppas = 0, rail_reads = 0;
+	int i;
+	int ret;
+
+	/* Fully cached reads should not enter this path */
+	WARN_ON(bitmap_full(read_bitmap, nr_secs));
+
+	bitmap_zero(pvalid, PR_BITMAP_SIZE);
+	if (rqd->nr_ppas == 1) {
+		pblk_rail_set_bitmap(pblk, &rqd->ppa_addr, 0, rail_ppa_list,
+				     &nr_rail_ppas, read_bitmap, pvalid,
+				     &rail_reads);
+
+		if (nr_rail_ppas == 1) {
+			memcpy(&rqd->ppa_addr, rail_ppa_list,
+			       nr_rail_ppas * sizeof(struct ppa_addr));
+		} else {
+			rqd->ppa_list = rqd->meta_list + pblk_dma_meta_size;
+			rqd->dma_ppa_list = rqd->dma_meta_list +
+			  pblk_dma_meta_size;
+			memcpy(rqd->ppa_list, rail_ppa_list,
+			       nr_rail_ppas * sizeof(struct ppa_addr));
+		}
+	} else {
+		for (i = 0; i < rqd->nr_ppas; i++) {
+			pblk_rail_set_bitmap(pblk, rqd->ppa_list, i,
+					     rail_ppa_list, &nr_rail_ppas,
+					     read_bitmap, pvalid, &rail_reads);
+
+			/* Don't split if this is the last ppa of the rqd */
+			if (((nr_rail_ppas + PBLK_RAIL_STRIDE_WIDTH) >=
+			     NVM_MAX_VLBA) && (i + 1 < rqd->nr_ppas)) {
+				struct pblk_g_ctx *r_ctx = nvm_rq_to_pdu(rqd);
+
+				pblk_rail_bio_split(pblk, bio, i + 1);
+				rqd->nr_ppas = pblk_get_secs(*bio);
+				r_ctx->private = *bio;
+				break;
+			}
+		}
+		memcpy(rqd->ppa_list, rail_ppa_list,
+		       nr_rail_ppas * sizeof(struct ppa_addr));
+	}
+
+	if (bitmap_empty(read_bitmap, rqd->nr_ppas))
+		return NVM_IO_REQUEUE;
+
+	if (read_empty && !bitmap_empty(read_bitmap, rqd->nr_ppas))
+		bio_advance(*bio, (rqd->nr_ppas) * PBLK_EXPOSED_PAGE_SIZE);
+
+	if (pblk_setup_partial_read(pblk, rqd, bio_init_idx, read_bitmap,
+				    nr_rail_ppas))
+		return NVM_IO_ERR;
+
+	rqd->end_io = pblk_rail_end_io_read;
+	pr_ctx = r_ctx->private;
+	bitmap_copy(pr_ctx->bitmap, pvalid, PR_BITMAP_SIZE);
+
+	ret = pblk_submit_io(pblk, rqd);
+	if (ret) {
+		bio_put(rqd->bio);
+		pr_err("pblk: partial RAIL read IO submission failed\n");
+		/* Free allocated pages in new bio */
+		pblk_bio_free_pages(pblk, rqd->bio, 0, rqd->bio->bi_vcnt);
+		kfree(pr_ctx);
+		__pblk_end_io_read(pblk, rqd, false);
+		return NVM_IO_ERR;
+	}
+
+	return NVM_IO_OK;
+}
diff --git a/drivers/lightnvm/pblk.h b/drivers/lightnvm/pblk.h
index bd88784e51d9..01fe4362b27e 100644
--- a/drivers/lightnvm/pblk.h
+++ b/drivers/lightnvm/pblk.h
@@ -28,6 +28,7 @@
 #include <linux/vmalloc.h>
 #include <linux/crc32.h>
 #include <linux/uuid.h>
+#include <linux/log2.h>
 
 #include <linux/lightnvm.h>
 
@@ -45,7 +46,7 @@
 #define PBLK_COMMAND_TIMEOUT_MS 30000
 
 /* Max 512 LUNs per device */
-#define PBLK_MAX_LUNS_BITMAP (4)
+#define PBLK_MAX_LUNS_BITMAP (512)
 
 #define NR_PHY_IN_LOG (PBLK_EXPOSED_PAGE_SIZE / PBLK_SECTOR)
 
@@ -123,6 +124,13 @@ struct pblk_g_ctx {
 	u64 lba;
 };
 
+#ifdef CONFIG_NVM_PBLK_RAIL
+#define PBLK_RAIL_STRIDE_WIDTH 4
+#define PR_BITMAP_SIZE (NVM_MAX_VLBA * PBLK_RAIL_STRIDE_WIDTH)
+#else
+#define PR_BITMAP_SIZE NVM_MAX_VLBA
+#endif
+
 /* partial read context */
 struct pblk_pr_ctx {
 	struct bio *orig_bio;
@@ -604,6 +612,39 @@ struct pblk_addrf {
 	int sec_ws_stripe;
 };
 
+#ifdef CONFIG_NVM_PBLK_RAIL
+
+struct p2b_entry {
+	int pos;
+	int nr_valid;
+};
+
+struct pblk_rail {
+	struct p2b_entry **p2b;         /* Maps RAIL sectors to rb pos */
+	struct page *pages;             /* Pages to hold parity writes */
+	void **data;                    /* Buffer that holds parity pages */
+	DECLARE_BITMAP(busy_bitmap, PBLK_MAX_LUNS_BITMAP);
+	u64 *lba;                       /* Buffer to compute LBA parity */
+};
+
+/* Initialize and tear down RAIL */
+int pblk_rail_init(struct pblk *pblk);
+void pblk_rail_free(struct pblk *pblk);
+/* Adjust some system parameters */
+bool pblk_rail_meta_distance(struct pblk_line *data_line);
+int pblk_rail_rb_delay(struct pblk_rb *rb);
+/* Core */
+void pblk_rail_line_close(struct pblk *pblk, struct pblk_line *line);
+int pblk_rail_down_stride(struct pblk *pblk, int lun, int timeout);
+void pblk_rail_up_stride(struct pblk *pblk, int lun);
+/* Write path */
+int pblk_rail_submit_write(struct pblk *pblk);
+/* Read Path */
+int pblk_rail_read_bio(struct pblk *pblk, struct nvm_rq *rqd, int blba,
+		       unsigned long *read_bitmap, int bio_init_idx,
+		       struct bio **bio);
+#endif /* CONFIG_NVM_PBLK_RAIL */
+
 typedef int (pblk_map_page_fn)(struct pblk *pblk, unsigned int sentry,
 			       struct ppa_addr *ppa_list,
 			       unsigned long *lun_bitmap,
@@ -1115,6 +1156,26 @@ static inline u64 pblk_dev_ppa_to_line_addr(struct pblk *pblk,
 	return paddr;
 }
 
+static inline int pblk_pos_to_lun(struct nvm_geo *geo, int pos)
+{
+	return pos >> ilog2(geo->num_ch);
+}
+
+static inline int pblk_pos_to_chnl(struct nvm_geo *geo, int pos)
+{
+	return pos % geo->num_ch;
+}
+
+static inline void pblk_dev_ppa_set_lun(struct ppa_addr *p, int lun)
+{
+	p->a.lun = lun;
+}
+
+static inline void pblk_dev_ppa_set_chnl(struct ppa_addr *p, int chnl)
+{
+	p->a.ch = chnl;
+}
+
 static inline struct ppa_addr pblk_ppa32_to_ppa64(struct pblk *pblk, u32 ppa32)
 {
 	struct nvm_tgt_dev *dev = pblk->dev;
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [RFC PATCH 6/6] lightnvm: pblk: Integrate RAIL
  2018-09-17  5:29 [RFC PATCH 0/6] lightnvm: pblk: Introduce RAIL to enforce low tail read latency Heiner Litz
                   ` (4 preceding siblings ...)
  2018-09-17  5:29 ` [RFC PATCH 5/6] lightnvm: pblk: Add RAIL interface Heiner Litz
@ 2018-09-17  5:29 ` Heiner Litz
  2018-09-18 11:38   ` Hans Holmberg
  2018-09-18 11:46 ` [RFC PATCH 0/6] lightnvm: pblk: Introduce RAIL to enforce low tail read latency Hans Holmberg
  6 siblings, 1 reply; 17+ messages in thread
From: Heiner Litz @ 2018-09-17  5:29 UTC (permalink / raw)
  To: linux-block, hlitz; +Cc: javier, mb, igor.j.konopko, marcin.dziegielewski

Integrate Redundant Array of Independent Luns (RAIL) into lightnvm. RAIL
enforces low tail read latency by guaranteeing that reads are never
serialized behind writes and erases to the same LUN. Whenever a LUN is
serving a high-latency operation, reads are served by recomputing the
original data from redundant parity information.
RAIL trades capacity (redundancy) for read latency; the redundancy can,
however, also be leveraged for fault tolerance.
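
As an illustration only (not part of the diff below), the recomputation on the
read path amounts to XOR-ing the remaining members of a stride, in the spirit
of pblk_rail_data_parity() added in the previous patch; a hypothetical helper
would look roughly like:

	/* Sketch: rebuild the sector behind a busy LUN by XOR-ing the other
	 * (PBLK_RAIL_STRIDE_WIDTH - 1) sectors of its stride.
	 */
	static void rail_reconstruct(void *dst, void **stride_secs, int nr_secs)
	{
		unsigned int w;
		int i;

		memset(dst, 0, PBLK_EXPOSED_PAGE_SIZE);
		for (i = 0; i < nr_secs; i++)
			for (w = 0; w < PBLK_EXPOSED_PAGE_SIZE / sizeof(unsigned long); w++)
				((unsigned long *)dst)[w] ^=
					((unsigned long *)stride_secs[i])[w];
	}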

On FIO, with the kyber scheduler set to a target read latency of 500us,
RAIL reduces tail latency percentiles (us) as follows:

       Avg    90%    99%     99.9%  99.95%  99.99%
       pblk   90     1000    2200   3000    6000
       RAIL   85     100     250    400     500

Signed-off-by: Heiner Litz <hlitz@ucsc.edu>
---
 drivers/lightnvm/Kconfig      | 10 ++++++++++
 drivers/lightnvm/Makefile     |  1 +
 drivers/lightnvm/pblk-core.c  | 36 ++++++++++++++++++++++++++++++++++-
 drivers/lightnvm/pblk-init.c  | 17 +++++++++++++++++
 drivers/lightnvm/pblk-rail.c  |  1 +
 drivers/lightnvm/pblk-rb.c    |  6 ++++++
 drivers/lightnvm/pblk-read.c  |  9 +++++++++
 drivers/lightnvm/pblk-write.c |  9 +++++++++
 drivers/lightnvm/pblk.h       |  5 +++++
 9 files changed, 93 insertions(+), 1 deletion(-)

diff --git a/drivers/lightnvm/Kconfig b/drivers/lightnvm/Kconfig
index a872cd720967..165d5a29acc3 100644
--- a/drivers/lightnvm/Kconfig
+++ b/drivers/lightnvm/Kconfig
@@ -35,6 +35,16 @@ config NVM_PBLK_DEBUG
 	  vocal error messages, and extra tracking fields in the pblk sysfs
 	  entries.
 
+config NVM_PBLK_RAIL
+       bool "Pblk RAIL Support"
+       default n
+       help
+         Enables RAIL for pblk. RAIL enforces tail read latency guarantees by
+	 eliminiating reads being serialized behind writes to the same LUN.
+	 eliminating reads being serialized behind writes to the same LUN.
+	 stride is written at a time. Reads can bypass busy LUNs by recompting
+	 stride is written at a time. Reads can bypass busy LUNs by recomputing
+
 endif # NVM_PBLK_DEBUG
 
 endif # NVM
diff --git a/drivers/lightnvm/Makefile b/drivers/lightnvm/Makefile
index 97d9d7c71550..92f4376428cc 100644
--- a/drivers/lightnvm/Makefile
+++ b/drivers/lightnvm/Makefile
@@ -5,6 +5,7 @@
 
 obj-$(CONFIG_NVM)		:= core.o
 obj-$(CONFIG_NVM_PBLK)		+= pblk.o
+obj-$(CONFIG_NVM_PBLK_RAIL)	+= pblk-rail.o
 pblk-y				:= pblk-init.o pblk-core.o pblk-rb.o \
 				   pblk-write.o pblk-cache.o pblk-read.o \
 				   pblk-gc.o pblk-recovery.o pblk-map.o \
diff --git a/drivers/lightnvm/pblk-core.c b/drivers/lightnvm/pblk-core.c
index a31bf359f905..ca74d7763fa9 100644
--- a/drivers/lightnvm/pblk-core.c
+++ b/drivers/lightnvm/pblk-core.c
@@ -113,6 +113,12 @@ static void pblk_end_io_erase(struct nvm_rq *rqd)
 {
 	struct pblk *pblk = rqd->private;
 
+#ifdef CONFIG_NVM_PBLK_RAIL
+	struct ppa_addr *ppa_list = nvm_rq_to_ppa_list(rqd);
+
+	pblk_up_chunk(pblk, ppa_list[0]);
+#endif
+
 	__pblk_end_io_erase(pblk, rqd);
 	mempool_free(rqd, &pblk->e_rq_pool);
 }
@@ -940,7 +946,11 @@ static int pblk_blk_erase_sync(struct pblk *pblk, struct ppa_addr ppa)
 	/* The write thread schedules erases so that it minimizes disturbances
 	 * with writes. Thus, there is no need to take the LUN semaphore.
 	 */
+#ifdef CONFIG_NVM_PBLK_RAIL
+	ret = pblk_submit_io_sync_sem(pblk, &rqd);
+#else
 	ret = pblk_submit_io_sync(pblk, &rqd);
+#endif
 	rqd.private = pblk;
 	__pblk_end_io_erase(pblk, &rqd);
 
@@ -1754,7 +1764,11 @@ int pblk_blk_erase_async(struct pblk *pblk, struct ppa_addr ppa)
 	/* The write thread schedules erases so that it minimizes disturbances
 	 * with writes. Thus, there is no need to take the LUN semaphore.
 	 */
+#ifdef CONFIG_NVM_PBLK_RAIL
+	err = pblk_submit_io_sem(pblk, rqd);
+#else
 	err = pblk_submit_io(pblk, rqd);
+#endif
 	if (err) {
 		struct nvm_tgt_dev *dev = pblk->dev;
 		struct nvm_geo *geo = &dev->geo;
@@ -1909,6 +1923,10 @@ void pblk_line_close_ws(struct work_struct *work)
 	if (w_err_gc->has_write_err)
 		pblk_save_lba_list(pblk, line);
 
+#ifdef CONFIG_NVM_PBLK_RAIL
+	pblk_rail_line_close(pblk, line);
+#endif
+
 	pblk_line_close(pblk, line);
 	mempool_free(line_ws, &pblk->gen_ws_pool);
 }
@@ -1938,8 +1956,12 @@ static void __pblk_down_chunk(struct pblk *pblk, int pos)
 	 * Only send one inflight I/O per LUN. Since we map at a page
 	 * granurality, all ppas in the I/O will map to the same LUN
 	 */
-
+#ifdef CONFIG_NVM_PBLK_RAIL
+	(void)rlun;
+	ret = pblk_rail_down_stride(pblk, pos, msecs_to_jiffies(30000));
+#else
 	ret = down_timeout(&rlun->wr_sem, msecs_to_jiffies(30000));
+#endif
 	if (ret == -ETIME || ret == -EINTR)
 		pblk_err(pblk, "taking lun semaphore timed out: err %d\n",
 				-ret);
@@ -1978,7 +2000,13 @@ void pblk_up_chunk(struct pblk *pblk, struct ppa_addr ppa)
 	int pos = pblk_ppa_to_pos(geo, ppa);
 
 	rlun = &pblk->luns[pos];
+
+#ifdef CONFIG_NVM_PBLK_RAIL
+	pblk_rail_up_stride(pblk, pos);
+#else
 	up(&rlun->wr_sem);
+#endif
+
 }
 
 void pblk_up_rq(struct pblk *pblk, unsigned long *lun_bitmap)
@@ -1991,7 +2019,13 @@ void pblk_up_rq(struct pblk *pblk, unsigned long *lun_bitmap)
 
 	while ((bit = find_next_bit(lun_bitmap, num_lun, bit + 1)) < num_lun) {
 		rlun = &pblk->luns[bit];
+
+#ifdef CONFIG_NVM_PBLK_RAIL
+		pblk_rail_up_stride(pblk, bit);
+#else
 		up(&rlun->wr_sem);
+#endif
+
 	}
 }
 
diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c
index 2b9c6ebd9fac..3e8255c8873f 100644
--- a/drivers/lightnvm/pblk-init.c
+++ b/drivers/lightnvm/pblk-init.c
@@ -1050,6 +1050,7 @@ static int pblk_lines_init(struct pblk *pblk)
 	kfree(pblk->lines);
 fail_free_chunk_meta:
 	kfree(chunk_meta);
+
 fail_free_luns:
 	kfree(pblk->luns);
 fail_free_meta:
@@ -1108,6 +1109,11 @@ static void pblk_tear_down(struct pblk *pblk, bool graceful)
 		__pblk_pipeline_flush(pblk);
 	__pblk_pipeline_stop(pblk);
 	pblk_writer_stop(pblk);
+
+#ifdef CONFIG_NVM_PBLK_RAIL
+	pblk_rail_free(pblk);
+#endif
+
 	pblk_rb_sync_l2p(&pblk->rwb);
 	pblk_rl_free(&pblk->rl);
 
@@ -1226,6 +1232,12 @@ static void *pblk_init(struct nvm_tgt_dev *dev, struct gendisk *tdisk,
 		goto fail_stop_writer;
 	}
 
+#ifdef CONFIG_NVM_PBLK_RAIL
+	ret = pblk_rail_init(pblk);
+	if (ret)
+		goto fail_free_gc;
+#endif
+
 	/* inherit the size from the underlying device */
 	blk_queue_logical_block_size(tqueue, queue_physical_block_size(bqueue));
 	blk_queue_max_hw_sectors(tqueue, queue_max_hw_sectors(bqueue));
@@ -1249,6 +1261,11 @@ static void *pblk_init(struct nvm_tgt_dev *dev, struct gendisk *tdisk,
 
 	return pblk;
 
+#ifdef CONFIG_NVM_PBLK_RAIL
+fail_free_gc:
+	pblk_gc_exit(pblk, false);
+#endif
+
 fail_stop_writer:
 	pblk_writer_stop(pblk);
 fail_free_l2p:
diff --git a/drivers/lightnvm/pblk-rail.c b/drivers/lightnvm/pblk-rail.c
index a48ed31a0ba9..619ff9689d29 100644
--- a/drivers/lightnvm/pblk-rail.c
+++ b/drivers/lightnvm/pblk-rail.c
@@ -1,3 +1,4 @@
+/* SPDX-License-Identifier: GPL-2.0 */
 /*
  * Copyright (C) 2018 Heiner Litz
  * Initial release: Heiner Litz <hlitz@ucsc.edu>
diff --git a/drivers/lightnvm/pblk-rb.c b/drivers/lightnvm/pblk-rb.c
index a7648e12f54f..b04462479fe3 100644
--- a/drivers/lightnvm/pblk-rb.c
+++ b/drivers/lightnvm/pblk-rb.c
@@ -389,8 +389,14 @@ static int __pblk_rb_may_write(struct pblk_rb *rb, unsigned int nr_entries,
 	sync = READ_ONCE(rb->sync);
 	mem = READ_ONCE(rb->mem);
 
+#ifdef CONFIG_NVM_PBLK_RAIL
+	if (pblk_rb_ring_space(rb, mem, sync, rb->nr_entries) <
+	    nr_entries + pblk_rail_rb_delay(rb))
+		return 0;
+#else
 	if (pblk_rb_ring_space(rb, mem, sync, rb->nr_entries) < nr_entries)
 		return 0;
+#endif
 
 	if (pblk_rb_update_l2p(rb, nr_entries, mem, sync))
 		return 0;
diff --git a/drivers/lightnvm/pblk-read.c b/drivers/lightnvm/pblk-read.c
index 67d44caefff4..a3f33503f60c 100644
--- a/drivers/lightnvm/pblk-read.c
+++ b/drivers/lightnvm/pblk-read.c
@@ -472,6 +472,15 @@ int pblk_submit_read(struct pblk *pblk, struct bio *bio)
 		return NVM_IO_DONE;
 	}
 
+#ifdef CONFIG_NVM_PBLK_RAIL
+	ret = pblk_rail_read_bio(pblk, rqd, blba, read_bitmap, bio_init_idx,
+				 &bio);
+	if (ret == NVM_IO_OK)
+		return ret;
+	if (ret == NVM_IO_ERR)
+		goto fail_end_io;
+#endif
+
 	/* All sectors are to be read from the device */
 	if (bitmap_empty(read_bitmap, rqd->nr_ppas)) {
 		struct bio *int_bio = NULL;
diff --git a/drivers/lightnvm/pblk-write.c b/drivers/lightnvm/pblk-write.c
index 6eba38b83acd..db42184cfba3 100644
--- a/drivers/lightnvm/pblk-write.c
+++ b/drivers/lightnvm/pblk-write.c
@@ -469,6 +469,11 @@ static inline bool pblk_valid_meta_ppa(struct pblk *pblk,
 				test_bit(pos_opt, data_line->blk_bitmap))
 		return true;
 
+#ifdef CONFIG_NVM_PBLK_RAIL
+	if (unlikely(pblk_rail_meta_distance(data_line)))
+		data_line->meta_distance--;
+#endif
+
 	if (unlikely(pblk_ppa_comp(ppa_opt, ppa)))
 		data_line->meta_distance--;
 
@@ -571,6 +576,10 @@ static int pblk_submit_write(struct pblk *pblk)
 	unsigned long pos;
 	unsigned int resubmit;
 
+#ifdef CONFIG_NVM_PBLK_RAIL
+	pblk_rail_submit_write(pblk);
+#endif
+
 	spin_lock(&pblk->resubmit_lock);
 	resubmit = !list_empty(&pblk->resubmit_list);
 	spin_unlock(&pblk->resubmit_lock);
diff --git a/drivers/lightnvm/pblk.h b/drivers/lightnvm/pblk.h
index 01fe4362b27e..9742524f74ea 100644
--- a/drivers/lightnvm/pblk.h
+++ b/drivers/lightnvm/pblk.h
@@ -758,6 +758,11 @@ struct pblk {
 	struct pblk_gc gc;
 
 	pblk_map_page_fn *map_page;
+
+#ifdef CONFIG_NVM_PBLK_RAIL
+	struct pblk_rail rail;
+#endif
+
 };
 
 struct pblk_line_ws {
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 5/6] lightnvm: pblk: Add RAIL interface
  2018-09-17  5:29 ` [RFC PATCH 5/6] lightnvm: pblk: Add RAIL interface Heiner Litz
@ 2018-09-18 11:28   ` Hans Holmberg
  2018-09-18 16:11     ` Heiner Litz
  0 siblings, 1 reply; 17+ messages in thread
From: Hans Holmberg @ 2018-09-18 11:28 UTC (permalink / raw)
  To: hlitz
  Cc: linux-block, Javier Gonzalez, Matias Bjorling, igor.j.konopko,
	marcin.dziegielewski

On Mon, Sep 17, 2018 at 7:30 AM Heiner Litz <hlitz@ucsc.edu> wrote:
>
> In preparation for supporting RAIL, add the RAIL API.
>
> Signed-off-by: Heiner Litz <hlitz@ucsc.edu>
> ---
>  drivers/lightnvm/pblk-rail.c | 808 +++++++++++++++++++++++++++++++++++
>  drivers/lightnvm/pblk.h      |  63 ++-
>  2 files changed, 870 insertions(+), 1 deletion(-)
>  create mode 100644 drivers/lightnvm/pblk-rail.c
>
> diff --git a/drivers/lightnvm/pblk-rail.c b/drivers/lightnvm/pblk-rail.c
> new file mode 100644
> index 000000000000..a48ed31a0ba9
> --- /dev/null
> +++ b/drivers/lightnvm/pblk-rail.c
> @@ -0,0 +1,808 @@
> +/*
> + * Copyright (C) 2018 Heiner Litz
> + * Initial release: Heiner Litz <hlitz@ucsc.edu>
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License version
> + * 2 as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful, but
> + * WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * General Public License for more details.
> + *
> + * pblk-rail.c - pblk's RAIL path
> + */
> +
> +#include "pblk.h"
> +
> +#define PBLK_RAIL_EMPTY ~0x0
This constant is not being used.

> +#define PBLK_RAIL_PARITY_WRITE 0x8000
Where does this magic number come from? Please document.

> +
> +/* RAIL auxiliary functions */
> +static unsigned int pblk_rail_nr_parity_luns(struct pblk *pblk)
> +{
> +       struct pblk_line_meta *lm = &pblk->lm;
> +
> +       return lm->blk_per_line / PBLK_RAIL_STRIDE_WIDTH;
> +}
> +
> +static unsigned int pblk_rail_nr_data_luns(struct pblk *pblk)
> +{
> +       struct pblk_line_meta *lm = &pblk->lm;
> +
> +       return lm->blk_per_line - pblk_rail_nr_parity_luns(pblk);
> +}
> +
> +static unsigned int pblk_rail_sec_per_stripe(struct pblk *pblk)
> +{
> +       struct pblk_line_meta *lm = &pblk->lm;
> +
> +       return lm->blk_per_line * pblk->min_write_pgs;
> +}
> +
> +static unsigned int pblk_rail_psec_per_stripe(struct pblk *pblk)
> +{
> +       return pblk_rail_nr_parity_luns(pblk) * pblk->min_write_pgs;
> +}
> +
> +static unsigned int pblk_rail_dsec_per_stripe(struct pblk *pblk)
> +{
> +       return pblk_rail_sec_per_stripe(pblk) - pblk_rail_psec_per_stripe(pblk);
> +}
> +
> +static unsigned int pblk_rail_wrap_lun(struct pblk *pblk, unsigned int lun)
> +{
> +       struct pblk_line_meta *lm = &pblk->lm;
> +
> +       return (lun & (lm->blk_per_line - 1));
> +}
> +
> +bool pblk_rail_meta_distance(struct pblk_line *data_line)
> +{
> +       return (data_line->meta_distance % PBLK_RAIL_STRIDE_WIDTH) == 0;
> +}
> +
> +/* Notify readers that LUN is serving high latency operation */
> +static void pblk_rail_notify_reader_down(struct pblk *pblk, int lun)
> +{
> +       WARN_ON(test_and_set_bit(lun, pblk->rail.busy_bitmap));
> +       /* Make sure that busy bit is seen by reader before proceeding */
> +       smp_mb__after_atomic();
> +}
> +
> +static void pblk_rail_notify_reader_up(struct pblk *pblk, int lun)
> +{
> +       /* Make sure that write is completed before releasing busy bit */
> +       smp_mb__before_atomic();
> +       WARN_ON(!test_and_clear_bit(lun, pblk->rail.busy_bitmap));
> +}
> +
> +int pblk_rail_lun_busy(struct pblk *pblk, struct ppa_addr ppa)
> +{
> +       struct nvm_tgt_dev *dev = pblk->dev;
> +       struct nvm_geo *geo = &dev->geo;
> +       int lun_pos = pblk_ppa_to_pos(geo, ppa);
> +
> +       return test_bit(lun_pos, pblk->rail.busy_bitmap);
> +}
> +
> +/* Enforces one writer per stride */
> +int pblk_rail_down_stride(struct pblk *pblk, int lun_pos, int timeout)
> +{
> +       struct pblk_lun *rlun;
> +       int strides = pblk_rail_nr_parity_luns(pblk);
> +       int stride = lun_pos % strides;
> +       int ret;
> +
> +       rlun = &pblk->luns[stride];
> +       ret = down_timeout(&rlun->wr_sem, timeout);
> +       pblk_rail_notify_reader_down(pblk, lun_pos);
> +
> +       return ret;
> +}
> +
> +void pblk_rail_up_stride(struct pblk *pblk, int lun_pos)
> +{
> +       struct pblk_lun *rlun;
> +       int strides = pblk_rail_nr_parity_luns(pblk);
> +       int stride = lun_pos % strides;
> +
> +       pblk_rail_notify_reader_up(pblk, lun_pos);
> +       rlun = &pblk->luns[stride];
> +       up(&rlun->wr_sem);
> +}
> +
> +/* Determine whether a sector holds data, meta or is bad*/
> +bool pblk_rail_valid_sector(struct pblk *pblk, struct pblk_line *line, int pos)
> +{
> +       struct pblk_line_meta *lm = &pblk->lm;
> +       struct nvm_tgt_dev *dev = pblk->dev;
> +       struct nvm_geo *geo = &dev->geo;
> +       struct ppa_addr ppa;
> +       int lun;
> +
> +       if (pos >= line->smeta_ssec && pos < (line->smeta_ssec + lm->smeta_sec))
> +               return false;
> +
> +       if (pos >= line->emeta_ssec &&
> +           pos < (line->emeta_ssec + lm->emeta_sec[0]))
> +               return false;
> +
> +       ppa = addr_to_gen_ppa(pblk, pos, line->id);
> +       lun = pblk_ppa_to_pos(geo, ppa);
> +
> +       return !test_bit(lun, line->blk_bitmap);
> +}
> +
> +/* Delay rb overwrite until whole stride has been written */
> +int pblk_rail_rb_delay(struct pblk_rb *rb)
> +{
> +       struct pblk *pblk = container_of(rb, struct pblk, rwb);
> +
> +       return pblk_rail_sec_per_stripe(pblk);
> +}
> +
> +static unsigned int pblk_rail_sec_to_stride(struct pblk *pblk, unsigned int sec)
> +{
> +       unsigned int sec_in_stripe = sec % pblk_rail_sec_per_stripe(pblk);
> +       int page = sec_in_stripe / pblk->min_write_pgs;
> +
> +       return page % pblk_rail_nr_parity_luns(pblk);
> +}
> +
> +static unsigned int pblk_rail_sec_to_idx(struct pblk *pblk, unsigned int sec)
> +{
> +       unsigned int sec_in_stripe = sec % pblk_rail_sec_per_stripe(pblk);
> +
> +       return sec_in_stripe / pblk_rail_psec_per_stripe(pblk);
> +}
> +
> +static void pblk_rail_data_parity(void *dest, void *src)
> +{
> +       unsigned int i;
> +
> +       for (i = 0; i < PBLK_EXPOSED_PAGE_SIZE / sizeof(unsigned long); i++)
> +               ((unsigned long *)dest)[i] ^= ((unsigned long *)src)[i];
> +}
> +
> +static void pblk_rail_lba_parity(u64 *dest, u64 *src)
> +{
> +       *dest ^= *src;
> +}
> +
> +/* Tracks where a sector is located in the rwb */
> +void pblk_rail_track_sec(struct pblk *pblk, struct pblk_line *line, int cur_sec,
> +                        int sentry, int nr_valid)
> +{
> +       int stride, idx, pos;
> +
> +       stride = pblk_rail_sec_to_stride(pblk, cur_sec);
> +       idx = pblk_rail_sec_to_idx(pblk, cur_sec);
> +       pos = pblk_rb_wrap_pos(&pblk->rwb, sentry);
> +       pblk->rail.p2b[stride][idx].pos = pos;
> +       pblk->rail.p2b[stride][idx].nr_valid = nr_valid;
> +}
> +
> +/* RAIL's sector mapping function */
> +static void pblk_rail_map_sec(struct pblk *pblk, struct pblk_line *line,
> +                             int sentry, struct pblk_sec_meta *meta_list,
> +                             __le64 *lba_list, struct ppa_addr ppa)
> +{
> +       struct pblk_w_ctx *w_ctx;
> +       __le64 addr_empty = cpu_to_le64(ADDR_EMPTY);
> +
> +       kref_get(&line->ref);
> +
> +       if (sentry & PBLK_RAIL_PARITY_WRITE) {
> +               u64 *lba;
> +
> +               sentry &= ~PBLK_RAIL_PARITY_WRITE;
> +               lba = &pblk->rail.lba[sentry];
> +               meta_list->lba = cpu_to_le64(*lba);
> +               *lba_list = cpu_to_le64(*lba);
> +               line->nr_valid_lbas++;
> +       } else {
> +               w_ctx = pblk_rb_w_ctx(&pblk->rwb, sentry);
> +               w_ctx->ppa = ppa;
> +               meta_list->lba = cpu_to_le64(w_ctx->lba);
> +               *lba_list = cpu_to_le64(w_ctx->lba);
> +
> +               if (*lba_list != addr_empty)
> +                       line->nr_valid_lbas++;
> +               else
> +                       atomic64_inc(&pblk->pad_wa);
> +       }
> +}
> +
> +int pblk_rail_map_page_data(struct pblk *pblk, unsigned int sentry,
> +                           struct ppa_addr *ppa_list,
> +                           unsigned long *lun_bitmap,
> +                           struct pblk_sec_meta *meta_list,
> +                           unsigned int valid_secs)
> +{
> +       struct pblk_line *line = pblk_line_get_data(pblk);
> +       struct pblk_emeta *emeta;
> +       __le64 *lba_list;
> +       u64 paddr;
> +       int nr_secs = pblk->min_write_pgs;
> +       int i;
> +
> +       if (pblk_line_is_full(line)) {
> +               struct pblk_line *prev_line = line;
> +
> +               /* If we cannot allocate a new line, make sure to store metadata
> +                * on current line and then fail
> +                */
> +               line = pblk_line_replace_data(pblk);
> +               pblk_line_close_meta(pblk, prev_line);
> +
> +               if (!line)
> +                       return -EINTR;
> +       }
> +
> +       emeta = line->emeta;
> +       lba_list = emeta_to_lbas(pblk, emeta->buf);
> +
> +       paddr = pblk_alloc_page(pblk, line, nr_secs);
> +
> +       pblk_rail_track_sec(pblk, line, paddr, sentry, valid_secs);
> +
> +       for (i = 0; i < nr_secs; i++, paddr++) {
> +               __le64 addr_empty = cpu_to_le64(ADDR_EMPTY);
> +
> +               /* ppa to be sent to the device */
> +               ppa_list[i] = addr_to_gen_ppa(pblk, paddr, line->id);
> +
> +               /* Write context for target bio completion on write buffer. Note
> +                * that the write buffer is protected by the sync backpointer,
> +                * and a single writer thread have access to each specific entry
> +                * at a time. Thus, it is safe to modify the context for the
> +                * entry we are setting up for submission without taking any
> +                * lock or memory barrier.
> +                */
> +               if (i < valid_secs) {
> +                       pblk_rail_map_sec(pblk, line, sentry + i, &meta_list[i],
> +                                         &lba_list[paddr], ppa_list[i]);
> +               } else {
> +                       lba_list[paddr] = meta_list[i].lba = addr_empty;
> +                       __pblk_map_invalidate(pblk, line, paddr);
> +               }
> +       }
> +
> +       pblk_down_rq(pblk, ppa_list[0], lun_bitmap);
> +       return 0;
> +}

This is a lot of duplication of code from the "normal" pblk map
function -  could you refactor to avoid this?

> +
> +/* RAIL Initialization and tear down */
> +int pblk_rail_init(struct pblk *pblk)
> +{
> +       struct pblk_line_meta *lm = &pblk->lm;
> +       int i, p2be;
> +       unsigned int nr_strides;
> +       unsigned int psecs;
> +       void *kaddr;
> +
> +       if (!PBLK_RAIL_STRIDE_WIDTH)
> +               return 0;
> +
> +       if (((lm->blk_per_line % PBLK_RAIL_STRIDE_WIDTH) != 0) ||
> +           (lm->blk_per_line < PBLK_RAIL_STRIDE_WIDTH)) {
> +               pr_err("pblk: unsupported RAIL stride %i\n", lm->blk_per_line);
> +               return -EINVAL;
> +       }

This is just a check of the maximum blocks per line - bad blocks will
reduce the number of writable blocks. What happens when a line goes
below PBLK_RAIL_STRIDE_WIDTH writable blocks?

> +
> +       psecs = pblk_rail_psec_per_stripe(pblk);
> +       nr_strides = pblk_rail_sec_per_stripe(pblk) / PBLK_RAIL_STRIDE_WIDTH;
> +
> +       pblk->rail.p2b = kmalloc_array(nr_strides, sizeof(struct p2b_entry *),
> +                                      GFP_KERNEL);
> +       if (!pblk->rail.p2b)
> +               return -ENOMEM;
> +
> +       for (p2be = 0; p2be < nr_strides; p2be++) {
> +               pblk->rail.p2b[p2be] = kmalloc_array(PBLK_RAIL_STRIDE_WIDTH - 1,
> +                                              sizeof(struct p2b_entry),
> +                                              GFP_KERNEL);
> +               if (!pblk->rail.p2b[p2be])
> +                       goto free_p2b_entries;
> +       }
> +
> +       pblk->rail.data = kmalloc(psecs * sizeof(void *), GFP_KERNEL);
> +       if (!pblk->rail.data)
> +               goto free_p2b_entries;
> +
> +       pblk->rail.pages = alloc_pages(GFP_KERNEL, get_count_order(psecs));
> +       if (!pblk->rail.pages)
> +               goto free_data;
> +
> +       kaddr = page_address(pblk->rail.pages);
> +       for (i = 0; i < psecs; i++)
> +               pblk->rail.data[i] = kaddr + i * PBLK_EXPOSED_PAGE_SIZE;
> +
> +       pblk->rail.lba = kmalloc_array(psecs, sizeof(u64 *), GFP_KERNEL);
> +       if (!pblk->rail.lba)
> +               goto free_pages;
> +
> +       /* Subtract parity bits from device capacity */
> +       pblk->capacity = pblk->capacity * (PBLK_RAIL_STRIDE_WIDTH - 1) /
> +               PBLK_RAIL_STRIDE_WIDTH;
> +
> +       pblk->map_page = pblk_rail_map_page_data;
> +
> +       return 0;
> +
> +free_pages:
> +       free_pages((unsigned long)page_address(pblk->rail.pages),
> +                  get_count_order(psecs));
> +free_data:
> +       kfree(pblk->rail.data);
> +free_p2b_entries:
> +       for (p2be = p2be - 1; p2be >= 0; p2be--)
> +               kfree(pblk->rail.p2b[p2be]);
> +       kfree(pblk->rail.p2b);
> +
> +       return -ENOMEM;
> +}
> +
> +void pblk_rail_free(struct pblk *pblk)
> +{
> +       unsigned int i;
> +       unsigned int nr_strides;
> +       unsigned int psecs;
> +
> +       psecs = pblk_rail_psec_per_stripe(pblk);
> +       nr_strides = pblk_rail_sec_per_stripe(pblk) / PBLK_RAIL_STRIDE_WIDTH;
> +
> +       kfree(pblk->rail.lba);
> +       free_pages((unsigned long)page_address(pblk->rail.pages),
> +                  get_count_order(psecs));
> +       kfree(pblk->rail.data);
> +       for (i = 0; i < nr_strides; i++)
> +               kfree(pblk->rail.p2b[i]);
> +       kfree(pblk->rail.p2b);
> +}
> +
> +/* PBLK supports 64 ppas max. By performing RAIL reads, a sector is read using
> + * multiple ppas which can lead to violation of the 64 ppa limit. In this case,
> + * split the bio
> + */
> +static void pblk_rail_bio_split(struct pblk *pblk, struct bio **bio, int sec)
> +{
> +       struct nvm_tgt_dev *dev = pblk->dev;
> +       struct bio *split;
> +
> +       sec *= (dev->geo.csecs >> 9);
> +
> +       split = bio_split(*bio, sec, GFP_KERNEL, &pblk_bio_set);
> +       /* there isn't chance to merge the split bio */
> +       split->bi_opf |= REQ_NOMERGE;
> +       bio_set_flag(*bio, BIO_QUEUE_ENTERED);
> +       bio_chain(split, *bio);
> +       generic_make_request(*bio);
> +       *bio = split;
> +}
> +
> +/* RAIL's Write Path */
> +static int pblk_rail_sched_parity(struct pblk *pblk)
> +{
> +       struct pblk_line *line = pblk_line_get_data(pblk);
> +       unsigned int sec_in_stripe;
> +
> +       while (1) {
> +               sec_in_stripe = line->cur_sec % pblk_rail_sec_per_stripe(pblk);
> +
> +               /* Schedule parity write at end of data section */
> +               if (sec_in_stripe >= pblk_rail_dsec_per_stripe(pblk))
> +                       return 1;
> +
> +               /* Skip bad blocks and meta sectors until we find a valid sec */
> +               if (test_bit(line->cur_sec, line->map_bitmap))
> +                       line->cur_sec += pblk->min_write_pgs;
> +               else
> +                       break;
> +       }
> +
> +       return 0;
> +}
> +
> +/* Mark RAIL parity sectors as invalid sectors so they will be gc'ed */
> +void pblk_rail_line_close(struct pblk *pblk, struct pblk_line *line)
> +{
> +       int off, bit;
> +
> +       for (off = pblk_rail_dsec_per_stripe(pblk);
> +            off < pblk->lm.sec_per_line;
> +            off += pblk_rail_sec_per_stripe(pblk)) {
> +               for (bit = 0; bit < pblk_rail_psec_per_stripe(pblk); bit++)
> +                       set_bit(off + bit, line->invalid_bitmap);
> +       }
> +}
> +
> +void pblk_rail_end_io_write(struct nvm_rq *rqd)
> +{
> +       struct pblk *pblk = rqd->private;
> +       struct pblk_c_ctx *c_ctx = nvm_rq_to_pdu(rqd);
> +
> +       if (rqd->error) {
> +               pblk_log_write_err(pblk, rqd);
> +               return pblk_end_w_fail(pblk, rqd);

The write error recovery path relies on the sentry in c_ctx being an
index into the write buffer, so this won't work.

Additionally, if a write (data or parity) fails, the whole stride would
be broken and need to fall back on "normal" reads, right?
One solution could be to check line->w_err_gc->has_write_err on the read path.
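
A rough, untested sketch of that check (pblk_rail_read_ok() is just a
name I made up); pblk_rail_set_bitmap() could then call it instead of
pblk_rail_lun_busy() directly:

/* Only serve the read through RAIL if the stride's line is clean,
 * otherwise fall back to a regular read.
 */
static bool pblk_rail_read_ok(struct pblk *pblk, struct ppa_addr ppa)
{
        struct pblk_line *line = pblk_ppa_to_line(pblk, ppa);

        if (line->w_err_gc->has_write_err)
                return false;

        return pblk_rail_lun_busy(pblk, ppa);
}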

> +       }
> +#ifdef CONFIG_NVM_DEBUG
> +       else
> +               WARN_ONCE(rqd->bio->bi_status, "pblk: corrupted write error\n");
> +#endif
> +
> +       pblk_up_rq(pblk, c_ctx->lun_bitmap);
> +
> +       pblk_rq_to_line_put(pblk, rqd);
> +       bio_put(rqd->bio);
> +       pblk_free_rqd(pblk, rqd, PBLK_WRITE);
> +
> +       atomic_dec(&pblk->inflight_io);
> +}
> +
> +static int pblk_rail_read_to_bio(struct pblk *pblk, struct nvm_rq *rqd,
> +                         struct bio *bio, unsigned int stride,
> +                         unsigned int nr_secs, unsigned int paddr)
> +{
> +       struct pblk_c_ctx *c_ctx = nvm_rq_to_pdu(rqd);
> +       int sec, i;
> +       int nr_data = PBLK_RAIL_STRIDE_WIDTH - 1;
> +       struct pblk_line *line = pblk_line_get_data(pblk);
> +
> +       c_ctx->nr_valid = nr_secs;
> +       /* sentry indexes rail page buffer, instead of rwb */
> +       c_ctx->sentry = stride * pblk->min_write_pgs;
> +       c_ctx->sentry |= PBLK_RAIL_PARITY_WRITE;
> +
> +       for (sec = 0; sec < pblk->min_write_pgs; sec++) {
> +               void *pg_addr;
> +               struct page *page;
> +               u64 *lba;
> +
> +               lba = &pblk->rail.lba[stride * pblk->min_write_pgs + sec];
> +               pg_addr = pblk->rail.data[stride * pblk->min_write_pgs + sec];
> +               page = virt_to_page(pg_addr);
> +
> +               if (!page) {
> +                       pr_err("pblk: could not allocate RAIL bio page %p\n",
> +                              pg_addr);
> +                       return -NVM_IO_ERR;
> +               }
> +
> +               if (bio_add_page(bio, page, pblk->rwb.seg_size, 0) !=
> +                   pblk->rwb.seg_size) {
> +                       pr_err("pblk: could not add page to RAIL bio\n");
> +                       return -NVM_IO_ERR;
> +               }
> +
> +               *lba = 0;
> +               memset(pg_addr, 0, PBLK_EXPOSED_PAGE_SIZE);
> +
> +               for (i = 0; i < nr_data; i++) {
> +                       struct pblk_rb_entry *entry;
> +                       struct pblk_w_ctx *w_ctx;
> +                       u64 lba_src;
> +                       unsigned int pos;
> +                       unsigned int cur;
> +                       int distance = pblk_rail_psec_per_stripe(pblk);
> +
> +                       cur = paddr - distance * (nr_data - i) + sec;
> +
> +                       if (!pblk_rail_valid_sector(pblk, line, cur))
> +                               continue;
> +
> +                       pos = pblk->rail.p2b[stride][i].pos;
> +                       pos = pblk_rb_wrap_pos(&pblk->rwb, pos + sec);
> +                       entry = &pblk->rwb.entries[pos];
> +                       w_ctx = &entry->w_ctx;
> +                       lba_src = w_ctx->lba;
> +
> +                       if (sec < pblk->rail.p2b[stride][i].nr_valid &&
> +                           lba_src != ADDR_EMPTY) {
> +                               pblk_rail_data_parity(pg_addr, entry->data);
> +                               pblk_rail_lba_parity(lba, &lba_src);

What keeps the parity lba values from invalidating "real" data lbas
during recovery?

> +                       }
> +               }
> +       }
> +
> +       return 0;
> +}
> +
> +int pblk_rail_submit_write(struct pblk *pblk)
> +{
> +       int i;
> +       struct nvm_rq *rqd;
> +       struct bio *bio;
> +       struct pblk_line *line = pblk_line_get_data(pblk);
> +       int start, end, bb_offset;
> +       unsigned int stride = 0;
> +
> +       if (!pblk_rail_sched_parity(pblk))
> +               return 0;
> +
> +       start = line->cur_sec;
> +       bb_offset = start % pblk_rail_sec_per_stripe(pblk);
> +       end = start + pblk_rail_sec_per_stripe(pblk) - bb_offset;
> +
> +       for (i = start; i < end; i += pblk->min_write_pgs, stride++) {
> +               /* Do not generate parity in this slot if the sec is bad
> +                * or reserved for meta.
> +                * We check on the read path and perform a conventional
> +                * read, to avoid reading parity from the bad block
> +                */
> +               if (!pblk_rail_valid_sector(pblk, line, i))
> +                       continue;
> +
> +               rqd = pblk_alloc_rqd(pblk, PBLK_WRITE);
> +               if (IS_ERR(rqd)) {
> +                       pr_err("pblk: cannot allocate parity write req.\n");
> +                       return -ENOMEM;
> +               }
> +
> +               bio = bio_alloc(GFP_KERNEL, pblk->min_write_pgs);
> +               if (!bio) {
> +                       pr_err("pblk: cannot allocate parity write bio\n");
> +                       pblk_free_rqd(pblk, rqd, PBLK_WRITE);
> +                       return -ENOMEM;
> +               }
> +
> +               bio->bi_iter.bi_sector = 0; /* internal bio */
> +               bio_set_op_attrs(bio, REQ_OP_WRITE, 0);
> +               rqd->bio = bio;
> +
> +               pblk_rail_read_to_bio(pblk, rqd, bio, stride,
> +                                     pblk->min_write_pgs, i);
> +
> +               if (pblk_submit_io_set(pblk, rqd, pblk_rail_end_io_write)) {
> +                       bio_put(rqd->bio);
> +                       pblk_free_rqd(pblk, rqd, PBLK_WRITE);
> +
> +                       return -NVM_IO_ERR;
> +               }
> +       }
> +
> +       return 0;
> +}
> +
> +/* RAIL's Read Path */
> +static void pblk_rail_end_io_read(struct nvm_rq *rqd)
> +{
> +       struct pblk *pblk = rqd->private;
> +       struct pblk_g_ctx *r_ctx = nvm_rq_to_pdu(rqd);
> +       struct pblk_pr_ctx *pr_ctx = r_ctx->private;
> +       struct bio *new_bio = rqd->bio;
> +       struct bio *bio = pr_ctx->orig_bio;
> +       struct bio_vec src_bv, dst_bv;
> +       struct pblk_sec_meta *meta_list = rqd->meta_list;
> +       int bio_init_idx = pr_ctx->bio_init_idx;
> +       int nr_secs = pr_ctx->orig_nr_secs;
> +       __le64 *lba_list_mem, *lba_list_media;
> +       __le64 addr_empty = cpu_to_le64(ADDR_EMPTY);
> +       void *src_p, *dst_p;
> +       int i, r, rail_ppa = 0;
> +       unsigned char valid;
> +
> +       if (unlikely(rqd->nr_ppas == 1)) {
> +               struct ppa_addr ppa;
> +
> +               ppa = rqd->ppa_addr;
> +               rqd->ppa_list = pr_ctx->ppa_ptr;
> +               rqd->dma_ppa_list = pr_ctx->dma_ppa_list;
> +               rqd->ppa_list[0] = ppa;
> +       }
> +
> +       /* Re-use allocated memory for intermediate lbas */
> +       lba_list_mem = (((void *)rqd->ppa_list) + pblk_dma_ppa_size);
> +       lba_list_media = (((void *)rqd->ppa_list) + 2 * pblk_dma_ppa_size);
> +
> +       for (i = 0; i < rqd->nr_ppas; i++)
> +               lba_list_media[i] = meta_list[i].lba;
> +       for (i = 0; i < nr_secs; i++)
> +               meta_list[i].lba = lba_list_mem[i];
> +
> +       for (i = 0; i < nr_secs; i++) {
> +               struct pblk_line *line;
> +               u64 meta_lba = 0x0UL, mlba;
> +
> +               line = pblk_ppa_to_line(pblk, rqd->ppa_list[rail_ppa]);
> +
> +               valid = bitmap_weight(pr_ctx->bitmap, PBLK_RAIL_STRIDE_WIDTH);
> +               bitmap_shift_right(pr_ctx->bitmap, pr_ctx->bitmap,
> +                                  PBLK_RAIL_STRIDE_WIDTH, PR_BITMAP_SIZE);
> +
> +               if (valid == 0) /* Skip cached reads */
> +                       continue;
> +
> +               kref_put(&line->ref, pblk_line_put);
> +
> +               dst_bv = bio->bi_io_vec[bio_init_idx + i];
> +               dst_p = kmap_atomic(dst_bv.bv_page);
> +
> +               memset(dst_p + dst_bv.bv_offset, 0, PBLK_EXPOSED_PAGE_SIZE);
> +               meta_list[i].lba = cpu_to_le64(0x0UL);
> +
> +               for (r = 0; r < valid; r++, rail_ppa++) {
> +                       src_bv = new_bio->bi_io_vec[rail_ppa];
> +
> +                       if (lba_list_media[rail_ppa] != addr_empty) {
> +                               src_p = kmap_atomic(src_bv.bv_page);
> +                               pblk_rail_data_parity(dst_p + dst_bv.bv_offset,
> +                                                     src_p + src_bv.bv_offset);
> +                               mlba = le64_to_cpu(lba_list_media[rail_ppa]);
> +                               pblk_rail_lba_parity(&meta_lba, &mlba);
> +                               kunmap_atomic(src_p);
> +                       }
> +
> +                       mempool_free(src_bv.bv_page, &pblk->page_bio_pool);
> +               }
> +               meta_list[i].lba = cpu_to_le64(meta_lba);
> +               kunmap_atomic(dst_p);
> +       }
> +
> +       bio_put(new_bio);
> +       rqd->nr_ppas = pr_ctx->orig_nr_secs;
> +       kfree(pr_ctx);
> +       rqd->bio = NULL;
> +
> +       bio_endio(bio);
> +       __pblk_end_io_read(pblk, rqd, false);
> +}
> +
> +/* Converts original ppa into ppa list of RAIL reads */
> +static int pblk_rail_setup_ppas(struct pblk *pblk, struct ppa_addr ppa,
> +                               struct ppa_addr *rail_ppas,
> +                               unsigned char *pvalid, int *nr_rail_ppas,
> +                               int *rail_reads)
> +{
> +       struct nvm_tgt_dev *dev = pblk->dev;
> +       struct nvm_geo *geo = &dev->geo;
> +       struct ppa_addr rail_ppa = ppa;
> +       unsigned int lun_pos = pblk_ppa_to_pos(geo, ppa);
> +       unsigned int strides = pblk_rail_nr_parity_luns(pblk);
> +       struct pblk_line *line;
> +       unsigned int i;
> +       int ppas = *nr_rail_ppas;
> +       int valid = 0;
> +
> +       for (i = 1; i < PBLK_RAIL_STRIDE_WIDTH; i++) {
> +               unsigned int neighbor, lun, chnl;
> +               int laddr;
> +
> +               neighbor = pblk_rail_wrap_lun(pblk, lun_pos + i * strides);
> +
> +               lun = pblk_pos_to_lun(geo, neighbor);
> +               chnl = pblk_pos_to_chnl(geo, neighbor);
> +               pblk_dev_ppa_set_lun(&rail_ppa, lun);
> +               pblk_dev_ppa_set_chnl(&rail_ppa, chnl);
> +
> +               line = pblk_ppa_to_line(pblk, rail_ppa);
> +               laddr = pblk_dev_ppa_to_line_addr(pblk, rail_ppa);
> +
> +               /* Do not read from bad blocks */
> +               if (!pblk_rail_valid_sector(pblk, line, laddr)) {
> +                       /* Perform regular read if parity sector is bad */
> +                       if (neighbor >= pblk_rail_nr_data_luns(pblk))
> +                               return 0;
> +
> +                       /* If any other neighbor is bad we can just skip it */
> +                       continue;
> +               }
> +
> +               rail_ppas[ppas++] = rail_ppa;
> +               valid++;
> +       }
> +
> +       if (valid == 1)
> +               return 0;
> +
> +       *pvalid = valid;
> +       *nr_rail_ppas = ppas;
> +       (*rail_reads)++;
> +       return 1;
> +}
> +
> +static void pblk_rail_set_bitmap(struct pblk *pblk, struct ppa_addr *ppa_list,
> +                                int ppa, struct ppa_addr *rail_ppa_list,
> +                                int *nr_rail_ppas, unsigned long *read_bitmap,
> +                                unsigned long *pvalid, int *rail_reads)
> +{
> +       unsigned char valid;
> +
> +       if (test_bit(ppa, read_bitmap))
> +               return;
> +
> +       if (pblk_rail_lun_busy(pblk, ppa_list[ppa]) &&
> +           pblk_rail_setup_ppas(pblk, ppa_list[ppa],
> +                                rail_ppa_list, &valid,
> +                                nr_rail_ppas, rail_reads)) {
> +               WARN_ON(test_and_set_bit(ppa, read_bitmap));
> +               bitmap_set(pvalid, ppa * PBLK_RAIL_STRIDE_WIDTH, valid);
> +       } else {
> +               rail_ppa_list[(*nr_rail_ppas)++] = ppa_list[ppa];
> +               bitmap_set(pvalid, ppa * PBLK_RAIL_STRIDE_WIDTH, 1);
> +       }
> +}
> +
> +int pblk_rail_read_bio(struct pblk *pblk, struct nvm_rq *rqd, int blba,
> +                      unsigned long *read_bitmap, int bio_init_idx,
> +                      struct bio **bio)
> +{
> +       struct pblk_g_ctx *r_ctx = nvm_rq_to_pdu(rqd);
> +       struct pblk_pr_ctx *pr_ctx;
> +       struct ppa_addr rail_ppa_list[NVM_MAX_VLBA];
> +       DECLARE_BITMAP(pvalid, PR_BITMAP_SIZE);
> +       int nr_secs = rqd->nr_ppas;
> +       bool read_empty = bitmap_empty(read_bitmap, nr_secs);
> +       int nr_rail_ppas = 0, rail_reads = 0;
> +       int i;
> +       int ret;
> +
> +       /* Fully cached reads should not enter this path */
> +       WARN_ON(bitmap_full(read_bitmap, nr_secs));
> +
> +       bitmap_zero(pvalid, PR_BITMAP_SIZE);
> +       if (rqd->nr_ppas == 1) {
> +               pblk_rail_set_bitmap(pblk, &rqd->ppa_addr, 0, rail_ppa_list,
> +                                    &nr_rail_ppas, read_bitmap, pvalid,
> +                                    &rail_reads);
> +
> +               if (nr_rail_ppas == 1) {
> +                       memcpy(&rqd->ppa_addr, rail_ppa_list,
> +                              nr_rail_ppas * sizeof(struct ppa_addr));
> +               } else {
> +                       rqd->ppa_list = rqd->meta_list + pblk_dma_meta_size;
> +                       rqd->dma_ppa_list = rqd->dma_meta_list +
> +                         pblk_dma_meta_size;
> +                       memcpy(rqd->ppa_list, rail_ppa_list,
> +                              nr_rail_ppas * sizeof(struct ppa_addr));
> +               }
> +       } else {
> +               for (i = 0; i < rqd->nr_ppas; i++) {
> +                       pblk_rail_set_bitmap(pblk, rqd->ppa_list, i,
> +                                            rail_ppa_list, &nr_rail_ppas,
> +                                            read_bitmap, pvalid, &rail_reads);
> +
> +                       /* Don't split if this it the last ppa of the rqd */
> +                       if (((nr_rail_ppas + PBLK_RAIL_STRIDE_WIDTH) >=
> +                            NVM_MAX_VLBA) && (i + 1 < rqd->nr_ppas)) {
> +                               struct pblk_g_ctx *r_ctx = nvm_rq_to_pdu(rqd);
> +
> +                               pblk_rail_bio_split(pblk, bio, i + 1);
> +                               rqd->nr_ppas = pblk_get_secs(*bio);
> +                               r_ctx->private = *bio;
> +                               break;
> +                       }
> +               }
> +               memcpy(rqd->ppa_list, rail_ppa_list,
> +                      nr_rail_ppas * sizeof(struct ppa_addr));
> +       }
> +
> +       if (bitmap_empty(read_bitmap, rqd->nr_ppas))
> +               return NVM_IO_REQUEUE;
> +
> +       if (read_empty && !bitmap_empty(read_bitmap, rqd->nr_ppas))
> +               bio_advance(*bio, (rqd->nr_ppas) * PBLK_EXPOSED_PAGE_SIZE);
> +
> +       if (pblk_setup_partial_read(pblk, rqd, bio_init_idx, read_bitmap,
> +                                   nr_rail_ppas))
> +               return NVM_IO_ERR;
> +
> +       rqd->end_io = pblk_rail_end_io_read;
> +       pr_ctx = r_ctx->private;
> +       bitmap_copy(pr_ctx->bitmap, pvalid, PR_BITMAP_SIZE);
> +
> +       ret = pblk_submit_io(pblk, rqd);
> +       if (ret) {
> +               bio_put(rqd->bio);
> +               pr_err("pblk: partial RAIL read IO submission failed\n");
> +               /* Free allocated pages in new bio */
> +               pblk_bio_free_pages(pblk, rqd->bio, 0, rqd->bio->bi_vcnt);
> +               kfree(pr_ctx);
> +               __pblk_end_io_read(pblk, rqd, false);
> +               return NVM_IO_ERR;
> +       }
> +
> +       return NVM_IO_OK;
> +}
> diff --git a/drivers/lightnvm/pblk.h b/drivers/lightnvm/pblk.h
> index bd88784e51d9..01fe4362b27e 100644
> --- a/drivers/lightnvm/pblk.h
> +++ b/drivers/lightnvm/pblk.h
> @@ -28,6 +28,7 @@
>  #include <linux/vmalloc.h>
>  #include <linux/crc32.h>
>  #include <linux/uuid.h>
> +#include <linux/log2.h>
>
>  #include <linux/lightnvm.h>
>
> @@ -45,7 +46,7 @@
>  #define PBLK_COMMAND_TIMEOUT_MS 30000
>
>  /* Max 512 LUNs per device */
> -#define PBLK_MAX_LUNS_BITMAP (4)
> +#define PBLK_MAX_LUNS_BITMAP (512)

512 is probably enough for everyone for now, but why not make this dynamic?
Better not to waste memory and introduce an artificial limit on the number
of LUNs.
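
Untested sketch of what I mean: size the bitmap from the geometry in
pblk_rail_init() and turn busy_bitmap into a plain unsigned long pointer
(with a matching kfree() in pblk_rail_free()):

        struct nvm_tgt_dev *dev = pblk->dev;
        struct nvm_geo *geo = &dev->geo;

        /* one bit per LUN actually present on the device */
        pblk->rail.busy_bitmap = kcalloc(BITS_TO_LONGS(geo->all_luns),
                                         sizeof(unsigned long), GFP_KERNEL);
        if (!pblk->rail.busy_bitmap)
                return -ENOMEM;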

>
>  #define NR_PHY_IN_LOG (PBLK_EXPOSED_PAGE_SIZE / PBLK_SECTOR)
>
> @@ -123,6 +124,13 @@ struct pblk_g_ctx {
>         u64 lba;
>  };
>
> +#ifdef CONFIG_NVM_PBLK_RAIL
> +#define PBLK_RAIL_STRIDE_WIDTH 4
> +#define PR_BITMAP_SIZE (NVM_MAX_VLBA * PBLK_RAIL_STRIDE_WIDTH)
> +#else
> +#define PR_BITMAP_SIZE NVM_MAX_VLBA
> +#endif
> +
>  /* partial read context */
>  struct pblk_pr_ctx {
>         struct bio *orig_bio;
> @@ -604,6 +612,39 @@ struct pblk_addrf {
>         int sec_ws_stripe;
>  };
>
> +#ifdef CONFIG_NVM_PBLK_RAIL
> +
> +struct p2b_entry {
> +       int pos;
> +       int nr_valid;
> +};
> +
> +struct pblk_rail {
> +       struct p2b_entry **p2b;         /* Maps RAIL sectors to rb pos */
> +       struct page *pages;             /* Pages to hold parity writes */
> +       void **data;                    /* Buffer that holds parity pages */
> +       DECLARE_BITMAP(busy_bitmap, PBLK_MAX_LUNS_BITMAP);
> +       u64 *lba;                       /* Buffer to compute LBA parity */
> +};
> +
> +/* Initialize and tear down RAIL */
> +int pblk_rail_init(struct pblk *pblk);
> +void pblk_rail_free(struct pblk *pblk);
> +/* Adjust some system parameters */
> +bool pblk_rail_meta_distance(struct pblk_line *data_line);
> +int pblk_rail_rb_delay(struct pblk_rb *rb);
> +/* Core */
> +void pblk_rail_line_close(struct pblk *pblk, struct pblk_line *line);
> +int pblk_rail_down_stride(struct pblk *pblk, int lun, int timeout);
> +void pblk_rail_up_stride(struct pblk *pblk, int lun);
> +/* Write path */
> +int pblk_rail_submit_write(struct pblk *pblk);
> +/* Read Path */
> +int pblk_rail_read_bio(struct pblk *pblk, struct nvm_rq *rqd, int blba,
> +                      unsigned long *read_bitmap, int bio_init_idx,
> +                      struct bio **bio);
> +#endif /* CONFIG_NVM_PBLK_RAIL */
> +
>  typedef int (pblk_map_page_fn)(struct pblk *pblk, unsigned int sentry,
>                                struct ppa_addr *ppa_list,
>                                unsigned long *lun_bitmap,
> @@ -1115,6 +1156,26 @@ static inline u64 pblk_dev_ppa_to_line_addr(struct pblk *pblk,
>         return paddr;
>  }
>
> +static inline int pblk_pos_to_lun(struct nvm_geo *geo, int pos)
> +{
> +       return pos >> ilog2(geo->num_ch);
> +}
> +
> +static inline int pblk_pos_to_chnl(struct nvm_geo *geo, int pos)
> +{
> +       return pos % geo->num_ch;
> +}
> +
> +static inline void pblk_dev_ppa_set_lun(struct ppa_addr *p, int lun)
> +{
> +       p->a.lun = lun;
> +}
> +
> +static inline void pblk_dev_ppa_set_chnl(struct ppa_addr *p, int chnl)
> +{
> +       p->a.ch = chnl;
> +}

What is the motivation for adding the lun and chnl setters? They seem
uncalled for.

> +
>  static inline struct ppa_addr pblk_ppa32_to_ppa64(struct pblk *pblk, u32 ppa32)
>  {
>         struct nvm_tgt_dev *dev = pblk->dev;
> --
> 2.17.1
>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 6/6] lightnvm: pblk: Integrate RAIL
  2018-09-17  5:29 ` [RFC PATCH 6/6] lightnvm: pblk: Integrate RAIL Heiner Litz
@ 2018-09-18 11:38   ` Hans Holmberg
  0 siblings, 0 replies; 17+ messages in thread
From: Hans Holmberg @ 2018-09-18 11:38 UTC (permalink / raw)
  To: hlitz
  Cc: linux-block, Javier Gonzalez, Matias Bjorling, igor.j.konopko,
	marcin.dziegielewski

On Mon, Sep 17, 2018 at 7:30 AM Heiner Litz <hlitz@ucsc.edu> wrote:
>
> Integrate Redundant Array of Independent Luns (RAIL) into lightnvm. RAIL
> enforces low tail read latency by guaranteeing that reads are never
> serialized behind writes and erases to the same LUN. Whenever LUNs serve a
> high latency operation, reads are performed by recomputing the original
> data utilizing redundant parity information.
> RAIL trades off read latency for capacity (redundancy) which, however, can
> be leveraged for fault tolerance.
>
> On FIO, with the kyber scheduler set to a target read latency of 500us,
> RAIL reduces tail latency percentiles (us) as follows:
>
>        Avg    90%    99%     99.9%  99.95%  99.99%
>        pblk   90     1000    2200   3000    6000
>        RAIL   85     100     250    400     500
>
> Signed-off-by: Heiner Litz <hlitz@ucsc.edu>
> ---
>  drivers/lightnvm/Kconfig      | 10 ++++++++++
>  drivers/lightnvm/Makefile     |  1 +
>  drivers/lightnvm/pblk-core.c  | 36 ++++++++++++++++++++++++++++++++++-
>  drivers/lightnvm/pblk-init.c  | 17 +++++++++++++++++
>  drivers/lightnvm/pblk-rail.c  |  1 +
>  drivers/lightnvm/pblk-rb.c    |  6 ++++++
>  drivers/lightnvm/pblk-read.c  |  9 +++++++++
>  drivers/lightnvm/pblk-write.c |  9 +++++++++
>  drivers/lightnvm/pblk.h       |  5 +++++
>  9 files changed, 93 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/lightnvm/Kconfig b/drivers/lightnvm/Kconfig
> index a872cd720967..165d5a29acc3 100644
> --- a/drivers/lightnvm/Kconfig
> +++ b/drivers/lightnvm/Kconfig
> @@ -35,6 +35,16 @@ config NVM_PBLK_DEBUG
>           vocal error messages, and extra tracking fields in the pblk sysfs
>           entries.
>
> +config NVM_PBLK_RAIL
> +       bool "Pblk RAIL Support"
> +       default n
> +       help
> +         Enables RAIL for pblk. RAIL enforces tail read latency guarantees by
> +        eliminating reads being serialized behind writes to the same LUN.
> +        RAIL partitions LUNs into strides and enforces that only one LUN per
> +        stride is written at a time. Reads can bypass busy LUNs by recomputing
> +        requested data using parity redundancy.
> +
>  endif # NVM_PBLK_DEBUG

Having a compile-time option forces the user (or even worse, the
distribution provider) to pick the RAIL or non-RAIL version of pblk.
It's also a pain having to re-compile and re-provision the kernel when testing.

I see no reason why this should not be dynamically handled within pblk
(RAIL on/off and stride width could be supplied via the create ioctl).
One would want to configure the stride width to fit a given workload in any case.

nvm_ioctl_create_extended has 16 reserved bits, so we have room for
adding RAIL parameters.
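
From memory the extended create struct looks roughly like the below, so
a stride width (with 0 meaning RAIL off) could go into the reserved
field; please double-check the uapi header, this is just to illustrate
the idea and rail_stride is a made-up name:

struct nvm_ioctl_create_extended {
        __u16 lun_begin;
        __u16 lun_end;
        __u16 op;
        __u16 rail_stride;      /* was rsv_16: 0 = RAIL disabled */
        __u32 rsv_32[2];
};

The value would then have to be plumbed through the create path in the
lightnvm core down to pblk_init()/pblk_rail_init(), replacing the
compile-time PBLK_RAIL_STRIDE_WIDTH.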

>
>  endif # NVM
> diff --git a/drivers/lightnvm/Makefile b/drivers/lightnvm/Makefile
> index 97d9d7c71550..92f4376428cc 100644
> --- a/drivers/lightnvm/Makefile
> +++ b/drivers/lightnvm/Makefile
> @@ -5,6 +5,7 @@
>
>  obj-$(CONFIG_NVM)              := core.o
>  obj-$(CONFIG_NVM_PBLK)         += pblk.o
> +obj-$(CONFIG_NVM_PBLK_RAIL)    += pblk-rail.o
>  pblk-y                         := pblk-init.o pblk-core.o pblk-rb.o \
>                                    pblk-write.o pblk-cache.o pblk-read.o \
>                                    pblk-gc.o pblk-recovery.o pblk-map.o \
> diff --git a/drivers/lightnvm/pblk-core.c b/drivers/lightnvm/pblk-core.c
> index a31bf359f905..ca74d7763fa9 100644
> --- a/drivers/lightnvm/pblk-core.c
> +++ b/drivers/lightnvm/pblk-core.c
> @@ -113,6 +113,12 @@ static void pblk_end_io_erase(struct nvm_rq *rqd)
>  {
>         struct pblk *pblk = rqd->private;
>
> +#ifdef CONFIG_NVM_PBLK_RAIL
> +       struct ppa_addr *ppa_list = nvm_rq_to_ppa_list(rqd);
> +
> +       pblk_up_chunk(pblk, ppa_list[0]);
> +#endif
> +
>         __pblk_end_io_erase(pblk, rqd);
>         mempool_free(rqd, &pblk->e_rq_pool);
>  }
> @@ -940,7 +946,11 @@ static int pblk_blk_erase_sync(struct pblk *pblk, struct ppa_addr ppa)
>         /* The write thread schedules erases so that it minimizes disturbances
>          * with writes. Thus, there is no need to take the LUN semaphore.
>          */
> +#ifdef CONFIG_NVM_PBLK_RAIL
> +       ret = pblk_submit_io_sync_sem(pblk, &rqd);
> +#else
>         ret = pblk_submit_io_sync(pblk, &rqd);
> +#endif
>         rqd.private = pblk;
>         __pblk_end_io_erase(pblk, &rqd);
>
> @@ -1754,7 +1764,11 @@ int pblk_blk_erase_async(struct pblk *pblk, struct ppa_addr ppa)
>         /* The write thread schedules erases so that it minimizes disturbances
>          * with writes. Thus, there is no need to take the LUN semaphore.
>          */
> +#ifdef CONFIG_NVM_PBLK_RAIL
> +       err = pblk_submit_io_sem(pblk, rqd);
> +#else
>         err = pblk_submit_io(pblk, rqd);
> +#endif
>         if (err) {
>                 struct nvm_tgt_dev *dev = pblk->dev;
>                 struct nvm_geo *geo = &dev->geo;
> @@ -1909,6 +1923,10 @@ void pblk_line_close_ws(struct work_struct *work)
>         if (w_err_gc->has_write_err)
>                 pblk_save_lba_list(pblk, line);
>
> +#ifdef CONFIG_NVM_PBLK_RAIL
> +       pblk_rail_line_close(pblk, line);
> +#endif
> +
>         pblk_line_close(pblk, line);
>         mempool_free(line_ws, &pblk->gen_ws_pool);
>  }
> @@ -1938,8 +1956,12 @@ static void __pblk_down_chunk(struct pblk *pblk, int pos)
>          * Only send one inflight I/O per LUN. Since we map at a page
>          * granurality, all ppas in the I/O will map to the same LUN
>          */
> -
> +#ifdef CONFIG_NVM_PBLK_RAIL
> +       (void)rlun;
> +       ret = pblk_rail_down_stride(pblk, pos, msecs_to_jiffies(30000));
> +#else
>         ret = down_timeout(&rlun->wr_sem, msecs_to_jiffies(30000));
> +#endif
>         if (ret == -ETIME || ret == -EINTR)
>                 pblk_err(pblk, "taking lun semaphore timed out: err %d\n",
>                                 -ret);
> @@ -1978,7 +2000,13 @@ void pblk_up_chunk(struct pblk *pblk, struct ppa_addr ppa)
>         int pos = pblk_ppa_to_pos(geo, ppa);
>
>         rlun = &pblk->luns[pos];
> +
> +#ifdef CONFIG_NVM_PBLK_RAIL
> +       pblk_rail_up_stride(pblk, pos);
> +#else
>         up(&rlun->wr_sem);
> +#endif
> +
>  }
>
>  void pblk_up_rq(struct pblk *pblk, unsigned long *lun_bitmap)
> @@ -1991,7 +2019,13 @@ void pblk_up_rq(struct pblk *pblk, unsigned long *lun_bitmap)
>
>         while ((bit = find_next_bit(lun_bitmap, num_lun, bit + 1)) < num_lun) {
>                 rlun = &pblk->luns[bit];
> +
> +#ifdef CONFIG_NVM_PBLK_RAIL
> +               pblk_rail_up_stride(pblk, bit);
> +#else
>                 up(&rlun->wr_sem);
> +#endif
> +
>         }
>  }
>
> diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c
> index 2b9c6ebd9fac..3e8255c8873f 100644
> --- a/drivers/lightnvm/pblk-init.c
> +++ b/drivers/lightnvm/pblk-init.c
> @@ -1050,6 +1050,7 @@ static int pblk_lines_init(struct pblk *pblk)
>         kfree(pblk->lines);
>  fail_free_chunk_meta:
>         kfree(chunk_meta);
> +
>  fail_free_luns:
>         kfree(pblk->luns);
>  fail_free_meta:
> @@ -1108,6 +1109,11 @@ static void pblk_tear_down(struct pblk *pblk, bool graceful)
>                 __pblk_pipeline_flush(pblk);
>         __pblk_pipeline_stop(pblk);
>         pblk_writer_stop(pblk);
> +
> +#ifdef CONFIG_NVM_PBLK_RAIL
> +       pblk_rail_free(pblk);
> +#endif
> +
>         pblk_rb_sync_l2p(&pblk->rwb);
>         pblk_rl_free(&pblk->rl);
>
> @@ -1226,6 +1232,12 @@ static void *pblk_init(struct nvm_tgt_dev *dev, struct gendisk *tdisk,
>                 goto fail_stop_writer;
>         }
>
> +#ifdef CONFIG_NVM_PBLK_RAIL
> +       ret = pblk_rail_init(pblk);
> +       if (ret)
> +               goto fail_free_gc;
> +#endif
> +
>         /* inherit the size from the underlying device */
>         blk_queue_logical_block_size(tqueue, queue_physical_block_size(bqueue));
>         blk_queue_max_hw_sectors(tqueue, queue_max_hw_sectors(bqueue));
> @@ -1249,6 +1261,11 @@ static void *pblk_init(struct nvm_tgt_dev *dev, struct gendisk *tdisk,
>
>         return pblk;
>
> +#ifdef CONFIG_NVM_PBLK_RAIL
> +fail_free_gc:
> +       pblk_gc_exit(pblk, false);
> +#endif
> +
>  fail_stop_writer:
>         pblk_writer_stop(pblk);
>  fail_free_l2p:
> diff --git a/drivers/lightnvm/pblk-rail.c b/drivers/lightnvm/pblk-rail.c
> index a48ed31a0ba9..619ff9689d29 100644
> --- a/drivers/lightnvm/pblk-rail.c
> +++ b/drivers/lightnvm/pblk-rail.c
> @@ -1,3 +1,4 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
>  /*
>   * Copyright (C) 2018 Heiner Litz
>   * Initial release: Heiner Litz <hlitz@ucsc.edu>
> diff --git a/drivers/lightnvm/pblk-rb.c b/drivers/lightnvm/pblk-rb.c
> index a7648e12f54f..b04462479fe3 100644
> --- a/drivers/lightnvm/pblk-rb.c
> +++ b/drivers/lightnvm/pblk-rb.c
> @@ -389,8 +389,14 @@ static int __pblk_rb_may_write(struct pblk_rb *rb, unsigned int nr_entries,
>         sync = READ_ONCE(rb->sync);
>         mem = READ_ONCE(rb->mem);
>
> +#ifdef CONFIG_NVM_PBLK_RAIL
> +       if (pblk_rb_ring_space(rb, mem, sync, rb->nr_entries) <
> +           nr_entries + pblk_rail_rb_delay(rb))
> +               return 0;
> +#else
>         if (pblk_rb_ring_space(rb, mem, sync, rb->nr_entries) < nr_entries)
>                 return 0;
> +#endif
>
>         if (pblk_rb_update_l2p(rb, nr_entries, mem, sync))
>                 return 0;
> diff --git a/drivers/lightnvm/pblk-read.c b/drivers/lightnvm/pblk-read.c
> index 67d44caefff4..a3f33503f60c 100644
> --- a/drivers/lightnvm/pblk-read.c
> +++ b/drivers/lightnvm/pblk-read.c
> @@ -472,6 +472,15 @@ int pblk_submit_read(struct pblk *pblk, struct bio *bio)
>                 return NVM_IO_DONE;
>         }
>
> +#ifdef CONFIG_NVM_PBLK_RAIL
> +       ret = pblk_rail_read_bio(pblk, rqd, blba, read_bitmap, bio_init_idx,
> +                                &bio);
> +       if (ret == NVM_IO_OK)
> +               return ret;
> +       if (ret == NVM_IO_ERR)
> +               goto fail_end_io;
> +#endif
> +
>         /* All sectors are to be read from the device */
>         if (bitmap_empty(read_bitmap, rqd->nr_ppas)) {
>                 struct bio *int_bio = NULL;
> diff --git a/drivers/lightnvm/pblk-write.c b/drivers/lightnvm/pblk-write.c
> index 6eba38b83acd..db42184cfba3 100644
> --- a/drivers/lightnvm/pblk-write.c
> +++ b/drivers/lightnvm/pblk-write.c
> @@ -469,6 +469,11 @@ static inline bool pblk_valid_meta_ppa(struct pblk *pblk,
>                                 test_bit(pos_opt, data_line->blk_bitmap))
>                 return true;
>
> +#ifdef CONFIG_NVM_PBLK_RAIL
> +       if (unlikely(pblk_rail_meta_distance(data_line)))
> +               data_line->meta_distance--;
> +#endif
> +
>         if (unlikely(pblk_ppa_comp(ppa_opt, ppa)))
>                 data_line->meta_distance--;
>
> @@ -571,6 +576,10 @@ static int pblk_submit_write(struct pblk *pblk)
>         unsigned long pos;
>         unsigned int resubmit;
>
> +#ifdef CONFIG_NVM_PBLK_RAIL
> +       pblk_rail_submit_write(pblk);
> +#endif
> +
>         spin_lock(&pblk->resubmit_lock);
>         resubmit = !list_empty(&pblk->resubmit_list);
>         spin_unlock(&pblk->resubmit_lock);
> diff --git a/drivers/lightnvm/pblk.h b/drivers/lightnvm/pblk.h
> index 01fe4362b27e..9742524f74ea 100644
> --- a/drivers/lightnvm/pblk.h
> +++ b/drivers/lightnvm/pblk.h
> @@ -758,6 +758,11 @@ struct pblk {
>         struct pblk_gc gc;
>
>         pblk_map_page_fn *map_page;
> +
> +#ifdef CONFIG_NVM_PBLK_RAIL
> +       struct pblk_rail rail;
> +#endif
> +
>  };
>
>  struct pblk_line_ws {
> --
> 2.17.1
>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 0/6] lightnvm: pblk: Introduce RAIL to enforce low tail read latency
  2018-09-17  5:29 [RFC PATCH 0/6] lightnvm: pblk: Introduce RAIL to enforce low tail read latency Heiner Litz
                   ` (5 preceding siblings ...)
  2018-09-17  5:29 ` [RFC PATCH 6/6] lightnvm: pblk: Integrate RAIL Heiner Litz
@ 2018-09-18 11:46 ` Hans Holmberg
  2018-09-18 16:13   ` Heiner Litz
  6 siblings, 1 reply; 17+ messages in thread
From: Hans Holmberg @ 2018-09-18 11:46 UTC (permalink / raw)
  To: hlitz
  Cc: linux-block, Javier Gonzalez, Matias Bjorling, igor.j.konopko,
	marcin.dziegielewski

On Mon, Sep 17, 2018 at 7:29 AM Heiner Litz <hlitz@ucsc.edu> wrote:
>
> Hi All,
> this patchset introduces RAIL, a mechanism to enforce low tail read latency for
> lightnvm OCSSD devices. RAIL leverages redundancy to guarantee that reads are
> always served from LUNs that do not serve a high latency operation such as a
> write or erase. This avoids that reads become serialized behind these operations
> reducing tail latency by ~10x. In particular, in the absence of ECC read errors,
> it provides 99.99 percentile read latencies of below 500us. RAIL introduces
> capacity overheads (7%-25%) due to RAID-5 like striping (providing fault
> tolerance) and reduces the maximum write bandwidth to 110K IOPS on CNEX SSD.
>
> This patch is based on pblk/core and requires two additional patches from Javier
> to be applicable (let me know if you want me to rebase):

As the patches do not apply, could you make a branch available so I
can get hold of the code in its present state?
That would make reviewing and testing so much easier.

I have some concerns regarding recovery and write error handling, but
I have not found anything that can't be fixed.
I also believe that RAIL on/off and stride width should not be
configured at build time, but instead be part of the create IOCTL.

See my comments on the individual patches for details.

>
> The 1st patch exposes some existing APIs so they can be used by RAIL
> The 2nd patch introduces a configurable sector mapping function
> The 3rd patch refactors the write path so the end_io_fn can be specified when
> setting up the request
> The 4th patch adds a new submit io function that acquires the write semaphore
> The 5th patch introduces the RAIL feature and its API
> The 6th patch integrates RAIL into pblk's read and write path
>
>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 5/6] lightnvm: pblk: Add RAIL interface
  2018-09-18 11:28   ` Hans Holmberg
@ 2018-09-18 16:11     ` Heiner Litz
  2018-09-19  7:53       ` Hans Holmberg
  0 siblings, 1 reply; 17+ messages in thread
From: Heiner Litz @ 2018-09-18 16:11 UTC (permalink / raw)
  To: hans.ml.holmberg
  Cc: linux-block, Javier Gonzalez, Matias Bjørling,
	igor.j.konopko, marcin.dziegielewski

On Tue, Sep 18, 2018 at 4:28 AM Hans Holmberg
<hans.ml.holmberg@owltronix.com> wrote:
>
> On Mon, Sep 17, 2018 at 7:30 AM Heiner Litz <hlitz@ucsc.edu> wrote:
> >
> > In preparation of supporting RAIL, add the RAIL API.
> >
> > Signed-off-by: Heiner Litz <hlitz@ucsc.edu>
> > ---
> >  drivers/lightnvm/pblk-rail.c | 808 +++++++++++++++++++++++++++++++++++
> >  drivers/lightnvm/pblk.h      |  63 ++-
> >  2 files changed, 870 insertions(+), 1 deletion(-)
> >  create mode 100644 drivers/lightnvm/pblk-rail.c
> >
> > diff --git a/drivers/lightnvm/pblk-rail.c b/drivers/lightnvm/pblk-rail.c
> > new file mode 100644
> > index 000000000000..a48ed31a0ba9
> > --- /dev/null
> > +++ b/drivers/lightnvm/pblk-rail.c
> > @@ -0,0 +1,808 @@
> > +/*
> > + * Copyright (C) 2018 Heiner Litz
> > + * Initial release: Heiner Litz <hlitz@ucsc.edu>
> > + *
> > + * This program is free software; you can redistribute it and/or
> > + * modify it under the terms of the GNU General Public License version
> > + * 2 as published by the Free Software Foundation.
> > + *
> > + * This program is distributed in the hope that it will be useful, but
> > + * WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> > + * General Public License for more details.
> > + *
> > + * pblk-rail.c - pblk's RAIL path
> > + */
> > +
> > +#include "pblk.h"
> > +
> > +#define PBLK_RAIL_EMPTY ~0x0
> This constant is not being used.

thanks, will remove

> > +#define PBLK_RAIL_PARITY_WRITE 0x8000
> Where does this magic number come from? Please document.

ok, will document
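
Roughly what I have in mind (assuming rwb positions never reach bit 15,
so the flag cannot collide with a real write buffer index):

/* ORed into c_ctx->sentry to mark a parity write. For these requests
 * the low bits index pblk->rail.data / pblk->rail.lba instead of the
 * ring write buffer, so the flag must stay above any valid rwb position.
 */
#define PBLK_RAIL_PARITY_WRITE 0x8000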

>
> > +
> > +/* RAIL auxiliary functions */
> > +static unsigned int pblk_rail_nr_parity_luns(struct pblk *pblk)
> > +{
> > +       struct pblk_line_meta *lm = &pblk->lm;
> > +
> > +       return lm->blk_per_line / PBLK_RAIL_STRIDE_WIDTH;
> > +}
> > +
> > +static unsigned int pblk_rail_nr_data_luns(struct pblk *pblk)
> > +{
> > +       struct pblk_line_meta *lm = &pblk->lm;
> > +
> > +       return lm->blk_per_line - pblk_rail_nr_parity_luns(pblk);
> > +}
> > +
> > +static unsigned int pblk_rail_sec_per_stripe(struct pblk *pblk)
> > +{
> > +       struct pblk_line_meta *lm = &pblk->lm;
> > +
> > +       return lm->blk_per_line * pblk->min_write_pgs;
> > +}
> > +
> > +static unsigned int pblk_rail_psec_per_stripe(struct pblk *pblk)
> > +{
> > +       return pblk_rail_nr_parity_luns(pblk) * pblk->min_write_pgs;
> > +}
> > +
> > +static unsigned int pblk_rail_dsec_per_stripe(struct pblk *pblk)
> > +{
> > +       return pblk_rail_sec_per_stripe(pblk) - pblk_rail_psec_per_stripe(pblk);
> > +}
> > +
> > +static unsigned int pblk_rail_wrap_lun(struct pblk *pblk, unsigned int lun)
> > +{
> > +       struct pblk_line_meta *lm = &pblk->lm;
> > +
> > +       return (lun & (lm->blk_per_line - 1));
> > +}
> > +
> > +bool pblk_rail_meta_distance(struct pblk_line *data_line)
> > +{
> > +       return (data_line->meta_distance % PBLK_RAIL_STRIDE_WIDTH) == 0;
> > +}
> > +
> > +/* Notify readers that LUN is serving high latency operation */
> > +static void pblk_rail_notify_reader_down(struct pblk *pblk, int lun)
> > +{
> > +       WARN_ON(test_and_set_bit(lun, pblk->rail.busy_bitmap));
> > +       /* Make sure that busy bit is seen by reader before proceeding */
> > +       smp_mb__after_atomic();
> > +}
> > +
> > +static void pblk_rail_notify_reader_up(struct pblk *pblk, int lun)
> > +{
> > +       /* Make sure that write is completed before releasing busy bit */
> > +       smp_mb__before_atomic();
> > +       WARN_ON(!test_and_clear_bit(lun, pblk->rail.busy_bitmap));
> > +}
> > +
> > +int pblk_rail_lun_busy(struct pblk *pblk, struct ppa_addr ppa)
> > +{
> > +       struct nvm_tgt_dev *dev = pblk->dev;
> > +       struct nvm_geo *geo = &dev->geo;
> > +       int lun_pos = pblk_ppa_to_pos(geo, ppa);
> > +
> > +       return test_bit(lun_pos, pblk->rail.busy_bitmap);
> > +}
> > +
> > +/* Enforces one writer per stride */
> > +int pblk_rail_down_stride(struct pblk *pblk, int lun_pos, int timeout)
> > +{
> > +       struct pblk_lun *rlun;
> > +       int strides = pblk_rail_nr_parity_luns(pblk);
> > +       int stride = lun_pos % strides;
> > +       int ret;
> > +
> > +       rlun = &pblk->luns[stride];
> > +       ret = down_timeout(&rlun->wr_sem, timeout);
> > +       pblk_rail_notify_reader_down(pblk, lun_pos);
> > +
> > +       return ret;
> > +}
> > +
> > +void pblk_rail_up_stride(struct pblk *pblk, int lun_pos)
> > +{
> > +       struct pblk_lun *rlun;
> > +       int strides = pblk_rail_nr_parity_luns(pblk);
> > +       int stride = lun_pos % strides;
> > +
> > +       pblk_rail_notify_reader_up(pblk, lun_pos);
> > +       rlun = &pblk->luns[stride];
> > +       up(&rlun->wr_sem);
> > +}
> > +
> > +/* Determine whether a sector holds data, meta or is bad*/
> > +bool pblk_rail_valid_sector(struct pblk *pblk, struct pblk_line *line, int pos)
> > +{
> > +       struct pblk_line_meta *lm = &pblk->lm;
> > +       struct nvm_tgt_dev *dev = pblk->dev;
> > +       struct nvm_geo *geo = &dev->geo;
> > +       struct ppa_addr ppa;
> > +       int lun;
> > +
> > +       if (pos >= line->smeta_ssec && pos < (line->smeta_ssec + lm->smeta_sec))
> > +               return false;
> > +
> > +       if (pos >= line->emeta_ssec &&
> > +           pos < (line->emeta_ssec + lm->emeta_sec[0]))
> > +               return false;
> > +
> > +       ppa = addr_to_gen_ppa(pblk, pos, line->id);
> > +       lun = pblk_ppa_to_pos(geo, ppa);
> > +
> > +       return !test_bit(lun, line->blk_bitmap);
> > +}
> > +
> > +/* Delay rb overwrite until whole stride has been written */
> > +int pblk_rail_rb_delay(struct pblk_rb *rb)
> > +{
> > +       struct pblk *pblk = container_of(rb, struct pblk, rwb);
> > +
> > +       return pblk_rail_sec_per_stripe(pblk);
> > +}
> > +
> > +static unsigned int pblk_rail_sec_to_stride(struct pblk *pblk, unsigned int sec)
> > +{
> > +       unsigned int sec_in_stripe = sec % pblk_rail_sec_per_stripe(pblk);
> > +       int page = sec_in_stripe / pblk->min_write_pgs;
> > +
> > +       return page % pblk_rail_nr_parity_luns(pblk);
> > +}
> > +
> > +static unsigned int pblk_rail_sec_to_idx(struct pblk *pblk, unsigned int sec)
> > +{
> > +       unsigned int sec_in_stripe = sec % pblk_rail_sec_per_stripe(pblk);
> > +
> > +       return sec_in_stripe / pblk_rail_psec_per_stripe(pblk);
> > +}
> > +
> > +static void pblk_rail_data_parity(void *dest, void *src)
> > +{
> > +       unsigned int i;
> > +
> > +       for (i = 0; i < PBLK_EXPOSED_PAGE_SIZE / sizeof(unsigned long); i++)
> > +               ((unsigned long *)dest)[i] ^= ((unsigned long *)src)[i];
> > +}
> > +
> > +static void pblk_rail_lba_parity(u64 *dest, u64 *src)
> > +{
> > +       *dest ^= *src;
> > +}
> > +
> > +/* Tracks where a sector is located in the rwb */
> > +void pblk_rail_track_sec(struct pblk *pblk, struct pblk_line *line, int cur_sec,
> > +                        int sentry, int nr_valid)
> > +{
> > +       int stride, idx, pos;
> > +
> > +       stride = pblk_rail_sec_to_stride(pblk, cur_sec);
> > +       idx = pblk_rail_sec_to_idx(pblk, cur_sec);
> > +       pos = pblk_rb_wrap_pos(&pblk->rwb, sentry);
> > +       pblk->rail.p2b[stride][idx].pos = pos;
> > +       pblk->rail.p2b[stride][idx].nr_valid = nr_valid;
> > +}
> > +
> > +/* RAIL's sector mapping function */
> > +static void pblk_rail_map_sec(struct pblk *pblk, struct pblk_line *line,
> > +                             int sentry, struct pblk_sec_meta *meta_list,
> > +                             __le64 *lba_list, struct ppa_addr ppa)
> > +{
> > +       struct pblk_w_ctx *w_ctx;
> > +       __le64 addr_empty = cpu_to_le64(ADDR_EMPTY);
> > +
> > +       kref_get(&line->ref);
> > +
> > +       if (sentry & PBLK_RAIL_PARITY_WRITE) {
> > +               u64 *lba;
> > +
> > +               sentry &= ~PBLK_RAIL_PARITY_WRITE;
> > +               lba = &pblk->rail.lba[sentry];
> > +               meta_list->lba = cpu_to_le64(*lba);
> > +               *lba_list = cpu_to_le64(*lba);
> > +               line->nr_valid_lbas++;
> > +       } else {
> > +               w_ctx = pblk_rb_w_ctx(&pblk->rwb, sentry);
> > +               w_ctx->ppa = ppa;
> > +               meta_list->lba = cpu_to_le64(w_ctx->lba);
> > +               *lba_list = cpu_to_le64(w_ctx->lba);
> > +
> > +               if (*lba_list != addr_empty)
> > +                       line->nr_valid_lbas++;
> > +               else
> > +                       atomic64_inc(&pblk->pad_wa);
> > +       }
> > +}
> > +
> > +int pblk_rail_map_page_data(struct pblk *pblk, unsigned int sentry,
> > +                           struct ppa_addr *ppa_list,
> > +                           unsigned long *lun_bitmap,
> > +                           struct pblk_sec_meta *meta_list,
> > +                           unsigned int valid_secs)
> > +{
> > +       struct pblk_line *line = pblk_line_get_data(pblk);
> > +       struct pblk_emeta *emeta;
> > +       __le64 *lba_list;
> > +       u64 paddr;
> > +       int nr_secs = pblk->min_write_pgs;
> > +       int i;
> > +
> > +       if (pblk_line_is_full(line)) {
> > +               struct pblk_line *prev_line = line;
> > +
> > +               /* If we cannot allocate a new line, make sure to store metadata
> > +                * on current line and then fail
> > +                */
> > +               line = pblk_line_replace_data(pblk);
> > +               pblk_line_close_meta(pblk, prev_line);
> > +
> > +               if (!line)
> > +                       return -EINTR;
> > +       }
> > +
> > +       emeta = line->emeta;
> > +       lba_list = emeta_to_lbas(pblk, emeta->buf);
> > +
> > +       paddr = pblk_alloc_page(pblk, line, nr_secs);
> > +
> > +       pblk_rail_track_sec(pblk, line, paddr, sentry, valid_secs);
> > +
> > +       for (i = 0; i < nr_secs; i++, paddr++) {
> > +               __le64 addr_empty = cpu_to_le64(ADDR_EMPTY);
> > +
> > +               /* ppa to be sent to the device */
> > +               ppa_list[i] = addr_to_gen_ppa(pblk, paddr, line->id);
> > +
> > +               /* Write context for target bio completion on write buffer. Note
> > +                * that the write buffer is protected by the sync backpointer,
> > +                * and a single writer thread have access to each specific entry
> > +                * at a time. Thus, it is safe to modify the context for the
> > +                * entry we are setting up for submission without taking any
> > +                * lock or memory barrier.
> > +                */
> > +               if (i < valid_secs) {
> > +                       pblk_rail_map_sec(pblk, line, sentry + i, &meta_list[i],
> > +                                         &lba_list[paddr], ppa_list[i]);
> > +               } else {
> > +                       lba_list[paddr] = meta_list[i].lba = addr_empty;
> > +                       __pblk_map_invalidate(pblk, line, paddr);
> > +               }
> > +       }
> > +
> > +       pblk_down_rq(pblk, ppa_list[0], lun_bitmap);
> > +       return 0;
> > +}
>
> This is a lot of duplication of code from the "normal" pblk map
> function -  could you refactor to avoid this?

I wanted to keep the mapping function as general as possible in case
we want to support other mapping functions at some point. If you think
this is not needed I can reduce the mapping function to only the code
that differs between the mapping functions, e.g. we could turn
pblk_map_page_data into a pblk_map_sec_data.
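
Roughly like this (untested sketch - it is just the non-parity branch
of pblk_rail_map_sec() above pulled into a shared per-sector helper,
assuming the regular map path wants the same body):

static void pblk_map_sec_data(struct pblk *pblk, struct pblk_line *line,
			      int sentry, struct pblk_sec_meta *meta_list,
			      __le64 *lba_list, struct ppa_addr ppa)
{
	struct pblk_w_ctx *w_ctx;
	__le64 addr_empty = cpu_to_le64(ADDR_EMPTY);

	/* Map one sector straight from its write buffer entry */
	w_ctx = pblk_rb_w_ctx(&pblk->rwb, sentry);
	w_ctx->ppa = ppa;
	meta_list->lba = cpu_to_le64(w_ctx->lba);
	*lba_list = cpu_to_le64(w_ctx->lba);

	if (*lba_list != addr_empty)
		line->nr_valid_lbas++;
	else
		atomic64_inc(&pblk->pad_wa);
}

pblk_rail_map_sec() would then keep only the PBLK_RAIL_PARITY_WRITE
branch and call this helper for the regular case, and the "normal"
map path could call the same helper from its mapping loop.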

>
> > +
> > +/* RAIL Initialization and tear down */
> > +int pblk_rail_init(struct pblk *pblk)
> > +{
> > +       struct pblk_line_meta *lm = &pblk->lm;
> > +       int i, p2be;
> > +       unsigned int nr_strides;
> > +       unsigned int psecs;
> > +       void *kaddr;
> > +
> > +       if (!PBLK_RAIL_STRIDE_WIDTH)
> > +               return 0;
> > +
> > +       if (((lm->blk_per_line % PBLK_RAIL_STRIDE_WIDTH) != 0) ||
> > +           (lm->blk_per_line < PBLK_RAIL_STRIDE_WIDTH)) {
> > +               pr_err("pblk: unsupported RAIL stride %i\n", lm->blk_per_line);
> > +               return -EINVAL;
> > +       }
>
> This is just a check of the maximum blocks per line - bad blocks will
> reduce the number of writable blocks. What happens when a line goes
> below PBLK_RAIL_STRIDE_WIDTH writable blocks?

This check just guarantees that lm->blk_per_line is a multiple of
PBLK_RAIL_STRIDE_WIDTH. Bad blocks are handled dynamically at runtime
via pblk_rail_valid_sector(pblk, line, cur), which skips parity
computation if the parity block is bad. In theory a line can have
fewer writable blocks than PBLK_RAIL_STRIDE_WIDTH; in that case parity
is simply computed over the smaller number of blocks (e.g. with a
stride width of 4, a stride with one bad data block computes parity
over the remaining two data blocks).

>
> > +
> > +       psecs = pblk_rail_psec_per_stripe(pblk);
> > +       nr_strides = pblk_rail_sec_per_stripe(pblk) / PBLK_RAIL_STRIDE_WIDTH;
> > +
> > +       pblk->rail.p2b = kmalloc_array(nr_strides, sizeof(struct p2b_entry *),
> > +                                      GFP_KERNEL);
> > +       if (!pblk->rail.p2b)
> > +               return -ENOMEM;
> > +
> > +       for (p2be = 0; p2be < nr_strides; p2be++) {
> > +               pblk->rail.p2b[p2be] = kmalloc_array(PBLK_RAIL_STRIDE_WIDTH - 1,
> > +                                              sizeof(struct p2b_entry),
> > +                                              GFP_KERNEL);
> > +               if (!pblk->rail.p2b[p2be])
> > +                       goto free_p2b_entries;
> > +       }
> > +
> > +       pblk->rail.data = kmalloc(psecs * sizeof(void *), GFP_KERNEL);
> > +       if (!pblk->rail.data)
> > +               goto free_p2b_entries;
> > +
> > +       pblk->rail.pages = alloc_pages(GFP_KERNEL, get_count_order(psecs));
> > +       if (!pblk->rail.pages)
> > +               goto free_data;
> > +
> > +       kaddr = page_address(pblk->rail.pages);
> > +       for (i = 0; i < psecs; i++)
> > +               pblk->rail.data[i] = kaddr + i * PBLK_EXPOSED_PAGE_SIZE;
> > +
> > +       pblk->rail.lba = kmalloc_array(psecs, sizeof(u64 *), GFP_KERNEL);
> > +       if (!pblk->rail.lba)
> > +               goto free_pages;
> > +
> > +       /* Subtract parity bits from device capacity */
> > +       pblk->capacity = pblk->capacity * (PBLK_RAIL_STRIDE_WIDTH - 1) /
> > +               PBLK_RAIL_STRIDE_WIDTH;
> > +
> > +       pblk->map_page = pblk_rail_map_page_data;
> > +
> > +       return 0;
> > +
> > +free_pages:
> > +       free_pages((unsigned long)page_address(pblk->rail.pages),
> > +                  get_count_order(psecs));
> > +free_data:
> > +       kfree(pblk->rail.data);
> > +free_p2b_entries:
> > +       for (p2be = p2be - 1; p2be >= 0; p2be--)
> > +               kfree(pblk->rail.p2b[p2be]);
> > +       kfree(pblk->rail.p2b);
> > +
> > +       return -ENOMEM;
> > +}
> > +
> > +void pblk_rail_free(struct pblk *pblk)
> > +{
> > +       unsigned int i;
> > +       unsigned int nr_strides;
> > +       unsigned int psecs;
> > +
> > +       psecs = pblk_rail_psec_per_stripe(pblk);
> > +       nr_strides = pblk_rail_sec_per_stripe(pblk) / PBLK_RAIL_STRIDE_WIDTH;
> > +
> > +       kfree(pblk->rail.lba);
> > +       free_pages((unsigned long)page_address(pblk->rail.pages),
> > +                  get_count_order(psecs));
> > +       kfree(pblk->rail.data);
> > +       for (i = 0; i < nr_strides; i++)
> > +               kfree(pblk->rail.p2b[i]);
> > +       kfree(pblk->rail.p2b);
> > +}
> > +
> > +/* PBLK supports 64 ppas max. By performing RAIL reads, a sector is read using
> > + * multiple ppas which can lead to violation of the 64 ppa limit. In this case,
> > + * split the bio
> > + */
> > +static void pblk_rail_bio_split(struct pblk *pblk, struct bio **bio, int sec)
> > +{
> > +       struct nvm_tgt_dev *dev = pblk->dev;
> > +       struct bio *split;
> > +
> > +       sec *= (dev->geo.csecs >> 9);
> > +
> > +       split = bio_split(*bio, sec, GFP_KERNEL, &pblk_bio_set);
> > +       /* there isn't chance to merge the split bio */
> > +       split->bi_opf |= REQ_NOMERGE;
> > +       bio_set_flag(*bio, BIO_QUEUE_ENTERED);
> > +       bio_chain(split, *bio);
> > +       generic_make_request(*bio);
> > +       *bio = split;
> > +}
> > +
> > +/* RAIL's Write Path */
> > +static int pblk_rail_sched_parity(struct pblk *pblk)
> > +{
> > +       struct pblk_line *line = pblk_line_get_data(pblk);
> > +       unsigned int sec_in_stripe;
> > +
> > +       while (1) {
> > +               sec_in_stripe = line->cur_sec % pblk_rail_sec_per_stripe(pblk);
> > +
> > +               /* Schedule parity write at end of data section */
> > +               if (sec_in_stripe >= pblk_rail_dsec_per_stripe(pblk))
> > +                       return 1;
> > +
> > +               /* Skip bad blocks and meta sectors until we find a valid sec */
> > +               if (test_bit(line->cur_sec, line->map_bitmap))
> > +                       line->cur_sec += pblk->min_write_pgs;
> > +               else
> > +                       break;
> > +       }
> > +
> > +       return 0;
> > +}
> > +
> > +/* Mark RAIL parity sectors as invalid sectors so they will be gc'ed */
> > +void pblk_rail_line_close(struct pblk *pblk, struct pblk_line *line)
> > +{
> > +       int off, bit;
> > +
> > +       for (off = pblk_rail_dsec_per_stripe(pblk);
> > +            off < pblk->lm.sec_per_line;
> > +            off += pblk_rail_sec_per_stripe(pblk)) {
> > +               for (bit = 0; bit < pblk_rail_psec_per_stripe(pblk); bit++)
> > +                       set_bit(off + bit, line->invalid_bitmap);
> > +       }
> > +}
> > +
> > +void pblk_rail_end_io_write(struct nvm_rq *rqd)
> > +{
> > +       struct pblk *pblk = rqd->private;
> > +       struct pblk_c_ctx *c_ctx = nvm_rq_to_pdu(rqd);
> > +
> > +       if (rqd->error) {
> > +               pblk_log_write_err(pblk, rqd);
> > +               return pblk_end_w_fail(pblk, rqd);
>
> The write error recovery path relies on the sentry in c_ctx being an
> index into the write buffer, so this won't work.

You mean a RAIL parity write? Yes, good catch.

>
> Additionally, if a write (data or parity) fails, the whole stride would
> be broken and need to fall back on "normal" reads, right?
> One solution could be to check line->w_err_gc->has_write_err on the read path.

When a data write fails, it is remapped and the RAIL mapping function
tracks the new location in the p2b. The page will be marked bad and
hence taken into account when computing parity for parity writes and
when performing RAIL reads, so the line should still be intact. This
might be insufficiently tested, but in theory it should work.

>
> > +       }
> > +#ifdef CONFIG_NVM_DEBUG
> > +       else
> > +               WARN_ONCE(rqd->bio->bi_status, "pblk: corrupted write error\n");
> > +#endif
> > +
> > +       pblk_up_rq(pblk, c_ctx->lun_bitmap);
> > +
> > +       pblk_rq_to_line_put(pblk, rqd);
> > +       bio_put(rqd->bio);
> > +       pblk_free_rqd(pblk, rqd, PBLK_WRITE);
> > +
> > +       atomic_dec(&pblk->inflight_io);
> > +}
> > +
> > +static int pblk_rail_read_to_bio(struct pblk *pblk, struct nvm_rq *rqd,
> > +                         struct bio *bio, unsigned int stride,
> > +                         unsigned int nr_secs, unsigned int paddr)
> > +{
> > +       struct pblk_c_ctx *c_ctx = nvm_rq_to_pdu(rqd);
> > +       int sec, i;
> > +       int nr_data = PBLK_RAIL_STRIDE_WIDTH - 1;
> > +       struct pblk_line *line = pblk_line_get_data(pblk);
> > +
> > +       c_ctx->nr_valid = nr_secs;
> > +       /* sentry indexes rail page buffer, instead of rwb */
> > +       c_ctx->sentry = stride * pblk->min_write_pgs;
> > +       c_ctx->sentry |= PBLK_RAIL_PARITY_WRITE;
> > +
> > +       for (sec = 0; sec < pblk->min_write_pgs; sec++) {
> > +               void *pg_addr;
> > +               struct page *page;
> > +               u64 *lba;
> > +
> > +               lba = &pblk->rail.lba[stride * pblk->min_write_pgs + sec];
> > +               pg_addr = pblk->rail.data[stride * pblk->min_write_pgs + sec];
> > +               page = virt_to_page(pg_addr);
> > +
> > +               if (!page) {
> > +                       pr_err("pblk: could not allocate RAIL bio page %p\n",
> > +                              pg_addr);
> > +                       return -NVM_IO_ERR;
> > +               }
> > +
> > +               if (bio_add_page(bio, page, pblk->rwb.seg_size, 0) !=
> > +                   pblk->rwb.seg_size) {
> > +                       pr_err("pblk: could not add page to RAIL bio\n");
> > +                       return -NVM_IO_ERR;
> > +               }
> > +
> > +               *lba = 0;
> > +               memset(pg_addr, 0, PBLK_EXPOSED_PAGE_SIZE);
> > +
> > +               for (i = 0; i < nr_data; i++) {
> > +                       struct pblk_rb_entry *entry;
> > +                       struct pblk_w_ctx *w_ctx;
> > +                       u64 lba_src;
> > +                       unsigned int pos;
> > +                       unsigned int cur;
> > +                       int distance = pblk_rail_psec_per_stripe(pblk);
> > +
> > +                       cur = paddr - distance * (nr_data - i) + sec;
> > +
> > +                       if (!pblk_rail_valid_sector(pblk, line, cur))
> > +                               continue;
> > +
> > +                       pos = pblk->rail.p2b[stride][i].pos;
> > +                       pos = pblk_rb_wrap_pos(&pblk->rwb, pos + sec);
> > +                       entry = &pblk->rwb.entries[pos];
> > +                       w_ctx = &entry->w_ctx;
> > +                       lba_src = w_ctx->lba;
> > +
> > +                       if (sec < pblk->rail.p2b[stride][i].nr_valid &&
> > +                           lba_src != ADDR_EMPTY) {
> > +                               pblk_rail_data_parity(pg_addr, entry->data);
> > +                               pblk_rail_lba_parity(lba, &lba_src);
>
> What keeps the parity lba values from invalidating "real" data lbas
> during recovery?

The RAIL geometry is known during recovery, so the parity LBAs can be
ignored; this is just not implemented yet.

>
> > +                       }
> > +               }
> > +       }
> > +
> > +       return 0;
> > +}
> > +
> > +int pblk_rail_submit_write(struct pblk *pblk)
> > +{
> > +       int i;
> > +       struct nvm_rq *rqd;
> > +       struct bio *bio;
> > +       struct pblk_line *line = pblk_line_get_data(pblk);
> > +       int start, end, bb_offset;
> > +       unsigned int stride = 0;
> > +
> > +       if (!pblk_rail_sched_parity(pblk))
> > +               return 0;
> > +
> > +       start = line->cur_sec;
> > +       bb_offset = start % pblk_rail_sec_per_stripe(pblk);
> > +       end = start + pblk_rail_sec_per_stripe(pblk) - bb_offset;
> > +
> > +       for (i = start; i < end; i += pblk->min_write_pgs, stride++) {
> > +               /* Do not generate parity in this slot if the sec is bad
> > +                * or reserved for meta.
> > +                * We check on the read path and perform a conventional
> > +                * read, to avoid reading parity from the bad block
> > +                */
> > +               if (!pblk_rail_valid_sector(pblk, line, i))
> > +                       continue;
> > +
> > +               rqd = pblk_alloc_rqd(pblk, PBLK_WRITE);
> > +               if (IS_ERR(rqd)) {
> > +                       pr_err("pblk: cannot allocate parity write req.\n");
> > +                       return -ENOMEM;
> > +               }
> > +
> > +               bio = bio_alloc(GFP_KERNEL, pblk->min_write_pgs);
> > +               if (!bio) {
> > +                       pr_err("pblk: cannot allocate parity write bio\n");
> > +                       pblk_free_rqd(pblk, rqd, PBLK_WRITE);
> > +                       return -ENOMEM;
> > +               }
> > +
> > +               bio->bi_iter.bi_sector = 0; /* internal bio */
> > +               bio_set_op_attrs(bio, REQ_OP_WRITE, 0);
> > +               rqd->bio = bio;
> > +
> > +               pblk_rail_read_to_bio(pblk, rqd, bio, stride,
> > +                                     pblk->min_write_pgs, i);
> > +
> > +               if (pblk_submit_io_set(pblk, rqd, pblk_rail_end_io_write)) {
> > +                       bio_put(rqd->bio);
> > +                       pblk_free_rqd(pblk, rqd, PBLK_WRITE);
> > +
> > +                       return -NVM_IO_ERR;
> > +               }
> > +       }
> > +
> > +       return 0;
> > +}
> > +
> > +/* RAIL's Read Path */
> > +static void pblk_rail_end_io_read(struct nvm_rq *rqd)
> > +{
> > +       struct pblk *pblk = rqd->private;
> > +       struct pblk_g_ctx *r_ctx = nvm_rq_to_pdu(rqd);
> > +       struct pblk_pr_ctx *pr_ctx = r_ctx->private;
> > +       struct bio *new_bio = rqd->bio;
> > +       struct bio *bio = pr_ctx->orig_bio;
> > +       struct bio_vec src_bv, dst_bv;
> > +       struct pblk_sec_meta *meta_list = rqd->meta_list;
> > +       int bio_init_idx = pr_ctx->bio_init_idx;
> > +       int nr_secs = pr_ctx->orig_nr_secs;
> > +       __le64 *lba_list_mem, *lba_list_media;
> > +       __le64 addr_empty = cpu_to_le64(ADDR_EMPTY);
> > +       void *src_p, *dst_p;
> > +       int i, r, rail_ppa = 0;
> > +       unsigned char valid;
> > +
> > +       if (unlikely(rqd->nr_ppas == 1)) {
> > +               struct ppa_addr ppa;
> > +
> > +               ppa = rqd->ppa_addr;
> > +               rqd->ppa_list = pr_ctx->ppa_ptr;
> > +               rqd->dma_ppa_list = pr_ctx->dma_ppa_list;
> > +               rqd->ppa_list[0] = ppa;
> > +       }
> > +
> > +       /* Re-use allocated memory for intermediate lbas */
> > +       lba_list_mem = (((void *)rqd->ppa_list) + pblk_dma_ppa_size);
> > +       lba_list_media = (((void *)rqd->ppa_list) + 2 * pblk_dma_ppa_size);
> > +
> > +       for (i = 0; i < rqd->nr_ppas; i++)
> > +               lba_list_media[i] = meta_list[i].lba;
> > +       for (i = 0; i < nr_secs; i++)
> > +               meta_list[i].lba = lba_list_mem[i];
> > +
> > +       for (i = 0; i < nr_secs; i++) {
> > +               struct pblk_line *line;
> > +               u64 meta_lba = 0x0UL, mlba;
> > +
> > +               line = pblk_ppa_to_line(pblk, rqd->ppa_list[rail_ppa]);
> > +
> > +               valid = bitmap_weight(pr_ctx->bitmap, PBLK_RAIL_STRIDE_WIDTH);
> > +               bitmap_shift_right(pr_ctx->bitmap, pr_ctx->bitmap,
> > +                                  PBLK_RAIL_STRIDE_WIDTH, PR_BITMAP_SIZE);
> > +
> > +               if (valid == 0) /* Skip cached reads */
> > +                       continue;
> > +
> > +               kref_put(&line->ref, pblk_line_put);
> > +
> > +               dst_bv = bio->bi_io_vec[bio_init_idx + i];
> > +               dst_p = kmap_atomic(dst_bv.bv_page);
> > +
> > +               memset(dst_p + dst_bv.bv_offset, 0, PBLK_EXPOSED_PAGE_SIZE);
> > +               meta_list[i].lba = cpu_to_le64(0x0UL);
> > +
> > +               for (r = 0; r < valid; r++, rail_ppa++) {
> > +                       src_bv = new_bio->bi_io_vec[rail_ppa];
> > +
> > +                       if (lba_list_media[rail_ppa] != addr_empty) {
> > +                               src_p = kmap_atomic(src_bv.bv_page);
> > +                               pblk_rail_data_parity(dst_p + dst_bv.bv_offset,
> > +                                                     src_p + src_bv.bv_offset);
> > +                               mlba = le64_to_cpu(lba_list_media[rail_ppa]);
> > +                               pblk_rail_lba_parity(&meta_lba, &mlba);
> > +                               kunmap_atomic(src_p);
> > +                       }
> > +
> > +                       mempool_free(src_bv.bv_page, &pblk->page_bio_pool);
> > +               }
> > +               meta_list[i].lba = cpu_to_le64(meta_lba);
> > +               kunmap_atomic(dst_p);
> > +       }
> > +
> > +       bio_put(new_bio);
> > +       rqd->nr_ppas = pr_ctx->orig_nr_secs;
> > +       kfree(pr_ctx);
> > +       rqd->bio = NULL;
> > +
> > +       bio_endio(bio);
> > +       __pblk_end_io_read(pblk, rqd, false);
> > +}
> > +
> > +/* Converts original ppa into ppa list of RAIL reads */
> > +static int pblk_rail_setup_ppas(struct pblk *pblk, struct ppa_addr ppa,
> > +                               struct ppa_addr *rail_ppas,
> > +                               unsigned char *pvalid, int *nr_rail_ppas,
> > +                               int *rail_reads)
> > +{
> > +       struct nvm_tgt_dev *dev = pblk->dev;
> > +       struct nvm_geo *geo = &dev->geo;
> > +       struct ppa_addr rail_ppa = ppa;
> > +       unsigned int lun_pos = pblk_ppa_to_pos(geo, ppa);
> > +       unsigned int strides = pblk_rail_nr_parity_luns(pblk);
> > +       struct pblk_line *line;
> > +       unsigned int i;
> > +       int ppas = *nr_rail_ppas;
> > +       int valid = 0;
> > +
> > +       for (i = 1; i < PBLK_RAIL_STRIDE_WIDTH; i++) {
> > +               unsigned int neighbor, lun, chnl;
> > +               int laddr;
> > +
> > +               neighbor = pblk_rail_wrap_lun(pblk, lun_pos + i * strides);
> > +
> > +               lun = pblk_pos_to_lun(geo, neighbor);
> > +               chnl = pblk_pos_to_chnl(geo, neighbor);
> > +               pblk_dev_ppa_set_lun(&rail_ppa, lun);
> > +               pblk_dev_ppa_set_chnl(&rail_ppa, chnl);
> > +
> > +               line = pblk_ppa_to_line(pblk, rail_ppa);
> > +               laddr = pblk_dev_ppa_to_line_addr(pblk, rail_ppa);
> > +
> > +               /* Do not read from bad blocks */
> > +               if (!pblk_rail_valid_sector(pblk, line, laddr)) {
> > +                       /* Perform regular read if parity sector is bad */
> > +                       if (neighbor >= pblk_rail_nr_data_luns(pblk))
> > +                               return 0;
> > +
> > +                       /* If any other neighbor is bad we can just skip it */
> > +                       continue;
> > +               }
> > +
> > +               rail_ppas[ppas++] = rail_ppa;
> > +               valid++;
> > +       }
> > +
> > +       if (valid == 1)
> > +               return 0;
> > +
> > +       *pvalid = valid;
> > +       *nr_rail_ppas = ppas;
> > +       (*rail_reads)++;
> > +       return 1;
> > +}
> > +
> > +static void pblk_rail_set_bitmap(struct pblk *pblk, struct ppa_addr *ppa_list,
> > +                                int ppa, struct ppa_addr *rail_ppa_list,
> > +                                int *nr_rail_ppas, unsigned long *read_bitmap,
> > +                                unsigned long *pvalid, int *rail_reads)
> > +{
> > +       unsigned char valid;
> > +
> > +       if (test_bit(ppa, read_bitmap))
> > +               return;
> > +
> > +       if (pblk_rail_lun_busy(pblk, ppa_list[ppa]) &&
> > +           pblk_rail_setup_ppas(pblk, ppa_list[ppa],
> > +                                rail_ppa_list, &valid,
> > +                                nr_rail_ppas, rail_reads)) {
> > +               WARN_ON(test_and_set_bit(ppa, read_bitmap));
> > +               bitmap_set(pvalid, ppa * PBLK_RAIL_STRIDE_WIDTH, valid);
> > +       } else {
> > +               rail_ppa_list[(*nr_rail_ppas)++] = ppa_list[ppa];
> > +               bitmap_set(pvalid, ppa * PBLK_RAIL_STRIDE_WIDTH, 1);
> > +       }
> > +}
> > +
> > +int pblk_rail_read_bio(struct pblk *pblk, struct nvm_rq *rqd, int blba,
> > +                      unsigned long *read_bitmap, int bio_init_idx,
> > +                      struct bio **bio)
> > +{
> > +       struct pblk_g_ctx *r_ctx = nvm_rq_to_pdu(rqd);
> > +       struct pblk_pr_ctx *pr_ctx;
> > +       struct ppa_addr rail_ppa_list[NVM_MAX_VLBA];
> > +       DECLARE_BITMAP(pvalid, PR_BITMAP_SIZE);
> > +       int nr_secs = rqd->nr_ppas;
> > +       bool read_empty = bitmap_empty(read_bitmap, nr_secs);
> > +       int nr_rail_ppas = 0, rail_reads = 0;
> > +       int i;
> > +       int ret;
> > +
> > +       /* Fully cached reads should not enter this path */
> > +       WARN_ON(bitmap_full(read_bitmap, nr_secs));
> > +
> > +       bitmap_zero(pvalid, PR_BITMAP_SIZE);
> > +       if (rqd->nr_ppas == 1) {
> > +               pblk_rail_set_bitmap(pblk, &rqd->ppa_addr, 0, rail_ppa_list,
> > +                                    &nr_rail_ppas, read_bitmap, pvalid,
> > +                                    &rail_reads);
> > +
> > +               if (nr_rail_ppas == 1) {
> > +                       memcpy(&rqd->ppa_addr, rail_ppa_list,
> > +                              nr_rail_ppas * sizeof(struct ppa_addr));
> > +               } else {
> > +                       rqd->ppa_list = rqd->meta_list + pblk_dma_meta_size;
> > +                       rqd->dma_ppa_list = rqd->dma_meta_list +
> > +                         pblk_dma_meta_size;
> > +                       memcpy(rqd->ppa_list, rail_ppa_list,
> > +                              nr_rail_ppas * sizeof(struct ppa_addr));
> > +               }
> > +       } else {
> > +               for (i = 0; i < rqd->nr_ppas; i++) {
> > +                       pblk_rail_set_bitmap(pblk, rqd->ppa_list, i,
> > +                                            rail_ppa_list, &nr_rail_ppas,
> > +                                            read_bitmap, pvalid, &rail_reads);
> > +
> > +                       /* Don't split if this it the last ppa of the rqd */
> > +                       if (((nr_rail_ppas + PBLK_RAIL_STRIDE_WIDTH) >=
> > +                            NVM_MAX_VLBA) && (i + 1 < rqd->nr_ppas)) {
> > +                               struct pblk_g_ctx *r_ctx = nvm_rq_to_pdu(rqd);
> > +
> > +                               pblk_rail_bio_split(pblk, bio, i + 1);
> > +                               rqd->nr_ppas = pblk_get_secs(*bio);
> > +                               r_ctx->private = *bio;
> > +                               break;
> > +                       }
> > +               }
> > +               memcpy(rqd->ppa_list, rail_ppa_list,
> > +                      nr_rail_ppas * sizeof(struct ppa_addr));
> > +       }
> > +
> > +       if (bitmap_empty(read_bitmap, rqd->nr_ppas))
> > +               return NVM_IO_REQUEUE;
> > +
> > +       if (read_empty && !bitmap_empty(read_bitmap, rqd->nr_ppas))
> > +               bio_advance(*bio, (rqd->nr_ppas) * PBLK_EXPOSED_PAGE_SIZE);
> > +
> > +       if (pblk_setup_partial_read(pblk, rqd, bio_init_idx, read_bitmap,
> > +                                   nr_rail_ppas))
> > +               return NVM_IO_ERR;
> > +
> > +       rqd->end_io = pblk_rail_end_io_read;
> > +       pr_ctx = r_ctx->private;
> > +       bitmap_copy(pr_ctx->bitmap, pvalid, PR_BITMAP_SIZE);
> > +
> > +       ret = pblk_submit_io(pblk, rqd);
> > +       if (ret) {
> > +               bio_put(rqd->bio);
> > +               pr_err("pblk: partial RAIL read IO submission failed\n");
> > +               /* Free allocated pages in new bio */
> > +               pblk_bio_free_pages(pblk, rqd->bio, 0, rqd->bio->bi_vcnt);
> > +               kfree(pr_ctx);
> > +               __pblk_end_io_read(pblk, rqd, false);
> > +               return NVM_IO_ERR;
> > +       }
> > +
> > +       return NVM_IO_OK;
> > +}
> > diff --git a/drivers/lightnvm/pblk.h b/drivers/lightnvm/pblk.h
> > index bd88784e51d9..01fe4362b27e 100644
> > --- a/drivers/lightnvm/pblk.h
> > +++ b/drivers/lightnvm/pblk.h
> > @@ -28,6 +28,7 @@
> >  #include <linux/vmalloc.h>
> >  #include <linux/crc32.h>
> >  #include <linux/uuid.h>
> > +#include <linux/log2.h>
> >
> >  #include <linux/lightnvm.h>
> >
> > @@ -45,7 +46,7 @@
> >  #define PBLK_COMMAND_TIMEOUT_MS 30000
> >
> >  /* Max 512 LUNs per device */
> > -#define PBLK_MAX_LUNS_BITMAP (4)
> > +#define PBLK_MAX_LUNS_BITMAP (512)
>
> 512 is probably enough for everyone for now, but why not make this dynamic?
> Better not waste memory and introduce an artificial limit on number of luns.

I can make it dynamic. It just makes the init path messier, as
meta_init takes the write semaphore (and hence the busy bitmap), so I
have to init RAIL in the middle of everything else.
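
The allocation itself would be trivial, something like this in
pblk_rail_init() (untested; assumes geo = &pblk->dev->geo and that
busy_bitmap becomes a plain unsigned long * in struct pblk_rail):

	/* size the busy bitmap from the geometry instead of a
	 * compile-time maximum
	 */
	pblk->rail.busy_bitmap = kcalloc(BITS_TO_LONGS(geo->all_luns),
					 sizeof(unsigned long), GFP_KERNEL);
	if (!pblk->rail.busy_bitmap)
		return -ENOMEM;

The messy part is really only the init ordering, not the allocation.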

>
> >
> >  #define NR_PHY_IN_LOG (PBLK_EXPOSED_PAGE_SIZE / PBLK_SECTOR)
> >
> > @@ -123,6 +124,13 @@ struct pblk_g_ctx {
> >         u64 lba;
> >  };
> >
> > +#ifdef CONFIG_NVM_PBLK_RAIL
> > +#define PBLK_RAIL_STRIDE_WIDTH 4
> > +#define PR_BITMAP_SIZE (NVM_MAX_VLBA * PBLK_RAIL_STRIDE_WIDTH)
> > +#else
> > +#define PR_BITMAP_SIZE NVM_MAX_VLBA
> > +#endif
> > +
> >  /* partial read context */
> >  struct pblk_pr_ctx {
> >         struct bio *orig_bio;
> > @@ -604,6 +612,39 @@ struct pblk_addrf {
> >         int sec_ws_stripe;
> >  };
> >
> > +#ifdef CONFIG_NVM_PBLK_RAIL
> > +
> > +struct p2b_entry {
> > +       int pos;
> > +       int nr_valid;
> > +};
> > +
> > +struct pblk_rail {
> > +       struct p2b_entry **p2b;         /* Maps RAIL sectors to rb pos */
> > +       struct page *pages;             /* Pages to hold parity writes */
> > +       void **data;                    /* Buffer that holds parity pages */
> > +       DECLARE_BITMAP(busy_bitmap, PBLK_MAX_LUNS_BITMAP);
> > +       u64 *lba;                       /* Buffer to compute LBA parity */
> > +};
> > +
> > +/* Initialize and tear down RAIL */
> > +int pblk_rail_init(struct pblk *pblk);
> > +void pblk_rail_free(struct pblk *pblk);
> > +/* Adjust some system parameters */
> > +bool pblk_rail_meta_distance(struct pblk_line *data_line);
> > +int pblk_rail_rb_delay(struct pblk_rb *rb);
> > +/* Core */
> > +void pblk_rail_line_close(struct pblk *pblk, struct pblk_line *line);
> > +int pblk_rail_down_stride(struct pblk *pblk, int lun, int timeout);
> > +void pblk_rail_up_stride(struct pblk *pblk, int lun);
> > +/* Write path */
> > +int pblk_rail_submit_write(struct pblk *pblk);
> > +/* Read Path */
> > +int pblk_rail_read_bio(struct pblk *pblk, struct nvm_rq *rqd, int blba,
> > +                      unsigned long *read_bitmap, int bio_init_idx,
> > +                      struct bio **bio);
> > +#endif /* CONFIG_NVM_PBLK_RAIL */
> > +
> >  typedef int (pblk_map_page_fn)(struct pblk *pblk, unsigned int sentry,
> >                                struct ppa_addr *ppa_list,
> >                                unsigned long *lun_bitmap,
> > @@ -1115,6 +1156,26 @@ static inline u64 pblk_dev_ppa_to_line_addr(struct pblk *pblk,
> >         return paddr;
> >  }
> >
> > +static inline int pblk_pos_to_lun(struct nvm_geo *geo, int pos)
> > +{
> > +       return pos >> ilog2(geo->num_ch);
> > +}
> > +
> > +static inline int pblk_pos_to_chnl(struct nvm_geo *geo, int pos)
> > +{
> > +       return pos % geo->num_ch;
> > +}
> > +
> > +static inline void pblk_dev_ppa_set_lun(struct ppa_addr *p, int lun)
> > +{
> > +       p->a.lun = lun;
> > +}
> > +
> > +static inline void pblk_dev_ppa_set_chnl(struct ppa_addr *p, int chnl)
> > +{
> > +       p->a.ch = chnl;
> > +}
>
> What is the motivation for adding the lun and chnl setters? They seem
> uncalled for.

They are used in RAIL's read path to generate the ppas for RAIL reads.

>
> > +
> >  static inline struct ppa_addr pblk_ppa32_to_ppa64(struct pblk *pblk, u32 ppa32)
> >  {
> >         struct nvm_tgt_dev *dev = pblk->dev;
> > --
> > 2.17.1
> >

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 0/6] lightnvm: pblk: Introduce RAIL to enforce low tail read latency
  2018-09-18 11:46 ` [RFC PATCH 0/6] lightnvm: pblk: Introduce RAIL to enforce low tail read latency Hans Holmberg
@ 2018-09-18 16:13   ` Heiner Litz
  2018-09-19  7:58     ` Hans Holmberg
  0 siblings, 1 reply; 17+ messages in thread
From: Heiner Litz @ 2018-09-18 16:13 UTC (permalink / raw)
  To: hans.ml.holmberg
  Cc: linux-block, Javier Gonzalez, Matias Bjørling,
	igor.j.konopko, marcin.dziegielewski

Hi Hans,
thanks a lot for your comments! I will send you a git repo to test. I
have a patch which enables/disables RAIL via ioctl and will send that
as well.

Heiner
On Tue, Sep 18, 2018 at 4:46 AM Hans Holmberg
<hans.ml.holmberg@owltronix.com> wrote:
>
> On Mon, Sep 17, 2018 at 7:29 AM Heiner Litz <hlitz@ucsc.edu> wrote:
> >
> > Hi All,
> > this patchset introduces RAIL, a mechanism to enforce low tail read latency for
> > lightnvm OCSSD devices. RAIL leverages redundancy to guarantee that reads are
> > always served from LUNs that do not serve a high latency operation such as a
> > write or erase. This avoids that reads become serialized behind these operations
> > reducing tail latency by ~10x. In particular, in the absence of ECC read errors,
> > it provides 99.99 percentile read latencies of below 500us. RAIL introduces
> > capacity overheads (7%-25%) due to RAID-5 like striping (providing fault
> > tolerance) and reduces the maximum write bandwidth to 110K IOPS on CNEX SSD.
> >
> > This patch is based on pblk/core and requires two additional patches from Javier
> > to be applicable (let me know if you want me to rebase):
>
> As the patches do not apply, could you make a branch available so I
> can get hold of the code in it's present state?
> That would make reviewing and testing so much easier.
>
> I have some concerns regarding recovery and write error handling, but
> I have not found anything that can't be fixed.
> I also believe that rail on/off and stride width should not be
> configured at build time, but instead be part of the create IOCTL.
>
> See my comments on the individual patches for details.
>
> >
> > The 1st patch exposes some existing APIs so they can be used by RAIL
> > The 2nd patch introduces a configurable sector mapping function
> > The 3rd patch refactors the write path so the end_io_fn can be specified when
> > setting up the request
> > The 4th patch adds a new submit io function that acquires the write semaphore
> > The 5th patch introduces the RAIL feature and its API
> > The 6th patch integrates RAIL into pblk's read and write path
> >
> >

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 5/6] lightnvm: pblk: Add RAIL interface
  2018-09-18 16:11     ` Heiner Litz
@ 2018-09-19  7:53       ` Hans Holmberg
  2018-09-20 23:58         ` Heiner Litz
  0 siblings, 1 reply; 17+ messages in thread
From: Hans Holmberg @ 2018-09-19  7:53 UTC (permalink / raw)
  To: hlitz
  Cc: linux-block, Javier Gonzalez, Matias Bjorling, igor.j.konopko,
	marcin.dziegielewski

On Tue, Sep 18, 2018 at 6:11 PM Heiner Litz <hlitz@ucsc.edu> wrote:
>
> On Tue, Sep 18, 2018 at 4:28 AM Hans Holmberg
> <hans.ml.holmberg@owltronix.com> wrote:
> >
> > On Mon, Sep 17, 2018 at 7:30 AM Heiner Litz <hlitz@ucsc.edu> wrote:
> > >
> > > In prepartion of supporting RAIL, add the RAIL API.
> > >
> > > Signed-off-by: Heiner Litz <hlitz@ucsc.edu>
> > > ---
> > >  drivers/lightnvm/pblk-rail.c | 808 +++++++++++++++++++++++++++++++++++
> > >  drivers/lightnvm/pblk.h      |  63 ++-
> > >  2 files changed, 870 insertions(+), 1 deletion(-)
> > >  create mode 100644 drivers/lightnvm/pblk-rail.c
> > >
> > > diff --git a/drivers/lightnvm/pblk-rail.c b/drivers/lightnvm/pblk-rail.c
> > > new file mode 100644
> > > index 000000000000..a48ed31a0ba9
> > > --- /dev/null
> > > +++ b/drivers/lightnvm/pblk-rail.c
> > > @@ -0,0 +1,808 @@
> > > +/*
> > > + * Copyright (C) 2018 Heiner Litz
> > > + * Initial release: Heiner Litz <hlitz@ucsc.edu>
> > > + *
> > > + * This program is free software; you can redistribute it and/or
> > > + * modify it under the terms of the GNU General Public License version
> > > + * 2 as published by the Free Software Foundation.
> > > + *
> > > + * This program is distributed in the hope that it will be useful, but
> > > + * WITHOUT ANY WARRANTY; without even the implied warranty of
> > > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> > > + * General Public License for more details.
> > > + *
> > > + * pblk-rail.c - pblk's RAIL path
> > > + */
> > > +
> > > +#include "pblk.h"
> > > +
> > > +#define PBLK_RAIL_EMPTY ~0x0
> > This constant is not being used.
>
> thanks, will remove
>
> > > +#define PBLK_RAIL_PARITY_WRITE 0x8000
> > Where does this magic number come from? Please document.
>
> ok , will document
>
> >
> > > +
> > > +/* RAIL auxiliary functions */
> > > +static unsigned int pblk_rail_nr_parity_luns(struct pblk *pblk)
> > > +{
> > > +       struct pblk_line_meta *lm = &pblk->lm;
> > > +
> > > +       return lm->blk_per_line / PBLK_RAIL_STRIDE_WIDTH;
> > > +}
> > > +
> > > +static unsigned int pblk_rail_nr_data_luns(struct pblk *pblk)
> > > +{
> > > +       struct pblk_line_meta *lm = &pblk->lm;
> > > +
> > > +       return lm->blk_per_line - pblk_rail_nr_parity_luns(pblk);
> > > +}
> > > +
> > > +static unsigned int pblk_rail_sec_per_stripe(struct pblk *pblk)
> > > +{
> > > +       struct pblk_line_meta *lm = &pblk->lm;
> > > +
> > > +       return lm->blk_per_line * pblk->min_write_pgs;
> > > +}
> > > +
> > > +static unsigned int pblk_rail_psec_per_stripe(struct pblk *pblk)
> > > +{
> > > +       return pblk_rail_nr_parity_luns(pblk) * pblk->min_write_pgs;
> > > +}
> > > +
> > > +static unsigned int pblk_rail_dsec_per_stripe(struct pblk *pblk)
> > > +{
> > > +       return pblk_rail_sec_per_stripe(pblk) - pblk_rail_psec_per_stripe(pblk);
> > > +}
> > > +
> > > +static unsigned int pblk_rail_wrap_lun(struct pblk *pblk, unsigned int lun)
> > > +{
> > > +       struct pblk_line_meta *lm = &pblk->lm;
> > > +
> > > +       return (lun & (lm->blk_per_line - 1));
> > > +}
> > > +
> > > +bool pblk_rail_meta_distance(struct pblk_line *data_line)
> > > +{
> > > +       return (data_line->meta_distance % PBLK_RAIL_STRIDE_WIDTH) == 0;
> > > +}
> > > +
> > > +/* Notify readers that LUN is serving high latency operation */
> > > +static void pblk_rail_notify_reader_down(struct pblk *pblk, int lun)
> > > +{
> > > +       WARN_ON(test_and_set_bit(lun, pblk->rail.busy_bitmap));
> > > +       /* Make sure that busy bit is seen by reader before proceeding */
> > > +       smp_mb__after_atomic();
> > > +}
> > > +
> > > +static void pblk_rail_notify_reader_up(struct pblk *pblk, int lun)
> > > +{
> > > +       /* Make sure that write is completed before releasing busy bit */
> > > +       smp_mb__before_atomic();
> > > +       WARN_ON(!test_and_clear_bit(lun, pblk->rail.busy_bitmap));
> > > +}
> > > +
> > > +int pblk_rail_lun_busy(struct pblk *pblk, struct ppa_addr ppa)
> > > +{
> > > +       struct nvm_tgt_dev *dev = pblk->dev;
> > > +       struct nvm_geo *geo = &dev->geo;
> > > +       int lun_pos = pblk_ppa_to_pos(geo, ppa);
> > > +
> > > +       return test_bit(lun_pos, pblk->rail.busy_bitmap);
> > > +}
> > > +
> > > +/* Enforces one writer per stride */
> > > +int pblk_rail_down_stride(struct pblk *pblk, int lun_pos, int timeout)
> > > +{
> > > +       struct pblk_lun *rlun;
> > > +       int strides = pblk_rail_nr_parity_luns(pblk);
> > > +       int stride = lun_pos % strides;
> > > +       int ret;
> > > +
> > > +       rlun = &pblk->luns[stride];
> > > +       ret = down_timeout(&rlun->wr_sem, timeout);
> > > +       pblk_rail_notify_reader_down(pblk, lun_pos);
> > > +
> > > +       return ret;
> > > +}
> > > +
> > > +void pblk_rail_up_stride(struct pblk *pblk, int lun_pos)
> > > +{
> > > +       struct pblk_lun *rlun;
> > > +       int strides = pblk_rail_nr_parity_luns(pblk);
> > > +       int stride = lun_pos % strides;
> > > +
> > > +       pblk_rail_notify_reader_up(pblk, lun_pos);
> > > +       rlun = &pblk->luns[stride];
> > > +       up(&rlun->wr_sem);
> > > +}
> > > +
> > > +/* Determine whether a sector holds data, meta or is bad*/
> > > +bool pblk_rail_valid_sector(struct pblk *pblk, struct pblk_line *line, int pos)
> > > +{
> > > +       struct pblk_line_meta *lm = &pblk->lm;
> > > +       struct nvm_tgt_dev *dev = pblk->dev;
> > > +       struct nvm_geo *geo = &dev->geo;
> > > +       struct ppa_addr ppa;
> > > +       int lun;
> > > +
> > > +       if (pos >= line->smeta_ssec && pos < (line->smeta_ssec + lm->smeta_sec))
> > > +               return false;
> > > +
> > > +       if (pos >= line->emeta_ssec &&
> > > +           pos < (line->emeta_ssec + lm->emeta_sec[0]))
> > > +               return false;
> > > +
> > > +       ppa = addr_to_gen_ppa(pblk, pos, line->id);
> > > +       lun = pblk_ppa_to_pos(geo, ppa);
> > > +
> > > +       return !test_bit(lun, line->blk_bitmap);
> > > +}
> > > +
> > > +/* Delay rb overwrite until whole stride has been written */
> > > +int pblk_rail_rb_delay(struct pblk_rb *rb)
> > > +{
> > > +       struct pblk *pblk = container_of(rb, struct pblk, rwb);
> > > +
> > > +       return pblk_rail_sec_per_stripe(pblk);
> > > +}
> > > +
> > > +static unsigned int pblk_rail_sec_to_stride(struct pblk *pblk, unsigned int sec)
> > > +{
> > > +       unsigned int sec_in_stripe = sec % pblk_rail_sec_per_stripe(pblk);
> > > +       int page = sec_in_stripe / pblk->min_write_pgs;
> > > +
> > > +       return page % pblk_rail_nr_parity_luns(pblk);
> > > +}
> > > +
> > > +static unsigned int pblk_rail_sec_to_idx(struct pblk *pblk, unsigned int sec)
> > > +{
> > > +       unsigned int sec_in_stripe = sec % pblk_rail_sec_per_stripe(pblk);
> > > +
> > > +       return sec_in_stripe / pblk_rail_psec_per_stripe(pblk);
> > > +}
> > > +
> > > +static void pblk_rail_data_parity(void *dest, void *src)
> > > +{
> > > +       unsigned int i;
> > > +
> > > +       for (i = 0; i < PBLK_EXPOSED_PAGE_SIZE / sizeof(unsigned long); i++)
> > > +               ((unsigned long *)dest)[i] ^= ((unsigned long *)src)[i];
> > > +}
> > > +
> > > +static void pblk_rail_lba_parity(u64 *dest, u64 *src)
> > > +{
> > > +       *dest ^= *src;
> > > +}
> > > +
> > > +/* Tracks where a sector is located in the rwb */
> > > +void pblk_rail_track_sec(struct pblk *pblk, struct pblk_line *line, int cur_sec,
> > > +                        int sentry, int nr_valid)
> > > +{
> > > +       int stride, idx, pos;
> > > +
> > > +       stride = pblk_rail_sec_to_stride(pblk, cur_sec);
> > > +       idx = pblk_rail_sec_to_idx(pblk, cur_sec);
> > > +       pos = pblk_rb_wrap_pos(&pblk->rwb, sentry);
> > > +       pblk->rail.p2b[stride][idx].pos = pos;
> > > +       pblk->rail.p2b[stride][idx].nr_valid = nr_valid;
> > > +}
> > > +
> > > +/* RAIL's sector mapping function */
> > > +static void pblk_rail_map_sec(struct pblk *pblk, struct pblk_line *line,
> > > +                             int sentry, struct pblk_sec_meta *meta_list,
> > > +                             __le64 *lba_list, struct ppa_addr ppa)
> > > +{
> > > +       struct pblk_w_ctx *w_ctx;
> > > +       __le64 addr_empty = cpu_to_le64(ADDR_EMPTY);
> > > +
> > > +       kref_get(&line->ref);
> > > +
> > > +       if (sentry & PBLK_RAIL_PARITY_WRITE) {
> > > +               u64 *lba;
> > > +
> > > +               sentry &= ~PBLK_RAIL_PARITY_WRITE;
> > > +               lba = &pblk->rail.lba[sentry];
> > > +               meta_list->lba = cpu_to_le64(*lba);
> > > +               *lba_list = cpu_to_le64(*lba);
> > > +               line->nr_valid_lbas++;
> > > +       } else {
> > > +               w_ctx = pblk_rb_w_ctx(&pblk->rwb, sentry);
> > > +               w_ctx->ppa = ppa;
> > > +               meta_list->lba = cpu_to_le64(w_ctx->lba);
> > > +               *lba_list = cpu_to_le64(w_ctx->lba);
> > > +
> > > +               if (*lba_list != addr_empty)
> > > +                       line->nr_valid_lbas++;
> > > +               else
> > > +                       atomic64_inc(&pblk->pad_wa);
> > > +       }
> > > +}
> > > +
> > > +int pblk_rail_map_page_data(struct pblk *pblk, unsigned int sentry,
> > > +                           struct ppa_addr *ppa_list,
> > > +                           unsigned long *lun_bitmap,
> > > +                           struct pblk_sec_meta *meta_list,
> > > +                           unsigned int valid_secs)
> > > +{
> > > +       struct pblk_line *line = pblk_line_get_data(pblk);
> > > +       struct pblk_emeta *emeta;
> > > +       __le64 *lba_list;
> > > +       u64 paddr;
> > > +       int nr_secs = pblk->min_write_pgs;
> > > +       int i;
> > > +
> > > +       if (pblk_line_is_full(line)) {
> > > +               struct pblk_line *prev_line = line;
> > > +
> > > +               /* If we cannot allocate a new line, make sure to store metadata
> > > +                * on current line and then fail
> > > +                */
> > > +               line = pblk_line_replace_data(pblk);
> > > +               pblk_line_close_meta(pblk, prev_line);
> > > +
> > > +               if (!line)
> > > +                       return -EINTR;
> > > +       }
> > > +
> > > +       emeta = line->emeta;
> > > +       lba_list = emeta_to_lbas(pblk, emeta->buf);
> > > +
> > > +       paddr = pblk_alloc_page(pblk, line, nr_secs);
> > > +
> > > +       pblk_rail_track_sec(pblk, line, paddr, sentry, valid_secs);
> > > +
> > > +       for (i = 0; i < nr_secs; i++, paddr++) {
> > > +               __le64 addr_empty = cpu_to_le64(ADDR_EMPTY);
> > > +
> > > +               /* ppa to be sent to the device */
> > > +               ppa_list[i] = addr_to_gen_ppa(pblk, paddr, line->id);
> > > +
> > > +               /* Write context for target bio completion on write buffer. Note
> > > +                * that the write buffer is protected by the sync backpointer,
> > > +                * and a single writer thread have access to each specific entry
> > > +                * at a time. Thus, it is safe to modify the context for the
> > > +                * entry we are setting up for submission without taking any
> > > +                * lock or memory barrier.
> > > +                */
> > > +               if (i < valid_secs) {
> > > +                       pblk_rail_map_sec(pblk, line, sentry + i, &meta_list[i],
> > > +                                         &lba_list[paddr], ppa_list[i]);
> > > +               } else {
> > > +                       lba_list[paddr] = meta_list[i].lba = addr_empty;
> > > +                       __pblk_map_invalidate(pblk, line, paddr);
> > > +               }
> > > +       }
> > > +
> > > +       pblk_down_rq(pblk, ppa_list[0], lun_bitmap);
> > > +       return 0;
> > > +}
> >
> > This is a lot of duplication of code from the "normal" pblk map
> > function -  could you refactor to avoid this?
>
> I wanted to keep the mapping function as general as possible in case
> we want to support other mapping functions at some point. If you think
> this is not needed I can reduce the mapping function to only the code
> that differs between the mapping functions, e.g. we could turn
> pblk_map_page_data into a pblk_map_sec_data.

I think it would be better to keep the code common as far as we can,
and if we introduce other mapping functions in the future we can
rework the common denominator then.

>
> >
> > > +
> > > +/* RAIL Initialization and tear down */
> > > +int pblk_rail_init(struct pblk *pblk)
> > > +{
> > > +       struct pblk_line_meta *lm = &pblk->lm;
> > > +       int i, p2be;
> > > +       unsigned int nr_strides;
> > > +       unsigned int psecs;
> > > +       void *kaddr;
> > > +
> > > +       if (!PBLK_RAIL_STRIDE_WIDTH)
> > > +               return 0;
> > > +
> > > +       if (((lm->blk_per_line % PBLK_RAIL_STRIDE_WIDTH) != 0) ||
> > > +           (lm->blk_per_line < PBLK_RAIL_STRIDE_WIDTH)) {
> > > +               pr_err("pblk: unsupported RAIL stride %i\n", lm->blk_per_line);
> > > +               return -EINVAL;
> > > +       }
> >
> > This is just a check of the maximum blocks per line - bad blocks will
> > reduce the number of writable blocks. What happens when a line goes
> > below PBLK_RAIL_STRIDE_WIDTH writable blocks?
>
> This check just guarantees that lm->blk_per_line is a multiple of
> PBLK_RAIL_STRIDE_WIDTH. Bad blocks are handled dynamically at runtime
> via pblk_rail_valid_sector(pblk, line, cur), which skips parity
> computation if the parity block is bad. In theory a line can have
> fewer writable blocks than PBLK_RAIL_STRIDE_WIDTH; in that case parity
> is simply computed over the smaller number of blocks (e.g. with a
> stride width of 4, a stride with one bad data block computes parity
> over the remaining two data blocks).

Yes, I see now, it should work.

The only case I see that is problematic is if only the parity block(s)
are non-bad in a line, resulting in no data being written, just parity
(adding a huge write latency penalty) - we could either disable RAIL
for that class of lines or mark them as bad.

>
> >
> > > +
> > > +       psecs = pblk_rail_psec_per_stripe(pblk);
> > > +       nr_strides = pblk_rail_sec_per_stripe(pblk) / PBLK_RAIL_STRIDE_WIDTH;
> > > +
> > > +       pblk->rail.p2b = kmalloc_array(nr_strides, sizeof(struct p2b_entry *),
> > > +                                      GFP_KERNEL);
> > > +       if (!pblk->rail.p2b)
> > > +               return -ENOMEM;
> > > +
> > > +       for (p2be = 0; p2be < nr_strides; p2be++) {
> > > +               pblk->rail.p2b[p2be] = kmalloc_array(PBLK_RAIL_STRIDE_WIDTH - 1,
> > > +                                              sizeof(struct p2b_entry),
> > > +                                              GFP_KERNEL);
> > > +               if (!pblk->rail.p2b[p2be])
> > > +                       goto free_p2b_entries;
> > > +       }
> > > +
> > > +       pblk->rail.data = kmalloc(psecs * sizeof(void *), GFP_KERNEL);
> > > +       if (!pblk->rail.data)
> > > +               goto free_p2b_entries;
> > > +
> > > +       pblk->rail.pages = alloc_pages(GFP_KERNEL, get_count_order(psecs));
> > > +       if (!pblk->rail.pages)
> > > +               goto free_data;
> > > +
> > > +       kaddr = page_address(pblk->rail.pages);
> > > +       for (i = 0; i < psecs; i++)
> > > +               pblk->rail.data[i] = kaddr + i * PBLK_EXPOSED_PAGE_SIZE;
> > > +
> > > +       pblk->rail.lba = kmalloc_array(psecs, sizeof(u64 *), GFP_KERNEL);
> > > +       if (!pblk->rail.lba)
> > > +               goto free_pages;
> > > +
> > > +       /* Subtract parity bits from device capacity */
> > > +       pblk->capacity = pblk->capacity * (PBLK_RAIL_STRIDE_WIDTH - 1) /
> > > +               PBLK_RAIL_STRIDE_WIDTH;
> > > +
> > > +       pblk->map_page = pblk_rail_map_page_data;
> > > +
> > > +       return 0;
> > > +
> > > +free_pages:
> > > +       free_pages((unsigned long)page_address(pblk->rail.pages),
> > > +                  get_count_order(psecs));
> > > +free_data:
> > > +       kfree(pblk->rail.data);
> > > +free_p2b_entries:
> > > +       for (p2be = p2be - 1; p2be >= 0; p2be--)
> > > +               kfree(pblk->rail.p2b[p2be]);
> > > +       kfree(pblk->rail.p2b);
> > > +
> > > +       return -ENOMEM;
> > > +}
> > > +
> > > +void pblk_rail_free(struct pblk *pblk)
> > > +{
> > > +       unsigned int i;
> > > +       unsigned int nr_strides;
> > > +       unsigned int psecs;
> > > +
> > > +       psecs = pblk_rail_psec_per_stripe(pblk);
> > > +       nr_strides = pblk_rail_sec_per_stripe(pblk) / PBLK_RAIL_STRIDE_WIDTH;
> > > +
> > > +       kfree(pblk->rail.lba);
> > > +       free_pages((unsigned long)page_address(pblk->rail.pages),
> > > +                  get_count_order(psecs));
> > > +       kfree(pblk->rail.data);
> > > +       for (i = 0; i < nr_strides; i++)
> > > +               kfree(pblk->rail.p2b[i]);
> > > +       kfree(pblk->rail.p2b);
> > > +}
> > > +
> > > +/* PBLK supports 64 ppas max. By performing RAIL reads, a sector is read using
> > > + * multiple ppas which can lead to violation of the 64 ppa limit. In this case,
> > > + * split the bio
> > > + */
> > > +static void pblk_rail_bio_split(struct pblk *pblk, struct bio **bio, int sec)
> > > +{
> > > +       struct nvm_tgt_dev *dev = pblk->dev;
> > > +       struct bio *split;
> > > +
> > > +       sec *= (dev->geo.csecs >> 9);
> > > +
> > > +       split = bio_split(*bio, sec, GFP_KERNEL, &pblk_bio_set);
> > > +       /* there isn't chance to merge the split bio */
> > > +       split->bi_opf |= REQ_NOMERGE;
> > > +       bio_set_flag(*bio, BIO_QUEUE_ENTERED);
> > > +       bio_chain(split, *bio);
> > > +       generic_make_request(*bio);
> > > +       *bio = split;
> > > +}
> > > +
> > > +/* RAIL's Write Path */
> > > +static int pblk_rail_sched_parity(struct pblk *pblk)
> > > +{
> > > +       struct pblk_line *line = pblk_line_get_data(pblk);
> > > +       unsigned int sec_in_stripe;
> > > +
> > > +       while (1) {
> > > +               sec_in_stripe = line->cur_sec % pblk_rail_sec_per_stripe(pblk);
> > > +
> > > +               /* Schedule parity write at end of data section */
> > > +               if (sec_in_stripe >= pblk_rail_dsec_per_stripe(pblk))
> > > +                       return 1;
> > > +
> > > +               /* Skip bad blocks and meta sectors until we find a valid sec */
> > > +               if (test_bit(line->cur_sec, line->map_bitmap))
> > > +                       line->cur_sec += pblk->min_write_pgs;
> > > +               else
> > > +                       break;
> > > +       }
> > > +
> > > +       return 0;
> > > +}
> > > +
> > > +/* Mark RAIL parity sectors as invalid sectors so they will be gc'ed */
> > > +void pblk_rail_line_close(struct pblk *pblk, struct pblk_line *line)
> > > +{
> > > +       int off, bit;
> > > +
> > > +       for (off = pblk_rail_dsec_per_stripe(pblk);
> > > +            off < pblk->lm.sec_per_line;
> > > +            off += pblk_rail_sec_per_stripe(pblk)) {
> > > +               for (bit = 0; bit < pblk_rail_psec_per_stripe(pblk); bit++)
> > > +                       set_bit(off + bit, line->invalid_bitmap);
> > > +       }
> > > +}
> > > +
> > > +void pblk_rail_end_io_write(struct nvm_rq *rqd)
> > > +{
> > > +       struct pblk *pblk = rqd->private;
> > > +       struct pblk_c_ctx *c_ctx = nvm_rq_to_pdu(rqd);
> > > +
> > > +       if (rqd->error) {
> > > +               pblk_log_write_err(pblk, rqd);
> > > +               return pblk_end_w_fail(pblk, rqd);
> >
> > The write error recovery path relies on the sentry in c_ctx being an
> > index into the write buffer, so this won't work.
>
> You mean a RAIL parity write? Yes, good catch.
>

It does not make sense to re-issue failing parity writes anyway, right?

> >
> > Additionally, if a write (data or parity) fails, the whole stride would
> > be broken and need to fall back on "normal" reads, right?
> > One solution could be to check line->w_err_gc->has_write_err on the read path.
>
> When a data write fails, it is remapped and the RAIL mapping function
> tracks the new location in the p2b. The page will be marked bad and
> hence taken into account when computing parity for parity writes and
> when performing RAIL reads, so the line should still be intact. This
> might be insufficiently tested, but in theory it should work.

As far as I can tell from the code, pblk_rail_valid_sector only checks
if the sector is occupied by metadata or if the whole block is bad.
In the case of a write failure, the block will not be marked bad. What
we could do is keep track of the write pointer internally to check
whether the sector has been successfully written.

I can create a patch for keeping track of the write pointer for each
block - this would be useful for debugging purposes in any case. Once
this is in place it would be easy to add a check in
pblk_rail_valid_sector ensuring that the sector has actually been
written successfully.
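
With such a write pointer in place, the check could look something
like this (sketch only - line->blk_wp[] is hypothetical and would
record, per block in the line, the last line-local sector that
completed without a write error):

static bool pblk_rail_sec_written(struct pblk *pblk, struct pblk_line *line,
				  int pos)
{
	struct nvm_tgt_dev *dev = pblk->dev;
	struct nvm_geo *geo = &dev->geo;
	struct ppa_addr ppa = addr_to_gen_ppa(pblk, pos, line->id);
	int lun = pblk_ppa_to_pos(geo, ppa);

	/* blk_wp[] does not exist yet - see above */
	return pos <= line->blk_wp[lun];
}

pblk_rail_valid_sector() could then also reject sectors that were hit
by (or never written because of) a failed write.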

>
> >
> > > +       }
> > > +#ifdef CONFIG_NVM_DEBUG
> > > +       else
> > > +               WARN_ONCE(rqd->bio->bi_status, "pblk: corrupted write error\n");
> > > +#endif
> > > +
> > > +       pblk_up_rq(pblk, c_ctx->lun_bitmap);
> > > +
> > > +       pblk_rq_to_line_put(pblk, rqd);
> > > +       bio_put(rqd->bio);
> > > +       pblk_free_rqd(pblk, rqd, PBLK_WRITE);
> > > +
> > > +       atomic_dec(&pblk->inflight_io);
> > > +}
> > > +
> > > +static int pblk_rail_read_to_bio(struct pblk *pblk, struct nvm_rq *rqd,
> > > +                         struct bio *bio, unsigned int stride,
> > > +                         unsigned int nr_secs, unsigned int paddr)
> > > +{
> > > +       struct pblk_c_ctx *c_ctx = nvm_rq_to_pdu(rqd);
> > > +       int sec, i;
> > > +       int nr_data = PBLK_RAIL_STRIDE_WIDTH - 1;
> > > +       struct pblk_line *line = pblk_line_get_data(pblk);
> > > +
> > > +       c_ctx->nr_valid = nr_secs;
> > > +       /* sentry indexes rail page buffer, instead of rwb */
> > > +       c_ctx->sentry = stride * pblk->min_write_pgs;
> > > +       c_ctx->sentry |= PBLK_RAIL_PARITY_WRITE;
> > > +
> > > +       for (sec = 0; sec < pblk->min_write_pgs; sec++) {
> > > +               void *pg_addr;
> > > +               struct page *page;
> > > +               u64 *lba;
> > > +
> > > +               lba = &pblk->rail.lba[stride * pblk->min_write_pgs + sec];
> > > +               pg_addr = pblk->rail.data[stride * pblk->min_write_pgs + sec];
> > > +               page = virt_to_page(pg_addr);
> > > +
> > > +               if (!page) {
> > > +                       pr_err("pblk: could not allocate RAIL bio page %p\n",
> > > +                              pg_addr);
> > > +                       return -NVM_IO_ERR;
> > > +               }
> > > +
> > > +               if (bio_add_page(bio, page, pblk->rwb.seg_size, 0) !=
> > > +                   pblk->rwb.seg_size) {
> > > +                       pr_err("pblk: could not add page to RAIL bio\n");
> > > +                       return -NVM_IO_ERR;
> > > +               }
> > > +
> > > +               *lba = 0;
> > > +               memset(pg_addr, 0, PBLK_EXPOSED_PAGE_SIZE);
> > > +
> > > +               for (i = 0; i < nr_data; i++) {
> > > +                       struct pblk_rb_entry *entry;
> > > +                       struct pblk_w_ctx *w_ctx;
> > > +                       u64 lba_src;
> > > +                       unsigned int pos;
> > > +                       unsigned int cur;
> > > +                       int distance = pblk_rail_psec_per_stripe(pblk);
> > > +
> > > +                       cur = paddr - distance * (nr_data - i) + sec;
> > > +
> > > +                       if (!pblk_rail_valid_sector(pblk, line, cur))
> > > +                               continue;
> > > +
> > > +                       pos = pblk->rail.p2b[stride][i].pos;
> > > +                       pos = pblk_rb_wrap_pos(&pblk->rwb, pos + sec);
> > > +                       entry = &pblk->rwb.entries[pos];
> > > +                       w_ctx = &entry->w_ctx;
> > > +                       lba_src = w_ctx->lba;
> > > +
> > > +                       if (sec < pblk->rail.p2b[stride][i].nr_valid &&
> > > +                           lba_src != ADDR_EMPTY) {
> > > +                               pblk_rail_data_parity(pg_addr, entry->data);
> > > +                               pblk_rail_lba_parity(lba, &lba_src);
> >
> > What keeps the parity lba values from invalidating "real" data lbas
> > during recovery?
>
> The RAIL geometry is known during recovery, so the parity LBAs can be
> ignored; this is just not implemented yet.

Ah, it's not in place yet - then it makes sense.
It would be straightforward to implement, using something like
sector_is_parity(line, paddr) to avoid mapping parity sectors during
recovery.
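
Something along these lines should be enough (sketch, not tested; it
only needs the stripe geometry helpers from pblk-rail.c, which would
have to be visible to the recovery code, and the line argument could
even be dropped):

static bool pblk_rail_sec_is_parity(struct pblk *pblk, u64 paddr)
{
	unsigned int sec_in_stripe = paddr % pblk_rail_sec_per_stripe(pblk);

	/* parity occupies the last psec_per_stripe sectors of a stripe */
	return sec_in_stripe >= pblk_rail_dsec_per_stripe(pblk);
}

Recovery would then simply skip mapping any paddr for which this
returns true.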

>
> >
> > > +                       }
> > > +               }
> > > +       }
> > > +
> > > +       return 0;
> > > +}
> > > +
> > > +int pblk_rail_submit_write(struct pblk *pblk)
> > > +{
> > > +       int i;
> > > +       struct nvm_rq *rqd;
> > > +       struct bio *bio;
> > > +       struct pblk_line *line = pblk_line_get_data(pblk);
> > > +       int start, end, bb_offset;
> > > +       unsigned int stride = 0;
> > > +
> > > +       if (!pblk_rail_sched_parity(pblk))
> > > +               return 0;
> > > +
> > > +       start = line->cur_sec;
> > > +       bb_offset = start % pblk_rail_sec_per_stripe(pblk);
> > > +       end = start + pblk_rail_sec_per_stripe(pblk) - bb_offset;
> > > +
> > > +       for (i = start; i < end; i += pblk->min_write_pgs, stride++) {
> > > +               /* Do not generate parity in this slot if the sec is bad
> > > +                * or reserved for meta.
> > > +                * We check on the read path and perform a conventional
> > > +                * read, to avoid reading parity from the bad block
> > > +                */
> > > +               if (!pblk_rail_valid_sector(pblk, line, i))
> > > +                       continue;
> > > +
> > > +               rqd = pblk_alloc_rqd(pblk, PBLK_WRITE);
> > > +               if (IS_ERR(rqd)) {
> > > +                       pr_err("pblk: cannot allocate parity write req.\n");
> > > +                       return -ENOMEM;
> > > +               }
> > > +
> > > +               bio = bio_alloc(GFP_KERNEL, pblk->min_write_pgs);
> > > +               if (!bio) {
> > > +                       pr_err("pblk: cannot allocate parity write bio\n");
> > > +                       pblk_free_rqd(pblk, rqd, PBLK_WRITE);
> > > +                       return -ENOMEM;
> > > +               }
> > > +
> > > +               bio->bi_iter.bi_sector = 0; /* internal bio */
> > > +               bio_set_op_attrs(bio, REQ_OP_WRITE, 0);
> > > +               rqd->bio = bio;
> > > +
> > > +               pblk_rail_read_to_bio(pblk, rqd, bio, stride,
> > > +                                     pblk->min_write_pgs, i);
> > > +
> > > +               if (pblk_submit_io_set(pblk, rqd, pblk_rail_end_io_write)) {
> > > +                       bio_put(rqd->bio);
> > > +                       pblk_free_rqd(pblk, rqd, PBLK_WRITE);
> > > +
> > > +                       return -NVM_IO_ERR;
> > > +               }
> > > +       }
> > > +
> > > +       return 0;
> > > +}
> > > +
> > > +/* RAIL's Read Path */
> > > +static void pblk_rail_end_io_read(struct nvm_rq *rqd)
> > > +{
> > > +       struct pblk *pblk = rqd->private;
> > > +       struct pblk_g_ctx *r_ctx = nvm_rq_to_pdu(rqd);
> > > +       struct pblk_pr_ctx *pr_ctx = r_ctx->private;
> > > +       struct bio *new_bio = rqd->bio;
> > > +       struct bio *bio = pr_ctx->orig_bio;
> > > +       struct bio_vec src_bv, dst_bv;
> > > +       struct pblk_sec_meta *meta_list = rqd->meta_list;
> > > +       int bio_init_idx = pr_ctx->bio_init_idx;
> > > +       int nr_secs = pr_ctx->orig_nr_secs;
> > > +       __le64 *lba_list_mem, *lba_list_media;
> > > +       __le64 addr_empty = cpu_to_le64(ADDR_EMPTY);
> > > +       void *src_p, *dst_p;
> > > +       int i, r, rail_ppa = 0;
> > > +       unsigned char valid;
> > > +
> > > +       if (unlikely(rqd->nr_ppas == 1)) {
> > > +               struct ppa_addr ppa;
> > > +
> > > +               ppa = rqd->ppa_addr;
> > > +               rqd->ppa_list = pr_ctx->ppa_ptr;
> > > +               rqd->dma_ppa_list = pr_ctx->dma_ppa_list;
> > > +               rqd->ppa_list[0] = ppa;
> > > +       }
> > > +
> > > +       /* Re-use allocated memory for intermediate lbas */
> > > +       lba_list_mem = (((void *)rqd->ppa_list) + pblk_dma_ppa_size);
> > > +       lba_list_media = (((void *)rqd->ppa_list) + 2 * pblk_dma_ppa_size);
> > > +
> > > +       for (i = 0; i < rqd->nr_ppas; i++)
> > > +               lba_list_media[i] = meta_list[i].lba;
> > > +       for (i = 0; i < nr_secs; i++)
> > > +               meta_list[i].lba = lba_list_mem[i];
> > > +
> > > +       for (i = 0; i < nr_secs; i++) {
> > > +               struct pblk_line *line;
> > > +               u64 meta_lba = 0x0UL, mlba;
> > > +
> > > +               line = pblk_ppa_to_line(pblk, rqd->ppa_list[rail_ppa]);
> > > +
> > > +               valid = bitmap_weight(pr_ctx->bitmap, PBLK_RAIL_STRIDE_WIDTH);
> > > +               bitmap_shift_right(pr_ctx->bitmap, pr_ctx->bitmap,
> > > +                                  PBLK_RAIL_STRIDE_WIDTH, PR_BITMAP_SIZE);
> > > +
> > > +               if (valid == 0) /* Skip cached reads */
> > > +                       continue;
> > > +
> > > +               kref_put(&line->ref, pblk_line_put);
> > > +
> > > +               dst_bv = bio->bi_io_vec[bio_init_idx + i];
> > > +               dst_p = kmap_atomic(dst_bv.bv_page);
> > > +
> > > +               memset(dst_p + dst_bv.bv_offset, 0, PBLK_EXPOSED_PAGE_SIZE);
> > > +               meta_list[i].lba = cpu_to_le64(0x0UL);
> > > +
> > > +               for (r = 0; r < valid; r++, rail_ppa++) {
> > > +                       src_bv = new_bio->bi_io_vec[rail_ppa];
> > > +
> > > +                       if (lba_list_media[rail_ppa] != addr_empty) {
> > > +                               src_p = kmap_atomic(src_bv.bv_page);
> > > +                               pblk_rail_data_parity(dst_p + dst_bv.bv_offset,
> > > +                                                     src_p + src_bv.bv_offset);
> > > +                               mlba = le64_to_cpu(lba_list_media[rail_ppa]);
> > > +                               pblk_rail_lba_parity(&meta_lba, &mlba);
> > > +                               kunmap_atomic(src_p);
> > > +                       }
> > > +
> > > +                       mempool_free(src_bv.bv_page, &pblk->page_bio_pool);
> > > +               }
> > > +               meta_list[i].lba = cpu_to_le64(meta_lba);
> > > +               kunmap_atomic(dst_p);
> > > +       }
> > > +
> > > +       bio_put(new_bio);
> > > +       rqd->nr_ppas = pr_ctx->orig_nr_secs;
> > > +       kfree(pr_ctx);
> > > +       rqd->bio = NULL;
> > > +
> > > +       bio_endio(bio);
> > > +       __pblk_end_io_read(pblk, rqd, false);
> > > +}
> > > +
> > > +/* Converts original ppa into ppa list of RAIL reads */
> > > +static int pblk_rail_setup_ppas(struct pblk *pblk, struct ppa_addr ppa,
> > > +                               struct ppa_addr *rail_ppas,
> > > +                               unsigned char *pvalid, int *nr_rail_ppas,
> > > +                               int *rail_reads)
> > > +{
> > > +       struct nvm_tgt_dev *dev = pblk->dev;
> > > +       struct nvm_geo *geo = &dev->geo;
> > > +       struct ppa_addr rail_ppa = ppa;
> > > +       unsigned int lun_pos = pblk_ppa_to_pos(geo, ppa);
> > > +       unsigned int strides = pblk_rail_nr_parity_luns(pblk);
> > > +       struct pblk_line *line;
> > > +       unsigned int i;
> > > +       int ppas = *nr_rail_ppas;
> > > +       int valid = 0;
> > > +
> > > +       for (i = 1; i < PBLK_RAIL_STRIDE_WIDTH; i++) {
> > > +               unsigned int neighbor, lun, chnl;
> > > +               int laddr;
> > > +
> > > +               neighbor = pblk_rail_wrap_lun(pblk, lun_pos + i * strides);
> > > +
> > > +               lun = pblk_pos_to_lun(geo, neighbor);
> > > +               chnl = pblk_pos_to_chnl(geo, neighbor);
> > > +               pblk_dev_ppa_set_lun(&rail_ppa, lun);
> > > +               pblk_dev_ppa_set_chnl(&rail_ppa, chnl);
> > > +
> > > +               line = pblk_ppa_to_line(pblk, rail_ppa);
> > > +               laddr = pblk_dev_ppa_to_line_addr(pblk, rail_ppa);
> > > +
> > > +               /* Do not read from bad blocks */
> > > +               if (!pblk_rail_valid_sector(pblk, line, laddr)) {
> > > +                       /* Perform regular read if parity sector is bad */
> > > +                       if (neighbor >= pblk_rail_nr_data_luns(pblk))
> > > +                               return 0;
> > > +
> > > +                       /* If any other neighbor is bad we can just skip it */
> > > +                       continue;
> > > +               }
> > > +
> > > +               rail_ppas[ppas++] = rail_ppa;
> > > +               valid++;
> > > +       }
> > > +
> > > +       if (valid == 1)
> > > +               return 0;
> > > +
> > > +       *pvalid = valid;
> > > +       *nr_rail_ppas = ppas;
> > > +       (*rail_reads)++;
> > > +       return 1;
> > > +}
> > > +
> > > +static void pblk_rail_set_bitmap(struct pblk *pblk, struct ppa_addr *ppa_list,
> > > +                                int ppa, struct ppa_addr *rail_ppa_list,
> > > +                                int *nr_rail_ppas, unsigned long *read_bitmap,
> > > +                                unsigned long *pvalid, int *rail_reads)
> > > +{
> > > +       unsigned char valid;
> > > +
> > > +       if (test_bit(ppa, read_bitmap))
> > > +               return;
> > > +
> > > +       if (pblk_rail_lun_busy(pblk, ppa_list[ppa]) &&
> > > +           pblk_rail_setup_ppas(pblk, ppa_list[ppa],
> > > +                                rail_ppa_list, &valid,
> > > +                                nr_rail_ppas, rail_reads)) {
> > > +               WARN_ON(test_and_set_bit(ppa, read_bitmap));
> > > +               bitmap_set(pvalid, ppa * PBLK_RAIL_STRIDE_WIDTH, valid);
> > > +       } else {
> > > +               rail_ppa_list[(*nr_rail_ppas)++] = ppa_list[ppa];
> > > +               bitmap_set(pvalid, ppa * PBLK_RAIL_STRIDE_WIDTH, 1);
> > > +       }
> > > +}
> > > +
> > > +int pblk_rail_read_bio(struct pblk *pblk, struct nvm_rq *rqd, int blba,
> > > +                      unsigned long *read_bitmap, int bio_init_idx,
> > > +                      struct bio **bio)
> > > +{
> > > +       struct pblk_g_ctx *r_ctx = nvm_rq_to_pdu(rqd);
> > > +       struct pblk_pr_ctx *pr_ctx;
> > > +       struct ppa_addr rail_ppa_list[NVM_MAX_VLBA];
> > > +       DECLARE_BITMAP(pvalid, PR_BITMAP_SIZE);
> > > +       int nr_secs = rqd->nr_ppas;
> > > +       bool read_empty = bitmap_empty(read_bitmap, nr_secs);
> > > +       int nr_rail_ppas = 0, rail_reads = 0;
> > > +       int i;
> > > +       int ret;
> > > +
> > > +       /* Fully cached reads should not enter this path */
> > > +       WARN_ON(bitmap_full(read_bitmap, nr_secs));
> > > +
> > > +       bitmap_zero(pvalid, PR_BITMAP_SIZE);
> > > +       if (rqd->nr_ppas == 1) {
> > > +               pblk_rail_set_bitmap(pblk, &rqd->ppa_addr, 0, rail_ppa_list,
> > > +                                    &nr_rail_ppas, read_bitmap, pvalid,
> > > +                                    &rail_reads);
> > > +
> > > +               if (nr_rail_ppas == 1) {
> > > +                       memcpy(&rqd->ppa_addr, rail_ppa_list,
> > > +                              nr_rail_ppas * sizeof(struct ppa_addr));
> > > +               } else {
> > > +                       rqd->ppa_list = rqd->meta_list + pblk_dma_meta_size;
> > > +                       rqd->dma_ppa_list = rqd->dma_meta_list +
> > > +                         pblk_dma_meta_size;
> > > +                       memcpy(rqd->ppa_list, rail_ppa_list,
> > > +                              nr_rail_ppas * sizeof(struct ppa_addr));
> > > +               }
> > > +       } else {
> > > +               for (i = 0; i < rqd->nr_ppas; i++) {
> > > +                       pblk_rail_set_bitmap(pblk, rqd->ppa_list, i,
> > > +                                            rail_ppa_list, &nr_rail_ppas,
> > > +                                            read_bitmap, pvalid, &rail_reads);
> > > +
> > > +                       /* Don't split if this is the last ppa of the rqd */
> > > +                       if (((nr_rail_ppas + PBLK_RAIL_STRIDE_WIDTH) >=
> > > +                            NVM_MAX_VLBA) && (i + 1 < rqd->nr_ppas)) {
> > > +                               struct pblk_g_ctx *r_ctx = nvm_rq_to_pdu(rqd);
> > > +
> > > +                               pblk_rail_bio_split(pblk, bio, i + 1);
> > > +                               rqd->nr_ppas = pblk_get_secs(*bio);
> > > +                               r_ctx->private = *bio;
> > > +                               break;
> > > +                       }
> > > +               }
> > > +               memcpy(rqd->ppa_list, rail_ppa_list,
> > > +                      nr_rail_ppas * sizeof(struct ppa_addr));
> > > +       }
> > > +
> > > +       if (bitmap_empty(read_bitmap, rqd->nr_ppas))
> > > +               return NVM_IO_REQUEUE;
> > > +
> > > +       if (read_empty && !bitmap_empty(read_bitmap, rqd->nr_ppas))
> > > +               bio_advance(*bio, (rqd->nr_ppas) * PBLK_EXPOSED_PAGE_SIZE);
> > > +
> > > +       if (pblk_setup_partial_read(pblk, rqd, bio_init_idx, read_bitmap,
> > > +                                   nr_rail_ppas))
> > > +               return NVM_IO_ERR;
> > > +
> > > +       rqd->end_io = pblk_rail_end_io_read;
> > > +       pr_ctx = r_ctx->private;
> > > +       bitmap_copy(pr_ctx->bitmap, pvalid, PR_BITMAP_SIZE);
> > > +
> > > +       ret = pblk_submit_io(pblk, rqd);
> > > +       if (ret) {
> > > +               bio_put(rqd->bio);
> > > +               pr_err("pblk: partial RAIL read IO submission failed\n");
> > > +               /* Free allocated pages in new bio */
> > > +               pblk_bio_free_pages(pblk, rqd->bio, 0, rqd->bio->bi_vcnt);
> > > +               kfree(pr_ctx);
> > > +               __pblk_end_io_read(pblk, rqd, false);
> > > +               return NVM_IO_ERR;
> > > +       }
> > > +
> > > +       return NVM_IO_OK;
> > > +}
> > > diff --git a/drivers/lightnvm/pblk.h b/drivers/lightnvm/pblk.h
> > > index bd88784e51d9..01fe4362b27e 100644
> > > --- a/drivers/lightnvm/pblk.h
> > > +++ b/drivers/lightnvm/pblk.h
> > > @@ -28,6 +28,7 @@
> > >  #include <linux/vmalloc.h>
> > >  #include <linux/crc32.h>
> > >  #include <linux/uuid.h>
> > > +#include <linux/log2.h>
> > >
> > >  #include <linux/lightnvm.h>
> > >
> > > @@ -45,7 +46,7 @@
> > >  #define PBLK_COMMAND_TIMEOUT_MS 30000
> > >
> > >  /* Max 512 LUNs per device */
> > > -#define PBLK_MAX_LUNS_BITMAP (4)
> > > +#define PBLK_MAX_LUNS_BITMAP (512)
> >
> > 512 is probably enough for everyone for now, but why not make this dynamic?
> > Better not to waste memory or introduce an artificial limit on the number of LUNs.
>
> I can make it dynamic. It just makes the init path more messy as
> meta_init takes the write semaphore (and hence busy bitmap) so I have
> to init RAIL in the middle of everything else.
>
> >
> > >
> > >  #define NR_PHY_IN_LOG (PBLK_EXPOSED_PAGE_SIZE / PBLK_SECTOR)
> > >
> > > @@ -123,6 +124,13 @@ struct pblk_g_ctx {
> > >         u64 lba;
> > >  };
> > >
> > > +#ifdef CONFIG_NVM_PBLK_RAIL
> > > +#define PBLK_RAIL_STRIDE_WIDTH 4
> > > +#define PR_BITMAP_SIZE (NVM_MAX_VLBA * PBLK_RAIL_STRIDE_WIDTH)
> > > +#else
> > > +#define PR_BITMAP_SIZE NVM_MAX_VLBA
> > > +#endif
> > > +
> > >  /* partial read context */
> > >  struct pblk_pr_ctx {
> > >         struct bio *orig_bio;
> > > @@ -604,6 +612,39 @@ struct pblk_addrf {
> > >         int sec_ws_stripe;
> > >  };
> > >
> > > +#ifdef CONFIG_NVM_PBLK_RAIL
> > > +
> > > +struct p2b_entry {
> > > +       int pos;
> > > +       int nr_valid;
> > > +};
> > > +
> > > +struct pblk_rail {
> > > +       struct p2b_entry **p2b;         /* Maps RAIL sectors to rb pos */
> > > +       struct page *pages;             /* Pages to hold parity writes */
> > > +       void **data;                    /* Buffer that holds parity pages */
> > > +       DECLARE_BITMAP(busy_bitmap, PBLK_MAX_LUNS_BITMAP);
> > > +       u64 *lba;                       /* Buffer to compute LBA parity */
> > > +};
> > > +
> > > +/* Initialize and tear down RAIL */
> > > +int pblk_rail_init(struct pblk *pblk);
> > > +void pblk_rail_free(struct pblk *pblk);
> > > +/* Adjust some system parameters */
> > > +bool pblk_rail_meta_distance(struct pblk_line *data_line);
> > > +int pblk_rail_rb_delay(struct pblk_rb *rb);
> > > +/* Core */
> > > +void pblk_rail_line_close(struct pblk *pblk, struct pblk_line *line);
> > > +int pblk_rail_down_stride(struct pblk *pblk, int lun, int timeout);
> > > +void pblk_rail_up_stride(struct pblk *pblk, int lun);
> > > +/* Write path */
> > > +int pblk_rail_submit_write(struct pblk *pblk);
> > > +/* Read Path */
> > > +int pblk_rail_read_bio(struct pblk *pblk, struct nvm_rq *rqd, int blba,
> > > +                      unsigned long *read_bitmap, int bio_init_idx,
> > > +                      struct bio **bio);
> > > +#endif /* CONFIG_NVM_PBLK_RAIL */
> > > +
> > >  typedef int (pblk_map_page_fn)(struct pblk *pblk, unsigned int sentry,
> > >                                struct ppa_addr *ppa_list,
> > >                                unsigned long *lun_bitmap,
> > > @@ -1115,6 +1156,26 @@ static inline u64 pblk_dev_ppa_to_line_addr(struct pblk *pblk,
> > >         return paddr;
> > >  }
> > >
> > > +static inline int pblk_pos_to_lun(struct nvm_geo *geo, int pos)
> > > +{
> > > +       return pos >> ilog2(geo->num_ch);
> > > +}
> > > +
> > > +static inline int pblk_pos_to_chnl(struct nvm_geo *geo, int pos)
> > > +{
> > > +       return pos % geo->num_ch;
> > > +}
> > > +
> > > +static inline void pblk_dev_ppa_set_lun(struct ppa_addr *p, int lun)
> > > +{
> > > +       p->a.lun = lun;
> > > +}
> > > +
> > > +static inline void pblk_dev_ppa_set_chnl(struct ppa_addr *p, int chnl)
> > > +{
> > > +       p->a.ch = chnl;
> > > +}
> >
> > What is the motivation for adding the lun and chnl setters? They seem
> > uncalled for.
>
> They are used in RAIL's read path to generate the ppas for RAIL reads.

It's just a style thing, but they seem a bit redundant to me.

>
> >
> > > +
> > >  static inline struct ppa_addr pblk_ppa32_to_ppa64(struct pblk *pblk, u32 ppa32)
> > >  {
> > >         struct nvm_tgt_dev *dev = pblk->dev;
> > > --
> > > 2.17.1
> > >

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 0/6] lightnvm: pblk: Introduce RAIL to enforce low tail read latency
  2018-09-18 16:13   ` Heiner Litz
@ 2018-09-19  7:58     ` Hans Holmberg
  2018-09-21  4:34       ` Heiner Litz
  0 siblings, 1 reply; 17+ messages in thread
From: Hans Holmberg @ 2018-09-19  7:58 UTC (permalink / raw)
  To: hlitz
  Cc: linux-block, Javier Gonzalez, Matias Bjorling, igor.j.konopko,
	marcin.dziegielewski

On Tue, Sep 18, 2018 at 6:13 PM Heiner Litz <hlitz@ucsc.edu> wrote:
>
> Hi Hans,
> thanks a lot for your comments! I will send you a git repo to test. I
> have a patch which enables/disables RAIL via ioctl and will send that
> as well.

Great!

Once I have the code in a branch I can start creating test cases for
bad-block corner cases, recovery and write error handling.

Thanks,
Hans

>
> Heiner
> On Tue, Sep 18, 2018 at 4:46 AM Hans Holmberg
> <hans.ml.holmberg@owltronix.com> wrote:
> >
> > On Mon, Sep 17, 2018 at 7:29 AM Heiner Litz <hlitz@ucsc.edu> wrote:
> > >
> > > Hi All,
> > > this patchset introduces RAIL, a mechanism to enforce low tail read latency for
> > > lightnvm OCSSD devices. RAIL leverages redundancy to guarantee that reads are
> > > always served from LUNs that do not serve a high latency operation such as a
> > > write or erase. This avoids that reads become serialized behind these operations
> > > reducing tail latency by ~10x. In particular, in the absence of ECC read errors,
> > > it provides 99.99 percentile read latencies of below 500us. RAIL introduces
> > > capacity overheads (7%-25%) due to RAID-5 like striping (providing fault
> > > tolerance) and reduces the maximum write bandwidth to 110K IOPS on CNEX SSD.
> > >
> > > This patch is based on pblk/core and requires two additional patches from Javier
> > > to be applicable (let me know if you want me to rebase):
> >
> > As the patches do not apply, could you make a branch available so I
> > can get hold of the code in its present state?
> > That would make reviewing and testing so much easier.
> >
> > I have some concerns regarding recovery and write error handling, but
> > I have not found anything that can't be fixed.
> > I also believe that RAIL on/off and stride width should not be
> > configured at build-time, but instead be part of the create IOCTL.
> >
> > See my comments on the individual patches for details.
> >
> > >
> > > The 1st patch exposes some existing APIs so they can be used by RAIL
> > > The 2nd patch introduces a configurable sector mapping function
> > > The 3rd patch refactors the write path so the end_io_fn can be specified when
> > > setting up the request
> > > The 4th patch adds a new submit io function that acquires the write semaphore
> > > The 5th patch introduces the RAIL feature and its API
> > > The 6th patch integrates RAIL into pblk's read and write path
> > >
> > >

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 5/6] lightnvm: pblk: Add RAIL interface
  2018-09-19  7:53       ` Hans Holmberg
@ 2018-09-20 23:58         ` Heiner Litz
  2018-09-21  7:04           ` Hans Holmberg
  0 siblings, 1 reply; 17+ messages in thread
From: Heiner Litz @ 2018-09-20 23:58 UTC (permalink / raw)
  To: hans.ml.holmberg
  Cc: linux-block, Javier Gonzalez, Matias Bjørling,
	igor.j.konopko, marcin.dziegielewski

On Wed, Sep 19, 2018 at 12:53 AM Hans Holmberg
<hans.ml.holmberg@owltronix.com> wrote:
>
> On Tue, Sep 18, 2018 at 6:11 PM Heiner Litz <hlitz@ucsc.edu> wrote:
> >
> > On Tue, Sep 18, 2018 at 4:28 AM Hans Holmberg
> > <hans.ml.holmberg@owltronix.com> wrote:
> > >
> > > On Mon, Sep 17, 2018 at 7:30 AM Heiner Litz <hlitz@ucsc.edu> wrote:
> > > >
> > > > In preparation for supporting RAIL, add the RAIL API.
> > > >
> > > > Signed-off-by: Heiner Litz <hlitz@ucsc.edu>
> > > > ---
> > > >  drivers/lightnvm/pblk-rail.c | 808 +++++++++++++++++++++++++++++++++++
> > > >  drivers/lightnvm/pblk.h      |  63 ++-
> > > >  2 files changed, 870 insertions(+), 1 deletion(-)
> > > >  create mode 100644 drivers/lightnvm/pblk-rail.c
> > > >
> > > > diff --git a/drivers/lightnvm/pblk-rail.c b/drivers/lightnvm/pblk-rail.c
> > > > new file mode 100644
> > > > index 000000000000..a48ed31a0ba9
> > > > --- /dev/null
> > > > +++ b/drivers/lightnvm/pblk-rail.c
> > > > @@ -0,0 +1,808 @@
> > > > +/*
> > > > + * Copyright (C) 2018 Heiner Litz
> > > > + * Initial release: Heiner Litz <hlitz@ucsc.edu>
> > > > + *
> > > > + * This program is free software; you can redistribute it and/or
> > > > + * modify it under the terms of the GNU General Public License version
> > > > + * 2 as published by the Free Software Foundation.
> > > > + *
> > > > + * This program is distributed in the hope that it will be useful, but
> > > > + * WITHOUT ANY WARRANTY; without even the implied warranty of
> > > > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> > > > + * General Public License for more details.
> > > > + *
> > > > + * pblk-rail.c - pblk's RAIL path
> > > > + */
> > > > +
> > > > +#include "pblk.h"
> > > > +
> > > > +#define PBLK_RAIL_EMPTY ~0x0
> > > This constant is not being used.
> >
> > thanks, will remove
> >
> > > > +#define PBLK_RAIL_PARITY_WRITE 0x8000
> > > Where does this magic number come from? Please document.
> >
> > ok , will document
> >
> > >
> > > > +
> > > > +/* RAIL auxiliary functions */
> > > > +static unsigned int pblk_rail_nr_parity_luns(struct pblk *pblk)
> > > > +{
> > > > +       struct pblk_line_meta *lm = &pblk->lm;
> > > > +
> > > > +       return lm->blk_per_line / PBLK_RAIL_STRIDE_WIDTH;
> > > > +}
> > > > +
> > > > +static unsigned int pblk_rail_nr_data_luns(struct pblk *pblk)
> > > > +{
> > > > +       struct pblk_line_meta *lm = &pblk->lm;
> > > > +
> > > > +       return lm->blk_per_line - pblk_rail_nr_parity_luns(pblk);
> > > > +}
> > > > +
> > > > +static unsigned int pblk_rail_sec_per_stripe(struct pblk *pblk)
> > > > +{
> > > > +       struct pblk_line_meta *lm = &pblk->lm;
> > > > +
> > > > +       return lm->blk_per_line * pblk->min_write_pgs;
> > > > +}
> > > > +
> > > > +static unsigned int pblk_rail_psec_per_stripe(struct pblk *pblk)
> > > > +{
> > > > +       return pblk_rail_nr_parity_luns(pblk) * pblk->min_write_pgs;
> > > > +}
> > > > +
> > > > +static unsigned int pblk_rail_dsec_per_stripe(struct pblk *pblk)
> > > > +{
> > > > +       return pblk_rail_sec_per_stripe(pblk) - pblk_rail_psec_per_stripe(pblk);
> > > > +}
> > > > +
> > > > +static unsigned int pblk_rail_wrap_lun(struct pblk *pblk, unsigned int lun)
> > > > +{
> > > > +       struct pblk_line_meta *lm = &pblk->lm;
> > > > +
> > > > +       return (lun & (lm->blk_per_line - 1));
> > > > +}
> > > > +
> > > > +bool pblk_rail_meta_distance(struct pblk_line *data_line)
> > > > +{
> > > > +       return (data_line->meta_distance % PBLK_RAIL_STRIDE_WIDTH) == 0;
> > > > +}
> > > > +
> > > > +/* Notify readers that LUN is serving high latency operation */
> > > > +static void pblk_rail_notify_reader_down(struct pblk *pblk, int lun)
> > > > +{
> > > > +       WARN_ON(test_and_set_bit(lun, pblk->rail.busy_bitmap));
> > > > +       /* Make sure that busy bit is seen by reader before proceeding */
> > > > +       smp_mb__after_atomic();
> > > > +}
> > > > +
> > > > +static void pblk_rail_notify_reader_up(struct pblk *pblk, int lun)
> > > > +{
> > > > +       /* Make sure that write is completed before releasing busy bit */
> > > > +       smp_mb__before_atomic();
> > > > +       WARN_ON(!test_and_clear_bit(lun, pblk->rail.busy_bitmap));
> > > > +}
> > > > +
> > > > +int pblk_rail_lun_busy(struct pblk *pblk, struct ppa_addr ppa)
> > > > +{
> > > > +       struct nvm_tgt_dev *dev = pblk->dev;
> > > > +       struct nvm_geo *geo = &dev->geo;
> > > > +       int lun_pos = pblk_ppa_to_pos(geo, ppa);
> > > > +
> > > > +       return test_bit(lun_pos, pblk->rail.busy_bitmap);
> > > > +}
> > > > +
> > > > +/* Enforces one writer per stride */
> > > > +int pblk_rail_down_stride(struct pblk *pblk, int lun_pos, int timeout)
> > > > +{
> > > > +       struct pblk_lun *rlun;
> > > > +       int strides = pblk_rail_nr_parity_luns(pblk);
> > > > +       int stride = lun_pos % strides;
> > > > +       int ret;
> > > > +
> > > > +       rlun = &pblk->luns[stride];
> > > > +       ret = down_timeout(&rlun->wr_sem, timeout);
> > > > +       pblk_rail_notify_reader_down(pblk, lun_pos);
> > > > +
> > > > +       return ret;
> > > > +}
> > > > +
> > > > +void pblk_rail_up_stride(struct pblk *pblk, int lun_pos)
> > > > +{
> > > > +       struct pblk_lun *rlun;
> > > > +       int strides = pblk_rail_nr_parity_luns(pblk);
> > > > +       int stride = lun_pos % strides;
> > > > +
> > > > +       pblk_rail_notify_reader_up(pblk, lun_pos);
> > > > +       rlun = &pblk->luns[stride];
> > > > +       up(&rlun->wr_sem);
> > > > +}
> > > > +
> > > > +/* Determine whether a sector holds data, meta, or is bad */
> > > > +bool pblk_rail_valid_sector(struct pblk *pblk, struct pblk_line *line, int pos)
> > > > +{
> > > > +       struct pblk_line_meta *lm = &pblk->lm;
> > > > +       struct nvm_tgt_dev *dev = pblk->dev;
> > > > +       struct nvm_geo *geo = &dev->geo;
> > > > +       struct ppa_addr ppa;
> > > > +       int lun;
> > > > +
> > > > +       if (pos >= line->smeta_ssec && pos < (line->smeta_ssec + lm->smeta_sec))
> > > > +               return false;
> > > > +
> > > > +       if (pos >= line->emeta_ssec &&
> > > > +           pos < (line->emeta_ssec + lm->emeta_sec[0]))
> > > > +               return false;
> > > > +
> > > > +       ppa = addr_to_gen_ppa(pblk, pos, line->id);
> > > > +       lun = pblk_ppa_to_pos(geo, ppa);
> > > > +
> > > > +       return !test_bit(lun, line->blk_bitmap);
> > > > +}
> > > > +
> > > > +/* Delay rb overwrite until whole stride has been written */
> > > > +int pblk_rail_rb_delay(struct pblk_rb *rb)
> > > > +{
> > > > +       struct pblk *pblk = container_of(rb, struct pblk, rwb);
> > > > +
> > > > +       return pblk_rail_sec_per_stripe(pblk);
> > > > +}
> > > > +
> > > > +static unsigned int pblk_rail_sec_to_stride(struct pblk *pblk, unsigned int sec)
> > > > +{
> > > > +       unsigned int sec_in_stripe = sec % pblk_rail_sec_per_stripe(pblk);
> > > > +       int page = sec_in_stripe / pblk->min_write_pgs;
> > > > +
> > > > +       return page % pblk_rail_nr_parity_luns(pblk);
> > > > +}
> > > > +
> > > > +static unsigned int pblk_rail_sec_to_idx(struct pblk *pblk, unsigned int sec)
> > > > +{
> > > > +       unsigned int sec_in_stripe = sec % pblk_rail_sec_per_stripe(pblk);
> > > > +
> > > > +       return sec_in_stripe / pblk_rail_psec_per_stripe(pblk);
> > > > +}
> > > > +
> > > > +static void pblk_rail_data_parity(void *dest, void *src)
> > > > +{
> > > > +       unsigned int i;
> > > > +
> > > > +       for (i = 0; i < PBLK_EXPOSED_PAGE_SIZE / sizeof(unsigned long); i++)
> > > > +               ((unsigned long *)dest)[i] ^= ((unsigned long *)src)[i];
> > > > +}
> > > > +
> > > > +static void pblk_rail_lba_parity(u64 *dest, u64 *src)
> > > > +{
> > > > +       *dest ^= *src;
> > > > +}
> > > > +
> > > > +/* Tracks where a sector is located in the rwb */
> > > > +void pblk_rail_track_sec(struct pblk *pblk, struct pblk_line *line, int cur_sec,
> > > > +                        int sentry, int nr_valid)
> > > > +{
> > > > +       int stride, idx, pos;
> > > > +
> > > > +       stride = pblk_rail_sec_to_stride(pblk, cur_sec);
> > > > +       idx = pblk_rail_sec_to_idx(pblk, cur_sec);
> > > > +       pos = pblk_rb_wrap_pos(&pblk->rwb, sentry);
> > > > +       pblk->rail.p2b[stride][idx].pos = pos;
> > > > +       pblk->rail.p2b[stride][idx].nr_valid = nr_valid;
> > > > +}
> > > > +
> > > > +/* RAIL's sector mapping function */
> > > > +static void pblk_rail_map_sec(struct pblk *pblk, struct pblk_line *line,
> > > > +                             int sentry, struct pblk_sec_meta *meta_list,
> > > > +                             __le64 *lba_list, struct ppa_addr ppa)
> > > > +{
> > > > +       struct pblk_w_ctx *w_ctx;
> > > > +       __le64 addr_empty = cpu_to_le64(ADDR_EMPTY);
> > > > +
> > > > +       kref_get(&line->ref);
> > > > +
> > > > +       if (sentry & PBLK_RAIL_PARITY_WRITE) {
> > > > +               u64 *lba;
> > > > +
> > > > +               sentry &= ~PBLK_RAIL_PARITY_WRITE;
> > > > +               lba = &pblk->rail.lba[sentry];
> > > > +               meta_list->lba = cpu_to_le64(*lba);
> > > > +               *lba_list = cpu_to_le64(*lba);
> > > > +               line->nr_valid_lbas++;
> > > > +       } else {
> > > > +               w_ctx = pblk_rb_w_ctx(&pblk->rwb, sentry);
> > > > +               w_ctx->ppa = ppa;
> > > > +               meta_list->lba = cpu_to_le64(w_ctx->lba);
> > > > +               *lba_list = cpu_to_le64(w_ctx->lba);
> > > > +
> > > > +               if (*lba_list != addr_empty)
> > > > +                       line->nr_valid_lbas++;
> > > > +               else
> > > > +                       atomic64_inc(&pblk->pad_wa);
> > > > +       }
> > > > +}
> > > > +
> > > > +int pblk_rail_map_page_data(struct pblk *pblk, unsigned int sentry,
> > > > +                           struct ppa_addr *ppa_list,
> > > > +                           unsigned long *lun_bitmap,
> > > > +                           struct pblk_sec_meta *meta_list,
> > > > +                           unsigned int valid_secs)
> > > > +{
> > > > +       struct pblk_line *line = pblk_line_get_data(pblk);
> > > > +       struct pblk_emeta *emeta;
> > > > +       __le64 *lba_list;
> > > > +       u64 paddr;
> > > > +       int nr_secs = pblk->min_write_pgs;
> > > > +       int i;
> > > > +
> > > > +       if (pblk_line_is_full(line)) {
> > > > +               struct pblk_line *prev_line = line;
> > > > +
> > > > +               /* If we cannot allocate a new line, make sure to store metadata
> > > > +                * on current line and then fail
> > > > +                */
> > > > +               line = pblk_line_replace_data(pblk);
> > > > +               pblk_line_close_meta(pblk, prev_line);
> > > > +
> > > > +               if (!line)
> > > > +                       return -EINTR;
> > > > +       }
> > > > +
> > > > +       emeta = line->emeta;
> > > > +       lba_list = emeta_to_lbas(pblk, emeta->buf);
> > > > +
> > > > +       paddr = pblk_alloc_page(pblk, line, nr_secs);
> > > > +
> > > > +       pblk_rail_track_sec(pblk, line, paddr, sentry, valid_secs);
> > > > +
> > > > +       for (i = 0; i < nr_secs; i++, paddr++) {
> > > > +               __le64 addr_empty = cpu_to_le64(ADDR_EMPTY);
> > > > +
> > > > +               /* ppa to be sent to the device */
> > > > +               ppa_list[i] = addr_to_gen_ppa(pblk, paddr, line->id);
> > > > +
> > > > +               /* Write context for target bio completion on write buffer. Note
> > > > +                * that the write buffer is protected by the sync backpointer,
> > > > +                * and a single writer thread has access to each specific entry
> > > > +                * at a time. Thus, it is safe to modify the context for the
> > > > +                * entry we are setting up for submission without taking any
> > > > +                * lock or memory barrier.
> > > > +                */
> > > > +               if (i < valid_secs) {
> > > > +                       pblk_rail_map_sec(pblk, line, sentry + i, &meta_list[i],
> > > > +                                         &lba_list[paddr], ppa_list[i]);
> > > > +               } else {
> > > > +                       lba_list[paddr] = meta_list[i].lba = addr_empty;
> > > > +                       __pblk_map_invalidate(pblk, line, paddr);
> > > > +               }
> > > > +       }
> > > > +
> > > > +       pblk_down_rq(pblk, ppa_list[0], lun_bitmap);
> > > > +       return 0;
> > > > +}
> > >
> > > This is a lot of duplication of code from the "normal" pblk map
> > > function -  could you refactor to avoid this?
> >
> > I wanted to keep the mapping function as general as possible in case
> > we want to support other mapping functions at some point. If you think
> > this is not needed I can reduce the mapping func to only the code that
> > differs between the mapping functions. E.g. we could turn
> > pblk_map_page_data into a pblk_map_sec_data.
>
> I think it would be better to try to keep common code as far as we
> can, and if we would introduce other mapping functions in the future
> we'll rework the common denominator.
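
For illustration (pblk_map_sec_data is just the hypothetical name
mentioned above; this is only a sketch, not code from the posted
patches), the per-sector mapping shared by both paths could be
factored out roughly like this:

static void pblk_map_sec_data(struct pblk *pblk, struct pblk_line *line,
			      unsigned int sentry,
			      struct pblk_sec_meta *meta,
			      __le64 *lba_entry, struct ppa_addr ppa)
{
	struct pblk_w_ctx *w_ctx = pblk_rb_w_ctx(&pblk->rwb, sentry);
	__le64 addr_empty = cpu_to_le64(ADDR_EMPTY);

	w_ctx->ppa = ppa;
	meta->lba = cpu_to_le64(w_ctx->lba);
	*lba_entry = cpu_to_le64(w_ctx->lba);

	if (*lba_entry != addr_empty)
		line->nr_valid_lbas++;
	else
		atomic64_inc(&pblk->pad_wa);
}

The default and RAIL mapping functions would then only keep the code
that actually differs (e.g. the parity handling on the RAIL side).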
>
> >
> > >
> > > > +
> > > > +/* RAIL Initialization and tear down */
> > > > +int pblk_rail_init(struct pblk *pblk)
> > > > +{
> > > > +       struct pblk_line_meta *lm = &pblk->lm;
> > > > +       int i, p2be;
> > > > +       unsigned int nr_strides;
> > > > +       unsigned int psecs;
> > > > +       void *kaddr;
> > > > +
> > > > +       if (!PBLK_RAIL_STRIDE_WIDTH)
> > > > +               return 0;
> > > > +
> > > > +       if (((lm->blk_per_line % PBLK_RAIL_STRIDE_WIDTH) != 0) ||
> > > > +           (lm->blk_per_line < PBLK_RAIL_STRIDE_WIDTH)) {
> > > > +               pr_err("pblk: unsupported RAIL stride %i\n", lm->blk_per_line);
> > > > +               return -EINVAL;
> > > > +       }
> > >
> > > This is just a check of the maximum blocks per line - bad blocks will
> > > reduce the number of writable blocks. What happens when a line goes
> > > below PBLK_RAIL_STRIDE_WIDTH writable blocks?
> >
> > This check just guarantees that lm->blk_per_line is a multiple of
> > PBLK_RAIL_STRIDE_WIDTH. Bad blocks are handled dynamically at runtime
> > via pblk_rail_valid_sector(pblk, line, cur) which skips parity
> > computation if the parity block is bad. In theory a line can have
> > fewer writable blocks than PBLK_RAIL_STRIDE_WIDTH; in that case parity
> > is computed over a smaller number of blocks.
>
> Yes, I see now, it should work.
>
> The only case I see that is problematic is if only the parity block(s)
> are non-bad in a line, resulting in no data being written, just parity
> (adding a huge write latency penalty) - we could either disable RAIL
> for that class of lines or mark them as bad.
>
> >
> > >
> > > > +
> > > > +       psecs = pblk_rail_psec_per_stripe(pblk);
> > > > +       nr_strides = pblk_rail_sec_per_stripe(pblk) / PBLK_RAIL_STRIDE_WIDTH;
> > > > +
> > > > +       pblk->rail.p2b = kmalloc_array(nr_strides, sizeof(struct p2b_entry *),
> > > > +                                      GFP_KERNEL);
> > > > +       if (!pblk->rail.p2b)
> > > > +               return -ENOMEM;
> > > > +
> > > > +       for (p2be = 0; p2be < nr_strides; p2be++) {
> > > > +               pblk->rail.p2b[p2be] = kmalloc_array(PBLK_RAIL_STRIDE_WIDTH - 1,
> > > > +                                              sizeof(struct p2b_entry),
> > > > +                                              GFP_KERNEL);
> > > > +               if (!pblk->rail.p2b[p2be])
> > > > +                       goto free_p2b_entries;
> > > > +       }
> > > > +
> > > > +       pblk->rail.data = kmalloc(psecs * sizeof(void *), GFP_KERNEL);
> > > > +       if (!pblk->rail.data)
> > > > +               goto free_p2b_entries;
> > > > +
> > > > +       pblk->rail.pages = alloc_pages(GFP_KERNEL, get_count_order(psecs));
> > > > +       if (!pblk->rail.pages)
> > > > +               goto free_data;
> > > > +
> > > > +       kaddr = page_address(pblk->rail.pages);
> > > > +       for (i = 0; i < psecs; i++)
> > > > +               pblk->rail.data[i] = kaddr + i * PBLK_EXPOSED_PAGE_SIZE;
> > > > +
> > > > +       pblk->rail.lba = kmalloc_array(psecs, sizeof(u64 *), GFP_KERNEL);
> > > > +       if (!pblk->rail.lba)
> > > > +               goto free_pages;
> > > > +
> > > > +       /* Subtract parity bits from device capacity */
> > > > +       pblk->capacity = pblk->capacity * (PBLK_RAIL_STRIDE_WIDTH - 1) /
> > > > +               PBLK_RAIL_STRIDE_WIDTH;
> > > > +
> > > > +       pblk->map_page = pblk_rail_map_page_data;
> > > > +
> > > > +       return 0;
> > > > +
> > > > +free_pages:
> > > > +       free_pages((unsigned long)page_address(pblk->rail.pages),
> > > > +                  get_count_order(psecs));
> > > > +free_data:
> > > > +       kfree(pblk->rail.data);
> > > > +free_p2b_entries:
> > > > +       for (p2be = p2be - 1; p2be >= 0; p2be--)
> > > > +               kfree(pblk->rail.p2b[p2be]);
> > > > +       kfree(pblk->rail.p2b);
> > > > +
> > > > +       return -ENOMEM;
> > > > +}
> > > > +
> > > > +void pblk_rail_free(struct pblk *pblk)
> > > > +{
> > > > +       unsigned int i;
> > > > +       unsigned int nr_strides;
> > > > +       unsigned int psecs;
> > > > +
> > > > +       psecs = pblk_rail_psec_per_stripe(pblk);
> > > > +       nr_strides = pblk_rail_sec_per_stripe(pblk) / PBLK_RAIL_STRIDE_WIDTH;
> > > > +
> > > > +       kfree(pblk->rail.lba);
> > > > +       free_pages((unsigned long)page_address(pblk->rail.pages),
> > > > +                  get_count_order(psecs));
> > > > +       kfree(pblk->rail.data);
> > > > +       for (i = 0; i < nr_strides; i++)
> > > > +               kfree(pblk->rail.p2b[i]);
> > > > +       kfree(pblk->rail.p2b);
> > > > +}
> > > > +
> > > > +/* PBLK supports 64 ppas max. By performing RAIL reads, a sector is read using
> > > > + * multiple ppas which can lead to violation of the 64 ppa limit. In this case,
> > > > + * split the bio
> > > > + */
> > > > +static void pblk_rail_bio_split(struct pblk *pblk, struct bio **bio, int sec)
> > > > +{
> > > > +       struct nvm_tgt_dev *dev = pblk->dev;
> > > > +       struct bio *split;
> > > > +
> > > > +       sec *= (dev->geo.csecs >> 9);
> > > > +
> > > > +       split = bio_split(*bio, sec, GFP_KERNEL, &pblk_bio_set);
> > > > +       /* there is no chance to merge the split bio */
> > > > +       split->bi_opf |= REQ_NOMERGE;
> > > > +       bio_set_flag(*bio, BIO_QUEUE_ENTERED);
> > > > +       bio_chain(split, *bio);
> > > > +       generic_make_request(*bio);
> > > > +       *bio = split;
> > > > +}
> > > > +
> > > > +/* RAIL's Write Path */
> > > > +static int pblk_rail_sched_parity(struct pblk *pblk)
> > > > +{
> > > > +       struct pblk_line *line = pblk_line_get_data(pblk);
> > > > +       unsigned int sec_in_stripe;
> > > > +
> > > > +       while (1) {
> > > > +               sec_in_stripe = line->cur_sec % pblk_rail_sec_per_stripe(pblk);
> > > > +
> > > > +               /* Schedule parity write at end of data section */
> > > > +               if (sec_in_stripe >= pblk_rail_dsec_per_stripe(pblk))
> > > > +                       return 1;
> > > > +
> > > > +               /* Skip bad blocks and meta sectors until we find a valid sec */
> > > > +               if (test_bit(line->cur_sec, line->map_bitmap))
> > > > +                       line->cur_sec += pblk->min_write_pgs;
> > > > +               else
> > > > +                       break;
> > > > +       }
> > > > +
> > > > +       return 0;
> > > > +}
> > > > +
> > > > +/* Mark RAIL parity sectors as invalid sectors so they will be gc'ed */
> > > > +void pblk_rail_line_close(struct pblk *pblk, struct pblk_line *line)
> > > > +{
> > > > +       int off, bit;
> > > > +
> > > > +       for (off = pblk_rail_dsec_per_stripe(pblk);
> > > > +            off < pblk->lm.sec_per_line;
> > > > +            off += pblk_rail_sec_per_stripe(pblk)) {
> > > > +               for (bit = 0; bit < pblk_rail_psec_per_stripe(pblk); bit++)
> > > > +                       set_bit(off + bit, line->invalid_bitmap);
> > > > +       }
> > > > +}
> > > > +
> > > > +void pblk_rail_end_io_write(struct nvm_rq *rqd)
> > > > +{
> > > > +       struct pblk *pblk = rqd->private;
> > > > +       struct pblk_c_ctx *c_ctx = nvm_rq_to_pdu(rqd);
> > > > +
> > > > +       if (rqd->error) {
> > > > +               pblk_log_write_err(pblk, rqd);
> > > > +               return pblk_end_w_fail(pblk, rqd);
> > >
> > > The write error recovery path relies on the sentry in c_ctx being
> > > an index into the write buffer, so this won't work.
> >
> > You mean a RAIL parity write? Yes, good catch.
> >
>
> It does not make sense to re-issue failing parity writes anyway, right?

Yes, this is correct. I think I can just take out end_w_fail, but we
need to let readers know that the page is bad (see below).

>
> > >
> > > Additionally, if a write (data or parity) fails, the whole stride would
> > > be broken and need to fall back on "normal" reads, right?
> > > One solution could be to check line->w_err_gc->has_write_err on the read path.
> >
> > When a data write fails it is remapped and the RAIL mapping function
> > tracks that new location in the p2b. The page will be marked bad and
> > hence taken into account when computing parity in the case of parity
> > writes and RAIL reads, so the line should still be intact. This might
> > be insufficiently tested but in theory it should work.
>
> As far as I can tell from the code, pblk_rail_valid_sector only checks
> if the sector is occupied by metadata or if the whole block is bad.
> In the case of a write failure, the block will not be marked bad. What
> we could do is to keep track of the write pointer internally to check
> if the sector had been successfully written.
>
> I can create a patch for keeping track of the write pointer for each
> block - this would be useful for debugging purposes in any case. Once
> this is in place it would be easy to add a check in
> pblk_rail_valid_sector ensuring that the sector has actually been
> written successfully.

Hmm, I don't think keeping track of the write pointer is sufficient. Readers
need to be able to determine whether a page is bad at any time. So I
believe we need a per-line bitmap telling us which stripes (horizontal pages)
are bad, or am I missing something?
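
A rough sketch of what that could look like (stripe_bad_bitmap would be
a new per-line field sized to the number of stripes per line; all of
this is illustrative and not in the posted patches):

/* Mark the whole stripe bad when a write in it fails ... */
static void pblk_rail_mark_stripe_bad(struct pblk *pblk,
				      struct pblk_line *line, u64 paddr)
{
	set_bit(paddr / pblk_rail_sec_per_stripe(pblk),
		line->stripe_bad_bitmap);
}

/* ... and let pblk_rail_valid_sector() consult it on the read path */
static bool pblk_rail_stripe_bad(struct pblk *pblk,
				 struct pblk_line *line, u64 paddr)
{
	return test_bit(paddr / pblk_rail_sec_per_stripe(pblk),
			line->stripe_bad_bitmap);
}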

>
> >
> > >
> > > > +       }
> > > > +#ifdef CONFIG_NVM_DEBUG
> > > > +       else
> > > > +               WARN_ONCE(rqd->bio->bi_status, "pblk: corrupted write error\n");
> > > > +#endif
> > > > +
> > > > +       pblk_up_rq(pblk, c_ctx->lun_bitmap);
> > > > +
> > > > +       pblk_rq_to_line_put(pblk, rqd);
> > > > +       bio_put(rqd->bio);
> > > > +       pblk_free_rqd(pblk, rqd, PBLK_WRITE);
> > > > +
> > > > +       atomic_dec(&pblk->inflight_io);
> > > > +}
> > > > +
> > > > +static int pblk_rail_read_to_bio(struct pblk *pblk, struct nvm_rq *rqd,
> > > > +                         struct bio *bio, unsigned int stride,
> > > > +                         unsigned int nr_secs, unsigned int paddr)
> > > > +{
> > > > +       struct pblk_c_ctx *c_ctx = nvm_rq_to_pdu(rqd);
> > > > +       int sec, i;
> > > > +       int nr_data = PBLK_RAIL_STRIDE_WIDTH - 1;
> > > > +       struct pblk_line *line = pblk_line_get_data(pblk);
> > > > +
> > > > +       c_ctx->nr_valid = nr_secs;
> > > > +       /* sentry indexes rail page buffer, instead of rwb */
> > > > +       c_ctx->sentry = stride * pblk->min_write_pgs;
> > > > +       c_ctx->sentry |= PBLK_RAIL_PARITY_WRITE;
> > > > +
> > > > +       for (sec = 0; sec < pblk->min_write_pgs; sec++) {
> > > > +               void *pg_addr;
> > > > +               struct page *page;
> > > > +               u64 *lba;
> > > > +
> > > > +               lba = &pblk->rail.lba[stride * pblk->min_write_pgs + sec];
> > > > +               pg_addr = pblk->rail.data[stride * pblk->min_write_pgs + sec];
> > > > +               page = virt_to_page(pg_addr);
> > > > +
> > > > +               if (!page) {
> > > > +                       pr_err("pblk: could not allocate RAIL bio page %p\n",
> > > > +                              pg_addr);
> > > > +                       return -NVM_IO_ERR;
> > > > +               }
> > > > +
> > > > +               if (bio_add_page(bio, page, pblk->rwb.seg_size, 0) !=
> > > > +                   pblk->rwb.seg_size) {
> > > > +                       pr_err("pblk: could not add page to RAIL bio\n");
> > > > +                       return -NVM_IO_ERR;
> > > > +               }
> > > > +
> > > > +               *lba = 0;
> > > > +               memset(pg_addr, 0, PBLK_EXPOSED_PAGE_SIZE);
> > > > +
> > > > +               for (i = 0; i < nr_data; i++) {
> > > > +                       struct pblk_rb_entry *entry;
> > > > +                       struct pblk_w_ctx *w_ctx;
> > > > +                       u64 lba_src;
> > > > +                       unsigned int pos;
> > > > +                       unsigned int cur;
> > > > +                       int distance = pblk_rail_psec_per_stripe(pblk);
> > > > +
> > > > +                       cur = paddr - distance * (nr_data - i) + sec;
> > > > +
> > > > +                       if (!pblk_rail_valid_sector(pblk, line, cur))
> > > > +                               continue;
> > > > +
> > > > +                       pos = pblk->rail.p2b[stride][i].pos;
> > > > +                       pos = pblk_rb_wrap_pos(&pblk->rwb, pos + sec);
> > > > +                       entry = &pblk->rwb.entries[pos];
> > > > +                       w_ctx = &entry->w_ctx;
> > > > +                       lba_src = w_ctx->lba;
> > > > +
> > > > +                       if (sec < pblk->rail.p2b[stride][i].nr_valid &&
> > > > +                           lba_src != ADDR_EMPTY) {
> > > > +                               pblk_rail_data_parity(pg_addr, entry->data);
> > > > +                               pblk_rail_lba_parity(lba, &lba_src);
> > >
> > > What keeps the parity lba values from invalidating "real" data lbas
> > > during recovery?
> >
> > The RAIL geometry is known during recovery so the parity LBAs can be
> > ignored, not implemented yet.
>
> Ah, it's not in place yet, then it makes sense.
> It would be straightforward to implement, using something like
> sector_is_parity(line, paddr) to avoid mapping parity sectors during
> recovery.
>
> >
> > >
> > > > +                       }
> > > > +               }
> > > > +       }
> > > > +
> > > > +       return 0;
> > > > +}
> > > > +
> > > > +int pblk_rail_submit_write(struct pblk *pblk)
> > > > +{
> > > > +       int i;
> > > > +       struct nvm_rq *rqd;
> > > > +       struct bio *bio;
> > > > +       struct pblk_line *line = pblk_line_get_data(pblk);
> > > > +       int start, end, bb_offset;
> > > > +       unsigned int stride = 0;
> > > > +
> > > > +       if (!pblk_rail_sched_parity(pblk))
> > > > +               return 0;
> > > > +
> > > > +       start = line->cur_sec;
> > > > +       bb_offset = start % pblk_rail_sec_per_stripe(pblk);
> > > > +       end = start + pblk_rail_sec_per_stripe(pblk) - bb_offset;
> > > > +
> > > > +       for (i = start; i < end; i += pblk->min_write_pgs, stride++) {
> > > > +               /* Do not generate parity in this slot if the sec is bad
> > > > +                * or reserved for meta.
> > > > +                * We check on the read path and perform a conventional
> > > > +                * read, to avoid reading parity from the bad block
> > > > +                */
> > > > +               if (!pblk_rail_valid_sector(pblk, line, i))
> > > > +                       continue;
> > > > +
> > > > +               rqd = pblk_alloc_rqd(pblk, PBLK_WRITE);
> > > > +               if (IS_ERR(rqd)) {
> > > > +                       pr_err("pblk: cannot allocate parity write req.\n");
> > > > +                       return -ENOMEM;
> > > > +               }
> > > > +
> > > > +               bio = bio_alloc(GFP_KERNEL, pblk->min_write_pgs);
> > > > +               if (!bio) {
> > > > +                       pr_err("pblk: cannot allocate parity write bio\n");
> > > > +                       pblk_free_rqd(pblk, rqd, PBLK_WRITE);
> > > > +                       return -ENOMEM;
> > > > +               }
> > > > +
> > > > +               bio->bi_iter.bi_sector = 0; /* internal bio */
> > > > +               bio_set_op_attrs(bio, REQ_OP_WRITE, 0);
> > > > +               rqd->bio = bio;
> > > > +
> > > > +               pblk_rail_read_to_bio(pblk, rqd, bio, stride,
> > > > +                                     pblk->min_write_pgs, i);
> > > > +
> > > > +               if (pblk_submit_io_set(pblk, rqd, pblk_rail_end_io_write)) {
> > > > +                       bio_put(rqd->bio);
> > > > +                       pblk_free_rqd(pblk, rqd, PBLK_WRITE);
> > > > +
> > > > +                       return -NVM_IO_ERR;
> > > > +               }
> > > > +       }
> > > > +
> > > > +       return 0;
> > > > +}
> > > > +
> > > > +/* RAIL's Read Path */
> > > > +static void pblk_rail_end_io_read(struct nvm_rq *rqd)
> > > > +{
> > > > +       struct pblk *pblk = rqd->private;
> > > > +       struct pblk_g_ctx *r_ctx = nvm_rq_to_pdu(rqd);
> > > > +       struct pblk_pr_ctx *pr_ctx = r_ctx->private;
> > > > +       struct bio *new_bio = rqd->bio;
> > > > +       struct bio *bio = pr_ctx->orig_bio;
> > > > +       struct bio_vec src_bv, dst_bv;
> > > > +       struct pblk_sec_meta *meta_list = rqd->meta_list;
> > > > +       int bio_init_idx = pr_ctx->bio_init_idx;
> > > > +       int nr_secs = pr_ctx->orig_nr_secs;
> > > > +       __le64 *lba_list_mem, *lba_list_media;
> > > > +       __le64 addr_empty = cpu_to_le64(ADDR_EMPTY);
> > > > +       void *src_p, *dst_p;
> > > > +       int i, r, rail_ppa = 0;
> > > > +       unsigned char valid;
> > > > +
> > > > +       if (unlikely(rqd->nr_ppas == 1)) {
> > > > +               struct ppa_addr ppa;
> > > > +
> > > > +               ppa = rqd->ppa_addr;
> > > > +               rqd->ppa_list = pr_ctx->ppa_ptr;
> > > > +               rqd->dma_ppa_list = pr_ctx->dma_ppa_list;
> > > > +               rqd->ppa_list[0] = ppa;
> > > > +       }
> > > > +
> > > > +       /* Re-use allocated memory for intermediate lbas */
> > > > +       lba_list_mem = (((void *)rqd->ppa_list) + pblk_dma_ppa_size);
> > > > +       lba_list_media = (((void *)rqd->ppa_list) + 2 * pblk_dma_ppa_size);
> > > > +
> > > > +       for (i = 0; i < rqd->nr_ppas; i++)
> > > > +               lba_list_media[i] = meta_list[i].lba;
> > > > +       for (i = 0; i < nr_secs; i++)
> > > > +               meta_list[i].lba = lba_list_mem[i];
> > > > +
> > > > +       for (i = 0; i < nr_secs; i++) {
> > > > +               struct pblk_line *line;
> > > > +               u64 meta_lba = 0x0UL, mlba;
> > > > +
> > > > +               line = pblk_ppa_to_line(pblk, rqd->ppa_list[rail_ppa]);
> > > > +
> > > > +               valid = bitmap_weight(pr_ctx->bitmap, PBLK_RAIL_STRIDE_WIDTH);
> > > > +               bitmap_shift_right(pr_ctx->bitmap, pr_ctx->bitmap,
> > > > +                                  PBLK_RAIL_STRIDE_WIDTH, PR_BITMAP_SIZE);
> > > > +
> > > > +               if (valid == 0) /* Skip cached reads */
> > > > +                       continue;
> > > > +
> > > > +               kref_put(&line->ref, pblk_line_put);
> > > > +
> > > > +               dst_bv = bio->bi_io_vec[bio_init_idx + i];
> > > > +               dst_p = kmap_atomic(dst_bv.bv_page);
> > > > +
> > > > +               memset(dst_p + dst_bv.bv_offset, 0, PBLK_EXPOSED_PAGE_SIZE);
> > > > +               meta_list[i].lba = cpu_to_le64(0x0UL);
> > > > +
> > > > +               for (r = 0; r < valid; r++, rail_ppa++) {
> > > > +                       src_bv = new_bio->bi_io_vec[rail_ppa];
> > > > +
> > > > +                       if (lba_list_media[rail_ppa] != addr_empty) {
> > > > +                               src_p = kmap_atomic(src_bv.bv_page);
> > > > +                               pblk_rail_data_parity(dst_p + dst_bv.bv_offset,
> > > > +                                                     src_p + src_bv.bv_offset);
> > > > +                               mlba = le64_to_cpu(lba_list_media[rail_ppa]);
> > > > +                               pblk_rail_lba_parity(&meta_lba, &mlba);
> > > > +                               kunmap_atomic(src_p);
> > > > +                       }
> > > > +
> > > > +                       mempool_free(src_bv.bv_page, &pblk->page_bio_pool);
> > > > +               }
> > > > +               meta_list[i].lba = cpu_to_le64(meta_lba);
> > > > +               kunmap_atomic(dst_p);
> > > > +       }
> > > > +
> > > > +       bio_put(new_bio);
> > > > +       rqd->nr_ppas = pr_ctx->orig_nr_secs;
> > > > +       kfree(pr_ctx);
> > > > +       rqd->bio = NULL;
> > > > +
> > > > +       bio_endio(bio);
> > > > +       __pblk_end_io_read(pblk, rqd, false);
> > > > +}
> > > > +
> > > > +/* Converts original ppa into ppa list of RAIL reads */
> > > > +static int pblk_rail_setup_ppas(struct pblk *pblk, struct ppa_addr ppa,
> > > > +                               struct ppa_addr *rail_ppas,
> > > > +                               unsigned char *pvalid, int *nr_rail_ppas,
> > > > +                               int *rail_reads)
> > > > +{
> > > > +       struct nvm_tgt_dev *dev = pblk->dev;
> > > > +       struct nvm_geo *geo = &dev->geo;
> > > > +       struct ppa_addr rail_ppa = ppa;
> > > > +       unsigned int lun_pos = pblk_ppa_to_pos(geo, ppa);
> > > > +       unsigned int strides = pblk_rail_nr_parity_luns(pblk);
> > > > +       struct pblk_line *line;
> > > > +       unsigned int i;
> > > > +       int ppas = *nr_rail_ppas;
> > > > +       int valid = 0;
> > > > +
> > > > +       for (i = 1; i < PBLK_RAIL_STRIDE_WIDTH; i++) {
> > > > +               unsigned int neighbor, lun, chnl;
> > > > +               int laddr;
> > > > +
> > > > +               neighbor = pblk_rail_wrap_lun(pblk, lun_pos + i * strides);
> > > > +
> > > > +               lun = pblk_pos_to_lun(geo, neighbor);
> > > > +               chnl = pblk_pos_to_chnl(geo, neighbor);
> > > > +               pblk_dev_ppa_set_lun(&rail_ppa, lun);
> > > > +               pblk_dev_ppa_set_chnl(&rail_ppa, chnl);
> > > > +
> > > > +               line = pblk_ppa_to_line(pblk, rail_ppa);
> > > > +               laddr = pblk_dev_ppa_to_line_addr(pblk, rail_ppa);
> > > > +
> > > > +               /* Do not read from bad blocks */
> > > > +               if (!pblk_rail_valid_sector(pblk, line, laddr)) {
> > > > +                       /* Perform regular read if parity sector is bad */
> > > > +                       if (neighbor >= pblk_rail_nr_data_luns(pblk))
> > > > +                               return 0;
> > > > +
> > > > +                       /* If any other neighbor is bad we can just skip it */
> > > > +                       continue;
> > > > +               }
> > > > +
> > > > +               rail_ppas[ppas++] = rail_ppa;
> > > > +               valid++;
> > > > +       }
> > > > +
> > > > +       if (valid == 1)
> > > > +               return 0;
> > > > +
> > > > +       *pvalid = valid;
> > > > +       *nr_rail_ppas = ppas;
> > > > +       (*rail_reads)++;
> > > > +       return 1;
> > > > +}
> > > > +
> > > > +static void pblk_rail_set_bitmap(struct pblk *pblk, struct ppa_addr *ppa_list,
> > > > +                                int ppa, struct ppa_addr *rail_ppa_list,
> > > > +                                int *nr_rail_ppas, unsigned long *read_bitmap,
> > > > +                                unsigned long *pvalid, int *rail_reads)
> > > > +{
> > > > +       unsigned char valid;
> > > > +
> > > > +       if (test_bit(ppa, read_bitmap))
> > > > +               return;
> > > > +
> > > > +       if (pblk_rail_lun_busy(pblk, ppa_list[ppa]) &&
> > > > +           pblk_rail_setup_ppas(pblk, ppa_list[ppa],
> > > > +                                rail_ppa_list, &valid,
> > > > +                                nr_rail_ppas, rail_reads)) {
> > > > +               WARN_ON(test_and_set_bit(ppa, read_bitmap));
> > > > +               bitmap_set(pvalid, ppa * PBLK_RAIL_STRIDE_WIDTH, valid);
> > > > +       } else {
> > > > +               rail_ppa_list[(*nr_rail_ppas)++] = ppa_list[ppa];
> > > > +               bitmap_set(pvalid, ppa * PBLK_RAIL_STRIDE_WIDTH, 1);
> > > > +       }
> > > > +}
> > > > +
> > > > +int pblk_rail_read_bio(struct pblk *pblk, struct nvm_rq *rqd, int blba,
> > > > +                      unsigned long *read_bitmap, int bio_init_idx,
> > > > +                      struct bio **bio)
> > > > +{
> > > > +       struct pblk_g_ctx *r_ctx = nvm_rq_to_pdu(rqd);
> > > > +       struct pblk_pr_ctx *pr_ctx;
> > > > +       struct ppa_addr rail_ppa_list[NVM_MAX_VLBA];
> > > > +       DECLARE_BITMAP(pvalid, PR_BITMAP_SIZE);
> > > > +       int nr_secs = rqd->nr_ppas;
> > > > +       bool read_empty = bitmap_empty(read_bitmap, nr_secs);
> > > > +       int nr_rail_ppas = 0, rail_reads = 0;
> > > > +       int i;
> > > > +       int ret;
> > > > +
> > > > +       /* Fully cached reads should not enter this path */
> > > > +       WARN_ON(bitmap_full(read_bitmap, nr_secs));
> > > > +
> > > > +       bitmap_zero(pvalid, PR_BITMAP_SIZE);
> > > > +       if (rqd->nr_ppas == 1) {
> > > > +               pblk_rail_set_bitmap(pblk, &rqd->ppa_addr, 0, rail_ppa_list,
> > > > +                                    &nr_rail_ppas, read_bitmap, pvalid,
> > > > +                                    &rail_reads);
> > > > +
> > > > +               if (nr_rail_ppas == 1) {
> > > > +                       memcpy(&rqd->ppa_addr, rail_ppa_list,
> > > > +                              nr_rail_ppas * sizeof(struct ppa_addr));
> > > > +               } else {
> > > > +                       rqd->ppa_list = rqd->meta_list + pblk_dma_meta_size;
> > > > +                       rqd->dma_ppa_list = rqd->dma_meta_list +
> > > > +                         pblk_dma_meta_size;
> > > > +                       memcpy(rqd->ppa_list, rail_ppa_list,
> > > > +                              nr_rail_ppas * sizeof(struct ppa_addr));
> > > > +               }
> > > > +       } else {
> > > > +               for (i = 0; i < rqd->nr_ppas; i++) {
> > > > +                       pblk_rail_set_bitmap(pblk, rqd->ppa_list, i,
> > > > +                                            rail_ppa_list, &nr_rail_ppas,
> > > > +                                            read_bitmap, pvalid, &rail_reads);
> > > > +
> > > > +                       /* Don't split if this is the last ppa of the rqd */
> > > > +                       if (((nr_rail_ppas + PBLK_RAIL_STRIDE_WIDTH) >=
> > > > +                            NVM_MAX_VLBA) && (i + 1 < rqd->nr_ppas)) {
> > > > +                               struct pblk_g_ctx *r_ctx = nvm_rq_to_pdu(rqd);
> > > > +
> > > > +                               pblk_rail_bio_split(pblk, bio, i + 1);
> > > > +                               rqd->nr_ppas = pblk_get_secs(*bio);
> > > > +                               r_ctx->private = *bio;
> > > > +                               break;
> > > > +                       }
> > > > +               }
> > > > +               memcpy(rqd->ppa_list, rail_ppa_list,
> > > > +                      nr_rail_ppas * sizeof(struct ppa_addr));
> > > > +       }
> > > > +
> > > > +       if (bitmap_empty(read_bitmap, rqd->nr_ppas))
> > > > +               return NVM_IO_REQUEUE;
> > > > +
> > > > +       if (read_empty && !bitmap_empty(read_bitmap, rqd->nr_ppas))
> > > > +               bio_advance(*bio, (rqd->nr_ppas) * PBLK_EXPOSED_PAGE_SIZE);
> > > > +
> > > > +       if (pblk_setup_partial_read(pblk, rqd, bio_init_idx, read_bitmap,
> > > > +                                   nr_rail_ppas))
> > > > +               return NVM_IO_ERR;
> > > > +
> > > > +       rqd->end_io = pblk_rail_end_io_read;
> > > > +       pr_ctx = r_ctx->private;
> > > > +       bitmap_copy(pr_ctx->bitmap, pvalid, PR_BITMAP_SIZE);
> > > > +
> > > > +       ret = pblk_submit_io(pblk, rqd);
> > > > +       if (ret) {
> > > > +               bio_put(rqd->bio);
> > > > +               pr_err("pblk: partial RAIL read IO submission failed\n");
> > > > +               /* Free allocated pages in new bio */
> > > > +               pblk_bio_free_pages(pblk, rqd->bio, 0, rqd->bio->bi_vcnt);
> > > > +               kfree(pr_ctx);
> > > > +               __pblk_end_io_read(pblk, rqd, false);
> > > > +               return NVM_IO_ERR;
> > > > +       }
> > > > +
> > > > +       return NVM_IO_OK;
> > > > +}
> > > > diff --git a/drivers/lightnvm/pblk.h b/drivers/lightnvm/pblk.h
> > > > index bd88784e51d9..01fe4362b27e 100644
> > > > --- a/drivers/lightnvm/pblk.h
> > > > +++ b/drivers/lightnvm/pblk.h
> > > > @@ -28,6 +28,7 @@
> > > >  #include <linux/vmalloc.h>
> > > >  #include <linux/crc32.h>
> > > >  #include <linux/uuid.h>
> > > > +#include <linux/log2.h>
> > > >
> > > >  #include <linux/lightnvm.h>
> > > >
> > > > @@ -45,7 +46,7 @@
> > > >  #define PBLK_COMMAND_TIMEOUT_MS 30000
> > > >
> > > >  /* Max 512 LUNs per device */
> > > > -#define PBLK_MAX_LUNS_BITMAP (4)
> > > > +#define PBLK_MAX_LUNS_BITMAP (512)
> > >
> > > 512 is probably enough for everyone for now, but why not make this dynamic?
> > > Better not waste memory and introduce an artificial limit on number of luns.
> >
> > I can make it dynamic. It just makes the init path more messy as
> > meta_init takes the write semaphore (and hence busy bitmap) so I have
> > to init RAIL in the middle of everything else.
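A minimal sketch of what a dynamic variant could look like, sizing the busy
bitmap from the device geometry at init time instead of using the fixed
PBLK_MAX_LUNS_BITMAP (the helper name is hypothetical and not part of this
patchset; busy_bitmap would become an unsigned long * in struct pblk_rail):

static int pblk_rail_alloc_busy_bitmap(struct pblk *pblk)
{
        struct nvm_geo *geo = &pblk->dev->geo;

        /* one bit per LUN, sized for the actual LUN count of the device */
        pblk->rail.busy_bitmap = bitmap_zalloc(geo->all_luns, GFP_KERNEL);

        return pblk->rail.busy_bitmap ? 0 : -ENOMEM;
}

The allocation would still have to be ordered against meta_init as discussed
above, since meta_init takes the write semaphore and therefore touches the
busy bitmap.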
> >
> > >
> > > >
> > > >  #define NR_PHY_IN_LOG (PBLK_EXPOSED_PAGE_SIZE / PBLK_SECTOR)
> > > >
> > > > @@ -123,6 +124,13 @@ struct pblk_g_ctx {
> > > >         u64 lba;
> > > >  };
> > > >
> > > > +#ifdef CONFIG_NVM_PBLK_RAIL
> > > > +#define PBLK_RAIL_STRIDE_WIDTH 4
> > > > +#define PR_BITMAP_SIZE (NVM_MAX_VLBA * PBLK_RAIL_STRIDE_WIDTH)
> > > > +#else
> > > > +#define PR_BITMAP_SIZE NVM_MAX_VLBA
> > > > +#endif
> > > > +
> > > >  /* partial read context */
> > > >  struct pblk_pr_ctx {
> > > >         struct bio *orig_bio;
> > > > @@ -604,6 +612,39 @@ struct pblk_addrf {
> > > >         int sec_ws_stripe;
> > > >  };
> > > >
> > > > +#ifdef CONFIG_NVM_PBLK_RAIL
> > > > +
> > > > +struct p2b_entry {
> > > > +       int pos;
> > > > +       int nr_valid;
> > > > +};
> > > > +
> > > > +struct pblk_rail {
> > > > +       struct p2b_entry **p2b;         /* Maps RAIL sectors to rb pos */
> > > > +       struct page *pages;             /* Pages to hold parity writes */
> > > > +       void **data;                    /* Buffer that holds parity pages */
> > > > +       DECLARE_BITMAP(busy_bitmap, PBLK_MAX_LUNS_BITMAP);
> > > > +       u64 *lba;                       /* Buffer to compute LBA parity */
> > > > +};
> > > > +
> > > > +/* Initialize and tear down RAIL */
> > > > +int pblk_rail_init(struct pblk *pblk);
> > > > +void pblk_rail_free(struct pblk *pblk);
> > > > +/* Adjust some system parameters */
> > > > +bool pblk_rail_meta_distance(struct pblk_line *data_line);
> > > > +int pblk_rail_rb_delay(struct pblk_rb *rb);
> > > > +/* Core */
> > > > +void pblk_rail_line_close(struct pblk *pblk, struct pblk_line *line);
> > > > +int pblk_rail_down_stride(struct pblk *pblk, int lun, int timeout);
> > > > +void pblk_rail_up_stride(struct pblk *pblk, int lun);
> > > > +/* Write path */
> > > > +int pblk_rail_submit_write(struct pblk *pblk);
> > > > +/* Read Path */
> > > > +int pblk_rail_read_bio(struct pblk *pblk, struct nvm_rq *rqd, int blba,
> > > > +                      unsigned long *read_bitmap, int bio_init_idx,
> > > > +                      struct bio **bio);
> > > > +#endif /* CONFIG_NVM_PBLK_RAIL */
> > > > +
> > > >  typedef int (pblk_map_page_fn)(struct pblk *pblk, unsigned int sentry,
> > > >                                struct ppa_addr *ppa_list,
> > > >                                unsigned long *lun_bitmap,
> > > > @@ -1115,6 +1156,26 @@ static inline u64 pblk_dev_ppa_to_line_addr(struct pblk *pblk,
> > > >         return paddr;
> > > >  }
> > > >
> > > > +static inline int pblk_pos_to_lun(struct nvm_geo *geo, int pos)
> > > > +{
> > > > +       return pos >> ilog2(geo->num_ch);
> > > > +}
> > > > +
> > > > +static inline int pblk_pos_to_chnl(struct nvm_geo *geo, int pos)
> > > > +{
> > > > +       return pos % geo->num_ch;
> > > > +}
> > > > +
> > > > +static inline void pblk_dev_ppa_set_lun(struct ppa_addr *p, int lun)
> > > > +{
> > > > +       p->a.lun = lun;
> > > > +}
> > > > +
> > > > +static inline void pblk_dev_ppa_set_chnl(struct ppa_addr *p, int chnl)
> > > > +{
> > > > +       p->a.ch = chnl;
> > > > +}
> > >
> > > What is the motivation for adding the lun and chnl setters? They seem
> > > uncalled for.
> >
> > They are used in RAIL's read path to generate the ppas for RAIL reads
>
> It's just a style thing, but they seem a bit redundant to me.
>
> >
> > >
> > > > +
> > > >  static inline struct ppa_addr pblk_ppa32_to_ppa64(struct pblk *pblk, u32 ppa32)
> > > >  {
> > > >         struct nvm_tgt_dev *dev = pblk->dev;
> > > > --
> > > > 2.17.1
> > > >

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 0/6] lightnvm: pblk: Introduce RAIL to enforce low tail read latency
  2018-09-19  7:58     ` Hans Holmberg
@ 2018-09-21  4:34       ` Heiner Litz
  0 siblings, 0 replies; 17+ messages in thread
From: Heiner Litz @ 2018-09-21  4:34 UTC (permalink / raw)
  To: hans.ml.holmberg
  Cc: linux-block, Javier Gonzalez, Matias Bjørling,
	igor.j.konopko, marcin.dziegielewski

Hi Hans,
here is my git branch:
https://github.com/hlitz/rail_lightnvm/tree/rail_4-20
thanks for testing!
Heiner
On Wed, Sep 19, 2018 at 12:58 AM Hans Holmberg
<hans.ml.holmberg@owltronix.com> wrote:
>
> On Tue, Sep 18, 2018 at 6:13 PM Heiner Litz <hlitz@ucsc.edu> wrote:
> >
> > Hi Hans,
> > thanks a lot for your comments! I will send you a git repo to test. I
> > have a patch which enables/disables RAIL via ioctl and will send that
> > as well.
>
> Great!
>
> Once I have the code in a branch I can start creating test cases for
> bad-block corner cases, recovery and write error handling.
>
> Thanks,
> Hans
>
> >
> > Heiner
> > On Tue, Sep 18, 2018 at 4:46 AM Hans Holmberg
> > <hans.ml.holmberg@owltronix.com> wrote:
> > >
> > > On Mon, Sep 17, 2018 at 7:29 AM Heiner Litz <hlitz@ucsc.edu> wrote:
> > > >
> > > > Hi All,
> > > > this patchset introduces RAIL, a mechanism to enforce low tail read latency for
> > > > lightnvm OCSSD devices. RAIL leverages redundancy to guarantee that reads are
> > > > always served from LUNs that do not serve a high latency operation such as a
> > > > write or erase. This avoids that reads become serialized behind these operations
> > > > reducing tail latency by ~10x. In particular, in the absence of ECC read errors,
> > > > it provides 99.99 percentile read latencies of below 500us. RAIL introduces
> > > > capacity overheads (7%-25%) due to RAID-5 like striping (providing fault
> > > > tolerance) and reduces the maximum write bandwidth to 110K IOPS on CNEX SSD.
> > > >
> > > > This patch is based on pblk/core and requires two additional patches from Javier
> > > > to be applicable (let me know if you want me to rebase):
> > >
> > > As the patches do not apply, could you make a branch available so I
> > > can get hold of the code in its present state?
> > > That would make reviewing and testing so much easier.
> > >
> > > I have some concerns regarding recovery and write error handling, but
> > > I have not found anything that can't be fixed.
> > > I also believe that rail/on off and stride width should not be
> > > configured at build-time, but instead be part of the create IOCTL.
> > >
> > > See my comments on the individual patches for details.
> > >
> > > >
> > > > The 1st patch exposes some existing APIs so they can be used by RAIL
> > > > The 2nd patch introduces a configurable sector mapping function
> > > > The 3rd patch refactors the write path so the end_io_fn can be specified when
> > > > setting up the request
> > > > The 4th patch adds a new submit io function that acquires the write semaphore
> > > > The 5th patch introduces the RAIL feature and its API
> > > > The 6th patch integrates RAIL into pblk's read and write path
> > > >
> > > >

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 5/6] lightnvm: pblk: Add RAIL interface
  2018-09-20 23:58         ` Heiner Litz
@ 2018-09-21  7:04           ` Hans Holmberg
  0 siblings, 0 replies; 17+ messages in thread
From: Hans Holmberg @ 2018-09-21  7:04 UTC (permalink / raw)
  To: hlitz
  Cc: linux-block, Javier Gonzalez, Matias Bjorling, igor.j.konopko,
	marcin.dziegielewski

On Fri, Sep 21, 2018 at 1:59 AM Heiner Litz <hlitz@ucsc.edu> wrote:
>
> On Wed, Sep 19, 2018 at 12:53 AM Hans Holmberg
> <hans.ml.holmberg@owltronix.com> wrote:
> >
> > On Tue, Sep 18, 2018 at 6:11 PM Heiner Litz <hlitz@ucsc.edu> wrote:
> > >
> > > On Tue, Sep 18, 2018 at 4:28 AM Hans Holmberg
> > > <hans.ml.holmberg@owltronix.com> wrote:
> > > >
> > > > On Mon, Sep 17, 2018 at 7:30 AM Heiner Litz <hlitz@ucsc.edu> wrote:
> > > > >
> > > > > In preparation for supporting RAIL, add the RAIL API.
> > > > >
> > > > > Signed-off-by: Heiner Litz <hlitz@ucsc.edu>
> > > > > ---
> > > > >  drivers/lightnvm/pblk-rail.c | 808 +++++++++++++++++++++++++++++++++++
> > > > >  drivers/lightnvm/pblk.h      |  63 ++-
> > > > >  2 files changed, 870 insertions(+), 1 deletion(-)
> > > > >  create mode 100644 drivers/lightnvm/pblk-rail.c
> > > > >
> > > > > diff --git a/drivers/lightnvm/pblk-rail.c b/drivers/lightnvm/pblk-rail.c
> > > > > new file mode 100644
> > > > > index 000000000000..a48ed31a0ba9
> > > > > --- /dev/null
> > > > > +++ b/drivers/lightnvm/pblk-rail.c
> > > > > @@ -0,0 +1,808 @@
> > > > > +/*
> > > > > + * Copyright (C) 2018 Heiner Litz
> > > > > + * Initial release: Heiner Litz <hlitz@ucsc.edu>
> > > > > + *
> > > > > + * This program is free software; you can redistribute it and/or
> > > > > + * modify it under the terms of the GNU General Public License version
> > > > > + * 2 as published by the Free Software Foundation.
> > > > > + *
> > > > > + * This program is distributed in the hope that it will be useful, but
> > > > > + * WITHOUT ANY WARRANTY; without even the implied warranty of
> > > > > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> > > > > + * General Public License for more details.
> > > > > + *
> > > > > + * pblk-rail.c - pblk's RAIL path
> > > > > + */
> > > > > +
> > > > > +#include "pblk.h"
> > > > > +
> > > > > +#define PBLK_RAIL_EMPTY ~0x0
> > > > This constant is not being used.
> > >
> > > thanks, will remove
> > >
> > > > > +#define PBLK_RAIL_PARITY_WRITE 0x8000
> > > > Where does this magic number come from? Please document.
> > >
> > > ok , will document
> > >
> > > >
> > > > > +
> > > > > +/* RAIL auxiliary functions */
> > > > > +static unsigned int pblk_rail_nr_parity_luns(struct pblk *pblk)
> > > > > +{
> > > > > +       struct pblk_line_meta *lm = &pblk->lm;
> > > > > +
> > > > > +       return lm->blk_per_line / PBLK_RAIL_STRIDE_WIDTH;
> > > > > +}
> > > > > +
> > > > > +static unsigned int pblk_rail_nr_data_luns(struct pblk *pblk)
> > > > > +{
> > > > > +       struct pblk_line_meta *lm = &pblk->lm;
> > > > > +
> > > > > +       return lm->blk_per_line - pblk_rail_nr_parity_luns(pblk);
> > > > > +}
> > > > > +
> > > > > +static unsigned int pblk_rail_sec_per_stripe(struct pblk *pblk)
> > > > > +{
> > > > > +       struct pblk_line_meta *lm = &pblk->lm;
> > > > > +
> > > > > +       return lm->blk_per_line * pblk->min_write_pgs;
> > > > > +}
> > > > > +
> > > > > +static unsigned int pblk_rail_psec_per_stripe(struct pblk *pblk)
> > > > > +{
> > > > > +       return pblk_rail_nr_parity_luns(pblk) * pblk->min_write_pgs;
> > > > > +}
> > > > > +
> > > > > +static unsigned int pblk_rail_dsec_per_stripe(struct pblk *pblk)
> > > > > +{
> > > > > +       return pblk_rail_sec_per_stripe(pblk) - pblk_rail_psec_per_stripe(pblk);
> > > > > +}
> > > > > +
> > > > > +static unsigned int pblk_rail_wrap_lun(struct pblk *pblk, unsigned int lun)
> > > > > +{
> > > > > +       struct pblk_line_meta *lm = &pblk->lm;
> > > > > +
> > > > > +       return (lun & (lm->blk_per_line - 1));
> > > > > +}
> > > > > +
> > > > > +bool pblk_rail_meta_distance(struct pblk_line *data_line)
> > > > > +{
> > > > > +       return (data_line->meta_distance % PBLK_RAIL_STRIDE_WIDTH) == 0;
> > > > > +}
> > > > > +
> > > > > +/* Notify readers that LUN is serving high latency operation */
> > > > > +static void pblk_rail_notify_reader_down(struct pblk *pblk, int lun)
> > > > > +{
> > > > > +       WARN_ON(test_and_set_bit(lun, pblk->rail.busy_bitmap));
> > > > > +       /* Make sure that busy bit is seen by reader before proceeding */
> > > > > +       smp_mb__after_atomic();
> > > > > +}
> > > > > +
> > > > > +static void pblk_rail_notify_reader_up(struct pblk *pblk, int lun)
> > > > > +{
> > > > > +       /* Make sure that write is completed before releasing busy bit */
> > > > > +       smp_mb__before_atomic();
> > > > > +       WARN_ON(!test_and_clear_bit(lun, pblk->rail.busy_bitmap));
> > > > > +}
> > > > > +
> > > > > +int pblk_rail_lun_busy(struct pblk *pblk, struct ppa_addr ppa)
> > > > > +{
> > > > > +       struct nvm_tgt_dev *dev = pblk->dev;
> > > > > +       struct nvm_geo *geo = &dev->geo;
> > > > > +       int lun_pos = pblk_ppa_to_pos(geo, ppa);
> > > > > +
> > > > > +       return test_bit(lun_pos, pblk->rail.busy_bitmap);
> > > > > +}
> > > > > +
> > > > > +/* Enforces one writer per stride */
> > > > > +int pblk_rail_down_stride(struct pblk *pblk, int lun_pos, int timeout)
> > > > > +{
> > > > > +       struct pblk_lun *rlun;
> > > > > +       int strides = pblk_rail_nr_parity_luns(pblk);
> > > > > +       int stride = lun_pos % strides;
> > > > > +       int ret;
> > > > > +
> > > > > +       rlun = &pblk->luns[stride];
> > > > > +       ret = down_timeout(&rlun->wr_sem, timeout);
> > > > > +       pblk_rail_notify_reader_down(pblk, lun_pos);
> > > > > +
> > > > > +       return ret;
> > > > > +}
> > > > > +
> > > > > +void pblk_rail_up_stride(struct pblk *pblk, int lun_pos)
> > > > > +{
> > > > > +       struct pblk_lun *rlun;
> > > > > +       int strides = pblk_rail_nr_parity_luns(pblk);
> > > > > +       int stride = lun_pos % strides;
> > > > > +
> > > > > +       pblk_rail_notify_reader_up(pblk, lun_pos);
> > > > > +       rlun = &pblk->luns[stride];
> > > > > +       up(&rlun->wr_sem);
> > > > > +}
> > > > > +
> > > > > +/* Determine whether a sector holds data, meta or is bad */
> > > > > +bool pblk_rail_valid_sector(struct pblk *pblk, struct pblk_line *line, int pos)
> > > > > +{
> > > > > +       struct pblk_line_meta *lm = &pblk->lm;
> > > > > +       struct nvm_tgt_dev *dev = pblk->dev;
> > > > > +       struct nvm_geo *geo = &dev->geo;
> > > > > +       struct ppa_addr ppa;
> > > > > +       int lun;
> > > > > +
> > > > > +       if (pos >= line->smeta_ssec && pos < (line->smeta_ssec + lm->smeta_sec))
> > > > > +               return false;
> > > > > +
> > > > > +       if (pos >= line->emeta_ssec &&
> > > > > +           pos < (line->emeta_ssec + lm->emeta_sec[0]))
> > > > > +               return false;
> > > > > +
> > > > > +       ppa = addr_to_gen_ppa(pblk, pos, line->id);
> > > > > +       lun = pblk_ppa_to_pos(geo, ppa);
> > > > > +
> > > > > +       return !test_bit(lun, line->blk_bitmap);
> > > > > +}
> > > > > +
> > > > > +/* Delay rb overwrite until whole stride has been written */
> > > > > +int pblk_rail_rb_delay(struct pblk_rb *rb)
> > > > > +{
> > > > > +       struct pblk *pblk = container_of(rb, struct pblk, rwb);
> > > > > +
> > > > > +       return pblk_rail_sec_per_stripe(pblk);
> > > > > +}
> > > > > +
> > > > > +static unsigned int pblk_rail_sec_to_stride(struct pblk *pblk, unsigned int sec)
> > > > > +{
> > > > > +       unsigned int sec_in_stripe = sec % pblk_rail_sec_per_stripe(pblk);
> > > > > +       int page = sec_in_stripe / pblk->min_write_pgs;
> > > > > +
> > > > > +       return page % pblk_rail_nr_parity_luns(pblk);
> > > > > +}
> > > > > +
> > > > > +static unsigned int pblk_rail_sec_to_idx(struct pblk *pblk, unsigned int sec)
> > > > > +{
> > > > > +       unsigned int sec_in_stripe = sec % pblk_rail_sec_per_stripe(pblk);
> > > > > +
> > > > > +       return sec_in_stripe / pblk_rail_psec_per_stripe(pblk);
> > > > > +}
> > > > > +
> > > > > +static void pblk_rail_data_parity(void *dest, void *src)
> > > > > +{
> > > > > +       unsigned int i;
> > > > > +
> > > > > +       for (i = 0; i < PBLK_EXPOSED_PAGE_SIZE / sizeof(unsigned long); i++)
> > > > > +               ((unsigned long *)dest)[i] ^= ((unsigned long *)src)[i];
> > > > > +}
> > > > > +
> > > > > +static void pblk_rail_lba_parity(u64 *dest, u64 *src)
> > > > > +{
> > > > > +       *dest ^= *src;
> > > > > +}
> > > > > +
> > > > > +/* Tracks where a sector is located in the rwb */
> > > > > +void pblk_rail_track_sec(struct pblk *pblk, struct pblk_line *line, int cur_sec,
> > > > > +                        int sentry, int nr_valid)
> > > > > +{
> > > > > +       int stride, idx, pos;
> > > > > +
> > > > > +       stride = pblk_rail_sec_to_stride(pblk, cur_sec);
> > > > > +       idx = pblk_rail_sec_to_idx(pblk, cur_sec);
> > > > > +       pos = pblk_rb_wrap_pos(&pblk->rwb, sentry);
> > > > > +       pblk->rail.p2b[stride][idx].pos = pos;
> > > > > +       pblk->rail.p2b[stride][idx].nr_valid = nr_valid;
> > > > > +}
> > > > > +
> > > > > +/* RAIL's sector mapping function */
> > > > > +static void pblk_rail_map_sec(struct pblk *pblk, struct pblk_line *line,
> > > > > +                             int sentry, struct pblk_sec_meta *meta_list,
> > > > > +                             __le64 *lba_list, struct ppa_addr ppa)
> > > > > +{
> > > > > +       struct pblk_w_ctx *w_ctx;
> > > > > +       __le64 addr_empty = cpu_to_le64(ADDR_EMPTY);
> > > > > +
> > > > > +       kref_get(&line->ref);
> > > > > +
> > > > > +       if (sentry & PBLK_RAIL_PARITY_WRITE) {
> > > > > +               u64 *lba;
> > > > > +
> > > > > +               sentry &= ~PBLK_RAIL_PARITY_WRITE;
> > > > > +               lba = &pblk->rail.lba[sentry];
> > > > > +               meta_list->lba = cpu_to_le64(*lba);
> > > > > +               *lba_list = cpu_to_le64(*lba);
> > > > > +               line->nr_valid_lbas++;
> > > > > +       } else {
> > > > > +               w_ctx = pblk_rb_w_ctx(&pblk->rwb, sentry);
> > > > > +               w_ctx->ppa = ppa;
> > > > > +               meta_list->lba = cpu_to_le64(w_ctx->lba);
> > > > > +               *lba_list = cpu_to_le64(w_ctx->lba);
> > > > > +
> > > > > +               if (*lba_list != addr_empty)
> > > > > +                       line->nr_valid_lbas++;
> > > > > +               else
> > > > > +                       atomic64_inc(&pblk->pad_wa);
> > > > > +       }
> > > > > +}
> > > > > +
> > > > > +int pblk_rail_map_page_data(struct pblk *pblk, unsigned int sentry,
> > > > > +                           struct ppa_addr *ppa_list,
> > > > > +                           unsigned long *lun_bitmap,
> > > > > +                           struct pblk_sec_meta *meta_list,
> > > > > +                           unsigned int valid_secs)
> > > > > +{
> > > > > +       struct pblk_line *line = pblk_line_get_data(pblk);
> > > > > +       struct pblk_emeta *emeta;
> > > > > +       __le64 *lba_list;
> > > > > +       u64 paddr;
> > > > > +       int nr_secs = pblk->min_write_pgs;
> > > > > +       int i;
> > > > > +
> > > > > +       if (pblk_line_is_full(line)) {
> > > > > +               struct pblk_line *prev_line = line;
> > > > > +
> > > > > +               /* If we cannot allocate a new line, make sure to store metadata
> > > > > +                * on current line and then fail
> > > > > +                */
> > > > > +               line = pblk_line_replace_data(pblk);
> > > > > +               pblk_line_close_meta(pblk, prev_line);
> > > > > +
> > > > > +               if (!line)
> > > > > +                       return -EINTR;
> > > > > +       }
> > > > > +
> > > > > +       emeta = line->emeta;
> > > > > +       lba_list = emeta_to_lbas(pblk, emeta->buf);
> > > > > +
> > > > > +       paddr = pblk_alloc_page(pblk, line, nr_secs);
> > > > > +
> > > > > +       pblk_rail_track_sec(pblk, line, paddr, sentry, valid_secs);
> > > > > +
> > > > > +       for (i = 0; i < nr_secs; i++, paddr++) {
> > > > > +               __le64 addr_empty = cpu_to_le64(ADDR_EMPTY);
> > > > > +
> > > > > +               /* ppa to be sent to the device */
> > > > > +               ppa_list[i] = addr_to_gen_ppa(pblk, paddr, line->id);
> > > > > +
> > > > > +               /* Write context for target bio completion on write buffer. Note
> > > > > +                * that the write buffer is protected by the sync backpointer,
> > > > > +                * and a single writer thread has access to each specific entry
> > > > > +                * at a time. Thus, it is safe to modify the context for the
> > > > > +                * entry we are setting up for submission without taking any
> > > > > +                * lock or memory barrier.
> > > > > +                */
> > > > > +               if (i < valid_secs) {
> > > > > +                       pblk_rail_map_sec(pblk, line, sentry + i, &meta_list[i],
> > > > > +                                         &lba_list[paddr], ppa_list[i]);
> > > > > +               } else {
> > > > > +                       lba_list[paddr] = meta_list[i].lba = addr_empty;
> > > > > +                       __pblk_map_invalidate(pblk, line, paddr);
> > > > > +               }
> > > > > +       }
> > > > > +
> > > > > +       pblk_down_rq(pblk, ppa_list[0], lun_bitmap);
> > > > > +       return 0;
> > > > > +}
> > > >
> > > > This is a lot of duplication of code from the "normal" pblk map
> > > > function -  could you refactor to avoid this?
> > >
> > > I wanted to keep the mapping function as general as possible in case
> > > we want to support other mapping functions at some point. If you think
> > > this is not needed I can reduce the mapping func to only the code that
> > > differs between the mapping functions, e.g. we could turn
> > > pblk_map_page_data into a pblk_map_sec_data.
> >
> > I think it would be better to try to keep as much code common as we
> > can, and if we introduce other mapping functions in the future
> > we'll rework the common denominator.
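One way the common code could be kept (purely an illustrative sketch, not what
this patchset does): keep a single page-mapping loop and make only the
per-sector step pluggable, so pblk_rail_map_page_data reduces to passing its
own sector hook. The typedef mirrors the existing pblk_map_page_fn style; the
names below are hypothetical:

/* per-sector mapping hook; the RAIL and default sector mappers would both
 * implement this signature
 */
typedef void (pblk_map_sec_fn)(struct pblk *pblk, struct pblk_line *line,
                               unsigned int sentry,
                               struct pblk_sec_meta *meta,
                               __le64 *lba_entry, struct ppa_addr ppa);

/* shared line handling, page allocation and per-sector loop of today's
 * pblk_map_page_data, with the sector step replaced by the callback
 */
int pblk_map_page_common(struct pblk *pblk, unsigned int sentry,
                         struct ppa_addr *ppa_list,
                         unsigned long *lun_bitmap,
                         struct pblk_sec_meta *meta_list,
                         unsigned int valid_secs,
                         pblk_map_sec_fn *map_sec);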
> >
> > >
> > > >
> > > > > +
> > > > > +/* RAIL Initialization and tear down */
> > > > > +int pblk_rail_init(struct pblk *pblk)
> > > > > +{
> > > > > +       struct pblk_line_meta *lm = &pblk->lm;
> > > > > +       int i, p2be;
> > > > > +       unsigned int nr_strides;
> > > > > +       unsigned int psecs;
> > > > > +       void *kaddr;
> > > > > +
> > > > > +       if (!PBLK_RAIL_STRIDE_WIDTH)
> > > > > +               return 0;
> > > > > +
> > > > > +       if (((lm->blk_per_line % PBLK_RAIL_STRIDE_WIDTH) != 0) ||
> > > > > +           (lm->blk_per_line < PBLK_RAIL_STRIDE_WIDTH)) {
> > > > > +               pr_err("pblk: unsupported RAIL stride %i\n", lm->blk_per_line);
> > > > > +               return -EINVAL;
> > > > > +       }
> > > >
> > > > This is just a check of the maximum blocks per line - bad blocks will
> > > > reduce the number of writable blocks. What happens when a line goes
> > > > below PBLK_RAIL_STRIDE_WIDTH writable blocks?
> > >
> > > This check just guarantees that lm->blk_per_line is a multiple of
> > > PBLK_RAIL_STRIDE_WIDTH. Bad blocks are handled dynamically at runtime
> > > via pblk_rail_valid_sector(pblk, line, cur) which skips parity
> > > computation if the parity block is bad. In theory a line can have
> > > fewer writable blocks than PBLK_RAIL_STRIDE_WIDTH; in this case parity
> > > is computed over a smaller number of blocks.
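For illustration, with the helpers defined earlier in this patch and a made-up
line geometry (the numbers below are only an example, not taken from any
particular device):

/* Example: blk_per_line = 64, min_write_pgs = 8, PBLK_RAIL_STRIDE_WIDTH = 4
 *
 *   pblk_rail_nr_parity_luns()  = 64 / 4    = 16 parity LUNs per line
 *   pblk_rail_nr_data_luns()    = 64 - 16   = 48 data LUNs per line
 *   pblk_rail_sec_per_stripe()  = 64 * 8    = 512 sectors per stripe
 *   pblk_rail_psec_per_stripe() = 16 * 8    = 128 parity sectors per stripe
 *   capacity scaling            = (4 - 1)/4 = 75% of raw (25% overhead)
 *
 * A bad block in a stride simply shrinks that stride: pblk_rail_valid_sector()
 * filters it out of both parity generation and RAIL reads.
 */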
> >
> > Yes, I see now, it should work.
> >
> > The only case I see that is problematic is if only the parity block(s)
> > are non-bad in a line, resulting in no data being written, just parity
> > (adding a huge write latency penalty) - we could either disable RAIL
> > for that class of lines or mark them as bad.
> >
> > >
> > > >
> > > > > +
> > > > > +       psecs = pblk_rail_psec_per_stripe(pblk);
> > > > > +       nr_strides = pblk_rail_sec_per_stripe(pblk) / PBLK_RAIL_STRIDE_WIDTH;
> > > > > +
> > > > > +       pblk->rail.p2b = kmalloc_array(nr_strides, sizeof(struct p2b_entry *),
> > > > > +                                      GFP_KERNEL);
> > > > > +       if (!pblk->rail.p2b)
> > > > > +               return -ENOMEM;
> > > > > +
> > > > > +       for (p2be = 0; p2be < nr_strides; p2be++) {
> > > > > +               pblk->rail.p2b[p2be] = kmalloc_array(PBLK_RAIL_STRIDE_WIDTH - 1,
> > > > > +                                              sizeof(struct p2b_entry),
> > > > > +                                              GFP_KERNEL);
> > > > > +               if (!pblk->rail.p2b[p2be])
> > > > > +                       goto free_p2b_entries;
> > > > > +       }
> > > > > +
> > > > > +       pblk->rail.data = kmalloc(psecs * sizeof(void *), GFP_KERNEL);
> > > > > +       if (!pblk->rail.data)
> > > > > +               goto free_p2b_entries;
> > > > > +
> > > > > +       pblk->rail.pages = alloc_pages(GFP_KERNEL, get_count_order(psecs));
> > > > > +       if (!pblk->rail.pages)
> > > > > +               goto free_data;
> > > > > +
> > > > > +       kaddr = page_address(pblk->rail.pages);
> > > > > +       for (i = 0; i < psecs; i++)
> > > > > +               pblk->rail.data[i] = kaddr + i * PBLK_EXPOSED_PAGE_SIZE;
> > > > > +
> > > > > +       pblk->rail.lba = kmalloc_array(psecs, sizeof(u64 *), GFP_KERNEL);
> > > > > +       if (!pblk->rail.lba)
> > > > > +               goto free_pages;
> > > > > +
> > > > > +       /* Subtract parity bits from device capacity */
> > > > > +       pblk->capacity = pblk->capacity * (PBLK_RAIL_STRIDE_WIDTH - 1) /
> > > > > +               PBLK_RAIL_STRIDE_WIDTH;
> > > > > +
> > > > > +       pblk->map_page = pblk_rail_map_page_data;
> > > > > +
> > > > > +       return 0;
> > > > > +
> > > > > +free_pages:
> > > > > +       free_pages((unsigned long)page_address(pblk->rail.pages),
> > > > > +                  get_count_order(psecs));
> > > > > +free_data:
> > > > > +       kfree(pblk->rail.data);
> > > > > +free_p2b_entries:
> > > > > +       for (p2be = p2be - 1; p2be >= 0; p2be--)
> > > > > +               kfree(pblk->rail.p2b[p2be]);
> > > > > +       kfree(pblk->rail.p2b);
> > > > > +
> > > > > +       return -ENOMEM;
> > > > > +}
> > > > > +
> > > > > +void pblk_rail_free(struct pblk *pblk)
> > > > > +{
> > > > > +       unsigned int i;
> > > > > +       unsigned int nr_strides;
> > > > > +       unsigned int psecs;
> > > > > +
> > > > > +       psecs = pblk_rail_psec_per_stripe(pblk);
> > > > > +       nr_strides = pblk_rail_sec_per_stripe(pblk) / PBLK_RAIL_STRIDE_WIDTH;
> > > > > +
> > > > > +       kfree(pblk->rail.lba);
> > > > > +       free_pages((unsigned long)page_address(pblk->rail.pages),
> > > > > +                  get_count_order(psecs));
> > > > > +       kfree(pblk->rail.data);
> > > > > +       for (i = 0; i < nr_strides; i++)
> > > > > +               kfree(pblk->rail.p2b[i]);
> > > > > +       kfree(pblk->rail.p2b);
> > > > > +}
> > > > > +
> > > > > +/* PBLK supports 64 ppas max. By performing RAIL reads, a sector is read using
> > > > > + * multiple ppas which can lead to violation of the 64 ppa limit. In this case,
> > > > > + * split the bio
> > > > > + */
> > > > > +static void pblk_rail_bio_split(struct pblk *pblk, struct bio **bio, int sec)
> > > > > +{
> > > > > +       struct nvm_tgt_dev *dev = pblk->dev;
> > > > > +       struct bio *split;
> > > > > +
> > > > > +       sec *= (dev->geo.csecs >> 9);
> > > > > +
> > > > > +       split = bio_split(*bio, sec, GFP_KERNEL, &pblk_bio_set);
> > > > > +       /* there is no chance to merge the split bio */
> > > > > +       split->bi_opf |= REQ_NOMERGE;
> > > > > +       bio_set_flag(*bio, BIO_QUEUE_ENTERED);
> > > > > +       bio_chain(split, *bio);
> > > > > +       generic_make_request(*bio);
> > > > > +       *bio = split;
> > > > > +}
> > > > > +
> > > > > +/* RAIL's Write Path */
> > > > > +static int pblk_rail_sched_parity(struct pblk *pblk)
> > > > > +{
> > > > > +       struct pblk_line *line = pblk_line_get_data(pblk);
> > > > > +       unsigned int sec_in_stripe;
> > > > > +
> > > > > +       while (1) {
> > > > > +               sec_in_stripe = line->cur_sec % pblk_rail_sec_per_stripe(pblk);
> > > > > +
> > > > > +               /* Schedule parity write at end of data section */
> > > > > +               if (sec_in_stripe >= pblk_rail_dsec_per_stripe(pblk))
> > > > > +                       return 1;
> > > > > +
> > > > > +               /* Skip bad blocks and meta sectors until we find a valid sec */
> > > > > +               if (test_bit(line->cur_sec, line->map_bitmap))
> > > > > +                       line->cur_sec += pblk->min_write_pgs;
> > > > > +               else
> > > > > +                       break;
> > > > > +       }
> > > > > +
> > > > > +       return 0;
> > > > > +}
> > > > > +
> > > > > +/* Mark RAIL parity sectors as invalid sectors so they will be gc'ed */
> > > > > +void pblk_rail_line_close(struct pblk *pblk, struct pblk_line *line)
> > > > > +{
> > > > > +       int off, bit;
> > > > > +
> > > > > +       for (off = pblk_rail_dsec_per_stripe(pblk);
> > > > > +            off < pblk->lm.sec_per_line;
> > > > > +            off += pblk_rail_sec_per_stripe(pblk)) {
> > > > > +               for (bit = 0; bit < pblk_rail_psec_per_stripe(pblk); bit++)
> > > > > +                       set_bit(off + bit, line->invalid_bitmap);
> > > > > +       }
> > > > > +}
> > > > > +
> > > > > +void pblk_rail_end_io_write(struct nvm_rq *rqd)
> > > > > +{
> > > > > +       struct pblk *pblk = rqd->private;
> > > > > +       struct pblk_c_ctx *c_ctx = nvm_rq_to_pdu(rqd);
> > > > > +
> > > > > +       if (rqd->error) {
> > > > > +               pblk_log_write_err(pblk, rqd);
> > > > > +               return pblk_end_w_fail(pblk, rqd);
> > > >
> > > > The write error recovery path relies on the fact that sentry in c_ctx is
> > > > an index in the write buffer, so this won't work.
> > >
> > > You mean a RAIL parity write? Yes, good catch.
> > >
> >
> > It does not make sense to re-issue failing parity writes anyway, right?
>
> Yes this is correct. I think I can just take out end_w_fail, but we
> need to let readers know that the page is bad (see below).
>
> >
> > > >
> > > > Additionally, if a write (data or parity) fails, the whole stride would
> > > > be broken and need to fall back on "normal" reads, right?
> > > > One solution could be to check line->w_err_gc->has_write_err on the read path.
> > >
> > > When a data write fails it is remapped and the RAIL mapping function
> > > tracks that new location in the p2b. The page will be marked bad and
> > > hence taken into account when computing parity in the case of parity
> > > writes and RAIL reads, so the line should still be intact. This might
> > > be insufficiently tested but in theory it should work.
> >
> > As far as I can tell from the code, pblk_rail_valid_sector only checks
> > if the sector is occupied by metadata or if the whole block is bad.
> > In the case of a write failure, the block will not be marked bad. What
> > we could do is to keep track of the write pointer internally to check
> > if the sector had been successfully written.
> >
> > I can create a patch for keeping track of the write pointer for each
> > block - this would be useful for debugging purposes in any case. Once
> > this is in place it would be easy to add a check in
> > pblk_rail_valid_sector ensuring that the sector has actually been
> > written successfully.
>
> Hmm, I don't think keeping track of the write pointer is sufficient. Readers
> need to be able to determine whether a page is bad at any time. So I
> believe we need a per-line bitmap telling us which stripes (horizontal pages)
> are bad, or am I missing something?

The easiest fix is to check if the line has write error(s) and fall
back on non-rail-reads until the line has been recovered. Write errors
are rare, so this would not be so bad.

If we start tracking per-chunk write pointers, we can check whether a
rail-read is impossible due to a write error: if the chunk-sector offset
of a read in a rail-request is greater than the wp in its chunk, the
sector has not been written successfully.
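A rough sketch of how the two options could be combined in the read path; the
per-chunk write pointer is hypothetical (only the line->w_err_gc->has_write_err
flag exists today), and the check would sit next to pblk_rail_valid_sector()
in pblk_rail_setup_ppas():

static bool pblk_rail_read_allowed(struct pblk *pblk, struct pblk_line *line,
                                   u64 paddr)
{
        /* (a) simplest fix: fall back to regular reads until a line with
         *     write errors has been recovered
         */
        if (line->w_err_gc->has_write_err)
                return false;

        /* (b) if per-chunk write pointers were tracked, a stricter check
         *     could reject only the affected sectors: a paddr whose offset
         *     within its chunk lies beyond the chunk's wp was never written
         *     and cannot be used to reconstruct the stride
         */
        return true;
}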


>
> >
> > >
> > > >
> > > > > +       }
> > > > > +#ifdef CONFIG_NVM_DEBUG
> > > > > +       else
> > > > > +               WARN_ONCE(rqd->bio->bi_status, "pblk: corrupted write error\n");
> > > > > +#endif
> > > > > +
> > > > > +       pblk_up_rq(pblk, c_ctx->lun_bitmap);
> > > > > +
> > > > > +       pblk_rq_to_line_put(pblk, rqd);
> > > > > +       bio_put(rqd->bio);
> > > > > +       pblk_free_rqd(pblk, rqd, PBLK_WRITE);
> > > > > +
> > > > > +       atomic_dec(&pblk->inflight_io);
> > > > > +}
> > > > > +
> > > > > +static int pblk_rail_read_to_bio(struct pblk *pblk, struct nvm_rq *rqd,
> > > > > +                         struct bio *bio, unsigned int stride,
> > > > > +                         unsigned int nr_secs, unsigned int paddr)
> > > > > +{
> > > > > +       struct pblk_c_ctx *c_ctx = nvm_rq_to_pdu(rqd);
> > > > > +       int sec, i;
> > > > > +       int nr_data = PBLK_RAIL_STRIDE_WIDTH - 1;
> > > > > +       struct pblk_line *line = pblk_line_get_data(pblk);
> > > > > +
> > > > > +       c_ctx->nr_valid = nr_secs;
> > > > > +       /* sentry indexes rail page buffer, instead of rwb */
> > > > > +       c_ctx->sentry = stride * pblk->min_write_pgs;
> > > > > +       c_ctx->sentry |= PBLK_RAIL_PARITY_WRITE;
> > > > > +
> > > > > +       for (sec = 0; sec < pblk->min_write_pgs; sec++) {
> > > > > +               void *pg_addr;
> > > > > +               struct page *page;
> > > > > +               u64 *lba;
> > > > > +
> > > > > +               lba = &pblk->rail.lba[stride * pblk->min_write_pgs + sec];
> > > > > +               pg_addr = pblk->rail.data[stride * pblk->min_write_pgs + sec];
> > > > > +               page = virt_to_page(pg_addr);
> > > > > +
> > > > > +               if (!page) {
> > > > > +                       pr_err("pblk: could not allocate RAIL bio page %p\n",
> > > > > +                              pg_addr);
> > > > > +                       return -NVM_IO_ERR;
> > > > > +               }
> > > > > +
> > > > > +               if (bio_add_page(bio, page, pblk->rwb.seg_size, 0) !=
> > > > > +                   pblk->rwb.seg_size) {
> > > > > +                       pr_err("pblk: could not add page to RAIL bio\n");
> > > > > +                       return -NVM_IO_ERR;
> > > > > +               }
> > > > > +
> > > > > +               *lba = 0;
> > > > > +               memset(pg_addr, 0, PBLK_EXPOSED_PAGE_SIZE);
> > > > > +
> > > > > +               for (i = 0; i < nr_data; i++) {
> > > > > +                       struct pblk_rb_entry *entry;
> > > > > +                       struct pblk_w_ctx *w_ctx;
> > > > > +                       u64 lba_src;
> > > > > +                       unsigned int pos;
> > > > > +                       unsigned int cur;
> > > > > +                       int distance = pblk_rail_psec_per_stripe(pblk);
> > > > > +
> > > > > +                       cur = paddr - distance * (nr_data - i) + sec;
> > > > > +
> > > > > +                       if (!pblk_rail_valid_sector(pblk, line, cur))
> > > > > +                               continue;
> > > > > +
> > > > > +                       pos = pblk->rail.p2b[stride][i].pos;
> > > > > +                       pos = pblk_rb_wrap_pos(&pblk->rwb, pos + sec);
> > > > > +                       entry = &pblk->rwb.entries[pos];
> > > > > +                       w_ctx = &entry->w_ctx;
> > > > > +                       lba_src = w_ctx->lba;
> > > > > +
> > > > > +                       if (sec < pblk->rail.p2b[stride][i].nr_valid &&
> > > > > +                           lba_src != ADDR_EMPTY) {
> > > > > +                               pblk_rail_data_parity(pg_addr, entry->data);
> > > > > +                               pblk_rail_lba_parity(lba, &lba_src);
> > > >
> > > > What keeps the parity lba values from invalidating "real" data lbas
> > > > during recovery?
> > >
> > > The RAIL geometry is known during recovery so the parity LBAs can be
> > > ignored; this is not implemented yet.
> >
> > Ah, it's not in place yet, then it makes sense.
> > It would be straightforward to implement, using something like
> > sector_is_parity(line, paddr) to avoid mapping parity sectors during
> > recovery.
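A minimal sketch of such a helper, built only on the stripe geometry helpers
introduced in this patch (the name is just a suggestion):

/* true if paddr falls into the parity portion of its RAIL stripe */
static inline bool pblk_rail_sector_is_parity(struct pblk *pblk, u64 paddr)
{
        unsigned int sec_in_stripe = paddr % pblk_rail_sec_per_stripe(pblk);

        return sec_in_stripe >= pblk_rail_dsec_per_stripe(pblk);
}

Recovery could then skip any paddr for which this returns true instead of
inserting the XORed parity LBA into the L2P table.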
> >
> > >
> > > >
> > > > > +                       }
> > > > > +               }
> > > > > +       }
> > > > > +
> > > > > +       return 0;
> > > > > +}
> > > > > +
> > > > > +int pblk_rail_submit_write(struct pblk *pblk)
> > > > > +{
> > > > > +       int i;
> > > > > +       struct nvm_rq *rqd;
> > > > > +       struct bio *bio;
> > > > > +       struct pblk_line *line = pblk_line_get_data(pblk);
> > > > > +       int start, end, bb_offset;
> > > > > +       unsigned int stride = 0;
> > > > > +
> > > > > +       if (!pblk_rail_sched_parity(pblk))
> > > > > +               return 0;
> > > > > +
> > > > > +       start = line->cur_sec;
> > > > > +       bb_offset = start % pblk_rail_sec_per_stripe(pblk);
> > > > > +       end = start + pblk_rail_sec_per_stripe(pblk) - bb_offset;
> > > > > +
> > > > > +       for (i = start; i < end; i += pblk->min_write_pgs, stride++) {
> > > > > +               /* Do not generate parity in this slot if the sec is bad
> > > > > +                * or reserved for meta.
> > > > > +                * We check on the read path and perform a conventional
> > > > > +                * read, to avoid reading parity from the bad block
> > > > > +                */
> > > > > +               if (!pblk_rail_valid_sector(pblk, line, i))
> > > > > +                       continue;
> > > > > +
> > > > > +               rqd = pblk_alloc_rqd(pblk, PBLK_WRITE);
> > > > > +               if (IS_ERR(rqd)) {
> > > > > +                       pr_err("pblk: cannot allocate parity write req.\n");
> > > > > +                       return -ENOMEM;
> > > > > +               }
> > > > > +
> > > > > +               bio = bio_alloc(GFP_KERNEL, pblk->min_write_pgs);
> > > > > +               if (!bio) {
> > > > > +                       pr_err("pblk: cannot allocate parity write bio\n");
> > > > > +                       pblk_free_rqd(pblk, rqd, PBLK_WRITE);
> > > > > +                       return -ENOMEM;
> > > > > +               }
> > > > > +
> > > > > +               bio->bi_iter.bi_sector = 0; /* internal bio */
> > > > > +               bio_set_op_attrs(bio, REQ_OP_WRITE, 0);
> > > > > +               rqd->bio = bio;
> > > > > +
> > > > > +               pblk_rail_read_to_bio(pblk, rqd, bio, stride,
> > > > > +                                     pblk->min_write_pgs, i);
> > > > > +
> > > > > +               if (pblk_submit_io_set(pblk, rqd, pblk_rail_end_io_write)) {
> > > > > +                       bio_put(rqd->bio);
> > > > > +                       pblk_free_rqd(pblk, rqd, PBLK_WRITE);
> > > > > +
> > > > > +                       return -NVM_IO_ERR;
> > > > > +               }
> > > > > +       }
> > > > > +
> > > > > +       return 0;
> > > > > +}
> > > > > +
> > > > > +/* RAIL's Read Path */
> > > > > +static void pblk_rail_end_io_read(struct nvm_rq *rqd)
> > > > > +{
> > > > > +       struct pblk *pblk = rqd->private;
> > > > > +       struct pblk_g_ctx *r_ctx = nvm_rq_to_pdu(rqd);
> > > > > +       struct pblk_pr_ctx *pr_ctx = r_ctx->private;
> > > > > +       struct bio *new_bio = rqd->bio;
> > > > > +       struct bio *bio = pr_ctx->orig_bio;
> > > > > +       struct bio_vec src_bv, dst_bv;
> > > > > +       struct pblk_sec_meta *meta_list = rqd->meta_list;
> > > > > +       int bio_init_idx = pr_ctx->bio_init_idx;
> > > > > +       int nr_secs = pr_ctx->orig_nr_secs;
> > > > > +       __le64 *lba_list_mem, *lba_list_media;
> > > > > +       __le64 addr_empty = cpu_to_le64(ADDR_EMPTY);
> > > > > +       void *src_p, *dst_p;
> > > > > +       int i, r, rail_ppa = 0;
> > > > > +       unsigned char valid;
> > > > > +
> > > > > +       if (unlikely(rqd->nr_ppas == 1)) {
> > > > > +               struct ppa_addr ppa;
> > > > > +
> > > > > +               ppa = rqd->ppa_addr;
> > > > > +               rqd->ppa_list = pr_ctx->ppa_ptr;
> > > > > +               rqd->dma_ppa_list = pr_ctx->dma_ppa_list;
> > > > > +               rqd->ppa_list[0] = ppa;
> > > > > +       }
> > > > > +
> > > > > +       /* Re-use allocated memory for intermediate lbas */
> > > > > +       lba_list_mem = (((void *)rqd->ppa_list) + pblk_dma_ppa_size);
> > > > > +       lba_list_media = (((void *)rqd->ppa_list) + 2 * pblk_dma_ppa_size);
> > > > > +
> > > > > +       for (i = 0; i < rqd->nr_ppas; i++)
> > > > > +               lba_list_media[i] = meta_list[i].lba;
> > > > > +       for (i = 0; i < nr_secs; i++)
> > > > > +               meta_list[i].lba = lba_list_mem[i];
> > > > > +
> > > > > +       for (i = 0; i < nr_secs; i++) {
> > > > > +               struct pblk_line *line;
> > > > > +               u64 meta_lba = 0x0UL, mlba;
> > > > > +
> > > > > +               line = pblk_ppa_to_line(pblk, rqd->ppa_list[rail_ppa]);
> > > > > +
> > > > > +               valid = bitmap_weight(pr_ctx->bitmap, PBLK_RAIL_STRIDE_WIDTH);
> > > > > +               bitmap_shift_right(pr_ctx->bitmap, pr_ctx->bitmap,
> > > > > +                                  PBLK_RAIL_STRIDE_WIDTH, PR_BITMAP_SIZE);
> > > > > +
> > > > > +               if (valid == 0) /* Skip cached reads */
> > > > > +                       continue;
> > > > > +
> > > > > +               kref_put(&line->ref, pblk_line_put);
> > > > > +
> > > > > +               dst_bv = bio->bi_io_vec[bio_init_idx + i];
> > > > > +               dst_p = kmap_atomic(dst_bv.bv_page);
> > > > > +
> > > > > +               memset(dst_p + dst_bv.bv_offset, 0, PBLK_EXPOSED_PAGE_SIZE);
> > > > > +               meta_list[i].lba = cpu_to_le64(0x0UL);
> > > > > +
> > > > > +               for (r = 0; r < valid; r++, rail_ppa++) {
> > > > > +                       src_bv = new_bio->bi_io_vec[rail_ppa];
> > > > > +
> > > > > +                       if (lba_list_media[rail_ppa] != addr_empty) {
> > > > > +                               src_p = kmap_atomic(src_bv.bv_page);
> > > > > +                               pblk_rail_data_parity(dst_p + dst_bv.bv_offset,
> > > > > +                                                     src_p + src_bv.bv_offset);
> > > > > +                               mlba = le64_to_cpu(lba_list_media[rail_ppa]);
> > > > > +                               pblk_rail_lba_parity(&meta_lba, &mlba);
> > > > > +                               kunmap_atomic(src_p);
> > > > > +                       }
> > > > > +
> > > > > +                       mempool_free(src_bv.bv_page, &pblk->page_bio_pool);
> > > > > +               }
> > > > > +               meta_list[i].lba = cpu_to_le64(meta_lba);
> > > > > +               kunmap_atomic(dst_p);
> > > > > +       }
> > > > > +
> > > > > +       bio_put(new_bio);
> > > > > +       rqd->nr_ppas = pr_ctx->orig_nr_secs;
> > > > > +       kfree(pr_ctx);
> > > > > +       rqd->bio = NULL;
> > > > > +
> > > > > +       bio_endio(bio);
> > > > > +       __pblk_end_io_read(pblk, rqd, false);
> > > > > +}
> > > > > +
> > > > > +/* Converts original ppa into ppa list of RAIL reads */
> > > > > +static int pblk_rail_setup_ppas(struct pblk *pblk, struct ppa_addr ppa,
> > > > > +                               struct ppa_addr *rail_ppas,
> > > > > +                               unsigned char *pvalid, int *nr_rail_ppas,
> > > > > +                               int *rail_reads)
> > > > > +{
> > > > > +       struct nvm_tgt_dev *dev = pblk->dev;
> > > > > +       struct nvm_geo *geo = &dev->geo;
> > > > > +       struct ppa_addr rail_ppa = ppa;
> > > > > +       unsigned int lun_pos = pblk_ppa_to_pos(geo, ppa);
> > > > > +       unsigned int strides = pblk_rail_nr_parity_luns(pblk);
> > > > > +       struct pblk_line *line;
> > > > > +       unsigned int i;
> > > > > +       int ppas = *nr_rail_ppas;
> > > > > +       int valid = 0;
> > > > > +
> > > > > +       for (i = 1; i < PBLK_RAIL_STRIDE_WIDTH; i++) {
> > > > > +               unsigned int neighbor, lun, chnl;
> > > > > +               int laddr;
> > > > > +
> > > > > +               neighbor = pblk_rail_wrap_lun(pblk, lun_pos + i * strides);
> > > > > +
> > > > > +               lun = pblk_pos_to_lun(geo, neighbor);
> > > > > +               chnl = pblk_pos_to_chnl(geo, neighbor);
> > > > > +               pblk_dev_ppa_set_lun(&rail_ppa, lun);
> > > > > +               pblk_dev_ppa_set_chnl(&rail_ppa, chnl);
> > > > > +
> > > > > +               line = pblk_ppa_to_line(pblk, rail_ppa);
> > > > > +               laddr = pblk_dev_ppa_to_line_addr(pblk, rail_ppa);
> > > > > +
> > > > > +               /* Do not read from bad blocks */
> > > > > +               if (!pblk_rail_valid_sector(pblk, line, laddr)) {
> > > > > +                       /* Perform regular read if parity sector is bad */
> > > > > +                       if (neighbor >= pblk_rail_nr_data_luns(pblk))
> > > > > +                               return 0;
> > > > > +
> > > > > +                       /* If any other neighbor is bad we can just skip it */
> > > > > +                       continue;
> > > > > +               }
> > > > > +
> > > > > +               rail_ppas[ppas++] = rail_ppa;
> > > > > +               valid++;
> > > > > +       }
> > > > > +
> > > > > +       if (valid == 1)
> > > > > +               return 0;
> > > > > +
> > > > > +       *pvalid = valid;
> > > > > +       *nr_rail_ppas = ppas;
> > > > > +       (*rail_reads)++;
> > > > > +       return 1;
> > > > > +}
> > > > > +
> > > > > +static void pblk_rail_set_bitmap(struct pblk *pblk, struct ppa_addr *ppa_list,
> > > > > +                                int ppa, struct ppa_addr *rail_ppa_list,
> > > > > +                                int *nr_rail_ppas, unsigned long *read_bitmap,
> > > > > +                                unsigned long *pvalid, int *rail_reads)
> > > > > +{
> > > > > +       unsigned char valid;
> > > > > +
> > > > > +       if (test_bit(ppa, read_bitmap))
> > > > > +               return;
> > > > > +
> > > > > +       if (pblk_rail_lun_busy(pblk, ppa_list[ppa]) &&
> > > > > +           pblk_rail_setup_ppas(pblk, ppa_list[ppa],
> > > > > +                                rail_ppa_list, &valid,
> > > > > +                                nr_rail_ppas, rail_reads)) {
> > > > > +               WARN_ON(test_and_set_bit(ppa, read_bitmap));
> > > > > +               bitmap_set(pvalid, ppa * PBLK_RAIL_STRIDE_WIDTH, valid);
> > > > > +       } else {
> > > > > +               rail_ppa_list[(*nr_rail_ppas)++] = ppa_list[ppa];
> > > > > +               bitmap_set(pvalid, ppa * PBLK_RAIL_STRIDE_WIDTH, 1);
> > > > > +       }
> > > > > +}
> > > > > +
> > > > > +int pblk_rail_read_bio(struct pblk *pblk, struct nvm_rq *rqd, int blba,
> > > > > +                      unsigned long *read_bitmap, int bio_init_idx,
> > > > > +                      struct bio **bio)
> > > > > +{
> > > > > +       struct pblk_g_ctx *r_ctx = nvm_rq_to_pdu(rqd);
> > > > > +       struct pblk_pr_ctx *pr_ctx;
> > > > > +       struct ppa_addr rail_ppa_list[NVM_MAX_VLBA];
> > > > > +       DECLARE_BITMAP(pvalid, PR_BITMAP_SIZE);
> > > > > +       int nr_secs = rqd->nr_ppas;
> > > > > +       bool read_empty = bitmap_empty(read_bitmap, nr_secs);
> > > > > +       int nr_rail_ppas = 0, rail_reads = 0;
> > > > > +       int i;
> > > > > +       int ret;
> > > > > +
> > > > > +       /* Fully cached reads should not enter this path */
> > > > > +       WARN_ON(bitmap_full(read_bitmap, nr_secs));
> > > > > +
> > > > > +       bitmap_zero(pvalid, PR_BITMAP_SIZE);
> > > > > +       if (rqd->nr_ppas == 1) {
> > > > > +               pblk_rail_set_bitmap(pblk, &rqd->ppa_addr, 0, rail_ppa_list,
> > > > > +                                    &nr_rail_ppas, read_bitmap, pvalid,
> > > > > +                                    &rail_reads);
> > > > > +
> > > > > +               if (nr_rail_ppas == 1) {
> > > > > +                       memcpy(&rqd->ppa_addr, rail_ppa_list,
> > > > > +                              nr_rail_ppas * sizeof(struct ppa_addr));
> > > > > +               } else {
> > > > > +                       rqd->ppa_list = rqd->meta_list + pblk_dma_meta_size;
> > > > > +                       rqd->dma_ppa_list = rqd->dma_meta_list +
> > > > > +                         pblk_dma_meta_size;
> > > > > +                       memcpy(rqd->ppa_list, rail_ppa_list,
> > > > > +                              nr_rail_ppas * sizeof(struct ppa_addr));
> > > > > +               }
> > > > > +       } else {
> > > > > +               for (i = 0; i < rqd->nr_ppas; i++) {
> > > > > +                       pblk_rail_set_bitmap(pblk, rqd->ppa_list, i,
> > > > > +                                            rail_ppa_list, &nr_rail_ppas,
> > > > > +                                            read_bitmap, pvalid, &rail_reads);
> > > > > +
> > > > > +                       /* Don't split if this is the last ppa of the rqd */
> > > > > +                       if (((nr_rail_ppas + PBLK_RAIL_STRIDE_WIDTH) >=
> > > > > +                            NVM_MAX_VLBA) && (i + 1 < rqd->nr_ppas)) {
> > > > > +                               struct pblk_g_ctx *r_ctx = nvm_rq_to_pdu(rqd);
> > > > > +
> > > > > +                               pblk_rail_bio_split(pblk, bio, i + 1);
> > > > > +                               rqd->nr_ppas = pblk_get_secs(*bio);
> > > > > +                               r_ctx->private = *bio;
> > > > > +                               break;
> > > > > +                       }
> > > > > +               }
> > > > > +               memcpy(rqd->ppa_list, rail_ppa_list,
> > > > > +                      nr_rail_ppas * sizeof(struct ppa_addr));
> > > > > +       }
> > > > > +
> > > > > +       if (bitmap_empty(read_bitmap, rqd->nr_ppas))
> > > > > +               return NVM_IO_REQUEUE;
> > > > > +
> > > > > +       if (read_empty && !bitmap_empty(read_bitmap, rqd->nr_ppas))
> > > > > +               bio_advance(*bio, (rqd->nr_ppas) * PBLK_EXPOSED_PAGE_SIZE);
> > > > > +
> > > > > +       if (pblk_setup_partial_read(pblk, rqd, bio_init_idx, read_bitmap,
> > > > > +                                   nr_rail_ppas))
> > > > > +               return NVM_IO_ERR;
> > > > > +
> > > > > +       rqd->end_io = pblk_rail_end_io_read;
> > > > > +       pr_ctx = r_ctx->private;
> > > > > +       bitmap_copy(pr_ctx->bitmap, pvalid, PR_BITMAP_SIZE);
> > > > > +
> > > > > +       ret = pblk_submit_io(pblk, rqd);
> > > > > +       if (ret) {
> > > > > +               bio_put(rqd->bio);
> > > > > +               pr_err("pblk: partial RAIL read IO submission failed\n");
> > > > > +               /* Free allocated pages in new bio */
> > > > > +               pblk_bio_free_pages(pblk, rqd->bio, 0, rqd->bio->bi_vcnt);
> > > > > +               kfree(pr_ctx);
> > > > > +               __pblk_end_io_read(pblk, rqd, false);
> > > > > +               return NVM_IO_ERR;
> > > > > +       }
> > > > > +
> > > > > +       return NVM_IO_OK;
> > > > > +}
> > > > > diff --git a/drivers/lightnvm/pblk.h b/drivers/lightnvm/pblk.h
> > > > > index bd88784e51d9..01fe4362b27e 100644
> > > > > --- a/drivers/lightnvm/pblk.h
> > > > > +++ b/drivers/lightnvm/pblk.h
> > > > > @@ -28,6 +28,7 @@
> > > > >  #include <linux/vmalloc.h>
> > > > >  #include <linux/crc32.h>
> > > > >  #include <linux/uuid.h>
> > > > > +#include <linux/log2.h>
> > > > >
> > > > >  #include <linux/lightnvm.h>
> > > > >
> > > > > @@ -45,7 +46,7 @@
> > > > >  #define PBLK_COMMAND_TIMEOUT_MS 30000
> > > > >
> > > > >  /* Max 512 LUNs per device */
> > > > > -#define PBLK_MAX_LUNS_BITMAP (4)
> > > > > +#define PBLK_MAX_LUNS_BITMAP (512)
> > > >
> > > > 512 is probably enough for everyone for now, but why not make this dynamic?
> > > > Better not waste memory and introduce an artificial limit on number of luns.
> > >
> > > I can make it dynamic. It just makes the init path more messy as
> > > meta_init takes the write semaphore (and hence busy bitmap) so I have
> > > to init RAIL in the middle of everything else.
> > >
> > > >
> > > > >
> > > > >  #define NR_PHY_IN_LOG (PBLK_EXPOSED_PAGE_SIZE / PBLK_SECTOR)
> > > > >
> > > > > @@ -123,6 +124,13 @@ struct pblk_g_ctx {
> > > > >         u64 lba;
> > > > >  };
> > > > >
> > > > > +#ifdef CONFIG_NVM_PBLK_RAIL
> > > > > +#define PBLK_RAIL_STRIDE_WIDTH 4
> > > > > +#define PR_BITMAP_SIZE (NVM_MAX_VLBA * PBLK_RAIL_STRIDE_WIDTH)
> > > > > +#else
> > > > > +#define PR_BITMAP_SIZE NVM_MAX_VLBA
> > > > > +#endif
> > > > > +
> > > > >  /* partial read context */
> > > > >  struct pblk_pr_ctx {
> > > > >         struct bio *orig_bio;
> > > > > @@ -604,6 +612,39 @@ struct pblk_addrf {
> > > > >         int sec_ws_stripe;
> > > > >  };
> > > > >
> > > > > +#ifdef CONFIG_NVM_PBLK_RAIL
> > > > > +
> > > > > +struct p2b_entry {
> > > > > +       int pos;
> > > > > +       int nr_valid;
> > > > > +};
> > > > > +
> > > > > +struct pblk_rail {
> > > > > +       struct p2b_entry **p2b;         /* Maps RAIL sectors to rb pos */
> > > > > +       struct page *pages;             /* Pages to hold parity writes */
> > > > > +       void **data;                    /* Buffer that holds parity pages */
> > > > > +       DECLARE_BITMAP(busy_bitmap, PBLK_MAX_LUNS_BITMAP);
> > > > > +       u64 *lba;                       /* Buffer to compute LBA parity */
> > > > > +};
> > > > > +
> > > > > +/* Initialize and tear down RAIL */
> > > > > +int pblk_rail_init(struct pblk *pblk);
> > > > > +void pblk_rail_free(struct pblk *pblk);
> > > > > +/* Adjust some system parameters */
> > > > > +bool pblk_rail_meta_distance(struct pblk_line *data_line);
> > > > > +int pblk_rail_rb_delay(struct pblk_rb *rb);
> > > > > +/* Core */
> > > > > +void pblk_rail_line_close(struct pblk *pblk, struct pblk_line *line);
> > > > > +int pblk_rail_down_stride(struct pblk *pblk, int lun, int timeout);
> > > > > +void pblk_rail_up_stride(struct pblk *pblk, int lun);
> > > > > +/* Write path */
> > > > > +int pblk_rail_submit_write(struct pblk *pblk);
> > > > > +/* Read Path */
> > > > > +int pblk_rail_read_bio(struct pblk *pblk, struct nvm_rq *rqd, int blba,
> > > > > +                      unsigned long *read_bitmap, int bio_init_idx,
> > > > > +                      struct bio **bio);
> > > > > +#endif /* CONFIG_NVM_PBLK_RAIL */
> > > > > +
> > > > >  typedef int (pblk_map_page_fn)(struct pblk *pblk, unsigned int sentry,
> > > > >                                struct ppa_addr *ppa_list,
> > > > >                                unsigned long *lun_bitmap,
> > > > > @@ -1115,6 +1156,26 @@ static inline u64 pblk_dev_ppa_to_line_addr(struct pblk *pblk,
> > > > >         return paddr;
> > > > >  }
> > > > >
> > > > > +static inline int pblk_pos_to_lun(struct nvm_geo *geo, int pos)
> > > > > +{
> > > > > +       return pos >> ilog2(geo->num_ch);
> > > > > +}
> > > > > +
> > > > > +static inline int pblk_pos_to_chnl(struct nvm_geo *geo, int pos)
> > > > > +{
> > > > > +       return pos % geo->num_ch;
> > > > > +}
> > > > > +
> > > > > +static inline void pblk_dev_ppa_set_lun(struct ppa_addr *p, int lun)
> > > > > +{
> > > > > +       p->a.lun = lun;
> > > > > +}
> > > > > +
> > > > > +static inline void pblk_dev_ppa_set_chnl(struct ppa_addr *p, int chnl)
> > > > > +{
> > > > > +       p->a.ch = chnl;
> > > > > +}
> > > >
> > > > What is the motivation for adding the lun and chnl setters? They seem
> > > > uncalled for.
> > >
> > > They are used in RAIL's read path to generate the PPAs for RAIL reads.
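> > >
> > > E.g. when a stride member's LUN is busy, the read is redirected to
> > > its stride neighbors by rewriting the ppa, roughly like this
> > > (simplified; geo, ppa and pos come from the surrounding stride
> > > bookkeeping, which is omitted here):
> > >
> > >     /* point ppa at the lun/channel behind rb position pos */
> > >     pblk_dev_ppa_set_lun(&ppa, pblk_pos_to_lun(geo, pos));
> > >     pblk_dev_ppa_set_chnl(&ppa, pblk_pos_to_chnl(geo, pos));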
> >
> > It's just a style thing, but they seem a bit redundant to me.
> >
> > >
> > > >
> > > > > +
> > > > >  static inline struct ppa_addr pblk_ppa32_to_ppa64(struct pblk *pblk, u32 ppa32)
> > > > >  {
> > > > >         struct nvm_tgt_dev *dev = pblk->dev;
> > > > > --
> > > > > 2.17.1
> > > > >

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2018-09-21 12:51 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-09-17  5:29 [RFC PATCH 0/6] lightnvm: pblk: Introduce RAIL to enforce low tail read latency Heiner Litz
2018-09-17  5:29 ` [RFC PATCH 1/6] lightnvm: pblk: refactor read and write APIs Heiner Litz
2018-09-17  5:29 ` [RFC PATCH 2/6] lightnvm: pblk: Add configurable mapping function Heiner Litz
2018-09-17  5:29 ` [RFC PATCH 3/6] lightnvm: pblk: Refactor end_io function in pblk_submit_io_set Heiner Litz
2018-09-17  5:29 ` [RFC PATCH 4/6] lightnvm: pblk: Add pblk_submit_io_sem Heiner Litz
2018-09-17  5:29 ` [RFC PATCH 5/6] lightnvm: pblk: Add RAIL interface Heiner Litz
2018-09-18 11:28   ` Hans Holmberg
2018-09-18 16:11     ` Heiner Litz
2018-09-19  7:53       ` Hans Holmberg
2018-09-20 23:58         ` Heiner Litz
2018-09-21  7:04           ` Hans Holmberg
2018-09-17  5:29 ` [RFC PATCH 6/6] lightnvm: pblk: Integrate RAIL Heiner Litz
2018-09-18 11:38   ` Hans Holmberg
2018-09-18 11:46 ` [RFC PATCH 0/6] lightnvm: pblk: Introduce RAIL to enforce low tail read latency Hans Holmberg
2018-09-18 16:13   ` Heiner Litz
2018-09-19  7:58     ` Hans Holmberg
2018-09-21  4:34       ` Heiner Litz
