* [PATCH 0/5 v2] Support for Open-Channel SSDs
@ 2015-04-15 12:34 ` Matias Bjørling
  0 siblings, 0 replies; 53+ messages in thread
From: Matias Bjørling @ 2015-04-15 12:34 UTC (permalink / raw)
  To: hch, axboe, linux-fsdevel, linux-kernel, linux-nvme
  Cc: javier, keith.busch, Matias Bjørling

A problem with SSDs is that they expose only a narrow read/write
interface through which the host and device must communicate their
intent. This narrow interface lets little information be carried down
from file-systems and applications, so the performance guarantees these
devices can give are best-effort.

There are various approaches to mitigating this. Examples include trim
and multi-stream support. However, these approaches are vendor-specific,
each with its own behavior. More importantly, they do not allow the host
to fully control data placement, parallelism and garbage collection.

When an SSD exposes its physical characteristics to the host,
file-systems and applications can place data directly and control when
and where garbage collection is applied. We call the class of SSDs that
expose these physical characteristics Open-Channel SSDs.

For this class of SSDs, LightNVM is a specification that defines a
common interface. It allows the host to manage data placement, garbage
collection, and parallelism. With it, the kernel can expose a building
block for further integration into file-systems and applications.
Immediate benefits include strict control of access latency and IO
response variance.

This patchset wires up support in the block layer, introduces a simple
block device target called rrpc, and finally adds support in the
null_blk and NVMe drivers.

Patches are against v4.0.

Development and further information on LightNVM can be found at:

  https://github.com/OpenChannelSSD/linux

Changes since v1:

 - Split LightNVM into two parts: a get/put interface for flash
   blocks and the respective targets that implement the flash
   translation layer logic. A sketch of the interface is shown below.
 - Updated the patches according to the LightNVM specification changes.
 - Added an interface to add/remove targets for a block device.
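
A minimal sketch of how a target might drive the get/put interface
(hypothetical helper; it uses the blk_nvm_* calls and the ADDR_EMPTY
constant introduced in patch 2, and leaves out locking and GC policy):

  #include <linux/lightnvm.h>

  /* Allocate the next physical page address from a per-lun write block,
   * fetching a fresh block from the lun's free list when needed. */
  static sector_t example_next_addr(struct nvm_lun *lun,
                                    struct nvm_block **cur)
  {
          sector_t addr;

          if (!*cur) {
                  *cur = blk_nvm_get_blk(lun, 0); /* 0: not a GC allocation */
                  if (!*cur)
                          return ADDR_EMPTY;
          }

          addr = blk_nvm_alloc_addr(*cur); /* next free page in the block */
          if (addr == ADDR_EMPTY)
                  *cur = NULL; /* block is full; GC later moves its valid
                                * pages and returns it via blk_nvm_put_blk() */

          return addr;
  }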

Matias Bjørling (5):
  blk-mq: Add prep/unprep support
  blk-mq: Support for Open-Channel SSDs
  lightnvm: RRPC target
  null_blk: LightNVM support
  nvme: LightNVM support

 Documentation/block/null_blk.txt |    8 +
 block/Kconfig                    |   12 +
 block/Makefile                   |    2 +-
 block/blk-mq.c                   |   40 +-
 block/blk-nvm.c                  |  722 ++++++++++++++++++++++
 block/blk-sysfs.c                |   11 +
 block/blk.h                      |   18 +
 drivers/Kconfig                  |    2 +
 drivers/Makefile                 |    2 +
 drivers/block/null_blk.c         |   89 ++-
 drivers/block/nvme-core.c        |  380 +++++++++++-
 drivers/lightnvm/Kconfig         |   29 +
 drivers/lightnvm/Makefile        |    5 +
 drivers/lightnvm/rrpc.c          | 1222 ++++++++++++++++++++++++++++++++++++++
 drivers/lightnvm/rrpc.h          |  203 +++++++
 include/linux/bio.h              |    9 +
 include/linux/blk-mq.h           |    3 +
 include/linux/blk_types.h        |   12 +-
 include/linux/blkdev.h           |  218 +++++++
 include/linux/lightnvm.h         |   55 ++
 include/linux/nvme.h             |    2 +
 include/uapi/linux/nvm.h         |   70 +++
 include/uapi/linux/nvme.h        |  116 ++++
 23 files changed, 3217 insertions(+), 13 deletions(-)
 create mode 100644 block/blk-nvm.c
 create mode 100644 drivers/lightnvm/Kconfig
 create mode 100644 drivers/lightnvm/Makefile
 create mode 100644 drivers/lightnvm/rrpc.c
 create mode 100644 drivers/lightnvm/rrpc.h
 create mode 100644 include/linux/lightnvm.h
 create mode 100644 include/uapi/linux/nvm.h

-- 
1.9.1


* [PATCH 1/5 v2] blk-mq: Add prep/unprep support
  2015-04-15 12:34 ` Matias Bjørling
@ 2015-04-15 12:34   ` Matias Bjørling
  -1 siblings, 0 replies; 53+ messages in thread
From: Matias Bjørling @ 2015-04-15 12:34 UTC (permalink / raw)
  To: hch, axboe, linux-fsdevel, linux-kernel, linux-nvme
  Cc: javier, keith.busch, Matias Bjørling

Allow users to hook into prep/unprep functions just before an IO is
dispatched to the device driver. This is necessary for request-based
logic, such as the remapping done by LightNVM targets, to take place in
the layers above the driver.
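
For illustration, a prep/unprep pair (hypothetical functions, not part
of this patch) would be registered with blk_queue_prep_rq() and
blk_queue_unprep_rq() and could look roughly like this:

  #include <linux/blkdev.h>
  #include <linux/blk-mq.h>

  /* prep runs right before ->queue_rq(); unprep runs when the request
   * is ended through blk_mq_end_request(). */
  static int example_prep_rq(struct request_queue *q, struct request *rq)
  {
          /* set up per-request state, e.g. remap to a physical address */
          rq->cmd_flags |= REQ_DONTPREP;  /* do not prepare this rq twice */
          return BLK_MQ_RQ_QUEUE_OK;      /* continue to ->queue_rq() */
          /* BLK_MQ_RQ_QUEUE_DONE would signal the IO was handled here */
  }

  static void example_unprep_rq(struct request_queue *q, struct request *rq)
  {
          /* tear down whatever example_prep_rq() set up */
  }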

Signed-off-by: Matias Bjørling <m@bjorling.me>
---
 block/blk-mq.c         | 28 ++++++++++++++++++++++++++--
 include/linux/blk-mq.h |  1 +
 2 files changed, 27 insertions(+), 2 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 33c4285..f3dd028 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -338,6 +338,11 @@ EXPORT_SYMBOL(__blk_mq_end_request);
 
 void blk_mq_end_request(struct request *rq, int error)
 {
+	struct request_queue *q = rq->q;
+
+	if (q->unprep_rq_fn)
+		q->unprep_rq_fn(q, rq);
+
 	if (blk_update_request(rq, error, blk_rq_bytes(rq)))
 		BUG();
 	__blk_mq_end_request(rq, error);
@@ -753,6 +758,17 @@ static void flush_busy_ctxs(struct blk_mq_hw_ctx *hctx, struct list_head *list)
 	}
 }
 
+static int blk_mq_prep_rq(struct request_queue *q, struct request *rq)
+{
+	if (!q->prep_rq_fn)
+		return 0;
+
+	if (rq->cmd_flags & REQ_DONTPREP)
+		return 0;
+
+	return q->prep_rq_fn(q, rq);
+}
+
 /*
  * Run this hardware queue, pulling any software queues mapped to it in.
  * Note that this function currently has various problems around ordering
@@ -812,11 +828,15 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
 		bd.list = dptr;
 		bd.last = list_empty(&rq_list);
 
-		ret = q->mq_ops->queue_rq(hctx, &bd);
+		ret = blk_mq_prep_rq(q, rq);
+		if (likely(!ret))
+			ret = q->mq_ops->queue_rq(hctx, &bd);
 		switch (ret) {
 		case BLK_MQ_RQ_QUEUE_OK:
 			queued++;
 			continue;
+		case BLK_MQ_RQ_QUEUE_DONE:
+			continue;
 		case BLK_MQ_RQ_QUEUE_BUSY:
 			list_add(&rq->queuelist, &rq_list);
 			__blk_mq_requeue_request(rq);
@@ -1270,10 +1290,14 @@ static void blk_mq_make_request(struct request_queue *q, struct bio *bio)
 		 * error (busy), just add it to our list as we previously
 		 * would have done
 		 */
-		ret = q->mq_ops->queue_rq(data.hctx, &bd);
+		ret = blk_mq_prep_rq(q, rq);
+		if (likely(!ret))
+			ret = q->mq_ops->queue_rq(data.hctx, &bd);
 		if (ret == BLK_MQ_RQ_QUEUE_OK)
 			goto done;
 		else {
+			if (ret == BLK_MQ_RQ_QUEUE_DONE)
+				goto done;
 			__blk_mq_requeue_request(rq);
 
 			if (ret == BLK_MQ_RQ_QUEUE_ERROR) {
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 7aec861..d7b39af 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -140,6 +140,7 @@ enum {
 	BLK_MQ_RQ_QUEUE_OK	= 0,	/* queued fine */
 	BLK_MQ_RQ_QUEUE_BUSY	= 1,	/* requeue IO for later */
 	BLK_MQ_RQ_QUEUE_ERROR	= 2,	/* end IO with error */
+	BLK_MQ_RQ_QUEUE_DONE	= 3,	/* IO is already handled */
 
 	BLK_MQ_F_SHOULD_MERGE	= 1 << 0,
 	BLK_MQ_F_TAG_SHARED	= 1 << 1,
-- 
1.9.1


* [PATCH 2/5 v2] blk-mq: Support for Open-Channel SSDs
  2015-04-15 12:34 ` Matias Bjørling
@ 2015-04-15 12:34   ` Matias Bjørling
  -1 siblings, 0 replies; 53+ messages in thread
From: Matias Bjørling @ 2015-04-15 12:34 UTC (permalink / raw)
  To: hch, axboe, linux-fsdevel, linux-kernel, linux-nvme
  Cc: javier, keith.busch, Matias Bjørling

Open-channel SSDs are devices that share responsibilities with the host
in order to implement and maintain features that typical SSDs keep
strictly in firmware. These include (i) the Flash Translation Layer
(FTL), (ii) bad block management, and (iii) hardware units such as the
flash controller, the interface controller, and a large number of flash
chips. In this way, Open-channel SSDs expose direct access to their
physical flash storage, while keeping a subset of the internal features
of SSDs.

LightNVM is a specification that gives support to Open-channel SSDs.
It allows the host to manage data placement, garbage collection, and
parallelism. Device-specific responsibilities such as bad block
management, FTL extensions to support atomic IOs, or metadata
persistence are still handled by the device.

The implementation of LightNVM consists of two parts: core and
(multiple) targets. The core implements functionality shared across
targets, such as initialization, teardown and statistics. The targets
implement the interface that exposes physical flash to user-space
applications. Examples of such targets include key-value stores, object
stores, and traditional block devices, which can be application-specific.
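
As a rough sketch (illustrative names only; the real rrpc target follows
in the next patch), a target module describes itself with a struct
nvm_target_type and registers it with the core:

  #include <linux/blkdev.h>
  #include <linux/module.h>

  /* Target entry points; bodies omitted in this sketch. */
  static void example_make_rq(struct request_queue *q, struct bio *bio);
  static int example_prep_rq(struct request_queue *q, struct request *rq);
  static void example_unprep_rq(struct request_queue *q, struct request *rq);
  static sector_t example_capacity(void *private);
  static void *example_init(struct request_queue *bqueue,
                            struct request_queue *tqueue,
                            struct gendisk *bdisk, struct gendisk *tdisk,
                            int lun_begin, int lun_end);
  static void example_exit(void *private);

  static struct nvm_target_type tt_example = {
          .name           = "example",
          .version        = {0, 0, 1},
          .make_rq        = example_make_rq,   /* bio-based entry point */
          .prep_rq        = example_prep_rq,   /* hooks added in patch 1 */
          .unprep_rq      = example_unprep_rq,
          .capacity       = example_capacity,  /* size reported for the disk */
          .init           = example_init,
          .exit           = example_exit,
  };

  static int __init example_module_init(void)
  {
          return nvm_register_target(&tt_example);
  }

  static void __exit example_module_exit(void)
  {
          nvm_unregister_target(&tt_example);
  }
  module_init(example_module_init);
  module_exit(example_module_exit);

Instances of a registered target type are then created and removed at
run-time through the nvm sysfs attribute group (configure, remove and
free_blocks) that this patch adds to the block device.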

Contributions in this patch from:

  Javier Gonzalez <javier@paletta.io>
  Jesper Madsen <jmad@itu.dk>

Signed-off-by: Matias Bjørling <m@bjorling.me>
---
 block/Kconfig             |  12 +
 block/Makefile            |   2 +-
 block/blk-mq.c            |  12 +-
 block/blk-nvm.c           | 722 ++++++++++++++++++++++++++++++++++++++++++++++
 block/blk-sysfs.c         |  11 +
 block/blk.h               |  18 ++
 include/linux/bio.h       |   9 +
 include/linux/blk-mq.h    |   4 +-
 include/linux/blk_types.h |  12 +-
 include/linux/blkdev.h    | 218 ++++++++++++++
 include/linux/lightnvm.h  |  56 ++++
 include/uapi/linux/nvm.h  |  70 +++++
 12 files changed, 1140 insertions(+), 6 deletions(-)
 create mode 100644 block/blk-nvm.c
 create mode 100644 include/linux/lightnvm.h
 create mode 100644 include/uapi/linux/nvm.h

diff --git a/block/Kconfig b/block/Kconfig
index 161491d..a3fca8f 100644
--- a/block/Kconfig
+++ b/block/Kconfig
@@ -88,6 +88,18 @@ config BLK_DEV_INTEGRITY
 	T10/SCSI Data Integrity Field or the T13/ATA External Path
 	Protection.  If in doubt, say N.
 
+config BLK_DEV_NVM
+	bool "Block layer Open-Channel SSD support"
+	depends on BLK_DEV
+	default y
+	---help---
+	  Say Y here to get to enable support for Open-channel SSDs.
+
+	  Open-Channel SSDs expose direct access to the underlying non-volatile
+	  memory.
+
+	  This option is required by Open-Channel SSD target drivers.
+
 config BLK_DEV_THROTTLING
 	bool "Block layer bio throttling support"
 	depends on BLK_CGROUP=y
diff --git a/block/Makefile b/block/Makefile
index 00ecc97..66a5826 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -22,4 +22,4 @@ obj-$(CONFIG_IOSCHED_CFQ)	+= cfq-iosched.o
 obj-$(CONFIG_BLOCK_COMPAT)	+= compat_ioctl.o
 obj-$(CONFIG_BLK_CMDLINE_PARSER)	+= cmdline-parser.o
 obj-$(CONFIG_BLK_DEV_INTEGRITY) += bio-integrity.o blk-integrity.o t10-pi.o
-
+obj-$(CONFIG_BLK_DEV_NVM)  += blk-nvm.o
diff --git a/block/blk-mq.c b/block/blk-mq.c
index f3dd028..58a8a71 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -221,6 +221,9 @@ static void blk_mq_rq_ctx_init(struct request_queue *q, struct blk_mq_ctx *ctx,
 	rq->end_io = NULL;
 	rq->end_io_data = NULL;
 	rq->next_rq = NULL;
+#ifdef CONFIG_BLK_DEV_NVM
+	rq->phys_sector = 0;
+#endif
 
 	ctx->rq_dispatched[rw_is_sync(rw_flags)]++;
 }
@@ -1445,6 +1448,7 @@ static struct blk_mq_tags *blk_mq_init_rq_map(struct blk_mq_tag_set *set,
 	struct blk_mq_tags *tags;
 	unsigned int i, j, entries_per_page, max_order = 4;
 	size_t rq_size, left;
+	unsigned int cmd_size = set->cmd_size;
 
 	tags = blk_mq_init_tags(set->queue_depth, set->reserved_tags,
 				set->numa_node,
@@ -1462,11 +1466,14 @@ static struct blk_mq_tags *blk_mq_init_rq_map(struct blk_mq_tag_set *set,
 		return NULL;
 	}
 
+	if (set->flags & BLK_MQ_F_NVM)
+		cmd_size += sizeof(struct nvm_per_rq);
+
 	/*
 	 * rq_size is the size of the request plus driver payload, rounded
 	 * to the cacheline size
 	 */
-	rq_size = round_up(sizeof(struct request) + set->cmd_size,
+	rq_size = round_up(sizeof(struct request) + cmd_size,
 				cache_line_size());
 	left = rq_size * set->queue_depth;
 
@@ -1978,6 +1985,9 @@ struct request_queue *blk_mq_init_queue(struct blk_mq_tag_set *set)
 	if (!(set->flags & BLK_MQ_F_SG_MERGE))
 		q->queue_flags |= 1 << QUEUE_FLAG_NO_SG_MERGE;
 
+	if (set->flags & BLK_MQ_F_NVM)
+		q->queue_flags |= 1 << QUEUE_FLAG_NVM;
+
 	q->sg_reserved_size = INT_MAX;
 
 	INIT_WORK(&q->requeue_work, blk_mq_requeue_work);
diff --git a/block/blk-nvm.c b/block/blk-nvm.c
new file mode 100644
index 0000000..722821c
--- /dev/null
+++ b/block/blk-nvm.c
@@ -0,0 +1,722 @@
+/*
+ * blk-nvm.c - Block layer Open-channel SSD integration
+ *
+ * Copyright (C) 2015 IT University of Copenhagen
+ * Initial release: Matias Bjorling <mabj@itu.dk>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License version
+ * 2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; see the file COPYING.  If not, write to
+ * the Free Software Foundation, 675 Mass Ave, Cambridge, MA 02139,
+ * USA.
+ *
+ */
+
+#include <linux/blkdev.h>
+#include <linux/blk-mq.h>
+#include <linux/list.h>
+#include <linux/types.h>
+#include <linux/sem.h>
+#include <linux/bitmap.h>
+
+#include <linux/lightnvm.h>
+
+#include "blk.h"
+
+static LIST_HEAD(_targets);
+static DECLARE_RWSEM(_lock);
+
+struct nvm_target_type *nvm_find_target_type(const char *name)
+{
+	struct nvm_target_type *tt;
+
+	list_for_each_entry(tt, &_targets, list)
+		if (!strcmp(name, tt->name))
+			return tt;
+
+	return NULL;
+}
+
+int nvm_register_target(struct nvm_target_type *tt)
+{
+	int ret = 0;
+
+	down_write(&_lock);
+	if (nvm_find_target_type(tt->name))
+		ret = -EEXIST;
+	else
+		list_add(&tt->list, &_targets);
+	up_write(&_lock);
+
+	return ret;
+}
+
+void nvm_unregister_target(struct nvm_target_type *tt)
+{
+	if (!tt)
+		return;
+
+	down_write(&_lock);
+	list_del(&tt->list);
+	up_write(&_lock);
+}
+
+static void nvm_reset_block(struct nvm_lun *lun, struct nvm_block *block)
+{
+	spin_lock(&block->lock);
+	bitmap_zero(block->invalid_pages, lun->nr_pages_per_blk);
+	block->next_page = 0;
+	block->nr_invalid_pages = 0;
+	atomic_set(&block->data_cmnt_size, 0);
+	spin_unlock(&block->lock);
+}
+
+/* Use blk_nvm_[get/put]_blk to administer the blocks in use for each lun.
+ * Whenever a block is in use by an append point, we store it on the
+ * used_list. We then move it back when it is free to be used by another
+ * append point.
+ *
+ * The newly claimed block is always added to the back of used_list, as we
+ * assume that the start of the used list is the oldest block, and therefore
+ * more likely to contain invalidated pages.
+ */
+struct nvm_block *blk_nvm_get_blk(struct nvm_lun *lun, int is_gc)
+{
+	struct nvm_block *block = NULL;
+
+	BUG_ON(!lun);
+
+	spin_lock(&lun->lock);
+
+	if (list_empty(&lun->free_list)) {
+		pr_err_ratelimited("nvm: lun %u has no free pages available\n",
+								lun->id);
+		spin_unlock(&lun->lock);
+		goto out;
+	}
+
+	if (!is_gc && lun->nr_free_blocks < lun->reserved_blocks) {
+		spin_unlock(&lun->lock);
+		goto out;
+	}
+
+	block = list_first_entry(&lun->free_list, struct nvm_block, list);
+	list_move_tail(&block->list, &lun->used_list);
+
+	lun->nr_free_blocks--;
+
+	spin_unlock(&lun->lock);
+
+	nvm_reset_block(lun, block);
+
+out:
+	return block;
+}
+EXPORT_SYMBOL(blk_nvm_get_blk);
+
+/* We assume that all valid pages have already been moved when the block is
+ * added back to the free list. We add it to the tail to allow round-robin use
+ * of all blocks, thereby providing simple (naive) wear-leveling.
+ */
+void blk_nvm_put_blk(struct nvm_block *block)
+{
+	struct nvm_lun *lun = block->lun;
+
+	spin_lock(&lun->lock);
+
+	list_move_tail(&block->list, &lun->free_list);
+	lun->nr_free_blocks++;
+
+	spin_unlock(&lun->lock);
+}
+EXPORT_SYMBOL(blk_nvm_put_blk);
+
+sector_t blk_nvm_alloc_addr(struct nvm_block *block)
+{
+	sector_t addr = ADDR_EMPTY;
+
+	spin_lock(&block->lock);
+	if (block_is_full(block))
+		goto out;
+
+	addr = block_to_addr(block) + block->next_page;
+
+	block->next_page++;
+out:
+	spin_unlock(&block->lock);
+	return addr;
+}
+EXPORT_SYMBOL(blk_nvm_alloc_addr);
+
+/* Send erase command to device */
+int blk_nvm_erase_blk(struct nvm_dev *dev, struct nvm_block *block)
+{
+	if (dev->ops->erase_block)
+		return dev->ops->erase_block(dev->q, block->id);
+
+	return 0;
+}
+EXPORT_SYMBOL(blk_nvm_erase_blk);
+
+static void nvm_blocks_free(struct nvm_dev *dev)
+{
+	struct nvm_lun *lun;
+	int i;
+
+	nvm_for_each_lun(dev, lun, i) {
+		if (!lun->blocks)
+			break;
+		vfree(lun->blocks);
+	}
+}
+
+static void nvm_luns_free(struct nvm_dev *dev)
+{
+	kfree(dev->luns);
+}
+
+static int nvm_luns_init(struct nvm_dev *dev)
+{
+	struct nvm_lun *lun;
+	struct nvm_id_chnl *chnl;
+	int i;
+
+	dev->luns = kcalloc(dev->nr_luns, sizeof(struct nvm_lun), GFP_KERNEL);
+	if (!dev->luns)
+		return -ENOMEM;
+
+	nvm_for_each_lun(dev, lun, i) {
+		chnl = &dev->identity.chnls[i];
+		pr_info("nvm: p %u qsize %u gr %u ge %u begin %llu end %llu\n",
+			i, chnl->queue_size, chnl->gran_read, chnl->gran_erase,
+			chnl->laddr_begin, chnl->laddr_end);
+
+		spin_lock_init(&lun->lock);
+
+		INIT_LIST_HEAD(&lun->free_list);
+		INIT_LIST_HEAD(&lun->used_list);
+
+		lun->id = i;
+		lun->dev = dev;
+		lun->chnl = chnl;
+		lun->reserved_blocks = 2; /* for GC only */
+		lun->nr_blocks =
+				(chnl->laddr_end - chnl->laddr_begin + 1) /
+				(chnl->gran_erase / chnl->gran_read);
+		lun->nr_free_blocks = lun->nr_blocks;
+		lun->nr_pages_per_blk = chnl->gran_erase / chnl->gran_write *
+					(chnl->gran_write / dev->sector_size);
+
+		dev->total_pages += lun->nr_blocks * lun->nr_pages_per_blk;
+		dev->total_blocks += lun->nr_blocks;
+
+		if (lun->nr_pages_per_blk >
+				MAX_INVALID_PAGES_STORAGE * BITS_PER_LONG) {
+			pr_err("nvm: number of pages per block too high.");
+			return -EINVAL;
+		}
+	}
+
+	return 0;
+}
+
+static int nvm_block_map(u64 slba, u64 nlb, u64 *entries, void *private)
+{
+	struct nvm_dev *dev = private;
+	sector_t max_pages = dev->total_pages * (dev->sector_size >> 9);
+	u64 elba = slba + nlb;
+	struct nvm_lun *lun;
+	struct nvm_block *blk;
+	sector_t total_pgs_per_lun = /* each lun have the same configuration */
+		   dev->luns[0].nr_blocks * dev->luns[0].nr_pages_per_blk;
+	u64 i;
+	int lun_id;
+
+	if (unlikely(elba > dev->total_pages)) {
+		pr_err("nvm: L2P data from device is out of bounds!\n");
+		return -EINVAL;
+	}
+
+	for (i = 0; i < nlb; i++) {
+		u64 pba = le64_to_cpu(entries[i]);
+
+		if (unlikely(pba >= max_pages && pba != U64_MAX)) {
+			pr_err("nvm: L2P data entry is out of bounds!\n");
+			return -EINVAL;
+		}
+
+		/* Address zero is special: the first page on a disk is
+		 * protected, as it often holds internal device boot
+		 * information. */
+		if (!pba)
+			continue;
+
+		/* resolve block from physical address */
+		lun_id = pba / total_pgs_per_lun;
+		lun = &dev->luns[lun_id];
+
+		/* Calculate block offset into lun */
+		pba = pba - (total_pgs_per_lun * lun_id);
+		blk = &lun->blocks[pba / lun->nr_pages_per_blk];
+
+		if (!blk->type) {
+			/* At this point, we don't know anything about the
+			 * block. It's up to the FTL on top to re-establish
+			 * the block state. */
+			list_move_tail(&blk->list, &lun->used_list);
+			blk->type = 1;
+			lun->nr_free_blocks--;
+		}
+	}
+
+	return 0;
+}
+
+static int nvm_blocks_init(struct nvm_dev *dev)
+{
+	struct nvm_lun *lun;
+	struct nvm_block *block;
+	sector_t lun_iter, block_iter, cur_block_id = 0;
+	int ret;
+
+	nvm_for_each_lun(dev, lun, lun_iter) {
+		lun->blocks = vzalloc(sizeof(struct nvm_block) *
+						lun->nr_blocks);
+		if (!lun->blocks)
+			return -ENOMEM;
+
+		lun_for_each_block(lun, block, block_iter) {
+			spin_lock_init(&block->lock);
+			INIT_LIST_HEAD(&block->list);
+
+			block->lun = lun;
+			block->id = cur_block_id++;
+
+			/* First block is reserved for device */
+			if (unlikely(lun_iter == 0 && block_iter == 0))
+				continue;
+
+			list_add_tail(&block->list, &lun->free_list);
+		}
+	}
+
+	/* Without bad block table support, we can use the mapping table to
+	 * restore the state of each block. */
+	if (dev->ops->get_l2p_tbl) {
+		ret = dev->ops->get_l2p_tbl(dev->q, 0, dev->total_pages,
+							nvm_block_map, dev);
+		if (ret) {
+			pr_err("nvm: could not read L2P table.\n");
+			pr_warn("nvm: default block initialization");
+		}
+	}
+
+	return 0;
+}
+
+static void nvm_core_free(struct nvm_dev *dev)
+{
+	kfree(dev->identity.chnls);
+	kfree(dev);
+}
+
+static int nvm_core_init(struct nvm_dev *dev, int max_qdepth)
+{
+	dev->nr_luns = dev->identity.nchannels;
+	dev->sector_size = EXPOSED_PAGE_SIZE;
+	INIT_LIST_HEAD(&dev->online_targets);
+
+	return 0;
+}
+
+static void nvm_free(struct nvm_dev *dev)
+{
+	if (!dev)
+		return;
+
+	nvm_blocks_free(dev);
+	nvm_luns_free(dev);
+	nvm_core_free(dev);
+}
+
+int nvm_validate_features(struct nvm_dev *dev)
+{
+	struct nvm_get_features gf;
+	int ret;
+
+	ret = dev->ops->get_features(dev->q, &gf);
+	if (ret)
+		return ret;
+
+	/* Only the default configuration is supported,
+	 * i.e. L2P, no on-drive GC, and the drive performs ECC. */
+	if (gf.rsp != 0x0 || gf.ext != 0x0)
+		return -EINVAL;
+
+	return 0;
+}
+
+int nvm_validate_responsibility(struct nvm_dev *dev)
+{
+	if (!dev->ops->set_responsibility)
+		return 0;
+
+	return dev->ops->set_responsibility(dev->q, 0);
+}
+
+int nvm_init(struct nvm_dev *dev)
+{
+	struct blk_mq_tag_set *tag_set = dev->q->tag_set;
+	int max_qdepth;
+	int ret = 0;
+
+	if (!dev->q || !dev->ops)
+		return -EINVAL;
+
+	if (dev->ops->identify(dev->q, &dev->identity)) {
+		pr_err("nvm: device could not be identified\n");
+		ret = -EINVAL;
+		goto err;
+	}
+
+	max_qdepth = tag_set->queue_depth * tag_set->nr_hw_queues;
+
+	pr_debug("nvm dev: ver %u type %u chnls %u max qdepth: %i\n",
+			dev->identity.ver_id,
+			dev->identity.nvm_type,
+			dev->identity.nchannels,
+			max_qdepth);
+
+	ret = nvm_validate_features(dev);
+	if (ret) {
+		pr_err("nvm: disk features are not supported.");
+		goto err;
+	}
+
+	ret = nvm_validate_responsibility(dev);
+	if (ret) {
+		pr_err("nvm: disk responsibilities are not supported.");
+		goto err;
+	}
+
+	ret = nvm_core_init(dev, max_qdepth);
+	if (ret) {
+		pr_err("nvm: could not initialize core structures.\n");
+		goto err;
+	}
+
+	ret = nvm_luns_init(dev);
+	if (ret) {
+		pr_err("nvm: could not initialize luns\n");
+		goto err;
+	}
+
+	if (!dev->nr_luns) {
+		pr_err("nvm: device did not expose any luns.\n");
+		goto err;
+	}
+
+	ret = nvm_blocks_init(dev);
+	if (ret) {
+		pr_err("nvm: could not initialize blocks\n");
+		goto err;
+	}
+
+	pr_info("nvm: allocating %lu physical pages (%lu KB)\n",
+		dev->total_pages, dev->total_pages * dev->sector_size / 1024);
+	pr_info("nvm: luns: %u\n", dev->nr_luns);
+	pr_info("nvm: blocks: %lu\n", dev->total_blocks);
+	pr_info("nvm: target sector size=%d\n", dev->sector_size);
+
+	return 0;
+err:
+	nvm_free(dev);
+	pr_err("nvm: failed to initialize nvm\n");
+	return ret;
+}
+
+void nvm_exit(struct nvm_dev *dev)
+{
+	nvm_free(dev);
+
+	pr_info("nvm: successfully unloaded\n");
+}
+
+int blk_nvm_register(struct request_queue *q, struct nvm_dev_ops *ops)
+{
+	struct nvm_dev *dev;
+	int ret;
+
+	if (!ops->identify || !ops->get_features)
+		return -EINVAL;
+
+	/* does not yet support multi-page IOs. */
+	blk_queue_max_hw_sectors(q, queue_logical_block_size(q) >> 9);
+
+	dev = kzalloc(sizeof(struct nvm_dev), GFP_KERNEL);
+	if (!dev)
+		return -ENOMEM;
+
+	dev->q = q;
+	dev->ops = ops;
+
+	ret = nvm_init(dev);
+	if (ret)
+		goto err_init;
+
+	q->nvm = dev;
+
+	return 0;
+err_init:
+	kfree(dev);
+	return ret;
+}
+EXPORT_SYMBOL(blk_nvm_register);
+
+void blk_nvm_unregister(struct request_queue *q)
+{
+	if (!blk_queue_nvm(q))
+		return;
+
+	nvm_exit(q->nvm);
+}
+
+static int nvm_ioctl(struct block_device *bdev, fmode_t mode, unsigned int cmd,
+							unsigned long arg)
+{
+	return 0;
+}
+
+static int nvm_open(struct block_device *bdev, fmode_t mode)
+{
+	return 0;
+}
+
+static void nvm_release(struct gendisk *disk, fmode_t mode)
+{
+}
+
+static const struct block_device_operations nvm_fops = {
+	.owner		= THIS_MODULE,
+	.ioctl		= nvm_ioctl,
+	.open		= nvm_open,
+	.release	= nvm_release,
+};
+
+static int nvm_create_target(struct gendisk *qdisk, char *ttname, char *tname,
+						int lun_begin, int lun_end)
+{
+	struct request_queue *qqueue = qdisk->queue;
+	struct nvm_dev *qnvm = qqueue->nvm;
+	struct request_queue *tqueue;
+	struct gendisk *tdisk;
+	struct nvm_target_type *tt;
+	struct nvm_target *t;
+	void *targetdata;
+
+	tt = nvm_find_target_type(ttname);
+	if (!tt) {
+		pr_err("nvm: target type %s not found\n", ttname);
+		return -EINVAL;
+	}
+
+	down_write(&_lock);
+	list_for_each_entry(t, &qnvm->online_targets, list) {
+		if (!strcmp(tname, t->disk->disk_name)) {
+			pr_err("nvm: target name already exists.\n");
+			up_write(&_lock);
+			return -EINVAL;
+		}
+	}
+	up_write(&_lock);
+
+	t = kmalloc(sizeof(struct nvm_target), GFP_KERNEL);
+	if (!t)
+		return -ENOMEM;
+
+	tqueue = blk_alloc_queue_node(GFP_KERNEL, qqueue->node);
+	if (!tqueue)
+		goto err_t;
+	blk_queue_make_request(tqueue, tt->make_rq);
+
+	tdisk = alloc_disk(0);
+	if (!tdisk)
+		goto err_queue;
+
+	sprintf(tdisk->disk_name, "%s", tname);
+	tdisk->flags = GENHD_FL_EXT_DEVT;
+	tdisk->major = 0;
+	tdisk->first_minor = 0;
+	tdisk->fops = &nvm_fops;
+	tdisk->queue = tqueue;
+
+	targetdata = tt->init(qqueue, tqueue, qdisk, tdisk, lun_begin, lun_end);
+	if (IS_ERR(targetdata))
+		goto err_init;
+
+	tdisk->private_data = targetdata;
+	tqueue->queuedata = targetdata;
+
+	blk_queue_prep_rq(qqueue, tt->prep_rq);
+	blk_queue_unprep_rq(qqueue, tt->unprep_rq);
+
+	set_capacity(tdisk, tt->capacity(targetdata));
+	add_disk(tdisk);
+
+	t->type = tt;
+	t->disk = tdisk;
+
+	down_write(&_lock);
+	list_add_tail(&t->list, &qnvm->online_targets);
+	up_write(&_lock);
+
+	return 0;
+err_init:
+	put_disk(tdisk);
+err_queue:
+	blk_cleanup_queue(tqueue);
+err_t:
+	kfree(t);
+	return -ENOMEM;
+}
+
+/* _lock must be taken */
+static void nvm_remove_target(struct nvm_target *t)
+{
+	struct nvm_target_type *tt = t->type;
+	struct gendisk *tdisk = t->disk;
+	struct request_queue *q = tdisk->queue;
+
+	del_gendisk(tdisk);
+	if (tt->exit)
+		tt->exit(tdisk->private_data);
+	blk_cleanup_queue(q);
+
+	put_disk(tdisk);
+
+	list_del(&t->list);
+	kfree(t);
+}
+
+static ssize_t free_blocks_show(struct device *d, struct device_attribute *attr,
+		char *page)
+{
+	struct gendisk *disk = dev_to_disk(d);
+	struct nvm_dev *dev = disk->queue->nvm;
+
+	char *page_start = page;
+	struct nvm_lun *lun;
+	unsigned int i;
+
+	nvm_for_each_lun(dev, lun, i)
+		page += sprintf(page, "%8u\t%u\n", i, lun->nr_free_blocks);
+
+	return page - page_start;
+}
+
+DEVICE_ATTR_RO(free_blocks);
+
+static ssize_t configure_store(struct device *d, struct device_attribute *attr,
+						const char *buf, size_t cnt)
+{
+	struct gendisk *disk = dev_to_disk(d);
+	struct nvm_dev *dev = disk->queue->nvm;
+	char name[255], ttname[255];
+	int lun_begin, lun_end, ret;
+
+	if (cnt >= 255)
+		return -EINVAL;
+
+	ret = sscanf(buf, "%s %s %u:%u", name, ttname, &lun_begin, &lun_end);
+	if (ret != 4) {
+		pr_err("nvm: configure must be in the format of \"name targetname lun_begin:lun_end\".\n");
+		return -EINVAL;
+	}
+
+	if (lun_begin > lun_end || lun_end > dev->nr_luns) {
+		pr_err("nvm: lun out of bound (%u:%u > %u)\n",
+					lun_begin, lun_end, dev->nr_luns);
+		return -EINVAL;
+	}
+
+	ret = nvm_create_target(disk, name, ttname, lun_begin, lun_end);
+	if (ret)
+		pr_err("nvm: configure disk failed\n");
+
+	return cnt;
+}
+DEVICE_ATTR_WO(configure);
+
+static ssize_t remove_store(struct device *d, struct device_attribute *attr,
+						const char *buf, size_t cnt)
+{
+	struct gendisk *disk = dev_to_disk(d);
+	struct nvm_dev *dev = disk->queue->nvm;
+	struct nvm_target *t = NULL;
+	char tname[255];
+	int ret;
+
+	if (cnt >= 255)
+		return -EINVAL;
+
+	ret = sscanf(buf, "%s", tname);
+	if (ret != 1) {
+		pr_err("nvm: remove requires the following format: \"targetname\".\n");
+		return -EINVAL;
+	}
+
+	down_write(&_lock);
+	list_for_each_entry(t, &dev->online_targets, list) {
+		if (!strcmp(tname, t->disk->disk_name)) {
+			nvm_remove_target(t);
+			ret = 0;
+			break;
+		}
+	}
+	up_write(&_lock);
+
+	if (ret)
+		pr_err("nvm: target \"%s\" doesn't exist.\n", tname);
+
+	return cnt;
+}
+
+DEVICE_ATTR_WO(remove);
+
+static struct attribute *nvm_attrs[] = {
+	&dev_attr_free_blocks.attr,
+	&dev_attr_configure.attr,
+	&dev_attr_remove.attr,
+	NULL,
+};
+
+static struct attribute_group nvm_attribute_group = {
+	.name = "nvm",
+	.attrs = nvm_attrs,
+};
+
+int blk_nvm_init_sysfs(struct device *dev)
+{
+	int ret;
+
+	ret = sysfs_create_group(&dev->kobj, &nvm_attribute_group);
+	if (ret)
+		return ret;
+
+	kobject_uevent(&dev->kobj, KOBJ_CHANGE);
+
+	return 0;
+}
+
+void blk_nvm_remove_sysfs(struct device *dev)
+{
+	sysfs_remove_group(&dev->kobj, &nvm_attribute_group);
+}
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index faaf36a..ad8cf2f 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -568,6 +568,12 @@ int blk_register_queue(struct gendisk *disk)
 	if (ret)
 		return ret;
 
+	if (blk_queue_nvm(q)) {
+		ret = blk_nvm_init_sysfs(dev);
+		if (ret)
+			return ret;
+	}
+
 	ret = kobject_add(&q->kobj, kobject_get(&dev->kobj), "%s", "queue");
 	if (ret < 0) {
 		blk_trace_remove_sysfs(dev);
@@ -601,6 +607,11 @@ void blk_unregister_queue(struct gendisk *disk)
 	if (WARN_ON(!q))
 		return;
 
+	if (blk_queue_nvm(q)) {
+		blk_nvm_unregister(q);
+		blk_nvm_remove_sysfs(disk_to_dev(disk));
+	}
+
 	if (q->mq_ops)
 		blk_mq_unregister_disk(disk);
 
diff --git a/block/blk.h b/block/blk.h
index 43b0361..3e4abee 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -281,4 +281,22 @@ static inline int blk_throtl_init(struct request_queue *q) { return 0; }
 static inline void blk_throtl_exit(struct request_queue *q) { }
 #endif /* CONFIG_BLK_DEV_THROTTLING */
 
+#ifdef CONFIG_BLK_DEV_NVM
+struct nvm_target {
+	struct list_head list;
+	struct nvm_target_type *type;
+	struct gendisk *disk;
+};
+
+struct nvm_dev_ops;
+
+extern void blk_nvm_unregister(struct request_queue *);
+extern int blk_nvm_init_sysfs(struct device *);
+extern void blk_nvm_remove_sysfs(struct device *);
+#else
+static inline void blk_nvm_unregister(struct request_queue *q) { }
+static inline int blk_nvm_init_sysfs(struct device *dev) { return 0; }
+static inline void blk_nvm_remove_sysfs(struct device *dev) { }
+#endif /* CONFIG_BLK_DEV_NVM */
+
 #endif /* BLK_INTERNAL_H */
diff --git a/include/linux/bio.h b/include/linux/bio.h
index da3a127..ace0b23 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -354,6 +354,15 @@ static inline void bip_set_seed(struct bio_integrity_payload *bip,
 
 #endif /* CONFIG_BLK_DEV_INTEGRITY */
 
+#if defined(CONFIG_BLK_DEV_NVM)
+
+/* bio open-channel ssd payload */
+struct bio_nvm_payload {
+	void *private;
+};
+
+#endif /* CONFIG_BLK_DEV_NVM */
+
 extern void bio_trim(struct bio *bio, int offset, int size);
 extern struct bio *bio_split(struct bio *bio, int sectors,
 			     gfp_t gfp, struct bio_set *bs);
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index d7b39af..75e1497 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -140,13 +140,15 @@ enum {
 	BLK_MQ_RQ_QUEUE_OK	= 0,	/* queued fine */
 	BLK_MQ_RQ_QUEUE_BUSY	= 1,	/* requeue IO for later */
 	BLK_MQ_RQ_QUEUE_ERROR	= 2,	/* end IO with error */
-	BLK_MQ_RQ_QUEUE_DONE	= 3,	/* IO is already handled */
+	BLK_MQ_RQ_QUEUE_DONE	= 3,	/* IO handled by prep */
 
 	BLK_MQ_F_SHOULD_MERGE	= 1 << 0,
 	BLK_MQ_F_TAG_SHARED	= 1 << 1,
 	BLK_MQ_F_SG_MERGE	= 1 << 2,
 	BLK_MQ_F_SYSFS_UP	= 1 << 3,
 	BLK_MQ_F_DEFER_ISSUE	= 1 << 4,
+	BLK_MQ_F_NVM		= 1 << 5,
+
 	BLK_MQ_F_ALLOC_POLICY_START_BIT = 8,
 	BLK_MQ_F_ALLOC_POLICY_BITS = 1,
 
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index a1b25e3..a619844 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -83,7 +83,10 @@ struct bio {
 		struct bio_integrity_payload *bi_integrity; /* data integrity */
 #endif
 	};
-
+#if defined(CONFIG_BLK_DEV_NVM)
+	struct bio_nvm_payload *bi_nvm; /* open-channel ssd support */
+#endif
 	unsigned short		bi_vcnt;	/* how many bio_vec's */
 
 	/*
@@ -193,6 +196,8 @@ enum rq_flag_bits {
 	__REQ_HASHED,		/* on IO scheduler merge hash */
 	__REQ_MQ_INFLIGHT,	/* track inflight for MQ */
 	__REQ_NO_TIMEOUT,	/* requests may never expire */
+	__REQ_NVM_MAPPED,	/* NVM mapped this request */
+	__REQ_NVM_NO_INFLIGHT,	/* request should not use inflight protection */
 	__REQ_NR_BITS,		/* stops here */
 };
 
@@ -213,7 +218,7 @@ enum rq_flag_bits {
 #define REQ_COMMON_MASK \
 	(REQ_WRITE | REQ_FAILFAST_MASK | REQ_SYNC | REQ_META | REQ_PRIO | \
 	 REQ_DISCARD | REQ_WRITE_SAME | REQ_NOIDLE | REQ_FLUSH | REQ_FUA | \
-	 REQ_SECURE | REQ_INTEGRITY)
+	 REQ_SECURE | REQ_INTEGRITY | REQ_NVM_NO_INFLIGHT)
 #define REQ_CLONE_MASK		REQ_COMMON_MASK
 
 #define BIO_NO_ADVANCE_ITER_MASK	(REQ_DISCARD|REQ_WRITE_SAME)
@@ -247,5 +252,6 @@ enum rq_flag_bits {
 #define REQ_HASHED		(1ULL << __REQ_HASHED)
 #define REQ_MQ_INFLIGHT		(1ULL << __REQ_MQ_INFLIGHT)
 #define REQ_NO_TIMEOUT		(1ULL << __REQ_NO_TIMEOUT)
-
+#define REQ_NVM_MAPPED		(1ULL << __REQ_NVM_MAPPED)
+#define REQ_NVM_NO_INFLIGHT	(1ULL << __REQ_NVM_NO_INFLIGHT)
 #endif /* __LINUX_BLK_TYPES_H */
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 7f9a516..d416fd5 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -209,6 +209,9 @@ struct request {
 
 	/* for bidi */
 	struct request *next_rq;
+#ifdef CONFIG_BLK_DEV_NVM
+	sector_t phys_sector;
+#endif
 };
 
 static inline unsigned short req_get_ioprio(struct request *req)
@@ -309,6 +312,10 @@ struct queue_limits {
 	unsigned char		raid_partial_stripes_expensive;
 };
 
+#ifdef CONFIG_BLK_DEV_NVM
+struct nvm_dev;
+#endif
+
 struct request_queue {
 	/*
 	 * Together with queue_head for cacheline sharing
@@ -455,6 +462,9 @@ struct request_queue {
 #ifdef CONFIG_BLK_DEV_IO_TRACE
 	struct blk_trace	*blk_trace;
 #endif
+#ifdef CONFIG_BLK_DEV_NVM
+	struct nvm_dev *nvm;
+#endif
 	/*
 	 * for flush operations
 	 */
@@ -513,6 +523,7 @@ struct request_queue {
 #define QUEUE_FLAG_INIT_DONE   20	/* queue is initialized */
 #define QUEUE_FLAG_NO_SG_MERGE 21	/* don't attempt to merge SG segments*/
 #define QUEUE_FLAG_SG_GAPS     22	/* queue doesn't support SG gaps */
+#define QUEUE_FLAG_NVM         23	/* open-channel SSD managed queue */
 
 #define QUEUE_FLAG_DEFAULT	((1 << QUEUE_FLAG_IO_STAT) |		\
 				 (1 << QUEUE_FLAG_STACKABLE)	|	\
@@ -601,6 +612,7 @@ static inline void queue_flag_clear(unsigned int flag, struct request_queue *q)
 #define blk_queue_discard(q)	test_bit(QUEUE_FLAG_DISCARD, &(q)->queue_flags)
 #define blk_queue_secdiscard(q)	(blk_queue_discard(q) && \
 	test_bit(QUEUE_FLAG_SECDISCARD, &(q)->queue_flags))
+#define blk_queue_nvm(q)	test_bit(QUEUE_FLAG_NVM, &(q)->queue_flags)
 
 #define blk_noretry_request(rq) \
 	((rq)->cmd_flags & (REQ_FAILFAST_DEV|REQ_FAILFAST_TRANSPORT| \
@@ -822,6 +834,7 @@ extern int sg_scsi_ioctl(struct request_queue *, struct gendisk *, fmode_t,
 			 struct scsi_ioctl_command __user *);
 
 extern void blk_queue_bio(struct request_queue *q, struct bio *bio);
+extern void blk_init_request_from_bio(struct request *req, struct bio *bio);
 
 /*
  * A queue has just exitted congestion.  Note this in the global counter of
@@ -902,6 +915,11 @@ static inline unsigned int blk_rq_cur_sectors(const struct request *rq)
 	return blk_rq_cur_bytes(rq) >> 9;
 }
 
+static inline sector_t blk_rq_phys_pos(const struct request *rq)
+{
+	return rq->phys_sector;
+}
+
 static inline unsigned int blk_queue_get_max_sectors(struct request_queue *q,
 						     unsigned int cmd_flags)
 {
@@ -1504,6 +1522,8 @@ extern bool blk_integrity_merge_bio(struct request_queue *, struct request *,
 static inline
 struct blk_integrity *bdev_get_integrity(struct block_device *bdev)
 {
+	if (unlikely(!bdev))
+		return NULL;
 	return bdev->bd_disk->integrity;
 }
 
@@ -1598,6 +1618,204 @@ static inline bool blk_integrity_is_initialized(struct gendisk *g)
 
 #endif /* CONFIG_BLK_DEV_INTEGRITY */
 
+#ifdef CONFIG_BLK_DEV_NVM
+
+#include <uapi/linux/nvm.h>
+
+typedef int (nvm_l2p_update_fn)(u64, u64, u64 *, void *);
+typedef int (nvm_id_fn)(struct request_queue *, struct nvm_id *);
+typedef int (nvm_get_features_fn)(struct request_queue *,
+				  struct nvm_get_features *);
+typedef int (nvm_set_rsp_fn)(struct request_queue *, u64);
+typedef int (nvm_get_l2p_tbl_fn)(struct request_queue *, u64, u64,
+				 nvm_l2p_update_fn *, void *);
+typedef int (nvm_erase_blk_fn)(struct request_queue *, sector_t);
+
+struct nvm_dev_ops {
+	nvm_id_fn		*identify;
+	nvm_get_features_fn	*get_features;
+	nvm_set_rsp_fn		*set_responsibility;
+	nvm_get_l2p_tbl_fn	*get_l2p_tbl;
+
+	nvm_erase_blk_fn	*erase_block;
+};
+
+struct nvm_blocks;
+
+/*
+ * We assume that the device exposes its channels as a linear address
+ * space. A lun therefore has a phy_addr_start and phy_addr_end that
+ * denote its start and end. This abstraction is used to let the
+ * open-channel SSD (or any other device) expose its read/write/erase
+ * interface and be administered by the host system.
+ */
+struct nvm_lun {
+	struct nvm_dev *dev;
+
+	/* lun block lists */
+	struct list_head used_list;	/* In-use blocks */
+	struct list_head free_list;	/* Not used blocks i.e. released
+					 *  and ready for use */
+
+	struct {
+		spinlock_t lock;
+	} ____cacheline_aligned_in_smp;
+
+	struct nvm_block *blocks;
+	struct nvm_id_chnl *chnl;
+
+	int id;
+	int reserved_blocks;
+
+	unsigned int nr_blocks;		/* end_block - start_block. */
+	unsigned int nr_free_blocks;	/* Number of unused blocks */
+
+	int nr_pages_per_blk;
+};
+
+struct nvm_block {
+	/* Management structures */
+	struct list_head list;
+	struct nvm_lun *lun;
+
+	spinlock_t lock;
+
+#define MAX_INVALID_PAGES_STORAGE 8
+	/* Bitmap for invalid page entries */
+	unsigned long invalid_pages[MAX_INVALID_PAGES_STORAGE];
+	/* points to the next writable page within a block */
+	unsigned int next_page;
+	/* number of pages that are invalid, wrt host page size */
+	unsigned int nr_invalid_pages;
+
+	unsigned int id;
+	int type;
+	/* Persistent data structures */
+	atomic_t data_cmnt_size; /* data pages committed to stable storage */
+};
+
+struct nvm_dev {
+	struct nvm_dev_ops *ops;
+	struct request_queue *q;
+
+	struct nvm_id identity;
+
+	struct list_head online_targets;
+
+	/* Open-channel SSD stores extra data after the private driver data */
+	unsigned int drv_cmd_size;
+
+	int nr_luns;
+	struct nvm_lun *luns;
+
+	/*int nr_blks_per_lun;
+	int nr_pages_per_blk;*/
+	/* Calculated/Cached values. These do not reflect the actual usable
+	 * blocks at run-time. */
+	unsigned long total_pages;
+	unsigned long total_blocks;
+
+	uint32_t sector_size;
+};
+
+/* Logical to physical mapping */
+struct nvm_addr {
+	sector_t addr;
+	struct nvm_block *block;
+};
+
+/* Physical to logical mapping */
+struct nvm_rev_addr {
+	sector_t addr;
+};
+
+struct rrpc_inflight_rq {
+	struct list_head list;
+	sector_t l_start;
+	sector_t l_end;
+};
+
+struct nvm_per_rq {
+	struct rrpc_inflight_rq inflight_rq;
+	struct nvm_addr *addr;
+	unsigned int flags;
+};
+
+typedef void (nvm_tgt_make_rq)(struct request_queue *, struct bio *);
+typedef int (nvm_tgt_prep_rq)(struct request_queue *, struct request *);
+typedef void (nvm_tgt_unprep_rq)(struct request_queue *, struct request *);
+typedef sector_t (nvm_tgt_capacity)(void *);
+typedef void *(nvm_tgt_init_fn)(struct request_queue *, struct request_queue *,
+				struct gendisk *, struct gendisk *, int, int);
+typedef void (nvm_tgt_exit_fn)(void *);
+
+struct nvm_target_type {
+	const char *name;
+	unsigned int version[3];
+
+	/* target entry points */
+	nvm_tgt_make_rq *make_rq;
+	nvm_tgt_prep_rq *prep_rq;
+	nvm_tgt_unprep_rq *unprep_rq;
+	nvm_tgt_capacity *capacity;
+
+	/* module-specific init/teardown */
+	nvm_tgt_init_fn *init;
+	nvm_tgt_exit_fn *exit;
+
+	/* For open-channel SSD internal use */
+	struct list_head list;
+};
+
+extern struct nvm_target_type *nvm_find_target_type(const char *name);
+extern int nvm_register_target(struct nvm_target_type *tt);
+extern void nvm_unregister_target(struct nvm_target_type *tt);
+extern int blk_nvm_register(struct request_queue *,
+						struct nvm_dev_ops *);
+extern struct nvm_block *blk_nvm_get_blk(struct nvm_lun *, int);
+extern void blk_nvm_put_blk(struct nvm_block *block);
+extern int blk_nvm_erase_blk(struct nvm_dev *, struct nvm_block *);
+extern sector_t blk_nvm_alloc_addr(struct nvm_block *);
+static inline struct nvm_dev *blk_nvm_get_dev(struct request_queue *q)
+{
+	return q->nvm;
+}
+#else
+struct nvm_dev;
+struct nvm_dev_ops;
+struct nvm_lun;
+struct nvm_block;
+struct nvm_target_type;
+
+static inline struct nvm_target_type *nvm_find_target_type(const char *name)
+{
+	return NULL;
+}
+static inline int nvm_register_target(struct nvm_target_type *tt)
+{
+	return -EINVAL;
+}
+static inline void nvm_unregister_target(struct nvm_target_type *tt) {}
+static inline int blk_nvm_register(struct request_queue *q,
+						struct nvm_dev_ops *ops)
+{
+	return -EINVAL;
+}
+static inline struct nvm_block *blk_nvm_get_blk(struct nvm_lun *lun, int is_gc)
+{
+	return NULL;
+}
+static inline void blk_nvm_put_blk(struct nvm_block *block) {}
+static inline int blk_nvm_erase_blk(struct nvm_dev *dev,
+						struct nvm_block *block)
+{
+	return -EINVAL;
+}
+static inline struct nvm_dev *blk_nvm_get_dev(struct request_queue *q)
+{
+	return NULL;
+}
+static inline sector_t blk_nvm_alloc_addr(struct nvm_block *block)
+{
+	return 0;
+}
+#endif /* CONFIG_BLK_DEV_NVM */
+
 struct block_device_operations {
 	int (*open) (struct block_device *, fmode_t);
 	void (*release) (struct gendisk *, fmode_t);
diff --git a/include/linux/lightnvm.h b/include/linux/lightnvm.h
new file mode 100644
index 0000000..888d994
--- /dev/null
+++ b/include/linux/lightnvm.h
@@ -0,0 +1,56 @@
+#ifndef NVM_H
+#define NVM_H
+
+#include <linux/blkdev.h>
+#include <linux/types.h>
+
+#define nvm_for_each_lun(dev, lun, i) \
+		for ((i) = 0, lun = &(dev)->luns[0]; \
+			(i) < (dev)->nr_luns; (i)++, lun = &(dev)->luns[(i)])
+
+#define lun_for_each_block(p, b, i) \
+		for ((i) = 0, b = &(p)->blocks[0]; \
+			(i) < (p)->nr_blocks; (i)++, b = &(p)->blocks[(i)])
+
+#define block_for_each_page(b, p) \
+		for ((p)->addr = block_to_addr((b)), (p)->block = (b); \
+			(p)->addr < block_to_addr((b)) \
+				+ (b)->lun->nr_pages_per_blk; \
+			(p)->addr++)
+
+/* We currently assume that the lightnvm device accepts data in 512 byte
+ * chunks. This should be set to the smallest command size available for a
+ * given device.
+ */
+#define NVM_SECTOR 512
+#define EXPOSED_PAGE_SIZE 4096
+
+#define NR_PHY_IN_LOG (EXPOSED_PAGE_SIZE / NVM_SECTOR)
+
+#define NVM_MSG_PREFIX "nvm"
+#define ADDR_EMPTY (~0ULL)
+#define LTOP_POISON 0xD3ADB33F
+
+/* core.c */
+
+static inline int block_is_full(struct nvm_block *block)
+{
+	struct nvm_lun *lun = block->lun;
+
+	return block->next_page == lun->nr_pages_per_blk;
+}
+
+static inline sector_t block_to_addr(struct nvm_block *block)
+{
+	struct nvm_lun *lun = block->lun;
+
+	return block->id * lun->nr_pages_per_blk;
+}
+
+static inline struct nvm_lun *paddr_to_lun(struct nvm_dev *dev,
+							sector_t p_addr)
+{
+	return &dev->luns[p_addr / (dev->total_pages / dev->nr_luns)];
+}
+
+#endif
diff --git a/include/uapi/linux/nvm.h b/include/uapi/linux/nvm.h
new file mode 100644
index 0000000..fb95cf5
--- /dev/null
+++ b/include/uapi/linux/nvm.h
@@ -0,0 +1,70 @@
+/*
+ * Definitions for the LightNVM interface
+ * Copyright (c) 2015, IT University of Copenhagen
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+#ifndef _UAPI_LINUX_LIGHTNVM_H
+#define _UAPI_LINUX_LIGHTNVM_H
+
+#include <linux/types.h>
+
+enum {
+	/* HW Responsibilities */
+	NVM_RSP_L2P	= 0x00,
+	NVM_RSP_GC	= 0x01,
+	NVM_RSP_ECC	= 0x02,
+
+	/* Physical NVM Type */
+	NVM_NVMT_BLK	= 0,
+	NVM_NVMT_BYTE	= 1,
+
+	/* Internal IO Scheduling algorithm */
+	NVM_IOSCHED_CHANNEL	= 0,
+	NVM_IOSCHED_CHIP	= 1,
+
+	/* Status codes */
+	NVM_SUCCESS		= 0,
+	NVM_RSP_NOT_CHANGEABLE	= 1,
+};
+
+struct nvm_id_chnl {
+	__u64	laddr_begin;
+	__u64	laddr_end;
+	__u32	oob_size;
+	__u32	queue_size;
+	__u32	gran_read;
+	__u32	gran_write;
+	__u32	gran_erase;
+	__u32	t_r;
+	__u32	t_sqr;
+	__u32	t_w;
+	__u32	t_sqw;
+	__u32	t_e;
+	__u16	chnl_parallelism;
+	__u8	io_sched;
+	__u8	res[133];
+};
+
+struct nvm_id {
+	__u8	ver_id;
+	__u8	nvm_type;
+	__u16	nchannels;
+	struct nvm_id_chnl *chnls;
+};
+
+struct nvm_get_features {
+	__u64	rsp;
+	__u64	ext;
+};
+
+#endif /* _UAPI_LINUX_LIGHTNVM_H */
+
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 2/5 v2] blk-mq: Support for Open-Channel SSDs
@ 2015-04-15 12:34   ` Matias Bjørling
  0 siblings, 0 replies; 53+ messages in thread
From: Matias Bjørling @ 2015-04-15 12:34 UTC (permalink / raw)
  To: hch, axboe, linux-fsdevel, linux-kernel, linux-nvme
  Cc: javier, keith.busch, Matias Bjørling

Open-channel SSDs are devices that share responsibilities with the host
in order to implement and maintain features that typical SSDs keep
strictly in firmware. These include (i) the Flash Translation Layer
(FTL), (ii) bad block management, and (iii) hardware units such as the
flash controller, the interface controller, and large amounts of flash
chips. In this way, Open-channels SSDs exposes direct access to their
physical flash storage, while keeping a subset of the internal features
of SSDs.

LightNVM is a specification that gives support to Open-channel SSDs
LightNVM allows the host to manage data placement, garbage collection,
and parallelism. Device specific responsibilities such as bad block
management, FTL extensions to support atomic IOs, or metadata
persistence are still handled by the device.

The implementation of LightNVM consists of two parts: core and
(multiple) targets. The core implements functionality shared across
targets. This is initialization, teardown and statistics. The targets
implement the interface that exposes physical flash to user-space
applications. Examples of such targets include key-value store,
object-store, as well as traditional block devices, which can be
application-specific.

Contributions in this patch from:

  Javier Gonzalez <javier@paletta.io>
  Jesper Madsen <jmad@itu.dk>

Signed-off-by: Matias Bjørling <m@bjorling.me>
---
 block/Kconfig             |  12 +
 block/Makefile            |   2 +-
 block/blk-mq.c            |  12 +-
 block/blk-nvm.c           | 722 ++++++++++++++++++++++++++++++++++++++++++++++
 block/blk-sysfs.c         |  11 +
 block/blk.h               |  18 ++
 include/linux/bio.h       |   9 +
 include/linux/blk-mq.h    |   4 +-
 include/linux/blk_types.h |  12 +-
 include/linux/blkdev.h    | 218 ++++++++++++++
 include/linux/lightnvm.h  |  56 ++++
 include/uapi/linux/nvm.h  |  70 +++++
 12 files changed, 1140 insertions(+), 6 deletions(-)
 create mode 100644 block/blk-nvm.c
 create mode 100644 include/linux/lightnvm.h
 create mode 100644 include/uapi/linux/nvm.h

diff --git a/block/Kconfig b/block/Kconfig
index 161491d..a3fca8f 100644
--- a/block/Kconfig
+++ b/block/Kconfig
@@ -88,6 +88,18 @@ config BLK_DEV_INTEGRITY
 	T10/SCSI Data Integrity Field or the T13/ATA External Path
 	Protection.  If in doubt, say N.
 
+config BLK_DEV_NVM
+	bool "Block layer Open-Channel SSD support"
+	depends on BLK_DEV
+	default y
+	---help---
+	  Say Y here to get to enable support for Open-channel SSDs.
+
+	  Open-Channel SSDs expose direct access to the underlying non-volatile
+	  memory.
+
+	  This option is required by Open-Channel SSD target drivers.
+
 config BLK_DEV_THROTTLING
 	bool "Block layer bio throttling support"
 	depends on BLK_CGROUP=y
diff --git a/block/Makefile b/block/Makefile
index 00ecc97..66a5826 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -22,4 +22,4 @@ obj-$(CONFIG_IOSCHED_CFQ)	+= cfq-iosched.o
 obj-$(CONFIG_BLOCK_COMPAT)	+= compat_ioctl.o
 obj-$(CONFIG_BLK_CMDLINE_PARSER)	+= cmdline-parser.o
 obj-$(CONFIG_BLK_DEV_INTEGRITY) += bio-integrity.o blk-integrity.o t10-pi.o
-
+obj-$(CONFIG_BLK_DEV_NVM)  += blk-nvm.o
diff --git a/block/blk-mq.c b/block/blk-mq.c
index f3dd028..58a8a71 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -221,6 +221,9 @@ static void blk_mq_rq_ctx_init(struct request_queue *q, struct blk_mq_ctx *ctx,
 	rq->end_io = NULL;
 	rq->end_io_data = NULL;
 	rq->next_rq = NULL;
+#if CONFIG_BLK_DEV_NVM
+	rq->phys_sector = 0;
+#endif
 
 	ctx->rq_dispatched[rw_is_sync(rw_flags)]++;
 }
@@ -1445,6 +1448,7 @@ static struct blk_mq_tags *blk_mq_init_rq_map(struct blk_mq_tag_set *set,
 	struct blk_mq_tags *tags;
 	unsigned int i, j, entries_per_page, max_order = 4;
 	size_t rq_size, left;
+	unsigned int cmd_size = set->cmd_size;
 
 	tags = blk_mq_init_tags(set->queue_depth, set->reserved_tags,
 				set->numa_node,
@@ -1462,11 +1466,14 @@ static struct blk_mq_tags *blk_mq_init_rq_map(struct blk_mq_tag_set *set,
 		return NULL;
 	}
 
+	if (set->flags & BLK_MQ_F_NVM)
+		cmd_size += sizeof(struct nvm_per_rq);
+
 	/*
 	 * rq_size is the size of the request plus driver payload, rounded
 	 * to the cacheline size
 	 */
-	rq_size = round_up(sizeof(struct request) + set->cmd_size,
+	rq_size = round_up(sizeof(struct request) + cmd_size,
 				cache_line_size());
 	left = rq_size * set->queue_depth;
 
@@ -1978,6 +1985,9 @@ struct request_queue *blk_mq_init_queue(struct blk_mq_tag_set *set)
 	if (!(set->flags & BLK_MQ_F_SG_MERGE))
 		q->queue_flags |= 1 << QUEUE_FLAG_NO_SG_MERGE;
 
+	if (set->flags & BLK_MQ_F_NVM)
+		q->queue_flags |= 1 << QUEUE_FLAG_NVM;
+
 	q->sg_reserved_size = INT_MAX;
 
 	INIT_WORK(&q->requeue_work, blk_mq_requeue_work);
diff --git a/block/blk-nvm.c b/block/blk-nvm.c
new file mode 100644
index 0000000..722821c
--- /dev/null
+++ b/block/blk-nvm.c
@@ -0,0 +1,722 @@
+/*
+ * blk-nvm.c - Block layer Open-channel SSD integration
+ *
+ * Copyright (C) 2015 IT University of Copenhagen
+ * Initial release: Matias Bjorling <mabj@itu.dk>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License version
+ * 2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; see the file COPYING.  If not, write to
+ * the Free Software Foundation, 675 Mass Ave, Cambridge, MA 02139,
+ * USA.
+ *
+ */
+
+#include <linux/blkdev.h>
+#include <linux/blk-mq.h>
+#include <linux/list.h>
+#include <linux/types.h>
+#include <linux/sem.h>
+#include <linux/bitmap.h>
+
+#include <linux/lightnvm.h>
+
+#include "blk.h"
+
+static LIST_HEAD(_targets);
+static DECLARE_RWSEM(_lock);
+
+struct nvm_target_type *nvm_find_target_type(const char *name)
+{
+	struct nvm_target_type *tt;
+
+	list_for_each_entry(tt, &_targets, list)
+		if (!strcmp(name, tt->name))
+			return tt;
+
+	return NULL;
+}
+
+int nvm_register_target(struct nvm_target_type *tt)
+{
+	int ret = 0;
+
+	down_write(&_lock);
+	if (nvm_find_target_type(tt->name))
+		ret = -EEXIST;
+	else
+		list_add(&tt->list, &_targets);
+	up_write(&_lock);
+
+	return ret;
+}
+
+void nvm_unregister_target(struct nvm_target_type *tt)
+{
+	if (!tt)
+		return;
+
+	down_write(&_lock);
+	list_del(&tt->list);
+	up_write(&_lock);
+}
+
+static void nvm_reset_block(struct nvm_lun *lun, struct nvm_block *block)
+{
+	spin_lock(&block->lock);
+	bitmap_zero(block->invalid_pages, lun->nr_pages_per_blk);
+	block->next_page = 0;
+	block->nr_invalid_pages = 0;
+	atomic_set(&block->data_cmnt_size, 0);
+	spin_unlock(&block->lock);
+}
+
+/* Use blk_nvm_[get/put]_blk to administer the blocks in use for each lun.
+ * Whenever a block is in use by an append point, we store it within the
+ * used_list. We then move it back when it's free to be used by another
+ * append point.
+ *
+ * The newly claimed block is always added to the back of used_list, as we
+ * assume that the start of the used list holds the oldest block, and is
+ * therefore more likely to contain invalidated pages.
+ */
+struct nvm_block *blk_nvm_get_blk(struct nvm_lun *lun, int is_gc)
+{
+	struct nvm_block *block = NULL;
+
+	BUG_ON(!lun);
+
+	spin_lock(&lun->lock);
+
+	if (list_empty(&lun->free_list)) {
+		pr_err_ratelimited("nvm: lun %u has no free pages available\n",
+								lun->id);
+		spin_unlock(&lun->lock);
+		goto out;
+	}
+
+	if (!is_gc && lun->nr_free_blocks < lun->reserved_blocks) {
+		spin_unlock(&lun->lock);
+		goto out;
+	}
+
+	block = list_first_entry(&lun->free_list, struct nvm_block, list);
+	list_move_tail(&block->list, &lun->used_list);
+
+	lun->nr_free_blocks--;
+
+	spin_unlock(&lun->lock);
+
+	nvm_reset_block(lun, block);
+
+out:
+	return block;
+}
+EXPORT_SYMBOL(blk_nvm_get_blk);
+
+/* We assume that all valid pages have already been moved when the block is
+ * added back to the free list. We add it last to allow round-robin use of
+ * all blocks, thereby providing simple (naive) wear-leveling.
+ */
+void blk_nvm_put_blk(struct nvm_block *block)
+{
+	struct nvm_lun *lun = block->lun;
+
+	spin_lock(&lun->lock);
+
+	list_move_tail(&block->list, &lun->free_list);
+	lun->nr_free_blocks++;
+
+	spin_unlock(&lun->lock);
+}
+EXPORT_SYMBOL(blk_nvm_put_blk);
+
+sector_t blk_nvm_alloc_addr(struct nvm_block *block)
+{
+	sector_t addr = ADDR_EMPTY;
+
+	spin_lock(&block->lock);
+	if (block_is_full(block))
+		goto out;
+
+	addr = block_to_addr(block) + block->next_page;
+
+	block->next_page++;
+out:
+	spin_unlock(&block->lock);
+	return addr;
+}
+EXPORT_SYMBOL(blk_nvm_alloc_addr);
+
+/* Send erase command to device */
+int blk_nvm_erase_blk(struct nvm_dev *dev, struct nvm_block *block)
+{
+	if (dev->ops->erase_block)
+		return dev->ops->erase_block(dev->q, block->id);
+
+	return 0;
+}
+EXPORT_SYMBOL(blk_nvm_erase_blk);
+
+static void nvm_blocks_free(struct nvm_dev *dev)
+{
+	struct nvm_lun *lun;
+	int i;
+
+	nvm_for_each_lun(dev, lun, i) {
+		if (!lun->blocks)
+			break;
+		vfree(lun->blocks);
+	}
+}
+
+static void nvm_luns_free(struct nvm_dev *dev)
+{
+	kfree(dev->luns);
+}
+
+static int nvm_luns_init(struct nvm_dev *dev)
+{
+	struct nvm_lun *lun;
+	struct nvm_id_chnl *chnl;
+	int i;
+
+	dev->luns = kcalloc(dev->nr_luns, sizeof(struct nvm_lun), GFP_KERNEL);
+	if (!dev->luns)
+		return -ENOMEM;
+
+	nvm_for_each_lun(dev, lun, i) {
+		chnl = &dev->identity.chnls[i];
+		pr_info("nvm: p %u qsize %u gr %u ge %u begin %llu end %llu\n",
+			i, chnl->queue_size, chnl->gran_read, chnl->gran_erase,
+			chnl->laddr_begin, chnl->laddr_end);
+
+		spin_lock_init(&lun->lock);
+
+		INIT_LIST_HEAD(&lun->free_list);
+		INIT_LIST_HEAD(&lun->used_list);
+
+		lun->id = i;
+		lun->dev = dev;
+		lun->chnl = chnl;
+		lun->reserved_blocks = 2; /* for GC only */
+		lun->nr_blocks =
+				(chnl->laddr_end - chnl->laddr_begin + 1) /
+				(chnl->gran_erase / chnl->gran_read);
+		lun->nr_free_blocks = lun->nr_blocks;
+		lun->nr_pages_per_blk = chnl->gran_erase / chnl->gran_write *
+					(chnl->gran_write / dev->sector_size);
+
+		dev->total_pages += lun->nr_blocks * lun->nr_pages_per_blk;
+		dev->total_blocks += lun->nr_blocks;
+
+		if (lun->nr_pages_per_blk >
+				MAX_INVALID_PAGES_STORAGE * BITS_PER_LONG) {
+			pr_err("nvm: number of pages per block too high\n");
+			return -EINVAL;
+		}
+	}
+
+	return 0;
+}
+
+static int nvm_block_map(u64 slba, u64 nlb, u64 *entries, void *private)
+{
+	struct nvm_dev *dev = private;
+	sector_t max_pages = dev->total_pages * (dev->sector_size >> 9);
+	u64 elba = slba + nlb;
+	struct nvm_lun *lun;
+	struct nvm_block *blk;
+	sector_t total_pgs_per_lun = /* each lun has the same configuration */
+		   dev->luns[0].nr_blocks * dev->luns[0].nr_pages_per_blk;
+	u64 i;
+	int lun_id;
+
+	if (unlikely(elba > dev->total_pages)) {
+		pr_err("nvm: L2P data from device is out of bounds!\n");
+		return -EINVAL;
+	}
+
+	for (i = 0; i < nlb; i++) {
+		u64 pba = le64_to_cpu(entries[i]);
+
+		if (unlikely(pba >= max_pages && pba != U64_MAX)) {
+			pr_err("nvm: L2P data entry is out of bounds!\n");
+			return -EINVAL;
+		}
+
+		/* Address zero is special. The first page of a disk is
+		 * protected, as it often holds internal device boot
+		 * information. */
+		if (!pba)
+			continue;
+
+		/* resolve block from physical address */
+		lun_id = pba / total_pgs_per_lun;
+		lun = &dev->luns[lun_id];
+
+		/* Calculate block offset into lun */
+		pba = pba - (total_pgs_per_lun * lun_id);
+		blk = &lun->blocks[pba / lun->nr_pages_per_blk];
+
+		if (!blk->type) {
+			/* at this point, we don't know anything about the
+			 * block. It's up to the FTL on top to re-establish the
+			 * block state */
+			list_move_tail(&blk->list, &lun->used_list);
+			blk->type = 1;
+			lun->nr_free_blocks--;
+		}
+	}
+
+	return 0;
+}
+
+static int nvm_blocks_init(struct nvm_dev *dev)
+{
+	struct nvm_lun *lun;
+	struct nvm_block *block;
+	sector_t lun_iter, block_iter, cur_block_id = 0;
+	int ret;
+
+	nvm_for_each_lun(dev, lun, lun_iter) {
+		lun->blocks = vzalloc(sizeof(struct nvm_block) *
+						lun->nr_blocks);
+		if (!lun->blocks)
+			return -ENOMEM;
+
+		lun_for_each_block(lun, block, block_iter) {
+			spin_lock_init(&block->lock);
+			INIT_LIST_HEAD(&block->list);
+
+			block->lun = lun;
+			block->id = cur_block_id++;
+
+			/* First block is reserved for device */
+			if (unlikely(lun_iter == 0 && block_iter == 0))
+				continue;
+
+			list_add_tail(&block->list, &lun->free_list);
+		}
+	}
+
+	/* Without bad block table support, we can use the mapping table to
+	   restore the state of each block. */
+	if (dev->ops->get_l2p_tbl) {
+		ret = dev->ops->get_l2p_tbl(dev->q, 0, dev->total_pages,
+							nvm_block_map, dev);
+		if (ret) {
+			pr_err("nvm: could not read L2P table.\n");
+			pr_warn("nvm: falling back to default block initialization\n");
+		}
+	}
+
+	return 0;
+}
+
+static void nvm_core_free(struct nvm_dev *dev)
+{
+	kfree(dev->identity.chnls);
+	kfree(dev);
+}
+
+static int nvm_core_init(struct nvm_dev *dev, int max_qdepth)
+{
+	dev->nr_luns = dev->identity.nchannels;
+	dev->sector_size = EXPOSED_PAGE_SIZE;
+	INIT_LIST_HEAD(&dev->online_targets);
+
+	return 0;
+}
+
+static void nvm_free(struct nvm_dev *dev)
+{
+	if (!dev)
+		return;
+
+	nvm_blocks_free(dev);
+	nvm_luns_free(dev);
+	nvm_core_free(dev);
+}
+
+int nvm_validate_features(struct nvm_dev *dev)
+{
+	struct nvm_get_features gf;
+	int ret;
+
+	ret = dev->ops->get_features(dev->q, &gf);
+	if (ret)
+		return ret;
+
+	/* Only the default configuration is supported,
+	 * i.e. L2P, no on-drive GC, and the drive performs ECC */
+	if (gf.rsp != 0x0 || gf.ext != 0x0)
+		return -EINVAL;
+
+	return 0;
+}
+
+int nvm_validate_responsibility(struct nvm_dev *dev)
+{
+	if (!dev->ops->set_responsibility)
+		return 0;
+
+	return dev->ops->set_responsibility(dev->q, 0);
+}
+
+int nvm_init(struct nvm_dev *dev)
+{
+	struct blk_mq_tag_set *tag_set = dev->q->tag_set;
+	int max_qdepth;
+	int ret = 0;
+
+	if (!dev->q || !dev->ops)
+		return -EINVAL;
+
+	if (dev->ops->identify(dev->q, &dev->identity)) {
+		pr_err("nvm: device could not be identified\n");
+		ret = -EINVAL;
+		goto err;
+	}
+
+	max_qdepth = tag_set->queue_depth * tag_set->nr_hw_queues;
+
+	pr_debug("nvm dev: ver %u type %u chnls %u max qdepth: %i\n",
+			dev->identity.ver_id,
+			dev->identity.nvm_type,
+			dev->identity.nchannels,
+			max_qdepth);
+
+	ret = nvm_validate_features(dev);
+	if (ret) {
+		pr_err("nvm: disk features are not supported.");
+		goto err;
+	}
+
+	ret = nvm_validate_responsibility(dev);
+	if (ret) {
+		pr_err("nvm: disk responsibilities are not supported.");
+		goto err;
+	}
+
+	ret = nvm_core_init(dev, max_qdepth);
+	if (ret) {
+		pr_err("nvm: could not initialize core structures.\n");
+		goto err;
+	}
+
+	ret = nvm_luns_init(dev);
+	if (ret) {
+		pr_err("nvm: could not initialize luns\n");
+		goto err;
+	}
+
+	if (!dev->nr_luns) {
+		pr_err("nvm: device did not expose any luns.\n");
+		ret = -EINVAL;
+		goto err;
+	}
+
+	ret = nvm_blocks_init(dev);
+	if (ret) {
+		pr_err("nvm: could not initialize blocks\n");
+		goto err;
+	}
+
+	pr_info("nvm: allocating %lu physical pages (%lu KB)\n",
+		dev->total_pages, dev->total_pages * dev->sector_size / 1024);
+	pr_info("nvm: luns: %u\n", dev->nr_luns);
+	pr_info("nvm: blocks: %lu\n", dev->total_blocks);
+	pr_info("nvm: target sector size=%d\n", dev->sector_size);
+
+	return 0;
+err:
+	nvm_free(dev);
+	pr_err("nvm: failed to initialize nvm\n");
+	return ret;
+}
+
+void nvm_exit(struct nvm_dev *dev)
+{
+	nvm_free(dev);
+
+	pr_info("nvm: successfully unloaded\n");
+}
+
+int blk_nvm_register(struct request_queue *q, struct nvm_dev_ops *ops)
+{
+	struct nvm_dev *dev;
+	int ret;
+
+	if (!ops->identify || !ops->get_features)
+		return -EINVAL;
+
+	/* does not yet support multi-page IOs. */
+	blk_queue_max_hw_sectors(q, queue_logical_block_size(q) >> 9);
+
+	dev = kzalloc(sizeof(struct nvm_dev), GFP_KERNEL);
+	if (!dev)
+		return -ENOMEM;
+
+	dev->q = q;
+	dev->ops = ops;
+
+	ret = nvm_init(dev);
+	if (ret)
+		goto err_init;
+
+	q->nvm = dev;
+
+	return 0;
+err_init:
+	kfree(dev);
+	return ret;
+}
+EXPORT_SYMBOL(blk_nvm_register);
+
+void blk_nvm_unregister(struct request_queue *q)
+{
+	if (!blk_queue_nvm(q))
+		return;
+
+	nvm_exit(q->nvm);
+}
+
+static int nvm_ioctl(struct block_device *bdev, fmode_t mode, unsigned int cmd,
+							unsigned long arg)
+{
+	return 0;
+}
+
+static int nvm_open(struct block_device *bdev, fmode_t mode)
+{
+	return 0;
+}
+
+static void nvm_release(struct gendisk *disk, fmode_t mode)
+{
+}
+
+static const struct block_device_operations nvm_fops = {
+	.owner		= THIS_MODULE,
+	.ioctl		= nvm_ioctl,
+	.open		= nvm_open,
+	.release	= nvm_release,
+};
+
+static int nvm_create_target(struct gendisk *qdisk, char *ttname, char *tname,
+						int lun_begin, int lun_end)
+{
+	struct request_queue *qqueue = qdisk->queue;
+	struct nvm_dev *qnvm = qqueue->nvm;
+	struct request_queue *tqueue;
+	struct gendisk *tdisk;
+	struct nvm_target_type *tt;
+	struct nvm_target *t;
+	void *targetdata;
+
+	tt = nvm_find_target_type(ttname);
+	if (!tt) {
+		pr_err("nvm: target type %s not found\n", ttname);
+		return -EINVAL;
+	}
+
+	down_write(&_lock);
+	list_for_each_entry(t, &qnvm->online_targets, list) {
+		if (!strcmp(tname, t->disk->disk_name)) {
+			pr_err("nvm: target name already exists.\n");
+			up_write(&_lock);
+			return -EINVAL;
+		}
+	}
+	up_write(&_lock);
+
+	t = kmalloc(sizeof(struct nvm_target), GFP_KERNEL);
+	if (!t)
+		return -ENOMEM;
+
+	tqueue = blk_alloc_queue_node(GFP_KERNEL, qqueue->node);
+	if (!tqueue)
+		goto err_t;
+	blk_queue_make_request(tqueue, tt->make_rq);
+
+	tdisk = alloc_disk(0);
+	if (!tdisk)
+		goto err_queue;
+
+	sprintf(tdisk->disk_name, "%s", tname);
+	tdisk->flags = GENHD_FL_EXT_DEVT;
+	tdisk->major = 0;
+	tdisk->first_minor = 0;
+	tdisk->fops = &nvm_fops;
+	tdisk->queue = tqueue;
+
+	targetdata = tt->init(qqueue, tqueue, qdisk, tdisk, lun_begin, lun_end);
+	if (IS_ERR(targetdata))
+		goto err_init;
+
+	tdisk->private_data = targetdata;
+	tqueue->queuedata = targetdata;
+
+	blk_queue_prep_rq(qqueue, tt->prep_rq);
+	blk_queue_unprep_rq(qqueue, tt->unprep_rq);
+
+	set_capacity(tdisk, tt->capacity(targetdata));
+	add_disk(tdisk);
+
+	t->type = tt;
+	t->disk = tdisk;
+
+	down_write(&_lock);
+	list_add_tail(&t->list, &qnvm->online_targets);
+	up_write(&_lock);
+
+	return 0;
+err_init:
+	put_disk(tdisk);
+err_queue:
+	blk_cleanup_queue(tqueue);
+err_t:
+	kfree(t);
+	return -ENOMEM;
+}
+
+/* _lock must be taken */
+static void nvm_remove_target(struct nvm_target *t)
+{
+	struct nvm_target_type *tt = t->type;
+	struct gendisk *tdisk = t->disk;
+	struct request_queue *q = tdisk->queue;
+
+	del_gendisk(tdisk);
+	if (tt->exit)
+		tt->exit(tdisk->private_data);
+	blk_cleanup_queue(q);
+
+	put_disk(tdisk);
+
+	list_del(&t->list);
+	kfree(t);
+}
+
+static ssize_t free_blocks_show(struct device *d, struct device_attribute *attr,
+		char *page)
+{
+	struct gendisk *disk = dev_to_disk(d);
+	struct nvm_dev *dev = disk->queue->nvm;
+
+	char *page_start = page;
+	struct nvm_lun *lun;
+	unsigned int i;
+
+	nvm_for_each_lun(dev, lun, i)
+		page += sprintf(page, "%8u\t%u\n", i, lun->nr_free_blocks);
+
+	return page - page_start;
+}
+
+DEVICE_ATTR_RO(free_blocks);
+
+static ssize_t configure_store(struct device *d, struct device_attribute *attr,
+						const char *buf, size_t cnt)
+{
+	struct gendisk *disk = dev_to_disk(d);
+	struct nvm_dev *dev = disk->queue->nvm;
+	char name[255], ttname[255];
+	int lun_begin, lun_end, ret;
+
+	if (cnt >= 255)
+		return -EINVAL;
+
+	ret = sscanf(buf, "%s %s %u:%u", name, ttname, &lun_begin, &lun_end);
+	if (ret != 4) {
+		pr_err("nvm: configure must be in the format of \"name targetname lun_begin:lun_end\".\n");
+		return -EINVAL;
+	}
+
+	if (lun_begin > lun_end || lun_end > dev->nr_luns) {
+		pr_err("nvm: lun out of bound (%u:%u > %u)\n",
+					lun_begin, lun_end, dev->nr_luns);
+		return -EINVAL;
+	}
+
+	ret = nvm_create_target(disk, name, ttname, lun_begin, lun_end);
+	if (ret)
+		pr_err("nvm: configure disk failed\n");
+
+	return cnt;
+}
+DEVICE_ATTR_WO(configure);
+
+static ssize_t remove_store(struct device *d, struct device_attribute *attr,
+						const char *buf, size_t cnt)
+{
+	struct gendisk *disk = dev_to_disk(d);
+	struct nvm_dev *dev = disk->queue->nvm;
+	struct nvm_target *t = NULL;
+	char tname[255];
+	int ret;
+
+	if (cnt >= 255)
+		return -EINVAL;
+
+	ret = sscanf(buf, "%s", tname);
+	if (ret != 1) {
+		pr_err("nvm: remove use the following format \"targetname\".\n");
+		return -EINVAL;
+	}
+
+	down_write(&_lock);
+	list_for_each_entry(t, &dev->online_targets, list) {
+		if (!strcmp(tname, t->disk->disk_name)) {
+			nvm_remove_target(t);
+			ret = 0;
+			break;
+		}
+	}
+	up_write(&_lock);
+
+	if (ret)
+		pr_err("nvm: target \"%s\" doesn't exist.\n", tname);
+
+	return cnt;
+}
+
+DEVICE_ATTR_WO(remove);
+
+static struct attribute *nvm_attrs[] = {
+	&dev_attr_free_blocks.attr,
+	&dev_attr_configure.attr,
+	&dev_attr_remove.attr,
+	NULL,
+};
+
+static struct attribute_group nvm_attribute_group = {
+	.name = "nvm",
+	.attrs = nvm_attrs,
+};
+
+int blk_nvm_init_sysfs(struct device *dev)
+{
+	int ret;
+
+	ret = sysfs_create_group(&dev->kobj, &nvm_attribute_group);
+	if (ret)
+		return ret;
+
+	kobject_uevent(&dev->kobj, KOBJ_CHANGE);
+
+	return 0;
+}
+
+void blk_nvm_remove_sysfs(struct device *dev)
+{
+	sysfs_remove_group(&dev->kobj, &nvm_attribute_group);
+}
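
To make the get/put contract above concrete, here is a minimal sketch of
how a target might consume it. The per-lun "current block" policy, the
example_* names and the error handling are assumptions for illustration;
only blk_nvm_get_blk(), blk_nvm_alloc_addr(), blk_nvm_erase_blk() and
blk_nvm_put_blk() come from this file.

/* Illustrative sketch, not part of the patch. */
static sector_t example_alloc_page(struct nvm_lun *lun,
				   struct nvm_block **cur, int is_gc)
{
	sector_t paddr;

	if (*cur) {
		paddr = blk_nvm_alloc_addr(*cur);
		if (paddr != ADDR_EMPTY)
			return paddr;
	}

	/* Current block exhausted (or none yet): claim a fresh one. */
	*cur = blk_nvm_get_blk(lun, is_gc);
	if (!*cur)
		return ADDR_EMPTY;

	return blk_nvm_alloc_addr(*cur);
}

/* Once GC has moved all valid pages off a block, the target erases it
 * and hands it back to the lun's free list. */
static void example_reclaim(struct nvm_dev *dev, struct nvm_block *blk)
{
	blk_nvm_erase_blk(dev, blk);
	blk_nvm_put_blk(blk);
}
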
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index faaf36a..ad8cf2f 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -568,6 +568,12 @@ int blk_register_queue(struct gendisk *disk)
 	if (ret)
 		return ret;
 
+	if (blk_queue_nvm(q)) {
+		ret = blk_nvm_init_sysfs(dev);
+		if (ret)
+			return ret;
+	}
+
 	ret = kobject_add(&q->kobj, kobject_get(&dev->kobj), "%s", "queue");
 	if (ret < 0) {
 		blk_trace_remove_sysfs(dev);
@@ -601,6 +607,11 @@ void blk_unregister_queue(struct gendisk *disk)
 	if (WARN_ON(!q))
 		return;
 
+	if (blk_queue_nvm(q)) {
+		blk_nvm_unregister(q);
+		blk_nvm_remove_sysfs(disk_to_dev(disk));
+	}
+
 	if (q->mq_ops)
 		blk_mq_unregister_disk(disk);
 
diff --git a/block/blk.h b/block/blk.h
index 43b0361..3e4abee 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -281,4 +281,22 @@ static inline int blk_throtl_init(struct request_queue *q) { return 0; }
 static inline void blk_throtl_exit(struct request_queue *q) { }
 #endif /* CONFIG_BLK_DEV_THROTTLING */
 
+#ifdef CONFIG_BLK_DEV_NVM
+struct nvm_target {
+	struct list_head list;
+	struct nvm_target_type *type;
+	struct gendisk *disk;
+};
+
+struct nvm_dev_ops;
+
+extern void blk_nvm_unregister(struct request_queue *);
+extern int blk_nvm_init_sysfs(struct device *);
+extern void blk_nvm_remove_sysfs(struct device *);
+#else
+static inline void blk_nvm_unregister(struct request_queue *q) { }
+static inline int blk_nvm_init_sysfs(struct device *dev) { return 0; }
+static inline void blk_nvm_remove_sysfs(struct device *dev) { }
+#endif /* CONFIG_BLK_DEV_NVM */
+
 #endif /* BLK_INTERNAL_H */
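
With the sysfs group from blk-nvm.c registered through blk_register_queue()
above, targets are meant to be created from user space. Below is a
hypothetical user-space sketch; the device name, the resulting sysfs path
and the "rrpc" target type (registered by a later patch in this series) are
assumptions. The token order follows configure_store(), which hands the
first word to nvm_find_target_type() and uses the second as the new disk
name.

/* Illustrative user-space helper, not part of the patch. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	const char *path = "/sys/block/nvme0n1/nvm/configure";
	const char *cmd = "rrpc myssd 0:0";	/* type, name, lun range */
	int fd = open(path, O_WRONLY);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	if (write(fd, cmd, strlen(cmd)) < 0)
		perror("write");
	close(fd);
	return 0;
}
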
diff --git a/include/linux/bio.h b/include/linux/bio.h
index da3a127..ace0b23 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -354,6 +354,15 @@ static inline void bip_set_seed(struct bio_integrity_payload *bip,
 
 #endif /* CONFIG_BLK_DEV_INTEGRITY */
 
+#if defined(CONFIG_BLK_DEV_NVM)
+
+/* bio open-channel ssd payload */
+struct bio_nvm_payload {
+	void *private;
+};
+
+#endif /* CONFIG_BLK_DEV_NVM */
+
 extern void bio_trim(struct bio *bio, int offset, int size);
 extern struct bio *bio_split(struct bio *bio, int sectors,
 			     gfp_t gfp, struct bio_set *bs);
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index d7b39af..75e1497 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -140,13 +140,15 @@ enum {
 	BLK_MQ_RQ_QUEUE_OK	= 0,	/* queued fine */
 	BLK_MQ_RQ_QUEUE_BUSY	= 1,	/* requeue IO for later */
 	BLK_MQ_RQ_QUEUE_ERROR	= 2,	/* end IO with error */
-	BLK_MQ_RQ_QUEUE_DONE	= 3,	/* IO is already handled */
+	BLK_MQ_RQ_QUEUE_DONE	= 3,	/* IO handled by prep */
 
 	BLK_MQ_F_SHOULD_MERGE	= 1 << 0,
 	BLK_MQ_F_TAG_SHARED	= 1 << 1,
 	BLK_MQ_F_SG_MERGE	= 1 << 2,
 	BLK_MQ_F_SYSFS_UP	= 1 << 3,
 	BLK_MQ_F_DEFER_ISSUE	= 1 << 4,
+	BLK_MQ_F_NVM		= 1 << 5,
+
 	BLK_MQ_F_ALLOC_POLICY_START_BIT = 8,
 	BLK_MQ_F_ALLOC_POLICY_BITS = 1,
 
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index a1b25e3..a619844 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -83,7 +83,10 @@ struct bio {
 		struct bio_integrity_payload *bi_integrity; /* data integrity */
 #endif
 	};
-
+#if defined(CONFIG_BLK_DEV_NVM)
+	struct bio_nvm_payload *bi_nvm;	/* open-channel ssd
+					 * support */
+#endif
 	unsigned short		bi_vcnt;	/* how many bio_vec's */
 
 	/*
@@ -193,6 +196,8 @@ enum rq_flag_bits {
 	__REQ_HASHED,		/* on IO scheduler merge hash */
 	__REQ_MQ_INFLIGHT,	/* track inflight for MQ */
 	__REQ_NO_TIMEOUT,	/* requests may never expire */
+	__REQ_NVM_MAPPED,	/* NVM mapped this request */
+	__REQ_NVM_NO_INFLIGHT,	/* request should not use inflight protection */
 	__REQ_NR_BITS,		/* stops here */
 };
 
@@ -213,7 +218,7 @@ enum rq_flag_bits {
 #define REQ_COMMON_MASK \
 	(REQ_WRITE | REQ_FAILFAST_MASK | REQ_SYNC | REQ_META | REQ_PRIO | \
 	 REQ_DISCARD | REQ_WRITE_SAME | REQ_NOIDLE | REQ_FLUSH | REQ_FUA | \
-	 REQ_SECURE | REQ_INTEGRITY)
+	 REQ_SECURE | REQ_INTEGRITY | REQ_NVM_NO_INFLIGHT)
 #define REQ_CLONE_MASK		REQ_COMMON_MASK
 
 #define BIO_NO_ADVANCE_ITER_MASK	(REQ_DISCARD|REQ_WRITE_SAME)
@@ -247,5 +252,6 @@ enum rq_flag_bits {
 #define REQ_HASHED		(1ULL << __REQ_HASHED)
 #define REQ_MQ_INFLIGHT		(1ULL << __REQ_MQ_INFLIGHT)
 #define REQ_NO_TIMEOUT		(1ULL << __REQ_NO_TIMEOUT)
-
+#define REQ_NVM_MAPPED		(1ULL << __REQ_NVM_MAPPED)
+#define REQ_NVM_NO_INFLIGHT	(1ULL << __REQ_NVM_NO_INFLIGHT)
 #endif /* __LINUX_BLK_TYPES_H */
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 7f9a516..d416fd5 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -209,6 +209,9 @@ struct request {
 
 	/* for bidi */
 	struct request *next_rq;
+#ifdef CONFIG_BLK_DEV_NVM
+	sector_t phys_sector;
+#endif
 };
 
 static inline unsigned short req_get_ioprio(struct request *req)
@@ -309,6 +312,10 @@ struct queue_limits {
 	unsigned char		raid_partial_stripes_expensive;
 };
 
+#ifdef CONFIG_BLK_DEV_NVM
+struct nvm_dev;
+#endif
+
 struct request_queue {
 	/*
 	 * Together with queue_head for cacheline sharing
@@ -455,6 +462,9 @@ struct request_queue {
 #ifdef CONFIG_BLK_DEV_IO_TRACE
 	struct blk_trace	*blk_trace;
 #endif
+#ifdef CONFIG_BLK_DEV_NVM
+	struct nvm_dev *nvm;
+#endif
 	/*
 	 * for flush operations
 	 */
@@ -513,6 +523,7 @@ struct request_queue {
 #define QUEUE_FLAG_INIT_DONE   20	/* queue is initialized */
 #define QUEUE_FLAG_NO_SG_MERGE 21	/* don't attempt to merge SG segments*/
 #define QUEUE_FLAG_SG_GAPS     22	/* queue doesn't support SG gaps */
+#define QUEUE_FLAG_NVM         23	/* open-channel SSD managed queue */
 
 #define QUEUE_FLAG_DEFAULT	((1 << QUEUE_FLAG_IO_STAT) |		\
 				 (1 << QUEUE_FLAG_STACKABLE)	|	\
@@ -601,6 +612,7 @@ static inline void queue_flag_clear(unsigned int flag, struct request_queue *q)
 #define blk_queue_discard(q)	test_bit(QUEUE_FLAG_DISCARD, &(q)->queue_flags)
 #define blk_queue_secdiscard(q)	(blk_queue_discard(q) && \
 	test_bit(QUEUE_FLAG_SECDISCARD, &(q)->queue_flags))
+#define blk_queue_nvm(q)	test_bit(QUEUE_FLAG_NVM, &(q)->queue_flags)
 
 #define blk_noretry_request(rq) \
 	((rq)->cmd_flags & (REQ_FAILFAST_DEV|REQ_FAILFAST_TRANSPORT| \
@@ -822,6 +834,7 @@ extern int sg_scsi_ioctl(struct request_queue *, struct gendisk *, fmode_t,
 			 struct scsi_ioctl_command __user *);
 
 extern void blk_queue_bio(struct request_queue *q, struct bio *bio);
+extern void blk_init_request_from_bio(struct request *req, struct bio *bio);
 
 /*
  * A queue has just exitted congestion.  Note this in the global counter of
@@ -902,6 +915,11 @@ static inline unsigned int blk_rq_cur_sectors(const struct request *rq)
 	return blk_rq_cur_bytes(rq) >> 9;
 }
 
+static inline sector_t blk_rq_phys_pos(const struct request *rq)
+{
+#ifdef CONFIG_BLK_DEV_NVM
+	return rq->phys_sector;
+#else
+	return 0;
+#endif
+}
+
 static inline unsigned int blk_queue_get_max_sectors(struct request_queue *q,
 						     unsigned int cmd_flags)
 {
@@ -1504,6 +1522,8 @@ extern bool blk_integrity_merge_bio(struct request_queue *, struct request *,
 static inline
 struct blk_integrity *bdev_get_integrity(struct block_device *bdev)
 {
+	if (unlikely(!bdev))
+		return NULL;
 	return bdev->bd_disk->integrity;
 }
 
@@ -1598,6 +1618,204 @@ static inline bool blk_integrity_is_initialized(struct gendisk *g)
 
 #endif /* CONFIG_BLK_DEV_INTEGRITY */
 
+#ifdef CONFIG_BLK_DEV_NVM
+
+#include <uapi/linux/nvm.h>
+
+typedef int (nvm_l2p_update_fn)(u64, u64, u64 *, void *);
+typedef int (nvm_id_fn)(struct request_queue *, struct nvm_id *);
+typedef int (nvm_get_features_fn)(struct request_queue *,
+				  struct nvm_get_features *);
+typedef int (nvm_set_rsp_fn)(struct request_queue *, u64);
+typedef int (nvm_get_l2p_tbl_fn)(struct request_queue *, u64, u64,
+				 nvm_l2p_update_fn *, void *);
+typedef int (nvm_erase_blk_fn)(struct request_queue *, sector_t);
+
+struct nvm_dev_ops {
+	nvm_id_fn		*identify;
+	nvm_get_features_fn	*get_features;
+	nvm_set_rsp_fn		*set_responsibility;
+	nvm_get_l2p_tbl_fn	*get_l2p_tbl;
+
+	nvm_erase_blk_fn	*erase_block;
+};
+
+struct nvm_blocks;
+
+/*
+ * We assume that the device exposes its channels as a linear address
+ * space. A lun therefore has a physical address range that denotes its
+ * start and end. This abstraction is used to let the
+ * open-channel SSD (or any other device) expose its read/write/erase
+ * interface and be administered by the host system.
+ */
+struct nvm_lun {
+	struct nvm_dev *dev;
+
+	/* lun block lists */
+	struct list_head used_list;	/* In-use blocks */
+	struct list_head free_list;	/* Unused blocks, i.e. released
+					 * and ready for use */
+
+	struct {
+		spinlock_t lock;
+	} ____cacheline_aligned_in_smp;
+
+	struct nvm_block *blocks;
+	struct nvm_id_chnl *chnl;
+
+	int id;
+	int reserved_blocks;
+
+	unsigned int nr_blocks;		/* end_block - start_block. */
+	unsigned int nr_free_blocks;	/* Number of unused blocks */
+
+	int nr_pages_per_blk;
+};
+
+struct nvm_block {
+	/* Management structures */
+	struct list_head list;
+	struct nvm_lun *lun;
+
+	spinlock_t lock;
+
+#define MAX_INVALID_PAGES_STORAGE 8
+	/* Bitmap for invalid page entries */
+	unsigned long invalid_pages[MAX_INVALID_PAGES_STORAGE];
+	/* points to the next writable page within a block */
+	unsigned int next_page;
+	/* number of pages that are invalid, wrt host page size */
+	unsigned int nr_invalid_pages;
+
+	unsigned int id;
+	int type;
+	/* Persistent data structures */
+	atomic_t data_cmnt_size; /* data pages committed to stable storage */
+};
+
+struct nvm_dev {
+	struct nvm_dev_ops *ops;
+	struct request_queue *q;
+
+	struct nvm_id identity;
+
+	struct list_head online_targets;
+
+	/* Open-channel SSD stores extra data after the private driver data */
+	unsigned int drv_cmd_size;
+
+	int nr_luns;
+	struct nvm_lun *luns;
+
+	/*int nr_blks_per_lun;
+	int nr_pages_per_blk;*/
+	/* Calculated/Cached values. These do not reflect the actual usable
+	 * blocks at run-time. */
+	unsigned long total_pages;
+	unsigned long total_blocks;
+
+	uint32_t sector_size;
+};
+
+/* Logical to physical mapping */
+struct nvm_addr {
+	sector_t addr;
+	struct nvm_block *block;
+};
+
+/* Physical to logical mapping */
+struct nvm_rev_addr {
+	sector_t addr;
+};
+
+struct rrpc_inflight_rq {
+	struct list_head list;
+	sector_t l_start;
+	sector_t l_end;
+};
+
+struct nvm_per_rq {
+	struct rrpc_inflight_rq inflight_rq;
+	struct nvm_addr *addr;
+	unsigned int flags;
+};
+
+typedef void (nvm_tgt_make_rq)(struct request_queue *, struct bio *);
+typedef int (nvm_tgt_prep_rq)(struct request_queue *, struct request *);
+typedef void (nvm_tgt_unprep_rq)(struct request_queue *, struct request *);
+typedef sector_t (nvm_tgt_capacity)(void *);
+typedef void *(nvm_tgt_init_fn)(struct request_queue *, struct request_queue *,
+				struct gendisk *, struct gendisk *, int, int);
+typedef void (nvm_tgt_exit_fn)(void *);
+
+struct nvm_target_type {
+	const char *name;
+	unsigned int version[3];
+
+	/* target entry points */
+	nvm_tgt_make_rq *make_rq;
+	nvm_tgt_prep_rq *prep_rq;
+	nvm_tgt_unprep_rq *unprep_rq;
+	nvm_tgt_capacity *capacity;
+
+	/* module-specific init/teardown */
+	nvm_tgt_init_fn *init;
+	nvm_tgt_exit_fn *exit;
+
+	/* For open-channel SSD internal use */
+	struct list_head list;
+};
+
+extern struct nvm_target_type *nvm_find_target_type(const char *name);
+extern int nvm_register_target(struct nvm_target_type *tt);
+extern void nvm_unregister_target(struct nvm_target_type *tt);
+extern int blk_nvm_register(struct request_queue *,
+						struct nvm_dev_ops *);
+extern struct nvm_block *blk_nvm_get_blk(struct nvm_lun *, int);
+extern void blk_nvm_put_blk(struct nvm_block *block);
+extern int blk_nvm_erase_blk(struct nvm_dev *, struct nvm_block *);
+extern sector_t blk_nvm_alloc_addr(struct nvm_block *);
+static inline struct nvm_dev *blk_nvm_get_dev(struct request_queue *q)
+{
+	return q->nvm;
+}
+#else
+struct nvm_dev_ops;
+struct nvm_lun;
+struct nvm_block;
+struct nvm_target_type;
+
+static inline struct nvm_target_type *nvm_find_target_type(const char *name)
+{
+	return NULL;
+}
+static inline int nvm_register_target(struct nvm_target_type *tt) { return -EINVAL; }
+static inline void nvm_unregister_target(struct nvm_target_type *tt) {}
+static inline int blk_nvm_register(struct request_queue *q,
+						struct nvm_dev_ops *ops)
+{
+	return -EINVAL;
+}
+static inline struct nvm_block *blk_nvm_get_blk(struct nvm_lun *lun, int is_gc)
+{
+	return NULL;
+}
+static inline void blk_nvm_put_blk(struct nvm_block *block) {}
+static inline int blk_nvm_erase_blk(struct nvm_dev *dev, struct nvm_block *block)
+{
+	return -EINVAL;
+}
+static inline struct nvm_dev *blk_nvm_get_dev(struct request_queue *q)
+{
+	return NULL;
+}
+static inline sector_t blk_nvm_alloc_addr(struct nvm_block *block)
+{
+	return 0;
+}
+#endif /* CONFIG_BLK_DEV_NVM */
+
 struct block_device_operations {
 	int (*open) (struct block_device *, fmode_t);
 	void (*release) (struct gendisk *, fmode_t);
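
As a reading aid for the hooks declared above, this is roughly what the
driver side could look like. The example_* callbacks and their bodies are
placeholder assumptions; only the nvm_dev_ops layout and the
blk_nvm_register() entry point are defined by this patch.

/* Illustrative sketch, not part of the patch. */
static int example_identify(struct request_queue *q, struct nvm_id *id)
{
	/* A real driver would fill *id (and id->chnls) from the device. */
	return -EOPNOTSUPP;
}

static int example_get_features(struct request_queue *q,
				struct nvm_get_features *gf)
{
	gf->rsp = 0;	/* default responsibilities only */
	gf->ext = 0;
	return 0;
}

static int example_erase_block(struct request_queue *q, sector_t block_id)
{
	/* A real driver would issue a per-block erase command here. */
	return 0;
}

static struct nvm_dev_ops example_nvm_ops = {
	.identify	= example_identify,
	.get_features	= example_get_features,
	.erase_block	= example_erase_block,
};

static int example_attach(struct request_queue *q)
{
	/* Called by the driver once its request queue is set up. */
	return blk_nvm_register(q, &example_nvm_ops);
}
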
diff --git a/include/linux/lightnvm.h b/include/linux/lightnvm.h
new file mode 100644
index 0000000..888d994
--- /dev/null
+++ b/include/linux/lightnvm.h
@@ -0,0 +1,56 @@
+#ifndef NVM_H
+#define NVM_H
+
+#include <linux/blkdev.h>
+#include <linux/types.h>
+
+#define nvm_for_each_lun(dev, lun, i) \
+		for ((i) = 0, lun = &(dev)->luns[0]; \
+			(i) < (dev)->nr_luns; (i)++, lun = &(dev)->luns[(i)])
+
+#define lun_for_each_block(p, b, i) \
+		for ((i) = 0, b = &(p)->blocks[0]; \
+			(i) < (p)->nr_blocks; (i)++, b = &(p)->blocks[(i)])
+
+#define block_for_each_page(b, p) \
+		for ((p)->addr = block_to_addr((b)), (p)->block = (b); \
+			(p)->addr < block_to_addr((b)) \
+				+ (b)->lun->nr_pages_per_blk; \
+			(p)->addr++)
+
+/* We currently assume that the lightnvm device accepts data in 512 byte
+ * chunks. This should be set to the smallest command size available for a
+ * given device.
+ */
+#define NVM_SECTOR 512
+#define EXPOSED_PAGE_SIZE 4096
+
+#define NR_PHY_IN_LOG (EXPOSED_PAGE_SIZE / NVM_SECTOR)
+
+#define NVM_MSG_PREFIX "nvm"
+#define ADDR_EMPTY (~0ULL)
+#define LTOP_POISON 0xD3ADB33F
+
+/* core.c */
+
+static inline int block_is_full(struct nvm_block *block)
+{
+	struct nvm_lun *lun = block->lun;
+
+	return block->next_page == lun->nr_pages_per_blk;
+}
+
+static inline sector_t block_to_addr(struct nvm_block *block)
+{
+	struct nvm_lun *lun = block->lun;
+
+	return block->id * lun->nr_pages_per_blk;
+}
+
+static inline struct nvm_lun *paddr_to_lun(struct nvm_dev *dev,
+							sector_t p_addr)
+{
+	return &dev->luns[p_addr / (dev->total_pages / dev->nr_luns)];
+}
+
+#endif
diff --git a/include/uapi/linux/nvm.h b/include/uapi/linux/nvm.h
new file mode 100644
index 0000000..fb95cf5
--- /dev/null
+++ b/include/uapi/linux/nvm.h
@@ -0,0 +1,70 @@
+/*
+ * Definitions for the LightNVM interface
+ * Copyright (c) 2015, IT University of Copenhagen
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+#ifndef _UAPI_LINUX_LIGHTNVM_H
+#define _UAPI_LINUX_LIGHTNVM_H
+
+#include <linux/types.h>
+
+enum {
+	/* HW Responsibilities */
+	NVM_RSP_L2P	= 0x00,
+	NVM_RSP_GC	= 0x01,
+	NVM_RSP_ECC	= 0x02,
+
+	/* Physical NVM Type */
+	NVM_NVMT_BLK	= 0,
+	NVM_NVMT_BYTE	= 1,
+
+	/* Internal IO Scheduling algorithm */
+	NVM_IOSCHED_CHANNEL	= 0,
+	NVM_IOSCHED_CHIP	= 1,
+
+	/* Status codes */
+	NVM_SUCCESS		= 0,
+	NVM_RSP_NOT_CHANGEABLE	= 1,
+};
+
+struct nvm_id_chnl {
+	u64	laddr_begin;
+	u64	laddr_end;
+	u32	oob_size;
+	u32	queue_size;
+	u32	gran_read;
+	u32	gran_write;
+	u32	gran_erase;
+	u32	t_r;
+	u32	t_sqr;
+	u32	t_w;
+	u32	t_sqw;
+	u32	t_e;
+	u16	chnl_parallelism;
+	u8	io_sched;
+	u8	res[133];
+};
+
+struct nvm_id {
+	u8	ver_id;
+	u8	nvm_type;
+	u16	nchannels;
+	struct nvm_id_chnl *chnls;
+};
+
+struct nvm_get_features {
+	u64	rsp;
+	u64	ext;
+};
+
+#endif /* _UAPI_LINUX_LIGHTNVM_H */
+
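
The granularity fields above feed the lun geometry computed by
nvm_luns_init() earlier in the patch. A small worked example may help; the
concrete numbers, the helper itself and the assumption that
laddr_begin/laddr_end are expressed in gran_read units are illustrative
only.

/* Illustrative sketch, not part of the patch. With assumed gran_read =
 * gran_write = 4096, gran_erase = 1048576 (256 write granules), a 4096
 * byte exposed sector and a channel spanning 1048576 logical addresses,
 * this reports 4096 blocks of 256 pages each. */
static void example_geometry(struct nvm_id_chnl *chnl, u32 sector_size)
{
	u32 pages_per_blk = chnl->gran_erase / chnl->gran_write *
			    (chnl->gran_write / sector_size);
	u64 nr_blocks = (chnl->laddr_end - chnl->laddr_begin + 1) /
			(chnl->gran_erase / chnl->gran_read);

	pr_info("nvm: example geometry: %llu blocks of %u pages\n",
		nr_blocks, pages_per_blk);
}
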
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 2/5 v2] blk-mq: Support for Open-Channel SSDs
@ 2015-04-15 12:34   ` Matias Bjørling
  0 siblings, 0 replies; 53+ messages in thread
From: Matias Bjørling @ 2015-04-15 12:34 UTC (permalink / raw)


Open-channel SSDs are devices that share responsibilities with the host
in order to implement and maintain features that typical SSDs keep
strictly in firmware. These include (i) the Flash Translation Layer
(FTL), (ii) bad block management, and (iii) hardware units such as the
flash controller, the interface controller, and large amounts of flash
chips. In this way, Open-channels SSDs exposes direct access to their
physical flash storage, while keeping a subset of the internal features
of SSDs.

LightNVM is a specification that gives support to Open-channel SSDs
LightNVM allows the host to manage data placement, garbage collection,
and parallelism. Device specific responsibilities such as bad block
management, FTL extensions to support atomic IOs, or metadata
persistence are still handled by the device.

The implementation of LightNVM consists of two parts: core and
(multiple) targets. The core implements functionality shared across
targets. This is initialization, teardown and statistics. The targets
implement the interface that exposes physical flash to user-space
applications. Examples of such targets include key-value store,
object-store, as well as traditional block devices, which can be
application-specific.

Contributions in this patch from:

  Javier Gonzalez <javier at paletta.io>
  Jesper Madsen <jmad at itu.dk>

Signed-off-by: Matias Bj?rling <m at bjorling.me>
---
 block/Kconfig             |  12 +
 block/Makefile            |   2 +-
 block/blk-mq.c            |  12 +-
 block/blk-nvm.c           | 722 ++++++++++++++++++++++++++++++++++++++++++++++
 block/blk-sysfs.c         |  11 +
 block/blk.h               |  18 ++
 include/linux/bio.h       |   9 +
 include/linux/blk-mq.h    |   4 +-
 include/linux/blk_types.h |  12 +-
 include/linux/blkdev.h    | 218 ++++++++++++++
 include/linux/lightnvm.h  |  56 ++++
 include/uapi/linux/nvm.h  |  70 +++++
 12 files changed, 1140 insertions(+), 6 deletions(-)
 create mode 100644 block/blk-nvm.c
 create mode 100644 include/linux/lightnvm.h
 create mode 100644 include/uapi/linux/nvm.h

diff --git a/block/Kconfig b/block/Kconfig
index 161491d..a3fca8f 100644
--- a/block/Kconfig
+++ b/block/Kconfig
@@ -88,6 +88,18 @@ config BLK_DEV_INTEGRITY
 	T10/SCSI Data Integrity Field or the T13/ATA External Path
 	Protection.  If in doubt, say N.
 
+config BLK_DEV_NVM
+	bool "Block layer Open-Channel SSD support"
+	depends on BLK_DEV
+	default y
+	---help---
+	  Say Y here to get to enable support for Open-channel SSDs.
+
+	  Open-Channel SSDs expose direct access to the underlying non-volatile
+	  memory.
+
+	  This option is required by Open-Channel SSD target drivers.
+
 config BLK_DEV_THROTTLING
 	bool "Block layer bio throttling support"
 	depends on BLK_CGROUP=y
diff --git a/block/Makefile b/block/Makefile
index 00ecc97..66a5826 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -22,4 +22,4 @@ obj-$(CONFIG_IOSCHED_CFQ)	+= cfq-iosched.o
 obj-$(CONFIG_BLOCK_COMPAT)	+= compat_ioctl.o
 obj-$(CONFIG_BLK_CMDLINE_PARSER)	+= cmdline-parser.o
 obj-$(CONFIG_BLK_DEV_INTEGRITY) += bio-integrity.o blk-integrity.o t10-pi.o
-
+obj-$(CONFIG_BLK_DEV_NVM)  += blk-nvm.o
diff --git a/block/blk-mq.c b/block/blk-mq.c
index f3dd028..58a8a71 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -221,6 +221,9 @@ static void blk_mq_rq_ctx_init(struct request_queue *q, struct blk_mq_ctx *ctx,
 	rq->end_io = NULL;
 	rq->end_io_data = NULL;
 	rq->next_rq = NULL;
+#if CONFIG_BLK_DEV_NVM
+	rq->phys_sector = 0;
+#endif
 
 	ctx->rq_dispatched[rw_is_sync(rw_flags)]++;
 }
@@ -1445,6 +1448,7 @@ static struct blk_mq_tags *blk_mq_init_rq_map(struct blk_mq_tag_set *set,
 	struct blk_mq_tags *tags;
 	unsigned int i, j, entries_per_page, max_order = 4;
 	size_t rq_size, left;
+	unsigned int cmd_size = set->cmd_size;
 
 	tags = blk_mq_init_tags(set->queue_depth, set->reserved_tags,
 				set->numa_node,
@@ -1462,11 +1466,14 @@ static struct blk_mq_tags *blk_mq_init_rq_map(struct blk_mq_tag_set *set,
 		return NULL;
 	}
 
+	if (set->flags & BLK_MQ_F_NVM)
+		cmd_size += sizeof(struct nvm_per_rq);
+
 	/*
 	 * rq_size is the size of the request plus driver payload, rounded
 	 * to the cacheline size
 	 */
-	rq_size = round_up(sizeof(struct request) + set->cmd_size,
+	rq_size = round_up(sizeof(struct request) + cmd_size,
 				cache_line_size());
 	left = rq_size * set->queue_depth;
 
@@ -1978,6 +1985,9 @@ struct request_queue *blk_mq_init_queue(struct blk_mq_tag_set *set)
 	if (!(set->flags & BLK_MQ_F_SG_MERGE))
 		q->queue_flags |= 1 << QUEUE_FLAG_NO_SG_MERGE;
 
+	if (set->flags & BLK_MQ_F_NVM)
+		q->queue_flags |= 1 << QUEUE_FLAG_NVM;
+
 	q->sg_reserved_size = INT_MAX;
 
 	INIT_WORK(&q->requeue_work, blk_mq_requeue_work);
diff --git a/block/blk-nvm.c b/block/blk-nvm.c
new file mode 100644
index 0000000..722821c
--- /dev/null
+++ b/block/blk-nvm.c
@@ -0,0 +1,722 @@
+/*
+ * blk-nvm.c - Block layer Open-channel SSD integration
+ *
+ * Copyright (C) 2015 IT University of Copenhagen
+ * Initial release: Matias Bjorling <mabj at itu.dk>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License version
+ * 2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; see the file COPYING.  If not, write to
+ * the Free Software Foundation, 675 Mass Ave, Cambridge, MA 02139,
+ * USA.
+ *
+ */
+
+#include <linux/blkdev.h>
+#include <linux/blk-mq.h>
+#include <linux/list.h>
+#include <linux/types.h>
+#include <linux/sem.h>
+#include <linux/bitmap.h>
+
+#include <linux/lightnvm.h>
+
+#include "blk.h"
+
+static LIST_HEAD(_targets);
+static DECLARE_RWSEM(_lock);
+
+struct nvm_target_type *nvm_find_target_type(const char *name)
+{
+	struct nvm_target_type *tt;
+
+	list_for_each_entry(tt, &_targets, list)
+		if (!strcmp(name, tt->name))
+			return tt;
+
+	return NULL;
+}
+
+int nvm_register_target(struct nvm_target_type *tt)
+{
+	int ret = 0;
+
+	down_write(&_lock);
+	if (nvm_find_target_type(tt->name))
+		ret = -EEXIST;
+	else
+		list_add(&tt->list, &_targets);
+	up_write(&_lock);
+
+	return ret;
+}
+
+void nvm_unregister_target(struct nvm_target_type *tt)
+{
+	if (!tt)
+		return;
+
+	down_write(&_lock);
+	list_del(&tt->list);
+	up_write(&_lock);
+}
+
+static void nvm_reset_block(struct nvm_lun *lun, struct nvm_block *block)
+{
+	spin_lock(&block->lock);
+	bitmap_zero(block->invalid_pages, lun->nr_pages_per_blk);
+	block->next_page = 0;
+	block->nr_invalid_pages = 0;
+	atomic_set(&block->data_cmnt_size, 0);
+	spin_unlock(&block->lock);
+}
+
+/* use blk_nvm_lun_[get/put]_block to administer the blocks in use for each lun.
+ * Whenever a block is in used by an append point, we store it within the
+ * used_list. We then move it back when its free to be used by another append
+ * point.
+ *
+ * The newly claimed block is always added to the back of used_list. As we
+ * assume that the start of used list is the oldest block, and therefore
+ * more likely to contain invalidated pages.
+ */
+struct nvm_block *blk_nvm_get_blk(struct nvm_lun *lun, int is_gc)
+{
+	struct nvm_block *block = NULL;
+
+	BUG_ON(!lun);
+
+	spin_lock(&lun->lock);
+
+	if (list_empty(&lun->free_list)) {
+		pr_err_ratelimited("nvm: lun %u have no free pages available",
+								lun->id);
+		spin_unlock(&lun->lock);
+		goto out;
+	}
+
+	while (!is_gc && lun->nr_free_blocks < lun->reserved_blocks) {
+		spin_unlock(&lun->lock);
+		goto out;
+	}
+
+	block = list_first_entry(&lun->free_list, struct nvm_block, list);
+	list_move_tail(&block->list, &lun->used_list);
+
+	lun->nr_free_blocks--;
+
+	spin_unlock(&lun->lock);
+
+	nvm_reset_block(lun, block);
+
+out:
+	return block;
+}
+EXPORT_SYMBOL(blk_nvm_get_blk);
+
+/* We assume that all valid pages have already been moved when added back to the
+ * free list. We add it last to allow round-robin use of all pages. Thereby
+ * provide simple (naive) wear-leveling.
+ */
+void blk_nvm_put_blk(struct nvm_block *block)
+{
+	struct nvm_lun *lun = block->lun;
+
+	spin_lock(&lun->lock);
+
+	list_move_tail(&block->list, &lun->free_list);
+	lun->nr_free_blocks++;
+
+	spin_unlock(&lun->lock);
+}
+EXPORT_SYMBOL(blk_nvm_put_blk);
+
+sector_t blk_nvm_alloc_addr(struct nvm_block *block)
+{
+	sector_t addr = ADDR_EMPTY;
+
+	spin_lock(&block->lock);
+	if (block_is_full(block))
+		goto out;
+
+	addr = block_to_addr(block) + block->next_page;
+
+	block->next_page++;
+out:
+	spin_unlock(&block->lock);
+	return addr;
+}
+EXPORT_SYMBOL(blk_nvm_alloc_addr);
+
+/* Send erase command to device */
+int blk_nvm_erase_blk(struct nvm_dev *dev, struct nvm_block *block)
+{
+	if (dev->ops->erase_block)
+		return dev->ops->erase_block(dev->q, block->id);
+
+	return 0;
+}
+EXPORT_SYMBOL(blk_nvm_erase_blk);
+
+static void nvm_blocks_free(struct nvm_dev *dev)
+{
+	struct nvm_lun *lun;
+	int i;
+
+	nvm_for_each_lun(dev, lun, i) {
+		if (!lun->blocks)
+			break;
+		vfree(lun->blocks);
+	}
+}
+
+static void nvm_luns_free(struct nvm_dev *dev)
+{
+	kfree(dev->luns);
+}
+
+static int nvm_luns_init(struct nvm_dev *dev)
+{
+	struct nvm_lun *lun;
+	struct nvm_id_chnl *chnl;
+	int i;
+
+	dev->luns = kcalloc(dev->nr_luns, sizeof(struct nvm_lun), GFP_KERNEL);
+	if (!dev->luns)
+		return -ENOMEM;
+
+	nvm_for_each_lun(dev, lun, i) {
+		chnl = &dev->identity.chnls[i];
+		pr_info("nvm: p %u qsize %u gr %u ge %u begin %llu end %llu\n",
+			i, chnl->queue_size, chnl->gran_read, chnl->gran_erase,
+			chnl->laddr_begin, chnl->laddr_end);
+
+		spin_lock_init(&lun->lock);
+
+		INIT_LIST_HEAD(&lun->free_list);
+		INIT_LIST_HEAD(&lun->used_list);
+
+		lun->id = i;
+		lun->dev = dev;
+		lun->chnl = chnl;
+		lun->reserved_blocks = 2; /* for GC only */
+		lun->nr_blocks =
+				(chnl->laddr_end - chnl->laddr_begin + 1) /
+				(chnl->gran_erase / chnl->gran_read);
+		lun->nr_free_blocks = lun->nr_blocks;
+		lun->nr_pages_per_blk = chnl->gran_erase / chnl->gran_write *
+					(chnl->gran_write / dev->sector_size);
+
+		dev->total_pages += lun->nr_blocks * lun->nr_pages_per_blk;
+		dev->total_blocks += lun->nr_blocks;
+
+		if (lun->nr_pages_per_blk >
+				MAX_INVALID_PAGES_STORAGE * BITS_PER_LONG) {
+			pr_err("nvm: number of pages per block too high.");
+			return -EINVAL;
+		}
+	}
+
+	return 0;
+}
+
+static int nvm_block_map(u64 slba, u64 nlb, u64 *entries, void *private)
+{
+	struct nvm_dev *dev = private;
+	sector_t max_pages = dev->total_pages * (dev->sector_size >> 9);
+	u64 elba = slba + nlb;
+	struct nvm_lun *lun;
+	struct nvm_block *blk;
+	sector_t total_pgs_per_lun = /* each lun have the same configuration */
+		   dev->luns[0].nr_blocks * dev->luns[0].nr_pages_per_blk;
+	u64 i;
+	int lun_id;
+
+	if (unlikely(elba > dev->total_pages)) {
+		pr_err("nvm: L2P data from device is out of bounds!\n");
+		return -EINVAL;
+	}
+
+	for (i = 0; i < nlb; i++) {
+		u64 pba = le64_to_cpu(entries[i]);
+
+		if (unlikely(pba >= max_pages && pba != U64_MAX)) {
+			pr_err("nvm: L2P data entry is out of bounds!\n");
+			return -EINVAL;
+		}
+
+		/* Address zero is a special one. The first page on a disk is
+		 * protected. As it often holds internal device boot
+		 * information. */
+		if (!pba)
+			continue;
+
+		/* resolve block from physical address */
+		lun_id = pba / total_pgs_per_lun;
+		lun = &dev->luns[lun_id];
+
+		/* Calculate block offset into lun */
+		pba = pba - (total_pgs_per_lun * lun_id);
+		blk = &lun->blocks[pba / lun->nr_pages_per_blk];
+
+		if (!blk->type) {
+			/* at this point, we don't know anything about the
+			 * block. It's up to the FTL on top to re-etablish the
+			 * block state */
+			list_move_tail(&blk->list, &lun->used_list);
+			blk->type = 1;
+			lun->nr_free_blocks--;
+		}
+	}
+
+	return 0;
+}
+
+static int nvm_blocks_init(struct nvm_dev *dev)
+{
+	struct nvm_lun *lun;
+	struct nvm_block *block;
+	sector_t lun_iter, block_iter, cur_block_id = 0;
+	int ret;
+
+	nvm_for_each_lun(dev, lun, lun_iter) {
+		lun->blocks = vzalloc(sizeof(struct nvm_block) *
+						lun->nr_blocks);
+		if (!lun->blocks)
+			return -ENOMEM;
+
+		lun_for_each_block(lun, block, block_iter) {
+			spin_lock_init(&block->lock);
+			INIT_LIST_HEAD(&block->list);
+
+			block->lun = lun;
+			block->id = cur_block_id++;
+
+			/* First block is reserved for device */
+			if (unlikely(lun_iter == 0 && block_iter == 0))
+				continue;
+
+			list_add_tail(&block->list, &lun->free_list);
+		}
+	}
+
+	/* Without bad block table support, we can use the mapping table to get
+	   restore the state of each block. */
+	if (dev->ops->get_l2p_tbl) {
+		ret = dev->ops->get_l2p_tbl(dev->q, 0, dev->total_pages,
+							nvm_block_map, dev);
+		if (ret) {
+			pr_err("nvm: could not read L2P table.\n");
+			pr_warn("nvm: default block initialization");
+		}
+	}
+
+	return 0;
+}
+
+static void nvm_core_free(struct nvm_dev *dev)
+{
+	kfree(dev->identity.chnls);
+	kfree(dev);
+}
+
+static int nvm_core_init(struct nvm_dev *dev, int max_qdepth)
+{
+	dev->nr_luns = dev->identity.nchannels;
+	dev->sector_size = EXPOSED_PAGE_SIZE;
+	INIT_LIST_HEAD(&dev->online_targets);
+
+	return 0;
+}
+
+static void nvm_free(struct nvm_dev *dev)
+{
+	if (!dev)
+		return;
+
+	nvm_blocks_free(dev);
+	nvm_luns_free(dev);
+	nvm_core_free(dev);
+}
+
+int nvm_validate_features(struct nvm_dev *dev)
+{
+	struct nvm_get_features gf;
+	int ret;
+
+	ret = dev->ops->get_features(dev->q, &gf);
+	if (ret)
+		return ret;
+
+	/* Only default configuration is supported.
+	 * I.e. L2P, No ondrive GC and drive performs ECC */
+	if (gf.rsp != 0x0 || gf.ext != 0x0)
+		return -EINVAL;
+
+	return 0;
+}
+
+int nvm_validate_responsibility(struct nvm_dev *dev)
+{
+	if (!dev->ops->set_responsibility)
+		return 0;
+
+	return dev->ops->set_responsibility(dev->q, 0);
+}
+
+int nvm_init(struct nvm_dev *dev)
+{
+	struct blk_mq_tag_set *tag_set = dev->q->tag_set;
+	int max_qdepth;
+	int ret = 0;
+
+	if (!dev->q || !dev->ops)
+		return -EINVAL;
+
+	if (dev->ops->identify(dev->q, &dev->identity)) {
+		pr_err("nvm: device could not be identified\n");
+		ret = -EINVAL;
+		goto err;
+	}
+
+	max_qdepth = tag_set->queue_depth * tag_set->nr_hw_queues;
+
+	pr_debug("nvm dev: ver %u type %u chnls %u max qdepth: %i\n",
+			dev->identity.ver_id,
+			dev->identity.nvm_type,
+			dev->identity.nchannels,
+			max_qdepth);
+
+	ret = nvm_validate_features(dev);
+	if (ret) {
+		pr_err("nvm: disk features are not supported.");
+		goto err;
+	}
+
+	ret = nvm_validate_responsibility(dev);
+	if (ret) {
+		pr_err("nvm: disk responsibilities are not supported.");
+		goto err;
+	}
+
+	ret = nvm_core_init(dev, max_qdepth);
+	if (ret) {
+		pr_err("nvm: could not initialize core structures.\n");
+		goto err;
+	}
+
+	ret = nvm_luns_init(dev);
+	if (ret) {
+		pr_err("nvm: could not initialize luns\n");
+		goto err;
+	}
+
+	if (!dev->nr_luns) {
+		pr_err("nvm: device did not expose any luns.\n");
+		goto err;
+	}
+
+	ret = nvm_blocks_init(dev);
+	if (ret) {
+		pr_err("nvm: could not initialize blocks\n");
+		goto err;
+	}
+
+	pr_info("nvm: allocating %lu physical pages (%lu KB)\n",
+		dev->total_pages, dev->total_pages * dev->sector_size / 1024);
+	pr_info("nvm: luns: %u\n", dev->nr_luns);
+	pr_info("nvm: blocks: %lu\n", dev->total_blocks);
+	pr_info("nvm: target sector size=%d\n", dev->sector_size);
+
+	return 0;
+err:
+	nvm_free(dev);
+	pr_err("nvm: failed to initialize nvm\n");
+	return ret;
+}
+
+void nvm_exit(struct nvm_dev *dev)
+{
+	nvm_free(dev);
+
+	pr_info("nvm: successfully unloaded\n");
+}
+
+int blk_nvm_register(struct request_queue *q, struct nvm_dev_ops *ops)
+{
+	struct nvm_dev *dev;
+	int ret;
+
+	if (!ops->identify || !ops->get_features)
+		return -EINVAL;
+
+	/* does not yet support multi-page IOs. */
+	blk_queue_max_hw_sectors(q, queue_logical_block_size(q) >> 9);
+
+	dev = kzalloc(sizeof(struct nvm_dev), GFP_KERNEL);
+	if (!dev)
+		return -ENOMEM;
+
+	dev->q = q;
+	dev->ops = ops;
+
+	ret = nvm_init(dev);
+	if (ret)
+		goto err_init;
+
+	q->nvm = dev;
+
+	return 0;
+err_init:
+	kfree(dev);
+	return ret;
+}
+EXPORT_SYMBOL(blk_nvm_register);
+
+void blk_nvm_unregister(struct request_queue *q)
+{
+	if (!blk_queue_nvm(q))
+		return;
+
+	nvm_exit(q->nvm);
+}
+
+static int nvm_ioctl(struct block_device *bdev, fmode_t mode, unsigned int cmd,
+							unsigned long arg)
+{
+	return 0;
+}
+
+static int nvm_open(struct block_device *bdev, fmode_t mode)
+{
+	return 0;
+}
+
+static void nvm_release(struct gendisk *disk, fmode_t mode)
+{
+}
+
+static const struct block_device_operations nvm_fops = {
+	.owner		= THIS_MODULE,
+	.ioctl		= nvm_ioctl,
+	.open		= nvm_open,
+	.release	= nvm_release,
+};
+
+static int nvm_create_target(struct gendisk *qdisk, char *ttname, char *tname,
+						int lun_begin, int lun_end)
+{
+	struct request_queue *qqueue = qdisk->queue;
+	struct nvm_dev *qnvm = qqueue->nvm;
+	struct request_queue *tqueue;
+	struct gendisk *tdisk;
+	struct nvm_target_type *tt;
+	struct nvm_target *t;
+	void *targetdata;
+
+	tt = nvm_find_target_type(ttname);
+	if (!tt) {
+		pr_err("nvm: target type %s not found\n", ttname);
+		return -EINVAL;
+	}
+
+	down_write(&_lock);
+	list_for_each_entry(t, &qnvm->online_targets, list) {
+		if (!strcmp(tname, t->disk->disk_name)) {
+			pr_err("nvm: target name already exists.\n");
+			up_write(&_lock);
+			return -EINVAL;
+		}
+	}
+	up_write(&_lock);
+
+	t = kmalloc(sizeof(struct nvm_target), GFP_KERNEL);
+	if (!t)
+		return -ENOMEM;
+
+	tqueue = blk_alloc_queue_node(GFP_KERNEL, qqueue->node);
+	if (!tqueue)
+		goto err_t;
+	blk_queue_make_request(tqueue, tt->make_rq);
+
+	tdisk = alloc_disk(0);
+	if (!tdisk)
+		goto err_queue;
+
+	sprintf(tdisk->disk_name, "%s", tname);
+	tdisk->flags = GENHD_FL_EXT_DEVT;
+	tdisk->major = 0;
+	tdisk->first_minor = 0;
+	tdisk->fops = &nvm_fops;
+	tdisk->queue = tqueue;
+
+	targetdata = tt->init(qqueue, tqueue, qdisk, tdisk, lun_begin, lun_end);
+	if (IS_ERR(targetdata))
+		goto err_init;
+
+	tdisk->private_data = targetdata;
+	tqueue->queuedata = targetdata;
+
+	blk_queue_prep_rq(qqueue, tt->prep_rq);
+	blk_queue_unprep_rq(qqueue, tt->unprep_rq);
+
+	set_capacity(tdisk, tt->capacity(targetdata));
+	add_disk(tdisk);
+
+	t->type = tt;
+	t->disk = tdisk;
+
+	down_write(&_lock);
+	list_add_tail(&t->list, &qnvm->online_targets);
+	up_write(&_lock);
+
+	return 0;
+err_init:
+	put_disk(tdisk);
+err_queue:
+	blk_cleanup_queue(tqueue);
+err_t:
+	kfree(t);
+	return -ENOMEM;
+}
+
+/* _lock must be taken */
+static void nvm_remove_target(struct nvm_target *t)
+{
+	struct nvm_target_type *tt = t->type;
+	struct gendisk *tdisk = t->disk;
+	struct request_queue *q = tdisk->queue;
+
+	del_gendisk(tdisk);
+	if (tt->exit)
+		tt->exit(tdisk->private_data);
+	blk_cleanup_queue(q);
+
+	put_disk(tdisk);
+
+	list_del(&t->list);
+	kfree(t);
+}
+
+static ssize_t free_blocks_show(struct device *d, struct device_attribute *attr,
+		char *page)
+{
+	struct gendisk *disk = dev_to_disk(d);
+	struct nvm_dev *dev = disk->queue->nvm;
+
+	char *page_start = page;
+	struct nvm_lun *lun;
+	unsigned int i;
+
+	nvm_for_each_lun(dev, lun, i)
+		page += sprintf(page, "%8u\t%u\n", i, lun->nr_free_blocks);
+
+	return page - page_start;
+}
+
+DEVICE_ATTR_RO(free_blocks);
+
+static ssize_t configure_store(struct device *d, struct device_attribute *attr,
+						const char *buf, size_t cnt)
+{
+	struct gendisk *disk = dev_to_disk(d);
+	struct nvm_dev *dev = disk->queue->nvm;
+	char name[255], ttname[255];
+	int lun_begin, lun_end, ret;
+
+	if (cnt >= 255)
+		return -EINVAL;
+
+	ret = sscanf(buf, "%s %s %u:%u", name, ttname, &lun_begin, &lun_end);
+	if (ret != 4) {
+		pr_err("nvm: configure must be in the format of \"name targetname lun_begin:lun_end\".\n");
+		return -EINVAL;
+	}
+
+	if (lun_begin > lun_end || lun_end > dev->nr_luns) {
+		pr_err("nvm: lun out of bound (%u:%u > %u)\n",
+					lun_begin, lun_end, dev->nr_luns);
+		return -EINVAL;
+	}
+
+	ret = nvm_create_target(disk, name, ttname, lun_begin, lun_end);
+	if (ret)
+		pr_err("nvm: configure disk failed\n");
+
+	return cnt;
+}
+DEVICE_ATTR_WO(configure);
+
+static ssize_t remove_store(struct device *d, struct device_attribute *attr,
+						const char *buf, size_t cnt)
+{
+	struct gendisk *disk = dev_to_disk(d);
+	struct nvm_dev *dev = disk->queue->nvm;
+	struct nvm_target *t = NULL;
+	char tname[255];
+	int ret;
+
+	if (cnt >= 255)
+		return -EINVAL;
+
+	ret = sscanf(buf, "%s", tname);
+	if (ret != 1) {
+		pr_err("nvm: remove use the following format \"targetname\".\n");
+		return -EINVAL;
+	}
+
+	down_write(&_lock);
+	list_for_each_entry(t, &dev->online_targets, list) {
+		if (!strcmp(tname, t->disk->disk_name)) {
+			nvm_remove_target(t);
+			ret = 0;
+			break;
+		}
+	}
+	up_write(&_lock);
+
+	if (ret)
+		pr_err("nvm: target \"%s\" doesn't exist.\n", tname);
+
+	return cnt;
+}
+
+DEVICE_ATTR_WO(remove);
+
+static struct attribute *nvm_attrs[] = {
+	&dev_attr_free_blocks.attr,
+	&dev_attr_configure.attr,
+	&dev_attr_remove.attr,
+	NULL,
+};
+
+static struct attribute_group nvm_attribute_group = {
+	.name = "nvm",
+	.attrs = nvm_attrs,
+};
+
+int blk_nvm_init_sysfs(struct device *dev)
+{
+	int ret;
+
+	ret = sysfs_create_group(&dev->kobj, &nvm_attribute_group);
+	if (ret)
+		return ret;
+
+	kobject_uevent(&dev->kobj, KOBJ_CHANGE);
+
+	return 0;
+}
+
+void blk_nvm_remove_sysfs(struct device *dev)
+{
+	sysfs_remove_group(&dev->kobj, &nvm_attribute_group);
+}
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index faaf36a..ad8cf2f 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -568,6 +568,12 @@ int blk_register_queue(struct gendisk *disk)
 	if (ret)
 		return ret;
 
+	if (blk_queue_nvm(q)) {
+		ret = blk_nvm_init_sysfs(dev);
+		if (ret)
+			return ret;
+	}
+
 	ret = kobject_add(&q->kobj, kobject_get(&dev->kobj), "%s", "queue");
 	if (ret < 0) {
 		blk_trace_remove_sysfs(dev);
@@ -601,6 +607,11 @@ void blk_unregister_queue(struct gendisk *disk)
 	if (WARN_ON(!q))
 		return;
 
+	if (blk_queue_nvm(q)) {
+		blk_nvm_unregister(q);
+		blk_nvm_remove_sysfs(disk_to_dev(disk));
+	}
+
 	if (q->mq_ops)
 		blk_mq_unregister_disk(disk);
 
diff --git a/block/blk.h b/block/blk.h
index 43b0361..3e4abee 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -281,4 +281,22 @@ static inline int blk_throtl_init(struct request_queue *q) { return 0; }
 static inline void blk_throtl_exit(struct request_queue *q) { }
 #endif /* CONFIG_BLK_DEV_THROTTLING */
 
+#ifdef CONFIG_BLK_DEV_NVM
+struct nvm_target {
+	struct list_head list;
+	struct nvm_target_type *type;
+	struct gendisk *disk;
+};
+
+struct nvm_dev_ops;
+
+extern void blk_nvm_unregister(struct request_queue *);
+extern int blk_nvm_init_sysfs(struct device *);
+extern void blk_nvm_remove_sysfs(struct device *);
+#else
+static void blk_nvm_unregister(struct request_queue *q) { }
+static int blk_nvm_init_sysfs(struct device *) { return 0; }
+static void blk_nvm_remove_sysfs(struct device *) { }
+#endif /* CONFIG_BLK_DEV_NVM */
+
 #endif /* BLK_INTERNAL_H */
diff --git a/include/linux/bio.h b/include/linux/bio.h
index da3a127..ace0b23 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -354,6 +354,15 @@ static inline void bip_set_seed(struct bio_integrity_payload *bip,
 
 #endif /* CONFIG_BLK_DEV_INTEGRITY */
 
+#if defined(CONFIG_BLK_DEV_NVM)
+
+/* bio open-channel ssd payload */
+struct bio_nvm_payload {
+	void *private;
+};
+
+#endif /* CONFIG_BLK_DEV_NVM */
+
 extern void bio_trim(struct bio *bio, int offset, int size);
 extern struct bio *bio_split(struct bio *bio, int sectors,
 			     gfp_t gfp, struct bio_set *bs);
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index d7b39af..75e1497 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -140,13 +140,15 @@ enum {
 	BLK_MQ_RQ_QUEUE_OK	= 0,	/* queued fine */
 	BLK_MQ_RQ_QUEUE_BUSY	= 1,	/* requeue IO for later */
 	BLK_MQ_RQ_QUEUE_ERROR	= 2,	/* end IO with error */
-	BLK_MQ_RQ_QUEUE_DONE	= 3,	/* IO is already handled */
+	BLK_MQ_RQ_QUEUE_DONE	= 3,	/* IO handled by prep */
 
 	BLK_MQ_F_SHOULD_MERGE	= 1 << 0,
 	BLK_MQ_F_TAG_SHARED	= 1 << 1,
 	BLK_MQ_F_SG_MERGE	= 1 << 2,
 	BLK_MQ_F_SYSFS_UP	= 1 << 3,
 	BLK_MQ_F_DEFER_ISSUE	= 1 << 4,
+	BLK_MQ_F_NVM		= 1 << 5,
+
 	BLK_MQ_F_ALLOC_POLICY_START_BIT = 8,
 	BLK_MQ_F_ALLOC_POLICY_BITS = 1,
 
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index a1b25e3..a619844 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -83,7 +83,10 @@ struct bio {
 		struct bio_integrity_payload *bi_integrity; /* data integrity */
 #endif
 	};
-
+#if defined(CONFIG_BLK_DEV_NVM)
+	struct bio_nvm_payload *bi_nvm;	/* open-channel ssd
+					 * support */
+#endif
 	unsigned short		bi_vcnt;	/* how many bio_vec's */
 
 	/*
@@ -193,6 +196,8 @@ enum rq_flag_bits {
 	__REQ_HASHED,		/* on IO scheduler merge hash */
 	__REQ_MQ_INFLIGHT,	/* track inflight for MQ */
 	__REQ_NO_TIMEOUT,	/* requests may never expire */
+	__REQ_NVM_MAPPED,	/* NVM mapped this request */
+	__REQ_NVM_NO_INFLIGHT,	/* request should not use inflight protection */
 	__REQ_NR_BITS,		/* stops here */
 };
 
@@ -213,7 +218,7 @@ enum rq_flag_bits {
 #define REQ_COMMON_MASK \
 	(REQ_WRITE | REQ_FAILFAST_MASK | REQ_SYNC | REQ_META | REQ_PRIO | \
 	 REQ_DISCARD | REQ_WRITE_SAME | REQ_NOIDLE | REQ_FLUSH | REQ_FUA | \
-	 REQ_SECURE | REQ_INTEGRITY)
+	 REQ_SECURE | REQ_INTEGRITY | REQ_NVM_NO_INFLIGHT)
 #define REQ_CLONE_MASK		REQ_COMMON_MASK
 
 #define BIO_NO_ADVANCE_ITER_MASK	(REQ_DISCARD|REQ_WRITE_SAME)
@@ -247,5 +252,6 @@ enum rq_flag_bits {
 #define REQ_HASHED		(1ULL << __REQ_HASHED)
 #define REQ_MQ_INFLIGHT		(1ULL << __REQ_MQ_INFLIGHT)
 #define REQ_NO_TIMEOUT		(1ULL << __REQ_NO_TIMEOUT)
-
+#define REQ_NVM_MAPPED		(1ULL << __REQ_NVM_MAPPED)
+#define REQ_NVM_NO_INFLIGHT	(1ULL << __REQ_NVM_NO_INFLIGHT)
 #endif /* __LINUX_BLK_TYPES_H */
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 7f9a516..d416fd5 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -209,6 +209,9 @@ struct request {
 
 	/* for bidi */
 	struct request *next_rq;
+#ifdef CONFIG_BLK_DEV_NVM
+	sector_t phys_sector;
+#endif
 };
 
 static inline unsigned short req_get_ioprio(struct request *req)
@@ -309,6 +312,10 @@ struct queue_limits {
 	unsigned char		raid_partial_stripes_expensive;
 };
 
+#ifdef CONFIG_BLK_DEV_NVM
+struct nvm_dev;
+#endif
+
 struct request_queue {
 	/*
 	 * Together with queue_head for cacheline sharing
@@ -455,6 +462,9 @@ struct request_queue {
 #ifdef CONFIG_BLK_DEV_IO_TRACE
 	struct blk_trace	*blk_trace;
 #endif
+#ifdef CONFIG_BLK_DEV_NVM
+	struct nvm_dev *nvm;
+#endif
 	/*
 	 * for flush operations
 	 */
@@ -513,6 +523,7 @@ struct request_queue {
 #define QUEUE_FLAG_INIT_DONE   20	/* queue is initialized */
 #define QUEUE_FLAG_NO_SG_MERGE 21	/* don't attempt to merge SG segments*/
 #define QUEUE_FLAG_SG_GAPS     22	/* queue doesn't support SG gaps */
+#define QUEUE_FLAG_NVM         23	/* open-channel SSD managed queue */
 
 #define QUEUE_FLAG_DEFAULT	((1 << QUEUE_FLAG_IO_STAT) |		\
 				 (1 << QUEUE_FLAG_STACKABLE)	|	\
@@ -601,6 +612,7 @@ static inline void queue_flag_clear(unsigned int flag, struct request_queue *q)
 #define blk_queue_discard(q)	test_bit(QUEUE_FLAG_DISCARD, &(q)->queue_flags)
 #define blk_queue_secdiscard(q)	(blk_queue_discard(q) && \
 	test_bit(QUEUE_FLAG_SECDISCARD, &(q)->queue_flags))
+#define blk_queue_nvm(q)	test_bit(QUEUE_FLAG_NVM, &(q)->queue_flags)
 
 #define blk_noretry_request(rq) \
 	((rq)->cmd_flags & (REQ_FAILFAST_DEV|REQ_FAILFAST_TRANSPORT| \
@@ -822,6 +834,7 @@ extern int sg_scsi_ioctl(struct request_queue *, struct gendisk *, fmode_t,
 			 struct scsi_ioctl_command __user *);
 
 extern void blk_queue_bio(struct request_queue *q, struct bio *bio);
+extern void blk_init_request_from_bio(struct request *req, struct bio *bio);
 
 /*
  * A queue has just exitted congestion.  Note this in the global counter of
@@ -902,6 +915,11 @@ static inline unsigned int blk_rq_cur_sectors(const struct request *rq)
 	return blk_rq_cur_bytes(rq) >> 9;
 }
 
+static inline sector_t blk_rq_phys_pos(const struct request *rq)
+{
+	return rq->phys_sector;
+}
+
 static inline unsigned int blk_queue_get_max_sectors(struct request_queue *q,
 						     unsigned int cmd_flags)
 {
@@ -1504,6 +1522,8 @@ extern bool blk_integrity_merge_bio(struct request_queue *, struct request *,
 static inline
 struct blk_integrity *bdev_get_integrity(struct block_device *bdev)
 {
+	if (unlikely(!bdev))
+		return NULL;
 	return bdev->bd_disk->integrity;
 }
 
@@ -1598,6 +1618,204 @@ static inline bool blk_integrity_is_initialized(struct gendisk *g)
 
 #endif /* CONFIG_BLK_DEV_INTEGRITY */
 
+#ifdef CONFIG_BLK_DEV_NVM
+
+#include <uapi/linux/nvm.h>
+
+typedef int (nvm_l2p_update_fn)(u64, u64, u64 *, void *);
+typedef int (nvm_id_fn)(struct request_queue *, struct nvm_id *);
+typedef int (nvm_get_features_fn)(struct request_queue *,
+				  struct nvm_get_features *);
+typedef int (nvm_set_rsp_fn)(struct request_queue *, u64);
+typedef int (nvm_get_l2p_tbl_fn)(struct request_queue *, u64, u64,
+				 nvm_l2p_update_fn *, void *);
+typedef int (nvm_erase_blk_fn)(struct request_queue *, sector_t);
+
+struct nvm_dev_ops {
+	nvm_id_fn		*identify;
+	nvm_get_features_fn	*get_features;
+	nvm_set_rsp_fn		*set_responsibility;
+	nvm_get_l2p_tbl_fn	*get_l2p_tbl;
+
+	nvm_erase_blk_fn	*erase_block;
+};
+
+struct nvm_blocks;
+
+/*
+ * We assume that the device exposes its channels as a linear address
+ * space. A LUN therefore has a phy_addr_start and a phy_addr_end that
+ * denote its start and end. This abstraction lets the open-channel
+ * SSD (or any other device) expose its read/write/erase interface
+ * and be administered by the host system.
+ */
+struct nvm_lun {
+	struct nvm_dev *dev;
+
+	/* lun block lists */
+	struct list_head used_list;	/* In-use blocks */
+	struct list_head free_list;	/* Unused blocks, i.e. released
+					 * and ready for use */
+
+	struct {
+		spinlock_t lock;
+	} ____cacheline_aligned_in_smp;
+
+	struct nvm_block *blocks;
+	struct nvm_id_chnl *chnl;
+
+	int id;
+	int reserved_blocks;
+
+	unsigned int nr_blocks;		/* end_block - start_block. */
+	unsigned int nr_free_blocks;	/* Number of unused blocks */
+
+	int nr_pages_per_blk;
+};
+
+struct nvm_block {
+	/* Management structures */
+	struct list_head list;
+	struct nvm_lun *lun;
+
+	spinlock_t lock;
+
+#define MAX_INVALID_PAGES_STORAGE 8
+	/* Bitmap for invalid page entries */
+	unsigned long invalid_pages[MAX_INVALID_PAGES_STORAGE];
+	/* points to the next writable page within a block */
+	unsigned int next_page;
+	/* number of pages that are invalid, wrt host page size */
+	unsigned int nr_invalid_pages;
+
+	unsigned int id;
+	int type;
+	/* Persistent data structures */
+	atomic_t data_cmnt_size; /* data pages committed to stable storage */
+};
+
+struct nvm_dev {
+	struct nvm_dev_ops *ops;
+	struct request_queue *q;
+
+	struct nvm_id identity;
+
+	struct list_head online_targets;
+
+	/* Open-channel SSD stores extra data after the private driver data */
+	unsigned int drv_cmd_size;
+
+	int nr_luns;
+	struct nvm_lun *luns;
+
+	/*int nr_blks_per_lun;
+	int nr_pages_per_blk;*/
+	/* Calculated/Cached values. These do not reflect the actual usable
+	 * blocks at run-time. */
+	unsigned long total_pages;
+	unsigned long total_blocks;
+
+	uint32_t sector_size;
+};
+
+/* Logical to physical mapping */
+struct nvm_addr {
+	sector_t addr;
+	struct nvm_block *block;
+};
+
+/* Physical to logical mapping */
+struct nvm_rev_addr {
+	sector_t addr;
+};
+
+struct rrpc_inflight_rq {
+	struct list_head list;
+	sector_t l_start;
+	sector_t l_end;
+};
+
+struct nvm_per_rq {
+	struct rrpc_inflight_rq inflight_rq;
+	struct nvm_addr *addr;
+	unsigned int flags;
+};
+
+typedef void (nvm_tgt_make_rq)(struct request_queue *, struct bio *);
+typedef int (nvm_tgt_prep_rq)(struct request_queue *, struct request *);
+typedef void (nvm_tgt_unprep_rq)(struct request_queue *, struct request *);
+typedef sector_t (nvm_tgt_capacity)(void *);
+typedef void *(nvm_tgt_init_fn)(struct request_queue *, struct request_queue *,
+				struct gendisk *, struct gendisk *, int, int);
+typedef void (nvm_tgt_exit_fn)(void *);
+
+struct nvm_target_type {
+	const char *name;
+	unsigned int version[3];
+
+	/* target entry points */
+	nvm_tgt_make_rq *make_rq;
+	nvm_tgt_prep_rq *prep_rq;
+	nvm_tgt_unprep_rq *unprep_rq;
+	nvm_tgt_capacity *capacity;
+
+	/* module-specific init/teardown */
+	nvm_tgt_init_fn *init;
+	nvm_tgt_exit_fn *exit;
+
+	/* For open-channel SSD internal use */
+	struct list_head list;
+};
+
+extern struct nvm_target_type *nvm_find_target_type(const char *name);
+extern int nvm_register_target(struct nvm_target_type *tt);
+extern void nvm_unregister_target(struct nvm_target_type *tt);
+extern int blk_nvm_register(struct request_queue *,
+						struct nvm_dev_ops *);
+extern struct nvm_block *blk_nvm_get_blk(struct nvm_lun *, int);
+extern void blk_nvm_put_blk(struct nvm_block *block);
+extern int blk_nvm_erase_blk(struct nvm_dev *, struct nvm_block *);
+extern sector_t blk_nvm_alloc_addr(struct nvm_block *);
+static inline struct nvm_dev *blk_nvm_get_dev(struct request_queue *q)
+{
+	return q->nvm;
+}
+#else
+struct nvm_dev_ops;
+struct nvm_lun;
+struct nvm_block;
+struct nvm_target_type;
+
+static inline struct nvm_target_type *nvm_find_target_type(const char *name)
+{
+	return NULL;
+}
+static inline int nvm_register_target(struct nvm_target_type *tt)
+{
+	return -EINVAL;
+}
+static inline void nvm_unregister_target(struct nvm_target_type *tt) {}
+static inline int blk_nvm_register(struct request_queue *q,
+						struct nvm_dev_ops *ops)
+{
+	return -EINVAL;
+}
+static inline struct nvm_block *blk_nvm_get_blk(struct nvm_lun *lun, int is_gc)
+{
+	return NULL;
+}
+static inline void blk_nvm_put_blk(struct nvm_block *block) {}
+static inline int blk_nvm_erase_blk(struct nvm_dev *dev, struct nvm_block *blk)
+{
+	return -EINVAL;
+}
+static inline struct nvm_dev *blk_nvm_get_dev(struct request_queue *q)
+{
+	return NULL;
+}
+static inline sector_t blk_nvm_alloc_addr(struct nvm_block *block)
+{
+	return 0;
+}
+#endif /* CONFIG_BLK_DEV_NVM */
+
 struct block_device_operations {
 	int (*open) (struct block_device *, fmode_t);
 	void (*release) (struct gendisk *, fmode_t);
diff --git a/include/linux/lightnvm.h b/include/linux/lightnvm.h
new file mode 100644
index 0000000..888d994
--- /dev/null
+++ b/include/linux/lightnvm.h
@@ -0,0 +1,56 @@
+#ifndef NVM_H
+#define NVM_H
+
+#include <linux/blkdev.h>
+#include <linux/types.h>
+
+#define nvm_for_each_lun(dev, lun, i) \
+		for ((i) = 0, lun = &(dev)->luns[0]; \
+			(i) < (dev)->nr_luns; (i)++, lun = &(dev)->luns[(i)])
+
+#define lun_for_each_block(p, b, i) \
+		for ((i) = 0, b = &(p)->blocks[0]; \
+			(i) < (p)->nr_blocks; (i)++, b = &(p)->blocks[(i)])
+
+#define block_for_each_page(b, p) \
+		for ((p)->addr = block_to_addr((b)), (p)->block = (b); \
+			(p)->addr < block_to_addr((b)) \
+				+ (b)->lun->nr_pages_per_blk; \
+			(p)->addr++)
+
+/* We currently assume that the lightnvm device accepts data in 512-byte
+ * chunks. This should be set to the smallest command size available for a
+ * given device.
+ */
+#define NVM_SECTOR 512
+#define EXPOSED_PAGE_SIZE 4096
+
+#define NR_PHY_IN_LOG (EXPOSED_PAGE_SIZE / NVM_SECTOR)
+
+#define NVM_MSG_PREFIX "nvm"
+#define ADDR_EMPTY (~0ULL)
+#define LTOP_POISON 0xD3ADB33F
+
+/* core.c */
+
+static inline int block_is_full(struct nvm_block *block)
+{
+	struct nvm_lun *lun = block->lun;
+
+	return block->next_page == lun->nr_pages_per_blk;
+}
+
+static inline sector_t block_to_addr(struct nvm_block *block)
+{
+	struct nvm_lun *lun = block->lun;
+
+	return block->id * lun->nr_pages_per_blk;
+}
+
+static inline struct nvm_lun *paddr_to_lun(struct nvm_dev *dev,
+							sector_t p_addr)
+{
+	return &dev->luns[p_addr / (dev->total_pages / dev->nr_luns)];
+}
+
+#endif
diff --git a/include/uapi/linux/nvm.h b/include/uapi/linux/nvm.h
new file mode 100644
index 0000000..fb95cf5
--- /dev/null
+++ b/include/uapi/linux/nvm.h
@@ -0,0 +1,70 @@
+/*
+ * Definitions for the LightNVM interface
+ * Copyright (c) 2015, IT University of Copenhagen
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+#ifndef _UAPI_LINUX_LIGHTNVM_H
+#define _UAPI_LINUX_LIGHTNVM_H
+
+#include <linux/types.h>
+
+enum {
+	/* HW Responsibilities */
+	NVM_RSP_L2P	= 0x00,
+	NVM_RSP_GC	= 0x01,
+	NVM_RSP_ECC	= 0x02,
+
+	/* Physical NVM Type */
+	NVM_NVMT_BLK	= 0,
+	NVM_NVMT_BYTE	= 1,
+
+	/* Internal IO Scheduling algorithm */
+	NVM_IOSCHED_CHANNEL	= 0,
+	NVM_IOSCHED_CHIP	= 1,
+
+	/* Status codes */
+	NVM_SUCCESS		= 0,
+	NVM_RSP_NOT_CHANGEABLE	= 1,
+};
+
+struct nvm_id_chnl {
+	u64	laddr_begin;
+	u64	laddr_end;
+	u32	oob_size;
+	u32	queue_size;
+	u32	gran_read;
+	u32	gran_write;
+	u32	gran_erase;
+	u32	t_r;
+	u32	t_sqr;
+	u32	t_w;
+	u32	t_sqw;
+	u32	t_e;
+	u16	chnl_parallelism;
+	u8	io_sched;
+	u8	res[133];
+};
+
+struct nvm_id {
+	u8	ver_id;
+	u8	nvm_type;
+	u16	nchannels;
+	struct nvm_id_chnl *chnls;
+};
+
+struct nvm_get_features {
+	u64	rsp;
+	u64	ext;
+};
+
+#endif /* _UAPI_LINUX_LIGHTNVM_H */
+
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 3/5 v2] lightnvm: RRPC target
  2015-04-15 12:34 ` Matias Bjørling
  (?)
@ 2015-04-15 12:34   ` Matias Bjørling
  -1 siblings, 0 replies; 53+ messages in thread
From: Matias Bjørling @ 2015-04-15 12:34 UTC (permalink / raw)
  To: hch, axboe, linux-fsdevel, linux-kernel, linux-nvme
  Cc: javier, keith.busch, Matias Bjørling

This patch implements a simple block device target for Open-Channel SSDs.
It exposes the physical flash as a generic sector-based address space.

The FTL implements a round-robin approach for sector allocation,
together with a greedy cost-based garbage collector.
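
As a rough illustration of those two policies, here is a minimal,
hypothetical userspace sketch (this is not the rrpc code itself; the names
demo_blk, demo_next_lun and demo_pick_gc_victim are invented for this
example only):

  #include <stdio.h>

  #define DEMO_NR_LUNS	4
  #define DEMO_NR_BLKS	8

  struct demo_blk {
  	int invalid_pages;	/* cost metric: pages already overwritten */
  };

  /* Round-robin allocation: each new write goes to the next LUN in
   * line, spreading writes across the device's parallel units. */
  static int demo_next_lun(int *rr)
  {
  	return (*rr)++ % DEMO_NR_LUNS;
  }

  /* Greedy cost-based GC: reclaim the block with the most invalid
   * pages, since erasing it frees the most space for the least amount
   * of live data that must be moved first. */
  static int demo_pick_gc_victim(struct demo_blk *blks, int nr)
  {
  	int i, victim = 0;

  	for (i = 1; i < nr; i++)
  		if (blks[i].invalid_pages > blks[victim].invalid_pages)
  			victim = i;
  	return victim;
  }

  int main(void)
  {
  	struct demo_blk blks[DEMO_NR_BLKS] = {
  		{ 3 }, { 17 }, { 9 }, { 0 }, { 12 }, { 5 }, { 7 }, { 1 },
  	};
  	int rr = 0, i;

  	for (i = 0; i < 6; i++)
  		printf("write %d -> lun %d\n", i, demo_next_lun(&rr));
  	printf("gc victim: block %d\n",
  	       demo_pick_gc_victim(blks, DEMO_NR_BLKS));
  	return 0;
  }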

Signed-off-by: Matias Bjørling <m@bjorling.me>
---
 drivers/Kconfig           |    2 +
 drivers/Makefile          |    2 +
 drivers/lightnvm/Kconfig  |   29 ++
 drivers/lightnvm/Makefile |    5 +
 drivers/lightnvm/rrpc.c   | 1222 +++++++++++++++++++++++++++++++++++++++++++++
 drivers/lightnvm/rrpc.h   |  203 ++++++++
 include/linux/lightnvm.h  |    1 -
 7 files changed, 1463 insertions(+), 1 deletion(-)
 create mode 100644 drivers/lightnvm/Kconfig
 create mode 100644 drivers/lightnvm/Makefile
 create mode 100644 drivers/lightnvm/rrpc.c
 create mode 100644 drivers/lightnvm/rrpc.h

diff --git a/drivers/Kconfig b/drivers/Kconfig
index c0cc96b..da47047 100644
--- a/drivers/Kconfig
+++ b/drivers/Kconfig
@@ -42,6 +42,8 @@ source "drivers/net/Kconfig"
 
 source "drivers/isdn/Kconfig"
 
+source "drivers/lightnvm/Kconfig"
+
 # input before char - char/joystick depends on it. As does USB.
 
 source "drivers/input/Kconfig"
diff --git a/drivers/Makefile b/drivers/Makefile
index 527a6da..6b6928a 100644
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -165,3 +165,5 @@ obj-$(CONFIG_RAS)		+= ras/
 obj-$(CONFIG_THUNDERBOLT)	+= thunderbolt/
 obj-$(CONFIG_CORESIGHT)		+= coresight/
 obj-$(CONFIG_ANDROID)		+= android/
+
+obj-$(CONFIG_NVM)		+= lightnvm/
diff --git a/drivers/lightnvm/Kconfig b/drivers/lightnvm/Kconfig
new file mode 100644
index 0000000..89fabe1
--- /dev/null
+++ b/drivers/lightnvm/Kconfig
@@ -0,0 +1,29 @@
+#
+# Open-Channel SSD NVM configuration
+#
+
+menuconfig NVM
+	bool "Open-Channel SSD target support"
+	depends on BLK_DEV_NVM
+	help
+	  Say Y here to enable support for Open-Channel SSDs.
+
+	  Open-Channel SSDs implement a set of extensions to SSDs that
+	  expose direct access to the underlying non-volatile memory.
+
+	  If you say N, all options in this submenu will be skipped and
+	  disabled; only do this if you know what you are doing.
+
+if NVM
+
+config NVM_RRPC
+	tristate "Round-robin Hybrid Open-Channel SSD"
+	depends on BLK_DEV_NVM
+	---help---
+	Allows an open-channel SSD to be exposed as a block device to the
+	host. The target is implemented using a linear mapping table and
+	cost-based garbage collection. It is optimized for 4K IO sizes.
+
+	See Documentation/nvm-rrpc.txt for details.
+
+endif # NVM
diff --git a/drivers/lightnvm/Makefile b/drivers/lightnvm/Makefile
new file mode 100644
index 0000000..80d75a8
--- /dev/null
+++ b/drivers/lightnvm/Makefile
@@ -0,0 +1,5 @@
+#
+# Makefile for Open-Channel SSDs.
+#
+
+obj-$(CONFIG_NVM)		+= rrpc.o
diff --git a/drivers/lightnvm/rrpc.c b/drivers/lightnvm/rrpc.c
new file mode 100644
index 0000000..180cb09
--- /dev/null
+++ b/drivers/lightnvm/rrpc.c
@@ -0,0 +1,1222 @@
+/*
+ * Copyright (C) 2015 IT University of Copenhagen
+ * Initial release: Matias Bjorling <mabj@itu.dk>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License version
+ * 2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; see the file COPYING.  If not, write to
+ * the Free Software Foundation, 675 Mass Ave, Cambridge, MA 02139,
+ * USA.
+ *
+ * Implementation of a Round-robin page-based Hybrid FTL for Open-channel SSDs.
+ */
+
+#include "rrpc.h"
+
+static struct kmem_cache *_addr_cache;
+static struct kmem_cache *_gcb_cache;
+static DECLARE_RWSEM(_lock);
+
+#define rrpc_for_each_lun(rrpc, rlun, i) \
+		for ((i) = 0, rlun = &(rrpc)->luns[0]; \
+			(i) < (rrpc)->nr_luns; (i)++, rlun = &(rrpc)->luns[(i)])
+
+static void invalidate_block_page(struct nvm_addr *p)
+{
+	struct nvm_block *block = p->block;
+	unsigned int page_offset;
+
+	if (!block)
+		return;
+
+	spin_lock(&block->lock);
+	page_offset = p->addr % block->lun->nr_pages_per_blk;
+	WARN_ON(test_and_set_bit(page_offset, block->invalid_pages));
+	block->nr_invalid_pages++;
+	spin_unlock(&block->lock);
+}
+
+static inline void __nvm_page_invalidate(struct rrpc *rrpc, struct nvm_addr *a)
+{
+	BUG_ON(!spin_is_locked(&rrpc->rev_lock));
+	if (a->addr == ADDR_EMPTY)
+		return;
+
+	invalidate_block_page(a);
+	rrpc->rev_trans_map[a->addr - rrpc->poffset].addr = ADDR_EMPTY;
+}
+
+static void rrpc_invalidate_range(struct rrpc *rrpc, sector_t slba,
+								unsigned len)
+{
+	sector_t i;
+
+	spin_lock(&rrpc->rev_lock);
+	for (i = slba; i < slba + len; i++) {
+		struct nvm_addr *gp = &rrpc->trans_map[i];
+
+		__nvm_page_invalidate(rrpc, gp);
+		gp->block = NULL;
+	}
+	spin_unlock(&rrpc->rev_lock);
+}
+
+static struct request *rrpc_inflight_laddr_acquire(struct rrpc *rrpc,
+					sector_t laddr, unsigned int pages)
+{
+	struct request *rq;
+	struct rrpc_inflight_rq *inf;
+
+	rq = blk_mq_alloc_request(rrpc->q_dev, READ, GFP_NOIO, false);
+	if (!rq)
+		return ERR_PTR(-ENOMEM);
+
+	inf = rrpc_get_inflight_rq(rq);
+	if (rrpc_lock_laddr(rrpc, laddr, pages, inf)) {
+		blk_mq_free_request(rq);
+		return NULL;
+	}
+
+	return rq;
+}
+
+static void rrpc_inflight_laddr_release(struct rrpc *rrpc, struct request *rq)
+{
+	struct rrpc_inflight_rq *inf;
+
+	inf = rrpc_get_inflight_rq(rq);
+	rrpc_unlock_laddr(rrpc, inf->l_start, inf);
+
+	blk_mq_free_request(rq);
+}
+
+static void rrpc_discard(struct rrpc *rrpc, struct bio *bio)
+{
+	sector_t slba = bio->bi_iter.bi_sector / NR_PHY_IN_LOG;
+	sector_t len = bio->bi_iter.bi_size / EXPOSED_PAGE_SIZE;
+	struct request *rq;
+
+	do {
+		rq = rrpc_inflight_laddr_acquire(rrpc, slba, len);
+		schedule();
+	} while (!rq);
+
+	if (IS_ERR(rq)) {
+		bio_io_error(bio);
+		return;
+	}
+
+	rrpc_invalidate_range(rrpc, slba, len);
+	rrpc_inflight_laddr_release(rrpc, rq);
+}
+
+/* requires lun->lock taken */
+static void rrpc_set_lun_cur(struct rrpc_lun *rlun, struct nvm_block *block)
+{
+	BUG_ON(!block);
+
+	if (rlun->cur) {
+		spin_lock(&rlun->cur->lock);
+		WARN_ON(!block_is_full(rlun->cur));
+		spin_unlock(&rlun->cur->lock);
+	}
+	rlun->cur = block;
+}
+
+static struct rrpc_lun *get_next_lun(struct rrpc *rrpc)
+{
+	int next = atomic_inc_return(&rrpc->next_lun);
+
+	return &rrpc->luns[next % rrpc->nr_luns];
+}
+
+static void rrpc_gc_kick(struct rrpc *rrpc)
+{
+	struct rrpc_lun *rlun;
+	unsigned int i;
+
+	for (i = 0; i < rrpc->nr_luns; i++) {
+		rlun = &rrpc->luns[i];
+		queue_work(rrpc->krqd_wq, &rlun->ws_gc);
+	}
+}
+
+/**
+ * rrpc_gc_timer - default gc timer function.
+ * @data: ptr to the 'rrpc' structure
+ *
+ * Description:
+ *   rrpc configures a timer to kick the GC to force proactive behavior.
+ *
+ **/
+static void rrpc_gc_timer(unsigned long data)
+{
+	struct rrpc *rrpc = (struct rrpc *)data;
+
+	rrpc_gc_kick(rrpc);
+	mod_timer(&rrpc->gc_timer, jiffies + msecs_to_jiffies(10));
+}
+
+static void rrpc_end_sync_bio(struct bio *bio, int error)
+{
+	struct completion *waiting = bio->bi_private;
+
+	if (error)
+		pr_err("nvm: gc request failed.\n");
+
+	complete(waiting);
+}
+
+/*
+ * rrpc_move_valid_pages -- migrate live data off the block
+ * @rrpc: the 'rrpc' structure
+ * @block: the block from which to migrate live pages
+ *
+ * Description:
+ *   GC algorithms may call this function to migrate remaining live
+ *   pages off the block prior to erasing it. This function blocks
+ *   further execution until the operation is complete.
+ */
+static int rrpc_move_valid_pages(struct rrpc *rrpc, struct nvm_block *block)
+{
+	struct request_queue *q = rrpc->q_dev;
+	struct nvm_lun *lun = block->lun;
+	struct nvm_rev_addr *rev;
+	struct bio *bio;
+	struct request *rq;
+	struct page *page;
+	int slot;
+	sector_t phys_addr;
+	DECLARE_COMPLETION_ONSTACK(wait);
+
+	if (bitmap_full(block->invalid_pages, lun->nr_pages_per_blk))
+		return 0;
+
+	bio = bio_alloc(GFP_NOIO, 1);
+	if (!bio) {
+		pr_err("nvm: could not alloc bio on gc\n");
+		return -ENOMEM;
+	}
+
+	page = mempool_alloc(rrpc->page_pool, GFP_NOIO);
+
+	while ((slot = find_first_zero_bit(block->invalid_pages,
+					   lun->nr_pages_per_blk)) <
+						lun->nr_pages_per_blk) {
+
+		/* Lock laddr */
+		phys_addr = block_to_addr(block) + slot;
+
+try:
+		spin_lock(&rrpc->rev_lock);
+		/* Get logical address from physical to logical table */
+		rev = &rrpc->rev_trans_map[phys_addr - rrpc->poffset];
+		/* already updated by previous regular write */
+		if (rev->addr == ADDR_EMPTY) {
+			spin_unlock(&rrpc->rev_lock);
+			continue;
+		}
+
+		rq = rrpc_inflight_laddr_acquire(rrpc, rev->addr, 1);
+		if (!rq) {
+			spin_unlock(&rrpc->rev_lock);
+			schedule();
+			goto try;
+		}
+
+		spin_unlock(&rrpc->rev_lock);
+
+		/* Perform read to do GC */
+		bio->bi_iter.bi_sector = nvm_get_sector(rev->addr);
+		bio->bi_rw |= (READ | REQ_NVM_NO_INFLIGHT);
+		bio->bi_private = &wait;
+		bio->bi_end_io = rrpc_end_sync_bio;
+		bio->bi_nvm = &rrpc->payload;
+
+		/* TODO: may fail when EXP_PG_SIZE > PAGE_SIZE */
+		bio_add_pc_page(q, bio, page, EXPOSED_PAGE_SIZE, 0);
+
+		/* execute read */
+		q->make_request_fn(q, bio);
+		wait_for_completion_io(&wait);
+
+		/* and write it back */
+		bio_reset(bio);
+		reinit_completion(&wait);
+
+		bio->bi_iter.bi_sector = nvm_get_sector(rev->addr);
+		bio->bi_rw |= (WRITE | REQ_NVM_NO_INFLIGHT);
+		bio->bi_private = &wait;
+		bio->bi_end_io = rrpc_end_sync_bio;
+		bio->bi_nvm = &rrpc->payload;
+		/* TODO: may fail when EXP_PG_SIZE > PAGE_SIZE */
+		bio_add_pc_page(q, bio, page, EXPOSED_PAGE_SIZE, 0);
+
+		q->make_request_fn(q, bio);
+		wait_for_completion_io(&wait);
+
+		rrpc_inflight_laddr_release(rrpc, rq);
+
+		/* reset structures for next run */
+		reinit_completion(&wait);
+		bio_reset(bio);
+	}
+
+	mempool_free(page, rrpc->page_pool);
+	bio_put(bio);
+
+	if (!bitmap_full(block->invalid_pages, lun->nr_pages_per_blk)) {
+		pr_err("nvm: failed to garbage collect block\n");
+		return -EIO;
+	}
+
+	return 0;
+}
+
+static void rrpc_block_gc(struct work_struct *work)
+{
+	struct rrpc_block_gc *gcb = container_of(work, struct rrpc_block_gc,
+									ws_gc);
+	struct rrpc *rrpc = gcb->rrpc;
+	struct nvm_block *block = gcb->block;
+	struct nvm_dev *dev = rrpc->q_nvm;
+
+	pr_debug("nvm: block '%d' being reclaimed\n", block->id);
+
+	if (rrpc_move_valid_pages(rrpc, block))
+		goto done;
+
+	blk_nvm_erase_blk(dev, block);
+	blk_nvm_put_blk(block);
+done:
+	mempool_free(gcb, rrpc->gcb_pool);
+}
+
+/* returns the block with the higher number of invalid pages of the two,
+ * i.e. the cheaper block to garbage collect */
+static struct rrpc_block *rblock_max_invalid(struct rrpc_block *ra,
+					       struct rrpc_block *rb)
+{
+	struct nvm_block *a = ra->parent;
+	struct nvm_block *b = rb->parent;
+
+	BUG_ON(!a || !b);
+
+	if (a->nr_invalid_pages == b->nr_invalid_pages)
+		return ra;
+
+	return (a->nr_invalid_pages < b->nr_invalid_pages) ? rb : ra;
+}
+
+/* linearly find the block with the highest number of invalid pages
+ * requires lun->lock */
+static struct rrpc_block *block_prio_find_max(struct rrpc_lun *rlun)
+{
+	struct list_head *prio_list = &rlun->prio_list;
+	struct rrpc_block *rblock, *max;
+
+	BUG_ON(list_empty(prio_list));
+
+	max = list_first_entry(prio_list, struct rrpc_block, prio);
+	list_for_each_entry(rblock, prio_list, prio)
+		max = rblock_max_invalid(max, rblock);
+
+	return max;
+}
+
+static void rrpc_lun_gc(struct work_struct *work)
+{
+	struct rrpc_lun *rlun = container_of(work, struct rrpc_lun, ws_gc);
+	struct rrpc *rrpc = rlun->rrpc;
+	struct nvm_lun *lun = rlun->parent;
+	struct rrpc_block_gc *gcb;
+	unsigned int nr_blocks_need;
+
+	nr_blocks_need = lun->nr_blocks / GC_LIMIT_INVERSE;
+
+	if (nr_blocks_need < rrpc->nr_luns)
+		nr_blocks_need = rrpc->nr_luns;
+
+	spin_lock(&lun->lock);
+	while (nr_blocks_need > lun->nr_free_blocks &&
+					!list_empty(&rlun->prio_list)) {
+		struct rrpc_block *rblock = block_prio_find_max(rlun);
+		struct nvm_block *block = rblock->parent;
+
+		if (!block->nr_invalid_pages)
+			break;
+
+		list_del_init(&rblock->prio);
+
+		BUG_ON(!block_is_full(block));
+
+		pr_debug("nvm: selected block '%d' as GC victim\n",
+								block->id);
+
+		gcb = mempool_alloc(rrpc->gcb_pool, GFP_ATOMIC);
+		if (!gcb)
+			break;
+
+		gcb->rrpc = rrpc;
+		gcb->block = rblock->parent;
+		INIT_WORK(&gcb->ws_gc, rrpc_block_gc);
+
+		queue_work(rrpc->kgc_wq, &gcb->ws_gc);
+
+		nr_blocks_need--;
+	}
+	spin_unlock(&lun->lock);
+
+	/* TODO: Hint that request queue can be started again */
+}
+
+static void rrpc_gc_queue(struct work_struct *work)
+{
+	struct rrpc_block_gc *gcb = container_of(work, struct rrpc_block_gc,
+									ws_gc);
+	struct rrpc *rrpc = gcb->rrpc;
+	struct nvm_block *block = gcb->block;
+	struct nvm_lun *lun = block->lun;
+	struct rrpc_lun *rlun = &rrpc->luns[lun->id - rrpc->lun_offset];
+	struct rrpc_block *rblock =
+			&rlun->blocks[block->id % lun->nr_blocks];
+
+	spin_lock(&rlun->lock);
+	list_add_tail(&rblock->prio, &rlun->prio_list);
+	spin_unlock(&rlun->lock);
+
+	mempool_free(gcb, rrpc->gcb_pool);
+	pr_debug("nvm: block '%d' is full, allow GC (sched)\n", block->id);
+}
+
+static int rrpc_ioctl(struct block_device *bdev, fmode_t mode, unsigned int cmd,
+							unsigned long arg)
+{
+	return 0;
+}
+
+static int rrpc_open(struct block_device *bdev, fmode_t mode)
+{
+	return 0;
+}
+
+static void rrpc_release(struct gendisk *disk, fmode_t mode)
+{
+}
+
+static const struct block_device_operations rrpc_fops = {
+	.owner		= THIS_MODULE,
+	.ioctl		= rrpc_ioctl,
+	.open		= rrpc_open,
+	.release	= rrpc_release,
+};
+
+static struct rrpc_lun *__rrpc_get_lun_rr(struct rrpc *rrpc, int is_gc)
+{
+	unsigned int i;
+	struct rrpc_lun *rlun, *max_free;
+
+	if (!is_gc)
+		return get_next_lun(rrpc);
+
+	/* FIXME */
+	/* during GC, we don't care about RR, instead we want to make
+	 * sure that we maintain evenness between the block luns. */
+	max_free = &rrpc->luns[0];
+	/* prevent the GC-ing lun from devouring pages of a lun with
+	 * few free blocks. We don't take the lock as we only need an
+	 * estimate. */
+	rrpc_for_each_lun(rrpc, rlun, i) {
+		if (rlun->parent->nr_free_blocks >
+					max_free->parent->nr_free_blocks)
+			max_free = rlun;
+	}
+
+	return max_free;
+}
+
+static inline void __rrpc_page_invalidate(struct rrpc *rrpc,
+							struct nvm_addr *gp)
+{
+	BUG_ON(!spin_is_locked(&rrpc->rev_lock));
+	if (gp->addr == ADDR_EMPTY)
+		return;
+
+	invalidate_block_page(gp);
+	rrpc->rev_trans_map[gp->addr - rrpc->poffset].addr = ADDR_EMPTY;
+}
+
+void nvm_update_map(struct rrpc *rrpc, sector_t l_addr, struct nvm_addr *p,
+					int is_gc)
+{
+	struct nvm_addr *gp;
+	struct nvm_rev_addr *rev;
+
+	BUG_ON(l_addr >= rrpc->nr_pages);
+
+	gp = &rrpc->trans_map[l_addr];
+	spin_lock(&rrpc->rev_lock);
+	if (gp->block)
+		__nvm_page_invalidate(rrpc, gp);
+
+	gp->addr = p->addr;
+	gp->block = p->block;
+
+	rev = &rrpc->rev_trans_map[p->addr - rrpc->poffset];
+	rev->addr = l_addr;
+	spin_unlock(&rrpc->rev_lock);
+}
+
+/* Simple round-robin Logical to physical address translation.
+ *
+ * Retrieve the mapping using the active append point. Then update the ap for
+ * the next write to the disk.
+ *
+ * Returns nvm_addr with the physical address and block. Remember to return to
+ * rrpc->addr_cache when request is finished.
+ */
+static struct nvm_addr *rrpc_map_page(struct rrpc *rrpc, sector_t laddr,
+								int is_gc)
+{
+	struct nvm_addr *p;
+	struct rrpc_lun *rlun;
+	struct nvm_lun *lun;
+	struct nvm_block *p_block;
+	sector_t p_addr;
+
+	p = mempool_alloc(rrpc->addr_pool, GFP_ATOMIC);
+	if (!p) {
+		pr_err("rrpc: address pool run out of space\n");
+		return NULL;
+	}
+
+	rlun = __rrpc_get_lun_rr(rrpc, is_gc);
+	lun = rlun->parent;
+
+	if (!is_gc && lun->nr_free_blocks < rrpc->nr_luns * 4) {
+		mempool_free(p, rrpc->addr_pool);
+		return NULL;
+	}
+
+	spin_lock(&rlun->lock);
+
+	p_block = rlun->cur;
+	p_addr = blk_nvm_alloc_addr(p_block);
+
+	if (p_addr == ADDR_EMPTY) {
+		p_block = blk_nvm_get_blk(lun, 0);
+
+		if (!p_block) {
+			if (is_gc) {
+				p_addr = blk_nvm_alloc_addr(rlun->gc_cur);
+				if (p_addr == ADDR_EMPTY) {
+					p_block = blk_nvm_get_blk(lun, 1);
+					if (!p_block) {
+						pr_err("rrpc: no more blocks");
+						goto finished;
+					} else {
+						rlun->gc_cur = p_block;
+						p_addr =
+					       blk_nvm_alloc_addr(rlun->gc_cur);
+					}
+				}
+				p_block = rlun->gc_cur;
+			}
+			goto finished;
+		}
+
+		rrpc_set_lun_cur(rlun, p_block);
+		p_addr = blk_nvm_alloc_addr(p_block);
+	}
+
+finished:
+	if (p_addr == ADDR_EMPTY)
+		goto err;
+
+	p->addr = p_addr;
+	p->block = p_block;
+
+	if (!p_block)
+		WARN_ON(is_gc);
+
+	spin_unlock(&rlun->lock);
+	if (p)
+		nvm_update_map(rrpc, laddr, p, is_gc);
+	return p;
+err:
+	spin_unlock(&rlun->lock);
+	mempool_free(p, rrpc->addr_pool);
+	return NULL;
+}
+
+static void __rrpc_unprep_rq(struct rrpc *rrpc, struct request *rq)
+{
+	struct nvm_per_rq *pb = get_per_rq_data(rq);
+	struct nvm_addr *p = pb->addr;
+	struct nvm_block *block = p->block;
+	struct nvm_lun *lun = block->lun;
+	struct rrpc_block_gc *gcb;
+	int cmnt_size;
+
+	rrpc_unlock_rq(rrpc, rq);
+
+	if (rq_data_dir(rq) == WRITE) {
+		cmnt_size = atomic_inc_return(&block->data_cmnt_size);
+		if (likely(cmnt_size != lun->nr_pages_per_blk))
+			goto done;
+
+		gcb = mempool_alloc(rrpc->gcb_pool, GFP_ATOMIC);
+		if (!gcb) {
+			pr_err("rrpc: not able to queue block for gc.");
+			goto done;
+		}
+
+		gcb->rrpc = rrpc;
+		gcb->block = block;
+		INIT_WORK(&gcb->ws_gc, rrpc_gc_queue);
+
+		queue_work(rrpc->kgc_wq, &gcb->ws_gc);
+	}
+
+done:
+	mempool_free(pb->addr, rrpc->addr_pool);
+}
+
+static void rrpc_unprep_rq(struct request_queue *q, struct request *rq)
+{
+	struct rrpc *rrpc;
+	struct bio *bio;
+
+	bio = rq->bio;
+	if (unlikely(!bio))
+		return;
+
+	rrpc = container_of(bio->bi_nvm, struct rrpc, payload);
+
+	if (rq->cmd_flags & REQ_NVM_MAPPED)
+		__rrpc_unprep_rq(rrpc, rq);
+}
+
+/* Look up the primary translation table. If there isn't an associated block
+ * to the addr, we assume there is no data and don't take a ref. */
+static struct nvm_addr *rrpc_lookup_ltop(struct rrpc *rrpc, sector_t laddr)
+{
+	struct nvm_addr *gp, *p;
+
+	BUG_ON(laddr >= rrpc->nr_pages);
+
+	p = mempool_alloc(rrpc->addr_pool, GFP_ATOMIC);
+	if (!p)
+		return NULL;
+
+	gp = &rrpc->trans_map[laddr];
+
+	p->addr = gp->addr;
+	p->block = gp->block;
+
+	return p;
+}
+
+static int rrpc_requeue_and_kick(struct rrpc *rrpc, struct request *rq)
+{
+	blk_mq_requeue_request(rq);
+	blk_mq_kick_requeue_list(rrpc->q_dev);
+	return BLK_MQ_RQ_QUEUE_DONE;
+}
+
+static int rrpc_read_rq(struct rrpc *rrpc, struct request *rq)
+{
+	struct nvm_addr *p;
+	struct nvm_per_rq *pb;
+	sector_t l_addr = nvm_get_laddr(rq);
+
+	if (rrpc_lock_rq(rrpc, rq))
+		return BLK_MQ_RQ_QUEUE_BUSY;
+
+	p = rrpc_lookup_ltop(rrpc, l_addr);
+	if (!p) {
+		rrpc_unlock_rq(rrpc, rq);
+		return BLK_MQ_RQ_QUEUE_BUSY;
+	}
+
+	if (p->block)
+		rq->phys_sector = nvm_get_sector(p->addr) +
+					(blk_rq_pos(rq) % NR_PHY_IN_LOG);
+	else {
+		rrpc_unlock_rq(rrpc, rq);
+		blk_mq_end_request(rq, 0);
+		return BLK_MQ_RQ_QUEUE_DONE;
+	}
+
+	pb = get_per_rq_data(rq);
+	pb->addr = p;
+
+	return BLK_MQ_RQ_QUEUE_OK;
+}
+
+static int rrpc_write_rq(struct rrpc *rrpc, struct request *rq)
+{
+	struct nvm_per_rq *pb;
+	struct nvm_addr *p;
+	int is_gc = 0;
+	sector_t l_addr = nvm_get_laddr(rq);
+
+	if (rq->cmd_flags & REQ_NVM_NO_INFLIGHT)
+		is_gc = 1;
+
+	if (rrpc_lock_rq(rrpc, rq))
+		return rrpc_requeue_and_kick(rrpc, rq);
+
+	p = rrpc_map_page(rrpc, l_addr, is_gc);
+	if (!p) {
+		BUG_ON(is_gc);
+		rrpc_unlock_rq(rrpc, rq);
+		rrpc_gc_kick(rrpc);
+		return rrpc_requeue_and_kick(rrpc, rq);
+	}
+
+	rq->phys_sector = nvm_get_sector(p->addr);
+
+	pb = get_per_rq_data(rq);
+	pb->addr = p;
+
+	return BLK_MQ_RQ_QUEUE_OK;
+}
+
+static int __rrpc_prep_rq(struct rrpc *rrpc, struct request *rq)
+{
+	int rw = rq_data_dir(rq);
+	int ret;
+
+	if (rw == WRITE)
+		ret = rrpc_write_rq(rrpc, rq);
+	else
+		ret = rrpc_read_rq(rrpc, rq);
+
+	if (!ret)
+		rq->cmd_flags |= (REQ_NVM_MAPPED|REQ_DONTPREP);
+
+	return ret;
+}
+
+static int rrpc_prep_rq(struct request_queue *q, struct request *rq)
+{
+	struct rrpc *rrpc;
+	struct bio *bio;
+
+	bio = rq->bio;
+	if (unlikely(!bio))
+		return 0;
+
+	if (unlikely(!bio->bi_nvm)) {
+		if (bio_data_dir(bio) == WRITE) {
+			pr_warn("nvm: attempting to write without FTL.\n");
+			return BLK_MQ_RQ_QUEUE_ERROR;
+		}
+		return BLK_MQ_RQ_QUEUE_OK;
+	}
+
+	rrpc = container_of(bio->bi_nvm, struct rrpc, payload);
+
+	return __rrpc_prep_rq(rrpc, rq);
+}
+
+static void rrpc_make_rq(struct request_queue *q, struct bio *bio)
+{
+	struct rrpc *rrpc = q->queuedata;
+
+	if (bio->bi_rw & REQ_DISCARD) {
+		rrpc_discard(rrpc, bio);
+		return;
+	}
+
+	bio->bi_nvm = &rrpc->payload;
+	bio->bi_bdev = rrpc->q_bdev;
+
+	generic_make_request(bio);
+}
+
+static void rrpc_gc_free(struct rrpc *rrpc)
+{
+	struct rrpc_lun *rlun;
+	int i;
+
+	if (rrpc->krqd_wq)
+		destroy_workqueue(rrpc->krqd_wq);
+
+	if (rrpc->kgc_wq)
+		destroy_workqueue(rrpc->kgc_wq);
+
+	if (!rrpc->luns)
+		return;
+
+	for (i = 0; i < rrpc->nr_luns; i++) {
+		rlun = &rrpc->luns[i];
+
+		if (!rlun->blocks)
+			break;
+		vfree(rlun->blocks);
+	}
+}
+
+static int rrpc_gc_init(struct rrpc *rrpc)
+{
+	rrpc->krqd_wq = alloc_workqueue("knvm-work", WQ_MEM_RECLAIM|WQ_UNBOUND,
+						rrpc->nr_luns);
+	if (!rrpc->krqd_wq)
+		return -ENOMEM;
+
+	rrpc->kgc_wq = alloc_workqueue("knvm-gc", WQ_MEM_RECLAIM, 1);
+	if (!rrpc->kgc_wq)
+		return -ENOMEM;
+
+	setup_timer(&rrpc->gc_timer, rrpc_gc_timer, (unsigned long)rrpc);
+
+	return 0;
+}
+
+static void rrpc_map_free(struct rrpc *rrpc)
+{
+	vfree(rrpc->rev_trans_map);
+	vfree(rrpc->trans_map);
+}
+
+static int rrpc_l2p_update(u64 slba, u64 nlb, u64 *entries, void *private)
+{
+	struct rrpc *rrpc = (struct rrpc *)private;
+	struct nvm_dev *dev = rrpc->q_nvm;
+	struct nvm_addr *addr = rrpc->trans_map + slba;
+	struct nvm_rev_addr *raddr = rrpc->rev_trans_map;
+	sector_t max_pages = dev->total_pages * (dev->sector_size >> 9);
+	u64 elba = slba + nlb;
+	u64 i;
+
+	if (unlikely(elba > dev->total_pages)) {
+		pr_err("nvm: L2P data from device is out of bounds!\n");
+		return -EINVAL;
+	}
+
+	for (i = 0; i < nlb; i++) {
+		u64 pba = le64_to_cpu(entries[i]);
+		/* LNVM treats address spaces as silos; LBA and PBA are
+		 * equally large and zero-indexed. */
+		if (unlikely(pba >= max_pages && pba != U64_MAX)) {
+			pr_err("nvm: L2P data entry is out of bounds!\n");
+			return -EINVAL;
+		}
+
+		/* Address zero is special: the first page of a disk is
+		 * protected, as it often holds internal device boot
+		 * information. */
+		if (!pba)
+			continue;
+
+		addr[i].addr = pba;
+		raddr[pba].addr = slba + i;
+	}
+
+	return 0;
+}
+
+static int rrpc_map_init(struct rrpc *rrpc)
+{
+	struct nvm_dev *dev = rrpc->q_nvm;
+	sector_t i;
+	int ret;
+
+	rrpc->trans_map = vzalloc(sizeof(struct nvm_addr) * rrpc->nr_pages);
+	if (!rrpc->trans_map)
+		return -ENOMEM;
+
+	rrpc->rev_trans_map = vmalloc(sizeof(struct nvm_rev_addr)
+							* rrpc->nr_pages);
+	if (!rrpc->rev_trans_map)
+		return -ENOMEM;
+
+	for (i = 0; i < rrpc->nr_pages; i++) {
+		struct nvm_addr *p = &rrpc->trans_map[i];
+		struct nvm_rev_addr *r = &rrpc->rev_trans_map[i];
+
+		p->addr = ADDR_EMPTY;
+		r->addr = ADDR_EMPTY;
+	}
+
+	if (!dev->ops->get_l2p_tbl)
+		return 0;
+
+	/* Bring up the mapping table from device */
+	ret = dev->ops->get_l2p_tbl(dev->q, 0, dev->total_pages,
+							rrpc_l2p_update, rrpc);
+	if (ret) {
+		pr_err("nvm: rrpc: could not read L2P table.\n");
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+
+/* Minimum pages needed within a lun */
+#define PAGE_POOL_SIZE 16
+#define ADDR_POOL_SIZE 64
+
+static int rrpc_core_init(struct rrpc *rrpc)
+{
+	int i;
+
+	down_write(&_lock);
+	if (!_addr_cache) {
+		_addr_cache = kmem_cache_create("nvm_addr_cache",
+				sizeof(struct nvm_addr), 0, 0, NULL);
+		if (!_addr_cache) {
+			up_write(&_lock);
+			return -ENOMEM;
+		}
+	}
+
+	if (!_gcb_cache) {
+		_gcb_cache = kmem_cache_create("nvm_gcb_cache",
+				sizeof(struct rrpc_block_gc), 0, 0, NULL);
+		if (!_gcb_cache) {
+			kmem_cache_destroy(_addr_cache);
+			up_write(&_lock);
+			return -ENOMEM;
+		}
+	}
+	up_write(&_lock);
+
+	rrpc->page_pool = mempool_create_page_pool(PAGE_POOL_SIZE, 0);
+	if (!rrpc->page_pool)
+		return -ENOMEM;
+
+	rrpc->addr_pool = mempool_create_slab_pool(ADDR_POOL_SIZE, _addr_cache);
+	if (!rrpc->addr_pool)
+		return -ENOMEM;
+
+	rrpc->gcb_pool = mempool_create_slab_pool(rrpc->q_nvm->nr_luns,
+								_gcb_cache);
+	if (!rrpc->gcb_pool)
+		return -ENOMEM;
+
+	for (i = 0; i < NVM_INFLIGHT_PARTITIONS; i++) {
+		struct nvm_inflight *map = &rrpc->inflight_map[i];
+
+		spin_lock_init(&map->lock);
+		INIT_LIST_HEAD(&map->reqs);
+	}
+
+	return 0;
+}
+
+static void rrpc_core_free(struct rrpc *rrpc)
+{
+	if (rrpc->addr_pool)
+		mempool_destroy(rrpc->addr_pool);
+	if (rrpc->page_pool)
+		mempool_destroy(rrpc->page_pool);
+
+	down_write(&_lock);
+	if (_addr_cache)
+		kmem_cache_destroy(_addr_cache);
+	if (_gcb_cache)
+		kmem_cache_destroy(_gcb_cache);
+	up_write(&_lock);
+}
+
+static void rrpc_luns_free(struct rrpc *rrpc)
+{
+	kfree(rrpc->luns);
+}
+
+static int rrpc_luns_init(struct rrpc *rrpc, int lun_begin, int lun_end)
+{
+	struct nvm_dev *dev = rrpc->q_nvm;
+	struct nvm_block *block;
+	struct rrpc_lun *rlun;
+	int i, j;
+
+	spin_lock_init(&rrpc->rev_lock);
+
+	rrpc->luns = kcalloc(rrpc->nr_luns, sizeof(struct rrpc_lun),
+								GFP_KERNEL);
+	if (!rrpc->luns)
+		return -ENOMEM;
+
+	/* 1:1 mapping */
+	for (i = 0; i < rrpc->nr_luns; i++) {
+		struct nvm_lun *lun = &dev->luns[i + lun_begin];
+
+		rlun = &rrpc->luns[i];
+		rlun->rrpc = rrpc;
+		rlun->parent = lun;
+		rlun->nr_blocks = lun->nr_blocks;
+
+		rrpc->total_blocks += lun->nr_blocks;
+		rrpc->nr_pages += lun->nr_blocks * lun->nr_pages_per_blk;
+
+		INIT_LIST_HEAD(&rlun->prio_list);
+		INIT_WORK(&rlun->ws_gc, rrpc_lun_gc);
+		spin_lock_init(&rlun->lock);
+
+		rlun->blocks = vzalloc(sizeof(struct rrpc_block) *
+						 rlun->nr_blocks);
+		if (!rlun->blocks)
+			goto err;
+
+		lun_for_each_block(lun, block, j) {
+			struct rrpc_block *rblock = &rlun->blocks[j];
+
+			rblock->parent = block;
+			INIT_LIST_HEAD(&rblock->prio);
+		}
+	}
+
+	return 0;
+err:
+	return -ENOMEM;
+}
+
+static void rrpc_free(struct rrpc *rrpc)
+{
+	rrpc_gc_free(rrpc);
+	rrpc_map_free(rrpc);
+	rrpc_core_free(rrpc);
+	rrpc_luns_free(rrpc);
+
+	kfree(rrpc);
+}
+
+static void rrpc_exit(void *private)
+{
+	struct rrpc *rrpc = private;
+
+	blkdev_put(rrpc->q_bdev, FMODE_WRITE | FMODE_READ);
+	del_timer(&rrpc->gc_timer);
+
+	flush_workqueue(rrpc->krqd_wq);
+	flush_workqueue(rrpc->kgc_wq);
+
+	rrpc_free(rrpc);
+}
+
+static sector_t rrpc_capacity(void *private)
+{
+	struct rrpc *rrpc = private;
+	struct nvm_lun *lun;
+	sector_t reserved;
+	int i, max_pages_per_blk = 0;
+
+	nvm_for_each_lun(rrpc->q_nvm, lun, i) {
+		if (lun->nr_pages_per_blk > max_pages_per_blk)
+			max_pages_per_blk = lun->nr_pages_per_blk;
+	}
+
+	/* cur, gc, and two emergency blocks for each lun */
+	reserved = rrpc->nr_luns * max_pages_per_blk * 4;
+
+	if (reserved > rrpc->nr_pages) {
+		pr_err("rrpc: not enough space available to expose storage.\n");
+		return 0;
+	}
+
+	return ((rrpc->nr_pages - reserved) / 10) * 9 * NR_PHY_IN_LOG;
+}
+
+/*
+ * Look up each physical page of the block in the reverse translation map and
+ * check whether it is still valid by comparing the logical-to-physical
+ * mapping against the physical address. Pages that are no longer referenced
+ * are marked invalid in the block's invalid_pages bitmap.
+ */
+static void rrpc_block_map_update(struct rrpc *rrpc, struct nvm_block *block)
+{
+	struct nvm_lun *lun = block->lun;
+	int offset;
+	struct nvm_addr *laddr;
+	sector_t paddr, pladdr;
+
+	for (offset = 0; offset < lun->nr_pages_per_blk; offset++) {
+		paddr = block_to_addr(block) + offset;
+
+		pladdr = rrpc->rev_trans_map[paddr].addr;
+		if (pladdr == ADDR_EMPTY)
+			continue;
+
+		laddr = &rrpc->trans_map[pladdr];
+
+		if (paddr == laddr->addr) {
+			laddr->block = block;
+		} else {
+			set_bit(offset, block->invalid_pages);
+			block->nr_invalid_pages++;
+		}
+	}
+}
+
+static int rrpc_blocks_init(struct rrpc *rrpc)
+{
+	struct nvm_dev *dev = rrpc->q_nvm;
+	struct nvm_lun *lun;
+	struct nvm_block *blk;
+	sector_t lun_iter, blk_iter;
+
+	for (lun_iter = 0; lun_iter < rrpc->nr_luns; lun_iter++) {
+		lun = &dev->luns[lun_iter + rrpc->lun_offset];
+
+		lun_for_each_block(lun, blk, blk_iter)
+			rrpc_block_map_update(rrpc, blk);
+	}
+
+	return 0;
+}
+
+static int rrpc_luns_configure(struct rrpc *rrpc)
+{
+	struct rrpc_lun *rlun;
+	struct nvm_block *blk;
+	int i;
+
+	for (i = 0; i < rrpc->nr_luns; i++) {
+		rlun = &rrpc->luns[i];
+
+		blk = blk_nvm_get_blk(rlun->parent, 0);
+		if (!blk)
+			return -EINVAL;
+
+		rrpc_set_lun_cur(rlun, blk);
+
+		/* Emergency gc block */
+		blk = blk_nvm_get_blk(rlun->parent, 1);
+		if (!blk)
+			return -EINVAL;
+		rlun->gc_cur = blk;
+	}
+
+	return 0;
+}
+
+static void *rrpc_init(struct request_queue *qdev,
+			struct request_queue *qtarget, struct gendisk *qdisk,
+			struct gendisk *tdisk, int lun_begin, int lun_end)
+{
+	struct nvm_dev *dev;
+	struct block_device *bdev;
+	struct rrpc *rrpc;
+	int ret;
+
+	if (!blk_queue_nvm(qdev)) {
+		pr_err("nvm: block device not supported.\n");
+		return ERR_PTR(-EINVAL);
+	}
+
+	bdev = bdget_disk(qdisk, 0);
+	if (blkdev_get(bdev, FMODE_WRITE | FMODE_READ, NULL)) {
+		pr_err("nvm: could not access backing device\n");
+		return ERR_PTR(-EINVAL);
+	}
+
+	dev = blk_nvm_get_dev(qdev);
+
+	rrpc = kzalloc(sizeof(struct rrpc), GFP_KERNEL);
+	if (!rrpc) {
+		blkdev_put(bdev, FMODE_WRITE | FMODE_READ);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	rrpc->q_dev = qdev;
+	rrpc->q_nvm = qdev->nvm;
+	rrpc->q_bdev = bdev;
+	rrpc->nr_luns = lun_end - lun_begin + 1;
+
+	/* simple round-robin strategy */
+	atomic_set(&rrpc->next_lun, -1);
+
+	ret = rrpc_luns_init(rrpc, lun_begin, lun_end);
+	if (ret) {
+		pr_err("nvm: could not initialize luns\n");
+		goto err;
+	}
+
+	rrpc->poffset = rrpc->luns[0].parent->nr_blocks *
+			rrpc->luns[0].parent->nr_pages_per_blk * lun_begin;
+	rrpc->lun_offset = lun_begin;
+
+	ret = rrpc_core_init(rrpc);
+	if (ret) {
+		pr_err("nvm: rrpc: could not initialize core\n");
+		goto err;
+	}
+
+	ret = rrpc_map_init(rrpc);
+	if (ret) {
+		pr_err("nvm: rrpc: could not initialize maps\n");
+		goto err;
+	}
+
+	ret = rrpc_blocks_init(rrpc);
+	if (ret) {
+		pr_err("nvm: rrpc: could not initialize state for blocks\n");
+		goto err;
+	}
+
+	ret = rrpc_luns_configure(rrpc);
+	if (ret) {
+		pr_err("nvm: rrpc: not enough blocks available in LUNs.\n");
+		goto err;
+	}
+
+	ret = rrpc_gc_init(rrpc);
+	if (ret) {
+		pr_err("nvm: rrpc: could not initialize gc\n");
+		goto err;
+	}
+
+	/* make sure to inherit the size from the underlying device */
+	blk_queue_logical_block_size(qtarget, queue_physical_block_size(qdev));
+	blk_queue_max_hw_sectors(qtarget, queue_max_hw_sectors(qdev));
+
+	pr_info("nvm: rrpc initialized with %u luns and %llu pages.\n",
+			rrpc->nr_luns, (unsigned long long)rrpc->nr_pages);
+
+	mod_timer(&rrpc->gc_timer, jiffies + msecs_to_jiffies(10));
+
+	return rrpc;
+err:
+	blkdev_put(bdev, FMODE_WRITE | FMODE_READ);
+	rrpc_free(rrpc);
+	return ERR_PTR(ret);
+}
+
+/* round robin, page-based FTL, and cost-based GC */
+static struct nvm_target_type tt_rrpc = {
+	.name		= "rrpc",
+
+	.make_rq	= rrpc_make_rq,
+	.prep_rq	= rrpc_prep_rq,
+	.unprep_rq	= rrpc_unprep_rq,
+
+	.capacity	= rrpc_capacity,
+
+	.init		= rrpc_init,
+	.exit		= rrpc_exit,
+};
+
+static int __init rrpc_module_init(void)
+{
+	return nvm_register_target(&tt_rrpc);
+}
+
+static void rrpc_module_exit(void)
+{
+	nvm_unregister_target(&tt_rrpc);
+}
+
+module_init(rrpc_module_init);
+module_exit(rrpc_module_exit);
+MODULE_LICENSE("GPL");
+MODULE_DESCRIPTION("Round-Robin Cost-based Hybrid Layer for Open-Channel SSDs");
diff --git a/drivers/lightnvm/rrpc.h b/drivers/lightnvm/rrpc.h
new file mode 100644
index 0000000..eeebc7f
--- /dev/null
+++ b/drivers/lightnvm/rrpc.h
@@ -0,0 +1,203 @@
+/*
+ * Copyright (C) 2015 Matias Bjørling.
+ *
+ * This file is released under the GPL.
+ */
+
+#ifndef RRPC_H_
+#define RRPC_H_
+
+#include <linux/blkdev.h>
+#include <linux/blk-mq.h>
+#include <linux/bio.h>
+#include <linux/module.h>
+#include <linux/kthread.h>
+
+#include <linux/lightnvm.h>
+
+/* We partition the translation map namespace into these pieces to track
+ * in-flight addresses. */
+#define NVM_INFLIGHT_PARTITIONS 1
+
+/* Run only GC if less than 1/X blocks are free */
+#define GC_LIMIT_INVERSE 10
+#define GC_TIME_SECS 100
+
+struct nvm_inflight {
+	spinlock_t lock;
+	struct list_head reqs;
+};
+
+struct rrpc_lun;
+
+struct rrpc_block {
+	struct nvm_block *parent;
+	struct list_head prio;
+};
+
+struct rrpc_lun {
+	struct rrpc *rrpc;
+	struct nvm_lun *parent;
+	struct nvm_block *cur, *gc_cur;
+	struct rrpc_block *blocks;	/* Reference to block allocation */
+	struct list_head prio_list;		/* Blocks that may be GC'ed */
+	struct work_struct ws_gc;
+
+	int nr_blocks;
+	spinlock_t lock;
+};
+
+struct rrpc {
+	struct bio_nvm_payload payload;
+
+	struct nvm_dev *q_nvm;
+	struct request_queue *q_dev;
+	struct block_device *q_bdev;
+
+	int nr_luns;
+	int lun_offset;
+	sector_t poffset; /* physical page offset */
+
+	struct rrpc_lun *luns;
+
+	/* calculated values */
+	unsigned long nr_pages;
+	unsigned long total_blocks;
+
+	/* Write strategy variables. Move these into a per-strategy
+	 * structure. */
+	atomic_t next_lun; /* Whenever a page is written, this is updated
+			    * to point to the next write lun */
+
+	/* Simple translation map of logical addresses to physical addresses.
+	 * The logical addresses are known by the host system, while the physical
+	 * addresses are used when writing to the disk block device. */
+	struct nvm_addr *trans_map;
+	/* also store a reverse map for garbage collection */
+	struct nvm_rev_addr *rev_trans_map;
+	spinlock_t rev_lock;
+
+	struct nvm_inflight inflight_map[NVM_INFLIGHT_PARTITIONS];
+
+	mempool_t *addr_pool;
+	mempool_t *page_pool;
+	mempool_t *gcb_pool;
+
+	struct timer_list gc_timer;
+	struct workqueue_struct *krqd_wq;
+	struct workqueue_struct *kgc_wq;
+
+	struct gc_blocks *gblks;
+	struct gc_luns *gluns;
+};
+
+struct rrpc_block_gc {
+	struct rrpc *rrpc;
+	struct nvm_block *block;
+	struct work_struct ws_gc;
+};
+
+static inline sector_t nvm_get_laddr(struct request *rq)
+{
+	return blk_rq_pos(rq) / NR_PHY_IN_LOG;
+}
+
+static inline sector_t nvm_get_sector(sector_t laddr)
+{
+	return laddr * NR_PHY_IN_LOG;
+}
+
+static inline void *get_per_rq_data(struct request *rq)
+{
+	struct request_queue *q = rq->q;
+
+	return blk_mq_rq_to_pdu(rq) + q->tag_set->cmd_size;
+}
+
+static inline int request_intersects(struct rrpc_inflight_rq *r,
+				sector_t laddr_start, sector_t laddr_end)
+{
+	return (laddr_end >= r->l_start && laddr_end <= r->l_end) &&
+		(laddr_start >= r->l_start && laddr_start <= r->l_end);
+}
+
+static int __rrpc_lock_laddr(struct rrpc *rrpc, sector_t laddr,
+			     unsigned pages, struct rrpc_inflight_rq *r)
+{
+	struct nvm_inflight *map =
+			&rrpc->inflight_map[laddr % NVM_INFLIGHT_PARTITIONS];
+	sector_t laddr_end = laddr + pages - 1;
+	struct rrpc_inflight_rq *rtmp;
+
+	spin_lock_irq(&map->lock);
+	list_for_each_entry(rtmp, &map->reqs, list) {
+		if (unlikely(request_intersects(rtmp, laddr, laddr_end))) {
+			/* existing, overlapping request, come back later */
+			spin_unlock_irq(&map->lock);
+			return 1;
+		}
+	}
+
+	r->l_start = laddr;
+	r->l_end = laddr_end;
+
+	list_add_tail(&r->list, &map->reqs);
+	spin_unlock_irq(&map->lock);
+	return 0;
+}
+
+static inline int rrpc_lock_laddr(struct rrpc *rrpc, sector_t laddr,
+				 unsigned pages,
+				 struct rrpc_inflight_rq *r)
+{
+	BUG_ON((laddr + pages) > rrpc->nr_pages);
+
+	return __rrpc_lock_laddr(rrpc, laddr, pages, r);
+}
+
+static inline struct rrpc_inflight_rq *rrpc_get_inflight_rq(struct request *rq)
+{
+	struct nvm_per_rq *pd = get_per_rq_data(rq);
+
+	return &pd->inflight_rq;
+}
+
+static inline int rrpc_lock_rq(struct rrpc *rrpc, struct request *rq)
+{
+	sector_t laddr = nvm_get_laddr(rq);
+	unsigned int pages = blk_rq_bytes(rq) / EXPOSED_PAGE_SIZE;
+	struct rrpc_inflight_rq *r = rrpc_get_inflight_rq(rq);
+
+	if (rq->cmd_flags & REQ_NVM_NO_INFLIGHT)
+		return 0;
+
+	return rrpc_lock_laddr(rrpc, laddr, pages, r);
+}
+
+static inline void rrpc_unlock_laddr(struct rrpc *rrpc, sector_t laddr,
+				    struct rrpc_inflight_rq *r)
+{
+	struct nvm_inflight *map =
+			&rrpc->inflight_map[laddr % NVM_INFLIGHT_PARTITIONS];
+	unsigned long flags;
+
+	spin_lock_irqsave(&map->lock, flags);
+	list_del_init(&r->list);
+	spin_unlock_irqrestore(&map->lock, flags);
+}
+
+static inline void rrpc_unlock_rq(struct rrpc *rrpc, struct request *rq)
+{
+	sector_t laddr = nvm_get_laddr(rq);
+	unsigned int pages = blk_rq_bytes(rq) / EXPOSED_PAGE_SIZE;
+	struct rrpc_inflight_rq *r = rrpc_get_inflight_rq(rq);
+
+	BUG_ON((laddr + pages) > rrpc->nr_pages);
+
+	if (rq->cmd_flags & REQ_NVM_NO_INFLIGHT)
+		return;
+
+	rrpc_unlock_laddr(rrpc, laddr, r);
+}
+
+#endif /* RRPC_H_ */
diff --git a/include/linux/lightnvm.h b/include/linux/lightnvm.h
index 888d994..5f9f187 100644
--- a/include/linux/lightnvm.h
+++ b/include/linux/lightnvm.h
@@ -29,7 +29,6 @@
 
 #define NVM_MSG_PREFIX "nvm"
 #define ADDR_EMPTY (~0ULL)
-#define LTOP_POISON 0xD3ADB33F
 
 /* core.c */
 
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 3/5 v2] lightnvm: RRPC target
@ 2015-04-15 12:34   ` Matias Bjørling
  0 siblings, 0 replies; 53+ messages in thread
From: Matias Bjørling @ 2015-04-15 12:34 UTC (permalink / raw)
  To: hch, axboe, linux-fsdevel, linux-kernel, linux-nvme
  Cc: javier, keith.busch, Matias Bjørling

This target implements a simple target to be used by Open-Channel SSDs.
It exposes the physical flash a generic sector-based address space.

The FTL implements a round-robin approach for sector allocation,
together with a greedy cost-based garbage collector.

Signed-off-by: Matias Bjørling <m@bjorling.me>
---
 drivers/Kconfig           |    2 +
 drivers/Makefile          |    2 +
 drivers/lightnvm/Kconfig  |   29 ++
 drivers/lightnvm/Makefile |    5 +
 drivers/lightnvm/rrpc.c   | 1222 +++++++++++++++++++++++++++++++++++++++++++++
 drivers/lightnvm/rrpc.h   |  203 ++++++++
 include/linux/lightnvm.h  |    1 -
 7 files changed, 1463 insertions(+), 1 deletion(-)
 create mode 100644 drivers/lightnvm/Kconfig
 create mode 100644 drivers/lightnvm/Makefile
 create mode 100644 drivers/lightnvm/rrpc.c
 create mode 100644 drivers/lightnvm/rrpc.h

diff --git a/drivers/Kconfig b/drivers/Kconfig
index c0cc96b..da47047 100644
--- a/drivers/Kconfig
+++ b/drivers/Kconfig
@@ -42,6 +42,8 @@ source "drivers/net/Kconfig"
 
 source "drivers/isdn/Kconfig"
 
+source "drivers/lightnvm/Kconfig"
+
 # input before char - char/joystick depends on it. As does USB.
 
 source "drivers/input/Kconfig"
diff --git a/drivers/Makefile b/drivers/Makefile
index 527a6da..6b6928a 100644
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -165,3 +165,5 @@ obj-$(CONFIG_RAS)		+= ras/
 obj-$(CONFIG_THUNDERBOLT)	+= thunderbolt/
 obj-$(CONFIG_CORESIGHT)		+= coresight/
 obj-$(CONFIG_ANDROID)		+= android/
+
+obj-$(CONFIG_NVM)		+= lightnvm/
diff --git a/drivers/lightnvm/Kconfig b/drivers/lightnvm/Kconfig
new file mode 100644
index 0000000..89fabe1
--- /dev/null
+++ b/drivers/lightnvm/Kconfig
@@ -0,0 +1,29 @@
+#
+# Open-Channel SSD NVM configuration
+#
+
+menuconfig NVM
+	bool "Open-Channel SSD target support"
+	depends on BLK_DEV_NVM
+	help
+	  Say Y here to get to enable Open-channel SSDs.
+
+	  Open-Channel SSDs implement a set of extension to SSDs, that
+	  exposes direct access to the underlying non-volatile memory.
+
+	  If you say N, all options in this submenu will be skipped and disabled
+	  only do this if you know what you are doing.
+
+if NVM
+
+config NVM_RRPC
+	tristate "Round-robin Hybrid Open-Channel SSD"
+	depends on BLK_DEV_NVM
+	---help---
+	Allows an open-channel SSD to be exposed as a block device to the
+	host. The target is implemented using a linear mapping table and
+	cost-based garbage collection. It is optimized for 4K IO sizes.
+
+	See Documentation/nvm-rrpc.txt for details.
+
+endif # NVM
diff --git a/drivers/lightnvm/Makefile b/drivers/lightnvm/Makefile
new file mode 100644
index 0000000..80d75a8
--- /dev/null
+++ b/drivers/lightnvm/Makefile
@@ -0,0 +1,5 @@
+#
+# Makefile for Open-Channel SSDs.
+#
+
+obj-$(CONFIG_NVM)		+= rrpc.o
diff --git a/drivers/lightnvm/rrpc.c b/drivers/lightnvm/rrpc.c
new file mode 100644
index 0000000..180cb09
--- /dev/null
+++ b/drivers/lightnvm/rrpc.c
@@ -0,0 +1,1222 @@
+/*
+ * Copyright (C) 2015 IT University of Copenhagen
+ * Initial release: Matias Bjorling <mabj@itu.dk>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License version
+ * 2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; see the file COPYING.  If not, write to
+ * the Free Software Foundation, 675 Mass Ave, Cambridge, MA 02139,
+ * USA.
+ *
+ * Implementation of a Round-robin page-based Hybrid FTL for Open-channel SSDs.
+ */
+
+#include "rrpc.h"
+
+static struct kmem_cache *_addr_cache;
+static struct kmem_cache *_gcb_cache;
+static DECLARE_RWSEM(_lock);
+
+#define rrpc_for_each_lun(rrpc, rlun, i) \
+		for ((i) = 0, rlun = &(rrpc)->luns[0]; \
+			(i) < (rrpc)->nr_luns; (i)++, rlun = &(rrpc)->luns[(i)])
+
+static void invalidate_block_page(struct nvm_addr *p)
+{
+	struct nvm_block *block = p->block;
+	unsigned int page_offset;
+
+	if (!block)
+		return;
+
+	spin_lock(&block->lock);
+	page_offset = p->addr % block->lun->nr_pages_per_blk;
+	WARN_ON(test_and_set_bit(page_offset, block->invalid_pages));
+	block->nr_invalid_pages++;
+	spin_unlock(&block->lock);
+}
+
+static inline void __nvm_page_invalidate(struct rrpc *rrpc, struct nvm_addr *a)
+{
+	BUG_ON(!spin_is_locked(&rrpc->rev_lock));
+	if (a->addr == ADDR_EMPTY)
+		return;
+
+	invalidate_block_page(a);
+	rrpc->rev_trans_map[a->addr - rrpc->poffset].addr = ADDR_EMPTY;
+}
+
+static void rrpc_invalidate_range(struct rrpc *rrpc, sector_t slba,
+								unsigned len)
+{
+	sector_t i;
+
+	spin_lock(&rrpc->rev_lock);
+	for (i = slba; i < slba + len; i++) {
+		struct nvm_addr *gp = &rrpc->trans_map[i];
+
+		__nvm_page_invalidate(rrpc, gp);
+		gp->block = NULL;
+	}
+	spin_unlock(&rrpc->rev_lock);
+}
+
+static struct request *rrpc_inflight_laddr_acquire(struct rrpc *rrpc,
+					sector_t laddr, unsigned int pages)
+{
+	struct request *rq;
+	struct rrpc_inflight_rq *inf;
+
+	rq = blk_mq_alloc_request(rrpc->q_dev, READ, GFP_NOIO, false);
+	if (!rq)
+		return ERR_PTR(-ENOMEM);
+
+	inf = rrpc_get_inflight_rq(rq);
+	if (rrpc_lock_laddr(rrpc, laddr, pages, inf)) {
+		blk_mq_free_request(rq);
+		return NULL;
+	}
+
+	return rq;
+}
+
+static void rrpc_inflight_laddr_release(struct rrpc *rrpc, struct request *rq)
+{
+	struct rrpc_inflight_rq *inf;
+
+	inf = rrpc_get_inflight_rq(rq);
+	rrpc_unlock_laddr(rrpc, inf->l_start, inf);
+
+	blk_mq_free_request(rq);
+}
+
+static void rrpc_discard(struct rrpc *rrpc, struct bio *bio)
+{
+	sector_t slba = bio->bi_iter.bi_sector / NR_PHY_IN_LOG;
+	sector_t len = bio->bi_iter.bi_size / EXPOSED_PAGE_SIZE;
+	struct request *rq;
+
+	do {
+		rq = rrpc_inflight_laddr_acquire(rrpc, slba, len);
+		schedule();
+	} while (!rq);
+
+	if (IS_ERR(rq)) {
+		bio_io_error(bio);
+		return;
+	}
+
+	rrpc_invalidate_range(rrpc, slba, len);
+	rrpc_inflight_laddr_release(rrpc, rq);
+}
+
+/* requires lun->lock taken */
+static void rrpc_set_lun_cur(struct rrpc_lun *rlun, struct nvm_block *block)
+{
+	BUG_ON(!block);
+
+	if (rlun->cur) {
+		spin_lock(&rlun->cur->lock);
+		WARN_ON(!block_is_full(rlun->cur));
+		spin_unlock(&rlun->cur->lock);
+	}
+	rlun->cur = block;
+}
+
+static struct rrpc_lun *get_next_lun(struct rrpc *rrpc)
+{
+	int next = atomic_inc_return(&rrpc->next_lun);
+
+	return &rrpc->luns[next % rrpc->nr_luns];
+}
+
+static void rrpc_gc_kick(struct rrpc *rrpc)
+{
+	struct rrpc_lun *rlun;
+	unsigned int i;
+
+	for (i = 0; i < rrpc->nr_luns; i++) {
+		rlun = &rrpc->luns[i];
+		queue_work(rrpc->krqd_wq, &rlun->ws_gc);
+	}
+}
+
+/**
+ * rrpc_gc_timer - default gc timer function.
+ * @data: ptr to the 'rrpc' structure
+ *
+ * Description:
+ *   rrpc configures a timer to kick the GC to force proactive behavior.
+ *
+ **/
+static void rrpc_gc_timer(unsigned long data)
+{
+	struct rrpc *rrpc = (struct rrpc *)data;
+
+	rrpc_gc_kick(rrpc);
+	mod_timer(&rrpc->gc_timer, jiffies + msecs_to_jiffies(10));
+}
+
+static void rrpc_end_sync_bio(struct bio *bio, int error)
+{
+	struct completion *waiting = bio->bi_private;
+
+	if (error)
+		pr_err("nvm: gc request failed.\n");
+
+	complete(waiting);
+}
+
+/*
+ * rrpc_move_valid_pages -- migrate live data off the block
+ * @rrpc: the 'rrpc' structure
+ * @block: the block from which to migrate live pages
+ *
+ * Description:
+ *   GC algorithms may call this function to migrate remaining live
+ *   pages off the block prior to erasing it. This function blocks
+ *   further execution until the operation is complete.
+ */
+static int rrpc_move_valid_pages(struct rrpc *rrpc, struct nvm_block *block)
+{
+	struct request_queue *q = rrpc->q_dev;
+	struct nvm_lun *lun = block->lun;
+	struct nvm_rev_addr *rev;
+	struct bio *bio;
+	struct request *rq;
+	struct page *page;
+	int slot;
+	sector_t phys_addr;
+	DECLARE_COMPLETION_ONSTACK(wait);
+
+	if (bitmap_full(block->invalid_pages, lun->nr_pages_per_blk))
+		return 0;
+
+	bio = bio_alloc(GFP_NOIO, 1);
+	if (!bio) {
+		pr_err("nvm: could not alloc bio on gc\n");
+		return -ENOMEM;
+	}
+
+	page = mempool_alloc(rrpc->page_pool, GFP_NOIO);
+
+	while ((slot = find_first_zero_bit(block->invalid_pages,
+					   lun->nr_pages_per_blk)) <
+						lun->nr_pages_per_blk) {
+
+		/* Lock laddr */
+		phys_addr = block_to_addr(block) + slot;
+
+try:
+		spin_lock(&rrpc->rev_lock);
+		/* Get logical address from physical to logical table */
+		rev = &rrpc->rev_trans_map[phys_addr - rrpc->poffset];
+		/* already updated by previous regular write */
+		if (rev->addr == ADDR_EMPTY) {
+			spin_unlock(&rrpc->rev_lock);
+			continue;
+		}
+
+		rq = rrpc_inflight_laddr_acquire(rrpc, rev->addr, 1);
+		if (!rq) {
+			spin_unlock(&rrpc->rev_lock);
+			schedule();
+			goto try;
+		}
+
+		spin_unlock(&rrpc->rev_lock);
+
+		/* Perform read to do GC */
+		bio->bi_iter.bi_sector = nvm_get_sector(rev->addr);
+		bio->bi_rw |= (READ | REQ_NVM_NO_INFLIGHT);
+		bio->bi_private = &wait;
+		bio->bi_end_io = rrpc_end_sync_bio;
+		bio->bi_nvm = &rrpc->payload;
+
+		/* TODO: may fail when EXP_PG_SIZE > PAGE_SIZE */
+		bio_add_pc_page(q, bio, page, EXPOSED_PAGE_SIZE, 0);
+
+		/* execute read */
+		q->make_request_fn(q, bio);
+		wait_for_completion_io(&wait);
+
+		/* and write it back */
+		bio_reset(bio);
+		reinit_completion(&wait);
+
+		bio->bi_iter.bi_sector = nvm_get_sector(rev->addr);
+		bio->bi_rw |= (WRITE | REQ_NVM_NO_INFLIGHT);
+		bio->bi_private = &wait;
+		bio->bi_end_io = rrpc_end_sync_bio;
+		bio->bi_nvm = &rrpc->payload;
+		/* TODO: may fail when EXP_PG_SIZE > PAGE_SIZE */
+		bio_add_pc_page(q, bio, page, EXPOSED_PAGE_SIZE, 0);
+
+		q->make_request_fn(q, bio);
+		wait_for_completion_io(&wait);
+
+		rrpc_inflight_laddr_release(rrpc, rq);
+
+		/* reset structures for next run */
+		reinit_completion(&wait);
+		bio_reset(bio);
+	}
+
+	mempool_free(page, rrpc->page_pool);
+	bio_put(bio);
+
+	if (!bitmap_full(block->invalid_pages, lun->nr_pages_per_blk)) {
+		pr_err("nvm: failed to garbage collect block\n");
+		return -EIO;
+	}
+
+	return 0;
+}
+
+static void rrpc_block_gc(struct work_struct *work)
+{
+	struct rrpc_block_gc *gcb = container_of(work, struct rrpc_block_gc,
+									ws_gc);
+	struct rrpc *rrpc = gcb->rrpc;
+	struct nvm_block *block = gcb->block;
+	struct nvm_dev *dev = rrpc->q_nvm;
+
+	pr_debug("nvm: block '%d' being reclaimed\n", block->id);
+
+	if (rrpc_move_valid_pages(rrpc, block))
+		goto done;
+
+	blk_nvm_erase_blk(dev, block);
+	blk_nvm_put_blk(block);
+done:
+	mempool_free(gcb, rrpc->gcb_pool);
+}
+
+/* the block with the highest number of invalid pages will be at the
+ * beginning of the list */
+static struct rrpc_block *rblock_max_invalid(struct rrpc_block *ra,
+					       struct rrpc_block *rb)
+{
+	struct nvm_block *a = ra->parent;
+	struct nvm_block *b = rb->parent;
+
+	BUG_ON(!a || !b);
+
+	if (a->nr_invalid_pages == b->nr_invalid_pages)
+		return ra;
+
+	return (a->nr_invalid_pages < b->nr_invalid_pages) ? rb : ra;
+}
+
+/* linearly find the block with highest number of invalid pages
+ * requires lun->lock */
+static struct rrpc_block *block_prio_find_max(struct rrpc_lun *rlun)
+{
+	struct list_head *prio_list = &rlun->prio_list;
+	struct rrpc_block *rblock, *max;
+
+	BUG_ON(list_empty(prio_list));
+
+	max = list_first_entry(prio_list, struct rrpc_block, prio);
+	list_for_each_entry(rblock, prio_list, prio)
+		max = rblock_max_invalid(max, rblock);
+
+	return max;
+}
+
+static void rrpc_lun_gc(struct work_struct *work)
+{
+	struct rrpc_lun *rlun = container_of(work, struct rrpc_lun, ws_gc);
+	struct rrpc *rrpc = rlun->rrpc;
+	struct nvm_lun *lun = rlun->parent;
+	struct rrpc_block_gc *gcb;
+	unsigned int nr_blocks_need;
+
+	nr_blocks_need = lun->nr_blocks / GC_LIMIT_INVERSE;
+
+	if (nr_blocks_need < rrpc->nr_luns)
+		nr_blocks_need = rrpc->nr_luns;
+
+	spin_lock(&lun->lock);
+	while (nr_blocks_need > lun->nr_free_blocks &&
+					!list_empty(&rlun->prio_list)) {
+		struct rrpc_block *rblock = block_prio_find_max(rlun);
+		struct nvm_block *block = rblock->parent;
+
+		if (!block->nr_invalid_pages)
+			break;
+
+		list_del_init(&rblock->prio);
+
+		BUG_ON(!block_is_full(block));
+
+		pr_debug("nvm: selected block '%d' as GC victim\n",
+								block->id);
+
+		gcb = mempool_alloc(rrpc->gcb_pool, GFP_ATOMIC);
+		if (!gcb)
+			break;
+
+		gcb->rrpc = rrpc;
+		gcb->block = rblock->parent;
+		INIT_WORK(&gcb->ws_gc, rrpc_block_gc);
+
+		queue_work(rrpc->kgc_wq, &gcb->ws_gc);
+
+		nr_blocks_need--;
+	}
+	spin_unlock(&lun->lock);
+
+	/* TODO: Hint that request queue can be started again */
+}
+
+static void rrpc_gc_queue(struct work_struct *work)
+{
+	struct rrpc_block_gc *gcb = container_of(work, struct rrpc_block_gc,
+									ws_gc);
+	struct rrpc *rrpc = gcb->rrpc;
+	struct nvm_block *block = gcb->block;
+	struct nvm_lun *lun = block->lun;
+	struct rrpc_lun *rlun = &rrpc->luns[lun->id - rrpc->lun_offset];
+	struct rrpc_block *rblock =
+			&rlun->blocks[block->id % lun->nr_blocks];
+
+	spin_lock(&rlun->lock);
+	list_add_tail(&rblock->prio, &rlun->prio_list);
+	spin_unlock(&rlun->lock);
+
+	mempool_free(gcb, rrpc->gcb_pool);
+	pr_debug("nvm: block '%d' is full, allow GC (sched)\n", block->id);
+}
+
+static int rrpc_ioctl(struct block_device *bdev, fmode_t mode, unsigned int cmd,
+							unsigned long arg)
+{
+	return 0;
+}
+
+static int rrpc_open(struct block_device *bdev, fmode_t mode)
+{
+	return 0;
+}
+
+static void rrpc_release(struct gendisk *disk, fmode_t mode)
+{
+}
+
+static const struct block_device_operations rrpc_fops = {
+	.owner		= THIS_MODULE,
+	.ioctl		= rrpc_ioctl,
+	.open		= rrpc_open,
+	.release	= rrpc_release,
+};
+
+static struct rrpc_lun *__rrpc_get_lun_rr(struct rrpc *rrpc, int is_gc)
+{
+	unsigned int i;
+	struct rrpc_lun *rlun, *max_free;
+
+	if (!is_gc)
+		return get_next_lun(rrpc);
+
+	/* FIXME */
+	/* during GC, we don't care about RR, instead we want to make
+	 * sure that we maintain evenness between the block luns. */
+	max_free = &rrpc->luns[0];
+	/* prevent the GC-ing lun from devouring pages of a lun with
+	 * few free blocks. We don't take the lock as we only need an
+	 * estimate. */
+	rrpc_for_each_lun(rrpc, rlun, i) {
+		if (rlun->parent->nr_free_blocks >
+					max_free->parent->nr_free_blocks)
+			max_free = rlun;
+	}
+
+	return max_free;
+}
+
+static inline void __rrpc_page_invalidate(struct rrpc *rrpc,
+							struct nvm_addr *gp)
+{
+	BUG_ON(!spin_is_locked(&rrpc->rev_lock));
+	if (gp->addr == ADDR_EMPTY)
+		return;
+
+	invalidate_block_page(gp);
+	rrpc->rev_trans_map[gp->addr - rrpc->poffset].addr = ADDR_EMPTY;
+}
+
+void nvm_update_map(struct rrpc *rrpc, sector_t l_addr, struct nvm_addr *p,
+					int is_gc)
+{
+	struct nvm_addr *gp;
+	struct nvm_rev_addr *rev;
+
+	BUG_ON(l_addr >= rrpc->nr_pages);
+
+	gp = &rrpc->trans_map[l_addr];
+	spin_lock(&rrpc->rev_lock);
+	if (gp->block)
+		__nvm_page_invalidate(rrpc, gp);
+
+	gp->addr = p->addr;
+	gp->block = p->block;
+
+	rev = &rrpc->rev_trans_map[p->addr - rrpc->poffset];
+	rev->addr = l_addr;
+	spin_unlock(&rrpc->rev_lock);
+}
+
+/* Simple round-robin Logical to physical address translation.
+ *
+ * Retrieve the mapping using the active append point. Then update the ap for
+ * the next write to the disk.
+ *
+ * Returns nvm_addr with the physical address and block. Remember to return to
+ * rrpc->addr_cache when request is finished.
+ */
+static struct nvm_addr *rrpc_map_page(struct rrpc *rrpc, sector_t laddr,
+								int is_gc)
+{
+	struct nvm_addr *p;
+	struct rrpc_lun *rlun;
+	struct nvm_lun *lun;
+	struct nvm_block *p_block;
+	sector_t p_addr;
+
+	p = mempool_alloc(rrpc->addr_pool, GFP_ATOMIC);
+	if (!p) {
+		pr_err("rrpc: address pool ran out of space\n");
+		return NULL;
+	}
+
+	rlun = __rrpc_get_lun_rr(rrpc, is_gc);
+	lun = rlun->parent;
+
+	if (!is_gc && lun->nr_free_blocks < rrpc->nr_luns * 4) {
+		mempool_free(p, rrpc->addr_pool);
+		return NULL;
+	}
+
+	spin_lock(&rlun->lock);
+
+	p_block = rlun->cur;
+	p_addr = blk_nvm_alloc_addr(p_block);
+
+	if (p_addr == ADDR_EMPTY) {
+		p_block = blk_nvm_get_blk(lun, 0);
+
+		if (!p_block) {
+			if (is_gc) {
+				p_addr = blk_nvm_alloc_addr(rlun->gc_cur);
+				if (p_addr == ADDR_EMPTY) {
+					p_block = blk_nvm_get_blk(lun, 1);
+					if (!p_block) {
+						pr_err("rrpc: no more blocks\n");
+						goto finished;
+					} else {
+						rlun->gc_cur = p_block;
+						p_addr =
+					       blk_nvm_alloc_addr(rlun->gc_cur);
+					}
+				}
+				p_block = rlun->gc_cur;
+			}
+			goto finished;
+		}
+
+		rrpc_set_lun_cur(rlun, p_block);
+		p_addr = blk_nvm_alloc_addr(p_block);
+	}
+
+finished:
+	if (p_addr == ADDR_EMPTY)
+		goto err;
+
+	p->addr = p_addr;
+	p->block = p_block;
+
+	if (!p_block)
+		WARN_ON(is_gc);
+
+	spin_unlock(&rlun->lock);
+	if (p)
+		nvm_update_map(rrpc, laddr, p, is_gc);
+	return p;
+err:
+	spin_unlock(&rlun->lock);
+	mempool_free(p, rrpc->addr_pool);
+	return NULL;
+}
+
+static void __rrpc_unprep_rq(struct rrpc *rrpc, struct request *rq)
+{
+	struct nvm_per_rq *pb = get_per_rq_data(rq);
+	struct nvm_addr *p = pb->addr;
+	struct nvm_block *block = p->block;
+	struct nvm_lun *lun = block->lun;
+	struct rrpc_block_gc *gcb;
+	int cmnt_size;
+
+	rrpc_unlock_rq(rrpc, rq);
+
+	if (rq_data_dir(rq) == WRITE) {
+		cmnt_size = atomic_inc_return(&block->data_cmnt_size);
+		if (likely(cmnt_size != lun->nr_pages_per_blk))
+			goto done;
+
+		gcb = mempool_alloc(rrpc->gcb_pool, GFP_ATOMIC);
+		if (!gcb) {
+			pr_err("rrpc: not able to queue block for gc.\n");
+			goto done;
+		}
+
+		gcb->rrpc = rrpc;
+		gcb->block = block;
+		INIT_WORK(&gcb->ws_gc, rrpc_gc_queue);
+
+		queue_work(rrpc->kgc_wq, &gcb->ws_gc);
+	}
+
+done:
+	mempool_free(pb->addr, rrpc->addr_pool);
+}
+
+static void rrpc_unprep_rq(struct request_queue *q, struct request *rq)
+{
+	struct rrpc *rrpc;
+	struct bio *bio;
+
+	bio = rq->bio;
+	if (unlikely(!bio))
+		return;
+
+	rrpc = container_of(bio->bi_nvm, struct rrpc, payload);
+
+	if (rq->cmd_flags & REQ_NVM_MAPPED)
+		__rrpc_unprep_rq(rrpc, rq);
+}
+
+/* Look up the primary translation table. If there isn't a block associated
+ * with the addr, we assume there is no data and do not take a ref. */
+static struct nvm_addr *rrpc_lookup_ltop(struct rrpc *rrpc, sector_t laddr)
+{
+	struct nvm_addr *gp, *p;
+
+	BUG_ON(laddr >= rrpc->nr_pages);
+
+	p = mempool_alloc(rrpc->addr_pool, GFP_ATOMIC);
+	if (!p)
+		return NULL;
+
+	gp = &rrpc->trans_map[laddr];
+
+	p->addr = gp->addr;
+	p->block = gp->block;
+
+	return p;
+}
+
+static int rrpc_requeue_and_kick(struct rrpc *rrpc, struct request *rq)
+{
+	blk_mq_requeue_request(rq);
+	blk_mq_kick_requeue_list(rrpc->q_dev);
+	return BLK_MQ_RQ_QUEUE_DONE;
+}
+
+static int rrpc_read_rq(struct rrpc *rrpc, struct request *rq)
+{
+	struct nvm_addr *p;
+	struct nvm_per_rq *pb;
+	sector_t l_addr = nvm_get_laddr(rq);
+
+	if (rrpc_lock_rq(rrpc, rq))
+		return BLK_MQ_RQ_QUEUE_BUSY;
+
+	p = rrpc_lookup_ltop(rrpc, l_addr);
+	if (!p) {
+		rrpc_unlock_rq(rrpc, rq);
+		return BLK_MQ_RQ_QUEUE_BUSY;
+	}
+
+	if (p->block)
+		rq->phys_sector = nvm_get_sector(p->addr) +
+					(blk_rq_pos(rq) % NR_PHY_IN_LOG);
+	else {
+		rrpc_unlock_rq(rrpc, rq);
+		blk_mq_end_request(rq, 0);
+		return BLK_MQ_RQ_QUEUE_DONE;
+	}
+
+	pb = get_per_rq_data(rq);
+	pb->addr = p;
+
+	return BLK_MQ_RQ_QUEUE_OK;
+}
+
+static int rrpc_write_rq(struct rrpc *rrpc, struct request *rq)
+{
+	struct nvm_per_rq *pb;
+	struct nvm_addr *p;
+	int is_gc = 0;
+	sector_t l_addr = nvm_get_laddr(rq);
+
+	if (rq->cmd_flags & REQ_NVM_NO_INFLIGHT)
+		is_gc = 1;
+
+	if (rrpc_lock_rq(rrpc, rq))
+		return rrpc_requeue_and_kick(rrpc, rq);
+
+	p = rrpc_map_page(rrpc, l_addr, is_gc);
+	if (!p) {
+		BUG_ON(is_gc);
+		rrpc_unlock_rq(rrpc, rq);
+		rrpc_gc_kick(rrpc);
+		return rrpc_requeue_and_kick(rrpc, rq);
+	}
+
+	rq->phys_sector = nvm_get_sector(p->addr);
+
+	pb = get_per_rq_data(rq);
+	pb->addr = p;
+
+	return BLK_MQ_RQ_QUEUE_OK;
+}
+
+static int __rrpc_prep_rq(struct rrpc *rrpc, struct request *rq)
+{
+	int rw = rq_data_dir(rq);
+	int ret;
+
+	if (rw == WRITE)
+		ret = rrpc_write_rq(rrpc, rq);
+	else
+		ret = rrpc_read_rq(rrpc, rq);
+
+	if (!ret)
+		rq->cmd_flags |= (REQ_NVM_MAPPED|REQ_DONTPREP);
+
+	return ret;
+}
+
+static int rrpc_prep_rq(struct request_queue *q, struct request *rq)
+{
+	struct rrpc *rrpc;
+	struct bio *bio;
+
+	bio = rq->bio;
+	if (unlikely(!bio))
+		return 0;
+
+	if (unlikely(!bio->bi_nvm)) {
+		if (bio_data_dir(bio) == WRITE) {
+			pr_warn("nvm: attempting to write without FTL.\n");
+			return BLK_MQ_RQ_QUEUE_ERROR;
+		}
+		return BLK_MQ_RQ_QUEUE_OK;
+	}
+
+	rrpc = container_of(bio->bi_nvm, struct rrpc, payload);
+
+	return __rrpc_prep_rq(rrpc, rq);
+}
+
+static void rrpc_make_rq(struct request_queue *q, struct bio *bio)
+{
+	struct rrpc *rrpc = q->queuedata;
+
+	if (bio->bi_rw & REQ_DISCARD) {
+		rrpc_discard(rrpc, bio);
+		return;
+	}
+
+	bio->bi_nvm = &rrpc->payload;
+	bio->bi_bdev = rrpc->q_bdev;
+
+	generic_make_request(bio);
+}
+
+static void rrpc_gc_free(struct rrpc *rrpc)
+{
+	struct rrpc_lun *rlun;
+	int i;
+
+	if (rrpc->krqd_wq)
+		destroy_workqueue(rrpc->krqd_wq);
+
+	if (rrpc->kgc_wq)
+		destroy_workqueue(rrpc->kgc_wq);
+
+	if (!rrpc->luns)
+		return;
+
+	for (i = 0; i < rrpc->nr_luns; i++) {
+		rlun = &rrpc->luns[i];
+
+		if (!rlun->blocks)
+			break;
+		vfree(rlun->blocks);
+	}
+}
+
+static int rrpc_gc_init(struct rrpc *rrpc)
+{
+	rrpc->krqd_wq = alloc_workqueue("knvm-work", WQ_MEM_RECLAIM|WQ_UNBOUND,
+						rrpc->nr_luns);
+	if (!rrpc->krqd_wq)
+		return -ENOMEM;
+
+	rrpc->kgc_wq = alloc_workqueue("knvm-gc", WQ_MEM_RECLAIM, 1);
+	if (!rrpc->kgc_wq)
+		return -ENOMEM;
+
+	setup_timer(&rrpc->gc_timer, rrpc_gc_timer, (unsigned long)rrpc);
+
+	return 0;
+}
+
+static void rrpc_map_free(struct rrpc *rrpc)
+{
+	vfree(rrpc->rev_trans_map);
+	vfree(rrpc->trans_map);
+}
+
+static int rrpc_l2p_update(u64 slba, u64 nlb, u64 *entries, void *private)
+{
+	struct rrpc *rrpc = (struct rrpc *)private;
+	struct nvm_dev *dev = rrpc->q_nvm;
+	struct nvm_addr *addr = rrpc->trans_map + slba;
+	struct nvm_rev_addr *raddr = rrpc->rev_trans_map;
+	sector_t max_pages = dev->total_pages * (dev->sector_size >> 9);
+	u64 elba = slba + nlb;
+	u64 i;
+
+	if (unlikely(elba > dev->total_pages)) {
+		pr_err("nvm: L2P data from device is out of bounds!\n");
+		return -EINVAL;
+	}
+
+	for (i = 0; i < nlb; i++) {
+		u64 pba = le64_to_cpu(entries[i]);
+		/* LNVM treats address-spaces as silos, LBA and PBA are
+		 * equally large and zero-indexed. */
+		if (unlikely(pba >= max_pages && pba != U64_MAX)) {
+			pr_err("nvm: L2P data entry is out of bounds!\n");
+			return -EINVAL;
+		}
+
+		/* Address zero is special: the first page on a disk is
+		 * protected, as it often holds internal device boot
+		 * information. */
+		if (!pba)
+			continue;
+
+		addr[i].addr = pba;
+		raddr[pba].addr = slba + i;
+	}
+
+	return 0;
+}
+
+static int rrpc_map_init(struct rrpc *rrpc)
+{
+	struct nvm_dev *dev = rrpc->q_nvm;
+	sector_t i;
+	int ret;
+
+	rrpc->trans_map = vzalloc(sizeof(struct nvm_addr) * rrpc->nr_pages);
+	if (!rrpc->trans_map)
+		return -ENOMEM;
+
+	rrpc->rev_trans_map = vmalloc(sizeof(struct nvm_rev_addr)
+							* rrpc->nr_pages);
+	if (!rrpc->rev_trans_map)
+		return -ENOMEM;
+
+	for (i = 0; i < rrpc->nr_pages; i++) {
+		struct nvm_addr *p = &rrpc->trans_map[i];
+		struct nvm_rev_addr *r = &rrpc->rev_trans_map[i];
+
+		p->addr = ADDR_EMPTY;
+		r->addr = ADDR_EMPTY;
+	}
+
+	if (!dev->ops->get_l2p_tbl)
+		return 0;
+
+	/* Bring up the mapping table from device */
+	ret = dev->ops->get_l2p_tbl(dev->q, 0, dev->total_pages,
+							rrpc_l2p_update, rrpc);
+	if (ret) {
+		pr_err("nvm: rrpc: could not read L2P table.\n");
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+
+/* Minimum pages needed within a lun */
+#define PAGE_POOL_SIZE 16
+#define ADDR_POOL_SIZE 64
+
+static int rrpc_core_init(struct rrpc *rrpc)
+{
+	int i;
+
+	down_write(&_lock);
+	if (!_addr_cache) {
+		_addr_cache = kmem_cache_create("nvm_addr_cache",
+				sizeof(struct nvm_addr), 0, 0, NULL);
+		if (!_addr_cache) {
+			up_write(&_lock);
+			return -ENOMEM;
+		}
+	}
+
+	if (!_gcb_cache) {
+		_gcb_cache = kmem_cache_create("nvm_gcb_cache",
+				sizeof(struct rrpc_block_gc), 0, 0, NULL);
+		if (!_gcb_cache) {
+			kmem_cache_destroy(_addr_cache);
+			up_write(&_lock);
+			return -ENOMEM;
+		}
+	}
+	up_write(&_lock);
+
+	rrpc->page_pool = mempool_create_page_pool(PAGE_POOL_SIZE, 0);
+	if (!rrpc->page_pool)
+		return -ENOMEM;
+
+	rrpc->addr_pool = mempool_create_slab_pool(ADDR_POOL_SIZE, _addr_cache);
+	if (!rrpc->addr_pool)
+		return -ENOMEM;
+
+	rrpc->gcb_pool = mempool_create_slab_pool(rrpc->q_nvm->nr_luns,
+								_gcb_cache);
+	if (!rrpc->gcb_pool)
+		return -ENOMEM;
+
+	for (i = 0; i < NVM_INFLIGHT_PARTITIONS; i++) {
+		struct nvm_inflight *map = &rrpc->inflight_map[i];
+
+		spin_lock_init(&map->lock);
+		INIT_LIST_HEAD(&map->reqs);
+	}
+
+	return 0;
+}
+
+static void rrpc_core_free(struct rrpc *rrpc)
+{
+	if (rrpc->addr_pool)
+		mempool_destroy(rrpc->addr_pool);
+	if (rrpc->page_pool)
+		mempool_destroy(rrpc->page_pool);
+
+	down_write(&_lock);
+	if (_addr_cache)
+		kmem_cache_destroy(_addr_cache);
+	if (_gcb_cache)
+		kmem_cache_destroy(_gcb_cache);
+	up_write(&_lock);
+}
+
+static void rrpc_luns_free(struct rrpc *rrpc)
+{
+	kfree(rrpc->luns);
+}
+
+static int rrpc_luns_init(struct rrpc *rrpc, int lun_begin, int lun_end)
+{
+	struct nvm_dev *dev = rrpc->q_nvm;
+	struct nvm_block *block;
+	struct rrpc_lun *rlun;
+	int i, j;
+
+	spin_lock_init(&rrpc->rev_lock);
+
+	rrpc->luns = kcalloc(rrpc->nr_luns, sizeof(struct rrpc_lun),
+								GFP_KERNEL);
+	if (!rrpc->luns)
+		return -ENOMEM;
+
+	/* 1:1 mapping */
+	for (i = 0; i < rrpc->nr_luns; i++) {
+		struct nvm_lun *lun = &dev->luns[i + lun_begin];
+
+		rlun = &rrpc->luns[i];
+		rlun->rrpc = rrpc;
+		rlun->parent = lun;
+		rlun->nr_blocks = lun->nr_blocks;
+
+		rrpc->total_blocks += lun->nr_blocks;
+		rrpc->nr_pages += lun->nr_blocks * lun->nr_pages_per_blk;
+
+		INIT_LIST_HEAD(&rlun->prio_list);
+		INIT_WORK(&rlun->ws_gc, rrpc_lun_gc);
+		spin_lock_init(&rlun->lock);
+
+		rlun->blocks = vzalloc(sizeof(struct rrpc_block) *
+						 rlun->nr_blocks);
+		if (!rlun->blocks)
+			goto err;
+
+		lun_for_each_block(lun, block, j) {
+			struct rrpc_block *rblock = &rlun->blocks[j];
+
+			rblock->parent = block;
+			INIT_LIST_HEAD(&rblock->prio);
+		}
+	}
+
+	return 0;
+err:
+	return -ENOMEM;
+}
+
+static void rrpc_free(struct rrpc *rrpc)
+{
+	rrpc_gc_free(rrpc);
+	rrpc_map_free(rrpc);
+	rrpc_core_free(rrpc);
+	rrpc_luns_free(rrpc);
+
+	kfree(rrpc);
+}
+
+static void rrpc_exit(void *private)
+{
+	struct rrpc *rrpc = private;
+
+	blkdev_put(rrpc->q_bdev, FMODE_WRITE | FMODE_READ);
+	del_timer(&rrpc->gc_timer);
+
+	flush_workqueue(rrpc->krqd_wq);
+	flush_workqueue(rrpc->kgc_wq);
+
+	rrpc_free(rrpc);
+}
+
+static sector_t rrpc_capacity(void *private)
+{
+	struct rrpc *rrpc = private;
+	struct nvm_lun *lun;
+	sector_t reserved;
+	int i, max_pages_per_blk = 0;
+
+	nvm_for_each_lun(rrpc->q_nvm, lun, i) {
+		if (lun->nr_pages_per_blk > max_pages_per_blk)
+			max_pages_per_blk = lun->nr_pages_per_blk;
+	}
+
+	/* cur, gc, and two emergency blocks for each lun */
+	reserved = rrpc->nr_luns * max_pages_per_blk * 4;
+
+	if (reserved > rrpc->nr_pages) {
+		pr_err("rrpc: not enough space available to expose storage.\n");
+		return 0;
+	}
+
+	return ((rrpc->nr_pages - reserved) / 10) * 9 * NR_PHY_IN_LOG;
+}
+
+/*
+ * Look up the logical address in the reverse trans map and check whether it is
+ * still valid by comparing the logical-to-physical mapping with the physical
+ * address. Pages whose mapping no longer points back at them are marked
+ * invalid.
+ */
+static void rrpc_block_map_update(struct rrpc *rrpc, struct nvm_block *block)
+{
+	struct nvm_lun *lun = block->lun;
+	int offset;
+	struct nvm_addr *laddr;
+	sector_t paddr, pladdr;
+
+	for (offset = 0; offset < lun->nr_pages_per_blk; offset++) {
+		paddr = block_to_addr(block) + offset;
+
+		pladdr = rrpc->rev_trans_map[paddr].addr;
+		if (pladdr == ADDR_EMPTY)
+			continue;
+
+		laddr = &rrpc->trans_map[pladdr];
+
+		if (paddr == laddr->addr) {
+			laddr->block = block;
+		} else {
+			set_bit(offset, block->invalid_pages);
+			block->nr_invalid_pages++;
+		}
+	}
+}
+
+static int rrpc_blocks_init(struct rrpc *rrpc)
+{
+	struct nvm_dev *dev = rrpc->q_nvm;
+	struct nvm_lun *lun;
+	struct nvm_block *blk;
+	sector_t lun_iter, blk_iter;
+
+	for (lun_iter = 0; lun_iter < rrpc->nr_luns; lun_iter++) {
+		lun = &dev->luns[lun_iter + rrpc->lun_offset];
+
+		lun_for_each_block(lun, blk, blk_iter)
+			rrpc_block_map_update(rrpc, blk);
+	}
+
+	return 0;
+}
+
+static int rrpc_luns_configure(struct rrpc *rrpc)
+{
+	struct rrpc_lun *rlun;
+	struct nvm_block *blk;
+	int i;
+
+	for (i = 0; i < rrpc->nr_luns; i++) {
+		rlun = &rrpc->luns[i];
+
+		blk = blk_nvm_get_blk(rlun->parent, 0);
+		if (!blk)
+			return -EINVAL;
+
+		rrpc_set_lun_cur(rlun, blk);
+
+		/* Emergency gc block */
+		blk = blk_nvm_get_blk(rlun->parent, 1);
+		if (!blk)
+			return -EINVAL;
+		rlun->gc_cur = blk;
+	}
+
+	return 0;
+}
+
+static void *rrpc_init(struct request_queue *qdev,
+			struct request_queue *qtarget, struct gendisk *qdisk,
+			struct gendisk *tdisk, int lun_begin, int lun_end)
+{
+	struct nvm_dev *dev;
+	struct block_device *bdev;
+	struct rrpc *rrpc;
+	int ret;
+
+	if (!blk_queue_nvm(qdev)) {
+		pr_err("nvm: block device not supported.\n");
+		return ERR_PTR(-EINVAL);
+	}
+
+	bdev = bdget_disk(qdisk, 0);
+	if (blkdev_get(bdev, FMODE_WRITE | FMODE_READ, NULL)) {
+		pr_err("nvm: could not access backing device\n");
+		return ERR_PTR(-EINVAL);
+	}
+
+	dev = blk_nvm_get_dev(qdev);
+
+	rrpc = kzalloc(sizeof(struct rrpc), GFP_KERNEL);
+	if (!rrpc) {
+		ret = -ENOMEM;
+		goto err;
+	}
+
+	rrpc->q_dev = qdev;
+	rrpc->q_nvm = qdev->nvm;
+	rrpc->q_bdev = bdev;
+	rrpc->nr_luns = lun_end - lun_begin + 1;
+
+	/* simple round-robin strategy */
+	atomic_set(&rrpc->next_lun, -1);
+
+	ret = rrpc_luns_init(rrpc, lun_begin, lun_end);
+	if (ret) {
+		pr_err("nvm: could not initialize luns\n");
+		goto err;
+	}
+
+	rrpc->poffset = rrpc->luns[0].parent->nr_blocks *
+			rrpc->luns[0].parent->nr_pages_per_blk * lun_begin;
+	rrpc->lun_offset = lun_begin;
+
+	ret = rrpc_core_init(rrpc);
+	if (ret) {
+		pr_err("nvm: rrpc: could not initialize core\n");
+		goto err;
+	}
+
+	ret = rrpc_map_init(rrpc);
+	if (ret) {
+		pr_err("nvm: rrpc: could not initialize maps\n");
+		goto err;
+	}
+
+	ret = rrpc_blocks_init(rrpc);
+	if (ret) {
+		pr_err("nvm: rrpc: could not initialize state for blocks\n");
+		goto err;
+	}
+
+	ret = rrpc_luns_configure(rrpc);
+	if (ret) {
+		pr_err("nvm: rrpc: not enough blocks available in LUNs.\n");
+		goto err;
+	}
+
+	ret = rrpc_gc_init(rrpc);
+	if (ret) {
+		pr_err("nvm: rrpc: could not initialize gc\n");
+		goto err;
+	}
+
+	/* make sure to inherit the size from the underlying device */
+	blk_queue_logical_block_size(qtarget, queue_physical_block_size(qdev));
+	blk_queue_max_hw_sectors(qtarget, queue_max_hw_sectors(qdev));
+
+	pr_info("nvm: rrpc initialized with %u luns and %llu pages.\n",
+			rrpc->nr_luns, (unsigned long long)rrpc->nr_pages);
+
+	mod_timer(&rrpc->gc_timer, jiffies + msecs_to_jiffies(10));
+
+	return rrpc;
+err:
+	blkdev_put(bdev, FMODE_WRITE | FMODE_READ);
+	rrpc_free(rrpc);
+	return ERR_PTR(ret);
+}
+
+/* round robin, page-based FTL, and cost-based GC */
+static struct nvm_target_type tt_rrpc = {
+	.name		= "rrpc",
+
+	.make_rq	= rrpc_make_rq,
+	.prep_rq	= rrpc_prep_rq,
+	.unprep_rq	= rrpc_unprep_rq,
+
+	.capacity	= rrpc_capacity,
+
+	.init		= rrpc_init,
+	.exit		= rrpc_exit,
+};
+
+static int __init rrpc_module_init(void)
+{
+	return nvm_register_target(&tt_rrpc);
+}
+
+static void rrpc_module_exit(void)
+{
+	nvm_unregister_target(&tt_rrpc);
+}
+
+module_init(rrpc_module_init);
+module_exit(rrpc_module_exit);
+MODULE_LICENSE("GPL");
+MODULE_DESCRIPTION("Round-Robin Cost-based Hybrid Layer for Open-Channel SSDs");
diff --git a/drivers/lightnvm/rrpc.h b/drivers/lightnvm/rrpc.h
new file mode 100644
index 0000000..eeebc7f
--- /dev/null
+++ b/drivers/lightnvm/rrpc.h
@@ -0,0 +1,203 @@
+/*
+ * Copyright (C) 2015 Matias Bjørling.
+ *
+ * This file is released under the GPL.
+ */
+
+#ifndef RRPC_H_
+#define RRPC_H_
+
+#include <linux/blkdev.h>
+#include <linux/blk-mq.h>
+#include <linux/bio.h>
+#include <linux/module.h>
+#include <linux/kthread.h>
+
+#include <linux/lightnvm.h>
+
+/* We partition the namespace of the translation map into these pieces for
+ * tracking in-flight addresses. */
+#define NVM_INFLIGHT_PARTITIONS 1
+
+/* Run only GC if less than 1/X blocks are free */
+#define GC_LIMIT_INVERSE 10
+#define GC_TIME_SECS 100
+
+struct nvm_inflight {
+	spinlock_t lock;
+	struct list_head reqs;
+};
+
+struct rrpc_lun;
+
+struct rrpc_block {
+	struct nvm_block *parent;
+	struct list_head prio;
+};
+
+struct rrpc_lun {
+	struct rrpc *rrpc;
+	struct nvm_lun *parent;
+	struct nvm_block *cur, *gc_cur;
+	struct rrpc_block *blocks;	/* Reference to block allocation */
+	struct list_head prio_list;		/* Blocks that may be GC'ed */
+	struct work_struct ws_gc;
+
+	int nr_blocks;
+	spinlock_t lock;
+};
+
+struct rrpc {
+	struct bio_nvm_payload payload;
+
+	struct nvm_dev *q_nvm;
+	struct request_queue *q_dev;
+	struct block_device *q_bdev;
+
+	int nr_luns;
+	int lun_offset;
+	sector_t poffset; /* physical page offset */
+
+	struct rrpc_lun *luns;
+
+	/* calculated values */
+	unsigned long nr_pages;
+	unsigned long total_blocks;
+
+	/* Write strategy variables. Move these into a separate structure for
+	 * each strategy. */
+	atomic_t next_lun; /* Whenever a page is written, this is updated
+			    * to point to the next write lun */
+
+	/* Simple translation map of logical addresses to physical addresses.
+	 * The logical addresses are known by the host system, while the physical
+	 * addresses are used when writing to the disk block device. */
+	struct nvm_addr *trans_map;
+	/* also store a reverse map for garbage collection */
+	struct nvm_rev_addr *rev_trans_map;
+	spinlock_t rev_lock;
+
+	struct nvm_inflight inflight_map[NVM_INFLIGHT_PARTITIONS];
+
+	mempool_t *addr_pool;
+	mempool_t *page_pool;
+	mempool_t *gcb_pool;
+
+	struct timer_list gc_timer;
+	struct workqueue_struct *krqd_wq;
+	struct workqueue_struct *kgc_wq;
+
+	struct gc_blocks *gblks;
+	struct gc_luns *gluns;
+};
+
+struct rrpc_block_gc {
+	struct rrpc *rrpc;
+	struct nvm_block *block;
+	struct work_struct ws_gc;
+};
+
+static inline sector_t nvm_get_laddr(struct request *rq)
+{
+	return blk_rq_pos(rq) / NR_PHY_IN_LOG;
+}
+
+static inline sector_t nvm_get_sector(sector_t laddr)
+{
+	return laddr * NR_PHY_IN_LOG;
+}
+
+static inline void *get_per_rq_data(struct request *rq)
+{
+	struct request_queue *q = rq->q;
+
+	return blk_mq_rq_to_pdu(rq) + q->tag_set->cmd_size;
+}
+
+static inline int request_intersects(struct rrpc_inflight_rq *r,
+				sector_t laddr_start, sector_t laddr_end)
+{
+	return (laddr_end >= r->l_start && laddr_end <= r->l_end) &&
+		(laddr_start >= r->l_start && laddr_start <= r->l_end);
+}
+
+static int __rrpc_lock_laddr(struct rrpc *rrpc, sector_t laddr,
+			     unsigned pages, struct rrpc_inflight_rq *r)
+{
+	struct nvm_inflight *map =
+			&rrpc->inflight_map[laddr % NVM_INFLIGHT_PARTITIONS];
+	sector_t laddr_end = laddr + pages - 1;
+	struct rrpc_inflight_rq *rtmp;
+
+	spin_lock_irq(&map->lock);
+	list_for_each_entry(rtmp, &map->reqs, list) {
+		if (unlikely(request_intersects(rtmp, laddr, laddr_end))) {
+			/* existing, overlapping request, come back later */
+			spin_unlock_irq(&map->lock);
+			return 1;
+		}
+	}
+
+	r->l_start = laddr;
+	r->l_end = laddr_end;
+
+	list_add_tail(&r->list, &map->reqs);
+	spin_unlock_irq(&map->lock);
+	return 0;
+}
+
+static inline int rrpc_lock_laddr(struct rrpc *rrpc, sector_t laddr,
+				 unsigned pages,
+				 struct rrpc_inflight_rq *r)
+{
+	BUG_ON((laddr + pages) > rrpc->nr_pages);
+
+	return __rrpc_lock_laddr(rrpc, laddr, pages, r);
+}
+
+static inline struct rrpc_inflight_rq *rrpc_get_inflight_rq(struct request *rq)
+{
+	struct nvm_per_rq *pd = get_per_rq_data(rq);
+
+	return &pd->inflight_rq;
+}
+
+static inline int rrpc_lock_rq(struct rrpc *rrpc, struct request *rq)
+{
+	sector_t laddr = nvm_get_laddr(rq);
+	unsigned int pages = blk_rq_bytes(rq) / EXPOSED_PAGE_SIZE;
+	struct rrpc_inflight_rq *r = rrpc_get_inflight_rq(rq);
+
+	if (rq->cmd_flags & REQ_NVM_NO_INFLIGHT)
+		return 0;
+
+	return rrpc_lock_laddr(rrpc, laddr, pages, r);
+}
+
+static inline void rrpc_unlock_laddr(struct rrpc *rrpc, sector_t laddr,
+				    struct rrpc_inflight_rq *r)
+{
+	struct nvm_inflight *map =
+			&rrpc->inflight_map[laddr % NVM_INFLIGHT_PARTITIONS];
+	unsigned long flags;
+
+	spin_lock_irqsave(&map->lock, flags);
+	list_del_init(&r->list);
+	spin_unlock_irqrestore(&map->lock, flags);
+}
+
+static inline void rrpc_unlock_rq(struct rrpc *rrpc, struct request *rq)
+{
+	sector_t laddr = nvm_get_laddr(rq);
+	unsigned int pages = blk_rq_bytes(rq) / EXPOSED_PAGE_SIZE;
+	struct rrpc_inflight_rq *r = rrpc_get_inflight_rq(rq);
+
+	BUG_ON((laddr + pages) > rrpc->nr_pages);
+
+	if (rq->cmd_flags & REQ_NVM_NO_INFLIGHT)
+		return;
+
+	rrpc_unlock_laddr(rrpc, laddr, r);
+}
+
+#endif /* RRPC_H_ */
diff --git a/include/linux/lightnvm.h b/include/linux/lightnvm.h
index 888d994..5f9f187 100644
--- a/include/linux/lightnvm.h
+++ b/include/linux/lightnvm.h
@@ -29,7 +29,6 @@
 
 #define NVM_MSG_PREFIX "nvm"
 #define ADDR_EMPTY (~0ULL)
-#define LTOP_POISON 0xD3ADB33F
 
 /* core.c */
 
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 3/5 v2] lightnvm: RRPC target
@ 2015-04-15 12:34   ` Matias Bjørling
  0 siblings, 0 replies; 53+ messages in thread
From: Matias Bjørling @ 2015-04-15 12:34 UTC (permalink / raw)


This patch implements a simple target to be used by Open-Channel SSDs.
It exposes the physical flash as a generic sector-based address space.

The FTL implements a round-robin approach for sector allocation,
together with a greedy cost-based garbage collector.

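As a rough illustration of these two policies (round-robin write placement
across luns and greedy, cost-based victim selection), the sketch below is a
simplified, stand-alone C model. The toy_lun/toy_block types and helpers are
hypothetical stand-ins used only for illustration; they are not the rrpc
structures from the patch.

  /*
   * Simplified, self-contained model of the two policies above. The toy_lun
   * and toy_block types are hypothetical stand-ins, not the rrpc structures.
   */
  #include <stdio.h>

  struct toy_block {
  	int id;
  	unsigned int nr_invalid_pages;	/* pages whose data was rewritten elsewhere */
  };

  struct toy_lun {
  	struct toy_block *blocks;
  	unsigned int nr_blocks;
  };

  /* Round-robin placement: each new write goes to the next lun in turn. */
  static unsigned int next_lun(unsigned int *rr, unsigned int nr_luns)
  {
  	return (*rr)++ % nr_luns;
  }

  /* Greedy cost-based victim selection: reclaim the block with the most
   * invalid pages, as it has the least live data left to migrate. */
  static struct toy_block *pick_gc_victim(struct toy_lun *lun)
  {
  	struct toy_block *victim = &lun->blocks[0];
  	unsigned int i;

  	for (i = 1; i < lun->nr_blocks; i++)
  		if (lun->blocks[i].nr_invalid_pages > victim->nr_invalid_pages)
  			victim = &lun->blocks[i];

  	return victim;
  }

  int main(void)
  {
  	struct toy_block blks[3] = { { 0, 5 }, { 1, 42 }, { 2, 17 } };
  	struct toy_lun lun = { blks, 3 };
  	unsigned int rr = 0;

  	printf("write 0 -> lun %u, write 1 -> lun %u\n",
  	       next_lun(&rr, 4), next_lun(&rr, 4));
  	printf("gc victim: block %d\n", pick_gc_victim(&lun)->id);
  	return 0;
  }
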
Signed-off-by: Matias Bjørling <m@bjorling.me>
---
 drivers/Kconfig           |    2 +
 drivers/Makefile          |    2 +
 drivers/lightnvm/Kconfig  |   29 ++
 drivers/lightnvm/Makefile |    5 +
 drivers/lightnvm/rrpc.c   | 1222 +++++++++++++++++++++++++++++++++++++++++++++
 drivers/lightnvm/rrpc.h   |  203 ++++++++
 include/linux/lightnvm.h  |    1 -
 7 files changed, 1463 insertions(+), 1 deletion(-)
 create mode 100644 drivers/lightnvm/Kconfig
 create mode 100644 drivers/lightnvm/Makefile
 create mode 100644 drivers/lightnvm/rrpc.c
 create mode 100644 drivers/lightnvm/rrpc.h

diff --git a/drivers/Kconfig b/drivers/Kconfig
index c0cc96b..da47047 100644
--- a/drivers/Kconfig
+++ b/drivers/Kconfig
@@ -42,6 +42,8 @@ source "drivers/net/Kconfig"
 
 source "drivers/isdn/Kconfig"
 
+source "drivers/lightnvm/Kconfig"
+
 # input before char - char/joystick depends on it. As does USB.
 
 source "drivers/input/Kconfig"
diff --git a/drivers/Makefile b/drivers/Makefile
index 527a6da..6b6928a 100644
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -165,3 +165,5 @@ obj-$(CONFIG_RAS)		+= ras/
 obj-$(CONFIG_THUNDERBOLT)	+= thunderbolt/
 obj-$(CONFIG_CORESIGHT)		+= coresight/
 obj-$(CONFIG_ANDROID)		+= android/
+
+obj-$(CONFIG_NVM)		+= lightnvm/
diff --git a/drivers/lightnvm/Kconfig b/drivers/lightnvm/Kconfig
new file mode 100644
index 0000000..89fabe1
--- /dev/null
+++ b/drivers/lightnvm/Kconfig
@@ -0,0 +1,29 @@
+#
+# Open-Channel SSD NVM configuration
+#
+
+menuconfig NVM
+	bool "Open-Channel SSD target support"
+	depends on BLK_DEV_NVM
+	help
+	  Say Y here to enable support for Open-Channel SSDs.
+
+	  Open-Channel SSDs implement a set of extensions to SSDs that
+	  expose direct access to the underlying non-volatile memory.
+
+	  If you say N, all options in this submenu will be skipped and
+	  disabled; only do this if you know what you are doing.
+
+if NVM
+
+config NVM_RRPC
+	tristate "Round-robin Hybrid Open-Channel SSD"
+	depends on BLK_DEV_NVM
+	---help---
+	Allows an open-channel SSD to be exposed as a block device to the
+	host. The target is implemented using a linear mapping table and
+	cost-based garbage collection. It is optimized for 4K IO sizes.
+
+	See Documentation/nvm-rrpc.txt for details.
+
+endif # NVM
diff --git a/drivers/lightnvm/Makefile b/drivers/lightnvm/Makefile
new file mode 100644
index 0000000..80d75a8
--- /dev/null
+++ b/drivers/lightnvm/Makefile
@@ -0,0 +1,5 @@
+#
+# Makefile for Open-Channel SSDs.
+#
+
+obj-$(CONFIG_NVM)		+= rrpc.o
diff --git a/drivers/lightnvm/rrpc.c b/drivers/lightnvm/rrpc.c
new file mode 100644
index 0000000..180cb09
--- /dev/null
+++ b/drivers/lightnvm/rrpc.c
@@ -0,0 +1,1222 @@
+/*
+ * Copyright (C) 2015 IT University of Copenhagen
+ * Initial release: Matias Bjorling <mabj@itu.dk>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License version
+ * 2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; see the file COPYING.  If not, write to
+ * the Free Software Foundation, 675 Mass Ave, Cambridge, MA 02139,
+ * USA.
+ *
+ * Implementation of a Round-robin page-based Hybrid FTL for Open-channel SSDs.
+ */
+
+#include "rrpc.h"
+
+static struct kmem_cache *_addr_cache;
+static struct kmem_cache *_gcb_cache;
+static DECLARE_RWSEM(_lock);
+
+#define rrpc_for_each_lun(rrpc, rlun, i) \
+		for ((i) = 0, rlun = &(rrpc)->luns[0]; \
+			(i) < (rrpc)->nr_luns; (i)++, rlun = &(rrpc)->luns[(i)])
+
+static void invalidate_block_page(struct nvm_addr *p)
+{
+	struct nvm_block *block = p->block;
+	unsigned int page_offset;
+
+	if (!block)
+		return;
+
+	spin_lock(&block->lock);
+	page_offset = p->addr % block->lun->nr_pages_per_blk;
+	WARN_ON(test_and_set_bit(page_offset, block->invalid_pages));
+	block->nr_invalid_pages++;
+	spin_unlock(&block->lock);
+}
+
+static inline void __nvm_page_invalidate(struct rrpc *rrpc, struct nvm_addr *a)
+{
+	BUG_ON(!spin_is_locked(&rrpc->rev_lock));
+	if (a->addr == ADDR_EMPTY)
+		return;
+
+	invalidate_block_page(a);
+	rrpc->rev_trans_map[a->addr - rrpc->poffset].addr = ADDR_EMPTY;
+}
+
+static void rrpc_invalidate_range(struct rrpc *rrpc, sector_t slba,
+								unsigned len)
+{
+	sector_t i;
+
+	spin_lock(&rrpc->rev_lock);
+	for (i = slba; i < slba + len; i++) {
+		struct nvm_addr *gp = &rrpc->trans_map[i];
+
+		__nvm_page_invalidate(rrpc, gp);
+		gp->block = NULL;
+	}
+	spin_unlock(&rrpc->rev_lock);
+}
+
+static struct request *rrpc_inflight_laddr_acquire(struct rrpc *rrpc,
+					sector_t laddr, unsigned int pages)
+{
+	struct request *rq;
+	struct rrpc_inflight_rq *inf;
+
+	rq = blk_mq_alloc_request(rrpc->q_dev, READ, GFP_NOIO, false);
+	if (!rq)
+		return ERR_PTR(-ENOMEM);
+
+	inf = rrpc_get_inflight_rq(rq);
+	if (rrpc_lock_laddr(rrpc, laddr, pages, inf)) {
+		blk_mq_free_request(rq);
+		return NULL;
+	}
+
+	return rq;
+}
+
+static void rrpc_inflight_laddr_release(struct rrpc *rrpc, struct request *rq)
+{
+	struct rrpc_inflight_rq *inf;
+
+	inf = rrpc_get_inflight_rq(rq);
+	rrpc_unlock_laddr(rrpc, inf->l_start, inf);
+
+	blk_mq_free_request(rq);
+}
+
+static void rrpc_discard(struct rrpc *rrpc, struct bio *bio)
+{
+	sector_t slba = bio->bi_iter.bi_sector / NR_PHY_IN_LOG;
+	sector_t len = bio->bi_iter.bi_size / EXPOSED_PAGE_SIZE;
+	struct request *rq;
+
+	do {
+		rq = rrpc_inflight_laddr_acquire(rrpc, slba, len);
+		schedule();
+	} while (!rq);
+
+	if (IS_ERR(rq)) {
+		bio_io_error(bio);
+		return;
+	}
+
+	rrpc_invalidate_range(rrpc, slba, len);
+	rrpc_inflight_laddr_release(rrpc, rq);
+}
+
+/* requires lun->lock taken */
+static void rrpc_set_lun_cur(struct rrpc_lun *rlun, struct nvm_block *block)
+{
+	BUG_ON(!block);
+
+	if (rlun->cur) {
+		spin_lock(&rlun->cur->lock);
+		WARN_ON(!block_is_full(rlun->cur));
+		spin_unlock(&rlun->cur->lock);
+	}
+	rlun->cur = block;
+}
+
+static struct rrpc_lun *get_next_lun(struct rrpc *rrpc)
+{
+	int next = atomic_inc_return(&rrpc->next_lun);
+
+	return &rrpc->luns[next % rrpc->nr_luns];
+}
+
+static void rrpc_gc_kick(struct rrpc *rrpc)
+{
+	struct rrpc_lun *rlun;
+	unsigned int i;
+
+	for (i = 0; i < rrpc->nr_luns; i++) {
+		rlun = &rrpc->luns[i];
+		queue_work(rrpc->krqd_wq, &rlun->ws_gc);
+	}
+}
+
+/**
+ * rrpc_gc_timer - default gc timer function.
+ * @data: ptr to the 'rrpc' structure
+ *
+ * Description:
+ *   rrpc configures a timer to kick the GC to force proactive behavior.
+ *
+ **/
+static void rrpc_gc_timer(unsigned long data)
+{
+	struct rrpc *rrpc = (struct rrpc *)data;
+
+	rrpc_gc_kick(rrpc);
+	mod_timer(&rrpc->gc_timer, jiffies + msecs_to_jiffies(10));
+}
+
+static void rrpc_end_sync_bio(struct bio *bio, int error)
+{
+	struct completion *waiting = bio->bi_private;
+
+	if (error)
+		pr_err("nvm: gc request failed.\n");
+
+	complete(waiting);
+}
+
+/*
+ * rrpc_move_valid_pages -- migrate live data off the block
+ * @rrpc: the 'rrpc' structure
+ * @block: the block from which to migrate live pages
+ *
+ * Description:
+ *   GC algorithms may call this function to migrate remaining live
+ *   pages off the block prior to erasing it. This function blocks
+ *   further execution until the operation is complete.
+ */
+static int rrpc_move_valid_pages(struct rrpc *rrpc, struct nvm_block *block)
+{
+	struct request_queue *q = rrpc->q_dev;
+	struct nvm_lun *lun = block->lun;
+	struct nvm_rev_addr *rev;
+	struct bio *bio;
+	struct request *rq;
+	struct page *page;
+	int slot;
+	sector_t phys_addr;
+	DECLARE_COMPLETION_ONSTACK(wait);
+
+	if (bitmap_full(block->invalid_pages, lun->nr_pages_per_blk))
+		return 0;
+
+	bio = bio_alloc(GFP_NOIO, 1);
+	if (!bio) {
+		pr_err("nvm: could not alloc bio on gc\n");
+		return -ENOMEM;
+	}
+
+	page = mempool_alloc(rrpc->page_pool, GFP_NOIO);
+
+	while ((slot = find_first_zero_bit(block->invalid_pages,
+					   lun->nr_pages_per_blk)) <
+						lun->nr_pages_per_blk) {
+
+		/* Lock laddr */
+		phys_addr = block_to_addr(block) + slot;
+
+try:
+		spin_lock(&rrpc->rev_lock);
+		/* Get logical address from physical to logical table */
+		rev = &rrpc->rev_trans_map[phys_addr - rrpc->poffset];
+		/* already updated by previous regular write */
+		if (rev->addr == ADDR_EMPTY) {
+			spin_unlock(&rrpc->rev_lock);
+			continue;
+		}
+
+		rq = rrpc_inflight_laddr_acquire(rrpc, rev->addr, 1);
+		if (!rq) {
+			spin_unlock(&rrpc->rev_lock);
+			schedule();
+			goto try;
+		}
+
+		spin_unlock(&rrpc->rev_lock);
+
+		/* Perform read to do GC */
+		bio->bi_iter.bi_sector = nvm_get_sector(rev->addr);
+		bio->bi_rw |= (READ | REQ_NVM_NO_INFLIGHT);
+		bio->bi_private = &wait;
+		bio->bi_end_io = rrpc_end_sync_bio;
+		bio->bi_nvm = &rrpc->payload;
+
+		/* TODO: may fail when EXP_PG_SIZE > PAGE_SIZE */
+		bio_add_pc_page(q, bio, page, EXPOSED_PAGE_SIZE, 0);
+
+		/* execute read */
+		q->make_request_fn(q, bio);
+		wait_for_completion_io(&wait);
+
+		/* and write it back */
+		bio_reset(bio);
+		reinit_completion(&wait);
+
+		bio->bi_iter.bi_sector = nvm_get_sector(rev->addr);
+		bio->bi_rw |= (WRITE | REQ_NVM_NO_INFLIGHT);
+		bio->bi_private = &wait;
+		bio->bi_end_io = rrpc_end_sync_bio;
+		bio->bi_nvm = &rrpc->payload;
+		/* TODO: may fail when EXP_PG_SIZE > PAGE_SIZE */
+		bio_add_pc_page(q, bio, page, EXPOSED_PAGE_SIZE, 0);
+
+		q->make_request_fn(q, bio);
+		wait_for_completion_io(&wait);
+
+		rrpc_inflight_laddr_release(rrpc, rq);
+
+		/* reset structures for next run */
+		reinit_completion(&wait);
+		bio_reset(bio);
+	}
+
+	mempool_free(page, rrpc->page_pool);
+	bio_put(bio);
+
+	if (!bitmap_full(block->invalid_pages, lun->nr_pages_per_blk)) {
+		pr_err("nvm: failed to garbage collect block\n");
+		return -EIO;
+	}
+
+	return 0;
+}
+
+static void rrpc_block_gc(struct work_struct *work)
+{
+	struct rrpc_block_gc *gcb = container_of(work, struct rrpc_block_gc,
+									ws_gc);
+	struct rrpc *rrpc = gcb->rrpc;
+	struct nvm_block *block = gcb->block;
+	struct nvm_dev *dev = rrpc->q_nvm;
+
+	pr_debug("nvm: block '%d' being reclaimed\n", block->id);
+
+	if (rrpc_move_valid_pages(rrpc, block))
+		goto done;
+
+	blk_nvm_erase_blk(dev, block);
+	blk_nvm_put_blk(block);
+done:
+	mempool_free(gcb, rrpc->gcb_pool);
+}
+
+/* the block with the highest number of invalid pages will be at the
+ * beginning of the list */
+static struct rrpc_block *rblock_max_invalid(struct rrpc_block *ra,
+					       struct rrpc_block *rb)
+{
+	struct nvm_block *a = ra->parent;
+	struct nvm_block *b = rb->parent;
+
+	BUG_ON(!a || !b);
+
+	if (a->nr_invalid_pages == b->nr_invalid_pages)
+		return ra;
+
+	return (a->nr_invalid_pages < b->nr_invalid_pages) ? rb : ra;
+}
+
+/* linearly find the block with highest number of invalid pages
+ * requires lun->lock */
+static struct rrpc_block *block_prio_find_max(struct rrpc_lun *rlun)
+{
+	struct list_head *prio_list = &rlun->prio_list;
+	struct rrpc_block *rblock, *max;
+
+	BUG_ON(list_empty(prio_list));
+
+	max = list_first_entry(prio_list, struct rrpc_block, prio);
+	list_for_each_entry(rblock, prio_list, prio)
+		max = rblock_max_invalid(max, rblock);
+
+	return max;
+}
+
+static void rrpc_lun_gc(struct work_struct *work)
+{
+	struct rrpc_lun *rlun = container_of(work, struct rrpc_lun, ws_gc);
+	struct rrpc *rrpc = rlun->rrpc;
+	struct nvm_lun *lun = rlun->parent;
+	struct rrpc_block_gc *gcb;
+	unsigned int nr_blocks_need;
+
+	nr_blocks_need = lun->nr_blocks / GC_LIMIT_INVERSE;
+
+	if (nr_blocks_need < rrpc->nr_luns)
+		nr_blocks_need = rrpc->nr_luns;
+
+	spin_lock(&lun->lock);
+	while (nr_blocks_need > lun->nr_free_blocks &&
+					!list_empty(&rlun->prio_list)) {
+		struct rrpc_block *rblock = block_prio_find_max(rlun);
+		struct nvm_block *block = rblock->parent;
+
+		if (!block->nr_invalid_pages)
+			break;
+
+		list_del_init(&rblock->prio);
+
+		BUG_ON(!block_is_full(block));
+
+		pr_debug("nvm: selected block '%d' as GC victim\n",
+								block->id);
+
+		gcb = mempool_alloc(rrpc->gcb_pool, GFP_ATOMIC);
+		if (!gcb)
+			break;
+
+		gcb->rrpc = rrpc;
+		gcb->block = rblock->parent;
+		INIT_WORK(&gcb->ws_gc, rrpc_block_gc);
+
+		queue_work(rrpc->kgc_wq, &gcb->ws_gc);
+
+		nr_blocks_need--;
+	}
+	spin_unlock(&lun->lock);
+
+	/* TODO: Hint that request queue can be started again */
+}
+
+static void rrpc_gc_queue(struct work_struct *work)
+{
+	struct rrpc_block_gc *gcb = container_of(work, struct rrpc_block_gc,
+									ws_gc);
+	struct rrpc *rrpc = gcb->rrpc;
+	struct nvm_block *block = gcb->block;
+	struct nvm_lun *lun = block->lun;
+	struct rrpc_lun *rlun = &rrpc->luns[lun->id - rrpc->lun_offset];
+	struct rrpc_block *rblock =
+			&rlun->blocks[block->id % lun->nr_blocks];
+
+	spin_lock(&rlun->lock);
+	list_add_tail(&rblock->prio, &rlun->prio_list);
+	spin_unlock(&rlun->lock);
+
+	mempool_free(gcb, rrpc->gcb_pool);
+	pr_debug("nvm: block '%d' is full, allow GC (sched)\n", block->id);
+}
+
+static int rrpc_ioctl(struct block_device *bdev, fmode_t mode, unsigned int cmd,
+							unsigned long arg)
+{
+	return 0;
+}
+
+static int rrpc_open(struct block_device *bdev, fmode_t mode)
+{
+	return 0;
+}
+
+static void rrpc_release(struct gendisk *disk, fmode_t mode)
+{
+}
+
+static const struct block_device_operations rrpc_fops = {
+	.owner		= THIS_MODULE,
+	.ioctl		= rrpc_ioctl,
+	.open		= rrpc_open,
+	.release	= rrpc_release,
+};
+
+static struct rrpc_lun *__rrpc_get_lun_rr(struct rrpc *rrpc, int is_gc)
+{
+	unsigned int i;
+	struct rrpc_lun *rlun, *max_free;
+
+	if (!is_gc)
+		return get_next_lun(rrpc);
+
+	/* FIXME */
+	/* during GC, we don't care about RR, instead we want to make
+	 * sure that we maintain evenness between the block luns. */
+	max_free = &rrpc->luns[0];
+	/* prevent the GC-ing lun from devouring pages of a lun with
+	 * few free blocks. We don't take the lock as we only need an
+	 * estimate. */
+	rrpc_for_each_lun(rrpc, rlun, i) {
+		if (rlun->parent->nr_free_blocks >
+					max_free->parent->nr_free_blocks)
+			max_free = rlun;
+	}
+
+	return max_free;
+}
+
+static inline void __rrpc_page_invalidate(struct rrpc *rrpc,
+							struct nvm_addr *gp)
+{
+	BUG_ON(!spin_is_locked(&rrpc->rev_lock));
+	if (gp->addr == ADDR_EMPTY)
+		return;
+
+	invalidate_block_page(gp);
+	rrpc->rev_trans_map[gp->addr - rrpc->poffset].addr = ADDR_EMPTY;
+}
+
+void nvm_update_map(struct rrpc *rrpc, sector_t l_addr, struct nvm_addr *p,
+					int is_gc)
+{
+	struct nvm_addr *gp;
+	struct nvm_rev_addr *rev;
+
+	BUG_ON(l_addr >= rrpc->nr_pages);
+
+	gp = &rrpc->trans_map[l_addr];
+	spin_lock(&rrpc->rev_lock);
+	if (gp->block)
+		__nvm_page_invalidate(rrpc, gp);
+
+	gp->addr = p->addr;
+	gp->block = p->block;
+
+	rev = &rrpc->rev_trans_map[p->addr - rrpc->poffset];
+	rev->addr = l_addr;
+	spin_unlock(&rrpc->rev_lock);
+}
+
+/* Simple round-robin Logical to physical address translation.
+ *
+ * Retrieve the mapping using the active append point. Then update the ap for
+ * the next write to the disk.
+ *
+ * Returns nvm_addr with the physical address and block. Remember to return to
+ * rrpc->addr_cache when request is finished.
+ */
+static struct nvm_addr *rrpc_map_page(struct rrpc *rrpc, sector_t laddr,
+								int is_gc)
+{
+	struct nvm_addr *p;
+	struct rrpc_lun *rlun;
+	struct nvm_lun *lun;
+	struct nvm_block *p_block;
+	sector_t p_addr;
+
+	p = mempool_alloc(rrpc->addr_pool, GFP_ATOMIC);
+	if (!p) {
+		pr_err("rrpc: address pool ran out of space\n");
+		return NULL;
+	}
+
+	rlun = __rrpc_get_lun_rr(rrpc, is_gc);
+	lun = rlun->parent;
+
+	if (!is_gc && lun->nr_free_blocks < rrpc->nr_luns * 4) {
+		mempool_free(p, rrpc->addr_pool);
+		return NULL;
+	}
+
+	spin_lock(&rlun->lock);
+
+	p_block = rlun->cur;
+	p_addr = blk_nvm_alloc_addr(p_block);
+
+	if (p_addr == ADDR_EMPTY) {
+		p_block = blk_nvm_get_blk(lun, 0);
+
+		if (!p_block) {
+			if (is_gc) {
+				p_addr = blk_nvm_alloc_addr(rlun->gc_cur);
+				if (p_addr == ADDR_EMPTY) {
+					p_block = blk_nvm_get_blk(lun, 1);
+					if (!p_block) {
+						pr_err("rrpc: no more blocks\n");
+						goto finished;
+					} else {
+						rlun->gc_cur = p_block;
+						p_addr =
+					       blk_nvm_alloc_addr(rlun->gc_cur);
+					}
+				}
+				p_block = rlun->gc_cur;
+			}
+			goto finished;
+		}
+
+		rrpc_set_lun_cur(rlun, p_block);
+		p_addr = blk_nvm_alloc_addr(p_block);
+	}
+
+finished:
+	if (p_addr == ADDR_EMPTY)
+		goto err;
+
+	p->addr = p_addr;
+	p->block = p_block;
+
+	if (!p_block)
+		WARN_ON(is_gc);
+
+	spin_unlock(&rlun->lock);
+	if (p)
+		nvm_update_map(rrpc, laddr, p, is_gc);
+	return p;
+err:
+	spin_unlock(&rlun->lock);
+	mempool_free(p, rrpc->addr_pool);
+	return NULL;
+}
+
+static void __rrpc_unprep_rq(struct rrpc *rrpc, struct request *rq)
+{
+	struct nvm_per_rq *pb = get_per_rq_data(rq);
+	struct nvm_addr *p = pb->addr;
+	struct nvm_block *block = p->block;
+	struct nvm_lun *lun = block->lun;
+	struct rrpc_block_gc *gcb;
+	int cmnt_size;
+
+	rrpc_unlock_rq(rrpc, rq);
+
+	if (rq_data_dir(rq) == WRITE) {
+		cmnt_size = atomic_inc_return(&block->data_cmnt_size);
+		if (likely(cmnt_size != lun->nr_pages_per_blk))
+			goto done;
+
+		gcb = mempool_alloc(rrpc->gcb_pool, GFP_ATOMIC);
+		if (!gcb) {
+			pr_err("rrpc: not able to queue block for gc.\n");
+			goto done;
+		}
+
+		gcb->rrpc = rrpc;
+		gcb->block = block;
+		INIT_WORK(&gcb->ws_gc, rrpc_gc_queue);
+
+		queue_work(rrpc->kgc_wq, &gcb->ws_gc);
+	}
+
+done:
+	mempool_free(pb->addr, rrpc->addr_pool);
+}
+
+static void rrpc_unprep_rq(struct request_queue *q, struct request *rq)
+{
+	struct rrpc *rrpc;
+	struct bio *bio;
+
+	bio = rq->bio;
+	if (unlikely(!bio))
+		return;
+
+	rrpc = container_of(bio->bi_nvm, struct rrpc, payload);
+
+	if (rq->cmd_flags & REQ_NVM_MAPPED)
+		__rrpc_unprep_rq(rrpc, rq);
+}
+
+/* Look up the primary translation table. If there isn't a block associated
+ * with the addr, we assume there is no data and do not take a ref. */
+static struct nvm_addr *rrpc_lookup_ltop(struct rrpc *rrpc, sector_t laddr)
+{
+	struct nvm_addr *gp, *p;
+
+	BUG_ON(laddr >= rrpc->nr_pages);
+
+	p = mempool_alloc(rrpc->addr_pool, GFP_ATOMIC);
+	if (!p)
+		return NULL;
+
+	gp = &rrpc->trans_map[laddr];
+
+	p->addr = gp->addr;
+	p->block = gp->block;
+
+	return p;
+}
+
+static int rrpc_requeue_and_kick(struct rrpc *rrpc, struct request *rq)
+{
+	blk_mq_requeue_request(rq);
+	blk_mq_kick_requeue_list(rrpc->q_dev);
+	return BLK_MQ_RQ_QUEUE_DONE;
+}
+
+static int rrpc_read_rq(struct rrpc *rrpc, struct request *rq)
+{
+	struct nvm_addr *p;
+	struct nvm_per_rq *pb;
+	sector_t l_addr = nvm_get_laddr(rq);
+
+	if (rrpc_lock_rq(rrpc, rq))
+		return BLK_MQ_RQ_QUEUE_BUSY;
+
+	p = rrpc_lookup_ltop(rrpc, l_addr);
+	if (!p) {
+		rrpc_unlock_rq(rrpc, rq);
+		return BLK_MQ_RQ_QUEUE_BUSY;
+	}
+
+	if (p->block)
+		rq->phys_sector = nvm_get_sector(p->addr) +
+					(blk_rq_pos(rq) % NR_PHY_IN_LOG);
+	else {
+		rrpc_unlock_rq(rrpc, rq);
+		blk_mq_end_request(rq, 0);
+		return BLK_MQ_RQ_QUEUE_DONE;
+	}
+
+	pb = get_per_rq_data(rq);
+	pb->addr = p;
+
+	return BLK_MQ_RQ_QUEUE_OK;
+}
+
+static int rrpc_write_rq(struct rrpc *rrpc, struct request *rq)
+{
+	struct nvm_per_rq *pb;
+	struct nvm_addr *p;
+	int is_gc = 0;
+	sector_t l_addr = nvm_get_laddr(rq);
+
+	if (rq->cmd_flags & REQ_NVM_NO_INFLIGHT)
+		is_gc = 1;
+
+	if (rrpc_lock_rq(rrpc, rq))
+		return rrpc_requeue_and_kick(rrpc, rq);
+
+	p = rrpc_map_page(rrpc, l_addr, is_gc);
+	if (!p) {
+		BUG_ON(is_gc);
+		rrpc_unlock_rq(rrpc, rq);
+		rrpc_gc_kick(rrpc);
+		return rrpc_requeue_and_kick(rrpc, rq);
+	}
+
+	rq->phys_sector = nvm_get_sector(p->addr);
+
+	pb = get_per_rq_data(rq);
+	pb->addr = p;
+
+	return BLK_MQ_RQ_QUEUE_OK;
+}
+
+static int __rrpc_prep_rq(struct rrpc *rrpc, struct request *rq)
+{
+	int rw = rq_data_dir(rq);
+	int ret;
+
+	if (rw == WRITE)
+		ret = rrpc_write_rq(rrpc, rq);
+	else
+		ret = rrpc_read_rq(rrpc, rq);
+
+	if (!ret)
+		rq->cmd_flags |= (REQ_NVM_MAPPED|REQ_DONTPREP);
+
+	return ret;
+}
+
+static int rrpc_prep_rq(struct request_queue *q, struct request *rq)
+{
+	struct rrpc *rrpc;
+	struct bio *bio;
+
+	bio = rq->bio;
+	if (unlikely(!bio))
+		return 0;
+
+	if (unlikely(!bio->bi_nvm)) {
+		if (bio_data_dir(bio) == WRITE) {
+			pr_warn("nvm: attempting to write without FTL.\n");
+			return BLK_MQ_RQ_QUEUE_ERROR;
+		}
+		return BLK_MQ_RQ_QUEUE_OK;
+	}
+
+	rrpc = container_of(bio->bi_nvm, struct rrpc, payload);
+
+	return __rrpc_prep_rq(rrpc, rq);
+}
+
+static void rrpc_make_rq(struct request_queue *q, struct bio *bio)
+{
+	struct rrpc *rrpc = q->queuedata;
+
+	if (bio->bi_rw & REQ_DISCARD) {
+		rrpc_discard(rrpc, bio);
+		return;
+	}
+
+	bio->bi_nvm = &rrpc->payload;
+	bio->bi_bdev = rrpc->q_bdev;
+
+	generic_make_request(bio);
+}
+
+static void rrpc_gc_free(struct rrpc *rrpc)
+{
+	struct rrpc_lun *rlun;
+	int i;
+
+	if (rrpc->krqd_wq)
+		destroy_workqueue(rrpc->krqd_wq);
+
+	if (rrpc->kgc_wq)
+		destroy_workqueue(rrpc->kgc_wq);
+
+	if (!rrpc->luns)
+		return;
+
+	for (i = 0; i < rrpc->nr_luns; i++) {
+		rlun = &rrpc->luns[i];
+
+		if (!rlun->blocks)
+			break;
+		vfree(rlun->blocks);
+	}
+}
+
+static int rrpc_gc_init(struct rrpc *rrpc)
+{
+	rrpc->krqd_wq = alloc_workqueue("knvm-work", WQ_MEM_RECLAIM|WQ_UNBOUND,
+						rrpc->nr_luns);
+	if (!rrpc->krqd_wq)
+		return -ENOMEM;
+
+	rrpc->kgc_wq = alloc_workqueue("knvm-gc", WQ_MEM_RECLAIM, 1);
+	if (!rrpc->kgc_wq)
+		return -ENOMEM;
+
+	setup_timer(&rrpc->gc_timer, rrpc_gc_timer, (unsigned long)rrpc);
+
+	return 0;
+}
+
+static void rrpc_map_free(struct rrpc *rrpc)
+{
+	vfree(rrpc->rev_trans_map);
+	vfree(rrpc->trans_map);
+}
+
+static int rrpc_l2p_update(u64 slba, u64 nlb, u64 *entries, void *private)
+{
+	struct rrpc *rrpc = (struct rrpc *)private;
+	struct nvm_dev *dev = rrpc->q_nvm;
+	struct nvm_addr *addr = rrpc->trans_map + slba;
+	struct nvm_rev_addr *raddr = rrpc->rev_trans_map;
+	sector_t max_pages = dev->total_pages * (dev->sector_size >> 9);
+	u64 elba = slba + nlb;
+	u64 i;
+
+	if (unlikely(elba > dev->total_pages)) {
+		pr_err("nvm: L2P data from device is out of bounds!\n");
+		return -EINVAL;
+	}
+
+	for (i = 0; i < nlb; i++) {
+		u64 pba = le64_to_cpu(entries[i]);
+		/* LNVM treats address-spaces as silos, LBA and PBA are
+		 * equally large and zero-indexed. */
+		if (unlikely(pba >= max_pages && pba != U64_MAX)) {
+			pr_err("nvm: L2P data entry is out of bounds!\n");
+			return -EINVAL;
+		}
+
+		/* Address zero is special: the first page on a disk is
+		 * protected, as it often holds internal device boot
+		 * information. */
+		if (!pba)
+			continue;
+
+		addr[i].addr = pba;
+		raddr[pba].addr = slba + i;
+	}
+
+	return 0;
+}
+
+static int rrpc_map_init(struct rrpc *rrpc)
+{
+	struct nvm_dev *dev = rrpc->q_nvm;
+	sector_t i;
+	int ret;
+
+	rrpc->trans_map = vzalloc(sizeof(struct nvm_addr) * rrpc->nr_pages);
+	if (!rrpc->trans_map)
+		return -ENOMEM;
+
+	rrpc->rev_trans_map = vmalloc(sizeof(struct nvm_rev_addr)
+							* rrpc->nr_pages);
+	if (!rrpc->rev_trans_map)
+		return -ENOMEM;
+
+	for (i = 0; i < rrpc->nr_pages; i++) {
+		struct nvm_addr *p = &rrpc->trans_map[i];
+		struct nvm_rev_addr *r = &rrpc->rev_trans_map[i];
+
+		p->addr = ADDR_EMPTY;
+		r->addr = ADDR_EMPTY;
+	}
+
+	if (!dev->ops->get_l2p_tbl)
+		return 0;
+
+	/* Bring up the mapping table from device */
+	ret = dev->ops->get_l2p_tbl(dev->q, 0, dev->total_pages,
+							rrpc_l2p_update, rrpc);
+	if (ret) {
+		pr_err("nvm: rrpc: could not read L2P table.\n");
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+
+/* Minimum pages needed within a lun */
+#define PAGE_POOL_SIZE 16
+#define ADDR_POOL_SIZE 64
+
+static int rrpc_core_init(struct rrpc *rrpc)
+{
+	int i;
+
+	down_write(&_lock);
+	if (!_addr_cache) {
+		_addr_cache = kmem_cache_create("nvm_addr_cache",
+				sizeof(struct nvm_addr), 0, 0, NULL);
+		if (!_addr_cache) {
+			up_write(&_lock);
+			return -ENOMEM;
+		}
+	}
+
+	if (!_gcb_cache) {
+		_gcb_cache = kmem_cache_create("nvm_gcb_cache",
+				sizeof(struct rrpc_block_gc), 0, 0, NULL);
+		if (!_gcb_cache) {
+			kmem_cache_destroy(_addr_cache);
+			up_write(&_lock);
+			return -ENOMEM;
+		}
+	}
+	up_write(&_lock);
+
+	rrpc->page_pool = mempool_create_page_pool(PAGE_POOL_SIZE, 0);
+	if (!rrpc->page_pool)
+		return -ENOMEM;
+
+	rrpc->addr_pool = mempool_create_slab_pool(ADDR_POOL_SIZE, _addr_cache);
+	if (!rrpc->addr_pool)
+		return -ENOMEM;
+
+	rrpc->gcb_pool = mempool_create_slab_pool(rrpc->q_nvm->nr_luns,
+								_gcb_cache);
+	if (!rrpc->gcb_pool)
+		return -ENOMEM;
+
+	for (i = 0; i < NVM_INFLIGHT_PARTITIONS; i++) {
+		struct nvm_inflight *map = &rrpc->inflight_map[i];
+
+		spin_lock_init(&map->lock);
+		INIT_LIST_HEAD(&map->reqs);
+	}
+
+	return 0;
+}
+
+static void rrpc_core_free(struct rrpc *rrpc)
+{
+	if (rrpc->addr_pool)
+		mempool_destroy(rrpc->addr_pool);
+	if (rrpc->page_pool)
+		mempool_destroy(rrpc->page_pool);
+
+	down_write(&_lock);
+	if (_addr_cache)
+		kmem_cache_destroy(_addr_cache);
+	if (_gcb_cache)
+		kmem_cache_destroy(_gcb_cache);
+	up_write(&_lock);
+}
+
+static void rrpc_luns_free(struct rrpc *rrpc)
+{
+	kfree(rrpc->luns);
+}
+
+static int rrpc_luns_init(struct rrpc *rrpc, int lun_begin, int lun_end)
+{
+	struct nvm_dev *dev = rrpc->q_nvm;
+	struct nvm_block *block;
+	struct rrpc_lun *rlun;
+	int i, j;
+
+	spin_lock_init(&rrpc->rev_lock);
+
+	rrpc->luns = kcalloc(rrpc->nr_luns, sizeof(struct rrpc_lun),
+								GFP_KERNEL);
+	if (!rrpc->luns)
+		return -ENOMEM;
+
+	/* 1:1 mapping */
+	for (i = 0; i < rrpc->nr_luns; i++) {
+		struct nvm_lun *lun = &dev->luns[i + lun_begin];
+
+		rlun = &rrpc->luns[i];
+		rlun->rrpc = rrpc;
+		rlun->parent = lun;
+		rlun->nr_blocks = lun->nr_blocks;
+
+		rrpc->total_blocks += lun->nr_blocks;
+		rrpc->nr_pages += lun->nr_blocks * lun->nr_pages_per_blk;
+
+		INIT_LIST_HEAD(&rlun->prio_list);
+		INIT_WORK(&rlun->ws_gc, rrpc_lun_gc);
+		spin_lock_init(&rlun->lock);
+
+		rlun->blocks = vzalloc(sizeof(struct rrpc_block) *
+						 rlun->nr_blocks);
+		if (!rlun->blocks)
+			goto err;
+
+		lun_for_each_block(lun, block, j) {
+			struct rrpc_block *rblock = &rlun->blocks[j];
+
+			rblock->parent = block;
+			INIT_LIST_HEAD(&rblock->prio);
+		}
+	}
+
+	return 0;
+err:
+	return -ENOMEM;
+}
+
+static void rrpc_free(struct rrpc *rrpc)
+{
+	rrpc_gc_free(rrpc);
+	rrpc_map_free(rrpc);
+	rrpc_core_free(rrpc);
+	rrpc_luns_free(rrpc);
+
+	kfree(rrpc);
+}
+
+static void rrpc_exit(void *private)
+{
+	struct rrpc *rrpc = private;
+
+	blkdev_put(rrpc->q_bdev, FMODE_WRITE | FMODE_READ);
+	del_timer(&rrpc->gc_timer);
+
+	flush_workqueue(rrpc->krqd_wq);
+	flush_workqueue(rrpc->kgc_wq);
+
+	rrpc_free(rrpc);
+}
+
+static sector_t rrpc_capacity(void *private)
+{
+	struct rrpc *rrpc = private;
+	struct nvm_lun *lun;
+	sector_t reserved;
+	int i, max_pages_per_blk = 0;
+
+	nvm_for_each_lun(rrpc->q_nvm, lun, i) {
+		if (lun->nr_pages_per_blk > max_pages_per_blk)
+			max_pages_per_blk = lun->nr_pages_per_blk;
+	}
+
+	/* cur, gc, and two emergency blocks for each lun */
+	reserved = rrpc->nr_luns * max_pages_per_blk * 4;
+
+	if (reserved > rrpc->nr_pages) {
+		pr_err("rrpc: not enough space available to expose storage.\n");
+		return 0;
+	}
+
+	return ((rrpc->nr_pages - reserved) / 10) * 9 * NR_PHY_IN_LOG;
+}
+
+/*
+ * Looks up the logical address in the reverse translation map and checks
+ * whether it is still valid by comparing the stored logical-to-physical
+ * mapping against the physical address; stale pages are marked invalid.
+ */
+static void rrpc_block_map_update(struct rrpc *rrpc, struct nvm_block *block)
+{
+	struct nvm_lun *lun = block->lun;
+	int offset;
+	struct nvm_addr *laddr;
+	sector_t paddr, pladdr;
+
+	for (offset = 0; offset < lun->nr_pages_per_blk; offset++) {
+		paddr = block_to_addr(block) + offset;
+
+		pladdr = rrpc->rev_trans_map[paddr].addr;
+		if (pladdr == ADDR_EMPTY)
+			continue;
+
+		laddr = &rrpc->trans_map[pladdr];
+
+		if (paddr == laddr->addr) {
+			laddr->block = block;
+		} else {
+			set_bit(offset, block->invalid_pages);
+			block->nr_invalid_pages++;
+		}
+	}
+}
+
+static int rrpc_blocks_init(struct rrpc *rrpc)
+{
+	struct nvm_dev *dev = rrpc->q_nvm;
+	struct nvm_lun *lun;
+	struct nvm_block *blk;
+	sector_t lun_iter, blk_iter;
+
+	for (lun_iter = 0; lun_iter < rrpc->nr_luns; lun_iter++) {
+		lun = &dev->luns[lun_iter + rrpc->lun_offset];
+
+		lun_for_each_block(lun, blk, blk_iter)
+			rrpc_block_map_update(rrpc, blk);
+	}
+
+	return 0;
+}
+
+static int rrpc_luns_configure(struct rrpc *rrpc)
+{
+	struct rrpc_lun *rlun;
+	struct nvm_block *blk;
+	int i;
+
+	for (i = 0; i < rrpc->nr_luns; i++) {
+		rlun = &rrpc->luns[i];
+
+		blk = blk_nvm_get_blk(rlun->parent, 0);
+		if (!blk)
+			return -EINVAL;
+
+		rrpc_set_lun_cur(rlun, blk);
+
+		/* Emergency gc block */
+		blk = blk_nvm_get_blk(rlun->parent, 1);
+		if (!blk)
+			return -EINVAL;
+		rlun->gc_cur = blk;
+	}
+
+	return 0;
+}
+
+static void *rrpc_init(struct request_queue *qdev,
+			struct request_queue *qtarget, struct gendisk *qdisk,
+			struct gendisk *tdisk, int lun_begin, int lun_end)
+{
+	struct nvm_dev *dev;
+	struct block_device *bdev;
+	struct rrpc *rrpc;
+	int ret;
+
+	if (!blk_queue_nvm(qdev)) {
+		pr_err("nvm: block device not supported.\n");
+		return ERR_PTR(-EINVAL);
+	}
+
+	bdev = bdget_disk(qdisk, 0);
+	if (blkdev_get(bdev, FMODE_WRITE | FMODE_READ, NULL)) {
+		pr_err("nvm: could not access backing device\n");
+		return ERR_PTR(-EINVAL);
+	}
+
+	dev = blk_nvm_get_dev(qdev);
+
+	rrpc = kzalloc(sizeof(struct rrpc), GFP_KERNEL);
+	if (!rrpc) {
+		ret = -ENOMEM;
+		goto err;
+	}
+
+	rrpc->q_dev = qdev;
+	rrpc->q_nvm = qdev->nvm;
+	rrpc->q_bdev = bdev;
+	rrpc->nr_luns = lun_end - lun_begin + 1;
+
+	/* simple round-robin strategy */
+	atomic_set(&rrpc->next_lun, -1);
+
+	ret = rrpc_luns_init(rrpc, lun_begin, lun_end);
+	if (ret) {
+		pr_err("nvm: could not initialize luns\n");
+		goto err;
+	}
+
+	rrpc->poffset = rrpc->luns[0].parent->nr_blocks *
+			rrpc->luns[0].parent->nr_pages_per_blk * lun_begin;
+	rrpc->lun_offset = lun_begin;
+
+	ret = rrpc_core_init(rrpc);
+	if (ret) {
+		pr_err("nvm: rrpc: could not initialize core\n");
+		goto err;
+	}
+
+	ret = rrpc_map_init(rrpc);
+	if (ret) {
+		pr_err("nvm: rrpc: could not initialize maps\n");
+		goto err;
+	}
+
+	ret = rrpc_blocks_init(rrpc);
+	if (ret) {
+		pr_err("nvm: rrpc: could not initialize state for blocks\n");
+		goto err;
+	}
+
+	ret = rrpc_luns_configure(rrpc);
+	if (ret) {
+		pr_err("nvm: rrpc: not enough blocks available in LUNs.\n");
+		goto err;
+	}
+
+	ret = rrpc_gc_init(rrpc);
+	if (ret) {
+		pr_err("nvm: rrpc: could not initialize gc\n");
+		goto err;
+	}
+
+	/* make sure to inherit the size from the underlying device */
+	blk_queue_logical_block_size(qtarget, queue_physical_block_size(qdev));
+	blk_queue_max_hw_sectors(qtarget, queue_max_hw_sectors(qdev));
+
+	pr_info("nvm: rrpc initialized with %u luns and %llu pages.\n",
+			rrpc->nr_luns, (unsigned long long)rrpc->nr_pages);
+
+	mod_timer(&rrpc->gc_timer, jiffies + msecs_to_jiffies(10));
+
+	return rrpc;
+err:
+	blkdev_put(bdev, FMODE_WRITE | FMODE_READ);
+	rrpc_free(rrpc);
+	return ERR_PTR(ret);
+}
+
+/* round robin, page-based FTL, and cost-based GC */
+static struct nvm_target_type tt_rrpc = {
+	.name		= "rrpc",
+
+	.make_rq	= rrpc_make_rq,
+	.prep_rq	= rrpc_prep_rq,
+	.unprep_rq	= rrpc_unprep_rq,
+
+	.capacity	= rrpc_capacity,
+
+	.init		= rrpc_init,
+	.exit		= rrpc_exit,
+};
+
+static int __init rrpc_module_init(void)
+{
+	return nvm_register_target(&tt_rrpc);
+}
+
+static void rrpc_module_exit(void)
+{
+	nvm_unregister_target(&tt_rrpc);
+}
+
+module_init(rrpc_module_init);
+module_exit(rrpc_module_exit);
+MODULE_LICENSE("GPL");
+MODULE_DESCRIPTION("Round-Robin Cost-based Hybrid Layer for Open-Channel SSDs");
diff --git a/drivers/lightnvm/rrpc.h b/drivers/lightnvm/rrpc.h
new file mode 100644
index 0000000..eeebc7f
--- /dev/null
+++ b/drivers/lightnvm/rrpc.h
@@ -0,0 +1,203 @@
+/*
+ * Copyright (C) 2015 Matias Bjørling.
+ *
+ * This file is released under the GPL.
+ */
+
+#ifndef RRPC_H_
+#define RRPC_H_
+
+#include <linux/blkdev.h>
+#include <linux/blk-mq.h>
+#include <linux/bio.h>
+#include <linux/module.h>
+#include <linux/kthread.h>
+
+#include <linux/lightnvm.h>
+
+/* We partition the address space of the translation map into these pieces
+ * to track in-flight addresses. */
+#define NVM_INFLIGHT_PARTITIONS 1
+
+/* Run only GC if less than 1/X blocks are free */
+#define GC_LIMIT_INVERSE 10
+#define GC_TIME_SECS 100
+
+struct nvm_inflight {
+	spinlock_t lock;
+	struct list_head reqs;
+};
+
+struct rrpc_lun;
+
+struct rrpc_block {
+	struct nvm_block *parent;
+	struct list_head prio;
+};
+
+struct rrpc_lun {
+	struct rrpc *rrpc;
+	struct nvm_lun *parent;
+	struct nvm_block *cur, *gc_cur;
+	struct rrpc_block *blocks;	/* Reference to block allocation */
+	struct list_head prio_list;		/* Blocks that may be GC'ed */
+	struct work_struct ws_gc;
+
+	int nr_blocks;
+	spinlock_t lock;
+};
+
+struct rrpc {
+	struct bio_nvm_payload payload;
+
+	struct nvm_dev *q_nvm;
+	struct request_queue *q_dev;
+	struct block_device *q_bdev;
+
+	int nr_luns;
+	int lun_offset;
+	sector_t poffset; /* physical page offset */
+
+	struct rrpc_lun *luns;
+
+	/* calculated values */
+	unsigned long nr_pages;
+	unsigned long total_blocks;
+
+	/* Write strategy variables. Move these into a separate structure for
+	 * each strategy. */
+	atomic_t next_lun; /* Whenever a page is written, this is updated
+			    * to point to the next write lun */
+
+	/* Simple translation map of logical addresses to physical addresses.
+	 * The logical addresses are known by the host system, while the physical
+	 * addresses are used when writing to the disk block device. */
+	struct nvm_addr *trans_map;
+	/* also store a reverse map for garbage collection */
+	struct nvm_rev_addr *rev_trans_map;
+	spinlock_t rev_lock;
+
+	struct nvm_inflight inflight_map[NVM_INFLIGHT_PARTITIONS];
+
+	mempool_t *addr_pool;
+	mempool_t *page_pool;
+	mempool_t *gcb_pool;
+
+	struct timer_list gc_timer;
+	struct workqueue_struct *krqd_wq;
+	struct workqueue_struct *kgc_wq;
+
+	struct gc_blocks *gblks;
+	struct gc_luns *gluns;
+};
+
+struct rrpc_block_gc {
+	struct rrpc *rrpc;
+	struct nvm_block *block;
+	struct work_struct ws_gc;
+};
+
+static inline sector_t nvm_get_laddr(struct request *rq)
+{
+	return blk_rq_pos(rq) / NR_PHY_IN_LOG;
+}
+
+static inline sector_t nvm_get_sector(sector_t laddr)
+{
+	return laddr * NR_PHY_IN_LOG;
+}
+
+static inline void *get_per_rq_data(struct request *rq)
+{
+	struct request_queue *q = rq->q;
+
+	return blk_mq_rq_to_pdu(rq) + q->tag_set->cmd_size;
+}
+
+static inline int request_intersects(struct rrpc_inflight_rq *r,
+				sector_t laddr_start, sector_t laddr_end)
+{
+	return (laddr_end >= r->l_start && laddr_end <= r->l_end) &&
+		(laddr_start >= r->l_start && laddr_start <= r->l_end);
+}
+
+static int __rrpc_lock_laddr(struct rrpc *rrpc, sector_t laddr,
+			     unsigned pages, struct rrpc_inflight_rq *r)
+{
+	struct nvm_inflight *map =
+			&rrpc->inflight_map[laddr % NVM_INFLIGHT_PARTITIONS];
+	sector_t laddr_end = laddr + pages - 1;
+	struct rrpc_inflight_rq *rtmp;
+
+	spin_lock_irq(&map->lock);
+	list_for_each_entry(rtmp, &map->reqs, list) {
+		if (unlikely(request_intersects(rtmp, laddr, laddr_end))) {
+			/* existing, overlapping request, come back later */
+			spin_unlock_irq(&map->lock);
+			return 1;
+		}
+	}
+
+	r->l_start = laddr;
+	r->l_end = laddr_end;
+
+	list_add_tail(&r->list, &map->reqs);
+	spin_unlock_irq(&map->lock);
+	return 0;
+}
+
+static inline int rrpc_lock_laddr(struct rrpc *rrpc, sector_t laddr,
+				 unsigned pages,
+				 struct rrpc_inflight_rq *r)
+{
+	BUG_ON((laddr + pages) > rrpc->nr_pages);
+
+	return __rrpc_lock_laddr(rrpc, laddr, pages, r);
+}
+
+static inline struct rrpc_inflight_rq *rrpc_get_inflight_rq(struct request *rq)
+{
+	struct nvm_per_rq *pd = get_per_rq_data(rq);
+
+	return &pd->inflight_rq;
+}
+
+static inline int rrpc_lock_rq(struct rrpc *rrpc, struct request *rq)
+{
+	sector_t laddr = nvm_get_laddr(rq);
+	unsigned int pages = blk_rq_bytes(rq) / EXPOSED_PAGE_SIZE;
+	struct rrpc_inflight_rq *r = rrpc_get_inflight_rq(rq);
+
+	if (rq->cmd_flags & REQ_NVM_NO_INFLIGHT)
+		return 0;
+
+	return rrpc_lock_laddr(rrpc, laddr, pages, r);
+}
+
+static inline void rrpc_unlock_laddr(struct rrpc *rrpc, sector_t laddr,
+				    struct rrpc_inflight_rq *r)
+{
+	struct nvm_inflight *map =
+			&rrpc->inflight_map[laddr % NVM_INFLIGHT_PARTITIONS];
+	unsigned long flags;
+
+	spin_lock_irqsave(&map->lock, flags);
+	list_del_init(&r->list);
+	spin_unlock_irqrestore(&map->lock, flags);
+}
+
+static inline void rrpc_unlock_rq(struct rrpc *rrpc, struct request *rq)
+{
+	sector_t laddr = nvm_get_laddr(rq);
+	unsigned int pages = blk_rq_bytes(rq) / EXPOSED_PAGE_SIZE;
+	struct rrpc_inflight_rq *r = rrpc_get_inflight_rq(rq);
+
+	BUG_ON((laddr + pages) > rrpc->nr_pages);
+
+	if (rq->cmd_flags & REQ_NVM_NO_INFLIGHT)
+		return;
+
+	rrpc_unlock_laddr(rrpc, laddr, r);
+}
+
+#endif /* RRPC_H_ */
diff --git a/include/linux/lightnvm.h b/include/linux/lightnvm.h
index 888d994..5f9f187 100644
--- a/include/linux/lightnvm.h
+++ b/include/linux/lightnvm.h
@@ -29,7 +29,6 @@
 
 #define NVM_MSG_PREFIX "nvm"
 #define ADDR_EMPTY (~0ULL)
-#define LTOP_POISON 0xD3ADB33F
 
 /* core.c */
 
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 4/5 v2] null_blk: LightNVM support
  2015-04-15 12:34 ` Matias Bjørling
@ 2015-04-15 12:34   ` Matias Bjørling
  -1 siblings, 0 replies; 53+ messages in thread
From: Matias Bjørling @ 2015-04-15 12:34 UTC (permalink / raw)
  To: hch, axboe, linux-fsdevel, linux-kernel, linux-nvme
  Cc: javier, keith.busch, Matias Bjørling

Initial support for LightNVM. The support can be used to benchmark the
performance of targets and of the core implementation.
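
As a usage sketch (the parameter names follow the diff below; the exact
invocation is only an example), a LightNVM-capable null device could be
brought up with something like "modprobe null_blk queue_mode=2 bs=4096
nvm_enable=1 nvm_num_channels=1", after which a target such as rrpc can
be attached on top of it.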

Signed-off-by: Matias Bjørling <m@bjorling.me>
---
 Documentation/block/null_blk.txt |  8 ++++
 drivers/block/null_blk.c         | 89 +++++++++++++++++++++++++++++++++++++---
 2 files changed, 92 insertions(+), 5 deletions(-)

diff --git a/Documentation/block/null_blk.txt b/Documentation/block/null_blk.txt
index 2f6c6ff..b907ecc 100644
--- a/Documentation/block/null_blk.txt
+++ b/Documentation/block/null_blk.txt
@@ -70,3 +70,11 @@ use_per_node_hctx=[0/1]: Default: 0
      parameter.
   1: The multi-queue block layer is instantiated with a hardware dispatch
      queue for each CPU node in the system.
+
+IV: LightNVM specific parameters
+
+nvm_enable=[0/1]: Default: 0
+  Enable LightNVM for null block devices. Requires blk-mq (queue_mode=2).
+
+nvm_num_channels=[x]: Default: 1
+  Number of LightNVM channels exposed to the LightNVM core.
diff --git a/drivers/block/null_blk.c b/drivers/block/null_blk.c
index 65cd61a..9cf566e 100644
--- a/drivers/block/null_blk.c
+++ b/drivers/block/null_blk.c
@@ -7,6 +7,7 @@
 #include <linux/init.h>
 #include <linux/slab.h>
 #include <linux/blk-mq.h>
+#include <linux/lightnvm.h>
 #include <linux/hrtimer.h>
 
 struct nullb_cmd {
@@ -147,6 +148,14 @@ static bool use_per_node_hctx = false;
 module_param(use_per_node_hctx, bool, S_IRUGO);
 MODULE_PARM_DESC(use_per_node_hctx, "Use per-node allocation for hardware context queues. Default: false");
 
+static bool nvm_enable;
+module_param(nvm_enable, bool, S_IRUGO);
+MODULE_PARM_DESC(nvm_enable, "Enable Open-channel SSD. Default: false");
+
+static int nvm_num_channels = 1;
+module_param(nvm_num_channels, int, S_IRUGO);
+MODULE_PARM_DESC(nvm_num_channels, "Number of channels to be exposed from the Open-Channel SSD. Default: 1");
+
 static void put_tag(struct nullb_queue *nq, unsigned int tag)
 {
 	clear_bit_unlock(tag, nq->tag_map);
@@ -351,6 +360,50 @@ static void null_request_fn(struct request_queue *q)
 	}
 }
 
+static int null_nvm_id(struct request_queue *q, struct nvm_id *id)
+{
+	sector_t size = gb * 1024 * 1024 * 1024ULL;
+	unsigned long per_chnl_size =
+				size / bs / nvm_num_channels;
+	struct nvm_id_chnl *chnl;
+	int i;
+
+	id->ver_id = 0x1;
+	id->nvm_type = NVM_NVMT_BLK;
+	id->nchannels = nvm_num_channels;
+
+	id->chnls = kmalloc_array(id->nchannels, sizeof(struct nvm_id_chnl),
+								GFP_KERNEL);
+	if (!id->chnls)
+		return -ENOMEM;
+
+	for (i = 0; i < id->nchannels; i++) {
+		chnl = &id->chnls[i];
+		chnl->queue_size = hw_queue_depth;
+		chnl->gran_read = bs;
+		chnl->gran_write = bs;
+		chnl->gran_erase = bs * 256;
+		chnl->oob_size = 0;
+		chnl->t_r = chnl->t_sqr = 25000; /* 25us */
+		chnl->t_w = chnl->t_sqw = 500000; /* 500us */
+		chnl->t_e = 1500000; /* 1500us */
+		chnl->io_sched = NVM_IOSCHED_CHANNEL;
+		chnl->laddr_begin = per_chnl_size * i;
+		chnl->laddr_end = per_chnl_size * (i + 1) - 1;
+	}
+
+	return 0;
+}
+
+static int null_nvm_get_features(struct request_queue *q,
+						struct nvm_get_features *gf)
+{
+	gf->rsp = 0;
+	gf->ext = 0;
+
+	return 0;
+}
+
 static int null_queue_rq(struct blk_mq_hw_ctx *hctx,
 			 const struct blk_mq_queue_data *bd)
 {
@@ -387,6 +440,11 @@ static int null_init_hctx(struct blk_mq_hw_ctx *hctx, void *data,
 	return 0;
 }
 
+static struct nvm_dev_ops null_nvm_dev_ops = {
+	.identify		= null_nvm_id,
+	.get_features		= null_nvm_get_features,
+};
+
 static struct blk_mq_ops null_mq_ops = {
 	.queue_rq       = null_queue_rq,
 	.map_queue      = blk_mq_map_queue,
@@ -525,6 +583,17 @@ static int null_add_dev(void)
 		nullb->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
 		nullb->tag_set.driver_data = nullb;
 
+		if (nvm_enable) {
+			nullb->tag_set.flags &= ~BLK_MQ_F_SHOULD_MERGE;
+			nullb->tag_set.flags |= BLK_MQ_F_NVM;
+
+			if (bs != 4096) {
+				pr_warn("null_blk: only a 4K block size is supported for Open-Channel SSDs. bs is set to 4K.\n");
+				bs = 4096;
+			}
+
+		}
+
 		rv = blk_mq_alloc_tag_set(&nullb->tag_set);
 		if (rv)
 			goto out_cleanup_queues;
@@ -567,11 +636,6 @@ static int null_add_dev(void)
 		goto out_cleanup_blk_queue;
 	}
 
-	mutex_lock(&lock);
-	list_add_tail(&nullb->list, &nullb_list);
-	nullb->index = nullb_indexes++;
-	mutex_unlock(&lock);
-
 	blk_queue_logical_block_size(nullb->q, bs);
 	blk_queue_physical_block_size(nullb->q, bs);
 
@@ -579,16 +643,31 @@ static int null_add_dev(void)
 	sector_div(size, bs);
 	set_capacity(disk, size);
 
+	mutex_lock(&lock);
+	nullb->index = nullb_indexes++;
+	list_add_tail(&nullb->list, &nullb_list);
+	mutex_unlock(&lock);
+
 	disk->flags |= GENHD_FL_EXT_DEVT | GENHD_FL_SUPPRESS_PARTITION_INFO;
 	disk->major		= null_major;
 	disk->first_minor	= nullb->index;
 	disk->fops		= &null_fops;
 	disk->private_data	= nullb;
 	disk->queue		= nullb->q;
+
+	if (nvm_enable && queue_mode == NULL_Q_MQ) {
+		if (blk_nvm_register(nullb->q, &null_nvm_dev_ops))
+			goto out_cleanup_nvm;
+
+		nullb->q->nvm->drv_cmd_size = sizeof(struct nullb_cmd);
+	}
+
 	sprintf(disk->disk_name, "nullb%d", nullb->index);
 	add_disk(disk);
 	return 0;
 
+out_cleanup_nvm:
+	put_disk(disk);
 out_cleanup_blk_queue:
 	blk_cleanup_queue(nullb->q);
 out_cleanup_tags:
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 5/5 v2] nvme: LightNVM support
  2015-04-15 12:34 ` Matias Bjørling
@ 2015-04-15 12:34   ` Matias Bjørling
  -1 siblings, 0 replies; 53+ messages in thread
From: Matias Bjørling @ 2015-04-15 12:34 UTC (permalink / raw)
  To: hch, axboe, linux-fsdevel, linux-kernel, linux-nvme
  Cc: javier, keith.busch, Matias Bjørling

The first generation of Open-Channel SSDs will be based on NVMe. The
integration requires that an NVMe device expose itself as a LightNVM
device. Currently this is done by hooking into the Controller
Capabilities (CAP register) and a per-namespace bit in NSFEAT.

After detection, vendor-specific commands are used to identify the
device and enumerate its supported features.
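
In condensed form, the detection path described above looks roughly like
this (a sketch taken from the hunks further down; error handling omitted):

	u64 cap = readq(&dev->bar->cap);

	if (NVME_CAP_NVM(cap)) {
		/* controller advertises the LightNVM command set */
		dev->tagset.flags |= BLK_MQ_F_NVM;
	}

	if (id->nsfeat & NVME_NS_FEAT_NVM) {
		/* namespace opts in; register it with the LightNVM core */
		blk_nvm_register(ns->queue, &nvme_nvm_dev_ops);
	}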

Signed-off-by: Matias Bjørling <m@bjorling.me>
---
 drivers/block/nvme-core.c | 380 +++++++++++++++++++++++++++++++++++++++++++++-
 include/linux/nvme.h      |   2 +
 include/uapi/linux/nvme.h | 116 ++++++++++++++
 3 files changed, 497 insertions(+), 1 deletion(-)

diff --git a/drivers/block/nvme-core.c b/drivers/block/nvme-core.c
index e23be20..cbbf728 100644
--- a/drivers/block/nvme-core.c
+++ b/drivers/block/nvme-core.c
@@ -39,6 +39,7 @@
 #include <linux/slab.h>
 #include <linux/t10-pi.h>
 #include <linux/types.h>
+#include <linux/lightnvm.h>
 #include <scsi/sg.h>
 #include <asm-generic/io-64-nonatomic-lo-hi.h>
 
@@ -134,6 +135,8 @@ static inline void _nvme_check_size(void)
 	BUILD_BUG_ON(sizeof(struct nvme_id_ns) != 4096);
 	BUILD_BUG_ON(sizeof(struct nvme_lba_range_type) != 64);
 	BUILD_BUG_ON(sizeof(struct nvme_smart_log) != 512);
+	BUILD_BUG_ON(sizeof(struct nvme_lnvm_hb_write_command) != 64);
+	BUILD_BUG_ON(sizeof(struct nvme_lnvm_l2ptbl_command) != 64);
 }
 
 typedef void (*nvme_completion_fn)(struct nvme_queue *, void *,
@@ -591,6 +594,30 @@ static void nvme_init_integrity(struct nvme_ns *ns)
 }
 #endif
 
+static struct nvme_iod *nvme_get_dma_iod(struct nvme_dev *dev, void *buf,
+								unsigned length)
+{
+	struct scatterlist *sg;
+	struct nvme_iod *iod;
+	struct device *ddev = &dev->pci_dev->dev;
+
+	if (!length || length > INT_MAX - PAGE_SIZE)
+		return ERR_PTR(-EINVAL);
+
+	iod = __nvme_alloc_iod(1, length, dev, 0, GFP_KERNEL);
+	if (!iod)
+		goto err;
+
+	sg = iod->sg;
+	sg_init_one(sg, buf, length);
+	iod->nents = 1;
+	dma_map_sg(ddev, sg, iod->nents, DMA_FROM_DEVICE);
+
+	return iod;
+err:
+	return ERR_PTR(-ENOMEM);
+}
+
 static void req_completion(struct nvme_queue *nvmeq, void *ctx,
 						struct nvme_completion *cqe)
 {
@@ -760,6 +787,46 @@ static void nvme_submit_flush(struct nvme_queue *nvmeq, struct nvme_ns *ns,
 	writel(nvmeq->sq_tail, nvmeq->q_db);
 }
 
+static int nvme_submit_lnvm_iod(struct nvme_queue *nvmeq, struct nvme_iod *iod,
+							struct nvme_ns *ns)
+{
+	struct request *req = iod_get_private(iod);
+	struct nvme_command *cmnd;
+	u16 control = 0;
+	u32 dsmgmt = 0;
+
+	if (req->cmd_flags & REQ_FUA)
+		control |= NVME_RW_FUA;
+	if (req->cmd_flags & (REQ_FAILFAST_DEV | REQ_RAHEAD))
+		control |= NVME_RW_LR;
+
+	if (req->cmd_flags & REQ_RAHEAD)
+		dsmgmt |= NVME_RW_DSM_FREQ_PREFETCH;
+
+	cmnd = &nvmeq->sq_cmds[nvmeq->sq_tail];
+	memset(cmnd, 0, sizeof(*cmnd));
+
+	cmnd->lnvm_hb_w.opcode = (rq_data_dir(req) ?
+				lnvm_cmd_hybrid_write : lnvm_cmd_hybrid_read);
+	cmnd->lnvm_hb_w.command_id = req->tag;
+	cmnd->lnvm_hb_w.nsid = cpu_to_le32(ns->ns_id);
+	cmnd->lnvm_hb_w.prp1 = cpu_to_le64(sg_dma_address(iod->sg));
+	cmnd->lnvm_hb_w.prp2 = cpu_to_le64(iod->first_dma);
+	cmnd->lnvm_hb_w.slba = cpu_to_le64(nvme_block_nr(ns, blk_rq_pos(req)));
+	cmnd->lnvm_hb_w.length = cpu_to_le16(
+			(blk_rq_bytes(req) >> ns->lba_shift) - 1);
+	cmnd->lnvm_hb_w.control = cpu_to_le16(control);
+	cmnd->lnvm_hb_w.dsmgmt = cpu_to_le32(dsmgmt);
+	cmnd->lnvm_hb_w.phys_addr =
+			cpu_to_le64(nvme_block_nr(ns, req->phys_sector));
+
+	if (++nvmeq->sq_tail == nvmeq->q_depth)
+		nvmeq->sq_tail = 0;
+	writel(nvmeq->sq_tail, nvmeq->q_db);
+
+	return 0;
+}
+
 static int nvme_submit_iod(struct nvme_queue *nvmeq, struct nvme_iod *iod,
 							struct nvme_ns *ns)
 {
@@ -895,6 +962,8 @@ static int nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
 		nvme_submit_discard(nvmeq, ns, req, iod);
 	else if (req->cmd_flags & REQ_FLUSH)
 		nvme_submit_flush(nvmeq, ns, req->tag);
+	else if (req->cmd_flags & REQ_NVM_MAPPED)
+		nvme_submit_lnvm_iod(nvmeq, iod, ns);
 	else
 		nvme_submit_iod(nvmeq, iod, ns);
 
@@ -1156,6 +1225,84 @@ static int adapter_delete_sq(struct nvme_dev *dev, u16 sqid)
 	return adapter_delete_queue(dev, nvme_admin_delete_sq, sqid);
 }
 
+int nvme_nvm_identify_cmd(struct nvme_dev *dev, u32 chnl_off,
+							dma_addr_t dma_addr)
+{
+	struct nvme_command c;
+
+	memset(&c, 0, sizeof(c));
+	c.common.opcode = lnvm_admin_identify;
+	c.common.nsid = cpu_to_le32(chnl_off);
+	c.common.prp1 = cpu_to_le64(dma_addr);
+
+	return nvme_submit_admin_cmd(dev, &c, NULL);
+}
+
+int nvme_nvm_get_features_cmd(struct nvme_dev *dev, unsigned nsid,
+							dma_addr_t dma_addr)
+{
+	struct nvme_command c;
+
+	memset(&c, 0, sizeof(c));
+	c.common.opcode = lnvm_admin_get_features;
+	c.common.nsid = cpu_to_le32(nsid);
+	c.common.prp1 = cpu_to_le64(dma_addr);
+
+	return nvme_submit_admin_cmd(dev, &c, NULL);
+}
+
+int nvme_nvm_set_responsibility_cmd(struct nvme_dev *dev, unsigned nsid,
+								u64 resp)
+{
+	struct nvme_command c;
+
+	memset(&c, 0, sizeof(c));
+	c.common.opcode = lnvm_admin_set_responsibility;
+	c.common.nsid = cpu_to_le32(nsid);
+	c.lnvm_resp.resp = cpu_to_le64(resp);
+
+	return nvme_submit_admin_cmd(dev, &c, NULL);
+}
+
+int nvme_nvm_get_l2p_tbl_cmd(struct nvme_dev *dev, unsigned nsid, u64 slba,
+				u32 nlb, u16 dma_npages, struct nvme_iod *iod)
+{
+	struct nvme_command c;
+	unsigned length;
+
+	memset(&c, 0, sizeof(c));
+	c.common.opcode = lnvm_admin_get_l2p_tbl;
+	c.common.nsid = cpu_to_le32(nsid);
+
+	c.lnvm_l2p.slba = cpu_to_le64(slba);
+	c.lnvm_l2p.nlb = cpu_to_le32(nlb);
+	c.lnvm_l2p.prp1_len = cpu_to_le16(dma_npages);
+
+	length = nvme_setup_prps(dev, iod, iod->length, GFP_KERNEL);
+	if ((length >> 12) != dma_npages)
+		return -ENOMEM;
+
+	c.common.prp1 = cpu_to_le64(sg_dma_address(iod->sg));
+	c.common.prp2 = cpu_to_le64(iod->first_dma);
+
+	return nvme_submit_admin_cmd(dev, &c, NULL);
+}
+
+int nvme_nvm_erase_block_cmd(struct nvme_dev *dev, struct nvme_ns *ns,
+						sector_t block_id)
+{
+	struct nvme_command c;
+	int nsid = ns->ns_id;
+	int res;
+
+	memset(&c, 0, sizeof(c));
+	c.common.opcode = lnvm_cmd_erase_sync;
+	c.common.nsid = cpu_to_le32(nsid);
+	c.lnvm_erase.blk_addr = cpu_to_le64(block_id);
+
+	return nvme_submit_io_cmd(dev, ns, &c, &res);
+}
+
 int nvme_identify(struct nvme_dev *dev, unsigned nsid, unsigned cns,
 							dma_addr_t dma_addr)
 {
@@ -1551,6 +1698,185 @@ static int nvme_shutdown_ctrl(struct nvme_dev *dev)
 	return 0;
 }
 
+static int init_chnls(struct nvme_dev *dev, struct nvm_id *nvm_id,
+			struct nvme_lnvm_id *dma_buf, dma_addr_t dma_addr)
+{
+	struct nvme_lnvm_id_chnl *src = dma_buf->chnls;
+	struct nvm_id_chnl *dst = nvm_id->chnls;
+	unsigned int len = nvm_id->nchannels;
+	int i, end, off = 0;
+
+	while (len) {
+		end = min_t(u32, NVME_LNVM_CHNLS_PR_REQ, len);
+
+		for (i = 0; i < end; i++, dst++, src++) {
+			dst->laddr_begin = le64_to_cpu(src->laddr_begin);
+			dst->laddr_end = le64_to_cpu(src->laddr_end);
+			dst->oob_size = le32_to_cpu(src->oob_size);
+			dst->queue_size = le32_to_cpu(src->queue_size);
+			dst->gran_read = le32_to_cpu(src->gran_read);
+			dst->gran_write = le32_to_cpu(src->gran_write);
+			dst->gran_erase = le32_to_cpu(src->gran_erase);
+			dst->t_r = le32_to_cpu(src->t_r);
+			dst->t_sqr = le32_to_cpu(src->t_sqr);
+			dst->t_w = le32_to_cpu(src->t_w);
+			dst->t_sqw = le32_to_cpu(src->t_sqw);
+			dst->t_e = le32_to_cpu(src->t_e);
+			dst->io_sched = src->io_sched;
+		}
+
+		len -= end;
+		if (!len)
+			break;
+
+		off += end;
+
+		if (nvme_nvm_identify_cmd(dev, off, dma_addr))
+			return -EIO;
+
+		src = dma_buf->chnls;
+	}
+	return 0;
+}
+
+static int nvme_nvm_identify(struct request_queue *q, struct nvm_id *nvm_id)
+{
+	struct nvme_ns *ns = q->queuedata;
+	struct nvme_dev *dev = ns->dev;
+	struct pci_dev *pdev = dev->pci_dev;
+	struct nvme_lnvm_id *ctrl;
+	dma_addr_t dma_addr;
+	unsigned int ret;
+
+	ctrl = dma_alloc_coherent(&pdev->dev, 4096, &dma_addr, GFP_KERNEL);
+	if (!ctrl)
+		return -ENOMEM;
+
+	ret = nvme_nvm_identify_cmd(dev, 0, dma_addr);
+	if (ret) {
+		ret = -EIO;
+		goto out;
+	}
+
+	nvm_id->ver_id = ctrl->ver_id;
+	nvm_id->nvm_type = ctrl->nvm_type;
+	nvm_id->nchannels = le16_to_cpu(ctrl->nchannels);
+
+	if (!nvm_id->chnls)
+		nvm_id->chnls = kmalloc(sizeof(struct nvm_id_chnl)
+					* nvm_id->nchannels, GFP_KERNEL);
+
+	if (!nvm_id->chnls) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	ret = init_chnls(dev, nvm_id, ctrl, dma_addr);
+out:
+	dma_free_coherent(&pdev->dev, 4096, ctrl, dma_addr);
+	return ret;
+}
+
+static int nvme_nvm_get_features(struct request_queue *q,
+						struct nvm_get_features *gf)
+{
+	struct nvme_ns *ns = q->queuedata;
+	struct nvme_dev *dev = ns->dev;
+	struct pci_dev *pdev = dev->pci_dev;
+	dma_addr_t dma_addr;
+	int ret = 0;
+	u64 *mem;
+
+	mem = (u64 *)dma_alloc_coherent(&pdev->dev,
+					sizeof(struct nvm_get_features),
+							&dma_addr, GFP_KERNEL);
+	if (!mem)
+		return -ENOMEM;
+
+	ret = nvme_nvm_get_features_cmd(dev, ns->ns_id, dma_addr);
+	if (ret)
+		goto finish;
+
+	gf->rsp = le64_to_cpu(mem[0]);
+	gf->ext = le64_to_cpu(mem[1]);
+
+finish:
+	dma_free_coherent(&pdev->dev, sizeof(struct nvm_get_features), mem,
+								dma_addr);
+	return ret;
+}
+
+static int nvme_nvm_set_responsibility(struct request_queue *q, u64 resp)
+{
+	struct nvme_ns *ns = q->queuedata;
+	struct nvme_dev *dev = ns->dev;
+
+	return nvme_nvm_set_responsibility_cmd(dev, ns->ns_id, resp);
+}
+
+static int nvme_nvm_get_l2p_tbl(struct request_queue *q, u64 slba, u64 nlb,
+				nvm_l2p_update_fn *update_l2p, void *private)
+{
+	struct nvme_ns *ns = q->queuedata;
+	struct nvme_dev *dev = ns->dev;
+	struct pci_dev *pdev = dev->pci_dev;
+	static const u16 dma_npages = 256U;
+	static const u32 length = dma_npages * PAGE_SIZE;
+	u64 nlb_pr_dma = length / sizeof(u64);
+	struct nvme_iod *iod;
+	u64 cmd_slba = slba;
+	dma_addr_t dma_addr;
+	void *entries;
+	int res = 0;
+
+	entries = dma_alloc_coherent(&pdev->dev, length, &dma_addr, GFP_KERNEL);
+	if (!entries)
+		return -ENOMEM;
+
+	iod = nvme_get_dma_iod(dev, entries, length);
+	if (!iod) {
+		res = -ENOMEM;
+		goto out;
+	}
+
+	while (nlb) {
+		u64 cmd_nlb = min_t(u64, nlb_pr_dma, nlb);
+
+		res = nvme_nvm_get_l2p_tbl_cmd(dev, ns->ns_id, cmd_slba,
+						(u32)cmd_nlb, dma_npages, iod);
+		if (res) {
+			dev_err(&pdev->dev, "L2P table transfer failed (%d)\n",
+									res);
+			res = -EIO;
+			goto free_iod;
+		}
+
+		if (update_l2p(cmd_slba, cmd_nlb, entries, private)) {
+			res = -EINTR;
+			goto free_iod;
+		}
+
+		cmd_slba += cmd_nlb;
+		nlb -= cmd_nlb;
+	}
+
+free_iod:
+	dma_unmap_sg(&pdev->dev, iod->sg, 1, DMA_FROM_DEVICE);
+	nvme_free_iod(dev, iod);
+out:
+	dma_free_coherent(&pdev->dev, PAGE_SIZE * dma_npages, entries,
+								dma_addr);
+	return res;
+}
+
+static int nvme_nvm_erase_block(struct request_queue *q, sector_t block_id)
+{
+	struct nvme_ns *ns = q->queuedata;
+	struct nvme_dev *dev = ns->dev;
+
+	return nvme_nvm_erase_block_cmd(dev, ns, block_id);
+}
+
 static struct blk_mq_ops nvme_mq_admin_ops = {
 	.queue_rq	= nvme_admin_queue_rq,
 	.map_queue	= blk_mq_map_queue,
@@ -1560,6 +1886,14 @@ static struct blk_mq_ops nvme_mq_admin_ops = {
 	.timeout	= nvme_timeout,
 };
 
+static struct nvm_dev_ops nvme_nvm_dev_ops = {
+	.identify		= nvme_nvm_identify,
+	.get_features		= nvme_nvm_get_features,
+	.set_responsibility	= nvme_nvm_set_responsibility,
+	.get_l2p_tbl		= nvme_nvm_get_l2p_tbl,
+	.erase_block		= nvme_nvm_erase_block,
+};
+
 static struct blk_mq_ops nvme_mq_ops = {
 	.queue_rq	= nvme_queue_rq,
 	.map_queue	= blk_mq_map_queue,
@@ -1744,6 +2078,26 @@ void nvme_unmap_user_pages(struct nvme_dev *dev, int write,
 		put_page(sg_page(&iod->sg[i]));
 }
 
+static int nvme_nvm_submit_io(struct nvme_ns *ns, struct nvme_user_io *io)
+{
+	struct nvme_command c;
+	struct nvme_dev *dev = ns->dev;
+
+	memset(&c, 0, sizeof(c));
+	c.rw.opcode = io->opcode;
+	c.rw.flags = io->flags;
+	c.rw.nsid = cpu_to_le32(ns->ns_id);
+	c.rw.slba = cpu_to_le64(io->slba);
+	c.rw.length = cpu_to_le16(io->nblocks);
+	c.rw.control = cpu_to_le16(io->control);
+	c.rw.dsmgmt = cpu_to_le32(io->dsmgmt);
+	c.rw.reftag = cpu_to_le32(io->reftag);
+	c.rw.apptag = cpu_to_le16(io->apptag);
+	c.rw.appmask = cpu_to_le16(io->appmask);
+
+	return nvme_submit_io_cmd(dev, ns, &c, NULL);
+}
+
 static int nvme_submit_io(struct nvme_ns *ns, struct nvme_user_io __user *uio)
 {
 	struct nvme_dev *dev = ns->dev;
@@ -1769,6 +2123,10 @@ static int nvme_submit_io(struct nvme_ns *ns, struct nvme_user_io __user *uio)
 	case nvme_cmd_compare:
 		iod = nvme_map_user_pages(dev, io.opcode & 1, io.addr, length);
 		break;
+	case lnvm_admin_identify:
+	case lnvm_admin_get_features:
+	case lnvm_admin_set_responsibility:
+		return nvme_nvm_submit_io(ns, &io);
 	default:
 		return -EINVAL;
 	}
@@ -2073,6 +2431,17 @@ static int nvme_revalidate_disk(struct gendisk *disk)
 	if (dev->oncs & NVME_CTRL_ONCS_DSM)
 		nvme_config_discard(ns);
 
+	if (id->nsfeat & NVME_NS_FEAT_NVM) {
+		if (blk_nvm_register(ns->queue, &nvme_nvm_dev_ops)) {
+			dev_warn(&dev->pci_dev->dev,
+				 "%s: LightNVM init failure\n", __func__);
+			return 0;
+		}
+
+		/* FIXME: This will be handled later by ns */
+		ns->queue->nvm->drv_cmd_size = sizeof(struct nvme_cmd_info);
+	}
+
 	dma_free_coherent(&dev->pci_dev->dev, 4096, id, dma_addr);
 	return 0;
 }
@@ -2185,6 +2554,7 @@ static void nvme_alloc_ns(struct nvme_dev *dev, unsigned nsid)
 	if (ns->ms)
 		revalidate_disk(ns->disk);
 	return;
+
  out_free_queue:
 	blk_cleanup_queue(ns->queue);
  out_free_ns:
@@ -2316,7 +2686,9 @@ static int nvme_dev_add(struct nvme_dev *dev)
 	struct nvme_id_ctrl *ctrl;
 	void *mem;
 	dma_addr_t dma_addr;
-	int shift = NVME_CAP_MPSMIN(readq(&dev->bar->cap)) + 12;
+	u64 cap = readq(&dev->bar->cap);
+	int shift = NVME_CAP_MPSMIN(cap) + 12;
+	int nvm_cmdset = NVME_CAP_NVM(cap);
 
 	mem = dma_alloc_coherent(&pdev->dev, 4096, &dma_addr, GFP_KERNEL);
 	if (!mem)
@@ -2332,6 +2704,7 @@ static int nvme_dev_add(struct nvme_dev *dev)
 	ctrl = mem;
 	nn = le32_to_cpup(&ctrl->nn);
 	dev->oncs = le16_to_cpup(&ctrl->oncs);
+	dev->oacs = le16_to_cpup(&ctrl->oacs);
 	dev->abort_limit = ctrl->acl + 1;
 	dev->vwc = ctrl->vwc;
 	dev->event_limit = min(ctrl->aerl + 1, 8);
@@ -2364,6 +2737,11 @@ static int nvme_dev_add(struct nvme_dev *dev)
 	dev->tagset.flags = BLK_MQ_F_SHOULD_MERGE;
 	dev->tagset.driver_data = dev;
 
+	if (nvm_cmdset) {
+		dev->tagset.flags &= ~BLK_MQ_F_SHOULD_MERGE;
+		dev->tagset.flags |= BLK_MQ_F_NVM;
+	}
+
 	if (blk_mq_alloc_tag_set(&dev->tagset))
 		return 0;
 
diff --git a/include/linux/nvme.h b/include/linux/nvme.h
index 0adad4a..dc9c805 100644
--- a/include/linux/nvme.h
+++ b/include/linux/nvme.h
@@ -39,6 +39,7 @@ struct nvme_bar {
 #define NVME_CAP_STRIDE(cap)	(((cap) >> 32) & 0xf)
 #define NVME_CAP_MPSMIN(cap)	(((cap) >> 48) & 0xf)
 #define NVME_CAP_MPSMAX(cap)	(((cap) >> 52) & 0xf)
+#define NVME_CAP_NVM(cap)	(((cap) >> 38) & 0x1)
 
 enum {
 	NVME_CC_ENABLE		= 1 << 0,
@@ -100,6 +101,7 @@ struct nvme_dev {
 	u32 stripe_size;
 	u32 page_size;
 	u16 oncs;
+	u16 oacs;
 	u16 abort_limit;
 	u8 event_limit;
 	u8 vwc;
diff --git a/include/uapi/linux/nvme.h b/include/uapi/linux/nvme.h
index aef9a81..64c91a5 100644
--- a/include/uapi/linux/nvme.h
+++ b/include/uapi/linux/nvme.h
@@ -85,6 +85,35 @@ struct nvme_id_ctrl {
 	__u8			vs[1024];
 };
 
+struct nvme_lnvm_id_chnl {
+	__le64			laddr_begin;
+	__le64			laddr_end;
+	__le32			oob_size;
+	__le32			queue_size;
+	__le32			gran_read;
+	__le32			gran_write;
+	__le32			gran_erase;
+	__le32			t_r;
+	__le32			t_sqr;
+	__le32			t_w;
+	__le32			t_sqw;
+	__le32			t_e;
+	__le16			chnl_parallelism;
+	__u8			io_sched;
+	__u8			reserved[133];
+} __attribute__((packed));
+
+struct nvme_lnvm_id {
+	__u8				ver_id;
+	__u8				nvm_type;
+	__le16				nchannels;
+	__u8				reserved[252];
+	struct nvme_lnvm_id_chnl	chnls[];
+} __attribute__((packed));
+
+#define NVME_LNVM_CHNLS_PR_REQ ((4096U - sizeof(struct nvme_lnvm_id)) \
+					/ sizeof(struct nvme_lnvm_id_chnl))
+
 enum {
 	NVME_CTRL_ONCS_COMPARE			= 1 << 0,
 	NVME_CTRL_ONCS_WRITE_UNCORRECTABLE	= 1 << 1,
@@ -130,6 +159,7 @@ struct nvme_id_ns {
 
 enum {
 	NVME_NS_FEAT_THIN	= 1 << 0,
+	NVME_NS_FEAT_NVM	= 1 << 3,
 	NVME_NS_FLBAS_LBA_MASK	= 0xf,
 	NVME_NS_FLBAS_META_EXT	= 0x10,
 	NVME_LBAF_RP_BEST	= 0,
@@ -231,6 +261,14 @@ enum nvme_opcode {
 	nvme_cmd_resv_release	= 0x15,
 };
 
+enum lnvme_opcode {
+	lnvm_cmd_hybrid_write	= 0x81,
+	lnvm_cmd_hybrid_read	= 0x02,
+	lnvm_cmd_phys_write	= 0x91,
+	lnvm_cmd_phys_read	= 0x92,
+	lnvm_cmd_erase_sync	= 0x90,
+};
+
 struct nvme_common_command {
 	__u8			opcode;
 	__u8			flags;
@@ -261,6 +299,60 @@ struct nvme_rw_command {
 	__le16			appmask;
 };
 
+struct nvme_lnvm_hb_write_command {
+	__u8			opcode;
+	__u8			flags;
+	__u16			command_id;
+	__le32			nsid;
+	__u64			rsvd2;
+	__le64			metadata;
+	__le64			prp1;
+	__le64			prp2;
+	__le64			slba;
+	__le16			length;
+	__le16			control;
+	__le32			dsmgmt;
+	__le64			phys_addr;
+};
+
+struct nvme_lnvm_l2ptbl_command {
+	__u8			opcode;
+	__u8			flags;
+	__u16			command_id;
+	__le32			nsid;
+	__le32			cdw2[4];
+	__le64			prp1;
+	__le64			prp2;
+	__le64			slba;
+	__le32			nlb;
+	__u16			prp1_len;
+	__le16			cdw14[5];
+};
+
+struct nvme_lnvm_set_resp_command {
+	__u8			opcode;
+	__u8			flags;
+	__u16			command_id;
+	__le32			nsid;
+	__u64			rsvd[2];
+	__le64			prp1;
+	__le64			prp2;
+	__le64			resp;
+	__u32			rsvd11[4];
+};
+
+struct nvme_lnvm_erase_block {
+	__u8			opcode;
+	__u8			flags;
+	__u16			command_id;
+	__le32			nsid;
+	__u64			rsvd[2];
+	__le64			prp1;
+	__le64			prp2;
+	__le64			blk_addr;
+	__u32			rsvd11[4];
+};
+
 enum {
 	NVME_RW_LR			= 1 << 15,
 	NVME_RW_FUA			= 1 << 14,
@@ -328,6 +420,13 @@ enum nvme_admin_opcode {
 	nvme_admin_format_nvm		= 0x80,
 	nvme_admin_security_send	= 0x81,
 	nvme_admin_security_recv	= 0x82,
+
+	lnvm_admin_identify		= 0xe2,
+	lnvm_admin_get_features		= 0xe6,
+	lnvm_admin_set_responsibility	= 0xe5,
+	lnvm_admin_get_l2p_tbl		= 0xea,
+	lnvm_admin_get_bb_tbl		= 0xf2,
+	lnvm_admin_set_bb_tbl		= 0xf1,
 };
 
 enum {
@@ -457,6 +556,18 @@ struct nvme_format_cmd {
 	__u32			rsvd11[5];
 };
 
+struct nvme_lnvm_identify {
+	__u8			opcode;
+	__u8			flags;
+	__u16			command_id;
+	__le32			nsid;
+	__u64			rsvd[2];
+	__le64			prp1;
+	__le64			prp2;
+	__le32			cns;
+	__u32			rsvd11[5];
+};
+
 struct nvme_command {
 	union {
 		struct nvme_common_command common;
@@ -470,6 +581,11 @@ struct nvme_command {
 		struct nvme_format_cmd format;
 		struct nvme_dsm_cmd dsm;
 		struct nvme_abort_cmd abort;
+		struct nvme_lnvm_identify lnvm_identify;
+		struct nvme_lnvm_hb_write_command lnvm_hb_w;
+		struct nvme_lnvm_l2ptbl_command lnvm_l2p;
+		struct nvme_lnvm_set_resp_command lnvm_resp;
+		struct nvme_lnvm_erase_block lnvm_erase;
 	};
 };
 
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 5/5 v2] nvme: LightNVM support
@ 2015-04-15 12:34   ` Matias Bjørling
  0 siblings, 0 replies; 53+ messages in thread
From: Matias Bjørling @ 2015-04-15 12:34 UTC (permalink / raw)


The first generation of Open-Channel SSDs will be based on NVMe. The
integration requires that an NVMe device expose itself as a LightNVM
device. Currently this is done by hooking into the Controller
Capabilities (CAP register) and a per-namespace bit in NSFEAT.

After detection, vendor-specific commands are used to identify the
device and enumerate its supported features.

Signed-off-by: Matias Bjørling <m@bjorling.me>
---
 drivers/block/nvme-core.c | 380 +++++++++++++++++++++++++++++++++++++++++++++-
 include/linux/nvme.h      |   2 +
 include/uapi/linux/nvme.h | 116 ++++++++++++++
 3 files changed, 497 insertions(+), 1 deletion(-)

diff --git a/drivers/block/nvme-core.c b/drivers/block/nvme-core.c
index e23be20..cbbf728 100644
--- a/drivers/block/nvme-core.c
+++ b/drivers/block/nvme-core.c
@@ -39,6 +39,7 @@
 #include <linux/slab.h>
 #include <linux/t10-pi.h>
 #include <linux/types.h>
+#include <linux/lightnvm.h>
 #include <scsi/sg.h>
 #include <asm-generic/io-64-nonatomic-lo-hi.h>
 
@@ -134,6 +135,8 @@ static inline void _nvme_check_size(void)
 	BUILD_BUG_ON(sizeof(struct nvme_id_ns) != 4096);
 	BUILD_BUG_ON(sizeof(struct nvme_lba_range_type) != 64);
 	BUILD_BUG_ON(sizeof(struct nvme_smart_log) != 512);
+	BUILD_BUG_ON(sizeof(struct nvme_lnvm_hb_write_command) != 64);
+	BUILD_BUG_ON(sizeof(struct nvme_lnvm_l2ptbl_command) != 64);
 }
 
 typedef void (*nvme_completion_fn)(struct nvme_queue *, void *,
@@ -591,6 +594,30 @@ static void nvme_init_integrity(struct nvme_ns *ns)
 }
 #endif
 
+static struct nvme_iod *nvme_get_dma_iod(struct nvme_dev *dev, void *buf,
+								unsigned length)
+{
+	struct scatterlist *sg;
+	struct nvme_iod *iod;
+	struct device *ddev = &dev->pci_dev->dev;
+
+	if (!length || length > INT_MAX - PAGE_SIZE)
+		return ERR_PTR(-EINVAL);
+
+	iod = __nvme_alloc_iod(1, length, dev, 0, GFP_KERNEL);
+	if (!iod)
+		goto err;
+
+	sg = iod->sg;
+	sg_init_one(sg, buf, length);
+	iod->nents = 1;
+	dma_map_sg(ddev, sg, iod->nents, DMA_FROM_DEVICE);
+
+	return iod;
+err:
+	return ERR_PTR(-ENOMEM);
+}
+
 static void req_completion(struct nvme_queue *nvmeq, void *ctx,
 						struct nvme_completion *cqe)
 {
@@ -760,6 +787,46 @@ static void nvme_submit_flush(struct nvme_queue *nvmeq, struct nvme_ns *ns,
 	writel(nvmeq->sq_tail, nvmeq->q_db);
 }
 
+static int nvme_submit_lnvm_iod(struct nvme_queue *nvmeq, struct nvme_iod *iod,
+							struct nvme_ns *ns)
+{
+	struct request *req = iod_get_private(iod);
+	struct nvme_command *cmnd;
+	u16 control = 0;
+	u32 dsmgmt = 0;
+
+	if (req->cmd_flags & REQ_FUA)
+		control |= NVME_RW_FUA;
+	if (req->cmd_flags & (REQ_FAILFAST_DEV | REQ_RAHEAD))
+		control |= NVME_RW_LR;
+
+	if (req->cmd_flags & REQ_RAHEAD)
+		dsmgmt |= NVME_RW_DSM_FREQ_PREFETCH;
+
+	cmnd = &nvmeq->sq_cmds[nvmeq->sq_tail];
+	memset(cmnd, 0, sizeof(*cmnd));
+
+	cmnd->lnvm_hb_w.opcode = (rq_data_dir(req) ?
+				lnvm_cmd_hybrid_write : lnvm_cmd_hybrid_read);
+	cmnd->lnvm_hb_w.command_id = req->tag;
+	cmnd->lnvm_hb_w.nsid = cpu_to_le32(ns->ns_id);
+	cmnd->lnvm_hb_w.prp1 = cpu_to_le64(sg_dma_address(iod->sg));
+	cmnd->lnvm_hb_w.prp2 = cpu_to_le64(iod->first_dma);
+	cmnd->lnvm_hb_w.slba = cpu_to_le64(nvme_block_nr(ns, blk_rq_pos(req)));
+	cmnd->lnvm_hb_w.length = cpu_to_le16(
+			(blk_rq_bytes(req) >> ns->lba_shift) - 1);
+	cmnd->lnvm_hb_w.control = cpu_to_le16(control);
+	cmnd->lnvm_hb_w.dsmgmt = cpu_to_le32(dsmgmt);
+	cmnd->lnvm_hb_w.phys_addr =
+			cpu_to_le64(nvme_block_nr(ns, req->phys_sector));
+
+	if (++nvmeq->sq_tail == nvmeq->q_depth)
+		nvmeq->sq_tail = 0;
+	writel(nvmeq->sq_tail, nvmeq->q_db);
+
+	return 0;
+}
+
 static int nvme_submit_iod(struct nvme_queue *nvmeq, struct nvme_iod *iod,
 							struct nvme_ns *ns)
 {
@@ -895,6 +962,8 @@ static int nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
 		nvme_submit_discard(nvmeq, ns, req, iod);
 	else if (req->cmd_flags & REQ_FLUSH)
 		nvme_submit_flush(nvmeq, ns, req->tag);
+	else if (req->cmd_flags & REQ_NVM_MAPPED)
+		nvme_submit_lnvm_iod(nvmeq, iod, ns);
 	else
 		nvme_submit_iod(nvmeq, iod, ns);
 
@@ -1156,6 +1225,84 @@ static int adapter_delete_sq(struct nvme_dev *dev, u16 sqid)
 	return adapter_delete_queue(dev, nvme_admin_delete_sq, sqid);
 }
 
+int nvme_nvm_identify_cmd(struct nvme_dev *dev, u32 chnl_off,
+							dma_addr_t dma_addr)
+{
+	struct nvme_command c;
+
+	memset(&c, 0, sizeof(c));
+	c.common.opcode = lnvm_admin_identify;
+	c.common.nsid = cpu_to_le32(chnl_off);
+	c.common.prp1 = cpu_to_le64(dma_addr);
+
+	return nvme_submit_admin_cmd(dev, &c, NULL);
+}
+
+int nvme_nvm_get_features_cmd(struct nvme_dev *dev, unsigned nsid,
+							dma_addr_t dma_addr)
+{
+	struct nvme_command c;
+
+	memset(&c, 0, sizeof(c));
+	c.common.opcode = lnvm_admin_get_features;
+	c.common.nsid = cpu_to_le32(nsid);
+	c.common.prp1 = cpu_to_le64(dma_addr);
+
+	return nvme_submit_admin_cmd(dev, &c, NULL);
+}
+
+int nvme_nvm_set_responsibility_cmd(struct nvme_dev *dev, unsigned nsid,
+								u64 resp)
+{
+	struct nvme_command c;
+
+	memset(&c, 0, sizeof(c));
+	c.common.opcode = lnvm_admin_set_responsibility;
+	c.common.nsid = cpu_to_le32(nsid);
+	c.lnvm_resp.resp = cpu_to_le64(resp);
+
+	return nvme_submit_admin_cmd(dev, &c, NULL);
+}
+
+int nvme_nvm_get_l2p_tbl_cmd(struct nvme_dev *dev, unsigned nsid, u64 slba,
+				u32 nlb, u16 dma_npages, struct nvme_iod *iod)
+{
+	struct nvme_command c;
+	unsigned length;
+
+	memset(&c, 0, sizeof(c));
+	c.common.opcode = lnvm_admin_get_l2p_tbl;
+	c.common.nsid = cpu_to_le32(nsid);
+
+	c.lnvm_l2p.slba = cpu_to_le64(slba);
+	c.lnvm_l2p.nlb = cpu_to_le32(nlb);
+	c.lnvm_l2p.prp1_len = cpu_to_le16(dma_npages);
+
+	length = nvme_setup_prps(dev, iod, iod->length, GFP_KERNEL);
+	if ((length >> 12) != dma_npages)
+		return -ENOMEM;
+
+	c.common.prp1 = cpu_to_le64(sg_dma_address(iod->sg));
+	c.common.prp2 = cpu_to_le64(iod->first_dma);
+
+	return nvme_submit_admin_cmd(dev, &c, NULL);
+}
+
+int nvme_nvm_erase_block_cmd(struct nvme_dev *dev, struct nvme_ns *ns,
+						sector_t block_id)
+{
+	struct nvme_command c;
+	int nsid = ns->ns_id;
+	int res;
+
+	memset(&c, 0, sizeof(c));
+	c.common.opcode = lnvm_cmd_erase_sync;
+	c.common.nsid = cpu_to_le32(nsid);
+	c.lnvm_erase.blk_addr = cpu_to_le64(block_id);
+
+	return nvme_submit_io_cmd(dev, ns, &c, &res);
+}
+
 int nvme_identify(struct nvme_dev *dev, unsigned nsid, unsigned cns,
 							dma_addr_t dma_addr)
 {
@@ -1551,6 +1698,185 @@ static int nvme_shutdown_ctrl(struct nvme_dev *dev)
 	return 0;
 }
 
+static int init_chnls(struct nvme_dev *dev, struct nvm_id *nvm_id,
+			struct nvme_lnvm_id *dma_buf, dma_addr_t dma_addr)
+{
+	struct nvme_lnvm_id_chnl *src = dma_buf->chnls;
+	struct nvm_id_chnl *dst = nvm_id->chnls;
+	unsigned int len = nvm_id->nchannels;
+	int i, end, off = 0;
+
+	while (len) {
+		end = min_t(u32, NVME_LNVM_CHNLS_PR_REQ, len);
+
+		for (i = 0; i < end; i++, dst++, src++) {
+			dst->laddr_begin = le64_to_cpu(src->laddr_begin);
+			dst->laddr_end = le64_to_cpu(src->laddr_end);
+			dst->oob_size = le32_to_cpu(src->oob_size);
+			dst->queue_size = le32_to_cpu(src->queue_size);
+			dst->gran_read = le32_to_cpu(src->gran_read);
+			dst->gran_write = le32_to_cpu(src->gran_write);
+			dst->gran_erase = le32_to_cpu(src->gran_erase);
+			dst->t_r = le32_to_cpu(src->t_r);
+			dst->t_sqr = le32_to_cpu(src->t_sqr);
+			dst->t_w = le32_to_cpu(src->t_w);
+			dst->t_sqw = le32_to_cpu(src->t_sqw);
+			dst->t_e = le32_to_cpu(src->t_e);
+			dst->io_sched = src->io_sched;
+		}
+
+		len -= end;
+		if (!len)
+			break;
+
+		off += end;
+
+		if (nvme_nvm_identify_cmd(dev, off, dma_addr))
+			return -EIO;
+
+		src = dma_buf->chnls;
+	}
+	return 0;
+}
+
+static int nvme_nvm_identify(struct request_queue *q, struct nvm_id *nvm_id)
+{
+	struct nvme_ns *ns = q->queuedata;
+	struct nvme_dev *dev = ns->dev;
+	struct pci_dev *pdev = dev->pci_dev;
+	struct nvme_lnvm_id *ctrl;
+	dma_addr_t dma_addr;
+	unsigned int ret;
+
+	ctrl = dma_alloc_coherent(&pdev->dev, 4096, &dma_addr, GFP_KERNEL);
+	if (!ctrl)
+		return -ENOMEM;
+
+	ret = nvme_nvm_identify_cmd(dev, 0, dma_addr);
+	if (ret) {
+		ret = -EIO;
+		goto out;
+	}
+
+	nvm_id->ver_id = ctrl->ver_id;
+	nvm_id->nvm_type = ctrl->nvm_type;
+	nvm_id->nchannels = le16_to_cpu(ctrl->nchannels);
+
+	if (!nvm_id->chnls)
+		nvm_id->chnls = kmalloc(sizeof(struct nvm_id_chnl)
+					* nvm_id->nchannels, GFP_KERNEL);
+
+	if (!nvm_id->chnls) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	ret = init_chnls(dev, nvm_id, ctrl, dma_addr);
+out:
+	dma_free_coherent(&pdev->dev, 4096, ctrl, dma_addr);
+	return ret;
+}
+
+static int nvme_nvm_get_features(struct request_queue *q,
+						struct nvm_get_features *gf)
+{
+	struct nvme_ns *ns = q->queuedata;
+	struct nvme_dev *dev = ns->dev;
+	struct pci_dev *pdev = dev->pci_dev;
+	dma_addr_t dma_addr;
+	int ret = 0;
+	u64 *mem;
+
+	mem = (u64 *)dma_alloc_coherent(&pdev->dev,
+					sizeof(struct nvm_get_features),
+							&dma_addr, GFP_KERNEL);
+	if (!mem)
+		return -ENOMEM;
+
+	ret = nvme_nvm_get_features_cmd(dev, ns->ns_id, dma_addr);
+	if (ret)
+		goto finish;
+
+	gf->rsp = le64_to_cpu(mem[0]);
+	gf->ext = le64_to_cpu(mem[1]);
+
+finish:
+	dma_free_coherent(&pdev->dev, sizeof(struct nvm_get_features), mem,
+								dma_addr);
+	return ret;
+}
+
+static int nvme_nvm_set_responsibility(struct request_queue *q, u64 resp)
+{
+	struct nvme_ns *ns = q->queuedata;
+	struct nvme_dev *dev = ns->dev;
+
+	return nvme_nvm_set_responsibility_cmd(dev, ns->ns_id, resp);
+}
+
+static int nvme_nvm_get_l2p_tbl(struct request_queue *q, u64 slba, u64 nlb,
+				nvm_l2p_update_fn *update_l2p, void *private)
+{
+	struct nvme_ns *ns = q->queuedata;
+	struct nvme_dev *dev = ns->dev;
+	struct pci_dev *pdev = dev->pci_dev;
+	static const u16 dma_npages = 256U;
+	static const u32 length = dma_npages * PAGE_SIZE;
+	u64 nlb_pr_dma = length / sizeof(u64);
+	struct nvme_iod *iod;
+	u64 cmd_slba = slba;
+	dma_addr_t dma_addr;
+	void *entries;
+	int res = 0;
+
+	entries = dma_alloc_coherent(&pdev->dev, length, &dma_addr, GFP_KERNEL);
+	if (!entries)
+		return -ENOMEM;
+
+	iod = nvme_get_dma_iod(dev, entries, length);
+	if (!iod) {
+		res = -ENOMEM;
+		goto out;
+	}
+
+	while (nlb) {
+		u64 cmd_nlb = min_t(u64, nlb_pr_dma, nlb);
+
+		res = nvme_nvm_get_l2p_tbl_cmd(dev, ns->ns_id, cmd_slba,
+						(u32)cmd_nlb, dma_npages, iod);
+		if (res) {
+			dev_err(&pdev->dev, "L2P table transfer failed (%d)\n",
+									res);
+			res = -EIO;
+			goto free_iod;
+		}
+
+		if (update_l2p(cmd_slba, cmd_nlb, entries, private)) {
+			res = -EINTR;
+			goto free_iod;
+		}
+
+		cmd_slba += cmd_nlb;
+		nlb -= cmd_nlb;
+	}
+
+free_iod:
+	dma_unmap_sg(&pdev->dev, iod->sg, 1, DMA_FROM_DEVICE);
+	nvme_free_iod(dev, iod);
+out:
+	dma_free_coherent(&pdev->dev, PAGE_SIZE * dma_npages, entries,
+								dma_addr);
+	return res;
+}
+
+static int nvme_nvm_erase_block(struct request_queue *q, sector_t block_id)
+{
+	struct nvme_ns *ns = q->queuedata;
+	struct nvme_dev *dev = ns->dev;
+
+	return nvme_nvm_erase_block_cmd(dev, ns, block_id);
+}
+
 static struct blk_mq_ops nvme_mq_admin_ops = {
 	.queue_rq	= nvme_admin_queue_rq,
 	.map_queue	= blk_mq_map_queue,
@@ -1560,6 +1886,14 @@ static struct blk_mq_ops nvme_mq_admin_ops = {
 	.timeout	= nvme_timeout,
 };
 
+static struct nvm_dev_ops nvme_nvm_dev_ops = {
+	.identify		= nvme_nvm_identify,
+	.get_features		= nvme_nvm_get_features,
+	.set_responsibility	= nvme_nvm_set_responsibility,
+	.get_l2p_tbl		= nvme_nvm_get_l2p_tbl,
+	.erase_block		= nvme_nvm_erase_block,
+};
+
 static struct blk_mq_ops nvme_mq_ops = {
 	.queue_rq	= nvme_queue_rq,
 	.map_queue	= blk_mq_map_queue,
@@ -1744,6 +2078,26 @@ void nvme_unmap_user_pages(struct nvme_dev *dev, int write,
 		put_page(sg_page(&iod->sg[i]));
 }
 
+static int nvme_nvm_submit_io(struct nvme_ns *ns, struct nvme_user_io *io)
+{
+	struct nvme_command c;
+	struct nvme_dev *dev = ns->dev;
+
+	memset(&c, 0, sizeof(c));
+	c.rw.opcode = io->opcode;
+	c.rw.flags = io->flags;
+	c.rw.nsid = cpu_to_le32(ns->ns_id);
+	c.rw.slba = cpu_to_le64(io->slba);
+	c.rw.length = cpu_to_le16(io->nblocks);
+	c.rw.control = cpu_to_le16(io->control);
+	c.rw.dsmgmt = cpu_to_le32(io->dsmgmt);
+	c.rw.reftag = cpu_to_le32(io->reftag);
+	c.rw.apptag = cpu_to_le16(io->apptag);
+	c.rw.appmask = cpu_to_le16(io->appmask);
+
+	return nvme_submit_io_cmd(dev, ns, &c, NULL);
+}
+
 static int nvme_submit_io(struct nvme_ns *ns, struct nvme_user_io __user *uio)
 {
 	struct nvme_dev *dev = ns->dev;
@@ -1769,6 +2123,10 @@ static int nvme_submit_io(struct nvme_ns *ns, struct nvme_user_io __user *uio)
 	case nvme_cmd_compare:
 		iod = nvme_map_user_pages(dev, io.opcode & 1, io.addr, length);
 		break;
+	case lnvm_admin_identify:
+	case lnvm_admin_get_features:
+	case lnvm_admin_set_responsibility:
+		return nvme_nvm_submit_io(ns, &io);
 	default:
 		return -EINVAL;
 	}
@@ -2073,6 +2431,17 @@ static int nvme_revalidate_disk(struct gendisk *disk)
 	if (dev->oncs & NVME_CTRL_ONCS_DSM)
 		nvme_config_discard(ns);
 
+	if (id->nsfeat & NVME_NS_FEAT_NVM) {
+		if (blk_nvm_register(ns->queue, &nvme_nvm_dev_ops)) {
+			dev_warn(&dev->pci_dev->dev,
+				 "%s: LightNVM init failure\n", __func__);
+			return 0;
+		}
+
+		/* FIXME: This will be handled later by ns */
+		ns->queue->nvm->drv_cmd_size = sizeof(struct nvme_cmd_info);
+	}
+
 	dma_free_coherent(&dev->pci_dev->dev, 4096, id, dma_addr);
 	return 0;
 }
@@ -2185,6 +2554,7 @@ static void nvme_alloc_ns(struct nvme_dev *dev, unsigned nsid)
 	if (ns->ms)
 		revalidate_disk(ns->disk);
 	return;
+
  out_free_queue:
 	blk_cleanup_queue(ns->queue);
  out_free_ns:
@@ -2316,7 +2686,9 @@ static int nvme_dev_add(struct nvme_dev *dev)
 	struct nvme_id_ctrl *ctrl;
 	void *mem;
 	dma_addr_t dma_addr;
-	int shift = NVME_CAP_MPSMIN(readq(&dev->bar->cap)) + 12;
+	u64 cap = readq(&dev->bar->cap);
+	int shift = NVME_CAP_MPSMIN(cap) + 12;
+	int nvm_cmdset = NVME_CAP_NVM(cap);
 
 	mem = dma_alloc_coherent(&pdev->dev, 4096, &dma_addr, GFP_KERNEL);
 	if (!mem)
@@ -2332,6 +2704,7 @@ static int nvme_dev_add(struct nvme_dev *dev)
 	ctrl = mem;
 	nn = le32_to_cpup(&ctrl->nn);
 	dev->oncs = le16_to_cpup(&ctrl->oncs);
+	dev->oacs = le16_to_cpup(&ctrl->oacs);
 	dev->abort_limit = ctrl->acl + 1;
 	dev->vwc = ctrl->vwc;
 	dev->event_limit = min(ctrl->aerl + 1, 8);
@@ -2364,6 +2737,11 @@ static int nvme_dev_add(struct nvme_dev *dev)
 	dev->tagset.flags = BLK_MQ_F_SHOULD_MERGE;
 	dev->tagset.driver_data = dev;
 
+	if (nvm_cmdset) {
+		dev->tagset.flags &= ~BLK_MQ_F_SHOULD_MERGE;
+		dev->tagset.flags |= BLK_MQ_F_NVM;
+	}
+
 	if (blk_mq_alloc_tag_set(&dev->tagset))
 		return 0;
 
diff --git a/include/linux/nvme.h b/include/linux/nvme.h
index 0adad4a..dc9c805 100644
--- a/include/linux/nvme.h
+++ b/include/linux/nvme.h
@@ -39,6 +39,7 @@ struct nvme_bar {
 #define NVME_CAP_STRIDE(cap)	(((cap) >> 32) & 0xf)
 #define NVME_CAP_MPSMIN(cap)	(((cap) >> 48) & 0xf)
 #define NVME_CAP_MPSMAX(cap)	(((cap) >> 52) & 0xf)
+#define NVME_CAP_NVM(cap)	(((cap) >> 38) & 0x1)
 
 enum {
 	NVME_CC_ENABLE		= 1 << 0,
@@ -100,6 +101,7 @@ struct nvme_dev {
 	u32 stripe_size;
 	u32 page_size;
 	u16 oncs;
+	u16 oacs;
 	u16 abort_limit;
 	u8 event_limit;
 	u8 vwc;
diff --git a/include/uapi/linux/nvme.h b/include/uapi/linux/nvme.h
index aef9a81..64c91a5 100644
--- a/include/uapi/linux/nvme.h
+++ b/include/uapi/linux/nvme.h
@@ -85,6 +85,35 @@ struct nvme_id_ctrl {
 	__u8			vs[1024];
 };
 
+struct nvme_lnvm_id_chnl {
+	__le64			laddr_begin;
+	__le64			laddr_end;
+	__le32			oob_size;
+	__le32			queue_size;
+	__le32			gran_read;
+	__le32			gran_write;
+	__le32			gran_erase;
+	__le32			t_r;
+	__le32			t_sqr;
+	__le32			t_w;
+	__le32			t_sqw;
+	__le32			t_e;
+	__le16			chnl_parallelism;
+	__u8			io_sched;
+	__u8			reserved[133];
+} __attribute__((packed));
+
+struct nvme_lnvm_id {
+	__u8				ver_id;
+	__u8				nvm_type;
+	__le16				nchannels;
+	__u8				reserved[252];
+	struct nvme_lnvm_id_chnl	chnls[];
+} __attribute__((packed));
+
+#define NVME_LNVM_CHNLS_PR_REQ ((4096U - sizeof(struct nvme_lnvm_id)) \
+					/ sizeof(struct nvme_lnvm_id_chnl))
+
 enum {
 	NVME_CTRL_ONCS_COMPARE			= 1 << 0,
 	NVME_CTRL_ONCS_WRITE_UNCORRECTABLE	= 1 << 1,
@@ -130,6 +159,7 @@ struct nvme_id_ns {
 
 enum {
 	NVME_NS_FEAT_THIN	= 1 << 0,
+	NVME_NS_FEAT_NVM	= 1 << 3,
 	NVME_NS_FLBAS_LBA_MASK	= 0xf,
 	NVME_NS_FLBAS_META_EXT	= 0x10,
 	NVME_LBAF_RP_BEST	= 0,
@@ -231,6 +261,14 @@ enum nvme_opcode {
 	nvme_cmd_resv_release	= 0x15,
 };
 
+enum lnvme_opcode {
+	lnvm_cmd_hybrid_write	= 0x81,
+	lnvm_cmd_hybrid_read	= 0x02,
+	lnvm_cmd_phys_write	= 0x91,
+	lnvm_cmd_phys_read	= 0x92,
+	lnvm_cmd_erase_sync	= 0x90,
+};
+
 struct nvme_common_command {
 	__u8			opcode;
 	__u8			flags;
@@ -261,6 +299,60 @@ struct nvme_rw_command {
 	__le16			appmask;
 };
 
+struct nvme_lnvm_hb_write_command {
+	__u8			opcode;
+	__u8			flags;
+	__u16			command_id;
+	__le32			nsid;
+	__u64			rsvd2;
+	__le64			metadata;
+	__le64			prp1;
+	__le64			prp2;
+	__le64			slba;
+	__le16			length;
+	__le16			control;
+	__le32			dsmgmt;
+	__le64			phys_addr;
+};
+
+struct nvme_lnvm_l2ptbl_command {
+	__u8			opcode;
+	__u8			flags;
+	__u16			command_id;
+	__le32			nsid;
+	__le32			cdw2[4];
+	__le64			prp1;
+	__le64			prp2;
+	__le64			slba;
+	__le32			nlb;
+	__u16			prp1_len;
+	__le16			cdw14[5];
+};
+
+struct nvme_lnvm_set_resp_command {
+	__u8			opcode;
+	__u8			flags;
+	__u16			command_id;
+	__le32			nsid;
+	__u64			rsvd[2];
+	__le64			prp1;
+	__le64			prp2;
+	__le64			resp;
+	__u32			rsvd11[4];
+};
+
+struct nvme_lnvm_erase_block {
+	__u8			opcode;
+	__u8			flags;
+	__u16			command_id;
+	__le32			nsid;
+	__u64			rsvd[2];
+	__le64			prp1;
+	__le64			prp2;
+	__le64			blk_addr;
+	__u32			rsvd11[4];
+};
+
 enum {
 	NVME_RW_LR			= 1 << 15,
 	NVME_RW_FUA			= 1 << 14,
@@ -328,6 +420,13 @@ enum nvme_admin_opcode {
 	nvme_admin_format_nvm		= 0x80,
 	nvme_admin_security_send	= 0x81,
 	nvme_admin_security_recv	= 0x82,
+
+	lnvm_admin_identify		= 0xe2,
+	lnvm_admin_get_features		= 0xe6,
+	lnvm_admin_set_responsibility	= 0xe5,
+	lnvm_admin_get_l2p_tbl		= 0xea,
+	lnvm_admin_get_bb_tbl		= 0xf2,
+	lnvm_admin_set_bb_tbl		= 0xf1,
 };
 
 enum {
@@ -457,6 +556,18 @@ struct nvme_format_cmd {
 	__u32			rsvd11[5];
 };
 
+struct nvme_lnvm_identify {
+	__u8			opcode;
+	__u8			flags;
+	__u16			command_id;
+	__le32			nsid;
+	__u64			rsvd[2];
+	__le64			prp1;
+	__le64			prp2;
+	__le32			cns;
+	__u32			rsvd11[5];
+};
+
 struct nvme_command {
 	union {
 		struct nvme_common_command common;
@@ -470,6 +581,11 @@ struct nvme_command {
 		struct nvme_format_cmd format;
 		struct nvme_dsm_cmd dsm;
 		struct nvme_abort_cmd abort;
+		struct nvme_lnvm_identify lnvm_identify;
+		struct nvme_lnvm_hb_write_command lnvm_hb_w;
+		struct nvme_lnvm_l2ptbl_command lnvm_l2p;
+		struct nvme_lnvm_set_resp_command lnvm_resp;
+		struct nvme_lnvm_erase_block lnvm_erase;
 	};
 };
 
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* Re: [PATCH 2/5 v2] blk-mq: Support for Open-Channel SSDs
  2015-04-15 12:34   ` Matias Bjørling
@ 2015-04-16  9:10     ` Paul Bolle
  -1 siblings, 0 replies; 53+ messages in thread
From: Paul Bolle @ 2015-04-16  9:10 UTC (permalink / raw)
  To: Matias Bjørling
  Cc: hch, axboe, linux-fsdevel, linux-kernel, linux-nvme, javier, keith.busch

A few things I spotted (while actually fiddling with 3/5).

On Wed, 2015-04-15 at 14:34 +0200, Matias Bjørling wrote:
> index f3dd028..58a8a71 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -221,6 +221,9 @@ static void blk_mq_rq_ctx_init(struct request_queue *q, struct blk_mq_ctx *ctx,
>  	rq->end_io = NULL;
>  	rq->end_io_data = NULL;
>  	rq->next_rq = NULL;
> +#if CONFIG_BLK_DEV_NVM

I think you meant
    #ifdef CONFIG_BLK_DEV_NVM

> +	rq->phys_sector = 0;
> +#endif
>  
>  	ctx->rq_dispatched[rw_is_sync(rw_flags)]++;
>  }

> --- /dev/null
> +++ b/block/blk-nvm.c

> +int nvm_register_target(struct nvm_target_type *tt)
> +{
> +	int ret = 0;
> +
> +	down_write(&_lock);
> +	if (nvm_find_target_type(tt->name))
> +		ret = -EEXIST;
> +	else
> +		list_add(&tt->list, &_targets);
> +	up_write(&_lock);
> +
> +	return ret;
> +}
> +
> +void nvm_unregister_target(struct nvm_target_type *tt)
> +{
> +	if (!tt)
> +		return;
> +
> +	down_write(&_lock);
> +	list_del(&tt->list);
> +	up_write(&_lock);
> +}

Trying to build rrpc.ko I saw this
    WARNING: "nvm_unregister_target" [[...]/drivers/lightnvm/rrpc.ko] undefined!
    WARNING: "nvm_register_target" [[...]/drivers/lightnvm/rrpc.ko] undefined!

So I guess you need to export these two functions too.
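
For illustration, a minimal sketch of the exports (assuming the definitions
stay in block/blk-nvm.c as in the hunk above; whether EXPORT_SYMBOL() or
EXPORT_SYMBOL_GPL() is wanted depends on the intended licensing):

    /* after the definition of nvm_register_target() */
    EXPORT_SYMBOL_GPL(nvm_register_target);

    /* after the definition of nvm_unregister_target() */
    EXPORT_SYMBOL_GPL(nvm_unregister_target);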

> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -209,6 +209,9 @@ struct request {
>  
>  	/* for bidi */
>  	struct request *next_rq;
> +#if CONFIG_BLK_DEV_NVM

Again, I think
    #ifdef CONFIG_BLK_DEV_NVM

> +	sector_t phys_sector;
> +#endif
>  };
>  
>  static inline unsigned short req_get_ioprio(struct request *req)


Paul Bolle


^ permalink raw reply	[flat|nested] 53+ messages in thread


* Re: [PATCH 3/5 v2] lightnvm: RRPC target
  2015-04-15 12:34   ` Matias Bjørling
@ 2015-04-16  9:12     ` Paul Bolle
  -1 siblings, 0 replies; 53+ messages in thread
From: Paul Bolle @ 2015-04-16  9:12 UTC (permalink / raw)
  To: Matias Bjørling
  Cc: hch, axboe, linux-fsdevel, linux-kernel, linux-nvme, javier, keith.busch

On Wed, 2015-04-15 at 14:34 +0200, Matias Bjørling wrote:
> --- /dev/null
> +++ b/drivers/lightnvm/Kconfig

> +menuconfig NVM
> +	bool "Open-Channel SSD target support"
> +	depends on BLK_DEV_NVM
> +	help
> +	  Say Y here to get to enable Open-channel SSDs.
> +
> +	  Open-Channel SSDs implement a set of extension to SSDs, that
> +	  exposes direct access to the underlying non-volatile memory.
> +
> +	  If you say N, all options in this submenu will be skipped and disabled
> +	  only do this if you know what you are doing.
> +
> +if NVM
> +
> +config NVM_RRPC
> +	tristate "Round-robin Hybrid Open-Channel SSD"
> +	depends on BLK_DEV_NVM

NVM implies BLK_DEV_NVM so this dependency isn't really needed.
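
For illustration, with that line dropped the entry would simply start as:

    config NVM_RRPC
    	tristate "Round-robin Hybrid Open-Channel SSD"

since everything inside the "if NVM" block already implies BLK_DEV_NVM
through NVM's own dependency.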

> +	---help---
> +	Allows an open-channel SSD to be exposed as a block device to the
> +	host. The target is implemented using a linear mapping table and
> +	cost-based garbage collection. It is optimized for 4K IO sizes.
> +
> +	See Documentation/nvm-rrpc.txt for details.

This file isn't part of this series, is it?

> +
> +endif # NVM
> diff --git a/drivers/lightnvm/Makefile b/drivers/lightnvm/Makefile
> new file mode 100644
> index 0000000..80d75a8
> --- /dev/null
> +++ b/drivers/lightnvm/Makefile
> @@ -0,0 +1,5 @@
> +#
> +# Makefile for Open-Channel SSDs.
> +#
> +
> +obj-$(CONFIG_NVM)		+= rrpc.o

I suppose you meant to use
    obj-$(CONFIG_NVM_RRPC)		+= rrpc.o

because otherwise setting NVM_RRPC has no effect. 

> --- /dev/null
> +++ b/drivers/lightnvm/rrpc.c

> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License version
> + * 2 as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful, but
> + * WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * General Public License for more details.

This states the license is GPL v2.

> +MODULE_LICENSE("GPL");

And (according to include/linux/module.h) the license ident "GPL" states
the license is GPL v2 or later. So either the comment at the top of this
file or the license ident needs to change.
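
For example, if v2 only is what is meant, the ident would become:

    MODULE_LICENSE("GPL v2");

or, going the other way, the header comment could be reworded to "version 2,
or (at your option) any later version" and MODULE_LICENSE("GPL") kept as is.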

Thanks,


Paul Bolle


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 2/5 v2] blk-mq: Support for Open-Channel SSDs
  2015-04-16  9:10     ` Paul Bolle
  (?)
@ 2015-04-16 10:23       ` Matias Bjørling
  -1 siblings, 0 replies; 53+ messages in thread
From: Matias Bjørling @ 2015-04-16 10:23 UTC (permalink / raw)
  To: Paul Bolle
  Cc: hch, axboe, linux-fsdevel, linux-kernel, linux-nvme, javier, keith.busch

On 04/16/2015 11:10 AM, Paul Bolle wrote:
> A few things I spotted (while actually fiddling with 3/5).

Thanks. I'll fix them up.

>
> On Wed, 2015-04-15 at 14:34 +0200, Matias Bjørling wrote:
>> index f3dd028..58a8a71 100644
>> --- a/block/blk-mq.c
>> +++ b/block/blk-mq.c
>> @@ -221,6 +221,9 @@ static void blk_mq_rq_ctx_init(struct request_queue *q, struct blk_mq_ctx *ctx,
>>   	rq->end_io = NULL;
>>   	rq->end_io_data = NULL;
>>   	rq->next_rq = NULL;
>> +#if CONFIG_BLK_DEV_NVM
>
> I think you meant
>      #ifdef CONFIG_BLK_DEV_NVM
>
>> +	rq->phys_sector = 0;
>> +#endif
>>
>>   	ctx->rq_dispatched[rw_is_sync(rw_flags)]++;
>>   }
>
>> --- /dev/null
>> +++ b/block/blk-nvm.c
>
>> +int nvm_register_target(struct nvm_target_type *tt)
>> +{
>> +	int ret = 0;
>> +
>> +	down_write(&_lock);
>> +	if (nvm_find_target_type(tt->name))
>> +		ret = -EEXIST;
>> +	else
>> +		list_add(&tt->list, &_targets);
>> +	up_write(&_lock);
>> +
>> +	return ret;
>> +}
>> +
>> +void nvm_unregister_target(struct nvm_target_type *tt)
>> +{
>> +	if (!tt)
>> +		return;
>> +
>> +	down_write(&_lock);
>> +	list_del(&tt->list);
>> +	up_write(&_lock);
>> +}
>
> Trying to build rrpc.ko I saw this
>      WARNING: "nvm_unregister_target" [[...]/drivers/lightnvm/rrpc.ko] undefined!
>      WARNING: "nvm_register_target" [[...]/drivers/lightnvm/rrpc.ko] undefined!
>
> So I guess you need to export these two functions too.
>
>> --- a/include/linux/blkdev.h
>> +++ b/include/linux/blkdev.h
>> @@ -209,6 +209,9 @@ struct request {
>>
>>   	/* for bidi */
>>   	struct request *next_rq;
>> +#if CONFIG_BLK_DEV_NVM
>
> Again, I think
>      #ifdef CONFIG_BLK_DEV_NVM
>
>> +	sector_t phys_sector;
>> +#endif
>>   };
>>
>>   static inline unsigned short req_get_ioprio(struct request *req)
>
>
> Paul Bolle
>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 2/5 v2] blk-mq: Support for Open-Channel SSDs
  2015-04-16 10:23       ` Matias Bjørling
  (?)
@ 2015-04-16 11:34         ` Paul Bolle
  -1 siblings, 0 replies; 53+ messages in thread
From: Paul Bolle @ 2015-04-16 11:34 UTC (permalink / raw)
  To: Matias Bjørling
  Cc: hch, axboe, linux-fsdevel, linux-kernel, linux-nvme, javier, keith.busch

On Thu, 2015-04-16 at 12:23 +0200, Matias Bjørling wrote:
> On 04/16/2015 11:10 AM, Paul Bolle wrote:
> > A few things I spotted (while actually fiddling with 3/5).
> 
> Thanks. I'll fix them up.

Please note that just using #ifdef instead of #if is not all that's
needed. See, I had some fun playing whack-a-mole with warnings and
errors showing up in the "CONFIG_BLK_DEV_NVM is not set" case (because I
was looking into things outside of this series that I don't understand).

After adding the changes pasted at the end of this message (which I gave
almost no thought whatsoever) I ran into:
    block/blk-mq.c: In function ‘blk_mq_init_rq_map’:
    block/blk-mq.c:1473:22: error: invalid application of ‘sizeof’ to incomplete type ‘struct nvm_per_rq’
       cmd_size += sizeof(struct nvm_per_rq);
                          ^

Then I admitted defeat.

Have fun with your turn of that game.


Paul Bolle

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 95fbc98307a5..27d4ebf36a80 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -223,7 +223,7 @@ static void blk_mq_rq_ctx_init(struct request_queue *q, struct blk_mq_ctx *ctx,
 	rq->end_io = NULL;
 	rq->end_io_data = NULL;
 	rq->next_rq = NULL;
-#if CONFIG_BLK_DEV_NVM
+#ifdef CONFIG_BLK_DEV_NVM
 	rq->phys_sector = 0;
 #endif
 
diff --git a/block/blk.h b/block/blk.h
index 3e4abee7c194..19c037041006 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -294,9 +294,9 @@ extern void blk_nvm_unregister(struct request_queue *);
 extern int blk_nvm_init_sysfs(struct device *);
 extern void blk_nvm_remove_sysfs(struct device *);
 #else
-static void blk_nvm_unregister(struct request_queue *q) { }
-static int blk_nvm_init_sysfs(struct device *) { return 0; }
-static void blk_nvm_remove_sysfs(struct device *) { }
+static inline void blk_nvm_unregister(struct request_queue *q) { }
+static inline int blk_nvm_init_sysfs(struct device *dev) { return 0; }
+static inline void blk_nvm_remove_sysfs(struct device *dev) { }
 #endif /* CONFIG_BLK_DEV_NVM */
 
 #endif /* BLK_INTERNAL_H */
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 56f963586112..77d7b35a27c7 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -209,7 +209,7 @@ struct request {
 
 	/* for bidi */
 	struct request *next_rq;
-#if CONFIG_BLK_DEV_NVM
+#ifdef CONFIG_BLK_DEV_NVM
 	sector_t phys_sector;
 #endif
 };
@@ -914,11 +914,6 @@ static inline unsigned int blk_rq_cur_sectors(const struct request *rq)
 	return blk_rq_cur_bytes(rq) >> 9;
 }
 
-static inline sector_t blk_rq_phys_pos(const struct request *rq)
-{
-	return rq->phys_sector;
-}
-
 static inline unsigned int blk_queue_get_max_sectors(struct request_queue *q,
 						     unsigned int cmd_flags)
 {
@@ -1780,35 +1775,32 @@ static inline struct nvm_dev *blk_nvm_get_dev(struct request_queue *q)
 	return q->nvm;
 }
 #else
+struct nvm_dev;
 struct nvm_dev_ops;
 struct nvm_lun;
 struct nvm_block;
 struct nvm_target_type;
 
-struct nvm_target_type *nvm_find_target_type(const char *)
+static inline struct nvm_target_type *nvm_find_target_type(const char *name)
 {
 	return NULL;
 }
-int nvm_register_target(struct nvm_target_type *tt) { return -EINVAL; }
-void nvm_unregister_target(struct nvm_target_type *tt) {}
-static inline int blk_nvm_register(struct request_queue *,
-						struct nvm_dev_ops *)
+static inline int nvm_register_target(struct nvm_target_type *tt) { return -EINVAL; }
+static inline void nvm_unregister_target(struct nvm_target_type *tt) {}
+static inline int blk_nvm_register(struct request_queue *q,
+						struct nvm_dev_ops *ops)
 {
 	return -EINVAL;
 }
-static inline struct nvm_block *blk_nvm_get_blk(struct nvm_lun *, int)
+static inline struct nvm_block *blk_nvm_get_blk(struct nvm_lun *lun, int is_gc)
 {
 	return NULL;
 }
-static inline void blk_nvm_put_blk(struct nvm_block *) {}
-static inline int blk_nvm_erase_blk(struct nvm_dev *, struct nvm_block *)
+static inline void blk_nvm_put_blk(struct nvm_block *block) {}
+static inline int blk_nvm_erase_blk(struct nvm_dev *dev, struct nvm_block *blok)
 {
 	return -EINVAL;
 }
-static inline int blk_nvm_get_dev(struct request_queue *)
-{
-	return NULL;
-}
 static inline sector_t blk_nvm_alloc_addr(struct nvm_block *block)
 {
 	return 0;



^ permalink raw reply related	[flat|nested] 53+ messages in thread

* Re: [PATCH 2/5 v2] blk-mq: Support for Open-Channel SSDs
  2015-04-16 11:34         ` Paul Bolle
  (?)
@ 2015-04-16 13:29           ` Matias Bjørling
  -1 siblings, 0 replies; 53+ messages in thread
From: Matias Bjørling @ 2015-04-16 13:29 UTC (permalink / raw)
  To: Paul Bolle
  Cc: hch, axboe, linux-fsdevel, linux-kernel, linux-nvme, javier, keith.busch

On 04/16/2015 01:34 PM, Paul Bolle wrote:
> On Thu, 2015-04-16 at 12:23 +0200, Matias Bjørling wrote:
>> On 04/16/2015 11:10 AM, Paul Bolle wrote:
>>> A few things I spotted (while actually fiddling with 3/5).
>>
>> Thanks. I'll fix them up.
>
> Please note that just using #ifdef instead of #if is not all that's
> needed. See, I had some fun playing whack-a-mole with warnings and
> errors showing up in the "CONFIG_BLK_DEV_NVM is not set" case (because I
> was looking into things outside of this series that I don't understand).
>
> After adding the changes pasted at the end of this message (which I gave
> almost no thought whatsoever) I ran into:
>      block/blk-mq.c: In function ‘blk_mq_init_rq_map’:
>      block/blk-mq.c:1473:22: error: invalid application of ‘sizeof’ to incomplete type ‘struct nvm_per_rq’
>         cmd_size += sizeof(struct nvm_per_rq);
>                            ^
>
> Then I admitted defeat.
>
> Have fun with your turn of that game.

Thanks. Nothing like a good whack-a-mole. I've fixed it up in the master at:

   https://github.com/OpenChannelSSD/linux.git

Keith, for the nvme driver, I've packed everything neatly into an #ifdef,
so it looks the same as the existing #ifdef'ed integrity support. It should
have been like that from the start.
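
For the blk-mq side, a rough sketch of how the cmd_size bump can be guarded
(the exact shape of the check in blk_mq_init_rq_map() is an assumption on my
part, not copied from the tree):

    #ifdef CONFIG_BLK_DEV_NVM
    	/* room for the LightNVM per-request context */
    	if (set->flags & BLK_MQ_F_NVM)
    		cmd_size += sizeof(struct nvm_per_rq);
    #endif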

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 5/5 v2] nvme: LightNVM support
  2015-04-15 12:34   ` Matias Bjørling
@ 2015-04-16 14:55     ` Keith Busch
  -1 siblings, 0 replies; 53+ messages in thread
From: Keith Busch @ 2015-04-16 14:55 UTC (permalink / raw)
  To: Matias Bjørling
  Cc: hch, axboe, linux-fsdevel, linux-kernel, linux-nvme, javier, keith.busch

On Wed, 15 Apr 2015, Matias Bjørling wrote:
> @@ -2316,7 +2686,9 @@ static int nvme_dev_add(struct nvme_dev *dev)
> 	struct nvme_id_ctrl *ctrl;
> 	void *mem;
> 	dma_addr_t dma_addr;
> -	int shift = NVME_CAP_MPSMIN(readq(&dev->bar->cap)) + 12;
> +	u64 cap = readq(&dev->bar->cap);
> +	int shift = NVME_CAP_MPSMIN(cap) + 12;
> +	int nvm_cmdset = NVME_CAP_NVM(cap);

The controller capabilities' command sets supported used here is the
right way to key off on support for this new command set, IMHO, but I do
not see in this patch the command set being selected when the controller
is enabled

Also if we're going this route, I think we need to define this reserved
bit in the spec, but I'm not sure how to help with that.

> @@ -2332,6 +2704,7 @@ static int nvme_dev_add(struct nvme_dev *dev)
> 	ctrl = mem;
> 	nn = le32_to_cpup(&ctrl->nn);
> 	dev->oncs = le16_to_cpup(&ctrl->oncs);
> +	dev->oacs = le16_to_cpup(&ctrl->oacs);

I don't find OACS used anywhere in the rest of the patch. I think this
must be left over from v1.

Otherwise it looks pretty good to me, but I think it would be cleaner if
the lightnvm stuff is not mixed in the same file with the standard nvme
command set. We might end up splitting nvme-core in the future anyway
for command sets and transports.
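
Purely as an illustration of that kind of split (the nvme-lightnvm.c file
name is just a placeholder; the first two lines are roughly what
drivers/block/Makefile already does for the driver):

    # drivers/block/Makefile, sketch only
    obj-$(CONFIG_BLK_DEV_NVME)	+= nvme.o
    nvme-y			:= nvme-core.o nvme-scsi.o
    nvme-$(CONFIG_BLK_DEV_NVM)	+= nvme-lightnvm.o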

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 5/5 v2] nvme: LightNVM support
  2015-04-16 14:55     ` Keith Busch
@ 2015-04-16 15:14       ` Javier González
  -1 siblings, 0 replies; 53+ messages in thread
From: Javier González @ 2015-04-16 15:14 UTC (permalink / raw)
  To: Keith Busch
  Cc: Matias Bjørling, hch, axboe, linux-fsdevel, linux-kernel,
	linux-nvme

Hi,

> On 16 Apr 2015, at 16:55, Keith Busch <keith.busch@intel.com> wrote:
> 
> On Wed, 15 Apr 2015, Matias Bjørling wrote:
>> @@ -2316,7 +2686,9 @@ static int nvme_dev_add(struct nvme_dev *dev)
>> 	struct nvme_id_ctrl *ctrl;
>> 	void *mem;
>> 	dma_addr_t dma_addr;
>> -	int shift = NVME_CAP_MPSMIN(readq(&dev->bar->cap)) + 12;
>> +	u64 cap = readq(&dev->bar->cap);
>> +	int shift = NVME_CAP_MPSMIN(cap) + 12;
>> +	int nvm_cmdset = NVME_CAP_NVM(cap);
> 
> The controller capabilities' command sets supported used here is the
> right way to key off on support for this new command set, IMHO, but I do
> not see in this patch the command set being selected when the controller
> is enabled
> 
> Also if we're going this route, I think we need to define this reserved
> bit in the spec, but I'm not sure how to help with that.
> 
>> @@ -2332,6 +2704,7 @@ static int nvme_dev_add(struct nvme_dev *dev)
>> 	ctrl = mem;
>> 	nn = le32_to_cpup(&ctrl->nn);
>> 	dev->oncs = le16_to_cpup(&ctrl->oncs);
>> +	dev->oacs = le16_to_cpup(&ctrl->oacs);
> 
> I don't find OACS used anywhere in the rest of the patch. I think this
> must be left over from v1.
> 
> Otherwise it looks pretty good to me, but I think it would be cleaner if
> the lightnvm stuff is not mixed in the same file with the standard nvme
> command set. We might end up splitting nvme-core in the future anyway
> for command sets and transports.

Would you be ok with having nvme-lightnvm for LightNVM specific
commands?

Javier.


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 5/5 v2] nvme: LightNVM support
  2015-04-16 15:14       ` Javier González
@ 2015-04-16 15:52         ` Keith Busch
  -1 siblings, 0 replies; 53+ messages in thread
From: Keith Busch @ 2015-04-16 15:52 UTC (permalink / raw)
  To: Javier González
  Cc: Keith Busch, Matias Bjørling, hch, axboe, linux-fsdevel,
	linux-kernel, linux-nvme

On Thu, 16 Apr 2015, Javier González wrote:
>> On 16 Apr 2015, at 16:55, Keith Busch <keith.busch@intel.com> wrote:
>>
>> Otherwise it looks pretty good to me, but I think it would be cleaner if
>> the lightnvm stuff is not mixed in the same file with the standard nvme
>> command set. We might end up splitting nvme-core in the future anyway
>> for command sets and transports.
>
> Would you be ok with having nvme-lightnvm for LightNVM specific
> commands?

Sounds good to me, but I don't really have a dog in this fight. :)

^ permalink raw reply	[flat|nested] 53+ messages in thread

* RE: [PATCH 5/5 v2] nvme: LightNVM support
  2015-04-16 15:52         ` Keith Busch
  (?)
@ 2015-04-16 16:01           ` James R. Bergsten
  -1 siblings, 0 replies; 53+ messages in thread
From: James R. Bergsten @ 2015-04-16 16:01 UTC (permalink / raw)
  To: 'Keith Busch', 'Javier González'
  Cc: hch, 'Matias Bjørling',
	axboe, linux-kernel, linux-nvme, linux-fsdevel

My two cents worth is that it's (always) better to put ALL the commands into one place so that the entire set can be viewed at once and thus avoid inadvertent overloading of an opcode.  Otherwise you don't know what you don't know.

-----Original Message-----
From: Linux-nvme [mailto:linux-nvme-bounces@lists.infradead.org] On Behalf Of Keith Busch
Sent: Thursday, April 16, 2015 8:52 AM
To: Javier González
Cc: hch@infradead.org; Matias Bjørling; axboe@fb.com; linux-kernel@vger.kernel.org; linux-nvme@lists.infradead.org; Keith Busch; linux-fsdevel@vger.kernel.org
Subject: Re: [PATCH 5/5 v2] nvme: LightNVM support

On Thu, 16 Apr 2015, Javier González wrote:
>> On 16 Apr 2015, at 16:55, Keith Busch <keith.busch@intel.com> wrote:
>>
>> Otherwise it looks pretty good to me, but I think it would be cleaner 
>> if the lightnvm stuff is not mixed in the same file with the standard 
>> nvme command set. We might end up splitting nvme-core in the future 
>> anyway for command sets and transports.
>
> Would you be ok with having nvme-lightnvm for LightNVM specific 
> commands?

Sounds good to me, but I don't really have a dog in this fight. :)


^ permalink raw reply	[flat|nested] 53+ messages in thread

* RE: [PATCH 5/5 v2] nvme: LightNVM support
  2015-04-16 16:01           ` James R. Bergsten
@ 2015-04-16 16:12             ` Keith Busch
  -1 siblings, 0 replies; 53+ messages in thread
From: Keith Busch @ 2015-04-16 16:12 UTC (permalink / raw)
  To: James R. Bergsten
  Cc: 'Keith Busch', 'Javier González',
	hch, 'Matias Bjørling',
	axboe, linux-kernel, linux-nvme, linux-fsdevel

On Thu, 16 Apr 2015, James R. Bergsten wrote:
> My two cents worth is that it's (always) better to put ALL the commands into
> one place so that the entire set can be viewed at once and thus avoid
> inadvertent overloading of an opcode.  Otherwise you don't know what you
> don't know.

Yes, but these are two different command sets.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 5/5 v2] nvme: LightNVM support
  2015-04-16 14:55     ` Keith Busch
  (?)
@ 2015-04-16 17:17       ` Matias Bjorling
  -1 siblings, 0 replies; 53+ messages in thread
From: Matias Bjorling @ 2015-04-16 17:17 UTC (permalink / raw)
  To: Keith Busch; +Cc: hch, axboe, linux-fsdevel, linux-kernel, linux-nvme, javier

On 16-04-2015 at 16:55, Keith Busch wrote:
> On Wed, 15 Apr 2015, Matias Bjørling wrote:
>> @@ -2316,7 +2686,9 @@ static int nvme_dev_add(struct nvme_dev *dev)
>>     struct nvme_id_ctrl *ctrl;
>>     void *mem;
>>     dma_addr_t dma_addr;
>> -    int shift = NVME_CAP_MPSMIN(readq(&dev->bar->cap)) + 12;
>> +    u64 cap = readq(&dev->bar->cap);
>> +    int shift = NVME_CAP_MPSMIN(cap) + 12;
>> +    int nvm_cmdset = NVME_CAP_NVM(cap);
>
> The controller capabilities' command sets supported used here is the
> right way to key off on support for this new command set, IMHO, but I do
> not see in this patch the command set being selected when the controller
> is enabled

I'll get that added. Wouldn't the command set always be selected? An NVMe 
controller can expose both normal and lightnvm namespaces, so we would 
always enable it if the CAP bit is set.
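
Roughly, keying off that bit at enable time could look like the sketch 
below. This is illustrative only: NVME_CAP_NVM comes from this patch, 
while nvme_configure_css, dev->cmdset_lightnvm, NVME_CC_CSS_MASK and 
NVME_CC_CSS_LIGHTNVM are made-up names, not finalized spec values.

	/* sketch: remember and select the LightNVM command set when the
	 * controller advertises it in CAP */
	static void nvme_configure_css(struct nvme_dev *dev)
	{
		u64 cap = readq(&dev->bar->cap);

		if (NVME_CAP_NVM(cap)) {
			dev->cmdset_lightnvm = 1;                 /* made-up field */
			dev->ctrl_config &= ~NVME_CC_CSS_MASK;    /* made-up mask */
			dev->ctrl_config |= NVME_CC_CSS_LIGHTNVM; /* made-up value */
		}
	}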

>
> Also if we're going this route, I think we need to define this reserved
> bit in the spec, but I'm not sure how to help with that.

Agree, we'll see how it can be proposed.

>
>> @@ -2332,6 +2704,7 @@ static int nvme_dev_add(struct nvme_dev *dev)
>>     ctrl = mem;
>>     nn = le32_to_cpup(&ctrl->nn);
>>     dev->oncs = le16_to_cpup(&ctrl->oncs);
>> +    dev->oacs = le16_to_cpup(&ctrl->oacs);
>
> I don't find OACS used anywhere in the rest of the patch. I think this
> must be left over from v1.

Oops, yes, that's just a leftover.

>
> Otherwise it looks pretty good to me, but I think it would be cleaner if
> the lightnvm stuff is not mixed in the same file with the standard nvme
> command set. We might end up splitting nvme-core in the future anyway
> for command sets and transports.

Will do. Thanks.


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 1/5 v2] blk-mq: Add prep/unprep support
  2015-04-15 12:34   ` Matias Bjørling
@ 2015-04-17  6:34     ` Christoph Hellwig
  -1 siblings, 0 replies; 53+ messages in thread
From: Christoph Hellwig @ 2015-04-17  6:34 UTC (permalink / raw)
  To: Matias Bjørling
  Cc: hch, axboe, linux-fsdevel, linux-kernel, linux-nvme, keith.busch, javier

On Wed, Apr 15, 2015 at 02:34:40PM +0200, Matias Bjørling wrote:
> Allow users to hook into prep/unprep functions just before an IO is
> dispatched to the device driver. This is necessary for request-based
> logic to take place at upper layers.

I don't think any of this logic belongs into the block layer.  All this
should be library functions called by the drivers.


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 1/5 v2] blk-mq: Add prep/unprep support
  2015-04-17  6:34     ` Christoph Hellwig
@ 2015-04-17  8:15       ` Matias Bjørling
  -1 siblings, 0 replies; 53+ messages in thread
From: Matias Bjørling @ 2015-04-17  8:15 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: axboe, linux-fsdevel, linux-kernel, linux-nvme, keith.busch, javier

On 04/17/2015 08:34 AM, Christoph Hellwig wrote:
> On Wed, Apr 15, 2015 at 02:34:40PM +0200, Matias Bjørling wrote:
>> Allow users to hook into prep/unprep functions just before an IO is
>> dispatched to the device driver. This is necessary for request-based
>> logic to take place at upper layers.
>
> I don't think any of this logic belongs into the block layer.  All this
> should be library functions called by the drivers.
>

Just the prep/unprep, or other pieces as well?

I like that struct request_queue has a ref to struct nvm_dev, and that 
the variables in request and bio used to get to that struct live in the 
block layer.

In the future, applications can have an API to get/put flash blocks 
directly (using the blk_nvm_[get/put]_blk interface).
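
For reference, an in-kernel consumer of that interface could look roughly 
like the sketch below. The exact signatures are guessed here: the 
blk_nvm_get_blk/blk_nvm_put_blk names come from the patch set, while the 
q->nvm_dev field, the lun argument and the struct nvm_block layout are 
just illustrative.

	/* sketch: reserve a flash block, write to it through the normal
	 * block layer path, then release it again */
	struct nvm_block *blk;

	blk = blk_nvm_get_blk(q->nvm_dev, lun, GFP_KERNEL);
	if (!blk)
		return -ENOSPC;

	/* ... submit bios against the reserved block; the prep hook fills
	 * in the physical address before the driver sees the request ... */

	blk_nvm_put_blk(q->nvm_dev, blk);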

^ permalink raw reply	[flat|nested] 53+ messages in thread

* [PATCH 1/5 v2] blk-mq: Add prep/unprep support
@ 2015-04-17  8:15       ` Matias Bjørling
  0 siblings, 0 replies; 53+ messages in thread
From: Matias Bjørling @ 2015-04-17  8:15 UTC (permalink / raw)


On 04/17/2015 08:34 AM, Christoph Hellwig wrote:
> On Wed, Apr 15, 2015@02:34:40PM +0200, Matias Bj??rling wrote:
>> Allow users to hook into prep/unprep functions just before an IO is
>> dispatched to the device driver. This is necessary for request-based
>> logic to take place at upper layers.
>
> I don't think any of this logic belongs into the block layer.  All this
> should be library functions called by the drivers.
>

Just the prep/unprep, or other pieces as well?

I like that struct request_queue has a ref to struct nvm_dev, and the 
variables in request and bio to get to the struct is in the block layer.

In the future, applications can have an API to get/put flash block 
directly. (using the blk_nvm_[get/put]_blk interface).

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 1/5 v2] blk-mq: Add prep/unprep support
  2015-04-17  8:15       ` Matias Bjørling
@ 2015-04-17 17:46         ` Christoph Hellwig
  -1 siblings, 0 replies; 53+ messages in thread
From: Christoph Hellwig @ 2015-04-17 17:46 UTC (permalink / raw)
  To: Matias Bjørling
  Cc: Christoph Hellwig, axboe, linux-fsdevel, linux-kernel,
	linux-nvme, keith.busch, javier

On Fri, Apr 17, 2015 at 10:15:46AM +0200, Matias Bjørling wrote:
> Just the prep/unprep, or other pieces as well?

All of it - it's functionality that lies logically below the block
layer, so that's where it should be handled.

In fact it should probably work similarly to the mtd subsystem - that is, 
have its own API for low-level drivers, and just export a block driver 
as one consumer on the top side.
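
Something like mtd's register/unregister pattern, i.e. a rough sketch 
along these lines (all of the names below are made up for illustration):

	/* sketch: low-level drivers register with an open-channel core,
	 * which exposes a block device on top as one consumer */
	struct nvm_dev_ops {
		int (*identify)(struct nvm_dev *dev, void *buf);
		int (*submit_io)(struct nvm_dev *dev, struct nvm_rq *rqd);
		int (*erase_block)(struct nvm_dev *dev, sector_t blk_addr);
	};

	int nvm_register(struct nvm_dev *dev, const struct nvm_dev_ops *ops);
	void nvm_unregister(struct nvm_dev *dev);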

> In the future, applications can have an API to get/put flash block directly.
> (using the blk_nvm_[get/put]_blk interface).

s/application/filesystem/?

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 1/5 v2] blk-mq: Add prep/unprep support
  2015-04-17 17:46         ` Christoph Hellwig
@ 2015-04-18  6:45           ` Matias Bjorling
  -1 siblings, 0 replies; 53+ messages in thread
From: Matias Bjorling @ 2015-04-18  6:45 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: axboe, linux-fsdevel, linux-kernel, linux-nvme, keith.busch, javier

On 17-04-2015 at 19:46, Christoph Hellwig wrote:
> On Fri, Apr 17, 2015 at 10:15:46AM +0200, Matias Bjørling wrote:
>> Just the prep/unprep, or other pieces as well?
>
> All of it - it's functionality that lies logically below the block
> layer, so that's where it should be handled.
>
> In fact it should probably work similar to the mtd subsystem - that is
> have it's own API for low level drivers, and just export a block driver
> as one consumer on the top side.

The low-level drivers will be NVMe and vendors' own PCIe drivers. They 
are very generic in nature, and each driver would duplicate the same 
work. Both could have normal and open-channel drives attached.

I'd like to keep blk-mq in the loop. I don't think it will be pretty to 
have two data paths in the drivers. For blk-mq, bios are split/merged 
on the way down. Thus, the actual physical addresses aren't known 
before the IO is diced to the right size.
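
That is what the prep/unprep hooks in patch 1 are meant for. Conceptually 
it is something like the sketch below (simplified; the rq->phys_sector 
field, the rrpc_map_page helper and the return convention are assumed 
here, and the real rrpc code does the mapping under proper locking):

	/* sketch: translate logical to physical addresses just before
	 * dispatch, after blk-mq has split/merged the bio to its final size */
	static int rrpc_prep_rq(struct request_queue *q, struct request *rq)
	{
		struct rrpc *rrpc = q->queuedata;
		sector_t laddr = blk_rq_pos(rq);

		rq->phys_sector = rrpc_map_page(rrpc, laddr);
		return BLKPREP_OK;	/* return convention assumed */
	}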

The reason it shouldn't be under a single block device is that a target 
should be able to provide a global address space. That allows the 
address space to grow/shrink dynamically with the disks, giving a 
continuously growing address space where disks can be added/removed as 
requirements grow or flash ages - not on a sector level, but on a flash 
block level.

>
>> In the future, applications can have an API to get/put flash block directly.
>> (using the blk_nvm_[get/put]_blk interface).
>
> s/application/filesystem/?
>

Applications. The goal is that key-value stores, e.g. RocksDB, 
Aerospike, Ceph and similar, have direct access to flash storage. There 
won't be a kernel file-system in between.

The get/put interface can be seen as a space reservation interface that 
defines where a given process is allowed to access the storage media.

Another way to see it is that we provide a block allocator in the 
kernel, while applications implement the rest of the "file-system" in 
user-space, specially optimized for their data structures. This makes a 
lot of sense for a small subset (LSM trees, fractal trees, etc.) of 
database applications.
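
From the application side that could end up looking something like the 
purely hypothetical user-space sketch below; the device node, struct and 
ioctl numbers are all made up, no such interface exists yet.

	#include <fcntl.h>
	#include <sys/ioctl.h>
	#include <unistd.h>

	/* hypothetical reservation API: struct and ioctls are made up */
	struct nvm_block_ioctl { unsigned int lun; unsigned long long start_addr; };
	#define NVM_GET_BLOCK _IOWR('N', 0x20, struct nvm_block_ioctl)
	#define NVM_PUT_BLOCK _IOW('N', 0x21, struct nvm_block_ioctl)

	int reserve_write_release(const char *buf, size_t len)
	{
		struct nvm_block_ioctl blk = { .lun = 0 };
		int fd = open("/dev/lightnvm0", O_RDWR);	/* made-up node */

		if (fd < 0 || ioctl(fd, NVM_GET_BLOCK, &blk) < 0)
			return -1;
		pwrite(fd, buf, len, blk.start_addr << 9);	/* app-decided placement */
		ioctl(fd, NVM_PUT_BLOCK, &blk);			/* hand block back for GC */
		close(fd);
		return 0;
	}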


^ permalink raw reply	[flat|nested] 53+ messages in thread

* [PATCH 1/5 v2] blk-mq: Add prep/unprep support
@ 2015-04-18  6:45           ` Matias Bjorling
  0 siblings, 0 replies; 53+ messages in thread
From: Matias Bjorling @ 2015-04-18  6:45 UTC (permalink / raw)


Den 17-04-2015 kl. 19:46 skrev Christoph Hellwig:
> On Fri, Apr 17, 2015@10:15:46AM +0200, Matias Bj?rling wrote:
>> Just the prep/unprep, or other pieces as well?
>
> All of it - it's functionality that lies logically below the block
> layer, so that's where it should be handled.
>
> In fact it should probably work similar to the mtd subsystem - that is
> have it's own API for low level drivers, and just export a block driver
> as one consumer on the top side.

The low level drivers will be NVMe and vendor's own PCI-e drivers. It's 
very generic in their nature. Each driver would duplicate the same work. 
Both could have normal and open-channel drives attached.

I'll like to keep blk-mq in the loop. I don't think it will be pretty to 
have two data paths in the drivers. For blk-mq, bios are splitted/merged 
on the way down. Thus, the actual physical addresses needs aren't known 
before the IO is diced to the right size.

The reason it shouldn't be under the a single block device, is that a 
target should be able to provide a global address space. That allows the 
address space to grow/shrink dynamically with the disks. Allowing a 
continuously growing address space, where disks can be added/removed as 
requirements grow or flash ages. Not on a sector level, but on a flash 
block level.

>
>> In the future, applications can have an API to get/put flash block directly.
>> (using the blk_nvm_[get/put]_blk interface).
>
> s/application/filesystem/?
>

Applications. The goal is that key value stores, e.g. RocksDB, 
Aerospike, Ceph and similar have direct access to flash storage. There 
won't be a kernel file-system between.

The get/put interface can be seen as a space reservation interface for 
where a given process is allowed to access the storage media.

It can also be seen in the way that we provide a block allocator in the 
kernel, while applications implement the rest of "file-system" in 
user-space, specially optimized for their data structures. This makes a 
lot of sense for a small subset (LSM, Fractal trees, etc.) of database 
applications.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 1/5 v2] blk-mq: Add prep/unprep support
  2015-04-18  6:45           ` Matias Bjorling
@ 2015-04-18 20:16             ` Christoph Hellwig
  -1 siblings, 0 replies; 53+ messages in thread
From: Christoph Hellwig @ 2015-04-18 20:16 UTC (permalink / raw)
  To: Matias Bjorling
  Cc: Christoph Hellwig, keith.busch, javier, linux-kernel, linux-nvme,
	axboe, linux-fsdevel

On Sat, Apr 18, 2015 at 08:45:19AM +0200, Matias Bjorling wrote:
> The low level drivers will be NVMe and vendor's own PCI-e drivers. It's very
> generic in their nature. Each driver would duplicate the same work. Both
> could have normal and open-channel drives attached.

I didn't say the work should move into the driver, but rather that the 
driver should talk to the open-channel SSD code directly instead of 
hooking into the core block code.

> I'll like to keep blk-mq in the loop. I don't think it will be pretty to
> have two data paths in the drivers. For blk-mq, bios are splitted/merged on
> the way down. Thus, the actual physical addresses needs aren't known before
> the IO is diced to the right size.

But you _do_ have two different data paths already.  Nothing says you 
can't use blk-mq for your data path, but it should be a separate entry 
point, similar to how a SCSI disk and an MMC device both use the block 
layer but still use different entry points.
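
I.e. own the queue setup instead of hooking the generic path. A rough 
sketch, not tied to the actual lightnvm structures (nvm_target, 
nvm_mq_ops and nvm_rq below are made-up names):

	/* sketch: the open-channel code owns its own blk-mq entry point,
	 * the same way mmc/scsi own theirs */
	static int nvm_init_queue(struct nvm_target *t)
	{
		t->tag_set.ops = &nvm_mq_ops;
		t->tag_set.nr_hw_queues = 1;
		t->tag_set.queue_depth = 64;
		t->tag_set.cmd_size = sizeof(struct nvm_rq);

		if (blk_mq_alloc_tag_set(&t->tag_set))
			return -ENOMEM;

		t->queue = blk_mq_init_queue(&t->tag_set);
		return IS_ERR(t->queue) ? PTR_ERR(t->queue) : 0;
	}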

> The reason it shouldn't be under the a single block device, is that a target
> should be able to provide a global address space.
> That allows the address
> space to grow/shrink dynamically with the disks. Allowing a continuously
> growing address space, where disks can be added/removed as requirements grow
> or flash ages. Not on a sector level, but on a flash block level.

I don't understand what you mean with a single block device here, but I
suspect we're talking past each other somehow.

> >>In the future, applications can have an API to get/put flash block directly.
> >>(using the blk_nvm_[get/put]_blk interface).
> >
> >s/application/filesystem/?
> >
> 
> Applications. The goal is that key value stores, e.g. RocksDB, Aerospike,
> Ceph and similar have direct access to flash storage. There won't be a
> kernel file-system between.
> 
> The get/put interface can be seen as a space reservation interface for where
> a given process is allowed to access the storage media.
> 
> It can also be seen in the way that we provide a block allocator in the
> kernel, while applications implement the rest of "file-system" in
> user-space, specially optimized for their data structures. This makes a lot
> of sense for a small subset (LSM, Fractal trees, etc.) of database
> applications.

While we'll need a proper API for that first, it's just another reason 
why we shouldn't shoehorn the open-channel SSD support into the block 
layer.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 1/5 v2] blk-mq: Add prep/unprep support
  2015-04-18 20:16             ` Christoph Hellwig
@ 2015-04-19 18:12               ` Matias Bjorling
  -1 siblings, 0 replies; 53+ messages in thread
From: Matias Bjorling @ 2015-04-19 18:12 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: keith.busch, javier, linux-kernel, linux-nvme, axboe, linux-fsdevel

> On Sat, Apr 18, 2015 at 08:45:19AM +0200, Matias Bjorling wrote:
<snip>
>> The reason it shouldn't be under the a single block device, is that a target
>> should be able to provide a global address space.
>> That allows the address
>> space to grow/shrink dynamically with the disks. Allowing a continuously
>> growing address space, where disks can be added/removed as requirements grow
>> or flash ages. Not on a sector level, but on a flash block level.
>
> I don't understand what you mean with a single block device here, but I
> suspect we're talking past each other somehow.

Sorry. I meant that several block devices should form a single address 
space (exposed as a single block device), consisting of all the flash 
blocks. Applications could then get/put from that.

Thanks for your feedback. I'll push the pieces around and make the 
integration self-contained outside of the block layer.


^ permalink raw reply	[flat|nested] 53+ messages in thread

end of thread, other threads:[~2015-04-19 18:13 UTC | newest]

Thread overview: 53+ messages
2015-04-15 12:34 [PATCH 0/5 v2] Support for Open-Channel SSDs Matias Bjørling
2015-04-15 12:34 ` [PATCH 1/5 v2] blk-mq: Add prep/unprep support Matias Bjørling
2015-04-17  6:34   ` Christoph Hellwig
2015-04-17  8:15     ` Matias Bjørling
2015-04-17 17:46       ` Christoph Hellwig
2015-04-18  6:45         ` Matias Bjorling
2015-04-18 20:16           ` Christoph Hellwig
2015-04-19 18:12             ` Matias Bjorling
2015-04-15 12:34 ` [PATCH 2/5 v2] blk-mq: Support for Open-Channel SSDs Matias Bjørling
2015-04-16  9:10   ` Paul Bolle
2015-04-16 10:23     ` Matias Bjørling
2015-04-16 11:34       ` Paul Bolle
2015-04-16 13:29         ` Matias Bjørling
2015-04-15 12:34 ` [PATCH 3/5 v2] lightnvm: RRPC target Matias Bjørling
2015-04-16  9:12   ` Paul Bolle
2015-04-15 12:34 ` [PATCH 4/5 v2] null_blk: LightNVM support Matias Bjørling
2015-04-15 12:34 ` [PATCH 5/5 v2] nvme: " Matias Bjørling
2015-04-16 14:55   ` Keith Busch
2015-04-16 15:14     ` Javier González
2015-04-16 15:52       ` Keith Busch
2015-04-16 16:01         ` James R. Bergsten
2015-04-16 16:12           ` Keith Busch
2015-04-16 17:17     ` Matias Bjorling
