From: John Garry <john.g.garry@oracle.com>
To: axboe@kernel.dk, kbusch@kernel.org, hch@lst.de, sagi@grimberg.me,
	jejb@linux.ibm.com, martin.petersen@oracle.com,
	djwong@kernel.org, viro@zeniv.linux.org.uk, brauner@kernel.org,
	dchinner@redhat.com, jack@suse.cz
Cc: linux-block@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-nvme@lists.infradead.org, linux-fsdevel@vger.kernel.org,
	tytso@mit.edu, jbongio@google.com, linux-scsi@vger.kernel.org,
	ojaswin@linux.ibm.com, linux-aio@kvack.org,
	linux-btrfs@vger.kernel.org, io-uring@vger.kernel.org,
	nilay@linux.ibm.com, ritesh.list@gmail.com,
	Alan Adamson <alan.adamson@oracle.com>,
	John Garry <john.g.garry@oracle.com>
Subject: [PATCH v4 10/11] nvme: Atomic write support
Date: Mon, 19 Feb 2024 13:01:08 +0000	[thread overview]
Message-ID: <20240219130109.341523-11-john.g.garry@oracle.com> (raw)
In-Reply-To: <20240219130109.341523-1-john.g.garry@oracle.com>

From: Alan Adamson <alan.adamson@oracle.com>

Add support to set block layer request_queue atomic write limits. The
limits will be derived from either the namespace or controller atomic
parameters.

NVMe atomic-related parameters are grouped into "normal" and "power-fail"
(or PF) classes of parameter. For atomic write support, only the PF
parameters are of interest. The "normal" parameters are concerned with
racing reads and writes, a guarantee which also applies to the PF
parameters. See NVM Command Set Specification Revision 1.0d, section
2.1.4, for reference.

Whether to use the per-namespace or the controller-wide atomic parameters
is decided by NSFEAT bit 1 - see Figure 97: Identify - Identify Namespace
Data Structure, NVM Command Set.

NVMe namespaces may define an atomic boundary, whereby no atomic guarantees
are provided for a write which straddles this boundary in the LBA space.
The block layer merging policy is such that no merge may produce a request
which straddles such a boundary.
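
For example (hypothetical values): with a 4096 B logical block size and a
128 KiB atomic boundary, a 32 KiB write starting at byte offset 112 KiB
would span bytes 112 KiB to 144 KiB and so cross the boundary at 128 KiB;
the block layer must not allow a merge which produces such a request.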

Unlike SCSI, NVMe specifies no granularity or alignment rules. In addition,
again unlike SCSI, there is no dedicated atomic write command - a write
which adheres to the atomic size limit and boundary is implicitly atomic.

If NSFEAT bit 1 is set, the following parameters are of interest:
- NAWUPF (Namespace Atomic Write Unit Power Fail)
- NABSPF (Namespace Atomic Boundary Size Power Fail)
- NABO (Namespace Atomic Boundary Offset)

and we set request_queue limits as follows:
- atomic_write_unit_max = rounddown_pow_of_two(NAWUPF)
- atomic_write_max_bytes = NAWUPF
- atomic_write_boundary = NABSPF
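
As a worked example (hypothetical values): with a 4096 B logical block
size, NAWUPF = 15 gives an atomic write size of (15 + 1) * 4096 B = 64 KiB,
so atomic_write_max_bytes = 64 KiB and atomic_write_unit_max =
rounddown_pow_of_two(64 KiB) = 64 KiB, while NABSPF = 31 gives an atomic
boundary of (31 + 1) * 4096 B = 128 KiB.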

In the unlikely scenario that NABO is non-zero, atomic writes will not be
supported at all, as dealing with this adds extra complexity. This policy
may change in future.

In all cases, atomic_write_unit_min is set to the logical block size.

If NSFEAT bit 1 is unset, the following parameter is of interest:
- AWUPF (Atomic Write Unit Power Fail)

and we set request_queue limits as follows:
- atomic_write_unit_max = rounddown_pow_of_two(AWUPF)
- atomic_write_max_bytes = AWUPF
- atomic_write_boundary = 0
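
For example (hypothetical values): with a 4096 B logical block size,
AWUPF = 7 gives an atomic write size of (7 + 1) * 4096 B = 32 KiB, so both
atomic_write_max_bytes and atomic_write_unit_max are 32 KiB, with no
boundary restriction.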

The block layer requires that the atomic_write_boundary value is a
power-of-2. However, it is really only required that atomic_write_boundary
be a multiple of atomic_write_unit_max. As such, if NABSPF were not a
power-of-2, atomic_write_unit_max could be reduced such that it evenly
divides NABSPF. However, this complexity is not yet supported.
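
For illustration only (this is not implemented by the patch and the
function name is hypothetical), such a reduction could look roughly like:

  /*
   * Halve unit_max (keeping it a power-of-2, as the block layer requires)
   * until the boundary is an exact multiple of it. This terminates, since
   * the boundary is a multiple of the logical block size and unit_max
   * starts as a power-of-2 multiple of the logical block size.
   */
  static unsigned int nvme_reduce_unit_max(unsigned int unit_max,
                                           unsigned int boundary)
  {
          while (unit_max > 1 && boundary % unit_max)
                  unit_max /= 2;
          return unit_max;
  }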

A helper function, nvme_valid_atomic_write(), is also added for the
submission path to verify that a request submitted to the driver will
actually be executed atomically.
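
The caller is added in the following patch, which is why the helper is
marked __maybe_unused here. A rough sketch of how the submission path might
use it (REQ_ATOMIC comes from the block layer patches earlier in this
series; the error status shown is only illustrative):

  if (req->cmd_flags & REQ_ATOMIC && !nvme_valid_atomic_write(req))
          return BLK_STS_INVAL;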

Note on NABSPF:
There seems to be some vagueness in the spec as to whether NABSPF applies
when NSFEAT bit 1 is unset. Figure 97 does not explicitly mention NABSPF or
how it is affected by bit 1. However, Figure 4 does say to check Figure 97
for info about per-namespace parameters, which NABSPF is, so it is implied.
Note also that nvme_update_disk_info() currently checks the namespace
parameter NABO regardless of this bit.

Signed-off-by: Alan Adamson <alan.adamson@oracle.com>
#jpg: total rewrite
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 drivers/nvme/host/core.c | 67 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 67 insertions(+)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 0a96362912ce..c5bc663c8582 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -934,6 +934,31 @@ static inline blk_status_t nvme_setup_write_zeroes(struct nvme_ns *ns,
 	return BLK_STS_OK;
 }
 
+__maybe_unused
+static bool nvme_valid_atomic_write(struct request *req)
+{
+	struct request_queue *q = req->q;
+	u32 boundary_bytes = queue_atomic_write_boundary_bytes(q);
+
+	if (blk_rq_bytes(req) > queue_atomic_write_unit_max_bytes(q))
+		return false;
+
+	if (boundary_bytes) {
+		u64 mask = boundary_bytes - 1, imask = ~mask;
+		u64 start = blk_rq_pos(req) << SECTOR_SHIFT;
+		u64 end = start + blk_rq_bytes(req) - 1;
+
+		/* If larger than the boundary size, it must cross a boundary */
+		if (blk_rq_bytes(req) > boundary_bytes)
+			return false;
+
+		if ((start & imask) != (end & imask))
+			return false;
+	}
+
+	return true;
+}
+
 static inline blk_status_t nvme_setup_rw(struct nvme_ns *ns,
 		struct request *req, struct nvme_command *cmnd,
 		enum nvme_opcode op)
@@ -1960,6 +1985,45 @@ static void nvme_set_queue_limits(struct nvme_ctrl *ctrl,
 	blk_queue_write_cache(q, vwc, vwc);
 }
 
+static void nvme_update_atomic_write_disk_info(struct nvme_ctrl *ctrl,
+		 struct gendisk *disk, struct nvme_id_ns *id, u32 bs,
+		 u32 atomic_bs)
+{
+	unsigned int unit_min = 0, unit_max = 0, boundary = 0, max_bytes = 0;
+	struct request_queue *q = disk->queue;
+
+	if (id->nsfeat & NVME_NS_FEAT_ATOMICS && id->nawupf) {
+		if (le16_to_cpu(id->nabspf))
+			boundary = (le16_to_cpu(id->nabspf) + 1) * bs;
+
+		/*
+		 * The boundary size just needs to be a multiple of unit_max
+		 * (and not necessarily a power-of-2), so this could be relaxed
+		 * in the block layer in future.
+		 * Furthermore, if needed, unit_max could be reduced so that the
+		 * boundary size was compliant.
+		 */
+		if (!boundary || is_power_of_2(boundary)) {
+			max_bytes = atomic_bs;
+			unit_min = bs;
+			unit_max = rounddown_pow_of_two(atomic_bs);
+		} else {
+			dev_notice(ctrl->device, "Unsupported atomic write boundary (%d)\n",
+				boundary);
+			boundary = 0;
+		}
+	} else if (ctrl->subsys->awupf) {
+		max_bytes = atomic_bs;
+		unit_min = bs;
+		unit_max = rounddown_pow_of_two(atomic_bs);
+	}
+
+	blk_queue_atomic_write_max_bytes(q, max_bytes);
+	blk_queue_atomic_write_unit_min_sectors(q, unit_min >> SECTOR_SHIFT);
+	blk_queue_atomic_write_unit_max_sectors(q, unit_max >> SECTOR_SHIFT);
+	blk_queue_atomic_write_boundary_bytes(q, boundary);
+}
+
 static void nvme_update_disk_info(struct nvme_ctrl *ctrl, struct gendisk *disk,
 		struct nvme_ns_head *head, struct nvme_id_ns *id)
 {
@@ -1990,6 +2054,9 @@ static void nvme_update_disk_info(struct nvme_ctrl *ctrl, struct gendisk *disk,
 			atomic_bs = (1 + le16_to_cpu(id->nawupf)) * bs;
 		else
 			atomic_bs = (1 + ctrl->subsys->awupf) * bs;
+
+		nvme_update_atomic_write_disk_info(ctrl, disk, id, bs,
+						atomic_bs);
 	}
 
 	if (id->nsfeat & NVME_NS_FEAT_IO_OPT) {
-- 
2.31.1

