* [PATCH v10 00/41] btrfs: zoned block device support
@ 2020-11-10 11:26 Naohiro Aota
  2020-11-10 11:26 ` [PATCH v10 01/41] block: add bio_add_zone_append_page Naohiro Aota
                   ` (42 more replies)
  0 siblings, 43 replies; 125+ messages in thread
From: Naohiro Aota @ 2020-11-10 11:26 UTC (permalink / raw)
  To: linux-btrfs, dsterba
  Cc: hare, linux-fsdevel, Jens Axboe, Christoph Hellwig,
	Darrick J. Wong, Naohiro Aota

This series adds zoned block device support to btrfs.

This series is also available on github.
Kernel   https://github.com/naota/linux/tree/btrfs-zoned-v10
Userland https://github.com/naota/btrfs-progs/tree/btrfs-zoned
xfstests https://github.com/naota/fstests/tree/btrfs-zoned

The userland tools depend on a patched util-linux (libblkid and wipefs) to
handle the log-structured superblock. To ease testing, pre-compiled
statically linked userland tools are available here:
https://wdc.app.box.com/s/fnhqsb3otrvgkstq66o6bvdw6tk525kp

This v10 still leaves the following issues for a later fix, but the first
part of the series should be in good shape to be merged.
- Bio submission path & splitting an ordered extent
- Redirtying freed tree blocks
  - Switch to keeping it dirty
    - Not working correctly for now
- Dedicated tree-log block group
  - We need tree-log for zoned device
    - Dbench (32 clients) is 85% slower with "-o notreelog"
  - Need to separate tree-log block group from other metadata space_info
- Relocation
  - Use normal write command for relocation
  - Relocated device extents must be reset
    - It should be discarded on regular btrfs too though

Changes from v9:
  - Extract iomap_dio_bio_opflags() to set the proper bi_opf flag
  - Write pointer emulation
    - Rewrite using btrfs_previous_extent_item()
    - Convert ASSERT() to runtime check
  - Exclude regular superblock positions
  - Fix an error on writing to conventional zones
  - Take the transaction lock in mark_block_group_to_copy()
  - Rename 'hmzoned_devices' to 'zoned_devices' in btrfs_check_zoned_mode()
  - Add do_discard_extent() helper
  - Move zoned check into fetch_cluster_info()
  - Drop setting bdev to bio in btrfs_bio_add_page() (will fix later once
    we support multiple devices)
  - Subtract bytes_zone_unusable properly when removing a block group
  - Add "struct block_device *bdev" directly to btrfs_rmap_block()
  - Rename btrfs_zone_align to btrfs_align_offset_to_zone
  - Add comment to use pr_info in place of btrfs_info
  - Add comment for superblock log zones
  - Fix coding style
  - Fix typos

btrfs-progs and xfstests series will follow.

This version of ZONED btrfs switched from the normal write command to the
zone append write command. A zone append write does not specify the LBA to
write to (the write pointer). Instead, it only selects a zone to write to
by its start LBA. The device (NVMe ZNS), or the zone append emulation in
the sd driver in the case of SAS or SATA HDDs, then writes the data at the
write pointer position and returns the written LBA in the command reply.

The benefit of the zone append write command is that the issuing order of
write commands does not matter. So, we can eliminate the block group lock
and utilize asynchronous checksumming, which can reorder the IOs.

Eliminating the lock improves performance. In particular, on a workload
with massive contention on the same zone [1], we observed a 36% performance
improvement compared to the normal write command.

[1] Fio running 16 jobs with 4KB random writes for 5 minutes

However, there are some limitations. We cannot use a non-SINGLE profile.
Supporting non-SINGLE profiles with zone append writing is not trivial. For
example, in the DUP profile, we send a zone append write IO to two zones on
a device. The device replies with the written LBA for each IO. If the
offsets of the returned addresses from the beginning of their zones differ,
the two copies end up at different logical addresses.

For the same reason, we cannot issue multiple IOs for one ordered extent.
Thus, the size of an ordered extent is limited to max_zone_append_size.
This limitation causes fragmentation and increased metadata usage. In the
future, we can add an optimization to merge ordered extents after end_bio.

* Patch series description

A zoned block device consists of a number of zones. Zones are either
conventional and accepting random writes or sequential and requiring
that writes be issued in LBA order from each zone write pointer
position. This patch series ensures that the sequential write
constraint of sequential zones is respected while fundamentally not
changing btrfs block and I/O management for blocks stored in
conventional zones.

To achieve this, the default chunk size of btrfs is changed on zoned
block devices so that chunks are always aligned to a zone. Allocation
of blocks within a chunk is changed so that the allocation is always
sequential from the beginning of the chunk. To do so, an allocation
pointer is added to block groups and used as the allocation hint. The
allocation changes also ensure that blocks freed below the allocation
pointer are ignored, resulting in sequential block allocation
regardless of the chunk usage.

The zone of a chunk is reset to allow reuse of the zone only when the
block group is being freed, that is, when all the chunks of the block
group are unused.

For btrfs volumes composed of multiple zoned disks, a restriction is
added to ensure that all disks have the same zone size. This
restriction matches the existing constraint that all chunks in a block
group must have the same size.

* Enabling tree-log

The tree-log feature does not work in ZONED mode as is. Blocks for a
tree-log tree are allocated mixed with other metadata blocks, and btrfs
writes and syncs the tree-log blocks to devices at fsync() time, which is
a different timing from a global transaction commit. As a result, both
writing tree-log blocks and writing other metadata blocks become
non-sequential writes, which ZONED mode must avoid.

This series introduces a dedicated block group for tree-log blocks to
create two metadata writing streams, one for tree-log blocks and the
other for metadata blocks. As a result, each write stream can now be
written to devices separately and sequentially.

* Log-structured superblock

The superblock (and its copies) is the only data structure in btrfs
which has a fixed location on a device. Since we cannot overwrite in a
sequential write required zone, we cannot place the superblock in such
a zone.

This series implements superblock log writing. It uses two zones as a
circular buffer to write updated superblocks. Once the first zone is
filled up, writing moves to the second zone. The first zone is reset once
both zones are filled. We can determine the position of the latest
superblock by reading the write pointer information from the device.

* Patch series organization

Patches 1 and 2 are preparation patches for the block and iomap layers.

Patch 3 introduces the ZONED incompatible feature flag to indicate that the
btrfs volume was formatted for use on zoned block devices.

Patches 4 to 6 implement functions to gather information on the zones of
the device (zone type, write pointer position, and max_zone_append_size).

Patches 7 to 10 disable features which are not compatible with the
sequential write constraint of zoned block devices. These include
space_cache, NODATACOW, fallocate, and MIXED_BG.

Patch 11 implements the log-structured superblock writing.

Patches 12 and 13 tweak the device extent allocation for ZONED mode and add
verification to check if a device extent is properly aligned to zones.

Patches 14 to 17 implement the sequential block allocator for ZONED mode.

Patch 18 implements zone reset for unused block groups.

Patches 19 to 30 implement the writing path for several types of IO
(non-compressed data, direct IO, and metadata). These include re-dirtying
once-freed metadata blocks to prevent write holes.

Patches 31 to 40 tweak some btrfs features to work with ZONED mode. These
include device-replace, relocation, repairing IO errors, and tree-log.

Finally, patch 41 adds the ZONED feature to the list of supported features.

* Patch testing note

** Zone-aware util-linux

Since the log-structured superblock feature changed the location of the
superblock magic, the current util-linux (libblkid) can no longer detect
ZONED btrfs. You need to apply a to-be-posted patch to util-linux to make
it "zone aware".

** Testing device

You need devices that support the zone append write command to run ZONED
btrfs.

Other than real devices, null_blk supports the zone append write command.
You can use a memory-backed null_blk to run the tests. The following
script creates a 12800 MB /dev/nullb0.

    sysfs=/sys/kernel/config/nullb/nullb0
    size=12800 # MB
    
    # drop nullb0
    if [[ -d $sysfs ]]; then
            echo 0 > "${sysfs}"/power
            rmdir $sysfs
    fi
    lsmod | grep -q null_blk && rmmod null_blk
    modprobe null_blk nr_devices=0
    
    mkdir "${sysfs}"
    
    echo "${size}" > "${sysfs}"/size
    echo 1 > "${sysfs}"/zoned
    echo 0 > "${sysfs}"/zone_nr_conv
    echo 1 > "${sysfs}"/memory_backed
    
    echo 1 > "${sysfs}"/power
    udevadm settle

Zoned SCSI devices such as SMR HDDs or scsi_debug also support the zone
append command as an emulated command within the SCSI sd driver. This
emulation is completely transparent to the user and provides the same
semantics as native NVMe ZNS drive support.

Also, a qemu patch is available to emulate an NVMe ZNS device.

** xfstests

We ran xfstests on ZONED btrfs and, if we omit some cases that are known
to fail currently, all test cases pass.

Cases that can be ignored:
1) cases that also fail with regular btrfs on regular devices,
2) cases trying to test the fallocate feature without
   "_require_xfs_io_command "falloc"",
3) cases trying to test features incompatible with ZONED btrfs (e.g. RAID5/6),
4) cases trying to use a setup incompatible with ZONED btrfs (e.g. dm-linear
   not aligned to a zone boundary, swap),
5) cases trying to create a file system that is too small (we require at
   least 9 zones to initiate ZONED btrfs),
6) cases dropping the original MKFS_OPTIONS ("-O zoned"), so they cannot
   create ZONED btrfs (btrfs/003),
7) cases hitting ENOSPC incurred by the larger metadata block group size.

I will send a patch series for xfstests to handle cases 2-6 properly.

The patched xfstests is available here:

https://github.com/naota/fstests/tree/btrfs-zoned

Also, you need to apply the following patch if you run xfstests with
tcmu devices. Without this patch, xfstests btrfs/003 fails at
"_devmgt_add" after "_devmgt_remove".

https://marc.info/?l=linux-scsi&m=156498625421698&w=2

v9 https://lore.kernel.org/linux-btrfs/cover.1604065156.git.naohiro.aota@wdc.com/
v8 https://lore.kernel.org/linux-btrfs/cover.1601572459.git.naohiro.aota@wdc.com/
v7 https://lore.kernel.org/linux-btrfs/20200911123259.3782926-1-naohiro.aota@wdc.com/
v6 https://lore.kernel.org/linux-btrfs/20191213040915.3502922-1-naohiro.aota@wdc.com/
v5 https://lore.kernel.org/linux-btrfs/20191204082513.857320-1-naohiro.aota@wdc.com/
v4 https://lwn.net/Articles/797061/
v3 https://lore.kernel.org/linux-btrfs/20190808093038.4163421-1-naohiro.aota@wdc.com/
v2 https://lore.kernel.org/linux-btrfs/20190607131025.31996-1-naohiro.aota@wdc.com/
v1 https://lore.kernel.org/linux-btrfs/20180809180450.5091-1-naota@elisp.net/

Changelog
v9
 - Direct-IO path now follow several hardware restrictions (other than
   max_zone_append_size) by using ZONE_APPEND support of iomap
 - introduces union of fs_info->zone_size and fs_info->zoned [Johannes]
   - and use btrfs_is_zoned(fs_info) in place of btrfs_fs_incompat(fs_info, ZONED)
 - print if zoned is enabled or not when printing module info [Johannes]
 - drop patch of disabling inode_cache on ZONED
 - moved the for_treelog flag to a proper location [Johannes]
 - Code style fixes [Johannes]
 - Add comment about adding physical layer things to ordered extent
   structure
 - Pass file_offset explicitly to extract_ordered_extent() instead of
   determining it from bio
 - Bug fixes
   - write out fsync region so that the logical address of ordered extents
     and checksums are properly finalized
   - free zone_info at umount time
   - fix superblock log handling when entering zones[1] for the first time
   - fix double free of log-tree roots [Johannes]
   - Drop erroneous ASSERT in do_allocation_zoned()
v8
 - Use bio_add_hw_page() to build up bio to honor hardware restrictions
   - add bio_add_zone_append_page() as a wrapper of the function
 - Split file extent on submitting bio
   - If bio_add_zone_append_page() fails, split the file extent and send
     out bio
   - so, we can ensure one bio == one file extent
 - Fix build bot issues
 - Rebased on misc-next
v7:
 - Use zone append write command instead of normal write command
   - Bio issuing order does not matter
   - No need to use lock anymore
   - Can use asynchronous checksum
 - Removed RAID support for now
 - Rename HMZONED to ZONED
 - Split some patches
 - Rebased on kdave/for-5.9-rc3 + iomap direct IO
v6:
 - Use bitmap helpers (Johannes)
 - Code cleanup (Johannes)
 - Rebased on kdave/for-5.5
 - Enable the tree-log feature.
 - Treat conventional zones as sequential zones, so we can now allow
   mixed allocation of conventional zone and sequential write required
   zone to construct a block group.
 - Implement log-structured superblock
   - No need for one conventional zone at the beginning of a device.
 - Fix deadlock of direct IO writing
 - Fix building with !CONFIG_BLK_DEV_ZONED (Johannes)
 - Fix leak of zone_info (Johannes)
v5:
 - Rebased on kdave/for-5.5
 - Enable the tree-log feature.
 - Treat conventional zones as sequential zones, so we can now allow
   mixed allocation of conventional zone and sequential write required
   zone to construct a block group.
 - Implement log-structured superblock
   - No need for one conventional zone at the beginning of a device.
 - Fix deadlock of direct IO writing
 - Fix building with !CONFIG_BLK_DEV_ZONED (Johannes)
 - Fix leak of zone_info (Johannes)
v4:
 - Move memory allocation of zone information out of
   btrfs_get_dev_zones() (Anand)
 - Add disabled features table in commit log (Anand)
 - Ensure "max_chunk_size >= devs_min * data_stripes * zone_size"
v3:
 - Serialize allocation and submit_bio instead of bio buffering in
   btrfs_map_bio().
 -- Disable async checksum/submit in HMZONED mode
 - Introduce helper functions and hmzoned.c/h (Josef, David)
 - Add support for repairing IO failure
 - Add support for NOCOW direct IO write (Josef)
 - Disable preallocation entirely
 -- Disable INODE_MAP_CACHE
 -- relocation is reworked not to rely on preallocation in HMZONED mode
 - Disable NODATACOW
 - Disable MIXED_BG
 - Device extent that cover super block position is banned (David)
v2:
 - Add support for dev-replace
 -- To support dev-replace, moved submit_buffer one layer up. It now
    handles bio instead of btrfs_bio.
 -- Mark unmirrored Block Group readonly only when there are writable
    mirrored BGs. Necessary to handle degraded RAID.
 - Expire worker use vanilla delayed_work instead of btrfs's async-thread
 - Device extent allocator now ensures that the region is on the same zone type.
 - Add delayed allocation shrinking.
 - Rename btrfs_drop_dev_zonetypes() to btrfs_destroy_dev_zonetypes()
 - Fix
 -- Use SECTOR_SHIFT (Nikolay)
 -- Use btrfs_err (Nikolay)


Johannes Thumshirn (1):
  block: add bio_add_zone_append_page

Naohiro Aota (40):
  iomap: support REQ_OP_ZONE_APPEND
  btrfs: introduce ZONED feature flag
  btrfs: get zone information of zoned block devices
  btrfs: check and enable ZONED mode
  btrfs: introduce max_zone_append_size
  btrfs: disallow space_cache in ZONED mode
  btrfs: disallow NODATACOW in ZONED mode
  btrfs: disable fallocate in ZONED mode
  btrfs: disallow mixed-bg in ZONED mode
  btrfs: implement log-structured superblock for ZONED mode
  btrfs: implement zoned chunk allocator
  btrfs: verify device extent is aligned to zone
  btrfs: load zone's allocation offset
  btrfs: emulate write pointer for conventional zones
  btrfs: track unusable bytes for zones
  btrfs: do sequential extent allocation in ZONED mode
  btrfs: reset zones of unused block groups
  btrfs: redirty released extent buffers in ZONED mode
  btrfs: extract page adding function
  btrfs: use bio_add_zone_append_page for zoned btrfs
  btrfs: handle REQ_OP_ZONE_APPEND as writing
  btrfs: split ordered extent when bio is sent
  btrfs: extend btrfs_rmap_block for specifying a device
  btrfs: use ZONE_APPEND write for ZONED btrfs
  btrfs: enable zone append writing for direct IO
  btrfs: introduce dedicated data write path for ZONED mode
  btrfs: serialize meta IOs on ZONED mode
  btrfs: wait existing extents before truncating
  btrfs: avoid async metadata checksum on ZONED mode
  btrfs: mark block groups to copy for device-replace
  btrfs: implement cloning for ZONED device-replace
  btrfs: implement copying for ZONED device-replace
  btrfs: support dev-replace in ZONED mode
  btrfs: enable relocation in ZONED mode
  btrfs: relocate block group to repair IO failure in ZONED
  btrfs: split alloc_log_tree()
  btrfs: extend zoned allocator to use dedicated tree-log block group
  btrfs: serialize log transaction on ZONED mode
  btrfs: reorder log node allocation
  btrfs: enable to mount ZONED incompat flag

 block/bio.c                       |   38 +
 fs/btrfs/Makefile                 |    1 +
 fs/btrfs/block-group.c            |   84 +-
 fs/btrfs/block-group.h            |   18 +-
 fs/btrfs/ctree.h                  |   20 +-
 fs/btrfs/dev-replace.c            |  195 +++++
 fs/btrfs/dev-replace.h            |    3 +
 fs/btrfs/disk-io.c                |   93 ++-
 fs/btrfs/disk-io.h                |    2 +
 fs/btrfs/extent-tree.c            |  218 ++++-
 fs/btrfs/extent_io.c              |  130 ++-
 fs/btrfs/extent_io.h              |    2 +
 fs/btrfs/file.c                   |    6 +-
 fs/btrfs/free-space-cache.c       |   58 ++
 fs/btrfs/free-space-cache.h       |    2 +
 fs/btrfs/inode.c                  |  164 +++-
 fs/btrfs/ioctl.c                  |   13 +
 fs/btrfs/ordered-data.c           |   79 ++
 fs/btrfs/ordered-data.h           |   10 +
 fs/btrfs/relocation.c             |   35 +-
 fs/btrfs/scrub.c                  |  145 ++++
 fs/btrfs/space-info.c             |   13 +-
 fs/btrfs/space-info.h             |    4 +-
 fs/btrfs/super.c                  |   19 +-
 fs/btrfs/sysfs.c                  |    4 +
 fs/btrfs/tests/extent-map-tests.c |    2 +-
 fs/btrfs/transaction.c            |   10 +
 fs/btrfs/transaction.h            |    3 +
 fs/btrfs/tree-log.c               |   52 +-
 fs/btrfs/volumes.c                |  322 +++++++-
 fs/btrfs/volumes.h                |    7 +
 fs/btrfs/zoned.c                  | 1272 +++++++++++++++++++++++++++++
 fs/btrfs/zoned.h                  |  295 +++++++
 fs/iomap/direct-io.c              |   41 +-
 include/linux/bio.h               |    2 +
 include/linux/iomap.h             |    1 +
 include/uapi/linux/btrfs.h        |    1 +
 37 files changed, 3246 insertions(+), 118 deletions(-)
 create mode 100644 fs/btrfs/zoned.c
 create mode 100644 fs/btrfs/zoned.h

-- 
2.27.0


^ permalink raw reply	[flat|nested] 125+ messages in thread

* [PATCH v10 01/41] block: add bio_add_zone_append_page
  2020-11-10 11:26 [PATCH v10 00/41] btrfs: zoned block device support Naohiro Aota
@ 2020-11-10 11:26 ` Naohiro Aota
  2020-11-10 17:20   ` Christoph Hellwig
  2020-11-10 11:26 ` [PATCH v10 02/41] iomap: support REQ_OP_ZONE_APPEND Naohiro Aota
                   ` (41 subsequent siblings)
  42 siblings, 1 reply; 125+ messages in thread
From: Naohiro Aota @ 2020-11-10 11:26 UTC (permalink / raw)
  To: linux-btrfs, dsterba
  Cc: hare, linux-fsdevel, Jens Axboe, Christoph Hellwig,
	Darrick J. Wong, Johannes Thumshirn

From: Johannes Thumshirn <johannes.thumshirn@wdc.com>

Add bio_add_zone_append_page(), a wrapper around bio_add_hw_page() which
is intended to be used by file systems that directly add pages to a bio
instead of using bio_iov_iter_get_pages().

Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 block/bio.c         | 38 ++++++++++++++++++++++++++++++++++++++
 include/linux/bio.h |  2 ++
 2 files changed, 40 insertions(+)

diff --git a/block/bio.c b/block/bio.c
index 58d765400226..c8943201c26c 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -853,6 +853,44 @@ int bio_add_pc_page(struct request_queue *q, struct bio *bio,
 }
 EXPORT_SYMBOL(bio_add_pc_page);
 
+/**
+ * bio_add_zone_append_page - attempt to add page to zone-append bio
+ * @bio: destination bio
+ * @page: page to add
+ * @len: vec entry length
+ * @offset: vec entry offset
+ *
+ * Attempt to add a page to the bio_vec maplist of a bio that will be submitted
+ * for a zone-append request. This can fail for a number of reasons, such as the
+ * bio being full or the target block device is not a zoned block device or
+ * other limitations of the target block device. The target block device must
+ * allow bio's up to PAGE_SIZE, so it is always possible to add a single page
+ * to an empty bio.
+ *
+ * Returns: number of bytes added to the bio, or 0 in case of a failure.
+ */
+int bio_add_zone_append_page(struct bio *bio, struct page *page,
+			     unsigned int len, unsigned int offset)
+{
+	struct request_queue *q;
+	bool same_page = false;
+
+	if (WARN_ON_ONCE(bio_op(bio) != REQ_OP_ZONE_APPEND))
+		return 0;
+
+	if (WARN_ON_ONCE(!bio->bi_disk))
+		return 0;
+
+	q = bio->bi_disk->queue;
+
+	if (WARN_ON_ONCE(!blk_queue_is_zoned(q)))
+		return 0;
+
+	return bio_add_hw_page(q, bio, page, len, offset,
+			       queue_max_zone_append_sectors(q), &same_page);
+}
+EXPORT_SYMBOL_GPL(bio_add_zone_append_page);
+
 /**
  * __bio_try_merge_page - try appending data to an existing bvec.
  * @bio: destination bio
diff --git a/include/linux/bio.h b/include/linux/bio.h
index c6d765382926..7ef300cb4e9a 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -442,6 +442,8 @@ void bio_chain(struct bio *, struct bio *);
 extern int bio_add_page(struct bio *, struct page *, unsigned int,unsigned int);
 extern int bio_add_pc_page(struct request_queue *, struct bio *, struct page *,
 			   unsigned int, unsigned int);
+int bio_add_zone_append_page(struct bio *bio, struct page *page,
+			     unsigned int len, unsigned int offset);
 bool __bio_try_merge_page(struct bio *bio, struct page *page,
 		unsigned int len, unsigned int off, bool *same_page);
 void __bio_add_page(struct bio *bio, struct page *page,
-- 
2.27.0



* [PATCH v10 02/41] iomap: support REQ_OP_ZONE_APPEND
  2020-11-10 11:26 [PATCH v10 00/41] btrfs: zoned block device support Naohiro Aota
  2020-11-10 11:26 ` [PATCH v10 01/41] block: add bio_add_zone_append_page Naohiro Aota
@ 2020-11-10 11:26 ` Naohiro Aota
  2020-11-10 17:25   ` Christoph Hellwig
                     ` (3 more replies)
  2020-11-10 11:26 ` [PATCH v10 03/41] btrfs: introduce ZONED feature flag Naohiro Aota
                   ` (40 subsequent siblings)
  42 siblings, 4 replies; 125+ messages in thread
From: Naohiro Aota @ 2020-11-10 11:26 UTC (permalink / raw)
  To: linux-btrfs, dsterba
  Cc: hare, linux-fsdevel, Jens Axboe, Christoph Hellwig,
	Darrick J. Wong, Naohiro Aota

A ZONE_APPEND bio must follow hardware restrictions (e.g. not exceeding
max_zone_append_sectors) so that it is not split. bio_iov_iter_get_pages
builds such a restricted bio using __bio_iov_append_get_pages if
bio_op(bio) == REQ_OP_ZONE_APPEND.

To utilize it, we need to set the bio_op before calling
bio_iov_iter_get_pages(). This commit introduces IOMAP_F_ZONE_APPEND, so
that an iomap user can set the flag to indicate they want
REQ_OP_ZONE_APPEND and a restricted bio.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/iomap/direct-io.c  | 41 +++++++++++++++++++++++++++++++++++------
 include/linux/iomap.h |  1 +
 2 files changed, 36 insertions(+), 6 deletions(-)

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index c1aafb2ab990..f04572a55a09 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -200,6 +200,34 @@ iomap_dio_zero(struct iomap_dio *dio, struct iomap *iomap, loff_t pos,
 	iomap_dio_submit_bio(dio, iomap, bio, pos);
 }
 
+/*
+ * Figure out the bio's operation flags from the dio request, the
+ * mapping, and whether or not we want FUA.  Note that we can end up
+ * clearing the WRITE_FUA flag in the dio request.
+ */
+static inline unsigned int
+iomap_dio_bio_opflags(struct iomap_dio *dio, struct iomap *iomap, bool use_fua)
+{
+	unsigned int opflags = REQ_SYNC | REQ_IDLE;
+
+	if (!(dio->flags & IOMAP_DIO_WRITE)) {
+		WARN_ON_ONCE(iomap->flags & IOMAP_F_ZONE_APPEND);
+		return REQ_OP_READ;
+	}
+
+	if (iomap->flags & IOMAP_F_ZONE_APPEND)
+		opflags |= REQ_OP_ZONE_APPEND;
+	else
+		opflags |= REQ_OP_WRITE;
+
+	if (use_fua)
+		opflags |= REQ_FUA;
+	else
+		dio->flags &= ~IOMAP_DIO_WRITE_FUA;
+
+	return opflags;
+}
+
 static loff_t
 iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
 		struct iomap_dio *dio, struct iomap *iomap)
@@ -278,6 +306,13 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
 		bio->bi_private = dio;
 		bio->bi_end_io = iomap_dio_bio_end_io;
 
+		/*
+		 * Set the operation flags early so that bio_iov_iter_get_pages
+		 * can set up the page vector appropriately for a ZONE_APPEND
+		 * operation.
+		 */
+		bio->bi_opf = iomap_dio_bio_opflags(dio, iomap, use_fua);
+
 		ret = bio_iov_iter_get_pages(bio, dio->submit.iter);
 		if (unlikely(ret)) {
 			/*
@@ -292,14 +327,8 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
 
 		n = bio->bi_iter.bi_size;
 		if (dio->flags & IOMAP_DIO_WRITE) {
-			bio->bi_opf = REQ_OP_WRITE | REQ_SYNC | REQ_IDLE;
-			if (use_fua)
-				bio->bi_opf |= REQ_FUA;
-			else
-				dio->flags &= ~IOMAP_DIO_WRITE_FUA;
 			task_io_account_write(n);
 		} else {
-			bio->bi_opf = REQ_OP_READ;
 			if (dio->flags & IOMAP_DIO_DIRTY)
 				bio_set_pages_dirty(bio);
 		}
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 4d1d3c3469e9..1bccd1880d0d 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -54,6 +54,7 @@ struct vm_fault;
 #define IOMAP_F_SHARED		0x04
 #define IOMAP_F_MERGED		0x08
 #define IOMAP_F_BUFFER_HEAD	0x10
+#define IOMAP_F_ZONE_APPEND	0x20
 
 /*
  * Flags set by the core iomap code during operations:
-- 
2.27.0



* [PATCH v10 03/41] btrfs: introduce ZONED feature flag
  2020-11-10 11:26 [PATCH v10 00/41] btrfs: zoned block device support Naohiro Aota
  2020-11-10 11:26 ` [PATCH v10 01/41] block: add bio_add_zone_append_page Naohiro Aota
  2020-11-10 11:26 ` [PATCH v10 02/41] iomap: support REQ_OP_ZONE_APPEND Naohiro Aota
@ 2020-11-10 11:26 ` Naohiro Aota
  2020-11-19 21:31   ` David Sterba
  2020-11-10 11:26 ` [PATCH v10 04/41] btrfs: get zone information of zoned block devices Naohiro Aota
                   ` (39 subsequent siblings)
  42 siblings, 1 reply; 125+ messages in thread
From: Naohiro Aota @ 2020-11-10 11:26 UTC (permalink / raw)
  To: linux-btrfs, dsterba
  Cc: hare, linux-fsdevel, Jens Axboe, Christoph Hellwig,
	Darrick J. Wong, Naohiro Aota, Damien Le Moal, Anand Jain,
	Johannes Thumshirn

This patch introduces the ZONED incompat flag. The flag indicates that the
volume management will satisfy the constraints imposed by host-managed
zoned block devices.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 fs/btrfs/sysfs.c           | 2 ++
 include/uapi/linux/btrfs.h | 1 +
 2 files changed, 3 insertions(+)

diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index 279d9262b676..828006020bbd 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -263,6 +263,7 @@ BTRFS_FEAT_ATTR_INCOMPAT(no_holes, NO_HOLES);
 BTRFS_FEAT_ATTR_INCOMPAT(metadata_uuid, METADATA_UUID);
 BTRFS_FEAT_ATTR_COMPAT_RO(free_space_tree, FREE_SPACE_TREE);
 BTRFS_FEAT_ATTR_INCOMPAT(raid1c34, RAID1C34);
+BTRFS_FEAT_ATTR_INCOMPAT(zoned, ZONED);
 
 static struct attribute *btrfs_supported_feature_attrs[] = {
 	BTRFS_FEAT_ATTR_PTR(mixed_backref),
@@ -278,6 +279,7 @@ static struct attribute *btrfs_supported_feature_attrs[] = {
 	BTRFS_FEAT_ATTR_PTR(metadata_uuid),
 	BTRFS_FEAT_ATTR_PTR(free_space_tree),
 	BTRFS_FEAT_ATTR_PTR(raid1c34),
+	BTRFS_FEAT_ATTR_PTR(zoned),
 	NULL
 };
 
diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index 2c39d15a2beb..5df73001aad4 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -307,6 +307,7 @@ struct btrfs_ioctl_fs_info_args {
 #define BTRFS_FEATURE_INCOMPAT_NO_HOLES		(1ULL << 9)
 #define BTRFS_FEATURE_INCOMPAT_METADATA_UUID	(1ULL << 10)
 #define BTRFS_FEATURE_INCOMPAT_RAID1C34		(1ULL << 11)
+#define BTRFS_FEATURE_INCOMPAT_ZONED		(1ULL << 12)
 
 struct btrfs_ioctl_feature_flags {
 	__u64 compat_flags;
-- 
2.27.0



* [PATCH v10 04/41] btrfs: get zone information of zoned block devices
  2020-11-10 11:26 [PATCH v10 00/41] btrfs: zoned block device support Naohiro Aota
                   ` (2 preceding siblings ...)
  2020-11-10 11:26 ` [PATCH v10 03/41] btrfs: introduce ZONED feature flag Naohiro Aota
@ 2020-11-10 11:26 ` Naohiro Aota
  2020-11-12  6:57   ` Anand Jain
                     ` (2 more replies)
  2020-11-10 11:26 ` [PATCH v10 05/41] btrfs: check and enable ZONED mode Naohiro Aota
                   ` (38 subsequent siblings)
  42 siblings, 3 replies; 125+ messages in thread
From: Naohiro Aota @ 2020-11-10 11:26 UTC (permalink / raw)
  To: linux-btrfs, dsterba
  Cc: hare, linux-fsdevel, Jens Axboe, Christoph Hellwig,
	Darrick J. Wong, Naohiro Aota, Damien Le Moal, Josef Bacik

If a zoned block device is found, get its zone information (number of
zones and zone size) using the new helper function
btrfs_get_dev_zone_info(). To avoid costly run-time zone report commands
to test the device zone types during block allocation, attach the
seq_zones bitmap to the device structure to indicate whether a zone is
sequential or accepts random writes. It also attaches the empty_zones
bitmap to indicate whether a zone is empty.

This patch also introduces the helper function btrfs_dev_is_sequential()
to test if the zone storing a block is a sequential write required zone,
and btrfs_dev_is_empty_zone() to test if the zone is an empty zone.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/Makefile      |   1 +
 fs/btrfs/dev-replace.c |   5 ++
 fs/btrfs/super.c       |   5 ++
 fs/btrfs/volumes.c     |  19 ++++-
 fs/btrfs/volumes.h     |   4 +
 fs/btrfs/zoned.c       | 182 +++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/zoned.h       |  91 +++++++++++++++++++++
 7 files changed, 305 insertions(+), 2 deletions(-)
 create mode 100644 fs/btrfs/zoned.c
 create mode 100644 fs/btrfs/zoned.h

diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
index e738f6206ea5..0497fdc37f90 100644
--- a/fs/btrfs/Makefile
+++ b/fs/btrfs/Makefile
@@ -16,6 +16,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \
 btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
 btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o
 btrfs-$(CONFIG_BTRFS_FS_REF_VERIFY) += ref-verify.o
+btrfs-$(CONFIG_BLK_DEV_ZONED) += zoned.o
 
 btrfs-$(CONFIG_BTRFS_FS_RUN_SANITY_TESTS) += tests/free-space-tests.o \
 	tests/extent-buffer-tests.o tests/btrfs-tests.o \
diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index 20ce1970015f..6f6d77224c2b 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -21,6 +21,7 @@
 #include "rcu-string.h"
 #include "dev-replace.h"
 #include "sysfs.h"
+#include "zoned.h"
 
 /*
  * Device replace overview
@@ -291,6 +292,10 @@ static int btrfs_init_dev_replace_tgtdev(struct btrfs_fs_info *fs_info,
 	set_blocksize(device->bdev, BTRFS_BDEV_BLOCKSIZE);
 	device->fs_devices = fs_info->fs_devices;
 
+	ret = btrfs_get_dev_zone_info(device);
+	if (ret)
+		goto error;
+
 	mutex_lock(&fs_info->fs_devices->device_list_mutex);
 	list_add(&device->dev_list, &fs_info->fs_devices->devices);
 	fs_info->fs_devices->num_devices++;
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 8840a4fa81eb..ed55014fd1bd 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -2462,6 +2462,11 @@ static void __init btrfs_print_mod_info(void)
 #endif
 #ifdef CONFIG_BTRFS_FS_REF_VERIFY
 			", ref-verify=on"
+#endif
+#ifdef CONFIG_BLK_DEV_ZONED
+			", zoned=yes"
+#else
+			", zoned=no"
 #endif
 			;
 	pr_info("Btrfs loaded, crc32c=%s%s\n", crc32c_impl(), options);
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 58b9c419a2b6..e787bf89f761 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -31,6 +31,7 @@
 #include "space-info.h"
 #include "block-group.h"
 #include "discard.h"
+#include "zoned.h"
 
 const struct btrfs_raid_attr btrfs_raid_array[BTRFS_NR_RAID_TYPES] = {
 	[BTRFS_RAID_RAID10] = {
@@ -374,6 +375,7 @@ void btrfs_free_device(struct btrfs_device *device)
 	rcu_string_free(device->name);
 	extent_io_tree_release(&device->alloc_state);
 	bio_put(device->flush_bio);
+	btrfs_destroy_dev_zone_info(device);
 	kfree(device);
 }
 
@@ -667,6 +669,11 @@ static int btrfs_open_one_device(struct btrfs_fs_devices *fs_devices,
 	clear_bit(BTRFS_DEV_STATE_IN_FS_METADATA, &device->dev_state);
 	device->mode = flags;
 
+	/* Get zone type information of zoned block devices */
+	ret = btrfs_get_dev_zone_info(device);
+	if (ret != 0)
+		goto error_free_page;
+
 	fs_devices->open_devices++;
 	if (test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state) &&
 	    device->devid != BTRFS_DEV_REPLACE_DEVID) {
@@ -1143,6 +1150,7 @@ static void btrfs_close_one_device(struct btrfs_device *device)
 		device->bdev = NULL;
 	}
 	clear_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state);
+	btrfs_destroy_dev_zone_info(device);
 
 	device->fs_info = NULL;
 	atomic_set(&device->dev_stats_ccnt, 0);
@@ -2543,6 +2551,14 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
 	}
 	rcu_assign_pointer(device->name, name);
 
+	device->fs_info = fs_info;
+	device->bdev = bdev;
+
+	/* Get zone type information of zoned block devices */
+	ret = btrfs_get_dev_zone_info(device);
+	if (ret)
+		goto error_free_device;
+
 	trans = btrfs_start_transaction(root, 0);
 	if (IS_ERR(trans)) {
 		ret = PTR_ERR(trans);
@@ -2559,8 +2575,6 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
 					 fs_info->sectorsize);
 	device->disk_total_bytes = device->total_bytes;
 	device->commit_total_bytes = device->total_bytes;
-	device->fs_info = fs_info;
-	device->bdev = bdev;
 	set_bit(BTRFS_DEV_STATE_IN_FS_METADATA, &device->dev_state);
 	clear_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state);
 	device->mode = FMODE_EXCL;
@@ -2707,6 +2721,7 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
 		sb->s_flags |= SB_RDONLY;
 	if (trans)
 		btrfs_end_transaction(trans);
+	btrfs_destroy_dev_zone_info(device);
 error_free_device:
 	btrfs_free_device(device);
 error:
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index bf27ac07d315..9c07b97a2260 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -51,6 +51,8 @@ struct btrfs_io_geometry {
 #define BTRFS_DEV_STATE_REPLACE_TGT	(3)
 #define BTRFS_DEV_STATE_FLUSH_SENT	(4)
 
+struct btrfs_zoned_device_info;
+
 struct btrfs_device {
 	struct list_head dev_list; /* device_list_mutex */
 	struct list_head dev_alloc_list; /* chunk mutex */
@@ -64,6 +66,8 @@ struct btrfs_device {
 
 	struct block_device *bdev;
 
+	struct btrfs_zoned_device_info *zone_info;
+
 	/* the mode sent to blkdev_get */
 	fmode_t mode;
 
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
new file mode 100644
index 000000000000..b7ffe6670d3a
--- /dev/null
+++ b/fs/btrfs/zoned.c
@@ -0,0 +1,182 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/slab.h>
+#include <linux/blkdev.h>
+#include "ctree.h"
+#include "volumes.h"
+#include "zoned.h"
+#include "rcu-string.h"
+
+/* Maximum number of zones to report per blkdev_report_zones() call */
+#define BTRFS_REPORT_NR_ZONES   4096
+
+static int copy_zone_info_cb(struct blk_zone *zone, unsigned int idx,
+			     void *data)
+{
+	struct blk_zone *zones = data;
+
+	memcpy(&zones[idx], zone, sizeof(*zone));
+
+	return 0;
+}
+
+static int btrfs_get_dev_zones(struct btrfs_device *device, u64 pos,
+			       struct blk_zone *zones, unsigned int *nr_zones)
+{
+	int ret;
+
+	if (!*nr_zones)
+		return 0;
+
+	ret = blkdev_report_zones(device->bdev, pos >> SECTOR_SHIFT, *nr_zones,
+				  copy_zone_info_cb, zones);
+	if (ret < 0) {
+		btrfs_err_in_rcu(device->fs_info,
+				 "zoned: failed to read zone %llu on %s (devid %llu)",
+				 pos, rcu_str_deref(device->name),
+				 device->devid);
+		return ret;
+	}
+	*nr_zones = ret;
+	if (!ret)
+		return -EIO;
+
+	return 0;
+}
+
+int btrfs_get_dev_zone_info(struct btrfs_device *device)
+{
+	struct btrfs_zoned_device_info *zone_info = NULL;
+	struct block_device *bdev = device->bdev;
+	sector_t nr_sectors = bdev->bd_part->nr_sects;
+	sector_t sector = 0;
+	struct blk_zone *zones = NULL;
+	unsigned int i, nreported = 0, nr_zones;
+	unsigned int zone_sectors;
+	int ret;
+
+	if (!bdev_is_zoned(bdev))
+		return 0;
+
+	if (device->zone_info)
+		return 0;
+
+	zone_info = kzalloc(sizeof(*zone_info), GFP_KERNEL);
+	if (!zone_info)
+		return -ENOMEM;
+
+	zone_sectors = bdev_zone_sectors(bdev);
+	ASSERT(is_power_of_2(zone_sectors));
+	zone_info->zone_size = (u64)zone_sectors << SECTOR_SHIFT;
+	zone_info->zone_size_shift = ilog2(zone_info->zone_size);
+	zone_info->nr_zones = nr_sectors >> ilog2(bdev_zone_sectors(bdev));
+	if (!IS_ALIGNED(nr_sectors, zone_sectors))
+		zone_info->nr_zones++;
+
+	zone_info->seq_zones = bitmap_zalloc(zone_info->nr_zones, GFP_KERNEL);
+	if (!zone_info->seq_zones) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	zone_info->empty_zones = bitmap_zalloc(zone_info->nr_zones, GFP_KERNEL);
+	if (!zone_info->empty_zones) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	zones = kcalloc(BTRFS_REPORT_NR_ZONES,
+			sizeof(struct blk_zone), GFP_KERNEL);
+	if (!zones) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	/* Get zones type */
+	while (sector < nr_sectors) {
+		nr_zones = BTRFS_REPORT_NR_ZONES;
+		ret = btrfs_get_dev_zones(device, sector << SECTOR_SHIFT, zones,
+					  &nr_zones);
+		if (ret)
+			goto out;
+
+		for (i = 0; i < nr_zones; i++) {
+			if (zones[i].type == BLK_ZONE_TYPE_SEQWRITE_REQ)
+				set_bit(nreported, zone_info->seq_zones);
+			if (zones[i].cond == BLK_ZONE_COND_EMPTY)
+				set_bit(nreported, zone_info->empty_zones);
+			nreported++;
+		}
+		sector = zones[nr_zones - 1].start + zones[nr_zones - 1].len;
+	}
+
+	if (nreported != zone_info->nr_zones) {
+		btrfs_err_in_rcu(device->fs_info,
+				 "inconsistent number of zones on %s (%u / %u)",
+				 rcu_str_deref(device->name), nreported,
+				 zone_info->nr_zones);
+		ret = -EIO;
+		goto out;
+	}
+
+	kfree(zones);
+
+	device->zone_info = zone_info;
+
+	/*
+	 * This function is called from open_fs_devices(), which is before
+	 * we set the device->fs_info. So, we use pr_info instead of
+	 * btrfs_info to avoid printing confusing message like "BTRFS info
+	 * (device <unknown>) ..."
+	 */
+
+	rcu_read_lock();
+	if (device->fs_info)
+		btrfs_info(device->fs_info,
+			"host-%s zoned block device %s, %u zones of %llu bytes",
+			bdev_zoned_model(bdev) == BLK_ZONED_HM ? "managed" : "aware",
+			rcu_str_deref(device->name), zone_info->nr_zones,
+			zone_info->zone_size);
+	else
+		pr_info("BTRFS info: host-%s zoned block device %s, %u zones of %llu bytes",
+			bdev_zoned_model(bdev) == BLK_ZONED_HM ? "managed" : "aware",
+			rcu_str_deref(device->name), zone_info->nr_zones,
+			zone_info->zone_size);
+	rcu_read_unlock();
+
+	return 0;
+
+out:
+	kfree(zones);
+	bitmap_free(zone_info->empty_zones);
+	bitmap_free(zone_info->seq_zones);
+	kfree(zone_info);
+
+	return ret;
+}
+
+void btrfs_destroy_dev_zone_info(struct btrfs_device *device)
+{
+	struct btrfs_zoned_device_info *zone_info = device->zone_info;
+
+	if (!zone_info)
+		return;
+
+	bitmap_free(zone_info->seq_zones);
+	bitmap_free(zone_info->empty_zones);
+	kfree(zone_info);
+	device->zone_info = NULL;
+}
+
+int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
+		       struct blk_zone *zone)
+{
+	unsigned int nr_zones = 1;
+	int ret;
+
+	ret = btrfs_get_dev_zones(device, pos, zone, &nr_zones);
+	if (ret != 0 || !nr_zones)
+		return ret ? ret : -EIO;
+
+	return 0;
+}
diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h
new file mode 100644
index 000000000000..c9e69ff87ab9
--- /dev/null
+++ b/fs/btrfs/zoned.h
@@ -0,0 +1,91 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef BTRFS_ZONED_H
+#define BTRFS_ZONED_H
+
+#include <linux/types.h>
+
+struct btrfs_zoned_device_info {
+	/*
+	 * Number of zones, zone size and types of zones if bdev is a
+	 * zoned block device.
+	 */
+	u64 zone_size;
+	u8  zone_size_shift;
+	u32 nr_zones;
+	unsigned long *seq_zones;
+	unsigned long *empty_zones;
+};
+
+#ifdef CONFIG_BLK_DEV_ZONED
+int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
+		       struct blk_zone *zone);
+int btrfs_get_dev_zone_info(struct btrfs_device *device);
+void btrfs_destroy_dev_zone_info(struct btrfs_device *device);
+#else /* CONFIG_BLK_DEV_ZONED */
+static inline int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
+				     struct blk_zone *zone)
+{
+	return 0;
+}
+
+static inline int btrfs_get_dev_zone_info(struct btrfs_device *device)
+{
+	return 0;
+}
+
+static inline void btrfs_destroy_dev_zone_info(struct btrfs_device *device) { }
+
+#endif
+
+static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)
+{
+	struct btrfs_zoned_device_info *zone_info = device->zone_info;
+
+	if (!zone_info)
+		return false;
+
+	return test_bit(pos >> zone_info->zone_size_shift,
+			zone_info->seq_zones);
+}
+
+static inline bool btrfs_dev_is_empty_zone(struct btrfs_device *device, u64 pos)
+{
+	struct btrfs_zoned_device_info *zone_info = device->zone_info;
+
+	if (!zone_info)
+		return true;
+
+	return test_bit(pos >> zone_info->zone_size_shift,
+			zone_info->empty_zones);
+}
+
+static inline void btrfs_dev_set_empty_zone_bit(struct btrfs_device *device,
+						u64 pos, bool set)
+{
+	struct btrfs_zoned_device_info *zone_info = device->zone_info;
+	unsigned int zno;
+
+	if (!zone_info)
+		return;
+
+	zno = pos >> zone_info->zone_size_shift;
+	if (set)
+		set_bit(zno, zone_info->empty_zones);
+	else
+		clear_bit(zno, zone_info->empty_zones);
+}
+
+static inline void btrfs_dev_set_zone_empty(struct btrfs_device *device,
+					    u64 pos)
+{
+	btrfs_dev_set_empty_zone_bit(device, pos, true);
+}
+
+static inline void btrfs_dev_clear_zone_empty(struct btrfs_device *device,
+					      u64 pos)
+{
+	btrfs_dev_set_empty_zone_bit(device, pos, false);
+}
+
+#endif
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH v10 05/41] btrfs: check and enable ZONED mode
  2020-11-10 11:26 [PATCH v10 00/41] btrfs: zoned block device support Naohiro Aota
                   ` (3 preceding siblings ...)
  2020-11-10 11:26 ` [PATCH v10 04/41] btrfs: get zone information of zoned block devices Naohiro Aota
@ 2020-11-10 11:26 ` Naohiro Aota
  2020-11-18 11:29   ` Anand Jain
  2020-11-10 11:26 ` [PATCH v10 06/41] btrfs: introduce max_zone_append_size Naohiro Aota
                   ` (37 subsequent siblings)
  42 siblings, 1 reply; 125+ messages in thread
From: Naohiro Aota @ 2020-11-10 11:26 UTC (permalink / raw)
  To: linux-btrfs, dsterba
  Cc: hare, linux-fsdevel, Jens Axboe, Christoph Hellwig,
	Darrick J. Wong, Naohiro Aota, Johannes Thumshirn,
	Damien Le Moal, Josef Bacik

This commit introduces the function btrfs_check_zoned_mode() to check
whether the ZONED flag is enabled on the file system and whether the file
system consists of zoned devices with equal zone sizes.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/ctree.h       | 11 ++++++
 fs/btrfs/dev-replace.c |  7 ++++
 fs/btrfs/disk-io.c     | 11 ++++++
 fs/btrfs/super.c       |  1 +
 fs/btrfs/volumes.c     |  5 +++
 fs/btrfs/zoned.c       | 81 ++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/zoned.h       | 26 ++++++++++++++
 7 files changed, 142 insertions(+)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index aac3d6f4e35b..453f41ca024e 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -948,6 +948,12 @@ struct btrfs_fs_info {
 	/* Type of exclusive operation running */
 	unsigned long exclusive_operation;
 
+	/* Zone size when in ZONED mode */
+	union {
+		u64 zone_size;
+		u64 zoned;
+	};
+
 #ifdef CONFIG_BTRFS_FS_REF_VERIFY
 	spinlock_t ref_verify_lock;
 	struct rb_root block_tree;
@@ -3595,4 +3601,9 @@ static inline int btrfs_is_testing(struct btrfs_fs_info *fs_info)
 }
 #endif
 
+static inline bool btrfs_is_zoned(struct btrfs_fs_info *fs_info)
+{
+	return fs_info->zoned != 0;
+}
+
 #endif
diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index 6f6d77224c2b..db87f1aa604b 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -238,6 +238,13 @@ static int btrfs_init_dev_replace_tgtdev(struct btrfs_fs_info *fs_info,
 		return PTR_ERR(bdev);
 	}
 
+	if (!btrfs_check_device_zone_type(fs_info, bdev)) {
+		btrfs_err(fs_info,
+			  "dev-replace: zoned type of target device mismatch with filesystem");
+		ret = -EINVAL;
+		goto error;
+	}
+
 	sync_blockdev(bdev);
 
 	list_for_each_entry(device, &fs_info->fs_devices->devices, dev_list) {
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 764001609a15..e76ac4da208d 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -42,6 +42,7 @@
 #include "block-group.h"
 #include "discard.h"
 #include "space-info.h"
+#include "zoned.h"
 
 #define BTRFS_SUPER_FLAG_SUPP	(BTRFS_HEADER_FLAG_WRITTEN |\
 				 BTRFS_HEADER_FLAG_RELOC |\
@@ -2976,6 +2977,8 @@ int __cold open_ctree(struct super_block *sb, struct btrfs_fs_devices *fs_device
 	if (features & BTRFS_FEATURE_INCOMPAT_SKINNY_METADATA)
 		btrfs_info(fs_info, "has skinny extents");
 
+	fs_info->zoned = features & BTRFS_FEATURE_INCOMPAT_ZONED;
+
 	/*
 	 * flag our filesystem as having big metadata blocks if
 	 * they are bigger than the page size
@@ -3130,7 +3133,15 @@ int __cold open_ctree(struct super_block *sb, struct btrfs_fs_devices *fs_device
 
 	btrfs_free_extra_devids(fs_devices, 1);
 
+	ret = btrfs_check_zoned_mode(fs_info);
+	if (ret) {
+		btrfs_err(fs_info, "failed to initialize zoned mode: %d",
+			  ret);
+		goto fail_block_groups;
+	}
+
 	ret = btrfs_sysfs_add_fsid(fs_devices);
+
 	if (ret) {
 		btrfs_err(fs_info, "failed to init sysfs fsid interface: %d",
 				ret);
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index ed55014fd1bd..3312fe08168f 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -44,6 +44,7 @@
 #include "backref.h"
 #include "space-info.h"
 #include "sysfs.h"
+#include "zoned.h"
 #include "tests/btrfs-tests.h"
 #include "block-group.h"
 #include "discard.h"
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index e787bf89f761..10827892c086 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -2518,6 +2518,11 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
 	if (IS_ERR(bdev))
 		return PTR_ERR(bdev);
 
+	if (!btrfs_check_device_zone_type(fs_info, bdev)) {
+		ret = -EINVAL;
+		goto error;
+	}
+
 	if (fs_devices->seeding) {
 		seeding_dev = 1;
 		down_write(&sb->s_umount);
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index b7ffe6670d3a..1223d5b0e411 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -180,3 +180,84 @@ int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
 
 	return 0;
 }
+
+int btrfs_check_zoned_mode(struct btrfs_fs_info *fs_info)
+{
+	struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
+	struct btrfs_device *device;
+	u64 zoned_devices = 0;
+	u64 nr_devices = 0;
+	u64 zone_size = 0;
+	const bool incompat_zoned = btrfs_is_zoned(fs_info);
+	int ret = 0;
+
+	/* Count zoned devices */
+	list_for_each_entry(device, &fs_devices->devices, dev_list) {
+		enum blk_zoned_model model;
+
+		if (!device->bdev)
+			continue;
+
+		model = bdev_zoned_model(device->bdev);
+		if (model == BLK_ZONED_HM ||
+		    (model == BLK_ZONED_HA && incompat_zoned)) {
+			zoned_devices++;
+			if (!zone_size) {
+				zone_size = device->zone_info->zone_size;
+			} else if (device->zone_info->zone_size != zone_size) {
+				btrfs_err(fs_info,
+					  "zoned: unequal block device zone sizes: have %llu found %llu",
+					  device->zone_info->zone_size,
+					  zone_size);
+				ret = -EINVAL;
+				goto out;
+			}
+		}
+		nr_devices++;
+	}
+
+	if (!zoned_devices && !incompat_zoned)
+		goto out;
+
+	if (!zoned_devices && incompat_zoned) {
+		/* No zoned block device found on ZONED FS */
+		btrfs_err(fs_info,
+			  "zoned: no zoned devices found on a zoned filesystem");
+		ret = -EINVAL;
+		goto out;
+	}
+
+	if (zoned_devices && !incompat_zoned) {
+		btrfs_err(fs_info,
+			  "zoned: mode not enabled but zoned device found");
+		ret = -EINVAL;
+		goto out;
+	}
+
+	if (zoned_devices != nr_devices) {
+		btrfs_err(fs_info,
+			  "zoned: cannot mix zoned and regular devices");
+		ret = -EINVAL;
+		goto out;
+	}
+
+	/*
+	 * stripe_size is always aligned to BTRFS_STRIPE_LEN in
+	 * __btrfs_alloc_chunk(). Since we want stripe_len == zone_size,
+	 * check the alignment here.
+	 */
+	if (!IS_ALIGNED(zone_size, BTRFS_STRIPE_LEN)) {
+		btrfs_err(fs_info,
+			  "zoned: zone size not aligned to stripe %u",
+			  BTRFS_STRIPE_LEN);
+		ret = -EINVAL;
+		goto out;
+	}
+
+	fs_info->zone_size = zone_size;
+
+	btrfs_info(fs_info, "zoned mode enabled with zone size %llu",
+		   fs_info->zone_size);
+out:
+	return ret;
+}
diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h
index c9e69ff87ab9..bcb1cb99a4f3 100644
--- a/fs/btrfs/zoned.h
+++ b/fs/btrfs/zoned.h
@@ -4,6 +4,7 @@
 #define BTRFS_ZONED_H
 
 #include <linux/types.h>
+#include <linux/blkdev.h>
 
 struct btrfs_zoned_device_info {
 	/*
@@ -22,6 +23,7 @@ int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
 		       struct blk_zone *zone);
 int btrfs_get_dev_zone_info(struct btrfs_device *device);
 void btrfs_destroy_dev_zone_info(struct btrfs_device *device);
+int btrfs_check_zoned_mode(struct btrfs_fs_info *fs_info);
 #else /* CONFIG_BLK_DEV_ZONED */
 static inline int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
 				     struct blk_zone *zone)
@@ -36,6 +38,15 @@ static inline int btrfs_get_dev_zone_info(struct btrfs_device *device)
 
 static inline void btrfs_destroy_dev_zone_info(struct btrfs_device *device) { }
 
+static inline int btrfs_check_zoned_mode(struct btrfs_fs_info *fs_info)
+{
+	if (!btrfs_is_zoned(fs_info))
+		return 0;
+
+	btrfs_err(fs_info, "Zoned block device support is not enabled");
+	return -EOPNOTSUPP;
+}
+
 #endif
 
 static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)
@@ -88,4 +99,19 @@ static inline void btrfs_dev_clear_zone_empty(struct btrfs_device *device,
 	btrfs_dev_set_empty_zone_bit(device, pos, false);
 }
 
+static inline bool btrfs_check_device_zone_type(struct btrfs_fs_info *fs_info,
+						struct block_device *bdev)
+{
+	u64 zone_size;
+
+	if (btrfs_is_zoned(fs_info)) {
+		zone_size = (u64)bdev_zone_sectors(bdev) << SECTOR_SHIFT;
+		/* Do not allow non-zoned device */
+		return bdev_is_zoned(bdev) && fs_info->zone_size == zone_size;
+	}
+
+	/* Do not allow host-managed zoned device */
+	return bdev_zoned_model(bdev) != BLK_ZONED_HM;
+}
+
 #endif
-- 
2.27.0



* [PATCH v10 06/41] btrfs: introduce max_zone_append_size
  2020-11-10 11:26 [PATCH v10 00/41] btrfs: zoned block device support Naohiro Aota
                   ` (4 preceding siblings ...)
  2020-11-10 11:26 ` [PATCH v10 05/41] btrfs: check and enable ZONED mode Naohiro Aota
@ 2020-11-10 11:26 ` Naohiro Aota
  2020-11-19  9:23   ` Anand Jain
  2020-11-10 11:26 ` [PATCH v10 07/41] btrfs: disallow space_cache in ZONED mode Naohiro Aota
                   ` (36 subsequent siblings)
  42 siblings, 1 reply; 125+ messages in thread
From: Naohiro Aota @ 2020-11-10 11:26 UTC (permalink / raw)
  To: linux-btrfs, dsterba
  Cc: hare, linux-fsdevel, Jens Axboe, Christoph Hellwig,
	Darrick J. Wong, Naohiro Aota, Josef Bacik

The zone append write command restricts the maximum size of the I/O it
accepts. This is because a zone append write command cannot be split, as
we ask the device to place the data into a specific target zone and the
device responds with the actual written location of the data.

Introduce max_zone_append_size to zone_info and fs_info to track this
value, so we can limit all I/O to a zoned block device that we want to
write using the zone append command to the device's limit.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/ctree.h |  3 +++
 fs/btrfs/zoned.c | 17 +++++++++++++++--
 fs/btrfs/zoned.h |  1 +
 3 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 453f41ca024e..c70d3fcc62c2 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -954,6 +954,9 @@ struct btrfs_fs_info {
 		u64 zoned;
 	};
 
+	/* Max size to emit ZONE_APPEND write command */
+	u64 max_zone_append_size;
+
 #ifdef CONFIG_BTRFS_FS_REF_VERIFY
 	spinlock_t ref_verify_lock;
 	struct rb_root block_tree;
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index 1223d5b0e411..2897432eb43c 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -48,6 +48,7 @@ int btrfs_get_dev_zone_info(struct btrfs_device *device)
 {
 	struct btrfs_zoned_device_info *zone_info = NULL;
 	struct block_device *bdev = device->bdev;
+	struct request_queue *queue = bdev_get_queue(bdev);
 	sector_t nr_sectors = bdev->bd_part->nr_sects;
 	sector_t sector = 0;
 	struct blk_zone *zones = NULL;
@@ -69,6 +70,8 @@ int btrfs_get_dev_zone_info(struct btrfs_device *device)
 	ASSERT(is_power_of_2(zone_sectors));
 	zone_info->zone_size = (u64)zone_sectors << SECTOR_SHIFT;
 	zone_info->zone_size_shift = ilog2(zone_info->zone_size);
+	zone_info->max_zone_append_size =
+		(u64)queue_max_zone_append_sectors(queue) << SECTOR_SHIFT;
 	zone_info->nr_zones = nr_sectors >> ilog2(bdev_zone_sectors(bdev));
 	if (!IS_ALIGNED(nr_sectors, zone_sectors))
 		zone_info->nr_zones++;
@@ -188,6 +191,7 @@ int btrfs_check_zoned_mode(struct btrfs_fs_info *fs_info)
 	u64 zoned_devices = 0;
 	u64 nr_devices = 0;
 	u64 zone_size = 0;
+	u64 max_zone_append_size = 0;
 	const bool incompat_zoned = btrfs_is_zoned(fs_info);
 	int ret = 0;
 
@@ -201,10 +205,13 @@ int btrfs_check_zoned_mode(struct btrfs_fs_info *fs_info)
 		model = bdev_zoned_model(device->bdev);
 		if (model == BLK_ZONED_HM ||
 		    (model == BLK_ZONED_HA && incompat_zoned)) {
+			struct btrfs_zoned_device_info *zone_info =
+				device->zone_info;
+
 			zoned_devices++;
 			if (!zone_size) {
-				zone_size = device->zone_info->zone_size;
-			} else if (device->zone_info->zone_size != zone_size) {
+				zone_size = zone_info->zone_size;
+			} else if (zone_info->zone_size != zone_size) {
 				btrfs_err(fs_info,
 					  "zoned: unequal block device zone sizes: have %llu found %llu",
 					  device->zone_info->zone_size,
@@ -212,6 +219,11 @@ int btrfs_check_zoned_mode(struct btrfs_fs_info *fs_info)
 				ret = -EINVAL;
 				goto out;
 			}
+			if (!max_zone_append_size ||
+			    (zone_info->max_zone_append_size &&
+			     zone_info->max_zone_append_size < max_zone_append_size))
+				max_zone_append_size =
+					zone_info->max_zone_append_size;
 		}
 		nr_devices++;
 	}
@@ -255,6 +267,7 @@ int btrfs_check_zoned_mode(struct btrfs_fs_info *fs_info)
 	}
 
 	fs_info->zone_size = zone_size;
+	fs_info->max_zone_append_size = max_zone_append_size;
 
 	btrfs_info(fs_info, "zoned mode enabled with zone size %llu",
 		   fs_info->zone_size);
diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h
index bcb1cb99a4f3..52aa6af5d8dc 100644
--- a/fs/btrfs/zoned.h
+++ b/fs/btrfs/zoned.h
@@ -13,6 +13,7 @@ struct btrfs_zoned_device_info {
 	 */
 	u64 zone_size;
 	u8  zone_size_shift;
+	u64 max_zone_append_size;
 	u32 nr_zones;
 	unsigned long *seq_zones;
 	unsigned long *empty_zones;
-- 
2.27.0



* [PATCH v10 07/41] btrfs: disallow space_cache in ZONED mode
  2020-11-10 11:26 [PATCH v10 00/41] btrfs: zoned block device support Naohiro Aota
                   ` (5 preceding siblings ...)
  2020-11-10 11:26 ` [PATCH v10 06/41] btrfs: introduce max_zone_append_size Naohiro Aota
@ 2020-11-10 11:26 ` Naohiro Aota
  2020-11-19 10:42   ` Anand Jain
  2020-11-10 11:26 ` [PATCH v10 08/41] btrfs: disallow NODATACOW " Naohiro Aota
                   ` (35 subsequent siblings)
  42 siblings, 1 reply; 125+ messages in thread
From: Naohiro Aota @ 2020-11-10 11:26 UTC (permalink / raw)
  To: linux-btrfs, dsterba
  Cc: hare, linux-fsdevel, Jens Axboe, Christoph Hellwig,
	Darrick J. Wong, Naohiro Aota, Josef Bacik

As updates to the space cache v1 are done in-place, the space cache cannot
be located over sequential zones, and there is no guarantee that the device
will have enough conventional zones to store this cache. Resolve this
problem by completely disabling space cache v1.  This does not introduce
any problem for sequential block groups: all the free space is located
after the allocation pointer and there is no free space before the
pointer, so there is no need for such a cache.

Note: we can technically use the free-space-tree (space cache v2) in ZONED
mode. But, since ZONED mode now always allocates extents in a block group
sequentially regardless of the underlying device zone type, there is no
point in enabling and maintaining the tree.

For the same reason, NODATACOW is also disabled.

In summary, ZONED will disable:

| Disabled features | Reason                                              |
|-------------------+-----------------------------------------------------|
| RAID/Dup          | Cannot handle two zone append writes to different   |
|                   | zones                                               |
|-------------------+-----------------------------------------------------|
| space_cache (v1)  | In-place updating                                   |
| NODATACOW         | In-place updating                                   |
|-------------------+-----------------------------------------------------|
| fallocate         | Reserved extent will be a write hole                |
|-------------------+-----------------------------------------------------|
| MIXED_BG          | Allocated metadata region will be write holes for   |
|                   | data writes                                         |

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/super.c | 13 +++++++++++--
 fs/btrfs/zoned.c | 18 ++++++++++++++++++
 fs/btrfs/zoned.h |  6 ++++++
 3 files changed, 35 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 3312fe08168f..1adbbeebc649 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -525,8 +525,15 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
 	cache_gen = btrfs_super_cache_generation(info->super_copy);
 	if (btrfs_fs_compat_ro(info, FREE_SPACE_TREE))
 		btrfs_set_opt(info->mount_opt, FREE_SPACE_TREE);
-	else if (cache_gen)
-		btrfs_set_opt(info->mount_opt, SPACE_CACHE);
+	else if (cache_gen) {
+		if (btrfs_is_zoned(info)) {
+			btrfs_info(info,
+			"zoned: clearing existing space cache");
+			btrfs_set_super_cache_generation(info->super_copy, 0);
+		} else {
+			btrfs_set_opt(info->mount_opt, SPACE_CACHE);
+		}
+	}
 
 	/*
 	 * Even the options are empty, we still need to do extra check
@@ -985,6 +992,8 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
 		ret = -EINVAL;
 
 	}
+	if (!ret)
+		ret = btrfs_check_mountopts_zoned(info);
 	if (!ret && btrfs_test_opt(info, SPACE_CACHE))
 		btrfs_info(info, "disk space caching is enabled");
 	if (!ret && btrfs_test_opt(info, FREE_SPACE_TREE))
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index 2897432eb43c..d6b8165e2c91 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -274,3 +274,21 @@ int btrfs_check_zoned_mode(struct btrfs_fs_info *fs_info)
 out:
 	return ret;
 }
+
+int btrfs_check_mountopts_zoned(struct btrfs_fs_info *info)
+{
+	if (!btrfs_is_zoned(info))
+		return 0;
+
+	/*
+	 * Space cache writing is not COWed. Disable that to avoid write
+	 * errors in sequential zones.
+	 */
+	if (btrfs_test_opt(info, SPACE_CACHE)) {
+		btrfs_err(info,
+			  "zoned: space cache v1 is not supported");
+		return -EINVAL;
+	}
+
+	return 0;
+}
diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h
index 52aa6af5d8dc..81c00a3ed202 100644
--- a/fs/btrfs/zoned.h
+++ b/fs/btrfs/zoned.h
@@ -25,6 +25,7 @@ int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
 int btrfs_get_dev_zone_info(struct btrfs_device *device);
 void btrfs_destroy_dev_zone_info(struct btrfs_device *device);
 int btrfs_check_zoned_mode(struct btrfs_fs_info *fs_info);
+int btrfs_check_mountopts_zoned(struct btrfs_fs_info *info);
 #else /* CONFIG_BLK_DEV_ZONED */
 static inline int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
 				     struct blk_zone *zone)
@@ -48,6 +49,11 @@ static inline int btrfs_check_zoned_mode(struct btrfs_fs_info *fs_info)
 	return -EOPNOTSUPP;
 }
 
+static inline int btrfs_check_mountopts_zoned(struct btrfs_fs_info *info)
+{
+	return 0;
+}
+
 #endif
 
 static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)
-- 
2.27.0



* [PATCH v10 08/41] btrfs: disallow NODATACOW in ZONED mode
  2020-11-10 11:26 [PATCH v10 00/41] btrfs: zoned block device support Naohiro Aota
                   ` (6 preceding siblings ...)
  2020-11-10 11:26 ` [PATCH v10 07/41] btrfs: disallow space_cache in ZONED mode Naohiro Aota
@ 2020-11-10 11:26 ` Naohiro Aota
  2020-11-20  4:17   ` Anand Jain
  2020-11-10 11:26 ` [PATCH v10 09/41] btrfs: disable fallocate " Naohiro Aota
                   ` (34 subsequent siblings)
  42 siblings, 1 reply; 125+ messages in thread
From: Naohiro Aota @ 2020-11-10 11:26 UTC (permalink / raw)
  To: linux-btrfs, dsterba
  Cc: hare, linux-fsdevel, Jens Axboe, Christoph Hellwig,
	Darrick J. Wong, Naohiro Aota, Josef Bacik, Johannes Thumshirn

NODATACOW implies overwriting the file data on a device, which is
impossible in sequential-write-required zones. Disable NODATACOW both
globally via the mount option and per-file by masking out the FS_NOCOW_FL
attribute flag.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/ioctl.c | 13 +++++++++++++
 fs/btrfs/zoned.c |  5 +++++
 2 files changed, 18 insertions(+)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index ab408a23ba32..d13b522e7bb2 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -193,6 +193,15 @@ static int check_fsflags(unsigned int old_flags, unsigned int flags)
 	return 0;
 }
 
+static int check_fsflags_compatible(struct btrfs_fs_info *fs_info,
+				    unsigned int flags)
+{
+	if (btrfs_is_zoned(fs_info) && (flags & FS_NOCOW_FL))
+		return -EPERM;
+
+	return 0;
+}
+
 static int btrfs_ioctl_setflags(struct file *file, void __user *arg)
 {
 	struct inode *inode = file_inode(file);
@@ -230,6 +239,10 @@ static int btrfs_ioctl_setflags(struct file *file, void __user *arg)
 	if (ret)
 		goto out_unlock;
 
+	ret = check_fsflags_compatible(fs_info, fsflags);
+	if (ret)
+		goto out_unlock;
+
 	binode_flags = binode->flags;
 	if (fsflags & FS_SYNC_FL)
 		binode_flags |= BTRFS_INODE_SYNC;
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index d6b8165e2c91..bd153932606e 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -290,5 +290,10 @@ int btrfs_check_mountopts_zoned(struct btrfs_fs_info *info)
 		return -EINVAL;
 	}
 
+	if (btrfs_test_opt(info, NODATACOW)) {
+		btrfs_err(info, "zoned: NODATACOW not supported");
+		return -EINVAL;
+	}
+
 	return 0;
 }
-- 
2.27.0


* [PATCH v10 09/41] btrfs: disable fallocate in ZONED mode
  2020-11-10 11:26 [PATCH v10 00/41] btrfs: zoned block device support Naohiro Aota
                   ` (7 preceding siblings ...)
  2020-11-10 11:26 ` [PATCH v10 08/41] btrfs: disallow NODATACOW " Naohiro Aota
@ 2020-11-10 11:26 ` Naohiro Aota
  2020-11-20  4:28   ` Anand Jain
  2020-11-10 11:26 ` [PATCH v10 10/41] btrfs: disallow mixed-bg " Naohiro Aota
                   ` (33 subsequent siblings)
  42 siblings, 1 reply; 125+ messages in thread
From: Naohiro Aota @ 2020-11-10 11:26 UTC (permalink / raw)
  To: linux-btrfs, dsterba
  Cc: hare, linux-fsdevel, Jens Axboe, Christoph Hellwig,
	Darrick J. Wong, Naohiro Aota, Johannes Thumshirn, Josef Bacik

fallocate() is implemented by reserving an actual extent instead of a byte
reservation. This can expose the sequential write constraint of
host-managed zoned block devices to the application, which would break the
POSIX semantics of the fallocated file.  To avoid this, report fallocate()
as not supported in ZONED mode for now.

In the future, we may be able to implement "in-memory" fallocate() in ZONED
mode by utilizing space_info->bytes_may_use or so.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/file.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 0ff659455b1e..68938a43081e 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -3341,6 +3341,10 @@ static long btrfs_fallocate(struct file *file, int mode,
 	alloc_end = round_up(offset + len, blocksize);
 	cur_offset = alloc_start;
 
+	/* Do not allow fallocate in ZONED mode */
+	if (btrfs_is_zoned(btrfs_sb(inode->i_sb)))
+		return -EOPNOTSUPP;
+
 	/* Make sure we aren't being give some crap mode */
 	if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE |
 		     FALLOC_FL_ZERO_RANGE))
-- 
2.27.0


* [PATCH v10 10/41] btrfs: disallow mixed-bg in ZONED mode
  2020-11-10 11:26 [PATCH v10 00/41] btrfs: zoned block device support Naohiro Aota
                   ` (8 preceding siblings ...)
  2020-11-10 11:26 ` [PATCH v10 09/41] btrfs: disable fallocate " Naohiro Aota
@ 2020-11-10 11:26 ` Naohiro Aota
  2020-11-20  4:32   ` Anand Jain
  2020-11-10 11:26 ` [PATCH v10 11/41] btrfs: implement log-structured superblock for " Naohiro Aota
                   ` (32 subsequent siblings)
  42 siblings, 1 reply; 125+ messages in thread
From: Naohiro Aota @ 2020-11-10 11:26 UTC (permalink / raw)
  To: linux-btrfs, dsterba
  Cc: hare, linux-fsdevel, Jens Axboe, Christoph Hellwig,
	Darrick J. Wong, Naohiro Aota, Josef Bacik

Placing both data and metadata in a block group is impossible in ZONED
mode. For data, we can allocate space for it and write it immediately
after the allocation. For metadata, however, we cannot do so, because the
logical addresses are recorded in other metadata buffers to build up the
trees. As a result, a data buffer can be placed after a metadata buffer
which is not yet written. Writing out the data buffer first would break
the sequential write rule.

This commit checks for and disallows the MIXED_GROUPS feature in ZONED mode.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/zoned.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index bd153932606e..f87d35cb9235 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -266,6 +266,13 @@ int btrfs_check_zoned_mode(struct btrfs_fs_info *fs_info)
 		goto out;
 	}
 
+	if (btrfs_fs_incompat(fs_info, MIXED_GROUPS)) {
+		btrfs_err(fs_info,
+			  "zoned: mixed block groups not supported");
+		ret = -EINVAL;
+		goto out;
+	}
+
 	fs_info->zone_size = zone_size;
 	fs_info->max_zone_append_size = max_zone_append_size;
 
-- 
2.27.0


* [PATCH v10 11/41] btrfs: implement log-structured superblock for ZONED mode
  2020-11-10 11:26 [PATCH v10 00/41] btrfs: zoned block device support Naohiro Aota
                   ` (9 preceding siblings ...)
  2020-11-10 11:26 ` [PATCH v10 10/41] btrfs: disallow mixed-bg " Naohiro Aota
@ 2020-11-10 11:26 ` Naohiro Aota
  2020-11-11  1:34   ` kernel test robot
                     ` (3 more replies)
  2020-11-10 11:26 ` [PATCH v10 12/41] btrfs: implement zoned chunk allocator Naohiro Aota
                   ` (31 subsequent siblings)
  42 siblings, 4 replies; 125+ messages in thread
From: Naohiro Aota @ 2020-11-10 11:26 UTC (permalink / raw)
  To: linux-btrfs, dsterba
  Cc: hare, linux-fsdevel, Jens Axboe, Christoph Hellwig,
	Darrick J. Wong, Naohiro Aota

The superblock (and its copies) is the only data structure in btrfs which
has a fixed location on a device. Since we cannot overwrite it in a
sequential write required zone, we cannot place the superblock in such a
zone. One easy solution is limiting the superblock and its copies to
conventional zones only. However, this method has two downsides. First, it
reduces the number of superblock copies: the location of the second copy
is at 256GB, which is in a sequential write required zone on typical
devices on the market today, so only two superblock locations would remain
usable. Second, we could not support devices which have no conventional
zones at all.

To solve these two problems, we employ superblock log writing. It uses two
zones as a circular buffer to write updated superblocks. Once the first
zone is filled up, writing continues into the second zone. When both zones
are filled up, the first zone is reset before being written to again.

We can determine the position of the latest superblock by reading the
write pointer information from the device. One corner case is when both
zones are full. For this situation, we read out the last superblock of
each zone and compare their generations to determine which zone is older.

The following zones are reserved as the circular buffer on ZONED btrfs.

- The primary superblock: zones 0 and 1
- The first copy: zones 16 and 17
- The second copy: zone 1024 or the zone at 256GB, whichever is smaller,
  and the zone next to it

If these reserved zones are conventional, the superblock is written at a
fixed location at the start of the zone, without logging.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/block-group.c |   9 ++
 fs/btrfs/disk-io.c     |  41 ++++-
 fs/btrfs/scrub.c       |   3 +
 fs/btrfs/volumes.c     |  21 ++-
 fs/btrfs/zoned.c       | 329 +++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/zoned.h       |  44 ++++++
 6 files changed, 435 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index c0f1d6818df7..6b4831824f51 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -1723,6 +1723,7 @@ int btrfs_rmap_block(struct btrfs_fs_info *fs_info, u64 chunk_start,
 static int exclude_super_stripes(struct btrfs_block_group *cache)
 {
 	struct btrfs_fs_info *fs_info = cache->fs_info;
+	const bool zoned = btrfs_is_zoned(fs_info);
 	u64 bytenr;
 	u64 *logical;
 	int stripe_len;
@@ -1744,6 +1745,14 @@ static int exclude_super_stripes(struct btrfs_block_group *cache)
 		if (ret)
 			return ret;
 
+		/* Shouldn't have super stripes in sequential zones */
+		if (zoned && nr) {
+			btrfs_err(fs_info,
+				  "zoned: block group %llu must not contain super block",
+				  cache->start);
+			return -EUCLEAN;
+		}
+
 		while (nr--) {
 			u64 len = min_t(u64, stripe_len,
 				cache->start + cache->length - logical[nr]);
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index e76ac4da208d..509085a368bb 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3423,10 +3423,17 @@ struct btrfs_super_block *btrfs_read_dev_one_super(struct block_device *bdev,
 {
 	struct btrfs_super_block *super;
 	struct page *page;
-	u64 bytenr;
+	u64 bytenr, bytenr_orig;
 	struct address_space *mapping = bdev->bd_inode->i_mapping;
+	int ret;
+
+	bytenr_orig = btrfs_sb_offset(copy_num);
+	ret = btrfs_sb_log_location_bdev(bdev, copy_num, READ, &bytenr);
+	if (ret == -ENOENT)
+		return ERR_PTR(-EINVAL);
+	else if (ret)
+		return ERR_PTR(ret);
 
-	bytenr = btrfs_sb_offset(copy_num);
 	if (bytenr + BTRFS_SUPER_INFO_SIZE >= i_size_read(bdev->bd_inode))
 		return ERR_PTR(-EINVAL);
 
@@ -3440,7 +3447,7 @@ struct btrfs_super_block *btrfs_read_dev_one_super(struct block_device *bdev,
 		return ERR_PTR(-ENODATA);
 	}
 
-	if (btrfs_super_bytenr(super) != bytenr) {
+	if (btrfs_super_bytenr(super) != bytenr_orig) {
 		btrfs_release_disk_super(super);
 		return ERR_PTR(-EINVAL);
 	}
@@ -3495,7 +3502,8 @@ static int write_dev_supers(struct btrfs_device *device,
 	SHASH_DESC_ON_STACK(shash, fs_info->csum_shash);
 	int i;
 	int errors = 0;
-	u64 bytenr;
+	int ret;
+	u64 bytenr, bytenr_orig;
 
 	if (max_mirrors == 0)
 		max_mirrors = BTRFS_SUPER_MIRROR_MAX;
@@ -3507,12 +3515,21 @@ static int write_dev_supers(struct btrfs_device *device,
 		struct bio *bio;
 		struct btrfs_super_block *disk_super;
 
-		bytenr = btrfs_sb_offset(i);
+		bytenr_orig = btrfs_sb_offset(i);
+		ret = btrfs_sb_log_location(device, i, WRITE, &bytenr);
+		if (ret == -ENOENT) {
+			continue;
+		} else if (ret < 0) {
+			btrfs_err(device->fs_info, "couldn't get super block location for mirror %d",
+				  i);
+			errors++;
+			continue;
+		}
 		if (bytenr + BTRFS_SUPER_INFO_SIZE >=
 		    device->commit_total_bytes)
 			break;
 
-		btrfs_set_super_bytenr(sb, bytenr);
+		btrfs_set_super_bytenr(sb, bytenr_orig);
 
 		crypto_shash_digest(shash, (const char *)sb + BTRFS_CSUM_SIZE,
 				    BTRFS_SUPER_INFO_SIZE - BTRFS_CSUM_SIZE,
@@ -3557,6 +3574,7 @@ static int write_dev_supers(struct btrfs_device *device,
 			bio->bi_opf |= REQ_FUA;
 
 		btrfsic_submit_bio(bio);
+		btrfs_advance_sb_log(device, i);
 	}
 	return errors < i ? 0 : -1;
 }
@@ -3573,6 +3591,7 @@ static int wait_dev_supers(struct btrfs_device *device, int max_mirrors)
 	int i;
 	int errors = 0;
 	bool primary_failed = false;
+	int ret;
 	u64 bytenr;
 
 	if (max_mirrors == 0)
@@ -3581,7 +3600,15 @@ static int wait_dev_supers(struct btrfs_device *device, int max_mirrors)
 	for (i = 0; i < max_mirrors; i++) {
 		struct page *page;
 
-		bytenr = btrfs_sb_offset(i);
+		ret = btrfs_sb_log_location(device, i, READ, &bytenr);
+		if (ret == -ENOENT) {
+			break;
+		} else if (ret < 0) {
+			errors++;
+			if (i == 0)
+				primary_failed = true;
+			continue;
+		}
 		if (bytenr + BTRFS_SUPER_INFO_SIZE >=
 		    device->commit_total_bytes)
 			break;
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index cf63f1e27a27..aa1b36cf5c88 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -20,6 +20,7 @@
 #include "rcu-string.h"
 #include "raid56.h"
 #include "block-group.h"
+#include "zoned.h"
 
 /*
  * This is only the first step towards a full-features scrub. It reads all
@@ -3704,6 +3705,8 @@ static noinline_for_stack int scrub_supers(struct scrub_ctx *sctx,
 		if (bytenr + BTRFS_SUPER_INFO_SIZE >
 		    scrub_dev->commit_total_bytes)
 			break;
+		if (!btrfs_check_super_location(scrub_dev, bytenr))
+			continue;
 
 		ret = scrub_pages(sctx, bytenr, BTRFS_SUPER_INFO_SIZE, bytenr,
 				  scrub_dev, BTRFS_EXTENT_FLAG_SUPER, gen, i,
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 10827892c086..db884b96a5ea 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1282,7 +1282,8 @@ void btrfs_release_disk_super(struct btrfs_super_block *super)
 }
 
 static struct btrfs_super_block *btrfs_read_disk_super(struct block_device *bdev,
-						       u64 bytenr)
+						       u64 bytenr,
+						       u64 bytenr_orig)
 {
 	struct btrfs_super_block *disk_super;
 	struct page *page;
@@ -1313,7 +1314,7 @@ static struct btrfs_super_block *btrfs_read_disk_super(struct block_device *bdev
 	/* align our pointer to the offset of the super block */
 	disk_super = p + offset_in_page(bytenr);
 
-	if (btrfs_super_bytenr(disk_super) != bytenr ||
+	if (btrfs_super_bytenr(disk_super) != bytenr_orig ||
 	    btrfs_super_magic(disk_super) != BTRFS_MAGIC) {
 		btrfs_release_disk_super(p);
 		return ERR_PTR(-EINVAL);
@@ -1348,7 +1349,8 @@ struct btrfs_device *btrfs_scan_one_device(const char *path, fmode_t flags,
 	bool new_device_added = false;
 	struct btrfs_device *device = NULL;
 	struct block_device *bdev;
-	u64 bytenr;
+	u64 bytenr, bytenr_orig;
+	int ret;
 
 	lockdep_assert_held(&uuid_mutex);
 
@@ -1358,14 +1360,18 @@ struct btrfs_device *btrfs_scan_one_device(const char *path, fmode_t flags,
 	 * So, we need to add a special mount option to scan for
 	 * later supers, using BTRFS_SUPER_MIRROR_MAX instead
 	 */
-	bytenr = btrfs_sb_offset(0);
 	flags |= FMODE_EXCL;
 
 	bdev = blkdev_get_by_path(path, flags, holder);
 	if (IS_ERR(bdev))
 		return ERR_CAST(bdev);
 
-	disk_super = btrfs_read_disk_super(bdev, bytenr);
+	bytenr_orig = btrfs_sb_offset(0);
+	ret = btrfs_sb_log_location_bdev(bdev, 0, READ, &bytenr);
+	if (ret)
+		return ERR_PTR(ret);
+
+	disk_super = btrfs_read_disk_super(bdev, bytenr, bytenr_orig);
 	if (IS_ERR(disk_super)) {
 		device = ERR_CAST(disk_super);
 		goto error_bdev_put;
@@ -2029,6 +2035,11 @@ void btrfs_scratch_superblocks(struct btrfs_fs_info *fs_info,
 		if (IS_ERR(disk_super))
 			continue;
 
+		if (bdev_is_zoned(bdev)) {
+			btrfs_reset_sb_log_zones(bdev, copy_num);
+			continue;
+		}
+
 		memset(&disk_super->magic, 0, sizeof(disk_super->magic));
 
 		page = virt_to_page(disk_super);
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index f87d35cb9235..84ade8c19ddc 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -10,6 +10,9 @@
 /* Maximum number of zones to report per blkdev_report_zones() call */
 #define BTRFS_REPORT_NR_ZONES   4096
 
+/* Number of superblock log zones */
+#define BTRFS_NR_SB_LOG_ZONES 2
+
 static int copy_zone_info_cb(struct blk_zone *zone, unsigned int idx,
 			     void *data)
 {
@@ -20,6 +23,106 @@ static int copy_zone_info_cb(struct blk_zone *zone, unsigned int idx,
 	return 0;
 }
 
+static int sb_write_pointer(struct block_device *bdev, struct blk_zone *zones,
+			    u64 *wp_ret)
+{
+	bool empty[BTRFS_NR_SB_LOG_ZONES];
+	bool full[BTRFS_NR_SB_LOG_ZONES];
+	sector_t sector;
+
+	ASSERT(zones[0].type != BLK_ZONE_TYPE_CONVENTIONAL &&
+	       zones[1].type != BLK_ZONE_TYPE_CONVENTIONAL);
+
+	empty[0] = (zones[0].cond == BLK_ZONE_COND_EMPTY);
+	empty[1] = (zones[1].cond == BLK_ZONE_COND_EMPTY);
+	full[0] = (zones[0].cond == BLK_ZONE_COND_FULL);
+	full[1] = (zones[1].cond == BLK_ZONE_COND_FULL);
+
+	/*
+	 * Possible state of log buffer zones
+	 *
+	 *   E I F
+	 * E * x 0
+	 * I 0 x 0
+	 * F 1 1 C
+	 *
+	 * Row: zones[0]
+	 * Col: zones[1]
+	 * State:
+	 *   E: Empty, I: In-Use, F: Full
+	 * Log position:
+	 *   *: Special case, no superblock is written
+	 *   0: Use write pointer of zones[0]
+	 *   1: Use write pointer of zones[1]
+	 *   C: Compare SBs from zones[0] and zones[1], use the newer one
+	 *   x: Invalid state
+	 */
+
+	if (empty[0] && empty[1]) {
+		/* Special case to distinguish no superblock to read */
+		*wp_ret = zones[0].start << SECTOR_SHIFT;
+		return -ENOENT;
+	} else if (full[0] && full[1]) {
+		/* Compare two super blocks */
+		struct address_space *mapping = bdev->bd_inode->i_mapping;
+		struct page *page[BTRFS_NR_SB_LOG_ZONES];
+		struct btrfs_super_block *super[BTRFS_NR_SB_LOG_ZONES];
+		int i;
+
+		for (i = 0; i < BTRFS_NR_SB_LOG_ZONES; i++) {
+			u64 bytenr = ((zones[i].start + zones[i].len) << SECTOR_SHIFT) -
+				BTRFS_SUPER_INFO_SIZE;
+
+			page[i] = read_cache_page_gfp(mapping, bytenr >> PAGE_SHIFT, GFP_NOFS);
+			if (IS_ERR(page[i])) {
+				if (i == 1)
+					btrfs_release_disk_super(super[0]);
+				return PTR_ERR(page[i]);
+			}
+			super[i] = page_address(page[i]);
+		}
+
+		if (super[0]->generation > super[1]->generation)
+			sector = zones[1].start;
+		else
+			sector = zones[0].start;
+
+		for (i = 0; i < BTRFS_NR_SB_LOG_ZONES; i++)
+			btrfs_release_disk_super(super[i]);
+	} else if (!full[0] && (empty[1] || full[1])) {
+		sector = zones[0].wp;
+	} else if (full[0]) {
+		sector = zones[1].wp;
+	} else {
+		return -EUCLEAN;
+	}
+	*wp_ret = sector << SECTOR_SHIFT;
+	return 0;
+}
+
+/*
+ * The following zones are reserved as the circular buffer on ZONED btrfs.
+ *  - The primary superblock: zones 0 and 1
+ *  - The first copy: zones 16 and 17
+ *  - The second copy: zone 1024 or the zone at 256GB, whichever is
+ *    smaller, and the zone next to it
+ */
+static inline u32 sb_zone_number(u8 shift, int mirror)
+{
+	ASSERT(mirror < BTRFS_SUPER_MIRROR_MAX);
+
+	switch (mirror) {
+	case 0:
+		return 0;
+	case 1:
+		return 16;
+	case 2:
+		return min(btrfs_sb_offset(mirror) >> shift, 1024ULL);
+	}
+
+	return 0;
+}
+
 static int btrfs_get_dev_zones(struct btrfs_device *device, u64 pos,
 			       struct blk_zone *zones, unsigned int *nr_zones)
 {
@@ -122,6 +225,52 @@ int btrfs_get_dev_zone_info(struct btrfs_device *device)
 		goto out;
 	}
 
+	/* Validate superblock log */
+	nr_zones = BTRFS_NR_SB_LOG_ZONES;
+	for (i = 0; i < BTRFS_SUPER_MIRROR_MAX; i++) {
+		u32 sb_zone = sb_zone_number(zone_info->zone_size_shift, i);
+		u64 sb_wp;
+		int sb_pos = BTRFS_NR_SB_LOG_ZONES * i;
+
+		if (sb_zone + 1 >= zone_info->nr_zones)
+			continue;
+
+		sector = sb_zone << (zone_info->zone_size_shift - SECTOR_SHIFT);
+		ret = btrfs_get_dev_zones(device, sector << SECTOR_SHIFT,
+					  &zone_info->sb_zones[sb_pos],
+					  &nr_zones);
+		if (ret)
+			goto out;
+
+		if (nr_zones != BTRFS_NR_SB_LOG_ZONES) {
+			btrfs_err_in_rcu(device->fs_info,
+			 "zoned: failed to read super block log zone info at devid %llu zone %u",
+					 device->devid, sb_zone);
+			ret = -EUCLEAN;
+			goto out;
+		}
+
+		/*
+		 * If zones[0] is conventional, always use the beginning of
+		 * the zone to record the superblock. No need to validate in
+		 * that case.
+		 */
+		if (zone_info->sb_zones[BTRFS_NR_SB_LOG_ZONES * i].type ==
+		    BLK_ZONE_TYPE_CONVENTIONAL)
+			continue;
+
+		ret = sb_write_pointer(device->bdev,
+				       &zone_info->sb_zones[sb_pos], &sb_wp);
+		if (ret != -ENOENT && ret) {
+			btrfs_err_in_rcu(device->fs_info,
+				"zoned: super block log zone corrupted devid %llu zone %u",
+					 device->devid, sb_zone);
+			ret = -EUCLEAN;
+			goto out;
+		}
+	}
+
+
 	kfree(zones);
 
 	device->zone_info = zone_info;
@@ -304,3 +453,183 @@ int btrfs_check_mountopts_zoned(struct btrfs_fs_info *info)
 
 	return 0;
 }
+
+static int sb_log_location(struct block_device *bdev, struct blk_zone *zones,
+			   int rw, u64 *bytenr_ret)
+{
+	u64 wp;
+	int ret;
+
+	if (zones[0].type == BLK_ZONE_TYPE_CONVENTIONAL) {
+		*bytenr_ret = zones[0].start << SECTOR_SHIFT;
+		return 0;
+	}
+
+	ret = sb_write_pointer(bdev, zones, &wp);
+	if (ret != -ENOENT && ret < 0)
+		return ret;
+
+	if (rw == WRITE) {
+		struct blk_zone *reset = NULL;
+
+		if (wp == zones[0].start << SECTOR_SHIFT)
+			reset = &zones[0];
+		else if (wp == zones[1].start << SECTOR_SHIFT)
+			reset = &zones[1];
+
+		if (reset && reset->cond != BLK_ZONE_COND_EMPTY) {
+			ASSERT(reset->cond == BLK_ZONE_COND_FULL);
+
+			ret = blkdev_zone_mgmt(bdev, REQ_OP_ZONE_RESET,
+					       reset->start, reset->len,
+					       GFP_NOFS);
+			if (ret)
+				return ret;
+
+			reset->cond = BLK_ZONE_COND_EMPTY;
+			reset->wp = reset->start;
+		}
+	} else if (ret != -ENOENT) {
+		/* For READ, we want the previous one */
+		if (wp == zones[0].start << SECTOR_SHIFT)
+			wp = (zones[1].start + zones[1].len) << SECTOR_SHIFT;
+		wp -= BTRFS_SUPER_INFO_SIZE;
+	}
+
+	*bytenr_ret = wp;
+	return 0;
+
+}
+
+int btrfs_sb_log_location_bdev(struct block_device *bdev, int mirror, int rw,
+			       u64 *bytenr_ret)
+{
+	struct blk_zone zones[BTRFS_NR_SB_LOG_ZONES];
+	unsigned int zone_sectors;
+	u32 sb_zone;
+	int ret;
+	u64 zone_size;
+	u8 zone_sectors_shift;
+	sector_t nr_sectors = bdev->bd_part->nr_sects;
+	u32 nr_zones;
+
+	if (!bdev_is_zoned(bdev)) {
+		*bytenr_ret = btrfs_sb_offset(mirror);
+		return 0;
+	}
+
+	ASSERT(rw == READ || rw == WRITE);
+
+	zone_sectors = bdev_zone_sectors(bdev);
+	if (!is_power_of_2(zone_sectors))
+		return -EINVAL;
+	zone_size = zone_sectors << SECTOR_SHIFT;
+	zone_sectors_shift = ilog2(zone_sectors);
+	nr_zones = nr_sectors >> zone_sectors_shift;
+
+	sb_zone = sb_zone_number(zone_sectors_shift + SECTOR_SHIFT, mirror);
+	if (sb_zone + 1 >= nr_zones)
+		return -ENOENT;
+
+	ret = blkdev_report_zones(bdev, sb_zone << zone_sectors_shift,
+				  BTRFS_NR_SB_LOG_ZONES, copy_zone_info_cb,
+				  zones);
+	if (ret < 0)
+		return ret;
+	if (ret != BTRFS_NR_SB_LOG_ZONES)
+		return -EIO;
+
+	return sb_log_location(bdev, zones, rw, bytenr_ret);
+}
+
+int btrfs_sb_log_location(struct btrfs_device *device, int mirror, int rw,
+			  u64 *bytenr_ret)
+{
+	struct btrfs_zoned_device_info *zinfo = device->zone_info;
+	u32 zone_num;
+
+	if (!zinfo) {
+		*bytenr_ret = btrfs_sb_offset(mirror);
+		return 0;
+	}
+
+	zone_num = sb_zone_number(zinfo->zone_size_shift, mirror);
+	if (zone_num + 1 >= zinfo->nr_zones)
+		return -ENOENT;
+
+	return sb_log_location(device->bdev,
+			       &zinfo->sb_zones[BTRFS_NR_SB_LOG_ZONES * mirror],
+			       rw, bytenr_ret);
+}
+
+static inline bool is_sb_log_zone(struct btrfs_zoned_device_info *zinfo,
+				  int mirror)
+{
+	u32 zone_num;
+
+	if (!zinfo)
+		return false;
+
+	zone_num = sb_zone_number(zinfo->zone_size_shift, mirror);
+	if (zone_num + 1 >= zinfo->nr_zones)
+		return false;
+
+	if (!test_bit(zone_num, zinfo->seq_zones))
+		return false;
+
+	return true;
+}
+
+void btrfs_advance_sb_log(struct btrfs_device *device, int mirror)
+{
+	struct btrfs_zoned_device_info *zinfo = device->zone_info;
+	struct blk_zone *zone;
+
+	if (!is_sb_log_zone(zinfo, mirror))
+		return;
+
+	zone = &zinfo->sb_zones[BTRFS_NR_SB_LOG_ZONES * mirror];
+	if (zone->cond != BLK_ZONE_COND_FULL) {
+
+		if (zone->cond == BLK_ZONE_COND_EMPTY)
+			zone->cond = BLK_ZONE_COND_IMP_OPEN;
+
+		zone->wp += (BTRFS_SUPER_INFO_SIZE >> SECTOR_SHIFT);
+
+		if (zone->wp == zone->start + zone->len)
+			zone->cond = BLK_ZONE_COND_FULL;
+
+		return;
+	}
+
+	zone++;
+	ASSERT(zone->cond != BLK_ZONE_COND_FULL);
+	if (zone->cond == BLK_ZONE_COND_EMPTY)
+		zone->cond = BLK_ZONE_COND_IMP_OPEN;
+
+	zone->wp += (BTRFS_SUPER_INFO_SIZE >> SECTOR_SHIFT);
+
+	if (zone->wp == zone->start + zone->len)
+		zone->cond = BLK_ZONE_COND_FULL;
+}
+
+int btrfs_reset_sb_log_zones(struct block_device *bdev, int mirror)
+{
+	sector_t zone_sectors;
+	sector_t nr_sectors = bdev->bd_part->nr_sects;
+	u8 zone_sectors_shift;
+	u32 sb_zone;
+	u32 nr_zones;
+
+	zone_sectors = bdev_zone_sectors(bdev);
+	zone_sectors_shift = ilog2(zone_sectors);
+	nr_zones = nr_sectors >> zone_sectors_shift;
+
+	sb_zone = sb_zone_number(zone_sectors_shift + SECTOR_SHIFT, mirror);
+	if (sb_zone + 1 >= nr_zones)
+		return -ENOENT;
+
+	return blkdev_zone_mgmt(bdev, REQ_OP_ZONE_RESET,
+				sb_zone << zone_sectors_shift,
+				zone_sectors * BTRFS_NR_SB_LOG_ZONES, GFP_NOFS);
+}
diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h
index 81c00a3ed202..de9d7dd8c351 100644
--- a/fs/btrfs/zoned.h
+++ b/fs/btrfs/zoned.h
@@ -5,6 +5,8 @@
 
 #include <linux/types.h>
 #include <linux/blkdev.h>
+#include "volumes.h"
+#include "disk-io.h"
 
 struct btrfs_zoned_device_info {
 	/*
@@ -17,6 +19,7 @@ struct btrfs_zoned_device_info {
 	u32 nr_zones;
 	unsigned long *seq_zones;
 	unsigned long *empty_zones;
+	struct blk_zone sb_zones[2 * BTRFS_SUPER_MIRROR_MAX];
 };
 
 #ifdef CONFIG_BLK_DEV_ZONED
@@ -26,6 +29,12 @@ int btrfs_get_dev_zone_info(struct btrfs_device *device);
 void btrfs_destroy_dev_zone_info(struct btrfs_device *device);
 int btrfs_check_zoned_mode(struct btrfs_fs_info *fs_info);
 int btrfs_check_mountopts_zoned(struct btrfs_fs_info *info);
+int btrfs_sb_log_location_bdev(struct block_device *bdev, int mirror, int rw,
+			       u64 *bytenr_ret);
+int btrfs_sb_log_location(struct btrfs_device *device, int mirror, int rw,
+			  u64 *bytenr_ret);
+void btrfs_advance_sb_log(struct btrfs_device *device, int mirror);
+int btrfs_reset_sb_log_zones(struct block_device *bdev, int mirror);
 #else /* CONFIG_BLK_DEV_ZONED */
 static inline int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
 				     struct blk_zone *zone)
@@ -54,6 +63,30 @@ static inline int btrfs_check_mountopts_zoned(struct btrfs_fs_info *info)
 	return 0;
 }
 
+static inline int btrfs_sb_log_location_bdev(struct block_device *bdev,
+					     int mirror, int rw,
+					     u64 *bytenr_ret)
+{
+	*bytenr_ret = btrfs_sb_offset(mirror);
+	return 0;
+}
+
+static inline int btrfs_sb_log_location(struct btrfs_device *device, int mirror,
+					int rw, u64 *bytenr_ret)
+{
+	*bytenr_ret = btrfs_sb_offset(mirror);
+	return 0;
+}
+
+static inline void btrfs_advance_sb_log(struct btrfs_device *device,
+					int mirror) { }
+
+static inline int btrfs_reset_sb_log_zones(struct block_device *bdev,
+					   int mirror)
+{
+	return 0;
+}
+
 #endif
 
 static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)
@@ -121,4 +154,15 @@ static inline bool btrfs_check_device_zone_type(struct btrfs_fs_info *fs_info,
 	return bdev_zoned_model(bdev) != BLK_ZONED_HM;
 }
 
+static inline bool btrfs_check_super_location(struct btrfs_device *device,
+					      u64 pos)
+{
+	/*
+	 * On a non-zoned device, any address is OK. On a zoned device,
+	 * non-SEQUENTIAL WRITE REQUIRED zones are capable.
+	 */
+	return device->zone_info == NULL ||
+	       !btrfs_dev_is_sequential(device, pos);
+}
+
 #endif
-- 
2.27.0


* [PATCH v10 12/41] btrfs: implement zoned chunk allocator
  2020-11-10 11:26 [PATCH v10 00/41] btrfs: zoned block device support Naohiro Aota
                   ` (10 preceding siblings ...)
  2020-11-10 11:26 ` [PATCH v10 11/41] btrfs: implement log-structured superblock for " Naohiro Aota
@ 2020-11-10 11:26 ` Naohiro Aota
  2020-11-24 11:36   ` Anand Jain
  2020-12-09  5:27   ` Anand Jain
  2020-11-10 11:26 ` [PATCH v10 13/41] btrfs: verify device extent is aligned to zone Naohiro Aota
                   ` (30 subsequent siblings)
  42 siblings, 2 replies; 125+ messages in thread
From: Naohiro Aota @ 2020-11-10 11:26 UTC (permalink / raw)
  To: linux-btrfs, dsterba
  Cc: hare, linux-fsdevel, Jens Axboe, Christoph Hellwig,
	Darrick J. Wong, Naohiro Aota

This commit implements a zoned chunk/dev_extent allocator. The zoned
allocator aligns device extents to zone boundaries, so that a zone reset
affects only its own device extent and does not change the state of blocks
in neighboring device extents.

It also checks that a region allocation does not overlap any of the
superblock zones, and ensures the region is empty.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/volumes.c | 136 ++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/volumes.h |   1 +
 fs/btrfs/zoned.c   | 144 +++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/zoned.h   |  34 +++++++++++
 4 files changed, 315 insertions(+)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index db884b96a5ea..7831cf6c6da4 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1416,6 +1416,21 @@ static bool contains_pending_extent(struct btrfs_device *device, u64 *start,
 	return false;
 }
 
+static inline u64 dev_extent_search_start_zoned(struct btrfs_device *device,
+						u64 start)
+{
+	u64 tmp;
+
+	if (device->zone_info->zone_size > SZ_1M)
+		tmp = device->zone_info->zone_size;
+	else
+		tmp = SZ_1M;
+	if (start < tmp)
+		start = tmp;
+
+	return btrfs_align_offset_to_zone(device, start);
+}
+
 static u64 dev_extent_search_start(struct btrfs_device *device, u64 start)
 {
 	switch (device->fs_devices->chunk_alloc_policy) {
@@ -1426,11 +1441,57 @@ static u64 dev_extent_search_start(struct btrfs_device *device, u64 start)
 		 * make sure to start at an offset of at least 1MB.
 		 */
 		return max_t(u64, start, SZ_1M);
+	case BTRFS_CHUNK_ALLOC_ZONED:
+		return dev_extent_search_start_zoned(device, start);
 	default:
 		BUG();
 	}
 }
 
+static bool dev_extent_hole_check_zoned(struct btrfs_device *device,
+					u64 *hole_start, u64 *hole_size,
+					u64 num_bytes)
+{
+	u64 zone_size = device->zone_info->zone_size;
+	u64 pos;
+	int ret;
+	int changed = 0;
+
+	ASSERT(IS_ALIGNED(*hole_start, zone_size));
+
+	while (*hole_size > 0) {
+		pos = btrfs_find_allocatable_zones(device, *hole_start,
+						   *hole_start + *hole_size,
+						   num_bytes);
+		if (pos != *hole_start) {
+			*hole_size = *hole_start + *hole_size - pos;
+			*hole_start = pos;
+			changed = 1;
+			if (*hole_size < num_bytes)
+				break;
+		}
+
+		ret = btrfs_ensure_empty_zones(device, pos, num_bytes);
+
+		/* Range is ensured to be empty */
+		if (!ret)
+			return changed;
+
+		/* Given hole range was invalid (outside of device) */
+		if (ret == -ERANGE) {
+			*hole_start += *hole_size;
+			*hole_size = 0;
+			return 1;
+		}
+
+		*hole_start += zone_size;
+		*hole_size -= zone_size;
+		changed = 1;
+	}
+
+	return changed;
+}
+
 /**
  * dev_extent_hole_check - check if specified hole is suitable for allocation
  * @device:	the device which we have the hole
@@ -1463,6 +1524,10 @@ static bool dev_extent_hole_check(struct btrfs_device *device, u64 *hole_start,
 	case BTRFS_CHUNK_ALLOC_REGULAR:
 		/* No extra check */
 		break;
+	case BTRFS_CHUNK_ALLOC_ZONED:
+		changed |= dev_extent_hole_check_zoned(device, hole_start,
+						       hole_size, num_bytes);
+		break;
 	default:
 		BUG();
 	}
@@ -1517,6 +1582,9 @@ static int find_free_dev_extent_start(struct btrfs_device *device,
 
 	search_start = dev_extent_search_start(device, search_start);
 
+	WARN_ON(device->zone_info &&
+		!IS_ALIGNED(num_bytes, device->zone_info->zone_size));
+
 	path = btrfs_alloc_path();
 	if (!path)
 		return -ENOMEM;
@@ -4907,6 +4975,37 @@ static void init_alloc_chunk_ctl_policy_regular(
 	ctl->dev_extent_min = BTRFS_STRIPE_LEN * ctl->dev_stripes;
 }
 
+static void init_alloc_chunk_ctl_policy_zoned(
+				      struct btrfs_fs_devices *fs_devices,
+				      struct alloc_chunk_ctl *ctl)
+{
+	u64 zone_size = fs_devices->fs_info->zone_size;
+	u64 limit;
+	int min_num_stripes = ctl->devs_min * ctl->dev_stripes;
+	int min_data_stripes = (min_num_stripes - ctl->nparity) / ctl->ncopies;
+	u64 min_chunk_size = min_data_stripes * zone_size;
+	u64 type = ctl->type;
+
+	ctl->max_stripe_size = zone_size;
+	if (type & BTRFS_BLOCK_GROUP_DATA) {
+		ctl->max_chunk_size = round_down(BTRFS_MAX_DATA_CHUNK_SIZE,
+						 zone_size);
+	} else if (type & BTRFS_BLOCK_GROUP_METADATA) {
+		ctl->max_chunk_size = ctl->max_stripe_size;
+	} else if (type & BTRFS_BLOCK_GROUP_SYSTEM) {
+		ctl->max_chunk_size = 2 * ctl->max_stripe_size;
+		ctl->devs_max = min_t(int, ctl->devs_max,
+				      BTRFS_MAX_DEVS_SYS_CHUNK);
+	}
+
+	/* We don't want a chunk larger than 10% of writable space */
+	limit = max(round_down(div_factor(fs_devices->total_rw_bytes, 1),
+			       zone_size),
+		    min_chunk_size);
+	ctl->max_chunk_size = min(limit, ctl->max_chunk_size);
+	ctl->dev_extent_min = zone_size * ctl->dev_stripes;
+}
+
 static void init_alloc_chunk_ctl(struct btrfs_fs_devices *fs_devices,
 				 struct alloc_chunk_ctl *ctl)
 {
@@ -4927,6 +5026,9 @@ static void init_alloc_chunk_ctl(struct btrfs_fs_devices *fs_devices,
 	case BTRFS_CHUNK_ALLOC_REGULAR:
 		init_alloc_chunk_ctl_policy_regular(fs_devices, ctl);
 		break;
+	case BTRFS_CHUNK_ALLOC_ZONED:
+		init_alloc_chunk_ctl_policy_zoned(fs_devices, ctl);
+		break;
 	default:
 		BUG();
 	}
@@ -5053,6 +5155,38 @@ static int decide_stripe_size_regular(struct alloc_chunk_ctl *ctl,
 	return 0;
 }
 
+static int decide_stripe_size_zoned(struct alloc_chunk_ctl *ctl,
+				    struct btrfs_device_info *devices_info)
+{
+	u64 zone_size = devices_info[0].dev->zone_info->zone_size;
+	/* Number of stripes that count for block group size */
+	int data_stripes;
+
+	/*
+	 * It should hold because:
+	 *    dev_extent_min == dev_extent_want == zone_size * dev_stripes
+	 */
+	ASSERT(devices_info[ctl->ndevs - 1].max_avail == ctl->dev_extent_min);
+
+	ctl->stripe_size = zone_size;
+	ctl->num_stripes = ctl->ndevs * ctl->dev_stripes;
+	data_stripes = (ctl->num_stripes - ctl->nparity) / ctl->ncopies;
+
+	/* stripe_size is fixed in ZONED. Reduce ndevs instead. */
+	if (ctl->stripe_size * data_stripes > ctl->max_chunk_size) {
+		ctl->ndevs = div_u64(div_u64(ctl->max_chunk_size * ctl->ncopies,
+					     ctl->stripe_size) + ctl->nparity,
+				     ctl->dev_stripes);
+		ctl->num_stripes = ctl->ndevs * ctl->dev_stripes;
+		data_stripes = (ctl->num_stripes - ctl->nparity) / ctl->ncopies;
+		ASSERT(ctl->stripe_size * data_stripes <= ctl->max_chunk_size);
+	}
+
+	ctl->chunk_size = ctl->stripe_size * data_stripes;
+
+	return 0;
+}
+
 static int decide_stripe_size(struct btrfs_fs_devices *fs_devices,
 			      struct alloc_chunk_ctl *ctl,
 			      struct btrfs_device_info *devices_info)
@@ -5080,6 +5214,8 @@ static int decide_stripe_size(struct btrfs_fs_devices *fs_devices,
 	switch (fs_devices->chunk_alloc_policy) {
 	case BTRFS_CHUNK_ALLOC_REGULAR:
 		return decide_stripe_size_regular(ctl, devices_info);
+	case BTRFS_CHUNK_ALLOC_ZONED:
+		return decide_stripe_size_zoned(ctl, devices_info);
 	default:
 		BUG();
 	}
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 9c07b97a2260..0249aca668fb 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -213,6 +213,7 @@ BTRFS_DEVICE_GETSET_FUNCS(bytes_used);
 
 enum btrfs_chunk_allocation_policy {
 	BTRFS_CHUNK_ALLOC_REGULAR,
+	BTRFS_CHUNK_ALLOC_ZONED,
 };
 
 struct btrfs_fs_devices {
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index 84ade8c19ddc..ed5de1c138d7 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -1,11 +1,13 @@
 // SPDX-License-Identifier: GPL-2.0
 
+#include <linux/bitops.h>
 #include <linux/slab.h>
 #include <linux/blkdev.h>
 #include "ctree.h"
 #include "volumes.h"
 #include "zoned.h"
 #include "rcu-string.h"
+#include "disk-io.h"
 
 /* Maximum number of zones to report per blkdev_report_zones() call */
 #define BTRFS_REPORT_NR_ZONES   4096
@@ -424,6 +426,7 @@ int btrfs_check_zoned_mode(struct btrfs_fs_info *fs_info)
 
 	fs_info->zone_size = zone_size;
 	fs_info->max_zone_append_size = max_zone_append_size;
+	fs_info->fs_devices->chunk_alloc_policy = BTRFS_CHUNK_ALLOC_ZONED;
 
 	btrfs_info(fs_info, "zoned mode enabled with zone size %llu",
 		   fs_info->zone_size);
@@ -633,3 +636,144 @@ int btrfs_reset_sb_log_zones(struct block_device *bdev, int mirror)
 				sb_zone << zone_sectors_shift,
 				zone_sectors * BTRFS_NR_SB_LOG_ZONES, GFP_NOFS);
 }
+
+/*
+ * btrfs_find_allocatable_zones - find allocatable zones within a given region
+ * @device:	the device to allocate a region on
+ * @hole_start: the position of the hole to allocate the region
+ * @hole_end:	the end position of the hole
+ * @num_bytes:	the size of the wanted region
+ *
+ * Allocatable region should not contain any superblock locations.
+ */
+u64 btrfs_find_allocatable_zones(struct btrfs_device *device, u64 hole_start,
+				 u64 hole_end, u64 num_bytes)
+{
+	struct btrfs_zoned_device_info *zinfo = device->zone_info;
+	u8 shift = zinfo->zone_size_shift;
+	u64 nzones = num_bytes >> shift;
+	u64 pos = hole_start;
+	u64 begin, end;
+	bool have_sb;
+	int i;
+
+	ASSERT(IS_ALIGNED(hole_start, zinfo->zone_size));
+	ASSERT(IS_ALIGNED(num_bytes, zinfo->zone_size));
+
+	while (pos < hole_end) {
+		begin = pos >> shift;
+		end = begin + nzones;
+
+		if (end > zinfo->nr_zones)
+			return hole_end;
+
+		/* Check if zones in the region are all empty */
+		if (btrfs_dev_is_sequential(device, pos) &&
+		    find_next_zero_bit(zinfo->empty_zones, end, begin) != end) {
+			pos += zinfo->zone_size;
+			continue;
+		}
+
+		have_sb = false;
+		for (i = 0; i < BTRFS_SUPER_MIRROR_MAX; i++) {
+			u32 sb_zone;
+			u64 sb_pos;
+
+			sb_zone = sb_zone_number(shift, i);
+			if (!(end <= sb_zone ||
+			      sb_zone + BTRFS_NR_SB_LOG_ZONES <= begin)) {
+				have_sb = true;
+				pos = ((u64)sb_zone + BTRFS_NR_SB_LOG_ZONES) << shift;
+				break;
+			}
+
+			/*
+			 * We also need to exclude regular superblock
+			 * positions
+			 */
+			sb_pos = btrfs_sb_offset(i);
+			if (!(pos + num_bytes <= sb_pos ||
+			      sb_pos + BTRFS_SUPER_INFO_SIZE <= pos)) {
+				have_sb = true;
+				pos = ALIGN(sb_pos + BTRFS_SUPER_INFO_SIZE,
+					    zinfo->zone_size);
+				break;
+			}
+		}
+		if (!have_sb)
+			break;
+
+	}
+
+	return pos;
+}
+
+int btrfs_reset_device_zone(struct btrfs_device *device, u64 physical,
+			    u64 length, u64 *bytes)
+{
+	int ret;
+
+	*bytes = 0;
+	ret = blkdev_zone_mgmt(device->bdev, REQ_OP_ZONE_RESET,
+			       physical >> SECTOR_SHIFT, length >> SECTOR_SHIFT,
+			       GFP_NOFS);
+	if (ret)
+		return ret;
+
+	*bytes = length;
+	while (length) {
+		btrfs_dev_set_zone_empty(device, physical);
+		physical += device->zone_info->zone_size;
+		length -= device->zone_info->zone_size;
+	}
+
+	return 0;
+}
+
+int btrfs_ensure_empty_zones(struct btrfs_device *device, u64 start, u64 size)
+{
+	struct btrfs_zoned_device_info *zinfo = device->zone_info;
+	u8 shift = zinfo->zone_size_shift;
+	unsigned long begin = start >> shift;
+	unsigned long end = (start + size) >> shift;
+	u64 pos;
+	int ret;
+
+	ASSERT(IS_ALIGNED(start, zinfo->zone_size));
+	ASSERT(IS_ALIGNED(size, zinfo->zone_size));
+
+	if (end > zinfo->nr_zones)
+		return -ERANGE;
+
+	/* All the zones are conventional */
+	if (find_next_bit(zinfo->seq_zones, end, begin) == end)
+		return 0;
+
+	/* All the zones are sequential and empty */
+	if (find_next_zero_bit(zinfo->seq_zones, end, begin) == end &&
+	    find_next_zero_bit(zinfo->empty_zones, end, begin) == end)
+		return 0;
+
+	for (pos = start; pos < start + size; pos += zinfo->zone_size) {
+		u64 reset_bytes;
+
+		if (!btrfs_dev_is_sequential(device, pos) ||
+		    btrfs_dev_is_empty_zone(device, pos))
+			continue;
+
+		/* Free regions should be empty */
+		btrfs_warn_in_rcu(
+			device->fs_info,
+			"zoned: resetting device %s (devid %llu) zone %llu for allocation",
+			rcu_str_deref(device->name), device->devid,
+			pos >> shift);
+		WARN_ON_ONCE(1);
+
+		ret = btrfs_reset_device_zone(device, pos, zinfo->zone_size,
+					      &reset_bytes);
+		if (ret)
+			return ret;
+	}
+
+	return 0;
+}
diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h
index de9d7dd8c351..ec2391c52d8b 100644
--- a/fs/btrfs/zoned.h
+++ b/fs/btrfs/zoned.h
@@ -35,6 +35,11 @@ int btrfs_sb_log_location(struct btrfs_device *device, int mirror, int rw,
 			  u64 *bytenr_ret);
 void btrfs_advance_sb_log(struct btrfs_device *device, int mirror);
 int btrfs_reset_sb_log_zones(struct block_device *bdev, int mirror);
+u64 btrfs_find_allocatable_zones(struct btrfs_device *device, u64 hole_start,
+				 u64 hole_end, u64 num_bytes);
+int btrfs_reset_device_zone(struct btrfs_device *device, u64 physical,
+			    u64 length, u64 *bytes);
+int btrfs_ensure_empty_zones(struct btrfs_device *device, u64 start, u64 size);
 #else /* CONFIG_BLK_DEV_ZONED */
 static inline int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
 				     struct blk_zone *zone)
@@ -87,6 +92,26 @@ static inline int btrfs_reset_sb_log_zones(struct block_device *bdev,
 	return 0;
 }
 
+static inline u64 btrfs_find_allocatable_zones(struct btrfs_device *device,
+					       u64 hole_start, u64 hole_end,
+					       u64 num_bytes)
+{
+	return hole_start;
+}
+
+static inline int btrfs_reset_device_zone(struct btrfs_device *device,
+					  u64 physical, u64 length, u64 *bytes)
+{
+	*bytes = 0;
+	return 0;
+}
+
+static inline int btrfs_ensure_empty_zones(struct btrfs_device *device,
+					   u64 start, u64 size)
+{
+	return 0;
+}
+
 #endif
 
 static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)
@@ -165,4 +190,13 @@ static inline bool btrfs_check_super_location(struct btrfs_device *device,
 	       !btrfs_dev_is_sequential(device, pos);
 }
 
+static inline u64 btrfs_align_offset_to_zone(struct btrfs_device *device,
+					     u64 pos)
+{
+	if (!device->zone_info)
+		return pos;
+
+	return ALIGN(pos, device->zone_info->zone_size);
+}
+
 #endif
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH v10 13/41] btrfs: verify device extent is aligned to zone
  2020-11-10 11:26 [PATCH v10 00/41] btrfs: zoned block device support Naohiro Aota
                   ` (11 preceding siblings ...)
  2020-11-10 11:26 ` [PATCH v10 12/41] btrfs: implement zoned chunk allocator Naohiro Aota
@ 2020-11-10 11:26 ` Naohiro Aota
  2020-11-27  6:27   ` Anand Jain
  2020-11-10 11:26 ` [PATCH v10 14/41] btrfs: load zone's allocation offset Naohiro Aota
                   ` (29 subsequent siblings)
  42 siblings, 1 reply; 125+ messages in thread
From: Naohiro Aota @ 2020-11-10 11:26 UTC (permalink / raw)
  To: linux-btrfs, dsterba
  Cc: hare, linux-fsdevel, Jens Axboe, Christoph Hellwig,
	Darrick J. Wong, Naohiro Aota, Josef Bacik

Add a check to verify_one_dev_extent() that a device extent on a zoned
block device is aligned to the respective zone boundaries.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/volumes.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 7831cf6c6da4..c0e27c1e2559 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -7783,6 +7783,20 @@ static int verify_one_dev_extent(struct btrfs_fs_info *fs_info,
 		ret = -EUCLEAN;
 		goto out;
 	}
+
+	if (dev->zone_info) {
+		u64 zone_size = dev->zone_info->zone_size;
+
+		if (!IS_ALIGNED(physical_offset, zone_size) ||
+		    !IS_ALIGNED(physical_len, zone_size)) {
+			btrfs_err(fs_info,
+"zoned: dev extent devid %llu physical offset %llu len %llu is not aligned to device zone",
+				  devid, physical_offset, physical_len);
+			ret = -EUCLEAN;
+			goto out;
+		}
+	}
+
 out:
 	free_extent_map(em);
 	return ret;
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH v10 14/41] btrfs: load zone's allocation offset
  2020-11-10 11:26 [PATCH v10 00/41] btrfs: zoned block device support Naohiro Aota
                   ` (12 preceding siblings ...)
  2020-11-10 11:26 ` [PATCH v10 13/41] btrfs: verify device extent is aligned to zone Naohiro Aota
@ 2020-11-10 11:26 ` Naohiro Aota
  2020-12-08  9:54   ` Anand Jain
  2020-11-10 11:26 ` [PATCH v10 15/41] btrfs: emulate write pointer for conventional zones Naohiro Aota
                   ` (28 subsequent siblings)
  42 siblings, 1 reply; 125+ messages in thread
From: Naohiro Aota @ 2020-11-10 11:26 UTC (permalink / raw)
  To: linux-btrfs, dsterba
  Cc: hare, linux-fsdevel, Jens Axboe, Christoph Hellwig,
	Darrick J. Wong, Naohiro Aota, Josef Bacik

Zoned btrfs must allocate blocks at the zones' write pointers. The device's
write pointer position can be mapped to a logical address within a block
group. This commit adds "alloc_offset" to track that logical address.

The offset is populated in btrfs_load_block_group_zone_info() from the
write pointers of the corresponding zones.

For now, zoned btrfs only supports the SINGLE profile. Supporting
non-SINGLE profiles with zone append writes is not trivial. For example, in
the DUP profile, we send a zone append write IO to two zones on a device.
The device replies with the written LBAs for the IOs. If the offsets of the
returned addresses from the beginning of their zones differ, the two copies
end up at different logical addresses.

Supporting such diverging physical addresses would require a fine-grained
logical-to-physical mapping, which in turn would require a new metadata
type. So disable non-SINGLE profiles for now.

This commit handles the case where all the zones in a block group are
sequential. The next patch will handle block groups containing conventional
zones.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/block-group.c |  15 ++++
 fs/btrfs/block-group.h |   6 ++
 fs/btrfs/zoned.c       | 154 +++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/zoned.h       |   7 ++
 4 files changed, 182 insertions(+)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index 6b4831824f51..ffc64dfbe09e 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -15,6 +15,7 @@
 #include "delalloc-space.h"
 #include "discard.h"
 #include "raid56.h"
+#include "zoned.h"
 
 /*
  * Return target flags in extended format or 0 if restripe for this chunk_type
@@ -1935,6 +1936,13 @@ static int read_one_block_group(struct btrfs_fs_info *info,
 			goto error;
 	}
 
+	ret = btrfs_load_block_group_zone_info(cache);
+	if (ret) {
+		btrfs_err(info, "zoned: failed to load zone info of bg %llu",
+			  cache->start);
+		goto error;
+	}
+
 	/*
 	 * We need to exclude the super stripes now so that the space info has
 	 * super bytes accounted for, otherwise we'll think we have more space
@@ -2161,6 +2169,13 @@ int btrfs_make_block_group(struct btrfs_trans_handle *trans, u64 bytes_used,
 	cache->last_byte_to_unpin = (u64)-1;
 	cache->cached = BTRFS_CACHE_FINISHED;
 	cache->needs_free_space = 1;
+
+	ret = btrfs_load_block_group_zone_info(cache);
+	if (ret) {
+		btrfs_put_block_group(cache);
+		return ret;
+	}
+
 	ret = exclude_super_stripes(cache);
 	if (ret) {
 		/* We may have excluded something, so call this just in case */
diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h
index adfd7583a17b..14e3043c9ce7 100644
--- a/fs/btrfs/block-group.h
+++ b/fs/btrfs/block-group.h
@@ -183,6 +183,12 @@ struct btrfs_block_group {
 
 	/* Record locked full stripes for RAID5/6 block group */
 	struct btrfs_full_stripe_locks_tree full_stripe_locks_root;
+
+	/*
+	 * Allocation offset for the block group to implement sequential
+	 * allocation. This is used only with ZONED mode enabled.
+	 */
+	u64 alloc_offset;
 };
 
 static inline u64 btrfs_block_group_end(struct btrfs_block_group *block_group)
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index ed5de1c138d7..69d3412c4fef 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -3,14 +3,20 @@
 #include <linux/bitops.h>
 #include <linux/slab.h>
 #include <linux/blkdev.h>
+#include <linux/sched/mm.h>
 #include "ctree.h"
 #include "volumes.h"
 #include "zoned.h"
 #include "rcu-string.h"
 #include "disk-io.h"
+#include "block-group.h"
 
 /* Maximum number of zones to report per blkdev_report_zones() call */
 #define BTRFS_REPORT_NR_ZONES   4096
+/* Invalid allocation pointer value for missing devices */
+#define WP_MISSING_DEV ((u64)-1)
+/* Pseudo write pointer value for conventional zone */
+#define WP_CONVENTIONAL ((u64)-2)
 
 /* Number of superblock log zones */
 #define BTRFS_NR_SB_LOG_ZONES 2
@@ -777,3 +783,151 @@ int btrfs_ensure_empty_zones(struct btrfs_device *device, u64 start, u64 size)
 
 	return 0;
 }
+
+int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache)
+{
+	struct btrfs_fs_info *fs_info = cache->fs_info;
+	struct extent_map_tree *em_tree = &fs_info->mapping_tree;
+	struct extent_map *em;
+	struct map_lookup *map;
+	struct btrfs_device *device;
+	u64 logical = cache->start;
+	u64 length = cache->length;
+	u64 physical = 0;
+	int ret;
+	int i;
+	unsigned int nofs_flag;
+	u64 *alloc_offsets = NULL;
+	u32 num_sequential = 0, num_conventional = 0;
+
+	if (!btrfs_is_zoned(fs_info))
+		return 0;
+
+	/* Sanity check */
+	if (!IS_ALIGNED(length, fs_info->zone_size)) {
+		btrfs_err(fs_info, "zoned: block group %llu len %llu unaligned to zone size %llu",
+			  logical, length, fs_info->zone_size);
+		return -EIO;
+	}
+
+	/* Get the chunk mapping */
+	read_lock(&em_tree->lock);
+	em = lookup_extent_mapping(em_tree, logical, length);
+	read_unlock(&em_tree->lock);
+
+	if (!em)
+		return -EINVAL;
+
+	map = em->map_lookup;
+
+	/*
+	 * Get the zone type: if the group is mapped to a non-sequential zone,
+	 * there is no need for the allocation offset (fit allocation is OK).
+	 */
+	alloc_offsets = kcalloc(map->num_stripes, sizeof(*alloc_offsets),
+				GFP_NOFS);
+	if (!alloc_offsets) {
+		free_extent_map(em);
+		return -ENOMEM;
+	}
+
+	for (i = 0; i < map->num_stripes; i++) {
+		bool is_sequential;
+		struct blk_zone zone;
+
+		device = map->stripes[i].dev;
+		physical = map->stripes[i].physical;
+
+		if (device->bdev == NULL) {
+			alloc_offsets[i] = WP_MISSING_DEV;
+			continue;
+		}
+
+		is_sequential = btrfs_dev_is_sequential(device, physical);
+		if (is_sequential)
+			num_sequential++;
+		else
+			num_conventional++;
+
+		if (!is_sequential) {
+			alloc_offsets[i] = WP_CONVENTIONAL;
+			continue;
+		}
+
+		/*
+		 * This zone will be used for allocation, so mark this
+		 * zone non-empty.
+		 */
+		btrfs_dev_clear_zone_empty(device, physical);
+
+		/*
+		 * The group is mapped to a sequential zone. Get the zone write
+		 * pointer to determine the allocation offset within the zone.
+		 */
+		WARN_ON(!IS_ALIGNED(physical, fs_info->zone_size));
+		nofs_flag = memalloc_nofs_save();
+		ret = btrfs_get_dev_zone(device, physical, &zone);
+		memalloc_nofs_restore(nofs_flag);
+		if (ret == -EIO || ret == -EOPNOTSUPP) {
+			ret = 0;
+			alloc_offsets[i] = WP_MISSING_DEV;
+			continue;
+		} else if (ret) {
+			goto out;
+		}
+
+		switch (zone.cond) {
+		case BLK_ZONE_COND_OFFLINE:
+		case BLK_ZONE_COND_READONLY:
+			btrfs_err(fs_info, "zoned: offline/readonly zone %llu on device %s (devid %llu)",
+				  physical >> device->zone_info->zone_size_shift,
+				  rcu_str_deref(device->name), device->devid);
+			alloc_offsets[i] = WP_MISSING_DEV;
+			break;
+		case BLK_ZONE_COND_EMPTY:
+			alloc_offsets[i] = 0;
+			break;
+		case BLK_ZONE_COND_FULL:
+			alloc_offsets[i] = fs_info->zone_size;
+			break;
+		default:
+			/* Partially used zone */
+			alloc_offsets[i] =
+				((zone.wp - zone.start) << SECTOR_SHIFT);
+			break;
+		}
+	}
+
+	if (num_conventional > 0) {
+		/*
+		 * Since conventional zones do not have a write pointer, we
+		 * cannot determine alloc_offset from the pointer
+		 */
+		ret = -EINVAL;
+		goto out;
+	}
+
+	switch (map->type & BTRFS_BLOCK_GROUP_PROFILE_MASK) {
+	case 0: /* single */
+		cache->alloc_offset = alloc_offsets[0];
+		break;
+	case BTRFS_BLOCK_GROUP_DUP:
+	case BTRFS_BLOCK_GROUP_RAID1:
+	case BTRFS_BLOCK_GROUP_RAID0:
+	case BTRFS_BLOCK_GROUP_RAID10:
+	case BTRFS_BLOCK_GROUP_RAID5:
+	case BTRFS_BLOCK_GROUP_RAID6:
+		/* non-SINGLE profiles are not supported yet */
+	default:
+		btrfs_err(fs_info, "zoned: profile %s not supported",
+			  btrfs_bg_type_to_raid_name(map->type));
+		ret = -EINVAL;
+		goto out;
+	}
+
+out:
+	kfree(alloc_offsets);
+	free_extent_map(em);
+
+	return ret;
+}
diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h
index ec2391c52d8b..e3338a2f1be9 100644
--- a/fs/btrfs/zoned.h
+++ b/fs/btrfs/zoned.h
@@ -40,6 +40,7 @@ u64 btrfs_find_allocatable_zones(struct btrfs_device *device, u64 hole_start,
 int btrfs_reset_device_zone(struct btrfs_device *device, u64 physical,
 			    u64 length, u64 *bytes);
 int btrfs_ensure_empty_zones(struct btrfs_device *device, u64 start, u64 size);
+int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache);
 #else /* CONFIG_BLK_DEV_ZONED */
 static inline int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
 				     struct blk_zone *zone)
@@ -112,6 +113,12 @@ static inline int btrfs_ensure_empty_zones(struct btrfs_device *device,
 	return 0;
 }
 
+static inline int btrfs_load_block_group_zone_info(
+	struct btrfs_block_group *cache)
+{
+	return 0;
+}
+
 #endif
 
 static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH v10 15/41] btrfs: emulate write pointer for conventional zones
  2020-11-10 11:26 [PATCH v10 00/41] btrfs: zoned block device support Naohiro Aota
                   ` (13 preceding siblings ...)
  2020-11-10 11:26 ` [PATCH v10 14/41] btrfs: load zone's allocation offset Naohiro Aota
@ 2020-11-10 11:26 ` Naohiro Aota
  2020-11-10 11:26 ` [PATCH v10 16/41] btrfs: track unusable bytes for zones Naohiro Aota
                   ` (27 subsequent siblings)
  42 siblings, 0 replies; 125+ messages in thread
From: Naohiro Aota @ 2020-11-10 11:26 UTC (permalink / raw)
  To: linux-btrfs, dsterba
  Cc: hare, linux-fsdevel, Jens Axboe, Christoph Hellwig,
	Darrick J. Wong, Naohiro Aota

Conventional zones do not have a write pointer, so we cannot use it to
determine the allocation offset if a block group contains a conventional
zone.

Instead, we can use the end of the last allocated extent in the block
group as the allocation offset.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/zoned.c | 80 ++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 74 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index 69d3412c4fef..9bf40300e428 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -784,6 +784,61 @@ int btrfs_ensure_empty_zones(struct btrfs_device *device, u64 start, u64 size)
 	return 0;
 }
 
+static int emulate_write_pointer(struct btrfs_block_group *cache,
+				 u64 *offset_ret)
+{
+	struct btrfs_fs_info *fs_info = cache->fs_info;
+	struct btrfs_root *root = fs_info->extent_root;
+	struct btrfs_path *path;
+	struct btrfs_key key;
+	struct btrfs_key found_key;
+	int ret;
+	u64 length;
+
+	path = btrfs_alloc_path();
+	if (!path)
+		return -ENOMEM;
+
+	key.objectid = cache->start + cache->length;
+	key.type = 0;
+	key.offset = 0;
+
+	ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
+	/* We should not find the exact match */
+	if (ret <= 0) {
+		ret = -EUCLEAN;
+		goto out;
+	}
+
+	ret = btrfs_previous_extent_item(root, path, cache->start);
+	if (ret) {
+		if (ret == 1) {
+			ret = 0;
+			*offset_ret = 0;
+		}
+		goto out;
+	}
+
+	btrfs_item_key_to_cpu(path->nodes[0], &found_key, path->slots[0]);
+
+	if (found_key.type == BTRFS_EXTENT_ITEM_KEY)
+		length = found_key.offset;
+	else
+		length = fs_info->nodesize;
+
+	if (!(found_key.objectid >= cache->start &&
+	       found_key.objectid + length <= cache->start + cache->length)) {
+		ret = -EUCLEAN;
+		goto out;
+	}
+	*offset_ret = found_key.objectid + length - cache->start;
+	ret = 0;
+
+out:
+	btrfs_free_path(path);
+	return ret;
+}
+
 int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache)
 {
 	struct btrfs_fs_info *fs_info = cache->fs_info;
@@ -798,6 +853,7 @@ int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache)
 	int i;
 	unsigned int nofs_flag;
 	u64 *alloc_offsets = NULL;
+	u64 emulated_offset = 0;
 	u32 num_sequential = 0, num_conventional = 0;
 
 	if (!btrfs_is_zoned(fs_info))
@@ -899,12 +955,16 @@ int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache)
 	}
 
 	if (num_conventional > 0) {
-		/*
-		 * Since conventional zones do not have a write pointer, we
-		 * cannot determine alloc_offset from the pointer
-		 */
-		ret = -EINVAL;
-		goto out;
+		ret = emulate_write_pointer(cache, &emulated_offset);
+		if (ret || map->num_stripes == num_conventional) {
+			if (!ret)
+				cache->alloc_offset = emulated_offset;
+			else
+				btrfs_err(fs_info,
+			"zoned: failed to emulate write pointer of bg %llu",
+					  cache->start);
+			goto out;
+		}
 	}
 
 	switch (map->type & BTRFS_BLOCK_GROUP_PROFILE_MASK) {
@@ -926,6 +986,14 @@ int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache)
 	}
 
 out:
+	/* An extent is allocated after the write pointer */
+	if (num_conventional && emulated_offset > cache->alloc_offset) {
+		btrfs_err(fs_info,
+			  "zoned: got wrong write pointer in BG %llu: %llu > %llu",
+			  logical, emulated_offset, cache->alloc_offset);
+		ret = -EIO;
+	}
+
 	kfree(alloc_offsets);
 	free_extent_map(em);
 
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH v10 16/41] btrfs: track unusable bytes for zones
  2020-11-10 11:26 [PATCH v10 00/41] btrfs: zoned block device support Naohiro Aota
                   ` (14 preceding siblings ...)
  2020-11-10 11:26 ` [PATCH v10 15/41] btrfs: emulate write pointer for conventional zones Naohiro Aota
@ 2020-11-10 11:26 ` Naohiro Aota
  2020-11-10 11:26 ` [PATCH v10 17/41] btrfs: do sequential extent allocation in ZONED mode Naohiro Aota
                   ` (26 subsequent siblings)
  42 siblings, 0 replies; 125+ messages in thread
From: Naohiro Aota @ 2020-11-10 11:26 UTC (permalink / raw)
  To: linux-btrfs, dsterba
  Cc: hare, linux-fsdevel, Jens Axboe, Christoph Hellwig,
	Darrick J. Wong, Naohiro Aota

In zoned btrfs, a region that was once written and then freed is not usable
until we reset the underlying zone. So we need to distinguish such unusable
space from usable free space.

Introduce the "zone_unusable" field in the block group structure and
"bytes_zone_unusable" in the space_info structure to track the unusable
space.

Pinned bytes are always reclaimed as unusable space. But when an allocated
region is returned before being used, e.g., when the block group becomes
read-only between allocation time and reservation time, we can safely
return the region to the block group. For that situation, this commit
introduces btrfs_add_free_space_unused(). It behaves the same as
btrfs_add_free_space() on regular btrfs; on zoned btrfs, it rewinds the
allocation offset instead.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/block-group.c      | 21 ++++++++++-----
 fs/btrfs/block-group.h      |  1 +
 fs/btrfs/extent-tree.c      | 10 ++++++-
 fs/btrfs/free-space-cache.c | 52 +++++++++++++++++++++++++++++++++++++
 fs/btrfs/free-space-cache.h |  2 ++
 fs/btrfs/space-info.c       | 13 ++++++----
 fs/btrfs/space-info.h       |  4 ++-
 fs/btrfs/sysfs.c            |  2 ++
 fs/btrfs/zoned.c            | 24 +++++++++++++++++
 fs/btrfs/zoned.h            |  3 +++
 10 files changed, 119 insertions(+), 13 deletions(-)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index ffc64dfbe09e..723b7c183cd9 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -1080,12 +1080,17 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
 		WARN_ON(block_group->space_info->total_bytes
 			< block_group->length);
 		WARN_ON(block_group->space_info->bytes_readonly
-			< block_group->length);
+			< block_group->length - block_group->zone_unusable);
+		WARN_ON(block_group->space_info->bytes_zone_unusable
+			< block_group->zone_unusable);
 		WARN_ON(block_group->space_info->disk_total
 			< block_group->length * factor);
 	}
 	block_group->space_info->total_bytes -= block_group->length;
-	block_group->space_info->bytes_readonly -= block_group->length;
+	block_group->space_info->bytes_readonly -=
+		(block_group->length - block_group->zone_unusable);
+	block_group->space_info->bytes_zone_unusable -=
+		block_group->zone_unusable;
 	block_group->space_info->disk_total -= block_group->length * factor;
 
 	spin_unlock(&block_group->space_info->lock);
@@ -1229,7 +1234,7 @@ static int inc_block_group_ro(struct btrfs_block_group *cache, int force)
 	}
 
 	num_bytes = cache->length - cache->reserved - cache->pinned -
-		    cache->bytes_super - cache->used;
+		    cache->bytes_super - cache->zone_unusable - cache->used;
 
 	/*
 	 * Data never overcommits, even in mixed mode, so do just the straight
@@ -1973,6 +1978,8 @@ static int read_one_block_group(struct btrfs_fs_info *info,
 		btrfs_free_excluded_extents(cache);
 	}
 
+	btrfs_calc_zone_unusable(cache);
+
 	ret = btrfs_add_block_group_cache(info, cache);
 	if (ret) {
 		btrfs_remove_free_space_cache(cache);
@@ -1980,7 +1987,8 @@ static int read_one_block_group(struct btrfs_fs_info *info,
 	}
 	trace_btrfs_add_block_group(info, cache, 0);
 	btrfs_update_space_info(info, cache->flags, cache->length,
-				cache->used, cache->bytes_super, &space_info);
+				cache->used, cache->bytes_super,
+				cache->zone_unusable, &space_info);
 
 	cache->space_info = space_info;
 
@@ -2217,7 +2225,7 @@ int btrfs_make_block_group(struct btrfs_trans_handle *trans, u64 bytes_used,
 	 */
 	trace_btrfs_add_block_group(fs_info, cache, 1);
 	btrfs_update_space_info(fs_info, cache->flags, size, bytes_used,
-				cache->bytes_super, &cache->space_info);
+				cache->bytes_super, 0, &cache->space_info);
 	btrfs_update_global_block_rsv(fs_info);
 
 	link_block_group(cache);
@@ -2325,7 +2333,8 @@ void btrfs_dec_block_group_ro(struct btrfs_block_group *cache)
 	spin_lock(&cache->lock);
 	if (!--cache->ro) {
 		num_bytes = cache->length - cache->reserved -
-			    cache->pinned - cache->bytes_super - cache->used;
+			    cache->pinned - cache->bytes_super -
+			    cache->zone_unusable - cache->used;
 		sinfo->bytes_readonly -= num_bytes;
 		list_del_init(&cache->ro_list);
 	}
diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h
index 14e3043c9ce7..5be47f4bfea7 100644
--- a/fs/btrfs/block-group.h
+++ b/fs/btrfs/block-group.h
@@ -189,6 +189,7 @@ struct btrfs_block_group {
 	 * allocation. This is used only with ZONED mode enabled.
 	 */
 	u64 alloc_offset;
+	u64 zone_unusable;
 };
 
 static inline u64 btrfs_block_group_end(struct btrfs_block_group *block_group)
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 3b21fee13e77..09439782b9a8 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -34,6 +34,7 @@
 #include "block-group.h"
 #include "discard.h"
 #include "rcu-string.h"
+#include "zoned.h"
 
 #undef SCRAMBLE_DELAYED_REFS
 
@@ -2765,6 +2766,9 @@ fetch_cluster_info(struct btrfs_fs_info *fs_info,
 {
 	struct btrfs_free_cluster *ret = NULL;
 
+	if (btrfs_is_zoned(fs_info))
+		return NULL;
+
 	*empty_cluster = 0;
 	if (btrfs_mixed_space_info(space_info))
 		return ret;
@@ -2846,7 +2850,11 @@ static int unpin_extent_range(struct btrfs_fs_info *fs_info,
 		space_info->max_extent_size = 0;
 		percpu_counter_add_batch(&space_info->total_bytes_pinned,
 			    -len, BTRFS_TOTAL_BYTES_PINNED_BATCH);
-		if (cache->ro) {
+		if (btrfs_is_zoned(fs_info)) {
+			/* Need reset before reusing in a zoned block group */
+			space_info->bytes_zone_unusable += len;
+			readonly = true;
+		} else if (cache->ro) {
 			space_info->bytes_readonly += len;
 			readonly = true;
 		}
diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index af0013d3df63..f6434794cb0b 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -2467,6 +2467,8 @@ int __btrfs_add_free_space(struct btrfs_fs_info *fs_info,
 	int ret = 0;
 	u64 filter_bytes = bytes;
 
+	ASSERT(!btrfs_is_zoned(fs_info));
+
 	info = kmem_cache_zalloc(btrfs_free_space_cachep, GFP_NOFS);
 	if (!info)
 		return -ENOMEM;
@@ -2524,11 +2526,44 @@ int __btrfs_add_free_space(struct btrfs_fs_info *fs_info,
 	return ret;
 }
 
+static int __btrfs_add_free_space_zoned(struct btrfs_block_group *block_group,
+					u64 bytenr, u64 size, bool used)
+{
+	struct btrfs_free_space_ctl *ctl = block_group->free_space_ctl;
+	u64 offset = bytenr - block_group->start;
+	u64 to_free, to_unusable;
+
+	spin_lock(&ctl->tree_lock);
+	if (!used)
+		to_free = size;
+	else if (offset >= block_group->alloc_offset)
+		to_free = size;
+	else if (offset + size <= block_group->alloc_offset)
+		to_free = 0;
+	else
+		to_free = offset + size - block_group->alloc_offset;
+	to_unusable = size - to_free;
+
+	ctl->free_space += to_free;
+	block_group->zone_unusable += to_unusable;
+	spin_unlock(&ctl->tree_lock);
+	if (!used) {
+		spin_lock(&block_group->lock);
+		block_group->alloc_offset -= size;
+		spin_unlock(&block_group->lock);
+	}
+	return 0;
+}
+
 int btrfs_add_free_space(struct btrfs_block_group *block_group,
 			 u64 bytenr, u64 size)
 {
 	enum btrfs_trim_state trim_state = BTRFS_TRIM_STATE_UNTRIMMED;
 
+	if (btrfs_is_zoned(block_group->fs_info))
+		return __btrfs_add_free_space_zoned(block_group, bytenr, size,
+						    true);
+
 	if (btrfs_test_opt(block_group->fs_info, DISCARD_SYNC))
 		trim_state = BTRFS_TRIM_STATE_TRIMMED;
 
@@ -2537,6 +2572,16 @@ int btrfs_add_free_space(struct btrfs_block_group *block_group,
 				      bytenr, size, trim_state);
 }
 
+int btrfs_add_free_space_unused(struct btrfs_block_group *block_group,
+				u64 bytenr, u64 size)
+{
+	if (btrfs_is_zoned(block_group->fs_info))
+		return __btrfs_add_free_space_zoned(block_group, bytenr, size,
+						    false);
+
+	return btrfs_add_free_space(block_group, bytenr, size);
+}
+
 /*
  * This is a subtle distinction because when adding free space back in general,
  * we want it to be added as untrimmed for async. But in the case where we add
@@ -2547,6 +2592,10 @@ int btrfs_add_free_space_async_trimmed(struct btrfs_block_group *block_group,
 {
 	enum btrfs_trim_state trim_state = BTRFS_TRIM_STATE_UNTRIMMED;
 
+	if (btrfs_is_zoned(block_group->fs_info))
+		return __btrfs_add_free_space_zoned(block_group, bytenr, size,
+						    true);
+
 	if (btrfs_test_opt(block_group->fs_info, DISCARD_SYNC) ||
 	    btrfs_test_opt(block_group->fs_info, DISCARD_ASYNC))
 		trim_state = BTRFS_TRIM_STATE_TRIMMED;
@@ -2564,6 +2613,9 @@ int btrfs_remove_free_space(struct btrfs_block_group *block_group,
 	int ret;
 	bool re_search = false;
 
+	if (btrfs_is_zoned(block_group->fs_info))
+		return 0;
+
 	spin_lock(&ctl->tree_lock);
 
 again:
diff --git a/fs/btrfs/free-space-cache.h b/fs/btrfs/free-space-cache.h
index e3d5e0ad8f8e..469382529f7e 100644
--- a/fs/btrfs/free-space-cache.h
+++ b/fs/btrfs/free-space-cache.h
@@ -116,6 +116,8 @@ int __btrfs_add_free_space(struct btrfs_fs_info *fs_info,
 			   enum btrfs_trim_state trim_state);
 int btrfs_add_free_space(struct btrfs_block_group *block_group,
 			 u64 bytenr, u64 size);
+int btrfs_add_free_space_unused(struct btrfs_block_group *block_group,
+				u64 bytenr, u64 size);
 int btrfs_add_free_space_async_trimmed(struct btrfs_block_group *block_group,
 				       u64 bytenr, u64 size);
 int btrfs_remove_free_space(struct btrfs_block_group *block_group,
diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index 64099565ab8f..bbbf3c1412a4 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -163,6 +163,7 @@ u64 __pure btrfs_space_info_used(struct btrfs_space_info *s_info,
 	ASSERT(s_info);
 	return s_info->bytes_used + s_info->bytes_reserved +
 		s_info->bytes_pinned + s_info->bytes_readonly +
+		s_info->bytes_zone_unusable +
 		(may_use_included ? s_info->bytes_may_use : 0);
 }
 
@@ -257,7 +258,7 @@ int btrfs_init_space_info(struct btrfs_fs_info *fs_info)
 
 void btrfs_update_space_info(struct btrfs_fs_info *info, u64 flags,
 			     u64 total_bytes, u64 bytes_used,
-			     u64 bytes_readonly,
+			     u64 bytes_readonly, u64 bytes_zone_unusable,
 			     struct btrfs_space_info **space_info)
 {
 	struct btrfs_space_info *found;
@@ -273,6 +274,7 @@ void btrfs_update_space_info(struct btrfs_fs_info *info, u64 flags,
 	found->bytes_used += bytes_used;
 	found->disk_used += bytes_used * factor;
 	found->bytes_readonly += bytes_readonly;
+	found->bytes_zone_unusable += bytes_zone_unusable;
 	if (total_bytes > 0)
 		found->full = 0;
 	btrfs_try_granting_tickets(info, found);
@@ -422,10 +424,10 @@ static void __btrfs_dump_space_info(struct btrfs_fs_info *fs_info,
 		   info->total_bytes - btrfs_space_info_used(info, true),
 		   info->full ? "" : "not ");
 	btrfs_info(fs_info,
-		"space_info total=%llu, used=%llu, pinned=%llu, reserved=%llu, may_use=%llu, readonly=%llu",
+		"space_info total=%llu, used=%llu, pinned=%llu, reserved=%llu, may_use=%llu, readonly=%llu zone_unusable=%llu",
 		info->total_bytes, info->bytes_used, info->bytes_pinned,
 		info->bytes_reserved, info->bytes_may_use,
-		info->bytes_readonly);
+		info->bytes_readonly, info->bytes_zone_unusable);
 
 	DUMP_BLOCK_RSV(fs_info, global_block_rsv);
 	DUMP_BLOCK_RSV(fs_info, trans_block_rsv);
@@ -454,9 +456,10 @@ void btrfs_dump_space_info(struct btrfs_fs_info *fs_info,
 	list_for_each_entry(cache, &info->block_groups[index], list) {
 		spin_lock(&cache->lock);
 		btrfs_info(fs_info,
-			"block group %llu has %llu bytes, %llu used %llu pinned %llu reserved %s",
+			"block group %llu has %llu bytes, %llu used %llu pinned %llu reserved %llu zone_unusable %s",
 			cache->start, cache->length, cache->used, cache->pinned,
-			cache->reserved, cache->ro ? "[readonly]" : "");
+			cache->reserved, cache->zone_unusable,
+			cache->ro ? "[readonly]" : "");
 		spin_unlock(&cache->lock);
 		btrfs_dump_free_space(cache, bytes);
 	}
diff --git a/fs/btrfs/space-info.h b/fs/btrfs/space-info.h
index 5646393b928c..ee003ffba956 100644
--- a/fs/btrfs/space-info.h
+++ b/fs/btrfs/space-info.h
@@ -17,6 +17,8 @@ struct btrfs_space_info {
 	u64 bytes_may_use;	/* number of bytes that may be used for
 				   delalloc/allocations */
 	u64 bytes_readonly;	/* total bytes that are read only */
+	u64 bytes_zone_unusable;	/* total bytes that are unusable until
+					   resetting the device zone */
 
 	u64 max_extent_size;	/* This will hold the maximum extent size of
 				   the space info if we had an ENOSPC in the
@@ -119,7 +121,7 @@ DECLARE_SPACE_INFO_UPDATE(bytes_pinned, "pinned");
 int btrfs_init_space_info(struct btrfs_fs_info *fs_info);
 void btrfs_update_space_info(struct btrfs_fs_info *info, u64 flags,
 			     u64 total_bytes, u64 bytes_used,
-			     u64 bytes_readonly,
+			     u64 bytes_readonly, u64 bytes_zone_unusable,
 			     struct btrfs_space_info **space_info);
 struct btrfs_space_info *btrfs_find_space_info(struct btrfs_fs_info *info,
 					       u64 flags);
diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index 828006020bbd..ea679803da9b 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -635,6 +635,7 @@ SPACE_INFO_ATTR(bytes_pinned);
 SPACE_INFO_ATTR(bytes_reserved);
 SPACE_INFO_ATTR(bytes_may_use);
 SPACE_INFO_ATTR(bytes_readonly);
+SPACE_INFO_ATTR(bytes_zone_unusable);
 SPACE_INFO_ATTR(disk_used);
 SPACE_INFO_ATTR(disk_total);
 BTRFS_ATTR(space_info, total_bytes_pinned,
@@ -648,6 +649,7 @@ static struct attribute *space_info_attrs[] = {
 	BTRFS_ATTR_PTR(space_info, bytes_reserved),
 	BTRFS_ATTR_PTR(space_info, bytes_may_use),
 	BTRFS_ATTR_PTR(space_info, bytes_readonly),
+	BTRFS_ATTR_PTR(space_info, bytes_zone_unusable),
 	BTRFS_ATTR_PTR(space_info, disk_used),
 	BTRFS_ATTR_PTR(space_info, disk_total),
 	BTRFS_ATTR_PTR(space_info, total_bytes_pinned),
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index 9bf40300e428..5ee26b9fe5b1 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -999,3 +999,27 @@ int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache)
 
 	return ret;
 }
+
+void btrfs_calc_zone_unusable(struct btrfs_block_group *cache)
+{
+	u64 unusable, free;
+
+	if (!btrfs_is_zoned(cache->fs_info))
+		return;
+
+	WARN_ON(cache->bytes_super != 0);
+	unusable = cache->alloc_offset - cache->used;
+	free = cache->length - cache->alloc_offset;
+
+	/* We only need ->free_space in ALLOC_SEQ BGs */
+	cache->last_byte_to_unpin = (u64)-1;
+	cache->cached = BTRFS_CACHE_FINISHED;
+	cache->free_space_ctl->free_space = free;
+	cache->zone_unusable = unusable;
+
+	/*
+	 * Should not have any excluded extents. Just
+	 * in case, though.
+	 */
+	btrfs_free_excluded_extents(cache);
+}
diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h
index e3338a2f1be9..c86cde1978cd 100644
--- a/fs/btrfs/zoned.h
+++ b/fs/btrfs/zoned.h
@@ -41,6 +41,7 @@ int btrfs_reset_device_zone(struct btrfs_device *device, u64 physical,
 			    u64 length, u64 *bytes);
 int btrfs_ensure_empty_zones(struct btrfs_device *device, u64 start, u64 size);
 int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache);
+void btrfs_calc_zone_unusable(struct btrfs_block_group *cache);
 #else /* CONFIG_BLK_DEV_ZONED */
 static inline int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
 				     struct blk_zone *zone)
@@ -119,6 +120,8 @@ static inline int btrfs_load_block_group_zone_info(
 	return 0;
 }
 
+static inline void btrfs_calc_zone_unusable(struct btrfs_block_group *cache) { }
+
 #endif
 
 static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)
-- 
2.27.0
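The __btrfs_add_free_space_zoned() hunk above splits a returned range around the block group's write pointer (alloc_offset). A minimal user-space sketch of that split follows; the struct and function names here are illustrative, not the kernel's:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Sketch of the split done by __btrfs_add_free_space_zoned(): bytes at or
 * after the write pointer (alloc_offset) become free space again, while
 * bytes before it cannot be rewritten until the zone is reset and are
 * accounted as zone_unusable.
 */
struct zoned_split {
	uint64_t to_free;
	uint64_t to_unusable;
};

static struct zoned_split split_returned_range(uint64_t offset, uint64_t size,
					       uint64_t alloc_offset, int used)
{
	struct zoned_split s;

	if (!used)                              /* never written: all free */
		s.to_free = size;
	else if (offset >= alloc_offset)        /* wholly at/after the write pointer */
		s.to_free = size;
	else if (offset + size <= alloc_offset) /* wholly before the write pointer */
		s.to_free = 0;
	else                                    /* straddles the write pointer */
		s.to_free = offset + size - alloc_offset;
	s.to_unusable = size - s.to_free;
	return s;
}
```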


^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH v10 17/41] btrfs: do sequential extent allocation in ZONED mode
  2020-11-10 11:26 [PATCH v10 00/41] btrfs: zoned block device support Naohiro Aota
                   ` (15 preceding siblings ...)
  2020-11-10 11:26 ` [PATCH v10 16/41] btrfs: track unusable bytes for zones Naohiro Aota
@ 2020-11-10 11:26 ` Naohiro Aota
  2020-11-10 11:26 ` [PATCH v10 18/41] btrfs: reset zones of unused block groups Naohiro Aota
                   ` (25 subsequent siblings)
  42 siblings, 0 replies; 125+ messages in thread
From: Naohiro Aota @ 2020-11-10 11:26 UTC (permalink / raw)
  To: linux-btrfs, dsterba
  Cc: hare, linux-fsdevel, Jens Axboe, Christoph Hellwig,
	Darrick J. Wong, Naohiro Aota, Josef Bacik

This commit implements a sequential extent allocator for the ZONED mode.
This allocator just needs to check if there is enough space in the block
group. Therefore, it never manages bitmaps or clusters. Also add ASSERTs
to the corresponding functions.

Strictly speaking, with zone append writing, tracking the allocation
offset is unnecessary; checking space availability would be enough. But by
tracking the offset and returning it as the allocated region, we can skip
modifying ordered extents and checksum information when there is no IO
reordering.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
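
The core of do_allocation_zoned() below is a write-pointer bump inside one
block group. A minimal user-space sketch of that behavior (standalone
names, no locking shown; not the kernel API):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Sequential allocation sketch: succeed only if the tail of the block
 * group has room, then hand out the region at the current write pointer
 * and advance it. Returns 0 on success, -1 when the caller should move
 * on to the next block group.
 */
static int zoned_alloc(uint64_t bg_start, uint64_t bg_length,
		       uint64_t *alloc_offset, uint64_t num_bytes,
		       uint64_t *found)
{
	uint64_t avail = bg_length - *alloc_offset;

	if (avail < num_bytes)
		return -1;	/* not enough room behind the write pointer */
	*found = bg_start + *alloc_offset;
	*alloc_offset += num_bytes;
	return 0;
}
```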
---
 fs/btrfs/block-group.c      |  4 ++
 fs/btrfs/extent-tree.c      | 85 ++++++++++++++++++++++++++++++++++---
 fs/btrfs/free-space-cache.c |  6 +++
 3 files changed, 89 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index 723b7c183cd9..232885261c37 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -683,6 +683,10 @@ int btrfs_cache_block_group(struct btrfs_block_group *cache, int load_cache_only
 	struct btrfs_caching_control *caching_ctl;
 	int ret = 0;
 
+	/* Allocator for ZONED btrfs does not use the cache at all */
+	if (btrfs_is_zoned(fs_info))
+		return 0;
+
 	caching_ctl = kzalloc(sizeof(*caching_ctl), GFP_NOFS);
 	if (!caching_ctl)
 		return -ENOMEM;
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 09439782b9a8..ab0ce3ba2b89 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3563,6 +3563,7 @@ btrfs_release_block_group(struct btrfs_block_group *cache,
 
 enum btrfs_extent_allocation_policy {
 	BTRFS_EXTENT_ALLOC_CLUSTERED,
+	BTRFS_EXTENT_ALLOC_ZONED,
 };
 
 /*
@@ -3815,6 +3816,58 @@ static int do_allocation_clustered(struct btrfs_block_group *block_group,
 	return find_free_extent_unclustered(block_group, ffe_ctl);
 }
 
+/*
+ * Simple allocator for sequential only block group. It only allows
+ * sequential allocation. No need to play with trees. This function
+ * also reserves the bytes as in btrfs_add_reserved_bytes.
+ */
+static int do_allocation_zoned(struct btrfs_block_group *block_group,
+			       struct find_free_extent_ctl *ffe_ctl,
+			       struct btrfs_block_group **bg_ret)
+{
+	struct btrfs_space_info *space_info = block_group->space_info;
+	struct btrfs_free_space_ctl *ctl = block_group->free_space_ctl;
+	u64 start = block_group->start;
+	u64 num_bytes = ffe_ctl->num_bytes;
+	u64 avail;
+	int ret = 0;
+
+	ASSERT(btrfs_is_zoned(block_group->fs_info));
+
+	spin_lock(&space_info->lock);
+	spin_lock(&block_group->lock);
+
+	if (block_group->ro) {
+		ret = 1;
+		goto out;
+	}
+
+	avail = block_group->length - block_group->alloc_offset;
+	if (avail < num_bytes) {
+		ffe_ctl->max_extent_size = avail;
+		ret = 1;
+		goto out;
+	}
+
+	ffe_ctl->found_offset = start + block_group->alloc_offset;
+	block_group->alloc_offset += num_bytes;
+	spin_lock(&ctl->tree_lock);
+	ctl->free_space -= num_bytes;
+	spin_unlock(&ctl->tree_lock);
+
+	/*
+	 * We do not check if found_offset is aligned to stripesize. The
+	 * address is anyway rewritten when using zone append writing.
+	 */
+
+	ffe_ctl->search_start = ffe_ctl->found_offset;
+
+out:
+	spin_unlock(&block_group->lock);
+	spin_unlock(&space_info->lock);
+	return ret;
+}
+
 static int do_allocation(struct btrfs_block_group *block_group,
 			 struct find_free_extent_ctl *ffe_ctl,
 			 struct btrfs_block_group **bg_ret)
@@ -3822,6 +3875,8 @@ static int do_allocation(struct btrfs_block_group *block_group,
 	switch (ffe_ctl->policy) {
 	case BTRFS_EXTENT_ALLOC_CLUSTERED:
 		return do_allocation_clustered(block_group, ffe_ctl, bg_ret);
+	case BTRFS_EXTENT_ALLOC_ZONED:
+		return do_allocation_zoned(block_group, ffe_ctl, bg_ret);
 	default:
 		BUG();
 	}
@@ -3836,6 +3891,9 @@ static void release_block_group(struct btrfs_block_group *block_group,
 		ffe_ctl->retry_clustered = false;
 		ffe_ctl->retry_unclustered = false;
 		break;
+	case BTRFS_EXTENT_ALLOC_ZONED:
+		/* Nothing to do */
+		break;
 	default:
 		BUG();
 	}
@@ -3864,6 +3922,9 @@ static void found_extent(struct find_free_extent_ctl *ffe_ctl,
 	case BTRFS_EXTENT_ALLOC_CLUSTERED:
 		found_extent_clustered(ffe_ctl, ins);
 		break;
+	case BTRFS_EXTENT_ALLOC_ZONED:
+		/* Nothing to do */
+		break;
 	default:
 		BUG();
 	}
@@ -3879,6 +3940,9 @@ static int chunk_allocation_failed(struct find_free_extent_ctl *ffe_ctl)
 		 */
 		ffe_ctl->loop = LOOP_NO_EMPTY_SIZE;
 		return 0;
+	case BTRFS_EXTENT_ALLOC_ZONED:
+		/* Give up here */
+		return -ENOSPC;
 	default:
 		BUG();
 	}
@@ -4047,6 +4111,9 @@ static int prepare_allocation(struct btrfs_fs_info *fs_info,
 	case BTRFS_EXTENT_ALLOC_CLUSTERED:
 		return prepare_allocation_clustered(fs_info, ffe_ctl,
 						    space_info, ins);
+	case BTRFS_EXTENT_ALLOC_ZONED:
+		/* nothing to do */
+		return 0;
 	default:
 		BUG();
 	}
@@ -4110,6 +4177,9 @@ static noinline int find_free_extent(struct btrfs_root *root,
 	ffe_ctl.last_ptr = NULL;
 	ffe_ctl.use_cluster = true;
 
+	if (btrfs_is_zoned(fs_info))
+		ffe_ctl.policy = BTRFS_EXTENT_ALLOC_ZONED;
+
 	ins->type = BTRFS_EXTENT_ITEM_KEY;
 	ins->objectid = 0;
 	ins->offset = 0;
@@ -4252,20 +4322,23 @@ static noinline int find_free_extent(struct btrfs_root *root,
 		/* move on to the next group */
 		if (ffe_ctl.search_start + num_bytes >
 		    block_group->start + block_group->length) {
-			btrfs_add_free_space(block_group, ffe_ctl.found_offset,
-					     num_bytes);
+			btrfs_add_free_space_unused(block_group,
+						    ffe_ctl.found_offset,
+						    num_bytes);
 			goto loop;
 		}
 
 		if (ffe_ctl.found_offset < ffe_ctl.search_start)
-			btrfs_add_free_space(block_group, ffe_ctl.found_offset,
-				ffe_ctl.search_start - ffe_ctl.found_offset);
+			btrfs_add_free_space_unused(block_group,
+						    ffe_ctl.found_offset,
+						    ffe_ctl.search_start - ffe_ctl.found_offset);
 
 		ret = btrfs_add_reserved_bytes(block_group, ram_bytes,
 				num_bytes, delalloc);
 		if (ret == -EAGAIN) {
-			btrfs_add_free_space(block_group, ffe_ctl.found_offset,
-					     num_bytes);
+			btrfs_add_free_space_unused(block_group,
+						    ffe_ctl.found_offset,
+						    num_bytes);
 			goto loop;
 		}
 		btrfs_inc_block_group_reservations(block_group);
diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index f6434794cb0b..2161d0ad5cf0 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -2903,6 +2903,8 @@ u64 btrfs_find_space_for_alloc(struct btrfs_block_group *block_group,
 	u64 align_gap_len = 0;
 	enum btrfs_trim_state align_gap_trim_state = BTRFS_TRIM_STATE_UNTRIMMED;
 
+	ASSERT(!btrfs_is_zoned(block_group->fs_info));
+
 	spin_lock(&ctl->tree_lock);
 	entry = find_free_space(ctl, &offset, &bytes_search,
 				block_group->full_stripe_len, max_extent_size);
@@ -3034,6 +3036,8 @@ u64 btrfs_alloc_from_cluster(struct btrfs_block_group *block_group,
 	struct rb_node *node;
 	u64 ret = 0;
 
+	ASSERT(!btrfs_is_zoned(block_group->fs_info));
+
 	spin_lock(&cluster->lock);
 	if (bytes > cluster->max_size)
 		goto out;
@@ -3810,6 +3814,8 @@ int btrfs_trim_block_group(struct btrfs_block_group *block_group,
 	int ret;
 	u64 rem = 0;
 
+	ASSERT(!btrfs_is_zoned(block_group->fs_info));
+
 	*trimmed = 0;
 
 	spin_lock(&block_group->lock);
-- 
2.27.0



* [PATCH v10 18/41] btrfs: reset zones of unused block groups
  2020-11-10 11:26 [PATCH v10 00/41] btrfs: zoned block device support Naohiro Aota
                   ` (16 preceding siblings ...)
  2020-11-10 11:26 ` [PATCH v10 17/41] btrfs: do sequential extent allocation in ZONED mode Naohiro Aota
@ 2020-11-10 11:26 ` Naohiro Aota
  2020-11-10 11:26 ` [PATCH v10 19/41] btrfs: redirty released extent buffers in ZONED mode Naohiro Aota
                   ` (24 subsequent siblings)
  42 siblings, 0 replies; 125+ messages in thread
From: Naohiro Aota @ 2020-11-10 11:26 UTC (permalink / raw)
  To: linux-btrfs, dsterba
  Cc: hare, linux-fsdevel, Jens Axboe, Christoph Hellwig,
	Darrick J. Wong, Naohiro Aota

For a ZONED volume, a block group maps to a zone of the device. For
deleted unused block groups, the zone of the block group can be reset to
rewind the zone write pointer to the start of the zone.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
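
The btrfs_can_zone_reset() helper added below gates a reset on the zone
type and on whole-zone alignment. Sketched standalone (illustrative names,
zone type passed in as a flag rather than looked up from the device):

```c
#include <assert.h>
#include <stdint.h>

/*
 * A zone reset rewinds a whole sequential zone, so it is only valid when
 * the range covers complete zones: sequential zone type, and both the
 * start and the length aligned to the zone size.
 */
static int can_zone_reset(int is_sequential, uint64_t physical,
			  uint64_t length, uint64_t zone_size)
{
	if (!is_sequential)
		return 0;	/* conventional zones are discarded, not reset */
	if (physical % zone_size || length % zone_size)
		return 0;	/* partial zones cannot be reset */
	return 1;
}
```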
---
 fs/btrfs/block-group.c |  8 ++++++--
 fs/btrfs/extent-tree.c | 17 ++++++++++++-----
 fs/btrfs/zoned.h       | 16 ++++++++++++++++
 3 files changed, 34 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index 232885261c37..31511e59ca74 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -1470,8 +1470,12 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
 		if (!async_trim_enabled && btrfs_test_opt(fs_info, DISCARD_ASYNC))
 			goto flip_async;
 
-		/* DISCARD can flip during remount */
-		trimming = btrfs_test_opt(fs_info, DISCARD_SYNC);
+		/*
+		 * DISCARD can flip during remount. In ZONED mode, we need
+		 * to reset sequential required zones.
+		 */
+		trimming = btrfs_test_opt(fs_info, DISCARD_SYNC) ||
+				btrfs_is_zoned(fs_info);
 
 		/* Implicit trim during transaction commit. */
 		if (trimming)
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index ab0ce3ba2b89..11e6483372c3 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -1331,6 +1331,9 @@ int btrfs_discard_extent(struct btrfs_fs_info *fs_info, u64 bytenr,
 
 		stripe = bbio->stripes;
 		for (i = 0; i < bbio->num_stripes; i++, stripe++) {
+			struct btrfs_device *dev = stripe->dev;
+			u64 physical = stripe->physical;
+			u64 length = stripe->length;
 			u64 bytes;
 			struct request_queue *req_q;
 
@@ -1338,14 +1341,18 @@ int btrfs_discard_extent(struct btrfs_fs_info *fs_info, u64 bytenr,
 				ASSERT(btrfs_test_opt(fs_info, DEGRADED));
 				continue;
 			}
+
 			req_q = bdev_get_queue(stripe->dev->bdev);
-			if (!blk_queue_discard(req_q))
+			/* Zone reset in ZONED mode */
+			if (btrfs_can_zone_reset(dev, physical, length))
+				ret = btrfs_reset_device_zone(dev, physical,
+							      length, &bytes);
+			else if (blk_queue_discard(req_q))
+				ret = btrfs_issue_discard(dev->bdev, physical,
+							  length, &bytes);
+			else
 				continue;
 
-			ret = btrfs_issue_discard(stripe->dev->bdev,
-						  stripe->physical,
-						  stripe->length,
-						  &bytes);
 			if (!ret) {
 				discarded_bytes += bytes;
 			} else if (ret != -EOPNOTSUPP) {
diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h
index c86cde1978cd..6a07af0c7f6d 100644
--- a/fs/btrfs/zoned.h
+++ b/fs/btrfs/zoned.h
@@ -209,4 +209,20 @@ static inline u64 btrfs_align_offset_to_zone(struct btrfs_device *device,
 	return ALIGN(pos, device->zone_info->zone_size);
 }
 
+static inline bool btrfs_can_zone_reset(struct btrfs_device *device,
+					u64 physical, u64 length)
+{
+	u64 zone_size;
+
+	if (!btrfs_dev_is_sequential(device, physical))
+		return false;
+
+	zone_size = device->zone_info->zone_size;
+	if (!IS_ALIGNED(physical, zone_size) ||
+	    !IS_ALIGNED(length, zone_size))
+		return false;
+
+	return true;
+}
+
 #endif
-- 
2.27.0



* [PATCH v10 19/41] btrfs: redirty released extent buffers in ZONED mode
  2020-11-10 11:26 [PATCH v10 00/41] btrfs: zoned block device support Naohiro Aota
                   ` (17 preceding siblings ...)
  2020-11-10 11:26 ` [PATCH v10 18/41] btrfs: reset zones of unused block groups Naohiro Aota
@ 2020-11-10 11:26 ` Naohiro Aota
  2020-11-10 11:26 ` [PATCH v10 20/41] btrfs: extract page adding function Naohiro Aota
                   ` (23 subsequent siblings)
  42 siblings, 0 replies; 125+ messages in thread
From: Naohiro Aota @ 2020-11-10 11:26 UTC (permalink / raw)
  To: linux-btrfs, dsterba
  Cc: hare, linux-fsdevel, Jens Axboe, Christoph Hellwig,
	Darrick J. Wong, Naohiro Aota

Tree manipulating operations like merging nodes often release
once-allocated tree nodes. Btrfs cleans such nodes so that pages in the
node are not uselessly written out. On ZONED volumes, however, this
optimization blocks the following IOs, as cancelling the write-out of the
freed blocks breaks the sequential write sequence expected by the device.

This patch introduces a list of clean and unwritten extent buffers that
have been released in a transaction. Btrfs redirties these buffers so that
btree_write_cache_pages() can send proper bios to the devices.

It also clears the entire content of the extent buffer so as not to
confuse raw block scanners, e.g. btrfsck. Since clearing the content makes
csum_dirty_buffer() complain about a bytenr mismatch, skip the check and
checksum using the newly introduced buffer flag EXTENT_BUFFER_NO_CHECK.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
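
The add/drain pair introduced below (btrfs_redirty_list_add() /
btrfs_free_redirty_list()) can be modeled with a toy list. This sketch
mirrors the at-most-once insertion implied by the
list_empty(&eb->release_list) check; types and names are illustrative, not
the kernel's:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy model of the redirty list: a buffer joins the transaction's list
 * at most once, gets redirtied so write-out happens in order, and the
 * whole list is drained at commit time.
 */
struct toy_eb {
	int on_list;
	int dirty;
	struct toy_eb *next;
};

static void redirty_list_add(struct toy_eb **head, struct toy_eb *eb)
{
	if (eb->on_list)
		return;		/* already queued in this transaction */
	eb->dirty = 1;		/* redirty for ordered write-out */
	eb->on_list = 1;
	eb->next = *head;
	*head = eb;
}

static int free_redirty_list(struct toy_eb **head)
{
	int n = 0;

	while (*head) {
		struct toy_eb *eb = *head;

		*head = eb->next;
		eb->on_list = 0;
		n++;
	}
	return n;
}
```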
---
 fs/btrfs/disk-io.c     |  8 ++++++++
 fs/btrfs/extent-tree.c | 12 +++++++++++-
 fs/btrfs/extent_io.c   |  4 ++++
 fs/btrfs/extent_io.h   |  2 ++
 fs/btrfs/transaction.c | 10 ++++++++++
 fs/btrfs/transaction.h |  3 +++
 fs/btrfs/tree-log.c    |  6 ++++++
 fs/btrfs/zoned.c       | 37 +++++++++++++++++++++++++++++++++++++
 fs/btrfs/zoned.h       |  7 +++++++
 9 files changed, 88 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 509085a368bb..c0180fbd5c78 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -462,6 +462,12 @@ static int csum_dirty_buffer(struct btrfs_fs_info *fs_info, struct page *page)
 		return 0;
 
 	found_start = btrfs_header_bytenr(eb);
+
+	if (test_bit(EXTENT_BUFFER_NO_CHECK, &eb->bflags)) {
+		WARN_ON(found_start != 0);
+		return 0;
+	}
+
 	/*
 	 * Please do not consolidate these warnings into a single if.
 	 * It is useful to know what went wrong.
@@ -4616,6 +4622,8 @@ void btrfs_cleanup_one_transaction(struct btrfs_transaction *cur_trans,
 				     EXTENT_DIRTY);
 	btrfs_destroy_pinned_extent(fs_info, &cur_trans->pinned_extents);
 
+	btrfs_free_redirty_list(cur_trans);
+
 	cur_trans->state =TRANS_STATE_COMPLETED;
 	wake_up(&cur_trans->commit_wait);
 }
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 11e6483372c3..99640dacf8e6 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3422,8 +3422,10 @@ void btrfs_free_tree_block(struct btrfs_trans_handle *trans,
 
 		if (root->root_key.objectid != BTRFS_TREE_LOG_OBJECTID) {
 			ret = check_ref_cleanup(trans, buf->start);
-			if (!ret)
+			if (!ret) {
+				btrfs_redirty_list_add(trans->transaction, buf);
 				goto out;
+			}
 		}
 
 		pin = 0;
@@ -3435,6 +3437,13 @@ void btrfs_free_tree_block(struct btrfs_trans_handle *trans,
 			goto out;
 		}
 
+		if (btrfs_is_zoned(fs_info)) {
+			btrfs_redirty_list_add(trans->transaction, buf);
+			pin_down_extent(trans, cache, buf->start, buf->len, 1);
+			btrfs_put_block_group(cache);
+			goto out;
+		}
+
 		WARN_ON(test_bit(EXTENT_BUFFER_DIRTY, &buf->bflags));
 
 		btrfs_add_free_space(cache, buf->start, buf->len);
@@ -4771,6 +4780,7 @@ btrfs_init_new_buffer(struct btrfs_trans_handle *trans, struct btrfs_root *root,
 	__btrfs_tree_lock(buf, nest);
 	btrfs_clean_tree_block(buf);
 	clear_bit(EXTENT_BUFFER_STALE, &buf->bflags);
+	clear_bit(EXTENT_BUFFER_NO_CHECK, &buf->bflags);
 
 	btrfs_set_lock_blocking_write(buf);
 	set_extent_buffer_uptodate(buf);
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 60f5f68d892d..e91c504fe973 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -24,6 +24,7 @@
 #include "rcu-string.h"
 #include "backref.h"
 #include "disk-io.h"
+#include "zoned.h"
 
 static struct kmem_cache *extent_state_cache;
 static struct kmem_cache *extent_buffer_cache;
@@ -4959,6 +4960,7 @@ __alloc_extent_buffer(struct btrfs_fs_info *fs_info, u64 start,
 
 	btrfs_leak_debug_add(&fs_info->eb_leak_lock, &eb->leak_list,
 			     &fs_info->allocated_ebs);
+	INIT_LIST_HEAD(&eb->release_list);
 
 	spin_lock_init(&eb->refs_lock);
 	atomic_set(&eb->refs, 1);
@@ -5744,6 +5746,8 @@ void write_extent_buffer(const struct extent_buffer *eb, const void *srcv,
 	char *src = (char *)srcv;
 	unsigned long i = start >> PAGE_SHIFT;
 
+	WARN_ON(test_bit(EXTENT_BUFFER_NO_CHECK, &eb->bflags));
+
 	if (check_eb_range(eb, start, len))
 		return;
 
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index f39d02e7f7ef..5f2ccfd0205e 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -30,6 +30,7 @@ enum {
 	EXTENT_BUFFER_IN_TREE,
 	/* write IO error */
 	EXTENT_BUFFER_WRITE_ERR,
+	EXTENT_BUFFER_NO_CHECK,
 };
 
 /* these are flags for __process_pages_contig */
@@ -107,6 +108,7 @@ struct extent_buffer {
 	 */
 	wait_queue_head_t read_lock_wq;
 	struct page *pages[INLINE_EXTENT_BUFFER_PAGES];
+	struct list_head release_list;
 #ifdef CONFIG_BTRFS_DEBUG
 	int spinning_writers;
 	atomic_t spinning_readers;
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 52ada47aff50..a8561536cd0d 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -22,6 +22,7 @@
 #include "qgroup.h"
 #include "block-group.h"
 #include "space-info.h"
+#include "zoned.h"
 
 #define BTRFS_ROOT_TRANS_TAG 0
 
@@ -336,6 +337,8 @@ static noinline int join_transaction(struct btrfs_fs_info *fs_info,
 	spin_lock_init(&cur_trans->dirty_bgs_lock);
 	INIT_LIST_HEAD(&cur_trans->deleted_bgs);
 	spin_lock_init(&cur_trans->dropped_roots_lock);
+	INIT_LIST_HEAD(&cur_trans->releasing_ebs);
+	spin_lock_init(&cur_trans->releasing_ebs_lock);
 	list_add_tail(&cur_trans->list, &fs_info->trans_list);
 	extent_io_tree_init(fs_info, &cur_trans->dirty_pages,
 			IO_TREE_TRANS_DIRTY_PAGES, fs_info->btree_inode);
@@ -2345,6 +2348,13 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
 		goto scrub_continue;
 	}
 
+	/*
+	 * At this point, we should have written all the tree blocks
+	 * allocated in this transaction. So it's now safe to free the
+	 * redirtied extent buffers.
+	 */
+	btrfs_free_redirty_list(cur_trans);
+
 	ret = write_all_supers(fs_info, 0);
 	/*
 	 * the super is written, we can safely allow the tree-loggers
diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
index 858d9153a1cd..380e0aaa15b3 100644
--- a/fs/btrfs/transaction.h
+++ b/fs/btrfs/transaction.h
@@ -92,6 +92,9 @@ struct btrfs_transaction {
 	 */
 	atomic_t pending_ordered;
 	wait_queue_head_t pending_wait;
+
+	spinlock_t releasing_ebs_lock;
+	struct list_head releasing_ebs;
 };
 
 #define __TRANS_FREEZABLE	(1U << 0)
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 56cbc1706b6f..5f585cf57383 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -20,6 +20,7 @@
 #include "inode-map.h"
 #include "block-group.h"
 #include "space-info.h"
+#include "zoned.h"
 
 /* magic values for the inode_only field in btrfs_log_inode:
  *
@@ -2742,6 +2743,8 @@ static noinline int walk_down_log_tree(struct btrfs_trans_handle *trans,
 						free_extent_buffer(next);
 						return ret;
 					}
+					btrfs_redirty_list_add(
+						trans->transaction, next);
 				} else {
 					if (test_and_clear_bit(EXTENT_BUFFER_DIRTY, &next->bflags))
 						clear_extent_buffer_dirty(next);
@@ -3277,6 +3280,9 @@ static void free_log_tree(struct btrfs_trans_handle *trans,
 	clear_extent_bits(&log->dirty_log_pages, 0, (u64)-1,
 			  EXTENT_DIRTY | EXTENT_NEW | EXTENT_NEED_WAIT);
 	extent_io_tree_release(&log->log_csum_range);
+
+	if (trans && log->node)
+		btrfs_redirty_list_add(trans->transaction, log->node);
 	btrfs_put_root(log);
 }
 
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index 5ee26b9fe5b1..b56bfeaf8744 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -10,6 +10,7 @@
 #include "rcu-string.h"
 #include "disk-io.h"
 #include "block-group.h"
+#include "transaction.h"
 
 /* Maximum number of zones to report per blkdev_report_zones() call */
 #define BTRFS_REPORT_NR_ZONES   4096
@@ -1023,3 +1024,39 @@ void btrfs_calc_zone_unusable(struct btrfs_block_group *cache)
 	 */
 	btrfs_free_excluded_extents(cache);
 }
+
+void btrfs_redirty_list_add(struct btrfs_transaction *trans,
+			    struct extent_buffer *eb)
+{
+	struct btrfs_fs_info *fs_info = eb->fs_info;
+
+	if (!btrfs_is_zoned(fs_info) ||
+	    btrfs_header_flag(eb, BTRFS_HEADER_FLAG_WRITTEN) ||
+	    !list_empty(&eb->release_list))
+		return;
+
+	set_extent_buffer_dirty(eb);
+	set_extent_bits_nowait(&trans->dirty_pages, eb->start,
+			       eb->start + eb->len - 1, EXTENT_DIRTY);
+	memzero_extent_buffer(eb, 0, eb->len);
+	set_bit(EXTENT_BUFFER_NO_CHECK, &eb->bflags);
+
+	spin_lock(&trans->releasing_ebs_lock);
+	list_add_tail(&eb->release_list, &trans->releasing_ebs);
+	spin_unlock(&trans->releasing_ebs_lock);
+	atomic_inc(&eb->refs);
+}
+
+void btrfs_free_redirty_list(struct btrfs_transaction *trans)
+{
+	spin_lock(&trans->releasing_ebs_lock);
+	while (!list_empty(&trans->releasing_ebs)) {
+		struct extent_buffer *eb;
+
+		eb = list_first_entry(&trans->releasing_ebs,
+				      struct extent_buffer, release_list);
+		list_del_init(&eb->release_list);
+		free_extent_buffer(eb);
+	}
+	spin_unlock(&trans->releasing_ebs_lock);
+}
diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h
index 6a07af0c7f6d..a7de80c313be 100644
--- a/fs/btrfs/zoned.h
+++ b/fs/btrfs/zoned.h
@@ -42,6 +42,9 @@ int btrfs_reset_device_zone(struct btrfs_device *device, u64 physical,
 int btrfs_ensure_empty_zones(struct btrfs_device *device, u64 start, u64 size);
 int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache);
 void btrfs_calc_zone_unusable(struct btrfs_block_group *cache);
+void btrfs_redirty_list_add(struct btrfs_transaction *trans,
+			    struct extent_buffer *eb);
+void btrfs_free_redirty_list(struct btrfs_transaction *trans);
 #else /* CONFIG_BLK_DEV_ZONED */
 static inline int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
 				     struct blk_zone *zone)
@@ -122,6 +125,10 @@ static inline int btrfs_load_block_group_zone_info(
 
 static inline void btrfs_calc_zone_unusable(struct btrfs_block_group *cache) { }
 
+static inline void btrfs_redirty_list_add(struct btrfs_transaction *trans,
+					  struct extent_buffer *eb) { }
+static inline void btrfs_free_redirty_list(struct btrfs_transaction *trans) { }
+
 #endif
 
 static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)
-- 
2.27.0



* [PATCH v10 20/41] btrfs: extract page adding function
  2020-11-10 11:26 [PATCH v10 00/41] btrfs: zoned block device support Naohiro Aota
                   ` (18 preceding siblings ...)
  2020-11-10 11:26 ` [PATCH v10 19/41] btrfs: redirty released extent buffers in ZONED mode Naohiro Aota
@ 2020-11-10 11:26 ` Naohiro Aota
  2020-11-10 11:26 ` [PATCH v10 21/41] btrfs: use bio_add_zone_append_page for zoned btrfs Naohiro Aota
                   ` (22 subsequent siblings)
  42 siblings, 0 replies; 125+ messages in thread
From: Naohiro Aota @ 2020-11-10 11:26 UTC (permalink / raw)
  To: linux-btrfs, dsterba
  Cc: hare, linux-fsdevel, Jens Axboe, Christoph Hellwig,
	Darrick J. Wong, Naohiro Aota, Josef Bacik

This commit extracts the page-adding part of submit_extent_page() into a
separate function. The page is added only when the bio_flags are the same,
the page is contiguous, and the added page fits in the same stripe as the
pages already in the bio.

The condition checks are reordered to allow an early return, avoiding a
possibly heavy call to btrfs_bio_fits_in_stripe().
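
The reordered checks — cheap comparisons first, the heavy stripe check
last — can be sketched outside the kernel with stand-in types (all names
and the 128-sector stripe width below are illustrative, not btrfs's):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative stand-in for the parts of struct bio we care about. */
struct fake_bio {
	uint64_t end_sector;	/* sector where the bio currently ends */
	unsigned long flags;	/* stands in for prev_bio_flags */
};

/* Placeholder for the "possibly heavy" btrfs_bio_fits_in_stripe();
 * here we simply pretend stripes are 128 sectors wide. */
static bool fits_in_stripe(const struct fake_bio *bio, uint64_t sector)
{
	return (bio->end_sector / 128) == (sector / 128);
}

/* Mirrors the early-return structure of btrfs_bio_add_page():
 * cheapest checks first, the expensive check last. */
static bool can_merge_page(const struct fake_bio *bio, uint64_t sector,
			   unsigned long bio_flags)
{
	if (bio->flags != bio_flags)	/* cheap: flag comparison */
		return false;
	if (bio->end_sector != sector)	/* cheap: contiguity */
		return false;
	return fits_in_stripe(bio, sector);	/* heavy check last */
}
```

The point of the ordering is simply that a flag or contiguity mismatch
returns before the stripe lookup ever runs.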

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/extent_io.c | 56 ++++++++++++++++++++++++++++++++------------
 1 file changed, 41 insertions(+), 15 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index e91c504fe973..868ae0874a34 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3012,6 +3012,44 @@ struct bio *btrfs_bio_clone_partial(struct bio *orig, int offset, int size)
 	return bio;
 }
 
+/**
+ * btrfs_bio_add_page	-	attempt to add a page to bio
+ * @bio:	destination bio
+ * @page:	page to add to the bio
+ * @logical:	logical offset of the new bio, also used to check whether
+ *              the page being added is contiguous with the previous one
+ * @size:	portion of page that we want to write
+ * @pg_offset:	starting offset in the page
+ * @prev_bio_flags:  flags of previous bio to see if we can merge the current one
+ * @bio_flags:	flags of the current bio to see if we can merge them
+ *
+ * Attempt to add a page to bio, considering stripe alignment etc. Return
+ * true if the page was successfully added, false otherwise.
+ */
+static bool btrfs_bio_add_page(struct bio *bio, struct page *page, u64 logical,
+			       unsigned int size, unsigned int pg_offset,
+			       unsigned long prev_bio_flags,
+			       unsigned long bio_flags)
+{
+	sector_t sector = logical >> SECTOR_SHIFT;
+	bool contig;
+
+	if (prev_bio_flags != bio_flags)
+		return false;
+
+	if (prev_bio_flags & EXTENT_BIO_COMPRESSED)
+		contig = bio->bi_iter.bi_sector == sector;
+	else
+		contig = bio_end_sector(bio) == sector;
+	if (!contig)
+		return false;
+
+	if (btrfs_bio_fits_in_stripe(page, size, bio, bio_flags))
+		return false;
+
+	return bio_add_page(bio, page, size, pg_offset) == size;
+}
+
 /*
  * @opf:	bio REQ_OP_* and REQ_* flags as one value
  * @wbc:	optional writeback control for io accounting
@@ -3040,27 +3078,15 @@ static int submit_extent_page(unsigned int opf,
 	int ret = 0;
 	struct bio *bio;
 	size_t page_size = min_t(size_t, size, PAGE_SIZE);
-	sector_t sector = offset >> 9;
 	struct extent_io_tree *tree = &BTRFS_I(page->mapping->host)->io_tree;
 
 	ASSERT(bio_ret);
 
 	if (*bio_ret) {
-		bool contig;
-		bool can_merge = true;
-
 		bio = *bio_ret;
-		if (prev_bio_flags & EXTENT_BIO_COMPRESSED)
-			contig = bio->bi_iter.bi_sector == sector;
-		else
-			contig = bio_end_sector(bio) == sector;
-
-		if (btrfs_bio_fits_in_stripe(page, page_size, bio, bio_flags))
-			can_merge = false;
-
-		if (prev_bio_flags != bio_flags || !contig || !can_merge ||
-		    force_bio_submit ||
-		    bio_add_page(bio, page, page_size, pg_offset) < page_size) {
+		if (force_bio_submit ||
+		    !btrfs_bio_add_page(bio, page, offset, page_size, pg_offset,
+					prev_bio_flags, bio_flags)) {
 			ret = submit_one_bio(bio, mirror_num, prev_bio_flags);
 			if (ret < 0) {
 				*bio_ret = NULL;
-- 
2.27.0

* [PATCH v10 21/41] btrfs: use bio_add_zone_append_page for zoned btrfs
  2020-11-10 11:26 [PATCH v10 00/41] btrfs: zoned block device support Naohiro Aota
                   ` (19 preceding siblings ...)
  2020-11-10 11:26 ` [PATCH v10 20/41] btrfs: extract page adding function Naohiro Aota
@ 2020-11-10 11:26 ` Naohiro Aota
  2020-11-10 11:26 ` [PATCH v10 22/41] btrfs: handle REQ_OP_ZONE_APPEND as writing Naohiro Aota
                   ` (21 subsequent siblings)
  42 siblings, 0 replies; 125+ messages in thread
From: Naohiro Aota @ 2020-11-10 11:26 UTC (permalink / raw)
  To: linux-btrfs, dsterba
  Cc: hare, linux-fsdevel, Jens Axboe, Christoph Hellwig,
	Darrick J. Wong, Naohiro Aota

A zoned device has its own hardware restrictions, e.g. max_zone_append_size
when using REQ_OP_ZONE_APPEND. To follow these restrictions, use
bio_add_zone_append_page() instead of bio_add_page(). Since
bio_add_zone_append_page() needs the target device, this commit reads the
chunk information to memoize the target device in btrfs_io_bio(bio)->device.

Currently, zoned btrfs only supports the SINGLE profile. In the future,
btrfs_io_bio can hold an extent_map and check the restrictions for all the
devices the bio will be mapped to.
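
The restriction this addresses can be made concrete with a tiny helper
(the 128 KiB cap below is an assumed example value for
max_zone_append_size, not a property of any particular device):

```c
#include <assert.h>
#include <stdint.h>

/* Number of REQ_OP_ZONE_APPEND bios needed for a write of @len bytes when
 * the device caps each zone append at @max_zone_append_size bytes; this is
 * the per-bio limit that bio_add_zone_append_page() enforces by refusing
 * to add pages past it. */
static uint64_t zone_append_bios_needed(uint64_t len,
					uint64_t max_zone_append_size)
{
	return (len + max_zone_append_size - 1) / max_zone_append_size;
}
```

For example, a 256 KiB write against a 128 KiB limit needs two bios, which
is why the submission path must be able to stop adding pages mid-write.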

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/extent_io.c | 30 +++++++++++++++++++++++++++---
 1 file changed, 27 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 868ae0874a34..b9b366f4d942 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3033,6 +3033,7 @@ static bool btrfs_bio_add_page(struct bio *bio, struct page *page, u64 logical,
 {
 	sector_t sector = logical >> SECTOR_SHIFT;
 	bool contig;
+	int ret;
 
 	if (prev_bio_flags != bio_flags)
 		return false;
@@ -3047,7 +3048,12 @@ static bool btrfs_bio_add_page(struct bio *bio, struct page *page, u64 logical,
 	if (btrfs_bio_fits_in_stripe(page, size, bio, bio_flags))
 		return false;
 
-	return bio_add_page(bio, page, size, pg_offset) == size;
+	if (bio_op(bio) == REQ_OP_ZONE_APPEND)
+		ret = bio_add_zone_append_page(bio, page, size, pg_offset);
+	else
+		ret = bio_add_page(bio, page, size, pg_offset);
+
+	return ret == size;
 }
 
 /*
@@ -3078,7 +3084,9 @@ static int submit_extent_page(unsigned int opf,
 	int ret = 0;
 	struct bio *bio;
 	size_t page_size = min_t(size_t, size, PAGE_SIZE);
-	struct extent_io_tree *tree = &BTRFS_I(page->mapping->host)->io_tree;
+	struct btrfs_inode *inode = BTRFS_I(page->mapping->host);
+	struct extent_io_tree *tree = &inode->io_tree;
+	struct btrfs_fs_info *fs_info = inode->root->fs_info;
 
 	ASSERT(bio_ret);
 
@@ -3109,11 +3117,27 @@ static int submit_extent_page(unsigned int opf,
 	if (wbc) {
 		struct block_device *bdev;
 
-		bdev = BTRFS_I(page->mapping->host)->root->fs_info->fs_devices->latest_bdev;
+		bdev = fs_info->fs_devices->latest_bdev;
 		bio_set_dev(bio, bdev);
 		wbc_init_bio(wbc, bio);
 		wbc_account_cgroup_owner(wbc, page, page_size);
 	}
+	if (btrfs_is_zoned(fs_info) &&
+	    bio_op(bio) == REQ_OP_ZONE_APPEND) {
+		struct extent_map *em;
+		struct map_lookup *map;
+
+		em = btrfs_get_chunk_map(fs_info, offset, page_size);
+		if (IS_ERR(em))
+			return PTR_ERR(em);
+
+		map = em->map_lookup;
+		/* We only support SINGLE profile for now */
+		ASSERT(map->num_stripes == 1);
+		btrfs_io_bio(bio)->device = map->stripes[0].dev;
+
+		free_extent_map(em);
+	}
 
 	*bio_ret = bio;
 
-- 
2.27.0

* [PATCH v10 22/41] btrfs: handle REQ_OP_ZONE_APPEND as writing
  2020-11-10 11:26 [PATCH v10 00/41] btrfs: zoned block device support Naohiro Aota
                   ` (20 preceding siblings ...)
  2020-11-10 11:26 ` [PATCH v10 21/41] btrfs: use bio_add_zone_append_page for zoned btrfs Naohiro Aota
@ 2020-11-10 11:26 ` Naohiro Aota
  2020-11-10 11:26 ` [PATCH v10 23/41] btrfs: split ordered extent when bio is sent Naohiro Aota
                   ` (20 subsequent siblings)
  42 siblings, 0 replies; 125+ messages in thread
From: Naohiro Aota @ 2020-11-10 11:26 UTC (permalink / raw)
  To: linux-btrfs, dsterba
  Cc: hare, linux-fsdevel, Jens Axboe, Christoph Hellwig,
	Darrick J. Wong, Naohiro Aota, Josef Bacik

ZONED btrfs uses REQ_OP_ZONE_APPEND bios for writing to actual devices.
Make btrfs_end_bio() and btrfs_op() aware of them.
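
The op mapping this patch extends can be mirrored in a self-contained
sketch (the enum values here are illustrative; the kernel's REQ_OP_* and
BTRFS_MAP_* constants differ):

```c
#include <assert.h>

/* Illustrative stand-ins for the kernel's request-op and map-op enums. */
enum req_op { OP_READ, OP_WRITE, OP_DISCARD, OP_ZONE_APPEND };
enum map_op { MAP_READ, MAP_WRITE, MAP_DISCARD };

/* Mirrors btrfs_op() after this patch: a zone append writes data, so it
 * must be treated as a write everywhere btrfs_op() is consulted. */
static enum map_op to_map_op(enum req_op op)
{
	switch (op) {
	case OP_DISCARD:
		return MAP_DISCARD;
	case OP_WRITE:
	case OP_ZONE_APPEND:	/* the case this patch adds */
		return MAP_WRITE;
	default:
		return MAP_READ;
	}
}
```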

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/disk-io.c |  4 ++--
 fs/btrfs/inode.c   | 10 +++++-----
 fs/btrfs/volumes.c |  8 ++++----
 fs/btrfs/volumes.h |  1 +
 4 files changed, 12 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index c0180fbd5c78..8acf1ed75889 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -652,7 +652,7 @@ static void end_workqueue_bio(struct bio *bio)
 	fs_info = end_io_wq->info;
 	end_io_wq->status = bio->bi_status;
 
-	if (bio_op(bio) == REQ_OP_WRITE) {
+	if (btrfs_op(bio) == BTRFS_MAP_WRITE) {
 		if (end_io_wq->metadata == BTRFS_WQ_ENDIO_METADATA)
 			wq = fs_info->endio_meta_write_workers;
 		else if (end_io_wq->metadata == BTRFS_WQ_ENDIO_FREE_SPACE)
@@ -827,7 +827,7 @@ blk_status_t btrfs_submit_metadata_bio(struct inode *inode, struct bio *bio,
 	int async = check_async_write(fs_info, BTRFS_I(inode));
 	blk_status_t ret;
 
-	if (bio_op(bio) != REQ_OP_WRITE) {
+	if (btrfs_op(bio) != BTRFS_MAP_WRITE) {
 		/*
 		 * called for a read, do the setup so that checksum validation
 		 * can happen in the async kernel threads
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 936c3137c646..591ca539e444 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2192,7 +2192,7 @@ blk_status_t btrfs_submit_data_bio(struct inode *inode, struct bio *bio,
 	if (btrfs_is_free_space_inode(BTRFS_I(inode)))
 		metadata = BTRFS_WQ_ENDIO_FREE_SPACE;
 
-	if (bio_op(bio) != REQ_OP_WRITE) {
+	if (btrfs_op(bio) != BTRFS_MAP_WRITE) {
 		ret = btrfs_bio_wq_end_io(fs_info, bio, metadata);
 		if (ret)
 			goto out;
@@ -7526,7 +7526,7 @@ static void btrfs_dio_private_put(struct btrfs_dio_private *dip)
 	if (!refcount_dec_and_test(&dip->refs))
 		return;
 
-	if (bio_op(dip->dio_bio) == REQ_OP_WRITE) {
+	if (btrfs_op(dip->dio_bio) == BTRFS_MAP_WRITE) {
 		__endio_write_update_ordered(BTRFS_I(dip->inode),
 					     dip->logical_offset,
 					     dip->bytes,
@@ -7695,7 +7695,7 @@ static inline blk_status_t btrfs_submit_dio_bio(struct bio *bio,
 {
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	struct btrfs_dio_private *dip = bio->bi_private;
-	bool write = bio_op(bio) == REQ_OP_WRITE;
+	bool write = btrfs_op(bio) == BTRFS_MAP_WRITE;
 	blk_status_t ret;
 
 	/* Check btrfs_submit_bio_hook() for rules about async submit. */
@@ -7746,7 +7746,7 @@ static struct btrfs_dio_private *btrfs_create_dio_private(struct bio *dio_bio,
 							  struct inode *inode,
 							  loff_t file_offset)
 {
-	const bool write = (bio_op(dio_bio) == REQ_OP_WRITE);
+	const bool write = (btrfs_op(dio_bio) == BTRFS_MAP_WRITE);
 	const bool csum = !(BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM);
 	size_t dip_size;
 	struct btrfs_dio_private *dip;
@@ -7777,7 +7777,7 @@ static struct btrfs_dio_private *btrfs_create_dio_private(struct bio *dio_bio,
 static blk_qc_t btrfs_submit_direct(struct inode *inode, struct iomap *iomap,
 		struct bio *dio_bio, loff_t file_offset)
 {
-	const bool write = (bio_op(dio_bio) == REQ_OP_WRITE);
+	const bool write = (btrfs_op(dio_bio) == BTRFS_MAP_WRITE);
 	const bool csum = !(BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM);
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	const bool raid56 = (btrfs_data_alloc_profile(fs_info) &
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index c0e27c1e2559..683b3ed06226 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6451,7 +6451,7 @@ static void btrfs_end_bio(struct bio *bio)
 			struct btrfs_device *dev = btrfs_io_bio(bio)->device;
 
 			ASSERT(dev->bdev);
-			if (bio_op(bio) == REQ_OP_WRITE)
+			if (btrfs_op(bio) == BTRFS_MAP_WRITE)
 				btrfs_dev_stat_inc_and_print(dev,
 						BTRFS_DEV_STAT_WRITE_ERRS);
 			else if (!(bio->bi_opf & REQ_RAHEAD))
@@ -6564,10 +6564,10 @@ blk_status_t btrfs_map_bio(struct btrfs_fs_info *fs_info, struct bio *bio,
 	atomic_set(&bbio->stripes_pending, bbio->num_stripes);
 
 	if ((bbio->map_type & BTRFS_BLOCK_GROUP_RAID56_MASK) &&
-	    ((bio_op(bio) == REQ_OP_WRITE) || (mirror_num > 1))) {
+	    ((btrfs_op(bio) == BTRFS_MAP_WRITE) || (mirror_num > 1))) {
 		/* In this case, map_length has been set to the length of
 		   a single stripe; not the whole write */
-		if (bio_op(bio) == REQ_OP_WRITE) {
+		if (btrfs_op(bio) == BTRFS_MAP_WRITE) {
 			ret = raid56_parity_write(fs_info, bio, bbio,
 						  map_length);
 		} else {
@@ -6590,7 +6590,7 @@ blk_status_t btrfs_map_bio(struct btrfs_fs_info *fs_info, struct bio *bio,
 		dev = bbio->stripes[dev_nr].dev;
 		if (!dev || !dev->bdev || test_bit(BTRFS_DEV_STATE_MISSING,
 						   &dev->dev_state) ||
-		    (bio_op(first_bio) == REQ_OP_WRITE &&
+		    (btrfs_op(first_bio) == BTRFS_MAP_WRITE &&
 		    !test_bit(BTRFS_DEV_STATE_WRITEABLE, &dev->dev_state))) {
 			bbio_error(bbio, first_bio, logical);
 			continue;
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 0249aca668fb..cff1f7689eac 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -410,6 +410,7 @@ static inline enum btrfs_map_op btrfs_op(struct bio *bio)
 	case REQ_OP_DISCARD:
 		return BTRFS_MAP_DISCARD;
 	case REQ_OP_WRITE:
+	case REQ_OP_ZONE_APPEND:
 		return BTRFS_MAP_WRITE;
 	default:
 		WARN_ON_ONCE(1);
-- 
2.27.0

* [PATCH v10 23/41] btrfs: split ordered extent when bio is sent
  2020-11-10 11:26 [PATCH v10 00/41] btrfs: zoned block device support Naohiro Aota
                   ` (21 preceding siblings ...)
  2020-11-10 11:26 ` [PATCH v10 22/41] btrfs: handle REQ_OP_ZONE_APPEND as writing Naohiro Aota
@ 2020-11-10 11:26 ` Naohiro Aota
  2020-11-11  2:01     ` kernel test robot
                     ` (4 more replies)
  2020-11-10 11:26 ` [PATCH v10 24/41] btrfs: extend btrfs_rmap_block for specifying a device Naohiro Aota
                   ` (19 subsequent siblings)
  42 siblings, 5 replies; 125+ messages in thread
From: Naohiro Aota @ 2020-11-10 11:26 UTC (permalink / raw)
  To: linux-btrfs, dsterba
  Cc: hare, linux-fsdevel, Jens Axboe, Christoph Hellwig,
	Darrick J. Wong, Naohiro Aota

For a zone append write, the device decides the location the data is
written to. Therefore we cannot ensure that two bios are written
consecutively on the device. In order to ensure that an ordered extent
maps to a contiguous region on disk, we need to maintain a "one bio ==
one ordered extent" rule.

This commit implements the splitting of an ordered extent and its extent
map on bio submission to adhere to that rule.
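
The pre/post split arithmetic used on submission can be checked in
isolation (a sketch of the calculation done in extract_ordered_extent(),
not the kernel function itself):

```c
#include <assert.h>
#include <stdint.h>

/* An ordered extent covers [disk_bytenr, disk_bytenr + disk_num_bytes) on
 * disk and the bio covers [start, start + len) inside it. @pre and @post
 * are the leading and trailing pieces that get cloned into new ordered
 * extents, so the bio maps 1:1 to the piece that remains. */
static void split_ordered(uint64_t disk_bytenr, uint64_t disk_num_bytes,
			  uint64_t start, uint64_t len,
			  uint64_t *pre, uint64_t *post)
{
	uint64_t ordered_end = disk_bytenr + disk_num_bytes;

	*pre = start - disk_bytenr;
	*post = ordered_end - (start + len);
}
```

When the bio covers the whole ordered extent both pieces are zero, which
is the "No need to split" early return in the patch.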

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/inode.c        | 89 +++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/ordered-data.c | 76 +++++++++++++++++++++++++++++++++++
 fs/btrfs/ordered-data.h |  2 +
 3 files changed, 167 insertions(+)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 591ca539e444..df85d8dea37c 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2158,6 +2158,86 @@ static blk_status_t btrfs_submit_bio_start(void *private_data, struct bio *bio,
 	return btrfs_csum_one_bio(BTRFS_I(inode), bio, 0, 0);
 }
 
+int extract_ordered_extent(struct inode *inode, struct bio *bio,
+			   loff_t file_offset)
+{
+	struct btrfs_ordered_extent *ordered;
+	struct extent_map *em = NULL, *em_new = NULL;
+	struct extent_map_tree *em_tree = &BTRFS_I(inode)->extent_tree;
+	u64 start = (u64)bio->bi_iter.bi_sector << SECTOR_SHIFT;
+	u64 len = bio->bi_iter.bi_size;
+	u64 end = start + len;
+	u64 ordered_end;
+	u64 pre, post;
+	int ret = 0;
+
+	ordered = btrfs_lookup_ordered_extent(BTRFS_I(inode), file_offset);
+	if (WARN_ON_ONCE(!ordered))
+		return -EIO;
+
+	/* No need to split */
+	if (ordered->disk_num_bytes == len)
+		goto out;
+
+	/* We cannot split once end_bio'd ordered extent */
+	if (WARN_ON_ONCE(ordered->bytes_left != ordered->disk_num_bytes)) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	/* We cannot split a compressed ordered extent */
+	if (WARN_ON_ONCE(ordered->disk_num_bytes != ordered->num_bytes)) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	/* We cannot split a waited ordered extent */
+	if (WARN_ON_ONCE(wq_has_sleeper(&ordered->wait))) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	ordered_end = ordered->disk_bytenr + ordered->disk_num_bytes;
+	/* bio must be in one ordered extent */
+	if (WARN_ON_ONCE(start < ordered->disk_bytenr || end > ordered_end)) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	/* Checksum list should be empty */
+	if (WARN_ON_ONCE(!list_empty(&ordered->list))) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	pre = start - ordered->disk_bytenr;
+	post = ordered_end - end;
+
+	btrfs_split_ordered_extent(ordered, pre, post);
+
+	read_lock(&em_tree->lock);
+	em = lookup_extent_mapping(em_tree, ordered->file_offset, len);
+	if (!em) {
+		read_unlock(&em_tree->lock);
+		ret = -EIO;
+		goto out;
+	}
+	read_unlock(&em_tree->lock);
+
+	ASSERT(!test_bit(EXTENT_FLAG_COMPRESSED, &em->flags));
+	em_new = create_io_em(BTRFS_I(inode), em->start + pre, len,
+			      em->start + pre, em->block_start + pre, len,
+			      len, len, BTRFS_COMPRESS_NONE,
+			      BTRFS_ORDERED_REGULAR);
+	free_extent_map(em_new);
+
+out:
+	free_extent_map(em);
+	btrfs_put_ordered_extent(ordered);
+
+	return ret;
+}
+
 /*
  * extent_io.c submission hook. This does the right thing for csum calculation
  * on write, or reading the csums from the tree before a read.
@@ -2192,6 +2272,15 @@ blk_status_t btrfs_submit_data_bio(struct inode *inode, struct bio *bio,
 	if (btrfs_is_free_space_inode(BTRFS_I(inode)))
 		metadata = BTRFS_WQ_ENDIO_FREE_SPACE;
 
+	if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
+		struct page *page = bio_first_bvec_all(bio)->bv_page;
+		loff_t file_offset = page_offset(page);
+
+		ret = extract_ordered_extent(inode, bio, file_offset);
+		if (ret)
+			goto out;
+	}
+
 	if (btrfs_op(bio) != BTRFS_MAP_WRITE) {
 		ret = btrfs_bio_wq_end_io(fs_info, bio, metadata);
 		if (ret)
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index 87bac9ecdf4c..35ef25e39561 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -943,6 +943,82 @@ void btrfs_lock_and_flush_ordered_range(struct btrfs_inode *inode, u64 start,
 	}
 }
 
+static void clone_ordered_extent(struct btrfs_ordered_extent *ordered, u64 pos,
+				 u64 len)
+{
+	struct inode *inode = ordered->inode;
+	u64 file_offset = ordered->file_offset + pos;
+	u64 disk_bytenr = ordered->disk_bytenr + pos;
+	u64 num_bytes = len;
+	u64 disk_num_bytes = len;
+	int type;
+	unsigned long flags_masked =
+		ordered->flags & ~(1 << BTRFS_ORDERED_DIRECT);
+	int compress_type = ordered->compress_type;
+	unsigned long weight;
+
+	weight = hweight_long(flags_masked);
+	WARN_ON_ONCE(weight > 1);
+	if (!weight)
+		type = 0;
+	else
+		type = __ffs(flags_masked);
+
+	if (test_bit(BTRFS_ORDERED_COMPRESSED, &ordered->flags)) {
+		WARN_ON_ONCE(1);
+		btrfs_add_ordered_extent_compress(BTRFS_I(inode), file_offset,
+						  disk_bytenr, num_bytes,
+						  disk_num_bytes, type,
+						  compress_type);
+	} else if (test_bit(BTRFS_ORDERED_DIRECT, &ordered->flags)) {
+		btrfs_add_ordered_extent_dio(BTRFS_I(inode), file_offset,
+					     disk_bytenr, num_bytes,
+					     disk_num_bytes, type);
+	} else {
+		btrfs_add_ordered_extent(BTRFS_I(inode), file_offset,
+					 disk_bytenr, num_bytes, disk_num_bytes,
+					 type);
+	}
+}
+
+void btrfs_split_ordered_extent(struct btrfs_ordered_extent *ordered, u64 pre,
+				u64 post)
+{
+	struct inode *inode = ordered->inode;
+	struct btrfs_ordered_inode_tree *tree = &BTRFS_I(inode)->ordered_tree;
+	struct rb_node *node;
+	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+
+	spin_lock_irq(&tree->lock);
+	/* Remove from tree once */
+	node = &ordered->rb_node;
+	rb_erase(node, &tree->tree);
+	RB_CLEAR_NODE(node);
+	if (tree->last == node)
+		tree->last = NULL;
+
+	ordered->file_offset += pre;
+	ordered->disk_bytenr += pre;
+	ordered->num_bytes -= (pre + post);
+	ordered->disk_num_bytes -= (pre + post);
+	ordered->bytes_left -= (pre + post);
+
+	/* Re-insert the node */
+	node = tree_insert(&tree->tree, ordered->file_offset,
+			   &ordered->rb_node);
+	if (node)
+		btrfs_panic(fs_info, -EEXIST,
+				"zoned: inconsistency in ordered tree at offset %llu",
+				ordered->file_offset);
+
+	spin_unlock_irq(&tree->lock);
+
+	if (pre)
+		clone_ordered_extent(ordered, 0, pre);
+	if (post)
+		clone_ordered_extent(ordered, pre + ordered->disk_num_bytes, post);
+}
+
 int __init ordered_data_init(void)
 {
 	btrfs_ordered_extent_cache = kmem_cache_create("btrfs_ordered_extent",
diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
index c3a2325e64a4..e346b03bd66a 100644
--- a/fs/btrfs/ordered-data.h
+++ b/fs/btrfs/ordered-data.h
@@ -193,6 +193,8 @@ void btrfs_wait_ordered_roots(struct btrfs_fs_info *fs_info, u64 nr,
 void btrfs_lock_and_flush_ordered_range(struct btrfs_inode *inode, u64 start,
 					u64 end,
 					struct extent_state **cached_state);
+void btrfs_split_ordered_extent(struct btrfs_ordered_extent *ordered, u64 pre,
+				u64 post);
 int __init ordered_data_init(void);
 void __cold ordered_data_exit(void);
 
-- 
2.27.0

* [PATCH v10 24/41] btrfs: extend btrfs_rmap_block for specifying a device
  2020-11-10 11:26 [PATCH v10 00/41] btrfs: zoned block device support Naohiro Aota
                   ` (22 preceding siblings ...)
  2020-11-10 11:26 ` [PATCH v10 23/41] btrfs: split ordered extent when bio is sent Naohiro Aota
@ 2020-11-10 11:26 ` Naohiro Aota
  2020-11-10 11:26 ` [PATCH v10 25/41] btrfs: use ZONE_APPEND write for ZONED btrfs Naohiro Aota
                   ` (18 subsequent siblings)
  42 siblings, 0 replies; 125+ messages in thread
From: Naohiro Aota @ 2020-11-10 11:26 UTC (permalink / raw)
  To: linux-btrfs, dsterba
  Cc: hare, linux-fsdevel, Jens Axboe, Christoph Hellwig,
	Darrick J. Wong, Naohiro Aota

btrfs_rmap_block currently reverse-maps the physical addresses on all
devices to the corresponding logical addresses.

This commit extends the function to match a specified device. The old
behavior of querying all devices is kept intact by passing NULL as the
target device.

We pass a block_device rather than a btrfs_device to __btrfs_rmap_block,
since this function is intended to reverse-map the result of a bio, which
only has a block_device.

This commit also exports the function for later use.
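
The arithmetic change — keeping the intra-stripe remainder as @offset
instead of discarding it — can be sketched for the SINGLE-profile case
(all values illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* Reverse-map @physical, which falls inside a stripe starting at
 * @stripe_physical, back to a logical address. Keeping @offset mirrors
 * the switch from div64_u64() to div64_u64_rem() in the patch; before
 * it, addresses were rounded down to a stripe boundary. */
static uint64_t rmap_single(uint64_t chunk_start, uint64_t stripe_physical,
			    uint64_t stripe_len, uint64_t io_stripe_size,
			    uint64_t physical)
{
	uint64_t rel = physical - stripe_physical;
	uint64_t stripe_nr = rel / stripe_len;
	uint64_t offset = rel % stripe_len;	/* previously dropped */

	return chunk_start + stripe_nr * io_stripe_size + offset;
}
```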

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/block-group.c            | 20 ++++++++++++++------
 fs/btrfs/block-group.h            |  8 +++-----
 fs/btrfs/tests/extent-map-tests.c |  2 +-
 3 files changed, 18 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index 31511e59ca74..04bb0602f1cc 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -1646,8 +1646,11 @@ static void set_avail_alloc_bits(struct btrfs_fs_info *fs_info, u64 flags)
 }
 
 /**
- * btrfs_rmap_block - Map a physical disk address to a list of logical addresses
+ * btrfs_rmap_block - Map a physical disk address to a list of logical
+ *                    addresses
  * @chunk_start:   logical address of block group
+ * @bdev:	   physical device to resolve. Can be NULL to indicate any
+ *                 device.
  * @physical:	   physical address to map to logical addresses
  * @logical:	   return array of logical addresses which map to @physical
  * @naddrs:	   length of @logical
@@ -1657,9 +1660,9 @@ static void set_avail_alloc_bits(struct btrfs_fs_info *fs_info, u64 flags)
  * Used primarily to exclude those portions of a block group that contain super
  * block copies.
  */
-EXPORT_FOR_TESTS
 int btrfs_rmap_block(struct btrfs_fs_info *fs_info, u64 chunk_start,
-		     u64 physical, u64 **logical, int *naddrs, int *stripe_len)
+		     struct block_device *bdev, u64 physical, u64 **logical,
+		     int *naddrs, int *stripe_len)
 {
 	struct extent_map *em;
 	struct map_lookup *map;
@@ -1677,6 +1680,7 @@ int btrfs_rmap_block(struct btrfs_fs_info *fs_info, u64 chunk_start,
 	map = em->map_lookup;
 	data_stripe_length = em->orig_block_len;
 	io_stripe_size = map->stripe_len;
+	chunk_start = em->start;
 
 	/* For RAID5/6 adjust to a full IO stripe length */
 	if (map->type & BTRFS_BLOCK_GROUP_RAID56_MASK)
@@ -1691,14 +1695,18 @@ int btrfs_rmap_block(struct btrfs_fs_info *fs_info, u64 chunk_start,
 	for (i = 0; i < map->num_stripes; i++) {
 		bool already_inserted = false;
 		u64 stripe_nr;
+		u64 offset;
 		int j;
 
 		if (!in_range(physical, map->stripes[i].physical,
 			      data_stripe_length))
 			continue;
 
+		if (bdev && map->stripes[i].dev->bdev != bdev)
+			continue;
+
 		stripe_nr = physical - map->stripes[i].physical;
-		stripe_nr = div64_u64(stripe_nr, map->stripe_len);
+		stripe_nr = div64_u64_rem(stripe_nr, map->stripe_len, &offset);
 
 		if (map->type & BTRFS_BLOCK_GROUP_RAID10) {
 			stripe_nr = stripe_nr * map->num_stripes + i;
@@ -1712,7 +1720,7 @@ int btrfs_rmap_block(struct btrfs_fs_info *fs_info, u64 chunk_start,
 		 * instead of map->stripe_len
 		 */
 
-		bytenr = chunk_start + stripe_nr * io_stripe_size;
+		bytenr = chunk_start + stripe_nr * io_stripe_size + offset;
 
 		/* Ensure we don't add duplicate addresses */
 		for (j = 0; j < nr; j++) {
@@ -1754,7 +1762,7 @@ static int exclude_super_stripes(struct btrfs_block_group *cache)
 
 	for (i = 0; i < BTRFS_SUPER_MIRROR_MAX; i++) {
 		bytenr = btrfs_sb_offset(i);
-		ret = btrfs_rmap_block(fs_info, cache->start,
+		ret = btrfs_rmap_block(fs_info, cache->start, NULL,
 				       bytenr, &logical, &nr, &stripe_len);
 		if (ret)
 			return ret;
diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h
index 5be47f4bfea7..9a4009eaaecb 100644
--- a/fs/btrfs/block-group.h
+++ b/fs/btrfs/block-group.h
@@ -275,6 +275,9 @@ void check_system_chunk(struct btrfs_trans_handle *trans, const u64 type);
 u64 btrfs_get_alloc_profile(struct btrfs_fs_info *fs_info, u64 orig_flags);
 void btrfs_put_block_group_cache(struct btrfs_fs_info *info);
 int btrfs_free_block_groups(struct btrfs_fs_info *info);
+int btrfs_rmap_block(struct btrfs_fs_info *fs_info, u64 chunk_start,
+		       struct block_device *bdev, u64 physical, u64 **logical,
+		       int *naddrs, int *stripe_len);
 
 static inline u64 btrfs_data_alloc_profile(struct btrfs_fs_info *fs_info)
 {
@@ -301,9 +304,4 @@ static inline int btrfs_block_group_done(struct btrfs_block_group *cache)
 void btrfs_freeze_block_group(struct btrfs_block_group *cache);
 void btrfs_unfreeze_block_group(struct btrfs_block_group *cache);
 
-#ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
-int btrfs_rmap_block(struct btrfs_fs_info *fs_info, u64 chunk_start,
-		     u64 physical, u64 **logical, int *naddrs, int *stripe_len);
-#endif
-
 #endif /* BTRFS_BLOCK_GROUP_H */
diff --git a/fs/btrfs/tests/extent-map-tests.c b/fs/btrfs/tests/extent-map-tests.c
index 57379e96ccc9..c0aefe6dee0b 100644
--- a/fs/btrfs/tests/extent-map-tests.c
+++ b/fs/btrfs/tests/extent-map-tests.c
@@ -507,7 +507,7 @@ static int test_rmap_block(struct btrfs_fs_info *fs_info,
 		goto out_free;
 	}
 
-	ret = btrfs_rmap_block(fs_info, em->start, btrfs_sb_offset(1),
+	ret = btrfs_rmap_block(fs_info, em->start, NULL, btrfs_sb_offset(1),
 			       &logical, &out_ndaddrs, &out_stripe_len);
 	if (ret || (out_ndaddrs == 0 && test->expected_mapped_addr)) {
 		test_err("didn't rmap anything but expected %d",
-- 
2.27.0

* [PATCH v10 25/41] btrfs: use ZONE_APPEND write for ZONED btrfs
  2020-11-10 11:26 [PATCH v10 00/41] btrfs: zoned block device support Naohiro Aota
                   ` (23 preceding siblings ...)
  2020-11-10 11:26 ` [PATCH v10 24/41] btrfs: extend btrfs_rmap_block for specifying a device Naohiro Aota
@ 2020-11-10 11:26 ` Naohiro Aota
  2020-11-10 11:26 ` [PATCH v10 26/41] btrfs: enable zone append writing for direct IO Naohiro Aota
                   ` (17 subsequent siblings)
  42 siblings, 0 replies; 125+ messages in thread
From: Naohiro Aota @ 2020-11-10 11:26 UTC (permalink / raw)
  To: linux-btrfs, dsterba
  Cc: hare, linux-fsdevel, Jens Axboe, Christoph Hellwig,
	Darrick J. Wong, Naohiro Aota, Josef Bacik, Johannes Thumshirn

This commit enables zone append writing for zoned btrfs. When using zone
append, a bio is issued to the start of a target zone and the device
decides where inside the zone to place it. Upon completion the device
reports the actual written position back to the host.

Three parts are necessary to enable zone append in btrfs. First, modify
the bio to use REQ_OP_ZONE_APPEND in btrfs_submit_bio_hook() and adjust
bi_sector to point to the beginning of the zone.

Second, record the returned physical address (and disk/partno) in the
ordered extent in end_bio_extent_writepage() after the bio has completed.
We cannot resolve the physical address to the logical address there,
because we can neither take locks nor allocate a buffer in that end_bio
context. So we record the physical address and resolve it later, in
btrfs_finish_ordered_io().

Finally, rewrite the logical addresses of the extent mapping and checksum
data according to the physical address (using __btrfs_rmap_block). If the
returned address matches the originally allocated address, we can skip
the rewriting process.
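
Once the completion's physical address has been reverse-mapped to a
logical address, the rewrite itself amounts to shifting every recorded
logical address by the same delta (a sketch of the idea behind
btrfs_rewrite_logical_zoned(), not its implementation):

```c
#include <assert.h>
#include <stdint.h>

/* Shift a recorded logical address (extent map start, checksum bytenr,
 * ...) by the difference between where the device actually wrote the
 * data and where it was originally allocated. A no-op when the two
 * match, which is why the rewrite can be skipped in that case. */
static uint64_t rewrite_logical(uint64_t recorded, uint64_t allocated_logical,
				uint64_t actual_logical)
{
	return recorded + (actual_logical - allocated_logical);
}
```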

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/extent_io.c    | 12 +++++++-
 fs/btrfs/file.c         |  2 +-
 fs/btrfs/inode.c        |  4 +++
 fs/btrfs/ordered-data.c |  3 ++
 fs/btrfs/ordered-data.h |  8 +++++
 fs/btrfs/volumes.c      | 15 +++++++++
 fs/btrfs/zoned.c        | 68 +++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/zoned.h        | 11 +++++++
 8 files changed, 121 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index b9b366f4d942..7f94fef3647b 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2743,6 +2743,7 @@ static void end_bio_extent_writepage(struct bio *bio)
 	u64 start;
 	u64 end;
 	struct bvec_iter_all iter_all;
+	bool first_bvec = true;
 
 	ASSERT(!bio_flagged(bio, BIO_CLONED));
 	bio_for_each_segment_all(bvec, bio, iter_all) {
@@ -2769,6 +2770,11 @@ static void end_bio_extent_writepage(struct bio *bio)
 		start = page_offset(page);
 		end = start + bvec->bv_offset + bvec->bv_len - 1;
 
+		if (first_bvec) {
+			btrfs_record_physical_zoned(inode, start, bio);
+			first_bvec = false;
+		}
+
 		end_extent_writepage(page, error, start, end);
 		end_page_writeback(page);
 	}
@@ -3525,6 +3531,7 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
 	size_t blocksize;
 	int ret = 0;
 	int nr = 0;
+	int opf = REQ_OP_WRITE;
 	const unsigned int write_flags = wbc_to_write_flags(wbc);
 	bool compressed;
 
@@ -3537,6 +3544,9 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
 		return 1;
 	}
 
+	if (btrfs_is_zoned(inode->root->fs_info))
+		opf = REQ_OP_ZONE_APPEND;
+
 	/*
 	 * we don't want to touch the inode after unlocking the page,
 	 * so we update the mapping writeback index now
@@ -3597,7 +3607,7 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
 			       page->index, cur, end);
 		}
 
-		ret = submit_extent_page(REQ_OP_WRITE | write_flags, wbc,
+		ret = submit_extent_page(opf | write_flags, wbc,
 					 page, offset, iosize, pg_offset,
 					 &epd->bio,
 					 end_bio_extent_writepage,
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 68938a43081e..bdc268c91334 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2226,7 +2226,7 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
 	 * the current transaction commits before the ordered extents complete
 	 * and a power failure happens right after that.
 	 */
-	if (full_sync) {
+	if (full_sync || btrfs_is_zoned(fs_info)) {
 		ret = btrfs_wait_ordered_range(inode, start, len);
 	} else {
 		/*
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index df85d8dea37c..fe15441278de 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -51,6 +51,7 @@
 #include "delalloc-space.h"
 #include "block-group.h"
 #include "space-info.h"
+#include "zoned.h"
 
 struct btrfs_iget_args {
 	u64 ino;
@@ -2676,6 +2677,9 @@ static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent)
 	bool clear_reserved_extent = true;
 	unsigned int clear_bits;
 
+	if (ordered_extent->disk)
+		btrfs_rewrite_logical_zoned(ordered_extent);
+
 	start = ordered_extent->file_offset;
 	end = start + ordered_extent->num_bytes - 1;
 
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index 35ef25e39561..1a3b06713d0f 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -199,6 +199,9 @@ static int __btrfs_add_ordered_extent(struct btrfs_inode *inode, u64 file_offset
 	entry->compress_type = compress_type;
 	entry->truncated_len = (u64)-1;
 	entry->qgroup_rsv = ret;
+	entry->physical = (u64)-1;
+	entry->disk = NULL;
+	entry->partno = (u8)-1;
 	if (type != BTRFS_ORDERED_IO_DONE && type != BTRFS_ORDERED_COMPLETE)
 		set_bit(type, &entry->flags);
 
diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
index e346b03bd66a..084c609afd83 100644
--- a/fs/btrfs/ordered-data.h
+++ b/fs/btrfs/ordered-data.h
@@ -127,6 +127,14 @@ struct btrfs_ordered_extent {
 	struct completion completion;
 	struct btrfs_work flush_work;
 	struct list_head work_list;
+
+	/*
+	 * Used to reverse-map the physical address returned from a
+	 * ZONE_APPEND write command in a workqueue context.
+	 */
+	u64 physical;
+	struct gendisk *disk;
+	u8 partno;
 };
 
 /*
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 683b3ed06226..c8187d704c89 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6503,6 +6503,21 @@ static void submit_stripe_bio(struct btrfs_bio *bbio, struct bio *bio,
 	btrfs_io_bio(bio)->device = dev;
 	bio->bi_end_io = btrfs_end_bio;
 	bio->bi_iter.bi_sector = physical >> 9;
+	/*
+	 * For zone append writes, bi_sector must point to the beginning of
+	 * the zone.
+	 */
+	if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
+		if (btrfs_dev_is_sequential(dev, physical)) {
+			u64 zone_start = round_down(physical,
+						    fs_info->zone_size);
+
+			bio->bi_iter.bi_sector = zone_start >> SECTOR_SHIFT;
+		} else {
+			bio->bi_opf &= ~REQ_OP_ZONE_APPEND;
+			bio->bi_opf |= REQ_OP_WRITE;
+		}
+	}
 	btrfs_debug_in_rcu(fs_info,
 	"btrfs_map_bio: rw %d 0x%x, sector=%llu, dev=%lu (%s id %llu), size=%u",
 		bio_op(bio), bio->bi_opf, (u64)bio->bi_iter.bi_sector,
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index b56bfeaf8744..f38bd0200788 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -1060,3 +1060,71 @@ void btrfs_free_redirty_list(struct btrfs_transaction *trans)
 	}
 	spin_unlock(&trans->releasing_ebs_lock);
 }
+
+void btrfs_record_physical_zoned(struct inode *inode, u64 file_offset,
+				 struct bio *bio)
+{
+	struct btrfs_ordered_extent *ordered;
+	u64 physical = (u64)bio->bi_iter.bi_sector << SECTOR_SHIFT;
+
+	if (bio_op(bio) != REQ_OP_ZONE_APPEND)
+		return;
+
+	ordered = btrfs_lookup_ordered_extent(BTRFS_I(inode), file_offset);
+	if (WARN_ON(!ordered))
+		return;
+
+	ordered->physical = physical;
+	ordered->disk = bio->bi_disk;
+	ordered->partno = bio->bi_partno;
+
+	btrfs_put_ordered_extent(ordered);
+}
+
+void btrfs_rewrite_logical_zoned(struct btrfs_ordered_extent *ordered)
+{
+	struct extent_map_tree *em_tree;
+	struct extent_map *em;
+	struct inode *inode = ordered->inode;
+	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+	struct btrfs_ordered_sum *sum;
+	struct block_device *bdev;
+	u64 orig_logical = ordered->disk_bytenr;
+	u64 *logical = NULL;
+	int nr, stripe_len;
+
+	bdev = bdget_disk(ordered->disk, ordered->partno);
+	if (WARN_ON(!bdev))
+		return;
+
+	if (WARN_ON(btrfs_rmap_block(fs_info, orig_logical, bdev,
+				     ordered->physical, &logical, &nr,
+				     &stripe_len)))
+		goto out;
+
+	WARN_ON(nr != 1);
+
+	if (orig_logical == *logical)
+		goto out;
+
+	ordered->disk_bytenr = *logical;
+
+	em_tree = &BTRFS_I(inode)->extent_tree;
+	write_lock(&em_tree->lock);
+	em = search_extent_mapping(em_tree, ordered->file_offset,
+				   ordered->num_bytes);
+	em->block_start = *logical;
+	free_extent_map(em);
+	write_unlock(&em_tree->lock);
+
+	list_for_each_entry(sum, &ordered->list, list) {
+		if (*logical < orig_logical)
+			sum->bytenr -= orig_logical - *logical;
+		else
+			sum->bytenr += *logical - orig_logical;
+	}
+
+out:
+	kfree(logical);
+	bdput(bdev);
+}
diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h
index a7de80c313be..2872a0cbc847 100644
--- a/fs/btrfs/zoned.h
+++ b/fs/btrfs/zoned.h
@@ -45,6 +45,9 @@ void btrfs_calc_zone_unusable(struct btrfs_block_group *cache);
 void btrfs_redirty_list_add(struct btrfs_transaction *trans,
 			    struct extent_buffer *eb);
 void btrfs_free_redirty_list(struct btrfs_transaction *trans);
+void btrfs_record_physical_zoned(struct inode *inode, u64 file_offset,
+				 struct bio *bio);
+void btrfs_rewrite_logical_zoned(struct btrfs_ordered_extent *ordered);
 #else /* CONFIG_BLK_DEV_ZONED */
 static inline int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
 				     struct blk_zone *zone)
@@ -129,6 +132,14 @@ static inline void btrfs_redirty_list_add(struct btrfs_transaction *trans,
 					  struct extent_buffer *eb) { }
 static inline void btrfs_free_redirty_list(struct btrfs_transaction *trans) { }
 
+static inline void btrfs_record_physical_zoned(struct inode *inode,
+					       u64 file_offset, struct bio *bio)
+{
+}
+
+static inline void btrfs_rewrite_logical_zoned(
+				struct btrfs_ordered_extent *ordered) { }
+
 #endif
 
 static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH v10 26/41] btrfs: enable zone append writing for direct IO
  2020-11-10 11:26 [PATCH v10 00/41] btrfs: zoned block device support Naohiro Aota
                   ` (24 preceding siblings ...)
  2020-11-10 11:26 ` [PATCH v10 25/41] btrfs: use ZONE_APPEND write for ZONED btrfs Naohiro Aota
@ 2020-11-10 11:26 ` Naohiro Aota
  2020-11-10 11:26 ` [PATCH v10 27/41] btrfs: introduce dedicated data write path for ZONED mode Naohiro Aota
                   ` (16 subsequent siblings)
  42 siblings, 0 replies; 125+ messages in thread
From: Naohiro Aota @ 2020-11-10 11:26 UTC (permalink / raw)
  To: linux-btrfs, dsterba
  Cc: hare, linux-fsdevel, Jens Axboe, Christoph Hellwig,
	Darrick J. Wong, Naohiro Aota, Josef Bacik

As with buffered IO, enable zone append writing for direct IO when it is
used on a zoned block device.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/inode.c | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index fe15441278de..445cb6ba4a59 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7542,6 +7542,9 @@ static int btrfs_dio_iomap_begin(struct inode *inode, loff_t start,
 	iomap->bdev = fs_info->fs_devices->latest_bdev;
 	iomap->length = len;
 
+	if (write && btrfs_is_zoned(fs_info) && fs_info->max_zone_append_size)
+		iomap->flags |= IOMAP_F_ZONE_APPEND;
+
 	free_extent_map(em);
 
 	return 0;
@@ -7779,6 +7782,8 @@ static void btrfs_end_dio_bio(struct bio *bio)
 	if (err)
 		dip->dio_bio->bi_status = err;
 
+	btrfs_record_physical_zoned(dip->inode, dip->logical_offset, bio);
+
 	bio_put(bio);
 	btrfs_dio_private_put(dip);
 }
@@ -7933,6 +7938,18 @@ static blk_qc_t btrfs_submit_direct(struct inode *inode, struct iomap *iomap,
 		bio->bi_end_io = btrfs_end_dio_bio;
 		btrfs_io_bio(bio)->logical = file_offset;
 
+		WARN_ON_ONCE(write && btrfs_is_zoned(fs_info) &&
+			     fs_info->max_zone_append_size &&
+			     bio_op(bio) != REQ_OP_ZONE_APPEND);
+
+		if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
+			ret = extract_ordered_extent(inode, bio, file_offset);
+			if (ret) {
+				bio_put(bio);
+				goto out_err;
+			}
+		}
+
 		ASSERT(submit_len >= clone_len);
 		submit_len -= clone_len;
 
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH v10 27/41] btrfs: introduce dedicated data write path for ZONED mode
  2020-11-10 11:26 [PATCH v10 00/41] btrfs: zoned block device support Naohiro Aota
                   ` (25 preceding siblings ...)
  2020-11-10 11:26 ` [PATCH v10 26/41] btrfs: enable zone append writing for direct IO Naohiro Aota
@ 2020-11-10 11:26 ` Naohiro Aota
  2020-11-10 11:26 ` [PATCH v10 28/41] btrfs: serialize meta IOs on " Naohiro Aota
                   ` (15 subsequent siblings)
  42 siblings, 0 replies; 125+ messages in thread
From: Naohiro Aota @ 2020-11-10 11:26 UTC (permalink / raw)
  To: linux-btrfs, dsterba
  Cc: hare, linux-fsdevel, Jens Axboe, Christoph Hellwig,
	Darrick J. Wong, Naohiro Aota

If more than one IO is issued for one file extent, these IOs can be written
to separate regions on a device. Since we cannot map one file extent to
such separate areas, we need to follow the "one IO == one ordered extent"
rule.

The normal buffered, uncompressed, not pre-allocated write path (used by
cow_file_range()) sometimes does not follow this rule. It can write only a
part of an ordered extent when a region to write is specified, e.g. when it
is called from fdatasync().

Introduce a dedicated (uncompressed buffered) data write path for ZONED
mode. This write path CoWs the region and writes it at once.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/inode.c | 34 ++++++++++++++++++++++++++++++++--
 1 file changed, 32 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 445cb6ba4a59..991ef2bf018f 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1350,6 +1350,29 @@ static int cow_file_range_async(struct btrfs_inode *inode,
 	return 0;
 }
 
+static noinline int run_delalloc_zoned(struct btrfs_inode *inode,
+				       struct page *locked_page, u64 start,
+				       u64 end, int *page_started,
+				       unsigned long *nr_written)
+{
+	int ret;
+
+	ret = cow_file_range(inode, locked_page, start, end,
+			     page_started, nr_written, 0);
+	if (ret)
+		return ret;
+
+	if (*page_started)
+		return 0;
+
+	__set_page_dirty_nobuffers(locked_page);
+	account_page_redirty(locked_page);
+	extent_write_locked_range(&inode->vfs_inode, start, end, WB_SYNC_ALL);
+	*page_started = 1;
+
+	return 0;
+}
+
 static noinline int csum_exist_in_range(struct btrfs_fs_info *fs_info,
 					u64 bytenr, u64 num_bytes)
 {
@@ -1820,17 +1843,24 @@ int btrfs_run_delalloc_range(struct btrfs_inode *inode, struct page *locked_page
 {
 	int ret;
 	int force_cow = need_force_cow(inode, start, end);
+	const bool do_compress = inode_can_compress(inode) &&
+		inode_need_compress(inode, start, end);
+	const bool zoned = btrfs_is_zoned(inode->root->fs_info);
 
 	if (inode->flags & BTRFS_INODE_NODATACOW && !force_cow) {
+		ASSERT(!zoned);
 		ret = run_delalloc_nocow(inode, locked_page, start, end,
 					 page_started, 1, nr_written);
 	} else if (inode->flags & BTRFS_INODE_PREALLOC && !force_cow) {
+		ASSERT(!zoned);
 		ret = run_delalloc_nocow(inode, locked_page, start, end,
 					 page_started, 0, nr_written);
-	} else if (!inode_can_compress(inode) ||
-		   !inode_need_compress(inode, start, end)) {
+	} else if (!do_compress && !zoned) {
 		ret = cow_file_range(inode, locked_page, start, end,
 				     page_started, nr_written, 1);
+	} else if (!do_compress && zoned) {
+		ret = run_delalloc_zoned(inode, locked_page, start, end,
+					 page_started, nr_written);
 	} else {
 		set_bit(BTRFS_INODE_HAS_ASYNC_EXTENT, &inode->runtime_flags);
 		ret = cow_file_range_async(inode, wbc, locked_page, start, end,
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH v10 28/41] btrfs: serialize meta IOs on ZONED mode
  2020-11-10 11:26 [PATCH v10 00/41] btrfs: zoned block device support Naohiro Aota
                   ` (26 preceding siblings ...)
  2020-11-10 11:26 ` [PATCH v10 27/41] btrfs: introduce dedicated data write path for ZONED mode Naohiro Aota
@ 2020-11-10 11:26 ` Naohiro Aota
  2020-11-10 11:26 ` [PATCH v10 29/41] btrfs: wait existing extents before truncating Naohiro Aota
                   ` (14 subsequent siblings)
  42 siblings, 0 replies; 125+ messages in thread
From: Naohiro Aota @ 2020-11-10 11:26 UTC (permalink / raw)
  To: linux-btrfs, dsterba
  Cc: hare, linux-fsdevel, Jens Axboe, Christoph Hellwig,
	Darrick J. Wong, Naohiro Aota, Josef Bacik

We cannot use zone append for writing metadata, because the B-tree nodes
have references to each other using the logical address. Without knowing
the address in advance, we cannot construct the tree in the first place.
So we need to serialize write IOs for metadata.

We cannot add a mutex around allocation and submission because metadata
blocks are allocated in an earlier stage to build up B-trees.

Add a zoned_meta_io_lock and hold it during metadata IO submission in
btree_write_cache_pages() to serialize IOs. Furthermore, add a per-block
group metadata IO submission pointer "meta_write_pointer" to ensure
sequential writing, which could otherwise be broken when writing back
blocks of an unfinished transaction.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/block-group.h |  1 +
 fs/btrfs/ctree.h       |  1 +
 fs/btrfs/disk-io.c     |  1 +
 fs/btrfs/extent_io.c   | 27 ++++++++++++++++++++++-
 fs/btrfs/zoned.c       | 50 ++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/zoned.h       | 32 +++++++++++++++++++++++++++
 6 files changed, 111 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h
index 9a4009eaaecb..44f68e12f863 100644
--- a/fs/btrfs/block-group.h
+++ b/fs/btrfs/block-group.h
@@ -190,6 +190,7 @@ struct btrfs_block_group {
 	 */
 	u64 alloc_offset;
 	u64 zone_unusable;
+	u64 meta_write_pointer;
 };
 
 static inline u64 btrfs_block_group_end(struct btrfs_block_group *block_group)
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index c70d3fcc62c2..8138e932b7cc 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -956,6 +956,7 @@ struct btrfs_fs_info {
 
 	/* Max size to emit ZONE_APPEND write command */
 	u64 max_zone_append_size;
+	struct mutex zoned_meta_io_lock;
 
 #ifdef CONFIG_BTRFS_FS_REF_VERIFY
 	spinlock_t ref_verify_lock;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 8acf1ed75889..66f90ebfc01f 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2652,6 +2652,7 @@ void btrfs_init_fs_info(struct btrfs_fs_info *fs_info)
 	mutex_init(&fs_info->delete_unused_bgs_mutex);
 	mutex_init(&fs_info->reloc_mutex);
 	mutex_init(&fs_info->delalloc_root_mutex);
+	mutex_init(&fs_info->zoned_meta_io_lock);
 	seqlock_init(&fs_info->profiles_lock);
 
 	INIT_LIST_HEAD(&fs_info->dirty_cowonly_roots);
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 7f94fef3647b..d26c827f39c6 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -25,6 +25,7 @@
 #include "backref.h"
 #include "disk-io.h"
 #include "zoned.h"
+#include "block-group.h"
 
 static struct kmem_cache *extent_state_cache;
 static struct kmem_cache *extent_buffer_cache;
@@ -3995,6 +3996,7 @@ int btree_write_cache_pages(struct address_space *mapping,
 				   struct writeback_control *wbc)
 {
 	struct extent_buffer *eb, *prev_eb = NULL;
+	struct btrfs_block_group *cache = NULL;
 	struct extent_page_data epd = {
 		.bio = NULL,
 		.extent_locked = 0,
@@ -4029,6 +4031,7 @@ int btree_write_cache_pages(struct address_space *mapping,
 		tag = PAGECACHE_TAG_TOWRITE;
 	else
 		tag = PAGECACHE_TAG_DIRTY;
+	btrfs_zoned_meta_io_lock(fs_info);
 retry:
 	if (wbc->sync_mode == WB_SYNC_ALL)
 		tag_pages_for_writeback(mapping, index, end);
@@ -4071,12 +4074,30 @@ int btree_write_cache_pages(struct address_space *mapping,
 			if (!ret)
 				continue;
 
+			if (!btrfs_check_meta_write_pointer(fs_info, eb,
+							    &cache)) {
+				/*
+				 * If for_sync, this hole will be filled with
+				 * a transaction commit.
+				 */
+				if (wbc->sync_mode == WB_SYNC_ALL &&
+				    !wbc->for_sync)
+					ret = -EAGAIN;
+				else
+					ret = 0;
+				done = 1;
+				free_extent_buffer(eb);
+				break;
+			}
+
 			prev_eb = eb;
 			ret = lock_extent_buffer_for_io(eb, &epd);
 			if (!ret) {
+				btrfs_revert_meta_write_pointer(cache, eb);
 				free_extent_buffer(eb);
 				continue;
 			} else if (ret < 0) {
+				btrfs_revert_meta_write_pointer(cache, eb);
 				done = 1;
 				free_extent_buffer(eb);
 				break;
@@ -4109,10 +4130,12 @@ int btree_write_cache_pages(struct address_space *mapping,
 		index = 0;
 		goto retry;
 	}
+	if (cache)
+		btrfs_put_block_group(cache);
 	ASSERT(ret <= 0);
 	if (ret < 0) {
 		end_write_bio(&epd, ret);
-		return ret;
+		goto out;
 	}
 	/*
 	 * If something went wrong, don't allow any metadata write bio to be
@@ -4147,6 +4170,8 @@ int btree_write_cache_pages(struct address_space *mapping,
 		ret = -EROFS;
 		end_write_bio(&epd, ret);
 	}
+out:
+	btrfs_zoned_meta_io_unlock(fs_info);
 	return ret;
 }
 
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index f38bd0200788..d345c07f5fdf 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -995,6 +995,9 @@ int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache)
 		ret = -EIO;
 	}
 
+	if (!ret)
+		cache->meta_write_pointer = cache->alloc_offset + cache->start;
+
 	kfree(alloc_offsets);
 	free_extent_map(em);
 
@@ -1128,3 +1131,50 @@ void btrfs_rewrite_logical_zoned(struct btrfs_ordered_extent *ordered)
 	kfree(logical);
 	bdput(bdev);
 }
+
+bool btrfs_check_meta_write_pointer(struct btrfs_fs_info *fs_info,
+				    struct extent_buffer *eb,
+				    struct btrfs_block_group **cache_ret)
+{
+	struct btrfs_block_group *cache;
+	bool ret = true;
+
+	if (!btrfs_is_zoned(fs_info))
+		return true;
+
+	cache = *cache_ret;
+
+	if (cache && (eb->start < cache->start ||
+		      cache->start + cache->length <= eb->start)) {
+		btrfs_put_block_group(cache);
+		cache = NULL;
+		*cache_ret = NULL;
+	}
+
+	if (!cache)
+		cache = btrfs_lookup_block_group(fs_info, eb->start);
+
+	if (cache) {
+		if (cache->meta_write_pointer != eb->start) {
+			btrfs_put_block_group(cache);
+			cache = NULL;
+			ret = false;
+		} else {
+			cache->meta_write_pointer = eb->start + eb->len;
+		}
+
+		*cache_ret = cache;
+	}
+
+	return ret;
+}
+
+void btrfs_revert_meta_write_pointer(struct btrfs_block_group *cache,
+				     struct extent_buffer *eb)
+{
+	if (!btrfs_is_zoned(eb->fs_info) || !cache)
+		return;
+
+	ASSERT(cache->meta_write_pointer == eb->start + eb->len);
+	cache->meta_write_pointer = eb->start;
+}
diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h
index 2872a0cbc847..41d786a97e40 100644
--- a/fs/btrfs/zoned.h
+++ b/fs/btrfs/zoned.h
@@ -48,6 +48,11 @@ void btrfs_free_redirty_list(struct btrfs_transaction *trans);
 void btrfs_record_physical_zoned(struct inode *inode, u64 file_offset,
 				 struct bio *bio);
 void btrfs_rewrite_logical_zoned(struct btrfs_ordered_extent *ordered);
+bool btrfs_check_meta_write_pointer(struct btrfs_fs_info *fs_info,
+				    struct extent_buffer *eb,
+				    struct btrfs_block_group **cache_ret);
+void btrfs_revert_meta_write_pointer(struct btrfs_block_group *cache,
+				     struct extent_buffer *eb);
 #else /* CONFIG_BLK_DEV_ZONED */
 static inline int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
 				     struct blk_zone *zone)
@@ -140,6 +145,19 @@ static inline void btrfs_record_physical_zoned(struct inode *inode,
 static inline void btrfs_rewrite_logical_zoned(
 				struct btrfs_ordered_extent *ordered) { }
 
+static inline bool btrfs_check_meta_write_pointer(struct btrfs_fs_info *fs_info,
+			       struct extent_buffer *eb,
+			       struct btrfs_block_group **cache_ret)
+{
+	return true;
+}
+
+static inline void btrfs_revert_meta_write_pointer(
+						struct btrfs_block_group *cache,
+						struct extent_buffer *eb)
+{
+}
+
 #endif
 
 static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)
@@ -243,4 +261,18 @@ static inline bool btrfs_can_zone_reset(struct btrfs_device *device,
 	return true;
 }
 
+static inline void btrfs_zoned_meta_io_lock(struct btrfs_fs_info *fs_info)
+{
+	if (!btrfs_is_zoned(fs_info))
+		return;
+	mutex_lock(&fs_info->zoned_meta_io_lock);
+}
+
+static inline void btrfs_zoned_meta_io_unlock(struct btrfs_fs_info *fs_info)
+{
+	if (!btrfs_is_zoned(fs_info))
+		return;
+	mutex_unlock(&fs_info->zoned_meta_io_lock);
+}
+
 #endif
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH v10 29/41] btrfs: wait existing extents before truncating
  2020-11-10 11:26 [PATCH v10 00/41] btrfs: zoned block device support Naohiro Aota
                   ` (27 preceding siblings ...)
  2020-11-10 11:26 ` [PATCH v10 28/41] btrfs: serialize meta IOs on " Naohiro Aota
@ 2020-11-10 11:26 ` Naohiro Aota
  2020-11-10 11:26 ` [PATCH v10 30/41] btrfs: avoid async metadata checksum on ZONED mode Naohiro Aota
                   ` (13 subsequent siblings)
  42 siblings, 0 replies; 125+ messages in thread
From: Naohiro Aota @ 2020-11-10 11:26 UTC (permalink / raw)
  To: linux-btrfs, dsterba
  Cc: hare, linux-fsdevel, Jens Axboe, Christoph Hellwig,
	Darrick J. Wong, Naohiro Aota, Josef Bacik

When truncating a file, file buffers which have already been allocated but
not yet written may be truncated. Truncating these buffers could break the
sequential write pattern in a block group if, for example, the truncated
blocks are followed by blocks allocated to another file. To avoid this
problem, always wait for write out of all unwritten buffers before
proceeding with the truncate.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/inode.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 991ef2bf018f..992aa963592d 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4955,6 +4955,16 @@ static int btrfs_setsize(struct inode *inode, struct iattr *attr)
 		btrfs_drew_write_unlock(&root->snapshot_lock);
 		btrfs_end_transaction(trans);
 	} else {
+		struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+
+		if (btrfs_is_zoned(fs_info)) {
+			ret = btrfs_wait_ordered_range(
+				inode,
+				ALIGN(newsize, fs_info->sectorsize),
+				(u64)-1);
+			if (ret)
+				return ret;
+		}
 
 		/*
 		 * We're truncating a file that used to have good data down to
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH v10 30/41] btrfs: avoid async metadata checksum on ZONED mode
  2020-11-10 11:26 [PATCH v10 00/41] btrfs: zoned block device support Naohiro Aota
                   ` (28 preceding siblings ...)
  2020-11-10 11:26 ` [PATCH v10 29/41] btrfs: wait existing extents before truncating Naohiro Aota
@ 2020-11-10 11:26 ` Naohiro Aota
  2020-11-10 11:26 ` [PATCH v10 31/41] btrfs: mark block groups to copy for device-replace Naohiro Aota
                   ` (12 subsequent siblings)
  42 siblings, 0 replies; 125+ messages in thread
From: Naohiro Aota @ 2020-11-10 11:26 UTC (permalink / raw)
  To: linux-btrfs, dsterba
  Cc: hare, linux-fsdevel, Jens Axboe, Christoph Hellwig,
	Darrick J. Wong, Naohiro Aota, Josef Bacik

In ZONED mode, btrfs uses the per-FS zoned_meta_io_lock to serialize
metadata write IOs.

Even with this serialization, write bios sent from btree_write_cache_pages
can be reordered by the async checksum workers, as these workers are per
CPU and not per zone.

To preserve write bio ordering, disable async metadata checksum on ZONED
mode. This does not result in lower performance with HDDs, as a single CPU
core is fast enough to checksum a single zone write stream at the maximum
possible bandwidth of the device. If multiple zones are written
simultaneously, HDD seek overhead lowers the achievable maximum bandwidth,
so again the per-zone checksum serialization does not affect performance.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/disk-io.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 66f90ebfc01f..9490dbbbdb2a 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -813,6 +813,8 @@ static blk_status_t btree_submit_bio_start(void *private_data, struct bio *bio,
 static int check_async_write(struct btrfs_fs_info *fs_info,
 			     struct btrfs_inode *bi)
 {
+	if (btrfs_is_zoned(fs_info))
+		return 0;
 	if (atomic_read(&bi->sync_writers))
 		return 0;
 	if (test_bit(BTRFS_FS_CSUM_IMPL_FAST, &fs_info->flags))
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH v10 31/41] btrfs: mark block groups to copy for device-replace
  2020-11-10 11:26 [PATCH v10 00/41] btrfs: zoned block device support Naohiro Aota
                   ` (29 preceding siblings ...)
  2020-11-10 11:26 ` [PATCH v10 30/41] btrfs: avoid async metadata checksum on ZONED mode Naohiro Aota
@ 2020-11-10 11:26 ` Naohiro Aota
  2020-11-11  3:13   ` kernel test robot
  2020-11-11  3:16   ` kernel test robot
  2020-11-10 11:26 ` [PATCH v10 32/41] btrfs: implement cloning for ZONED device-replace Naohiro Aota
                   ` (11 subsequent siblings)
  42 siblings, 2 replies; 125+ messages in thread
From: Naohiro Aota @ 2020-11-10 11:26 UTC (permalink / raw)
  To: linux-btrfs, dsterba
  Cc: hare, linux-fsdevel, Jens Axboe, Christoph Hellwig,
	Darrick J. Wong, Naohiro Aota

This is the first of four patches to support device-replace in ZONED mode.

We have two types of I/Os during the device-replace process. One is an I/O
to "copy" (by the scrub functions) all the device extents on the source
device to the destination device.  The other one is an I/O to "clone" (by
handle_ops_on_dev_replace()) new incoming write I/Os from users to the
source device into the target device.

Cloning incoming I/Os can break the sequential write rule on the target
device. When a write is mapped to the middle of a block group, the I/O is
directed to the middle of a target device zone, which breaks the
sequential write requirement.

However, the cloning function cannot simply be disabled, since incoming
I/Os targeting already copied device extents must be cloned so that the
I/O is executed on the target device.

We cannot use dev_replace->cursor_{left,right} to determine whether a bio
is going to a not-yet-copied region. Since there is a time gap between
finishing btrfs_scrub_dev() and rewriting the mapping tree in
btrfs_dev_replace_finishing(), we can have a newly allocated device extent
which is never cloned nor copied.

So the point is to copy only already existing device extents. This patch
introduces mark_block_group_to_copy() to mark existing block groups as a
target of copying. Then, handle_ops_on_dev_replace() and dev-replace can
check the flag to do their job.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/block-group.h |   1 +
 fs/btrfs/dev-replace.c | 183 +++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/dev-replace.h |   3 +
 fs/btrfs/scrub.c       |  17 ++++
 4 files changed, 204 insertions(+)

diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h
index 44f68e12f863..ccbcf37eae9c 100644
--- a/fs/btrfs/block-group.h
+++ b/fs/btrfs/block-group.h
@@ -95,6 +95,7 @@ struct btrfs_block_group {
 	unsigned int iref:1;
 	unsigned int has_caching_ctl:1;
 	unsigned int removed:1;
+	unsigned int to_copy:1;
 
 	int disk_cache_state;
 
diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index db87f1aa604b..95e75fc8e266 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -22,6 +22,7 @@
 #include "dev-replace.h"
 #include "sysfs.h"
 #include "zoned.h"
+#include "block-group.h"
 
 /*
  * Device replace overview
@@ -437,6 +438,184 @@ static char* btrfs_dev_name(struct btrfs_device *device)
 		return rcu_str_deref(device->name);
 }
 
+static int mark_block_group_to_copy(struct btrfs_fs_info *fs_info,
+				    struct btrfs_device *src_dev)
+{
+	struct btrfs_path *path;
+	struct btrfs_key key;
+	struct btrfs_key found_key;
+	struct btrfs_root *root = fs_info->dev_root;
+	struct btrfs_dev_extent *dev_extent = NULL;
+	struct btrfs_block_group *cache;
+	struct btrfs_trans_handle *trans;
+	int ret = 0;
+	u64 chunk_offset, length;
+
+	/* Do not use "to_copy" on non-ZONED for now */
+	if (!btrfs_is_zoned(fs_info))
+		return 0;
+
+	mutex_lock(&fs_info->chunk_mutex);
+
+	/* Ensure we don't have pending new block group */
+	spin_lock(&fs_info->trans_lock);
+	while (fs_info->running_transaction &&
+	       !list_empty(&fs_info->running_transaction->dev_update_list)) {
+		spin_unlock(&fs_info->trans_lock);
+		mutex_unlock(&fs_info->chunk_mutex);
+		trans = btrfs_attach_transaction(root);
+		if (IS_ERR(trans)) {
+			ret = PTR_ERR(trans);
+			mutex_lock(&fs_info->chunk_mutex);
+			if (ret == -ENOENT)
+				continue;
+			else
+				goto unlock;
+		}
+
+		ret = btrfs_commit_transaction(trans);
+		mutex_lock(&fs_info->chunk_mutex);
+		if (ret)
+			goto unlock;
+
+		spin_lock(&fs_info->trans_lock);
+	}
+	spin_unlock(&fs_info->trans_lock);
+
+	path = btrfs_alloc_path();
+	if (!path) {
+		ret = -ENOMEM;
+		goto unlock;
+	}
+
+	path->reada = READA_FORWARD;
+	path->search_commit_root = 1;
+	path->skip_locking = 1;
+
+	key.objectid = src_dev->devid;
+	key.offset = 0;
+	key.type = BTRFS_DEV_EXTENT_KEY;
+
+	ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
+	if (ret < 0)
+		goto free_path;
+	if (ret > 0) {
+		if (path->slots[0] >=
+		    btrfs_header_nritems(path->nodes[0])) {
+			ret = btrfs_next_leaf(root, path);
+			if (ret < 0)
+				goto free_path;
+			if (ret > 0) {
+				ret = 0;
+				goto free_path;
+			}
+		} else {
+			ret = 0;
+		}
+	}
+
+	while (1) {
+		struct extent_buffer *l = path->nodes[0];
+		int slot = path->slots[0];
+
+		btrfs_item_key_to_cpu(l, &found_key, slot);
+
+		if (found_key.objectid != src_dev->devid)
+			break;
+
+		if (found_key.type != BTRFS_DEV_EXTENT_KEY)
+			break;
+
+		if (found_key.offset < key.offset)
+			break;
+
+		dev_extent = btrfs_item_ptr(l, slot, struct btrfs_dev_extent);
+		length = btrfs_dev_extent_length(l, dev_extent);
+
+		chunk_offset = btrfs_dev_extent_chunk_offset(l, dev_extent);
+
+		cache = btrfs_lookup_block_group(fs_info, chunk_offset);
+		if (!cache)
+			goto skip;
+
+		spin_lock(&cache->lock);
+		cache->to_copy = 1;
+		spin_unlock(&cache->lock);
+
+		btrfs_put_block_group(cache);
+
+skip:
+		ret = btrfs_next_item(root, path);
+		if (ret != 0) {
+			if (ret > 0)
+				ret = 0;
+			break;
+		}
+	}
+
+free_path:
+	btrfs_free_path(path);
+unlock:
+	mutex_unlock(&fs_info->chunk_mutex);
+
+	return ret;
+}
+
+bool btrfs_finish_block_group_to_copy(struct btrfs_device *srcdev,
+				      struct btrfs_block_group *cache,
+				      u64 physical)
+{
+	struct btrfs_fs_info *fs_info = cache->fs_info;
+	struct extent_map *em;
+	struct map_lookup *map;
+	u64 chunk_offset = cache->start;
+	int num_extents, cur_extent;
+	int i;
+
+	/* Do not use "to_copy" on non-ZONED for now */
+	if (!btrfs_is_zoned(fs_info))
+		return true;
+
+	spin_lock(&cache->lock);
+	if (cache->removed) {
+		spin_unlock(&cache->lock);
+		return true;
+	}
+	spin_unlock(&cache->lock);
+
+	em = btrfs_get_chunk_map(fs_info, chunk_offset, 1);
+	ASSERT(!IS_ERR(em));
+	map = em->map_lookup;
+
+	num_extents = cur_extent = 0;
+	for (i = 0; i < map->num_stripes; i++) {
+		/* We have more device extent to copy */
+		if (srcdev != map->stripes[i].dev)
+			continue;
+
+		num_extents++;
+		if (physical == map->stripes[i].physical)
+			cur_extent = i;
+	}
+
+	free_extent_map(em);
+
+	if (num_extents > 1 && cur_extent < num_extents - 1) {
+		/*
+		 * Has more stripes on this device. Keep this BG
+		 * readonly until we finish all the stripes.
+		 */
+		return false;
+	}
+
+	/* Last stripe on this device */
+	spin_lock(&cache->lock);
+	cache->to_copy = 0;
+	spin_unlock(&cache->lock);
+
+	return true;
+}
+
 static int btrfs_dev_replace_start(struct btrfs_fs_info *fs_info,
 		const char *tgtdev_name, u64 srcdevid, const char *srcdev_name,
 		int read_src)
@@ -478,6 +657,10 @@ static int btrfs_dev_replace_start(struct btrfs_fs_info *fs_info,
 	if (ret)
 		return ret;
 
+	ret = mark_block_group_to_copy(fs_info, src_device);
+	if (ret)
+		return ret;
+
 	down_write(&dev_replace->rwsem);
 	switch (dev_replace->replace_state) {
 	case BTRFS_IOCTL_DEV_REPLACE_STATE_NEVER_STARTED:
diff --git a/fs/btrfs/dev-replace.h b/fs/btrfs/dev-replace.h
index 60b70dacc299..3911049a5f23 100644
--- a/fs/btrfs/dev-replace.h
+++ b/fs/btrfs/dev-replace.h
@@ -18,5 +18,8 @@ int btrfs_dev_replace_cancel(struct btrfs_fs_info *fs_info);
 void btrfs_dev_replace_suspend_for_unmount(struct btrfs_fs_info *fs_info);
 int btrfs_resume_dev_replace_async(struct btrfs_fs_info *fs_info);
 int __pure btrfs_dev_replace_is_ongoing(struct btrfs_dev_replace *dev_replace);
+bool btrfs_finish_block_group_to_copy(struct btrfs_device *srcdev,
+				      struct btrfs_block_group *cache,
+				      u64 physical);
 
 #endif
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index aa1b36cf5c88..d0d7db3c8b0b 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -3500,6 +3500,17 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
 		if (!cache)
 			goto skip;
 
+
+		if (sctx->is_dev_replace && btrfs_fs_incompat(fs_info, ZONED)) {
+			spin_lock(&cache->lock);
+			if (!cache->to_copy) {
+				spin_unlock(&cache->lock);
+				ro_set = 0;
+				goto done;
+			}
+			spin_unlock(&cache->lock);
+		}
+
 		/*
 		 * Make sure that while we are scrubbing the corresponding block
 		 * group doesn't get its logical address and its device extents
@@ -3631,6 +3642,12 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
 
 		scrub_pause_off(fs_info);
 
+		if (sctx->is_dev_replace &&
+		    !btrfs_finish_block_group_to_copy(dev_replace->srcdev,
+						      cache, found_key.offset))
+			ro_set = 0;
+
+done:
 		down_write(&dev_replace->rwsem);
 		dev_replace->cursor_left = dev_replace->cursor_right;
 		dev_replace->item_needs_writeback = 1;
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH v10 32/41] btrfs: implement cloning for ZONED device-replace
  2020-11-10 11:26 [PATCH v10 00/41] btrfs: zoned block device support Naohiro Aota
                   ` (30 preceding siblings ...)
  2020-11-10 11:26 ` [PATCH v10 31/41] btrfs: mark block groups to copy for device-replace Naohiro Aota
@ 2020-11-10 11:26 ` Naohiro Aota
  2020-11-10 11:26 ` [PATCH v10 33/41] btrfs: implement copying " Naohiro Aota
                   ` (10 subsequent siblings)
  42 siblings, 0 replies; 125+ messages in thread
From: Naohiro Aota @ 2020-11-10 11:26 UTC (permalink / raw)
  To: linux-btrfs, dsterba
  Cc: hare, linux-fsdevel, Jens Axboe, Christoph Hellwig,
	Darrick J. Wong, Naohiro Aota

This is patch 2/4 to implement device-replace for ZONED mode.

In ZONED mode, a block group must be either copied (from the source device
to the destination device) or cloned (written to both devices).

This commit implements the cloning part. If a block group targeted by an IO
is marked to copy, we should not clone the IO to the destination device,
because the block group will eventually be copied by the replace process.

This commit also handles cloning of zone resets.
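
The cloning decision described above can be modeled with a minimal
userspace sketch (this is not kernel code; the helper name is hypothetical,
only the "to_copy" semantics come from the patch): a write is duplicated to
the replace target only while a replace is running and the targeted block
group is not marked to copy.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical model of the check added to handle_ops_on_dev_replace():
 * a write IO is cloned to the replace target only when dev-replace is
 * running and the block group is NOT marked "to_copy" (a marked group
 * will be copied later by the scrub-based copy process anyway).
 */
static bool should_clone_write(bool replace_running, bool bg_to_copy)
{
	return replace_running && !bg_to_copy;
}
```

For example, a write into a block group with "to_copy" set goes only to the
source device; the replace process copies it later.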

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/extent-tree.c | 57 +++++++++++++++++++++++++++++++-----------
 fs/btrfs/scrub.c       |  2 +-
 fs/btrfs/volumes.c     | 33 ++++++++++++++++++++++--
 fs/btrfs/zoned.c       | 11 ++++++++
 4 files changed, 85 insertions(+), 18 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 99640dacf8e6..2ee21076b641 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -35,6 +35,7 @@
 #include "discard.h"
 #include "rcu-string.h"
 #include "zoned.h"
+#include "dev-replace.h"
 
 #undef SCRAMBLE_DELAYED_REFS
 
@@ -1298,6 +1299,46 @@ static int btrfs_issue_discard(struct block_device *bdev, u64 start, u64 len,
 	return ret;
 }
 
+static int do_discard_extent(struct btrfs_bio_stripe *stripe, u64 *bytes)
+{
+	struct btrfs_device *dev = stripe->dev;
+	struct btrfs_fs_info *fs_info = dev->fs_info;
+	struct btrfs_dev_replace *dev_replace = &fs_info->dev_replace;
+	u64 phys = stripe->physical;
+	u64 len = stripe->length;
+	u64 discarded = 0;
+	int ret = 0;
+
+	/* Zone reset in ZONED mode */
+	if (btrfs_can_zone_reset(dev, phys, len)) {
+		u64 src_disc;
+
+		ret = btrfs_reset_device_zone(dev, phys, len, &discarded);
+		if (ret)
+			goto out;
+
+		if (!btrfs_dev_replace_is_ongoing(dev_replace) ||
+		    dev != dev_replace->srcdev)
+			goto out;
+
+		src_disc = discarded;
+
+		/* send to replace target as well */
+		ret = btrfs_reset_device_zone(dev_replace->tgtdev, phys, len,
+					      &discarded);
+		discarded += src_disc;
+	} else if (blk_queue_discard(bdev_get_queue(stripe->dev->bdev))) {
+		ret = btrfs_issue_discard(dev->bdev, phys, len, &discarded);
+	} else {
+		ret = 0;
+		*bytes = 0;
+	}
+
+out:
+	*bytes = discarded;
+	return ret;
+}
+
 int btrfs_discard_extent(struct btrfs_fs_info *fs_info, u64 bytenr,
 			 u64 num_bytes, u64 *actual_bytes)
 {
@@ -1331,28 +1372,14 @@ int btrfs_discard_extent(struct btrfs_fs_info *fs_info, u64 bytenr,
 
 		stripe = bbio->stripes;
 		for (i = 0; i < bbio->num_stripes; i++, stripe++) {
-			struct btrfs_device *dev = stripe->dev;
-			u64 physical = stripe->physical;
-			u64 length = stripe->length;
 			u64 bytes;
-			struct request_queue *req_q;
 
 			if (!stripe->dev->bdev) {
 				ASSERT(btrfs_test_opt(fs_info, DEGRADED));
 				continue;
 			}
 
-			req_q = bdev_get_queue(stripe->dev->bdev);
-			/* Zone reset in ZONED mode */
-			if (btrfs_can_zone_reset(dev, physical, length))
-				ret = btrfs_reset_device_zone(dev, physical,
-							      length, &bytes);
-			else if (blk_queue_discard(req_q))
-				ret = btrfs_issue_discard(dev->bdev, physical,
-							  length, &bytes);
-			else
-				continue;
-
+			ret = do_discard_extent(stripe, &bytes);
 			if (!ret) {
 				discarded_bytes += bytes;
 			} else if (ret != -EOPNOTSUPP) {
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index d0d7db3c8b0b..371bb6437cab 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -3501,7 +3501,7 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
 			goto skip;
 
 
-		if (sctx->is_dev_replace && btrfs_fs_incompat(fs_info, ZONED)) {
+		if (sctx->is_dev_replace && btrfs_is_zoned(fs_info)) {
 			spin_lock(&cache->lock);
 			if (!cache->to_copy) {
 				spin_unlock(&cache->lock);
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index c8187d704c89..434fc6f758cc 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5969,9 +5969,29 @@ static int get_extra_mirror_from_replace(struct btrfs_fs_info *fs_info,
 	return ret;
 }
 
+static bool is_block_group_to_copy(struct btrfs_fs_info *fs_info, u64 logical)
+{
+	struct btrfs_block_group *cache;
+	bool ret;
+
+	/* non-ZONED mode does not use "to_copy" flag */
+	if (!btrfs_is_zoned(fs_info))
+		return false;
+
+	cache = btrfs_lookup_block_group(fs_info, logical);
+
+	spin_lock(&cache->lock);
+	ret = cache->to_copy;
+	spin_unlock(&cache->lock);
+
+	btrfs_put_block_group(cache);
+	return ret;
+}
+
 static void handle_ops_on_dev_replace(enum btrfs_map_op op,
 				      struct btrfs_bio **bbio_ret,
 				      struct btrfs_dev_replace *dev_replace,
+				      u64 logical,
 				      int *num_stripes_ret, int *max_errors_ret)
 {
 	struct btrfs_bio *bbio = *bbio_ret;
@@ -5984,6 +6004,15 @@ static void handle_ops_on_dev_replace(enum btrfs_map_op op,
 	if (op == BTRFS_MAP_WRITE) {
 		int index_where_to_add;
 
+		/*
+		 * A block group which has "to_copy" set will be
+		 * copied eventually by the dev-replace process, so
+		 * we can avoid cloning the IO here.
+		 */
+		if (is_block_group_to_copy(dev_replace->srcdev->fs_info,
+					   logical))
+			return;
+
 		/*
 		 * duplicate the write operations while the dev replace
 		 * procedure is running. Since the copying of the old disk to
@@ -6379,8 +6408,8 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info,
 
 	if (dev_replace_is_ongoing && dev_replace->tgtdev != NULL &&
 	    need_full_stripe(op)) {
-		handle_ops_on_dev_replace(op, &bbio, dev_replace, &num_stripes,
-					  &max_errors);
+		handle_ops_on_dev_replace(op, &bbio, dev_replace, logical,
+					  &num_stripes, &max_errors);
 	}
 
 	*bbio_ret = bbio;
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index d345c07f5fdf..8bf5df03ceb8 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -11,6 +11,7 @@
 #include "disk-io.h"
 #include "block-group.h"
 #include "transaction.h"
+#include "dev-replace.h"
 
 /* Maximum number of zones to report per blkdev_report_zones() call */
 #define BTRFS_REPORT_NR_ZONES   4096
@@ -891,6 +892,8 @@ int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache)
 	for (i = 0; i < map->num_stripes; i++) {
 		bool is_sequential;
 		struct blk_zone zone;
+		struct btrfs_dev_replace *dev_replace = &fs_info->dev_replace;
+		int dev_replace_is_ongoing = 0;
 
 		device = map->stripes[i].dev;
 		physical = map->stripes[i].physical;
@@ -917,6 +920,14 @@ int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache)
 		 */
 		btrfs_dev_clear_zone_empty(device, physical);
 
+		down_read(&dev_replace->rwsem);
+		dev_replace_is_ongoing =
+			btrfs_dev_replace_is_ongoing(dev_replace);
+		if (dev_replace_is_ongoing && dev_replace->tgtdev != NULL)
+			btrfs_dev_clear_zone_empty(dev_replace->tgtdev,
+						   physical);
+		up_read(&dev_replace->rwsem);
+
 		/*
 		 * The group is mapped to a sequential zone. Get the zone write
 		 * pointer to determine the allocation offset within the zone.
-- 
2.27.0



* [PATCH v10 33/41] btrfs: implement copying for ZONED device-replace
  2020-11-10 11:26 [PATCH v10 00/41] btrfs: zoned block device support Naohiro Aota
                   ` (31 preceding siblings ...)
  2020-11-10 11:26 ` [PATCH v10 32/41] btrfs: implement cloning for ZONED device-replace Naohiro Aota
@ 2020-11-10 11:26 ` Naohiro Aota
  2020-11-10 11:26 ` [PATCH v10 34/41] btrfs: support dev-replace in ZONED mode Naohiro Aota
                   ` (9 subsequent siblings)
  42 siblings, 0 replies; 125+ messages in thread
From: Naohiro Aota @ 2020-11-10 11:26 UTC (permalink / raw)
  To: linux-btrfs, dsterba
  Cc: hare, linux-fsdevel, Jens Axboe, Christoph Hellwig,
	Darrick J. Wong, Naohiro Aota

This is patch 3/4 to implement device-replace on ZONED mode.

This commit implements the copying part, which tracks the write pointer
during the device-replace process. Since device-replace's copying is smart
enough to copy only the used extents on the source device, we have to fill
the gaps to honor the sequential write rule on the target device.

The device-replace process in ZONED mode must copy or clone all the extents
on the source device exactly once.  So, we need to ensure that allocations
started just before the dev-replace process have their corresponding extent
information in the B-trees. finish_extent_writes_for_zoned() implements
that functionality, which basically is the code removed in the commit
042528f8d840 ("Btrfs: fix block group remaining RO forever after error
during device replace").

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/scrub.c | 86 ++++++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/zoned.c | 12 +++++++
 fs/btrfs/zoned.h |  8 +++++
 3 files changed, 106 insertions(+)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 371bb6437cab..aaf7882dee06 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -169,6 +169,7 @@ struct scrub_ctx {
 	int			pages_per_rd_bio;
 
 	int			is_dev_replace;
+	u64			write_pointer;
 
 	struct scrub_bio        *wr_curr_bio;
 	struct mutex            wr_lock;
@@ -1623,6 +1624,25 @@ static int scrub_write_page_to_dev_replace(struct scrub_block *sblock,
 	return scrub_add_page_to_wr_bio(sblock->sctx, spage);
 }
 
+static int fill_writer_pointer_gap(struct scrub_ctx *sctx, u64 physical)
+{
+	int ret = 0;
+	u64 length;
+
+	if (!btrfs_is_zoned(sctx->fs_info))
+		return 0;
+
+	if (sctx->write_pointer < physical) {
+		length = physical - sctx->write_pointer;
+
+		ret = btrfs_zoned_issue_zeroout(sctx->wr_tgtdev,
+						sctx->write_pointer, length);
+		if (!ret)
+			sctx->write_pointer = physical;
+	}
+	return ret;
+}
+
 static int scrub_add_page_to_wr_bio(struct scrub_ctx *sctx,
 				    struct scrub_page *spage)
 {
@@ -1645,6 +1665,13 @@ static int scrub_add_page_to_wr_bio(struct scrub_ctx *sctx,
 	if (sbio->page_count == 0) {
 		struct bio *bio;
 
+		ret = fill_writer_pointer_gap(sctx,
+					      spage->physical_for_dev_replace);
+		if (ret) {
+			mutex_unlock(&sctx->wr_lock);
+			return ret;
+		}
+
 		sbio->physical = spage->physical_for_dev_replace;
 		sbio->logical = spage->logical;
 		sbio->dev = sctx->wr_tgtdev;
@@ -1706,6 +1733,10 @@ static void scrub_wr_submit(struct scrub_ctx *sctx)
 	 * doubled the write performance on spinning disks when measured
 	 * with Linux 3.5 */
 	btrfsic_submit_bio(sbio->bio);
+
+	if (btrfs_is_zoned(sctx->fs_info))
+		sctx->write_pointer = sbio->physical +
+			sbio->page_count * PAGE_SIZE;
 }
 
 static void scrub_wr_bio_end_io(struct bio *bio)
@@ -2973,6 +3004,21 @@ static noinline_for_stack int scrub_raid56_parity(struct scrub_ctx *sctx,
 	return ret < 0 ? ret : 0;
 }
 
+static void sync_replace_for_zoned(struct scrub_ctx *sctx)
+{
+	if (!btrfs_is_zoned(sctx->fs_info))
+		return;
+
+	sctx->flush_all_writes = true;
+	scrub_submit(sctx);
+	mutex_lock(&sctx->wr_lock);
+	scrub_wr_submit(sctx);
+	mutex_unlock(&sctx->wr_lock);
+
+	wait_event(sctx->list_wait,
+		   atomic_read(&sctx->bios_in_flight) == 0);
+}
+
 static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx,
 					   struct map_lookup *map,
 					   struct btrfs_device *scrub_dev,
@@ -3105,6 +3151,14 @@ static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx,
 	 */
 	blk_start_plug(&plug);
 
+	if (sctx->is_dev_replace &&
+	    btrfs_dev_is_sequential(sctx->wr_tgtdev, physical)) {
+		mutex_lock(&sctx->wr_lock);
+		sctx->write_pointer = physical;
+		mutex_unlock(&sctx->wr_lock);
+		sctx->flush_all_writes = true;
+	}
+
 	/*
 	 * now find all extents for each stripe and scrub them
 	 */
@@ -3292,6 +3346,9 @@ static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx,
 			if (ret)
 				goto out;
 
+			if (sctx->is_dev_replace)
+				sync_replace_for_zoned(sctx);
+
 			if (extent_logical + extent_len <
 			    key.objectid + bytes) {
 				if (map->type & BTRFS_BLOCK_GROUP_RAID56_MASK) {
@@ -3414,6 +3471,25 @@ static noinline_for_stack int scrub_chunk(struct scrub_ctx *sctx,
 	return ret;
 }
 
+static int finish_extent_writes_for_zoned(struct btrfs_root *root,
+					  struct btrfs_block_group *cache)
+{
+	struct btrfs_fs_info *fs_info = cache->fs_info;
+	struct btrfs_trans_handle *trans;
+
+	if (!btrfs_is_zoned(fs_info))
+		return 0;
+
+	btrfs_wait_block_group_reservations(cache);
+	btrfs_wait_nocow_writers(cache);
+	btrfs_wait_ordered_roots(fs_info, U64_MAX, cache->start, cache->length);
+
+	trans = btrfs_join_transaction(root);
+	if (IS_ERR(trans))
+		return PTR_ERR(trans);
+	return btrfs_commit_transaction(trans);
+}
+
 static noinline_for_stack
 int scrub_enumerate_chunks(struct scrub_ctx *sctx,
 			   struct btrfs_device *scrub_dev, u64 start, u64 end)
@@ -3569,6 +3645,16 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
 		 * group is not RO.
 		 */
 		ret = btrfs_inc_block_group_ro(cache, sctx->is_dev_replace);
+		if (!ret && sctx->is_dev_replace) {
+			ret = finish_extent_writes_for_zoned(root, cache);
+			if (ret) {
+				btrfs_dec_block_group_ro(cache);
+				scrub_pause_off(fs_info);
+				btrfs_put_block_group(cache);
+				break;
+			}
+		}
+
 		if (ret == 0) {
 			ro_set = 1;
 		} else if (ret == -ENOSPC && !sctx->is_dev_replace) {
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index 8bf5df03ceb8..cce4ddfff5d2 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -1189,3 +1189,15 @@ void btrfs_revert_meta_write_pointer(struct btrfs_block_group *cache,
 	ASSERT(cache->meta_write_pointer == eb->start + eb->len);
 	cache->meta_write_pointer = eb->start;
 }
+
+int btrfs_zoned_issue_zeroout(struct btrfs_device *device, u64 physical,
+			      u64 length)
+{
+	if (!btrfs_dev_is_sequential(device, physical))
+		return -EOPNOTSUPP;
+
+	return blkdev_issue_zeroout(device->bdev,
+				    physical >> SECTOR_SHIFT,
+				    length >> SECTOR_SHIFT,
+				    GFP_NOFS, 0);
+}
diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h
index 41d786a97e40..40204f8310ca 100644
--- a/fs/btrfs/zoned.h
+++ b/fs/btrfs/zoned.h
@@ -53,6 +53,8 @@ bool btrfs_check_meta_write_pointer(struct btrfs_fs_info *fs_info,
 				    struct btrfs_block_group **cache_ret);
 void btrfs_revert_meta_write_pointer(struct btrfs_block_group *cache,
 				     struct extent_buffer *eb);
+int btrfs_zoned_issue_zeroout(struct btrfs_device *device, u64 physical,
+			      u64 length);
 #else /* CONFIG_BLK_DEV_ZONED */
 static inline int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
 				     struct blk_zone *zone)
@@ -158,6 +160,12 @@ static inline void btrfs_revert_meta_write_pointer(
 {
 }
 
+static inline int btrfs_zoned_issue_zeroout(struct btrfs_device *device,
+					    u64 physical, u64 length)
+{
+	return -EOPNOTSUPP;
+}
+
 #endif
 
 static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)
-- 
2.27.0



* [PATCH v10 34/41] btrfs: support dev-replace in ZONED mode
  2020-11-10 11:26 [PATCH v10 00/41] btrfs: zoned block device support Naohiro Aota
                   ` (32 preceding siblings ...)
  2020-11-10 11:26 ` [PATCH v10 33/41] btrfs: implement copying " Naohiro Aota
@ 2020-11-10 11:26 ` Naohiro Aota
  2020-11-10 11:26 ` [PATCH v10 35/41] btrfs: enable relocation " Naohiro Aota
                   ` (8 subsequent siblings)
  42 siblings, 0 replies; 125+ messages in thread
From: Naohiro Aota @ 2020-11-10 11:26 UTC (permalink / raw)
  To: linux-btrfs, dsterba
  Cc: hare, linux-fsdevel, Jens Axboe, Christoph Hellwig,
	Darrick J. Wong, Naohiro Aota, Josef Bacik

This is patch 4/4 to implement device-replace on ZONED mode.

Even after the copying is done, the write pointers of the source device and
the destination device may not be synchronized. For example, when the last
allocated extent is freed before the device-replace process, that extent is
not copied, leaving a hole there.

This patch synchronizes the write pointers by writing zeros to the
destination device.
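
The write-pointer arithmetic can be sketched as a userspace model
(hypothetical function name; the real code is
btrfs_sync_zone_write_pointer() in this patch). A zone report returns
`wp` and `start` in 512-byte sectors, so the byte offset of the target's
write pointer inside the device extent is recovered by shifting; a copy
position beyond the write pointer indicates corruption (-EUCLEAN in the
patch), and anything short of it is the length to zero-fill.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SHIFT 9	/* 512-byte sectors, as in the block layer */

/*
 * Hypothetical model of the computation in btrfs_sync_zone_write_pointer():
 * translate the zone's write pointer (reported in sectors) into a byte
 * offset within the device extent, then return how many bytes must be
 * zero-filled to catch up, or -1 (stands in for -EUCLEAN) if the copy
 * position has somehow overrun the write pointer.
 */
static int64_t zeroout_length(uint64_t physical_start, uint64_t physical_pos,
			      uint64_t zone_wp_sector, uint64_t zone_start_sector)
{
	uint64_t wp = physical_start +
		      ((zone_wp_sector - zone_start_sector) << SECTOR_SHIFT);

	if (physical_pos > wp)
		return -1;	/* corruption: position past the write pointer */
	return (int64_t)(wp - physical_pos);
}
```

For instance, a zone whose write pointer sits 16 sectors (8 KiB) into the
extent, with copying stopped at 4 KiB, leaves 4 KiB to zero out.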

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/scrub.c | 36 +++++++++++++++++++++++++
 fs/btrfs/zoned.c | 69 ++++++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/zoned.h |  9 +++++++
 3 files changed, 114 insertions(+)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index aaf7882dee06..0e2211b9c810 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -3019,6 +3019,31 @@ static void sync_replace_for_zoned(struct scrub_ctx *sctx)
 		   atomic_read(&sctx->bios_in_flight) == 0);
 }
 
+static int sync_write_pointer_for_zoned(struct scrub_ctx *sctx, u64 logical,
+					u64 physical, u64 physical_end)
+{
+	struct btrfs_fs_info *fs_info = sctx->fs_info;
+	int ret = 0;
+
+	if (!btrfs_is_zoned(fs_info))
+		return 0;
+
+	wait_event(sctx->list_wait, atomic_read(&sctx->bios_in_flight) == 0);
+
+	mutex_lock(&sctx->wr_lock);
+	if (sctx->write_pointer < physical_end) {
+		ret = btrfs_sync_zone_write_pointer(sctx->wr_tgtdev, logical,
+						    physical,
+						    sctx->write_pointer);
+		if (ret)
+			btrfs_err(fs_info, "failed to recover write pointer");
+	}
+	mutex_unlock(&sctx->wr_lock);
+	btrfs_dev_clear_zone_empty(sctx->wr_tgtdev, physical);
+
+	return ret;
+}
+
 static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx,
 					   struct map_lookup *map,
 					   struct btrfs_device *scrub_dev,
@@ -3416,6 +3441,17 @@ static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx,
 	blk_finish_plug(&plug);
 	btrfs_free_path(path);
 	btrfs_free_path(ppath);
+
+	if (sctx->is_dev_replace && ret >= 0) {
+		int ret2;
+
+		ret2 = sync_write_pointer_for_zoned(sctx, base + offset,
+						    map->stripes[num].physical,
+						    physical_end);
+		if (ret2)
+			ret = ret2;
+	}
+
 	return ret < 0 ? ret : 0;
 }
 
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index cce4ddfff5d2..77ca93bda258 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -12,6 +12,7 @@
 #include "block-group.h"
 #include "transaction.h"
 #include "dev-replace.h"
+#include "space-info.h"
 
 /* Maximum number of zones to report per blkdev_report_zones() call */
 #define BTRFS_REPORT_NR_ZONES   4096
@@ -1201,3 +1202,71 @@ int btrfs_zoned_issue_zeroout(struct btrfs_device *device, u64 physical,
 				    length >> SECTOR_SHIFT,
 				    GFP_NOFS, 0);
 }
+
+static int read_zone_info(struct btrfs_fs_info *fs_info, u64 logical,
+			  struct blk_zone *zone)
+{
+	struct btrfs_bio *bbio = NULL;
+	u64 mapped_length = PAGE_SIZE;
+	unsigned int nofs_flag;
+	int nmirrors;
+	int i, ret;
+
+	ret = btrfs_map_sblock(fs_info, BTRFS_MAP_GET_READ_MIRRORS, logical,
+			       &mapped_length, &bbio);
+	if (ret || !bbio || mapped_length < PAGE_SIZE) {
+		btrfs_put_bbio(bbio);
+		return -EIO;
+	}
+
+	if (bbio->map_type & BTRFS_BLOCK_GROUP_RAID56_MASK)
+		return -EINVAL;
+
+	nofs_flag = memalloc_nofs_save();
+	nmirrors = (int)bbio->num_stripes;
+	for (i = 0; i < nmirrors; i++) {
+		u64 physical = bbio->stripes[i].physical;
+		struct btrfs_device *dev = bbio->stripes[i].dev;
+
+		/* Missing device */
+		if (!dev->bdev)
+			continue;
+
+		ret = btrfs_get_dev_zone(dev, physical, zone);
+		/* Failing device */
+		if (ret == -EIO || ret == -EOPNOTSUPP)
+			continue;
+		break;
+	}
+	memalloc_nofs_restore(nofs_flag);
+
+	return ret;
+}
+
+int btrfs_sync_zone_write_pointer(struct btrfs_device *tgt_dev, u64 logical,
+				    u64 physical_start, u64 physical_pos)
+{
+	struct btrfs_fs_info *fs_info = tgt_dev->fs_info;
+	struct blk_zone zone;
+	u64 length;
+	u64 wp;
+	int ret;
+
+	if (!btrfs_dev_is_sequential(tgt_dev, physical_pos))
+		return 0;
+
+	ret = read_zone_info(fs_info, logical, &zone);
+	if (ret)
+		return ret;
+
+	wp = physical_start + ((zone.wp - zone.start) << SECTOR_SHIFT);
+
+	if (physical_pos == wp)
+		return 0;
+
+	if (physical_pos > wp)
+		return -EUCLEAN;
+
+	length = wp - physical_pos;
+	return btrfs_zoned_issue_zeroout(tgt_dev, physical_pos, length);
+}
diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h
index 40204f8310ca..5b61500a0aa9 100644
--- a/fs/btrfs/zoned.h
+++ b/fs/btrfs/zoned.h
@@ -55,6 +55,8 @@ void btrfs_revert_meta_write_pointer(struct btrfs_block_group *cache,
 				     struct extent_buffer *eb);
 int btrfs_zoned_issue_zeroout(struct btrfs_device *device, u64 physical,
 			      u64 length);
+int btrfs_sync_zone_write_pointer(struct btrfs_device *tgt_dev, u64 logical,
+				  u64 physical_start, u64 physical_pos);
 #else /* CONFIG_BLK_DEV_ZONED */
 static inline int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
 				     struct blk_zone *zone)
@@ -166,6 +168,13 @@ static inline int btrfs_zoned_issue_zeroout(struct btrfs_device *device,
 	return -EOPNOTSUPP;
 }
 
+static inline int btrfs_sync_zone_write_pointer(struct btrfs_device *tgt_dev,
+						u64 logical, u64 physical_start,
+						u64 physical_pos)
+{
+	return -EOPNOTSUPP;
+}
+
 #endif
 
 static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)
-- 
2.27.0



* [PATCH v10 35/41] btrfs: enable relocation in ZONED mode
  2020-11-10 11:26 [PATCH v10 00/41] btrfs: zoned block device support Naohiro Aota
                   ` (33 preceding siblings ...)
  2020-11-10 11:26 ` [PATCH v10 34/41] btrfs: support dev-replace in ZONED mode Naohiro Aota
@ 2020-11-10 11:26 ` Naohiro Aota
  2020-11-10 11:26 ` [PATCH v10 36/41] btrfs: relocate block group to repair IO failure in ZONED Naohiro Aota
                   ` (7 subsequent siblings)
  42 siblings, 0 replies; 125+ messages in thread
From: Naohiro Aota @ 2020-11-10 11:26 UTC (permalink / raw)
  To: linux-btrfs, dsterba
  Cc: hare, linux-fsdevel, Jens Axboe, Christoph Hellwig,
	Darrick J. Wong, Naohiro Aota

To serialize allocation and submit_bio, we introduced a mutex around them.
As a result, preallocation must be completely disabled to avoid a deadlock.

Since the current relocation process relies on preallocation to move file
data extents, it must be handled another way. In ZONED mode, we just
truncate the inode to the size that we wanted to preallocate. Then, we
flush the dirty pages on the file before finishing the relocation process.
run_delalloc_zoned() will handle all the allocation and submit the IOs to
the underlying layers.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/relocation.c | 35 +++++++++++++++++++++++++++++++++--
 1 file changed, 33 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 3602806d71bd..44b697b881b6 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -2603,6 +2603,32 @@ static noinline_for_stack int prealloc_file_extent_cluster(
 	if (ret)
 		return ret;
 
+	/*
+	 * In ZONED mode, we cannot preallocate the file region. Instead,
+	 * we dirty and filemap_write the region.
+	 */
+
+	if (btrfs_is_zoned(inode->root->fs_info)) {
+		struct btrfs_root *root = inode->root;
+		struct btrfs_trans_handle *trans;
+
+		end = cluster->end - offset + 1;
+		trans = btrfs_start_transaction(root, 1);
+		if (IS_ERR(trans))
+			return PTR_ERR(trans);
+
+		inode->vfs_inode.i_ctime = current_time(&inode->vfs_inode);
+		i_size_write(&inode->vfs_inode, end);
+		ret = btrfs_update_inode(trans, root, &inode->vfs_inode);
+		if (ret) {
+			btrfs_abort_transaction(trans, ret);
+			btrfs_end_transaction(trans);
+			return ret;
+		}
+
+		return btrfs_end_transaction(trans);
+	}
+
 	inode_lock(&inode->vfs_inode);
 	for (nr = 0; nr < cluster->nr; nr++) {
 		start = cluster->boundary[nr] - offset;
@@ -2799,6 +2825,8 @@ static int relocate_file_extent_cluster(struct inode *inode,
 		}
 	}
 	WARN_ON(nr != cluster->nr);
+	if (btrfs_is_zoned(fs_info) && !ret)
+		ret = btrfs_wait_ordered_range(inode, 0, (u64)-1);
 out:
 	kfree(ra);
 	return ret;
@@ -3434,8 +3462,12 @@ static int __insert_orphan_inode(struct btrfs_trans_handle *trans,
 	struct btrfs_path *path;
 	struct btrfs_inode_item *item;
 	struct extent_buffer *leaf;
+	u64 flags = BTRFS_INODE_NOCOMPRESS | BTRFS_INODE_PREALLOC;
 	int ret;
 
+	if (btrfs_is_zoned(trans->fs_info))
+		flags &= ~BTRFS_INODE_PREALLOC;
+
 	path = btrfs_alloc_path();
 	if (!path)
 		return -ENOMEM;
@@ -3450,8 +3482,7 @@ static int __insert_orphan_inode(struct btrfs_trans_handle *trans,
 	btrfs_set_inode_generation(leaf, item, 1);
 	btrfs_set_inode_size(leaf, item, 0);
 	btrfs_set_inode_mode(leaf, item, S_IFREG | 0600);
-	btrfs_set_inode_flags(leaf, item, BTRFS_INODE_NOCOMPRESS |
-					  BTRFS_INODE_PREALLOC);
+	btrfs_set_inode_flags(leaf, item, flags);
 	btrfs_mark_buffer_dirty(leaf);
 out:
 	btrfs_free_path(path);
-- 
2.27.0



* [PATCH v10 36/41] btrfs: relocate block group to repair IO failure in ZONED
  2020-11-10 11:26 [PATCH v10 00/41] btrfs: zoned block device support Naohiro Aota
                   ` (34 preceding siblings ...)
  2020-11-10 11:26 ` [PATCH v10 35/41] btrfs: enable relocation " Naohiro Aota
@ 2020-11-10 11:26 ` Naohiro Aota
  2020-11-10 11:26 ` [PATCH v10 37/41] btrfs: split alloc_log_tree() Naohiro Aota
                   ` (6 subsequent siblings)
  42 siblings, 0 replies; 125+ messages in thread
From: Naohiro Aota @ 2020-11-10 11:26 UTC (permalink / raw)
  To: linux-btrfs, dsterba
  Cc: hare, linux-fsdevel, Jens Axboe, Christoph Hellwig,
	Darrick J. Wong, Naohiro Aota, Josef Bacik

When btrfs finds a checksum error and the file system has a mirror of the
damaged data, btrfs reads the correct data from the mirror and writes it
back to the damaged blocks. This repairing, however, violates the
sequential write rule.

We can consider three methods to repair an IO failure in ZONED mode:
(1) Reset and rewrite the damaged zone
(2) Allocate a new device extent and replace the damaged device extent with
    the new one
(3) Relocate the corresponding block group

Method (1) is most similar to the behavior on regular devices. However, it
also wipes non-damaged data in the same device extent, so it unnecessarily
degrades non-damaged data.

Method (2) is much like device replacing, but done within the same device.
It is safe because it keeps the device extent until the replacing finishes.
However, extending device replacing is non-trivial. It assumes
"src_dev->physical == dst_dev->physical". Also, the extent mapping
replacing function should be extended to support replacing a device extent
position within one device.

Method (3) invokes relocation of the damaged block group, so it is
straightforward to implement. It relocates all the mirrored device extents,
so it is, potentially, a more costly operation than method (1) or (2). But
it relocates only the used extents, which reduces the total IO size.

Let's apply method (3) for now. In the future, we can extend device-replace
and apply method (2).

To protect a block group from being relocated multiple times by multiple IO
errors, this commit introduces a "relocating_repair" bit to show that it is
now being relocated to repair IO failures. It also uses a new kthread,
"btrfs-relocating-repair", so as not to block the IO path with the
relocating process.

This commit also supports repairing in the scrub process.
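
The "relocating_repair" guard can be modeled as a simple test-and-set in
userspace C (hypothetical names; in the patch the test and set happen under
the block group's spinlock in btrfs_repair_one_zone()): only the first IO
error on a block group launches the repair kthread, later errors see the
bit already set and return early.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the relevant btrfs_block_group state. */
struct bg {
	bool relocating_repair;
};

/*
 * Hypothetical model of the guard in btrfs_repair_one_zone(): the bit is
 * tested and set in one critical section (under cache->lock in the real
 * code), so exactly one caller wins and starts the repair kthread.
 */
static bool start_repair_once(struct bg *bg)
{
	if (bg->relocating_repair)
		return false;	/* repair already queued */
	bg->relocating_repair = true;
	return true;
}
```

A second checksum error against the same block group therefore does not
spawn a second "btrfs-relocating-repair" kthread.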

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/block-group.h |  1 +
 fs/btrfs/extent_io.c   |  3 ++
 fs/btrfs/scrub.c       |  3 ++
 fs/btrfs/volumes.c     | 71 ++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/volumes.h     |  1 +
 5 files changed, 79 insertions(+)

diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h
index ccbcf37eae9c..25f67fe24746 100644
--- a/fs/btrfs/block-group.h
+++ b/fs/btrfs/block-group.h
@@ -96,6 +96,7 @@ struct btrfs_block_group {
 	unsigned int has_caching_ctl:1;
 	unsigned int removed:1;
 	unsigned int to_copy:1;
+	unsigned int relocating_repair:1;
 
 	int disk_cache_state;
 
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index d26c827f39c6..c11cf531ba86 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2268,6 +2268,9 @@ int repair_io_failure(struct btrfs_fs_info *fs_info, u64 ino, u64 start,
 	ASSERT(!(fs_info->sb->s_flags & SB_RDONLY));
 	BUG_ON(!mirror_num);
 
+	if (btrfs_is_zoned(fs_info))
+		return btrfs_repair_one_zone(fs_info, logical);
+
 	bio = btrfs_io_bio_alloc(1);
 	bio->bi_iter.bi_size = 0;
 	map_length = length;
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 0e2211b9c810..e6a8df8a8f4f 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -861,6 +861,9 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
 	have_csum = sblock_to_check->pagev[0]->have_csum;
 	dev = sblock_to_check->pagev[0]->dev;
 
+	if (btrfs_is_zoned(fs_info) && !sctx->is_dev_replace)
+		return btrfs_repair_one_zone(fs_info, logical);
+
 	/*
 	 * We must use GFP_NOFS because the scrub task might be waiting for a
 	 * worker task executing this function and in turn a transaction commit
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 434fc6f758cc..8788dc64ba46 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -7984,3 +7984,74 @@ bool btrfs_pinned_by_swapfile(struct btrfs_fs_info *fs_info, void *ptr)
 	spin_unlock(&fs_info->swapfile_pins_lock);
 	return node != NULL;
 }
+
+static int relocating_repair_kthread(void *data)
+{
+	struct btrfs_block_group *cache = (struct btrfs_block_group *) data;
+	struct btrfs_fs_info *fs_info = cache->fs_info;
+	u64 target;
+	int ret = 0;
+
+	target = cache->start;
+	btrfs_put_block_group(cache);
+
+	if (!btrfs_exclop_start(fs_info, BTRFS_EXCLOP_BALANCE)) {
+		btrfs_info(fs_info,
+			   "zoned: skip relocating block group %llu to repair: EBUSY",
+			   target);
+		return -EBUSY;
+	}
+
+	mutex_lock(&fs_info->delete_unused_bgs_mutex);
+
+	/* Ensure Block Group still exists */
+	cache = btrfs_lookup_block_group(fs_info, target);
+	if (!cache)
+		goto out;
+
+	if (!cache->relocating_repair)
+		goto out;
+
+	ret = btrfs_may_alloc_data_chunk(fs_info, target);
+	if (ret < 0)
+		goto out;
+
+	btrfs_info(fs_info, "zoned: relocating block group %llu to repair IO failure",
+		   target);
+	ret = btrfs_relocate_chunk(fs_info, target);
+
+out:
+	if (cache)
+		btrfs_put_block_group(cache);
+	mutex_unlock(&fs_info->delete_unused_bgs_mutex);
+	btrfs_exclop_finish(fs_info);
+
+	return ret;
+}
+
+int btrfs_repair_one_zone(struct btrfs_fs_info *fs_info, u64 logical)
+{
+	struct btrfs_block_group *cache;
+
+	/* Do not attempt to repair in degraded state */
+	if (btrfs_test_opt(fs_info, DEGRADED))
+		return 0;
+
+	cache = btrfs_lookup_block_group(fs_info, logical);
+	if (!cache)
+		return 0;
+
+	spin_lock(&cache->lock);
+	if (cache->relocating_repair) {
+		spin_unlock(&cache->lock);
+		btrfs_put_block_group(cache);
+		return 0;
+	}
+	cache->relocating_repair = 1;
+	spin_unlock(&cache->lock);
+
+	kthread_run(relocating_repair_kthread, cache,
+		    "btrfs-relocating-repair");
+
+	return 0;
+}
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index cff1f7689eac..7c1ad6901791 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -584,5 +584,6 @@ void btrfs_scratch_superblocks(struct btrfs_fs_info *fs_info,
 int btrfs_bg_type_to_factor(u64 flags);
 const char *btrfs_bg_type_to_raid_name(u64 flags);
 int btrfs_verify_dev_extents(struct btrfs_fs_info *fs_info);
+int btrfs_repair_one_zone(struct btrfs_fs_info *fs_info, u64 logical);
 
 #endif
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH v10 37/41] btrfs: split alloc_log_tree()
  2020-11-10 11:26 [PATCH v10 00/41] btrfs: zoned block device support Naohiro Aota
                   ` (35 preceding siblings ...)
  2020-11-10 11:26 ` [PATCH v10 36/41] btrfs: relocate block group to repair IO failure in ZONED Naohiro Aota
@ 2020-11-10 11:26 ` Naohiro Aota
  2020-11-10 11:26 ` [PATCH v10 38/41] btrfs: extend zoned allocator to use dedicated tree-log block group Naohiro Aota
                   ` (5 subsequent siblings)
  42 siblings, 0 replies; 125+ messages in thread
From: Naohiro Aota @ 2020-11-10 11:26 UTC (permalink / raw)
  To: linux-btrfs, dsterba
  Cc: hare, linux-fsdevel, Jens Axboe, Christoph Hellwig,
	Darrick J. Wong, Naohiro Aota, Johannes Thumshirn

This is a preparation for the next patch. This commit splits
alloc_log_tree() into a part that allocates the tree structure (which
remains in alloc_log_tree()) and a part that allocates the tree node
(moved into btrfs_alloc_log_tree_node()). The latter part is also exported
so it can be used in the next patch.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/disk-io.c | 33 +++++++++++++++++++++++++++------
 fs/btrfs/disk-io.h |  2 ++
 2 files changed, 29 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 9490dbbbdb2a..97e3deb46cf1 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1211,7 +1211,6 @@ static struct btrfs_root *alloc_log_tree(struct btrfs_trans_handle *trans,
 					 struct btrfs_fs_info *fs_info)
 {
 	struct btrfs_root *root;
-	struct extent_buffer *leaf;
 
 	root = btrfs_alloc_root(fs_info, BTRFS_TREE_LOG_OBJECTID, GFP_NOFS);
 	if (!root)
@@ -1221,6 +1220,14 @@ static struct btrfs_root *alloc_log_tree(struct btrfs_trans_handle *trans,
 	root->root_key.type = BTRFS_ROOT_ITEM_KEY;
 	root->root_key.offset = BTRFS_TREE_LOG_OBJECTID;
 
+	return root;
+}
+
+int btrfs_alloc_log_tree_node(struct btrfs_trans_handle *trans,
+			      struct btrfs_root *root)
+{
+	struct extent_buffer *leaf;
+
 	/*
 	 * DON'T set SHAREABLE bit for log trees.
 	 *
@@ -1233,26 +1240,33 @@ static struct btrfs_root *alloc_log_tree(struct btrfs_trans_handle *trans,
 
 	leaf = btrfs_alloc_tree_block(trans, root, 0, BTRFS_TREE_LOG_OBJECTID,
 			NULL, 0, 0, 0, BTRFS_NESTING_NORMAL);
-	if (IS_ERR(leaf)) {
-		btrfs_put_root(root);
-		return ERR_CAST(leaf);
-	}
+	if (IS_ERR(leaf))
+		return PTR_ERR(leaf);
 
 	root->node = leaf;
 
 	btrfs_mark_buffer_dirty(root->node);
 	btrfs_tree_unlock(root->node);
-	return root;
+
+	return 0;
 }
 
 int btrfs_init_log_root_tree(struct btrfs_trans_handle *trans,
 			     struct btrfs_fs_info *fs_info)
 {
 	struct btrfs_root *log_root;
+	int ret;
 
 	log_root = alloc_log_tree(trans, fs_info);
 	if (IS_ERR(log_root))
 		return PTR_ERR(log_root);
+
+	ret = btrfs_alloc_log_tree_node(trans, log_root);
+	if (ret) {
+		btrfs_put_root(log_root);
+		return ret;
+	}
+
 	WARN_ON(fs_info->log_root_tree);
 	fs_info->log_root_tree = log_root;
 	return 0;
@@ -1264,11 +1278,18 @@ int btrfs_add_log_tree(struct btrfs_trans_handle *trans,
 	struct btrfs_fs_info *fs_info = root->fs_info;
 	struct btrfs_root *log_root;
 	struct btrfs_inode_item *inode_item;
+	int ret;
 
 	log_root = alloc_log_tree(trans, fs_info);
 	if (IS_ERR(log_root))
 		return PTR_ERR(log_root);
 
+	ret = btrfs_alloc_log_tree_node(trans, log_root);
+	if (ret) {
+		btrfs_put_root(log_root);
+		return ret;
+	}
+
 	log_root->last_trans = trans->transid;
 	log_root->root_key.offset = root->root_key.objectid;
 
diff --git a/fs/btrfs/disk-io.h b/fs/btrfs/disk-io.h
index fee69ced58b4..b82ae3711c42 100644
--- a/fs/btrfs/disk-io.h
+++ b/fs/btrfs/disk-io.h
@@ -115,6 +115,8 @@ blk_status_t btrfs_wq_submit_bio(struct btrfs_fs_info *fs_info, struct bio *bio,
 			extent_submit_bio_start_t *submit_bio_start);
 blk_status_t btrfs_submit_bio_done(void *private_data, struct bio *bio,
 			  int mirror_num);
+int btrfs_alloc_log_tree_node(struct btrfs_trans_handle *trans,
+			      struct btrfs_root *root);
 int btrfs_init_log_root_tree(struct btrfs_trans_handle *trans,
 			     struct btrfs_fs_info *fs_info);
 int btrfs_add_log_tree(struct btrfs_trans_handle *trans,
-- 
2.27.0



* [PATCH v10 38/41] btrfs: extend zoned allocator to use dedicated tree-log block group
  2020-11-10 11:26 [PATCH v10 00/41] btrfs: zoned block device support Naohiro Aota
                   ` (36 preceding siblings ...)
  2020-11-10 11:26 ` [PATCH v10 37/41] btrfs: split alloc_log_tree() Naohiro Aota
@ 2020-11-10 11:26 ` Naohiro Aota
  2020-11-11  4:58   ` [PATCH v10.1 " Naohiro Aota
  2020-11-10 11:26 ` [PATCH v10 39/41] btrfs: serialize log transaction on ZONED mode Naohiro Aota
                   ` (4 subsequent siblings)
  42 siblings, 1 reply; 125+ messages in thread
From: Naohiro Aota @ 2020-11-10 11:26 UTC (permalink / raw)
  To: linux-btrfs, dsterba
  Cc: hare, linux-fsdevel, Jens Axboe, Christoph Hellwig,
	Darrick J. Wong, Naohiro Aota, Johannes Thumshirn

This is the 1/3 patch to enable tree-log on ZONED mode.

The tree-log feature does not work on ZONED mode as is. Blocks for a
tree-log tree are allocated mixed with other metadata blocks, and btrfs
writes and syncs the tree-log blocks to devices at fsync() time, which is
different from the timing of a global transaction commit. As a result,
both writing tree-log blocks and writing other metadata blocks become
non-sequential writes, which ZONED mode must avoid.

We can introduce a dedicated block group for tree-log blocks so that
tree-log blocks and other metadata blocks become separate write streams.
As a result, each write stream can now be written to devices separately.
"fs_info->treelog_bg" tracks the dedicated block group, and btrfs assigns
"treelog_bg" on demand at tree-log block allocation time.

This commit extends the zoned block allocator to use the block group.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/block-group.c |  7 +++++
 fs/btrfs/ctree.h       |  2 ++
 fs/btrfs/extent-tree.c | 63 +++++++++++++++++++++++++++++++++++++++---
 3 files changed, 68 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index 04bb0602f1cc..d222f54eb0c1 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -939,6 +939,13 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
 	btrfs_return_cluster_to_free_space(block_group, cluster);
 	spin_unlock(&cluster->refill_lock);
 
+	if (btrfs_is_zoned(fs_info)) {
+		spin_lock(&fs_info->treelog_bg_lock);
+		if (fs_info->treelog_bg == block_group->start)
+			fs_info->treelog_bg = 0;
+		spin_unlock(&fs_info->treelog_bg_lock);
+	}
+
 	path = btrfs_alloc_path();
 	if (!path) {
 		ret = -ENOMEM;
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 8138e932b7cc..2fd7e58343ce 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -957,6 +957,8 @@ struct btrfs_fs_info {
 	/* Max size to emit ZONE_APPEND write command */
 	u64 max_zone_append_size;
 	struct mutex zoned_meta_io_lock;
+	spinlock_t treelog_bg_lock;
+	u64 treelog_bg;
 
 #ifdef CONFIG_BTRFS_FS_REF_VERIFY
 	spinlock_t ref_verify_lock;
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 2ee21076b641..69d913ffc425 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3631,6 +3631,9 @@ struct find_free_extent_ctl {
 	bool have_caching_bg;
 	bool orig_have_caching_bg;
 
+	/* Allocation is called for tree-log */
+	bool for_treelog;
+
 	/* RAID index, converted from flags */
 	int index;
 
@@ -3868,23 +3871,54 @@ static int do_allocation_zoned(struct btrfs_block_group *block_group,
 			       struct find_free_extent_ctl *ffe_ctl,
 			       struct btrfs_block_group **bg_ret)
 {
+	struct btrfs_fs_info *fs_info = block_group->fs_info;
 	struct btrfs_space_info *space_info = block_group->space_info;
 	struct btrfs_free_space_ctl *ctl = block_group->free_space_ctl;
 	u64 start = block_group->start;
 	u64 num_bytes = ffe_ctl->num_bytes;
 	u64 avail;
+	u64 bytenr = block_group->start;
+	u64 log_bytenr;
 	int ret = 0;
+	bool skip;
 
 	ASSERT(btrfs_is_zoned(block_group->fs_info));
 
+	/*
+	 * Do not allow non-tree-log blocks in the dedicated tree-log block
+	 * group, and vice versa.
+	 */
+	spin_lock(&fs_info->treelog_bg_lock);
+	log_bytenr = fs_info->treelog_bg;
+	skip = log_bytenr && ((ffe_ctl->for_treelog && bytenr != log_bytenr) ||
+			      (!ffe_ctl->for_treelog && bytenr == log_bytenr));
+	spin_unlock(&fs_info->treelog_bg_lock);
+	if (skip)
+		return 1;
+
 	spin_lock(&space_info->lock);
 	spin_lock(&block_group->lock);
+	spin_lock(&fs_info->treelog_bg_lock);
+
+	ASSERT(!ffe_ctl->for_treelog ||
+	       block_group->start == fs_info->treelog_bg ||
+	       fs_info->treelog_bg == 0);
 
 	if (block_group->ro) {
 		ret = 1;
 		goto out;
 	}
 
+	/*
+	 * Do not allow currently using block group to be tree-log dedicated
+	 * block group.
+	 */
+	if (ffe_ctl->for_treelog && !fs_info->treelog_bg &&
+	    (block_group->used || block_group->reserved)) {
+		ret = 1;
+		goto out;
+	}
+
 	avail = block_group->length - block_group->alloc_offset;
 	if (avail < num_bytes) {
 		ffe_ctl->max_extent_size = avail;
@@ -3892,6 +3926,9 @@ static int do_allocation_zoned(struct btrfs_block_group *block_group,
 		goto out;
 	}
 
+	if (ffe_ctl->for_treelog && !fs_info->treelog_bg)
+		fs_info->treelog_bg = block_group->start;
+
 	ffe_ctl->found_offset = start + block_group->alloc_offset;
 	block_group->alloc_offset += num_bytes;
 	spin_lock(&ctl->tree_lock);
@@ -3906,6 +3943,9 @@ static int do_allocation_zoned(struct btrfs_block_group *block_group,
 	ffe_ctl->search_start = ffe_ctl->found_offset;
 
 out:
+	if (ret && ffe_ctl->for_treelog)
+		fs_info->treelog_bg = 0;
+	spin_unlock(&fs_info->treelog_bg_lock);
 	spin_unlock(&block_group->lock);
 	spin_unlock(&space_info->lock);
 	return ret;
@@ -4155,7 +4195,12 @@ static int prepare_allocation(struct btrfs_fs_info *fs_info,
 		return prepare_allocation_clustered(fs_info, ffe_ctl,
 						    space_info, ins);
 	case BTRFS_EXTENT_ALLOC_ZONED:
-		/* nothing to do */
+		if (ffe_ctl->for_treelog) {
+			spin_lock(&fs_info->treelog_bg_lock);
+			if (fs_info->treelog_bg)
+				ffe_ctl->hint_byte = fs_info->treelog_bg;
+			spin_unlock(&fs_info->treelog_bg_lock);
+		}
 		return 0;
 	default:
 		BUG();
@@ -4199,6 +4244,7 @@ static noinline int find_free_extent(struct btrfs_root *root,
 	struct find_free_extent_ctl ffe_ctl = {0};
 	struct btrfs_space_info *space_info;
 	bool full_search = false;
+	bool for_treelog = root->root_key.objectid == BTRFS_TREE_LOG_OBJECTID;
 
 	WARN_ON(num_bytes < fs_info->sectorsize);
 
@@ -4212,6 +4258,7 @@ static noinline int find_free_extent(struct btrfs_root *root,
 	ffe_ctl.orig_have_caching_bg = false;
 	ffe_ctl.found_offset = 0;
 	ffe_ctl.hint_byte = hint_byte_orig;
+	ffe_ctl.for_treelog = for_treelog;
 	ffe_ctl.policy = BTRFS_EXTENT_ALLOC_CLUSTERED;
 
 	/* For clustered allocation */
@@ -4286,8 +4333,15 @@ static noinline int find_free_extent(struct btrfs_root *root,
 		struct btrfs_block_group *bg_ret;
 
 		/* If the block group is read-only, we can skip it entirely. */
-		if (unlikely(block_group->ro))
+		if (unlikely(block_group->ro)) {
+			if (btrfs_is_zoned(fs_info) && for_treelog) {
+				spin_lock(&fs_info->treelog_bg_lock);
+				if (block_group->start == fs_info->treelog_bg)
+					fs_info->treelog_bg = 0;
+				spin_unlock(&fs_info->treelog_bg_lock);
+			}
 			continue;
+		}
 
 		btrfs_grab_block_group(block_group, delalloc);
 		ffe_ctl.search_start = block_group->start;
@@ -4475,6 +4529,7 @@ int btrfs_reserve_extent(struct btrfs_root *root, u64 ram_bytes,
 	bool final_tried = num_bytes == min_alloc_size;
 	u64 flags;
 	int ret;
+	bool for_treelog = root->root_key.objectid == BTRFS_TREE_LOG_OBJECTID;
 
 	flags = get_alloc_profile_by_root(root, is_data);
 again:
@@ -4498,8 +4553,8 @@ int btrfs_reserve_extent(struct btrfs_root *root, u64 ram_bytes,
 
 			sinfo = btrfs_find_space_info(fs_info, flags);
 			btrfs_err(fs_info,
-				  "allocation failed flags %llu, wanted %llu",
-				  flags, num_bytes);
+			"allocation failed flags %llu, wanted %llu treelog %d",
+				  flags, num_bytes, for_treelog);
 			if (sinfo)
 				btrfs_dump_space_info(fs_info, sinfo,
 						      num_bytes, 1);
-- 
2.27.0



* [PATCH v10 39/41] btrfs: serialize log transaction on ZONED mode
  2020-11-10 11:26 [PATCH v10 00/41] btrfs: zoned block device support Naohiro Aota
                   ` (37 preceding siblings ...)
  2020-11-10 11:26 ` [PATCH v10 38/41] btrfs: extend zoned allocator to use dedicated tree-log block group Naohiro Aota
@ 2020-11-10 11:26 ` Naohiro Aota
  2020-11-10 11:26 ` [PATCH v10 40/41] btrfs: reorder log node allocation Naohiro Aota
                   ` (3 subsequent siblings)
  42 siblings, 0 replies; 125+ messages in thread
From: Naohiro Aota @ 2020-11-10 11:26 UTC (permalink / raw)
  To: linux-btrfs, dsterba
  Cc: hare, linux-fsdevel, Jens Axboe, Christoph Hellwig,
	Darrick J. Wong, Naohiro Aota

This is the 2/3 patch to enable tree-log on ZONED mode.

Since we can start more than one log transaction per subvolume
simultaneously, nodes from multiple transactions can be allocated
interleaved. Such mixed allocation results in non-sequential writes at
log transaction commit time. The nodes of the global log root tree
(fs_info->log_root_tree) have the same mixed allocation problem.

This patch serializes log transactions by waiting for a committing
transaction when someone tries to start a new one, to avoid the mixed
allocation problem. We must also wait for running log transactions from
other subvolumes, but there is no easy way to detect which subvolume root
is running a log transaction. So, this patch forbids starting a new log
transaction when other subvolumes have already allocated the global log
root tree.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/tree-log.c | 22 +++++++++++++++++++++-
 1 file changed, 21 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 5f585cf57383..505de1cc1394 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -106,6 +106,7 @@ static noinline int replay_dir_deletes(struct btrfs_trans_handle *trans,
 				       struct btrfs_root *log,
 				       struct btrfs_path *path,
 				       u64 dirid, int del_all);
+static void wait_log_commit(struct btrfs_root *root, int transid);
 
 /*
  * tree logging is a special write ahead log used to make sure that
@@ -140,16 +141,25 @@ static int start_log_trans(struct btrfs_trans_handle *trans,
 			   struct btrfs_log_ctx *ctx)
 {
 	struct btrfs_fs_info *fs_info = root->fs_info;
+	const bool zoned = btrfs_is_zoned(fs_info);
 	int ret = 0;
 
 	mutex_lock(&root->log_mutex);
 
+again:
 	if (root->log_root) {
+		int index = (root->log_transid + 1) % 2;
+
 		if (btrfs_need_log_full_commit(trans)) {
 			ret = -EAGAIN;
 			goto out;
 		}
 
+		if (zoned && atomic_read(&root->log_commit[index])) {
+			wait_log_commit(root, root->log_transid - 1);
+			goto again;
+		}
+
 		if (!root->log_start_pid) {
 			clear_bit(BTRFS_ROOT_MULTI_LOG_TASKS, &root->state);
 			root->log_start_pid = current->pid;
@@ -158,7 +168,9 @@ static int start_log_trans(struct btrfs_trans_handle *trans,
 		}
 	} else {
 		mutex_lock(&fs_info->tree_log_mutex);
-		if (!fs_info->log_root_tree)
+		if (zoned && fs_info->log_root_tree)
+			ret = -EAGAIN;
+		else if (!fs_info->log_root_tree)
 			ret = btrfs_init_log_root_tree(trans, fs_info);
 		mutex_unlock(&fs_info->tree_log_mutex);
 		if (ret)
@@ -193,14 +205,22 @@ static int start_log_trans(struct btrfs_trans_handle *trans,
  */
 static int join_running_log_trans(struct btrfs_root *root)
 {
+	const bool zoned = btrfs_is_zoned(root->fs_info);
 	int ret = -ENOENT;
 
 	if (!test_bit(BTRFS_ROOT_HAS_LOG_TREE, &root->state))
 		return ret;
 
 	mutex_lock(&root->log_mutex);
+again:
 	if (root->log_root) {
+		int index = (root->log_transid + 1) % 2;
+
 		ret = 0;
+		if (zoned && atomic_read(&root->log_commit[index])) {
+			wait_log_commit(root, root->log_transid - 1);
+			goto again;
+		}
 		atomic_inc(&root->log_writers);
 	}
 	mutex_unlock(&root->log_mutex);
-- 
2.27.0



* [PATCH v10 40/41] btrfs: reorder log node allocation
  2020-11-10 11:26 [PATCH v10 00/41] btrfs: zoned block device support Naohiro Aota
                   ` (38 preceding siblings ...)
  2020-11-10 11:26 ` [PATCH v10 39/41] btrfs: serialize log transaction on ZONED mode Naohiro Aota
@ 2020-11-10 11:26 ` Naohiro Aota
  2020-11-10 11:26 ` [PATCH v10 41/41] btrfs: enable to mount ZONED incompat flag Naohiro Aota
                   ` (2 subsequent siblings)
  42 siblings, 0 replies; 125+ messages in thread
From: Naohiro Aota @ 2020-11-10 11:26 UTC (permalink / raw)
  To: linux-btrfs, dsterba
  Cc: hare, linux-fsdevel, Jens Axboe, Christoph Hellwig,
	Darrick J. Wong, Naohiro Aota, Josef Bacik, Johannes Thumshirn

This is the 3/3 patch to enable tree-log on ZONED mode.

The allocation order of the nodes of "fs_info->log_root_tree" and the
nodes of "root->log_root" is not the same as their write order, so
writing them out causes unaligned write errors.

This patch reorders the allocations by delaying allocation of the root
node of "fs_info->log_root_tree" so that the node buffers can go out to
the devices sequentially.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/disk-io.c  |  7 -------
 fs/btrfs/tree-log.c | 24 ++++++++++++++++++------
 2 files changed, 18 insertions(+), 13 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 97e3deb46cf1..e896dd564434 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1255,18 +1255,11 @@ int btrfs_init_log_root_tree(struct btrfs_trans_handle *trans,
 			     struct btrfs_fs_info *fs_info)
 {
 	struct btrfs_root *log_root;
-	int ret;
 
 	log_root = alloc_log_tree(trans, fs_info);
 	if (IS_ERR(log_root))
 		return PTR_ERR(log_root);
 
-	ret = btrfs_alloc_log_tree_node(trans, log_root);
-	if (ret) {
-		btrfs_put_root(log_root);
-		return ret;
-	}
-
 	WARN_ON(fs_info->log_root_tree);
 	fs_info->log_root_tree = log_root;
 	return 0;
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 505de1cc1394..15f9e8a461ee 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -3140,6 +3140,16 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
 	list_add_tail(&root_log_ctx.list, &log_root_tree->log_ctxs[index2]);
 	root_log_ctx.log_transid = log_root_tree->log_transid;
 
+	mutex_lock(&fs_info->tree_log_mutex);
+	if (!log_root_tree->node) {
+		ret = btrfs_alloc_log_tree_node(trans, log_root_tree);
+		if (ret) {
+			mutex_unlock(&fs_info->tree_log_mutex);
+			goto out;
+		}
+	}
+	mutex_unlock(&fs_info->tree_log_mutex);
+
 	/*
 	 * Now we are safe to update the log_root_tree because we're under the
 	 * log_mutex, and we're a current writer so we're holding the commit
@@ -3289,12 +3299,14 @@ static void free_log_tree(struct btrfs_trans_handle *trans,
 		.process_func = process_one_buffer
 	};
 
-	ret = walk_log_tree(trans, log, &wc);
-	if (ret) {
-		if (trans)
-			btrfs_abort_transaction(trans, ret);
-		else
-			btrfs_handle_fs_error(log->fs_info, ret, NULL);
+	if (log->node) {
+		ret = walk_log_tree(trans, log, &wc);
+		if (ret) {
+			if (trans)
+				btrfs_abort_transaction(trans, ret);
+			else
+				btrfs_handle_fs_error(log->fs_info, ret, NULL);
+		}
 	}
 
 	clear_extent_bits(&log->dirty_log_pages, 0, (u64)-1,
-- 
2.27.0



* [PATCH v10 41/41] btrfs: enable to mount ZONED incompat flag
  2020-11-10 11:26 [PATCH v10 00/41] btrfs: zoned block device support Naohiro Aota
                   ` (39 preceding siblings ...)
  2020-11-10 11:26 ` [PATCH v10 40/41] btrfs: reorder log node allocation Naohiro Aota
@ 2020-11-10 11:26 ` Naohiro Aota
  2020-11-10 14:00 ` [PATCH v10 00/41] btrfs: zoned block device support Anand Jain
  2020-11-27 19:28 ` David Sterba
  42 siblings, 0 replies; 125+ messages in thread
From: Naohiro Aota @ 2020-11-10 11:26 UTC (permalink / raw)
  To: linux-btrfs, dsterba
  Cc: hare, linux-fsdevel, Jens Axboe, Christoph Hellwig,
	Darrick J. Wong, Naohiro Aota, Josef Bacik

This final patch adds the ZONED incompat flag to
BTRFS_FEATURE_INCOMPAT_SUPP and enables btrfs to mount a ZONED-flagged
file system.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/ctree.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 2fd7e58343ce..935b3470a069 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -302,7 +302,8 @@ struct btrfs_super_block {
 	 BTRFS_FEATURE_INCOMPAT_SKINNY_METADATA |	\
 	 BTRFS_FEATURE_INCOMPAT_NO_HOLES	|	\
 	 BTRFS_FEATURE_INCOMPAT_METADATA_UUID	|	\
-	 BTRFS_FEATURE_INCOMPAT_RAID1C34)
+	 BTRFS_FEATURE_INCOMPAT_RAID1C34	|	\
+	 BTRFS_FEATURE_INCOMPAT_ZONED)
 
 #define BTRFS_FEATURE_INCOMPAT_SAFE_SET			\
 	(BTRFS_FEATURE_INCOMPAT_EXTENDED_IREF)
-- 
2.27.0



* Re: [PATCH v10 00/41] btrfs: zoned block device support
  2020-11-10 11:26 [PATCH v10 00/41] btrfs: zoned block device support Naohiro Aota
                   ` (40 preceding siblings ...)
  2020-11-10 11:26 ` [PATCH v10 41/41] btrfs: enable to mount ZONED incompat flag Naohiro Aota
@ 2020-11-10 14:00 ` Anand Jain
  2020-11-11  5:07   ` Naohiro Aota
  2020-11-27 19:28 ` David Sterba
  42 siblings, 1 reply; 125+ messages in thread
From: Anand Jain @ 2020-11-10 14:00 UTC (permalink / raw)
  To: Naohiro Aota, linux-btrfs, dsterba
  Cc: hare, linux-fsdevel, Jens Axboe, Christoph Hellwig, Darrick J. Wong

On 10/11/20 7:26 pm, Naohiro Aota wrote:
> This series adds zoned block device support to btrfs.
> 
> This series is also available on github.
> Kernel   https://github.com/naota/linux/tree/btrfs-zoned-v10

  This branch is not reachable. Should it be

     https://github.com/naota/linux/tree/btrfs-zoned-for-v10 ?

  But the commits in this branch are at a pre-fixups stage.

Thanks, Anand


> Userland https://github.com/naota/btrfs-progs/tree/btrfs-zoned
> xfstests https://github.com/naota/fstests/tree/btrfs-zoned

> 
> Userland tool depends on patched util-linux (libblkid and wipefs) to handle
> log-structured superblock. To ease the testing, pre-compiled static linked
> userland tools are available here:
> https://wdc.app.box.com/s/fnhqsb3otrvgkstq66o6bvdw6tk525kp
> 
> This v10 still leaves the following issues left for later fix. But, the
> first part of the series should be good shape to be merged.
> - Bio submission path & splitting an ordered extent
> - Redirtying freed tree blocks
>    - Switch to keeping it dirty
>      - Not working correctly for now
> - Dedicated tree-log block group
>    - We need tree-log for zoned device
>      - Dbench (32 clients) is 85% slower with "-o notreelog"
>    - Need to separate tree-log block group from other metadata space_info
> - Relocation
>    - Use normal write command for relocation
>    - Relocated device extents must be reset
>      - It should be discarded on regular btrfs too though
> 
> Changes from v9:
>    - Extract iomap_dio_bio_opflags() to set the proper bi_opf flag
>    - write pointer emulation
>      - Rewrite using btrfs_previous_extent_item()
>      - Convert ASSERT() to runtime check
>    - Exclude regular superblock positions
>    - Fix an error on writing to conventional zones
>    - Take the transaction lock in mark_block_group_to_copy()
>    - Rename 'hmzoned_devices' to 'zoned_devices' in btrfs_check_zoned_mode()
>    - Add do_discard_extent() helper
>    - Move zoned check into fetch_cluster_info()
>    - Drop setting bdev to bio in btrfs_bio_add_page() (will fix later once
>      we support multiple devices)
>    - Subtract bytes_zone_unusable properly when removing a block group
>    - Add "struct block_device *bdev" directly to btrfs_rmap_block()
>    - Rename btrfs_zone_align to btrfs_align_offset_to_zone
>    - Add comment to use pr_info in place of btrfs_info
>    - Add comment for superblock log zones
>    - Fix coding style
>    - Fix typos
> 
> btrfs-progs and xfstests series will follow.
> 
> This version of ZONED btrfs switched from normal write command to zone
> append write command. You do not need to specify LBA (at the write pointer)
> to write for zone append write command. Instead, you only select a zone to
> write with its start LBA. Then the device (NVMe ZNS), or the emulation of
> zone append command in the sd driver in the case of SAS or SATA HDDs,
> automatically writes the data at the write pointer position and returns the
> written LBA as a command reply.
> 
> The benefit of using the zone append write command is that write command
> issuing order does not matter. So, we can eliminate block group lock and
> utilize asynchronous checksum, which can reorder the IOs.
> 
> Eliminating the lock improves performance. In particular, on a workload
> with massive competing to the same zone [1], we observed 36% performance
> improvement compared to normal write.
> 
> [1] Fio running 16 jobs with 4KB random writes for 5 minutes
> 
> However, there are some limitations. We cannot use the non-SINGLE profile.
> Supporting non-SINGLE profile with zone append writing is not trivial. For
> example, in the DUP profile, we send a zone append writing IO to two zones
> on a device. The device reply with written LBAs for the IOs. If the offsets
> of the returned addresses from the beginning of the zone are different,
> then it results in different logical addresses.
> 
> For the same reason, we cannot issue multiple IOs for one ordered extent.
> Thus, the size of an ordered extent is limited under max_zone_append_size.
> This limitation will cause fragmentation and increased usage of metadata.
> In the future, we can add optimization to merge ordered extents after
> end_bio.
> 
> * Patch series description
> 
> A zoned block device consists of a number of zones. Zones are either
> conventional and accepting random writes or sequential and requiring
> that writes be issued in LBA order from each zone write pointer
> position. This patch series ensures that the sequential write
> constraint of sequential zones is respected while fundamentally not
> changing BtrFS block and I/O management for block stored in
> conventional zones.
> 
> To achieve this, the default chunk size of btrfs is changed on zoned
> block devices so that chunks are always aligned to a zone. Allocation
> of blocks within a chunk is changed so that the allocation is always
> sequential from the beginning of the chunks. To do so, an allocation
> pointer is added to block groups and used as the allocation hint.  The
> allocation changes also ensure that blocks freed below the allocation
> pointer are ignored, resulting in sequential block allocation
> regardless of the chunk usage.
> 
> The zone of a chunk is reset to allow reuse of the zone only when the
> block group is being freed, that is, when all the chunks of the block
> group are unused.
> 
> For btrfs volumes composed of multiple zoned disks, a restriction is
> added to ensure that all disks have the same zone size. This
> restriction matches the existing constraint that all chunks in a block
> group must have the same size.
> 
> * Enabling tree-log
> 
> The tree-log feature does not work on ZONED mode as is. Blocks for a
> tree-log tree are allocated mixed with other metadata blocks, and btrfs
> writes and syncs the tree-log blocks to devices at the time of fsync(),
> which is different timing than a global transaction commit. As a result,
> both writing tree-log blocks and writing other metadata blocks become
> non-sequential writes which ZONED mode must avoid.
> 
> This series introduces a dedicated block group for tree-log blocks to
> create two metadata writing streams, one for tree-log blocks and the
> other for metadata blocks. As a result, each write stream can now be
> written to devices separately and sequentially.
> 
> * Log-structured superblock
> 
> Superblock (and its copies) is the only data structure in btrfs which
> has a fixed location on a device. Since we cannot overwrite in a
> sequential write required zone, we cannot place superblock in the
> zone.
> 
> This series implements superblock log writing. It uses two zones as a
> circular buffer to write updated superblocks. Once the first zone is filled
> up, start writing into the second zone. The first zone will be reset once
> both zones are filled. We can determine the position of the latest
> superblock by reading the write pointer information from a device.
> 
> * Patch series organization
> 
> Patches 1 and 2 are preparing patches for block and iomap layer.
> 
> Patch 3 introduces the ZONED incompatible feature flag to indicate that the
> btrfs volume was formatted for use on zoned block devices.
> 
> Patches 4 to 6 implement functions to gather information on the zones of
> the device (zones type, write pointer position, and max_zone_append_size).
> 
> Patches 7 to 10 disable features which are not compatible with the
> sequential write constraints of zoned block devices. These includes
> space_cache, NODATACOW, fallocate, and MIXED_BG.
> 
> Patch 11 implements the log-structured superblock writing.
> 
> Patches 12 and 13 tweak the device extent allocation for ZONED mode and add
> verification to check if a device extent is properly aligned to zones.
> 
> Patches 14 to 17 implements sequential block allocator for ZONED mode.
> 
> Patch 18 implement a zone reset for unused block groups.
> 
> Patches 19 to 30 implement the writing path for several types of IO
> (non-compressed data, direct IO, and metadata). These include re-dirtying
> once-freed metadata blocks to prevent write holes.
> 
> Patches 31 to 40 tweak some btrfs features to work with ZONED mode. These
> include device-replace, relocation, repairing IO error, and tree-log.
> 
> Finally, patch 41 adds the ZONED feature to the list of supported features.
> 
> * Patch testing note
> 
> ** Zone-aware util-linux
> 
> Since the log-structured superblock feature changed the location of
> superblock magic, the current util-linux (libblkid) cannot detect ZONED
> btrfs anymore. You need to apply a to-be-posted patch to util-linux to make
> it "zone aware".
> 
> ** Testing device
> 
> You need devices with zone append write command support to run ZONED
> btrfs.
> 
> Other than real devices, null_blk supports the zone append write command.
> You can use a memory-backed null_blk to run the tests on it. The following
> script creates a 12800 MB /dev/nullb0.
> 
>      sysfs=/sys/kernel/config/nullb/nullb0
>      size=12800 # MB
>      
>      # drop nullb0
>      if [[ -d $sysfs ]]; then
>              echo 0 > "${sysfs}"/power
>              rmdir $sysfs
>      fi
>      lsmod | grep -q null_blk && rmmod null_blk
>      modprobe null_blk nr_devices=0
>      
>      mkdir "${sysfs}"
>      
>      echo "${size}" > "${sysfs}"/size
>      echo 1 > "${sysfs}"/zoned
>      echo 0 > "${sysfs}"/zone_nr_conv
>      echo 1 > "${sysfs}"/memory_backed
>      
>      echo 1 > "${sysfs}"/power
>      udevadm settle
> 
> Zoned SCSI devices such as SMR HDDs or scsi_debug also support the zone
> append command as an emulated command within the SCSI sd driver. This
> emulation is completely transparent to the user and provides the same
> semantics as native NVMe ZNS drive support.
> 
> Also, there is a qemu patch available to enable NVMe ZNS device emulation.
> 
> ** xfstests
> 
> We ran xfstests on ZONED btrfs and, if we omit some cases that are known
> to fail currently, all test cases pass.
> 
> Cases that can be ignored:
> 1) failing also with the regular btrfs on regular devices,
> 2) trying to test fallocate feature without testing with
>     "_require_xfs_io_command "falloc"",
> 3) trying to test incompatible features for ZONED btrfs (e.g. RAID5/6)
> 4) trying to use incompatible setup for ZONED btrfs (e.g. dm-linear not
>     aligned to zone boundary, swap)
> 5) trying to create a file system with a too small size (we require at
>     least 9 zones to create a ZONED btrfs)
> 6) dropping original MKFS_OPTIONS ("-O zoned"), so it cannot create ZONED
>     btrfs (btrfs/003)
> 7) hitting ENOSPC incurred by the larger metadata block group size
> 
> I will send a patch series for xfstests to handle these cases (2-6)
> properly.
> 
> Patched xfstests is available here:
> 
> https://github.com/naota/fstests/tree/btrfs-zoned
> 
> Also, you need to apply the following patch if you run xfstests with
> tcmu devices. xfstests btrfs/003 fails at "_devmgt_add" after
> "_devmgt_remove" without this patch.
> 
> https://marc.info/?l=linux-scsi&m=156498625421698&w=2
> 
> v9 https://lore.kernel.org/linux-btrfs/cover.1604065156.git.naohiro.aota@wdc.com/
> v8 https://lore.kernel.org/linux-btrfs/cover.1601572459.git.naohiro.aota@wdc.com/
> v7 https://lore.kernel.org/linux-btrfs/20200911123259.3782926-1-naohiro.aota@wdc.com/
> v6 https://lore.kernel.org/linux-btrfs/20191213040915.3502922-1-naohiro.aota@wdc.com/
> v5 https://lore.kernel.org/linux-btrfs/20191204082513.857320-1-naohiro.aota@wdc.com/
> v4 https://lwn.net/Articles/797061/
> v3 https://lore.kernel.org/linux-btrfs/20190808093038.4163421-1-naohiro.aota@wdc.com/
> v2 https://lore.kernel.org/linux-btrfs/20190607131025.31996-1-naohiro.aota@wdc.com/
> v1 https://lore.kernel.org/linux-btrfs/20180809180450.5091-1-naota@elisp.net/
> 
> Changelog
> v9
>   - Direct-IO path now follows several hardware restrictions (other than
>     max_zone_append_size) by using ZONE_APPEND support of iomap
>   - introduces union of fs_info->zone_size and fs_info->zoned [Johannes]
>     - and use btrfs_is_zoned(fs_info) in place of btrfs_fs_incompat(fs_info, ZONED)
>   - print if zoned is enabled or not when printing module info [Johannes]
>   - drop patch of disabling inode_cache on ZONED
>   - moved for_treelog flag to a proper location [Johannes]
>   - Code style fixes [Johannes]
>   - Add comment about adding physical layer things to ordered extent
>     structure
>   - Pass file_offset explicitly to extract_ordered_extent() instead of
>     determining it from bio
>   - Bug fixes
>     - write out fsync region so that the logical address of ordered extents
>       and checksums are properly finalized
>     - free zone_info at umount time
>     - fix superblock log handling when entering zones[1] for the first time
>     - fix double free of log-tree roots [Johannes]
>     - Drop erroneous ASSERT in do_allocation_zoned()
> v8
>   - Use bio_add_hw_page() to build up bio to honor hardware restrictions
>     - add bio_add_zone_append_page() as a wrapper of the function
>   - Split file extent on submitting bio
>     - If bio_add_zone_append_page() fails, split the file extent and send
>       out bio
>     - so, we can ensure one bio == one file extent
>   - Fix build bot issues
>   - Rebased on misc-next
> v7:
>   - Use zone append write command instead of normal write command
>     - Bio issuing order does not matter
>     - No need to use lock anymore
>     - Can use asynchronous checksum
>   - Removed RAID support for now
>   - Rename HMZONED to ZONED
>   - Split some patches
>   - Rebased on kdave/for-5.9-rc3 + iomap direct IO
> v6:
>   - Use bitmap helpers (Johannes)
>   - Code cleanup (Johannes)
>   - Rebased on kdave/for-5.5
>   - Enable the tree-log feature.
>   - Treat conventional zones as sequential zones, so we can now allow
>     mixed allocation of conventional zones and sequential write required
>     zones to construct a block group.
>   - Implement log-structured superblock
>     - No need for one conventional zone at the beginning of a device.
>   - Fix deadlock of direct IO writing
>   - Fix building with !CONFIG_BLK_DEV_ZONED (Johannes)
>   - Fix leak of zone_info (Johannes)
> v5:
>   - Rebased on kdave/for-5.5
>   - Enable the tree-log feature.
>   - Treat conventional zones as sequential zones, so we can now allow
>     mixed allocation of conventional zones and sequential write required
>     zones to construct a block group.
>   - Implement log-structured superblock
>     - No need for one conventional zone at the beginning of a device.
>   - Fix deadlock of direct IO writing
>   - Fix building with !CONFIG_BLK_DEV_ZONED (Johannes)
>   - Fix leak of zone_info (Johannes)
> v4:
>   - Move memory allocation of zone information out of
>     btrfs_get_dev_zones() (Anand)
>   - Add disabled features table in commit log (Anand)
>   - Ensure "max_chunk_size >= devs_min * data_stripes * zone_size"
> v3:
>   - Serialize allocation and submit_bio instead of bio buffering in
>     btrfs_map_bio().
>   -- Disable async checksum/submit in HMZONED mode
>   - Introduce helper functions and hmzoned.c/h (Josef, David)
>   - Add support for repairing IO failure
>   - Add support for NOCOW direct IO write (Josef)
>   - Disable preallocation entirely
>   -- Disable INODE_MAP_CACHE
>   -- relocation is reworked not to rely on preallocation in HMZONED mode
>   - Disable NODATACOW
>   - Disable MIXED_BG
>   - Device extents that cover superblock positions are banned (David)
> v2:
>   - Add support for dev-replace
>   -- To support dev-replace, moved submit_buffer one layer up. It now
>      handles bio instead of btrfs_bio.
>   -- Mark unmirrored Block Group readonly only when there are writable
>      mirrored BGs. Necessary to handle degraded RAID.
>   - Expire worker use vanilla delayed_work instead of btrfs's async-thread
>   - Device extent allocator now ensures that a region is on the same zone type.
>   - Add delayed allocation shrinking.
>   - Rename btrfs_drop_dev_zonetypes() to btrfs_destroy_dev_zonetypes()
>   - Fix
>   -- Use SECTOR_SHIFT (Nikolay)
>   -- Use btrfs_err (Nikolay)
> 
> 
> Johannes Thumshirn (1):
>    block: add bio_add_zone_append_page
> 
> Naohiro Aota (40):
>    iomap: support REQ_OP_ZONE_APPEND
>    btrfs: introduce ZONED feature flag
>    btrfs: get zone information of zoned block devices
>    btrfs: check and enable ZONED mode
>    btrfs: introduce max_zone_append_size
>    btrfs: disallow space_cache in ZONED mode
>    btrfs: disallow NODATACOW in ZONED mode
>    btrfs: disable fallocate in ZONED mode
>    btrfs: disallow mixed-bg in ZONED mode
>    btrfs: implement log-structured superblock for ZONED mode
>    btrfs: implement zoned chunk allocator
>    btrfs: verify device extent is aligned to zone
>    btrfs: load zone's allocation offset
>    btrfs: emulate write pointer for conventional zones
>    btrfs: track unusable bytes for zones
>    btrfs: do sequential extent allocation in ZONED mode
>    btrfs: reset zones of unused block groups
>    btrfs: redirty released extent buffers in ZONED mode
>    btrfs: extract page adding function
>    btrfs: use bio_add_zone_append_page for zoned btrfs
>    btrfs: handle REQ_OP_ZONE_APPEND as writing
>    btrfs: split ordered extent when bio is sent
>    btrfs: extend btrfs_rmap_block for specifying a device
>    btrfs: use ZONE_APPEND write for ZONED btrfs
>    btrfs: enable zone append writing for direct IO
>    btrfs: introduce dedicated data write path for ZONED mode
>    btrfs: serialize meta IOs on ZONED mode
>    btrfs: wait existing extents before truncating
>    btrfs: avoid async metadata checksum on ZONED mode
>    btrfs: mark block groups to copy for device-replace
>    btrfs: implement cloning for ZONED device-replace
>    btrfs: implement copying for ZONED device-replace
>    btrfs: support dev-replace in ZONED mode
>    btrfs: enable relocation in ZONED mode
>    btrfs: relocate block group to repair IO failure in ZONED
>    btrfs: split alloc_log_tree()
>    btrfs: extend zoned allocator to use dedicated tree-log block group
>    btrfs: serialize log transaction on ZONED mode
>    btrfs: reorder log node allocation
>    btrfs: enable to mount ZONED incompat flag
> 
>   block/bio.c                       |   38 +
>   fs/btrfs/Makefile                 |    1 +
>   fs/btrfs/block-group.c            |   84 +-
>   fs/btrfs/block-group.h            |   18 +-
>   fs/btrfs/ctree.h                  |   20 +-
>   fs/btrfs/dev-replace.c            |  195 +++++
>   fs/btrfs/dev-replace.h            |    3 +
>   fs/btrfs/disk-io.c                |   93 ++-
>   fs/btrfs/disk-io.h                |    2 +
>   fs/btrfs/extent-tree.c            |  218 ++++-
>   fs/btrfs/extent_io.c              |  130 ++-
>   fs/btrfs/extent_io.h              |    2 +
>   fs/btrfs/file.c                   |    6 +-
>   fs/btrfs/free-space-cache.c       |   58 ++
>   fs/btrfs/free-space-cache.h       |    2 +
>   fs/btrfs/inode.c                  |  164 +++-
>   fs/btrfs/ioctl.c                  |   13 +
>   fs/btrfs/ordered-data.c           |   79 ++
>   fs/btrfs/ordered-data.h           |   10 +
>   fs/btrfs/relocation.c             |   35 +-
>   fs/btrfs/scrub.c                  |  145 ++++
>   fs/btrfs/space-info.c             |   13 +-
>   fs/btrfs/space-info.h             |    4 +-
>   fs/btrfs/super.c                  |   19 +-
>   fs/btrfs/sysfs.c                  |    4 +
>   fs/btrfs/tests/extent-map-tests.c |    2 +-
>   fs/btrfs/transaction.c            |   10 +
>   fs/btrfs/transaction.h            |    3 +
>   fs/btrfs/tree-log.c               |   52 +-
>   fs/btrfs/volumes.c                |  322 +++++++-
>   fs/btrfs/volumes.h                |    7 +
>   fs/btrfs/zoned.c                  | 1272 +++++++++++++++++++++++++++++
>   fs/btrfs/zoned.h                  |  295 +++++++
>   fs/iomap/direct-io.c              |   41 +-
>   include/linux/bio.h               |    2 +
>   include/linux/iomap.h             |    1 +
>   include/uapi/linux/btrfs.h        |    1 +
>   37 files changed, 3246 insertions(+), 118 deletions(-)
>   create mode 100644 fs/btrfs/zoned.c
>   create mode 100644 fs/btrfs/zoned.h
> 


^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH v10 01/41] block: add bio_add_zone_append_page
  2020-11-10 11:26 ` [PATCH v10 01/41] block: add bio_add_zone_append_page Naohiro Aota
@ 2020-11-10 17:20   ` Christoph Hellwig
  2020-11-11  7:20     ` Johannes Thumshirn
  0 siblings, 1 reply; 125+ messages in thread
From: Christoph Hellwig @ 2020-11-10 17:20 UTC (permalink / raw)
  To: Naohiro Aota
  Cc: linux-btrfs, dsterba, hare, linux-fsdevel, Jens Axboe,
	Christoph Hellwig, Darrick J. Wong, Johannes Thumshirn

> +int bio_add_zone_append_page(struct bio *bio, struct page *page,
> +			     unsigned int len, unsigned int offset)
> +{
> +	struct request_queue *q;
> +	bool same_page = false;
> +
> +	if (WARN_ON_ONCE(bio_op(bio) != REQ_OP_ZONE_APPEND))
> +		return 0;
> +
> +	if (WARN_ON_ONCE(!bio->bi_disk))
> +		return 0;

Do we need this check?  I'd rather just initialize q at declaration time
and let the NULL pointer deref be the obvious sign for a grave
programming error.

Except for that the patch looks good to me:

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH v10 02/41] iomap: support REQ_OP_ZONE_APPEND
  2020-11-10 11:26 ` [PATCH v10 02/41] iomap: support REQ_OP_ZONE_APPEND Naohiro Aota
@ 2020-11-10 17:25   ` Christoph Hellwig
  2020-11-10 18:55   ` Darrick J. Wong
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 125+ messages in thread
From: Christoph Hellwig @ 2020-11-10 17:25 UTC (permalink / raw)
  To: Naohiro Aota
  Cc: linux-btrfs, dsterba, hare, linux-fsdevel, Jens Axboe,
	Christoph Hellwig, Darrick J. Wong

>  		struct iomap_dio *dio, struct iomap *iomap)
> @@ -278,6 +306,13 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
>  		bio->bi_private = dio;
>  		bio->bi_end_io = iomap_dio_bio_end_io;
>  
> +		/*
> +		 * Set the operation flags early so that bio_iov_iter_get_pages
> +		 * can set up the page vector appropriately for a ZONE_APPEND
> +		 * operation.
> +		 */
> +		bio->bi_opf = iomap_dio_bio_opflags(dio, iomap, use_fua);

We could just calculate the flags once before the loop if we touch
this anyway.

But otherwise this looks ok to me.


* Re: [PATCH v10 02/41] iomap: support REQ_OP_ZONE_APPEND
  2020-11-10 11:26 ` [PATCH v10 02/41] iomap: support REQ_OP_ZONE_APPEND Naohiro Aota
  2020-11-10 17:25   ` Christoph Hellwig
@ 2020-11-10 18:55   ` Darrick J. Wong
  2020-11-10 19:01     ` Darrick J. Wong
  2020-11-24 11:29     ` Christoph Hellwig
  2020-11-30 18:11   ` Darrick J. Wong
  2020-12-09  9:31   ` Christoph Hellwig
  3 siblings, 2 replies; 125+ messages in thread
From: Darrick J. Wong @ 2020-11-10 18:55 UTC (permalink / raw)
  To: Naohiro Aota
  Cc: linux-btrfs, dsterba, hare, linux-fsdevel, Jens Axboe, Christoph Hellwig

On Tue, Nov 10, 2020 at 08:26:05PM +0900, Naohiro Aota wrote:
> A ZONE_APPEND bio must follow hardware restrictions (e.g. not exceeding
> max_zone_append_sectors) so that it is not split. bio_iov_iter_get_pages
> builds such a restricted bio using __bio_iov_append_get_pages if
> bio_op(bio) == REQ_OP_ZONE_APPEND.
> 
> To utilize it, we need to set the bio_op before calling
> bio_iov_iter_get_pages(). This commit introduces IOMAP_F_ZONE_APPEND so
> that an iomap user can set the flag to indicate they want
> REQ_OP_ZONE_APPEND and a restricted bio.
> 
> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
> ---
>  fs/iomap/direct-io.c  | 41 +++++++++++++++++++++++++++++++++++------
>  include/linux/iomap.h |  1 +
>  2 files changed, 36 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index c1aafb2ab990..f04572a55a09 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -200,6 +200,34 @@ iomap_dio_zero(struct iomap_dio *dio, struct iomap *iomap, loff_t pos,
>  	iomap_dio_submit_bio(dio, iomap, bio, pos);
>  }
>  
> +/*
> + * Figure out the bio's operation flags from the dio request, the
> + * mapping, and whether or not we want FUA.  Note that we can end up
> + * clearing the WRITE_FUA flag in the dio request.
> + */
> +static inline unsigned int
> +iomap_dio_bio_opflags(struct iomap_dio *dio, struct iomap *iomap, bool use_fua)

Hmm, just to check my understanding of what iomap has to do to support
all this:

When we're wanting to use a ZONE_APPEND command, the @iomap structure
has to have IOMAP_F_ZONE_APPEND set in iomap->flags, iomap->type is set
to IOMAP_MAPPED, but what should iomap->addr be set to?

I gather from what I see in zonefs and the relevant NVME proposal that
iomap->addr should be set to the (byte) address of the zone we want to
append to?  And if we do that, then bio->bi_iter.bi_sector will be set
to sector address of iomap->addr, right?

(I got lost trying to figure out how btrfs sets ->addr for appends.)

Then when the IO completes, the block layer sets bio->bi_iter.bi_sector
to wherever the drive told it that it actually wrote the bio, right?

If that's true, then that implies that need_zeroout must always be false
for an append operation, right?  Does that also mean that the directio
request has to be aligned to an fs block and not just the sector size?

Can userspace send a directio append that crosses a zone boundary?  If
so, what happens if a direct append to a lower address fails but a
direct append to a higher address succeeds?
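For what it's worth, my mental model of the append contract is something like this toy (not kernel code; the names here are made up):

```c
#include <assert.h>

/* Toy model: the submitter points the "bio" at the start of the target
 * zone; the device appends at its current write pointer and reports the
 * actual location back by rewriting the sector field on completion. */
struct toy_zone {
	unsigned long start;	/* zone start, in sectors */
	unsigned long wp;	/* relative write pointer, in sectors */
};

struct toy_bio {
	unsigned long sector;	/* in: zone start; out: written location */
	unsigned long nr_sectors;
};

void toy_zone_append(struct toy_zone *z, struct toy_bio *bio)
{
	bio->sector = z->start + z->wp;	/* device chooses the location */
	z->wp += bio->nr_sectors;
}
```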

> +{
> +	unsigned int opflags = REQ_SYNC | REQ_IDLE;
> +
> +	if (!(dio->flags & IOMAP_DIO_WRITE)) {
> +		WARN_ON_ONCE(iomap->flags & IOMAP_F_ZONE_APPEND);
> +		return REQ_OP_READ;
> +	}
> +
> +	if (iomap->flags & IOMAP_F_ZONE_APPEND)
> +		opflags |= REQ_OP_ZONE_APPEND;
> +	else
> +		opflags |= REQ_OP_WRITE;
> +
> +	if (use_fua)
> +		opflags |= REQ_FUA;
> +	else
> +		dio->flags &= ~IOMAP_DIO_WRITE_FUA;
> +
> +	return opflags;
> +}
> +
>  static loff_t
>  iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
>  		struct iomap_dio *dio, struct iomap *iomap)
> @@ -278,6 +306,13 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
>  		bio->bi_private = dio;
>  		bio->bi_end_io = iomap_dio_bio_end_io;
>  
> +		/*
> +		 * Set the operation flags early so that bio_iov_iter_get_pages
> +		 * can set up the page vector appropriately for a ZONE_APPEND
> +		 * operation.
> +		 */
> +		bio->bi_opf = iomap_dio_bio_opflags(dio, iomap, use_fua);

I'm also vaguely wondering how to communicate the write location back to
the filesystem when the bio completes?  btrfs handles the bio completion
completely so it doesn't have a problem, but for other filesystems
(cough future xfs cough) either we'd have to add a new callback for
append operations; or I guess everyone could hook the bio endio.

Admittedly that's not really your problem, and for all I know hch is
already working on this.

--D

> +
>  		ret = bio_iov_iter_get_pages(bio, dio->submit.iter);
>  		if (unlikely(ret)) {
>  			/*
> @@ -292,14 +327,8 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
>  
>  		n = bio->bi_iter.bi_size;
>  		if (dio->flags & IOMAP_DIO_WRITE) {
> -			bio->bi_opf = REQ_OP_WRITE | REQ_SYNC | REQ_IDLE;
> -			if (use_fua)
> -				bio->bi_opf |= REQ_FUA;
> -			else
> -				dio->flags &= ~IOMAP_DIO_WRITE_FUA;
>  			task_io_account_write(n);
>  		} else {
> -			bio->bi_opf = REQ_OP_READ;
>  			if (dio->flags & IOMAP_DIO_DIRTY)
>  				bio_set_pages_dirty(bio);
>  		}
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index 4d1d3c3469e9..1bccd1880d0d 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -54,6 +54,7 @@ struct vm_fault;
>  #define IOMAP_F_SHARED		0x04
>  #define IOMAP_F_MERGED		0x08
>  #define IOMAP_F_BUFFER_HEAD	0x10
> +#define IOMAP_F_ZONE_APPEND	0x20
>  
>  /*
>   * Flags set by the core iomap code during operations:
> -- 
> 2.27.0
> 


* Re: [PATCH v10 02/41] iomap: support REQ_OP_ZONE_APPEND
  2020-11-10 18:55   ` Darrick J. Wong
@ 2020-11-10 19:01     ` Darrick J. Wong
  2020-11-24 11:29     ` Christoph Hellwig
  1 sibling, 0 replies; 125+ messages in thread
From: Darrick J. Wong @ 2020-11-10 19:01 UTC (permalink / raw)
  To: Naohiro Aota
  Cc: linux-btrfs, dsterba, hare, linux-fsdevel, Jens Axboe, Christoph Hellwig

On Tue, Nov 10, 2020 at 10:55:06AM -0800, Darrick J. Wong wrote:
> On Tue, Nov 10, 2020 at 08:26:05PM +0900, Naohiro Aota wrote:
> > A ZONE_APPEND bio must follow hardware restrictions (e.g. not exceeding
> > max_zone_append_sectors) so that it is not split. bio_iov_iter_get_pages
> > builds such a restricted bio using __bio_iov_append_get_pages if
> > bio_op(bio) == REQ_OP_ZONE_APPEND.
> > 
> > To utilize it, we need to set the bio_op before calling
> > bio_iov_iter_get_pages(). This commit introduces IOMAP_F_ZONE_APPEND so
> > that an iomap user can set the flag to indicate they want
> > REQ_OP_ZONE_APPEND and a restricted bio.
> > 
> > Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
> > ---
> >  fs/iomap/direct-io.c  | 41 +++++++++++++++++++++++++++++++++++------
> >  include/linux/iomap.h |  1 +
> >  2 files changed, 36 insertions(+), 6 deletions(-)
> > 
> > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> > index c1aafb2ab990..f04572a55a09 100644
> > --- a/fs/iomap/direct-io.c
> > +++ b/fs/iomap/direct-io.c
> > @@ -200,6 +200,34 @@ iomap_dio_zero(struct iomap_dio *dio, struct iomap *iomap, loff_t pos,
> >  	iomap_dio_submit_bio(dio, iomap, bio, pos);
> >  }
> >  
> > +/*
> > + * Figure out the bio's operation flags from the dio request, the
> > + * mapping, and whether or not we want FUA.  Note that we can end up
> > + * clearing the WRITE_FUA flag in the dio request.
> > + */
> > +static inline unsigned int
> > +iomap_dio_bio_opflags(struct iomap_dio *dio, struct iomap *iomap, bool use_fua)
> 
> Hmm, just to check my understanding of what iomap has to do to support
> all this:
> 
> When we're wanting to use a ZONE_APPEND command, the @iomap structure
> has to have IOMAP_F_ZONE_APPEND set in iomap->flags, iomap->type is set
> to IOMAP_MAPPED, but what should iomap->addr be set to?
> 
> I gather from what I see in zonefs and the relevant NVME proposal that
> iomap->addr should be set to the (byte) address of the zone we want to
> append to?  And if we do that, then bio->bi_iter.bi_sector will be set
> to sector address of iomap->addr, right?
> 
> (I got lost trying to figure out how btrfs sets ->addr for appends.)
> 
> Then when the IO completes, the block layer sets bio->bi_iter.bi_sector
> to wherever the drive told it that it actually wrote the bio, right?
> 
> If that's true, then that implies that need_zeroout must always be false
> for an append operation, right?  Does that also mean that the directio
> request has to be aligned to an fs block and not just the sector size?
> 
> Can userspace send a directio append that crosses a zone boundary?  If
> so, what happens if a direct append to a lower address fails but a
> direct append to a higher address succeeds?

Bleh, vim tomfoolery == missing sentence.  Change the above paragraph to
read:

Can userspace send a directio append that crosses a zone boundary?  Can
we issue multiple bios for a single append write?  What happens if a
direct append to a lower address fails but a direct append to a higher
address succeeds?

--D

> > +{
> > +	unsigned int opflags = REQ_SYNC | REQ_IDLE;
> > +
> > +	if (!(dio->flags & IOMAP_DIO_WRITE)) {
> > +		WARN_ON_ONCE(iomap->flags & IOMAP_F_ZONE_APPEND);
> > +		return REQ_OP_READ;
> > +	}
> > +
> > +	if (iomap->flags & IOMAP_F_ZONE_APPEND)
> > +		opflags |= REQ_OP_ZONE_APPEND;
> > +	else
> > +		opflags |= REQ_OP_WRITE;
> > +
> > +	if (use_fua)
> > +		opflags |= REQ_FUA;
> > +	else
> > +		dio->flags &= ~IOMAP_DIO_WRITE_FUA;
> > +
> > +	return opflags;
> > +}
> > +
> >  static loff_t
> >  iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
> >  		struct iomap_dio *dio, struct iomap *iomap)
> > @@ -278,6 +306,13 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
> >  		bio->bi_private = dio;
> >  		bio->bi_end_io = iomap_dio_bio_end_io;
> >  
> > +		/*
> > +		 * Set the operation flags early so that bio_iov_iter_get_pages
> > +		 * can set up the page vector appropriately for a ZONE_APPEND
> > +		 * operation.
> > +		 */
> > +		bio->bi_opf = iomap_dio_bio_opflags(dio, iomap, use_fua);
> 
> I'm also vaguely wondering how to communicate the write location back to
> the filesystem when the bio completes?  btrfs handles the bio completion
> completely so it doesn't have a problem, but for other filesystems
> (cough future xfs cough) either we'd have to add a new callback for
> append operations; or I guess everyone could hook the bio endio.
> 
> Admittedly that's not really your problem, and for all I know hch is
> already working on this.
> 
> --D
> 
> > +
> >  		ret = bio_iov_iter_get_pages(bio, dio->submit.iter);
> >  		if (unlikely(ret)) {
> >  			/*
> > @@ -292,14 +327,8 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
> >  
> >  		n = bio->bi_iter.bi_size;
> >  		if (dio->flags & IOMAP_DIO_WRITE) {
> > -			bio->bi_opf = REQ_OP_WRITE | REQ_SYNC | REQ_IDLE;
> > -			if (use_fua)
> > -				bio->bi_opf |= REQ_FUA;
> > -			else
> > -				dio->flags &= ~IOMAP_DIO_WRITE_FUA;
> >  			task_io_account_write(n);
> >  		} else {
> > -			bio->bi_opf = REQ_OP_READ;
> >  			if (dio->flags & IOMAP_DIO_DIRTY)
> >  				bio_set_pages_dirty(bio);
> >  		}
> > diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> > index 4d1d3c3469e9..1bccd1880d0d 100644
> > --- a/include/linux/iomap.h
> > +++ b/include/linux/iomap.h
> > @@ -54,6 +54,7 @@ struct vm_fault;
> >  #define IOMAP_F_SHARED		0x04
> >  #define IOMAP_F_MERGED		0x08
> >  #define IOMAP_F_BUFFER_HEAD	0x10
> > +#define IOMAP_F_ZONE_APPEND	0x20
> >  
> >  /*
> >   * Flags set by the core iomap code during operations:
> > -- 
> > 2.27.0
> > 


* Re: [PATCH v10 11/41] btrfs: implement log-structured superblock for ZONED mode
  2020-11-10 11:26 ` [PATCH v10 11/41] btrfs: implement log-structured superblock for " Naohiro Aota
@ 2020-11-11  1:34   ` kernel test robot
  2020-11-11  2:43   ` kernel test robot
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 125+ messages in thread
From: kernel test robot @ 2020-11-11  1:34 UTC (permalink / raw)
  To: kbuild-all

[-- Attachment #1: Type: text/plain, Size: 3240 bytes --]

Hi Naohiro,

I love your patch! Perhaps something to improve:

[auto build test WARNING on xfs-linux/for-next]
[also build test WARNING on v5.10-rc3]
[cannot apply to kdave/for-next block/for-next next-20201110]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Naohiro-Aota/btrfs-zoned-block-device-support/20201110-193227
base:   https://git.kernel.org/pub/scm/fs/xfs/xfs-linux.git for-next
config: arc-allyesconfig (attached as .config)
compiler: arceb-elf-gcc (GCC) 9.3.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/0day-ci/linux/commit/9c0a9c59b898ad314f5610c22069f439465b2fec
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Naohiro-Aota/btrfs-zoned-block-device-support/20201110-193227
        git checkout 9c0a9c59b898ad314f5610c22069f439465b2fec
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross ARCH=arc 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

   fs/btrfs/zoned.c: In function 'btrfs_sb_log_location_bdev':
>> fs/btrfs/zoned.c:511:6: warning: variable 'zone_size' set but not used [-Wunused-but-set-variable]
     511 |  u64 zone_size;
         |      ^~~~~~~~~

vim +/zone_size +511 fs/btrfs/zoned.c

   503	
   504	int btrfs_sb_log_location_bdev(struct block_device *bdev, int mirror, int rw,
   505				       u64 *bytenr_ret)
   506	{
   507		struct blk_zone zones[BTRFS_NR_SB_LOG_ZONES];
   508		unsigned int zone_sectors;
   509		u32 sb_zone;
   510		int ret;
 > 511		u64 zone_size;
   512		u8 zone_sectors_shift;
   513		sector_t nr_sectors = bdev->bd_part->nr_sects;
   514		u32 nr_zones;
   515	
   516		if (!bdev_is_zoned(bdev)) {
   517			*bytenr_ret = btrfs_sb_offset(mirror);
   518			return 0;
   519		}
   520	
   521		ASSERT(rw == READ || rw == WRITE);
   522	
   523		zone_sectors = bdev_zone_sectors(bdev);
   524		if (!is_power_of_2(zone_sectors))
   525			return -EINVAL;
   526		zone_size = zone_sectors << SECTOR_SHIFT;
   527		zone_sectors_shift = ilog2(zone_sectors);
   528		nr_zones = nr_sectors >> zone_sectors_shift;
   529	
   530		sb_zone = sb_zone_number(zone_sectors_shift + SECTOR_SHIFT, mirror);
   531		if (sb_zone + 1 >= nr_zones)
   532			return -ENOENT;
   533	
   534		ret = blkdev_report_zones(bdev, sb_zone << zone_sectors_shift,
   535					  BTRFS_NR_SB_LOG_ZONES, copy_zone_info_cb,
   536					  zones);
   537		if (ret < 0)
   538			return ret;
   539		if (ret != BTRFS_NR_SB_LOG_ZONES)
   540			return -EIO;
   541	
   542		return sb_log_location(bdev, zones, rw, bytenr_ret);
   543	}
   544	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: config.gz --]
[-- Type: application/gzip, Size: 66433 bytes --]


* Re: [PATCH v10 23/41] btrfs: split ordered extent when bio is sent
  2020-11-10 11:26 ` [PATCH v10 23/41] btrfs: split ordered extent when bio is sent Naohiro Aota
@ 2020-11-11  2:01     ` kernel test robot
  2020-11-11  2:26     ` kernel test robot
                       ` (3 subsequent siblings)
  4 siblings, 0 replies; 125+ messages in thread
From: kernel test robot @ 2020-11-11  2:01 UTC (permalink / raw)
  To: Naohiro Aota, linux-btrfs, dsterba
  Cc: kbuild-all, hare, linux-fsdevel, Jens Axboe, Christoph Hellwig,
	Darrick J. Wong, Naohiro Aota

[-- Attachment #1: Type: text/plain, Size: 4437 bytes --]

Hi Naohiro,

I love your patch! Perhaps something to improve:

[auto build test WARNING on xfs-linux/for-next]
[also build test WARNING on v5.10-rc3]
[cannot apply to kdave/for-next block/for-next next-20201110]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Naohiro-Aota/btrfs-zoned-block-device-support/20201110-193227
base:   https://git.kernel.org/pub/scm/fs/xfs/xfs-linux.git for-next
config: arc-allyesconfig (attached as .config)
compiler: arceb-elf-gcc (GCC) 9.3.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/0day-ci/linux/commit/c2b1e52b104fa60d0c731cc5016be18e98ec71d2
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Naohiro-Aota/btrfs-zoned-block-device-support/20201110-193227
        git checkout c2b1e52b104fa60d0c731cc5016be18e98ec71d2
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross ARCH=arc 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

>> fs/btrfs/inode.c:2161:5: warning: no previous prototype for 'extract_ordered_extent' [-Wmissing-prototypes]
    2161 | int extract_ordered_extent(struct inode *inode, struct bio *bio,
         |     ^~~~~~~~~~~~~~~~~~~~~~

vim +/extract_ordered_extent +2161 fs/btrfs/inode.c

  2160	
> 2161	int extract_ordered_extent(struct inode *inode, struct bio *bio,
  2162				   loff_t file_offset)
  2163	{
  2164		struct btrfs_ordered_extent *ordered;
  2165		struct extent_map *em = NULL, *em_new = NULL;
  2166		struct extent_map_tree *em_tree = &BTRFS_I(inode)->extent_tree;
  2167		u64 start = (u64)bio->bi_iter.bi_sector << SECTOR_SHIFT;
  2168		u64 len = bio->bi_iter.bi_size;
  2169		u64 end = start + len;
  2170		u64 ordered_end;
  2171		u64 pre, post;
  2172		int ret = 0;
  2173	
  2174		ordered = btrfs_lookup_ordered_extent(BTRFS_I(inode), file_offset);
  2175		if (WARN_ON_ONCE(!ordered))
  2176			return -EIO;
  2177	
  2178		/* No need to split */
  2179		if (ordered->disk_num_bytes == len)
  2180			goto out;
  2181	
  2182		/* We cannot split once end_bio'd ordered extent */
  2183		if (WARN_ON_ONCE(ordered->bytes_left != ordered->disk_num_bytes)) {
  2184			ret = -EINVAL;
  2185			goto out;
  2186		}
  2187	
  2188		/* We cannot split a compressed ordered extent */
  2189		if (WARN_ON_ONCE(ordered->disk_num_bytes != ordered->num_bytes)) {
  2190			ret = -EINVAL;
  2191			goto out;
  2192		}
  2193	
  2194		/* We cannot split a waited ordered extent */
  2195		if (WARN_ON_ONCE(wq_has_sleeper(&ordered->wait))) {
  2196			ret = -EINVAL;
  2197			goto out;
  2198		}
  2199	
  2200		ordered_end = ordered->disk_bytenr + ordered->disk_num_bytes;
  2201		/* bio must be in one ordered extent */
  2202		if (WARN_ON_ONCE(start < ordered->disk_bytenr || end > ordered_end)) {
  2203			ret = -EINVAL;
  2204			goto out;
  2205		}
  2206	
  2207		/* Checksum list should be empty */
  2208		if (WARN_ON_ONCE(!list_empty(&ordered->list))) {
  2209			ret = -EINVAL;
  2210			goto out;
  2211		}
  2212	
  2213		pre = start - ordered->disk_bytenr;
  2214		post = ordered_end - end;
  2215	
  2216		btrfs_split_ordered_extent(ordered, pre, post);
  2217	
  2218		read_lock(&em_tree->lock);
  2219		em = lookup_extent_mapping(em_tree, ordered->file_offset, len);
  2220		if (!em) {
  2221			read_unlock(&em_tree->lock);
  2222			ret = -EIO;
  2223			goto out;
  2224		}
  2225		read_unlock(&em_tree->lock);
  2226	
  2227		ASSERT(!test_bit(EXTENT_FLAG_COMPRESSED, &em->flags));
  2228		em_new = create_io_em(BTRFS_I(inode), em->start + pre, len,
  2229				      em->start + pre, em->block_start + pre, len,
  2230				      len, len, BTRFS_COMPRESS_NONE,
  2231				      BTRFS_ORDERED_REGULAR);
  2232		free_extent_map(em_new);
  2233	
  2234	out:
  2235		free_extent_map(em);
  2236		btrfs_put_ordered_extent(ordered);
  2237	
  2238		return ret;
  2239	}
  2240	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 66433 bytes --]

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH v10 23/41] btrfs: split ordered extent when bio is sent
  2020-11-10 11:26 ` [PATCH v10 23/41] btrfs: split ordered extent when bio is sent Naohiro Aota
@ 2020-11-11  2:26     ` kernel test robot
  2020-11-11  2:26     ` kernel test robot
                       ` (3 subsequent siblings)
  4 siblings, 0 replies; 125+ messages in thread
From: kernel test robot @ 2020-11-11  2:26 UTC (permalink / raw)
  To: Naohiro Aota, linux-btrfs, dsterba
  Cc: kbuild-all, clang-built-linux, hare, linux-fsdevel, Jens Axboe,
	Christoph Hellwig, Darrick J. Wong, Naohiro Aota

[-- Attachment #1: Type: text/plain, Size: 4851 bytes --]

Hi Naohiro,

I love your patch! Perhaps something to improve:

[auto build test WARNING on xfs-linux/for-next]
[also build test WARNING on v5.10-rc3]
[cannot apply to kdave/for-next block/for-next next-20201110]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting a patch, we suggest using '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Naohiro-Aota/btrfs-zoned-block-device-support/20201110-193227
base:   https://git.kernel.org/pub/scm/fs/xfs/xfs-linux.git for-next
config: powerpc-randconfig-r022-20201110 (attached as .config)
compiler: clang version 12.0.0 (https://github.com/llvm/llvm-project 4d81c8adb6ed9840257f6cb6b93f60856d422a15)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # install powerpc cross compiling tool for clang build
        # apt-get install binutils-powerpc-linux-gnu
        # https://github.com/0day-ci/linux/commit/c2b1e52b104fa60d0c731cc5016be18e98ec71d2
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Naohiro-Aota/btrfs-zoned-block-device-support/20201110-193227
        git checkout c2b1e52b104fa60d0c731cc5016be18e98ec71d2
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross ARCH=powerpc 

If you fix the issue, kindly add the following tag as appropriate:
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

>> fs/btrfs/inode.c:2161:5: warning: no previous prototype for function 'extract_ordered_extent' [-Wmissing-prototypes]
   int extract_ordered_extent(struct inode *inode, struct bio *bio,
       ^
   fs/btrfs/inode.c:2161:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   int extract_ordered_extent(struct inode *inode, struct bio *bio,
   ^
   static 
   1 warning generated.

vim +/extract_ordered_extent +2161 fs/btrfs/inode.c

  2160	
> 2161	int extract_ordered_extent(struct inode *inode, struct bio *bio,
  2162				   loff_t file_offset)
  2163	{
  2164		struct btrfs_ordered_extent *ordered;
  2165		struct extent_map *em = NULL, *em_new = NULL;
  2166		struct extent_map_tree *em_tree = &BTRFS_I(inode)->extent_tree;
  2167		u64 start = (u64)bio->bi_iter.bi_sector << SECTOR_SHIFT;
  2168		u64 len = bio->bi_iter.bi_size;
  2169		u64 end = start + len;
  2170		u64 ordered_end;
  2171		u64 pre, post;
  2172		int ret = 0;
  2173	
  2174		ordered = btrfs_lookup_ordered_extent(BTRFS_I(inode), file_offset);
  2175		if (WARN_ON_ONCE(!ordered))
  2176			return -EIO;
  2177	
  2178		/* No need to split */
  2179		if (ordered->disk_num_bytes == len)
  2180			goto out;
  2181	
  2182		/* We cannot split once end_bio'd ordered extent */
  2183		if (WARN_ON_ONCE(ordered->bytes_left != ordered->disk_num_bytes)) {
  2184			ret = -EINVAL;
  2185			goto out;
  2186		}
  2187	
  2188		/* We cannot split a compressed ordered extent */
  2189		if (WARN_ON_ONCE(ordered->disk_num_bytes != ordered->num_bytes)) {
  2190			ret = -EINVAL;
  2191			goto out;
  2192		}
  2193	
  2194		/* We cannot split a waited ordered extent */
  2195		if (WARN_ON_ONCE(wq_has_sleeper(&ordered->wait))) {
  2196			ret = -EINVAL;
  2197			goto out;
  2198		}
  2199	
  2200		ordered_end = ordered->disk_bytenr + ordered->disk_num_bytes;
  2201		/* bio must be in one ordered extent */
  2202		if (WARN_ON_ONCE(start < ordered->disk_bytenr || end > ordered_end)) {
  2203			ret = -EINVAL;
  2204			goto out;
  2205		}
  2206	
  2207		/* Checksum list should be empty */
  2208		if (WARN_ON_ONCE(!list_empty(&ordered->list))) {
  2209			ret = -EINVAL;
  2210			goto out;
  2211		}
  2212	
  2213		pre = start - ordered->disk_bytenr;
  2214		post = ordered_end - end;
  2215	
  2216		btrfs_split_ordered_extent(ordered, pre, post);
  2217	
  2218		read_lock(&em_tree->lock);
  2219		em = lookup_extent_mapping(em_tree, ordered->file_offset, len);
  2220		if (!em) {
  2221			read_unlock(&em_tree->lock);
  2222			ret = -EIO;
  2223			goto out;
  2224		}
  2225		read_unlock(&em_tree->lock);
  2226	
  2227		ASSERT(!test_bit(EXTENT_FLAG_COMPRESSED, &em->flags));
  2228		em_new = create_io_em(BTRFS_I(inode), em->start + pre, len,
  2229				      em->start + pre, em->block_start + pre, len,
  2230				      len, len, BTRFS_COMPRESS_NONE,
  2231				      BTRFS_ORDERED_REGULAR);
  2232		free_extent_map(em_new);
  2233	
  2234	out:
  2235		free_extent_map(em);
  2236		btrfs_put_ordered_extent(ordered);
  2237	
  2238		return ret;
  2239	}
  2240	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 29478 bytes --]

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH v10 11/41] btrfs: implement log-structured superblock for ZONED mode
  2020-11-10 11:26 ` [PATCH v10 11/41] btrfs: implement log-structured superblock for " Naohiro Aota
  2020-11-11  1:34   ` kernel test robot
@ 2020-11-11  2:43   ` kernel test robot
  2020-11-23 17:46   ` David Sterba
  2020-11-24  6:46   ` Anand Jain
  3 siblings, 0 replies; 125+ messages in thread
From: kernel test robot @ 2020-11-11  2:43 UTC (permalink / raw)
  To: kbuild-all

[-- Attachment #1: Type: text/plain, Size: 4455 bytes --]

Hi Naohiro,

I love your patch! Perhaps something to improve:

[auto build test WARNING on xfs-linux/for-next]
[also build test WARNING on v5.10-rc3]
[cannot apply to kdave/for-next block/for-next next-20201110]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting a patch, we suggest using '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Naohiro-Aota/btrfs-zoned-block-device-support/20201110-193227
base:   https://git.kernel.org/pub/scm/fs/xfs/xfs-linux.git for-next
config: x86_64-randconfig-s022-20201110 (attached as .config)
compiler: gcc-9 (Debian 9.3.0-15) 9.3.0
reproduce:
        # apt-get install sparse
        # sparse version: v0.6.3-76-gf680124b-dirty
        # https://github.com/0day-ci/linux/commit/9c0a9c59b898ad314f5610c22069f439465b2fec
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Naohiro-Aota/btrfs-zoned-block-device-support/20201110-193227
        git checkout 9c0a9c59b898ad314f5610c22069f439465b2fec
        # save the attached .config to linux build tree
        make W=1 C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' ARCH=x86_64 

If you fix the issue, kindly add the following tag as appropriate:
Reported-by: kernel test robot <lkp@intel.com>


"sparse warnings: (new ones prefixed by >>)"
>> fs/btrfs/zoned.c:85:29: sparse: sparse: restricted __le64 degrades to integer
   fs/btrfs/zoned.c:85:52: sparse: sparse: restricted __le64 degrades to integer

vim +85 fs/btrfs/zoned.c

    25	
    26	static int sb_write_pointer(struct block_device *bdev, struct blk_zone *zones,
    27				    u64 *wp_ret)
    28	{
    29		bool empty[BTRFS_NR_SB_LOG_ZONES];
    30		bool full[BTRFS_NR_SB_LOG_ZONES];
    31		sector_t sector;
    32	
    33		ASSERT(zones[0].type != BLK_ZONE_TYPE_CONVENTIONAL &&
    34		       zones[1].type != BLK_ZONE_TYPE_CONVENTIONAL);
    35	
    36		empty[0] = (zones[0].cond == BLK_ZONE_COND_EMPTY);
    37		empty[1] = (zones[1].cond == BLK_ZONE_COND_EMPTY);
    38		full[0] = (zones[0].cond == BLK_ZONE_COND_FULL);
    39		full[1] = (zones[1].cond == BLK_ZONE_COND_FULL);
    40	
    41		/*
    42		 * Possible state of log buffer zones
    43		 *
    44		 *   E I F
    45		 * E * x 0
    46		 * I 0 x 0
    47		 * F 1 1 C
    48		 *
    49		 * Row: zones[0]
    50		 * Col: zones[1]
    51		 * State:
    52		 *   E: Empty, I: In-Use, F: Full
    53		 * Log position:
    54		 *   *: Special case, no superblock is written
    55		 *   0: Use write pointer of zones[0]
    56		 *   1: Use write pointer of zones[1]
    57		 *   C: Compare SBs from zones[0] and zones[1], use the newer one
    58		 *   x: Invalid state
    59		 */
    60	
    61		if (empty[0] && empty[1]) {
    62			/* Special case to distinguish no superblock to read */
    63			*wp_ret = zones[0].start << SECTOR_SHIFT;
    64			return -ENOENT;
    65		} else if (full[0] && full[1]) {
    66			/* Compare two super blocks */
    67			struct address_space *mapping = bdev->bd_inode->i_mapping;
    68			struct page *page[BTRFS_NR_SB_LOG_ZONES];
    69			struct btrfs_super_block *super[BTRFS_NR_SB_LOG_ZONES];
    70			int i;
    71	
    72			for (i = 0; i < BTRFS_NR_SB_LOG_ZONES; i++) {
    73				u64 bytenr = ((zones[i].start + zones[i].len) << SECTOR_SHIFT) -
    74					BTRFS_SUPER_INFO_SIZE;
    75	
    76				page[i] = read_cache_page_gfp(mapping, bytenr >> PAGE_SHIFT, GFP_NOFS);
    77				if (IS_ERR(page[i])) {
    78					if (i == 1)
    79						btrfs_release_disk_super(super[0]);
    80					return PTR_ERR(page[i]);
    81				}
    82				super[i] = page_address(page[i]);
    83			}
    84	
  > 85			if (super[0]->generation > super[1]->generation)
    86				sector = zones[1].start;
    87			else
    88				sector = zones[0].start;
    89	
    90			for (i = 0; i < BTRFS_NR_SB_LOG_ZONES; i++)
    91				btrfs_release_disk_super(super[i]);
    92		} else if (!full[0] && (empty[1] || full[1])) {
    93			sector = zones[0].wp;
    94		} else if (full[0]) {
    95			sector = zones[1].wp;
    96		} else {
    97			return -EUCLEAN;
    98		}
    99		*wp_ret = sector << SECTOR_SHIFT;
   100		return 0;
   101	}
   102	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: config.gz --]
[-- Type: application/gzip, Size: 32694 bytes --]

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH v10 31/41] btrfs: mark block groups to copy for device-replace
  2020-11-10 11:26 ` [PATCH v10 31/41] btrfs: mark block groups to copy for device-replace Naohiro Aota
@ 2020-11-11  3:13   ` kernel test robot
  2020-11-11  3:16   ` kernel test robot
  1 sibling, 0 replies; 125+ messages in thread
From: kernel test robot @ 2020-11-11  3:13 UTC (permalink / raw)
  To: kbuild-all

[-- Attachment #1: Type: text/plain, Size: 5521 bytes --]

Hi Naohiro,

I love your patch! Perhaps something to improve:

[auto build test WARNING on xfs-linux/for-next]
[also build test WARNING on v5.10-rc3]
[cannot apply to kdave/for-next block/for-next next-20201110]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting a patch, we suggest using '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Naohiro-Aota/btrfs-zoned-block-device-support/20201110-193227
base:   https://git.kernel.org/pub/scm/fs/xfs/xfs-linux.git for-next
config: arc-allyesconfig (attached as .config)
compiler: arceb-elf-gcc (GCC) 9.3.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/0day-ci/linux/commit/355d6db6997f390f363f08fa9bfbf8b38eb891a2
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Naohiro-Aota/btrfs-zoned-block-device-support/20201110-193227
        git checkout 355d6db6997f390f363f08fa9bfbf8b38eb891a2
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross ARCH=arc 

If you fix the issue, kindly add the following tag as appropriate:
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

   fs/btrfs/dev-replace.c: In function 'mark_block_group_to_copy':
>> fs/btrfs/dev-replace.c:452:20: warning: variable 'length' set but not used [-Wunused-but-set-variable]
     452 |  u64 chunk_offset, length;
         |                    ^~~~~~

vim +/length +452 fs/btrfs/dev-replace.c

   440	
   441	static int mark_block_group_to_copy(struct btrfs_fs_info *fs_info,
   442					    struct btrfs_device *src_dev)
   443	{
   444		struct btrfs_path *path;
   445		struct btrfs_key key;
   446		struct btrfs_key found_key;
   447		struct btrfs_root *root = fs_info->dev_root;
   448		struct btrfs_dev_extent *dev_extent = NULL;
   449		struct btrfs_block_group *cache;
   450		struct btrfs_trans_handle *trans;
   451		int ret = 0;
 > 452		u64 chunk_offset, length;
   453	
   454		/* Do not use "to_copy" on non-ZONED for now */
   455		if (!btrfs_is_zoned(fs_info))
   456			return 0;
   457	
   458		mutex_lock(&fs_info->chunk_mutex);
   459	
   460		/* Ensure we don't have pending new block group */
   461		spin_lock(&fs_info->trans_lock);
   462		while (fs_info->running_transaction &&
   463		       !list_empty(&fs_info->running_transaction->dev_update_list)) {
   464			spin_unlock(&fs_info->trans_lock);
   465			mutex_unlock(&fs_info->chunk_mutex);
   466			trans = btrfs_attach_transaction(root);
   467			if (IS_ERR(trans)) {
   468				ret = PTR_ERR(trans);
   469				mutex_lock(&fs_info->chunk_mutex);
   470				if (ret == -ENOENT)
   471					continue;
   472				else
   473					goto unlock;
   474			}
   475	
   476			ret = btrfs_commit_transaction(trans);
   477			mutex_lock(&fs_info->chunk_mutex);
   478			if (ret)
   479				goto unlock;
   480	
   481			spin_lock(&fs_info->trans_lock);
   482		}
   483		spin_unlock(&fs_info->trans_lock);
   484	
   485		path = btrfs_alloc_path();
   486		if (!path) {
   487			ret = -ENOMEM;
   488			goto unlock;
   489		}
   490	
   491		path->reada = READA_FORWARD;
   492		path->search_commit_root = 1;
   493		path->skip_locking = 1;
   494	
   495		key.objectid = src_dev->devid;
   496		key.offset = 0;
   497		key.type = BTRFS_DEV_EXTENT_KEY;
   498	
   499		ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
   500		if (ret < 0)
   501			goto free_path;
   502		if (ret > 0) {
   503			if (path->slots[0] >=
   504			    btrfs_header_nritems(path->nodes[0])) {
   505				ret = btrfs_next_leaf(root, path);
   506				if (ret < 0)
   507					goto free_path;
   508				if (ret > 0) {
   509					ret = 0;
   510					goto free_path;
   511				}
   512			} else {
   513				ret = 0;
   514			}
   515		}
   516	
   517		while (1) {
   518			struct extent_buffer *l = path->nodes[0];
   519			int slot = path->slots[0];
   520	
   521			btrfs_item_key_to_cpu(l, &found_key, slot);
   522	
   523			if (found_key.objectid != src_dev->devid)
   524				break;
   525	
   526			if (found_key.type != BTRFS_DEV_EXTENT_KEY)
   527				break;
   528	
   529			if (found_key.offset < key.offset)
   530				break;
   531	
   532			dev_extent = btrfs_item_ptr(l, slot, struct btrfs_dev_extent);
   533			length = btrfs_dev_extent_length(l, dev_extent);
   534	
   535			chunk_offset = btrfs_dev_extent_chunk_offset(l, dev_extent);
   536	
   537			cache = btrfs_lookup_block_group(fs_info, chunk_offset);
   538			if (!cache)
   539				goto skip;
   540	
   541			spin_lock(&cache->lock);
   542			cache->to_copy = 1;
   543			spin_unlock(&cache->lock);
   544	
   545			btrfs_put_block_group(cache);
   546	
   547	skip:
   548			ret = btrfs_next_item(root, path);
   549			if (ret != 0) {
   550				if (ret > 0)
   551					ret = 0;
   552				break;
   553			}
   554		}
   555	
   556	free_path:
   557		btrfs_free_path(path);
   558	unlock:
   559		mutex_unlock(&fs_info->chunk_mutex);
   560	
   561		return ret;
   562	}
   563	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: config.gz --]
[-- Type: application/gzip, Size: 66433 bytes --]

^ permalink raw reply	[flat|nested] 125+ messages in thread

   500		if (ret < 0)
   501			goto free_path;
   502		if (ret > 0) {
   503			if (path->slots[0] >=
   504			    btrfs_header_nritems(path->nodes[0])) {
   505				ret = btrfs_next_leaf(root, path);
   506				if (ret < 0)
   507					goto free_path;
   508				if (ret > 0) {
   509					ret = 0;
   510					goto free_path;
   511				}
   512			} else {
   513				ret = 0;
   514			}
   515		}
   516	
   517		while (1) {
   518			struct extent_buffer *l = path->nodes[0];
   519			int slot = path->slots[0];
   520	
   521			btrfs_item_key_to_cpu(l, &found_key, slot);
   522	
   523			if (found_key.objectid != src_dev->devid)
   524				break;
   525	
   526			if (found_key.type != BTRFS_DEV_EXTENT_KEY)
   527				break;
   528	
   529			if (found_key.offset < key.offset)
   530				break;
   531	
   532			dev_extent = btrfs_item_ptr(l, slot, struct btrfs_dev_extent);
   533			length = btrfs_dev_extent_length(l, dev_extent);
   534	
   535			chunk_offset = btrfs_dev_extent_chunk_offset(l, dev_extent);
   536	
   537			cache = btrfs_lookup_block_group(fs_info, chunk_offset);
   538			if (!cache)
   539				goto skip;
   540	
   541			spin_lock(&cache->lock);
   542			cache->to_copy = 1;
   543			spin_unlock(&cache->lock);
   544	
   545			btrfs_put_block_group(cache);
   546	
   547	skip:
   548			ret = btrfs_next_item(root, path);
   549			if (ret != 0) {
   550				if (ret > 0)
   551					ret = 0;
   552				break;
   553			}
   554		}
   555	
   556	free_path:
   557		btrfs_free_path(path);
   558	unlock:
   559		mutex_unlock(&fs_info->chunk_mutex);
   560	
   561		return ret;
   562	}
   563	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: config.gz --]
[-- Type: application/gzip, Size: 66433 bytes --]

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH v10 23/41] btrfs: split ordered extent when bio is sent
  2020-11-10 11:26 ` [PATCH v10 23/41] btrfs: split ordered extent when bio is sent Naohiro Aota
@ 2020-11-11  3:46     ` kernel test robot
  2020-11-11  2:26     ` kernel test robot
                       ` (3 subsequent siblings)
  4 siblings, 0 replies; 125+ messages in thread
From: kernel test robot @ 2020-11-11  3:46 UTC (permalink / raw)
  To: Naohiro Aota, linux-btrfs, dsterba
  Cc: kbuild-all, hare, linux-fsdevel, Jens Axboe, Christoph Hellwig,
	Darrick J. Wong, Naohiro Aota

[-- Attachment #1: Type: text/plain, Size: 1983 bytes --]

Hi Naohiro,

I love your patch! Perhaps something to improve:

[auto build test WARNING on xfs-linux/for-next]
[also build test WARNING on v5.10-rc3]
[cannot apply to kdave/for-next block/for-next next-20201110]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Naohiro-Aota/btrfs-zoned-block-device-support/20201110-193227
base:   https://git.kernel.org/pub/scm/fs/xfs/xfs-linux.git for-next
config: x86_64-randconfig-s022-20201110 (attached as .config)
compiler: gcc-9 (Debian 9.3.0-15) 9.3.0
reproduce:
        # apt-get install sparse
        # sparse version: v0.6.3-76-gf680124b-dirty
        # https://github.com/0day-ci/linux/commit/c2b1e52b104fa60d0c731cc5016be18e98ec71d2
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Naohiro-Aota/btrfs-zoned-block-device-support/20201110-193227
        git checkout c2b1e52b104fa60d0c731cc5016be18e98ec71d2
        # save the attached .config to linux build tree
        make W=1 C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' ARCH=x86_64 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>


"sparse warnings: (new ones prefixed by >>)"
>> fs/btrfs/inode.c:2161:5: sparse: sparse: symbol 'extract_ordered_extent' was not declared. Should it be static?
>> fs/btrfs/inode.c:2279:21: sparse: sparse: incorrect type in assignment (different base types) @@     expected restricted blk_status_t [usertype] ret @@     got int @@
>> fs/btrfs/inode.c:2279:21: sparse:     expected restricted blk_status_t [usertype] ret
>> fs/btrfs/inode.c:2279:21: sparse:     got int

Please review and possibly fold the followup patch.

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 32694 bytes --]

^ permalink raw reply	[flat|nested] 125+ messages in thread

* [RFC PATCH] btrfs: extract_ordered_extent() can be static
  2020-11-10 11:26 ` [PATCH v10 23/41] btrfs: split ordered extent when bio is sent Naohiro Aota
@ 2020-11-11  3:46     ` kernel test robot
  2020-11-11  2:26     ` kernel test robot
                       ` (3 subsequent siblings)
  4 siblings, 0 replies; 125+ messages in thread
From: kernel test robot @ 2020-11-11  3:46 UTC (permalink / raw)
  To: Naohiro Aota, linux-btrfs, dsterba
  Cc: kbuild-all, hare, linux-fsdevel, Jens Axboe, Christoph Hellwig,
	Darrick J. Wong, Naohiro Aota


Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: kernel test robot <lkp@intel.com>
---
 inode.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index df85d8dea37c9e..ee3f3ab1b964b6 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2158,8 +2158,8 @@ static blk_status_t btrfs_submit_bio_start(void *private_data, struct bio *bio,
 	return btrfs_csum_one_bio(BTRFS_I(inode), bio, 0, 0);
 }
 
-int extract_ordered_extent(struct inode *inode, struct bio *bio,
-			   loff_t file_offset)
+static int extract_ordered_extent(struct inode *inode, struct bio *bio,
+				  loff_t file_offset)
 {
 	struct btrfs_ordered_extent *ordered;
 	struct extent_map *em = NULL, *em_new = NULL;

^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH v10.1 23/41] btrfs: split ordered extent when bio is sent
  2020-11-10 11:26 ` [PATCH v10 23/41] btrfs: split ordered extent when bio is sent Naohiro Aota
                     ` (3 preceding siblings ...)
  2020-11-11  3:46     ` kernel test robot
@ 2020-11-11  4:12   ` Naohiro Aota
  4 siblings, 0 replies; 125+ messages in thread
From: Naohiro Aota @ 2020-11-11  4:12 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Josef Bacik, Hannes Reinecke, linux-fsdevel, Naohiro Aota,
	kernel test robot

For a zone append write, the device decides the location the data is
written to. Therefore, we cannot ensure that two bios are written
consecutively on the device. In order to ensure that an ordered extent
maps to a contiguous region on disk, we need to maintain a "one bio ==
one ordered extent" rule.

This commit implements the splitting of an ordered extent and extent map
on bio submission to adhere to the rule.

[testbot] made extract_ordered_extent static
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/inode.c        | 89 +++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/ordered-data.c | 76 +++++++++++++++++++++++++++++++++++
 fs/btrfs/ordered-data.h |  2 +
 3 files changed, 167 insertions(+)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 591ca539e444..ee3f3ab1b964 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2158,6 +2158,86 @@ static blk_status_t btrfs_submit_bio_start(void *private_data, struct bio *bio,
 	return btrfs_csum_one_bio(BTRFS_I(inode), bio, 0, 0);
 }
 
+static int extract_ordered_extent(struct inode *inode, struct bio *bio,
+				  loff_t file_offset)
+{
+	struct btrfs_ordered_extent *ordered;
+	struct extent_map *em = NULL, *em_new = NULL;
+	struct extent_map_tree *em_tree = &BTRFS_I(inode)->extent_tree;
+	u64 start = (u64)bio->bi_iter.bi_sector << SECTOR_SHIFT;
+	u64 len = bio->bi_iter.bi_size;
+	u64 end = start + len;
+	u64 ordered_end;
+	u64 pre, post;
+	int ret = 0;
+
+	ordered = btrfs_lookup_ordered_extent(BTRFS_I(inode), file_offset);
+	if (WARN_ON_ONCE(!ordered))
+		return -EIO;
+
+	/* No need to split */
+	if (ordered->disk_num_bytes == len)
+		goto out;
+
+	/* We cannot split once end_bio'd ordered extent */
+	if (WARN_ON_ONCE(ordered->bytes_left != ordered->disk_num_bytes)) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	/* We cannot split a compressed ordered extent */
+	if (WARN_ON_ONCE(ordered->disk_num_bytes != ordered->num_bytes)) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	/* We cannot split a waited ordered extent */
+	if (WARN_ON_ONCE(wq_has_sleeper(&ordered->wait))) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	ordered_end = ordered->disk_bytenr + ordered->disk_num_bytes;
+	/* bio must be in one ordered extent */
+	if (WARN_ON_ONCE(start < ordered->disk_bytenr || end > ordered_end)) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	/* Checksum list should be empty */
+	if (WARN_ON_ONCE(!list_empty(&ordered->list))) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	pre = start - ordered->disk_bytenr;
+	post = ordered_end - end;
+
+	btrfs_split_ordered_extent(ordered, pre, post);
+
+	read_lock(&em_tree->lock);
+	em = lookup_extent_mapping(em_tree, ordered->file_offset, len);
+	if (!em) {
+		read_unlock(&em_tree->lock);
+		ret = -EIO;
+		goto out;
+	}
+	read_unlock(&em_tree->lock);
+
+	ASSERT(!test_bit(EXTENT_FLAG_COMPRESSED, &em->flags));
+	em_new = create_io_em(BTRFS_I(inode), em->start + pre, len,
+			      em->start + pre, em->block_start + pre, len,
+			      len, len, BTRFS_COMPRESS_NONE,
+			      BTRFS_ORDERED_REGULAR);
+	free_extent_map(em_new);
+
+out:
+	free_extent_map(em);
+	btrfs_put_ordered_extent(ordered);
+
+	return ret;
+}
+
 /*
  * extent_io.c submission hook. This does the right thing for csum calculation
  * on write, or reading the csums from the tree before a read.
@@ -2192,6 +2272,15 @@ blk_status_t btrfs_submit_data_bio(struct inode *inode, struct bio *bio,
 	if (btrfs_is_free_space_inode(BTRFS_I(inode)))
 		metadata = BTRFS_WQ_ENDIO_FREE_SPACE;
 
+	if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
+		struct page *page = bio_first_bvec_all(bio)->bv_page;
+		loff_t file_offset = page_offset(page);
+
+		ret = extract_ordered_extent(inode, bio, file_offset);
+		if (ret)
+			goto out;
+	}
+
 	if (btrfs_op(bio) != BTRFS_MAP_WRITE) {
 		ret = btrfs_bio_wq_end_io(fs_info, bio, metadata);
 		if (ret)
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index 87bac9ecdf4c..35ef25e39561 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -943,6 +943,82 @@ void btrfs_lock_and_flush_ordered_range(struct btrfs_inode *inode, u64 start,
 	}
 }
 
+static void clone_ordered_extent(struct btrfs_ordered_extent *ordered, u64 pos,
+				 u64 len)
+{
+	struct inode *inode = ordered->inode;
+	u64 file_offset = ordered->file_offset + pos;
+	u64 disk_bytenr = ordered->disk_bytenr + pos;
+	u64 num_bytes = len;
+	u64 disk_num_bytes = len;
+	int type;
+	unsigned long flags_masked =
+		ordered->flags & ~(1 << BTRFS_ORDERED_DIRECT);
+	int compress_type = ordered->compress_type;
+	unsigned long weight;
+
+	weight = hweight_long(flags_masked);
+	WARN_ON_ONCE(weight > 1);
+	if (!weight)
+		type = 0;
+	else
+		type = __ffs(flags_masked);
+
+	if (test_bit(BTRFS_ORDERED_COMPRESSED, &ordered->flags)) {
+		WARN_ON_ONCE(1);
+		btrfs_add_ordered_extent_compress(BTRFS_I(inode), file_offset,
+						  disk_bytenr, num_bytes,
+						  disk_num_bytes, type,
+						  compress_type);
+	} else if (test_bit(BTRFS_ORDERED_DIRECT, &ordered->flags)) {
+		btrfs_add_ordered_extent_dio(BTRFS_I(inode), file_offset,
+					     disk_bytenr, num_bytes,
+					     disk_num_bytes, type);
+	} else {
+		btrfs_add_ordered_extent(BTRFS_I(inode), file_offset,
+					 disk_bytenr, num_bytes, disk_num_bytes,
+					 type);
+	}
+}
+
+void btrfs_split_ordered_extent(struct btrfs_ordered_extent *ordered, u64 pre,
+				u64 post)
+{
+	struct inode *inode = ordered->inode;
+	struct btrfs_ordered_inode_tree *tree = &BTRFS_I(inode)->ordered_tree;
+	struct rb_node *node;
+	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+
+	spin_lock_irq(&tree->lock);
+	/* Remove from tree once */
+	node = &ordered->rb_node;
+	rb_erase(node, &tree->tree);
+	RB_CLEAR_NODE(node);
+	if (tree->last == node)
+		tree->last = NULL;
+
+	ordered->file_offset += pre;
+	ordered->disk_bytenr += pre;
+	ordered->num_bytes -= (pre + post);
+	ordered->disk_num_bytes -= (pre + post);
+	ordered->bytes_left -= (pre + post);
+
+	/* Re-insert the node */
+	node = tree_insert(&tree->tree, ordered->file_offset,
+			   &ordered->rb_node);
+	if (node)
+		btrfs_panic(fs_info, -EEXIST,
+				"zoned: inconsistency in ordered tree at offset %llu",
+				ordered->file_offset);
+
+	spin_unlock_irq(&tree->lock);
+
+	if (pre)
+		clone_ordered_extent(ordered, 0, pre);
+	if (post)
+		clone_ordered_extent(ordered, pre + ordered->disk_num_bytes, post);
+}
+
 int __init ordered_data_init(void)
 {
 	btrfs_ordered_extent_cache = kmem_cache_create("btrfs_ordered_extent",
diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
index c3a2325e64a4..e346b03bd66a 100644
--- a/fs/btrfs/ordered-data.h
+++ b/fs/btrfs/ordered-data.h
@@ -193,6 +193,8 @@ void btrfs_wait_ordered_roots(struct btrfs_fs_info *fs_info, u64 nr,
 void btrfs_lock_and_flush_ordered_range(struct btrfs_inode *inode, u64 start,
 					u64 end,
 					struct extent_state **cached_state);
+void btrfs_split_ordered_extent(struct btrfs_ordered_extent *ordered, u64 pre,
+				u64 post);
 int __init ordered_data_init(void);
 void __cold ordered_data_exit(void);
 
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH v10.1 38/41] btrfs: extend zoned allocator to use dedicated tree-log block group
  2020-11-10 11:26 ` [PATCH v10 38/41] btrfs: extend zoned allocator to use dedicated tree-log block group Naohiro Aota
@ 2020-11-11  4:58   ` Naohiro Aota
  0 siblings, 0 replies; 125+ messages in thread
From: Naohiro Aota @ 2020-11-11  4:58 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Josef Bacik, Hannes Reinecke, linux-fsdevel, Naohiro Aota,
	Johannes Thumshirn

This is the first of three patches to enable tree-log on ZONED mode.

The tree-log feature does not work on ZONED mode as is. Blocks for a
tree-log tree are allocated intermixed with other metadata blocks, and
btrfs writes and syncs the tree-log blocks to devices at fsync() time,
which is a different timing than a global transaction commit. As a result,
both writing tree-log blocks and writing other metadata blocks become
non-sequential writes, which ZONED mode must avoid.

We can introduce a dedicated block group for tree-log blocks so that
tree-log blocks and other metadata blocks form separate write streams.
As a result, each write stream can now be written to devices separately.
"fs_info->treelog_bg" tracks the dedicated block group, and btrfs assigns
"treelog_bg" on demand at tree-log block allocation time.

This commit extends the zoned block allocator to use the block group.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/block-group.c |  7 ++++
 fs/btrfs/ctree.h       |  2 ++
 fs/btrfs/extent-tree.c | 79 +++++++++++++++++++++++++++++++++++++++---
 3 files changed, 84 insertions(+), 4 deletions(-)

I forgot to merge a patch to add the lock description...

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index 04bb0602f1cc..d222f54eb0c1 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -939,6 +939,13 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
 	btrfs_return_cluster_to_free_space(block_group, cluster);
 	spin_unlock(&cluster->refill_lock);
 
+	if (btrfs_is_zoned(fs_info)) {
+		spin_lock(&fs_info->treelog_bg_lock);
+		if (fs_info->treelog_bg == block_group->start)
+			fs_info->treelog_bg = 0;
+		spin_unlock(&fs_info->treelog_bg_lock);
+	}
+
 	path = btrfs_alloc_path();
 	if (!path) {
 		ret = -ENOMEM;
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 8138e932b7cc..2fd7e58343ce 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -957,6 +957,8 @@ struct btrfs_fs_info {
 	/* Max size to emit ZONE_APPEND write command */
 	u64 max_zone_append_size;
 	struct mutex zoned_meta_io_lock;
+	spinlock_t treelog_bg_lock;
+	u64 treelog_bg;
 
 #ifdef CONFIG_BTRFS_FS_REF_VERIFY
 	spinlock_t ref_verify_lock;
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 2ee21076b641..f50eea392b2f 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3631,6 +3631,9 @@ struct find_free_extent_ctl {
 	bool have_caching_bg;
 	bool orig_have_caching_bg;
 
+	/* Allocation is called for tree-log */
+	bool for_treelog;
+
 	/* RAID index, converted from flags */
 	int index;
 
@@ -3859,6 +3862,22 @@ static int do_allocation_clustered(struct btrfs_block_group *block_group,
 	return find_free_extent_unclustered(block_group, ffe_ctl);
 }
 
+/*
+ * Tree-log Block Group Locking
+ * ============================
+ *
+ * fs_info::treelog_bg_lock protects the fs_info::treelog_bg which
+ * indicates the starting address of a block group, which is reserved only
+ * for tree-log metadata.
+ *
+ * Lock nesting
+ * ============
+ *
+ * space_info::lock
+ *   block_group::lock
+ *     fs_info::treelog_bg_lock
+ */
+
 /*
  * Simple allocator for sequential only block group. It only allows
  * sequential allocation. No need to play with trees. This function
@@ -3868,23 +3887,54 @@ static int do_allocation_zoned(struct btrfs_block_group *block_group,
 			       struct find_free_extent_ctl *ffe_ctl,
 			       struct btrfs_block_group **bg_ret)
 {
+	struct btrfs_fs_info *fs_info = block_group->fs_info;
 	struct btrfs_space_info *space_info = block_group->space_info;
 	struct btrfs_free_space_ctl *ctl = block_group->free_space_ctl;
 	u64 start = block_group->start;
 	u64 num_bytes = ffe_ctl->num_bytes;
 	u64 avail;
+	u64 bytenr = block_group->start;
+	u64 log_bytenr;
 	int ret = 0;
+	bool skip;
 
 	ASSERT(btrfs_is_zoned(block_group->fs_info));
 
+	/*
+	 * Do not allow non-tree-log blocks in the dedicated tree-log block
+	 * group, and vice versa.
+	 */
+	spin_lock(&fs_info->treelog_bg_lock);
+	log_bytenr = fs_info->treelog_bg;
+	skip = log_bytenr && ((ffe_ctl->for_treelog && bytenr != log_bytenr) ||
+			      (!ffe_ctl->for_treelog && bytenr == log_bytenr));
+	spin_unlock(&fs_info->treelog_bg_lock);
+	if (skip)
+		return 1;
+
 	spin_lock(&space_info->lock);
 	spin_lock(&block_group->lock);
+	spin_lock(&fs_info->treelog_bg_lock);
+
+	ASSERT(!ffe_ctl->for_treelog ||
+	       block_group->start == fs_info->treelog_bg ||
+	       fs_info->treelog_bg == 0);
 
 	if (block_group->ro) {
 		ret = 1;
 		goto out;
 	}
 
+	/*
+	 * Do not allow currently using block group to be tree-log dedicated
+	 * block group.
+	 */
+	if (ffe_ctl->for_treelog && !fs_info->treelog_bg &&
+	    (block_group->used || block_group->reserved)) {
+		ret = 1;
+		goto out;
+	}
+
 	avail = block_group->length - block_group->alloc_offset;
 	if (avail < num_bytes) {
 		ffe_ctl->max_extent_size = avail;
@@ -3892,6 +3942,9 @@ static int do_allocation_zoned(struct btrfs_block_group *block_group,
 		goto out;
 	}
 
+	if (ffe_ctl->for_treelog && !fs_info->treelog_bg)
+		fs_info->treelog_bg = block_group->start;
+
 	ffe_ctl->found_offset = start + block_group->alloc_offset;
 	block_group->alloc_offset += num_bytes;
 	spin_lock(&ctl->tree_lock);
@@ -3906,6 +3959,9 @@ static int do_allocation_zoned(struct btrfs_block_group *block_group,
 	ffe_ctl->search_start = ffe_ctl->found_offset;
 
 out:
+	if (ret && ffe_ctl->for_treelog)
+		fs_info->treelog_bg = 0;
+	spin_unlock(&fs_info->treelog_bg_lock);
 	spin_unlock(&block_group->lock);
 	spin_unlock(&space_info->lock);
 	return ret;
@@ -4155,7 +4211,12 @@ static int prepare_allocation(struct btrfs_fs_info *fs_info,
 		return prepare_allocation_clustered(fs_info, ffe_ctl,
 						    space_info, ins);
 	case BTRFS_EXTENT_ALLOC_ZONED:
-		/* nothing to do */
+		if (ffe_ctl->for_treelog) {
+			spin_lock(&fs_info->treelog_bg_lock);
+			if (fs_info->treelog_bg)
+				ffe_ctl->hint_byte = fs_info->treelog_bg;
+			spin_unlock(&fs_info->treelog_bg_lock);
+		}
 		return 0;
 	default:
 		BUG();
@@ -4199,6 +4260,7 @@ static noinline int find_free_extent(struct btrfs_root *root,
 	struct find_free_extent_ctl ffe_ctl = {0};
 	struct btrfs_space_info *space_info;
 	bool full_search = false;
+	bool for_treelog = root->root_key.objectid == BTRFS_TREE_LOG_OBJECTID;
 
 	WARN_ON(num_bytes < fs_info->sectorsize);
 
@@ -4212,6 +4274,7 @@ static noinline int find_free_extent(struct btrfs_root *root,
 	ffe_ctl.orig_have_caching_bg = false;
 	ffe_ctl.found_offset = 0;
 	ffe_ctl.hint_byte = hint_byte_orig;
+	ffe_ctl.for_treelog = for_treelog;
 	ffe_ctl.policy = BTRFS_EXTENT_ALLOC_CLUSTERED;
 
 	/* For clustered allocation */
@@ -4286,8 +4349,15 @@ static noinline int find_free_extent(struct btrfs_root *root,
 		struct btrfs_block_group *bg_ret;
 
 		/* If the block group is read-only, we can skip it entirely. */
-		if (unlikely(block_group->ro))
+		if (unlikely(block_group->ro)) {
+			if (btrfs_is_zoned(fs_info) && for_treelog) {
+				spin_lock(&fs_info->treelog_bg_lock);
+				if (block_group->start == fs_info->treelog_bg)
+					fs_info->treelog_bg = 0;
+				spin_unlock(&fs_info->treelog_bg_lock);
+			}
 			continue;
+		}
 
 		btrfs_grab_block_group(block_group, delalloc);
 		ffe_ctl.search_start = block_group->start;
@@ -4475,6 +4545,7 @@ int btrfs_reserve_extent(struct btrfs_root *root, u64 ram_bytes,
 	bool final_tried = num_bytes == min_alloc_size;
 	u64 flags;
 	int ret;
+	bool for_treelog = root->root_key.objectid == BTRFS_TREE_LOG_OBJECTID;
 
 	flags = get_alloc_profile_by_root(root, is_data);
 again:
@@ -4498,8 +4569,8 @@ int btrfs_reserve_extent(struct btrfs_root *root, u64 ram_bytes,
 
 			sinfo = btrfs_find_space_info(fs_info, flags);
 			btrfs_err(fs_info,
-				  "allocation failed flags %llu, wanted %llu",
-				  flags, num_bytes);
+			"allocation failed flags %llu, wanted %llu treelog %d",
+				  flags, num_bytes, for_treelog);
 			if (sinfo)
 				btrfs_dump_space_info(fs_info, sinfo,
 						      num_bytes, 1);
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 125+ messages in thread

* Re: [PATCH v10 00/41] btrfs: zoned block device support
  2020-11-10 14:00 ` [PATCH v10 00/41] btrfs: zoned block device support Anand Jain
@ 2020-11-11  5:07   ` Naohiro Aota
  0 siblings, 0 replies; 125+ messages in thread
From: Naohiro Aota @ 2020-11-11  5:07 UTC (permalink / raw)
  To: Anand Jain
  Cc: linux-btrfs, dsterba, hare, linux-fsdevel, Jens Axboe,
	Christoph Hellwig, Darrick J. Wong

On Tue, Nov 10, 2020 at 10:00:14PM +0800, Anand Jain wrote:
>On 10/11/20 7:26 pm, Naohiro Aota wrote:
>>This series adds zoned block device support to btrfs.
>>
>>This series is also available on github.
>>Kernel   https://github.com/naota/linux/tree/btrfs-zoned-v10
>
> This branch is not reachable. Should it be
>
>    https://github.com/naota/linux/tree/btrfs-zoned-for-v10 ?
>
> But the commits in this branch are at a pre-fixups stage.

Sorry, I forgot to push that one. I have pushed the fixed-up version to
the "btrfs-zoned-v10" branch. Thanks

>
>Thanks, Anand
>
>
>>Userland https://github.com/naota/btrfs-progs/tree/btrfs-zoned
>>xfstests https://github.com/naota/fstests/tree/btrfs-zoned
>
>>
>>Userland tool depends on patched util-linux (libblkid and wipefs) to handle
>>log-structured superblock. To ease the testing, pre-compiled static linked
>>userland tools are available here:
>>https://wdc.app.box.com/s/fnhqsb3otrvgkstq66o6bvdw6tk525kp
>>
>>This v10 still leaves the following issues left for later fix. But, the
>>first part of the series should be good shape to be merged.
>>- Bio submission path & splitting an ordered extent
>>- Redirtying freed tree blocks
>>   - Switch to keeping it dirty
>>     - Not working correctly for now
>>- Dedicated tree-log block group
>>   - We need tree-log for zoned device
>>     - Dbench (32 clients) is 85% slower with "-o notreelog"
>>   - Need to separate tree-log block group from other metadata space_info
>>- Relocation
>>   - Use normal write command for relocation
>>   - Relocated device extents must be reset
>>     - It should be discarded on regular btrfs too though
>>
>>Changes from v9:
>>   - Extract iomap_dio_bio_opflags() to set the proper bi_opf flag
>>   - write pointer emulation
>>     - Rewrite using btrfs_previous_extent_item()
>>     - Convert ASSERT() to runtime check
>>   - Exclude regular superblock positions
>>   - Fix an error on writing to conventional zones
>>   - Take the transaction lock in mark_block_group_to_copy()
>>   - Rename 'hmzoned_devices' to 'zoned_devices' in btrfs_check_zoned_mode()
>>   - Add do_discard_extent() helper
>>   - Move zoned check into fetch_cluster_info()
>>   - Drop setting bdev to bio in btrfs_bio_add_page() (will fix later once
>>     we support multiple devices)
>>   - Subtract bytes_zone_unusable properly when removing a block group
>>   - Add "struct block_device *bdev" directly to btrfs_rmap_block()
>>   - Rename btrfs_zone_align to btrfs_align_offset_to_zone
>>   - Add comment to use pr_info in place of btrfs_info
>>   - Add comment for superblock log zones
>>   - Fix coding style
>>   - Fix typos
>>
>>btrfs-progs and xfstests series will follow.
>>
>>This version of ZONED btrfs switched from the normal write command to the
>>zone append write command. With zone append, you do not specify the LBA
>>(the write pointer) to write to. Instead, you only select a zone to write
>>to by its start LBA. The device (NVMe ZNS), or the emulation of the zone
>>append command in the sd driver in the case of SAS or SATA HDDs, then
>>automatically writes the data at the write pointer position and returns
>>the written LBA as a command reply.
>>
>>The benefit of using the zone append write command is that the write
>>command issuing order does not matter. So, we can eliminate the block
>>group lock and utilize asynchronous checksumming, which can reorder the IOs.
>>
>>Eliminating the lock improves performance. In particular, on a workload
>>with massive contention on the same zone [1], we observed a 36% performance
>>improvement compared to normal writes.
>>
>>[1] Fio running 16 jobs with 4KB random writes for 5 minutes
>>
>>However, there are some limitations. We cannot use non-SINGLE profiles.
>>Supporting a non-SINGLE profile with zone append writes is not trivial. For
>>example, in the DUP profile, we send a zone append write IO to two zones
>>on a device. The device replies with the written LBAs for the IOs. If the
>>offsets of the returned addresses from the beginning of their zones
>>differ, the two copies end up at different logical addresses.
>>
>>For the same reason, we cannot issue multiple IOs for one ordered extent.
>>Thus, the size of an ordered extent is limited to max_zone_append_size.
>>This limitation will cause fragmentation and increased metadata usage. In
>>the future, we can add an optimization to merge ordered extents after
>>end_bio.
>>
>>* Patch series description
>>
>>A zoned block device consists of a number of zones. Zones are either
>>conventional, accepting random writes, or sequential, requiring that
>>writes be issued in LBA order from each zone's write pointer position.
>>This patch series ensures that the sequential write constraint of
>>sequential zones is respected while fundamentally not changing btrfs
>>block and I/O management for blocks stored in conventional zones.
>>
>>To achieve this, the default chunk size of btrfs is changed on zoned
>>block devices so that chunks are always aligned to a zone. Allocation
>>of blocks within a chunk is changed so that the allocation is always
>>sequential from the beginning of the chunks. To do so, an allocation
>>pointer is added to block groups and used as the allocation hint.  The
>>allocation changes also ensure that blocks freed below the allocation
>>pointer are ignored, resulting in sequential block allocation
>>regardless of the chunk usage.
>>
>>The zone of a chunk is reset to allow reuse of the zone only when the
>>block group is being freed, that is, when all the chunks of the block
>>group are unused.
>>
>>For btrfs volumes composed of multiple zoned disks, a restriction is
>>added to ensure that all disks have the same zone size. This
>>restriction matches the existing constraint that all chunks in a block
>>group must have the same size.
>>
>>* Enabling tree-log
>>
>>The tree-log feature does not work on ZONED mode as is. Blocks for a
>>tree-log tree are allocated mixed with other metadata blocks, and btrfs
>>writes and syncs the tree-log blocks to devices at the time of fsync(),
>>which is different timing than a global transaction commit. As a result,
>>both writing tree-log blocks and writing other metadata blocks become
>>non-sequential writes which ZONED mode must avoid.
>>
>>This series introduces a dedicated block group for tree-log blocks to
>>create two metadata writing streams, one for tree-log blocks and the
>>other for metadata blocks. As a result, each write stream can now be
>>written to devices separately and sequentially.
>>
>>* Log-structured superblock
>>
>>The superblock (and its copies) is the only data structure in btrfs
>>which has a fixed location on a device. Since we cannot overwrite in a
>>sequential-write-required zone, we cannot place the superblock in such a
>>zone.
>>
>>This series implements superblock log writing. It uses two zones as a
>>circular buffer to write updated superblocks. Once the first zone is
>>filled up, we start writing into the second zone. The first zone is reset
>>once both zones are filled. We can determine the position of the latest
>>superblock by reading the write pointer information from the device.
>>
>>* Patch series organization
>>
>>Patches 1 and 2 are preparing patches for block and iomap layer.
>>
>>Patch 3 introduces the ZONED incompatible feature flag to indicate that the
>>btrfs volume was formatted for use on zoned block devices.
>>
>>Patches 4 to 6 implement functions to gather information on the zones of
>>the device (zones type, write pointer position, and max_zone_append_size).
>>
>>Patches 7 to 10 disable features which are not compatible with the
>>sequential write constraints of zoned block devices. These include
>>space_cache, NODATACOW, fallocate, and MIXED_BG.
>>
>>Patch 11 implements the log-structured superblock writing.
>>
>>Patches 12 and 13 tweak the device extent allocation for ZONED mode and add
>>verification to check if a device extent is properly aligned to zones.
>>
>>Patches 14 to 17 implement a sequential block allocator for ZONED mode.
>>
>>Patch 18 implements a zone reset for unused block groups.
>>
>>Patches 19 to 30 implement the writing path for several types of IO
>>(non-compressed data, direct IO, and metadata). These include re-dirtying
>>once-freed metadata blocks to prevent write holes.
>>
>>Patches 31 to 40 tweak some btrfs features to work with ZONED mode. These
>>include device-replace, relocation, repairing IO error, and tree-log.
>>
>>Finally, patch 41 adds the ZONED feature to the list of supported features.
>>
>>* Patch testing note
>>
>>** Zone-aware util-linux
>>
>>Since the log-structured superblock feature changed the location of the
>>superblock magic, current util-linux (libblkid) cannot detect ZONED btrfs
>>anymore. You need to apply a to-be-posted patch to util-linux to make it
>>"zone aware".
>>
>>** Testing device
>>
>>You need devices supporting the zone append write command to run ZONED
>>btrfs.
>>
>>Other than real devices, null_blk supports the zone append write command.
>>You can use a memory-backed null_blk to run the tests. The following
>>script creates a 12800 MB /dev/nullb0.
>>
>>     sysfs=/sys/kernel/config/nullb/nullb0
>>     size=12800 # MB
>>     # drop nullb0
>>     if [[ -d $sysfs ]]; then
>>             echo 0 > "${sysfs}"/power
>>             rmdir $sysfs
>>     fi
>>     lsmod | grep -q null_blk && rmmod null_blk
>>     modprobe null_blk nr_devices=0
>>     mkdir "${sysfs}"
>>     echo "${size}" > "${sysfs}"/size
>>     echo 1 > "${sysfs}"/zoned
>>     echo 0 > "${sysfs}"/zone_nr_conv
>>     echo 1 > "${sysfs}"/memory_backed
>>     echo 1 > "${sysfs}"/power
>>     udevadm settle
>>
>>Zoned SCSI devices such as SMR HDDs or scsi_debug also support the zone
>>append command as an emulated command within the SCSI sd driver. This
>>emulation is completely transparent to the user and provides the same
>>semantics as native NVMe ZNS drive support.
>>
>>Also, there is a qemu patch available to emulate an NVMe ZNS device.
>>
>>** xfstests
>>
>>We ran xfstests on ZONED btrfs and, if we omit some cases that are known
>>to fail currently, all test cases pass.
>>
>>Cases that can be ignored:
>>1) failing also with the regular btrfs on regular devices,
>>2) trying to test the fallocate feature without using
>>    "_require_xfs_io_command "falloc"",
>>3) trying to test incompatible features for ZONED btrfs (e.g. RAID5/6)
>>4) trying to use incompatible setup for ZONED btrfs (e.g. dm-linear not
>>    aligned to zone boundary, swap)
>>5) trying to create a file system with too small a size (we require at
>>    least 9 zones to create a ZONED btrfs)
>>6) dropping original MKFS_OPTIONS ("-O zoned"), so it cannot create ZONED
>>    btrfs (btrfs/003)
>>7) hitting ENOSPC caused by the larger metadata block group size
>>
>>I will send a patch series for xfstests to handle these cases (2-6)
>>properly.
>>
>>Patched xfstests is available here:
>>
>>https://github.com/naota/fstests/tree/btrfs-zoned
>>
>>Also, you need to apply the following patch if you run xfstests with
>>tcmu devices. xfstests btrfs/003 failed to "_devmgt_add" after
>>"_devmgt_remove" without this patch.
>>
>>https://marc.info/?l=linux-scsi&m=156498625421698&w=2
>>
>>v9 https://lore.kernel.org/linux-btrfs/cover.1604065156.git.naohiro.aota@wdc.com/
>>v8 https://lore.kernel.org/linux-btrfs/cover.1601572459.git.naohiro.aota@wdc.com/
>>v7 https://lore.kernel.org/linux-btrfs/20200911123259.3782926-1-naohiro.aota@wdc.com/
>>v6 https://lore.kernel.org/linux-btrfs/20191213040915.3502922-1-naohiro.aota@wdc.com/
>>v5 https://lore.kernel.org/linux-btrfs/20191204082513.857320-1-naohiro.aota@wdc.com/
>>v4 https://lwn.net/Articles/797061/
>>v3 https://lore.kernel.org/linux-btrfs/20190808093038.4163421-1-naohiro.aota@wdc.com/
>>v2 https://lore.kernel.org/linux-btrfs/20190607131025.31996-1-naohiro.aota@wdc.com/
>>v1 https://lore.kernel.org/linux-btrfs/20180809180450.5091-1-naota@elisp.net/
>>
>>Changelog
>>v9
>>  - Direct-IO path now follows several hardware restrictions (other than
>>    max_zone_append_size) by using the ZONE_APPEND support of iomap
>>  - introduces union of fs_info->zone_size and fs_info->zoned [Johannes]
>>    - and use btrfs_is_zoned(fs_info) in place of btrfs_fs_incompat(fs_info, ZONED)
>>  - print if zoned is enabled or not when printing module info [Johannes]
>>  - drop patch of disabling inode_cache on ZONED
>>  - moved for_treelog flag to a proper location [Johannes]
>>  - Code style fixes [Johannes]
>>  - Add comment about adding physical layer things to ordered extent
>>    structure
>>  - Pass file_offset explicitly to extract_ordered_extent() instead of
>>    determining it from bio
>>  - Bug fixes
>>    - write out fsync region so that the logical address of ordered extents
>>      and checksums are properly finalized
>>    - free zone_info at umount time
>>    - fix superblock log handling when entering zones[1] for the first time
>>    - fix double free of log-tree roots [Johannes]
>>    - Drop erroneous ASSERT in do_allocation_zoned()
>>v8
>>  - Use bio_add_hw_page() to build up bio to honor hardware restrictions
>>    - add bio_add_zone_append_page() as a wrapper of the function
>>  - Split file extent on submitting bio
>>    - If bio_add_zone_append_page() fails, split the file extent and send
>>      out bio
>>    - so, we can ensure one bio == one file extent
>>  - Fix build bot issues
>>  - Rebased on misc-next
>>v7:
>>  - Use zone append write command instead of normal write command
>>    - Bio issuing order does not matter
>>    - No need to use lock anymore
>>    - Can use asynchronous checksum
>>  - Removed RAID support for now
>>  - Rename HMZONED to ZONED
>>  - Split some patches
>>  - Rebased on kdave/for-5.9-rc3 + iomap direct IO
>>v6:
>>  - Use bitmap helpers (Johannes)
>>  - Code cleanup (Johannes)
>>  - Rebased on kdave/for-5.5
>>v5:
>>  - Rebased on kdave/for-5.5
>>  - Enable the tree-log feature.
>>  - Treat conventional zones as sequential zones, so we can now allow
>>    mixed allocation of conventional zone and sequential write required
>>    zone to construct a block group.
>>  - Implement log-structured superblock
>>    - No need for one conventional zone at the beginning of a device.
>>  - Fix deadlock of direct IO writing
>>  - Fix building with !CONFIG_BLK_DEV_ZONED (Johannes)
>>  - Fix leak of zone_info (Johannes)
>>v4:
>>  - Move memory allocation of zone information out of
>>    btrfs_get_dev_zones() (Anand)
>>  - Add disabled features table in commit log (Anand)
>>  - Ensure "max_chunk_size >= devs_min * data_stripes * zone_size"
>>v3:
>>  - Serialize allocation and submit_bio instead of bio buffering in
>>    btrfs_map_bio().
>>  -- Disable async checksum/submit in HMZONED mode
>>  - Introduce helper functions and hmzoned.c/h (Josef, David)
>>  - Add support for repairing IO failure
>>  - Add support for NOCOW direct IO write (Josef)
>>  - Disable preallocation entirely
>>  -- Disable INODE_MAP_CACHE
>>  -- relocation is reworked not to rely on preallocation in HMZONED mode
>>  - Disable NODATACOW
>>  - Disable MIXED_BG
>>  - Device extent that cover super block position is banned (David)
>>v2:
>>  - Add support for dev-replace
>>  -- To support dev-replace, moved submit_buffer one layer up. It now
>>     handles bio instead of btrfs_bio.
>>  -- Mark unmirrored Block Group readonly only when there are writable
>>     mirrored BGs. Necessary to handle degraded RAID.
>>  - Expire worker use vanilla delayed_work instead of btrfs's async-thread
>>  - Device extent allocator now ensures that a region is on the same zone type.
>>  - Add delayed allocation shrinking.
>>  - Rename btrfs_drop_dev_zonetypes() to btrfs_destroy_dev_zonetypes
>>  - Fix
>>  -- Use SECTOR_SHIFT (Nikolay)
>>  -- Use btrfs_err (Nikolay)
>>
>>
>>Johannes Thumshirn (1):
>>   block: add bio_add_zone_append_page
>>
>>Naohiro Aota (40):
>>   iomap: support REQ_OP_ZONE_APPEND
>>   btrfs: introduce ZONED feature flag
>>   btrfs: get zone information of zoned block devices
>>   btrfs: check and enable ZONED mode
>>   btrfs: introduce max_zone_append_size
>>   btrfs: disallow space_cache in ZONED mode
>>   btrfs: disallow NODATACOW in ZONED mode
>>   btrfs: disable fallocate in ZONED mode
>>   btrfs: disallow mixed-bg in ZONED mode
>>   btrfs: implement log-structured superblock for ZONED mode
>>   btrfs: implement zoned chunk allocator
>>   btrfs: verify device extent is aligned to zone
>>   btrfs: load zone's alloction offset
>>   btrfs: emulate write pointer for conventional zones
>>   btrfs: track unusable bytes for zones
>>   btrfs: do sequential extent allocation in ZONED mode
>>   btrfs: reset zones of unused block groups
>>   btrfs: redirty released extent buffers in ZONED mode
>>   btrfs: extract page adding function
>>   btrfs: use bio_add_zone_append_page for zoned btrfs
>>   btrfs: handle REQ_OP_ZONE_APPEND as writing
>>   btrfs: split ordered extent when bio is sent
>>   btrfs: extend btrfs_rmap_block for specifying a device
>>   btrfs: use ZONE_APPEND write for ZONED btrfs
>>   btrfs: enable zone append writing for direct IO
>>   btrfs: introduce dedicated data write path for ZONED mode
>>   btrfs: serialize meta IOs on ZONED mode
>>   btrfs: wait existing extents before truncating
>>   btrfs: avoid async metadata checksum on ZONED mode
>>   btrfs: mark block groups to copy for device-replace
>>   btrfs: implement cloning for ZONED device-replace
>>   btrfs: implement copying for ZONED device-replace
>>   btrfs: support dev-replace in ZONED mode
>>   btrfs: enable relocation in ZONED mode
>>   btrfs: relocate block group to repair IO failure in ZONED
>>   btrfs: split alloc_log_tree()
>>   btrfs: extend zoned allocator to use dedicated tree-log block group
>>   btrfs: serialize log transaction on ZONED mode
>>   btrfs: reorder log node allocation
>>   btrfs: enable to mount ZONED incompat flag
>>
>>  block/bio.c                       |   38 +
>>  fs/btrfs/Makefile                 |    1 +
>>  fs/btrfs/block-group.c            |   84 +-
>>  fs/btrfs/block-group.h            |   18 +-
>>  fs/btrfs/ctree.h                  |   20 +-
>>  fs/btrfs/dev-replace.c            |  195 +++++
>>  fs/btrfs/dev-replace.h            |    3 +
>>  fs/btrfs/disk-io.c                |   93 ++-
>>  fs/btrfs/disk-io.h                |    2 +
>>  fs/btrfs/extent-tree.c            |  218 ++++-
>>  fs/btrfs/extent_io.c              |  130 ++-
>>  fs/btrfs/extent_io.h              |    2 +
>>  fs/btrfs/file.c                   |    6 +-
>>  fs/btrfs/free-space-cache.c       |   58 ++
>>  fs/btrfs/free-space-cache.h       |    2 +
>>  fs/btrfs/inode.c                  |  164 +++-
>>  fs/btrfs/ioctl.c                  |   13 +
>>  fs/btrfs/ordered-data.c           |   79 ++
>>  fs/btrfs/ordered-data.h           |   10 +
>>  fs/btrfs/relocation.c             |   35 +-
>>  fs/btrfs/scrub.c                  |  145 ++++
>>  fs/btrfs/space-info.c             |   13 +-
>>  fs/btrfs/space-info.h             |    4 +-
>>  fs/btrfs/super.c                  |   19 +-
>>  fs/btrfs/sysfs.c                  |    4 +
>>  fs/btrfs/tests/extent-map-tests.c |    2 +-
>>  fs/btrfs/transaction.c            |   10 +
>>  fs/btrfs/transaction.h            |    3 +
>>  fs/btrfs/tree-log.c               |   52 +-
>>  fs/btrfs/volumes.c                |  322 +++++++-
>>  fs/btrfs/volumes.h                |    7 +
>>  fs/btrfs/zoned.c                  | 1272 +++++++++++++++++++++++++++++
>>  fs/btrfs/zoned.h                  |  295 +++++++
>>  fs/iomap/direct-io.c              |   41 +-
>>  include/linux/bio.h               |    2 +
>>  include/linux/iomap.h             |    1 +
>>  include/uapi/linux/btrfs.h        |    1 +
>>  37 files changed, 3246 insertions(+), 118 deletions(-)
>>  create mode 100644 fs/btrfs/zoned.c
>>  create mode 100644 fs/btrfs/zoned.h
>>
>

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH v10 01/41] block: add bio_add_zone_append_page
  2020-11-10 17:20   ` Christoph Hellwig
@ 2020-11-11  7:20     ` Johannes Thumshirn
  0 siblings, 0 replies; 125+ messages in thread
From: Johannes Thumshirn @ 2020-11-11  7:20 UTC (permalink / raw)
  To: hch, Naohiro Aota
  Cc: linux-btrfs, dsterba, hare, linux-fsdevel, Jens Axboe, Darrick J. Wong

On 10/11/2020 18:20, Christoph Hellwig wrote:
> Do we need this check?  I'd rather just initialize q at declaration time
> and let the NULL pointer deref be the obvious sign for a grave
> programming error..

I wanted to be a bit more gentle, but I don't mind doing it the other way
around either.

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH v10 04/41] btrfs: get zone information of zoned block devices
  2020-11-10 11:26 ` [PATCH v10 04/41] btrfs: get zone information of zoned block devices Naohiro Aota
@ 2020-11-12  6:57   ` Anand Jain
  2020-11-12  7:35     ` Johannes Thumshirn
                       ` (2 more replies)
  2020-11-25 21:47   ` David Sterba
  2020-11-25 22:16   ` David Sterba
  2 siblings, 3 replies; 125+ messages in thread
From: Anand Jain @ 2020-11-12  6:57 UTC (permalink / raw)
  To: Naohiro Aota, linux-btrfs, dsterba
  Cc: hare, linux-fsdevel, Jens Axboe, Christoph Hellwig,
	Darrick J. Wong, Damien Le Moal, Josef Bacik



> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> index 8840a4fa81eb..ed55014fd1bd 100644
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -2462,6 +2462,11 @@ static void __init btrfs_print_mod_info(void)
>   #endif
>   #ifdef CONFIG_BTRFS_FS_REF_VERIFY
>   			", ref-verify=on"
> +#endif
> +#ifdef CONFIG_BLK_DEV_ZONED
> +			", zoned=yes"
> +#else
> +			", zoned=no"
>   #endif

IMO, we don't need this, as most generic kernels will be compiled with
CONFIG_BLK_DEV_ZONED defined.
For review purposes we may want to know if the mounted device is a zoned
device, so a log of the zoned device and its type may be useful once we
have verified the zoned devices in open_ctree().

> @@ -374,6 +375,7 @@ void btrfs_free_device(struct btrfs_device *device)
>   	rcu_string_free(device->name);
>   	extent_io_tree_release(&device->alloc_state);
>   	bio_put(device->flush_bio);

> +	btrfs_destroy_dev_zone_info(device);

Freeing of btrfs_device::zone_info already happens in this path..

  btrfs_close_one_device()
    btrfs_destroy_dev_zone_info()

  So we don't need this one..

  btrfs_free_device()
    btrfs_destroy_dev_zone_info()


> @@ -2543,6 +2551,14 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
>   	}
>   	rcu_assign_pointer(device->name, name);
>   
> +	device->fs_info = fs_info;
> +	device->bdev = bdev;
> +
> +	/* Get zone type information of zoned block devices */
> +	ret = btrfs_get_dev_zone_info(device);
> +	if (ret)
> +		goto error_free_device;
> +
>   	trans = btrfs_start_transaction(root, 0);
>   	if (IS_ERR(trans)) {
>   		ret = PTR_ERR(trans);

It should be something like goto error_free_zone from here.


> @@ -2707,6 +2721,7 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
>   		sb->s_flags |= SB_RDONLY;
>   	if (trans)
>   		btrfs_end_transaction(trans);


error_free_zone:
> +	btrfs_destroy_dev_zone_info(device);
>   error_free_device:
>   	btrfs_free_device(device);
>   error:

  As mentioned, we don't need the btrfs_destroy_dev_zone_info() call in
  btrfs_free_device(). Otherwise we end up calling
  btrfs_destroy_dev_zone_info() twice here.


Thanks, Anand

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH v10 04/41] btrfs: get zone information of zoned block devices
  2020-11-12  6:57   ` Anand Jain
@ 2020-11-12  7:35     ` Johannes Thumshirn
  2020-11-12  7:44       ` Damien Le Moal
  2020-11-12  9:39     ` Johannes Thumshirn
  2020-11-12 12:57     ` Naohiro Aota
  2 siblings, 1 reply; 125+ messages in thread
From: Johannes Thumshirn @ 2020-11-12  7:35 UTC (permalink / raw)
  To: Anand Jain, Naohiro Aota, linux-btrfs, dsterba
  Cc: hare, linux-fsdevel, Jens Axboe, hch, Darrick J. Wong,
	Damien Le Moal, Josef Bacik

On 12/11/2020 08:00, Anand Jain wrote:
>> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
>> index 8840a4fa81eb..ed55014fd1bd 100644
>> --- a/fs/btrfs/super.c
>> +++ b/fs/btrfs/super.c
>> @@ -2462,6 +2462,11 @@ static void __init btrfs_print_mod_info(void)
>>   #endif
>>   #ifdef CONFIG_BTRFS_FS_REF_VERIFY
>>   			", ref-verify=on"
>> +#endif
>> +#ifdef CONFIG_BLK_DEV_ZONED
>> +			", zoned=yes"
>> +#else
>> +			", zoned=no"
>>   #endif
> IMO, we don't need this, as most of the generic kernel will be compiled
> with the CONFIG_BLK_DEV_ZONED defined.
> For review purpose we may want to know if the mounted device
> is a zoned device. So log of zone device and its type may be useful
> when we have verified the zoned devices in the open_ctree().
> 

David explicitly asked for this in [1] so we included it.

[1] https://lore.kernel.org/linux-btrfs/20201013155301.GE6756@twin.jikos.cz

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH v10 04/41] btrfs: get zone information of zoned block devices
  2020-11-12  7:35     ` Johannes Thumshirn
@ 2020-11-12  7:44       ` Damien Le Moal
  2020-11-12  9:44         ` Anand Jain
  0 siblings, 1 reply; 125+ messages in thread
From: Damien Le Moal @ 2020-11-12  7:44 UTC (permalink / raw)
  To: Johannes Thumshirn, Anand Jain, Naohiro Aota, linux-btrfs, dsterba
  Cc: hare, linux-fsdevel, Jens Axboe, hch, Darrick J. Wong, Josef Bacik

On 2020/11/12 16:35, Johannes Thumshirn wrote:
> On 12/11/2020 08:00, Anand Jain wrote:
>>> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
>>> index 8840a4fa81eb..ed55014fd1bd 100644
>>> --- a/fs/btrfs/super.c
>>> +++ b/fs/btrfs/super.c
>>> @@ -2462,6 +2462,11 @@ static void __init btrfs_print_mod_info(void)
>>>   #endif
>>>   #ifdef CONFIG_BTRFS_FS_REF_VERIFY
>>>   			", ref-verify=on"
>>> +#endif
>>> +#ifdef CONFIG_BLK_DEV_ZONED
>>> +			", zoned=yes"
>>> +#else
>>> +			", zoned=no"
>>>   #endif
>> IMO, we don't need this, as most of the generic kernel will be compiled
>> with the CONFIG_BLK_DEV_ZONED defined.
>> For review purpose we may want to know if the mounted device
>> is a zoned device. So log of zone device and its type may be useful
>> when we have verified the zoned devices in the open_ctree().
>>
> 
> David explicitly asked for this in [1] so we included it.
> 
> [1] https://lore.kernel.org/linux-btrfs/20201013155301.GE6756@twin.jikos.cz
> 

And as of now, not all generic kernels are compiled with CONFIG_BLK_DEV_ZONED.
E.g. RHEL and CentOS. That may change in the future, but it should not be
assumed that CONFIG_BLK_DEV_ZONED is always enabled.

-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH v10 04/41] btrfs: get zone information of zoned block devices
  2020-11-12  6:57   ` Anand Jain
  2020-11-12  7:35     ` Johannes Thumshirn
@ 2020-11-12  9:39     ` Johannes Thumshirn
  2020-11-12 12:57     ` Naohiro Aota
  2 siblings, 0 replies; 125+ messages in thread
From: Johannes Thumshirn @ 2020-11-12  9:39 UTC (permalink / raw)
  To: Anand Jain, Naohiro Aota, linux-btrfs, dsterba
  Cc: hare, linux-fsdevel, Jens Axboe, hch, Darrick J. Wong,
	Damien Le Moal, Josef Bacik

On 12/11/2020 08:00, Anand Jain wrote:
> 
> 
>> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
>> index 8840a4fa81eb..ed55014fd1bd 100644
>> --- a/fs/btrfs/super.c
>> +++ b/fs/btrfs/super.c
>> @@ -2462,6 +2462,11 @@ static void __init btrfs_print_mod_info(void)
>>   #endif
>>   #ifdef CONFIG_BTRFS_FS_REF_VERIFY
>>   			", ref-verify=on"
>> +#endif
>> +#ifdef CONFIG_BLK_DEV_ZONED
>> +			", zoned=yes"
>> +#else
>> +			", zoned=no"
>>   #endif
> 
> IMO, we don't need this, as most of the generic kernel will be compiled
> with the CONFIG_BLK_DEV_ZONED defined.
> For review purpose we may want to know if the mounted device
> is a zoned device. So log of zone device and its type may be useful
> when we have verified the zoned devices in the open_ctree().
> 
>> @@ -374,6 +375,7 @@ void btrfs_free_device(struct btrfs_device *device)
>>   	rcu_string_free(device->name);
>>   	extent_io_tree_release(&device->alloc_state);
>>   	bio_put(device->flush_bio);
> 
>> +	btrfs_destroy_dev_zone_info(device);
> 
> Free of btrfs_device::zone_info is already happening in the path..
> 
>   btrfs_close_one_device()
>     btrfs_destroy_dev_zone_info()
> 
>   We don't need this..
> 
>   btrfs_free_device()
>    btrfs_destroy_dev_zone_info()
> 
> 
>> @@ -2543,6 +2551,14 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
>>   	}
>>   	rcu_assign_pointer(device->name, name);
>>   
>> +	device->fs_info = fs_info;
>> +	device->bdev = bdev;
>> +
>> +	/* Get zone type information of zoned block devices */
>> +	ret = btrfs_get_dev_zone_info(device);
>> +	if (ret)
>> +		goto error_free_device;
>> +
>>   	trans = btrfs_start_transaction(root, 0);
>>   	if (IS_ERR(trans)) {
>>   		ret = PTR_ERR(trans);
> 
> It should be something like goto error_free_zone from here.
> 
> 
>> @@ -2707,6 +2721,7 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
>>   		sb->s_flags |= SB_RDONLY;
>>   	if (trans)
>>   		btrfs_end_transaction(trans);
> 
> 
> error_free_zone:
>> +	btrfs_destroy_dev_zone_info(device);
>>   error_free_device:
>>   	btrfs_free_device(device);
>>   error:
> 
>   As mentioned we don't need btrfs_destroy_dev_zone_info()
>   again in  btrfs_free_device(). Otherwise we end up calling
>   btrfs_destroy_dev_zone_info twice here.

Which doesn't do any harm as:
void btrfs_destroy_dev_zone_info(struct btrfs_device *device)
{
        struct btrfs_zoned_device_info *zone_info = device->zone_info;

        if (!zone_info)
                return;

	/* ... */
        device->zone_info = NULL;
}

Not sure what would be the preferred style here

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH v10 04/41] btrfs: get zone information of zoned block devices
  2020-11-12  7:44       ` Damien Le Moal
@ 2020-11-12  9:44         ` Anand Jain
  2020-11-13 21:34           ` David Sterba
  0 siblings, 1 reply; 125+ messages in thread
From: Anand Jain @ 2020-11-12  9:44 UTC (permalink / raw)
  To: Damien Le Moal, Johannes Thumshirn, Naohiro Aota, linux-btrfs, dsterba
  Cc: hare, linux-fsdevel, Jens Axboe, hch, Darrick J. Wong, Josef Bacik



On 12/11/20 3:44 pm, Damien Le Moal wrote:
> On 2020/11/12 16:35, Johannes Thumshirn wrote:
>> On 12/11/2020 08:00, Anand Jain wrote:
>>>> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
>>>> index 8840a4fa81eb..ed55014fd1bd 100644
>>>> --- a/fs/btrfs/super.c
>>>> +++ b/fs/btrfs/super.c
>>>> @@ -2462,6 +2462,11 @@ static void __init btrfs_print_mod_info(void)
>>>>    #endif
>>>>    #ifdef CONFIG_BTRFS_FS_REF_VERIFY
>>>>    			", ref-verify=on"
>>>> +#endif
>>>> +#ifdef CONFIG_BLK_DEV_ZONED
>>>> +			", zoned=yes"
>>>> +#else
>>>> +			", zoned=no"
>>>>    #endif
>>> IMO, we don't need this, as most of the generic kernel will be compiled
>>> with the CONFIG_BLK_DEV_ZONED defined.
>>> For review purpose we may want to know if the mounted device
>>> is a zoned device. So log of zone device and its type may be useful
>>> when we have verified the zoned devices in the open_ctree().
>>>
>>
>> David explicitly asked for this in [1] so we included it.
>>
>> [1] https://lore.kernel.org/linux-btrfs/20201013155301.GE6756@twin.jikos.cz
>>
> 
> And as of now, not all generic kernels are compiled with CONFIG_BLK_DEV_ZONED.
> E.g. RHEL and CentOS. That may change in the future, but it should not be
> assumed that CONFIG_BLK_DEV_ZONED is always enabled.
> 

Ok. My comment was from a long-term perspective. I am fine if you want
to keep it.

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH v10 04/41] btrfs: get zone information of zoned block devices
  2020-11-12  6:57   ` Anand Jain
  2020-11-12  7:35     ` Johannes Thumshirn
  2020-11-12  9:39     ` Johannes Thumshirn
@ 2020-11-12 12:57     ` Naohiro Aota
  2020-11-18 11:17       ` Anand Jain
  2 siblings, 1 reply; 125+ messages in thread
From: Naohiro Aota @ 2020-11-12 12:57 UTC (permalink / raw)
  To: Anand Jain
  Cc: linux-btrfs, dsterba, hare, linux-fsdevel, Jens Axboe,
	Christoph Hellwig, Darrick J. Wong, Damien Le Moal, Josef Bacik

On Thu, Nov 12, 2020 at 02:57:42PM +0800, Anand Jain wrote:
>
>
>>diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
>>index 8840a4fa81eb..ed55014fd1bd 100644
>>--- a/fs/btrfs/super.c
>>+++ b/fs/btrfs/super.c
>>@@ -2462,6 +2462,11 @@ static void __init btrfs_print_mod_info(void)
>>  #endif
>>  #ifdef CONFIG_BTRFS_FS_REF_VERIFY
>>  			", ref-verify=on"
>>+#endif
>>+#ifdef CONFIG_BLK_DEV_ZONED
>>+			", zoned=yes"
>>+#else
>>+			", zoned=no"
>>  #endif
>
>IMO, we don't need this, as most of the generic kernel will be compiled
>with the CONFIG_BLK_DEV_ZONED defined.
>For review purpose we may want to know if the mounted device
>is a zoned device. So log of zone device and its type may be useful
>when we have verified the zoned devices in the open_ctree().
>
>>@@ -374,6 +375,7 @@ void btrfs_free_device(struct btrfs_device *device)
>>  	rcu_string_free(device->name);
>>  	extent_io_tree_release(&device->alloc_state);
>>  	bio_put(device->flush_bio);
>
>>+	btrfs_destroy_dev_zone_info(device);
>
>Free of btrfs_device::zone_info is already happening in the path..
>
> btrfs_close_one_device()
>   btrfs_destroy_dev_zone_info()
>
> We don't need this..
>
> btrfs_free_device()
>  btrfs_destroy_dev_zone_info()

Ah, yes, I once had it only in btrfs_free_device() and noticed that it does
not free the device zone info on umount. So, I added one in
btrfs_close_one_device() and forgot to remove the other one. I'll drop it
from btrfs_free_device().

>
>
>>@@ -2543,6 +2551,14 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
>>  	}
>>  	rcu_assign_pointer(device->name, name);
>>+	device->fs_info = fs_info;
>>+	device->bdev = bdev;
>>+
>>+	/* Get zone type information of zoned block devices */
>>+	ret = btrfs_get_dev_zone_info(device);
>>+	if (ret)
>>+		goto error_free_device;
>>+
>>  	trans = btrfs_start_transaction(root, 0);
>>  	if (IS_ERR(trans)) {
>>  		ret = PTR_ERR(trans);
>
>It should be something like goto error_free_zone from here.
>
>
>>@@ -2707,6 +2721,7 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
>>  		sb->s_flags |= SB_RDONLY;
>>  	if (trans)
>>  		btrfs_end_transaction(trans);
>
>
>error_free_zone:

And, I'll do something like this.

>>+	btrfs_destroy_dev_zone_info(device);
>>  error_free_device:
>>  	btrfs_free_device(device);
>>  error:
>
> As mentioned we don't need btrfs_destroy_dev_zone_info()
> again in  btrfs_free_device(). Otherwise we end up calling
> btrfs_destroy_dev_zone_info twice here.
>
>
>Thanks, Anand

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH v10 04/41] btrfs: get zone information of zoned block devices
  2020-11-12  9:44         ` Anand Jain
@ 2020-11-13 21:34           ` David Sterba
  0 siblings, 0 replies; 125+ messages in thread
From: David Sterba @ 2020-11-13 21:34 UTC (permalink / raw)
  To: Anand Jain
  Cc: Damien Le Moal, Johannes Thumshirn, Naohiro Aota, linux-btrfs,
	dsterba, hare, linux-fsdevel, Jens Axboe, hch, Darrick J. Wong,
	Josef Bacik

On Thu, Nov 12, 2020 at 05:44:11PM +0800, Anand Jain wrote:
> On 12/11/20 3:44 pm, Damien Le Moal wrote:
> > On 2020/11/12 16:35, Johannes Thumshirn wrote:
> >> On 12/11/2020 08:00, Anand Jain wrote:
> >>>> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> >>>> index 8840a4fa81eb..ed55014fd1bd 100644
> >>>> --- a/fs/btrfs/super.c
> >>>> +++ b/fs/btrfs/super.c
> >>>> @@ -2462,6 +2462,11 @@ static void __init btrfs_print_mod_info(void)
> >>>>    #endif
> >>>>    #ifdef CONFIG_BTRFS_FS_REF_VERIFY
> >>>>    			", ref-verify=on"
> >>>> +#endif
> >>>> +#ifdef CONFIG_BLK_DEV_ZONED
> >>>> +			", zoned=yes"
> >>>> +#else
> >>>> +			", zoned=no"
> >>>>    #endif
> >>> IMO, we don't need this, as most generic kernels will be compiled
> >>> with CONFIG_BLK_DEV_ZONED defined.
> >>> For review purposes we may want to know whether the mounted device
> >>> is a zoned device, so logging the zoned device and its type may be
> >>> useful once we have verified the zoned devices in open_ctree().
> >>>
> >>
> >> David explicitly asked for this in [1] so we included it.
> >>
> >> [1] https://lore.kernel.org/linux-btrfs/20201013155301.GE6756@twin.jikos.cz
> >>
> > 
> > And as of now, not all generic kernels are compiled with CONFIG_BLK_DEV_ZONED.
> > E.g. RHEL and CentOS. That may change in the future, but it should not be
> > assumed that CONFIG_BLK_DEV_ZONED is always enabled.
> 
> Ok. My comment was from the long term perspective. I am fine if you want 
> to keep it.

The idea is to let the module announce which conditionally built
features are there according to fs/btrfs/Makefile and Kconfig. Besides
ACLs, which should always be on, and self-tests, which run right after
module load, all others are there and we should keep the list up to date.

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH v10 04/41] btrfs: get zone information of zoned block devices
  2020-11-12 12:57     ` Naohiro Aota
@ 2020-11-18 11:17       ` Anand Jain
  2020-11-30 11:16         ` Anand Jain
  0 siblings, 1 reply; 125+ messages in thread
From: Anand Jain @ 2020-11-18 11:17 UTC (permalink / raw)
  To: Naohiro Aota
  Cc: linux-btrfs, dsterba, hare, linux-fsdevel, Jens Axboe,
	Christoph Hellwig, Darrick J. Wong, Damien Le Moal, Josef Bacik



Also, %device->fs_info is not protected. It is better to avoid using
fs_info while we are still in open_fs_devices(). Yes, the <unknown> part
could be handled better, but we need to fix that as a whole. For now, you
can use something like...

-------------------------
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index 1223d5b0e411..e857bb304d28 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -130,19 +130,11 @@ int btrfs_get_dev_zone_info(struct btrfs_device *device)
          * (device <unknown>) ..."
          */

-       rcu_read_lock();
-       if (device->fs_info)
-               btrfs_info(device->fs_info,
-                       "host-%s zoned block device %s, %u zones of %llu bytes",
-                       bdev_zoned_model(bdev) == BLK_ZONED_HM ? "managed" : "aware",
-                       rcu_str_deref(device->name), zone_info->nr_zones,
-                       zone_info->zone_size);
-       else
-               pr_info("BTRFS info: host-%s zoned block device %s, %u zones of %llu bytes",
-                       bdev_zoned_model(bdev) == BLK_ZONED_HM ? "managed" : "aware",
-                       rcu_str_deref(device->name), zone_info->nr_zones,
-                       zone_info->zone_size);
-       rcu_read_unlock();
+       btrfs_info_in_rcu(NULL,
+               "host-%s zoned block device %s, %u zones of %llu bytes",
+               bdev_zoned_model(bdev) == BLK_ZONED_HM ? "managed" : "aware",
+               rcu_str_deref(device->name), zone_info->nr_zones,
+               zone_info->zone_size);

         return 0;
  ---------------------------

Thanks, Anand


On 12/11/20 8:57 pm, Naohiro Aota wrote:
> On Thu, Nov 12, 2020 at 02:57:42PM +0800, Anand Jain wrote:
>>
>>
>>> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
>>> index 8840a4fa81eb..ed55014fd1bd 100644
>>> --- a/fs/btrfs/super.c
>>> +++ b/fs/btrfs/super.c
>>> @@ -2462,6 +2462,11 @@ static void __init btrfs_print_mod_info(void)
>>>  #endif
>>>  #ifdef CONFIG_BTRFS_FS_REF_VERIFY
>>>              ", ref-verify=on"
>>> +#endif
>>> +#ifdef CONFIG_BLK_DEV_ZONED
>>> +            ", zoned=yes"
>>> +#else
>>> +            ", zoned=no"
>>>  #endif
>>
>> IMO, we don't need this, as most generic kernels will be compiled
>> with CONFIG_BLK_DEV_ZONED defined.
>> For review purposes we may want to know whether the mounted device
>> is a zoned device, so logging the zoned device and its type may be
>> useful once we have verified the zoned devices in open_ctree().
>>
>>> @@ -374,6 +375,7 @@ void btrfs_free_device(struct btrfs_device *device)
>>>      rcu_string_free(device->name);
>>>      extent_io_tree_release(&device->alloc_state);
>>>      bio_put(device->flush_bio);
>>
>>> +    btrfs_destroy_dev_zone_info(device);
>>
>> Free of btrfs_device::zone_info is already happening in the path..
>>
>> btrfs_close_one_device()
>>   btrfs_destroy_dev_zone_info()
>>
>> We don't need this..
>>
>> btrfs_free_device()
>>  btrfs_destroy_dev_zone_info()
> 
> Ah, yes, I once had it only in btrfs_free_device() and noticed that it does
> not free the device zone info on umount. So, I added one in
> btrfs_close_one_device() and forgot to remove the other one. I'll drop it
> from btrfs_free_device().
> 
>>
>>
>>> @@ -2543,6 +2551,14 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
>>>      }
>>>      rcu_assign_pointer(device->name, name);
>>> +    device->fs_info = fs_info;
>>> +    device->bdev = bdev;
>>> +
>>> +    /* Get zone type information of zoned block devices */
>>> +    ret = btrfs_get_dev_zone_info(device);
>>> +    if (ret)
>>> +        goto error_free_device;
>>> +
>>>      trans = btrfs_start_transaction(root, 0);
>>>      if (IS_ERR(trans)) {
>>>          ret = PTR_ERR(trans);
>>
>> It should be something like goto error_free_zone from here.
>>
>>
>>> @@ -2707,6 +2721,7 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
>>>          sb->s_flags |= SB_RDONLY;
>>>      if (trans)
>>>          btrfs_end_transaction(trans);
>>
>>
>> error_free_zone:
> 
> And, I'll do something like this.
> 
>>> +    btrfs_destroy_dev_zone_info(device);
>>>  error_free_device:
>>>      btrfs_free_device(device);
>>>  error:
>>
>> As mentioned we don't need btrfs_destroy_dev_zone_info()
>> again in  btrfs_free_device(). Otherwise we end up calling
>> btrfs_destroy_dev_zone_info twice here.
>>
>>
>> Thanks, Anand


^ permalink raw reply related	[flat|nested] 125+ messages in thread

* Re: [PATCH v10 05/41] btrfs: check and enable ZONED mode
  2020-11-10 11:26 ` [PATCH v10 05/41] btrfs: check and enable ZONED mode Naohiro Aota
@ 2020-11-18 11:29   ` Anand Jain
  2020-11-27 18:44     ` David Sterba
  0 siblings, 1 reply; 125+ messages in thread
From: Anand Jain @ 2020-11-18 11:29 UTC (permalink / raw)
  To: Naohiro Aota, linux-btrfs, dsterba
  Cc: hare, linux-fsdevel, Jens Axboe, Christoph Hellwig,
	Darrick J. Wong, Johannes Thumshirn, Damien Le Moal, Josef Bacik

On 10/11/20 7:26 pm, Naohiro Aota wrote:
> This commit introduces the function btrfs_check_zoned_mode() to check if
> ZONED flag is enabled on the file system and if the file system consists of
> zoned devices with equal zone size.
> 
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
> Reviewed-by: Josef Bacik <josef@toxicpanda.com>
> ---
>   fs/btrfs/ctree.h       | 11 ++++++
>   fs/btrfs/dev-replace.c |  7 ++++
>   fs/btrfs/disk-io.c     | 11 ++++++
>   fs/btrfs/super.c       |  1 +
>   fs/btrfs/volumes.c     |  5 +++
>   fs/btrfs/zoned.c       | 81 ++++++++++++++++++++++++++++++++++++++++++
>   fs/btrfs/zoned.h       | 26 ++++++++++++++
>   7 files changed, 142 insertions(+)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index aac3d6f4e35b..453f41ca024e 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -948,6 +948,12 @@ struct btrfs_fs_info {
>   	/* Type of exclusive operation running */
>   	unsigned long exclusive_operation;
>   
> +	/* Zone size when in ZONED mode */
> +	union {
> +		u64 zone_size;
> +		u64 zoned;
> +	};
> +
>   #ifdef CONFIG_BTRFS_FS_REF_VERIFY
>   	spinlock_t ref_verify_lock;
>   	struct rb_root block_tree;
> @@ -3595,4 +3601,9 @@ static inline int btrfs_is_testing(struct btrfs_fs_info *fs_info)
>   }
>   #endif
>   
> +static inline bool btrfs_is_zoned(struct btrfs_fs_info *fs_info)
> +{
> +	return fs_info->zoned != 0;
> +}
> +
>   #endif
> diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
> index 6f6d77224c2b..db87f1aa604b 100644
> --- a/fs/btrfs/dev-replace.c
> +++ b/fs/btrfs/dev-replace.c
> @@ -238,6 +238,13 @@ static int btrfs_init_dev_replace_tgtdev(struct btrfs_fs_info *fs_info,
>   		return PTR_ERR(bdev);
>   	}
>   
> +	if (!btrfs_check_device_zone_type(fs_info, bdev)) {
> +		btrfs_err(fs_info,
> +			  "dev-replace: zoned type of target device mismatch with filesystem");
> +		ret = -EINVAL;
> +		goto error;
> +	}
> +
>   	sync_blockdev(bdev);
>   
>   	list_for_each_entry(device, &fs_info->fs_devices->devices, dev_list) {

  I am not sure if it is done in some other patch, but we still have to
  check for

  (model == BLK_ZONED_HA && incompat_zoned)

here, right? What if a zoned device is added to a non-zoned FS through
replace?


> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 764001609a15..e76ac4da208d 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -42,6 +42,7 @@
>   #include "block-group.h"
>   #include "discard.h"
>   #include "space-info.h"
> +#include "zoned.h"
>   
>   #define BTRFS_SUPER_FLAG_SUPP	(BTRFS_HEADER_FLAG_WRITTEN |\
>   				 BTRFS_HEADER_FLAG_RELOC |\
> @@ -2976,6 +2977,8 @@ int __cold open_ctree(struct super_block *sb, struct btrfs_fs_devices *fs_device
>   	if (features & BTRFS_FEATURE_INCOMPAT_SKINNY_METADATA)
>   		btrfs_info(fs_info, "has skinny extents");
>   
> +	fs_info->zoned = features & BTRFS_FEATURE_INCOMPAT_ZONED;
> +
>   	/*
>   	 * flag our filesystem as having big metadata blocks if
>   	 * they are bigger than the page size
> @@ -3130,7 +3133,15 @@ int __cold open_ctree(struct super_block *sb, struct btrfs_fs_devices *fs_device
>   
>   	btrfs_free_extra_devids(fs_devices, 1);
>   
> +	ret = btrfs_check_zoned_mode(fs_info);
> +	if (ret) {
> +		btrfs_err(fs_info, "failed to initialize zoned mode: %d",
> +			  ret);
> +		goto fail_block_groups;
> +	}
> +
>   	ret = btrfs_sysfs_add_fsid(fs_devices);
> +
>   	if (ret) {
>   		btrfs_err(fs_info, "failed to init sysfs fsid interface: %d",
>   				ret);
> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> index ed55014fd1bd..3312fe08168f 100644
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -44,6 +44,7 @@
>   #include "backref.h"
>   #include "space-info.h"
>   #include "sysfs.h"
> +#include "zoned.h"
>   #include "tests/btrfs-tests.h"
>   #include "block-group.h"
>   #include "discard.h"
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index e787bf89f761..10827892c086 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -2518,6 +2518,11 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
>   	if (IS_ERR(bdev))
>   		return PTR_ERR(bdev);
>   
> +	if (!btrfs_check_device_zone_type(fs_info, bdev)) {
> +		ret = -EINVAL;
> +		goto error;
> +	}
> +
>   	if (fs_devices->seeding) {
>   		seeding_dev = 1;
>   		down_write(&sb->s_umount);


Same here too. It can also happen that a zoned device is added to a
non-zoned FS.

Thanks, Anand


> diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
> index b7ffe6670d3a..1223d5b0e411 100644
> --- a/fs/btrfs/zoned.c
> +++ b/fs/btrfs/zoned.c
> @@ -180,3 +180,84 @@ int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
>   
>   	return 0;
>   }
> +
> +int btrfs_check_zoned_mode(struct btrfs_fs_info *fs_info)
> +{
> +	struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
> +	struct btrfs_device *device;
> +	u64 zoned_devices = 0;
> +	u64 nr_devices = 0;
> +	u64 zone_size = 0;
> +	const bool incompat_zoned = btrfs_is_zoned(fs_info);
> +	int ret = 0;
> +
> +	/* Count zoned devices */
> +	list_for_each_entry(device, &fs_devices->devices, dev_list) {
> +		enum blk_zoned_model model;
> +
> +		if (!device->bdev)
> +			continue;
> +
> +		model = bdev_zoned_model(device->bdev);
> +		if (model == BLK_ZONED_HM ||
> +		    (model == BLK_ZONED_HA && incompat_zoned)) {
> +			zoned_devices++;
> +			if (!zone_size) {
> +				zone_size = device->zone_info->zone_size;
> +			} else if (device->zone_info->zone_size != zone_size) {
> +				btrfs_err(fs_info,
> +					  "zoned: unequal block device zone sizes: have %llu found %llu",
> +					  device->zone_info->zone_size,
> +					  zone_size);
> +				ret = -EINVAL;
> +				goto out;
> +			}
> +		}
> +		nr_devices++;
> +	}
> +
> +	if (!zoned_devices && !incompat_zoned)
> +		goto out;
> +
> +	if (!zoned_devices && incompat_zoned) {
> +		/* No zoned block device found on ZONED FS */
> +		btrfs_err(fs_info,
> +			  "zoned: no zoned devices found on a zoned filesystem");
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +
> +	if (zoned_devices && !incompat_zoned) {
> +		btrfs_err(fs_info,
> +			  "zoned: mode not enabled but zoned device found");
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +
> +	if (zoned_devices != nr_devices) {
> +		btrfs_err(fs_info,
> +			  "zoned: cannot mix zoned and regular devices");
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +
> +	/*
> +	 * stripe_size is always aligned to BTRFS_STRIPE_LEN in
> +	 * __btrfs_alloc_chunk(). Since we want stripe_len == zone_size,
> +	 * check the alignment here.
> +	 */
> +	if (!IS_ALIGNED(zone_size, BTRFS_STRIPE_LEN)) {
> +		btrfs_err(fs_info,
> +			  "zoned: zone size not aligned to stripe %u",
> +			  BTRFS_STRIPE_LEN);
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +
> +	fs_info->zone_size = zone_size;
> +
> +	btrfs_info(fs_info, "zoned mode enabled with zone size %llu",
> +		   fs_info->zone_size);
> +out:
> +	return ret;
> +}
> diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h
> index c9e69ff87ab9..bcb1cb99a4f3 100644
> --- a/fs/btrfs/zoned.h
> +++ b/fs/btrfs/zoned.h
> @@ -4,6 +4,7 @@
>   #define BTRFS_ZONED_H
>   
>   #include <linux/types.h>
> +#include <linux/blkdev.h>
>   
>   struct btrfs_zoned_device_info {
>   	/*
> @@ -22,6 +23,7 @@ int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
>   		       struct blk_zone *zone);
>   int btrfs_get_dev_zone_info(struct btrfs_device *device);
>   void btrfs_destroy_dev_zone_info(struct btrfs_device *device);
> +int btrfs_check_zoned_mode(struct btrfs_fs_info *fs_info);
>   #else /* CONFIG_BLK_DEV_ZONED */
>   static inline int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
>   				     struct blk_zone *zone)
> @@ -36,6 +38,15 @@ static inline int btrfs_get_dev_zone_info(struct btrfs_device *device)
>   
>   static inline void btrfs_destroy_dev_zone_info(struct btrfs_device *device) { }
>   
> +static inline int btrfs_check_zoned_mode(struct btrfs_fs_info *fs_info)
> +{
> +	if (!btrfs_is_zoned(fs_info))
> +		return 0;
> +
> +	btrfs_err(fs_info, "Zoned block devices support is not enabled");
> +	return -EOPNOTSUPP;
> +}
> +
>   #endif
>   
>   static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)
> @@ -88,4 +99,19 @@ static inline void btrfs_dev_clear_zone_empty(struct btrfs_device *device,
>   	btrfs_dev_set_empty_zone_bit(device, pos, false);
>   }
>   
> +static inline bool btrfs_check_device_zone_type(struct btrfs_fs_info *fs_info,
> +						struct block_device *bdev)
> +{
> +	u64 zone_size;
> +
> +	if (btrfs_is_zoned(fs_info)) {
> +		zone_size = (u64)bdev_zone_sectors(bdev) << SECTOR_SHIFT;
> +		/* Do not allow non-zoned device */
> +		return bdev_is_zoned(bdev) && fs_info->zone_size == zone_size;
> +	}
> +
> +	/* Do not allow Host Manged zoned device */
> +	return bdev_zoned_model(bdev) != BLK_ZONED_HM;
> +}
> +
>   #endif
> 


^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH v10 06/41] btrfs: introduce max_zone_append_size
  2020-11-10 11:26 ` [PATCH v10 06/41] btrfs: introduce max_zone_append_size Naohiro Aota
@ 2020-11-19  9:23   ` Anand Jain
  2020-11-27 18:47     ` David Sterba
  0 siblings, 1 reply; 125+ messages in thread
From: Anand Jain @ 2020-11-19  9:23 UTC (permalink / raw)
  To: Naohiro Aota, linux-btrfs, dsterba
  Cc: hare, linux-fsdevel, Jens Axboe, Christoph Hellwig,
	Darrick J. Wong, Josef Bacik

On 10/11/20 7:26 pm, Naohiro Aota wrote:
> The zone append write command has a maximum IO size restriction it
> accepts. This is because a zone append write command cannot be split, as
> we ask the device to place the data into a specific target zone and the
> device responds with the actual written location of the data.
> 
> Introduce max_zone_append_size to zone_info and fs_info to track the
> value, so we can limit all I/O to a zoned block device that we want to
> write using the zone append command to the device's limits.
> 
> Reviewed-by: Josef Bacik <josef@toxicpanda.com>
> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
> ---

Looks good, except: what happens when we replace or add a new zoned
device with a different queue_max_zone_append_sectors(queue) value?

Nit: IMHO some parts of patches 4, 5 and 6 could have been in one
patch. It's fine now as they are already at v10 and have Reviewed-by tags.

Thanks, Anand


>   fs/btrfs/ctree.h |  3 +++
>   fs/btrfs/zoned.c | 17 +++++++++++++++--
>   fs/btrfs/zoned.h |  1 +
>   3 files changed, 19 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 453f41ca024e..c70d3fcc62c2 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -954,6 +954,9 @@ struct btrfs_fs_info {
>   		u64 zoned;
>   	};
>   
> +	/* Max size to emit ZONE_APPEND write command */
> +	u64 max_zone_append_size;
> +
>   #ifdef CONFIG_BTRFS_FS_REF_VERIFY
>   	spinlock_t ref_verify_lock;
>   	struct rb_root block_tree;
> diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
> index 1223d5b0e411..2897432eb43c 100644
> --- a/fs/btrfs/zoned.c
> +++ b/fs/btrfs/zoned.c
> @@ -48,6 +48,7 @@ int btrfs_get_dev_zone_info(struct btrfs_device *device)
>   {
>   	struct btrfs_zoned_device_info *zone_info = NULL;
>   	struct block_device *bdev = device->bdev;
> +	struct request_queue *queue = bdev_get_queue(bdev);
>   	sector_t nr_sectors = bdev->bd_part->nr_sects;
>   	sector_t sector = 0;
>   	struct blk_zone *zones = NULL;
> @@ -69,6 +70,8 @@ int btrfs_get_dev_zone_info(struct btrfs_device *device)
>   	ASSERT(is_power_of_2(zone_sectors));
>   	zone_info->zone_size = (u64)zone_sectors << SECTOR_SHIFT;
>   	zone_info->zone_size_shift = ilog2(zone_info->zone_size);
> +	zone_info->max_zone_append_size =
> +		(u64)queue_max_zone_append_sectors(queue) << SECTOR_SHIFT;
>   	zone_info->nr_zones = nr_sectors >> ilog2(bdev_zone_sectors(bdev));
>   	if (!IS_ALIGNED(nr_sectors, zone_sectors))
>   		zone_info->nr_zones++;
> @@ -188,6 +191,7 @@ int btrfs_check_zoned_mode(struct btrfs_fs_info *fs_info)
>   	u64 zoned_devices = 0;
>   	u64 nr_devices = 0;
>   	u64 zone_size = 0;
> +	u64 max_zone_append_size = 0;
>   	const bool incompat_zoned = btrfs_is_zoned(fs_info);
>   	int ret = 0;
>   
> @@ -201,10 +205,13 @@ int btrfs_check_zoned_mode(struct btrfs_fs_info *fs_info)
>   		model = bdev_zoned_model(device->bdev);
>   		if (model == BLK_ZONED_HM ||
>   		    (model == BLK_ZONED_HA && incompat_zoned)) {
> +			struct btrfs_zoned_device_info *zone_info =
> +				device->zone_info;
> +
>   			zoned_devices++;
>   			if (!zone_size) {
> -				zone_size = device->zone_info->zone_size;
> -			} else if (device->zone_info->zone_size != zone_size) {
> +				zone_size = zone_info->zone_size;
> +			} else if (zone_info->zone_size != zone_size) {
>   				btrfs_err(fs_info,
>   					  "zoned: unequal block device zone sizes: have %llu found %llu",
>   					  device->zone_info->zone_size,
> @@ -212,6 +219,11 @@ int btrfs_check_zoned_mode(struct btrfs_fs_info *fs_info)
>   				ret = -EINVAL;
>   				goto out;
>   			}
> +			if (!max_zone_append_size ||
> +			    (zone_info->max_zone_append_size &&
> +			     zone_info->max_zone_append_size < max_zone_append_size))
> +				max_zone_append_size =
> +					zone_info->max_zone_append_size;
>   		}
>   		nr_devices++;
>   	}
> @@ -255,6 +267,7 @@ int btrfs_check_zoned_mode(struct btrfs_fs_info *fs_info)
>   	}
>   
>   	fs_info->zone_size = zone_size;
> +	fs_info->max_zone_append_size = max_zone_append_size;
>   
>   	btrfs_info(fs_info, "zoned mode enabled with zone size %llu",
>   		   fs_info->zone_size);
> diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h
> index bcb1cb99a4f3..52aa6af5d8dc 100644
> --- a/fs/btrfs/zoned.h
> +++ b/fs/btrfs/zoned.h
> @@ -13,6 +13,7 @@ struct btrfs_zoned_device_info {
>   	 */
>   	u64 zone_size;
>   	u8  zone_size_shift;
> +	u64 max_zone_append_size;
>   	u32 nr_zones;
>   	unsigned long *seq_zones;
>   	unsigned long *empty_zones;
> 


^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH v10 07/41] btrfs: disallow space_cache in ZONED mode
  2020-11-10 11:26 ` [PATCH v10 07/41] btrfs: disallow space_cache in ZONED mode Naohiro Aota
@ 2020-11-19 10:42   ` Anand Jain
  2020-11-20  4:08     ` Anand Jain
  0 siblings, 1 reply; 125+ messages in thread
From: Anand Jain @ 2020-11-19 10:42 UTC (permalink / raw)
  To: Naohiro Aota, linux-btrfs, dsterba
  Cc: hare, linux-fsdevel, Jens Axboe, Christoph Hellwig,
	Darrick J. Wong, Josef Bacik



> @@ -985,6 +992,8 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
>   		ret = -EINVAL;
>   
>   	}
> +	if (!ret)
> +		ret = btrfs_check_mountopts_zoned(info);
>   	if (!ret && btrfs_test_opt(info, SPACE_CACHE))
>   		btrfs_info(info, "disk space caching is enabled");
>   	if (!ret && btrfs_test_opt(info, FREE_SPACE_TREE))
> diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
> index 2897432eb43c..d6b8165e2c91 100644
> --- a/fs/btrfs/zoned.c
> +++ b/fs/btrfs/zoned.c
> @@ -274,3 +274,21 @@ int btrfs_check_zoned_mode(struct btrfs_fs_info *fs_info)
>   out:
>   	return ret;
>   }
> +
> +int btrfs_check_mountopts_zoned(struct btrfs_fs_info *info)
> +{
> +	if (!btrfs_is_zoned(info))
> +		return 0;
> +
> +	/*
> +	 * Space cache writing is not COWed. Disable that to avoid write
> +	 * errors in sequential zones.
> +	 */
> +	if (btrfs_test_opt(info, SPACE_CACHE)) {
> +		btrfs_err(info,
> +			  "zoned: space cache v1 is not supported");
> +		return -EINVAL;
> +	}
> +
> +	return 0;
> +}
> diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h
> index 52aa6af5d8dc..81c00a3ed202 100644
> --- a/fs/btrfs/zoned.h
> +++ b/fs/btrfs/zoned.h
> @@ -25,6 +25,7 @@ int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
>   int btrfs_get_dev_zone_info(struct btrfs_device *device);
>   void btrfs_destroy_dev_zone_info(struct btrfs_device *device);
>   int btrfs_check_zoned_mode(struct btrfs_fs_info *fs_info);
> +int btrfs_check_mountopts_zoned(struct btrfs_fs_info *info);
>   #else /* CONFIG_BLK_DEV_ZONED */
>   static inline int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
>   				     struct blk_zone *zone)
> @@ -48,6 +49,11 @@ static inline int btrfs_check_zoned_mode(struct btrfs_fs_info *fs_info)
>   	return -EOPNOTSUPP;
>   }
>   
> +static inline int btrfs_check_mountopts_zoned(struct btrfs_fs_info *info)
> +{
> +	return 0;
> +}
> +
>   #endif
>   
>   static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)
> 

The whole of the above code can be replaced by:

-------------------
@@ -810,8 +810,15 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
                         break;
                 case Opt_space_cache:
                 case Opt_space_cache_version:
                         if (token == Opt_space_cache ||
                             strcmp(args[0].from, "v1") == 0) {
+                               if (btrfs_is_zoned(info)) {
+                                       btrfs_err(info,
+                                       "zoned: space cache v1 is not supported");
+                                       ret = -EINVAL;
+                                       goto out;
+                               }
                                 btrfs_clear_opt(info->mount_opt,
                                                 FREE_SPACE_TREE);
                                 btrfs_set_and_info(info, SPACE_CACHE,
-------------------

Thanks.

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH v10 03/41] btrfs: introduce ZONED feature flag
  2020-11-10 11:26 ` [PATCH v10 03/41] btrfs: introduce ZONED feature flag Naohiro Aota
@ 2020-11-19 21:31   ` David Sterba
  0 siblings, 0 replies; 125+ messages in thread
From: David Sterba @ 2020-11-19 21:31 UTC (permalink / raw)
  To: Naohiro Aota
  Cc: linux-btrfs, dsterba, hare, linux-fsdevel, Jens Axboe,
	Christoph Hellwig, Darrick J. Wong, Damien Le Moal, Anand Jain,
	Johannes Thumshirn

On Tue, Nov 10, 2020 at 08:26:06PM +0900, Naohiro Aota wrote:
> This patch introduces the ZONED incompat flag. The flag indicates that the
> volume management will satisfy the constraints imposed by host-managed
> zoned block devices.
> 
> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
> Reviewed-by: Anand Jain <anand.jain@oracle.com>
> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
>  fs/btrfs/sysfs.c           | 2 ++
>  include/uapi/linux/btrfs.h | 1 +
>  2 files changed, 3 insertions(+)
> 
> diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
> index 279d9262b676..828006020bbd 100644
> --- a/fs/btrfs/sysfs.c
> +++ b/fs/btrfs/sysfs.c
> @@ -263,6 +263,7 @@ BTRFS_FEAT_ATTR_INCOMPAT(no_holes, NO_HOLES);
>  BTRFS_FEAT_ATTR_INCOMPAT(metadata_uuid, METADATA_UUID);
>  BTRFS_FEAT_ATTR_COMPAT_RO(free_space_tree, FREE_SPACE_TREE);
>  BTRFS_FEAT_ATTR_INCOMPAT(raid1c34, RAID1C34);
> +BTRFS_FEAT_ATTR_INCOMPAT(zoned, ZONED);

> +	BTRFS_FEAT_ATTR_PTR(zoned),

As we're going to add zoned support incrementally, we can't advertise
the support in sysfs until it's feature complete. Until then it's going
to be under CONFIG_BTRFS_DEBUG. This has been folded to this patch and
changelog updated.

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH v10 07/41] btrfs: disallow space_cache in ZONED mode
  2020-11-19 10:42   ` Anand Jain
@ 2020-11-20  4:08     ` Anand Jain
  0 siblings, 0 replies; 125+ messages in thread
From: Anand Jain @ 2020-11-20  4:08 UTC (permalink / raw)
  To: Naohiro Aota, linux-btrfs, dsterba
  Cc: hare, linux-fsdevel, Jens Axboe, Christoph Hellwig,
	Darrick J. Wong, Josef Bacik



On 19/11/20 6:42 pm, Anand Jain wrote:
> 
> 
>> @@ -985,6 +992,8 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
>>           ret = -EINVAL;
>>       }
>> +    if (!ret)
>> +        ret = btrfs_check_mountopts_zoned(info);
>>       if (!ret && btrfs_test_opt(info, SPACE_CACHE))
>>           btrfs_info(info, "disk space caching is enabled");
>>       if (!ret && btrfs_test_opt(info, FREE_SPACE_TREE))
>> diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
>> index 2897432eb43c..d6b8165e2c91 100644
>> --- a/fs/btrfs/zoned.c
>> +++ b/fs/btrfs/zoned.c
>> @@ -274,3 +274,21 @@ int btrfs_check_zoned_mode(struct btrfs_fs_info *fs_info)
>>   out:
>>       return ret;
>>   }
>> +
>> +int btrfs_check_mountopts_zoned(struct btrfs_fs_info *info)
>> +{
>> +    if (!btrfs_is_zoned(info))
>> +        return 0;
>> +
>> +    /*
>> +     * Space cache writing is not COWed. Disable that to avoid write
>> +     * errors in sequential zones.
>> +     */
>> +    if (btrfs_test_opt(info, SPACE_CACHE)) {
>> +        btrfs_err(info,
>> +              "zoned: space cache v1 is not supported");
>> +        return -EINVAL;
>> +    }
>> +
>> +    return 0;
>> +}
>> diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h
>> index 52aa6af5d8dc..81c00a3ed202 100644
>> --- a/fs/btrfs/zoned.h
>> +++ b/fs/btrfs/zoned.h
>> @@ -25,6 +25,7 @@ int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
>>   int btrfs_get_dev_zone_info(struct btrfs_device *device);
>>   void btrfs_destroy_dev_zone_info(struct btrfs_device *device);
>>   int btrfs_check_zoned_mode(struct btrfs_fs_info *fs_info);
>> +int btrfs_check_mountopts_zoned(struct btrfs_fs_info *info);
>>   #else /* CONFIG_BLK_DEV_ZONED */
>>   static inline int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
>>                        struct blk_zone *zone)
>> @@ -48,6 +49,11 @@ static inline int btrfs_check_zoned_mode(struct btrfs_fs_info *fs_info)
>>       return -EOPNOTSUPP;
>>   }
>> +static inline int btrfs_check_mountopts_zoned(struct btrfs_fs_info *info)
>> +{
>> +    return 0;
>> +}
>> +
>>   #endif
>>   static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)
>>
> 
> The whole of the above code can be replaced by..
> 



> -------------------
> @@ -810,8 +810,15 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
>                          break;
>                  case Opt_space_cache:
>                  case Opt_space_cache_version:
>                          if (token == Opt_space_cache ||
>                              strcmp(args[0].from, "v1") == 0) {
> +                               if (btrfs_is_zoned(info)) {
> +                                       btrfs_err(info,
> +                                       "zoned: space cache v1 is not supported");
> +                                       ret = -EINVAL;
> +                                       goto out;
> +                               }
>                                  btrfs_clear_opt(info->mount_opt,
>                                                  FREE_SPACE_TREE);
>                                  btrfs_set_and_info(info, SPACE_CACHE,
> -------------------
> 


Later patches add more flag checks to btrfs_check_mountopts_zoned().
I have no preference between this way and adding those checks
individually in btrfs_parse_options(), so you may ignore this.

Thanks.

> Thanks.

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH v10 08/41] btrfs: disallow NODATACOW in ZONED mode
  2020-11-10 11:26 ` [PATCH v10 08/41] btrfs: disallow NODATACOW " Naohiro Aota
@ 2020-11-20  4:17   ` Anand Jain
  2020-11-23 17:21     ` David Sterba
  0 siblings, 1 reply; 125+ messages in thread
From: Anand Jain @ 2020-11-20  4:17 UTC (permalink / raw)
  To: Naohiro Aota, linux-btrfs, dsterba
  Cc: hare, linux-fsdevel, Jens Axboe, Christoph Hellwig,
	Darrick J. Wong, Josef Bacik, Johannes Thumshirn

On 10/11/20 7:26 pm, Naohiro Aota wrote:
> NODATACOW implies overwriting the file data on a device, which is
> impossible in sequential required zones. Disable NODATACOW globally with
> mount option and per-file NODATACOW attribute by masking FS_NOCOW_FL.
> 
> Reviewed-by: Josef Bacik <josef@toxicpanda.com>
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>

Looks good.
  Reviewed-by: Anand Jain <anand.jain@oracle.com>

A nit below.

> ---
>   fs/btrfs/ioctl.c | 13 +++++++++++++
>   fs/btrfs/zoned.c |  5 +++++
>   2 files changed, 18 insertions(+)
> 
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index ab408a23ba32..d13b522e7bb2 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -193,6 +193,15 @@ static int check_fsflags(unsigned int old_flags, unsigned int flags)
>   	return 0;
>   }
>   
> +static int check_fsflags_compatible(struct btrfs_fs_info *fs_info,
> +				    unsigned int flags)
> +{
> +	if (btrfs_is_zoned(fs_info) && (flags & FS_NOCOW_FL))


> +		return -EPERM;

nit:
  Should it be -EINVAL instead? I am not sure. Maybe David can fix it
while integrating.

Thanks.


> +
> +	return 0;
> +}
> +
>   static int btrfs_ioctl_setflags(struct file *file, void __user *arg)
>   {
>   	struct inode *inode = file_inode(file);
> @@ -230,6 +239,10 @@ static int btrfs_ioctl_setflags(struct file *file, void __user *arg)
>   	if (ret)
>   		goto out_unlock;
>   
> +	ret = check_fsflags_compatible(fs_info, fsflags);
> +	if (ret)
> +		goto out_unlock;
> +
>   	binode_flags = binode->flags;
>   	if (fsflags & FS_SYNC_FL)
>   		binode_flags |= BTRFS_INODE_SYNC;
> diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
> index d6b8165e2c91..bd153932606e 100644
> --- a/fs/btrfs/zoned.c
> +++ b/fs/btrfs/zoned.c
> @@ -290,5 +290,10 @@ int btrfs_check_mountopts_zoned(struct btrfs_fs_info *info)
>   		return -EINVAL;
>   	}
>   
> +	if (btrfs_test_opt(info, NODATACOW)) {
> +		btrfs_err(info, "zoned: NODATACOW not supported");
> +		return -EINVAL;
> +	}
> +
>   	return 0;
>   }
> 


^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH v10 09/41] btrfs: disable fallocate in ZONED mode
  2020-11-10 11:26 ` [PATCH v10 09/41] btrfs: disable fallocate " Naohiro Aota
@ 2020-11-20  4:28   ` Anand Jain
  0 siblings, 0 replies; 125+ messages in thread
From: Anand Jain @ 2020-11-20  4:28 UTC (permalink / raw)
  To: Naohiro Aota, linux-btrfs, dsterba
  Cc: hare, linux-fsdevel, Jens Axboe, Christoph Hellwig,
	Darrick J. Wong, Johannes Thumshirn, Josef Bacik

On 10/11/20 7:26 pm, Naohiro Aota wrote:
> fallocate() is implemented by reserving an actual extent instead of making
> a reservation. This can result in exposing the sequential write constraint
> of host-managed zoned block devices to the application, which would break
> the POSIX semantics for the fallocated file.  To avoid this, report
> fallocate() as not supported when in ZONED mode for now.
> 
> In the future, we may be able to implement "in-memory" fallocate() in ZONED
> mode by utilizing space_info->bytes_may_use or so.
> 
> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Looks good.

Reviewed-by: Anand Jain <anand.jain@oracle.com>

Thanks.

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH v10 10/41] btrfs: disallow mixed-bg in ZONED mode
  2020-11-10 11:26 ` [PATCH v10 10/41] btrfs: disallow mixed-bg " Naohiro Aota
@ 2020-11-20  4:32   ` Anand Jain
  0 siblings, 0 replies; 125+ messages in thread
From: Anand Jain @ 2020-11-20  4:32 UTC (permalink / raw)
  To: Naohiro Aota, linux-btrfs, dsterba
  Cc: hare, linux-fsdevel, Jens Axboe, Christoph Hellwig,
	Darrick J. Wong, Josef Bacik

On 10/11/20 7:26 pm, Naohiro Aota wrote:
> Placing both data and metadata in a block group is impossible in ZONED
> mode. For data, we can allocate a space for it and write it immediately
> after the allocation. For metadata, however, we cannot do so, because the
> logical addresses are recorded in other metadata buffers to build up the
> trees. As a result, a data buffer can be placed after a metadata buffer,
> which is not written yet. Writing out the data buffer will break the
> sequential write rule.
> 
> This commit checks for and disallows MIXED_BG with ZONED mode.
> 
> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
> Reviewed-by: Josef Bacik <josef@toxicpanda.com>
> ---


Reviewed-by: Anand Jain <anand.jain@oracle.com>

Thanks.

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH v10 08/41] btrfs: disallow NODATACOW in ZONED mode
  2020-11-20  4:17   ` Anand Jain
@ 2020-11-23 17:21     ` David Sterba
  2020-11-24  3:29       ` Anand Jain
  0 siblings, 1 reply; 125+ messages in thread
From: David Sterba @ 2020-11-23 17:21 UTC (permalink / raw)
  To: Anand Jain
  Cc: Naohiro Aota, linux-btrfs, dsterba, hare, linux-fsdevel,
	Jens Axboe, Christoph Hellwig, Darrick J. Wong, Josef Bacik,
	Johannes Thumshirn

On Fri, Nov 20, 2020 at 12:17:21PM +0800, Anand Jain wrote:
> On 10/11/20 7:26 pm, Naohiro Aota wrote:
> > +				    unsigned int flags)
> > +{
> > +	if (btrfs_is_zoned(fs_info) && (flags & FS_NOCOW_FL))
> 
> 
> > +		return -EPERM;
> 
> nit:
>   Should it be -EINVAL instead? I am not sure. May be David can fix 
> while integrating.

IIRC we've discussed that in some previous iteration. EPERM should be
interpreted as that it's not permitted right now, but otherwise it is a
valid operation/flag. The constraint is the zoned device.

As an example: deleting default subvolume is not permitted (EPERM), but
otherwise subvolume deletion is a valid operation.

So, EINVAL is for invalid combination of parameters or a request for
something that does not make sense at all.

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH v10 11/41] btrfs: implement log-structured superblock for ZONED mode
  2020-11-10 11:26 ` [PATCH v10 11/41] btrfs: implement log-structured superblock for " Naohiro Aota
  2020-11-11  1:34   ` kernel test robot
  2020-11-11  2:43   ` kernel test robot
@ 2020-11-23 17:46   ` David Sterba
  2020-11-24  9:30     ` Johannes Thumshirn
  2020-11-24  6:46   ` Anand Jain
  3 siblings, 1 reply; 125+ messages in thread
From: David Sterba @ 2020-11-23 17:46 UTC (permalink / raw)
  To: Naohiro Aota
  Cc: linux-btrfs, dsterba, hare, linux-fsdevel, Jens Axboe,
	Christoph Hellwig, Darrick J. Wong

On Tue, Nov 10, 2020 at 08:26:14PM +0900, Naohiro Aota wrote:
> The superblock (and its copies) is the only data structure in btrfs which
> has a fixed location on a device. Since we cannot overwrite in a sequential
> write required zone, we cannot place a superblock in such a zone. One easy
> solution is limiting the superblock and its copies to conventional zones.
> However, this method has two downsides. One is a reduced number of
> superblock copies: the location of the second copy is 256GB, which is in a
> sequential write required zone on typical devices on the market today, so
> the number of superblock copies would be limited to two. The second
> downside is that we could not support devices which have no conventional
> zones at all.
> 
> To solve these two problems, we employ superblock log writing. It uses two
> zones as a circular buffer to write updated superblocks. Once the first
> zone is filled up, we start writing into the second zone. Then, when both
> zones are filled up, and before we start writing to the first zone again,
> we reset the first zone.
> 
> We can determine the position of the latest superblock by reading the write
> pointer information from the device. One corner case is when both zones
> are full. For this situation, we read out the last superblock of each
> zone and compare them to determine which zone is older.
> 
> The following zones are reserved as the circular buffer on ZONED btrfs.
> 
> - The primary superblock: zones 0 and 1
> - The first copy: zones 16 and 17
> - The second copy: zones 1024 or zone at 256GB which is minimum, and next
>   to it
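
The corner case in the quoted description (both log zones full) can be
sketched in userspace C. This is an illustrative sketch only, not the
kernel code; the struct layout and the helper name are invented for the
example:

```c
#include <stdint.h>
#include <stdbool.h>

/* Simplified view of one superblock log zone: its write pointer offset
 * within the zone and the generation of the last superblock written. */
struct sb_log_zone {
	uint64_t wp;		/* write pointer, bytes from zone start */
	uint64_t capacity;	/* writable bytes in the zone */
	uint64_t last_gen;	/* generation of the last superblock */
};

/* Return the index (0 or 1) of the zone holding the latest superblock.
 * - If both zones are full (the corner case above), compare the
 *   generations of the last superblock in each zone.
 * - Otherwise, the zone currently being written holds the latest copy. */
static int sb_log_latest_zone(const struct sb_log_zone *z0,
			      const struct sb_log_zone *z1)
{
	bool full0 = z0->wp == z0->capacity;
	bool full1 = z1->wp == z1->capacity;

	if (full0 && full1)		/* corner case: compare last sbs */
		return z0->last_gen > z1->last_gen ? 0 : 1;
	if (full0)			/* writing continued into zone 1 */
		return 1;
	if (z0->wp == 0 && full1)	/* zone 0 was just reset */
		return 1;
	return 0;			/* still writing zone 0 */
}
```

The same decision procedure works for any of the three reserved zone
pairs, since each pair forms an independent circular buffer.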

I was thinking about that, again. We need a specification. The above is
too vague.

- supported zone sizes
  eg. if device has 256M, how does it work? I think we can support
  zones from some range (256M-1G), where filling the zone will start
  filing the other zone, leaving the remaining space empty if needed,
  effectively reserving the logical range [0..2G] for superblock

- related to the above, is it necessary to fill the whole zone?
  if both zones are filled, assuming 1G zone size, do we really expect
  the user to wait until 2G of data are read?
  with average reading speed 150MB/s, reading 2G will take about 13
  seconds, just to find the latest copy of the superblock(!)

- what are exact offsets of the superblocks
  primary (64K), ie. not from the beginning
  as partitioning is not supported, nor bootloaders, we don't need to
  worry about overwriting them

- what is an application supposed to do when there's a garbage after a
  sequence of valid superblocks (all zeros can be considered a valid
  termination block)

The idea is to provide enough information for a 3rd party tool to read
the superblock (blkid, progs) and decouple the format from current
hardware capabilities. If the zones are going to be large in the future
we might consider allowing further flexibility, or fix the current zone
maximum to 1G and in the future add a separate incompat bit that would
extend the maximum to say 10G.

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH v10 08/41] btrfs: disallow NODATACOW in ZONED mode
  2020-11-23 17:21     ` David Sterba
@ 2020-11-24  3:29       ` Anand Jain
  0 siblings, 0 replies; 125+ messages in thread
From: Anand Jain @ 2020-11-24  3:29 UTC (permalink / raw)
  To: dsterba, Naohiro Aota, linux-btrfs, dsterba, hare, linux-fsdevel,
	Jens Axboe, Christoph Hellwig, Darrick J. Wong, Josef Bacik,
	Johannes Thumshirn

On 24/11/20 1:21 am, David Sterba wrote:
> On Fri, Nov 20, 2020 at 12:17:21PM +0800, Anand Jain wrote:
>> On 10/11/20 7:26 pm, Naohiro Aota wrote:
>>> +				    unsigned int flags)
>>> +{
>>> +	if (btrfs_is_zoned(fs_info) && (flags & FS_NOCOW_FL))
>>
>>
>>> +		return -EPERM;
>>
>> nit:
>>    Should it be -EINVAL instead? I am not sure. May be David can fix
>> while integrating.
> 
> IIRC we've discussed that in some previous iteration. EPERM should be
> interpreted as that it's not permitted right now, but otherwise it is a
> valid operation/flag. The constraint is the zoned device.
> 
> As an example: deleting default subvolume is not permitted (EPERM), but
> otherwise subvolume deletion is a valid operation.
> 
> So, EINVAL is for invalid combination of parameters or a request for
> something that does not make sense at all.
> 

Ok. Thanks.

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH v10 11/41] btrfs: implement log-structured superblock for ZONED mode
  2020-11-10 11:26 ` [PATCH v10 11/41] btrfs: implement log-structured superblock for " Naohiro Aota
                     ` (2 preceding siblings ...)
  2020-11-23 17:46   ` David Sterba
@ 2020-11-24  6:46   ` Anand Jain
  2020-11-24  7:16     ` Hannes Reinecke
  3 siblings, 1 reply; 125+ messages in thread
From: Anand Jain @ 2020-11-24  6:46 UTC (permalink / raw)
  To: Naohiro Aota, linux-btrfs, dsterba
  Cc: hare, linux-fsdevel, Jens Axboe, Christoph Hellwig, Darrick J. Wong

On 10/11/20 7:26 pm, Naohiro Aota wrote:
> Superblock (and its copies) is the only data structure in btrfs which has a
> fixed location on a device. Since we cannot overwrite in a sequential write
> required zone, we cannot place superblock in the zone. One easy solution is
> limiting superblock and copies to be placed only in conventional zones.
> However, this method has two downsides: one is reduced number of superblock
> copies. The location of the second copy of superblock is 256GB, which is in
> a sequential write required zone on typical devices in the market today.
> So, the number of superblock and copies is limited to be two.  Second
> downside is that we cannot support devices which have no conventional zones
> at all.
> 


> To solve these two problems, we employ superblock log writing. It uses two
> zones as a circular buffer to write updated superblocks. Once the first
> zone is filled up, start writing into the second buffer. Then, when the
> both zones are filled up and before start writing to the first zone again,
> it reset the first zone.
> 
> We can determine the position of the latest superblock by reading write
> pointer information from a device. One corner case is when the both zones
> are full. For this situation, we read out the last superblock of each
> zone, and compare them to determine which zone is older.
> 
> The following zones are reserved as the circular buffer on ZONED btrfs.
> 
> - The primary superblock: zones 0 and 1
> - The first copy: zones 16 and 17
> - The second copy: zones 1024 or zone at 256GB which is minimum, and next
>    to it

The superblock log approach needs a non-deterministic and inconsistent
number of blocks to be read to find copy #0. And, to store 4K of
superblock, we are reserving a lot more space. But I don't know any better
way. I am just checking with you...

At the time of mkfs, is it possible to format the block device to
add conventional zones as needed to support our sb LBAs?
  OR
For superblock zones why not reset the write pointer before the
transaction commit?

Thanks.


> If these reserved zones are conventional, superblock is written fixed at
> the start of the zone without logging.


> 
> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>




^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH v10 11/41] btrfs: implement log-structured superblock for ZONED mode
  2020-11-24  6:46   ` Anand Jain
@ 2020-11-24  7:16     ` Hannes Reinecke
  0 siblings, 0 replies; 125+ messages in thread
From: Hannes Reinecke @ 2020-11-24  7:16 UTC (permalink / raw)
  To: Anand Jain, Naohiro Aota, linux-btrfs, dsterba
  Cc: hare, linux-fsdevel, Jens Axboe, Christoph Hellwig, Darrick J. Wong

On 11/24/20 7:46 AM, Anand Jain wrote:
> On 10/11/20 7:26 pm, Naohiro Aota wrote:
>> Superblock (and its copies) is the only data structure in btrfs which 
>> has a
>> fixed location on a device. Since we cannot overwrite in a sequential 
>> write
>> required zone, we cannot place superblock in the zone. One easy 
>> solution is
>> limiting superblock and copies to be placed only in conventional zones.
>> However, this method has two downsides: one is reduced number of 
>> superblock
>> copies. The location of the second copy of superblock is 256GB, which 
>> is in
>> a sequential write required zone on typical devices in the market today.
>> So, the number of superblock and copies is limited to be two.  Second
>> downside is that we cannot support devices which have no conventional 
>> zones
>> at all.
>>
> 
> 
>> To solve these two problems, we employ superblock log writing. It uses 
>> two
>> zones as a circular buffer to write updated superblocks. Once the first
>> zone is filled up, start writing into the second buffer. Then, when the
>> both zones are filled up and before start writing to the first zone 
>> again,
>> it reset the first zone.
>>
>> We can determine the position of the latest superblock by reading write
>> pointer information from a device. One corner case is when the both zones
>> are full. For this situation, we read out the last superblock of each
>> zone, and compare them to determine which zone is older.
>>
>> The following zones are reserved as the circular buffer on ZONED btrfs.
>>
>> - The primary superblock: zones 0 and 1
>> - The first copy: zones 16 and 17
>> - The second copy: zones 1024 or zone at 256GB which is minimum, and next
>>    to it
> 
> Superblock log approach needs a non-deterministic and inconsistent
> number of blocks to be read to find copy #0. And, to use 4K bytes
> we are reserving a lot more space. But I don't know any better way.
> I am just checking with you...
> 
> At the time of mkfs, is it possible to format the block device to
> add conventional zones as needed to support our sb LBAs?

No. The number of conventional zones (if any) is a drive characteristic, 
and one cannot assume that the number can be modified.

>   OR
> For superblock zones why not reset the write pointer before the
> transaction commit?
> 
A write pointer reset is equivalent to clearing the contents of the 
zone, so we would lose the previous information there.

HTH.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH v10 11/41] btrfs: implement log-structured superblock for ZONED mode
  2020-11-23 17:46   ` David Sterba
@ 2020-11-24  9:30     ` Johannes Thumshirn
  0 siblings, 0 replies; 125+ messages in thread
From: Johannes Thumshirn @ 2020-11-24  9:30 UTC (permalink / raw)
  To: dsterba, Naohiro Aota
  Cc: linux-btrfs, dsterba, hare, linux-fsdevel, Jens Axboe, hch,
	Darrick J. Wong

On 23/11/2020 18:49, David Sterba wrote:
> On Tue, Nov 10, 2020 at 08:26:14PM +0900, Naohiro Aota wrote:
>> Superblock (and its copies) is the only data structure in btrfs which has a
>> fixed location on a device. Since we cannot overwrite in a sequential write
>> required zone, we cannot place superblock in the zone. One easy solution is
>> limiting superblock and copies to be placed only in conventional zones.
>> However, this method has two downsides: one is reduced number of superblock
>> copies. The location of the second copy of superblock is 256GB, which is in
>> a sequential write required zone on typical devices in the market today.
>> So, the number of superblock and copies is limited to be two.  Second
>> downside is that we cannot support devices which have no conventional zones
>> at all.
>>
>> To solve these two problems, we employ superblock log writing. It uses two
>> zones as a circular buffer to write updated superblocks. Once the first
>> zone is filled up, start writing into the second buffer. Then, when the
>> both zones are filled up and before start writing to the first zone again,
>> it reset the first zone.
>>
>> We can determine the position of the latest superblock by reading write
>> pointer information from a device. One corner case is when the both zones
>> are full. For this situation, we read out the last superblock of each
>> zone, and compare them to determine which zone is older.
>>
>> The following zones are reserved as the circular buffer on ZONED btrfs.
>>
>> - The primary superblock: zones 0 and 1
>> - The first copy: zones 16 and 17
>> - The second copy: zones 1024 or zone at 256GB which is minimum, and next
>>   to it
> 
> I was thinking about that, again. We need a specification. The above is
> too vague.
> 
> - supported zone sizes
>   eg. if device has 256M, how does it work? I think we can support
>   zones from some range (256M-1G), where filling the zone will start
>   filing the other zone, leaving the remaining space empty if needed,
>   effectively reserving the logical range [0..2G] for superblock
> 
> - related to the above, is it necessary to fill the whole zone?
>   if both zones are filled, assuming 1G zone size, do we really expect
>   the user to wait until 2G of data are read?
>   with average reading speed 150MB/s, reading 2G will take about 13
>   seconds, just to find the latest copy of the superblock(!)
> 
> - what are exact offsets of the superblocks
>   primary (64K), ie. not from the beginning
>   as partitioning is not supported, nor bootloaders, we don't need to
>   worry about overwriting them
> 
> - what is an application supposed to do when there's a garbage after a
>   sequence of valid superblocks (all zeros can be considered a valid
>   termination block)
> 
> The idea is to provide enough information for a 3rd party tool to read
> the superblock (blkid, progs) and decouple the format from current
> hardware capabilities. If the zones are going to be large in the future
> we might consider allowing further flexibility, or fix the current zone
> maximum to 1G and in the future add a separate incompat bit that would
> extend the maximum to say 10G.
> 

We don't need to do that. All we need to do to find the valid superblock
is a report zones call: get the write pointer and then read from 
write-pointer - sizeof(struct btrfs_super_block). There is no need to scan
a whole zone. The last thing that was written will be right before the
write pointer.
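
In other words, the lookup is constant-time offset arithmetic rather than
a scan. A hedged sketch of that arithmetic (the helper name is invented;
4096 matches BTRFS_SUPER_INFO_SIZE, the granularity at which btrfs writes
superblocks):

```c
#include <stdint.h>

#define BTRFS_SUPER_INFO_SIZE 4096ULL	/* superblock write granularity */

/* Byte offset (from the start of the device) of the most recently
 * written superblock in a log zone, given the zone start and the write
 * pointer reported by a report zones call. Returns (uint64_t)-1 if
 * nothing has been written to the zone yet. */
static uint64_t sb_log_read_offset(uint64_t zone_start, uint64_t write_pointer)
{
	if (write_pointer <= zone_start)	/* empty zone: no superblock */
		return (uint64_t)-1;
	return write_pointer - BTRFS_SUPER_INFO_SIZE;
}
```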

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH v10 02/41] iomap: support REQ_OP_ZONE_APPEND
  2020-11-10 18:55   ` Darrick J. Wong
  2020-11-10 19:01     ` Darrick J. Wong
@ 2020-11-24 11:29     ` Christoph Hellwig
  1 sibling, 0 replies; 125+ messages in thread
From: Christoph Hellwig @ 2020-11-24 11:29 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Naohiro Aota, linux-btrfs, dsterba, hare, linux-fsdevel,
	Jens Axboe, Christoph Hellwig

On Tue, Nov 10, 2020 at 10:55:06AM -0800, Darrick J. Wong wrote:
> When we're wanting to use a ZONE_APPEND command, the @iomap structure
> has to have IOMAP_F_ZONE_APPEND set in iomap->flags, iomap->type is set
> to IOMAP_MAPPED, but what should iomap->addr be set to?
> 
> I gather from what I see in zonefs and the relevant NVME proposal that
> iomap->addr should be set to the (byte) address of the zone we want to
> append to?  And if we do that, then bio->bi_iter.bi_sector will be set
> to sector address of iomap->addr, right?

Yes.

> Then when the IO completes, the block layer sets bio->bi_iter.bi_sector
> to wherever the drive told it that it actually wrote the bio, right?

Yes.

> If that's true, then that implies that need_zeroout must always be false
> for an append operation, right?  Does that also mean that the directio
> request has to be aligned to an fs block and not just the sector size?

I think so, yes.

> Can userspace send a directio append that crosses a zone boundary?  If
> so, what happens if a direct append to a lower address fails but a
> direct append to a higher address succeeds?

Userspace doesn't know about zone boundaries.  It can send I/O larger
than a zone, but the file system has to split it into multiple I/Os,
just like when it has to cross an AG boundary in XFS.
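
The boundary split described here is simple interval arithmetic: each
sub-I/O may extend at most to the next zone boundary. A hedged sketch
(the helper name is invented; zone sizes on zoned block devices are
powers of two):

```c
#include <stdint.h>

/* Length of the first sub-I/O for a write starting at 'pos' with 'len'
 * bytes remaining, so that no sub-I/O crosses a zone boundary.
 * 'zone_size' must be a power of two. The caller loops, advancing
 * pos and len, until len reaches zero. */
static uint64_t zone_split_len(uint64_t pos, uint64_t len, uint64_t zone_size)
{
	uint64_t to_boundary = zone_size - (pos & (zone_size - 1));

	return len < to_boundary ? len : to_boundary;
}
```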

> I'm also vaguely wondering how to communicate the write location back to
> the filesystem when the bio completes?  btrfs handles the bio completion
> completely so it doesn't have a problem, but for other filesystems
> (cough future xfs cough) either we'd have to add a new callback for
> append operations; or I guess everyone could hook the bio endio.
> 
> Admittedly that's not really your problem, and for all I know hch is
> already working on this.

I think any non-trivial file system needs to override the bio completion
handler for writes anyway, so this seems reasonable.  It might be worth
documenting, though.

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH v10 12/41] btrfs: implement zoned chunk allocator
  2020-11-10 11:26 ` [PATCH v10 12/41] btrfs: implement zoned chunk allocator Naohiro Aota
@ 2020-11-24 11:36   ` Anand Jain
  2020-11-25  1:57     ` Naohiro Aota
  2020-12-09  5:27   ` Anand Jain
  1 sibling, 1 reply; 125+ messages in thread
From: Anand Jain @ 2020-11-24 11:36 UTC (permalink / raw)
  To: Naohiro Aota, linux-btrfs, dsterba
  Cc: hare, linux-fsdevel, Jens Axboe, Christoph Hellwig, Darrick J. Wong

On 10/11/20 7:26 pm, Naohiro Aota wrote:
> This commit implements a zoned chunk/dev_extent allocator. The zoned
> allocator aligns the device extents to zone boundaries, so that a zone
> reset affects only the device extent and does not change the state of
> blocks in the neighbor device extents.
> 
> Also, it checks that a region allocation is not overlapping any of the
> super block zones, and ensures the region is empty.
> 
> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>

Looks good.

Chunks and stripes are aligned to the zone_size. I guess zone_size won't
change after the block device has been formatted with it? For testing,
what if the device image is dumped onto another zoned device with a
different zone_size?

A small nit is below.

> +static void init_alloc_chunk_ctl_policy_zoned(
> +				      struct btrfs_fs_devices *fs_devices,
> +				      struct alloc_chunk_ctl *ctl)
> +{
> +	u64 zone_size = fs_devices->fs_info->zone_size;
> +	u64 limit;
> +	int min_num_stripes = ctl->devs_min * ctl->dev_stripes;
> +	int min_data_stripes = (min_num_stripes - ctl->nparity) / ctl->ncopies;
> +	u64 min_chunk_size = min_data_stripes * zone_size;
> +	u64 type = ctl->type;
> +
> +	ctl->max_stripe_size = zone_size;
> +	if (type & BTRFS_BLOCK_GROUP_DATA) {
> +		ctl->max_chunk_size = round_down(BTRFS_MAX_DATA_CHUNK_SIZE,
> +						 zone_size);
> +	} else if (type & BTRFS_BLOCK_GROUP_METADATA) {
> +		ctl->max_chunk_size = ctl->max_stripe_size;
> +	} else if (type & BTRFS_BLOCK_GROUP_SYSTEM) {
> +		ctl->max_chunk_size = 2 * ctl->max_stripe_size;
> +		ctl->devs_max = min_t(int, ctl->devs_max,
> +				      BTRFS_MAX_DEVS_SYS_CHUNK);
> +	}
> +


> +	/* We don't want a chunk larger than 10% of writable space */
> +	limit = max(round_down(div_factor(fs_devices->total_rw_bytes, 1),

  What's the purpose of div_factor here?

Thanks.


^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH v10 12/41] btrfs: implement zoned chunk allocator
  2020-11-24 11:36   ` Anand Jain
@ 2020-11-25  1:57     ` Naohiro Aota
  2020-11-25  7:17       ` Anand Jain
  2020-11-25  9:59       ` Graham Cobb
  0 siblings, 2 replies; 125+ messages in thread
From: Naohiro Aota @ 2020-11-25  1:57 UTC (permalink / raw)
  To: Anand Jain
  Cc: linux-btrfs, dsterba, hare, linux-fsdevel, Jens Axboe,
	Christoph Hellwig, Darrick J. Wong

On Tue, Nov 24, 2020 at 07:36:18PM +0800, Anand Jain wrote:
>On 10/11/20 7:26 pm, Naohiro Aota wrote:
>>This commit implements a zoned chunk/dev_extent allocator. The zoned
>>allocator aligns the device extents to zone boundaries, so that a zone
>>reset affects only the device extent and does not change the state of
>>blocks in the neighbor device extents.
>>
>>Also, it checks that a region allocation is not overlapping any of the
>>super block zones, and ensures the region is empty.
>>
>>Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
>
>Looks good.
>
>Chunks and stripes are aligned to the zone_size. I guess zone_size won't
>change after the block device has been formatted with it? For testing,
>what if the device image is dumped onto another zoned device with a
>different zone_size?

Zone size is a drive characteristic, so it never changes on the same device.

Dump/restore on another device with a different zone_size should be banned,
because we cannot ensure device extents are aligned to zone boundaries.

>
>A small nit is below.
>
>>+static void init_alloc_chunk_ctl_policy_zoned(
>>+				      struct btrfs_fs_devices *fs_devices,
>>+				      struct alloc_chunk_ctl *ctl)
>>+{
>>+	u64 zone_size = fs_devices->fs_info->zone_size;
>>+	u64 limit;
>>+	int min_num_stripes = ctl->devs_min * ctl->dev_stripes;
>>+	int min_data_stripes = (min_num_stripes - ctl->nparity) / ctl->ncopies;
>>+	u64 min_chunk_size = min_data_stripes * zone_size;
>>+	u64 type = ctl->type;
>>+
>>+	ctl->max_stripe_size = zone_size;
>>+	if (type & BTRFS_BLOCK_GROUP_DATA) {
>>+		ctl->max_chunk_size = round_down(BTRFS_MAX_DATA_CHUNK_SIZE,
>>+						 zone_size);
>>+	} else if (type & BTRFS_BLOCK_GROUP_METADATA) {
>>+		ctl->max_chunk_size = ctl->max_stripe_size;
>>+	} else if (type & BTRFS_BLOCK_GROUP_SYSTEM) {
>>+		ctl->max_chunk_size = 2 * ctl->max_stripe_size;
>>+		ctl->devs_max = min_t(int, ctl->devs_max,
>>+				      BTRFS_MAX_DEVS_SYS_CHUNK);
>>+	}
>>+
>
>
>>+	/* We don't want a chunk larger than 10% of writable space */
>>+	limit = max(round_down(div_factor(fs_devices->total_rw_bytes, 1),
>
> What's the purpose of dev_factor here?

This one follows the same limitation as in the regular allocator
(init_alloc_chunk_ctl_policy_regular).
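
For context, btrfs's div_factor() scales a value by tenths, so the
div_factor(total_rw_bytes, 1) in the quoted hunk implements the "10% of
writable space" cap from the comment. A userspace restatement of that
arithmetic (mirroring the kernel helper's behavior as of this series):

```c
#include <stdint.h>

/* Mirrors the arithmetic of btrfs's div_factor(): scale 'num' by
 * 'factor' tenths. div_factor(num, 1) is therefore num / 10, i.e. the
 * 10%-of-writable-space cap used by the chunk allocator. */
static uint64_t div_factor(uint64_t num, int factor)
{
	if (factor == 10)
		return num;
	num *= factor;
	return num / 10;
}
```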

>
>Thanks.
>

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH v10 12/41] btrfs: implement zoned chunk allocator
  2020-11-25  1:57     ` Naohiro Aota
@ 2020-11-25  7:17       ` Anand Jain
  2020-11-25 11:48         ` Naohiro Aota
  2020-11-25  9:59       ` Graham Cobb
  1 sibling, 1 reply; 125+ messages in thread
From: Anand Jain @ 2020-11-25  7:17 UTC (permalink / raw)
  To: Naohiro Aota
  Cc: linux-btrfs, dsterba, hare, linux-fsdevel, Jens Axboe,
	Christoph Hellwig, Darrick J. Wong



On 25/11/20 9:57 am, Naohiro Aota wrote:
> On Tue, Nov 24, 2020 at 07:36:18PM +0800, Anand Jain wrote:
>> On 10/11/20 7:26 pm, Naohiro Aota wrote:
>>> This commit implements a zoned chunk/dev_extent allocator. The zoned
>>> allocator aligns the device extents to zone boundaries, so that a zone
>>> reset affects only the device extent and does not change the state of
>>> blocks in the neighbor device extents.
>>>
>>> Also, it checks that a region allocation is not overlapping any of the
>>> super block zones, and ensures the region is empty.
>>>
>>> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
>>
>> Looks good.
>>
>> Chunks and stripes are aligned to the zone_size. I guess zone_size won't
>> change after the block device has been formatted with it? For testing,
>> what if the device image is dumped onto another zoned device with a
>> different zone_size?
> 
> Zone size is a drive characteristic, so it never change on the same device.
> 
> Dump/restore on another device with a different zone_size should be banned,
> because we cannot ensure device extents are aligned to zone boundaries.

Fair enough. Do we have any checks to fail such a mount? Sorry if I have 
missed it somewhere in the patch.
Thanks.

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH v10 12/41] btrfs: implement zoned chunk allocator
  2020-11-25  1:57     ` Naohiro Aota
  2020-11-25  7:17       ` Anand Jain
@ 2020-11-25  9:59       ` Graham Cobb
  2020-11-25 11:50         ` Naohiro Aota
  1 sibling, 1 reply; 125+ messages in thread
From: Graham Cobb @ 2020-11-25  9:59 UTC (permalink / raw)
  To: Naohiro Aota, Anand Jain
  Cc: linux-btrfs, dsterba, hare, linux-fsdevel, Jens Axboe,
	Christoph Hellwig, Darrick J. Wong

On 25/11/2020 01:57, Naohiro Aota wrote:
> On Tue, Nov 24, 2020 at 07:36:18PM +0800, Anand Jain wrote:
>> On 10/11/20 7:26 pm, Naohiro Aota wrote:
>>> This commit implements a zoned chunk/dev_extent allocator. The zoned
>>> allocator aligns the device extents to zone boundaries, so that a zone
>>> reset affects only the device extent and does not change the state of
>>> blocks in the neighbor device extents.
>>>
>>> Also, it checks that a region allocation is not overlapping any of the
>>> super block zones, and ensures the region is empty.
>>>
>>> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
>>
>> Looks good.
>>
>> Chunks and stripes are aligned to the zone_size. I guess zone_size won't
>> change after the block device has been formatted with it? For testing,
>> what if the device image is dumped onto another zoned device with a
>> different zone_size?
> 
> Zone size is a drive characteristic, so it never change on the same device.
> 
> Dump/restore on another device with a different zone_size should be banned,
> because we cannot ensure device extents are aligned to zone boundaries.

Does this mean 'btrfs replace' is banned as well? Or is it allowed to a
similar-enough device? What about 'add' followed by 'remove'?

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH v10 12/41] btrfs: implement zoned chunk allocator
  2020-11-25  7:17       ` Anand Jain
@ 2020-11-25 11:48         ` Naohiro Aota
  0 siblings, 0 replies; 125+ messages in thread
From: Naohiro Aota @ 2020-11-25 11:48 UTC (permalink / raw)
  To: Anand Jain
  Cc: linux-btrfs, dsterba, hare, linux-fsdevel, Jens Axboe,
	Christoph Hellwig, Darrick J. Wong

On Wed, Nov 25, 2020 at 03:17:42PM +0800, Anand Jain wrote:
>
>
>On 25/11/20 9:57 am, Naohiro Aota wrote:
>>On Tue, Nov 24, 2020 at 07:36:18PM +0800, Anand Jain wrote:
>>>On 10/11/20 7:26 pm, Naohiro Aota wrote:
>>>>This commit implements a zoned chunk/dev_extent allocator. The zoned
>>>>allocator aligns the device extents to zone boundaries, so that a zone
>>>>reset affects only the device extent and does not change the state of
>>>>blocks in the neighbor device extents.
>>>>
>>>>Also, it checks that a region allocation is not overlapping any of the
>>>>super block zones, and ensures the region is empty.
>>>>
>>>>Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
>>>
>>>Looks good.
>>>
>>>Chunks and stripes are aligned to the zone_size. I guess zone_size won't
>>>change after the block device has been formatted with it? For testing,
>>>what if the device image is dumped onto another zoned device with a
>>>different zone_size?
>>
>>Zone size is a drive characteristic, so it never change on the same device.
>>
>>Dump/restore on another device with a different zone_size should be banned,
>>because we cannot ensure device extents are aligned to zone boundaries.
>
>Fair enough. Do we have any checks to fail such a mount? Sorry if I have
>missed it somewhere in the patch.
>Thanks.

We have a check in verify_one_dev_extent() to confirm that a device
extent's position and size are aligned to zone size (patch 13).
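For illustration, the alignment test described here boils down to a power-of-two mask check, the same idea as the kernel's IS_ALIGNED() macro (this is a userspace sketch, not the btrfs code itself):

```c
#include <assert.h>
#include <stdint.h>

/* Userspace sketch of the alignment test verify_one_dev_extent()
 * performs: with a power-of-two zone size, "aligned" reduces to a
 * simple mask check, like the kernel's IS_ALIGNED() macro. */
static int is_aligned_u64(uint64_t x, uint64_t align)
{
	return (x & (align - 1)) == 0;
}
```

Both the device extent's physical offset and its length are checked this way, so an extent can never straddle a zone boundary.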

* Re: [PATCH v10 12/41] btrfs: implement zoned chunk allocator
  2020-11-25  9:59       ` Graham Cobb
@ 2020-11-25 11:50         ` Naohiro Aota
  0 siblings, 0 replies; 125+ messages in thread
From: Naohiro Aota @ 2020-11-25 11:50 UTC (permalink / raw)
  To: Graham Cobb
  Cc: Anand Jain, linux-btrfs, dsterba, hare, linux-fsdevel,
	Jens Axboe, Christoph Hellwig, Darrick J. Wong

On Wed, Nov 25, 2020 at 09:59:40AM +0000, Graham Cobb wrote:
>On 25/11/2020 01:57, Naohiro Aota wrote:
>> On Tue, Nov 24, 2020 at 07:36:18PM +0800, Anand Jain wrote:
>>> On 10/11/20 7:26 pm, Naohiro Aota wrote:
>>>> This commit implements a zoned chunk/dev_extent allocator. The zoned
>>>> allocator aligns the device extents to zone boundaries, so that a zone
>>>> reset affects only the device extent and does not change the state of
>>>> blocks in the neighbor device extents.
>>>>
>>>> Also, it checks that a region allocation is not overlapping any of the
>>>> super block zones, and ensures the region is empty.
>>>>
>>>> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
>>>
>>> Looks good.
>>>
>>> Chunks and stripes are aligned to the zone_size. I guess zone_size won't
>>> change after the block device has been formatted with it? For testing,
>>> what if the device image is dumped onto another zoned device with a
>>> different zone_size?
>>
>> Zone size is a drive characteristic, so it never changes on the same device.
>>
>> Dump/restore on another device with a different zone_size should be banned,
>> because we cannot ensure device extents are aligned to zone boundaries.
>
>Does this mean 'btrfs replace' is banned as well? Or is it allowed to a
>similar-enough device? What about 'add' followed by 'remove'?

Replacing is allowed if the zone size is the same, and the same applies to
adding a disk. This restriction is checked in btrfs_init_new_device() (patch 5).

* Re: [PATCH v10 04/41] btrfs: get zone information of zoned block devices
  2020-11-10 11:26 ` [PATCH v10 04/41] btrfs: get zone information of zoned block devices Naohiro Aota
  2020-11-12  6:57   ` Anand Jain
@ 2020-11-25 21:47   ` David Sterba
  2020-11-25 22:07     ` David Sterba
  2020-11-25 23:50     ` Damien Le Moal
  2020-11-25 22:16   ` David Sterba
  2 siblings, 2 replies; 125+ messages in thread
From: David Sterba @ 2020-11-25 21:47 UTC (permalink / raw)
  To: Naohiro Aota
  Cc: linux-btrfs, dsterba, hare, linux-fsdevel, Jens Axboe,
	Christoph Hellwig, Darrick J. Wong, Damien Le Moal, Josef Bacik

On Tue, Nov 10, 2020 at 08:26:07PM +0900, Naohiro Aota wrote:
> +int btrfs_get_dev_zone_info(struct btrfs_device *device)
> +{
> +	struct btrfs_zoned_device_info *zone_info = NULL;
> +	struct block_device *bdev = device->bdev;
> +	sector_t nr_sectors = bdev->bd_part->nr_sects;
> +	sector_t sector = 0;

I'd rather replace the sector_t types with u64. The type is unsigned
long and does not have the same width on 32/64 bit. The typecasts must
be used and if not, bugs happen (and happened).

> +	struct blk_zone *zones = NULL;
> +	unsigned int i, nreported = 0, nr_zones;
> +	unsigned int zone_sectors;
> +	int ret;
> +
> +	if (!bdev_is_zoned(bdev))
> +		return 0;
> +
> +	if (device->zone_info)
> +		return 0;
> +
> +	zone_info = kzalloc(sizeof(*zone_info), GFP_KERNEL);
> +	if (!zone_info)
> +		return -ENOMEM;
> +
> +	zone_sectors = bdev_zone_sectors(bdev);
> +	ASSERT(is_power_of_2(zone_sectors));

As is_power_of_2() works only on unsigned long, this needs to be
open-coded, as there's no unsigned long long version.
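For illustration, an open-coded variant that works for any 64-bit value could look like this (a userspace sketch, not the exact code that landed):

```c
#include <assert.h>
#include <stdint.h>

/* The kernel's is_power_of_2() takes an unsigned long, which is only
 * 32 bits wide on 32-bit architectures; an open-coded test on a fixed
 * 64-bit type avoids truncating a large zone size. */
static int is_power_of_2_u64(uint64_t n)
{
	return n != 0 && (n & (n - 1)) == 0;
}
```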

* Re: [PATCH v10 04/41] btrfs: get zone information of zoned block devices
  2020-11-25 21:47   ` David Sterba
@ 2020-11-25 22:07     ` David Sterba
  2020-11-25 23:50     ` Damien Le Moal
  1 sibling, 0 replies; 125+ messages in thread
From: David Sterba @ 2020-11-25 22:07 UTC (permalink / raw)
  To: dsterba, Naohiro Aota, linux-btrfs, dsterba, hare, linux-fsdevel,
	Jens Axboe, Christoph Hellwig, Darrick J. Wong, Damien Le Moal,
	Josef Bacik

On Wed, Nov 25, 2020 at 10:47:53PM +0100, David Sterba wrote:
> On Tue, Nov 10, 2020 at 08:26:07PM +0900, Naohiro Aota wrote:
> > +int btrfs_get_dev_zone_info(struct btrfs_device *device)
> > +{
> > +	struct btrfs_zoned_device_info *zone_info = NULL;
> > +	struct block_device *bdev = device->bdev;
> > +	sector_t nr_sectors = bdev->bd_part->nr_sects;
> > +	sector_t sector = 0;
> 
> I'd rather replace the sector_t types with u64. The type is unsigned
> long and does not have the same width on 32/64 bit. The typecasts must
> be used and if not, bugs happen (and happened).

Like in the same function a few lines below

   95         /* Get zones type */
   96         while (sector < nr_sectors) {
   97                 nr_zones = BTRFS_REPORT_NR_ZONES;
   98                 ret = btrfs_get_dev_zones(device, sector << SECTOR_SHIFT, zones,
   99                                           &nr_zones);

sector without a type cast to u64
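For illustration, here is the class of bug being pointed out, with a 32-bit type standing in for a narrow sector_t (a userspace sketch; the function names are illustrative):

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SHIFT 9

/* If the sector count lives in a 32-bit type, the shift is performed
 * in 32-bit arithmetic and the high bits are silently lost. */
static uint64_t byte_offset_buggy(uint32_t sector)
{
	return sector << SECTOR_SHIFT;	/* shift done in 32 bits */
}

/* Widening to u64 before the shift preserves the full byte offset. */
static uint64_t byte_offset_fixed(uint32_t sector)
{
	return (uint64_t)sector << SECTOR_SHIFT;
}
```

Offsets below 4 GiB are unaffected, which is exactly why such bugs survive testing on small devices.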

* Re: [PATCH v10 04/41] btrfs: get zone information of zoned block devices
  2020-11-10 11:26 ` [PATCH v10 04/41] btrfs: get zone information of zoned block devices Naohiro Aota
  2020-11-12  6:57   ` Anand Jain
  2020-11-25 21:47   ` David Sterba
@ 2020-11-25 22:16   ` David Sterba
  2 siblings, 0 replies; 125+ messages in thread
From: David Sterba @ 2020-11-25 22:16 UTC (permalink / raw)
  To: Naohiro Aota
  Cc: linux-btrfs, dsterba, hare, linux-fsdevel, Jens Axboe,
	Christoph Hellwig, Darrick J. Wong, Damien Le Moal, Josef Bacik

On Tue, Nov 10, 2020 at 08:26:07PM +0900, Naohiro Aota wrote:
> +	/* Get zones type */
> +	while (sector < nr_sectors) {
> +		nr_zones = BTRFS_REPORT_NR_ZONES;
> +		ret = btrfs_get_dev_zones(device, sector << SECTOR_SHIFT, zones,
> +					  &nr_zones);
> +		if (ret)
> +			goto out;
> +
> +		for (i = 0; i < nr_zones; i++) {
> +			if (zones[i].type == BLK_ZONE_TYPE_SEQWRITE_REQ)
> +				set_bit(nreported, zone_info->seq_zones);
> +			if (zones[i].cond == BLK_ZONE_COND_EMPTY)
> +				set_bit(nreported, zone_info->empty_zones);

set_bit() is atomic, but atomicity is not needed here as nothing else can
be touching the bitmap yet, so I'll switch it to plain __set_bit().
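For illustration, the non-atomic variant is just a plain read-modify-write on the bitmap word (a userspace sketch of what __set_bit() does; it is safe here only because the zone bitmaps are still private to the function populating them):

```c
#include <assert.h>
#include <limits.h>

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Plain, non-atomic bit set: correct only while no other context can
 * touch the bitmap, which is the case while zone_info is being built. */
static void set_bit_nonatomic(unsigned int nr, unsigned long *map)
{
	map[nr / BITS_PER_LONG] |= 1UL << (nr % BITS_PER_LONG);
}
```

The atomic set_bit() would do the same update with a locked read-modify-write, which costs more for no benefit in a single-writer setup phase.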

* Re: [PATCH v10 04/41] btrfs: get zone information of zoned block devices
  2020-11-25 21:47   ` David Sterba
  2020-11-25 22:07     ` David Sterba
@ 2020-11-25 23:50     ` Damien Le Moal
  2020-11-26 14:11       ` David Sterba
  1 sibling, 1 reply; 125+ messages in thread
From: Damien Le Moal @ 2020-11-25 23:50 UTC (permalink / raw)
  To: dsterba, Naohiro Aota
  Cc: hare, dsterba, linux-fsdevel, linux-btrfs, hch, josef,
	darrick.wong, axboe

Hi David,

On Wed, 2020-11-25 at 22:47 +0100, David Sterba wrote:
> On Tue, Nov 10, 2020 at 08:26:07PM +0900, Naohiro Aota wrote:
> > +int btrfs_get_dev_zone_info(struct btrfs_device *device)
> > +{
> > +	struct btrfs_zoned_device_info *zone_info = NULL;
> > +	struct block_device *bdev = device->bdev;
> > +	sector_t nr_sectors = bdev->bd_part->nr_sects;
> > +	sector_t sector = 0;
> 
> I'd rather replace the sector_t types with u64. The type is unsigned
> long and does not have the same width on 32/64 bit. The typecasts must
> be used and if not, bugs happen (and happened).

Since kernel 5.2, sector_t is unconditionally defined as u64 in linux/types.h:

typedef u64 sector_t;

CONFIG_LBDAF does not exist anymore.

I am not against using u64 at all, but using sector_t makes it clear what the
unit is for the values at hand.

> 
> > +	struct blk_zone *zones = NULL;
> > +	unsigned int i, nreported = 0, nr_zones;
> > +	unsigned int zone_sectors;
> > +	int ret;
> > +
> > +	if (!bdev_is_zoned(bdev))
> > +		return 0;
> > +
> > +	if (device->zone_info)
> > +		return 0;
> > +
> > +	zone_info = kzalloc(sizeof(*zone_info), GFP_KERNEL);
> > +	if (!zone_info)
> > +		return -ENOMEM;
> > +
> > +	zone_sectors = bdev_zone_sectors(bdev);
> > +	ASSERT(is_power_of_2(zone_sectors));
> 
> As is_power_of_2() works only on unsigned long, this needs to be
> open-coded, as there's no unsigned long long version.

-- 
Damien Le Moal
Western Digital

* Re: [PATCH v10 04/41] btrfs: get zone information of zoned block devices
  2020-11-25 23:50     ` Damien Le Moal
@ 2020-11-26 14:11       ` David Sterba
  0 siblings, 0 replies; 125+ messages in thread
From: David Sterba @ 2020-11-26 14:11 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: dsterba, Naohiro Aota, hare, dsterba, linux-fsdevel, linux-btrfs,
	hch, josef, darrick.wong, axboe

On Wed, Nov 25, 2020 at 11:50:39PM +0000, Damien Le Moal wrote:
> Hi David,
> 
> On Wed, 2020-11-25 at 22:47 +0100, David Sterba wrote:
> > On Tue, Nov 10, 2020 at 08:26:07PM +0900, Naohiro Aota wrote:
> > > +int btrfs_get_dev_zone_info(struct btrfs_device *device)
> > > +{
> > > +	struct btrfs_zoned_device_info *zone_info = NULL;
> > > +	struct block_device *bdev = device->bdev;
> > > +	sector_t nr_sectors = bdev->bd_part->nr_sects;
> > > +	sector_t sector = 0;
> > 
> > I'd rather replace the sector_t types with u64. The type is unsigned
> > long and does not have the same width on 32/64 bit. The typecasts must
> > be used and if not, bugs happen (and happened).
> 
> Since kernel 5.2, sector_t is unconditionally defined as u64 in linux/types.h:
> 
> typedef u64 sector_t;
> 
> CONFIG_LBDAF does not exist anymore.

That's great, I was not aware of that.

> I am not against using u64 at all, but using sector_t makes it clear what the
> unit is for the values at hand.

Yeah agreed, I'll switch it back.

* Re: [PATCH v10 13/41] btrfs: verify device extent is aligned to zone
  2020-11-10 11:26 ` [PATCH v10 13/41] btrfs: verify device extent is aligned to zone Naohiro Aota
@ 2020-11-27  6:27   ` Anand Jain
  0 siblings, 0 replies; 125+ messages in thread
From: Anand Jain @ 2020-11-27  6:27 UTC (permalink / raw)
  To: Naohiro Aota, linux-btrfs, dsterba
  Cc: hare, linux-fsdevel, Jens Axboe, Christoph Hellwig,
	Darrick J. Wong, Josef Bacik

On 10/11/20 7:26 pm, Naohiro Aota wrote:
> Add a check in verify_one_dev_extent() to check if a device extent on a
> zoned block device is aligned to the respective zone boundary.
> 
> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
> Reviewed-by: Josef Bacik <josef@toxicpanda.com>
> ---
>   fs/btrfs/volumes.c | 14 ++++++++++++++
>   1 file changed, 14 insertions(+)
> 
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 7831cf6c6da4..c0e27c1e2559 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -7783,6 +7783,20 @@ static int verify_one_dev_extent(struct btrfs_fs_info *fs_info,
>   		ret = -EUCLEAN;
>   		goto out;
>   	}
> +
> +	if (dev->zone_info) {
> +		u64 zone_size = dev->zone_info->zone_size;
> +
> +		if (!IS_ALIGNED(physical_offset, zone_size) ||
> +		    !IS_ALIGNED(physical_len, zone_size)) {
> +			btrfs_err(fs_info,
> +"zoned: dev extent devid %llu physical offset %llu len %llu is not aligned to device zone",
> +				  devid, physical_offset, physical_len);
> +			ret = -EUCLEAN;
> +			goto out;
> +		}
> +	}
> +
>   out:
>   	free_extent_map(em);
>   	return ret;
> 


Looks good.
Reviewed-by: Anand Jain <anand.jain@oracle.com>


* Re: [PATCH v10 05/41] btrfs: check and enable ZONED mode
  2020-11-18 11:29   ` Anand Jain
@ 2020-11-27 18:44     ` David Sterba
  2020-11-30 12:12       ` Anand Jain
  0 siblings, 1 reply; 125+ messages in thread
From: David Sterba @ 2020-11-27 18:44 UTC (permalink / raw)
  To: Anand Jain
  Cc: Naohiro Aota, linux-btrfs, dsterba, hare, linux-fsdevel,
	Jens Axboe, Christoph Hellwig, Darrick J. Wong,
	Johannes Thumshirn, Damien Le Moal, Josef Bacik

On Wed, Nov 18, 2020 at 07:29:20PM +0800, Anand Jain wrote:
> On 10/11/20 7:26 pm, Naohiro Aota wrote:
> > This commit introduces the function btrfs_check_zoned_mode() to check if
> > ZONED flag is enabled on the file system and if the file system consists of
> > zoned devices with equal zone size.
> > 
> > Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> > Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
> > Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
> > Reviewed-by: Josef Bacik <josef@toxicpanda.com>
> > ---
> >   fs/btrfs/ctree.h       | 11 ++++++
> >   fs/btrfs/dev-replace.c |  7 ++++
> >   fs/btrfs/disk-io.c     | 11 ++++++
> >   fs/btrfs/super.c       |  1 +
> >   fs/btrfs/volumes.c     |  5 +++
> >   fs/btrfs/zoned.c       | 81 ++++++++++++++++++++++++++++++++++++++++++
> >   fs/btrfs/zoned.h       | 26 ++++++++++++++
> >   7 files changed, 142 insertions(+)
> > 
> > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> > index aac3d6f4e35b..453f41ca024e 100644
> > --- a/fs/btrfs/ctree.h
> > +++ b/fs/btrfs/ctree.h
> > @@ -948,6 +948,12 @@ struct btrfs_fs_info {
> >   	/* Type of exclusive operation running */
> >   	unsigned long exclusive_operation;
> >   
> > +	/* Zone size when in ZONED mode */
> > +	union {
> > +		u64 zone_size;
> > +		u64 zoned;
> > +	};
> > +
> >   #ifdef CONFIG_BTRFS_FS_REF_VERIFY
> >   	spinlock_t ref_verify_lock;
> >   	struct rb_root block_tree;
> > @@ -3595,4 +3601,9 @@ static inline int btrfs_is_testing(struct btrfs_fs_info *fs_info)
> >   }
> >   #endif
> >   
> > +static inline bool btrfs_is_zoned(struct btrfs_fs_info *fs_info)
> > +{
> > +	return fs_info->zoned != 0;
> > +}
> > +
> >   #endif
> > diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
> > index 6f6d77224c2b..db87f1aa604b 100644
> > --- a/fs/btrfs/dev-replace.c
> > +++ b/fs/btrfs/dev-replace.c
> > @@ -238,6 +238,13 @@ static int btrfs_init_dev_replace_tgtdev(struct btrfs_fs_info *fs_info,
> >   		return PTR_ERR(bdev);
> >   	}
> >   
> > +	if (!btrfs_check_device_zone_type(fs_info, bdev)) {
> > +		btrfs_err(fs_info,
> > +			  "dev-replace: zoned type of target device mismatch with filesystem");
> > +		ret = -EINVAL;
> > +		goto error;
> > +	}
> > +
> >   	sync_blockdev(bdev);
> >   
> >   	list_for_each_entry(device, &fs_info->fs_devices->devices, dev_list) {
> 
>   I am not sure if it is done in some other patch. But we still have to
>   check for
> 
>   (model == BLK_ZONED_HA && incompat_zoned))

Do you really mean BLK_ZONED_HA, ie. host-aware (HA)?
btrfs_check_device_zone_type checks for _HM.

> right? What if in a non-zoned FS, a zoned device is added through the
> replace. No?

The types of devices cannot mix, yeah. So I'd like to know the answer as
well.

> > --- a/fs/btrfs/volumes.c
> > +++ b/fs/btrfs/volumes.c
> > @@ -2518,6 +2518,11 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
> >   	if (IS_ERR(bdev))
> >   		return PTR_ERR(bdev);
> >   
> > +	if (!btrfs_check_device_zone_type(fs_info, bdev)) {
> > +		ret = -EINVAL;
> > +		goto error;
> > +	}
> > +
> >   	if (fs_devices->seeding) {
> >   		seeding_dev = 1;
> >   		down_write(&sb->s_umount);
> 
> Same here too. It can also happen that a zone device is added to a non 
> zoned fs.

* Re: [PATCH v10 06/41] btrfs: introduce max_zone_append_size
  2020-11-19  9:23   ` Anand Jain
@ 2020-11-27 18:47     ` David Sterba
  0 siblings, 0 replies; 125+ messages in thread
From: David Sterba @ 2020-11-27 18:47 UTC (permalink / raw)
  To: Anand Jain
  Cc: Naohiro Aota, linux-btrfs, dsterba, hare, linux-fsdevel,
	Jens Axboe, Christoph Hellwig, Darrick J. Wong, Josef Bacik

On Thu, Nov 19, 2020 at 05:23:20PM +0800, Anand Jain wrote:
> On 10/11/20 7:26 pm, Naohiro Aota wrote:
> > The zone append write command has a maximum IO size restriction it
> > accepts. This is because a zone append write command cannot be split, as
> > we ask the device to place the data into a specific target zone and the
> > device responds with the actual written location of the data.
> > 
> > Introduce max_zone_append_size to zone_info and fs_info to track the
> > value, so we can limit all I/O to a zoned block device that we want to
> > write using the zone append command to the device's limits.
> > 
> > Reviewed-by: Josef Bacik <josef@toxicpanda.com>
> > Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
> > ---
> 
> Looks good, except: what happens when we replace or add a new zoned
> device with a different queue_max_zone_append_sectors(queue) value?

The max zone append size seems to be a constraint across all devices, so
yeah, it should be recalculated.
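For illustration, the recalculation being asked for is a minimum over the per-device limits, re-derived whenever a device is added or replaced (a hypothetical sketch; recalc_max_zone_append() and its parameters are illustrative names, not the btrfs API):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The zone-append size the filesystem may use is bounded by the most
 * restrictive device, so take the minimum of all non-zero per-device
 * limits (zero meaning "device imposes no limit"). */
static uint32_t recalc_max_zone_append(const uint32_t *dev_limits, size_t nr)
{
	uint32_t limit = UINT32_MAX;
	int found = 0;

	for (size_t i = 0; i < nr; i++) {
		if (!dev_limits[i])
			continue;	/* no limit from this device */
		found = 1;
		if (dev_limits[i] < limit)
			limit = dev_limits[i];
	}

	return found ? limit : 0;
}
```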

* Re: [PATCH v10 00/41] btrfs: zoned block device support
  2020-11-10 11:26 [PATCH v10 00/41] btrfs: zoned block device support Naohiro Aota
                   ` (41 preceding siblings ...)
  2020-11-10 14:00 ` [PATCH v10 00/41] btrfs: zoned block device support Anand Jain
@ 2020-11-27 19:28 ` David Sterba
  42 siblings, 0 replies; 125+ messages in thread
From: David Sterba @ 2020-11-27 19:28 UTC (permalink / raw)
  To: Naohiro Aota
  Cc: linux-btrfs, dsterba, hare, linux-fsdevel, Jens Axboe,
	Christoph Hellwig, Darrick J. Wong

On Tue, Nov 10, 2020 at 08:26:03PM +0900, Naohiro Aota wrote:
> Johannes Thumshirn (1):
>   block: add bio_add_zone_append_page
> 
> Naohiro Aota (40):
>   iomap: support REQ_OP_ZONE_APPEND

From that one

>   btrfs: introduce ZONED feature flag
>   btrfs: get zone information of zoned block devices
>   btrfs: check and enable ZONED mode
>   btrfs: introduce max_zone_append_size
>   btrfs: disallow space_cache in ZONED mode
>   btrfs: disallow NODATACOW in ZONED mode
>   btrfs: disable fallocate in ZONED mode
>   btrfs: disallow mixed-bg in ZONED mode
>   btrfs: implement log-structured superblock for ZONED mode

Everything up to this patch has been added to misc-next. There's still an
open question regarding the superblock copies; we had a chat about that
with Johannes, so he'll tell you more.

* Re: [PATCH v10 04/41] btrfs: get zone information of zoned block devices
  2020-11-18 11:17       ` Anand Jain
@ 2020-11-30 11:16         ` Anand Jain
  0 siblings, 0 replies; 125+ messages in thread
From: Anand Jain @ 2020-11-30 11:16 UTC (permalink / raw)
  To: Naohiro Aota
  Cc: linux-btrfs, dsterba, hare, linux-fsdevel, Jens Axboe,
	Christoph Hellwig, Darrick J. Wong, Damien Le Moal, Josef Bacik


Below two comments are fixed in the misc-next.

Reviewed-by: Anand Jain <anand.jain@oracle.com>

Thanks.


On 18/11/20 7:17 pm, Anand Jain wrote:
> 
> 
> Also, %device->fs_info is not protected. It is better to avoid using
> fs_info when we are still at open_fs_devices(). Yeah, the unknown part
> can be better. We need to fix it as a whole. For now, you can use
> something like...
> 
> -------------------------
> diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
> index 1223d5b0e411..e857bb304d28 100644
> --- a/fs/btrfs/zoned.c
> +++ b/fs/btrfs/zoned.c
> @@ -130,19 +130,11 @@ int btrfs_get_dev_zone_info(struct btrfs_device *device)
>           * (device <unknown>) ..."
>           */
> 
> -       rcu_read_lock();
> -       if (device->fs_info)
> -               btrfs_info(device->fs_info,
> -                       "host-%s zoned block device %s, %u zones of %llu bytes",
> -                       bdev_zoned_model(bdev) == BLK_ZONED_HM ? "managed" : "aware",
> -                       rcu_str_deref(device->name), zone_info->nr_zones,
> -                       zone_info->zone_size);
> -       else
> -               pr_info("BTRFS info: host-%s zoned block device %s, %u zones of %llu bytes",
> -                       bdev_zoned_model(bdev) == BLK_ZONED_HM ? "managed" : "aware",
> -                       rcu_str_deref(device->name), zone_info->nr_zones,
> -                       zone_info->zone_size);
> -       rcu_read_unlock();
> +       btrfs_info_in_rcu(NULL,
> +               "host-%s zoned block device %s, %u zones of %llu bytes",
> +               bdev_zoned_model(bdev) == BLK_ZONED_HM ? "managed" : "aware",
> +               rcu_str_deref(device->name), zone_info->nr_zones,
> +               zone_info->zone_size);
> 
>          return 0;
>   ---------------------------
> 
> Thanks, Anand
> 
> 



>>>> @@ -374,6 +375,7 @@ void btrfs_free_device(struct btrfs_device *device)
>>>>      rcu_string_free(device->name);
>>>>      extent_io_tree_release(&device->alloc_state);
>>>>      bio_put(device->flush_bio);
>>>
>>>> +    btrfs_destroy_dev_zone_info(device);
>>>
>>> Free of btrfs_device::zone_info is already happening in the path..
>>>
>>> btrfs_close_one_device()
>>>   btrfs_destroy_dev_zone_info()
>>>
>>> We don't need this..
>>>
>>> btrfs_free_device()
>>>  btrfs_destroy_dev_zone_info()
>>
>> Ah, yes, I once had it only in btrfs_free_device() and noticed that it 
>> does
>> not free the device zone info on umount. So, I added one in
>> btrfs_close_one_device() and forgot to remove the other one. I'll drop it
>> from btrfs_free_device().



* Re: [PATCH v10 05/41] btrfs: check and enable ZONED mode
  2020-11-27 18:44     ` David Sterba
@ 2020-11-30 12:12       ` Anand Jain
  2020-11-30 13:15         ` Damien Le Moal
  0 siblings, 1 reply; 125+ messages in thread
From: Anand Jain @ 2020-11-30 12:12 UTC (permalink / raw)
  To: dsterba, Naohiro Aota, linux-btrfs, dsterba, hare, linux-fsdevel,
	Jens Axboe, Christoph Hellwig, Darrick J. Wong,
	Johannes Thumshirn, Damien Le Moal, Josef Bacik

On 28/11/20 2:44 am, David Sterba wrote:
> On Wed, Nov 18, 2020 at 07:29:20PM +0800, Anand Jain wrote:
>> On 10/11/20 7:26 pm, Naohiro Aota wrote:
>>> This commit introduces the function btrfs_check_zoned_mode() to check if
>>> ZONED flag is enabled on the file system and if the file system consists of
>>> zoned devices with equal zone size.
>>>
>>> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>>> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
>>> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
>>> Reviewed-by: Josef Bacik <josef@toxicpanda.com>
>>> ---
>>>    fs/btrfs/ctree.h       | 11 ++++++
>>>    fs/btrfs/dev-replace.c |  7 ++++
>>>    fs/btrfs/disk-io.c     | 11 ++++++
>>>    fs/btrfs/super.c       |  1 +
>>>    fs/btrfs/volumes.c     |  5 +++
>>>    fs/btrfs/zoned.c       | 81 ++++++++++++++++++++++++++++++++++++++++++
>>>    fs/btrfs/zoned.h       | 26 ++++++++++++++
>>>    7 files changed, 142 insertions(+)
>>>
>>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
>>> index aac3d6f4e35b..453f41ca024e 100644
>>> --- a/fs/btrfs/ctree.h
>>> +++ b/fs/btrfs/ctree.h
>>> @@ -948,6 +948,12 @@ struct btrfs_fs_info {
>>>    	/* Type of exclusive operation running */
>>>    	unsigned long exclusive_operation;
>>>    
>>> +	/* Zone size when in ZONED mode */
>>> +	union {
>>> +		u64 zone_size;
>>> +		u64 zoned;
>>> +	};
>>> +
>>>    #ifdef CONFIG_BTRFS_FS_REF_VERIFY
>>>    	spinlock_t ref_verify_lock;
>>>    	struct rb_root block_tree;
>>> @@ -3595,4 +3601,9 @@ static inline int btrfs_is_testing(struct btrfs_fs_info *fs_info)
>>>    }
>>>    #endif
>>>    
>>> +static inline bool btrfs_is_zoned(struct btrfs_fs_info *fs_info)
>>> +{
>>> +	return fs_info->zoned != 0;
>>> +}
>>> +
>>>    #endif
>>> diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
>>> index 6f6d77224c2b..db87f1aa604b 100644
>>> --- a/fs/btrfs/dev-replace.c
>>> +++ b/fs/btrfs/dev-replace.c
>>> @@ -238,6 +238,13 @@ static int btrfs_init_dev_replace_tgtdev(struct btrfs_fs_info *fs_info,
>>>    		return PTR_ERR(bdev);
>>>    	}
>>>    
>>> +	if (!btrfs_check_device_zone_type(fs_info, bdev)) {
>>> +		btrfs_err(fs_info,
>>> +			  "dev-replace: zoned type of target device mismatch with filesystem");
>>> +		ret = -EINVAL;
>>> +		goto error;
>>> +	}
>>> +
>>>    	sync_blockdev(bdev);
>>>    
>>>    	list_for_each_entry(device, &fs_info->fs_devices->devices, dev_list) {
>>
>>    I am not sure if it is done in some other patch. But we still have to
>>    check for
>>
>>    (model == BLK_ZONED_HA && incompat_zoned))
> 
> Do you really mean BLK_ZONED_HA, ie. host-aware (HA)?
> btrfs_check_device_zone_type checks for _HM.


Still confusing to me. The below function, which is part of this
patch, says we don't support BLK_ZONED_HM. So does it mean we
allow BLK_ZONED_HA only?

+static inline bool btrfs_check_device_zone_type(struct btrfs_fs_info *fs_info,
+						struct block_device *bdev)
+{
+	u64 zone_size;
+
+	if (btrfs_is_zoned(fs_info)) {
+		zone_size = (u64)bdev_zone_sectors(bdev) << SECTOR_SHIFT;
+		/* Do not allow non-zoned device */
+		return bdev_is_zoned(bdev) && fs_info->zone_size == zone_size;
+	}
+
+	/* Do not allow Host Manged zoned device */
+	return bdev_zoned_model(bdev) != BLK_ZONED_HM;
+}


Also, if there is a new type of zoned device in the future, the older
kernel should be able to reject the newer zone device types.

And, if possible, could you rename the above function to
btrfs_zone_type_is_valid(), or something better?


>> right? What if in a non-zoned FS, a zoned device is added through the
>> replace. No?
> 
> The types of devices cannot mix, yeah. So I'd like to know the answer as
> well.


>>> --- a/fs/btrfs/volumes.c
>>> +++ b/fs/btrfs/volumes.c
>>> @@ -2518,6 +2518,11 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
>>>    	if (IS_ERR(bdev))
>>>    		return PTR_ERR(bdev);
>>>    
>>> +	if (!btrfs_check_device_zone_type(fs_info, bdev)) {
>>> +		ret = -EINVAL;
>>> +		goto error;
>>> +	}
>>> +
>>>    	if (fs_devices->seeding) {
>>>    		seeding_dev = 1;
>>>    		down_write(&sb->s_umount);
>>
>> Same here too. It can also happen that a zone device is added to a non
>> zoned fs.


Thanks.

* Re: [PATCH v10 05/41] btrfs: check and enable ZONED mode
  2020-11-30 12:12       ` Anand Jain
@ 2020-11-30 13:15         ` Damien Le Moal
  2020-12-01  2:19           ` Anand Jain
  0 siblings, 1 reply; 125+ messages in thread
From: Damien Le Moal @ 2020-11-30 13:15 UTC (permalink / raw)
  To: Anand Jain, dsterba, Naohiro Aota, linux-btrfs, dsterba, hare,
	linux-fsdevel, Jens Axboe, hch, Darrick J. Wong,
	Johannes Thumshirn, Josef Bacik

On 2020/11/30 21:13, Anand Jain wrote:
> On 28/11/20 2:44 am, David Sterba wrote:
>> On Wed, Nov 18, 2020 at 07:29:20PM +0800, Anand Jain wrote:
>>> On 10/11/20 7:26 pm, Naohiro Aota wrote:
>>>> This commit introduces the function btrfs_check_zoned_mode() to check if
>>>> ZONED flag is enabled on the file system and if the file system consists of
>>>> zoned devices with equal zone size.
>>>>
>>>> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>>>> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
>>>> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
>>>> Reviewed-by: Josef Bacik <josef@toxicpanda.com>
>>>> ---
>>>>    fs/btrfs/ctree.h       | 11 ++++++
>>>>    fs/btrfs/dev-replace.c |  7 ++++
>>>>    fs/btrfs/disk-io.c     | 11 ++++++
>>>>    fs/btrfs/super.c       |  1 +
>>>>    fs/btrfs/volumes.c     |  5 +++
>>>>    fs/btrfs/zoned.c       | 81 ++++++++++++++++++++++++++++++++++++++++++
>>>>    fs/btrfs/zoned.h       | 26 ++++++++++++++
>>>>    7 files changed, 142 insertions(+)
>>>>
>>>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
>>>> index aac3d6f4e35b..453f41ca024e 100644
>>>> --- a/fs/btrfs/ctree.h
>>>> +++ b/fs/btrfs/ctree.h
>>>> @@ -948,6 +948,12 @@ struct btrfs_fs_info {
>>>>    	/* Type of exclusive operation running */
>>>>    	unsigned long exclusive_operation;
>>>>    
>>>> +	/* Zone size when in ZONED mode */
>>>> +	union {
>>>> +		u64 zone_size;
>>>> +		u64 zoned;
>>>> +	};
>>>> +
>>>>    #ifdef CONFIG_BTRFS_FS_REF_VERIFY
>>>>    	spinlock_t ref_verify_lock;
>>>>    	struct rb_root block_tree;
>>>> @@ -3595,4 +3601,9 @@ static inline int btrfs_is_testing(struct btrfs_fs_info *fs_info)
>>>>    }
>>>>    #endif
>>>>    
>>>> +static inline bool btrfs_is_zoned(struct btrfs_fs_info *fs_info)
>>>> +{
>>>> +	return fs_info->zoned != 0;
>>>> +}
>>>> +
>>>>    #endif
>>>> diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
>>>> index 6f6d77224c2b..db87f1aa604b 100644
>>>> --- a/fs/btrfs/dev-replace.c
>>>> +++ b/fs/btrfs/dev-replace.c
>>>> @@ -238,6 +238,13 @@ static int btrfs_init_dev_replace_tgtdev(struct btrfs_fs_info *fs_info,
>>>>    		return PTR_ERR(bdev);
>>>>    	}
>>>>    
>>>> +	if (!btrfs_check_device_zone_type(fs_info, bdev)) {
>>>> +		btrfs_err(fs_info,
>>>> +			  "dev-replace: zoned type of target device mismatch with filesystem");
>>>> +		ret = -EINVAL;
>>>> +		goto error;
>>>> +	}
>>>> +
>>>>    	sync_blockdev(bdev);
>>>>    
>>>>    	list_for_each_entry(device, &fs_info->fs_devices->devices, dev_list) {
>>>
>>>    I am not sure if it is done in some other patch. But we still have to
>>>    check for
>>>
>>>    (model == BLK_ZONED_HA && incompat_zoned))
>>
>> Do you really mean BLK_ZONED_HA, ie. host-aware (HA)?
>> btrfs_check_device_zone_type checks for _HM.
> 
> 
> Still confusing to me. The below function, which is part of this
> patch, says we don't support BLK_ZONED_HM. So does it mean we
> allow BLK_ZONED_HA only?
> 
> +static inline bool btrfs_check_device_zone_type(struct btrfs_fs_info *fs_info,
> +						struct block_device *bdev)
> +{
> +	u64 zone_size;
> +
> +	if (btrfs_is_zoned(fs_info)) {
> +		zone_size = (u64)bdev_zone_sectors(bdev) << SECTOR_SHIFT;
> +		/* Do not allow non-zoned device */

This comment does not make sense. It should be:

		/* Only allow zoned devices with the same zone size */

> +		return bdev_is_zoned(bdev) && fs_info->zone_size == zone_size;
> +	}
> +
> +	/* Do not allow Host Manged zoned device */
> +	return bdev_zoned_model(bdev) != BLK_ZONED_HM;

The comment is also wrong. It should read:

	/* Allow only host managed zoned devices */

This is because we decided to treat host aware devices in the same way as
regular block devices, since HA drives are backward compatible with regular
block devices.

> +}
> 
> 
> Also, if there is a new type of zoned device in the future, the older 
> kernel should be able to reject the newer zone device types.
> 
> And, if possible could you rename above function to 
> btrfs_zone_type_is_valid(). Or better.
> 
> 
>>> right? What if in a non-zoned FS, a zoned device is added through the
>>> replace. No?
>>
>> The types of devices cannot mix, yeah. So I'd like to know the answer as
>> well.
> 
> 
>>>> --- a/fs/btrfs/volumes.c
>>>> +++ b/fs/btrfs/volumes.c
>>>> @@ -2518,6 +2518,11 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
>>>>    	if (IS_ERR(bdev))
>>>>    		return PTR_ERR(bdev);
>>>>    
>>>> +	if (!btrfs_check_device_zone_type(fs_info, bdev)) {
>>>> +		ret = -EINVAL;
>>>> +		goto error;
>>>> +	}
>>>> +
>>>>    	if (fs_devices->seeding) {
>>>>    		seeding_dev = 1;
>>>>    		down_write(&sb->s_umount);
>>>
>>> Same here too. It can also happen that a zone device is added to a non
>>> zoned fs.
> 
> 
> Thanks.
> 


-- 
Damien Le Moal
Western Digital Research

* Re: [PATCH v10 02/41] iomap: support REQ_OP_ZONE_APPEND
  2020-11-10 11:26 ` [PATCH v10 02/41] iomap: support REQ_OP_ZONE_APPEND Naohiro Aota
  2020-11-10 17:25   ` Christoph Hellwig
  2020-11-10 18:55   ` Darrick J. Wong
@ 2020-11-30 18:11   ` Darrick J. Wong
  2020-12-01 10:16     ` Johannes Thumshirn
  2020-12-09  9:31   ` Christoph Hellwig
  3 siblings, 1 reply; 125+ messages in thread
From: Darrick J. Wong @ 2020-11-30 18:11 UTC (permalink / raw)
  To: Naohiro Aota
  Cc: linux-btrfs, dsterba, hare, linux-fsdevel, Jens Axboe, Christoph Hellwig

On Tue, Nov 10, 2020 at 08:26:05PM +0900, Naohiro Aota wrote:
> A ZONE_APPEND bio must follow hardware restrictions (e.g. not exceeding
> max_zone_append_sectors) not to be split. bio_iov_iter_get_pages builds
> such restricted bio using __bio_iov_append_get_pages if bio_op(bio) ==
> REQ_OP_ZONE_APPEND.
> 
> To utilize it, we need to set the bio_op before calling
> bio_iov_iter_get_pages(). This commit introduces IOMAP_F_ZONE_APPEND, so
> that iomap user can set the flag to indicate they want REQ_OP_ZONE_APPEND
> and restricted bio.
> 
> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>

Christoph's answers seem reasonable to me, so

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

Er... do you want me to take this one via the iomap tree?

--D

> ---
>  fs/iomap/direct-io.c  | 41 +++++++++++++++++++++++++++++++++++------
>  include/linux/iomap.h |  1 +
>  2 files changed, 36 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index c1aafb2ab990..f04572a55a09 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -200,6 +200,34 @@ iomap_dio_zero(struct iomap_dio *dio, struct iomap *iomap, loff_t pos,
>  	iomap_dio_submit_bio(dio, iomap, bio, pos);
>  }
>  
> +/*
> + * Figure out the bio's operation flags from the dio request, the
> + * mapping, and whether or not we want FUA.  Note that we can end up
> + * clearing the WRITE_FUA flag in the dio request.
> + */
> +static inline unsigned int
> +iomap_dio_bio_opflags(struct iomap_dio *dio, struct iomap *iomap, bool use_fua)
> +{
> +	unsigned int opflags = REQ_SYNC | REQ_IDLE;
> +
> +	if (!(dio->flags & IOMAP_DIO_WRITE)) {
> +		WARN_ON_ONCE(iomap->flags & IOMAP_F_ZONE_APPEND);
> +		return REQ_OP_READ;
> +	}
> +
> +	if (iomap->flags & IOMAP_F_ZONE_APPEND)
> +		opflags |= REQ_OP_ZONE_APPEND;
> +	else
> +		opflags |= REQ_OP_WRITE;
> +
> +	if (use_fua)
> +		opflags |= REQ_FUA;
> +	else
> +		dio->flags &= ~IOMAP_DIO_WRITE_FUA;
> +
> +	return opflags;
> +}
> +
>  static loff_t
>  iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
>  		struct iomap_dio *dio, struct iomap *iomap)
> @@ -278,6 +306,13 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
>  		bio->bi_private = dio;
>  		bio->bi_end_io = iomap_dio_bio_end_io;
>  
> +		/*
> +		 * Set the operation flags early so that bio_iov_iter_get_pages
> +		 * can set up the page vector appropriately for a ZONE_APPEND
> +		 * operation.
> +		 */
> +		bio->bi_opf = iomap_dio_bio_opflags(dio, iomap, use_fua);
> +
>  		ret = bio_iov_iter_get_pages(bio, dio->submit.iter);
>  		if (unlikely(ret)) {
>  			/*
> @@ -292,14 +327,8 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
>  
>  		n = bio->bi_iter.bi_size;
>  		if (dio->flags & IOMAP_DIO_WRITE) {
> -			bio->bi_opf = REQ_OP_WRITE | REQ_SYNC | REQ_IDLE;
> -			if (use_fua)
> -				bio->bi_opf |= REQ_FUA;
> -			else
> -				dio->flags &= ~IOMAP_DIO_WRITE_FUA;
>  			task_io_account_write(n);
>  		} else {
> -			bio->bi_opf = REQ_OP_READ;
>  			if (dio->flags & IOMAP_DIO_DIRTY)
>  				bio_set_pages_dirty(bio);
>  		}
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index 4d1d3c3469e9..1bccd1880d0d 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -54,6 +54,7 @@ struct vm_fault;
>  #define IOMAP_F_SHARED		0x04
>  #define IOMAP_F_MERGED		0x08
>  #define IOMAP_F_BUFFER_HEAD	0x10
> +#define IOMAP_F_ZONE_APPEND	0x20
>  
>  /*
>   * Flags set by the core iomap code during operations:
> -- 
> 2.27.0
> 


* Re: [PATCH v10 05/41] btrfs: check and enable ZONED mode
  2020-11-30 13:15         ` Damien Le Moal
@ 2020-12-01  2:19           ` Anand Jain
  2020-12-01  2:29             ` Damien Le Moal
  0 siblings, 1 reply; 125+ messages in thread
From: Anand Jain @ 2020-12-01  2:19 UTC (permalink / raw)
  To: Damien Le Moal, dsterba, Naohiro Aota, linux-btrfs, dsterba,
	hare, linux-fsdevel, Jens Axboe, hch, Darrick J. Wong,
	Johannes Thumshirn, Josef Bacik

On 30/11/20 9:15 pm, Damien Le Moal wrote:
> On 2020/11/30 21:13, Anand Jain wrote:
>> On 28/11/20 2:44 am, David Sterba wrote:
>>> On Wed, Nov 18, 2020 at 07:29:20PM +0800, Anand Jain wrote:
>>>> On 10/11/20 7:26 pm, Naohiro Aota wrote:
>>>>> This commit introduces the function btrfs_check_zoned_mode() to check if
>>>>> ZONED flag is enabled on the file system and if the file system consists of
>>>>> zoned devices with equal zone size.
>>>>>
>>>>> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>>>>> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
>>>>> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
>>>>> Reviewed-by: Josef Bacik <josef@toxicpanda.com>
>>>>> ---
>>>>>     fs/btrfs/ctree.h       | 11 ++++++
>>>>>     fs/btrfs/dev-replace.c |  7 ++++
>>>>>     fs/btrfs/disk-io.c     | 11 ++++++
>>>>>     fs/btrfs/super.c       |  1 +
>>>>>     fs/btrfs/volumes.c     |  5 +++
>>>>>     fs/btrfs/zoned.c       | 81 ++++++++++++++++++++++++++++++++++++++++++
>>>>>     fs/btrfs/zoned.h       | 26 ++++++++++++++
>>>>>     7 files changed, 142 insertions(+)
>>>>>
>>>>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
>>>>> index aac3d6f4e35b..453f41ca024e 100644
>>>>> --- a/fs/btrfs/ctree.h
>>>>> +++ b/fs/btrfs/ctree.h
>>>>> @@ -948,6 +948,12 @@ struct btrfs_fs_info {
>>>>>     	/* Type of exclusive operation running */
>>>>>     	unsigned long exclusive_operation;
>>>>>     
>>>>> +	/* Zone size when in ZONED mode */
>>>>> +	union {
>>>>> +		u64 zone_size;
>>>>> +		u64 zoned;
>>>>> +	};
>>>>> +
>>>>>     #ifdef CONFIG_BTRFS_FS_REF_VERIFY
>>>>>     	spinlock_t ref_verify_lock;
>>>>>     	struct rb_root block_tree;
>>>>> @@ -3595,4 +3601,9 @@ static inline int btrfs_is_testing(struct btrfs_fs_info *fs_info)
>>>>>     }
>>>>>     #endif
>>>>>     
>>>>> +static inline bool btrfs_is_zoned(struct btrfs_fs_info *fs_info)
>>>>> +{
>>>>> +	return fs_info->zoned != 0;
>>>>> +}
>>>>> +
>>>>>     #endif
>>>>> diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
>>>>> index 6f6d77224c2b..db87f1aa604b 100644
>>>>> --- a/fs/btrfs/dev-replace.c
>>>>> +++ b/fs/btrfs/dev-replace.c
>>>>> @@ -238,6 +238,13 @@ static int btrfs_init_dev_replace_tgtdev(struct btrfs_fs_info *fs_info,
>>>>>     		return PTR_ERR(bdev);
>>>>>     	}
>>>>>     
>>>>> +	if (!btrfs_check_device_zone_type(fs_info, bdev)) {
>>>>> +		btrfs_err(fs_info,
>>>>> +			  "dev-replace: zoned type of target device mismatch with filesystem");
>>>>> +		ret = -EINVAL;
>>>>> +		goto error;
>>>>> +	}
>>>>> +
>>>>>     	sync_blockdev(bdev);
>>>>>     
>>>>>     	list_for_each_entry(device, &fs_info->fs_devices->devices, dev_list) {
>>>>
>>>>     I am not sure if it is done in some other patch. But we still have to
>>>>     check for
>>>>
>>>>     (model == BLK_ZONED_HA && incompat_zoned))
>>>
>>> Do you really mean BLK_ZONED_HA, ie. host-aware (HA)?
>>> btrfs_check_device_zone_type checks for _HM.
>>
>>
>> Still confusing to me. The below function, which is part of this
>> patch, says we don't support BLK_ZONED_HM. So does it mean we
>> allow BLK_ZONED_HA only?
>>
>> +static inline bool btrfs_check_device_zone_type(struct btrfs_fs_info
>> *fs_info,
>> +						struct block_device *bdev)
>> +{
>> +	u64 zone_size;
>> +
>> +	if (btrfs_is_zoned(fs_info)) {
>> +		zone_size = (u64)bdev_zone_sectors(bdev) << SECTOR_SHIFT;
>> +		/* Do not allow non-zoned device */
> 
> This comment does not make sense. It should be:
> 
> 		/* Only allow zoned devices with the same zone size */
> 
>> +		return bdev_is_zoned(bdev) && fs_info->zone_size == zone_size;
>> +	}
>> +
>> +	/* Do not allow Host Manged zoned device */
>> +	return bdev_zoned_model(bdev) != BLK_ZONED_HM;
> 
> The comment is also wrong. It should read:
> 
> 	/* Allow only host managed zoned devices */
> 
> This is because we decided to treat host aware devices in the same way as
> regular block devices, since HA drives are backward compatible with regular
> block devices.


Yeah, I read about them, but I still have questions, like: does an FS work
on top of a BLK_ZONED_HA device without modification?
  Are we OK to replace an HM device with an HA device? Or to add an HA
device to a btrfs on an HM device?

Thanks.

> 
>> +}
>>
>>
>> Also, if there is a new type of zoned device in the future, the older
>> kernel should be able to reject the newer zone device types.
>>
>> And, if possible could you rename above function to
>> btrfs_zone_type_is_valid(). Or better.
>>
>>
>>>> right? What if in a non-zoned FS, a zoned device is added through the
>>>> replace. No?
>>>
>>> The types of devices cannot mix, yeah. So I'd like to know the answer as
>>> well.
>>
>>
>>>>> --- a/fs/btrfs/volumes.c
>>>>> +++ b/fs/btrfs/volumes.c
>>>>> @@ -2518,6 +2518,11 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
>>>>>     	if (IS_ERR(bdev))
>>>>>     		return PTR_ERR(bdev);
>>>>>     
>>>>> +	if (!btrfs_check_device_zone_type(fs_info, bdev)) {
>>>>> +		ret = -EINVAL;
>>>>> +		goto error;
>>>>> +	}
>>>>> +
>>>>>     	if (fs_devices->seeding) {
>>>>>     		seeding_dev = 1;
>>>>>     		down_write(&sb->s_umount);
>>>>
>>>> Same here too. It can also happen that a zone device is added to a non
>>>> zoned fs.
>>
>>
>> Thanks.
>>
> 
> 



* Re: [PATCH v10 05/41] btrfs: check and enable ZONED mode
  2020-12-01  2:19           ` Anand Jain
@ 2020-12-01  2:29             ` Damien Le Moal
  2020-12-01  5:53               ` Anand Jain
  2020-12-01 10:45               ` Graham Cobb
  0 siblings, 2 replies; 125+ messages in thread
From: Damien Le Moal @ 2020-12-01  2:29 UTC (permalink / raw)
  To: Anand Jain, dsterba, Naohiro Aota, linux-btrfs, dsterba, hare,
	linux-fsdevel, Jens Axboe, hch, Darrick J. Wong,
	Johannes Thumshirn, Josef Bacik

On 2020/12/01 11:20, Anand Jain wrote:
> On 30/11/20 9:15 pm, Damien Le Moal wrote:
>> On 2020/11/30 21:13, Anand Jain wrote:
>>> On 28/11/20 2:44 am, David Sterba wrote:
>>>> On Wed, Nov 18, 2020 at 07:29:20PM +0800, Anand Jain wrote:
>>>>> On 10/11/20 7:26 pm, Naohiro Aota wrote:
>>>>>> This commit introduces the function btrfs_check_zoned_mode() to check if
>>>>>> ZONED flag is enabled on the file system and if the file system consists of
>>>>>> zoned devices with equal zone size.
>>>>>>
>>>>>> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>>>>>> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
>>>>>> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
>>>>>> Reviewed-by: Josef Bacik <josef@toxicpanda.com>
>>>>>> ---
>>>>>>     fs/btrfs/ctree.h       | 11 ++++++
>>>>>>     fs/btrfs/dev-replace.c |  7 ++++
>>>>>>     fs/btrfs/disk-io.c     | 11 ++++++
>>>>>>     fs/btrfs/super.c       |  1 +
>>>>>>     fs/btrfs/volumes.c     |  5 +++
>>>>>>     fs/btrfs/zoned.c       | 81 ++++++++++++++++++++++++++++++++++++++++++
>>>>>>     fs/btrfs/zoned.h       | 26 ++++++++++++++
>>>>>>     7 files changed, 142 insertions(+)
>>>>>>
>>>>>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
>>>>>> index aac3d6f4e35b..453f41ca024e 100644
>>>>>> --- a/fs/btrfs/ctree.h
>>>>>> +++ b/fs/btrfs/ctree.h
>>>>>> @@ -948,6 +948,12 @@ struct btrfs_fs_info {
>>>>>>     	/* Type of exclusive operation running */
>>>>>>     	unsigned long exclusive_operation;
>>>>>>     
>>>>>> +	/* Zone size when in ZONED mode */
>>>>>> +	union {
>>>>>> +		u64 zone_size;
>>>>>> +		u64 zoned;
>>>>>> +	};
>>>>>> +
>>>>>>     #ifdef CONFIG_BTRFS_FS_REF_VERIFY
>>>>>>     	spinlock_t ref_verify_lock;
>>>>>>     	struct rb_root block_tree;
>>>>>> @@ -3595,4 +3601,9 @@ static inline int btrfs_is_testing(struct btrfs_fs_info *fs_info)
>>>>>>     }
>>>>>>     #endif
>>>>>>     
>>>>>> +static inline bool btrfs_is_zoned(struct btrfs_fs_info *fs_info)
>>>>>> +{
>>>>>> +	return fs_info->zoned != 0;
>>>>>> +}
>>>>>> +
>>>>>>     #endif
>>>>>> diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
>>>>>> index 6f6d77224c2b..db87f1aa604b 100644
>>>>>> --- a/fs/btrfs/dev-replace.c
>>>>>> +++ b/fs/btrfs/dev-replace.c
>>>>>> @@ -238,6 +238,13 @@ static int btrfs_init_dev_replace_tgtdev(struct btrfs_fs_info *fs_info,
>>>>>>     		return PTR_ERR(bdev);
>>>>>>     	}
>>>>>>     
>>>>>> +	if (!btrfs_check_device_zone_type(fs_info, bdev)) {
>>>>>> +		btrfs_err(fs_info,
>>>>>> +			  "dev-replace: zoned type of target device mismatch with filesystem");
>>>>>> +		ret = -EINVAL;
>>>>>> +		goto error;
>>>>>> +	}
>>>>>> +
>>>>>>     	sync_blockdev(bdev);
>>>>>>     
>>>>>>     	list_for_each_entry(device, &fs_info->fs_devices->devices, dev_list) {
>>>>>
>>>>>     I am not sure if it is done in some other patch. But we still have to
>>>>>     check for
>>>>>
>>>>>     (model == BLK_ZONED_HA && incompat_zoned))
>>>>
>>>> Do you really mean BLK_ZONED_HA, ie. host-aware (HA)?
>>>> btrfs_check_device_zone_type checks for _HM.
>>>
>>>
>>> Still confusing to me. The below function, which is part of this
>>> patch, says we don't support BLK_ZONED_HM. So does it mean we
>>> allow BLK_ZONED_HA only?
>>>
>>> +static inline bool btrfs_check_device_zone_type(struct btrfs_fs_info
>>> *fs_info,
>>> +						struct block_device *bdev)
>>> +{
>>> +	u64 zone_size;
>>> +
>>> +	if (btrfs_is_zoned(fs_info)) {
>>> +		zone_size = (u64)bdev_zone_sectors(bdev) << SECTOR_SHIFT;
>>> +		/* Do not allow non-zoned device */
>>
>> This comment does not make sense. It should be:
>>
>> 		/* Only allow zoned devices with the same zone size */
>>
>>> +		return bdev_is_zoned(bdev) && fs_info->zone_size == zone_size;
>>> +	}
>>> +
>>> +	/* Do not allow Host Manged zoned device */
>>> +	return bdev_zoned_model(bdev) != BLK_ZONED_HM;
>>
>> The comment is also wrong. It should read:
>>
>> 	/* Allow only host managed zoned devices */
>>
>> This is because we decided to treat host aware devices in the same way as
>> regular block devices, since HA drives are backward compatible with regular
>> block devices.
> 
> 
> Yeah, I read about them, but I still have questions, like: does an FS work
> on top of a BLK_ZONED_HA device without modification?

Yes. These drives are fully backward compatible and accept random writes
anywhere. Performance however is potentially a different story as the drive will
eventually need to do internal garbage collection of some sort, exactly like an
SSD, but definitely not at SSD speeds :)

>   Are we OK to replace an HM device with an HA device? Or to add an HA
> device to a btrfs on an HM device?

We have a choice here: we can treat HA drives as regular devices or treat them
as HM devices. Anything in between does not make sense. I am fine either way,
the main reason being that there are no HA drives on the market today that I know
of (this model did not have a lot of success due to the potentially very
unpredictable performance depending on the use case).

So the simplest thing to do is, in my opinion, to ignore their "zone"
characteristics and treat them as regular disks. But treating them as HM drives
is simple to do too.

Of note is that a host-aware drive will be reported by the block layer as
BLK_ZONED_HA only as long as the drive does not have any partition. If it does,
then the block layer will treat the drive as a regular disk.

> 
> Thanks.
> 
>>
>>> +}
>>>
>>>
>>> Also, if there is a new type of zoned device in the future, the older
>>> kernel should be able to reject the newer zone device types.
>>>
>>> And, if possible could you rename above function to
>>> btrfs_zone_type_is_valid(). Or better.
>>>
>>>
>>>>> right? What if in a non-zoned FS, a zoned device is added through the
>>>>> replace. No?
>>>>
>>>> The types of devices cannot mix, yeah. So I'd like to know the answer as
>>>> well.
>>>
>>>
>>>>>> --- a/fs/btrfs/volumes.c
>>>>>> +++ b/fs/btrfs/volumes.c
>>>>>> @@ -2518,6 +2518,11 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
>>>>>>     	if (IS_ERR(bdev))
>>>>>>     		return PTR_ERR(bdev);
>>>>>>     
>>>>>> +	if (!btrfs_check_device_zone_type(fs_info, bdev)) {
>>>>>> +		ret = -EINVAL;
>>>>>> +		goto error;
>>>>>> +	}
>>>>>> +
>>>>>>     	if (fs_devices->seeding) {
>>>>>>     		seeding_dev = 1;
>>>>>>     		down_write(&sb->s_umount);
>>>>>
>>>>> Same here too. It can also happen that a zone device is added to a non
>>>>> zoned fs.
>>>
>>>
>>> Thanks.
>>>
>>
>>
> 
> 


-- 
Damien Le Moal
Western Digital Research


* Re: [PATCH v10 05/41] btrfs: check and enable ZONED mode
  2020-12-01  2:29             ` Damien Le Moal
@ 2020-12-01  5:53               ` Anand Jain
  2020-12-01  6:09                 ` Damien Le Moal
  2020-12-01 10:45               ` Graham Cobb
  1 sibling, 1 reply; 125+ messages in thread
From: Anand Jain @ 2020-12-01  5:53 UTC (permalink / raw)
  To: Damien Le Moal, dsterba, Naohiro Aota, linux-btrfs, dsterba,
	hare, linux-fsdevel, Jens Axboe, hch, Darrick J. Wong,
	Johannes Thumshirn, Josef Bacik

On 1/12/20 10:29 am, Damien Le Moal wrote:
> On 2020/12/01 11:20, Anand Jain wrote:
>> On 30/11/20 9:15 pm, Damien Le Moal wrote:
>>> On 2020/11/30 21:13, Anand Jain wrote:
>>>> On 28/11/20 2:44 am, David Sterba wrote:
>>>>> On Wed, Nov 18, 2020 at 07:29:20PM +0800, Anand Jain wrote:
>>>>>> On 10/11/20 7:26 pm, Naohiro Aota wrote:
>>>>>>> This commit introduces the function btrfs_check_zoned_mode() to check if
>>>>>>> ZONED flag is enabled on the file system and if the file system consists of
>>>>>>> zoned devices with equal zone size.
>>>>>>>
>>>>>>> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>>>>>>> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
>>>>>>> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
>>>>>>> Reviewed-by: Josef Bacik <josef@toxicpanda.com>
>>>>>>> ---
>>>>>>>      fs/btrfs/ctree.h       | 11 ++++++
>>>>>>>      fs/btrfs/dev-replace.c |  7 ++++
>>>>>>>      fs/btrfs/disk-io.c     | 11 ++++++
>>>>>>>      fs/btrfs/super.c       |  1 +
>>>>>>>      fs/btrfs/volumes.c     |  5 +++
>>>>>>>      fs/btrfs/zoned.c       | 81 ++++++++++++++++++++++++++++++++++++++++++
>>>>>>>      fs/btrfs/zoned.h       | 26 ++++++++++++++
>>>>>>>      7 files changed, 142 insertions(+)
>>>>>>>
>>>>>>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
>>>>>>> index aac3d6f4e35b..453f41ca024e 100644
>>>>>>> --- a/fs/btrfs/ctree.h
>>>>>>> +++ b/fs/btrfs/ctree.h
>>>>>>> @@ -948,6 +948,12 @@ struct btrfs_fs_info {
>>>>>>>      	/* Type of exclusive operation running */
>>>>>>>      	unsigned long exclusive_operation;
>>>>>>>      
>>>>>>> +	/* Zone size when in ZONED mode */
>>>>>>> +	union {
>>>>>>> +		u64 zone_size;
>>>>>>> +		u64 zoned;
>>>>>>> +	};
>>>>>>> +
>>>>>>>      #ifdef CONFIG_BTRFS_FS_REF_VERIFY
>>>>>>>      	spinlock_t ref_verify_lock;
>>>>>>>      	struct rb_root block_tree;
>>>>>>> @@ -3595,4 +3601,9 @@ static inline int btrfs_is_testing(struct btrfs_fs_info *fs_info)
>>>>>>>      }
>>>>>>>      #endif
>>>>>>>      
>>>>>>> +static inline bool btrfs_is_zoned(struct btrfs_fs_info *fs_info)
>>>>>>> +{
>>>>>>> +	return fs_info->zoned != 0;
>>>>>>> +}
>>>>>>> +
>>>>>>>      #endif
>>>>>>> diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
>>>>>>> index 6f6d77224c2b..db87f1aa604b 100644
>>>>>>> --- a/fs/btrfs/dev-replace.c
>>>>>>> +++ b/fs/btrfs/dev-replace.c
>>>>>>> @@ -238,6 +238,13 @@ static int btrfs_init_dev_replace_tgtdev(struct btrfs_fs_info *fs_info,
>>>>>>>      		return PTR_ERR(bdev);
>>>>>>>      	}
>>>>>>>      
>>>>>>> +	if (!btrfs_check_device_zone_type(fs_info, bdev)) {
>>>>>>> +		btrfs_err(fs_info,
>>>>>>> +			  "dev-replace: zoned type of target device mismatch with filesystem");
>>>>>>> +		ret = -EINVAL;
>>>>>>> +		goto error;
>>>>>>> +	}
>>>>>>> +
>>>>>>>      	sync_blockdev(bdev);
>>>>>>>      
>>>>>>>      	list_for_each_entry(device, &fs_info->fs_devices->devices, dev_list) {
>>>>>>
>>>>>>      I am not sure if it is done in some other patch. But we still have to
>>>>>>      check for
>>>>>>
>>>>>>      (model == BLK_ZONED_HA && incompat_zoned))
>>>>>
>>>>> Do you really mean BLK_ZONED_HA, ie. host-aware (HA)?
>>>>> btrfs_check_device_zone_type checks for _HM.
>>>>
>>>>
>>>> Still confusing to me. The below function, which is part of this
>>>> patch, says we don't support BLK_ZONED_HM. So does it mean we
>>>> allow BLK_ZONED_HA only?
>>>>
>>>> +static inline bool btrfs_check_device_zone_type(struct btrfs_fs_info
>>>> *fs_info,
>>>> +						struct block_device *bdev)
>>>> +{
>>>> +	u64 zone_size;
>>>> +
>>>> +	if (btrfs_is_zoned(fs_info)) {
>>>> +		zone_size = (u64)bdev_zone_sectors(bdev) << SECTOR_SHIFT;
>>>> +		/* Do not allow non-zoned device */
>>>
>>> This comment does not make sense. It should be:
>>>
>>> 		/* Only allow zoned devices with the same zone size */
>>>
>>>> +		return bdev_is_zoned(bdev) && fs_info->zone_size == zone_size;
>>>> +	}
>>>> +
>>>> +	/* Do not allow Host Manged zoned device */
>>>> +	return bdev_zoned_model(bdev) != BLK_ZONED_HM;
>>>
>>> The comment is also wrong. It should read:
>>>
>>> 	/* Allow only host managed zoned devices */
>>>
>>> This is because we decided to treat host aware devices in the same way as
>>> regular block devices, since HA drives are backward compatible with regular
>>> block devices.
>>
>>
>> Yeah, I read about them, but I still have questions, like: does an FS work
>> on top of a BLK_ZONED_HA device without modification?
> 
> Yes. These drives are fully backward compatible and accept random writes
> anywhere. Performance however is potentially a different story as the drive will
> eventually need to do internal garbage collection of some sort, exactly like an
> SSD, but definitely not at SSD speeds :)
> 
>>    Are we OK to replace an HM device with an HA device? Or to add an HA
>> device to a btrfs on an HM device?
> 
> We have a choice here: we can treat HA drives as regular devices or treat them
> as HM devices. Anything in between does not make sense. I am fine either way,
> the main reason being that there are no HA drives on the market today that I know
> of (this model did not have a lot of success due to the potentially very
> unpredictable performance depending on the use case).
> 
> So the simplest thing to do is, in my opinion, to ignore their "zone"
> characteristics and treat them as regular disks. But treating them as HM drives
> is simple to do too.
>
> Of note is that a host-aware drive will be reported by the block layer as
> BLK_ZONED_HA only as long as the drive does not have any partition. If it does,
> then the block layer will treat the drive as a regular disk.

IMO, for now it is better to check for BLK_ZONED_HA explicitly in a
non-zoned btrfs, and for BLK_ZONED_HM explicitly in a zoned btrfs.
This way, if there is another BLK_ZONED_xx type in the future, we
have the opportunity to review whether to support it. As below [1]...

[1]
bool btrfs_check_device_type()
{
	if (bdev_is_zoned()) {
		if (btrfs_is_zoned()) {
			if (bdev_zoned_model == BLK_ZONED_HM)
				/* also check the zone_size. */
				return true;
		} else {
			if (bdev_zoned_model == BLK_ZONED_HA)
				/* a regular device and FS, no zone_size to check I think? */
				return true;
		}
	} else {
		if (!btrfs_is_zoned())
			return true;
	}

	return false;
}

Thanks.

> 
>>
>> Thanks.
>>
>>>
>>>> +}
>>>>
>>>>
>>>> Also, if there is a new type of zoned device in the future, the older
>>>> kernel should be able to reject the newer zone device types.
>>>>
>>>> And, if possible could you rename above function to
>>>> btrfs_zone_type_is_valid(). Or better.
>>>>
>>>>
>>>>>> right? What if in a non-zoned FS, a zoned device is added through the
>>>>>> replace. No?
>>>>>
>>>>> The types of devices cannot mix, yeah. So I'd like to know the answer as
>>>>> well.
>>>>
>>>>
>>>>>>> --- a/fs/btrfs/volumes.c
>>>>>>> +++ b/fs/btrfs/volumes.c
>>>>>>> @@ -2518,6 +2518,11 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
>>>>>>>      	if (IS_ERR(bdev))
>>>>>>>      		return PTR_ERR(bdev);
>>>>>>>      
>>>>>>> +	if (!btrfs_check_device_zone_type(fs_info, bdev)) {
>>>>>>> +		ret = -EINVAL;
>>>>>>> +		goto error;
>>>>>>> +	}
>>>>>>> +
>>>>>>>      	if (fs_devices->seeding) {
>>>>>>>      		seeding_dev = 1;
>>>>>>>      		down_write(&sb->s_umount);
>>>>>>
>>>>>> Same here too. It can also happen that a zone device is added to a non
>>>>>> zoned fs.
>>>>
>>>>
>>>> Thanks.



* Re: [PATCH v10 05/41] btrfs: check and enable ZONED mode
  2020-12-01  5:53               ` Anand Jain
@ 2020-12-01  6:09                 ` Damien Le Moal
  2020-12-01  7:12                   ` Anand Jain
  0 siblings, 1 reply; 125+ messages in thread
From: Damien Le Moal @ 2020-12-01  6:09 UTC (permalink / raw)
  To: Anand Jain, dsterba, Naohiro Aota, linux-btrfs, dsterba, hare,
	linux-fsdevel, Jens Axboe, hch, Darrick J. Wong,
	Johannes Thumshirn, Josef Bacik

On 2020/12/01 14:54, Anand Jain wrote:
> On 1/12/20 10:29 am, Damien Le Moal wrote:
>> On 2020/12/01 11:20, Anand Jain wrote:
>>> On 30/11/20 9:15 pm, Damien Le Moal wrote:
>>>> On 2020/11/30 21:13, Anand Jain wrote:
>>>>> On 28/11/20 2:44 am, David Sterba wrote:
>>>>>> On Wed, Nov 18, 2020 at 07:29:20PM +0800, Anand Jain wrote:
>>>>>>> On 10/11/20 7:26 pm, Naohiro Aota wrote:
>>>>>>>> This commit introduces the function btrfs_check_zoned_mode() to check if
>>>>>>>> ZONED flag is enabled on the file system and if the file system consists of
>>>>>>>> zoned devices with equal zone size.
>>>>>>>>
>>>>>>>> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>>>>>>>> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
>>>>>>>> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
>>>>>>>> Reviewed-by: Josef Bacik <josef@toxicpanda.com>
>>>>>>>> ---
>>>>>>>>      fs/btrfs/ctree.h       | 11 ++++++
>>>>>>>>      fs/btrfs/dev-replace.c |  7 ++++
>>>>>>>>      fs/btrfs/disk-io.c     | 11 ++++++
>>>>>>>>      fs/btrfs/super.c       |  1 +
>>>>>>>>      fs/btrfs/volumes.c     |  5 +++
>>>>>>>>      fs/btrfs/zoned.c       | 81 ++++++++++++++++++++++++++++++++++++++++++
>>>>>>>>      fs/btrfs/zoned.h       | 26 ++++++++++++++
>>>>>>>>      7 files changed, 142 insertions(+)
>>>>>>>>
>>>>>>>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
>>>>>>>> index aac3d6f4e35b..453f41ca024e 100644
>>>>>>>> --- a/fs/btrfs/ctree.h
>>>>>>>> +++ b/fs/btrfs/ctree.h
>>>>>>>> @@ -948,6 +948,12 @@ struct btrfs_fs_info {
>>>>>>>>      	/* Type of exclusive operation running */
>>>>>>>>      	unsigned long exclusive_operation;
>>>>>>>>      
>>>>>>>> +	/* Zone size when in ZONED mode */
>>>>>>>> +	union {
>>>>>>>> +		u64 zone_size;
>>>>>>>> +		u64 zoned;
>>>>>>>> +	};
>>>>>>>> +
>>>>>>>>      #ifdef CONFIG_BTRFS_FS_REF_VERIFY
>>>>>>>>      	spinlock_t ref_verify_lock;
>>>>>>>>      	struct rb_root block_tree;
>>>>>>>> @@ -3595,4 +3601,9 @@ static inline int btrfs_is_testing(struct btrfs_fs_info *fs_info)
>>>>>>>>      }
>>>>>>>>      #endif
>>>>>>>>      
>>>>>>>> +static inline bool btrfs_is_zoned(struct btrfs_fs_info *fs_info)
>>>>>>>> +{
>>>>>>>> +	return fs_info->zoned != 0;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>>      #endif
>>>>>>>> diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
>>>>>>>> index 6f6d77224c2b..db87f1aa604b 100644
>>>>>>>> --- a/fs/btrfs/dev-replace.c
>>>>>>>> +++ b/fs/btrfs/dev-replace.c
>>>>>>>> @@ -238,6 +238,13 @@ static int btrfs_init_dev_replace_tgtdev(struct btrfs_fs_info *fs_info,
>>>>>>>>      		return PTR_ERR(bdev);
>>>>>>>>      	}
>>>>>>>>      
>>>>>>>> +	if (!btrfs_check_device_zone_type(fs_info, bdev)) {
>>>>>>>> +		btrfs_err(fs_info,
>>>>>>>> +			  "dev-replace: zoned type of target device mismatch with filesystem");
>>>>>>>> +		ret = -EINVAL;
>>>>>>>> +		goto error;
>>>>>>>> +	}
>>>>>>>> +
>>>>>>>>      	sync_blockdev(bdev);
>>>>>>>>      
>>>>>>>>      	list_for_each_entry(device, &fs_info->fs_devices->devices, dev_list) {
>>>>>>>
>>>>>>>      I am not sure if it is done in some other patch. But we still have to
>>>>>>>      check for
>>>>>>>
>>>>>>>      (model == BLK_ZONED_HA && incompat_zoned))
>>>>>>
>>>>>> Do you really mean BLK_ZONED_HA, ie. host-aware (HA)?
>>>>>> btrfs_check_device_zone_type checks for _HM.
>>>>>
>>>>>
>>>>> Still confusing to me. The below function, which is part of this
>>>>> patch, says we don't support BLK_ZONED_HM. So does it mean we
>>>>> allow BLK_ZONED_HA only?
>>>>>
>>>>> +static inline bool btrfs_check_device_zone_type(struct btrfs_fs_info
>>>>> *fs_info,
>>>>> +						struct block_device *bdev)
>>>>> +{
>>>>> +	u64 zone_size;
>>>>> +
>>>>> +	if (btrfs_is_zoned(fs_info)) {
>>>>> +		zone_size = (u64)bdev_zone_sectors(bdev) << SECTOR_SHIFT;
>>>>> +		/* Do not allow non-zoned device */
>>>>
>>>> This comment does not make sense. It should be:
>>>>
>>>> 		/* Only allow zoned devices with the same zone size */
>>>>
>>>>> +		return bdev_is_zoned(bdev) && fs_info->zone_size == zone_size;
>>>>> +	}
>>>>> +
>>>>> +	/* Do not allow Host Manged zoned device */
>>>>> +	return bdev_zoned_model(bdev) != BLK_ZONED_HM;
>>>>
>>>> The comment is also wrong. It should read:
>>>>
>>>> 	/* Allow only host managed zoned devices */
>>>>
>>>> This is because we decided to treat host aware devices in the same way as
>>>> regular block devices, since HA drives are backward compatible with regular
>>>> block devices.
>>>
>>>
>>> Yeah, I read about them, but I still have questions, like: does an FS work
>>> on top of a BLK_ZONED_HA device without modification?
>>
>> Yes. These drives are fully backward compatible and accept random writes
>> anywhere. Performance however is potentially a different story as the drive will
>> eventually need to do internal garbage collection of some sort, exactly like an
>> SSD, but definitely not at SSD speeds :)
>>
>>>    Are we OK to replace an HM device with an HA device? Or to add an HA
>>> device to a btrfs on an HM device?
>>
>> We have a choice here: we can treat HA drives as regular devices or treat them
>> as HM devices. Anything in between does not make sense. I am fine either way,
>> the main reason being that there are no HA drives on the market today that I know
>> of (this model did not have a lot of success due to the potentially very
>> unpredictable performance depending on the use case).
>>
>> So the simplest thing to do is, in my opinion, to ignore their "zone"
>> characteristics and treat them as regular disks. But treating them as HM drives
>> is simple to do too.
>>
>> Of note is that a host-aware drive will be reported by the block layer as
>> BLK_ZONED_HA only as long as the drive does not have any partition. If it does,
>> then the block layer will treat the drive as a regular disk.
> 
> IMO, for now it is better to check for BLK_ZONED_HA explicitly in a
> non-zoned btrfs, and for BLK_ZONED_HM explicitly in a zoned btrfs.

Sure, we can. But since HA drives are backward compatible, I am not sure the
HA check for non-zoned makes sense. As long as the zoned flag is not set, the
drive can be used like a regular disk. If the user really wants to use it as a
zoned drive, they can format it with force, selecting the zoned flag in the
btrfs super. Then the HA drive will be used as a zoned disk, exactly like HM disks.

> This way, if there is another BLK_ZONED_xx type in the future, we
> have the opportunity to review whether to support it. As below [1]...

It is very unlikely that we will see any other zone model. ZNS adopted the HM
model on purpose, to avoid multiplying the possible models, which would make
the ecosystem effort a nightmare.

> 
> [1]
> bool btrfs_check_device_type()
> {
> 	if (bdev_is_zoned()) {
> 		if (btrfs_is_zoned()) {
> 			if (bdev_zoned_model == BLK_ZONED_HM)
> 				/* also check the zone_size. */
> 				return true;
> 		} else {
> 			if (bdev_zoned_model == BLK_ZONED_HA)
> 				/* a regular device and FS, no zone_size to check I think? */
> 				return true;
> 		}
> 	} else {
> 		if (!btrfs_is_zoned())
> 			return true;
> 	}
> 
> 	return false;
> }
> 
> Thanks.

Works for me. Maybe reverse the conditions to make things easier to read and
understand:

bool btrfs_check_device_type()
{
	if (btrfs_is_zoned()) {
		if (bdev_is_zoned()) {
			/* also check the zone_size. */
			return true;
		}

		/*
		 * Regular device: emulate zones with zone size equal
		 * to device extent size.
		 */
		return true;
	}

	if (bdev_zoned_model == BLK_ZONED_HM) {
	/* Zoned HM devices require a zoned btrfs */
		return false;
	}

	/* Regular device or zoned HA device used as a regular device */
	return true;
}


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH v10 05/41] btrfs: check and enable ZONED mode
  2020-12-01  6:09                 ` Damien Le Moal
@ 2020-12-01  7:12                   ` Anand Jain
  0 siblings, 0 replies; 125+ messages in thread
From: Anand Jain @ 2020-12-01  7:12 UTC (permalink / raw)
  To: Damien Le Moal, dsterba, Naohiro Aota, linux-btrfs, dsterba,
	hare, linux-fsdevel, Jens Axboe, hch, Darrick J. Wong,
	Johannes Thumshirn, Josef Bacik



On 1/12/20 2:09 pm, Damien Le Moal wrote:
> On 2020/12/01 14:54, Anand Jain wrote:
>> On 1/12/20 10:29 am, Damien Le Moal wrote:
>>> On 2020/12/01 11:20, Anand Jain wrote:
>>>> On 30/11/20 9:15 pm, Damien Le Moal wrote:
>>>>> On 2020/11/30 21:13, Anand Jain wrote:
>>>>>> On 28/11/20 2:44 am, David Sterba wrote:
>>>>>>> On Wed, Nov 18, 2020 at 07:29:20PM +0800, Anand Jain wrote:
>>>>>>>> On 10/11/20 7:26 pm, Naohiro Aota wrote:
>>>>>>>>> This commit introduces the function btrfs_check_zoned_mode() to check if
>>>>>>>>> ZONED flag is enabled on the file system and if the file system consists of
>>>>>>>>> zoned devices with equal zone size.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>>>>>>>>> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
>>>>>>>>> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
>>>>>>>>> Reviewed-by: Josef Bacik <josef@toxicpanda.com>
>>>>>>>>> ---
>>>>>>>>>       fs/btrfs/ctree.h       | 11 ++++++
>>>>>>>>>       fs/btrfs/dev-replace.c |  7 ++++
>>>>>>>>>       fs/btrfs/disk-io.c     | 11 ++++++
>>>>>>>>>       fs/btrfs/super.c       |  1 +
>>>>>>>>>       fs/btrfs/volumes.c     |  5 +++
>>>>>>>>>       fs/btrfs/zoned.c       | 81 ++++++++++++++++++++++++++++++++++++++++++
>>>>>>>>>       fs/btrfs/zoned.h       | 26 ++++++++++++++
>>>>>>>>>       7 files changed, 142 insertions(+)
>>>>>>>>>
>>>>>>>>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
>>>>>>>>> index aac3d6f4e35b..453f41ca024e 100644
>>>>>>>>> --- a/fs/btrfs/ctree.h
>>>>>>>>> +++ b/fs/btrfs/ctree.h
>>>>>>>>> @@ -948,6 +948,12 @@ struct btrfs_fs_info {
>>>>>>>>>       	/* Type of exclusive operation running */
>>>>>>>>>       	unsigned long exclusive_operation;
>>>>>>>>>       
>>>>>>>>> +	/* Zone size when in ZONED mode */
>>>>>>>>> +	union {
>>>>>>>>> +		u64 zone_size;
>>>>>>>>> +		u64 zoned;
>>>>>>>>> +	};
>>>>>>>>> +
>>>>>>>>>       #ifdef CONFIG_BTRFS_FS_REF_VERIFY
>>>>>>>>>       	spinlock_t ref_verify_lock;
>>>>>>>>>       	struct rb_root block_tree;
>>>>>>>>> @@ -3595,4 +3601,9 @@ static inline int btrfs_is_testing(struct btrfs_fs_info *fs_info)
>>>>>>>>>       }
>>>>>>>>>       #endif
>>>>>>>>>       
>>>>>>>>> +static inline bool btrfs_is_zoned(struct btrfs_fs_info *fs_info)
>>>>>>>>> +{
>>>>>>>>> +	return fs_info->zoned != 0;
>>>>>>>>> +}
>>>>>>>>> +
>>>>>>>>>       #endif
>>>>>>>>> diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
>>>>>>>>> index 6f6d77224c2b..db87f1aa604b 100644
>>>>>>>>> --- a/fs/btrfs/dev-replace.c
>>>>>>>>> +++ b/fs/btrfs/dev-replace.c
>>>>>>>>> @@ -238,6 +238,13 @@ static int btrfs_init_dev_replace_tgtdev(struct btrfs_fs_info *fs_info,
>>>>>>>>>       		return PTR_ERR(bdev);
>>>>>>>>>       	}
>>>>>>>>>       
>>>>>>>>> +	if (!btrfs_check_device_zone_type(fs_info, bdev)) {
>>>>>>>>> +		btrfs_err(fs_info,
>>>>>>>>> +			  "dev-replace: zoned type of target device mismatch with filesystem");
>>>>>>>>> +		ret = -EINVAL;
>>>>>>>>> +		goto error;
>>>>>>>>> +	}
>>>>>>>>> +
>>>>>>>>>       	sync_blockdev(bdev);
>>>>>>>>>       
>>>>>>>>>       	list_for_each_entry(device, &fs_info->fs_devices->devices, dev_list) {
>>>>>>>>
>>>>>>>>       I am not sure if it is done in some other patch. But we still have to
>>>>>>>>       check for
>>>>>>>>
>>>>>>>>       (model == BLK_ZONED_HA && incompat_zoned))
>>>>>>>
>>>>>>> Do you really mean BLK_ZONED_HA, ie. host-aware (HA)?
>>>>>>> btrfs_check_device_zone_type checks for _HM.
>>>>>>
>>>>>>
>>>>>> Still confusing to me. The below function, which is part of this
>>>>>> patch, says we don't support BLK_ZONED_HM. So does it mean we
>>>>>> allow BLK_ZONED_HA only?
>>>>>>
>>>>>> +static inline bool btrfs_check_device_zone_type(struct btrfs_fs_info
>>>>>> *fs_info,
>>>>>> +						struct block_device *bdev)
>>>>>> +{
>>>>>> +	u64 zone_size;
>>>>>> +
>>>>>> +	if (btrfs_is_zoned(fs_info)) {
>>>>>> +		zone_size = (u64)bdev_zone_sectors(bdev) << SECTOR_SHIFT;
>>>>>> +		/* Do not allow non-zoned device */
>>>>>
>>>>> This comment does not make sense. It should be:
>>>>>
>>>>> 		/* Only allow zoned devices with the same zone size */
>>>>>
>>>>>> +		return bdev_is_zoned(bdev) && fs_info->zone_size == zone_size;
>>>>>> +	}
>>>>>> +
>>>>>> +	/* Do not allow Host Manged zoned device */
>>>>>> +	return bdev_zoned_model(bdev) != BLK_ZONED_HM;
>>>>>
>>>>> The comment is also wrong. It should read:
>>>>>
>>>>> 	/* Allow only host managed zoned devices */
>>>>>
>>>>> This is because we decided to treat host aware devices in the same way as
>>>>> regular block devices, since HA drives are backward compatible with regular
>>>>> block devices.
>>>>
>>>>
>>>> Yeah, I read about them, but I have questions like: does an FS work on top
>>>> of a BLK_ZONED_HA drive without modification?
>>>
>>> Yes. These drives are fully backward compatible and accept random writes
>>> anywhere. Performance however is potentially a different story as the drive will
>>> eventually need to do internal garbage collection of some sort, exactly like an
>>> SSD, but definitely not at SSD speeds :)
>>>
>>>>     Are we ok to replace an HM device with a HA device? Or add a HA device
>>>> to a btrfs on an HM device.
>>>
>>> We have a choice here: we can treat HA drives as regular devices or treat them
>>> as HM devices. Anything in between does not make sense. I am fine either way,
>>> the main reason being that there are no HA drive on the market today that I know
>>> of (this model did not have a lot of success due to the potentially very
>>> unpredictable performance depending on the use case).
>>>
>>> So the simplest thing to do is, in my opinion, to ignore their "zone"
>>> characteristics and treat them as regular disks. But treating them as HM drives
>>> is simple to do too.
>>>> Of note is that a host-aware drive will be reported by the block layer as
>>> BLK_ZONED_HA only as long as the drive does not have any partition. If it does,
>>> then the block layer will treat the drive as a regular disk.
>>
>> IMO. For now, it is better to check for the BLK_ZONED_HA explicitly in a
>> non-zoned-btrfs. And check for BLK_ZONED_HM explicitly in a zoned-btrfs.
> 
> Sure, we can. But since HA drives are backward compatible, not sure the HA check
> for non-zoned make sense. As long as the zoned flag is not set, the drive can be
> used like a regular disk. If the user really want to use it as a zoned drive,
> then it can format with force selecting the zoned flag in btrfs super. Then the
> HA drive will be used as a zoned disk, exactly like HM disks.
> 
>> This way, if there is another type of BLK_ZONED_xx in the future, we
>> have the opportunity to review to support it. As below [1]...
> 
> It is very unlikely that we will see any other zone model. ZNS adopted the HM
> model in purpose, to avoid multiplying the possible models, making the ecosystem
> effort a nightmare.
> 
>>
>> [1]
>> bool btrfs_check_device_type()
>> {
>> 	if (bdev_is_zoned()) {
>> 		if (btrfs_is_zoned())
>> 			if (bdev_zoned_model == BLK_ZONED_HM)
>> 			/* also check the zone_size. */
>> 				return true;
>> 		else
>> 			if (bdev_zoned_model == BLK_ZONED_HA)
>> 			/* a regular device and FS, no zone_size to check I think? */
>> 				return true;
>> 	} else {
>> 		if (!btrfs_is_zoned())
>> 			return true
>> 	}
>>
>> 	return false;
>> }
>>
>> Thanks.
> 
> Works for me. May be reverse the conditions to make things easier to read and
> understand:
> 
> bool btrfs_check_device_type()
> {
> 	if (btrfs_is_zoned()) {
> 		if (bdev_is_zoned()) {
> 			/* also check the zone_size. */
> 			return true;
> 		}
> 
> 		/*
> 		 * Regular device: emulate zones with zone size equal
> 		 * to device extent size.
> 		 */
> 		return true;
> 	}
> 
> 	if (bdev_zoned_model == BLK_ZONED_HM) {
> 		/* Zoned HM device require zoned btrfs */
> 		return false;
> 	}
> 
> 	/* Regular device or zoned HA device used as a regular device */
> 	return true;
> }
> 
> 

Yeah. Makes sense.

Thanks.

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH v10 02/41] iomap: support REQ_OP_ZONE_APPEND
  2020-11-30 18:11   ` Darrick J. Wong
@ 2020-12-01 10:16     ` Johannes Thumshirn
  0 siblings, 0 replies; 125+ messages in thread
From: Johannes Thumshirn @ 2020-12-01 10:16 UTC (permalink / raw)
  To: Darrick J. Wong, Naohiro Aota
  Cc: linux-btrfs, dsterba, hare, linux-fsdevel, Jens Axboe, hch

On 30/11/2020 19:16, Darrick J. Wong wrote:
> On Tue, Nov 10, 2020 at 08:26:05PM +0900, Naohiro Aota wrote:
>> A ZONE_APPEND bio must follow hardware restrictions (e.g. not exceeding
>> max_zone_append_sectors) not to be split. bio_iov_iter_get_pages builds
>> such restricted bio using __bio_iov_append_get_pages if bio_op(bio) ==
>> REQ_OP_ZONE_APPEND.
>>
>> To utilize it, we need to set the bio_op before calling
>> bio_iov_iter_get_pages(). This commit introduces IOMAP_F_ZONE_APPEND, so
>> that iomap user can set the flag to indicate they want REQ_OP_ZONE_APPEND
>> and restricted bio.
>>
>> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
> 
> Christoph's answers seem reasonable to me, so
> 
> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Er... do you want me to take this one via the iomap tree?

Ah, probably yes, but we have an update addressing Christoph's comments
coming with the next version of the series.

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH v10 05/41] btrfs: check and enable ZONED mode
  2020-12-01  2:29             ` Damien Le Moal
  2020-12-01  5:53               ` Anand Jain
@ 2020-12-01 10:45               ` Graham Cobb
  2020-12-01 11:03                 ` Damien Le Moal
  1 sibling, 1 reply; 125+ messages in thread
From: Graham Cobb @ 2020-12-01 10:45 UTC (permalink / raw)
  To: Damien Le Moal, Anand Jain, dsterba, Naohiro Aota, linux-btrfs,
	dsterba, hare, linux-fsdevel, Jens Axboe, hch, Darrick J. Wong,
	Johannes Thumshirn, Josef Bacik

On 01/12/2020 02:29, Damien Le Moal wrote:
> Yes. These drives are fully backward compatible and accept random writes
> anywhere. Performance however is potentially a different story as the drive will
> eventually need to do internal garbage collection of some sort, exactly like an
> SSD, but definitely not at SSD speeds :)
> 
>>   Are we ok to replace an HM device with a HA device? Or add a HA device 
>> to a btrfs on an HM device.
> 
> We have a choice here: we can treat HA drives as regular devices or treat them
> as HM devices. Anything in between does not make sense. I am fine either way,
> the main reason being that there are no HA drive on the market today that I know
> of (this model did not have a lot of success due to the potentially very
> unpredictable performance depending on the use case).

So there will be no testing against HA drives? And no btrfs developers
will have one? And they have very different timing and possibly failure
modes from "normal" disks when they do GC?

I think there is no option but to disallow them. If HA drives start to
appear in significant numbers then that would be easy enough to change,
after suitable testing.

> Of note is that a host-aware drive will be reported by the block layer as
> BLK_ZONED_HA only as long as the drive does not have any partition. If it does,
> then the block layer will treat the drive as a regular disk.

That is a bit of a shame. With that unfortunate decision in the block
layer, system managers need to realise that partitioning an HA disk
means they may be entering territory untested by their filesystem.

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH v10 05/41] btrfs: check and enable ZONED mode
  2020-12-01 10:45               ` Graham Cobb
@ 2020-12-01 11:03                 ` Damien Le Moal
  2020-12-01 11:11                   ` hch
  0 siblings, 1 reply; 125+ messages in thread
From: Damien Le Moal @ 2020-12-01 11:03 UTC (permalink / raw)
  To: Graham Cobb, Anand Jain, dsterba, Naohiro Aota, linux-btrfs,
	dsterba, hare, linux-fsdevel, Jens Axboe, hch, Darrick J. Wong,
	Johannes Thumshirn, Josef Bacik

On 2020/12/01 19:45, Graham Cobb wrote:
> On 01/12/2020 02:29, Damien Le Moal wrote:
>> Yes. These drives are fully backward compatible and accept random writes
>> anywhere. Performance however is potentially a different story as the drive will
>> eventually need to do internal garbage collection of some sort, exactly like an
>> SSD, but definitely not at SSD speeds :)
>>
>>>   Are we ok to replace an HM device with a HA device? Or add a HA device 
>>> to a btrfs on an HM device.
>>
>> We have a choice here: we can treat HA drives as regular devices or treat them
>> as HM devices. Anything in between does not make sense. I am fine either way,
>> the main reason being that there are no HA drive on the market today that I know
>> of (this model did not have a lot of success due to the potentially very
>> unpredictable performance depending on the use case).
> 
> So there will be no testing against HA drives? And no btrfs developers
> will have one? And they have very different timing and possibly failure
> modes from "normal" disks when they do GC?
> 
> I think there is no option but to disallow them. If HA drives start to
> appear in significant numbers then that would be easy enough to change,
> after suitable testing.

Works for me. Even simpler :)

> 
>> Of note is that a host-aware drive will be reported by the block layer as
>> BLK_ZONED_HA only as long as the drive does not have any partition. If it does,
>> then the block layer will treat the drive as a regular disk.
> 
> That is a bit of a shame. With that unfortunate decision in the block
> layer, system managers need to realise that partitioning an HA disk
> means they may be entering territory untested by their filesystem.

Well, not really. HA drives are, per the specifications, backward compatible.
If they are partitioned, the block layer will force regular drive mode,
hiding their zoned interface, which is completely optional to use in the
first place.

If by "untested territory" you mean the possibility of hitting drive FW bugs
coming from the added complexity of internal GC, then I would argue that this is
a common territory for any FS on any drive, especially SSDs: device FW bugs do
exist and show up from time to time, even on the simplest of drives.


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH v10 05/41] btrfs: check and enable ZONED mode
  2020-12-01 11:03                 ` Damien Le Moal
@ 2020-12-01 11:11                   ` hch
  2020-12-01 11:27                     ` Damien Le Moal
  0 siblings, 1 reply; 125+ messages in thread
From: hch @ 2020-12-01 11:11 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Graham Cobb, Anand Jain, dsterba, Naohiro Aota, linux-btrfs,
	dsterba, hare, linux-fsdevel, Jens Axboe, hch, Darrick J. Wong,
	Johannes Thumshirn, Josef Bacik

On Tue, Dec 01, 2020 at 11:03:35AM +0000, Damien Le Moal wrote:
> Well, not really. HA drives, per specifications, are backward compatible. If
> they are partitioned, the block layer will force a regular drive mode use,
> hiding their zoned interface, which is completely optional to use in the first
> place.
> 
> If by "untested territory" you mean the possibility of hitting drive FW bugs
> coming from the added complexity of internal GC, then I would argue that this is
> a common territory for any FS on any drive, especially SSDs: device FW bugs do
> exist and show up from time to time, even on the simplest of drives.

Also note that most cheaper hard drives are using SMR internally, so
you'll run into the same GC "issues" as with an HA drive that is used for
random writes.

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH v10 05/41] btrfs: check and enable ZONED mode
  2020-12-01 11:11                   ` hch
@ 2020-12-01 11:27                     ` Damien Le Moal
  0 siblings, 0 replies; 125+ messages in thread
From: Damien Le Moal @ 2020-12-01 11:27 UTC (permalink / raw)
  To: hch
  Cc: Graham Cobb, Anand Jain, dsterba, Naohiro Aota, linux-btrfs,
	dsterba, hare, linux-fsdevel, Jens Axboe, Darrick J. Wong,
	Johannes Thumshirn, Josef Bacik

On 2020/12/01 20:11, hch@infradead.org wrote:
> On Tue, Dec 01, 2020 at 11:03:35AM +0000, Damien Le Moal wrote:
>> Well, not really. HA drives, per specifications, are backward compatible. If
>> they are partitioned, the block layer will force a regular drive mode use,
>> hiding their zoned interface, which is completely optional to use in the first
>> place.
>>
>> If by "untested territory" you mean the possibility of hitting drive FW bugs
>> coming from the added complexity of internal GC, then I would argue that this is
>> a common territory for any FS on any drive, especially SSDs: device FW bugs do
>> exist and show up from time to time, even on the simplest of drives.
> 
> Also note that most cheaper hard drives are using SMR internally,
> you'll run into the same GC "issues" as with a HA drive that is used for
> random writes.

Indeed. Good point!

-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH v10 14/41] btrfs: load zone's alloction offset
  2020-11-10 11:26 ` [PATCH v10 14/41] btrfs: load zone's alloction offset Naohiro Aota
@ 2020-12-08  9:54   ` Anand Jain
  0 siblings, 0 replies; 125+ messages in thread
From: Anand Jain @ 2020-12-08  9:54 UTC (permalink / raw)
  To: Naohiro Aota, linux-btrfs, dsterba
  Cc: hare, linux-fsdevel, Jens Axboe, Christoph Hellwig,
	Darrick J. Wong, Josef Bacik

On 10/11/20 7:26 pm, Naohiro Aota wrote:
> Zoned btrfs must allocate blocks at the zones' write pointer. The device's
> write pointer position can be mapped to a logical address within a block
> group. This commit adds "alloc_offset" to track the logical address.
> 
> This logical address is populated in btrfs_load_block_group_zone_info()
> from write pointers of corresponding zones.
> 
> For now, zoned btrfs only support the SINGLE profile. Supporting non-SINGLE
> profile with zone append writing is not trivial. For example, in the DUP
> profile, we send a zone append writing IO to two zones on a device. The
> device reply with written LBAs for the IOs. If the offsets of the returned
> addresses from the beginning of the zone are different, then it results in
> different logical addresses.
> 
> We need fine-grained logical to physical mapping to support such separated
> physical address issue. Since it should require additional metadata type,
> disable non-SINGLE profiles for now.
> 
> This commit supports the case all the zones in a block group are
> sequential. The next patch will handle the case having a conventional zone.
> 
> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
> Reviewed-by: Josef Bacik <josef@toxicpanda.com>
> ---
>   fs/btrfs/block-group.c |  15 ++++
>   fs/btrfs/block-group.h |   6 ++
>   fs/btrfs/zoned.c       | 154 +++++++++++++++++++++++++++++++++++++++++
>   fs/btrfs/zoned.h       |   7 ++
>   4 files changed, 182 insertions(+)
> 
> diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
> index 6b4831824f51..ffc64dfbe09e 100644
> --- a/fs/btrfs/block-group.c
> +++ b/fs/btrfs/block-group.c
> @@ -15,6 +15,7 @@
>   #include "delalloc-space.h"
>   #include "discard.h"
>   #include "raid56.h"
> +#include "zoned.h"
>   
>   /*
>    * Return target flags in extended format or 0 if restripe for this chunk_type
> @@ -1935,6 +1936,13 @@ static int read_one_block_group(struct btrfs_fs_info *info,
>   			goto error;
>   	}
>   
> +	ret = btrfs_load_block_group_zone_info(cache);
> +	if (ret) {
> +		btrfs_err(info, "zoned: failed to load zone info of bg %llu",
> +			  cache->start);
> +		goto error;
> +	}
> +
>   	/*
>   	 * We need to exclude the super stripes now so that the space info has
>   	 * super bytes accounted for, otherwise we'll think we have more space
> @@ -2161,6 +2169,13 @@ int btrfs_make_block_group(struct btrfs_trans_handle *trans, u64 bytes_used,
>   	cache->last_byte_to_unpin = (u64)-1;
>   	cache->cached = BTRFS_CACHE_FINISHED;
>   	cache->needs_free_space = 1;
> +
> +	ret = btrfs_load_block_group_zone_info(cache);
> +	if (ret) {
> +		btrfs_put_block_group(cache);
> +		return ret;
> +	}
> +
>   	ret = exclude_super_stripes(cache);
>   	if (ret) {
>   		/* We may have excluded something, so call this just in case */
> diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h
> index adfd7583a17b..14e3043c9ce7 100644
> --- a/fs/btrfs/block-group.h
> +++ b/fs/btrfs/block-group.h
> @@ -183,6 +183,12 @@ struct btrfs_block_group {
>   
>   	/* Record locked full stripes for RAID5/6 block group */
>   	struct btrfs_full_stripe_locks_tree full_stripe_locks_root;
> +
> +	/*
> +	 * Allocation offset for the block group to implement sequential
> +	 * allocation. This is used only with ZONED mode enabled.
> +	 */
> +	u64 alloc_offset;
>   };
>   
>   static inline u64 btrfs_block_group_end(struct btrfs_block_group *block_group)
> diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
> index ed5de1c138d7..69d3412c4fef 100644
> --- a/fs/btrfs/zoned.c
> +++ b/fs/btrfs/zoned.c
> @@ -3,14 +3,20 @@
>   #include <linux/bitops.h>
>   #include <linux/slab.h>
>   #include <linux/blkdev.h>
> +#include <linux/sched/mm.h>
>   #include "ctree.h"
>   #include "volumes.h"
>   #include "zoned.h"
>   #include "rcu-string.h"
>   #include "disk-io.h"
> +#include "block-group.h"
>   
>   /* Maximum number of zones to report per blkdev_report_zones() call */
>   #define BTRFS_REPORT_NR_ZONES   4096
> +/* Invalid allocation pointer value for missing devices */
> +#define WP_MISSING_DEV ((u64)-1)
> +/* Pseudo write pointer value for conventional zone */
> +#define WP_CONVENTIONAL ((u64)-2)
>   
>   /* Number of superblock log zones */
>   #define BTRFS_NR_SB_LOG_ZONES 2
> @@ -777,3 +783,151 @@ int btrfs_ensure_empty_zones(struct btrfs_device *device, u64 start, u64 size)
>   
>   	return 0;
>   }
> +
> +int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache)
> +{
> +	struct btrfs_fs_info *fs_info = cache->fs_info;
> +	struct extent_map_tree *em_tree = &fs_info->mapping_tree;
> +	struct extent_map *em;
> +	struct map_lookup *map;
> +	struct btrfs_device *device;
> +	u64 logical = cache->start;
> +	u64 length = cache->length;
> +	u64 physical = 0;
> +	int ret;
> +	int i;
> +	unsigned int nofs_flag;
> +	u64 *alloc_offsets = NULL;
> +	u32 num_sequential = 0, num_conventional = 0;
> +
> +	if (!btrfs_is_zoned(fs_info))
> +		return 0;
> +
> +	/* Sanity check */
> +	if (!IS_ALIGNED(length, fs_info->zone_size)) {
> +		btrfs_err(fs_info, "zoned: block group %llu len %llu unaligned to zone size %llu",
> +			  logical, length, fs_info->zone_size);
> +		return -EIO;
> +	}
> +
> +	/* Get the chunk mapping */
> +	read_lock(&em_tree->lock);
> +	em = lookup_extent_mapping(em_tree, logical, length);
> +	read_unlock(&em_tree->lock);
> +
> +	if (!em)
> +		return -EINVAL;
> +
> +	map = em->map_lookup;
> +
> +	/*
> +	 * Get the zone type: if the group is mapped to a non-sequential zone,
> +	 * there is no need for the allocation offset (fit allocation is OK).
> +	 */
> +	alloc_offsets = kcalloc(map->num_stripes, sizeof(*alloc_offsets),
> +				GFP_NOFS);
> +	if (!alloc_offsets) {
> +		free_extent_map(em);
> +		return -ENOMEM;
> +	}
> +
> +	for (i = 0; i < map->num_stripes; i++) {
> +		bool is_sequential;
> +		struct blk_zone zone;
> +
> +		device = map->stripes[i].dev;
> +		physical = map->stripes[i].physical;
> +
> +		if (device->bdev == NULL) {
> +			alloc_offsets[i] = WP_MISSING_DEV;
> +			continue;
> +		}
> +
> +		is_sequential = btrfs_dev_is_sequential(device, physical);
> +		if (is_sequential)
> +			num_sequential++;
> +		else
> +			num_conventional++;
> +
> +		if (!is_sequential) {
> +			alloc_offsets[i] = WP_CONVENTIONAL;
> +			continue;
> +		}
> +
> +		/*
> +		 * This zone will be used for allocation, so mark this
> +		 * zone non-empty.
> +		 */
> +		btrfs_dev_clear_zone_empty(device, physical);
> +
> +		/*
> +		 * The group is mapped to a sequential zone. Get the zone write
> +		 * pointer to determine the allocation offset within the zone.
> +		 */
> +		WARN_ON(!IS_ALIGNED(physical, fs_info->zone_size));
> +		nofs_flag = memalloc_nofs_save();
> +		ret = btrfs_get_dev_zone(device, physical, &zone);
> +		memalloc_nofs_restore(nofs_flag);
> +		if (ret == -EIO || ret == -EOPNOTSUPP) {
> +			ret = 0;
> +			alloc_offsets[i] = WP_MISSING_DEV;
> +			continue;
> +		} else if (ret) {
> +			goto out;
> +		}
> +
> +		switch (zone.cond) {
> +		case BLK_ZONE_COND_OFFLINE:
> +		case BLK_ZONE_COND_READONLY:
> +			btrfs_err(fs_info, "zoned: offline/readonly zone %llu on device %s (devid %llu)",
> +				  physical >> device->zone_info->zone_size_shift,
> +				  rcu_str_deref(device->name), device->devid);
> +			alloc_offsets[i] = WP_MISSING_DEV;
> +			break;
> +		case BLK_ZONE_COND_EMPTY:
> +			alloc_offsets[i] = 0;
> +			break;
> +		case BLK_ZONE_COND_FULL:
> +			alloc_offsets[i] = fs_info->zone_size;
> +			break;
> +		default:
> +			/* Partially used zone */
> +			alloc_offsets[i] =
> +				((zone.wp - zone.start) << SECTOR_SHIFT);
> +			break;
> +		}
> +	}
> +
> +	if (num_conventional > 0) {
> +		/*
> +		 * Since conventional zones do not have a write pointer, we
> +		 * cannot determine alloc_offset from the pointer
> +		 */
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +
> +	switch (map->type & BTRFS_BLOCK_GROUP_PROFILE_MASK) {
> +	case 0: /* single */
> +		cache->alloc_offset = alloc_offsets[0];
> +		break;
> +	case BTRFS_BLOCK_GROUP_DUP:
> +	case BTRFS_BLOCK_GROUP_RAID1:
> +	case BTRFS_BLOCK_GROUP_RAID0:
> +	case BTRFS_BLOCK_GROUP_RAID10:
> +	case BTRFS_BLOCK_GROUP_RAID5:
> +	case BTRFS_BLOCK_GROUP_RAID6:
> +		/* non-SINGLE profiles are not supported yet */
> +	default:
> +		btrfs_err(fs_info, "zoned: profile %s not supported",
> +			  btrfs_bg_type_to_raid_name(map->type));
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +
> +out:
> +	kfree(alloc_offsets);
> +	free_extent_map(em);
> +
> +	return ret;
> +}
> diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h
> index ec2391c52d8b..e3338a2f1be9 100644
> --- a/fs/btrfs/zoned.h
> +++ b/fs/btrfs/zoned.h
> @@ -40,6 +40,7 @@ u64 btrfs_find_allocatable_zones(struct btrfs_device *device, u64 hole_start,
>   int btrfs_reset_device_zone(struct btrfs_device *device, u64 physical,
>   			    u64 length, u64 *bytes);
>   int btrfs_ensure_empty_zones(struct btrfs_device *device, u64 start, u64 size);
> +int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache);
>   #else /* CONFIG_BLK_DEV_ZONED */
>   static inline int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
>   				     struct blk_zone *zone)
> @@ -112,6 +113,12 @@ static inline int btrfs_ensure_empty_zones(struct btrfs_device *device,
>   	return 0;
>   }
>   
> +static inline int btrfs_load_block_group_zone_info(
> +	struct btrfs_block_group *cache)
> +{
> +	return 0;
> +}
> +
>   #endif
>   
>   static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)
> 


looks good.

Reviewed-by: Anand Jain <anand.jain@oracle.com>



^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH v10 12/41] btrfs: implement zoned chunk allocator
  2020-11-10 11:26 ` [PATCH v10 12/41] btrfs: implement zoned chunk allocator Naohiro Aota
  2020-11-24 11:36   ` Anand Jain
@ 2020-12-09  5:27   ` Anand Jain
  1 sibling, 0 replies; 125+ messages in thread
From: Anand Jain @ 2020-12-09  5:27 UTC (permalink / raw)
  To: Naohiro Aota, linux-btrfs, dsterba
  Cc: hare, linux-fsdevel, Jens Axboe, Christoph Hellwig, Darrick J. Wong




> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index db884b96a5ea..7831cf6c6da4 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -1416,6 +1416,21 @@ static bool contains_pending_extent(struct btrfs_device *device, u64 *start,
>   	return false;
>   }
>   
> +static inline u64 dev_extent_search_start_zoned(struct btrfs_device *device,
> +						u64 start)
> +{
> +	u64 tmp;
> +
> +	if (device->zone_info->zone_size > SZ_1M)
> +		tmp = device->zone_info->zone_size;
> +	else
> +		tmp = SZ_1M;
> +	if (start < tmp)
> +		start = tmp;
> +
> +	return btrfs_align_offset_to_zone(device, start);
> +}
> +
>   static u64 dev_extent_search_start(struct btrfs_device *device, u64 start)
>   {
>   	switch (device->fs_devices->chunk_alloc_policy) {
> @@ -1426,11 +1441,57 @@ static u64 dev_extent_search_start(struct btrfs_device *device, u64 start)
>   		 * make sure to start at an offset of at least 1MB.
>   		 */
>   		return max_t(u64, start, SZ_1M);
> +	case BTRFS_CHUNK_ALLOC_ZONED:
> +		return dev_extent_search_start_zoned(device, start);
>   	default:
>   		BUG();
>   	}
>   }
>   

> @@ -165,4 +190,13 @@ static inline bool btrfs_check_super_location(struct btrfs_device *device,
>   	       !btrfs_dev_is_sequential(device, pos);
>   }
>   
> +static inline u64 btrfs_align_offset_to_zone(struct btrfs_device *device,
> +					     u64 pos)
> +{
> +	if (!device->zone_info)
> +		return pos;
> +
> +	return ALIGN(pos, device->zone_info->zone_size);
> +}
> +
>   #endif
> 

  Small functions (such as the ones above) can be open coded to make
  reviewing easier. btrfs_align_offset_to_zone() and
  dev_extent_search_start_zoned() can be open coded and merged into
  the parent function dev_extent_search_start() as below...

dev_extent_search_start()
::
	case BTRFS_CHUNK_ALLOC_ZONED:
		start = max_t(u64, start,
			      max_t(u64, device->zone_info->zone_size, SZ_1M));

          return ALIGN(start, device->zone_info->zone_size);

  As of now we don't allow mixing zoned and regular devices in a
  btrfs (this is verified during mount and device add/replace).
  So we don't have to check for the same again in
  btrfs_align_offset_to_zone().

Thanks.

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH v10 02/41] iomap: support REQ_OP_ZONE_APPEND
  2020-11-10 11:26 ` [PATCH v10 02/41] iomap: support REQ_OP_ZONE_APPEND Naohiro Aota
                     ` (2 preceding siblings ...)
  2020-11-30 18:11   ` Darrick J. Wong
@ 2020-12-09  9:31   ` Christoph Hellwig
  2020-12-09 10:08     ` Johannes Thumshirn
  3 siblings, 1 reply; 125+ messages in thread
From: Christoph Hellwig @ 2020-12-09  9:31 UTC (permalink / raw)
  To: Naohiro Aota
  Cc: linux-btrfs, dsterba, hare, linux-fsdevel, Jens Axboe,
	Christoph Hellwig, Darrick J. Wong

Btw, another thing I noticed:

when using io_uring to submit a write to btrfs that ends up using Zone
Append we'll hit the

	if (WARN_ON_ONCE(is_bvec))
		return -EINVAL;

case in bio_iov_iter_get_pages with the changes in this series.

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH v10 02/41] iomap: support REQ_OP_ZONE_APPEND
  2020-12-09  9:31   ` Christoph Hellwig
@ 2020-12-09 10:08     ` Johannes Thumshirn
  2020-12-09 10:10       ` hch
  0 siblings, 1 reply; 125+ messages in thread
From: Johannes Thumshirn @ 2020-12-09 10:08 UTC (permalink / raw)
  To: hch, Naohiro Aota
  Cc: linux-btrfs, dsterba, hare, linux-fsdevel, Jens Axboe, Darrick J. Wong

On 09/12/2020 10:34, Christoph Hellwig wrote:
> Btw, another thing I noticed:
> 
> when using io_uring to submit a write to btrfs that ends up using Zone
> Append we'll hit the
> 
> 	if (WARN_ON_ONCE(is_bvec))
> 		return -EINVAL;
> 
> case in bio_iov_iter_get_pages with the changes in this series.

Yes, this warning is totally bogus. It has been in there since the beginning
of the zone-append series and I have no idea why I didn't kill it.

IIRC Chaitanya had a patch in his nvmet zoned series removing it.


* Re: [PATCH v10 02/41] iomap: support REQ_OP_ZONE_APPEND
  2020-12-09 10:08     ` Johannes Thumshirn
@ 2020-12-09 10:10       ` hch
  2020-12-09 10:16         ` Johannes Thumshirn
  0 siblings, 1 reply; 125+ messages in thread
From: hch @ 2020-12-09 10:10 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: hch, Naohiro Aota, linux-btrfs, dsterba, hare, linux-fsdevel,
	Jens Axboe, Darrick J. Wong

On Wed, Dec 09, 2020 at 10:08:53AM +0000, Johannes Thumshirn wrote:
> On 09/12/2020 10:34, Christoph Hellwig wrote:
> > Btw, another thing I noticed:
> > 
> > when using io_uring to submit a write to btrfs that ends up using Zone
> > Append we'll hit the
> > 
> > 	if (WARN_ON_ONCE(is_bvec))
> > 		return -EINVAL;
> > 
> > case in bio_iov_iter_get_pages with the changes in this series.
> 
> Yes this warning is totally bogus. It was in there from the beginning of the
> zone-append series and I have no idea why I didn't kill it.
> 
> IIRC Chaitanya had a patch in his nvmet zoned series removing it.

Yes, but it is wrong.  What we need is a version of
__bio_iov_bvec_add_pages that takes the hardware limits into account.


* Re: [PATCH v10 02/41] iomap: support REQ_OP_ZONE_APPEND
  2020-12-09 10:10       ` hch
@ 2020-12-09 10:16         ` Johannes Thumshirn
  2020-12-09 13:38           ` Johannes Thumshirn
  0 siblings, 1 reply; 125+ messages in thread
From: Johannes Thumshirn @ 2020-12-09 10:16 UTC (permalink / raw)
  To: hch
  Cc: Naohiro Aota, linux-btrfs, dsterba, hare, linux-fsdevel,
	Jens Axboe, Darrick J. Wong

On 09/12/2020 11:10, hch@infradead.org wrote:
> On Wed, Dec 09, 2020 at 10:08:53AM +0000, Johannes Thumshirn wrote:
>> On 09/12/2020 10:34, Christoph Hellwig wrote:
>>> Btw, another thing I noticed:
>>>
>>> when using io_uring to submit a write to btrfs that ends up using Zone
>>> Append we'll hit the
>>>
>>> 	if (WARN_ON_ONCE(is_bvec))
>>> 		return -EINVAL;
>>>
>>> case in bio_iov_iter_get_pages with the changes in this series.
>>
>> Yes this warning is totally bogus. It was in there from the beginning of the
>> zone-append series and I have no idea why I didn't kill it.
>>
>> IIRC Chaitanya had a patch in his nvmet zoned series removing it.
> 
> Yes, but it is wrong.  What we need is a version of
> __bio_iov_bvec_add_pages that takes the hardware limits into account.
> 

Ah now I understand the situation, I'm on it.


* Re: [PATCH v10 02/41] iomap: support REQ_OP_ZONE_APPEND
  2020-12-09 10:16         ` Johannes Thumshirn
@ 2020-12-09 13:38           ` Johannes Thumshirn
  2020-12-11  7:26             ` Johannes Thumshirn
  0 siblings, 1 reply; 125+ messages in thread
From: Johannes Thumshirn @ 2020-12-09 13:38 UTC (permalink / raw)
  To: hch
  Cc: Naohiro Aota, linux-btrfs, dsterba, hare, linux-fsdevel,
	Jens Axboe, Darrick J. Wong

On 09/12/2020 11:18, Johannes Thumshirn wrote:
> On 09/12/2020 11:10, hch@infradead.org wrote:
>> On Wed, Dec 09, 2020 at 10:08:53AM +0000, Johannes Thumshirn wrote:
>>> On 09/12/2020 10:34, Christoph Hellwig wrote:
>>>> Btw, another thing I noticed:
>>>>
>>>> when using io_uring to submit a write to btrfs that ends up using Zone
>>>> Append we'll hit the
>>>>
>>>> 	if (WARN_ON_ONCE(is_bvec))
>>>> 		return -EINVAL;
>>>>
>>>> case in bio_iov_iter_get_pages with the changes in this series.
>>>
>>> Yes this warning is totally bogus. It was in there from the beginning of the
>>> zone-append series and I have no idea why I didn't kill it.
>>>
>>> IIRC Chaitanya had a patch in his nvmet zoned series removing it.
>>
>> Yes, but it is wrong.  What we need is a version of
>> __bio_iov_bvec_add_pages that takes the hardware limits into account.
>>
> 
> Ah now I understand the situation, I'm on it.
> 

OK got something, just need to test it.


* Re: [PATCH v10 02/41] iomap: support REQ_OP_ZONE_APPEND
  2020-12-09 13:38           ` Johannes Thumshirn
@ 2020-12-11  7:26             ` Johannes Thumshirn
  2020-12-11 21:24               ` Chaitanya Kulkarni
  0 siblings, 1 reply; 125+ messages in thread
From: Johannes Thumshirn @ 2020-12-11  7:26 UTC (permalink / raw)
  To: hch
  Cc: Naohiro Aota, linux-btrfs, dsterba, hare, linux-fsdevel,
	Jens Axboe, Darrick J. Wong

On 09/12/2020 14:41, Johannes Thumshirn wrote:
> On 09/12/2020 11:18, Johannes Thumshirn wrote:
>> On 09/12/2020 11:10, hch@infradead.org wrote:
>>> On Wed, Dec 09, 2020 at 10:08:53AM +0000, Johannes Thumshirn wrote:
>>>> On 09/12/2020 10:34, Christoph Hellwig wrote:
>>>>> Btw, another thing I noticed:
>>>>>
>>>>> when using io_uring to submit a write to btrfs that ends up using Zone
>>>>> Append we'll hit the
>>>>>
>>>>> 	if (WARN_ON_ONCE(is_bvec))
>>>>> 		return -EINVAL;
>>>>>
>>>>> case in bio_iov_iter_get_pages with the changes in this series.
>>>>
>>>> Yes this warning is totally bogus. It was in there from the beginning of the
>>>> zone-append series and I have no idea why I didn't kill it.
>>>>
>>>> IIRC Chaitanya had a patch in his nvmet zoned series removing it.
>>>
>>> Yes, but it is wrong.  What we need is a version of
>>> __bio_iov_bvec_add_pages that takes the hardware limits into account.
>>>
>>
>> Ah now I understand the situation, I'm on it.
>>
> 
> OK got something, just need to test it.
> 

I just ran tests with my solution, and to verify it worked as expected I ran
the test without it. Interestingly, the WARN_ON() didn't trigger for me.
Here's the fio command line I used:

fio --ioengine io_uring --rw readwrite --bs 1M --size 1G --time_based   \
    --runtime 1m --filename /mnt/test/io_uring --name io_uring-test     \
    --direct 1 --numjobs $NPROC


I did verify it's using zone append though.

What did you use to trigger the warning?


* Re: [PATCH v10 02/41] iomap: support REQ_OP_ZONE_APPEND
  2020-12-11  7:26             ` Johannes Thumshirn
@ 2020-12-11 21:24               ` Chaitanya Kulkarni
  2020-12-12 10:22                 ` Johannes Thumshirn
  0 siblings, 1 reply; 125+ messages in thread
From: Chaitanya Kulkarni @ 2020-12-11 21:24 UTC (permalink / raw)
  To: Johannes Thumshirn, hch
  Cc: Naohiro Aota, linux-btrfs, dsterba, hare, linux-fsdevel,
	Jens Axboe, Darrick J. Wong

Johannes/Christoph,

On 12/10/20 23:30, Johannes Thumshirn wrote:
> On 09/12/2020 14:41, Johannes Thumshirn wrote:
>> On 09/12/2020 11:18, Johannes Thumshirn wrote:
>>> On 09/12/2020 11:10, hch@infradead.org wrote:
>>>> On Wed, Dec 09, 2020 at 10:08:53AM +0000, Johannes Thumshirn wrote:
>>>>> On 09/12/2020 10:34, Christoph Hellwig wrote:
>>>>>> Btw, another thing I noticed:
>>>>>>
>>>>>> when using io_uring to submit a write to btrfs that ends up using Zone
>>>>>> Append we'll hit the
>>>>>>
>>>>>> 	if (WARN_ON_ONCE(is_bvec))
>>>>>> 		return -EINVAL;
>>>>>>
>>>>>> case in bio_iov_iter_get_pages with the changes in this series.
>>>>> Yes this warning is totally bogus. It was in there from the beginning of the
>>>>> zone-append series and I have no idea why I didn't kill it.
>>>>>
>>>>> IIRC Chaitanya had a patch in his nvmet zoned series removing it.

Even though that patch is no longer needed, I've tested it with the NVMeOF
backend, which builds bios from bvecs with bio_iov_iter_get_pages(). I can
still send that patch, please confirm.



* Re: [PATCH v10 02/41] iomap: support REQ_OP_ZONE_APPEND
  2020-12-11 21:24               ` Chaitanya Kulkarni
@ 2020-12-12 10:22                 ` Johannes Thumshirn
  0 siblings, 0 replies; 125+ messages in thread
From: Johannes Thumshirn @ 2020-12-12 10:22 UTC (permalink / raw)
  To: Chaitanya Kulkarni, hch
  Cc: Naohiro Aota, linux-btrfs, dsterba, hare, linux-fsdevel,
	Jens Axboe, Darrick J. Wong

On 11/12/2020 22:24, Chaitanya Kulkarni wrote:
> Johhanes/Christoph,
> 
> On 12/10/20 23:30, Johannes Thumshirn wrote:
>> On 09/12/2020 14:41, Johannes Thumshirn wrote:
>>> On 09/12/2020 11:18, Johannes Thumshirn wrote:
>>>> On 09/12/2020 11:10, hch@infradead.org wrote:
>>>>> On Wed, Dec 09, 2020 at 10:08:53AM +0000, Johannes Thumshirn wrote:
>>>>>> On 09/12/2020 10:34, Christoph Hellwig wrote:
>>>>>>> Btw, another thing I noticed:
>>>>>>>
>>>>>>> when using io_uring to submit a write to btrfs that ends up using Zone
>>>>>>> Append we'll hit the
>>>>>>>
>>>>>>> 	if (WARN_ON_ONCE(is_bvec))
>>>>>>> 		return -EINVAL;
>>>>>>>
>>>>>>> case in bio_iov_iter_get_pages with the changes in this series.
>>>>>> Yes this warning is totally bogus. It was in there from the beginning of the
>>>>>> zone-append series and I have no idea why I didn't kill it.
>>>>>>
>>>>>> IIRC Chaitanya had a patch in his nvmet zoned series removing it.
> 
> Even though that patch is not needed I've tested that with the NVMeOF
> backend to build bios from bvecs with bio_iov_iter_get_pages(), I can still
> send that patch, please confirm.
>

I have the following locally, but I fail to call it in my tests:

commit ea93c91a70204a23ebf9e22b19fbf8add644e12e
Author: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Date:   Wed Dec 9 23:26:38 2020 +0900

    block: provide __bio_iov_bvec_add_append_pages
    
    Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

diff --git a/block/bio.c b/block/bio.c
index 5c6982902330..dc8178ca796f 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -992,6 +992,31 @@ void bio_release_pages(struct bio *bio, bool mark_dirty)
 }
 EXPORT_SYMBOL_GPL(bio_release_pages);
 
+static int __bio_iov_bvec_add_append_pages(struct bio *bio,
+                                          struct iov_iter *iter)
+{
+       const struct bio_vec *bv = iter->bvec;
+       struct request_queue *q = bio->bi_disk->queue;
+       unsigned int max_append_sectors = queue_max_zone_append_sectors(q);
+       unsigned int len;
+       size_t size;
+
+       if (WARN_ON_ONCE(!max_append_sectors))
+               return -EINVAL;
+
+       if (WARN_ON_ONCE(iter->iov_offset > bv->bv_len))
+               return -EINVAL;
+
+       len = min_t(size_t, bv->bv_len - iter->iov_offset, iter->count);
+       size = bio_add_hw_page(q, bio, bv->bv_page, len,
+                              bv->bv_offset + iter->iov_offset,
+                              max_append_sectors, false);
+       if (unlikely(size != len))
+               return -EINVAL;
+       iov_iter_advance(iter, size);
+       return 0;
+}
+
 static int __bio_iov_bvec_add_pages(struct bio *bio, struct iov_iter *iter)
 {
        const struct bio_vec *bv = iter->bvec;
@@ -1142,9 +1167,10 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 
        do {
                if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
-                       if (WARN_ON_ONCE(is_bvec))
-                               return -EINVAL;
-                       ret = __bio_iov_append_get_pages(bio, iter);
+                       if (is_bvec)
+                               ret = __bio_iov_bvec_add_append_pages(bio, iter);
+                       else
+                               ret = __bio_iov_append_get_pages(bio, iter);
                } else {
                        if (is_bvec)
                                ret = __bio_iov_bvec_add_pages(bio, iter);



It's basically __bio_iov_bvec_add_pages() respecting the max_zone_append_sectors queue limit.


end of thread, other threads:[~2020-12-12 10:23 UTC | newest]

Thread overview: 125+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-11-10 11:26 [PATCH v10 00/41] btrfs: zoned block device support Naohiro Aota
2020-11-10 11:26 ` [PATCH v10 01/41] block: add bio_add_zone_append_page Naohiro Aota
2020-11-10 17:20   ` Christoph Hellwig
2020-11-11  7:20     ` Johannes Thumshirn
2020-11-10 11:26 ` [PATCH v10 02/41] iomap: support REQ_OP_ZONE_APPEND Naohiro Aota
2020-11-10 17:25   ` Christoph Hellwig
2020-11-10 18:55   ` Darrick J. Wong
2020-11-10 19:01     ` Darrick J. Wong
2020-11-24 11:29     ` Christoph Hellwig
2020-11-30 18:11   ` Darrick J. Wong
2020-12-01 10:16     ` Johannes Thumshirn
2020-12-09  9:31   ` Christoph Hellwig
2020-12-09 10:08     ` Johannes Thumshirn
2020-12-09 10:10       ` hch
2020-12-09 10:16         ` Johannes Thumshirn
2020-12-09 13:38           ` Johannes Thumshirn
2020-12-11  7:26             ` Johannes Thumshirn
2020-12-11 21:24               ` Chaitanya Kulkarni
2020-12-12 10:22                 ` Johannes Thumshirn
2020-11-10 11:26 ` [PATCH v10 03/41] btrfs: introduce ZONED feature flag Naohiro Aota
2020-11-19 21:31   ` David Sterba
2020-11-10 11:26 ` [PATCH v10 04/41] btrfs: get zone information of zoned block devices Naohiro Aota
2020-11-12  6:57   ` Anand Jain
2020-11-12  7:35     ` Johannes Thumshirn
2020-11-12  7:44       ` Damien Le Moal
2020-11-12  9:44         ` Anand Jain
2020-11-13 21:34           ` David Sterba
2020-11-12  9:39     ` Johannes Thumshirn
2020-11-12 12:57     ` Naohiro Aota
2020-11-18 11:17       ` Anand Jain
2020-11-30 11:16         ` Anand Jain
2020-11-25 21:47   ` David Sterba
2020-11-25 22:07     ` David Sterba
2020-11-25 23:50     ` Damien Le Moal
2020-11-26 14:11       ` David Sterba
2020-11-25 22:16   ` David Sterba
2020-11-10 11:26 ` [PATCH v10 05/41] btrfs: check and enable ZONED mode Naohiro Aota
2020-11-18 11:29   ` Anand Jain
2020-11-27 18:44     ` David Sterba
2020-11-30 12:12       ` Anand Jain
2020-11-30 13:15         ` Damien Le Moal
2020-12-01  2:19           ` Anand Jain
2020-12-01  2:29             ` Damien Le Moal
2020-12-01  5:53               ` Anand Jain
2020-12-01  6:09                 ` Damien Le Moal
2020-12-01  7:12                   ` Anand Jain
2020-12-01 10:45               ` Graham Cobb
2020-12-01 11:03                 ` Damien Le Moal
2020-12-01 11:11                   ` hch
2020-12-01 11:27                     ` Damien Le Moal
2020-11-10 11:26 ` [PATCH v10 06/41] btrfs: introduce max_zone_append_size Naohiro Aota
2020-11-19  9:23   ` Anand Jain
2020-11-27 18:47     ` David Sterba
2020-11-10 11:26 ` [PATCH v10 07/41] btrfs: disallow space_cache in ZONED mode Naohiro Aota
2020-11-19 10:42   ` Anand Jain
2020-11-20  4:08     ` Anand Jain
2020-11-10 11:26 ` [PATCH v10 08/41] btrfs: disallow NODATACOW " Naohiro Aota
2020-11-20  4:17   ` Anand Jain
2020-11-23 17:21     ` David Sterba
2020-11-24  3:29       ` Anand Jain
2020-11-10 11:26 ` [PATCH v10 09/41] btrfs: disable fallocate " Naohiro Aota
2020-11-20  4:28   ` Anand Jain
2020-11-10 11:26 ` [PATCH v10 10/41] btrfs: disallow mixed-bg " Naohiro Aota
2020-11-20  4:32   ` Anand Jain
2020-11-10 11:26 ` [PATCH v10 11/41] btrfs: implement log-structured superblock for " Naohiro Aota
2020-11-11  1:34   ` kernel test robot
2020-11-11  2:43   ` kernel test robot
2020-11-23 17:46   ` David Sterba
2020-11-24  9:30     ` Johannes Thumshirn
2020-11-24  6:46   ` Anand Jain
2020-11-24  7:16     ` Hannes Reinecke
2020-11-10 11:26 ` [PATCH v10 12/41] btrfs: implement zoned chunk allocator Naohiro Aota
2020-11-24 11:36   ` Anand Jain
2020-11-25  1:57     ` Naohiro Aota
2020-11-25  7:17       ` Anand Jain
2020-11-25 11:48         ` Naohiro Aota
2020-11-25  9:59       ` Graham Cobb
2020-11-25 11:50         ` Naohiro Aota
2020-12-09  5:27   ` Anand Jain
2020-11-10 11:26 ` [PATCH v10 13/41] btrfs: verify device extent is aligned to zone Naohiro Aota
2020-11-27  6:27   ` Anand Jain
2020-11-10 11:26 ` [PATCH v10 14/41] btrfs: load zone's alloction offset Naohiro Aota
2020-12-08  9:54   ` Anand Jain
2020-11-10 11:26 ` [PATCH v10 15/41] btrfs: emulate write pointer for conventional zones Naohiro Aota
2020-11-10 11:26 ` [PATCH v10 16/41] btrfs: track unusable bytes for zones Naohiro Aota
2020-11-10 11:26 ` [PATCH v10 17/41] btrfs: do sequential extent allocation in ZONED mode Naohiro Aota
2020-11-10 11:26 ` [PATCH v10 18/41] btrfs: reset zones of unused block groups Naohiro Aota
2020-11-10 11:26 ` [PATCH v10 19/41] btrfs: redirty released extent buffers in ZONED mode Naohiro Aota
2020-11-10 11:26 ` [PATCH v10 20/41] btrfs: extract page adding function Naohiro Aota
2020-11-10 11:26 ` [PATCH v10 21/41] btrfs: use bio_add_zone_append_page for zoned btrfs Naohiro Aota
2020-11-10 11:26 ` [PATCH v10 22/41] btrfs: handle REQ_OP_ZONE_APPEND as writing Naohiro Aota
2020-11-10 11:26 ` [PATCH v10 23/41] btrfs: split ordered extent when bio is sent Naohiro Aota
2020-11-11  2:01   ` kernel test robot
2020-11-11  2:01     ` kernel test robot
2020-11-11  2:26   ` kernel test robot
2020-11-11  2:26     ` kernel test robot
2020-11-11  3:46   ` kernel test robot
2020-11-11  3:46     ` kernel test robot
2020-11-11  3:46   ` [RFC PATCH] btrfs: extract_ordered_extent() can be static kernel test robot
2020-11-11  3:46     ` kernel test robot
2020-11-11  4:12   ` [PATCH v10.1 23/41] btrfs: split ordered extent when bio is sent Naohiro Aota
2020-11-10 11:26 ` [PATCH v10 24/41] btrfs: extend btrfs_rmap_block for specifying a device Naohiro Aota
2020-11-10 11:26 ` [PATCH v10 25/41] btrfs: use ZONE_APPEND write for ZONED btrfs Naohiro Aota
2020-11-10 11:26 ` [PATCH v10 26/41] btrfs: enable zone append writing for direct IO Naohiro Aota
2020-11-10 11:26 ` [PATCH v10 27/41] btrfs: introduce dedicated data write path for ZONED mode Naohiro Aota
2020-11-10 11:26 ` [PATCH v10 28/41] btrfs: serialize meta IOs on " Naohiro Aota
2020-11-10 11:26 ` [PATCH v10 29/41] btrfs: wait existing extents before truncating Naohiro Aota
2020-11-10 11:26 ` [PATCH v10 30/41] btrfs: avoid async metadata checksum on ZONED mode Naohiro Aota
2020-11-10 11:26 ` [PATCH v10 31/41] btrfs: mark block groups to copy for device-replace Naohiro Aota
2020-11-11  3:13   ` kernel test robot
2020-11-11  3:16   ` kernel test robot
2020-11-10 11:26 ` [PATCH v10 32/41] btrfs: implement cloning for ZONED device-replace Naohiro Aota
2020-11-10 11:26 ` [PATCH v10 33/41] btrfs: implement copying " Naohiro Aota
2020-11-10 11:26 ` [PATCH v10 34/41] btrfs: support dev-replace in ZONED mode Naohiro Aota
2020-11-10 11:26 ` [PATCH v10 35/41] btrfs: enable relocation " Naohiro Aota
2020-11-10 11:26 ` [PATCH v10 36/41] btrfs: relocate block group to repair IO failure in ZONED Naohiro Aota
2020-11-10 11:26 ` [PATCH v10 37/41] btrfs: split alloc_log_tree() Naohiro Aota
2020-11-10 11:26 ` [PATCH v10 38/41] btrfs: extend zoned allocator to use dedicated tree-log block group Naohiro Aota
2020-11-11  4:58   ` [PATCH v10.1 " Naohiro Aota
2020-11-10 11:26 ` [PATCH v10 39/41] btrfs: serialize log transaction on ZONED mode Naohiro Aota
2020-11-10 11:26 ` [PATCH v10 40/41] btrfs: reorder log node allocation Naohiro Aota
2020-11-10 11:26 ` [PATCH v10 41/41] btrfs: enable to mount ZONED incompat flag Naohiro Aota
2020-11-10 14:00 ` [PATCH v10 00/41] btrfs: zoned block device support Anand Jain
2020-11-11  5:07   ` Naohiro Aota
2020-11-27 19:28 ` David Sterba
