* [PATCH v15 00/42] btrfs: zoned block device support
@ 2021-02-04 10:21 Naohiro Aota
  2021-02-04 10:21 ` [PATCH v15 01/42] block: add bio_add_zone_append_page Naohiro Aota
  2021-02-10 19:58 ` [PATCH v15 00/42] btrfs: zoned block device support David Sterba
  0 siblings, 2 replies; 72+ messages in thread
From: Naohiro Aota @ 2021-02-04 10:21 UTC (permalink / raw)
  To: linux-btrfs, dsterba; +Cc: hare, linux-fsdevel, Naohiro Aota

This series adds zoned block device support to btrfs. Some patches from
the previous series have already been merged as preparation patches.

This series and related changes to userland tools are also available on
github.

Kernel   https://github.com/naota/linux/tree/btrfs-zoned-v15
Userland https://github.com/naota/btrfs-progs/tree/btrfs-zoned
xfstests https://github.com/naota/fstests/tree/btrfs-zoned

The userland tools depend on a patched util-linux (libblkid and wipefs) to
handle the log-structured superblock. To ease testing, pre-compiled
statically linked userland tools are available here:
https://wdc.app.box.com/s/fnhqsb3otrvgkstq66o6bvdw6tk525kp

Followup work will address several areas that can be improved.

- Splitting an ordered extent: since we need to enforce the rule of one BIO
  == one ordered extent, the BIO submission path could be improved to not
  require splitting ordered extents, but rather to create ordered extents
  that can be processed with a single BIO.
- Redirtying freed tree blocks: switch to keeping the blocks dirty
- Dedicated tree-log block group: we need a tree-log on zoned devices for
  performance reasons. Dbench (32 clients) is 85% slower with "-o
  notreelog". However, we need to separate the tree-log block group from
  other metadata space_info to avoid premature ENOSPC problems
- Relocation: use the normal write command for relocation. Also, relocated
  device extents must be reset, and they should be discarded on regular
  btrfs too.
- Support for zone capacity smaller than zone size (NVMe ZNS devices)
- Support for device open and active zone limits (NVMe ZNS devices)

Also, we are leaving a fix for "btrfs: serialize log transaction on zoned
filesystem" for later. Filipe pointed out that fsync() on a zoned filesystem
falls back to a full transaction commit even without concurrency, leading to
performance degradation. A fix exists for this issue itself, but it revealed
other failures in the fsync() path. The current code is slower but works, so
we leave this performance fix for later.

Changes from v14 (+ fixes in for-next)
  - Fix commit log, messages and comment styles (David)
  - Added some comments to code
  - Do not always call inode_need_compress() (patch 29)
  - Fix double unlock in mark_block_group_to_copy() (patch 33)
  - Do not limit parallelism for non-zoned FS (patch 41)

btrfs-progs and xfstests series will follow.

This version of ZONED btrfs switched from the normal write command to the
zone append write command. With zone append, you do not need to specify the
LBA (at the write pointer) to write to. Instead, you only select the zone to
write to by its start LBA. The device (NVMe ZNS), or the emulation of the
zone append command in the sd driver in the case of SAS or SATA HDDs, then
automatically writes the data at the write pointer position and returns the
written LBA as the command reply.
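
To illustrate, here is a minimal, hypothetical sketch against the generic
block layer API (not code from this series, error handling omitted): the
bio only names the zone, and on completion it carries back the sector the
device actually wrote to.

    /* Sketch: write one page with zone append and report where it landed. */
    static void zone_append_sketch(struct block_device *bdev,
                                   struct page *page, sector_t zone_start)
    {
            struct bio *bio = bio_alloc(GFP_NOFS, 1);

            bio_set_dev(bio, bdev);
            bio->bi_opf = REQ_OP_ZONE_APPEND | REQ_SYNC;
            /* Select only the zone; the device picks the exact LBA. */
            bio->bi_iter.bi_sector = zone_start;
            bio_add_zone_append_page(bio, page, PAGE_SIZE, 0);
            submit_bio_wait(bio);
            /* The reply: bi_sector now holds the LBA the device wrote to. */
            pr_info("written at sector %llu\n",
                    (unsigned long long)bio->bi_iter.bi_sector);
            bio_put(bio);
    }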

The benefit of using the zone append write command is that the write
command issuing order does not matter. So, we can eliminate the block group
lock and utilize asynchronous checksumming, which can reorder the IOs.

Eliminating the lock improves performance. In particular, on a workload
with massive contention on the same zone [1], we observed a 36% performance
improvement compared to the normal write command.

[1] Fio running 16 jobs with 4KB random writes for 5 minutes

However, there are some limitations. We cannot use non-SINGLE profiles,
because supporting them with zone append writing is not trivial. For
example, in the DUP profile, we send a zone append write IO to two zones on
a device. The device replies with the written LBAs for the IOs. If the
offsets of the returned addresses from the beginning of their zones differ,
the two copies end up at different logical addresses.
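
As a concrete, purely hypothetical illustration: with DUP, a single 4 KiB
zone append write is duplicated to two zones of the same device. If one
zone reports the data at offset 64 KiB from its zone start while the other
reports 68 KiB (because another write slipped in between), the two copies
sit at different offsets within their device extents and can no longer
share the single logical address that the DUP profile requires.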

For the same reason, we cannot issue multiple IOs for one ordered extent.
Thus, the size of an ordered extent is limited to max_zone_append_size.
This limitation causes fragmentation and increased metadata usage. In the
future, we can add an optimization that merges ordered extents after
end_bio.

* Patch series description

A zoned block device consists of a number of zones. Zones are either
conventional, accepting random writes, or sequential, requiring that
writes be issued in LBA order from each zone's write pointer position.
This patch series ensures that the sequential write constraint of
sequential zones is respected while fundamentally not changing btrfs
block and I/O management for blocks stored in conventional zones.

To achieve this, the default chunk size of btrfs is changed on zoned
block devices so that chunks are always aligned to a zone. Allocation of
blocks within a chunk is changed so that the allocation is always
sequential from the beginning of the chunk. To do so, an allocation
pointer is added to block groups and used as the allocation hint. The
allocation changes also ensure that blocks freed below the allocation
pointer are ignored, resulting in sequential block allocation regardless
of the chunk usage.
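
The idea can be sketched as follows (hypothetical structure and helper,
not the actual btrfs code): the block group only tracks how far it has
written, and space below the pointer is never handed out again.

    /* Sketch: sequential allocation from a per-block-group pointer. */
    struct sketch_block_group {
            u64 start;          /* logical start of the chunk */
            u64 length;         /* chunk length, zone aligned */
            u64 alloc_offset;   /* allocation pointer, relative to start */
    };

    static int sketch_alloc(struct sketch_block_group *bg, u64 num_bytes,
                            u64 *ret_start)
    {
            /* Blocks freed below alloc_offset are ignored, never reused. */
            if (bg->alloc_offset + num_bytes > bg->length)
                    return -ENOSPC;

            *ret_start = bg->start + bg->alloc_offset;
            bg->alloc_offset += num_bytes;
            return 0;
    }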

The zone of a chunk is reset to allow reuse of the zone only when the
block group is being freed, that is, when all the chunks of the block
group are unused.

For btrfs volumes composed of multiple zoned disks, a restriction is
added to ensure that all disks have the same zone size. This
restriction matches the existing constraint that all chunks in a block
group must have the same size.

* Enabling tree-log

The tree-log feature does not work in ZONED mode as is. Blocks for a
tree-log tree are allocated mixed with other metadata blocks, and btrfs
writes and syncs the tree-log blocks to devices at fsync() time, which is
a different timing from a global transaction commit. As a result, both
writing tree-log blocks and writing other metadata blocks become
non-sequential writes, which ZONED mode must avoid.

This series introduces a dedicated block group for tree-log blocks to
create two metadata writing streams, one for tree-log blocks and the
other for metadata blocks. As a result, each write stream can now be
written to devices separately and sequentially.

* Log-structured superblock

The superblock (and its copies) is the only data structure in btrfs that
has a fixed location on a device. Since we cannot overwrite data in a
sequential write required zone, we cannot place the superblock in such a
zone.

This series implements superblock log writing. It uses two zones as a
circular buffer to write updated superblocks. Once the first zone is
filled up, writing moves on to the second zone. The first zone is reset
once both zones are filled. We can determine the position of the latest
superblock by reading the write pointer information from the device.
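
The lookup on the read side can be sketched like this (a simplified,
hypothetical helper; the real code must also handle the cases where both
zones are empty or both are full):

    /* Sketch: locate the newest superblock in the two-zone log. */
    static bool sketch_zone_in_use(const struct blk_zone *z)
    {
            return z->wp > z->start && z->wp < z->start + z->len;
    }

    static u64 sketch_latest_sb_pos(const struct blk_zone *z0,
                                    const struct blk_zone *z1, u64 sb_size)
    {
            /*
             * The zone currently being filled received the most recent
             * superblock write; the newest copy sits right below its
             * write pointer.
             */
            const struct blk_zone *z = sketch_zone_in_use(z1) ? z1 : z0;

            return ((u64)z->wp << SECTOR_SHIFT) - sb_size;
    }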

* Patch series organization

Patches 1 and 2 are preparation patches for the block and iomap layers.

Patches 3 to 7 are fixes for previous patches or preparation for the later
patches.

Patch 8 implements emulated zoned mode for non-zoned devices.

Patches 9 and 10 tweak the device extent allocation for ZONED mode and add
verification to check if a device extent is properly aligned to zones.

Patches 11 to 14 implement the sequential block allocator for ZONED mode.

Patches 15 and 16 tweak some btrfs features to work with emulated ZONED mode.
These include re-dirtying (and pinning) of once-freed metadata blocks to
prevent write holes, and advancing the allocation offset for tree-log node
blocks.

Patches 17 and later are for real zoned devices.

Patch 17 implements a zone reset for unused block groups.

Patches 18 to 31 implement the writing path for several types of IO
(non-compressed data, direct IO, and metadata). These include re-dirtying
once-freed metadata blocks to prevent write holes.

Patches 32 to 41 tweak some btrfs features to work with ZONED mode. These
include device-replace, relocation, repairing IO error, and tree-log.

Patch 42 adds the ZONED feature to the list of supported features.

* Patch testing note

** Zone-aware util-linux

Since the log-structured superblock feature changed the location of the
superblock magic, current util-linux (libblkid) cannot detect ZONED btrfs
anymore. You need to apply a to-be-posted patch to util-linux to make it
"zone aware".

** Testing device

You need devices that support the zone append write command to run ZONED
btrfs.

Other than real devices, null_blk supports the zone append write command.
You can use a memory-backed null_blk device to run the tests. The following
script creates a 12800 MB /dev/nullb0.

    sysfs=/sys/kernel/config/nullb/nullb0
    size=12800 # MB
    
    # drop nullb0
    if [[ -d $sysfs ]]; then
            echo 0 > "${sysfs}"/power
            rmdir $sysfs
    fi
    lsmod | grep -q null_blk && rmmod null_blk
    modprobe null_blk nr_devices=0
    
    mkdir "${sysfs}"
    
    echo "${size}" > "${sysfs}"/size
    echo 1 > "${sysfs}"/zoned
    echo 0 > "${sysfs}"/zone_nr_conv
    echo 1 > "${sysfs}"/memory_backed
    
    echo 1 > "${sysfs}"/power
    udevadm settle

Zoned SCSI devices such as SMR HDDs or scsi_debug also support the zone
append command as an emulated command within the SCSI sd driver. This
emulation is completely transparent to the user and provides the same
semantics as native NVMe ZNS drive support.

Also, there is a qemu patch available to enable NVMe ZNS device emulation.

** xfstests

We ran xfstests on ZONED btrfs and, if we omit some cases that are currently
known to fail, all test cases pass.

Cases that can be ignored:
1) failing also with regular btrfs on regular devices,
2) trying to test the fallocate feature without checking it with
   "_require_xfs_io_command "falloc"",
3) trying to test features incompatible with ZONED btrfs (e.g. RAID5/6),
4) trying to use a setup incompatible with ZONED btrfs (e.g. dm-linear not
   aligned to a zone boundary, swap),
5) trying to create a filesystem that is too small (we require at least
   9 zones to initiate a ZONED btrfs),
6) dropping the original MKFS_OPTIONS ("-O zoned"), so it cannot create a
   ZONED btrfs (btrfs/003),
7) hitting ENOSPC caused by the larger metadata block group size.

I will send a patch series for xfstests to handle these cases (2-6)
properly.

Patched xfstests is available here:

https://github.com/naota/fstests/tree/btrfs-zoned

Also, you need to apply the following patch if you run xfstests with tcmu
devices. Without this patch, xfstests btrfs/003 fails at "_devmgt_add"
after "_devmgt_remove".

https://marc.info/?l=linux-scsi&m=156498625421698&w=2

v14 https://lore.kernel.org/linux-btrfs/SN4PR0401MB359814032CDF9889ED2DA7FA9BB89@SN4PR0401MB3598.namprd04.prod.outlook.com/T/
v13 https://lore.kernel.org/linux-btrfs/cover.1611295439.git.naohiro.aota@wdc.com/T/
v12 https://lore.kernel.org/linux-btrfs/cover.1610693036.git.naohiro.aota@wdc.com/T/
v11 https://lore.kernel.org/linux-btrfs/SN4PR0401MB35989E15509A0D36CBC35B109BAA0@SN4PR0401MB3598.namprd04.prod.outlook.com/T/#t
v10 https://lore.kernel.org/linux-btrfs/cover.1605007036.git.naohiro.aota@wdc.com/
v9 https://lore.kernel.org/linux-btrfs/cover.1604065156.git.naohiro.aota@wdc.com/
v8 https://lore.kernel.org/linux-btrfs/cover.1601572459.git.naohiro.aota@wdc.com/
v7 https://lore.kernel.org/linux-btrfs/20200911123259.3782926-1-naohiro.aota@wdc.com/
v6 https://lore.kernel.org/linux-btrfs/20191213040915.3502922-1-naohiro.aota@wdc.com/
v5 https://lore.kernel.org/linux-btrfs/20191204082513.857320-1-naohiro.aota@wdc.com/
v4 https://lwn.net/Articles/797061/
v3 https://lore.kernel.org/linux-btrfs/20190808093038.4163421-1-naohiro.aota@wdc.com/
v2 https://lore.kernel.org/linux-btrfs/20190607131025.31996-1-naohiro.aota@wdc.com/
v1 https://lore.kernel.org/linux-btrfs/20180809180450.5091-1-naota@elisp.net/

Changelog
v13
  - Rebased on the latest misc-next
    - Fix conflicts
  - Bug fix
    - Fix use-after-free in read_one_block_group() (Patch 05)
    - Set ffe_ctl->max_extent_size and total_free_size properly (Patch 14)
    - Add check if a bio spans across ordered extents (Patch 23)
  - Add comment.
v12
 - Addressed the review comments
   - Add @return to btrfs_bio_add_page()
   - Load and pass struct btrfs_block_group_item* to read_one_block_group
   - Remove lock contention from transaction log serialisation
   - Introduce btrfs_clear_treelog_bg() helper
   - Do not eat errored return value in calculate_alloc_pointer()
   - Handle errors from btrfs_add_ordered_extent() in clone_ordered_extent()
   - Use btrfs_is_zoned in btrfs_ioctl_fitrim()
   - Code style and commit message fix
 - Use the same SB locations as regular btrfs if the device is non-zoned.
 - Remove "force_zoned" flag since it's no longer necessary by delaying the
   load of zone info.
 - Added comments.
v11
  - Added emulated zoned mode support.
    - Change superblock (SB) location on conventional zones to unify the
      location of primary SB on regular btrfs and emulated zoned btrfs.
    - Move zone info loading to the later open_ctree() stage to determine
      the emulated zone size from the size of device extents.
  - Set REQ_OP_ZONE_APPEND only if the block group is on a sequential zone.
  - Disallow fitrim on zoned mode for now.
  - Mark fully zone_unusable block group as unused, so that it can be
    reclaimed soon.
  - Replace: do not issue zero out on conventional zones.
  - Add treelog_bg_lock's lock order description.
  - Open code btrfs_align_offset_to_zone() and
    dev_extent_search_start_zoned() (Anand).

  - Bug fix:
    - Re-check pending extent if device extent hole is changed by
      dev_extent_hole_check_zoned() 
    - Do not load allocation pointer for new block group on conventional
      zones to avoid deadlock.
v10
  - Added emulated zoned mode support.
    - Change superblock (SB) location on conventional zones to unify the
      location of primary SB on regular btrfs and emulated zoned btrfs.
    - Move zone info loading to the later open_ctree() stage to determine
      the emulated zone size from the size of device extents.
  - Set REQ_OP_ZONE_APPEND only if the block group is on a sequential zone.
  - Disallow fitrim on zoned mode for now.
  - Mark fully zone_unusable block group as unused, so that it can be
    reclaimed soon.
  - Replace: do not issue zero out on conventional zones.
  - Add treelog_bg_lock's lock order description.
  - Open code btrfs_align_offset_to_zone() and
    dev_extent_search_start_zoned() (Anand).

  - Bug fix:
    - Re-check pending extent if device extent hole is changed by
      dev_extent_hole_check_zoned() 
    - Do not load allocation pointer for new block group on conventional
      zones to avoid deadlock.

v9
 - Direct-IO path now follows several hardware restrictions (other than
   max_zone_append_size) by using the ZONE_APPEND support of iomap
 - introduces union of fs_info->zone_size and fs_info->zoned [Johannes]
   - and use btrfs_is_zoned(fs_info) in place of btrfs_fs_incompat(fs_info, ZONED)
 - print if zoned is enabled or not when printing module info [Johannes]
 - drop patch of disabling inode_cache on ZONED
 - moved for_treelog flag to a proper location [Johannes]
 - Code style fixes [Johannes]
 - Add comment about adding physical layer things to ordered extent
   structure
 - Pass file_offset explicitly to extract_ordered_extent() instead of
   determining it from bio
 - Bug fixes
   - write out the fsync region so that the logical addresses of ordered
     extents and checksums are properly finalized
   - free zone_info at umount time
   - fix superblock log handling when entering zones[1] for the first time
   - fix double free of log-tree roots [Johannes]
   - drop erroneous ASSERT in do_allocation_zoned()
v8
 - Use bio_add_hw_page() to build up bio to honor hardware restrictions
   - add bio_add_zone_append_page() as a wrapper of the function
 - Split file extent on submitting bio
   - If bio_add_zone_append_page() fails, split the file extent and send
     out bio
   - so, we can ensure one bio == one file extent
 - Fix build bot issues
 - Rebased on misc-next
v7:
 - Use zone append write command instead of normal write command
   - Bio issuing order does not matter
   - No need to use lock anymore
   - Can use asynchronous checksum
 - Removed RAID support for now
 - Rename HMZONED to ZONED
 - Split some patches
 - Rebased on kdave/for-5.9-rc3 + iomap direct IO
v6:
 - Use bitmap helpers (Johannes)
 - Code cleanup (Johannes)
 - Rebased on kdave/for-5.5
 - Enable the tree-log feature.
 - Treat conventional zones as sequential zones, so we can now allow
   mixed allocation of conventional zone and sequential write required
   zone to construct a block group.
 - Implement log-structured superblock
   - No need for one conventional zone at the beginning of a device.
 - Fix deadlock of direct IO writing
 - Fix building with !CONFIG_BLK_DEV_ZONED (Johannes)
 - Fix leak of zone_info (Johannes)
v5:
 - Rebased on kdave/for-5.5
 - Enable the tree-log feature.
 - Treat conventional zones as sequential zones, so we can now allow
   mixed allocation of conventional zone and sequential write required
   zone to construct a block group.
 - Implement log-structured superblock
   - No need for one conventional zone at the beginning of a device.
 - Fix deadlock of direct IO writing
 - Fix building with !CONFIG_BLK_DEV_ZONED (Johannes)
 - Fix leak of zone_info (Johannes)
v4:
 - Move memory allocation of zone information out of
   btrfs_get_dev_zones() (Anand)
 - Add disabled features table in commit log (Anand)
 - Ensure "max_chunk_size >= devs_min * data_stripes * zone_size"
v3:
 - Serialize allocation and submit_bio instead of bio buffering in
   btrfs_map_bio().
 -- Disable async checksum/submit in HMZONED mode
 - Introduce helper functions and hmzoned.c/h (Josef, David)
 - Add support for repairing IO failure
 - Add support for NOCOW direct IO write (Josef)
 - Disable preallocation entirely
 -- Disable INODE_MAP_CACHE
 -- relocation is reworked not to rely on preallocation in HMZONED mode
 - Disable NODATACOW
 - Disable MIXED_BG
 - Device extents that cover a super block position are banned (David)
v2:
 - Add support for dev-replace
 -- To support dev-replace, moved submit_buffer one layer up. It now
    handles bio instead of btrfs_bio.
 -- Mark unmirrored Block Group readonly only when there are writable
    mirrored BGs. Necessary to handle degraded RAID.
 - Expire worker uses vanilla delayed_work instead of btrfs's async-thread
 - Device extent allocator now ensures that a region is on the same zone type.
 - Add delayed allocation shrinking.
 - Rename btrfs_drop_dev_zonetypes() to btrfs_destroy_dev_zonetypes()
 - Fix
 -- Use SECTOR_SHIFT (Nikolay)
 -- Use btrfs_err (Nikolay)

Johannes Thumshirn (7):
  block: add bio_add_zone_append_page
  btrfs: release path before calling to btrfs_load_block_group_zone_info
  btrfs: zoned: do not load fs_info::zoned from incompat flag
  btrfs: zoned: allow zoned filesystems on non-zoned block devices
  btrfs: zoned: check if bio spans across an ordered extent
  btrfs: zoned: cache if block-group is on a sequential zone
  btrfs: save irq flags when looking up an ordered extent

Naohiro Aota (35):
  iomap: support REQ_OP_ZONE_APPEND
  btrfs: zoned: defer loading zone info after opening trees
  btrfs: zoned: use regular super block location on zone emulation
  btrfs: zoned: disallow fitrim on zoned filesystems
  btrfs: zoned: implement zoned chunk allocator
  btrfs: zoned: verify device extent is aligned to zone
  btrfs: zoned: load zone's allocation offset
  btrfs: zoned: calculate allocation offset for conventional zones
  btrfs: zoned: track unusable bytes for zones
  btrfs: zoned: implement sequential extent allocation
  btrfs: zoned: redirty released extent buffers
  btrfs: zoned: advance allocation pointer after tree log node
  btrfs: zoned: reset zones of unused block groups
  btrfs: factor out helper adding a page to bio
  btrfs: zoned: use bio_add_zone_append_page
  btrfs: zoned: handle REQ_OP_ZONE_APPEND as writing
  btrfs: zoned: split ordered extent when bio is sent
  btrfs: extend btrfs_rmap_block for specifying a device
  btrfs: zoned: use ZONE_APPEND write for zoned btrfs
  btrfs: zoned: enable zone append writing for direct IO
  btrfs: zoned: introduce dedicated data write path for zoned
    filesystems
  btrfs: zoned: serialize metadata IO
  btrfs: zoned: wait for existing extents before truncating
  btrfs: zoned: do not use async metadata checksum on zoned filesystems
  btrfs: zoned: mark block groups to copy for device-replace
  btrfs: zoned: implement cloning for zoned device-replace
  btrfs: zoned: implement copying for zoned device-replace
  btrfs: zoned: support dev-replace in zoned filesystems
  btrfs: zoned: enable relocation on a zoned filesystem
  btrfs: zoned: relocate block group to repair IO failure in zoned
    filesystems
  btrfs: split alloc_log_tree()
  btrfs: zoned: extend zoned allocator to use dedicated tree-log block
    group
  btrfs: zoned: serialize log transaction on zoned filesystems
  btrfs: zoned: reorder log node allocation on zoned filesystem
  btrfs: zoned: enable to mount ZONED incompat flag

 block/bio.c                       |  33 ++
 fs/btrfs/block-group.c            | 134 +++--
 fs/btrfs/block-group.h            |  21 +-
 fs/btrfs/ctree.h                  |   8 +-
 fs/btrfs/dev-replace.c            | 184 +++++++
 fs/btrfs/dev-replace.h            |   3 +
 fs/btrfs/disk-io.c                |  66 ++-
 fs/btrfs/disk-io.h                |   2 +
 fs/btrfs/extent-tree.c            | 230 +++++++-
 fs/btrfs/extent_io.c              | 139 ++++-
 fs/btrfs/extent_io.h              |   2 +
 fs/btrfs/file.c                   |   6 +-
 fs/btrfs/free-space-cache.c       |  87 +++
 fs/btrfs/free-space-cache.h       |   2 +
 fs/btrfs/inode.c                  | 197 ++++++-
 fs/btrfs/ioctl.c                  |   8 +
 fs/btrfs/ordered-data.c           |  86 ++-
 fs/btrfs/ordered-data.h           |  10 +
 fs/btrfs/relocation.c             |  34 +-
 fs/btrfs/scrub.c                  | 143 +++++
 fs/btrfs/space-info.c             |  13 +-
 fs/btrfs/space-info.h             |   4 +-
 fs/btrfs/sysfs.c                  |   2 +
 fs/btrfs/tests/extent-map-tests.c |   2 +-
 fs/btrfs/transaction.c            |  10 +
 fs/btrfs/transaction.h            |   3 +
 fs/btrfs/tree-log.c               |  62 ++-
 fs/btrfs/volumes.c                | 314 ++++++++++-
 fs/btrfs/volumes.h                |   3 +
 fs/btrfs/zoned.c                  | 874 +++++++++++++++++++++++++++++-
 fs/btrfs/zoned.h                  | 157 +++++-
 fs/iomap/direct-io.c              |  43 +-
 include/linux/bio.h               |   2 +
 include/linux/iomap.h             |   1 +
 34 files changed, 2716 insertions(+), 169 deletions(-)

-- 
2.30.0



* [PATCH v15 01/42] block: add bio_add_zone_append_page
  2021-02-04 10:21 [PATCH v15 00/42] btrfs: zoned block device support Naohiro Aota
@ 2021-02-04 10:21 ` Naohiro Aota
  2021-02-04 10:21   ` [PATCH v15 02/42] iomap: support REQ_OP_ZONE_APPEND Naohiro Aota
                     ` (41 more replies)
  2021-02-10 19:58 ` [PATCH v15 00/42] btrfs: zoned block device support David Sterba
  1 sibling, 42 replies; 72+ messages in thread
From: Naohiro Aota @ 2021-02-04 10:21 UTC (permalink / raw)
  To: linux-btrfs, dsterba
  Cc: hare, linux-fsdevel, Johannes Thumshirn, Christoph Hellwig,
	Josef Bacik, Chaitanya Kulkarni, Jens Axboe

From: Johannes Thumshirn <johannes.thumshirn@wdc.com>

Add bio_add_zone_append_page(), a wrapper around bio_add_hw_page() which
is intended to be used by file systems that directly add pages to a bio
instead of using bio_iov_iter_get_pages().
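
A hypothetical caller could fill a zone append bio page by page roughly
like this (illustration only, not part of the patch):

    static unsigned int sketch_add_pages(struct bio *bio, struct page **pages,
                                         unsigned int nr_pages)
    {
            unsigned int i;

            for (i = 0; i < nr_pages; i++) {
                    /*
                     * Less than a full page added means the bio hit the
                     * device's zone append limit and must be submitted.
                     */
                    if (bio_add_zone_append_page(bio, pages[i], PAGE_SIZE,
                                                 0) < PAGE_SIZE)
                            break;
            }
            return i;   /* caller submits bio and retries remaining pages */
    }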

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Acked-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 block/bio.c         | 33 +++++++++++++++++++++++++++++++++
 include/linux/bio.h |  2 ++
 2 files changed, 35 insertions(+)

diff --git a/block/bio.c b/block/bio.c
index 1f2cc1fbe283..2f21d2958b60 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -851,6 +851,39 @@ int bio_add_pc_page(struct request_queue *q, struct bio *bio,
 }
 EXPORT_SYMBOL(bio_add_pc_page);
 
+/**
+ * bio_add_zone_append_page - attempt to add page to zone-append bio
+ * @bio: destination bio
+ * @page: page to add
+ * @len: vec entry length
+ * @offset: vec entry offset
+ *
+ * Attempt to add a page to the bio_vec maplist of a bio that will be submitted
+ * for a zone-append request. This can fail for a number of reasons, such as the
+ * bio being full or the target block device is not a zoned block device or
+ * other limitations of the target block device. The target block device must
+ * allow bio's up to PAGE_SIZE, so it is always possible to add a single page
+ * to an empty bio.
+ *
+ * Returns: number of bytes added to the bio, or 0 in case of a failure.
+ */
+int bio_add_zone_append_page(struct bio *bio, struct page *page,
+			     unsigned int len, unsigned int offset)
+{
+	struct request_queue *q = bio->bi_disk->queue;
+	bool same_page = false;
+
+	if (WARN_ON_ONCE(bio_op(bio) != REQ_OP_ZONE_APPEND))
+		return 0;
+
+	if (WARN_ON_ONCE(!blk_queue_is_zoned(q)))
+		return 0;
+
+	return bio_add_hw_page(q, bio, page, len, offset,
+			       queue_max_zone_append_sectors(q), &same_page);
+}
+EXPORT_SYMBOL_GPL(bio_add_zone_append_page);
+
 /**
  * __bio_try_merge_page - try appending data to an existing bvec.
  * @bio: destination bio
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 1edda614f7ce..de62911473bb 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -455,6 +455,8 @@ void bio_chain(struct bio *, struct bio *);
 extern int bio_add_page(struct bio *, struct page *, unsigned int,unsigned int);
 extern int bio_add_pc_page(struct request_queue *, struct bio *, struct page *,
 			   unsigned int, unsigned int);
+int bio_add_zone_append_page(struct bio *bio, struct page *page,
+			     unsigned int len, unsigned int offset);
 bool __bio_try_merge_page(struct bio *bio, struct page *page,
 		unsigned int len, unsigned int off, bool *same_page);
 void __bio_add_page(struct bio *bio, struct page *page,
-- 
2.30.0



* [PATCH v15 02/42] iomap: support REQ_OP_ZONE_APPEND
  2021-02-04 10:21 ` [PATCH v15 01/42] block: add bio_add_zone_append_page Naohiro Aota
@ 2021-02-04 10:21   ` Naohiro Aota
  2021-02-04 10:21   ` [PATCH v15 03/42] btrfs: zoned: defer loading zone info after opening trees Naohiro Aota
                     ` (40 subsequent siblings)
  41 siblings, 0 replies; 72+ messages in thread
From: Naohiro Aota @ 2021-02-04 10:21 UTC (permalink / raw)
  To: linux-btrfs, dsterba
  Cc: hare, linux-fsdevel, Naohiro Aota, Darrick J . Wong,
	Christoph Hellwig, Chaitanya Kulkarni

A ZONE_APPEND bio must follow hardware restrictions (e.g. not exceeding
max_zone_append_sectors) so that it is not split. bio_iov_iter_get_pages()
builds such a restricted bio using __bio_iov_append_get_pages() if
bio_op(bio) == REQ_OP_ZONE_APPEND.

To utilize it, we need to set the bio_op before calling
bio_iov_iter_get_pages(). This commit introduces IOMAP_F_ZONE_APPEND, so
that an iomap user can set the flag to indicate it wants REQ_OP_ZONE_APPEND
and a restricted bio.
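
For illustration, a filesystem could request zone append from its
->iomap_begin() callback roughly as follows (hypothetical sketch, the
sketch_* helpers are made up):

    static int sketch_iomap_begin(struct inode *inode, loff_t pos,
                                  loff_t length, unsigned int flags,
                                  struct iomap *iomap, struct iomap *srcmap)
    {
            /* Resolve the mapping for [pos, pos + length) ... */
            iomap->bdev = sketch_get_bdev(inode);           /* made up */
            iomap->addr = sketch_zone_start(inode, pos);    /* made up */
            iomap->offset = pos;
            iomap->length = length;
            iomap->type = IOMAP_MAPPED;
            /* Ask the dio code to build REQ_OP_ZONE_APPEND bios. */
            if (flags & IOMAP_WRITE)
                    iomap->flags |= IOMAP_F_ZONE_APPEND;
            return 0;
    }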

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/iomap/direct-io.c  | 43 +++++++++++++++++++++++++++++++++++++------
 include/linux/iomap.h |  1 +
 2 files changed, 38 insertions(+), 6 deletions(-)

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 933f234d5bec..2273120d8ed7 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -201,6 +201,34 @@ iomap_dio_zero(struct iomap_dio *dio, struct iomap *iomap, loff_t pos,
 	iomap_dio_submit_bio(dio, iomap, bio, pos);
 }
 
+/*
+ * Figure out the bio's operation flags from the dio request, the
+ * mapping, and whether or not we want FUA.  Note that we can end up
+ * clearing the WRITE_FUA flag in the dio request.
+ */
+static inline unsigned int
+iomap_dio_bio_opflags(struct iomap_dio *dio, struct iomap *iomap, bool use_fua)
+{
+	unsigned int opflags = REQ_SYNC | REQ_IDLE;
+
+	if (!(dio->flags & IOMAP_DIO_WRITE)) {
+		WARN_ON_ONCE(iomap->flags & IOMAP_F_ZONE_APPEND);
+		return REQ_OP_READ;
+	}
+
+	if (iomap->flags & IOMAP_F_ZONE_APPEND)
+		opflags |= REQ_OP_ZONE_APPEND;
+	else
+		opflags |= REQ_OP_WRITE;
+
+	if (use_fua)
+		opflags |= REQ_FUA;
+	else
+		dio->flags &= ~IOMAP_DIO_WRITE_FUA;
+
+	return opflags;
+}
+
 static loff_t
 iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
 		struct iomap_dio *dio, struct iomap *iomap)
@@ -208,6 +236,7 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
 	unsigned int blkbits = blksize_bits(bdev_logical_block_size(iomap->bdev));
 	unsigned int fs_block_size = i_blocksize(inode), pad;
 	unsigned int align = iov_iter_alignment(dio->submit.iter);
+	unsigned int bio_opf;
 	struct bio *bio;
 	bool need_zeroout = false;
 	bool use_fua = false;
@@ -263,6 +292,13 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
 			iomap_dio_zero(dio, iomap, pos - pad, pad);
 	}
 
+	/*
+	 * Set the operation flags early so that bio_iov_iter_get_pages
+	 * can set up the page vector appropriately for a ZONE_APPEND
+	 * operation.
+	 */
+	bio_opf = iomap_dio_bio_opflags(dio, iomap, use_fua);
+
 	do {
 		size_t n;
 		if (dio->error) {
@@ -278,6 +314,7 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
 		bio->bi_ioprio = dio->iocb->ki_ioprio;
 		bio->bi_private = dio;
 		bio->bi_end_io = iomap_dio_bio_end_io;
+		bio->bi_opf = bio_opf;
 
 		ret = bio_iov_iter_get_pages(bio, dio->submit.iter);
 		if (unlikely(ret)) {
@@ -293,14 +330,8 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
 
 		n = bio->bi_iter.bi_size;
 		if (dio->flags & IOMAP_DIO_WRITE) {
-			bio->bi_opf = REQ_OP_WRITE | REQ_SYNC | REQ_IDLE;
-			if (use_fua)
-				bio->bi_opf |= REQ_FUA;
-			else
-				dio->flags &= ~IOMAP_DIO_WRITE_FUA;
 			task_io_account_write(n);
 		} else {
-			bio->bi_opf = REQ_OP_READ;
 			if (dio->flags & IOMAP_DIO_DIRTY)
 				bio_set_pages_dirty(bio);
 		}
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 5bd3cac4df9c..8ebb1fa6f3b7 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -55,6 +55,7 @@ struct vm_fault;
 #define IOMAP_F_SHARED		0x04
 #define IOMAP_F_MERGED		0x08
 #define IOMAP_F_BUFFER_HEAD	0x10
+#define IOMAP_F_ZONE_APPEND	0x20
 
 /*
  * Flags set by the core iomap code during operations:
-- 
2.30.0



* [PATCH v15 03/42] btrfs: zoned: defer loading zone info after opening trees
  2021-02-04 10:21 ` [PATCH v15 01/42] block: add bio_add_zone_append_page Naohiro Aota
  2021-02-04 10:21   ` [PATCH v15 02/42] iomap: support REQ_OP_ZONE_APPEND Naohiro Aota
@ 2021-02-04 10:21   ` Naohiro Aota
  2021-02-04 10:21   ` [PATCH v15 04/42] btrfs: zoned: use regular super block location on zone emulation Naohiro Aota
                     ` (39 subsequent siblings)
  41 siblings, 0 replies; 72+ messages in thread
From: Naohiro Aota @ 2021-02-04 10:21 UTC (permalink / raw)
  To: linux-btrfs, dsterba
  Cc: hare, linux-fsdevel, Naohiro Aota, Anand Jain, Josef Bacik

This is a preparation patch to implement zone emulation on a regular
device.

To emulate a zoned filesystem on a regular (non-zoned) device, we need to
decide an emulated zone size. Instead of making it a compile-time static
value, we'll make it configurable at mkfs time. Since we have one zone ==
one device extent restriction, we can determine the emulated zone size
from the size of a device extent. We can extend btrfs_get_dev_zone_info()
to show a regular device filled with conventional zones once the zone size
is decided.

The current call site of btrfs_get_dev_zone_info() during the mount
process is earlier than the loading of the filesystem trees, so we don't
know the size of a device extent at that point. Thus, we can't slice a
regular device into conventional zones.

This patch introduces btrfs_get_dev_zone_info_all_devices() to load the
zone info for all the devices, and places a call to it in open_ctree()
after the trees have been loaded.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/disk-io.c | 13 +++++++++++++
 fs/btrfs/volumes.c |  4 ----
 fs/btrfs/zoned.c   | 25 +++++++++++++++++++++++++
 fs/btrfs/zoned.h   |  6 ++++++
 4 files changed, 44 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 71fab77873a5..2b6a3df765cd 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3333,6 +3333,19 @@ int __cold open_ctree(struct super_block *sb, struct btrfs_fs_devices *fs_device
 	if (ret)
 		goto fail_tree_roots;
 
+	/*
+	 * Get zone type information of zoned block devices. This will also
+	 * handle emulation of a zoned filesystem if a regular device has the
+	 * zoned incompat feature flag set.
+	 */
+	ret = btrfs_get_dev_zone_info_all_devices(fs_info);
+	if (ret) {
+		btrfs_err(fs_info,
+			  "zoned: failed to read device zone info: %d",
+			  ret);
+		goto fail_block_groups;
+	}
+
 	/*
 	 * If we have a uuid root and we're not being told to rescan we need to
 	 * check the generation here so we can set the
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 3948f5b50d11..07cd4742c123 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -669,10 +669,6 @@ static int btrfs_open_one_device(struct btrfs_fs_devices *fs_devices,
 	clear_bit(BTRFS_DEV_STATE_IN_FS_METADATA, &device->dev_state);
 	device->mode = flags;
 
-	ret = btrfs_get_dev_zone_info(device);
-	if (ret != 0)
-		goto error_free_page;
-
 	fs_devices->open_devices++;
 	if (test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state) &&
 	    device->devid != BTRFS_DEV_REPLACE_DEVID) {
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index 41d27fefd306..0b1b1f38a196 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -143,6 +143,31 @@ static int btrfs_get_dev_zones(struct btrfs_device *device, u64 pos,
 	return 0;
 }
 
+int btrfs_get_dev_zone_info_all_devices(struct btrfs_fs_info *fs_info)
+{
+	struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
+	struct btrfs_device *device;
+	int ret = 0;
+
+	/* fs_info->zone_size might not set yet. Use the incomapt flag here. */
+	if (!btrfs_fs_incompat(fs_info, ZONED))
+		return 0;
+
+	mutex_lock(&fs_devices->device_list_mutex);
+	list_for_each_entry(device, &fs_devices->devices, dev_list) {
+		/* We can skip reading of zone info for missing devices */
+		if (!device->bdev)
+			continue;
+
+		ret = btrfs_get_dev_zone_info(device);
+		if (ret)
+			break;
+	}
+	mutex_unlock(&fs_devices->device_list_mutex);
+
+	return ret;
+}
+
 int btrfs_get_dev_zone_info(struct btrfs_device *device)
 {
 	struct btrfs_zoned_device_info *zone_info = NULL;
diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h
index 8abe2f83272b..eb47b7ad9ab1 100644
--- a/fs/btrfs/zoned.h
+++ b/fs/btrfs/zoned.h
@@ -25,6 +25,7 @@ struct btrfs_zoned_device_info {
 #ifdef CONFIG_BLK_DEV_ZONED
 int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
 		       struct blk_zone *zone);
+int btrfs_get_dev_zone_info_all_devices(struct btrfs_fs_info *fs_info);
 int btrfs_get_dev_zone_info(struct btrfs_device *device);
 void btrfs_destroy_dev_zone_info(struct btrfs_device *device);
 int btrfs_check_zoned_mode(struct btrfs_fs_info *fs_info);
@@ -42,6 +43,11 @@ static inline int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
 	return 0;
 }
 
+static inline int btrfs_get_dev_zone_info_all_devices(struct btrfs_fs_info *fs_info)
+{
+	return 0;
+}
+
 static inline int btrfs_get_dev_zone_info(struct btrfs_device *device)
 {
 	return 0;
-- 
2.30.0



* [PATCH v15 04/42] btrfs: zoned: use regular super block location on zone emulation
  2021-02-04 10:21 ` [PATCH v15 01/42] block: add bio_add_zone_append_page Naohiro Aota
  2021-02-04 10:21   ` [PATCH v15 02/42] iomap: support REQ_OP_ZONE_APPEND Naohiro Aota
  2021-02-04 10:21   ` [PATCH v15 03/42] btrfs: zoned: defer loading zone info after opening trees Naohiro Aota
@ 2021-02-04 10:21   ` Naohiro Aota
  2021-02-04 10:21   ` [PATCH v15 05/42] btrfs: release path before calling to btrfs_load_block_group_zone_info Naohiro Aota
                     ` (38 subsequent siblings)
  41 siblings, 0 replies; 72+ messages in thread
From: Naohiro Aota @ 2021-02-04 10:21 UTC (permalink / raw)
  To: linux-btrfs, dsterba
  Cc: hare, linux-fsdevel, Naohiro Aota, Anand Jain, Josef Bacik

A zoned btrfs filesystem currently has a superblock at the beginning of
the superblock logging zones if the zones are conventional. This
difference in superblock position causes a chicken-and-egg problem for
filesystems with emulated zones. Since the device is a regular (non-zoned)
device, we cannot know whether the filesystem is regular or zoned while
reading the superblock. But, to load the superblock, we need to know
whether it is emulated zoned or not.

To solve the problem, place the superblocks at the same locations as on
regular btrfs on regular devices. This is possible because all the
superblock locations are guaranteed to be in an (emulated) conventional
zone on regular devices.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/zoned.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index 0b1b1f38a196..8b3868088c5e 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -552,7 +552,13 @@ int btrfs_sb_log_location(struct btrfs_device *device, int mirror, int rw,
 	struct btrfs_zoned_device_info *zinfo = device->zone_info;
 	u32 zone_num;
 
-	if (!zinfo) {
+	/*
+	 * For a zoned filesystem on a non-zoned block device, use the same
+	 * super block locations as regular filesystem. Doing so, the super
+	 * block can always be retrieved and the zoned flag of the volume
+	 * detected from the super block information.
+	 */
+	if (!bdev_is_zoned(device->bdev)) {
 		*bytenr_ret = btrfs_sb_offset(mirror);
 		return 0;
 	}
-- 
2.30.0



* [PATCH v15 05/42] btrfs: release path before calling to btrfs_load_block_group_zone_info
  2021-02-04 10:21 ` [PATCH v15 01/42] block: add bio_add_zone_append_page Naohiro Aota
                     ` (2 preceding siblings ...)
  2021-02-04 10:21   ` [PATCH v15 04/42] btrfs: zoned: use regular super block location on zone emulation Naohiro Aota
@ 2021-02-04 10:21   ` Naohiro Aota
  2021-02-04 10:21   ` [PATCH v15 06/42] btrfs: zoned: do not load fs_info::zoned from incompat flag Naohiro Aota
                     ` (37 subsequent siblings)
  41 siblings, 0 replies; 72+ messages in thread
From: Naohiro Aota @ 2021-02-04 10:21 UTC (permalink / raw)
  To: linux-btrfs, dsterba
  Cc: hare, linux-fsdevel, Johannes Thumshirn, Anand Jain, Josef Bacik

From: Johannes Thumshirn <johannes.thumshirn@wdc.com>

Since we have no write pointer in conventional zones, we cannot
determine the allocation offset from it. Instead, we set the allocation
offset after the highest addressed extent. This is done by reading the
extent tree in btrfs_load_block_group_zone_info().

However, this function is called from btrfs_read_block_groups(), so the
read lock for the tree node could be recursively taken.

To avoid this unsafe locking scenario, release the path before reading
the extent tree to get the allocation offset.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
---
 fs/btrfs/block-group.c | 38 +++++++++++++++++---------------------
 1 file changed, 17 insertions(+), 21 deletions(-)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index 5fa6b3d540f4..b8fbee70a897 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -1810,24 +1810,8 @@ static int check_chunk_block_group_mappings(struct btrfs_fs_info *fs_info)
 	return ret;
 }
 
-static void read_block_group_item(struct btrfs_block_group *cache,
-				 struct btrfs_path *path,
-				 const struct btrfs_key *key)
-{
-	struct extent_buffer *leaf = path->nodes[0];
-	struct btrfs_block_group_item bgi;
-	int slot = path->slots[0];
-
-	cache->length = key->offset;
-
-	read_extent_buffer(leaf, &bgi, btrfs_item_ptr_offset(leaf, slot),
-			   sizeof(bgi));
-	cache->used = btrfs_stack_block_group_used(&bgi);
-	cache->flags = btrfs_stack_block_group_flags(&bgi);
-}
-
 static int read_one_block_group(struct btrfs_fs_info *info,
-				struct btrfs_path *path,
+				struct btrfs_block_group_item *bgi,
 				const struct btrfs_key *key,
 				int need_clear)
 {
@@ -1842,7 +1826,9 @@ static int read_one_block_group(struct btrfs_fs_info *info,
 	if (!cache)
 		return -ENOMEM;
 
-	read_block_group_item(cache, path, key);
+	cache->length = key->offset;
+	cache->used = btrfs_stack_block_group_used(bgi);
+	cache->flags = btrfs_stack_block_group_flags(bgi);
 
 	set_free_space_tree_thresholds(cache);
 
@@ -2001,19 +1987,29 @@ int btrfs_read_block_groups(struct btrfs_fs_info *info)
 		need_clear = 1;
 
 	while (1) {
+		struct btrfs_block_group_item bgi;
+		struct extent_buffer *leaf;
+		int slot;
+
 		ret = find_first_block_group(info, path, &key);
 		if (ret > 0)
 			break;
 		if (ret != 0)
 			goto error;
 
-		btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
-		ret = read_one_block_group(info, path, &key, need_clear);
+		leaf = path->nodes[0];
+		slot = path->slots[0];
+
+		read_extent_buffer(leaf, &bgi, btrfs_item_ptr_offset(leaf, slot),
+				   sizeof(bgi));
+
+		btrfs_item_key_to_cpu(leaf, &key, slot);
+		btrfs_release_path(path);
+		ret = read_one_block_group(info, &bgi, &key, need_clear);
 		if (ret < 0)
 			goto error;
 		key.objectid += key.offset;
 		key.offset = 0;
-		btrfs_release_path(path);
 	}
 	btrfs_release_path(path);
 
-- 
2.30.0



* [PATCH v15 06/42] btrfs: zoned: do not load fs_info::zoned from incompat flag
  2021-02-04 10:21 ` [PATCH v15 01/42] block: add bio_add_zone_append_page Naohiro Aota
                     ` (3 preceding siblings ...)
  2021-02-04 10:21   ` [PATCH v15 05/42] btrfs: release path before calling to btrfs_load_block_group_zone_info Naohiro Aota
@ 2021-02-04 10:21   ` Naohiro Aota
  2021-02-04 10:21   ` [PATCH v15 07/42] btrfs: zoned: disallow fitrim on zoned filesystems Naohiro Aota
                     ` (36 subsequent siblings)
  41 siblings, 0 replies; 72+ messages in thread
From: Naohiro Aota @ 2021-02-04 10:21 UTC (permalink / raw)
  To: linux-btrfs, dsterba
  Cc: hare, linux-fsdevel, Johannes Thumshirn, Anand Jain, Josef Bacik

From: Johannes Thumshirn <johannes.thumshirn@wdc.com>

Don't set the zoned flag in fs_info as soon as we're encountering the
incompat filesystem flag for a zoned filesystem on mount. The zoned flag
in fs_info is in a union together with the zone_size, so setting it too
early will result in setting an incorrect zone_size as well.

Once the correct zone_size is read from the device, we can rely on the
zoned flag in fs_info as well to determine if the filesystem is zoned.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
---
 fs/btrfs/disk-io.c | 2 --
 fs/btrfs/zoned.c   | 8 ++++++++
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 2b6a3df765cd..8551b0fc1b22 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3201,8 +3201,6 @@ int __cold open_ctree(struct super_block *sb, struct btrfs_fs_devices *fs_device
 	if (features & BTRFS_FEATURE_INCOMPAT_SKINNY_METADATA)
 		btrfs_info(fs_info, "has skinny extents");
 
-	fs_info->zoned = (features & BTRFS_FEATURE_INCOMPAT_ZONED);
-
 	/*
 	 * flag our filesystem as having big metadata blocks if
 	 * they are bigger than the page size
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index 8b3868088c5e..c0840412ccb6 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -432,6 +432,14 @@ int btrfs_check_zoned_mode(struct btrfs_fs_info *fs_info)
 	fs_info->zone_size = zone_size;
 	fs_info->max_zone_append_size = max_zone_append_size;
 
+	/*
+	 * Check mount options here, because we might change fs_info->zoned
+	 * from fs_info->zone_size.
+	 */
+	ret = btrfs_check_mountopts_zoned(fs_info);
+	if (ret)
+		goto out;
+
 	btrfs_info(fs_info, "zoned mode enabled with zone size %llu", zone_size);
 out:
 	return ret;
-- 
2.30.0



* [PATCH v15 07/42] btrfs: zoned: disallow fitrim on zoned filesystems
  2021-02-04 10:21 ` [PATCH v15 01/42] block: add bio_add_zone_append_page Naohiro Aota
                     ` (4 preceding siblings ...)
  2021-02-04 10:21   ` [PATCH v15 06/42] btrfs: zoned: do not load fs_info::zoned from incompat flag Naohiro Aota
@ 2021-02-04 10:21   ` Naohiro Aota
  2021-02-04 10:21   ` [PATCH v15 08/42] btrfs: zoned: allow zoned filesystems on non-zoned block devices Naohiro Aota
                     ` (35 subsequent siblings)
  41 siblings, 0 replies; 72+ messages in thread
From: Naohiro Aota @ 2021-02-04 10:21 UTC (permalink / raw)
  To: linux-btrfs, dsterba
  Cc: hare, linux-fsdevel, Naohiro Aota, Anand Jain, Josef Bacik

The implementation of fitrim depends on the space cache, which is not
used and is disabled for the zoned extent allocator. So the current code
does not work with zoned filesystems.

In the future, we can implement fitrim for zoned filesystems by enabling
the space cache (but only for fitrim) or by scanning the extent tree at
fitrim time. For now, disallow fitrim on zoned filesystems.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
---
 fs/btrfs/ioctl.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index e6a63f652235..a8c60d46d19c 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -527,6 +527,14 @@ static noinline int btrfs_ioctl_fitrim(struct btrfs_fs_info *fs_info,
 	if (!capable(CAP_SYS_ADMIN))
 		return -EPERM;
 
+	/*
+	 * btrfs_trim_block_group() depends on space cache, which is not
+	 * available in zoned filesystem. So, disallow fitrim on a zoned
+	 * filesystem for now.
+	 */
+	if (btrfs_is_zoned(fs_info))
+		return -EOPNOTSUPP;
+
 	/*
 	 * If the fs is mounted with nologreplay, which requires it to be
 	 * mounted in RO mode as well, we can not allow discard on free space
-- 
2.30.0



* [PATCH v15 08/42] btrfs: zoned: allow zoned filesystems on non-zoned block devices
  2021-02-04 10:21 ` [PATCH v15 01/42] block: add bio_add_zone_append_page Naohiro Aota
                     ` (5 preceding siblings ...)
  2021-02-04 10:21   ` [PATCH v15 07/42] btrfs: zoned: disallow fitrim on zoned filesystems Naohiro Aota
@ 2021-02-04 10:21   ` Naohiro Aota
  2021-02-04 10:21   ` [PATCH v15 09/42] btrfs: zoned: implement zoned chunk allocator Naohiro Aota
                     ` (34 subsequent siblings)
  41 siblings, 0 replies; 72+ messages in thread
From: Naohiro Aota @ 2021-02-04 10:21 UTC (permalink / raw)
  To: linux-btrfs, dsterba
  Cc: hare, linux-fsdevel, Johannes Thumshirn, Anand Jain, Josef Bacik,
	Naohiro Aota

From: Johannes Thumshirn <johannes.thumshirn@wdc.com>

Allow running a zoned filesystem on non-zoned devices. This is done by
"slicing up" the block device into statically sized chunks and faking a
conventional zone on each of them. The emulated zone size is determined
from the size of a device extent.

This is mainly aimed at testing parts of zoned filesystems, i.e. the zoned
chunk allocator, on regular block devices.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
---
 fs/btrfs/zoned.c | 150 +++++++++++++++++++++++++++++++++++++++++++----
 fs/btrfs/zoned.h |  14 +++--
 2 files changed, 148 insertions(+), 16 deletions(-)

diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index c0840412ccb6..6699f626a86e 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -119,6 +119,36 @@ static inline u32 sb_zone_number(int shift, int mirror)
 	return 0;
 }
 
+/*
+ * Emulate blkdev_report_zones() for a non-zoned device. It slices up the block
+ * device into static sized chunks and fake a conventional zone on each of
+ * them.
+ */
+static int emulate_report_zones(struct btrfs_device *device, u64 pos,
+				struct blk_zone *zones, unsigned int nr_zones)
+{
+	const sector_t zone_sectors = device->fs_info->zone_size >> SECTOR_SHIFT;
+	sector_t bdev_size = bdev_nr_sectors(device->bdev);
+	unsigned int i;
+
+	pos >>= SECTOR_SHIFT;
+	for (i = 0; i < nr_zones; i++) {
+		zones[i].start = i * zone_sectors + pos;
+		zones[i].len = zone_sectors;
+		zones[i].capacity = zone_sectors;
+		zones[i].wp = zones[i].start + zone_sectors;
+		zones[i].type = BLK_ZONE_TYPE_CONVENTIONAL;
+		zones[i].cond = BLK_ZONE_COND_NOT_WP;
+
+		if (zones[i].wp >= bdev_size) {
+			i++;
+			break;
+		}
+	}
+
+	return i;
+}
+
 static int btrfs_get_dev_zones(struct btrfs_device *device, u64 pos,
 			       struct blk_zone *zones, unsigned int *nr_zones)
 {
@@ -127,6 +157,12 @@ static int btrfs_get_dev_zones(struct btrfs_device *device, u64 pos,
 	if (!*nr_zones)
 		return 0;
 
+	if (!bdev_is_zoned(device->bdev)) {
+		ret = emulate_report_zones(device, pos, zones, *nr_zones);
+		*nr_zones = ret;
+		return 0;
+	}
+
 	ret = blkdev_report_zones(device->bdev, pos >> SECTOR_SHIFT, *nr_zones,
 				  copy_zone_info_cb, zones);
 	if (ret < 0) {
@@ -143,6 +179,50 @@ static int btrfs_get_dev_zones(struct btrfs_device *device, u64 pos,
 	return 0;
 }
 
+/* The emulated zone size is determined from the size of device extent */
+static int calculate_emulated_zone_size(struct btrfs_fs_info *fs_info)
+{
+	struct btrfs_path *path;
+	struct btrfs_root *root = fs_info->dev_root;
+	struct btrfs_key key;
+	struct extent_buffer *leaf;
+	struct btrfs_dev_extent *dext;
+	int ret = 0;
+
+	key.objectid = 1;
+	key.type = BTRFS_DEV_EXTENT_KEY;
+	key.offset = 0;
+
+	path = btrfs_alloc_path();
+	if (!path)
+		return -ENOMEM;
+
+	ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
+	if (ret < 0)
+		goto out;
+
+	if (path->slots[0] >= btrfs_header_nritems(path->nodes[0])) {
+		ret = btrfs_next_item(root, path);
+		if (ret < 0)
+			goto out;
+		/* No dev extents at all? Not good */
+		if (ret > 0) {
+			ret = -EUCLEAN;
+			goto out;
+		}
+	}
+
+	leaf = path->nodes[0];
+	dext = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_dev_extent);
+	fs_info->zone_size = btrfs_dev_extent_length(leaf, dext);
+	ret = 0;
+
+out:
+	btrfs_free_path(path);
+
+	return ret;
+}
+
 int btrfs_get_dev_zone_info_all_devices(struct btrfs_fs_info *fs_info)
 {
 	struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
@@ -170,6 +250,7 @@ int btrfs_get_dev_zone_info_all_devices(struct btrfs_fs_info *fs_info)
 
 int btrfs_get_dev_zone_info(struct btrfs_device *device)
 {
+	struct btrfs_fs_info *fs_info = device->fs_info;
 	struct btrfs_zoned_device_info *zone_info = NULL;
 	struct block_device *bdev = device->bdev;
 	struct request_queue *queue = bdev_get_queue(bdev);
@@ -178,9 +259,14 @@ int btrfs_get_dev_zone_info(struct btrfs_device *device)
 	struct blk_zone *zones = NULL;
 	unsigned int i, nreported = 0, nr_zones;
 	unsigned int zone_sectors;
+	char *model, *emulated;
 	int ret;
 
-	if (!bdev_is_zoned(bdev))
+	/*
+	 * Cannot use btrfs_is_zoned here, since fs_info::zone_size might not
+	 * yet be set.
+	 */
+	if (!btrfs_fs_incompat(fs_info, ZONED))
 		return 0;
 
 	if (device->zone_info)
@@ -190,8 +276,20 @@ int btrfs_get_dev_zone_info(struct btrfs_device *device)
 	if (!zone_info)
 		return -ENOMEM;
 
+	if (!bdev_is_zoned(bdev)) {
+		if (!fs_info->zone_size) {
+			ret = calculate_emulated_zone_size(fs_info);
+			if (ret)
+				goto out;
+		}
+
+		ASSERT(fs_info->zone_size);
+		zone_sectors = fs_info->zone_size >> SECTOR_SHIFT;
+	} else {
+		zone_sectors = bdev_zone_sectors(bdev);
+	}
+
 	nr_sectors = bdev_nr_sectors(bdev);
-	zone_sectors = bdev_zone_sectors(bdev);
 	/* Check if it's power of 2 (see is_power_of_2) */
 	ASSERT(zone_sectors != 0 && (zone_sectors & (zone_sectors - 1)) == 0);
 	zone_info->zone_size = zone_sectors << SECTOR_SHIFT;
@@ -297,20 +395,42 @@ int btrfs_get_dev_zone_info(struct btrfs_device *device)
 
 	device->zone_info = zone_info;
 
-	/* device->fs_info is not safe to use for printing messages */
-	btrfs_info_in_rcu(NULL,
-			"host-%s zoned block device %s, %u zones of %llu bytes",
-			bdev_zoned_model(bdev) == BLK_ZONED_HM ? "managed" : "aware",
-			rcu_str_deref(device->name), zone_info->nr_zones,
-			zone_info->zone_size);
+	switch (bdev_zoned_model(bdev)) {
+	case BLK_ZONED_HM:
+		model = "host-managed zoned";
+		emulated = "";
+		break;
+	case BLK_ZONED_HA:
+		model = "host-aware zoned";
+		emulated = "";
+		break;
+	case BLK_ZONED_NONE:
+		model = "regular";
+		emulated = "emulated ";
+		break;
+	default:
+		/* Just in case */
+		btrfs_err_in_rcu(fs_info, "zoned: unsupported model %d on %s",
+				 bdev_zoned_model(bdev),
+				 rcu_str_deref(device->name));
+		ret = -EOPNOTSUPP;
+		goto out_free_zone_info;
+	}
+
+	btrfs_info_in_rcu(fs_info,
+		"%s block device %s, %u %szones of %llu bytes",
+		model, rcu_str_deref(device->name), zone_info->nr_zones,
+		emulated, zone_info->zone_size);
 
 	return 0;
 
 out:
 	kfree(zones);
+out_free_zone_info:
 	bitmap_free(zone_info->empty_zones);
 	bitmap_free(zone_info->seq_zones);
 	kfree(zone_info);
+	device->zone_info = NULL;
 
 	return ret;
 }
@@ -349,7 +469,7 @@ int btrfs_check_zoned_mode(struct btrfs_fs_info *fs_info)
 	u64 nr_devices = 0;
 	u64 zone_size = 0;
 	u64 max_zone_append_size = 0;
-	const bool incompat_zoned = btrfs_is_zoned(fs_info);
+	const bool incompat_zoned = btrfs_fs_incompat(fs_info, ZONED);
 	int ret = 0;
 
 	/* Count zoned devices */
@@ -360,9 +480,17 @@ int btrfs_check_zoned_mode(struct btrfs_fs_info *fs_info)
 			continue;
 
 		model = bdev_zoned_model(device->bdev);
+		/*
+		 * A Host-Managed zoned device must be used as a zoned device.
+		 * A Host-Aware zoned device and a non-zoned devices can be
+		 * treated as a zoned device, if ZONED flag is enabled in the
+		 * superblock.
+		 */
 		if (model == BLK_ZONED_HM ||
-		    (model == BLK_ZONED_HA && incompat_zoned)) {
-			struct btrfs_zoned_device_info *zone_info;
+		    (model == BLK_ZONED_HA && incompat_zoned) ||
+		    (model == BLK_ZONED_NONE && incompat_zoned)) {
+			struct btrfs_zoned_device_info *zone_info =
+				device->zone_info;
 
 			zone_info = device->zone_info;
 			zoned_devices++;
diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h
index eb47b7ad9ab1..5e78786bb723 100644
--- a/fs/btrfs/zoned.h
+++ b/fs/btrfs/zoned.h
@@ -142,12 +142,16 @@ static inline void btrfs_dev_clear_zone_empty(struct btrfs_device *device, u64 p
 static inline bool btrfs_check_device_zone_type(const struct btrfs_fs_info *fs_info,
 						struct block_device *bdev)
 {
-	u64 zone_size;
-
 	if (btrfs_is_zoned(fs_info)) {
-		zone_size = bdev_zone_sectors(bdev) << SECTOR_SHIFT;
-		/* Do not allow non-zoned device */
-		return bdev_is_zoned(bdev) && fs_info->zone_size == zone_size;
+		/*
+		 * We can allow a regular device on a zoned filesystem, because
+		 * we will emulate the zoned capabilities.
+		 */
+		if (!bdev_is_zoned(bdev))
+			return true;
+
+		return fs_info->zone_size ==
+			(bdev_zone_sectors(bdev) << SECTOR_SHIFT);
 	}
 
 	/* Do not allow Host Manged zoned device */
-- 
2.30.0



* [PATCH v15 09/42] btrfs: zoned: implement zoned chunk allocator
  2021-02-04 10:21 ` [PATCH v15 01/42] block: add bio_add_zone_append_page Naohiro Aota
                     ` (6 preceding siblings ...)
  2021-02-04 10:21   ` [PATCH v15 08/42] btrfs: zoned: allow zoned filesystems on non-zoned block devices Naohiro Aota
@ 2021-02-04 10:21   ` Naohiro Aota
  2021-02-04 10:21   ` [PATCH v15 10/42] btrfs: zoned: verify device extent is aligned to zone Naohiro Aota
                     ` (33 subsequent siblings)
  41 siblings, 0 replies; 72+ messages in thread
From: Naohiro Aota @ 2021-02-04 10:21 UTC (permalink / raw)
  To: linux-btrfs, dsterba
  Cc: hare, linux-fsdevel, Naohiro Aota, Anand Jain, Josef Bacik

Implement a zoned chunk and device extent allocator. One device zone
becomes a device extent so that a zone reset affects only this device
extent and does not change the state of blocks in the neighbor device
extents.

To implement the allocator, we need to extend the following functions for
a zoned filesystem.

- init_alloc_chunk_ctl
- dev_extent_search_start
- dev_extent_hole_check
- decide_stripe_size

init_alloc_chunk_ctl_zoned() is mostly the same as the regular one. It
always sets the stripe_size to the zone size and aligns the other parameters
to the zone size.

dev_extent_search_start() only aligns the start offset to zone boundaries.
We don't need to avoid the first 1MB as on a regular filesystem, because we
reserve the first two zones for superblock logging anyway.
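
For illustration only (assumed 256MiB zones; this is not part of the patch),
the zoned search start is just a round-up to the next zone boundary:

/* Illustrative userspace sketch of the BTRFS_CHUNK_ALLOC_ZONED case of
 * dev_extent_search_start(): align the start to the zone size, which is
 * a power of two.
 */
#include <stdio.h>
#include <stdint.h>

#define ALIGN_UP(x, a) (((x) + (a) - 1) & ~((uint64_t)(a) - 1))

int main(void)
{
	uint64_t zone_size = 256ULL << 20;	/* assumed 256 MiB zones */
	uint64_t start = 300ULL << 20;		/* arbitrary start offset */

	/* 300 MiB rounds up to 512 MiB, the next zone boundary */
	printf("search start: %llu MiB\n",
	       (unsigned long long)(ALIGN_UP(start, zone_size) >> 20));
	return 0;
}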

dev_extent_hole_check_zoned() checks if the zones in the given hole are
either conventional or empty sequential zones. It also skips the zones
reserved for superblock logging.

Shifting the hole this way means the new hole may now contain pending
extents, so in that case we loop again and re-check.

Finally, decide_stripe_size_zoned() shrinks the number of devices instead of
the stripe size, because we need to honor stripe_size == zone_size.
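
As a rough worked example (all numbers and the multi-device layout are made
up just to exercise the arithmetic; this is not part of the patch):

/* Illustrative sketch of the decide_stripe_size_zoned() arithmetic:
 * stripe_size is pinned to the zone size, so the device count is reduced
 * until data_stripes * stripe_size fits under max_chunk_size.
 */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint64_t zone_size = 256ULL << 20;		/* assumed zone size */
	uint64_t max_chunk_size = 10ULL << 30;		/* data chunk cap */
	int ncopies = 1, nparity = 0, dev_stripes = 1;	/* RAID0-like layout */
	int ndevs = 64;					/* assumed devices */

	uint64_t stripe_size = zone_size;
	int num_stripes = ndevs * dev_stripes;
	int data_stripes = (num_stripes - nparity) / ncopies;

	if (stripe_size * (uint64_t)data_stripes > max_chunk_size) {
		ndevs = (int)(((max_chunk_size * ncopies / stripe_size) +
			       nparity) / dev_stripes);
		num_stripes = ndevs * dev_stripes;
		data_stripes = (num_stripes - nparity) / ncopies;
	}
	/* 64 devices shrink to 40; chunk = 40 * 256 MiB = 10 GiB */
	printf("ndevs=%d chunk=%llu MiB\n", ndevs,
	       (unsigned long long)(stripe_size * data_stripes >> 20));
	return 0;
}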

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/volumes.c | 171 ++++++++++++++++++++++++++++++++++++++++-----
 fs/btrfs/volumes.h |   1 +
 fs/btrfs/zoned.c   | 141 +++++++++++++++++++++++++++++++++++++
 fs/btrfs/zoned.h   |  25 +++++++
 4 files changed, 321 insertions(+), 17 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 07cd4742c123..ae2aeadad5a0 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1414,11 +1414,62 @@ static u64 dev_extent_search_start(struct btrfs_device *device, u64 start)
 		 * make sure to start at an offset of at least 1MB.
 		 */
 		return max_t(u64, start, SZ_1M);
+	case BTRFS_CHUNK_ALLOC_ZONED:
+		/*
+		 * We don't care about the starting region like regular
+		 * allocator, because we anyway use/reserve the first two zones
+		 * for superblock logging.
+		 */
+		return ALIGN(start, device->zone_info->zone_size);
 	default:
 		BUG();
 	}
 }
 
+static bool dev_extent_hole_check_zoned(struct btrfs_device *device,
+					u64 *hole_start, u64 *hole_size,
+					u64 num_bytes)
+{
+	u64 zone_size = device->zone_info->zone_size;
+	u64 pos;
+	int ret;
+	bool changed = false;
+
+	ASSERT(IS_ALIGNED(*hole_start, zone_size));
+
+	while (*hole_size > 0) {
+		pos = btrfs_find_allocatable_zones(device, *hole_start,
+						   *hole_start + *hole_size,
+						   num_bytes);
+		if (pos != *hole_start) {
+			*hole_size = *hole_start + *hole_size - pos;
+			*hole_start = pos;
+			changed = true;
+			if (*hole_size < num_bytes)
+				break;
+		}
+
+		ret = btrfs_ensure_empty_zones(device, pos, num_bytes);
+
+		/* Range is ensured to be empty */
+		if (!ret)
+			return changed;
+
+		/* Given hole range was invalid (outside of device) */
+		if (ret == -ERANGE) {
+			*hole_start += *hole_size;
+			*hole_size = 0;
+			return 1;
+		}
+
+		*hole_start += zone_size;
+		*hole_size -= zone_size;
+		changed = true;
+	}
+
+	return changed;
+}
+
 /**
  * dev_extent_hole_check - check if specified hole is suitable for allocation
  * @device:	the device which we have the hole
@@ -1426,7 +1477,7 @@ static u64 dev_extent_search_start(struct btrfs_device *device, u64 start)
  * @hole_size:	the size of the hole
  * @num_bytes:	the size of the free space that we need
  *
- * This function may modify @hole_start and @hole_end to reflect the suitable
+ * This function may modify @hole_start and @hole_size to reflect the suitable
  * position for allocation. Returns 1 if hole position is updated, 0 otherwise.
  */
 static bool dev_extent_hole_check(struct btrfs_device *device, u64 *hole_start,
@@ -1435,24 +1486,39 @@ static bool dev_extent_hole_check(struct btrfs_device *device, u64 *hole_start,
 	bool changed = false;
 	u64 hole_end = *hole_start + *hole_size;
 
-	/*
-	 * Check before we set max_hole_start, otherwise we could end up
-	 * sending back this offset anyway.
-	 */
-	if (contains_pending_extent(device, hole_start, *hole_size)) {
-		if (hole_end >= *hole_start)
-			*hole_size = hole_end - *hole_start;
-		else
-			*hole_size = 0;
-		changed = true;
-	}
+	for (;;) {
+		/*
+		 * Check before we set max_hole_start, otherwise we could end up
+		 * sending back this offset anyway.
+		 */
+		if (contains_pending_extent(device, hole_start, *hole_size)) {
+			if (hole_end >= *hole_start)
+				*hole_size = hole_end - *hole_start;
+			else
+				*hole_size = 0;
+			changed = true;
+		}
+
+		switch (device->fs_devices->chunk_alloc_policy) {
+		case BTRFS_CHUNK_ALLOC_REGULAR:
+			/* No extra check */
+			break;
+		case BTRFS_CHUNK_ALLOC_ZONED:
+			if (dev_extent_hole_check_zoned(device, hole_start,
+							hole_size, num_bytes)) {
+				changed = true;
+				/*
+				 * The changed hole can contain pending extent.
+				 * Loop again to check that.
+				 */
+				continue;
+			}
+			break;
+		default:
+			BUG();
+		}
 
-	switch (device->fs_devices->chunk_alloc_policy) {
-	case BTRFS_CHUNK_ALLOC_REGULAR:
-		/* No extra check */
 		break;
-	default:
-		BUG();
 	}
 
 	return changed;
@@ -1505,6 +1571,9 @@ static int find_free_dev_extent_start(struct btrfs_device *device,
 
 	search_start = dev_extent_search_start(device, search_start);
 
+	WARN_ON(device->zone_info &&
+		!IS_ALIGNED(num_bytes, device->zone_info->zone_size));
+
 	path = btrfs_alloc_path();
 	if (!path)
 		return -ENOMEM;
@@ -4899,6 +4968,37 @@ static void init_alloc_chunk_ctl_policy_regular(
 	ctl->dev_extent_min = BTRFS_STRIPE_LEN * ctl->dev_stripes;
 }
 
+static void init_alloc_chunk_ctl_policy_zoned(
+				      struct btrfs_fs_devices *fs_devices,
+				      struct alloc_chunk_ctl *ctl)
+{
+	u64 zone_size = fs_devices->fs_info->zone_size;
+	u64 limit;
+	int min_num_stripes = ctl->devs_min * ctl->dev_stripes;
+	int min_data_stripes = (min_num_stripes - ctl->nparity) / ctl->ncopies;
+	u64 min_chunk_size = min_data_stripes * zone_size;
+	u64 type = ctl->type;
+
+	ctl->max_stripe_size = zone_size;
+	if (type & BTRFS_BLOCK_GROUP_DATA) {
+		ctl->max_chunk_size = round_down(BTRFS_MAX_DATA_CHUNK_SIZE,
+						 zone_size);
+	} else if (type & BTRFS_BLOCK_GROUP_METADATA) {
+		ctl->max_chunk_size = ctl->max_stripe_size;
+	} else if (type & BTRFS_BLOCK_GROUP_SYSTEM) {
+		ctl->max_chunk_size = 2 * ctl->max_stripe_size;
+		ctl->devs_max = min_t(int, ctl->devs_max,
+				      BTRFS_MAX_DEVS_SYS_CHUNK);
+	}
+
+	/* We don't want a chunk larger than 10% of writable space */
+	limit = max(round_down(div_factor(fs_devices->total_rw_bytes, 1),
+			       zone_size),
+		    min_chunk_size);
+	ctl->max_chunk_size = min(limit, ctl->max_chunk_size);
+	ctl->dev_extent_min = zone_size * ctl->dev_stripes;
+}
+
 static void init_alloc_chunk_ctl(struct btrfs_fs_devices *fs_devices,
 				 struct alloc_chunk_ctl *ctl)
 {
@@ -4919,6 +5019,9 @@ static void init_alloc_chunk_ctl(struct btrfs_fs_devices *fs_devices,
 	case BTRFS_CHUNK_ALLOC_REGULAR:
 		init_alloc_chunk_ctl_policy_regular(fs_devices, ctl);
 		break;
+	case BTRFS_CHUNK_ALLOC_ZONED:
+		init_alloc_chunk_ctl_policy_zoned(fs_devices, ctl);
+		break;
 	default:
 		BUG();
 	}
@@ -5045,6 +5148,38 @@ static int decide_stripe_size_regular(struct alloc_chunk_ctl *ctl,
 	return 0;
 }
 
+static int decide_stripe_size_zoned(struct alloc_chunk_ctl *ctl,
+				    struct btrfs_device_info *devices_info)
+{
+	u64 zone_size = devices_info[0].dev->zone_info->zone_size;
+	/* Number of stripes that count for block group size */
+	int data_stripes;
+
+	/*
+	 * It should hold because:
+	 *    dev_extent_min == dev_extent_want == zone_size * dev_stripes
+	 */
+	ASSERT(devices_info[ctl->ndevs - 1].max_avail == ctl->dev_extent_min);
+
+	ctl->stripe_size = zone_size;
+	ctl->num_stripes = ctl->ndevs * ctl->dev_stripes;
+	data_stripes = (ctl->num_stripes - ctl->nparity) / ctl->ncopies;
+
+	/* stripe_size is fixed in a zoned filesystem. Reduce ndevs instead. */
+	if (ctl->stripe_size * data_stripes > ctl->max_chunk_size) {
+		ctl->ndevs = div_u64(div_u64(ctl->max_chunk_size * ctl->ncopies,
+					     ctl->stripe_size) + ctl->nparity,
+				     ctl->dev_stripes);
+		ctl->num_stripes = ctl->ndevs * ctl->dev_stripes;
+		data_stripes = (ctl->num_stripes - ctl->nparity) / ctl->ncopies;
+		ASSERT(ctl->stripe_size * data_stripes <= ctl->max_chunk_size);
+	}
+
+	ctl->chunk_size = ctl->stripe_size * data_stripes;
+
+	return 0;
+}
+
 static int decide_stripe_size(struct btrfs_fs_devices *fs_devices,
 			      struct alloc_chunk_ctl *ctl,
 			      struct btrfs_device_info *devices_info)
@@ -5072,6 +5207,8 @@ static int decide_stripe_size(struct btrfs_fs_devices *fs_devices,
 	switch (fs_devices->chunk_alloc_policy) {
 	case BTRFS_CHUNK_ALLOC_REGULAR:
 		return decide_stripe_size_regular(ctl, devices_info);
+	case BTRFS_CHUNK_ALLOC_ZONED:
+		return decide_stripe_size_zoned(ctl, devices_info);
 	default:
 		BUG();
 	}
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 04e2b26823c2..598ac225176d 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -214,6 +214,7 @@ BTRFS_DEVICE_GETSET_FUNCS(bytes_used);
 
 enum btrfs_chunk_allocation_policy {
 	BTRFS_CHUNK_ALLOC_REGULAR,
+	BTRFS_CHUNK_ALLOC_ZONED,
 };
 
 /*
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index 6699f626a86e..69fd0d078b9b 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -1,11 +1,13 @@
 // SPDX-License-Identifier: GPL-2.0
 
+#include <linux/bitops.h>
 #include <linux/slab.h>
 #include <linux/blkdev.h>
 #include "ctree.h"
 #include "volumes.h"
 #include "zoned.h"
 #include "rcu-string.h"
+#include "disk-io.h"
 
 /* Maximum number of zones to report per blkdev_report_zones() call */
 #define BTRFS_REPORT_NR_ZONES   4096
@@ -559,6 +561,7 @@ int btrfs_check_zoned_mode(struct btrfs_fs_info *fs_info)
 
 	fs_info->zone_size = zone_size;
 	fs_info->max_zone_append_size = max_zone_append_size;
+	fs_info->fs_devices->chunk_alloc_policy = BTRFS_CHUNK_ALLOC_ZONED;
 
 	/*
 	 * Check mount options here, because we might change fs_info->zoned
@@ -779,3 +782,141 @@ int btrfs_reset_sb_log_zones(struct block_device *bdev, int mirror)
 				sb_zone << zone_sectors_shift,
 				zone_sectors * BTRFS_NR_SB_LOG_ZONES, GFP_NOFS);
 }
+
+/**
+ * btrfs_find_allocatable_zones - find allocatable zones within a given region
+ *
+ * @device:	the device to allocate a region on
+ * @hole_start: the position of the hole to allocate the region
+ * @hole_end:	the end of the hole
+ * @num_bytes:	size of wanted region
+ * @return:	position of allocatable zones
+ *
+ * Allocatable region should not contain any superblock locations.
+ */
+u64 btrfs_find_allocatable_zones(struct btrfs_device *device, u64 hole_start,
+				 u64 hole_end, u64 num_bytes)
+{
+	struct btrfs_zoned_device_info *zinfo = device->zone_info;
+	const u8 shift = zinfo->zone_size_shift;
+	u64 nzones = num_bytes >> shift;
+	u64 pos = hole_start;
+	u64 begin, end;
+	bool have_sb;
+	int i;
+
+	ASSERT(IS_ALIGNED(hole_start, zinfo->zone_size));
+	ASSERT(IS_ALIGNED(num_bytes, zinfo->zone_size));
+
+	while (pos < hole_end) {
+		begin = pos >> shift;
+		end = begin + nzones;
+
+		if (end > zinfo->nr_zones)
+			return hole_end;
+
+		/* Check if zones in the region are all empty */
+		if (btrfs_dev_is_sequential(device, pos) &&
+		    find_next_zero_bit(zinfo->empty_zones, end, begin) != end) {
+			pos += zinfo->zone_size;
+			continue;
+		}
+
+		have_sb = false;
+		for (i = 0; i < BTRFS_SUPER_MIRROR_MAX; i++) {
+			u32 sb_zone;
+			u64 sb_pos;
+
+			sb_zone = sb_zone_number(shift, i);
+			if (!(end <= sb_zone ||
+			      sb_zone + BTRFS_NR_SB_LOG_ZONES <= begin)) {
+				have_sb = true;
+				pos = ((u64)sb_zone + BTRFS_NR_SB_LOG_ZONES) << shift;
+				break;
+			}
+
+			/* We also need to exclude regular superblock positions */
+			sb_pos = btrfs_sb_offset(i);
+			if (!(pos + num_bytes <= sb_pos ||
+			      sb_pos + BTRFS_SUPER_INFO_SIZE <= pos)) {
+				have_sb = true;
+				pos = ALIGN(sb_pos + BTRFS_SUPER_INFO_SIZE,
+					    zinfo->zone_size);
+				break;
+			}
+		}
+		if (!have_sb)
+			break;
+	}
+
+	return pos;
+}
+
+int btrfs_reset_device_zone(struct btrfs_device *device, u64 physical,
+			    u64 length, u64 *bytes)
+{
+	int ret;
+
+	*bytes = 0;
+	ret = blkdev_zone_mgmt(device->bdev, REQ_OP_ZONE_RESET,
+			       physical >> SECTOR_SHIFT, length >> SECTOR_SHIFT,
+			       GFP_NOFS);
+	if (ret)
+		return ret;
+
+	*bytes = length;
+	while (length) {
+		btrfs_dev_set_zone_empty(device, physical);
+		physical += device->zone_info->zone_size;
+		length -= device->zone_info->zone_size;
+	}
+
+	return 0;
+}
+
+int btrfs_ensure_empty_zones(struct btrfs_device *device, u64 start, u64 size)
+{
+	struct btrfs_zoned_device_info *zinfo = device->zone_info;
+	const u8 shift = zinfo->zone_size_shift;
+	unsigned long begin = start >> shift;
+	unsigned long end = (start + size) >> shift;
+	u64 pos;
+	int ret;
+
+	ASSERT(IS_ALIGNED(start, zinfo->zone_size));
+	ASSERT(IS_ALIGNED(size, zinfo->zone_size));
+
+	if (end > zinfo->nr_zones)
+		return -ERANGE;
+
+	/* All the zones are conventional */
+	if (find_next_bit(zinfo->seq_zones, end, begin) == end)
+		return 0;
+
+	/* All the zones are sequential and empty */
+	if (find_next_zero_bit(zinfo->seq_zones, end, begin) == end &&
+	    find_next_zero_bit(zinfo->empty_zones, end, begin) == end)
+		return 0;
+
+	for (pos = start; pos < start + size; pos += zinfo->zone_size) {
+		u64 reset_bytes;
+
+		if (!btrfs_dev_is_sequential(device, pos) ||
+		    btrfs_dev_is_empty_zone(device, pos))
+			continue;
+
+		/* Free regions should be empty */
+		btrfs_warn_in_rcu(
+			device->fs_info,
+		"zoned: resetting device %s (devid %llu) zone %llu for allocation",
+			rcu_str_deref(device->name), device->devid, pos >> shift);
+		WARN_ON_ONCE(1);
+
+		ret = btrfs_reset_device_zone(device, pos, zinfo->zone_size,
+					      &reset_bytes);
+		if (ret)
+			return ret;
+	}
+
+	return 0;
+}
diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h
index 5e78786bb723..6c8f83c48c2e 100644
--- a/fs/btrfs/zoned.h
+++ b/fs/btrfs/zoned.h
@@ -36,6 +36,11 @@ int btrfs_sb_log_location(struct btrfs_device *device, int mirror, int rw,
 			  u64 *bytenr_ret);
 void btrfs_advance_sb_log(struct btrfs_device *device, int mirror);
 int btrfs_reset_sb_log_zones(struct block_device *bdev, int mirror);
+u64 btrfs_find_allocatable_zones(struct btrfs_device *device, u64 hole_start,
+				 u64 hole_end, u64 num_bytes);
+int btrfs_reset_device_zone(struct btrfs_device *device, u64 physical,
+			    u64 length, u64 *bytes);
+int btrfs_ensure_empty_zones(struct btrfs_device *device, u64 start, u64 size);
 #else /* CONFIG_BLK_DEV_ZONED */
 static inline int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
 				     struct blk_zone *zone)
@@ -91,6 +96,26 @@ static inline int btrfs_reset_sb_log_zones(struct block_device *bdev, int mirror
 	return 0;
 }
 
+static inline u64 btrfs_find_allocatable_zones(struct btrfs_device *device,
+					       u64 hole_start, u64 hole_end,
+					       u64 num_bytes)
+{
+	return hole_start;
+}
+
+static inline int btrfs_reset_device_zone(struct btrfs_device *device,
+					  u64 physical, u64 length, u64 *bytes)
+{
+	*bytes = 0;
+	return 0;
+}
+
+static inline int btrfs_ensure_empty_zones(struct btrfs_device *device,
+					   u64 start, u64 size)
+{
+	return 0;
+}
+
 #endif
 
 static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)
-- 
2.30.0


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH v15 10/42] btrfs: zoned: verify device extent is aligned to zone
  2021-02-04 10:21 ` [PATCH v15 01/42] block: add bio_add_zone_append_page Naohiro Aota
                     ` (7 preceding siblings ...)
  2021-02-04 10:21   ` [PATCH v15 09/42] btrfs: zoned: implement zoned chunk allocator Naohiro Aota
@ 2021-02-04 10:21   ` Naohiro Aota
  2021-02-04 10:21   ` [PATCH v15 11/42] btrfs: zoned: load zone's allocation offset Naohiro Aota
                     ` (32 subsequent siblings)
  41 siblings, 0 replies; 72+ messages in thread
From: Naohiro Aota @ 2021-02-04 10:21 UTC (permalink / raw)
  To: linux-btrfs, dsterba
  Cc: hare, linux-fsdevel, Naohiro Aota, Anand Jain, Josef Bacik

Add a check in verify_one_dev_extent() to ensure that a device extent on a
zoned block device is aligned to the respective zone boundary.

If it isn't, mark the filesystem as unclean.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
---
 fs/btrfs/volumes.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index ae2aeadad5a0..10401def16ef 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -7769,6 +7769,20 @@ static int verify_one_dev_extent(struct btrfs_fs_info *fs_info,
 		ret = -EUCLEAN;
 		goto out;
 	}
+
+	if (dev->zone_info) {
+		u64 zone_size = dev->zone_info->zone_size;
+
+		if (!IS_ALIGNED(physical_offset, zone_size) ||
+		    !IS_ALIGNED(physical_len, zone_size)) {
+			btrfs_err(fs_info,
+"zoned: dev extent devid %llu physical offset %llu len %llu is not aligned to device zone",
+				  devid, physical_offset, physical_len);
+			ret = -EUCLEAN;
+			goto out;
+		}
+	}
+
 out:
 	free_extent_map(em);
 	return ret;
-- 
2.30.0


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH v15 11/42] btrfs: zoned: load zone's allocation offset
  2021-02-04 10:21 ` [PATCH v15 01/42] block: add bio_add_zone_append_page Naohiro Aota
                     ` (8 preceding siblings ...)
  2021-02-04 10:21   ` [PATCH v15 10/42] btrfs: zoned: verify device extent is aligned to zone Naohiro Aota
@ 2021-02-04 10:21   ` Naohiro Aota
  2021-02-04 10:21   ` [PATCH v15 12/42] btrfs: zoned: calculate allocation offset for conventional zones Naohiro Aota
                     ` (31 subsequent siblings)
  41 siblings, 0 replies; 72+ messages in thread
From: Naohiro Aota @ 2021-02-04 10:21 UTC (permalink / raw)
  To: linux-btrfs, dsterba
  Cc: hare, linux-fsdevel, Naohiro Aota, Josef Bacik, Anand Jain

A zoned filesystem must allocate blocks at the zones' write pointer. The
device's write pointer position can be mapped to a logical address within
a block group. To facilitate this, add an "alloc_offset" to the
block-group to track the logical addresses of the write pointer.

This logical address is populated in btrfs_load_block_group_zone_info()
from the write pointers of corresponding zones.
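
A minimal sketch of that mapping (not part of the patch; as in struct
blk_zone, the write pointer and zone start are in 512-byte sectors, and the
numbers are assumed):

/* Illustrative sketch: derive the block group's allocation offset and the
 * next logical address from a sequential zone's write pointer, as
 * btrfs_load_block_group_zone_info() does.
 */
#include <stdio.h>
#include <stdint.h>

#define SECTOR_SHIFT 9

int main(void)
{
	uint64_t bg_start = 1ULL << 30;			/* assumed logical start */
	uint64_t zone_start = 512ULL << 11;		/* zone start, sectors */
	uint64_t zone_wp = zone_start + (96ULL << 11);	/* 96 MiB written */

	uint64_t alloc_offset = (zone_wp - zone_start) << SECTOR_SHIFT;

	printf("alloc_offset=%llu MiB, next write at logical %llu\n",
	       (unsigned long long)(alloc_offset >> 20),
	       (unsigned long long)(bg_start + alloc_offset));
	return 0;
}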

For now, a zoned filesystem supports only the single profile. Supporting
non-single profiles with zone append writing is not trivial. For example,
in the DUP profile, we send a zone append write IO to two zones on a
device. The device replies with the written LBAs for the IOs. If the
offsets of the returned addresses from the beginning of the zone differ,
the two copies end up at different logical addresses.

Supporting such separated physical addresses would require a fine-grained
logical-to-physical mapping and thus an additional metadata type, so
non-single profiles are disabled for now.
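
To make the problem concrete, a hypothetical sketch (the numbers are made
up; nothing here is from the patch):

/* Illustration of why zone append breaks DUP/RAID on a zoned filesystem:
 * each device picks its own landing position, so the two copies of the
 * same write can end up at different in-zone offsets.
 */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint64_t bg_start = 1ULL << 30;		/* assumed logical bg start */
	/* Offsets reported by the two zone append completions for the same
	 * data; they may differ because other in-flight appends can land
	 * first on either device.
	 */
	uint64_t wp_off_dev0 = 64ULL << 10;	/* 64 KiB into its zone */
	uint64_t wp_off_dev1 = 128ULL << 10;	/* 128 KiB into its zone */

	printf("copy 0 at logical %llu, copy 1 at logical %llu\n",
	       (unsigned long long)(bg_start + wp_off_dev0),
	       (unsigned long long)(bg_start + wp_off_dev1));
	/* One extent cannot have two logical addresses, hence single only */
	return 0;
}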

This commit supports the case where all the zones in a block group are
sequential. The next patch will handle block groups that contain
conventional zones.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/block-group.c |  15 ++++
 fs/btrfs/block-group.h |   6 ++
 fs/btrfs/zoned.c       | 151 +++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/zoned.h       |   7 ++
 4 files changed, 179 insertions(+)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index b8fbee70a897..e6bf728496eb 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -15,6 +15,7 @@
 #include "delalloc-space.h"
 #include "discard.h"
 #include "raid56.h"
+#include "zoned.h"
 
 /*
  * Return target flags in extended format or 0 if restripe for this chunk_type
@@ -1855,6 +1856,13 @@ static int read_one_block_group(struct btrfs_fs_info *info,
 			goto error;
 	}
 
+	ret = btrfs_load_block_group_zone_info(cache);
+	if (ret) {
+		btrfs_err(info, "zoned: failed to load zone info of bg %llu",
+			  cache->start);
+		goto error;
+	}
+
 	/*
 	 * We need to exclude the super stripes now so that the space info has
 	 * super bytes accounted for, otherwise we'll think we have more space
@@ -2141,6 +2149,13 @@ int btrfs_make_block_group(struct btrfs_trans_handle *trans, u64 bytes_used,
 	cache->cached = BTRFS_CACHE_FINISHED;
 	if (btrfs_fs_compat_ro(fs_info, FREE_SPACE_TREE))
 		cache->needs_free_space = 1;
+
+	ret = btrfs_load_block_group_zone_info(cache);
+	if (ret) {
+		btrfs_put_block_group(cache);
+		return ret;
+	}
+
 	ret = exclude_super_stripes(cache);
 	if (ret) {
 		/* We may have excluded something, so call this just in case */
diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h
index 8f74a96074f7..224946fa9bed 100644
--- a/fs/btrfs/block-group.h
+++ b/fs/btrfs/block-group.h
@@ -183,6 +183,12 @@ struct btrfs_block_group {
 
 	/* Record locked full stripes for RAID5/6 block group */
 	struct btrfs_full_stripe_locks_tree full_stripe_locks_root;
+
+	/*
+	 * Allocation offset for the block group to implement sequential
+	 * allocation. This is used only on a zoned filesystem.
+	 */
+	u64 alloc_offset;
 };
 
 static inline u64 btrfs_block_group_end(struct btrfs_block_group *block_group)
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index 69fd0d078b9b..0a7cd00f405f 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -3,14 +3,20 @@
 #include <linux/bitops.h>
 #include <linux/slab.h>
 #include <linux/blkdev.h>
+#include <linux/sched/mm.h>
 #include "ctree.h"
 #include "volumes.h"
 #include "zoned.h"
 #include "rcu-string.h"
 #include "disk-io.h"
+#include "block-group.h"
 
 /* Maximum number of zones to report per blkdev_report_zones() call */
 #define BTRFS_REPORT_NR_ZONES   4096
+/* Invalid allocation pointer value for missing devices */
+#define WP_MISSING_DEV ((u64)-1)
+/* Pseudo write pointer value for conventional zone */
+#define WP_CONVENTIONAL ((u64)-2)
 
 /* Number of superblock log zones */
 #define BTRFS_NR_SB_LOG_ZONES 2
@@ -920,3 +926,148 @@ int btrfs_ensure_empty_zones(struct btrfs_device *device, u64 start, u64 size)
 
 	return 0;
 }
+
+int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache)
+{
+	struct btrfs_fs_info *fs_info = cache->fs_info;
+	struct extent_map_tree *em_tree = &fs_info->mapping_tree;
+	struct extent_map *em;
+	struct map_lookup *map;
+	struct btrfs_device *device;
+	u64 logical = cache->start;
+	u64 length = cache->length;
+	u64 physical = 0;
+	int ret;
+	int i;
+	unsigned int nofs_flag;
+	u64 *alloc_offsets = NULL;
+	u32 num_sequential = 0, num_conventional = 0;
+
+	if (!btrfs_is_zoned(fs_info))
+		return 0;
+
+	/* Sanity check */
+	if (!IS_ALIGNED(length, fs_info->zone_size)) {
+		btrfs_err(fs_info,
+		"zoned: block group %llu len %llu unaligned to zone size %llu",
+			  logical, length, fs_info->zone_size);
+		return -EIO;
+	}
+
+	/* Get the chunk mapping */
+	read_lock(&em_tree->lock);
+	em = lookup_extent_mapping(em_tree, logical, length);
+	read_unlock(&em_tree->lock);
+
+	if (!em)
+		return -EINVAL;
+
+	map = em->map_lookup;
+
+	alloc_offsets = kcalloc(map->num_stripes, sizeof(*alloc_offsets), GFP_NOFS);
+	if (!alloc_offsets) {
+		free_extent_map(em);
+		return -ENOMEM;
+	}
+
+	for (i = 0; i < map->num_stripes; i++) {
+		bool is_sequential;
+		struct blk_zone zone;
+
+		device = map->stripes[i].dev;
+		physical = map->stripes[i].physical;
+
+		if (device->bdev == NULL) {
+			alloc_offsets[i] = WP_MISSING_DEV;
+			continue;
+		}
+
+		is_sequential = btrfs_dev_is_sequential(device, physical);
+		if (is_sequential)
+			num_sequential++;
+		else
+			num_conventional++;
+
+		if (!is_sequential) {
+			alloc_offsets[i] = WP_CONVENTIONAL;
+			continue;
+		}
+
+		/*
+		 * This zone will be used for allocation, so mark this zone
+		 * non-empty.
+		 */
+		btrfs_dev_clear_zone_empty(device, physical);
+
+		/*
+		 * The group is mapped to a sequential zone. Get the zone write
+		 * pointer to determine the allocation offset within the zone.
+		 */
+		WARN_ON(!IS_ALIGNED(physical, fs_info->zone_size));
+		nofs_flag = memalloc_nofs_save();
+		ret = btrfs_get_dev_zone(device, physical, &zone);
+		memalloc_nofs_restore(nofs_flag);
+		if (ret == -EIO || ret == -EOPNOTSUPP) {
+			ret = 0;
+			alloc_offsets[i] = WP_MISSING_DEV;
+			continue;
+		} else if (ret) {
+			goto out;
+		}
+
+		switch (zone.cond) {
+		case BLK_ZONE_COND_OFFLINE:
+		case BLK_ZONE_COND_READONLY:
+			btrfs_err(fs_info,
+		"zoned: offline/readonly zone %llu on device %s (devid %llu)",
+				  physical >> device->zone_info->zone_size_shift,
+				  rcu_str_deref(device->name), device->devid);
+			alloc_offsets[i] = WP_MISSING_DEV;
+			break;
+		case BLK_ZONE_COND_EMPTY:
+			alloc_offsets[i] = 0;
+			break;
+		case BLK_ZONE_COND_FULL:
+			alloc_offsets[i] = fs_info->zone_size;
+			break;
+		default:
+			/* Partially used zone */
+			alloc_offsets[i] =
+					((zone.wp - zone.start) << SECTOR_SHIFT);
+			break;
+		}
+	}
+
+	if (num_conventional > 0) {
+		/*
+		 * Since conventional zones do not have a write pointer, we
+		 * cannot determine alloc_offset from the pointer
+		 */
+		ret = -EINVAL;
+		goto out;
+	}
+
+	switch (map->type & BTRFS_BLOCK_GROUP_PROFILE_MASK) {
+	case 0: /* single */
+		cache->alloc_offset = alloc_offsets[0];
+		break;
+	case BTRFS_BLOCK_GROUP_DUP:
+	case BTRFS_BLOCK_GROUP_RAID1:
+	case BTRFS_BLOCK_GROUP_RAID0:
+	case BTRFS_BLOCK_GROUP_RAID10:
+	case BTRFS_BLOCK_GROUP_RAID5:
+	case BTRFS_BLOCK_GROUP_RAID6:
+		/* non-single profiles are not supported yet */
+	default:
+		btrfs_err(fs_info, "zoned: profile %s not yet supported",
+			  btrfs_bg_type_to_raid_name(map->type));
+		ret = -EINVAL;
+		goto out;
+	}
+
+out:
+	kfree(alloc_offsets);
+	free_extent_map(em);
+
+	return ret;
+}
diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h
index 6c8f83c48c2e..4f3152d7b98f 100644
--- a/fs/btrfs/zoned.h
+++ b/fs/btrfs/zoned.h
@@ -41,6 +41,7 @@ u64 btrfs_find_allocatable_zones(struct btrfs_device *device, u64 hole_start,
 int btrfs_reset_device_zone(struct btrfs_device *device, u64 physical,
 			    u64 length, u64 *bytes);
 int btrfs_ensure_empty_zones(struct btrfs_device *device, u64 start, u64 size);
+int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache);
 #else /* CONFIG_BLK_DEV_ZONED */
 static inline int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
 				     struct blk_zone *zone)
@@ -116,6 +117,12 @@ static inline int btrfs_ensure_empty_zones(struct btrfs_device *device,
 	return 0;
 }
 
+static inline int btrfs_load_block_group_zone_info(
+		struct btrfs_block_group *cache)
+{
+	return 0;
+}
+
 #endif
 
 static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)
-- 
2.30.0


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH v15 12/42] btrfs: zoned: calculate allocation offset for conventional zones
  2021-02-04 10:21 ` [PATCH v15 01/42] block: add bio_add_zone_append_page Naohiro Aota
                     ` (9 preceding siblings ...)
  2021-02-04 10:21   ` [PATCH v15 11/42] btrfs: zoned: load zone's allocation offset Naohiro Aota
@ 2021-02-04 10:21   ` Naohiro Aota
  2021-02-04 10:21   ` [PATCH v15 13/42] btrfs: zoned: track unusable bytes for zones Naohiro Aota
                     ` (30 subsequent siblings)
  41 siblings, 0 replies; 72+ messages in thread
From: Naohiro Aota @ 2021-02-04 10:21 UTC (permalink / raw)
  To: linux-btrfs, dsterba
  Cc: hare, linux-fsdevel, Naohiro Aota, Josef Bacik, Anand Jain

Conventional zones do not have a write pointer, so we cannot use it to
determine the allocation offset for sequential allocation if a block
group contains a conventional zone.

Instead, we can use the end of the highest addressed extent in the block
group as the allocation offset.
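
For illustration (assumed numbers, not from the patch), the offset
calculate_alloc_pointer() derives is simply the last extent's end made
relative to the block group start:

/* Illustrative sketch: allocation offset of a conventional-zone block
 * group = end of the highest addressed extent - block group start.
 */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint64_t bg_start = 4ULL << 30;				/* assumed */
	uint64_t last_extent_start = bg_start + (200ULL << 20);
	uint64_t last_extent_len = 8ULL << 20;			/* 8 MiB */

	uint64_t alloc_offset = last_extent_start + last_extent_len - bg_start;

	printf("alloc_offset=%llu MiB\n",
	       (unsigned long long)(alloc_offset >> 20));	/* 208 MiB */
	return 0;
}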

For a new block group, we cannot calculate the allocation offset by
consulting the extent tree, because doing so would take an extent buffer
lock after the chunk mutex, which is already held in
btrfs_make_block_group(), and cause a deadlock. Since it is a new block
group anyway, we can simply set the allocation offset to 0.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/block-group.c |  4 +-
 fs/btrfs/zoned.c       | 99 +++++++++++++++++++++++++++++++++++++++---
 fs/btrfs/zoned.h       |  4 +-
 3 files changed, 98 insertions(+), 9 deletions(-)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index e6bf728496eb..6d10874189df 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -1856,7 +1856,7 @@ static int read_one_block_group(struct btrfs_fs_info *info,
 			goto error;
 	}
 
-	ret = btrfs_load_block_group_zone_info(cache);
+	ret = btrfs_load_block_group_zone_info(cache, false);
 	if (ret) {
 		btrfs_err(info, "zoned: failed to load zone info of bg %llu",
 			  cache->start);
@@ -2150,7 +2150,7 @@ int btrfs_make_block_group(struct btrfs_trans_handle *trans, u64 bytes_used,
 	if (btrfs_fs_compat_ro(fs_info, FREE_SPACE_TREE))
 		cache->needs_free_space = 1;
 
-	ret = btrfs_load_block_group_zone_info(cache);
+	ret = btrfs_load_block_group_zone_info(cache, true);
 	if (ret) {
 		btrfs_put_block_group(cache);
 		return ret;
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index 0a7cd00f405f..b892566a1c93 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -927,7 +927,68 @@ int btrfs_ensure_empty_zones(struct btrfs_device *device, u64 start, u64 size)
 	return 0;
 }
 
-int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache)
+/*
+ * Calculate an allocation pointer from the extent allocation information
+ * for a block group consisting of conventional zones. It points to the
+ * end of the highest addressed extent in the block group, which is used
+ * as the allocation offset.
+ */
+static int calculate_alloc_pointer(struct btrfs_block_group *cache,
+				   u64 *offset_ret)
+{
+	struct btrfs_fs_info *fs_info = cache->fs_info;
+	struct btrfs_root *root = fs_info->extent_root;
+	struct btrfs_path *path;
+	struct btrfs_key key;
+	struct btrfs_key found_key;
+	int ret;
+	u64 length;
+
+	path = btrfs_alloc_path();
+	if (!path)
+		return -ENOMEM;
+
+	key.objectid = cache->start + cache->length;
+	key.type = 0;
+	key.offset = 0;
+
+	ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
+	/* We should not find the exact match */
+	if (!ret)
+		ret = -EUCLEAN;
+	if (ret < 0)
+		goto out;
+
+	ret = btrfs_previous_extent_item(root, path, cache->start);
+	if (ret) {
+		if (ret == 1) {
+			ret = 0;
+			*offset_ret = 0;
+		}
+		goto out;
+	}
+
+	btrfs_item_key_to_cpu(path->nodes[0], &found_key, path->slots[0]);
+
+	if (found_key.type == BTRFS_EXTENT_ITEM_KEY)
+		length = found_key.offset;
+	else
+		length = fs_info->nodesize;
+
+	if (!(found_key.objectid >= cache->start &&
+	       found_key.objectid + length <= cache->start + cache->length)) {
+		ret = -EUCLEAN;
+		goto out;
+	}
+	*offset_ret = found_key.objectid + length - cache->start;
+	ret = 0;
+
+out:
+	btrfs_free_path(path);
+	return ret;
+}
+
+int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache, bool new)
 {
 	struct btrfs_fs_info *fs_info = cache->fs_info;
 	struct extent_map_tree *em_tree = &fs_info->mapping_tree;
@@ -941,6 +1002,7 @@ int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache)
 	int i;
 	unsigned int nofs_flag;
 	u64 *alloc_offsets = NULL;
+	u64 last_alloc = 0;
 	u32 num_sequential = 0, num_conventional = 0;
 
 	if (!btrfs_is_zoned(fs_info))
@@ -1040,11 +1102,30 @@ int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache)
 
 	if (num_conventional > 0) {
 		/*
-		 * Since conventional zones do not have a write pointer, we
-		 * cannot determine alloc_offset from the pointer
+		 * Avoid calling calculate_alloc_pointer() for a new BG. It
+		 * is of no use for a new BG, whose offset must always be 0.
+		 *
+		 * Also, we have a lock chain of extent buffer lock ->
+		 * chunk mutex.  For new BG, this function is called from
+		 * btrfs_make_block_group() which is already taking the
+		 * chunk mutex. Thus, we cannot call
+		 * calculate_alloc_pointer() which takes extent buffer
+		 * locks to avoid deadlock.
 		 */
-		ret = -EINVAL;
-		goto out;
+		if (new) {
+			cache->alloc_offset = 0;
+			goto out;
+		}
+		ret = calculate_alloc_pointer(cache, &last_alloc);
+		if (ret || map->num_stripes == num_conventional) {
+			if (!ret)
+				cache->alloc_offset = last_alloc;
+			else
+				btrfs_err(fs_info,
+			"zoned: failed to determine allocation offset of bg %llu",
+					  cache->start);
+			goto out;
+		}
 	}
 
 	switch (map->type & BTRFS_BLOCK_GROUP_PROFILE_MASK) {
@@ -1066,6 +1147,14 @@ int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache)
 	}
 
 out:
+	/* An extent is allocated after the write pointer */
+	if (!ret && num_conventional && last_alloc > cache->alloc_offset) {
+		btrfs_err(fs_info,
+			  "zoned: got wrong write pointer in BG %llu: %llu > %llu",
+			  logical, last_alloc, cache->alloc_offset);
+		ret = -EIO;
+	}
+
 	kfree(alloc_offsets);
 	free_extent_map(em);
 
diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h
index 4f3152d7b98f..d27db3993e51 100644
--- a/fs/btrfs/zoned.h
+++ b/fs/btrfs/zoned.h
@@ -41,7 +41,7 @@ u64 btrfs_find_allocatable_zones(struct btrfs_device *device, u64 hole_start,
 int btrfs_reset_device_zone(struct btrfs_device *device, u64 physical,
 			    u64 length, u64 *bytes);
 int btrfs_ensure_empty_zones(struct btrfs_device *device, u64 start, u64 size);
-int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache);
+int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache, bool new);
 #else /* CONFIG_BLK_DEV_ZONED */
 static inline int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
 				     struct blk_zone *zone)
@@ -118,7 +118,7 @@ static inline int btrfs_ensure_empty_zones(struct btrfs_device *device,
 }
 
 static inline int btrfs_load_block_group_zone_info(
-		struct btrfs_block_group *cache)
+		struct btrfs_block_group *cache, bool new)
 {
 	return 0;
 }
-- 
2.30.0


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH v15 13/42] btrfs: zoned: track unusable bytes for zones
  2021-02-04 10:21 ` [PATCH v15 01/42] block: add bio_add_zone_append_page Naohiro Aota
                     ` (10 preceding siblings ...)
  2021-02-04 10:21   ` [PATCH v15 12/42] btrfs: zoned: calculate allocation offset for conventional zones Naohiro Aota
@ 2021-02-04 10:21   ` Naohiro Aota
  2021-02-04 10:21   ` [PATCH v15 14/42] btrfs: zoned: implement sequential extent allocation Naohiro Aota
                     ` (29 subsequent siblings)
  41 siblings, 0 replies; 72+ messages in thread
From: Naohiro Aota @ 2021-02-04 10:21 UTC (permalink / raw)
  To: linux-btrfs, dsterba; +Cc: hare, linux-fsdevel, Naohiro Aota, Josef Bacik

In a zoned filesystem, a region that was written once and then freed is not
usable again until the underlying zone has been reset. So we need to
distinguish such unusable space from usable free space.

Therefore we need to introduce the "zone_unusable" field to the block
group structure, and "bytes_zone_unusable" to the space_info structure
to track the unusable space.
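
As a quick worked example with assumed numbers (this mirrors the
btrfs_calc_zone_unusable() arithmetic added below):

/* Illustrative sketch of the zoned accounting: space between 'used' and
 * the allocation offset was written and then freed, so it is unusable
 * until a zone reset; space past the allocation offset is still free.
 */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint64_t length = 256ULL << 20;		/* assumed block group length */
	uint64_t alloc_offset = 160ULL << 20;	/* written so far */
	uint64_t used = 100ULL << 20;		/* still referenced */

	uint64_t zone_unusable = alloc_offset - used;	/* 60 MiB */
	uint64_t free_space = length - alloc_offset;	/* 96 MiB */

	printf("zone_unusable=%llu MiB free=%llu MiB\n",
	       (unsigned long long)(zone_unusable >> 20),
	       (unsigned long long)(free_space >> 20));
	return 0;
}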

Pinned bytes are always reclaimed to the unusable space. But when an
allocated region is returned before being used (e.g., the block group
becomes read-only between allocation time and reservation time), we can
safely return the region to the block group. For this situation, this
commit introduces btrfs_add_free_space_unused(). It behaves the same as
btrfs_add_free_space() on a regular filesystem; on a zoned filesystem, it
rewinds the allocation offset.
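
A minimal sketch of the two paths (the real __btrfs_add_free_space_zoned()
below also handles regions straddling the allocation offset and takes the
appropriate locks; the structure and numbers here are made up):

/* Illustration: a freed region below the allocation offset becomes
 * zone_unusable, while a region returned before use just rewinds the
 * allocation offset so the space is free again.
 */
#include <stdio.h>
#include <stdint.h>

struct bg { uint64_t alloc_offset, zone_unusable; };

static void add_free_space_zoned(struct bg *bg, uint64_t offset,
				 uint64_t size, int used)
{
	if (!used) {
		bg->alloc_offset -= size;	/* rewind, space is free again */
		return;
	}
	if (offset < bg->alloc_offset)		/* written and then freed */
		bg->zone_unusable += size;
}

int main(void)
{
	struct bg bg = { .alloc_offset = 160ULL << 20 };

	add_free_space_zoned(&bg, 40ULL << 20, 8ULL << 20, 1);	 /* freed data */
	add_free_space_zoned(&bg, 152ULL << 20, 8ULL << 20, 0); /* rolled back */

	printf("alloc_offset=%llu MiB zone_unusable=%llu MiB\n",
	       (unsigned long long)(bg.alloc_offset >> 20),
	       (unsigned long long)(bg.zone_unusable >> 20));
	return 0;
}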

Because the read-only byte counter already tracks free-but-unusable bytes
while a block group is read-only, we migrate the zone_unusable bytes into
the read-only bytes when a block group is marked read-only.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/block-group.c      | 51 +++++++++++++++++++++-------
 fs/btrfs/block-group.h      |  1 +
 fs/btrfs/extent-tree.c      |  5 +++
 fs/btrfs/free-space-cache.c | 67 +++++++++++++++++++++++++++++++++++++
 fs/btrfs/free-space-cache.h |  2 ++
 fs/btrfs/space-info.c       | 13 ++++---
 fs/btrfs/space-info.h       |  4 ++-
 fs/btrfs/sysfs.c            |  2 ++
 fs/btrfs/zoned.c            | 21 ++++++++++++
 fs/btrfs/zoned.h            |  3 ++
 10 files changed, 151 insertions(+), 18 deletions(-)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index 6d10874189df..e4444d4dd4b5 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -1009,12 +1009,17 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
 		WARN_ON(block_group->space_info->total_bytes
 			< block_group->length);
 		WARN_ON(block_group->space_info->bytes_readonly
-			< block_group->length);
+			< block_group->length - block_group->zone_unusable);
+		WARN_ON(block_group->space_info->bytes_zone_unusable
+			< block_group->zone_unusable);
 		WARN_ON(block_group->space_info->disk_total
 			< block_group->length * factor);
 	}
 	block_group->space_info->total_bytes -= block_group->length;
-	block_group->space_info->bytes_readonly -= block_group->length;
+	block_group->space_info->bytes_readonly -=
+		(block_group->length - block_group->zone_unusable);
+	block_group->space_info->bytes_zone_unusable -=
+		block_group->zone_unusable;
 	block_group->space_info->disk_total -= block_group->length * factor;
 
 	spin_unlock(&block_group->space_info->lock);
@@ -1158,7 +1163,7 @@ static int inc_block_group_ro(struct btrfs_block_group *cache, int force)
 	}
 
 	num_bytes = cache->length - cache->reserved - cache->pinned -
-		    cache->bytes_super - cache->used;
+		    cache->bytes_super - cache->zone_unusable - cache->used;
 
 	/*
 	 * Data never overcommits, even in mixed mode, so do just the straight
@@ -1189,6 +1194,12 @@ static int inc_block_group_ro(struct btrfs_block_group *cache, int force)
 
 	if (!ret) {
 		sinfo->bytes_readonly += num_bytes;
+		if (btrfs_is_zoned(cache->fs_info)) {
+			/* Migrate zone_unusable bytes to readonly */
+			sinfo->bytes_readonly += cache->zone_unusable;
+			sinfo->bytes_zone_unusable -= cache->zone_unusable;
+			cache->zone_unusable = 0;
+		}
 		cache->ro++;
 		list_add_tail(&cache->ro_list, &sinfo->ro_bgs);
 	}
@@ -1876,12 +1887,20 @@ static int read_one_block_group(struct btrfs_fs_info *info,
 	}
 
 	/*
-	 * Check for two cases, either we are full, and therefore don't need
-	 * to bother with the caching work since we won't find any space, or we
-	 * are empty, and we can just add all the space in and be done with it.
-	 * This saves us _a_lot_ of time, particularly in the full case.
+	 * For zoned filesystem, space after the allocation offset is the only
+	 * free space for a block group. So, we don't need any caching work.
+	 * btrfs_calc_zone_unusable() will set the amount of free space and
+	 * zone_unusable space.
+	 *
+	 * For regular filesystem, check for two cases, either we are full, and
+	 * therefore don't need to bother with the caching work since we won't
+	 * find any space, or we are empty, and we can just add all the space
+	 * in and be done with it.  This saves us _a_lot_ of time, particularly
+	 * in the full case.
 	 */
-	if (cache->length == cache->used) {
+	if (btrfs_is_zoned(info)) {
+		btrfs_calc_zone_unusable(cache);
+	} else if (cache->length == cache->used) {
 		cache->last_byte_to_unpin = (u64)-1;
 		cache->cached = BTRFS_CACHE_FINISHED;
 		btrfs_free_excluded_extents(cache);
@@ -1900,7 +1919,8 @@ static int read_one_block_group(struct btrfs_fs_info *info,
 	}
 	trace_btrfs_add_block_group(info, cache, 0);
 	btrfs_update_space_info(info, cache->flags, cache->length,
-				cache->used, cache->bytes_super, &space_info);
+				cache->used, cache->bytes_super,
+				cache->zone_unusable, &space_info);
 
 	cache->space_info = space_info;
 
@@ -1956,7 +1976,7 @@ static int fill_dummy_bgs(struct btrfs_fs_info *fs_info)
 			break;
 		}
 		btrfs_update_space_info(fs_info, bg->flags, em->len, em->len,
-					0, &space_info);
+					0, 0, &space_info);
 		bg->space_info = space_info;
 		link_block_group(bg);
 
@@ -2197,7 +2217,7 @@ int btrfs_make_block_group(struct btrfs_trans_handle *trans, u64 bytes_used,
 	 */
 	trace_btrfs_add_block_group(fs_info, cache, 1);
 	btrfs_update_space_info(fs_info, cache->flags, size, bytes_used,
-				cache->bytes_super, &cache->space_info);
+				cache->bytes_super, 0, &cache->space_info);
 	btrfs_update_global_block_rsv(fs_info);
 
 	link_block_group(cache);
@@ -2305,8 +2325,15 @@ void btrfs_dec_block_group_ro(struct btrfs_block_group *cache)
 	spin_lock(&cache->lock);
 	if (!--cache->ro) {
 		num_bytes = cache->length - cache->reserved -
-			    cache->pinned - cache->bytes_super - cache->used;
+			    cache->pinned - cache->bytes_super -
+			    cache->zone_unusable - cache->used;
 		sinfo->bytes_readonly -= num_bytes;
+		if (btrfs_is_zoned(cache->fs_info)) {
+			/* Migrate zone_unusable bytes back */
+			cache->zone_unusable = cache->alloc_offset - cache->used;
+			sinfo->bytes_zone_unusable += cache->zone_unusable;
+			sinfo->bytes_readonly -= cache->zone_unusable;
+		}
 		list_del_init(&cache->ro_list);
 	}
 	spin_unlock(&cache->lock);
diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h
index 224946fa9bed..0fd66febe115 100644
--- a/fs/btrfs/block-group.h
+++ b/fs/btrfs/block-group.h
@@ -189,6 +189,7 @@ struct btrfs_block_group {
 	 * allocation. This is used only on a zoned filesystem.
 	 */
 	u64 alloc_offset;
+	u64 zone_unusable;
 };
 
 static inline u64 btrfs_block_group_end(struct btrfs_block_group *block_group)
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 5476ab84e544..5c61c3f136f7 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -34,6 +34,7 @@
 #include "block-group.h"
 #include "discard.h"
 #include "rcu-string.h"
+#include "zoned.h"
 
 #undef SCRAMBLE_DELAYED_REFS
 
@@ -2740,6 +2741,10 @@ static int unpin_extent_range(struct btrfs_fs_info *fs_info,
 		if (cache->ro) {
 			space_info->bytes_readonly += len;
 			readonly = true;
+		} else if (btrfs_is_zoned(fs_info)) {
+			/* Need reset before reusing in a zoned block group */
+			space_info->bytes_zone_unusable += len;
+			readonly = true;
 		}
 		spin_unlock(&cache->lock);
 		if (!readonly && return_free_space &&
diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index 6134e10a6e7f..b93ac31eca69 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -2477,6 +2477,8 @@ int __btrfs_add_free_space(struct btrfs_fs_info *fs_info,
 	int ret = 0;
 	u64 filter_bytes = bytes;
 
+	ASSERT(!btrfs_is_zoned(fs_info));
+
 	info = kmem_cache_zalloc(btrfs_free_space_cachep, GFP_NOFS);
 	if (!info)
 		return -ENOMEM;
@@ -2534,11 +2536,49 @@ int __btrfs_add_free_space(struct btrfs_fs_info *fs_info,
 	return ret;
 }
 
+static int __btrfs_add_free_space_zoned(struct btrfs_block_group *block_group,
+					u64 bytenr, u64 size, bool used)
+{
+	struct btrfs_free_space_ctl *ctl = block_group->free_space_ctl;
+	u64 offset = bytenr - block_group->start;
+	u64 to_free, to_unusable;
+
+	spin_lock(&ctl->tree_lock);
+	if (!used)
+		to_free = size;
+	else if (offset >= block_group->alloc_offset)
+		to_free = size;
+	else if (offset + size <= block_group->alloc_offset)
+		to_free = 0;
+	else
+		to_free = offset + size - block_group->alloc_offset;
+	to_unusable = size - to_free;
+
+	ctl->free_space += to_free;
+	block_group->zone_unusable += to_unusable;
+	spin_unlock(&ctl->tree_lock);
+	if (!used) {
+		spin_lock(&block_group->lock);
+		block_group->alloc_offset -= size;
+		spin_unlock(&block_group->lock);
+	}
+
+	/* All the region is now unusable. Mark it as unused and reclaim */
+	if (block_group->zone_unusable == block_group->length)
+		btrfs_mark_bg_unused(block_group);
+
+	return 0;
+}
+
 int btrfs_add_free_space(struct btrfs_block_group *block_group,
 			 u64 bytenr, u64 size)
 {
 	enum btrfs_trim_state trim_state = BTRFS_TRIM_STATE_UNTRIMMED;
 
+	if (btrfs_is_zoned(block_group->fs_info))
+		return __btrfs_add_free_space_zoned(block_group, bytenr, size,
+						    true);
+
 	if (btrfs_test_opt(block_group->fs_info, DISCARD_SYNC))
 		trim_state = BTRFS_TRIM_STATE_TRIMMED;
 
@@ -2547,6 +2587,16 @@ int btrfs_add_free_space(struct btrfs_block_group *block_group,
 				      bytenr, size, trim_state);
 }
 
+int btrfs_add_free_space_unused(struct btrfs_block_group *block_group,
+				u64 bytenr, u64 size)
+{
+	if (btrfs_is_zoned(block_group->fs_info))
+		return __btrfs_add_free_space_zoned(block_group, bytenr, size,
+						    false);
+
+	return btrfs_add_free_space(block_group, bytenr, size);
+}
+
 /*
  * This is a subtle distinction because when adding free space back in general,
  * we want it to be added as untrimmed for async. But in the case where we add
@@ -2557,6 +2607,10 @@ int btrfs_add_free_space_async_trimmed(struct btrfs_block_group *block_group,
 {
 	enum btrfs_trim_state trim_state = BTRFS_TRIM_STATE_UNTRIMMED;
 
+	if (btrfs_is_zoned(block_group->fs_info))
+		return __btrfs_add_free_space_zoned(block_group, bytenr, size,
+						    true);
+
 	if (btrfs_test_opt(block_group->fs_info, DISCARD_SYNC) ||
 	    btrfs_test_opt(block_group->fs_info, DISCARD_ASYNC))
 		trim_state = BTRFS_TRIM_STATE_TRIMMED;
@@ -2574,6 +2628,9 @@ int btrfs_remove_free_space(struct btrfs_block_group *block_group,
 	int ret;
 	bool re_search = false;
 
+	if (btrfs_is_zoned(block_group->fs_info))
+		return 0;
+
 	spin_lock(&ctl->tree_lock);
 
 again:
@@ -2668,6 +2725,16 @@ void btrfs_dump_free_space(struct btrfs_block_group *block_group,
 	struct rb_node *n;
 	int count = 0;
 
+	/*
+	 * Zoned btrfs does not use free space tree and cluster. Just print
+	 * out the free space after the allocation offset.
+	 */
+	if (btrfs_is_zoned(fs_info)) {
+		btrfs_info(fs_info, "free space %llu",
+			   block_group->length - block_group->alloc_offset);
+		return;
+	}
+
 	spin_lock(&ctl->tree_lock);
 	for (n = rb_first(&ctl->free_space_offset); n; n = rb_next(n)) {
 		info = rb_entry(n, struct btrfs_free_space, offset_index);
diff --git a/fs/btrfs/free-space-cache.h b/fs/btrfs/free-space-cache.h
index ecb09a02d544..1f23088d43f9 100644
--- a/fs/btrfs/free-space-cache.h
+++ b/fs/btrfs/free-space-cache.h
@@ -107,6 +107,8 @@ int __btrfs_add_free_space(struct btrfs_fs_info *fs_info,
 			   enum btrfs_trim_state trim_state);
 int btrfs_add_free_space(struct btrfs_block_group *block_group,
 			 u64 bytenr, u64 size);
+int btrfs_add_free_space_unused(struct btrfs_block_group *block_group,
+				u64 bytenr, u64 size);
 int btrfs_add_free_space_async_trimmed(struct btrfs_block_group *block_group,
 				       u64 bytenr, u64 size);
 int btrfs_remove_free_space(struct btrfs_block_group *block_group,
diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index bccd98141a6e..2da6177f4b0b 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -169,6 +169,7 @@ u64 __pure btrfs_space_info_used(struct btrfs_space_info *s_info,
 	ASSERT(s_info);
 	return s_info->bytes_used + s_info->bytes_reserved +
 		s_info->bytes_pinned + s_info->bytes_readonly +
+		s_info->bytes_zone_unusable +
 		(may_use_included ? s_info->bytes_may_use : 0);
 }
 
@@ -264,7 +265,7 @@ int btrfs_init_space_info(struct btrfs_fs_info *fs_info)
 
 void btrfs_update_space_info(struct btrfs_fs_info *info, u64 flags,
 			     u64 total_bytes, u64 bytes_used,
-			     u64 bytes_readonly,
+			     u64 bytes_readonly, u64 bytes_zone_unusable,
 			     struct btrfs_space_info **space_info)
 {
 	struct btrfs_space_info *found;
@@ -280,6 +281,7 @@ void btrfs_update_space_info(struct btrfs_fs_info *info, u64 flags,
 	found->bytes_used += bytes_used;
 	found->disk_used += bytes_used * factor;
 	found->bytes_readonly += bytes_readonly;
+	found->bytes_zone_unusable += bytes_zone_unusable;
 	if (total_bytes > 0)
 		found->full = 0;
 	btrfs_try_granting_tickets(info, found);
@@ -429,10 +431,10 @@ static void __btrfs_dump_space_info(struct btrfs_fs_info *fs_info,
 		   info->total_bytes - btrfs_space_info_used(info, true),
 		   info->full ? "" : "not ");
 	btrfs_info(fs_info,
-		"space_info total=%llu, used=%llu, pinned=%llu, reserved=%llu, may_use=%llu, readonly=%llu",
+		"space_info total=%llu, used=%llu, pinned=%llu, reserved=%llu, may_use=%llu, readonly=%llu zone_unusable=%llu",
 		info->total_bytes, info->bytes_used, info->bytes_pinned,
 		info->bytes_reserved, info->bytes_may_use,
-		info->bytes_readonly);
+		info->bytes_readonly, info->bytes_zone_unusable);
 
 	DUMP_BLOCK_RSV(fs_info, global_block_rsv);
 	DUMP_BLOCK_RSV(fs_info, trans_block_rsv);
@@ -461,9 +463,10 @@ void btrfs_dump_space_info(struct btrfs_fs_info *fs_info,
 	list_for_each_entry(cache, &info->block_groups[index], list) {
 		spin_lock(&cache->lock);
 		btrfs_info(fs_info,
-			"block group %llu has %llu bytes, %llu used %llu pinned %llu reserved %s",
+			"block group %llu has %llu bytes, %llu used %llu pinned %llu reserved %llu zone_unusable %s",
 			cache->start, cache->length, cache->used, cache->pinned,
-			cache->reserved, cache->ro ? "[readonly]" : "");
+			cache->reserved, cache->zone_unusable,
+			cache->ro ? "[readonly]" : "");
 		spin_unlock(&cache->lock);
 		btrfs_dump_free_space(cache, bytes);
 	}
diff --git a/fs/btrfs/space-info.h b/fs/btrfs/space-info.h
index e237156ce888..b1a8ffb03b3e 100644
--- a/fs/btrfs/space-info.h
+++ b/fs/btrfs/space-info.h
@@ -17,6 +17,8 @@ struct btrfs_space_info {
 	u64 bytes_may_use;	/* number of bytes that may be used for
 				   delalloc/allocations */
 	u64 bytes_readonly;	/* total bytes that are read only */
+	u64 bytes_zone_unusable;	/* total bytes that are unusable until
+					   resetting the device zone */
 
 	u64 max_extent_size;	/* This will hold the maximum extent size of
 				   the space info if we had an ENOSPC in the
@@ -123,7 +125,7 @@ DECLARE_SPACE_INFO_UPDATE(bytes_pinned, "pinned");
 int btrfs_init_space_info(struct btrfs_fs_info *fs_info);
 void btrfs_update_space_info(struct btrfs_fs_info *info, u64 flags,
 			     u64 total_bytes, u64 bytes_used,
-			     u64 bytes_readonly,
+			     u64 bytes_readonly, u64 bytes_zone_unusable,
 			     struct btrfs_space_info **space_info);
 struct btrfs_space_info *btrfs_find_space_info(struct btrfs_fs_info *info,
 					       u64 flags);
diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index 19b9fffa2c9c..6eb1c50fa98c 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -666,6 +666,7 @@ SPACE_INFO_ATTR(bytes_pinned);
 SPACE_INFO_ATTR(bytes_reserved);
 SPACE_INFO_ATTR(bytes_may_use);
 SPACE_INFO_ATTR(bytes_readonly);
+SPACE_INFO_ATTR(bytes_zone_unusable);
 SPACE_INFO_ATTR(disk_used);
 SPACE_INFO_ATTR(disk_total);
 BTRFS_ATTR(space_info, total_bytes_pinned,
@@ -679,6 +680,7 @@ static struct attribute *space_info_attrs[] = {
 	BTRFS_ATTR_PTR(space_info, bytes_reserved),
 	BTRFS_ATTR_PTR(space_info, bytes_may_use),
 	BTRFS_ATTR_PTR(space_info, bytes_readonly),
+	BTRFS_ATTR_PTR(space_info, bytes_zone_unusable),
 	BTRFS_ATTR_PTR(space_info, disk_used),
 	BTRFS_ATTR_PTR(space_info, disk_total),
 	BTRFS_ATTR_PTR(space_info, total_bytes_pinned),
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index b892566a1c93..c5f9f4c6f20b 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -1160,3 +1160,24 @@ int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache, bool new)
 
 	return ret;
 }
+
+void btrfs_calc_zone_unusable(struct btrfs_block_group *cache)
+{
+	u64 unusable, free;
+
+	if (!btrfs_is_zoned(cache->fs_info))
+		return;
+
+	WARN_ON(cache->bytes_super != 0);
+	unusable = cache->alloc_offset - cache->used;
+	free = cache->length - cache->alloc_offset;
+
+	/* We only need ->free_space in ALLOC_SEQ block groups */
+	cache->last_byte_to_unpin = (u64)-1;
+	cache->cached = BTRFS_CACHE_FINISHED;
+	cache->free_space_ctl->free_space = free;
+	cache->zone_unusable = unusable;
+
+	/* Should not have any excluded extents. Just in case, though */
+	btrfs_free_excluded_extents(cache);
+}
diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h
index d27db3993e51..37304d1675e6 100644
--- a/fs/btrfs/zoned.h
+++ b/fs/btrfs/zoned.h
@@ -42,6 +42,7 @@ int btrfs_reset_device_zone(struct btrfs_device *device, u64 physical,
 			    u64 length, u64 *bytes);
 int btrfs_ensure_empty_zones(struct btrfs_device *device, u64 start, u64 size);
 int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache, bool new);
+void btrfs_calc_zone_unusable(struct btrfs_block_group *cache);
 #else /* CONFIG_BLK_DEV_ZONED */
 static inline int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
 				     struct blk_zone *zone)
@@ -123,6 +124,8 @@ static inline int btrfs_load_block_group_zone_info(
 	return 0;
 }
 
+static inline void btrfs_calc_zone_unusable(struct btrfs_block_group *cache) { }
+
 #endif
 
 static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)
-- 
2.30.0


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH v15 14/42] btrfs: zoned: implement sequential extent allocation
  2021-02-04 10:21 ` [PATCH v15 01/42] block: add bio_add_zone_append_page Naohiro Aota
                     ` (11 preceding siblings ...)
  2021-02-04 10:21   ` [PATCH v15 13/42] btrfs: zoned: track unusable bytes for zones Naohiro Aota
@ 2021-02-04 10:21   ` Naohiro Aota
  2021-02-04 10:21   ` [PATCH v15 15/42] btrfs: zoned: redirty released extent buffers Naohiro Aota
                     ` (28 subsequent siblings)
  41 siblings, 0 replies; 72+ messages in thread
From: Naohiro Aota @ 2021-02-04 10:21 UTC (permalink / raw)
  To: linux-btrfs, dsterba; +Cc: hare, linux-fsdevel, Naohiro Aota, Josef Bacik

Implement a sequential extent allocator for zoned filesystems. This
allocator only needs to check if there is enough space in the block group
after the allocation pointer to satisfy the extent allocation request.
Therefore the allocator never manages bitmaps or clusters. Also, add
assertions to the corresponding functions.
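
In simplified form, with assumed numbers, the whole decision is one
subtraction and one comparison (the real do_allocation_zoned() below also
takes the space_info and block group locks and updates the free space
counter):

/* Illustrative sketch of the sequential allocator: succeed if the space
 * after the allocation pointer covers the request, then bump the pointer.
 */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint64_t bg_start = 8ULL << 30;		/* assumed logical bg start */
	uint64_t length = 256ULL << 20;
	uint64_t alloc_offset = 192ULL << 20;	/* already allocated */
	uint64_t num_bytes = 16ULL << 20;	/* requested extent size */

	uint64_t avail = length - alloc_offset;
	if (avail < num_bytes) {
		printf("no space in this block group, try the next one\n");
		return 1;
	}
	printf("allocated %llu bytes at logical %llu\n",
	       (unsigned long long)num_bytes,
	       (unsigned long long)(bg_start + alloc_offset));
	alloc_offset += num_bytes;		/* now 208 MiB */
	return 0;
}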

Since zone append writes are used, it would not strictly be necessary to
track the allocation offset; the allocator only needs to check the
available space. But by tracking the offset and returning it as the
allocated region, we can skip modifying ordered extents and checksum
information when no IO reordering happens.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/block-group.c      |  4 ++
 fs/btrfs/extent-tree.c      | 90 ++++++++++++++++++++++++++++++++++---
 fs/btrfs/free-space-cache.c |  6 +++
 3 files changed, 94 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index e4444d4dd4b5..63093cfb807e 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -725,6 +725,10 @@ int btrfs_cache_block_group(struct btrfs_block_group *cache, int load_cache_only
 	struct btrfs_caching_control *caching_ctl = NULL;
 	int ret = 0;
 
+	/* Allocator for zoned filesystems does not use the cache at all */
+	if (btrfs_is_zoned(fs_info))
+		return 0;
+
 	caching_ctl = kzalloc(sizeof(*caching_ctl), GFP_NOFS);
 	if (!caching_ctl)
 		return -ENOMEM;
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 5c61c3f136f7..85d99307673d 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3429,6 +3429,7 @@ btrfs_release_block_group(struct btrfs_block_group *cache,
 
 enum btrfs_extent_allocation_policy {
 	BTRFS_EXTENT_ALLOC_CLUSTERED,
+	BTRFS_EXTENT_ALLOC_ZONED,
 };
 
 /*
@@ -3681,6 +3682,65 @@ static int do_allocation_clustered(struct btrfs_block_group *block_group,
 	return find_free_extent_unclustered(block_group, ffe_ctl);
 }
 
+/*
+ * Simple allocator for sequential only block group. It only allows sequential
+ * allocation. No need to play with trees. This function also reserves the
+ * bytes as in btrfs_add_reserved_bytes.
+ */
+static int do_allocation_zoned(struct btrfs_block_group *block_group,
+			       struct find_free_extent_ctl *ffe_ctl,
+			       struct btrfs_block_group **bg_ret)
+{
+	struct btrfs_space_info *space_info = block_group->space_info;
+	struct btrfs_free_space_ctl *ctl = block_group->free_space_ctl;
+	u64 start = block_group->start;
+	u64 num_bytes = ffe_ctl->num_bytes;
+	u64 avail;
+	int ret = 0;
+
+	ASSERT(btrfs_is_zoned(block_group->fs_info));
+
+	spin_lock(&space_info->lock);
+	spin_lock(&block_group->lock);
+
+	if (block_group->ro) {
+		ret = 1;
+		goto out;
+	}
+
+	avail = block_group->length - block_group->alloc_offset;
+	if (avail < num_bytes) {
+		if (ffe_ctl->max_extent_size < avail) {
+			/*
+			 * With sequential allocator, free space is always
+			 * contiguous
+			 */
+			ffe_ctl->max_extent_size = avail;
+			ffe_ctl->total_free_space = avail;
+		}
+		ret = 1;
+		goto out;
+	}
+
+	ffe_ctl->found_offset = start + block_group->alloc_offset;
+	block_group->alloc_offset += num_bytes;
+	spin_lock(&ctl->tree_lock);
+	ctl->free_space -= num_bytes;
+	spin_unlock(&ctl->tree_lock);
+
+	/*
+	 * We do not check if found_offset is aligned to stripesize. The
+	 * address is anyway rewritten when using zone append writing.
+	 */
+
+	ffe_ctl->search_start = ffe_ctl->found_offset;
+
+out:
+	spin_unlock(&block_group->lock);
+	spin_unlock(&space_info->lock);
+	return ret;
+}
+
 static int do_allocation(struct btrfs_block_group *block_group,
 			 struct find_free_extent_ctl *ffe_ctl,
 			 struct btrfs_block_group **bg_ret)
@@ -3688,6 +3748,8 @@ static int do_allocation(struct btrfs_block_group *block_group,
 	switch (ffe_ctl->policy) {
 	case BTRFS_EXTENT_ALLOC_CLUSTERED:
 		return do_allocation_clustered(block_group, ffe_ctl, bg_ret);
+	case BTRFS_EXTENT_ALLOC_ZONED:
+		return do_allocation_zoned(block_group, ffe_ctl, bg_ret);
 	default:
 		BUG();
 	}
@@ -3702,6 +3764,9 @@ static void release_block_group(struct btrfs_block_group *block_group,
 		ffe_ctl->retry_clustered = false;
 		ffe_ctl->retry_unclustered = false;
 		break;
+	case BTRFS_EXTENT_ALLOC_ZONED:
+		/* Nothing to do */
+		break;
 	default:
 		BUG();
 	}
@@ -3730,6 +3795,9 @@ static void found_extent(struct find_free_extent_ctl *ffe_ctl,
 	case BTRFS_EXTENT_ALLOC_CLUSTERED:
 		found_extent_clustered(ffe_ctl, ins);
 		break;
+	case BTRFS_EXTENT_ALLOC_ZONED:
+		/* Nothing to do */
+		break;
 	default:
 		BUG();
 	}
@@ -3745,6 +3813,9 @@ static int chunk_allocation_failed(struct find_free_extent_ctl *ffe_ctl)
 		 */
 		ffe_ctl->loop = LOOP_NO_EMPTY_SIZE;
 		return 0;
+	case BTRFS_EXTENT_ALLOC_ZONED:
+		/* Give up here */
+		return -ENOSPC;
 	default:
 		BUG();
 	}
@@ -3913,6 +3984,9 @@ static int prepare_allocation(struct btrfs_fs_info *fs_info,
 	case BTRFS_EXTENT_ALLOC_CLUSTERED:
 		return prepare_allocation_clustered(fs_info, ffe_ctl,
 						    space_info, ins);
+	case BTRFS_EXTENT_ALLOC_ZONED:
+		/* Nothing to do */
+		return 0;
 	default:
 		BUG();
 	}
@@ -3976,6 +4050,9 @@ static noinline int find_free_extent(struct btrfs_root *root,
 	ffe_ctl.last_ptr = NULL;
 	ffe_ctl.use_cluster = true;
 
+	if (btrfs_is_zoned(fs_info))
+		ffe_ctl.policy = BTRFS_EXTENT_ALLOC_ZONED;
+
 	ins->type = BTRFS_EXTENT_ITEM_KEY;
 	ins->objectid = 0;
 	ins->offset = 0;
@@ -4118,20 +4195,21 @@ static noinline int find_free_extent(struct btrfs_root *root,
 		/* move on to the next group */
 		if (ffe_ctl.search_start + num_bytes >
 		    block_group->start + block_group->length) {
-			btrfs_add_free_space(block_group, ffe_ctl.found_offset,
-					     num_bytes);
+			btrfs_add_free_space_unused(block_group,
+					    ffe_ctl.found_offset, num_bytes);
 			goto loop;
 		}
 
 		if (ffe_ctl.found_offset < ffe_ctl.search_start)
-			btrfs_add_free_space(block_group, ffe_ctl.found_offset,
-				ffe_ctl.search_start - ffe_ctl.found_offset);
+			btrfs_add_free_space_unused(block_group,
+					ffe_ctl.found_offset,
+					ffe_ctl.search_start - ffe_ctl.found_offset);
 
 		ret = btrfs_add_reserved_bytes(block_group, ram_bytes,
 				num_bytes, delalloc);
 		if (ret == -EAGAIN) {
-			btrfs_add_free_space(block_group, ffe_ctl.found_offset,
-					     num_bytes);
+			btrfs_add_free_space_unused(block_group,
+					ffe_ctl.found_offset, num_bytes);
 			goto loop;
 		}
 		btrfs_inc_block_group_reservations(block_group);
diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index b93ac31eca69..d2a43186cc7f 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -2928,6 +2928,8 @@ u64 btrfs_find_space_for_alloc(struct btrfs_block_group *block_group,
 	u64 align_gap_len = 0;
 	enum btrfs_trim_state align_gap_trim_state = BTRFS_TRIM_STATE_UNTRIMMED;
 
+	ASSERT(!btrfs_is_zoned(block_group->fs_info));
+
 	spin_lock(&ctl->tree_lock);
 	entry = find_free_space(ctl, &offset, &bytes_search,
 				block_group->full_stripe_len, max_extent_size);
@@ -3059,6 +3061,8 @@ u64 btrfs_alloc_from_cluster(struct btrfs_block_group *block_group,
 	struct rb_node *node;
 	u64 ret = 0;
 
+	ASSERT(!btrfs_is_zoned(block_group->fs_info));
+
 	spin_lock(&cluster->lock);
 	if (bytes > cluster->max_size)
 		goto out;
@@ -3835,6 +3839,8 @@ int btrfs_trim_block_group(struct btrfs_block_group *block_group,
 	int ret;
 	u64 rem = 0;
 
+	ASSERT(!btrfs_is_zoned(block_group->fs_info));
+
 	*trimmed = 0;
 
 	spin_lock(&block_group->lock);
-- 
2.30.0


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH v15 15/42] btrfs: zoned: redirty released extent buffers
  2021-02-04 10:21 ` [PATCH v15 01/42] block: add bio_add_zone_append_page Naohiro Aota
                     ` (12 preceding siblings ...)
  2021-02-04 10:21   ` [PATCH v15 14/42] btrfs: zoned: implement sequential extent allocation Naohiro Aota
@ 2021-02-04 10:21   ` Naohiro Aota
  2021-02-04 10:21   ` [PATCH v15 16/42] btrfs: zoned: advance allocation pointer after tree log node Naohiro Aota
                     ` (27 subsequent siblings)
  41 siblings, 0 replies; 72+ messages in thread
From: Naohiro Aota @ 2021-02-04 10:21 UTC (permalink / raw)
  To: linux-btrfs, dsterba; +Cc: hare, linux-fsdevel, Naohiro Aota, Josef Bacik

Tree manipulating operations like merging nodes often release
once-allocated tree nodes. Such nodes are cleaned so that pages in the
node are not uselessly written out. On zoned volumes, however, this
optimization blocks the following IOs, because cancelling the write-out
of the freed blocks breaks the sequential write order expected by the
device.

Introduce a list of clean and unwritten extent buffers that have been
released in a transaction. Redirty the buffers so that
btree_write_cache_pages() can send proper bios to the devices.

Besides that, clear the entire content of the extent buffer so as not to
confuse raw block scanners such as 'btrfs check'. Since the content is
cleared, csum_dirty_buffer() would complain about a bytenr mismatch, so
skip the check and the checksum calculation using the newly introduced
buffer flag EXTENT_BUFFER_NO_CHECK.
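
To make the mechanism concrete, here is a minimal user-space sketch of
the redirty list idea (plain C with simplified, hypothetical structures;
the real code operates on extent buffers and takes the appropriate
locks):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdbool.h>

struct ebuf {
	struct ebuf *next;
	bool dirty;
	bool no_check;  /* skip bytenr/checksum verification on write-out */
	char data[64];  /* stand-in for the buffer contents */
};

struct transaction {
	struct ebuf *releasing; /* singly linked "redirty" list */
};

static void redirty_list_add(struct transaction *trans, struct ebuf *eb)
{
	memset(eb->data, 0, sizeof(eb->data)); /* raw scanners see no stale node */
	eb->dirty = true;     /* written out again, keeping writes sequential */
	eb->no_check = true;  /* checksum code skips the zeroed buffer */
	eb->next = trans->releasing;
	trans->releasing = eb;
}

static void free_redirty_list(struct transaction *trans)
{
	while (trans->releasing) {
		struct ebuf *eb = trans->releasing;

		trans->releasing = eb->next;
		free(eb);
	}
}

int main(void)
{
	struct transaction trans = { NULL };
	struct ebuf *eb = calloc(1, sizeof(*eb));

	redirty_list_add(&trans, eb);
	printf("buffers parked: %d\n", trans.releasing != NULL);
	free_redirty_list(&trans); /* after the transaction wrote its blocks */
	return 0;
}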

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/disk-io.c     |  8 ++++++++
 fs/btrfs/extent-tree.c | 12 +++++++++++-
 fs/btrfs/extent_io.c   |  4 ++++
 fs/btrfs/extent_io.h   |  2 ++
 fs/btrfs/transaction.c | 10 ++++++++++
 fs/btrfs/transaction.h |  3 +++
 fs/btrfs/tree-log.c    |  6 ++++++
 fs/btrfs/zoned.c       | 37 +++++++++++++++++++++++++++++++++++++
 fs/btrfs/zoned.h       |  7 +++++++
 9 files changed, 88 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 8551b0fc1b22..eb1afd7d89f7 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -459,6 +459,12 @@ static int csum_dirty_buffer(struct btrfs_fs_info *fs_info, struct bio_vec *bvec
 		return 0;
 
 	found_start = btrfs_header_bytenr(eb);
+
+	if (test_bit(EXTENT_BUFFER_NO_CHECK, &eb->bflags)) {
+		WARN_ON(found_start != 0);
+		return 0;
+	}
+
 	/*
 	 * Please do not consolidate these warnings into a single if.
 	 * It is useful to know what went wrong.
@@ -4774,6 +4780,8 @@ void btrfs_cleanup_one_transaction(struct btrfs_transaction *cur_trans,
 				     EXTENT_DIRTY);
 	btrfs_destroy_pinned_extent(fs_info, &cur_trans->pinned_extents);
 
+	btrfs_free_redirty_list(cur_trans);
+
 	cur_trans->state =TRANS_STATE_COMPLETED;
 	wake_up(&cur_trans->commit_wait);
 }
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 85d99307673d..4d48a773bf9c 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3292,8 +3292,10 @@ void btrfs_free_tree_block(struct btrfs_trans_handle *trans,
 
 		if (root->root_key.objectid != BTRFS_TREE_LOG_OBJECTID) {
 			ret = check_ref_cleanup(trans, buf->start);
-			if (!ret)
+			if (!ret) {
+				btrfs_redirty_list_add(trans->transaction, buf);
 				goto out;
+			}
 		}
 
 		cache = btrfs_lookup_block_group(fs_info, buf->start);
@@ -3304,6 +3306,13 @@ void btrfs_free_tree_block(struct btrfs_trans_handle *trans,
 			goto out;
 		}
 
+		if (btrfs_is_zoned(fs_info)) {
+			btrfs_redirty_list_add(trans->transaction, buf);
+			pin_down_extent(trans, cache, buf->start, buf->len, 1);
+			btrfs_put_block_group(cache);
+			goto out;
+		}
+
 		WARN_ON(test_bit(EXTENT_BUFFER_DIRTY, &buf->bflags));
 
 		btrfs_add_free_space(cache, buf->start, buf->len);
@@ -4635,6 +4644,7 @@ btrfs_init_new_buffer(struct btrfs_trans_handle *trans, struct btrfs_root *root,
 	__btrfs_tree_lock(buf, nest);
 	btrfs_clean_tree_block(buf);
 	clear_bit(EXTENT_BUFFER_STALE, &buf->bflags);
+	clear_bit(EXTENT_BUFFER_NO_CHECK, &buf->bflags);
 
 	set_extent_buffer_uptodate(buf);
 
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 2fa4ca12e2dd..fa9b37178d42 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -25,6 +25,7 @@
 #include "backref.h"
 #include "disk-io.h"
 #include "subpage.h"
+#include "zoned.h"
 
 static struct kmem_cache *extent_state_cache;
 static struct kmem_cache *extent_buffer_cache;
@@ -5183,6 +5184,7 @@ __alloc_extent_buffer(struct btrfs_fs_info *fs_info, u64 start,
 
 	btrfs_leak_debug_add(&fs_info->eb_leak_lock, &eb->leak_list,
 			     &fs_info->allocated_ebs);
+	INIT_LIST_HEAD(&eb->release_list);
 
 	spin_lock_init(&eb->refs_lock);
 	atomic_set(&eb->refs, 1);
@@ -6105,6 +6107,8 @@ void write_extent_buffer(const struct extent_buffer *eb, const void *srcv,
 	char *src = (char *)srcv;
 	unsigned long i = get_eb_page_index(start);
 
+	WARN_ON(test_bit(EXTENT_BUFFER_NO_CHECK, &eb->bflags));
+
 	if (check_eb_range(eb, start, len))
 		return;
 
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 047b3e66897f..824640cb0ace 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -31,6 +31,7 @@ enum {
 	EXTENT_BUFFER_IN_TREE,
 	/* write IO error */
 	EXTENT_BUFFER_WRITE_ERR,
+	EXTENT_BUFFER_NO_CHECK,
 };
 
 /* these are flags for __process_pages_contig */
@@ -93,6 +94,7 @@ struct extent_buffer {
 	struct rw_semaphore lock;
 
 	struct page *pages[INLINE_EXTENT_BUFFER_PAGES];
+	struct list_head release_list;
 #ifdef CONFIG_BTRFS_DEBUG
 	struct list_head leak_list;
 #endif
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 00c0680dac3a..acff6bb49a97 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -21,6 +21,7 @@
 #include "qgroup.h"
 #include "block-group.h"
 #include "space-info.h"
+#include "zoned.h"
 
 #define BTRFS_ROOT_TRANS_TAG 0
 
@@ -380,6 +381,8 @@ static noinline int join_transaction(struct btrfs_fs_info *fs_info,
 	spin_lock_init(&cur_trans->dirty_bgs_lock);
 	INIT_LIST_HEAD(&cur_trans->deleted_bgs);
 	spin_lock_init(&cur_trans->dropped_roots_lock);
+	INIT_LIST_HEAD(&cur_trans->releasing_ebs);
+	spin_lock_init(&cur_trans->releasing_ebs_lock);
 	list_add_tail(&cur_trans->list, &fs_info->trans_list);
 	extent_io_tree_init(fs_info, &cur_trans->dirty_pages,
 			IO_TREE_TRANS_DIRTY_PAGES, fs_info->btree_inode);
@@ -2350,6 +2353,13 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
 		goto scrub_continue;
 	}
 
+	/*
+	 * At this point, we should have written all the tree blocks allocated
+	 * in this transaction. So it's now safe to free the redirtied extent
+	 * buffers.
+	 */
+	btrfs_free_redirty_list(cur_trans);
+
 	ret = write_all_supers(fs_info, 0);
 	/*
 	 * the super is written, we can safely allow the tree-loggers
diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
index 935bd6958a8a..6335716e513f 100644
--- a/fs/btrfs/transaction.h
+++ b/fs/btrfs/transaction.h
@@ -93,6 +93,9 @@ struct btrfs_transaction {
 	 */
 	atomic_t pending_ordered;
 	wait_queue_head_t pending_wait;
+
+	spinlock_t releasing_ebs_lock;
+	struct list_head releasing_ebs;
 };
 
 #define __TRANS_FREEZABLE	(1U << 0)
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 4c7b283ed2b2..c02eeeac439c 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -19,6 +19,7 @@
 #include "qgroup.h"
 #include "block-group.h"
 #include "space-info.h"
+#include "zoned.h"
 
 /* magic values for the inode_only field in btrfs_log_inode:
  *
@@ -2752,6 +2753,8 @@ static noinline int walk_down_log_tree(struct btrfs_trans_handle *trans,
 						free_extent_buffer(next);
 						return ret;
 					}
+					btrfs_redirty_list_add(
+						trans->transaction, next);
 				} else {
 					if (test_and_clear_bit(EXTENT_BUFFER_DIRTY, &next->bflags))
 						clear_extent_buffer_dirty(next);
@@ -3296,6 +3299,9 @@ static void free_log_tree(struct btrfs_trans_handle *trans,
 	clear_extent_bits(&log->dirty_log_pages, 0, (u64)-1,
 			  EXTENT_DIRTY | EXTENT_NEW | EXTENT_NEED_WAIT);
 	extent_io_tree_release(&log->log_csum_range);
+
+	if (trans && log->node)
+		btrfs_redirty_list_add(trans->transaction, log->node);
 	btrfs_put_root(log);
 }
 
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index c5f9f4c6f20b..1de67d789b83 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -10,6 +10,7 @@
 #include "rcu-string.h"
 #include "disk-io.h"
 #include "block-group.h"
+#include "transaction.h"
 
 /* Maximum number of zones to report per blkdev_report_zones() call */
 #define BTRFS_REPORT_NR_ZONES   4096
@@ -1181,3 +1182,39 @@ void btrfs_calc_zone_unusable(struct btrfs_block_group *cache)
 	/* Should not have any excluded extents. Just in case, though */
 	btrfs_free_excluded_extents(cache);
 }
+
+void btrfs_redirty_list_add(struct btrfs_transaction *trans,
+			    struct extent_buffer *eb)
+{
+	struct btrfs_fs_info *fs_info = eb->fs_info;
+
+	if (!btrfs_is_zoned(fs_info) ||
+	    btrfs_header_flag(eb, BTRFS_HEADER_FLAG_WRITTEN) ||
+	    !list_empty(&eb->release_list))
+		return;
+
+	set_extent_buffer_dirty(eb);
+	set_extent_bits_nowait(&trans->dirty_pages, eb->start,
+			       eb->start + eb->len - 1, EXTENT_DIRTY);
+	memzero_extent_buffer(eb, 0, eb->len);
+	set_bit(EXTENT_BUFFER_NO_CHECK, &eb->bflags);
+
+	spin_lock(&trans->releasing_ebs_lock);
+	list_add_tail(&eb->release_list, &trans->releasing_ebs);
+	spin_unlock(&trans->releasing_ebs_lock);
+	atomic_inc(&eb->refs);
+}
+
+void btrfs_free_redirty_list(struct btrfs_transaction *trans)
+{
+	spin_lock(&trans->releasing_ebs_lock);
+	while (!list_empty(&trans->releasing_ebs)) {
+		struct extent_buffer *eb;
+
+		eb = list_first_entry(&trans->releasing_ebs,
+				      struct extent_buffer, release_list);
+		list_del_init(&eb->release_list);
+		free_extent_buffer(eb);
+	}
+	spin_unlock(&trans->releasing_ebs_lock);
+}
diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h
index 37304d1675e6..b250a578e38c 100644
--- a/fs/btrfs/zoned.h
+++ b/fs/btrfs/zoned.h
@@ -43,6 +43,9 @@ int btrfs_reset_device_zone(struct btrfs_device *device, u64 physical,
 int btrfs_ensure_empty_zones(struct btrfs_device *device, u64 start, u64 size);
 int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache, bool new);
 void btrfs_calc_zone_unusable(struct btrfs_block_group *cache);
+void btrfs_redirty_list_add(struct btrfs_transaction *trans,
+			    struct extent_buffer *eb);
+void btrfs_free_redirty_list(struct btrfs_transaction *trans);
 #else /* CONFIG_BLK_DEV_ZONED */
 static inline int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
 				     struct blk_zone *zone)
@@ -126,6 +129,10 @@ static inline int btrfs_load_block_group_zone_info(
 
 static inline void btrfs_calc_zone_unusable(struct btrfs_block_group *cache) { }
 
+static inline void btrfs_redirty_list_add(struct btrfs_transaction *trans,
+					  struct extent_buffer *eb) { }
+static inline void btrfs_free_redirty_list(struct btrfs_transaction *trans) { }
+
 #endif
 
 static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)
-- 
2.30.0


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH v15 16/42] btrfs: zoned: advance allocation pointer after tree log node
  2021-02-04 10:21 ` [PATCH v15 01/42] block: add bio_add_zone_append_page Naohiro Aota
                     ` (13 preceding siblings ...)
  2021-02-04 10:21   ` [PATCH v15 15/42] btrfs: zoned: redirty released extent buffers Naohiro Aota
@ 2021-02-04 10:21   ` Naohiro Aota
  2021-02-04 10:21   ` [PATCH v15 17/42] btrfs: zoned: reset zones of unused block groups Naohiro Aota
                     ` (26 subsequent siblings)
  41 siblings, 0 replies; 72+ messages in thread
From: Naohiro Aota @ 2021-02-04 10:21 UTC (permalink / raw)
  To: linux-btrfs, dsterba; +Cc: hare, linux-fsdevel, Naohiro Aota, Josef Bacik

Since the allocation information of a tree log node is not recorded in
the extent tree, calculate_alloc_pointer() cannot detect such a node, so
the calculated allocation pointer can end up pointing before the node,
risking that it gets overwritten by a later allocation.

Replaying the log calls btrfs_remove_free_space() for each node in the
log tree, so advance the allocation pointer past the node there to avoid
allocating over it.
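
A minimal sketch of the pointer advance, with simplified structures and
offsets taken relative to the block group start (an assumption of this
example, not the kernel code):

#include <stdio.h>
#include <stdint.h>

struct block_group {
	uint64_t alloc_offset; /* allocation pointer, relative to the start */
};

static void advance_past_log_node(struct block_group *bg, uint64_t offset,
				  uint64_t bytes)
{
	/* never allow the pointer to sit before the end of the log node */
	if (bg->alloc_offset < offset + bytes)
		bg->alloc_offset = offset + bytes;
}

int main(void)
{
	struct block_group bg = { .alloc_offset = 4096 };

	/* a 16K tree-log node at offset 8K is not in the extent tree */
	advance_past_log_node(&bg, 8192, 16384);
	printf("alloc_offset = %llu\n", (unsigned long long)bg.alloc_offset);
	return 0;
}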

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/free-space-cache.c | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index d2a43186cc7f..5400294bd271 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -2628,8 +2628,22 @@ int btrfs_remove_free_space(struct btrfs_block_group *block_group,
 	int ret;
 	bool re_search = false;
 
-	if (btrfs_is_zoned(block_group->fs_info))
+	if (btrfs_is_zoned(block_group->fs_info)) {
+		/*
+		 * This can happen with conventional zones when replaying log.
+		 * Since the allocation info of tree-log nodes are not recorded
+		 * to the extent-tree, calculate_alloc_pointer() failed to
+		 * advance the allocation pointer after last allocated tree log
+		 * node blocks.
+		 *
+		 * This function is called from
+		 * btrfs_pin_extent_for_log_replay() when replaying the log.
+		 * Advance the pointer not to overwrite the tree-log nodes.
+		 */
+		if (block_group->alloc_offset < offset + bytes)
+			block_group->alloc_offset = offset + bytes;
 		return 0;
+	}
 
 	spin_lock(&ctl->tree_lock);
 
-- 
2.30.0


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH v15 17/42] btrfs: zoned: reset zones of unused block groups
  2021-02-04 10:21 ` [PATCH v15 01/42] block: add bio_add_zone_append_page Naohiro Aota
                     ` (14 preceding siblings ...)
  2021-02-04 10:21   ` [PATCH v15 16/42] btrfs: zoned: advance allocation pointer after tree log node Naohiro Aota
@ 2021-02-04 10:21   ` Naohiro Aota
  2021-02-04 10:21   ` [PATCH v15 18/42] btrfs: factor out helper adding a page to bio Naohiro Aota
                     ` (25 subsequent siblings)
  41 siblings, 0 replies; 72+ messages in thread
From: Naohiro Aota @ 2021-02-04 10:21 UTC (permalink / raw)
  To: linux-btrfs, dsterba
  Cc: hare, linux-fsdevel, Naohiro Aota, Josef Bacik, Anand Jain

We must reset the zones of a deleted unused block group to rewind the
zones' write pointers to the zones' start.

To do this, reuse the DISCARD_SYNC code path to perform the reset when
the filesystem is running on zoned devices.
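
As a sketch of the decision, a discard range can be handled as a zone
reset only when it covers whole sequential zones; the check below uses a
hypothetical 256M zone size for illustration:

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define ZONE_SIZE (256ULL << 20) /* example zone size, power of two */

static bool is_aligned(uint64_t x, uint64_t a)
{
	return (x & (a - 1)) == 0;
}

static bool can_zone_reset(bool sequential, uint64_t physical, uint64_t length)
{
	if (!sequential) /* conventional zones are discarded as usual */
		return false;
	return is_aligned(physical, ZONE_SIZE) && is_aligned(length, ZONE_SIZE);
}

int main(void)
{
	printf("%d\n", can_zone_reset(true, 2 * ZONE_SIZE, ZONE_SIZE)); /* 1 */
	printf("%d\n", can_zone_reset(true, ZONE_SIZE / 2, ZONE_SIZE)); /* 0 */
	return 0;
}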

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/block-group.c |  8 ++++++--
 fs/btrfs/extent-tree.c | 17 ++++++++++++-----
 fs/btrfs/zoned.h       | 15 +++++++++++++++
 3 files changed, 33 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index 63093cfb807e..70a0c0f8f99f 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -1408,8 +1408,12 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
 		if (!async_trim_enabled && btrfs_test_opt(fs_info, DISCARD_ASYNC))
 			goto flip_async;
 
-		/* DISCARD can flip during remount */
-		trimming = btrfs_test_opt(fs_info, DISCARD_SYNC);
+		/*
+		 * DISCARD can flip during remount. On zoned filesystems, we
+		 * need to reset sequential-required zones.
+		 */
+		trimming = btrfs_test_opt(fs_info, DISCARD_SYNC) ||
+				btrfs_is_zoned(fs_info);
 
 		/* Implicit trim during transaction commit. */
 		if (trimming)
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 4d48a773bf9c..a717366c9823 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -1298,6 +1298,9 @@ int btrfs_discard_extent(struct btrfs_fs_info *fs_info, u64 bytenr,
 
 		stripe = bbio->stripes;
 		for (i = 0; i < bbio->num_stripes; i++, stripe++) {
+			struct btrfs_device *dev = stripe->dev;
+			u64 physical = stripe->physical;
+			u64 length = stripe->length;
 			u64 bytes;
 			struct request_queue *req_q;
 
@@ -1305,14 +1308,18 @@ int btrfs_discard_extent(struct btrfs_fs_info *fs_info, u64 bytenr,
 				ASSERT(btrfs_test_opt(fs_info, DEGRADED));
 				continue;
 			}
+
 			req_q = bdev_get_queue(stripe->dev->bdev);
-			if (!blk_queue_discard(req_q))
+			/* Zone reset on zoned filesystems */
+			if (btrfs_can_zone_reset(dev, physical, length))
+				ret = btrfs_reset_device_zone(dev, physical,
+							      length, &bytes);
+			else if (blk_queue_discard(req_q))
+				ret = btrfs_issue_discard(dev->bdev, physical,
+							  length, &bytes);
+			else
 				continue;
 
-			ret = btrfs_issue_discard(stripe->dev->bdev,
-						  stripe->physical,
-						  stripe->length,
-						  &bytes);
 			if (!ret) {
 				discarded_bytes += bytes;
 			} else if (ret != -EOPNOTSUPP) {
diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h
index b250a578e38c..c105641a6ad3 100644
--- a/fs/btrfs/zoned.h
+++ b/fs/btrfs/zoned.h
@@ -209,4 +209,19 @@ static inline bool btrfs_check_super_location(struct btrfs_device *device, u64 p
 	return device->zone_info == NULL || !btrfs_dev_is_sequential(device, pos);
 }
 
+static inline bool btrfs_can_zone_reset(struct btrfs_device *device,
+					u64 physical, u64 length)
+{
+	u64 zone_size;
+
+	if (!btrfs_dev_is_sequential(device, physical))
+		return false;
+
+	zone_size = device->zone_info->zone_size;
+	if (!IS_ALIGNED(physical, zone_size) || !IS_ALIGNED(length, zone_size))
+		return false;
+
+	return true;
+}
+
 #endif
-- 
2.30.0


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH v15 18/42] btrfs: factor out helper adding a page to bio
  2021-02-04 10:21 ` [PATCH v15 01/42] block: add bio_add_zone_append_page Naohiro Aota
                     ` (15 preceding siblings ...)
  2021-02-04 10:21   ` [PATCH v15 17/42] btrfs: zoned: reset zones of unused block groups Naohiro Aota
@ 2021-02-04 10:21   ` Naohiro Aota
  2021-02-04 10:21   ` [PATCH v15 19/42] btrfs: zoned: use bio_add_zone_append_page Naohiro Aota
                     ` (24 subsequent siblings)
  41 siblings, 0 replies; 72+ messages in thread
From: Naohiro Aota @ 2021-02-04 10:21 UTC (permalink / raw)
  To: linux-btrfs, dsterba; +Cc: hare, linux-fsdevel, Naohiro Aota, Josef Bacik

Extract adding a page to a bio from submit_extent_page(). The page is
added only when the bio_flags match, the page is contiguous with the bio,
and it fits in the same stripe as the pages already in the bio.

The condition checks are reordered to allow early returns and to avoid a
potentially heavy btrfs_bio_fits_in_stripe() call.
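
The ordering of the checks can be sketched as below (a simplified
user-space model; the stripe check stands in for the heavier
btrfs_bio_fits_in_stripe() and always succeeds here):

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

struct bio_state {
	unsigned long flags;
	uint64_t end_sector;
};

static bool fits_in_stripe(uint64_t sector, unsigned int size)
{
	(void)sector;
	(void)size;
	return true; /* stand-in for the expensive stripe-boundary check */
}

static bool try_add_page(struct bio_state *bio, uint64_t sector,
			 unsigned int size, unsigned long flags)
{
	if (bio->flags != flags)           /* cheapest check first */
		return false;
	if (bio->end_sector != sector)     /* contiguity check */
		return false;
	if (!fits_in_stripe(sector, size)) /* expensive check last */
		return false;

	bio->end_sector += size >> 9;      /* the page was "added" */
	return true;
}

int main(void)
{
	struct bio_state bio = { .flags = 0, .end_sector = 128 };

	printf("%d\n", try_add_page(&bio, 128, 4096, 0)); /* 1 */
	printf("%d\n", try_add_page(&bio, 999, 4096, 0)); /* 0, not contiguous */
	return 0;
}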

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/extent_io.c | 60 +++++++++++++++++++++++++++++++++-----------
 1 file changed, 45 insertions(+), 15 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index fa9b37178d42..5db7e6c69391 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3084,6 +3084,48 @@ struct bio *btrfs_bio_clone_partial(struct bio *orig, int offset, int size)
 	return bio;
 }
 
+/**
+ * btrfs_bio_add_page - attempt to add a page to bio
+ *
+ * @bio:	destination bio
+ * @page:	page to add to the bio
+ * @disk_bytenr:  offset of the new bio or to check whether we are adding
+ *                a contiguous page to the previous one
+ * @pg_offset:	starting offset in the page
+ * @size:	portion of page that we want to write
+ * @prev_bio_flags:  flags of previous bio to see if we can merge the current one
+ * @bio_flags:	flags of the current bio to see if we can merge them
+ * @return:	true if page was added, false otherwise
+ *
+ * Attempt to add a page to bio considering stripe alignment etc.
+ *
+ * Return true if successfully page added. Otherwise, return false.
+ */
+static bool btrfs_bio_add_page(struct bio *bio, struct page *page,
+			       u64 disk_bytenr, unsigned int size,
+			       unsigned int pg_offset,
+			       unsigned long prev_bio_flags,
+			       unsigned long bio_flags)
+{
+	const sector_t sector = disk_bytenr >> SECTOR_SHIFT;
+	bool contig;
+
+	if (prev_bio_flags != bio_flags)
+		return false;
+
+	if (prev_bio_flags & EXTENT_BIO_COMPRESSED)
+		contig = bio->bi_iter.bi_sector == sector;
+	else
+		contig = bio_end_sector(bio) == sector;
+	if (!contig)
+		return false;
+
+	if (btrfs_bio_fits_in_stripe(page, size, bio, bio_flags))
+		return false;
+
+	return bio_add_page(bio, page, size, pg_offset) == size;
+}
+
 /*
  * @opf:	bio REQ_OP_* and REQ_* flags as one value
  * @wbc:	optional writeback control for io accounting
@@ -3112,27 +3154,15 @@ static int submit_extent_page(unsigned int opf,
 	int ret = 0;
 	struct bio *bio;
 	size_t io_size = min_t(size_t, size, PAGE_SIZE);
-	sector_t sector = disk_bytenr >> 9;
 	struct extent_io_tree *tree = &BTRFS_I(page->mapping->host)->io_tree;
 
 	ASSERT(bio_ret);
 
 	if (*bio_ret) {
-		bool contig;
-		bool can_merge = true;
-
 		bio = *bio_ret;
-		if (prev_bio_flags & EXTENT_BIO_COMPRESSED)
-			contig = bio->bi_iter.bi_sector == sector;
-		else
-			contig = bio_end_sector(bio) == sector;
-
-		if (btrfs_bio_fits_in_stripe(page, io_size, bio, bio_flags))
-			can_merge = false;
-
-		if (prev_bio_flags != bio_flags || !contig || !can_merge ||
-		    force_bio_submit ||
-		    bio_add_page(bio, page, io_size, pg_offset) < io_size) {
+		if (force_bio_submit ||
+		    !btrfs_bio_add_page(bio, page, disk_bytenr, io_size,
+					pg_offset, prev_bio_flags, bio_flags)) {
 			ret = submit_one_bio(bio, mirror_num, prev_bio_flags);
 			if (ret < 0) {
 				*bio_ret = NULL;
-- 
2.30.0


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH v15 19/42] btrfs: zoned: use bio_add_zone_append_page
  2021-02-04 10:21 ` [PATCH v15 01/42] block: add bio_add_zone_append_page Naohiro Aota
                     ` (16 preceding siblings ...)
  2021-02-04 10:21   ` [PATCH v15 18/42] btrfs: factor out helper adding a page to bio Naohiro Aota
@ 2021-02-04 10:21   ` Naohiro Aota
  2021-02-04 10:21   ` [PATCH v15 20/42] btrfs: zoned: handle REQ_OP_ZONE_APPEND as writing Naohiro Aota
                     ` (23 subsequent siblings)
  41 siblings, 0 replies; 72+ messages in thread
From: Naohiro Aota @ 2021-02-04 10:21 UTC (permalink / raw)
  To: linux-btrfs, dsterba; +Cc: hare, linux-fsdevel, Naohiro Aota, Josef Bacik

A zoned device has its own hardware restrictions, e.g.
max_zone_append_size when using REQ_OP_ZONE_APPEND. To follow these
restrictions, use bio_add_zone_append_page() instead of bio_add_page().
We need the target device to use bio_add_zone_append_page(), so this
commit reads the chunk information to cache the target device in
btrfs_io_bio(bio)->device.

Caching only the target device is sufficient here, as zoned filesystems
only support the single profile at the moment. Once more profiles are
supported, btrfs_io_bio can hold an extent_map to be able to check the
restrictions of all devices the btrfs_bio will be mapped to.
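
A small sketch of the caching step (hypothetical, simplified structures;
the single profile assumption means the chunk map has exactly one stripe
to pick the device from):

#include <assert.h>
#include <stdio.h>

struct device {
	int id;
};

struct map_lookup {
	int num_stripes;
	struct device *stripes[4];
};

int main(void)
{
	struct device dev = { .id = 42 };
	struct map_lookup map = { .num_stripes = 1, .stripes = { &dev } };
	struct device *target;

	/* only the single profile is supported, so exactly one stripe */
	assert(map.num_stripes == 1);
	target = map.stripes[0]; /* cached in the bio's per-IO private data */

	printf("target device id = %d\n", target->id);
	return 0;
}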

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/extent_io.c | 29 ++++++++++++++++++++++++++---
 1 file changed, 26 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 5db7e6c69391..15503a435e98 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3109,6 +3109,7 @@ static bool btrfs_bio_add_page(struct bio *bio, struct page *page,
 {
 	const sector_t sector = disk_bytenr >> SECTOR_SHIFT;
 	bool contig;
+	int ret;
 
 	if (prev_bio_flags != bio_flags)
 		return false;
@@ -3123,7 +3124,12 @@ static bool btrfs_bio_add_page(struct bio *bio, struct page *page,
 	if (btrfs_bio_fits_in_stripe(page, size, bio, bio_flags))
 		return false;
 
-	return bio_add_page(bio, page, size, pg_offset) == size;
+	if (bio_op(bio) == REQ_OP_ZONE_APPEND)
+		ret = bio_add_zone_append_page(bio, page, size, pg_offset);
+	else
+		ret = bio_add_page(bio, page, size, pg_offset);
+
+	return ret == size;
 }
 
 /*
@@ -3154,7 +3160,9 @@ static int submit_extent_page(unsigned int opf,
 	int ret = 0;
 	struct bio *bio;
 	size_t io_size = min_t(size_t, size, PAGE_SIZE);
-	struct extent_io_tree *tree = &BTRFS_I(page->mapping->host)->io_tree;
+	struct btrfs_inode *inode = BTRFS_I(page->mapping->host);
+	struct extent_io_tree *tree = &inode->io_tree;
+	struct btrfs_fs_info *fs_info = inode->root->fs_info;
 
 	ASSERT(bio_ret);
 
@@ -3185,11 +3193,26 @@ static int submit_extent_page(unsigned int opf,
 	if (wbc) {
 		struct block_device *bdev;
 
-		bdev = BTRFS_I(page->mapping->host)->root->fs_info->fs_devices->latest_bdev;
+		bdev = fs_info->fs_devices->latest_bdev;
 		bio_set_dev(bio, bdev);
 		wbc_init_bio(wbc, bio);
 		wbc_account_cgroup_owner(wbc, page, io_size);
 	}
+	if (btrfs_is_zoned(fs_info) && bio_op(bio) == REQ_OP_ZONE_APPEND) {
+		struct extent_map *em;
+		struct map_lookup *map;
+
+		em = btrfs_get_chunk_map(fs_info, disk_bytenr, io_size);
+		if (IS_ERR(em))
+			return PTR_ERR(em);
+
+		map = em->map_lookup;
+		/* We only support single profile for now */
+		ASSERT(map->num_stripes == 1);
+		btrfs_io_bio(bio)->device = map->stripes[0].dev;
+
+		free_extent_map(em);
+	}
 
 	*bio_ret = bio;
 
-- 
2.30.0


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH v15 20/42] btrfs: zoned: handle REQ_OP_ZONE_APPEND as writing
  2021-02-04 10:21 ` [PATCH v15 01/42] block: add bio_add_zone_append_page Naohiro Aota
                     ` (17 preceding siblings ...)
  2021-02-04 10:21   ` [PATCH v15 19/42] btrfs: zoned: use bio_add_zone_append_page Naohiro Aota
@ 2021-02-04 10:21   ` Naohiro Aota
  2021-02-04 10:22   ` [PATCH v15 21/42] btrfs: zoned: split ordered extent when bio is sent Naohiro Aota
                     ` (22 subsequent siblings)
  41 siblings, 0 replies; 72+ messages in thread
From: Naohiro Aota @ 2021-02-04 10:21 UTC (permalink / raw)
  To: linux-btrfs, dsterba; +Cc: hare, linux-fsdevel, Naohiro Aota, Josef Bacik

Zoned filesystems use REQ_OP_ZONE_APPEND bios for writing to actual
devices.

Make btrfs_end_bio() and btrfs_op() aware of this by mapping
REQ_OP_ZONE_APPEND to BTRFS_MAP_WRITE and using btrfs_op() instead of
bio_op().
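
The mapping can be sketched as a small dispatch function (the enum
values are illustrative placeholders, not the kernel definitions):

#include <stdio.h>

enum bio_op { OP_READ, OP_WRITE, OP_DISCARD, OP_ZONE_APPEND };
enum btrfs_map_op { MAP_READ, MAP_WRITE, MAP_DISCARD };

static enum btrfs_map_op map_op(enum bio_op op)
{
	switch (op) {
	case OP_DISCARD:
		return MAP_DISCARD;
	case OP_WRITE:
	case OP_ZONE_APPEND:
		return MAP_WRITE; /* zone append counts as a write */
	default:
		return MAP_READ;
	}
}

int main(void)
{
	printf("%d\n", map_op(OP_ZONE_APPEND) == MAP_WRITE); /* 1 */
	printf("%d\n", map_op(OP_READ) == MAP_WRITE);        /* 0 */
	return 0;
}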

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/disk-io.c |  4 ++--
 fs/btrfs/inode.c   | 10 +++++-----
 fs/btrfs/volumes.c |  8 ++++----
 fs/btrfs/volumes.h |  1 +
 4 files changed, 12 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index eb1afd7d89f7..70621184a731 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -709,7 +709,7 @@ static void end_workqueue_bio(struct bio *bio)
 	fs_info = end_io_wq->info;
 	end_io_wq->status = bio->bi_status;
 
-	if (bio_op(bio) == REQ_OP_WRITE) {
+	if (btrfs_op(bio) == BTRFS_MAP_WRITE) {
 		if (end_io_wq->metadata == BTRFS_WQ_ENDIO_METADATA)
 			wq = fs_info->endio_meta_write_workers;
 		else if (end_io_wq->metadata == BTRFS_WQ_ENDIO_FREE_SPACE)
@@ -885,7 +885,7 @@ blk_status_t btrfs_submit_metadata_bio(struct inode *inode, struct bio *bio,
 	int async = check_async_write(fs_info, BTRFS_I(inode));
 	blk_status_t ret;
 
-	if (bio_op(bio) != REQ_OP_WRITE) {
+	if (btrfs_op(bio) != BTRFS_MAP_WRITE) {
 		/*
 		 * called for a read, do the setup so that checksum validation
 		 * can happen in the async kernel threads
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 5522e9d09c8a..d7a9c770dc3b 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2250,7 +2250,7 @@ blk_status_t btrfs_submit_data_bio(struct inode *inode, struct bio *bio,
 	if (btrfs_is_free_space_inode(BTRFS_I(inode)))
 		metadata = BTRFS_WQ_ENDIO_FREE_SPACE;
 
-	if (bio_op(bio) != REQ_OP_WRITE) {
+	if (btrfs_op(bio) != BTRFS_MAP_WRITE) {
 		ret = btrfs_bio_wq_end_io(fs_info, bio, metadata);
 		if (ret)
 			goto out;
@@ -7681,7 +7681,7 @@ static void btrfs_dio_private_put(struct btrfs_dio_private *dip)
 	if (!refcount_dec_and_test(&dip->refs))
 		return;
 
-	if (bio_op(dip->dio_bio) == REQ_OP_WRITE) {
+	if (btrfs_op(dip->dio_bio) == BTRFS_MAP_WRITE) {
 		__endio_write_update_ordered(BTRFS_I(dip->inode),
 					     dip->logical_offset,
 					     dip->bytes,
@@ -7847,7 +7847,7 @@ static inline blk_status_t btrfs_submit_dio_bio(struct bio *bio,
 {
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	struct btrfs_dio_private *dip = bio->bi_private;
-	bool write = bio_op(bio) == REQ_OP_WRITE;
+	bool write = btrfs_op(bio) == BTRFS_MAP_WRITE;
 	blk_status_t ret;
 
 	/* Check btrfs_submit_bio_hook() for rules about async submit. */
@@ -7897,7 +7897,7 @@ static struct btrfs_dio_private *btrfs_create_dio_private(struct bio *dio_bio,
 							  struct inode *inode,
 							  loff_t file_offset)
 {
-	const bool write = (bio_op(dio_bio) == REQ_OP_WRITE);
+	const bool write = (btrfs_op(dio_bio) == BTRFS_MAP_WRITE);
 	const bool csum = !(BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM);
 	size_t dip_size;
 	struct btrfs_dio_private *dip;
@@ -7927,7 +7927,7 @@ static struct btrfs_dio_private *btrfs_create_dio_private(struct bio *dio_bio,
 static blk_qc_t btrfs_submit_direct(struct inode *inode, struct iomap *iomap,
 		struct bio *dio_bio, loff_t file_offset)
 {
-	const bool write = (bio_op(dio_bio) == REQ_OP_WRITE);
+	const bool write = (btrfs_op(dio_bio) == BTRFS_MAP_WRITE);
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	const bool raid56 = (btrfs_data_alloc_profile(fs_info) &
 			     BTRFS_BLOCK_GROUP_RAID56_MASK);
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 10401def16ef..400375aaa197 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6448,7 +6448,7 @@ static void btrfs_end_bio(struct bio *bio)
 			struct btrfs_device *dev = btrfs_io_bio(bio)->device;
 
 			ASSERT(dev->bdev);
-			if (bio_op(bio) == REQ_OP_WRITE)
+			if (btrfs_op(bio) == BTRFS_MAP_WRITE)
 				btrfs_dev_stat_inc_and_print(dev,
 						BTRFS_DEV_STAT_WRITE_ERRS);
 			else if (!(bio->bi_opf & REQ_RAHEAD))
@@ -6561,10 +6561,10 @@ blk_status_t btrfs_map_bio(struct btrfs_fs_info *fs_info, struct bio *bio,
 	atomic_set(&bbio->stripes_pending, bbio->num_stripes);
 
 	if ((bbio->map_type & BTRFS_BLOCK_GROUP_RAID56_MASK) &&
-	    ((bio_op(bio) == REQ_OP_WRITE) || (mirror_num > 1))) {
+	    ((btrfs_op(bio) == BTRFS_MAP_WRITE) || (mirror_num > 1))) {
 		/* In this case, map_length has been set to the length of
 		   a single stripe; not the whole write */
-		if (bio_op(bio) == REQ_OP_WRITE) {
+		if (btrfs_op(bio) == BTRFS_MAP_WRITE) {
 			ret = raid56_parity_write(fs_info, bio, bbio,
 						  map_length);
 		} else {
@@ -6587,7 +6587,7 @@ blk_status_t btrfs_map_bio(struct btrfs_fs_info *fs_info, struct bio *bio,
 		dev = bbio->stripes[dev_nr].dev;
 		if (!dev || !dev->bdev || test_bit(BTRFS_DEV_STATE_MISSING,
 						   &dev->dev_state) ||
-		    (bio_op(first_bio) == REQ_OP_WRITE &&
+		    (btrfs_op(first_bio) == BTRFS_MAP_WRITE &&
 		    !test_bit(BTRFS_DEV_STATE_WRITEABLE, &dev->dev_state))) {
 			bbio_error(bbio, first_bio, logical);
 			continue;
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 598ac225176d..d3bbdb4175df 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -424,6 +424,7 @@ static inline enum btrfs_map_op btrfs_op(struct bio *bio)
 	case REQ_OP_DISCARD:
 		return BTRFS_MAP_DISCARD;
 	case REQ_OP_WRITE:
+	case REQ_OP_ZONE_APPEND:
 		return BTRFS_MAP_WRITE;
 	default:
 		WARN_ON_ONCE(1);
-- 
2.30.0


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH v15 21/42] btrfs: zoned: split ordered extent when bio is sent
  2021-02-04 10:21 ` [PATCH v15 01/42] block: add bio_add_zone_append_page Naohiro Aota
                     ` (18 preceding siblings ...)
  2021-02-04 10:21   ` [PATCH v15 20/42] btrfs: zoned: handle REQ_OP_ZONE_APPEND as writing Naohiro Aota
@ 2021-02-04 10:22   ` Naohiro Aota
  2021-02-04 10:22   ` [PATCH v15 22/42] btrfs: zoned: check if bio spans across an ordered extent Naohiro Aota
                     ` (21 subsequent siblings)
  41 siblings, 0 replies; 72+ messages in thread
From: Naohiro Aota @ 2021-02-04 10:22 UTC (permalink / raw)
  To: linux-btrfs, dsterba; +Cc: hare, linux-fsdevel, Naohiro Aota, Josef Bacik

For a zone append write, the device decides the location the data is being
written to. Therefore we cannot ensure that two bios are written
consecutively on the device. In order to ensure that an ordered extent
maps to a contiguous region on disk, we need to maintain a "one bio ==
one ordered extent" rule.

Implement splitting of an ordered extent and extent map on bio submission
to adhere to the rule.

extract_ordered_extent() hooks into btrfs_submit_data_bio() and splits
the corresponding ordered extent so that the ordered extent's region fits
into one bio and within the corresponding device limits.

Several sanity checks need to be done in extract_ordered_extent(), e.g.

- We cannot split an ordered extent that is already end_bio'd, because we
  cannot divide ordered->bytes_left for the split parts
- We do not expect a compressed ordered extent
- The checksum list must be empty, because we omit splitting the list.
  Since the function is called before btrfs_wq_submit_bio() or
  btrfs_csum_one_bio(), this should always be ensured.

We also need to split the extent map by creating a new one. Otherwise,
unpin_extent_cache() complains about the difference between the start of
the extent map and the file's logical offset.
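
The split arithmetic itself is simple; a small worked example (numbers
chosen arbitrarily for illustration):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint64_t disk_bytenr = 1ULL << 20;    /* ordered extent start */
	uint64_t disk_num_bytes = 1ULL << 20; /* ordered extent size: 1M */
	uint64_t start = disk_bytenr + (256ULL << 10); /* bio start */
	uint64_t len = 512ULL << 10;          /* bio size: 512K */

	uint64_t ordered_end = disk_bytenr + disk_num_bytes;
	uint64_t end = start + len;
	uint64_t pre = start - disk_bytenr;   /* bytes before the bio */
	uint64_t post = ordered_end - end;    /* bytes after the bio */

	/* pre and post become new, cloned ordered extents; the original
	 * one shrinks to exactly the region covered by the bio */
	printf("pre=%lluK post=%lluK\n",
	       (unsigned long long)(pre >> 10),
	       (unsigned long long)(post >> 10)); /* pre=256K post=256K */
	return 0;
}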

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/inode.c        | 95 +++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/ordered-data.c | 78 +++++++++++++++++++++++++++++++++
 fs/btrfs/ordered-data.h |  2 +
 3 files changed, 175 insertions(+)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index d7a9c770dc3b..750482a06d67 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2215,6 +2215,92 @@ static blk_status_t btrfs_submit_bio_start(struct inode *inode, struct bio *bio,
 	return btrfs_csum_one_bio(BTRFS_I(inode), bio, 0, 0);
 }
 
+static blk_status_t extract_ordered_extent(struct btrfs_inode *inode,
+					   struct bio *bio, loff_t file_offset)
+{
+	struct btrfs_ordered_extent *ordered;
+	struct extent_map *em = NULL, *em_new = NULL;
+	struct extent_map_tree *em_tree = &inode->extent_tree;
+	u64 start = (u64)bio->bi_iter.bi_sector << SECTOR_SHIFT;
+	u64 len = bio->bi_iter.bi_size;
+	u64 end = start + len;
+	u64 ordered_end;
+	u64 pre, post;
+	int ret = 0;
+
+	ordered = btrfs_lookup_ordered_extent(inode, file_offset);
+	if (WARN_ON_ONCE(!ordered))
+		return BLK_STS_IOERR;
+
+	/* No need to split */
+	if (ordered->disk_num_bytes == len)
+		goto out;
+
+	/* We cannot split once end_bio'd ordered extent */
+	if (WARN_ON_ONCE(ordered->bytes_left != ordered->disk_num_bytes)) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	/* We cannot split a compressed ordered extent */
+	if (WARN_ON_ONCE(ordered->disk_num_bytes != ordered->num_bytes)) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	ordered_end = ordered->disk_bytenr + ordered->disk_num_bytes;
+	/* bio must be in one ordered extent */
+	if (WARN_ON_ONCE(start < ordered->disk_bytenr || end > ordered_end)) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	/* Checksum list should be empty */
+	if (WARN_ON_ONCE(!list_empty(&ordered->list))) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	pre = start - ordered->disk_bytenr;
+	post = ordered_end - end;
+
+	ret = btrfs_split_ordered_extent(ordered, pre, post);
+	if (ret)
+		goto out;
+
+	read_lock(&em_tree->lock);
+	em = lookup_extent_mapping(em_tree, ordered->file_offset, len);
+	if (!em) {
+		read_unlock(&em_tree->lock);
+		ret = -EIO;
+		goto out;
+	}
+	read_unlock(&em_tree->lock);
+
+	ASSERT(!test_bit(EXTENT_FLAG_COMPRESSED, &em->flags));
+	/*
+	 * We cannot reuse em_new here but have to create a new one, as
+	 * unpin_extent_cache() expects the start of the extent map to be the
+	 * logical offset of the file, which does not hold true anymore after
+	 * splitting.
+	 */
+	em_new = create_io_em(inode, em->start + pre, len,
+			      em->start + pre, em->block_start + pre, len,
+			      len, len, BTRFS_COMPRESS_NONE,
+			      BTRFS_ORDERED_REGULAR);
+	if (IS_ERR(em_new)) {
+		ret = PTR_ERR(em_new);
+		goto out;
+	}
+	free_extent_map(em_new);
+
+out:
+	free_extent_map(em);
+	btrfs_put_ordered_extent(ordered);
+
+	return errno_to_blk_status(ret);
+}
+
 /*
  * extent_io.c submission hook. This does the right thing for csum calculation
  * on write, or reading the csums from the tree before a read.
@@ -2250,6 +2336,15 @@ blk_status_t btrfs_submit_data_bio(struct inode *inode, struct bio *bio,
 	if (btrfs_is_free_space_inode(BTRFS_I(inode)))
 		metadata = BTRFS_WQ_ENDIO_FREE_SPACE;
 
+	if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
+		struct page *page = bio_first_bvec_all(bio)->bv_page;
+		loff_t file_offset = page_offset(page);
+
+		ret = extract_ordered_extent(BTRFS_I(inode), bio, file_offset);
+		if (ret)
+			goto out;
+	}
+
 	if (btrfs_op(bio) != BTRFS_MAP_WRITE) {
 		ret = btrfs_bio_wq_end_io(fs_info, bio, metadata);
 		if (ret)
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index e8dee1578d4a..2dc707f02f00 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -920,6 +920,84 @@ void btrfs_lock_and_flush_ordered_range(struct btrfs_inode *inode, u64 start,
 	}
 }
 
+static int clone_ordered_extent(struct btrfs_ordered_extent *ordered, u64 pos,
+				u64 len)
+{
+	struct inode *inode = ordered->inode;
+	u64 file_offset = ordered->file_offset + pos;
+	u64 disk_bytenr = ordered->disk_bytenr + pos;
+	u64 num_bytes = len;
+	u64 disk_num_bytes = len;
+	int type;
+	unsigned long flags_masked = ordered->flags & ~(1 << BTRFS_ORDERED_DIRECT);
+	int compress_type = ordered->compress_type;
+	unsigned long weight;
+	int ret;
+
+	weight = hweight_long(flags_masked);
+	WARN_ON_ONCE(weight > 1);
+	if (!weight)
+		type = 0;
+	else
+		type = __ffs(flags_masked);
+
+	if (test_bit(BTRFS_ORDERED_COMPRESSED, &ordered->flags)) {
+		WARN_ON_ONCE(1);
+		ret = btrfs_add_ordered_extent_compress(BTRFS_I(inode),
+				file_offset, disk_bytenr, num_bytes,
+				disk_num_bytes, compress_type);
+	} else if (test_bit(BTRFS_ORDERED_DIRECT, &ordered->flags)) {
+		ret = btrfs_add_ordered_extent_dio(BTRFS_I(inode), file_offset,
+				disk_bytenr, num_bytes, disk_num_bytes, type);
+	} else {
+		ret = btrfs_add_ordered_extent(BTRFS_I(inode), file_offset,
+				disk_bytenr, num_bytes, disk_num_bytes, type);
+	}
+
+	return ret;
+}
+
+int btrfs_split_ordered_extent(struct btrfs_ordered_extent *ordered, u64 pre,
+				u64 post)
+{
+	struct inode *inode = ordered->inode;
+	struct btrfs_ordered_inode_tree *tree = &BTRFS_I(inode)->ordered_tree;
+	struct rb_node *node;
+	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+	int ret = 0;
+
+	spin_lock_irq(&tree->lock);
+	/* Remove from tree once */
+	node = &ordered->rb_node;
+	rb_erase(node, &tree->tree);
+	RB_CLEAR_NODE(node);
+	if (tree->last == node)
+		tree->last = NULL;
+
+	ordered->file_offset += pre;
+	ordered->disk_bytenr += pre;
+	ordered->num_bytes -= (pre + post);
+	ordered->disk_num_bytes -= (pre + post);
+	ordered->bytes_left -= (pre + post);
+
+	/* Re-insert the node */
+	node = tree_insert(&tree->tree, ordered->file_offset, &ordered->rb_node);
+	if (node)
+		btrfs_panic(fs_info, -EEXIST,
+			"zoned: inconsistency in ordered tree at offset %llu",
+			    ordered->file_offset);
+
+	spin_unlock_irq(&tree->lock);
+
+	if (pre)
+		ret = clone_ordered_extent(ordered, 0, pre);
+	if (post)
+		ret = clone_ordered_extent(ordered, pre + ordered->disk_num_bytes,
+					   post);
+
+	return ret;
+}
+
 int __init ordered_data_init(void)
 {
 	btrfs_ordered_extent_cache = kmem_cache_create("btrfs_ordered_extent",
diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
index cca3307807e8..c400be75a3f1 100644
--- a/fs/btrfs/ordered-data.h
+++ b/fs/btrfs/ordered-data.h
@@ -201,6 +201,8 @@ void btrfs_wait_ordered_roots(struct btrfs_fs_info *fs_info, u64 nr,
 void btrfs_lock_and_flush_ordered_range(struct btrfs_inode *inode, u64 start,
 					u64 end,
 					struct extent_state **cached_state);
+int btrfs_split_ordered_extent(struct btrfs_ordered_extent *ordered, u64 pre,
+			       u64 post);
 int __init ordered_data_init(void);
 void __cold ordered_data_exit(void);
 
-- 
2.30.0


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH v15 22/42] btrfs: zoned: check if bio spans across an ordered extent
  2021-02-04 10:21 ` [PATCH v15 01/42] block: add bio_add_zone_append_page Naohiro Aota
                     ` (19 preceding siblings ...)
  2021-02-04 10:22   ` [PATCH v15 21/42] btrfs: zoned: split ordered extent when bio is sent Naohiro Aota
@ 2021-02-04 10:22   ` Naohiro Aota
  2021-02-04 10:22   ` [PATCH v15 23/42] btrfs: extend btrfs_rmap_block for specifying a device Naohiro Aota
                     ` (20 subsequent siblings)
  41 siblings, 0 replies; 72+ messages in thread
From: Naohiro Aota @ 2021-02-04 10:22 UTC (permalink / raw)
  To: linux-btrfs, dsterba
  Cc: hare, linux-fsdevel, Johannes Thumshirn, Josef Bacik, Naohiro Aota

From: Johannes Thumshirn <johannes.thumshirn@wdc.com>

To ensure that an ordered extent maps to a contiguous region on disk, we
need to maintain a "one bio == one ordered extent" rule.

Ensure that a bio being constructed does not span more than one ordered
extent.
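
The check itself can be sketched as follows (positions and sizes in
bytes, names simplified):

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

static bool fits_in_ordered_extent(uint64_t bio_start, uint64_t bio_size,
				   uint64_t add_size, uint64_t ordered_start,
				   uint64_t ordered_size)
{
	/* the page may be added only if the grown bio still ends within
	 * the ordered extent */
	return bio_start + bio_size + add_size <=
	       ordered_start + ordered_size;
}

int main(void)
{
	/* 1M ordered extent, bio currently 960K into it */
	printf("%d\n", fits_in_ordered_extent(0, 960 << 10, 64 << 10,
					      0, 1 << 20));  /* 1: fits */
	printf("%d\n", fits_in_ordered_extent(0, 960 << 10, 128 << 10,
					      0, 1 << 20));  /* 0: spans */
	return 0;
}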

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/ctree.h     |  2 ++
 fs/btrfs/extent_io.c |  9 +++++++--
 fs/btrfs/inode.c     | 27 +++++++++++++++++++++++++++
 3 files changed, 36 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index a9b0521d9e89..10da47ab093a 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3120,6 +3120,8 @@ void btrfs_split_delalloc_extent(struct inode *inode,
 				 struct extent_state *orig, u64 split);
 int btrfs_bio_fits_in_stripe(struct page *page, size_t size, struct bio *bio,
 			     unsigned long bio_flags);
+bool btrfs_bio_fits_in_ordered_extent(struct page *page, struct bio *bio,
+				      unsigned int size);
 void btrfs_set_range_writeback(struct extent_io_tree *tree, u64 start, u64 end);
 vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf);
 int btrfs_readpage(struct file *file, struct page *page);
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 15503a435e98..72b1a23d17f9 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3124,10 +3124,15 @@ static bool btrfs_bio_add_page(struct bio *bio, struct page *page,
 	if (btrfs_bio_fits_in_stripe(page, size, bio, bio_flags))
 		return false;
 
-	if (bio_op(bio) == REQ_OP_ZONE_APPEND)
+	if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
+		struct page *first_page = bio_first_bvec_all(bio)->bv_page;
+
+		if (!btrfs_bio_fits_in_ordered_extent(first_page, bio, size))
+			return false;
 		ret = bio_add_zone_append_page(bio, page, size, pg_offset);
-	else
+	} else {
 		ret = bio_add_page(bio, page, size, pg_offset);
+	}
 
 	return ret == size;
 }
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 750482a06d67..31545e503b9e 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2215,6 +2215,33 @@ static blk_status_t btrfs_submit_bio_start(struct inode *inode, struct bio *bio,
 	return btrfs_csum_one_bio(BTRFS_I(inode), bio, 0, 0);
 }
 
+bool btrfs_bio_fits_in_ordered_extent(struct page *page, struct bio *bio,
+				      unsigned int size)
+{
+	struct btrfs_inode *inode = BTRFS_I(page->mapping->host);
+	struct btrfs_fs_info *fs_info = inode->root->fs_info;
+	struct btrfs_ordered_extent *ordered;
+	u64 len = bio->bi_iter.bi_size + size;
+	bool ret = true;
+
+	ASSERT(btrfs_is_zoned(fs_info));
+	ASSERT(fs_info->max_zone_append_size > 0);
+	ASSERT(bio_op(bio) == REQ_OP_ZONE_APPEND);
+
+	/* Ordered extent not yet created, so we're good */
+	ordered = btrfs_lookup_ordered_extent(inode, page_offset(page));
+	if (!ordered)
+		return ret;
+
+	if ((bio->bi_iter.bi_sector << SECTOR_SHIFT) + len >
+	    ordered->disk_bytenr + ordered->disk_num_bytes)
+		ret = false;
+
+	btrfs_put_ordered_extent(ordered);
+
+	return ret;
+}
+
 static blk_status_t extract_ordered_extent(struct btrfs_inode *inode,
 					   struct bio *bio, loff_t file_offset)
 {
-- 
2.30.0


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH v15 23/42] btrfs: extend btrfs_rmap_block for specifying a device
  2021-02-04 10:21 ` [PATCH v15 01/42] block: add bio_add_zone_append_page Naohiro Aota
                     ` (20 preceding siblings ...)
  2021-02-04 10:22   ` [PATCH v15 22/42] btrfs: zoned: check if bio spans across an ordered extent Naohiro Aota
@ 2021-02-04 10:22   ` Naohiro Aota
  2021-02-04 10:22   ` [PATCH v15 24/42] btrfs: zoned: cache if block-group is on a sequential zone Naohiro Aota
                     ` (19 subsequent siblings)
  41 siblings, 0 replies; 72+ messages in thread
From: Naohiro Aota @ 2021-02-04 10:22 UTC (permalink / raw)
  To: linux-btrfs, dsterba; +Cc: hare, linux-fsdevel, Naohiro Aota, Josef Bacik

btrfs_rmap_block currently reverse-maps the physical addresses on all
devices to the corresponding logical addresses.

Extend the function to restrict the lookup to a specified device. The
old behaviour of querying all devices is left intact by passing NULL as
the target device.

A block_device instead of a btrfs_device is passed into btrfs_rmap_block,
as this function is intended to reverse-map the result of a bio, which
only has a block_device.

Also export the function for later use.
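
For illustration, the reverse mapping for the trivial single-stripe
case, including the in-stripe offset that is now preserved by taking the
division remainder (a simplified sketch, not the RAID-aware kernel
code):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint64_t chunk_logical = 1ULL << 30;   /* chunk start (logical) */
	uint64_t stripe_physical = 1ULL << 31; /* stripe start on disk */
	uint64_t stripe_len = 64ULL << 10;     /* 64K stripes */

	/* a physical address 3 stripes plus 4K into the chunk */
	uint64_t physical = stripe_physical + 3 * stripe_len + 4096;

	uint64_t off = physical - stripe_physical;
	uint64_t stripe_nr = off / stripe_len;
	uint64_t offset = off % stripe_len;    /* the preserved remainder */
	uint64_t logical = chunk_logical + stripe_nr * stripe_len + offset;

	printf("logical = %llu\n", (unsigned long long)logical);
	return 0;
}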

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/block-group.c            | 16 +++++++++++-----
 fs/btrfs/block-group.h            |  8 +++-----
 fs/btrfs/tests/extent-map-tests.c |  2 +-
 3 files changed, 15 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index 70a0c0f8f99f..f5e9f560ce6d 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -1588,6 +1588,7 @@ static void set_avail_alloc_bits(struct btrfs_fs_info *fs_info, u64 flags)
  *
  * @fs_info:       the filesystem
  * @chunk_start:   logical address of block group
+ * @bdev:	   physical device to resolve, can be NULL to indicate any device
  * @physical:	   physical address to map to logical addresses
  * @logical:	   return array of logical addresses which map to @physical
  * @naddrs:	   length of @logical
@@ -1597,9 +1598,9 @@ static void set_avail_alloc_bits(struct btrfs_fs_info *fs_info, u64 flags)
  * Used primarily to exclude those portions of a block group that contain super
  * block copies.
  */
-EXPORT_FOR_TESTS
 int btrfs_rmap_block(struct btrfs_fs_info *fs_info, u64 chunk_start,
-		     u64 physical, u64 **logical, int *naddrs, int *stripe_len)
+		     struct block_device *bdev, u64 physical, u64 **logical,
+		     int *naddrs, int *stripe_len)
 {
 	struct extent_map *em;
 	struct map_lookup *map;
@@ -1617,6 +1618,7 @@ int btrfs_rmap_block(struct btrfs_fs_info *fs_info, u64 chunk_start,
 	map = em->map_lookup;
 	data_stripe_length = em->orig_block_len;
 	io_stripe_size = map->stripe_len;
+	chunk_start = em->start;
 
 	/* For RAID5/6 adjust to a full IO stripe length */
 	if (map->type & BTRFS_BLOCK_GROUP_RAID56_MASK)
@@ -1631,14 +1633,18 @@ int btrfs_rmap_block(struct btrfs_fs_info *fs_info, u64 chunk_start,
 	for (i = 0; i < map->num_stripes; i++) {
 		bool already_inserted = false;
 		u64 stripe_nr;
+		u64 offset;
 		int j;
 
 		if (!in_range(physical, map->stripes[i].physical,
 			      data_stripe_length))
 			continue;
 
+		if (bdev && map->stripes[i].dev->bdev != bdev)
+			continue;
+
 		stripe_nr = physical - map->stripes[i].physical;
-		stripe_nr = div64_u64(stripe_nr, map->stripe_len);
+		stripe_nr = div64_u64_rem(stripe_nr, map->stripe_len, &offset);
 
 		if (map->type & BTRFS_BLOCK_GROUP_RAID10) {
 			stripe_nr = stripe_nr * map->num_stripes + i;
@@ -1652,7 +1658,7 @@ int btrfs_rmap_block(struct btrfs_fs_info *fs_info, u64 chunk_start,
 		 * instead of map->stripe_len
 		 */
 
-		bytenr = chunk_start + stripe_nr * io_stripe_size;
+		bytenr = chunk_start + stripe_nr * io_stripe_size + offset;
 
 		/* Ensure we don't add duplicate addresses */
 		for (j = 0; j < nr; j++) {
@@ -1694,7 +1700,7 @@ static int exclude_super_stripes(struct btrfs_block_group *cache)
 
 	for (i = 0; i < BTRFS_SUPER_MIRROR_MAX; i++) {
 		bytenr = btrfs_sb_offset(i);
-		ret = btrfs_rmap_block(fs_info, cache->start,
+		ret = btrfs_rmap_block(fs_info, cache->start, NULL,
 				       bytenr, &logical, &nr, &stripe_len);
 		if (ret)
 			return ret;
diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h
index 0fd66febe115..d14ac03bb93d 100644
--- a/fs/btrfs/block-group.h
+++ b/fs/btrfs/block-group.h
@@ -277,6 +277,9 @@ void btrfs_put_block_group_cache(struct btrfs_fs_info *info);
 int btrfs_free_block_groups(struct btrfs_fs_info *info);
 void btrfs_wait_space_cache_v1_finished(struct btrfs_block_group *cache,
 				struct btrfs_caching_control *caching_ctl);
+int btrfs_rmap_block(struct btrfs_fs_info *fs_info, u64 chunk_start,
+		       struct block_device *bdev, u64 physical, u64 **logical,
+		       int *naddrs, int *stripe_len);
 
 static inline u64 btrfs_data_alloc_profile(struct btrfs_fs_info *fs_info)
 {
@@ -303,9 +306,4 @@ static inline int btrfs_block_group_done(struct btrfs_block_group *cache)
 void btrfs_freeze_block_group(struct btrfs_block_group *cache);
 void btrfs_unfreeze_block_group(struct btrfs_block_group *cache);
 
-#ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
-int btrfs_rmap_block(struct btrfs_fs_info *fs_info, u64 chunk_start,
-		     u64 physical, u64 **logical, int *naddrs, int *stripe_len);
-#endif
-
 #endif /* BTRFS_BLOCK_GROUP_H */
diff --git a/fs/btrfs/tests/extent-map-tests.c b/fs/btrfs/tests/extent-map-tests.c
index 57379e96ccc9..c0aefe6dee0b 100644
--- a/fs/btrfs/tests/extent-map-tests.c
+++ b/fs/btrfs/tests/extent-map-tests.c
@@ -507,7 +507,7 @@ static int test_rmap_block(struct btrfs_fs_info *fs_info,
 		goto out_free;
 	}
 
-	ret = btrfs_rmap_block(fs_info, em->start, btrfs_sb_offset(1),
+	ret = btrfs_rmap_block(fs_info, em->start, NULL, btrfs_sb_offset(1),
 			       &logical, &out_ndaddrs, &out_stripe_len);
 	if (ret || (out_ndaddrs == 0 && test->expected_mapped_addr)) {
 		test_err("didn't rmap anything but expected %d",
-- 
2.30.0


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH v15 24/42] btrfs: zoned: cache if block-group is on a sequential zone
  2021-02-04 10:21 ` [PATCH v15 01/42] block: add bio_add_zone_append_page Naohiro Aota
                     ` (21 preceding siblings ...)
  2021-02-04 10:22   ` [PATCH v15 23/42] btrfs: extend btrfs_rmap_block for specifying a device Naohiro Aota
@ 2021-02-04 10:22   ` Naohiro Aota
  2021-02-04 10:22   ` [PATCH v15 25/42] btrfs: save irq flags when looking up an ordered extent Naohiro Aota
                     ` (18 subsequent siblings)
  41 siblings, 0 replies; 72+ messages in thread
From: Naohiro Aota @ 2021-02-04 10:22 UTC (permalink / raw)
  To: linux-btrfs, dsterba; +Cc: hare, linux-fsdevel, Johannes Thumshirn, Josef Bacik

From: Johannes Thumshirn <johannes.thumshirn@wdc.com>

On a zoned filesystem, cache whether a block group is located on a
sequential write only zone.

On sequential write only zones, we can use REQ_OP_ZONE_APPEND for writing
data, so provide btrfs_use_zone_append() to figure out whether an I/O is
targeting a sequential write only zone and can therefore use
REQ_OP_ZONE_APPEND for data writing.
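
A small sketch of the decision helper, with the cached per-block-group
flag and the gating conditions modeled as plain parameters (simplified;
not the kernel function):

#include <stdio.h>
#include <stdbool.h>

struct block_group {
	bool seq_zone; /* set once when the zone layout is loaded */
};

static bool use_zone_append(bool fs_is_zoned, unsigned int max_append_size,
			    bool is_data, const struct block_group *bg)
{
	if (!fs_is_zoned || !max_append_size || !is_data)
		return false;
	return bg->seq_zone; /* only sequential zones take zone append */
}

int main(void)
{
	struct block_group bg = { .seq_zone = true };

	printf("%d\n", use_zone_append(true, 1 << 20, true, &bg));  /* 1 */
	printf("%d\n", use_zone_append(true, 1 << 20, false, &bg)); /* 0 */
	return 0;
}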

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 fs/btrfs/block-group.h |  3 +++
 fs/btrfs/zoned.c       | 29 +++++++++++++++++++++++++++++
 fs/btrfs/zoned.h       |  6 ++++++
 3 files changed, 38 insertions(+)

diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h
index d14ac03bb93d..31c7c5872b92 100644
--- a/fs/btrfs/block-group.h
+++ b/fs/btrfs/block-group.h
@@ -181,6 +181,9 @@ struct btrfs_block_group {
 	 */
 	int needs_free_space;
 
+	/* Flag indicating this block group is placed on a sequential zone */
+	bool seq_zone;
+
 	/* Record locked full stripes for RAID5/6 block group */
 	struct btrfs_full_stripe_locks_tree full_stripe_locks_root;
 
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index 1de67d789b83..f6c68704c840 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -1101,6 +1101,9 @@ int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache, bool new)
 		}
 	}
 
+	if (num_sequential > 0)
+		cache->seq_zone = true;
+
 	if (num_conventional > 0) {
 		/*
 		 * Avoid calling calculate_alloc_pointer() for new BG. It
@@ -1218,3 +1221,29 @@ void btrfs_free_redirty_list(struct btrfs_transaction *trans)
 	}
 	spin_unlock(&trans->releasing_ebs_lock);
 }
+
+bool btrfs_use_zone_append(struct btrfs_inode *inode, struct extent_map *em)
+{
+	struct btrfs_fs_info *fs_info = inode->root->fs_info;
+	struct btrfs_block_group *cache;
+	bool ret = false;
+
+	if (!btrfs_is_zoned(fs_info))
+		return false;
+
+	if (!fs_info->max_zone_append_size)
+		return false;
+
+	if (!is_data_inode(&inode->vfs_inode))
+		return false;
+
+	cache = btrfs_lookup_block_group(fs_info, em->block_start);
+	ASSERT(cache);
+	if (!cache)
+		return false;
+
+	ret = cache->seq_zone;
+	btrfs_put_block_group(cache);
+
+	return ret;
+}
diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h
index c105641a6ad3..14d578328cbe 100644
--- a/fs/btrfs/zoned.h
+++ b/fs/btrfs/zoned.h
@@ -46,6 +46,7 @@ void btrfs_calc_zone_unusable(struct btrfs_block_group *cache);
 void btrfs_redirty_list_add(struct btrfs_transaction *trans,
 			    struct extent_buffer *eb);
 void btrfs_free_redirty_list(struct btrfs_transaction *trans);
+bool btrfs_use_zone_append(struct btrfs_inode *inode, struct extent_map *em);
 #else /* CONFIG_BLK_DEV_ZONED */
 static inline int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
 				     struct blk_zone *zone)
@@ -133,6 +134,11 @@ static inline void btrfs_redirty_list_add(struct btrfs_transaction *trans,
 					  struct extent_buffer *eb) { }
 static inline void btrfs_free_redirty_list(struct btrfs_transaction *trans) { }
 
+static inline bool btrfs_use_zone_append(struct btrfs_inode *inode,
+					 struct extent_map *em)
+{
+	return false;
+}
 #endif
 
 static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)
-- 
2.30.0


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH v15 25/42] btrfs: save irq flags when looking up an ordered extent
  2021-02-04 10:21 ` [PATCH v15 01/42] block: add bio_add_zone_append_page Naohiro Aota
                     ` (22 preceding siblings ...)
  2021-02-04 10:22   ` [PATCH v15 24/42] btrfs: zoned: cache if block-group is on a sequential zone Naohiro Aota
@ 2021-02-04 10:22   ` Naohiro Aota
  2021-02-04 10:22   ` [PATCH v15 26/42] btrfs: zoned: use ZONE_APPEND write for zoned btrfs Naohiro Aota
                     ` (17 subsequent siblings)
  41 siblings, 0 replies; 72+ messages in thread
From: Naohiro Aota @ 2021-02-04 10:22 UTC (permalink / raw)
  To: linux-btrfs, dsterba; +Cc: hare, linux-fsdevel, Johannes Thumshirn, Josef Bacik

From: Johannes Thumshirn <johannes.thumshirn@wdc.com>

A following patch will add another caller of
btrfs_lookup_ordered_extent(), but from a bio's endio context.

btrfs_lookup_ordered_extent() uses spin_lock_irq(), which unconditionally
disables and re-enables interrupts. Change this to spin_lock_irqsave() so
the interrupt state is saved on entry and restored on exit, which is
needed when the caller may already run with interrupts disabled.
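
For illustration, a minimal sketch with hypothetical names, showing why
saving the interrupt state matters when the caller may already run with
interrupts disabled (e.g. an endio handler):

  struct example_tree {
          spinlock_t lock;
          struct rb_root root;
  };

  static struct rb_node *example_lookup(struct example_tree *tree, u64 key)
  {
          unsigned long flags;
          struct rb_node *node;

          /* Save the current interrupt state instead of force-disabling it */
          spin_lock_irqsave(&tree->lock, flags);
          node = tree->root.rb_node;  /* stand-in for the real search */
          /*
           * Restore the saved state: a caller that entered with interrupts
           * disabled does not get them re-enabled behind its back.
           */
          spin_unlock_irqrestore(&tree->lock, flags);

          return node;
  }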

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/ordered-data.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index 2dc707f02f00..fe235ab935d3 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -767,9 +767,10 @@ struct btrfs_ordered_extent *btrfs_lookup_ordered_extent(struct btrfs_inode *ino
 	struct btrfs_ordered_inode_tree *tree;
 	struct rb_node *node;
 	struct btrfs_ordered_extent *entry = NULL;
+	unsigned long flags;
 
 	tree = &inode->ordered_tree;
-	spin_lock_irq(&tree->lock);
+	spin_lock_irqsave(&tree->lock, flags);
 	node = tree_search(tree, file_offset);
 	if (!node)
 		goto out;
@@ -780,7 +781,7 @@ struct btrfs_ordered_extent *btrfs_lookup_ordered_extent(struct btrfs_inode *ino
 	if (entry)
 		refcount_inc(&entry->refs);
 out:
-	spin_unlock_irq(&tree->lock);
+	spin_unlock_irqrestore(&tree->lock, flags);
 	return entry;
 }
 
-- 
2.30.0


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH v15 26/42] btrfs: zoned: use ZONE_APPEND write for zoned btrfs
  2021-02-04 10:21 ` [PATCH v15 01/42] block: add bio_add_zone_append_page Naohiro Aota
                     ` (23 preceding siblings ...)
  2021-02-04 10:22   ` [PATCH v15 25/42] btrfs: save irq flags when looking up an ordered extent Naohiro Aota
@ 2021-02-04 10:22   ` Naohiro Aota
  2021-02-04 10:22   ` [PATCH v15 27/42] btrfs: zoned: enable zone append writing for direct IO Naohiro Aota
                     ` (16 subsequent siblings)
  41 siblings, 0 replies; 72+ messages in thread
From: Naohiro Aota @ 2021-02-04 10:22 UTC (permalink / raw)
  To: linux-btrfs, dsterba
  Cc: hare, linux-fsdevel, Naohiro Aota, Johannes Thumshirn, Josef Bacik

This commit enables zone append writing for zoned btrfs. When using zone
append, a bio is issued to the start of a target zone and the device
decides where to place it inside the zone. Upon completion the device
reports the actual written position back to the host.

Three parts are necessary to enable zone append in btrfs. First, modify
the bio to use REQ_OP_ZONE_APPEND in btrfs_submit_bio_hook() and adjust
the bi_sector to point to the beginning of the zone.

Second, record the returned physical address (and the disk/partno) in the
ordered extent in end_bio_extent_writepage() after the bio has completed.
We cannot resolve the physical address to the logical address because we
can neither take locks nor allocate a buffer in this end_bio context. So,
we need to record the physical address to resolve it later in
btrfs_finish_ordered_io().

And finally, rewrite the logical addresses of the extent mapping and
checksum data according to the physical address using btrfs_rmap_block.
If the returned address matches the originally allocated address, we can
skip this rewriting process.
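
For illustration, the core of the submission and completion sides,
condensed from the hunks below:

  /* Submission: a zone append bio targets the start of the zone */
  u64 zone_start = round_down(physical, fs_info->zone_size);

  bio->bi_iter.bi_sector = zone_start >> SECTOR_SHIFT;

  /* Completion: bi_sector now holds where the data actually landed */
  ordered->physical = (u64)bio->bi_iter.bi_sector << SECTOR_SHIFT;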

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/extent_io.c    | 15 +++++++--
 fs/btrfs/file.c         |  6 +++-
 fs/btrfs/inode.c        |  4 +++
 fs/btrfs/ordered-data.c |  3 ++
 fs/btrfs/ordered-data.h |  8 +++++
 fs/btrfs/volumes.c      | 14 ++++++++
 fs/btrfs/zoned.c        | 73 +++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/zoned.h        | 12 +++++++
 8 files changed, 132 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 72b1a23d17f9..4c186a5f9efa 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2735,6 +2735,7 @@ static void end_bio_extent_writepage(struct bio *bio)
 	u64 start;
 	u64 end;
 	struct bvec_iter_all iter_all;
+	bool first_bvec = true;
 
 	ASSERT(!bio_flagged(bio, BIO_CLONED));
 	bio_for_each_segment_all(bvec, bio, iter_all) {
@@ -2761,6 +2762,11 @@ static void end_bio_extent_writepage(struct bio *bio)
 		start = page_offset(page);
 		end = start + bvec->bv_offset + bvec->bv_len - 1;
 
+		if (first_bvec) {
+			btrfs_record_physical_zoned(inode, start, bio);
+			first_bvec = false;
+		}
+
 		end_extent_writepage(page, error, start, end);
 		end_page_writeback(page);
 	}
@@ -3665,6 +3671,7 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
 	struct extent_map *em;
 	int ret = 0;
 	int nr = 0;
+	int opf = REQ_OP_WRITE;
 	const unsigned int write_flags = wbc_to_write_flags(wbc);
 	bool compressed;
 
@@ -3711,6 +3718,10 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
 
 		/* Note that em_end from extent_map_end() is exclusive */
 		iosize = min(em_end, end + 1) - cur;
+
+		if (btrfs_use_zone_append(inode, em))
+			opf = REQ_OP_ZONE_APPEND;
+
 		free_extent_map(em);
 		em = NULL;
 
@@ -3736,8 +3747,8 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
 			       page->index, cur, end);
 		}
 
-		ret = submit_extent_page(REQ_OP_WRITE | write_flags, wbc,
-					 page, disk_bytenr, iosize,
+		ret = submit_extent_page(opf | write_flags, wbc, page,
+					 disk_bytenr, iosize,
 					 cur - page_offset(page), &epd->bio,
 					 end_bio_extent_writepage,
 					 0, 0, 0, false);
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 5a54f78faed5..0152524599e6 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2168,8 +2168,12 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
 	 * commit waits for their completion, to avoid data loss if we fsync,
 	 * the current transaction commits before the ordered extents complete
 	 * and a power failure happens right after that.
+	 *
+	 * For zoned filesystem, if a write IO uses a ZONE_APPEND command, the
+	 * logical address recorded in the ordered extent may change. We need
+	 * to wait for the IO to stabilize the logical address.
 	 */
-	if (full_sync) {
+	if (full_sync || btrfs_is_zoned(fs_info)) {
 		ret = btrfs_wait_ordered_range(inode, start, len);
 	} else {
 		/*
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 31545e503b9e..6dbab9293425 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -50,6 +50,7 @@
 #include "delalloc-space.h"
 #include "block-group.h"
 #include "space-info.h"
+#include "zoned.h"
 
 struct btrfs_iget_args {
 	u64 ino;
@@ -2874,6 +2875,9 @@ static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent)
 		goto out;
 	}
 
+	if (ordered_extent->disk)
+		btrfs_rewrite_logical_zoned(ordered_extent);
+
 	btrfs_free_io_failure_record(inode, start, end);
 
 	if (test_bit(BTRFS_ORDERED_TRUNCATED, &ordered_extent->flags)) {
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index fe235ab935d3..985a21558437 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -199,6 +199,9 @@ static int __btrfs_add_ordered_extent(struct btrfs_inode *inode, u64 file_offset
 	entry->compress_type = compress_type;
 	entry->truncated_len = (u64)-1;
 	entry->qgroup_rsv = ret;
+	entry->physical = (u64)-1;
+	entry->disk = NULL;
+	entry->partno = (u8)-1;
 
 	ASSERT(type == BTRFS_ORDERED_REGULAR ||
 	       type == BTRFS_ORDERED_NOCOW ||
diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
index c400be75a3f1..99e0853e4d3b 100644
--- a/fs/btrfs/ordered-data.h
+++ b/fs/btrfs/ordered-data.h
@@ -139,6 +139,14 @@ struct btrfs_ordered_extent {
 	struct completion completion;
 	struct btrfs_work flush_work;
 	struct list_head work_list;
+
+	/*
+	 * Used to reverse-map physical address returned from ZONE_APPEND write
+	 * command in a workqueue context
+	 */
+	u64 physical;
+	struct gendisk *disk;
+	u8 partno;
 };
 
 /*
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 400375aaa197..a4d47c6050f7 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6500,6 +6500,20 @@ static void submit_stripe_bio(struct btrfs_bio *bbio, struct bio *bio,
 	btrfs_io_bio(bio)->device = dev;
 	bio->bi_end_io = btrfs_end_bio;
 	bio->bi_iter.bi_sector = physical >> 9;
+	/*
+	 * For zone append writing, bi_sector must point to the beginning of
+	 * the zone
+	 */
+	if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
+		if (btrfs_dev_is_sequential(dev, physical)) {
+			u64 zone_start = round_down(physical, fs_info->zone_size);
+
+			bio->bi_iter.bi_sector = zone_start >> SECTOR_SHIFT;
+		} else {
+			bio->bi_opf &= ~REQ_OP_ZONE_APPEND;
+			bio->bi_opf |= REQ_OP_WRITE;
+		}
+	}
 	btrfs_debug_in_rcu(fs_info,
 	"btrfs_map_bio: rw %d 0x%x, sector=%llu, dev=%lu (%s id %llu), size=%u",
 		bio_op(bio), bio->bi_opf, bio->bi_iter.bi_sector,
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index f6c68704c840..050aea447332 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -1247,3 +1247,76 @@ bool btrfs_use_zone_append(struct btrfs_inode *inode, struct extent_map *em)
 
 	return ret;
 }
+
+void btrfs_record_physical_zoned(struct inode *inode, u64 file_offset,
+				 struct bio *bio)
+{
+	struct btrfs_ordered_extent *ordered;
+	const u64 physical = (u64)bio->bi_iter.bi_sector << SECTOR_SHIFT;
+
+	if (bio_op(bio) != REQ_OP_ZONE_APPEND)
+		return;
+
+	ordered = btrfs_lookup_ordered_extent(BTRFS_I(inode), file_offset);
+	if (WARN_ON(!ordered))
+		return;
+
+	ordered->physical = physical;
+	ordered->disk = bio->bi_disk;
+	ordered->partno = bio->bi_partno;
+
+	btrfs_put_ordered_extent(ordered);
+}
+
+void btrfs_rewrite_logical_zoned(struct btrfs_ordered_extent *ordered)
+{
+	struct extent_map_tree *em_tree;
+	struct extent_map *em;
+	struct inode *inode = ordered->inode;
+	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+	struct btrfs_ordered_sum *sum;
+	struct block_device *bdev;
+	u64 orig_logical = ordered->disk_bytenr;
+	u64 *logical = NULL;
+	int nr, stripe_len;
+
+	/*
+	 * Zoned devices should not have partitions. So, we can assume it
+	 * is 0.
+	 */
+	ASSERT(ordered->partno == 0);
+	bdev = bdgrab(ordered->disk->part0);
+	if (WARN_ON(!bdev))
+		return;
+
+	if (WARN_ON(btrfs_rmap_block(fs_info, orig_logical, bdev,
+				     ordered->physical, &logical, &nr,
+				     &stripe_len)))
+		goto out;
+
+	WARN_ON(nr != 1);
+
+	if (orig_logical == *logical)
+		goto out;
+
+	ordered->disk_bytenr = *logical;
+
+	em_tree = &BTRFS_I(inode)->extent_tree;
+	write_lock(&em_tree->lock);
+	em = search_extent_mapping(em_tree, ordered->file_offset,
+				   ordered->num_bytes);
+	em->block_start = *logical;
+	free_extent_map(em);
+	write_unlock(&em_tree->lock);
+
+	list_for_each_entry(sum, &ordered->list, list) {
+		if (*logical < orig_logical)
+			sum->bytenr -= orig_logical - *logical;
+		else
+			sum->bytenr += *logical - orig_logical;
+	}
+
+out:
+	kfree(logical);
+	bdput(bdev);
+}
diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h
index 14d578328cbe..04f7b21652b6 100644
--- a/fs/btrfs/zoned.h
+++ b/fs/btrfs/zoned.h
@@ -47,6 +47,9 @@ void btrfs_redirty_list_add(struct btrfs_transaction *trans,
 			    struct extent_buffer *eb);
 void btrfs_free_redirty_list(struct btrfs_transaction *trans);
 bool btrfs_use_zone_append(struct btrfs_inode *inode, struct extent_map *em);
+void btrfs_record_physical_zoned(struct inode *inode, u64 file_offset,
+				 struct bio *bio);
+void btrfs_rewrite_logical_zoned(struct btrfs_ordered_extent *ordered);
 #else /* CONFIG_BLK_DEV_ZONED */
 static inline int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
 				     struct blk_zone *zone)
@@ -139,6 +142,15 @@ static inline bool btrfs_use_zone_append(struct btrfs_inode *inode,
 {
 	return false;
 }
+
+static inline void btrfs_record_physical_zoned(struct inode *inode,
+					       u64 file_offset, struct bio *bio)
+{
+}
+
+static inline void btrfs_rewrite_logical_zoned(
+				struct btrfs_ordered_extent *ordered) { }
+
 #endif
 
 static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)
-- 
2.30.0


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH v15 27/42] btrfs: zoned: enable zone append writing for direct IO
  2021-02-04 10:21 ` [PATCH v15 01/42] block: add bio_add_zone_append_page Naohiro Aota
                     ` (24 preceding siblings ...)
  2021-02-04 10:22   ` [PATCH v15 26/42] btrfs: zoned: use ZONE_APPEND write for zoned btrfs Naohiro Aota
@ 2021-02-04 10:22   ` Naohiro Aota
  2021-02-04 10:22   ` [PATCH v15 28/42] btrfs: zoned: introduce dedicated data write path for zoned filesystems Naohiro Aota
                     ` (15 subsequent siblings)
  41 siblings, 0 replies; 72+ messages in thread
From: Naohiro Aota @ 2021-02-04 10:22 UTC (permalink / raw)
  To: linux-btrfs, dsterba; +Cc: hare, linux-fsdevel, Naohiro Aota, Josef Bacik

As with buffered IO, enable zone append writing for direct IO when it is
used on a zoned block device.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/inode.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 6dbab9293425..dd6fe8afd0e0 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7738,6 +7738,9 @@ static int btrfs_dio_iomap_begin(struct inode *inode, loff_t start,
 	iomap->bdev = fs_info->fs_devices->latest_bdev;
 	iomap->length = len;
 
+	if (write && btrfs_use_zone_append(BTRFS_I(inode), em))
+		iomap->flags |= IOMAP_F_ZONE_APPEND;
+
 	free_extent_map(em);
 
 	return 0;
@@ -7964,6 +7967,8 @@ static void btrfs_end_dio_bio(struct bio *bio)
 	if (err)
 		dip->dio_bio->bi_status = err;
 
+	btrfs_record_physical_zoned(dip->inode, dip->logical_offset, bio);
+
 	bio_put(bio);
 	btrfs_dio_private_put(dip);
 }
@@ -8124,6 +8129,19 @@ static blk_qc_t btrfs_submit_direct(struct inode *inode, struct iomap *iomap,
 		bio->bi_end_io = btrfs_end_dio_bio;
 		btrfs_io_bio(bio)->logical = file_offset;
 
+		WARN_ON_ONCE(write && btrfs_is_zoned(fs_info) &&
+			     fs_info->max_zone_append_size &&
+			     bio_op(bio) != REQ_OP_ZONE_APPEND);
+
+		if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
+			status = extract_ordered_extent(BTRFS_I(inode), bio,
+							file_offset);
+			if (status) {
+				bio_put(bio);
+				goto out_err;
+			}
+		}
+
 		ASSERT(submit_len >= clone_len);
 		submit_len -= clone_len;
 
-- 
2.30.0


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH v15 28/42] btrfs: zoned: introduce dedicated data write path for zoned filesystems
  2021-02-04 10:21 ` [PATCH v15 01/42] block: add bio_add_zone_append_page Naohiro Aota
                     ` (25 preceding siblings ...)
  2021-02-04 10:22   ` [PATCH v15 27/42] btrfs: zoned: enable zone append writing for direct IO Naohiro Aota
@ 2021-02-04 10:22   ` Naohiro Aota
  2021-02-04 10:22   ` [PATCH v15 29/42] btrfs: zoned: serialize metadata IO Naohiro Aota
                     ` (14 subsequent siblings)
  41 siblings, 0 replies; 72+ messages in thread
From: Naohiro Aota @ 2021-02-04 10:22 UTC (permalink / raw)
  To: linux-btrfs, dsterba; +Cc: hare, linux-fsdevel, Naohiro Aota, Josef Bacik

If more than one IO is issued for one file extent, these IOs can be
written to separate regions on a device. Since we cannot map one file
extent to such separate regions on a zoned filesystem, we need to follow
the "one IO == one ordered extent" rule.

The normal buffered, uncompressed and not pre-allocated write path (used
by cow_file_range()) sometimes does not follow this rule. It can write
only a part of an ordered extent when a region to write is specified,
e.g. when it is called from fdatasync().

Introduce a dedicated (uncompressed buffered) data write path for zoned
filesystems, that will CoW the region and write it at once.
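
For illustration, the idea condensed from the hunk below (the real
run_delalloc_zoned() additionally handles the page_started and error
cases):

  /* Allocate the CoW extents for the whole range, without starting IO */
  ret = cow_file_range(inode, locked_page, start, end, page_started,
                       nr_written, 0);

  /* Then write the range back in one pass: one ordered extent, one IO */
  extent_write_locked_range(&inode->vfs_inode, start, end, WB_SYNC_ALL);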

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/inode.c | 34 ++++++++++++++++++++++++++++++++--
 1 file changed, 32 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index dd6fe8afd0e0..c4779cde83c6 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1394,6 +1394,29 @@ static int cow_file_range_async(struct btrfs_inode *inode,
 	return 0;
 }
 
+static noinline int run_delalloc_zoned(struct btrfs_inode *inode,
+				       struct page *locked_page, u64 start,
+				       u64 end, int *page_started,
+				       unsigned long *nr_written)
+{
+	int ret;
+
+	ret = cow_file_range(inode, locked_page, start, end, page_started,
+			     nr_written, 0);
+	if (ret)
+		return ret;
+
+	if (*page_started)
+		return 0;
+
+	__set_page_dirty_nobuffers(locked_page);
+	account_page_redirty(locked_page);
+	extent_write_locked_range(&inode->vfs_inode, start, end, WB_SYNC_ALL);
+	*page_started = 1;
+
+	return 0;
+}
+
 static noinline int csum_exist_in_range(struct btrfs_fs_info *fs_info,
 					u64 bytenr, u64 num_bytes)
 {
@@ -1871,17 +1894,24 @@ int btrfs_run_delalloc_range(struct btrfs_inode *inode, struct page *locked_page
 {
 	int ret;
 	int force_cow = need_force_cow(inode, start, end);
+	const bool zoned = btrfs_is_zoned(inode->root->fs_info);
 
 	if (inode->flags & BTRFS_INODE_NODATACOW && !force_cow) {
+		ASSERT(!zoned);
 		ret = run_delalloc_nocow(inode, locked_page, start, end,
 					 page_started, 1, nr_written);
 	} else if (inode->flags & BTRFS_INODE_PREALLOC && !force_cow) {
+		ASSERT(!zoned);
 		ret = run_delalloc_nocow(inode, locked_page, start, end,
 					 page_started, 0, nr_written);
 	} else if (!inode_can_compress(inode) ||
 		   !inode_need_compress(inode, start, end)) {
-		ret = cow_file_range(inode, locked_page, start, end,
-				     page_started, nr_written, 1);
+		if (zoned)
+			ret = run_delalloc_zoned(inode, locked_page, start, end,
+						 page_started, nr_written);
+		else
+			ret = cow_file_range(inode, locked_page, start, end,
+					     page_started, nr_written, 1);
 	} else {
 		set_bit(BTRFS_INODE_HAS_ASYNC_EXTENT, &inode->runtime_flags);
 		ret = cow_file_range_async(inode, wbc, locked_page, start, end,
-- 
2.30.0


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH v15 29/42] btrfs: zoned: serialize metadata IO
  2021-02-04 10:21 ` [PATCH v15 01/42] block: add bio_add_zone_append_page Naohiro Aota
                     ` (26 preceding siblings ...)
  2021-02-04 10:22   ` [PATCH v15 28/42] btrfs: zoned: introduce dedicated data write path for zoned filesystems Naohiro Aota
@ 2021-02-04 10:22   ` Naohiro Aota
  2021-02-04 10:22   ` [PATCH v15 30/42] btrfs: zoned: wait for existing extents before truncating Naohiro Aota
                     ` (13 subsequent siblings)
  41 siblings, 0 replies; 72+ messages in thread
From: Naohiro Aota @ 2021-02-04 10:22 UTC (permalink / raw)
  To: linux-btrfs, dsterba; +Cc: hare, linux-fsdevel, Naohiro Aota, Josef Bacik

We cannot use zone append for writing metadata, because the B-tree nodes
have references to each other using logical addresses. Without knowing
the address in advance, we cannot construct the tree in the first place.
So we need to serialize write IOs for metadata.

We cannot add a mutex around allocation and submission because metadata
blocks are allocated in an earlier stage to build up B-trees.

Add a zoned_meta_io_lock and hold it during metadata IO submission in
btree_write_cache_pages() to serialize IOs.

Furthermore, this adds a per-block group metadata IO submission pointer
"meta_write_pointer" to ensure sequential writing, which can get broken
when writing back blocks in an unfinished transaction. If the write out
fails because of a hole and the write out is for data integrity
(WB_SYNC_ALL), it returns -EAGAIN.

A caller like fsync() code should handle this properly e.g. by falling
back to a full transaction commit.
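
For illustration, the per block group check done at submission time,
condensed from btrfs_check_meta_write_pointer() below:

  /* Only the extent buffer at the current write pointer may be submitted */
  if (cache->meta_write_pointer != eb->start)
          return false;  /* skip it, or -EAGAIN for WB_SYNC_ALL w/o for_sync */

  cache->meta_write_pointer = eb->start + eb->len;  /* advance the pointer */
  return true;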

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/block-group.h |  1 +
 fs/btrfs/ctree.h       |  1 +
 fs/btrfs/disk-io.c     |  1 +
 fs/btrfs/extent_io.c   | 25 ++++++++++++++++++++-
 fs/btrfs/zoned.c       | 50 ++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/zoned.h       | 32 +++++++++++++++++++++++++++
 6 files changed, 109 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h
index 31c7c5872b92..a07108d65c44 100644
--- a/fs/btrfs/block-group.h
+++ b/fs/btrfs/block-group.h
@@ -193,6 +193,7 @@ struct btrfs_block_group {
 	 */
 	u64 alloc_offset;
 	u64 zone_unusable;
+	u64 meta_write_pointer;
 };
 
 static inline u64 btrfs_block_group_end(struct btrfs_block_group *block_group)
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 10da47ab093a..1bb4f767966a 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -975,6 +975,7 @@ struct btrfs_fs_info {
 
 	/* Max size to emit ZONE_APPEND write command */
 	u64 max_zone_append_size;
+	struct mutex zoned_meta_io_lock;
 
 #ifdef CONFIG_BTRFS_FS_REF_VERIFY
 	spinlock_t ref_verify_lock;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 70621184a731..458bb27e0327 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2769,6 +2769,7 @@ void btrfs_init_fs_info(struct btrfs_fs_info *fs_info)
 	mutex_init(&fs_info->delete_unused_bgs_mutex);
 	mutex_init(&fs_info->reloc_mutex);
 	mutex_init(&fs_info->delalloc_root_mutex);
+	mutex_init(&fs_info->zoned_meta_io_lock);
 	seqlock_init(&fs_info->profiles_lock);
 
 	INIT_LIST_HEAD(&fs_info->dirty_cowonly_roots);
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 4c186a5f9efa..ac210cf0956b 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -26,6 +26,7 @@
 #include "disk-io.h"
 #include "subpage.h"
 #include "zoned.h"
+#include "block-group.h"
 
 static struct kmem_cache *extent_state_cache;
 static struct kmem_cache *extent_buffer_cache;
@@ -4162,6 +4163,7 @@ static int submit_eb_page(struct page *page, struct writeback_control *wbc,
 			  struct extent_buffer **eb_context)
 {
 	struct address_space *mapping = page->mapping;
+	struct btrfs_block_group *cache = NULL;
 	struct extent_buffer *eb;
 	int ret;
 
@@ -4194,13 +4196,31 @@ static int submit_eb_page(struct page *page, struct writeback_control *wbc,
 	if (!ret)
 		return 0;
 
+	if (!btrfs_check_meta_write_pointer(eb->fs_info, eb, &cache)) {
+		/*
+		 * If for_sync, this hole will be filled with a
+		 * transaction commit.
+		 */
+		if (wbc->sync_mode == WB_SYNC_ALL && !wbc->for_sync)
+			ret = -EAGAIN;
+		else
+			ret = 0;
+		free_extent_buffer(eb);
+		return ret;
+	}
+
 	*eb_context = eb;
 
 	ret = lock_extent_buffer_for_io(eb, epd);
 	if (ret <= 0) {
+		btrfs_revert_meta_write_pointer(cache, eb);
+		if (cache)
+			btrfs_put_block_group(cache);
 		free_extent_buffer(eb);
 		return ret;
 	}
+	if (cache)
+		btrfs_put_block_group(cache);
 	ret = write_one_eb(eb, wbc, epd);
 	free_extent_buffer(eb);
 	if (ret < 0)
@@ -4246,6 +4266,7 @@ int btree_write_cache_pages(struct address_space *mapping,
 		tag = PAGECACHE_TAG_TOWRITE;
 	else
 		tag = PAGECACHE_TAG_DIRTY;
+	btrfs_zoned_meta_io_lock(fs_info);
 retry:
 	if (wbc->sync_mode == WB_SYNC_ALL)
 		tag_pages_for_writeback(mapping, index, end);
@@ -4286,7 +4307,7 @@ int btree_write_cache_pages(struct address_space *mapping,
 	}
 	if (ret < 0) {
 		end_write_bio(&epd, ret);
-		return ret;
+		goto out;
 	}
 	/*
 	 * If something went wrong, don't allow any metadata write bio to be
@@ -4321,6 +4342,8 @@ int btree_write_cache_pages(struct address_space *mapping,
 		ret = -EROFS;
 		end_write_bio(&epd, ret);
 	}
+out:
+	btrfs_zoned_meta_io_unlock(fs_info);
 	return ret;
 }
 
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index 050aea447332..2803a3e5d022 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -1159,6 +1159,9 @@ int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache, bool new)
 		ret = -EIO;
 	}
 
+	if (!ret)
+		cache->meta_write_pointer = cache->alloc_offset + cache->start;
+
 	kfree(alloc_offsets);
 	free_extent_map(em);
 
@@ -1320,3 +1323,50 @@ void btrfs_rewrite_logical_zoned(struct btrfs_ordered_extent *ordered)
 	kfree(logical);
 	bdput(bdev);
 }
+
+bool btrfs_check_meta_write_pointer(struct btrfs_fs_info *fs_info,
+				    struct extent_buffer *eb,
+				    struct btrfs_block_group **cache_ret)
+{
+	struct btrfs_block_group *cache;
+	bool ret = true;
+
+	if (!btrfs_is_zoned(fs_info))
+		return true;
+
+	cache = *cache_ret;
+
+	if (cache && (eb->start < cache->start ||
+		      cache->start + cache->length <= eb->start)) {
+		btrfs_put_block_group(cache);
+		cache = NULL;
+		*cache_ret = NULL;
+	}
+
+	if (!cache)
+		cache = btrfs_lookup_block_group(fs_info, eb->start);
+
+	if (cache) {
+		if (cache->meta_write_pointer != eb->start) {
+			btrfs_put_block_group(cache);
+			cache = NULL;
+			ret = false;
+		} else {
+			cache->meta_write_pointer = eb->start + eb->len;
+		}
+
+		*cache_ret = cache;
+	}
+
+	return ret;
+}
+
+void btrfs_revert_meta_write_pointer(struct btrfs_block_group *cache,
+				     struct extent_buffer *eb)
+{
+	if (!btrfs_is_zoned(eb->fs_info) || !cache)
+		return;
+
+	ASSERT(cache->meta_write_pointer == eb->start + eb->len);
+	cache->meta_write_pointer = eb->start;
+}
diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h
index 04f7b21652b6..0755a25d0f4c 100644
--- a/fs/btrfs/zoned.h
+++ b/fs/btrfs/zoned.h
@@ -50,6 +50,11 @@ bool btrfs_use_zone_append(struct btrfs_inode *inode, struct extent_map *em);
 void btrfs_record_physical_zoned(struct inode *inode, u64 file_offset,
 				 struct bio *bio);
 void btrfs_rewrite_logical_zoned(struct btrfs_ordered_extent *ordered);
+bool btrfs_check_meta_write_pointer(struct btrfs_fs_info *fs_info,
+				    struct extent_buffer *eb,
+				    struct btrfs_block_group **cache_ret);
+void btrfs_revert_meta_write_pointer(struct btrfs_block_group *cache,
+				     struct extent_buffer *eb);
 #else /* CONFIG_BLK_DEV_ZONED */
 static inline int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
 				     struct blk_zone *zone)
@@ -151,6 +156,19 @@ static inline void btrfs_record_physical_zoned(struct inode *inode,
 static inline void btrfs_rewrite_logical_zoned(
 				struct btrfs_ordered_extent *ordered) { }
 
+static inline bool btrfs_check_meta_write_pointer(struct btrfs_fs_info *fs_info,
+			       struct extent_buffer *eb,
+			       struct btrfs_block_group **cache_ret)
+{
+	return true;
+}
+
+static inline void btrfs_revert_meta_write_pointer(
+						struct btrfs_block_group *cache,
+						struct extent_buffer *eb)
+{
+}
+
 #endif
 
 static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)
@@ -242,4 +260,18 @@ static inline bool btrfs_can_zone_reset(struct btrfs_device *device,
 	return true;
 }
 
+static inline void btrfs_zoned_meta_io_lock(struct btrfs_fs_info *fs_info)
+{
+	if (!btrfs_is_zoned(fs_info))
+		return;
+	mutex_lock(&fs_info->zoned_meta_io_lock);
+}
+
+static inline void btrfs_zoned_meta_io_unlock(struct btrfs_fs_info *fs_info)
+{
+	if (!btrfs_is_zoned(fs_info))
+		return;
+	mutex_unlock(&fs_info->zoned_meta_io_lock);
+}
+
 #endif
-- 
2.30.0


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH v15 30/42] btrfs: zoned: wait for existing extents before truncating
  2021-02-04 10:21 ` [PATCH v15 01/42] block: add bio_add_zone_append_page Naohiro Aota
                     ` (27 preceding siblings ...)
  2021-02-04 10:22   ` [PATCH v15 29/42] btrfs: zoned: serialize metadata IO Naohiro Aota
@ 2021-02-04 10:22   ` Naohiro Aota
  2021-02-04 10:22   ` [PATCH v15 31/42] btrfs: zoned: do not use async metadata checksum on zoned filesystems Naohiro Aota
                     ` (12 subsequent siblings)
  41 siblings, 0 replies; 72+ messages in thread
From: Naohiro Aota @ 2021-02-04 10:22 UTC (permalink / raw)
  To: linux-btrfs, dsterba; +Cc: hare, linux-fsdevel, Naohiro Aota, Josef Bacik

When truncating a file, file buffers which have already been allocated but
not yet written may be truncated. Truncating these buffers could break the
sequential write pattern in a block group if the truncated blocks are, for
example, followed by blocks allocated to another file. To avoid this
problem, always wait for the write out of all unwritten buffers before
proceeding with the truncate execution.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/inode.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index c4779cde83c6..535abf898225 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -5169,6 +5169,15 @@ static int btrfs_setsize(struct inode *inode, struct iattr *attr)
 		btrfs_drew_write_unlock(&root->snapshot_lock);
 		btrfs_end_transaction(trans);
 	} else {
+		struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+
+		if (btrfs_is_zoned(fs_info)) {
+			ret = btrfs_wait_ordered_range(inode,
+					ALIGN(newsize, fs_info->sectorsize),
+					(u64)-1);
+			if (ret)
+				return ret;
+		}
 
 		/*
 		 * We're truncating a file that used to have good data down to
-- 
2.30.0


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH v15 31/42] btrfs: zoned: do not use async metadata checksum on zoned filesystems
  2021-02-04 10:21 ` [PATCH v15 01/42] block: add bio_add_zone_append_page Naohiro Aota
                     ` (28 preceding siblings ...)
  2021-02-04 10:22   ` [PATCH v15 30/42] btrfs: zoned: wait for existing extents before truncating Naohiro Aota
@ 2021-02-04 10:22   ` Naohiro Aota
  2021-02-04 10:22   ` [PATCH v15 32/42] btrfs: zoned: mark block groups to copy for device-replace Naohiro Aota
                     ` (11 subsequent siblings)
  41 siblings, 0 replies; 72+ messages in thread
From: Naohiro Aota @ 2021-02-04 10:22 UTC (permalink / raw)
  To: linux-btrfs, dsterba; +Cc: hare, linux-fsdevel, Naohiro Aota, Josef Bacik

On zoned filesystems, btrfs uses per-FS zoned_meta_io_lock to serialize
the metadata write IOs.

Even with this serialization, write bios sent from btree_write_cache_pages
can be reordered by async checksum workers as these workers are per CPU and
not per zone.

To preserve write BIO ordering, we disable async metadata checksum on a
zoned filesystem. This does not result in lower performance with HDDs as a
single CPU core is fast enough to do checksum for a single zone write
stream with the maximum possible bandwidth of the device. If multiple zones
are being written simultaneously, HDD seek overhead lowers the achievable
maximum bandwidth, resulting again in a per zone checksum serialization not
affecting the performance.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/disk-io.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 458bb27e0327..6e16f556ed75 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -871,6 +871,8 @@ static blk_status_t btree_submit_bio_start(struct inode *inode, struct bio *bio,
 static int check_async_write(struct btrfs_fs_info *fs_info,
 			     struct btrfs_inode *bi)
 {
+	if (btrfs_is_zoned(fs_info))
+		return 0;
 	if (atomic_read(&bi->sync_writers))
 		return 0;
 	if (test_bit(BTRFS_FS_CSUM_IMPL_FAST, &fs_info->flags))
-- 
2.30.0


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH v15 32/42] btrfs: zoned: mark block groups to copy for device-replace
  2021-02-04 10:21 ` [PATCH v15 01/42] block: add bio_add_zone_append_page Naohiro Aota
                     ` (29 preceding siblings ...)
  2021-02-04 10:22   ` [PATCH v15 31/42] btrfs: zoned: do not use async metadata checksum on zoned filesystems Naohiro Aota
@ 2021-02-04 10:22   ` Naohiro Aota
  2021-02-04 10:22   ` [PATCH v15 33/42] btrfs: zoned: implement cloning for zoned device-replace Naohiro Aota
                     ` (10 subsequent siblings)
  41 siblings, 0 replies; 72+ messages in thread
From: Naohiro Aota @ 2021-02-04 10:22 UTC (permalink / raw)
  To: linux-btrfs, dsterba; +Cc: hare, linux-fsdevel, Naohiro Aota, Josef Bacik

This is the 1/4 patch to support device-replace on zoned filesystems.

We have two types of I/Os during the device-replace process. One is an I/O
to "copy" (by the scrub functions) all the device extents from the source
device to the destination device. The other is an I/O to "clone" (by
handle_ops_on_dev_replace()) new incoming write I/Os from users, directed
at the source device, into the target device.

Cloning incoming I/Os can break the sequential write rule on the target
device. When a write is mapped in the middle of a block group, the I/O is
directed to the middle of a target device zone, which breaks the
sequential write requirement.

However, the cloning function cannot be disabled since incoming I/Os
targeting already copied device extents must be cloned so that the I/O is
executed on the target device.

We cannot use dev_replace->cursor_{left,right} to determine whether a bio
is going to a not yet copied region. Since we have a time gap between
finishing btrfs_scrub_dev() and rewriting the mapping tree in
btrfs_dev_replace_finishing(), we can have a newly allocated device extent
which is never cloned nor copied.

So the point is to copy only already existing device extents. This patch
introduces mark_block_group_to_copy() to mark existing block groups as a
target of copying. Then, handle_ops_on_dev_replace() and dev-replace can
check the flag to do their job.

Also, btrfs_finish_block_group_to_copy() will check if the copied stripe
is the last stripe in the block group. With the last stripe copied,
the to_copy flag is finally disabled. Afterwards we can safely clone
incoming IOs on this block group.
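
For illustration, the lifecycle of the flag condensed from the hunks below
(the last-stripe bookkeeping is omitted):

  /* dev-replace start: mark every existing block group on the source */
  spin_lock(&cache->lock);
  cache->to_copy = 1;
  spin_unlock(&cache->lock);

  /* after the last device extent of this group has been copied */
  spin_lock(&cache->lock);
  cache->to_copy = 0;
  spin_unlock(&cache->lock);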

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/block-group.h |   1 +
 fs/btrfs/dev-replace.c | 184 +++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/dev-replace.h |   3 +
 fs/btrfs/scrub.c       |  16 ++++
 4 files changed, 204 insertions(+)

diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h
index a07108d65c44..d37ee576ac6e 100644
--- a/fs/btrfs/block-group.h
+++ b/fs/btrfs/block-group.h
@@ -95,6 +95,7 @@ struct btrfs_block_group {
 	unsigned int iref:1;
 	unsigned int has_caching_ctl:1;
 	unsigned int removed:1;
+	unsigned int to_copy:1;
 
 	int disk_cache_state;
 
diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index bc73f798ce3a..3a9c1e046ebe 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -22,6 +22,7 @@
 #include "dev-replace.h"
 #include "sysfs.h"
 #include "zoned.h"
+#include "block-group.h"
 
 /*
  * Device replace overview
@@ -459,6 +460,185 @@ static char* btrfs_dev_name(struct btrfs_device *device)
 		return rcu_str_deref(device->name);
 }
 
+static int mark_block_group_to_copy(struct btrfs_fs_info *fs_info,
+				    struct btrfs_device *src_dev)
+{
+	struct btrfs_path *path;
+	struct btrfs_key key;
+	struct btrfs_key found_key;
+	struct btrfs_root *root = fs_info->dev_root;
+	struct btrfs_dev_extent *dev_extent = NULL;
+	struct btrfs_block_group *cache;
+	struct btrfs_trans_handle *trans;
+	int ret = 0;
+	u64 chunk_offset;
+
+	/* Do not use "to_copy" on non zoned filesystem for now */
+	if (!btrfs_is_zoned(fs_info))
+		return 0;
+
+	mutex_lock(&fs_info->chunk_mutex);
+
+	/* Ensure we don't have pending new block group */
+	spin_lock(&fs_info->trans_lock);
+	while (fs_info->running_transaction &&
+	       !list_empty(&fs_info->running_transaction->dev_update_list)) {
+		spin_unlock(&fs_info->trans_lock);
+		mutex_unlock(&fs_info->chunk_mutex);
+		trans = btrfs_attach_transaction(root);
+		if (IS_ERR(trans)) {
+			ret = PTR_ERR(trans);
+			mutex_lock(&fs_info->chunk_mutex);
+			if (ret == -ENOENT) {
+				spin_lock(&fs_info->trans_lock);
+				continue;
+			} else {
+				goto unlock;
+			}
+		}
+
+		ret = btrfs_commit_transaction(trans);
+		mutex_lock(&fs_info->chunk_mutex);
+		if (ret)
+			goto unlock;
+
+		spin_lock(&fs_info->trans_lock);
+	}
+	spin_unlock(&fs_info->trans_lock);
+
+	path = btrfs_alloc_path();
+	if (!path) {
+		ret = -ENOMEM;
+		goto unlock;
+	}
+
+	path->reada = READA_FORWARD;
+	path->search_commit_root = 1;
+	path->skip_locking = 1;
+
+	key.objectid = src_dev->devid;
+	key.type = BTRFS_DEV_EXTENT_KEY;
+	key.offset = 0;
+
+	ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
+	if (ret < 0)
+		goto free_path;
+	if (ret > 0) {
+		if (path->slots[0] >=
+		    btrfs_header_nritems(path->nodes[0])) {
+			ret = btrfs_next_leaf(root, path);
+			if (ret < 0)
+				goto free_path;
+			if (ret > 0) {
+				ret = 0;
+				goto free_path;
+			}
+		} else {
+			ret = 0;
+		}
+	}
+
+	while (1) {
+		struct extent_buffer *leaf = path->nodes[0];
+		int slot = path->slots[0];
+
+		btrfs_item_key_to_cpu(leaf, &found_key, slot);
+
+		if (found_key.objectid != src_dev->devid)
+			break;
+
+		if (found_key.type != BTRFS_DEV_EXTENT_KEY)
+			break;
+
+		if (found_key.offset < key.offset)
+			break;
+
+		dev_extent = btrfs_item_ptr(leaf, slot, struct btrfs_dev_extent);
+
+		chunk_offset = btrfs_dev_extent_chunk_offset(leaf, dev_extent);
+
+		cache = btrfs_lookup_block_group(fs_info, chunk_offset);
+		if (!cache)
+			goto skip;
+
+		spin_lock(&cache->lock);
+		cache->to_copy = 1;
+		spin_unlock(&cache->lock);
+
+		btrfs_put_block_group(cache);
+
+skip:
+		ret = btrfs_next_item(root, path);
+		if (ret != 0) {
+			if (ret > 0)
+				ret = 0;
+			break;
+		}
+	}
+
+free_path:
+	btrfs_free_path(path);
+unlock:
+	mutex_unlock(&fs_info->chunk_mutex);
+
+	return ret;
+}
+
+bool btrfs_finish_block_group_to_copy(struct btrfs_device *srcdev,
+				      struct btrfs_block_group *cache,
+				      u64 physical)
+{
+	struct btrfs_fs_info *fs_info = cache->fs_info;
+	struct extent_map *em;
+	struct map_lookup *map;
+	u64 chunk_offset = cache->start;
+	int num_extents, cur_extent;
+	int i;
+
+	/* Do not use "to_copy" on non zoned filesystem for now */
+	if (!btrfs_is_zoned(fs_info))
+		return true;
+
+	spin_lock(&cache->lock);
+	if (cache->removed) {
+		spin_unlock(&cache->lock);
+		return true;
+	}
+	spin_unlock(&cache->lock);
+
+	em = btrfs_get_chunk_map(fs_info, chunk_offset, 1);
+	ASSERT(!IS_ERR(em));
+	map = em->map_lookup;
+
+	num_extents = cur_extent = 0;
+	for (i = 0; i < map->num_stripes; i++) {
+		/* We have more device extent to copy */
+		if (srcdev != map->stripes[i].dev)
+			continue;
+
+		num_extents++;
+		if (physical == map->stripes[i].physical)
+			cur_extent = i;
+	}
+
+	free_extent_map(em);
+
+	if (num_extents > 1 && cur_extent < num_extents - 1) {
+		/*
+		 * Has more stripes on this device. Keep this block group
+		 * readonly until we finish all the stripes.
+		 */
+		return false;
+	}
+
+	/* Last stripe on this device */
+	spin_lock(&cache->lock);
+	cache->to_copy = 0;
+	spin_unlock(&cache->lock);
+
+	return true;
+}
+
 static int btrfs_dev_replace_start(struct btrfs_fs_info *fs_info,
 		const char *tgtdev_name, u64 srcdevid, const char *srcdev_name,
 		int read_src)
@@ -500,6 +680,10 @@ static int btrfs_dev_replace_start(struct btrfs_fs_info *fs_info,
 	if (ret)
 		return ret;
 
+	ret = mark_block_group_to_copy(fs_info, src_device);
+	if (ret)
+		return ret;
+
 	down_write(&dev_replace->rwsem);
 	switch (dev_replace->replace_state) {
 	case BTRFS_IOCTL_DEV_REPLACE_STATE_NEVER_STARTED:
diff --git a/fs/btrfs/dev-replace.h b/fs/btrfs/dev-replace.h
index 60b70dacc299..3911049a5f23 100644
--- a/fs/btrfs/dev-replace.h
+++ b/fs/btrfs/dev-replace.h
@@ -18,5 +18,8 @@ int btrfs_dev_replace_cancel(struct btrfs_fs_info *fs_info);
 void btrfs_dev_replace_suspend_for_unmount(struct btrfs_fs_info *fs_info);
 int btrfs_resume_dev_replace_async(struct btrfs_fs_info *fs_info);
 int __pure btrfs_dev_replace_is_ongoing(struct btrfs_dev_replace *dev_replace);
+bool btrfs_finish_block_group_to_copy(struct btrfs_device *srcdev,
+				      struct btrfs_block_group *cache,
+				      u64 physical);
 
 #endif
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 5f4f88a4d2c8..da4f9c24e42d 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -3561,6 +3561,16 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
 		if (!cache)
 			goto skip;
 
+		if (sctx->is_dev_replace && btrfs_is_zoned(fs_info)) {
+			spin_lock(&cache->lock);
+			if (!cache->to_copy) {
+				spin_unlock(&cache->lock);
+				ro_set = 0;
+				goto done;
+			}
+			spin_unlock(&cache->lock);
+		}
+
 		/*
 		 * Make sure that while we are scrubbing the corresponding block
 		 * group doesn't get its logical address and its device extents
@@ -3692,6 +3702,12 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
 
 		scrub_pause_off(fs_info);
 
+		if (sctx->is_dev_replace &&
+		    !btrfs_finish_block_group_to_copy(dev_replace->srcdev,
+						      cache, found_key.offset))
+			ro_set = 0;
+
+done:
 		down_write(&dev_replace->rwsem);
 		dev_replace->cursor_left = dev_replace->cursor_right;
 		dev_replace->item_needs_writeback = 1;
-- 
2.30.0


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH v15 33/42] btrfs: zoned: implement cloning for zoned device-replace
  2021-02-04 10:21 ` [PATCH v15 01/42] block: add bio_add_zone_append_page Naohiro Aota
                     ` (30 preceding siblings ...)
  2021-02-04 10:22   ` [PATCH v15 32/42] btrfs: zoned: mark block groups to copy for device-replace Naohiro Aota
@ 2021-02-04 10:22   ` Naohiro Aota
  2021-02-04 10:22   ` [PATCH v15 34/42] btrfs: zoned: implement copying " Naohiro Aota
                     ` (9 subsequent siblings)
  41 siblings, 0 replies; 72+ messages in thread
From: Naohiro Aota @ 2021-02-04 10:22 UTC (permalink / raw)
  To: linux-btrfs, dsterba; +Cc: hare, linux-fsdevel, Naohiro Aota, Josef Bacik

This is the 2/4 patch to implement device-replace for zoned filesystems.

In zoned mode, a block group must be either copied (from the source device
to the destination device) or cloned (to both devices).

This commit implements the cloning part. If a block group targeted by an IO
is marked to copy, we should not clone the IO to the destination device,
because the block group is eventually copied by the replace process.

This commit also handles cloning of device zone resets.
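
For illustration, the write cloning decision condensed from the
handle_ops_on_dev_replace() change below:

  if (op == BTRFS_MAP_WRITE &&
      is_block_group_to_copy(dev_replace->srcdev->fs_info, logical))
          return;  /* the copy side of dev-replace will handle this range */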

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/extent-tree.c | 57 +++++++++++++++++++++++++++++++-----------
 fs/btrfs/volumes.c     | 31 +++++++++++++++++++++--
 fs/btrfs/zoned.c       |  9 +++++++
 3 files changed, 80 insertions(+), 17 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index a717366c9823..e2b2abc42295 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -35,6 +35,7 @@
 #include "discard.h"
 #include "rcu-string.h"
 #include "zoned.h"
+#include "dev-replace.h"
 
 #undef SCRAMBLE_DELAYED_REFS
 
@@ -1265,6 +1266,46 @@ static int btrfs_issue_discard(struct block_device *bdev, u64 start, u64 len,
 	return ret;
 }
 
+static int do_discard_extent(struct btrfs_bio_stripe *stripe, u64 *bytes)
+{
+	struct btrfs_device *dev = stripe->dev;
+	struct btrfs_fs_info *fs_info = dev->fs_info;
+	struct btrfs_dev_replace *dev_replace = &fs_info->dev_replace;
+	u64 phys = stripe->physical;
+	u64 len = stripe->length;
+	u64 discarded = 0;
+	int ret = 0;
+
+	/* Zone reset on a zoned filesystem */
+	if (btrfs_can_zone_reset(dev, phys, len)) {
+		u64 src_disc;
+
+		ret = btrfs_reset_device_zone(dev, phys, len, &discarded);
+		if (ret)
+			goto out;
+
+		if (!btrfs_dev_replace_is_ongoing(dev_replace) ||
+		    dev != dev_replace->srcdev)
+			goto out;
+
+		src_disc = discarded;
+
+		/* Send to replace target as well */
+		ret = btrfs_reset_device_zone(dev_replace->tgtdev, phys, len,
+					      &discarded);
+		discarded += src_disc;
+	} else if (blk_queue_discard(bdev_get_queue(stripe->dev->bdev))) {
+		ret = btrfs_issue_discard(dev->bdev, phys, len, &discarded);
+	} else {
+		ret = 0;
+		*bytes = 0;
+	}
+
+out:
+	*bytes = discarded;
+	return ret;
+}
+
 int btrfs_discard_extent(struct btrfs_fs_info *fs_info, u64 bytenr,
 			 u64 num_bytes, u64 *actual_bytes)
 {
@@ -1298,28 +1339,14 @@ int btrfs_discard_extent(struct btrfs_fs_info *fs_info, u64 bytenr,
 
 		stripe = bbio->stripes;
 		for (i = 0; i < bbio->num_stripes; i++, stripe++) {
-			struct btrfs_device *dev = stripe->dev;
-			u64 physical = stripe->physical;
-			u64 length = stripe->length;
 			u64 bytes;
-			struct request_queue *req_q;
 
 			if (!stripe->dev->bdev) {
 				ASSERT(btrfs_test_opt(fs_info, DEGRADED));
 				continue;
 			}
 
-			req_q = bdev_get_queue(stripe->dev->bdev);
-			/* Zone reset on zoned filesystems */
-			if (btrfs_can_zone_reset(dev, physical, length))
-				ret = btrfs_reset_device_zone(dev, physical,
-							      length, &bytes);
-			else if (blk_queue_discard(req_q))
-				ret = btrfs_issue_discard(dev->bdev, physical,
-							  length, &bytes);
-			else
-				continue;
-
+			ret = do_discard_extent(stripe, &bytes);
 			if (!ret) {
 				discarded_bytes += bytes;
 			} else if (ret != -EOPNOTSUPP) {
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index a4d47c6050f7..52ec6721ada2 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5973,9 +5973,29 @@ static int get_extra_mirror_from_replace(struct btrfs_fs_info *fs_info,
 	return ret;
 }
 
+static bool is_block_group_to_copy(struct btrfs_fs_info *fs_info, u64 logical)
+{
+	struct btrfs_block_group *cache;
+	bool ret;
+
+	/* Non-ZONED mode does not use "to_copy" flag */
+	if (!btrfs_is_zoned(fs_info))
+		return false;
+
+	cache = btrfs_lookup_block_group(fs_info, logical);
+
+	spin_lock(&cache->lock);
+	ret = cache->to_copy;
+	spin_unlock(&cache->lock);
+
+	btrfs_put_block_group(cache);
+	return ret;
+}
+
 static void handle_ops_on_dev_replace(enum btrfs_map_op op,
 				      struct btrfs_bio **bbio_ret,
 				      struct btrfs_dev_replace *dev_replace,
+				      u64 logical,
 				      int *num_stripes_ret, int *max_errors_ret)
 {
 	struct btrfs_bio *bbio = *bbio_ret;
@@ -5988,6 +6008,13 @@ static void handle_ops_on_dev_replace(enum btrfs_map_op op,
 	if (op == BTRFS_MAP_WRITE) {
 		int index_where_to_add;
 
+		/*
+		 * A block group which has "to_copy" set will eventually be
+		 * copied by the dev-replace process. We can avoid cloning IO here.
+		 */
+		if (is_block_group_to_copy(dev_replace->srcdev->fs_info, logical))
+			return;
+
 		/*
 		 * duplicate the write operations while the dev replace
 		 * procedure is running. Since the copying of the old disk to
@@ -6376,8 +6403,8 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info,
 
 	if (dev_replace_is_ongoing && dev_replace->tgtdev != NULL &&
 	    need_full_stripe(op)) {
-		handle_ops_on_dev_replace(op, &bbio, dev_replace, &num_stripes,
-					  &max_errors);
+		handle_ops_on_dev_replace(op, &bbio, dev_replace, logical,
+					  &num_stripes, &max_errors);
 	}
 
 	*bbio_ret = bbio;
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index 2803a3e5d022..72d9c8ba98a3 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -11,6 +11,7 @@
 #include "disk-io.h"
 #include "block-group.h"
 #include "transaction.h"
+#include "dev-replace.h"
 
 /* Maximum number of zones to report per blkdev_report_zones() call */
 #define BTRFS_REPORT_NR_ZONES   4096
@@ -1036,6 +1037,8 @@ int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache, bool new)
 	for (i = 0; i < map->num_stripes; i++) {
 		bool is_sequential;
 		struct blk_zone zone;
+		struct btrfs_dev_replace *dev_replace = &fs_info->dev_replace;
+		int dev_replace_is_ongoing = 0;
 
 		device = map->stripes[i].dev;
 		physical = map->stripes[i].physical;
@@ -1062,6 +1065,12 @@ int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache, bool new)
 		 */
 		btrfs_dev_clear_zone_empty(device, physical);
 
+		down_read(&dev_replace->rwsem);
+		dev_replace_is_ongoing = btrfs_dev_replace_is_ongoing(dev_replace);
+		if (dev_replace_is_ongoing && dev_replace->tgtdev != NULL)
+			btrfs_dev_clear_zone_empty(dev_replace->tgtdev, physical);
+		up_read(&dev_replace->rwsem);
+
 		/*
 		 * The group is mapped to a sequential zone. Get the zone write
 		 * pointer to determine the allocation offset within the zone.
-- 
2.30.0


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH v15 34/42] btrfs: zoned: implement copying for zoned device-replace
  2021-02-04 10:21 ` [PATCH v15 01/42] block: add bio_add_zone_append_page Naohiro Aota
                     ` (31 preceding siblings ...)
  2021-02-04 10:22   ` [PATCH v15 33/42] btrfs: zoned: implement cloning for zoned device-replace Naohiro Aota
@ 2021-02-04 10:22   ` Naohiro Aota
  2021-02-04 10:22   ` [PATCH v15 35/42] btrfs: zoned: support dev-replace in zoned filesystems Naohiro Aota
                     ` (8 subsequent siblings)
  41 siblings, 0 replies; 72+ messages in thread
From: Naohiro Aota @ 2021-02-04 10:22 UTC (permalink / raw)
  To: linux-btrfs, dsterba; +Cc: hare, linux-fsdevel, Naohiro Aota, Josef Bacik

This is the 3/4 patch to implement device-replace on zoned filesystems.

This commit implements copying. To do this, it tracks the write pointer
during the device replace process. As device-replace's copy process is
smart enough to only copy used extents on the source device, we have to
fill the gap to honor the sequential write requirement on the target
device.
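
For illustration, the gap filling condensed from fill_writer_pointer_gap()
below ("gap" is just a local name here):

  if (sctx->write_pointer < physical) {
          u64 gap = physical - sctx->write_pointer;

          /* zero-fill up to the next copied extent to stay sequential */
          ret = btrfs_zoned_issue_zeroout(sctx->wr_tgtdev,
                                          sctx->write_pointer, gap);
          if (!ret)
                  sctx->write_pointer = physical;
  }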

The device-replace process on zoned filesystems must copy or clone all
the extents in the source device exactly once. So, we need to ensure that
allocations started just before the dev-replace process have their
corresponding extent information in the B-trees.
finish_extent_writes_for_zoned() implements that functionality, which is
basically the code removed in commit 042528f8d840 ("Btrfs: fix block group
remaining RO forever after error during device replace").

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/scrub.c   | 84 ++++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/volumes.c |  2 +-
 fs/btrfs/zoned.c   |  9 +++++
 fs/btrfs/zoned.h   |  7 ++++
 4 files changed, 101 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index da4f9c24e42d..92904902d160 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -166,6 +166,7 @@ struct scrub_ctx {
 	int			pages_per_rd_bio;
 
 	int			is_dev_replace;
+	u64			write_pointer;
 
 	struct scrub_bio        *wr_curr_bio;
 	struct mutex            wr_lock;
@@ -1619,6 +1620,25 @@ static int scrub_write_page_to_dev_replace(struct scrub_block *sblock,
 	return scrub_add_page_to_wr_bio(sblock->sctx, spage);
 }
 
+static int fill_writer_pointer_gap(struct scrub_ctx *sctx, u64 physical)
+{
+	int ret = 0;
+	u64 length;
+
+	if (!btrfs_is_zoned(sctx->fs_info))
+		return 0;
+
+	if (sctx->write_pointer < physical) {
+		length = physical - sctx->write_pointer;
+
+		ret = btrfs_zoned_issue_zeroout(sctx->wr_tgtdev,
+						sctx->write_pointer, length);
+		if (!ret)
+			sctx->write_pointer = physical;
+	}
+	return ret;
+}
+
 static int scrub_add_page_to_wr_bio(struct scrub_ctx *sctx,
 				    struct scrub_page *spage)
 {
@@ -1641,6 +1661,13 @@ static int scrub_add_page_to_wr_bio(struct scrub_ctx *sctx,
 	if (sbio->page_count == 0) {
 		struct bio *bio;
 
+		ret = fill_writer_pointer_gap(sctx,
+					      spage->physical_for_dev_replace);
+		if (ret) {
+			mutex_unlock(&sctx->wr_lock);
+			return ret;
+		}
+
 		sbio->physical = spage->physical_for_dev_replace;
 		sbio->logical = spage->logical;
 		sbio->dev = sctx->wr_tgtdev;
@@ -1702,6 +1729,9 @@ static void scrub_wr_submit(struct scrub_ctx *sctx)
 	 * doubled the write performance on spinning disks when measured
 	 * with Linux 3.5 */
 	btrfsic_submit_bio(sbio->bio);
+
+	if (btrfs_is_zoned(sctx->fs_info))
+		sctx->write_pointer = sbio->physical + sbio->page_count * PAGE_SIZE;
 }
 
 static void scrub_wr_bio_end_io(struct bio *bio)
@@ -3025,6 +3055,20 @@ static noinline_for_stack int scrub_raid56_parity(struct scrub_ctx *sctx,
 	return ret < 0 ? ret : 0;
 }
 
+static void sync_replace_for_zoned(struct scrub_ctx *sctx)
+{
+	if (!btrfs_is_zoned(sctx->fs_info))
+		return;
+
+	sctx->flush_all_writes = true;
+	scrub_submit(sctx);
+	mutex_lock(&sctx->wr_lock);
+	scrub_wr_submit(sctx);
+	mutex_unlock(&sctx->wr_lock);
+
+	wait_event(sctx->list_wait, atomic_read(&sctx->bios_in_flight) == 0);
+}
+
 static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx,
 					   struct map_lookup *map,
 					   struct btrfs_device *scrub_dev,
@@ -3165,6 +3209,14 @@ static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx,
 	 */
 	blk_start_plug(&plug);
 
+	if (sctx->is_dev_replace &&
+	    btrfs_dev_is_sequential(sctx->wr_tgtdev, physical)) {
+		mutex_lock(&sctx->wr_lock);
+		sctx->write_pointer = physical;
+		mutex_unlock(&sctx->wr_lock);
+		sctx->flush_all_writes = true;
+	}
+
 	/*
 	 * now find all extents for each stripe and scrub them
 	 */
@@ -3353,6 +3405,9 @@ static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx,
 			if (ret)
 				goto out;
 
+			if (sctx->is_dev_replace)
+				sync_replace_for_zoned(sctx);
+
 			if (extent_logical + extent_len <
 			    key.objectid + bytes) {
 				if (map->type & BTRFS_BLOCK_GROUP_RAID56_MASK) {
@@ -3475,6 +3530,25 @@ static noinline_for_stack int scrub_chunk(struct scrub_ctx *sctx,
 	return ret;
 }
 
+static int finish_extent_writes_for_zoned(struct btrfs_root *root,
+					  struct btrfs_block_group *cache)
+{
+	struct btrfs_fs_info *fs_info = cache->fs_info;
+	struct btrfs_trans_handle *trans;
+
+	if (!btrfs_is_zoned(fs_info))
+		return 0;
+
+	btrfs_wait_block_group_reservations(cache);
+	btrfs_wait_nocow_writers(cache);
+	btrfs_wait_ordered_roots(fs_info, U64_MAX, cache->start, cache->length);
+
+	trans = btrfs_join_transaction(root);
+	if (IS_ERR(trans))
+		return PTR_ERR(trans);
+	return btrfs_commit_transaction(trans);
+}
+
 static noinline_for_stack
 int scrub_enumerate_chunks(struct scrub_ctx *sctx,
 			   struct btrfs_device *scrub_dev, u64 start, u64 end)
@@ -3629,6 +3703,16 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
 		 * group is not RO.
 		 */
 		ret = btrfs_inc_block_group_ro(cache, sctx->is_dev_replace);
+		if (!ret && sctx->is_dev_replace) {
+			ret = finish_extent_writes_for_zoned(root, cache);
+			if (ret) {
+				btrfs_dec_block_group_ro(cache);
+				scrub_pause_off(fs_info);
+				btrfs_put_block_group(cache);
+				break;
+			}
+		}
+
 		if (ret == 0) {
 			ro_set = 1;
 		} else if (ret == -ENOSPC && !sctx->is_dev_replace) {
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 52ec6721ada2..1312b17a6b49 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5978,7 +5978,7 @@ static bool is_block_group_to_copy(struct btrfs_fs_info *fs_info, u64 logical)
 	struct btrfs_block_group *cache;
 	bool ret;
 
-	/* Non-ZONED mode does not use "to_copy" flag */
+	/* Non zoned filesystem does not use "to_copy" flag */
 	if (!btrfs_is_zoned(fs_info))
 		return false;
 
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index 72d9c8ba98a3..396723947934 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -1379,3 +1379,12 @@ void btrfs_revert_meta_write_pointer(struct btrfs_block_group *cache,
 	ASSERT(cache->meta_write_pointer == eb->start + eb->len);
 	cache->meta_write_pointer = eb->start;
 }
+
+int btrfs_zoned_issue_zeroout(struct btrfs_device *device, u64 physical, u64 length)
+{
+	if (!btrfs_dev_is_sequential(device, physical))
+		return -EOPNOTSUPP;
+
+	return blkdev_issue_zeroout(device->bdev, physical >> SECTOR_SHIFT,
+				    length >> SECTOR_SHIFT, GFP_NOFS, 0);
+}
diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h
index 0755a25d0f4c..5ed1ea2009ea 100644
--- a/fs/btrfs/zoned.h
+++ b/fs/btrfs/zoned.h
@@ -55,6 +55,7 @@ bool btrfs_check_meta_write_pointer(struct btrfs_fs_info *fs_info,
 				    struct btrfs_block_group **cache_ret);
 void btrfs_revert_meta_write_pointer(struct btrfs_block_group *cache,
 				     struct extent_buffer *eb);
+int btrfs_zoned_issue_zeroout(struct btrfs_device *device, u64 physical, u64 length);
 #else /* CONFIG_BLK_DEV_ZONED */
 static inline int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
 				     struct blk_zone *zone)
@@ -169,6 +170,12 @@ static inline void btrfs_revert_meta_write_pointer(
 {
 }
 
+static inline int btrfs_zoned_issue_zeroout(struct btrfs_device *device,
+					    u64 physical, u64 length)
+{
+	return -EOPNOTSUPP;
+}
+
 #endif
 
 static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)
-- 
2.30.0


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH v15 35/42] btrfs: zoned: support dev-replace in zoned filesystems
  2021-02-04 10:21 ` [PATCH v15 01/42] block: add bio_add_zone_append_page Naohiro Aota
                     ` (32 preceding siblings ...)
  2021-02-04 10:22   ` [PATCH v15 34/42] btrfs: zoned: implement copying " Naohiro Aota
@ 2021-02-04 10:22   ` Naohiro Aota
  2021-02-04 10:22   ` [PATCH v15 36/42] btrfs: zoned: enable relocation on a zoned filesystem Naohiro Aota
                     ` (7 subsequent siblings)
  41 siblings, 0 replies; 72+ messages in thread
From: Naohiro Aota @ 2021-02-04 10:22 UTC (permalink / raw)
  To: linux-btrfs, dsterba; +Cc: hare, linux-fsdevel, Naohiro Aota, Josef Bacik

This is the 4/4 patch to implement device-replace on zoned filesystems.

Even after the copying is done, the write pointers of the source device and
the destination device may not be synchronized. For example, when the last
allocated extent is freed before the device-replace process runs, the extent
is not copied, leaving a hole there.

Synchronize the write pointers by writing zeroes to the destination device.
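
A user-space sketch of that idea follows (the struct and helper below are
invented for the example; the real code is btrfs_sync_zone_write_pointer()
and btrfs_zoned_issue_zeroout() in the diff): the source zone's write
pointer tells us how far the destination must be advanced, and the gap is
filled with zeroes.

#include <stdio.h>
#include <stdint.h>

#define SECTOR_SHIFT 9

/* Toy model of a zone as reported by the source device. */
struct zone_model {
	uint64_t start;	/* zone start, in 512-byte sectors */
	uint64_t wp;	/* write pointer, in 512-byte sectors */
};

/*
 * Zero-fill from @physical_pos (how far the copy advanced the target) up
 * to the source zone's write pointer, so both devices end at the same
 * physical offset.
 */
static int sync_write_pointer(const struct zone_model *src_zone,
			      uint64_t physical_start, uint64_t physical_pos)
{
	uint64_t wp = physical_start +
		      ((src_zone->wp - src_zone->start) << SECTOR_SHIFT);

	if (physical_pos == wp)
		return 0;	/* already in sync */
	if (physical_pos > wp)
		return -1;	/* target ahead of source (-EUCLEAN in the patch) */

	printf("zero out %llu bytes at %llu\n",
	       (unsigned long long)(wp - physical_pos),
	       (unsigned long long)physical_pos);
	return 0;
}

int main(void)
{
	/* Source zone starts at sector 0 and has 4 MiB written. */
	struct zone_model zone = {
		.start = 0,
		.wp = (4ULL * 1024 * 1024) >> SECTOR_SHIFT,
	};

	/* Copying stopped at 3 MiB because the last extent was a hole. */
	return sync_write_pointer(&zone, 0, 3 * 1024 * 1024);
}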

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/scrub.c | 40 ++++++++++++++++++++++++++
 fs/btrfs/zoned.c | 74 ++++++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/zoned.h |  9 ++++++
 3 files changed, 123 insertions(+)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 92904902d160..e0c3ec01e324 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -1628,6 +1628,9 @@ static int fill_writer_pointer_gap(struct scrub_ctx *sctx, u64 physical)
 	if (!btrfs_is_zoned(sctx->fs_info))
 		return 0;
 
+	if (!btrfs_dev_is_sequential(sctx->wr_tgtdev, physical))
+		return 0;
+
 	if (sctx->write_pointer < physical) {
 		length = physical - sctx->write_pointer;
 
@@ -3069,6 +3072,32 @@ static void sync_replace_for_zoned(struct scrub_ctx *sctx)
 	wait_event(sctx->list_wait, atomic_read(&sctx->bios_in_flight) == 0);
 }
 
+static int sync_write_pointer_for_zoned(struct scrub_ctx *sctx, u64 logical,
+					u64 physical, u64 physical_end)
+{
+	struct btrfs_fs_info *fs_info = sctx->fs_info;
+	int ret = 0;
+
+	if (!btrfs_is_zoned(fs_info))
+		return 0;
+
+	wait_event(sctx->list_wait, atomic_read(&sctx->bios_in_flight) == 0);
+
+	mutex_lock(&sctx->wr_lock);
+	if (sctx->write_pointer < physical_end) {
+		ret = btrfs_sync_zone_write_pointer(sctx->wr_tgtdev, logical,
+						    physical,
+						    sctx->write_pointer);
+		if (ret)
+			btrfs_err(fs_info,
+				  "zoned: failed to recover write pointer");
+	}
+	mutex_unlock(&sctx->wr_lock);
+	btrfs_dev_clear_zone_empty(sctx->wr_tgtdev, physical);
+
+	return ret;
+}
+
 static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx,
 					   struct map_lookup *map,
 					   struct btrfs_device *scrub_dev,
@@ -3475,6 +3504,17 @@ static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx,
 	blk_finish_plug(&plug);
 	btrfs_free_path(path);
 	btrfs_free_path(ppath);
+
+	if (sctx->is_dev_replace && ret >= 0) {
+		int ret2;
+
+		ret2 = sync_write_pointer_for_zoned(sctx, base + offset,
+						    map->stripes[num].physical,
+						    physical_end);
+		if (ret2)
+			ret = ret2;
+	}
+
 	return ret < 0 ? ret : 0;
 }
 
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index 396723947934..148cbfc7f988 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -12,6 +12,7 @@
 #include "block-group.h"
 #include "transaction.h"
 #include "dev-replace.h"
+#include "space-info.h"
 
 /* Maximum number of zones to report per blkdev_report_zones() call */
 #define BTRFS_REPORT_NR_ZONES   4096
@@ -1388,3 +1389,76 @@ int btrfs_zoned_issue_zeroout(struct btrfs_device *device, u64 physical, u64 len
 	return blkdev_issue_zeroout(device->bdev, physical >> SECTOR_SHIFT,
 				    length >> SECTOR_SHIFT, GFP_NOFS, 0);
 }
+
+static int read_zone_info(struct btrfs_fs_info *fs_info, u64 logical,
+			  struct blk_zone *zone)
+{
+	struct btrfs_bio *bbio = NULL;
+	u64 mapped_length = PAGE_SIZE;
+	unsigned int nofs_flag;
+	int nmirrors;
+	int i, ret;
+
+	ret = btrfs_map_sblock(fs_info, BTRFS_MAP_GET_READ_MIRRORS, logical,
+			       &mapped_length, &bbio);
+	if (ret || !bbio || mapped_length < PAGE_SIZE) {
+		btrfs_put_bbio(bbio);
+		return -EIO;
+	}
+
+	if (bbio->map_type & BTRFS_BLOCK_GROUP_RAID56_MASK)
+		return -EINVAL;
+
+	nofs_flag = memalloc_nofs_save();
+	nmirrors = (int)bbio->num_stripes;
+	for (i = 0; i < nmirrors; i++) {
+		u64 physical = bbio->stripes[i].physical;
+		struct btrfs_device *dev = bbio->stripes[i].dev;
+
+		/* Missing device */
+		if (!dev->bdev)
+			continue;
+
+		ret = btrfs_get_dev_zone(dev, physical, zone);
+		/* Failing device */
+		if (ret == -EIO || ret == -EOPNOTSUPP)
+			continue;
+		break;
+	}
+	memalloc_nofs_restore(nofs_flag);
+
+	return ret;
+}
+
+/*
+ * Synchronize write pointer in a zone at @physical_start on @tgt_dev, by
+ * filling zeros between @physical_pos to a write pointer of dev-replace
+ * source device.
+ */
+int btrfs_sync_zone_write_pointer(struct btrfs_device *tgt_dev, u64 logical,
+				    u64 physical_start, u64 physical_pos)
+{
+	struct btrfs_fs_info *fs_info = tgt_dev->fs_info;
+	struct blk_zone zone;
+	u64 length;
+	u64 wp;
+	int ret;
+
+	if (!btrfs_dev_is_sequential(tgt_dev, physical_pos))
+		return 0;
+
+	ret = read_zone_info(fs_info, logical, &zone);
+	if (ret)
+		return ret;
+
+	wp = physical_start + ((zone.wp - zone.start) << SECTOR_SHIFT);
+
+	if (physical_pos == wp)
+		return 0;
+
+	if (physical_pos > wp)
+		return -EUCLEAN;
+
+	length = wp - physical_pos;
+	return btrfs_zoned_issue_zeroout(tgt_dev, physical_pos, length);
+}
diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h
index 5ed1ea2009ea..932ad9bc0de6 100644
--- a/fs/btrfs/zoned.h
+++ b/fs/btrfs/zoned.h
@@ -56,6 +56,8 @@ bool btrfs_check_meta_write_pointer(struct btrfs_fs_info *fs_info,
 void btrfs_revert_meta_write_pointer(struct btrfs_block_group *cache,
 				     struct extent_buffer *eb);
 int btrfs_zoned_issue_zeroout(struct btrfs_device *device, u64 physical, u64 length);
+int btrfs_sync_zone_write_pointer(struct btrfs_device *tgt_dev, u64 logical,
+				  u64 physical_start, u64 physical_pos);
 #else /* CONFIG_BLK_DEV_ZONED */
 static inline int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
 				     struct blk_zone *zone)
@@ -176,6 +178,13 @@ static inline int btrfs_zoned_issue_zeroout(struct btrfs_device *device,
 	return -EOPNOTSUPP;
 }
 
+static inline int btrfs_sync_zone_write_pointer(struct btrfs_device *tgt_dev,
+						u64 logical, u64 physical_start,
+						u64 physical_pos)
+{
+	return -EOPNOTSUPP;
+}
+
 #endif
 
 static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)
-- 
2.30.0


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH v15 36/42] btrfs: zoned: enable relocation on a zoned filesystem
  2021-02-04 10:21 ` [PATCH v15 01/42] block: add bio_add_zone_append_page Naohiro Aota
                     ` (33 preceding siblings ...)
  2021-02-04 10:22   ` [PATCH v15 35/42] btrfs: zoned: support dev-replace in zoned filesystems Naohiro Aota
@ 2021-02-04 10:22   ` Naohiro Aota
  2021-02-04 10:22   ` [PATCH v15 37/42] btrfs: zoned: relocate block group to repair IO failure in zoned filesystems Naohiro Aota
                     ` (6 subsequent siblings)
  41 siblings, 0 replies; 72+ messages in thread
From: Naohiro Aota @ 2021-02-04 10:22 UTC (permalink / raw)
  To: linux-btrfs, dsterba; +Cc: hare, linux-fsdevel, Naohiro Aota, Josef Bacik

Currently, fallocate() is disabled on a zoned filesystem. Since the current
relocation process relies on preallocation to move file data extents, it
must be handled differently.

On a zoned filesystem, we instead truncate the inode to the size that we
wanted to preallocate. Then, we flush dirty pages on the file before
finishing the relocation process. run_delalloc_zoned() will handle all the
allocations and submit IOs to the underlying layers.
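
A minimal user-space sketch of that design choice (everything below uses
invented names; in the patch the zoned branch lives in
prealloc_file_extent_cluster() and the final flush is the
btrfs_wait_ordered_range() call added to relocate_file_extent_cluster()):

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

struct toy_inode {
	uint64_t i_size;
	uint64_t preallocated;	/* bytes reserved up front */
};

/* Regular device: reserve the whole cluster before copying into it. */
static void prealloc_cluster(struct toy_inode *inode, uint64_t end)
{
	inode->preallocated = end;
	inode->i_size = end;
}

/* Zoned device: no preallocation; only extend i_size now and let the later
 * writeback (run_delalloc_zoned() in btrfs) allocate extents sequentially. */
static void extend_isize(struct toy_inode *inode, uint64_t end)
{
	inode->i_size = end;
}

static void finish_cluster(const struct toy_inode *inode, bool zoned)
{
	if (zoned)
		printf("flush dirty pages and wait for ordered extents\n");
	printf("i_size=%llu preallocated=%llu\n",
	       (unsigned long long)inode->i_size,
	       (unsigned long long)inode->preallocated);
}

int main(void)
{
	struct toy_inode ino = { 0 };
	bool zoned = true;
	uint64_t cluster_end = 16 * 1024 * 1024;

	if (zoned)
		extend_isize(&ino, cluster_end);
	else
		prealloc_cluster(&ino, cluster_end);
	finish_cluster(&ino, zoned);
	return 0;
}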

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/relocation.c | 34 ++++++++++++++++++++++++++++++++--
 1 file changed, 32 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 473b78874844..232d5da7b7be 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -2553,6 +2553,31 @@ static noinline_for_stack int prealloc_file_extent_cluster(
 	if (ret)
 		return ret;
 
+	/*
+	 * On a zoned filesystem, we cannot preallocate the file region.
+	 * Instead, we dirty and fiemap_write the region.
+	 */
+	if (btrfs_is_zoned(inode->root->fs_info)) {
+		struct btrfs_root *root = inode->root;
+		struct btrfs_trans_handle *trans;
+
+		end = cluster->end - offset + 1;
+		trans = btrfs_start_transaction(root, 1);
+		if (IS_ERR(trans))
+			return PTR_ERR(trans);
+
+		inode->vfs_inode.i_ctime = current_time(&inode->vfs_inode);
+		i_size_write(&inode->vfs_inode, end);
+		ret = btrfs_update_inode(trans, root, inode);
+		if (ret) {
+			btrfs_abort_transaction(trans, ret);
+			btrfs_end_transaction(trans);
+			return ret;
+		}
+
+		return btrfs_end_transaction(trans);
+	}
+
 	inode_lock(&inode->vfs_inode);
 	for (nr = 0; nr < cluster->nr; nr++) {
 		start = cluster->boundary[nr] - offset;
@@ -2756,6 +2781,8 @@ static int relocate_file_extent_cluster(struct inode *inode,
 		}
 	}
 	WARN_ON(nr != cluster->nr);
+	if (btrfs_is_zoned(fs_info) && !ret)
+		ret = btrfs_wait_ordered_range(inode, 0, (u64)-1);
 out:
 	kfree(ra);
 	return ret;
@@ -3434,8 +3461,12 @@ static int __insert_orphan_inode(struct btrfs_trans_handle *trans,
 	struct btrfs_path *path;
 	struct btrfs_inode_item *item;
 	struct extent_buffer *leaf;
+	u64 flags = BTRFS_INODE_NOCOMPRESS | BTRFS_INODE_PREALLOC;
 	int ret;
 
+	if (btrfs_is_zoned(trans->fs_info))
+		flags &= ~BTRFS_INODE_PREALLOC;
+
 	path = btrfs_alloc_path();
 	if (!path)
 		return -ENOMEM;
@@ -3450,8 +3481,7 @@ static int __insert_orphan_inode(struct btrfs_trans_handle *trans,
 	btrfs_set_inode_generation(leaf, item, 1);
 	btrfs_set_inode_size(leaf, item, 0);
 	btrfs_set_inode_mode(leaf, item, S_IFREG | 0600);
-	btrfs_set_inode_flags(leaf, item, BTRFS_INODE_NOCOMPRESS |
-					  BTRFS_INODE_PREALLOC);
+	btrfs_set_inode_flags(leaf, item, flags);
 	btrfs_mark_buffer_dirty(leaf);
 out:
 	btrfs_free_path(path);
-- 
2.30.0


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH v15 37/42] btrfs: zoned: relocate block group to repair IO failure in zoned filesystems
  2021-02-04 10:21 ` [PATCH v15 01/42] block: add bio_add_zone_append_page Naohiro Aota
                     ` (34 preceding siblings ...)
  2021-02-04 10:22   ` [PATCH v15 36/42] btrfs: zoned: enable relocation on a zoned filesystem Naohiro Aota
@ 2021-02-04 10:22   ` Naohiro Aota
  2021-02-04 10:22   ` [PATCH v15 38/42] btrfs: split alloc_log_tree() Naohiro Aota
                     ` (5 subsequent siblings)
  41 siblings, 0 replies; 72+ messages in thread
From: Naohiro Aota @ 2021-02-04 10:22 UTC (permalink / raw)
  To: linux-btrfs, dsterba; +Cc: hare, linux-fsdevel, Naohiro Aota, Josef Bacik

When btrfs finds a checksum error and the filesystem has a mirror of the
damaged data, btrfs reads the correct data from the mirror and writes it to
the damaged blocks. This, however, violates the sequential write constraint
of a zoned block device.

We can consider three methods to repair an IO failure in zoned filesystems:
(1) Reset and rewrite the damaged zone
(2) Allocate a new device extent and replace the damaged device extent with
    the new one
(3) Relocate the corresponding block group

Method (1) is the most similar to the behavior on regular devices. However,
it also wipes non-damaged data in the same device extent, and so it
unnecessarily degrades non-damaged data.

Method (2) is much like device replacing but done within the same device. It
is safe because it keeps the device extent until the replacing finishes.
However, extending device replacing is non-trivial. It assumes
"src_dev->physical == dst_dev->physical". Also, the extent mapping replacing
function should be extended to support replacing a device extent's position
within one device.

Method (3) invokes relocation of the damaged block group and is
straightforward to implement. It relocates all the mirrored device extents,
so it is potentially a more costly operation than method (1) or (2). But it
relocates only used extents, which reduces the total IO size.

Let's apply method (3) for now. In the future, we can extend device-replace
and apply method (2).

To protect a block group from being relocated multiple times by multiple IO
errors, this commit introduces a "relocating_repair" bit to show that the
group is now being relocated to repair IO failures. It also uses a new
kthread, "btrfs-relocating-repair", so that the relocation does not block
the IO path.

This commit also supports repairing in the scrub process.
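
A small user-space model of the once-only guard (type and helper names are
invented for the example; in the patch the bit is cache->relocating_repair
taken under cache->lock, and the actual relocation then runs in the
kthread):

#include <stdio.h>
#include <stdbool.h>
#include <pthread.h>

struct toy_block_group {
	pthread_mutex_t lock;
	bool relocating_repair;
	unsigned long long start;
};

/* Returns true if this caller won the right to start the repair; any
 * later caller for the same block group sees the bit set and backs off. */
static bool try_start_repair(struct toy_block_group *bg)
{
	bool start;

	pthread_mutex_lock(&bg->lock);
	start = !bg->relocating_repair;
	bg->relocating_repair = true;
	pthread_mutex_unlock(&bg->lock);

	if (start)
		printf("relocating block group %llu to repair IO failure\n",
		       bg->start);
	return start;
}

int main(void)
{
	struct toy_block_group bg = {
		.lock = PTHREAD_MUTEX_INITIALIZER,
		.start = 1ULL << 30,
	};

	try_start_repair(&bg);	/* first IO error: starts the repair */
	try_start_repair(&bg);	/* second error on same group: no-op */
	return 0;
}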

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/block-group.h |  1 +
 fs/btrfs/extent_io.c   |  3 ++
 fs/btrfs/scrub.c       |  3 ++
 fs/btrfs/volumes.c     | 72 ++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/volumes.h     |  1 +
 5 files changed, 80 insertions(+)

diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h
index d37ee576ac6e..29678426247d 100644
--- a/fs/btrfs/block-group.h
+++ b/fs/btrfs/block-group.h
@@ -96,6 +96,7 @@ struct btrfs_block_group {
 	unsigned int has_caching_ctl:1;
 	unsigned int removed:1;
 	unsigned int to_copy:1;
+	unsigned int relocating_repair:1;
 
 	int disk_cache_state;
 
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index ac210cf0956b..32fb5021f353 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2260,6 +2260,9 @@ int repair_io_failure(struct btrfs_fs_info *fs_info, u64 ino, u64 start,
 	ASSERT(!(fs_info->sb->s_flags & SB_RDONLY));
 	BUG_ON(!mirror_num);
 
+	if (btrfs_is_zoned(fs_info))
+		return btrfs_repair_one_zone(fs_info, logical);
+
 	bio = btrfs_io_bio_alloc(1);
 	bio->bi_iter.bi_size = 0;
 	map_length = length;
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index e0c3ec01e324..310fce00fcda 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -857,6 +857,9 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
 	have_csum = sblock_to_check->pagev[0]->have_csum;
 	dev = sblock_to_check->pagev[0]->dev;
 
+	if (btrfs_is_zoned(fs_info) && !sctx->is_dev_replace)
+		return btrfs_repair_one_zone(fs_info, logical);
+
 	/*
 	 * We must use GFP_NOFS because the scrub task might be waiting for a
 	 * worker task executing this function and in turn a transaction commit
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 1312b17a6b49..b8fab44394f5 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -7980,3 +7980,75 @@ bool btrfs_pinned_by_swapfile(struct btrfs_fs_info *fs_info, void *ptr)
 	spin_unlock(&fs_info->swapfile_pins_lock);
 	return node != NULL;
 }
+
+static int relocating_repair_kthread(void *data)
+{
+	struct btrfs_block_group *cache = (struct btrfs_block_group *)data;
+	struct btrfs_fs_info *fs_info = cache->fs_info;
+	u64 target;
+	int ret = 0;
+
+	target = cache->start;
+	btrfs_put_block_group(cache);
+
+	if (!btrfs_exclop_start(fs_info, BTRFS_EXCLOP_BALANCE)) {
+		btrfs_info(fs_info,
+			   "zoned: skip relocating block group %llu to repair: EBUSY",
+			   target);
+		return -EBUSY;
+	}
+
+	mutex_lock(&fs_info->delete_unused_bgs_mutex);
+
+	/* Ensure block group still exists */
+	cache = btrfs_lookup_block_group(fs_info, target);
+	if (!cache)
+		goto out;
+
+	if (!cache->relocating_repair)
+		goto out;
+
+	ret = btrfs_may_alloc_data_chunk(fs_info, target);
+	if (ret < 0)
+		goto out;
+
+	btrfs_info(fs_info,
+		   "zoned: relocating block group %llu to repair IO failure",
+		   target);
+	ret = btrfs_relocate_chunk(fs_info, target);
+
+out:
+	if (cache)
+		btrfs_put_block_group(cache);
+	mutex_unlock(&fs_info->delete_unused_bgs_mutex);
+	btrfs_exclop_finish(fs_info);
+
+	return ret;
+}
+
+int btrfs_repair_one_zone(struct btrfs_fs_info *fs_info, u64 logical)
+{
+	struct btrfs_block_group *cache;
+
+	/* Do not attempt to repair in degraded state */
+	if (btrfs_test_opt(fs_info, DEGRADED))
+		return 0;
+
+	cache = btrfs_lookup_block_group(fs_info, logical);
+	if (!cache)
+		return 0;
+
+	spin_lock(&cache->lock);
+	if (cache->relocating_repair) {
+		spin_unlock(&cache->lock);
+		btrfs_put_block_group(cache);
+		return 0;
+	}
+	cache->relocating_repair = 1;
+	spin_unlock(&cache->lock);
+
+	kthread_run(relocating_repair_kthread, cache,
+		    "btrfs-relocating-repair");
+
+	return 0;
+}
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index d3bbdb4175df..d4c3e0dd32b8 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -599,5 +599,6 @@ void btrfs_scratch_superblocks(struct btrfs_fs_info *fs_info,
 int btrfs_bg_type_to_factor(u64 flags);
 const char *btrfs_bg_type_to_raid_name(u64 flags);
 int btrfs_verify_dev_extents(struct btrfs_fs_info *fs_info);
+int btrfs_repair_one_zone(struct btrfs_fs_info *fs_info, u64 logical);
 
 #endif
-- 
2.30.0


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH v15 38/42] btrfs: split alloc_log_tree()
  2021-02-04 10:21 ` [PATCH v15 01/42] block: add bio_add_zone_append_page Naohiro Aota
                     ` (35 preceding siblings ...)
  2021-02-04 10:22   ` [PATCH v15 37/42] btrfs: zoned: relocate block group to repair IO failure in zoned filesystems Naohiro Aota
@ 2021-02-04 10:22   ` Naohiro Aota
  2021-02-04 10:22   ` [PATCH v15 39/42] btrfs: zoned: extend zoned allocator to use dedicated tree-log block group Naohiro Aota
                     ` (4 subsequent siblings)
  41 siblings, 0 replies; 72+ messages in thread
From: Naohiro Aota @ 2021-02-04 10:22 UTC (permalink / raw)
  To: linux-btrfs, dsterba
  Cc: hare, linux-fsdevel, Naohiro Aota, Filipe Manana, Josef Bacik,
	Johannes Thumshirn

This is a preparation patch for the next patch. Split alloc_log_tree() into
two parts. The first part, allocating the tree structure, remains in
alloc_log_tree(), and the second part, allocating the tree node, is moved
into a new function, btrfs_alloc_log_tree_node().

Also, export the latter part so that it can be used in the next patch.

Cc: Filipe Manana <fdmanana@gmail.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/disk-io.c | 33 +++++++++++++++++++++++++++------
 fs/btrfs/disk-io.h |  2 ++
 2 files changed, 29 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 6e16f556ed75..d2fa92526b3b 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1254,7 +1254,6 @@ static struct btrfs_root *alloc_log_tree(struct btrfs_trans_handle *trans,
 					 struct btrfs_fs_info *fs_info)
 {
 	struct btrfs_root *root;
-	struct extent_buffer *leaf;
 
 	root = btrfs_alloc_root(fs_info, BTRFS_TREE_LOG_OBJECTID, GFP_NOFS);
 	if (!root)
@@ -1264,6 +1263,14 @@ static struct btrfs_root *alloc_log_tree(struct btrfs_trans_handle *trans,
 	root->root_key.type = BTRFS_ROOT_ITEM_KEY;
 	root->root_key.offset = BTRFS_TREE_LOG_OBJECTID;
 
+	return root;
+}
+
+int btrfs_alloc_log_tree_node(struct btrfs_trans_handle *trans,
+			      struct btrfs_root *root)
+{
+	struct extent_buffer *leaf;
+
 	/*
 	 * DON'T set SHAREABLE bit for log trees.
 	 *
@@ -1276,26 +1283,33 @@ static struct btrfs_root *alloc_log_tree(struct btrfs_trans_handle *trans,
 
 	leaf = btrfs_alloc_tree_block(trans, root, 0, BTRFS_TREE_LOG_OBJECTID,
 			NULL, 0, 0, 0, BTRFS_NESTING_NORMAL);
-	if (IS_ERR(leaf)) {
-		btrfs_put_root(root);
-		return ERR_CAST(leaf);
-	}
+	if (IS_ERR(leaf))
+		return PTR_ERR(leaf);
 
 	root->node = leaf;
 
 	btrfs_mark_buffer_dirty(root->node);
 	btrfs_tree_unlock(root->node);
-	return root;
+
+	return 0;
 }
 
 int btrfs_init_log_root_tree(struct btrfs_trans_handle *trans,
 			     struct btrfs_fs_info *fs_info)
 {
 	struct btrfs_root *log_root;
+	int ret;
 
 	log_root = alloc_log_tree(trans, fs_info);
 	if (IS_ERR(log_root))
 		return PTR_ERR(log_root);
+
+	ret = btrfs_alloc_log_tree_node(trans, log_root);
+	if (ret) {
+		btrfs_put_root(log_root);
+		return ret;
+	}
+
 	WARN_ON(fs_info->log_root_tree);
 	fs_info->log_root_tree = log_root;
 	return 0;
@@ -1307,11 +1321,18 @@ int btrfs_add_log_tree(struct btrfs_trans_handle *trans,
 	struct btrfs_fs_info *fs_info = root->fs_info;
 	struct btrfs_root *log_root;
 	struct btrfs_inode_item *inode_item;
+	int ret;
 
 	log_root = alloc_log_tree(trans, fs_info);
 	if (IS_ERR(log_root))
 		return PTR_ERR(log_root);
 
+	ret = btrfs_alloc_log_tree_node(trans, log_root);
+	if (ret) {
+		btrfs_put_root(log_root);
+		return ret;
+	}
+
 	log_root->last_trans = trans->transid;
 	log_root->root_key.offset = root->root_key.objectid;
 
diff --git a/fs/btrfs/disk-io.h b/fs/btrfs/disk-io.h
index 9f4a2a1e3d36..0e7e9526b6a8 100644
--- a/fs/btrfs/disk-io.h
+++ b/fs/btrfs/disk-io.h
@@ -120,6 +120,8 @@ blk_status_t btrfs_wq_submit_bio(struct inode *inode, struct bio *bio,
 				 extent_submit_bio_start_t *submit_bio_start);
 blk_status_t btrfs_submit_bio_done(void *private_data, struct bio *bio,
 			  int mirror_num);
+int btrfs_alloc_log_tree_node(struct btrfs_trans_handle *trans,
+			      struct btrfs_root *root);
 int btrfs_init_log_root_tree(struct btrfs_trans_handle *trans,
 			     struct btrfs_fs_info *fs_info);
 int btrfs_add_log_tree(struct btrfs_trans_handle *trans,
-- 
2.30.0


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH v15 39/42] btrfs: zoned: extend zoned allocator to use dedicated tree-log block group
  2021-02-04 10:21 ` [PATCH v15 01/42] block: add bio_add_zone_append_page Naohiro Aota
                     ` (36 preceding siblings ...)
  2021-02-04 10:22   ` [PATCH v15 38/42] btrfs: split alloc_log_tree() Naohiro Aota
@ 2021-02-04 10:22   ` Naohiro Aota
  2021-02-04 10:22   ` [PATCH v15 40/42] btrfs: zoned: serialize log transaction on zoned filesystems Naohiro Aota
                     ` (3 subsequent siblings)
  41 siblings, 0 replies; 72+ messages in thread
From: Naohiro Aota @ 2021-02-04 10:22 UTC (permalink / raw)
  To: linux-btrfs, dsterba
  Cc: hare, linux-fsdevel, Naohiro Aota, Josef Bacik, Johannes Thumshirn

This is the 1/3 patch to enable tree log on zoned filesystems.

The tree-log feature does not work on a zoned filesystem as is. Blocks for
a tree-log tree are allocated interleaved with other metadata blocks, and
btrfs writes and syncs the tree-log blocks to devices at fsync() time,
which has a different timing than a global transaction commit. As a
result, both writing tree-log blocks and writing other metadata blocks
become non-sequential writes that zoned filesystems must avoid.

Introduce a dedicated block group for tree-log blocks, so that tree-log
blocks and other metadata blocks can form separate write streams. As a
result, each write stream can now be written to devices separately.
"fs_info->treelog_bg" tracks the dedicated block group, and btrfs assigns
"treelog_bg" on demand at tree-log block allocation time.
This commit extends the zoned block allocator to use the block group.
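
The core of the new rule is the skip test at the top of
do_allocation_zoned(). A self-contained sketch of just that predicate (the
helper name is invented for the example):

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

/*
 * Return true when the candidate block group must be skipped:
 *  - a tree-log allocation may only use the dedicated tree-log block group,
 *  - any other allocation must stay out of it.
 * A treelog_bg of 0 means no block group has been dedicated yet.
 */
static bool skip_block_group(uint64_t treelog_bg, uint64_t candidate_bg,
			     bool for_treelog)
{
	if (!treelog_bg)
		return false;
	return (for_treelog && candidate_bg != treelog_bg) ||
	       (!for_treelog && candidate_bg == treelog_bg);
}

int main(void)
{
	const uint64_t treelog_bg = 1ULL << 30;	/* dedicated group start */
	const uint64_t other_bg = 2ULL << 30;

	printf("%d\n", skip_block_group(treelog_bg, other_bg, true));	  /* 1 */
	printf("%d\n", skip_block_group(treelog_bg, treelog_bg, false)); /* 1 */
	printf("%d\n", skip_block_group(treelog_bg, treelog_bg, true));  /* 0 */
	printf("%d\n", skip_block_group(0, other_bg, true));		  /* 0 */
	return 0;
}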

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/block-group.c |  2 ++
 fs/btrfs/ctree.h       |  2 ++
 fs/btrfs/disk-io.c     |  1 +
 fs/btrfs/extent-tree.c | 75 +++++++++++++++++++++++++++++++++++++++---
 fs/btrfs/zoned.h       | 14 ++++++++
 5 files changed, 90 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index f5e9f560ce6d..5064be59dac5 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -901,6 +901,8 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
 	btrfs_return_cluster_to_free_space(block_group, cluster);
 	spin_unlock(&cluster->refill_lock);
 
+	btrfs_clear_treelog_bg(block_group);
+
 	path = btrfs_alloc_path();
 	if (!path) {
 		ret = -ENOMEM;
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 1bb4f767966a..6f4b493625ef 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -976,6 +976,8 @@ struct btrfs_fs_info {
 	/* Max size to emit ZONE_APPEND write command */
 	u64 max_zone_append_size;
 	struct mutex zoned_meta_io_lock;
+	spinlock_t treelog_bg_lock;
+	u64 treelog_bg;
 
 #ifdef CONFIG_BTRFS_FS_REF_VERIFY
 	spinlock_t ref_verify_lock;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index d2fa92526b3b..84c6650d5ef7 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2787,6 +2787,7 @@ void btrfs_init_fs_info(struct btrfs_fs_info *fs_info)
 	spin_lock_init(&fs_info->super_lock);
 	spin_lock_init(&fs_info->buffer_lock);
 	spin_lock_init(&fs_info->unused_bgs_lock);
+	spin_lock_init(&fs_info->treelog_bg_lock);
 	rwlock_init(&fs_info->tree_mod_log_lock);
 	mutex_init(&fs_info->unused_bg_unpin_mutex);
 	mutex_init(&fs_info->delete_unused_bgs_mutex);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index e2b2abc42295..f8e8c17e5624 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3497,6 +3497,9 @@ struct find_free_extent_ctl {
 	bool have_caching_bg;
 	bool orig_have_caching_bg;
 
+	/* Allocation is called for tree-log */
+	bool for_treelog;
+
 	/* RAID index, converted from flags */
 	int index;
 
@@ -3725,6 +3728,22 @@ static int do_allocation_clustered(struct btrfs_block_group *block_group,
 	return find_free_extent_unclustered(block_group, ffe_ctl);
 }
 
+/*
+ * Tree-log block group locking
+ * ============================
+ *
+ * fs_info::treelog_bg_lock protects the fs_info::treelog_bg which
+ * indicates the starting address of a block group, which is reserved only
+ * for tree-log metadata.
+ *
+ * Lock nesting
+ * ============
+ *
+ * space_info::lock
+ *   block_group::lock
+ *     fs_info::treelog_bg_lock
+ */
+
 /*
  * Simple allocator for sequential only block group. It only allows sequential
  * allocation. No need to play with trees. This function also reserves the
@@ -3734,23 +3753,54 @@ static int do_allocation_zoned(struct btrfs_block_group *block_group,
 			       struct find_free_extent_ctl *ffe_ctl,
 			       struct btrfs_block_group **bg_ret)
 {
+	struct btrfs_fs_info *fs_info = block_group->fs_info;
 	struct btrfs_space_info *space_info = block_group->space_info;
 	struct btrfs_free_space_ctl *ctl = block_group->free_space_ctl;
 	u64 start = block_group->start;
 	u64 num_bytes = ffe_ctl->num_bytes;
 	u64 avail;
+	u64 bytenr = block_group->start;
+	u64 log_bytenr;
 	int ret = 0;
+	bool skip;
 
 	ASSERT(btrfs_is_zoned(block_group->fs_info));
 
+	/*
+	 * Do not allow non-tree-log blocks in the dedicated tree-log block
+	 * group, and vice versa.
+	 */
+	spin_lock(&fs_info->treelog_bg_lock);
+	log_bytenr = fs_info->treelog_bg;
+	skip = log_bytenr && ((ffe_ctl->for_treelog && bytenr != log_bytenr) ||
+			      (!ffe_ctl->for_treelog && bytenr == log_bytenr));
+	spin_unlock(&fs_info->treelog_bg_lock);
+	if (skip)
+		return 1;
+
 	spin_lock(&space_info->lock);
 	spin_lock(&block_group->lock);
+	spin_lock(&fs_info->treelog_bg_lock);
+
+	ASSERT(!ffe_ctl->for_treelog ||
+	       block_group->start == fs_info->treelog_bg ||
+	       fs_info->treelog_bg == 0);
 
 	if (block_group->ro) {
 		ret = 1;
 		goto out;
 	}
 
+	/*
+	 * Do not allow currently using block group to be tree-log dedicated
+	 * block group.
+	 */
+	if (ffe_ctl->for_treelog && !fs_info->treelog_bg &&
+	    (block_group->used || block_group->reserved)) {
+		ret = 1;
+		goto out;
+	}
+
 	avail = block_group->length - block_group->alloc_offset;
 	if (avail < num_bytes) {
 		if (ffe_ctl->max_extent_size < avail) {
@@ -3765,6 +3815,9 @@ static int do_allocation_zoned(struct btrfs_block_group *block_group,
 		goto out;
 	}
 
+	if (ffe_ctl->for_treelog && !fs_info->treelog_bg)
+		fs_info->treelog_bg = block_group->start;
+
 	ffe_ctl->found_offset = start + block_group->alloc_offset;
 	block_group->alloc_offset += num_bytes;
 	spin_lock(&ctl->tree_lock);
@@ -3779,6 +3832,9 @@ static int do_allocation_zoned(struct btrfs_block_group *block_group,
 	ffe_ctl->search_start = ffe_ctl->found_offset;
 
 out:
+	if (ret && ffe_ctl->for_treelog)
+		fs_info->treelog_bg = 0;
+	spin_unlock(&fs_info->treelog_bg_lock);
 	spin_unlock(&block_group->lock);
 	spin_unlock(&space_info->lock);
 	return ret;
@@ -4028,7 +4084,12 @@ static int prepare_allocation(struct btrfs_fs_info *fs_info,
 		return prepare_allocation_clustered(fs_info, ffe_ctl,
 						    space_info, ins);
 	case BTRFS_EXTENT_ALLOC_ZONED:
-		/* Nothing to do */
+		if (ffe_ctl->for_treelog) {
+			spin_lock(&fs_info->treelog_bg_lock);
+			if (fs_info->treelog_bg)
+				ffe_ctl->hint_byte = fs_info->treelog_bg;
+			spin_unlock(&fs_info->treelog_bg_lock);
+		}
 		return 0;
 	default:
 		BUG();
@@ -4072,6 +4133,7 @@ static noinline int find_free_extent(struct btrfs_root *root,
 	struct find_free_extent_ctl ffe_ctl = {0};
 	struct btrfs_space_info *space_info;
 	bool full_search = false;
+	bool for_treelog = root->root_key.objectid == BTRFS_TREE_LOG_OBJECTID;
 
 	WARN_ON(num_bytes < fs_info->sectorsize);
 
@@ -4085,6 +4147,7 @@ static noinline int find_free_extent(struct btrfs_root *root,
 	ffe_ctl.orig_have_caching_bg = false;
 	ffe_ctl.found_offset = 0;
 	ffe_ctl.hint_byte = hint_byte_orig;
+	ffe_ctl.for_treelog = for_treelog;
 	ffe_ctl.policy = BTRFS_EXTENT_ALLOC_CLUSTERED;
 
 	/* For clustered allocation */
@@ -4159,8 +4222,11 @@ static noinline int find_free_extent(struct btrfs_root *root,
 		struct btrfs_block_group *bg_ret;
 
 		/* If the block group is read-only, we can skip it entirely. */
-		if (unlikely(block_group->ro))
+		if (unlikely(block_group->ro)) {
+			if (for_treelog)
+				btrfs_clear_treelog_bg(block_group);
 			continue;
+		}
 
 		btrfs_grab_block_group(block_group, delalloc);
 		ffe_ctl.search_start = block_group->start;
@@ -4346,6 +4412,7 @@ int btrfs_reserve_extent(struct btrfs_root *root, u64 ram_bytes,
 	bool final_tried = num_bytes == min_alloc_size;
 	u64 flags;
 	int ret;
+	bool for_treelog = root->root_key.objectid == BTRFS_TREE_LOG_OBJECTID;
 
 	flags = get_alloc_profile_by_root(root, is_data);
 again:
@@ -4369,8 +4436,8 @@ int btrfs_reserve_extent(struct btrfs_root *root, u64 ram_bytes,
 
 			sinfo = btrfs_find_space_info(fs_info, flags);
 			btrfs_err(fs_info,
-				  "allocation failed flags %llu, wanted %llu",
-				  flags, num_bytes);
+			"allocation failed flags %llu, wanted %llu tree-log %d",
+				  flags, num_bytes, for_treelog);
 			if (sinfo)
 				btrfs_dump_space_info(fs_info, sinfo,
 						      num_bytes, 1);
diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h
index 932ad9bc0de6..61e969652fe1 100644
--- a/fs/btrfs/zoned.h
+++ b/fs/btrfs/zoned.h
@@ -7,6 +7,7 @@
 #include <linux/blkdev.h>
 #include "volumes.h"
 #include "disk-io.h"
+#include "block-group.h"
 
 struct btrfs_zoned_device_info {
 	/*
@@ -290,4 +291,17 @@ static inline void btrfs_zoned_meta_io_unlock(struct btrfs_fs_info *fs_info)
 	mutex_unlock(&fs_info->zoned_meta_io_lock);
 }
 
+static inline void btrfs_clear_treelog_bg(struct btrfs_block_group *bg)
+{
+	struct btrfs_fs_info *fs_info = bg->fs_info;
+
+	if (!btrfs_is_zoned(fs_info))
+		return;
+
+	spin_lock(&fs_info->treelog_bg_lock);
+	if (fs_info->treelog_bg == bg->start)
+		fs_info->treelog_bg = 0;
+	spin_unlock(&fs_info->treelog_bg_lock);
+}
+
 #endif
-- 
2.30.0


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH v15 40/42] btrfs: zoned: serialize log transaction on zoned filesystems
  2021-02-04 10:21 ` [PATCH v15 01/42] block: add bio_add_zone_append_page Naohiro Aota
                     ` (37 preceding siblings ...)
  2021-02-04 10:22   ` [PATCH v15 39/42] btrfs: zoned: extend zoned allocator to use dedicated tree-log block group Naohiro Aota
@ 2021-02-04 10:22   ` Naohiro Aota
  2021-02-04 11:50     ` Filipe Manana
  2021-02-05  9:15     ` Naohiro Aota
  2021-02-04 10:22   ` [PATCH v15 41/42] btrfs: zoned: reorder log node allocation on zoned filesystem Naohiro Aota
                     ` (2 subsequent siblings)
  41 siblings, 2 replies; 72+ messages in thread
From: Naohiro Aota @ 2021-02-04 10:22 UTC (permalink / raw)
  To: linux-btrfs, dsterba
  Cc: hare, linux-fsdevel, Naohiro Aota, Filipe Manana, Josef Bacik

This is the 2/3 patch to enable tree-log on zoned filesystems.

Since we can start more than one log transaction per subvolume
simultaneously, nodes from multiple transactions can be allocated
interleaved. Such mixed allocation results in non-sequential writes at the
time of a log transaction commit. The nodes of the global log root tree
(fs_info->log_root_tree) also have the same problem with mixed allocation.

Serialize log transactions by waiting for a committing transaction when
someone tries to start a new transaction, to avoid the mixed allocation
problem. We must also wait for running log transactions from other
subvolumes, but there is no easy way to detect which subvolume root is
running a log transaction. So, this patch forbids starting a new log
transaction when another subvolume has already allocated the global log
root tree.
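
A single-threaded model of the resulting control flow in start_log_trans()
(names below are invented and locking is elided; the real code waits under
root->log_mutex with wait_log_commit()):

#include <stdio.h>
#include <stdbool.h>

struct log_state {
	int log_transid;
	bool commit_in_progress[2];	/* models root->log_commit[index] */
};

/* Stand-in for wait_log_commit(): the kernel sleeps until the committing
 * log transaction finishes; here we just flip the flag. */
static void wait_log_commit(struct log_state *s, int index)
{
	printf("waiting for log transid %d to commit\n", s->log_transid - 1);
	s->commit_in_progress[index] = false;
}

static int start_log_trans(struct log_state *s, bool zoned)
{
	int index = (s->log_transid + 1) % 2;

	while (zoned && s->commit_in_progress[index]) {
		wait_log_commit(s, index);
		/* Re-check: state may have changed while we were sleeping. */
		index = (s->log_transid + 1) % 2;
	}

	printf("joined log transid %d\n", s->log_transid);
	return 0;
}

int main(void)
{
	struct log_state s = {
		.log_transid = 7,
		.commit_in_progress = { true, false },
	};

	return start_log_trans(&s, true);
}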

Cc: Filipe Manana <fdmanana@gmail.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/tree-log.c | 29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index c02eeeac439c..8be3164d4c5d 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -105,6 +105,7 @@ static noinline int replay_dir_deletes(struct btrfs_trans_handle *trans,
 				       struct btrfs_root *log,
 				       struct btrfs_path *path,
 				       u64 dirid, int del_all);
+static void wait_log_commit(struct btrfs_root *root, int transid);
 
 /*
  * tree logging is a special write ahead log used to make sure that
@@ -140,6 +141,7 @@ static int start_log_trans(struct btrfs_trans_handle *trans,
 {
 	struct btrfs_fs_info *fs_info = root->fs_info;
 	struct btrfs_root *tree_root = fs_info->tree_root;
+	const bool zoned = btrfs_is_zoned(fs_info);
 	int ret = 0;
 
 	/*
@@ -160,12 +162,20 @@ static int start_log_trans(struct btrfs_trans_handle *trans,
 
 	mutex_lock(&root->log_mutex);
 
+again:
 	if (root->log_root) {
+		int index = (root->log_transid + 1) % 2;
+
 		if (btrfs_need_log_full_commit(trans)) {
 			ret = -EAGAIN;
 			goto out;
 		}
 
+		if (zoned && atomic_read(&root->log_commit[index])) {
+			wait_log_commit(root, root->log_transid - 1);
+			goto again;
+		}
+
 		if (!root->log_start_pid) {
 			clear_bit(BTRFS_ROOT_MULTI_LOG_TASKS, &root->state);
 			root->log_start_pid = current->pid;
@@ -173,6 +183,17 @@ static int start_log_trans(struct btrfs_trans_handle *trans,
 			set_bit(BTRFS_ROOT_MULTI_LOG_TASKS, &root->state);
 		}
 	} else {
+		if (zoned) {
+			mutex_lock(&fs_info->tree_log_mutex);
+			if (fs_info->log_root_tree)
+				ret = -EAGAIN;
+			else
+				ret = btrfs_init_log_root_tree(trans, fs_info);
+			mutex_unlock(&fs_info->tree_log_mutex);
+		}
+		if (ret)
+			goto out;
+
 		ret = btrfs_add_log_tree(trans, root);
 		if (ret)
 			goto out;
@@ -201,14 +222,22 @@ static int start_log_trans(struct btrfs_trans_handle *trans,
  */
 static int join_running_log_trans(struct btrfs_root *root)
 {
+	const bool zoned = btrfs_is_zoned(root->fs_info);
 	int ret = -ENOENT;
 
 	if (!test_bit(BTRFS_ROOT_HAS_LOG_TREE, &root->state))
 		return ret;
 
 	mutex_lock(&root->log_mutex);
+again:
 	if (root->log_root) {
+		int index = (root->log_transid + 1) % 2;
+
 		ret = 0;
+		if (zoned && atomic_read(&root->log_commit[index])) {
+			wait_log_commit(root, root->log_transid - 1);
+			goto again;
+		}
 		atomic_inc(&root->log_writers);
 	}
 	mutex_unlock(&root->log_mutex);
-- 
2.30.0


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH v15 41/42] btrfs: zoned: reorder log node allocation on zoned filesystem
  2021-02-04 10:21 ` [PATCH v15 01/42] block: add bio_add_zone_append_page Naohiro Aota
                     ` (38 preceding siblings ...)
  2021-02-04 10:22   ` [PATCH v15 40/42] btrfs: zoned: serialize log transaction on zoned filesystems Naohiro Aota
@ 2021-02-04 10:22   ` Naohiro Aota
  2021-02-04 11:57     ` Filipe Manana
  2021-02-04 10:22   ` [PATCH v15 42/42] btrfs: zoned: enable to mount ZONED incompat flag Naohiro Aota
  2021-02-05  9:26   ` [PATCH v15 43/43] btrfs: zoned: deal with holes writing out tree-log pages Naohiro Aota
  41 siblings, 1 reply; 72+ messages in thread
From: Naohiro Aota @ 2021-02-04 10:22 UTC (permalink / raw)
  To: linux-btrfs, dsterba
  Cc: hare, linux-fsdevel, Naohiro Aota, Filipe Manana, Josef Bacik,
	Johannes Thumshirn

This is the 3/3 patch to enable tree-log on zoned filesystems.

The allocation order of the nodes of "fs_info->log_root_tree" and the nodes
of "root->log_root" is not the same as their writing order. So, writing
them out causes unaligned write errors.

Reorder their allocation by delaying the allocation of the root node of
"fs_info->log_root_tree", so that the node buffers can go out to the
devices sequentially.

Cc: Filipe Manana <fdmanana@gmail.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/disk-io.c  | 12 +++++++-----
 fs/btrfs/tree-log.c | 27 +++++++++++++++++++++------
 2 files changed, 28 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 84c6650d5ef7..c2576c5fe62e 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1298,16 +1298,18 @@ int btrfs_init_log_root_tree(struct btrfs_trans_handle *trans,
 			     struct btrfs_fs_info *fs_info)
 {
 	struct btrfs_root *log_root;
-	int ret;
 
 	log_root = alloc_log_tree(trans, fs_info);
 	if (IS_ERR(log_root))
 		return PTR_ERR(log_root);
 
-	ret = btrfs_alloc_log_tree_node(trans, log_root);
-	if (ret) {
-		btrfs_put_root(log_root);
-		return ret;
+	if (!btrfs_is_zoned(fs_info)) {
+		int ret = btrfs_alloc_log_tree_node(trans, log_root);
+
+		if (ret) {
+			btrfs_put_root(log_root);
+			return ret;
+		}
 	}
 
 	WARN_ON(fs_info->log_root_tree);
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 8be3164d4c5d..7ba044bfa9b1 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -3159,6 +3159,19 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
 	list_add_tail(&root_log_ctx.list, &log_root_tree->log_ctxs[index2]);
 	root_log_ctx.log_transid = log_root_tree->log_transid;
 
+	if (btrfs_is_zoned(fs_info)) {
+		mutex_lock(&fs_info->tree_log_mutex);
+		if (!log_root_tree->node) {
+			ret = btrfs_alloc_log_tree_node(trans, log_root_tree);
+			if (ret) {
+				mutex_unlock(&fs_info->tree_log_mutex);
+				mutex_unlock(&log_root_tree->log_mutex);
+				goto out;
+			}
+		}
+		mutex_unlock(&fs_info->tree_log_mutex);
+	}
+
 	/*
 	 * Now we are safe to update the log_root_tree because we're under the
 	 * log_mutex, and we're a current writer so we're holding the commit
@@ -3317,12 +3330,14 @@ static void free_log_tree(struct btrfs_trans_handle *trans,
 		.process_func = process_one_buffer
 	};
 
-	ret = walk_log_tree(trans, log, &wc);
-	if (ret) {
-		if (trans)
-			btrfs_abort_transaction(trans, ret);
-		else
-			btrfs_handle_fs_error(log->fs_info, ret, NULL);
+	if (log->node) {
+		ret = walk_log_tree(trans, log, &wc);
+		if (ret) {
+			if (trans)
+				btrfs_abort_transaction(trans, ret);
+			else
+				btrfs_handle_fs_error(log->fs_info, ret, NULL);
+		}
 	}
 
 	clear_extent_bits(&log->dirty_log_pages, 0, (u64)-1,
-- 
2.30.0


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH v15 42/42] btrfs: zoned: enable to mount ZONED incompat flag
  2021-02-04 10:21 ` [PATCH v15 01/42] block: add bio_add_zone_append_page Naohiro Aota
                     ` (39 preceding siblings ...)
  2021-02-04 10:22   ` [PATCH v15 41/42] btrfs: zoned: reorder log node allocation on zoned filesystem Naohiro Aota
@ 2021-02-04 10:22   ` Naohiro Aota
  2021-02-05  9:26   ` [PATCH v15 43/43] btrfs: zoned: deal with holes writing out tree-log pages Naohiro Aota
  41 siblings, 0 replies; 72+ messages in thread
From: Naohiro Aota @ 2021-02-04 10:22 UTC (permalink / raw)
  To: linux-btrfs, dsterba
  Cc: hare, linux-fsdevel, Naohiro Aota, Anand Jain, Josef Bacik

This final patch adds the ZONED incompat flag to
BTRFS_FEATURE_INCOMPAT_SUPP and enables btrfs to mount a ZONED-flagged
filesystem.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/ctree.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 6f4b493625ef..3bc00aed13b2 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -298,7 +298,8 @@ struct btrfs_super_block {
 	 BTRFS_FEATURE_INCOMPAT_SKINNY_METADATA |	\
 	 BTRFS_FEATURE_INCOMPAT_NO_HOLES	|	\
 	 BTRFS_FEATURE_INCOMPAT_METADATA_UUID	|	\
-	 BTRFS_FEATURE_INCOMPAT_RAID1C34)
+	 BTRFS_FEATURE_INCOMPAT_RAID1C34	|	\
+	 BTRFS_FEATURE_INCOMPAT_ZONED)
 
 #define BTRFS_FEATURE_INCOMPAT_SAFE_SET			\
 	(BTRFS_FEATURE_INCOMPAT_EXTENDED_IREF)
-- 
2.30.0


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* Re: [PATCH v15 40/42] btrfs: zoned: serialize log transaction on zoned filesystems
  2021-02-04 10:22   ` [PATCH v15 40/42] btrfs: zoned: serialize log transaction on zoned filesystems Naohiro Aota
@ 2021-02-04 11:50     ` Filipe Manana
  2021-02-05  7:21       ` Naohiro Aota
  2021-02-05  9:15     ` Naohiro Aota
  1 sibling, 1 reply; 72+ messages in thread
From: Filipe Manana @ 2021-02-04 11:50 UTC (permalink / raw)
  To: Naohiro Aota; +Cc: linux-btrfs, David Sterba, hare, linux-fsdevel, Josef Bacik

On Thu, Feb 4, 2021 at 10:23 AM Naohiro Aota <naohiro.aota@wdc.com> wrote:
>
> This is the 2/3 patch to enable tree-log on zoned filesystems.
>
> Since we can start more than one log transactions per subvolume
> simultaneously, nodes from multiple transactions can be allocated
> interleaved. Such mixed allocation results in non-sequential writes at the
> time of a log transaction commit. The nodes of the global log root tree
> (fs_info->log_root_tree), also have the same problem with mixed
> allocation.
>
> Serializes log transactions by waiting for a committing transaction when
> someone tries to start a new transaction, to avoid the mixed allocation
> problem. We must also wait for running log transactions from another
> subvolume, but there is no easy way to detect which subvolume root is
> running a log transaction. So, this patch forbids starting a new log
> transaction when other subvolumes already allocated the global log root
> tree.
>
> Cc: Filipe Manana <fdmanana@gmail.com>
> Reviewed-by: Josef Bacik <josef@toxicpanda.com>
> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
> ---
>  fs/btrfs/tree-log.c | 29 +++++++++++++++++++++++++++++
>  1 file changed, 29 insertions(+)
>
> diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
> index c02eeeac439c..8be3164d4c5d 100644
> --- a/fs/btrfs/tree-log.c
> +++ b/fs/btrfs/tree-log.c
> @@ -105,6 +105,7 @@ static noinline int replay_dir_deletes(struct btrfs_trans_handle *trans,
>                                        struct btrfs_root *log,
>                                        struct btrfs_path *path,
>                                        u64 dirid, int del_all);
> +static void wait_log_commit(struct btrfs_root *root, int transid);
>
>  /*
>   * tree logging is a special write ahead log used to make sure that
> @@ -140,6 +141,7 @@ static int start_log_trans(struct btrfs_trans_handle *trans,
>  {
>         struct btrfs_fs_info *fs_info = root->fs_info;
>         struct btrfs_root *tree_root = fs_info->tree_root;
> +       const bool zoned = btrfs_is_zoned(fs_info);
>         int ret = 0;
>
>         /*
> @@ -160,12 +162,20 @@ static int start_log_trans(struct btrfs_trans_handle *trans,
>
>         mutex_lock(&root->log_mutex);
>
> +again:
>         if (root->log_root) {
> +               int index = (root->log_transid + 1) % 2;
> +
>                 if (btrfs_need_log_full_commit(trans)) {
>                         ret = -EAGAIN;
>                         goto out;
>                 }
>
> +               if (zoned && atomic_read(&root->log_commit[index])) {
> +                       wait_log_commit(root, root->log_transid - 1);
> +                       goto again;
> +               }
> +
>                 if (!root->log_start_pid) {
>                         clear_bit(BTRFS_ROOT_MULTI_LOG_TASKS, &root->state);
>                         root->log_start_pid = current->pid;
> @@ -173,6 +183,17 @@ static int start_log_trans(struct btrfs_trans_handle *trans,
>                         set_bit(BTRFS_ROOT_MULTI_LOG_TASKS, &root->state);
>                 }
>         } else {
> +               if (zoned) {
> +                       mutex_lock(&fs_info->tree_log_mutex);
> +                       if (fs_info->log_root_tree)
> +                               ret = -EAGAIN;
> +                       else
> +                               ret = btrfs_init_log_root_tree(trans, fs_info);
> +                       mutex_unlock(&fs_info->tree_log_mutex);
> +               }

So, nothing here changed since v14 - all my comments still apply [1].
This is based on pre-5.10 code and is broken as it is - it results in
every fsync falling back to a transaction commit, defeating the
purpose of all the patches that deal with log trees on zoned
filesystems.

Thanks.

[1] https://lore.kernel.org/linux-btrfs/CAL3q7H5pv416FVwThOHe+M3L5B-z_n6_ZGQQxsUq5vC5fsAoJw@mail.gmail.com/


> +               if (ret)
> +                       goto out;
> +
>                 ret = btrfs_add_log_tree(trans, root);
>                 if (ret)
>                         goto out;
> @@ -201,14 +222,22 @@ static int start_log_trans(struct btrfs_trans_handle *trans,
>   */
>  static int join_running_log_trans(struct btrfs_root *root)
>  {
> +       const bool zoned = btrfs_is_zoned(root->fs_info);
>         int ret = -ENOENT;
>
>         if (!test_bit(BTRFS_ROOT_HAS_LOG_TREE, &root->state))
>                 return ret;
>
>         mutex_lock(&root->log_mutex);
> +again:
>         if (root->log_root) {
> +               int index = (root->log_transid + 1) % 2;
> +
>                 ret = 0;
> +               if (zoned && atomic_read(&root->log_commit[index])) {
> +                       wait_log_commit(root, root->log_transid - 1);
> +                       goto again;
> +               }
>                 atomic_inc(&root->log_writers);
>         }
>         mutex_unlock(&root->log_mutex);
> --
> 2.30.0
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v15 41/42] btrfs: zoned: reorder log node allocation on zoned filesystem
  2021-02-04 10:22   ` [PATCH v15 41/42] btrfs: zoned: reorder log node allocation on zoned filesystem Naohiro Aota
@ 2021-02-04 11:57     ` Filipe Manana
  2021-02-04 14:54       ` Johannes Thumshirn
  0 siblings, 1 reply; 72+ messages in thread
From: Filipe Manana @ 2021-02-04 11:57 UTC (permalink / raw)
  To: Naohiro Aota
  Cc: linux-btrfs, David Sterba, hare, linux-fsdevel, Josef Bacik,
	Johannes Thumshirn

On Thu, Feb 4, 2021 at 10:23 AM Naohiro Aota <naohiro.aota@wdc.com> wrote:
>
> This is the 3/3 patch to enable tree-log on zoned filesystems.
>
> The allocation order of nodes of "fs_info->log_root_tree" and nodes of
> "root->log_root" is not the same as the writing order of them. So, the
> writing causes unaligned write errors.
>
> Reorder the allocation of them by delaying allocation of the root node of
> "fs_info->log_root_tree," so that the node buffers can go out sequentially
> to devices.
>
> Cc: Filipe Manana <fdmanana@gmail.com>
> Reviewed-by: Josef Bacik <josef@toxicpanda.com>
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
> ---
>  fs/btrfs/disk-io.c  | 12 +++++++-----
>  fs/btrfs/tree-log.c | 27 +++++++++++++++++++++------
>  2 files changed, 28 insertions(+), 11 deletions(-)
>
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 84c6650d5ef7..c2576c5fe62e 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -1298,16 +1298,18 @@ int btrfs_init_log_root_tree(struct btrfs_trans_handle *trans,
>                              struct btrfs_fs_info *fs_info)
>  {
>         struct btrfs_root *log_root;
> -       int ret;
>
>         log_root = alloc_log_tree(trans, fs_info);
>         if (IS_ERR(log_root))
>                 return PTR_ERR(log_root);
>
> -       ret = btrfs_alloc_log_tree_node(trans, log_root);
> -       if (ret) {
> -               btrfs_put_root(log_root);
> -               return ret;
> +       if (!btrfs_is_zoned(fs_info)) {
> +               int ret = btrfs_alloc_log_tree_node(trans, log_root);
> +
> +               if (ret) {
> +                       btrfs_put_root(log_root);
> +                       return ret;
> +               }
>         }
>
>         WARN_ON(fs_info->log_root_tree);
> diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
> index 8be3164d4c5d..7ba044bfa9b1 100644
> --- a/fs/btrfs/tree-log.c
> +++ b/fs/btrfs/tree-log.c
> @@ -3159,6 +3159,19 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
>         list_add_tail(&root_log_ctx.list, &log_root_tree->log_ctxs[index2]);
>         root_log_ctx.log_transid = log_root_tree->log_transid;
>
> +       if (btrfs_is_zoned(fs_info)) {
> +               mutex_lock(&fs_info->tree_log_mutex);
> +               if (!log_root_tree->node) {

As commented in v14, the log root tree is not protected by
fs_info->tree_log_mutex anymore.
It is fs_info->tree_root->log_mutex as of 5.10.

Everything else was addressed and looks good.
Thanks.

> +                       ret = btrfs_alloc_log_tree_node(trans, log_root_tree);
> +                       if (ret) {
> +                               mutex_unlock(&fs_info->tree_log_mutex);
> +                               mutex_unlock(&log_root_tree->log_mutex);
> +                               goto out;
> +                       }
> +               }
> +               mutex_unlock(&fs_info->tree_log_mutex);
> +       }
> +
>         /*
>          * Now we are safe to update the log_root_tree because we're under the
>          * log_mutex, and we're a current writer so we're holding the commit
> @@ -3317,12 +3330,14 @@ static void free_log_tree(struct btrfs_trans_handle *trans,
>                 .process_func = process_one_buffer
>         };
>
> -       ret = walk_log_tree(trans, log, &wc);
> -       if (ret) {
> -               if (trans)
> -                       btrfs_abort_transaction(trans, ret);
> -               else
> -                       btrfs_handle_fs_error(log->fs_info, ret, NULL);
> +       if (log->node) {
> +               ret = walk_log_tree(trans, log, &wc);
> +               if (ret) {
> +                       if (trans)
> +                               btrfs_abort_transaction(trans, ret);
> +                       else
> +                               btrfs_handle_fs_error(log->fs_info, ret, NULL);
> +               }
>         }
>
>         clear_extent_bits(&log->dirty_log_pages, 0, (u64)-1,
> --
> 2.30.0
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v15 41/42] btrfs: zoned: reorder log node allocation on zoned filesystem
  2021-02-04 11:57     ` Filipe Manana
@ 2021-02-04 14:54       ` Johannes Thumshirn
  2021-02-04 15:48         ` David Sterba
  0 siblings, 1 reply; 72+ messages in thread
From: Johannes Thumshirn @ 2021-02-04 14:54 UTC (permalink / raw)
  To: fdmanana, Naohiro Aota
  Cc: linux-btrfs, David Sterba, hare, linux-fsdevel, Josef Bacik

On 04/02/2021 12:57, Filipe Manana wrote:
> On Thu, Feb 4, 2021 at 10:23 AM Naohiro Aota <naohiro.aota@wdc.com> wrote:
>>
>> This is the 3/3 patch to enable tree-log on zoned filesystems.
>>
>> The allocation order of nodes of "fs_info->log_root_tree" and nodes of
>> "root->log_root" is not the same as the writing order of them. So, the
>> writing causes unaligned write errors.
>>
>> Reorder the allocation of them by delaying allocation of the root node of
>> "fs_info->log_root_tree," so that the node buffers can go out sequentially
>> to devices.
>>
>> Cc: Filipe Manana <fdmanana@gmail.com>
>> Reviewed-by: Josef Bacik <josef@toxicpanda.com>
>> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
>> ---
>>  fs/btrfs/disk-io.c  | 12 +++++++-----
>>  fs/btrfs/tree-log.c | 27 +++++++++++++++++++++------
>>  2 files changed, 28 insertions(+), 11 deletions(-)
>>
>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>> index 84c6650d5ef7..c2576c5fe62e 100644
>> --- a/fs/btrfs/disk-io.c
>> +++ b/fs/btrfs/disk-io.c
>> @@ -1298,16 +1298,18 @@ int btrfs_init_log_root_tree(struct btrfs_trans_handle *trans,
>>                              struct btrfs_fs_info *fs_info)
>>  {
>>         struct btrfs_root *log_root;
>> -       int ret;
>>
>>         log_root = alloc_log_tree(trans, fs_info);
>>         if (IS_ERR(log_root))
>>                 return PTR_ERR(log_root);
>>
>> -       ret = btrfs_alloc_log_tree_node(trans, log_root);
>> -       if (ret) {
>> -               btrfs_put_root(log_root);
>> -               return ret;
>> +       if (!btrfs_is_zoned(fs_info)) {
>> +               int ret = btrfs_alloc_log_tree_node(trans, log_root);
>> +
>> +               if (ret) {
>> +                       btrfs_put_root(log_root);
>> +                       return ret;
>> +               }
>>         }
>>
>>         WARN_ON(fs_info->log_root_tree);
>> diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
>> index 8be3164d4c5d..7ba044bfa9b1 100644
>> --- a/fs/btrfs/tree-log.c
>> +++ b/fs/btrfs/tree-log.c
>> @@ -3159,6 +3159,19 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
>>         list_add_tail(&root_log_ctx.list, &log_root_tree->log_ctxs[index2]);
>>         root_log_ctx.log_transid = log_root_tree->log_transid;
>>
>> +       if (btrfs_is_zoned(fs_info)) {
>> +               mutex_lock(&fs_info->tree_log_mutex);
>> +               if (!log_root_tree->node) {
> 
> As commented in v14, the log root tree is not protected by
> fs_info->tree_log_mutex anymore.
> It is fs_info->tree_root->log_mutex as of 5.10.
> 
> Everything else was addressed and looks good.
> Thanks.

David, can you add this or should we send an incremental patch?
This survived fstests -g quick run with lockdep enabled.

diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 7ba044bfa9b1..36c4a60d20dc 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -3160,7 +3160,7 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
        root_log_ctx.log_transid = log_root_tree->log_transid;
        if (btrfs_is_zoned(fs_info)) {
-               mutex_lock(&fs_info->tree_log_mutex);
+               mutex_lock(&fs_info->tree_root->log_mutex);
                if (!log_root_tree->node) {
                        ret = btrfs_alloc_log_tree_node(trans, log_root_tree);
                        if (ret) {
@@ -3169,7 +3169,7 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
                                goto out;
                        }
                }
-               mutex_unlock(&fs_info->tree_log_mutex);
+               mutex_unlock(&fs_info->tree_root->log_mutex);
        }
        /*
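
As a side note, the rule Filipe points out (fs_info->log_root_tree is
protected by fs_info->tree_root->log_mutex since 5.10) could also be made
self-documenting with a lockdep annotation. A minimal sketch only, based
on the hunk above; the helper name is hypothetical and not part of the
posted diff:

	/*
	 * Hypothetical helper: caller must hold
	 * fs_info->tree_root->log_mutex, which protects
	 * fs_info->log_root_tree as of 5.10.
	 */
	static int alloc_log_root_tree_node_if_needed(struct btrfs_trans_handle *trans,
						      struct btrfs_root *log_root_tree)
	{
		struct btrfs_fs_info *fs_info = log_root_tree->fs_info;

		lockdep_assert_held(&fs_info->tree_root->log_mutex);

		if (log_root_tree->node)
			return 0;
		return btrfs_alloc_log_tree_node(trans, log_root_tree);
	}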

^ permalink raw reply related	[flat|nested] 72+ messages in thread

* Re: [PATCH v15 41/42] btrfs: zoned: reorder log node allocation on zoned filesystem
  2021-02-04 14:54       ` Johannes Thumshirn
@ 2021-02-04 15:48         ` David Sterba
  2021-02-04 15:51           ` Johannes Thumshirn
  0 siblings, 1 reply; 72+ messages in thread
From: David Sterba @ 2021-02-04 15:48 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: fdmanana, Naohiro Aota, linux-btrfs, David Sterba, hare,
	linux-fsdevel, Josef Bacik

On Thu, Feb 04, 2021 at 02:54:25PM +0000, Johannes Thumshirn wrote:
> On 04/02/2021 12:57, Filipe Manana wrote:
> > On Thu, Feb 4, 2021 at 10:23 AM Naohiro Aota <naohiro.aota@wdc.com> wrote:
> >> --- a/fs/btrfs/tree-log.c
> >> +++ b/fs/btrfs/tree-log.c
> >> @@ -3159,6 +3159,19 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
> >>         list_add_tail(&root_log_ctx.list, &log_root_tree->log_ctxs[index2]);
> >>         root_log_ctx.log_transid = log_root_tree->log_transid;
> >>
> >> +       if (btrfs_is_zoned(fs_info)) {
> >> +               mutex_lock(&fs_info->tree_log_mutex);
> >> +               if (!log_root_tree->node) {
> > 
> > As commented in v14, the log root tree is not protected by
> > fs_info->tree_log_mutex anymore.
> > It is fs_info->tree_root->log_mutex as of 5.10.
> > 
> > Everything else was addressed and looks good.
> > Thanks.
> 
> David, can you add this or should we send an incremental patch?
> This survived fstests -g quick run with lockdep enabled.
> 
> diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
> index 7ba044bfa9b1..36c4a60d20dc 100644
> --- a/fs/btrfs/tree-log.c
> +++ b/fs/btrfs/tree-log.c
> @@ -3160,7 +3160,7 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
>         root_log_ctx.log_transid = log_root_tree->log_transid;
>         if (btrfs_is_zoned(fs_info)) {
> -               mutex_lock(&fs_info->tree_log_mutex);
> +               mutex_lock(&fs_info->tree_root->log_mutex);
>                 if (!log_root_tree->node) {
>                         ret = btrfs_alloc_log_tree_node(trans, log_root_tree);
>                         if (ret) {
> @@ -3169,7 +3169,7 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
>                                 goto out;
>                         }
>                 }
> -               mutex_unlock(&fs_info->tree_log_mutex);
> +               mutex_unlock(&fs_info->tree_root->log_mutex);

Folded to the patch, thanks.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v15 41/42] btrfs: zoned: reorder log node allocation on zoned filesystem
  2021-02-04 15:48         ` David Sterba
@ 2021-02-04 15:51           ` Johannes Thumshirn
  0 siblings, 0 replies; 72+ messages in thread
From: Johannes Thumshirn @ 2021-02-04 15:51 UTC (permalink / raw)
  To: dsterba
  Cc: fdmanana, Naohiro Aota, linux-btrfs, David Sterba, hare,
	linux-fsdevel, Josef Bacik

On 04/02/2021 16:50, David Sterba wrote:
> On Thu, Feb 04, 2021 at 02:54:25PM +0000, Johannes Thumshirn wrote:
>> On 04/02/2021 12:57, Filipe Manana wrote:
>>> On Thu, Feb 4, 2021 at 10:23 AM Naohiro Aota <naohiro.aota@wdc.com> wrote:
>>>> --- a/fs/btrfs/tree-log.c
>>>> +++ b/fs/btrfs/tree-log.c
>>>> @@ -3159,6 +3159,19 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
>>>>         list_add_tail(&root_log_ctx.list, &log_root_tree->log_ctxs[index2]);
>>>>         root_log_ctx.log_transid = log_root_tree->log_transid;
>>>>
>>>> +       if (btrfs_is_zoned(fs_info)) {
>>>> +               mutex_lock(&fs_info->tree_log_mutex);
>>>> +               if (!log_root_tree->node) {
>>>
>>> As commented in v14, the log root tree is not protected by
>>> fs_info->tree_log_mutex anymore.
>>> It is fs_info->tree_root->log_mutex as of 5.10.
>>>
>>> Everything else was addressed and looks good.
>>> Thanks.
>>
>> David, can you add this or should we send an incremental patch?
>> This survived fstests -g quick run with lockdep enabled.
>>
>> diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
>> index 7ba044bfa9b1..36c4a60d20dc 100644
>> --- a/fs/btrfs/tree-log.c
>> +++ b/fs/btrfs/tree-log.c
>> @@ -3160,7 +3160,7 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
>>         root_log_ctx.log_transid = log_root_tree->log_transid;
>>         if (btrfs_is_zoned(fs_info)) {
>> -               mutex_lock(&fs_info->tree_log_mutex);
>> +               mutex_lock(&fs_info->tree_root->log_mutex);
>>                 if (!log_root_tree->node) {
>>                         ret = btrfs_alloc_log_tree_node(trans, log_root_tree);
>>                         if (ret) {
>> @@ -3169,7 +3169,7 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
>>                                 goto out;
>>                         }
>>                 }
>> -               mutex_unlock(&fs_info->tree_log_mutex);
>> +               mutex_unlock(&fs_info->tree_root->log_mutex);
> 
> Folded to the patch, thanks.
> 

Thanks a lot

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v15 40/42] btrfs: zoned: serialize log transaction on zoned filesystems
  2021-02-04 11:50     ` Filipe Manana
@ 2021-02-05  7:21       ` Naohiro Aota
  0 siblings, 0 replies; 72+ messages in thread
From: Naohiro Aota @ 2021-02-05  7:21 UTC (permalink / raw)
  To: Filipe Manana; +Cc: linux-btrfs, David Sterba, hare, linux-fsdevel, Josef Bacik

On Thu, Feb 04, 2021 at 11:50:45AM +0000, Filipe Manana wrote:
> On Thu, Feb 4, 2021 at 10:23 AM Naohiro Aota <naohiro.aota@wdc.com> wrote:
> >
> > This is the 2/3 patch to enable tree-log on zoned filesystems.
> >
> > Since we can start more than one log transactions per subvolume
> > simultaneously, nodes from multiple transactions can be allocated
> > interleaved. Such mixed allocation results in non-sequential writes at the
> > time of a log transaction commit. The nodes of the global log root tree
> > (fs_info->log_root_tree), also have the same problem with mixed
> > allocation.
> >
> > Serializes log transactions by waiting for a committing transaction when
> > someone tries to start a new transaction, to avoid the mixed allocation
> > problem. We must also wait for running log transactions from another
> > subvolume, but there is no easy way to detect which subvolume root is
> > running a log transaction. So, this patch forbids starting a new log
> > transaction when other subvolumes already allocated the global log root
> > tree.
> >
> > Cc: Filipe Manana <fdmanana@gmail.com>
> > Reviewed-by: Josef Bacik <josef@toxicpanda.com>
> > Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
> > ---
> >  fs/btrfs/tree-log.c | 29 +++++++++++++++++++++++++++++
> >  1 file changed, 29 insertions(+)
> >
> > diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
> > index c02eeeac439c..8be3164d4c5d 100644
> > --- a/fs/btrfs/tree-log.c
> > +++ b/fs/btrfs/tree-log.c
> > @@ -105,6 +105,7 @@ static noinline int replay_dir_deletes(struct btrfs_trans_handle *trans,
> >                                        struct btrfs_root *log,
> >                                        struct btrfs_path *path,
> >                                        u64 dirid, int del_all);
> > +static void wait_log_commit(struct btrfs_root *root, int transid);
> >
> >  /*
> >   * tree logging is a special write ahead log used to make sure that
> > @@ -140,6 +141,7 @@ static int start_log_trans(struct btrfs_trans_handle *trans,
> >  {
> >         struct btrfs_fs_info *fs_info = root->fs_info;
> >         struct btrfs_root *tree_root = fs_info->tree_root;
> > +       const bool zoned = btrfs_is_zoned(fs_info);
> >         int ret = 0;
> >
> >         /*
> > @@ -160,12 +162,20 @@ static int start_log_trans(struct btrfs_trans_handle *trans,
> >
> >         mutex_lock(&root->log_mutex);
> >
> > +again:
> >         if (root->log_root) {
> > +               int index = (root->log_transid + 1) % 2;
> > +
> >                 if (btrfs_need_log_full_commit(trans)) {
> >                         ret = -EAGAIN;
> >                         goto out;
> >                 }
> >
> > +               if (zoned && atomic_read(&root->log_commit[index])) {
> > +                       wait_log_commit(root, root->log_transid - 1);
> > +                       goto again;
> > +               }
> > +
> >                 if (!root->log_start_pid) {
> >                         clear_bit(BTRFS_ROOT_MULTI_LOG_TASKS, &root->state);
> >                         root->log_start_pid = current->pid;
> > @@ -173,6 +183,17 @@ static int start_log_trans(struct btrfs_trans_handle *trans,
> >                         set_bit(BTRFS_ROOT_MULTI_LOG_TASKS, &root->state);
> >                 }
> >         } else {
> > +               if (zoned) {
> > +                       mutex_lock(&fs_info->tree_log_mutex);
> > +                       if (fs_info->log_root_tree)
> > +                               ret = -EAGAIN;
> > +                       else
> > +                               ret = btrfs_init_log_root_tree(trans, fs_info);
> > +                       mutex_unlock(&fs_info->tree_log_mutex);
> > +               }
> 
> So, nothing here changed since v14 - all my comments still apply [1]
> This is based on pre-5.10 code and is broken as it is - it results in
> every fsync falling back to a transaction commit, defeating the
> purpose of all the patches that deal with log trees on zoned
> filesystems.
> 
> Thanks.
> 
> [1] https://lore.kernel.org/linux-btrfs/CAL3q7H5pv416FVwThOHe+M3L5B-z_n6_ZGQQxsUq5vC5fsAoJw@mail.gmail.com/

Yes...

As noted in the cover letter, there is a fix for this issue itself.
However, the fix revealed other failures in the fsync() path. With
further investigation, I found those failures are not really related to
the zoned fsync() code. So, I will soon post two patches (one
incremental for this one, and one to deal with a regression case).

> 
> 
> > +               if (ret)
> > +                       goto out;
> > +
> >                 ret = btrfs_add_log_tree(trans, root);
> >                 if (ret)
> >                         goto out;
> > @@ -201,14 +222,22 @@ static int start_log_trans(struct btrfs_trans_handle *trans,
> >   */
> >  static int join_running_log_trans(struct btrfs_root *root)
> >  {
> > +       const bool zoned = btrfs_is_zoned(root->fs_info);
> >         int ret = -ENOENT;
> >
> >         if (!test_bit(BTRFS_ROOT_HAS_LOG_TREE, &root->state))
> >                 return ret;
> >
> >         mutex_lock(&root->log_mutex);
> > +again:
> >         if (root->log_root) {
> > +               int index = (root->log_transid + 1) % 2;
> > +
> >                 ret = 0;
> > +               if (zoned && atomic_read(&root->log_commit[index])) {
> > +                       wait_log_commit(root, root->log_transid - 1);
> > +                       goto again;
> > +               }
> >                 atomic_inc(&root->log_writers);
> >         }
> >         mutex_unlock(&root->log_mutex);
> > --
> > 2.30.0
> >
> 
> 
> -- 
> Filipe David Manana,
> 
> “Whether you think you can, or you think you can't — you're right.”

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v15 40/42] btrfs: zoned: serialize log transaction on zoned filesystems
  2021-02-04 10:22   ` [PATCH v15 40/42] btrfs: zoned: serialize log transaction on zoned filesystems Naohiro Aota
  2021-02-04 11:50     ` Filipe Manana
@ 2021-02-05  9:15     ` Naohiro Aota
  2021-02-05 11:21       ` Filipe Manana
  2021-02-09  1:49       ` David Sterba
  1 sibling, 2 replies; 72+ messages in thread
From: Naohiro Aota @ 2021-02-05  9:15 UTC (permalink / raw)
  To: linux-btrfs, dsterba; +Cc: hare, linux-fsdevel, Filipe Manana, Josef Bacik

David, could you fold the below incremental diff to this patch? Or, I
can send a full replacement patch.

diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 8be3164d4c5d..4e72794342c0 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -143,6 +143,7 @@ static int start_log_trans(struct btrfs_trans_handle *trans,
 	struct btrfs_root *tree_root = fs_info->tree_root;
 	const bool zoned = btrfs_is_zoned(fs_info);
 	int ret = 0;
+	bool created = false;
 
 	/*
 	 * First check if the log root tree was already created. If not, create
@@ -152,8 +153,10 @@ static int start_log_trans(struct btrfs_trans_handle *trans,
 		mutex_lock(&tree_root->log_mutex);
 		if (!fs_info->log_root_tree) {
 			ret = btrfs_init_log_root_tree(trans, fs_info);
-			if (!ret)
+			if (!ret) {
 				set_bit(BTRFS_ROOT_HAS_LOG_TREE, &tree_root->state);
+				created = true;
+			}
 		}
 		mutex_unlock(&tree_root->log_mutex);
 		if (ret)
@@ -183,16 +186,16 @@ static int start_log_trans(struct btrfs_trans_handle *trans,
 			set_bit(BTRFS_ROOT_MULTI_LOG_TASKS, &root->state);
 		}
 	} else {
-		if (zoned) {
-			mutex_lock(&fs_info->tree_log_mutex);
-			if (fs_info->log_root_tree)
-				ret = -EAGAIN;
-			else
-				ret = btrfs_init_log_root_tree(trans, fs_info);
-			mutex_unlock(&fs_info->tree_log_mutex);
-		}
-		if (ret)
+		/*
+		 * This means fs_info->log_root_tree was already created
+		 * for some other FS trees. Do the full commit not to mix
+		 * nodes from multiple log transactions to do sequential
+		 * writing.
+		 */
+		if (zoned && !created) {
+			ret = -EAGAIN;
 			goto out;
+		}
 
 		ret = btrfs_add_log_tree(trans, root);
 		if (ret)


On Thu, Feb 04, 2021 at 07:22:19PM +0900, Naohiro Aota wrote:
> This is the 2/3 patch to enable tree-log on zoned filesystems.
> 
> Since we can start more than one log transactions per subvolume
> simultaneously, nodes from multiple transactions can be allocated
> interleaved. Such mixed allocation results in non-sequential writes at the
> time of a log transaction commit. The nodes of the global log root tree
> (fs_info->log_root_tree), also have the same problem with mixed
> allocation.
> 
> Serializes log transactions by waiting for a committing transaction when
> someone tries to start a new transaction, to avoid the mixed allocation
> problem. We must also wait for running log transactions from another
> subvolume, but there is no easy way to detect which subvolume root is
> running a log transaction. So, this patch forbids starting a new log
> transaction when other subvolumes already allocated the global log root
> tree.
> 
> Cc: Filipe Manana <fdmanana@gmail.com>
> Reviewed-by: Josef Bacik <josef@toxicpanda.com>
> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
> ---
>  fs/btrfs/tree-log.c | 29 +++++++++++++++++++++++++++++
>  1 file changed, 29 insertions(+)
> 
> diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
> index c02eeeac439c..8be3164d4c5d 100644
> --- a/fs/btrfs/tree-log.c
> +++ b/fs/btrfs/tree-log.c
> @@ -105,6 +105,7 @@ static noinline int replay_dir_deletes(struct btrfs_trans_handle *trans,
>  				       struct btrfs_root *log,
>  				       struct btrfs_path *path,
>  				       u64 dirid, int del_all);
> +static void wait_log_commit(struct btrfs_root *root, int transid);
>  
>  /*
>   * tree logging is a special write ahead log used to make sure that
> @@ -140,6 +141,7 @@ static int start_log_trans(struct btrfs_trans_handle *trans,
>  {
>  	struct btrfs_fs_info *fs_info = root->fs_info;
>  	struct btrfs_root *tree_root = fs_info->tree_root;
> +	const bool zoned = btrfs_is_zoned(fs_info);
>  	int ret = 0;
>  
>  	/*
> @@ -160,12 +162,20 @@ static int start_log_trans(struct btrfs_trans_handle *trans,
>  
>  	mutex_lock(&root->log_mutex);
>  
> +again:
>  	if (root->log_root) {
> +		int index = (root->log_transid + 1) % 2;
> +
>  		if (btrfs_need_log_full_commit(trans)) {
>  			ret = -EAGAIN;
>  			goto out;
>  		}
>  
> +		if (zoned && atomic_read(&root->log_commit[index])) {
> +			wait_log_commit(root, root->log_transid - 1);
> +			goto again;
> +		}
> +
>  		if (!root->log_start_pid) {
>  			clear_bit(BTRFS_ROOT_MULTI_LOG_TASKS, &root->state);
>  			root->log_start_pid = current->pid;
> @@ -173,6 +183,17 @@ static int start_log_trans(struct btrfs_trans_handle *trans,
>  			set_bit(BTRFS_ROOT_MULTI_LOG_TASKS, &root->state);
>  		}
>  	} else {
> +		if (zoned) {
> +			mutex_lock(&fs_info->tree_log_mutex);
> +			if (fs_info->log_root_tree)
> +				ret = -EAGAIN;
> +			else
> +				ret = btrfs_init_log_root_tree(trans, fs_info);
> +			mutex_unlock(&fs_info->tree_log_mutex);
> +		}
> +		if (ret)
> +			goto out;
> +
>  		ret = btrfs_add_log_tree(trans, root);
>  		if (ret)
>  			goto out;
> @@ -201,14 +222,22 @@ static int start_log_trans(struct btrfs_trans_handle *trans,
>   */
>  static int join_running_log_trans(struct btrfs_root *root)
>  {
> +	const bool zoned = btrfs_is_zoned(root->fs_info);
>  	int ret = -ENOENT;
>  
>  	if (!test_bit(BTRFS_ROOT_HAS_LOG_TREE, &root->state))
>  		return ret;
>  
>  	mutex_lock(&root->log_mutex);
> +again:
>  	if (root->log_root) {
> +		int index = (root->log_transid + 1) % 2;
> +
>  		ret = 0;
> +		if (zoned && atomic_read(&root->log_commit[index])) {
> +			wait_log_commit(root, root->log_transid - 1);
> +			goto again;
> +		}
>  		atomic_inc(&root->log_writers);
>  	}
>  	mutex_unlock(&root->log_mutex);
> -- 
> 2.30.0
> 

^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH v15 43/43] btrfs: zoned: deal with holes writing out tree-log pages
  2021-02-04 10:21 ` [PATCH v15 01/42] block: add bio_add_zone_append_page Naohiro Aota
                     ` (40 preceding siblings ...)
  2021-02-04 10:22   ` [PATCH v15 42/42] btrfs: zoned: enable to mount ZONED incompat flag Naohiro Aota
@ 2021-02-05  9:26   ` Naohiro Aota
  2021-02-05 11:49     ` Filipe Manana
  2021-02-05 14:58     ` [PATCH v15.1 " Naohiro Aota
  41 siblings, 2 replies; 72+ messages in thread
From: Naohiro Aota @ 2021-02-05  9:26 UTC (permalink / raw)
  To: linux-btrfs, dsterba
  Cc: hare, linux-fsdevel, Johannes Thumshirn, Filipe Manana

Since the zoned filesystem requires sequential write out of metadata, we
cannot proceed with a hole in tree-log pages. When such a hole exists,
btree_write_cache_pages() will return -EAGAIN. We cannot wait for the range
to be written, because it will cause a deadlock. So, let's bail out to a
full commit in this case.

Cc: Filipe Manana <fdmanana@gmail.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/tree-log.c | 19 ++++++++++++++++++-
 1 file changed, 18 insertions(+), 1 deletion(-)

This patch solves a regression introduced by fixing patch 40. I'm
sorry for the confusing patch numbering.

diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 4e72794342c0..629e605cd62d 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -3120,6 +3120,14 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
 	 */
 	blk_start_plug(&plug);
 	ret = btrfs_write_marked_extents(fs_info, &log->dirty_log_pages, mark);
+	/*
+	 * There is a hole writing out the extents and cannot proceed it on
+	 * zoned filesystem, which require sequential writing. We can
+	 * ignore the error for now, since we don't wait for completion for
+	 * now.
+	 */
+	if (ret == -EAGAIN)
+		ret = 0;
 	if (ret) {
 		blk_finish_plug(&plug);
 		btrfs_abort_transaction(trans, ret);
@@ -3229,7 +3237,16 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
 					 &log_root_tree->dirty_log_pages,
 					 EXTENT_DIRTY | EXTENT_NEW);
 	blk_finish_plug(&plug);
-	if (ret) {
+	/*
+	 * There is a hole in the extents, and failed to sequential write
+	 * on zoned filesystem. We cannot wait for this write outs, sinc it
+	 * cause a deadlock. Bail out to the full commit, instead.
+	 */
+	if (ret == -EAGAIN) {
+		btrfs_wait_tree_log_extents(log, mark);
+		mutex_unlock(&log_root_tree->log_mutex);
+		goto out_wake_log_root;
+	} else if (ret) {
 		btrfs_set_log_full_commit(trans);
 		btrfs_abort_transaction(trans, ret);
 		mutex_unlock(&log_root_tree->log_mutex);
-- 
2.30.0


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* Re: [PATCH v15 40/42] btrfs: zoned: serialize log transaction on zoned filesystems
  2021-02-05  9:15     ` Naohiro Aota
@ 2021-02-05 11:21       ` Filipe Manana
  2021-02-09  1:49       ` David Sterba
  1 sibling, 0 replies; 72+ messages in thread
From: Filipe Manana @ 2021-02-05 11:21 UTC (permalink / raw)
  To: Naohiro Aota; +Cc: linux-btrfs, David Sterba, hare, linux-fsdevel, Josef Bacik

On Fri, Feb 5, 2021 at 9:15 AM Naohiro Aota <naohiro.aota@wdc.com> wrote:
>
> David, could you fold the below incremental diff to this patch? Or, I
> can send a full replacement patch.
>
> diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
> index 8be3164d4c5d..4e72794342c0 100644
> --- a/fs/btrfs/tree-log.c
> +++ b/fs/btrfs/tree-log.c
> @@ -143,6 +143,7 @@ static int start_log_trans(struct btrfs_trans_handle *trans,
>         struct btrfs_root *tree_root = fs_info->tree_root;
>         const bool zoned = btrfs_is_zoned(fs_info);
>         int ret = 0;
> +       bool created = false;
>
>         /*
>          * First check if the log root tree was already created. If not, create
> @@ -152,8 +153,10 @@ static int start_log_trans(struct btrfs_trans_handle *trans,
>                 mutex_lock(&tree_root->log_mutex);
>                 if (!fs_info->log_root_tree) {
>                         ret = btrfs_init_log_root_tree(trans, fs_info);
> -                       if (!ret)
> +                       if (!ret) {
>                                 set_bit(BTRFS_ROOT_HAS_LOG_TREE, &tree_root->state);
> +                               created = true;
> +                       }
>                 }
>                 mutex_unlock(&tree_root->log_mutex);
>                 if (ret)
> @@ -183,16 +186,16 @@ static int start_log_trans(struct btrfs_trans_handle *trans,
>                         set_bit(BTRFS_ROOT_MULTI_LOG_TASKS, &root->state);
>                 }
>         } else {
> -               if (zoned) {
> -                       mutex_lock(&fs_info->tree_log_mutex);
> -                       if (fs_info->log_root_tree)
> -                               ret = -EAGAIN;
> -                       else
> -                               ret = btrfs_init_log_root_tree(trans, fs_info);
> -                       mutex_unlock(&fs_info->tree_log_mutex);
> -               }
> -               if (ret)
> +               /*
> +                * This means fs_info->log_root_tree was already created
> +                * for some other FS trees. Do the full commit not to mix
> +                * nodes from multiple log transactions to do sequential
> +                * writing.
> +                */
> +               if (zoned && !created) {
> +                       ret = -EAGAIN;
>                         goto out;
> +               }
>
>                 ret = btrfs_add_log_tree(trans, root);
>                 if (ret)
>

Ok, with this, it looks good to me and you can have,

Reviewed-by: Filipe Manana <fdmanana@suse.com>

Thanks.

>
> On Thu, Feb 04, 2021 at 07:22:19PM +0900, Naohiro Aota wrote:
> > This is the 2/3 patch to enable tree-log on zoned filesystems.
> >
> > Since we can start more than one log transactions per subvolume
> > simultaneously, nodes from multiple transactions can be allocated
> > interleaved. Such mixed allocation results in non-sequential writes at the
> > time of a log transaction commit. The nodes of the global log root tree
> > (fs_info->log_root_tree), also have the same problem with mixed
> > allocation.
> >
> > Serializes log transactions by waiting for a committing transaction when
> > someone tries to start a new transaction, to avoid the mixed allocation
> > problem. We must also wait for running log transactions from another
> > subvolume, but there is no easy way to detect which subvolume root is
> > running a log transaction. So, this patch forbids starting a new log
> > transaction when other subvolumes already allocated the global log root
> > tree.
> >
> > Cc: Filipe Manana <fdmanana@gmail.com>
> > Reviewed-by: Josef Bacik <josef@toxicpanda.com>
> > Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
> > ---
> >  fs/btrfs/tree-log.c | 29 +++++++++++++++++++++++++++++
> >  1 file changed, 29 insertions(+)
> >
> > diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
> > index c02eeeac439c..8be3164d4c5d 100644
> > --- a/fs/btrfs/tree-log.c
> > +++ b/fs/btrfs/tree-log.c
> > @@ -105,6 +105,7 @@ static noinline int replay_dir_deletes(struct btrfs_trans_handle *trans,
> >                                      struct btrfs_root *log,
> >                                      struct btrfs_path *path,
> >                                      u64 dirid, int del_all);
> > +static void wait_log_commit(struct btrfs_root *root, int transid);
> >
> >  /*
> >   * tree logging is a special write ahead log used to make sure that
> > @@ -140,6 +141,7 @@ static int start_log_trans(struct btrfs_trans_handle *trans,
> >  {
> >       struct btrfs_fs_info *fs_info = root->fs_info;
> >       struct btrfs_root *tree_root = fs_info->tree_root;
> > +     const bool zoned = btrfs_is_zoned(fs_info);
> >       int ret = 0;
> >
> >       /*
> > @@ -160,12 +162,20 @@ static int start_log_trans(struct btrfs_trans_handle *trans,
> >
> >       mutex_lock(&root->log_mutex);
> >
> > +again:
> >       if (root->log_root) {
> > +             int index = (root->log_transid + 1) % 2;
> > +
> >               if (btrfs_need_log_full_commit(trans)) {
> >                       ret = -EAGAIN;
> >                       goto out;
> >               }
> >
> > +             if (zoned && atomic_read(&root->log_commit[index])) {
> > +                     wait_log_commit(root, root->log_transid - 1);
> > +                     goto again;
> > +             }
> > +
> >               if (!root->log_start_pid) {
> >                       clear_bit(BTRFS_ROOT_MULTI_LOG_TASKS, &root->state);
> >                       root->log_start_pid = current->pid;
> > @@ -173,6 +183,17 @@ static int start_log_trans(struct btrfs_trans_handle *trans,
> >                       set_bit(BTRFS_ROOT_MULTI_LOG_TASKS, &root->state);
> >               }
> >       } else {
> > +             if (zoned) {
> > +                     mutex_lock(&fs_info->tree_log_mutex);
> > +                     if (fs_info->log_root_tree)
> > +                             ret = -EAGAIN;
> > +                     else
> > +                             ret = btrfs_init_log_root_tree(trans, fs_info);
> > +                     mutex_unlock(&fs_info->tree_log_mutex);
> > +             }
> > +             if (ret)
> > +                     goto out;
> > +
> >               ret = btrfs_add_log_tree(trans, root);
> >               if (ret)
> >                       goto out;
> > @@ -201,14 +222,22 @@ static int start_log_trans(struct btrfs_trans_handle *trans,
> >   */
> >  static int join_running_log_trans(struct btrfs_root *root)
> >  {
> > +     const bool zoned = btrfs_is_zoned(root->fs_info);
> >       int ret = -ENOENT;
> >
> >       if (!test_bit(BTRFS_ROOT_HAS_LOG_TREE, &root->state))
> >               return ret;
> >
> >       mutex_lock(&root->log_mutex);
> > +again:
> >       if (root->log_root) {
> > +             int index = (root->log_transid + 1) % 2;
> > +
> >               ret = 0;
> > +             if (zoned && atomic_read(&root->log_commit[index])) {
> > +                     wait_log_commit(root, root->log_transid - 1);
> > +                     goto again;
> > +             }
> >               atomic_inc(&root->log_writers);
> >       }
> >       mutex_unlock(&root->log_mutex);
> > --
> > 2.30.0
> >



-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v15 43/43] btrfs: zoned: deal with holes writing out tree-log pages
  2021-02-05  9:26   ` [PATCH v15 43/43] btrfs: zoned: deal with holes writing out tree-log pages Naohiro Aota
@ 2021-02-05 11:49     ` Filipe Manana
  2021-02-05 12:55       ` Naohiro Aota
  2021-02-05 14:19       ` Filipe Manana
  2021-02-05 14:58     ` [PATCH v15.1 " Naohiro Aota
  1 sibling, 2 replies; 72+ messages in thread
From: Filipe Manana @ 2021-02-05 11:49 UTC (permalink / raw)
  To: Naohiro Aota
  Cc: linux-btrfs, David Sterba, hare, linux-fsdevel, Johannes Thumshirn

On Fri, Feb 5, 2021 at 9:26 AM Naohiro Aota <naohiro.aota@wdc.com> wrote:
>
> Since the zoned filesystem requires sequential write out of metadata, we
> cannot proceed with a hole in tree-log pages. When such a hole exists,
> btree_write_cache_pages() will return -EAGAIN. We cannot wait for the range
> to be written, because it will cause a deadlock. So, let's bail out to a
> full commit in this case.
>
> Cc: Filipe Manana <fdmanana@gmail.com>
> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
> ---
>  fs/btrfs/tree-log.c | 19 ++++++++++++++++++-
>  1 file changed, 18 insertions(+), 1 deletion(-)
>
> This patch solves a regression introduced by fixing patch 40. I'm
> sorry for the confusing patch numbering.

Hum, how does patch 40 can cause this?
And is it before the fixup or after?

>
> diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
> index 4e72794342c0..629e605cd62d 100644
> --- a/fs/btrfs/tree-log.c
> +++ b/fs/btrfs/tree-log.c
> @@ -3120,6 +3120,14 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
>          */
>         blk_start_plug(&plug);
>         ret = btrfs_write_marked_extents(fs_info, &log->dirty_log_pages, mark);
> +       /*
> +        * There is a hole writing out the extents and cannot proceed it on
> +        * zoned filesystem, which require sequential writing. We can

require -> requires

> +        * ignore the error for now, since we don't wait for completion for
> +        * now.

So why can we ignore the error for now?
Why not just bail out here and mark the log for full commit? (without
a transaction abort)

> +        */
> +       if (ret == -EAGAIN)
> +               ret = 0;
>         if (ret) {
>                 blk_finish_plug(&plug);
>                 btrfs_abort_transaction(trans, ret);
> @@ -3229,7 +3237,16 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
>                                          &log_root_tree->dirty_log_pages,
>                                          EXTENT_DIRTY | EXTENT_NEW);
>         blk_finish_plug(&plug);
> -       if (ret) {
> +       /*
> +        * There is a hole in the extents, and failed to sequential write
> +        * on zoned filesystem. We cannot wait for this write outs, sinc it

this -> these

> +        * cause a deadlock. Bail out to the full commit, instead.
> +        */
> +       if (ret == -EAGAIN) {
> +               btrfs_wait_tree_log_extents(log, mark);
> +               mutex_unlock(&log_root_tree->log_mutex);
> +               goto out_wake_log_root;

Must also call btrfs_set_log_full_commit(trans);

Thanks.

> +       } else if (ret) {
>                 btrfs_set_log_full_commit(trans);
>                 btrfs_abort_transaction(trans, ret);
>                 mutex_unlock(&log_root_tree->log_mutex);
> --
> 2.30.0
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v15 43/43] btrfs: zoned: deal with holes writing out tree-log pages
  2021-02-05 11:49     ` Filipe Manana
@ 2021-02-05 12:55       ` Naohiro Aota
  2021-02-05 13:07         ` Filipe Manana
  2021-02-05 14:19       ` Filipe Manana
  1 sibling, 1 reply; 72+ messages in thread
From: Naohiro Aota @ 2021-02-05 12:55 UTC (permalink / raw)
  To: Filipe Manana
  Cc: linux-btrfs, David Sterba, hare, linux-fsdevel, Johannes Thumshirn

On Fri, Feb 05, 2021 at 11:49:05AM +0000, Filipe Manana wrote:
> On Fri, Feb 5, 2021 at 9:26 AM Naohiro Aota <naohiro.aota@wdc.com> wrote:
> >
> > Since the zoned filesystem requires sequential write out of metadata, we
> > cannot proceed with a hole in tree-log pages. When such a hole exists,
> > btree_write_cache_pages() will return -EAGAIN. We cannot wait for the range
> > to be written, because it will cause a deadlock. So, let's bail out to a
> > full commit in this case.
> >
> > Cc: Filipe Manana <fdmanana@gmail.com>
> > Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
> > ---
> >  fs/btrfs/tree-log.c | 19 ++++++++++++++++++-
> >  1 file changed, 18 insertions(+), 1 deletion(-)
> >
> > This patch solves a regression introduced by fixing patch 40. I'm
> > sorry for the confusing patch numbering.
> 
> Hum, how does patch 40 can cause this?
> And is it before the fixup or after?

With pre-5.10 code base + zoned series at that time, it passed
xfstests without this patch.

With current code base + zoned series without the fixup for patch 40,
it also passed the tests, because we are mostly bailing out to a full
commit.

The fixup now exercises the new fsync code in zoned mode and has
revealed an issue where btrfs_write_marked_extents() returns -EAGAIN.
This error happens when a concurrent transaction commit writes a dirty
extent during this tree-log commit. The issue didn't occur previously,
I guess because of the longer critical section.
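
To illustrate how a hole turns into -EAGAIN on a zoned filesystem (a
conceptual sketch only; the field and variable names approximate what the
zoned series does and are not exact code from this patch set):

	/*
	 * Metadata writes must land exactly at the block group's write
	 * pointer. If the next dirty extent buffer of this log commit does
	 * not start at the write pointer, the space in between belongs to a
	 * concurrent writer (e.g. a transaction commit). We can neither
	 * submit out of order nor wait here, so writeback gives up with
	 * -EAGAIN.
	 */
	if (cache->meta_write_pointer != eb->start) {
		ret = -EAGAIN;	/* hole: someone else writes that range first */
	} else {
		cache->meta_write_pointer = eb->start + eb->len;
		/* safe to submit this extent buffer sequentially */
	}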

> 
> >
> > diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
> > index 4e72794342c0..629e605cd62d 100644
> > --- a/fs/btrfs/tree-log.c
> > +++ b/fs/btrfs/tree-log.c
> > @@ -3120,6 +3120,14 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
> >          */
> >         blk_start_plug(&plug);
> >         ret = btrfs_write_marked_extents(fs_info, &log->dirty_log_pages, mark);
> > +       /*
> > +        * There is a hole writing out the extents and cannot proceed it on
> > +        * zoned filesystem, which require sequential writing. We can
> 
> require -> requires
> 
> > +        * ignore the error for now, since we don't wait for completion for
> > +        * now.
> 
> So why can we ignore the error for now?
> Why not just bail out here and mark the log for full commit? (without
> a transaction abort)

As described above, -EAGAIN happens when a concurrent process writes
out an extent buffer of this tree-log commit. This concurrent write
out will fill a hole for us, so the next write out might
succeed. Indeed we can bail out here, but I opted to try the next
write.
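
Summed up as code, the policy described here is roughly the following (a
sketch of the approach; the concrete version is what becomes v15.1 of
this patch later in the thread):

	/*
	 * First pass: nobody waits for these writes yet, so a hole left by
	 * a concurrent transaction commit may still be filled by that
	 * writer. Ignore -EAGAIN and let the next write retry.
	 */
	ret = btrfs_write_marked_extents(fs_info, &log->dirty_log_pages, mark);
	if (ret == -EAGAIN && btrfs_is_zoned(fs_info))
		ret = 0;

	/*
	 * Second pass: from here on we would have to wait for the extents,
	 * which can deadlock, so bail out to a full transaction commit.
	 */
	ret = btrfs_write_marked_extents(fs_info,
					 &log_root_tree->dirty_log_pages,
					 EXTENT_DIRTY | EXTENT_NEW);
	if (ret == -EAGAIN && btrfs_is_zoned(fs_info)) {
		btrfs_set_log_full_commit(trans);
		btrfs_wait_tree_log_extents(log, mark);
		mutex_unlock(&log_root_tree->log_mutex);
		goto out_wake_log_root;
	}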

> > +        */
> > +       if (ret == -EAGAIN)
> > +               ret = 0;
> >         if (ret) {
> >                 blk_finish_plug(&plug);
> >                 btrfs_abort_transaction(trans, ret);
> > @@ -3229,7 +3237,16 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
> >                                          &log_root_tree->dirty_log_pages,
> >                                          EXTENT_DIRTY | EXTENT_NEW);
> >         blk_finish_plug(&plug);
> > -       if (ret) {
> > +       /*
> > +        * There is a hole in the extents, and failed to sequential write
> > +        * on zoned filesystem. We cannot wait for this write outs, sinc it
> 
> this -> these
> 
> > +        * cause a deadlock. Bail out to the full commit, instead.
> > +        */
> > +       if (ret == -EAGAIN) {
> > +               btrfs_wait_tree_log_extents(log, mark);
> > +               mutex_unlock(&log_root_tree->log_mutex);
> > +               goto out_wake_log_root;
> 
> Must also call btrfs_set_log_full_commit(trans);

Oops, I missed this one.

> Thanks.
> 
> > +       } else if (ret) {
> >                 btrfs_set_log_full_commit(trans);
> >                 btrfs_abort_transaction(trans, ret);
> >                 mutex_unlock(&log_root_tree->log_mutex);
> > --
> > 2.30.0
> >
> 
> 
> -- 
> Filipe David Manana,
> 
> “Whether you think you can, or you think you can't — you're right.”

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v15 43/43] btrfs: zoned: deal with holes writing out tree-log pages
  2021-02-05 12:55       ` Naohiro Aota
@ 2021-02-05 13:07         ` Filipe Manana
  0 siblings, 0 replies; 72+ messages in thread
From: Filipe Manana @ 2021-02-05 13:07 UTC (permalink / raw)
  To: Naohiro Aota
  Cc: linux-btrfs, David Sterba, hare, linux-fsdevel, Johannes Thumshirn

On Fri, Feb 5, 2021 at 12:55 PM Naohiro Aota <naohiro.aota@wdc.com> wrote:
>
> On Fri, Feb 05, 2021 at 11:49:05AM +0000, Filipe Manana wrote:
> > On Fri, Feb 5, 2021 at 9:26 AM Naohiro Aota <naohiro.aota@wdc.com> wrote:
> > >
> > > Since the zoned filesystem requires sequential write out of metadata, we
> > > cannot proceed with a hole in tree-log pages. When such a hole exists,
> > > btree_write_cache_pages() will return -EAGAIN. We cannot wait for the range
> > > to be written, because it will cause a deadlock. So, let's bail out to a
> > > full commit in this case.
> > >
> > > Cc: Filipe Manana <fdmanana@gmail.com>
> > > Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
> > > ---
> > >  fs/btrfs/tree-log.c | 19 ++++++++++++++++++-
> > >  1 file changed, 18 insertions(+), 1 deletion(-)
> > >
> > > This patch solves a regression introduced by fixing patch 40. I'm
> > > sorry for the confusing patch numbering.
> >
> > Hum, how does patch 40 can cause this?
> > And is it before the fixup or after?
>
> With pre-5.10 code base + zoned series at that time, it passed
> xfstests without this patch.
>
> With current code base + zoned series without the fixup for patch 40,
> it also passed the tests, because we are mostly bailing out to a full
> commit.
>
> The fixup now exercises the new fsync code in zoned mode and has
> revealed an issue where btrfs_write_marked_extents() returns -EAGAIN.
> This error happens when a concurrent transaction commit writes a dirty
> extent during this tree-log commit. The issue didn't occur previously,
> I guess because of the longer critical section.

Ok, if I understand you correctly, the problem is a transaction commit
and an fsync both allocating metadata extents at the same time.

>
> >
> > >
> > > diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
> > > index 4e72794342c0..629e605cd62d 100644
> > > --- a/fs/btrfs/tree-log.c
> > > +++ b/fs/btrfs/tree-log.c
> > > @@ -3120,6 +3120,14 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
> > >          */
> > >         blk_start_plug(&plug);
> > >         ret = btrfs_write_marked_extents(fs_info, &log->dirty_log_pages, mark);
> > > +       /*
> > > +        * There is a hole writing out the extents and cannot proceed it on
> > > +        * zoned filesystem, which require sequential writing. We can
> >
> > require -> requires
> >
> > > +        * ignore the error for now, since we don't wait for completion for
> > > +        * now.
> >
> > So why can we ignore the error for now?
> > Why not just bail out here and mark the log for full commit? (without
> > a transaction abort)
>
> As described above, -EAGAIN happens when a concurrent process writes
> out an extent buffer of this tree-log commit. This concurrent write
> out will fill a hole for us, so the next write out might
> succeed. Indeed we can bail out here, but I opted to try the next
> write.

Ok, if I understand you correctly, you mean it will be fine if after
this point no one allocates metadata extents from the hole?

I think such a clear explanation would fit nicely in the comment.

Thanks.

>
> > > +        */
> > > +       if (ret == -EAGAIN)
> > > +               ret = 0;
> > >         if (ret) {
> > >                 blk_finish_plug(&plug);
> > >                 btrfs_abort_transaction(trans, ret);
> > > @@ -3229,7 +3237,16 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
> > >                                          &log_root_tree->dirty_log_pages,
> > >                                          EXTENT_DIRTY | EXTENT_NEW);
> > >         blk_finish_plug(&plug);
> > > -       if (ret) {
> > > +       /*
> > > +        * There is a hole in the extents, and failed to sequential write
> > > +        * on zoned filesystem. We cannot wait for this write outs, sinc it
> >
> > this -> these
> >
> > > +        * cause a deadlock. Bail out to the full commit, instead.
> > > +        */
> > > +       if (ret == -EAGAIN) {
> > > +               btrfs_wait_tree_log_extents(log, mark);
> > > +               mutex_unlock(&log_root_tree->log_mutex);
> > > +               goto out_wake_log_root;
> >
> > Must also call btrfs_set_log_full_commit(trans);
>
> Oops, I missed this one.
>
> > Thanks.
> >
> > > +       } else if (ret) {
> > >                 btrfs_set_log_full_commit(trans);
> > >                 btrfs_abort_transaction(trans, ret);
> > >                 mutex_unlock(&log_root_tree->log_mutex);
> > > --
> > > 2.30.0
> > >
> >
> >
> > --
> > Filipe David Manana,
> >
> > “Whether you think you can, or you think you can't — you're right.”



-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v15 43/43] btrfs: zoned: deal with holes writing out tree-log pages
  2021-02-05 11:49     ` Filipe Manana
  2021-02-05 12:55       ` Naohiro Aota
@ 2021-02-05 14:19       ` Filipe Manana
  2021-02-05 14:46         ` Naohiro Aota
  1 sibling, 1 reply; 72+ messages in thread
From: Filipe Manana @ 2021-02-05 14:19 UTC (permalink / raw)
  To: Naohiro Aota
  Cc: linux-btrfs, David Sterba, hare, linux-fsdevel, Johannes Thumshirn

On Fri, Feb 5, 2021 at 11:49 AM Filipe Manana <fdmanana@gmail.com> wrote:
>
> On Fri, Feb 5, 2021 at 9:26 AM Naohiro Aota <naohiro.aota@wdc.com> wrote:
> >
> > Since the zoned filesystem requires sequential write out of metadata, we
> > cannot proceed with a hole in tree-log pages. When such a hole exists,
> > btree_write_cache_pages() will return -EAGAIN. We cannot wait for the range
> > to be written, because it will cause a deadlock. So, let's bail out to a
> > full commit in this case.
> >
> > Cc: Filipe Manana <fdmanana@gmail.com>
> > Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
> > ---
> >  fs/btrfs/tree-log.c | 19 ++++++++++++++++++-
> >  1 file changed, 18 insertions(+), 1 deletion(-)
> >
> > This patch solves a regression introduced by fixing patch 40. I'm
> > sorry for the confusing patch numbering.
>
> Hum, how does patch 40 can cause this?
> And is it before the fixup or after?
>
> >
> > diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
> > index 4e72794342c0..629e605cd62d 100644
> > --- a/fs/btrfs/tree-log.c
> > +++ b/fs/btrfs/tree-log.c
> > @@ -3120,6 +3120,14 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
> >          */
> >         blk_start_plug(&plug);
> >         ret = btrfs_write_marked_extents(fs_info, &log->dirty_log_pages, mark);
> > +       /*
> > +        * There is a hole writing out the extents and cannot proceed it on
> > +        * zoned filesystem, which require sequential writing. We can
>
> require -> requires
>
> > +        * ignore the error for now, since we don't wait for completion for
> > +        * now.
>
> So why can we ignore the error for now?
> Why not just bail out here and mark the log for full commit? (without
> a transaction abort)
>
> > +        */
> > +       if (ret == -EAGAIN)
> > +               ret = 0;

Thinking again about this, it would be safer and self-documenting to
check here that we are in zoned mode:

if (ret == -EAGAIN && is_zoned)
    ret = 0;

Because if we start to get -EAGAIN here one day, from non-zoned code,
we risk not writing out some extent buffer and getting a corrupt log,
which may be very hard to find.
With that additional check in place, we'll end up aborting the
transaction with -EAGAIN and notice the problem much sooner.

> >         if (ret) {
> >                 blk_finish_plug(&plug);
> >                 btrfs_abort_transaction(trans, ret);
> > @@ -3229,7 +3237,16 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
> >                                          &log_root_tree->dirty_log_pages,
> >                                          EXTENT_DIRTY | EXTENT_NEW);
> >         blk_finish_plug(&plug);
> > -       if (ret) {
> > +       /*
> > +        * There is a hole in the extents, and failed to sequential write
> > +        * on zoned filesystem. We cannot wait for this write outs, sinc it
>
> this -> these
>
> > +        * cause a deadlock. Bail out to the full commit, instead.
> > +        */
> > +       if (ret == -EAGAIN) {

I would add "&& is_zoned" here too.

Thanks.


> > +               btrfs_wait_tree_log_extents(log, mark);
> > +               mutex_unlock(&log_root_tree->log_mutex);
> > +               goto out_wake_log_root;
>
> Must also call btrfs_set_log_full_commit(trans);
>
> Thanks.
>
> > +       } else if (ret) {
> >                 btrfs_set_log_full_commit(trans);
> >                 btrfs_abort_transaction(trans, ret);
> >                 mutex_unlock(&log_root_tree->log_mutex);
> > --
> > 2.30.0
> >
>
>
> --
> Filipe David Manana,
>
> “Whether you think you can, or you think you can't — you're right.”



--
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v15 43/43] btrfs: zoned: deal with holes writing out tree-log pages
  2021-02-05 14:19       ` Filipe Manana
@ 2021-02-05 14:46         ` Naohiro Aota
  0 siblings, 0 replies; 72+ messages in thread
From: Naohiro Aota @ 2021-02-05 14:46 UTC (permalink / raw)
  To: Filipe Manana
  Cc: linux-btrfs, David Sterba, hare, linux-fsdevel, Johannes Thumshirn

On Fri, Feb 05, 2021 at 02:19:50PM +0000, Filipe Manana wrote:
> On Fri, Feb 5, 2021 at 11:49 AM Filipe Manana <fdmanana@gmail.com> wrote:
> >
> > On Fri, Feb 5, 2021 at 9:26 AM Naohiro Aota <naohiro.aota@wdc.com> wrote:
> > >
> > > Since the zoned filesystem requires sequential write out of metadata, we
> > > cannot proceed with a hole in tree-log pages. When such a hole exists,
> > > btree_write_cache_pages() will return -EAGAIN. We cannot wait for the range
> > > to be written, because it will cause a deadlock. So, let's bail out to a
> > > full commit in this case.
> > >
> > > Cc: Filipe Manana <fdmanana@gmail.com>
> > > Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
> > > ---
> > >  fs/btrfs/tree-log.c | 19 ++++++++++++++++++-
> > >  1 file changed, 18 insertions(+), 1 deletion(-)
> > >
> > > This patch solves a regression introduced by fixing patch 40. I'm
> > > sorry for the confusing patch numbering.
> >
> > Hum, how does patch 40 can cause this?
> > And is it before the fixup or after?
> >
> > >
> > > diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
> > > index 4e72794342c0..629e605cd62d 100644
> > > --- a/fs/btrfs/tree-log.c
> > > +++ b/fs/btrfs/tree-log.c
> > > @@ -3120,6 +3120,14 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
> > >          */
> > >         blk_start_plug(&plug);
> > >         ret = btrfs_write_marked_extents(fs_info, &log->dirty_log_pages, mark);
> > > +       /*
> > > +        * There is a hole writing out the extents and cannot proceed it on
> > > +        * zoned filesystem, which require sequential writing. We can
> >
> > require -> requires
> >
> > > +        * ignore the error for now, since we don't wait for completion for
> > > +        * now.
> >
> > So why can we ignore the error for now?
> > Why not just bail out here and mark the log for full commit? (without
> > a transaction abort)
> >
> > > +        */
> > > +       if (ret == -EAGAIN)
> > > +               ret = 0;
> 
> Thinking again about this, it would be safer and self-documenting to
> check here that we are in zoned mode:
> 
> if (ret == -EAGAIN && is_zoned)
>     ret = 0;
> 
> Because if we start to get -EAGAIN here one day, from non-zoned code,
> we risk not writing out some extent buffer and getting a corrupt log,
> which may be very hard to find.
> With that additional check in place, we'll end up aborting the
> transaction with -EAGAIN and notice the problem much sooner.

Yeah, I agree.

I'll post a new version with the comments revised and using "if (ret
== -EAGAIN && btrfs_is_zoned(fs_info))".

> > >         if (ret) {
> > >                 blk_finish_plug(&plug);
> > >                 btrfs_abort_transaction(trans, ret);
> > > @@ -3229,7 +3237,16 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
> > >                                          &log_root_tree->dirty_log_pages,
> > >                                          EXTENT_DIRTY | EXTENT_NEW);
> > >         blk_finish_plug(&plug);
> > > -       if (ret) {
> > > +       /*
> > > +        * There is a hole in the extents, and failed to sequential write
> > > +        * on zoned filesystem. We cannot wait for this write outs, sinc it
> >
> > this -> these
> >
> > > +        * cause a deadlock. Bail out to the full commit, instead.
> > > +        */
> > > +       if (ret == -EAGAIN) {
> 
> I would add "&& is_zoned" here too.
> 
> Thanks.
>
> 
> > > +               btrfs_wait_tree_log_extents(log, mark);
> > > +               mutex_unlock(&log_root_tree->log_mutex);
> > > +               goto out_wake_log_root;
> >
> > Must also call btrfs_set_log_full_commit(trans);
> >
> > Thanks.
> >
> > > +       } else if (ret) {
> > >                 btrfs_set_log_full_commit(trans);
> > >                 btrfs_abort_transaction(trans, ret);
> > >                 mutex_unlock(&log_root_tree->log_mutex);
> > > --
> > > 2.30.0
> > >
> >
> >
> > --
> > Filipe David Manana,
> >
> > “Whether you think you can, or you think you can't — you're right.”
> 
> 
> 
> --
> Filipe David Manana,
> 
> “Whether you think you can, or you think you can't — you're right.”

^ permalink raw reply	[flat|nested] 72+ messages in thread

* [PATCH v15.1 43/43] btrfs: zoned: deal with holes writing out tree-log pages
  2021-02-05  9:26   ` [PATCH v15 43/43] btrfs: zoned: deal with holes writing out tree-log pages Naohiro Aota
  2021-02-05 11:49     ` Filipe Manana
@ 2021-02-05 14:58     ` Naohiro Aota
  2021-02-05 16:25       ` Filipe Manana
  2021-02-09  1:55       ` David Sterba
  1 sibling, 2 replies; 72+ messages in thread
From: Naohiro Aota @ 2021-02-05 14:58 UTC (permalink / raw)
  To: linux-btrfs, dsterba
  Cc: hare, linux-fsdevel, Johannes Thumshirn, Filipe Manana

Since the zoned filesystem requires sequential write out of metadata, we
cannot proceed with a hole in tree-log pages. When such a hole exists,
btree_write_cache_pages() will return -EAGAIN. This happens when someone,
e.g., a concurrent transaction commit, writes a dirty extent in this
tree-log commit.

If we are not going to wait for the extents, we can hope the concurrent
writing fills the hole for us. So, we can ignore the error in this case and
hope the next write will succeed.

If we want to wait for them and got the error, we cannot wait for them
because it will cause a deadlock. So, let's bail out to a full commit in
this case.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/tree-log.c | 23 ++++++++++++++++++++++-
 1 file changed, 22 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index fc04625cbbd1..d90695c1ab6c 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -3120,6 +3120,17 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
 	 */
 	blk_start_plug(&plug);
 	ret = btrfs_write_marked_extents(fs_info, &log->dirty_log_pages, mark);
+	/*
+	 * -EAGAIN happens when someone, e.g., a concurrent transaction
+	 *  commit, writes a dirty extent in this tree-log commit. This
+	 *  concurrent write will create a hole writing out the extents,
+	 *  and we cannot proceed on a zoned filesystem, requiring
+	 *  sequential writing. While we can bail out to a full commit
+	 *  here, but we can continue hoping the concurrent writing fills
+	 *  the hole.
+	 */
+	if (ret == -EAGAIN && btrfs_is_zoned(fs_info))
+		ret = 0;
 	if (ret) {
 		blk_finish_plug(&plug);
 		btrfs_abort_transaction(trans, ret);
@@ -3242,7 +3253,17 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
 					 &log_root_tree->dirty_log_pages,
 					 EXTENT_DIRTY | EXTENT_NEW);
 	blk_finish_plug(&plug);
-	if (ret) {
+	/*
+	 * As described above, -EAGAIN indicates a hole in the extents. We
+	 * cannot wait for these write outs since the waiting cause a
+	 * deadlock. Bail out to the full commit instead.
+	 */
+	if (ret == -EAGAIN && btrfs_is_zoned(fs_info)) {
+		btrfs_set_log_full_commit(trans);
+		btrfs_wait_tree_log_extents(log, mark);
+		mutex_unlock(&log_root_tree->log_mutex);
+		goto out_wake_log_root;
+	} else if (ret) {
 		btrfs_set_log_full_commit(trans);
 		btrfs_abort_transaction(trans, ret);
 		mutex_unlock(&log_root_tree->log_mutex);
-- 
2.30.0


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* Re: [PATCH v15.1 43/43] btrfs: zoned: deal with holes writing out tree-log pages
  2021-02-05 14:58     ` [PATCH v15.1 " Naohiro Aota
@ 2021-02-05 16:25       ` Filipe Manana
  2021-02-09  1:55       ` David Sterba
  1 sibling, 0 replies; 72+ messages in thread
From: Filipe Manana @ 2021-02-05 16:25 UTC (permalink / raw)
  To: Naohiro Aota
  Cc: linux-btrfs, David Sterba, hare, linux-fsdevel, Johannes Thumshirn

On Fri, Feb 5, 2021 at 2:58 PM Naohiro Aota <naohiro.aota@wdc.com> wrote:
>
> Since the zoned filesystem requires sequential write out of metadata, we
> cannot proceed with a hole in tree-log pages. When such a hole exists,
> btree_write_cache_pages() will return -EAGAIN. This happens when someone,
> e.g., a concurrent transaction commit, writes a dirty extent in this
> tree-log commit.
>
> If we are not going to wait for the extents, we can hope the concurrent
> writing fills the hole for us. So, we can ignore the error in this case and
> hope the next write will succeed.
>
> If we want to wait for them and got the error, we cannot wait for them
> because it will cause a deadlock. So, let's bail out to a full commit in
> this case.
>
> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>

Reviewed-by: Filipe Manana <fdmanana@suse.com>

It looks good now, thanks!

> ---
>  fs/btrfs/tree-log.c | 23 ++++++++++++++++++++++-
>  1 file changed, 22 insertions(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
> index fc04625cbbd1..d90695c1ab6c 100644
> --- a/fs/btrfs/tree-log.c
> +++ b/fs/btrfs/tree-log.c
> @@ -3120,6 +3120,17 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
>          */
>         blk_start_plug(&plug);
>         ret = btrfs_write_marked_extents(fs_info, &log->dirty_log_pages, mark);
> +       /*
> +        * -EAGAIN happens when someone, e.g., a concurrent transaction
> +        *  commit, writes a dirty extent in this tree-log commit. This
> +        *  concurrent write will create a hole writing out the extents,
> +        *  and we cannot proceed on a zoned filesystem, requiring
> +        *  sequential writing. While we can bail out to a full commit
> +        *  here, but we can continue hoping the concurrent writing fills
> +        *  the hole.
> +        */
> +       if (ret == -EAGAIN && btrfs_is_zoned(fs_info))
> +               ret = 0;
>         if (ret) {
>                 blk_finish_plug(&plug);
>                 btrfs_abort_transaction(trans, ret);
> @@ -3242,7 +3253,17 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
>                                          &log_root_tree->dirty_log_pages,
>                                          EXTENT_DIRTY | EXTENT_NEW);
>         blk_finish_plug(&plug);
> -       if (ret) {
> +       /*
> +        * As described above, -EAGAIN indicates a hole in the extents. We
> +        * cannot wait for these write outs since the waiting cause a
> +        * deadlock. Bail out to the full commit instead.
> +        */
> +       if (ret == -EAGAIN && btrfs_is_zoned(fs_info)) {
> +               btrfs_set_log_full_commit(trans);
> +               btrfs_wait_tree_log_extents(log, mark);
> +               mutex_unlock(&log_root_tree->log_mutex);
> +               goto out_wake_log_root;
> +       } else if (ret) {
>                 btrfs_set_log_full_commit(trans);
>                 btrfs_abort_transaction(trans, ret);
>                 mutex_unlock(&log_root_tree->log_mutex);
> --
> 2.30.0
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


* Re: [PATCH v15 40/42] btrfs: zoned: serialize log transaction on zoned filesystems
  2021-02-05  9:15     ` Naohiro Aota
  2021-02-05 11:21       ` Filipe Manana
@ 2021-02-09  1:49       ` David Sterba
  1 sibling, 0 replies; 72+ messages in thread
From: David Sterba @ 2021-02-09  1:49 UTC (permalink / raw)
  To: Naohiro Aota
  Cc: linux-btrfs, dsterba, hare, linux-fsdevel, Filipe Manana, Josef Bacik

On Fri, Feb 05, 2021 at 06:15:16PM +0900, Naohiro Aota wrote:
> David, could you fold the below incremental diff to this patch? Or, I
> can send a full replacement patch.

Folded into the patch, thanks.


* Re: [PATCH v15.1 43/43] btrfs: zoned: deal with holes writing out tree-log pages
  2021-02-05 14:58     ` [PATCH v15.1 " Naohiro Aota
  2021-02-05 16:25       ` Filipe Manana
@ 2021-02-09  1:55       ` David Sterba
  1 sibling, 0 replies; 72+ messages in thread
From: David Sterba @ 2021-02-09  1:55 UTC (permalink / raw)
  To: Naohiro Aota
  Cc: linux-btrfs, dsterba, hare, linux-fsdevel, Johannes Thumshirn,
	Filipe Manana

On Fri, Feb 05, 2021 at 11:58:36PM +0900, Naohiro Aota wrote:
> Since the zoned filesystem requires sequential write out of metadata, we
> cannot proceed with a hole in tree-log pages. When such a hole exists,
> btree_write_cache_pages() will return -EAGAIN. This happens when someone,
> e.g., a concurrent transaction commit, writes a dirty extent in this
> tree-log commit.
> 
> If we are not going to wait for the extents, we can hope the concurrent
> writing fills the hole for us. So, we can ignore the error in this case and
> hope the next write will succeed.
> 
> If we want to wait for them and got the error, we cannot wait for them
> because it will cause a deadlock. So, let's bail out to a full commit in
> this case.
> 
> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>

Inserted before the last patch, thanks.


* Re: [PATCH v15 00/42] btrfs: zoned block device support
  2021-02-04 10:21 [PATCH v15 00/42] btrfs: zoned block device support Naohiro Aota
  2021-02-04 10:21 ` [PATCH v15 01/42] block: add bio_add_zone_append_page Naohiro Aota
@ 2021-02-10 19:58 ` David Sterba
  2021-02-11  9:58   ` Johannes Thumshirn
  1 sibling, 1 reply; 72+ messages in thread
From: David Sterba @ 2021-02-10 19:58 UTC (permalink / raw)
  To: Naohiro Aota; +Cc: linux-btrfs, dsterba, hare, linux-fsdevel

On Thu, Feb 04, 2021 at 07:21:39PM +0900, Naohiro Aota wrote:
> This series adds zoned block device support to btrfs. Some of the patches
> in the previous series are already merged as preparation patches.

Moved from for-next to misc-next.

> * Log-structured superblock
> 
> Superblock (and its copies) is the only data structure in btrfs which
> has a fixed location on a device. Since we cannot overwrite in a
> sequential write required zone, we cannot place superblock in the
> zone.
> 
> This series implements superblock log writing. It uses two zones as a
> circular buffer to write updated superblocks. Once the first zone is filled
> up, start writing into the second zone. The first zone will be reset once
> both zones are filled. We can determine the postion of the latest
> superblock by reading the write pointer information from a device.

About that: this patchset still leaves the superblock at a fixed zone
number, while we want it at a fixed location, spanning 2 zones
regardless of their size.


* Re: [PATCH v15 00/42] btrfs: zoned block device support
  2021-02-10 19:58 ` [PATCH v15 00/42] btrfs: zoned block device support David Sterba
@ 2021-02-11  9:58   ` Johannes Thumshirn
  2021-02-11 15:19     ` David Sterba
  0 siblings, 1 reply; 72+ messages in thread
From: Johannes Thumshirn @ 2021-02-11  9:58 UTC (permalink / raw)
  To: dsterba, Naohiro Aota; +Cc: linux-btrfs, dsterba, hare, linux-fsdevel

On 10/02/2021 21:02, David Sterba wrote:
>> This series implements superblock log writing. It uses two zones as a
>> circular buffer to write updated superblocks. Once the first zone is filled
>> up, start writing into the second zone. The first zone will be reset once
>> both zones are filled. We can determine the postion of the latest
>> superblock by reading the write pointer information from a device.
> 
> About that, in this patchset it's still leaving superblock at the fixed
> zone number while we want it at a fixed location, spanning 2 zones
> regardless of their size.
> 

We'll always need 2 zones, otherwise we won't be powercut safe.


* Re: [PATCH v15 00/42] btrfs: zoned block device support
  2021-02-11  9:58   ` Johannes Thumshirn
@ 2021-02-11 15:19     ` David Sterba
  2021-02-11 15:26       ` Johannes Thumshirn
  0 siblings, 1 reply; 72+ messages in thread
From: David Sterba @ 2021-02-11 15:19 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: dsterba, Naohiro Aota, linux-btrfs, dsterba, hare, linux-fsdevel

On Thu, Feb 11, 2021 at 09:58:09AM +0000, Johannes Thumshirn wrote:
> On 10/02/2021 21:02, David Sterba wrote:
> >> This series implements superblock log writing. It uses two zones as a
> >> circular buffer to write updated superblocks. Once the first zone is filled
> >> up, start writing into the second zone. The first zone will be reset once
> >> both zones are filled. We can determine the postion of the latest
> >> superblock by reading the write pointer information from a device.
> > 
> > About that, in this patchset it's still leaving superblock at the fixed
> > zone number while we want it at a fixed location, spanning 2 zones
> > regardless of their size.
> 
> We'll always need 2 zones or otherwise we won't be powercut safe.

Yes we do, that hasn't changed.


* Re: [PATCH v15 00/42] btrfs: zoned block device support
  2021-02-11 15:19     ` David Sterba
@ 2021-02-11 15:26       ` Johannes Thumshirn
  2021-02-11 15:46         ` David Sterba
  0 siblings, 1 reply; 72+ messages in thread
From: Johannes Thumshirn @ 2021-02-11 15:26 UTC (permalink / raw)
  To: dsterba; +Cc: Naohiro Aota, linux-btrfs, dsterba, hare, linux-fsdevel

On 11/02/2021 16:21, David Sterba wrote:
> On Thu, Feb 11, 2021 at 09:58:09AM +0000, Johannes Thumshirn wrote:
>> On 10/02/2021 21:02, David Sterba wrote:
>>>> This series implements superblock log writing. It uses two zones as a
>>>> circular buffer to write updated superblocks. Once the first zone is filled
>>>> up, start writing into the second zone. The first zone will be reset once
>>>> both zones are filled. We can determine the postion of the latest
>>>> superblock by reading the write pointer information from a device.
>>>
>>> About that, in this patchset it's still leaving superblock at the fixed
>>> zone number while we want it at a fixed location, spanning 2 zones
>>> regardless of their size.
>>
>> We'll always need 2 zones or otherwise we won't be powercut safe.
> 
> Yes we do, that hasn't changed.
> 

OK, that part I don't understand. With the log-structured superblocks on a
zoned filesystem, we write a new superblock until the 1st zone is filled.
Then we advance to the second zone. As soon as we have written a superblock
to the second zone, we can reset the first.
If we only used one zone, we would need to write until its end, reset it and
start writing again from the beginning. But if a powercut happens between
the reset and the first write after the reset, we end up with no superblock.
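
To make the lookup side of this concrete, here is a minimal, self-contained
C sketch (not the btrfs code; the function name, the return convention and
the state handling are assumptions derived from the description above) of
picking the zone that holds the newest superblock from the two write
pointers:

#include <stdbool.h>
#include <stdint.h>

/*
 * Sketch only: decide which of the two superblock log zones holds the
 * newest superblock, given each zone's write pointer as an offset inside
 * the zone (0 = empty, zone_size = full). Assumed layout: fill zone 0,
 * then zone 1, and reset the inactive zone once the other one has data.
 */
static int latest_sb_zone(uint64_t wp0, uint64_t wp1, uint64_t zone_size)
{
	bool empty0 = (wp0 == 0), empty1 = (wp1 == 0);
	bool full0 = (wp0 == zone_size), full1 = (wp1 == zone_size);

	if (empty0 && empty1)
		return -1;	/* no superblock written yet */
	if (empty1 || (full1 && !full0))
		return 0;	/* newest copy is in zone 0 */
	if (empty0 || (full0 && !full1))
		return 1;	/* newest copy is in zone 1 */
	/*
	 * Both zones are completely full: the write pointers alone cannot
	 * tell which copy is newer, so the superblock generations would
	 * have to be compared.
	 */
	return -2;
}

The newest superblock inside the selected zone then sits immediately below
that zone's write pointer.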


* Re: [PATCH v15 00/42] btrfs: zoned block device support
  2021-02-11 15:26       ` Johannes Thumshirn
@ 2021-02-11 15:46         ` David Sterba
  2021-02-15 16:58           ` Johannes Thumshirn
  0 siblings, 1 reply; 72+ messages in thread
From: David Sterba @ 2021-02-11 15:46 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: dsterba, Naohiro Aota, linux-btrfs, dsterba, hare, linux-fsdevel

On Thu, Feb 11, 2021 at 03:26:04PM +0000, Johannes Thumshirn wrote:
> On 11/02/2021 16:21, David Sterba wrote:
> > On Thu, Feb 11, 2021 at 09:58:09AM +0000, Johannes Thumshirn wrote:
> >> On 10/02/2021 21:02, David Sterba wrote:
> >>>> This series implements superblock log writing. It uses two zones as a
> >>>> circular buffer to write updated superblocks. Once the first zone is filled
> >>>> up, start writing into the second zone. The first zone will be reset once
> >>>> both zones are filled. We can determine the postion of the latest
> >>>> superblock by reading the write pointer information from a device.
> >>>
> >>> About that, in this patchset it's still leaving superblock at the fixed
> >>> zone number while we want it at a fixed location, spanning 2 zones
> >>> regardless of their size.
> >>
> >> We'll always need 2 zones or otherwise we won't be powercut safe.
> > 
> > Yes we do, that hasn't changed.
> 
> OK that I don't understand, with the log structured superblocks on a zoned
> filesystem, we're writing a new superblock until the 1st zone is filled.
> Then we advance to the second zone. As soon as we wrote a superblock to
> the second zone we can reset the first.
> If we only use one zone,

No, that can't work and nobody suggests that.

> we would need to write until it's end, reset and
> start writing again from the beginning. But if a powercut happens between
> reset and first write after the reset, we end up with no superblock.

What I'm saying, and what we discussed on Slack in December, is that we
can't fix the zone number of the 1st and 2nd superblock copies the way it
is done now in sb_zone_number.

The primary superblock must be there for any reference and to actually
let the tools learn about the incompat bits.

The 1st copy is now fixed at zone 16, so its location depends on the zone
size. The idea is to define the superblock copies to start at fixed byte
offsets, with the ring buffer occupying the two consecutive zones from
there, regardless of their size.

primary:		   0
1st copy:		 16G
2nd copy:		256G

Due to the variability of zone sizes in future devices, we'll reserve
space at each superblock location, assuming zone sizes can grow up to
several gigabytes. The current working number is 1G; with some safety
margin the reserved ranges would be (eg. for a 4G zone size):

primary:		0 up to 8G
1st copy:		16G up to 24G
2nd copy:		256G up to 262G

It is wasteful, but we want to be future proof, and with disk sizes
expected to range from tens of terabytes to a hundred terabytes it's not
a significant loss of space.

If zone sizes can be expected to grow beyond 4G, the 1st copy can be
defined at 64G; that would leave us some margin until somebody thinks
that 32G zones are a great idea.
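
As a rough illustration of how fixed byte offsets decouple the superblock
location from the zone size, here is a small C sketch; the names, the
offset array and the bounds check are assumptions for illustration, not
the final on-disk definition:

#include <stdint.h>

#define SZ_1G	(1024ULL * 1024 * 1024)

/* Proposed fixed starting offsets of the superblock copies (bytes). */
static const uint64_t sb_copy_offset[] = { 0, 16 * SZ_1G, 256 * SZ_1G };

/*
 * Sketch only: return the first of the two log zones of a superblock
 * copy, or -1 if the copy does not fit on the device. Zones are assumed
 * to be equally sized.
 */
static int64_t sb_log_first_zone(int mirror, uint64_t zone_size,
				 uint64_t nr_zones)
{
	uint64_t zone;

	if (mirror < 0 || mirror > 2)
		return -1;
	/* the zone that contains the fixed byte offset */
	zone = sb_copy_offset[mirror] / zone_size;
	if (zone + 1 >= nr_zones)	/* the log needs two consecutive zones */
		return -1;
	return (int64_t)zone;
}

With a 1G zone size the copies land at zones 0, 16 and 256; with a 4G zone
size at zones 0, 4 and 64, so the byte location stays fixed while the zone
numbers vary with the device.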


* Re: [PATCH v15 00/42] btrfs: zoned block device support
  2021-02-11 15:46         ` David Sterba
@ 2021-02-15 16:58           ` Johannes Thumshirn
  2021-02-15 17:02             ` David Sterba
  2021-02-16  4:33             ` Naohiro Aota
  0 siblings, 2 replies; 72+ messages in thread
From: Johannes Thumshirn @ 2021-02-15 16:58 UTC (permalink / raw)
  To: dsterba
  Cc: Naohiro Aota, linux-btrfs, dsterba, hare, linux-fsdevel, Damien Le Moal

On 11/02/2021 16:48, David Sterba wrote:
> On Thu, Feb 11, 2021 at 03:26:04PM +0000, Johannes Thumshirn wrote:
>> On 11/02/2021 16:21, David Sterba wrote:
>>> On Thu, Feb 11, 2021 at 09:58:09AM +0000, Johannes Thumshirn wrote:
>>>> On 10/02/2021 21:02, David Sterba wrote:
>>>>>> This series implements superblock log writing. It uses two zones as a
>>>>>> circular buffer to write updated superblocks. Once the first zone is filled
>>>>>> up, start writing into the second zone. The first zone will be reset once
>>>>>> both zones are filled. We can determine the postion of the latest
>>>>>> superblock by reading the write pointer information from a device.
>>>>>
>>>>> About that, in this patchset it's still leaving superblock at the fixed
>>>>> zone number while we want it at a fixed location, spanning 2 zones
>>>>> regardless of their size.
>>>>
>>>> We'll always need 2 zones or otherwise we won't be powercut safe.
>>>
>>> Yes we do, that hasn't changed.
>>
>> OK that I don't understand, with the log structured superblocks on a zoned
>> filesystem, we're writing a new superblock until the 1st zone is filled.
>> Then we advance to the second zone. As soon as we wrote a superblock to
>> the second zone we can reset the first.
>> If we only use one zone,
> 
> No, that can't work and nobody suggests that.
> 
>> we would need to write until it's end, reset and
>> start writing again from the beginning. But if a powercut happens between
>> reset and first write after the reset, we end up with no superblock.
> 
> What I'm saying and what we discussed on slack in December, we can't fix
> the zone number for the 1st and 2nd copy of superblock like it is now in
> sb_zone_number.
> 
> The primary superblock must be there for any reference and to actually
> let the tools learn about the incompat bits.
> 
> The 1st copy is now fixed zone 16, which depends on the zone size. The
> idea is to define the superblock offsets to start at given offsets,
> where the ring buffer has the two consecutive zones, regardless of their
> size.
> 
> primary:		   0
> 1st copy:		 16G
> 2nd copy:		256G
> 
> Due to the variability of the zones in future devices, we'll reserve a
> space at the superblock interval, assuming the zone sizes can grow up to
> several gigabytes. Current working number is 1G, with some safety margin
> the reserved ranges would be (eg. for a 4G zone size):
> 
> primary:		0 up to 8G
> 1st copy:		16G up to 24G
> 2nd copy:		256G up to 262G
> 
> It is wasteful but we want to be future proof and expecting disk sizes
> from tens of terabytes to a hundred terabytes, it's not significant
> loss of space.
> 
> If the zone sizes can be expected higher than 4G, the 1st copy can be
> defined at 64G, that would leave us some margin until somebody thinks
> that 32G zones are a great idea.
> 

We've been talking about this today and our proposal would be as follows:
Primary SB: the two zones starting at LBA 0
Secondary SB: the two zones starting with the zone that contains the address 16G
Third SB: the two zones starting with the zone that contains the address 256G,
or not present if the disk is too small.

This would make it safe up to a zone size of 8GB, and we would then have
adjacent superblock log zones.

How does that sound?

Byte,
	Johannes


* Re: [PATCH v15 00/42] btrfs: zoned block device support
  2021-02-15 16:58           ` Johannes Thumshirn
@ 2021-02-15 17:02             ` David Sterba
  2021-02-16  4:33             ` Naohiro Aota
  1 sibling, 0 replies; 72+ messages in thread
From: David Sterba @ 2021-02-15 17:02 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: Naohiro Aota, linux-btrfs, dsterba, hare, linux-fsdevel, Damien Le Moal

On Mon, Feb 15, 2021 at 04:58:05PM +0000, Johannes Thumshirn wrote:
> On 11/02/2021 16:48, David Sterba wrote:
> > On Thu, Feb 11, 2021 at 03:26:04PM +0000, Johannes Thumshirn wrote:
> >> On 11/02/2021 16:21, David Sterba wrote:
> >>> On Thu, Feb 11, 2021 at 09:58:09AM +0000, Johannes Thumshirn wrote:
> >>>> On 10/02/2021 21:02, David Sterba wrote:
> >>>>>> This series implements superblock log writing. It uses two zones as a
> >>>>>> circular buffer to write updated superblocks. Once the first zone is filled
> >>>>>> up, start writing into the second zone. The first zone will be reset once
> >>>>>> both zones are filled. We can determine the postion of the latest
> >>>>>> superblock by reading the write pointer information from a device.
> >>>>>
> >>>>> About that, in this patchset it's still leaving superblock at the fixed
> >>>>> zone number while we want it at a fixed location, spanning 2 zones
> >>>>> regardless of their size.
> >>>>
> >>>> We'll always need 2 zones or otherwise we won't be powercut safe.
> >>>
> >>> Yes we do, that hasn't changed.
> >>
> >> OK that I don't understand, with the log structured superblocks on a zoned
> >> filesystem, we're writing a new superblock until the 1st zone is filled.
> >> Then we advance to the second zone. As soon as we wrote a superblock to
> >> the second zone we can reset the first.
> >> If we only use one zone,
> > 
> > No, that can't work and nobody suggests that.
> > 
> >> we would need to write until it's end, reset and
> >> start writing again from the beginning. But if a powercut happens between
> >> reset and first write after the reset, we end up with no superblock.
> > 
> > What I'm saying and what we discussed on slack in December, we can't fix
> > the zone number for the 1st and 2nd copy of superblock like it is now in
> > sb_zone_number.
> > 
> > The primary superblock must be there for any reference and to actually
> > let the tools learn about the incompat bits.
> > 
> > The 1st copy is now fixed zone 16, which depends on the zone size. The
> > idea is to define the superblock offsets to start at given offsets,
> > where the ring buffer has the two consecutive zones, regardless of their
> > size.
> > 
> > primary:		   0
> > 1st copy:		 16G
> > 2nd copy:		256G
> > 
> > Due to the variability of the zones in future devices, we'll reserve a
> > space at the superblock interval, assuming the zone sizes can grow up to
> > several gigabytes. Current working number is 1G, with some safety margin
> > the reserved ranges would be (eg. for a 4G zone size):
> > 
> > primary:		0 up to 8G
> > 1st copy:		16G up to 24G
> > 2nd copy:		256G up to 262G
> > 
> > It is wasteful but we want to be future proof and expecting disk sizes
> > from tens of terabytes to a hundred terabytes, it's not significant
> > loss of space.
> > 
> > If the zone sizes can be expected higher than 4G, the 1st copy can be
> > defined at 64G, that would leave us some margin until somebody thinks
> > that 32G zones are a great idea.
> > 
> 
> We've been talking about this today and our proposal would be as follows:
> Primary SB is two zones starting at LBA 0
> Seconday SB the two zones starting with the zone that contains the address 16G
> Third SB the two zones starting with the zone that contains the address 256G 
> or not present if the disk is too small.
> 
> This would make it safe until a zone size of 8GB and we'd have adjacent 
> superblock log zones then.
> 
> How does that sound?

It sounds like we're on the same page regarding the superblock writes.


* Re: [PATCH v15 00/42] btrfs: zoned block device support
  2021-02-15 16:58           ` Johannes Thumshirn
  2021-02-15 17:02             ` David Sterba
@ 2021-02-16  4:33             ` Naohiro Aota
  2021-02-16 11:46               ` David Sterba
  1 sibling, 1 reply; 72+ messages in thread
From: Naohiro Aota @ 2021-02-16  4:33 UTC (permalink / raw)
  To: dsterba
  Cc: Johannes Thumshirn, linux-btrfs, dsterba, hare, linux-fsdevel,
	Damien Le Moal

On Mon, Feb 15, 2021 at 04:58:05PM +0000, Johannes Thumshirn wrote:
> On 11/02/2021 16:48, David Sterba wrote:
> > On Thu, Feb 11, 2021 at 03:26:04PM +0000, Johannes Thumshirn wrote:
> >> On 11/02/2021 16:21, David Sterba wrote:
> >>> On Thu, Feb 11, 2021 at 09:58:09AM +0000, Johannes Thumshirn wrote:
> >>>> On 10/02/2021 21:02, David Sterba wrote:
> >>>>>> This series implements superblock log writing. It uses two zones as a
> >>>>>> circular buffer to write updated superblocks. Once the first zone is filled
> >>>>>> up, start writing into the second zone. The first zone will be reset once
> >>>>>> both zones are filled. We can determine the postion of the latest
> >>>>>> superblock by reading the write pointer information from a device.
> >>>>>
> >>>>> About that, in this patchset it's still leaving superblock at the fixed
> >>>>> zone number while we want it at a fixed location, spanning 2 zones
> >>>>> regardless of their size.
> >>>>
> >>>> We'll always need 2 zones or otherwise we won't be powercut safe.
> >>>
> >>> Yes we do, that hasn't changed.
> >>
> >> OK that I don't understand, with the log structured superblocks on a zoned
> >> filesystem, we're writing a new superblock until the 1st zone is filled.
> >> Then we advance to the second zone. As soon as we wrote a superblock to
> >> the second zone we can reset the first.
> >> If we only use one zone,
> > 
> > No, that can't work and nobody suggests that.
> > 
> >> we would need to write until it's end, reset and
> >> start writing again from the beginning. But if a powercut happens between
> >> reset and first write after the reset, we end up with no superblock.
> > 
> > What I'm saying and what we discussed on slack in December, we can't fix
> > the zone number for the 1st and 2nd copy of superblock like it is now in
> > sb_zone_number.
> > 
> > The primary superblock must be there for any reference and to actually
> > let the tools learn about the incompat bits.
> > 
> > The 1st copy is now fixed zone 16, which depends on the zone size. The
> > idea is to define the superblock offsets to start at given offsets,
> > where the ring buffer has the two consecutive zones, regardless of their
> > size.
> > 
> > primary:		   0
> > 1st copy:		 16G
> > 2nd copy:		256G
> > 
> > Due to the variability of the zones in future devices, we'll reserve a
> > space at the superblock interval, assuming the zone sizes can grow up to
> > several gigabytes. Current working number is 1G, with some safety margin
> > the reserved ranges would be (eg. for a 4G zone size):
> > 
> > primary:		0 up to 8G
> > 1st copy:		16G up to 24G
> > 2nd copy:		256G up to 262G
> > 
> > It is wasteful but we want to be future proof and expecting disk sizes
> > from tens of terabytes to a hundred terabytes, it's not significant
> > loss of space.
> > 
> > If the zone sizes can be expected higher than 4G, the 1st copy can be
> > defined at 64G, that would leave us some margin until somebody thinks
> > that 32G zones are a great idea.
> > 
> 
> We've been talking about this today and our proposal would be as follows:
> Primary SB is two zones starting at LBA 0
> Seconday SB the two zones starting with the zone that contains the address 16G

For the secondary SB on a filesystem < 16GB, what do you think about
using the last two zones (or zones #2 and #3 would do)? Then we can
ensure there are two SB copies even on such a filesystem.

> Third SB the two zones starting with the zone that contains the address 256G 
> or not present if the disk is too small.
> 
> This would make it safe until a zone size of 8GB and we'd have adjacent 
> superblock log zones then.
> 
> How does that sound?
> 
> Byte,
> 	Johannes
> 


* Re: [PATCH v15 00/42] btrfs: zoned block device support
  2021-02-16  4:33             ` Naohiro Aota
@ 2021-02-16 11:46               ` David Sterba
  2021-02-22  7:50                 ` Naohiro Aota
  0 siblings, 1 reply; 72+ messages in thread
From: David Sterba @ 2021-02-16 11:46 UTC (permalink / raw)
  To: Naohiro Aota
  Cc: Johannes Thumshirn, linux-btrfs, dsterba, hare, linux-fsdevel,
	Damien Le Moal

On Tue, Feb 16, 2021 at 01:33:28PM +0900, Naohiro Aota wrote:
> On Mon, Feb 15, 2021 at 04:58:05PM +0000, Johannes Thumshirn wrote:
> > On 11/02/2021 16:48, David Sterba wrote:
> > > On Thu, Feb 11, 2021 at 03:26:04PM +0000, Johannes Thumshirn wrote:
> > >> On 11/02/2021 16:21, David Sterba wrote:
> > >>> On Thu, Feb 11, 2021 at 09:58:09AM +0000, Johannes Thumshirn wrote:
> > >>>> On 10/02/2021 21:02, David Sterba wrote:
> > >>>>>> This series implements superblock log writing. It uses two zones as a
> > >>>>>> circular buffer to write updated superblocks. Once the first zone is filled
> > >>>>>> up, start writing into the second zone. The first zone will be reset once
> > >>>>>> both zones are filled. We can determine the postion of the latest
> > >>>>>> superblock by reading the write pointer information from a device.
> > >>>>>
> > >>>>> About that, in this patchset it's still leaving superblock at the fixed
> > >>>>> zone number while we want it at a fixed location, spanning 2 zones
> > >>>>> regardless of their size.
> > >>>>
> > >>>> We'll always need 2 zones or otherwise we won't be powercut safe.
> > >>>
> > >>> Yes we do, that hasn't changed.
> > >>
> > >> OK that I don't understand, with the log structured superblocks on a zoned
> > >> filesystem, we're writing a new superblock until the 1st zone is filled.
> > >> Then we advance to the second zone. As soon as we wrote a superblock to
> > >> the second zone we can reset the first.
> > >> If we only use one zone,
> > > 
> > > No, that can't work and nobody suggests that.
> > > 
> > >> we would need to write until it's end, reset and
> > >> start writing again from the beginning. But if a powercut happens between
> > >> reset and first write after the reset, we end up with no superblock.
> > > 
> > > What I'm saying and what we discussed on slack in December, we can't fix
> > > the zone number for the 1st and 2nd copy of superblock like it is now in
> > > sb_zone_number.
> > > 
> > > The primary superblock must be there for any reference and to actually
> > > let the tools learn about the incompat bits.
> > > 
> > > The 1st copy is now fixed zone 16, which depends on the zone size. The
> > > idea is to define the superblock offsets to start at given offsets,
> > > where the ring buffer has the two consecutive zones, regardless of their
> > > size.
> > > 
> > > primary:		   0
> > > 1st copy:		 16G
> > > 2nd copy:		256G
> > > 
> > > Due to the variability of the zones in future devices, we'll reserve a
> > > space at the superblock interval, assuming the zone sizes can grow up to
> > > several gigabytes. Current working number is 1G, with some safety margin
> > > the reserved ranges would be (eg. for a 4G zone size):
> > > 
> > > primary:		0 up to 8G
> > > 1st copy:		16G up to 24G
> > > 2nd copy:		256G up to 262G
> > > 
> > > It is wasteful but we want to be future proof and expecting disk sizes
> > > from tens of terabytes to a hundred terabytes, it's not significant
> > > loss of space.
> > > 
> > > If the zone sizes can be expected higher than 4G, the 1st copy can be
> > > defined at 64G, that would leave us some margin until somebody thinks
> > > that 32G zones are a great idea.
> > > 
> > 
> > We've been talking about this today and our proposal would be as follows:
> > Primary SB is two zones starting at LBA 0
> > Seconday SB the two zones starting with the zone that contains the address 16G
> 
> For the secondary SB on a file system < 16GB, how do you think of
> using the last two zones (or zones #2, #3 will do)? Then, we can
> assure to have two SB copies even on such a file system.

For real hardware I think this is not relevant, but for the emulated mode
we need to deal with that case. The reserved size is wasteful and will
become noticeable for devices < 16G, but I'd rather keep the logic simple
and not care much about this corner case. So the superblock range would
be reserved, and if there's not enough space to store the secondary sb,
we simply don't write it.
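
A minimal sketch of that policy (the fixed offsets and the two-zones-per-copy
reservation are carried over from the discussion above; the names and the
exact cut-off are assumptions):

#include <stdint.h>

#define SZ_1G	(1024ULL * 1024 * 1024)

/*
 * Sketch only: count how many superblock copies fit on a zoned device when
 * each copy needs two zones reserved at the fixed offsets 0, 16G and 256G.
 * Copies whose two zones do not fully fit on the device are not written.
 */
static int nr_sb_copies(uint64_t dev_size, uint64_t zone_size)
{
	static const uint64_t offsets[] = { 0, 16 * SZ_1G, 256 * SZ_1G };
	int i, copies = 0;

	for (i = 0; i < 3; i++)
		if (offsets[i] + 2 * zone_size <= dev_size)
			copies++;
	return copies;
}

For example, an emulated 8G device with 256M zones would get only the
primary copy, while a multi-terabyte drive gets all three.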


* Re: [PATCH v15 00/42] btrfs: zoned block device support
  2021-02-16 11:46               ` David Sterba
@ 2021-02-22  7:50                 ` Naohiro Aota
  2021-02-22 16:00                   ` David Sterba
  0 siblings, 1 reply; 72+ messages in thread
From: Naohiro Aota @ 2021-02-22  7:50 UTC (permalink / raw)
  To: dsterba, Johannes Thumshirn, linux-btrfs, dsterba, hare,
	linux-fsdevel, Damien Le Moal

On Tue, Feb 16, 2021 at 12:46:11PM +0100, David Sterba wrote:
> On Tue, Feb 16, 2021 at 01:33:28PM +0900, Naohiro Aota wrote:
> > On Mon, Feb 15, 2021 at 04:58:05PM +0000, Johannes Thumshirn wrote:
> > > On 11/02/2021 16:48, David Sterba wrote:
> > > > On Thu, Feb 11, 2021 at 03:26:04PM +0000, Johannes Thumshirn wrote:
> > > >> On 11/02/2021 16:21, David Sterba wrote:
> > > >>> On Thu, Feb 11, 2021 at 09:58:09AM +0000, Johannes Thumshirn wrote:
> > > >>>> On 10/02/2021 21:02, David Sterba wrote:
> > > >>>>>> This series implements superblock log writing. It uses two zones as a
> > > >>>>>> circular buffer to write updated superblocks. Once the first zone is filled
> > > >>>>>> up, start writing into the second zone. The first zone will be reset once
> > > >>>>>> both zones are filled. We can determine the postion of the latest
> > > >>>>>> superblock by reading the write pointer information from a device.
> > > >>>>>
> > > >>>>> About that, in this patchset it's still leaving superblock at the fixed
> > > >>>>> zone number while we want it at a fixed location, spanning 2 zones
> > > >>>>> regardless of their size.
> > > >>>>
> > > >>>> We'll always need 2 zones or otherwise we won't be powercut safe.
> > > >>>
> > > >>> Yes we do, that hasn't changed.
> > > >>
> > > >> OK that I don't understand, with the log structured superblocks on a zoned
> > > >> filesystem, we're writing a new superblock until the 1st zone is filled.
> > > >> Then we advance to the second zone. As soon as we wrote a superblock to
> > > >> the second zone we can reset the first.
> > > >> If we only use one zone,
> > > > 
> > > > No, that can't work and nobody suggests that.
> > > > 
> > > >> we would need to write until it's end, reset and
> > > >> start writing again from the beginning. But if a powercut happens between
> > > >> reset and first write after the reset, we end up with no superblock.
> > > > 
> > > > What I'm saying and what we discussed on slack in December, we can't fix
> > > > the zone number for the 1st and 2nd copy of superblock like it is now in
> > > > sb_zone_number.
> > > > 
> > > > The primary superblock must be there for any reference and to actually
> > > > let the tools learn about the incompat bits.
> > > > 
> > > > The 1st copy is now fixed zone 16, which depends on the zone size. The
> > > > idea is to define the superblock offsets to start at given offsets,
> > > > where the ring buffer has the two consecutive zones, regardless of their
> > > > size.
> > > > 
> > > > primary:		   0
> > > > 1st copy:		 16G
> > > > 2nd copy:		256G
> > > > 
> > > > Due to the variability of the zones in future devices, we'll reserve a
> > > > space at the superblock interval, assuming the zone sizes can grow up to
> > > > several gigabytes. Current working number is 1G, with some safety margin
> > > > the reserved ranges would be (eg. for a 4G zone size):
> > > > 
> > > > primary:		0 up to 8G
> > > > 1st copy:		16G up to 24G
> > > > 2nd copy:		256G up to 262G
> > > > 
> > > > It is wasteful but we want to be future proof and expecting disk sizes
> > > > from tens of terabytes to a hundred terabytes, it's not significant
> > > > loss of space.
> > > > 
> > > > If the zone sizes can be expected higher than 4G, the 1st copy can be
> > > > defined at 64G, that would leave us some margin until somebody thinks
> > > > that 32G zones are a great idea.
> > > > 
> > > 
> > > We've been talking about this today and our proposal would be as follows:
> > > Primary SB is two zones starting at LBA 0
> > > Seconday SB the two zones starting with the zone that contains the address 16G
> > 
> > For the secondary SB on a file system < 16GB, how do you think of
> > using the last two zones (or zones #2, #3 will do)? Then, we can
> > assure to have two SB copies even on such a file system.
> 
> For real hardware I think this is not relevant but for the emulated mode
> we need to deal with that case. The reserved size is wasteful and this
> will become noticeable for devices < 16G but I'd rather keep the logic
> simple and not care much about this corner case. So, the superblock
> range would be reserved and if there's not enough to store the secondary
> sb, then don't.

Sure, that works. I'm running xfstests with these new SB
locations. Once it passes, I'll post the patch.

One corner case left. What should we do with zone size > 8G? In this
case, the primary SB zones and the 1st copy SB zones overlap. I know
this is unrealistic for real hardware, but you can still create such a
device with null_blk.

1) Use the next zones (zones #2 and #3) as the primary SB zones
2) Do not write the primary SBs
3) Refuse to run mkfs

To keep the logic simple, would method #3 be appropriate here?

Technically, with a zone size > 128G all the log zones overlap. I'm
considering refusing to run mkfs in this insane case anyway.
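
A minimal sketch of what option 3 could look like at mkfs time (the function
name and the exact bound are assumptions; the 8G limit simply follows from
two zones having to fit below the copy anchored at 16G):

#include <stdbool.h>
#include <stdint.h>

#define SZ_1G	(1024ULL * 1024 * 1024)

/*
 * Sketch only: the primary superblock log covers bytes [0, 2 * zone_size),
 * so it stays clear of the copy anchored at 16G only while the zone size
 * is at most 8G. Reject anything larger at mkfs time.
 */
static bool sb_log_zone_size_supported(uint64_t zone_size)
{
	return 2 * zone_size <= 16 * SZ_1G;
}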


* Re: [PATCH v15 00/42] btrfs: zoned block device support
  2021-02-22  7:50                 ` Naohiro Aota
@ 2021-02-22 16:00                   ` David Sterba
  0 siblings, 0 replies; 72+ messages in thread
From: David Sterba @ 2021-02-22 16:00 UTC (permalink / raw)
  To: Naohiro Aota
  Cc: dsterba, Johannes Thumshirn, linux-btrfs, dsterba, hare,
	linux-fsdevel, Damien Le Moal

On Mon, Feb 22, 2021 at 04:50:43PM +0900, Naohiro Aota wrote:
> > For real hardware I think this is not relevant but for the emulated mode
> > we need to deal with that case. The reserved size is wasteful and this
> > will become noticeable for devices < 16G but I'd rather keep the logic
> > simple and not care much about this corner case. So, the superblock
> > range would be reserved and if there's not enough to store the secondary
> > sb, then don't.
> 
> Sure. That works. I'm running xfstests with these new SB
> locations. Once it passed, I'll post the patch.
> 
> One corner case left. What should we do with zone size > 8G? In this
> case, the primary SB zones and the 1st copy SB zones overlap. I know
> this is unrealistic for real hardware, but you can still create such a
> device with null_blk.
> 
> 1) Use the following zones (zones #2, #3) as the primary SB zones
> 2) Do not write the primary SBs
> 3) Reject to mkfs
> 
> To be simple logic, method #3 would be appropriate here?
> 
> Technically, all the log zones overlap with zone size > 128 GB. I'm
> considering to reject to mkfs in this insane case anyway.

The 8G zone size idea is to buy us some time to support future hardware;
once this no longer suffices we'll add an incompat bit like BIGZONES that
will allow larger zone sizes. At that time we'll probably have a better
idea about the exact number. So it's #3.


end of thread

Thread overview: 72+ messages
2021-02-04 10:21 [PATCH v15 00/42] btrfs: zoned block device support Naohiro Aota
2021-02-04 10:21 ` [PATCH v15 01/42] block: add bio_add_zone_append_page Naohiro Aota
2021-02-04 10:21   ` [PATCH v15 02/42] iomap: support REQ_OP_ZONE_APPEND Naohiro Aota
2021-02-04 10:21   ` [PATCH v15 03/42] btrfs: zoned: defer loading zone info after opening trees Naohiro Aota
2021-02-04 10:21   ` [PATCH v15 04/42] btrfs: zoned: use regular super block location on zone emulation Naohiro Aota
2021-02-04 10:21   ` [PATCH v15 05/42] btrfs: release path before calling to btrfs_load_block_group_zone_info Naohiro Aota
2021-02-04 10:21   ` [PATCH v15 06/42] btrfs: zoned: do not load fs_info::zoned from incompat flag Naohiro Aota
2021-02-04 10:21   ` [PATCH v15 07/42] btrfs: zoned: disallow fitrim on zoned filesystems Naohiro Aota
2021-02-04 10:21   ` [PATCH v15 08/42] btrfs: zoned: allow zoned filesystems on non-zoned block devices Naohiro Aota
2021-02-04 10:21   ` [PATCH v15 09/42] btrfs: zoned: implement zoned chunk allocator Naohiro Aota
2021-02-04 10:21   ` [PATCH v15 10/42] btrfs: zoned: verify device extent is aligned to zone Naohiro Aota
2021-02-04 10:21   ` [PATCH v15 11/42] btrfs: zoned: load zone's allocation offset Naohiro Aota
2021-02-04 10:21   ` [PATCH v15 12/42] btrfs: zoned: calculate allocation offset for conventional zones Naohiro Aota
2021-02-04 10:21   ` [PATCH v15 13/42] btrfs: zoned: track unusable bytes for zones Naohiro Aota
2021-02-04 10:21   ` [PATCH v15 14/42] btrfs: zoned: implement sequential extent allocation Naohiro Aota
2021-02-04 10:21   ` [PATCH v15 15/42] btrfs: zoned: redirty released extent buffers Naohiro Aota
2021-02-04 10:21   ` [PATCH v15 16/42] btrfs: zoned: advance allocation pointer after tree log node Naohiro Aota
2021-02-04 10:21   ` [PATCH v15 17/42] btrfs: zoned: reset zones of unused block groups Naohiro Aota
2021-02-04 10:21   ` [PATCH v15 18/42] btrfs: factor out helper adding a page to bio Naohiro Aota
2021-02-04 10:21   ` [PATCH v15 19/42] btrfs: zoned: use bio_add_zone_append_page Naohiro Aota
2021-02-04 10:21   ` [PATCH v15 20/42] btrfs: zoned: handle REQ_OP_ZONE_APPEND as writing Naohiro Aota
2021-02-04 10:22   ` [PATCH v15 21/42] btrfs: zoned: split ordered extent when bio is sent Naohiro Aota
2021-02-04 10:22   ` [PATCH v15 22/42] btrfs: zoned: check if bio spans across an ordered extent Naohiro Aota
2021-02-04 10:22   ` [PATCH v15 23/42] btrfs: extend btrfs_rmap_block for specifying a device Naohiro Aota
2021-02-04 10:22   ` [PATCH v15 24/42] btrfs: zoned: cache if block-group is on a sequential zone Naohiro Aota
2021-02-04 10:22   ` [PATCH v15 25/42] btrfs: save irq flags when looking up an ordered extent Naohiro Aota
2021-02-04 10:22   ` [PATCH v15 26/42] btrfs: zoned: use ZONE_APPEND write for zoned btrfs Naohiro Aota
2021-02-04 10:22   ` [PATCH v15 27/42] btrfs: zoned: enable zone append writing for direct IO Naohiro Aota
2021-02-04 10:22   ` [PATCH v15 28/42] btrfs: zoned: introduce dedicated data write path for zoned filesystems Naohiro Aota
2021-02-04 10:22   ` [PATCH v15 29/42] btrfs: zoned: serialize metadata IO Naohiro Aota
2021-02-04 10:22   ` [PATCH v15 30/42] btrfs: zoned: wait for existing extents before truncating Naohiro Aota
2021-02-04 10:22   ` [PATCH v15 31/42] btrfs: zoned: do not use async metadata checksum on zoned filesystems Naohiro Aota
2021-02-04 10:22   ` [PATCH v15 32/42] btrfs: zoned: mark block groups to copy for device-replace Naohiro Aota
2021-02-04 10:22   ` [PATCH v15 33/42] btrfs: zoned: implement cloning for zoned device-replace Naohiro Aota
2021-02-04 10:22   ` [PATCH v15 34/42] btrfs: zoned: implement copying " Naohiro Aota
2021-02-04 10:22   ` [PATCH v15 35/42] btrfs: zoned: support dev-replace in zoned filesystems Naohiro Aota
2021-02-04 10:22   ` [PATCH v15 36/42] btrfs: zoned: enable relocation on a zoned filesystem Naohiro Aota
2021-02-04 10:22   ` [PATCH v15 37/42] btrfs: zoned: relocate block group to repair IO failure in zoned filesystems Naohiro Aota
2021-02-04 10:22   ` [PATCH v15 38/42] btrfs: split alloc_log_tree() Naohiro Aota
2021-02-04 10:22   ` [PATCH v15 39/42] btrfs: zoned: extend zoned allocator to use dedicated tree-log block group Naohiro Aota
2021-02-04 10:22   ` [PATCH v15 40/42] btrfs: zoned: serialize log transaction on zoned filesystems Naohiro Aota
2021-02-04 11:50     ` Filipe Manana
2021-02-05  7:21       ` Naohiro Aota
2021-02-05  9:15     ` Naohiro Aota
2021-02-05 11:21       ` Filipe Manana
2021-02-09  1:49       ` David Sterba
2021-02-04 10:22   ` [PATCH v15 41/42] btrfs: zoned: reorder log node allocation on zoned filesystem Naohiro Aota
2021-02-04 11:57     ` Filipe Manana
2021-02-04 14:54       ` Johannes Thumshirn
2021-02-04 15:48         ` David Sterba
2021-02-04 15:51           ` Johannes Thumshirn
2021-02-04 10:22   ` [PATCH v15 42/42] btrfs: zoned: enable to mount ZONED incompat flag Naohiro Aota
2021-02-05  9:26   ` [PATCH v15 43/43] btrfs: zoned: deal with holes writing out tree-log pages Naohiro Aota
2021-02-05 11:49     ` Filipe Manana
2021-02-05 12:55       ` Naohiro Aota
2021-02-05 13:07         ` Filipe Manana
2021-02-05 14:19       ` Filipe Manana
2021-02-05 14:46         ` Naohiro Aota
2021-02-05 14:58     ` [PATCH v15.1 " Naohiro Aota
2021-02-05 16:25       ` Filipe Manana
2021-02-09  1:55       ` David Sterba
2021-02-10 19:58 ` [PATCH v15 00/42] btrfs: zoned block device support David Sterba
2021-02-11  9:58   ` Johannes Thumshirn
2021-02-11 15:19     ` David Sterba
2021-02-11 15:26       ` Johannes Thumshirn
2021-02-11 15:46         ` David Sterba
2021-02-15 16:58           ` Johannes Thumshirn
2021-02-15 17:02             ` David Sterba
2021-02-16  4:33             ` Naohiro Aota
2021-02-16 11:46               ` David Sterba
2021-02-22  7:50                 ` Naohiro Aota
2021-02-22 16:00                   ` David Sterba
