* [PATCHBOMB v3] xfsprogs: everything headed towards 6.9
@ 2024-04-16  0:51 Darrick J. Wong
  2024-04-16  0:57 ` [PATCHSET 1/4] xfsprogs: bug fixes for 6.8 Darrick J. Wong
                   ` (3 more replies)
  0 siblings, 4 replies; 36+ messages in thread
From: Darrick J. Wong @ 2024-04-16  0:51 UTC (permalink / raw)
  To: Carlos Maiolino, Christoph Hellwig; +Cc: xfs

Hi Carlos and Christoph,

This is a resend of my earlier patchset of everything that we need to
get xfsprogs up to a 6.8 release.  A handful of the v2 patches didn't
complete review, so this v3 patchbomb contains only the series that have
unreviewed patches:

[PATCHSET 1/4] xfsprogs: bug fixes for 6.8
  [PATCH 2/5] xfs_db: improve number extraction in getbitval
  [PATCH 3/5] xfs_scrub: fix threadcount estimates for phase 6

[PATCHSET 2/4] libxfs: sync with 6.9
  [PATCH 088/111] libxfs: teach buftargs to maintain their own buffer
  [PATCH 089/111] libxfs: add xfile support
  [PATCH 090/111] libxfs: partition memfd files to avoid using too many
  [PATCH 091/111] xfs: teach buftargs to maintain their own buffer
  [PATCH 092/111] libxfs: support in-memory buffer cache targets

(Only patch 90 lacks a review, but I decided to throw in a few more for
context.)

[PATCHSET v30.3 3/4] xfsprogs: bmap log intent cleanups
  [PATCH 1/4] libxfs: remove kmem_alloc, kmem_zalloc, and kmem_free
  [PATCH 2/4] libxfs: add a bi_entry helper
  [PATCH 3/4] libxfs: reuse xfs_bmap_update_cancel_item
  [PATCH 4/4] libxfs: add a xattr_entry helper

[PATCHSET v30.3 4/4] xfs_repair: minor fixes
  [PATCH 1/1] xfs_repair: check num before bplist[num]

--D


* [PATCHSET 1/4] xfsprogs: bug fixes for 6.8
  2024-04-16  0:51 [PATCHBOMB v3] xfsprogs: everything headed towards 6.9 Darrick J. Wong
@ 2024-04-16  0:57 ` Darrick J. Wong
  2024-04-16  0:58   ` [PATCH 1/5] xfs_repair: double-check with shortform attr verifiers Darrick J. Wong
                     ` (5 more replies)
  2024-04-16  0:58 ` [PATCHSET 2/4] libxfs: sync with 6.9 Darrick J. Wong
                   ` (2 subsequent siblings)
  3 siblings, 6 replies; 36+ messages in thread
From: Darrick J. Wong @ 2024-04-16  0:57 UTC (permalink / raw)
  To: djwong, cem
  Cc: Bill O'Donnell, Christoph Hellwig, cmaiolino, linux-xfs, hch

Hi all,

Bug fixes for xfsprogs for 6.8.

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

This has been running on the djcloud for months with no problems.  Enjoy!
Comments and questions are, as always, welcome.

--D

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=xfsprogs-6.8-fixes
---
Commits in this patchset:
 * xfs_repair: double-check with shortform attr verifiers
 * xfs_db: improve number extraction in getbitval
 * xfs_scrub: fix threadcount estimates for phase 6
 * xfs_scrub: don't fail while reporting media scan errors
 * xfs_io: add linux madvise advice codes
---
 db/bit.c             |   37 ++++++++++--------------
 io/madvise.c         |   77 +++++++++++++++++++++++++++++++++++++++++++++++++-
 repair/attr_repair.c |   17 +++++++++++
 scrub/phase6.c       |   36 ++++++++++++++++++-----
 4 files changed, 137 insertions(+), 30 deletions(-)



* [PATCHSET 2/4] libxfs: sync with 6.9
  2024-04-16  0:51 [PATCHBOMB v3] xfsprogs: everything headed towards 6.9 Darrick J. Wong
  2024-04-16  0:57 ` [PATCHSET 1/4] xfsprogs: bug fixes for 6.8 Darrick J. Wong
@ 2024-04-16  0:58 ` Darrick J. Wong
  2024-04-16  1:00   ` [PATCH 088/111] libxfs: teach buftargs to maintain their own buffer hashtable Darrick J. Wong
                     ` (4 more replies)
  2024-04-16  0:58 ` [PATCHSET v30.3 3/4] xfsprogs: bmap log intent cleanups Darrick J. Wong
  2024-04-16  0:58 ` [PATCHSET v30.3 4/4] xfs_repair: minor fixes Darrick J. Wong
  3 siblings, 5 replies; 36+ messages in thread
From: Darrick J. Wong @ 2024-04-16  0:58 UTC (permalink / raw)
  To: djwong, cem
  Cc: Gao Xiang, Dan Carpenter, Chandan Babu R, Matthew Wilcox (Oracle),
	Dave Chinner, Christoph Hellwig, cmaiolino, linux-xfs, hch

Hi all,

Synchronize libxfs with the kernel.

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

This has been running on the djcloud for months with no problems.  Enjoy!
Comments and questions are, as always, welcome.

--D

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=libxfs-6.9-sync
---
Commits in this patchset:
 * xfs: convert kmem_zalloc() to kzalloc()
 * xfs: convert kmem_alloc() to kmalloc()
 * xfs: convert remaining kmem_free() to kfree()
 * xfs: use __GFP_NOLOCKDEP instead of GFP_NOFS
 * xfs: use GFP_KERNEL in pure transaction contexts
 * xfs: clean up remaining GFP_NOFS users
 * xfs: use xfs_defer_alloc a bit more
 * xfs: Replace xfs_isilocked with xfs_assert_ilocked
 * xfs: create a static name for the dot entry too
 * xfs: create a predicate to determine if two xfs_names are the same
 * xfs: create a macro for decoding ftypes in tracepoints
 * xfs: report the health of quota counts
 * xfs: implement live quotacheck inode scan
 * xfs: report health of inode link counts
 * xfs: teach scrub to check file nlinks
 * xfs: separate the marking of sick and checked metadata
 * xfs: report fs corruption errors to the health tracking system
 * xfs: report ag header corruption errors to the health tracking system
 * xfs: report block map corruption errors to the health tracking system
 * xfs: report btree block corruption errors to the health system
 * xfs: report dir/attr block corruption errors to the health system
 * xfs: report inode corruption errors to the health system
 * xfs: report realtime metadata corruption errors to the health system
 * xfs: report XFS_IS_CORRUPT errors to the health system
 * xfs: add secondary and indirect classes to the health tracking system
 * xfs: remember sick inodes that get inactivated
 * xfs: update health status if we get a clean bill of health
 * xfs: consolidate btree block freeing tracepoints
 * xfs: consolidate btree block allocation tracepoints
 * xfs: set the btree cursor bc_ops in xfs_btree_alloc_cursor
 * xfs: drop XFS_BTREE_CRC_BLOCKS
 * xfs: encode the btree geometry flags in the btree ops structure
 * xfs: remove bc_ino.flags
 * xfs: consolidate the xfs_alloc_lookup_* helpers
 * xfs: turn the allocbt cursor active field into a btree flag
 * xfs: extern some btree ops structures
 * xfs: initialize btree blocks using btree_ops structure
 * xfs: rename btree block/buffer init functions
 * xfs: btree convert xfs_btree_init_block to xfs_btree_init_buf calls
 * xfs: remove the unnecessary daddr paramter to _init_block
 * xfs: set btree block buffer ops in _init_buf
 * xfs: move lru refs to the btree ops structure
 * xfs: move the btree stats offset into struct btree_ops
 * xfs: factor out a xfs_btree_owner helper
 * xfs: factor out a btree block owner check
 * xfs: store the btree pointer length in struct xfs_btree_ops
 * xfs: split out a btree type from the btree ops geometry flags
 * xfs: split the per-btree union in struct xfs_btree_cur
 * xfs: create predicate to determine if cursor is at inode root level
 * xfs: move comment about two 2 keys per pointer in the rmap btree
 * xfs: add a xfs_btree_init_ptr_from_cur
 * xfs: don't override bc_ops for staging btrees
 * xfs: fold xfs_allocbt_init_common into xfs_allocbt_init_cursor
 * xfs: remove xfs_allocbt_stage_cursor
 * xfs: fold xfs_inobt_init_common into xfs_inobt_init_cursor
 * xfs: remove xfs_inobt_stage_cursor
 * xfs: fold xfs_refcountbt_init_common into xfs_refcountbt_init_cursor
 * xfs: remove xfs_refcountbt_stage_cursor
 * xfs: fold xfs_rmapbt_init_common into xfs_rmapbt_init_cursor
 * xfs: remove xfs_rmapbt_stage_cursor
 * xfs: make full use of xfs_btree_stage_ifakeroot in xfs_bmbt_stage_cursor
 * xfs: make staging file forks explicit
 * xfs: fold xfs_bmbt_init_common into xfs_bmbt_init_cursor
 * xfs: remove xfs_bmbt_stage_cursor
 * xfs: split the agf_roots and agf_levels arrays
 * xfs: add a name field to struct xfs_btree_ops
 * xfs: add a sick_mask to struct xfs_btree_ops
 * xfs: split xfs_allocbt_init_cursor
 * xfs: remove xfs_inobt_cur
 * xfs: remove the btnum argument to xfs_inobt_count_blocks
 * xfs: split xfs_inobt_insert_sprec
 * xfs: split xfs_inobt_init_cursor
 * xfs: pass a 'bool is_finobt' to xfs_inobt_insert
 * xfs: remove xfs_btnum_t
 * xfs: simplify xfs_btree_check_sblock_siblings
 * xfs: simplify xfs_btree_check_lblock_siblings
 * xfs: open code xfs_btree_check_lptr in xfs_bmap_btree_to_extents
 * xfs: consolidate btree ptr checking
 * xfs: misc cleanups for __xfs_btree_check_sblock
 * xfs: remove the crc variable in __xfs_btree_check_lblock
 * xfs: tighten up validation of root block in inode forks
 * xfs: consolidate btree block verification
 * xfs: rename btree helpers that depends on the block number representation
 * xfs: factor out a __xfs_btree_check_lblock_hdr helper
 * xfs: remove xfs_btree_reada_bufl
 * xfs: remove xfs_btree_reada_bufs
 * xfs: move and rename xfs_btree_read_bufl
 * libxfs: teach buftargs to maintain their own buffer hashtable
 * libxfs: add xfile support
 * libxfs: partition memfd files to avoid using too many fds
 * xfs: teach buftargs to maintain their own buffer hashtable
 * libxfs: support in-memory buffer cache targets
 * xfs: add a xfs_btree_ptrs_equal helper
 * xfs: support in-memory btrees
 * xfs: launder in-memory btree buffers before transaction commit
 * xfs: create a helper to decide if a file mapping targets the rt volume
 * xfs: repair the rmapbt
 * xfs: create a shadow rmap btree during rmap repair
 * xfs: hook live rmap operations during a repair operation
 * xfs: clean up bmap log intent item tracepoint callsites
 * xfs: move xfs_bmap_defer_add to xfs_bmap_item.c
 * xfs: fix xfs_bunmapi to allow unmapping of partial rt extents
 * xfs: add a realtime flag to the bmap update log redo items
 * xfs: support deferred bmap updates on the attr fork
 * xfs: xfs_bmap_finish_one should map unwritten extents properly
 * xfs: move xfs_symlink_remote.c declarations to xfs_symlink_remote.h
 * xfs: move remote symlink target read function to libxfs
 * xfs: move symlink target write function to libxfs
 * xfs: xfs_btree_bload_prep_block() should use __GFP_NOFAIL
 * xfs: shrink failure needs to hold AGI buffer
 * xfs: allow sunit mount option to repair bad primary sb stripe values
---
 copy/xfs_copy.c             |    4 
 db/agf.c                    |   28 -
 db/bmap_inflate.c           |    8 
 db/check.c                  |   14 -
 db/freesp.c                 |    8 
 db/metadump.c               |   12 
 include/kmem.h              |    5 
 include/libxfs.h            |    4 
 include/xfs_mount.h         |    5 
 include/xfs_trace.h         |   17 -
 include/xfs_trans.h         |    1 
 libxfs/Makefile             |    9 
 libxfs/buf_mem.c            |  313 ++++++++++++
 libxfs/buf_mem.h            |   30 +
 libxfs/defer_item.c         |   15 +
 libxfs/defer_item.h         |   13 +
 libxfs/init.c               |   52 +-
 libxfs/libxfs_api_defs.h    |   10 
 libxfs/libxfs_io.h          |   42 +-
 libxfs/libxfs_priv.h        |   19 -
 libxfs/logitem.c            |    2 
 libxfs/rdwr.c               |   86 ++-
 libxfs/trans.c              |   40 ++
 libxfs/util.c               |   10 
 libxfs/xfile.c              |  393 +++++++++++++++
 libxfs/xfile.h              |   34 +
 libxfs/xfs_ag.c             |   79 ++-
 libxfs/xfs_ag.h             |   18 -
 libxfs/xfs_alloc.c          |  258 ++++++----
 libxfs/xfs_alloc_btree.c    |  191 ++++---
 libxfs/xfs_alloc_btree.h    |   10 
 libxfs/xfs_attr.c           |    5 
 libxfs/xfs_attr_leaf.c      |   22 +
 libxfs/xfs_attr_remote.c    |   37 +
 libxfs/xfs_bmap.c           |  365 ++++++++++----
 libxfs/xfs_bmap.h           |   19 +
 libxfs/xfs_bmap_btree.c     |  152 ++----
 libxfs/xfs_bmap_btree.h     |    5 
 libxfs/xfs_btree.c          | 1097 ++++++++++++++++++++++++++-----------------
 libxfs/xfs_btree.h          |  274 +++++------
 libxfs/xfs_btree_mem.c      |  346 ++++++++++++++
 libxfs/xfs_btree_mem.h      |   75 +++
 libxfs/xfs_btree_staging.c  |  133 +----
 libxfs/xfs_btree_staging.h  |   10 
 libxfs/xfs_da_btree.c       |   59 ++
 libxfs/xfs_da_format.h      |   11 
 libxfs/xfs_defer.c          |   25 -
 libxfs/xfs_dir2.c           |   59 +-
 libxfs/xfs_dir2.h           |   13 +
 libxfs/xfs_dir2_block.c     |    8 
 libxfs/xfs_dir2_data.c      |    3 
 libxfs/xfs_dir2_leaf.c      |    3 
 libxfs/xfs_dir2_node.c      |    7 
 libxfs/xfs_dir2_sf.c        |   16 -
 libxfs/xfs_format.h         |   21 -
 libxfs/xfs_fs.h             |    8 
 libxfs/xfs_health.h         |   95 ++++
 libxfs/xfs_ialloc.c         |  232 ++++++---
 libxfs/xfs_ialloc_btree.c   |  173 +++----
 libxfs/xfs_ialloc_btree.h   |   11 
 libxfs/xfs_iext_tree.c      |   26 +
 libxfs/xfs_inode_buf.c      |   12 
 libxfs/xfs_inode_fork.c     |   49 +-
 libxfs/xfs_inode_fork.h     |    1 
 libxfs/xfs_log_format.h     |    4 
 libxfs/xfs_refcount.c       |   69 ++-
 libxfs/xfs_refcount_btree.c |   78 +--
 libxfs/xfs_refcount_btree.h |    2 
 libxfs/xfs_rmap.c           |  284 +++++++++--
 libxfs/xfs_rmap.h           |   31 +
 libxfs/xfs_rmap_btree.c     |  240 +++++++--
 libxfs/xfs_rmap_btree.h     |    8 
 libxfs/xfs_rtbitmap.c       |   11 
 libxfs/xfs_sb.c             |   42 +-
 libxfs/xfs_sb.h             |    5 
 libxfs/xfs_shared.h         |   67 ++-
 libxfs/xfs_symlink_remote.c |  155 ++++++
 libxfs/xfs_symlink_remote.h |   26 +
 libxfs/xfs_trans_inode.c    |    6 
 libxfs/xfs_types.h          |   26 -
 logprint/log_misc.c         |    8 
 logprint/log_print_all.c    |    8 
 mkfs/xfs_mkfs.c             |    8 
 repair/agbtree.c            |   28 +
 repair/bmap_repair.c        |    4 
 repair/bulkload.c           |    2 
 repair/phase5.c             |   28 +
 repair/phase6.c             |    4 
 repair/prefetch.c           |   12 
 repair/prefetch.h           |    1 
 repair/progress.c           |   14 -
 repair/progress.h           |    2 
 repair/scan.c               |   18 -
 repair/xfs_repair.c         |   47 +-
 94 files changed, 4425 insertions(+), 1915 deletions(-)
 create mode 100644 libxfs/buf_mem.c
 create mode 100644 libxfs/buf_mem.h
 create mode 100644 libxfs/defer_item.h
 create mode 100644 libxfs/xfile.c
 create mode 100644 libxfs/xfile.h
 create mode 100644 libxfs/xfs_btree_mem.c
 create mode 100644 libxfs/xfs_btree_mem.h
 create mode 100644 libxfs/xfs_symlink_remote.h



* [PATCHSET v30.3 3/4] xfsprogs: bmap log intent cleanups
  2024-04-16  0:51 [PATCHBOMB v3] xfsprogs: everything headed towards 6.9 Darrick J. Wong
  2024-04-16  0:57 ` [PATCHSET 1/4] xfsprogs: bug fixes for 6.8 Darrick J. Wong
  2024-04-16  0:58 ` [PATCHSET 2/4] libxfs: sync with 6.9 Darrick J. Wong
@ 2024-04-16  0:58 ` Darrick J. Wong
  2024-04-16  1:01   ` [PATCH 1/4] libxfs: remove kmem_alloc, kmem_zalloc, and kmem_free Darrick J. Wong
                     ` (3 more replies)
  2024-04-16  0:58 ` [PATCHSET v30.3 4/4] xfs_repair: minor fixes Darrick J. Wong
  3 siblings, 4 replies; 36+ messages in thread
From: Darrick J. Wong @ 2024-04-16  0:58 UTC (permalink / raw)
  To: djwong, cem; +Cc: cmaiolino, linux-xfs, hch

Hi all,

The next major target of online repair is metadata that is persisted
in blocks mapped by a file fork.  In other words, we want to repair
directories, extended attributes, symbolic links, and the realtime free
space information.  For file-based metadata, we assume that the space
metadata is correct, which enables repair to construct new versions of
the metadata in a temporary file.  We then need to swap the file fork
mappings of the two files atomically.  With this patchset, we begin
constructing such a facility based on the existing bmap log items and a
new extent swap log item.

This series cleans up a few parts of the file block mapping log intent
code before we start adding support for realtime bmap intents.  Most of
it involves cleaning up tracepoints so that more of the data extraction
logic ends up in the tracepoint code and not the tracepoint call site,
which should reduce overhead further when tracepoints are disabled.
There is also a change to pass bmap intents all the way back to the bmap
code instead of unboxing the intent values and re-boxing them after the
_finish_one function completes.
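
As a rough illustration of the tracepoint cleanup described above, the
sketch below hands the whole intent to the trace helper, so the fields
are only pulled apart when tracing is switched on.  Every name in it
(sketch_intent, sketch_trace_bmap, sketch_trace_enabled) is hypothetical
and is not the kernel's tracepoint API; it only shows the shape of the
idea.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a bmap intent; not the real xfs_bmap_intent. */
struct sketch_intent {
	uint64_t	ino;
	uint64_t	startoff;
	uint64_t	blockcount;
};

/* Hypothetical trace state standing in for the tracing infrastructure. */
static bool		sketch_trace_enabled;
static unsigned int	sketch_trace_events;
static uint64_t		sketch_last_ino;

/*
 * The call site passes the intent itself.  Field extraction lives here,
 * behind the enabled check, instead of at every caller.
 */
static void
sketch_trace_bmap(
	const struct sketch_intent	*bi)
{
	if (!sketch_trace_enabled)
		return;

	sketch_last_ino = bi->ino;
	sketch_trace_events++;
}
```

A caller would simply write sketch_trace_bmap(bi) and never touch the
individual fields when tracing is off.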

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

This has been running on the djcloud for months with no problems.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=bmap-intent-cleanups

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=bmap-intent-cleanups
---
Commits in this patchset:
 * libxfs: remove kmem_alloc, kmem_zalloc, and kmem_free
 * libxfs: add a bi_entry helper
 * libxfs: reuse xfs_bmap_update_cancel_item
 * libxfs: add a xattr_entry helper
---
 db/bmap_inflate.c         |    2 +-
 include/kmem.h            |   10 +-------
 libxfs/defer_item.c       |   58 ++++++++++++++++++++++++---------------------
 libxfs/init.c             |    2 +-
 libxfs/kmem.c             |   32 ++++++++-----------------
 libxlog/xfs_log_recover.c |   19 +++++++--------
 repair/bmap_repair.c      |    4 ++-
 7 files changed, 55 insertions(+), 72 deletions(-)



* [PATCHSET v30.3 4/4] xfs_repair: minor fixes
  2024-04-16  0:51 [PATCHBOMB v3] xfsprogs: everything headed towards 6.9 Darrick J. Wong
                   ` (2 preceding siblings ...)
  2024-04-16  0:58 ` [PATCHSET v30.3 3/4] xfsprogs: bmap log intent cleanups Darrick J. Wong
@ 2024-04-16  0:58 ` Darrick J. Wong
  2024-04-16  1:02   ` [PATCH 1/1] xfs_repair: check num before bplist[num] Darrick J. Wong
  3 siblings, 1 reply; 36+ messages in thread
From: Darrick J. Wong @ 2024-04-16  0:58 UTC (permalink / raw)
  To: djwong, cem; +Cc: cmaiolino, linux-xfs, hch

Hi all,

Fix some random minor problems in xfs_repair.

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

This has been running on the djcloud for months with no problems.  Enjoy!
Comments and questions are, as always, welcome.

--D

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=repair-fixes
---
Commits in this patchset:
 * xfs_repair: check num before bplist[num]
---
 repair/prefetch.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)



* [PATCH 1/5] xfs_repair: double-check with shortform attr verifiers
  2024-04-16  0:57 ` [PATCHSET 1/4] xfsprogs: bug fixes for 6.8 Darrick J. Wong
@ 2024-04-16  0:58   ` Darrick J. Wong
  2024-04-16  0:59   ` [PATCH 2/5] xfs_db: improve number extraction in getbitval Darrick J. Wong
                     ` (4 subsequent siblings)
  5 siblings, 0 replies; 36+ messages in thread
From: Darrick J. Wong @ 2024-04-16  0:58 UTC (permalink / raw)
  To: djwong, cem
  Cc: Christoph Hellwig, Bill O'Donnell, cmaiolino, linux-xfs, hch

From: Darrick J. Wong <djwong@kernel.org>

Call the shortform attr structure verifier as the last thing we do in
process_shortform_attr to make sure that we don't leave any latent
errors for the kernel to stumble over.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bill O'Donnell <bodonnel@redhat.com>
---
 repair/attr_repair.c |   17 +++++++++++++++++
 1 file changed, 17 insertions(+)


diff --git a/repair/attr_repair.c b/repair/attr_repair.c
index 01e4afb90d5c..f117f9aef9ce 100644
--- a/repair/attr_repair.c
+++ b/repair/attr_repair.c
@@ -212,6 +212,7 @@ process_shortform_attr(
 {
 	struct xfs_attr_sf_hdr		*hdr = XFS_DFORK_APTR(dip);
 	struct xfs_attr_sf_entry	*currententry, *nextentry, *tempentry;
+	xfs_failaddr_t			fa;
 	int				i, junkit;
 	int				currentsize, remainingspace;
 
@@ -373,6 +374,22 @@ process_shortform_attr(
 		}
 	}
 
+	fa = libxfs_attr_shortform_verify(hdr, be16_to_cpu(hdr->totsize));
+	if (fa) {
+		if (no_modify) {
+			do_warn(
+	_("inode %" PRIu64 " shortform attr verifier failure, would have cleared attrs\n"),
+				ino);
+		} else {
+			do_warn(
+	_("inode %" PRIu64 " shortform attr verifier failure, cleared attrs\n"),
+				ino);
+			hdr->count = 0;
+			hdr->totsize = cpu_to_be16(sizeof(struct xfs_attr_sf_hdr));
+			*repair = 1;
+		}
+	}
+
 	return(*repair);
 }
 



* [PATCH 2/5] xfs_db: improve number extraction in getbitval
  2024-04-16  0:57 ` [PATCHSET 1/4] xfsprogs: bug fixes for 6.8 Darrick J. Wong
  2024-04-16  0:58   ` [PATCH 1/5] xfs_repair: double-check with shortform attr verifiers Darrick J. Wong
@ 2024-04-16  0:59   ` Darrick J. Wong
  2024-04-16  4:53     ` Christoph Hellwig
  2024-04-16  0:59   ` [PATCH 3/5] xfs_scrub: fix threadcount estimates for phase 6 Darrick J. Wong
                     ` (3 subsequent siblings)
  5 siblings, 1 reply; 36+ messages in thread
From: Darrick J. Wong @ 2024-04-16  0:59 UTC (permalink / raw)
  To: djwong, cem; +Cc: cmaiolino, linux-xfs, hch

From: Darrick J. Wong <djwong@kernel.org>

For some reason, getbitval only reads a u64 directly from a pointer if
that pointer is aligned to a 16-byte boundary.  If not, then it resorts
to scraping bits individually.  I don't know of any platform where we
require 16-byte alignment for an 8-byte access, or why we'd care now
that we have things like get_unaligned_beXX.

Rework this function to detect byte-aligned accesses and use the
get_unaligned_beXX functions, which work whether or not the pointer is
naturally aligned.  Only fall back to the bit scraping algorithm for the
really weird cases.
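
For readers unfamiliar with the unaligned helpers, here is a minimal
sketch of what a byte-wise big-endian read looks like.  The sketch_*
functions are hypothetical stand-ins written for this illustration; they
are not the get_unaligned_beXX implementations that xfsprogs uses.

```c
#include <stdint.h>

/*
 * Hypothetical stand-in for get_unaligned_be32(): assemble a big-endian
 * 32-bit value one byte at a time, so the pointer may have any alignment.
 */
static inline uint32_t
sketch_get_unaligned_be32(
	const void	*p)
{
	const uint8_t	*b = p;

	return ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
	       ((uint32_t)b[2] << 8)  |  (uint32_t)b[3];
}

/* Same idea for 64 bits, built from two 32-bit halves. */
static inline uint64_t
sketch_get_unaligned_be64(
	const void	*p)
{
	const uint8_t	*b = p;

	return ((uint64_t)sketch_get_unaligned_be32(b) << 32) |
		sketch_get_unaligned_be32(b + 4);
}
```

Because the value is built from individual bytes, the only requirement
is that the field start on a byte boundary, which is the bit == 0 case.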

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 db/bit.c |   37 ++++++++++++++++---------------------
 1 file changed, 16 insertions(+), 21 deletions(-)


diff --git a/db/bit.c b/db/bit.c
index c9bfd2eb025f..1b9ca054f3b1 100644
--- a/db/bit.c
+++ b/db/bit.c
@@ -55,39 +55,34 @@ getbitval(
 	char		*p;
 	int64_t		rval;
 	int		signext;
-	int		z1, z2, z3, z4;
 
 	ASSERT(nbits<=64);
 
 	p = (char *)obj + byteize(bitoff);
 	bit = bitoffs(bitoff);
 	signext = (flags & BVSIGNED) != 0;
-	z4 = ((intptr_t)p & 0xf) == 0 && bit == 0;
-	if (nbits == 64 && z4)
-		return be64_to_cpu(*(__be64 *)p);
-	z3 = ((intptr_t)p & 0x7) == 0 && bit == 0;
-	if (nbits == 32 && z3) {
+
+	if (bit != 0)
+		goto scrape_bits;
+
+	switch (nbits) {
+	case 64:
+		return get_unaligned_be64(p);
+	case 32:
 		if (signext)
-			return (__s32)be32_to_cpu(*(__be32 *)p);
-		else
-			return (__u32)be32_to_cpu(*(__be32 *)p);
-	}
-	z2 = ((intptr_t)p & 0x3) == 0 && bit == 0;
-	if (nbits == 16 && z2) {
+			return (__s32)get_unaligned_be32(p);
+		return (__u32)get_unaligned_be32(p);
+	case 16:
 		if (signext)
-			return (__s16)be16_to_cpu(*(__be16 *)p);
-		else
-			return (__u16)be16_to_cpu(*(__be16 *)p);
-	}
-	z1 = ((intptr_t)p & 0x1) == 0 && bit == 0;
-	if (nbits == 8 && z1) {
+			return (__s16)get_unaligned_be16(p);
+		return (__u16)get_unaligned_be16(p);
+	case 8:
 		if (signext)
 			return *(__s8 *)p;
-		else
-			return *(__u8 *)p;
+		return *(__u8 *)p;
 	}
 
-
+scrape_bits:
 	for (i = 0, rval = 0LL; i < nbits; i++) {
 		if (getbit_l(p, bit + i)) {
 			/* If the last bit is on and we care about sign



* [PATCH 3/5] xfs_scrub: fix threadcount estimates for phase 6
  2024-04-16  0:57 ` [PATCHSET 1/4] xfsprogs: bug fixes for 6.8 Darrick J. Wong
  2024-04-16  0:58   ` [PATCH 1/5] xfs_repair: double-check with shortform attr verifiers Darrick J. Wong
  2024-04-16  0:59   ` [PATCH 2/5] xfs_db: improve number extraction in getbitval Darrick J. Wong
@ 2024-04-16  0:59   ` Darrick J. Wong
  2024-04-16  4:53     ` Christoph Hellwig
  2024-04-16  0:59   ` [PATCH 4/5] xfs_scrub: don't fail while reporting media scan errors Darrick J. Wong
                     ` (2 subsequent siblings)
  5 siblings, 1 reply; 36+ messages in thread
From: Darrick J. Wong @ 2024-04-16  0:59 UTC (permalink / raw)
  To: djwong, cem; +Cc: cmaiolino, linux-xfs, hch

From: Darrick J. Wong <djwong@kernel.org>

If a filesystem has a realtime device or an external log device, the
media scan can start up a separate readverify controller (and workqueue)
to handle that.  Each of those controllers can call progress_add, so we
need to bump up nr_threads so that the progress reports controller knows
to make its ptvar big enough to handle all these threads.
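
To see why an undercount matters, consider a per-thread counter that
hands out one slot per worker.  The sketch below assumes a fixed-size
slot array sized from the thread estimate; the sketch_* names are
hypothetical and this is not the actual ptvar implementation.  It is
single-threaded for brevity.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical per-thread counter with one slot per expected thread. */
struct sketch_ptcounter {
	size_t		nr_slots;
	size_t		nr_used;
	uint64_t	*slots;
};

/* Sum the worker estimates; pass 0 for a device that is not present. */
static size_t
sketch_estimate_threads(
	size_t		data_heads,
	size_t		rt_heads,
	size_t		log_heads)
{
	return data_heads + rt_heads + log_heads;
}

static int
sketch_ptcounter_init(
	struct sketch_ptcounter	*pc,
	size_t			nr_threads)
{
	pc->slots = calloc(nr_threads, sizeof(uint64_t));
	if (!pc->slots)
		return -1;
	pc->nr_slots = nr_threads;
	pc->nr_used = 0;
	return 0;
}

/* Each new worker claims a slot; returns NULL once the slots run out. */
static uint64_t *
sketch_ptcounter_claim(
	struct sketch_ptcounter	*pc)
{
	if (pc->nr_used >= pc->nr_slots)
		return NULL;
	return &pc->slots[pc->nr_used++];
}
```

If the estimate covers only the data device, a worker from the realtime
or log controller is the one that finds no slot left.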

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 scrub/phase6.c |   10 ++++++++++
 1 file changed, 10 insertions(+)


diff --git a/scrub/phase6.c b/scrub/phase6.c
index 99a32bc79620..393d9eaa83d8 100644
--- a/scrub/phase6.c
+++ b/scrub/phase6.c
@@ -743,7 +743,17 @@ phase6_estimate(
 
 	*items = cvt_off_fsb_to_b(&ctx->mnt,
 			(d_blocks - d_bfree) + (r_blocks - r_bfree));
+
+	/*
+	 * Each read-verify pool starts a thread pool, and each worker thread
+	 * can contribute to the progress counter.  Hence we need to set
+	 * nr_threads appropriately to handle that many threads.
+	 */
 	*nr_threads = disk_heads(ctx->datadev);
+	if (ctx->rtdev)
+		*nr_threads += disk_heads(ctx->rtdev);
+	if (ctx->logdev)
+		*nr_threads += disk_heads(ctx->logdev);
 	*rshift = 20;
 	return 0;
 }



* [PATCH 4/5] xfs_scrub: don't fail while reporting media scan errors
  2024-04-16  0:57 ` [PATCHSET 1/4] xfsprogs: bug fixes for 6.8 Darrick J. Wong
                     ` (2 preceding siblings ...)
  2024-04-16  0:59   ` [PATCH 3/5] xfs_scrub: fix threadcount estimates for phase 6 Darrick J. Wong
@ 2024-04-16  0:59   ` Darrick J. Wong
  2024-04-16  0:59   ` [PATCH 5/5] xfs_io: add linux madvise advice codes Darrick J. Wong
  2024-04-17  7:34   ` [PATCHSET 1/4] xfsprogs: bug fixes for 6.8 Carlos Maiolino
  5 siblings, 0 replies; 36+ messages in thread
From: Darrick J. Wong @ 2024-04-16  0:59 UTC (permalink / raw)
  To: djwong, cem; +Cc: Christoph Hellwig, cmaiolino, linux-xfs, hch

From: Darrick J. Wong <djwong@kernel.org>

If we can't open a file to report that it has media errors, just log
that fact and move on.  In this case we want to keep going with phase 6
so we report as many errors as possible.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 scrub/phase6.c |   26 ++++++++++++++++++--------
 1 file changed, 18 insertions(+), 8 deletions(-)


diff --git a/scrub/phase6.c b/scrub/phase6.c
index 393d9eaa83d8..193d3b4e9083 100644
--- a/scrub/phase6.c
+++ b/scrub/phase6.c
@@ -21,6 +21,7 @@
 #include "read_verify.h"
 #include "spacemap.h"
 #include "vfs.h"
+#include "common.h"
 
 /*
  * Phase 6: Verify data file integrity.
@@ -291,13 +292,14 @@ report_inode_loss(
 	/* Try to open the inode. */
 	fd = scrub_open_handle(handle);
 	if (fd < 0) {
-		error = errno;
-		if (error == ESTALE)
-			return error;
+		/* Handle is stale, try again. */
+		if (errno == ESTALE)
+			return ESTALE;
 
-		str_info(ctx, descr,
-_("Disappeared during read error reporting."));
-		return error;
+		str_error(ctx, descr,
+ _("Could not open to report read errors: %s."),
+				strerror(errno));
+		return 0;
 	}
 
 	/* Go find the badness. */
@@ -353,10 +355,18 @@ report_dirent_loss(
 	fd = openat(dir_fd, dirent->d_name,
 			O_RDONLY | O_NOATIME | O_NOFOLLOW | O_NOCTTY);
 	if (fd < 0) {
+		char		descr[PATH_MAX + 1];
+
 		if (errno == ENOENT)
 			return 0;
-		str_errno(ctx, path);
-		return errno;
+
+		snprintf(descr, PATH_MAX, "%s/%s", path, dirent->d_name);
+		descr[PATH_MAX] = 0;
+
+		str_error(ctx, descr,
+ _("Could not open to report read errors: %s."),
+				strerror(errno));
+		return 0;
 	}
 
 	/* Go find the badness. */



* [PATCH 5/5] xfs_io: add linux madvise advice codes
  2024-04-16  0:57 ` [PATCHSET 1/4] xfsprogs: bug fixes for 6.8 Darrick J. Wong
                     ` (3 preceding siblings ...)
  2024-04-16  0:59   ` [PATCH 4/5] xfs_scrub: don't fail while reporting media scan errors Darrick J. Wong
@ 2024-04-16  0:59   ` Darrick J. Wong
  2024-04-17  7:34   ` [PATCHSET 1/4] xfsprogs: bug fixes for 6.8 Carlos Maiolino
  5 siblings, 0 replies; 36+ messages in thread
From: Darrick J. Wong @ 2024-04-16  0:59 UTC (permalink / raw)
  To: djwong, cem; +Cc: Christoph Hellwig, cmaiolino, linux-xfs, hch

From: Darrick J. Wong <djwong@kernel.org>

Add all the Linux-specific madvise codes.  We're going to need
MADV_POPULATE_READ for a regression test.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 io/madvise.c |   77 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 76 insertions(+), 1 deletion(-)


diff --git a/io/madvise.c b/io/madvise.c
index 6e9c5b121d72..ede233955ced 100644
--- a/io/madvise.c
+++ b/io/madvise.c
@@ -9,6 +9,7 @@
 #include <sys/mman.h>
 #include "init.h"
 #include "io.h"
+#include <asm/mman.h>
 
 static cmdinfo_t madvise_cmd;
 
@@ -26,6 +27,31 @@ madvise_help(void)
 " -r -- expect random page references (POSIX_MADV_RANDOM)\n"
 " -s -- expect sequential page references (POSIX_MADV_SEQUENTIAL)\n"
 " -w -- will need these pages (POSIX_MADV_WILLNEED) [*]\n"
+"\n"
+"The following Linux-specific advise values are available:\n"
+#ifdef MADV_COLLAPSE
+" -c -- try to collapse range into transparent hugepages (MADV_COLLAPSE)\n"
+#endif
+#ifdef MADV_COLD
+" -D -- deactivate the range (MADV_COLD)\n"
+#endif
+" -f -- free the range (MADV_FREE)\n"
+" -h -- enable transparent hugepages (MADV_HUGEPAGE)\n"
+" -H -- disable transparent hugepages (MADV_NOHUGEPAGE)\n"
+" -m -- mark the range mergeable (MADV_MERGEABLE)\n"
+" -M -- mark the range unmergeable (MADV_UNMERGEABLE)\n"
+" -o -- mark the range offline (MADV_SOFT_OFFLINE)\n"
+" -p -- punch a hole in the file (MADV_REMOVE)\n"
+" -P -- poison the page cache (MADV_HWPOISON)\n"
+#ifdef MADV_POPULATE_READ
+" -R -- prefault in the range for read (MADV_POPULATE_READ)\n"
+#endif
+#ifdef MADV_POPULATE_WRITE
+" -W -- prefault in the range for write (MADV_POPULATE_WRITE)\n"
+#endif
+#ifdef MADV_PAGEOUT
+" -X -- reclaim the range (MADV_PAGEOUT)\n"
+#endif
 " Notes:\n"
 "   NORMAL sets the default readahead setting on the file.\n"
 "   RANDOM sets the readahead setting on the file to zero.\n"
@@ -45,20 +71,69 @@ madvise_f(
 	int		advise = MADV_NORMAL, c;
 	size_t		blocksize, sectsize;
 
-	while ((c = getopt(argc, argv, "drsw")) != EOF) {
+	while ((c = getopt(argc, argv, "cdDfhHmMopPrRswWX")) != EOF) {
 		switch (c) {
+#ifdef MADV_COLLAPSE
+		case 'c':	/* collapse to thp */
+			advise = MADV_COLLAPSE;
+			break;
+#endif
 		case 'd':	/* Don't need these pages */
 			advise = MADV_DONTNEED;
 			break;
+#ifdef MADV_COLD
+		case 'D':	/* make more likely to be reclaimed */
+			advise = MADV_COLD;
+			break;
+#endif
+		case 'f':	/* page range out of memory */
+			advise = MADV_FREE;
+			break;
+		case 'h':	/* enable thp memory */
+			advise = MADV_HUGEPAGE;
+			break;
+		case 'H':	/* disable thp memory */
+			advise = MADV_NOHUGEPAGE;
+			break;
+		case 'm':	/* enable merging */
+			advise = MADV_MERGEABLE;
+			break;
+		case 'M':	/* disable merging */
+			advise = MADV_UNMERGEABLE;
+			break;
+		case 'o':	/* offline */
+			advise = MADV_SOFT_OFFLINE;
+			break;
+		case 'p':	/* punch hole */
+			advise = MADV_REMOVE;
+			break;
+		case 'P':	/* poison */
+			advise = MADV_HWPOISON;
+			break;
 		case 'r':	/* Expect random page references */
 			advise = MADV_RANDOM;
 			break;
+#ifdef MADV_POPULATE_READ
+		case 'R':	/* fault in pages for read */
+			advise = MADV_POPULATE_READ;
+			break;
+#endif
 		case 's':	/* Expect sequential page references */
 			advise = MADV_SEQUENTIAL;
 			break;
 		case 'w':	/* Will need these pages */
 			advise = MADV_WILLNEED;
 			break;
+#ifdef MADV_POPULATE_WRITE
+		case 'W':	/* fault in pages for write */
+			advise = MADV_POPULATE_WRITE;
+			break;
+#endif
+#ifdef MADV_PAGEOUT
+		case 'X':	/* reclaim memory */
+			advise = MADV_PAGEOUT;
+			break;
+#endif
 		default:
 			exitcode = 1;
 			return command_usage(&madvise_cmd);
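
Note on the hunk above: the newer advice values are wrapped in #ifdef so
that the tool still builds against older headers.  A minimal standalone
sketch of that pattern follows.  The helper names advice_for_flag() and
advise_scratch_mapping() are hypothetical and are not part of xfs_io;
only madvise(2), mmap(2) and the MADV_* constants are real.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of xfs_io): translate an option letter
 * into a madvise(2) advice value.  Returns 0 on success, or -1 if the
 * letter is unknown or the build headers lack the advice value.
 */
static int
advice_for_flag(
	int		c,
	int		*advice)
{
	switch (c) {
	case 'd':
		*advice = MADV_DONTNEED;
		return 0;
	case 'r':
		*advice = MADV_RANDOM;
		return 0;
	case 's':
		*advice = MADV_SEQUENTIAL;
		return 0;
	case 'w':
		*advice = MADV_WILLNEED;
		return 0;
#ifdef MADV_PAGEOUT
	case 'X':	/* compiled in only when the headers define it */
		*advice = MADV_PAGEOUT;
		return 0;
#endif
	default:
		return -1;
	}
}

/*
 * Map a few anonymous pages, apply the advice, and unmap them again.
 * Returns 0 or a negative errno.
 */
static int
advise_scratch_mapping(
	int		advice)
{
	size_t		len = 4 * (size_t)sysconf(_SC_PAGESIZE);
	void		*p;
	int		error = 0;

	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -errno;
	if (madvise(p, len, advice))
		error = -errno;
	munmap(p, len);
	return error;
}
```

With this shape, a letter whose advice value is missing at build time
falls through to the default case, much as an unsupported flag ends up
in command_usage() in the patch.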



* [PATCH 088/111] libxfs: teach buftargs to maintain their own buffer hashtable
  2024-04-16  0:58 ` [PATCHSET 2/4] libxfs: sync with 6.9 Darrick J. Wong
@ 2024-04-16  1:00   ` Darrick J. Wong
  2024-04-16  1:00   ` [PATCH 089/111] libxfs: add xfile support Darrick J. Wong
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 36+ messages in thread
From: Darrick J. Wong @ 2024-04-16  1:00 UTC (permalink / raw)
  To: djwong, cem; +Cc: Christoph Hellwig, cmaiolino, linux-xfs, hch

From: Darrick J. Wong <djwong@kernel.org>

Currently, cached buffers are indexed with a single global bcache
structure.  This works ok for the limited use case where we only support
reading from the data device, but will fail badly when we want to
support buffers from in-memory btrees.  Move the bcache structure into
the buftarg.

As a side effect, we don't need to compare buftarg->bt_bdev anymore
since libxfs is careful enough not to create more than one buftarg per
open fd.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 include/libxfs.h    |    1 -
 libxfs/init.c       |   48 +++++++++++++++++++++---------------------------
 libxfs/libxfs_io.h  |   10 ++++++----
 libxfs/logitem.c    |    2 +-
 libxfs/rdwr.c       |   45 +++++++++++++++++++++++++++++----------------
 mkfs/xfs_mkfs.c     |    2 +-
 repair/prefetch.c   |   12 ++++++++----
 repair/prefetch.h   |    1 +
 repair/progress.c   |   14 +++++++++-----
 repair/progress.h   |    2 +-
 repair/scan.c       |    2 +-
 repair/xfs_repair.c |   32 +++++++++++++++++---------------
 12 files changed, 95 insertions(+), 76 deletions(-)
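
Illustration of the idea in the commit message: once each target owns
its own cache, a lookup no longer has to compare device numbers, because
the choice of target already selects the hashtable.  The sketch below is
a toy model under that assumption.  Every toy_* name is hypothetical, and
none of it is the libxfs cache API.

```c
#include <stdlib.h>

#define TOY_BUCKETS	8

struct toy_buf {
	struct toy_buf	*next;
	long		blkno;
	int		data;
};

struct toy_cache {
	struct toy_buf	*buckets[TOY_BUCKETS];
};

/* Each target owns a private cache instead of sharing a global one. */
struct toy_target {
	int		fd;
	struct toy_cache *cache;
};

static struct toy_target *
toy_target_alloc(int fd)
{
	struct toy_target *t = calloc(1, sizeof(*t));

	if (!t)
		return NULL;
	t->cache = calloc(1, sizeof(*t->cache));
	if (!t->cache) {
		free(t);
		return NULL;
	}
	t->fd = fd;
	return t;
}

/* Tear down the cache together with the target that owns it. */
static void
toy_target_free(struct toy_target *t)
{
	int		i;

	for (i = 0; i < TOY_BUCKETS; i++) {
		struct toy_buf *b = t->cache->buckets[i];

		while (b) {
			struct toy_buf *next = b->next;

			free(b);
			b = next;
		}
	}
	free(t->cache);
	free(t);
}

static int
toy_insert(struct toy_target *t, long blkno, int data)
{
	unsigned long	h = (unsigned long)blkno % TOY_BUCKETS;
	struct toy_buf	*b = calloc(1, sizeof(*b));

	if (!b)
		return -1;
	b->blkno = blkno;
	b->data = data;
	b->next = t->cache->buckets[h];
	t->cache->buckets[h] = b;
	return 0;
}

/* The key is only the block number; the target picks the table. */
static struct toy_buf *
toy_lookup(struct toy_target *t, long blkno)
{
	struct toy_buf	*b;

	b = t->cache->buckets[(unsigned long)blkno % TOY_BUCKETS];
	for (; b; b = b->next)
		if (b->blkno == blkno)
			return b;
	return NULL;
}
```

Two targets can then cache the same block number without colliding,
which is the property in-memory btree targets would rely on.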


diff --git a/include/libxfs.h b/include/libxfs.h
index aeec2bc76126..60d3b7968775 100644
--- a/include/libxfs.h
+++ b/include/libxfs.h
@@ -147,7 +147,6 @@ int		libxfs_init(struct libxfs_init *);
 void		libxfs_destroy(struct libxfs_init *li);
 
 extern int	libxfs_device_alignment (void);
-extern void	libxfs_report(FILE *);
 
 /* check or write log footer: specify device, log size in blocks & uuid */
 typedef char	*(libxfs_get_block_t)(char *, int, void *);
diff --git a/libxfs/init.c b/libxfs/init.c
index 5641b9bef6bd..f002dc93cd56 100644
--- a/libxfs/init.c
+++ b/libxfs/init.c
@@ -36,7 +36,6 @@ pthread_mutex_t	atomic64_lock = PTHREAD_MUTEX_INITIALIZER;
 
 char *progname = "libxfs";	/* default, changed by each tool */
 
-struct cache *libxfs_bcache;	/* global buffer cache */
 int libxfs_bhash_size;		/* #buckets in bcache */
 
 int	use_xfs_buf_lock;	/* global flag: use xfs_buf locks for MT */
@@ -267,8 +266,6 @@ libxfs_init(struct libxfs_init *a)
 
 	if (!libxfs_bhash_size)
 		libxfs_bhash_size = LIBXFS_BHASHSIZE(sbp);
-	libxfs_bcache = cache_init(a->bcache_flags, libxfs_bhash_size,
-				   &libxfs_bcache_operations);
 	use_xfs_buf_lock = a->flags & LIBXFS_USEBUFLOCK;
 	xfs_dir_startup();
 	init_caches();
@@ -451,6 +448,7 @@ xfs_set_inode_alloc(
 static struct xfs_buftarg *
 libxfs_buftarg_alloc(
 	struct xfs_mount	*mp,
+	struct libxfs_init	*xi,
 	struct libxfs_dev	*dev,
 	unsigned long		write_fails)
 {
@@ -472,6 +470,9 @@ libxfs_buftarg_alloc(
 	}
 	pthread_mutex_init(&btp->lock, NULL);
 
+	btp->bcache = cache_init(xi->bcache_flags, libxfs_bhash_size,
+			&libxfs_bcache_operations);
+
 	return btp;
 }
 
@@ -568,12 +569,13 @@ libxfs_buftarg_init(
 		return;
 	}
 
-	mp->m_ddev_targp = libxfs_buftarg_alloc(mp, &xi->data, dfail);
+	mp->m_ddev_targp = libxfs_buftarg_alloc(mp, xi, &xi->data, dfail);
 	if (!xi->log.dev || xi->log.dev == xi->data.dev)
 		mp->m_logdev_targp = mp->m_ddev_targp;
 	else
-		mp->m_logdev_targp = libxfs_buftarg_alloc(mp, &xi->log, lfail);
-	mp->m_rtdev_targp = libxfs_buftarg_alloc(mp, &xi->rt, rfail);
+		mp->m_logdev_targp = libxfs_buftarg_alloc(mp, xi, &xi->log,
+				lfail);
+	mp->m_rtdev_targp = libxfs_buftarg_alloc(mp, xi, &xi->rt, rfail);
 }
 
 /* Compute maximum possible height for per-AG btree types for this fs. */
@@ -856,7 +858,7 @@ libxfs_flush_mount(
 	 * LOST_WRITE flag to be set in the buftarg.  Once that's done,
 	 * instruct the disks to persist their write caches.
 	 */
-	libxfs_bcache_flush();
+	libxfs_bcache_flush(mp);
 
 	/* Flush all kernel and disk write caches, and report failures. */
 	if (mp->m_ddev_targp) {
@@ -882,6 +884,14 @@ libxfs_flush_mount(
 	return error;
 }
 
+static void
+libxfs_buftarg_free(
+	struct xfs_buftarg	*btp)
+{
+	cache_destroy(btp->bcache);
+	kmem_free(btp);
+}
+
 /*
  * Release any resource obtained during a mount.
  */
@@ -898,7 +908,7 @@ libxfs_umount(
 	 * all incore buffers, then pick up the outcome when we tell the disks
 	 * to persist their write caches.
 	 */
-	libxfs_bcache_purge();
+	libxfs_bcache_purge(mp);
 	error = libxfs_flush_mount(mp);
 
 	/*
@@ -913,10 +923,10 @@ libxfs_umount(
 	free(mp->m_fsname);
 	mp->m_fsname = NULL;
 
-	kmem_free(mp->m_rtdev_targp);
+	libxfs_buftarg_free(mp->m_rtdev_targp);
 	if (mp->m_logdev_targp != mp->m_ddev_targp)
-		kmem_free(mp->m_logdev_targp);
-	kmem_free(mp->m_ddev_targp);
+		libxfs_buftarg_free(mp->m_logdev_targp);
+	libxfs_buftarg_free(mp->m_ddev_targp);
 
 	return error;
 }
@@ -932,10 +942,7 @@ libxfs_destroy(
 
 	libxfs_close_devices(li);
 
-	/* Free everything from the buffer cache before freeing buffer cache */
-	libxfs_bcache_purge();
 	libxfs_bcache_free();
-	cache_destroy(libxfs_bcache);
 	leaked = destroy_caches();
 	rcu_unregister_thread();
 	if (getenv("LIBXFS_LEAK_CHECK") && leaked)
@@ -947,16 +954,3 @@ libxfs_device_alignment(void)
 {
 	return platform_align_blockdev();
 }
-
-void
-libxfs_report(FILE *fp)
-{
-	time_t t;
-	char *c;
-
-	cache_report(fp, "libxfs_bcache", libxfs_bcache);
-
-	t = time(NULL);
-	c = asctime(localtime(&t));
-	fprintf(fp, "%s", c);
-}
diff --git a/libxfs/libxfs_io.h b/libxfs/libxfs_io.h
index 259c6a7cf771..7877e17685b8 100644
--- a/libxfs/libxfs_io.h
+++ b/libxfs/libxfs_io.h
@@ -28,6 +28,7 @@ struct xfs_buftarg {
 	dev_t			bt_bdev;
 	int			bt_bdev_fd;
 	unsigned int		flags;
+	struct cache		*bcache;	/* buffer cache */
 };
 
 /* We purged a dirty buffer and lost a write. */
@@ -36,6 +37,8 @@ struct xfs_buftarg {
 #define XFS_BUFTARG_CORRUPT_WRITE	(1 << 1)
 /* Simulate failure after a certain number of writes. */
 #define XFS_BUFTARG_INJECT_WRITE_FAIL	(1 << 2)
+/* purge buffers when lookups find a size mismatch */
+#define XFS_BUFTARG_MISCOMPARE_PURGE	(1 << 3)
 
 /* Simulate the system crashing after a certain number of writes. */
 static inline void
@@ -140,7 +143,6 @@ int libxfs_buf_priority(struct xfs_buf *bp);
 
 /* Buffer Cache Interfaces */
 
-extern struct cache	*libxfs_bcache;
 extern struct cache_operations	libxfs_bcache_operations;
 
 #define LIBXFS_GETBUF_TRYLOCK	(1 << 0)
@@ -184,10 +186,10 @@ libxfs_buf_read(
 
 int libxfs_readbuf_verify(struct xfs_buf *bp, const struct xfs_buf_ops *ops);
 struct xfs_buf *libxfs_getsb(struct xfs_mount *mp);
-extern void	libxfs_bcache_purge(void);
+extern void	libxfs_bcache_purge(struct xfs_mount *mp);
 extern void	libxfs_bcache_free(void);
-extern void	libxfs_bcache_flush(void);
-extern int	libxfs_bcache_overflowed(void);
+extern void	libxfs_bcache_flush(struct xfs_mount *mp);
+extern int	libxfs_bcache_overflowed(struct xfs_mount *mp);
 
 /* Buffer (Raw) Interfaces */
 int		libxfs_bwrite(struct xfs_buf *bp);
diff --git a/libxfs/logitem.c b/libxfs/logitem.c
index 3ce2d7574a37..7757259dfc5e 100644
--- a/libxfs/logitem.c
+++ b/libxfs/logitem.c
@@ -46,7 +46,7 @@ xfs_trans_buf_item_match(
 	list_for_each_entry(lip, &tp->t_items, li_trans) {
 		blip = (struct xfs_buf_log_item *)lip;
 		if (blip->bli_item.li_type == XFS_LI_BUF &&
-		    blip->bli_buf->b_target->bt_bdev == btp->bt_bdev &&
+		    blip->bli_buf->b_target == btp &&
 		    xfs_buf_daddr(blip->bli_buf) == map[0].bm_bn &&
 		    blip->bli_buf->b_length == len) {
 			ASSERT(blip->bli_buf->b_map_count == nmaps);
diff --git a/libxfs/rdwr.c b/libxfs/rdwr.c
index 153007d5fc86..cf986a7e7820 100644
--- a/libxfs/rdwr.c
+++ b/libxfs/rdwr.c
@@ -198,18 +198,20 @@ libxfs_bhash(cache_key_t key, unsigned int hashsize, unsigned int hashshift)
 }
 
 static int
-libxfs_bcompare(struct cache_node *node, cache_key_t key)
+libxfs_bcompare(
+	struct cache_node	*node,
+	cache_key_t		key)
 {
 	struct xfs_buf		*bp = container_of(node, struct xfs_buf,
 						   b_node);
 	struct xfs_bufkey	*bkey = (struct xfs_bufkey *)key;
+	struct cache		*bcache = bkey->buftarg->bcache;
 
-	if (bp->b_target->bt_bdev == bkey->buftarg->bt_bdev &&
-	    bp->b_cache_key == bkey->blkno) {
+	if (bp->b_cache_key == bkey->blkno) {
 		if (bp->b_length == bkey->bblen)
 			return CACHE_HIT;
 #ifdef IO_BCOMPARE_CHECK
-		if (!(libxfs_bcache->c_flags & CACHE_MISCOMPARE_PURGE)) {
+		if (!(bcache->c_flags & CACHE_MISCOMPARE_PURGE)) {
 			fprintf(stderr,
 	"%lx: Badness in key lookup (length)\n"
 	"bp=(bno 0x%llx, len %u bytes) key=(bno 0x%llx, len %u bytes)\n",
@@ -399,11 +401,12 @@ __cache_lookup(
 	struct xfs_buf		**bpp)
 {
 	struct cache_node	*cn = NULL;
+	struct cache		*bcache = key->buftarg->bcache;
 	struct xfs_buf		*bp;
 
 	*bpp = NULL;
 
-	cache_node_get(libxfs_bcache, key, &cn);
+	cache_node_get(bcache, key, &cn);
 	if (!cn)
 		return -ENOMEM;
 	bp = container_of(cn, struct xfs_buf, b_node);
@@ -415,7 +418,7 @@ __cache_lookup(
 		if (ret) {
 			ASSERT(ret == EAGAIN);
 			if (flags & LIBXFS_GETBUF_TRYLOCK) {
-				cache_node_put(libxfs_bcache, cn);
+				cache_node_put(bcache, cn);
 				return -EAGAIN;
 			}
 
@@ -434,7 +437,7 @@ __cache_lookup(
 		bp->b_holder = pthread_self();
 	}
 
-	cache_node_set_priority(libxfs_bcache, cn,
+	cache_node_set_priority(bcache, cn,
 			cache_node_get_priority(cn) - CACHE_PREFETCH_PRIORITY);
 	*bpp = bp;
 	return 0;
@@ -550,7 +553,7 @@ libxfs_buf_relse(
 	}
 
 	if (!list_empty(&bp->b_node.cn_hash))
-		cache_node_put(libxfs_bcache, &bp->b_node);
+		cache_node_put(bp->b_target->bcache, &bp->b_node);
 	else if (--bp->b_node.cn_count == 0) {
 		if (bp->b_flags & LIBXFS_B_DIRTY)
 			libxfs_bwrite(bp);
@@ -606,7 +609,7 @@ libxfs_readbufr(struct xfs_buftarg *btp, xfs_daddr_t blkno, struct xfs_buf *bp,
 
 	error = __read_buf(fd, bp->b_addr, bytes, LIBXFS_BBTOOFF64(blkno), flags);
 	if (!error &&
-	    bp->b_target->bt_bdev == btp->bt_bdev &&
+	    bp->b_target == btp &&
 	    bp->b_cache_key == blkno &&
 	    bp->b_length == len)
 		bp->b_flags |= LIBXFS_B_UPTODATE;
@@ -1003,21 +1006,31 @@ libxfs_bflush(
 }
 
 void
-libxfs_bcache_purge(void)
+libxfs_bcache_purge(struct xfs_mount *mp)
 {
-	cache_purge(libxfs_bcache);
+	if (!mp)
+		return;
+	cache_purge(mp->m_ddev_targp->bcache);
+	cache_purge(mp->m_logdev_targp->bcache);
+	cache_purge(mp->m_rtdev_targp->bcache);
 }
 
 void
-libxfs_bcache_flush(void)
+libxfs_bcache_flush(struct xfs_mount *mp)
 {
-	cache_flush(libxfs_bcache);
+	if (!mp)
+		return;
+	cache_flush(mp->m_ddev_targp->bcache);
+	cache_flush(mp->m_logdev_targp->bcache);
+	cache_flush(mp->m_rtdev_targp->bcache);
 }
 
 int
-libxfs_bcache_overflowed(void)
+libxfs_bcache_overflowed(struct xfs_mount *mp)
 {
-	return cache_overflowed(libxfs_bcache);
+	return cache_overflowed(mp->m_ddev_targp->bcache) ||
+		cache_overflowed(mp->m_logdev_targp->bcache) ||
+		cache_overflowed(mp->m_rtdev_targp->bcache);
 }
 
 struct cache_operations libxfs_bcache_operations = {
@@ -1466,7 +1479,7 @@ libxfs_buf_set_priority(
 	struct xfs_buf	*bp,
 	int		priority)
 {
-	cache_node_set_priority(libxfs_bcache, &bp->b_node, priority);
+	cache_node_set_priority(bp->b_target->bcache, &bp->b_node, priority);
 }
 
 int
diff --git a/mkfs/xfs_mkfs.c b/mkfs/xfs_mkfs.c
index f4a9bf20f391..d6fa48edeab5 100644
--- a/mkfs/xfs_mkfs.c
+++ b/mkfs/xfs_mkfs.c
@@ -4613,7 +4613,7 @@ main(
 	 * Need to drop references to inodes we still hold, first.
 	 */
 	libxfs_rtmount_destroy(mp);
-	libxfs_bcache_purge();
+	libxfs_bcache_purge(mp);
 
 	/*
 	 * Mark the filesystem ok.
diff --git a/repair/prefetch.c b/repair/prefetch.c
index b0dd19775ca8..de36c5fe2cc9 100644
--- a/repair/prefetch.c
+++ b/repair/prefetch.c
@@ -886,10 +886,12 @@ init_prefetch(
 
 prefetch_args_t *
 start_inode_prefetch(
+	struct xfs_mount	*mp,
 	xfs_agnumber_t		agno,
 	int			dirs_only,
 	prefetch_args_t		*prev_args)
 {
+	struct cache		*bcache = mp->m_ddev_targp->bcache;
 	prefetch_args_t		*args;
 	long			max_queue;
 	struct xfs_ino_geometry	*igeo = M_IGEO(mp);
@@ -914,7 +916,7 @@ start_inode_prefetch(
 	 * and not any other associated metadata like directories
 	 */
 
-	max_queue = libxfs_bcache->c_maxcount / thread_count / 8;
+	max_queue = bcache->c_maxcount / thread_count / 8;
 	if (igeo->inode_cluster_size > mp->m_sb.sb_blocksize)
 		max_queue = max_queue * igeo->blocks_per_cluster /
 				igeo->ialloc_blks;
@@ -970,14 +972,16 @@ prefetch_ag_range(
 	void			(*func)(struct workqueue *,
 					xfs_agnumber_t, void *))
 {
+	struct xfs_mount	*mp = work->wq_ctx;
 	int			i;
 	struct prefetch_args	*pf_args[2];
 
-	pf_args[start_ag & 1] = start_inode_prefetch(start_ag, dirs_only, NULL);
+	pf_args[start_ag & 1] = start_inode_prefetch(mp, start_ag, dirs_only,
+			NULL);
 	for (i = start_ag; i < end_ag; i++) {
 		/* Don't prefetch end_ag */
 		if (i + 1 < end_ag)
-			pf_args[(~i) & 1] = start_inode_prefetch(i + 1,
+			pf_args[(~i) & 1] = start_inode_prefetch(mp, i + 1,
 						dirs_only, pf_args[i & 1]);
 		func(work, i, pf_args[i & 1]);
 	}
@@ -1027,7 +1031,7 @@ do_inode_prefetch(
 	 * filesystem - it's all in the cache. In that case, run a thread per
 	 * CPU to maximise parallelism of the queue to be processed.
 	 */
-	if (check_cache && !libxfs_bcache_overflowed()) {
+	if (check_cache && !libxfs_bcache_overflowed(mp)) {
 		queue.wq_ctx = mp;
 		create_work_queue(&queue, mp, platform_nproc());
 		for (i = 0; i < mp->m_sb.sb_agcount; i++)
diff --git a/repair/prefetch.h b/repair/prefetch.h
index 54ece48ad228..a8c52a1195b6 100644
--- a/repair/prefetch.h
+++ b/repair/prefetch.h
@@ -39,6 +39,7 @@ init_prefetch(
 
 prefetch_args_t *
 start_inode_prefetch(
+	struct xfs_mount	*mp,
 	xfs_agnumber_t		agno,
 	int			dirs_only,
 	prefetch_args_t		*prev_args);
diff --git a/repair/progress.c b/repair/progress.c
index f6c4d988444e..625dc41c2894 100644
--- a/repair/progress.c
+++ b/repair/progress.c
@@ -383,14 +383,18 @@ timediff(int phase)
 **  array.
 */
 char *
-timestamp(int end, int phase, char *buf)
+timestamp(
+	struct xfs_mount	*mp,
+	int			end,
+	int			phase,
+	char			*buf)
 {
 
-	time_t    now;
-	struct tm *tmp;
+	time_t			now;
+	struct tm		*tmp;
 
-	if (verbose > 1)
-		cache_report(stderr, "libxfs_bcache", libxfs_bcache);
+	if (verbose > 1 && mp && mp->m_ddev_targp)
+		cache_report(stderr, "libxfs_bcache", mp->m_ddev_targp->bcache);
 
 	now = time(NULL);
 
diff --git a/repair/progress.h b/repair/progress.h
index 2c1690db1b17..75b751b783b2 100644
--- a/repair/progress.h
+++ b/repair/progress.h
@@ -37,7 +37,7 @@ extern void stop_progress_rpt(void);
 extern void summary_report(void);
 extern int  set_progress_msg(int report, uint64_t total);
 extern uint64_t print_final_rpt(void);
-extern char *timestamp(int end, int phase, char *buf);
+extern char *timestamp(struct xfs_mount *mp, int end, int phase, char *buf);
 extern char *duration(int val, char *buf);
 extern int do_parallel;
 
diff --git a/repair/scan.c b/repair/scan.c
index 7e6d94cfa670..715be1166fc2 100644
--- a/repair/scan.c
+++ b/repair/scan.c
@@ -42,7 +42,7 @@ struct aghdr_cnts {
 void
 set_mp(xfs_mount_t *mpp)
 {
-	libxfs_bcache_purge();
+	libxfs_bcache_purge(mp);
 	mp = mpp;
 }
 
diff --git a/repair/xfs_repair.c b/repair/xfs_repair.c
index ba9d28330d82..d4f99f36f71d 100644
--- a/repair/xfs_repair.c
+++ b/repair/xfs_repair.c
@@ -942,9 +942,11 @@ repair_capture_writeback(
 }
 
 static inline void
-phase_end(int phase)
+phase_end(
+	struct xfs_mount	*mp,
+	int			phase)
 {
-	timestamp(PHASE_END, phase, NULL);
+	timestamp(mp, PHASE_END, phase, NULL);
 
 	/* Fail if someone injected an post-phase error. */
 	if (fail_after_phase && phase == fail_after_phase)
@@ -979,8 +981,8 @@ main(int argc, char **argv)
 
 	msgbuf = malloc(DURATION_BUF_SIZE);
 
-	timestamp(PHASE_START, 0, NULL);
-	phase_end(0);
+	timestamp(temp_mp, PHASE_START, 0, NULL);
+	phase_end(temp_mp, 0);
 
 	/* -f forces this, but let's be nice and autodetect it, as well. */
 	if (!isa_file) {
@@ -1002,7 +1004,7 @@ main(int argc, char **argv)
 
 	/* do phase1 to make sure we have a superblock */
 	phase1(temp_mp);
-	phase_end(1);
+	phase_end(temp_mp, 1);
 
 	if (no_modify && primary_sb_modified)  {
 		do_warn(_("Primary superblock would have been modified.\n"
@@ -1139,8 +1141,8 @@ main(int argc, char **argv)
 		unsigned long	max_mem;
 		struct rlimit	rlim;
 
-		libxfs_bcache_purge();
-		cache_destroy(libxfs_bcache);
+		libxfs_bcache_purge(mp);
+		cache_destroy(mp->m_ddev_targp->bcache);
 
 		mem_used = (mp->m_sb.sb_icount >> (10 - 2)) +
 					(mp->m_sb.sb_dblocks >> (10 + 1)) +
@@ -1200,7 +1202,7 @@ main(int argc, char **argv)
 			do_log(_("        - block cache size set to %d entries\n"),
 				libxfs_bhash_size * HASH_CACHE_RATIO);
 
-		libxfs_bcache = cache_init(0, libxfs_bhash_size,
+		mp->m_ddev_targp->bcache = cache_init(0, libxfs_bhash_size,
 						&libxfs_bcache_operations);
 	}
 
@@ -1228,16 +1230,16 @@ main(int argc, char **argv)
 
 	/* make sure the per-ag freespace maps are ok so we can mount the fs */
 	phase2(mp, phase2_threads);
-	phase_end(2);
+	phase_end(mp, 2);
 
 	if (do_prefetch)
 		init_prefetch(mp);
 
 	phase3(mp, phase2_threads);
-	phase_end(3);
+	phase_end(mp, 3);
 
 	phase4(mp);
-	phase_end(4);
+	phase_end(mp, 4);
 
 	if (no_modify) {
 		printf(_("No modify flag set, skipping phase 5\n"));
@@ -1247,7 +1249,7 @@ main(int argc, char **argv)
 	} else {
 		phase5(mp);
 	}
-	phase_end(5);
+	phase_end(mp, 5);
 
 	/*
 	 * Done with the block usage maps, toss them...
@@ -1257,10 +1259,10 @@ main(int argc, char **argv)
 
 	if (!bad_ino_btree)  {
 		phase6(mp);
-		phase_end(6);
+		phase_end(mp, 6);
 
 		phase7(mp, phase2_threads);
-		phase_end(7);
+		phase_end(mp, 7);
 	} else  {
 		do_warn(
 _("Inode allocation btrees are too corrupted, skipping phases 6 and 7\n"));
@@ -1385,7 +1387,7 @@ _("Note - stripe unit (%d) and width (%d) were copied from a backup superblock.\
 	 * verifiers are run (where we discover the max metadata LSN), reformat
 	 * the log if necessary and unmount.
 	 */
-	libxfs_bcache_flush();
+	libxfs_bcache_flush(mp);
 	format_log_max_lsn(mp);
 
 	if (xfs_sb_version_needsrepair(&mp->m_sb))



* [PATCH 089/111] libxfs: add xfile support
  2024-04-16  0:58 ` [PATCHSET 2/4] libxfs: sync with 6.9 Darrick J. Wong
  2024-04-16  1:00   ` [PATCH 088/111] libxfs: teach buftargs to maintain their own buffer hashtable Darrick J. Wong
@ 2024-04-16  1:00   ` Darrick J. Wong
  2024-04-16  1:00   ` [PATCH 090/111] libxfs: partition memfd files to avoid using too many fds Darrick J. Wong
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 36+ messages in thread
From: Darrick J. Wong @ 2024-04-16  1:00 UTC (permalink / raw)
  To: djwong, cem; +Cc: Christoph Hellwig, cmaiolino, linux-xfs, hch

From: Darrick J. Wong <djwong@kernel.org>

Port the xfile functionality (anonymous pageable file-index memory) from
the kernel.  In userspace, we try to use memfd_create() to create tmpfs
files that are not in any namespace, matching the kernel.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 libxfs/Makefile     |    2 
 libxfs/xfile.c      |  210 +++++++++++++++++++++++++++++++++++++++++++++++++++
 libxfs/xfile.h      |   21 +++++
 repair/xfs_repair.c |   15 ++++
 4 files changed, 248 insertions(+)
 create mode 100644 libxfs/xfile.c
 create mode 100644 libxfs/xfile.h
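
Illustration of the approach described above: create a memfd, retry
without MFD_NOEXEC_SEAL if the kernel rejects it, and treat any short
read or write as an allocation failure.  This is a minimal sketch, not
the patch's code.  The scratch_* names are hypothetical, and the sketch
calls memfd_create through syscall(2) only so that it does not depend on
the glibc wrapper being available.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <linux/memfd.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef MFD_NOEXEC_SEAL
# define MFD_NOEXEC_SEAL	(0x0008U)
#endif

/* Open a close-on-exec memfd; older kernels reject MFD_NOEXEC_SEAL. */
static int
scratch_fd_create(
	const char	*name)
{
	int		fd;

	fd = syscall(SYS_memfd_create, name, MFD_CLOEXEC | MFD_NOEXEC_SEAL);
	if (fd < 0)
		fd = syscall(SYS_memfd_create, name, MFD_CLOEXEC);
	return fd;
}

/* Write count bytes at pos; a short write counts as "out of memory". */
static int
scratch_store(
	int		fd,
	const void	*buf,
	size_t		count,
	off_t		pos)
{
	ssize_t		ret = pwrite(fd, buf, count, pos);

	if (ret < 0)
		return -errno;
	if ((size_t)ret != count)
		return -ENOMEM;
	return 0;
}

/* Read count bytes at pos; a short read counts as "out of memory". */
static int
scratch_load(
	int		fd,
	void		*buf,
	size_t		count,
	off_t		pos)
{
	ssize_t		ret = pread(fd, buf, count, pos);

	if (ret < 0)
		return -errno;
	if ((size_t)ret != count)
		return -ENOMEM;
	return 0;
}
```

Storing at a large offset leaves a hole below it, so only the pages that
were actually written consume memory.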


diff --git a/libxfs/Makefile b/libxfs/Makefile
index 6f688c0ad25a..43e8ae183229 100644
--- a/libxfs/Makefile
+++ b/libxfs/Makefile
@@ -26,6 +26,7 @@ HFILES = \
 	libxfs_priv.h \
 	linux-err.h \
 	topology.h \
+	xfile.h \
 	xfs_ag_resv.h \
 	xfs_alloc.h \
 	xfs_alloc_btree.h \
@@ -66,6 +67,7 @@ CFILES = cache.c \
 	topology.c \
 	trans.c \
 	util.c \
+	xfile.c \
 	xfs_ag.c \
 	xfs_ag_resv.c \
 	xfs_alloc.c \
diff --git a/libxfs/xfile.c b/libxfs/xfile.c
new file mode 100644
index 000000000000..cba173cc17f1
--- /dev/null
+++ b/libxfs/xfile.c
@@ -0,0 +1,210 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (c) 2021-2024 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#include "libxfs_priv.h"
+#include "libxfs.h"
+#include "libxfs/xfile.h"
+#include <linux/memfd.h>
+#include <sys/mman.h>
+#include <sys/types.h>
+#include <sys/wait.h>
+
+/*
+ * Swappable Temporary Memory
+ * ==========================
+ *
+ * Offline checking sometimes needs to be able to stage a large amount of data
+ * in memory.  This information might not fit in the available memory and it
+ * doesn't all need to be accessible at all times.  In other words, we want an
+ * indexed data buffer to store data that can be paged out.
+ *
+ * memfd files meet those requirements.  Therefore, the xfile mechanism uses
+ * one to store our staging data.  The xfile must be freed with xfile_destroy.
+ *
+ * xfiles assume that the caller will handle all required concurrency
+ * management; file locks are not taken.
+ */
+
+/*
+ * Starting with Linux 6.3, there's a new MFD_NOEXEC_SEAL flag that disables
+ * the longstanding memfd behavior that files are created with the executable
+ * bit set, and seals the file against it being turned back on.
+ */
+#ifndef MFD_NOEXEC_SEAL
+# define MFD_NOEXEC_SEAL	(0x0008U)
+#endif
+
+/*
+ * Open a memory-backed fd to back an xfile.  We require close-on-exec here,
+ * because these memfd files function as windowed RAM and hence should never
+ * be shared with other processes.
+ */
+static int
+xfile_create_fd(
+	const char		*description)
+{
+	int			fd = -1;
+	int			ret;
+
+	/*
+	 * memfd_create was added to kernel 3.17 (2014).  MFD_NOEXEC_SEAL
+	 * causes -EINVAL on old kernels, so fall back to omitting it so that
+	 * new xfs_repair can run on an older recovery cd kernel.
+	 */
+	fd = memfd_create(description, MFD_CLOEXEC | MFD_NOEXEC_SEAL);
+	if (fd >= 0)
+		goto got_fd;
+	fd = memfd_create(description, MFD_CLOEXEC);
+	if (fd >= 0)
+		goto got_fd;
+
+	/*
+	 * O_TMPFILE exists as of kernel 3.11 (2013), which means that if we
+	 * find it, we're pretty safe in assuming O_CLOEXEC exists too.
+	 */
+	fd = open("/dev/shm", O_TMPFILE | O_CLOEXEC | O_RDWR, 0600);
+	if (fd >= 0)
+		goto got_fd;
+
+	fd = open("/tmp", O_TMPFILE | O_CLOEXEC | O_RDWR, 0600);
+	if (fd >= 0)
+		goto got_fd;
+
+	/*
+	 * mkostemp exists as of glibc 2.7 (2007) and O_CLOEXEC exists as of
+	 * kernel 2.6.23 (2007).
+	 */
+	fd = mkostemp("libxfsXXXXXX", O_CLOEXEC);
+	if (fd >= 0)
+		goto got_fd;
+
+	if (!errno)
+		errno = EOPNOTSUPP;
+	return -1;
+got_fd:
+	/*
+	 * Turn off mode bits we don't want -- group members and others should
+	 * not have access to the xfile, nor should it be executable.  memfds are
+	 * created with mode 0777, but we'll be careful just in case the other
+	 * implementations fail to set 0600.
+	 */
+	ret = fchmod(fd, 0600);
+	if (ret)
+		perror("disabling xfile executable bit");
+
+	return fd;
+}
+
+/*
+ * Create an xfile of the given size.  The description will be used in the
+ * trace output.
+ */
+int
+xfile_create(
+	const char		*description,
+	struct xfile		**xfilep)
+{
+	struct xfile		*xf;
+	int			error;
+
+	xf = kmalloc(sizeof(struct xfile), 0);
+	if (!xf)
+		return -ENOMEM;
+
+	xf->fd = xfile_create_fd(description);
+	if (xf->fd < 0) {
+		error = -errno;
+		kfree(xf);
+		return error;
+	}
+
+	*xfilep = xf;
+	return 0;
+}
+
+/* Close the file and release all resources. */
+void
+xfile_destroy(
+	struct xfile		*xf)
+{
+	close(xf->fd);
+	kfree(xf);
+}
+
+static inline loff_t
+xfile_maxbytes(
+	struct xfile		*xf)
+{
+	if (sizeof(loff_t) == 8)
+		return LLONG_MAX;
+	return LONG_MAX;
+}
+
+/*
+ * Load an object.  Since we're treating this file as "memory", any error or
+ * short IO is treated as a failure to allocate memory.
+ */
+ssize_t
+xfile_load(
+	struct xfile		*xf,
+	void			*buf,
+	size_t			count,
+	loff_t			pos)
+{
+	ssize_t			ret;
+
+	if (count > INT_MAX)
+		return -ENOMEM;
+	if (xfile_maxbytes(xf) - pos < count)
+		return -ENOMEM;
+
+	ret = pread(xf->fd, buf, count, pos);
+	if (ret < 0)
+		return -errno;
+	if (ret != count)
+		return -ENOMEM;
+	return 0;
+}
+
+/*
+ * Store an object.  Since we're treating this file as "memory", any error or
+ * short IO is treated as a failure to allocate memory.
+ */
+ssize_t
+xfile_store(
+	struct xfile		*xf,
+	const void		*buf,
+	size_t			count,
+	loff_t			pos)
+{
+	ssize_t			ret;
+
+	if (count > INT_MAX)
+		return -E2BIG;
+	if (xfile_maxbytes(xf) - pos < count)
+		return -EFBIG;
+
+	ret = pwrite(xf->fd, buf, count, pos);
+	if (ret < 0)
+		return -errno;
+	if (ret != count)
+		return -ENOMEM;
+	return 0;
+}
+
+/* Compute the number of bytes used by a xfile. */
+unsigned long long
+xfile_bytes(
+	struct xfile		*xf)
+{
+	struct stat		statbuf;
+	int			error;
+
+	error = fstat(xf->fd, &statbuf);
+	if (error)
+		return -errno;
+
+	return (unsigned long long)statbuf.st_blocks << 9;
+}
diff --git a/libxfs/xfile.h b/libxfs/xfile.h
new file mode 100644
index 000000000000..d60084011357
--- /dev/null
+++ b/libxfs/xfile.h
@@ -0,0 +1,21 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * Copyright (c) 2021-2024 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#ifndef __LIBXFS_XFILE_H__
+#define __LIBXFS_XFILE_H__
+
+struct xfile {
+	int			fd;
+};
+
+int xfile_create(const char *description, struct xfile **xfilep);
+void xfile_destroy(struct xfile *xf);
+
+ssize_t xfile_load(struct xfile *xf, void *buf, size_t count, loff_t pos);
+ssize_t xfile_store(struct xfile *xf, const void *buf, size_t count, loff_t pos);
+
+unsigned long long xfile_bytes(struct xfile *xf);
+
+#endif /* __LIBXFS_XFILE_H__ */
diff --git a/repair/xfs_repair.c b/repair/xfs_repair.c
index d4f99f36f71d..01f92e841f29 100644
--- a/repair/xfs_repair.c
+++ b/repair/xfs_repair.c
@@ -953,6 +953,20 @@ phase_end(
 		platform_crash();
 }
 
+/* Try to allow as many memfds as possible. */
+static void
+bump_max_fds(void)
+{
+	struct rlimit	rlim = { };
+	int		ret;
+
+	ret = getrlimit(RLIMIT_NOFILE, &rlim);
+	if (!ret) {
+		rlim.rlim_cur = rlim.rlim_max;
+		setrlimit(RLIMIT_NOFILE, &rlim);
+	}
+}
+
 int
 main(int argc, char **argv)
 {
@@ -972,6 +986,7 @@ main(int argc, char **argv)
 	bindtextdomain(PACKAGE, LOCALEDIR);
 	textdomain(PACKAGE);
 	dinode_bmbt_translation_init();
+	bump_max_fds();
 
 	temp_mp = &xfs_m;
 	setbuf(stdout, NULL);



* [PATCH 090/111] libxfs: partition memfd files to avoid using too many fds
  2024-04-16  0:58 ` [PATCHSET 2/4] libxfs: sync with 6.9 Darrick J. Wong
  2024-04-16  1:00   ` [PATCH 088/111] libxfs: teach buftargs to maintain their own buffer hashtable Darrick J. Wong
  2024-04-16  1:00   ` [PATCH 089/111] libxfs: add xfile support Darrick J. Wong
@ 2024-04-16  1:00   ` Darrick J. Wong
  2024-04-16  4:55     ` Christoph Hellwig
  2024-04-24 17:20     ` [PATCH v3.1 " Darrick J. Wong
  2024-04-16  1:00   ` [PATCH 091/111] xfs: teach buftargs to maintain their own buffer hashtable Darrick J. Wong
  2024-04-16  1:01   ` [PATCH 092/111] libxfs: support in-memory buffer cache targets Darrick J. Wong
  4 siblings, 2 replies; 36+ messages in thread
From: Darrick J. Wong @ 2024-04-16  1:00 UTC (permalink / raw)
  To: djwong, cem; +Cc: cmaiolino, linux-xfs, hch

From: Darrick J. Wong <djwong@kernel.org>

Allow several xfiles to share one memfd by giving each xfile its own
partition (a byte range) of the file, so that we avoid running out of
file descriptors.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 libxfs/xfile.c |  197 ++++++++++++++++++++++++++++++++++++++++++++++++++++++--
 libxfs/xfile.h |   17 ++++-
 2 files changed, 205 insertions(+), 9 deletions(-)
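
Illustration of the partitioning scheme: each user reserves a
page-aligned byte range at the current end of a shared file, and only
the last range can shrink the file again when it is released.  This is
a single-threaded sketch under those assumptions, whereas the patch
serializes these steps with a mutex.  The partition_* and round_up_u64
names are hypothetical.  Unlike the patch, the sketch rounds the length
up to a page in both alloc and free.

```c
#include <errno.h>
#include <stdint.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

static uint64_t
round_up_u64(uint64_t x, uint64_t align)
{
	return ((x + align - 1) / align) * align;
}

/* Reserve a page-aligned range at the end of fd; offset goes to *posp. */
static int
partition_alloc(
	int		fd,
	uint64_t	maxbytes,
	off_t		*posp)
{
	uint64_t	pagesz = (uint64_t)sysconf(_SC_PAGESIZE);
	struct stat	sb;
	off_t		pos;

	if (fstat(fd, &sb))
		return -errno;
	pos = (off_t)round_up_u64((uint64_t)sb.st_size, pagesz);

	/* Extend the file so writes anywhere in the range can succeed. */
	if (ftruncate(fd, pos + (off_t)round_up_u64(maxbytes, pagesz)))
		return -errno;
	*posp = pos;
	return 0;
}

/* Release a range; the file shrinks only if the range was the last one. */
static int
partition_free(
	int		fd,
	off_t		pos,
	uint64_t	maxbytes)
{
	uint64_t	pagesz = (uint64_t)sysconf(_SC_PAGESIZE);
	off_t		len = (off_t)round_up_u64(maxbytes, pagesz);
	struct stat	sb;

	if (fstat(fd, &sb))
		return -errno;
	if (sb.st_size == pos + len && ftruncate(fd, pos))
		return -errno;
	return 0;
}
```

Releasing ranges in reverse order of creation returns the most address
space, which matches the "bonus points" remark in xfile_fcb_irele().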


diff --git a/libxfs/xfile.c b/libxfs/xfile.c
index cba173cc17f1..fdb76f406647 100644
--- a/libxfs/xfile.c
+++ b/libxfs/xfile.c
@@ -97,6 +97,149 @@ xfile_create_fd(
 	return fd;
 }
 
+static LIST_HEAD(fcb_list);
+static pthread_mutex_t fcb_mutex = PTHREAD_MUTEX_INITIALIZER;
+
+/* Create a new memfd. */
+static inline int
+xfile_fcb_create(
+	const char		*description,
+	struct xfile_fcb	**fcbp)
+{
+	struct xfile_fcb	*fcb;
+	int			fd;
+
+	fd = xfile_create_fd(description);
+	if (fd < 0)
+		return -errno;
+
+	fcb = malloc(sizeof(struct xfile_fcb));
+	if (!fcb) {
+		close(fd);
+		return -ENOMEM;
+	}
+
+	list_head_init(&fcb->fcb_list);
+	fcb->fd = fd;
+	fcb->refcount = 1;
+
+	*fcbp = fcb;
+	return 0;
+}
+
+/* Release an xfile control block */
+static void
+xfile_fcb_irele(
+	struct xfile_fcb	*fcb,
+	loff_t			pos,
+	uint64_t		len)
+{
+	/*
+	 * If this memfd is linked only to itself, it's private, so we can
+	 * close it without taking any locks.
+	 */
+	if (list_empty(&fcb->fcb_list)) {
+		close(fcb->fd);
+		free(fcb);
+		return;
+	}
+
+	pthread_mutex_lock(&fcb_mutex);
+	if (--fcb->refcount == 0) {
+		/* If we're the last user of this memfd file, kill it fast. */
+		list_del(&fcb->fcb_list);
+		close(fcb->fd);
+		free(fcb);
+	} else if (len > 0) {
+		struct stat	statbuf;
+		int		ret;
+
+		/*
+		 * If we were using the end of a partitioned file, free the
+		 * address space.  IOWs, bonus points if you delete these in
+		 * reverse-order of creation.
+		 */
+		ret = fstat(fcb->fd, &statbuf);
+		if (!ret && statbuf.st_size == pos + len) {
+			ret = ftruncate(fcb->fd, pos);
+		}
+	}
+	pthread_mutex_unlock(&fcb_mutex);
+}
+
+/*
+ * Find a memfd that can accommodate the given amount of address space.
+ */
+static int
+xfile_fcb_find(
+	const char		*description,
+	uint64_t		maxbytes,
+	loff_t			*posp,
+	struct xfile_fcb	**fcbp)
+{
+	struct xfile_fcb	*fcb;
+	int			ret;
+	int			error = 0;
+
+	/* No maximum range means that the caller gets a private memfd. */
+	if (maxbytes == 0) {
+		*posp = 0;
+		return xfile_fcb_create(description, fcbp);
+	}
+
+	/* round up to page granularity so we can do mmap */
+	maxbytes = roundup_64(maxbytes, PAGE_SIZE);
+
+	pthread_mutex_lock(&fcb_mutex);
+
+	/*
+	 * The caller needs only a bounded byte range, so look for a memfd with
+	 * available file range.
+	 */
+	list_for_each_entry(fcb, &fcb_list, fcb_list) {
+		struct stat	statbuf;
+		loff_t		pos;
+
+		ret = fstat(fcb->fd, &statbuf);
+		if (ret)
+			continue;
+		pos = roundup_64(statbuf.st_size, PAGE_SIZE);
+
+		/*
+		 * Truncate up to ensure that the memfd can actually handle
+		 * writes to the end of the range.
+		 */
+		ret = ftruncate(fcb->fd, pos + maxbytes);
+		if (ret)
+			continue;
+
+		fcb->refcount++;
+		*posp = pos;
+		*fcbp = fcb;
+		goto out_unlock;
+	}
+
+	/* Otherwise, open a new memfd and add it to our list. */
+	error = xfile_fcb_create(description, &fcb);
+	if (error)
+		goto out_unlock;
+
+	ret = ftruncate(fcb->fd, maxbytes);
+	if (ret) {
+		error = -errno;
+		xfile_fcb_irele(fcb, 0, maxbytes);
+		goto out_unlock;
+	}
+
+	list_add_tail(&fcb->fcb_list, &fcb_list);
+	*posp = 0;
+	*fcbp = fcb;
+
+out_unlock:
+	pthread_mutex_unlock(&fcb_mutex);
+	return error;
+}
+
 /*
  * Create an xfile of the given size.  The description will be used in the
  * trace output.
@@ -104,6 +247,7 @@ xfile_create_fd(
 int
 xfile_create(
 	const char		*description,
+	unsigned long long	maxbytes,
 	struct xfile		**xfilep)
 {
 	struct xfile		*xf;
@@ -113,13 +257,14 @@ xfile_create(
 	if (!xf)
 		return -ENOMEM;
 
-	xf->fd = xfile_create_fd(description);
-	if (xf->fd < 0) {
-		error = -errno;
+	error = xfile_fcb_find(description, maxbytes, &xf->partition_pos,
+			&xf->fcb);
+	if (error) {
 		kfree(xf);
 		return error;
 	}
 
+	xf->maxbytes = maxbytes;
 	*xfilep = xf;
 	return 0;
 }
@@ -129,7 +274,7 @@ void
 xfile_destroy(
 	struct xfile		*xf)
 {
-	close(xf->fd);
+	xfile_fcb_irele(xf->fcb, xf->partition_pos, xf->maxbytes);
 	kfree(xf);
 }
 
@@ -137,6 +282,9 @@ static inline loff_t
 xfile_maxbytes(
 	struct xfile		*xf)
 {
+	if (xf->maxbytes > 0)
+		return xf->maxbytes;
+
 	if (sizeof(loff_t) == 8)
 		return LLONG_MAX;
 	return LONG_MAX;
@@ -160,7 +308,7 @@ xfile_load(
 	if (xfile_maxbytes(xf) - pos < count)
 		return -ENOMEM;
 
-	ret = pread(xf->fd, buf, count, pos);
+	ret = pread(xf->fcb->fd, buf, count, pos + xf->partition_pos);
 	if (ret < 0)
 		return -errno;
 	if (ret != count)
@@ -186,7 +334,7 @@ xfile_store(
 	if (xfile_maxbytes(xf) - pos < count)
 		return -EFBIG;
 
-	ret = pwrite(xf->fd, buf, count, pos);
+	ret = pwrite(xf->fcb->fd, buf, count, pos + xf->partition_pos);
 	if (ret < 0)
 		return -errno;
 	if (ret != count)
@@ -194,6 +342,38 @@ xfile_store(
 	return 0;
 }
 
+/* Compute the number of bytes used by a partitioned xfile. */
+static unsigned long long
+xfile_partition_bytes(
+	struct xfile		*xf)
+{
+	loff_t			data_pos = xf->partition_pos;
+	loff_t			stop_pos = data_pos + xf->maxbytes;
+	loff_t			hole_pos;
+	unsigned long long	bytes = 0;
+
+	data_pos = lseek(xf->fcb->fd, data_pos, SEEK_DATA);
+	while (data_pos >= 0 && data_pos < stop_pos) {
+		hole_pos = lseek(xf->fcb->fd, data_pos, SEEK_HOLE);
+		if (hole_pos < 0) {
+			/* save error, break */
+			data_pos = hole_pos;
+			break;
+		}
+		if (hole_pos >= stop_pos) {
+			bytes += stop_pos - data_pos;
+			return bytes;
+		}
+		bytes += hole_pos - data_pos;
+
+		data_pos = lseek(xf->fcb->fd, hole_pos, SEEK_DATA);
+	}
+	if (data_pos < 0 && errno != ENXIO)
+		return xf->maxbytes;
+
+	return bytes;
+}
+
 /* Compute the number of bytes used by a xfile. */
 unsigned long long
 xfile_bytes(
@@ -202,7 +382,10 @@ xfile_bytes(
 	struct stat		statbuf;
 	int			error;
 
-	error = fstat(xf->fd, &statbuf);
+	if (xf->maxbytes > 0)
+		return xfile_partition_bytes(xf);
+
+	error = fstat(xf->fcb->fd, &statbuf);
 	if (error)
 		return -errno;
 
diff --git a/libxfs/xfile.h b/libxfs/xfile.h
index d60084011357..180a42bbbaa2 100644
--- a/libxfs/xfile.h
+++ b/libxfs/xfile.h
@@ -6,11 +6,24 @@
 #ifndef __LIBXFS_XFILE_H__
 #define __LIBXFS_XFILE_H__
 
-struct xfile {
+struct xfile_fcb {
+	struct list_head	fcb_list;
 	int			fd;
+	unsigned int		refcount;
 };
 
-int xfile_create(const char *description, struct xfile **xfilep);
+struct xfile {
+	struct xfile_fcb	*fcb;
+
+	/* File position within fcb->fd where this partition starts */
+	loff_t			partition_pos;
+
+	/* Maximum number of bytes that can be written to the partition. */
+	uint64_t		maxbytes;
+};
+
+int xfile_create(const char *description, unsigned long long maxbytes,
+		struct xfile **xfilep);
 void xfile_destroy(struct xfile *xf);
 
 ssize_t xfile_load(struct xfile *xf, void *buf, size_t count, loff_t pos);



* [PATCH 091/111] xfs: teach buftargs to maintain their own buffer hashtable
  2024-04-16  0:58 ` [PATCHSET 2/4] libxfs: sync with 6.9 Darrick J. Wong
                     ` (2 preceding siblings ...)
  2024-04-16  1:00   ` [PATCH 090/111] libxfs: partition memfd files to avoid using too many fds Darrick J. Wong
@ 2024-04-16  1:00   ` Darrick J. Wong
  2024-04-16  1:01   ` [PATCH 092/111] libxfs: support in-memory buffer cache targets Darrick J. Wong
  4 siblings, 0 replies; 36+ messages in thread
From: Darrick J. Wong @ 2024-04-16  1:00 UTC (permalink / raw)
  To: djwong, cem; +Cc: Christoph Hellwig, cmaiolino, linux-xfs, hch

From: Darrick J. Wong <djwong@kernel.org>

Source kernel commit: e7b58f7c1be20550d4f51cec6307b811e7555f52

Currently, cached buffers are indexed by per-AG hashtables.  This works
great for the data device, but won't work for in-memory btrees.  To
handle that use case, buftargs will need to be able to index buffers
independently of other data structures.

We accomplish this by hoisting the rhashtable and its lock into a
separate xfs_buf_cache structure, making the buftarg point to the
_buf_cache structure, and reworking various functions to use it.  This
will enable the in-memory buftarg to come up with its own _buf_cache.
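
As a rough userspace illustration of the idea (not the kernel code), the
sketch below bundles an index and its lock into one structure so that any
owner can embed or allocate its own copy.  All names here (struct buf,
struct buf_cache, buf_cache_*, NBUCKETS) are hypothetical stand-ins, and a
fixed-size chained hash replaces the rhashtable.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define NBUCKETS	16	/* hypothetical fixed table size */

/* Hypothetical stand-in for a cached buffer, keyed by block number. */
struct buf {
	unsigned long long	blkno;
	struct buf		*next;
};

/* The index and the lock that protects it travel together. */
struct buf_cache {
	pthread_mutex_t		lock;
	struct buf		*buckets[NBUCKETS];
};

/* A per-AG structure simply embeds a cache, like pag_bcache. */
struct perag {
	struct buf_cache	pag_bcache;
};

/* Returns 0 or a negative errno, matching the calling convention used here. */
static int
buf_cache_init(struct buf_cache *bch)
{
	size_t			i;

	for (i = 0; i < NBUCKETS; i++)
		bch->buckets[i] = NULL;
	return -pthread_mutex_init(&bch->lock, NULL);
}

static void
buf_cache_insert(struct buf_cache *bch, struct buf *bp)
{
	size_t			h = bp->blkno % NBUCKETS;

	assert(bp->next == NULL);
	pthread_mutex_lock(&bch->lock);
	bp->next = bch->buckets[h];
	bch->buckets[h] = bp;
	pthread_mutex_unlock(&bch->lock);
}

static struct buf *
buf_cache_lookup(struct buf_cache *bch, unsigned long long blkno)
{
	struct buf		*bp;

	pthread_mutex_lock(&bch->lock);
	for (bp = bch->buckets[blkno % NBUCKETS]; bp; bp = bp->next)
		if (bp->blkno == blkno)
			break;
	pthread_mutex_unlock(&bch->lock);
	return bp;
}
```

Because every function takes the cache itself rather than its owner, an
in-memory target could hand in a private cache and reuse the same block
numbers without colliding with the per-AG index.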

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 libxfs/libxfs_priv.h |    4 ++--
 libxfs/xfs_ag.c      |    6 +++---
 libxfs/xfs_ag.h      |    4 +---
 3 files changed, 6 insertions(+), 8 deletions(-)


diff --git a/libxfs/libxfs_priv.h b/libxfs/libxfs_priv.h
index 0a4f686d9455..aee85c155abf 100644
--- a/libxfs/libxfs_priv.h
+++ b/libxfs/libxfs_priv.h
@@ -550,8 +550,8 @@ unsigned int hweight8(unsigned int w);
 unsigned int hweight32(unsigned int w);
 unsigned int hweight64(__u64 w);
 
-static inline int xfs_buf_hash_init(struct xfs_perag *pag) { return 0; }
-static inline void xfs_buf_hash_destroy(struct xfs_perag *pag) { }
+#define xfs_buf_cache_init(bch)		(0)
+#define xfs_buf_cache_destroy(bch)	((void)0)
 
 static inline int xfs_iunlink_init(struct xfs_perag *pag) { return 0; }
 static inline void xfs_iunlink_destroy(struct xfs_perag *pag) { }
diff --git a/libxfs/xfs_ag.c b/libxfs/xfs_ag.c
index 389a8288e989..06a881285682 100644
--- a/libxfs/xfs_ag.c
+++ b/libxfs/xfs_ag.c
@@ -262,7 +262,7 @@ xfs_free_perag(
 		xfs_defer_drain_free(&pag->pag_intents_drain);
 
 		cancel_delayed_work_sync(&pag->pag_blockgc_work);
-		xfs_buf_hash_destroy(pag);
+		xfs_buf_cache_destroy(&pag->pag_bcache);
 
 		/* drop the mount's active reference */
 		xfs_perag_rele(pag);
@@ -350,7 +350,7 @@ xfs_free_unused_perag_range(
 		spin_unlock(&mp->m_perag_lock);
 		if (!pag)
 			break;
-		xfs_buf_hash_destroy(pag);
+		xfs_buf_cache_destroy(&pag->pag_bcache);
 		xfs_defer_drain_free(&pag->pag_intents_drain);
 		kfree(pag);
 	}
@@ -417,7 +417,7 @@ xfs_initialize_perag(
 		pag->pagb_tree = RB_ROOT;
 #endif /* __KERNEL__ */
 
-		error = xfs_buf_hash_init(pag);
+		error = xfs_buf_cache_init(&pag->pag_bcache);
 		if (error)
 			goto out_remove_pag;
 
diff --git a/libxfs/xfs_ag.h b/libxfs/xfs_ag.h
index 19eddba09894..29bfa6273dec 100644
--- a/libxfs/xfs_ag.h
+++ b/libxfs/xfs_ag.h
@@ -106,9 +106,7 @@ struct xfs_perag {
 	int		pag_ici_reclaimable;	/* reclaimable inodes */
 	unsigned long	pag_ici_reclaim_cursor;	/* reclaim restart point */
 
-	/* buffer cache index */
-	spinlock_t	pag_buf_lock;	/* lock for pag_buf_hash */
-	struct rhashtable pag_buf_hash;
+	struct xfs_buf_cache	pag_bcache;
 
 	/* background prealloc block trimming */
 	struct delayed_work	pag_blockgc_work;



* [PATCH 092/111] libxfs: support in-memory buffer cache targets
  2024-04-16  0:58 ` [PATCHSET 2/4] libxfs: sync with 6.9 Darrick J. Wong
                     ` (3 preceding siblings ...)
  2024-04-16  1:00   ` [PATCH 091/111] xfs: teach buftargs to maintain their own buffer hashtable Darrick J. Wong
@ 2024-04-16  1:01   ` Darrick J. Wong
  4 siblings, 0 replies; 36+ messages in thread
From: Darrick J. Wong @ 2024-04-16  1:01 UTC (permalink / raw)
  To: djwong, cem; +Cc: Christoph Hellwig, cmaiolino, linux-xfs, hch

From: Darrick J. Wong <djwong@kernel.org>

Allow the buffer cache to target in-memory files by connecting it to
xfiles.  The next few patches will enable creating xfs_btrees in memory.
Unlike the kernel version of this patch, we use a partitioned xfile to
avoid overflowing the fd table instead of opening a separate memfd for
each target.
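
To make the partitioning idea concrete, here is a minimal sketch under
stated assumptions: struct part, part_carve and part_map_page are
hypothetical names, and any page-cache-backed fd (for example one obtained
from tmpfile()) stands in for the memfd.  It only shows how several users
can share one fd by owning page-aligned windows of it, and how a page of a
window can be mapped directly.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical stand-in for an xfile partition: a window of a shared fd. */
struct part {
	int	fd;		/* backing file shared by all partitions */
	off_t	pos;		/* where this partition starts within fd */
	off_t	maxbytes;	/* partition length, a multiple of the page size */
};

/* Append a partition of npages pages to the end of the backing file. */
static int
part_carve(struct part *p, int fd, long npages)
{
	long	pagesz = sysconf(_SC_PAGESIZE);
	off_t	end = lseek(fd, 0, SEEK_END);

	if (pagesz <= 0 || end < 0 || npages <= 0)
		return -1;

	/* Earlier partitions are whole pages, so the end stays page aligned. */
	assert(end % pagesz == 0);

	/* Extend the file so the whole window can be written and mapped. */
	if (ftruncate(fd, end + npages * pagesz) < 0)
		return -1;

	p->fd = fd;
	p->pos = end;
	p->maxbytes = npages * pagesz;
	return 0;
}

/* Map one page of the partition; stores through the pointer reach the file. */
static void *
part_map_page(const struct part *p, long pageno)
{
	long	pagesz = sysconf(_SC_PAGESIZE);
	off_t	off = pageno * pagesz;
	void	*addr;

	if (pageno < 0 || off >= p->maxbytes)
		return NULL;

	addr = mmap(NULL, pagesz, PROT_READ | PROT_WRITE, MAP_SHARED,
			p->fd, p->pos + off);
	return addr == MAP_FAILED ? NULL : addr;
}
```

With a shared mapping there is nothing to write back, which is why the
read and write paths for memory targets can return immediately.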

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 libxfs/Makefile    |    4 +
 libxfs/buf_mem.c   |  235 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 libxfs/buf_mem.h   |   26 ++++++
 libxfs/init.c      |    4 +
 libxfs/libxfs_io.h |   22 +++++
 libxfs/rdwr.c      |   41 ++++-----
 6 files changed, 310 insertions(+), 22 deletions(-)
 create mode 100644 libxfs/buf_mem.c
 create mode 100644 libxfs/buf_mem.h


diff --git a/libxfs/Makefile b/libxfs/Makefile
index 43e8ae183229..8dc9a79059ed 100644
--- a/libxfs/Makefile
+++ b/libxfs/Makefile
@@ -26,6 +26,7 @@ HFILES = \
 	libxfs_priv.h \
 	linux-err.h \
 	topology.h \
+	buf_mem.h \
 	xfile.h \
 	xfs_ag_resv.h \
 	xfs_alloc.h \
@@ -58,7 +59,8 @@ HFILES = \
 	xfs_trans_space.h \
 	xfs_dir2_priv.h
 
-CFILES = cache.c \
+CFILES = buf_mem.c \
+	cache.c \
 	defer_item.c \
 	init.c \
 	kmem.c \
diff --git a/libxfs/buf_mem.c b/libxfs/buf_mem.c
new file mode 100644
index 000000000000..7c8fa1d2cdcd
--- /dev/null
+++ b/libxfs/buf_mem.c
@@ -0,0 +1,235 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (c) 2023-2024 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#include "libxfs_priv.h"
+#include "libxfs.h"
+#include "libxfs/xfile.h"
+#include "libxfs/buf_mem.h"
+#include <sys/mman.h>
+#include <sys/types.h>
+#include <sys/wait.h>
+
+/*
+ * Buffer Cache for In-Memory Files
+ * ================================
+ *
+ * Offline fsck wants to create ephemeral ordered recordsets.  The existing
+ * btree infrastructure can do this, but we need the buffer cache to target
+ * memory instead of block devices.
+ *
+ * xfiles meet those requirements.  Therefore, the xmbuf mechanism uses a
+ * partition on an xfile to store the staging data.
+ *
+ * xmbufs assume that the caller will handle all required concurrency
+ * management.  The resulting xfs_buf objects are kept private to the xmbuf
+ * (they are not recycled to the LRU) because b_addr is mapped directly to the
+ * memfd file.
+ *
+ * The only supported block size is the system page size.
+ */
+
+/* Figure out the xfile buffer cache block size here */
+unsigned int	XMBUF_BLOCKSIZE;
+unsigned int	XMBUF_BLOCKSHIFT;
+
+void
+xmbuf_libinit(void)
+{
+	long		ret = sysconf(_SC_PAGESIZE);
+
+	/* If we don't find a power-of-two page size, go with 4k. */
+	if (ret < 0 || !is_power_of_2(ret))
+		ret = 4096;
+
+	XMBUF_BLOCKSIZE = ret;
+	XMBUF_BLOCKSHIFT = libxfs_highbit32(XMBUF_BLOCKSIZE);
+}
+
+/* Allocate a new cache node (aka an xfs_buf) */
+static struct cache_node *
+xmbuf_cache_alloc(
+	cache_key_t		key)
+{
+	struct xfs_bufkey	*bufkey = (struct xfs_bufkey *)key;
+	struct xfs_buf		*bp;
+	int			error;
+
+	bp = kmem_cache_zalloc(xfs_buf_cache, 0);
+	if (!bp)
+		return NULL;
+
+	bp->b_cache_key = bufkey->blkno;
+	bp->b_length = bufkey->bblen;
+	bp->b_target = bufkey->buftarg;
+	bp->b_mount = bufkey->buftarg->bt_mount;
+
+	pthread_mutex_init(&bp->b_lock, NULL);
+	INIT_LIST_HEAD(&bp->b_li_list);
+	bp->b_maps = &bp->__b_map;
+
+	bp->b_nmaps = 1;
+	bp->b_maps[0].bm_bn = bufkey->blkno;
+	bp->b_maps[0].bm_len = bp->b_length;
+
+	error = xmbuf_map_page(bp);
+	if (error) {
+		fprintf(stderr,
+ _("%s: %s can't mmap %u bytes at xfile offset %llu: %s\n"),
+				progname, __FUNCTION__, BBTOB(bp->b_length),
+				(unsigned long long)BBTOB(bufkey->blkno),
+				strerror(-error));
+
+		kmem_cache_free(xfs_buf_cache, bp);
+		return NULL;
+	}
+
+	return &bp->b_node;
+}
+
+/* Flush a buffer to disk before purging the cache node */
+static int
+xmbuf_cache_flush(
+	struct cache_node	*node)
+{
+	/* direct mapped buffers do not need writing */
+	return 0;
+}
+
+/* Release resources, free the buffer. */
+static void
+xmbuf_cache_relse(
+	struct cache_node	*node)
+{
+	struct xfs_buf		*bp;
+
+	bp = container_of(node, struct xfs_buf, b_node);
+	xmbuf_unmap_page(bp);
+	kmem_cache_free(xfs_buf_cache, bp);
+}
+
+/* Release a bunch of buffers */
+static unsigned int
+xmbuf_cache_bulkrelse(
+	struct cache		*cache,
+	struct list_head	*list)
+{
+	struct cache_node	*cn, *n;
+	int			count = 0;
+
+	if (list_empty(list))
+		return 0;
+
+	list_for_each_entry_safe(cn, n, list, cn_mru) {
+		xmbuf_cache_relse(cn);
+		count++;
+	}
+
+	return count;
+}
+
+static struct cache_operations xmbuf_bcache_operations = {
+	.hash		= libxfs_bhash,
+	.alloc		= xmbuf_cache_alloc,
+	.flush		= xmbuf_cache_flush,
+	.relse		= xmbuf_cache_relse,
+	.compare	= libxfs_bcompare,
+	.bulkrelse	= xmbuf_cache_bulkrelse
+};
+
+/*
+ * Allocate a buffer cache target for a memory-backed file and set up the
+ * buffer target.
+ */
+int
+xmbuf_alloc(
+	struct xfs_mount	*mp,
+	const char		*descr,
+	unsigned long long	maxpos,
+	struct xfs_buftarg	**btpp)
+{
+	struct xfs_buftarg	*btp;
+	struct xfile		*xfile;
+	struct cache		*cache;
+	int			error;
+
+	btp = kzalloc(sizeof(*btp), GFP_KERNEL);
+	if (!btp)
+		return -ENOMEM;
+
+	error = xfile_create(descr, maxpos, &xfile);
+	if (error)
+		goto out_btp;
+
+	cache = cache_init(0, LIBXFS_BHASHSIZE(NULL), &xmbuf_bcache_operations);
+	if (!cache) {
+		error = -ENOMEM;
+		goto out_xfile;
+	}
+
+	/* Initialize buffer target */
+	btp->bt_mount = mp;
+	btp->bt_bdev = (dev_t)-1;
+	btp->bt_bdev_fd = -1;
+	btp->bt_xfile = xfile;
+	btp->bcache = cache;
+
+	error = -pthread_mutex_init(&btp->lock, NULL);
+	if (error)
+		goto out_cache;
+
+	*btpp = btp;
+	return 0;
+
+out_cache:
+	cache_destroy(cache);
+out_xfile:
+	xfile_destroy(xfile);
+out_btp:
+	kfree(btp);
+	return error;
+}
+
+/* Free a buffer cache target for a memory-backed file. */
+void
+xmbuf_free(
+	struct xfs_buftarg	*btp)
+{
+	ASSERT(xfs_buftarg_is_mem(btp));
+
+	cache_destroy(btp->bcache);
+	pthread_mutex_destroy(&btp->lock);
+	xfile_destroy(btp->bt_xfile);
+	kfree(btp);
+}
+
+/* Directly map a memfd page into the buffer cache. */
+int
+xmbuf_map_page(
+	struct xfs_buf		*bp)
+{
+	struct xfile		*xfile = bp->b_target->bt_xfile;
+	void			*p;
+	loff_t			pos;
+
+	pos = xfile->partition_pos + BBTOB(xfs_buf_daddr(bp));
+	p = mmap(NULL, BBTOB(bp->b_length), PROT_READ | PROT_WRITE, MAP_SHARED,
+			xfile->fcb->fd, pos);
+	if (p == MAP_FAILED)
+		return -errno;
+
+	bp->b_addr = p;
+	bp->b_flags |= LIBXFS_B_UPTODATE | LIBXFS_B_UNCHECKED;
+	bp->b_error = 0;
+	return 0;
+}
+
+/* Unmap a memfd page that was mapped into the buffer cache. */
+void
+xmbuf_unmap_page(
+	struct xfs_buf		*bp)
+{
+	munmap(bp->b_addr, BBTOB(bp->b_length));
+	bp->b_addr = NULL;
+}
diff --git a/libxfs/buf_mem.h b/libxfs/buf_mem.h
new file mode 100644
index 000000000000..d2be2c4240b6
--- /dev/null
+++ b/libxfs/buf_mem.h
@@ -0,0 +1,26 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (c) 2023-2024 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#ifndef __XFS_BUF_MEM_H__
+#define __XFS_BUF_MEM_H__
+
+extern unsigned int		XMBUF_BLOCKSIZE;
+extern unsigned int		XMBUF_BLOCKSHIFT;
+
+void xmbuf_libinit(void);
+
+static inline bool xfs_buftarg_is_mem(const struct xfs_buftarg *target)
+{
+	return target->bt_xfile != NULL;
+}
+
+int xmbuf_alloc(struct xfs_mount *mp, const char *descr,
+		unsigned long long maxpos, struct xfs_buftarg **btpp);
+void xmbuf_free(struct xfs_buftarg *btp);
+
+int xmbuf_map_page(struct xfs_buf *bp);
+void xmbuf_unmap_page(struct xfs_buf *bp);
+
+#endif /* __XFS_BUF_MEM_H__ */
diff --git a/libxfs/init.c b/libxfs/init.c
index f002dc93cd56..f5cd85655cf0 100644
--- a/libxfs/init.c
+++ b/libxfs/init.c
@@ -22,6 +22,8 @@
 #include "xfs_rmap_btree.h"
 #include "xfs_refcount_btree.h"
 #include "libfrog/platform.h"
+#include "libxfs/xfile.h"
+#include "libxfs/buf_mem.h"
 
 #include "xfs_format.h"
 #include "xfs_da_format.h"
@@ -253,6 +255,7 @@ int
 libxfs_init(struct libxfs_init *a)
 {
 	xfs_check_ondisk_structs();
+	xmbuf_libinit();
 	rcu_init();
 	rcu_register_thread();
 	radix_tree_init();
@@ -463,6 +466,7 @@ libxfs_buftarg_alloc(
 	btp->bt_mount = mp;
 	btp->bt_bdev = dev->dev;
 	btp->bt_bdev_fd = dev->fd;
+	btp->bt_xfile = NULL;
 	btp->flags = 0;
 	if (write_fails) {
 		btp->writes_left = write_fails;
diff --git a/libxfs/libxfs_io.h b/libxfs/libxfs_io.h
index 7877e17685b8..ae3c4a9484c7 100644
--- a/libxfs/libxfs_io.h
+++ b/libxfs/libxfs_io.h
@@ -27,6 +27,7 @@ struct xfs_buftarg {
 	unsigned long		writes_left;
 	dev_t			bt_bdev;
 	int			bt_bdev_fd;
+	struct xfile		*bt_xfile;
 	unsigned int		flags;
 	struct cache		*bcache;	/* buffer cache */
 };
@@ -58,6 +59,27 @@ xfs_buftarg_trip_write(
 void libxfs_buftarg_init(struct xfs_mount *mp, struct libxfs_init *xi);
 int libxfs_blkdev_issue_flush(struct xfs_buftarg *btp);
 
+/*
+ * The bufkey is used to pass the new buffer information to the cache object
+ * allocation routine. Because discontiguous buffers need to pass different
+ * information, we need fields to pass that information. However, because the
+ * blkno and bblen are needed for the initial cache entry lookup (i.e. for
+ * bcompare), a non-null map/nmaps is what switches us to discontiguous
+ * buffer initialisation instead of contiguous buffer initialisation.
+ */
+struct xfs_bufkey {
+	struct xfs_buftarg	*buftarg;
+	xfs_daddr_t		blkno;
+	unsigned int		bblen;
+	struct xfs_buf_map	*map;
+	int			nmaps;
+};
+
+/* for buf_mem.c only: */
+unsigned int libxfs_bhash(cache_key_t key, unsigned int hashsize,
+		unsigned int hashshift);
+int libxfs_bcompare(struct cache_node *node, cache_key_t key);
+
 #define LIBXFS_BBTOOFF64(bbs)	(((xfs_off_t)(bbs)) << BBSHIFT)
 
 #define XB_PAGES        2
diff --git a/libxfs/rdwr.c b/libxfs/rdwr.c
index cf986a7e7820..50760cd866e3 100644
--- a/libxfs/rdwr.c
+++ b/libxfs/rdwr.c
@@ -18,7 +18,8 @@
 #include "xfs_inode.h"
 #include "xfs_trans.h"
 #include "libfrog/platform.h"
-
+#include "libxfs/xfile.h"
+#include "libxfs/buf_mem.h"
 #include "libxfs.h"
 
 static void libxfs_brelse(struct cache_node *node);
@@ -69,6 +70,9 @@ libxfs_device_zero(struct xfs_buftarg *btp, xfs_daddr_t start, uint len)
 	char		*z;
 	int		error;
 
+	if (xfs_buftarg_is_mem(btp))
+		return -EOPNOTSUPP;
+
 	start_offset = LIBXFS_BBTOOFF64(start);
 
 	/* try to use special zeroing methods, fall back to writes if needed */
@@ -167,26 +171,10 @@ static struct cache_mru		xfs_buf_freelist =
 	{{&xfs_buf_freelist.cm_list, &xfs_buf_freelist.cm_list},
 	 0, PTHREAD_MUTEX_INITIALIZER };
 
-/*
- * The bufkey is used to pass the new buffer information to the cache object
- * allocation routine. Because discontiguous buffers need to pass different
- * information, we need fields to pass that information. However, because the
- * blkno and bblen is needed for the initial cache entry lookup (i.e. for
- * bcompare) the fact that the map/nmaps is non-null to switch to discontiguous
- * buffer initialisation instead of a contiguous buffer.
- */
-struct xfs_bufkey {
-	struct xfs_buftarg	*buftarg;
-	xfs_daddr_t		blkno;
-	unsigned int		bblen;
-	struct xfs_buf_map	*map;
-	int			nmaps;
-};
-
 /*  2^63 + 2^61 - 2^57 + 2^54 - 2^51 - 2^18 + 1 */
 #define GOLDEN_RATIO_PRIME	0x9e37fffffffc0001UL
 #define CACHE_LINE_SIZE		64
-static unsigned int
+unsigned int
 libxfs_bhash(cache_key_t key, unsigned int hashsize, unsigned int hashshift)
 {
 	uint64_t	hashval = ((struct xfs_bufkey *)key)->blkno;
@@ -197,7 +185,7 @@ libxfs_bhash(cache_key_t key, unsigned int hashsize, unsigned int hashshift)
 	return tmp % hashsize;
 }
 
-static int
+int
 libxfs_bcompare(
 	struct cache_node	*node,
 	cache_key_t		key)
@@ -231,6 +219,8 @@ static void
 __initbuf(struct xfs_buf *bp, struct xfs_buftarg *btp, xfs_daddr_t bno,
 		unsigned int bytes)
 {
+	ASSERT(!xfs_buftarg_is_mem(btp));
+
 	bp->b_flags = 0;
 	bp->b_cache_key = bno;
 	bp->b_length = BTOBB(bytes);
@@ -577,7 +567,6 @@ libxfs_balloc(
 	return &bp->b_node;
 }
 
-
 static int
 __read_buf(int fd, void *buf, int len, off_t offset, int flags)
 {
@@ -607,6 +596,9 @@ libxfs_readbufr(struct xfs_buftarg *btp, xfs_daddr_t blkno, struct xfs_buf *bp,
 
 	ASSERT(len <= bp->b_length);
 
+	if (xfs_buftarg_is_mem(btp))
+		return 0;
+
 	error = __read_buf(fd, bp->b_addr, bytes, LIBXFS_BBTOOFF64(blkno), flags);
 	if (!error &&
 	    bp->b_target == btp &&
@@ -639,6 +631,9 @@ libxfs_readbufr_map(struct xfs_buftarg *btp, struct xfs_buf *bp, int flags)
 	void	*buf;
 	int	i;
 
+	if (xfs_buftarg_is_mem(btp))
+		return 0;
+
 	buf = bp->b_addr;
 	for (i = 0; i < bp->b_nmaps; i++) {
 		off_t	offset = LIBXFS_BBTOOFF64(bp->b_maps[i].bm_bn);
@@ -857,7 +852,9 @@ libxfs_bwrite(
 		}
 	}
 
-	if (!(bp->b_flags & LIBXFS_B_DISCONTIG)) {
+	if (xfs_buftarg_is_mem(bp->b_target)) {
+		bp->b_error = 0;
+	} else if (!(bp->b_flags & LIBXFS_B_DISCONTIG)) {
 		bp->b_error = __write_buf(fd, bp->b_addr, BBTOB(bp->b_length),
 				    LIBXFS_BBTOOFF64(xfs_buf_daddr(bp)),
 				    bp->b_flags);
@@ -917,6 +914,8 @@ libxfs_buf_prepare_mru(
 		xfs_perag_put(bp->b_pag);
 	bp->b_pag = NULL;
 
+	ASSERT(!xfs_buftarg_is_mem(btp));
+
 	if (!(bp->b_flags & LIBXFS_B_DIRTY))
 		return;
 



* [PATCH 1/4] libxfs: remove kmem_alloc, kmem_zalloc, and kmem_free
  2024-04-16  0:58 ` [PATCHSET v30.3 3/4] xfsprogs: bmap log intent cleanups Darrick J. Wong
@ 2024-04-16  1:01   ` Darrick J. Wong
  2024-04-16  4:55     ` Christoph Hellwig
  2024-04-16  1:01   ` [PATCH 2/4] libxfs: add a bi_entry helper Darrick J. Wong
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 36+ messages in thread
From: Darrick J. Wong @ 2024-04-16  1:01 UTC (permalink / raw)
  To: djwong, cem; +Cc: cmaiolino, linux-xfs, hch

From: Darrick J. Wong <djwong@kernel.org>

Remove all three of these helpers now that the kernel has dropped them.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 db/bmap_inflate.c         |    2 +-
 include/kmem.h            |   10 +---------
 libxfs/defer_item.c       |    2 +-
 libxfs/init.c             |    2 +-
 libxfs/kmem.c             |   32 ++++++++++----------------------
 libxlog/xfs_log_recover.c |   19 +++++++++----------
 repair/bmap_repair.c      |    4 ++--
 7 files changed, 25 insertions(+), 46 deletions(-)


diff --git a/db/bmap_inflate.c b/db/bmap_inflate.c
index c85d5dc0d64a..00e1aff66567 100644
--- a/db/bmap_inflate.c
+++ b/db/bmap_inflate.c
@@ -327,7 +327,7 @@ populate_btree(
 	/* Leak any unused blocks */
 	list_for_each_entry_safe(resv, n, &bd.resv_list, list) {
 		list_del(&resv->list);
-		kmem_free(resv);
+		kfree(resv);
 	}
 	return error;
 }
diff --git a/include/kmem.h b/include/kmem.h
index 6818a404728f..386b4a6be783 100644
--- a/include/kmem.h
+++ b/include/kmem.h
@@ -50,15 +50,7 @@ kmem_cache_free(struct kmem_cache *cache, void *ptr)
 	free(ptr);
 }
 
-extern void	*kmem_alloc(size_t, int);
 extern void	*kvmalloc(size_t, gfp_t);
-extern void	*kmem_zalloc(size_t, int);
-
-static inline void
-kmem_free(const void *ptr) {
-	free((void *)ptr);
-}
-
 extern void	*krealloc(void *, size_t, int);
 
 static inline void *kmalloc(size_t size, gfp_t flags)
@@ -70,7 +62,7 @@ static inline void *kmalloc(size_t size, gfp_t flags)
 
 static inline void kfree(const void *ptr)
 {
-	return kmem_free(ptr);
+	free((void *)ptr);
 }
 
 #endif
diff --git a/libxfs/defer_item.c b/libxfs/defer_item.c
index d67032c26200..680a72664746 100644
--- a/libxfs/defer_item.c
+++ b/libxfs/defer_item.c
@@ -606,7 +606,7 @@ xfs_attr_free_item(
 	if (attr->xattri_da_state)
 		xfs_da_state_free(attr->xattri_da_state);
 	if (attr->xattri_da_args->op_flags & XFS_DA_OP_RECOVERY)
-		kmem_free(attr);
+		kfree(attr);
 	else
 		kmem_cache_free(xfs_attr_intent_cache, attr);
 }
diff --git a/libxfs/init.c b/libxfs/init.c
index f5cd85655cf0..d0478960278a 100644
--- a/libxfs/init.c
+++ b/libxfs/init.c
@@ -893,7 +893,7 @@ libxfs_buftarg_free(
 	struct xfs_buftarg	*btp)
 {
 	cache_destroy(btp->bcache);
-	kmem_free(btp);
+	kfree(btp);
 }
 
 /*
diff --git a/libxfs/kmem.c b/libxfs/kmem.c
index c264be018bdc..a2a3935d00e8 100644
--- a/libxfs/kmem.c
+++ b/libxfs/kmem.c
@@ -65,33 +65,21 @@ kmem_cache_zalloc(struct kmem_cache *cache, gfp_t flags)
 	return ptr;
 }
 
-void *
-kmem_alloc(size_t size, int flags)
-{
-	void	*ptr = malloc(size);
-
-	if (ptr == NULL) {
-		fprintf(stderr, _("%s: malloc failed (%d bytes): %s\n"),
-			progname, (int)size, strerror(errno));
-		exit(1);
-	}
-	return ptr;
-}
-
 void *
 kvmalloc(size_t size, gfp_t flags)
 {
+	void	*ptr;
+
 	if (flags & __GFP_ZERO)
-		return kmem_zalloc(size, 0);
-	return kmem_alloc(size, 0);
-}
+		ptr = calloc(1, size);
+	else
+		ptr = malloc(size);
 
-void *
-kmem_zalloc(size_t size, int flags)
-{
-	void	*ptr = kmem_alloc(size, flags);
-
-	memset(ptr, 0, size);
+	if (ptr == NULL) {
+		fprintf(stderr, _("%s: malloc failed (%d bytes): %s\n"),
+			progname, (int)size, strerror(errno));
+		exit(1);
+	}
 	return ptr;
 }
 
diff --git a/libxlog/xfs_log_recover.c b/libxlog/xfs_log_recover.c
index 99f759d5cb03..31b11fee9e47 100644
--- a/libxlog/xfs_log_recover.c
+++ b/libxlog/xfs_log_recover.c
@@ -991,7 +991,7 @@ xlog_recover_new_tid(
 {
 	struct xlog_recover	*trans;
 
-	trans = kmem_zalloc(sizeof(struct xlog_recover), 0);
+	trans = kzalloc(sizeof(struct xlog_recover), 0);
 	trans->r_log_tid   = tid;
 	trans->r_lsn	   = lsn;
 	INIT_LIST_HEAD(&trans->r_itemq);
@@ -1006,7 +1006,7 @@ xlog_recover_add_item(
 {
 	struct xlog_recover_item *item;
 
-	item = kmem_zalloc(sizeof(struct xlog_recover_item), 0);
+	item = kzalloc(sizeof(struct xlog_recover_item), 0);
 	INIT_LIST_HEAD(&item->ri_list);
 	list_add_tail(&item->ri_list, head);
 }
@@ -1085,7 +1085,7 @@ xlog_recover_add_to_trans(
 		return 0;
 	}
 
-	ptr = kmem_alloc(len, 0);
+	ptr = kmalloc(len, 0);
 	memcpy(ptr, dp, len);
 	in_f = (struct xfs_inode_log_format *)ptr;
 
@@ -1107,13 +1107,12 @@ xlog_recover_add_to_trans(
 		"bad number of regions (%d) in inode log format",
 				  in_f->ilf_size);
 			ASSERT(0);
-			kmem_free(ptr);
+			kfree(ptr);
 			return XFS_ERROR(EIO);
 		}
 
 		item->ri_total = in_f->ilf_size;
-		item->ri_buf =
-			kmem_zalloc(item->ri_total * sizeof(xfs_log_iovec_t),
+		item->ri_buf = kzalloc(item->ri_total * sizeof(xfs_log_iovec_t),
 				    0);
 	}
 	ASSERT(item->ri_total > item->ri_cnt);
@@ -1141,13 +1140,13 @@ xlog_recover_free_trans(
 		/* Free the regions in the item. */
 		list_del(&item->ri_list);
 		for (i = 0; i < item->ri_cnt; i++)
-			kmem_free(item->ri_buf[i].i_addr);
+			kfree(item->ri_buf[i].i_addr);
 		/* Free the item itself */
-		kmem_free(item->ri_buf);
-		kmem_free(item);
+		kfree(item->ri_buf);
+		kfree(item);
 	}
 	/* Free the transaction recover structure */
-	kmem_free(trans);
+	kfree(trans);
 }
 
 /*
diff --git a/repair/bmap_repair.c b/repair/bmap_repair.c
index 845584f18450..317061aa564f 100644
--- a/repair/bmap_repair.c
+++ b/repair/bmap_repair.c
@@ -595,7 +595,7 @@ xrep_bmap(
 	if (error)
 		return error;
 
-	rb = kmem_zalloc(sizeof(struct xrep_bmap), KM_NOFS | KM_MAYFAIL);
+	rb = kzalloc(sizeof(struct xrep_bmap), 0);
 	if (!rb)
 		return ENOMEM;
 	rb->sc = sc;
@@ -622,7 +622,7 @@ xrep_bmap(
 out_bitmap:
 	free_slab(&rb->bmap_records);
 out_rb:
-	kmem_free(rb);
+	kfree(rb);
 	return error;
 }
 



* [PATCH 2/4] libxfs: add a bi_entry helper
  2024-04-16  0:58 ` [PATCHSET v30.3 3/4] xfsprogs: bmap log intent cleanups Darrick J. Wong
  2024-04-16  1:01   ` [PATCH 1/4] libxfs: remove kmem_alloc, kmem_zalloc, and kmem_free Darrick J. Wong
@ 2024-04-16  1:01   ` Darrick J. Wong
  2024-04-16  4:55     ` Christoph Hellwig
  2024-04-16  1:01   ` [PATCH 3/4] libxfs: reuse xfs_bmap_update_cancel_item Darrick J. Wong
  2024-04-16  1:02   ` [PATCH 4/4] libxfs: add a xattr_entry helper Darrick J. Wong
  3 siblings, 1 reply; 36+ messages in thread
From: Darrick J. Wong @ 2024-04-16  1:01 UTC (permalink / raw)
  To: djwong, cem; +Cc: cmaiolino, linux-xfs, hch

From: Darrick J. Wong <djwong@kernel.org>

Add a helper to translate from the item list head to the bmap_intent
structure and use it to shorten assignments and avoid the need for extra
local variables.
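
For readers unfamiliar with the pattern, the following standalone sketch
shows what such a helper buys.  The list_head and container_of definitions
are simplified userspace stand-ins, and struct bmap_intent with its
owner_ino field is a hypothetical reduction of the real intent structure.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified userspace stand-ins for the kernel list primitives. */
struct list_head {
	struct list_head	*next, *prev;
};

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical, cut-down intent: only the fields the sketch needs. */
struct bmap_intent {
	int			owner_ino;
	struct list_head	bi_list;
};

/* Translate an embedded list head back to the intent that contains it. */
static inline struct bmap_intent *
bi_entry(const struct list_head *e)
{
	return container_of(e, struct bmap_intent, bi_list);
}

/* With the helper, callers no longer need a declare-then-assign pair. */
static int
bmap_diff_items(const struct list_head *a, const struct list_head *b)
{
	struct bmap_intent	*ba = bi_entry(a);
	struct bmap_intent	*bb = bi_entry(b);

	return ba->owner_ino - bb->owner_ino;
}
```

The helper keeps the structure and member names in one place, so each
caller shrinks to a single initialised declaration.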

Inspired-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 libxfs/defer_item.c |   18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)


diff --git a/libxfs/defer_item.c b/libxfs/defer_item.c
index 680a72664746..d19322a0b255 100644
--- a/libxfs/defer_item.c
+++ b/libxfs/defer_item.c
@@ -439,6 +439,11 @@ const struct xfs_defer_op_type xfs_refcount_update_defer_type = {
 
 /* Inode Block Mapping */
 
+static inline struct xfs_bmap_intent *bi_entry(const struct list_head *e)
+{
+	return list_entry(e, struct xfs_bmap_intent, bi_list);
+}
+
 /* Sort bmap intents by inode. */
 static int
 xfs_bmap_update_diff_items(
@@ -446,11 +451,9 @@ xfs_bmap_update_diff_items(
 	const struct list_head		*a,
 	const struct list_head		*b)
 {
-	const struct xfs_bmap_intent	*ba;
-	const struct xfs_bmap_intent	*bb;
+	struct xfs_bmap_intent		*ba = bi_entry(a);
+	struct xfs_bmap_intent		*bb = bi_entry(b);
 
-	ba = container_of(a, struct xfs_bmap_intent, bi_list);
-	bb = container_of(b, struct xfs_bmap_intent, bi_list);
 	return ba->bi_owner->i_ino - bb->bi_owner->i_ino;
 }
 
@@ -527,10 +530,9 @@ xfs_bmap_update_finish_item(
 	struct list_head		*item,
 	struct xfs_btree_cur		**state)
 {
-	struct xfs_bmap_intent		*bi;
+	struct xfs_bmap_intent		*bi = bi_entry(item);
 	int				error;
 
-	bi = container_of(item, struct xfs_bmap_intent, bi_list);
 	error = xfs_bmap_finish_one(tp, bi);
 	if (!error && bi->bi_bmap.br_blockcount > 0) {
 		ASSERT(bi->bi_type == XFS_BMAP_UNMAP);
@@ -554,9 +556,7 @@ STATIC void
 xfs_bmap_update_cancel_item(
 	struct list_head		*item)
 {
-	struct xfs_bmap_intent		*bi;
-
-	bi = container_of(item, struct xfs_bmap_intent, bi_list);
+	struct xfs_bmap_intent		*bi = bi_entry(item);
 
 	xfs_bmap_update_put_group(bi);
 	kmem_cache_free(xfs_bmap_intent_cache, bi);



* [PATCH 3/4] libxfs: reuse xfs_bmap_update_cancel_item
  2024-04-16  0:58 ` [PATCHSET v30.3 3/4] xfsprogs: bmap log intent cleanups Darrick J. Wong
  2024-04-16  1:01   ` [PATCH 1/4] libxfs: remove kmem_alloc, kmem_zalloc, and kmem_free Darrick J. Wong
  2024-04-16  1:01   ` [PATCH 2/4] libxfs: add a bi_entry helper Darrick J. Wong
@ 2024-04-16  1:01   ` Darrick J. Wong
  2024-04-16  4:55     ` Christoph Hellwig
  2024-04-16  1:02   ` [PATCH 4/4] libxfs: add a xattr_entry helper Darrick J. Wong
  3 siblings, 1 reply; 36+ messages in thread
From: Darrick J. Wong @ 2024-04-16  1:01 UTC (permalink / raw)
  To: djwong, cem; +Cc: cmaiolino, linux-xfs, hch

From: Darrick J. Wong <djwong@kernel.org>

Reuse xfs_bmap_update_cancel_item to put the AG/RTG and free the item in
a few places that currently open code the logic.
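
The shape of the refactoring can be sketched in isolation as follows.
Everything here is hypothetical (struct group, struct intent, the
four-blocks-per-step rule); the point is only that the finish path calls
the cancel helper for its release step instead of repeating the put and
free inline.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical group object that counts the intents pinning it. */
struct group {
	int		intent_refs;
};

/* Hypothetical heap-allocated intent. */
struct intent {
	struct group	*grp;
	unsigned long	blockcount;	/* work still to be done */
};

/* Single place that drops the group reference and frees the item. */
static void
intent_cancel_item(struct intent *it)
{
	assert(it->grp->intent_refs > 0);
	it->grp->intent_refs--;
	free(it);
}

/*
 * Do one step of work.  Returns -EAGAIN while work remains (the item stays
 * alive), or 0 once the item has been finished and released.
 */
static int
intent_finish_item(struct intent *it)
{
	/* Assumption for the sketch: one step handles at most 4 blocks. */
	unsigned long	step = it->blockcount < 4 ? it->blockcount : 4;

	it->blockcount -= step;
	if (it->blockcount > 0)
		return -EAGAIN;

	/* Reuse the cancel helper rather than open coding put + free. */
	intent_cancel_item(it);
	return 0;
}
```

Keeping the release logic in one function means a later change to how the
group reference is dropped only has to be made once.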

Inspired-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 libxfs/defer_item.c |   25 ++++++++++++-------------
 1 file changed, 12 insertions(+), 13 deletions(-)


diff --git a/libxfs/defer_item.c b/libxfs/defer_item.c
index d19322a0b255..36811c7fece1 100644
--- a/libxfs/defer_item.c
+++ b/libxfs/defer_item.c
@@ -522,6 +522,17 @@ xfs_bmap_update_put_group(
 	xfs_perag_intent_put(bi->bi_pag);
 }
 
+/* Cancel a deferred bmap update. */
+STATIC void
+xfs_bmap_update_cancel_item(
+	struct list_head		*item)
+{
+	struct xfs_bmap_intent		*bi = bi_entry(item);
+
+	xfs_bmap_update_put_group(bi);
+	kmem_cache_free(xfs_bmap_intent_cache, bi);
+}
+
 /* Process a deferred rmap update. */
 STATIC int
 xfs_bmap_update_finish_item(
@@ -539,8 +550,7 @@ xfs_bmap_update_finish_item(
 		return -EAGAIN;
 	}
 
-	xfs_bmap_update_put_group(bi);
-	kmem_cache_free(xfs_bmap_intent_cache, bi);
+	xfs_bmap_update_cancel_item(item);
 	return error;
 }
 
@@ -551,17 +561,6 @@ xfs_bmap_update_abort_intent(
 {
 }
 
-/* Cancel a deferred rmap update. */
-STATIC void
-xfs_bmap_update_cancel_item(
-	struct list_head		*item)
-{
-	struct xfs_bmap_intent		*bi = bi_entry(item);
-
-	xfs_bmap_update_put_group(bi);
-	kmem_cache_free(xfs_bmap_intent_cache, bi);
-}
-
 const struct xfs_defer_op_type xfs_bmap_update_defer_type = {
 	.name		= "bmap",
 	.create_intent	= xfs_bmap_update_create_intent,



* [PATCH 4/4] libxfs: add a xattr_entry helper
  2024-04-16  0:58 ` [PATCHSET v30.3 3/4] xfsprogs: bmap log intent cleanups Darrick J. Wong
                     ` (2 preceding siblings ...)
  2024-04-16  1:01   ` [PATCH 3/4] libxfs: reuse xfs_bmap_update_cancel_item Darrick J. Wong
@ 2024-04-16  1:02   ` Darrick J. Wong
  2024-04-16  4:56     ` Christoph Hellwig
  3 siblings, 1 reply; 36+ messages in thread
From: Darrick J. Wong @ 2024-04-16  1:02 UTC (permalink / raw)
  To: djwong, cem; +Cc: cmaiolino, linux-xfs, hch

From: Darrick J. Wong <djwong@kernel.org>

Add a helper to translate from the item list head to the attr_intent
item structure and use it to shorten assignments and avoid the need for
extra local variables.

Inspired-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 libxfs/defer_item.c |   15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)


diff --git a/libxfs/defer_item.c b/libxfs/defer_item.c
index 36811c7fece1..fdb922f08c39 100644
--- a/libxfs/defer_item.c
+++ b/libxfs/defer_item.c
@@ -570,6 +570,13 @@ const struct xfs_defer_op_type xfs_bmap_update_defer_type = {
 	.cancel_item	= xfs_bmap_update_cancel_item,
 };
 
+/* Logged extended attributes */
+
+static inline struct xfs_attr_intent *attri_entry(const struct list_head *e)
+{
+	return list_entry(e, struct xfs_attr_intent, xattri_list);
+}
+
 /* Get an ATTRI. */
 static struct xfs_log_item *
 xfs_attr_create_intent(
@@ -618,11 +625,10 @@ xfs_attr_finish_item(
 	struct list_head	*item,
 	struct xfs_btree_cur	**state)
 {
-	struct xfs_attr_intent	*attr;
-	int			error;
+	struct xfs_attr_intent	*attr = attri_entry(item);
 	struct xfs_da_args	*args;
+	int			error;
 
-	attr = container_of(item, struct xfs_attr_intent, xattri_list);
 	args = attr->xattri_da_args;
 
 	/*
@@ -651,9 +657,8 @@ static void
 xfs_attr_cancel_item(
 	struct list_head	*item)
 {
-	struct xfs_attr_intent	*attr;
+	struct xfs_attr_intent	*attr = attri_entry(item);
 
-	attr = container_of(item, struct xfs_attr_intent, xattri_list);
 	xfs_attr_free_item(attr);
 }
 


^ permalink raw reply related	[flat|nested] 36+ messages in thread
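For readers unfamiliar with the pattern, the helper above is a typed wrapper around the usual "embedded list node" idiom. The following is a minimal sketch under stated assumptions, not the libxfs code: struct demo_intent, demo_entry and the local container_of macro are hypothetical stand-ins.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-in for the list node type; not the libxfs definition. */
struct list_head {
	struct list_head	*next;
	struct list_head	*prev;
};

/* Step back from a member pointer to the start of the enclosing struct. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical intent item with an embedded list node. */
struct demo_intent {
	int			value;
	struct list_head	list;
};

/* Typed accessor: list node in, enclosing item out. */
static inline struct demo_intent *
demo_entry(const struct list_head *e)
{
	return container_of(e, struct demo_intent, list);
}
```

With such a helper, a caller can initialize its local variable in the declaration instead of spelling out the container type and member name each time.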

* [PATCH 1/1] xfs_repair: check num before bplist[num]
  2024-04-16  0:58 ` [PATCHSET v30.3 4/4] xfs_repair: minor fixes Darrick J. Wong
@ 2024-04-16  1:02   ` Darrick J. Wong
  2024-04-16  4:56     ` Christoph Hellwig
  0 siblings, 1 reply; 36+ messages in thread
From: Darrick J. Wong @ 2024-04-16  1:02 UTC (permalink / raw)
  To: djwong, cem; +Cc: cmaiolino, linux-xfs, hch

From: Darrick J. Wong <djwong@kernel.org>

smatch complained about indexing the array before checking the array
index, so fix that.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 repair/prefetch.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)


diff --git a/repair/prefetch.c b/repair/prefetch.c
index de36c5fe2cc9..22efd54bf9eb 100644
--- a/repair/prefetch.c
+++ b/repair/prefetch.c
@@ -494,7 +494,7 @@ pf_batch_read(
 						args->last_bno_read, &fsbno);
 			max_fsbno = fsbno + pf_max_fsbs;
 		}
-		while (bplist[num] && num < MAX_BUFS && fsbno < max_fsbno) {
+		while (num < MAX_BUFS && bplist[num] && fsbno < max_fsbno) {
 			/*
 			 * Discontiguous buffers need special handling, so stop
 			 * gathering new buffers and process the list and this


^ permalink raw reply related	[flat|nested] 36+ messages in thread
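The reordering matters because C evaluates && left to right and stops at the first false operand. A minimal sketch, not taken from repair/prefetch.c (MAX_SLOTS and count_leading_set are hypothetical names):

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SLOTS 4	/* hypothetical bound, stands in for MAX_BUFS */

/*
 * Count the leading non-NULL entries of a fixed-size array.  The bounds
 * check comes first, so slots[num] is never read once num == MAX_SLOTS.
 */
static size_t
count_leading_set(void *slots[MAX_SLOTS])
{
	size_t	num = 0;

	while (num < MAX_SLOTS && slots[num])
		num++;

	assert(num <= MAX_SLOTS);
	return num;
}
```

With the operands in the opposite order, a completely full array would cause a read one element past the end before the bounds check could stop the loop.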

* Re: [PATCH 2/5] xfs_db: improve number extraction in getbitval
  2024-04-16  0:59   ` [PATCH 2/5] xfs_db: improve number extraction in getbitval Darrick J. Wong
@ 2024-04-16  4:53     ` Christoph Hellwig
  0 siblings, 0 replies; 36+ messages in thread
From: Christoph Hellwig @ 2024-04-16  4:53 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: cem, cmaiolino, linux-xfs, hch

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 3/5] xfs_scrub: fix threadcount estimates for phase 6
  2024-04-16  0:59   ` [PATCH 3/5] xfs_scrub: fix threadcount estimates for phase 6 Darrick J. Wong
@ 2024-04-16  4:53     ` Christoph Hellwig
  0 siblings, 0 replies; 36+ messages in thread
From: Christoph Hellwig @ 2024-04-16  4:53 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: cem, cmaiolino, linux-xfs, hch

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 090/111] libxfs: partition memfd files to avoid using too many fds
  2024-04-16  1:00   ` [PATCH 090/111] libxfs: partition memfd files to avoid using too many fds Darrick J. Wong
@ 2024-04-16  4:55     ` Christoph Hellwig
  2024-04-16 15:49       ` Darrick J. Wong
  2024-04-24 17:20     ` [PATCH v3.1 " Darrick J. Wong
  1 sibling, 1 reply; 36+ messages in thread
From: Christoph Hellwig @ 2024-04-16  4:55 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: cem, cmaiolino, linux-xfs, hch

On Mon, Apr 15, 2024 at 06:00:39PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Make it so that we can partition a memfd file to avoid running out of
> file descriptors.

Not a fan of this, but I guess there is a real need somewhere because
we run out of the number of open fds otherwise?  Given that repair
generally runs as root wouldn't it make more sense to just raise the
limit?


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/4] libxfs: remove kmem_alloc, kmem_zalloc, and kmem_free
  2024-04-16  1:01   ` [PATCH 1/4] libxfs: remove kmem_alloc, kmem_zalloc, and kmem_free Darrick J. Wong
@ 2024-04-16  4:55     ` Christoph Hellwig
  0 siblings, 0 replies; 36+ messages in thread
From: Christoph Hellwig @ 2024-04-16  4:55 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: cem, cmaiolino, linux-xfs, hch

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 2/4] libxfs: add a bi_entry helper
  2024-04-16  1:01   ` [PATCH 2/4] libxfs: add a bi_entry helper Darrick J. Wong
@ 2024-04-16  4:55     ` Christoph Hellwig
  0 siblings, 0 replies; 36+ messages in thread
From: Christoph Hellwig @ 2024-04-16  4:55 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: cem, cmaiolino, linux-xfs, hch

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 3/4] libxfs: reuse xfs_bmap_update_cancel_item
  2024-04-16  1:01   ` [PATCH 3/4] libxfs: reuse xfs_bmap_update_cancel_item Darrick J. Wong
@ 2024-04-16  4:55     ` Christoph Hellwig
  0 siblings, 0 replies; 36+ messages in thread
From: Christoph Hellwig @ 2024-04-16  4:55 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: cem, cmaiolino, linux-xfs, hch

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 4/4] libxfs: add a xattr_entry helper
  2024-04-16  1:02   ` [PATCH 4/4] libxfs: add a xattr_entry helper Darrick J. Wong
@ 2024-04-16  4:56     ` Christoph Hellwig
  0 siblings, 0 replies; 36+ messages in thread
From: Christoph Hellwig @ 2024-04-16  4:56 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: cem, cmaiolino, linux-xfs, hch

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/1] xfs_repair: check num before bplist[num]
  2024-04-16  1:02   ` [PATCH 1/1] xfs_repair: check num before bplist[num] Darrick J. Wong
@ 2024-04-16  4:56     ` Christoph Hellwig
  0 siblings, 0 replies; 36+ messages in thread
From: Christoph Hellwig @ 2024-04-16  4:56 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: cem, cmaiolino, linux-xfs, hch

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 090/111] libxfs: partition memfd files to avoid using too many fds
  2024-04-16  4:55     ` Christoph Hellwig
@ 2024-04-16 15:49       ` Darrick J. Wong
  2024-04-16 16:29         ` Christoph Hellwig
  0 siblings, 1 reply; 36+ messages in thread
From: Darrick J. Wong @ 2024-04-16 15:49 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: cem, cmaiolino, linux-xfs

On Mon, Apr 15, 2024 at 09:55:00PM -0700, Christoph Hellwig wrote:
> On Mon, Apr 15, 2024 at 06:00:39PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > Make it so that we can partition a memfd file to avoid running out of
> > file descriptors.
> 
> Not a fan of this, but I guess there is a real need somewhere because
> we run out of the number of open fds otherwise?

Yes, we can hit the open fd limit...

>                                                  Given that repair
> generally runs as root wouldn't it make more sense to just raise the
> limit?

...and we /did/ raise the limit to whatever RLIMIT_NOFILE says is the
maximum, but sysadmins could have lowered sysctl_nr_open on us, so we
still ought to partition to try to avoid ENFILE on those environments.

(Granted the /proc/sys/fs/nr_open default is a million, and if you
actually have more than 500,000 AGs then either wowee you are rich!! or
clod-init exploded the fs and you get what you deserve :P)

--D

^ permalink raw reply	[flat|nested] 36+ messages in thread
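As an illustration of the limit-raising step mentioned above, here is a minimal sketch using getrlimit(2) and setrlimit(2). It is not the xfs_repair implementation; raise_nofile_limit is a hypothetical helper, and it does nothing about the nr_open caveat raised in the message.

```c
#include <assert.h>
#include <sys/resource.h>

/*
 * Raise the soft RLIMIT_NOFILE to the current hard limit.  Returns 0 on
 * success or -1 with errno set.  Hypothetical helper for illustration.
 */
static int
raise_nofile_limit(void)
{
	struct rlimit	rl;

	if (getrlimit(RLIMIT_NOFILE, &rl) < 0)
		return -1;

	/* The soft limit may be raised up to the hard limit. */
	rl.rlim_cur = rl.rlim_max;
	return setrlimit(RLIMIT_NOFILE, &rl);
}
```

A caller would still need a fallback for the case where the resulting limit is too small, which is the situation the partitioning patch is meant to handle.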

* Re: [PATCH 090/111] libxfs: partition memfd files to avoid using too many fds
  2024-04-16 15:49       ` Darrick J. Wong
@ 2024-04-16 16:29         ` Christoph Hellwig
  2024-04-16 16:57           ` Darrick J. Wong
  0 siblings, 1 reply; 36+ messages in thread
From: Christoph Hellwig @ 2024-04-16 16:29 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Christoph Hellwig, cem, cmaiolino, linux-xfs

On Tue, Apr 16, 2024 at 08:49:32AM -0700, Darrick J. Wong wrote:
> > Not a fan of this, but I guess there is a real need somewhere because
> > we run out of the number of open fds otherwise?
> 
> Yes, we can hit the open fd limit...
> 
> >                                                  Given that repair
> > generally runs as root wouldn't it make more sense to just raise the
> > limit?
> 
> ...and we /did/ raise the limit to whatever RLIMIT_NOFILE says is the
> maximum, but sysadmins could have lowered sysctl_nr_open on us, so we
> still ought to partition to try to avoid ENFILE on those environments.
> 
> (Granted the /proc/sys/fs/nr_open default is a million, and if you
> actually have more than 500,000 AGs then either wowee you are rich!! or
> clod-init exploded the fs and you get what you deserve :P)

What is clod-init?  And where did you see this happen?


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 090/111] libxfs: partition memfd files to avoid using too many fds
  2024-04-16 16:29         ` Christoph Hellwig
@ 2024-04-16 16:57           ` Darrick J. Wong
  2024-04-16 18:47             ` Christoph Hellwig
  0 siblings, 1 reply; 36+ messages in thread
From: Darrick J. Wong @ 2024-04-16 16:57 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: cem, cmaiolino, linux-xfs

On Tue, Apr 16, 2024 at 09:29:09AM -0700, Christoph Hellwig wrote:
> On Tue, Apr 16, 2024 at 08:49:32AM -0700, Darrick J. Wong wrote:
> > > Not a fan of this, but I guess there is a real need somewhere because
> > > we run out of the number of open fds otherwise?
> > 
> > Yes, we can hit the open fd limit...
> > 
> > >                                                  Given that repair
> > > generally runs as root wouldn't it make more sense to just raise the
> > > limit?
> > 
> > ...and we /did/ raise the limit to whatever RLIMIT_NOFILE says is the
> > maximum, but sysadmins could have lowered sysctl_nr_open on us, so we
> > still ought to partition to try to avoid ENFILE on those environments.
> > 
> > (Granted the /proc/sys/fs/nr_open default is a million, and if you
> > actually have more than 500,000 AGs then either wowee you are rich!! or
> > clod-init exploded the fs and you get what you deserve :P)
> 
> What is clod-init?  And where did you see this happen?

cloud-init is a piece of software that cloud/container vendors install
in the rootfs that will, upon the first startup, growfs the minified
root image to cover the entire root disk.  This is why we keep getting
complaints about 1TB filesystems with 1,000 AGs in them.  It's "fine"
for ext4 because of the 128M groups, and completely terrible for XFS.

(More generally it will also configure networking, accounts, and the
mandatory vendor agents and whatnot.)

--D

^ permalink raw reply	[flat|nested] 36+ messages in thread
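The group counts in the message above follow from simple division, since growing a filesystem keeps the existing group size. A minimal sketch of the arithmetic; the 1 GiB AG size is an assumption chosen to match the "1TB with 1,000 AGs" figure, and group_count is a hypothetical helper:

```c
#include <assert.h>
#include <stdint.h>

/* Number of fixed-size groups needed to cover a device, rounding up. */
static uint64_t
group_count(uint64_t dev_bytes, uint64_t group_bytes)
{
	assert(group_bytes > 0);
	return (dev_bytes + group_bytes - 1) / group_bytes;
}
```

Under that assumption, group_count(1ULL << 40, 1ULL << 30) gives 1024 AGs for a 1 TiB disk, while the same disk with 128 MiB groups has 8192 of them.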

* Re: [PATCH 090/111] libxfs: partition memfd files to avoid using too many fds
  2024-04-16 16:57           ` Darrick J. Wong
@ 2024-04-16 18:47             ` Christoph Hellwig
  2024-04-16 18:55               ` Darrick J. Wong
  0 siblings, 1 reply; 36+ messages in thread
From: Christoph Hellwig @ 2024-04-16 18:47 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Christoph Hellwig, cem, cmaiolino, linux-xfs

On Tue, Apr 16, 2024 at 09:57:41AM -0700, Darrick J. Wong wrote:
> cloud-init is a piece of software that cloud/container vendors install
> in the rootfs that will, upon the first startup, growfs the minified
> root image to cover the entire root disk.  This is why we keep getting
> complaints about 1TB filesystems with 1,000 AGs in them.  It's "fine"
> for ext4 because of the 128M groups, and completely terrible for XFS.
> 
> (More generally it will also configure networking, accounts, and the
> mandatory vendor agents and whatnot.)

Yes, I know cloud-init, but between the misspelling and not directly
obvious relevance I didn't get the reference.


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 090/111] libxfs: partition memfd files to avoid using too many fds
  2024-04-16 18:47             ` Christoph Hellwig
@ 2024-04-16 18:55               ` Darrick J. Wong
  0 siblings, 0 replies; 36+ messages in thread
From: Darrick J. Wong @ 2024-04-16 18:55 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: cem, cmaiolino, linux-xfs

On Tue, Apr 16, 2024 at 11:47:35AM -0700, Christoph Hellwig wrote:
> On Tue, Apr 16, 2024 at 09:57:41AM -0700, Darrick J. Wong wrote:
> > cloud-init is a piece of software that cloud/container vendors install
> > in the rootfs that will, upon the first startup, growfs the minified
> > root image to cover the entire root disk.  This is why we keep getting
> > complaints about 1TB filesystems with 1,000 AGs in them.  It's "fine"
> > for ext4 because of the 128M groups, and completely terrible for XFS.
> > 
> > (More generally it will also configure networking, accounts, and the
> > mandatory vendor agents and whatnot.)
> 
> Yes, I know cloud-init, but between the misspelling and not directly
> obvious relevance I didn't get the reference.

Sorry, that was a typo on my part.

--D

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCHSET 1/4] xfsprogs: bug fixes for 6.8
  2024-04-16  0:57 ` [PATCHSET 1/4] xfsprogs: bug fixes for 6.8 Darrick J. Wong
                     ` (4 preceding siblings ...)
  2024-04-16  0:59   ` [PATCH 5/5] xfs_io: add linux madvise advice codes Darrick J. Wong
@ 2024-04-17  7:34   ` Carlos Maiolino
  2024-04-17 15:30     ` Darrick J. Wong
  5 siblings, 1 reply; 36+ messages in thread
From: Carlos Maiolino @ 2024-04-17  7:34 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Bill O'Donnell, Christoph Hellwig, cmaiolino, linux-xfs, hch

On Mon, Apr 15, 2024 at 05:57:47PM -0700, Darrick J. Wong wrote:
> Hi all,
> 
> Bug fixes for xfsprogs for 6.8.

Hey, do you plan to send a PR for these patches, or do you want me to pick them up from the list?

Carlos

> 
> If you're going to start using this code, I strongly recommend pulling
> from my git trees, which are linked below.
> 
> This has been running on the djcloud for months with no problems.  Enjoy!
> Comments and questions are, as always, welcome.
> 
> --D
> 
> xfsprogs git tree:
> https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=xfsprogs-6.8-fixes
> ---
> Commits in this patchset:
>  * xfs_repair: double-check with shortform attr verifiers
>  * xfs_db: improve number extraction in getbitval
>  * xfs_scrub: fix threadcount estimates for phase 6
>  * xfs_scrub: don't fail while reporting media scan errors
>  * xfs_io: add linux madvise advice codes
> ---
>  db/bit.c             |   37 ++++++++++--------------
>  io/madvise.c         |   77 +++++++++++++++++++++++++++++++++++++++++++++++++-
>  repair/attr_repair.c |   17 +++++++++++
>  scrub/phase6.c       |   36 ++++++++++++++++++-----
>  4 files changed, 137 insertions(+), 30 deletions(-)
> 

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCHSET 1/4] xfsprogs: bug fixes for 6.8
  2024-04-17  7:34   ` [PATCHSET 1/4] xfsprogs: bug fixes for 6.8 Carlos Maiolino
@ 2024-04-17 15:30     ` Darrick J. Wong
  0 siblings, 0 replies; 36+ messages in thread
From: Darrick J. Wong @ 2024-04-17 15:30 UTC (permalink / raw)
  To: Carlos Maiolino
  Cc: Bill O'Donnell, Christoph Hellwig, cmaiolino, linux-xfs, hch

On Wed, Apr 17, 2024 at 09:34:41AM +0200, Carlos Maiolino wrote:
> On Mon, Apr 15, 2024 at 05:57:47PM -0700, Darrick J. Wong wrote:
> > Hi all,
> > 
> > Bug fixes for xfsprogs for 6.8.
> 
> Hey, do you plan to send a PR for these patches, or do you want me to pick them up from the list?

I can send PRs.  Do you want them for just the bugfixes at the start of
my branch, or should I keep going through libxfs syncs all the way to
the end?

--D

> Carlos
> 
> > 
> > If you're going to start using this code, I strongly recommend pulling
> > from my git trees, which are linked below.
> > 
> > This has been running on the djcloud for months with no problems.  Enjoy!
> > Comments and questions are, as always, welcome.
> > 
> > --D
> > 
> > xfsprogs git tree:
> > https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=xfsprogs-6.8-fixes
> > ---
> > Commits in this patchset:
> >  * xfs_repair: double-check with shortform attr verifiers
> >  * xfs_db: improve number extraction in getbitval
> >  * xfs_scrub: fix threadcount estimates for phase 6
> >  * xfs_scrub: don't fail while reporting media scan errors
> >  * xfs_io: add linux madvise advice codes
> > ---
> >  db/bit.c             |   37 ++++++++++--------------
> >  io/madvise.c         |   77 +++++++++++++++++++++++++++++++++++++++++++++++++-
> >  repair/attr_repair.c |   17 +++++++++++
> >  scrub/phase6.c       |   36 ++++++++++++++++++-----
> >  4 files changed, 137 insertions(+), 30 deletions(-)
> > 
> 

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH v3.1 090/111] libxfs: partition memfd files to avoid using too many fds
  2024-04-16  1:00   ` [PATCH 090/111] libxfs: partition memfd files to avoid using too many fds Darrick J. Wong
  2024-04-16  4:55     ` Christoph Hellwig
@ 2024-04-24 17:20     ` Darrick J. Wong
  1 sibling, 0 replies; 36+ messages in thread
From: Darrick J. Wong @ 2024-04-24 17:20 UTC (permalink / raw)
  To: cem; +Cc: cmaiolino, linux-xfs, hch

From: Darrick J. Wong <djwong@kernel.org>

In a few patchsets from now, we'll transition xfs_repair to use
memfd-backed rmap and rcbag btrees for storing repair data instead of
heap allocations.  This allows repair to use libxfs code shared from the
online repair code, which reduces the size of the codebase.  It also
reduces heap fragmentation, which might be critical on 32-bit systems.

However, there's one hitch -- userspace xfiles naively allocate one
memfd per data structure, but there's only so many file descriptors that
a process can open.  If a filesystem has a lot of allocation groups, we
can run out of fds and fail.  xfs_repair already tries to increase
RLIMIT_NOFILE to the maximum (~1M) but this can fail due to system or
memory constraints.

Fortunately, it is possible to compute the upper bound of a memfd btree,
which implies that we can store multiple btrees per memfd.  Make it so
that we can partition a memfd file to avoid running out of file
descriptors.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
v3.1: improve commit message to explain why we need this
---
 libxfs/xfile.c |  197 ++++++++++++++++++++++++++++++++++++++++++++++++++++++--
 libxfs/xfile.h |   17 ++++-
 2 files changed, 205 insertions(+), 9 deletions(-)

diff --git a/libxfs/xfile.c b/libxfs/xfile.c
index cba173cc17f1..fdb76f406647 100644
--- a/libxfs/xfile.c
+++ b/libxfs/xfile.c
@@ -97,6 +97,149 @@ xfile_create_fd(
 	return fd;
 }
 
+static LIST_HEAD(fcb_list);
+static pthread_mutex_t fcb_mutex = PTHREAD_MUTEX_INITIALIZER;
+
+/* Create a new memfd. */
+static inline int
+xfile_fcb_create(
+	const char		*description,
+	struct xfile_fcb	**fcbp)
+{
+	struct xfile_fcb	*fcb;
+	int			fd;
+
+	fd = xfile_create_fd(description);
+	if (fd < 0)
+		return -errno;
+
+	fcb = malloc(sizeof(struct xfile_fcb));
+	if (!fcb) {
+		close(fd);
+		return -ENOMEM;
+	}
+
+	list_head_init(&fcb->fcb_list);
+	fcb->fd = fd;
+	fcb->refcount = 1;
+
+	*fcbp = fcb;
+	return 0;
+}
+
+/* Release an xfile control block */
+static void
+xfile_fcb_irele(
+	struct xfile_fcb	*fcb,
+	loff_t			pos,
+	uint64_t		len)
+{
+	/*
+	 * If this memfd is linked only to itself, it's private, so we can
+	 * close it without taking any locks.
+	 */
+	if (list_empty(&fcb->fcb_list)) {
+		close(fcb->fd);
+		free(fcb);
+		return;
+	}
+
+	pthread_mutex_lock(&fcb_mutex);
+	if (--fcb->refcount == 0) {
+		/* If we're the last user of this memfd file, kill it fast. */
+		list_del(&fcb->fcb_list);
+		close(fcb->fd);
+		free(fcb);
+	} else if (len > 0) {
+		struct stat	statbuf;
+		int		ret;
+
+		/*
+		 * If we were using the end of a partitioned file, free the
+		 * address space.  IOWs, bonus points if you delete these in
+		 * reverse-order of creation.
+		 */
+		ret = fstat(fcb->fd, &statbuf);
+		if (!ret && statbuf.st_size == pos + len) {
+			ret = ftruncate(fcb->fd, pos);
+		}
+	}
+	pthread_mutex_unlock(&fcb_mutex);
+}
+
+/*
+ * Find a memfd that can accommodate the given amount of address space.
+ */
+static int
+xfile_fcb_find(
+	const char		*description,
+	uint64_t		maxbytes,
+	loff_t			*posp,
+	struct xfile_fcb	**fcbp)
+{
+	struct xfile_fcb	*fcb;
+	int			ret;
+	int			error = 0;
+
+	/* No maximum range means that the caller gets a private memfd. */
+	if (maxbytes == 0) {
+		*posp = 0;
+		return xfile_fcb_create(description, fcbp);
+	}
+
+	/* round up to page granularity so we can do mmap */
+	maxbytes = roundup_64(maxbytes, PAGE_SIZE);
+
+	pthread_mutex_lock(&fcb_mutex);
+
+	/*
+	 * If we only need a bounded byte range, look for a memfd with
+	 * available file range.
+	 */
+	list_for_each_entry(fcb, &fcb_list, fcb_list) {
+		struct stat	statbuf;
+		loff_t		pos;
+
+		ret = fstat(fcb->fd, &statbuf);
+		if (ret)
+			continue;
+		pos = roundup_64(statbuf.st_size, PAGE_SIZE);
+
+		/*
+		 * Truncate up to ensure that the memfd can actually handle
+		 * writes to the end of the range.
+		 */
+		ret = ftruncate(fcb->fd, pos + maxbytes);
+		if (ret)
+			continue;
+
+		fcb->refcount++;
+		*posp = pos;
+		*fcbp = fcb;
+		goto out_unlock;
+	}
+
+	/* Otherwise, open a new memfd and add it to our list. */
+	error = xfile_fcb_create(description, &fcb);
+	if (error)
+		goto out_unlock;
+
+	ret = ftruncate(fcb->fd, maxbytes);
+	if (ret) {
+		error = -errno;
+		xfile_fcb_irele(fcb, 0, maxbytes);
+		goto out_unlock;
+	}
+
+	list_add_tail(&fcb->fcb_list, &fcb_list);
+	*posp = 0;
+	*fcbp = fcb;
+
+out_unlock:
+	pthread_mutex_unlock(&fcb_mutex);
+	return error;
+}
+
 /*
  * Create an xfile of the given size.  The description will be used in the
  * trace output.
@@ -104,6 +247,7 @@ xfile_create_fd(
 int
 xfile_create(
 	const char		*description,
+	unsigned long long	maxbytes,
 	struct xfile		**xfilep)
 {
 	struct xfile		*xf;
@@ -113,13 +257,14 @@ xfile_create(
 	if (!xf)
 		return -ENOMEM;
 
-	xf->fd = xfile_create_fd(description);
-	if (xf->fd < 0) {
-		error = -errno;
+	error = xfile_fcb_find(description, maxbytes, &xf->partition_pos,
+			&xf->fcb);
+	if (error) {
 		kfree(xf);
 		return error;
 	}
 
+	xf->maxbytes = maxbytes;
 	*xfilep = xf;
 	return 0;
 }
@@ -129,7 +274,7 @@ void
 xfile_destroy(
 	struct xfile		*xf)
 {
-	close(xf->fd);
+	xfile_fcb_irele(xf->fcb, xf->partition_pos, xf->maxbytes);
 	kfree(xf);
 }
 
@@ -137,6 +282,9 @@ static inline loff_t
 xfile_maxbytes(
 	struct xfile		*xf)
 {
+	if (xf->maxbytes > 0)
+		return xf->maxbytes;
+
 	if (sizeof(loff_t) == 8)
 		return LLONG_MAX;
 	return LONG_MAX;
@@ -160,7 +308,7 @@ xfile_load(
 	if (xfile_maxbytes(xf) - pos < count)
 		return -ENOMEM;
 
-	ret = pread(xf->fd, buf, count, pos);
+	ret = pread(xf->fcb->fd, buf, count, pos + xf->partition_pos);
 	if (ret < 0)
 		return -errno;
 	if (ret != count)
@@ -186,7 +334,7 @@ xfile_store(
 	if (xfile_maxbytes(xf) - pos < count)
 		return -EFBIG;
 
-	ret = pwrite(xf->fd, buf, count, pos);
+	ret = pwrite(xf->fcb->fd, buf, count, pos + xf->partition_pos);
 	if (ret < 0)
 		return -errno;
 	if (ret != count)
@@ -194,6 +342,38 @@ xfile_store(
 	return 0;
 }
 
+/* Compute the number of bytes used by a partitioned xfile. */
+static unsigned long long
+xfile_partition_bytes(
+	struct xfile		*xf)
+{
+	loff_t			data_pos = xf->partition_pos;
+	loff_t			stop_pos = data_pos + xf->maxbytes;
+	loff_t			hole_pos;
+	unsigned long long	bytes = 0;
+
+	data_pos = lseek(xf->fcb->fd, data_pos, SEEK_DATA);
+	while (data_pos >= 0 && data_pos < stop_pos) {
+		hole_pos = lseek(xf->fcb->fd, data_pos, SEEK_HOLE);
+		if (hole_pos < 0) {
+			/* save error, break */
+			data_pos = hole_pos;
+			break;
+		}
+		if (hole_pos >= stop_pos) {
+			bytes += stop_pos - data_pos;
+			return bytes;
+		}
+		bytes += hole_pos - data_pos;
+
+		data_pos = lseek(xf->fcb->fd, hole_pos, SEEK_DATA);
+	}
+	if (data_pos < 0 && errno != ENXIO)
+		return xf->maxbytes;
+
+	return bytes;
+}
+
 /* Compute the number of bytes used by a xfile. */
 unsigned long long
 xfile_bytes(
@@ -202,7 +382,10 @@ xfile_bytes(
 	struct stat		statbuf;
 	int			error;
 
-	error = fstat(xf->fd, &statbuf);
+	if (xf->maxbytes > 0)
+		return xfile_partition_bytes(xf);
+
+	error = fstat(xf->fcb->fd, &statbuf);
 	if (error)
 		return -errno;
 
diff --git a/libxfs/xfile.h b/libxfs/xfile.h
index d60084011357..180a42bbbaa2 100644
--- a/libxfs/xfile.h
+++ b/libxfs/xfile.h
@@ -6,11 +6,24 @@
 #ifndef __LIBXFS_XFILE_H__
 #define __LIBXFS_XFILE_H__
 
-struct xfile {
+struct xfile_fcb {
+	struct list_head	fcb_list;
 	int			fd;
+	unsigned int		refcount;
 };
 
-int xfile_create(const char *description, struct xfile **xfilep);
+struct xfile {
+	struct xfile_fcb	*fcb;
+
+	/* File position within fcb->fd where this partition starts */
+	loff_t			partition_pos;
+
+	/* Maximum number of bytes that can be written to the partition. */
+	uint64_t		maxbytes;
+};
+
+int xfile_create(const char *description, unsigned long long maxbytes,
+		struct xfile **xfilep);
 void xfile_destroy(struct xfile *xf);
 
 ssize_t xfile_load(struct xfile *xf, void *buf, size_t count, loff_t pos);

^ permalink raw reply related	[flat|nested] 36+ messages in thread
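The partitioning idea in the patch above can be sketched in a few lines: several fixed-size ranges share one memfd, and every access adds the partition's start offset. This is a simplified illustration under stated assumptions, not the libxfs xfile code; struct demo_part and the demo_part_* helpers are hypothetical, and the caller is assumed to have created the fd with memfd_create(2) and sized it with ftruncate(2).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical partition: a byte range carved out of a shared memfd. */
struct demo_part {
	int	fd;	/* shared memfd */
	off_t	start;	/* file offset where this partition begins */
	off_t	len;	/* fixed upper bound on the partition size */
};

/* Return nonzero if [pos, pos + count) fits inside the partition. */
static int
demo_part_fits(const struct demo_part *p, size_t count, off_t pos)
{
	return pos >= 0 && pos <= p->len && (off_t)count <= p->len - pos;
}

/* Write at a partition-relative offset, refusing to cross the boundary. */
static ssize_t
demo_part_write(const struct demo_part *p, const void *buf, size_t count,
		off_t pos)
{
	if (!demo_part_fits(p, count, pos))
		return -1;
	return pwrite(p->fd, buf, count, p->start + pos);
}

/* Read at a partition-relative offset with the same bounds check. */
static ssize_t
demo_part_read(const struct demo_part *p, void *buf, size_t count,
		off_t pos)
{
	if (!demo_part_fits(p, count, pos))
		return -1;
	return pread(p->fd, buf, count, p->start + pos);
}
```

Two partitions such as { fd, 0, 4096 } and { fd, 4096, 4096 } can then be read and written independently through a single file descriptor, which is the property that keeps the fd count down.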

end of thread, other threads:[~2024-04-24 17:20 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-04-16  0:51 [PATCHBOMB v3] xfsprogs: everything headed towards 6.9 Darrick J. Wong
2024-04-16  0:57 ` [PATCHSET 1/4] xfsprogs: bug fixes for 6.8 Darrick J. Wong
2024-04-16  0:58   ` [PATCH 1/5] xfs_repair: double-check with shortform attr verifiers Darrick J. Wong
2024-04-16  0:59   ` [PATCH 2/5] xfs_db: improve number extraction in getbitval Darrick J. Wong
2024-04-16  4:53     ` Christoph Hellwig
2024-04-16  0:59   ` [PATCH 3/5] xfs_scrub: fix threadcount estimates for phase 6 Darrick J. Wong
2024-04-16  4:53     ` Christoph Hellwig
2024-04-16  0:59   ` [PATCH 4/5] xfs_scrub: don't fail while reporting media scan errors Darrick J. Wong
2024-04-16  0:59   ` [PATCH 5/5] xfs_io: add linux madvise advice codes Darrick J. Wong
2024-04-17  7:34   ` [PATCHSET 1/4] xfsprogs: bug fixes for 6.8 Carlos Maiolino
2024-04-17 15:30     ` Darrick J. Wong
2024-04-16  0:58 ` [PATCHSET 2/4] libxfs: sync with 6.9 Darrick J. Wong
2024-04-16  1:00   ` [PATCH 088/111] libxfs: teach buftargs to maintain their own buffer hashtable Darrick J. Wong
2024-04-16  1:00   ` [PATCH 089/111] libxfs: add xfile support Darrick J. Wong
2024-04-16  1:00   ` [PATCH 090/111] libxfs: partition memfd files to avoid using too many fds Darrick J. Wong
2024-04-16  4:55     ` Christoph Hellwig
2024-04-16 15:49       ` Darrick J. Wong
2024-04-16 16:29         ` Christoph Hellwig
2024-04-16 16:57           ` Darrick J. Wong
2024-04-16 18:47             ` Christoph Hellwig
2024-04-16 18:55               ` Darrick J. Wong
2024-04-24 17:20     ` [PATCH v3.1 " Darrick J. Wong
2024-04-16  1:00   ` [PATCH 091/111] xfs: teach buftargs to maintain their own buffer hashtable Darrick J. Wong
2024-04-16  1:01   ` [PATCH 092/111] libxfs: support in-memory buffer cache targets Darrick J. Wong
2024-04-16  0:58 ` [PATCHSET v30.3 3/4] xfsprogs: bmap log intent cleanups Darrick J. Wong
2024-04-16  1:01   ` [PATCH 1/4] libxfs: remove kmem_alloc, kmem_zalloc, and kmem_free Darrick J. Wong
2024-04-16  4:55     ` Christoph Hellwig
2024-04-16  1:01   ` [PATCH 2/4] libxfs: add a bi_entry helper Darrick J. Wong
2024-04-16  4:55     ` Christoph Hellwig
2024-04-16  1:01   ` [PATCH 3/4] libxfs: reuse xfs_bmap_update_cancel_item Darrick J. Wong
2024-04-16  4:55     ` Christoph Hellwig
2024-04-16  1:02   ` [PATCH 4/4] libxfs: add a xattr_entry helper Darrick J. Wong
2024-04-16  4:56     ` Christoph Hellwig
2024-04-16  0:58 ` [PATCHSET v30.3 4/4] xfs_repair: minor fixes Darrick J. Wong
2024-04-16  1:02   ` [PATCH 1/1] xfs_repair: check num before bplist[num] Darrick J. Wong
2024-04-16  4:56     ` Christoph Hellwig
