linux-mm.kvack.org archive mirror
* [PATCH v2 0/6] tmpfs: add the option to disable swap
@ 2023-03-09 23:05 Luis Chamberlain
  2023-03-09 23:05 ` [PATCH v2 1/6] shmem: remove check for folio lock on writepage() Luis Chamberlain
                   ` (8 more replies)
  0 siblings, 9 replies; 30+ messages in thread
From: Luis Chamberlain @ 2023-03-09 23:05 UTC
  To: hughd, akpm, willy, brauner
  Cc: linux-mm, p.raghav, da.gomez, a.manzanares, dave, yosryahmed,
	keescook, mcgrof, patches, linux-kernel

Changes on this v2 PATCH series:

  o Added all respective tags for Reviewed-by, Acked-by's
  o David Hildenbrand suggested on the update-docs patch to mention THP.
    It turns out tmpfs.rst makes absolutely no mention of THP at all
    so I added all the relevant options to the docs including the
    system wide sysfs file. All that should hopefully demystify that
    and make it clearer.
  o Yosry Ahmed spell checked my patch "shmem: add support to ignore swap"

Changes since RFCv2 to the first real PATCH series:

  o Added Christian Brauner's Acked-by for the noswap patch (the only
    change in that patch is just the new shmem_show_options() change I
    describe below).
  o Embraced Yosry Ahmed's recommendation to use mapping_set_unevictable()
    to ensure the folios at least appear in the unevictable LRU.
    Since that is the goal, this accomplishes what we want and the VM
    takes care of things for us. The shmem writepage() still uses a stop-gap
    to ensure we don't get called for swap when shmem uses
    mapping_set_unevictable().
  o I had evaluated using shmem_lock() instead of calling mapping_set_unevictable()
    but upon my review this doesn't make much sense, as shmem_lock() was
    designed to make use of the RLIMIT_MEMLOCK and this was designed for
    files / IPC / unprivileged perf limits. If we were to use
    shmem_lock() we'd bump the count on each new inode. Using
    shmem_lock() would also complicate inode allocation on shmem as
    we'd have to unwind on failure from the user_shm_lock(). It would also
    raise the question of when to capture a ucount for an inode, should we
    just share one for the superblock at shmem_fill_super() or do we
    really need to capture it at every single inode creation? In theory
    we could end up with different limits. The simple solution is to
    just use mapping_set_unevictable() upon inode creation and be done
    with it, as it cannot fail.
  o Update the documentation for tmpfs before / after my patch to
    reflect use cases a bit more clearly between ramfs, tmpfs and brd
    ramdisks.
  o I updated shmem_show_options() to also reveal the noswap option
    when it's used.
  o Address checkpatch style complaint with spaces before tabs on
    shmem_fs.h.
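
Since shmem_show_options() now emits noswap, a script can confirm how an
existing mount was set up. What follows is a minimal sketch, assuming the
option shows up as a comma-separated token in the fourth field of
/proc/mounts. The helper name tmpfs_has_noswap and its optional second
argument are made up for illustration and are not part of the series:

```sh
# Succeed (exit status 0) if the tmpfs mounted at $1 lists "noswap" among
# its mount options. $2 is an optional mounts table in /proc/mounts
# format (default: /proc/mounts). Hypothetical helper, not kernel code.
tmpfs_has_noswap() {
	awk -v mp="$1" '
		# Field 2 is the mount point, field 3 the fs type, field 4 the options.
		$2 == mp && $3 == "tmpfs" {
			found = 0	# the last matching line wins for stacked mounts
			n = split($4, opts, ",")
			for (i = 1; i <= n; i++)
				if (opts[i] == "noswap")
					found = 1
		}
		END { exit !found }
	' "${2:-/proc/mounts}"
}
```

Pass the mount point without a trailing slash, as it appears in
/proc/mounts, for example: tmpfs_has_noswap /data-tmpfs && echo noswap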

Changes since first RFC:

  o Matthew suggested BUG_ON(!folio_test_locked(folio)) is not needed
    on writepage() callback for shmem so just remove that.
  o Based on Matthew's feedback the inode is set up early as it is not
    reset in case we split the folio. So now we move all the variables
    we can set up really early.
  o shmem writepage() should only be issued on reclaim, so just move
    the WARN_ON_ONCE(!wbc->for_reclaim) early so that the code and
    expectations are easier to read. This also avoids the folio split
    in that odd case.
  o There are a few cases where the shmem writepage() could possibly
    hit, but in the !total_swap_pages case we just bail out. We shouldn't
    be splitting the folio then. Likewise for the VM_LOCKED case. But a
    writepage() call in the VM_LOCKED case is not expected, so we want
    to learn about it: add a WARN_ON_ONCE() on that condition.
  o Based on Yosry Ahmed's feedback the patch which allows tmpfs to
    disable swap now just uses mapping_set_unevictable() on inode
    creation. In that case writepage() should not be called so we
    augment the WARN_ON_ONCE() for writepage() for that case to ensure
    that never happens.

To test I've used a kdevops [0] 8 vCPU, 4 GiB libvirt guest on linux-next.

I'm doing this work as part of future experimentation with tmpfs and the
page cache, but given that a common complaint about tmpfs is the
inability to work without the page cache I figured this might be useful
to others. It turns out it is -- at least Christian Brauner indicates
systemd uses ramfs for a few use-cases because they don't want to use
swap and so having this option would let them move over to using tmpfs
for those small use cases, see systemd-creds(1).

To see if you hit swap:

mkswap /dev/nvme2n1
swapon /dev/nvme2n1
free -h

With swap - what we see today
=============================
mount -t tmpfs            -o size=5G           tmpfs /data-tmpfs/
dd if=/dev/urandom of=/data-tmpfs/5g-rand2 bs=1G count=5
free -h
               total        used        free      shared  buff/cache   available
Mem:           3.7Gi       2.6Gi       1.2Gi       2.2Gi       2.2Gi       1.2Gi
Swap:           99Gi       2.8Gi        97Gi


Without swap
=============

free -h
               total        used        free      shared  buff/cache   available
Mem:           3.7Gi       387Mi       3.4Gi       2.1Mi        57Mi       3.3Gi
Swap:           99Gi          0B        99Gi
mount -t tmpfs            -o size=5G -o noswap tmpfs /data-tmpfs/
dd if=/dev/urandom of=/data-tmpfs/5g-rand2 bs=1G count=5
free -h
               total        used        free      shared  buff/cache   available
Mem:           3.7Gi       2.6Gi       1.2Gi       2.3Gi       2.3Gi       1.1Gi
Swap:           99Gi        21Mi        99Gi
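
To compare the two runs above without reading free(1) output by eye, the
swap in use can be computed from /proc/meminfo as SwapTotal - SwapFree.
A minimal sketch; the function name swap_used_kib and its optional file
argument are assumptions for illustration, not something the series adds:

```sh
# Print the amount of swap in use, in KiB (SwapTotal - SwapFree).
# $1 is an optional file in /proc/meminfo format (default: /proc/meminfo),
# which allows running it against a saved copy. Hypothetical helper.
swap_used_kib() {
	awk '
		$1 == "SwapTotal:" { total = $2 }
		$1 == "SwapFree:"  { free = $2 }
		END { printf "%.0f\n", total - free }
	' "${1:-/proc/meminfo}"
}

# Example use around the dd run shown above:
#   before=$(swap_used_kib)
#   dd if=/dev/urandom of=/data-tmpfs/5g-rand2 bs=1G count=5
#   after=$(swap_used_kib)
#   echo "swap grew by $((after - before)) KiB"
```

Other workloads on the same machine can also push pages to swap, so a
small non-zero delta on a noswap mount (like the 21Mi above) does not by
itself mean tmpfs pages were swapped out.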

The mix and match remount testing
=================================

# Cannot disable swap after it was first enabled:
mount -t tmpfs            -o size=5G           tmpfs /data-tmpfs/
mount -t tmpfs -o remount -o size=5G -o noswap tmpfs /data-tmpfs/
mount: /data-tmpfs: mount point not mounted or bad option.
       dmesg(1) may have more information after failed mount system call.
dmesg -c
tmpfs: Cannot disable swap on remount

# Remount with the same noswap option is OK:
mount -t tmpfs            -o size=5G -o noswap tmpfs /data-tmpfs/
mount -t tmpfs -o remount -o size=5G -o noswap tmpfs /data-tmpfs/
dmesg -c

# Trying to enable swap with a remount after it was first disabled:
mount -t tmpfs            -o size=5G -o noswap tmpfs /data-tmpfs/
mount -t tmpfs -o remount -o size=5G           tmpfs /data-tmpfs/
mount: /data-tmpfs: mount point not mounted or bad option.
       dmesg(1) may have more information after failed mount system call.
dmesg -c
tmpfs: Cannot enable swap on remount if it was disabled on first mount
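
The three cases above reduce to one rule: a remount must repeat the
noswap choice made at first mount. The sketch below is only a userspace
model of that rule for scripting or documentation purposes. The function
name remount_noswap_check and its 0/1 arguments are hypothetical; the two
messages are the ones dmesg printed above:

```sh
# Model of the remount rule, not kernel code.
#   $1: 1 if the filesystem was first mounted with noswap, else 0
#   $2: 1 if the remount request carries noswap, else 0
# Prints the matching kernel message and returns 1 on a mismatch,
# returns 0 silently when the remount would be accepted.
remount_noswap_check() {
	if [ "$2" = 1 ] && [ "$1" = 0 ]; then
		echo "tmpfs: Cannot disable swap on remount"
		return 1
	fi
	if [ "$2" = 0 ] && [ "$1" = 1 ]; then
		echo "tmpfs: Cannot enable swap on remount if it was disabled on first mount"
		return 1
	fi
	return 0
}
```

In practice this means scripts that remount a noswap tmpfs, for example
to resize it, have to pass -o noswap again on every remount.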

[0] https://github.com/linux-kdevops/kdevops

Luis Chamberlain (6):
  shmem: remove check for folio lock on writepage()
  shmem: set shmem_writepage() variables early
  shmem: move reclaim check early on writepages()
  shmem: skip page split if we're not reclaiming
  shmem: update documentation
  shmem: add support to ignore swap

 Documentation/filesystems/tmpfs.rst  | 66 ++++++++++++++++++++++-----
 Documentation/mm/unevictable-lru.rst |  2 +
 include/linux/shmem_fs.h             |  1 +
 mm/shmem.c                           | 68 ++++++++++++++++++----------
 4 files changed, 103 insertions(+), 34 deletions(-)

-- 
2.39.1




* [PATCH v2 1/6] shmem: remove check for folio lock on writepage()
  2023-03-09 23:05 [PATCH v2 0/6] tmpfs: add the option to disable swap Luis Chamberlain
@ 2023-03-09 23:05 ` Luis Chamberlain
  2023-03-09 23:05 ` [PATCH v2 2/6] shmem: set shmem_writepage() variables early Luis Chamberlain
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 30+ messages in thread
From: Luis Chamberlain @ 2023-03-09 23:05 UTC
  To: hughd, akpm, willy, brauner
  Cc: linux-mm, p.raghav, da.gomez, a.manzanares, dave, yosryahmed,
	keescook, mcgrof, patches, linux-kernel, David Hildenbrand

Matthew notes we should not need to check the folio lock
on the writepage() callback so remove it. This sanity check
has been lingering since linux-history days. We remove this
as we tidy up the writepage() callback to make things a bit
clearer.

Suggested-by: Matthew Wilcox <willy@infradead.org>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 mm/shmem.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 1af85259b6fc..7fff1a3af092 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1354,7 +1354,6 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
 		folio_clear_dirty(folio);
 	}
 
-	BUG_ON(!folio_test_locked(folio));
 	mapping = folio->mapping;
 	index = folio->index;
 	inode = mapping->host;
-- 
2.39.1




* [PATCH v2 2/6] shmem: set shmem_writepage() variables early
  2023-03-09 23:05 [PATCH v2 0/6] tmpfs: add the option to disable swap Luis Chamberlain
  2023-03-09 23:05 ` [PATCH v2 1/6] shmem: remove check for folio lock on writepage() Luis Chamberlain
@ 2023-03-09 23:05 ` Luis Chamberlain
  2023-03-09 23:05 ` [PATCH v2 3/6] shmem: move reclaim check early on writepages() Luis Chamberlain
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 30+ messages in thread
From: Luis Chamberlain @ 2023-03-09 23:05 UTC
  To: hughd, akpm, willy, brauner
  Cc: linux-mm, p.raghav, da.gomez, a.manzanares, dave, yosryahmed,
	keescook, mcgrof, patches, linux-kernel, David Hildenbrand

shmem_writepage() sets up variables typically used *after* a possible
huge page split. However even if that does happen the address space
mapping should not change, and the inode does not change either. So it
should be safe to set that from the very beginning.

This commit makes no functional changes.

Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 mm/shmem.c | 9 +++------
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 7fff1a3af092..2b9ff585a553 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1334,9 +1334,9 @@ int shmem_unuse(unsigned int type)
 static int shmem_writepage(struct page *page, struct writeback_control *wbc)
 {
 	struct folio *folio = page_folio(page);
-	struct shmem_inode_info *info;
-	struct address_space *mapping;
-	struct inode *inode;
+	struct address_space *mapping = folio->mapping;
+	struct inode *inode = mapping->host;
+	struct shmem_inode_info *info = SHMEM_I(inode);
 	swp_entry_t swap;
 	pgoff_t index;
 
@@ -1354,10 +1354,7 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
 		folio_clear_dirty(folio);
 	}
 
-	mapping = folio->mapping;
 	index = folio->index;
-	inode = mapping->host;
-	info = SHMEM_I(inode);
 	if (info->flags & VM_LOCKED)
 		goto redirty;
 	if (!total_swap_pages)
-- 
2.39.1




* [PATCH v2 3/6] shmem: move reclaim check early on writepages()
  2023-03-09 23:05 [PATCH v2 0/6] tmpfs: add the option to disable swap Luis Chamberlain
  2023-03-09 23:05 ` [PATCH v2 1/6] shmem: remove check for folio lock on writepage() Luis Chamberlain
  2023-03-09 23:05 ` [PATCH v2 2/6] shmem: set shmem_writepage() variables early Luis Chamberlain
@ 2023-03-09 23:05 ` Luis Chamberlain
  2023-03-09 23:05 ` [PATCH v2 4/6] shmem: skip page split if we're not reclaiming Luis Chamberlain
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 30+ messages in thread
From: Luis Chamberlain @ 2023-03-09 23:05 UTC
  To: hughd, akpm, willy, brauner
  Cc: linux-mm, p.raghav, da.gomez, a.manzanares, dave, yosryahmed,
	keescook, mcgrof, patches, linux-kernel, David Hildenbrand

i915_gem requires huge folios to be split when swapping.
However, the check that writepage() is used only for swap
purposes comes later. Avoid the splits if we're not being
called for reclaim, even if they should in theory not happen.

This makes the conditions easier to follow in shmem_writepage().

Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 mm/shmem.c | 22 ++++++++++------------
 1 file changed, 10 insertions(+), 12 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 2b9ff585a553..68e9970baf1e 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1340,6 +1340,16 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
 	swp_entry_t swap;
 	pgoff_t index;
 
+	/*
+	 * Our capabilities prevent regular writeback or sync from ever calling
+	 * shmem_writepage; but a stacking filesystem might use ->writepage of
+	 * its underlying filesystem, in which case tmpfs should write out to
+	 * swap only in response to memory pressure, and not for the writeback
+	 * threads or sync.
+	 */
+	if (WARN_ON_ONCE(!wbc->for_reclaim))
+		goto redirty;
+
 	/*
 	 * If /sys/kernel/mm/transparent_hugepage/shmem_enabled is "always" or
 	 * "force", drivers/gpu/drm/i915/gem/i915_gem_shmem.c gets huge pages,
@@ -1360,18 +1370,6 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
 	if (!total_swap_pages)
 		goto redirty;
 
-	/*
-	 * Our capabilities prevent regular writeback or sync from ever calling
-	 * shmem_writepage; but a stacking filesystem might use ->writepage of
-	 * its underlying filesystem, in which case tmpfs should write out to
-	 * swap only in response to memory pressure, and not for the writeback
-	 * threads or sync.
-	 */
-	if (!wbc->for_reclaim) {
-		WARN_ON_ONCE(1);	/* Still happens? Tell us about it! */
-		goto redirty;
-	}
-
 	/*
 	 * This is somewhat ridiculous, but without plumbing a SWAP_MAP_FALLOC
 	 * value into swapfile.c, the only way we can correctly account for a
-- 
2.39.1




* [PATCH v2 4/6] shmem: skip page split if we're not reclaiming
  2023-03-09 23:05 [PATCH v2 0/6] tmpfs: add the option to disable swap Luis Chamberlain
                   ` (2 preceding siblings ...)
  2023-03-09 23:05 ` [PATCH v2 3/6] shmem: move reclaim check early on writepages() Luis Chamberlain
@ 2023-03-09 23:05 ` Luis Chamberlain
  2023-03-09 23:09   ` Yosry Ahmed
  2023-04-18  4:41   ` Hugh Dickins
  2023-03-09 23:05 ` [PATCH v2 5/6] shmem: update documentation Luis Chamberlain
                   ` (4 subsequent siblings)
  8 siblings, 2 replies; 30+ messages in thread
From: Luis Chamberlain @ 2023-03-09 23:05 UTC
  To: hughd, akpm, willy, brauner
  Cc: linux-mm, p.raghav, da.gomez, a.manzanares, dave, yosryahmed,
	keescook, mcgrof, patches, linux-kernel, David Hildenbrand

In theory when info->flags & VM_LOCKED we should not be getting
shmem_writepage() called so we should be verifying this with a
WARN_ON_ONCE(). Since we should not be swapping then best to ensure
we also don't do the folio split earlier too. So just move the check
early to avoid folio splits in case it's a dubious call.

We also have a similar early bail when !total_swap_pages so just move
that earlier to avoid the possible folio split in the same situation.

Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 mm/shmem.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 68e9970baf1e..dfd995da77b4 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1350,6 +1350,12 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
 	if (WARN_ON_ONCE(!wbc->for_reclaim))
 		goto redirty;
 
+	if (WARN_ON_ONCE(info->flags & VM_LOCKED))
+		goto redirty;
+
+	if (!total_swap_pages)
+		goto redirty;
+
 	/*
 	 * If /sys/kernel/mm/transparent_hugepage/shmem_enabled is "always" or
 	 * "force", drivers/gpu/drm/i915/gem/i915_gem_shmem.c gets huge pages,
@@ -1365,10 +1371,6 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
 	}
 
 	index = folio->index;
-	if (info->flags & VM_LOCKED)
-		goto redirty;
-	if (!total_swap_pages)
-		goto redirty;
 
 	/*
 	 * This is somewhat ridiculous, but without plumbing a SWAP_MAP_FALLOC
-- 
2.39.1




* [PATCH v2 5/6] shmem: update documentation
  2023-03-09 23:05 [PATCH v2 0/6] tmpfs: add the option to disable swap Luis Chamberlain
                   ` (3 preceding siblings ...)
  2023-03-09 23:05 ` [PATCH v2 4/6] shmem: skip page split if we're not reclaiming Luis Chamberlain
@ 2023-03-09 23:05 ` Luis Chamberlain
  2023-04-18  5:29   ` Hugh Dickins
  2023-03-09 23:05 ` [PATCH v2 6/6] shmem: add support to ignore swap Luis Chamberlain
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 30+ messages in thread
From: Luis Chamberlain @ 2023-03-09 23:05 UTC
  To: hughd, akpm, willy, brauner
  Cc: linux-mm, p.raghav, da.gomez, a.manzanares, dave, yosryahmed,
	keescook, mcgrof, patches, linux-kernel, David Hildenbrand

Update the docs to reflect a bit better why some folks prefer tmpfs
over ramfs and clarify a bit more how both differ from brd
ramdisks.

While at it, add THP docs for tmpfs, both the mount options and the
sysfs file.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 Documentation/filesystems/tmpfs.rst | 57 +++++++++++++++++++++++++----
 1 file changed, 49 insertions(+), 8 deletions(-)

diff --git a/Documentation/filesystems/tmpfs.rst b/Documentation/filesystems/tmpfs.rst
index 0408c245785e..1ec9a9f8196b 100644
--- a/Documentation/filesystems/tmpfs.rst
+++ b/Documentation/filesystems/tmpfs.rst
@@ -13,14 +13,25 @@ everything stored therein is lost.
 
 tmpfs puts everything into the kernel internal caches and grows and
 shrinks to accommodate the files it contains and is able to swap
-unneeded pages out to swap space. It has maximum size limits which can
-be adjusted on the fly via 'mount -o remount ...'
-
-If you compare it to ramfs (which was the template to create tmpfs)
-you gain swapping and limit checking. Another similar thing is the RAM
-disk (/dev/ram*), which simulates a fixed size hard disk in physical
-RAM, where you have to create an ordinary filesystem on top. Ramdisks
-cannot swap and you do not have the possibility to resize them.
+unneeded pages out to swap space, and supports THP.
+
+tmpfs extends ramfs with a few userspace configurable options listed and
+explained further below, some of which can be reconfigured dynamically on the
+fly using a remount ('mount -o remount ...') of the filesystem. A tmpfs
+filesystem can be resized but it cannot be resized to a size below its current
+usage. tmpfs also supports POSIX ACLs, and extended attributes for the
+trusted.* and security.* namespaces. ramfs does not use swap and you cannot
+modify any parameter for a ramfs filesystem. The size limit of a ramfs
+filesystem is how much memory you have available, and so care must be taken if
+used so to not run out of memory.
+
+An alternative to tmpfs and ramfs is to use brd to create RAM disks
+(/dev/ram*), which allows you to simulate a block device disk in physical RAM.
+To write data you would then just need to create a regular filesystem on top
+of this ramdisk. As with ramfs, brd ramdisks cannot swap. brd ramdisks are also
+configured in size at initialization and you cannot dynamically resize them.
+Contrary to brd ramdisks, tmpfs has its own filesystem, it does not rely on the
+block layer at all.
 
 Since tmpfs lives completely in the page cache and on swap, all tmpfs
 pages will be shown as "Shmem" in /proc/meminfo and "Shared" in
@@ -85,6 +96,36 @@ mount with such options, since it allows any user with write access to
 use up all the memory on the machine; but enhances the scalability of
 that instance in a system with many CPUs making intensive use of it.
 
+tmpfs also supports Transparent Huge Pages which requires a kernel
+configured with CONFIG_TRANSPARENT_HUGEPAGE and with huge supported for
+your system (has_transparent_hugepage(), which is architecture specific).
+The mount options for this are:
+
+======  ============================================================
+huge=0  never: disables huge pages for the mount
+huge=1  always: enables huge pages for the mount
+huge=2  within_size: only allocate huge pages if the page will be
+        fully within i_size, also respect fadvise()/madvise() hints.
+huge=3  advise: only allocate huge pages if requested with
+        fadvise()/madvise()
+======  ============================================================
+
+There is a sysfs file which you can also use to control system wide THP
+configuration for all tmpfs mounts, the file is:
+
+/sys/kernel/mm/transparent_hugepage/shmem_enabled
+
+This sysfs file is placed on top of THP sysfs directory and so is registered
+by THP code. It is however only used to control all tmpfs mounts with one
+single knob. Since it controls all tmpfs mounts it should only be used either
+for emergency or testing purposes. The values you can set for shmem_enabled are:
+
+==  ============================================================
+-1  deny: disables huge on shm_mnt and all mounts, for
+    emergency use
+-2  force: enables huge on shm_mnt and all mounts, w/o needing
+    option, for testing
+==  ============================================================
 
 tmpfs has a mount option to set the NUMA memory allocation policy for
 all files in that instance (if CONFIG_NUMA is enabled) - which can be
-- 
2.39.1




* [PATCH v2 6/6] shmem: add support to ignore swap
  2023-03-09 23:05 [PATCH v2 0/6] tmpfs: add the option to disable swap Luis Chamberlain
                   ` (4 preceding siblings ...)
  2023-03-09 23:05 ` [PATCH v2 5/6] shmem: update documentation Luis Chamberlain
@ 2023-03-09 23:05 ` Luis Chamberlain
  2023-04-18  5:50   ` Hugh Dickins
  2023-03-14  1:21 ` [PATCH v2 0/6] tmpfs: add the option to disable swap Davidlohr Bueso
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 30+ messages in thread
From: Luis Chamberlain @ 2023-03-09 23:05 UTC
  To: hughd, akpm, willy, brauner
  Cc: linux-mm, p.raghav, da.gomez, a.manzanares, dave, yosryahmed,
	keescook, mcgrof, patches, linux-kernel

While experimenting with shmem, having the option to avoid swap
becomes a useful mechanism. One of the *raves* about brd over shmem is
that you can avoid swap, but that's not really a good reason to use brd
if we can instead use shmem. brd has its own good reasons to exist,
but just because "tmpfs" doesn't let you avoid swap is not a great reason
to avoid tmpfs if we can easily add support for it.

I don't add support for reconfiguring incompatible options, but if
we really wanted to we can add support for that.

To avoid swap we use mapping_set_unevictable() upon inode creation,
and put a WARN_ON_ONCE() stop-gap on writepage() for reclaim.

Acked-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 Documentation/filesystems/tmpfs.rst  |  9 ++++++---
 Documentation/mm/unevictable-lru.rst |  2 ++
 include/linux/shmem_fs.h             |  1 +
 mm/shmem.c                           | 28 +++++++++++++++++++++++++++-
 4 files changed, 36 insertions(+), 4 deletions(-)

diff --git a/Documentation/filesystems/tmpfs.rst b/Documentation/filesystems/tmpfs.rst
index 1ec9a9f8196b..f18f46be5c0c 100644
--- a/Documentation/filesystems/tmpfs.rst
+++ b/Documentation/filesystems/tmpfs.rst
@@ -13,7 +13,8 @@ everything stored therein is lost.
 
 tmpfs puts everything into the kernel internal caches and grows and
 shrinks to accommodate the files it contains and is able to swap
-unneeded pages out to swap space, and supports THP.
+unneeded pages out to swap space, if swap was enabled for the tmpfs
+mount. tmpfs also supports THP.
 
 tmpfs extends ramfs with a few userspace configurable options listed and
 explained further below, some of which can be reconfigured dynamically on the
@@ -33,8 +34,8 @@ configured in size at initialization and you cannot dynamically resize them.
 Contrary to brd ramdisks, tmpfs has its own filesystem, it does not rely on the
 block layer at all.
 
-Since tmpfs lives completely in the page cache and on swap, all tmpfs
-pages will be shown as "Shmem" in /proc/meminfo and "Shared" in
+Since tmpfs lives completely in the page cache and optionally on swap,
+all tmpfs pages will be shown as "Shmem" in /proc/meminfo and "Shared" in
 free(1). Notice that these counters also include shared memory
 (shmem, see ipcs(1)). The most reliable way to get the count is
 using df(1) and du(1).
@@ -83,6 +84,8 @@ nr_inodes  The maximum number of inodes for this instance. The default
            is half of the number of your physical RAM pages, or (on a
            machine with highmem) the number of lowmem RAM pages,
            whichever is the lower.
+noswap     Disables swap. Remounts must respect the original settings.
+           By default swap is enabled.
 =========  ============================================================
 
 These parameters accept a suffix k, m or g for kilo, mega and giga and
diff --git a/Documentation/mm/unevictable-lru.rst b/Documentation/mm/unevictable-lru.rst
index 92ac5dca420c..d5ac8511eb67 100644
--- a/Documentation/mm/unevictable-lru.rst
+++ b/Documentation/mm/unevictable-lru.rst
@@ -42,6 +42,8 @@ The unevictable list addresses the following classes of unevictable pages:
 
  * Those owned by ramfs.
 
+ * Those owned by tmpfs with the noswap mount option.
+
  * Those mapped into SHM_LOCK'd shared memory regions.
 
  * Those mapped into VM_LOCKED [mlock()ed] VMAs.
diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index 103d1000a5a2..50bf82b36995 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -45,6 +45,7 @@ struct shmem_sb_info {
 	kuid_t uid;		    /* Mount uid for root directory */
 	kgid_t gid;		    /* Mount gid for root directory */
 	bool full_inums;	    /* If i_ino should be uint or ino_t */
+	bool noswap;		    /* ignores VM reclaim / swap requests */
 	ino_t next_ino;		    /* The next per-sb inode number to use */
 	ino_t __percpu *ino_batch;  /* The next per-cpu inode number to use */
 	struct mempolicy *mpol;     /* default memory policy for mappings */
diff --git a/mm/shmem.c b/mm/shmem.c
index dfd995da77b4..2e122c72b375 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -119,10 +119,12 @@ struct shmem_options {
 	bool full_inums;
 	int huge;
 	int seen;
+	bool noswap;
 #define SHMEM_SEEN_BLOCKS 1
 #define SHMEM_SEEN_INODES 2
 #define SHMEM_SEEN_HUGE 4
 #define SHMEM_SEEN_INUMS 8
+#define SHMEM_SEEN_NOSWAP 16
 };
 
 #ifdef CONFIG_TMPFS
@@ -1337,6 +1339,7 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
 	struct address_space *mapping = folio->mapping;
 	struct inode *inode = mapping->host;
 	struct shmem_inode_info *info = SHMEM_I(inode);
+	struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
 	swp_entry_t swap;
 	pgoff_t index;
 
@@ -1350,7 +1353,7 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
 	if (WARN_ON_ONCE(!wbc->for_reclaim))
 		goto redirty;
 
-	if (WARN_ON_ONCE(info->flags & VM_LOCKED))
+	if (WARN_ON_ONCE((info->flags & VM_LOCKED) || sbinfo->noswap))
 		goto redirty;
 
 	if (!total_swap_pages)
@@ -2487,6 +2490,8 @@ static struct inode *shmem_get_inode(struct mnt_idmap *idmap, struct super_block
 			shmem_set_inode_flags(inode, info->fsflags);
 		INIT_LIST_HEAD(&info->shrinklist);
 		INIT_LIST_HEAD(&info->swaplist);
+		if (sbinfo->noswap)
+			mapping_set_unevictable(inode->i_mapping);
 		simple_xattrs_init(&info->xattrs);
 		cache_no_acl(inode);
 		mapping_set_large_folios(inode->i_mapping);
@@ -3574,6 +3579,7 @@ enum shmem_param {
 	Opt_uid,
 	Opt_inode32,
 	Opt_inode64,
+	Opt_noswap,
 };
 
 static const struct constant_table shmem_param_enums_huge[] = {
@@ -3595,6 +3601,7 @@ const struct fs_parameter_spec shmem_fs_parameters[] = {
 	fsparam_u32   ("uid",		Opt_uid),
 	fsparam_flag  ("inode32",	Opt_inode32),
 	fsparam_flag  ("inode64",	Opt_inode64),
+	fsparam_flag  ("noswap",	Opt_noswap),
 	{}
 };
 
@@ -3678,6 +3685,10 @@ static int shmem_parse_one(struct fs_context *fc, struct fs_parameter *param)
 		ctx->full_inums = true;
 		ctx->seen |= SHMEM_SEEN_INUMS;
 		break;
+	case Opt_noswap:
+		ctx->noswap = true;
+		ctx->seen |= SHMEM_SEEN_NOSWAP;
+		break;
 	}
 	return 0;
 
@@ -3776,6 +3787,14 @@ static int shmem_reconfigure(struct fs_context *fc)
 		err = "Current inum too high to switch to 32-bit inums";
 		goto out;
 	}
+	if ((ctx->seen & SHMEM_SEEN_NOSWAP) && ctx->noswap && !sbinfo->noswap) {
+		err = "Cannot disable swap on remount";
+		goto out;
+	}
+	if (!(ctx->seen & SHMEM_SEEN_NOSWAP) && !ctx->noswap && sbinfo->noswap) {
+		err = "Cannot enable swap on remount if it was disabled on first mount";
+		goto out;
+	}
 
 	if (ctx->seen & SHMEM_SEEN_HUGE)
 		sbinfo->huge = ctx->huge;
@@ -3796,6 +3815,10 @@ static int shmem_reconfigure(struct fs_context *fc)
 		sbinfo->mpol = ctx->mpol;	/* transfers initial ref */
 		ctx->mpol = NULL;
 	}
+
+	if (ctx->noswap)
+		sbinfo->noswap = true;
+
 	raw_spin_unlock(&sbinfo->stat_lock);
 	mpol_put(mpol);
 	return 0;
@@ -3850,6 +3873,8 @@ static int shmem_show_options(struct seq_file *seq, struct dentry *root)
 		seq_printf(seq, ",huge=%s", shmem_format_huge(sbinfo->huge));
 #endif
 	shmem_show_mpol(seq, sbinfo->mpol);
+	if (sbinfo->noswap)
+		seq_printf(seq, ",noswap");
 	return 0;
 }
 
@@ -3893,6 +3918,7 @@ static int shmem_fill_super(struct super_block *sb, struct fs_context *fc)
 			ctx->inodes = shmem_default_max_inodes();
 		if (!(ctx->seen & SHMEM_SEEN_INUMS))
 			ctx->full_inums = IS_ENABLED(CONFIG_TMPFS_INODE64);
+		sbinfo->noswap = ctx->noswap;
 	} else {
 		sb->s_flags |= SB_NOUSER;
 	}
-- 
2.39.1




* Re: [PATCH v2 4/6] shmem: skip page split if we're not reclaiming
  2023-03-09 23:05 ` [PATCH v2 4/6] shmem: skip page split if we're not reclaiming Luis Chamberlain
@ 2023-03-09 23:09   ` Yosry Ahmed
  2023-04-18  4:41   ` Hugh Dickins
  1 sibling, 0 replies; 30+ messages in thread
From: Yosry Ahmed @ 2023-03-09 23:09 UTC
  To: Luis Chamberlain
  Cc: hughd, akpm, willy, brauner, linux-mm, p.raghav, da.gomez,
	a.manzanares, dave, keescook, patches, linux-kernel,
	David Hildenbrand

On Thu, Mar 9, 2023 at 3:05 PM Luis Chamberlain <mcgrof@kernel.org> wrote:
>
> In theory when info->flags & VM_LOCKED we should not be getting
> shem_writepage() called so we should be verifying this with a
> WARN_ON_ONCE(). Since we should not be swapping then best to ensure
> we also don't do the folio split earlier too. So just move the check
> early to avoid folio splits in case its a dubious call.
>
> We also have a similar early bail when !total_swap_pages so just move
> that earlier to avoid the possible folio split in the same situation.
>
> Acked-by: David Hildenbrand <david@redhat.com>
> Reviewed-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Yosry Ahmed <yosryahmed@google.com>

> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> ---
>  mm/shmem.c | 10 ++++++----
>  1 file changed, 6 insertions(+), 4 deletions(-)
>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 68e9970baf1e..dfd995da77b4 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1350,6 +1350,12 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
>         if (WARN_ON_ONCE(!wbc->for_reclaim))
>                 goto redirty;
>
> +       if (WARN_ON_ONCE(info->flags & VM_LOCKED))
> +               goto redirty;
> +
> +       if (!total_swap_pages)
> +               goto redirty;
> +
>         /*
>          * If /sys/kernel/mm/transparent_hugepage/shmem_enabled is "always" or
>          * "force", drivers/gpu/drm/i915/gem/i915_gem_shmem.c gets huge pages,
> @@ -1365,10 +1371,6 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
>         }
>
>         index = folio->index;
> -       if (info->flags & VM_LOCKED)
> -               goto redirty;
> -       if (!total_swap_pages)
> -               goto redirty;
>
>         /*
>          * This is somewhat ridiculous, but without plumbing a SWAP_MAP_FALLOC
> --
> 2.39.1
>



* Re: [PATCH v2 0/6] tmpfs: add the option to disable swap
  2023-03-09 23:05 [PATCH v2 0/6] tmpfs: add the option to disable swap Luis Chamberlain
                   ` (5 preceding siblings ...)
  2023-03-09 23:05 ` [PATCH v2 6/6] shmem: add support to ignore swap Luis Chamberlain
@ 2023-03-14  1:21 ` Davidlohr Bueso
  2023-03-14  2:46 ` haoxin
  2023-04-18  4:31 ` Hugh Dickins
  8 siblings, 0 replies; 30+ messages in thread
From: Davidlohr Bueso @ 2023-03-14  1:21 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: hughd, akpm, willy, brauner, linux-mm, p.raghav, da.gomez,
	a.manzanares, yosryahmed, keescook, patches, linux-kernel

On Thu, 09 Mar 2023, Luis Chamberlain wrote:

>Changes on this v2 PATCH series:
>
>  o Added all respective tags for Reviewed-by, Acked-by's
>  o David Hildenbrand suggested on the update-docs patch to mention THP.
>    It turns out tmpfs.rst makes absolutely no mention to THP at all
>    so I added all the relevant options to the docs including the
>    system wide sysfs file. All that should hopefully demystify that
>    and make it clearer.
>  o Yosry Ahmed spell checked my patch "shmem: add support to ignore swap"
>
>Changes since RFCv2 to the first real PATCH series:
>
>  o Added Christian Brauner's Acked-by for the noswap patch (the only
>    change in that patch is just the new shmem_show_options() change I
>    describe below).
>  o Embraced Yosry Ahmed's recommendation to use mapping_set_unevictable()
>    to ensure the folios at least appear in the unevictable LRU.
>    Since that is the goal, this accomplishes what we want and the VM
>    takes care of things for us. The shmem writepage() still uses a stop-gap
>    to ensure we don't get called for swap when its shmem uses
>    mapping_set_unevictable().
>  o I had evaluated using shmem_lock() instead of calling mapping_set_unevictable()
>    but upon my review this doesn't make much sense, as shmem_lock() was
>    designed to make use of the RLIMIT_MEMLOCK and this was designed for
>    files / IPC / unprivileged perf limits. If we were to use
>    shmem_lock() we'd bump the count on each new inode. Using
>    shmem_lock() would also complicate inode allocation on shmem as
>    we'd have to unwind on failure from the user_shm_lock(). It would also
>    beg the question of when to capture a ucount for an inode, should we
>    just share one for the superblock at shmem_fill_super() or do we
>    really need to capture it at every single inode creation? In theory
>    we could end up with different limits. The simple solution is to
>    just use mapping_set_unevictable() upon inode creation and be done
>    with it, as it cannot fail.
>  o Update the documentation for tmpfs before / after my patch to
>    reflect use cases a bit more clearly between ramfs, tmpfs and brd
>    ramdisks.
>  o I updated the shmem_show_options() to also reveal the noswap option
>    when it's used.
>  o Address checkpatch style complaint with spaces before tabs on
>    shmem_fs.h.
>
>Changes since first RFC:
>
>  o Matthew suggested BUG_ON(!folio_test_locked(folio)) is not needed
>    on writepage() callback for shmem so just remove that.
>  o Based on Matthew's feedback the inode is set up early as it is not
>    reset in case we split the folio. So now we move all the variables
>    we can set up really early.
>  o shmem writepage() should only be issued on reclaim, so just move
>    the WARN_ON_ONCE(!wbc->for_reclaim) early so that the code and
>    expectations are easier to read. This also avoids the folio splitting
>    in case of that odd case.
>  o There are a few cases where the shmem writepage() could possibly
>    hit, but in the total_swap_pages we just bail out. We shouldn't be
>    splitting the folio then. Likewise for VM_LOCKED case. But for
>    a writepage() on a VM_LOCKED case is not expected so we want to
>    learn about it so add a WARN_ON_ONCE() on that condition.
>  o Based on Yosry Ahmed's feedback the patch which allows tmpfs to
>    disable swap now just uses mapping_set_unevictable() on inode
>    creation. In that case writepage() should not be called so we
>    augment the WARN_ON_ONCE() for writepage() for that case to ensure
>    that never happens.
>
>To test I've used a kdevops [0] 8 vCPU 4 GiB libvirt guest on linux-next.
>
>I'm doing this work as part of future experimentation with tmpfs and the
>page cache, but given a common complaint found about tmpfs is the
>inability to work without the page cache I figured this might be useful
>to others. It turns out it is -- at least Christian Brauner indicates
>systemd uses ramfs for a few use-cases because they don't want to use
>swap and so having this option would let them move over to using tmpfs
>for those small use cases, see systemd-creds(1).
>
>To see if you hit swap:
>
>mkswap /dev/nvme2n1
>swapon /dev/nvme2n1
>free -h
>
>With swap - what we see today
>=============================
>mount -t tmpfs            -o size=5G           tmpfs /data-tmpfs/
>dd if=/dev/urandom of=/data-tmpfs/5g-rand2 bs=1G count=5
>free -h
>               total        used        free      shared  buff/cache   available
>Mem:           3.7Gi       2.6Gi       1.2Gi       2.2Gi       2.2Gi       1.2Gi
>Swap:           99Gi       2.8Gi        97Gi
>
>
>Without swap
>=============
>
>free -h
>               total        used        free      shared  buff/cache   available
>Mem:           3.7Gi       387Mi       3.4Gi       2.1Mi        57Mi       3.3Gi
>Swap:           99Gi          0B        99Gi
>mount -t tmpfs            -o size=5G -o noswap tmpfs /data-tmpfs/
>dd if=/dev/urandom of=/data-tmpfs/5g-rand2 bs=1G count=5
>free -h
>               total        used        free      shared  buff/cache   available
>Mem:           3.7Gi       2.6Gi       1.2Gi       2.3Gi       2.3Gi       1.1Gi
>Swap:           99Gi        21Mi        99Gi
>
>The mix and match remount testing
>=================================
>
># Cannot disable swap after it was first enabled:
>mount -t tmpfs            -o size=5G           tmpfs /data-tmpfs/
>mount -t tmpfs -o remount -o size=5G -o noswap tmpfs /data-tmpfs/
>mount: /data-tmpfs: mount point not mounted or bad option.
>       dmesg(1) may have more information after failed mount system call.
>dmesg -c
>tmpfs: Cannot disable swap on remount
>
># Remount with the same noswap option is OK:
>mount -t tmpfs            -o size=5G -o noswap tmpfs /data-tmpfs/
>mount -t tmpfs -o remount -o size=5G -o noswap tmpfs /data-tmpfs/
>dmesg -c
>
># Trying to enable swap with a remount after it first disabled:
>mount -t tmpfs            -o size=5G -o noswap tmpfs /data-tmpfs/
>mount -t tmpfs -o remount -o size=5G           tmpfs /data-tmpfs/
>mount: /data-tmpfs: mount point not mounted or bad option.
>       dmesg(1) may have more information after failed mount system call.
>dmesg -c
>tmpfs: Cannot enable swap on remount if it was disabled on first mount

Nice! For the whole series:

Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
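
For readers following the remount transcript quoted above, the rule it exercises can be modeled in a few lines of userspace C. This is only a sketch under stated assumptions: struct tmpfs_mount_model and noswap_remount_error() are hypothetical names made up for illustration and are not the kernel's data structures or functions. The two error strings are the ones shown in the dmesg output above, without the "tmpfs: " prefix.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model of a tmpfs mount; not a kernel structure. */
struct tmpfs_mount_model {
	bool noswap;	/* value chosen at first mount, fixed afterwards */
};

/*
 * Returns NULL when the remount request is compatible with the first
 * mount, otherwise the error text seen in the quoted dmesg output.
 */
const char *noswap_remount_error(const struct tmpfs_mount_model *m,
				 bool remount_noswap)
{
	/* Mounted with swap, remount asks for noswap: rejected. */
	if (remount_noswap && !m->noswap)
		return "Cannot disable swap on remount";

	/* Mounted with noswap, remount omits it: rejected. */
	if (!remount_noswap && m->noswap)
		return "Cannot enable swap on remount if it was disabled on first mount";

	/* Same setting as the first mount: allowed. */
	return NULL;
}
```

In this model only a remount that repeats the original setting returns NULL, which matches the three cases in the transcript.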



* Re: [PATCH v2 0/6] tmpfs: add the option to disable swap
  2023-03-09 23:05 [PATCH v2 0/6] tmpfs: add the option to disable swap Luis Chamberlain
                   ` (6 preceding siblings ...)
  2023-03-14  1:21 ` [PATCH v2 0/6] tmpfs: add the option to disable swap Davidlohr Bueso
@ 2023-03-14  2:46 ` haoxin
  2023-03-19 20:32   ` Luis Chamberlain
  2023-04-18  4:31 ` Hugh Dickins
  8 siblings, 1 reply; 30+ messages in thread
From: haoxin @ 2023-03-14  2:46 UTC (permalink / raw)
  To: Luis Chamberlain, hughd, akpm, willy, brauner
  Cc: linux-mm, p.raghav, da.gomez, a.manzanares, dave, yosryahmed,
	keescook, patches, linux-kernel

This whole series looks good to me. I ran some tests on my virtual
machine and it works well.

So please add Tested-by: Xin Hao <xhao@linux.alibaba.com>.

Just one question: if the tmpfs page cache occupies a large amount of memory,
how can we ensure successful memory reclamation in case of a memory shortage?



* Re: [PATCH v2 0/6] tmpfs: add the option to disable swap
  2023-03-14  2:46 ` haoxin
@ 2023-03-19 20:32   ` Luis Chamberlain
  2023-03-20 11:14     ` haoxin
  0 siblings, 1 reply; 30+ messages in thread
From: Luis Chamberlain @ 2023-03-19 20:32 UTC (permalink / raw)
  To: haoxin
  Cc: hughd, akpm, willy, brauner, linux-mm, p.raghav, da.gomez,
	a.manzanares, dave, yosryahmed, keescook, patches, linux-kernel

On Tue, Mar 14, 2023 at 10:46:28AM +0800, haoxin wrote:
> All these series looks good to me and i do some test on my virtual machine
> it works well.
> 
> so please add Tested-by: Xin Hao <xhao@linux.alibaba.com> .
> 
> just one question, if tmpfs pagecache occupies a large amount of memory, how
> can we ensure successful memory reclamation in case of memory shortage?

If you're disabling swap then you know the only thing you can do is
unmount if you want to help the VM, otherwise the pressure is just
greater for the VM.

  Luis



* Re: [PATCH v2 0/6] tmpfs: add the option to disable swap
  2023-03-19 20:32   ` Luis Chamberlain
@ 2023-03-20 11:14     ` haoxin
  2023-03-20 21:36       ` Luis Chamberlain
  0 siblings, 1 reply; 30+ messages in thread
From: haoxin @ 2023-03-20 11:14 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: hughd, akpm, willy, brauner, linux-mm, p.raghav, da.gomez,
	a.manzanares, dave, yosryahmed, keescook, patches, linux-kernel


On 2023/3/20 at 4:32 AM, Luis Chamberlain wrote:
> On Tue, Mar 14, 2023 at 10:46:28AM +0800, haoxin wrote:
>> All these series looks good to me and i do some test on my virtual machine
>> it works well.
>>
>> so please add Tested-by: Xin Hao<xhao@linux.alibaba.com>  .
>>
>> just one question, if tmpfs pagecache occupies a large amount of memory, how
>> can we ensure successful memory reclamation in case of memory shortage?
> If you're disabling swap then you know the only thing you can do is
> unmount if you want to help the VM, otherwise the pressure is just
> greater for the VM.

Um, what I mean is: can we add a priority so that this type of page cache
is reclaimed last, instead of just setting the noswap parameter to make it
unreclaimable?

If such page cache occupies a big part of memory and cannot be reclaimed,
it will cause OOM.


>
>    Luis


* Re: [PATCH v2 0/6] tmpfs: add the option to disable swap
  2023-03-20 11:14     ` haoxin
@ 2023-03-20 21:36       ` Luis Chamberlain
  2023-03-21 11:37         ` haoxin
  0 siblings, 1 reply; 30+ messages in thread
From: Luis Chamberlain @ 2023-03-20 21:36 UTC (permalink / raw)
  To: haoxin
  Cc: hughd, akpm, willy, brauner, linux-mm, p.raghav, da.gomez,
	a.manzanares, dave, yosryahmed, keescook, patches, linux-kernel

On Mon, Mar 20, 2023 at 07:14:22PM +0800, haoxin wrote:
> 
> On 2023/3/20 at 4:32 AM, Luis Chamberlain wrote:
> > On Tue, Mar 14, 2023 at 10:46:28AM +0800, haoxin wrote:
> > > All these series looks good to me and i do some test on my virtual machine
> > > it works well.
> > > 
> > > so please add Tested-by: Xin Hao<xhao@linux.alibaba.com>  .
> > > 
> > > just one question, if tmpfs pagecache occupies a large amount of memory, how
> > > can we ensure successful memory reclamation in case of memory shortage?
> > If you're disabling swap then you know the only thing you can do is
> > unmount if you want to help the VM, otherwise the pressure is just
> > greater for the VM.
> 
> Un, what i mean is can we add a priority so that this type of pagecache is
> reclaimed last ?

That seems to be a classifier request for something much less aggressive
than mapping_set_unevictable(). My patches *prior* to using mapping_set_unevictable()
are I think closer to what it seems you want, but as noted before by
folks, that also puts unnecessary stress on the VM because we just fail
reclaim on our writepage().

> Instead of just setting the parameter noswap to make it unreclaimed, because
> if such pagecache which occupy big part of memory which can not be
> reclaimed, it will cause OOM.

You can't simultaneously retain possession of a cake and eat it, too.
Once you eat it, it's gone, and noswap eats the cake because of the
suggestion / decision to follow through with mapping_set_unevictable().

It sounds like you want to make mapping_set_unevictable() optional and
deal with the possible stress incurred by writepage() failing? Not quite
sure what else to recommend here.

  Luis



* Re: [PATCH v2 0/6] tmpfs: add the option to disable swap
  2023-03-20 21:36       ` Luis Chamberlain
@ 2023-03-21 11:37         ` haoxin
  0 siblings, 0 replies; 30+ messages in thread
From: haoxin @ 2023-03-21 11:37 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: hughd, akpm, willy, brauner, linux-mm, p.raghav, da.gomez,
	a.manzanares, dave, yosryahmed, keescook, patches, linux-kernel


On 2023/3/21 at 5:36 AM, Luis Chamberlain wrote:
> On Mon, Mar 20, 2023 at 07:14:22PM +0800, haoxin wrote:
>> On 2023/3/20 at 4:32 AM, Luis Chamberlain wrote:
>>> On Tue, Mar 14, 2023 at 10:46:28AM +0800, haoxin wrote:
>>>> All these series looks good to me and i do some test on my virtual machine
>>>> it works well.
>>>>
>>>> so please add Tested-by: Xin Hao<xhao@linux.alibaba.com>  .
>>>>
>>>> just one question, if tmpfs pagecache occupies a large amount of memory, how
>>>> can we ensure successful memory reclamation in case of memory shortage?
>>> If you're disabling swap then you know the only thing you can do is
>>> unmount if you want to help the VM, otherwise the pressure is just
>>> greater for the VM.
>> Un, what i mean is can we add a priority so that this type of pagecache is
>> reclaimed last ?
> That seems to be a classifier request for something much less aggressive
> than mapping_set_unevictable(). My patches *prior* to using mapping_set_unevictable()
> are I think closer to what it seems you want, but as noted before by
> folks, that also puts unnecessary stress on the VM because we just fail
> reclaim on our writepage().
>
>> Instead of just setting the parameter noswap to make it unreclaimed, because
>> if such pagecache which occupy big part of memory which can not be
>> reclaimed, it will cause OOM.
> You can't simultaneously retain possession of a cake and eat it, too.
> Once you eat it, it's gone, and noswap eats the cake because of the
> suggestion / decision to follow through with mapping_set_unevictable().
>
> It sounds like you want to make mapping_set_unevictable() optional and
> deal with the possible stress incurred by writepage() failing?
Yes. Just a personal idea; in any case, the current patch is an excellent
implementation. Thank you very much.
>   Not quite
> sure what else to recommend here.
>
>    Luis



* Re: [PATCH v2 0/6] tmpfs: add the option to disable swap
  2023-03-09 23:05 [PATCH v2 0/6] tmpfs: add the option to disable swap Luis Chamberlain
                   ` (7 preceding siblings ...)
  2023-03-14  2:46 ` haoxin
@ 2023-04-18  4:31 ` Hugh Dickins
  2023-04-18 20:55   ` Luis Chamberlain
  8 siblings, 1 reply; 30+ messages in thread
From: Hugh Dickins @ 2023-04-18  4:31 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: hughd, akpm, willy, brauner, linux-mm, p.raghav, da.gomez,
	a.manzanares, dave, yosryahmed, keescook, patches, linux-kernel

On Thu, 9 Mar 2023, Luis Chamberlain wrote:

> I'm doing this work as part of future experimentation with tmpfs and the
> page cache, but given a common complaint found about tmpfs is the
> inability to work without the page cache I figured this might be useful
> to others. It turns out it is -- at least Christian Brauner indicates
> systemd uses ramfs for a few use-cases because they don't want to use
> swap and so having this option would let them move over to using tmpfs
> for those small use cases, see systemd-creds(1).

Thanks for your thorough work on tmpfs "noswap": seems well-received
by quite a few others, that's good.

I've just a few comments on later patches (I don't understand why you
went into those little rearrangements at the start of shmem_writepage(),
but they seem harmless so I don't object), but wanted to ask here:

You say "a common complaint about tmpfs is the inability to work without
the page cache".  Ehh?  I don't understand that at all, and have never
heard such a complaint.  It doesn't affect the series itself (oh, Andrew
has copied that text into the first patch), but please illuminate!

Thanks,
Hugh



* Re: [PATCH v2 4/6] shmem: skip page split if we're not reclaiming
  2023-03-09 23:05 ` [PATCH v2 4/6] shmem: skip page split if we're not reclaiming Luis Chamberlain
  2023-03-09 23:09   ` Yosry Ahmed
@ 2023-04-18  4:41   ` Hugh Dickins
  2023-04-18 21:11     ` Luis Chamberlain
  1 sibling, 1 reply; 30+ messages in thread
From: Hugh Dickins @ 2023-04-18  4:41 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: hughd, akpm, willy, brauner, linux-mm, p.raghav, da.gomez,
	a.manzanares, dave, yosryahmed, keescook, patches, linux-kernel,
	David Hildenbrand

On Thu, 9 Mar 2023, Luis Chamberlain wrote:

> In theory when info->flags & VM_LOCKED we should not be getting
> shmem_writepage() called so we should be verifying this with a
> WARN_ON_ONCE(). Since we should not be swapping then best to ensure
> we also don't do the folio split earlier too. So just move the check
> early to avoid folio splits in case it's a dubious call.
> 
> We also have a similar early bail when !total_swap_pages so just move
> that earlier to avoid the possible folio split in the same situation.
> 
> Acked-by: David Hildenbrand <david@redhat.com>
> Reviewed-by: Christian Brauner <brauner@kernel.org>
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> ---
>  mm/shmem.c | 10 ++++++----
>  1 file changed, 6 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 68e9970baf1e..dfd995da77b4 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1350,6 +1350,12 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
>  	if (WARN_ON_ONCE(!wbc->for_reclaim))
>  		goto redirty;
>  
> +	if (WARN_ON_ONCE(info->flags & VM_LOCKED))
> +		goto redirty;

Well, okay, I don't mind that.  But shall we take bets on how soon syzbot
(hope it's not watching) will try flipping SHM_LOCK on while swapping out
pages from a SHM segment, and hit that warning?  Perhaps I'm wrong, but I
don't think any serialization prevents that.

Hugh

> +
> +	if (!total_swap_pages)
> +		goto redirty;
> +
>  	/*
>  	 * If /sys/kernel/mm/transparent_hugepage/shmem_enabled is "always" or
>  	 * "force", drivers/gpu/drm/i915/gem/i915_gem_shmem.c gets huge pages,
> @@ -1365,10 +1371,6 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
>  	}
>  
>  	index = folio->index;
> -	if (info->flags & VM_LOCKED)
> -		goto redirty;
> -	if (!total_swap_pages)
> -		goto redirty;
>  
>  	/*
>  	 * This is somewhat ridiculous, but without plumbing a SWAP_MAP_FALLOC
> -- 
> 2.39.1



* Re: [PATCH v2 5/6] shmem: update documentation
  2023-03-09 23:05 ` [PATCH v2 5/6] shmem: update documentation Luis Chamberlain
@ 2023-04-18  5:29   ` Hugh Dickins
  2023-04-18 21:20     ` Luis Chamberlain
  0 siblings, 1 reply; 30+ messages in thread
From: Hugh Dickins @ 2023-04-18  5:29 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: hughd, akpm, willy, brauner, linux-mm, p.raghav, da.gomez,
	a.manzanares, dave, yosryahmed, keescook, patches, linux-kernel,
	David Hildenbrand

On Thu, 9 Mar 2023, Luis Chamberlain wrote:

> Update the docs to reflect a bit better why some folks prefer tmpfs
> over ramfs and clarify a bit more about the difference between brd
> ramdisks.
> 
> While at it, add THP docs for tmpfs, both the mount options and the
> sysfs file.

Okay: the original canonical reference for THP options on tmpfs has
been Documentation/admin-guide/mm/transhuge.rst.  You're right that
they would be helpful here too: IIRC (but I might well be confusing
with our Google tree) we used to have them documented in both places,
but grew tired of keeping the two in synch.  You're volunteering to
do so! so please check now that they tell the same story.

But nowadays, "man 5 tmpfs" is much more important (and that might
give you a hint for what needs to be done after this series goes into
6.4-rc - and I wonder if there are tmpfs manpage updates needed from
Christian for idmapped too? or already taken care of?).

There's a little detail we do need you to remove, indicated below.

> 
> Reviewed-by: Christian Brauner <brauner@kernel.org>
> Reviewed-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> ---
>  Documentation/filesystems/tmpfs.rst | 57 +++++++++++++++++++++++++----
>  1 file changed, 49 insertions(+), 8 deletions(-)
> 
> diff --git a/Documentation/filesystems/tmpfs.rst b/Documentation/filesystems/tmpfs.rst
> index 0408c245785e..1ec9a9f8196b 100644
> --- a/Documentation/filesystems/tmpfs.rst
> +++ b/Documentation/filesystems/tmpfs.rst
> @@ -13,14 +13,25 @@ everything stored therein is lost.
>  
>  tmpfs puts everything into the kernel internal caches and grows and
>  shrinks to accommodate the files it contains and is able to swap
> -unneeded pages out to swap space. It has maximum size limits which can
> -be adjusted on the fly via 'mount -o remount ...'
> -
> -If you compare it to ramfs (which was the template to create tmpfs)
> -you gain swapping and limit checking. Another similar thing is the RAM
> -disk (/dev/ram*), which simulates a fixed size hard disk in physical
> -RAM, where you have to create an ordinary filesystem on top. Ramdisks
> -cannot swap and you do not have the possibility to resize them.
> +unneeded pages out to swap space, and supports THP.
> +
> +tmpfs extends ramfs with a few userspace configurable options listed and
> +explained further below, some of which can be reconfigured dynamically on the
> +fly using a remount ('mount -o remount ...') of the filesystem. A tmpfs
> +filesystem can be resized but it cannot be resized to a size below its current
> +usage. tmpfs also supports POSIX ACLs, and extended attributes for the
> +trusted.* and security.* namespaces. ramfs does not use swap and you cannot
> +modify any parameter for a ramfs filesystem. The size limit of a ramfs
> +filesystem is how much memory you have available, and so care must be taken if
> +used so to not run out of memory.
> +
> +An alternative to tmpfs and ramfs is to use brd to create RAM disks
> +(/dev/ram*), which allows you to simulate a block device disk in physical RAM.
> +To write data you would just then need to create an regular filesystem on top
> +this ramdisk. As with ramfs, brd ramdisks cannot swap. brd ramdisks are also
> +configured in size at initialization and you cannot dynamically resize them.
> +Contrary to brd ramdisks, tmpfs has its own filesystem, it does not rely on the
> +block layer at all.
>  
>  Since tmpfs lives completely in the page cache and on swap, all tmpfs
>  pages will be shown as "Shmem" in /proc/meminfo and "Shared" in
> @@ -85,6 +96,36 @@ mount with such options, since it allows any user with write access to
>  use up all the memory on the machine; but enhances the scalability of
>  that instance in a system with many CPUs making intensive use of it.
>  
> +tmpfs also supports Transparent Huge Pages which requires a kernel
> +configured with CONFIG_TRANSPARENT_HUGEPAGE and with huge supported for
> +your system (has_transparent_hugepage(), which is architecture specific).
> +The mount options for this are:
> +
> +======  ============================================================
> +huge=0  never: disables huge pages for the mount
> +huge=1  always: enables huge pages for the mount
> +huge=2  within_size: only allocate huge pages if the page will be
> +        fully within i_size, also respect fadvise()/madvise() hints.
> +huge=3  advise: only allocate huge pages if requested with
> +        fadvise()/madvise()

You're taking the source too literally there.  Minor point is that there
is no fadvise() for this, to date anyway.  Major point is: have you tried
mounting tmpfs with huge=0 etc?  I did propose "huge=0" and "huge=1" years
ago, but those "never" went in, it's "always" been the named options.
Please remove those misleading numbers, it's "huge=never" etc.

(Old Google internal trees excepted: and trying to wean people off
"huge=1" internally makes me a bit touchy when seeing those numbers above!)
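
To make the point about named values concrete, here is a minimal userspace sketch (not kernel code) of a lookup that accepts only the named forms and rejects the numeric ones. The enum, the table and parse_huge() are hypothetical names for illustration; the numbers simply mirror those in the quoted table, which are internal to the kernel source.

```c
#include <stddef.h>
#include <string.h>

/* Illustrative modes; the numeric values are internal and are NOT valid
 * "huge=" strings. HUGE_INVALID is a sentinel made up for this sketch. */
enum huge_mode {
	HUGE_INVALID = -100,
	HUGE_NEVER = 0,
	HUGE_ALWAYS = 1,
	HUGE_WITHIN_SIZE = 2,
	HUGE_ADVISE = 3,
};

struct huge_name {
	const char *name;
	enum huge_mode mode;
};

static const struct huge_name huge_names[] = {
	{ "never",       HUGE_NEVER },
	{ "always",      HUGE_ALWAYS },
	{ "within_size", HUGE_WITHIN_SIZE },
	{ "advise",      HUGE_ADVISE },
};

/* Map the value of a "huge=" option to a mode; only names are accepted. */
enum huge_mode parse_huge(const char *value)
{
	size_t i;

	if (value == NULL)
		return HUGE_INVALID;
	for (i = 0; i < sizeof(huge_names) / sizeof(huge_names[0]); i++) {
		if (strcmp(value, huge_names[i].name) == 0)
			return huge_names[i].mode;
	}
	return HUGE_INVALID;
}
```

With this model, "huge=never" resolves to a mode while "huge=0" is rejected, which is the behaviour the documentation should describe.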

> +======  ============================================================
> +
> +There is a sysfs file which you can also use to control system wide THP
> +configuration for all tmpfs mounts, the file is:
> +
> +/sys/kernel/mm/transparent_hugepage/shmem_enabled
> +
> +This sysfs file is placed on top of THP sysfs directory and so is registered
> +by THP code. It is however only used to control all tmpfs mounts with one
> +single knob. Since it controls all tmpfs mounts it should only be used either
> +for emergency or testing purposes. The values you can set for shmem_enabled are:
> +
> +==  ============================================================
> +-1  deny: disables huge on shm_mnt and all mounts, for
> +    emergency use
> +-2  force: enables huge on shm_mnt and all mounts, w/o needing
> +    option, for testing

Likewise here, please delete the invalid "-1" and "-2" notations,
-1 and -2 are just #defines for use in the kernel source.

And the description above is not quite accurate: it is very hard to
describe shmem_enabled, partly because it combines two different things.
It's partly the "huge=" mount option for any "internal mount", those
things like SysV SHM and memfd and i915 and shared-anonymous: the shmem
which has no user-visible mount to hold the option.  But also these
"deny" and "force" overrides affecting *all* internal and visible mounts.
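
As a small illustration of how a tool might inspect that knob, the sketch below extracts the currently selected value from one line of text. It assumes the file follows the common sysfs convention of listing every value with the active one in square brackets, for example "always within_size advise [never] deny force"; selected_value() is a hypothetical helper, and reading the file itself is left to the caller.

```c
#include <stddef.h>
#include <string.h>

/*
 * Copy the bracketed (currently selected) word from a line such as
 * "always within_size advise [never] deny force" into out.
 * Returns 0 on success, -1 if there is no complete bracketed word
 * or if it does not fit in out (including the trailing NUL).
 */
int selected_value(const char *line, char *out, size_t outlen)
{
	const char *open, *close;
	size_t len;

	if (line == NULL || out == NULL || outlen == 0)
		return -1;
	open = strchr(line, '[');
	if (open == NULL)
		return -1;
	close = strchr(open + 1, ']');
	if (close == NULL)
		return -1;
	len = (size_t)(close - open - 1);
	if (len == 0 || len >= outlen)
		return -1;
	memcpy(out, open + 1, len);
	out[len] = '\0';
	return 0;
}
```

Only names such as "deny" and "force" ever appear in that line; the -1 and -2 values never do.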

Hugh

> +==  ============================================================
>  
>  tmpfs has a mount option to set the NUMA memory allocation policy for
>  all files in that instance (if CONFIG_NUMA is enabled) - which can be
> -- 
> 2.39.1



* Re: [PATCH v2 6/6] shmem: add support to ignore swap
  2023-03-09 23:05 ` [PATCH v2 6/6] shmem: add support to ignore swap Luis Chamberlain
@ 2023-04-18  5:50   ` Hugh Dickins
  2023-04-18  7:38     ` Christian Brauner
  2023-04-18 21:22     ` [PATCH v2 6/6] shmem: add support to ignore swap Luis Chamberlain
  0 siblings, 2 replies; 30+ messages in thread
From: Hugh Dickins @ 2023-04-18  5:50 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: hughd, akpm, willy, brauner, linux-mm, p.raghav, da.gomez,
	a.manzanares, dave, yosryahmed, keescook, patches, linux-kernel

On Thu, 9 Mar 2023, Luis Chamberlain wrote:

> In doing experimentations with shmem having the option to avoid swap
> becomes a useful mechanism. One of the *raves* about brd over shmem is
> you can avoid swap, but that's not really a good reason to use brd if
> we can instead use shmem. Using brd has its own good reasons to exist,
> but just because "tmpfs" doesn't let you do that is not a great reason
> to avoid it if we can easily add support for it.
> 
> I don't add support for reconfiguring incompatible options, but if
> we really wanted to we can add support for that.
> 
> To avoid swap we use mapping_set_unevictable() upon inode creation,
> and put a WARN_ON_ONCE() stop-gap on writepages() for reclaim.

I have one big question here, which betrays my ignorance:
I hope that you or Christian can reassure me on this.

tmpfs has fs_flags FS_USERNS_MOUNT.  I know nothing about namespaces,
nothing; but from overhearings, wonder if an ordinary user in a namespace
might be able to mount their own tmpfs with "noswap", and thereby evade
all accounting of the locked memory.

That would be an absolute no-no for this patch; but I assume that even
if so, it can be easily remedied by inserting an appropriate (unknown
to me!) privilege check where the "noswap" option is validated.

I did idly wonder what happens with "noswap" when CONFIG_SWAP is not
enabled, or no swap is enabled; but I think it would be a waste of time
and code to worry over doing anything different from whatever behaviour
falls out trivially.

You'll be sending a manpage update to Alejandro in due course, I think.

Thanks,
Hugh

> 
> Acked-by: Christian Brauner <brauner@kernel.org>
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> ---
>  Documentation/filesystems/tmpfs.rst  |  9 ++++++---
>  Documentation/mm/unevictable-lru.rst |  2 ++
>  include/linux/shmem_fs.h             |  1 +
>  mm/shmem.c                           | 28 +++++++++++++++++++++++++++-
>  4 files changed, 36 insertions(+), 4 deletions(-)
> 
> diff --git a/Documentation/filesystems/tmpfs.rst b/Documentation/filesystems/tmpfs.rst
> index 1ec9a9f8196b..f18f46be5c0c 100644
> --- a/Documentation/filesystems/tmpfs.rst
> +++ b/Documentation/filesystems/tmpfs.rst
> @@ -13,7 +13,8 @@ everything stored therein is lost.
>  
>  tmpfs puts everything into the kernel internal caches and grows and
>  shrinks to accommodate the files it contains and is able to swap
> -unneeded pages out to swap space, and supports THP.
> +unneeded pages out to swap space, if swap was enabled for the tmpfs
> +mount. tmpfs also supports THP.
>  
>  tmpfs extends ramfs with a few userspace configurable options listed and
>  explained further below, some of which can be reconfigured dynamically on the
> @@ -33,8 +34,8 @@ configured in size at initialization and you cannot dynamically resize them.
>  Contrary to brd ramdisks, tmpfs has its own filesystem, it does not rely on the
>  block layer at all.
>  
> -Since tmpfs lives completely in the page cache and on swap, all tmpfs
> -pages will be shown as "Shmem" in /proc/meminfo and "Shared" in
> +Since tmpfs lives completely in the page cache and optionally on swap,
> +all tmpfs pages will be shown as "Shmem" in /proc/meminfo and "Shared" in
>  free(1). Notice that these counters also include shared memory
>  (shmem, see ipcs(1)). The most reliable way to get the count is
>  using df(1) and du(1).
> @@ -83,6 +84,8 @@ nr_inodes  The maximum number of inodes for this instance. The default
>             is half of the number of your physical RAM pages, or (on a
>             machine with highmem) the number of lowmem RAM pages,
>             whichever is the lower.
> +noswap     Disables swap. Remounts must respect the original settings.
> +           By default swap is enabled.
>  =========  ============================================================
>  
>  These parameters accept a suffix k, m or g for kilo, mega and giga and
> diff --git a/Documentation/mm/unevictable-lru.rst b/Documentation/mm/unevictable-lru.rst
> index 92ac5dca420c..d5ac8511eb67 100644
> --- a/Documentation/mm/unevictable-lru.rst
> +++ b/Documentation/mm/unevictable-lru.rst
> @@ -42,6 +42,8 @@ The unevictable list addresses the following classes of unevictable pages:
>  
>   * Those owned by ramfs.
>  
> + * Those owned by tmpfs with the noswap mount option.
> +
>   * Those mapped into SHM_LOCK'd shared memory regions.
>  
>   * Those mapped into VM_LOCKED [mlock()ed] VMAs.
> diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
> index 103d1000a5a2..50bf82b36995 100644
> --- a/include/linux/shmem_fs.h
> +++ b/include/linux/shmem_fs.h
> @@ -45,6 +45,7 @@ struct shmem_sb_info {
>  	kuid_t uid;		    /* Mount uid for root directory */
>  	kgid_t gid;		    /* Mount gid for root directory */
>  	bool full_inums;	    /* If i_ino should be uint or ino_t */
> +	bool noswap;		    /* ignores VM reclaim / swap requests */
>  	ino_t next_ino;		    /* The next per-sb inode number to use */
>  	ino_t __percpu *ino_batch;  /* The next per-cpu inode number to use */
>  	struct mempolicy *mpol;     /* default memory policy for mappings */
> diff --git a/mm/shmem.c b/mm/shmem.c
> index dfd995da77b4..2e122c72b375 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -119,10 +119,12 @@ struct shmem_options {
>  	bool full_inums;
>  	int huge;
>  	int seen;
> +	bool noswap;
>  #define SHMEM_SEEN_BLOCKS 1
>  #define SHMEM_SEEN_INODES 2
>  #define SHMEM_SEEN_HUGE 4
>  #define SHMEM_SEEN_INUMS 8
> +#define SHMEM_SEEN_NOSWAP 16
>  };
>  
>  #ifdef CONFIG_TMPFS
> @@ -1337,6 +1339,7 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
>  	struct address_space *mapping = folio->mapping;
>  	struct inode *inode = mapping->host;
>  	struct shmem_inode_info *info = SHMEM_I(inode);
> +	struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
>  	swp_entry_t swap;
>  	pgoff_t index;
>  
> @@ -1350,7 +1353,7 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
>  	if (WARN_ON_ONCE(!wbc->for_reclaim))
>  		goto redirty;
>  
> -	if (WARN_ON_ONCE(info->flags & VM_LOCKED))
> +	if (WARN_ON_ONCE((info->flags & VM_LOCKED) || sbinfo->noswap))
>  		goto redirty;
>  
>  	if (!total_swap_pages)
> @@ -2487,6 +2490,8 @@ static struct inode *shmem_get_inode(struct mnt_idmap *idmap, struct super_block
>  			shmem_set_inode_flags(inode, info->fsflags);
>  		INIT_LIST_HEAD(&info->shrinklist);
>  		INIT_LIST_HEAD(&info->swaplist);
> +		if (sbinfo->noswap)
> +			mapping_set_unevictable(inode->i_mapping);
>  		simple_xattrs_init(&info->xattrs);
>  		cache_no_acl(inode);
>  		mapping_set_large_folios(inode->i_mapping);
> @@ -3574,6 +3579,7 @@ enum shmem_param {
>  	Opt_uid,
>  	Opt_inode32,
>  	Opt_inode64,
> +	Opt_noswap,
>  };
>  
>  static const struct constant_table shmem_param_enums_huge[] = {
> @@ -3595,6 +3601,7 @@ const struct fs_parameter_spec shmem_fs_parameters[] = {
>  	fsparam_u32   ("uid",		Opt_uid),
>  	fsparam_flag  ("inode32",	Opt_inode32),
>  	fsparam_flag  ("inode64",	Opt_inode64),
> +	fsparam_flag  ("noswap",	Opt_noswap),
>  	{}
>  };
>  
> @@ -3678,6 +3685,10 @@ static int shmem_parse_one(struct fs_context *fc, struct fs_parameter *param)
>  		ctx->full_inums = true;
>  		ctx->seen |= SHMEM_SEEN_INUMS;
>  		break;
> +	case Opt_noswap:
> +		ctx->noswap = true;
> +		ctx->seen |= SHMEM_SEEN_NOSWAP;
> +		break;
>  	}
>  	return 0;
>  
> @@ -3776,6 +3787,14 @@ static int shmem_reconfigure(struct fs_context *fc)
>  		err = "Current inum too high to switch to 32-bit inums";
>  		goto out;
>  	}
> +	if ((ctx->seen & SHMEM_SEEN_NOSWAP) && ctx->noswap && !sbinfo->noswap) {
> +		err = "Cannot disable swap on remount";
> +		goto out;
> +	}
> +	if (!(ctx->seen & SHMEM_SEEN_NOSWAP) && !ctx->noswap && sbinfo->noswap) {
> +		err = "Cannot enable swap on remount if it was disabled on first mount";
> +		goto out;
> +	}
>  
>  	if (ctx->seen & SHMEM_SEEN_HUGE)
>  		sbinfo->huge = ctx->huge;
> @@ -3796,6 +3815,10 @@ static int shmem_reconfigure(struct fs_context *fc)
>  		sbinfo->mpol = ctx->mpol;	/* transfers initial ref */
>  		ctx->mpol = NULL;
>  	}
> +
> +	if (ctx->noswap)
> +		sbinfo->noswap = true;
> +
>  	raw_spin_unlock(&sbinfo->stat_lock);
>  	mpol_put(mpol);
>  	return 0;
> @@ -3850,6 +3873,8 @@ static int shmem_show_options(struct seq_file *seq, struct dentry *root)
>  		seq_printf(seq, ",huge=%s", shmem_format_huge(sbinfo->huge));
>  #endif
>  	shmem_show_mpol(seq, sbinfo->mpol);
> +	if (sbinfo->noswap)
> +		seq_printf(seq, ",noswap");
>  	return 0;
>  }
>  
> @@ -3893,6 +3918,7 @@ static int shmem_fill_super(struct super_block *sb, struct fs_context *fc)
>  			ctx->inodes = shmem_default_max_inodes();
>  		if (!(ctx->seen & SHMEM_SEEN_INUMS))
>  			ctx->full_inums = IS_ENABLED(CONFIG_TMPFS_INODE64);
> +		sbinfo->noswap = ctx->noswap;
>  	} else {
>  		sb->s_flags |= SB_NOUSER;
>  	}
> -- 
> 2.39.1
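
The two remount checks in the shmem_reconfigure() hunk quoted above can be modelled in userspace as follows. This is only a sketch for reasoning about which transitions are rejected: the structs and check_noswap_remount() are hypothetical stand-ins for sbinfo->noswap, ctx->noswap and the SHMEM_SEEN_NOSWAP bit, and the error strings are copied from the patch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Models sbinfo->noswap: how the filesystem was first mounted. */
struct mount_state {
	bool noswap;
};

/* Models the parsed remount options. */
struct remount_request {
	bool seen_noswap;	/* ctx->seen & SHMEM_SEEN_NOSWAP */
	bool noswap;		/* ctx->noswap */
};

/* Returns NULL when the remount is allowed, else the patch's error text. */
const char *check_noswap_remount(const struct mount_state *sb,
				 const struct remount_request *req)
{
	if (req->seen_noswap && req->noswap && !sb->noswap)
		return "Cannot disable swap on remount";
	if (!req->seen_noswap && !req->noswap && sb->noswap)
		return "Cannot enable swap on remount if it was disabled on first mount";
	return NULL;
}
```

In this model a remount succeeds only when the request repeats the original setting: "noswap" must be passed again on a noswap mount and must be absent on a swap-enabled one.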



* Re: [PATCH v2 6/6] shmem: add support to ignore swap
  2023-04-18  5:50   ` Hugh Dickins
@ 2023-04-18  7:38     ` Christian Brauner
  2023-04-18 21:51       ` Luis Chamberlain
  2023-04-18 21:22     ` [PATCH v2 6/6] shmem: add support to ignore swap Luis Chamberlain
  1 sibling, 1 reply; 30+ messages in thread
From: Christian Brauner @ 2023-04-18  7:38 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Luis Chamberlain, akpm, willy, linux-mm, p.raghav, da.gomez,
	a.manzanares, dave, yosryahmed, keescook, patches, linux-kernel

On Mon, Apr 17, 2023 at 10:50:59PM -0700, Hugh Dickins wrote:
> On Thu, 9 Mar 2023, Luis Chamberlain wrote:
> 
> > In doing experimentations with shmem having the option to avoid swap
> > becomes a useful mechanism. One of the *raves* about brd over shmem is
> > you can avoid swap, but that's not really a good reason to use brd if
> > we can instead use shmem. Using brd has its own good reasons to exist,
> > but just because "tmpfs" doesn't let you do that is not a great reason
> > to avoid it if we can easily add support for it.
> > 
> > I don't add support for reconfiguring incompatible options, but if
> > we really wanted to we can add support for that.
> > 
> > To avoid swap we use mapping_set_unevictable() upon inode creation,
> > and put a WARN_ON_ONCE() stop-gap on writepages() for reclaim.
> 
> I have one big question here, which betrays my ignorance:
> I hope that you or Christian can reassure me on this.
> 
> tmpfs has fs_flags FS_USERNS_MOUNT.  I know nothing about namespaces,
> nothing; but from overhearings, wonder if an ordinary user in a namespace
> might be able to mount their own tmpfs with "noswap", and thereby evade
> all accounting of the locked memory.
> 
> That would be an absolute no-no for this patch; but I assume that even
> if so, it can be easily remedied by inserting an appropriate (unknown
> to me!) privilege check where the "noswap" option is validated.

Oh, good catch. Thanks! So you would just need something like:

diff --git a/mm/shmem.c b/mm/shmem.c
index 787e83791eb5..21ce9b26bb4d 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -3571,6 +3571,10 @@ static int shmem_parse_one(struct fs_context *fc, struct fs_parameter *param)
                ctx->seen |= SHMEM_SEEN_INUMS;
                break;
        case Opt_noswap:
+               if ((fc->user_ns != &init_user_ns) || !capable(CAP_SYS_ADMIN)) {
+                       return invalfc(fc,
+                                      "Turning off swap in unprivileged tmpfs mounts unsupported");
+               }
                ctx->noswap = true;
                ctx->seen |= SHMEM_SEEN_NOSWAP;
                break;

The fc->user_ns is the userns that the tmpfs mount will be mounted in, i.e.,
fc->user_ns will become sb->s_user_ns if FS_USERNS_MOUNT is raised. So with the
check above we require that the tmpfs instance must ultimately belong to the
initial userns and that the caller has CAP_SYS_ADMIN in the initial userns
(CAP_SYS_ADMIN guards swapon and swapoff) according to capabilities(7).
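
For readers less familiar with namespaces, the condition can be restated as a tiny userspace truth-table model. This is a sketch only: struct mount_caller and noswap_permitted() are hypothetical names, and the two booleans stand in for the fc->user_ns comparison and the capable(CAP_SYS_ADMIN) call in the diff above.

```c
#include <stdbool.h>

/* Hypothetical stand-in for what the check looks at. */
struct mount_caller {
	bool in_init_userns;	/* models fc->user_ns == &init_user_ns */
	bool has_cap_sys_admin;	/* models capable(CAP_SYS_ADMIN) */
};

/* "noswap" is accepted only when both conditions hold. */
bool noswap_permitted(const struct mount_caller *c)
{
	return c->in_init_userns && c->has_cap_sys_admin;
}
```

A root user inside a non-initial user namespace fails the first condition, so the mount option is refused even though tmpfs itself can be mounted there.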



* Re: [PATCH v2 0/6] tmpfs: add the option to disable swap
  2023-04-18  4:31 ` Hugh Dickins
@ 2023-04-18 20:55   ` Luis Chamberlain
  0 siblings, 0 replies; 30+ messages in thread
From: Luis Chamberlain @ 2023-04-18 20:55 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: akpm, willy, brauner, linux-mm, p.raghav, da.gomez, a.manzanares,
	dave, yosryahmed, keescook, patches, linux-kernel

On Mon, Apr 17, 2023 at 09:31:20PM -0700, Hugh Dickins wrote:
> On Thu, 9 Mar 2023, Luis Chamberlain wrote:
> 
> > I'm doing this work as part of future experimentation with tmpfs and the
> > page cache, but given a common complaint found about tmpfs is the
> > inability to work without the page cache I figured this might be useful
> > to others. It turns out it is -- at least Christian Brauner indicates
> > systemd uses ramfs for a few use-cases because they don't want to use
> > swap and so having this option would let them move over to using tmpfs
> > for those small use cases, see systemd-creds(1).
> 
> Thanks for your thorough work on tmpfs "noswap": seems well-received
> by quite a few others, that's good.
> 
> I've just a few comments on later patches (I don't understand why you
> went into those little rearrangements at the start of shmem_writepage(),
> but they seem harmless so I don't object),

Because the devil is in the details as you noted too!

> but wanted to ask here:
> 
> You say "a common complaint about tmpfs is the inability to work without
> the page cache".  Ehh?

That was a mistake! s/page cache/swap.

  Luis



* Re: [PATCH v2 4/6] shmem: skip page split if we're not reclaiming
  2023-04-18  4:41   ` Hugh Dickins
@ 2023-04-18 21:11     ` Luis Chamberlain
  2023-04-18 21:20       ` Hugh Dickins
  0 siblings, 1 reply; 30+ messages in thread
From: Luis Chamberlain @ 2023-04-18 21:11 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: akpm, willy, brauner, linux-mm, p.raghav, da.gomez, a.manzanares,
	dave, yosryahmed, keescook, patches, linux-kernel,
	David Hildenbrand

On Mon, Apr 17, 2023 at 09:41:41PM -0700, Hugh Dickins wrote:
> On Thu, 9 Mar 2023, Luis Chamberlain wrote:
> 
> > > In theory when info->flags & VM_LOCKED we should not be getting
> > > shmem_writepage() called so we should be verifying this with a
> > > WARN_ON_ONCE(). Since we should not be swapping then best to ensure
> > > we also don't do the folio split earlier too. So just move the check
> > > early to avoid folio splits in case it's a dubious call.
> > 
> > We also have a similar early bail when !total_swap_pages so just move
> > that earlier to avoid the possible folio split in the same situation.
> > 
> > Acked-by: David Hildenbrand <david@redhat.com>
> > Reviewed-by: Christian Brauner <brauner@kernel.org>
> > Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> > ---
> >  mm/shmem.c | 10 ++++++----
> >  1 file changed, 6 insertions(+), 4 deletions(-)
> > 
> > diff --git a/mm/shmem.c b/mm/shmem.c
> > index 68e9970baf1e..dfd995da77b4 100644
> > --- a/mm/shmem.c
> > +++ b/mm/shmem.c
> > @@ -1350,6 +1350,12 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
> >  	if (WARN_ON_ONCE(!wbc->for_reclaim))
> >  		goto redirty;
> >  
> > +	if (WARN_ON_ONCE(info->flags & VM_LOCKED))
> > +		goto redirty;
> 
> Well, okay, I don't mind that.  But shall we take bets on how soon syzbot
> (hope it's not watching) will try flipping SHM_LOCK on while swapping out
> pages from a SHM segment, and hit that warning?  Perhaps I'm wrong, but I
> don't think any serialization prevents that.

I thought that may be the case. Would such serialization be welcomed?

  Luis



* Re: [PATCH v2 5/6] shmem: update documentation
  2023-04-18  5:29   ` Hugh Dickins
@ 2023-04-18 21:20     ` Luis Chamberlain
  2023-04-18 21:41       ` Hugh Dickins
  0 siblings, 1 reply; 30+ messages in thread
From: Luis Chamberlain @ 2023-04-18 21:20 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: akpm, willy, brauner, linux-mm, p.raghav, da.gomez, a.manzanares,
	dave, yosryahmed, keescook, patches, linux-kernel,
	David Hildenbrand

On Mon, Apr 17, 2023 at 10:29:59PM -0700, Hugh Dickins wrote:
> On Thu, 9 Mar 2023, Luis Chamberlain wrote:
> 
> > Update the docs to reflect a bit better why some folks prefer tmpfs
> > over ramfs and clarify a bit more about the difference between brd
> > ramdisks.
> > 
> > While at it, add THP docs for tmpfs, both the mount options and the
> > sysfs file.
> 
> Okay: the original canonical reference for THP options on tmpfs has
> been Documentation/admin-guide/mm/transhuge.rst.  You're right that
> they would be helpful here too: IIRC (but I might well be confusing
> with our Google tree) we used to have them documented in both places,
> but grew tired of keeping the two in synch.  You're volunteering to
> do so! so please check now that they tell the same story.

Hehe. Sure, we should just make one point to the other. Which one should
be the authoritative source?

> But nowadays, "man 5 tmpfs" is much more important (and that might
> give you a hint for what needs to be done after this series goes into
> 6.4-rc - and I wonder if there are tmpfs manpage updates needed from
> Christian for idmapped too? or already taken care of?).

Sure, what's the man page git tree to use? I can do that once these
documents are settled as well. I'll send fixes.

> There's a little detail we do need you to remove, indicated below.
> 
> > +======  ============================================================
> > +huge=0  never: disables huge pages for the mount
> > +huge=1  always: enables huge pages for the mount
> > +huge=2  within_size: only allocate huge pages if the page will be
> > +        fully within i_size, also respect fadvise()/madvise() hints.
> > +huge=3  advise: only allocate huge pages if requested with
> > +        fadvise()/madvise()
> 
> You're taking the source too literally there.  Minor point is that there
> is no fadvise() for this, to date anyway.  Major point is: have you tried
> mounting tmpfs with huge=0 etc?  I did propose "huge=0" and "huge=1" years
> ago, but those "never" went in, it's "always" been the named options.
> Please remove those misleading numbers, it's "huge=never" etc.

Will do.

> > +==  ============================================================
> > +-1  deny: disables huge on shm_mnt and all mounts, for
> > +    emergency use
> > +-2  force: enables huge on shm_mnt and all mounts, w/o needing
> > +    option, for testing
> 
> Likewise here, please delete the invalid "-1" and "-2" notations,
> -1 and -2 are just #defines for use in the kernel source.

ok!

> And the description above is not quite accurate: it is very hard to
> describe shmem_enabled, partly because it combines two different things.
> It's partly the "huge=" mount option for any "internal mount", those
> things like SysV SHM and memfd and i915 and shared-anonymous: the shmem
> which has no user-visible mount to hold the option.  But also these
> "deny" and "force" overrides affecting *all* internal and visible mounts.

I see, thanks.

  Luis



* Re: [PATCH v2 4/6] shmem: skip page split if we're not reclaiming
  2023-04-18 21:11     ` Luis Chamberlain
@ 2023-04-18 21:20       ` Hugh Dickins
  0 siblings, 0 replies; 30+ messages in thread
From: Hugh Dickins @ 2023-04-18 21:20 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Hugh Dickins, akpm, willy, brauner, linux-mm, p.raghav, da.gomez,
	a.manzanares, dave, yosryahmed, keescook, patches, linux-kernel,
	David Hildenbrand

On Tue, 18 Apr 2023, Luis Chamberlain wrote:
> On Mon, Apr 17, 2023 at 09:41:41PM -0700, Hugh Dickins wrote:
> > On Thu, 9 Mar 2023, Luis Chamberlain wrote:
> > 
> > > +	if (WARN_ON_ONCE(info->flags & VM_LOCKED))
> > > +		goto redirty;
> > 
> > Well, okay, I don't mind that.  But shall we take bets on how soon syzbot
> > (hope it's not watching) will try flipping SHM_LOCK on while swapping out
> > pages from a SHM segment, and hit that warning?  Perhaps I'm wrong, but I
> > don't think any serialization prevents that.
> 
> I thought that may be the case. Would such serialization be welcomed?

Absolutely not!  We don't insert slowdowns just to avoid warnings,
unless the warning is of something that really matters.  This one
does not matter, the situation is correctly handled, so the warning
would be better reverted.  Though I personally don't mind you leaving
it in until the first report of it arrives.

Hugh



* Re: [PATCH v2 6/6] shmem: add support to ignore swap
  2023-04-18  5:50   ` Hugh Dickins
  2023-04-18  7:38     ` Christian Brauner
@ 2023-04-18 21:22     ` Luis Chamberlain
  2023-04-18 21:30       ` Randy Dunlap
  1 sibling, 1 reply; 30+ messages in thread
From: Luis Chamberlain @ 2023-04-18 21:22 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: akpm, willy, brauner, linux-mm, p.raghav, da.gomez, a.manzanares,
	dave, yosryahmed, keescook, patches, linux-kernel

On Mon, Apr 17, 2023 at 10:50:59PM -0700, Hugh Dickins wrote:
> You'll be sending a manpage update to Alejandro in due course, I think.

Sure thing! Just need a git tree. I can send the updates as we reach
a consensus on where to store / share huge page shmem updates.

  Luis



* Re: [PATCH v2 6/6] shmem: add support to ignore swap
  2023-04-18 21:22     ` [PATCH v2 6/6] shmem: add support to ignore swap Luis Chamberlain
@ 2023-04-18 21:30       ` Randy Dunlap
  0 siblings, 0 replies; 30+ messages in thread
From: Randy Dunlap @ 2023-04-18 21:30 UTC (permalink / raw)
  To: Luis Chamberlain, Hugh Dickins
  Cc: akpm, willy, brauner, linux-mm, p.raghav, da.gomez, a.manzanares,
	dave, yosryahmed, keescook, patches, linux-kernel



On 4/18/23 14:22, Luis Chamberlain wrote:
> On Mon, Apr 17, 2023 at 10:50:59PM -0700, Hugh Dickins wrote:
>> You'll be sending a manpage update to Alejandro in due course, I think.
> 
> Sure thing! Just need a git tree. I can send the updates as we reach
> a consensus on where to store / share huge page shmem updates.
> 
>   Luis

From the latest man-page announcement:


	man-pages-6.04 - manual pages for GNU/Linux

The release tarball is already available at <kernel.org>.

Tarball download:
	<https://mirrors.edge.kernel.org/pub/linux/docs/man-pages/>
Git repository:
	<https://git.kernel.org/cgit/docs/man-pages/man-pages.git/>

-- 
~Randy



* Re: [PATCH v2 5/6] shmem: update documentation
  2023-04-18 21:20     ` Luis Chamberlain
@ 2023-04-18 21:41       ` Hugh Dickins
  2023-04-18 21:49         ` Luis Chamberlain
  0 siblings, 1 reply; 30+ messages in thread
From: Hugh Dickins @ 2023-04-18 21:41 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Hugh Dickins, akpm, willy, brauner, linux-mm, p.raghav, da.gomez,
	a.manzanares, dave, yosryahmed, keescook, patches, linux-kernel,
	David Hildenbrand

On Tue, 18 Apr 2023, Luis Chamberlain wrote:
> On Mon, Apr 17, 2023 at 10:29:59PM -0700, Hugh Dickins wrote:
> > On Thu, 9 Mar 2023, Luis Chamberlain wrote:
> > 
> > > Update the docs to reflect a bit better why some folks prefer tmpfs
> > > over ramfs and clarify a bit more about the difference between brd
> > > ramdisks.
> > > 
> > > While at it, add THP docs for tmpfs, both the mount options and the
> > > sysfs file.
> > 
> > Okay: the original canonical reference for THP options on tmpfs has
> > been Documentation/admin-guide/mm/transhuge.rst.  You're right that
> > they would be helpful here too: IIRC (but I might well be confusing
> > with our Google tree) we used to have them documented in both places,
> > but grew tired of keeping the two in synch.  You're volunteering to
> > do so! so please check now that they tell the same story.
> 
> Hehe. Sure, we should just make one point to the other. Which one should
> be the authoritative source?

Documentation/admin-guide/mm/transhuge.rst has been the authoritative
source up until this patch, so I suggest it remain so; but good if you
point to it from this Doc - unless in reading it you find that actually
its account is wrong.  (Haha, it refers to fadvise too, never mind that.)

But the man page is more important than either, so it would be good to
point to that too.  Mention the "huge=" option in this document, but
point to elsewhere for the detail of its values.

> 
> > But nowadays, "man 5 tmpfs" is much more important (and that might
> > give you a hint for what needs to be done after this series goes into
> > 6.4-rc - and I wonder if there are tmpfs manpage updates needed from
> > Christian for idmapped too? or already taken care of?).
> 
> Sure, what's the man page git tree to use? I can do that once these
> documents are settled as well. I'll send fixes.

Thanks. I'll look up a mail to lkml from Alejandro and forward that
to you, it has the details.

Hugh



* Re: [PATCH v2 5/6] shmem: update documentation
  2023-04-18 21:41       ` Hugh Dickins
@ 2023-04-18 21:49         ` Luis Chamberlain
  0 siblings, 0 replies; 30+ messages in thread
From: Luis Chamberlain @ 2023-04-18 21:49 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: akpm, willy, brauner, linux-mm, p.raghav, da.gomez, a.manzanares,
	dave, yosryahmed, keescook, patches, linux-kernel,
	David Hildenbrand

On Tue, Apr 18, 2023 at 02:41:07PM -0700, Hugh Dickins wrote:
> On Tue, 18 Apr 2023, Luis Chamberlain wrote:
> > On Mon, Apr 17, 2023 at 10:29:59PM -0700, Hugh Dickins wrote:
> > > On Thu, 9 Mar 2023, Luis Chamberlain wrote:
> > > 
> > > > Update the docs to reflect a bit better why some folks prefer tmpfs
> > > > over ramfs and clarify a bit more about the difference between brd
> > > > ramdisks.
> > > > 
> > > > While at it, add THP docs for tmpfs, both the mount options and the
> > > > sysfs file.
> > > 
> > > Okay: the original canonical reference for THP options on tmpfs has
> > > been Documentation/admin-guide/mm/transhuge.rst.  You're right that
> > > they would be helpful here too: IIRC (but I might well be confusing
> > > with our Google tree) we used to have them documented in both places,
> > > but grew tired of keeping the two in synch.  You're volunteering to
> > > do so! so please check now that they tell the same story.
> > 
> > Hehe. Sure, we should just make one point to the other. Which one should
> > be the authoritative source?
> 
> Documentation/admin-guide/mm/transhuge.rst has been the authoritative
> source up until this patch, so I suggest it remain so; but good if you
> point to it from this Doc - unless in reading it you find that actually
> its account is wrong.  (Haha, it refers to fadvise too, never mind that.)

Yeah I'll make the tmpfs kdoc point to the transhuge.rst page. I think
that's possible.

> But the man page is more important than either, so it would be good to
> point to that too. 

Sure I'll have the tmpfs kdoc also point to the tmpfs man page.

> Mention the "huge=" option in this document, but
> point to elsewhere for the detail of its values.

Sounds good.

  Luis



* Re: [PATCH v2 6/6] shmem: add support to ignore swap
  2023-04-18  7:38     ` Christian Brauner
@ 2023-04-18 21:51       ` Luis Chamberlain
  2023-04-20  8:57         ` [PATCH] shmem: restrict noswap option to initial user namespace Christian Brauner
  0 siblings, 1 reply; 30+ messages in thread
From: Luis Chamberlain @ 2023-04-18 21:51 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Hugh Dickins, akpm, willy, linux-mm, p.raghav, da.gomez,
	a.manzanares, dave, yosryahmed, keescook, patches, linux-kernel

On Tue, Apr 18, 2023 at 09:38:10AM +0200, Christian Brauner wrote:
> On Mon, Apr 17, 2023 at 10:50:59PM -0700, Hugh Dickins wrote:
> > On Thu, 9 Mar 2023, Luis Chamberlain wrote:
> > 
> > > In doing experimentations with shmem having the option to avoid swap
> > > becomes a useful mechanism. One of the *raves* about brd over shmem is
> > > you can avoid swap, but that's not really a good reason to use brd if
> > > we can instead use shmem. Using brd has its own good reasons to exist,
> > > but just because "tmpfs" doesn't let you do that is not a great reason
> > > to avoid it if we can easily add support for it.
> > > 
> > > I don't add support for reconfiguring incompatible options, but if
> > > we really wanted to we can add support for that.
> > > 
> > > To avoid swap we use mapping_set_unevictable() upon inode creation,
> > > and put a WARN_ON_ONCE() stop-gap on writepages() for reclaim.
> > 
> > I have one big question here, which betrays my ignorance:
> > I hope that you or Christian can reassure me on this.
> > 
> > tmpfs has fs_flags FS_USERNS_MOUNT.  I know nothing about namespaces,
> > nothing; but from overhearings, wonder if an ordinary user in a namespace
> > might be able to mount their own tmpfs with "noswap", and thereby evade
> > all accounting of the locked memory.
> > 
> > That would be an absolute no-no for this patch; but I assume that even
> > if so, it can be easily remedied by inserting an appropriate (unknown
> > to me!) privilege check where the "noswap" option is validated.
> 
> Oh, good catch. Thanks! So you would just need something like:
> 
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 787e83791eb5..21ce9b26bb4d 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -3571,6 +3571,10 @@ static int shmem_parse_one(struct fs_context *fc, struct fs_parameter *param)
>                 ctx->seen |= SHMEM_SEEN_INUMS;
>                 break;
>         case Opt_noswap:
> +               if ((fc->user_ns != &init_user_ns) || !capable(CAP_SYS_ADMIN)) {
> +                       return invalfc(fc,
> +                                      "Turning off swap in unprivileged tmpfs mounts unsupported");
> +               }
>                 ctx->noswap = true;
>                 ctx->seen |= SHMEM_SEEN_NOSWAP;
>                 break;
> 
> The fc->user_ns is the userns that the tmpfs mount will be mounted in, i.e.,
> fc->user_ns will become sb->s_user_ns if FS_USERNS_MOUNT is raised. So with the
> check above we require that the tmpfs instance must ultimately belong to the
> initial userns and that the caller has CAP_SYS_ADMIN in the initial userns
> (CAP_SYS_ADMIN guards swapon and swapoff) according to capabilities(7).

Christian, mind sending this as a fix?

  Luis



* [PATCH] shmem: restrict noswap option to initial user namespace
  2023-04-18 21:51       ` Luis Chamberlain
@ 2023-04-20  8:57         ` Christian Brauner
  2023-04-20 19:18           ` Luis Chamberlain
  0 siblings, 1 reply; 30+ messages in thread
From: Christian Brauner @ 2023-04-20  8:57 UTC (permalink / raw)
  To: Luis Chamberlain, Hugh Dickins
  Cc: Christian Brauner, akpm, willy, linux-mm, p.raghav, da.gomez,
	a.manzanares, dave, yosryahmed, keescook, patches, linux-kernel

Prevent tmpfs instances mounted in unprivileged namespaces from
evading accounting of locked memory by using the "noswap" mount option.

Cc: Luis Chamberlain <mcgrof@kernel.org>
Reported-by: Hugh Dickins <hughd@google.com>
Link: https://lore.kernel.org/lkml/79eae9fe-7818-a65c-89c6-138b55d609a@google.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 mm/shmem.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/mm/shmem.c b/mm/shmem.c
index 787e83791eb5..21ce9b26bb4d 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -3571,6 +3571,10 @@ static int shmem_parse_one(struct fs_context *fc, struct fs_parameter *param)
 		ctx->seen |= SHMEM_SEEN_INUMS;
 		break;
 	case Opt_noswap:
+		if ((fc->user_ns != &init_user_ns) || !capable(CAP_SYS_ADMIN)) {
+			return invalfc(fc,
+				       "Turning off swap in unprivileged tmpfs mounts unsupported");
+		}
 		ctx->noswap = true;
 		ctx->seen |= SHMEM_SEEN_NOSWAP;
 		break;
-- 
2.34.1
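
For readers following along, the condition added above can be modeled in
plain userspace C. This is only a sketch: struct userns_model,
init_userns_model and noswap_allowed() are hypothetical names made up for
illustration, and the has_cap_sys_admin flag stands in for the kernel's
capable(CAP_SYS_ADMIN) call. It is not the kernel code itself, just the
same boolean logic.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct user_namespace. */
struct userns_model {
	const struct userns_model *parent;	/* NULL for the initial namespace */
};

/* Hypothetical stand-in for the kernel's init_user_ns. */
static const struct userns_model init_userns_model = { .parent = NULL };

/*
 * Models the condition guarding Opt_noswap in the patch: "noswap" is
 * accepted only when the mount belongs to the initial user namespace
 * AND the caller holds CAP_SYS_ADMIN there. The has_cap_sys_admin
 * argument is an assumption standing in for capable(CAP_SYS_ADMIN).
 */
static bool noswap_allowed(const struct userns_model *mount_ns,
			   bool has_cap_sys_admin)
{
	/* Same shape as the patch: reject if either requirement fails. */
	if (mount_ns != &init_userns_model || !has_cap_sys_admin)
		return false;
	return true;
}
```

In this model, only the combination (init_userns_model, true) is accepted;
any child namespace is rejected regardless of the capability flag, which
mirrors the explanation that the tmpfs instance must ultimately belong to
the initial userns.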




* Re: [PATCH] shmem: restrict noswap option to initial user namespace
  2023-04-20  8:57         ` [PATCH] shmem: restrict noswap option to initial user namespace Christian Brauner
@ 2023-04-20 19:18           ` Luis Chamberlain
  0 siblings, 0 replies; 30+ messages in thread
From: Luis Chamberlain @ 2023-04-20 19:18 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Hugh Dickins, akpm, willy, linux-mm, p.raghav, da.gomez,
	a.manzanares, dave, yosryahmed, keescook, patches, linux-kernel

On Thu, Apr 20, 2023 at 10:57:43AM +0200, Christian Brauner wrote:
> Prevent tmpfs instances mounted in unprivileged namespaces from
> evading accounting of locked memory by using the "noswap" mount option.
> 
> Cc: Luis Chamberlain <mcgrof@kernel.org>
> Reported-by: Hugh Dickins <hughd@google.com>
> Link: https://lore.kernel.org/lkml/79eae9fe-7818-a65c-89c6-138b55d609a@google.com
> Signed-off-by: Christian Brauner <brauner@kernel.org>

Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>

  Luis



end of thread, other threads:[~2023-04-20 19:18 UTC | newest]

Thread overview: 30+ messages
2023-03-09 23:05 [PATCH v2 0/6] tmpfs: add the option to disable swap Luis Chamberlain
2023-03-09 23:05 ` [PATCH v2 1/6] shmem: remove check for folio lock on writepage() Luis Chamberlain
2023-03-09 23:05 ` [PATCH v2 2/6] shmem: set shmem_writepage() variables early Luis Chamberlain
2023-03-09 23:05 ` [PATCH v2 3/6] shmem: move reclaim check early on writepages() Luis Chamberlain
2023-03-09 23:05 ` [PATCH v2 4/6] shmem: skip page split if we're not reclaiming Luis Chamberlain
2023-03-09 23:09   ` Yosry Ahmed
2023-04-18  4:41   ` Hugh Dickins
2023-04-18 21:11     ` Luis Chamberlain
2023-04-18 21:20       ` Hugh Dickins
2023-03-09 23:05 ` [PATCH v2 5/6] shmem: update documentation Luis Chamberlain
2023-04-18  5:29   ` Hugh Dickins
2023-04-18 21:20     ` Luis Chamberlain
2023-04-18 21:41       ` Hugh Dickins
2023-04-18 21:49         ` Luis Chamberlain
2023-03-09 23:05 ` [PATCH v2 6/6] shmem: add support to ignore swap Luis Chamberlain
2023-04-18  5:50   ` Hugh Dickins
2023-04-18  7:38     ` Christian Brauner
2023-04-18 21:51       ` Luis Chamberlain
2023-04-20  8:57         ` [PATCH] shmem: restrict noswap option to initial user namespace Christian Brauner
2023-04-20 19:18           ` Luis Chamberlain
2023-04-18 21:22     ` [PATCH v2 6/6] shmem: add support to ignore swap Luis Chamberlain
2023-04-18 21:30       ` Randy Dunlap
2023-03-14  1:21 ` [PATCH v2 0/6] tmpfs: add the option to disable swap Davidlohr Bueso
2023-03-14  2:46 ` haoxin
2023-03-19 20:32   ` Luis Chamberlain
2023-03-20 11:14     ` haoxin
2023-03-20 21:36       ` Luis Chamberlain
2023-03-21 11:37         ` haoxin
2023-04-18  4:31 ` Hugh Dickins
2023-04-18 20:55   ` Luis Chamberlain
