From: Christian Brauner <brauner@kernel.org>
To: Dave Chinner <david@fromorbit.com>
Cc: lsf-pc@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org,
	 linux-mm@kvack.org, linux-btrfs@vger.kernel.org,
	linux-block@vger.kernel.org,
	 Matthew Wilcox <willy@infradead.org>, Jan Kara <jack@suse.cz>,
	Christoph Hellwig <hch@infradead.org>,
	 adrianvovk@gmail.com
Subject: Re: [LSF/MM/BPF TOPIC] Dropping page cache of individual fs
Date: Wed, 17 Jan 2024 14:19:43 +0100	[thread overview]
Message-ID: <20240117-yuppie-unflexibel-dbbb281cb948@brauner> (raw)
In-Reply-To: <ZabtYQqakvxJVYjM@dread.disaster.area>

On Wed, Jan 17, 2024 at 07:56:01AM +1100, Dave Chinner wrote:
> On Tue, Jan 16, 2024 at 11:50:32AM +0100, Christian Brauner wrote:
> > Hey,
> > 
> > I'm not sure this even needs a full LSFMM discussion but since I
> > currently don't have time to work on the patch I may as well submit it.
> > 
> > Gnome was recently awarded 1M Euro by the Sovereign Tech Fund (STF). The
> > STF was created by the German government to fund public infrastructure:
> > 
> > "The Sovereign Tech Fund supports the development, improvement and
> >  maintenance of open digital infrastructure. Our goal is to sustainably
> >  strengthen the open source ecosystem. We focus on security, resilience,
> >  technological diversity, and the people behind the code." (cf. [1])
> > 
> > Gnome has proposed various specific projects including integrating
> > systemd-homed with Gnome. Systemd-homed provides various features and if
> > you're interested in details then you might find it useful to read [2].
> > It makes use of various new VFS and fs specific developments over the
> > last years.
> > 
> > One feature is encrypting the home directory via LUKS. An appropriate
> > image or device must contain a GPT partition table. Currently there's
> > only one partition which is a LUKS2 volume. Inside that LUKS2 volume is
> > a Linux filesystem. Currently supported are btrfs (see [4] though),
> > ext4, and xfs.
> > 
> > The following issue isn't specific to systemd-homed. Gnome wants to be
> > able to support locking encrypted home directories. For example, when
> > the laptop is suspended. To do this the luksSuspend command can be used.
> > 
> > The luksSuspend call is nothing more than a device mapper ioctl that
> > suspends the block device and its owning superblock/filesystem, which
> > in turn is nothing but a freeze initiated from the block layer:
> > 
> > dm_suspend()
> > -> __dm_suspend()
> >    -> lock_fs()
> >       -> bdev_freeze()
> > 
> > So when we say luksSuspend we really mean block layer initiated freeze.
> > The overall goal or expectation of userspace is that after a luksSuspend
> > call all sensitive material has been evicted from relevant caches to
> > harden against various attacks. And luksSuspend does wipe the encryption
> > key and suspend the block device. However, the encryption key can still
> > be available clear-text in the page cache.
> 
> The wiping of secrets is completely orthogonal to the freezing of
> the device and filesystem - the freeze does not need to occur to
> allow the encryption keys and decrypted data to be purged. They
> should not be conflated; purging needs to be a completely separate
> operation that can be run regardless of device/fs freeze status.

Yes, I'm aware. I didn't mean to imply that these things are necessarily
connected, just that there are use-cases where they are, and the
encrypted home directory case is one of them: once the block device and
filesystem are frozen, one would also like to drop the page cache, which
holds most of the interesting data.

The fact that after a block layer initiated freeze - again mostly a
device mapper problem - one may or may not be able to successfully read
from the filesystem is annoying. Of course one can't write, that hangs
immediately. But if some data is still in the page cache, the contents
of that file can still be dumped. That's at least odd behavior from a
user's POV, even if it's clear to us why that's the case.

And a freeze already does a sync_filesystem() and a sync_blockdev() to
flush out any dirty data for that specific filesystem. So it would be
fitting to give users an API that also allows them to drop that
filesystem's page cache contents.

For some use-cases, like the Gnome one, the goal is to freeze and then
drop everything one can from the page cache of that specific filesystem.

And drop_caches is a big hammer, simply because there are workloads
where that isn't feasible. Even on a boring modern laptop one may have
lots of services. On a large-scale system one may have thousands of
services, and they may all use separate images (and the border between
isolated services and containers is fuzzy at best). Invoking drop_caches
there penalizes every service.

One may want to drop the page cache of _some_ services but not all of
them - especially during suspend, where one cares about dropping the
page cache of the home directory that gets suspended, encrypted or
unencrypted.

Even ignoring the security aspect: given that one froze the block device
and the owning filesystem, one may want to drop its page cache as well
without impacting every other filesystem on the system - of which there
may be thousands. One doesn't want to penalize them all.

Ignoring the specific use-case I know that David has been interested in
a way to drop the page cache for afs. So this is not just for the home
directory case. I mostly wanted to make it clear that there are users of
an interface like this; even if it were just best effort.

> 
> FWIW, focussing on purging the page cache omits the fact that
> having access to the directory structure is a problem - one can
> still retrieve other user information that is stored in metadata
> (e.g. xattrs) that isn't part of the page cache. Even the directory
> structure that is cached in dentries could reveal secrets someone
> wants to keep hidden (e.g code names for operations/products).

Yes, of course but that's fine. The most sensitive data and the biggest
chunks of data will be the contents of files. We don't necessarily need
to cater to the paranoid with this.

> 
> So if we want luksSuspend to actually protect user information when
> it runs, then it effectively needs to bring the filesystem right
> back to its "just mounted" state where the only thing in memory is
> the root directory dentry and inode and nothing else.

Yes, which we know isn't feasible.

> 
> And, of course, this is largely impossible to do because anything
> with an open file on the filesystem will prevent this robust cache
> purge from occurring....
> 
> Which brings us back to "best effort" only, and at this point we
> already have drop-caches....
> 
> Mind you, I do wonder if drop caches is fast enough for this sort of
> use case. It is single threaded, and if the filesystem/system has
> millions of cached inodes it can take minutes to run. Unmount has
> the same problem - purging large dentry/inode caches takes a *lot*
> of CPU time and these operations are single threaded.
> 
> So it may not be practical in the luks context to purge caches e.g.
> suspending a laptop shouldn't take minutes. However laptops are
> getting to the hundreds of GB of RAM these days and so they can
> cache millions of inodes, so cache purge runtime is definitely a
> consideration here.

I'm really trying to look for a practical api that doesn't require users
to drop the caches for every mounted image on the system.

FYI, I've tried to get some users to reply here so they could speak to
the fact that they don't expect this to be an optimal solution, but none
of them know how to reply to lore mboxes, so I can only relay the
information.


Thread overview: 27+ messages
2024-01-16 10:50 [LSF/MM/BPF TOPIC] Dropping page cache of individual fs Christian Brauner
2024-01-16 11:45 ` Jan Kara
2024-01-17 12:53   ` Christian Brauner
2024-01-17 14:35     ` Jan Kara
2024-01-17 14:52       ` Matthew Wilcox
2024-01-17 20:51         ` Phillip Susi
2024-01-17 20:58           ` Matthew Wilcox
2024-01-18 14:26         ` Christian Brauner
2024-01-30  0:13         ` Adrian Vovk
2024-02-15 13:57           ` Jan Kara
2024-02-15 19:46             ` Adrian Vovk
2024-02-15 23:17               ` Dave Chinner
2024-02-16  1:14                 ` Adrian Vovk
2024-02-16 20:38                   ` init_on_alloc digression: " John Hubbard
2024-02-16 21:11                     ` Adrian Vovk
2024-02-16 21:19                       ` John Hubbard
2024-01-16 15:25 ` James Bottomley
2024-01-16 15:40   ` Matthew Wilcox
2024-01-16 15:54     ` James Bottomley
2024-01-16 20:56 ` Dave Chinner
2024-01-17  6:17   ` Theodore Ts'o
2024-01-30  1:14     ` Adrian Vovk
2024-01-17 13:19   ` Christian Brauner [this message]
2024-01-17 22:26     ` Dave Chinner
2024-01-18 14:09       ` Christian Brauner
2024-02-05 17:39     ` Russell Haley
2024-02-17  4:04 ` Kent Overstreet
