All of lore.kernel.org
 help / color / mirror / Atom feed
From: John Hubbard <jhubbard@nvidia.com>
To: Theodore Ts'o <tytso@mit.edu>
Cc: Lee Jones <lee.jones@linaro.org>,
	linux-ext4@vger.kernel.org, Christoph Hellwig <hch@lst.de>,
	Dave Chinner <dchinner@redhat.com>,
	Goldwyn Rodrigues <rgoldwyn@suse.com>,
	"Darrick J . Wong" <darrick.wong@oracle.com>,
	Bob Peterson <rpeterso@redhat.com>,
	Damien Le Moal <damien.lemoal@wdc.com>,
	Andreas Gruenbacher <agruenba@redhat.com>,
	Ritesh Harjani <riteshh@linux.ibm.com>,
	Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	Johannes Thumshirn <jth@kernel.org>,
	linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	cluster-devel@redhat.com, linux-kernel@vger.kernel.org
Subject: Re: [REPORT] kernel BUG at fs/ext4/inode.c:2620 - page_buffers()
Date: Wed, 23 Feb 2022 16:44:07 -0800	[thread overview]
Message-ID: <d75891d6-4c2e-57bc-f840-9d8d5449628a@nvidia.com> (raw)
In-Reply-To: <YhbD1T7qhgnz4myM@mit.edu>

On 2/23/22 3:31 PM, Theodore Ts'o wrote:
> On Thu, Feb 17, 2022 at 10:33:34PM -0800, John Hubbard wrote:
>>
>> Just a small thing I'll say once, to get it out of my system. No action
>> required here, I just want it understood:
>>
>> Before commit 803e4572d7c5 ("mm/process_vm_access: set FOLL_PIN via
>> pin_user_pages_remote()"), you would have written that like this:
>>
>> "process_vm_writev() is dirtying pages without properly warning the file
>> system in advance..."
>>
>> Because, there were many callers that were doing this:
>>
>>     get_user_pages*()
>>     ...use the pages...
>>
>>     for_each_page() {
>>             set_page_dirty*()
>>             put_page()
>>     }
> 
> Sure, but that's not sufficient when modifying file-backed pages.

We are in complete agreement. I was just trying to hint (too subtly) 
that the problem is ages old, and after doing all this work on the 
prerequisites to solving it (pin_user_pages() is a prerequisite), it kind 
of bothers me to see a commit with "work around bugs in mm/gup.c" in the 
title. That's all. :)


> Previously, there was only two ways of modifying file-backed pages in
> the page cache --- either using the write(2) system call, or when a
> mmaped page is modified by the userspace.
> 
> In the case of write(2), the file system gets notified before the page
> cache is modified by a call to the address operation's write_begin()
> call, and after the page cache is modified, the address operation's
> write_end() call lets the file system know that the modification is
> done.  After the write is done, the 30 second writeback timer is
> triggered, and in the case of ext4's data=journalling mode, we close
> the ext4 micro-transation (and therefore the time between write_begin
> and write_end calls needs to be minimal); otherwise this can block
> ext4 transactions.
> 
> In the case of a user page fault, the address operation's
> page_mkwrite() is called, and at that point we will allocate any
> blocks needed to back memory if necessary (in the case of delayed
> allocation, file system space has to get reserved).  The problem here
> for remote access is that page_mkwrite() can block, and it can
> potentially fail (e.g., with ENOSPC or ENOMEM).  This is also why we
> can't just add the page buffers and do the file system allocation in
> set_page_dirty(), since set_page_dirty() isn't allowed to block.

Oh yes. Believe me, I am well-versed in that story! But it's always
nice to hear it again, especially from a file system maintainer.
Each time there is something new, such as the micro-transaction detail.

> 
> One approach might be to make all of the pages writeable when
> pin_user_pages_remote() is called.  That that would mean that in the
> case of remote access via process_vm_writev or RDMA, all of the blocks
> will be allocated early.  But that's probably better since at that
> point the userspace code is in a position to receive the error when
> setting up the RDMA memory, and we don't want to be asking the file
> system to do block allocation when incoming data is coming in from
> Infiniband or iWARP.

So it sounds like the file lease idea, yes? I'm hoping that that
is still viable. I still think it's a good LSF/MM topic, to work through.

> 
>> I see that ext4_warning_inode() has rate limiting, but it doesn't look
>> like it's really intended for a per-page rate. It looks like it is
>> per-superblock (yes?), so I suspect one instance of this problem, with
>> many pages involved, could hit the limit.
>>
>> Often, WARN_ON_ONCE() is used with per-page assertions. That's not great
>> either, but it might be better here. OTOH, I have minimal experience
>> with ext4_warning_inode() and maybe it really is just fine with per-page
>> failure rates.
> 
> With the syzbot reproducer, we're not actually triggering the rate
> limiter, since the ext4 warning is only getting hit a bit more than
> once every 4 seconds.  And I'm guessing that in the real world, people
> aren't actually trying to do remote direct access to file-backed
> memory, at least not using ext4, since that's an invitation to a
> kernel crash, and we would have gotten user complaints.  If some user

Actually...I can confirm that real customers really are doing *exactly* 
that! Despite the kernel crashes--because the crashes don't always 
happen unless you have a large (supercomputer-sized) installation. And 
even then it is not always root-caused properly.

I guess that goes in the "weird but true" category. :)


thanks,
-- 
John Hubbard
NVIDIA

> actually tries to use process_vm_writev for realsies, as opposed to a
> random fuzzer or from a malicious program , we do want to warn them
> about the potential data loss, so I'd prefer to warn once for each
> inode --- but I'm not convinced that it's worth the effort.
> 
> Cheers,
> 
> 						- Ted

WARNING: multiple messages have this Message-ID (diff)
From: John Hubbard <jhubbard@nvidia.com>
To: cluster-devel.redhat.com
Subject: [Cluster-devel] [REPORT] kernel BUG at fs/ext4/inode.c:2620 - page_buffers()
Date: Wed, 23 Feb 2022 16:44:07 -0800	[thread overview]
Message-ID: <d75891d6-4c2e-57bc-f840-9d8d5449628a@nvidia.com> (raw)
In-Reply-To: <YhbD1T7qhgnz4myM@mit.edu>

On 2/23/22 3:31 PM, Theodore Ts'o wrote:
> On Thu, Feb 17, 2022 at 10:33:34PM -0800, John Hubbard wrote:
>>
>> Just a small thing I'll say once, to get it out of my system. No action
>> required here, I just want it understood:
>>
>> Before commit 803e4572d7c5 ("mm/process_vm_access: set FOLL_PIN via
>> pin_user_pages_remote()"), you would have written that like this:
>>
>> "process_vm_writev() is dirtying pages without properly warning the file
>> system in advance..."
>>
>> Because, there were many callers that were doing this:
>>
>>     get_user_pages*()
>>     ...use the pages...
>>
>>     for_each_page() {
>>             set_page_dirty*()
>>             put_page()
>>     }
> 
> Sure, but that's not sufficient when modifying file-backed pages.

We are in complete agreement. I was just trying to hint (too subtly) 
that the problem is ages old, and after doing all this work on the 
prerequisites to solving it (pin_user_pages() is a prerequisite), it kind 
of bothers me to see a commit with "work around bugs in mm/gup.c" in the 
title. That's all. :)


> Previously, there was only two ways of modifying file-backed pages in
> the page cache --- either using the write(2) system call, or when a
> mmaped page is modified by the userspace.
> 
> In the case of write(2), the file system gets notified before the page
> cache is modified by a call to the address operation's write_begin()
> call, and after the page cache is modified, the address operation's
> write_end() call lets the file system know that the modification is
> done.  After the write is done, the 30 second writeback timer is
> triggered, and in the case of ext4's data=journalling mode, we close
> the ext4 micro-transation (and therefore the time between write_begin
> and write_end calls needs to be minimal); otherwise this can block
> ext4 transactions.
> 
> In the case of a user page fault, the address operation's
> page_mkwrite() is called, and at that point we will allocate any
> blocks needed to back memory if necessary (in the case of delayed
> allocation, file system space has to get reserved).  The problem here
> for remote access is that page_mkwrite() can block, and it can
> potentially fail (e.g., with ENOSPC or ENOMEM).  This is also why we
> can't just add the page buffers and do the file system allocation in
> set_page_dirty(), since set_page_dirty() isn't allowed to block.

Oh yes. Believe me, I am well-versed in that story! But it's always
nice to hear it again, especially from a file system maintainer.
Each time there is something new, such as the micro-transaction detail.

> 
> One approach might be to make all of the pages writeable when
> pin_user_pages_remote() is called.  That that would mean that in the
> case of remote access via process_vm_writev or RDMA, all of the blocks
> will be allocated early.  But that's probably better since at that
> point the userspace code is in a position to receive the error when
> setting up the RDMA memory, and we don't want to be asking the file
> system to do block allocation when incoming data is coming in from
> Infiniband or iWARP.

So it sounds like the file lease idea, yes? I'm hoping that that
is still viable. I still think it's a good LSF/MM topic, to work through.

> 
>> I see that ext4_warning_inode() has rate limiting, but it doesn't look
>> like it's really intended for a per-page rate. It looks like it is
>> per-superblock (yes?), so I suspect one instance of this problem, with
>> many pages involved, could hit the limit.
>>
>> Often, WARN_ON_ONCE() is used with per-page assertions. That's not great
>> either, but it might be better here. OTOH, I have minimal experience
>> with ext4_warning_inode() and maybe it really is just fine with per-page
>> failure rates.
> 
> With the syzbot reproducer, we're not actually triggering the rate
> limiter, since the ext4 warning is only getting hit a bit more than
> once every 4 seconds.  And I'm guessing that in the real world, people
> aren't actually trying to do remote direct access to file-backed
> memory, at least not using ext4, since that's an invitation to a
> kernel crash, and we would have gotten user complaints.  If some user

Actually...I can confirm that real customers really are doing *exactly* 
that! Despite the kernel crashes--because the crashes don't always 
happen unless you have a large (supercomputer-sized) installation. And 
even then it is not always root-caused properly.

I guess that goes in the "weird but true" category. :)


thanks,
-- 
John Hubbard
NVIDIA

> actually tries to use process_vm_writev for realsies, as opposed to a
> random fuzzer or from a malicious program , we do want to warn them
> about the potential data loss, so I'd prefer to warn once for each
> inode --- but I'm not convinced that it's worth the effort.
> 
> Cheers,
> 
> 						- Ted



  reply	other threads:[~2022-02-24  0:44 UTC|newest]

Thread overview: 59+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-02-16 16:31 [REPORT] kernel BUG at fs/ext4/inode.c:2620 - page_buffers() Lee Jones
2022-02-16 16:31 ` [Cluster-devel] " Lee Jones
2022-02-18  1:06 ` John Hubbard
2022-02-18  1:06   ` [Cluster-devel] " John Hubbard
2022-02-18  4:08   ` Theodore Ts'o
2022-02-18  4:08     ` [Cluster-devel] " Theodore Ts'o
2022-02-18  6:33     ` John Hubbard
2022-02-18  6:33       ` [Cluster-devel] " John Hubbard
2022-02-23 23:31       ` Theodore Ts'o
2022-02-23 23:31         ` [Cluster-devel] " Theodore Ts'o
2022-02-24  0:44         ` John Hubbard [this message]
2022-02-24  0:44           ` John Hubbard
2022-02-24  4:04           ` Theodore Ts'o
2022-02-24  4:04             ` [Cluster-devel] " Theodore Ts'o
2022-02-18  7:51     ` Greg Kroah-Hartman
2022-02-18  7:51       ` [Cluster-devel] " Greg Kroah-Hartman
2022-02-23 23:35       ` Theodore Ts'o
2022-02-23 23:35         ` [Cluster-devel] " Theodore Ts'o
2022-02-24  1:48         ` Dave Chinner
2022-02-24  1:48           ` [Cluster-devel] " Dave Chinner
2022-02-24  3:50           ` Theodore Ts'o
2022-02-24  3:50             ` [Cluster-devel] " Theodore Ts'o
2022-02-24 10:29             ` Dave Chinner
2022-02-24 10:29               ` [Cluster-devel] " Dave Chinner
2022-02-18  2:54 ` Theodore Ts'o
2022-02-18  2:54   ` [Cluster-devel] " Theodore Ts'o
2022-02-18  4:24   ` Matthew Wilcox
2022-02-18  4:24     ` [Cluster-devel] " Matthew Wilcox
2022-02-18  6:03     ` Theodore Ts'o
2022-02-18  6:03       ` [Cluster-devel] " Theodore Ts'o
2022-02-25 19:24 ` [PATCH -v2] ext4: don't BUG if kernel subsystems dirty pages without asking ext4 first Theodore Ts'o
2022-02-25 19:24   ` [Cluster-devel] " Theodore Ts'o
2022-02-25 20:51   ` Eric Biggers
2022-02-25 20:51     ` [Cluster-devel] " Eric Biggers
2022-02-25 21:08     ` Theodore Ts'o
2022-02-25 21:08       ` [Cluster-devel] " Theodore Ts'o
2022-02-25 21:23       ` [PATCH -v3] " Theodore Ts'o
2022-02-25 21:23         ` [Cluster-devel] " Theodore Ts'o
2022-02-25 21:33         ` John Hubbard
2022-02-25 21:33           ` [Cluster-devel] " John Hubbard
2022-02-25 23:21           ` Theodore Ts'o
2022-02-25 23:21             ` [Cluster-devel] " Theodore Ts'o
2022-02-26  0:18             ` Hillf Danton
2022-02-26  0:41             ` John Hubbard
2022-02-26  0:41               ` [Cluster-devel] " John Hubbard
2022-02-26  1:40               ` Theodore Ts'o
2022-02-26  1:40                 ` [Cluster-devel] " Theodore Ts'o
2022-02-26  2:00                 ` Theodore Ts'o
2022-02-26  2:00                   ` [Cluster-devel] " Theodore Ts'o
2022-02-26  2:55                 ` John Hubbard
2022-02-26  2:55                   ` [Cluster-devel] " John Hubbard
2022-03-03  4:26         ` [PATCH -v4] " Theodore Ts'o
2022-03-03  4:26           ` [Cluster-devel] " Theodore Ts'o
2022-03-03  8:21           ` Christoph Hellwig
2022-03-03  8:21             ` [Cluster-devel] " Christoph Hellwig
2022-03-03  9:21           ` Lee Jones
2022-03-03  9:21             ` [Cluster-devel] " Lee Jones
2022-03-03 14:38           ` [PATCH -v5] ext4: don't BUG if someone " Theodore Ts'o
2022-03-03 14:38             ` [Cluster-devel] " Theodore Ts'o

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=d75891d6-4c2e-57bc-f840-9d8d5449628a@nvidia.com \
    --to=jhubbard@nvidia.com \
    --cc=agruenba@redhat.com \
    --cc=cluster-devel@redhat.com \
    --cc=damien.lemoal@wdc.com \
    --cc=darrick.wong@oracle.com \
    --cc=dchinner@redhat.com \
    --cc=gregkh@linuxfoundation.org \
    --cc=hch@lst.de \
    --cc=jth@kernel.org \
    --cc=lee.jones@linaro.org \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-xfs@vger.kernel.org \
    --cc=rgoldwyn@suse.com \
    --cc=riteshh@linux.ibm.com \
    --cc=rpeterso@redhat.com \
    --cc=tytso@mit.edu \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.