Re: [REPORT] kernel BUG at fs/ext4/inode.c:2620 - page_buffers()

From: John Hubbard <jhubbard@nvidia.com>
To: Theodore Ts'o <tytso@mit.edu>
Cc: Lee Jones <lee.jones@linaro.org>,
	linux-ext4@vger.kernel.org, Christoph Hellwig <hch@lst.de>,
	Dave Chinner <dchinner@redhat.com>,
	Goldwyn Rodrigues <rgoldwyn@suse.com>,
	"Darrick J . Wong" <darrick.wong@oracle.com>,
	Bob Peterson <rpeterso@redhat.com>,
	Damien Le Moal <damien.lemoal@wdc.com>,
	Andreas Gruenbacher <agruenba@redhat.com>,
	Ritesh Harjani <riteshh@linux.ibm.com>,
	Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	Johannes Thumshirn <jth@kernel.org>,
	linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	cluster-devel@redhat.com, linux-kernel@vger.kernel.org
Subject: Re: [REPORT] kernel BUG at fs/ext4/inode.c:2620 - page_buffers()
Date: Wed, 23 Feb 2022 16:44:07 -0800	[thread overview]
Message-ID: <d75891d6-4c2e-57bc-f840-9d8d5449628a@nvidia.com> (raw)
In-Reply-To: <YhbD1T7qhgnz4myM@mit.edu>

On 2/23/22 3:31 PM, Theodore Ts'o wrote:
> On Thu, Feb 17, 2022 at 10:33:34PM -0800, John Hubbard wrote:
>>
>> Just a small thing I'll say once, to get it out of my system. No action
>> required here, I just want it understood:
>>
>> Before commit 803e4572d7c5 ("mm/process_vm_access: set FOLL_PIN via
>> pin_user_pages_remote()"), you would have written that like this:
>>
>> "process_vm_writev() is dirtying pages without properly warning the file
>> system in advance..."
>>
>> Because, there were many callers that were doing this:
>>
>>     get_user_pages*()
>>     ...use the pages...
>>
>>     for_each_page() {
>>             set_page_dirty*()
>>             put_page()
>>     }
> 
> Sure, but that's not sufficient when modifying file-backed pages.

We are in complete agreement. I was just trying to hint (too subtly) 
that the problem is ages old, and after doing all this work on the 
prerequisites to solving it (pin_user_pages() is a prerequisite), it kind 
of bothers me to see a commit with "work around bugs in mm/gup.c" in the 
title. That's all. :)

> Previously, there was only two ways of modifying file-backed pages in
> the page cache --- either using the write(2) system call, or when a
> mmaped page is modified by the userspace.
> 
> In the case of write(2), the file system gets notified before the page
> cache is modified by a call to the address operation's write_begin()
> call, and after the page cache is modified, the address operation's
> write_end() call lets the file system know that the modification is
> done.  After the write is done, the 30 second writeback timer is
> triggered, and in the case of ext4's data=journalling mode, we close
> the ext4 micro-transation (and therefore the time between write_begin
> and write_end calls needs to be minimal); otherwise this can block
> ext4 transactions.
> 
> In the case of a user page fault, the address operation's
> page_mkwrite() is called, and at that point we will allocate any
> blocks needed to back memory if necessary (in the case of delayed
> allocation, file system space has to get reserved).  The problem here
> for remote access is that page_mkwrite() can block, and it can
> potentially fail (e.g., with ENOSPC or ENOMEM).  This is also why we
> can't just add the page buffers and do the file system allocation in
> set_page_dirty(), since set_page_dirty() isn't allowed to block.

Oh yes. Believe me, I am well-versed in that story! But it's always
nice to hear it again, especially from a file system maintainer.
Each time there is something new, such as the micro-transaction detail.

> 
> One approach might be to make all of the pages writeable when
> pin_user_pages_remote() is called.  That that would mean that in the
> case of remote access via process_vm_writev or RDMA, all of the blocks
> will be allocated early.  But that's probably better since at that
> point the userspace code is in a position to receive the error when
> setting up the RDMA memory, and we don't want to be asking the file
> system to do block allocation when incoming data is coming in from
> Infiniband or iWARP.

So it sounds like the file lease idea, yes? I'm hoping that that
is still viable. I still think it's a good LSF/MM topic, to work through.

> 
>> I see that ext4_warning_inode() has rate limiting, but it doesn't look
>> like it's really intended for a per-page rate. It looks like it is
>> per-superblock (yes?), so I suspect one instance of this problem, with
>> many pages involved, could hit the limit.
>>
>> Often, WARN_ON_ONCE() is used with per-page assertions. That's not great
>> either, but it might be better here. OTOH, I have minimal experience
>> with ext4_warning_inode() and maybe it really is just fine with per-page
>> failure rates.
> 
> With the syzbot reproducer, we're not actually triggering the rate
> limiter, since the ext4 warning is only getting hit a bit more than
> once every 4 seconds.  And I'm guessing that in the real world, people
> aren't actually trying to do remote direct access to file-backed
> memory, at least not using ext4, since that's an invitation to a
> kernel crash, and we would have gotten user complaints.  If some user

Actually...I can confirm that real customers really are doing *exactly* 
that! Despite the kernel crashes--because the crashes don't always 
happen unless you have a large (supercomputer-sized) installation. And 
even then it is not always root-caused properly.

I guess that goes in the "weird but true" category. :)

thanks,
-- 
John Hubbard
NVIDIA

> actually tries to use process_vm_writev for realsies, as opposed to a
> random fuzzer or from a malicious program , we do want to warn them
> about the potential data loss, so I'd prefer to warn once for each
> inode --- but I'm not convinced that it's worth the effort.
> 
> Cheers,
> 
> 						- Ted