From: John Hubbard <jhubbard@nvidia.com> To: Theodore Ts'o <tytso@mit.edu> Cc: Lee Jones <lee.jones@linaro.org>, linux-ext4@vger.kernel.org, Christoph Hellwig <hch@lst.de>, Dave Chinner <dchinner@redhat.com>, Goldwyn Rodrigues <rgoldwyn@suse.com>, "Darrick J . Wong" <darrick.wong@oracle.com>, Bob Peterson <rpeterso@redhat.com>, Damien Le Moal <damien.lemoal@wdc.com>, Andreas Gruenbacher <agruenba@redhat.com>, Ritesh Harjani <riteshh@linux.ibm.com>, Greg Kroah-Hartman <gregkh@linuxfoundation.org>, Johannes Thumshirn <jth@kernel.org>, linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org, cluster-devel@redhat.com, linux-kernel@vger.kernel.org Subject: Re: [REPORT] kernel BUG at fs/ext4/inode.c:2620 - page_buffers() Date: Wed, 23 Feb 2022 16:44:07 -0800 [thread overview] Message-ID: <d75891d6-4c2e-57bc-f840-9d8d5449628a@nvidia.com> (raw) In-Reply-To: <YhbD1T7qhgnz4myM@mit.edu> On 2/23/22 3:31 PM, Theodore Ts'o wrote: > On Thu, Feb 17, 2022 at 10:33:34PM -0800, John Hubbard wrote: >> >> Just a small thing I'll say once, to get it out of my system. No action >> required here, I just want it understood: >> >> Before commit 803e4572d7c5 ("mm/process_vm_access: set FOLL_PIN via >> pin_user_pages_remote()"), you would have written that like this: >> >> "process_vm_writev() is dirtying pages without properly warning the file >> system in advance..." >> >> Because, there were many callers that were doing this: >> >> get_user_pages*() >> ...use the pages... >> >> for_each_page() { >> set_page_dirty*() >> put_page() >> } > > Sure, but that's not sufficient when modifying file-backed pages. We are in complete agreement. I was just trying to hint (too subtly) that the problem is ages old, and after doing all this work on the prerequisites to solving it (pin_user_pages() is a prerequisite), it kind of bothers me to see a commit with "work around bugs in mm/gup.c" in the title. That's all. :) > Previously, there was only two ways of modifying file-backed pages in > the page cache --- either using the write(2) system call, or when a > mmaped page is modified by the userspace. > > In the case of write(2), the file system gets notified before the page > cache is modified by a call to the address operation's write_begin() > call, and after the page cache is modified, the address operation's > write_end() call lets the file system know that the modification is > done. After the write is done, the 30 second writeback timer is > triggered, and in the case of ext4's data=journalling mode, we close > the ext4 micro-transation (and therefore the time between write_begin > and write_end calls needs to be minimal); otherwise this can block > ext4 transactions. > > In the case of a user page fault, the address operation's > page_mkwrite() is called, and at that point we will allocate any > blocks needed to back memory if necessary (in the case of delayed > allocation, file system space has to get reserved). The problem here > for remote access is that page_mkwrite() can block, and it can > potentially fail (e.g., with ENOSPC or ENOMEM). This is also why we > can't just add the page buffers and do the file system allocation in > set_page_dirty(), since set_page_dirty() isn't allowed to block. Oh yes. Believe me, I am well-versed in that story! But it's always nice to hear it again, especially from a file system maintainer. Each time there is something new, such as the micro-transaction detail. > > One approach might be to make all of the pages writeable when > pin_user_pages_remote() is called. That that would mean that in the > case of remote access via process_vm_writev or RDMA, all of the blocks > will be allocated early. But that's probably better since at that > point the userspace code is in a position to receive the error when > setting up the RDMA memory, and we don't want to be asking the file > system to do block allocation when incoming data is coming in from > Infiniband or iWARP. So it sounds like the file lease idea, yes? I'm hoping that that is still viable. I still think it's a good LSF/MM topic, to work through. > >> I see that ext4_warning_inode() has rate limiting, but it doesn't look >> like it's really intended for a per-page rate. It looks like it is >> per-superblock (yes?), so I suspect one instance of this problem, with >> many pages involved, could hit the limit. >> >> Often, WARN_ON_ONCE() is used with per-page assertions. That's not great >> either, but it might be better here. OTOH, I have minimal experience >> with ext4_warning_inode() and maybe it really is just fine with per-page >> failure rates. > > With the syzbot reproducer, we're not actually triggering the rate > limiter, since the ext4 warning is only getting hit a bit more than > once every 4 seconds. And I'm guessing that in the real world, people > aren't actually trying to do remote direct access to file-backed > memory, at least not using ext4, since that's an invitation to a > kernel crash, and we would have gotten user complaints. If some user Actually...I can confirm that real customers really are doing *exactly* that! Despite the kernel crashes--because the crashes don't always happen unless you have a large (supercomputer-sized) installation. And even then it is not always root-caused properly. I guess that goes in the "weird but true" category. :) thanks, -- John Hubbard NVIDIA > actually tries to use process_vm_writev for realsies, as opposed to a > random fuzzer or from a malicious program , we do want to warn them > about the potential data loss, so I'd prefer to warn once for each > inode --- but I'm not convinced that it's worth the effort. > > Cheers, > > - Ted
WARNING: multiple messages have this Message-ID (diff)
From: John Hubbard <jhubbard@nvidia.com> To: cluster-devel.redhat.com Subject: [Cluster-devel] [REPORT] kernel BUG at fs/ext4/inode.c:2620 - page_buffers() Date: Wed, 23 Feb 2022 16:44:07 -0800 [thread overview] Message-ID: <d75891d6-4c2e-57bc-f840-9d8d5449628a@nvidia.com> (raw) In-Reply-To: <YhbD1T7qhgnz4myM@mit.edu> On 2/23/22 3:31 PM, Theodore Ts'o wrote: > On Thu, Feb 17, 2022 at 10:33:34PM -0800, John Hubbard wrote: >> >> Just a small thing I'll say once, to get it out of my system. No action >> required here, I just want it understood: >> >> Before commit 803e4572d7c5 ("mm/process_vm_access: set FOLL_PIN via >> pin_user_pages_remote()"), you would have written that like this: >> >> "process_vm_writev() is dirtying pages without properly warning the file >> system in advance..." >> >> Because, there were many callers that were doing this: >> >> get_user_pages*() >> ...use the pages... >> >> for_each_page() { >> set_page_dirty*() >> put_page() >> } > > Sure, but that's not sufficient when modifying file-backed pages. We are in complete agreement. I was just trying to hint (too subtly) that the problem is ages old, and after doing all this work on the prerequisites to solving it (pin_user_pages() is a prerequisite), it kind of bothers me to see a commit with "work around bugs in mm/gup.c" in the title. That's all. :) > Previously, there was only two ways of modifying file-backed pages in > the page cache --- either using the write(2) system call, or when a > mmaped page is modified by the userspace. > > In the case of write(2), the file system gets notified before the page > cache is modified by a call to the address operation's write_begin() > call, and after the page cache is modified, the address operation's > write_end() call lets the file system know that the modification is > done. After the write is done, the 30 second writeback timer is > triggered, and in the case of ext4's data=journalling mode, we close > the ext4 micro-transation (and therefore the time between write_begin > and write_end calls needs to be minimal); otherwise this can block > ext4 transactions. > > In the case of a user page fault, the address operation's > page_mkwrite() is called, and at that point we will allocate any > blocks needed to back memory if necessary (in the case of delayed > allocation, file system space has to get reserved). The problem here > for remote access is that page_mkwrite() can block, and it can > potentially fail (e.g., with ENOSPC or ENOMEM). This is also why we > can't just add the page buffers and do the file system allocation in > set_page_dirty(), since set_page_dirty() isn't allowed to block. Oh yes. Believe me, I am well-versed in that story! But it's always nice to hear it again, especially from a file system maintainer. Each time there is something new, such as the micro-transaction detail. > > One approach might be to make all of the pages writeable when > pin_user_pages_remote() is called. That that would mean that in the > case of remote access via process_vm_writev or RDMA, all of the blocks > will be allocated early. But that's probably better since at that > point the userspace code is in a position to receive the error when > setting up the RDMA memory, and we don't want to be asking the file > system to do block allocation when incoming data is coming in from > Infiniband or iWARP. So it sounds like the file lease idea, yes? I'm hoping that that is still viable. I still think it's a good LSF/MM topic, to work through. > >> I see that ext4_warning_inode() has rate limiting, but it doesn't look >> like it's really intended for a per-page rate. It looks like it is >> per-superblock (yes?), so I suspect one instance of this problem, with >> many pages involved, could hit the limit. >> >> Often, WARN_ON_ONCE() is used with per-page assertions. That's not great >> either, but it might be better here. OTOH, I have minimal experience >> with ext4_warning_inode() and maybe it really is just fine with per-page >> failure rates. > > With the syzbot reproducer, we're not actually triggering the rate > limiter, since the ext4 warning is only getting hit a bit more than > once every 4 seconds. And I'm guessing that in the real world, people > aren't actually trying to do remote direct access to file-backed > memory, at least not using ext4, since that's an invitation to a > kernel crash, and we would have gotten user complaints. If some user Actually...I can confirm that real customers really are doing *exactly* that! Despite the kernel crashes--because the crashes don't always happen unless you have a large (supercomputer-sized) installation. And even then it is not always root-caused properly. I guess that goes in the "weird but true" category. :) thanks, -- John Hubbard NVIDIA > actually tries to use process_vm_writev for realsies, as opposed to a > random fuzzer or from a malicious program , we do want to warn them > about the potential data loss, so I'd prefer to warn once for each > inode --- but I'm not convinced that it's worth the effort. > > Cheers, > > - Ted
next prev parent reply other threads:[~2022-02-24 0:44 UTC|newest] Thread overview: 59+ messages / expand[flat|nested] mbox.gz Atom feed top 2022-02-16 16:31 [REPORT] kernel BUG at fs/ext4/inode.c:2620 - page_buffers() Lee Jones 2022-02-16 16:31 ` [Cluster-devel] " Lee Jones 2022-02-18 1:06 ` John Hubbard 2022-02-18 1:06 ` [Cluster-devel] " John Hubbard 2022-02-18 4:08 ` Theodore Ts'o 2022-02-18 4:08 ` [Cluster-devel] " Theodore Ts'o 2022-02-18 6:33 ` John Hubbard 2022-02-18 6:33 ` [Cluster-devel] " John Hubbard 2022-02-23 23:31 ` Theodore Ts'o 2022-02-23 23:31 ` [Cluster-devel] " Theodore Ts'o 2022-02-24 0:44 ` John Hubbard [this message] 2022-02-24 0:44 ` John Hubbard 2022-02-24 4:04 ` Theodore Ts'o 2022-02-24 4:04 ` [Cluster-devel] " Theodore Ts'o 2022-02-18 7:51 ` Greg Kroah-Hartman 2022-02-18 7:51 ` [Cluster-devel] " Greg Kroah-Hartman 2022-02-23 23:35 ` Theodore Ts'o 2022-02-23 23:35 ` [Cluster-devel] " Theodore Ts'o 2022-02-24 1:48 ` Dave Chinner 2022-02-24 1:48 ` [Cluster-devel] " Dave Chinner 2022-02-24 3:50 ` Theodore Ts'o 2022-02-24 3:50 ` [Cluster-devel] " Theodore Ts'o 2022-02-24 10:29 ` Dave Chinner 2022-02-24 10:29 ` [Cluster-devel] " Dave Chinner 2022-02-18 2:54 ` Theodore Ts'o 2022-02-18 2:54 ` [Cluster-devel] " Theodore Ts'o 2022-02-18 4:24 ` Matthew Wilcox 2022-02-18 4:24 ` [Cluster-devel] " Matthew Wilcox 2022-02-18 6:03 ` Theodore Ts'o 2022-02-18 6:03 ` [Cluster-devel] " Theodore Ts'o 2022-02-25 19:24 ` [PATCH -v2] ext4: don't BUG if kernel subsystems dirty pages without asking ext4 first Theodore Ts'o 2022-02-25 19:24 ` [Cluster-devel] " Theodore Ts'o 2022-02-25 20:51 ` Eric Biggers 2022-02-25 20:51 ` [Cluster-devel] " Eric Biggers 2022-02-25 21:08 ` Theodore Ts'o 2022-02-25 21:08 ` [Cluster-devel] " Theodore Ts'o 2022-02-25 21:23 ` [PATCH -v3] " Theodore Ts'o 2022-02-25 21:23 ` [Cluster-devel] " Theodore Ts'o 2022-02-25 21:33 ` John Hubbard 2022-02-25 21:33 ` [Cluster-devel] " John Hubbard 2022-02-25 23:21 ` Theodore Ts'o 2022-02-25 23:21 ` [Cluster-devel] " Theodore Ts'o 2022-02-26 0:18 ` Hillf Danton 2022-02-26 0:41 ` John Hubbard 2022-02-26 0:41 ` [Cluster-devel] " John Hubbard 2022-02-26 1:40 ` Theodore Ts'o 2022-02-26 1:40 ` [Cluster-devel] " Theodore Ts'o 2022-02-26 2:00 ` Theodore Ts'o 2022-02-26 2:00 ` [Cluster-devel] " Theodore Ts'o 2022-02-26 2:55 ` John Hubbard 2022-02-26 2:55 ` [Cluster-devel] " John Hubbard 2022-03-03 4:26 ` [PATCH -v4] " Theodore Ts'o 2022-03-03 4:26 ` [Cluster-devel] " Theodore Ts'o 2022-03-03 8:21 ` Christoph Hellwig 2022-03-03 8:21 ` [Cluster-devel] " Christoph Hellwig 2022-03-03 9:21 ` Lee Jones 2022-03-03 9:21 ` [Cluster-devel] " Lee Jones 2022-03-03 14:38 ` [PATCH -v5] ext4: don't BUG if someone " Theodore Ts'o 2022-03-03 14:38 ` [Cluster-devel] " Theodore Ts'o
Reply instructions: You may reply publicly to this message via plain-text email using any one of the following methods: * Save the following mbox file, import it into your mail client, and reply-to-all from there: mbox Avoid top-posting and favor interleaved quoting: https://en.wikipedia.org/wiki/Posting_style#Interleaved_style * Reply using the --to, --cc, and --in-reply-to switches of git-send-email(1): git send-email \ --in-reply-to=d75891d6-4c2e-57bc-f840-9d8d5449628a@nvidia.com \ --to=jhubbard@nvidia.com \ --cc=agruenba@redhat.com \ --cc=cluster-devel@redhat.com \ --cc=damien.lemoal@wdc.com \ --cc=darrick.wong@oracle.com \ --cc=dchinner@redhat.com \ --cc=gregkh@linuxfoundation.org \ --cc=hch@lst.de \ --cc=jth@kernel.org \ --cc=lee.jones@linaro.org \ --cc=linux-ext4@vger.kernel.org \ --cc=linux-fsdevel@vger.kernel.org \ --cc=linux-kernel@vger.kernel.org \ --cc=linux-xfs@vger.kernel.org \ --cc=rgoldwyn@suse.com \ --cc=riteshh@linux.ibm.com \ --cc=rpeterso@redhat.com \ --cc=tytso@mit.edu \ /path/to/YOUR_REPLY https://kernel.org/pub/software/scm/git/docs/git-send-email.html * If your mail client supports setting the In-Reply-To header via mailto: links, try the mailto: linkBe sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.