From: John Hubbard <jhubbard@nvidia.com> To: Theodore Ts'o <tytso@mit.edu> Cc: Lee Jones <lee.jones@linaro.org>, linux-ext4@vger.kernel.org, Christoph Hellwig <hch@lst.de>, Dave Chinner <dchinner@redhat.com>, Goldwyn Rodrigues <rgoldwyn@suse.com>, "Darrick J . Wong" <darrick.wong@oracle.com>, Bob Peterson <rpeterso@redhat.com>, Damien Le Moal <damien.lemoal@wdc.com>, Andreas Gruenbacher <agruenba@redhat.com>, Ritesh Harjani <riteshh@linux.ibm.com>, Greg Kroah-Hartman <gregkh@linuxfoundation.org>, Johannes Thumshirn <jth@kernel.org>, linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org, cluster-devel@redhat.com, linux-kernel@vger.kernel.org Subject: Re: [REPORT] kernel BUG at fs/ext4/inode.c:2620 - page_buffers() Date: Thu, 17 Feb 2022 22:33:34 -0800 [thread overview] Message-ID: <7bd88058-2a9a-92a6-2280-43c805b516c3@nvidia.com> (raw) In-Reply-To: <Yg8bxiz02WBGf6qO@mit.edu> On 2/17/22 20:08, Theodore Ts'o wrote: > On Thu, Feb 17, 2022 at 05:06:45PM -0800, John Hubbard wrote: >> Yes. And looking at the pair of backtraces below, this looks very much >> like another aspect of the "get_user_pages problem" [1], originally >> described in Jan Kara's 2018 email [2]. > > Hmm... I just posted my analysis, which tracks with yours; but I had > forgotten about Jan's 2018 e-mail on the matter. > >> I'm getting close to posting an RFC for the direct IO conversion to >> FOLL_PIN, but even after that, various parts of the kernel (reclaim, >> filesystems/block layer) still need to be changed so as to use >> page_maybe_dma_pinned() to help avoid this problem. There's a bit >> more than that, actually. > > The challenge is that fixing this "the right away" is probably not > something we can backport into an LTS kernel, whether it's 5.15 or > 5.10... or 4.19. > > The only thing which can probably survive getting backported is > something like this. It won't make the right thing happen if someone > is trying to RDMA or call process_vm_writev() into a file-backed > memory region --- but I'm not sure I care. Certainly if the goal is > to make Android kernels, I'm pretty sure they are't either using RDMA, > and I suspect they are probably masking out the process_vm_writev(2) > system call (at least, for Android O and newer). So if the goal is to > just to close some Syzbot bugs, what do folks think about this? > > - Ted Hi Ted! This seems reasonable...-ish. Although this could turn into a pretty grand game of whack-a-mole, one filesystem at a time. :) > > commit 7711b1fda6f7f04274fa1cba6f092410262b0296 > Author: Theodore Ts'o <tytso@mit.edu> > Date: Thu Feb 17 22:54:03 2022 -0500 > > ext4: work around bugs in mm/gup.c that can cause ext4 to BUG() > > [un]pin_user_pages_remote is dirtying pages without properly warning > the file system in advance (or faulting in the file data if the page Just a small thing I'll say once, to get it out of my system. No action required here, I just want it understood: Before commit 803e4572d7c5 ("mm/process_vm_access: set FOLL_PIN via pin_user_pages_remote()"), you would have written that like this: "process_vm_writev() is dirtying pages without properly warning the file system in advance..." Because, there were many callers that were doing this: get_user_pages*() ...use the pages... for_each_page() { set_page_dirty*() put_page() } anyway, moving on smartly... > is not yet in the page cache). This was noted by Jan Kara in 2018[1] > and more recently has resulted in bug reports by Syzbot in various > Android kernels[2]. > > Fixing this for real is non-trivial, and will never be backportable > into stable kernels. So this is a simple workaround that stops the > kernel from BUG()'ing. The changed pages will not be properly written > back, but given that the current gup code is missing the "read" in > "read-modify-write", the dirty page in the page cache might contain > corrupted data anyway. > > [1] https://www.spinics.net/lists/linux-mm/msg142700.html (Sorry my earlier response mangled this link. I've manually fixed it here, and am working with our IT to get the root cause addressed.) > [2] https://lore.kernel.org/r/Yg0m6IjcNmfaSokM@google.com > > Reported-by: syzbot+d59332e2db681cf18f0318a06e994ebbb529a8db@syzkaller.appspotmail.com > Reported-by: Lee Jones <lee.jones@linaro.org> > Signed-off-by: Theodore Ts'o <tytso@mit.edu> > > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c > index 01c9e4f743ba..3b2f336a90d1 100644 > --- a/fs/ext4/inode.c > +++ b/fs/ext4/inode.c > @@ -1993,6 +1993,15 @@ static int ext4_writepage(struct page *page, > else > len = PAGE_SIZE; > > + /* Should never happen but for buggy gup code */ > + if (!page_has_buffers(page)) { > + ext4_warning_inode(inode, > + "page %lu does not have buffers attached", page->index); I see that ext4_warning_inode() has rate limiting, but it doesn't look like it's really intended for a per-page rate. It looks like it is per-superblock (yes?), so I suspect one instance of this problem, with many pages involved, could hit the limit. Often, WARN_ON_ONCE() is used with per-page assertions. That's not great either, but it might be better here. OTOH, I have minimal experience with ext4_warning_inode() and maybe it really is just fine with per-page failure rates. thanks, -- John Hubbard NVIDIA > + ClearPageDirty(page); > + unlock_page(page); > + return 0; > + } > + > page_bufs = page_buffers(page); > /* > * We cannot do block allocation or other extent handling in this > @@ -2594,6 +2603,14 @@ static int mpage_prepare_extent_to_map(struct mpage_da_data *mpd) > wait_on_page_writeback(page); > BUG_ON(PageWriteback(page)); > > + /* Should never happen but for buggy gup code */ > + if (!page_has_buffers(page)) { > + ext4_warning_inode(mpd->inode, "page %lu does not have buffers attached", page->index); > + ClearPageDirty(page); > + unlock_page(page); > + continue; > + } > + > if (mpd->map.m_len == 0) > mpd->first_page = page->index; > mpd->next_page = page->index + 1;
WARNING: multiple messages have this Message-ID (diff)
From: John Hubbard <jhubbard@nvidia.com> To: cluster-devel.redhat.com Subject: [Cluster-devel] [REPORT] kernel BUG at fs/ext4/inode.c:2620 - page_buffers() Date: Thu, 17 Feb 2022 22:33:34 -0800 [thread overview] Message-ID: <7bd88058-2a9a-92a6-2280-43c805b516c3@nvidia.com> (raw) In-Reply-To: <Yg8bxiz02WBGf6qO@mit.edu> On 2/17/22 20:08, Theodore Ts'o wrote: > On Thu, Feb 17, 2022 at 05:06:45PM -0800, John Hubbard wrote: >> Yes. And looking at the pair of backtraces below, this looks very much >> like another aspect of the "get_user_pages problem" [1], originally >> described in Jan Kara's 2018 email [2]. > > Hmm... I just posted my analysis, which tracks with yours; but I had > forgotten about Jan's 2018 e-mail on the matter. > >> I'm getting close to posting an RFC for the direct IO conversion to >> FOLL_PIN, but even after that, various parts of the kernel (reclaim, >> filesystems/block layer) still need to be changed so as to use >> page_maybe_dma_pinned() to help avoid this problem. There's a bit >> more than that, actually. > > The challenge is that fixing this "the right away" is probably not > something we can backport into an LTS kernel, whether it's 5.15 or > 5.10... or 4.19. > > The only thing which can probably survive getting backported is > something like this. It won't make the right thing happen if someone > is trying to RDMA or call process_vm_writev() into a file-backed > memory region --- but I'm not sure I care. Certainly if the goal is > to make Android kernels, I'm pretty sure they are't either using RDMA, > and I suspect they are probably masking out the process_vm_writev(2) > system call (at least, for Android O and newer). So if the goal is to > just to close some Syzbot bugs, what do folks think about this? > > - Ted Hi Ted! This seems reasonable...-ish. Although this could turn into a pretty grand game of whack-a-mole, one filesystem at a time. :) > > commit 7711b1fda6f7f04274fa1cba6f092410262b0296 > Author: Theodore Ts'o <tytso@mit.edu> > Date: Thu Feb 17 22:54:03 2022 -0500 > > ext4: work around bugs in mm/gup.c that can cause ext4 to BUG() > > [un]pin_user_pages_remote is dirtying pages without properly warning > the file system in advance (or faulting in the file data if the page Just a small thing I'll say once, to get it out of my system. No action required here, I just want it understood: Before commit 803e4572d7c5 ("mm/process_vm_access: set FOLL_PIN via pin_user_pages_remote()"), you would have written that like this: "process_vm_writev() is dirtying pages without properly warning the file system in advance..." Because, there were many callers that were doing this: get_user_pages*() ...use the pages... for_each_page() { set_page_dirty*() put_page() } anyway, moving on smartly... > is not yet in the page cache). This was noted by Jan Kara in 2018[1] > and more recently has resulted in bug reports by Syzbot in various > Android kernels[2]. > > Fixing this for real is non-trivial, and will never be backportable > into stable kernels. So this is a simple workaround that stops the > kernel from BUG()'ing. The changed pages will not be properly written > back, but given that the current gup code is missing the "read" in > "read-modify-write", the dirty page in the page cache might contain > corrupted data anyway. > > [1] https://www.spinics.net/lists/linux-mm/msg142700.html (Sorry my earlier response mangled this link. I've manually fixed it here, and am working with our IT to get the root cause addressed.) > [2] https://lore.kernel.org/r/Yg0m6IjcNmfaSokM at google.com > > Reported-by: syzbot+d59332e2db681cf18f0318a06e994ebbb529a8db at syzkaller.appspotmail.com > Reported-by: Lee Jones <lee.jones@linaro.org> > Signed-off-by: Theodore Ts'o <tytso@mit.edu> > > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c > index 01c9e4f743ba..3b2f336a90d1 100644 > --- a/fs/ext4/inode.c > +++ b/fs/ext4/inode.c > @@ -1993,6 +1993,15 @@ static int ext4_writepage(struct page *page, > else > len = PAGE_SIZE; > > + /* Should never happen but for buggy gup code */ > + if (!page_has_buffers(page)) { > + ext4_warning_inode(inode, > + "page %lu does not have buffers attached", page->index); I see that ext4_warning_inode() has rate limiting, but it doesn't look like it's really intended for a per-page rate. It looks like it is per-superblock (yes?), so I suspect one instance of this problem, with many pages involved, could hit the limit. Often, WARN_ON_ONCE() is used with per-page assertions. That's not great either, but it might be better here. OTOH, I have minimal experience with ext4_warning_inode() and maybe it really is just fine with per-page failure rates. thanks, -- John Hubbard NVIDIA > + ClearPageDirty(page); > + unlock_page(page); > + return 0; > + } > + > page_bufs = page_buffers(page); > /* > * We cannot do block allocation or other extent handling in this > @@ -2594,6 +2603,14 @@ static int mpage_prepare_extent_to_map(struct mpage_da_data *mpd) > wait_on_page_writeback(page); > BUG_ON(PageWriteback(page)); > > + /* Should never happen but for buggy gup code */ > + if (!page_has_buffers(page)) { > + ext4_warning_inode(mpd->inode, "page %lu does not have buffers attached", page->index); > + ClearPageDirty(page); > + unlock_page(page); > + continue; > + } > + > if (mpd->map.m_len == 0) > mpd->first_page = page->index; > mpd->next_page = page->index + 1;
next prev parent reply other threads:[~2022-02-18 6:33 UTC|newest] Thread overview: 59+ messages / expand[flat|nested] mbox.gz Atom feed top 2022-02-16 16:31 [REPORT] kernel BUG at fs/ext4/inode.c:2620 - page_buffers() Lee Jones 2022-02-16 16:31 ` [Cluster-devel] " Lee Jones 2022-02-18 1:06 ` John Hubbard 2022-02-18 1:06 ` [Cluster-devel] " John Hubbard 2022-02-18 4:08 ` Theodore Ts'o 2022-02-18 4:08 ` [Cluster-devel] " Theodore Ts'o 2022-02-18 6:33 ` John Hubbard [this message] 2022-02-18 6:33 ` John Hubbard 2022-02-23 23:31 ` Theodore Ts'o 2022-02-23 23:31 ` [Cluster-devel] " Theodore Ts'o 2022-02-24 0:44 ` John Hubbard 2022-02-24 0:44 ` [Cluster-devel] " John Hubbard 2022-02-24 4:04 ` Theodore Ts'o 2022-02-24 4:04 ` [Cluster-devel] " Theodore Ts'o 2022-02-18 7:51 ` Greg Kroah-Hartman 2022-02-18 7:51 ` [Cluster-devel] " Greg Kroah-Hartman 2022-02-23 23:35 ` Theodore Ts'o 2022-02-23 23:35 ` [Cluster-devel] " Theodore Ts'o 2022-02-24 1:48 ` Dave Chinner 2022-02-24 1:48 ` [Cluster-devel] " Dave Chinner 2022-02-24 3:50 ` Theodore Ts'o 2022-02-24 3:50 ` [Cluster-devel] " Theodore Ts'o 2022-02-24 10:29 ` Dave Chinner 2022-02-24 10:29 ` [Cluster-devel] " Dave Chinner 2022-02-18 2:54 ` Theodore Ts'o 2022-02-18 2:54 ` [Cluster-devel] " Theodore Ts'o 2022-02-18 4:24 ` Matthew Wilcox 2022-02-18 4:24 ` [Cluster-devel] " Matthew Wilcox 2022-02-18 6:03 ` Theodore Ts'o 2022-02-18 6:03 ` [Cluster-devel] " Theodore Ts'o 2022-02-25 19:24 ` [PATCH -v2] ext4: don't BUG if kernel subsystems dirty pages without asking ext4 first Theodore Ts'o 2022-02-25 19:24 ` [Cluster-devel] " Theodore Ts'o 2022-02-25 20:51 ` Eric Biggers 2022-02-25 20:51 ` [Cluster-devel] " Eric Biggers 2022-02-25 21:08 ` Theodore Ts'o 2022-02-25 21:08 ` [Cluster-devel] " Theodore Ts'o 2022-02-25 21:23 ` [PATCH -v3] " Theodore Ts'o 2022-02-25 21:23 ` [Cluster-devel] " Theodore Ts'o 2022-02-25 21:33 ` John Hubbard 2022-02-25 21:33 ` [Cluster-devel] " John Hubbard 2022-02-25 23:21 ` Theodore Ts'o 2022-02-25 23:21 ` [Cluster-devel] " Theodore Ts'o 2022-02-26 0:18 ` Hillf Danton 2022-02-26 0:41 ` John Hubbard 2022-02-26 0:41 ` [Cluster-devel] " John Hubbard 2022-02-26 1:40 ` Theodore Ts'o 2022-02-26 1:40 ` [Cluster-devel] " Theodore Ts'o 2022-02-26 2:00 ` Theodore Ts'o 2022-02-26 2:00 ` [Cluster-devel] " Theodore Ts'o 2022-02-26 2:55 ` John Hubbard 2022-02-26 2:55 ` [Cluster-devel] " John Hubbard 2022-03-03 4:26 ` [PATCH -v4] " Theodore Ts'o 2022-03-03 4:26 ` [Cluster-devel] " Theodore Ts'o 2022-03-03 8:21 ` Christoph Hellwig 2022-03-03 8:21 ` [Cluster-devel] " Christoph Hellwig 2022-03-03 9:21 ` Lee Jones 2022-03-03 9:21 ` [Cluster-devel] " Lee Jones 2022-03-03 14:38 ` [PATCH -v5] ext4: don't BUG if someone " Theodore Ts'o 2022-03-03 14:38 ` [Cluster-devel] " Theodore Ts'o
Reply instructions: You may reply publicly to this message via plain-text email using any one of the following methods: * Save the following mbox file, import it into your mail client, and reply-to-all from there: mbox Avoid top-posting and favor interleaved quoting: https://en.wikipedia.org/wiki/Posting_style#Interleaved_style * Reply using the --to, --cc, and --in-reply-to switches of git-send-email(1): git send-email \ --in-reply-to=7bd88058-2a9a-92a6-2280-43c805b516c3@nvidia.com \ --to=jhubbard@nvidia.com \ --cc=agruenba@redhat.com \ --cc=cluster-devel@redhat.com \ --cc=damien.lemoal@wdc.com \ --cc=darrick.wong@oracle.com \ --cc=dchinner@redhat.com \ --cc=gregkh@linuxfoundation.org \ --cc=hch@lst.de \ --cc=jth@kernel.org \ --cc=lee.jones@linaro.org \ --cc=linux-ext4@vger.kernel.org \ --cc=linux-fsdevel@vger.kernel.org \ --cc=linux-kernel@vger.kernel.org \ --cc=linux-xfs@vger.kernel.org \ --cc=rgoldwyn@suse.com \ --cc=riteshh@linux.ibm.com \ --cc=rpeterso@redhat.com \ --cc=tytso@mit.edu \ /path/to/YOUR_REPLY https://kernel.org/pub/software/scm/git/docs/git-send-email.html * If your mail client supports setting the In-Reply-To header via mailto: links, try the mailto: linkBe sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.