From: Jan Kara <jack@suse.cz> To: "Yan, Zheng" <ukernel@gmail.com> Cc: Jan Kara <jack@suse.cz>, "Yan, Zheng" <zyan@redhat.com>, Linux Kernel Mailing List <linux-kernel@vger.kernel.org>, Linux FS-devel Mailing List <linux-fsdevel@vger.kernel.org>, ceph-devel <ceph-devel@vger.kernel.org>, linux-ext4@vger.kernel.org, linux-btrfs <linux-btrfs@vger.kernel.org>, linux-mm@kvack.org, Andrew Morton <akpm@linux-foundation.org>, Al Viro <viro@zeniv.linux.org.uk>, Jeff Layton <jlayton@redhat.com>, stable@vger.kernel.org Subject: Re: [PATCH] mm: save/restore current->journal_info in handle_mm_fault Date: Thu, 14 Dec 2017 17:53:14 +0100 [thread overview] Message-ID: <20171214165314.GB1930@quack2.suse.cz> (raw) In-Reply-To: <CAAM7YA=ThWbBpOe1wgeYjGt3ogr9kT6uy3UpqSn94XqbhjOHJw@mail.gmail.com> On Thu 14-12-17 22:30:26, Yan, Zheng wrote: > On Thu, Dec 14, 2017 at 9:43 PM, Jan Kara <jack@suse.cz> wrote: > > On Thu 14-12-17 18:55:27, Yan, Zheng wrote: > >> We recently got an Oops report: > >> > >> BUG: unable to handle kernel NULL pointer dereference at (null) > >> IP: jbd2__journal_start+0x38/0x1a2 > >> [...] > >> Call Trace: > >> ext4_page_mkwrite+0x307/0x52b > >> _ext4_get_block+0xd8/0xd8 > >> do_page_mkwrite+0x6e/0xd8 > >> handle_mm_fault+0x686/0xf9b > >> mntput_no_expire+0x1f/0x21e > >> __do_page_fault+0x21d/0x465 > >> dput+0x4a/0x2f7 > >> page_fault+0x22/0x30 > >> copy_user_generic_string+0x2c/0x40 > >> copy_page_to_iter+0x8c/0x2b8 > >> generic_file_read_iter+0x26e/0x845 > >> timerqueue_del+0x31/0x90 > >> ceph_read_iter+0x697/0xa33 [ceph] > >> hrtimer_cancel+0x23/0x41 > >> futex_wait+0x1c8/0x24d > >> get_futex_key+0x32c/0x39a > >> __vfs_read+0xe0/0x130 > >> vfs_read.part.1+0x6c/0x123 > >> handle_mm_fault+0x831/0xf9b > >> __fget+0x7e/0xbf > >> SyS_read+0x4d/0xb5 > >> > >> ceph_read_iter() uses current->journal_info to pass context info to > >> ceph_readpages(). Because ceph_readpages() needs to know if its caller > >> has already gotten capability of using page cache (distinguish read > >> from readahead/fadvise). ceph_read_iter() set current->journal_info, > >> then calls generic_file_read_iter(). > >> > >> In above Oops, page fault happened when copying data to userspace. > >> Page fault handler called ext4_page_mkwrite(). Ext4 code read > >> current->journal_info and assumed it is journal handle. > >> > >> I checked other filesystems, btrfs probably suffers similar problem > >> for its readpage. (page fault happens when write() copies data from > >> userspace memory and the memory is mapped to a file in btrfs. > >> verify_parent_transid() can be called during readpage) > >> > >> Cc: stable@vger.kernel.org > >> Signed-off-by: "Yan, Zheng" <zyan@redhat.com> > > > > I agree with the analysis but the patch is too ugly too live. Ceph just > > should not be abusing current->journal_info for passing information between > > two random functions or when it does a hackery like this, it should just > > make sure the pieces hold together. Poluting generic code to accommodate > > this hack in Ceph is not acceptable. Also bear in mind there are likely > > other code paths (e.g. memory reclaim) which could recurse into another > > filesystem confusing it with non-NULL current->journal_info in the same > > way. > > But ... > > some filesystem set journal_info in its write_begin(), then clear it > in write_end(). If buffer for write is mapped to another filesystem, > current->journal can leak to the later filesystem's page_readpage(). > The later filesystem may read current->journal and treat it as its own > journal handle. Besides, most filesystem's vm fault handle is > filemap_fault(), filemap also may tigger memory reclaim. Did you really observe this? Because write path uses iov_iter_copy_from_user_atomic() which does not allow page faults to happen. All page faulting happens in iov_iter_fault_in_readable() before ->write_begin() is called. And the recursion problems like you mention above are exactly the reason why things are done in a more complicated way like this. > > > > In this particular case I'm not sure why does ceph pass 'filp' into > > readpage() / readpages() handler when it already gets that pointer as part > > of arguments... > > It actually a flag which tells ceph_readpages() if its caller is > ceph_read_iter or readahead/fadvise/madvise. because when there are > multiple clients read/write a file a the same time, page cache should > be disabled. I'm not sure I understand the reasoning properly but from what you say above it rather seems the 'hint' should be stored in the inode (or possibly struct file)? Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR
WARNING: multiple messages have this Message-ID (diff)
From: Jan Kara <jack@suse.cz> To: "Yan, Zheng" <ukernel@gmail.com> Cc: Jan Kara <jack@suse.cz>, "Yan, Zheng" <zyan@redhat.com>, Linux Kernel Mailing List <linux-kernel@vger.kernel.org>, Linux FS-devel Mailing List <linux-fsdevel@vger.kernel.org>, ceph-devel <ceph-devel@vger.kernel.org>, linux-ext4@vger.kernel.org, linux-btrfs <linux-btrfs@vger.kernel.org>, linux-mm@kvack.org, Andrew Morton <akpm@linux-foundation.org>, Al Viro <viro@zeniv.linux.org.uk>, Jeff Layton <jlayton@redhat.com>, stable@vger.kernel.org Subject: Re: [PATCH] mm: save/restore current->journal_info in handle_mm_fault Date: Thu, 14 Dec 2017 17:53:14 +0100 [thread overview] Message-ID: <20171214165314.GB1930@quack2.suse.cz> (raw) In-Reply-To: <CAAM7YA=ThWbBpOe1wgeYjGt3ogr9kT6uy3UpqSn94XqbhjOHJw@mail.gmail.com> On Thu 14-12-17 22:30:26, Yan, Zheng wrote: > On Thu, Dec 14, 2017 at 9:43 PM, Jan Kara <jack@suse.cz> wrote: > > On Thu 14-12-17 18:55:27, Yan, Zheng wrote: > >> We recently got an Oops report: > >> > >> BUG: unable to handle kernel NULL pointer dereference at (null) > >> IP: jbd2__journal_start+0x38/0x1a2 > >> [...] > >> Call Trace: > >> ext4_page_mkwrite+0x307/0x52b > >> _ext4_get_block+0xd8/0xd8 > >> do_page_mkwrite+0x6e/0xd8 > >> handle_mm_fault+0x686/0xf9b > >> mntput_no_expire+0x1f/0x21e > >> __do_page_fault+0x21d/0x465 > >> dput+0x4a/0x2f7 > >> page_fault+0x22/0x30 > >> copy_user_generic_string+0x2c/0x40 > >> copy_page_to_iter+0x8c/0x2b8 > >> generic_file_read_iter+0x26e/0x845 > >> timerqueue_del+0x31/0x90 > >> ceph_read_iter+0x697/0xa33 [ceph] > >> hrtimer_cancel+0x23/0x41 > >> futex_wait+0x1c8/0x24d > >> get_futex_key+0x32c/0x39a > >> __vfs_read+0xe0/0x130 > >> vfs_read.part.1+0x6c/0x123 > >> handle_mm_fault+0x831/0xf9b > >> __fget+0x7e/0xbf > >> SyS_read+0x4d/0xb5 > >> > >> ceph_read_iter() uses current->journal_info to pass context info to > >> ceph_readpages(). Because ceph_readpages() needs to know if its caller > >> has already gotten capability of using page cache (distinguish read > >> from readahead/fadvise). ceph_read_iter() set current->journal_info, > >> then calls generic_file_read_iter(). > >> > >> In above Oops, page fault happened when copying data to userspace. > >> Page fault handler called ext4_page_mkwrite(). Ext4 code read > >> current->journal_info and assumed it is journal handle. > >> > >> I checked other filesystems, btrfs probably suffers similar problem > >> for its readpage. (page fault happens when write() copies data from > >> userspace memory and the memory is mapped to a file in btrfs. > >> verify_parent_transid() can be called during readpage) > >> > >> Cc: stable@vger.kernel.org > >> Signed-off-by: "Yan, Zheng" <zyan@redhat.com> > > > > I agree with the analysis but the patch is too ugly too live. Ceph just > > should not be abusing current->journal_info for passing information between > > two random functions or when it does a hackery like this, it should just > > make sure the pieces hold together. Poluting generic code to accommodate > > this hack in Ceph is not acceptable. Also bear in mind there are likely > > other code paths (e.g. memory reclaim) which could recurse into another > > filesystem confusing it with non-NULL current->journal_info in the same > > way. > > But ... > > some filesystem set journal_info in its write_begin(), then clear it > in write_end(). If buffer for write is mapped to another filesystem, > current->journal can leak to the later filesystem's page_readpage(). > The later filesystem may read current->journal and treat it as its own > journal handle. Besides, most filesystem's vm fault handle is > filemap_fault(), filemap also may tigger memory reclaim. Did you really observe this? Because write path uses iov_iter_copy_from_user_atomic() which does not allow page faults to happen. All page faulting happens in iov_iter_fault_in_readable() before ->write_begin() is called. And the recursion problems like you mention above are exactly the reason why things are done in a more complicated way like this. > > > > In this particular case I'm not sure why does ceph pass 'filp' into > > readpage() / readpages() handler when it already gets that pointer as part > > of arguments... > > It actually a flag which tells ceph_readpages() if its caller is > ceph_read_iter or readahead/fadvise/madvise. because when there are > multiple clients read/write a file a the same time, page cache should > be disabled. I'm not sure I understand the reasoning properly but from what you say above it rather seems the 'hint' should be stored in the inode (or possibly struct file)? Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
next prev parent reply other threads:[~2017-12-14 16:53 UTC|newest] Thread overview: 15+ messages / expand[flat|nested] mbox.gz Atom feed top 2017-12-14 10:55 [PATCH] mm: save/restore current->journal_info in handle_mm_fault Yan, Zheng 2017-12-14 10:55 ` Yan, Zheng 2017-12-14 13:30 ` Michal Hocko 2017-12-14 13:30 ` Michal Hocko 2017-12-14 13:43 ` Jan Kara 2017-12-14 13:43 ` Jan Kara 2017-12-14 14:30 ` Yan, Zheng 2017-12-14 14:30 ` Yan, Zheng 2017-12-14 16:53 ` Jan Kara [this message] 2017-12-14 16:53 ` Jan Kara 2017-12-15 1:17 ` Yan, Zheng 2017-12-15 1:17 ` Yan, Zheng 2017-12-15 10:33 ` Jan Kara 2017-12-15 10:33 ` Jan Kara 2017-12-14 20:48 ` Andreas Dilger
Reply instructions: You may reply publicly to this message via plain-text email using any one of the following methods: * Save the following mbox file, import it into your mail client, and reply-to-all from there: mbox Avoid top-posting and favor interleaved quoting: https://en.wikipedia.org/wiki/Posting_style#Interleaved_style * Reply using the --to, --cc, and --in-reply-to switches of git-send-email(1): git send-email \ --in-reply-to=20171214165314.GB1930@quack2.suse.cz \ --to=jack@suse.cz \ --cc=akpm@linux-foundation.org \ --cc=ceph-devel@vger.kernel.org \ --cc=jlayton@redhat.com \ --cc=linux-btrfs@vger.kernel.org \ --cc=linux-ext4@vger.kernel.org \ --cc=linux-fsdevel@vger.kernel.org \ --cc=linux-kernel@vger.kernel.org \ --cc=linux-mm@kvack.org \ --cc=stable@vger.kernel.org \ --cc=ukernel@gmail.com \ --cc=viro@zeniv.linux.org.uk \ --cc=zyan@redhat.com \ /path/to/YOUR_REPLY https://kernel.org/pub/software/scm/git/docs/git-send-email.html * If your mail client supports setting the In-Reply-To header via mailto: links, try the mailto: linkBe sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.