From: Hugh Dickins <hughd@google.com>
To: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
Hugh Dickins <hughd@google.com>,
Matthew Wilcox <willy@infradead.org>,
"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
Will Deacon <will@kernel.org>,
Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
Linux-MM <linux-mm@kvack.org>,
Linux ARM <linux-arm-kernel@lists.infradead.org>,
Catalin Marinas <catalin.marinas@arm.com>,
Jan Kara <jack@suse.cz>, Minchan Kim <minchan@kernel.org>,
Andrew Morton <akpm@linux-foundation.org>,
Vinayak Menon <vinmenon@codeaurora.org>,
Android Kernel Team <kernel-team@android.com>
Subject: Re: [PATCH 1/2] mm: Allow architectures to request 'old' entries when prefaulting
Date: Mon, 28 Dec 2020 20:35:06 -0800 (PST) [thread overview]
Message-ID: <alpine.LSU.2.11.2012281925420.1028@eggly.anvils> (raw)
In-Reply-To: <20201228221237.6nu75kgxq7ikxn2a@box>
Got it at last, sorry it's taken so long.
On Tue, 29 Dec 2020, Kirill A. Shutemov wrote:
> On Tue, Dec 29, 2020 at 01:05:48AM +0300, Kirill A. Shutemov wrote:
> > On Mon, Dec 28, 2020 at 10:47:36AM -0800, Linus Torvalds wrote:
> > > On Mon, Dec 28, 2020 at 4:53 AM Kirill A. Shutemov <kirill@shutemov.name> wrote:
> > > >
> > > > So far I only found one more pin leak and always-true check. I don't see
> > > > how can it lead to crash or corruption. Keep looking.
Those mods look good in themselves, but, as you expected,
made no difference to the corruption I was seeing.
> > >
> > > Well, I noticed that the nommu.c version of filemap_map_pages() needs
> > > fixing, but that's obviously not the case Hugh sees.
> > >
> > > No,m I think the problem is the
> > >
> > > pte_unmap_unlock(vmf->pte, vmf->ptl);
> > >
> > > at the end of filemap_map_pages().
> > >
> > > Why?
> > >
> > > Because we've been updating vmf->pte as we go along:
> > >
> > > vmf->pte += xas.xa_index - last_pgoff;
> > >
> > > and I think that by the time we get to that "pte_unmap_unlock()",
> > > vmf->pte potentially points to past the edge of the page directory.
> >
> > Well, if it's true we have bigger problem: we set up an pte entry without
> > relevant PTL.
> >
> > But I *think* we should be fine here: do_fault_around() limits start_pgoff
> > and end_pgoff to stay within the page table.
Yes, Linus's patch had made no difference,
the map_pages loop is safe in that respect.
> >
> > It made mw looking at the code around pte_unmap_unlock() and I think that
> > the bug is that we have to reset vmf->address and NULLify vmf->pte once we
> > are done with faultaround:
> >
> > diff --git a/mm/memory.c b/mm/memory.c
>
> Ugh.. Wrong place. Need to sleep.
>
> I'll look into your idea tomorrow.
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 87671284de62..e4daab80ed81 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2987,6 +2987,8 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf, unsigned long address,
> } while ((head = next_map_page(vmf, &xas, end_pgoff)) != NULL);
> pte_unmap_unlock(vmf->pte, vmf->ptl);
> rcu_read_unlock();
> + vmf->address = address;
> + vmf->pte = NULL;
> WRITE_ONCE(file->f_ra.mmap_miss, mmap_miss);
>
> return ret;
> --
And that made no (noticeable) difference either. But at last
I realized, it's absolutely on the right track, but missing the
couple of early returns at the head of filemap_map_pages(): add
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3025,14 +3025,12 @@ vm_fault_t filemap_map_pages(struct vm_f
rcu_read_lock();
head = first_map_page(vmf, &xas, end_pgoff);
- if (!head) {
- rcu_read_unlock();
- return 0;
- }
+ if (!head)
+ goto out;
if (filemap_map_pmd(vmf, head)) {
- rcu_read_unlock();
- return VM_FAULT_NOPAGE;
+ ret = VM_FAULT_NOPAGE;
+ goto out;
}
vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
@@ -3066,9 +3064,9 @@ unlock:
put_page(head);
} while ((head = next_map_page(vmf, &xas, end_pgoff)) != NULL);
pte_unmap_unlock(vmf->pte, vmf->ptl);
+out:
rcu_read_unlock();
vmf->address = address;
- vmf->pte = NULL;
WRITE_ONCE(file->f_ra.mmap_miss, mmap_miss);
return ret;
--
and then the corruption is fixed. It seems miraculous that the
machines even booted with that bad vmf->address going to __do_fault():
maybe that tells us what a good job map_pages does most of the time.
You'll see I've tried removing the "vmf->pte = NULL;" there. I did
criticize earlier that vmf->pte was being left set, but was either
thinking back to some earlier era of mm/memory.c, or else confusing
with vmf->prealloc_pte, which is NULLed when consumed: I could not
find anywhere in mm/memory.c which now needs vmf->pte to be cleared,
and I seem to run fine without it (even on i386 HIGHPTE).
So, the mystery is solved; but I don't think any of these patches
should be applied. Without thinking through Linus's suggestions
re do_set_pte() in particular, I do think this map_pages interface
is too ugly, and given us lots of trouble: please take your time
to go over it all again, and come up with a cleaner patch.
I've grown rather jaded, and questioning the value of the rework:
I don't think I want to look at or test another for a week or so.
Hugh
WARNING: multiple messages have this Message-ID
From: Hugh Dickins <hughd@google.com>
To: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Android Kernel Team <kernel-team@android.com>,
Jan Kara <jack@suse.cz>, Minchan Kim <minchan@kernel.org>,
Linus Torvalds <torvalds@linux-foundation.org>,
Hugh Dickins <hughd@google.com>,
Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
Matthew Wilcox <willy@infradead.org>,
Linux-MM <linux-mm@kvack.org>,
Vinayak Menon <vinmenon@codeaurora.org>,
Linux ARM <linux-arm-kernel@lists.infradead.org>,
Catalin Marinas <catalin.marinas@arm.com>,
Andrew Morton <akpm@linux-foundation.org>,
Will Deacon <will@kernel.org>,
"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Subject: Re: [PATCH 1/2] mm: Allow architectures to request 'old' entries when prefaulting
Date: Mon, 28 Dec 2020 20:35:06 -0800 (PST) [thread overview]
Message-ID: <alpine.LSU.2.11.2012281925420.1028@eggly.anvils> (raw)
In-Reply-To: <20201228221237.6nu75kgxq7ikxn2a@box>
Got it at last, sorry it's taken so long.
On Tue, 29 Dec 2020, Kirill A. Shutemov wrote:
> On Tue, Dec 29, 2020 at 01:05:48AM +0300, Kirill A. Shutemov wrote:
> > On Mon, Dec 28, 2020 at 10:47:36AM -0800, Linus Torvalds wrote:
> > > On Mon, Dec 28, 2020 at 4:53 AM Kirill A. Shutemov <kirill@shutemov.name> wrote:
> > > >
> > > > So far I only found one more pin leak and always-true check. I don't see
> > > > how can it lead to crash or corruption. Keep looking.
Those mods look good in themselves, but, as you expected,
made no difference to the corruption I was seeing.
> > >
> > > Well, I noticed that the nommu.c version of filemap_map_pages() needs
> > > fixing, but that's obviously not the case Hugh sees.
> > >
> > > No,m I think the problem is the
> > >
> > > pte_unmap_unlock(vmf->pte, vmf->ptl);
> > >
> > > at the end of filemap_map_pages().
> > >
> > > Why?
> > >
> > > Because we've been updating vmf->pte as we go along:
> > >
> > > vmf->pte += xas.xa_index - last_pgoff;
> > >
> > > and I think that by the time we get to that "pte_unmap_unlock()",
> > > vmf->pte potentially points to past the edge of the page directory.
> >
> > Well, if it's true we have bigger problem: we set up an pte entry without
> > relevant PTL.
> >
> > But I *think* we should be fine here: do_fault_around() limits start_pgoff
> > and end_pgoff to stay within the page table.
Yes, Linus's patch had made no difference,
the map_pages loop is safe in that respect.
> >
> > It made mw looking at the code around pte_unmap_unlock() and I think that
> > the bug is that we have to reset vmf->address and NULLify vmf->pte once we
> > are done with faultaround:
> >
> > diff --git a/mm/memory.c b/mm/memory.c
>
> Ugh.. Wrong place. Need to sleep.
>
> I'll look into your idea tomorrow.
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 87671284de62..e4daab80ed81 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2987,6 +2987,8 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf, unsigned long address,
> } while ((head = next_map_page(vmf, &xas, end_pgoff)) != NULL);
> pte_unmap_unlock(vmf->pte, vmf->ptl);
> rcu_read_unlock();
> + vmf->address = address;
> + vmf->pte = NULL;
> WRITE_ONCE(file->f_ra.mmap_miss, mmap_miss);
>
> return ret;
> --
And that made no (noticeable) difference either. But at last
I realized, it's absolutely on the right track, but missing the
couple of early returns at the head of filemap_map_pages(): add
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3025,14 +3025,12 @@ vm_fault_t filemap_map_pages(struct vm_f
rcu_read_lock();
head = first_map_page(vmf, &xas, end_pgoff);
- if (!head) {
- rcu_read_unlock();
- return 0;
- }
+ if (!head)
+ goto out;
if (filemap_map_pmd(vmf, head)) {
- rcu_read_unlock();
- return VM_FAULT_NOPAGE;
+ ret = VM_FAULT_NOPAGE;
+ goto out;
}
vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
@@ -3066,9 +3064,9 @@ unlock:
put_page(head);
} while ((head = next_map_page(vmf, &xas, end_pgoff)) != NULL);
pte_unmap_unlock(vmf->pte, vmf->ptl);
+out:
rcu_read_unlock();
vmf->address = address;
- vmf->pte = NULL;
WRITE_ONCE(file->f_ra.mmap_miss, mmap_miss);
return ret;
--
and then the corruption is fixed. It seems miraculous that the
machines even booted with that bad vmf->address going to __do_fault():
maybe that tells us what a good job map_pages does most of the time.
You'll see I've tried removing the "vmf->pte = NULL;" there. I did
criticize earlier that vmf->pte was being left set, but was either
thinking back to some earlier era of mm/memory.c, or else confusing
with vmf->prealloc_pte, which is NULLed when consumed: I could not
find anywhere in mm/memory.c which now needs vmf->pte to be cleared,
and I seem to run fine without it (even on i386 HIGHPTE).
So, the mystery is solved; but I don't think any of these patches
should be applied. Without thinking through Linus's suggestions
re do_set_pte() in particular, I do think this map_pages interface
is too ugly, and given us lots of trouble: please take your time
to go over it all again, and come up with a cleaner patch.
I've grown rather jaded, and questioning the value of the rework:
I don't think I want to look at or test another for a week or so.
Hugh
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
next prev parent reply other threads:[~2020-12-29 4:36 UTC|newest]
Thread overview: 138+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-12-09 16:39 [PATCH 0/2] Create 'old' ptes for faultaround mappings on arm64 with hardware access flag Will Deacon
2020-12-09 16:39 ` Will Deacon
2020-12-09 16:39 ` [PATCH 1/2] mm: Allow architectures to request 'old' entries when prefaulting Will Deacon
2020-12-09 16:39 ` Will Deacon
2020-12-09 17:58 ` Linus Torvalds
2020-12-09 17:58 ` Linus Torvalds
2020-12-09 17:58 ` Linus Torvalds
2020-12-09 18:40 ` Will Deacon
2020-12-09 18:40 ` Will Deacon
2020-12-09 19:04 ` Linus Torvalds
2020-12-09 19:04 ` Linus Torvalds
2020-12-09 19:04 ` Linus Torvalds
2020-12-09 20:32 ` Matthew Wilcox
2020-12-09 20:32 ` Matthew Wilcox
2020-12-09 21:04 ` Linus Torvalds
2020-12-09 21:04 ` Linus Torvalds
2020-12-09 21:04 ` Linus Torvalds
2020-12-10 15:08 ` Kirill A. Shutemov
2020-12-10 15:08 ` Kirill A. Shutemov
2020-12-10 17:23 ` Linus Torvalds
2020-12-10 17:23 ` Linus Torvalds
2020-12-10 17:23 ` Linus Torvalds
2020-12-14 16:07 ` Kirill A. Shutemov
2020-12-14 16:07 ` Kirill A. Shutemov
2020-12-14 17:54 ` Linus Torvalds
2020-12-14 17:54 ` Linus Torvalds
2020-12-14 17:54 ` Linus Torvalds
2020-12-14 18:56 ` Matthew Wilcox
2020-12-14 18:56 ` Matthew Wilcox
2020-12-16 17:07 ` Kirill A. Shutemov
2020-12-16 17:07 ` Kirill A. Shutemov
2020-12-16 18:41 ` Linus Torvalds
2020-12-16 18:41 ` Linus Torvalds
2020-12-16 18:41 ` Linus Torvalds
2020-12-17 10:54 ` Kirill A. Shutemov
2020-12-17 10:54 ` Kirill A. Shutemov
2020-12-17 18:22 ` Linus Torvalds
2020-12-17 18:22 ` Linus Torvalds
2020-12-17 18:22 ` Linus Torvalds
2020-12-18 11:04 ` Kirill A. Shutemov
2020-12-18 11:04 ` Kirill A. Shutemov
2020-12-18 18:56 ` Linus Torvalds
2020-12-18 18:56 ` Linus Torvalds
2020-12-18 18:56 ` Linus Torvalds
2020-12-19 12:41 ` Kirill A. Shutemov
2020-12-19 12:41 ` Kirill A. Shutemov
2020-12-19 20:08 ` Linus Torvalds
2020-12-19 20:08 ` Linus Torvalds
2020-12-19 20:08 ` Linus Torvalds
2020-12-19 20:34 ` Linus Torvalds
2020-12-19 20:34 ` Linus Torvalds
2020-12-19 20:34 ` Linus Torvalds
2020-12-22 10:00 ` Kirill A. Shutemov
2020-12-22 10:00 ` Kirill A. Shutemov
2020-12-24 4:04 ` Hugh Dickins
2020-12-24 4:04 ` Hugh Dickins
2020-12-24 4:04 ` Hugh Dickins
2020-12-25 11:31 ` Kirill A. Shutemov
2020-12-25 11:31 ` Kirill A. Shutemov
2020-12-26 17:57 ` Linus Torvalds
2020-12-26 17:57 ` Linus Torvalds
2020-12-26 17:57 ` Linus Torvalds
2020-12-26 20:43 ` Kirill A. Shutemov
2020-12-26 20:43 ` Kirill A. Shutemov
2020-12-26 21:03 ` Hugh Dickins
2020-12-26 21:03 ` Hugh Dickins
2020-12-26 21:03 ` Hugh Dickins
2020-12-26 21:16 ` Linus Torvalds
2020-12-26 21:16 ` Linus Torvalds
2020-12-26 21:16 ` Linus Torvalds
2020-12-26 22:40 ` Kirill A. Shutemov
2020-12-26 22:40 ` Kirill A. Shutemov
2020-12-27 0:45 ` Hugh Dickins
2020-12-27 0:45 ` Hugh Dickins
2020-12-27 0:45 ` Hugh Dickins
2020-12-27 2:38 ` Hugh Dickins
2020-12-27 2:38 ` Hugh Dickins
2020-12-27 2:38 ` Hugh Dickins
2020-12-27 19:38 ` Linus Torvalds
2020-12-27 19:38 ` Linus Torvalds
2020-12-27 19:38 ` Linus Torvalds
2020-12-27 20:32 ` Damian Tometzki
2020-12-27 20:32 ` Damian Tometzki
2020-12-27 22:35 ` Hugh Dickins
2020-12-27 22:35 ` Hugh Dickins
2020-12-27 22:35 ` Hugh Dickins
2020-12-27 23:12 ` Linus Torvalds
2020-12-27 23:12 ` Linus Torvalds
2020-12-27 23:12 ` Linus Torvalds
2020-12-27 23:40 ` Linus Torvalds
2020-12-27 23:40 ` Linus Torvalds
2020-12-27 23:40 ` Linus Torvalds
2020-12-27 23:55 ` Kirill A. Shutemov
2020-12-27 23:55 ` Kirill A. Shutemov
2020-12-27 23:48 ` Kirill A. Shutemov
2020-12-27 23:48 ` Kirill A. Shutemov
2020-12-28 1:54 ` Linus Torvalds
2020-12-28 1:54 ` Linus Torvalds
2020-12-28 1:54 ` Linus Torvalds
2020-12-28 6:43 ` Hugh Dickins
2020-12-28 6:43 ` Hugh Dickins
2020-12-28 6:43 ` Hugh Dickins
2020-12-28 12:53 ` Kirill A. Shutemov
2020-12-28 12:53 ` Kirill A. Shutemov
2020-12-28 18:47 ` Linus Torvalds
2020-12-28 18:47 ` Linus Torvalds
2020-12-28 18:47 ` Linus Torvalds
2020-12-28 21:58 ` Linus Torvalds
2020-12-28 21:58 ` Linus Torvalds
2020-12-28 21:58 ` Linus Torvalds
2020-12-29 13:28 ` Kirill A. Shutemov
2020-12-29 13:28 ` Kirill A. Shutemov
2020-12-29 15:19 ` Matthew Wilcox
2020-12-29 15:19 ` Matthew Wilcox
2020-12-29 20:52 ` Linus Torvalds
2020-12-29 20:52 ` Linus Torvalds
2020-12-29 20:52 ` Linus Torvalds
2020-12-28 22:05 ` Kirill A. Shutemov
2020-12-28 22:05 ` Kirill A. Shutemov
2020-12-28 22:12 ` Kirill A. Shutemov
2020-12-28 22:12 ` Kirill A. Shutemov
2020-12-29 4:35 ` Hugh Dickins [this message]
2020-12-29 4:35 ` Hugh Dickins
2020-12-29 4:35 ` Hugh Dickins
2020-12-28 23:28 ` Linus Torvalds
2020-12-28 23:28 ` Linus Torvalds
2020-12-28 23:28 ` Linus Torvalds
2020-12-26 21:07 ` Linus Torvalds
2020-12-26 21:07 ` Linus Torvalds
2020-12-26 21:07 ` Linus Torvalds
2020-12-26 21:41 ` Matthew Wilcox
2020-12-26 21:41 ` Matthew Wilcox
2020-12-09 16:39 ` [PATCH 2/2] arm64: mm: Implement arch_wants_old_faultaround_pte() Will Deacon
2020-12-09 16:39 ` Will Deacon
2020-12-09 18:35 ` Catalin Marinas
2020-12-09 18:35 ` Catalin Marinas
2020-12-09 18:46 ` Will Deacon
2020-12-09 18:46 ` Will Deacon
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=alpine.LSU.2.11.2012281925420.1028@eggly.anvils \
--to=hughd@google.com \
--cc=akpm@linux-foundation.org \
--cc=catalin.marinas@arm.com \
--cc=jack@suse.cz \
--cc=kernel-team@android.com \
--cc=kirill.shutemov@linux.intel.com \
--cc=kirill@shutemov.name \
--cc=linux-arm-kernel@lists.infradead.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=minchan@kernel.org \
--cc=torvalds@linux-foundation.org \
--cc=vinmenon@codeaurora.org \
--cc=will@kernel.org \
--cc=willy@infradead.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.