linux-mm.kvack.org archive mirror
From: Thomas Hellstrom <thellstrom@vmware.com>
To: "Linus Torvalds" <torvalds@linux-foundation.org>,
	"Thomas Hellström (VMware)" <thomas_os@shipmail.org>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>,
	"Linux Kernel Mailing List" <linux-kernel@vger.kernel.org>,
	Linux-MM <linux-mm@kvack.org>,
	"Matthew Wilcox" <willy@infradead.org>,
	"Will Deacon" <will.deacon@arm.com>,
	"Peter Zijlstra" <peterz@infradead.org>,
	"Rik van Riel" <riel@surriel.com>,
	"Minchan Kim" <minchan@kernel.org>,
	"Michal Hocko" <mhocko@suse.com>,
	"Huang Ying" <ying.huang@intel.com>,
	"Jérôme Glisse" <jglisse@redhat.com>
Subject: Re: [PATCH v4 3/9] mm: pagewalk: Don't split transhuge pmds when a pmd_entry is present
Date: Wed, 9 Oct 2019 18:52:34 +0000
Message-ID: <MN2PR05MB6141B981C2CAB4955D59747EA1950@MN2PR05MB6141.namprd05.prod.outlook.com>
In-Reply-To: CAHk-=whhdRSqjX5wy1LzFYnOG58UztpifkNvbxBcTVbT3Mzv4g@mail.gmail.com

Hi,

On 10/9/19 7:17 PM, Linus Torvalds wrote:
> On Wed, Oct 9, 2019 at 10:03 AM Thomas Hellström (VMware)
> <thomas_os@shipmail.org> wrote:
>> Nope, it handles the hugepages by ignoring them, since they should be
>> read-only, but if pmd_entry() was called with something else than a
>> hugepage, then it requests the fallback, but never a split.
> But  PAGE_WALK_FALLBACK _is_ a split.
>
> Oh, except you did this
>
> -               split_huge_pmd(walk->vma, pmd, addr);
> +               if (!ops->pmd_entry)
> +                       split_huge_pmd(walk->vma, pmd, addr);
>
>
> so it avoids the split.
>
> No, that's unacceptable. And makes no sense anyway. If it doesn't
> split the pmd, then it shouldn't walk the pte's - because they don't
> exist. And if it's not a hugepmd, then the split is a no-op, so the
> test makes no sense.
>
> I hadn't noticed that part of the patch. That simply can't be right. I
> don't think you've tested this, because you never actually have
> hugepages, do you?
>
> You didn't notice or realize that split_huge_pmd() just does that
>
>                 if (is_swap_pmd(*____pmd) || pmd_trans_huge(*____pmd)   \
>                                         || pmd_devmap(*____pmd))        \
>
> thing and doesn't do anything at all if it's not huge.
>
> So no. That code makes no sense at all, and I didn't realize how
> senseless it was, because I stupidly missed that "make the split
> conditional" - which is insane and wrong - and I thought that you
> wanted PAGE_WALK_FALLBACK to split a pmd and fall back to per-pte
> entries, which is what the name implies.
>
> But that's not what you wanted at all.
>
> Just get rid of PAGE_WALK_FALLBACK entirely then, and make the rule be
> that a zero return value just means "split and do ptes". Which is what
> you want (see above why "split" simply is wrong, and isn't an issue
> for you anyway).
>
> That won't change any existing cases, since even if they do have a
> zero return value, they don't have a pte_entry() callback, so they
> won't do that "split and do ptes" anyway.
>
>              Linus
>
Hmm, so we have the following cases to handle when returning from the
pmd_entry() handler:

1) A huge pmd was handled. The handler returns 0 and the walk continues.
2) The pmd is otherwise unstable, typically because someone just zapped a
huge pmd. The handler returns PAGE_WALK_FALLBACK, which gets caught by the
pmd_trans_unstable() test, and the walk retries.
3) The pmd points to a pte directory. The handler returns
PAGE_WALK_FALLBACK, falls through, avoids the split and continues to the
next level. Yeah, that split-avoidance test is indeed made unnecessary by
the preceding pmd_trans_unstable() test:

-               split_huge_pmd(walk->vma, pmd, addr);
+               if (!ops->pmd_entry)
+                       split_huge_pmd(walk->vma, pmd, addr);

But as the commit message says, PAGE_WALK_FALLBACK is needed to guarantee
that a virtual address range is handled once and only once. Therefore we
must distinguish between 1) and 2), since 2) must be retried until it's
handled correctly.
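
To make the three cases concrete, here is a small userspace model of the
dispatch, not kernel code. All names (model_pmd, model_pmd_entry,
model_walk_pmd, WALK_DONE, WALK_FALLBACK) are made up for illustration, and
the model assumes an unstable pmd always settles into a pte directory after
a fixed number of retries:

```c
#include <assert.h>

/* Hypothetical userspace model; none of these names are kernel API. */
enum pmd_state { PMD_HUGE, PMD_UNSTABLE, PMD_PTE_DIR };

#define WALK_DONE     0 /* case 1: huge pmd handled, skip the pte level */
#define WALK_FALLBACK 1 /* cases 2 and 3: the walker decides what to do */

struct model_pmd {
	enum pmd_state state;
	int settles_after; /* retries before an unstable pmd settles */
	int huge_handled;  /* times the range was handled as a huge pmd */
	int pte_handled;   /* times the range was handled at pte level */
};

/* Stand-in for a pmd_entry() callback: only huge pmds are handled here. */
static int model_pmd_entry(struct model_pmd *pmd)
{
	if (pmd->state == PMD_HUGE) {
		pmd->huge_handled++;
		return WALK_DONE;
	}
	return WALK_FALLBACK;
}

/*
 * Walk a single, freshly initialized pmd. Returns the number of retries.
 * The assertions check the "once and only once" property.
 */
static int model_walk_pmd(struct model_pmd *pmd)
{
	int retries = 0;

	for (;;) {
		if (model_pmd_entry(pmd) == WALK_DONE) {
			/* Case 1: done, never descend to the pte level. */
			assert(pmd->huge_handled + pmd->pte_handled == 1);
			return retries;
		}
		if (pmd->state == PMD_UNSTABLE) {
			/* Case 2: nothing was handled yet, so retry. */
			retries++;
			if (--pmd->settles_after <= 0)
				pmd->state = PMD_PTE_DIR;
			continue;
		}
		/* Case 3: a pte directory, handle it at the next level. */
		pmd->pte_handled++;
		assert(pmd->huge_handled + pmd->pte_handled == 1);
		return retries;
	}
}
```

Calling model_walk_pmd() on a struct initialized as { PMD_UNSTABLE, 2, 0, 0 }
retries twice and then handles the range once at the pte level. The point
is only that "done" and "fall back" are separate return values.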

So we need PAGE_WALK_FALLBACK. If we instead were to combine 1) and 3)
in a single return value and use, for example, PAGE_WALK_RETRY for 2),
the following could happen:

a) we handle the huge pmd and return 0 from pmd_entry().
b) another process splits it.
c) we fall through to the pte level and handle the same address range
again...
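
The a)-c) sequence can be replayed in a toy userspace model. This is only a
sketch: range_model, walk_merged and walk_distinct are hypothetical names,
and the concurrent split is simulated by a flag flipped between the callback
and the pte-level check:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one pmd-sized range; not kernel code. */
struct range_model {
	bool huge;   /* true while the range is mapped by a huge pmd */
	int handled; /* how many times the range has been handled */
};

/*
 * Merged scheme: the callback returns 0 both for "huge pmd handled" and
 * for "pte directory, go to the next level", so the walker has to look
 * at the pmd again to decide whether to descend.
 */
static int walk_merged(struct range_model *r, bool split_races_in)
{
	if (r->huge)
		r->handled++;     /* a) huge pmd handled, callback returns 0 */
	if (split_races_in)
		r->huge = false;  /* b) another process splits the pmd */
	if (!r->huge)
		r->handled++;     /* c) walker sees a pte directory, descends */
	return r->handled;
}

/*
 * Distinct scheme: the callback's return value records what it did, so
 * the walker does not re-inspect the pmd after a "done" result.
 */
static int walk_distinct(struct range_model *r, bool split_races_in)
{
	bool done = r->huge;      /* callback result: true means "handled" */

	if (done)
		r->handled++;
	if (split_races_in)
		r->huge = false;  /* the split still happens, but is harmless */
	if (!done)
		r->handled++;     /* only fall back when told to */
	return r->handled;
}
```

With a range initialized as { true, 0 } and the split racing in,
walk_merged() reports the range handled twice while walk_distinct() reports
it handled once.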

So to summarize: yes, the test in the code you cite is unnecessary. But
if we want to guarantee that a virtual address range is handled once and
only once, we need PAGE_WALK_FALLBACK (perhaps renamed). If not, we can
do without it, similar to your original patch.

Thanks,

/Thomas





