From: "Kirill A. Shutemov" <kirill@shutemov.name>
To: "Thomas Hellström (VMware)" <thomas_os@shipmail.org>,
	"Matthew Wilcox" <willy@infradead.org>,
	"Dan Williams" <dan.j.williams@intel.com>
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	torvalds@linux-foundation.org,
	"Thomas Hellstrom" <thellstrom@vmware.com>,
	"Andrew Morton" <akpm@linux-foundation.org>,
	"Will Deacon" <will.deacon@arm.com>,
	"Peter Zijlstra" <peterz@infradead.org>,
	"Rik van Riel" <riel@surriel.com>,
	"Minchan Kim" <minchan@kernel.org>,
	"Michal Hocko" <mhocko@suse.com>,
	"Huang Ying" <ying.huang@intel.com>,
	"Jérôme Glisse" <jglisse@redhat.com>
Subject: Re: [PATCH v3 2/7] mm: Add a walk_page_mapping() function to the pagewalk code
Date: Fri, 4 Oct 2019 16:24:07 +0300
Message-ID: <20191004132407.gzttci7lio6be467@box>
In-Reply-To: <8ef9fff3-df8d-cc14-35f9-d83db62e874f@shipmail.org>

On Fri, Oct 04, 2019 at 02:58:59PM +0200, Thomas Hellström (VMware) wrote:
> On 10/4/19 2:37 PM, Kirill A. Shutemov wrote:
> > On Thu, Oct 03, 2019 at 01:32:45PM +0200, Thomas Hellström (VMware) wrote:
> > > > > + *   If @mapping allows faulting of huge pmds and puds, it is desirable
> > > > > + *   that its huge_fault() handler blocks while this function is running on
> > > > > + *   @mapping. Otherwise a race may occur where the huge entry is split when
> > > > > + *   it was intended to be handled in a huge entry callback. This requires an
> > > > > + *   external lock, for example that @mapping->i_mmap_rwsem is held in
> > > > > + *   write mode in the huge_fault() handlers.
> > > > Em. No. We have ptl for this. It's the only lock required (plus mmap_sem
> > > > on read) to split PMD entry into PTE table. And it can happen not only
> > > > from fault path.
> > > > 
> > > > If you care about splitting compound page under you, take a pin or lock a
> > > > page. It will block split_huge_page().
> > > > 
> > > > Suggestion to block fault path is not viable (and it will not happen
> > > > magically just because of this comment).
> > > > 
> > > I was specifically thinking of this:
> > > 
> > > https://elixir.bootlin.com/linux/latest/source/mm/pagewalk.c#L103
> > > 
> > > > If a huge pud is concurrently faulted in here, it will immediately get split
> > > without getting processed in pud_entry(). An external lock would protect
> > > against that, but that's perhaps a bug in the pagewalk code?  For pmds the
> > > situation is not the same since when pte_entry is used, all pmds will
> > > unconditionally get split.
> > I *think* it should be fixed with something like this (there's no
> > pud_trans_unstable() yet):
> > 
> > diff --git a/mm/pagewalk.c b/mm/pagewalk.c
> > index d48c2a986ea3..221a3b945f42 100644
> > --- a/mm/pagewalk.c
> > +++ b/mm/pagewalk.c
> > @@ -102,10 +102,11 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
> >   					break;
> >   				continue;
> >   			}
> > +		} else {
> > +			split_huge_pud(walk->vma, pud, addr);
> >   		}
> > -		split_huge_pud(walk->vma, pud, addr);
> > -		if (pud_none(*pud))
> > +		if (pud_none(*pud) || pud_trans_unstable(*pud))
> >   			goto again;
> >   		if (ops->pmd_entry || ops->pte_entry)
> 
> Yes, this seems better. I was looking at implementing a pud_trans_unstable()
> as a basis of fixing problems like this, but when I looked at
> pmd_trans_unstable I got a bit confused:
> 
> Why are devmap huge pmds considered stable? I mean, couldn't anybody just
> run madvise() to clear those just like transhuge pmds?

Matthew, Dan, could you comment on this?

> > Or better yet converted to what we do on pmd level.
> > 
> > Honestly, all the code around PUD THP is missing a lot of groundwork.
> > Rushing it upstream for DAX was not the right move.
> > 
> > > There's a similar more scary race in
> > > 
> > > https://elixir.bootlin.com/linux/latest/source/mm/memory.c#L3931
> > > 
> > > It looks like if a concurrent thread faults in a huge pud just after the
> > > test for pud_none in that pmd_alloc, things might go pretty bad.
> > Hm? It will fail the next pmd_none() check under ptl. Do you have a
> > particular racing scenario?
> > 
> Yes, I misinterpreted the code somewhat, but here's the scenario that looks
> racy:
> 
> Thread 1              Thread 2
> huge_fault(pud)                               - Fell back, for example because of write fault on dirty-tracking.
>                       huge_fault(pud)         - Taken, read fault.
> pmd_alloc()                                   - Will fail pmd_none check and return a pmd_offset()

I see. It also misses a pud_trans_unstable() check, or a variant of it.
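
For illustration only, here is a minimal user-space sketch of the idea behind
such a check, modelled loosely on what pmd_trans_unstable() does at the pmd
level: read the entry once, and refuse to descend unless that single snapshot
points to a lower-level table. Everything below (struct model_pud, the
pud_state values, model_pud_trans_unstable(), model_walk_pud()) is hypothetical
and is not kernel API; it only models the retry logic from the diff above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model of a PUD slot; this is NOT the kernel's pud_t. */
enum pud_state { PUD_NONE, PUD_HUGE, PUD_TABLE };

struct model_pud {
	_Atomic int state;	/* one of enum pud_state; may change concurrently */
};

/*
 * Hypothetical analogue of pmd_trans_unstable() for the pud level.
 * Takes a single snapshot of the entry and returns true ("unstable")
 * when the snapshot is empty or huge, i.e. when the caller must not
 * assume that a pmd table sits below this entry.
 */
static bool model_pud_trans_unstable(struct model_pud *pud)
{
	int snap = atomic_load(&pud->state);

	assert(snap == PUD_NONE || snap == PUD_HUGE || snap == PUD_TABLE);
	return snap != PUD_TABLE;
}

enum walk_result { WALK_RETRY, WALK_DESCEND };

/*
 * Models the tail of the walk_pud_range() loop body after the proposed
 * change: retry the entry (the "goto again" in the diff) unless it is
 * stable, and only then descend to the pmd level.
 */
static enum walk_result model_walk_pud(struct model_pud *pud)
{
	if (model_pud_trans_unstable(pud))
		return WALK_RETRY;
	return WALK_DESCEND;
}
```

The point of the single snapshot is that the decision to descend is based on
one consistent value of the entry, rather than on separate none and huge tests
that a concurrent fault could slip between.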

-- 
 Kirill A. Shutemov
