From: Nadav Amit <namit@vmware.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: Yang Shi <yang.shi@linux.alibaba.com>,
	"jstancek@redhat.com" <jstancek@redhat.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	"stable@vger.kernel.org" <stable@vger.kernel.org>,
	Linux-MM <linux-mm@kvack.org>,
	LKML <linux-kernel@vger.kernel.org>,
	"Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>,
	Nick Piggin <npiggin@gmail.com>, Minchan Kim <minchan@kernel.org>,
	Mel Gorman <mgorman@suse.de>, Will Deacon <will.deacon@arm.com>
Subject: Re: [PATCH] mm: mmu_gather: remove __tlb_reset_range() for force flush
Date: Mon, 13 May 2019 09:11:38 +0000
Message-ID: <75FD46B2-2E0C-41F2-9308-AB68C8780E33@vmware.com>
In-Reply-To: <20190513083606.GL2623@hirez.programming.kicks-ass.net>

> On May 13, 2019, at 1:36 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Thu, May 09, 2019 at 09:21:35PM +0000, Nadav Amit wrote:
> 
>>>>> And we can fix that by having tlb_finish_mmu() sync up. Never let a
>>>>> concurrent tlb_finish_mmu() complete until all concurrent mmu_gathers
>>>>> have completed.
>>>>> 
>>>>> This should not be too hard to make happen.
>>>> 
>>>> This synchronization sounds much more expensive than what I proposed. I
>>>> agree that cache lines that move from one CPU to another might become an
>>>> issue, but I think the scheme I suggested would minimize this overhead.
>>> 
>>> Well, it would have a lot more unconditional atomic ops. My scheme only
>>> waits when there is actual concurrency.
>> 
>> Well, something has to give. I didn’t think the atomic op would be too
>> expensive if the same core performs it.
> 
> They're still at least 20 cycles a pop, uncontended.
> 
>>> I _think_ something like the below ought to work, but it's not even been
>>> near a compiler. The only problem is the unconditional wakeup; we can
>>> play games to avoid that if we want to continue with this.
>>> 
>>> Ideally we'd only do this when there's been actual overlap, but I've not
>>> found a sensible way to detect that.
>>> 
>>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>>> index 4ef4bbe78a1d..b70e35792d29 100644
>>> --- a/include/linux/mm_types.h
>>> +++ b/include/linux/mm_types.h
>>> @@ -590,7 +590,12 @@ static inline void dec_tlb_flush_pending(struct mm_struct *mm)
>>> 	 *
>>> 	 * Therefore we must rely on tlb_flush_*() to guarantee order.
>>> 	 */
>>> -	atomic_dec(&mm->tlb_flush_pending);
>>> +	if (atomic_dec_and_test(&mm->tlb_flush_pending)) {
>>> +		wake_up_var(&mm->tlb_flush_pending);
>>> +	} else {
>>> +		wait_var_event(&mm->tlb_flush_pending,
>>> +			       !atomic_read_acquire(&mm->tlb_flush_pending));
>>> +	}
>>> }
>> 
>> It still seems very expensive to me, at least for certain workloads (e.g.,
>> Apache with multithreaded MPM).
> 
> Is that Apache-MPM workload triggering this a lot? Having a known
> benchmark for this stuff is good for when someone has time to play with
> things.

Setting up Apache2 with mpm_worker causes every request to go through an
mmap-writev-munmap flow on every thread. I have not run this workload since
the patches that downgrade mmap_sem to read before the page-table zapping
were introduced. I presume these patches allow the page-table zapping to be
done concurrently, so this workload would now hit this flow.
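
Roughly, each worker thread ends up doing something like the sketch below
for every request (a simplified userspace illustration; handle_request()
and the anonymous buffer are made up for this sketch, not Apache's actual
code):

#include <string.h>
#include <sys/mman.h>
#include <sys/uio.h>

static void handle_request(int fd, size_t len)
{
        char *buf;
        struct iovec iov;

        /* Transient per-request buffer; anonymous memory keeps the
         * sketch self-contained (the real server maps files too). */
        buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED)
                return;

        memset(buf, 'x', len);          /* build the response */

        iov.iov_base = buf;
        iov.iov_len = len;
        writev(fd, &iov, 1);            /* send it to the client */

        /* The munmap() zaps page tables and flushes the TLB; with many
         * threads doing this concurrently on one mm, the flushes overlap. */
        munmap(buf, len);
}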

>> It may be possible to avoid false-positive nesting indications (when the
>> flushes do not overlap) by creating a new struct mmu_gather_pending, with
>> something like:
>> 
>>  struct mmu_gather_pending {
>> 	u64 start;
>> 	u64 end;
>> 	struct mmu_gather_pending *next;
>>  };
>> 
>> tlb_finish_mmu() would then iterate over the mm->mmu_gather_pending
>> (pointing to the linked list) and find whether there is any overlap. This
>> would still require synchronization (acquiring a lock when allocating and
>> deallocating or something fancier).
> 
> We have an interval_tree for this, and yes, that's how far I got :/
> 
> The other thing I was thinking of is trying to detect overlap through
> the page-tables themselves, but we have a distinct lack of storage
> there.
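
To make the list variant concrete, the check on the tlb_finish_mmu() side
could look something like the sketch below (uncompiled; the
mmu_gather_pending list head and gather_lock on mm_struct are hypothetical
additions, and the linear scan is exactly what your interval tree would
replace):

static bool mm_gather_overlaps(struct mm_struct *mm, u64 start, u64 end)
{
        struct mmu_gather_pending *p;
        bool overlap = false;

        spin_lock(&mm->gather_lock);    /* hypothetical lock for the list */
        for (p = mm->mmu_gather_pending; p; p = p->next) {
                /* standard interval intersection test */
                if (p->start < end && start < p->end) {
                        overlap = true;
                        break;
                }
        }
        spin_unlock(&mm->gather_lock);

        return overlap;
}

tlb_finish_mmu() would then take the expensive synchronizing path only when
an actual overlap is found.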

I tried to think about saving some generation info somewhere in the
page-struct, but I could not come up with a reasonable solution that would
not require traversing all the page tables again once the TLB flush is
done.

> The thing is, if this threaded monster runs on all CPUs (busy front-end
> server) and does a ton of invalidation due to all the short-lived
> request crud, then all the extra invalidations will add up too. Having
> to do process (machine in this case) wide invalidations is expensive,
> having to do more of them surely isn't cheap either.
> 
> So there might be something to win here.

Yes. I remember that these full TLB flushes leave their mark.

BTW: sometimes you don’t see the effect of these full TLB flushes as much in
VMs. I encountered a strange phenomenon at the time: INVLPG for an
arbitrary page caused my Haswell machine to flush the entire TLB whenever
the INVLPG was issued inside a VM. It took me quite some time to analyze
this problem. Eventually Intel told me that it is part of what is called
“page fracturing”: if the host uses 4KB pages in the EPT, the CPU (usually)
needs to flush the entire TLB for any INVLPG, since it does not know the
size of the flushed page.

I really need to finish my blog-post about it some day.

