From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Subject: Re: [PATCH] mm: Add new vma flag VM_LOCAL_CPU References: <0efb5547-9250-6b6c-fe8e-cf4f44aaa5eb@netapp.com> <20180514144901.0fe99d240ff8a53047dd512e@linux-foundation.org> <20180515004406.GB5168@bombadil.infradead.org> CC: Matthew Wilcox , Andrew Morton , Jeff Moyer , "Kirill A. Shutemov" , linux-kernel , linux-fsdevel , "linux-mm@kvack.org" , Thomas Gleixner , Ingo Molnar , "H. Peter Anvin" , , Peter Zijlstra , Dave Hansen , "Rik van Riel" , Jan Kara , Matthew Wilcox , Amit Golander To: Mark Rutland , Peter Zijlstra From: Boaz Harrosh Message-ID: <6698c486-56e7-1fc2-567c-69cc446a58be@netapp.com> Date: Tue, 15 May 2018 16:24:05 +0300 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: On 15/05/18 14:54, Boaz Harrosh wrote: > On 15/05/18 03:44, Matthew Wilcox wrote: >> On Mon, May 14, 2018 at 02:49:01PM -0700, Andrew Morton wrote: >>> On Mon, 14 May 2018 20:28:01 +0300 Boaz Harrosh wrote: >>>> In this project we utilize a per-core server thread so everything >>>> is kept local. If we use the regular zap_ptes() API All CPU's >>>> are scheduled for the unmap, though in our case we know that we >>>> have only used a single core. The regular zap_ptes adds a very big >>>> latency on every operation and mostly kills the concurrency of the >>>> over all system. Because it imposes a serialization between all cores >>> >>> I'd have thought that in this situation, only the local CPU's bit is >>> set in the vma's mm_cpumask() and the remote invalidations are not >>> performed. Is that a misunderstanding, or is all that stuff not working >>> correctly? >> >> I think you misunderstand Boaz's architecture. He has one thread per CPU, >> so every bit will be set in the mm's (not vma's) mm_cpumask. >> > > Hi Andrew, Matthew > > Yes I have been trying to investigate and trace this for days. > Please see the code below: > >> #define flush_tlb_range(vma, start, end) \ >> flush_tlb_mm_range(vma->vm_mm, start, end, vma->vm_flags) > > The mm_struct @mm below comes from here vma->vm_mm > >> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c >> index e055d1a..1d398a0 100644 >> --- a/arch/x86/mm/tlb.c >> +++ b/arch/x86/mm/tlb.c >> @@ -611,39 +611,40 @@ static unsigned long tlb_single_page_flush_ceiling __read_mostly = 33; >> void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start, >> unsigned long end, unsigned long vmflag) >> { >> int cpu; >> >> struct flush_tlb_info info __aligned(SMP_CACHE_BYTES) = { >> .mm = mm, >> }; >> >> cpu = get_cpu(); >> >> /* This is also a barrier that synchronizes with switch_mm(). */ >> info.new_tlb_gen = inc_mm_tlb_gen(mm); >> >> /* Should we flush just the requested range? */ >> if ((end != TLB_FLUSH_ALL) && >> !(vmflag & VM_HUGETLB) && >> ((end - start) >> PAGE_SHIFT) <= tlb_single_page_flush_ceiling) { >> info.start = start; >> info.end = end; >> } else { >> info.start = 0UL; >> info.end = TLB_FLUSH_ALL; >> } >> >> if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) { >> VM_WARN_ON(irqs_disabled()); >> local_irq_disable(); >> flush_tlb_func_local(&info, TLB_LOCAL_MM_SHOOTDOWN); >> local_irq_enable(); >> } >> >> - if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids) >> + if (!(vmflag & VM_LOCAL_CPU) && >> + cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids) >> flush_tlb_others(mm_cpumask(mm), &info); >> > > I have been tracing the mm_cpumask(vma->vm_mm) at my driver at > different points. At vma creation (file_operations->mmap()), > and before the call to insert_pfn (which calls here). > > At the beginning I was wishful thinking that the mm_cpumask(vma->vm_mm) > should have a single bit set just as the affinity of the thread on > creation of that thread. But then I saw that at %80 of the times some > other random bits are also set. > > Yes Random. Always the thread affinity (single) bit was set but > then zero one or two more bits were set as well. Never seen more then > two though, which baffles me a lot. > > If it was like Matthew said .i.e the cpumask of the all process > then I would expect all the bits to be set. Because I have a thread > on each core. And also I would even expect that all vma->vm_mm > or maybe mm_cpumask(vma->vm_mm) to point to the same global object. > But it was not so. it was pointing to some thread unique object but > still those phantom bits were set all over. (And I am almost sure > same vma had those bits change over time) > > So I would love some mm guy to explain where are those bits collected? > But I do not think this is a Kernel bug because as Matthew showed. > that vma above can and is allowed to leak memory addresses to other > threads / cores in the same process. So it appears that the Kernel > has some correct logic behind its madness. > Hi Mark So you see %20 of the times the mm_cpumask(vma->vm_mm) is a single bit set. And a natural call to flush_tlb_range will only invalidate the local cpu. Are you familiar with this logic? > Which brings me to another question. How can I find from > within a thread Say at the file_operations->mmap() call that the thread > is indeed core-pinned. What mm_cpumask should I inspect? > Mark, Peter do you know how I can check the above? Thanks Boaz >> put_cpu(); >> } > > Thanks > Boaz >