Date: Sun, 01 Jul 2012 10:41:15 +0800
From: Nai Xia <nai.xia@gmail.com>
Subject: Re: [PATCH 13/40] autonuma: CPU follow memory algorithm
To: dlaor@redhat.com
Cc: Andrea Arcangeli, Peter Zijlstra, Ingo Molnar, Hillf Danton,
 linux-kernel@vger.kernel.org, linux-mm@kvack.org, Dan Smith,
 Linus Torvalds, Andrew Morton, Thomas Gleixner, Paul Turner,
 Suresh Siddha, Mike Galbraith, "Paul E. McKenney", Lai Jiangshan,
 Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
 Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
 Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
 Benjamin Herrenschmidt

On 07/01/2012 03:37, Dor Laor wrote:
> On 06/30/2012 09:58 AM, Nai Xia wrote:
>> On Sat, Jun 30, 2012 at 1:48 PM, Dor Laor wrote:
>>> On 06/30/2012 05:43 AM, Nai Xia wrote:
>>>>
>>>> On Sat, Jun 30, 2012 at 9:23 AM, Andrea Arcangeli wrote:
>>>>>
>>>>> On Sat, Jun 30, 2012 at 04:01:50AM +0800, Nai Xia wrote:
>>>>>>
>>>>>> On Sat, Jun 30, 2012 at 2:53 AM, Peter Zijlstra wrote:
>>>>>>>
>>>>>>> On Fri, 2012-06-29 at 12:51 -0400, Dor Laor wrote:
>>>>>>>>
>>>>>>>> The previous comments were not shouts but the mother of all NAKs.
>>>>>>>
>>>>>>> I never said any such thing. I just said why should I bother reading
>>>>>>> your stuff if you're ignoring most of my feedback anyway.
>>>>>>>
>>>>>>> If you want to read that as a NAK, not my problem.
>>>>>>
>>>>>> Hey guys, can I say NAK to these patches?
>>>>>>
>>>>>> Now I am aware that this sampling algorithm is completely broken, if
>>>>>> we take a few seconds to see what it is trying to solve:
>>>>>>
>>>>>> We all know that the LRU tries to answer the question "what are the
>>>>>> pages recently accessed?", so it's enough to use pte bits to
>>>>>> approximate it.
>>>>>
>>>>> I made an example about the active list to try to explain why your
>>>>> example is still going to work fine.
>>>>>
>>>>> After a page becomes active (from inactive) and is a referenced
>>>>> active page, it won't become _very_active_ or _very_very_active_ or
>>>>> more, no matter how many more times you look up the pagecache.
>>>>>
>>>>> The LRU order wasn't relevant here.
>>>>>
>>>>>> However, the NUMA balancing problem is fundamentally like this:
>>>>>>
>>>>>> For each node, in some time unit,
>>>>>>
>>>>>> W = pages_accessed * average_page_access_frequency
>>>>>>
>>>>>> We are trying to move the process to the node having the maximum W,
>>>>>> right?
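For concreteness, the per-node weighting model described above might be
sketched as follows; all names and numbers here are illustrative, not
taken from the patch set:

#include <stdio.h>

#define MAX_NUMNODES 4

struct node_stats {
	unsigned long pages_accessed;	/* pages touched in the interval */
	double avg_access_freq;		/* mean accesses per page per interval */
};

/* Pick the node with the maximum W = pages_accessed * avg_access_freq. */
static int best_node(const struct node_stats stats[], int nr_nodes)
{
	int nid, best = 0;
	double w, best_w = -1.0;

	for (nid = 0; nid < nr_nodes; nid++) {
		w = stats[nid].pages_accessed * stats[nid].avg_access_freq;
		if (w > best_w) {
			best_w = w;
			best = nid;
		}
	}
	return best;
}

int main(void)
{
	/* Node 0: many pages, cold. Node 1: few pages, hot. */
	struct node_stats stats[MAX_NUMNODES] = {
		{ .pages_accessed = 1000, .avg_access_freq = 1.0 },
		{ .pages_accessed =  100, .avg_access_freq = 50.0 },
	};

	printf("preferred node: %d\n", best_node(stats, 2));
	return 0;
}

Here the second node wins on access frequency (W = 5000 vs 1000) despite
touching far fewer pages, which is exactly the case the argument above
says a pure pages-touched sample would miss.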
>>>>> First of all, the mm_autonuma statistics are not a function of time
>>>>> and there is no page access frequency there.
>>>>>
>>>>> mm_autonuma is static information collected by knuma_scand from the
>>>>> pagetables. It's static and 100% accurate on the whole process and
>>>>> definitely not generated by the numa hinting page faults. I could shut
>>>>> off all numa hinting page faults permanently and still generate the
>>>>> mm_autonuma information identically.
>>>>>
>>>>> There's a knob in /sys/kernel/mm/autonuma/knuma_scand/working_set that
>>>>> you can enable if you want to use "runtime" rather than static
>>>>> information for the mm_autonuma too, but that's not the default for
>>>>> now (though I think it may be a better default; there wasn't enough
>>>>> time to test it yet).
>>>>>
>>>>> The task_autonuma (thread) statistics are the only thing that is
>>>>> sampled by default, in a 10sec interval (the interval is tunable with
>>>>> sysfs too, and 10sec is likely too aggressive, 30sec sounds better;
>>>>> we're eventually going to make it dynamic anyway).
>>>>>
>>>>> So even if you were right, the thread statistics only kick in to
>>>>> balance threads against threads of the same process; most of the time
>>>>> what's more important are the mm_autonuma statistics.
>>>>>
>>>>> But in reality the thread statistics also work perfectly for the job,
>>>>> as an approximation of the NUMA memory footprint of the thread (vs the
>>>>> other threads). And then the rest of the memory slowly follows to
>>>>> whatever node's CPUs I placed the thread on (even if that's not the
>>>>> absolute best one at all times).
>>>>>
>>>>>> Andrea's patch can only approximate the pages_accessed number in a
>>>>>> time unit (scan interval). I don't think it can catch even 1% of
>>>>>> average_page_access_frequency on a busy workload. Blindly assuming
>>>>>> that all the pages' average_page_access_frequency is the same seems
>>>>>> broken to me.
>>>>>
>>>>> All we need is an approximation to take a better-than-random decision;
>>>>> even if you get it 1% right, it's still better than the 0% right of
>>>>> going blind. Your 1% is too pessimistic: in my tests the thread
>>>>> statistics are more like >90% correct on average (I monitor them with
>>>>> the debug mode constantly).
>>>>>
>>>>> If this 1%-right case happens once in a million samples, who cares:
>>>>> it's not going to run measurably slower anyway (and it will still be
>>>>> better than picking a 0%-right node).
>>>>>
>>>>> What you're saying is that because the active list in the pagecache
>>>>> won't differentiate between 10 cache hits and 20 cache hits, we should
>>>>> drop the active list, stop activating pages and just treat them all
>>>>> the same, because in some unlucky access pattern the active list may
>>>>> only get 1% of the working set right. But there's a reason why the
>>>>> active list exists despite getting things wrong in some corner case
>>>>> and possibly leaving the large amount of infrequently accessed pages
>>>>> in the inactive list forever (even if it gets things only 1% right in
>>>>> those worst cases, it's still better than 0% right and no active list
>>>>> at all).
>>>>>
>>>>> To say it another way, you may still crash the car even if you're
>>>>> careful, but do you think it's better to watch the street or to drive
>>>>> blindfolded?
>>>>>
>>>>> numa/sched drives blindfolded; autonuma watches around every 10sec
>>>>> very carefully for the best next turn to take with the car and for
>>>>> obstacles to avoid. You can imagine who wins.
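As a rough picture of the per-thread sampling described above, a numa
hinting fault histogram might look like the sketch below. Hypothetical
names, not the real task_autonuma internals: each hinting fault records
which node the touched page lives on, and at the end of the interval
the histogram approximates the thread's NUMA memory footprint.

#include <stdio.h>

#define MAX_NUMNODES 4

struct task_numa_sample {
	unsigned long faults[MAX_NUMNODES];
};

/* Called from the (simulated) numa hinting fault path. */
static void numa_hinting_fault(struct task_numa_sample *s, int page_nid)
{
	s->faults[page_nid]++;
}

/* End of the interval: prefer the node where most faults landed. */
static int preferred_node(const struct task_numa_sample *s)
{
	int nid, best = 0;

	for (nid = 1; nid < MAX_NUMNODES; nid++)
		if (s->faults[nid] > s->faults[best])
			best = nid;
	return best;
}

int main(void)
{
	struct task_numa_sample s = { { 0 } };

	/* Simulate one 10sec interval: most accesses hit pages on node 2. */
	numa_hinting_fault(&s, 0);
	numa_hinting_fault(&s, 2);
	numa_hinting_fault(&s, 2);
	numa_hinting_fault(&s, 2);

	printf("preferred node: %d\n", preferred_node(&s));
	return 0;
}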
>>>>> Watching the street carefully every 10sec doesn't mean that the next
>>>>> moment a missile won't hit your car and make you crash; you still
>>>>> have better chances of not crashing than by driving blindfolded.
>>>>>
>>>>> numa/sched pretends to compete without collecting information on the
>>>>> NUMA thread memory footprint (task_autonuma, sampled with an
>>>>> exponential backoff at 10sec intervals), and without process
>>>>> information (full static information from the pagetables, not
>>>>> sampled). No matter how you compute stuff, if you've got nothing
>>>>> meaningful as input to your algorithm, you lose. And it looks like
>>>>> you believe that you can take better decisions with nothing as input
>>>>> to your NUMA placement algorithm, because my thread info
>>>>> (task_autonuma) isn't 100% perfect at all times and can't predict the
>>>>> future. The alternative is to get that information from syscalls, but
>>>>> even ignoring the -ENOMEM from split_vma, that will lead to userland
>>>>> bugs, and overall the task_autonuma information may be more reliable
>>>>> in the end, even if it's sampled using an exponential backoff.
>>>>>
>>>>> Also note the exponential backoff: it's not really just the last
>>>>> interval, it's the last interval plus half the previous interval plus
>>>>> 1/4 of the interval before that, etc., and we can trivially control
>>>>> the decay.
>>>>>
>>>>> All we need is to get a direction, and knowing _exactly_ what the
>>>>> task did over the last 10 seconds (even if that can't predict what
>>>>> the thread will do in the next 1sec) is all we need to get one. After
>>>>> we take the direction, the memory will follow, so we couldn't care
>>>>> less what it does in the next second, because that will follow the
>>>>> CPU (after a while, last_nid anti-false-sharing logic permitting),
>>>>> and at least we'll know for sure that the memory accessed in the last
>>>>> 10sec is already local, and that defines the best node to schedule
>>>>> the thread on.
>>>>>
>>>>> I don't mean there's no room for improvement in the way the input
>>>>> data is computed, or even in the way the input data is generated (the
>>>>> exponential backoff decay can be tuned too). I just tried to do the
>>>>> simplest computations on the data to make the workloads converge
>>>>> fast, and you're welcome to contribute.
>>>>>
>>>>> But I believe the task_autonuma information is extremely valuable and
>>>>> we can trust it very much, knowing we'll get a great placement. The
>>>>> concern you have isn't invalid, but it's a very minor one, and the
>>>>> sampling rate effects you are concerned about, while real, are lost
>>>>> in the noise in practice.
>>>>
>>>> Well, I am not convinced by all these words. And surely I will NOT
>>>> follow your reasoning of "having information is always better than
>>>> nothing". We all know that a badly biased balancing is worse than
>>>> randomness: at least randomness means "average, fair play, ...".
>>>> With all these uncertain things, I think only a comprehensive survey
>>>> of real-world workloads can tell whether my concern is significant or
>>>> not.
>>>>
>>>> So I think my suggestion to you is: show the world some solid and sound
>
> ^^^^^^^^^^^^^
>>>> real-world proof that your approximation is > 90% accurate, just like
>>>
>>> The cover letter contained a link to the performance numbers:
>>> https://www.kernel.org/pub/linux/kernel/people/andrea/autonuma/autonuma_bench-20120530.pdf
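The exponential backoff described above is the classic halving
accumulator: at each sampling interval the accumulated statistic is
halved before the new sample is added, so the last interval counts
fully, the one before it by 1/2, the one before that by 1/4, and so on.
A minimal sketch, with hypothetical names:

#include <stdio.h>

/* acc = sample_n + sample_(n-1)/2 + sample_(n-2)/4 + ... */
static unsigned long decay_add(unsigned long acc, unsigned long sample)
{
	return acc / 2 + sample;
}

int main(void)
{
	unsigned long faults_per_interval[] = { 800, 400, 1000 };
	unsigned long acc = 0;
	unsigned int i;

	for (i = 0; i < 3; i++) {
		acc = decay_add(acc, faults_per_interval[i]);
		printf("interval %u: acc = %lu\n", i, acc);
	}
	/* Final acc = 1000 + 400/2 + 800/4 = 1400. */
	return 0;
}

Halving the divisor in decay_add (or halving less often) is the trivial
knob for controlling how fast old intervals fade.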
>> Yes, I saw this. But if you consider this already a solid and
>> comprehensive proof, you win; I have no other words to say.
>
> No one said there is a proof; on the contrary, I said it's possible to
> beat any heuristic algorithm, and Andrea explained the LRU is one such
> too.
>
> You asked above for a real-world example and that's what Andrea was
> trying to provide (note that it includes a tiny regression with a
> parallel kernel compile on tmpfs).
>
>>> It includes SPECjbb, kernel build, cpuHog in guests, and a handful of
>>> unit tests.
>>>
>>> I'm sure anyone can beat most kernel algorithms with some pathological
>>> case, including the LRU and CFS. The only way to improve the numa
>>> balancing stuff is
>>
>> As I already said, the pathological cases for the LRU have been well
>> understood for decades, so they are quite reasonable to ignore, and
>> every programmer has been taught to avoid them. That conclusion took
>> much time from many talented brains.
>
> Who are these programmers that you talk about? The average Java
> programmer is clueless w.r.t. memory allocation. Even with KVM we have
> the issue of a double swap storm when both the guest and the host have
> the same page on their LRU lists.
>
>> But the problem with this algorithm is not well understood. And you are
>> drawing a hasty conclusion about it without bothering to do
>> comprehensive research.
>>
>> "Collect the data from a wide range of pages occasionally, and then do
>> condensed computing on a small set of pages" looks like a very common
>> practice to me. But again, if you simply label this as "minor", I have
>> no other words to say.
>>
>>> to sample more, meaning faulting more == larger overhead.
>>
>> Are you sure you really want to compete on sampling speed with
>> CPU-intensive workloads?
>
> You didn't understand my point: I was saying exactly this, that it's not
> worth sampling more because it carries a huge overhead.
> Pls don't be that fast on the 'send' trigger :)
>
>> OK, I think I'll stop discussing this topic now. Without strict and
>> comprehensive research on this topic, further arguments seem to me to
>> be purely based on imagination.
>>
>> And I have no interest in beating any of your fancy algorithms; it
>> wouldn't bring me 1G$. I am just curious about the truth.
>
> No one said it's fancy besides you. I actually proposed a way to relax
> such false migrations in my previous reply.
>
>> Basically, anyone has the right to laugh, if W = x * y and you only
>
> Laughing while discussing numa code is underestimated!
> Let's not continue to spam the list; I think we've all made our points.
> Dor

Note, my laughing was based on your attitude of "anyone can beat
anything", which is also underestimated. I remember that despite my
criticism of the sampling, I was quite friendly to Andrea, so why the
acidness/bitterness?

But you are right. We all made our points. Let's stop it.