Date: Sun, 01 Jul 2012 10:41:15 +0800
From: Nai Xia <nai.xia@gmail.com>
Subject: Re: [PATCH 13/40] autonuma: CPU follow memory algorithm
To: dlaor@redhat.com
Cc: Andrea Arcangeli, Peter Zijlstra, Ingo Molnar, Hillf Danton,
 linux-kernel@vger.kernel.org, linux-mm@kvack.org, Dan Smith,
 Linus Torvalds, Andrew Morton, Thomas Gleixner, Paul Turner,
 Suresh Siddha, Mike Galbraith, "Paul E. McKenney", Lai Jiangshan,
 Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
 Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
 Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
 Benjamin Herrenschmidt

On 07/01/2012 03:37, Dor Laor wrote:
> On 06/30/2012 09:58 AM, Nai Xia wrote:
>> On Sat, Jun 30, 2012 at 1:48 PM, Dor Laor wrote:
>>> On 06/30/2012 05:43 AM, Nai Xia wrote:
>>>>
>>>> On Sat, Jun 30, 2012 at 9:23 AM, Andrea Arcangeli wrote:
>>>>>
>>>>> On Sat, Jun 30, 2012 at 04:01:50AM +0800, Nai Xia wrote:
>>>>>>
>>>>>> On Sat, Jun 30, 2012 at 2:53 AM, Peter Zijlstra wrote:
>>>>>>>
>>>>>>> On Fri, 2012-06-29 at 12:51 -0400, Dor Laor wrote:
>>>>>>>>
>>>>>>>> The previous comments were not shouts but the mother of all NAKs.
>>>>>>>
>>>>>>> I never said any such thing. I just said why should I bother reading
>>>>>>> your stuff if you're ignoring most of my feedback anyway.
>>>>>>>
>>>>>>> If you want to read that as a NAK, not my problem.
>>>>>>
>>>>>> Hey guys, can I say NAK to these patches?
>>>>>>
>>>>>> Now I am aware that this sampling algorithm is completely broken, if
>>>>>> we take a few seconds to see what it is trying to solve:
>>>>>>
>>>>>> We all know that the LRU tries to answer the question "what are the
>>>>>> pages recently accessed?", so it's enough to use pte bits to
>>>>>> approximate it.
>>>>>
>>>>> I made an example about the active list to try to explain why your
>>>>> example is still going to work fine.
>>>>>
>>>>> After a page becomes active (from inactive) and is a referenced
>>>>> active page, it won't become _very_active_ or _very_very_active_ or
>>>>> more, no matter how many more times you look up the pagecache.
>>>>>
>>>>> The LRU order wasn't relevant here.
>>>>>
>>>>>> However, the NUMA balancing problem is fundamentally like this:
>>>>>>
>>>>>> For each node, in some time unit,
>>>>>>
>>>>>> W = pages_accessed * average_page_access_frequency
>>>>>>
>>>>>> We are trying to move the process to the node having the maximum W,
>>>>>> right?
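For concreteness, the per-node weighting model described above might be
sketched as follows; all names and numbers here are illustrative, not
taken from the patch set:

#include <stdio.h>

#define MAX_NUMNODES 4

struct node_stats {
	unsigned long pages_accessed;	/* pages touched in the interval */
	double avg_access_freq;		/* mean accesses per page per interval */
};

/* Pick the node with the maximum W = pages_accessed * avg_access_freq. */
static int best_node(const struct node_stats stats[], int nr_nodes)
{
	int nid, best = 0;
	double w, best_w = -1.0;

	for (nid = 0; nid < nr_nodes; nid++) {
		w = stats[nid].pages_accessed * stats[nid].avg_access_freq;
		if (w > best_w) {
			best_w = w;
			best = nid;
		}
	}
	return best;
}

int main(void)
{
	/* Node 0: many pages, cold. Node 1: few pages, hot. */
	struct node_stats stats[MAX_NUMNODES] = {
		{ .pages_accessed = 1000, .avg_access_freq = 1.0 },
		{ .pages_accessed =  100, .avg_access_freq = 50.0 },
	};

	printf("preferred node: %d\n", best_node(stats, 2));
	return 0;
}

Here the second node wins on access frequency (W = 5000 vs 1000) despite
touching far fewer pages, which is exactly the case the argument above
says a pure pages-touched sample would miss.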
>>>>> First of all, the mm_autonuma statistics are not a function of time
>>>>> and there is no page access frequency there.
>>>>>
>>>>> mm_autonuma is static information collected by knuma_scand from the
>>>>> pagetables. It's static and 100% accurate on the whole process and
>>>>> definitely not generated by the numa hinting page faults. I could shut
>>>>> off all numa hinting page faults permanently and still generate the
>>>>> mm_autonuma information identically.
>>>>>
>>>>> There's a knob in /sys/kernel/mm/autonuma/knuma_scand/working_set that
>>>>> you can enable if you want to use "runtime" rather than static
>>>>> information for the mm_autonuma too, but that's not the default for
>>>>> now (though I think it may be a better default; there wasn't enough
>>>>> time to test it yet).
>>>>>
>>>>> The task_autonuma (thread) statistics are the only thing that is
>>>>> sampled by default, in a 10sec interval (the interval is tunable with
>>>>> sysfs too, and 10sec is likely too aggressive, 30sec sounds better;
>>>>> we're eventually going to make it dynamic anyway).
>>>>>
>>>>> So even if you were right, the thread statistics only kick in to
>>>>> balance threads against threads of the same process; most of the time
>>>>> what's more important are the mm_autonuma statistics.
>>>>>
>>>>> But in reality the thread statistics also work perfectly for the job,
>>>>> as an approximation of the NUMA memory footprint of the thread (vs the
>>>>> other threads). And then the rest of the memory slowly follows to
>>>>> whatever node's CPUs I placed the thread on (even if that's not the
>>>>> absolute best one at all times).
>>>>>
>>>>>> Andrea's patch can only approximate the pages_accessed number in a
>>>>>> time unit (scan interval). I don't think it can catch even 1% of
>>>>>> average_page_access_frequency on a busy workload. Blindly assuming
>>>>>> that all the pages' average_page_access_frequency is the same seems
>>>>>> broken to me.
>>>>>
>>>>> All we need is an approximation to take a better-than-random decision;
>>>>> even if you get it 1% right, it's still better than the 0% right of
>>>>> going blind. Your 1% is too pessimistic: in my tests the thread
>>>>> statistics are more like >90% correct on average (I monitor them with
>>>>> the debug mode constantly).
>>>>>
>>>>> If this 1%-right case happens once in a million samples, who cares:
>>>>> it's not going to run measurably slower anyway (and it will still be
>>>>> better than picking a 0%-right node).
>>>>>
>>>>> What you're saying is that because the active list in the pagecache
>>>>> won't differentiate between 10 cache hits and 20 cache hits, we should
>>>>> drop the active list, stop activating pages and just treat them all
>>>>> the same, because in some unlucky access pattern the active list may
>>>>> only get 1% of the working set right. But there's a reason why the
>>>>> active list exists despite getting things wrong in some corner case
>>>>> and possibly leaving the large amount of infrequently accessed pages
>>>>> in the inactive list forever (even if it gets things only 1% right in
>>>>> those worst cases, it's still better than 0% right and no active list
>>>>> at all).
>>>>>
>>>>> To say it another way, you may still crash the car even if you're
>>>>> careful, but do you think it's better to watch the street or to drive
>>>>> blindfolded?
>>>>>
>>>>> numa/sched drives blindfolded; autonuma watches around every 10sec
>>>>> very carefully for the best next turn to take with the car and for
>>>>> obstacles to avoid. You can imagine who wins.
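As a rough picture of the per-thread sampling described above, a numa
hinting fault histogram might look like the sketch below. Hypothetical
names, not the real task_autonuma internals: each hinting fault records
which node the touched page lives on, and at the end of the interval
the histogram approximates the thread's NUMA memory footprint.

#include <stdio.h>

#define MAX_NUMNODES 4

struct task_numa_sample {
	unsigned long faults[MAX_NUMNODES];
};

/* Called from the (simulated) numa hinting fault path. */
static void numa_hinting_fault(struct task_numa_sample *s, int page_nid)
{
	s->faults[page_nid]++;
}

/* End of the interval: prefer the node where most faults landed. */
static int preferred_node(const struct task_numa_sample *s)
{
	int nid, best = 0;

	for (nid = 1; nid < MAX_NUMNODES; nid++)
		if (s->faults[nid] > s->faults[best])
			best = nid;
	return best;
}

int main(void)
{
	struct task_numa_sample s = { { 0 } };

	/* Simulate one 10sec interval: most accesses hit pages on node 2. */
	numa_hinting_fault(&s, 0);
	numa_hinting_fault(&s, 2);
	numa_hinting_fault(&s, 2);
	numa_hinting_fault(&s, 2);

	printf("preferred node: %d\n", preferred_node(&s));
	return 0;
}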
>>>>> Watching the street carefully every 10sec doesn't mean that the next
>>>>> moment a missile won't hit your car and make you crash; you still
>>>>> have better chances of not crashing than by driving blindfolded.
>>>>>
>>>>> numa/sched pretends to compete without collecting information on the
>>>>> NUMA thread memory footprint (task_autonuma, sampled with an
>>>>> exponential backoff at 10sec intervals), and without process
>>>>> information (full static information from the pagetables, not
>>>>> sampled). No matter how you compute stuff, if you've got nothing
>>>>> meaningful as input to your algorithm, you lose. And it looks like
>>>>> you believe that you can take better decisions with nothing as input
>>>>> to your NUMA placement algorithm, because my thread info
>>>>> (task_autonuma) isn't 100% perfect at all times and can't predict the
>>>>> future. The alternative is to get that information from syscalls, but
>>>>> even ignoring the -ENOMEM from split_vma, that will lead to userland
>>>>> bugs, and overall the task_autonuma information may be more reliable
>>>>> in the end, even if it's sampled using an exponential backoff.
>>>>>
>>>>> Also note the exponential backoff: it's not really just the last
>>>>> interval, it's the last interval plus half the previous interval plus
>>>>> 1/4 of the interval before that, etc., and we can trivially control
>>>>> the decay.
>>>>>
>>>>> All we need is to get a direction, and knowing _exactly_ what the
>>>>> task did over the last 10 seconds (even if that can't predict what
>>>>> the thread will do in the next 1sec) is all we need to get one. After
>>>>> we take the direction, the memory will follow, so we couldn't care
>>>>> less what it does in the next second, because that will follow the
>>>>> CPU (after a while, last_nid anti-false-sharing logic permitting),
>>>>> and at least we'll know for sure that the memory accessed in the last
>>>>> 10sec is already local, and that defines the best node to schedule
>>>>> the thread on.
>>>>>
>>>>> I don't mean there's no room for improvement in the way the input
>>>>> data is computed, or even in the way the input data is generated (the
>>>>> exponential backoff decay can be tuned too). I just tried to do the
>>>>> simplest computations on the data to make the workloads converge
>>>>> fast, and you're welcome to contribute.
>>>>>
>>>>> But I believe the task_autonuma information is extremely valuable and
>>>>> we can trust it very much, knowing we'll get a great placement. The
>>>>> concern you have isn't invalid, but it's a very minor one, and the
>>>>> sampling rate effects you are concerned about, while real, are lost
>>>>> in the noise in practice.
>>>>
>>>> Well, I am not convinced by all these words. And surely I will NOT
>>>> follow your reasoning of "having information is always better than
>>>> nothing". We all know that a badly biased balancing is worse than
>>>> randomness: at least randomness means "average, fair play, ...".
>>>> With all these uncertain things, I think only a comprehensive survey
>>>> of real-world workloads can tell whether my concern is significant or
>>>> not.
>>>>
>>>> So I think my suggestion to you is: show the world some solid and sound
>
> ^^^^^^^^^^^^^
>>>> real-world proof that your approximation is > 90% accurate, just like
>>>
>>> The cover letter contained a link to the performance numbers:
>>> https://www.kernel.org/pub/linux/kernel/people/andrea/autonuma/autonuma_bench-20120530.pdf
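The exponential backoff described above is the classic halving
accumulator: at each sampling interval the accumulated statistic is
halved before the new sample is added, so the last interval counts
fully, the one before it by 1/2, the one before that by 1/4, and so on.
A minimal sketch, with hypothetical names:

#include <stdio.h>

/* acc = sample_n + sample_(n-1)/2 + sample_(n-2)/4 + ... */
static unsigned long decay_add(unsigned long acc, unsigned long sample)
{
	return acc / 2 + sample;
}

int main(void)
{
	unsigned long faults_per_interval[] = { 800, 400, 1000 };
	unsigned long acc = 0;
	unsigned int i;

	for (i = 0; i < 3; i++) {
		acc = decay_add(acc, faults_per_interval[i]);
		printf("interval %u: acc = %lu\n", i, acc);
	}
	/* Final acc = 1000 + 400/2 + 800/4 = 1400. */
	return 0;
}

Halving the divisor in decay_add (or halving less often) is the trivial
knob for controlling how fast old intervals fade.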
>> Yes, I saw this. But if you consider this already a solid and
>> comprehensive proof, you win; I have no other words to say.
>
> No one said there is a proof; on the contrary, I said it's possible to
> beat any heuristic algorithm, and Andrea explained the LRU is one such
> too.
>
> You asked above for a real-world example and that's what Andrea was
> trying to provide (note that it includes a tiny regression with a
> parallel kernel compile on tmpfs).
>
>>> It includes SPECjbb, kernel build, cpuHog in guests, and a handful of
>>> unit tests.
>>>
>>> I'm sure anyone can beat most kernel algorithms with some pathological
>>> case, including the LRU and CFS. The only way to improve the numa
>>> balancing stuff is
>>
>> As I already said, the pathological cases for the LRU have been well
>> understood for decades, so they are quite reasonable to ignore, and
>> every programmer has been taught to avoid them. That conclusion took
>> much time from many talented brains.
>
> Who are these programmers that you talk about? The average Java
> programmer is clueless w.r.t. memory allocation. Even with KVM we have
> the issue of a double swap storm when both the guest and the host have
> the same page on their LRU lists.
>
>> But the problem with this algorithm is not well understood. And you are
>> drawing a hasty conclusion about it without bothering to do
>> comprehensive research.
>>
>> "Collect the data from a wide range of pages occasionally, and then do
>> condensed computing on a small set of pages" looks like a very common
>> practice to me. But again, if you simply label this as "minor", I have
>> no other words to say.
>>
>>> to sample more, meaning faulting more == larger overhead.
>>
>> Are you sure you really want to compete on sampling speed with
>> CPU-intensive workloads?
>
> You didn't understand my point: I was saying exactly this, that it's not
> worth sampling more because it carries a huge overhead.
> Pls don't be that fast on the 'send' trigger :)
>
>> OK, I think I'll stop discussing this topic now. Without strict and
>> comprehensive research on this topic, further arguments seem to me to
>> be purely based on imagination.
>>
>> And I have no interest in beating any of your fancy algorithms; it
>> wouldn't bring me 1G$. I am just curious about the truth.
>
> No one said it's fancy besides you. I actually proposed a way to relax
> such false migrations in my previous reply.
>
>> Basically, anyone has the right to laugh, if W = x * y and you only
>
> Laughing while discussing numa code is underestimated!
> Let's not continue to spam the list; I think we've all made our points.
> Dor

Note, my laughing was based on your attitude of "anyone can beat
anything", which is also underestimated. I remember that despite my
criticism of the sampling, I was quite friendly to Andrea, so why the
acidness/bitterness?

But you are right. We all made our points. Let's stop it.