From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S932674AbeARXuK (ORCPT ); Thu, 18 Jan 2018 18:50:10 -0500
Received: from mail-wm0-f66.google.com ([74.125.82.66]:36149 "EHLO
	mail-wm0-f66.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S932394AbeARXuD (ORCPT ); Thu, 18 Jan 2018 18:50:03 -0500
X-Google-Smtp-Source: ACJfBosbRAbtI5lvjei4SMu6zfWGG7cZloen0zamt8uuzwreaaWZA7GHRWAbjR8caDLMaud4+DYOOg==
Date: Fri, 19 Jan 2018 02:49:55 +0300
From: "Kirill A. Shutemov"
To: Linus Torvalds, Peter Zijlstra
Cc: Andrea Arcangeli, Dave Hansen, Tetsuo Handa, "Kirill A. Shutemov",
	Andrew Morton, Johannes Weiner, Joonsoo Kim, Mel Gorman, Tony Luck,
	Vlastimil Babka, Michal Hocko, "hillf.zj", Hugh Dickins,
	Oleg Nesterov, Rik van Riel, Srikar Dronamraju, Vladimir Davydov,
	Ingo Molnar, Linux Kernel Mailing List, linux-mm,
	the arch/x86 maintainers
Subject: Re: [mm 4.15-rc8] Random oopses under memory pressure.
Message-ID: <20180118234955.nlo55rw2qsfnavfm@node.shutemov.name>
References: <201801170233.JDG21842.OFOJMQSHtOFFLV@I-love.SAKURA.ne.jp>
	<201801172008.CHH39543.FFtMHOOVSQJLFO@I-love.SAKURA.ne.jp>
	<201801181712.BFD13039.LtHOSVMFJQFOFO@I-love.SAKURA.ne.jp>
	<20180118122550.2lhsjx7hg5drcjo4@node.shutemov.name>
	<20180118145830.GA6406@redhat.com>
	<20180118165629.kpdkezarsf4qymnw@node.shutemov.name>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To:
User-Agent: NeoMutt/20171215
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, Jan 18, 2018 at 09:26:25AM -0800, Linus Torvalds wrote:
> On Thu, Jan 18, 2018 at 8:56 AM, Kirill A. Shutemov wrote:
> >
> > I can't say I fully grasp how 'diff' got this value and how it leads to both
> > checks being false.
> >
> > I think the problem is that page difference when they are in different sections.
>
> When you do
>
>     pte_page(*pvmw->pte) - pvmw->page
>
> then the compiler takes the pointer difference, and then divides by
> the size of "struct page" to get an index.
>
> But - and this is important - it does so knowing that the division it
> does will have no modulus: the two 'struct page *' pointers are really
> in the same array, and they really are 'n*sizeof(struct page)' apart
> for some 'n'.
>
> That means that the compiler can optimize the division. In fact, for
> this case, gcc will generate
>
>     subl %ebx, %eax
>     sarl $3, %eax
>     imull $-858993459, %eax, %eax
>
> because 'struct page' is 40 bytes in size, and that magic sequence
> happens to divide by 40 (first divide by 8, then that magical "imull"
> will divide by 5 *IFF* the thing is evenly divisible by 5 (and not too
> big - but the shift guarantees that).
>
> Basically, it's a magic trick, because real divides are very
> expensive, but you can fake them more quickly if you can limit the
> input domain.
>
> But what does it mean if the two "struct page *" are not in the same
> array, and the two arrays were allocated not aligned exactly 40 bytes
> away, but some random number of pages away?
>
> You get *COMPLETE*GARBAGE* when you do the above optimized divide.
> Suddenly the divide had a modulus (because the base of the two arrays
> weren't 40-byte aligned), and the "trick" doesn't work.
>
> So that's why you can't do pointer diffs between two arrays. Not
> because you can't subtract the two pointers, but because the
> *division* part of the C pointer diff rules leads to issues.

Thanks a lot for the explanation!

I wonder if this may be a problem in other places?

For instance, perf uses the address of a mutex to determine the lock
ordering. See mutex_lock_double(). The mutex is embedded into
struct perf_event_context, which is allocated with kzalloc(), so I don't
see how we can presume that alignment is consistent between them.

I don't think it's the only example in the kernel. Are we just lucky?
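
To make the arithmetic in the quoted explanation concrete, here is a minimal
user-space sketch of the shift-and-multiply sequence. It is not kernel code:
exact_div40(), inv5_is_valid() and INV5 are names made up for illustration,
and it uses unsigned 32-bit values so the wrap-around is well defined. It
assumes only what the quoted text states: a 40-byte element and the constant
-858993459, which is 0xCCCCCCCD as an unsigned 32-bit value.

```c
#include <assert.h>
#include <stdint.h>

/* -858993459 reinterpreted as unsigned: the inverse of 5 modulo 2^32. */
#define INV5 ((uint32_t)0xCCCCCCCDu)

/* Sanity check: 5 * INV5 wraps around to exactly 1 in 32-bit arithmetic. */
static int inv5_is_valid(void)
{
    return (uint32_t)(5u * INV5) == 1u;
}

/*
 * Hypothetical helper mirroring the sarl/imull pair for a non-negative
 * byte difference: shift right by 3 (divide by 8), then multiply by the
 * inverse of 5. The result equals byte_diff / 40 only when byte_diff is
 * an exact multiple of 40.
 */
static uint32_t exact_div40(uint32_t byte_diff)
{
    return (uint32_t)((byte_diff >> 3) * INV5);
}
```

With a byte difference of 400 (ten 40-byte elements) the helper returns 10,
as expected. With a difference of 4096, which is what two arrays one 4K page
apart would give, 4096 / 8 = 512 is not divisible by 5 and the helper returns
0x99999A00 instead of anything near 102.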

--
 Kirill A. Shutemov