Date: Sun, 12 Jul 2020 22:53:54 -0400
From: Joel Fernandes
To: Linus Torvalds
Cc: Naresh Kamboju, linux-stable, open list, linux-mm, Arnd Bergmann,
    Andrew Morton, Roman Gushchin, Michal Hocko, lkft-triage@lists.linaro.org,
    Chris Down, Michel Lespinasse, Fan Yang, Brian Geffon, Anshuman Khandual,
    Will Deacon, Catalin Marinas, pugaowei@gmail.com, Jerome Glisse,
    Greg Kroah-Hartman, Mel Gorman, Hugh Dickins, Al Viro, Tejun Heo,
    Sasha Levin
Subject: Re: WARNING: at mm/mremap.c:211 move_page_tables in i386
Message-ID: <20200713025354.GB3644504@google.com>
References: <20200712215041.GA3644504@google.com>

Hi Linus,

On Sun, Jul 12, 2020 at 03:58:06PM -0700, Linus Torvalds wrote:
> On Sun, Jul 12, 2020 at 2:50 PM Joel Fernandes wrote:
> >
> > I reproduced Naresh's issue on a 32-bit x86 machine and the below
> > patch fixes it. The issue is solely within execve() itself and the
> > way it allocates/copies the temporary stack.
> >
> > It is actually indeed an overlapping case because the length of the
> > stack is big enough to cause overlap. The VMA grows quite a bit
> > because of all the page faults that happen due to the copy of the
> > args/env. Then during the move of the overlapped region, it finds
> > that a PMD is already allocated.
>
> Oh, ok, I think I see.
>
> So the issue is that while move_normal_pmd() _itself_ will be clearing
> the old pmd entries when it copies them, the 4kB copies that happened
> _before_ this time will not have cleared the old pmd that they were
> in.
>
> So we've had a deeper stack, and we've already copied some of it one
> page at a time, and we're done with those 4kB entries, but now we hit
> a 2MB-aligned case and want to move that down. But when it does that,
> it hits the (by now hopefully empty) pmd that used to contain the 4kB
> entries we copied earlier.
>
> So we've cleared the page table, but we've simply never called
> pgtable_clear() here, because the page table got cleared in
> move_ptes() when it did that
>
>         pte = ptep_get_and_clear(mm, old_addr, old_pte);
>
> on the previous old pte entries.
>
> That makes sense to me. Does that match what you see? Because when I
> said it wasn't overlapped, I was really talking about that one single
> pmd move itself not being overlapped.

This matches exactly what you described. Here is some analysis with
specific numbers:

Some time during execve(), copy_strings() causes page faults. While
this happens the VMA is growing, and a new PMD is allocated during the
page fault:

  copy_strings: Copying args/env old process's memory 8067420 to new stack bfb0eec6 (len 4096)
  handle_mm_fault: vma: bfb0d000 c0000000
  handle_mm_fault: pmd_alloc: ad: bfb0dec6 ptr: f46d0bf8

Here we see that the pmd entry's (pmd_t *) pointer value is f46d0bf8
and the fault was on address bfb0dec6. Similarly, I noted the pattern
of the other copy_strings() faults and built the following mapping:

  address space            pmd entry pointer
  0xbfe00000-0xc0000000    f4698ff8
  0xbfc00000-0xbfe00000    f4698ff0
  0xbfa00000-0xbfc00000    f4698fe8

This is all from tracing copy_strings(). Later, the kernel decides to
move the VMA down. The VMA's total size is 5MB. The stack is initially
at a 1MB-aligned address, 0xbfb00000. exec requests move_page_tables()
to move it down by 4MB, that is, to 0xbf700000, which is also 1MB
aligned. This is an overlapping move.

When move_page_tables() starts, I see the following prints before the
warning fires. The plan of attack is to first copy 1MB using
move_ptes(), like you said.
Then it hits a 2MB-aligned boundary and starts move_normal_pmd(). For
both move_ptes() and move_normal_pmd(), a pmd_alloc() happens first,
which is printed below:

  move_page_tables: move_page_tables old=bfb00000 (len:5242880) new=bf700000
  move_page_tables: pmd_alloc: ad: bf700000 ptr: f4698fd8
  move_page_tables: move_ptes old=bfb00000->bfc00000 new=bf700000
  move_page_tables: pmd_alloc: ad: bf800000 ptr: f4698fe0
  move_page_tables: move_normal_pmd: old: bfc00000-c0000000 new: bf800000 (val: 0)
  move_page_tables: pmd_alloc: ad: bfa00000 ptr: f4698fe8
  move_page_tables: move_normal_pmd: old: bfe00000-c0000000 new: bfa00000 (val: 34164067)

So basically, it did 1MB worth of move_ptes() and twice 2MB worth of
move_normal_pmd(). Since the shift was only 4MB, it hit an already
allocated PMD during the second move_normal_pmd(). The warning fires
because that move_normal_pmd() sees its destination pmd was already
allocated. As you said, this is harmless.

One thing I observed by code reading is that move_page_tables() is (in
the mremap case) only called for non-overlapping moves, through
mremap_to() or move_vma() as the case may be. It makes me nervous that
we are doing an overlapping move and could be causing some bug
inadvertently. Could you let me know if there is any reason why we
should not make the new stack's location non-overlapping, just to keep
things simple? (Assuming we fix the issues related to the overriding
you mentioned below.)

> > The below patch fixes it and is not warning anymore in 30 minutes
> > of testing so far.
>
> So I'm not hugely happy with the patch, I have to admit.
>
> Why?
>
> Because:
>
> > +	/* Ensure the temporary stack is shifted by atleast its size */
> > +	if (stack_shift < (vma->vm_end - vma->vm_start))
> > +		stack_shift = (vma->vm_end - vma->vm_start);
>
> This basically overrides the random stack shift done by arch_align_stack().
>
> Yeah, yeah, arch_align_stack() is kind of misnamed.
> It _used_ to do what the name implies it does, which on x86 is to
> just give the minimum ABI-specified 16-byte stack alignment.
>
> But these days, what it really does is say "align the stack pointer,
> but also shift it down by a random amount within 8kB so that the start
> of the stack isn't some repeatable thing that an attacker can
> trivially control with the right argv/envp size.."

Got it, thanks. I will work on improving the patch along these lines.

> I don't think it works right anyway because we then PAGE_ALIGN() it,
> but whatever. :)
>
> But it looks like what you're doing means that now the size of the
> stack will override that shift, and that means that now the
> randomization is entirely undone. No?

Yes, true. I will work on fixing it.

> Plus you do it after we've also checked that we haven't grown the
> stack down to below mmap_min_addr().
>
> But maybe I misread that patch.
>
> But I do feel like you figured out why the bug happened, now we're
> just discussing whether the patch is the right thing to do.

Yes.

> Maybe saying "doing the pmd copies for the initial stack isn't
> important, so let's just note this as a special case and get rid of
> the WARN_ON()" might be an alternative solution.

Personally, I feel it is better to keep the warning, just so that we
can detect any bugs in the future.

thanks,

- Joel