From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-17.8 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,NICE_REPLY_A,SPF_HELO_NONE,SPF_PASS,UNPARSEABLE_RELAY, USER_AGENT_SANE_1 autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 736C0C4338F for ; Thu, 19 Aug 2021 06:31:47 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id A3B12610A5 for ; Thu, 19 Aug 2021 06:31:46 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.4.1 mail.kernel.org A3B12610A5 Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=linux.alibaba.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=kvack.org Received: by kanga.kvack.org (Postfix) id 2958D6B006C; Thu, 19 Aug 2021 02:31:46 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 21CE66B0071; Thu, 19 Aug 2021 02:31:46 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 0E5156B0072; Thu, 19 Aug 2021 02:31:46 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0023.hostedemail.com [216.40.44.23]) by kanga.kvack.org (Postfix) with ESMTP id E2D206B006C for ; Thu, 19 Aug 2021 02:31:45 -0400 (EDT) Received: from smtpin36.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay03.hostedemail.com (Postfix) with ESMTP id 861DE824C439 for ; Thu, 19 Aug 2021 06:31:45 +0000 (UTC) X-FDA: 78490859370.36.4C92DBD Received: from out30-45.freemail.mail.aliyun.com (out30-45.freemail.mail.aliyun.com [115.124.30.45]) by imf30.hostedemail.com (Postfix) with ESMTP id F0809E0056F2 for ; Thu, 19 Aug 2021 06:31:43 +0000 (UTC) X-Alimail-AntiSpam:AC=PASS;BC=-1|-1;BR=01201311R511e4;CH=green;DM=||false|;DS=||;FP=0|-1|-1|-1|0|-1|-1|-1;HT=e01e04420;MF=xuyu@linux.alibaba.com;NM=1;PH=DS;RN=8;SR=0;TI=SMTPD_---0UjyViR7_1629354694; Received: from xuyu-mbp15.local(mailfrom:xuyu@linux.alibaba.com fp:SMTPD_---0UjyViR7_1629354694) by smtp.aliyun-inc.com(127.0.0.1); Thu, 19 Aug 2021 14:31:40 +0800 Subject: Re: [PATCH] mm/swap: consider max pages in iomap_swapfile_add_extent To: "Darrick J. Wong" Cc: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, hch@infradead.org, riteshh@linux.ibm.com, tytso@mit.edu, gavin.dg@linux.alibaba.com, linux-btrfs References: <20210818170152.GG12664@magnolia> <20210819014518.GD12597@magnolia> From: Yu Xu Message-ID: <592c57a9-3ef1-807d-fab1-14376928c47a@linux.alibaba.com> Date: Thu, 19 Aug 2021 14:31:33 +0800 User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Thunderbird/78.13.0 MIME-Version: 1.0 In-Reply-To: <20210819014518.GD12597@magnolia> Content-Type: text/plain; charset=utf-8; format=flowed Content-Language: en-US Content-Transfer-Encoding: 7bit Authentication-Results: imf30.hostedemail.com; dkim=none; dmarc=pass (policy=none) header.from=alibaba.com; spf=pass (imf30.hostedemail.com: domain of xuyu@linux.alibaba.com designates 115.124.30.45 as permitted sender) smtp.mailfrom=xuyu@linux.alibaba.com X-Stat-Signature: qftz6yhwz8d8rb9kioiguq6tdkxa6nj3 X-Rspamd-Queue-Id: F0809E0056F2 X-Rspamd-Server: rspam05 X-HE-Tag: 1629354703-19582 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On 8/19/21 9:45 AM, Darrick J. Wong wrote: > On Thu, Aug 19, 2021 at 08:59:22AM +0800, Yu Xu wrote: >> On 8/19/21 1:01 AM, Darrick J. Wong wrote: >>> On Wed, Aug 18, 2021 at 07:36:51PM +0800, Xu Yu wrote: >>>> When the max pages (last_page in the swap header + 1) is smaller than >>>> the total pages (inode size) of the swapfile, iomap_swapfile_activate >>>> overwrites sis->max with total pages. >>>> >>>> However, frontswap_map is allocated using max pages, but cleared using >>>> sis->max in __frontswap_invalidate_area(). The consequence is that the >>>> neighbors of frontswap_map bitmap may be overwritten, and then slab is >>>> polluted. >>>> >>>> This fixes the issue by considering the limitation of max pages of swap >>>> info in iomap_swapfile_add_extent(). >>>> >>>> To reproduce the case, compile kernel with slub RED ZONE, then run test: >>>> $ sudo stress-ng -a 1 -x softlockup,resources -t 72h --metrics --times \ >>>> --verify -v -Y /root/tmpdir/stress-ng/stress-statistic-12.yaml \ >>>> --log-file /root/tmpdir/stress-ng/stress-logfile-12.txt \ >>>> --temp-path /root/tmpdir/stress-ng/ >> >> Hi, Darrick J. Wong, >> >> This is the stress-ng test script from our nightly test, I just copied >> as it is. >> >>> >>> 72 hours? That's not really a targeted reproducer test. >> >> Actually, the system crashes soon, in less than 5 minutes on average. > > Hehe. > > I'm about to send out a regression fstest that (I think!) reproduces the > validation error. Would you mind running it in your environment to > confirm that it trips the same memory debugger warning? Hi, Darrick J. Wong, Things are more complicated than I thought. I thought that slab redzone overwritten happens at __frontswap_invalidate_area() when swapoff, and your testcase should easily reproduce the debugger warning. However, it did not. Then I looked into the codes, and find that sis->max is reset to 0 before frontswap_invalidate_area(). And the slab redzone overwritten should happen at __frontswap_clear(sis,offset) in __frontswap_invalidate_page(). I did not reproduce the debugger warning solely with your testcase, but your testcase should have the same effect as stress-swap.c in stress-ng testsuite, i.e., maxpages in swap header != file size. I think to replace stress-swap with your testcase should have the same result, but I am not going to dig into it. To summarize, 1) I think that your testcase makes sense, as Zorro Lang commented. 2) I will send PATCH v2 to correct the commit message. > > Also ccing the btrfs list because the regression test fails on btrfs in > the same way that iomap failed. :( > > --D > >>> >>> Oh, very interesting! maxpages comes from the swap header. >>> >>> # fallocate -l 100m /mnt/a >>> # mkswap /mnt/a >>> mkswap: /mnt/a: insecure permissions 0644, 0600 suggested. >>> Setting up swapspace version 1, size = 100 MiB (104853504 bytes) >>> no label, UUID=c6c93d94-3b91-47cc-9684-9c0edbd04759 >>> # swapon /mnt/a >>> swapon: /mnt/a: insecure permissions 0644, 0600 suggested. >>> [ 68.267848] Adding 102396k swap on /mnt/a. Priority:-2 extents:1 across:102396k SS >>> >>> >>> Ok, ~100M of swap... >>> >>> # swapoff /mnt/a >>> # fallocate -l 200m /mnt/a >>> # swapon /mnt/a >>> swapon: /mnt/a: insecure permissions 0644, 0600 suggested. >>> [ 84.458713] Adding 204796k swap on /mnt/a. Priority:-2 extents:1 across:204796k SS >>> >>> Huh. 200MB of swap, even though we haven't touched the swap header: >>> >>> # dd if=/mnt/a bs=4096 count=1 | od -tx1 -Ad -c >>> 0000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >>> \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 >>> * >>> 0001024 01 00 00 00 ff 63 00 00 00 00 00 00 66 40 eb 42 >>> 001 \0 \0 \0 377 c \0 \0 \0 \0 \0 \0 f @ 353 B >>> >>> Version 1 swap header, 0x63ff pages (~102396kb), 0 bad pages... >>> >>> 0001040 8e 81 4e 4e 92 a1 f9 98 d9 43 2f cc 00 00 00 00 >>> 216 201 N N 222 241 371 230 331 C / 314 \0 \0 \0 \0 >>> 0001056 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >>> \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 >>> * >>> 0004080 00 00 00 00 00 00 53 57 41 50 53 50 41 43 45 32 >>> \0 \0 \0 \0 \0 \0 S W A P S P A C E 2 >>> 0004096 >>> >>> So yes, iomap_swapfile_* is incorrect. Even better, we have a targeted >>> way to test this now. :) >> >> Yes, this is more straightforward. I used to focus on the slab >> corruption too much. >> >>> >>>> We'll get the error log as below: >>>> >>>> [ 1151.015141] ============================================================================= >>>> [ 1151.016489] BUG kmalloc-16 (Not tainted): Right Redzone overwritten >>>> [ 1151.017486] ----------------------------------------------------------------------------- >>>> [ 1151.017486] >>>> [ 1151.018997] Disabling lock debugging due to kernel taint >>>> [ 1151.019873] INFO: 0x0000000084e43932-0x0000000098d17cae @offset=7392. First byte 0x0 instead of 0xcc >>>> [ 1151.021303] INFO: Allocated in __do_sys_swapon+0xcf6/0x1170 age=43417 cpu=9 pid=3816 >>>> [ 1151.022538] __slab_alloc+0xe/0x20 >>>> [ 1151.023069] __kmalloc_node+0xfd/0x4b0 >>>> [ 1151.023704] __do_sys_swapon+0xcf6/0x1170 >>>> [ 1151.024346] do_syscall_64+0x33/0x40 >>>> [ 1151.024925] entry_SYSCALL_64_after_hwframe+0x44/0xa9 >>>> [ 1151.025749] INFO: Freed in put_cred_rcu+0xa1/0xc0 age=43424 cpu=3 pid=2041 >>>> [ 1151.026889] kfree+0x276/0x2b0 >>>> [ 1151.027405] put_cred_rcu+0xa1/0xc0 >>>> [ 1151.027949] rcu_do_batch+0x17d/0x410 >>>> [ 1151.028566] rcu_core+0x14e/0x2b0 >>>> [ 1151.029084] __do_softirq+0x101/0x29e >>>> [ 1151.029645] asm_call_irq_on_stack+0x12/0x20 >>>> [ 1151.030381] do_softirq_own_stack+0x37/0x40 >>>> [ 1151.031037] do_softirq.part.15+0x2b/0x30 >>>> [ 1151.031710] __local_bh_enable_ip+0x4b/0x50 >>>> [ 1151.032412] copy_fpstate_to_sigframe+0x111/0x360 >>>> [ 1151.033197] __setup_rt_frame+0xce/0x480 >>>> [ 1151.033809] arch_do_signal+0x1a3/0x250 >>>> [ 1151.034463] exit_to_user_mode_prepare+0xcf/0x110 >>>> [ 1151.035242] syscall_exit_to_user_mode+0x27/0x190 >>>> [ 1151.035970] entry_SYSCALL_64_after_hwframe+0x44/0xa9 >>>> [ 1151.036795] INFO: Slab 0x000000003b9de4dc objects=44 used=9 fp=0x00000000539e349e flags=0xfffffc0010201 >>>> [ 1151.038323] INFO: Object 0x000000004855ba01 @offset=7376 fp=0x0000000000000000 >>>> [ 1151.038323] >>>> [ 1151.039683] Redzone 000000008d0afd3d: cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc ................ >>>> [ 1151.041180] Object 000000004855ba01: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ >>>> [ 1151.042714] Redzone 0000000084e43932: 00 00 00 c0 cc cc cc cc ........ >>>> [ 1151.044120] Padding 000000000864c042: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZZZZZZZZZ >>>> [ 1151.045615] CPU: 5 PID: 3816 Comm: stress-ng Tainted: G B 5.10.50+ #7 >>>> [ 1151.046846] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014 >>>> [ 1151.048633] Call Trace: >>>> [ 1151.049072] dump_stack+0x57/0x6a >>>> [ 1151.049585] check_bytes_and_report+0xed/0x110 >>>> [ 1151.050320] check_object+0x1eb/0x290 >>>> [ 1151.050924] ? __x64_sys_swapoff+0x39a/0x540 >>>> [ 1151.051646] free_debug_processing+0x151/0x350 >>>> [ 1151.052333] __slab_free+0x21a/0x3a0 >>>> [ 1151.052938] ? _cond_resched+0x2d/0x40 >>>> [ 1151.053529] ? __vunmap+0x1de/0x220 >>>> [ 1151.054139] ? __x64_sys_swapoff+0x39a/0x540 >>>> [ 1151.054796] ? kfree+0x276/0x2b0 >>>> [ 1151.055307] kfree+0x276/0x2b0 >>>> [ 1151.055832] __x64_sys_swapoff+0x39a/0x540 >>>> [ 1151.056466] do_syscall_64+0x33/0x40 >>>> [ 1151.057084] entry_SYSCALL_64_after_hwframe+0x44/0xa9 >>>> [ 1151.057866] RIP: 0033:0x150340b0ffb7 >>>> [ 1151.058481] Code: Unable to access opcode bytes at RIP 0x150340b0ff8d. >>>> [ 1151.059537] RSP: 002b:00007fff7f4ee238 EFLAGS: 00000246 ORIG_RAX: 00000000000000a8 >>>> [ 1151.060768] RAX: ffffffffffffffda RBX: 00007fff7f4ee66c RCX: 0000150340b0ffb7 >>>> [ 1151.061904] RDX: 000000000000000a RSI: 0000000000018094 RDI: 00007fff7f4ee860 >>>> [ 1151.063033] RBP: 00007fff7f4ef980 R08: 0000000000000000 R09: 0000150340a672bd >>>> [ 1151.064135] R10: 00007fff7f4edca0 R11: 0000000000000246 R12: 0000000000018094 >>>> [ 1151.065253] R13: 0000000000000005 R14: 000000000160d930 R15: 00007fff7f4ee66c >>>> [ 1151.066413] FIX kmalloc-16: Restoring 0x0000000084e43932-0x0000000098d17cae=0xcc >>>> [ 1151.066413] >>>> [ 1151.067890] FIX kmalloc-16: Object at 0x000000004855ba01 not freed >>>> >>>> Fixes: 0e6895ba00b7 ("ext4: implement swap_activate aops using iomap") >>>> Signed-off-by: Gang Deng >>>> Signed-off-by: Xu Yu >>> >>> Looks good; I'll cc you both when I send out the regression test. >>> Reviewed-by: Darrick J. Wong >>> >>> --D >>> >>>> --- >>>> fs/iomap/swapfile.c | 6 ++++++ >>>> 1 file changed, 6 insertions(+) >>>> >>>> diff --git a/fs/iomap/swapfile.c b/fs/iomap/swapfile.c >>>> index 6250ca6a1f85..4ecf4e1f68ef 100644 >>>> --- a/fs/iomap/swapfile.c >>>> +++ b/fs/iomap/swapfile.c >>>> @@ -31,11 +31,16 @@ static int iomap_swapfile_add_extent(struct iomap_swapfile_info *isi) >>>> { >>>> struct iomap *iomap = &isi->iomap; >>>> unsigned long nr_pages; >>>> + unsigned long max_pages; >>>> uint64_t first_ppage; >>>> uint64_t first_ppage_reported; >>>> uint64_t next_ppage; >>>> int error; >>>> + if (unlikely(isi->nr_pages >= isi->sis->max)) >>>> + return 0; >>>> + max_pages = isi->sis->max - isi->nr_pages; >>>> + >>>> /* >>>> * Round the start up and the end down so that the physical >>>> * extent aligns to a page boundary. >>>> @@ -48,6 +53,7 @@ static int iomap_swapfile_add_extent(struct iomap_swapfile_info *isi) >>>> if (first_ppage >= next_ppage) >>>> return 0; >>>> nr_pages = next_ppage - first_ppage; >>>> + nr_pages = min(nr_pages, max_pages); >>>> /* >>>> * Calculate how much swap space we're adding; the first page contains >>>> -- >>>> 2.20.1.2432.ga663e714 >>>> >> >> -- >> Thanks, >> Yu -- Thanks, Yu