Date: Mon, 12 Apr 2021 04:03:01 +0000
From: Dennis Zhou
To: Wang Yugui
Cc: Vlastimil Babka, linux-mm@kvack.org, linux-btrfs@vger.kernel.org
Subject: Re: unexpected -ENOMEM from percpu_counter_init()
References: <20210411000846.9CC6.409509F4@e16-tech.com> <20210411232000.BF15.409509F4@e16-tech.com>
In-Reply-To: <20210411232000.BF15.409509F4@e16-tech.com>

On Sun, Apr 11, 2021 at 11:20:00PM +0800, Wang Yugui wrote:
> Hi, Dennis Zhou
>
> > Hi,
> >
> > > On Sat, Apr 10, 2021 at 11:29:17PM +0800, Wang Yugui wrote:
> > > > Hi, Dennis Zhou
> > > >
> > > > Thanks for your nice answer.
> > > > but still a few questions.
> > > >
> > > > > Percpu is not really cheap memory to allocate because it has an
> > > > > amplification factor of NR_CPUS. As a result, percpu on the critical
> > > > > path is really not something that is expected to be high throughput.
> > > >
> > > > > Ideally things like btrfs snapshots should preallocate a number of these
> > > > > and not try to do atomic allocations because that in theory could fail
> > > > > because even after we go to the page allocator in the future we can't
> > > > > get enough pages due to needing to go into reclaim.
> > > >
> > > > pre-allocate in module such as mempool_t is just used in a few places in
> > > > linux/fs. so most people like system wide pre-allocate, because it is
> > > > easier to use?
> > > >
> > > > can we add more chances to manage the system wide pre-alloc
> > > > just like this?
> > > >
> > > > diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
> > > > index dc1f4dc..eb3f592 100644
> > > > --- a/include/linux/sched/mm.h
> > > > +++ b/include/linux/sched/mm.h
> > > > @@ -226,6 +226,11 @@ static inline void memalloc_noio_restore(unsigned int flags)
> > > >  static inline unsigned int memalloc_nofs_save(void)
> > > >  {
> > > >  	unsigned int flags = current->flags & PF_MEMALLOC_NOFS;
> > > > +
> > > > +	// just like slab_pre_alloc_hook
> > > > +	fs_reclaim_acquire(current->flags & gfp_allowed_mask);
> > > > +	fs_reclaim_release(current->flags & gfp_allowed_mask);
> > > > +
> > > >  	current->flags |= PF_MEMALLOC_NOFS;
> > > >  	return flags;
> > > >  }
> > > >
> > > >
> > > > > The workqueue approach has been good enough so far. Technically there is
> > > > > a higher priority workqueue that this work could be scheduled on, but
> > > > > save for this miss on my part, the system workqueue has worked out fine.
> > > >
> > > > > In the future, as I mentioned above, it would be good to support actually
> > > > > getting pages, but it's work that needs to be tackled with a bit of
> > > > > care. I might target the work for v5.14.
> > > > >
> > > > > > this is our application pipeline.
> > > > > > file_pre_process |
> > > > > > bwa.nipt xx |
> > > > > > samtools.nipt sort xx |
> > > > > > file_post_process
> > > > > >
> > > > > > file_pre_process/file_post_process is fast, so often are blocked by
> > > > > > pipe input/output.
> > > > > >
> > > > > > 'bwa.nipt xx' is a high-cpu-load, almost all of CPU cores.
> > > > > >
> > > > > > 'samtools.nipt sort xx' is a high-mem-load, it keeps the input in memory.
> > > > > > if the memory is not enough, it will save all the buffer to temp file,
> > > > > > so it is sometimes high-IO-load too(write 60G or more to file).
> > > > > >
> > > > > >
> > > > > > xfstests(generic/476) is just high-IO-load, cpu/memory load is NOT high.
> > > > > > so xfstests(generic/476) may be easier than our application pipeline.
> > > > > >
> > > > > > Although there is not yet a simple reproducer for another problem
> > > > > > that happened here, there is a somewhat high chance that something is
> > > > > > wrong in btrfs/mm/fs-buffer.
> > > > > > > but another problem(os froze without call trace, PANIC without OOPS?,
> > > > > > > the reason is yet unknown) still happens.
> > > > >
> > > > > I do not have an answer for this. I would recommend looking into kdump.
> > > >
> > > > percpu ENOMEM problem blocked many heavy load tests for quite a long time?
> > > > I still guess this problem of system freeze is a mm/btrfs problem.
> > > > OOM not work, OOPS not work too.
> > > >
> > >
> > > I don't follow. Is this still a problem after the patch?
> >
> >
> > After the patch for percpu ENOMEM, the problem of system freeze has a high
> > frequency (>75%) of being triggered by our user-space application.
> >
> > The problem of system freeze may not be caused by the percpu ENOMEM patch.
> >
> > percpu ENOMEM problem may happen more easily than the problem of
> > system freeze.
>
> After highmem zone +80% / otherzone +40% of WMARK_MIN/ WMARK_LOW/
> WMARK_HIGH, we worked around or reduced the reproduction frequency of the
> problem of system freeze.
>
> so this is a problem of linux-mm.
>
> the use case of our user-space application.
> 1) write the files with the total size > 3 * memory size.
>    the memory size > 128G
> 2) btrfs with SSD/SAS, SSD/SATA, or btrfs RAID6 hdd
>    SSD/NVMe maybe too fast, so difficult to reproduce.
> 3) some CPU load, and some memory load.
>

To me it just sounds like writeback is slow.
It's also hard to debug a system without actually observing it. You might
want to limit the memory allotted to the workload's cgroup, possibly via
memory.high. This may help reclaim kick in earlier.

> btrfs and other fs seem not like mempool_t with pre-alloc, so difficult
> job is left to the system-wide reclaim/pre-alloc of linux-mm.
>
> maybe memalloc_nofs_save() or memalloc_nofs_restore() is a good place to
> add some sync/async memory reclaim/pre-alloc operations for WMARK_MIN/
> WMARK_LOW/WMARK_HIGH and percpu PCPU_EMPTY_POP_PAGES_LOW.
>

It's not that simple. Memory reclaim is a balancing act, and these places
mark scopes where reclaim cannot trigger writeback, so the oom-killer is
the only way out.

I'm sorry, but beyond the above, I don't really have any additional
advice besides retuning your workload to use less memory and give the
system more headroom. I appreciate the bug report though, and if it's
anything percpu related I will always be available.

> Best Regards
> Wang Yugui (wangyugui@e16-tech.com)
> 2021/04/11
>

Thanks,
Dennis
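
For reference, a minimal sketch of the memory.high suggestion above. It
assumes cgroup v2 and that the workload already runs in its own cgroup
directory; the helper name set_memory_high() and any path passed to it
(for example /sys/fs/cgroup/workload) are hypothetical and not from this
thread. It only writes a byte limit to the cgroup's memory.high file.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/*
 * Write a byte limit to <cgroup_dir>/memory.high (cgroup v2 interface file).
 * Hypothetical helper for illustration only.
 * Returns 0 on success or a negative errno value on failure.
 */
int set_memory_high(const char *cgroup_dir, unsigned long long bytes)
{
	char path[4096];
	FILE *f;
	int n, rc = 0;

	assert(cgroup_dir != NULL);

	n = snprintf(path, sizeof(path), "%s/memory.high", cgroup_dir);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -ENAMETOOLONG;

	f = fopen(path, "w");
	if (!f)
		return -errno;

	/* The value is a plain decimal byte count followed by a newline. */
	if (fprintf(f, "%llu\n", bytes) < 0)
		rc = -EIO;

	/* stdio buffers the write, so a rejected value may only show up here. */
	if (fclose(f) != 0 && rc == 0)
		rc = -errno;

	return rc;
}
```

Under those assumptions, a call such as
set_memory_high("/sys/fs/cgroup/workload", 96ULL << 30) would ask the
kernel to start throttling and reclaiming once the group exceeds roughly
96G. The right value depends on the machine and workload and would need
tuning.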