Date: Fri, 14 Jun 2019 10:35:18 +1000
From: Dave Chinner
To: Kent Overstreet
Cc: Andreas Dilger, Linus Torvalds, Dave Chinner, "Darrick J. Wong",
        Christoph Hellwig, Matthew Wilcox, Amir Goldstein, Jan Kara,
        Linux List Kernel Mailing, linux-xfs, linux-fsdevel, Josef Bacik,
        Alexander Viro, Andrew Morton
Subject: Re: pagecache locking (was: bcachefs status update)
Message-ID: <20190614003518.GL14363@dread.disaster.area>
In-Reply-To: <20190613212112.GB28171@kmo-pixel>

On Thu, Jun 13, 2019 at 05:21:12PM -0400, Kent Overstreet wrote:
> On Thu, Jun 13, 2019 at 03:13:40PM -0600, Andreas Dilger wrote:
> > There are definitely workloads that require multiple threads doing
> > non-overlapping writes to a single file in HPC. This is becoming an
> > increasingly common problem as the number of cores on a single client
> > increases, since there is typically one thread per core trying to
> > write to a shared file. Using multiple files (one per core) is
> > possible, but that has file management issues for users when there
> > are a million cores running on the same job/file (obviously not on
> > the same client node) dumping data every hour.
>
> Mixed buffered and O_DIRECT though? That profile looks like just
> buffered IO to me.
>
> > We were just looking at this exact problem last week, and most of the
> > threads are spinning in grab_cache_page_nowait->add_to_page_cache_lru()
> > and set_page_dirty() when writing at 1.9GB/s when they could be
> > writing at 5.8GB/s (when threads are writing O_DIRECT instead of
> > buffered). Flame graph is attached for 16-thread case, but high-end
> > systems today easily have 2-4x that many cores.
>
> Yeah I've been spending some time on buffered IO performance too - 4k
> page overhead is a killer.
>
> bcachefs has a buffered write path that looks up multiple pages at a
> time and locks them, and then copies the data to all the pages at once
> (I stole the idea from btrfs). It was a very significant performance
> increase.

Careful with that - locking multiple pages is also a deadlock vector
that triggers unexpectedly when something conspires to lock pages in
non-ascending order. e.g.

64081362e8ff mm/page-writeback.c: fix range_cyclic writeback vs
writepages deadlock

The fs/iomap.c code avoids this problem by mapping the IO first, then
iterating pages one at a time until the mapping is consumed, then it
gets another mapping. It also avoids needing to put a page array on
stack....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
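A rough sketch of the loop shape Dave describes may help. This is a
hypothetical illustration only, not the actual fs/iomap.c code:
fs_map_extent(), get_locked_page() and copy_into_page() are made-up
placeholder helpers, while set_page_dirty(), unlock_page(), min_t(),
offset_in_page() and PAGE_SIZE are real kernel primitives. The
filesystem maps the IO range once, then the loop consumes that mapping
one locked page at a time, so at most one page lock is ever held and no
page array is needed on the stack.

	/*
	 * Hypothetical sketch only, not the real fs/iomap.c code.
	 * Outer loop: get one extent mapping from the filesystem.
	 * Inner loop: consume that mapping one locked page at a time.
	 */
	while (length > 0) {
		struct extent_map map;		/* placeholder type */
		size_t mapped = fs_map_extent(inode, pos, length, &map);	/* placeholder */
		size_t done = 0;

		while (done < mapped) {
			/* at most one page lock held at any time */
			struct page *page = get_locked_page(inode, pos + done);	/* placeholder */
			size_t chunk = min_t(size_t, mapped - done,
					     PAGE_SIZE - offset_in_page(pos + done));

			copy_into_page(page, offset_in_page(pos + done),	/* placeholder */
				       buf + done, chunk);
			set_page_dirty(page);
			unlock_page(page);
			done += chunk;
		}

		pos += mapped;
		buf += mapped;
		length -= mapped;
	}

Compare this with the batched multi-page approach Kent describes: there
every page in the batch must be locked up front, and correctness then
depends on always taking those locks in ascending index order, which is
exactly the ordering that broke in the commit cited above.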