From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Wed, 23 Sep 2020 09:11:43 -0400 (EDT)
From: Mikulas Patocka
To: Jan Kara
CC: Dave Chinner, Dan Williams, Linus Torvalds, Alexander Viro, Andrew Morton, Vishal Verma, Dave Jiang, Ira Weiny, Matthew Wilcox, Eric Sandeen, "Kani, Toshi", "Norton, Scott J", "Tadakamadla, Rajesh (DCIG/CDI/HPS Perf)", Linux Kernel Mailing List, linux-fsdevel, linux-nvdimm
Subject: Re: NVFS XFS metadata (was: [PATCH] pmem: export the symbols __copy_user_flushcache and __copy_from_user_flushcache)
In-Reply-To: <20200923095739.GC6719@quack2.suse.cz>
References: <20200922050314.GB12096@dread.disaster.area> <20200923095739.GC6719@quack2.suse.cz>
User-Agent: Alpine 2.02 (LRH 1266 2009-07-14)
List-Id: "Linux-nvdimm developer list."

On Wed, 23 Sep 2020, Jan Kara wrote:

> On Tue 22-09-20 12:46:05, Mikulas Patocka wrote:
> > > mapping 2^21 blocks requires a 5 level indirect tree. Which one is going
> > > to be faster to truncate away - a single record or 2 million individual
> > > blocks?
> > >
> > > IOWs, we can afford to take an extra cacheline miss or two on a
> > > tree block search, because we're accessing and managing orders of
> > > magnitude fewer records in the mapping tree than an indirect block
> > > tree.
> > >
> > > PMEM doesn't change this: extents are more time and space efficient
> > > at scale for mapping trees than indirect block trees regardless
> > > of the storage medium in use.
> >
> > PMEM doesn't have to be read linearly, so the attempts to allocate large
> > linear space are not needed. They won't harm but they won't help either.
> >
> > That's why NVFS has very simple block allocation algorithm - it uses a
> > per-cpu pointer and tries to allocate by a bit scan from this pointer. If
> > the group is full, it tries a random group with above-average number of
> > free blocks.
>
> I agree with Dave here. People are interested in 2MB or 1GB contiguous
> allocations for DAX so that files can be mapped at PMD or even PUD levels
> thus saving a lot of CPU time on page faults and TLB.

NVFS has an upper limit on block size of 1MB. So, should I raise it to 2MB?
Will 2MB blocks be useful to someone?

Is there some API through which userspace can ask the kernel for an aligned
allocation? fallocate() doesn't seem to offer an option for alignment.

> > EXT4 uses bit scan for allocations and people haven't complained that it's
> > inefficient, so it is probably OK.
>
> Yes, it is more or less OK but once you get to 1TB filesystem size and
> larger, the number of block groups grows enough that it isn't that great
> anymore. We are actually considering new allocation schemes for ext4 for
> these large filesystems...

NVFS can run with a block size larger than the page size, so you can reduce
the number of block groups by increasing the block size.

(ext4 also has the bigalloc feature, which does the same.)

> > If you think that the lack of journaling is show-stopper, I can implement
> > it. But then, I'll have something that has complexity of EXT4 and
> > performance of EXT4. So that there will no longer be any reason why to use
> > NVFS over EXT4. Without journaling, it will be faster than EXT4 and it may
> > attract some users who want good performance and who don't care about GID
> > and UID being updated atomically, etc.
>
> I'd hope that your filesystem offers more performance benefits than just
> what you can get from a lack of journalling :). ext4 can be configured to

I also don't know how to implement journaling on persistent memory :) On
EXT4 or XFS you can pin dirty buffers in memory until the journal is
flushed. This is obviously impossible on persistent memory. So, I'm
considering implementing only some lightweight journaling that will
guarantee atomicity between just a few writes.
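
A minimal sketch of what "atomicity between just a few writes" could look like when stores cannot be held back in volatile buffers, assuming a small undo log. This is an illustration only and not NVFS's design: `PMem`, `atomic_update`, `recover` and the `crash_after` parameter are hypothetical names, persistent memory is modelled as a Python list whose stores are durable immediately, and the cache flushes and fences real hardware would need appear only as comments.

```python
class PMem:
    """Toy persistent memory: every store is treated as durable at once."""

    def __init__(self, nwords):
        self.words = [0] * nwords
        self.log = []            # undo records: (index, old value)
        self.log_valid = False   # True while an update is in flight


def atomic_update(pm, updates, crash_after=None):
    """Apply {index: value} so that recovery sees all of it or none of it.

    crash_after (hypothetical test hook) simulates a power failure after
    that many in-place stores have been made.
    """
    order = sorted(updates)
    # 1. Save the old contents first.
    pm.log = [(i, pm.words[i]) for i in order]
    # 2. Mark the log valid (real hardware: flush + fence before this store).
    pm.log_valid = True
    # 3. Overwrite in place; each store is visible to a crash immediately.
    for n, i in enumerate(order):
        if crash_after is not None and n == crash_after:
            return False         # simulated crash, log is still valid
        pm.words[i] = updates[i]
    # 4. Commit point: discard the undo log.
    pm.log_valid = False
    pm.log = []
    return True


def recover(pm):
    """Run at mount time: roll back an update that did not reach step 4."""
    if pm.log_valid:
        for i, old in reversed(pm.log):
            pm.words[i] = old
        pm.log_valid = False
        pm.log = []
```

With a UID in word 0 and a GID in word 1, a crash between the two stores leaves one field updated and the other not; `recover()` then restores both to their previous values.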
> run without a journal as well - mkfs.ext4 -O ^has_journal. And yes, it does
> significantly improve performance for some workloads but you have to have
> some way to recover from crashes so it's mostly used for scratch
> filesystems (e.g. in build systems, Google uses this feature a lot for some
> of their infrastructure as well).
>
> 								Honza
> --
> Jan Kara
> SUSE Labs, CR

I've run "dir-test /mnt/test/ 8000000 8000000" and the result is:
EXT4 with journal       - 5m54,019s
EXT4 without journal    - 4m4,444s
NVFS                    - 2m9,482s

Mikulas
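
The allocation scheme quoted earlier in this message (a per-CPU pointer, a bit scan from that pointer, and a jump to a random group with an above-average number of free blocks once the current group is full) can be sketched roughly as follows. This is a toy model and not NVFS source: `Group`, `Allocator`, the wrap-around scan inside a group, and the fallback used when no group is above average are all assumptions made for illustration.

```python
import random


class Group:
    """One block group: a bitmap of blocks (True = in use)."""

    def __init__(self, nblocks):
        self.used = [False] * nblocks
        self.free = nblocks

    def scan(self, start):
        """First free block at or after start, wrapping around; None if full."""
        n = len(self.used)
        for i in list(range(start, n)) + list(range(0, start)):
            if not self.used[i]:
                return i
        return None


class Allocator:
    def __init__(self, ngroups, blocks_per_group, ncpus, seed=0):
        self.groups = [Group(blocks_per_group) for _ in range(ngroups)]
        # Per-CPU pointer: (group index, next block index to try).
        self.ptr = [(cpu % ngroups, 0) for cpu in range(ncpus)]
        self.rng = random.Random(seed)

    def _pick_group(self):
        """Random group with above-average free blocks (None if all full)."""
        avg = sum(g.free for g in self.groups) / len(self.groups)
        cand = [i for i, g in enumerate(self.groups) if g.free > avg]
        if not cand:
            # Assumption: when every group has the same free count, none is
            # above average, so fall back to any group with free blocks.
            cand = [i for i, g in enumerate(self.groups) if g.free > 0]
        return self.rng.choice(cand) if cand else None

    def alloc(self, cpu):
        """Allocate one block for this CPU; returns (group, block) or None."""
        g, pos = self.ptr[cpu]
        blk = self.groups[g].scan(pos)
        if blk is None:                  # current group is full
            g = self._pick_group()
            if g is None:
                return None              # filesystem is full
            blk = self.groups[g].scan(0)
        self.groups[g].used[blk] = True
        self.groups[g].free -= 1
        self.ptr[cpu] = (g, blk + 1)     # next scan starts after this block
        return (g, blk)
```

In this model, consecutive allocations on one CPU come out adjacent within a group, and only a full group triggers the random jump, which is the property the description relies on to keep the allocator simple.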