From: Dan Williams
Date: Tue, 15 Sep 2020 08:16:11 -0700
Subject: Re: [RFC] nvfs: a filesystem for persistent memory
To: Mikulas Patocka
Cc: Linus Torvalds, Alexander Viro, Andrew Morton, Vishal Verma, Dave Jiang, Ira Weiny, Matthew Wilcox, Jan Kara, Eric Sandeen, Dave Chinner, "Kani, Toshi", "Norton, Scott J", "Tadakamadla, Rajesh (DCIG/CDI/HPS Perf)", Linux Kernel Mailing List, linux-fsdevel, linux-nvdimm

On Tue, Sep 15, 2020 at 5:35 AM Mikulas Patocka wrote:
>
> Hi
>
> I am developing a new filesystem suitable for persistent memory - nvfs.

Nice!

> The goal is to have a small and fast filesystem that can be used on
> DAX-based devices. Nvfs maps the whole device into linear address space
> and it completely bypasses the overhead of the block layer and buffer
> cache.

So does device-dax, but device-dax lacks read(2)/write(2).

> In the past, there was nova filesystem for pmem, but it was abandoned a
> year ago (the last version is for the kernel 5.1 -
> https://github.com/NVSL/linux-nova ). Nvfs is smaller and performs better.
>
> The design of nvfs is similar to ext2/ext4, so that it fits into the VFS
> layer naturally, without too much glue code.
>
> I'd like to ask you to review it.
>
>
> tarballs:
>         http://people.redhat.com/~mpatocka/nvfs/
> git:
>         git://leontynka.twibright.com/nvfs.git
> the description of filesystem internals:
>         http://people.redhat.com/~mpatocka/nvfs/INTERNALS
> benchmarks:
>         http://people.redhat.com/~mpatocka/nvfs/BENCHMARKS
>
>
> TODO:
>
> - programs run approximately 4% slower when running from Optane-based
> persistent memory. Therefore, programs and libraries should use page cache
> and not DAX mapping.

This needs to be based on platform firmware data (ACPI HMAT) for the
relative performance of a PMEM range vs DRAM. For example, this
tradeoff should not exist with battery backed DRAM, or virtio-pmem.

>
> - when the fsck.nvfs tool mmaps the device /dev/pmem0, the kernel uses
> buffer cache for the mapping. The buffer cache slows down fsck by a factor
> of 5 to 10. Could it be possible to change the kernel so that it maps DAX
> based block devices directly?

We've been down this path before.

5a023cdba50c block: enable dax for raw block devices
9f4736fe7ca8 block: revert runtime dax control of the raw block device
acc93d30d7d4 Revert "block: enable dax for raw block devices"

EXT2/4 metadata buffer management depends on the page cache and we
eliminated a class of bugs by removing that support.
The problems are likely tractable, but there was not a
straightforward fix visible at the time.

> - __copy_from_user_inatomic_nocache doesn't flush cache for leading and
> trailing bytes.

You want copy_user_flushcache(). See how fs/dax.c arranges for
dax_copy_from_iter() to route to pmem_copy_from_iter().
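
To illustrate why the leading and trailing bytes matter, here is a rough
user-space sketch, not kernel code. It assumes a copy routine that uses
8-byte non-temporal stores for the aligned middle of a transfer and falls
back to ordinary cached stores for any unaligned head or tail. The names
plan_flush(), struct flush_plan and NT_STORE_ALIGN are made up for this
example, and the 8-byte store width is an assumption. The point is only
that any nonzero head or tail has gone through the CPU cache and still
needs an explicit write-back before it can be considered persistent.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed width of one non-temporal store; hypothetical for this sketch. */
#define NT_STORE_ALIGN 8u

/* How a copy of len bytes to dst splits into cached and uncached parts. */
struct flush_plan {
    size_t head_len; /* cached bytes before the first aligned store */
    size_t body_len; /* bytes written with non-temporal stores */
    size_t tail_len; /* cached bytes after the last aligned store */
};

/* Split [dst, dst + len) into unaligned head, aligned body, and tail. */
struct flush_plan plan_flush(uintptr_t dst, size_t len)
{
    struct flush_plan p = {0, 0, 0};
    size_t misalign = dst % NT_STORE_ALIGN;
    size_t head = misalign ? NT_STORE_ALIGN - misalign : 0;

    /* A very short copy may end before the first aligned boundary. */
    if (head > len)
        head = len;

    p.head_len = head;
    p.body_len = (len - head) / NT_STORE_ALIGN * NT_STORE_ALIGN;
    p.tail_len = len - head - p.body_len;

    /* The three parts must cover the whole transfer exactly. */
    assert(p.head_len + p.body_len + p.tail_len == len);
    return p;
}

/* Nonzero when some bytes went through the cache and must be written back. */
int needs_explicit_flush(uintptr_t dst, size_t len)
{
    struct flush_plan p = plan_flush(dst, len);

    return p.head_len != 0 || p.tail_len != 0;
}
```

With these assumptions, a 64-byte copy to an 8-byte-aligned address needs
no extra flush, while a 20-byte copy to an address ending in 3 leaves 5
cached head bytes and 7 cached tail bytes that a flushcache-style helper
would have to write back.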