From: Dan Williams <dan.j.williams@intel.com>
To: Mikulas Patocka <mpatocka@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
	Alexander Viro <viro@zeniv.linux.org.uk>,
	Andrew Morton <akpm@linux-foundation.org>,
	Matthew Wilcox <willy@infradead.org>, Jan Kara <jack@suse.cz>,
	Eric Sandeen <esandeen@redhat.com>,
	Dave Chinner <dchinner@redhat.com>,
	"Tadakamadla,
	Rajesh (DCIG/CDI/HPS Perf)" <rajesh.tadakamadla@hpe.com>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	linux-nvdimm <linux-nvdimm@lists.01.org>
Subject: Re: [RFC] nvfs: a filesystem for persistent memory
Date: Tue, 15 Sep 2020 08:16:11 -0700
Message-ID: <CAPcyv4gh=QaDB61_9_QTgtt-pZuTFdR6td0orE0VMH6=6SA2vw@mail.gmail.com>
In-Reply-To: <alpine.LRH.2.02.2009140852030.22422@file01.intranet.prod.int.rdu2.redhat.com>

On Tue, Sep 15, 2020 at 5:35 AM Mikulas Patocka <mpatocka@redhat.com> wrote:
>
> Hi
>
> I am developing a new filesystem suitable for persistent memory - nvfs.

Nice!

> The goal is to have a small and fast filesystem that can be used on
> DAX-based devices. Nvfs maps the whole device into linear address space
> and it completely bypasses the overhead of the block layer and buffer
> cache.

So does device-dax, but device-dax lacks read(2)/write(2).

> In the past, there was the nova filesystem for pmem, but it was abandoned a
> year ago (the last version is for kernel 5.1 -
> https://github.com/NVSL/linux-nova ). Nvfs is smaller and performs better.
>
> The design of nvfs is similar to ext2/ext4, so that it fits into the VFS
> layer naturally, without too much glue code.
>
> I'd like to ask you to review it.
>
>
> tarballs:
>         http://people.redhat.com/~mpatocka/nvfs/
> git:
>         git://leontynka.twibright.com/nvfs.git
> the description of filesystem internals:
>         http://people.redhat.com/~mpatocka/nvfs/INTERNALS
> benchmarks:
>         http://people.redhat.com/~mpatocka/nvfs/BENCHMARKS
>
>
> TODO:
>
> - programs run approximately 4% slower when running from Optane-based
> persistent memory. Therefore, programs and libraries should use page cache
> and not DAX mapping.

This needs to be based on platform firmware data (ACPI HMAT) for the
relative performance of a PMEM range vs DRAM. For example, this
tradeoff should not exist with battery-backed DRAM, or virtio-pmem.
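
As a rough illustration of that comparison: kernels that export HMAT data
through sysfs provide per-node attributes such as
access0/initiators/read_latency (in nanoseconds). The C sketch below reads
two such values and compares them. read_attr() and prefer_page_cache() are
hypothetical helper names, and the plain "slower than DRAM" rule is an
assumption for illustration, not something nvfs or the kernel implements.

```c
#include <stdbool.h>
#include <stdio.h>

/*
 * Read one unsigned decimal value from a sysfs-style attribute file.
 * Returns 0 on success, -1 if the file is missing or unparsable.
 * (Hypothetical helper, not a kernel or library API.)
 */
int read_attr(const char *path, unsigned long *out)
{
    FILE *f = fopen(path, "r");
    int parsed;

    if (!f)
        return -1;
    parsed = (fscanf(f, "%lu", out) == 1);
    fclose(f);
    return parsed ? 0 : -1;
}

/*
 * Assumed policy: copy executables into the page cache only when the
 * platform reports the PMEM range as slower to read than DRAM.
 * Equal latency (e.g. battery-backed DRAM) keeps the DAX mapping.
 */
bool prefer_page_cache(unsigned long dram_read_ns, unsigned long pmem_read_ns)
{
    return pmem_read_ns > dram_read_ns;
}
```

A caller would pass paths of the form
/sys/devices/system/node/nodeN/access0/initiators/read_latency to
read_attr(); which node numbers correspond to DRAM and to PMEM depends on
the platform and is not specified here.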

>
> - when the fsck.nvfs tool mmaps the device /dev/pmem0, the kernel uses
> buffer cache for the mapping. The buffer cache slows down fsck by a factor
> of 5 to 10. Could it be possible to change the kernel so that it maps
> DAX-based block devices directly?

We've been down this path before.

5a023cdba50c block: enable dax for raw block devices
9f4736fe7ca8 block: revert runtime dax control of the raw block device
acc93d30d7d4 Revert "block: enable dax for raw block devices"

EXT2/4 metadata buffer management depends on the page cache and we
eliminated a class of bugs by removing that support. The problems are
likely tractable, but there was not a straightforward fix visible at
the time.

> - __copy_from_user_inatomic_nocache doesn't flush cache for leading and
> trailing bytes.

You want copy_user_flushcache(). See how fs/dax.c arranges for
dax_copy_from_iter() to route to pmem_copy_from_iter().
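
To make the leading/trailing-byte problem concrete, here is a small
userspace sketch in C. It only does the arithmetic: given a destination
address and a length, it splits the copy into an unaligned head, an aligned
body that could use non-temporal stores, and a tail. The head and tail are
the parts that would go through the CPU cache and so need an explicit
write-back. The 8-byte store granularity is an assumption for illustration,
and struct flush_plan and plan_copy() are hypothetical names; this is not
the kernel's implementation.

```c
#include <stddef.h>
#include <stdint.h>

/* How a copy of `len` bytes to address `dst` splits up (hypothetical). */
struct flush_plan {
    size_t head_len; /* bytes before the first 8-byte boundary (cached stores) */
    size_t body_len; /* multiple of 8; candidate for non-temporal stores */
    size_t tail_len; /* leftover bytes after the body (cached stores) */
};

struct flush_plan plan_copy(uintptr_t dst, size_t len)
{
    struct flush_plan p;
    /* Distance from dst up to the next 8-byte boundary (0 if aligned). */
    size_t head = (size_t)(-dst & 7);

    if (head > len)
        head = len; /* short copy that never reaches the boundary */

    p.head_len = head;
    p.body_len = (len - head) & ~(size_t)7;
    p.tail_len = len - head - p.body_len;
    return p;
}
```

For example, a 20-byte copy to an address ending in 0x3 yields a 5-byte
head, an 8-byte body and a 7-byte tail. A flushing copy routine has to
write back the cache lines covering the head and the tail, while an aligned
copy whose length is a multiple of 8 has neither.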