linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Mikulas Patocka <mpatocka@redhat.com>
To: Dave Chinner <david@fromorbit.com>
Cc: Dan Williams <dan.j.williams@intel.com>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Alexander Viro <viro@zeniv.linux.org.uk>,
	Andrew Morton <akpm@linux-foundation.org>,
	Vishal Verma <vishal.l.verma@intel.com>,
	Dave Jiang <dave.jiang@intel.com>,
	Ira Weiny <ira.weiny@intel.com>,
	Matthew Wilcox <willy@infradead.org>, Jan Kara <jack@suse.cz>,
	Eric Sandeen <esandeen@redhat.com>,
	Dave Chinner <dchinner@redhat.com>,
	"Kani, Toshi" <toshi.kani@hpe.com>,
	"Norton, Scott J" <scott.norton@hpe.com>,
	"Tadakamadla,
	Rajesh (DCIG/CDI/HPS Perf)"  <rajesh.tadakamadla@hpe.com>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	linux-nvdimm <linux-nvdimm@lists.01.org>
Subject: Re: NVFS XFS metadata (was: [PATCH] pmem: export the symbols __copy_user_flushcache and __copy_from_user_flushcache)
Date: Wed, 23 Sep 2020 13:19:42 -0400 (EDT)	[thread overview]
Message-ID: <alpine.LRH.2.02.2009230445030.1800@file01.intranet.prod.int.rdu2.redhat.com> (raw)
In-Reply-To: <20200923024528.GD12096@dread.disaster.area>



On Wed, 23 Sep 2020, Dave Chinner wrote:

> > > dir-test /mnt/test/linux-2.6 63000 1048576
> > > nvfs		6.6s
> > > ext4 dax	8.4s
> > > xfs dax		12.2s
> > > 
> > > 
> > > dir-test /mnt/test/linux-2.6 63000 1048576 link
> > > nvfs		4.7s
> > > ext4 dax	5.6s
> > > xfs dax		7.8s
> > > 
> > > dir-test /mnt/test/linux-2.6 63000 1048576 dir
> > > nvfs		8.2s
> > > ext4 dax	15.1s
> > > xfs dax		11.8s
> > > 
> > > Yes, nvfs is faster than both ext4 and XFS on DAX, but it's  not a
> > > huge difference - it's not orders of magnitude faster.
> > 
> > If I increase the size of the test directory, NVFS is order of magnitude 
> > faster:
> > 
> > time dir-test /mnt/test/ 2000000 2000000
> > NVFS: 0m29,395s
> > XFS:  1m59,523s
> > EXT4: 1m14,176s
> 
> What happened to NVFS there? The runtime went up by a factor of 5,
> even though the number of ops performed only doubled.

This test is from a different machine (K10 Opteron) than the above test 
(Skylake Xeon). I borrowed the Xeon for a short time and I no longer have 
access to it.

> > time dir-test /mnt/test/ 8000000 8000000
> > NVFS: 2m13,507s
> > XFS: 14m31,261s
> > EXT4: reports "file 1976882 can't be created: No space left on device", 
> > 	(although there are free blocks and inodes)
> > 	Is it a bug or expected behavior?
> 
> Exponential increase in runtime for a workload like this indicates
> the XFS journal is too small to run large scale operations. I'm
> guessing you're just testing on a small device?

In this test, the pmem device had 64GiB.

I've created 1TiB ramdisk, formatted it with XFS and ran dir-test 8000000 
on it, however it wasn't much better - it took 14m8,824s.

> In which case, you'd get a 16MB log for XFS, which is tiny and most
> definitely will limit performance of any large scale metadta
> operation. Performance should improve significantly for large scale
> operations with a much larger log, and that should bring the XFS
> runtimes down significantly.

Is there some mkfs.xfs option that can increase log size?

> > If you think that the lack of journaling is show-stopper, I can implement 
> > it.
> 
> I did not say that. My comments are about the requirement for
> atomicity of object changes, not journalling. Journalling is an
> -implementation that can provide change atomicity-, it is not a
> design constraint for metadata modification algorithms.
> 
> Really, you can chose how to do object update however you want. What
> I want to review is the design documentation and a correctness proof
> for whatever mechanism you choose to use. Without that information,
> we have absolutely no chance of reviewing the filesystem
> implementation for correctness. We don't need a proof for something
> that uses journalling (because we all know how that works), but for
> something that uses soft updates we most definitely need the proof
> of correctness for the update algorithm before we can determine if
> the implementation is good...
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

I am thinking about this: I can implement lightweight journaling that will 
journal just a few writes - I'll allocate some small per-cpu intent log 
for that.

For example, in nvfs_rename, we call nvfs_delete_de and nvfs_finish_add - 
these functions are very simple, both of them write just one word - so we 
can add these two words to the intent log. The same for setattr requesting 
simultaneous uid/gid/mode change - they are small, so they'll fit into the 
intent log well.

Regarding verifiability, I can do this - the writes to pmem are wrapped in 
a macro nv_store. So, I can modify this macro so that it logs all 
modifications. Then I take the log, cut it at random time, reorder the 
entries (to simulate reordering in the CPU write-combining buffers), 
replay it, run nvfsck on it and mount it. This way, we can verify that no 
matter where the crash happened, either an old file or a new file is 
present in a directory.

Do you agree with that?

Mikulas


  parent reply	other threads:[~2020-09-23 17:19 UTC|newest]

Thread overview: 38+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-09-15 12:34 [RFC] nvfs: a filesystem for persistent memory Mikulas Patocka
2020-09-15 13:00 ` Matthew Wilcox
2020-09-15 13:24   ` Mikulas Patocka
2020-09-22 10:04   ` Ritesh Harjani
2020-09-15 15:16 ` Dan Williams
2020-09-15 16:58   ` Mikulas Patocka
2020-09-15 17:38     ` Mikulas Patocka
2020-09-16 10:57       ` [PATCH] pmem: export the symbols __copy_user_flushcache and __copy_from_user_flushcache Mikulas Patocka
2020-09-16 16:21         ` Dan Williams
2020-09-16 17:24           ` Mikulas Patocka
2020-09-16 17:40             ` Dan Williams
2020-09-16 18:06               ` Mikulas Patocka
2020-09-21 16:20                 ` NVFS XFS metadata (was: [PATCH] pmem: export the symbols __copy_user_flushcache and __copy_from_user_flushcache) Mikulas Patocka
2020-09-22  5:03                   ` Dave Chinner
2020-09-22 16:46                     ` Mikulas Patocka
2020-09-22 17:25                       ` Matthew Wilcox
2020-09-24 15:00                         ` Mikulas Patocka
2020-09-28 15:22                           ` Mikulas Patocka
2020-09-23  2:45                       ` Dave Chinner
2020-09-23  9:20                         ` A bug in ext4 with big directories (was: NVFS XFS metadata) Mikulas Patocka
2020-09-23  9:44                           ` Jan Kara
2020-09-23 12:46                             ` Mikulas Patocka
2020-09-23 17:19                         ` Mikulas Patocka [this message]
2020-09-23  9:57                       ` NVFS XFS metadata (was: [PATCH] pmem: export the symbols __copy_user_flushcache and __copy_from_user_flushcache) Jan Kara
2020-09-23 13:11                         ` Mikulas Patocka
2020-09-23 15:04                           ` Matthew Wilcox
2020-09-22 12:28                   ` Matthew Wilcox
2020-09-22 12:39                     ` Mikulas Patocka
2020-09-16 18:56               ` [PATCH] pmem: fix __copy_user_flushcache Mikulas Patocka
2020-09-18  1:53                 ` Dan Williams
2020-09-18 12:25                   ` the "read" syscall sees partial effects of the "write" syscall Mikulas Patocka
2020-09-18 13:13                     ` Jan Kara
2020-09-18 18:02                       ` Linus Torvalds
2020-09-20 23:41                       ` Dave Chinner
2020-09-17  6:50               ` [PATCH] pmem: export the symbols __copy_user_flushcache and __copy_from_user_flushcache Christoph Hellwig
2020-09-21 16:19   ` [RFC] nvfs: a filesystem for persistent memory Mikulas Patocka
2020-09-21 16:29     ` Dan Williams
2020-09-22 15:43     ` Ira Weiny

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=alpine.LRH.2.02.2009230445030.1800@file01.intranet.prod.int.rdu2.redhat.com \
    --to=mpatocka@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=dan.j.williams@intel.com \
    --cc=dave.jiang@intel.com \
    --cc=david@fromorbit.com \
    --cc=dchinner@redhat.com \
    --cc=esandeen@redhat.com \
    --cc=ira.weiny@intel.com \
    --cc=jack@suse.cz \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-nvdimm@lists.01.org \
    --cc=rajesh.tadakamadla@hpe.com \
    --cc=scott.norton@hpe.com \
    --cc=torvalds@linux-foundation.org \
    --cc=toshi.kani@hpe.com \
    --cc=viro@zeniv.linux.org.uk \
    --cc=vishal.l.verma@intel.com \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).