From: Mikulas Patocka <mpatocka@redhat.com>
To: Dan Williams <dan.j.williams@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
    Alexander Viro <viro@zeniv.linux.org.uk>,
    Andrew Morton <akpm@linux-foundation.org>,
    Matthew Wilcox <willy@infradead.org>,
    Jan Kara <jack@suse.cz>,
    Eric Sandeen <esandeen@redhat.com>,
    Dave Chinner <dchinner@redhat.com>,
    "Tadakamadla, Rajesh (DCIG/CDI/HPS Perf)" <rajesh.tadakamadla@hpe.com>,
    Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
    linux-fsdevel <linux-fsdevel@vger.kernel.org>,
    linux-nvdimm <linux-nvdimm@lists.01.org>
Subject: Re: [RFC] nvfs: a filesystem for persistent memory
Date: Tue, 15 Sep 2020 12:58:46 -0400 (EDT)
Message-ID: <alpine.LRH.2.02.2009151216050.16057@file01.intranet.prod.int.rdu2.redhat.com>
In-Reply-To: <CAPcyv4gh=QaDB61_9_QTgtt-pZuTFdR6td0orE0VMH6=6SA2vw@mail.gmail.com>

On Tue, 15 Sep 2020, Dan Williams wrote:

> > - when the fsck.nvfs tool mmaps the device /dev/pmem0, the kernel uses
> > buffer cache for the mapping. The buffer cache slows down fsck by a factor
> > of 5 to 10. Could it be possible to change the kernel so that it maps DAX
> > based block devices directly?
>
> We've been down this path before.
>
> 5a023cdba50c block: enable dax for raw block devices
> 9f4736fe7ca8 block: revert runtime dax control of the raw block device
> acc93d30d7d4 Revert "block: enable dax for raw block devices"

It says "The functionality is superseded by the new 'Device DAX'
facility". But the fsck tool can't change a fsdax device into a devdax
device just for checking. Or can it?

> EXT2/4 metadata buffer management depends on the page cache and we
> eliminated a class of bugs by removing that support. The problems are
> likely tractable, but there was not a straightforward fix visible at
> the time.

Thinking about it - it isn't as easy as it looks...

Suppose that the user mounts an ext2 filesystem and then uses the tune2fs
tool on the mounted block device. The tune2fs tool reads and writes the
mounted superblock directly.

So, read/write must be coherent with the buffer cache (otherwise the
kernel would not see the changes written by tune2fs). And mmap must be
coherent with read/write.

So, if we want to map the pmem device directly, we could add a new flag
MAP_DAX. Or we could test if the fd has the O_DIRECT flag and map it
directly in this case. But the default must be to map it coherently in
order to not break existing programs.

> > - __copy_from_user_inatomic_nocache doesn't flush cache for leading and
> > trailing bytes.
>
> You want copy_user_flushcache(). See how fs/dax.c arranges for
> dax_copy_from_iter() to route to pmem_copy_from_iter().

Is it something new for the kernel 5.10? I see only __copy_user_flushcache
that is implemented just for x86 and arm64.

There is __copy_from_user_flushcache implemented for x86, arm64 and power.
It is used in lib/iov_iter.c under
#ifdef CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE - so should I use this?

Mikulas
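
The mapping policy proposed in the message above (map the pmem device directly only when the caller opts in, otherwise keep the coherent buffer-cache mapping) can be sketched in userspace C as follows. This is only an illustration of the decision logic, not kernel code: SKETCH_MAP_DAX is a made-up stand-in for the proposed MAP_DAX flag, which does not exist, and choose_map_mode() and choose_map_mode_fd() are hypothetical helper names. The only real interfaces used are O_DIRECT and fcntl(F_GETFL).

```c
#define _GNU_SOURCE          /* exposes O_DIRECT in <fcntl.h> on Linux */
#include <assert.h>
#include <fcntl.h>

/*
 * Hypothetical opt-in bit standing in for the proposed MAP_DAX flag.
 * No such mmap flag exists; the value is arbitrary and illustrative.
 */
#define SKETCH_MAP_DAX 0x1u

enum map_mode {
	MAP_MODE_COHERENT,   /* default: go through the buffer cache */
	MAP_MODE_DIRECT      /* opt-in: map the persistent memory directly */
};

/*
 * Pure policy function: pick the mapping mode from the file status
 * flags and the (hypothetical) per-mapping flags. Existing programs
 * that set neither flag keep the coherent mapping.
 */
static enum map_mode choose_map_mode(int open_flags, unsigned sketch_flags)
{
	if (sketch_flags & SKETCH_MAP_DAX)
		return MAP_MODE_DIRECT;      /* explicit per-mapping opt-in */
	if (open_flags & O_DIRECT)
		return MAP_MODE_DIRECT;      /* opt-in via the open flags */
	return MAP_MODE_COHERENT;
}

/*
 * Convenience wrapper: read the status flags of an open descriptor
 * and apply the policy. Falls back to the coherent mode if the flags
 * cannot be read.
 */
static enum map_mode choose_map_mode_fd(int fd, unsigned sketch_flags)
{
	int fl;

	assert(fd >= 0);
	fl = fcntl(fd, F_GETFL);
	if (fl < 0)
		return MAP_MODE_COHERENT;
	return choose_map_mode(fl, sketch_flags);
}
```

Under this sketch, a tool such as tune2fs that opens the block device without O_DIRECT would still get the coherent mapping, while a checker that opens the device with O_DIRECT would get the direct one.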