* [POC][PATCH] xfs: reduce ilock contention on buffered randrw workload
@ 2019-04-04 16:57 Amir Goldstein
  (0 siblings, 1 reply; 38+ messages in thread)
From: Amir Goldstein @ 2019-04-04 16:57 UTC (permalink / raw)
To: Dave Chinner
Cc: Darrick J. Wong, Christoph Hellwig, Matthew Wilcox, linux-xfs, linux-fsdevel

This patch improves performance of the mixed random rw workload on xfs
without relaxing the atomic buffered read/write guarantee that xfs has
always provided.

We achieve that by calling generic_file_read_iter() twice: once with a
discard iterator to warm up the page cache before taking the shared
ilock, and once again under the shared ilock.

Since this is a POC patch, it also includes a separate fix to the
copy_page_to_iter() helper when called with a discard iterator. As far
as I could see, there is no other caller in the kernel of this method
with a discard iterator.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---

Hi Folks,

With this experimental patch I was able to bring the performance of the
random rw workload benchmark on xfs much closer to ext4. Ext4, like most
Linux filesystems, does not take the shared inode lock on buffered reads
and does not provide atomic buffered reads w.r.t. buffered writes.

Following are the numbers I got with the filebench randomrw workload [1]
on a VM with 4 CPUs and spindles. Note that this improvement is unrelated
to the rw_semaphore starvation issue that was observed when running the
same benchmark on a fast SSD drive [2].
=== random read/write - cold page cache ===

--- EXT4 ---
filebench randomrw (8 read threads, 8 write threads)
kernel 5.1.0-rc2, ext4

rand-write1   862304ops  14235ops/s  111.2mb/s    0.5ms/op
rand-read1     22065ops    364ops/s    2.8mb/s   21.5ms/op

--- XFS ---
filebench randomrw (8 read threads, 8 write threads)
kernel 5.1.0-rc2, xfs

rand-write1    39451ops    657ops/s    5.1mb/s   12.0ms/op
rand-read1      2035ops     34ops/s    0.3mb/s  232.7ms/op

--- XFS+ ---
filebench randomrw (8 read threads, 8 write threads)
kernel 5.1.0-rc2+ (xfs page cache warmup patch)

rand-write1   935597ops  15592ops/s  121.8mb/s    0.5ms/op
rand-read1      4446ops     74ops/s    0.6mb/s  107.6ms/op

To measure the effects of the two passes of generic_file_read_iter(),
I ran a random read benchmark [3] on a 5GB file with warm and cold
page cache.

=== random read - cold page cache ===

--- EXT4 ---
filebench randomread (8 read threads) - cold page cache
kernel 5.1.0-rc2

rand-read1     23589ops    393ops/s    3.1mb/s   20.3ms/op

--- XFS ---
filebench randomread (8 read threads) - cold page cache
kernel 5.1.0-rc2

rand-read1     20578ops    343ops/s    2.7mb/s   23.3ms/op

--- XFS+ ---
filebench randomread (8 read threads) - cold page cache
kernel 5.1.0-rc2+ (xfs page cache warmup patch)

rand-read1     20476ops    341ops/s    2.7mb/s   23.4ms/op

=== random read - warm page cache ===

--- EXT4 ---
filebench randomread (8 read threads) - warm page cache
kernel 5.1.0-rc2

rand-read1  58168696ops 969410ops/s 7573.5mb/s    0.0ms/op

--- XFS ---
filebench randomread (8 read threads) - warm page cache
kernel 5.1.0-rc2

rand-read1  52748818ops 878951ops/s 6866.8mb/s    0.0ms/op

--- XFS+ ---
filebench randomread (8 read threads) - warm page cache
kernel 5.1.0-rc2+ (xfs page cache warmup patch)

rand-read1  52770537ops 879445ops/s 6870.7mb/s    0.0ms/op

The numbers from this benchmark do not show any measurable difference
for a readers-only workload with either cold or warm page cache.

If needed I can provide more measurements with fio, or with different
workloads and different drives. Fire away!

Thanks,
Amir.
[1] https://github.com/amir73il/filebench/blob/overlayfs-devel/workloads/randomrw.f
[2] https://marc.info/?l=linux-xfs&m=155347265016053&w=2
[3] https://github.com/amir73il/filebench/blob/overlayfs-devel/workloads/randomread.f

 fs/xfs/xfs_file.c | 14 ++++++++++++++
 lib/iov_iter.c    |  5 +++--
 2 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 1f2e284..4e5f88a 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -240,6 +240,20 @@ xfs_file_buffered_aio_read(
 		if (!xfs_ilock_nowait(ip, XFS_IOLOCK_SHARED))
 			return -EAGAIN;
 	} else {
+		/*
+		 * Warm up page cache to minimize time spent under
+		 * shared ilock.
+		 */
+		struct iov_iter iter;
+		loff_t pos = iocb->ki_pos;
+
+		iov_iter_discard(&iter, READ, iov_iter_count(to));
+		ret = generic_file_read_iter(iocb, &iter);
+		if (ret <= 0)
+			return ret;
+
+		iocb->ki_pos = pos;
+
 		xfs_ilock(ip, XFS_IOLOCK_SHARED);
 	}
 	ret = generic_file_read_iter(iocb, to);

diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index ea36dc3..b22e433 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -893,9 +893,10 @@ size_t copy_page_to_iter(struct page *page, size_t offset, size_t bytes,
 		size_t wanted = copy_to_iter(kaddr + offset, bytes, i);
 		kunmap_atomic(kaddr);
 		return wanted;
-	} else if (unlikely(iov_iter_is_discard(i)))
+	} else if (unlikely(iov_iter_is_discard(i))) {
+		i->count -= bytes;
 		return bytes;
-	else if (likely(!iov_iter_is_pipe(i)))
+	} else if (likely(!iov_iter_is_pipe(i)))
 		return copy_page_to_iter_iovec(page, offset, bytes, i);
 	else
 		return copy_page_to_iter_pipe(page, offset, bytes, i);
--
2.7.4

^ permalink raw reply related [flat|nested] 38+ messages in thread
* Re: [POC][PATCH] xfs: reduce ilock contention on buffered randrw workload
From: Dave Chinner @ 2019-04-04 21:17 UTC (permalink / raw)
To: Amir Goldstein
Cc: Darrick J. Wong, Christoph Hellwig, Matthew Wilcox, linux-xfs, linux-fsdevel

On Thu, Apr 04, 2019 at 07:57:37PM +0300, Amir Goldstein wrote:
> This patch improves performance of mixed random rw workload
> on xfs without relaxing the atomic buffered read/write guarantee
> that xfs has always provided.
>
> We achieve that by calling generic_file_read_iter() twice.
> Once with a discard iterator to warm up page cache before taking
> the shared ilock and once again under shared ilock.

This will race with things like truncate, hole punching, etc. that
serialise IO and invalidate the page cache for data integrity
reasons under the IOLOCK. These rely on there being no IO to the
inode in progress at all to work correctly, which this patch
violates. IOWs, while this is fast, it is not safe and so not a
viable approach to solving the problem.

FYI, I'm working on a range lock implementation that should both
solve the performance issue and the reader starvation issue at the
same time by allowing concurrent buffered reads and writes to
different file ranges.

IO range locks will allow proper exclusion for other extent
manipulation operations like fallocate and truncate, and eventually
even allow truncate, hole punch, file extension, etc. to run
concurrently with other non-overlapping IO. They solve more than
just the performance issue you are seeing....

Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [POC][PATCH] xfs: reduce ilock contention on buffered randrw workload
From: Amir Goldstein @ 2019-04-05 14:02 UTC (permalink / raw)
To: Dave Chinner
Cc: Darrick J. Wong, Christoph Hellwig, Matthew Wilcox, linux-xfs, linux-fsdevel

On Fri, Apr 5, 2019 at 12:17 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Thu, Apr 04, 2019 at 07:57:37PM +0300, Amir Goldstein wrote:
> > This patch improves performance of mixed random rw workload
> > on xfs without relaxing the atomic buffered read/write guarantee
> > that xfs has always provided.
> >
> > We achieve that by calling generic_file_read_iter() twice.
> > Once with a discard iterator to warm up page cache before taking
> > the shared ilock and once again under shared ilock.
>
> This will race with things like truncate, hole punching, etc. that
> serialise IO and invalidate the page cache for data integrity
> reasons under the IOLOCK. These rely on there being no IO to the
> inode in progress at all to work correctly, which this patch
> violates. IOWs, while this is fast, it is not safe and so not a
> viable approach to solving the problem.
>

This statement leaves me wondering: if ext4 does not take
i_rwsem on generic_file_read_iter(), how does ext4 (or any other
fs for that matter) guarantee buffered read synchronization with
truncate, hole punching, etc.?

The answer in ext4's case is i_mmap_sem, which is read locked in
the page fault handler. And xfs does the same type of
synchronization with MMAPLOCK, so while my patch may not be safe,
I cannot follow why from your explanation, so please explain if I
am missing something.

One thing that Darrick mentioned earlier was that IOLOCK is also
used by xfs to synchronize pNFS leases (probably listed under 'etc'
in your explanation). I concede that my patch does not look safe
w.r.t. pNFS leases, but that can be sorted out with a hammer
(#ifndef CONFIG_EXPORTFS_BLOCK_OPS) or with finer instruments.

> FYI, I'm working on a range lock implementation that should both
> solve the performance issue and the reader starvation issue at the
> same time by allowing concurrent buffered reads and writes to
> different file ranges.
>
> IO range locks will allow proper exclusion for other extent
> manipulation operations like fallocate and truncate, and eventually
> even allow truncate, hole punch, file extension, etc. to run
> concurrently with other non-overlapping IO. They solve more than
> just the performance issue you are seeing....
>

I'm glad to hear that. IO range locks are definitely a more complete
solution to the problem going forward. However, I am still interested
in continuing the discussion on my POC patch. One reason is that I am
guessing it would be much easier for distros to backport and pick up
to solve performance issues.

Even if my patch doesn't get applied upstream nor picked up by
distros, I would still like to understand its flaws and limitations.
I know... if I break it, I get to keep the pieces, but the
information that you provide helps me make my risk assessments.

Thanks,
Amir.
* Re: [POC][PATCH] xfs: reduce ilock contention on buffered randrw workload
From: Dave Chinner @ 2019-04-07 23:27 UTC (permalink / raw)
To: Amir Goldstein
Cc: Darrick J. Wong, Christoph Hellwig, Matthew Wilcox, linux-xfs, linux-fsdevel

On Fri, Apr 05, 2019 at 05:02:33PM +0300, Amir Goldstein wrote:
> On Fri, Apr 5, 2019 at 12:17 AM Dave Chinner <david@fromorbit.com> wrote:
> >
> > On Thu, Apr 04, 2019 at 07:57:37PM +0300, Amir Goldstein wrote:
> > > This patch improves performance of mixed random rw workload
> > > on xfs without relaxing the atomic buffered read/write guarantee
> > > that xfs has always provided.
> > >
> > > We achieve that by calling generic_file_read_iter() twice.
> > > Once with a discard iterator to warm up page cache before taking
> > > the shared ilock and once again under shared ilock.
> >
> > This will race with things like truncate, hole punching, etc. that
> > serialise IO and invalidate the page cache for data integrity
> > reasons under the IOLOCK. These rely on there being no IO to the
> > inode in progress at all to work correctly, which this patch
> > violates. IOWs, while this is fast, it is not safe and so not a
> > viable approach to solving the problem.
>
> This statement leaves me wondering: if ext4 does not take
> i_rwsem on generic_file_read_iter(), how does ext4 (or any other
> fs for that matter) guarantee buffered read synchronization with
> truncate, hole punching, etc.?
> The answer in ext4's case is i_mmap_sem, which is read locked
> in the page fault handler.

Nope, the i_mmap_sem is for serialisation of /page faults/ against
truncate, holepunching, etc. Completely irrelevant to the read()
path.

> And xfs does the same type of synchronization with MMAPLOCK,
> so while my patch may not be safe, I cannot follow why from your
> explanation, so please explain if I am missing something.

mmap_sem inversions require independent locks for the IO path and
page faults - the MMAPLOCK does not protect anything in the
read()/write() IO path.

> One thing that Darrick mentioned earlier was that IOLOCK is also
> used by xfs to synchronize pNFS leases (probably listed under
> 'etc' in your explanation).

pNFS leases are separate to the IO locking. All the IO locking does
is serialise new IO submission against the process of breaking
leases so that extent maps that have been shared under the lease
are invalidated correctly. i.e. we can't start IO until the lease
has been recalled and the external client has guaranteed it won't
read/write data from the stale extent map.

If you do IO outside the IOLOCK, then you break those serialisation
guarantees and risk data corruption and/or stale data exposure...

> I concede that my patch does not look safe
> w.r.t. pNFS leases, but that can be sorted out with a hammer
> (#ifndef CONFIG_EXPORTFS_BLOCK_OPS)
> or with finer instruments.

All you see is this:

truncate:                               read()

IOLOCK_EXCL
flush relevant cached data
truncate page cache
                                        pre-read page cache between
                                        new eof and old eof
                                        IOLOCK_SHARED
                                        <blocks>
start transaction
ILOCK_EXCL
update isize
remove extents
....
commit xactn
IOLOCK unlock
                                        <gets lock>
                                        sees beyond EOF, returns 0

So you see the read() doing the right thing (detect EOF, returning
a short read). Great.

But what I see is uptodate pages containing stale data being left in
the page cache beyond EOF. That is the problem here - truncate must
not leave stale pages beyond EOF behind - it's the landmine that
causes future things to go wrong.

e.g. now the app does post-eof preallocation, so the range those
pages are cached over is allocated as unwritten - the filesystem
will do this without even looking at the page cache because it's
beyond EOF. Now we extend the file past those cached pages, and
iomap_zero() sees the range as unwritten and so does not write zeros
to the blocks between the old EOF and the new EOF. Now the app reads
from that range (say it does a sub-page write, triggering a page
cache RMW cycle). The read goes to instantiate the page cache page,
finds a page already in the cache that is uptodate, and uses it
without zeroing or reading from disk.

And now we have stale data exposure and/or data corruption.

I can come up with quite a few scenarios where this particular
"populate cache after invalidation" race can cause similar problems
for XFS. Hole punch and most of the other fallocate extent
manipulations have the same serialisation requirements - no IO over
the range of the operation can be *initiated* between the /start/ of
the page cache invalidation and the end of the specific extent
manipulation operation.

So how does ext4 avoid this problem on truncate?

History lesson: truncate in Linux (and hence ext4) has traditionally
been serialised by the hacky post-page-lock checks that are strewn
all through the page cache and mm/ subsystem. i.e. every time you
look up and lock a page cache page, you have to check the
page->mapping and page->index to ensure that the lookup-and-lock
hasn't raced with truncate. This only works because truncate
requires the inode size to be updated before invalidating the page
cache - that's the "hacky" part of it.

IOWs, the burden of detecting truncate races is strewn throughout
the mm/ subsystem, rather than being the responsibility of the
filesystem. This is made worse by the fact this mechanism simply
doesn't work for hole punching because there is no file size change
to indicate that the page lookup is racing with an in-progress
invalidation.

That means the mm/ and page cache code is unable to detect hole
punch races, and so the serialisation of invalidation vs page cache
instantiation has to be done in the filesystem. And no Linux native
filesystem had the infrastructure for such serialisation because
they never had to implement anything to ensure truncate was
serialised against new and in-progress IO.

The result of this is that, AFAICT, ext4 does not protect against
read() vs hole punch races - its hole punching code does:

Hole Punch:                             read():

inode_lock()
inode_dio_wait(inode);
down_write(i_mmap_sem)
truncate_pagecache_range()
                                        ext4_file_iter_read()
                                        ext4_map_blocks()
                                        down_read(i_data_sem)
                                        <gets mapping>
                                        <populates page cache over hole>
                                        <reads stale data into cache>
.....
down_write(i_data_sem)
remove extents

IOWs, ext4 is safe against truncate because of the
change-inode-size-before-invalidation hacks, but the lack of
serialisation of buffered reads means that hole punch and other
similar fallocate based extent manipulations can race against
reads....

> However, I am still interested in continuing the discussion on my POC
> patch. One reason is that I am guessing it would be much easier for
> distros to backport and pick up to solve performance issues.

Upstream first, please. If it's not fit for upstream, then it most
definitely is not fit for backporting to distro kernels.

Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [POC][PATCH] xfs: reduce ilock contention on buffered randrw workload
From: Amir Goldstein @ 2019-04-08 9:02 UTC (permalink / raw)
To: Dave Chinner
Cc: Darrick J. Wong, Christoph Hellwig, Matthew Wilcox, linux-xfs, linux-fsdevel, Ext4, Lukas Czerner, Theodore Tso, Jan Kara

On Mon, Apr 8, 2019 at 2:27 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Fri, Apr 05, 2019 at 05:02:33PM +0300, Amir Goldstein wrote:
> > On Fri, Apr 5, 2019 at 12:17 AM Dave Chinner <david@fromorbit.com> wrote:
> > >
> > > On Thu, Apr 04, 2019 at 07:57:37PM +0300, Amir Goldstein wrote:
> > > > This patch improves performance of mixed random rw workload
> > > > on xfs without relaxing the atomic buffered read/write guarantee
> > > > that xfs has always provided.
> > > >
> > > > We achieve that by calling generic_file_read_iter() twice.
> > > > Once with a discard iterator to warm up page cache before taking
> > > > the shared ilock and once again under shared ilock.
> > >
> > > This will race with things like truncate, hole punching, etc. that
> > > serialise IO and invalidate the page cache for data integrity
> > > reasons under the IOLOCK. These rely on there being no IO to the
> > > inode in progress at all to work correctly, which this patch
> > > violates. IOWs, while this is fast, it is not safe and so not a
> > > viable approach to solving the problem.
> >
> > This statement leaves me wondering: if ext4 does not take
> > i_rwsem on generic_file_read_iter(), how does ext4 (or any other
> > fs for that matter) guarantee buffered read synchronization with
> > truncate, hole punching, etc.?
> > The answer in ext4's case is i_mmap_sem, which is read locked
> > in the page fault handler.
>
> Nope, the i_mmap_sem is for serialisation of /page faults/ against
> truncate, holepunching, etc. Completely irrelevant to the read()
> path.

I'm at a loss here. Why are page faults completely irrelevant to the
read() path? Aren't full pages supposed to be faulted in on read()
after truncate_pagecache_range()? And aren't partial pages supposed
to be partially zeroed and uptodate after truncate_pagecache_range()?

> > And xfs does the same type of synchronization with MMAPLOCK,
> > so while my patch may not be safe, I cannot follow why from your
> > explanation, so please explain if I am missing something.
>
> mmap_sem inversions require independent locks for the IO path and
> page faults - the MMAPLOCK does not protect anything in the
> read()/write() IO path.

[...]

> All you see is this:
>
> truncate:                               read()
>
> IOLOCK_EXCL
> flush relevant cached data
> truncate page cache
>                                         pre-read page cache between
>                                         new eof and old eof
>                                         IOLOCK_SHARED
>                                         <blocks>
> start transaction
> ILOCK_EXCL
> update isize
> remove extents
> ....
> commit xactn
> IOLOCK unlock
>                                         <gets lock>
>                                         sees beyond EOF, returns 0
>
> So you see the read() doing the right thing (detect EOF, returning
> a short read). Great.
>
> But what I see is uptodate pages containing stale data being left in
> the page cache beyond EOF. That is the problem here - truncate must
> not leave stale pages beyond EOF behind - it's the landmine that
> causes future things to go wrong.
>
> e.g. now the app does post-eof preallocation, so the range those
> pages are cached over is allocated as unwritten - the filesystem
> will do this without even looking at the page cache because it's
> beyond EOF. Now we extend the file past those cached pages, and
> iomap_zero() sees the range as unwritten and so does not write zeros
> to the blocks between the old EOF and the new EOF. Now the app reads
> from that range (say it does a sub-page write, triggering a page
> cache RMW cycle). The read goes to instantiate the page cache page,
> finds a page already in the cache that is uptodate, and uses it
> without zeroing or reading from disk.
>
> And now we have stale data exposure and/or data corruption.
>
> I can come up with quite a few scenarios where this particular
> "populate cache after invalidation" race can cause similar problems
> for XFS. Hole punch and most of the other fallocate extent
> manipulations have the same serialisation requirements - no IO over
> the range of the operation can be *initiated* between the /start/ of
> the page cache invalidation and the end of the specific extent
> manipulation operation.
>
> So how does ext4 avoid this problem on truncate?
>
> History lesson: truncate in Linux (and hence ext4) has traditionally
> been serialised by the hacky post-page-lock checks that are strewn
> all through the page cache and mm/ subsystem. i.e. every time you
> look up and lock a page cache page, you have to check the
> page->mapping and page->index to ensure that the lookup-and-lock
> hasn't raced with truncate. This only works because truncate
> requires the inode size to be updated before invalidating the page
> cache - that's the "hacky" part of it.
>
> IOWs, the burden of detecting truncate races is strewn throughout
> the mm/ subsystem, rather than being the responsibility of the
> filesystem. This is made worse by the fact this mechanism simply
> doesn't work for hole punching because there is no file size change
> to indicate that the page lookup is racing with an in-progress
> invalidation.
>
> That means the mm/ and page cache code is unable to detect hole
> punch races, and so the serialisation of invalidation vs page cache
> instantiation has to be done in the filesystem. And no Linux native
> filesystem had the infrastructure for such serialisation because
> they never had to implement anything to ensure truncate was
> serialised against new and in-progress IO.
>
> The result of this is that, AFAICT, ext4 does not protect against
> read() vs hole punch races - its hole punching code does:
>
> Hole Punch:                             read():
>
> inode_lock()
> inode_dio_wait(inode);
> down_write(i_mmap_sem)
> truncate_pagecache_range()
>                                         ext4_file_iter_read()
>                                         ext4_map_blocks()
>                                         down_read(i_data_sem)
>                                         <gets mapping>
>                                         <populates page cache over hole>
>                                         <reads stale data into cache>
> .....
> down_write(i_data_sem)
> remove extents
>
> IOWs, ext4 is safe against truncate because of the
> change-inode-size-before-invalidation hacks, but the lack of
> serialisation of buffered reads means that hole punch and other
> similar fallocate based extent manipulations can race against
> reads....

Adding some ext4 folks to comment on the above.

Could it be that those races were already addressed by Lukas' work:
https://lore.kernel.org/patchwork/cover/371861/

Thanks,
Amir.
* Re: [POC][PATCH] xfs: reduce ilock contention on buffered randrw workload
From: Jan Kara @ 2019-04-08 14:11 UTC (permalink / raw)
To: Amir Goldstein
Cc: Dave Chinner, Darrick J. Wong, Christoph Hellwig, Matthew Wilcox, linux-xfs, linux-fsdevel, Ext4, Lukas Czerner, Theodore Tso, Jan Kara

On Mon 08-04-19 12:02:34, Amir Goldstein wrote:
> On Mon, Apr 8, 2019 at 2:27 AM Dave Chinner <david@fromorbit.com> wrote:
> >
> > On Fri, Apr 05, 2019 at 05:02:33PM +0300, Amir Goldstein wrote:
> > > On Fri, Apr 5, 2019 at 12:17 AM Dave Chinner <david@fromorbit.com> wrote:
> > > >
> > > > On Thu, Apr 04, 2019 at 07:57:37PM +0300, Amir Goldstein wrote:
> > > > > This patch improves performance of mixed random rw workload
> > > > > on xfs without relaxing the atomic buffered read/write guarantee
> > > > > that xfs has always provided.
> > > > >
> > > > > We achieve that by calling generic_file_read_iter() twice.
> > > > > Once with a discard iterator to warm up page cache before taking
> > > > > the shared ilock and once again under shared ilock.
> > > >
> > > > This will race with things like truncate, hole punching, etc. that
> > > > serialise IO and invalidate the page cache for data integrity
> > > > reasons under the IOLOCK. These rely on there being no IO to the
> > > > inode in progress at all to work correctly, which this patch
> > > > violates. IOWs, while this is fast, it is not safe and so not a
> > > > viable approach to solving the problem.
> > >
> > > This statement leaves me wondering: if ext4 does not take
> > > i_rwsem on generic_file_read_iter(), how does ext4 (or any other
> > > fs for that matter) guarantee buffered read synchronization with
> > > truncate, hole punching, etc.?
> > > The answer in ext4's case is i_mmap_sem, which is read locked
> > > in the page fault handler.
> >
> > Nope, the i_mmap_sem is for serialisation of /page faults/ against
> > truncate, holepunching, etc. Completely irrelevant to the read()
> > path.
>
> I'm at a loss here. Why are page faults completely irrelevant to the
> read() path? Aren't full pages supposed to be faulted in on read()
> after truncate_pagecache_range()?

During read(2), pages are not "faulted in". Just look at what
generic_file_buffered_read() does. It uses completely separate code to
add a page to the page cache, trigger readahead, and possibly call
->readpage() to fill the page with data. The "fault" path (handled by
filemap_fault()) applies only to accesses from userspace to mmaps.

> And aren't partial pages supposed to be partially zeroed and uptodate
> after truncate_pagecache_range()?

truncate_pagecache_range() does take care of page zeroing, that is
correct (if those pages are in the page cache in the first place).
I'm not sure how this relates to the above discussion though.

								Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [POC][PATCH] xfs: reduce ilock contention on buffered randrw workload
From: Amir Goldstein @ 2019-04-08 17:41 UTC (permalink / raw)
To: Jan Kara
Cc: Dave Chinner, Darrick J. Wong, Christoph Hellwig, Matthew Wilcox, linux-xfs, linux-fsdevel, Ext4, Lukas Czerner, Theodore Tso

On Mon, Apr 8, 2019 at 5:11 PM Jan Kara <jack@suse.cz> wrote:
>
> On Mon 08-04-19 12:02:34, Amir Goldstein wrote:
> > On Mon, Apr 8, 2019 at 2:27 AM Dave Chinner <david@fromorbit.com> wrote:
> > >
> > > On Fri, Apr 05, 2019 at 05:02:33PM +0300, Amir Goldstein wrote:
> > > > On Fri, Apr 5, 2019 at 12:17 AM Dave Chinner <david@fromorbit.com> wrote:
> > > > >
> > > > > On Thu, Apr 04, 2019 at 07:57:37PM +0300, Amir Goldstein wrote:
> > > > > > This patch improves performance of mixed random rw workload
> > > > > > on xfs without relaxing the atomic buffered read/write guarantee
> > > > > > that xfs has always provided.
> > > > > >
> > > > > > We achieve that by calling generic_file_read_iter() twice.
> > > > > > Once with a discard iterator to warm up page cache before taking
> > > > > > the shared ilock and once again under shared ilock.
> > > > >
> > > > > This will race with things like truncate, hole punching, etc. that
> > > > > serialise IO and invalidate the page cache for data integrity
> > > > > reasons under the IOLOCK. These rely on there being no IO to the
> > > > > inode in progress at all to work correctly, which this patch
> > > > > violates. IOWs, while this is fast, it is not safe and so not a
> > > > > viable approach to solving the problem.
> > > >
> > > > This statement leaves me wondering: if ext4 does not take
> > > > i_rwsem on generic_file_read_iter(), how does ext4 (or any other
> > > > fs for that matter) guarantee buffered read synchronization with
> > > > truncate, hole punching, etc.?
> > > > The answer in ext4's case is i_mmap_sem, which is read locked
> > > > in the page fault handler.
> > >
> > > Nope, the i_mmap_sem is for serialisation of /page faults/ against
> > > truncate, holepunching, etc. Completely irrelevant to the read()
> > > path.
> >
> > I'm at a loss here. Why are page faults completely irrelevant to the
> > read() path? Aren't full pages supposed to be faulted in on read()
> > after truncate_pagecache_range()?
>
> During read(2), pages are not "faulted in". Just look at what
> generic_file_buffered_read() does. It uses completely separate code to
> add a page to the page cache, trigger readahead, and possibly call
> ->readpage() to fill the page with data. The "fault" path (handled by
> filemap_fault()) applies only to accesses from userspace to mmaps.
>

Oh! thanks for fixing my blind spot.
So if you agree with Dave that ext4, and who knows what other fs,
are vulnerable to populating the page cache with stale "uptodate"
data, then it seems to me that xfs is also vulnerable via
readahead(2) and posix_fadvise().
Mind you, I recently added an fadvise f_op, so it could be used by
xfs to synchronize with IOLOCK.

Perhaps a better solution would be for truncate_pagecache_range()
to leave zeroed or Unwritten (i.e. lazy zeroed by read) pages in the
page cache. When we have shared pages for files, these pages could
be deduped.

Thanks,
Amir.
* Re: [POC][PATCH] xfs: reduce ilock contention on buffered randrw workload 2019-04-08 17:41 ` Amir Goldstein @ 2019-04-09 8:26 ` Jan Kara 2022-06-17 14:48 ` Amir Goldstein 0 siblings, 1 reply; 38+ messages in thread From: Jan Kara @ 2019-04-09 8:26 UTC (permalink / raw) To: Amir Goldstein Cc: Jan Kara, Dave Chinner, Darrick J . Wong, Christoph Hellwig, Matthew Wilcox, linux-xfs, linux-fsdevel, Ext4, Lukas Czerner, Theodore Tso, Al Viro On Mon 08-04-19 20:41:09, Amir Goldstein wrote: > On Mon, Apr 8, 2019 at 5:11 PM Jan Kara <jack@suse.cz> wrote: > > > > On Mon 08-04-19 12:02:34, Amir Goldstein wrote: > > > On Mon, Apr 8, 2019 at 2:27 AM Dave Chinner <david@fromorbit.com> wrote: > > > > > > > > On Fri, Apr 05, 2019 at 05:02:33PM +0300, Amir Goldstein wrote: > > > > > On Fri, Apr 5, 2019 at 12:17 AM Dave Chinner <david@fromorbit.com> wrote: > > > > > > > > > > > > On Thu, Apr 04, 2019 at 07:57:37PM +0300, Amir Goldstein wrote: > > > > > > > This patch improves performance of mixed random rw workload > > > > > > > on xfs without relaxing the atomic buffered read/write guaranty > > > > > > > that xfs has always provided. > > > > > > > > > > > > > > We achieve that by calling generic_file_read_iter() twice. > > > > > > > Once with a discard iterator to warm up page cache before taking > > > > > > > the shared ilock and once again under shared ilock. > > > > > > > > > > > > This will race with thing like truncate, hole punching, etc that > > > > > > serialise IO and invalidate the page cache for data integrity > > > > > > reasons under the IOLOCK. These rely on there being no IO to the > > > > > > inode in progress at all to work correctly, which this patch > > > > > > violates. IOWs, while this is fast, it is not safe and so not a > > > > > > viable approach to solving the problem. 
> > > > > > > > > > > > > > > > This statement leaves me wondering, if ext4 does not takes > > > > > i_rwsem on generic_file_read_iter(), how does ext4 (or any other > > > > > fs for that matter) guaranty buffered read synchronization with > > > > > truncate, hole punching etc? > > > > > The answer in ext4 case is i_mmap_sem, which is read locked > > > > > in the page fault handler. > > > > > > > > Nope, the i_mmap_sem is for serialisation of /page faults/ against > > > > truncate, holepunching, etc. Completely irrelevant to the read() > > > > path. > > > > > > > > > > I'm at lost here. Why are page faults completely irrelevant to read() > > > path? Aren't full pages supposed to be faulted in on read() after > > > truncate_pagecache_range()? > > > > During read(2), pages are not "faulted in". Just look at > > what generic_file_buffered_read() does. It uses completely separate code to > > add page to page cache, trigger readahead, and possibly call ->readpage() to > > fill the page with data. "fault" path (handled by filemap_fault()) applies > > only to accesses from userspace to mmaps. > > > > Oh! thanks for fixing my blind spot. > So if you agree with Dave that ext4, and who knows what other fs, > are vulnerable to populating page cache with stale "uptodate" data, Not that many filesystems support punching holes but you're right. > then it seems to me that also xfs is vulnerable via readahead(2) and > posix_fadvise(). Yes, this is correct AFAICT. > Mind you, I recently added an fadvise f_op, so it could be used by > xfs to synchronize with IOLOCK. And yes, this should work. > Perhaps a better solution would be for truncate_pagecache_range() > to leave zeroed or Unwritten (i.e. lazy zeroed by read) pages in page > cache. When we have shared pages for files, these pages could be > deduped. No, I wouldn't really mess with sharing pages due to this. It would be hard to make that scale resonably and would be rather complex. 
We really need a proper and reasonably simple synchronization mechanism between operations removing blocks from an inode and operations filling in the inode's page cache. The page lock was supposed to provide this but doesn't quite work, because hole punching first removes the pagecache pages and only then goes on to remove the blocks. So I agree with Dave that going for a range lock is really the cleanest way forward here without causing big regressions for mixed rw workloads. I'm just thinking how best to do that without introducing a lot of boilerplate code into each filesystem. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [POC][PATCH] xfs: reduce ilock contention on buffered randrw workload 2019-04-09 8:26 ` Jan Kara @ 2022-06-17 14:48 ` Amir Goldstein 2022-06-17 15:11 ` Jan Kara 0 siblings, 1 reply; 38+ messages in thread From: Amir Goldstein @ 2022-06-17 14:48 UTC (permalink / raw) To: Jan Kara Cc: Dave Chinner, Darrick J . Wong, Christoph Hellwig, Matthew Wilcox, linux-xfs, linux-fsdevel, Ext4, Lukas Czerner, Theodore Tso, Al Viro On Tue, Apr 9, 2019 at 11:26 AM Jan Kara <jack@suse.cz> wrote: > > On Mon 08-04-19 20:41:09, Amir Goldstein wrote: > > On Mon, Apr 8, 2019 at 5:11 PM Jan Kara <jack@suse.cz> wrote: > > > > > > On Mon 08-04-19 12:02:34, Amir Goldstein wrote: > > > > On Mon, Apr 8, 2019 at 2:27 AM Dave Chinner <david@fromorbit.com> wrote: > > > > > > > > > > On Fri, Apr 05, 2019 at 05:02:33PM +0300, Amir Goldstein wrote: > > > > > > On Fri, Apr 5, 2019 at 12:17 AM Dave Chinner <david@fromorbit.com> wrote: > > > > > > > > > > > > > > On Thu, Apr 04, 2019 at 07:57:37PM +0300, Amir Goldstein wrote: > > > > > > > > This patch improves performance of mixed random rw workload > > > > > > > > on xfs without relaxing the atomic buffered read/write guaranty > > > > > > > > that xfs has always provided. > > > > > > > > > > > > > > > > We achieve that by calling generic_file_read_iter() twice. > > > > > > > > Once with a discard iterator to warm up page cache before taking > > > > > > > > the shared ilock and once again under shared ilock. > > > > > > > > > > > > > > This will race with thing like truncate, hole punching, etc that > > > > > > > serialise IO and invalidate the page cache for data integrity > > > > > > > reasons under the IOLOCK. These rely on there being no IO to the > > > > > > > inode in progress at all to work correctly, which this patch > > > > > > > violates. IOWs, while this is fast, it is not safe and so not a > > > > > > > viable approach to solving the problem. 
> > > > > > > > > > > > > > > > > > > This statement leaves me wondering, if ext4 does not takes > > > > > > i_rwsem on generic_file_read_iter(), how does ext4 (or any other > > > > > > fs for that matter) guaranty buffered read synchronization with > > > > > > truncate, hole punching etc? > > > > > > The answer in ext4 case is i_mmap_sem, which is read locked > > > > > > in the page fault handler. > > > > > > > > > > Nope, the i_mmap_sem is for serialisation of /page faults/ against > > > > > truncate, holepunching, etc. Completely irrelevant to the read() > > > > > path. > > > > > > > > > > > > > I'm at lost here. Why are page faults completely irrelevant to read() > > > > path? Aren't full pages supposed to be faulted in on read() after > > > > truncate_pagecache_range()? > > > > > > During read(2), pages are not "faulted in". Just look at > > > what generic_file_buffered_read() does. It uses completely separate code to > > > add page to page cache, trigger readahead, and possibly call ->readpage() to > > > fill the page with data. "fault" path (handled by filemap_fault()) applies > > > only to accesses from userspace to mmaps. > > > > > > > Oh! thanks for fixing my blind spot. > > So if you agree with Dave that ext4, and who knows what other fs, > > are vulnerable to populating page cache with stale "uptodate" data, > > Not that many filesystems support punching holes but you're right. > > > then it seems to me that also xfs is vulnerable via readahead(2) and > > posix_fadvise(). > > Yes, this is correct AFAICT. > > > Mind you, I recently added an fadvise f_op, so it could be used by > > xfs to synchronize with IOLOCK. > > And yes, this should work. > > > Perhaps a better solution would be for truncate_pagecache_range() > > to leave zeroed or Unwritten (i.e. lazy zeroed by read) pages in page > > cache. When we have shared pages for files, these pages could be > > deduped. > > No, I wouldn't really mess with sharing pages due to this. 
It would be hard > to make that scale resonably and would be rather complex. We really need a > proper and reasonably simple synchronization mechanism between operations > removing blocks from inode and operations filling in page cache of the > inode. Page lock was supposed to provide this but doesn't quite work > because hole punching first remove pagecache pages and then go removing all > blocks. > > So I agree with Dave that going for range lock is really the cleanest way > forward here without causing big regressions for mixed rw workloads. I'm > just thinking how to best do that without introducing lot of boilerplate > code into each filesystem. Hi Jan, Dave, Trying to circle back to this after 3 years, seeing that there has been no progress with range locks and that the mixed rw workloads performance issue still very much exists. Is the situation now different than it was 3 years ago, with invalidate_lock? Would my approach of pre-warming the page cache before taking IOLOCK be safe if the page cache is pre-warmed with invalidate_lock held? For the pNFS leases issue, as I wrote back in the pre-COVID era, I intend to opt out of this optimization with #ifndef CONFIG_EXPORTFS_BLOCK_OPS. Thanks, Amir. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [POC][PATCH] xfs: reduce ilock contention on buffered randrw workload 2022-06-17 14:48 ` Amir Goldstein @ 2022-06-17 15:11 ` Jan Kara 2022-06-18 8:38 ` Amir Goldstein 0 siblings, 1 reply; 38+ messages in thread From: Jan Kara @ 2022-06-17 15:11 UTC (permalink / raw) To: Amir Goldstein Cc: Jan Kara, Dave Chinner, Darrick J . Wong, Christoph Hellwig, Matthew Wilcox, linux-xfs, linux-fsdevel, Ext4, Lukas Czerner, Theodore Tso, Al Viro On Fri 17-06-22 17:48:08, Amir Goldstein wrote: > On Tue, Apr 9, 2019 at 11:26 AM Jan Kara <jack@suse.cz> wrote: > > > > On Mon 08-04-19 20:41:09, Amir Goldstein wrote: > > > On Mon, Apr 8, 2019 at 5:11 PM Jan Kara <jack@suse.cz> wrote: > > > > > > > > On Mon 08-04-19 12:02:34, Amir Goldstein wrote: > > > > > On Mon, Apr 8, 2019 at 2:27 AM Dave Chinner <david@fromorbit.com> wrote: > > > > > > > > > > > > On Fri, Apr 05, 2019 at 05:02:33PM +0300, Amir Goldstein wrote: > > > > > > > On Fri, Apr 5, 2019 at 12:17 AM Dave Chinner <david@fromorbit.com> wrote: > > > > > > > > > > > > > > > > On Thu, Apr 04, 2019 at 07:57:37PM +0300, Amir Goldstein wrote: > > > > > > > > > This patch improves performance of mixed random rw workload > > > > > > > > > on xfs without relaxing the atomic buffered read/write guaranty > > > > > > > > > that xfs has always provided. > > > > > > > > > > > > > > > > > > We achieve that by calling generic_file_read_iter() twice. > > > > > > > > > Once with a discard iterator to warm up page cache before taking > > > > > > > > > the shared ilock and once again under shared ilock. > > > > > > > > > > > > > > > > This will race with thing like truncate, hole punching, etc that > > > > > > > > serialise IO and invalidate the page cache for data integrity > > > > > > > > reasons under the IOLOCK. These rely on there being no IO to the > > > > > > > > inode in progress at all to work correctly, which this patch > > > > > > > > violates. 
IOWs, while this is fast, it is not safe and so not a > > > > > > > > viable approach to solving the problem. > > > > > > > > > > > > > > > > > > > > > > This statement leaves me wondering, if ext4 does not takes > > > > > > > i_rwsem on generic_file_read_iter(), how does ext4 (or any other > > > > > > > fs for that matter) guaranty buffered read synchronization with > > > > > > > truncate, hole punching etc? > > > > > > > The answer in ext4 case is i_mmap_sem, which is read locked > > > > > > > in the page fault handler. > > > > > > > > > > > > Nope, the i_mmap_sem is for serialisation of /page faults/ against > > > > > > truncate, holepunching, etc. Completely irrelevant to the read() > > > > > > path. > > > > > > > > > > > > > > > > I'm at lost here. Why are page faults completely irrelevant to read() > > > > > path? Aren't full pages supposed to be faulted in on read() after > > > > > truncate_pagecache_range()? > > > > > > > > During read(2), pages are not "faulted in". Just look at > > > > what generic_file_buffered_read() does. It uses completely separate code to > > > > add page to page cache, trigger readahead, and possibly call ->readpage() to > > > > fill the page with data. "fault" path (handled by filemap_fault()) applies > > > > only to accesses from userspace to mmaps. > > > > > > > > > > Oh! thanks for fixing my blind spot. > > > So if you agree with Dave that ext4, and who knows what other fs, > > > are vulnerable to populating page cache with stale "uptodate" data, > > > > Not that many filesystems support punching holes but you're right. > > > > > then it seems to me that also xfs is vulnerable via readahead(2) and > > > posix_fadvise(). > > > > Yes, this is correct AFAICT. > > > > > Mind you, I recently added an fadvise f_op, so it could be used by > > > xfs to synchronize with IOLOCK. > > > > And yes, this should work. > > > > > Perhaps a better solution would be for truncate_pagecache_range() > > > to leave zeroed or Unwritten (i.e. 
lazy zeroed by read) pages in page > > > cache. When we have shared pages for files, these pages could be > > > deduped. > > > > No, I wouldn't really mess with sharing pages due to this. It would be hard > > to make that scale resonably and would be rather complex. We really need a > > proper and reasonably simple synchronization mechanism between operations > > removing blocks from inode and operations filling in page cache of the > > inode. Page lock was supposed to provide this but doesn't quite work > > because hole punching first remove pagecache pages and then go removing all > > blocks. > > > > So I agree with Dave that going for range lock is really the cleanest way > > forward here without causing big regressions for mixed rw workloads. I'm > > just thinking how to best do that without introducing lot of boilerplate > > code into each filesystem. > > Hi Jan, Dave, > > Trying to circle back to this after 3 years! > Seeing that there is no progress with range locks and > that the mixed rw workloads performance issue still very much exists. > > Is the situation now different than 3 years ago with invalidate_lock? Yes, I've implemented invalidate_lock exactly to fix the issues you've pointed out without regressing the mixed rw workloads (because invalidate_lock is taken in shared mode only for reads and usually not at all for writes). > Would my approach of pre-warm page cache before taking IOLOCK > be safe if page cache is pre-warmed with invalidate_lock held? Why would it be needed? But yes, with invalidate_lock you could presumably make that idea safe... Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [POC][PATCH] xfs: reduce ilock contention on buffered randrw workload 2022-06-17 15:11 ` Jan Kara @ 2022-06-18 8:38 ` Amir Goldstein 2022-06-20 9:11 ` Jan Kara 0 siblings, 1 reply; 38+ messages in thread From: Amir Goldstein @ 2022-06-18 8:38 UTC (permalink / raw) To: Jan Kara, Darrick J . Wong Cc: Dave Chinner, Christoph Hellwig, Matthew Wilcox, linux-xfs, linux-fsdevel On Fri, Jun 17, 2022 at 6:11 PM Jan Kara <jack@suse.cz> wrote: > > On Fri 17-06-22 17:48:08, Amir Goldstein wrote: > > On Tue, Apr 9, 2019 at 11:26 AM Jan Kara <jack@suse.cz> wrote: > > > > > > On Mon 08-04-19 20:41:09, Amir Goldstein wrote: > > > > On Mon, Apr 8, 2019 at 5:11 PM Jan Kara <jack@suse.cz> wrote: > > > > > > > > > > On Mon 08-04-19 12:02:34, Amir Goldstein wrote: > > > > > > On Mon, Apr 8, 2019 at 2:27 AM Dave Chinner <david@fromorbit.com> wrote: > > > > > > > > > > > > > > On Fri, Apr 05, 2019 at 05:02:33PM +0300, Amir Goldstein wrote: > > > > > > > > On Fri, Apr 5, 2019 at 12:17 AM Dave Chinner <david@fromorbit.com> wrote: > > > > > > > > > > > > > > > > > > On Thu, Apr 04, 2019 at 07:57:37PM +0300, Amir Goldstein wrote: > > > > > > > > > > This patch improves performance of mixed random rw workload > > > > > > > > > > on xfs without relaxing the atomic buffered read/write guaranty > > > > > > > > > > that xfs has always provided. > > > > > > > > > > > > > > > > > > > > We achieve that by calling generic_file_read_iter() twice. > > > > > > > > > > Once with a discard iterator to warm up page cache before taking > > > > > > > > > > the shared ilock and once again under shared ilock. > > > > > > > > > > > > > > > > > > This will race with thing like truncate, hole punching, etc that > > > > > > > > > serialise IO and invalidate the page cache for data integrity > > > > > > > > > reasons under the IOLOCK. These rely on there being no IO to the > > > > > > > > > inode in progress at all to work correctly, which this patch > > > > > > > > > violates. 
IOWs, while this is fast, it is not safe and so not a > > > > > > > > > viable approach to solving the problem. > > > > > > > > > > > > > > > > > > > > > > > > > This statement leaves me wondering, if ext4 does not takes > > > > > > > > i_rwsem on generic_file_read_iter(), how does ext4 (or any other > > > > > > > > fs for that matter) guaranty buffered read synchronization with > > > > > > > > truncate, hole punching etc? > > > > > > > > The answer in ext4 case is i_mmap_sem, which is read locked > > > > > > > > in the page fault handler. > > > > > > > > > > > > > > Nope, the i_mmap_sem is for serialisation of /page faults/ against > > > > > > > truncate, holepunching, etc. Completely irrelevant to the read() > > > > > > > path. > > > > > > > > > > > > > > > > > > > I'm at lost here. Why are page faults completely irrelevant to read() > > > > > > path? Aren't full pages supposed to be faulted in on read() after > > > > > > truncate_pagecache_range()? > > > > > > > > > > During read(2), pages are not "faulted in". Just look at > > > > > what generic_file_buffered_read() does. It uses completely separate code to > > > > > add page to page cache, trigger readahead, and possibly call ->readpage() to > > > > > fill the page with data. "fault" path (handled by filemap_fault()) applies > > > > > only to accesses from userspace to mmaps. > > > > > > > > > > > > > Oh! thanks for fixing my blind spot. > > > > So if you agree with Dave that ext4, and who knows what other fs, > > > > are vulnerable to populating page cache with stale "uptodate" data, > > > > > > Not that many filesystems support punching holes but you're right. > > > > > > > then it seems to me that also xfs is vulnerable via readahead(2) and > > > > posix_fadvise(). > > > > > > Yes, this is correct AFAICT. > > > > > > > Mind you, I recently added an fadvise f_op, so it could be used by > > > > xfs to synchronize with IOLOCK. > > > > > > And yes, this should work. 
> > > > > > > Perhaps a better solution would be for truncate_pagecache_range() > > > > to leave zeroed or Unwritten (i.e. lazy zeroed by read) pages in page > > > > cache. When we have shared pages for files, these pages could be > > > > deduped. > > > > > > No, I wouldn't really mess with sharing pages due to this. It would be hard > > > to make that scale resonably and would be rather complex. We really need a > > > proper and reasonably simple synchronization mechanism between operations > > > removing blocks from inode and operations filling in page cache of the > > > inode. Page lock was supposed to provide this but doesn't quite work > > > because hole punching first remove pagecache pages and then go removing all > > > blocks. > > > > > > So I agree with Dave that going for range lock is really the cleanest way > > > forward here without causing big regressions for mixed rw workloads. I'm > > > just thinking how to best do that without introducing lot of boilerplate > > > code into each filesystem. > > > > Hi Jan, Dave, > > > > Trying to circle back to this after 3 years! > > Seeing that there is no progress with range locks and > > that the mixed rw workloads performance issue still very much exists. > > > > Is the situation now different than 3 years ago with invalidate_lock? > > Yes, I've implemented invalidate_lock exactly to fix the issues you've > pointed out without regressing the mixed rw workloads (because > invalidate_lock is taken in shared mode only for reads and usually not at > all for writes). > > > Would my approach of pre-warm page cache before taking IOLOCK > > be safe if page cache is pre-warmed with invalidate_lock held? > > Why would it be needed? But yes, with invalidate_lock you could presumably > make that idea safe... To remind you, the context in which I pointed you to the punch hole race issue in "other file systems" was a discussion about trying to relax the "atomic write" POSIX semantics [1] of xfs. 
There was a lot of discussion around range locks and around changing the fairness of rwsem readers and writers, but none of this changes the fact that as long as the lock is file wide (and it does not look like that is going to change in the near future), it is better for lock contention to perform the serialization on page cache read/write and not on disk read/write. Therefore, *if* it is acceptable to pre-warm the page cache for buffered read under invalidate_lock, that is a simple way to bring xfs performance with a random rw mix workload on par with ext4 performance without losing the atomic write POSIX semantics. So everyone can be happy? In addition to Dave's concerns about stale page cache races with hole punch, I found in the original discussion this concern from Darrick [2]: > Reads and writes are not the only thing xfs uses i_rwsem to synchronise. > Reflink remap uses it to make sure everything's flushed to disk and that > page cache contents remain clean while the remap is ongoing. I'm pretty > sure pnfs uses it for similar reasons when granting and committing write > leases. To reiterate, pNFS leases are not the common case. To address this issue, I intend to opt out of the pre-warm optimization when pNFS leases are present, either globally or per file, whatever xfs developers tell me to do. From my understanding of the code, xfs_reflink_remap_prep() takes care of taking the invalidate_lock(s), so populating the page cache under invalidate_lock should be safe also w.r.t. reflink/dedupe. Darrick, am I missing anything? Thanks, Amir. [1] https://lore.kernel.org/all/CAOQ4uxgSc7hK1=GuUajzG1Z+ks6gzFFX+EtuBMULOk0s85zi3A@mail.gmail.com/ [2] https://lore.kernel.org/linux-xfs/20190325154731.GT1183@magnolia/ ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [POC][PATCH] xfs: reduce ilock contention on buffered randrw workload 2022-06-18 8:38 ` Amir Goldstein @ 2022-06-20 9:11 ` Jan Kara 2022-06-21 7:49 ` Amir Goldstein 0 siblings, 1 reply; 38+ messages in thread From: Jan Kara @ 2022-06-20 9:11 UTC (permalink / raw) To: Amir Goldstein Cc: Jan Kara, Darrick J . Wong, Dave Chinner, Christoph Hellwig, Matthew Wilcox, linux-xfs, linux-fsdevel On Sat 18-06-22 11:38:30, Amir Goldstein wrote: > On Fri, Jun 17, 2022 at 6:11 PM Jan Kara <jack@suse.cz> wrote: > > > > On Fri 17-06-22 17:48:08, Amir Goldstein wrote: > > > On Tue, Apr 9, 2019 at 11:26 AM Jan Kara <jack@suse.cz> wrote: > > > > > > > > On Mon 08-04-19 20:41:09, Amir Goldstein wrote: > > > > > On Mon, Apr 8, 2019 at 5:11 PM Jan Kara <jack@suse.cz> wrote: > > > > > > > > > > > > On Mon 08-04-19 12:02:34, Amir Goldstein wrote: > > > > > > > On Mon, Apr 8, 2019 at 2:27 AM Dave Chinner <david@fromorbit.com> wrote: > > > > > > > > > > > > > > > > On Fri, Apr 05, 2019 at 05:02:33PM +0300, Amir Goldstein wrote: > > > > > > > > > On Fri, Apr 5, 2019 at 12:17 AM Dave Chinner <david@fromorbit.com> wrote: > > > > > > > > > > > > > > > > > > > > On Thu, Apr 04, 2019 at 07:57:37PM +0300, Amir Goldstein wrote: > > > > > > > > > > > This patch improves performance of mixed random rw workload > > > > > > > > > > > on xfs without relaxing the atomic buffered read/write guaranty > > > > > > > > > > > that xfs has always provided. > > > > > > > > > > > > > > > > > > > > > > We achieve that by calling generic_file_read_iter() twice. > > > > > > > > > > > Once with a discard iterator to warm up page cache before taking > > > > > > > > > > > the shared ilock and once again under shared ilock. > > > > > > > > > > > > > > > > > > > > This will race with thing like truncate, hole punching, etc that > > > > > > > > > > serialise IO and invalidate the page cache for data integrity > > > > > > > > > > reasons under the IOLOCK. 
These rely on there being no IO to the > > > > > > > > > > inode in progress at all to work correctly, which this patch > > > > > > > > > > violates. IOWs, while this is fast, it is not safe and so not a > > > > > > > > > > viable approach to solving the problem. > > > > > > > > > > > > > > > > > > > > > > > > > > > > This statement leaves me wondering, if ext4 does not takes > > > > > > > > > i_rwsem on generic_file_read_iter(), how does ext4 (or any other > > > > > > > > > fs for that matter) guaranty buffered read synchronization with > > > > > > > > > truncate, hole punching etc? > > > > > > > > > The answer in ext4 case is i_mmap_sem, which is read locked > > > > > > > > > in the page fault handler. > > > > > > > > > > > > > > > > Nope, the i_mmap_sem is for serialisation of /page faults/ against > > > > > > > > truncate, holepunching, etc. Completely irrelevant to the read() > > > > > > > > path. > > > > > > > > > > > > > > > > > > > > > > I'm at lost here. Why are page faults completely irrelevant to read() > > > > > > > path? Aren't full pages supposed to be faulted in on read() after > > > > > > > truncate_pagecache_range()? > > > > > > > > > > > > During read(2), pages are not "faulted in". Just look at > > > > > > what generic_file_buffered_read() does. It uses completely separate code to > > > > > > add page to page cache, trigger readahead, and possibly call ->readpage() to > > > > > > fill the page with data. "fault" path (handled by filemap_fault()) applies > > > > > > only to accesses from userspace to mmaps. > > > > > > > > > > > > > > > > Oh! thanks for fixing my blind spot. > > > > > So if you agree with Dave that ext4, and who knows what other fs, > > > > > are vulnerable to populating page cache with stale "uptodate" data, > > > > > > > > Not that many filesystems support punching holes but you're right. > > > > > > > > > then it seems to me that also xfs is vulnerable via readahead(2) and > > > > > posix_fadvise(). 
> > > > > > > > Yes, this is correct AFAICT. > > > > > > > > > Mind you, I recently added an fadvise f_op, so it could be used by > > > > > xfs to synchronize with IOLOCK. > > > > > > > > And yes, this should work. > > > > > > > > > Perhaps a better solution would be for truncate_pagecache_range() > > > > > to leave zeroed or Unwritten (i.e. lazy zeroed by read) pages in page > > > > > cache. When we have shared pages for files, these pages could be > > > > > deduped. > > > > > > > > No, I wouldn't really mess with sharing pages due to this. It would be hard > > > > to make that scale resonably and would be rather complex. We really need a > > > > proper and reasonably simple synchronization mechanism between operations > > > > removing blocks from inode and operations filling in page cache of the > > > > inode. Page lock was supposed to provide this but doesn't quite work > > > > because hole punching first remove pagecache pages and then go removing all > > > > blocks. > > > > > > > > So I agree with Dave that going for range lock is really the cleanest way > > > > forward here without causing big regressions for mixed rw workloads. I'm > > > > just thinking how to best do that without introducing lot of boilerplate > > > > code into each filesystem. > > > > > > Hi Jan, Dave, > > > > > > Trying to circle back to this after 3 years! > > > Seeing that there is no progress with range locks and > > > that the mixed rw workloads performance issue still very much exists. > > > > > > Is the situation now different than 3 years ago with invalidate_lock? > > > > Yes, I've implemented invalidate_lock exactly to fix the issues you've > > pointed out without regressing the mixed rw workloads (because > > invalidate_lock is taken in shared mode only for reads and usually not at > > all for writes). > > > > > Would my approach of pre-warm page cache before taking IOLOCK > > > be safe if page cache is pre-warmed with invalidate_lock held? > > > > Why would it be needed? 
But yes, with invalidate_lock you could presumably > > make that idea safe... > > To remind you, the context in which I pointed you to the punch hole race > issue in "other file systems" was a discussion about trying to relax the > "atomic write" POSIX semantics [1] of xfs. Ah, I see. Sorry, I already forgot :-| > There was a lot of discussions around range locks and changing the > fairness of rwsem readers and writer, but none of this changes the fact > that as long as the lock is file wide (and it does not look like that is > going to change in the near future), it is better for lock contention to > perform the serialization on page cache read/write and not on disk > read/write. > > Therefore, *if* it is acceptable to pre-warn page cache for buffered read > under invalidate_lock, that is a simple way to bring the xfs performance with > random rw mix workload on par with ext4 performance without losing the > atomic write POSIX semantics. So everyone can be happy? So to spell out your proposal so that we are on the same page: you want to use invalidate_lock + page locks to achieve "writes are atomic wrt reads" property XFS currently has without holding i_rwsem in shared mode during reads. Am I getting it correct? How exactly do you imagine the synchronization of buffered read against buffered write would work? Lock all pages for the read range in the page cache? You'd need to be careful to not bring the machine OOM when someone asks to read a huge range... Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [POC][PATCH] xfs: reduce ilock contention on buffered randrw workload 2022-06-20 9:11 ` Jan Kara @ 2022-06-21 7:49 ` Amir Goldstein 2022-06-21 8:59 ` Jan Kara 0 siblings, 1 reply; 38+ messages in thread From: Amir Goldstein @ 2022-06-21 7:49 UTC (permalink / raw) To: Jan Kara Cc: Darrick J . Wong, Dave Chinner, Christoph Hellwig, Matthew Wilcox, linux-xfs, linux-fsdevel > > > > Hi Jan, Dave, > > > > > > > > Trying to circle back to this after 3 years! > > > > Seeing that there is no progress with range locks and > > > > that the mixed rw workloads performance issue still very much exists. > > > > > > > > Is the situation now different than 3 years ago with invalidate_lock? > > > > > > Yes, I've implemented invalidate_lock exactly to fix the issues you've > > > pointed out without regressing the mixed rw workloads (because > > > invalidate_lock is taken in shared mode only for reads and usually not at > > > all for writes). > > > > > > > Would my approach of pre-warm page cache before taking IOLOCK > > > > be safe if page cache is pre-warmed with invalidate_lock held? > > > > > > Why would it be needed? But yes, with invalidate_lock you could presumably > > > make that idea safe... > > > > To remind you, the context in which I pointed you to the punch hole race > > issue in "other file systems" was a discussion about trying to relax the > > "atomic write" POSIX semantics [1] of xfs. > > Ah, I see. Sorry, I already forgot :-| Understandable. It has been 3 years ;-) > > > There was a lot of discussions around range locks and changing the > > fairness of rwsem readers and writer, but none of this changes the fact > > that as long as the lock is file wide (and it does not look like that is > > going to change in the near future), it is better for lock contention to > > perform the serialization on page cache read/write and not on disk > > read/write. 
> > Therefore, *if* it is acceptable to pre-warn page cache for buffered read > > under invalidate_lock, that is a simple way to bring the xfs performance with > > random rw mix workload on par with ext4 performance without losing the > > atomic write POSIX semantics. So everyone can be happy? > > So to spell out your proposal so that we are on the same page: you want to > use invalidate_lock + page locks to achieve "writes are atomic wrt reads" > property XFS currently has without holding i_rwsem in shared mode during > reads. Am I getting it correct? Not exactly. > > How exactly do you imagine the synchronization of buffered read against > buffered write would work? Lock all pages for the read range in the page > cache? You'd need to be careful to not bring the machine OOM when someone > asks to read a huge range... I imagine that the atomic r/w synchronisation will remain *exactly* as it is today, by taking XFS_IOLOCK_SHARED around generic_file_read_iter() when reading data into the user buffer, but before that, I would like to issue and wait for the read of the pages in the range, to reduce the probability of doing the read I/O under XFS_IOLOCK_SHARED. The pre-warm of the page cache does not need to abide by the atomic read semantics, and it is also tolerable if some pages are evicted between the pre-warm and the read into the user buffer - in the worst case this will result in I/O amplification, but for the common case it will be a big win for mixed random r/w performance on xfs. To reduce the risk of page cache thrashing, we can limit this optimization to a maximum number of pre-warmed pages. The questions are: 1. Does this plan sound reasonable? 2. Is there a ready helper (force_page_cache_readahead?) that I can use which takes the required page/invalidate locks? Thanks, Amir. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [POC][PATCH] xfs: reduce ilock contention on buffered randrw workload 2022-06-21 7:49 ` Amir Goldstein @ 2022-06-21 8:59 ` Jan Kara 2022-06-21 12:53 ` Amir Goldstein 0 siblings, 1 reply; 38+ messages in thread From: Jan Kara @ 2022-06-21 8:59 UTC (permalink / raw) To: Amir Goldstein Cc: Jan Kara, Darrick J . Wong, Dave Chinner, Christoph Hellwig, Matthew Wilcox, linux-xfs, linux-fsdevel On Tue 21-06-22 10:49:48, Amir Goldstein wrote: > > How exactly do you imagine the synchronization of buffered read against > > buffered write would work? Lock all pages for the read range in the page > > cache? You'd need to be careful to not bring the machine OOM when someone > > asks to read a huge range... > > I imagine that the atomic r/w synchronisation will remain *exactly* as it is > today by taking XFS_IOLOCK_SHARED around generic_file_read_iter(), > when reading data into user buffer, but before that, I would like to issue > and wait for read of the pages in the range to reduce the probability > of doing the read I/O under XFS_IOLOCK_SHARED. > > The pre-warm of page cache does not need to abide to the atomic read > semantics and it is also tolerable if some pages are evicted in between > pre-warn and read to user buffer - in the worst case this will result in > I/O amplification, but for the common case, it will be a big win for the > mixed random r/w performance on xfs. > > To reduce risk of page cache thrashing we can limit this optimization > to a maximum number of page cache pre-warm. > > The questions are: > 1. Does this plan sound reasonable? Ah, I see now. So essentially the idea is to pull the readahead (which is currently happening from filemap_read() -> filemap_get_pages()) out from under the i_rwsem. It looks like a fine idea to me. > 2. Is there a ready helper (force_page_cache_readahead?) that > I can use which takes the required page/invalidate locks? page_cache_sync_readahead() should be the function you need. 
It does take care to lock invalidate_lock internally when creating & reading pages. I just cannot comment on whether calling this without i_rwsem does not break some internal XFS expectations for stuff like reflink etc. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [POC][PATCH] xfs: reduce ilock contention on buffered randrw workload 2022-06-21 8:59 ` Jan Kara @ 2022-06-21 12:53 ` Amir Goldstein 2022-06-22 3:23 ` Matthew Wilcox 2022-09-13 14:40 ` Amir Goldstein 0 siblings, 2 replies; 38+ messages in thread From: Amir Goldstein @ 2022-06-21 12:53 UTC (permalink / raw) To: Jan Kara Cc: Darrick J . Wong, Dave Chinner, Christoph Hellwig, Matthew Wilcox, linux-xfs, linux-fsdevel On Tue, Jun 21, 2022 at 11:59 AM Jan Kara <jack@suse.cz> wrote: > > On Tue 21-06-22 10:49:48, Amir Goldstein wrote: > > > How exactly do you imagine the synchronization of buffered read against > > > buffered write would work? Lock all pages for the read range in the page > > > cache? You'd need to be careful to not bring the machine OOM when someone > > > asks to read a huge range... > > > > I imagine that the atomic r/w synchronisation will remain *exactly* as it is > > today by taking XFS_IOLOCK_SHARED around generic_file_read_iter(), > > when reading data into user buffer, but before that, I would like to issue > > and wait for read of the pages in the range to reduce the probability > > of doing the read I/O under XFS_IOLOCK_SHARED. > > > > The pre-warm of page cache does not need to abide to the atomic read > > semantics and it is also tolerable if some pages are evicted in between > > pre-warm and read to user buffer - in the worst case this will result in > > I/O amplification, but for the common case, it will be a big win for the > > mixed random r/w performance on xfs. > > > > To reduce risk of page cache thrashing we can limit this optimization > > to a maximum number of page cache pre-warm. > > > > The questions are: > > 1. Does this plan sound reasonable? > > > > Ah, I see now. So essentially the idea is to pull the readahead (which is > currently happening from filemap_read() -> filemap_get_pages()) out from under > the i_rwsem. It looks like a fine idea to me. Great! Does anyone dislike the idea or have another suggestion? > > > 2.
Is there a ready helper (force_page_cache_readahead?) that > > I can use which takes the required page/invalidate locks? > > page_cache_sync_readahead() should be the function you need. It does take > care to lock invalidate_lock internally when creating & reading pages. I Thanks, I'll try that. > just cannot comment on whether calling this without i_rwsem does not break > some internal XFS expectations for stuff like reflink etc. reflink is done under xfs_ilock2_io_mmap => filemap_invalidate_lock_two so it should not be a problem. pNFS leases I need to look into. Thanks, Amir. ^ permalink raw reply [flat|nested] 38+ messages in thread
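For concreteness, the pre-warm under discussion could look roughly like the sketch below in the XFS buffered read path. This is a hypothetical illustration against roughly 5.19-era interfaces, not the actual POC patch; the helper name and the page cap are invented for the example:

```c
/*
 * Sketch only: kick off readahead for the requested range before taking
 * XFS_IOLOCK_SHARED, so that most page cache misses are filled without
 * holding the iolock.  page_cache_sync_readahead() takes
 * mapping->invalidate_lock internally when it creates and reads pages,
 * so this is safe against hole punch without holding i_rwsem.
 */
#define XFS_PREWARM_MAX_PAGES	256	/* hypothetical thrashing limit */

static void
xfs_file_prewarm_pagecache(
	struct kiocb		*iocb,
	struct iov_iter		*to)
{
	struct file		*file = iocb->ki_filp;
	struct address_space	*mapping = file->f_mapping;
	pgoff_t			first = iocb->ki_pos >> PAGE_SHIFT;
	pgoff_t			last;
	unsigned long		nr;

	if (!iov_iter_count(to))
		return;

	last = (iocb->ki_pos + iov_iter_count(to) - 1) >> PAGE_SHIFT;
	nr = min_t(unsigned long, last - first + 1, XFS_PREWARM_MAX_PAGES);

	/* Issue (but do not wait for) read I/O on the range. */
	page_cache_sync_readahead(mapping, &file->f_ra, file, first, nr);
}
```

xfs_file_buffered_read() would call this helper once, just before taking XFS_IOLOCK_SHARED; evictions between the pre-warm and the locked read are harmless and only cost extra I/O, as discussed above.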
* Re: [POC][PATCH] xfs: reduce ilock contention on buffered randrw workload 2022-06-21 12:53 ` Amir Goldstein @ 2022-06-22 3:23 ` Matthew Wilcox 2022-06-22 9:00 ` Amir Goldstein 2022-09-13 14:40 ` Amir Goldstein 1 sibling, 1 reply; 38+ messages in thread From: Matthew Wilcox @ 2022-06-22 3:23 UTC (permalink / raw) To: Amir Goldstein Cc: Jan Kara, Darrick J . Wong, Dave Chinner, Christoph Hellwig, linux-xfs, linux-fsdevel On Tue, Jun 21, 2022 at 03:53:33PM +0300, Amir Goldstein wrote: > On Tue, Jun 21, 2022 at 11:59 AM Jan Kara <jack@suse.cz> wrote: > > > > On Tue 21-06-22 10:49:48, Amir Goldstein wrote: > > > > How exactly do you imagine the synchronization of buffered read against > > > > buffered write would work? Lock all pages for the read range in the page > > > > cache? You'd need to be careful to not bring the machine OOM when someone > > > > asks to read a huge range... > > > > > > I imagine that the atomic r/w synchronisation will remain *exactly* as it is > > > today by taking XFS_IOLOCK_SHARED around generic_file_read_iter(), > > > when reading data into user buffer, but before that, I would like to issue > > > and wait for read of the pages in the range to reduce the probability > > > of doing the read I/O under XFS_IOLOCK_SHARED. > > > > > > The pre-warm of page cache does not need to abide to the atomic read > > > semantics and it is also tolerable if some pages are evicted in between > > > pre-warn and read to user buffer - in the worst case this will result in > > > I/O amplification, but for the common case, it will be a big win for the > > > mixed random r/w performance on xfs. > > > > > > To reduce risk of page cache thrashing we can limit this optimization > > > to a maximum number of page cache pre-warm. > > > > > > The questions are: > > > 1. Does this plan sound reasonable? > > > > Ah, I see now. 
So essentially the idea is to pull the readahead (which is > > currently happening from filemap_read() -> filemap_get_pages()) out from under > > the i_rwsem. It looks like a fine idea to me. > > Great! > Does anyone dislike the idea or have another suggestion? I guess I'm still confused. The problem was that the XFS IOLOCK was being held while we waited for readahead to complete. To fix this, you're planning on waiting for readahead to complete with the invalidate lock held? I don't see the benefit. I see the invalidate_lock as being roughly equivalent to the IOLOCK, just pulled up to the VFS. Is that incorrect? ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [POC][PATCH] xfs: reduce ilock contention on buffered randrw workload 2022-06-22 3:23 ` Matthew Wilcox @ 2022-06-22 9:00 ` Amir Goldstein 2022-06-22 9:34 ` Jan Kara 0 siblings, 1 reply; 38+ messages in thread From: Amir Goldstein @ 2022-06-22 9:00 UTC (permalink / raw) To: Matthew Wilcox Cc: Jan Kara, Darrick J . Wong, Dave Chinner, Christoph Hellwig, linux-xfs, linux-fsdevel On Wed, Jun 22, 2022 at 6:23 AM Matthew Wilcox <willy@infradead.org> wrote: > > On Tue, Jun 21, 2022 at 03:53:33PM +0300, Amir Goldstein wrote: > > On Tue, Jun 21, 2022 at 11:59 AM Jan Kara <jack@suse.cz> wrote: > > > > > > On Tue 21-06-22 10:49:48, Amir Goldstein wrote: > > > > > How exactly do you imagine the synchronization of buffered read against > > > > > buffered write would work? Lock all pages for the read range in the page > > > > > cache? You'd need to be careful to not bring the machine OOM when someone > > > > > asks to read a huge range... > > > > > > > > I imagine that the atomic r/w synchronisation will remain *exactly* as it is > > > > today by taking XFS_IOLOCK_SHARED around generic_file_read_iter(), > > > > when reading data into user buffer, but before that, I would like to issue > > > > and wait for read of the pages in the range to reduce the probability > > > > of doing the read I/O under XFS_IOLOCK_SHARED. > > > > > > > > The pre-warm of page cache does not need to abide to the atomic read > > > > semantics and it is also tolerable if some pages are evicted in between > > > > pre-warn and read to user buffer - in the worst case this will result in > > > > I/O amplification, but for the common case, it will be a big win for the > > > > mixed random r/w performance on xfs. > > > > > > > > To reduce risk of page cache thrashing we can limit this optimization > > > > to a maximum number of page cache pre-warm. > > > > > > > > The questions are: > > > > 1. Does this plan sound reasonable? > > > > > > Ah, I see now. 
So essentially the idea is to pull the readahead (which is > > > currently happening from filemap_read() -> filemap_get_pages()) out from under > > > the i_rwsem. It looks like a fine idea to me. > > > > Great! > > Does anyone dislike the idea or have another suggestion? > > I guess I'm still confused. > > The problem was that the XFS IOLOCK was being held while we waited for > readahead to complete. To fix this, you're planning on waiting for > readahead to complete with the invalidate lock held? I don't see the > benefit. > > I see the invalidate_lock as being roughly equivalent to the IOLOCK, > just pulled up to the VFS. Is that incorrect? > This question coming from you really shakes my confidence. This entire story started from the observation that xfs performance of concurrent mixed rw workload is two orders of magnitude worse than ext4 on slow disk. The reason for the difference was that xfs was taking the IOLOCK shared on reads and ext4 did not. That had two very different reasons: 1. POSIX atomic read/write semantics unique to xfs 2. Correctness w.r.t. races with punch hole etc., which led to the conclusion that all non-xfs filesystems are buggy in that respect The solution of pulling IOLOCK to vfs would have solved the bug but at the cost of severely regressing the mixed rw workload on all fs. The point of Jan's work on invalidate_lock was to fix the bug and avoid the regression. I hope that worked out, but I did not test the mixed rw workload on ext4 after invalidate_lock. IIUC, ideally, invalidate_lock was supposed to be taken only for adding pages to page cache and locking them, but not during IO in order to synchronize against truncating pages (punch hole). But from this comment in filemap_create_folio() I just learned that that is not exactly the case: "...Note that we could release invalidate_lock after inserting the folio into the page cache as the locked folio would then be enough to synchronize with hole punching. But..."
Even so, because invalidate_lock is not taken by writers and readers that work on existing pages and because invalidate_lock is not held for the entire read/write operation, statistically it should be less contended than IOLOCK for some workloads, but I am afraid that for the workload I tested (bs=8K and mostly cold page cache) it will be contended with current vfs code. I am going to go find a machine with slow disk to test the random rw workload again on both xfs and ext4 pre and post invalidate_lock and to try out the pre-warm page cache solution. The results could be: a) ext4 random rw performance has been degraded by invalidate_lock b) pre-warm page cache before taking IOLOCK is going to improve xfs random rw performance c) A little bit of both It would be great if you and Jan could agree on the facts and confirm that my observations about invalidate_lock are correct while I wait for the test results. Thanks, Amir. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [POC][PATCH] xfs: reduce ilock contention on buffered randrw workload 2022-06-22 9:00 ` Amir Goldstein @ 2022-06-22 9:34 ` Jan Kara 2022-06-22 16:26 ` Amir Goldstein 0 siblings, 1 reply; 38+ messages in thread From: Jan Kara @ 2022-06-22 9:34 UTC (permalink / raw) To: Amir Goldstein Cc: Matthew Wilcox, Jan Kara, Darrick J . Wong, Dave Chinner, Christoph Hellwig, linux-xfs, linux-fsdevel On Wed 22-06-22 12:00:35, Amir Goldstein wrote: > On Wed, Jun 22, 2022 at 6:23 AM Matthew Wilcox <willy@infradead.org> wrote: > > > > On Tue, Jun 21, 2022 at 03:53:33PM +0300, Amir Goldstein wrote: > > > On Tue, Jun 21, 2022 at 11:59 AM Jan Kara <jack@suse.cz> wrote: > > > > > > > > On Tue 21-06-22 10:49:48, Amir Goldstein wrote: > > > > > > How exactly do you imagine the synchronization of buffered read against > > > > > > buffered write would work? Lock all pages for the read range in the page > > > > > > cache? You'd need to be careful to not bring the machine OOM when someone > > > > > > asks to read a huge range... > > > > > > > > > > I imagine that the atomic r/w synchronisation will remain *exactly* as it is > > > > > today by taking XFS_IOLOCK_SHARED around generic_file_read_iter(), > > > > > when reading data into user buffer, but before that, I would like to issue > > > > > and wait for read of the pages in the range to reduce the probability > > > > > of doing the read I/O under XFS_IOLOCK_SHARED. > > > > > > > > > > The pre-warm of page cache does not need to abide to the atomic read > > > > > semantics and it is also tolerable if some pages are evicted in between > > > > > pre-warn and read to user buffer - in the worst case this will result in > > > > > I/O amplification, but for the common case, it will be a big win for the > > > > > mixed random r/w performance on xfs. > > > > > > > > > > To reduce risk of page cache thrashing we can limit this optimization > > > > > to a maximum number of page cache pre-warm. > > > > > > > > > > The questions are: > > > > > 1. 
Does this plan sound reasonable? > > > > > > > > Ah, I see now. So essentially the idea is to pull the readahead (which is > > > > currently happening from filemap_read() -> filemap_get_pages()) out from under > > > > the i_rwsem. It looks like a fine idea to me. > > > > > > Great! > > > Anyone doesn't like the idea or has another suggestion? > > > > I guess I'm still confused. > > > > The problem was the the XFS IOLOCK was being held while we waited for > > readahead to complete. To fix this, you're planning on waiting for > > readahead to complete with the invalidate lock held? I don't see the > > benefit. > > > > I see the invalidate_lock as being roughly equivalent to the IOLOCK, > > just pulled up to the VFS. Is that incorrect? > > > > This question coming from you really shakes my confidence. > > This entire story started from the observation that xfs performance > of concurrent mixed rw workload is two orders of magnitude worse > than ext4 on slow disk. > > The reason for the difference was that xfs was taking the IOLOCK > shared on reads and ext4 did not. > > That had two very different reasons: > 1. POSIX atomic read/write semantics unique to xfs > 2. Correctness w.r.t. races with punch hole etc, which lead to the > conclusion that all non-xfs filesystems are buggy in that respect > > The solution of pulling IOLOCK to vfs would have solved the bug > but at the cost of severely regressing the mix rw workload on all fs. > > The point of Jan's work on invalidate_lock was to fix the bug and > avoid the regression. I hope that worked out, but I did not test > the mixed rw workload on ext4 after invalidate_lock. Yes, it did work out :) > IIUC, ideally, invalidate_lock was supposed to be taken only for > adding pages to page cache and locking them, but not during IO > in order to synchronize against truncating pages (punch hole). 
> But from this comment in filemap_create_folio() I just learned > that that is not exactly the case: > "...Note that we could release invalidate_lock after inserting the > folio into the page cache as the locked folio would then be enough > to synchronize with hole punching. But..." > > Even so, because invalidate_lock is not taken by writers and reader > that work on existing pages and because invalidate_lock is not held > for the entire read/write operation, statistically it should be less > contended than IOLOCK for some workloads, but I am afraid that > for the workload I tested (bs=8K and mostly cold page cache) it will > be contended with current vfs code. Well, the rules are: When you are adding pages to the page cache you need to either hold i_rwsem or invalidate_lock at least in shared mode. Places removing underlying storage from the page cache pages are responsible for holding both i_rwsem and invalidate_lock in exclusive mode. Writes generally hold i_rwsem exclusive (or shared for overwriting direct IO), reads hold invalidate_lock shared. So there will not be contention on invalidate_lock as such for a mixed rw workload, except for the internal contention of the shared invalidate_lock holders on the cacheline holding the invalidate_lock (which is non-negligible as well in heavily parallel loads but that's a separate story). > I am going to go find a machine with slow disk to test the random rw > workload again on both xfs and ext4 pre and post invalidate_lock and > to try out the pre-warm page cache solution. > > The results could be: > a) ext4 random rw performance has been degraded by invalidate_lock > b) pre-warm page cache before taking IOLOCK is going to improve > xfs random rw performance > c) A little bit of both Well, numbers always beat the theory so I'm all for measuring it but let me say our kernel performance testing within SUSE didn't show significant hit being introduced by invalidate_lock for any major filesystem. 
I hope this clears things up a bit. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 38+ messages in thread
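The rules Jan describes can be condensed into a reference table. This is a paraphrase of the description above (and of the locking documentation that accompanies invalidate_lock in mainline), not authoritative text:

```c
/*
 * Page cache locking rules, summarized and simplified:
 *
 *   Operation                          i_rwsem          invalidate_lock
 *   ---------------------------------  ---------------  ---------------
 *   add pages to the page cache        any mode    OR   shared
 *   remove backing store (truncate,
 *   hole punch)                        exclusive   AND  exclusive
 *   buffered write                     exclusive        -
 *   overwriting direct IO write        shared           -
 *   buffered read (filling the cache)  -                shared
 */
```

So a mixed rw workload only ever takes invalidate_lock in shared mode, and the remaining cost is the cacheline bouncing of the rwsem counter itself.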
* Re: [POC][PATCH] xfs: reduce ilock contention on buffered randrw workload 2022-06-22 9:34 ` Jan Kara @ 2022-06-22 16:26 ` Amir Goldstein 0 siblings, 0 replies; 38+ messages in thread From: Amir Goldstein @ 2022-06-22 16:26 UTC (permalink / raw) To: Jan Kara Cc: Matthew Wilcox, Darrick J . Wong, Dave Chinner, Christoph Hellwig, linux-xfs, linux-fsdevel > > I am going to go find a machine with slow disk to test the random rw > > workload again on both xfs and ext4 pre and post invalidate_lock and > > to try out the pre-warm page cache solution. > > > > The results could be: > > a) ext4 random rw performance has been degraded by invalidate_lock > > b) pre-warm page cache before taking IOLOCK is going to improve > > xfs random rw performance > > c) A little bit of both The correct answer is b. :) > > Well, numbers always beat the theory so I'm all for measuring it but let me > say our kernel performance testing within SUSE didn't show significant hit > being introduced by invalidate_lock for any major filesystem. > Here are the numbers produced on v5.10.109, on v5.19-rc3 and on v5.19-rc3+ which includes the pre-warm test patch [1]. The numbers are produced by a filebench workload [2] that runs 8 random reader threads and 8 random writer threads for 60 seconds on a cold cache preallocated 5GB file. Note that the machine I tested with has much faster storage than the one that was used 3 years ago, but the performance impact of IOLOCK is still very clear, even larger in this test. If there are no other objections to the pre-warm concept, I will go on to write and test a proper patch. Thanks, Amir. 
[1] https://github.com/amir73il/linux/commit/70e94f3471739c442b1110ee46e8b59e5d5f5042 [2] https://github.com/amir73il/filebench/blob/overlayfs-devel/workloads/randomrw.f --- EXT4 5.10 --- filebench randomrw (8 read threads, 8 write threads) kernel 5.10.109, ext4 Test #1: rand-write1 3002127ops 50020ops/s 390.8mb/s 0.156ms/op [0.002ms - 213.755ms] rand-read1 31749234ops 528988ops/s 4132.7mb/s 0.010ms/op [0.001ms - 68.884ms] Test #2: rand-write1 3083679ops 51381ops/s 401.4mb/s 0.152ms/op [0.002ms - 181.368ms] rand-read1 32182118ops 536228ops/s 4189.3mb/s 0.010ms/op [0.001ms - 61.158ms] --- EXT4 5.19 --- filebench randomrw (8 read threads, 8 write threads) kernel 5.19-rc3, ext4 Test #1: rand-write1 2829917ops 47159ops/s 368.4mb/s 0.160ms/op [0.002ms - 4709.167ms] rand-read1 36997540ops 616542ops/s 4816.7mb/s 0.009ms/op [0.001ms - 4704.105ms] Test #2: rand-write1 2764486ops 46067ops/s 359.9mb/s 0.170ms/op [0.002ms - 5042.597ms] rand-read1 38893279ops 648118ops/s 5063.4mb/s 0.008ms/op [0.001ms - 5004.069ms] --- XFS 5.10 --- filebench randomrw (8 read threads, 8 write threads) kernel 5.10.109, xfs Test #1: rand-write1 1049278ops 17485ops/s 136.6mb/s 0.456ms/op [0.002ms - 224.062ms] rand-read1 33325ops 555ops/s 4.3mb/s 14.392ms/op [0.007ms - 224.833ms] Test #2: rand-write1 1127497ops 18788ops/s 146.8mb/s 0.424ms/op [0.003ms - 445.810ms] rand-read1 35341ops 589ops/s 4.6mb/s 13.566ms/op [0.005ms - 445.529ms] --- XFS 5.19 --- filebench randomrw (8 read threads, 8 write threads) kernel 5.19-rc3, xfs Test #1: rand-write1 3295934ops 54920ops/s 429.1mb/s 0.144ms/op [0.003ms - 109.703ms] rand-read1 86768ops 1446ops/s 11.3mb/s 5.520ms/op [0.003ms - 372.000ms] Test #2: rand-write1 3246935ops 54106ops/s 422.7mb/s 0.146ms/op [0.002ms - 103.505ms] rand-read1 167018ops 2783ops/s 21.7mb/s 2.867ms/op [0.003ms - 101.105ms] --- XFS+ 5.19 --- filebench randomrw (8 read threads, 8 write threads) kernel 5.19-rc3+ (xfs page cache warmup patch) Test #1: rand-write1 3054567ops 50899ops/s 
397.6mb/s 0.154ms/op [0.002ms - 201.531ms] rand-read1 38107333ops 634990ops/s 4960.9mb/s 0.008ms/op [0.001ms - 60.027ms] Test #2: rand-write1 2704416ops 45053ops/s 352.0mb/s 0.174ms/op [0.002ms - 287.079ms] rand-read1 38589737ops 642874ops/s 5022.4mb/s 0.008ms/op [0.001ms - 60.741ms] ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [POC][PATCH] xfs: reduce ilock contention on buffered randrw workload 2022-06-21 12:53 ` Amir Goldstein 2022-06-22 3:23 ` Matthew Wilcox @ 2022-09-13 14:40 ` Amir Goldstein 2022-09-14 16:01 ` Darrick J. Wong 1 sibling, 1 reply; 38+ messages in thread From: Amir Goldstein @ 2022-09-13 14:40 UTC (permalink / raw) To: Jan Kara, Dave Chinner, Christoph Hellwig Cc: Darrick J . Wong, Matthew Wilcox, linux-xfs, linux-fsdevel On Tue, Jun 21, 2022 at 3:53 PM Amir Goldstein <amir73il@gmail.com> wrote: > > On Tue, Jun 21, 2022 at 11:59 AM Jan Kara <jack@suse.cz> wrote: > > > > On Tue 21-06-22 10:49:48, Amir Goldstein wrote: > > > > How exactly do you imagine the synchronization of buffered read against > > > > buffered write would work? Lock all pages for the read range in the page > > > > cache? You'd need to be careful to not bring the machine OOM when someone > > > > asks to read a huge range... > > > > > > I imagine that the atomic r/w synchronisation will remain *exactly* as it is > > > today by taking XFS_IOLOCK_SHARED around generic_file_read_iter(), > > > when reading data into user buffer, but before that, I would like to issue > > > and wait for read of the pages in the range to reduce the probability > > > of doing the read I/O under XFS_IOLOCK_SHARED. > > > > > > The pre-warm of page cache does not need to abide to the atomic read > > > semantics and it is also tolerable if some pages are evicted in between > > > pre-warn and read to user buffer - in the worst case this will result in > > > I/O amplification, but for the common case, it will be a big win for the > > > mixed random r/w performance on xfs. > > > > > > To reduce risk of page cache thrashing we can limit this optimization > > > to a maximum number of page cache pre-warm. > > > > > > The questions are: > > > 1. Does this plan sound reasonable? > > > > Ah, I see now. 
So essentially the idea is to pull the readahead (which is > > currently happening from filemap_read() -> filemap_get_pages()) out from under > > the i_rwsem. It looks like a fine idea to me. > Although I was able to demonstrate performance improvement with page cache pre-warming on low latency disks, when testing on a common standard system [*], page cache pre-warming did not yield any improvement to the mixed rw workload. [*] I ran the following fio workload on e2-standard-8 GCE machine: [global] filename=/mnt/xfs/testfile.fio norandommap randrepeat=0 size=5G bs=8K ioengine=psync numjobs=8 group_reporting=1 direct=0 fallocate=1 end_fsync=0 runtime=60 [xfs-read] readwrite=randread [xfs-write] readwrite=randwrite The difference between ext4 and xfs with this machine/workload was two orders of magnitude: root@xfstests:~# fio ./ext4.fio ... Run status group 0 (all jobs): READ: bw=826MiB/s (866MB/s), 826MiB/s-826MiB/s (866MB/s-866MB/s), io=40.0GiB (42.9GB), run=49585-49585msec WRITE: bw=309MiB/s (324MB/s), 309MiB/s-309MiB/s (324MB/s-324MB/s), io=18.1GiB (19.5GB), run=60003-60003msec root@xfstests:~# fio ./xfs.fio ... Run status group 0 (all jobs): READ: bw=7053KiB/s (7223kB/s), 7053KiB/s-7053KiB/s (7223kB/s-7223kB/s), io=413MiB (433MB), run=60007-60007msec WRITE: bw=155MiB/s (163MB/s), 155MiB/s-155MiB/s (163MB/s-163MB/s), io=9324MiB (9777MB), run=60006-60006msec I verified that without XFS_IOLOCK_SHARED xfs fio results are on par with ext4 results for this workload. > > > just cannot comment on whether calling this without i_rwsem does not break > > some internal XFS expectations for stuff like reflink etc. > > relink is done under xfs_ilock2_io_mmap => filemap_invalidate_lock_two > so it should not be a problem. > > pNFS leases I need to look into. > I wonder if xfs_fs_map_blocks() and xfs_fs_commit_blocks() should not be taking the invalidate lock before calling invalidate_inode_pages2() like the xfs callers of truncate_pagecache_range() do? 
If we do that, then I don't see a problem with buffered read without XFS_IOLOCK_SHARED w.r.t. correctness of layout leases. Dave, Christoph, I know that you said that changing the atomic buffered read semantics is out of the question and that you also objected to a mount option (which nobody will know how to use) and I accept that. Given that a performant range locks implementation is not something trivial to accomplish (Dave please correct me if I am wrong), and given the massive performance impact of XFS_IOLOCK_SHARED on this workload, what do you think about POSIX_FADV_TORN_RW that a specific application can use to opt out of atomic buffered read semantics? The specific application that I want to modify to use this hint is Samba. Samba uses IO threads by default to issue pread/pwrite on the server for IO requested by the SMB client. The IO size is normally larger than xfs block size and the range may not be block aligned. The SMB protocol has explicit byte range locks and the server implements them, so it is pretty safe to assume that a client that did not request range locks does not need xfs to do the implicit range locking for it. For this reason and because of the huge performance win, I would like to implement POSIX_FADV_TORN_RW in xfs and have Samba try to set this hint when supported. It is very much possible that NFSv4 servers (user and kernel) would also want to set this hint for very similar reasons. Thoughts? Thanks, Amir. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [POC][PATCH] xfs: reduce ilock contention on buffered randrw workload 2022-09-13 14:40 ` Amir Goldstein @ 2022-09-14 16:01 ` Darrick J. Wong 2022-09-14 16:29 ` Amir Goldstein 0 siblings, 1 reply; 38+ messages in thread From: Darrick J. Wong @ 2022-09-14 16:01 UTC (permalink / raw) To: Amir Goldstein Cc: Jan Kara, Dave Chinner, Christoph Hellwig, Darrick J . Wong, Matthew Wilcox, linux-xfs, linux-fsdevel On Tue, Sep 13, 2022 at 05:40:46PM +0300, Amir Goldstein wrote: > On Tue, Jun 21, 2022 at 3:53 PM Amir Goldstein <amir73il@gmail.com> wrote: > > > > On Tue, Jun 21, 2022 at 11:59 AM Jan Kara <jack@suse.cz> wrote: > > > > > > On Tue 21-06-22 10:49:48, Amir Goldstein wrote: > > > > > How exactly do you imagine the synchronization of buffered read against > > > > > buffered write would work? Lock all pages for the read range in the page > > > > > cache? You'd need to be careful to not bring the machine OOM when someone > > > > > asks to read a huge range... > > > > > > > > I imagine that the atomic r/w synchronisation will remain *exactly* as it is > > > > today by taking XFS_IOLOCK_SHARED around generic_file_read_iter(), > > > > when reading data into user buffer, but before that, I would like to issue > > > > and wait for read of the pages in the range to reduce the probability > > > > of doing the read I/O under XFS_IOLOCK_SHARED. > > > > > > > > The pre-warm of page cache does not need to abide to the atomic read > > > > semantics and it is also tolerable if some pages are evicted in between > > > > pre-warn and read to user buffer - in the worst case this will result in > > > > I/O amplification, but for the common case, it will be a big win for the > > > > mixed random r/w performance on xfs. > > > > > > > > To reduce risk of page cache thrashing we can limit this optimization > > > > to a maximum number of page cache pre-warm. > > > > > > > > The questions are: > > > > 1. Does this plan sound reasonable? > > > > > > Ah, I see now. 
So essentially the idea is to pull the readahead (which is > > > currently happening from filemap_read() -> filemap_get_pages()) out from under > > > the i_rwsem. It looks like a fine idea to me. > > > > Although I was able to demonstrate performance improvement > with page cache pre-warming on low latency disks, when testing > on a common standard system [*], page cache pre-warming did not > yield any improvement to the mixed rw workload. > > [*] I ran the following fio workload on e2-standard-8 GCE machine: > > [global] > filename=/mnt/xfs/testfile.fio > norandommap > randrepeat=0 > size=5G > bs=8K > ioengine=psync > numjobs=8 > group_reporting=1 > direct=0 > fallocate=1 > end_fsync=0 > runtime=60 > > [xfs-read] > readwrite=randread > > [xfs-write] > readwrite=randwrite > > The difference between ext4 and xfs with this machine/workload was > two orders of magnitude: > > root@xfstests:~# fio ./ext4.fio > ... > Run status group 0 (all jobs): > READ: bw=826MiB/s (866MB/s), 826MiB/s-826MiB/s (866MB/s-866MB/s), > io=40.0GiB (42.9GB), run=49585-49585msec > WRITE: bw=309MiB/s (324MB/s), 309MiB/s-309MiB/s (324MB/s-324MB/s), > io=18.1GiB (19.5GB), run=60003-60003msec > > root@xfstests:~# fio ./xfs.fio > ... > Run status group 0 (all jobs): > READ: bw=7053KiB/s (7223kB/s), 7053KiB/s-7053KiB/s > (7223kB/s-7223kB/s), io=413MiB (433MB), run=60007-60007msec > WRITE: bw=155MiB/s (163MB/s), 155MiB/s-155MiB/s (163MB/s-163MB/s), > io=9324MiB (9777MB), run=60006-60006msec > > I verified that without XFS_IOLOCK_SHARED xfs fio results are on par > with ext4 results for this workload. > > > > > > just cannot comment on whether calling this without i_rwsem does not break > > > some internal XFS expectations for stuff like reflink etc. > > > > relink is done under xfs_ilock2_io_mmap => filemap_invalidate_lock_two > > so it should not be a problem. > > > > pNFS leases I need to look into. 
> > > > I wonder if xfs_fs_map_blocks() and xfs_fs_commit_blocks() > should not be taking the invalidate lock before calling > invalidate_inode_pages2() like the xfs callers of > truncate_pagecache_range() do? > > If we do that, then I don't see a problem with buffered read > without XFS_IOLOCK_SHARED w.r.t. correctness of layout leases. > > Dave, Christoph, > > I know that you said that changing the atomic buffered read semantics > is out of the question and that you also objected to a mount option > (which nobody will know how to use) and I accept that. > > Given that a performant range locks implementation is not something > trivial to accomplish (Dave please correct me if I am wrong), > and given the massive performance impact of XFS_IOLOCK_SHARED > on this workload, > what do you think about POSIX_FADV_TORN_RW that a specific > application can use to opt-out of atomic buffer read semantics? > > The specific application that I want to modify to use this hint is Samba. > Samba uses IO threads by default to issue pread/pwrite on the server > for IO requested by the SMB client. The IO size is normally larger than > xfs block size and the range may not be block aligned. > > The SMB protocol has explicit byte range locks and the server implements > them, so it is pretty safe to assume that a client that did not request > range locks does not need xfs to do the implicit range locking for it. > > For this reason and because of the huge performance win, > I would like to implement POSIX_FADV_TORN_RW in xfs and > have Samba try to set this hint when supported. > > It is very much possible that NFSv4 servers (user and kennel) > would also want to set this hint for very similar reasons. > > Thoughts? How about range locks for i_rwsem and invalidate_lock? That could reduce contention on VM farms, though I can only assume that, given that I don't have a reference implementation to play with... --D > Thanks, > Amir. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [POC][PATCH] xfs: reduce ilock contention on buffered randrw workload 2022-09-14 16:01 ` Darrick J. Wong @ 2022-09-14 16:29 ` Amir Goldstein 2022-09-14 17:39 ` Darrick J. Wong 0 siblings, 1 reply; 38+ messages in thread From: Amir Goldstein @ 2022-09-14 16:29 UTC (permalink / raw) To: Darrick J. Wong Cc: Jan Kara, Dave Chinner, Christoph Hellwig, Darrick J . Wong, Matthew Wilcox, linux-xfs, linux-fsdevel > > Dave, Christoph, > > > > I know that you said that changing the atomic buffered read semantics > > is out of the question and that you also objected to a mount option > > (which nobody will know how to use) and I accept that. > > > > Given that a performant range locks implementation is not something > > trivial to accomplish (Dave please correct me if I am wrong), > > and given the massive performance impact of XFS_IOLOCK_SHARED > > on this workload, > > what do you think about POSIX_FADV_TORN_RW that a specific > > application can use to opt-out of atomic buffer read semantics? > > > > The specific application that I want to modify to use this hint is Samba. > > Samba uses IO threads by default to issue pread/pwrite on the server > > for IO requested by the SMB client. The IO size is normally larger than > > xfs block size and the range may not be block aligned. > > > > The SMB protocol has explicit byte range locks and the server implements > > them, so it is pretty safe to assume that a client that did not request > > range locks does not need xfs to do the implicit range locking for it. > > > > For this reason and because of the huge performance win, > > I would like to implement POSIX_FADV_TORN_RW in xfs and > > have Samba try to set this hint when supported. > > > > It is very much possible that NFSv4 servers (user and kennel) > > would also want to set this hint for very similar reasons. > > > > Thoughts? > > How about range locks for i_rwsem and invalidate_lock? 
That could > reduce contention on VM farms, though I can only assume that, given that > I don't have a reference implementation to play with... > If you are asking if I have the bandwidth to work on range locks, then the answer is that I do not. IIRC, Dave had a WIP and ran some benchmarks with range locks, but I do not know what state that work is in. The question is, if application developers know (or believe) that their application does not care about torn reads, are we insisting not to allow them to opt out of atomic buffered reads (which they do not need) because no one has the time to work on range locks? If that is the final decision, then if customers come to me to complain about this workload, my response will be: If this workload is important for your application, either
- contribute developer resources to work on range locks,
- carry a patch in your kernel, or
- switch to another filesystem for this workload.
Thanks, Amir. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [POC][PATCH] xfs: reduce ilock contention on buffered randrw workload 2022-09-14 16:29 ` Amir Goldstein @ 2022-09-14 17:39 ` Darrick J. Wong 2022-09-19 23:09 ` Dave Chinner 0 siblings, 1 reply; 38+ messages in thread From: Darrick J. Wong @ 2022-09-14 17:39 UTC (permalink / raw) To: Amir Goldstein Cc: Jan Kara, Dave Chinner, Christoph Hellwig, Darrick J . Wong, Matthew Wilcox, linux-xfs, linux-fsdevel On Wed, Sep 14, 2022 at 07:29:15PM +0300, Amir Goldstein wrote: > > > Dave, Christoph, > > > > > > I know that you said that changing the atomic buffered read semantics > > > is out of the question and that you also objected to a mount option > > > (which nobody will know how to use) and I accept that. > > > > > > Given that a performant range locks implementation is not something > > > trivial to accomplish (Dave please correct me if I am wrong), > > > and given the massive performance impact of XFS_IOLOCK_SHARED > > > on this workload, > > > what do you think about POSIX_FADV_TORN_RW that a specific > > > application can use to opt-out of atomic buffer read semantics? > > > > > > The specific application that I want to modify to use this hint is Samba. > > > Samba uses IO threads by default to issue pread/pwrite on the server > > > for IO requested by the SMB client. The IO size is normally larger than > > > xfs block size and the range may not be block aligned. > > > > > > The SMB protocol has explicit byte range locks and the server implements > > > them, so it is pretty safe to assume that a client that did not request > > > range locks does not need xfs to do the implicit range locking for it. > > > > > > For this reason and because of the huge performance win, > > > I would like to implement POSIX_FADV_TORN_RW in xfs and > > > have Samba try to set this hint when supported. > > > > > > It is very much possible that NFSv4 servers (user and kennel) > > > would also want to set this hint for very similar reasons. > > > > > > Thoughts? 
> > > > How about range locks for i_rwsem and invalidate_lock? That could > > reduce contention on VM farms, though I can only assume that, given that > > I don't have a reference implementation to play with... > > > > If you are asking if I have the bandwidth to work on range lock > then the answer is that I do not. > > IIRC, Dave had a WIP and ran some benchmarks with range locks, > but I do not know at which state that work is. Yeah, that's what I was getting at -- I really wish Dave would post that as an RFC. The last time I talked to him about it, he was worried that the extra complexity of the range lock structure would lead to more memory traffic and overhead. I /know/ there are a lot of cloud vendors that would appreciate the speedup that range locking might provide. I'm also fairly sure there are also people who want maximum single threaded iops and will /not/ like range locks, but I think we ought to let kernel distributors choose which one they want. Recently I've been playing around with static keys, because certain parts of xfs online fsck need to hook into libxfs. The hooks have some overhead, so I'd want to reduce the cost of that to making the instruction prefetcher skip over a nop sled when fsck isn't running. I sorta suspect this is a way out -- the distributor selects a default locking implementation at kbuild time, and we allow a kernel command line parameter to switch (if desired) during early boot. That only works if the compiler supports asm goto (iirc) but that's not /so/ uncommon. I'll try to prod Dave about this later today, maybe we can find someone to work on it if he'd post the prototype. --D > The question is, if application developers know (or believe) > that their application does not care about torn reads, are we > insisting not to allow them to opt out of atomic buffered reads > (which they do not need) because noone has the time to > work on range locks? 
> > If that is the final decision then if customers come to me to > complain about this workload, my response will be: > > If this workload is important for your application, either > - contribute developer resource to work on range locks > - carry a patch in your kernel > or > - switch to another filesystem for this workload > > Thanks, > Amir. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [POC][PATCH] xfs: reduce ilock contention on buffered randrw workload 2022-09-14 17:39 ` Darrick J. Wong @ 2022-09-19 23:09 ` Dave Chinner 2022-09-20 2:24 ` Dave Chinner 0 siblings, 1 reply; 38+ messages in thread From: Dave Chinner @ 2022-09-19 23:09 UTC (permalink / raw) To: Darrick J. Wong Cc: Amir Goldstein, Jan Kara, Christoph Hellwig, Darrick J . Wong, Matthew Wilcox, linux-xfs, linux-fsdevel On Wed, Sep 14, 2022 at 10:39:45AM -0700, Darrick J. Wong wrote: > On Wed, Sep 14, 2022 at 07:29:15PM +0300, Amir Goldstein wrote: > > > > Dave, Christoph, > > > > > > > > I know that you said that changing the atomic buffered read semantics > > > > is out of the question and that you also objected to a mount option > > > > (which nobody will know how to use) and I accept that. > > > > > > > > Given that a performant range locks implementation is not something > > > > trivial to accomplish (Dave please correct me if I am wrong), > > > > and given the massive performance impact of XFS_IOLOCK_SHARED > > > > on this workload, > > > > what do you think about POSIX_FADV_TORN_RW that a specific > > > > application can use to opt-out of atomic buffer read semantics? > > > > > > > > The specific application that I want to modify to use this hint is Samba. > > > > Samba uses IO threads by default to issue pread/pwrite on the server > > > > for IO requested by the SMB client. The IO size is normally larger than > > > > xfs block size and the range may not be block aligned. > > > > > > > > The SMB protocol has explicit byte range locks and the server implements > > > > them, so it is pretty safe to assume that a client that did not request > > > > range locks does not need xfs to do the implicit range locking for it. > > > > > > > > For this reason and because of the huge performance win, > > > > I would like to implement POSIX_FADV_TORN_RW in xfs and > > > > have Samba try to set this hint when supported. 
> > > > It is very much possible that NFSv4 servers (user and kernel) > > > > would also want to set this hint for very similar reasons. > > > > Thoughts? > > > > > > How about range locks for i_rwsem and invalidate_lock? That could > > > reduce contention on VM farms, though I can only assume that, given that > > > I don't have a reference implementation to play with... > > > > > > > If you are asking if I have the bandwidth to work on range lock > > then the answer is that I do not. > > > > IIRC, Dave had a WIP and ran some benchmarks with range locks, > > but I do not know at which state that work is. > > Yeah, that's what I was getting at -- I really wish Dave would post that > as an RFC. The last time I talked to him about it, he was worried that > the extra complexity of the range lock structure would lead to more > memory traffic and overhead. The reason I haven't posted it is that I don't think range locks can ever be made to perform and scale as we need for the IO path. The problem with range locks is that the structure that tracks the locked ranges needs locking itself, and that means taking an IO lock is no longer just a single atomic operation - it's at least two atomic ops (lock, unlock on a spin lock) and then a bunch of cache misses while searching up the structure containing the range locks looking for overlaps. Hence a { range_lock(); range_unlock(); } pair is at minimum twice as costly as { down_read(); up_read(); } and that shows up dramatically with direct IO. My test system (2s, 32p) handles about 3 million atomic ops per second on a single cacheline across many CPUs before it breaks down into cacheline contention and goes really slow. That means I can run 1.6 million read/write DIO iops to a single file on the test machine with shared rwsem locking (~3.2 million atomic ops a second) but with range locks (assuming just atomic op overhead) that drops to ~800k r/w DIO ops. The hardware IO capacity is just over 1.7MIOPS... 
In reality, this contended cacheline is not the limiting factor for range locks - the limiting factor is the time it takes to run the critical section inside that lock. This was found with the mmap_sem when it was converted to range locks - the cache misses doing rb-tree pointer chasing with the spinlock held meant that the actual range lock rate topped out at about 180k lock,unlock pairs per second. i.e. an order of magnitude slower than a rwsem on this machine. Read this thread again: https://lore.kernel.org/linux-xfs/20190416122240.GN29573@dread.disaster.area/ That's really the elephant in the range locking room: a range lock with a locked search and update algorithm can't scale beyond a single CPU in its critical section. It's a hard limit no matter how many CPUs you have and how much concurrency the workload has - the range lock has a single threaded critical section that results in a hard computational limit on range lock operations. Hence I was looking at using a novel OLC btree algorithm for storing the range locks. The RCU-based OLC btree is largely lockless, allowing concurrent search, insert and delete operations on a range-based index. I've largely got the OLC btree to work, but that simply exposed a further problem that range locks need to handle. That is, the biggest problem for scaling range lock performance is that locking is mostly a singleton operation - very few workloads actually use concurrent access to a single file and hence need multiple range locks held at once. As a result, the typical concurrent IO range lock workload results in a single node btree, and so all lookups, inserts and removes hammer the seqlocks on a single node and we end up contending on a single cache line again. Compared to a rwsem, we consume a lot more CPU overhead before we detect a change has occurred and we need to go around and try again. 
That said, it's better than previous range lock implementations in that it gets up to about 400-450k mixed DIO and buffered iops, but it is still way, way down on using a shared rwsem or serialising at a page granularity via page locks. Yes, I know there are many advantages to range locking. Because we can exclude specific ranges, operations like truncate, fallocate, buffered writes, etc can all run concurrently with reads. DIO can run perfectly coherently with buffered IO (mmap is still a problem!). We can extend files without having to serialise against other IO within the existing EOF. We can pass lock contexts with AIO so that the inode can be unlocked at completion and we can get rid of the nasty inode_dio_wait() stuff we have for synchronisation with DIO. And so on. IOWs, while there are many upsides to range locking, the reality is that single file IO performance will not scale to storage hardware capability any more. I have a few thoughts on how range locking could be further optimised to avoid such overheads in the cases where range locking is not necessary, but I really don't think that the scalability of a range lock will ever be sufficient to allow us to track every IO we have in flight. > I /know/ there are a lot of cloud vendors that would appreciate the > speedup that range locking might provide. I'm also fairly sure there > are also people who want maximum single threaded iops and will /not/ > like range locks, but I think we ought to let kernel distributors choose > which one they want. Speedup for what operations? Not single file DIO, and only for mixed buffered read/write. Perhaps mixed fallocate/DIO workloads might benefit, but I reluctantly came to the conclusion that there really aren't many workloads that even a highly optimised rangelock would actually end up improving performance for... > Recently I've been playing around with static keys, because certain > parts of xfs online fsck need to hook into libxfs. 
The hooks have some > overhead, so I'd want to reduce the cost of that to making the > instruction prefetcher skip over a nop sled when fsck isn't running. > I sorta suspect this is a way out -- the distributor selects a default > locking implementation at kbuild time, and we allow a kernel command > line parameter to switch (if desired) during early boot. That only > works if the compiler supports asm goto (iirc) but that's not /so/ > uncommon. I think it's a lot harder than that. The range lock requires a significant on-stack structure to be declared in the context that the lock is being taken. i.e.:

+#define RANGE_LOCK_FULL (LLONG_MAX)
+
+struct range_lock {
+	uint64_t start;
+	uint64_t end;
+	struct list_head wait_list;
+#ifdef __KERNEL__
+	struct task_struct *task;
+#else
+	pthread_cond_t task;
+#endif
+};
+
+#define __RANGE_LOCK_INITIALIZER(__name, __start, __end) { \
+	.start = (__start) \
+	,.end = (__end) \
+	,.wait_list = LIST_HEAD_INIT((__name).wait_list) \
+	}
+
+#define DEFINE_RANGE_LOCK(name, start, end) \
+	struct range_lock name = __RANGE_LOCK_INITIALIZER((name), (start), (end))
+
+#define DEFINE_RANGE_LOCK_FULL(name) \
+	struct range_lock name = __RANGE_LOCK_INITIALIZER((name), 0, RANGE_LOCK_FULL)

And code that uses it looks like:

+static inline void
+xfs_iolock_range_init(
+	struct xfs_inode	*ip,
+	struct range_lock	*rlock,
+	uint64_t		start,
+	uint64_t		count)
+{
+	int			rounding;
+
+	rounding = max_t(int, i_blocksize(VFS_I(ip)), PAGE_SIZE);
+	range_lock_init(rlock, round_down(start, rounding),
+			round_up(start + count, rounding));
+}
+
 STATIC ssize_t
 xfs_file_dio_read(
 	struct kiocb		*iocb,
 	struct iov_iter		*to)
 {
 	struct xfs_inode	*ip = XFS_I(file_inode(iocb->ki_filp));
+	size_t			count = iov_iter_count(to);
 	ssize_t			ret;
+	struct range_lock	rlock;

 	trace_xfs_file_direct_read(iocb, to);

-	if (!iov_iter_count(to))
+	if (!count)
 		return 0; /* skip atime */

 	file_accessed(iocb->ki_filp);

-	ret = xfs_ilock_iocb(iocb, XFS_VFSLOCK_SHARED);
+	xfs_iolock_range_init(ip, &rlock, iocb->ki_pos, count);
+	ret = xfs_ilock_iocb(iocb, &rlock, XFS_VFSLOCK_SHARED);

Hence I don't think this is as simple as using static keys to switch code paths as there's a whole lot more information that needs to be set up for range locks compared to just using rwsems.... -Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [POC][PATCH] xfs: reduce ilock contention on buffered randrw workload 2022-09-19 23:09 ` Dave Chinner @ 2022-09-20 2:24 ` Dave Chinner 2022-09-20 3:08 ` Amir Goldstein 0 siblings, 1 reply; 38+ messages in thread From: Dave Chinner @ 2022-09-20 2:24 UTC (permalink / raw) To: Darrick J. Wong Cc: Amir Goldstein, Jan Kara, Christoph Hellwig, Darrick J . Wong, Matthew Wilcox, linux-xfs, linux-fsdevel On Tue, Sep 20, 2022 at 09:09:47AM +1000, Dave Chinner wrote: > On Wed, Sep 14, 2022 at 10:39:45AM -0700, Darrick J. Wong wrote: > > On Wed, Sep 14, 2022 at 07:29:15PM +0300, Amir Goldstein wrote: > > > > > Dave, Christoph, > > > > > > > > > > I know that you said that changing the atomic buffered read semantics > > > > > is out of the question and that you also objected to a mount option > > > > > (which nobody will know how to use) and I accept that. > > > > > > > > > > Given that a performant range locks implementation is not something > > > > > trivial to accomplish (Dave please correct me if I am wrong), > > > > > and given the massive performance impact of XFS_IOLOCK_SHARED > > > > > on this workload, > > > > > what do you think about POSIX_FADV_TORN_RW that a specific > > > > > application can use to opt-out of atomic buffer read semantics? > > > > > > > > > > The specific application that I want to modify to use this hint is Samba. > > > > > Samba uses IO threads by default to issue pread/pwrite on the server > > > > > for IO requested by the SMB client. The IO size is normally larger than > > > > > xfs block size and the range may not be block aligned. > > > > > > > > > > The SMB protocol has explicit byte range locks and the server implements > > > > > them, so it is pretty safe to assume that a client that did not request > > > > > range locks does not need xfs to do the implicit range locking for it. That doesn't cover concurrent local (server side) access to the file. 
It's not uncommon to have the same filesystems exported by both Samba and NFS at the same time, and the only point of co-ordination between the two is the underlying local filesystem.... IOWs, when we are talking about local filesystem behaviour, what a network protocol does above the filesystem is largely irrelevant to the synchronisation required within the filesystem implementation.... > > > > > For this reason and because of the huge performance win, > > > > > I would like to implement POSIX_FADV_TORN_RW in xfs and > > > > > have Samba try to set this hint when supported. > > > > > > > > > > It is very much possible that NFSv4 servers (user and kernel) > > > > > would also want to set this hint for very similar reasons. > > > > > > > > > > Thoughts? > > > > > > > > How about range locks for i_rwsem and invalidate_lock? That could > > > > reduce contention on VM farms, though I can only assume that, given that > > > > I don't have a reference implementation to play with... > > > > > > > > > > If you are asking if I have the bandwidth to work on range lock > > > then the answer is that I do not. > > > > > > IIRC, Dave had a WIP and ran some benchmarks with range locks, > > > but I do not know at which state that work is. > > > > Yeah, that's what I was getting at -- I really wish Dave would post that > > as an RFC. The last time I talked to him about it, he was worried that > > the extra complexity of the range lock structure would lead to more > > memory traffic and overhead. > > The reason I haven't posted it is that I don't think range locks can > > ever be made to perform and scale as we need for the IO path. > > [snip range lock scalability and perf issues] As I just discussed on #xfs with Darrick, there are other options we can pursue here. The first question we need to ask ourselves is this: what are we protecting against with exclusive buffered write behaviour? 
The answer is that we know there are custom enterprise database applications out there that assume that 8-16kB buffered writes are atomic. I wish I could say these are legacy applications these days, but they aren't - they are still in production use, and the applications built on those custom database engines are still under active development and use. AFAIK, the 8kB atomic write behaviour is historical and came from applications originally designed for Solaris and hardware that had an 8kB page size. Hence buffered 8kB writes were assumed to be the largest atomic write size for which concurrent reads would not see write tearing. These applications are now run on x86-64 boxes with 4kB page size, but they still assume that 8kB writes are atomic and can't tear. So, really, these days the atomic write behaviour of XFS is catering for these small random read/write IO applications, not to provide atomic writes for bulk data moving applications writing 2GB of data per write() syscall. Hence we can fairly safely say that we really only need "exclusive" buffered write locking for relatively small multipage IOs, not huge IOs. We can do single page shared buffered writes immediately - we guarantee that while the folio is locked, a buffered read cannot access the data until the folio is unlocked. So that could be the first step to relaxing the exclusive locking requirement for buffered writes. Next we need to consider that we now have large folio support in the page cache, which means we can treat contiguous file ranges larger than a single page as a single atomic unit if they are covered by a multi-page folio. As such, if we have a single multi-page folio that spans the entire write() range already in cache, we can run that write atomically under a shared IO lock the same as we can do with single page folios. However, what happens if the folio is smaller than the range we need to write? 
Well, in that case, we have to abort the shared lock write and upgrade to an exclusive lock before trying again. Of course, we can only determine if the write can go ahead once we have the folio locked. That means we need a new non-blocking write condition to be handled by the iomap code. We already have several of them because of IOCB_NOWAIT semantics that io_uring requires for buffered writes, so we are already well down the path of needing to support fully non-blocking writes through iomap. Further, the recent concurrent write data corruption that we uncovered requires a new hook in the iomap write path to allow writes to be aborted for remapping because the cached iomap has become stale. This validity check can only be done once the folio has been locked - if the cached iomap is stale once we have the page locked, then we have to back out and remap the write range and re-run the write. IOWs, we are going to have to add write retries to the iomap write path for data integrity purposes. 
This would give us multi-folio atomic writes with shared locking - this is much more complex, and it's unclear that multi-folio write batching will gain us anything over the single folio check described above... Finally, for anything that is concurrently reading and writing lots of data in chunks larger than PMD sizes, the application should really be using DIO with AIO or io_uring. So falling back to exclusive locking for such large single buffered write IOs doesn't seem like a huge issue right now.... Thoughts? -Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [POC][PATCH] xfs: reduce ilock contention on buffered randrw workload 2022-09-20 2:24 ` Dave Chinner @ 2022-09-20 3:08 ` Amir Goldstein 2022-09-21 11:20 ` Amir Goldstein 0 siblings, 1 reply; 38+ messages in thread From: Amir Goldstein @ 2022-09-20 3:08 UTC (permalink / raw) To: Dave Chinner Cc: Darrick J. Wong, Jan Kara, Christoph Hellwig, Darrick J . Wong, Matthew Wilcox, linux-xfs, linux-fsdevel On Tue, Sep 20, 2022 at 5:24 AM Dave Chinner <david@fromorbit.com> wrote: > > On Tue, Sep 20, 2022 at 09:09:47AM +1000, Dave Chinner wrote: > > On Wed, Sep 14, 2022 at 10:39:45AM -0700, Darrick J. Wong wrote: > > > On Wed, Sep 14, 2022 at 07:29:15PM +0300, Amir Goldstein wrote: > > > > > > Dave, Christoph, > > > > > > > > > > > > I know that you said that changing the atomic buffered read semantics > > > > > > is out of the question and that you also objected to a mount option > > > > > > (which nobody will know how to use) and I accept that. > > > > > > > > > > > > Given that a performant range locks implementation is not something > > > > > > trivial to accomplish (Dave please correct me if I am wrong), > > > > > > and given the massive performance impact of XFS_IOLOCK_SHARED > > > > > > on this workload, > > > > > > what do you think about POSIX_FADV_TORN_RW that a specific > > > > > > application can use to opt-out of atomic buffer read semantics? > > > > > > > > > > > > The specific application that I want to modify to use this hint is Samba. > > > > > > Samba uses IO threads by default to issue pread/pwrite on the server > > > > > > for IO requested by the SMB client. The IO size is normally larger than > > > > > > xfs block size and the range may not be block aligned. > > > > > > > > > > > > The SMB protocol has explicit byte range locks and the server implements > > > > > > them, so it is pretty safe to assume that a client that did not request > > > > > > range locks does not need xfs to do the implicit range locking for it. 
> > That doesn't cover concurrent local (server side) access to the > file. It's not uncommon to have the same filesystems exported by > both Samba and NFS at the same time, and the only point of > co-ordination between the two is the underlying local filesystem.... > > IOWs, when we are talking about local filesystem behaviour, what a > network protocol does above the filesystem is largely irrelevant to > the synchronisation required within the filesystem > implementation.... > Perhaps I did not explain my proposal well. Maybe I should have named it POSIX_FADV_TORN_READ. The fadvise() from the nfs/smb server would affect only the behavior of buffered reads on the open fd of that server; it will not affect any buffered IO on that inode not associated with the application that opted in to this behavior. Also, the nfs/smb server buffered writes are still under the same exclusive IO lock as the rest of the callers. > > > > > > For this reason and because of the huge performance win, > > > > > > I would like to implement POSIX_FADV_TORN_RW in xfs and > > > > > > have Samba try to set this hint when supported. > > > > > > > > > > > > It is very much possible that NFSv4 servers (user and kernel) > > > > > > would also want to set this hint for very similar reasons. > > > > > > > > > > > > Thoughts? > > > > > > > > > > How about range locks for i_rwsem and invalidate_lock? That could > > > > > reduce contention on VM farms, though I can only assume that, given that > > > > > I don't have a reference implementation to play with... > > > > > > > > > > > > > > If you are asking if I have the bandwidth to work on range lock > > > > then the answer is that I do not. > > > > > > > > IIRC, Dave had a WIP and ran some benchmarks with range locks, > > > > but I do not know at which state that work is. > > > > > > Yeah, that's what I was getting at -- I really wish Dave would post that > > > as an RFC. 
The last time I talked to him about it, he was worried that > > > the extra complexity of the range lock structure would lead to more > > > memory traffic and overhead. > > > > The reason I haven't posted it is that I don't think range locks can > > ever be made to perform and scale as we need for the IO path. > > [snip range lock scalability and perf issues] > > As I just discussed on #xfs with Darrick, there are other options > we can persue here. > > The first question we need to ask ourselves is this: what are we > protecting against with exclusive buffered write behaviour? > > The answer is that we know there are custom enterprise database > applications out there that assume that 8-16kB buffered writes are > atomic. I wish I could say these are legacy applications these days, > but they aren't - they are still in production use, and the > applications build on those custom database engines are still under > active development and use. > > AFAIK, the 8kB atomic write behaviour is historical and came from > applications originally designed for Solaris and hardware that > had an 8kB page size. Hence buffered 8kB writes were assumed to be > the largest atomic write size that concurrent reads would not see > write tearing. These applications are now run on x86-64 boxes with > 4kB page size, but they still assume that 8kB writes are atomic and > can't tear. > Interesting. I did not know which applications needed that behavior. The customer benchmark that started the complaint uses 8k buffered IO size (as do the benchmarks that I posted in this thread), so far as I am concerned, fixing small buffered IO will solve the problem. > So, really, these days the atomic write behaviour of XFS is catering > for these small random read/write IO applications, not to provide > atomic writes for bulk data moving applications writing 2GB of data > per write() syscall. 
Hence we can fairly safely say that we really > only need "exclusive" buffered write locking for relatively small > multipage IOs, not huge IOs. > > We can do single page shared buffered writes immediately - we > guarantee that while the folio is locked, a buffered read cannot > access the data until the folio is unlocked. So that could be the > first step to relaxing the exclusive locking requirement for > buffered writes. > > Next we need to consider that we now have large folio support in the > page cache, which means we can treat contiguous file ranges larger > than a single page a single atomic unit if they are covered by a > multi-page folio. As such, if we have a single multi-page folio that > spans the entire write() range already in cache, we can run that > write atomically under a shared IO lock the same as we can do with > single page folios. > > However, what happens if the folio is smaller than the range we need > to write? Well, in that case, we have to abort the shared lock write > and upgrade to an exclusive lock before trying again. > > Of course, we can only determine if the write can go ahead once we > have the folio locked. That means we need a new non-blocking write > condition to be handled by the iomap code. We already have several > of them because of IOCB_NOWAIT semantics that io_uring requires for > buffered writes, so we are already well down the path of needing to > support fully non-blocking writes through iomap. > > Further, the recent concurrent write data corruption that we > uncovered requires a new hook in the iomap write path to allow > writes to be aborted for remapping because the cached iomap has > become stale. This validity check can only be done once the folio > has locked - if the cached iomap is stale once we have the page > locked, then we have to back out and remap the write range and > re-run the write. > > IOWs, we are going to have to add write retries to the iomap write > path for data integrity purposes. 
These checks must be done only
> after the folio has been locked, so we really end up getting the
> "can't do atomic write" retry infrastructure for free with the data
> corruption fixes...
>
> With this in place, it becomes trivial to support atomic writes with
> shared locking all the way up to PMD sizes (or whatever the maximum
> multipage folio size the arch supports is) with a minimal amount of
> extra code.
>
> At this point, we have a buffered write path that tries to do shared
> locking first, and only falls back to exclusive locking if the page
> cache doesn't contain a folio large enough to soak up the entire
> write.
>
> In future, Darrick suggested we might be able to do a "trygetlock a
> bunch of folios" operation that locks a range of folios within the
> current iomap in one go, and then we write into all of them in a
> batch before unlocking them all. This would give us multi-folio
> atomic writes with shared locking - this is much more complex, and
> it's unclear that multi-folio write batching will gain us anything
> over the single folio check described above...
>
> Finally, for anything that is concurrently reading and writing lots
> of data in chunks larger than PMD sizes, the application should
> really be using DIO with AIO or io_uring. So falling back to
> exclusive locking for such large single buffered write IOs doesn't
> seem like a huge issue right now....
>
> Thoughts?

That sounds like a great plan.
I especially liked the "get it for free" part ;)
Is there already WIP for the data integrity issue fix?

If there is anything I can do to assist, run the benchmark or anything
else, please let me know.

In the meanwhile, I will run the benchmark with XFS_IOLOCK_SHARED
on the write() path.

Thanks!
Amir.

^ permalink raw reply	[flat|nested] 38+ messages in thread
* Re: [POC][PATCH] xfs: reduce ilock contention on buffered randrw workload 2022-09-20 3:08 ` Amir Goldstein @ 2022-09-21 11:20 ` Amir Goldstein 0 siblings, 0 replies; 38+ messages in thread From: Amir Goldstein @ 2022-09-21 11:20 UTC (permalink / raw) To: Dave Chinner Cc: Darrick J. Wong, Jan Kara, Christoph Hellwig, Darrick J . Wong, Matthew Wilcox, linux-xfs, linux-fsdevel On Tue, Sep 20, 2022 at 6:08 AM Amir Goldstein <amir73il@gmail.com> wrote: > [...] > > As I just discussed on #xfs with Darrick, there are other options > > we can persue here. > > > > The first question we need to ask ourselves is this: what are we > > protecting against with exclusive buffered write behaviour? > > > > The answer is that we know there are custom enterprise database > > applications out there that assume that 8-16kB buffered writes are > > atomic. I wish I could say these are legacy applications these days, > > but they aren't - they are still in production use, and the > > applications build on those custom database engines are still under > > active development and use. > > > > AFAIK, the 8kB atomic write behaviour is historical and came from > > applications originally designed for Solaris and hardware that > > had an 8kB page size. Hence buffered 8kB writes were assumed to be > > the largest atomic write size that concurrent reads would not see > > write tearing. These applications are now run on x86-64 boxes with > > 4kB page size, but they still assume that 8kB writes are atomic and > > can't tear. > > > > Interesting. I did not know which applications needed that behavior. > The customer benchmark that started the complaint uses 8k buffered > IO size (as do the benchmarks that I posted in this thread), so far as > I am concerned, fixing small buffered IO will solve the problem. 
>
> > So, really, these days the atomic write behaviour of XFS is catering
> > for these small random read/write IO applications, not to provide
> > atomic writes for bulk data moving applications writing 2GB of data
> > per write() syscall. Hence we can fairly safely say that we really
> > only need "exclusive" buffered write locking for relatively small
> > multipage IOs, not huge IOs.
> >
> > We can do single page shared buffered writes immediately - we
> > guarantee that while the folio is locked, a buffered read cannot
> > access the data until the folio is unlocked. So that could be the
> > first step to relaxing the exclusive locking requirement for
> > buffered writes.
> >
> > Next we need to consider that we now have large folio support in the
> > page cache, which means we can treat contiguous file ranges larger
> > than a single page as a single atomic unit if they are covered by a
> > multi-page folio. As such, if we have a single multi-page folio that
> > spans the entire write() range already in cache, we can run that
> > write atomically under a shared IO lock the same as we can do with
> > single page folios.
> >
> > However, what happens if the folio is smaller than the range we need
> > to write? Well, in that case, we have to abort the shared lock write
> > and upgrade to an exclusive lock before trying again.
> >

Please correct me if I am wrong, but with current upstream, the only
way that multi page folios are created for a 4K block / 4K page setup
is during readahead. We *could* allocate multi page folios on write to
an allocated block range that maps inside a single extent, but there
is no code for that today.

It seems that without this code, any write to a region of page cache
not pre-populated using readahead would get the exclusive iolock for
8K buffered writes until that single page folio cache entry is
evicted.

Am I reading the iomap code correctly?
> > Of course, we can only determine if the write can go ahead once we > > have the folio locked. That means we need a new non-blocking write > > condition to be handled by the iomap code. We already have several > > of them because of IOCB_NOWAIT semantics that io_uring requires for > > buffered writes, so we are already well down the path of needing to > > support fully non-blocking writes through iomap. > > > > Further, the recent concurrent write data corruption that we > > uncovered requires a new hook in the iomap write path to allow > > writes to be aborted for remapping because the cached iomap has > > become stale. This validity check can only be done once the folio > > has locked - if the cached iomap is stale once we have the page > > locked, then we have to back out and remap the write range and > > re-run the write. > > > > IOWs, we are going to have to add write retries to the iomap write > > path for data integrity purposes. These checks must be done only > > after the folio has been locked, so we really end up getting the > > "can't do atomic write" retry infrastructure for free with the data > > corruption fixes... > > > > With this in place, it becomes trivial to support atomic writes with > > shared locking all the way up to PMD sizes (or whatever the maximum > > multipage folio size the arch supports is) with a minimal amount of > > extra code. > > > > At this point, we have a buffered write path that tries to do shared > > locking first, and only falls back to exclusive locking if the page > > cache doesn't contain a folio large enough to soak up the entire > > write. > > > > In future, Darrick suggested we might be able to do a "trygetlock a > > bunch of folios" operation that locks a range of folios within the > > current iomap in one go, and then we write into all of them in a > > batch before unlocking them all. 
This would give us multi-folio > > atomic writes with shared locking - this is much more complex, and > > it's unclear that multi-folio write batching will gain us anything > > over the single folio check described above... > > > > Finally, for anything that is concurrently reading and writing lots > > of data in chunks larger than PMD sizes, the application should > > really be using DIO with AIO or io_uring. So falling back to > > exclusive locking for such large single buffered write IOs doesn't > > seem like a huge issue right now.... > > > > Thoughts? > > That sounds like a great plan. > I especially liked the "get it for free" part ;) > Is there already WIP for the data integrity issue fix? > OK. I see your patch set. > If there is anything I can do to assist, run the benchmark or anything > please let me know. > > In the meanwhile, I will run the benchmark with XFS_IOLOCK_SHARED > on the write() path. > As expected, results without exclusive iolock look very good [*]. Thanks, Amir. 
[*] I ran the following fio workload on e2-standard-8 GCE machine:

[global]
filename=/mnt/xfs/testfile.fio
norandommap
randrepeat=0
size=5G
bs=8K
ioengine=psync
numjobs=8
group_reporting=1
direct=0
fallocate=1
end_fsync=0
runtime=60

[xfs-read]
readwrite=randread

[xfs-write]
readwrite=randwrite

========================= v6.0-rc4 (BAD) =========
Run #1:
   READ: bw=7053KiB/s (7223kB/s)
  WRITE: bw=155MiB/s (163MB/s)
Run #2:
   READ: bw=4672KiB/s (4784kB/s)
  WRITE: bw=355MiB/s (372MB/s)
Run #3:
   READ: bw=5887KiB/s (6028kB/s)
  WRITE: bw=137MiB/s (144MB/s)

========================= v6.0-rc4 (read no iolock like ext4 - GOOD) =========
   READ: bw=742MiB/s (778MB/s)
  WRITE: bw=345MiB/s (361MB/s)

========================= v6.0-rc4 (write shared iolock - BETTER) =========
Run #1:
   READ: bw=762MiB/s (799MB/s)
  WRITE: bw=926MiB/s (971MB/s)
Run #2:
   READ: bw=170MiB/s (178MB/s)
  WRITE: bw=982MiB/s (1029MB/s)
Run #3:
   READ: bw=755MiB/s (792MB/s)
  WRITE: bw=933MiB/s (978MB/s)

^ permalink raw reply	[flat|nested] 38+ messages in thread
* Re: [POC][PATCH] xfs: reduce ilock contention on buffered randrw workload 2019-04-07 23:27 ` Dave Chinner 2019-04-08 9:02 ` Amir Goldstein @ 2019-04-08 11:03 ` Jan Kara 2019-04-22 10:55 ` Boaz Harrosh 1 sibling, 1 reply; 38+ messages in thread From: Jan Kara @ 2019-04-08 11:03 UTC (permalink / raw) To: Dave Chinner Cc: Amir Goldstein, Darrick J . Wong, Christoph Hellwig, Matthew Wilcox, linux-xfs, linux-fsdevel

On Mon 08-04-19 09:27:28, Dave Chinner wrote:
> The result of this is that, AFAICT, ext4 does not protect against
> read() vs hole punch races - in its hole punching code it does:
>
> Hole Punch:                          read():
>
> inode_lock()
> inode_dio_wait(inode);
> down_write(i_mmap_sem)
> truncate_pagecache_range()
>                                      ext4_file_iter_read()
>                                      ext4_map_blocks()
>                                      down_read(i_data_sem)
>                                      <gets mapping>
>                                      <populates page cache over hole>
>                                      <reads stale data into cache>
> .....
> down_write(i_data_sem)
> remove extents
>
> IOWs, ext4 is safe against truncate because of the
> change-inode-size-before-invalidation hacks, but the lack of
> serialised buffered reads means that hole punch and other similar
> fallocate based extent manipulations can race against reads....

Hum, you are right. Ext4 is buggy in this regard. I've fixed the race
for page fault in ea3d7209ca01 "ext4: fix races between page faults
and hole punching" but didn't realize the problem is there for
buffered reads as well. I'll think how we can fix this. Thanks for
noticing this!

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 38+ messages in thread
* Re: [POC][PATCH] xfs: reduce ilock contention on buffered randrw workload 2019-04-08 11:03 ` Jan Kara @ 2019-04-22 10:55 ` Boaz Harrosh 0 siblings, 0 replies; 38+ messages in thread From: Boaz Harrosh @ 2019-04-22 10:55 UTC (permalink / raw) To: Jan Kara, Dave Chinner Cc: Amir Goldstein, Darrick J . Wong, Christoph Hellwig, Matthew Wilcox, linux-xfs, linux-fsdevel On 08/04/19 14:03, Jan Kara wrote: <> > Hum, you are right. Ext4 is buggy in this regard. I've fixed the race for > page fault in ea3d7209ca01 "ext4: fix races between page faults and hole > punching" but didn't realize the problem is there for buffered reads as > well. I'll think how we can fix this. Thanks for noticing this! > This is very interesting. Please CC me if you have devised a test for this? For a long time I want to enhance fsstress --verify that each writer always writes the long address it writes to as data (ie. file looks like an increasing long counter) And readers want to see only this pattern or zero. > Honza > Thanks Boaz ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [POC][PATCH] xfs: reduce ilock contention on buffered randrw workload 2019-04-04 21:17 ` Dave Chinner 2019-04-05 14:02 ` Amir Goldstein @ 2019-04-08 10:33 ` Jan Kara 2019-04-08 16:37 ` Davidlohr Bueso 1 sibling, 1 reply; 38+ messages in thread From: Jan Kara @ 2019-04-08 10:33 UTC (permalink / raw) To: Dave Chinner Cc: Amir Goldstein, Darrick J . Wong, Christoph Hellwig, Matthew Wilcox, linux-xfs, linux-fsdevel, Davidlohr Bueso On Fri 05-04-19 08:17:30, Dave Chinner wrote: > FYI, I'm working on a range lock implementation that should both > solve the performance issue and the reader starvation issue at the > same time by allowing concurrent buffered reads and writes to > different file ranges. Are you aware of range locks Davidlohr has implemented [1]? It didn't get merged because he had no in-tree user at the time (he was more aiming at converting mmap_sem which is rather difficult). But the generic lock implementation should be well usable. Added Davidlohr to CC. Honza [1] https://lkml.org/lkml/2017/3/7/22 -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [POC][PATCH] xfs: reduce ilock contention on buffered randrw workload 2019-04-08 10:33 ` Jan Kara @ 2019-04-08 16:37 ` Davidlohr Bueso 2019-04-11 1:11 ` Dave Chinner 0 siblings, 1 reply; 38+ messages in thread From: Davidlohr Bueso @ 2019-04-08 16:37 UTC (permalink / raw) To: Jan Kara, Dave Chinner Cc: Amir Goldstein, Darrick J . Wong, Christoph Hellwig, Matthew Wilcox, linux-xfs, linux-fsdevel On Mon, 2019-04-08 at 12:33 +0200, Jan Kara wrote: > On Fri 05-04-19 08:17:30, Dave Chinner wrote: > > FYI, I'm working on a range lock implementation that should both > > solve the performance issue and the reader starvation issue at the > > same time by allowing concurrent buffered reads and writes to > > different file ranges. > > Are you aware of range locks Davidlohr has implemented [1]? It didn't get > merged because he had no in-tree user at the time (he was more aiming at > converting mmap_sem which is rather difficult). But the generic lock > implementation should be well usable. > > Added Davidlohr to CC. > > Honza > > [1] https://lkml.org/lkml/2017/3/7/22 fyi this was the latest version (had some naming updates per peterz). https://lkml.org/lkml/2018/2/4/232 ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [POC][PATCH] xfs: reduce ilock contention on buffered randrw workload 2019-04-08 16:37 ` Davidlohr Bueso @ 2019-04-11 1:11 ` Dave Chinner 2019-04-16 12:22 ` Dave Chinner 0 siblings, 1 reply; 38+ messages in thread From: Dave Chinner @ 2019-04-11 1:11 UTC (permalink / raw) To: Davidlohr Bueso Cc: Jan Kara, Amir Goldstein, Darrick J . Wong, Christoph Hellwig, Matthew Wilcox, linux-xfs, linux-fsdevel On Mon, Apr 08, 2019 at 09:37:09AM -0700, Davidlohr Bueso wrote: > On Mon, 2019-04-08 at 12:33 +0200, Jan Kara wrote: > > On Fri 05-04-19 08:17:30, Dave Chinner wrote: > > > FYI, I'm working on a range lock implementation that should both > > > solve the performance issue and the reader starvation issue at the > > > same time by allowing concurrent buffered reads and writes to > > > different file ranges. > > > > Are you aware of range locks Davidlohr has implemented [1]? It didn't get > > merged because he had no in-tree user at the time (he was more aiming at > > converting mmap_sem which is rather difficult). But the generic lock > > implementation should be well usable. > > > > Added Davidlohr to CC. > > > > Honza > > > > [1] https://lkml.org/lkml/2017/3/7/22 > > fyi this was the latest version (had some naming updates per peterz). > > https://lkml.org/lkml/2018/2/4/232 No, I wasn't aware of these because they haven't ever been posted to a list I subscribe to and they haven't been merged. I'll go have a look at them over the next few days. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [POC][PATCH] xfs: reduce ilock contention on buffered randrw workload 2019-04-11 1:11 ` Dave Chinner @ 2019-04-16 12:22 ` Dave Chinner 2019-04-18 3:10 ` Dave Chinner 0 siblings, 1 reply; 38+ messages in thread From: Dave Chinner @ 2019-04-16 12:22 UTC (permalink / raw) To: Davidlohr Bueso Cc: Jan Kara, Amir Goldstein, Darrick J . Wong, Christoph Hellwig, Matthew Wilcox, linux-xfs, linux-fsdevel On Thu, Apr 11, 2019 at 11:11:17AM +1000, Dave Chinner wrote: > On Mon, Apr 08, 2019 at 09:37:09AM -0700, Davidlohr Bueso wrote: > > On Mon, 2019-04-08 at 12:33 +0200, Jan Kara wrote: > > > On Fri 05-04-19 08:17:30, Dave Chinner wrote: > > > > FYI, I'm working on a range lock implementation that should both > > > > solve the performance issue and the reader starvation issue at the > > > > same time by allowing concurrent buffered reads and writes to > > > > different file ranges. > > > > > > Are you aware of range locks Davidlohr has implemented [1]? It didn't get > > > merged because he had no in-tree user at the time (he was more aiming at > > > converting mmap_sem which is rather difficult). But the generic lock > > > implementation should be well usable. > > > > > > Added Davidlohr to CC. > > > > > > Honza > > > > > > [1] https://lkml.org/lkml/2017/3/7/22 > > > > fyi this was the latest version (had some naming updates per peterz). > > > > https://lkml.org/lkml/2018/2/4/232 > > No, I wasn't aware of these because they haven't ever been posted to > a list I subscribe to and they haven't been merged. I'll go have a > look at them over the next few days. Rather disappointing, to tell the truth. The implementation doesn't scale nearly as a rwsem for concurrent IO using shared access (i.e. XFS direct IO). That's the baseline we need to get close to for range locks to be a viable prospect. 
Fio randrw numbers on a single file on a pmem device on a 16p
machine using 4kB AIO-DIO iodepth 128 w/ fio on 5.1.0-rc3:

		IOPS read/write (direct IO)
fio processes	rwsem		rangelock
 1		78k / 78k	75k / 75k
 2		131k / 131k	123k / 123k
 4		267k / 267k	183k / 183k
 8		372k / 372k	177k / 177k
 16		315k / 315k	135k / 135k

So uncontended, non-overlapping range performance is not really even
in the ballpark right now, unfortunately.

To indicate that range locking is actually working, let's do
buffered read/write, which takes the rwsem exclusive for writes and
so will be bound by that:

		IOPS read/write (buffered IO)
fio processes	rwsem		rangelock
 1		57k / 57k	64k / 64k
 2		61k / 61k	111k / 111k
 4		61k / 61k	228k / 228k
 8		55k / 55k	195k / 195k
 16		15k / 15k	40k / 40k

So the win for mixed buffered IO is huge, but it's at the cost of
direct IO performance. We can't make that tradeoff, so if this
implementation cannot be substantially improved we really can't use
it.

Fundamentally, the problem is that the interval tree work is all
done under a spinlock, so by the time we get to 16p, 70% of the CPU
time on the 16p is being burnt on the spinlock, and only 30% of the
CPU time is actually doing IO work:

+   70.78%     2.50%  [kernel]  [k] do_raw_spin_lock
+   69.72%     0.07%  [kernel]  [k] _raw_spin_lock_irqsave
-   68.27%    68.10%  [kernel]  [k] __pv_queued_spin_lock_slowpath
.....
+   15.61%     0.07%  [kernel]  [k] range_write_unlock
+   15.39%     0.06%  [kernel]  [k] range_read_unlock
+   15.11%     0.12%  [kernel]  [k] range_read_lock
+   15.11%     0.13%  [kernel]  [k] range_write_lock
.....
+   12.92%     0.11%  [kernel]  [k] range_is_locked

FWIW, I'm not convinced about the scalability of the rb/interval
tree, to tell you the truth. We got rid of the rbtree in XFS for
cache indexing because the multi-level pointer chasing was just too
expensive to do under a spinlock - it's just not a cache efficient
structure for random index object storage.
FWIW, I have a basic hack to replace the i_rwsem in XFS with a full
range read or write lock with my XFS range lock implementation so it
just behaves like a rwsem at this point. It is not in any way
optimised at this point. Numbers for same AIO-DIO test are:

		IOPS read/write (direct IO)
processes	rwsem		DB rangelock	XFS rangelock
 1		78k / 78k	75k / 75k	74k / 74k
 2		131k / 131k	123k / 123k	134k / 134k
 4		267k / 267k	183k / 183k	246k / 246k
 8		372k / 372k	177k / 177k	306k / 306k
 16		315k / 315k	135k / 135k	240k / 240k

Performance is much closer to rwsems, but the single lock critical
section still causes serious amounts of cacheline contention on pure
shared-read lock workloads like this.

FWIW, the XFS rangelock numbers match what I've been seeing with
userspace perf tests using unoptimised pthread mutexes - ~600,000
lock/unlock cycles a second over ~100 concurrent non-overlapping
ranges seem to be the limitation before futex contention burns down
the house.

As it is, I've done all the work to make XFS use proper range locks
on top of your patch - I'll go back tomorrow and modify the XFS
range locks to use the same API as your lock implementation. I'm
interested to see what falls out when they are used as real range
locks rather than a glorified rwsem....

Cheers,

Dave.

PS: there's a double lock bug in range_is_locked().....
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 38+ messages in thread
* Re: [POC][PATCH] xfs: reduce ilock contention on buffered randrw workload 2019-04-16 12:22 ` Dave Chinner @ 2019-04-18 3:10 ` Dave Chinner 2019-04-18 18:21 ` Davidlohr Bueso 0 siblings, 1 reply; 38+ messages in thread From: Dave Chinner @ 2019-04-18 3:10 UTC (permalink / raw) To: Davidlohr Bueso Cc: Jan Kara, Amir Goldstein, Darrick J . Wong, Christoph Hellwig, Matthew Wilcox, linux-xfs, linux-fsdevel

On Tue, Apr 16, 2019 at 10:22:40PM +1000, Dave Chinner wrote:
> On Thu, Apr 11, 2019 at 11:11:17AM +1000, Dave Chinner wrote:
> > On Mon, Apr 08, 2019 at 09:37:09AM -0700, Davidlohr Bueso wrote:
> > > On Mon, 2019-04-08 at 12:33 +0200, Jan Kara wrote:
> > > > On Fri 05-04-19 08:17:30, Dave Chinner wrote:
> > > > > FYI, I'm working on a range lock implementation that should both
> > > > > solve the performance issue and the reader starvation issue at the
> > > > > same time by allowing concurrent buffered reads and writes to
> > > > > different file ranges.
> > > >
> > > > Are you aware of range locks Davidlohr has implemented [1]? It didn't get
> > > > merged because he had no in-tree user at the time (he was more aiming at
> > > > converting mmap_sem which is rather difficult). But the generic lock
> > > > implementation should be well usable.
> > > >
> > > > Added Davidlohr to CC.
.....
> Fio randrw numbers on a single file on a pmem device on a 16p
> machine using 4kB AIO-DIO iodepth 128 w/ fio on 5.1.0-rc3:
>
>		IOPS read/write (direct IO)
> fio processes	rwsem		rangelock
>  1		78k / 78k	75k / 75k
>  2		131k / 131k	123k / 123k
>  4		267k / 267k	183k / 183k
>  8		372k / 372k	177k / 177k
>  16		315k / 315k	135k / 135k
....
> FWIW, I'm not convinced about the scalability of the rb/interval
> tree, to tell you the truth. We got rid of the rbtree in XFS for
> cache indexing because the multi-level pointer chasing was just too
> expensive to do under a spinlock - it's just not a cache efficient
> structure for random index object storage.
Yeah, definitely not convinced an rbtree is the right structure
here. Locking of the tree is the limitation....

> FWIW, I have a basic hack to replace the i_rwsem in XFS with a full
> range read or write lock with my XFS range lock implementation so it
> just behaves like a rwsem at this point. It is not in any way
> optimised at this point. Numbers for same AIO-DIO test are:

Now the stuff I've been working on has the same interface as
Davidlohr's patch, so I can swap and change them without thinking
about it. It's still completely unoptimised, but:

		IOPS read/write (direct IO)
processes	rwsem		DB rangelock	XFS rangelock
 1		78k / 78k	75k / 75k	72k / 72k
 2		131k / 131k	123k / 123k	133k / 133k
 4		267k / 267k	183k / 183k	237k / 237k
 8		372k / 372k	177k / 177k	265k / 265k
 16		315k / 315k	135k / 135k	228k / 228k

It's still substantially faster than the interval tree code.

BTW, if I take away the rwsem serialisation altogether, this
test tops out at just under 500k/500k at 8 threads, and at 16
threads has started dropping off (~440k/440k). So the rwsem is
a scalability limitation at just 8 threads....

/me goes off and thinks more about adding optimistic lock coupling
to the XFS iext btree to get rid of the need for tree-wide
locking altogether

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 38+ messages in thread
* Re: [POC][PATCH] xfs: reduce ilock contention on buffered randrw workload 2019-04-18 3:10 ` Dave Chinner @ 2019-04-18 18:21 ` Davidlohr Bueso 2019-04-20 23:54 ` Dave Chinner 0 siblings, 1 reply; 38+ messages in thread From: Davidlohr Bueso @ 2019-04-18 18:21 UTC (permalink / raw) To: Dave Chinner Cc: Jan Kara, Amir Goldstein, Darrick J . Wong, Christoph Hellwig, Matthew Wilcox, linux-xfs, linux-fsdevel

On Thu, 2019-04-18 at 13:10 +1000, Dave Chinner wrote:
> Now the stuff I've been working on has the same interface as
> Davidlohr's patch, so I can swap and change them without thinking
> about it. It's still completely unoptimised, but:
>
>		IOPS read/write (direct IO)
> processes	rwsem		DB rangelock	XFS rangelock
>  1		78k / 78k	75k / 75k	72k / 72k
>  2		131k / 131k	123k / 123k	133k / 133k
>  4		267k / 267k	183k / 183k	237k / 237k
>  8		372k / 372k	177k / 177k	265k / 265k
>  16		315k / 315k	135k / 135k	228k / 228k
>
> It's still substantially faster than the interval tree code.

In general, another big difference between rangelocks and rwsems
(when comparing them with full ranges) is that the latter will do
writer optimistic spinning, so saving a context switch under the
right scenarios provides major wins for rwsems -- I'm not sure if
this applies to your fio tests, though. And pretty soon readers will
also do this, hence rwsem will become a try-hard-not-to-sleep lock.

One of the reasons why I was hesitant with Btrees was the fact that
insertion requires memory allocation, something I wanted to avoid...
per your numbers, sacrificing tree depth was the wrong choice. Thanks
for sharing these numbers.

> BTW, if I take away the rwsem serialisation altogether, this
> test tops out at just under 500k/500k at 8 threads, and at 16
> threads has started dropping off (~440k/440k). So the rwsem is
> a scalability limitation at just 8 threads....
>
> /me goes off and thinks more about adding optimistic lock coupling
> to the XFS iext btree to get rid of the need for tree-wide
> locking altogether

I was not aware of this code.

Thanks,
Davidlohr

^ permalink raw reply	[flat|nested] 38+ messages in thread
* Re: [POC][PATCH] xfs: reduce ilock contention on buffered randrw workload 2019-04-18 18:21 ` Davidlohr Bueso @ 2019-04-20 23:54 ` Dave Chinner 2019-05-03 4:17 ` Dave Chinner 0 siblings, 1 reply; 38+ messages in thread From: Dave Chinner @ 2019-04-20 23:54 UTC (permalink / raw) To: Davidlohr Bueso Cc: Jan Kara, Amir Goldstein, Darrick J . Wong, Christoph Hellwig, Matthew Wilcox, linux-xfs, linux-fsdevel

On Thu, Apr 18, 2019 at 11:21:34AM -0700, Davidlohr Bueso wrote:
> On Thu, 2019-04-18 at 13:10 +1000, Dave Chinner wrote:
> > Now the stuff I've been working on has the same interface as
> > Davidlohr's patch, so I can swap and change them without thinking
> > about it. It's still completely unoptimised, but:
> >
> >		IOPS read/write (direct IO)
> > processes	rwsem		DB rangelock	XFS rangelock
> >  1		78k / 78k	75k / 75k	72k / 72k
> >  2		131k / 131k	123k / 123k	133k / 133k
> >  4		267k / 267k	183k / 183k	237k / 237k
> >  8		372k / 372k	177k / 177k	265k / 265k
> >  16		315k / 315k	135k / 135k	228k / 228k
> >
> > It's still substantially faster than the interval tree code.
>
> In general another big difference between both rangelock vs rwsems
> (when comparing them with full ranges) is that the latter will do
> writer optimistic spinning, so saving a context switch under the right
> scenarios provides major wins for rwsems -- I'm not sure if this
> applies to your fio tests, though. And pretty soon readers will also do
> this, hence rwsem will become a try-hard-not-to-sleep lock.

Right, the optimistic spin-on-owner behaviour basically causes both
rwsems and mutexes to behave as spin locks under these fio workloads
until all CPUs are spinning at 100% CPU usage before they even start
to behave like a sleeping lock.
Essentially, when I'm using the XFS range locks as "full range only"
rwsem equivalent locks, they end up behaving like rwsems in terms of
spinning on the tree mutex, as the single read-lock range is shared
between all concurrent readers and so has relatively low lookup and
modification cost. If I optimised the record update, it would be
even closer to the rwsem performance, but full range read locks are
going to be rare. Hell, even conflicting lock ranges are going to be
rare in the IO path, because the application is doing something very
wrong if it is doing overlapping concurrent IOs....

> One of the reasons why I was hesitant with Btrees was the fact that
> insertion requires memory allocation, something I wanted to avoid...

Yup, that's an issue, and one of the reasons why I've used a mutex
for the tree lock rather than a spin lock. But in the end, there's
no difference to CPU usage on contention as they both act as spin
locks in these cases...

If it becomes an issue, I'll add a mempool for the tree nodes and
make sure that it is pre-populated sufficiently before starting a
multi-level tree modification.

> per your numbers, sacrificing tree depth was the wrong choice. Thanks
> for sharing these numbers.

Yeah, anything that pointer chases through cachelines and dirties
multiple cachelines on rebalance really doesn't scale for a
fast-lookup structure. For btrees multi-level split/merge hurt, but
for the lookup and simple insert/remove case the cacheline miss cost
is substantially amortised by packing multiple records per
cacheline.

> > BTW, if I take away the rwsem serialisation altogether, this
> > test tops out at just under 500k/500k at 8 threads, and at 16
> > threads has started dropping off (~440k/440k). So the rwsem is
> > a scalability limitation at just 8 threads....
> >
> > /me goes off and thinks more about adding optimistic lock coupling
> > to the XFS iext btree to get rid of the need for tree-wide
> > locking altogether
>
> I was not aware of this code.

It's relatively new, and directly tailored to the needs of caching
the XFS extent tree - it's not really a generic btree in that its
record store format is the XFS on-disk extent record. i.e. it only
stores 54 bits of start offset and 21 bits of length in its 16 byte
records, and the rest of the space is for the record data.

As such, I'm not really implementing a generic range lock - it is
highly tailored to the requirements and constraints of XFS
filesystem IO. e.g. using max(block size, PAGE_SIZE) granularity
range indexing means we can support maximum IO sizes (4GB) with 21
bits of length in the record. And RANGE_LOCK_FULL is implemented as
length = MAXEXTLEN + a hidden state bit so it supports exclusive
locking from (start, RANGE_LOCK_FULL), and things like truncate and
EOF zeroing use this to exclude IO from the range they are working
on correctly....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 38+ messages in thread
* Re: [POC][PATCH] xfs: reduce ilock contention on buffered randrw workload 2019-04-20 23:54 ` Dave Chinner @ 2019-05-03 4:17 ` Dave Chinner 2019-05-03 5:17 ` Dave Chinner 0 siblings, 1 reply; 38+ messages in thread From: Dave Chinner @ 2019-05-03 4:17 UTC (permalink / raw) To: Davidlohr Bueso Cc: Jan Kara, Amir Goldstein, Darrick J . Wong, Christoph Hellwig, Matthew Wilcox, linux-xfs, linux-fsdevel

On Sun, Apr 21, 2019 at 09:54:12AM +1000, Dave Chinner wrote:
> On Thu, Apr 18, 2019 at 11:21:34AM -0700, Davidlohr Bueso wrote:
> > On Thu, 2019-04-18 at 13:10 +1000, Dave Chinner wrote:
> > > Now the stuff I've been working on has the same interface as
> > > Davidlohr's patch, so I can swap and change them without thinking
> > > about it. It's still completely unoptimised, but:
> > >
> > >		IOPS read/write (direct IO)
> > > processes	rwsem		DB rangelock	XFS rangelock
> > >  1		78k / 78k	75k / 75k	72k / 72k
> > >  2		131k / 131k	123k / 123k	133k / 133k
> > >  4		267k / 267k	183k / 183k	237k / 237k
> > >  8		372k / 372k	177k / 177k	265k / 265k
> > >  16		315k / 315k	135k / 135k	228k / 228k
> > >
> > > It's still substantially faster than the interval tree code.
....
> > > /me goes off and thinks more about adding optimistic lock coupling
> > > to the XFS iext btree to get rid of the need for tree-wide
> > > locking altogether
> >
> > I was not aware of this code.
>
> It's relatively new, and directly tailored to the needs of caching
> the XFS extent tree - it's not really a generic btree in that its
> record store format is the XFS on-disk extent record. i.e. it only
> stores 54 bits of start offset and 21 bits of length in its 16 byte
> records, and the rest of the space is for the record data.

So now I have a mostly working OLC btree based on this tree which is
plumbed into xfsprogs userspace and some testing code. I think I can
say now that the code will actually work, and it /should/ scale
better than a rwsem.
The userspace test harness that I have ran a "thread profile" to
indicate scalability. Basically it ran each thread in a different
offset range, locked a hundred ranges, unlocked them, and then looped
over this. The btree is a single level for the first 14 locks, 2-level
for up to 210 locks, and 3-level for up to 3150 locks. Hence most of
this testing results in the btree being 2-3 levels, and so it largely
removes the global root node lock as a point of contention. It's the
"best case" for concurrency for an OLC btree.

On a 16p machine:

		Range lock/unlock ops/s
threads		mutex btree	OLC btree
 1		5239442		 949487
 2		1014466		1398539
 4		 985940		2405275
 8		 733195		3288435
16		 653429		2809225

When looking at these numbers, remember that the mutex btree kernel
range lock performed a lot better than the interval tree range lock,
and was only ~30% down on an rwsem. The mutex btree code shows cache
residency effects for the single-threaded load, hence it looks much
faster than it is for occasional and multithreaded access. However,
at 2 threads (where hot CPU caches don't affect the performance) the
OLC btree is 40% faster, and at 8 threads it is 4.5x faster than the
mutex btree.

The OLC btree starts slowing down at 16 threads, largely because the
tree itself doesn't have enough depth to provide the interior nodes
to scale to higher concurrency levels without contention, but it's
still running at 4.5x faster than the mutex btree....

The best part is when I run worst-case threaded workloads on the OLC
btree. If I run the same 100-lock loops, but this time change the
offsets of each thread so they interleave into adjacent records in
the btree (i.e.
every thread touches every leaf), then the performance is still
pretty damn good:

		Range lock/unlock ops/s
threads		Worst Case	Best Case
 1		1045991		 949487
 2		1530212		1398539
 4		1147099		2405275
 8		1602114		3288435
16		1731890		2809225

IOWs, performance is down and somewhat variable around tree height
changes (4 threads straddles the 2-3 level tree height threshold),
but it's still a massive improvement on the mutex btree and it's not
going backwards as threads are added.

Concept proven.

Next steps are:

- separate the OLC btree from the XFS iext btree implementation.
  It will still have a similar interface (i.e. can't manipulate the
  btree records directly), but there's sufficient difference in
  structure for them to be separate implementations.

- expand records out to full 64-bit extents. The iext tree memory
  usage constraints no longer apply, so the record size can go up a
  little bit.

- work out whether RCU read locking and kfree_rcu() will work with
  the requirement to do memory allocation while holding
  rcu_read_lock(). The alternative is an internal garbage collector
  mechanism, kinda like I've hacked up to simulate kfree_rcu() in
  userspace.

- fix all the little bugs that still exist in the code.

- think about structural optimisations like parent pointers to avoid
  costly path walks to find parents for modifications.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 38+ messages in thread
* Re: [POC][PATCH] xfs: reduce ilock contention on buffered randrw workload
  2019-05-03  4:17             ` Dave Chinner
@ 2019-05-03  5:17               ` Dave Chinner
  0 siblings, 0 replies; 38+ messages in thread
From: Dave Chinner @ 2019-05-03  5:17 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Jan Kara, Amir Goldstein, Darrick J . Wong, Christoph Hellwig,
	Matthew Wilcox, linux-xfs, linux-fsdevel

On Fri, May 03, 2019 at 02:17:27PM +1000, Dave Chinner wrote:
> Concept proven.
>
> Next steps are:
....
> - work out whether RCU read locking and kfree_rcu() will
>   work with the requirement to do memory allocation while
>   holding rcu_read_lock(). Alternative is an internal
>   garbage collector mechanism, kinda like I've hacked up to
>   simulate kfree_rcu() in userspace.

Internal RCU interactions are now solved. Actually very simple in the
end, should be very easy to integrate into the code.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 38+ messages in thread
end of thread, other threads:[~2022-09-21 11:20 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-04-04 16:57 [POC][PATCH] xfs: reduce ilock contention on buffered randrw workload Amir Goldstein
2019-04-04 21:17 ` Dave Chinner
2019-04-05 14:02   ` Amir Goldstein
2019-04-07 23:27     ` Dave Chinner
2019-04-08  9:02       ` Amir Goldstein
2019-04-08 14:11         ` Jan Kara
2019-04-08 17:41           ` Amir Goldstein
2019-04-09  8:26             ` Jan Kara
2022-06-17 14:48               ` Amir Goldstein
2022-06-17 15:11                 ` Jan Kara
2022-06-18  8:38                   ` Amir Goldstein
2022-06-20  9:11                     ` Jan Kara
2022-06-21  7:49                       ` Amir Goldstein
2022-06-21  8:59                         ` Jan Kara
2022-06-21 12:53                           ` Amir Goldstein
2022-06-22  3:23                             ` Matthew Wilcox
2022-06-22  9:00                               ` Amir Goldstein
2022-06-22  9:34                                 ` Jan Kara
2022-06-22 16:26                                   ` Amir Goldstein
2022-09-13 14:40                                     ` Amir Goldstein
2022-09-14 16:01                                       ` Darrick J. Wong
2022-09-14 16:29                                         ` Amir Goldstein
2022-09-14 17:39                                           ` Darrick J. Wong
2022-09-19 23:09                                             ` Dave Chinner
2022-09-20  2:24                                               ` Dave Chinner
2022-09-20  3:08                                                 ` Amir Goldstein
2022-09-21 11:20                                                   ` Amir Goldstein
2019-04-08 11:03         ` Jan Kara
2019-04-22 10:55           ` Boaz Harrosh
2019-04-08 10:33       ` Jan Kara
2019-04-08 16:37 ` Davidlohr Bueso
2019-04-11  1:11   ` Dave Chinner
2019-04-16 12:22     ` Dave Chinner
2019-04-18  3:10       ` Dave Chinner
2019-04-18 18:21         ` Davidlohr Bueso
2019-04-20 23:54           ` Dave Chinner
2019-05-03  4:17             ` Dave Chinner
2019-05-03  5:17               ` Dave Chinner