linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Triggering non-integrity writeback from userspace
@ 2015-10-22 13:15 Andres Freund
  2015-10-24 19:09 ` Jan Kara
  2015-10-24 21:39 ` Dave Chinner
  0 siblings, 2 replies; 10+ messages in thread
From: Andres Freund @ 2015-10-22 13:15 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel, linux-kernel

Hi,

postgres regularly has to checkpoint data to disk to be able to free
data from its journal. We currently use buffered IO and that's not
going to change short term.

In a busy database this checkpointing process can write out a lot of
data. Currently that frequently leads to massive latency spikes
(c.f. 20140326191113.GF9066@alap3.anarazel.de) for other processed doing
IO. These happen either when the kernel starts writeback or when, at the
end of the checkpoint, we issue an fsync() on the datafiles.

One odd issue there is that the kernel tends to do writeback in a very
irregular manner. Even if we write data at a constant rate writeback
very often happens in bulk - not a good idea for preserving
interactivity.

What we're preparing to do now is to regularly issue
sync_file_range(SYNC_FILE_RANGE_WRITE) on a few blocks shortly after
we've written them to to the OS. That way there's not too much dirty
data in the page cache, so writeback won't cause latency spikes, and the
fsync at the end doesn't have to write much if anything.

That improves things a lot.

But I still see latency spikes that shouldn't be there given the amount
of IO. I'm wondering if that is related to the fact that
SYNC_FILE_RANGE_WRITE ends up doing __filemap_fdatawrite_range with
WB_SYNC_ALL specified. Given the the documentation for
SYNC_FILE_RANGE_WRITE I did not expect that:
 * SYNC_FILE_RANGE_WRITE: start writeout of all dirty pages in the range which
 * are not presently under writeout.  This is an asynchronous flush-to-disk
 * operation.  Not suitable for data integrity operations.

If I followed the code correctly - not a sure thing at all - that means
bios are submitted with WRITE_SYNC specified. Not really what's needed
in this case.

Now I think the docs are somewhat clear that SYNC_FILE_RANGE_WRITE isn't
there for data integrity, but it might be that people rely on in
nonetheless. so I'm loathe to suggest changing that. But I do wonder if
there's a way non-integrity writeback triggering could be exposed to
userspace. A new fadvise flags seems like a good way to do that -
POSIX_FADV_DONTNEED actually does non-integrity writeback, but also does
other things, so it's not suitable for us.

Greetings,

Andres Freund

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Triggering non-integrity writeback from userspace
  2015-10-22 13:15 Triggering non-integrity writeback from userspace Andres Freund
@ 2015-10-24 19:09 ` Jan Kara
  2015-10-24 21:39 ` Dave Chinner
  1 sibling, 0 replies; 10+ messages in thread
From: Jan Kara @ 2015-10-24 19:09 UTC (permalink / raw)
  To: Andres Freund; +Cc: linux-mm, linux-fsdevel, linux-kernel

  Hi,

On Thu 22-10-15 15:15:55, Andres Freund wrote:
> postgres regularly has to checkpoint data to disk to be able to free
> data from its journal. We currently use buffered IO and that's not
> going to change short term.
> 
> In a busy database this checkpointing process can write out a lot of
> data. Currently that frequently leads to massive latency spikes
> (c.f. 20140326191113.GF9066@alap3.anarazel.de) for other processed doing
> IO. These happen either when the kernel starts writeback or when, at the
> end of the checkpoint, we issue an fsync() on the datafiles.
> 
> One odd issue there is that the kernel tends to do writeback in a very
> irregular manner. Even if we write data at a constant rate writeback
> very often happens in bulk - not a good idea for preserving
> interactivity.
> 
> What we're preparing to do now is to regularly issue
> sync_file_range(SYNC_FILE_RANGE_WRITE) on a few blocks shortly after
> we've written them to to the OS. That way there's not too much dirty
> data in the page cache, so writeback won't cause latency spikes, and the
> fsync at the end doesn't have to write much if anything.
> 
> That improves things a lot.
> 
> But I still see latency spikes that shouldn't be there given the amount
> of IO. I'm wondering if that is related to the fact that
> SYNC_FILE_RANGE_WRITE ends up doing __filemap_fdatawrite_range with
> WB_SYNC_ALL specified. Given the the documentation for
> SYNC_FILE_RANGE_WRITE I did not expect that:
>  * SYNC_FILE_RANGE_WRITE: start writeout of all dirty pages in the range which
>  * are not presently under writeout.  This is an asynchronous flush-to-disk
>  * operation.  Not suitable for data integrity operations.
> 
> If I followed the code correctly - not a sure thing at all - that means
> bios are submitted with WRITE_SYNC specified. Not really what's needed
> in this case.
> 
> Now I think the docs are somewhat clear that SYNC_FILE_RANGE_WRITE isn't
> there for data integrity, but it might be that people rely on in
> nonetheless. so I'm loathe to suggest changing that. But I do wonder if
> there's a way non-integrity writeback triggering could be exposed to
> userspace. A new fadvise flags seems like a good way to do that -
> POSIX_FADV_DONTNEED actually does non-integrity writeback, but also does
> other things, so it's not suitable for us.

You are absolutely correct that sync_file_range() should issue writeback as
WB_SYNC_NONE and not wait for current writeback in progress. That was an
oversight introduced by commit ee53a891f474 (mm: do_sync_mapping_range
integrity fix) which changed do_sync_mapping_range() to use WB_SYNC_ALL
because it had other users which relied WB_SYNC_ALL semantics. Later that
got copied over to the current sync_file_range() implementation.

I think we should just revert to the very explicitely documented behavior
of sync_file_range(). I'll send a patch for that. Thanks for report.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Triggering non-integrity writeback from userspace
  2015-10-22 13:15 Triggering non-integrity writeback from userspace Andres Freund
  2015-10-24 19:09 ` Jan Kara
@ 2015-10-24 21:39 ` Dave Chinner
  2015-10-28  9:27   ` Andres Freund
  1 sibling, 1 reply; 10+ messages in thread
From: Dave Chinner @ 2015-10-24 21:39 UTC (permalink / raw)
  To: Andres Freund; +Cc: linux-mm, linux-fsdevel, linux-kernel

On Thu, Oct 22, 2015 at 03:15:55PM +0200, Andres Freund wrote:
> Hi,
> 
> postgres regularly has to checkpoint data to disk to be able to free
> data from its journal. We currently use buffered IO and that's not
> going to change short term.
> 
> In a busy database this checkpointing process can write out a lot of
> data. Currently that frequently leads to massive latency spikes
> (c.f. 20140326191113.GF9066@alap3.anarazel.de) for other processed doing
> IO. These happen either when the kernel starts writeback or when, at the
> end of the checkpoint, we issue an fsync() on the datafiles.
> 
> One odd issue there is that the kernel tends to do writeback in a very
> irregular manner. Even if we write data at a constant rate writeback
> very often happens in bulk - not a good idea for preserving
> interactivity.
> 
> What we're preparing to do now is to regularly issue
> sync_file_range(SYNC_FILE_RANGE_WRITE) on a few blocks shortly after
> we've written them to to the OS. That way there's not too much dirty
> data in the page cache, so writeback won't cause latency spikes, and the
> fsync at the end doesn't have to write much if anything.
> 
> That improves things a lot.
> 
> But I still see latency spikes that shouldn't be there given the amount
> of IO. I'm wondering if that is related to the fact that
> SYNC_FILE_RANGE_WRITE ends up doing __filemap_fdatawrite_range with
> WB_SYNC_ALL specified. Given the the documentation for
> SYNC_FILE_RANGE_WRITE I did not expect that:
>  * SYNC_FILE_RANGE_WRITE: start writeout of all dirty pages in the range which
>  * are not presently under writeout.  This is an asynchronous flush-to-disk
>  * operation.  Not suitable for data integrity operations.

WB_SYNC_ALL is simply a method of saying "writeback all dirty pages
and don't skip any". That's part of a data integrity operation, but
it's not what results in data integrity being provided. It may cause
some latencies caused by blocking on locks or in the request queues,
so that's what I'd be looking for.

i.e. if the request queues are full, SYNC_FILE_RANGE_WRITE will
block until all the IO it has been requested to write has been
submitted to the request queues. Put simply: the IO is asynchronous
in that we don't wait for completion, but the IO submission is still
synchronous.

Data integrity operations require related file metadata (e.g. block
allocation trnascations) to be forced to the journal/disk, and a
device cache flush issued to ensure the data is on stable storage.
SYNC_FILE_RANGE_WRITE does neither of these things, and hence while
the IO might be the same pattern as a data integrity operation, it
does not provide such guarantees.

> If I followed the code correctly - not a sure thing at all - that means
> bios are submitted with WRITE_SYNC specified. Not really what's needed
> in this case.

That just allows the IO scheduler to classify them differently to
bulk background writeback. 

> Now I think the docs are somewhat clear that SYNC_FILE_RANGE_WRITE isn't
> there for data integrity, but it might be that people rely on in
> nonetheless. so I'm loathe to suggest changing that. But I do wonder if
> there's a way non-integrity writeback triggering could be exposed to
> userspace. A new fadvise flags seems like a good way to do that -
> POSIX_FADV_DONTNEED actually does non-integrity writeback, but also does
> other things, so it's not suitable for us.

You don't want to do writeback from the syscall, right? i.e. you'd
like to expire the inode behind the fd, and schedule background
writeback to run on it immediately?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Triggering non-integrity writeback from userspace
  2015-10-24 21:39 ` Dave Chinner
@ 2015-10-28  9:27   ` Andres Freund
  2015-10-28 20:48     ` Dave Chinner
  0 siblings, 1 reply; 10+ messages in thread
From: Andres Freund @ 2015-10-28  9:27 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-mm, linux-fsdevel, linux-kernel

Hi,

Thanks for looking into this.

On 2015-10-25 08:39:12 +1100, Dave Chinner wrote:
> WB_SYNC_ALL is simply a method of saying "writeback all dirty pages
> and don't skip any". That's part of a data integrity operation, but
> it's not what results in data integrity being provided. It may cause
> some latencies caused by blocking on locks or in the request queues,
> so that's what I'd be looking for.

It also means we'll wait for more:
int write_cache_pages(struct address_space *mapping,
		      struct writeback_control *wbc, writepage_t writepage,
		      void *data)
{
...
	if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)
		tag = PAGECACHE_TAG_TOWRITE;
	else
		tag = PAGECACHE_TAG_DIRTY;
...
			if (PageWriteback(page)) {
				if (wbc->sync_mode != WB_SYNC_NONE)
					wait_on_page_writeback(page);
				else
					goto continue_unlock;
			}

> i.e. if the request queues are full, SYNC_FILE_RANGE_WRITE will
> block until all the IO it has been requested to write has been
> submitted to the request queues. Put simply: the IO is asynchronous
> in that we don't wait for completion, but the IO submission is still
> synchronous.

That's desirable in our case because there's a limit to how much
outstanding IO there is.

> Data integrity operations require related file metadata (e.g. block
> allocation trnascations) to be forced to the journal/disk, and a
> device cache flush issued to ensure the data is on stable storage.
> SYNC_FILE_RANGE_WRITE does neither of these things, and hence while
> the IO might be the same pattern as a data integrity operation, it
> does not provide such guarantees.

Which is desired here - the actual integrity is still going to be done
via fsync(). The idea of using SYNC_FILE_RANGE_WRITE beforehand is that
the fsync() will only have to do very little work. The language in
sync_file_range(2) doesn't inspire enough confidence for using it as an
actual integrity operation :/

> > If I followed the code correctly - not a sure thing at all - that means
> > bios are submitted with WRITE_SYNC specified. Not really what's needed
> > in this case.
>
> That just allows the IO scheduler to classify them differently to
> bulk background writeback.

It also influences which writes are merged and which are not, at least
if I understand elv_rq_merge_ok() and the callbacks it calls..

> You don't want to do writeback from the syscall, right? i.e. you'd
> like to expire the inode behind the fd, and schedule background
> writeback to run on it immediately?

Yes, that's exactly what we want. Blocking if a process has done too
much writes is fine tho.

Greetings,

Andres Freund

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Triggering non-integrity writeback from userspace
  2015-10-28  9:27   ` Andres Freund
@ 2015-10-28 20:48     ` Dave Chinner
  2015-10-28 23:23       ` Andres Freund
  2015-10-28 23:26       ` Dave Chinner
  0 siblings, 2 replies; 10+ messages in thread
From: Dave Chinner @ 2015-10-28 20:48 UTC (permalink / raw)
  To: Andres Freund; +Cc: linux-mm, linux-fsdevel, linux-kernel

Hi Andres,

On Wed, Oct 28, 2015 at 10:27:52AM +0100, Andres Freund wrote:
> On 2015-10-25 08:39:12 +1100, Dave Chinner wrote:
....
> > Data integrity operations require related file metadata (e.g. block
> > allocation trnascations) to be forced to the journal/disk, and a
> > device cache flush issued to ensure the data is on stable storage.
> > SYNC_FILE_RANGE_WRITE does neither of these things, and hence while
> > the IO might be the same pattern as a data integrity operation, it
> > does not provide such guarantees.
> 
> Which is desired here - the actual integrity is still going to be done
> via fsync().

OK, so you require data integrity, but....

> The idea of using SYNC_FILE_RANGE_WRITE beforehand is that
> the fsync() will only have to do very little work. The language in
> sync_file_range(2) doesn't inspire enough confidence for using it as an
> actual integrity operation :/

So really you're trying to minimise the blocking/latency of fsync()?

> > You don't want to do writeback from the syscall, right? i.e. you'd
> > like to expire the inode behind the fd, and schedule background
> > writeback to run on it immediately?
> 
> Yes, that's exactly what we want. Blocking if a process has done too
> much writes is fine tho.

OK, so it's really the latency of the fsync() operation that is what
you are trying to avoid? I've been meaning to get back to a generic
implementation of an aio fsync operation:

http://oss.sgi.com/archives/xfs/2014-06/msg00214.html

Would that be a better approach to solving your need for a
non-blocking data integrity flush of a file?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Triggering non-integrity writeback from userspace
  2015-10-28 20:48     ` Dave Chinner
@ 2015-10-28 23:23       ` Andres Freund
  2015-10-29  1:54         ` Dave Chinner
  2015-10-28 23:26       ` Dave Chinner
  1 sibling, 1 reply; 10+ messages in thread
From: Andres Freund @ 2015-10-28 23:23 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-mm, linux-fsdevel, linux-kernel

Hi,

On 2015-10-29 07:48:34 +1100, Dave Chinner wrote:
> > The idea of using SYNC_FILE_RANGE_WRITE beforehand is that
> > the fsync() will only have to do very little work. The language in
> > sync_file_range(2) doesn't inspire enough confidence for using it as an
> > actual integrity operation :/
> 
> So really you're trying to minimise the blocking/latency of fsync()?

The blocking/latency of the fsync doesn't actually matter at all *for
this callsite*. It's called from a dedicated background process - if
it's slowed down by a couple seconds it doesn't matter much.
The problem is that if you have a couple gigabytes of dirty data being
fsync()ed at once, latency for concurrent reads and writes often goes
absolutely apeshit. And those concurrent reads and writes might
actually be latency sensitive.

By calling sync_file_range() over small ranges of pages shortly after
they've been written we make it unlikely (but still possible) that much
data has to be flushed at fsync() time.


Should it interesting: The relevant background process is the
"checkpointer" - it writes back all dirty data from postgres' in-memory
shared buffer cache back to disk, then fyncs all files that have been
touched since the last checkpoint (might have independently been
flushed). After that it then can remove the old write-ahead-log/journal.


> > > You don't want to do writeback from the syscall, right? i.e. you'd
> > > like to expire the inode behind the fd, and schedule background
> > > writeback to run on it immediately?
> > 
> > Yes, that's exactly what we want. Blocking if a process has done too
> > much writes is fine tho.
> 
> OK, so it's really the latency of the fsync() operation that is what
> you are trying to avoid? I've been meaning to get back to a generic
> implementation of an aio fsync operation:
> 
> http://oss.sgi.com/archives/xfs/2014-06/msg00214.html
> 
> Would that be a better approach to solving your need for a
> non-blocking data integrity flush of a file?

So an async fsync() isn't that particularly interesting for the
checkpointer/the issue in this thread. But there's another process in
postgres where I could imagine it being useful. We have a "background"
process that regularly flushes the journal to disk. It currently uses
fdatasync() to do so for subsections of a preallocated/reused file. It
tries to sync the sections that in the near future needs to be flushed
to disk because a transaction commits.

I could imagine that it's good for throughput to issue multiple
asynchronous fsyncs in this background process. Might not be good for
latency sensitive workloads tho.

At the moment using fdatasync() instead of fsync() is a considerable
performance advantage... If I understand the above proposal correctly,
it'd allow specifying ranges, is that right?


There'll be some concern about portability around this - issuing
sync_file_range() every now and then isn't particularly invasive. Using
aio might end up being that, not sure.

Greetings,

Andres Freund

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Triggering non-integrity writeback from userspace
  2015-10-28 20:48     ` Dave Chinner
  2015-10-28 23:23       ` Andres Freund
@ 2015-10-28 23:26       ` Dave Chinner
  1 sibling, 0 replies; 10+ messages in thread
From: Dave Chinner @ 2015-10-28 23:26 UTC (permalink / raw)
  To: Andres Freund; +Cc: linux-mm, linux-fsdevel, linux-kernel

On Thu, Oct 29, 2015 at 07:48:34AM +1100, Dave Chinner wrote:
> Hi Andres,
> 
> On Wed, Oct 28, 2015 at 10:27:52AM +0100, Andres Freund wrote:
> > On 2015-10-25 08:39:12 +1100, Dave Chinner wrote:
> ....
> > > Data integrity operations require related file metadata (e.g. block
> > > allocation trnascations) to be forced to the journal/disk, and a
> > > device cache flush issued to ensure the data is on stable storage.
> > > SYNC_FILE_RANGE_WRITE does neither of these things, and hence while
> > > the IO might be the same pattern as a data integrity operation, it
> > > does not provide such guarantees.
> > 
> > Which is desired here - the actual integrity is still going to be done
> > via fsync().
> 
> OK, so you require data integrity, but....
> 
> > The idea of using SYNC_FILE_RANGE_WRITE beforehand is that
> > the fsync() will only have to do very little work. The language in
> > sync_file_range(2) doesn't inspire enough confidence for using it as an
> > actual integrity operation :/
> 
> So really you're trying to minimise the blocking/latency of fsync()?
> 
> > > You don't want to do writeback from the syscall, right? i.e. you'd
> > > like to expire the inode behind the fd, and schedule background
> > > writeback to run on it immediately?
> > 
> > Yes, that's exactly what we want. Blocking if a process has done too
> > much writes is fine tho.
> 
> OK, so it's really the latency of the fsync() operation that is what
> you are trying to avoid? I've been meaning to get back to a generic
> implementation of an aio fsync operation:
> 
> http://oss.sgi.com/archives/xfs/2014-06/msg00214.html
> 
> Would that be a better approach to solving your need for a
> non-blocking data integrity flush of a file?

Which was relatively trivial to do. Numbers below come from XFS, I
smoke tested ext4 and it kinda worked but behaviour was very
unpredictable and maxxed out at about 25000 IOPS with max
performance being at 4 threads @ an average of 20000 files/s...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

[RFC] aio: wire up generic aio_fsync method

From: Dave Chinner <dchinner@redhat.com>

We've had plenty of requests for an asynchronous fsync over the past
few years, and we've got the infrastructure there to do it. But
nobody has wired it up to test it. The common request we get from
userspace storage applications is to do a post-write pass over a set
of files that were just written (i.e. bulk background fsync) for
point-in-time checkpointing or flushing purposes.

So, just to see if I could brute force an effective implementation,
wire up aio_fsync, add a workqueue and push all the fsync calls off
to the workqueue. The workqueue will allow parallel dispatch, switch
execution if a fsync blocks for any reason, etc. Brute force and
very effective....

So, I hacked up fs_mark to enable fsync via the libaio io_fsync()
interface to run some tests. The quick test is:

	- write 10000 4k files into the cache
	- run a post write open-fsync-close pass (sync mode 5)
	- run 5 iterations
	- run a single thread, then 4 threads.

First I ran it on a 500TB sparse filesystem on a SSD.

FSUse%        Count         Size    Files/sec     App Overhead
     0        10000         4096        507.5           184435
     0        20000         4096        527.2           184815
     0        30000         4096        530.4           183798
     0        40000         4096        531.0           189431
     0        50000         4096        554.2           181557

real    1m34.548s
user    0m0.819s
sys     0m10.596s

Runs at around 500 log forces/s resulting in 500 log writes/s
giving a sustained IO load of about 1200 IOPS.

Using io_fsync():

FSUse%        Count         Size    Files/sec     App Overhead
     0        10000         4096       4124.1           151359
     0        20000         4096       5506.4           112704
     0        30000         4096       7347.1            97967
     0        40000         4096       7110.1            97089
     0        50000         4096       7075.3            94942

real    0m8.554s
user    0m0.350s
sys     0m3.684s

Runs at around 7,000 log forces/s, which are mostly aggregated down
to around 700 log writes/s, for a total sustained load of ~8000 IOPS.
The parallel dispatch of fsync operations allows the log to
aggregate them effectively, reducing journal IO by a factor of 10

Run the same workload, 4 threads at a time. Normal fsync:

FSUse%        Count         Size    Files/sec     App Overhead
     0        40000         4096       2156.0           690185
     0        80000         4096       1859.6           693849
     0       120000         4096       1858.8           723889
     0       160000         4096       1848.5           708657
     0       200000         4096       1842.7           736587

Runs at ~2000 log forces/s, resulting in ~1000 log writes/s and
3,000 IOPS. We see the journal writes being aggregated, but nowhere
near the rate of the previous async fsync run.

Using io_fsync():

SUse%        Count         Size    Files/sec     App Overhead
     0        40000         4096      18956.0           633011
     0        80000         4096      18972.1           635786
     0       120000         4096      23719.6           433334
     0       160000         4096      25780.6           403199
     0       200000         4096      24848.7           480086

real    0m9.512s
user    0m1.307s
sys     0m14.844s

Almost perfect scaling! ~24,000 log forces/s resulting in ~700 log
writes/s, so we've not got a 35:1 journal write aggregation
occurring, and so the total sustained IOPS is only ~25000 IOPS.

Just checking to see how far I can push it.

threads		files/s		IOPS		log aggregation
  1		 7000		 8000		 10:1
  4		24000		25000		 35:1
  8		32000		34000		100:1
 16		33000		35000		100:1
 32		30000		35000		 90:1

At 32 threads it's becoming context switch bound and burning
13-14 CPUs. It's pushing 6-800,000 context switches/s, and the
overhead in the blk_mq tag code is killing everything:

-   23.73%    23.73%  [kernel]            [k] _raw_spin_unlock_irqrestore
   - _raw_spin_unlock_irqrestore
      - 64.15% prepare_to_wait
         - 99.35% bt_get
              blk_mq_get_tag
....
      - 14.23% virtio_queue_rq
         - __blk_mq_run_hw_queue
            - blk_mq_run_hw_queue
               - 93.89% blk_mq_insert_requests
                    blk_mq_flush_plug_list
.....
   13.30%    13.30%  [kernel]            [k] _raw_spin_unlock_irq
   - _raw_spin_unlock_irq
      - 69.27% finish_task_switch
         - __schedule
            - 94.53% schedule
               - 68.36% schedule_timeout
                  - 85.22% io_schedule_timeout
                     + 93.52% bt_get
                     + 6.48% bit_wait_io
                  + 14.39% wait_for_completion
      - 15.36% blk_insert_flush
           blk_sq_make_request
           generic_make_request
         - submit_bio
            - 99.24% submit_bio_wait
                 blkdev_issue_flush
                 xfs_blkdev_issue_flush
                 xfs_file_fsync
                 vfs_fsync_range
                 vfs_fsync
                 generic_aio_fsync_work


So, essentiall, close on 30% of the CPU being used (2.5 of 8 CPUs
being spent on this workload) is being spent on lock contention on
the blk mq request and tag wait queues due to the amount of task
switching going on...

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/aio.c | 60 +++++++++++++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 51 insertions(+), 9 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 155f842..19df3ec 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -188,6 +188,19 @@ struct aio_kiocb {
 	struct eventfd_ctx	*ki_eventfd;
 };
 
+/*
+ * Generic async fsync work structure.  If the file does not supply
+ * an ->aio_fsync method but has a ->fsync method, then the f(d)sync request is
+ * passed to the aio_fsync_wq workqueue and is executed there.
+ */
+struct aio_fsync_args {
+	struct work_struct	work;
+	struct kiocb		*iocb;
+	int			datasync;
+};
+
+static struct workqueue_struct *aio_fsync_wq;
+
 /*------ sysctl variables----*/
 static DEFINE_SPINLOCK(aio_nr_lock);
 unsigned long aio_nr;		/* current system wide number of aio requests */
@@ -257,6 +270,10 @@ static int __init aio_setup(void)
 	if (IS_ERR(aio_mnt))
 		panic("Failed to create aio fs mount.");
 
+	aio_fsync_wq = alloc_workqueue("aio-fsync", 0, 0);
+	if (!aio_fsync_wq)
+		panic("Failed to create aio fsync workqueue.");
+
 	kiocb_cachep = KMEM_CACHE(aio_kiocb, SLAB_HWCACHE_ALIGN|SLAB_PANIC);
 	kioctx_cachep = KMEM_CACHE(kioctx,SLAB_HWCACHE_ALIGN|SLAB_PANIC);
 
@@ -1396,6 +1413,32 @@ static int aio_setup_vectored_rw(int rw, char __user *buf, size_t len,
 				len, UIO_FASTIOV, iovec, iter);
 }
 
+static void generic_aio_fsync_work(struct work_struct *work)
+{
+	struct aio_fsync_args *args = container_of(work,
+						   struct aio_fsync_args, work);
+	int error;
+
+	error = vfs_fsync(args->iocb->ki_filp, args->datasync);
+	aio_complete(args->iocb, error, 0);
+	kfree(args);
+}
+
+static int generic_aio_fsync(struct kiocb *iocb, int datasync)
+{
+	struct aio_fsync_args	*args;
+
+	args = kzalloc(sizeof(struct aio_fsync_args), GFP_KERNEL);
+	if (!args)
+		return -ENOMEM;
+
+	INIT_WORK(&args->work, generic_aio_fsync_work);
+	args->iocb = iocb;
+	args->datasync = datasync;
+	queue_work(aio_fsync_wq, &args->work);
+	return -EIOCBQUEUED;
+}
+
 /*
  * aio_run_iocb:
  *	Performs the initial checks and io submission.
@@ -1410,6 +1453,7 @@ static ssize_t aio_run_iocb(struct kiocb *req, unsigned opcode,
 	rw_iter_op *iter_op;
 	struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs;
 	struct iov_iter iter;
+	int datasync = 0;
 
 	switch (opcode) {
 	case IOCB_CMD_PREAD:
@@ -1460,17 +1504,15 @@ rw_common:
 		break;
 
 	case IOCB_CMD_FDSYNC:
-		if (!file->f_op->aio_fsync)
-			return -EINVAL;
-
-		ret = file->f_op->aio_fsync(req, 1);
-		break;
-
+		datasync = 1;
+		/* fall through */
 	case IOCB_CMD_FSYNC:
-		if (!file->f_op->aio_fsync)
+		if (file->f_op->aio_fsync)
+			ret = file->f_op->aio_fsync(req, datasync);
+		else if (file->f_op->fsync)
+			ret = generic_aio_fsync(req, datasync);
+		else
 			return -EINVAL;
-
-		ret = file->f_op->aio_fsync(req, 0);
 		break;
 
 	default:

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: Triggering non-integrity writeback from userspace
  2015-10-28 23:23       ` Andres Freund
@ 2015-10-29  1:54         ` Dave Chinner
  2015-10-29 16:23           ` Andres Freund
  0 siblings, 1 reply; 10+ messages in thread
From: Dave Chinner @ 2015-10-29  1:54 UTC (permalink / raw)
  To: Andres Freund; +Cc: linux-mm, linux-fsdevel, linux-kernel

On Thu, Oct 29, 2015 at 12:23:12AM +0100, Andres Freund wrote:
> Hi,
> 
> On 2015-10-29 07:48:34 +1100, Dave Chinner wrote:
> > > The idea of using SYNC_FILE_RANGE_WRITE beforehand is that
> > > the fsync() will only have to do very little work. The language in
> > > sync_file_range(2) doesn't inspire enough confidence for using it as an
> > > actual integrity operation :/
> > 
> > So really you're trying to minimise the blocking/latency of fsync()?
> 
> The blocking/latency of the fsync doesn't actually matter at all *for
> this callsite*. It's called from a dedicated background process - if
> it's slowed down by a couple seconds it doesn't matter much.
> The problem is that if you have a couple gigabytes of dirty data being
> fsync()ed at once, latency for concurrent reads and writes often goes
> absolutely apeshit. And those concurrent reads and writes might
> actually be latency sensitive.

Right, but my point is with an async fsync/fdatasync you don't need
this background process - you can just trickle out async fdatasync
calls instead of trckling out calls to sync_file_range().

> By calling sync_file_range() over small ranges of pages shortly after
> they've been written we make it unlikely (but still possible) that much
> data has to be flushed at fsync() time.

Right, but you still need the fsync call, whereas with a async fsync
call you don't - when you gather the completion, no further action
needs to be taken on that dirty range.

> At the moment using fdatasync() instead of fsync() is a considerable
> performance advantage... If I understand the above proposal correctly,
> it'd allow specifying ranges, is that right?

Well, the patch I sent doesn't do ranges, but it could easily be
passed in as the iocb has offset/len parameters that are used by
IOCB_CMD_PREAD/PWRITE. io_prep_fsync/io_fsync both memset the iocb
to zero, so if we pass in a non-zero length, we could treat it as a
ranged f(d)sync quite easily.

> There'll be some concern about portability around this - issuing
> sync_file_range() every now and then isn't particularly invasive. Using
> aio might end up being that, not sure.

It's still a non-portable/linux only solution, because it is using
the linux native aio interface, not the glibc one...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Triggering non-integrity writeback from userspace
  2015-10-29  1:54         ` Dave Chinner
@ 2015-10-29 16:23           ` Andres Freund
  2015-10-29 22:10             ` Dave Chinner
  0 siblings, 1 reply; 10+ messages in thread
From: Andres Freund @ 2015-10-29 16:23 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-mm, linux-fsdevel, linux-kernel

On 2015-10-29 12:54:22 +1100, Dave Chinner wrote:
> On Thu, Oct 29, 2015 at 12:23:12AM +0100, Andres Freund wrote:
> > The blocking/latency of the fsync doesn't actually matter at all *for
> > this callsite*. It's called from a dedicated background process - if
> > it's slowed down by a couple seconds it doesn't matter much.
> > The problem is that if you have a couple gigabytes of dirty data being
> > fsync()ed at once, latency for concurrent reads and writes often goes
> > absolutely apeshit. And those concurrent reads and writes might
> > actually be latency sensitive.
> 
> Right, but my point is with an async fsync/fdatasync you don't need
> this background process - you can just trickle out async fdatasync
> calls instead of trckling out calls to sync_file_range().

We don't want to do the checkpointing from normal backends that process
user queries, so there has to be a background process anyway. Depending
on settings we only do the checkpoints in 5 to 60 minutes intervals
(spread over that interval).


> > By calling sync_file_range() over small ranges of pages shortly after
> > they've been written we make it unlikely (but still possible) that much
> > data has to be flushed at fsync() time.
> 
> Right, but you still need the fsync call, whereas with a async fsync
> call you don't - when you gather the completion, no further action
> needs to be taken on that dirty range.

I assume that the actual IOs issued by the async fsync and a plain fsync
would be pretty similar. So the problem that an fsync of large amounts
of dirty data causes latency increases for other issuers of IO wouldn't
be gone, no?


> > At the moment using fdatasync() instead of fsync() is a considerable
> > performance advantage... If I understand the above proposal correctly,
> > it'd allow specifying ranges, is that right?
> 
> Well, the patch I sent doesn't do ranges, but it could easily be
> passed in as the iocb has offset/len parameters that are used by
> IOCB_CMD_PREAD/PWRITE.

That'd be cool. Then we could issue those for asynchronous transaction
commits, and to have more wal writes concurrently in progress by the
background wal writer.



I'll try the patch from 20151028232641.GS8773@dastard and see wether I
can make it be advantageous for throughput (for WAL flushing, not the
checkpointer process).  Wish I had a better storage system, my guess
it'll be more advantageous there. We'll see.


Greetings,

Andres Freund

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Triggering non-integrity writeback from userspace
  2015-10-29 16:23           ` Andres Freund
@ 2015-10-29 22:10             ` Dave Chinner
  0 siblings, 0 replies; 10+ messages in thread
From: Dave Chinner @ 2015-10-29 22:10 UTC (permalink / raw)
  To: Andres Freund; +Cc: linux-mm, linux-fsdevel, linux-kernel

On Thu, Oct 29, 2015 at 05:23:56PM +0100, Andres Freund wrote:
> On 2015-10-29 12:54:22 +1100, Dave Chinner wrote:
> > On Thu, Oct 29, 2015 at 12:23:12AM +0100, Andres Freund wrote:
> > > By calling sync_file_range() over small ranges of pages shortly after
> > > they've been written we make it unlikely (but still possible) that much
> > > data has to be flushed at fsync() time.
> > 
> > Right, but you still need the fsync call, whereas with a async fsync
> > call you don't - when you gather the completion, no further action
> > needs to be taken on that dirty range.
> 
> I assume that the actual IOs issued by the async fsync and a plain fsync
> would be pretty similar. So the problem that an fsync of large amounts
> of dirty data causes latency increases for other issuers of IO wouldn't
> be gone, no?

Yes, they'd be the same if the async operation is not range limited.

> > > At the moment using fdatasync() instead of fsync() is a considerable
> > > performance advantage... If I understand the above proposal correctly,
> > > it'd allow specifying ranges, is that right?
> > 
> > Well, the patch I sent doesn't do ranges, but it could easily be
> > passed in as the iocb has offset/len parameters that are used by
> > IOCB_CMD_PREAD/PWRITE.
> 
> That'd be cool. Then we could issue those for asynchronous transaction
> commits, and to have more wal writes concurrently in progress by the
> background wal writer.

Updated patch that allows ranged aio fsync below. In the
application, do this for a ranged fsync:

	io_prep_fsync(iocb, fd);
	iocb->u.c.offset = offset;	/* start of range */
	iocb->u.c.nbytes = len;		/* size (in bytes) to sync */
	error = io_submit(ctx, 1, &iocb);

> I'll try the patch from 20151028232641.GS8773@dastard and see wether I
> can make it be advantageous for throughput (for WAL flushing, not the
> checkpointer process).  Wish I had a better storage system, my guess
> it'll be more advantageous there. We'll see.

A $100 SATA ssd is all you need to get the IOPS rates in the
thousands for these sorts of tests...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

aio: wire up generic aio_fsync method

From: Dave Chinner <dchinner@redhat.com>

We've had plenty of requests for an asynchronous fsync over the past
few years, and we've got the infrastructure there to do it. But
nobody has wired it up to test it. The common request we get from
userspace storage applications is to do a post-write pass over a set
of files that were just written (i.e. bulk background fsync) for
point-in-time checkpointing or flushing purposes.

So, just to see if I could brute force an effective implementation,
wire up aio_fsync, add a workqueue and push all the fsync calls off
to the workqueue. The workqueue will allow parallel dispatch, switch
execution if a fsync blocks for any reason, etc. Brute force and
very effective....

This also allows us to do ranged f(data)sync calls. the libaio
io_prep_fsync() function zeros the unused sections of the iocb
passed to the kernel, so the offset/byte count in the iocb should
always be zero. Hence if we get a non-zero byte count, we can treat
it as a ranges operation. This allows applications to commit ranges
of files to stable storage, rather than just he entire file. TO do
this, we need to be able to pass the length to ->aio_fsync(), but
this is trivial to change because no subsystem currently implements
this method.

So, I hacked up fs_mark to enable fsync via the libaio io_fsync()
interface to run some tests. The quick test is:

	- write 10000 4k files into the cache
	- run a post write open-fsync-close pass (sync mode 5)
	- run 5 iterations
	- run a single thread, then 4 threads.

First I ran it on a 500TB sparse filesystem on a SSD.

FSUse%        Count         Size    Files/sec     App Overhead
     0        10000         4096        507.5           184435
     0        20000         4096        527.2           184815
     0        30000         4096        530.4           183798
     0        40000         4096        531.0           189431
     0        50000         4096        554.2           181557

real    1m34.548s
user    0m0.819s
sys     0m10.596s

Runs at around 500 log forces/s resulting in 500 log writes/s
giving a sustained IO load of about 1200 IOPS.

Using io_fsync():

FSUse%        Count         Size    Files/sec     App Overhead
     0        10000         4096       4124.1           151359
     0        20000         4096       5506.4           112704
     0        30000         4096       7347.1            97967
     0        40000         4096       7110.1            97089
     0        50000         4096       7075.3            94942

real    0m8.554s
user    0m0.350s
sys     0m3.684s

Runs at around 7,000 log forces/s, which are mostly aggregated down
to around 700 log writes/s, for a total sustained load of ~8000 IOPS.
The parallel dispatch of fsync operations allows the log to
aggregate them effectively, reducing journal IO by a factor of 10

Run the same workload, 4 threads at a time. Normal fsync:

FSUse%        Count         Size    Files/sec     App Overhead
     0        40000         4096       2156.0           690185
     0        80000         4096       1859.6           693849
     0       120000         4096       1858.8           723889
     0       160000         4096       1848.5           708657
     0       200000         4096       1842.7           736587

Runs at ~2000 log forces/s, resulting in ~1000 log writes/s and
3,000 IOPS. We see the journal writes being aggregated, but nowhere
near the rate of the previous async fsync run.

Using io_fsync():

SUse%        Count         Size    Files/sec     App Overhead
     0        40000         4096      18956.0           633011
     0        80000         4096      18972.1           635786
     0       120000         4096      23719.6           433334
     0       160000         4096      25780.6           403199
     0       200000         4096      24848.7           480086

real    0m9.512s
user    0m1.307s
sys     0m14.844s

Almost perfect scaling! ~24,000 log forces/s resulting in ~700 log
writes/s, so we've not got a 35:1 journal write aggregation
occurring, and so the total sustained IOPS is only ~25000 IOPS.

Just checking to see how far I can push it.

threads		files/s		IOPS		log aggregation
  1		 7000		 8000		 10:1
  4		24000		25000		 35:1
  8		32000		34000		100:1
 16		33000		35000		100:1
 32		30000		35000		 90:1

At 32 threads it's becoming context switch bound and burning
13-14 CPUs. It's pushing 6-800,000 context switches/s, and the
overhead in the blk_mq tag code is killing everything:

-   23.73%    23.73%  [kernel]            [k] _raw_spin_unlock_irqrestore
   - _raw_spin_unlock_irqrestore
      - 64.15% prepare_to_wait
         - 99.35% bt_get
              blk_mq_get_tag
....
      - 14.23% virtio_queue_rq
         - __blk_mq_run_hw_queue
            - blk_mq_run_hw_queue
               - 93.89% blk_mq_insert_requests
                    blk_mq_flush_plug_list
.....
   13.30%    13.30%  [kernel]            [k] _raw_spin_unlock_irq
   - _raw_spin_unlock_irq
      - 69.27% finish_task_switch
         - __schedule
            - 94.53% schedule
               - 68.36% schedule_timeout
                  - 85.22% io_schedule_timeout
                     + 93.52% bt_get
                     + 6.48% bit_wait_io
                  + 14.39% wait_for_completion
      - 15.36% blk_insert_flush
           blk_sq_make_request
           generic_make_request
         - submit_bio
            - 99.24% submit_bio_wait
                 blkdev_issue_flush
                 xfs_blkdev_issue_flush
                 xfs_file_fsync
                 vfs_fsync_range
                 vfs_fsync
                 generic_aio_fsync_work


So, essentiall, close on 30% of the CPU being used (2.5 of 8 CPUs
being spent on this workload) is being spent on lock contention on
the blk mq request and tag wait queues due to the amount of task
switching going on...

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/aio.c           | 69 +++++++++++++++++++++++++++++++++++++++++++++++-------
 include/linux/fs.h |  2 +-
 2 files changed, 61 insertions(+), 10 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 155f842..109433b 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -188,6 +188,20 @@ struct aio_kiocb {
 	struct eventfd_ctx	*ki_eventfd;
 };
 
+/*
+ * Generic async fsync work structure.  If the file does not supply
+ * an ->aio_fsync method but has a ->fsync method, then the f(d)sync request is
+ * passed to the aio_fsync_wq workqueue and is executed there.
+ */
+struct aio_fsync_args {
+	struct work_struct	work;
+	struct kiocb		*req;
+	size_t			len;		/* zero for full file fsync */
+	int			datasync;
+};
+
+static struct workqueue_struct *aio_fsync_wq;
+
 /*------ sysctl variables----*/
 static DEFINE_SPINLOCK(aio_nr_lock);
 unsigned long aio_nr;		/* current system wide number of aio requests */
@@ -257,6 +271,10 @@ static int __init aio_setup(void)
 	if (IS_ERR(aio_mnt))
 		panic("Failed to create aio fs mount.");
 
+	aio_fsync_wq = alloc_workqueue("aio-fsync", 0, 0);
+	if (!aio_fsync_wq)
+		panic("Failed to create aio fsync workqueue.");
+
 	kiocb_cachep = KMEM_CACHE(aio_kiocb, SLAB_HWCACHE_ALIGN|SLAB_PANIC);
 	kioctx_cachep = KMEM_CACHE(kioctx,SLAB_HWCACHE_ALIGN|SLAB_PANIC);
 
@@ -1396,6 +1414,40 @@ static int aio_setup_vectored_rw(int rw, char __user *buf, size_t len,
 				len, UIO_FASTIOV, iovec, iter);
 }
 
+static void generic_aio_fsync_work(struct work_struct *work)
+{
+	struct aio_fsync_args *args = container_of(work,
+						   struct aio_fsync_args, work);
+	struct kiocb *req = args->req;
+	int error;
+
+	if (!args->len)
+		error = vfs_fsync(req->ki_filp, args->datasync);
+	else
+		error = vfs_fsync_range(req->ki_filp, req->ki_pos,
+					req->ki_pos + args->len,
+					args->datasync);
+
+	aio_complete(req, error, 0);
+	kfree(args);
+}
+
+static int generic_aio_fsync(struct kiocb *req, size_t len, int datasync)
+{
+	struct aio_fsync_args	*args;
+
+	args = kzalloc(sizeof(struct aio_fsync_args), GFP_KERNEL);
+	if (!args)
+		return -ENOMEM;
+
+	INIT_WORK(&args->work, generic_aio_fsync_work);
+	args->req = req;
+	args->len = len;
+	args->datasync = datasync;
+	queue_work(aio_fsync_wq, &args->work);
+	return -EIOCBQUEUED;
+}
+
 /*
  * aio_run_iocb:
  *	Performs the initial checks and io submission.
@@ -1410,6 +1462,7 @@ static ssize_t aio_run_iocb(struct kiocb *req, unsigned opcode,
 	rw_iter_op *iter_op;
 	struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs;
 	struct iov_iter iter;
+	int datasync = 0;
 
 	switch (opcode) {
 	case IOCB_CMD_PREAD:
@@ -1460,17 +1513,15 @@ rw_common:
 		break;
 
 	case IOCB_CMD_FDSYNC:
-		if (!file->f_op->aio_fsync)
-			return -EINVAL;
-
-		ret = file->f_op->aio_fsync(req, 1);
-		break;
-
+		datasync = 1;
+		/* fall through */
 	case IOCB_CMD_FSYNC:
-		if (!file->f_op->aio_fsync)
+		if (file->f_op->aio_fsync)
+			ret = file->f_op->aio_fsync(req, len, datasync);
+		else if (file->f_op->fsync)
+			ret = generic_aio_fsync(req, len, datasync);
+		else
 			return -EINVAL;
-
-		ret = file->f_op->aio_fsync(req, 0);
 		break;
 
 	default:
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 72d8a84..8a74dfb 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1626,7 +1626,7 @@ struct file_operations {
 	int (*flush) (struct file *, fl_owner_t id);
 	int (*release) (struct inode *, struct file *);
 	int (*fsync) (struct file *, loff_t, loff_t, int datasync);
-	int (*aio_fsync) (struct kiocb *, int datasync);
+	int (*aio_fsync) (struct kiocb *, size_t len, int datasync);
 	int (*fasync) (int, struct file *, int);
 	int (*lock) (struct file *, int, struct file_lock *);
 	ssize_t (*sendpage) (struct file *, struct page *, int, size_t, loff_t *, int);

^ permalink raw reply related	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2015-10-29 22:10 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-10-22 13:15 Triggering non-integrity writeback from userspace Andres Freund
2015-10-24 19:09 ` Jan Kara
2015-10-24 21:39 ` Dave Chinner
2015-10-28  9:27   ` Andres Freund
2015-10-28 20:48     ` Dave Chinner
2015-10-28 23:23       ` Andres Freund
2015-10-29  1:54         ` Dave Chinner
2015-10-29 16:23           ` Andres Freund
2015-10-29 22:10             ` Dave Chinner
2015-10-28 23:26       ` Dave Chinner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).