linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Jeremy Bongio <bongiojp@gmail.com>
To: Ted Tso <tytso@mit.edu>, "Darrick J . Wong" <djwong@kernel.org>,
	Allison Henderson <allison.henderson@oracle.com>
Cc: linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-xfs@vger.kernel.org, Jeremy Bongio <bongiojp@gmail.com>
Subject: [PATCH 0/1] iomap regression for aio dio 4k writes
Date: Wed, 21 Jun 2023 10:29:19 -0700	[thread overview]
Message-ID: <20230621174114.1320834-1-bongiojp@gmail.com> (raw)

Hi Darrick and Allison,

There has been a standing performance regression involving AIO DIO
4k-aligned writes on ext4 backed by a fast local SSD since the switch
to iomap. I think it was originally reported and investigated in this
thread: https://lore.kernel.org/all/87lf7rkffv.fsf@collabora.com/

Short version:
Pre-iomap, for ext4 async direct writes, after the bio is written to disk
the completion function is called directly during the endio stage.

Post-iomap, for direct writes, after the bio is written to disk, the completion
function is deferred to a work queue. This adds latency that impacts
performance most noticeably in very fast SSDs.

Detailed version:
A possible explanation is below, followed by a few questions to figure
out the right way to fix it.

In 4.15, ext4 uses fs/direct-io.c. When an AIO DIO write has completed
in the nvme driver, the interrupt handler for the write request ends
in calling bio_endio() which ends up calling dio_bio_end_aio(). A
different end_io function is used for async and sync io. If there are
no pages mapped in memory for the write operation's inode, then the
completion function for ext4 is called directly. If there are pages
mapped, then they might be dirty and need to be updated and work
is deferred to a work queue.

Here is the relevant 4.15 code:

fs/direct-io.c: dio_bio_end_aio()
if (dio->result)
        defer_completion = dio->defer_completion ||
                           (dio_op == REQ_OP_WRITE &&
                           dio->inode->i_mapping->nrpages);
if (defer_completion) {
        INIT_WORK(&dio->complete_work, dio_aio_complete_work);
        queue_work(dio->inode->i_sb->s_dio_done_wq,
                   &dio->complete_work);
} else {
        dio_complete(dio, 0, DIO_COMPLETE_ASYNC);
}

After ext4 switched to using iomap, the endio function became
iomap_dio_bio_end_io() in fs/iomap/direct-io.c. In iomap the same end io
function is used for both async and sync io. All write requests will
defer io completion to a work queue even if there are no mapped pages
for the inode.

Here is the relevant code in latest kernel (post-iomap) ...

fs/iomap/direct-io.c: iomap_dio_bio_end_io()
if (dio->wait_for_completion) {
          struct task_struct *waiter = dio->submit.waiter;
          WRITE_ONCE(dio->submit.waiter, NULL);
          blk_wake_io_task(waiter);
   } else if (dio->flags & IOMAP_DIO_WRITE) {
         struct inode *inode = file_inode(dio->iocb->ki_filp);

         WRITE_ONCE(dio->iocb->private, NULL);
         INIT_WORK(&dio->aio.work, iomap_dio_complete_work);
         queue_work(inode->i_sb->s_dio_done_wq, &dio->aio.work);
   } else {
         WRITE_ONCE(dio->iocb->private, NULL);
         iomap_dio_complete_work(&dio->aio.work);
}

With the attached patch, I see significantly better performance in 5.10 than 4.15. 5.10 is the latest kernel where I have driver support for an SSD that is fast enough to reproduce the regression. I verified that upstream iomap works the same.

Test results using the reproduction script from the original report
and testing with 4k/8k/12k/16k blocksizes and write-only:
https://people.collabora.com/~krisman/dio/week21/bench.sh

fio benchmark command:
fio --ioengine libaio --size=2G --direct=1 --filename=${MNT}/file --iodepth=64 \
--time_based=1 --thread=1 --overwrite=1 --bs=${BS} --rw=$RW \
--name "`uname -r`-${TYPE}-${RW}-${BS}-${FS}" \
--runtime=100 --output-format=terse >> ${LOG}

For 4.15, with all write completions called in io handler:
4k:  bw=1056MiB/s
8k:  bw=2082MiB/s
12k: bw=2332MiB/s
16k: bw=2453MiB/s

For unmodified 5.10, with all write completions deferred:
4k:  bw=1004MiB/s
8k:  bw=2074MiB/s
12k: bw=2309MiB/s
16k: bw=2465MiB/s

For modified 5.10, with all write completions called in io handler:
4k:  bw=1193MiB/s
8k:  bw=2258MiB/s
12k: bw=2346MiB/s
16k: bw=2446MiB/s

Questions:

Why did iomap from the beginning not make the async/sync io and
mapped/unmapped distinction that fs/direct-io.c did?

Since no issues have been found for ext4 calling completion work
directly in the io handler pre-iomap, it is unlikely that this is
unsafe (sleeping within an io handler callback). However, this may not
be true for all filesystems. Does XFS potentially sleep in its
completion code?

Allison, Ted mentioned you saw a similar problem when doing performance testing for the latest version of Unbreakable Linux. Can you test/verify this patch addresses your performance regression as well?

Jeremy Bongio (1):
  For DIO writes with no mapped pages for inode, skip deferring
    completion.

 fs/iomap/direct-io.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

-- 
2.41.0.185.g7c58973941-goog


             reply	other threads:[~2023-06-21 17:42 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-06-21 17:29 Jeremy Bongio [this message]
2023-06-21 17:29 ` [PATCH 1/1] For DIO writes with no mapped pages for inode, skip deferring completion Jeremy Bongio
2023-06-21 18:55   ` Matthew Wilcox
2023-06-22  0:04   ` Dave Chinner
2023-06-21 23:59 ` [PATCH 0/1] iomap regression for aio dio 4k writes Dave Chinner
2023-06-22  1:55   ` Dave Chinner
2023-06-22  2:55     ` Matthew Wilcox
2023-06-22  4:08       ` Christoph Hellwig
2023-06-22  4:47       ` Dave Chinner
2023-06-23  2:32   ` Theodore Ts'o
2023-06-23  3:02     ` Dave Chinner
2023-06-22 23:22 ` Allison Henderson

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20230621174114.1320834-1-bongiojp@gmail.com \
    --to=bongiojp@gmail.com \
    --cc=allison.henderson@oracle.com \
    --cc=djwong@kernel.org \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-xfs@vger.kernel.org \
    --cc=tytso@mit.edu \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).