linux-ext4.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: "Theodore Y. Ts'o" <tytso@mit.edu>
To: "Xiaohui1 Li 李晓辉" <lixiaohui1@xiaomi.com>
Cc: "linux-ext4@vger.kernel.org" <linux-ext4@vger.kernel.org>,
	"harshadshirwadkar@gmail.com" <harshadshirwadkar@gmail.com>
Subject: Re: 答复: [External Mail]Re: [PATCH v3 09/13] ext4: fast-commit commit path changes
Date: Wed, 30 Oct 2019 10:26:28 -0400	[thread overview]
Message-ID: <20191030142628.GA16197@mit.edu> (raw)
In-Reply-To: <1572409673853.43507@xiaomi.com>

On Wed, Oct 30, 2019 at 04:28:42AM +0000, Xiaohui1 Li 李晓辉 wrote:
> the problem of file' data wating in jbd2 order mode is also a
> serious problem which case a long-latency fsync call.

Yes, this is a separate problem, although note that if the file with a
large amount of data is the file which is being fsync'ed, you have to
write it out at fsync time no matter what.

You could try to write out dirty data earlier (e.g., by decreasing the
30 second writeback window), but there are tradeoffs.  For one thing,
if the file ends up being deleted anyway, it's better not to write out
the data at all.  For another, if we know how big the file is at the
time when we do the writeout, we can do a better job allocating space
for the file, and it improves the file layout by making it more likely
it will be contiguous, or at least mostly contiguous.

Also, files that tend to be fsync'ed a lot tend to be database files
(e.g., SQLite files), and they tend to write small amounts of data and
then fsync them.  So the problem described below happens when there
are unrelated files that happen to be downloaded in parallel.  An
example of this in the Android case mgiht be when the user is
downloading a large video file, such as a movie, to be watched offline
later (such as when they are on a plane).

> as pointed out in this iJournaling paper, when three conditions turn up at the same time,
> 1: order mode must be applied, not the writeback mode.
> 2: The delayed block allocation technique of ext4 must be  applied.
> 3: backgroud buffer writes are too many.

(1) and (2) are the default.  (3) may or may not be a frequent
occurrence, depending on the workload.  In practice though, users
aren't downloading large files all *that* often.

> we have no choice as the order mode need to do this work, so the
> waiting inode-data-flushed-disk time is too long in some extreme
> conditions.  so it cause the appearance of long-latency fsync call.
> 
> thank you for your reply, i will try to fix this problem in my free time.

So there is a solution; it's just a bit tricky to do, and it's not
been a huge enough deal that anyone has allocated time to fix it.

The idea is to allocate space, but not actually update the metadata
blocks at the time when the data blocks are allocated.  Instead, we
reserve them so they won't get allocated for use by another file, and
we note where they are in the extent status cache.  We then issue the
writes of the data block, and only after they are complete, only
*then* do we update the metadata blocks (which then gets updated via
the journal, using either a commit or a fast commit).

This is similar to the dioread_nolock case, where we update the
metadata blocks first, but mark them as unwritten, then we let the
data blocks get written, and only finally do we update the metadata
blocks so they are marked as written (e.g., initialized).  This avoids
the stale data problem as well, but we end up modifying the metadata
blocks twice, and it has resulted other performance problems since in
increases overhead on the i_data_sem lock.  See for example some of
the posts by Liu Bo from Alibaba last year:

If we can allocate space, write the data blocks, and only *then*
update the extent tree metadata blocks, it solves a lot of problems.
We can get rid of the dioread_nolock option; we can get rid of the
data=ordered vs data=writeback distinction; and we can avoid the need
to force data blocks to be written out at commit time.  So it improves
performance, and it will reduce code complexity, making it a win-win
approach.

The problem is that this means significantly changing how we do block
allocation and block reservation, so it's a fairly large and invasive
set of changes.  But it's the right long-term direction, and we'll get
there eventually.

Cheers,

						- Ted

  reply	other threads:[~2019-10-30 14:26 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <1571900042725.99617@xiaomi.com>
2019-10-24 20:18 ` [PATCH v3 09/13] ext4: fast-commit commit path changes Theodore Y. Ts'o
     [not found]   ` <1572349386604.43878@xiaomi.com>
2019-10-29 21:35     ` Theodore Y. Ts'o
2019-10-30  4:28       ` 答复: [External Mail]Re: " Xiaohui1 Li 李晓辉
2019-10-30 14:26         ` Theodore Y. Ts'o [this message]
2019-11-04  1:01           ` xiaohui li
2019-11-04  3:22             ` Theodore Y. Ts'o
2019-11-08  7:58               ` xiaohui li

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20191030142628.GA16197@mit.edu \
    --to=tytso@mit.edu \
    --cc=harshadshirwadkar@gmail.com \
    --cc=linux-ext4@vger.kernel.org \
    --cc=lixiaohui1@xiaomi.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).