All of lore.kernel.org
 help / color / mirror / Atom feed
From: Daniel Phillips <daniel@phunq.net>
To: Pavel Machek <pavel@ucw.cz>
Cc: Howard Chu <hyc@symas.com>,
	Mike Galbraith <umgwanakikbuti@gmail.com>,
	Dave Chinner <david@fromorbit.com>,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	tux3@tux3.org, "Theodore Ts'o" <tytso@mit.edu>,
	OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Subject: Re: xfs: does mkfs.xfs require fancy switches to get decent performance? (was Tux3 Report: How fast can we fsync?)
Date: Mon, 11 May 2015 16:53:10 -0700	[thread overview]
Message-ID: <555140E6.1070409@phunq.net> (raw)
In-Reply-To: <20150511221223.GD4434@amd>

Hi Pavel,

On 05/11/2015 03:12 PM, Pavel Machek wrote:
>>> It is a fact of life that when you change one aspect of an intimately interconnected system,
>>> something else will change as well. You have naive/nonexistent free space management now; when you
>>> design something workable there it is going to impact everything else you've already done. It's an
>>> easy bet that the impact will be negative, the only question is to what degree.
>>
>> You might lose that bet. For example, suppose we do strictly linear allocation
>> each delta, and just leave nice big gaps between the deltas for future
>> expansion. Clearly, we run at similar or identical speed to the current naive
>> strategy until we must start filling in the gaps, and at that point our layout
>> is not any worse than XFS, which started bad and stayed that way.
> 
> Umm, are you sure. If "some areas of disk are faster than others" is
> still true on todays harddrives, the gaps will decrease the
> performance (as you'll "use up" the fast areas more quickly).

That's why I hedged my claim with "similar or identical". The
difference in media speed seems to be a relatively small effect
compared to extra seeks. It seems that XFS puts big spaces between
new directories, and suffers a lot of extra seeks because of it.
I propose to batch new directories together initially, then change
the allocation goal to a new, relatively empty area if a big batch
of files lands on a directory in a crowded region. The "big" gaps
would be on the order of delta size, so not really very big.

Anyway, some people seem to have pounced on the words "naive" and
"linear allocation" and jumped to the conclusion that our whole
strategy is naive. Far from it. We don't just throw files randomly
at the disk. We sort and partition files and metadata, and we
carefully arrange the order of our allocation operations so that
linear allocation produces a nice layout for both read and write.

This turned out to be so much better than fiddling with the goal
of individual allocations that we concluded we would get best
results by sticking with linear allocation, but improve our sort
step. The new plan is to partition updates into batches according
to some affinity metrics, and set the linear allocation goal per
batch. So for example, big files and append-type files can get
special treatment in separate batches, while files that seem to
be related because of having the same directory parent and being
written in the same delta will continue to be streamed out using
"naive" linear allocation, which is not necessarily as naive as
one might think.

It will take time and a lot of performance testing to get this
right, but nobody should get the idea that it is any inherent
design limitation. The opposite is true: we have no restrictions
at all in media layout.

Compared to Ext4, we do need to address the issue that data moves
around when updated. This can cause rapid fragmentation. Btrfs has
shown issues with that for big, randomly updated files. We want to
fix it without falling back on update-in-place as Btrfs does.

Actually, Tux3 already has update-in-place, and unlike Btrfs, we
can switch to it for non-empty files. But we think that perfect data
isolation per delta is something worth fighting for, and we would
rather not force users to fiddle around with mode settings just to
make something work as well as it already does on Ext4. We will
tackle this issue by partitioning as above, and use a dedicated
allocation strategy for such files, which are easy to detect.

Metadata moving around per update does not seem to be a problem
because it is all single blocks that need very little slack space
to stay close to home.

> Anyway... you have brand new filesystem. Of course it should be
> faster/better/nicer than the existing filesystems. So don't be too
> harsh with XFS people.

They have done a lot of good work, but they still have a long way
to go. I don't see any shame in that.

Regards,

Daniel

WARNING: multiple messages have this Message-ID (diff)
From: Daniel Phillips <daniel@phunq.net>
To: Pavel Machek <pavel@ucw.cz>
Cc: Theodore Ts'o <tytso@mit.edu>, Howard Chu <hyc@symas.com>,
	Dave Chinner <david@fromorbit.com>,
	linux-kernel@vger.kernel.org,
	Mike Galbraith <umgwanakikbuti@gmail.com>,
	tux3@tux3.org, linux-fsdevel@vger.kernel.org,
	OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Subject: Re: xfs: does mkfs.xfs require fancy switches to get decent performance? (was Tux3 Report: How fast can we fsync?)
Date: Mon, 11 May 2015 16:53:10 -0700	[thread overview]
Message-ID: <555140E6.1070409@phunq.net> (raw)
In-Reply-To: <20150511221223.GD4434@amd>

Hi Pavel,

On 05/11/2015 03:12 PM, Pavel Machek wrote:
>>> It is a fact of life that when you change one aspect of an intimately interconnected system,
>>> something else will change as well. You have naive/nonexistent free space management now; when you
>>> design something workable there it is going to impact everything else you've already done. It's an
>>> easy bet that the impact will be negative, the only question is to what degree.
>>
>> You might lose that bet. For example, suppose we do strictly linear allocation
>> each delta, and just leave nice big gaps between the deltas for future
>> expansion. Clearly, we run at similar or identical speed to the current naive
>> strategy until we must start filling in the gaps, and at that point our layout
>> is not any worse than XFS, which started bad and stayed that way.
> 
> Umm, are you sure. If "some areas of disk are faster than others" is
> still true on todays harddrives, the gaps will decrease the
> performance (as you'll "use up" the fast areas more quickly).

That's why I hedged my claim with "similar or identical". The
difference in media speed seems to be a relatively small effect
compared to extra seeks. It seems that XFS puts big spaces between
new directories, and suffers a lot of extra seeks because of it.
I propose to batch new directories together initially, then change
the allocation goal to a new, relatively empty area if a big batch
of files lands on a directory in a crowded region. The "big" gaps
would be on the order of delta size, so not really very big.

Anyway, some people seem to have pounced on the words "naive" and
"linear allocation" and jumped to the conclusion that our whole
strategy is naive. Far from it. We don't just throw files randomly
at the disk. We sort and partition files and metadata, and we
carefully arrange the order of our allocation operations so that
linear allocation produces a nice layout for both read and write.

This turned out to be so much better than fiddling with the goal
of individual allocations that we concluded we would get best
results by sticking with linear allocation, but improve our sort
step. The new plan is to partition updates into batches according
to some affinity metrics, and set the linear allocation goal per
batch. So for example, big files and append-type files can get
special treatment in separate batches, while files that seem to
be related because of having the same directory parent and being
written in the same delta will continue to be streamed out using
"naive" linear allocation, which is not necessarily as naive as
one might think.

It will take time and a lot of performance testing to get this
right, but nobody should get the idea that it is any inherent
design limitation. The opposite is true: we have no restrictions
at all in media layout.

Compared to Ext4, we do need to address the issue that data moves
around when updated. This can cause rapid fragmentation. Btrfs has
shown issues with that for big, randomly updated files. We want to
fix it without falling back on update-in-place as Btrfs does.

Actually, Tux3 already has update-in-place, and unlike Btrfs, we
can switch to it for non-empty files. But we think that perfect data
isolation per delta is something worth fighting for, and we would
rather not force users to fiddle around with mode settings just to
make something work as well as it already does on Ext4. We will
tackle this issue by partitioning as above, and use a dedicated
allocation strategy for such files, which are easy to detect.

Metadata moving around per update does not seem to be a problem
because it is all single blocks that need very little slack space
to stay close to home.

> Anyway... you have brand new filesystem. Of course it should be
> faster/better/nicer than the existing filesystems. So don't be too
> harsh with XFS people.

They have done a lot of good work, but they still have a long way
to go. I don't see any shame in that.

Regards,

Daniel

  parent reply	other threads:[~2015-05-11 23:53 UTC|newest]

Thread overview: 211+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-04-28 23:13 Tux3 Report: How fast can we fsync? Daniel Phillips
2015-04-28 23:13 ` Daniel Phillips
2015-04-29  2:21 ` Mike Galbraith
2015-04-29  6:01   ` Daniel Phillips
2015-04-29  6:01     ` Daniel Phillips
2015-04-29  6:20     ` Richard Weinberger
2015-04-29  6:56       ` Daniel Phillips
2015-04-29  6:56         ` Daniel Phillips
2015-04-29  6:33     ` Mike Galbraith
2015-04-29  7:23       ` Daniel Phillips
2015-04-29  7:23         ` Daniel Phillips
2015-04-29 16:42         ` Mike Galbraith
2015-04-29 19:05           ` xfs: does mkfs.xfs require fancy switches to get decent performance? (was Tux3 Report: How fast can we fsync?) Mike Galbraith
2015-04-29 19:20             ` Austin S Hemmelgarn
2015-04-29 21:12             ` Daniel Phillips
2015-04-30  4:40               ` Mike Galbraith
2015-04-30  0:20             ` Dave Chinner
2015-04-30  3:35               ` Mike Galbraith
2015-04-30  9:00               ` Martin Steigerwald
2015-04-30  9:00                 ` Martin Steigerwald
2015-04-30 14:57                 ` Theodore Ts'o
2015-04-30 15:59                   ` Daniel Phillips
2015-04-30 17:59                   ` Martin Steigerwald
2015-04-30 11:14               ` Daniel Phillips
2015-04-30 12:07                 ` Mike Galbraith
2015-04-30 12:58                   ` Daniel Phillips
2015-04-30 12:58                     ` Daniel Phillips
2015-04-30 13:48                     ` Mike Galbraith
2015-04-30 14:07                       ` Daniel Phillips
2015-04-30 14:28                         ` Howard Chu
2015-04-30 14:28                           ` Howard Chu
2015-04-30 15:14                           ` Daniel Phillips
2015-04-30 16:00                             ` Howard Chu
2015-04-30 18:22                             ` Christian Stroetmann
2015-05-11 22:12                             ` Pavel Machek
2015-05-11 23:17                               ` Theodore Ts'o
2015-05-12  2:34                                 ` Daniel Phillips
2015-05-12  5:38                                   ` Dave Chinner
2015-05-12  6:18                                     ` Daniel Phillips
2015-05-12  6:18                                       ` Daniel Phillips
2015-05-12 18:39                                       ` David Lang
2015-05-12 20:54                                         ` Daniel Phillips
2015-05-12 21:30                                           ` David Lang
2015-05-12 22:27                                             ` Daniel Phillips
2015-05-12 22:35                                               ` David Lang
2015-05-12 23:55                                                 ` Theodore Ts'o
2015-05-13  1:26                                                 ` Daniel Phillips
2015-05-13 19:09                                                   ` Martin Steigerwald
2015-05-13 19:37                                                     ` Daniel Phillips
2015-05-13 20:02                                                       ` Jeremy Allison
2015-05-13 20:02                                                         ` Jeremy Allison
2015-05-13 20:24                                                         ` Daniel Phillips
2015-05-13 20:25                                                       ` Martin Steigerwald
2015-05-13 20:38                                                         ` Daniel Phillips
2015-05-13 21:10                                                           ` Martin Steigerwald
2015-05-13  0:31                                             ` Daniel Phillips
2015-05-12 21:30                                           ` Christian Stroetmann
2015-05-13  7:20                                           ` Pavel Machek
2015-05-13 13:47                                             ` Elifarley Callado Coelho Cruz
2015-05-12  9:03                                   ` Pavel Machek
2015-05-12  9:03                                     ` Pavel Machek
2015-05-12 11:22                                     ` Daniel Phillips
2015-05-12 13:26                                       ` Howard Chu
2015-05-11 23:53                               ` Daniel Phillips [this message]
2015-05-11 23:53                                 ` Daniel Phillips
2015-05-12  0:12                                 ` David Lang
2015-05-12  4:36                                   ` Daniel Phillips
2015-05-12 17:30                                     ` Christian Stroetmann
2015-05-13  7:25                                 ` Pavel Machek
2015-05-13 11:31                                   ` Daniel Phillips
2015-05-13 12:41                                     ` Daniel Phillips
2015-05-13 13:08                                     ` Mike Galbraith
2015-05-13 13:15                                       ` Daniel Phillips
2015-04-30 14:33                         ` Mike Galbraith
2015-04-30 15:24                           ` Daniel Phillips
2015-04-30 15:24                             ` Daniel Phillips
2015-04-29 20:40           ` Tux3 Report: How fast can we fsync? Daniel Phillips
2015-04-29 20:40             ` Daniel Phillips
2015-04-29 22:06             ` OGAWA Hirofumi
2015-04-29 22:06               ` OGAWA Hirofumi
2015-04-30  3:57               ` Mike Galbraith
2015-04-30  3:50             ` Mike Galbraith
2015-04-30 10:59               ` Daniel Phillips
2015-04-30  1:46 ` Dave Chinner
2015-04-30 10:28   ` Daniel Phillips
2015-04-30 10:28     ` Daniel Phillips
     [not found]     ` <55420EAC.5040900@suse.com>
2015-04-30 11:36       ` Daniel Phillips
2015-04-30 13:19         ` Filipe David Manana
2015-04-30 13:25           ` Daniel Phillips
2015-05-01 15:38     ` Dave Chinner
2015-05-01 23:20       ` Daniel Phillips
2015-05-02  1:07         ` David Lang
2015-05-02 10:26           ` Daniel Phillips
2015-05-02 16:00             ` Christian Stroetmann
2015-05-02 16:30               ` Richard Weinberger
2015-05-02 17:00                 ` Christian Stroetmann
2015-05-12 17:41 ` Daniel Phillips
2015-05-12 17:46 ` Tux3 Report: How fast can we fail? Daniel Phillips
2015-05-13 22:07   ` Daniel Phillips
2015-05-26 10:03   ` Pavel Machek
2015-05-26 10:03     ` Pavel Machek
2015-05-27  6:41     ` Mosis Tembo
2015-05-27 18:28       ` Daniel Phillips
2015-05-27 18:28         ` Daniel Phillips
2015-05-27 21:39         ` Pavel Machek
2015-05-27 22:46           ` Daniel Phillips
2015-05-28 12:55             ` Austin S Hemmelgarn
2015-05-27  7:37     ` Mosis Tembo
2015-05-27 14:04       ` Austin S Hemmelgarn
2015-05-27 15:21         ` Mosis Tembo
2015-05-27 15:37           ` Austin S Hemmelgarn
2015-05-14  7:37 ` [WIP] tux3: Optimized fsync Daniel Phillips
2015-05-14  8:26 ` [FYI] tux3: Core changes Daniel Phillips
2015-05-14 12:59   ` Rik van Riel
2015-05-15  0:06     ` Daniel Phillips
2015-05-15  0:06       ` Daniel Phillips
2015-05-15  3:06       ` Rik van Riel
2015-05-15  8:09         ` Mel Gorman
2015-05-15  9:54           ` Daniel Phillips
2015-05-15  9:54             ` Daniel Phillips
2015-05-15 11:00             ` Mel Gorman
2015-05-16 22:38               ` David Lang
2015-05-18 12:57                 ` Mel Gorman
2015-05-18 12:57                   ` Mel Gorman
2015-05-15  9:38         ` Daniel Phillips
2015-05-15  9:38           ` Daniel Phillips
2015-05-27  7:41           ` Pavel Machek
2015-05-27 18:09             ` Daniel Phillips
2015-05-27 18:09               ` Daniel Phillips
2015-05-27 21:37               ` Pavel Machek
2015-05-27 22:33                 ` Daniel Phillips
2015-05-15  8:05       ` Mel Gorman
2015-05-17 13:26     ` Boaz Harrosh
2015-05-18  2:20       ` Rik van Riel
2015-05-18  7:58         ` Boaz Harrosh
2015-05-19  4:46         ` Daniel Phillips
2015-05-21 19:43     ` [WIP][PATCH] tux3: preliminatry nospace handling Daniel Phillips
2015-05-19 14:00   ` [FYI] tux3: Core changes Jan Kara
2015-05-19 19:18     ` Daniel Phillips
2015-05-19 20:33       ` David Lang
2015-05-19 20:33         ` David Lang
2015-05-20 14:44         ` Jan Kara
2015-05-20 16:22           ` Daniel Phillips
2015-05-20 18:01             ` David Lang
2015-05-20 18:01               ` David Lang
2015-05-20 19:53             ` Rik van Riel
2015-05-20 19:53               ` Rik van Riel
2015-05-20 22:51               ` Daniel Phillips
2015-05-20 22:51                 ` Daniel Phillips
2015-05-21  3:24                 ` Daniel Phillips
2015-05-21  3:51                   ` David Lang
2015-05-21 19:53                     ` Daniel Phillips
2015-05-21 19:53                       ` Daniel Phillips
2015-05-26  4:25                       ` Rik van Riel
2015-05-26  4:25                         ` Rik van Riel
2015-05-26  4:30                         ` Daniel Phillips
2015-05-26  4:30                           ` Daniel Phillips
2015-05-26  6:04                           ` David Lang
2015-05-26  6:04                             ` David Lang
2015-05-26  6:11                             ` Daniel Phillips
2015-05-26  6:13                               ` David Lang
2015-05-26  6:13                                 ` David Lang
2015-05-26  8:09                                 ` Daniel Phillips
2015-05-26  8:09                                   ` Daniel Phillips
2015-05-26 10:13                                   ` Pavel Machek
2015-05-26 10:13                                     ` Pavel Machek
2015-05-26  7:09                               ` Jan Kara
2015-05-26  8:08                                 ` Daniel Phillips
2015-05-26  8:08                                   ` Daniel Phillips
2015-05-26  9:00                                   ` Jan Kara
2015-05-26  9:00                                     ` Jan Kara
2015-05-26 20:22                                     ` Daniel Phillips
2015-05-26 21:36                                       ` Rik van Riel
2015-05-26 21:49                                         ` Daniel Phillips
2015-05-26 21:49                                           ` Daniel Phillips
2015-05-27  8:41                                       ` Jan Kara
2015-06-21 15:36                                         ` OGAWA Hirofumi
2015-06-21 15:36                                           ` OGAWA Hirofumi
2015-06-23 16:12                                           ` Jan Kara
2015-07-05 12:54                                             ` OGAWA Hirofumi
2015-07-05 12:54                                               ` OGAWA Hirofumi
2015-07-09 16:05                                               ` Jan Kara
2015-07-09 16:05                                                 ` Jan Kara
2015-07-31  4:44                                                 ` OGAWA Hirofumi
2015-07-31 15:37                                                   ` Raymond Jennings
2015-07-31 17:27                                                     ` Daniel Phillips
2015-07-31 17:27                                                       ` Daniel Phillips
2015-07-31 18:29                                                       ` David Lang
2015-07-31 18:29                                                         ` David Lang
2015-07-31 18:43                                                         ` Daniel Phillips
2015-07-31 18:43                                                           ` Daniel Phillips
2015-07-31 22:12                                                         ` Daniel Phillips
2015-07-31 22:12                                                           ` Daniel Phillips
2015-07-31 22:27                                                           ` David Lang
2015-08-01  0:00                                                             ` Daniel Phillips
2015-08-01  0:00                                                               ` Daniel Phillips
2015-08-01  0:16                                                               ` Daniel Phillips
2015-08-01  0:16                                                                 ` Daniel Phillips
2015-08-03 13:07                                                                 ` Jan Kara
2015-08-01 10:55                                                             ` Elifarley Callado Coelho Cruz
2015-08-18 16:39                                                       ` Rik van Riel
2015-08-03 13:42                                                   ` Jan Kara
2015-08-03 13:42                                                     ` Jan Kara
2015-08-09 13:42                                                     ` OGAWA Hirofumi
2015-08-10 12:45                                                       ` Jan Kara
2015-08-10 12:45                                                         ` Jan Kara
2015-08-16 19:42                                                         ` OGAWA Hirofumi
2015-05-26 10:22                                   ` Sergey Senozhatsky
2015-05-26 12:33                                     ` Jan Kara
2015-05-26 12:33                                       ` Jan Kara
2015-05-26 19:18                                     ` Daniel Phillips

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=555140E6.1070409@phunq.net \
    --to=daniel@phunq.net \
    --cc=david@fromorbit.com \
    --cc=hirofumi@mail.parknet.co.jp \
    --cc=hyc@symas.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=pavel@ucw.cz \
    --cc=tux3@tux3.org \
    --cc=tytso@mit.edu \
    --cc=umgwanakikbuti@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.