linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Dave Chinner <david@fromorbit.com>
To: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Ric Wheeler <rwheeler@redhat.com>,
	Andy Lutomirski <luto@amacapital.net>,
	One Thousand Gnomes <gnomes@lxorguk.ukuu.org.uk>,
	Gregory Farnum <greg@gregs42.com>,
	"Martin K. Petersen" <martin.petersen@oracle.com>,
	Christoph Hellwig <hch@infradead.org>,
	Jens Axboe <axboe@kernel.dk>,
	Andrew Morton <akpm@linux-foundation.org>,
	Linux API <linux-api@vger.kernel.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	shane.seymour@hpe.com, Bruce Fields <bfields@fieldses.org>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	Jeff Layton <jlayton@poochiereds.net>,
	Eric Sandeen <esandeen@redhat.com>
Subject: Re: [PATCH 2/2] block: create ioctl to discard-or-zeroout a range of blocks
Date: Thu, 17 Mar 2016 12:01:16 +1100	[thread overview]
Message-ID: <20160317010116.GK30721@dastard> (raw)
In-Reply-To: <20160316015139.GC5826@birch.djwong.org>

On Tue, Mar 15, 2016 at 06:51:39PM -0700, Darrick J. Wong wrote:
> On Tue, Mar 15, 2016 at 06:52:24PM -0400, Theodore Ts'o wrote:
> > On Wed, Mar 16, 2016 at 09:33:13AM +1100, Dave Chinner wrote:
> > > 
> > > Stale data escaping containment is a security issue. Enabling
> > > generic kernel mechanisms to *enable containment escape* is
> > > fundamentally wrong, and relying on userspace to Do The Right Thing
> > > is even more of a gamble, IMO.
> > 
> > We already have generic kernel mechanisms such as "the block device".   P
> > 
> > > It's a practical concern because if we enable this functionality in
> > > fallocate because it will get used by more than just special storage
> > > apps. i.e. this can't be waved away with "only properly managed
> > > applications will use it" arguments.
> > 
> > It requires a mount option.  How is this going to allow random
> > applications to use this feature, again?
> > 
> > > I also don't make a habit of publicising the fact that since we
> > > disabled the "-d unwritten=X" mkfs parameter (because of speed racer
> > > blogs such as the above and configuration cargo-culting resulting in
> > > unsuspecting users exposing stale data unintentionally) that the
> > > functionality still exists in the kernel code and that it only takes
> > > a single xfs_db command to turn off unwritten extents in XFS. i.e.
> > > we can easily make fallocate on XFS expose stale data, filesystem
> > > wide, without requiring mount options, kernel or application
> > > modifications.
> > 
> > So you have something even more dangerous in XFS and it's in the
> > kernel tree?  Has Red Hat threatened to have a distro-specific patch
> 
> xfs_db is the XFS debugger, so you can only enable that bit of functionality
> with magical commands, which IMHO isn't much different than people messing
> with their ext4 filesystems with debugfs.  You had better know what you're
> doing and if you break the filesystem you can eat both pieces. :P
> 
> > to comment out this code to make sure irresponsible users can't use
> > it?  What I've been suggesting has even more controls that what you
> > have.  And I've been keeping it as an out-of-tree kernel patch mainly
> > because you've been arguing that it's such a horrible thing.
> 
> One could lock it down even more -- hide it behind a Kconfig option that
> depends on CONFIG_EXPERT=y and itself defaults to n, require a mount option,
> only allow the file owner to call no-hide-stale and only if the file is 0600
> (or the appropriate group equivalents like Ted's existing patch), and upon
> adding stale extents, set an inode flag that locks uid/gid/mode/flags.
> Obviously root can still get to the file, but at least there's hard evidence
> that one is entering the twilight zone.
> 
> > > Making Google's hack more widely available through the fallocate
> > > API is entirely dependent on proving that:
> > 
> > Ceph is about to completely bypass the file system because of your
> > intransigence, and reimplement a userspace file system.  They seem to
> > believe it's necessary.  I'll let them make the case, because they
> > seem to think it's necessary.  And if not, if Linus sides with you,
> > and doesn't want to take the patch, I'll just keep it as a
> > Google-specific out-of-tree patch.  I don't *need* to have this thing
> > upstream.
> 
> Frankly, I second Eric Sandeen's comments -- just how bad is ext4's
> unwritten extent conversion for these folks?
> 
> I ran this crappy looping microbenchmark against a ramdisk:
> 	fallocate 400M
> 	write 400M
> 	fsync
> 	rewrite the 400M
> 	fsync
> on kernel 4.5.
> 
> For writing 400M through the page cache in 4k chunks,
> ext4: ~460MB/s -> ~580MB/s (~20%)
> XFS: ~660 -> ~870 (~25%)
> btrfs: ~130 -> ~200 (~35%)
> 
> For writing 400M in 80M chunks,
> ext4: ~480MB/s -> ~720MB/s (~30%)
> XFS: ~1GB/s -> ~1.5GB/s (~35%)
> btrfs: ~590MB/s -> ~590MB/s (no change)
> 
> For directio writing 400MB in 4k chunks,
> ext4: 25MB/s -> 26MB/s (~5%)
> XFS: 25 -> 27 (~8%)
> btrfs: 22 -> 18 (...)
> 
> For directio writing 1200MB in 80M chunks,
> ext4: ~2.9GB/s -> ~3.3GB/s (~13%)
> XFS: 3.2 -> 3.5 (~9%)
> btrfs: 2.3 -> 2.2 (...)

That's not comparing apples to apples. overwrite does not require
allocation/extent manipulation at all so it is comparing performance
of completely different file extent operations. The operations we
should be comparing are "first write" operations, namely "write over
hole with allocation" vs "write over preallocation".  Overwrite
performance should be the same regardless of the method used for the
intial write/allocation.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

  parent reply	other threads:[~2016-03-17  1:01 UTC|newest]

Thread overview: 82+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-03-02  4:09 [PATCH v5.1 0/2] create BLKZEROOUT ioctl that invalidates page cache Darrick J. Wong
2016-03-02  4:09 ` [PATCH 1/2] block: invalidate the page cache when issuing BLKZEROOUT Darrick J. Wong
2016-03-02  9:19   ` Christoph Hellwig
2016-03-02  4:09 ` [PATCH 2/2] block: create ioctl to discard-or-zeroout a range of blocks Darrick J. Wong
2016-03-02  9:20   ` Christoph Hellwig
2016-03-02 18:52   ` Linus Torvalds
2016-03-02 22:56     ` Darrick J. Wong
2016-03-02 23:49       ` Linus Torvalds
2016-03-03 17:02         ` Theodore Ts'o
2016-03-03 17:55           ` Linus Torvalds
2016-03-03 18:00             ` Christoph Hellwig
2016-03-03 18:14             ` Martin K. Petersen
2016-03-03 18:21             ` Theodore Ts'o
2016-03-03 18:01         ` Martin K. Petersen
2016-03-03 18:09           ` Christoph Hellwig
2016-03-03 18:12             ` Darrick J. Wong
2016-03-03 18:54             ` Martin K. Petersen
2016-03-03 22:39               ` Theodore Ts'o
2016-03-03 23:10                 ` Dave Chinner
2016-03-04  0:20                   ` Theodore Ts'o
2016-03-09 22:20                   ` Gregory Farnum
2016-03-09 23:08                     ` Theodore Ts'o
2016-03-10 14:58                       ` Ric Wheeler
2016-03-10 18:33                         ` Linus Torvalds
2016-03-10 21:47                           ` Theodore Ts'o
2016-03-11  4:42                           ` Ric Wheeler
2016-03-11 13:59                             ` One Thousand Gnomes
2016-03-11 15:27                               ` Theodore Ts'o
2016-03-11 17:23                               ` Linus Torvalds
2016-03-11 17:30                                 ` Andy Lutomirski
2016-03-11 18:25                                   ` Linus Torvalds
2016-03-11 22:30                                     ` Dave Chinner
2016-03-12  0:33                                       ` Linus Torvalds
2016-03-12  0:35                                       ` Theodore Ts'o
2016-03-12  0:44                                         ` Linus Torvalds
2016-03-12  7:19                                           ` Theodore Ts'o
2016-03-12 10:11                                             ` Thomas Schoebel-Theuer
2016-03-13 23:30                                           ` Dave Chinner
2016-03-14 10:34                                             ` Ric Wheeler
2016-03-14 14:46                                               ` Theodore Ts'o
2016-03-15 20:14                                                 ` Dave Chinner
2016-03-15 20:43                                                   ` Linus Torvalds
2016-03-15 21:29                                                     ` Theodore Ts'o
2016-03-15 22:33                                                     ` Dave Chinner
2016-03-15 22:52                                                       ` Theodore Ts'o
2016-03-16  1:51                                                         ` Darrick J. Wong
2016-03-16 21:45                                                           ` Andreas Dilger
2016-03-17  0:15                                                             ` Theodore Ts'o
2016-03-17  0:33                                                               ` Eric Sandeen
2016-03-17  0:59                                                                 ` Theodore Ts'o
2016-03-17  5:18                                                                 ` Gregory Farnum
2016-03-17 12:36                                                                   ` Theodore Ts'o
2016-03-17 17:47                                                                   ` Linus Torvalds
2016-03-17 17:50                                                                     ` Ric Wheeler
2016-03-17 17:59                                                                       ` Linus Torvalds
2016-03-17 18:35                                                                     ` Chris Mason
2016-03-17 20:49                                                                       ` Andreas Dilger
2016-03-17 21:00                                                                         ` Chris Mason
2016-03-18  3:20                                                                           ` Theodore Ts'o
2016-03-18 15:15                                                                             ` Jeff Moyer
2016-03-18 20:05                                                                               ` Martin K. Petersen
2016-03-18  6:52                                                                     ` Gregory Farnum
2016-03-18  7:19                                                                       ` Linus Torvalds
2016-03-17  1:01                                                           ` Dave Chinner [this message]
2016-03-17  2:38                                                             ` Darrick J. Wong
2016-03-18 22:55                                                         ` NeilBrown
2016-03-15 23:06                                                       ` Linus Torvalds
2016-03-15 23:14                                                         ` Linus Torvalds
2016-03-16  0:08                                                           ` Dave Chinner
2016-03-15 23:52                                                         ` Dave Chinner
2016-03-16  0:06                                                           ` Linus Torvalds
2016-03-16  0:30                                                             ` Eric Sandeen
2016-03-16  0:51                                                               ` Chris Mason
2016-03-16 22:23                                                                 ` Chris Mason
2016-03-17 13:49                                                                   ` Ric Wheeler
2016-03-15 22:38                                                   ` Eric Sandeen
2016-03-03 22:56               ` Dave Chinner
2016-03-04  2:30                 ` Thomas Schoebel-Theuer
2016-03-03 18:14           ` Linus Torvalds
2016-03-02  9:15 ` [PATCH v5.1 0/2] create BLKZEROOUT ioctl that invalidates page cache Arnd Bergmann
2016-03-02  9:44   ` Christoph Hellwig
2016-03-02 10:55     ` Arnd Bergmann

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20160317010116.GK30721@dastard \
    --to=david@fromorbit.com \
    --cc=akpm@linux-foundation.org \
    --cc=axboe@kernel.dk \
    --cc=bfields@fieldses.org \
    --cc=darrick.wong@oracle.com \
    --cc=esandeen@redhat.com \
    --cc=gnomes@lxorguk.ukuu.org.uk \
    --cc=greg@gregs42.com \
    --cc=hch@infradead.org \
    --cc=jlayton@poochiereds.net \
    --cc=linux-api@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=luto@amacapital.net \
    --cc=martin.petersen@oracle.com \
    --cc=rwheeler@redhat.com \
    --cc=shane.seymour@hpe.com \
    --cc=torvalds@linux-foundation.org \
    --cc=tytso@mit.edu \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).