All of lore.kernel.org
 help / color / mirror / Atom feed
From: Dave Chinner <david@fromorbit.com>
To: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>, Kanchan Joshi <joshi.k@samsung.com>,
	linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	tytso@mit.edu, adilger.kernel@dilger.ca, jack@suse.com,
	viro@zeniv.linux.org.uk, darrick.wong@oracle.com,
	jrdr.linux@gmail.com, ebiggers@google.com,
	jooyoung.hwang@samsung.com, chur.lee@samsung.com,
	prakash.v@samsung.com
Subject: Re: [PATCH 2/2] fs/ext4,jbd2: Add support for passing write-hint with journal.
Date: Thu, 13 Dec 2018 09:21:02 +1100	[thread overview]
Message-ID: <20181212222102.GC29416@dastard> (raw)
In-Reply-To: <20181211215444.GA29416@dastard>

On Wed, Dec 12, 2018 at 08:54:44AM +1100, Dave Chinner wrote:
> On Mon, Dec 10, 2018 at 10:07:48PM -0700, Jens Axboe wrote:
> > On 12/10/18 9:07 PM, Dave Chinner wrote:
> > > On Mon, Dec 10, 2018 at 08:44:32AM -0700, Jens Axboe wrote:
> > >> On 12/10/18 8:41 AM, Jan Kara wrote:
> > >>> On Mon 10-12-18 08:17:18, Jens Axboe wrote:
> > >>>> On 12/10/18 7:12 AM, Jan Kara wrote:
> > >>>>> On Mon 10-12-18 18:20:04, Kanchan Joshi wrote:
> > >>>>>> This patch introduces "j_writehint" in JBD2 journal,
> > >>>>>> which is set based by Ext4 depending on "journal_writehint"
> > >>>>>> mount option (inspired from "journal_ioprio").
> > >>>>>
> > >>>>> Thanks for the patch! It would be good to provide the explanation you have
> > >>>>> in the cover letter in this patch as well so that it gets recorded in git
> > >>>>> logs.
> > >>>>>
> > >>>>> Also I don't like the fact that users have to set the hint via a mount
> > >>>>> option for this to be enabled. OTOH the WRITE_FILE_<foo> hints defined in
> > >>>>> fs.h are generally supposed to be used by userspace so it's difficult to
> > >>>>> pick anything if we don't know what the userspace is going to do. I'd argue
> > >>>>> it's even difficult for the sysadmin to pick any good value even if he
> > >>>>> actually knows that he might benefit from setting some. Jens, is there
> > >>>>> some reasonable way for fs to automatically pick some stream value for its
> > >>>>> journal?
> > >>>>
> > >>>> I think we have two options here:
> > >>>>
> > >>>> 1) It's _probably_ safe to assume that journal data is short lived. While
> > >>>>    hints are all relative to the specific use case, the size of the journal
> > >>>>    compared to the rest of the drive is most likely very small. Hence a
> > >>>>    default of WRITE_LIFE_SHORT is probably a good idea.
> > >>>
> > >>> That's what I was thinking as well. But there are some exceptions like
> > >>> heavy DB load where there's very little of metadata modified (and thus
> > >>> almost no journal IO) compared to the DB data. So journal blocks may have
> > >>> actually longer life time than data blocks. OTOH if there's little journal
> > >>> IO there's no big benefit in specifying a stream for it so WRITE_LIFE_SHORT
> > >>> is probably a good default anyway.
> > >>
> > >> Right, that's my probably, it would definitely not work for all cases. But
> > >> it only really fails if two uses of the same life time ends up being vastly
> > >> different. It doesn't matter if LIFE_SHORT ends up being the longest life
> > >> time of them all.
> > >>
> > >>>> 2) We add a specific internal life time hint for fs journals.
> > >>>>
> > >>>> #2 makes the most sense to me, but requires a bit more work...
> > >>>
> > >>> Yeah, #2 would look more natural to me but I guess it needs some mapping to
> > >>> what the drive offers, doesn't it?
> > >>
> > >> We only used 4 streams, drives generally offer a lot more. So we can expand
> > >> it quite easily.
> > > 
> > > Can we get the number of stream supported from the drive? If we can
> > > get at this at mount time, we can use high numbers down for internal
> > > filesystem stuff, and low numbers up for user data (as already
> > > defined by the fcntl interface).
> > > 
> > > If the hardware doesn't support streams or doesn't support any more
> > > than the userspace interface covers, then it is probably best not to
> > > use hints at all for metadata...
> > 
> > Yes, we query these values. For instance, if we can't get the current
> > number of streams we support (4), then we don't use them. We currently
> > don't export this anywhere for the kernel to see, but that could be
> > rectified. In terms of values, the NVMe stream space is 16-bits, so
> > we could allocate from 65535 and down. There are no restrictions on
> > ordering, so it'd be perfectly fine to use your suggestion of top down
> > for the kernel.
> 
> That sounds perfect - right now I can't see a need for more than 4
> streams for XFS filesystem metadata (log, directory blocks, inodes,
> and internal btree blocks) as all the file data extent tree blocks
> whould inherit the data lifetime hint as the two of them are closely
> related....
> 
> > In terms of hardware support, we assign a number of streams per
> > namespace, and there's a fixed number of concurrently open streams per
> > drive. We can add reservations, for instance 8, for each namespace.
> > That'll give you the 4 user streams, and 4 for the kernel, 65535..65532.
> 
> *nod*
> 
> That also allows us to extend the "kernel reserved" space in future
> if we need to.
> 
> How do we find out if the device has the required number of streams
> to enable us to use this? Or can we just always set the hint on
> filesytem metadata IO and the devices will ignore it if they don't
> support it? The latter would be ideal from a filesystem POV - zero
> conf and just does the right thing when the hardware supports it...

Having looked at the block/nvme streams implementation yesterday,
it looks to me like this entails a complete rewrite of the
block layer streams functionality to enable this. There's not much
in the way of functionality there right now.

FWIW, I've got quite a few different nvme SSDs here (including
several enterprise drives), but none of them support streams. Which
means I can't test and validate anything to do with streams right
now. I'm guessing streams support is highly vendor specifici and
there are very few devices that support it right now?

At which point I have to ask: just how well tested is this
functionality? I really don't want to have XFS users exposed to yet
another round of "SSD firmware bugs corrupt XFS filesystems" because
nobody is using or testing this stuff outside of benchmarketing
exercises....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

  reply	other threads:[~2018-12-12 22:21 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <CGME20181210125245epcas2p202bc4336e91a55cee3786ed6abffea05@epcas2p2.samsung.com>
2018-12-10 12:50 ` [PATCH 0/2] fs,ext4,jbd2: Specifying write-hint for Ext4 journal Kanchan Joshi
     [not found]   ` <CGME20181210125251epcas1p3023db72cc08d6b2f899141d551d53f61@epcas1p3.samsung.com>
2018-12-10 12:50     ` [PATCH 1/2] fs: introduce APIs to enable sending write-hint with buffer-head Kanchan Joshi
2018-12-10 13:49       ` Jan Kara
2018-12-11 11:57         ` Kanchan Joshi
     [not found]   ` <CGME20181210125256epcas1p1e4142cfefb9b21dfb8dad927fbd49143@epcas1p1.samsung.com>
2018-12-10 12:50     ` [PATCH 2/2] fs/ext4,jbd2: Add support for passing write-hint with journal Kanchan Joshi
2018-12-10 14:12       ` Jan Kara
2018-12-10 15:17         ` Jens Axboe
2018-12-10 15:41           ` Jan Kara
2018-12-10 15:44             ` Jens Axboe
2018-12-11  4:07               ` Dave Chinner
2018-12-11  5:07                 ` Jens Axboe
2018-12-11 13:53                   ` Kanchan Joshi
2018-12-11 21:54                   ` Dave Chinner
2018-12-12 22:21                     ` Dave Chinner [this message]
2018-12-19 14:10                       ` Kanchan Joshi

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20181212222102.GC29416@dastard \
    --to=david@fromorbit.com \
    --cc=adilger.kernel@dilger.ca \
    --cc=axboe@kernel.dk \
    --cc=chur.lee@samsung.com \
    --cc=darrick.wong@oracle.com \
    --cc=ebiggers@google.com \
    --cc=jack@suse.com \
    --cc=jack@suse.cz \
    --cc=jooyoung.hwang@samsung.com \
    --cc=joshi.k@samsung.com \
    --cc=jrdr.linux@gmail.com \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=prakash.v@samsung.com \
    --cc=tytso@mit.edu \
    --cc=viro@zeniv.linux.org.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.