All of lore.kernel.org
 help / color / mirror / Atom feed
From: Dave Chinner <david@fromorbit.com>
To: Boaz Harrosh <boaz@plexistor.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
	Jan Kara <jack@suse.cz>, Hugh Dickins <hughd@google.com>,
	Mel Gorman <mgorman@suse.de>,
	linux-mm@kvack.org, linux-nvdimm <linux-nvdimm@ml01.01.org>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	Eryu Guan <eguan@redhat.com>
Subject: Re: [PATCH 3/3] RFC: dax: dax_prepare_freeze
Date: Wed, 25 Mar 2015 20:41:35 +1100	[thread overview]
Message-ID: <20150325094135.GI31342@dastard> (raw)
In-Reply-To: <5512725A.1010905@plexistor.com>

On Wed, Mar 25, 2015 at 10:31:22AM +0200, Boaz Harrosh wrote:
> On 03/25/2015 04:26 AM, Dave Chinner wrote:
> <>
> >>> +	/* TODO: each DAX fs has some private mount option to enable DAX. If
> >>> +	 * We made that option a generic MS_DAX_ENABLE super_block flag we could
> >>> +	 * Avoid the 95% extra unneeded loop-on-all-inodes every freeze.
> >>> +	 * if (!(sb->s_flags & MS_DAX_ENABLE))
> >>> +	 *	return 0;
> >>> +	 */
> >>> +
> >>> +	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> > 
> > missing locking.
> > 
> 
> I will please need help here. This is very deep inside the freeze process
> we area already holding bunch of locks. We know that nothing can be modified
> at this stage. We are completely read-only.

Which means we could stillbe reading new inodes in off disk and
hence the sb->s_inodes list can be changing. Memory reclaim can be
running via the shrinker, freeing clean inodes, hence the
sb->s_inodes list can be changing.

>From fs/inode.c:

/*
 * Inode locking rules:
.....
 * inode_sb_list_lock protects:
 *   sb->s_inodes, inode->i_sb_list

This...

> >>> +		/* TODO: For freezing we can actually do with write-protecting
> >>> +		 * the page. But I cannot find a ready made function that does
> >>> +		 * that for a giving mapping (with all the proper locking).
> >>> +		 * How performance sensitive is the all sb_freeze API?
> >>> +		 * For now we can just unmap the all mapping, and pay extra
> >>> +		 * on read faults.
> >>> +		 */
> >>> +		/* NOTE: Do not unmap private COW mapped pages it will not
> >>> +		 * modify the FS.
> >>> +		 */
> >>> +		if (IS_DAX(inode))
> >>> +			unmap_mapping_range(inode->i_mapping, 0, 0, 0);
> >>
> >> So what happens here is that we loop on all sb->s_inodes every freeze
> >> and in the not DAX case just do nothing.
> > 
> > Which is real bad and known to be a performance issue. See Josef's
> > recent sync scalability patchset posting that only tracks and walks
> > dirty inodes...
> 
> Sure but how hot is freeze? Josef's fixed the very hot sync path,
> but freeze happens once in a blue moon. Do we care?

Yes, because if you have 50 million cached inodes on a filesystem,
it's going to take a long time to traverse them all, and right now
the inode_sb_list_lock is a *global lock*.

> >> It could be nice to have a flag at the sb level to tel us if we need
> >> to expect IS_DAX() inodes at all, for example when we are mounted on
> >> an harddisk it should not be set.
> >>
> >> All of ext2/4 and now Dave's xfs have their own
> >> 	XFS_MOUNT_DAX / EXT2_MOUNT_DAX / EXT4_MOUNT_DAX
> >>
> >> Is it OK if I unify all this on sb->s_flags |= MS_MOUNT_DAX so I can check it
> >> here in Generic code? The option parsing will be done by each FS but
> >> the flag be global?
> > 
> > No, because as I mentioned in another thread we're going to end up
> > with filesystems that don't have "mount wide" DAX behaviour, and we
> > have to check every dirty inode anyway. And....
> > 
> 
> Sure! but let us contract with the FS, that please set the MS_MOUNT_DAX
> if there is any chance at all that IS_DAX() comes out true, so we loop
> here. 

The mount option is irrelevant here - we should only be looping over
dirty inodes.  We don't care if they are DAX or not - we have to
iterate them and ensure they are properly clean. We already have
infrastructure to do this - we should use it and fix the problem
once and for all rather than hacking special case code into random
places.

> BTW: We must loop this way on every sb inode because we do not have
> dirty inodes. There is no "dirty"ing going on in dax, not of inodes
> and not of pages.

Precisely the problem we need to address. We do have dirty inodes,
we just never set the fact they have dirty "pages" on them and hence
never do data writeback on them.

> > ... it's the wrong approach - sync_filesystem(sb) shoul dbe handling
> > this problem, so that sync and fsync work correctly, and then you
> > don't care about whether DAX is supported or not...
> > 
> 
> sync and fsync should and will work correctly, but this does not
> solve our problem. because what turns pages to read-only is the
> writeback. And we do not have this in dax. Therefore we need to
> do this here as a special case.

We can still use exactly the same dirty tracking as we use for data
writeback. The difference is that we don't need to go through all
teh page writeback; we can just flush the CPU caches and mark all
the mappings clean, then clear the I_DIRTY_PAGES flag and move on to
inode writeback....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

WARNING: multiple messages have this Message-ID
From: Dave Chinner <david@fromorbit.com>
To: Boaz Harrosh <boaz@plexistor.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
	Jan Kara <jack@suse.cz>, Hugh Dickins <hughd@google.com>,
	Mel Gorman <mgorman@suse.de>,
	linux-mm@kvack.org, linux-nvdimm <linux-nvdimm@ml01.01.org>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	Eryu Guan <eguan@redhat.com>
Subject: Re: [PATCH 3/3] RFC: dax: dax_prepare_freeze
Date: Wed, 25 Mar 2015 20:41:35 +1100	[thread overview]
Message-ID: <20150325094135.GI31342@dastard> (raw)
In-Reply-To: <5512725A.1010905@plexistor.com>

On Wed, Mar 25, 2015 at 10:31:22AM +0200, Boaz Harrosh wrote:
> On 03/25/2015 04:26 AM, Dave Chinner wrote:
> <>
> >>> +	/* TODO: each DAX fs has some private mount option to enable DAX. If
> >>> +	 * We made that option a generic MS_DAX_ENABLE super_block flag we could
> >>> +	 * Avoid the 95% extra unneeded loop-on-all-inodes every freeze.
> >>> +	 * if (!(sb->s_flags & MS_DAX_ENABLE))
> >>> +	 *	return 0;
> >>> +	 */
> >>> +
> >>> +	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> > 
> > missing locking.
> > 
> 
> I will please need help here. This is very deep inside the freeze process
> we area already holding bunch of locks. We know that nothing can be modified
> at this stage. We are completely read-only.

Which means we could stillbe reading new inodes in off disk and
hence the sb->s_inodes list can be changing. Memory reclaim can be
running via the shrinker, freeing clean inodes, hence the
sb->s_inodes list can be changing.

>From fs/inode.c:

/*
 * Inode locking rules:
.....
 * inode_sb_list_lock protects:
 *   sb->s_inodes, inode->i_sb_list

This...

> >>> +		/* TODO: For freezing we can actually do with write-protecting
> >>> +		 * the page. But I cannot find a ready made function that does
> >>> +		 * that for a giving mapping (with all the proper locking).
> >>> +		 * How performance sensitive is the all sb_freeze API?
> >>> +		 * For now we can just unmap the all mapping, and pay extra
> >>> +		 * on read faults.
> >>> +		 */
> >>> +		/* NOTE: Do not unmap private COW mapped pages it will not
> >>> +		 * modify the FS.
> >>> +		 */
> >>> +		if (IS_DAX(inode))
> >>> +			unmap_mapping_range(inode->i_mapping, 0, 0, 0);
> >>
> >> So what happens here is that we loop on all sb->s_inodes every freeze
> >> and in the not DAX case just do nothing.
> > 
> > Which is real bad and known to be a performance issue. See Josef's
> > recent sync scalability patchset posting that only tracks and walks
> > dirty inodes...
> 
> Sure but how hot is freeze? Josef's fixed the very hot sync path,
> but freeze happens once in a blue moon. Do we care?

Yes, because if you have 50 million cached inodes on a filesystem,
it's going to take a long time to traverse them all, and right now
the inode_sb_list_lock is a *global lock*.

> >> It could be nice to have a flag at the sb level to tel us if we need
> >> to expect IS_DAX() inodes at all, for example when we are mounted on
> >> an harddisk it should not be set.
> >>
> >> All of ext2/4 and now Dave's xfs have their own
> >> 	XFS_MOUNT_DAX / EXT2_MOUNT_DAX / EXT4_MOUNT_DAX
> >>
> >> Is it OK if I unify all this on sb->s_flags |= MS_MOUNT_DAX so I can check it
> >> here in Generic code? The option parsing will be done by each FS but
> >> the flag be global?
> > 
> > No, because as I mentioned in another thread we're going to end up
> > with filesystems that don't have "mount wide" DAX behaviour, and we
> > have to check every dirty inode anyway. And....
> > 
> 
> Sure! but let us contract with the FS, that please set the MS_MOUNT_DAX
> if there is any chance at all that IS_DAX() comes out true, so we loop
> here. 

The mount option is irrelevant here - we should only be looping over
dirty inodes.  We don't care if they are DAX or not - we have to
iterate them and ensure they are properly clean. We already have
infrastructure to do this - we should use it and fix the problem
once and for all rather than hacking special case code into random
places.

> BTW: We must loop this way on every sb inode because we do not have
> dirty inodes. There is no "dirty"ing going on in dax, not of inodes
> and not of pages.

Precisely the problem we need to address. We do have dirty inodes,
we just never set the fact they have dirty "pages" on them and hence
never do data writeback on them.

> > ... it's the wrong approach - sync_filesystem(sb) shoul dbe handling
> > this problem, so that sync and fsync work correctly, and then you
> > don't care about whether DAX is supported or not...
> > 
> 
> sync and fsync should and will work correctly, but this does not
> solve our problem. because what turns pages to read-only is the
> writeback. And we do not have this in dax. Therefore we need to
> do this here as a special case.

We can still use exactly the same dirty tracking as we use for data
writeback. The difference is that we don't need to go through all
teh page writeback; we can just flush the CPU caches and mark all
the mappings clean, then clear the I_DIRTY_PAGES flag and move on to
inode writeback....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

  reply	other threads:[~2015-03-25  9:41 UTC|newest]

Thread overview: 36+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-03-23 12:47 [PATCH 0/3 v3] dax: Fix mmap-write not updating c/mtime Boaz Harrosh
2015-03-23 12:47 ` Boaz Harrosh
2015-03-23 12:49 ` [PATCH 1/3] mm: New pfn_mkwrite same as page_mkwrite for VM_PFNMAP Boaz Harrosh
2015-03-23 22:49   ` Andrew Morton
2015-03-23 22:49     ` Andrew Morton
2015-03-23 12:52 ` [PATCH 2/3] dax: use pfn_mkwrite to update c/mtime + freeze protection Boaz Harrosh
2015-03-23 12:54 ` [PATCH 3/3] RFC: dax: dax_prepare_freeze Boaz Harrosh
2015-03-23 22:40   ` Dave Chinner
2015-03-23 22:40     ` Dave Chinner
2015-03-24  6:14     ` Boaz Harrosh
2015-03-24  6:14       ` Boaz Harrosh
2015-03-25  2:22       ` Dave Chinner
2015-03-25  2:22         ` Dave Chinner
2015-03-25  8:10         ` Boaz Harrosh
2015-03-25  9:29           ` Dave Chinner
2015-03-25  9:29             ` Dave Chinner
2015-03-25 10:19             ` Boaz Harrosh
2015-03-25 10:19               ` Boaz Harrosh
2015-03-25 20:00               ` Dave Chinner
2015-03-25 20:00                 ` Dave Chinner
2015-03-26  8:02                 ` Boaz Harrosh
2015-03-26 20:58                   ` Dave Chinner
2015-03-26 20:58                     ` Dave Chinner
2015-03-24 12:37   ` Boaz Harrosh
2015-03-24 12:37     ` Boaz Harrosh
2015-03-25  2:26     ` Dave Chinner
2015-03-25  2:26       ` Dave Chinner
2015-03-25  8:31       ` Boaz Harrosh
2015-03-25  8:31         ` Boaz Harrosh
2015-03-25  9:41         ` Dave Chinner [this message]
2015-03-25  9:41           ` Dave Chinner
2015-03-25 10:40           ` Boaz Harrosh
2015-03-25 10:40             ` Boaz Harrosh
2015-03-25 20:05             ` Dave Chinner
2015-03-25 20:05               ` Dave Chinner
2015-03-23 12:56 ` [PATCH v4] xfstest: generic/080 test that mmap-write updates c/mtime Boaz Harrosh

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20150325094135.GI31342@dastard \
    --to=david@fromorbit.com \
    --cc=akpm@linux-foundation.org \
    --cc=boaz@plexistor.com \
    --cc=eguan@redhat.com \
    --cc=hughd@google.com \
    --cc=jack@suse.cz \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux-nvdimm@ml01.01.org \
    --cc=matthew.r.wilcox@intel.com \
    --cc=mgorman@suse.de \
    --subject='Re: [PATCH 3/3] RFC: dax: dax_prepare_freeze' \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.