linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Eric Sandeen <sandeen@sandeen.net>
To: Jan Kara <jack@suse.cz>
Cc: Pingfan Liu <kernelfans@gmail.com>,
	"Darrick J. Wong" <darrick.wong@oracle.com>,
	Hari Bathini <hbathini@linux.ibm.com>,
	linuxppc-dev@lists.ozlabs.org, Dave Chinner <dchinner@redhat.com>,
	Eric Sandeen <esandeen@redhat.com>, Jan Kara <jack@suse.com>,
	linux-fsdevel@vger.kernel.org, linux-xfs@vger.kernel.org
Subject: Re: [PATCH] xfs: introduce "metasync" api to sync metadata to fsblock
Date: Mon, 14 Oct 2019 15:09:48 -0500	[thread overview]
Message-ID: <5796090e-6206-1bd7-174e-58798c9af052@sandeen.net> (raw)
In-Reply-To: <20191014200303.GF5939@quack2.suse.cz>



On 10/14/19 3:03 PM, Jan Kara wrote:
> On Mon 14-10-19 08:23:39, Eric Sandeen wrote:
>> On 10/14/19 4:43 AM, Jan Kara wrote:
>>> On Mon 14-10-19 16:33:15, Pingfan Liu wrote:
>>>> On Sun, Oct 13, 2019 at 09:34:17AM -0700, Darrick J. Wong wrote:
>>>>> On Sun, Oct 13, 2019 at 10:37:00PM +0800, Pingfan Liu wrote:
>>>>>> When using fadump (fireware assist dump) mode on powerpc, a mismatch
>>>>>> between grub xfs driver and kernel xfs driver has been obsevered.  Note:
>>>>>> fadump boots up in the following sequence: fireware -> grub reads kernel
>>>>>> and initramfs -> kernel boots.
>>>>>>
>>>>>> The process to reproduce this mismatch:
>>>>>>     - On powerpc, boot kernel with fadump=on and edit /etc/kdump.conf.
>>>>>>     - Replacing "path /var/crash" with "path /var/crashnew", then, "kdumpctl
>>>>>>       restart" to rebuild the initramfs. Detail about the rebuilding looks
>>>>>>       like: mkdumprd /boot/initramfs-`uname -r`.img.tmp;
>>>>>>             mv /boot/initramfs-`uname -r`.img.tmp /boot/initramfs-`uname -r`.img
>>>>>>             sync
>>>>>>     - "echo c >/proc/sysrq-trigger".
>>>>>>
>>>>>> The result:
>>>>>> The dump image will not be saved under /var/crashnew/* as expected, but
>>>>>> still saved under /var/crash.
>>>>>>
>>>>>> The root cause:
>>>>>> As Eric pointed out that on xfs, 'sync' ensures the consistency by writing
>>>>>> back metadata to xlog, but not necessary to fsblock. This raises issue if
>>>>>> grub can not replay the xlog before accessing the xfs files. Since the
>>>>>> above dir entry of initramfs should be saved as inline data with xfs_inode,
>>>>>> so xfs_fs_sync_fs() does not guarantee it written to fsblock.
>>>>>>
>>>>>> umount can be used to write metadata fsblock, but the filesystem can not be
>>>>>> umounted if still in use.
>>>>>>
>>>>>> There are two ways to fix this mismatch, either grub or xfs. It may be
>>>>>> easier to do this in xfs side by introducing an interface to flush metadata
>>>>>> to fsblock explicitly.
>>>>>>
>>>>>> With this patch, metadata can be written to fsblock by:
>>>>>>     # update AIL
>>>>>>     sync
>>>>>>     # new introduced interface to flush metadata to fsblock
>>>>>>     mount -o remount,metasync mountpoint
>>>>>
>>>>> I think this ought to be an ioctl or some sort of generic call since the
>>>>> jbd2 filesystems (ext3, ext4, ocfs2) suffer from the same "$BOOTLOADER
>>>>> is too dumb to recover logs but still wants to write to the fs"
>>>>> checkpointing problem.
>>>> Yes, a syscall sounds more reasonable.
>>>>>
>>>>> (Or maybe we should just put all that stuff in a vfat filesystem, I
>>>>> don't know...)
>>>> I think it is unavoidable to involve in each fs' implementation. What
>>>> about introducing an interface sync_to_fsblock(struct super_block *sb) in
>>>> the struct super_operations, then let each fs manage its own case?
>>>
>>> Well, we already have a way to achieve what you need: fsfreeze.
>>> Traditionally, that is guaranteed to put fs into a "clean" state very much
>>> equivalent to the fs being unmounted and that seems to be what the
>>> bootloader wants so that it can access the filesystem without worrying
>>> about some recovery details. So do you see any problem with replacing
>>> 'sync' in your example above with 'fsfreeze /boot && fsfreeze -u /boot'?
>>>
>>> 								Honza
>>
>> The problem with fsfreeze is that if the device you want to quiesce is, say,
>> the root fs, freeze isn't really a good option.
> 
> I agree you need to be really careful not to deadlock against yourself in
> that case. But this particular use actually has a chance to work.
> 
>> But the other thing I want to highlight about this approach is that it does not
>> solve the root problem: something is trying to read the block device without
>> first replaying the log.
>>
>> A call such as the proposal here is only going to leave consistent metadata at
>> the time the call returns; at any time after that, all guarantees are off again,
>> so the problem hasn't been solved.
> 
> Oh, absolutely agreed. I was also thinking about this before sending my
> reply. Once you unfreeze, the log can start filling with changes and
> there's no guarantee that e.g. inode does not move as part of these
> changes. But to be fair, replaying the log isn't easy either, even more so
> from a bootloader. You cannot write the changes from the log back into the
> filesystem as e.g. in case of suspend-to-disk the resumed kernel gets
> surprised and corrupts the fs under its hands (been there, tried that). So
> you must keep changes only in memory and that's not really easy in the
> constrained bootloader environment.
> 
> So I guess we are left with hacks that kind of mostly work and fsfreeze is
> one of those. If you don't mess with the files after fsfreeze, you're
> likely to find what you need even without replaying the log.

We're in agreement here.  ;)  I only worry about implementing things like this
which sound like guarantees, but aren't, and end up encouraging bad behavior
or promoting misconceptions.

More and more, I think we should reconsider Darrick's "bootfs" (ext2 by another
name, but with extra-sync-iness) proposal...

-Eric

> 								Honza
> 

  reply	other threads:[~2019-10-14 20:09 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <1570977420-3944-1-git-send-email-kernelfans@gmail.com>
     [not found] ` <20191013163417.GQ13108@magnolia>
2019-10-14  8:33   ` [PATCH] xfs: introduce "metasync" api to sync metadata to fsblock Pingfan Liu
2019-10-14  9:43     ` Jan Kara
2019-10-14 13:23       ` Eric Sandeen
2019-10-14 20:03         ` Jan Kara
2019-10-14 20:09           ` Eric Sandeen [this message]
2019-10-15  8:01             ` Christoph Hellwig
2019-10-15 13:10               ` Theodore Y. Ts'o
2019-10-15 16:18                 ` Darrick J. Wong
2019-10-15  2:20           ` Pingfan Liu
2019-10-15  2:12         ` Pingfan Liu

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=5796090e-6206-1bd7-174e-58798c9af052@sandeen.net \
    --to=sandeen@sandeen.net \
    --cc=darrick.wong@oracle.com \
    --cc=dchinner@redhat.com \
    --cc=esandeen@redhat.com \
    --cc=hbathini@linux.ibm.com \
    --cc=jack@suse.com \
    --cc=jack@suse.cz \
    --cc=kernelfans@gmail.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-xfs@vger.kernel.org \
    --cc=linuxppc-dev@lists.ozlabs.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).