[LSF/MM TOPIC NOTES] Phasing out kernel thread freezing

From: "Luis R. Rodriguez" <mcgrof@kernel.org>
To: linux-fsdevel@vger.kernel.org
Cc: Jan Kara <jack@suse.cz>, "Luis R. Rodriguez" <mcgrof@kernel.org>,
	Bart Van Assche <Bart.VanAssche@wdc.com>,
	"darrick.wong@oracle.com" <darrick.wong@oracle.com>,
	"jlayton@redhat.com" <jlayton@redhat.com>,
	"jikos@kernel.org" <jikos@kernel.org>,
	"david@fromorbit.com" <david@fromorbit.com>,
	"lsf-pc@lists.linux-foundation.org"
	<lsf-pc@lists.linux-foundation.org>,
	"Rafael J. Wysocki" <rafael@kernel.org>
Subject: [LSF/MM TOPIC NOTES] Phasing out kernel thread freezing
Date: Thu, 26 Apr 2018 21:22:43 +0000	[thread overview]
Message-ID: <20180426212243.GA27853@wotan.suse.de> (raw)

Below are my notes about this session at LSF/MM 2018.

kthread freezer woes
====================

Icebreakers:

  * When was the kthread primitive introduced and who introduced it?
    What was the original purpose of the kthreads API?

  * What was the first try_to_freeze() kernel user?

  * Do *you* run into issues with the kthread freezer?

  * We're addressing such woes with filesystems first

Motivation
----------

The kernel kthread freezer API is rather loose, and even with a lot of years of
evolution over *how* to properly freeze kthreads, there are still issues that
creep up. One goal suggested long ago by Jiri Kosina was to *not* have to
call try_to_freeze() on kthreads all over the kernel and instead replace it
with more appropriate infrastructure per subsystem.

Long term we want to address this throughout the kernel, however, we'll start
off focusing on filesystems first. Each other subsystem will have to address
this on their own but perhaps they can get some ideas of what to do from the
filesystems work.

Example of a modern kthread freezer issue
-----------------------------------------

A regression was detected on XFS with suspend, on a hibernation stress test
after 48 rounds of cycling it failed.

> Reverting 18f1df4e00ce ("xfs: Make xfsaild freezeable again")
> would break the proper form of the kthread for it to be freezable.
> This "form" is not defined formally, and sadly its just a form
> learned throughout years over different kthreads in the kernel.

Dave Chinner later noted:

	"Suspend on journalling filesystems has been broken for a long time
	(i.e since I first realised the scope of the problem back in 2005)"

	"IOWs, suspend of filesystems has been broken forever, and we've been
	slapping bandaids on it in XFS forever."

https://lkml.kernel.org/r/20171114212538.GC4094@dastard

Components to understand the the issue
--------------------------------------

  * refrigerator()
  * try_to_freeze()
  * kthread_freezable_should_stop()
  * kthread_run()

The core of the issue:

If a freezable kernel thread fails to call try_to_freeze() after the freezer
has initiated a freezing operation, the freezing of tasks will fail and the
entire hibernation or suspend operation will be cancelled.

refrigerator()
--------------

Commit 542f96a52 ("[PATCH] suspend-to-{RAM,disk}") by
Pavel Machek <pavel@ucw.cz> on v2.5.18 added suspend-to-RAM/disk
support, and as part of it, it added the refrigerator(). It carried
heavy warnings for a good reason:

	/Documentation/swsusp.txt
	BIG FAT WARNING

	If you have unsupported (*) devices using DMA...
				    ...say goodbye to your data.

	If you touch anything on disk between suspend and resume...
				    ...kiss your data goodbye.

	If your disk driver does not support suspend... (IDE does)
				    ...you'd better find out how to get along
				       without your data.

	(*) pm interface support is needed to make it safe.

# refrigerator() in a nutshell

void refrigerator(unsigned long flag)
{
	...
	while (current->flags & PF_FROZEN)
		schedule();
	...
}

kthread_run()
-------------

When was the kthread primitive introduced and who introduced it?

Rusty Russell <rusty@rustcorp.com.au> via linux-history commit
933ba10234f68 ("[PATCH] kthread primitive") on the v2.6.4 release.
Original motivation was to enable CPU hotplug. Managing tasks properly
in light of CPU hotplug is hard, the kthread primitive helps with this.

Uses kernel_thread() behind the scenes -- the kernel equivalent to a fork()

Don't freeze kthreads on try_to_freeze_tasks()
----------------------------------------------

We don't want kernel threads to be frozen in unexpected  places, so we allow
them to block freeze_processes(), or to set PF_NOFREEZE if needed.

KTW_FREEZABLE exists to enable kthread work freezing but no users exist.

static void create_kthread(struct kthread_create_info *create)
{
	...
        /* We want our own signal handler (we take no signals by default). */
        pid = kernel_thread(kthread, create, CLONE_FS | CLONE_FILES | SIGCHLD);
	...
}

Also:

bool freeze_task(struct task_struct *p)
{
	...
	if (!(p->flags & PF_KTHREAD))
		fake_signal_wake_up(p);
	else
		wake_up_state(p, TASK_INTERRUPTIBLE);
	...
}

Since kthreads want their own signal handler we won't wake them up with
the above signal, and we also have that extra PF_KTHREAD check -- just
in case. kthreads need to have control over how they are frozen.

freezer_do_not_count() - don't freeze current
---------------------------------------------

Just keep in mind its not just kthreads.

kthreads are not the only things which avoids the general freeze_processes().
freezer_do_not_count() sets PF_NOFREEZE and these proceesses will also skip
the general freeze. Currently set on:

  * do_fork() on wait_for_vfork_done()
  * do_coredump() on coredump_wait()
  * binder_thread_read() on binder_wait_for_work()

First try_to_freeze()
---------------------

Commit 54820fb26 ("[PATCH] swsusp: try_to_freeze() to make freezing hooks
nicer") added as of v2.6.11 by Pavel Machek <pavel@ucw.cz> added
try_to_freeze() API for the kernel scheduler.

# First try_to_freeze() kernel user

What was the first try_to_freeze() user? It was on the x86 Intel IO-APIC
kernel IRQ balancer:

static int __init balanced_irq_init(void)                                       
{
	...
	printk(KERN_INFO "Starting balanced_irq\n");
        if (kernel_thread(balanced_irq, NULL, CLONE_KERNEL) >= 0)
		return 0;
	else
	   printk(KERN_ERR "balanced_irq_init: failed to spawn balanced_irq");
	...
}

Modified later to use the kthread API, so after commit f26d6a2bbcf38 ("[PATCH]
i386: convert to the kthread API") merged on v2.6.22 this looked like:

static int __init balanced_irq_init(void)
{
	...
	printk(KERN_INFO "Starting balanced_irq\n");
	if (!IS_ERR(kthread_run(balanced_irq, NULL, "kirqd")))
		return 0;
	printk(KERN_ERR "balanced_irq_init: failed to spawn balanced_irq");
	...
}

Anyway, this IRQ balancer was later removed via commit 8b8e8c1bf7275e ("x86:
remove irqbalance in kernel for 32 bit") merged on v2.6.28 since the userspace
irqbalanced deprecated this.

Early try_to_freeze() users
---------------------------

Code was simplified all around that used similar semantics with
try_to_freeze() via commit f9adcf4ea1599 ("[PATCH] swsusp: refrigerator
cleanups") added on v2.6.11 by Pavel Machek <pavel@ucw.cz>.

What are the try_to_freeze() early users? Determined by using
linux-history with:

	git checkout -b linux-2.6.11 v2.6.11
	git grep try_to_freeze

  * architecture do_signal() calls -- for example on x86:

diff --git a/arch/x86_64/kernel/signal.c b/arch/x86_64/kernel/signal.c
index ad3b240cdd9c..1cb237ad1fcc 100644
--- a/arch/x86_64/kernel/signal.c
+++ b/arch/x86_64/kernel/signal.c
@@ -24,7 +24,6 @@
 #include <linux/stddef.h>
 #include <linux/personality.h>
 #include <linux/compiler.h>
-#include <linux/suspend.h>
 #include <asm/ucontext.h>
 #include <asm/uaccess.h>
 #include <asm/i387.h>
@@ -423,10 +422,8 @@ int do_signal(struct pt_regs *regs, sigset_t *oldset)
                return 1;
        }       
 
-       if (current->flags & PF_FREEZE) {
-               refrigerator(0);
+       if (try_to_freeze(0))
                goto no_signal;
-       }
 
        if (!oldset)
                oldset = &current->blocked;

	Commit fc558a7496bf ("[PATCH] swsusp: finally solve mysqld problem")
	by Rafael J. Wysocki <rjw@sisk.pl> on v2.6.17 moved try_to_freeze()
	to kernel/signal.c on get_signal_to_deliver(), now get_signal().

  * pcmcia on the pccardd kthread -- still present today
  * USB core hub_thread()
  * filesystmes
  * mm pdflush() context -- the original writeback daemon, "dirty page" flush
	pdflush added on v2.5.8. Prior to this we had bdflush(),
	and that was kicked off if try_to_free_buffers() was called on a page
	and it was determined not all the buffers of a page could be freed.
	Simple daemon to provide a dynamic response to dirty buffers.
	Its job was to writeback a limited number of buffers to disk and go
	back to sleep again.
  * sunrpc svc_recv()

kthread_freezable_should_stop()
-------------------------------

Commit 8a32c441c1609 ("freezer: implement and use
kthread_freezable_should_stop()" added via v3.3 by Tejun.
This commit claims you should *not* use try_to_freeze(), instead
you *should* use kthread_freezable_should_stop() if your kthread
is freezable....

But yet... only 4 kernel users!

The only filesystem using it is NFS on nfs4_callback_svc(). What gives?

Getting the kthread freezer right
---------------------------------

At the 2015 Kernel summit at South Korea, Jiri Kosina suggested we should
phase it out long term. The semantics are loose and experience shows that
it is difficult to get right. Best to phase it out if possible.

What to do - address filesystems first
--------------------------------------

Phasing out the kthread freezer is hard so lets divide and conquer.
Let's first address this on filesystems properly. Simple:

	freeze_fs() for filesystems that implement the callback

Patches are ongoing, so expect a new series soon for filesystem.
But what's left after this?

Filesystems with freeze_fs() and those using try_to_freeze()
------------------------------------------------------------

The current patches being worked on address filesystems which
implement freeze_fs(). Let's review these. The filesystems which
implement freeze_fs():

  * xfs
  * reiserfs
  * nilfs2
  * jfs
  * f2fs
  * ext4
  * ext2
  * btrfs

Of these, the following have freezer helpers, which can then be removed after
the kernel automaticaly calls freeze_fs() for us on suspend:

  * xfs
  * nilfs2
  * jfs
  * f2fs
  * ext4

What about other filesystems?

Order considerations
--------------------

Current simple solution:

	iterate_supers_reverse_excl() on freeze
	iterate_supers() for thaw

We acknowledged at LSF/MM that this order is not sufficient and definitely not
perfect. We would need a proper infrastructure to have a Directed Acyclic Graph
(DAG) we can rely on.  We do not have that today. We discussed possibly adding
infrastructure for this. We determined that this work is long term -- we're
fine to move on with life with the above simple implementation for now and
acknowledge the imperfect solution should work in most cases. Al Viro gave
an example where the order is not respected today.

You can use a loopback device to mount a file image on a filesystem A, and
at a later point in time dynamically change the backing device to another
file present on filesystem B using the loopback ioctl LOOP_CHANGE_FD --
loop_change_fd().

/*                                                                              
 * loop_change_fd switched the backing store of a loopback device to            
 * a new file. This is useful for operating system installers to free up        
 * the original file and in High Availability environments to switch to         
 * an alternative location for the content in case of server meltdown.          
 * This can only work if the loop device is used read-only, and if the          
 * new backing store is the same size and type as the old backing store.        
 */                                                                             
static int loop_change_fd(struct loop_device *lo, struct block_device *bdev,    
                          unsigned int arg)  
{
	...
}

The ioctl then currently enables the order for the a superblock to change
dynamically and break the above order assumptions to be correct.

It was mentioned it is believed some distributions (Fedora?) installers may use
this to complete installation and give control to the system withot rebooting.
If true, then suspend order *could* break in this case.

Should a flag be added to disable use of freeze_fs() if such ioctl is used?
Al didn't seem too happy with the idea of working around this by moving the
superblock around to the right order after the ioctl work.

One of the issues with the violation of this ordering is that its not possible
to detect violations to odering, given if you could do this, you may be able
to address ordering. Its a catch-22 situation. However, if there *are* known
cases where such violations do happen, one option is can skip the suspend
framework for filesystems later.

Long term we want proper infrastructure to address ordering, which can address
this example corner case. Other possible use cases for a DAG on the superblocks
is emergency remount.

Possible issues
---------------

David Howells noted future possible races with suspend/freeze and automount.
Should automount be skipped during the general suspend/resume cycle?

NFS
---

For NFS Jeff Layton has suggested to have freeze_fs() make the RPC 
engine "park" newly issued RPCs for that fs' client onto a rpc_wait_queue.
Any RPC that has already been sent however, we need to wait for a reply.

unfreeze_fs() can then just have the engine stop parking RPCs and wake up the
waitq. He however points out that if we're interested in making the cgroup
freezer also work, then we may need to do a bit more work to ensure that we
don't end up with frozen tasks squatting on VFS locks.

Is it true that cgroups freezer want tasks to not hold VFS locks prior to
freezing?

Dave Chinner notes that freezing a filesystem pretty much guarantees the
opposite - that tasks *will freeze when holding VFS locks* - the cgroup freezer
is broken by design *if* it requires tasks to be frozen without holding any
VFS/filesystem lock context, and as such we *should* be able to ignore it.

At LSF/MM it was acknowledged cgroup freezer may be broken, folks who would
care are aware, and a long term fix is needed.

Why the device model cannot be used for filesystems
---------------------------------------------------

Darick suggested that since the filesystem ordering is a layering problem we
should consider applying the device model to filesystems / block layer. Turns
out suspend of filesystems cannot be tied to suspend of devices.

# Do not use device model - breaks hibernation

Hibernation is one example of an issue.

Filesystems ordering is different that device ordering on a system.

During hibernation devices are suspended/resumed twice. When you hibernate
you need to freeze the filesystem, create the image, resume all devices to then
be able to store the image, and then suspend devices again.

Device ordering is not representative of the ordering in which you mount
filesystems. That has its own order, and suspending by bus order can be
different than the mount order. Suspending filesystems in the incorrect mount
order may yield in the inability for one filesystem to properly flush all
pending IO.

The ordering is not something which is only needed for suspend and hibernation
though. We'll need proper ordering for snapshotting, and if we later grow
the idea of snapshotting on files for the notion of subvolumes, odering
may play an important role here as well?

The current order strategy proposed is simply to iterate in reverse on all
superblocks, this should *in most cases* reflect the inverse of mount order.
The superblock on top will be the last mounted filesystem.

Filesystems on loopback freeze before the lower filesystem that hosts the
loopback image.

What about kernel filesystem which do not implement freeze_fs()?
----------------------------------------------------------------

Issuing a generic sync for filesystems which implement freeze_super() may
suffice.

cgroup notes
------------

Recall from Documentation/cgroup-v1/freezer-subsystem.txt:

	$ echo $$
	16644
	$ bash
	$ echo $$
	16690

	From a second, unrelated bash shell:
	$ kill -SIGSTOP 16690
	$ kill -SIGCONT 16690

	<at this point 16690 exits and causes 16644 to exit too>
                                                                                
This happens because bash can observe both signals and choose how it
responds to them.

FUSE drivers
------------

The implementation is only making use of filesystems which implement
freeze_fs() callback, it doesn't ouch other filesystems. It should
therefore technically not regress FUSE filesystems.

But there are still problems which need to be addressed on FUSE.

# New mount API with userspace freeze handler

One idea is that the kernel can notify userspace about the freeze and let
it deal with what it has to. There is a new mount call API being evolved,
one idea is to provide a userspace freeze hanlder call with this new mount
API.

It was punted that perhaps some of these things are ideas which can be worked
on under the Project Springfield umbrella [0].

[0] https://springfield-project.github.io/

# FUSE drivers and dirty data

https://lists.linux-foundation.org/pipermail/linux-pm/2007-May/012309.html

Paul Mackerras:

	"What happens if there is an encfs filesystem mounted,
	and we freeze all userspace processes and then do a sys_sync()?  Does
	the system wait forever for the encfs filesystem to write out its
	dirty data?"

Yes.

Technically sync() and the freezer should in theory address most considerations?

# An example broken approach

One could iterate over an XFS filesystems or FUSE filesystem and do something
like, but *this* script below is fragile and should be broken. Consider the
cgroup notes above, but this is also horribly slow and fragile. This uses the
filesystem freeze ioctl but also sends SIGSTOP/SIGCONT to processes. For FUSE
it could use whatever freeze op FUSE drivers come up with. But again, horribly
broken and why this needs proper infrastructure.

#!/bin/bash
set -e

XFS_FREEZE="/usr/sbin/xfs_freeze"
PROC_MOUNTS="/proc/mounts"
SUSPEND_SIGNAL="SIGSTOP"

error_quit()
{
        echo "$1" >&2
        exit 1
}

check-system()
{
        [ -r "${PROC_MOUNTS}" ] || error_quit "ERROR: cannot find or read ${PROC_MOUNTS}"
        [ -x "${XFS_FREEZE}" ] || error_quit "ERROR: cannot find or execute ${XFS_FREEZE}"
}

run-fs-freeze()
{
        local i FSTYPE MNT ROOTDEV ARGS

        while read ROOTDEV MNT FSTYPE ARGS; do
            [ "$ROOTDEV" = "rootfs" ] && continue
            [ "$MNT" = "/" ] && continue

            case $FSTYPE in
            xfs)
                echo "  Trying to freeze userspace processes on '$FSTYPE' mounted on ${MNT}... "
                for i in $(lsof +D $MNT -t 2>/dev/null); do kill -$SUSPEND_SIGNAL $i; done
                echo -n "  Trying to freeze fstype '$FSTYPE' mounted on ${MNT}... "
                $XFS_FREEZE -f $MNT
                echo "OK!"
                ;;
            *)
                ;;
            esac

        done < $PROC_MOUNTS
}

run-fs-unfreeze()
{
        local i FSTYPE MNT ROOTDEV ARGS

        while read ROOTDEV MNT FSTYPE ARGS; do
            [ "$ROOTDEV" = "rootfs" ] && continue
            [ "$MNT" = "/" ] && continue

            case $FSTYPE in
            xfs)
                echo "  Trying to unfreeze userspace processes on '$FSTYPE' mounted on ${MNT}... "
                for i in $(lsof +D $MNT -t 2>/dev/null); do kill -SIGCONT $i; done
                echo -n "  Trying to unfreeze fstype '$FSTYPE' mounted on ${MNT}... "
                $XFS_FREEZE -u $MNT
                echo "OK!"
                ;;
            *)
                ;;
            esac

        done < $PROC_MOUNTS
}

check-system

if [ "$2" = suspend ]; then
        echo "INFO: running $0 for $2"
else
        echo "INFO: running $0 for $2"
fi

if [ "$1" = pre ] ; then
        run-fs-freeze
fi
if [ "$1" = post ] ; then
        run-fs-unfreeze
fi

Future ideas
------------

# Flushing

Flushing is *not* fully needed as per Rafael, and he suggests this actually
creates high latencies for suspend. Addressing this as per Jan Kara is complex.
Jan Kara also notes that switching the fs freeze implementation to avoid
sync(2) if asked to is quite independent from implementing system suspend to
use fs freezing.

Can patches to make 'background buffered writeback not suck' help with future
suspend sync latencies?

Probably not?

Jens Axboe's patches:
https://lwn.net/Articles/681763/
https://lkml.org/lkml/2016/3/23/310

What about dynamically throttling the amount of writeback allowed through a
grace period, moments prior to suspend?
-- 
Do not panic