Mount/df/PAM login hangs during rsync to btrfs subvolume, or maybe doing btrfs subvolume snapshot

* Mount/df/PAM login hangs during rsync to btrfs subvolume, or maybe doing btrfs subvolume snapshot
@ 2019-09-11  6:45 David Newall
  2019-09-11  6:55 ` Nikolay Borisov
  0 siblings, 1 reply; 13+ messages in thread
From: David Newall @ 2019-09-11  6:45 UTC (permalink / raw)
  To: linux-btrfs

Hi All,

I might have misunderstood how to report a problem.  I registered for 
bugzilla and reported a bug 
(https://bugzilla.kernel.org/show_bug.cgi?id=204757), but, perhaps I 
should have sent this message to this mailing list, first.  My apologies 
if I bungled it.

I've been trying to track down a problem, intermittently, for a long 
time, and now need to reach out for advice.  I apologise in advance for 
the quality of this report, which I feel includes more detail than 
needed, yet may be missing what's important.  I'm trying my best.

The brief summary is that my system hangs during SSH login while a 
backup is in progress.  Sshd uses PAM authentication.  The problem seems 
to be related to mounts as df and mount also hang.

The longer details are:  I'm running Ubuntu 16.04.5 on a 64-bit VM under 
kvm.  I backup data using the following steps:

1. Take an LVM2 snapshot of the (non-root) ext2 file-system mounted as 
/data;
2. Mount a btrfs file system as /backup;
2. Mount the snapshot over an empty directory (may be subvolume; does it 
make a difference?) on /backup/snapshot;
3. Rsync the snapshot (with --archive --one-file-system --hard-links 
--inplace --numeric-ids --delete) to a subvolume /backup/data (thus it 
always contains /data as at last backup);
4. Take btrfs subvolume snapshot of /backup/data;
5. Unmount /backup/snapshot and /backup.

By the time I get called, SSH logins via PAM hang (but complete 
"immediately" if I re-configure sshd for UsePAM no).  Sessions which are 
still logged in seem unaffected, except df and mount both hang.  I don't 
know what else hangs.

During all of these steps, the /data is almost static, maybe even be 
completely static.

I've queried my user, carefully, to determine the exact step where it 
starts to hang, and am 90% confident in her answer, which indicates that 
the hang-condition starts during rsync.

Processes that were hanging complete normally when subvolume snapshot 
finishes.

There's a chance that processes complete when the snapshot or btrfs 
file-system is unmounted, but I think it's before then because I've 
tried running each step by hand, was unable to reproduce the problem, 
probably because the amount of data to rsync in real-use is much larger 
than I tried writing during that test.  At any rate, during that test I 
could log in between and during each step of the procedure.

The only messages in dmesg are "mounting ext2 file system using the ext4 
subsystem" and "mounted filesystem without journal. Opts: (null)", which 
sounds right as I use "mount" instead of "mount -text2".

When I tried running df under strace, strace's output was:

   open("/proc/self/mountinfo", O_RDONLY)  = 3
   fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
   read(3, "18 24 0:17 / /sys rw,nosuid,node"..., 1024) = 1024
   read(3, "ystemd/systemd-cgroups-agent,nam"..., 1024) = 1024
   read(3, "t rw,nosuid,nodev,noexec,relatim"..., 1024) = 1024
   read(3,

After the subvolume snapshot completed, strace continued producing output:

   "fs lxcfs rw,user_id=0,group_id=0"..., 1024) = 624
   --- SIGCONT {si_signo=SIGCONT, si_code=SI_USER, si_pid=28055, 
si_uid=1000} ---
   read(3, "", 1024)                       = 0
   lseek(3, 0, SEEK_CUR)                   = 3696
   close(3)                                = 0

I think the SIGCONT was because I suspended the parent, strace, using 
Ctrl-Z.

I could just leave sshd doing non-PAM authentication but I think that's 
the wrong approach.  How do I zero in on this problem?

Thanks,

David

^ permalink raw reply	[flat|nested] 13+ messages in thread