* scp bug due to progress indicator when copying from remote to local on Linux
From: Steve French @ 2019-01-11  6:11 UTC
To: CIFS, linux-fsdevel
Cc: Pavel Shilovsky

In discussing an interesting problem with scp recently with Pavel, we found a fairly obvious bug in scp when it is run with a progress indicator (which is the default when the source file is remote).

scp triggers SIGALRM, probably from update_progress_meter() in progressmeter.c, when executed as "scp localhost:somelargefile /mnt", i.e. with an ssh source path but a local path for the target, whenever flushing the large amount of cached data to disk (which seems to be triggered by the ftruncate call in scp.c) takes more than a few seconds, which can be common depending on disk or network speed.

Interestingly, this can easily be avoided by running in quiet mode ("-q"), which disables the progress meter in scp. But it seems very, very broken that scp of a large file can 'fail' when using a progress meter (unless caching were disabled on the file system) because of the slight delay in ftruncate caused by flushing a large amount of cached data to disk sequentially.

Any thoughts on whether scp is actively maintained, and on the best approach to fix progressmeter.c so it doesn't break on Linux?

-- 
Thanks,

Steve
* Fwd: scp bug due to progress indicator when copying from remote to local on Linux
From: Steve French @ 2019-01-11  6:13 UTC
To: LKML

---------- Forwarded message ---------
From: Steve French <smfrench@gmail.com>
Date: Fri, Jan 11, 2019 at 12:11 AM
Subject: scp bug due to progress indicator when copying from remote to local on Linux
To: CIFS <linux-cifs@vger.kernel.org>, linux-fsdevel <linux-fsdevel@vger.kernel.org>
Cc: CIFS <linux-cifs@vger.kernel.org>, Pavel Shilovsky <piastryyy@gmail.com>

In discussing an interesting problem with scp recently with Pavel, we found a fairly obvious bug in scp when it is run with a progress indicator (which is the default when the source file is remote).

scp triggers "SIGALRM" probably from update_progress_meter() in progressmeter.c when executed with "scp localhost:somelargefile /mnt" ie with an ssh source path but a local path for the target, as long as the flush of the large amount of cached data to disk (which seems to be triggered by the ftruncate call in scp.c) takes more than a few seconds (which can be common depending on disk or network speed).

It is interesting that this can be avoided easily by running in quiet mode ("-q") which disables the progress meter in scp, but it seems very, very broken that scp of a large file can 'fail' when using a progress meter (unless caching were disabled on the file system) due to the slight delay in ftruncate due to triggering a flush of a large amount of cached data to disk sequentially.

Any thoughts of whether scp is actively maintained and best approach to fix progressmeter.c so it doesn't break on Linux?

-- 
Thanks,

Steve
* Re: scp bug due to progress indicator when copying from remote to local on Linux
From: Dr. Bernd Feige @ 2019-01-11  8:50 UTC
To: Steve French, CIFS, linux-fsdevel
Cc: Pavel Shilovsky

Dear Steve,

The most commonly used scp is part of OpenSSH, which is developed as part of OpenBSD; the Linux version is maintained as the "portable" version:
https://www.openssh.com/portable.html

The primary SCM is at https://anongit.mindrot.org/openssh.git/

Best regards,
Bernd

On Friday, 2019-01-11 at 00:11 -0600, Steve French wrote:
> In discussing an interesting problem with scp recently with Pavel, we found a fairly obvious bug in scp when it is run with a progress indicator (which is the default when the source file is remote).
>
> scp triggers "SIGALRM" probably from update_progress_meter() in progressmeter.c when executed with "scp localhost:somelargefile /mnt" ie with an ssh source path but a local path for the target, as long as the flush of the large amount of cached data to disk (which seems to be triggered by the ftruncate call in scp.c) takes more than a few seconds (which can be common depending on disk or network speed).
>
> It is interesting that this can be avoided easily by running in quiet mode ("-q") which disables the progress meter in scp, but it seems very, very broken that scp of a large file can 'fail' when using a progress meter (unless caching were disabled on the file system) due to the slight delay in ftruncate due to triggering a flush of a large amount of cached data to disk sequentially.
>
> Any thoughts of whether scp is actively maintained and best approach to fix progressmeter.c so it doesn't break on Linux?
* Re: scp bug due to progress indicator when copying from remote to local on Linux
From: Matthew Wilcox @ 2019-01-11 13:28 UTC
To: Steve French
Cc: CIFS, linux-fsdevel, Pavel Shilovsky

On Fri, Jan 11, 2019 at 12:11:00AM -0600, Steve French wrote:
> In discussing an interesting problem with scp recently with Pavel, we found a fairly obvious bug in scp when it is run with a progress indicator (which is the default when the source file is remote).
>
> scp triggers "SIGALRM" probably from update_progress_meter() in progressmeter.c when executed with "scp localhost:somelargefile /mnt" ie with an ssh source path but a local path for the target, as long as the flush of the large amount of cached data to disk (which seems to be triggered by the ftruncate call in scp.c) takes more than a few seconds (which can be common depending on disk or network speed).

Are you saying the SIGALRM interrupts ftruncate() and causes the ftruncate to fail?

> It is interesting that this can be avoided easily by running in quiet mode ("-q") which disables the progress meter in scp, but it seems very, very broken that scp of a large file can 'fail' when using a progress meter (unless caching were disabled on the file system) due to the slight delay in ftruncate due to triggering a flush of a large amount of cached data to disk sequentially.
>
> Any thoughts of whether scp is actively maintained and best approach to fix progressmeter.c so it doesn't break on Linux?
* Re: scp bug due to progress indicator when copying from remote to local on Linux
From: Steve French @ 2019-01-11 21:13 UTC
To: Matthew Wilcox
Cc: CIFS, linux-fsdevel, Pavel Shilovsky

On Fri, Jan 11, 2019 at 7:28 AM Matthew Wilcox <willy@infradead.org> wrote:
> Are you saying the SIGALRM interrupts ftruncate() and causes the ftruncate to fail?

So ftruncate does not really fail (the file contents and size match on source and target after the copy), but the scp 'fails' and the user would be quite confused (and presumably the network stack doesn't like this signal, which can cause disconnects etc., which in theory could cause reconnect/data loss issues in some corner cases).

Here is sample output from rerunning scp a few minutes ago (in tracing it earlier I can see scp triggering the SIGALRM in the progress indicator code). scp returned a non-zero value (failure), but the scp 'succeeded' (the file was copied fine). If I run with -q (quiet mode, no progress indicator), there is no problem. For a relatively small file size like this it fails less than half of the time, but if the file size is much larger (gigabytes) the ftruncate obviously takes longer and SIGALRM will always trigger.

smfrench@smf-copy-test3:~$ scp localhost:250M /mnt/trgtfile
smfrench@localhost's password:
250M                                    100%  250MB  18.6MB/s   00:13
/mnt/trgtfile: truncate: Interrupted system call
smfrench@smf-copy-test3:~$ echo $?
1
smfrench@smf-copy-test3:~$ diff 250M /mnt/trgtfile

Here is sample strace output from another run. Note a very long series of crazy small write sizes (16K), which complete quite quickly since cached by the Linux mm, then a relatively long (few second) pause as ftruncate triggers the flush, which causes the SIGALRM timeout in the scp progress indicator.

<snip>
write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384) = 16384
read(7, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384) = 16384
write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384) = 16384
read(7, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384) = 16384
write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384) = 16384
fcntl(7, F_GETFL)                       = 0x800 (flags O_RDONLY|O_NONBLOCK)
fcntl(7, F_SETFL, O_RDONLY)             = 0
ftruncate(3, 262144000)                 = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
--- SIGWINCH {si_signo=SIGWINCH, si_code=SI_KERNEL} ---
rt_sigreturn({mask=[ALRM]})             = 0
ioctl(1, TIOCGWINSZ, {ws_row=51, ws_col=156, ws_xpixel=0, ws_ypixel=0}) = 0
getpgrp()                               = 82563

> > It is interesting that this can be avoided easily by running in quiet mode ("-q") which disables the progress meter in scp

-- 
Thanks,

Steve
* Re: scp bug due to progress indicator when copying from remote to local on Linux
From: Matthew Wilcox @ 2019-01-11 21:22 UTC
To: Steve French
Cc: CIFS, linux-fsdevel, Pavel Shilovsky

On Fri, Jan 11, 2019 at 03:13:05PM -0600, Steve French wrote:
> On Fri, Jan 11, 2019 at 7:28 AM Matthew Wilcox <willy@infradead.org> wrote:
> > Are you saying the SIGALRM interrupts ftruncate() and causes the ftruncate to fail?
>
> So ftruncate does not really fail (the file contents and size match on source and target after the copy) but the scp 'fails' and the user would be quite confused (and presumably the network stack doesn't like this signal, which can cause disconnects etc. which in theory could cause reconnect/data loss issues in some corner cases).

You've run into the problem that userspace simply doesn't check the return value from syscalls. It's not just scp, it's every program. Looking through cifs, you seem to do a lot of wait_event_interruptible() where you maybe should be doing wait_event_killable()?

> ftruncate(3, 262144000)                 = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
> --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
> --- SIGWINCH {si_signo=SIGWINCH, si_code=SI_KERNEL} ---
> rt_sigreturn({mask=[ALRM]})             = 0
> ioctl(1, TIOCGWINSZ, {ws_row=51, ws_col=156, ws_xpixel=0, ws_ypixel=0}) = 0
> getpgrp()                               = 82563

Right ... so the code never calls ftruncate() again. Changing all of userspace is just not going to happen; maybe you could get stuff fixed in libc, but really ftruncate() should only be interrupted by a fatal signal and not by SIGALRM.
* Re: scp bug due to progress indicator when copying from remote to local on Linux
From: Steve French @ 2019-01-11 21:50 UTC
To: Matthew Wilcox
Cc: CIFS, linux-fsdevel, Pavel Shilovsky

On Fri, Jan 11, 2019 at 3:22 PM Matthew Wilcox <willy@infradead.org> wrote:
> You've run into the problem that userspace simply doesn't check the return value from syscalls. It's not just scp, it's every program. Looking through cifs, you seem to do a lot of wait_event_interruptible() where you maybe should be doing wait_event_killable()?
>
> Right ... so the code never calls ftruncate() again. Changing all of userspace is just not going to happen; maybe you could get stuff fixed in libc, but really ftruncate() should only be interrupted by a fatal signal and not by SIGALRM.

Looking at the places where wait_event_interruptible is called, I didn't see code in fs/cifs that would match the (presumed) code path; most of those calls are in the smbdirect (RDMA) code. For example, cifs_setattr does call filemap_write_and_wait, but as that goes down into the mm layer and then to cifs_writepages and the SMB3 write code, I didn't spot a wait_event_interruptible in that path (I might have missed something in the mm layer). I do see one in the cifs reconnect path, but that is not what we are typically hitting. Any ideas how to determine what we are blocked in when we get the annoying SIGALRM?

Another vague thought: is it possible to block SIGALRM across all of cifs_setattr? If it is, why do so few file systems (only three: ceph, jffs2, ocfs2) ever call sigprocmask?

-- 
Thanks,

Steve
* Re: scp bug due to progress indicator when copying from remote to local on Linux
From: Matthew Wilcox @ 2019-01-11 23:05 UTC
To: Steve French
Cc: CIFS, linux-fsdevel, Pavel Shilovsky

On Fri, Jan 11, 2019 at 03:50:02PM -0600, Steve French wrote:
> On Fri, Jan 11, 2019 at 3:22 PM Matthew Wilcox <willy@infradead.org> wrote:
> > Right ... so the code never calls ftruncate() again. Changing all of userspace is just not going to happen; maybe you could get stuff fixed in libc, but really ftruncate() should only be interrupted by a fatal signal and not by SIGALRM.
>
> Looking at the places where wait_event_interruptible is called, I didn't see code in fs/cifs that would match the (presumed) code path; most of those calls are in the smbdirect (RDMA) code. For example, cifs_setattr does call filemap_write_and_wait, but as that goes down into the mm layer and then to cifs_writepages and the SMB3 write code, I didn't spot a wait_event_interruptible in that path (I might have missed something in the mm layer). I do see one in the cifs reconnect path, but that is not what we are typically hitting. Any ideas how to determine what we are blocked in when we get the annoying SIGALRM? Another vague thought: is it possible to block SIGALRM across all of cifs_setattr? If it is, why do so few file systems (only three: ceph, jffs2, ocfs2) ever call sigprocmask?

You can see where a task is currently sleeping with 'cat /proc/$pid/stack'. If you can provoke a long duration ftruncate, that'd be a good place to start looking.
* Re: scp bug due to progress indicator when copying from remote to local on Linux
From: Steve French @ 2019-01-11 23:18 UTC
To: Matthew Wilcox
Cc: CIFS, linux-fsdevel, Pavel Shilovsky

On Fri, Jan 11, 2019 at 5:05 PM Matthew Wilcox <willy@infradead.org> wrote:
> You can see where a task is currently sleeping with 'cat /proc/$pid/stack'. If you can provoke a long duration ftruncate, that'd be a good place to start looking.

Not surprisingly, it is waiting in mm code:

root@smf-copy-test3:~# cat /proc/92189/stack
[<0>] io_schedule+0x16/0x40
[<0>] wait_on_page_bit_common+0x14f/0x350
[<0>] __filemap_fdatawait_range+0x104/0x160
[<0>] filemap_write_and_wait+0x4d/0x90
[<0>] cifs_setattr+0xc9/0xe80 [cifs]
[<0>] notify_change+0x2d2/0x460
[<0>] do_truncate+0x78/0xc0
[<0>] do_sys_ftruncate+0x14c/0x1c0
[<0>] __x64_sys_ftruncate+0x1b/0x20
[<0>] do_syscall_64+0x5a/0x110
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[<0>] 0xffffffffffffffff

So that brings me back to thinking about whether it is practical to mask (non-killable) signals in a few places in cifs.ko, as apparently at least a few file systems (ceph, jffs2 and ocfs2) do - in particular, masking SIGALRM across calls to filemap_write_and_wait.

-- 
Thanks,

Steve
* Re: scp bug due to progress indicator when copying from remote to local on Linux
From: Pavel Shilovsky @ 2019-01-14 19:48 UTC
To: Matthew Wilcox
Cc: Steve French, CIFS, linux-fsdevel

On Fri, Jan 11, 2019 at 1:22 PM Matthew Wilcox <willy@infradead.org> wrote:
> You've run into the problem that userspace simply doesn't check the return value from syscalls. It's not just scp, it's every program. Looking through cifs, you seem to do a lot of wait_event_interruptible() where you maybe should be doing wait_event_killable()?

We are doing wait_event_interruptible() mostly in places where we are waiting on a blocking byte-range lock or when we are waiting for a TCP connection to be established (e.g. after a reconnect).

> > ftruncate(3, 262144000)                 = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
> > --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
> > --- SIGWINCH {si_signo=SIGWINCH, si_code=SI_KERNEL} ---
> > rt_sigreturn({mask=[ALRM]})             = 0
> > ioctl(1, TIOCGWINSZ, {ws_row=51, ws_col=156, ws_xpixel=0, ws_ypixel=0}) = 0
> > getpgrp()                               = 82563
>
> Right ... so the code never calls ftruncate() again. Changing all of userspace is just not going to happen; maybe you could get stuff fixed in libc, but really ftruncate() should only be interrupted by a fatal signal and not by SIGALRM.

It seems that SA_RESTART is just not set for scp. What do you think about returning ERESTARTNOINTR instead for this specific case - filemap_write_and_wait during ftruncate? It should force the syscall to be restarted regardless of the userspace program's settings.

-- 
Best regards,
Pavel Shilovsky
* Re: scp bug due to progress indicator when copying from remote to local on Linux
From: Andreas Dilger @ 2019-01-11 21:48 UTC
To: Steve French
Cc: Matthew Wilcox, CIFS, linux-fsdevel, Pavel Shilovsky

On Jan 11, 2019, at 2:13 PM, Steve French <smfrench@gmail.com> wrote:
> So ftruncate does not really fail (the file contents and size match on source and target after the copy) but the scp 'fails' and the user would be quite confused (and presumably the network stack doesn't like this signal, which can cause disconnects etc. which in theory could cause reconnect/data loss issues in some corner cases).
>
> Here is sample output from rerunning scp a few minutes ago. (in tracing it earlier I can see scp triggering the SIGALRM in the progress indicator code). scp returned non-zero value (failure), but scp 'succeeded' (file was copied fine). If I run with -q (quiet mode, no progress indicator, there is no problem). For a relatively small file size like this it fails less than 1/2 of the time, but if the file size is much larger (gigabytes) the ftruncate obviously takes longer and SIGALRM will always trigger.

I'm guessing this is writing to a CIFS filesystem as the target, or some other filesystem that doesn't handle sparse files? AFAIK, at one point there was a heuristic like "truncate-to-non-zero triggers file preallocation" before fallocate() was available.

Otherwise, why would the truncate() take a long time even if the file is large, since it shouldn't actually be doing any work to change the size of the file (I'm assuming that the target file is either empty or about the same size during an overwrite; you didn't mention anything unusual as a requirement).

Other than the duration of the truncate() call being excessive, it seems to me that everything is working as expected, or at least as the code asked for? It registered SIGALRM, so when a syscall returns -ERESTARTSYS it should actually restart that call.

My typical response for issues like this is to find the github repo for the code and push a patch to fix the code myself. If I can't find the upstream project/source repo easily, I download the .src.rpm, make a patch and submit it to the major distro package maintainers (RHEL, SLES, Ubuntu or Debian), so that at least it is fixed for the vast majority of users, and those maintainers can handle pushing the patch further upstream (and smaller distros will copy patches/sources from them).

Cheers, Andreas

> smfrench@smf-copy-test3:~$ scp localhost:250M /mnt/trgtfile
> smfrench@localhost's password:
> 250M                                    100%  250MB  18.6MB/s   00:13
> /mnt/trgtfile: truncate: Interrupted system call
> smfrench@smf-copy-test3:~$ echo $?
> 1
> smfrench@smf-copy-test3:~$ diff 250M /mnt/trgtfile
>
> <snip>
* Re: scp bug due to progress indicator when copying from remote to local on Linux
From: Steve French @ 2019-01-11 22:02 UTC
To: Andreas Dilger
Cc: Matthew Wilcox, CIFS, linux-fsdevel, Pavel Shilovsky

On Fri, Jan 11, 2019 at 3:48 PM Andreas Dilger <adilger@dilger.ca> wrote:
> I'm guessing this is writing to a CIFS filesystem as the target, or some other filesystem that doesn't handle sparse files? AFAIK, at one point there was a heuristic like "truncate-to-non-zero triggers file preallocation" before fallocate() was available.

cifs can handle sparse files (it has had protocol support for various fallocate equivalents for many, many years), and I think this would fail on other file systems too, depending on how long flushing the cached data takes (presumably a function of disk or network speed, size of cached data, etc.)

> Otherwise, why would the truncate() take a long time even if the file is large, since it shouldn't actually be doing any work to change the size of the file (I'm assuming that the target file is either empty or about the same size during an overwrite; you didn't mention anything unusual as a requirement).

Note that scp (unlike rsync, robocopy and some other tools) doesn't appear to offer an option to 'set the file size first' and then do the writes (which helps with some types of file systems), but setting the file size is not the issue in this case (perhaps it would be in some cloud or cluster file systems, but not here).

> My typical response for issues like this is to find the github repo for the code and push a patch to fix the code myself. If I can't find the upstream project/source repo easily, download the .src.rpm and make a patch and submit it to the major distro package maintainers (RHEL, SLES, Ubuntu or Debian), so that at least it is fixed for the vast majority of users, and those maintainers can handle pushing the patch further upstream (and smaller distros will copy patches/sources from them).

I have built the scp code as an experiment, and by the way there are many, many things that could be changed; other copy tools (e.g. robocopy, even rsync and cp) look much more mature, with lots more useful performance options. My earlier experiments, e.g. changing the scp I/O size to something larger, did help as expected in various cases (16K is suboptimal in many cases).

But in this case it seems much simpler - the code path is basically:

read 16K over ssh
write 16K to local/network file system (cached)
(repeated 1000s of times)
then ftruncate

ftruncate, if it were simply setting the file size, would be very fast - except that setting the file size requires the cached data to be written out first (writes could otherwise extend the file size), so it triggers a call to writepages (filemap_write_and_wait) as a side effect, and all the cached data is finally written out. That can take a few seconds and cause the scp progress indicator to malfunction and trigger the SIGALRM. This is normally not a problem; fsync etc. can take longer than a single I/O operation.

-- 
Thanks,

Steve
* Re: scp bug due to progress indicator when copying from remote to local on Linux
  2019-01-11 22:02         ` Steve French
@ 2019-01-11 23:51           ` Andreas Dilger
  2019-01-12  0:06             ` Steve French
  0 siblings, 1 reply; 14+ messages in thread
From: Andreas Dilger @ 2019-01-11 23:51 UTC (permalink / raw)
  To: Steve French; +Cc: Matthew Wilcox, CIFS, linux-fsdevel, Pavel Shilovsky

[-- Attachment #1: Type: text/plain, Size: 4713 bytes --]

On Jan 11, 2019, at 3:02 PM, Steve French <smfrench@gmail.com> wrote:
>
> On Fri, Jan 11, 2019 at 3:48 PM Andreas Dilger <adilger@dilger.ca> wrote:
>>
>> On Jan 11, 2019, at 2:13 PM, Steve French <smfrench@gmail.com> wrote:
>>>
>>> On Fri, Jan 11, 2019 at 7:28 AM Matthew Wilcox <willy@infradead.org> wrote:
>>>>
>>>> On Fri, Jan 11, 2019 at 12:11:00AM -0600, Steve French wrote:
>>>>> In discussing an interesting problem with scp recently with Pavel, we
>>>>> found a fairly obvious bug in scp when it is run with a progress
>>>>> indicator (which is the default when the source file is remote).
>>>>>
>>>>> scp triggers "SIGALRM" probably from update_progress_meter() in
>>>>> progressmeter.c when executed with "scp localhost:somelargefile /mnt"
>>>>> ie with an ssh source path but a local path for the target, as long as
> <snip>
>
>> My typical response for issues like this is to find the github repo for
>> the code and push a patch to fix the code myself. If I can't find the
>> upstream project/source repo easily, download the .src.rpm and make a
>> patch and submit it to the major distro package maintainers (RHEL, SLES,
>> Ubuntu or Debian), so that at least it is fixed for the vast majority
>> of users, and those maintainers can handle pushing the patch further
>> upstream (and smaller distros will copy patches/sources from them).
>
> I have built the scp code as an experiment and by the way there are many,
> many things that could be changed (other copy tools, e.g. robocopy and even
> rsync and cp, look much more mature, with lots more useful performance options).
> My earlier experiments, e.g. changing the scp i/o size to something larger,
> did help as expected for various cases (16K is suboptimal in many cases).

Years ago I submitted a patch to fix GNU fileutils so that "cp" could use
st_blksize to determine the data IO size for the copy. Originally it would
allocate its copy buffer on the stack, but when Lustre returned st_blksize
of 8MB it exploded. 2MB was safe at the time, and what we used for many
years, but fixing "cp" allowed us to increase st_blksize over time. Probably
scp should be modified to do the same.

It is often just a case of the developer not knowing any better, or the code
was written so long ago that 16KB transfers were seen as a huge improvement
over 4KB or 1KB transfers, and nobody has looked at it again. Since we all
depend on other developers to know their side of the house (e.g. I'm wholly
dependent on others for the crypto implementation of SSH/scp), we may as
well provide some help when there is something as widely used as scp that
we can improve in some useful way.

> But in this case it seems much simpler - the code path is basically
> read 16K over ssh
> write 16K to local/network file system (cached)
> (repeated 1000s of times)
> then ftruncate

Even if the filesystem is transferring large IOs from disk/network, doing
many thousands of excess system calls is pure overhead that could be avoided.
Also, even if scp could only do 16KB network transfers, staying in a tight
"scp copies data over ssh" loop is better for icache/dcache/context, and
doing a single large write call to/from the filesystem is most efficient
on that side of the house.

>> ... why would the truncate() take a long time even if the file
>> is large, since it shouldn't actually be doing any work to change the
>> size of the file (I'm assuming that the target file is either empty or
>> about the same size during an overwrite, you didn't mention anything
>> unusual as a requirement).
>
> ftruncate, if it were simply setting the file size, would be very fast,
> except that setting the file size needs to have the data written out
> (writes otherwise could risk extending the file size), so it triggers a
> call to writepages (filemap_write_and_wait) as a side effect,
> so all the cached data is finally written out, which can take a few
> seconds and cause the scp progress indicator to malfunction
> and trigger the SIGALRM. This is normally not a problem;
> fsync etc. can take longer than a single i/o operation.

This seems more like a kernel/filesystem issue. We shouldn't have to flush
the entire file to disk just to change the file size. In the exceedingly
common case of the current size == truncate() size, the truncate should be
a no-op unless the on-disk size is > new size. In the unlikely case that
the file size > new size, it could truncate only the pages and on-disk
blocks beyond the new size.

I don't see how pages in cache < new size could cause the file size to be
extended. Is this a VFS/MM thing, or CIFS-specific?

Cheers, Andreas

[-- Attachment #2: Message signed with OpenPGP --]
[-- Type: application/pgp-signature, Size: 873 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: scp bug due to progress indicator when copying from remote to local on Linux
  2019-01-11 23:51           ` Andreas Dilger
@ 2019-01-12  0:06             ` Steve French
  0 siblings, 0 replies; 14+ messages in thread
From: Steve French @ 2019-01-12 0:06 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: Matthew Wilcox, CIFS, linux-fsdevel, Pavel Shilovsky

On Fri, Jan 11, 2019 at 5:51 PM Andreas Dilger <adilger@dilger.ca> wrote:
>
> On Jan 11, 2019, at 3:02 PM, Steve French <smfrench@gmail.com> wrote:
<snip>
> > ftruncate, if it were simply setting the file size, would be very fast,
> > except that setting the file size needs to have the data written out
> > (writes otherwise could risk extending the file size), so it triggers a
> > call to writepages (filemap_write_and_wait) as a side effect,
> > so all the cached data is finally written out, which can take a few
> > seconds and cause the scp progress indicator to malfunction
> > and trigger the SIGALRM.
>
> This seems more like a kernel/filesystem issue. We shouldn't have to flush
> the entire file to disk just to change the file size. In the exceedingly
> common case of the current size == truncate() size, the truncate should be
> a no-op unless the on-disk size is > new size. In the unlikely case that
> the file size > new size, it could truncate only the pages and on-disk
> blocks beyond the new size.
>
> I don't see how pages in cache < new size could cause the file size to be
> extended. Is this a VFS/MM thing, or CIFS-specific?

I suspect that for various other network/cluster file systems, flushing
(not to disk, but at least to the server, i.e. across the network) is
considered safer to do before certain metadata updates (including setting
the file size).

--
Thanks,

Steve

^ permalink raw reply	[flat|nested] 14+ messages in thread
end of thread, other threads:[~2019-01-14 19:48 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-01-11  6:11 scp bug due to progress indicator when copying from remote to local on Linux Steve French
2019-01-11  6:13 ` Fwd: " Steve French
2019-01-11  8:50 ` Dr. Bernd Feige
2019-01-11 13:28 ` Matthew Wilcox
2019-01-11 21:13   ` Steve French
2019-01-11 21:22     ` Matthew Wilcox
2019-01-11 21:50       ` Steve French
2019-01-11 23:05         ` Matthew Wilcox
2019-01-11 23:18           ` Steve French
2019-01-14 19:48             ` Pavel Shilovsky
2019-01-11 21:48   ` Andreas Dilger
2019-01-11 22:02     ` Steve French
2019-01-11 23:51       ` Andreas Dilger
2019-01-12  0:06         ` Steve French