* Intermittent file corruption problems with cifs driver?
From: sean finney @ 2011-09-12  8:36 UTC
  To: linux-cifs, samba-technical

Hi all,

Recently at $customer I've been tasked with looking into a problem they
are intermittently having with corrupt file transfers from linux servers
to a windows share.  

Little info on the servers:

	Ubuntu Lucid 10.04
	Stock and up to date Linux 2.6.32-33-server distro package
	Stock cifs-utils 4.5-2 packages

Description of behavior:

	The servers are all part of a distributed service where each server
	regularly uploads 100-200MB zipfiles to the windows share.  Intermittently
	the resulting files will be corrupted.  On the client that performs
	the upload, the corrupted file will appear to have the correct checksum,
	but any other remote client will see it as corrupted.

	The problem used to be much more frequent, and mounting with -o directio
	seems to have greatly reduced, but not eliminated, the recurrence of the
	corruption.  But recently (perhaps due to higher rates of uploads?),
	the problem has started recurring.  It doesn't seem uniformly occurring,
	but rather in spurts where a couple files will be corrupted in one day,
	and then a week will go by with no corruptions.

	I do see occasional errors in the kernel logs, though I'm not sure if
	they are relevant or not (note that they're at substantially different
	times, and at present I have no way to correlate them with corruption,
	though I'm working on that):

	[170873.721023]  CIFS VFS: Error -104 sending data on socket to server
	[170873.728747]  CIFS VFS: Error -32 sending data on socket to server
	[515039.940104]  CIFS VFS: No response to cmd 115 mid 32714
	[515039.947933]  CIFS VFS: Send error in SessSetup = -11
	[521901.595381]  CIFS VFS: No response to cmd 46 mid 37426
	[521901.603422]  CIFS VFS: Send error in read = -11
	[2097744.571138]  CIFS VFS: No response for cmd 50 mid 48502
	[2097849.771138]  CIFS VFS: No response for cmd 114 mid 48519
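
For reference, the negative numbers in these messages are errno values. A minimal sketch for decoding them, assuming Linux errno numbering; the table is hard-coded so the result does not depend on the machine running the script, and `decode_errno` is a name made up for this example:

```python
# Linux errno values that appear in the CIFS VFS messages above.
# Hard-coded on the assumption that the logs come from a Linux kernel.
LINUX_ERRNO = {
    11: ("EAGAIN", "Resource temporarily unavailable"),
    32: ("EPIPE", "Broken pipe"),
    104: ("ECONNRESET", "Connection reset by peer"),
}


def decode_errno(rc):
    """Return 'NAME (description)' for a kernel-style negative return code."""
    name, text = LINUX_ERRNO.get(abs(rc), ("UNKNOWN", "not in table"))
    return f"{name} ({text})"


for rc in (-104, -32, -11):
    print(rc, decode_errno(rc))
```

Read this way, -104 and -32 point at the connection being reset or closed underneath the client, and -11 is what the client reports when a request gets no timely response.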


Reading through the archives along with the rest of teh internetz I've found 
very little info.  Someone posted here back in february about a similar
sounding problem, though I do not see the wsize-len blocks of NULL bytes in
the resulting files like they did.

I've written a small python script that right now is running on a pair of
these servers, which with a couple dozen threads is uploading similarly sized
files of arbitrary data, and comparing the upload results of each other.
after a few hours I haven't seen it yet, but will keep it running for
a couple days to see if it shows up.
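
The script itself is not included here, so the following is only a rough, single-threaded sketch of the upload-and-compare idea, not the actual test. The function names are hypothetical and temporary directories stand in for the local spool and the CIFS mount point. The expected checksum is taken from the local source file, since the uploading client may see a correct checksum for the remote copy even when other clients do not:

```python
import hashlib
import os
import shutil
import tempfile


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def upload(src, dst):
    """Copy src to dst, flush it out, and return the checksum of src."""
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
        fout.flush()
        os.fsync(fout.fileno())  # raises OSError if flushing the data failed
    return sha256_of(src)


def verify(path_seen_by_other_client, expected):
    """Meant to run on a different client than the one that uploaded."""
    return sha256_of(path_seen_by_other_client) == expected


if __name__ == "__main__":
    # Stand-in directories; a real run would use the share as mounted
    # on two different machines.
    with tempfile.TemporaryDirectory() as local, \
            tempfile.TemporaryDirectory() as share:
        src = os.path.join(local, "payload.bin")
        with open(src, "wb") as f:
            f.write(os.urandom(1 << 20))  # 1 MiB of arbitrary data
        dst = os.path.join(share, "payload.bin")
        expected = upload(src, dst)
        print("match" if verify(dst, expected) else "CORRUPT")
```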

I've also found a couple suggestions out there to "disable linux
extensions" and "disable oplocks" when searching on the above kernel error
messages, but am hesitant to try them unless there's a strong indication
that they will help, and i'm not entirely sure if/whether they will.


does this ring a bell with anyone?  at this point i can't just do a
blanket "try the latest" upgrade of these servers because they're part
of a production application, at least without any further indication that
there was a fix for this problem between the current and latest versions.
If I can repro the problem, however, and can then take it to a non-prod
machine, then I might have a bit more flexibility, but in the meantime
thought I'd field the question here on the off chance...


thanks!
	sean


* Re: Intermittent file corruption problems with cifs driver?
From: Steve French @ 2011-09-12 12:27 UTC
  To: sean finney; +Cc: linux-cifs, samba-technical

On Mon, Sep 12, 2011 at 3:36 AM, sean finney <seanius@seanius.net> wrote:

> Hi all,
>
> Recently at $customer I've been tasked with looking into a problem they
> are intermittently having with corrupt file transfers from linux servers
> to a windows share.
>
> Little info on the servers:
>
>        Ubuntu Lucid 10.04
>        Stock and up to date Linux 2.6.32-33-server distro package
>        Stock cifs-utils 4.5-2 packages
>
> Description of behavior:
>
>        The servers are all part of a distributed service where each server
>        regularly uploads 100-200MB zipfiles to the windows share.  Intermittently
>        the resulting files will be corrupted.  On the client that performs
>        the upload, the corrupted file will appear to have the correct checksum,
>        but any other remote client will see it as corrupted.
>
>        The problem used to be much more frequent, and mounting with -o directio
>        seems to have greatly reduced, but not eliminated, the recurrence of the
>        corruption.  But recently (perhaps due to higher rates of uploads?),
>        the problem has started recurring.  It doesn't seem uniformly occurring,
>        but rather in spurts where a couple files will be corrupted in one day,
>        and then a week will go by with no corruptions.
>
>        I do see occasional errors in the kernel logs, though I'm not sure if
>        they are relevant or not (note that they're at substantially different
>        times, and at present I have no way to correlate them with corruption,
>        though I'm working on that):
>
>        [170873.721023]  CIFS VFS: Error -104 sending data on socket to server
>        [170873.728747]  CIFS VFS: Error -32 sending data on socket to server
>        [515039.940104]  CIFS VFS: No response to cmd 115 mid 32714
>        [515039.947933]  CIFS VFS: Send error in SessSetup = -11
>        [521901.595381]  CIFS VFS: No response to cmd 46 mid 37426
>        [521901.603422]  CIFS VFS: Send error in read = -11
>        [2097744.571138]  CIFS VFS: No response for cmd 50 mid 48502
>        [2097849.771138]  CIFS VFS: No response for cmd 114 mid 48519
>


With log entries like the above, the probability of having a file fail to
copy is high (although I would expect it more with command 47, SMB WriteAndX,
but presumably if you get an error reading (command 46) you will
also have the copy fail, depending on how the application handles
such errors).
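
The command numbers in those messages are SMB1 command codes printed in decimal. As a rough aid for reading such logs, here is a small sketch that maps the codes mentioned in this thread to their protocol names; the regular expression and the `decode_line` helper are illustrative assumptions, not part of the cifs code:

```python
import re

# SMB1 command codes that appear in this thread (decimal -> name).
SMB_COMMANDS = {
    46: "SMB_COM_READ_ANDX",            # 0x2E
    47: "SMB_COM_WRITE_ANDX",           # 0x2F
    50: "SMB_COM_TRANSACTION2",         # 0x32
    114: "SMB_COM_NEGOTIATE",           # 0x72
    115: "SMB_COM_SESSION_SETUP_ANDX",  # 0x73
}

# Matches both the "No response to cmd" and "No response for cmd" wordings.
NO_RESPONSE = re.compile(r"No response (?:to|for) cmd (\d+) mid (\d+)")


def decode_line(line):
    """Return (cmd, name, mid) for a 'No response' line, else None."""
    match = NO_RESPONSE.search(line)
    if match is None:
        return None
    cmd, mid = int(match.group(1)), int(match.group(2))
    return cmd, SMB_COMMANDS.get(cmd, "unknown"), mid


print(decode_line("[521901.595381]  CIFS VFS: No response to cmd 46 mid 37426"))
```

None of the quoted timeouts is a command 47, which is why the log alone does not prove that a write was lost.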

The reconnection/retry behavior is much better in more current cifs,
especially when servers are sometimes very slow (as we saw
in clustered servers during occasional operations when cluster
overhead could cause > 15 second delays).   If the server
is taking greater than 15 seconds or so from time to time
(or as we see here with rc 104, if the server
or network randomly drops the connection), I would expect
big improvements in 2.6.39 or later (or equivalent
backport as some of the distros have done).

Disabling unix extensions (mount option "nounix") is
only going to make a difference to Samba and similar servers,
but I would not expect it to have an effect on this problem.
Mounting with "forcedirectio" could have an effect though, as it
1) changes i/o sizes to more closely match the application
(since the cache is not being used)
2) allows the application to get errors on writes more
quickly (since some applications forget to check errors
on fsync or close, and expect write errors on the actual write)


>
>
> Reading through the archives along with the rest of teh internetz I've found
> very little info.  Someone posted here back in february about a similar
> sounding problem, though I do not see the wsize-len blocks of NULL bytes in
> the resulting files like they did.
>
> I've written a small python script that right now is running on a pair of
> these servers, which with a couple dozen threads is uploading similarly sized
> files of arbitrary data, and comparing the upload results of each other.
> after a few hours I haven't seen it yet, but will keep it running for
> a couple days to see if it shows up.
>
> I've also found a couple suggestions out there to "disable linux
> extensions" and "disable oplocks" when searching on the above kernel error
> messages, but am hesitant to try them unless there's a strong indication
> that they will help, and i'm not entirely sure if/whether they will.
>
>
> does this ring a bell with anyone?  at this point i can't just do a
> blanket "try the latest" upgrade of these servers because they're part
> of a production application, at least without any further indication that
> there was a fix for this problem between the current and latest versions.
> If I can repro the problem, however, and can then take it to a non-prod
> machine, then I might have a bit more flexibility, but in the meantime
> thought I'd field the question here on the off chance...
>
>
> thanks!
>        sean



-- 
Thanks,

Steve



* Re: Intermittent file corruption problems with cifs driver?
From: Jeff Layton @ 2011-09-12 13:16 UTC
  To: sean finney; +Cc: linux-cifs, samba-technical

On Mon, 12 Sep 2011 10:36:55 +0200
sean finney <seanius@seanius.net> wrote:

> Hi all,
> 
> Recently at $customer I've been tasked with looking into a problem they
> are intermittently having with corrupt file transfers from linux servers
> to a windows share.  
> 
> Little info on the servers:
> 
> 	Ubuntu Lucid 10.04
> 	Stock and up to date Linux 2.6.32-33-server distro package
> 	Stock cifs-utils 4.5-2 packages
> 
> Description of behavior:
> 
> 	The servers are all part of a distributed service where each server
> 	regularly uploads 100-200MB zipfiles to the windows share.  Intermittently
> 	the resulting files will be corrupted.  On the client that performs
> 	the upload, the corrupted file will appear to have the correct checksum,
> 	but any other remote client will see it as corrupted.
> 
> 	The problem used to be much more frequent, and mounting with -o directio
> 	seems to have greatly reduced, but not eliminated, the recurrence of the
> 	corruption.  But recently (perhaps due to higher rates of uploads?),
> 	the problem has started recurring.  It doesn't seem uniformly occurring,
> 	but rather in spurts where a couple files will be corrupted in one day,
> 	and then a week will go by with no corruptions.
> 
> 	I do see occasional errors in the kernel logs, though I'm not sure if
> 	they are relevant or not (note that they're at substantially different
> 	times, and at present I have no way to correlate them with corruption,
> 	though I'm working on that):
> 
> 	[170873.721023]  CIFS VFS: Error -104 sending data on socket to server
> 	[170873.728747]  CIFS VFS: Error -32 sending data on socket to server
> 	[515039.940104]  CIFS VFS: No response to cmd 115 mid 32714
> 	[515039.947933]  CIFS VFS: Send error in SessSetup = -11
> 	[521901.595381]  CIFS VFS: No response to cmd 46 mid 37426
> 	[521901.603422]  CIFS VFS: Send error in read = -11
> 	[2097744.571138]  CIFS VFS: No response for cmd 50 mid 48502
> 	[2097849.771138]  CIFS VFS: No response for cmd 114 mid 48519
> 
> 
> Reading through the archives along with the rest of teh internetz I've found 
> very little info.  Someone posted here back in february about a similar
> sounding problem, though I do not see the wsize-len blocks of NULL bytes in
> the resulting files like they did.
> 
> I've written a small python script that right now is running on a pair of
> these servers, which with a couple dozen threads is uploading similarly sized
> files of arbitrary data, and comparing the upload results of each other.
> after a few hours I haven't seen it yet, but will keep it running for
> a couple days to see if it shows up.
> 
> I've also found a couple suggestions out there to "disable linux
> extensions" and "disable oplocks" when searching on the above kernel error
> messages, but am hesitant to try them unless there's a strong indication
> that they will help, and i'm not entirely sure if/whether they will.
> 
> 
> does this ring a bell with anyone?  at this point i can't just do a
> blanket "try the latest" upgrade of these servers because they're part
> of a production application, at least without any further indication that
> there was a fix for this problem between the current and latest versions.
> If I can repro the problem, however, and can then take it to a non-prod
> machine, then I might have a bit more flexibility, but in the meantime
> thought I'd field the question here on the off chance...
> 
> 
> thanks!
> 	sean
> 


Older kernels were particularly bad about giving up on writes that
timed out. When this happens, it typically will mark the mapping bad
so that you get an error on fsync or close, but that's small
consolation since a lot of programs don't check the return value on
close.
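
To make that concrete on the application side, here is a minimal sketch of a writer that lets errors from write, fsync and close all propagate. `write_checked` is a hypothetical helper, and it shows what the uploading program could do rather than anything the kernel does:

```python
import os
import tempfile


def write_checked(path, data):
    """Write data to path, letting write/fsync/close errors propagate."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            written = os.write(fd, view)  # may write fewer bytes than asked
            view = view[written:]
        os.fsync(fd)  # deferred write-back errors are reported here ...
    finally:
        os.close(fd)  # ... or here; os.close raises OSError on failure


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "upload.zip")
        try:
            write_checked(target, b"example payload")
            print("written and flushed")
        except OSError as err:
            print("upload failed, retry or alert:", err)
```

A program that calls close without looking at the result would report success in exactly the timeout case described above.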

That said, the messages you post above seem to indicate timeouts
on reads, not writes, but I don't recall whether writepages spewed any
errors when writes would time out.

The patchset that converted cifs to use async writes should not only
improve performance, but make this more robust as well. One thing you
can try is backporting 941b853 to see if that helps. Other than that,
I'd suggest moving to a newer kernel.

Anything before 2.6.38 is probably going to suck for data integrity for
those reasons, unless someone backported the newer code to it (like we
did for RHEL6).

-- 
Jeff Layton <jlayton@redhat.com>


* Re: Fwd: Intermittent file corruption problems with cifs driver?
From: sean finney @ 2011-09-12 14:37 UTC
  To: Steve French, Jeff Layton
  Cc: linux-cifs, samba-technical

Hi Steve, Jeff,

Thanks for the detailed info and suggestions.  It sounds like a 3.0
backport is the way to go then, since I've seen a number of other fixes
and improvements go that way.  Let's see how that works :)


	sean


* Re: Fwd: Intermittent file corruption problems with cifs driver?
From: Steve French @ 2011-09-12 15:25 UTC
  To: sean finney; +Cc: linux-cifs, samba-technical

On Mon, Sep 12, 2011 at 9:37 AM, sean finney <seanius@seanius.net> wrote:
> Hi Steve, Jeff,
>
> Thanks for the detailed info and suggestions.  It sounds like a 3.0
> backport is the way to go then, since I've seen a number of other fixes
> and improvements go that way.  Let's see how that works :)

3.0 is a lot faster too



-- 
Thanks,

Steve
