* [radosgw] Race condition corrupting data on COPY ?
From: Sylvain Munaut @ 2013-03-18  9:50 UTC
  To: ceph-devel

Hi,


I've just noticed something rather worrying on our cluster.

Some files are apparently truncated. From the first look I had at it,
it happened on files where there was a metadata update right after the
file was stored. The exact sequence was:

 - PUT to store the file
 - GET to get the file (which at that point is still correct and has
the proper length)
 - PUT using a 'copy source' over itself to update the metadata

All of these happened sequentially and very quickly, within the same second.

Subsequent GETs then return a truncated file.
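
For reference, the sequence above can be sketched as a small check. This is only an illustration under stated assumptions, not the client code that hit the problem: the `ObjectStore` interface, `InMemoryStore`, and `check_copy_preserves_data` are hypothetical names, and a real test would back the interface with an S3 client pointed at radosgw.

```python
from typing import Dict, Protocol


class ObjectStore(Protocol):
    """Hypothetical minimal interface; adapt it to boto or another S3 client."""

    def put(self, key: str, data: bytes) -> None: ...
    def get(self, key: str) -> bytes: ...
    def copy_onto_self(self, key: str, metadata: Dict[str, str]) -> None: ...


class InMemoryStore:
    """Stand-in store so the sketch runs without a cluster."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}
        self._metadata: Dict[str, Dict[str, str]] = {}

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        return self._objects[key]

    def copy_onto_self(self, key: str, metadata: Dict[str, str]) -> None:
        # A metadata-only update: the object data must stay untouched.
        self._metadata[key] = dict(metadata)


def check_copy_preserves_data(store: ObjectStore, key: str, data: bytes) -> bool:
    """Run PUT, GET, self-COPY back to back, then verify the stored object."""
    store.put(key, data)
    if store.get(key) != data:
        return False  # already wrong before the copy
    store.copy_onto_self(key, {"updated": "1"})
    return store.get(key) == data  # False if the copy truncated or altered it


if __name__ == "__main__":
    ok = check_copy_preserves_data(InMemoryStore(), "test-object", bytes(622080))
    print("data preserved:", ok)
```

Since the truncation was rare, such a check would probably have to be looped over many objects to have a chance of triggering it.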


I'm looking into it to narrow down the issue but I wanted to know if
anyone had seen something similar ?


Cheers,

     Sylvain


* Re: [radosgw] Race condition corrupting data on COPY ?
From: Yehuda Sadeh @ 2013-03-18 13:29 UTC
  To: Sylvain Munaut; +Cc: ceph-devel

On Mon, Mar 18, 2013 at 2:50 AM, Sylvain Munaut
<s.munaut@whatever-company.com> wrote:
> Hi,
>
>
> I've just noticed something rather worrying on our cluster.
>
> Some files are apparently truncated. From the first look I had at it,
> it happened on files where there was a metadata update right after the
> file was stored. The exact sequence was:
>
>  - PUT to store the file
>  - GET to get the file (which at that point is still correct and has
> the proper length)
>  - PUT using a 'copy source' over itself to update the metadata
>
> All of these happened sequentially and very quickly, within the same second.
>
> Subsequent GETs then return a truncated file.
>
>
> I'm looking into it to narrow down the issue but I wanted to know if
> anyone had seen something similar ?
>
>
What version are you using? Do you have logs?

Thanks,
Yehuda


* Re: [radosgw] Race condition corrupting data on COPY ?
From: Sylvain Munaut @ 2013-03-18 14:40 UTC
  To: Yehuda Sadeh; +Cc: ceph-devel

Hi,


> What version are you using? Do you have logs?

I'm running a custom build: 0.56.3 plus some patches (basically up to
7889c5412, plus fixes for #4150 and #4177).

I don't have any radosgw logs (the debug level is set to 0 and it didn't
output anything).
I do have the HTTP logs:

10.0.0.253 s3.svc - [14/Mar/2013:09:23:14 +0000] "PUT
/rb/138e6898a8039db16df2146398626f0303ae3e97427fdad33c95b6034f690b34
HTTP/1.1" 200 0 "-" "Boto/2.6.0 (linux2)"
10.0.0.74 s3.svc - [14/Mar/2013:09:23:14 +0000] "GET
/rb/138e6898a8039db16df2146398626f0303ae3e97427fdad33c95b6034f690b34?Signature=XXX%3D&Expires=1363256594&AWSAccessKeyId=XXX
HTTP/1.1" 200 622080 "-" "python-requests"
10.0.0.253 s3.svc - [14/Mar/2013:09:23:14 +0000] "PUT
/rb/138e6898a8039db16df2146398626f0303ae3e97427fdad33c95b6034f690b34
HTTP/1.1" 200 146 "-" "Boto/2.6.0 (linux2)"
10.0.0.74 s3.svc - [14/Mar/2013:10:14:53 +0000] "GET
/rb/138e6898a8039db16df2146398626f0303ae3e97427fdad33c95b6034f690b34?Signature=XXX%3D&Expires=1363258236&AWSAccessKeyId=XXX
HTTP/1.1" 200 461220 "-" "python-requests"
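
The size mismatch in these logs (622080 bytes on the first GET, 461220 on the later one) could be spotted automatically with something like the sketch below. It assumes each log entry is on a single unwrapped line and that the number after the status code is the response body size; the regex and the function names are illustrative, not part of radosgw or any web server tooling.

```python
import re
from collections import defaultdict

# Assumed layout: client host - [time] "METHOD path HTTP/x.y" status size ...
LOG_LINE = re.compile(
    r'^(?P<client>\S+) (?P<host>\S+) \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<path>\S+) HTTP/[\d.]+" '
    r'(?P<status>\d{3}) (?P<size>\d+)'
)


def get_sizes_by_object(lines):
    """Map each object path to the (time, size) of its successful GETs."""
    sizes = defaultdict(list)
    for line in lines:
        m = LOG_LINE.match(line)
        if not m or m["method"] != "GET" or m["status"] != "200":
            continue
        obj = m["path"].split("?", 1)[0]  # drop the signed query string
        sizes[obj].append((m["time"], int(m["size"])))
    return dict(sizes)


def objects_with_changing_size(lines):
    """Keep only objects whose GET responses did not all have the same size."""
    return {
        obj: entries
        for obj, entries in get_sizes_by_object(lines).items()
        if len({size for _, size in entries}) > 1
    }


OBJ = "/rb/138e6898a8039db16df2146398626f0303ae3e97427fdad33c95b6034f690b34"
SAMPLE = [
    f'10.0.0.253 s3.svc - [14/Mar/2013:09:23:14 +0000] "PUT {OBJ} HTTP/1.1" 200 0 "-" "Boto/2.6.0 (linux2)"',
    f'10.0.0.74 s3.svc - [14/Mar/2013:09:23:14 +0000] "GET {OBJ}?Signature=XXX%3D&Expires=1363256594&AWSAccessKeyId=XXX HTTP/1.1" 200 622080 "-" "python-requests"',
    f'10.0.0.253 s3.svc - [14/Mar/2013:09:23:14 +0000] "PUT {OBJ} HTTP/1.1" 200 146 "-" "Boto/2.6.0 (linux2)"',
    f'10.0.0.74 s3.svc - [14/Mar/2013:10:14:53 +0000] "GET {OBJ}?Signature=XXX%3D&Expires=1363258236&AWSAccessKeyId=XXX HTTP/1.1" 200 461220 "-" "python-requests"',
]

if __name__ == "__main__":
    for obj, entries in objects_with_changing_size(SAMPLE).items():
        print(obj, entries)
```

A differing size is only a hint: an object that was legitimately overwritten with new content between two GETs would be flagged as well.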


Cheers,

   Sylvain


* Re: [radosgw] Race condition corrupting data on COPY ?
From: Yehuda Sadeh @ 2013-03-18 16:25 UTC
  To: Sylvain Munaut; +Cc: ceph-devel

On Mon, Mar 18, 2013 at 7:40 AM, Sylvain Munaut
<s.munaut@whatever-company.com> wrote:
> Hi,
>
>
>> What version are you using? Do you have logs?
>
> I'm running a custom build: 0.56.3 plus some patches (basically up to
> 7889c5412, plus fixes for #4150 and #4177).
>
> I don't have any radosgw logs (the debug level is set to 0 and it didn't
> output anything).
> I do have the HTTP logs:
>
> 10.0.0.253 s3.svc - [14/Mar/2013:09:23:14 +0000] "PUT
> /rb/138e6898a8039db16df2146398626f0303ae3e97427fdad33c95b6034f690b34
> HTTP/1.1" 200 0 "-" "Boto/2.6.0 (linux2)"
> 10.0.0.74 s3.svc - [14/Mar/2013:09:23:14 +0000] "GET
> /rb/138e6898a8039db16df2146398626f0303ae3e97427fdad33c95b6034f690b34?Signature=XXX%3D&Expires=1363256594&AWSAccessKeyId=XXX
> HTTP/1.1" 200 622080 "-" "python-requests"
> 10.0.0.253 s3.svc - [14/Mar/2013:09:23:14 +0000] "PUT
> /rb/138e6898a8039db16df2146398626f0303ae3e97427fdad33c95b6034f690b34
> HTTP/1.1" 200 146 "-" "Boto/2.6.0 (linux2)"
> 10.0.0.74 s3.svc - [14/Mar/2013:10:14:53 +0000] "GET
> /rb/138e6898a8039db16df2146398626f0303ae3e97427fdad33c95b6034f690b34?Signature=XXX%3D&Expires=1363258236&AWSAccessKeyId=XXX
> HTTP/1.1" 200 461220 "-" "python-requests"
>
>
I can't make much out of that; we will probably need rgw logs (preferably
also with 'debug ms = 1') for this issue.

Yehuda


* Re: [radosgw] Race condition corrupting data on COPY ?
From: Sylvain Munaut @ 2013-03-18 16:39 UTC
  To: Yehuda Sadeh; +Cc: ceph-devel

Hi,

> I can't make much out of that; we will probably need rgw logs (preferably
> also with 'debug ms = 1') for this issue.

Well, the problem is that I can't make it happen again. It happened
4 times during an import of ~3000 files. I'm trying to reproduce
this on a test cluster but so far, no luck. I'll give it another shot
tomorrow.

I also can't enable debug logging on prod for long periods: the space
for logs is limited and would fill up in minutes with all the requests.
In any case, I have disabled the use of copy in production because I
can't have it corrupt random customer files.


Cheers,

    Sylvain
