* Message sequence overflow
@ 2016-06-01  8:49 Yan, Zheng
  2016-06-01 10:20 ` [ceph-users] " Ilya Dryomov
  2016-06-01 12:49 ` Sage Weil
  0 siblings, 2 replies; 12+ messages in thread
From: Yan, Zheng @ 2016-06-01  8:49 UTC (permalink / raw)
  To: ceph-devel; +Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw, James Webb

On Wed, Jun 1, 2016 at 6:15 AM, James Webb <jamesw@unity3d.com> wrote:
> Dear ceph-users...
>
> My team runs an internal buildfarm using ceph as a backend storage platform. We’ve recently upgraded to Jewel and are having reliability issues that we need some help with.
>
> Our infrastructure is the following:
> - We use CEPH/CEPHFS (10.2.1)
> - We have 3 mons and 6 storage servers with a total of 36 OSDs (~4160 PGs).
> - We use enterprise SSDs for everything including journals
> - We have one main mds and one standby mds.
> - We are using ceph kernel client to mount cephfs.
> - We have upgrade to Ubuntu 16.04 (4.4.0-22-generic kernel)
> - We are using a kernel NFS to serve NFS clients from a ceph mount (~ 32 nfs threads. 0 swappiness)
> - These are physical machines with 8 cores & 32GB memory
>
> On a regular basis, we lose all IO via ceph FS. We’re still trying to isolate the issue but it surfaces as an issue between MDS and ceph client.
> We can’t tell if our our NFS server is overwhelming the MDS or if this is some unrelated issue. Tuning NFS server has not solved our issues.
> So far our only recovery has been to fail the MDS and then restart our NFS. Any help or advice will be appreciated on the CEPH side of things.
> I’m pretty sure we’re running with default tuning of CEPH MDS configuration parameters.
>
>
> Here are the relevant log entries.
>
> From my primary MDS server, I start seeing these entries start to pile up:
>
> 2016-05-31 14:34:07.091117 7f9f2eb87700  0 log_channel(cluster) log [WRN] : client.4283066 isn't responding to mclientcaps(revoke), ino 10000004491 pending pAsLsXsFsxcrwb issued pAsxLsXsxFsxcrwb, sent 63.877480 seconds ago\
> 2016-05-31 14:34:07.091129 7f9f2eb87700  0 log_channel(cluster) log [WRN] : client.4283066 isn't responding to mclientcaps(revoke), ino 10000005ddf pending pAsLsXsFsxcrwb issued pAsxLsXsxFsxcrwb, sent 63.877382 seconds ago\
> 2016-05-31 14:34:07.091133 7f9f2eb87700  0 log_channel(cluster) log [WRN] : client.4283066 isn't responding to mclientcaps(revoke), ino 10000000a2a pending pAsLsXsFsxcrwb issued pAsxLsXsxFsxcrwb, sent 63.877356 seconds ago
>
> From my NFS server, I see these entries from dmesg also start piling up:
> [Tue May 31 14:33:09 2016] libceph: skipping mds0 X.X.X.195:6800 seq 0 expected 4294967296
> [Tue May 31 14:33:09 2016] libceph: skipping mds0 X.X.X.195:6800 seq 1 expected 4294967296
> [Tue May 31 14:33:09 2016] libceph: skipping mds0 X.X.X.195:6800 seq 2 expected 4294967296
>

4294967296 is 0x100000000, so this looks like a sequence number overflow.

In src/msg/Message.h:

class Message {
...
  unsigned get_seq() const { return header.seq; }
  void set_seq(unsigned s) { header.seq = s; }
...
}

in src/msg/simple/Pipe.cc

class Pipe {
...
  __u32 get_out_seq() { return out_seq; }
...
}
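
To make the suspected truncation concrete, here is a tiny standalone
sketch (not Ceph code, and it assumes a platform where unsigned is 32 bits):

#include <cstdint>
#include <iostream>

int main() {
  uint64_t header_seq = 4294967296ULL;        // 0x100000000, what the receiver expects next
  unsigned truncated = (unsigned)header_seq;  // what a 32-bit accessor path puts on the wire
  std::cout << "on the wire: " << truncated
            << ", receiver expects: " << header_seq << std::endl;
  // prints "on the wire: 0, receiver expects: 4294967296", matching the
  // "seq 0 expected 4294967296" messages in the kernel log above
  return 0;
}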

Is this a bug or intentional?

Regards
Yan, Zheng


> Next, we find something like this on one of the OSDs.:
> 2016-05-31 14:34:44.130279 mon.0 XX.XX.XX.188:6789/0 1272184 : cluster [INF] HEALTH_WARN; mds0: Client storage-nfs-01 failing to respond to capability release
>
> Finally, I am seeing consistent HEALTH_WARN in my status regarding trimming which I am not sure if it is related:
>
> cluster XXXXXXXX-bd8f-4091-bed3-8586fd0d6b46
>      health HEALTH_WARN
>             mds0: Behind on trimming (67/30)
>      monmap e3: 3 mons at {storage02=X.X.X.190:6789/0,storage03=X.X.X.189:6789/0,storage04=X.X.X.188:6789/0}
>             election epoch 206, quorum 0,1,2 storage04,storage03,storage02
>       fsmap e74879: 1/1/1 up {0=cephfs-03=up:active}, 1 up:standby
>      osdmap e65516: 36 osds: 36 up, 36 in
>       pgmap v15435732: 4160 pgs, 3 pools, 37539 GB data, 9611 kobjects
>             75117 GB used, 53591 GB / 125 TB avail
>                 4160 active+clean
>   client io 334 MB/s rd, 319 MB/s wr, 5839 op/s rd, 4848 op/s wr
>
>
> Regards,
> James Webb
> DevOps Engineer, Engineering Tools
> Unity Technologies

* Re: [ceph-users] Message sequence overflow
  2016-06-01  8:49 Message sequence overflow Yan, Zheng
@ 2016-06-01 10:20 ` Ilya Dryomov
  2016-06-01 12:49 ` Sage Weil
  1 sibling, 0 replies; 12+ messages in thread
From: Ilya Dryomov @ 2016-06-01 10:20 UTC (permalink / raw)
  To: Yan, Zheng; +Cc: ceph-devel, ceph-users, James Webb, Sage Weil

> On Wed, Jun 1, 2016 at 6:15 AM, James Webb <jamesw@unity3d.com> wrote:
>> Dear ceph-users...
>>
>> My team runs an internal buildfarm using ceph as a backend storage platform. We’ve recently upgraded to Jewel and are having reliability issues that we need some help with.
>>
>> Our infrastructure is the following:
>> - We use CEPH/CEPHFS (10.2.1)
>> - We have 3 mons and 6 storage servers with a total of 36 OSDs (~4160 PGs).
>> - We use enterprise SSDs for everything including journals
>> - We have one main mds and one standby mds.
>> - We are using ceph kernel client to mount cephfs.
>> - We have upgrade to Ubuntu 16.04 (4.4.0-22-generic kernel)
>> - We are using a kernel NFS to serve NFS clients from a ceph mount (~ 32 nfs threads. 0 swappiness)
>> - These are physical machines with 8 cores & 32GB memory
>>
>> On a regular basis, we lose all IO via ceph FS. We’re still trying to isolate the issue but it surfaces as an issue between MDS and ceph client.
>> We can’t tell if our our NFS server is overwhelming the MDS or if this is some unrelated issue. Tuning NFS server has not solved our issues.
>> So far our only recovery has been to fail the MDS and then restart our NFS. Any help or advice will be appreciated on the CEPH side of things.
>> I’m pretty sure we’re running with default tuning of CEPH MDS configuration parameters.
>>
>>
>> Here are the relevant log entries.
>>
>> From my primary MDS server, I start seeing these entries start to pile up:
>>
>> 2016-05-31 14:34:07.091117 7f9f2eb87700  0 log_channel(cluster) log [WRN] : client.4283066 isn't responding to mclientcaps(revoke), ino 10000004491 pending pAsLsXsFsxcrwb issued pAsxLsXsxFsxcrwb, sent 63.877480 seconds ago\
>> 2016-05-31 14:34:07.091129 7f9f2eb87700  0 log_channel(cluster) log [WRN] : client.4283066 isn't responding to mclientcaps(revoke), ino 10000005ddf pending pAsLsXsFsxcrwb issued pAsxLsXsxFsxcrwb, sent 63.877382 seconds ago\
>> 2016-05-31 14:34:07.091133 7f9f2eb87700  0 log_channel(cluster) log [WRN] : client.4283066 isn't responding to mclientcaps(revoke), ino 10000000a2a pending pAsLsXsFsxcrwb issued pAsxLsXsxFsxcrwb, sent 63.877356 seconds ago
>>
>> From my NFS server, I see these entries from dmesg also start piling up:
>> [Tue May 31 14:33:09 2016] libceph: skipping mds0 X.X.X.195:6800 seq 0 expected 4294967296
>> [Tue May 31 14:33:09 2016] libceph: skipping mds0 X.X.X.195:6800 seq 1 expected 4294967296
>> [Tue May 31 14:33:09 2016] libceph: skipping mds0 X.X.X.195:6800 seq 2 expected 4294967296
>>
>
> 4294967296 is 0x100000000, this looks like sequence  overflow.
>
> In src/msg/Message.h:
>
> class Message {
> ...
>   unsigned get_seq() const { return header.seq; }
>   void set_seq(unsigned s) { header.seq = s; }
> ...
> }
>
> in src/msg/simple/Pipe.cc
>
> class Pipe {
> ...
>   __u32 get_out_seq() { return out_seq; }
> ...
> }
>
> Is this bug or intentional ?

Hrm, I think this is a bug^Woversight.  Sage's commit 9731226228dd
("convert more types in ceph_fs.h to __le* notation") from early 2008
changed ceph_msg_header's seq from __u32 to __le64 and also changed
dout()s in the kernel from %d to %lld, so the 32 -> 64 switch seems
like it was intentional.  Message::get/set_seq() remained unsigned...

The question is which do we fix now - changing the kernel client to
wrap at 32 would be less of a hassle and easier in terms of backporting,
but the problem is really in the userspace messenger.  Sage?

Thanks,

                Ilya

* Re: Message sequence overflow
  2016-06-01  8:49 Message sequence overflow Yan, Zheng
  2016-06-01 10:20 ` [ceph-users] " Ilya Dryomov
@ 2016-06-01 12:49 ` Sage Weil
  2016-06-01 13:26   ` [ceph-users] " Ilya Dryomov
       [not found]   ` <alpine.DEB.2.11.1606010847240.6221-Wo5lQnKln9t9PHm/lf2LFUEOCMrvLtNR@public.gmane.org>
  1 sibling, 2 replies; 12+ messages in thread
From: Sage Weil @ 2016-06-01 12:49 UTC (permalink / raw)
  To: Yan, Zheng; +Cc: ceph-devel, ceph-users, James Webb


On Wed, 1 Jun 2016, Yan, Zheng wrote:
> On Wed, Jun 1, 2016 at 6:15 AM, James Webb <jamesw@unity3d.com> wrote:
> > Dear ceph-users...
> >
> > My team runs an internal buildfarm using ceph as a backend storage platform. We’ve recently upgraded to Jewel and are having reliability issues that we need some help with.
> >
> > Our infrastructure is the following:
> > - We use CEPH/CEPHFS (10.2.1)
> > - We have 3 mons and 6 storage servers with a total of 36 OSDs (~4160 PGs).
> > - We use enterprise SSDs for everything including journals
> > - We have one main mds and one standby mds.
> > - We are using ceph kernel client to mount cephfs.
> > - We have upgrade to Ubuntu 16.04 (4.4.0-22-generic kernel)
> > - We are using a kernel NFS to serve NFS clients from a ceph mount (~ 32 nfs threads. 0 swappiness)
> > - These are physical machines with 8 cores & 32GB memory
> >
> > On a regular basis, we lose all IO via ceph FS. We’re still trying to isolate the issue but it surfaces as an issue between MDS and ceph client.
> > We can’t tell if our our NFS server is overwhelming the MDS or if this is some unrelated issue. Tuning NFS server has not solved our issues.
> > So far our only recovery has been to fail the MDS and then restart our NFS. Any help or advice will be appreciated on the CEPH side of things.
> > I’m pretty sure we’re running with default tuning of CEPH MDS configuration parameters.
> >
> >
> > Here are the relevant log entries.
> >
> > From my primary MDS server, I start seeing these entries start to pile up:
> >
> > 2016-05-31 14:34:07.091117 7f9f2eb87700  0 log_channel(cluster) log [WRN] : client.4283066 isn't responding to mclientcaps(revoke), ino 10000004491 pending pAsLsXsFsxcrwb issued pAsxLsXsxFsxcrwb, sent 63.877480 seconds ago\
> > 2016-05-31 14:34:07.091129 7f9f2eb87700  0 log_channel(cluster) log [WRN] : client.4283066 isn't responding to mclientcaps(revoke), ino 10000005ddf pending pAsLsXsFsxcrwb issued pAsxLsXsxFsxcrwb, sent 63.877382 seconds ago\
> > 2016-05-31 14:34:07.091133 7f9f2eb87700  0 log_channel(cluster) log [WRN] : client.4283066 isn't responding to mclientcaps(revoke), ino 10000000a2a pending pAsLsXsFsxcrwb issued pAsxLsXsxFsxcrwb, sent 63.877356 seconds ago
> >
> > From my NFS server, I see these entries from dmesg also start piling up:
> > [Tue May 31 14:33:09 2016] libceph: skipping mds0 X.X.X.195:6800 seq 0 expected 4294967296
> > [Tue May 31 14:33:09 2016] libceph: skipping mds0 X.X.X.195:6800 seq 1 expected 4294967296
> > [Tue May 31 14:33:09 2016] libceph: skipping mds0 X.X.X.195:6800 seq 2 expected 4294967296
> >
> 
> 4294967296 is 0x100000000, this looks like sequence  overflow.
> 
> In src/msg/Message.h:
> 
> class Message {
> ...
>   unsigned get_seq() const { return header.seq; }
>   void set_seq(unsigned s) { header.seq = s; }
> ...
> }
> 
> in src/msg/simple/Pipe.cc
> 
> class Pipe {
> ...
>   __u32 get_out_seq() { return out_seq; }
> ...
> }
> 
> Is this bug or intentional ?

That's a bug.  The seq values are intended to be 32 bits. 

(We should also be using the ceph_cmp_seq (IIRC) helper for any inequality 
checks, which does a sloppy comparison so that a 31-bit signed difference 
is used to determine > or <.  It sounds like in this case we're just 
failing an equality check, though.)
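
For reference, a wraparound-tolerant comparison along those lines looks
roughly like this (a sketch of the idea, not a copy of the actual helper):

#include <cstdint>

// Interpret the 32-bit difference as signed, so two sequence numbers that
// straddle the wrap point still compare in the intended order (valid as
// long as the two sides stay within 2^31 of each other).
static inline int32_t seq_cmp(uint32_t a, uint32_t b) {
  return (int32_t)(a - b);   // <0: a is older, 0: equal, >0: a is newer
}
// e.g. seq_cmp(2, 0xfffffffeu) > 0, so seq 2 counts as newer than 0xfffffffe.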

sage


> Regards
> Yan, Zheng
> 
> 
> > Next, we find something like this on one of the OSDs.:
> > 2016-05-31 14:34:44.130279 mon.0 XX.XX.XX.188:6789/0 1272184 : cluster [INF] HEALTH_WARN; mds0: Client storage-nfs-01 failing to respond to capability release
> >
> > Finally, I am seeing consistent HEALTH_WARN in my status regarding trimming which I am not sure if it is related:
> >
> > cluster XXXXXXXX-bd8f-4091-bed3-8586fd0d6b46
> >      health HEALTH_WARN
> >             mds0: Behind on trimming (67/30)
> >      monmap e3: 3 mons at {storage02=X.X.X.190:6789/0,storage03=X.X.X.189:6789/0,storage04=X.X.X.188:6789/0}
> >             election epoch 206, quorum 0,1,2 storage04,storage03,storage02
> >       fsmap e74879: 1/1/1 up {0=cephfs-03=up:active}, 1 up:standby
> >      osdmap e65516: 36 osds: 36 up, 36 in
> >       pgmap v15435732: 4160 pgs, 3 pools, 37539 GB data, 9611 kobjects
> >             75117 GB used, 53591 GB / 125 TB avail
> >                 4160 active+clean
> >   client io 334 MB/s rd, 319 MB/s wr, 5839 op/s rd, 4848 op/s wr
> >
> >
> > Regards,
> > James Webb
> > DevOps Engineer, Engineering Tools
> > Unity Technologies

* Re: [ceph-users] Message sequence overflow
  2016-06-01 12:49 ` Sage Weil
@ 2016-06-01 13:26   ` Ilya Dryomov
       [not found]   ` <alpine.DEB.2.11.1606010847240.6221-Wo5lQnKln9t9PHm/lf2LFUEOCMrvLtNR@public.gmane.org>
  1 sibling, 0 replies; 12+ messages in thread
From: Ilya Dryomov @ 2016-06-01 13:26 UTC (permalink / raw)
  To: Sage Weil; +Cc: Yan, Zheng, ceph-devel, James Webb, ceph-users

On Wed, Jun 1, 2016 at 2:49 PM, Sage Weil <sage@newdream.net> wrote:
> On Wed, 1 Jun 2016, Yan, Zheng wrote:
>> On Wed, Jun 1, 2016 at 6:15 AM, James Webb <jamesw@unity3d.com> wrote:
>> > Dear ceph-users...
>> >
>> > My team runs an internal buildfarm using ceph as a backend storage platform. We’ve recently upgraded to Jewel and are having reliability issues that we need some help with.
>> >
>> > Our infrastructure is the following:
>> > - We use CEPH/CEPHFS (10.2.1)
>> > - We have 3 mons and 6 storage servers with a total of 36 OSDs (~4160 PGs).
>> > - We use enterprise SSDs for everything including journals
>> > - We have one main mds and one standby mds.
>> > - We are using ceph kernel client to mount cephfs.
>> > - We have upgrade to Ubuntu 16.04 (4.4.0-22-generic kernel)
>> > - We are using a kernel NFS to serve NFS clients from a ceph mount (~ 32 nfs threads. 0 swappiness)
>> > - These are physical machines with 8 cores & 32GB memory
>> >
>> > On a regular basis, we lose all IO via ceph FS. We’re still trying to isolate the issue but it surfaces as an issue between MDS and ceph client.
>> > We can’t tell if our our NFS server is overwhelming the MDS or if this is some unrelated issue. Tuning NFS server has not solved our issues.
>> > So far our only recovery has been to fail the MDS and then restart our NFS. Any help or advice will be appreciated on the CEPH side of things.
>> > I’m pretty sure we’re running with default tuning of CEPH MDS configuration parameters.
>> >
>> >
>> > Here are the relevant log entries.
>> >
>> > From my primary MDS server, I start seeing these entries start to pile up:
>> >
>> > 2016-05-31 14:34:07.091117 7f9f2eb87700  0 log_channel(cluster) log [WRN] : client.4283066 isn't responding to mclientcaps(revoke), ino 10000004491 pending pAsLsXsFsxcrwb issued pAsxLsXsxFsxcrwb, sent 63.877480 seconds ago\
>> > 2016-05-31 14:34:07.091129 7f9f2eb87700  0 log_channel(cluster) log [WRN] : client.4283066 isn't responding to mclientcaps(revoke), ino 10000005ddf pending pAsLsXsFsxcrwb issued pAsxLsXsxFsxcrwb, sent 63.877382 seconds ago\
>> > 2016-05-31 14:34:07.091133 7f9f2eb87700  0 log_channel(cluster) log [WRN] : client.4283066 isn't responding to mclientcaps(revoke), ino 10000000a2a pending pAsLsXsFsxcrwb issued pAsxLsXsxFsxcrwb, sent 63.877356 seconds ago
>> >
>> > From my NFS server, I see these entries from dmesg also start piling up:
>> > [Tue May 31 14:33:09 2016] libceph: skipping mds0 X.X.X.195:6800 seq 0 expected 4294967296
>> > [Tue May 31 14:33:09 2016] libceph: skipping mds0 X.X.X.195:6800 seq 1 expected 4294967296
>> > [Tue May 31 14:33:09 2016] libceph: skipping mds0 X.X.X.195:6800 seq 2 expected 4294967296
>> >
>>
>> 4294967296 is 0x100000000, this looks like sequence  overflow.
>>
>> In src/msg/Message.h:
>>
>> class Message {
>> ...
>>   unsigned get_seq() const { return header.seq; }
>>   void set_seq(unsigned s) { header.seq = s; }
>> ...
>> }
>>
>> in src/msg/simple/Pipe.cc
>>
>> class Pipe {
>> ...
>>   __u32 get_out_seq() { return out_seq; }
>> ...
>> }
>>
>> Is this bug or intentional ?
>
> That's a bug.  The seq values are intended to be 32 bits.
>
> (We should also be using the ceph_cmp_seq (IIRC) helper for any inequality
> checks, which does a sloppy comparison so that a 31-bit signed difference
> is used to determine > or <.  It sounds like in this case we're just
> failing an equality check, though.)

ceph_seq_cmp().  I'll patch the kernel client then (if Zheng doesn't
get to it first).

Thanks,

                Ilya

* Re: Message sequence overflow
       [not found]   ` <alpine.DEB.2.11.1606010847240.6221-Wo5lQnKln9t9PHm/lf2LFUEOCMrvLtNR@public.gmane.org>
@ 2016-06-01 13:51     ` Yan, Zheng
       [not found]       ` <CAAM7YA=xU+Ovxf-QU6ruRKGhWS88-d0UaX1AbcR9oCLbEi9_jg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 12+ messages in thread
From: Yan, Zheng @ 2016-06-01 13:51 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel, James Webb, ceph-users-idqoXFIVOFJgJs9I8MT0rw

On Wed, Jun 1, 2016 at 8:49 PM, Sage Weil <sage@newdream.net> wrote:
> On Wed, 1 Jun 2016, Yan, Zheng wrote:
>> On Wed, Jun 1, 2016 at 6:15 AM, James Webb <jamesw@unity3d.com> wrote:
>> > Dear ceph-users...
>> >
>> > My team runs an internal buildfarm using ceph as a backend storage platform. We’ve recently upgraded to Jewel and are having reliability issues that we need some help with.
>> >
>> > Our infrastructure is the following:
>> > - We use CEPH/CEPHFS (10.2.1)
>> > - We have 3 mons and 6 storage servers with a total of 36 OSDs (~4160 PGs).
>> > - We use enterprise SSDs for everything including journals
>> > - We have one main mds and one standby mds.
>> > - We are using ceph kernel client to mount cephfs.
>> > - We have upgrade to Ubuntu 16.04 (4.4.0-22-generic kernel)
>> > - We are using a kernel NFS to serve NFS clients from a ceph mount (~ 32 nfs threads. 0 swappiness)
>> > - These are physical machines with 8 cores & 32GB memory
>> >
>> > On a regular basis, we lose all IO via ceph FS. We’re still trying to isolate the issue but it surfaces as an issue between MDS and ceph client.
>> > We can’t tell if our our NFS server is overwhelming the MDS or if this is some unrelated issue. Tuning NFS server has not solved our issues.
>> > So far our only recovery has been to fail the MDS and then restart our NFS. Any help or advice will be appreciated on the CEPH side of things.
>> > I’m pretty sure we’re running with default tuning of CEPH MDS configuration parameters.
>> >
>> >
>> > Here are the relevant log entries.
>> >
>> > From my primary MDS server, I start seeing these entries start to pile up:
>> >
>> > 2016-05-31 14:34:07.091117 7f9f2eb87700  0 log_channel(cluster) log [WRN] : client.4283066 isn't responding to mclientcaps(revoke), ino 10000004491 pending pAsLsXsFsxcrwb issued pAsxLsXsxFsxcrwb, sent 63.877480 seconds ago\
>> > 2016-05-31 14:34:07.091129 7f9f2eb87700  0 log_channel(cluster) log [WRN] : client.4283066 isn't responding to mclientcaps(revoke), ino 10000005ddf pending pAsLsXsFsxcrwb issued pAsxLsXsxFsxcrwb, sent 63.877382 seconds ago\
>> > 2016-05-31 14:34:07.091133 7f9f2eb87700  0 log_channel(cluster) log [WRN] : client.4283066 isn't responding to mclientcaps(revoke), ino 10000000a2a pending pAsLsXsFsxcrwb issued pAsxLsXsxFsxcrwb, sent 63.877356 seconds ago
>> >
>> > From my NFS server, I see these entries from dmesg also start piling up:
>> > [Tue May 31 14:33:09 2016] libceph: skipping mds0 X.X.X.195:6800 seq 0 expected 4294967296
>> > [Tue May 31 14:33:09 2016] libceph: skipping mds0 X.X.X.195:6800 seq 1 expected 4294967296
>> > [Tue May 31 14:33:09 2016] libceph: skipping mds0 X.X.X.195:6800 seq 2 expected 4294967296
>> >
>>
>> 4294967296 is 0x100000000, this looks like sequence  overflow.
>>
>> In src/msg/Message.h:
>>
>> class Message {
>> ...
>>   unsigned get_seq() const { return header.seq; }
>>   void set_seq(unsigned s) { header.seq = s; }
>> ...
>> }
>>
>> in src/msg/simple/Pipe.cc
>>
>> class Pipe {
>> ...
>>   __u32 get_out_seq() { return out_seq; }
>> ...
>> }
>>
>> Is this bug or intentional ?
>
> That's a bug.  The seq values are intended to be 32 bits.
>
> (We should also be using the ceph_cmp_seq (IIRC) helper for any inequality
> checks, which does a sloppy comparison so that a 31-bit signed difference
> is used to determine > or <.  It sounds like in this case we're just
> failing an equality check, though.)
>

struct ceph_msg_header {
        __le64 seq;       /* message seq# for this session */
        ...
}

You mean we should leave the upper 32 bits unused?
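
(For illustration only, one way to read that option; the function name is
made up and this is not actual Ceph code:)

#include <cstdint>

// Hypothetical sketch of the "wrap at 32 bits" reading: the on-wire field
// stays 64 bits wide, but the sender masks the counter so the upper
// 32 bits are always zero and the value wraps 0xffffffff -> 0.
static uint64_t next_out_seq(uint64_t &out_seq) {
  out_seq = (out_seq + 1) & 0xffffffffULL;
  return out_seq;
}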


> sage
>
>
>> Regards
>> Yan, Zheng
>>
>>
>> > Next, we find something like this on one of the OSDs.:
>> > 2016-05-31 14:34:44.130279 mon.0 XX.XX.XX.188:6789/0 1272184 : cluster [INF] HEALTH_WARN; mds0: Client storage-nfs-01 failing to respond to capability release
>> >
>> > Finally, I am seeing consistent HEALTH_WARN in my status regarding trimming which I am not sure if it is related:
>> >
>> > cluster XXXXXXXX-bd8f-4091-bed3-8586fd0d6b46
>> >      health HEALTH_WARN
>> >             mds0: Behind on trimming (67/30)
>> >      monmap e3: 3 mons at {storage02=X.X.X.190:6789/0,storage03=X.X.X.189:6789/0,storage04=X.X.X.188:6789/0}
>> >             election epoch 206, quorum 0,1,2 storage04,storage03,storage02
>> >       fsmap e74879: 1/1/1 up {0=cephfs-03=up:active}, 1 up:standby
>> >      osdmap e65516: 36 osds: 36 up, 36 in
>> >       pgmap v15435732: 4160 pgs, 3 pools, 37539 GB data, 9611 kobjects
>> >             75117 GB used, 53591 GB / 125 TB avail
>> >                 4160 active+clean
>> >   client io 334 MB/s rd, 319 MB/s wr, 5839 op/s rd, 4848 op/s wr
>> >
>> >
>> > Regards,
>> > James Webb
>> > DevOps Engineer, Engineering Tools
>> > Unity Technologies

* Re: Message sequence overflow
       [not found]       ` <CAAM7YA=xU+Ovxf-QU6ruRKGhWS88-d0UaX1AbcR9oCLbEi9_jg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2016-06-01 14:22         ` Sage Weil
  2016-06-01 14:46           ` [ceph-users] " Ilya Dryomov
                             ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Sage Weil @ 2016-06-01 14:22 UTC (permalink / raw)
  To: Yan, Zheng; +Cc: ceph-devel, James Webb, ceph-users-idqoXFIVOFJgJs9I8MT0rw


On Wed, 1 Jun 2016, Yan, Zheng wrote:
> On Wed, Jun 1, 2016 at 8:49 PM, Sage Weil <sage-BnTBU8nroG7k1uMJSBkQmQ@public.gmane.org> wrote:
> > On Wed, 1 Jun 2016, Yan, Zheng wrote:
> >> On Wed, Jun 1, 2016 at 6:15 AM, James Webb <jamesw-pKHY6dePtwZBDgjK7y7TUQ@public.gmane.org> wrote:
> >> > Dear ceph-users...
> >> >
> >> > My team runs an internal buildfarm using ceph as a backend storage platform. We’ve recently upgraded to Jewel and are having reliability issues that we need some help with.
> >> >
> >> > Our infrastructure is the following:
> >> > - We use CEPH/CEPHFS (10.2.1)
> >> > - We have 3 mons and 6 storage servers with a total of 36 OSDs (~4160 PGs).
> >> > - We use enterprise SSDs for everything including journals
> >> > - We have one main mds and one standby mds.
> >> > - We are using ceph kernel client to mount cephfs.
> >> > - We have upgrade to Ubuntu 16.04 (4.4.0-22-generic kernel)
> >> > - We are using a kernel NFS to serve NFS clients from a ceph mount (~ 32 nfs threads. 0 swappiness)
> >> > - These are physical machines with 8 cores & 32GB memory
> >> >
> >> > On a regular basis, we lose all IO via ceph FS. We’re still trying to isolate the issue but it surfaces as an issue between MDS and ceph client.
> >> > We can’t tell if our our NFS server is overwhelming the MDS or if this is some unrelated issue. Tuning NFS server has not solved our issues.
> >> > So far our only recovery has been to fail the MDS and then restart our NFS. Any help or advice will be appreciated on the CEPH side of things.
> >> > I’m pretty sure we’re running with default tuning of CEPH MDS configuration parameters.
> >> >
> >> >
> >> > Here are the relevant log entries.
> >> >
> >> > From my primary MDS server, I start seeing these entries start to pile up:
> >> >
> >> > 2016-05-31 14:34:07.091117 7f9f2eb87700  0 log_channel(cluster) log [WRN] : client.4283066 isn't responding to mclientcaps(revoke), ino 10000004491 pending pAsLsXsFsxcrwb issued pAsxLsXsxFsxcrwb, sent 63.877480 seconds ago\
> >> > 2016-05-31 14:34:07.091129 7f9f2eb87700  0 log_channel(cluster) log [WRN] : client.4283066 isn't responding to mclientcaps(revoke), ino 10000005ddf pending pAsLsXsFsxcrwb issued pAsxLsXsxFsxcrwb, sent 63.877382 seconds ago\
> >> > 2016-05-31 14:34:07.091133 7f9f2eb87700  0 log_channel(cluster) log [WRN] : client.4283066 isn't responding to mclientcaps(revoke), ino 10000000a2a pending pAsLsXsFsxcrwb issued pAsxLsXsxFsxcrwb, sent 63.877356 seconds ago
> >> >
> >> > From my NFS server, I see these entries from dmesg also start piling up:
> >> > [Tue May 31 14:33:09 2016] libceph: skipping mds0 X.X.X.195:6800 seq 0 expected 4294967296
> >> > [Tue May 31 14:33:09 2016] libceph: skipping mds0 X.X.X.195:6800 seq 1 expected 4294967296
> >> > [Tue May 31 14:33:09 2016] libceph: skipping mds0 X.X.X.195:6800 seq 2 expected 4294967296
> >> >
> >>
> >> 4294967296 is 0x100000000, this looks like sequence  overflow.
> >>
> >> In src/msg/Message.h:
> >>
> >> class Message {
> >> ...
> >>   unsigned get_seq() const { return header.seq; }
> >>   void set_seq(unsigned s) { header.seq = s; }
> >> ...
> >> }
> >>
> >> in src/msg/simple/Pipe.cc
> >>
> >> class Pipe {
> >> ...
> >>   __u32 get_out_seq() { return out_seq; }
> >> ...
> >> }
> >>
> >> Is this bug or intentional ?
> >
> > That's a bug.  The seq values are intended to be 32 bits.
> >
> > (We should also be using the ceph_cmp_seq (IIRC) helper for any inequality
> > checks, which does a sloppy comparison so that a 31-bit signed difference
> > is used to determine > or <.  It sounds like in this case we're just
> > failing an equality check, though.)
> >
> 
> struct ceph_msg_header {
>         __le64 seq;       /* message seq# for this session */
>         ...
> }
> 
> you means we should leave the upper 32-bits unused?

Oh, hmm. I'm confusing this with the cap seq (which is 32 bits).  

I think we can safely go either way... the question is which path is
easier.  If we move to 32 bits on the kernel side, will userspace
also need to be patched to make reconnect work?  That unsigned get_seq()
is only 32 bits wide.

If we go with 64 bits, userspace still needs to be fixed to change that 
unsigned to uint64_t.
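
Something along these lines, i.e. just widening the accessors and the Pipe
counters (an untested sketch of the shape of the change, with stand-in
names, not an actual patch):

#include <cstdint>

struct msg_header_sketch { uint64_t seq; /* ... */ };  // stand-in for ceph_msg_header

class MessageSketch {
  msg_header_sketch header;
public:
  uint64_t get_seq() const { return header.seq; }   // was: unsigned
  void set_seq(uint64_t s) { header.seq = s; }      // was: void set_seq(unsigned s)
};
// ...and the corresponding in_seq/out_seq counters in Pipe would become
// 64-bit as well, instead of __u32.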

What do you think?
sage


> 
> 
> > sage
> >
> >
> >> Regards
> >> Yan, Zheng
> >>
> >>
> >> > Next, we find something like this on one of the OSDs.:
> >> > 2016-05-31 14:34:44.130279 mon.0 XX.XX.XX.188:6789/0 1272184 : cluster [INF] HEALTH_WARN; mds0: Client storage-nfs-01 failing to respond to capability release
> >> >
> >> > Finally, I am seeing consistent HEALTH_WARN in my status regarding trimming which I am not sure if it is related:
> >> >
> >> > cluster XXXXXXXX-bd8f-4091-bed3-8586fd0d6b46
> >> >      health HEALTH_WARN
> >> >             mds0: Behind on trimming (67/30)
> >> >      monmap e3: 3 mons at {storage02=X.X.X.190:6789/0,storage03=X.X.X.189:6789/0,storage04=X.X.X.188:6789/0}
> >> >             election epoch 206, quorum 0,1,2 storage04,storage03,storage02
> >> >       fsmap e74879: 1/1/1 up {0=cephfs-03=up:active}, 1 up:standby
> >> >      osdmap e65516: 36 osds: 36 up, 36 in
> >> >       pgmap v15435732: 4160 pgs, 3 pools, 37539 GB data, 9611 kobjects
> >> >             75117 GB used, 53591 GB / 125 TB avail
> >> >                 4160 active+clean
> >> >   client io 334 MB/s rd, 319 MB/s wr, 5839 op/s rd, 4848 op/s wr
> >> >
> >> >
> >> > Regards,
> >> > James Webb
> >> > DevOps Engineer, Engineering Tools
> >> > Unity Technologies

* Re: [ceph-users] Message sequence overflow
  2016-06-01 14:22         ` Sage Weil
@ 2016-06-01 14:46           ` Ilya Dryomov
  2016-06-02  2:46           ` Yan, Zheng
  2016-06-02  3:07           ` Alex Elder
  2 siblings, 0 replies; 12+ messages in thread
From: Ilya Dryomov @ 2016-06-01 14:46 UTC (permalink / raw)
  To: Sage Weil; +Cc: Yan, Zheng, ceph-devel, James Webb, ceph-users

On Wed, Jun 1, 2016 at 4:22 PM, Sage Weil <sage@newdream.net> wrote:
> On Wed, 1 Jun 2016, Yan, Zheng wrote:
>> On Wed, Jun 1, 2016 at 8:49 PM, Sage Weil <sage@newdream.net> wrote:
>> > On Wed, 1 Jun 2016, Yan, Zheng wrote:
>> >> On Wed, Jun 1, 2016 at 6:15 AM, James Webb <jamesw@unity3d.com> wrote:
>> >> > Dear ceph-users...
>> >> >
>> >> > My team runs an internal buildfarm using ceph as a backend storage platform. We’ve recently upgraded to Jewel and are having reliability issues that we need some help with.
>> >> >
>> >> > Our infrastructure is the following:
>> >> > - We use CEPH/CEPHFS (10.2.1)
>> >> > - We have 3 mons and 6 storage servers with a total of 36 OSDs (~4160 PGs).
>> >> > - We use enterprise SSDs for everything including journals
>> >> > - We have one main mds and one standby mds.
>> >> > - We are using ceph kernel client to mount cephfs.
>> >> > - We have upgrade to Ubuntu 16.04 (4.4.0-22-generic kernel)
>> >> > - We are using a kernel NFS to serve NFS clients from a ceph mount (~ 32 nfs threads. 0 swappiness)
>> >> > - These are physical machines with 8 cores & 32GB memory
>> >> >
>> >> > On a regular basis, we lose all IO via ceph FS. We’re still trying to isolate the issue but it surfaces as an issue between MDS and ceph client.
>> >> > We can’t tell if our our NFS server is overwhelming the MDS or if this is some unrelated issue. Tuning NFS server has not solved our issues.
>> >> > So far our only recovery has been to fail the MDS and then restart our NFS. Any help or advice will be appreciated on the CEPH side of things.
>> >> > I’m pretty sure we’re running with default tuning of CEPH MDS configuration parameters.
>> >> >
>> >> >
>> >> > Here are the relevant log entries.
>> >> >
>> >> > From my primary MDS server, I start seeing these entries start to pile up:
>> >> >
>> >> > 2016-05-31 14:34:07.091117 7f9f2eb87700  0 log_channel(cluster) log [WRN] : client.4283066 isn't responding to mclientcaps(revoke), ino 10000004491 pending pAsLsXsFsxcrwb issued pAsxLsXsxFsxcrwb, sent 63.877480 seconds ago\
>> >> > 2016-05-31 14:34:07.091129 7f9f2eb87700  0 log_channel(cluster) log [WRN] : client.4283066 isn't responding to mclientcaps(revoke), ino 10000005ddf pending pAsLsXsFsxcrwb issued pAsxLsXsxFsxcrwb, sent 63.877382 seconds ago\
>> >> > 2016-05-31 14:34:07.091133 7f9f2eb87700  0 log_channel(cluster) log [WRN] : client.4283066 isn't responding to mclientcaps(revoke), ino 10000000a2a pending pAsLsXsFsxcrwb issued pAsxLsXsxFsxcrwb, sent 63.877356 seconds ago
>> >> >
>> >> > From my NFS server, I see these entries from dmesg also start piling up:
>> >> > [Tue May 31 14:33:09 2016] libceph: skipping mds0 X.X.X.195:6800 seq 0 expected 4294967296
>> >> > [Tue May 31 14:33:09 2016] libceph: skipping mds0 X.X.X.195:6800 seq 1 expected 4294967296
>> >> > [Tue May 31 14:33:09 2016] libceph: skipping mds0 X.X.X.195:6800 seq 2 expected 4294967296
>> >> >
>> >>
>> >> 4294967296 is 0x100000000, this looks like sequence  overflow.
>> >>
>> >> In src/msg/Message.h:
>> >>
>> >> class Message {
>> >> ...
>> >>   unsigned get_seq() const { return header.seq; }
>> >>   void set_seq(unsigned s) { header.seq = s; }
>> >> ...
>> >> }
>> >>
>> >> in src/msg/simple/Pipe.cc
>> >>
>> >> class Pipe {
>> >> ...
>> >>   __u32 get_out_seq() { return out_seq; }
>> >> ...
>> >> }
>> >>
>> >> Is this bug or intentional ?
>> >
>> > That's a bug.  The seq values are intended to be 32 bits.
>> >
>> > (We should also be using the ceph_cmp_seq (IIRC) helper for any inequality
>> > checks, which does a sloppy comparison so that a 31-bit signed difference
>> > is used to determine > or <.  It sounds like in this case we're just
>> > failing an equality check, though.)
>> >
>>
>> struct ceph_msg_header {
>>         __le64 seq;       /* message seq# for this session */
>>         ...
>> }
>>
>> you means we should leave the upper 32-bits unused?
>
> Oh, hmm. I'm confusing this with the cap seq (which is 32 bits).
>
> I think we can safely go either way.. the question is which path is
> easier.  If we move to 32 bits used on the kernel side, will userspace
> also need to be patched to make reconnect work?  That unsigned get_seq()
> is only 32-bits wide.
>
> If we go with 64 bits, userspace still needs to be fixed to change that
> unsigned to uint64_t.
>
> What do you think?

Did you guys see my previous message?  I didn't bother to check what
kind of sequence numbers ceph_seq_cmp() takes - it seems to be unused BTW,
so I got diverted by Sage's reply.

> Hrm, I think this a bug^Woversight.  Sage's commit 9731226228dd
> ("convert more types in ceph_fs.h to __le* notation") from early 2008
> changed ceph_msg_header's seq from __u32 to __le64 and also changed
> dout()s in the kernel from %d to %lld, so the 32 -> 64 switch seems
> like it was intentional.  Message::get/set_seq() remained unsigned...
>
> The question is which do we fix now - changing the kernel client to
> wrap at 32 would be less of a hassle and easier in terms of backporting,
> but the problem is really in the userspace messenger.  Sage?

Patching the kernel should be trivial - I don't think userspace would
need to be patched in this case.  The change can also be backported
easily.  Going with 64 bits in userspace would be a lot messier...

OTOH the "bug" is in the userspace messenger, so I'd vote for fixing
the userspace, especially if we can piggy back on one of the looming
feature bits for this.
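
(Very roughly, the idea would be something like the sketch below; the
feature flag is hypothetical, this just illustrates gating the behaviour
on negotiation the way the existing has_feature() checks do:)

#include <cstdint>

// Hypothetical sketch: use the full 64-bit seq only when both peers
// advertise support (the feature name below is made up); otherwise keep
// the historical 32-bit wrap so unpatched peers keep working.
static uint64_t effective_out_seq(uint64_t seq, bool peer_has_64bit_seq_feature) {
  return peer_has_64bit_seq_feature ? seq : (seq & 0xffffffffULL);
}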

Thanks,

                Ilya


* Re: Message sequence overflow
  2016-06-01 14:22         ` Sage Weil
  2016-06-01 14:46           ` [ceph-users] " Ilya Dryomov
@ 2016-06-02  2:46           ` Yan, Zheng
       [not found]             ` <CAAM7YAmzpqAoh0TuVj0zCXPgrM=0BjfbwD+C3rJa=0CKwmRhAA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2016-06-02  3:07           ` Alex Elder
  2 siblings, 1 reply; 12+ messages in thread
From: Yan, Zheng @ 2016-06-02  2:46 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel, ceph-users, James Webb

On Wed, Jun 1, 2016 at 10:22 PM, Sage Weil <sage@newdream.net> wrote:
> On Wed, 1 Jun 2016, Yan, Zheng wrote:
>> On Wed, Jun 1, 2016 at 8:49 PM, Sage Weil <sage@newdream.net> wrote:
>> > On Wed, 1 Jun 2016, Yan, Zheng wrote:
>> >> On Wed, Jun 1, 2016 at 6:15 AM, James Webb <jamesw@unity3d.com> wrote:
>> >> > Dear ceph-users...
>> >> >
>> >> > My team runs an internal buildfarm using ceph as a backend storage platform. We’ve recently upgraded to Jewel and are having reliability issues that we need some help with.
>> >> >
>> >> > Our infrastructure is the following:
>> >> > - We use CEPH/CEPHFS (10.2.1)
>> >> > - We have 3 mons and 6 storage servers with a total of 36 OSDs (~4160 PGs).
>> >> > - We use enterprise SSDs for everything including journals
>> >> > - We have one main mds and one standby mds.
>> >> > - We are using ceph kernel client to mount cephfs.
>> >> > - We have upgrade to Ubuntu 16.04 (4.4.0-22-generic kernel)
>> >> > - We are using a kernel NFS to serve NFS clients from a ceph mount (~ 32 nfs threads. 0 swappiness)
>> >> > - These are physical machines with 8 cores & 32GB memory
>> >> >
>> >> > On a regular basis, we lose all IO via ceph FS. We’re still trying to isolate the issue but it surfaces as an issue between MDS and ceph client.
>> >> > We can’t tell if our our NFS server is overwhelming the MDS or if this is some unrelated issue. Tuning NFS server has not solved our issues.
>> >> > So far our only recovery has been to fail the MDS and then restart our NFS. Any help or advice will be appreciated on the CEPH side of things.
>> >> > I’m pretty sure we’re running with default tuning of CEPH MDS configuration parameters.
>> >> >
>> >> >
>> >> > Here are the relevant log entries.
>> >> >
>> >> > From my primary MDS server, I start seeing these entries start to pile up:
>> >> >
>> >> > 2016-05-31 14:34:07.091117 7f9f2eb87700  0 log_channel(cluster) log [WRN] : client.4283066 isn't responding to mclientcaps(revoke), ino 10000004491 pending pAsLsXsFsxcrwb issued pAsxLsXsxFsxcrwb, sent 63.877480 seconds ago\
>> >> > 2016-05-31 14:34:07.091129 7f9f2eb87700  0 log_channel(cluster) log [WRN] : client.4283066 isn't responding to mclientcaps(revoke), ino 10000005ddf pending pAsLsXsFsxcrwb issued pAsxLsXsxFsxcrwb, sent 63.877382 seconds ago\
>> >> > 2016-05-31 14:34:07.091133 7f9f2eb87700  0 log_channel(cluster) log [WRN] : client.4283066 isn't responding to mclientcaps(revoke), ino 10000000a2a pending pAsLsXsFsxcrwb issued pAsxLsXsxFsxcrwb, sent 63.877356 seconds ago
>> >> >
>> >> > From my NFS server, I see these entries from dmesg also start piling up:
>> >> > [Tue May 31 14:33:09 2016] libceph: skipping mds0 X.X.X.195:6800 seq 0 expected 4294967296
>> >> > [Tue May 31 14:33:09 2016] libceph: skipping mds0 X.X.X.195:6800 seq 1 expected 4294967296
>> >> > [Tue May 31 14:33:09 2016] libceph: skipping mds0 X.X.X.195:6800 seq 2 expected 4294967296
>> >> >
>> >>
>> >> 4294967296 is 0x100000000, this looks like sequence  overflow.
>> >>
>> >> In src/msg/Message.h:
>> >>
>> >> class Message {
>> >> ...
>> >>   unsigned get_seq() const { return header.seq; }
>> >>   void set_seq(unsigned s) { header.seq = s; }
>> >> ...
>> >> }
>> >>
>> >> in src/msg/simple/Pipe.cc
>> >>
>> >> class Pipe {
>> >> ...
>> >>   __u32 get_out_seq() { return out_seq; }
>> >> ...
>> >> }
>> >>
>> >> Is this bug or intentional ?
>> >
>> > That's a bug.  The seq values are intended to be 32 bits.
>> >
>> > (We should also be using the ceph_cmp_seq (IIRC) helper for any inequality
>> > checks, which does a sloppy comparison so that a 31-bit signed difference
>> > is used to determine > or <.  It sounds like in this case we're just
>> > failing an equality check, though.)
>> >
>>
>> struct ceph_msg_header {
>>         __le64 seq;       /* message seq# for this session */
>>         ...
>> }
>>
>> you means we should leave the upper 32-bits unused?
>
> Oh, hmm. I'm confusing this with the cap seq (which is 32 bits).
>
> I think we can safely go either way.. the question is which path is
> easier.  If we move to 32 bits used on the kernel side, will userspace
> also need to be patched to make reconnect work?  That unsigned get_seq()
> is only 32-bits wide.

I don't think userspace needs to be patched.

>
> If we go with 64 bits, userspace still needs to be fixed to change that
> unsigned to uint64_t.
>
> What do you think?
> sage
>

I like the 64-bit approach. Here is the userspace code that checks the
message sequence:

Pipe::reader() {

...
      if (m->get_seq() <= in_seq) {
        ldout(msgr->cct,0) << "reader got old message "
                << m->get_seq() << " <= " << in_seq << " " << m << " " << *m
                << ", discarding" << dendl;

        msgr->dispatch_throttle_release(m->get_dispatch_throttle_size());
        m->put();

        if (connection_state->has_feature(CEPH_FEATURE_RECONNECT_SEQ) &&
            msgr->cct->_conf->ms_die_on_old_message)
          assert(0 == "old msgs despite reconnect_seq feature");
        continue;
      }
      if (m->get_seq() > in_seq + 1) {
        ldout(msgr->cct,0) << "reader missed message?  skipped from seq "
                           << in_seq << " to " << m->get_seq() << dendl;
        if (msgr->cct->_conf->ms_die_on_skipped_message)
          assert(0 == "skipped incoming seq");
      }
      m->set_connection(connection_state.get());
      // note last received message.
      in_seq = m->get_seq();
...
}

Looks like the code works fine even when the two ends of the connection
use different widths. We don't need to worry about the change breaking
interoperability between patched and unpatched userspace.

Regards
Yan, Zheng

>
>>
>>
>> > sage
>> >
>> >
>> >> Regards
>> >> Yan, Zheng
>> >>
>> >>
>> >> > Next, we find something like this on one of the OSDs.:
>> >> > 2016-05-31 14:34:44.130279 mon.0 XX.XX.XX.188:6789/0 1272184 : cluster [INF] HEALTH_WARN; mds0: Client storage-nfs-01 failing to respond to capability release
>> >> >
>> >> > Finally, I am seeing consistent HEALTH_WARN in my status regarding trimming which I am not sure if it is related:
>> >> >
>> >> > cluster XXXXXXXX-bd8f-4091-bed3-8586fd0d6b46
>> >> >      health HEALTH_WARN
>> >> >             mds0: Behind on trimming (67/30)
>> >> >      monmap e3: 3 mons at {storage02=X.X.X.190:6789/0,storage03=X.X.X.189:6789/0,storage04=X.X.X.188:6789/0}
>> >> >             election epoch 206, quorum 0,1,2 storage04,storage03,storage02
>> >> >       fsmap e74879: 1/1/1 up {0=cephfs-03=up:active}, 1 up:standby
>> >> >      osdmap e65516: 36 osds: 36 up, 36 in
>> >> >       pgmap v15435732: 4160 pgs, 3 pools, 37539 GB data, 9611 kobjects
>> >> >             75117 GB used, 53591 GB / 125 TB avail
>> >> >                 4160 active+clean
>> >> >   client io 334 MB/s rd, 319 MB/s wr, 5839 op/s rd, 4848 op/s wr
>> >> >
>> >> >
>> >> > Regards,
>> >> > James Webb
>> >> > DevOps Engineer, Engineering Tools
>> >> > Unity Technologies

* Re: Message sequence overflow
  2016-06-01 14:22         ` Sage Weil
  2016-06-01 14:46           ` [ceph-users] " Ilya Dryomov
  2016-06-02  2:46           ` Yan, Zheng
@ 2016-06-02  3:07           ` Alex Elder
  2 siblings, 0 replies; 12+ messages in thread
From: Alex Elder @ 2016-06-02  3:07 UTC (permalink / raw)
  To: Sage Weil, Yan, Zheng; +Cc: ceph-devel, ceph-users, James Webb

On 06/01/2016 09:22 AM, Sage Weil wrote:
> On Wed, 1 Jun 2016, Yan, Zheng wrote:
>> On Wed, Jun 1, 2016 at 8:49 PM, Sage Weil <sage@newdream.net> wrote:
>>> On Wed, 1 Jun 2016, Yan, Zheng wrote:
>>>> On Wed, Jun 1, 2016 at 6:15 AM, James Webb <jamesw@unity3d.com> wrote:
>>>>> Dear ceph-users...
>>>>>
>>>>> My team runs an internal buildfarm using ceph as a backend storage platform. We’ve recently upgraded to Jewel and are having reliability issues that we need some help with.
>>>>>
>>>>> Our infrastructure is the following:
>>>>> - We use CEPH/CEPHFS (10.2.1)
>>>>> - We have 3 mons and 6 storage servers with a total of 36 OSDs (~4160 PGs).
>>>>> - We use enterprise SSDs for everything including journals
>>>>> - We have one main mds and one standby mds.
>>>>> - We are using ceph kernel client to mount cephfs.
>>>>> - We have upgrade to Ubuntu 16.04 (4.4.0-22-generic kernel)
>>>>> - We are using a kernel NFS to serve NFS clients from a ceph mount (~ 32 nfs threads. 0 swappiness)
>>>>> - These are physical machines with 8 cores & 32GB memory
>>>>>
>>>>> On a regular basis, we lose all IO via ceph FS. We’re still trying to isolate the issue but it surfaces as an issue between MDS and ceph client.
>>>>> We can’t tell if our our NFS server is overwhelming the MDS or if this is some unrelated issue. Tuning NFS server has not solved our issues.
>>>>> So far our only recovery has been to fail the MDS and then restart our NFS. Any help or advice will be appreciated on the CEPH side of things.
>>>>> I’m pretty sure we’re running with default tuning of CEPH MDS configuration parameters.
>>>>>
>>>>>
>>>>> Here are the relevant log entries.
>>>>>
>>>>> From my primary MDS server, I start seeing these entries start to pile up:
>>>>>
>>>>> 2016-05-31 14:34:07.091117 7f9f2eb87700  0 log_channel(cluster) log [WRN] : client.4283066 isn't responding to mclientcaps(revoke), ino 10000004491 pending pAsLsXsFsxcrwb issued pAsxLsXsxFsxcrwb, sent 63.877480 seconds ago\
>>>>> 2016-05-31 14:34:07.091129 7f9f2eb87700  0 log_channel(cluster) log [WRN] : client.4283066 isn't responding to mclientcaps(revoke), ino 10000005ddf pending pAsLsXsFsxcrwb issued pAsxLsXsxFsxcrwb, sent 63.877382 seconds ago\
>>>>> 2016-05-31 14:34:07.091133 7f9f2eb87700  0 log_channel(cluster) log [WRN] : client.4283066 isn't responding to mclientcaps(revoke), ino 10000000a2a pending pAsLsXsFsxcrwb issued pAsxLsXsxFsxcrwb, sent 63.877356 seconds ago
>>>>>
>>>>> From my NFS server, I see these entries from dmesg also start piling up:
>>>>> [Tue May 31 14:33:09 2016] libceph: skipping mds0 X.X.X.195:6800 seq 0 expected 4294967296
>>>>> [Tue May 31 14:33:09 2016] libceph: skipping mds0 X.X.X.195:6800 seq 1 expected 4294967296
>>>>> [Tue May 31 14:33:09 2016] libceph: skipping mds0 X.X.X.195:6800 seq 2 expected 4294967296
>>>>>
>>>>
>>>> 4294967296 is 0x100000000, this looks like sequence  overflow.
>>>>
>>>> In src/msg/Message.h:
>>>>
>>>> class Message {
>>>> ...
>>>>   unsigned get_seq() const { return header.seq; }
>>>>   void set_seq(unsigned s) { header.seq = s; }
>>>> ...
>>>> }
>>>>
>>>> in src/msg/simple/Pipe.cc
>>>>
>>>> class Pipe {
>>>> ...
>>>>   __u32 get_out_seq() { return out_seq; }
>>>> ...
>>>> }
>>>>
>>>> Is this bug or intentional ?
>>>
>>> That's a bug.  The seq values are intended to be 32 bits.
>>>
>>> (We should also be using the ceph_cmp_seq (IIRC) helper for any inequality
>>> checks, which does a sloppy comparison so that a 31-bit signed difference
>>> is used to determine > or <.  It sounds like in this case we're just
>>> failing an equality check, though.)
>>>
>>
>> struct ceph_msg_header {
>>         __le64 seq;       /* message seq# for this session */
>>         ...
>> }
>>
>> you means we should leave the upper 32-bits unused?
> 
> Oh, hmm. I'm confusing this with the cap seq (which is 32 bits).  
> 
> I think we can safely go either way.. the question is which path is 
> easier.  If we move to 32 bits used on the kernel side, will userspace 
> also need to be patched to make reconnect work?  That unsigned get_seq() 
> is only 32-bits wide.

What is the value of a 64-bit message sequence number?
Does the message sequence number only have to be big enough
to be unique for all conceivable in-flight messages?  If so,
32 bits (or maybe less, but I wouldn't suggest that)
might be enough, and 64 bits might have been over-designed.

					-Alex

> If we go with 64 bits, userspace still needs to be fixed to change that 
> unsigned to uint64_t.
> 
> What do you think?
> sage
> 
> 
>>
>>
>>> sage
>>>
>>>
>>>> Regards
>>>> Yan, Zheng
>>>>
>>>>
>>>>> Next, we find something like this on one of the OSDs.:
>>>>> 2016-05-31 14:34:44.130279 mon.0 XX.XX.XX.188:6789/0 1272184 : cluster [INF] HEALTH_WARN; mds0: Client storage-nfs-01 failing to respond to capability release
>>>>>
>>>>> Finally, I am seeing consistent HEALTH_WARN in my status regarding trimming which I am not sure if it is related:
>>>>>
>>>>> cluster XXXXXXXX-bd8f-4091-bed3-8586fd0d6b46
>>>>>      health HEALTH_WARN
>>>>>             mds0: Behind on trimming (67/30)
>>>>>      monmap e3: 3 mons at {storage02=X.X.X.190:6789/0,storage03=X.X.X.189:6789/0,storage04=X.X.X.188:6789/0}
>>>>>             election epoch 206, quorum 0,1,2 storage04,storage03,storage02
>>>>>       fsmap e74879: 1/1/1 up {0=cephfs-03=up:active}, 1 up:standby
>>>>>      osdmap e65516: 36 osds: 36 up, 36 in
>>>>>       pgmap v15435732: 4160 pgs, 3 pools, 37539 GB data, 9611 kobjects
>>>>>             75117 GB used, 53591 GB / 125 TB avail
>>>>>                 4160 active+clean
>>>>>   client io 334 MB/s rd, 319 MB/s wr, 5839 op/s rd, 4848 op/s wr
>>>>>
>>>>>
>>>>> Regards,
>>>>> James Webb
>>>>> DevOps Engineer, Engineering Tools
>>>>> Unity Technologies

* Re: Message sequence overflow
       [not found]             ` <CAAM7YAmzpqAoh0TuVj0zCXPgrM=0BjfbwD+C3rJa=0CKwmRhAA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2016-06-02  9:13               ` Ilya Dryomov
  2016-06-02  9:50                 ` [ceph-users] " Yan, Zheng
  2016-06-02  9:51                 ` Yan, Zheng
  0 siblings, 2 replies; 12+ messages in thread
From: Ilya Dryomov @ 2016-06-02  9:13 UTC (permalink / raw)
  To: Yan, Zheng; +Cc: ceph-devel, James Webb, ceph-users-idqoXFIVOFJgJs9I8MT0rw

On Thu, Jun 2, 2016 at 4:46 AM, Yan, Zheng <ukernel@gmail.com> wrote:
> On Wed, Jun 1, 2016 at 10:22 PM, Sage Weil <sage@newdream.net> wrote:
>> On Wed, 1 Jun 2016, Yan, Zheng wrote:
>>> On Wed, Jun 1, 2016 at 8:49 PM, Sage Weil <sage@newdream.net> wrote:
>>> > On Wed, 1 Jun 2016, Yan, Zheng wrote:
>>> >> On Wed, Jun 1, 2016 at 6:15 AM, James Webb <jamesw@unity3d.com> wrote:
>>> >> > Dear ceph-users...
>>> >> >
>>> >> > My team runs an internal buildfarm using ceph as a backend storage platform. We’ve recently upgraded to Jewel and are having reliability issues that we need some help with.
>>> >> >
>>> >> > Our infrastructure is the following:
>>> >> > - We use CEPH/CEPHFS (10.2.1)
>>> >> > - We have 3 mons and 6 storage servers with a total of 36 OSDs (~4160 PGs).
>>> >> > - We use enterprise SSDs for everything including journals
>>> >> > - We have one main mds and one standby mds.
>>> >> > - We are using ceph kernel client to mount cephfs.
>>> >> > - We have upgrade to Ubuntu 16.04 (4.4.0-22-generic kernel)
>>> >> > - We are using a kernel NFS to serve NFS clients from a ceph mount (~ 32 nfs threads. 0 swappiness)
>>> >> > - These are physical machines with 8 cores & 32GB memory
>>> >> >
>>> >> > On a regular basis, we lose all IO via ceph FS. We’re still trying to isolate the issue but it surfaces as an issue between MDS and ceph client.
>>> >> > We can’t tell if our our NFS server is overwhelming the MDS or if this is some unrelated issue. Tuning NFS server has not solved our issues.
>>> >> > So far our only recovery has been to fail the MDS and then restart our NFS. Any help or advice will be appreciated on the CEPH side of things.
>>> >> > I’m pretty sure we’re running with default tuning of CEPH MDS configuration parameters.
>>> >> >
>>> >> >
>>> >> > Here are the relevant log entries.
>>> >> >
>>> >> > From my primary MDS server, I start seeing these entries start to pile up:
>>> >> >
>>> >> > 2016-05-31 14:34:07.091117 7f9f2eb87700  0 log_channel(cluster) log [WRN] : client.4283066 isn't responding to mclientcaps(revoke), ino 10000004491 pending pAsLsXsFsxcrwb issued pAsxLsXsxFsxcrwb, sent 63.877480 seconds ago\
>>> >> > 2016-05-31 14:34:07.091129 7f9f2eb87700  0 log_channel(cluster) log [WRN] : client.4283066 isn't responding to mclientcaps(revoke), ino 10000005ddf pending pAsLsXsFsxcrwb issued pAsxLsXsxFsxcrwb, sent 63.877382 seconds ago\
>>> >> > 2016-05-31 14:34:07.091133 7f9f2eb87700  0 log_channel(cluster) log [WRN] : client.4283066 isn't responding to mclientcaps(revoke), ino 10000000a2a pending pAsLsXsFsxcrwb issued pAsxLsXsxFsxcrwb, sent 63.877356 seconds ago
>>> >> >
>>> >> > From my NFS server, I see these entries from dmesg also start piling up:
>>> >> > [Tue May 31 14:33:09 2016] libceph: skipping mds0 X.X.X.195:6800 seq 0 expected 4294967296
>>> >> > [Tue May 31 14:33:09 2016] libceph: skipping mds0 X.X.X.195:6800 seq 1 expected 4294967296
>>> >> > [Tue May 31 14:33:09 2016] libceph: skipping mds0 X.X.X.195:6800 seq 2 expected 4294967296
>>> >> >
>>> >>
>>> >> 4294967296 is 0x100000000, this looks like sequence  overflow.
>>> >>
>>> >> In src/msg/Message.h:
>>> >>
>>> >> class Message {
>>> >> ...
>>> >>   unsigned get_seq() const { return header.seq; }
>>> >>   void set_seq(unsigned s) { header.seq = s; }
>>> >> ...
>>> >> }
>>> >>
>>> >> in src/msg/simple/Pipe.cc
>>> >>
>>> >> class Pipe {
>>> >> ...
>>> >>   __u32 get_out_seq() { return out_seq; }
>>> >> ...
>>> >> }
>>> >>
>>> >> Is this bug or intentional ?
>>> >
>>> > That's a bug.  The seq values are intended to be 32 bits.
>>> >
>>> > (We should also be using the ceph_cmp_seq (IIRC) helper for any inequality
>>> > checks, which does a sloppy comparison so that a 31-bit signed difference
>>> > is used to determine > or <.  It sounds like in this case we're just
>>> > failing an equality check, though.)
>>> >
>>>
>>> struct ceph_msg_header {
>>>         __le64 seq;       /* message seq# for this session */
>>>         ...
>>> }
>>>
>>> You mean we should leave the upper 32 bits unused?
>>
>> Oh, hmm. I'm confusing this with the cap seq (which is 32 bits).
>>
>> I think we can safely go either way.. the question is which path is
>> easier.  If we move to 32 bits used on the kernel side, will userspace
>> also need to be patched to make reconnect work?  That unsigned get_seq()
>> is only 32-bits wide.
>
> I don't think userspace needs to be patched.
>
>>
>> If we go with 64 bits, userspace still needs to be fixed to change that
>> unsigned to uint64_t.
>>
>> What do you think?
>> sage
>>
>
> I like the 64-bit approach. Here is the userspace code that checks
> the message sequence.
>
> Pipe::reader() {
>
> ...
>       if (m->get_seq() <= in_seq) {
>         ldout(msgr->cct,0) << "reader got old message "
>                 << m->get_seq() << " <= " << in_seq << " " << m << " " << *m
>                 << ", discarding" << dendl;
>
>         msgr->dispatch_throttle_release(m->get_dispatch_throttle_size());
>         m->put();
>
>         if (connection_state->has_feature(CEPH_FEATURE_RECONNECT_SEQ) &&
>             msgr->cct->_conf->ms_die_on_old_message)
>           assert(0 == "old msgs despite reconnect_seq feature");
>         continue;
>       }
>       if (m->get_seq() > in_seq + 1) {
>         ldout(msgr->cct,0) << "reader missed message?  skipped from seq "
>                            << in_seq << " to " << m->get_seq() << dendl;
>         if (msgr->cct->_conf->ms_die_on_skipped_message)
>           assert(0 == "skipped incoming seq");
>       }
>       m->set_connection(connection_state.get());
>       // note last received message.
>       in_seq = m->get_seq();
> ...
> }
>
> Looks like the code works correctly even when the two ends of the
> connection use different seq widths. We don't need to worry that the
> change breaks interoperability between patched and un-patched userspace.

Are you sure?  It seems to me that if that were true, we wouldn't have
this thread in the first place.  An unpatched m->get_seq() in

    if (m->get_seq() <= in_seq) {

would truncate and the "old message" branch would be taken.  The same
goes for the unpatched m->set_seq() - the other (patched) side is going
to trip over.
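
As a quick standalone sketch of that truncation (stand-in types only, not
the real Message/Pipe classes, and assuming unsigned is 32 bits wide):

#include <cstdint>
#include <iostream>

struct FakeHeader { uint64_t seq; };   // stands in for the __le64 header.seq

struct FakeMessage {
  FakeHeader header;
  unsigned get_seq() const { return header.seq; }  // truncates to 32 bits
};

int main() {
  FakeMessage m;
  m.header.seq = 0x100000000ULL;        // 4294967296, first seq past 2^32
  uint64_t in_seq = 0xffffffffULL;      // last seq the reader accepted

  // The 32-bit getter reads the new seq back as 0, so the
  // "reader got old message ... discarding" branch fires.
  if (m.get_seq() <= in_seq)
    std::cout << "discarded as old: " << m.get_seq()
              << " <= " << in_seq << "\n";
  return 0;
}

At the wrap, in_seq is 0xffffffff, the incoming seq reads back as 0, and
the message is thrown away even though it is new.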

Thanks,

                Ilya
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [ceph-users] Message sequence overflow
  2016-06-02  9:13               ` Ilya Dryomov
@ 2016-06-02  9:50                 ` Yan, Zheng
  2016-06-02  9:51                 ` Yan, Zheng
  1 sibling, 0 replies; 12+ messages in thread
From: Yan, Zheng @ 2016-06-02  9:50 UTC (permalink / raw)
  To: Ilya Dryomov; +Cc: Sage Weil, ceph-devel, James Webb, ceph-users

On Thu, Jun 2, 2016 at 5:13 PM, Ilya Dryomov <idryomov@gmail.com> wrote:
> On Thu, Jun 2, 2016 at 4:46 AM, Yan, Zheng <ukernel@gmail.com> wrote:
>> On Wed, Jun 1, 2016 at 10:22 PM, Sage Weil <sage@newdream.net> wrote:
>>> On Wed, 1 Jun 2016, Yan, Zheng wrote:
>>>> On Wed, Jun 1, 2016 at 8:49 PM, Sage Weil <sage@newdream.net> wrote:
>>>> > On Wed, 1 Jun 2016, Yan, Zheng wrote:
>>>> >> On Wed, Jun 1, 2016 at 6:15 AM, James Webb <jamesw@unity3d.com> wrote:
>>>> >> > Dear ceph-users...
>>>> >> >
>>>> >> > My team runs an internal buildfarm using ceph as a backend storage platform. We’ve recently upgraded to Jewel and are having reliability issues that we need some help with.
>>>> >> >
>>>> >> > Our infrastructure is the following:
>>>> >> > - We use CEPH/CEPHFS (10.2.1)
>>>> >> > - We have 3 mons and 6 storage servers with a total of 36 OSDs (~4160 PGs).
>>>> >> > - We use enterprise SSDs for everything including journals
>>>> >> > - We have one main mds and one standby mds.
>>>> >> > - We are using ceph kernel client to mount cephfs.
>>>> >> > - We have upgrade to Ubuntu 16.04 (4.4.0-22-generic kernel)
>>>> >> > - We are using a kernel NFS to serve NFS clients from a ceph mount (~ 32 nfs threads. 0 swappiness)
>>>> >> > - These are physical machines with 8 cores & 32GB memory
>>>> >> >
>>>> >> > On a regular basis, we lose all IO via ceph FS. We’re still trying to isolate the issue but it surfaces as an issue between MDS and ceph client.
>>>> >> > We can’t tell if our our NFS server is overwhelming the MDS or if this is some unrelated issue. Tuning NFS server has not solved our issues.
>>>> >> > So far our only recovery has been to fail the MDS and then restart our NFS. Any help or advice will be appreciated on the CEPH side of things.
>>>> >> > I’m pretty sure we’re running with default tuning of CEPH MDS configuration parameters.
>>>> >> >
>>>> >> >
>>>> >> > Here are the relevant log entries.
>>>> >> >
>>>> >> > From my primary MDS server, I start seeing these entries start to pile up:
>>>> >> >
>>>> >> > 2016-05-31 14:34:07.091117 7f9f2eb87700  0 log_channel(cluster) log [WRN] : client.4283066 isn't responding to mclientcaps(revoke), ino 10000004491 pending pAsLsXsFsxcrwb issued pAsxLsXsxFsxcrwb, sent 63.877480 seconds ago\
>>>> >> > 2016-05-31 14:34:07.091129 7f9f2eb87700  0 log_channel(cluster) log [WRN] : client.4283066 isn't responding to mclientcaps(revoke), ino 10000005ddf pending pAsLsXsFsxcrwb issued pAsxLsXsxFsxcrwb, sent 63.877382 seconds ago\
>>>> >> > 2016-05-31 14:34:07.091133 7f9f2eb87700  0 log_channel(cluster) log [WRN] : client.4283066 isn't responding to mclientcaps(revoke), ino 10000000a2a pending pAsLsXsFsxcrwb issued pAsxLsXsxFsxcrwb, sent 63.877356 seconds ago
>>>> >> >
>>>> >> > From my NFS server, I see these entries from dmesg also start piling up:
>>>> >> > [Tue May 31 14:33:09 2016] libceph: skipping mds0 X.X.X.195:6800 seq 0 expected 4294967296
>>>> >> > [Tue May 31 14:33:09 2016] libceph: skipping mds0 X.X.X.195:6800 seq 1 expected 4294967296
>>>> >> > [Tue May 31 14:33:09 2016] libceph: skipping mds0 X.X.X.195:6800 seq 2 expected 4294967296
>>>> >> >
>>>> >>
>>>> >> 4294967296 is 0x100000000, this looks like sequence  overflow.
>>>> >>
>>>> >> In src/msg/Message.h:
>>>> >>
>>>> >> class Message {
>>>> >> ...
>>>> >>   unsigned get_seq() const { return header.seq; }
>>>> >>   void set_seq(unsigned s) { header.seq = s; }
>>>> >> ...
>>>> >> }
>>>> >>
>>>> >> in src/msg/simple/Pipe.cc
>>>> >>
>>>> >> class Pipe {
>>>> >> ...
>>>> >>   __u32 get_out_seq() { return out_seq; }
>>>> >> ...
>>>> >> }
>>>> >>
>>>> >> Is this bug or intentional ?
>>>> >
>>>> > That's a bug.  The seq values are intended to be 32 bits.
>>>> >
>>>> > (We should also be using the ceph_cmp_seq (IIRC) helper for any inequality
>>>> > checks, which does a sloppy comparison so that a 31-bit signed difference
>>>> > is used to determine > or <.  It sounds like in this case we're just
>>>> > failing an equality check, though.)
>>>> >
>>>>
>>>> struct ceph_msg_header {
>>>>         __le64 seq;       /* message seq# for this session */
>>>>         ...
>>>> }
>>>>
>>>> You mean we should leave the upper 32 bits unused?
>>>
>>> Oh, hmm. I'm confusing this with the cap seq (which is 32 bits).
>>>
>>> I think we can safely go either way.. the question is which path is
>>> easier.  If we move to 32 bits used on the kernel side, will userspace
>>> also need to be patched to make reconnect work?  That unsigned get_seq()
>>> is only 32-bits wide.
>>
>> I don't think userspace needs to be patched.
>>
>>>
>>> If we go with 64 bits, userspace still needs to be fixed to change that
>>> unsigned to uint64_t.
>>>
>>> What do you think?
>>> sage
>>>
>>
>> I like the 64-bit approach. Here is the userspace code that checks
>> the message sequence.
>>
>> Pipe::reader() {
>>
>> ...
>>       if (m->get_seq() <= in_seq) {
>>         ldout(msgr->cct,0) << "reader got old message "
>>                 << m->get_seq() << " <= " << in_seq << " " << m << " " << *m
>>                 << ", discarding" << dendl;
>>
>>         msgr->dispatch_throttle_release(m->get_dispatch_throttle_size());
>>         m->put();
>>
>>         if (connection_state->has_feature(CEPH_FEATURE_RECONNECT_SEQ) &&
>>             msgr->cct->_conf->ms_die_on_old_message)
>>           assert(0 == "old msgs despite reconnect_seq feature");
>>         continue;
>>       }
>>       if (m->get_seq() > in_seq + 1) {
>>         ldout(msgr->cct,0) << "reader missed message?  skipped from seq "
>>                            << in_seq << " to " << m->get_seq() << dendl;
>>         if (msgr->cct->_conf->ms_die_on_skipped_message)
>>           assert(0 == "skipped incoming seq");
>>       }
>>       m->set_connection(connection_state.get());
>>       // note last received message.
>>       in_seq = m->get_seq();
>> ...
>> }
>>
>> Looks like the code works correctly even when the two ends of the
>> connection use different seq widths. We don't need to worry that the
>> change breaks interoperability between patched and un-patched userspace.
>
> Are you sure?  It seems to me that if that were true, we wouldn't have
> this thread in the first place.  An unpatched m->get_seq() in
>
>     if (m->get_seq() <= in_seq) {
>
> would truncate and the "old message" branch would be taken.  The same
> goes for the unpatched m->set_seq() - the other (patched) side is going
> to trip over.

The user space code never increases in_seq itself. Instead, it assigns
m->get_seq() to in_seq. If m->get_seq() returns a 32-bit value, the
upper 32 bits of in_seq are always zero.

I did some tests (setting the initial value of out_seq to 0xffffff00). It
seems that the user space code also starts to malfunction when the seq
overflows. There is code that compares {newly_acked_seq, out_seq} and
{in_seq, in_seq_acked}, but it does not take the overflow into
consideration.
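
A wrap-tolerant comparison along the lines Sage mentioned earlier (decide
by a signed 32-bit difference) would look roughly like this; the names are
illustrative only, not the actual ceph helper:

#include <cstdint>
#include <cassert>

// <0: a is "before" b; 0: equal; >0: a is "after" b.
// Works across the 2^32 wrap as long as the real distance is < 2^31.
static int32_t seq_cmp(uint32_t a, uint32_t b) {
  return static_cast<int32_t>(a - b);
}

int main() {
  // Plain <= gets this wrong after the wrap: numerically 0x5 < 0xffffff00,
  // but circularly 0x5 is the newer seq.
  assert(seq_cmp(0x00000005u, 0xffffff00u) > 0);
  assert(seq_cmp(0xffffff00u, 0x00000005u) < 0);
  assert(seq_cmp(7u, 7u) == 0);
  return 0;
}

With something like this in place of plain <=/>= for {newly_acked_seq,
out_seq} and {in_seq, in_seq_acked}, the wrap would not matter.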

Besides, I found that Pipe::randomize_out_seq() never randomizes the out_seq.

Regards
Yan, Zheng

>
> Thanks,
>
>                 Ilya
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [ceph-users] Message sequence overflow
  2016-06-02  9:13               ` Ilya Dryomov
  2016-06-02  9:50                 ` [ceph-users] " Yan, Zheng
@ 2016-06-02  9:51                 ` Yan, Zheng
  1 sibling, 0 replies; 12+ messages in thread
From: Yan, Zheng @ 2016-06-02  9:51 UTC (permalink / raw)
  To: Ilya Dryomov; +Cc: Sage Weil, ceph-devel, James Webb, ceph-users

On Thu, Jun 2, 2016 at 5:13 PM, Ilya Dryomov <idryomov@gmail.com> wrote:
> On Thu, Jun 2, 2016 at 4:46 AM, Yan, Zheng <ukernel@gmail.com> wrote:
>> On Wed, Jun 1, 2016 at 10:22 PM, Sage Weil <sage@newdream.net> wrote:
>>> On Wed, 1 Jun 2016, Yan, Zheng wrote:
>>>> On Wed, Jun 1, 2016 at 8:49 PM, Sage Weil <sage@newdream.net> wrote:
>>>> > On Wed, 1 Jun 2016, Yan, Zheng wrote:
>>>> >> On Wed, Jun 1, 2016 at 6:15 AM, James Webb <jamesw@unity3d.com> wrote:
>>>> >> > Dear ceph-users...
>>>> >> >
>>>> >> > My team runs an internal buildfarm using ceph as a backend storage platform. We’ve recently upgraded to Jewel and are having reliability issues that we need some help with.
>>>> >> >
>>>> >> > Our infrastructure is the following:
>>>> >> > - We use CEPH/CEPHFS (10.2.1)
>>>> >> > - We have 3 mons and 6 storage servers with a total of 36 OSDs (~4160 PGs).
>>>> >> > - We use enterprise SSDs for everything including journals
>>>> >> > - We have one main mds and one standby mds.
>>>> >> > - We are using ceph kernel client to mount cephfs.
>>>> >> > - We have upgrade to Ubuntu 16.04 (4.4.0-22-generic kernel)
>>>> >> > - We are using a kernel NFS to serve NFS clients from a ceph mount (~ 32 nfs threads. 0 swappiness)
>>>> >> > - These are physical machines with 8 cores & 32GB memory
>>>> >> >
>>>> >> > On a regular basis, we lose all IO via ceph FS. We’re still trying to isolate the issue but it surfaces as an issue between MDS and ceph client.
>>>> >> > We can’t tell if our our NFS server is overwhelming the MDS or if this is some unrelated issue. Tuning NFS server has not solved our issues.
>>>> >> > So far our only recovery has been to fail the MDS and then restart our NFS. Any help or advice will be appreciated on the CEPH side of things.
>>>> >> > I’m pretty sure we’re running with default tuning of CEPH MDS configuration parameters.
>>>> >> >
>>>> >> >
>>>> >> > Here are the relevant log entries.
>>>> >> >
>>>> >> > From my primary MDS server, I start seeing these entries start to pile up:
>>>> >> >
>>>> >> > 2016-05-31 14:34:07.091117 7f9f2eb87700  0 log_channel(cluster) log [WRN] : client.4283066 isn't responding to mclientcaps(revoke), ino 10000004491 pending pAsLsXsFsxcrwb issued pAsxLsXsxFsxcrwb, sent 63.877480 seconds ago\
>>>> >> > 2016-05-31 14:34:07.091129 7f9f2eb87700  0 log_channel(cluster) log [WRN] : client.4283066 isn't responding to mclientcaps(revoke), ino 10000005ddf pending pAsLsXsFsxcrwb issued pAsxLsXsxFsxcrwb, sent 63.877382 seconds ago\
>>>> >> > 2016-05-31 14:34:07.091133 7f9f2eb87700  0 log_channel(cluster) log [WRN] : client.4283066 isn't responding to mclientcaps(revoke), ino 10000000a2a pending pAsLsXsFsxcrwb issued pAsxLsXsxFsxcrwb, sent 63.877356 seconds ago
>>>> >> >
>>>> >> > From my NFS server, I see these entries from dmesg also start piling up:
>>>> >> > [Tue May 31 14:33:09 2016] libceph: skipping mds0 X.X.X.195:6800 seq 0 expected 4294967296
>>>> >> > [Tue May 31 14:33:09 2016] libceph: skipping mds0 X.X.X.195:6800 seq 1 expected 4294967296
>>>> >> > [Tue May 31 14:33:09 2016] libceph: skipping mds0 X.X.X.195:6800 seq 2 expected 4294967296
>>>> >> >
>>>> >>
>>>> >> 4294967296 is 0x100000000, this looks like sequence  overflow.
>>>> >>
>>>> >> In src/msg/Message.h:
>>>> >>
>>>> >> class Message {
>>>> >> ...
>>>> >>   unsigned get_seq() const { return header.seq; }
>>>> >>   void set_seq(unsigned s) { header.seq = s; }
>>>> >> ...
>>>> >> }
>>>> >>
>>>> >> in src/msg/simple/Pipe.cc
>>>> >>
>>>> >> class Pipe {
>>>> >> ...
>>>> >>   __u32 get_out_seq() { return out_seq; }
>>>> >> ...
>>>> >> }
>>>> >>
>>>> >> Is this bug or intentional ?
>>>> >
>>>> > That's a bug.  The seq values are intended to be 32 bits.
>>>> >
>>>> > (We should also be using the ceph_cmp_seq (IIRC) helper for any inequality
>>>> > checks, which does a sloppy comparison so that a 31-bit signed difference
>>>> > is used to determine > or <.  It sounds like in this case we're just
>>>> > failing an equality check, though.)
>>>> >
>>>>
>>>> struct ceph_msg_header {
>>>>         __le64 seq;       /* message seq# for this session */
>>>>         ...
>>>> }
>>>>
>>>> You mean we should leave the upper 32 bits unused?
>>>
>>> Oh, hmm. I'm confusing this with the cap seq (which is 32 bits).
>>>
>>> I think we can safely go either way.. the question is which path is
>>> easier.  If we move to 32 bits used on the kernel side, will userspace
>>> also need to be patched to make reconnect work?  That unsigned get_seq()
>>> is only 32-bits wide.
>>
>> I don't think userspace needs to be patched.
>>
>>>
>>> If we go with 64 bits, userspace still needs to be fixed to change that
>>> unsigned to uint64_t.
>>>
>>> What do you think?
>>> sage
>>>
>>
>> I like the 64-bit approach. Here is the userspace code that checks
>> the message sequence.
>>
>> Pipe::reader() {
>>
>> ...
>>       if (m->get_seq() <= in_seq) {
>>         ldout(msgr->cct,0) << "reader got old message "
>>                 << m->get_seq() << " <= " << in_seq << " " << m << " " << *m
>>                 << ", discarding" << dendl;
>>
>>         msgr->dispatch_throttle_release(m->get_dispatch_throttle_size());
>>         m->put();
>>
>>         if (connection_state->has_feature(CEPH_FEATURE_RECONNECT_SEQ) &&
>>             msgr->cct->_conf->ms_die_on_old_message)
>>           assert(0 == "old msgs despite reconnect_seq feature");
>>         continue;
>>       }
>>       if (m->get_seq() > in_seq + 1) {
>>         ldout(msgr->cct,0) << "reader missed message?  skipped from seq "
>>                            << in_seq << " to " << m->get_seq() << dendl;
>>         if (msgr->cct->_conf->ms_die_on_skipped_message)
>>           assert(0 == "skipped incoming seq");
>>       }
>>       m->set_connection(connection_state.get());
>>       // note last received message.
>>       in_seq = m->get_seq();
>> ...
>> }
>>
>> Looks like the code works correctly even when the two ends of the
>> connection use different seq widths. We don't need to worry that the
>> change breaks interoperability between patched and un-patched userspace.
>
> Are you sure?  It seems to me that if that were true, we wouldn't have
> this thread in the first place.  An unpatched m->get_seq() in
>
>     if (m->get_seq() <= in_seq) {
>
> would truncate and the "old message" branch would be taken.  The same
> goes for the unpatched m->set_seq() - the other (patched) side is going
> to trip over.


The user space code never increases in_seq itself. Instead, it assigns
m->get_seq() to in_seq. If m->get_seq() returns a 32-bit value, the
upper 32 bits of in_seq are always zero.

I did some tests (setting the initial value of out_seq to 0xffffff00). It
seems that the user space code also starts to malfunction when the seq
overflows. There is code that compares {newly_acked_seq, out_seq} and
{in_seq, in_seq_acked}, but it does not take the overflow into
consideration.

Besides, I found that Pipe::randomize_out_seq() never randomizes the out_seq.

Regards
Yan, Zheng

>
> Thanks,
>
>                 Ilya
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2016-06-02  9:51 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-06-01  8:49 Message sequence overflow Yan, Zheng
2016-06-01 10:20 ` [ceph-users] " Ilya Dryomov
2016-06-01 12:49 ` Sage Weil
2016-06-01 13:26   ` [ceph-users] " Ilya Dryomov
     [not found]   ` <alpine.DEB.2.11.1606010847240.6221-Wo5lQnKln9t9PHm/lf2LFUEOCMrvLtNR@public.gmane.org>
2016-06-01 13:51     ` Yan, Zheng
     [not found]       ` <CAAM7YA=xU+Ovxf-QU6ruRKGhWS88-d0UaX1AbcR9oCLbEi9_jg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2016-06-01 14:22         ` Sage Weil
2016-06-01 14:46           ` [ceph-users] " Ilya Dryomov
2016-06-02  2:46           ` Yan, Zheng
     [not found]             ` <CAAM7YAmzpqAoh0TuVj0zCXPgrM=0BjfbwD+C3rJa=0CKwmRhAA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2016-06-02  9:13               ` Ilya Dryomov
2016-06-02  9:50                 ` [ceph-users] " Yan, Zheng
2016-06-02  9:51                 ` Yan, Zheng
2016-06-02  3:07           ` Alex Elder
