* Ceph-client branch for Ubuntu 14.04.1 LTS (3.13.0-x kernels)
@ 2015-01-05 10:53 Chaitanya Huilgol
  2015-01-05 11:15 ` Wido den Hollander
  2015-01-05 15:57 ` Ilya Dryomov
  0 siblings, 2 replies; 15+ messages in thread
From: Chaitanya Huilgol @ 2015-01-05 10:53 UTC (permalink / raw)
  To: ceph-devel

Hi All,

The stock ceph-client modules shipped with Ubuntu 14.04 LTS are quite dated, and we are seeing crashes and soft-lockup issues that have been fixed in the current ceph-client code base.
What would be the recommended ceph-client branch compatible with the Ubuntu 14.04 (3.13.0-x) kernels, so that we can pick up as many fixes as possible?

Regards,
Chaitanya

________________________________

PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies).


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Ceph-client branch for Ubuntu 14.04.1 LTS (3.13.0-x kernels)
  2015-01-05 10:53 Ceph-client branch for Ubuntu 14.04.1 LTS (3.13.0-x kernels) Chaitanya Huilgol
@ 2015-01-05 11:15 ` Wido den Hollander
  2015-01-08  8:49   ` joel.merrick
  2015-01-05 15:57 ` Ilya Dryomov
  1 sibling, 1 reply; 15+ messages in thread
From: Wido den Hollander @ 2015-01-05 11:15 UTC (permalink / raw)
  To: Chaitanya Huilgol, ceph-devel



On 05-01-15 11:53, Chaitanya Huilgol wrote:
> Hi All,
>
> The stock ceph-client modules with Ubuntu 14.04 LTS are quite dated and we are seeing crashes and soft-lockup issues which have been fixed in the current ceph-client code base.
> What would be recommended ceph-client branch compatible with the Ubuntu 14.04 (3.13.0-x) kernels so that we can get as many fixes as possible?
>

I recommend you take a look here: 
http://kernel.ubuntu.com/~kernel-ppa/mainline/

That should give you some new kernels.
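For example, installing one of those builds goes roughly like this; the version tag and .deb file names below are examples only, so check the archive index for the real names:

```shell
# Sketch: pull a newer kernel from the mainline archive above.
# The version tag "v3.18.1-vivid" is an example; browse
# kernel.ubuntu.com/~kernel-ppa/mainline/ for what is actually published.
mainline_url() {    # usage: mainline_url <version-tag>
    echo "http://kernel.ubuntu.com/~kernel-ppa/mainline/$1"
}

base=$(mainline_url v3.18.1-vivid)
# wget "$base/linux-headers-...-generic_amd64.deb" \
#      "$base/linux-image-...-generic_amd64.deb"
# sudo dpkg -i linux-headers-*.deb linux-image-*.deb && sudo reboot
echo "$base"
```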

Wido

> Regards,
> Chaitanya
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Ceph-client branch for Ubuntu 14.04.1 LTS (3.13.0-x kernels)
  2015-01-05 10:53 Ceph-client branch for Ubuntu 14.04.1 LTS (3.13.0-x kernels) Chaitanya Huilgol
  2015-01-05 11:15 ` Wido den Hollander
@ 2015-01-05 15:57 ` Ilya Dryomov
  2015-01-05 17:11   ` Somnath Roy
  2015-01-06  2:36   ` Chaitanya Huilgol
  1 sibling, 2 replies; 15+ messages in thread
From: Ilya Dryomov @ 2015-01-05 15:57 UTC (permalink / raw)
  To: Chaitanya Huilgol; +Cc: ceph-devel

On Mon, Jan 5, 2015 at 1:53 PM, Chaitanya Huilgol
<Chaitanya.Huilgol@sandisk.com> wrote:
> Hi All,
>
> The stock ceph-client modules with Ubuntu 14.04 LTS are quite dated and we are seeing crashes and soft-lockup issues which have been fixed in the current ceph-client code base.
> What would be recommended ceph-client branch compatible with the Ubuntu 14.04 (3.13.0-x) kernels so that we can get as many fixes as possible?

We actively mark rbd fixes (not so much cephfs) for stable, and the Ubuntu
kernel team generally picks them up.  The 3.13 series should have most of
the important fixes, although I haven't counted.

What issues in particular are you running into?  uname -a?

Thanks,

                Ilya

^ permalink raw reply	[flat|nested] 15+ messages in thread

* RE: Ceph-client branch for Ubuntu 14.04.1 LTS (3.13.0-x kernels)
  2015-01-05 15:57 ` Ilya Dryomov
@ 2015-01-05 17:11   ` Somnath Roy
  2015-01-05 18:50     ` Ilya Dryomov
  2015-01-06  2:36   ` Chaitanya Huilgol
  1 sibling, 1 reply; 15+ messages in thread
From: Somnath Roy @ 2015-01-05 17:11 UTC (permalink / raw)
  To: Ilya Dryomov, Chaitanya Huilgol; +Cc: ceph-devel

Ilya,
The main issue we are facing is a krbd client crash when a cluster node reboots. Has a fix for this been backported to any 14.04 LTS stable kernel?
If not, please suggest a workaround, as upgrading the kernel may not be an option.

Thanks & Regards
Somnath

-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Ilya Dryomov
Sent: Monday, January 05, 2015 7:58 AM
To: Chaitanya Huilgol
Cc: ceph-devel@vger.kernel.org
Subject: Re: Ceph-client branch for Ubuntu 14.04.1 LTS (3.13.0-x kernels)

On Mon, Jan 5, 2015 at 1:53 PM, Chaitanya Huilgol <Chaitanya.Huilgol@sandisk.com> wrote:
> Hi All,
>
> The stock ceph-client modules with Ubuntu 14.04 LTS are quite dated and we are seeing crashes and soft-lockup issues which have been fixed in the current ceph-client code base.
> What would be recommended ceph-client branch compatible with the Ubuntu 14.04 (3.13.0-x) kernels so that we can get as many fixes as possible?

We actively mark rbd (not so much cephfs) fixes for stable and Ubuntu kernel team generally picks them up.  3.13 series should have most of the important fixes, although I haven't counted.

What issues in particular you are running into?  uname -a?

Thanks,

                Ilya



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Ceph-client branch for Ubuntu 14.04.1 LTS (3.13.0-x kernels)
  2015-01-05 17:11   ` Somnath Roy
@ 2015-01-05 18:50     ` Ilya Dryomov
  2015-01-05 20:01       ` Somnath Roy
  0 siblings, 1 reply; 15+ messages in thread
From: Ilya Dryomov @ 2015-01-05 18:50 UTC (permalink / raw)
  To: Somnath Roy; +Cc: Chaitanya Huilgol, ceph-devel

On Mon, Jan 5, 2015 at 8:11 PM, Somnath Roy <Somnath.Roy@sandisk.com> wrote:
> Ilya,
> The main issue we are facing the krbd client crash in case of cluster node reboot. Is this fix backported to any 14.04 stable LTS kernel ?

I don't recall anything like that, at least not phrased that way.
Can you give more details - crash traces at least?

Thanks,

                Ilya

^ permalink raw reply	[flat|nested] 15+ messages in thread

* RE: Ceph-client branch for Ubuntu 14.04.1 LTS (3.13.0-x kernels)
  2015-01-05 18:50     ` Ilya Dryomov
@ 2015-01-05 20:01       ` Somnath Roy
  2015-01-05 20:33         ` Ilya Dryomov
  0 siblings, 1 reply; 15+ messages in thread
From: Somnath Roy @ 2015-01-05 20:01 UTC (permalink / raw)
  To: Ilya Dryomov; +Cc: Chaitanya Huilgol, ceph-devel

Ilya,
Here are the steps:

1. Start with a cluster of 3 nodes and replication set to 3.

2. Map a krbd image on a client.

3. Reboot, or stop the ceph services on, one or more of the nodes.

4. The client with the krbd image mapped crashes.

Also, if we try to reboot a client without unmapping the images, the client node goes into a loop and requires a hard boot.
We found that this issue is fixed in a later version of rbd; we are using the stock rbd that ships with Ubuntu 14.04 LTS.
Let me know if you need further details.
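In shell terms the repro looks roughly like this; the pool/image names and the host "osd-node-1" are placeholders, and DRY_RUN is left on so the commands only print instead of running:

```shell
#!/bin/sh
# Repro sketch for the krbd crash described above. Pool/image names and
# "osd-node-1" are placeholders; DRY_RUN=1 (the default here) makes run()
# print each command instead of executing it.
DRY_RUN=${DRY_RUN:-1}
run() { echo "+ $*"; [ "$DRY_RUN" = 1 ] || "$@"; }

run rbd map rbd/test-img                  # steps 1-2: map an image on the client
run ssh osd-node-1 "sudo stop ceph-all"   # step 3: stop ceph on a node (upstart job)
run tail -f /var/log/kern.log             # step 4: watch the client for the oops
```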

Thanks & Regards
Somnath

-----Original Message-----
From: Ilya Dryomov [mailto:ilya.dryomov@inktank.com]
Sent: Monday, January 05, 2015 10:50 AM
To: Somnath Roy
Cc: Chaitanya Huilgol; ceph-devel@vger.kernel.org
Subject: Re: Ceph-client branch for Ubuntu 14.04.1 LTS (3.13.0-x kernels)

On Mon, Jan 5, 2015 at 8:11 PM, Somnath Roy <Somnath.Roy@sandisk.com> wrote:
> Ilya,
> The main issue we are facing the krbd client crash in case of cluster node reboot. Is this fix backported to any 14.04 stable LTS kernel ?

I don't recall anything like that or at least phrased that way.
Can you give more details - crash traces at least?

Thanks,

                Ilya



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Ceph-client branch for Ubuntu 14.04.1 LTS (3.13.0-x kernels)
  2015-01-05 20:01       ` Somnath Roy
@ 2015-01-05 20:33         ` Ilya Dryomov
  2015-01-05 21:08           ` Somnath Roy
  2015-01-05 21:54           ` Somnath Roy
  0 siblings, 2 replies; 15+ messages in thread
From: Ilya Dryomov @ 2015-01-05 20:33 UTC (permalink / raw)
  To: Somnath Roy; +Cc: Chaitanya Huilgol, ceph-devel

On Mon, Jan 5, 2015 at 11:01 PM, Somnath Roy <Somnath.Roy@sandisk.com> wrote:
> Ilya,
> Here is the steps..
>
> 1. You have a cluster (3 nodes) and replication is 3
>
> 2. map krbd image to a client.
>
> 3. Reboot or stop ceph services on one or more nodes
>
> 4. The client with krbd mapped module crashes

Is it idle or under load?

Do you have a trace of the crash?

Thanks,

                Ilya

^ permalink raw reply	[flat|nested] 15+ messages in thread

* RE: Ceph-client branch for Ubuntu 14.04.1 LTS (3.13.0-x kernels)
  2015-01-05 20:33         ` Ilya Dryomov
@ 2015-01-05 21:08           ` Somnath Roy
  2015-01-06 12:31             ` Chaitanya Huilgol
  2015-01-05 21:54           ` Somnath Roy
  1 sibling, 1 reply; 15+ messages in thread
From: Somnath Roy @ 2015-01-05 21:08 UTC (permalink / raw)
  To: Ilya Dryomov; +Cc: Chaitanya Huilgol, ceph-devel

It's happening both when idle and under load.
I don't have the trace right now but will get you one soon.
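In case the next oops also dies with the box before reaching disk, netconsole can stream it to another host; a rough sketch, with the interface and sink address as placeholders:

```shell
# Sketch: stream kernel messages to a remote sink with netconsole, useful
# when an oops never makes it into the local syslog. "eth0" and
# "192.168.0.10" are placeholders for the client NIC and the collector host.
netconsole_arg() {   # usage: netconsole_arg <src-dev> <dst-ip> <dst-port>
    echo "netconsole=@/$1,$3@$2/"
}

# On the crashing client:
#   sudo modprobe netconsole "$(netconsole_arg eth0 192.168.0.10 6666)"
# On the sink host:
#   nc -l -u 6666 | tee krbd-oops.log
echo "$(netconsole_arg eth0 192.168.0.10 6666)"
```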

Thanks & Regards
Somnath

-----Original Message-----
From: Ilya Dryomov [mailto:ilya.dryomov@inktank.com]
Sent: Monday, January 05, 2015 12:34 PM
To: Somnath Roy
Cc: Chaitanya Huilgol; ceph-devel@vger.kernel.org
Subject: Re: Ceph-client branch for Ubuntu 14.04.1 LTS (3.13.0-x kernels)

On Mon, Jan 5, 2015 at 11:01 PM, Somnath Roy <Somnath.Roy@sandisk.com> wrote:
> Ilya,
> Here is the steps..
>
> 1. You have a cluster (3 nodes) and replication is 3
>
> 2. map krbd image to a client.
>
> 3. Reboot or stop ceph services on one or more nodes
>
> 4. The client with krbd mapped module crashes

Is it idle or under load?

Do you have a trace of the crash?

Thanks,

                Ilya



^ permalink raw reply	[flat|nested] 15+ messages in thread

* RE: Ceph-client branch for Ubuntu 14.04.1 LTS (3.13.0-x kernels)
  2015-01-05 20:33         ` Ilya Dryomov
  2015-01-05 21:08           ` Somnath Roy
@ 2015-01-05 21:54           ` Somnath Roy
  1 sibling, 0 replies; 15+ messages in thread
From: Somnath Roy @ 2015-01-05 21:54 UTC (permalink / raw)
  To: Ilya Dryomov; +Cc: Chaitanya Huilgol, ceph-devel

[-- Attachment #1: Type: text/plain, Size: 6412 bytes --]

Ilya,
I could gather the following syslog entries; the full syslog is attached. Please have a look and see whether it is helpful.

I can see the following trace:

Dec  9 01:38:01 rack1-ramp-5 kernel: [1371757.283268] Workqueue: ceph-msgr con_work [libceph]
Dec  9 01:38:01 rack1-ramp-5 kernel: [1371757.291641] task: ffff880fb6868000 ti: ffff880ffaa2a000 task.ti: ffff880ffaa2a000
Dec  9 01:38:01 rack1-ramp-5 kernel: [1371757.304503] RIP: 0010:[<ffffffffa035a40e>]  [<ffffffffa035a40e>] osd_reset+0x22e/0x2c0 [libceph]
Dec  9 01:38:01 rack1-ramp-5 kernel: [1371757.319808] RSP: 0018:ffff880ffaa2bd80  EFLAGS: 00010206
Dec  9 01:38:01 rack1-ramp-5 kernel: [1371757.328659] RAX: ffff881012fb4ca8 RBX: ffff8810114a9750 RCX: ffff881012790050
Dec  9 01:38:01 rack1-ramp-5 kernel: [1371757.599331] RDX: ffff881012fb4ca8 RSI: 0000000086588656 RDI: 0000000000000286
Dec  9 01:38:01 rack1-ramp-5 kernel: [1371757.703539] RBP: ffff880ffaa2bdd8 R08: 0000000000000000 R09: 0000000000000000
Dec  9 01:38:01 rack1-ramp-5 kernel: [1371757.810053] R10: ffffffff81600edf R11: ffffea003fef7a00 R12: ffff881012fb4c58
Dec  9 01:38:01 rack1-ramp-5 kernel: [1371757.918811] R13: ffff8810114a9810 R14: ffff881012790000 R15: ffff881012790020
Dec  9 01:38:01 rack1-ramp-5 kernel: [1371758.029661] libceph: osd32 down
Dec  9 01:38:01 rack1-ramp-5 kernel: [1371758.029662] libceph: osd33 down
Dec  9 01:38:01 rack1-ramp-5 kernel: [1371758.029662] libceph: osd38 down
Dec  9 01:38:01 rack1-ramp-5 kernel: [1371758.029662] libceph: osd39 down
Dec  9 01:38:01 rack1-ramp-5 kernel: [1371758.029663] libceph: osd40 down
Dec  9 01:38:01 rack1-ramp-5 kernel: [1371758.029663] libceph: osd47 down
Dec  9 01:38:01 rack1-ramp-5 kernel: [1371758.029663] libceph: osd48 down
Dec  9 01:38:01 rack1-ramp-5 kernel: [1371758.029663] libceph: osd49 down
Dec  9 01:38:01 rack1-ramp-5 kernel: [1371758.029664] libceph: osd50 down
Dec  9 01:38:01 rack1-ramp-5 kernel: [1371758.029664] libceph: osd51 down
Dec  9 01:38:01 rack1-ramp-5 kernel: [1371758.029664] libceph: osd52 down
Dec  9 01:38:01 rack1-ramp-5 kernel: [1371758.029665] libceph: osd53 down
Dec  9 01:38:01 rack1-ramp-5 kernel: [1371758.029665] libceph: osd57 down
Dec  9 01:38:01 rack1-ramp-5 kernel: [1371758.631655] FS:  0000000000000000(0000) GS:ffff88101f300000(0000) knlGS:0000000000000000
Dec  9 01:38:01 rack1-ramp-5 kernel: [1371758.700074] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec  9 01:38:01 rack1-ramp-5 kernel: [1371758.734306] CR2: 00007f0bbad49000 CR3: 0000000001c0e000 CR4: 00000000001407e0
Dec  9 01:38:01 rack1-ramp-5 kernel: [1371758.800693] Stack:
Dec  9 01:38:01 rack1-ramp-5 kernel: [1371758.832457]  ffff8810114a97a8 ffff8810114a9760 ffff881012fb4800 ffff881012fb4ca8
Dec  9 01:38:01 rack1-ramp-5 kernel: [1371758.897340]  ffff880ffaa2bda0 ffff880ffaa2bda0 ffff881012fb4c10 ffff881012fb4830
Dec  9 01:38:01 rack1-ramp-5 kernel: [1371758.962318]  ffff881012fb49b0 ffff881012fb4860 0000000000000011 ffff880ffaa2be20
Dec  9 01:38:01 rack1-ramp-5 kernel: [1371759.027390] Call Trace:
Dec  9 01:38:01 rack1-ramp-5 kernel: [1371759.058230]  [<ffffffffa03549e8>] con_work+0x298/0x640 [libceph]
Dec  9 01:38:01 rack1-ramp-5 kernel: [1371759.089619]  [<ffffffff810838a2>] process_one_work+0x182/0x450
Dec  9 01:38:01 rack1-ramp-5 kernel: [1371759.120139]  [<ffffffff81084641>] worker_thread+0x121/0x410
Dec  9 01:38:01 rack1-ramp-5 kernel: [1371759.149533]  [<ffffffff81084520>] ? rescuer_thread+0x3e0/0x3e0
Dec  9 01:38:01 rack1-ramp-5 kernel: [1371759.179041]  [<ffffffff8108b312>] kthread+0xd2/0xf0
Dec  9 01:38:01 rack1-ramp-5 kernel: [1371759.209159]  [<ffffffff8108b240>] ? kthread_create_on_node+0x1d0/0x1d0
Dec  9 01:38:01 rack1-ramp-5 kernel: [1371759.240921]  [<ffffffff8172637c>] ret_from_fork+0x7c/0xb0
Dec  9 01:38:01 rack1-ramp-5 kernel: [1371759.273511]  [<ffffffff8108b240>] ? kthread_create_on_node+0x1d0/0x1d0
Dec  9 01:38:01 rack1-ramp-5 kernel: [1371759.307636] Code: ff ff 48 89 df e8 e3 f1 ff ff 48 8b 7d a8 e8 7a 1c 3c e1 48 8b 7d b0 e8 41 68 d5 e0 48 83 c4 30 5b 41 5c 41 5d 41 5e 41 5f 5d c3 <0f> 0b 48 8b 45 b8 49 8b 0e 4c 89 f2 48 c7 c6 d0 e6 36 a0 48 c7
Dec  9 01:38:01 rack1-ramp-5 kernel: [1371759.421674] RIP  [<ffffffffa035a40e>] osd_reset+0x22e/0x2c0 [libceph]
Dec  9 01:38:01 rack1-ramp-5 kernel: [1371759.462127]  RSP <ffff880ffaa2bd80>
Dec  9 01:38:01 rack1-ramp-5 kernel: [1371759.567952] ---[ end trace 37d00d439ac66995 ]---
Dec  9 01:38:17 rack1-ramp-5 kernel: [1371759.614230] BUG: unable to handle kernel paging request at ffffffffffffffd8
Dec  9 01:38:17 rack1-ramp-5 kernel: [1371759.659349] IP: [<ffffffff8108b9b0>] kthread_data+0x10/0x20

Thanks & Regards
Somnath

-----Original Message-----
From: Somnath Roy
Sent: Monday, January 05, 2015 1:08 PM
To: 'Ilya Dryomov'
Cc: Chaitanya Huilgol; ceph-devel@vger.kernel.org
Subject: RE: Ceph-client branch for Ubuntu 14.04.1 LTS (3.13.0-x kernels)

It's happening both in idle and under load.
I don't have the trace right now but will get you one soon.

Thanks & Regards
Somnath

-----Original Message-----
From: Ilya Dryomov [mailto:ilya.dryomov@inktank.com]
Sent: Monday, January 05, 2015 12:34 PM
To: Somnath Roy
Cc: Chaitanya Huilgol; ceph-devel@vger.kernel.org
Subject: Re: Ceph-client branch for Ubuntu 14.04.1 LTS (3.13.0-x kernels)

On Mon, Jan 5, 2015 at 11:01 PM, Somnath Roy <Somnath.Roy@sandisk.com> wrote:
> Ilya,
> Here is the steps..
>
> 1. You have a cluster (3 nodes) and replication is 3
>
> 2. map krbd image to a client.
>
> 3. Reboot or stop ceph services on one or more nodes
>
> 4. The client with krbd mapped module crashes

Is it idle or under load?

Do you have a trace of the crash?

Thanks,

                Ilya



[-- Attachment #2: syslog.tar.gz --]
[-- Type: application/x-gzip, Size: 64086 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* RE: Ceph-client branch for Ubuntu 14.04.1 LTS (3.13.0-x kernels)
  2015-01-05 15:57 ` Ilya Dryomov
  2015-01-05 17:11   ` Somnath Roy
@ 2015-01-06  2:36   ` Chaitanya Huilgol
  1 sibling, 0 replies; 15+ messages in thread
From: Chaitanya Huilgol @ 2015-01-06  2:36 UTC (permalink / raw)
  To: Ilya Dryomov; +Cc: ceph-devel

Hi Ilya,

Can you please point us to the sources for the Ubuntu ceph-client with the fixes? The ceph-client code that comes with the linux-source Debian package does not seem to contain many of the fixes, and I did not see any ceph-client patches on top of the 3.13 kernel either. It looks like I might be looking in the wrong place.

Regards,
Chaitanya

-----Original Message-----
From: Ilya Dryomov [mailto:ilya.dryomov@inktank.com]
Sent: Monday, January 05, 2015 9:28 PM
To: Chaitanya Huilgol
Cc: ceph-devel@vger.kernel.org
Subject: Re: Ceph-client branch for Ubuntu 14.04.1 LTS (3.13.0-x kernels)

On Mon, Jan 5, 2015 at 1:53 PM, Chaitanya Huilgol <Chaitanya.Huilgol@sandisk.com> wrote:
> Hi All,
>
> The stock ceph-client modules with Ubuntu 14.04 LTS are quite dated and we are seeing crashes and soft-lockup issues which have been fixed in the current ceph-client code base.
> What would be recommended ceph-client branch compatible with the Ubuntu 14.04 (3.13.0-x) kernels so that we can get as many fixes as possible?

We actively mark rbd (not so much cephfs) fixes for stable and Ubuntu kernel team generally picks them up.  3.13 series should have most of the important fixes, although I haven't counted.

What issues in particular you are running into?  uname -a?

Thanks,

                Ilya



^ permalink raw reply	[flat|nested] 15+ messages in thread

* RE: Ceph-client branch for Ubuntu 14.04.1 LTS (3.13.0-x kernels)
  2015-01-05 21:08           ` Somnath Roy
@ 2015-01-06 12:31             ` Chaitanya Huilgol
  2015-01-06 14:19               ` Ilya Dryomov
  0 siblings, 1 reply; 15+ messages in thread
From: Chaitanya Huilgol @ 2015-01-06 12:31 UTC (permalink / raw)
  To: Somnath Roy, Ilya Dryomov; +Cc: ceph-devel

Hi Ilya,

The RBD crash when OSD nodes go away is routinely hit in our setups.
We have not been able to get a good stack trace for it because of our console capture issues, and the traces don't end up in the syslogs after the crash either. We will get you the traces soon.
Most of the time this happens when all the OSD nodes go away at once.  Could it have been fixed by one of the following commits?

cc9f1f5  libceph: change from BUG to WARN for __remove_osd() asserts (Ilya Dryomov, Nov 5)
ba9d114  libceph: clear r_req_lru_item in __unregister_linger_request() (Ilya Dryomov, Nov 5)
a390de0  libceph: unlink from o_linger_requests when clearing r_osd (Ilya Dryomov, Nov 4)
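To check whether a given tree already carries these, grepping the log by commit subject is more reliable than the abbreviated hashes, which differ between trees; a rough sketch (the checkout path is hypothetical):

```shell
# Sketch: test whether a kernel tree contains a fix by commit subject,
# since abbreviated hashes differ between ceph-client and distro trees.
has_fix() {   # usage: git log --oneline -- net/ceph | has_fix "subject substring"
    grep -qF "$1"
}

# Example against a checkout (hypothetical path):
#   cd ~/src/ceph-client
#   git log --oneline -- net/ceph drivers/block/rbd.c |
#       has_fix "clear r_req_lru_item" && echo "fix present"
```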

Also, we have encountered a few other issues, listed below.

(1) Soft Lockup issue
Dec 10 11:22:28 rack3-client-1 kernel: [661597.506625] BUG: soft lockup - CPU#2 stuck for 22s! [java:29169] --- (vdbench process)
.
.
Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.043935] Call Trace:
Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.097630] [<ffffffffa062d9e8>] con_work+0x298/0x640 [libceph]
Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.152461] [<ffffffff810838a2>] process_one_work+0x182/0x450
Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.206653] [<ffffffff81084641>] worker_thread+0x121/0x410
Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.259860] [<ffffffff81084520>] ? rescuer_thread+0x3e0/0x3e0
Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.312023] [<ffffffff8108b312>] kthread+0xd2/0xf0
Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.362974] [<ffffffff8108b240>] ? kthread_create_on_node+0x1d0/0x1d0
Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.414058] [<ffffffff8172637c>] ret_from_fork+0x7c/0xb0
Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.464358] [<ffffffff8108b240>] ? kthread_create_on_node+0x1d0/0x1d0
Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.514121] Code: ff ff 48 89 df e8 e3 f1 ff ff 48 8b 7d a8 e8 7a 8c 0e e1 48 8b 7d b0 e8 41 d8 a7 e0 48 83 c4 30 5b 41 5c 41 5d 41 5e 41 5f 5d c3 <0f> 0b 48 8b 45 b8 49 8b 0e 4c 89 f2 48 c7 c6 d0 76 64 a0 48 c7
Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.663443] RIP [<ffffffffa063340e>] osd_reset+0x22e/0x2c0 [libceph]
Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.712105] RSP <ffff880a22b8bd80>

(2) Soft lockup when OSDs are flapping

Dec 18 18:25:10 rack3-client-2 kernel: [157126.089489] BUG: soft lockup - CPU#4 stuck for 23s! [kworker/4:0:45012]
.
.
Dec 18 18:25:10 rack3-client-2 kernel: [157126.098648] Call Trace:
Dec 18 18:25:10 rack3-client-2 kernel: [157126.098653] [<ffffffffa030d963>] kick_requests+0x1e3/0x440 [libceph]
Dec 18 18:25:10 rack3-client-2 kernel: [157126.098657] [<ffffffffa030df98>] ceph_osdc_handle_map+0x2a8/0x620 [libceph]
Dec 18 18:25:10 rack3-client-2 kernel: [157126.098662] [<ffffffffa030e55b>] dispatch+0x24b/0xb20 [libceph]
Dec 18 18:25:10 rack3-client-2 kernel: [157126.098665] [<ffffffffa0301c08>] ? ceph_tcp_recvmsg+0x48/0x60 [libceph]
Dec 18 18:25:10 rack3-client-2 kernel: [157126.098669] [<ffffffffa030552f>] con_work+0x164f/0x2b60 [libceph]
Dec 18 18:25:10 rack3-client-2 kernel: [157126.098672] [<ffffffff8101b7d9>] ? sched_clock+0x9/0x10
Dec 18 18:25:10 rack3-client-2 kernel: [157126.098674] [<ffffffff8101b763>] ? native_sched_clock+0x13/0x80
Dec 18 18:25:10 rack3-client-2 kernel: [157126.098676] [<ffffffff8101b7d9>] ? sched_clock+0x9/0x10
Dec 18 18:25:10 rack3-client-2 kernel: [157126.098679] [<ffffffff8109d2d5>] ? sched_clock_cpu+0xb5/0x100
Dec 18 18:25:10 rack3-client-2 kernel: [157126.098681] [<ffffffff8109df6d>] ? vtime_common_task_switch+0x3d/0x40
Dec 18 18:25:10 rack3-client-2 kernel: [157126.098684] [<ffffffff810838a2>] process_one_work+0x182/0x450
Dec 18 18:25:10 rack3-client-2 kernel: [157126.098686] [<ffffffff81084641>] worker_thread+0x121/0x410
Dec 18 18:25:10 rack3-client-2 kernel: [157126.098688] [<ffffffff81084520>] ? rescuer_thread+0x3e0/0x3e0
Dec 18 18:25:10 rack3-client-2 kernel: [157126.098690] [<ffffffff8108b312>] kthread+0xd2/0xf0
Dec 18 18:25:10 rack3-client-2 kernel: [157126.098692] [<ffffffff8108b240>] ? kthread_create_on_node+0x1d0/0x1d0
Dec 18 18:25:10 rack3-client-2 kernel: [157126.098695] [<ffffffff8172637c>] ret_from_fork+0x7c/0xb0
Dec 18 18:25:10 rack3-client-2 kernel: [157126.098697] [<ffffffff8108b240>] ? kthread_create_on_node+0x1d0/0x1d0

(3)  BUG_ON(!list_empty(&req->r_req_lru_item));

Dec 4 17:14:33 rack6-ramp-4 kernel: [320359.828209] kernel BUG at /build/buildd/linux-3.13.0/net/ceph/osd_client.c:892!
Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.043935] Call Trace:
Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.097630] [<ffffffffa062d9e8>] con_work+0x298/0x640 [libceph]
Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.152461] [<ffffffff810838a2>] process_one_work+0x182/0x450
Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.206653] [<ffffffff81084641>] worker_thread+0x121/0x410
Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.259860] [<ffffffff81084520>] ? rescuer_thread+0x3e0/0x3e0
Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.312023] [<ffffffff8108b312>] kthread+0xd2/0xf0
Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.362974] [<ffffffff8108b240>] ? kthread_create_on_node+0x1d0/0x1d0
Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.414058] [<ffffffff8172637c>] ret_from_fork+0x7c/0xb0
Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.464358] [<ffffffff8108b240>] ? kthread_create_on_node+0x1d0/0x1d0

(4) img_request null
Dec 12 08:07:48 rack1-ram-6 kernel: [251596.908865] Assertion failure in rbd_img_obj_callback() at line 2127:
Dec 12 08:07:48 rack1-ram-6 kernel: [251596.908865]
Dec 12 08:07:48 rack1-ram-6 kernel: [251596.908865]     rbd_assert(img_request != NULL);

Dec 12 08:07:50 rack1-ram-6 kernel: [251597.257322]  [<ffffffffa01a5897>] rbd_obj_request_complete+0x27/0x70 [rbd]
Dec 12 08:07:50 rack1-ram-6 kernel: [251597.268450]  [<ffffffffa01a8d4f>] rbd_osd_req_callback+0xdf/0x4e0 [rbd]
Dec 12 08:07:50 rack1-ram-6 kernel: [251597.279182]  [<ffffffffa039e262>] dispatch+0x4a2/0x900 [libceph]
Dec 12 08:07:50 rack1-ram-6 kernel: [251597.289159]  [<ffffffffa039494b>] try_read+0x4ab/0x10d0 [libceph]
Dec 12 08:07:50 rack1-ram-6 kernel: [251597.299236]  [<ffffffffa0396362>] ? try_write+0xa42/0xe30 [libceph]
Dec 12 08:07:50 rack1-ram-6 kernel: [251597.309777]  [<ffffffff8101b7d9>] ? sched_clock+0x9/0x10
Dec 12 08:07:50 rack1-ram-6 kernel: [251597.318627]  [<ffffffff8101b763>] ? native_sched_clock+0x13/0x80
Dec 12 08:07:50 rack1-ram-6 kernel: [251597.332347]  [<ffffffff8101b7d9>] ? sched_clock+0x9/0x10
Dec 12 08:07:50 rack1-ram-6 kernel: [251597.341095]  [<ffffffff8109d2d5>] ? sched_clock_cpu+0xb5/0x100
Dec 12 08:07:50 rack1-ram-6 kernel: [251597.351061]  [<ffffffffa0396809>] con_work+0xb9/0x640 [libceph]
Dec 12 08:07:50 rack1-ram-6 kernel: [251597.361003]  [<ffffffff810838a2>] process_one_work+0x182/0x450
Dec 12 08:07:50 rack1-ram-6 kernel: [251597.370752]  [<ffffffff81084641>] worker_thread+0x121/0x410
Dec 12 08:07:50 rack1-ram-6 kernel: [251597.379816]  [<ffffffff81084520>] ? rescuer_thread+0x3e0/0x3e0
Dec 12 08:07:50 rack1-ram-6 kernel: [251597.389173]  [<ffffffff8108b312>] kthread+0xd2/0xf0
Dec 12 08:07:50 rack1-ram-6 kernel: [251597.396898]  [<ffffffff8108b240>] ? kthread_create_on_node+0x1d0/0x1d0
Dec 12 08:07:50 rack1-ram-6 kernel: [251597.407506]  [<ffffffff8172637c>] ret_from_fork+0x7c/0xb0
Dec 12 08:07:50 rack1-ram-6 kernel: [251597.416181]  [<ffffffff8108b240>] ? kthread_create_on_node+0x1d0/0x1d0
This is similar to: http://tracker.ceph.com/issues/8378

We saw that the rhel7a branch has many of the latest fixes and is somewhat compatible with 3.13 kernels.
For validation, we took the rhel7a ceph-client branch and, with minor modifications, got it to compile against the 3.13.0 headers. With this we did not hit any of the issues (except issue 2).
We understand that this is not the right approach for Ubuntu; it would be great if we could get the fixes into the Ubuntu 14.04 kernels as well.
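The build against the distro headers went roughly like this (a sketch: the rhel7a branch lives in the ceph/ceph-client repository, and the minor Makefile modifications mentioned above are still needed before it compiles cleanly):

```shell
# Sketch: build ceph-client rhel7a branch modules against the running
# Ubuntu 3.13 kernel's headers. Expect to need small Makefile tweaks
# before this compiles cleanly out of tree.
kdir() { echo "/lib/modules/$1/build"; }   # headers dir for a kernel release

# git clone https://github.com/ceph/ceph-client.git && cd ceph-client
# git checkout rhel7a
# make -C "$(kdir "$(uname -r)")" M="$PWD/net/ceph" modules
# make -C "$(kdir "$(uname -r)")" M="$PWD/drivers/block" CONFIG_BLK_DEV_RBD=m modules
echo "$(kdir 3.13.0-40-generic)"
```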

Regards,
Chaitanya

-----Original Message-----
From: Somnath Roy
Sent: Tuesday, January 06, 2015 2:38 AM
To: Ilya Dryomov
Cc: Chaitanya Huilgol; ceph-devel@vger.kernel.org
Subject: RE: Ceph-client branch for Ubuntu 14.04.1 LTS (3.13.0-x kernels)

It's happening both in idle and under load.
I don't have the trace right now but will get you one soon.

Thanks & Regards
Somnath

-----Original Message-----
From: Ilya Dryomov [mailto:ilya.dryomov@inktank.com]
Sent: Monday, January 05, 2015 12:34 PM
To: Somnath Roy
Cc: Chaitanya Huilgol; ceph-devel@vger.kernel.org
Subject: Re: Ceph-client branch for Ubuntu 14.04.1 LTS (3.13.0-x kernels)

On Mon, Jan 5, 2015 at 11:01 PM, Somnath Roy <Somnath.Roy@sandisk.com> wrote:
> Ilya,
> Here is the steps..
>
> 1. You have a cluster (3 nodes) and replication is 3
>
> 2. map krbd image to a client.
>
> 3. Reboot or stop ceph services on one or more nodes
>
> 4. The client with krbd mapped module crashes

Is it idle or under load?

Do you have a trace of the crash?

Thanks,

                Ilya




* Re: Ceph-client branch for Ubuntu 14.04.1 LTS (3.13.0-x kernels)
  2015-01-06 12:31             ` Chaitanya Huilgol
@ 2015-01-06 14:19               ` Ilya Dryomov
  2015-01-08  3:30                 ` Chaitanya Huilgol
  0 siblings, 1 reply; 15+ messages in thread
From: Ilya Dryomov @ 2015-01-06 14:19 UTC (permalink / raw)
  To: Chaitanya Huilgol; +Cc: Somnath Roy, ceph-devel

On Tue, Jan 6, 2015 at 3:31 PM, Chaitanya Huilgol
<Chaitanya.Huilgol@sandisk.com> wrote:
> Hi Ilya,
>
> The RBD crash when OSD nodes go away is routinely hit in our setups.
> We have not been able to get a good stack trace for this one due to our console capture issues, and these don't end up in the syslogs after the crash either. We will get you the traces soon.
> Most of the time this happens when all the OSD nodes go away at once.  This could probably have been fixed by one of the following commits?
>
> Ilya Dryomov
> libceph: change from BUG to WARN for __remove_osd() asserts …
> idryomov authored on Nov 5
> cc9f1f5
> Ilya Dryomov
> libceph: clear r_req_lru_item in __unregister_linger_request() …
> idryomov authored on Nov 5
> ba9d114
> Ilya Dryomov
> libceph: unlink from o_linger_requests when clearing r_osd …
> idryomov authored on Nov 4
> a390de0

Yes, but probably others as well.

>
> Also, we have encountered a few other issues, listed below
>
> (1) Soft Lockup issue
> Dec 10 11:22:28 rack3-client-1 kernel: [661597.506625] BUG: soft lockup - CPU#2 stuck for 22s! [java:29169] --- (vdbench process)
> .
> .
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.043935] Call Trace:
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.097630] [<ffffffffa062d9e8>] con_work+0x298/0x640 [libceph]
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.152461] [<ffffffff810838a2>] process_one_work+0x182/0x450
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.206653] [<ffffffff81084641>] worker_thread+0x121/0x410
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.259860] [<ffffffff81084520>] ? rescuer_thread+0x3e0/0x3e0
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.312023] [<ffffffff8108b312>] kthread+0xd2/0xf0
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.362974] [<ffffffff8108b240>] ? kthread_create_on_node+0x1d0/0x1d0
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.414058] [<ffffffff8172637c>] ret_from_fork+0x7c/0xb0
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.464358] [<ffffffff8108b240>] ? kthread_create_on_node+0x1d0/0x1d0
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.514121] Code: ff ff 48 89 df e8 e3 f1 ff ff 48 8b 7d a8 e8 7a 8c 0e e1 48 8b 7d b0 e8 41 d8 a7 e0 48 83 c4 30 5b 41 5c 41 5d 41 5e 41 5f 5d c3 <0f> 0b 48 8b 45 b8 49 8b 0e 4c 89 f2 48 c7 c6 d0 76 64 a0 48 c7
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.663443] RIP [<ffffffffa063340e>] osd_reset+0x22e/0x2c0 [libceph]
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.712105] RSP <ffff880a22b8bd80>
>
> (2) Soft lockup when OSDs are flapping
>
> Dec 18 18:25:10 rack3-client-2 kernel: [157126.089489] BUG: soft lockup - CPU#4 stuck for 23s! [kworker/4:0:45012]
> .
> .
> Dec 18 18:25:10 rack3-client-2 kernel: [157126.098648] Call Trace:
> Dec 18 18:25:10 rack3-client-2 kernel: [157126.098653] [<ffffffffa030d963>] kick_requests+0x1e3/0x440 [libceph]
> Dec 18 18:25:10 rack3-client-2 kernel: [157126.098657] [<ffffffffa030df98>] ceph_osdc_handle_map+0x2a8/0x620 [libceph]
> Dec 18 18:25:10 rack3-client-2 kernel: [157126.098662] [<ffffffffa030e55b>] dispatch+0x24b/0xb20 [libceph]
> Dec 18 18:25:10 rack3-client-2 kernel: [157126.098665] [<ffffffffa0301c08>] ? ceph_tcp_recvmsg+0x48/0x60 [libceph]
> Dec 18 18:25:10 rack3-client-2 kernel: [157126.098669] [<ffffffffa030552f>] con_work+0x164f/0x2b60 [libceph]
> Dec 18 18:25:10 rack3-client-2 kernel: [157126.098672] [<ffffffff8101b7d9>] ? sched_clock+0x9/0x10
> Dec 18 18:25:10 rack3-client-2 kernel: [157126.098674] [<ffffffff8101b763>] ? native_sched_clock+0x13/0x80
> Dec 18 18:25:10 rack3-client-2 kernel: [157126.098676] [<ffffffff8101b7d9>] ? sched_clock+0x9/0x10
> Dec 18 18:25:10 rack3-client-2 kernel: [157126.098679] [<ffffffff8109d2d5>] ? sched_clock_cpu+0xb5/0x100
> Dec 18 18:25:10 rack3-client-2 kernel: [157126.098681] [<ffffffff8109df6d>] ? vtime_common_task_switch+0x3d/0x40
> Dec 18 18:25:10 rack3-client-2 kernel: [157126.098684] [<ffffffff810838a2>] process_one_work+0x182/0x450
> Dec 18 18:25:10 rack3-client-2 kernel: [157126.098686] [<ffffffff81084641>] worker_thread+0x121/0x410
> Dec 18 18:25:10 rack3-client-2 kernel: [157126.098688] [<ffffffff81084520>] ? rescuer_thread+0x3e0/0x3e0
> Dec 18 18:25:10 rack3-client-2 kernel: [157126.098690] [<ffffffff8108b312>] kthread+0xd2/0xf0
> Dec 18 18:25:10 rack3-client-2 kernel: [157126.098692] [<ffffffff8108b240>] ? kthread_create_on_node+0x1d0/0x1d0
> Dec 18 18:25:10 rack3-client-2 kernel: [157126.098695] [<ffffffff8172637c>] ret_from_fork+0x7c/0xb0
> Dec 18 18:25:10 rack3-client-2 kernel: [157126.098697] [<ffffffff8108b240>] ? kthread_create_on_node+0x1d0/0x1d0
>
> (3)  BUG_ON(!list_empty(&req->r_req_lru_item));
>
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320359.828209] kernel BUG at /build/buildd/linux-3.13.0/net/ceph/osd_client.c:892!
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.043935] Call Trace:
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.097630] [<ffffffffa062d9e8>] con_work+0x298/0x640 [libceph]
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.152461] [<ffffffff810838a2>] process_one_work+0x182/0x450
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.206653] [<ffffffff81084641>] worker_thread+0x121/0x410
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.259860] [<ffffffff81084520>] ? rescuer_thread+0x3e0/0x3e0
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.312023] [<ffffffff8108b312>] kthread+0xd2/0xf0
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.362974] [<ffffffff8108b240>] ? kthread_create_on_node+0x1d0/0x1d0
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.414058] [<ffffffff8172637c>] ret_from_fork+0x7c/0xb0
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.464358] [<ffffffff8108b240>] ? kthread_create_on_node+0x1d0/0x1d0
>
> (4) img_request null
> Dec 12 08:07:48 rack1-ram-6 kernel: [251596.908865] Assertion failure in rbd_img_obj_callback() at line 2127:
> Dec 12 08:07:48 rack1-ram-6 kernel: [251596.908865]
> Dec 12 08:07:48 rack1-ram-6 kernel: [251596.908865]     rbd_assert(img_request != NULL);
>
> Dec 12 08:07:50 rack1-ram-6 kernel: [251597.257322]  [<ffffffffa01a5897>] rbd_obj_request_complete+0x27/0x70 [rbd]
> Dec 12 08:07:50 rack1-ram-6 kernel: [251597.268450]  [<ffffffffa01a8d4f>] rbd_osd_req_callback+0xdf/0x4e0 [rbd]
> Dec 12 08:07:50 rack1-ram-6 kernel: [251597.279182]  [<ffffffffa039e262>] dispatch+0x4a2/0x900 [libceph]
> Dec 12 08:07:50 rack1-ram-6 kernel: [251597.289159]  [<ffffffffa039494b>] try_read+0x4ab/0x10d0 [libceph]
> Dec 12 08:07:50 rack1-ram-6 kernel: [251597.299236]  [<ffffffffa0396362>] ? try_write+0xa42/0xe30 [libceph]
> Dec 12 08:07:50 rack1-ram-6 kernel: [251597.309777]  [<ffffffff8101b7d9>] ? sched_clock+0x9/0x10
> Dec 12 08:07:50 rack1-ram-6 kernel: [251597.318627]  [<ffffffff8101b763>] ? native_sched_clock+0x13/0x80
> Dec 12 08:07:50 rack1-ram-6 kernel: [251597.332347]  [<ffffffff8101b7d9>] ? sched_clock+0x9/0x10
> Dec 12 08:07:50 rack1-ram-6 kernel: [251597.341095]  [<ffffffff8109d2d5>] ? sched_clock_cpu+0xb5/0x100
> Dec 12 08:07:50 rack1-ram-6 kernel: [251597.351061]  [<ffffffffa0396809>] con_work+0xb9/0x640 [libceph]
> Dec 12 08:07:50 rack1-ram-6 kernel: [251597.361003]  [<ffffffff810838a2>] process_one_work+0x182/0x450
> Dec 12 08:07:50 rack1-ram-6 kernel: [251597.370752]  [<ffffffff81084641>] worker_thread+0x121/0x410
> Dec 12 08:07:50 rack1-ram-6 kernel: [251597.379816]  [<ffffffff81084520>] ? rescuer_thread+0x3e0/0x3e0
> Dec 12 08:07:50 rack1-ram-6 kernel: [251597.389173]  [<ffffffff8108b312>] kthread+0xd2/0xf0
> Dec 12 08:07:50 rack1-ram-6 kernel: [251597.396898]  [<ffffffff8108b240>] ? kthread_create_on_node+0x1d0/0x1d0
> Dec 12 08:07:50 rack1-ram-6 kernel: [251597.407506]  [<ffffffff8172637c>] ret_from_fork+0x7c/0xb0
> Dec 12 08:07:50 rack1-ram-6 kernel: [251597.416181]  [<ffffffff8108b240>] ? kthread_create_on_node+0x1d0/0x1d0
> This is similar to: http://tracker.ceph.com/issues/8378
>
> Saw that the rhel7a branch has many of the latest fixes and is somewhat compatible with 3.13 kernels.
> For validation, we have taken the rhel7a ceph-client branch and with minor modification gotten it to compile with 3.13.0 headers. With this we did not hit any issues (expect issue-2).

What do you mean by "expect issue-2"?

(3) and (4) should be fixed in rhel7-a.  Can't say anything about (1)
and (2) - please report back if you see any soft lockup splats on
rhel7-a.

> We understand that this is not the right approach for Ubuntu; it would be great if we could get the fixes into the Ubuntu 14.04 kernels as well.

It may not be the right approach, but in many ways it's better than
a set of selected backports.  While working on another report I found
a couple easy-to-backport patches that are missing from Ubuntu 3.13
series and will forward them to stable guys, but, for those who can
build their own kernels at least, branches like rhel7-a are best.
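For readers who do want to build their own modules, the approach discussed in this thread might look roughly like the following. The branch name comes from this thread; the out-of-tree make invocation and the need for "minor modification" on 3.13.0-x headers are assumptions, not a tested recipe.

```shell
# Sketch: build the ceph-client rhel7-a modules against the running
# Ubuntu 3.13 kernel's headers. As noted in this thread, minor source
# tweaks may be needed before this compiles cleanly.
git clone https://github.com/ceph/ceph-client.git
cd ceph-client
git checkout rhel7-a

# Build libceph and rbd as external modules against the installed headers
# (requires the linux-headers package for the running kernel).
make -C /lib/modules/$(uname -r)/build M=$PWD/net/ceph modules
make -C /lib/modules/$(uname -r)/build M=$PWD/drivers/block modules   # rbd.ko
```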

Thanks,

                Ilya
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* RE: Ceph-client branch for Ubuntu 14.04.1 LTS (3.13.0-x kernels)
  2015-01-06 14:19               ` Ilya Dryomov
@ 2015-01-08  3:30                 ` Chaitanya Huilgol
  2015-01-08  8:22                   ` Ilya Dryomov
  0 siblings, 1 reply; 15+ messages in thread
From: Chaitanya Huilgol @ 2015-01-08  3:30 UTC (permalink / raw)
  To: Ilya Dryomov; +Cc: Somnath Roy, ceph-devel

We have hit issue-2 too on the rhel7a code base (soft lockup in ceph_osdc_handle_map when a large number of OSDs were flapping due to spurious heartbeat failures).  We have not been able to reproduce the other issues.
On a side note, are the changes in the ceph-client rhel7a branch being actively pulled into the rhel7/centos7 kernel updates?

-----Original Message-----
From: Ilya Dryomov [mailto:ilya.dryomov@inktank.com]
Sent: Tuesday, January 06, 2015 7:49 PM
To: Chaitanya Huilgol
Cc: Somnath Roy; ceph-devel@vger.kernel.org
Subject: Re: Ceph-client branch for Ubuntu 14.04.1 LTS (3.13.0-x kernels)

On Tue, Jan 6, 2015 at 3:31 PM, Chaitanya Huilgol <Chaitanya.Huilgol@sandisk.com> wrote:
> Hi Ilya,
>
> The RBD crash when OSD nodes go away is routinely hit in our setups.
> We have not been able to get a good stack trace for this one due to our console capture issues, and these don't end up in the syslogs after the crash either. We will get you the traces soon.
> Most of the time this happens when all the OSD nodes go away at once.  This could probably have been fixed by one of the following commits?
>
> Ilya Dryomov
> libceph: change from BUG to WARN for __remove_osd() asserts …
> idryomov authored on Nov 5
> cc9f1f5
> Ilya Dryomov
> libceph: clear r_req_lru_item in __unregister_linger_request() …
> idryomov authored on Nov 5
> ba9d114
> Ilya Dryomov
> libceph: unlink from o_linger_requests when clearing r_osd …
> idryomov authored on Nov 4
> a390de0

Yes, but probably others as well.

>
> Also, we have encountered a few other issues, listed below
>
> (1) Soft Lockup issue
> Dec 10 11:22:28 rack3-client-1 kernel: [661597.506625] BUG: soft lockup - CPU#2 stuck for 22s! [java:29169] --- (vdbench process)
> .
> .
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.043935] Call Trace:
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.097630] [<ffffffffa062d9e8>] con_work+0x298/0x640 [libceph]
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.152461] [<ffffffff810838a2>] process_one_work+0x182/0x450
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.206653] [<ffffffff81084641>] worker_thread+0x121/0x410
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.259860] [<ffffffff81084520>] ? rescuer_thread+0x3e0/0x3e0
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.312023] [<ffffffff8108b312>] kthread+0xd2/0xf0
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.362974] [<ffffffff8108b240>] ? kthread_create_on_node+0x1d0/0x1d0
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.414058] [<ffffffff8172637c>] ret_from_fork+0x7c/0xb0
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.464358] [<ffffffff8108b240>] ? kthread_create_on_node+0x1d0/0x1d0
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.514121] Code: ff ff 48 89 df e8 e3 f1 ff ff 48 8b 7d a8 e8 7a 8c 0e e1 48 8b 7d b0 e8 41 d8 a7 e0 48 83 c4 30 5b 41 5c 41 5d 41 5e 41 5f 5d c3 <0f> 0b 48 8b 45 b8 49 8b 0e 4c 89 f2 48 c7 c6 d0 76 64 a0 48 c7
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.663443] RIP [<ffffffffa063340e>] osd_reset+0x22e/0x2c0 [libceph]
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.712105] RSP <ffff880a22b8bd80>
>
> (2) Soft lockup when OSDs are flapping
>
> Dec 18 18:25:10 rack3-client-2 kernel: [157126.089489] BUG: soft lockup - CPU#4 stuck for 23s! [kworker/4:0:45012]
> .
> .
> Dec 18 18:25:10 rack3-client-2 kernel: [157126.098648] Call Trace:
> Dec 18 18:25:10 rack3-client-2 kernel: [157126.098653] [<ffffffffa030d963>] kick_requests+0x1e3/0x440 [libceph]
> Dec 18 18:25:10 rack3-client-2 kernel: [157126.098657] [<ffffffffa030df98>] ceph_osdc_handle_map+0x2a8/0x620 [libceph]
> Dec 18 18:25:10 rack3-client-2 kernel: [157126.098662] [<ffffffffa030e55b>] dispatch+0x24b/0xb20 [libceph]
> Dec 18 18:25:10 rack3-client-2 kernel: [157126.098665] [<ffffffffa0301c08>] ? ceph_tcp_recvmsg+0x48/0x60 [libceph]
> Dec 18 18:25:10 rack3-client-2 kernel: [157126.098669] [<ffffffffa030552f>] con_work+0x164f/0x2b60 [libceph]
> Dec 18 18:25:10 rack3-client-2 kernel: [157126.098672] [<ffffffff8101b7d9>] ? sched_clock+0x9/0x10
> Dec 18 18:25:10 rack3-client-2 kernel: [157126.098674] [<ffffffff8101b763>] ? native_sched_clock+0x13/0x80
> Dec 18 18:25:10 rack3-client-2 kernel: [157126.098676] [<ffffffff8101b7d9>] ? sched_clock+0x9/0x10
> Dec 18 18:25:10 rack3-client-2 kernel: [157126.098679] [<ffffffff8109d2d5>] ? sched_clock_cpu+0xb5/0x100
> Dec 18 18:25:10 rack3-client-2 kernel: [157126.098681] [<ffffffff8109df6d>] ? vtime_common_task_switch+0x3d/0x40
> Dec 18 18:25:10 rack3-client-2 kernel: [157126.098684] [<ffffffff810838a2>] process_one_work+0x182/0x450
> Dec 18 18:25:10 rack3-client-2 kernel: [157126.098686] [<ffffffff81084641>] worker_thread+0x121/0x410
> Dec 18 18:25:10 rack3-client-2 kernel: [157126.098688] [<ffffffff81084520>] ? rescuer_thread+0x3e0/0x3e0
> Dec 18 18:25:10 rack3-client-2 kernel: [157126.098690] [<ffffffff8108b312>] kthread+0xd2/0xf0
> Dec 18 18:25:10 rack3-client-2 kernel: [157126.098692] [<ffffffff8108b240>] ? kthread_create_on_node+0x1d0/0x1d0
> Dec 18 18:25:10 rack3-client-2 kernel: [157126.098695] [<ffffffff8172637c>] ret_from_fork+0x7c/0xb0
> Dec 18 18:25:10 rack3-client-2 kernel: [157126.098697] [<ffffffff8108b240>] ? kthread_create_on_node+0x1d0/0x1d0
>
> (3)  BUG_ON(!list_empty(&req->r_req_lru_item));
>
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320359.828209] kernel BUG at /build/buildd/linux-3.13.0/net/ceph/osd_client.c:892!
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.043935] Call Trace:
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.097630] [<ffffffffa062d9e8>] con_work+0x298/0x640 [libceph]
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.152461] [<ffffffff810838a2>] process_one_work+0x182/0x450
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.206653] [<ffffffff81084641>] worker_thread+0x121/0x410
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.259860] [<ffffffff81084520>] ? rescuer_thread+0x3e0/0x3e0
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.312023] [<ffffffff8108b312>] kthread+0xd2/0xf0
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.362974] [<ffffffff8108b240>] ? kthread_create_on_node+0x1d0/0x1d0
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.414058] [<ffffffff8172637c>] ret_from_fork+0x7c/0xb0
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.464358] [<ffffffff8108b240>] ? kthread_create_on_node+0x1d0/0x1d0
>
> (4) img_request null
> Dec 12 08:07:48 rack1-ram-6 kernel: [251596.908865] Assertion failure in rbd_img_obj_callback() at line 2127:
> Dec 12 08:07:48 rack1-ram-6 kernel: [251596.908865]
> Dec 12 08:07:48 rack1-ram-6 kernel: [251596.908865]     rbd_assert(img_request != NULL);
>
> Dec 12 08:07:50 rack1-ram-6 kernel: [251597.257322]  [<ffffffffa01a5897>] rbd_obj_request_complete+0x27/0x70 [rbd]
> Dec 12 08:07:50 rack1-ram-6 kernel: [251597.268450]  [<ffffffffa01a8d4f>] rbd_osd_req_callback+0xdf/0x4e0 [rbd]
> Dec 12 08:07:50 rack1-ram-6 kernel: [251597.279182]  [<ffffffffa039e262>] dispatch+0x4a2/0x900 [libceph]
> Dec 12 08:07:50 rack1-ram-6 kernel: [251597.289159]  [<ffffffffa039494b>] try_read+0x4ab/0x10d0 [libceph]
> Dec 12 08:07:50 rack1-ram-6 kernel: [251597.299236]  [<ffffffffa0396362>] ? try_write+0xa42/0xe30 [libceph]
> Dec 12 08:07:50 rack1-ram-6 kernel: [251597.309777]  [<ffffffff8101b7d9>] ? sched_clock+0x9/0x10
> Dec 12 08:07:50 rack1-ram-6 kernel: [251597.318627]  [<ffffffff8101b763>] ? native_sched_clock+0x13/0x80
> Dec 12 08:07:50 rack1-ram-6 kernel: [251597.332347]  [<ffffffff8101b7d9>] ? sched_clock+0x9/0x10
> Dec 12 08:07:50 rack1-ram-6 kernel: [251597.341095]  [<ffffffff8109d2d5>] ? sched_clock_cpu+0xb5/0x100
> Dec 12 08:07:50 rack1-ram-6 kernel: [251597.351061]  [<ffffffffa0396809>] con_work+0xb9/0x640 [libceph]
> Dec 12 08:07:50 rack1-ram-6 kernel: [251597.361003]  [<ffffffff810838a2>] process_one_work+0x182/0x450
> Dec 12 08:07:50 rack1-ram-6 kernel: [251597.370752]  [<ffffffff81084641>] worker_thread+0x121/0x410
> Dec 12 08:07:50 rack1-ram-6 kernel: [251597.379816]  [<ffffffff81084520>] ? rescuer_thread+0x3e0/0x3e0
> Dec 12 08:07:50 rack1-ram-6 kernel: [251597.389173]  [<ffffffff8108b312>] kthread+0xd2/0xf0
> Dec 12 08:07:50 rack1-ram-6 kernel: [251597.396898]  [<ffffffff8108b240>] ? kthread_create_on_node+0x1d0/0x1d0
> Dec 12 08:07:50 rack1-ram-6 kernel: [251597.407506]  [<ffffffff8172637c>] ret_from_fork+0x7c/0xb0
> Dec 12 08:07:50 rack1-ram-6 kernel: [251597.416181]  [<ffffffff8108b240>] ? kthread_create_on_node+0x1d0/0x1d0
> This is similar to: http://tracker.ceph.com/issues/8378
>
> Saw that the rhel7a branch has many of the latest fixes and is somewhat compatible with 3.13 kernels.
> For validation, we have taken the rhel7a ceph-client branch and with minor modification gotten it to compile with the 3.13.0 headers. With this we did not hit any issues (expect issue-2).

What do you mean by "expect issue-2"?

(3) and (4) should be fixed in rhel7-a.  Can't say anything about (1) and (2) - please report back if you see any soft lockup splats on rhel7-a.

> We understand that this is not the right approach for Ubuntu; it would be great if we could get the fixes into the Ubuntu 14.04 kernels as well.

It may not be the right approach, but in many ways it's better than a set of selected backports.  While working on another report I found a couple easy-to-backport patches that are missing from Ubuntu 3.13 series and will forward them to stable guys, but, for those who can build their own kernels at least, branches like rhel7-a are best.

Thanks,

                Ilya




* Re: Ceph-client branch for Ubuntu 14.04.1 LTS (3.13.0-x kernels)
  2015-01-08  3:30                 ` Chaitanya Huilgol
@ 2015-01-08  8:22                   ` Ilya Dryomov
  0 siblings, 0 replies; 15+ messages in thread
From: Ilya Dryomov @ 2015-01-08  8:22 UTC (permalink / raw)
  To: Chaitanya Huilgol; +Cc: Somnath Roy, ceph-devel

On Thu, Jan 8, 2015 at 6:30 AM, Chaitanya Huilgol
<Chaitanya.Huilgol@sandisk.com> wrote:
> We have hit issue-2 too on the rhel7a code base (soft lockup in ceph_osdc_handle_map when a large number of OSDs were flapping due to spurious heartbeat failures).  We have not been able to reproduce the other issues.

Can I see the entire dmesg of a boot when it happened on rhel7-a?
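Since console capture has been a problem in this thread, one way to get such splats off the crashing box is the kernel's netconsole module, which streams printk output over UDP. All addresses, ports, MACs, and interface names below are placeholders.

```shell
# Hypothetical netconsole setup to stream kernel messages (including
# soft-lockup and oops splats) to a remote listener.
# Format: netconsole=src-port@src-ip/dev,tgt-port@tgt-ip/tgt-mac
sudo modprobe netconsole \
    netconsole=6665@10.0.0.2/eth0,6666@10.0.0.1/00:11:22:33:44:55

# On the receiving host (10.0.0.1), capture the stream:
nc -u -l 6666 | tee client-dmesg.log
```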

> On a side note, are the changes in the ceph-client rhel7a branch being actively pulled into the rhel7/centos7 kernel updates?

All of rhel7-a is on its way to rhel7.1 I think.  Not sure about centos.

Thanks,

                Ilya


* Re: Ceph-client branch for Ubuntu 14.04.1 LTS (3.13.0-x kernels)
  2015-01-05 11:15 ` Wido den Hollander
@ 2015-01-08  8:49   ` joel.merrick
  0 siblings, 0 replies; 15+ messages in thread
From: joel.merrick @ 2015-01-08  8:49 UTC (permalink / raw)
  To: Wido den Hollander; +Cc: Chaitanya Huilgol, ceph-devel

On Mon, Jan 5, 2015 at 11:15 AM, Wido den Hollander <wido@42on.com> wrote:
>
>
> On 05-01-15 11:53, Chaitanya Huilgol wrote:
>>
>> Hi All,
>>
>> The stock ceph-client modules with Ubuntu 14.04 LTS are quite dated and we
>> are seeing crashes and soft-lockup issues which have been fixed in the
>> current ceph-client code base.
>> What would be recommended ceph-client branch compatible with the Ubuntu
>> 14.04 (3.13.0-x) kernels so that we can get as many fixes as possible?
>>
>
> I recommend you take a look here:
> http://kernel.ubuntu.com/~kernel-ppa/mainline/
>
> That should give you some new kernels.

Just to throw in another method... Ubuntu also ships the kernels from
newer non-LTS releases (as well as X components and other cherry-picked
bits) in its LTS releases, so linux-generic-lts-utopic currently
exists in 14.04. As Ilya mentioned, though, the fixes should be marked
anyway in the stock Ubuntu 14.04 kernels (I use them without issue,
but use cases may vary), so this could still be useful knowledge for
someone.
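When juggling stock, LTS-enablement, and mainline-PPA kernels, it helps to check programmatically whether the running kernel is at least a given version. A minimal sketch using sort -V from coreutils; the helper name is made up for illustration:

```shell
# kernel_at_least: succeed if version string $1 >= $2 in version order.
# Works on strings like "3.13.0-43-generic" or the output of `uname -r`.
kernel_at_least() {
    [ "$(printf '%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

kernel_at_least "3.16.0-28-generic" "3.16" && echo "new enough"
kernel_at_least "3.13.0-43-generic" "3.16" || echo "too old"
```

Swap in `$(uname -r)` for the first argument to test the running kernel.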



-- 
$ echo "kpfmAdpoofdufevq/dp/vl" | perl -pe 's/(.)/chr(ord($1)-1)/ge'



Thread overview: 15+ messages
2015-01-05 10:53 Ceph-client branch for Ubuntu 14.04.1 LTS (3.13.0-x kernels) Chaitanya Huilgol
2015-01-05 11:15 ` Wido den Hollander
2015-01-08  8:49   ` joel.merrick
2015-01-05 15:57 ` Ilya Dryomov
2015-01-05 17:11   ` Somnath Roy
2015-01-05 18:50     ` Ilya Dryomov
2015-01-05 20:01       ` Somnath Roy
2015-01-05 20:33         ` Ilya Dryomov
2015-01-05 21:08           ` Somnath Roy
2015-01-06 12:31             ` Chaitanya Huilgol
2015-01-06 14:19               ` Ilya Dryomov
2015-01-08  3:30                 ` Chaitanya Huilgol
2015-01-08  8:22                   ` Ilya Dryomov
2015-01-05 21:54           ` Somnath Roy
2015-01-06  2:36   ` Chaitanya Huilgol
