* Possible bug in krbd (4.4.0)
@ 2017-01-04  0:13 Max Yehorov
  2017-01-04 16:15 ` Max Yehorov
  2017-01-07 16:08 ` Ilya Dryomov
  0 siblings, 2 replies; 8+ messages in thread
From: Max Yehorov @ 2017-01-04  0:13 UTC (permalink / raw)
  To: ceph-devel

Hi,

I have encountered what may be a weird bug. There is an rbd image
mapped and mounted on a client machine, and it is not possible to
umount it. Neither lsof nor fuser shows any mention of the device or
the mountpoint, and it is not exported via the NFS kernel server, so
it is unlikely to be blocked by the kernel.

There is an odd pattern in syslog: two OSDs are constantly losing
their connections. A wild guess is that umount tries to contact the
primary OSD and fails?

After I enabled kernel debug I saw the following:

[9586733.605792] libceph:  con_open ffff880748f58030 10.80.16.74:6812
[9586733.623876] libceph:  connect 10.80.16.74:6812
[9586733.625091] libceph:  connect 10.80.16.74:6812 EINPROGRESS sk_state = 2
[9586756.681246] libceph:  con_keepalive ffff881057d082b8
[9586767.713067] libceph:  fault ffff880748f59830 state 5 to peer
10.80.16.78:6812
[9586767.713593] libceph: osd27 10.80.16.78:6812 socket closed (con state OPEN)
[9586767.721145] libceph:  con_close ffff880748f59830 peer 10.80.16.78:6812
[9586767.724440] libceph:  con_open ffff880748f59830 10.80.16.78:6812
[9586767.742487] libceph:  connect 10.80.16.78:6812
[9586767.743696] libceph:  connect 10.80.16.78:6812 EINPROGRESS sk_state = 2
[9587346.956812] libceph:  try_read start on ffff881057d082b8 state 5
[9587466.968125] libceph:  try_write start ffff881057d082b8 state 5
[9587634.021257] libceph:  fault ffff880748f58030 state 5 to peer
10.80.16.74:6812
[9587634.021781] libceph: osd19 10.80.16.74:6812 socket closed (con state OPEN)
[9587634.029336] libceph:  con_close ffff880748f58030 peer 10.80.16.74:6812
[9587634.032628] libceph:  con_open ffff880748f58030 10.80.16.74:6812
[9587634.050677] libceph:  connect 10.80.16.74:6812
[9587634.051888] libceph:  connect 10.80.16.74:6812 EINPROGRESS sk_state = 2
[9587668.124746] libceph:  fault ffff880748f59830 state 5 to peer 10.80.16.78:6812

grep of ceph_sock_state_change:
kernel: [9585833.117190] libceph:  ceph_sock_state_change ffff880748f58030 state = CON_STATE_OPEN(5) sk_state = TCP_CLOSE_WAIT
kernel: [9585833.121912] libceph:  ceph_sock_state_change ffff880748f58030 state = CON_STATE_OPEN(5) sk_state = TCP_LAST_ACK
kernel: [9585833.122467] libceph:  ceph_sock_state_change ffff880748f58030 state = CON_STATE_OPEN(5) sk_state = TCP_CLOSE
kernel: [9585833.151589] libceph:  ceph_sock_state_change ffff880748f58030 state = CON_STATE_CONNECTING(3) sk_state = TCP_ESTABLISHED
kernel: [9586733.591304] libceph:  ceph_sock_state_change ffff880748f58030 state = CON_STATE_OPEN(5) sk_state = TCP_CLOSE_WAIT
kernel: [9586733.596020] libceph:  ceph_sock_state_change ffff880748f58030 state = CON_STATE_OPEN(5) sk_state = TCP_LAST_ACK
kernel: [9586733.596573] libceph:  ceph_sock_state_change ffff880748f58030 state = CON_STATE_OPEN(5) sk_state = TCP_CLOSE
kernel: [9586733.625709] libceph:  ceph_sock_state_change ffff880748f58030 state = CON_STATE_CONNECTING(3) sk_state = TCP_ESTABLISHED
kernel: [9587634.018152] libceph:  ceph_sock_state_change ffff880748f58030 state = CON_STATE_OPEN(5) sk_state = TCP_CLOSE_WAIT
kernel: [9587634.022853] libceph:  ceph_sock_state_change ffff880748f58030 state = CON_STATE_OPEN(5) sk_state = TCP_LAST_ACK
kernel: [9587634.023406] libceph:  ceph_sock_state_change ffff880748f58030 state = CON_STATE_OPEN(5) sk_state = TCP_CLOSE

A couple of observations:
the two OSDs in question have the same port, 6812, but different IPs
(10.80.16.74 and 10.80.16.78). What is more interesting is that they
have the same ceph_connection struct; note the ffff880748f59830 in the
log snippet above. So it seems that because two "struct sock *sk"
share the same "ceph_connection *con = sk->sk_user_data", they enter
an endless loop of establishing and closing the connection.

Does it sound plausible?


* Re: Possible bug in krbd (4.4.0)
  2017-01-04  0:13 Possible bug in krbd (4.4.0) Max Yehorov
@ 2017-01-04 16:15 ` Max Yehorov
  2017-01-07 16:08 ` Ilya Dryomov
  1 sibling, 0 replies; 8+ messages in thread
From: Max Yehorov @ 2017-01-04 16:15 UTC (permalink / raw)
  To: ceph-devel

Please disregard the comment about "the same ceph_connection struct".

On Tue, Jan 3, 2017 at 4:13 PM, Max Yehorov <myehorov@skytap.com> wrote:
> Hi,
>
> I have encountered what may be a weird bug. There is an rbd image
> mapped and mounted on a client machine, and it is not possible to
> umount it. Neither lsof nor fuser shows any mention of the device or
> the mountpoint, and it is not exported via the NFS kernel server, so
> it is unlikely to be blocked by the kernel.
>
> There is an odd pattern in syslog: two OSDs are constantly losing
> their connections. A wild guess is that umount tries to contact the
> primary OSD and fails?
>
> After I enabled kernel debug I saw the following:
>
> [9586733.605792] libceph:  con_open ffff880748f58030 10.80.16.74:6812
> [9586733.623876] libceph:  connect 10.80.16.74:6812
> [9586733.625091] libceph:  connect 10.80.16.74:6812 EINPROGRESS sk_state = 2
> [9586756.681246] libceph:  con_keepalive ffff881057d082b8
> [9586767.713067] libceph:  fault ffff880748f59830 state 5 to peer
> 10.80.16.78:6812
> [9586767.713593] libceph: osd27 10.80.16.78:6812 socket closed (con state OPEN)
> [9586767.721145] libceph:  con_close ffff880748f59830 peer 10.80.16.78:6812
> [9586767.724440] libceph:  con_open ffff880748f59830 10.80.16.78:6812
> [9586767.742487] libceph:  connect 10.80.16.78:6812
> [9586767.743696] libceph:  connect 10.80.16.78:6812 EINPROGRESS sk_state = 2
> [9587346.956812] libceph:  try_read start on ffff881057d082b8 state 5
> [9587466.968125] libceph:  try_write start ffff881057d082b8 state 5
> [9587634.021257] libceph:  fault ffff880748f58030 state 5 to peer
> 10.80.16.74:6812
> [9587634.021781] libceph: osd19 10.80.16.74:6812 socket closed (con state OPEN)
> [9587634.029336] libceph:  con_close ffff880748f58030 peer 10.80.16.74:6812
> [9587634.032628] libceph:  con_open ffff880748f58030 10.80.16.74:6812
> [9587634.050677] libceph:  connect 10.80.16.74:6812
> [9587634.051888] libceph:  connect 10.80.16.74:6812 EINPROGRESS sk_state = 2
> [9587668.124746] libceph:  fault ffff880748f59830 state 5 to peer
> 10.80.16.78:6812
>
> grep of ceph_sock_state_change
> kernel: [9585833.117190] libceph:  ceph_sock_state_change
> ffff880748f58030 state = CON_STATE_OPEN(5) sk_state = TCP_CLOSE_WAIT
> kernel: [9585833.121912] libceph:  ceph_sock_state_change
> ffff880748f58030 state = CON_STATE_OPEN(5) sk_state = TCP_LAST_ACK
> kernel: [9585833.122467] libceph:  ceph_sock_state_change
> ffff880748f58030 state = CON_STATE_OPEN(5) sk_state = TCP_CLOSE
> kernel: [9585833.151589] libceph:  ceph_sock_state_change
> ffff880748f58030 state = CON_STATE_CONNECTING(3) sk_state =
> TCP_ESTABLISHED
> kernel: [9586733.591304] libceph:  ceph_sock_state_change
> ffff880748f58030 state = CON_STATE_OPEN(5) sk_state = TCP_CLOSE_WAIT
> kernel: [9586733.596020] libceph:  ceph_sock_state_change
> ffff880748f58030 state = CON_STATE_OPEN(5) sk_state = TCP_LAST_ACK
> kernel: [9586733.596573] libceph:  ceph_sock_state_change
> ffff880748f58030 state = CON_STATE_OPEN(5) sk_state = TCP_CLOSE
> kernel: [9586733.625709] libceph:  ceph_sock_state_change
> ffff880748f58030 state = CON_STATE_CONNECTING(3) sk_state =
> TCP_ESTABLISHED
> kernel: [9587634.018152] libceph:  ceph_sock_state_change
> ffff880748f58030 state = CON_STATE_OPEN(5) sk_state = TCP_CLOSE_WAIT
> kernel: [9587634.022853] libceph:  ceph_sock_state_change
> ffff880748f58030 state = CON_STATE_OPEN(5) sk_state = TCP_LAST_ACK
> kernel: [9587634.023406] libceph:  ceph_sock_state_change
> ffff880748f58030 state = CON_STATE_OPEN(5) sk_state = TCP_CLOSE
>
> A couple of observations:
> the two OSDs in question have the same port, 6812, but different IPs
> (10.80.16.74 and 10.80.16.78). What is more interesting is that they
> have the same ceph_connection struct; note the ffff880748f59830 in the
> log snippet above. So it seems that because two "struct sock *sk"
> share the same "ceph_connection *con = sk->sk_user_data", they enter
> an endless loop of establishing and closing the connection.
>
> Does it sound plausible?


* Re: Possible bug in krbd (4.4.0)
  2017-01-04  0:13 Possible bug in krbd (4.4.0) Max Yehorov
  2017-01-04 16:15 ` Max Yehorov
@ 2017-01-07 16:08 ` Ilya Dryomov
  2017-02-03 23:20   ` Max Yehorov
  1 sibling, 1 reply; 8+ messages in thread
From: Ilya Dryomov @ 2017-01-07 16:08 UTC (permalink / raw)
  To: Max Yehorov; +Cc: ceph-devel

On Wed, Jan 4, 2017 at 3:13 AM, Max Yehorov <myehorov@skytap.com> wrote:
> Hi,
>
> I have encountered what may be a weird bug. There is an rbd image
> mapped and mounted on a client machine, and it is not possible to
> umount it. Neither lsof nor fuser shows any mention of the device or
> the mountpoint, and it is not exported via the NFS kernel server, so
> it is unlikely to be blocked by the kernel.
>
> There is an odd pattern in syslog: two OSDs are constantly losing
> their connections. A wild guess is that umount tries to contact the
> primary OSD and fails?

Does umount error out or hang forever?

>
> After I enabled kernel debug I saw the following:
>
> [9586733.605792] libceph:  con_open ffff880748f58030 10.80.16.74:6812
> [9586733.623876] libceph:  connect 10.80.16.74:6812
> [9586733.625091] libceph:  connect 10.80.16.74:6812 EINPROGRESS sk_state = 2
> [9586756.681246] libceph:  con_keepalive ffff881057d082b8
> [9586767.713067] libceph:  fault ffff880748f59830 state 5 to peer
> 10.80.16.78:6812
> [9586767.713593] libceph: osd27 10.80.16.78:6812 socket closed (con state OPEN)
> [9586767.721145] libceph:  con_close ffff880748f59830 peer 10.80.16.78:6812
> [9586767.724440] libceph:  con_open ffff880748f59830 10.80.16.78:6812
> [9586767.742487] libceph:  connect 10.80.16.78:6812
> [9586767.743696] libceph:  connect 10.80.16.78:6812 EINPROGRESS sk_state = 2
> [9587346.956812] libceph:  try_read start on ffff881057d082b8 state 5
> [9587466.968125] libceph:  try_write start ffff881057d082b8 state 5
> [9587634.021257] libceph:  fault ffff880748f58030 state 5 to peer
> 10.80.16.74:6812
> [9587634.021781] libceph: osd19 10.80.16.74:6812 socket closed (con state OPEN)
> [9587634.029336] libceph:  con_close ffff880748f58030 peer 10.80.16.74:6812
> [9587634.032628] libceph:  con_open ffff880748f58030 10.80.16.74:6812
> [9587634.050677] libceph:  connect 10.80.16.74:6812
> [9587634.051888] libceph:  connect 10.80.16.74:6812 EINPROGRESS sk_state = 2
> [9587668.124746] libceph:  fault ffff880748f59830 state 5 to peer
> 10.80.16.78:6812

How many rbd images were mapped on that machine at that time?  This
looks like two idle mappings reestablishing watch connections - if you
look closely, you'll notice that those "fault to peer" messages are
exactly 15 minutes apart.  This behaviour is annoying, but harmless.

If umount hangs, an output of

$ cat /sys/kernel/debug/ceph/<fsid>/osdc
$ echo w >/proc/sysrq-trigger
$ echo t >/proc/sysrq-trigger

might have helped.

Thanks,

                Ilya


* Re: Possible bug in krbd (4.4.0)
  2017-01-07 16:08 ` Ilya Dryomov
@ 2017-02-03 23:20   ` Max Yehorov
  2017-02-05 12:26     ` Ilya Dryomov
  0 siblings, 1 reply; 8+ messages in thread
From: Max Yehorov @ 2017-02-03 23:20 UTC (permalink / raw)
  To: Ilya Dryomov; +Cc: ceph-devel

It is certainly annoying. It has not been possible to umount for hours.
> Does umount error out or hang forever?
umount errors out with:
umount: target is busy
        (In some cases useful info about processes that
         use the device is found by lsof(8) or fuser(1).)

Maybe this will help?

- lsof, for the rbd device in question, shows it is in use by pid 5926

dio/rbd0  5926  root  cwd   DIR                8,5   4096          2 /
dio/rbd0  5926  root  rtd     DIR                8,5   4096          2 /
dio/rbd0  5926  root  txt     unknown                   /proc/5926/exe

- PID 5926 is dio; it seems to be a kernel thread
ps ax | grep 5926
 5926 ?        S<     0:00 [dio/rbd0]

- Also, there is no dio thread other than that one
ps ax | grep dio
 5926 ?        S<     0:00 [dio/rbd0]


On Sat, Jan 7, 2017 at 8:08 AM, Ilya Dryomov <idryomov@gmail.com> wrote:
> On Wed, Jan 4, 2017 at 3:13 AM, Max Yehorov <myehorov@skytap.com> wrote:
>> Hi,
>>
>> I have encountered what may be a weird bug. There is an rbd image
>> mapped and mounted on a client machine, and it is not possible to
>> umount it. Neither lsof nor fuser shows any mention of the device or
>> the mountpoint, and it is not exported via the NFS kernel server, so
>> it is unlikely to be blocked by the kernel.
>>
>> There is an odd pattern in syslog: two OSDs are constantly losing
>> their connections. A wild guess is that umount tries to contact the
>> primary OSD and fails?
>
> Does umount error out or hang forever?
>
>>
>> After I enabled kernel debug I saw the following:
>>
>> [9586733.605792] libceph:  con_open ffff880748f58030 10.80.16.74:6812
>> [9586733.623876] libceph:  connect 10.80.16.74:6812
>> [9586733.625091] libceph:  connect 10.80.16.74:6812 EINPROGRESS sk_state = 2
>> [9586756.681246] libceph:  con_keepalive ffff881057d082b8
>> [9586767.713067] libceph:  fault ffff880748f59830 state 5 to peer
>> 10.80.16.78:6812
>> [9586767.713593] libceph: osd27 10.80.16.78:6812 socket closed (con state OPEN)
>> [9586767.721145] libceph:  con_close ffff880748f59830 peer 10.80.16.78:6812
>> [9586767.724440] libceph:  con_open ffff880748f59830 10.80.16.78:6812
>> [9586767.742487] libceph:  connect 10.80.16.78:6812
>> [9586767.743696] libceph:  connect 10.80.16.78:6812 EINPROGRESS sk_state = 2
>> [9587346.956812] libceph:  try_read start on ffff881057d082b8 state 5
>> [9587466.968125] libceph:  try_write start ffff881057d082b8 state 5
>> [9587634.021257] libceph:  fault ffff880748f58030 state 5 to peer
>> 10.80.16.74:6812
>> [9587634.021781] libceph: osd19 10.80.16.74:6812 socket closed (con state OPEN)
>> [9587634.029336] libceph:  con_close ffff880748f58030 peer 10.80.16.74:6812
>> [9587634.032628] libceph:  con_open ffff880748f58030 10.80.16.74:6812
>> [9587634.050677] libceph:  connect 10.80.16.74:6812
>> [9587634.051888] libceph:  connect 10.80.16.74:6812 EINPROGRESS sk_state = 2
>> [9587668.124746] libceph:  fault ffff880748f59830 state 5 to peer
>> 10.80.16.78:6812
>
> How many rbd images were mapped on that machine at that time?  This
> looks like two idle mappings reestablishing watch connections - if you
> look closely, you'll notice that those "fault to peer" messages are
> exactly 15 minutes apart.  This behaviour is annoying, but harmless.
>
> If umount hangs, an output of
>
> $ cat /sys/kernel/debug/ceph/<fsid>/osdc
> $ echo w >/proc/sysrq-trigger
> $ echo t >/proc/sysrq-trigger
>
> might have helped.
>
> Thanks,
>
>                 Ilya


* Re: Possible bug in krbd (4.4.0)
  2017-02-03 23:20   ` Max Yehorov
@ 2017-02-05 12:26     ` Ilya Dryomov
  2017-02-06 18:26       ` Max Yehorov
  0 siblings, 1 reply; 8+ messages in thread
From: Ilya Dryomov @ 2017-02-05 12:26 UTC (permalink / raw)
  To: Max Yehorov; +Cc: ceph-devel

On Sat, Feb 4, 2017 at 12:20 AM, Max Yehorov <myehorov@skytap.com> wrote:
> It is certainly annoying. It has not been possible to umount for hours.

For hours?  Does it succeed after X number of hours?

>> Does umount error out or hang forever?
> umount errors out with:
> umount: target is busy
>         (In some cases useful info about processes that
>          use the device is found by lsof(8) or fuser(1).)
>
> Maybe this will help?
>
> - lsof, for the rbd device in question, shows it is in use by pid 5926
>
> dio/rbd0  5926  root  cwd   DIR                8,5   4096          2 /
> dio/rbd0  5926  root  rtd     DIR                8,5   4096          2 /
> dio/rbd0  5926  root  txt     unknown                   /proc/5926/exe

This doesn't say that /dev/rbd0 is in use by pid 5926.  It says that
there is a process with the name "dio/rbd0", that's all.

I'm guessing you ran plain "lsof" with no arguments here.  Were there
other */rbd0 matches, like jbd2/rbd0?

>
> - PID 5926 is dio; it seems to be a kernel thread
> ps ax | grep 5926
>  5926 ?        S<     0:00 [dio/rbd0]
>
> - Also, there is no dio thread other than that one
> ps ax | grep dio
>  5926 ?        S<     0:00 [dio/rbd0]

It's a kernel workqueue; its presence by itself doesn't mean much:

$ sudo mkfs.ext4 /dev/rbd0
...
$ sudo mount /dev/rbd0 /mnt
$ pgrep -a dio # nothing
$ sudo fio --ioengine=libaio --direct=1 --name=test \
    --filename=/mnt/test --bs=4k --size=16M --readwrite=randwrite
...
$ pgrep -a dio
1254 dio/rbd0
$ ps 1254
  PID TTY      STAT   TIME COMMAND
 1254 ?        S<     0:00 [dio/rbd0]
$ sudo umount /mnt # OK
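
FWIW, that is where the "dio/rbd0" name comes from: the deferred
direct I/O completion workqueue is created lazily the first time a
direct write needs deferred completion and is named after the
superblock. A paraphrased sketch of sb_init_dio_done_wq() from
fs/direct-io.c (not verbatim; details may vary by kernel version):

int sb_init_dio_done_wq(struct super_block *sb)
{
        struct workqueue_struct *old;
        /* lazily allocated, named "dio/<s_id>", e.g. "dio/rbd0" */
        struct workqueue_struct *wq =
                alloc_workqueue("dio/%s", WQ_MEM_RECLAIM, 0, sb->s_id);

        if (!wq)
                return -ENOMEM;

        /* several DIOs can race to create it; keep the first one */
        old = cmpxchg(&sb->s_dio_done_wq, NULL, wq);
        if (old)
                destroy_workqueue(wq);

        return 0;
}

The workqueue simply sticks around after that, so its presence does
not pin the mount or indicate an open file.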

Thanks,

                Ilya


* Re: Possible bug in krbd (4.4.0)
  2017-02-05 12:26     ` Ilya Dryomov
@ 2017-02-06 18:26       ` Max Yehorov
  2017-02-06 18:40         ` Ilya Dryomov
  0 siblings, 1 reply; 8+ messages in thread
From: Max Yehorov @ 2017-02-06 18:26 UTC (permalink / raw)
  To: Ilya Dryomov; +Cc: ceph-devel

> For hours?  Does it succeed after X number of hours?
Still mounted from Feb 6. There is an attempt to unmount every 10 minutes.

> This doesn't say that /dev/rbd0 is in use by pid 5926.  It says that
> there is a process with the name "dio/rbd0", that's all.
>
> I'm guessing you ran plain "lsof" with no arguments here.  Were there
> other */rbd0 matches, like jbd2/rbd0?

The entire output

:~# lsof | grep rbd

dio/rbd0   5926   root  cwd   DIR      8,5   4096   2   /
dio/rbd0   5926   root  rtd   DIR      8,5   4096   2   /
dio/rbd0   5926   root  txt   unknown                    /proc/5926/exe
rbd        5996   root  cwd   DIR      8,5   4096   2   /
rbd        5996   root  rtd   DIR      8,5   4096   2   /
rbd        5996   root  txt   unknown                    /proc/5996/exe



On Sun, Feb 5, 2017 at 4:26 AM, Ilya Dryomov <idryomov@gmail.com> wrote:
> On Sat, Feb 4, 2017 at 12:20 AM, Max Yehorov <myehorov@skytap.com> wrote:
>> It is certainly annoying. It has not been possible to umount for hours.
>
> For hours?  Does it succeed after X number of hours?
>
>>> Does umount error out or hang forever?
>> umount errors out with:
>> umount: target is busy
>>         (In some cases useful info about processes that
>>          use the device is found by lsof(8) or fuser(1).)
>>
>> Maybe this will help?
>>
>> - lsof, for the rbd device in question, shows it is in use by pid 5926
>>
>> dio/rbd0  5926  root  cwd   DIR                8,5   4096          2 /
>> dio/rbd0  5926  root  rtd     DIR                8,5   4096          2 /
>> dio/rbd0  5926  root  txt     unknown                   /proc/5926/exe
>
> This doesn't say that /dev/rbd0 is in use by pid 5926.  It says that
> there is a process with the name "dio/rbd0", that's all.
>
> I'm guessing you ran plain "lsof" with no arguments here.  Were there
> other */rbd0 matches, like jbd2/rbd0?
>
>>
>> - PID 5926 is dio; it seems to be a kernel thread
>> ps ax | grep 5926
>>  5926 ?        S<     0:00 [dio/rbd0]
>>
>> - Also, there is no dio thread other than that one
>> ps ax | grep dio
>>  5926 ?        S<     0:00 [dio/rbd0]
>
> It's a kernel workqueue; its presence by itself doesn't mean much:
>
> $ sudo mkfs.ext4 /dev/rbd0
> ...
> $ sudo mount /dev/rbd0 /mnt
> $ pgrep -a dio # nothing
> $ sudo fio --ioengine=libaio --direct=1 --name=test
> --filename=/mnt/test --bs=4k --size=16M --readwrite=randwrite
> ...
> $ pgrep -a dio
> 1254 dio/rbd0
> $ ps 1254
>   PID TTY      STAT   TIME COMMAND
>  1254 ?        S<     0:00 [dio/rbd0]
> $ sudo umount /mnt # OK
>
> Thanks,
>
>                 Ilya


* Re: Possible bug in krbd (4.4.0)
  2017-02-06 18:26       ` Max Yehorov
@ 2017-02-06 18:40         ` Ilya Dryomov
  2017-02-06 19:29           ` Max Yehorov
  0 siblings, 1 reply; 8+ messages in thread
From: Ilya Dryomov @ 2017-02-06 18:40 UTC (permalink / raw)
  To: Max Yehorov; +Cc: ceph-devel

On Mon, Feb 6, 2017 at 7:26 PM, Max Yehorov <myehorov@skytap.com> wrote:
>> For hours?  Does it succeed after X number of hours?
> Still mounted from Feb 6. There is an attempt to unmount every 10 minutes.
>
>> This doesn't say that /dev/rbd0 is in use by pid 5926.  It says that
>> there is a process with the name "dio/rbd0", that's all.
>>
>> I'm guessing you ran plain "lsof" with no arguments here.  Were there
>> other */rbd0 matches, like jbd2/rbd0?
>
> The entire output
>
> :~# lsof | grep rbd
>
> dio/rbd0   5926             root  cwd       DIR                8,5
>     4096          2 /
> dio/rbd0   5926             root  rtd       DIR                8,5
>     4096          2 /
> dio/rbd0   5926             root  txt   unknown
>                     /proc/5926/exe
> rbd        5996             root  cwd       DIR                8,5
>     4096          2 /
> rbd        5996             root  rtd       DIR                8,5
>     4096          2 /
> rbd        5996             root  txt   unknown
>                     /proc/5996/exe

Is this xfs or ext4?

What is the output of "rbd showmapped"?

Is there anything in dmesg?  Can you provide a snippet?

Can you provide the output I requested in my first reply?

$ cat /sys/kernel/debug/ceph/<fsid>/osdc
$ echo w >/proc/sysrq-trigger
$ echo t >/proc/sysrq-trigger

Thanks,

                Ilya


* Re: Possible bug in krbd (4.4.0)
  2017-02-06 18:40         ` Ilya Dryomov
@ 2017-02-06 19:29           ` Max Yehorov
  0 siblings, 0 replies; 8+ messages in thread
From: Max Yehorov @ 2017-02-06 19:29 UTC (permalink / raw)
  To: Ilya Dryomov; +Cc: ceph-devel

It is XFS.

Nothing interesting in dmesg.

rbd showmapped + trimmed sysrq (-t -w):
http://pastebin.com/s68Sqia4


On Mon, Feb 6, 2017 at 10:40 AM, Ilya Dryomov <idryomov@gmail.com> wrote:
> On Mon, Feb 6, 2017 at 7:26 PM, Max Yehorov <myehorov@skytap.com> wrote:
>>> For hours?  Does it succeed after X number of hours?
>> Still mounted from Feb 6. There is an attempt to unmount every 10 minutes.
>>
>>> This doesn't say that /dev/rbd0 is in use by pid 5926.  It says that
>>> there is a process with the name "dio/rbd0", that's all.
>>>
>>> I'm guessing you ran plain "lsof" with no arguments here.  Were there
>>> other */rbd0 matches, like jbd2/rbd0?
>>
>> The entire output
>>
>> :~# lsof | grep rbd
>>
>> dio/rbd0   5926             root  cwd       DIR                8,5
>>     4096          2 /
>> dio/rbd0   5926             root  rtd       DIR                8,5
>>     4096          2 /
>> dio/rbd0   5926             root  txt   unknown
>>                     /proc/5926/exe
>> rbd        5996             root  cwd       DIR                8,5
>>     4096          2 /
>> rbd        5996             root  rtd       DIR                8,5
>>     4096          2 /
>> rbd        5996             root  txt   unknown
>>                     /proc/5996/exe
>
> Is this xfs or ext4?
>
> What is the output of "rbd showmapped"?
>
> Is there anything in dmesg?  Can you provide a snippet?
>
> Can you provide the output I requested in my first reply?
>
> $ cat /sys/kernel/debug/ceph/<fsid>/osdc
> $ echo w >/proc/sysrq-trigger
> $ echo t >/proc/sysrq-trigger
>
> Thanks,
>
>                 Ilya

