* [ceph-users] occasional failure to unmap rbd
From: Shinobu Kinjo @ 2015-09-26  2:54 UTC
  To: Ceph Development

I think it would be more helpful to include the return value in:

# ./src/krbd.cc
530       cerr << "rbd: sysfs write failed" << std::endl;

like:

530       cerr << "rbd: sysfs write failed (" << r << ")" << std::endl;

That way we would know exactly what **write** is complaining about,
since **write** can return several different error codes on failure.
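
For illustration, a variant that also decodes the errno could look
roughly like this (a sketch only; r is assumed to hold the negative
errno propagated from the failed sysfs write):

  if (r < 0) {
    // prints e.g. "rbd: sysfs write failed: (16) Device or resource busy"
    cerr << "rbd: sysfs write failed: (" << -r << ") "
         << strerror(-r) << std::endl;   // strerror() from <cstring>
  }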

What do you think?

Shinobu

----- Original Message -----
From: "Jeff Epstein" <jeff.epstein@commerceguys.com>
To: "Jan Schermer" <jan@schermer.cz>
Cc: ceph-users@lists.ceph.com
Sent: Saturday, September 26, 2015 2:44:00 AM
Subject: Re: [ceph-users] occasional failure to unmap rbd

On 09/25/2015 12:53 PM, Jan Schermer wrote:
> What are you looking for in lsof? Did you try looking for the major/minor number of the rbd device?
> Things that could hold the device are devicemapper, lvm, swraid and possibly many more, not sure if all that shows in lsof output...
>
I searched for the rbd's mounted block device name, of course, which 
didn't turn up anything. Just now I tried searching for the minor device 
number, but I didn't see anything obviously useful. lsof usually just
shows processes, so if the device is being held by a kernel module
or by an inaccurate refcount, lsof wouldn't help.
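
For instance, searching by device number looks something like this
(the fc:0 / 252,0 pair is just an assumed example):

  # stat -c '%t:%T' /dev/rbd0   # major:minor in hex, e.g. fc:0
  # lsof | grep -w '252,0'      # lsof's DEVICE column uses decimal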

Jeff

* Re: [ceph-users] occasional failure to unmap rbd
From: Ilya Dryomov @ 2015-09-26  8:52 UTC
  To: Shinobu Kinjo; +Cc: Ceph Development

On Sat, Sep 26, 2015 at 5:54 AM, Shinobu Kinjo <skinjo@redhat.com> wrote:
> I think it would be more helpful to include the return value in:
>
> # ./src/krbd.cc
> 530       cerr << "rbd: sysfs write failed" << std::endl;
>
> like:
>
> 530       cerr << "rbd: sysfs write failed (" << r << ")" << std::endl;
>
> That way we would know exactly what **write** is complaining about,
> since **write** can return several different error codes on failure.
>
> What do you think?

It's already doing that:

rbd: sysfs write failed
rbd: unmap failed: (16) Device or resource busy

sysfs_write_rbd_remove() return value is propagated up and reported.
The code is written in such a way that return values are preserved and
never overwritten (hopefully!).
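
For illustration, the pattern looks roughly like this (a hypothetical
sketch, not the actual krbd.cc code; the signature is assumed):

  static int do_unmap(const char *devnode)
  {
    int r = sysfs_write_rbd_remove(devnode);  // 0 on success, -errno on failure
    if (r < 0) {
      cerr << "rbd: sysfs write failed" << std::endl;
      return r;  // the original -errno is propagated up unchanged
    }
    return 0;
  }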

Thanks,

                Ilya

* Re: [ceph-users] occasional failure to unmap rbd
From: Shinobu Kinjo @ 2015-09-26 10:30 UTC
  To: Ilya Dryomov; +Cc: Ceph Development

Thanks!
I completely overlooked that -;

 Shinobu

----- Original Message -----
From: "Ilya Dryomov" <idryomov@gmail.com>
To: "Shinobu Kinjo" <skinjo@redhat.com>
Cc: "Ceph Development" <ceph-devel@vger.kernel.org>
Sent: Saturday, September 26, 2015 5:52:50 PM
Subject: Re: [CEPH-DEVEL] [ceph-users] occasional failure to unmap rbd

[...]

* Re: [ceph-users] occasional failure to unmap rbd
From: Markus Kienast @ 2015-11-23 22:06 UTC
  To: Ceph Development

I am having the same issue here.

root@paris3:/etc/neutron# rbd unmap /dev/rbd0
rbd: failed to remove rbd device: (16) Device or resource busy
rbd: remove failed: (16) Device or resource busy

root@paris3:/etc/neutron# rbd info -p volumes
volume-f3ab6892-f35e-4b98-8832-efbaaa2f4ca2
2015-11-23 22:42:06.842697 7f2d57e49700  0 -- :/2760503703 >>
10.90.90.4:6789/0 pipe(0x1773250 sd=3 :0 s=1 pgs=0 cs=0 l=1
c=0x17734e0).fault
rbd image 'volume-f3ab6892-f35e-4b98-8832-efbaaa2f4ca2':
size 500 GB in 128000 objects
order 22 (4096 kB objects)
block_name_prefix: rbd_data.1b6d9e2aaa998b
format: 2
features: layering
root@paris3:/etc/neutron# rados -p volumes listwatchers
rbd_header.1b6d9e2aaa998b
2015-11-23 22:42:58.546723 7fec94fec700  0 -- :/2519796249 >>
10.90.90.4:6789/0 pipe(0x9cf260 sd=3 :0 s=1 pgs=0 cs=0 l=1
c=0x9cf4f0).fault
watcher=10.90.90.3:0/3293327848 client.8471177 cookie=1

root@paris3:/etc/neutron# ps ax | grep rbd
 7814 ?        S      0:00 [jbd2/rbd0-8]
11003 ?        S      0:00 [jbd2/rbd1-8]
14042 ?        S      0:00 [jbd2/rbd2p1-8]
24228 ?        S      0:00 [jbd2/rbd3-8]

root@paris3:/etc/neutron# ceph --version
ceph version 0.80.11 (8424145d49264624a3b0a204aedb127835161070)

root@paris3:/etc/neutron# ls /sys/block/rbd0/holders/
returns nothing

root@paris3:/etc/neutron# fuser -amv /dev/rbd0
                     USER        PID ACCESS COMMAND
/dev/rbd0:

root@paris3:/etc/neutron# lsof /dev/rbd0
returns nothing

Please advise,
Markus

On Sat, Sep 26, 2015 at 12:30 PM, Shinobu Kinjo <skinjo@redhat.com> wrote:
> [...]

* Re: [ceph-users] occasional failure to unmap rbd
From: Ilya Dryomov @ 2015-11-23 22:26 UTC
  To: Markus Kienast; +Cc: Shinobu Kinjo, Ceph Development

On Mon, Nov 23, 2015 at 11:03 PM, Markus Kienast <mark@trickkiste.at> wrote:
> I am having the same issue here.

Which kernel are you running?  Could you attach your dmesg?

>
> root@paris3:/etc/neutron# rbd unmap /dev/rbd0
> rbd: failed to remove rbd device: (16) Device or resource busy
> rbd: remove failed: (16) Device or resource busy
>
> root@paris3:/etc/neutron# rbd info -p volumes
> volume-f3ab6892-f35e-4b98-8832-efbaaa2f4ca2
> 2015-11-23 22:42:06.842697 7f2d57e49700  0 -- :/2760503703 >>
> 10.90.90.4:6789/0 pipe(0x1773250 sd=3 :0 s=1 pgs=0 cs=0 l=1
> c=0x17734e0).fault
> rbd image 'volume-f3ab6892-f35e-4b98-8832-efbaaa2f4ca2':
> size 500 GB in 128000 objects
> order 22 (4096 kB objects)
> block_name_prefix: rbd_data.1b6d9e2aaa998b
> format: 2
> features: layering
> root@paris3:/etc/neutron# rados -p volumes listwatchers
> rbd_header.1b6d9e2aaa998b
> 2015-11-23 22:42:58.546723 7fec94fec700  0 -- :/2519796249 >>
> 10.90.90.4:6789/0 pipe(0x9cf260 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x9cf4f0).fault

Did you root cause these faults?

> watcher=10.90.90.3:0/3293327848 client.8471177 cookie=1
>
> root@paris3:/etc/neutron# ps ax | grep rbd
>  7814 ?        S      0:00 [jbd2/rbd0-8]

Was there an ext filesystem involved?  How was it umounted - do you
have a "umount <mountpoint>" process stuck in D state?

> 11003 ?        S      0:00 [jbd2/rbd1-8]
> 14042 ?        S      0:00 [jbd2/rbd2p1-8]
> 24228 ?        S      0:00 [jbd2/rbd3-8]
>
> root@paris3:/etc/neutron# ceph --version
> ceph version 0.80.11 (8424145d49264624a3b0a204aedb127835161070)
>
> root@paris3:/etc/neutron# ls /sys/block/rbd0/holders/
> returns nothing
>
> root@paris3:/etc/neutron# fuser -amv /dev/rbd0
>                      USER        PID ACCESS COMMAND
> /dev/rbd0:

What's the output of "cat /sys/bus/rbd/devices/0/client_id"?

What's the output of "sudo cat /sys/kernel/debug/ceph/*/osdc"?

Thanks,

                Ilya

* Re: [ceph-users] occasional failure to unmap rbd
From: Markus Kienast @ 2015-11-23 23:12 UTC
  To: Ceph Development

Kernel Version
elias@paris3:~$ uname -a
Linux paris3.sfe.tv 3.16.0-28-generic #38-Ubuntu SMP Sat Dec 13
16:13:28 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

Output of dmesg and /var/log/dmesg is attached, but it does not show
much except for one mon being down; the mon is down for hardware
reasons.



On Mon, Nov 23, 2015 at 11:26 PM, Ilya Dryomov <idryomov@gmail.com> wrote:
>
> On Mon, Nov 23, 2015 at 11:03 PM, Markus Kienast <mark@trickkiste.at> wrote:
> > I am having the same issue here.
>
> Which kernel are you running?  Could you attach your dmesg?
>
> >
> > root@paris3:/etc/neutron# rbd unmap /dev/rbd0
> > rbd: failed to remove rbd device: (16) Device or resource busy
> > rbd: remove failed: (16) Device or resource busy
> >
> > root@paris3:/etc/neutron# rbd info -p volumes
> > volume-f3ab6892-f35e-4b98-8832-efbaaa2f4ca2
> > 2015-11-23 22:42:06.842697 7f2d57e49700  0 -- :/2760503703 >>
> > 10.90.90.4:6789/0 pipe(0x1773250 sd=3 :0 s=1 pgs=0 cs=0 l=1
> > c=0x17734e0).fault
> > rbd image 'volume-f3ab6892-f35e-4b98-8832-efbaaa2f4ca2':
> > size 500 GB in 128000 objects
> > order 22 (4096 kB objects)
> > block_name_prefix: rbd_data.1b6d9e2aaa998b
> > format: 2
> > features: layering
> > root@paris3:/etc/neutron# rados -p volumes listwatchers
> > rbd_header.1b6d9e2aaa998b
> > 2015-11-23 22:42:58.546723 7fec94fec700  0 -- :/2519796249 >>
> > 10.90.90.4:6789/0 pipe(0x9cf260 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x9cf4f0).fault
>
> Did you root cause these faults?

Hardware failure caused these faults.

>
> > watcher=10.90.90.3:0/3293327848 client.8471177 cookie=1
> >
> > root@paris3:/etc/neutron# ps ax | grep rbd
> >  7814 ?        S      0:00 [jbd2/rbd0-8]
>
> Was there an ext filesystem involved?  How was it umounted - do you
> have a "umount <mountpoint>" process stuck in D state?

Yes, all these RBDs are formatted with ext4. I am regularly using them
with OpenStack and have never had any problems.
I did "umount <mountpoint>" and the umount process did actually
finish just fine.
Where can I check whether it is stuck in "D" state?
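Would something like this show it, i.e. anything stuck in
uninterruptible sleep?

  # ps -eo pid,stat,comm | awk '$2 ~ /^D/'   # STAT starting with "D"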

>
> > 11003 ?        S      0:00 [jbd2/rbd1-8]
> > 14042 ?        S      0:00 [jbd2/rbd2p1-8]
> > 24228 ?        S      0:00 [jbd2/rbd3-8]
> >
> > root@paris3:/etc/neutron# ceph --version
> > ceph version 0.80.11 (8424145d49264624a3b0a204aedb127835161070)
> >
> > root@paris3:/etc/neutron# ls /sys/block/rbd0/holders/
> > returns nothing
> >
> > root@paris3:/etc/neutron# fuser -amv /dev/rbd0
> >                      USER        PID ACCESS COMMAND
> > /dev/rbd0:
>
> What's the output of "cat /sys/bus/rbd/devices/0/client_id"?

root@paris3:~# cat /sys/bus/rbd/devices/0/client_id
client8471177

>
> What's the output of "sudo cat /sys/kernel/debug/ceph/*/osdc"?

root@paris3:~# ls -l /sys/kernel/debug/ceph/
total 0
drwxr-xr-x 2 root root 0 Feb  4  2015
32ba3117-e320-49fc-aabd-f100d5a7e94b.client7663711
drwxr-xr-x 2 root root 0 Nov 23 11:41
32ba3117-e320-49fc-aabd-f100d5a7e94b.client8471177

root@paris3:~# cat
/sys/kernel/debug/ceph/32ba3117-e320-49fc-aabd-f100d5a7e94b.client8471177/osdc
has no output

root@paris3:~# cat
/sys/kernel/debug/ceph/32ba3117-e320-49fc-aabd-f100d5a7e94b.client7663711/osdc
hangs with no output

BTW, I have mapped these RBDs as user cinder not as admin.

But using -n produces the same error:
root@paris3:~# rbd -n client.cinder  unmap /dev/rbd0
rbd: failed to remove rbd device: (16) Device or resource busy
rbd: remove failed: (16) Device or resource busy

I appreciate your help!
Markus

>
> Thanks,
>
>                 Ilya

* Re: [ceph-users] occasional failure to unmap rbd
From: Ilya Dryomov @ 2015-11-24 11:49 UTC
  To: Markus Kienast; +Cc: Ceph Development

On Tue, Nov 24, 2015 at 12:12 AM, Markus Kienast <elias1884@gmail.com> wrote:
> Kernel Version
> elias@paris3:~$ uname -a
> Linux paris3.sfe.tv 3.16.0-28-generic #38-Ubuntu SMP Sat Dec 13
> 16:13:28 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
>
> Output of dmesg and /var/log/dmesg is attached, but it does not show
> much except for one mon being down; the mon is down for hardware
> reasons.
>
>
>
> On Mon, Nov 23, 2015 at 11:26 PM, Ilya Dryomov <idryomov@gmail.com> wrote:
>>
>> On Mon, Nov 23, 2015 at 11:03 PM, Markus Kienast <mark@trickkiste.at> wrote:
>> > I am having the same issue here.
>>
>> Which kernel are you running?  Could you attach your dmesg?
>>
>> >
>> > root@paris3:/etc/neutron# rbd unmap /dev/rbd0
>> > rbd: failed to remove rbd device: (16) Device or resource busy
>> > rbd: remove failed: (16) Device or resource busy
>> >
>> > root@paris3:/etc/neutron# rbd info -p volumes
>> > volume-f3ab6892-f35e-4b98-8832-efbaaa2f4ca2
>> > 2015-11-23 22:42:06.842697 7f2d57e49700  0 -- :/2760503703 >>
>> > 10.90.90.4:6789/0 pipe(0x1773250 sd=3 :0 s=1 pgs=0 cs=0 l=1
>> > c=0x17734e0).fault
>> > rbd image 'volume-f3ab6892-f35e-4b98-8832-efbaaa2f4ca2':
>> > size 500 GB in 128000 objects
>> > order 22 (4096 kB objects)
>> > block_name_prefix: rbd_data.1b6d9e2aaa998b
>> > format: 2
>> > features: layering
>> > root@paris3:/etc/neutron# rados -p volumes listwatchers
>> > rbd_header.1b6d9e2aaa998b
>> > 2015-11-23 22:42:58.546723 7fec94fec700  0 -- :/2519796249 >>
>> > 10.90.90.4:6789/0 pipe(0x9cf260 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x9cf4f0).fault
>>
>> Did you root cause these faults?
>
> Hardware failure caused these faults.
>
>>
>> > watcher=10.90.90.3:0/3293327848 client.8471177 cookie=1
>> >
>> > root@paris3:/etc/neutron# ps ax | grep rbd
>> >  7814 ?        S      0:00 [jbd2/rbd0-8]
>>
>> Was there an ext filesystem involved?  How was it umounted - do you
>> have a "umount <mountpoint>" process stuck in D state?
>
> Yes, all these RBDs are formatted with ext4. I am regularly using them
> with OpenStack and have never had any problems.
> I did "umount <mountpoint>" and the umount process did actually
> finish just fine.
> Where can I check whether it is stuck in "D" state?
>
>>
>> > 11003 ?        S      0:00 [jbd2/rbd1-8]
>> > 14042 ?        S      0:00 [jbd2/rbd2p1-8]
>> > 24228 ?        S      0:00 [jbd2/rbd3-8]
>> >
>> > root@paris3:/etc/neutron# ceph --version
>> > ceph version 0.80.11 (8424145d49264624a3b0a204aedb127835161070)
>> >
>> > root@paris3:/etc/neutron# ls /sys/block/rbd0/holders/
>> > returns nothing
>> >
>> > root@paris3:/etc/neutron# fuser -amv /dev/rbd0
>> >                      USER        PID ACCESS COMMAND
>> > /dev/rbd0:
>>
>> What's the output of "cat /sys/bus/rbd/devices/0/client_id"?
>
> root@paris3:~# cat /sys/bus/rbd/devices/0/client_id
> client8471177
>
>>
>> What's the output of "sudo cat /sys/kernel/debug/ceph/*/osdc"?
>
> root@paris3:~# ls -l /sys/kernel/debug/ceph/
> total 0
> drwxr-xr-x 2 root root 0 Feb  4  2015
> 32ba3117-e320-49fc-aabd-f100d5a7e94b.client7663711
> drwxr-xr-x 2 root root 0 Nov 23 11:41
> 32ba3117-e320-49fc-aabd-f100d5a7e94b.client8471177
>
> root@paris3:~# cat
> /sys/kernel/debug/ceph/32ba3117-e320-49fc-aabd-f100d5a7e94b.client8471177/osdc
> has no output

This means there are no outstanding/hung rbd I/Os.  According to you,
umount completed successfully, and yet there is a jbd2/rbd0-8 kthread
hanging around, keeping /dev/rbd0 open and holding a ref to it.
A quick search produced two similar reports:

[1] https://ask.fedoraproject.org/en/question/7572/how-to-stop-kernel-ext4-journaling-thread/
[2] http://lists.openwall.net/linux-ext4/2015/10/24/11

The only difference, as far as I can tell, is that those people
noticed the jbd2 thread because they wanted to run fsck, while you ran
into it because you tried to do "rbd unmap".  Neither mentions rbd.
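
A quick way to confirm that symptom is to check that nothing is
mounted while the journal kthread is still around, e.g.:

  # grep rbd0 /proc/mounts              # expect no output after umount
  # ps -eo pid,stat,comm | grep jbd2    # the leftover jbd2/rbd0-8 thread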

Look at [2], did you at any point see any similar errors in dmesg?

>
> root@paris3:~# cat
> /sys/kernel/debug/ceph/32ba3117-e320-49fc-aabd-f100d5a7e94b.client7663711/osdc
> hangs with no output

It shouldn't hang, so it could be unrelated.  Given the "Feb  4  2015"
timestamp, I'm going to assume you haven't rebooted this box in a long
time?  If so, do you remember what happened around that date?

Do you keep syslog archives?  I'd be interested in seeing everything
you have for Feb 3 - Feb 5.

To try to figure out where it's hanging, can you do

# cat /sys/kernel/debug/ceph/32ba3117-e320-49fc-aabd-f100d5a7e94b.client7663711/osdc
< it'll hang, grab its PID from ps output >
# cat /proc/$PID/stack
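
Equivalently, for example, using the shell's $! to grab the PID:

  # cat /sys/kernel/debug/ceph/32ba3117-e320-49fc-aabd-f100d5a7e94b.client7663711/osdc &
  # cat /proc/$!/stack    # kernel stack of the cat stuck in the read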

Thanks,

                Ilya

* Re: [ceph-users] occasional failure to unmap rbd
From: Ilya Dryomov @ 2015-11-24 11:51 UTC
  To: Markus Kienast; +Cc: Ceph Development

On Tue, Nov 24, 2015 at 12:49 PM, Ilya Dryomov <idryomov@gmail.com> wrote:
> [...]
> It shouldn't hang, so it could be unrelated.  Given the "Feb  4  2015"

It should read "so it could be related", of course.

Thanks,

                Ilya

* Re: [ceph-users] occasional failure to unmap rbd
From: Markus Kienast @ 2015-11-24 12:46 UTC
  To: Ilya Dryomov; +Cc: Ceph Development

Unfortunately I have rebooted the server, as I needed the services back online.
I did try mapping and unmapping again after reboot and did not see the
problem anymore.

However, I will search through my logs and send you everything from
Feb 3 - Feb 5.

And if I see the issue again, I will follow all the debug steps
described in this thread and post it here.

In the meantime, I have upgraded to the next minor revision from your
dragonfly-debian archives, so maybe that is why I no longer see the
problem.

Many thanks for your help!

Regards,
Markus

On Tue, Nov 24, 2015 at 12:51 PM, Ilya Dryomov <idryomov@gmail.com> wrote:
> [...]
>
> It should read "so it could be related", of course.
