* btrfs-image gets stuck, using 100%, looping on bad file descriptor
@ 2015-08-18 15:40 Timothy Normand Miller
  2015-08-19  1:32 ` Qu Wenruo
  2015-08-21  1:32 ` Qu Wenruo
  0 siblings, 2 replies; 11+ messages in thread
From: Timothy Normand Miller @ 2015-08-18 15:40 UTC (permalink / raw)
  To: Btrfs BTRFS

I've filed a bug report on this:

https://bugzilla.kernel.org/show_bug.cgi?id=103081

-- 
Timothy Normand Miller, PhD
Assistant Professor of Computer Science, Binghamton University
http://www.cs.binghamton.edu/~millerti/
Open Graphics Project


* Re: btrfs-image gets stuck, using 100%, looping on bad file descriptor
  2015-08-18 15:40 btrfs-image gets stuck, using 100%, looping on bad file descriptor Timothy Normand Miller
@ 2015-08-19  1:32 ` Qu Wenruo
  2015-08-19  2:46   ` Timothy Normand Miller
  2015-08-21  1:32 ` Qu Wenruo
  1 sibling, 1 reply; 11+ messages in thread
From: Qu Wenruo @ 2015-08-19  1:32 UTC (permalink / raw)
  To: Timothy Normand Miller, Btrfs BTRFS

Hi Timothy,

Although I have replied on the bugzilla, IMHO it's more appropriate to
discuss it on the mailing list, as it's not a kernel bug.

Thanks,
Qu

Timothy Normand Miller wrote on 2015/08/18 11:40 -0400:
> I've filed a bug report on this:
>
> https://bugzilla.kernel.org/show_bug.cgi?id=103081
>


* Re: btrfs-image gets stuck, using 100%, looping on bad file descriptor
  2015-08-19  1:32 ` Qu Wenruo
@ 2015-08-19  2:46   ` Timothy Normand Miller
  2015-08-19  2:48     ` Qu Wenruo
  0 siblings, 1 reply; 11+ messages in thread
From: Timothy Normand Miller @ 2015-08-19  2:46 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Btrfs BTRFS

On Tue, Aug 18, 2015 at 9:32 PM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote:
> Hi Timothy,
>
> Although I have replied to the bugzilla, IMHO it's more appropriate to
> discuss it in mail list, as it's not a kernel bug.
>

All four devices were online.  The "missing" one was a drive that
died, which was replaced by a new one, but btrfs wouldn't finish the
deletion of the missing device.

-- 
Timothy Normand Miller, PhD
Assistant Professor of Computer Science, Binghamton University
http://www.cs.binghamton.edu/~millerti/
Open Graphics Project


* Re: btrfs-image gets stuck, using 100%, looping on bad file descriptor
  2015-08-19  2:46   ` Timothy Normand Miller
@ 2015-08-19  2:48     ` Qu Wenruo
  2015-08-19  2:55       ` Timothy Normand Miller
  0 siblings, 1 reply; 11+ messages in thread
From: Qu Wenruo @ 2015-08-19  2:48 UTC (permalink / raw)
  To: Timothy Normand Miller; +Cc: Btrfs BTRFS



Timothy Normand Miller wrote on 2015/08/18 22:46 -0400:
> On Tue, Aug 18, 2015 at 9:32 PM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote:
>> Hi Timothy,
>>
>> Although I have replied to the bugzilla, IMHO it's more appropriate to
>> discuss it in mail list, as it's not a kernel bug.
>>
>
> All four devices were online.  The "missing" one was a drive that
> died, which was replaced by a new one, but btrfs wouldn't finish the
> deletion of the missing device.
>
By "replaced", did you mean you used "btrfs replace"? Or did you just
swap the physical disk without using "btrfs replace"?

Thanks,
Qu


* Re: btrfs-image gets stuck, using 100%, looping on bad file descriptor
  2015-08-19  2:48     ` Qu Wenruo
@ 2015-08-19  2:55       ` Timothy Normand Miller
  2015-08-19  5:22         ` Qu Wenruo
  2015-08-20 11:38         ` Austin S Hemmelgarn
  0 siblings, 2 replies; 11+ messages in thread
From: Timothy Normand Miller @ 2015-08-19  2:55 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Btrfs BTRFS

On Tue, Aug 18, 2015 at 10:48 PM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote:
>
>
> Timothy Normand Miller wrote on 2015/08/18 22:46 -0400:
>>
>> On Tue, Aug 18, 2015 at 9:32 PM, Qu Wenruo <quwenruo@cn.fujitsu.com>
>> wrote:
>>>
>>> Hi Timothy,
>>>
>>> Although I have replied to the bugzilla, IMHO it's more appropriate to
>>> discuss it in mail list, as it's not a kernel bug.
>>>
>>
>> All four devices were online.  The "missing" one was a drive that
>> died, which was replaced by a new one, but btrfs wouldn't finish the
>> deletion of the missing device.
>>
> By replaced, did you mean "btrfs replace"? Or just change the physical disk
> without using "btrfs replace"?

Here's what happened:

- A drive started throwing bad sectors.  Somehow this caused metadata
on other drives to get messed up.
- I took that drive offline and mounted degraded (it's a 4-drive RAID1)
- I did a "btrfs add" on a new drive and then a "btrfs delete missing"
- The replacement drive failed during the replacement operation, and
everything went to crap.
- With some help, I got a kernel patch that allowed me to mount the
original three drives with TWO missing devices.
- I added a brand new drive and then did "delete missing" again.  This
time, the first "delete missing" was successful, but it didn't fully
balance the drives, and there was another missing device, so I had to
do a "delete missing" again, and that failed.

I wanted to get this back online and restored from a backup, but I was
willing to keep it this way if people wanted to probe at it, in case we
can uncover any btrfs bugs.  So it was suggested to get a metadata
image, but that ran into some kind of bug in btrfs-image.

Currently, I'm restoring from backup, but I have at least a partial
metadata dump.


-- 
Timothy Normand Miller, PhD
Assistant Professor of Computer Science, Binghamton University
http://www.cs.binghamton.edu/~millerti/
Open Graphics Project


* Re: btrfs-image gets stuck, using 100%, looping on bad file descriptor
  2015-08-19  2:55       ` Timothy Normand Miller
@ 2015-08-19  5:22         ` Qu Wenruo
  2015-08-19 16:18           ` Timothy Normand Miller
  2015-08-20 11:38         ` Austin S Hemmelgarn
  1 sibling, 1 reply; 11+ messages in thread
From: Qu Wenruo @ 2015-08-19  5:22 UTC (permalink / raw)
  To: Timothy Normand Miller; +Cc: Btrfs BTRFS



Timothy Normand Miller wrote on 2015/08/18 22:55 -0400:
> On Tue, Aug 18, 2015 at 10:48 PM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote:
>>
>>
>> Timothy Normand Miller wrote on 2015/08/18 22:46 -0400:
>>>
>>> On Tue, Aug 18, 2015 at 9:32 PM, Qu Wenruo <quwenruo@cn.fujitsu.com>
>>> wrote:
>>>>
>>>> Hi Timothy,
>>>>
>>>> Although I have replied to the bugzilla, IMHO it's more appropriate to
>>>> discuss it in mail list, as it's not a kernel bug.
>>>>
>>>
>>> All four devices were online.  The "missing" one was a drive that
>>> died, which was replaced by a new one, but btrfs wouldn't finish the
>>> deletion of the missing device.
>>>
>> By replaced, did you mean "btrfs replace"? Or just change the physical disk
>> without using "btrfs replace"?
>
> Here's what happened:
>
> - A drive started throwing bad sectors.  Somehow this caused metadata
> on other drives to get messed up.

Did that cause any huge damage?

> - I took that drive offline and mounted degraded (it's a 4-drive RAID1)
> - I did a "btrfs add" on a new drive and then a "btrfs delete missing"
> - The replacement drive failed during the replacement operation, and
> everything went to crap.
> - With some help, I got a kernel patch that allowed me to mount the
> original three drives with TWO missing devices.

So the original 3 drives are still OK,
the original bad one is missing, and the newly added one is also missing?

That sounds quite repairable.

> - I added a brand new drive and then did "delete missing" again.  This
> time, the first "delete missing" was successful, but it didn't fully
> balance the drives, and there was another missing device, so I had to
> do a "delete missing" again, and that failed.
>
> I wanted to get this back online and restored from a backup, but I was
> willing to keep it this way if people wanted to probe at, in case we
> can uncover any btrfs bugs.  So it was suggested to get a metadata
> image, but that ran into some kind of bug in btrfs-image.
If btrfs-image doesn't work, you can also try btrfs-debug-tree.
IIRC, debug-tree should be more robust than btrfs-image.

BTW, have you tried btrfsck on it? Does it also cause the infinite loop?

I'll also try to reproduce it and investigate the code directly.

Thanks,
Qu
>
> Currently, I'm restoring from backup, but I have at least a partial
> metadata dump.
>
>


* Re: btrfs-image gets stuck, using 100%, looping on bad file descriptor
  2015-08-19  5:22         ` Qu Wenruo
@ 2015-08-19 16:18           ` Timothy Normand Miller
  0 siblings, 0 replies; 11+ messages in thread
From: Timothy Normand Miller @ 2015-08-19 16:18 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Btrfs BTRFS

On Wed, Aug 19, 2015 at 1:22 AM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote:
>
>
> Timothy Normand Miller wrote on 2015/08/18 22:55 -0400:
>>
>> On Tue, Aug 18, 2015 at 10:48 PM, Qu Wenruo <quwenruo@cn.fujitsu.com>
>> wrote:
>>>
>>>
>>>
>>> Timothy Normand Miller wrote on 2015/08/18 22:46 -0400:
>>>>
>>>>
>>>> On Tue, Aug 18, 2015 at 9:32 PM, Qu Wenruo <quwenruo@cn.fujitsu.com>
>>>> wrote:
>>>>>
>>>>>
>>>>> Hi Timothy,
>>>>>
>>>>> Although I have replied to the bugzilla, IMHO it's more appropriate to
>>>>> discuss it in mail list, as it's not a kernel bug.
>>>>>
>>>>
>>>> All four devices were online.  The "missing" one was a drive that
>>>> died, which was replaced by a new one, but btrfs wouldn't finish the
>>>> deletion of the missing device.
>>>>
>>> By replaced, did you mean "btrfs replace"? Or just change the physical
>>> disk
>>> without using "btrfs replace"?
>>
>>
>> Here's what happened:
>>
>> - A drive started throwing bad sectors.  Somehow this caused metadata
>> on other drives to get messed up.
>
>
> Did that cause any huge damage?

It seems that metadata was damaged on all drives.

>
>> - I took that drive offline and mounted degraded (it's a 4-drive RAID1)
>> - I did a "btrfs add" on a new drive and then a "btrfs delete missing"
>> - The replacement drive failed during the replacement operation, and
>> everything went to crap.
>> - With some help, I got a kernel patch that allowed me to mount the
>> original three drives with TWO missing devices.
>
>
> So the original 3 drives are still OK,
> original bad one is missing, and the newly add one is also missing?
>
> That sounds quite repairable.

Nothing I tried would run to completion.  There were always errors.

>
>> - I added a brand new drive and then did "delete missing" again.  This
>> time, the first "delete missing" was successful, but it didn't fully
>> balance the drives, and there was another missing device, so I had to
>> do a "delete missing" again, and that failed.
>>
>> I wanted to get this back online and restored from a backup, but I was
>> willing to keep it this way if people wanted to probe at, in case we
>> can uncover any btrfs bugs.  So it was suggested to get a metadata
>> image, but that ran into some kind of bug in btrfs-image.
>
> If btrfs-image doesn't work, you can also try btrfs-debug-tree.
> IIRC, debug-tree should be more robust than btrfs-image.
>
> BTW, have you tried btrfsck on it? Does it also cause the infinite loop?
>
> I'll also try to reproduce it and investigate the codes directly.

Well, I had to get things back online, so I've restored from backup.
I do have what limited metadata image I could get from btrfs-image.

>
> Thanks,
> Qu
>
>>
>> Currently, I'm restoring from backup, but I have at least a partial
>> metadata dump.
>>
>>
>



-- 
Timothy Normand Miller, PhD
Assistant Professor of Computer Science, Binghamton University
http://www.cs.binghamton.edu/~millerti/
Open Graphics Project


* Re: btrfs-image gets stuck, using 100%, looping on bad file descriptor
  2015-08-19  2:55       ` Timothy Normand Miller
  2015-08-19  5:22         ` Qu Wenruo
@ 2015-08-20 11:38         ` Austin S Hemmelgarn
  2015-08-20 13:08           ` Timothy Normand Miller
  1 sibling, 1 reply; 11+ messages in thread
From: Austin S Hemmelgarn @ 2015-08-20 11:38 UTC (permalink / raw)
  To: Timothy Normand Miller, Qu Wenruo; +Cc: Btrfs BTRFS


On 2015-08-18 22:55, Timothy Normand Miller wrote:
> On Tue, Aug 18, 2015 at 10:48 PM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote:
>>
>>
>> Timothy Normand Miller wrote on 2015/08/18 22:46 -0400:
>>>
>>> On Tue, Aug 18, 2015 at 9:32 PM, Qu Wenruo <quwenruo@cn.fujitsu.com>
>>> wrote:
>>>>
>>>> Hi Timothy,
>>>>
>>>> Although I have replied to the bugzilla, IMHO it's more appropriate to
>>>> discuss it in mail list, as it's not a kernel bug.
>>>>
>>>
>>> All four devices were online.  The "missing" one was a drive that
>>> died, which was replaced by a new one, but btrfs wouldn't finish the
>>> deletion of the missing device.
>>>
>> By replaced, did you mean "btrfs replace"? Or just change the physical disk
>> without using "btrfs replace"?
>
> Here's what happened:
>
> - A drive started throwing bad sectors.  Somehow this caused metadata
> on other drives to get messed up.
> - I took that drive offline and mounted degraded (it's a 4-drive RAID1)
> - I did a "btrfs add" on a new drive and then a "btrfs delete missing"
> - The replacement drive failed during the replacement operation, and
> everything went to crap.
> - With some help, I got a kernel patch that allowed me to mount the
> original three drives with TWO missing devices.
> - I added a brand new drive and then did "delete missing" again.  This
> time, the first "delete missing" was successful, but it didn't fully
> balance the drives, and there was another missing device, so I had to
> do a "delete missing" again, and that failed.
>
Just for reference, I've found that it is usually safer to delete the 
missing device first if possible, then add the new one and re-balance. 
There seem to be some edge-cases in the code for deleting missing devices.
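
To make the two orderings concrete, here is a minimal sketch that only
builds and prints the command sequences; it runs nothing. The mount
point "/mnt" and the device "/dev/sdX" are placeholders, and the helper
name replacement_plan is hypothetical. "btrfs add" and "btrfs delete
missing" in the quoted steps are taken to mean "btrfs device add" and
"btrfs device delete missing".

```python
def replacement_plan(mount_point, new_device, delete_first):
    """Return the btrfs commands, in order, for replacing a missing device.

    This is a dry-run helper: it builds argument lists and never executes
    them. It assumes the filesystem is already mounted degraded.
    """
    add = ["btrfs", "device", "add", new_device, mount_point]
    delete = ["btrfs", "device", "delete", "missing", mount_point]
    balance = ["btrfs", "balance", "start", mount_point]
    if delete_first:
        # Ordering suggested above: drop the missing device, then add
        # the new one and re-balance.
        return [delete, add, balance]
    # Ordering used in the quoted report: add first, then delete missing.
    return [add, delete]


if __name__ == "__main__":
    # Placeholder paths; substitute the real mount point and device.
    for cmd in replacement_plan("/mnt", "/dev/sdX", delete_first=True):
        print(" ".join(cmd))
```

Printing the plan first makes it easy to review the order before typing
anything against a degraded filesystem.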





* Re: btrfs-image gets stuck, using 100%, looping on bad file descriptor
  2015-08-20 11:38         ` Austin S Hemmelgarn
@ 2015-08-20 13:08           ` Timothy Normand Miller
  2015-08-20 13:12             ` Austin S Hemmelgarn
  0 siblings, 1 reply; 11+ messages in thread
From: Timothy Normand Miller @ 2015-08-20 13:08 UTC (permalink / raw)
  To: Austin S Hemmelgarn; +Cc: Qu Wenruo, Btrfs BTRFS

On Thu, Aug 20, 2015 at 7:38 AM, Austin S Hemmelgarn
<ahferroin7@gmail.com> wrote:

> Just for reference, I've found that it is usually safer to delete the
> missing device first if possible, then add the new one and re-balance. There
> seem to be some edge-cases in the code for deleting missing devices.
>

The problem is that you can't do that if there's not enough space on
the remaining devices to hold all the data.
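
As a rough illustration of that constraint, the sketch below estimates
whether the remaining devices could hold the data on their own. It
assumes the usual approximation for btrfs RAID1 (two copies of every
chunk on different devices), so usable space is about
min(total / 2, total - largest device); metadata overhead and chunk
granularity are ignored. The drive sizes and data amounts are made up
for the example and are not the sizes of the array discussed here, and
both function names are hypothetical.

```python
def raid1_usable(sizes):
    """Approximate usable capacity of a two-copy RAID1 over the given devices.

    Every chunk needs two copies on two different devices, so capacity is
    bounded both by half the total and by what fits beside the largest device.
    """
    if len(sizes) < 2:
        return 0  # two copies cannot be placed on fewer than two devices
    total = sum(sizes)
    return min(total // 2, total - max(sizes))


def can_delete_before_add(remaining_sizes, data_size):
    """True if the data would still fit after dropping the missing device."""
    return data_size <= raid1_usable(remaining_sizes)


if __name__ == "__main__":
    remaining = [2000, 2000, 2000]  # hypothetical sizes in GB
    print(raid1_usable(remaining))
    print(can_delete_before_add(remaining, 3500))
    print(can_delete_before_add(remaining, 2500))
```

When the check fails, the new device has to be added before the missing
one can be deleted.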


-- 
Timothy Normand Miller, PhD
Assistant Professor of Computer Science, Binghamton University
http://www.cs.binghamton.edu/~millerti/
Open Graphics Project


* Re: btrfs-image gets stuck, using 100%, looping on bad file descriptor
  2015-08-20 13:08           ` Timothy Normand Miller
@ 2015-08-20 13:12             ` Austin S Hemmelgarn
  0 siblings, 0 replies; 11+ messages in thread
From: Austin S Hemmelgarn @ 2015-08-20 13:12 UTC (permalink / raw)
  To: Timothy Normand Miller; +Cc: Qu Wenruo, Btrfs BTRFS


On 2015-08-20 09:08, Timothy Normand Miller wrote:
> On Thu, Aug 20, 2015 at 7:38 AM, Austin S Hemmelgarn
> <ahferroin7@gmail.com> wrote:
>
>> Just for reference, I've found that it is usually safer to delete the
>> missing device first if possible, then add the new one and re-balance. There
>> seem to be some edge-cases in the code for deleting missing devices.
>>
>
> The problem is that you can't do that if there's not enough space on
> the remaining devices to hold all the data.
>
>
Good point, I often forget that not everyone over-provisions their 
storage like I do.




* Re: btrfs-image gets stuck, using 100%, looping on bad file descriptor
  2015-08-18 15:40 btrfs-image gets stuck, using 100%, looping on bad file descriptor Timothy Normand Miller
  2015-08-19  1:32 ` Qu Wenruo
@ 2015-08-21  1:32 ` Qu Wenruo
  1 sibling, 0 replies; 11+ messages in thread
From: Qu Wenruo @ 2015-08-21  1:32 UTC (permalink / raw)
  To: Timothy Normand Miller, Btrfs BTRFS

Succeeded in reproducing the bug.

Any missing device will cause btrfs-image to loop infinitely.
It should be easy to fix.

I'll CC you when the patch is out.

Thanks,
Qu

Timothy Normand Miller wrote on 2015/08/18 11:40 -0400:
> I've filed a bug report on this:
>
> https://bugzilla.kernel.org/show_bug.cgi?id=103081
>

