linux-btrfs.vger.kernel.org archive mirror
* Unusual crash -- data rolled back ~2 weeks?
@ 2019-11-09 22:33 Timothy Pearson
  2019-11-09 22:48 ` Timothy Pearson
  2019-11-10  3:38 ` Qu Wenruo
  0 siblings, 2 replies; 14+ messages in thread
From: Timothy Pearson @ 2019-11-09 22:33 UTC (permalink / raw)
  To: linux-btrfs

We just experienced a very unusual crash on a Linux 5.3 file server using NFS to serve a BTRFS filesystem.  NFS went into deadlock (D wait) with no apparent underlying disk subsystem problems, and when the server was hard rebooted to clear the D wait the BTRFS filesystem remounted itself in the state that it was in approximately two weeks earlier (!).  There was also significant corruption of certain files (e.g. LDAP MDB and MySQL InnoDB) noted -- we restored from backup for those files, but are concerned about the status of the entire filesystem at this point.

We do not use subvolumes, snapshots, or any of the advanced features of BTRFS beyond the data checksumming.  I am at a loss as to how BTRFS could suddenly just "forget" about the past two weeks of written data and (mostly) cleanly roll back on the next mount without even throwing any warnings in dmesg.

Any thoughts on how this is possible, and if there is any chance of getting the lost couple weeks of data back, would be appreciated.

Thank you!

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Unusual crash -- data rolled back ~2 weeks?
  2019-11-09 22:33 Unusual crash -- data rolled back ~2 weeks? Timothy Pearson
@ 2019-11-09 22:48 ` Timothy Pearson
  2019-11-10  3:38 ` Qu Wenruo
  1 sibling, 0 replies; 14+ messages in thread
From: Timothy Pearson @ 2019-11-09 22:48 UTC (permalink / raw)
  To: linux-btrfs

One item I did forget to mention here is that the underlying device was expanded online using "btrfs fi resize max /mount/path" at most a month before the failure -- I don't have the exact timestamps available, so there remains a possibility that the latest files on the currently mounted filesystem correspond to the filesystem as it was immediately prior to the resize operation.

Again, any suggestions welcome.  It took us a bit of time (and several large file restores) to realize the filesystem had rolled back vs just corrupted a few files, so while there does exist a raw copy of the filesystem it is tainted by being mounted and written to before the copy was taken.

----- Original Message -----
> From: "Timothy Pearson" <tpearson@raptorengineering.com>
> To: "linux-btrfs" <linux-btrfs@vger.kernel.org>
> Sent: Saturday, November 9, 2019 4:33:29 PM
> Subject: Unusual crash -- data rolled back ~2 weeks?

> We just experienced a very unusual crash on a Linux 5.3 file server using NFS to
> serve a BTRFS filesystem.  NFS went into deadlock (D wait) with no apparent
> underlying disk subsystem problems, and when the server was hard rebooted to
> clear the D wait the BTRFS filesystem remounted itself in the state that it was
> in approximately two weeks earlier (!).  There was also significant corruption
> of certain files (e.g. LDAP MDB and MySQL InnoDB) noted -- we restored from
> backup for those files, but are concerned about the status of the entire
> filesystem at this point.
> 
> We do not use subvolumes, snapshots, or any of the advanced features of BTRFS
> beyond the data checksumming.  I am at a loss as to how BTRFS could suddenly
> just "forget" about the past two weeks of written data and (mostly) cleanly
> roll back on the next mount without even throwing any warnings in dmesg.
> 
> Any thoughts on how this is possible, and if there is any chance of getting the
> lost couple weeks of data back, would be appreciated.
> 
> Thank you!

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Unusual crash -- data rolled back ~2 weeks?
  2019-11-09 22:33 Unusual crash -- data rolled back ~2 weeks? Timothy Pearson
  2019-11-09 22:48 ` Timothy Pearson
@ 2019-11-10  3:38 ` Qu Wenruo
  2019-11-10  6:47   ` Timothy Pearson
  1 sibling, 1 reply; 14+ messages in thread
From: Qu Wenruo @ 2019-11-10  3:38 UTC (permalink / raw)
  To: Timothy Pearson, linux-btrfs


[-- Attachment #1.1: Type: text/plain, Size: 1376 bytes --]



On 2019/11/10 6:33 AM, Timothy Pearson wrote:
> We just experienced a very unusual crash on a Linux 5.3 file server using NFS to serve a BTRFS filesystem.  NFS went into deadlock (D wait) with no apparent underlying disk subsystem problems, and when the server was hard rebooted to clear the D wait the BTRFS filesystem remounted itself in the state that it was in approximately two weeks earlier (!).

This means that for those two weeks, the btrfs filesystem was never committed.

>  There was also significant corruption of certain files (e.g. LDAP MDB and MySQL InnoDB) noted -- we restored from backup for those files, but are concerned about the status of the entire filesystem at this point.

A btrfs check is needed to ensure there is no metadata corruption.

Also, we need sysrq+w output to determine where we are deadlocking.
Otherwise, it's really hard to find any clue from the report.
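
For reference, next time something like the following (run as root;
/dev/sdX is only a placeholder for the real device) would capture what
we need:

  echo w > /proc/sysrq-trigger      # dump stacks of all blocked (D state) tasks
  dmesg > sysrq-w.txt               # save the resulting output for the report
  btrfs check --readonly /dev/sdX   # offline metadata check, with the fs unmounted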

Thanks,
Qu

> 
> We do not use subvolumes, snapshots, or any of the advanced features of BTRFS beyond the data checksumming.  I am at a loss as to how BTRFS could suddenly just "forget" about the past two weeks of written data and (mostly) cleanly roll back on the next mount without even throwing any warnings in dmesg.
> 
> Any thoughts on how this is possible, and if there is any chance of getting the lost couple weeks of data back, would be appreciated.
> 
> Thank you!
> 


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Unusual crash -- data rolled back ~2 weeks?
  2019-11-10  3:38 ` Qu Wenruo
@ 2019-11-10  6:47   ` Timothy Pearson
  2019-11-10  6:54     ` Qu Wenruo
  0 siblings, 1 reply; 14+ messages in thread
From: Timothy Pearson @ 2019-11-10  6:47 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs



----- Original Message -----
> From: "Qu Wenruo" <quwenruo.btrfs@gmx.com>
> To: "Timothy Pearson" <tpearson@raptorengineering.com>, "linux-btrfs" <linux-btrfs@vger.kernel.org>
> Sent: Saturday, November 9, 2019 9:38:21 PM
> Subject: Re: Unusual crash -- data rolled back ~2 weeks?

> On 2019/11/10 上午6:33, Timothy Pearson wrote:
>> We just experienced a very unusual crash on a Linux 5.3 file server using NFS to
>> serve a BTRFS filesystem.  NFS went into deadlock (D wait) with no apparent
>> underlying disk subsystem problems, and when the server was hard rebooted to
>> clear the D wait the BTRFS filesystem remounted itself in the state that it was
>> in approximately two weeks earlier (!).
> 
> This means during two weeks, the btrfs is not committed.

Is there any hope of getting the data from that interval back via btrfs-recover or a similar tool, or does the lack of commit mean the data was stored in RAM only and is therefore gone after the server reboot?

If the latter, I'm somewhat surprised given the I/O load on the disk array in question, but it would also offer a clue as to why it hard locked the filesystem eventually (presumably on memory exhaustion -- the server has something like 128GB of RAM, so it could go quite a while before hitting the physical RAM limits).

> 
>>  There was also significant corruption of certain files (e.g. LDAP MDB and MySQL
>>  InnoDB) noted -- we restored from backup for those files, but are concerned
>>  about the status of the entire filesystem at this point.
> 
> Btrfs check is needed to ensure no metadata corruption.
> 
> Also, we need sysrq+w output to determine where we are deadlocking.
> Otherwise, it's really hard to find any clue from the report.

It would have been gathered if we'd known the filesystem was in this bad state.  At the time, the priority was on restoring service and we had assumed NFS had just wedged itself (again).  It was only after reboot and remount that the damage slowly came to light.

Do the described symptoms (what little we know of them at this point) line up with the issues fixed by https://patchwork.kernel.org/patch/11141559/ ?  Right now we're hoping that this particular issue was fixed by that series, but if not we might consider increasing backup frequency to nightly for this particular array and seeing if it happens again.

Thanks!

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Unusual crash -- data rolled back ~2 weeks?
  2019-11-10  6:47   ` Timothy Pearson
@ 2019-11-10  6:54     ` Qu Wenruo
  2019-11-10  7:18       ` Timothy Pearson
  0 siblings, 1 reply; 14+ messages in thread
From: Qu Wenruo @ 2019-11-10  6:54 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: linux-btrfs


[-- Attachment #1.1: Type: text/plain, Size: 2805 bytes --]



On 2019/11/10 2:47 PM, Timothy Pearson wrote:
> 
> 
> ----- Original Message -----
>> From: "Qu Wenruo" <quwenruo.btrfs@gmx.com>
>> To: "Timothy Pearson" <tpearson@raptorengineering.com>, "linux-btrfs" <linux-btrfs@vger.kernel.org>
>> Sent: Saturday, November 9, 2019 9:38:21 PM
>> Subject: Re: Unusual crash -- data rolled back ~2 weeks?
> 
>> On 2019/11/10 上午6:33, Timothy Pearson wrote:
>>> We just experienced a very unusual crash on a Linux 5.3 file server using NFS to
>>> serve a BTRFS filesystem.  NFS went into deadlock (D wait) with no apparent
>>> underlying disk subsystem problems, and when the server was hard rebooted to
>>> clear the D wait the BTRFS filesystem remounted itself in the state that it was
>>> in approximately two weeks earlier (!).
>>
>> This means during two weeks, the btrfs is not committed.
> 
> Is there any hope of getting the data from that interval back via btrfs-recover or a similar tool, or does the lack of commit mean the data was stored in RAM only and is therefore gone after the server reboot?

If it's a deadlock preventing new transactions from being committed, then
no metadata was written back to disk at all, so there is no way to recover
the metadata.  You may find some data blocks that were written, but without
the metadata they cannot be interpreted.
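
If you still want to try, the usual attempt (illustrative commands only;
<bytenr> would come from the btrfs-find-root output and /dev/sdX is a
placeholder) looks something like:

  btrfs-find-root /dev/sdX                          # list candidate tree roots and generations
  btrfs restore -D -t <bytenr> /dev/sdX /tmp        # dry run against an older tree root
  btrfs restore -t <bytenr> /dev/sdX /mnt/recovery  # copy out whatever is still reachable

But with no committed metadata from that period, expect very little.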

> 
> If the latter, I'm somewhat surprised given the I/O load on the disk array in question, but it would also offer a clue as to why it hard locked the filesystem eventually (presumably on memory exhaustion -- the server has something like 128GB of RAM, so it could go quite a while before hitting the physical RAM limits).
> 
>>
>>>  There was also significant corruption of certain files (e.g. LDAP MDB and MySQL
>>>  InnoDB) noted -- we restored from backup for those files, but are concerned
>>>  about the status of the entire filesystem at this point.
>>
>> Btrfs check is needed to ensure no metadata corruption.
>>
>> Also, we need sysrq+w output to determine where we are deadlocking.
>> Otherwise, it's really hard to find any clue from the report.
> 
> It would have been gathered if we'd known the filesystem was in this bad state.  At the time, the priority was on restoring service and we had assumed NFS had just wedged itself (again).  It was only after reboot and remount that the damage slowly came to light.
> 
> Do the described symptoms (what little we know of them at this point) line up with the issues fixed by https://patchwork.kernel.org/patch/11141559/ ?  Right now we're hoping that this particular issue was fixed by that series, but if not we might consider increasing backup frequency to nightly for this particular array and seeing if it happens again.

That fix is already in v5.3, thus I don't think that's the case.

Thanks,
Qu

> 
> Thanks!
> 


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Unusual crash -- data rolled back ~2 weeks?
  2019-11-10  6:54     ` Qu Wenruo
@ 2019-11-10  7:18       ` Timothy Pearson
  2019-11-10  7:45         ` Qu Wenruo
  2019-11-10  8:04         ` Andrei Borzenkov
  0 siblings, 2 replies; 14+ messages in thread
From: Timothy Pearson @ 2019-11-10  7:18 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs



----- Original Message -----
> From: "Qu Wenruo" <quwenruo.btrfs@gmx.com>
> To: "Timothy Pearson" <tpearson@raptorengineering.com>
> Cc: "linux-btrfs" <linux-btrfs@vger.kernel.org>
> Sent: Sunday, November 10, 2019 6:54:55 AM
> Subject: Re: Unusual crash -- data rolled back ~2 weeks?

> On 2019/11/10 下午2:47, Timothy Pearson wrote:
>> 
>> 
>> ----- Original Message -----
>>> From: "Qu Wenruo" <quwenruo.btrfs@gmx.com>
>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>, "linux-btrfs"
>>> <linux-btrfs@vger.kernel.org>
>>> Sent: Saturday, November 9, 2019 9:38:21 PM
>>> Subject: Re: Unusual crash -- data rolled back ~2 weeks?
>> 
>>> On 2019/11/10 上午6:33, Timothy Pearson wrote:
>>>> We just experienced a very unusual crash on a Linux 5.3 file server using NFS to
>>>> serve a BTRFS filesystem.  NFS went into deadlock (D wait) with no apparent
>>>> underlying disk subsystem problems, and when the server was hard rebooted to
>>>> clear the D wait the BTRFS filesystem remounted itself in the state that it was
>>>> in approximately two weeks earlier (!).
>>>
>>> This means during two weeks, the btrfs is not committed.
>> 
>> Is there any hope of getting the data from that interval back via btrfs-recover
>> or a similar tool, or does the lack of commit mean the data was stored in RAM
>> only and is therefore gone after the server reboot?
> 
> If it's deadlock preventing new transaction to be committed, then no
> metadata is even written back to disk, so no way to recover metadata.
> Maybe you can find some data written, but without metadata it makes no
> sense.

OK, I'll just assume the data written in that window is unrecoverable at this point then.

Would the commit deadlock affect only one btrfs filesystem or all of them on the machine?  I take it there is no automatic dmesg spew on extended deadlock?  dmesg was completely clean at the time of the fault / reboot.

>> 
>> If the latter, I'm somewhat surprised given the I/O load on the disk array in
>> question, but it would also offer a clue as to why it hard locked the
>> filesystem eventually (presumably on memory exhaustion -- the server has
>> something like 128GB of RAM, so it could go quite a while before hitting the
>> physical RAM limits).
>> 
>>>
>>>>  There was also significant corruption of certain files (e.g. LDAP MDB and MySQL
>>>>  InnoDB) noted -- we restored from backup for those files, but are concerned
>>>>  about the status of the entire filesystem at this point.
>>>
>>> Btrfs check is needed to ensure no metadata corruption.
>>>
>>> Also, we need sysrq+w output to determine where we are deadlocking.
>>> Otherwise, it's really hard to find any clue from the report.
>> 
>> It would have been gathered if we'd known the filesystem was in this bad state.
>> At the time, the priority was on restoring service and we had assumed NFS had
>> just wedged itself (again).  It was only after reboot and remount that the
>> damage slowly came to light.
>> 
>> Do the described symptoms (what little we know of them at this point) line up
>> with the issues fixed by https://patchwork.kernel.org/patch/11141559/ ?  Right
>> now we're hoping that this particular issue was fixed by that series, but if
>> not we might consider increasing backup frequency to nightly for this
>> particular array and seeing if it happens again.
> 
> That fix is already in v5.3, thus I don't think that's the case.
> 
> Thanks,
> Qu

Looking more carefully, the server in question had somehow been booted on 5.3-rc3.  It's possible this was because earlier versions were showing driver problems with other hardware, but in any case this machine was running 5.3-rc3, and the patch was created *after* the rc3 release.

Thanks!

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Unusual crash -- data rolled back ~2 weeks?
  2019-11-10  7:18       ` Timothy Pearson
@ 2019-11-10  7:45         ` Qu Wenruo
  2019-11-10  7:48           ` Timothy Pearson
                             ` (2 more replies)
  2019-11-10  8:04         ` Andrei Borzenkov
  1 sibling, 3 replies; 14+ messages in thread
From: Qu Wenruo @ 2019-11-10  7:45 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: linux-btrfs


[-- Attachment #1.1: Type: text/plain, Size: 4502 bytes --]



On 2019/11/10 3:18 PM, Timothy Pearson wrote:
> 
> 
> ----- Original Message -----
>> From: "Qu Wenruo" <quwenruo.btrfs@gmx.com>
>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>> Cc: "linux-btrfs" <linux-btrfs@vger.kernel.org>
>> Sent: Sunday, November 10, 2019 6:54:55 AM
>> Subject: Re: Unusual crash -- data rolled back ~2 weeks?
> 
>> On 2019/11/10 下午2:47, Timothy Pearson wrote:
>>>
>>>
>>> ----- Original Message -----
>>>> From: "Qu Wenruo" <quwenruo.btrfs@gmx.com>
>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>, "linux-btrfs"
>>>> <linux-btrfs@vger.kernel.org>
>>>> Sent: Saturday, November 9, 2019 9:38:21 PM
>>>> Subject: Re: Unusual crash -- data rolled back ~2 weeks?
>>>
>>>> On 2019/11/10 上午6:33, Timothy Pearson wrote:
>>>>> We just experienced a very unusual crash on a Linux 5.3 file server using NFS to
>>>>> serve a BTRFS filesystem.  NFS went into deadlock (D wait) with no apparent
>>>>> underlying disk subsystem problems, and when the server was hard rebooted to
>>>>> clear the D wait the BTRFS filesystem remounted itself in the state that it was
>>>>> in approximately two weeks earlier (!).
>>>>
>>>> This means during two weeks, the btrfs is not committed.
>>>
>>> Is there any hope of getting the data from that interval back via btrfs-recover
>>> or a similar tool, or does the lack of commit mean the data was stored in RAM
>>> only and is therefore gone after the server reboot?
>>
>> If it's deadlock preventing new transaction to be committed, then no
>> metadata is even written back to disk, so no way to recover metadata.
>> Maybe you can find some data written, but without metadata it makes no
>> sense.
> 
> OK, I'll just assume the data written in that window is unrecoverable at this point then.
> 
> Would the commit deadlock affect only one btrfs filesystem or all of them on the machine?  I take it there is no automatic dmesg spew on extended deadlock?  dmesg was completely clean at the time of the fault / reboot.

There should be kernel messages for things like a process hanging for more
than 120s.
If you could recover those, it would help us locate the cause.

Normally such a deadlock should only affect the unlucky filesystem that
hits the condition, not all of them.
But if you're unlucky enough, it may happen to other filesystems as well.

Anyway, without enough info, it's really hard to say.
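
For what it's worth, those warnings come from the kernel's hung task
detector; assuming CONFIG_DETECT_HUNG_TASK is enabled in your kernel
config, its settings can be checked with something like:

  cat /proc/sys/kernel/hung_task_timeout_secs   # warning threshold, 120s by default
  sysctl kernel.hung_task_warnings              # how many more warnings will be printed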

> 
>>>
>>> If the latter, I'm somewhat surprised given the I/O load on the disk array in
>>> question, but it would also offer a clue as to why it hard locked the
>>> filesystem eventually (presumably on memory exhaustion -- the server has
>>> something like 128GB of RAM, so it could go quite a while before hitting the
>>> physical RAM limits).
>>>
>>>>
>>>>>  There was also significant corruption of certain files (e.g. LDAP MDB and MySQL
>>>>>  InnoDB) noted -- we restored from backup for those files, but are concerned
>>>>>  about the status of the entire filesystem at this point.
>>>>
>>>> Btrfs check is needed to ensure no metadata corruption.
>>>>
>>>> Also, we need sysrq+w output to determine where we are deadlocking.
>>>> Otherwise, it's really hard to find any clue from the report.
>>>
>>> It would have been gathered if we'd known the filesystem was in this bad state.
>>> At the time, the priority was on restoring service and we had assumed NFS had
>>> just wedged itself (again).  It was only after reboot and remount that the
>>> damage slowly came to light.
>>>
>>> Do the described symptoms (what little we know of them at this point) line up
>>> with the issues fixed by https://patchwork.kernel.org/patch/11141559/ ?  Right
>>> now we're hoping that this particular issue was fixed by that series, but if
>>> not we might consider increasing backup frequency to nightly for this
>>> particular array and seeing if it happens again.
>>
>> That fix is already in v5.3, thus I don't think that's the case.
>>
>> Thanks,
>> Qu
> 
> Looking more carefully, the server in question had been booted on 5.3-rc3 somehow.  It's possible that this was because earlier versions were showing driver problems with the other hardware, but somehow this machine was running 5.3-rc3 and the patch was created *after* rc3 release.

If that's the case, simply upgrading the kernel should prevent the problem
from happening again.
And it's a relief that we don't need to face another deadly deadlock.

Thanks,
Qu

> 
> Thanks!
> 


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Unusual crash -- data rolled back ~2 weeks?
  2019-11-10  7:45         ` Qu Wenruo
@ 2019-11-10  7:48           ` Timothy Pearson
  2019-11-10 10:02           ` Timothy Pearson
  2019-11-11 23:28           ` Timothy Pearson
  2 siblings, 0 replies; 14+ messages in thread
From: Timothy Pearson @ 2019-11-10  7:48 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs



----- Original Message -----
> From: "Qu Wenruo" <quwenruo.btrfs@gmx.com>
> To: "Timothy Pearson" <tpearson@raptorengineering.com>
> Cc: "linux-btrfs" <linux-btrfs@vger.kernel.org>
> Sent: Sunday, November 10, 2019 7:45:14 AM
> Subject: Re: Unusual crash -- data rolled back ~2 weeks?

> On 2019/11/10 下午3:18, Timothy Pearson wrote:
>> 
>> 
>> ----- Original Message -----
>>> From: "Qu Wenruo" <quwenruo.btrfs@gmx.com>
>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>> Cc: "linux-btrfs" <linux-btrfs@vger.kernel.org>
>>> Sent: Sunday, November 10, 2019 6:54:55 AM
>>> Subject: Re: Unusual crash -- data rolled back ~2 weeks?
>> 
>>> On 2019/11/10 下午2:47, Timothy Pearson wrote:
>>>>
>>>>
>>>> ----- Original Message -----
>>>>> From: "Qu Wenruo" <quwenruo.btrfs@gmx.com>
>>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>, "linux-btrfs"
>>>>> <linux-btrfs@vger.kernel.org>
>>>>> Sent: Saturday, November 9, 2019 9:38:21 PM
>>>>> Subject: Re: Unusual crash -- data rolled back ~2 weeks?
>>>>
>>>>> On 2019/11/10 上午6:33, Timothy Pearson wrote:
>>>>>> We just experienced a very unusual crash on a Linux 5.3 file server using NFS to
>>>>>> serve a BTRFS filesystem.  NFS went into deadlock (D wait) with no apparent
>>>>>> underlying disk subsystem problems, and when the server was hard rebooted to
>>>>>> clear the D wait the BTRFS filesystem remounted itself in the state that it was
>>>>>> in approximately two weeks earlier (!).
>>>>>
>>>>> This means during two weeks, the btrfs is not committed.
>>>>
>>>> Is there any hope of getting the data from that interval back via btrfs-recover
>>>> or a similar tool, or does the lack of commit mean the data was stored in RAM
>>>> only and is therefore gone after the server reboot?
>>>
>>> If it's deadlock preventing new transaction to be committed, then no
>>> metadata is even written back to disk, so no way to recover metadata.
>>> Maybe you can find some data written, but without metadata it makes no
>>> sense.
>> 
>> OK, I'll just assume the data written in that window is unrecoverable at this
>> point then.
>> 
>> Would the commit deadlock affect only one btrfs filesystem or all of them on the
>> machine?  I take it there is no automatic dmesg spew on extended deadlock?
>> dmesg was completely clean at the time of the fault / reboot.
> 
> It should have some kernel message for things like process hang for over
> 120s.
> If you could recover that, it would help us to locate the cause.
> 
> Normally such deadlock should only affect the unlucky fs which meets the
> condition, not all filesystems.
> But if you're unlucky enough, it may happen to other filesystems.
> 
> Anyway, without enough info, it's really hard to say.

Agreed.  When I get back tomorrow I'll check whether a rotated dmesg/log file from ~2 weeks prior is still hanging around.

>> 
>>>>
>>>> If the latter, I'm somewhat surprised given the I/O load on the disk array in
>>>> question, but it would also offer a clue as to why it hard locked the
>>>> filesystem eventually (presumably on memory exhaustion -- the server has
>>>> something like 128GB of RAM, so it could go quite a while before hitting the
>>>> physical RAM limits).
>>>>
>>>>>
>>>>>>  There was also significant corruption of certain files (e.g. LDAP MDB and MySQL
>>>>>>  InnoDB) noted -- we restored from backup for those files, but are concerned
>>>>>>  about the status of the entire filesystem at this point.
>>>>>
>>>>> Btrfs check is needed to ensure no metadata corruption.
>>>>>
>>>>> Also, we need sysrq+w output to determine where we are deadlocking.
>>>>> Otherwise, it's really hard to find any clue from the report.
>>>>
>>>> It would have been gathered if we'd known the filesystem was in this bad state.
>>>> At the time, the priority was on restoring service and we had assumed NFS had
>>>> just wedged itself (again).  It was only after reboot and remount that the
>>>> damage slowly came to light.
>>>>
>>>> Do the described symptoms (what little we know of them at this point) line up
>>>> with the issues fixed by https://patchwork.kernel.org/patch/11141559/ ?  Right
>>>> now we're hoping that this particular issue was fixed by that series, but if
>>>> not we might consider increasing backup frequency to nightly for this
>>>> particular array and seeing if it happens again.
>>>
>>> That fix is already in v5.3, thus I don't think that's the case.
>>>
>>> Thanks,
>>> Qu
>> 
>> Looking more carefully, the server in question had been booted on 5.3-rc3
>> somehow.  It's possible that this was because earlier versions were showing
>> driver problems with the other hardware, but somehow this machine was running
>> 5.3-rc3 and the patch was created *after* rc3 release.
> 
> If that's the case, just upgrade the kernel should prevent such problem
> from happening.
> And it's a relief that we don't need to face another deadly deadlock.

Yes, absolutely, the kernel was updated immediately once this was discovered, but it was too late for the unfortunate filesystem in question.  At least we had a good DR test and backup check. :)

I'll keep an eye on this over the coming months to make sure it doesn't recur.

Thanks!

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Unusual crash -- data rolled back ~2 weeks?
  2019-11-10  7:18       ` Timothy Pearson
  2019-11-10  7:45         ` Qu Wenruo
@ 2019-11-10  8:04         ` Andrei Borzenkov
  1 sibling, 0 replies; 14+ messages in thread
From: Andrei Borzenkov @ 2019-11-10  8:04 UTC (permalink / raw)
  To: Timothy Pearson, Qu Wenruo; +Cc: linux-btrfs

On 10.11.2019 10:18, Timothy Pearson wrote:
>>>
>>> Do the described symptoms (what little we know of them at this point) line up
>>> with the issues fixed by https://patchwork.kernel.org/patch/11141559/ ?  Right
>>> now we're hoping that this particular issue was fixed by that series, but if
>>> not we might consider increasing backup frequency to nightly for this
>>> particular array and seeing if it happens again.
>>
>> That fix is already in v5.3, thus I don't think that's the case.
>>
>> Thanks,
>> Qu
> 
> Looking more carefully, the server in question had been booted on 5.3-rc3 somehow.  It's possible that this was because earlier versions were showing driver problems with the other hardware, but somehow this machine was running 5.3-rc3 and the patch was created *after* rc3 release.

The patch apparently was added in 5.3 final, so well after rc3.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Unusual crash -- data rolled back ~2 weeks?
  2019-11-10  7:45         ` Qu Wenruo
  2019-11-10  7:48           ` Timothy Pearson
@ 2019-11-10 10:02           ` Timothy Pearson
  2019-11-10 20:10             ` Zygo Blaxell
  2019-11-11 23:28           ` Timothy Pearson
  2 siblings, 1 reply; 14+ messages in thread
From: Timothy Pearson @ 2019-11-10 10:02 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs



----- Original Message -----
> From: "Qu Wenruo" <quwenruo.btrfs@gmx.com>
> To: "Timothy Pearson" <tpearson@raptorengineering.com>
> Cc: "linux-btrfs" <linux-btrfs@vger.kernel.org>
> Sent: Sunday, November 10, 2019 1:45:14 AM
> Subject: Re: Unusual crash -- data rolled back ~2 weeks?

> On 2019/11/10 下午3:18, Timothy Pearson wrote:
>> 
>> 
>> ----- Original Message -----
>>> From: "Qu Wenruo" <quwenruo.btrfs@gmx.com>
>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>> Cc: "linux-btrfs" <linux-btrfs@vger.kernel.org>
>>> Sent: Sunday, November 10, 2019 6:54:55 AM
>>> Subject: Re: Unusual crash -- data rolled back ~2 weeks?
>> 
>>> On 2019/11/10 下午2:47, Timothy Pearson wrote:
>>>>
>>>>
>>>> ----- Original Message -----
>>>>> From: "Qu Wenruo" <quwenruo.btrfs@gmx.com>
>>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>, "linux-btrfs"
>>>>> <linux-btrfs@vger.kernel.org>
>>>>> Sent: Saturday, November 9, 2019 9:38:21 PM
>>>>> Subject: Re: Unusual crash -- data rolled back ~2 weeks?
>>>>
>>>>> On 2019/11/10 上午6:33, Timothy Pearson wrote:
>>>>>> We just experienced a very unusual crash on a Linux 5.3 file server using NFS to
>>>>>> serve a BTRFS filesystem.  NFS went into deadlock (D wait) with no apparent
>>>>>> underlying disk subsystem problems, and when the server was hard rebooted to
>>>>>> clear the D wait the BTRFS filesystem remounted itself in the state that it was
>>>>>> in approximately two weeks earlier (!).
>>>>>
>>>>> This means during two weeks, the btrfs is not committed.
>>>>
>>>> Is there any hope of getting the data from that interval back via btrfs-recover
>>>> or a similar tool, or does the lack of commit mean the data was stored in RAM
>>>> only and is therefore gone after the server reboot?
>>>
>>> If it's deadlock preventing new transaction to be committed, then no
>>> metadata is even written back to disk, so no way to recover metadata.
>>> Maybe you can find some data written, but without metadata it makes no
>>> sense.
>> 
>> OK, I'll just assume the data written in that window is unrecoverable at this
>> point then.
>> 
>> Would the commit deadlock affect only one btrfs filesystem or all of them on the
>> machine?  I take it there is no automatic dmesg spew on extended deadlock?
>> dmesg was completely clean at the time of the fault / reboot.
> 
> It should have some kernel message for things like process hang for over
> 120s.
> If you could recover that, it would help us to locate the cause.
> 
> Normally such deadlock should only affect the unlucky fs which meets the
> condition, not all filesystems.
> But if you're unlucky enough, it may happen to other filesystems.
> 
> Anyway, without enough info, it's really hard to say.

I was able to retrieve complete logs from the kernel for the entire time period.  The BTRFS filesystem was online resized five days before the last apparent filesystem commit.  Immediately after resize, a couple of csum errors were thrown for a single inode on the resized filesystem, though this was not detected at the time.  The underlying hardware did not experience a fault at any point and is passing all diagnostics at this time.  Intriguingly, there are a handful of files accessible from after the last known good filesystem commit (Oct. 29), but the vast majority are simply absent.

At this point I'm more interested in making sure this type of event does not happen in the future than anything else.  At no point did the kernel print any type of stack trace or deadlock warning.  I'm starting to wonder if we hit a bug in the online resize path, but am just guessing at this point.  The timing is certainly very close / coincidental.

Thanks!

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Unusual crash -- data rolled back ~2 weeks?
  2019-11-10 10:02           ` Timothy Pearson
@ 2019-11-10 20:10             ` Zygo Blaxell
  0 siblings, 0 replies; 14+ messages in thread
From: Zygo Blaxell @ 2019-11-10 20:10 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: Qu Wenruo, linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 6230 bytes --]

On Sun, Nov 10, 2019 at 04:02:11AM -0600, Timothy Pearson wrote:
> 
> 
> ----- Original Message -----
> > From: "Qu Wenruo" <quwenruo.btrfs@gmx.com>
> > To: "Timothy Pearson" <tpearson@raptorengineering.com>
> > Cc: "linux-btrfs" <linux-btrfs@vger.kernel.org>
> > Sent: Sunday, November 10, 2019 1:45:14 AM
> > Subject: Re: Unusual crash -- data rolled back ~2 weeks?
> 
> > On 2019/11/10 下午3:18, Timothy Pearson wrote:
> >> 
> >> 
> >> ----- Original Message -----
> >>> From: "Qu Wenruo" <quwenruo.btrfs@gmx.com>
> >>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
> >>> Cc: "linux-btrfs" <linux-btrfs@vger.kernel.org>
> >>> Sent: Sunday, November 10, 2019 6:54:55 AM
> >>> Subject: Re: Unusual crash -- data rolled back ~2 weeks?
> >> 
> >>> On 2019/11/10 下午2:47, Timothy Pearson wrote:
> >>>>
> >>>>
> >>>> ----- Original Message -----
> >>>>> From: "Qu Wenruo" <quwenruo.btrfs@gmx.com>
> >>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>, "linux-btrfs"
> >>>>> <linux-btrfs@vger.kernel.org>
> >>>>> Sent: Saturday, November 9, 2019 9:38:21 PM
> >>>>> Subject: Re: Unusual crash -- data rolled back ~2 weeks?
> >>>>
> >>>>> On 2019/11/10 上午6:33, Timothy Pearson wrote:
> >>>>>> We just experienced a very unusual crash on a Linux 5.3 file server using NFS to
> >>>>>> serve a BTRFS filesystem.  NFS went into deadlock (D wait) with no apparent
> >>>>>> underlying disk subsystem problems, and when the server was hard rebooted to
> >>>>>> clear the D wait the BTRFS filesystem remounted itself in the state that it was
> >>>>>> in approximately two weeks earlier (!).
> >>>>>
> >>>>> This means during two weeks, the btrfs is not committed.
> >>>>
> >>>> Is there any hope of getting the data from that interval back via btrfs-recover
> >>>> or a similar tool, or does the lack of commit mean the data was stored in RAM
> >>>> only and is therefore gone after the server reboot?

Writeback will dump out some data blocks between commits; however, without
a commit, there will be no metadata pages on disk that point to the data.

Writeback could keep a fileserver running for a long time as long as
nobody calls a nontrivial fsync() (too complex to be sent to the log tree)
or sync(), or renames a file over another existing file (all may trigger a
commit if reservations fill up); however, as soon as one of those happens,
something should be noticeably failing as the calls will block.

> >>> If it's deadlock preventing new transaction to be committed, then no
> >>> metadata is even written back to disk, so no way to recover metadata.
> >>> Maybe you can find some data written, but without metadata it makes no
> >>> sense.
> >> 
> >> OK, I'll just assume the data written in that window is unrecoverable at this
> >> point then.
> >> 
> >> Would the commit deadlock affect only one btrfs filesystem or all of them on the
> >> machine?  I take it there is no automatic dmesg spew on extended deadlock?
> >> dmesg was completely clean at the time of the fault / reboot.

Stepping away from btrfs a bit, I've heard rumors of something like this
happening to SSDs (on Windows, so not a btrfs issue).  I guess it may
be possible for a log-structured FTL layer to revert to a significantly
earlier disk content state if there are enough free erase blocks so that
the older data isn't destroyed, and the pointer to the current log record
isn't updated in persistent storage due to a firmware bug.  Obviously this
is not relevant if you're not using SSD, and not likely if you have a
multi-disk filesystem (one disk will appear to be corrupted in that case).

> > It should have some kernel message for things like process hang for over
> > 120s.
> > If you could recover that, it would help us to locate the cause.
> > 
> > Normally such deadlock should only affect the unlucky fs which meets the
> > condition, not all filesystems.
> > But if you're unlucky enough, it may happen to other filesystems.
> > 
> > Anyway, without enough info, it's really hard to say.
> 
> I was able to retrieve complete logs from the kernel for the entire time period.  The BTRFS filesystem was online resized five days before the last apparent filesystem commit.  Immediately after resize, a couple of csum errors were thrown for a single inode on the resized filesystem, though this was not detected at the time.  The underlying hardware did not experience a fault at any point and is passing all diagnostics at this time.  Intriguingly, there are a handful of files accessible from after the last known good filesystem commit (Oct. 29), but the vast majority are simply absent.
> 
> At this point I'm more interested in making sure this type of event does not happen in the future than anything else.  At no point did the kernel print any type of stack trace or deadlock warning.  I'm starting to wonder if we hit a bug in the online resize path, but am just guessing at this point.  The timing is certainly very close / coincidental.

To detect this kind of failure we use a watchdog script that invokes mkdir
and rmdir every 30 seconds on each filesystem backed by disk (i.e. btrfs,
ext4, and xfs).  If the mkdir/rmdir takes too long (*) then we try to
log some information (mostly 'echo w > /proc/sysrq-trigger') and force
a reboot.  mkdir and rmdir will eventually get stuck on btrfs if there
is a commit that is not making forward progress.  It's a surprisingly
simple and effective bug detector on ext4 and xfs too.

(This doesn't detect the SSD thing--you'd need RAID1 to handle that case).
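
A minimal sketch of that kind of watchdog (mount points, names, and
thresholds here are made up for illustration) looks roughly like:

  #!/bin/sh
  # Probe each filesystem with mkdir/rmdir; if the probe hasn't finished
  # within TIMEOUT seconds, dump blocked-task stacks via sysrq-w so the
  # hang shows up in the logs.  A task stuck in D state can't be killed,
  # so the probe runs in the background and we only check whether it
  # finished, rather than trying to kill it.
  MOUNTPOINTS="/srv/data /home"   # hypothetical mount points to watch
  TIMEOUT=300                     # seconds before a probe is considered stuck

  while true; do
      for mp in $MOUNTPOINTS; do
          ( mkdir "$mp/.fs-watchdog" && rmdir "$mp/.fs-watchdog" ) &
          probe=$!
          slept=0
          while kill -0 "$probe" 2>/dev/null && [ "$slept" -lt "$TIMEOUT" ]; do
              sleep 5
              slept=$((slept + 5))
          done
          if kill -0 "$probe" 2>/dev/null; then
              echo w > /proc/sysrq-trigger   # log stacks of blocked (D state) tasks
              logger "fs-watchdog: probe on $mp exceeded ${TIMEOUT}s"
              # here one could also alert an admin or force a reboot
          fi
      done
      sleep 30
  done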

The lack of kernel messages is unexpected, especially when you have an NFS
process stuck in D state long enough for admins to force a reboot.
That should have produced at least a stuck-task warning if those warnings
are enabled in your kernel.  Did anyone capture the nfsd process stack trace?

(*) "too long" can be surprisingly long.  Some btrfs algorithms don't have
bounded running time and can delay a commit for several hours if there
are active writers on the system.  We log commits that take over 100
seconds, alert admins at one hour, and trigger automatic reboots after
12 hours.


> Thanks

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Unusual crash -- data rolled back ~2 weeks?
  2019-11-10  7:45         ` Qu Wenruo
  2019-11-10  7:48           ` Timothy Pearson
  2019-11-10 10:02           ` Timothy Pearson
@ 2019-11-11 23:28           ` Timothy Pearson
  2019-11-11 23:33             ` Timothy Pearson
  2019-11-12 11:30             ` Chris Murphy
  2 siblings, 2 replies; 14+ messages in thread
From: Timothy Pearson @ 2019-11-11 23:28 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs



----- Original Message -----
> From: "Qu Wenruo" <quwenruo.btrfs@gmx.com>
> To: "Timothy Pearson" <tpearson@raptorengineering.com>
> Cc: "linux-btrfs" <linux-btrfs@vger.kernel.org>
> Sent: Sunday, November 10, 2019 1:45:14 AM
> Subject: Re: Unusual crash -- data rolled back ~2 weeks?

> On 2019/11/10 下午3:18, Timothy Pearson wrote:
>> 
>> 
>> ----- Original Message -----
>>> From: "Qu Wenruo" <quwenruo.btrfs@gmx.com>
>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>> Cc: "linux-btrfs" <linux-btrfs@vger.kernel.org>
>>> Sent: Sunday, November 10, 2019 6:54:55 AM
>>> Subject: Re: Unusual crash -- data rolled back ~2 weeks?
>> 
>>> On 2019/11/10 下午2:47, Timothy Pearson wrote:
>>>>
>>>>
>>>> ----- Original Message -----
>>>>> From: "Qu Wenruo" <quwenruo.btrfs@gmx.com>
>>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>, "linux-btrfs"
>>>>> <linux-btrfs@vger.kernel.org>
>>>>> Sent: Saturday, November 9, 2019 9:38:21 PM
>>>>> Subject: Re: Unusual crash -- data rolled back ~2 weeks?
>>>>
>>>>> On 2019/11/10 上午6:33, Timothy Pearson wrote:
>>>>>> We just experienced a very unusual crash on a Linux 5.3 file server using NFS to
>>>>>> serve a BTRFS filesystem.  NFS went into deadlock (D wait) with no apparent
>>>>>> underlying disk subsystem problems, and when the server was hard rebooted to
>>>>>> clear the D wait the BTRFS filesystem remounted itself in the state that it was
>>>>>> in approximately two weeks earlier (!).
>>>>>
>>>>> This means during two weeks, the btrfs is not committed.
>>>>
>>>> Is there any hope of getting the data from that interval back via btrfs-recover
>>>> or a similar tool, or does the lack of commit mean the data was stored in RAM
>>>> only and is therefore gone after the server reboot?
>>>
>>> If it's deadlock preventing new transaction to be committed, then no
>>> metadata is even written back to disk, so no way to recover metadata.
>>> Maybe you can find some data written, but without metadata it makes no
>>> sense.
>> 
>> OK, I'll just assume the data written in that window is unrecoverable at this
>> point then.
>> 
>> Would the commit deadlock affect only one btrfs filesystem or all of them on the
>> machine?  I take it there is no automatic dmesg spew on extended deadlock?
>> dmesg was completely clean at the time of the fault / reboot.
> 
> It should have some kernel message for things like process hang for over
> 120s.
> If you could recover that, it would help us to locate the cause.
> 
> Normally such deadlock should only affect the unlucky fs which meets the
> condition, not all filesystems.
> But if you're unlucky enough, it may happen to other filesystems.
> 
> Anyway, without enough info, it's really hard to say.
> 
>> 
>>>>
>>>> If the latter, I'm somewhat surprised given the I/O load on the disk array in
>>>> question, but it would also offer a clue as to why it hard locked the
>>>> filesystem eventually (presumably on memory exhaustion -- the server has
>>>> something like 128GB of RAM, so it could go quite a while before hitting the
>>>> physical RAM limits).
>>>>
>>>>>
>>>>>>  There was also significant corruption of certain files (e.g. LDAP MDB and MySQL
>>>>>>  InnoDB) noted -- we restored from backup for those files, but are concerned
>>>>>>  about the status of the entire filesystem at this point.
>>>>>
>>>>> Btrfs check is needed to ensure no metadata corruption.
>>>>>
>>>>> Also, we need sysrq+w output to determine where we are deadlocking.
>>>>> Otherwise, it's really hard to find any clue from the report.
>>>>
>>>> It would have been gathered if we'd known the filesystem was in this bad state.
>>>> At the time, the priority was on restoring service and we had assumed NFS had
>>>> just wedged itself (again).  It was only after reboot and remount that the
>>>> damage slowly came to light.
>>>>
>>>> Do the described symptoms (what little we know of them at this point) line up
>>>> with the issues fixed by https://patchwork.kernel.org/patch/11141559/ ?  Right
>>>> now we're hoping that this particular issue was fixed by that series, but if
>>>> not we might consider increasing backup frequency to nightly for this
>>>> particular array and seeing if it happens again.
>>>
>>> That fix is already in v5.3, thus I don't think that's the case.
>>>
>>> Thanks,
>>> Qu
>> 
>> Looking more carefully, the server in question had been booted on 5.3-rc3
>> somehow.  It's possible that this was because earlier versions were showing
>> driver problems with the other hardware, but somehow this machine was running
>> 5.3-rc3 and the patch was created *after* rc3 release.
> 
> If that's the case, just upgrade the kernel should prevent such problem
> from happening.
> And it's a relief that we don't need to face another deadly deadlock.
> 
> Thanks,
> Qu

Here's the final information we gleaned from the disk image -- that is now being archived and we're moving on from this failure.

It doesn't look like a general commit failure; it looks like specific directories were somehow corrupted / automatically rolled back.  Again I wonder how much of this is due to the online resize; needless to say, we won't be doing that again -- future procedure will be to isolate the existing array, format a new array, transfer the files, and then restart the services.

btrfs-find-root returned the following:

=====
These generations showed the missing files and also contained files from after the crash and restart:
Well block 114904137728(gen: 295060 level: 1) seems good, but generation/level doesn't match, want gen: 294909 level: 1
Well block 114679480320(gen: 295059 level: 1) seems good, but generation/level doesn't match, want gen: 294909 level: 1
Well block 114592710656(gen: 295058 level: 1) seems good, but generation/level doesn't match, want gen: 294909 level: 1
Well block 114092670976(gen: 295057 level: 1) seems good, but generation/level doesn't match, want gen: 294909 level: 1
Well block 114844827648(gen: 295056 level: 1) seems good, but generation/level doesn't match, want gen: 294909 level: 1
Well block 114618925056(gen: 295055 level: 1) seems good, but generation/level doesn't match, want gen: 294909 level: 1
Well block 923598848(gen: 294112 level: 1) seems good, but generation/level doesn't match, want gen: 294909 level: 1
Well block 495386624(gen: 294111 level: 1) seems good, but generation/level doesn't match, want gen: 294909 level: 1

=====
This generation failed to recover any data whatsoever:
Well block 92602368(gen: 294008 level: 1) seems good, but generation/level doesn't match, want gen: 294909 level: 1

=====
Generations below do not show files created after the crash and restart, but the directories that would have contained the ~2 weeks of files are corrupted badly enough that they cannot be recovered.  Lots of "leaf parent key incorrect" on those directories; unknown if this is because of corruption that occurred prior to the crash or if this data was simply overwritten after remount and file restore.

Well block 299955716096(gen: 293446 level: 1) seems good, but generation/level doesn't match, want gen: 294909 level: 1
Well block 299916853248(gen: 293446 level: 1) seems good, but generation/level doesn't match, want gen: 294909 level: 1
Well block 299787747328(gen: 293445 level: 1) seems good, but generation/level doesn't match, want gen: 294909 level: 1

I'm still not fully confident that there isn't an underlying bug of some sort in btrfs, but all we can really do at this point is keep an eye on it and increase the backup frequency.

Thanks!

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Unusual crash -- data rolled back ~2 weeks?
  2019-11-11 23:28           ` Timothy Pearson
@ 2019-11-11 23:33             ` Timothy Pearson
  2019-11-12 11:30             ` Chris Murphy
  1 sibling, 0 replies; 14+ messages in thread
From: Timothy Pearson @ 2019-11-11 23:33 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs



----- Original Message -----
> From: "Timothy Pearson" <tpearson@raptorengineering.com>
> To: "Qu Wenruo" <quwenruo.btrfs@gmx.com>
> Cc: "linux-btrfs" <linux-btrfs@vger.kernel.org>
> Sent: Monday, November 11, 2019 5:28:37 PM
> Subject: Re: Unusual crash -- data rolled back ~2 weeks?

> ----- Original Message -----
>> From: "Qu Wenruo" <quwenruo.btrfs@gmx.com>
>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>> Cc: "linux-btrfs" <linux-btrfs@vger.kernel.org>
>> Sent: Sunday, November 10, 2019 1:45:14 AM
>> Subject: Re: Unusual crash -- data rolled back ~2 weeks?
> 
>> On 2019/11/10 下午3:18, Timothy Pearson wrote:
>>> 
>>> 
>>> ----- Original Message -----
>>>> From: "Qu Wenruo" <quwenruo.btrfs@gmx.com>
>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>>>> Cc: "linux-btrfs" <linux-btrfs@vger.kernel.org>
>>>> Sent: Sunday, November 10, 2019 6:54:55 AM
>>>> Subject: Re: Unusual crash -- data rolled back ~2 weeks?
>>> 
>>>> On 2019/11/10 下午2:47, Timothy Pearson wrote:
>>>>>
>>>>>
>>>>> ----- Original Message -----
>>>>>> From: "Qu Wenruo" <quwenruo.btrfs@gmx.com>
>>>>>> To: "Timothy Pearson" <tpearson@raptorengineering.com>, "linux-btrfs"
>>>>>> <linux-btrfs@vger.kernel.org>
>>>>>> Sent: Saturday, November 9, 2019 9:38:21 PM
>>>>>> Subject: Re: Unusual crash -- data rolled back ~2 weeks?
>>>>>
>>>>>> On 2019/11/10 上午6:33, Timothy Pearson wrote:
>>>>>>> We just experienced a very unusual crash on a Linux 5.3 file server using NFS to
>>>>>>> serve a BTRFS filesystem.  NFS went into deadlock (D wait) with no apparent
>>>>>>> underlying disk subsystem problems, and when the server was hard rebooted to
>>>>>>> clear the D wait the BTRFS filesystem remounted itself in the state that it was
>>>>>>> in approximately two weeks earlier (!).
>>>>>>
>>>>>> This means during two weeks, the btrfs is not committed.
>>>>>
>>>>> Is there any hope of getting the data from that interval back via btrfs-recover
>>>>> or a similar tool, or does the lack of commit mean the data was stored in RAM
>>>>> only and is therefore gone after the server reboot?
>>>>
>>>> If it's deadlock preventing new transaction to be committed, then no
>>>> metadata is even written back to disk, so no way to recover metadata.
>>>> Maybe you can find some data written, but without metadata it makes no
>>>> sense.
>>> 
>>> OK, I'll just assume the data written in that window is unrecoverable at this
>>> point then.
>>> 
>>> Would the commit deadlock affect only one btrfs filesystem or all of them on the
>>> machine?  I take it there is no automatic dmesg spew on extended deadlock?
>>> dmesg was completely clean at the time of the fault / reboot.
>> 
>> It should have some kernel message for things like process hang for over
>> 120s.
>> If you could recover that, it would help us to locate the cause.
>> 
>> Normally such deadlock should only affect the unlucky fs which meets the
>> condition, not all filesystems.
>> But if you're unlucky enough, it may happen to other filesystems.
>> 
>> Anyway, without enough info, it's really hard to say.
>> 
>>> 
>>>>>
>>>>> If the latter, I'm somewhat surprised given the I/O load on the disk array in
>>>>> question, but it would also offer a clue as to why it hard locked the
>>>>> filesystem eventually (presumably on memory exhaustion -- the server has
>>>>> something like 128GB of RAM, so it could go quite a while before hitting the
>>>>> physical RAM limits).
>>>>>
>>>>>>
>>>>>>>  There was also significant corruption of certain files (e.g. LDAP MDB and MySQL
>>>>>>>  InnoDB) noted -- we restored from backup for those files, but are concerned
>>>>>>>  about the status of the entire filesystem at this point.
>>>>>>
>>>>>> Btrfs check is needed to ensure no metadata corruption.
>>>>>>
>>>>>> Also, we need sysrq+w output to determine where we are deadlocking.
>>>>>> Otherwise, it's really hard to find any clue from the report.
>>>>>
>>>>> It would have been gathered if we'd known the filesystem was in this bad state.
>>>>> At the time, the priority was on restoring service and we had assumed NFS had
>>>>> just wedged itself (again).  It was only after reboot and remount that the
>>>>> damage slowly came to light.
>>>>>
>>>>> Do the described symptoms (what little we know of them at this point) line up
>>>>> with the issues fixed by https://patchwork.kernel.org/patch/11141559/ ?  Right
>>>>> now we're hoping that this particular issue was fixed by that series, but if
>>>>> not we might consider increasing backup frequency to nightly for this
>>>>> particular array and seeing if it happens again.
>>>>
>>>> That fix is already in v5.3, thus I don't think that's the case.
>>>>
>>>> Thanks,
>>>> Qu
>>> 
>>> Looking more carefully, the server in question had been booted on 5.3-rc3
>>> somehow.  It's possible that this was because earlier versions were showing
>>> driver problems with the other hardware, but somehow this machine was running
>>> 5.3-rc3 and the patch was created *after* rc3 release.
>> 
>> If that's the case, just upgrade the kernel should prevent such problem
>> from happening.
>> And it's a relief that we don't need to face another deadly deadlock.
>> 
>> Thanks,
>> Qu
> 
> Here's the final information we gleaned from the disk image -- that is now being
> archived and we're moving on from this failure.
> 
> It doesn't look like a general commit failure, it looks like somehow specific
> directories were corrupted / automatically rolled back.  Again I wonder how
> much of this is due to the online resize; needless to say, we won't be doing
> that again -- future procedure will be to isolate the existing array, format a
> new array, transfer files, then restart the services.
> 
> btrfs-find-root returned the following:
> 
> =====
> These generations showed the missing files and also contained files from after
> the crash and restart:
> Well block 114904137728(gen: 295060 level: 1) seems good, but generation/level
> doesn't match, want gen: 294909 level: 1
> Well block 114679480320(gen: 295059 level: 1) seems good, but generation/level
> doesn't match, want gen: 294909 level: 1
> Well block 114592710656(gen: 295058 level: 1) seems good, but generation/level
> doesn't match, want gen: 294909 level: 1
> Well block 114092670976(gen: 295057 level: 1) seems good, but generation/level
> doesn't match, want gen: 294909 level: 1
> Well block 114844827648(gen: 295056 level: 1) seems good, but generation/level
> doesn't match, want gen: 294909 level: 1
> Well block 114618925056(gen: 295055 level: 1) seems good, but generation/level
> doesn't match, want gen: 294909 level: 1
> Well block 923598848(gen: 294112 level: 1) seems good, but generation/level
> doesn't match, want gen: 294909 level: 1
> Well block 495386624(gen: 294111 level: 1) seems good, but generation/level
> doesn't match, want gen: 294909 level: 1
> 
> =====
> This generation failed to recover any data whatsoever:
> Well block 92602368(gen: 294008 level: 1) seems good, but generation/level
> doesn't match, want gen: 294909 level: 1
> 
> =====
> Generations below do not show files created after the crash and restart, but the
> directories that would have contained the ~2 weeks of files are corrupted badly
> enough that they cannot be recovered.  Lots of "leaf parent key incorrect" on
> those directories; unknown if this is because of corruption that occurred prior
> to the crash or if this data was simply overwritten after remount and file
> restore.
> 
> Well block 299955716096(gen: 293446 level: 1) seems good, but generation/level
> doesn't match, want gen: 294909 level: 1
> Well block 299916853248(gen: 293446 level: 1) seems good, but generation/level
> doesn't match, want gen: 294909 level: 1
> Well block 299787747328(gen: 293445 level: 1) seems good, but generation/level
> doesn't match, want gen: 294909 level: 1
> 
> My confidence still isn't great here that we don't have an underlying bug of
> some sort still present in btrfs, but all we can really do is keep an eye on it
> and increase backup frequency at this point.
> 
> Thanks!

For clarity, none of these roots allowed the files to be recovered.  They were simply missing from the latest generations, and the directories that would have contained them in previous generations were too badly corrupted to allow btrfs-recover to function.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Unusual crash -- data rolled back ~2 weeks?
  2019-11-11 23:28           ` Timothy Pearson
  2019-11-11 23:33             ` Timothy Pearson
@ 2019-11-12 11:30             ` Chris Murphy
  1 sibling, 0 replies; 14+ messages in thread
From: Chris Murphy @ 2019-11-12 11:30 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: Qu Wenruo, linux-btrfs

On Mon, Nov 11, 2019 at 11:28 PM Timothy Pearson
<tpearson@raptorengineering.com> wrote:
>
> Here's the final information we gleaned from the disk image -- that is now being archived and we're moving on from this failure.
>
> It doesn't look like a general commit failure, it looks like somehow specific directories were corrupted / automatically rolled back.  Again I wonder how much of this is due to the online resize; needless to say, we won't be doing that again -- future procedure will be to isolate the existing array, format a new array, transfer files, then restart the services.

I'm skeptical of resize being involved, for a couple of reasons:
a) it should have resulted in immediate problems, not days later
b) resize involves the same code as balance and device removal: the
first step is to identify any chunks in physical areas that will no
longer exist after the resize, move those chunks to areas with free
space that will continue to exist, and update all the metadata that
points to them. It's essentially identical to a filtered balance.

Therefore, if there's a bug in resize, there's also a bug in balance
and device removal. And if that's true I think we'd have other people
running into it.
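
For reference, the kind of filtered balance being compared to here would
be invoked with something like the following (the mount point is a
placeholder):

  btrfs balance start -dusage=50 /mnt   # relocate only data chunks that are at most 50% full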



>
> btrfs-find-root returned the following:
>
> =====
> These generations showed the missing files and also contained files from after the crash and restart:
> Well block 114904137728(gen: 295060 level: 1) seems good, but generation/level doesn't match, want gen: 294909 level: 1

It's really suspicious that it wants a LOWER generation number than
what it has. And it's not a huge difference, just 151 generations,
which isn't likely to be weeks. For a system root, that's maybe an hour
or two of time, or a couple of days if the filesystem isn't used that much.


> Well block 114679480320(gen: 295059 level: 1) seems good, but generation/level doesn't match, want gen: 294909 level: 1
> Well block 114592710656(gen: 295058 level: 1) seems good, but generation/level doesn't match, want gen: 294909 level: 1
> Well block 114092670976(gen: 295057 level: 1) seems good, but generation/level doesn't match, want gen: 294909 level: 1
> Well block 114844827648(gen: 295056 level: 1) seems good, but generation/level doesn't match, want gen: 294909 level: 1
> Well block 114618925056(gen: 295055 level: 1) seems good, but generation/level doesn't match, want gen: 294909 level: 1
> Well block 923598848(gen: 294112 level: 1) seems good, but generation/level doesn't match, want gen: 294909 level: 1
> Well block 495386624(gen: 294111 level: 1) seems good, but generation/level doesn't match, want gen: 294909 level: 1
>
> =====
> This generation failed to recover any data whatsoever:
> Well block 92602368(gen: 294008 level: 1) seems good, but generation/level doesn't match, want gen: 294909 level: 1

And that's 901 generations, which could be a day with average use, or
several days with light use.

What are the mount options for this file system?

>
> =====
> Generations below do not show files created after the crash and restart, but the directories that would have contained the ~2 weeks of files are corrupted badly enough that they cannot be recovered.  Lots of "leaf parent key incorrect" on those directories; unknown if this is because of corruption that occurred prior to the crash or if this data was simply overwritten after remount and file restore.
>
> Well block 299955716096(gen: 293446 level: 1) seems good, but generation/level doesn't match, want gen: 294909 level: 1
> Well block 299916853248(gen: 293446 level: 1) seems good, but generation/level doesn't match, want gen: 294909 level: 1
> Well block 299787747328(gen: 293445 level: 1) seems good, but generation/level doesn't match, want gen: 294909 level: 1
>
> My confidence still isn't great here that we don't have an underlying bug of some sort still present in btrfs, but all we can really do is keep an eye on it and increase backup frequency at this point.
>
> Thanks!

There isn't a lot to go on. Have you gone through the logs looking for
non-Btrfs related errors? Like SCSI or libata link resets, or doing a
grep -i 'fail\|error' and so on? Each drive has its own log, exposed
by 'smartctl -x', and it's also useful to know the SCT ERC setting,
via 'smartctl -l scterc', for each drive in the volume. Somewhere,
something got dropped.
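
For example, something along these lines (device names are placeholders):

  journalctl -k | grep -iE 'fail|error|reset'   # scan kernel log, if systemd-journald is in use
  smartctl -x /dev/sda                          # full device statistics and internal error log
  smartctl -l scterc /dev/sda                   # SCT error recovery control setting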


-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2019-11-12 11:30 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-11-09 22:33 Unusual crash -- data rolled back ~2 weeks? Timothy Pearson
2019-11-09 22:48 ` Timothy Pearson
2019-11-10  3:38 ` Qu Wenruo
2019-11-10  6:47   ` Timothy Pearson
2019-11-10  6:54     ` Qu Wenruo
2019-11-10  7:18       ` Timothy Pearson
2019-11-10  7:45         ` Qu Wenruo
2019-11-10  7:48           ` Timothy Pearson
2019-11-10 10:02           ` Timothy Pearson
2019-11-10 20:10             ` Zygo Blaxell
2019-11-11 23:28           ` Timothy Pearson
2019-11-11 23:33             ` Timothy Pearson
2019-11-12 11:30             ` Chris Murphy
2019-11-10  8:04         ` Andrei Borzenkov
