On 2020/2/12 10:57 PM, David Sterba wrote:
> On Wed, Feb 12, 2020 at 08:11:56PM +0800, Qu Wenruo wrote:
>>>>>
>>>>> This looks like an existing bug, IIRC Zygo reported it before.
>>>>>
>>>>> Btrfs balance just randomly failed at data reloc tree.
>>>>>
>>>>> Thus I don't believe it's related to Ethan's patches.
>>>>
>>>> Ok, then the patches make it more likely to happen, which could mean
>>>> that faster backref processing hits some race window. As there could be
>>>> more we should first fix the bug you say Zygo reported.
>>>
>>> I added a log to check if find_parent_nodes is ever called under
>>> test btrfs/125. It turns out that btrfs/125 doesn't pass through the
>>> function. What my patches do is all under find_parent_nodes.
>>
>> Balance goes through its own backref cache, thus it doesn't utilize the
>> path you're modifying.
>>
>> So don't worry, your patches look pretty good.
>>
>> Furthermore, this csum mismatch is not related to backref walk, but the
>> data csum and the data in data reloc tree, which are all created by balance.
>>
>> So there is really no reason to block such good optimization.
>
> I don't mean to block the patchset but when I test patchsets from 5
> people and tests start to fail I need to know what's the cause and if
> there's a fix in sight. So far the test failed 2 out of 2 (once the
> branch itself and then with for-next), I can do more rounds but at this
> point it's too reliable to reproduce so there is some connection.
>
> Sometimes it looks like I blame the messenger and complain under
> patches that don't cause the bugs, but often I don't have anything better
> than new warnings between 2 test rounds. Once we have more eyes on the
> problem we'll narrow it down and find the root cause.
>

BTW, from your initial report, the csum looks pretty long.
Are you testing with those new csum algos?

And could that be the reason why it's much easier to reproduce?

Thanks,
Qu