On 2019/12/21 上午12:53, Marc Lehmann wrote: > On Fri, Dec 20, 2019 at 09:41:15PM +0800, Qu Wenruo wrote: >> BTW, that chunk number is very small, and since it has 0 tolerance, it >> looks like to be SINGLE chunk. >> >> In that case, it looks like a temporary chunk from older mkfs, and it >> should contain no data/metadata at all, thus brings no data loss. > > Well, there indeed should not have been any data or metadata left as the > btrfs dev del succeeded after lengthy copying. > >> BTW, "btrfs ins dump-tree -t chunk " would help a lot. >> That would directly tell us if the devid 1 device is in chunk tree. > > Apologies if I wasn't too clear about it - I already had to mkfs and > redo the filesystem. I understand that makes tracking this down hard or > impossible, but I did need that machine and filesystem. > >>> And if you want to hear more "insane" things, after I hard-reset >>> my desktop machine (5.2.21) two days ago I had to "btrfs rescue >>> fix-device-size" to be able to mount (can't find the kernel error atm.). >> >> Consider all these insane things, I tend to believe there is some >> FUA/FLUSH related hardware problem. > > Please don't - I honestly think btrfs developers are way to fast to blame > hardware for problems. I currently lose btrfs filesystems about once every > 6 months, and other than the occasional user error, it's always the kernel > (e.g. 4.11 corrupting data, dmcache and/or bcache corrupting things, > low-memory situations etc. - none of these seem to be centric to btrfs, > but none of those are hardware errors either). I know its the kernel in > most cases because in those cases, I can identify the fix in a later > kernel, or the mitigating circumstances don't appear (e.g. freezes). > > In any case if it is a hardware problem, then linux and/or btrfs has > to work around them, because it affects many different controllers on > different boards: > > - dell perc h740 on "doom" and "cerebro" > - intel series 9 controller on "doom'" and "cerebro". > - samsung nvme controller on "yoyo" and "yuna". > - marvell sata controller on "doom". > > Just while I was writing this mail, on 5.4.5, the _newly created_ btrfs > filesystem I restored to went into readonly mode with ENOSPC. Another > hardware problem? > > [41801.618772] ------------[ cut here ]------------ > [41801.618776] BTRFS: Transaction aborted (error -28) According to your later replies, this bug turns out to be a problem in over-commit calculation. It doesn't really take disk requirement into consideration, thus can't handle cases like 3 disks RAID1 with 2 full disks. Now it acts just like we're using DUP profiles, thus causing the problem. To Josef, any idea to fix it? I guess we could go the complex statfs() way to do a calculation on how many bytes can really be allocated. Or hugely reduce the over-commit threshold? Thanks, Qu > [41801.618843] WARNING: CPU: 2 PID: 5713 at fs/btrfs/inode.c:3159 btrfs_finish_ordered_io+0x730/0x820 [btrfs] > [41801.618844] Modules linked in: nfsv3 nfs fscache nvidia_modeset(POE) nvidia(POE) btusb algif_skcipher af_alg dm_crypt nfsd auth_rpcgss nfs_acl lockd grace cls_fw sch_htb sit tunnel4 ip_tunnel hidp act_police cls_u32 sch_ingress sch_tbf 8021q garp mrp stp llc ip6t_REJECT nf_reject_ipv6 xt_CT xt_MASQUERADE xt_nat xt_REDIRECT nft_chain_nat nf_nat xt_owner xt_TCPMSS xt_DSCP xt_mark nf_log_ipv4 nf_log_common xt_LOG xt_state xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 xt_length xt_mac xt_tcpudp nft_compat nft_counter nf_tables xfrm_user xfrm_algo nfnetlink cmac uhid bnep tda10021 snd_hda_codec_hdmi binfmt_misc nls_iso8859_1 intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal intel_powerclamp kvm_intel kvm irqbypass tda827x tda10023 crct10dif_pclmul mei_hdcp crc32_pclmul btrtl btbcm rc_tt_1500 ghash_clmulni_intel snd_emu10k1 btintel snd_util_mem snd_ac97_codec aesni_intel bluetooth snd_hda_intel budget_av snd_rawmidi snd_intel_nhlt crypto_simd saa7146_vv > [41801.618864] snd_hda_codec videobuf_dma_sg budget_ci videobuf_core snd_seq_device budget_core cryptd ttpci_eeprom glue_helper snd_hda_core saa7146 dvb_core intel_cstate ac97_bus snd_hwdep rc_core snd_pcm intel_rapl_perf mxm_wmi cdc_acm pcspkr videodev snd_timer ecdh_generic snd emu10k1_gp ecc mc gameport soundcore mei_me mei mac_hid acpi_pad tcp_bbr drm_kms_helper drm fb_sys_fops syscopyarea sysfillrect sysimgblt ipmi_devintf ipmi_msghandler hid_generic usbhid hid usbkbd coretemp nct6775 hwmon_vid sunrpc parport_pc ppdev lp parport msr ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq raid1 raid0 multipath linear dm_cache_smq dm_cache dm_persistent_data dm_bio_prison dm_bufio libcrc32c ahci megaraid_sas i2c_i801 libahci lpc_ich r8169 realtek wmi video [last unloaded: nvidia] > [41801.618887] CPU: 2 PID: 5713 Comm: kworker/u8:15 Tainted: P OE 5.4.5-050405-generic #201912181630 > [41801.618888] Hardware name: MSI MS-7816/Z97-G43 (MS-7816), BIOS V17.8 12/24/2014 > [41801.618903] Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs] > [41801.618916] RIP: 0010:btrfs_finish_ordered_io+0x730/0x820 [btrfs] > [41801.618917] Code: 49 8b 46 50 f0 48 0f ba a8 40 ce 00 00 02 72 1c 8b 45 b0 83 f8 fb 0f 84 d4 00 00 00 89 c6 48 c7 c7 48 33 62 c0 e8 eb 9c 91 d5 <0f> 0b 8b 4d b0 ba 57 0c 00 00 48 c7 c6 40 67 61 c0 4c 89 f7 bb 01 > [41801.618918] RSP: 0018:ffffc18b40edfd80 EFLAGS: 00010282 > [41801.618921] BTRFS: error (device dm-35) in btrfs_finish_ordered_io:3159: errno=-28 No space left > [41801.618922] RAX: 0000000000000000 RBX: ffff9f8b7b2e3800 RCX: 0000000000000006 > [41801.618922] BTRFS info (device dm-35): forced readonly > [41801.618924] RDX: 0000000000000007 RSI: 0000000000000092 RDI: ffff9f8bbeb17440 > [41801.618924] RBP: ffffc18b40edfdf8 R08: 00000000000005a6 R09: ffffffff979a4d90 > [41801.618925] R10: ffffffff97983fa8 R11: ffffc18b40edfbe8 R12: ffff9f8ad8b4ab60 > [41801.618926] R13: ffff9f867ddb53c0 R14: ffff9f8bbb0446e8 R15: 0000000000000000 > [41801.618927] FS: 0000000000000000(0000) GS:ffff9f8bbeb00000(0000) knlGS:0000000000000000 > [41801.618928] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [41801.618929] CR2: 00007f8ab728fc30 CR3: 000000049080a002 CR4: 00000000001606e0 > [41801.618930] Call Trace: > [41801.618943] finish_ordered_fn+0x15/0x20 [btrfs] > [41801.618957] normal_work_helper+0xbd/0x2f0 [btrfs] > [41801.618959] ? __schedule+0x2eb/0x740 > [41801.618973] btrfs_endio_write_helper+0x12/0x20 [btrfs] > [41801.618975] process_one_work+0x1ec/0x3a0 > [41801.618977] worker_thread+0x4d/0x400 > [41801.618979] kthread+0x104/0x140 > [41801.618980] ? process_one_work+0x3a0/0x3a0 > [41801.618982] ? kthread_park+0x90/0x90 > [41801.618984] ret_from_fork+0x1f/0x40 > [41801.618985] ---[ end trace 35086266bf39c897 ]--- > [41801.618987] BTRFS: error (device dm-35) in btrfs_finish_ordered_io:3159: errno=-28 No space left > > unmount/remount seems to make it work again, and it is full (df) yet has > 3TB of unallocated space left. No clue what to do now, do I have to start > over restoring again? > > Filesystem Size Used Avail Use% Mounted on > /dev/mapper/xmnt-cold15 27T 23T 0 100% /cold1 > > Overall: > Device size: 24216.49GiB > Device allocated: 20894.89GiB > Device unallocated: 3321.60GiB > Device missing: 0.00GiB > Used: 20893.68GiB > Free (estimated): 3322.73GiB (min: 1661.93GiB) > Data ratio: 1.00 > Metadata ratio: 2.00 > Global reserve: 0.50GiB (used: 0.00GiB) > > Data,single: Size:20839.01GiB, Used:20837.88GiB (99.99%) > /dev/mapper/xmnt-cold15 9288.01GiB > /dev/mapper/xmnt-cold12 7427.00GiB > /dev/mapper/xmnt-cold13 4124.00GiB > > Metadata,RAID1: Size:27.91GiB, Used:27.90GiB (99.97%) > /dev/mapper/xmnt-cold15 25.44GiB > /dev/mapper/xmnt-cold12 24.46GiB > /dev/mapper/xmnt-cold13 5.91GiB > > System,RAID1: Size:0.03GiB, Used:0.00GiB (6.69%) > /dev/mapper/xmnt-cold15 0.03GiB > /dev/mapper/xmnt-cold12 0.03GiB > > Unallocated: > /dev/mapper/xmnt-cold15 0.01GiB > /dev/mapper/xmnt-cold12 0.00GiB > /dev/mapper/xmnt-cold13 3321.59GiB > > Please, don't always chalk it up to hardware problems - btrfs is a > wonderful filesystem for many reasons, one reason I like is that it can > detect corruption much earlier than other filesystems. This featire alone > makes it impossible for me to go back to xfs. However, I had corruption > on ext4, xfs, reiserfs over the years, but btrfs *is* simply way buggier > still than those - before btrfs (and even now) I kept md5sums of all > archived files (~200TB), and xfs and ext4 _do_ a much better job at not > corrupting data than btrfs on the same hardware - while I get filesystem > problems about every half a year with btrfs, I had (silent) corruption > problems maybe once every three to four years with xfs or ext4 (and not > yet on the bxoes I use currently). > > Please take these issues seriously - the trend of "it's a hardware > problem" will not remove the "unstable" stigma from btrfs as long as btrfs > is clearly more buggy then other filesystems. > > Sorry to be so blunt, but I am a bit sensitive with always being told > "it's probably a hardware problem" when it clearly affects practically any > server and any laptop I administrate. I believe in btrfs, and detecting > corruption early is a feature to me. > > I understand it can be frustrating to be confronted with hard to explain > accidents, and I understand if you can't find the bug with the sparse info > I gave, especially as the bug might not even be in btrfs. But keep in mind > that the people who boldly/dumbly use btrfs in production and restore > dozens of terabytes from backup every so and so many months are also being > frustrated if they present evidence from multiple machines and get told > "its probably a hardware problem". >