Output of the commands is attached.

The broken-sector theory sounds plausible and is consistent with my new
findings:
I suspected the problem to be in one specific directory, let's call it 
"broken_dir".
I created a new subvolume and copied broken_dir over.
- If I copied it with cp --reflink, made a snapshot, and tried to
btrfs-send that, it hung.
- If I rsynced broken_dir over, I could snapshot and btrfs-send without a
problem.

But shouldn't btrfs scrub or check find such errors?


On 9/6/18 8:16 PM, Chris Murphy wrote:
> OK you've got a different problem.
>
> [  186.898756] sd 2:0:0:0: [sdb] tag#0 FAILED Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK
> [  186.898762] sd 2:0:0:0: [sdb] tag#0 CDB: Read(10) 28 00 15 26 a0 d0 00 08 00 00
> [  186.898764] print_req_error: I/O error, dev sdb, sector 354853072
> [  187.109641] usb 2-1: reset SuperSpeed Gen 1 USB device number 2 using xhci_hcd
> [  187.345245] usb 2-1: reset SuperSpeed Gen 1 USB device number 2 using xhci_hcd
> [  187.657844] usb 2-1: reset SuperSpeed Gen 1 USB device number 2 using xhci_hcd
> [  187.851336] usb 2-1: reset SuperSpeed Gen 1 USB device number 2 using xhci_hcd
> [  188.026882] usb 2-1: reset SuperSpeed Gen 1 USB device number 2 using xhci_hcd
> [  188.215881] usb 2-1: reset SuperSpeed Gen 1 USB device number 2 using xhci_hcd
> [  188.247028] sd 2:0:0:0: [sdb] tag#0 FAILED Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK
> [  188.247041] sd 2:0:0:0: [sdb] tag#0 CDB: Read(10) 28 00 15 26 a8 d0 00 08 00 00
> [  188.247048] print_req_error: I/O error, dev sdb, sector 354855120
>
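
As a cross-check, the Read(10) CDBs in the log can be decoded by hand.
In a Read(10) CDB, byte 0 is the opcode (0x28), bytes 2-5 hold the
big-endian LBA, and bytes 7-8 hold the transfer length in logical
blocks. A minimal sketch in shell; the function name decode_read10 is
made up for illustration and is not an existing tool:

```shell
#!/bin/sh
# Decode a SCSI Read(10) CDB given as ten hex bytes, as printed in dmesg.
# decode_read10 is a hypothetical helper name, not an existing tool.
decode_read10() {
    if [ "$#" -ne 10 ] || [ "$1" != "28" ]; then
        echo "not a Read(10) CDB" >&2
        return 1
    fi
    # Bytes 2-5 (0-based): logical block address, big-endian.
    lba=$(( 0x$3$4$5$6 ))
    # Bytes 7-8 (0-based): transfer length in logical blocks, big-endian.
    blocks=$(( 0x$8$9 ))
    echo "lba=$lba blocks=$blocks"
}

decode_read10 28 00 15 26 a0 d0 00 08 00 00
decode_read10 28 00 15 26 a8 d0 00 08 00 00
```

This prints lba=354853072 and lba=354855120, each with blocks=2048. The
LBAs match the sector numbers in the two print_req_error lines, which
suggests the enclosure presents 512-byte logical blocks, and the second
request starts exactly 2048 blocks (1 MiB) after the first.
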
>
> This is a read error for a specific sector, so your drive has media
> problems. I think that's the instigating problem here: a bunch of
> other tasks depend on one or more reads completing, and those reads
> never do. But weirdly there also isn't any kind of libata reset. At
> least on SATA, by default we see a link reset after a command has not
> returned in 30 seconds. That reset would totally clear the drive's
> command queue, and then things either can recover or barf. But in your
> case, neither happens and it just sits there with hung tasks.
>
> [  189.350360] BTRFS error (device sdb1): bdev /dev/sdb1 errs: wr 0, rd 2, flush 0, corrupt 0, gen 0
>
> And that's the last we really see from Btrfs. After that, it's all
> just hung task traces, which are rather unsurprising to me.
>
> Drives in USB cases add a whole bunch of complicating factors for
> troubleshooting and repair, often including masking of the actual
> logical and physical sector size, the min and max IO size, alignment
> offset, and all kinds of things. They can have all sorts of bugs. And I'm also
> not totally certain about the relationship between the usb reset
> messages and the bad sector. As far as I know the only way we can get
> a sector LBA expressly noted in dmesg along with the failed read(10)
> command, is if the drive has reported back to libata that discrete
> error with sense information. So I'm accepting that as a reliable
> error, rather than it being something like a cable. But the reset
> messages could possibly be something else in addition to that.
>
> Anyway, the central issue is sector 354855120 is having problems. I
> can't tell from the trace if it's transient or persistent. Maybe if
> it's transient, that would explain how you sometimes get send to start
> working again briefly but then it reverts to hanging. What do you get
> for:
>
> fdisk -l /dev/sdb
> smartctl -x /dev/sdb
> smartctl -l sct erc /dev/sdb
>
> Those are all read only commands, nothing is written or changed.
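
One way to tell a transient failure from a persistent one might be to
re-read just the range of the failed request, bypassing the page cache.
A hedged sketch, assuming GNU dd (for iflag=direct), that the drive is
still /dev/sdb, and that the request covered 2048 blocks (the 08 00
transfer length in the CDB). Sector numbers in kernel log messages are
in 512-byte units, hence bs=512. The script only prints the command, so
nothing touches the drive until the printed line is run as root:

```shell
#!/bin/sh
# Sketch: print a read-only dd command that re-reads the failed range.
dev=/dev/sdb       # assumption: same device name as in the log
start=354855120    # sector from the second print_req_error line
count=2048         # assumption: transfer length from the CDB (0x0800 blocks)

# reread_cmd is a hypothetical helper; it only echoes the command line.
reread_cmd() {
    echo "dd if=$dev of=/dev/null bs=512 skip=$start count=$count iflag=direct"
}

reread_cmd
```

If the printed command fails with an I/O error on every run, the problem
is probably persistent; if it sometimes succeeds, it looks transient.
Note that the read itself may trigger the same USB resets and hang.
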