On 17.02.2021 23:37, Olga Kornievskaia wrote:
> On Tue, Feb 16, 2021 at 5:27 PM Timo Rothenpieler <timo@rothenpieler.org> wrote:
>>
>> On 16.02.2021 21:37, Timo Rothenpieler wrote:
>>> I can't get a network (I assume just TCP/20049 is fine, and not also
>>> some RDMA trace?) right now, but I will once a user has finished their
>>> work on the machine.
>>
>> There wasn't any TCP traffic to dump on the NFSoRDMA Port, probably
>> because everything is handled via RDMA/IB.
> 
> Yeah, I'm not sure if tcpdump can snoop on the IB traffic. I know that
> upstream tcpdump can snoop on RDMA mellanox card (but I only know
> about the Roce mode).

I managed to get https://github.com/Mellanox/ibdump working. Attached is 
what it records when I run the xfs_io copy_range command that gets 
stuck (sniffer.pcap).
Additionally, I rebooted the client machine and captured the traffic 
while it performs a successful copy during the first few minutes of 
uptime (sniffer2.pcap).

Both of those commands were run on the same 500M file.

>> But I recorded a trace log of rpcrdma and sunrpc observing the situation.
>>
>> To me it looks like the COPY task (task:15886@7) completes successfully?
>> The compressed trace.dat is attached.
> 
> I'm having a hard time reproducing the problem. But I only tried
> "xfs", "btrfs", "ext4" (first two send a CLONE since the file system
> supports it), the last one exercises a copy. In all my tries your

I can also reproduce this on a test NFS share exported from an ext4 
filesystem. I have not tested xfs yet.

> xfs_io commands succeed. The differences between our environments are
> (1) ZFS vs (xfs, etc) and (2) IB vs RoCE. Question is: does any
> copy_file_range work over RDMA/IB. One thing to try a synchronous

It works, on any size of file, when the client machine is freshly booted 
(within its first 10~30 minutes of uptime).

> copy: create a small file 10bytes and do a copy. Is this the case
> where we have copy and the callback racing, so instead do a really
> large copy: create a >=1GB file and do a copy. that will be an async
> copy but will not have a racy condition. Can you try those 2 examples
> for me?

I have observed in the past that the xfs_io copy is more likely to 
succeed the smaller the file is, though I could not make out a definite 
pattern.

I did some bisecting on the number of bytes and came up with the following:
A 2097153-byte file gets stuck, while a 2097152-byte (= 2^21) one 
still works.

It's been stable at that cutoff point for a while now, so I think that's 
actually the point where it starts happening, and the different behaviour 
I saw in the past was due to an issue in my testing.
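
For anyone who wants to repeat the size bisection, here is a rough sketch 
of the search in Python. It is not the exact procedure I used: 
smallest_failing_size and the copy_succeeds callback are hypothetical 
names, and the sketch assumes that every size at or below some cutoff 
copies fine and every size above it gets stuck. The callback would have 
to create a file of the given size on the NFS mount, run the xfs_io 
copy, and report whether it finished within some timeout.

```python
def smallest_failing_size(copy_succeeds, lo, hi):
    """Binary-search the smallest file size (in bytes) whose copy fails.

    copy_succeeds is a hypothetical callback taking a size and returning
    True if the copy completed, False if it got stuck or failed.
    Assumes copy_succeeds(lo) is True, copy_succeeds(hi) is False, and
    that there is a single cutoff between them.
    """
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if copy_succeeds(mid):
            lo = mid   # mid still works, the cutoff is above it
        else:
            hi = mid   # mid gets stuck, the cutoff is at or below it
    return hi


if __name__ == "__main__":
    # Stand-in predicate that mimics the cutoff reported in this mail;
    # a real run would perform an actual copy on the NFS mount instead.
    def fake_copy(size):
        return size <= 2 ** 21

    print(smallest_failing_size(fake_copy, 0, 500 * 1024 * 1024))
```

With a 500M upper bound this needs fewer than 30 copy attempts. Note 
that a stuck copy may leave the client in a bad state, so the results 
are only meaningful if each attempt starts from a comparable state.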

> Not sure how useful tracepoints here are. The results of the COPY
> isn't interesting as this is an async copy. The server should have
> sent a CB_COMPOUND with the copy's results. The process stack tells me
> that COPY is waiting for the results (waiting for the callback). So
> the question is there a problem of sending a callback over RDMA/IB? Or
> did the client receive it and missed it somehow? We really do need
> some better tracepoints in the copy (but we don't have them
> currently).
> 
> Would you be willing to install the upstream libpcap/tcpdump to see if
> it can capture RDMA/IB traffic or perhaps Chunk knows that it doesn't
> work for sure?

Managed to get ibdump working, as stated above.