Hi Stefano,

Thanks again for helping us to find the root cause of the issue.

> On 20 Apr 2022, at 3:36 am, Stefano Stabellini <sstabellini@kernel.org> wrote:
>
>>> Then there is xen_swiotlb_init() which allocates some memory for
>>> swiotlb-xen at boot. It could lower the total amount of memory
>>> available, but if you disabled swiotlb-xen like I suggested,
>>> xen_swiotlb_init() still should get called and executed anyway at boot
>>> (it is called from arch/arm/xen/mm.c:xen_mm_init). So xen_swiotlb_init()
>>> shouldn't be the one causing problems.
>>>
>>> That's it -- there is nothing else in swiotlb-xen that I can think of.
>>>
>>> I don't have any good ideas, so I would only suggest to add more printks
>>> and report the results, for instance:
>>
>> As suggested, I added more printks, but the only difference I see is the
>> size; apart from that, everything looks the same.
>>
>> Please find the attached logs for xen and native linux boot.
>
> One difference is that the order of the allocations is significantly
> different after the first 3 allocations. It is very unlikely but
> possible that this is an unrelated concurrency bug that only occurs on
> Xen. I doubt it.

I am not sure, but just to confirm with you: I see the logs below in every scenario.
The SWIOTLB memory is allocated by the Linux swiotlb and then reused by swiotlb-xen.
Is that okay, or could it cause an issue?

[    0.000000] mem auto-init: stack:off, heap alloc:off, heap free:off
[    0.000000] software IO TLB: mapped [mem 0x00000000f4000000-0x00000000f8000000] (64MB)

Snippet from int __ref xen_swiotlb_init(int verbose, bool early):

	/*
	 * IO TLB memory already allocated. Just use it.
	 */
	if (io_tlb_start != 0) {
		xen_io_tlb_start = phys_to_virt(io_tlb_start);
		goto end;
	}
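
If it helps, I can add a debug printk in that branch to confirm it is
taken on every Xen boot. A minimal sketch (debug only, assuming this
kernel's io_tlb_start is a phys_addr_t as in mainline):

	if (io_tlb_start != 0) {
		/* debug only: confirm swiotlb-xen reuses the Linux swiotlb buffer */
		pr_info("xen_swiotlb_init: reusing swiotlb at %pa (early=%d)\n",
			&io_tlb_start, early);
		xen_io_tlb_start = phys_to_virt(io_tlb_start);
		goto end;
	}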


>
> I think you could try booting native and Xen with only 1 CPU enabled in
> both cases.
>
> For native, you can do that with maxcpus, e.g. maxcpus=1.
> For Xen, you can do that with dom0_max_vcpus=1. I don't think we need to
> reduce the number of pCPUs seen by Xen, but it could be useful to pass
> sched=null to avoid any scheduler effects. This is just for debugging of
> course.
>

I tried booting Xen with "dom0_max_vcpus=1" and "sched=null", and the
issue remains.
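
For reference, I pass the options through the Xen command line. A sketch
of the device tree chosen node we use with u-boot (the arguments other
than these two flags depend on the board, so this is illustrative only):

	chosen {
		xen,xen-bootargs = "console=dtuart dom0_max_vcpus=1 sched=null";
	};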

>
> In reality, the most likely explanation is that the issue is a memory
> corruption. Something somewhere is corrupting Linux memory and it just
> happens that we see it when calling dma_direct_alloc. This means it is
> going to be difficult to find as the only real clue is that it is
> swiotlb-xen that is causing it.

Agreed, we only observe the issue with the swiotlb-xen DMA ops.
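
In case it helps narrow down the corruption, I can also try booting dom0
with some generic memory-debugging options enabled. A sketch, assuming
the kernel config supports them on this platform:

	# kernel config: poison/red-zone slab allocations to catch stray writes
	CONFIG_SLUB_DEBUG=y
	CONFIG_DEBUG_PAGEALLOC=y

	# or, without a config change, via the dom0 command line
	slub_debug=FZP
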
>
>
> I added more printks with the goal of detecting swiotlb-xen code paths
> that shouldn't be taken in a normal dom0 boot without domUs. For
> instance, range_straddles_page_boundary should always return zero and
> the dma_mask check in xen_swiotlb_alloc_coherent should always succeed.
>
> Fingers crossed we'll notice that the wrong path is taken just before
> the crash.

Please find the attached logs.

I captured the logs for Xen with and without "dom0_max_vcpus=1 & sched=null",
and also for native Linux with and without "maxcpus=1".

Regards,
Rahul