Re: [PATCH v5 15/16] x86: make Xen early boot code relocatable

From: "Jan Beulich" <JBeulich@suse.com>
To: Daniel Kiper <daniel.kiper@oracle.com>
Cc: Juergen Gross <JGross@suse.com>,
	sstabellini@kernel.org, andrew.cooper3@citrix.com,
	cardoe@cardoe.com, pgnet.dev@gmail.com, ning.sun@intel.com,
	david.vrabel@citrix.com, xen-devel@lists.xenproject.org,
	qiaowei.ren@intel.com, gang.wei@intel.com, fu.wei@linaro.org
Subject: Re: [PATCH v5 15/16] x86: make Xen early boot code relocatable
Date: Thu, 25 Aug 2016 08:41:39 -0600	[thread overview]
Message-ID: <57BF1FC3020000780010915B@prv-mh.provo.novell.com> (raw)
In-Reply-To: <1471646606-28519-16-git-send-email-daniel.kiper@oracle.com>

>>> On 20.08.16 at 00:43, <daniel.kiper@oracle.com> wrote:
> Every multiboot protocol (regardless of version) compatible image must
> specify its load address (in ELF or multiboot header). Multiboot protocol
> compatible loader have to load image at specified address. However, there
> is no guarantee that the requested memory region (in case of Xen it starts
> at 1 MiB and ends at 17 MiB) where image should be loaded initially is a RAM
> and it is free (legacy BIOS platforms are merciful for Xen but I found at
> least one EFI platform on which Xen load address conflicts with EFI boot
> services; it is Dell PowerEdge R820 with latest firmware). To cope with that
> problem we must make Xen early boot code relocatable and help boot loader to
> relocate image in proper way by suggesting, not requesting specific load
> addresses as it is right now, allowed address ranges. This patch does former.
> It does not add multiboot2 protocol interface which is done in "x86: add
> multiboot2 protocol support for relocatable images" patch.
> 
> This patch changes following things:
>   - default load address is changed from 1 MiB to 2 MiB; I did that because
>     initial page tables are using 2 MiB huge pages and this way required
>     updates for them are quite easy; it means that e.g. we avoid spacial
>     cases for start and end of required memory region if it live at address
>     not aligned to 2 MiB,

Should this be a separate change then, to separate possible
regressions due to that from such due to any other of the changes
here? And then I don't see why, when making the image relocatable
anyway, the link time load address actually still matters.

>   - %esi and %r15d registers are used as a storage for Xen image load base
>     address (%r15d shortly because %rsi is used for EFI SystemTable address
>     in 64-bit code); both registers are (%esi is mostly) unused in early
>     boot code and preserved during C functions calls,

In a context where you (also) talk about 64-bit code, %esi can't
be called preserved unilaterally. You should make clear that this is
for 32-bit function calls.

> --- a/xen/arch/x86/boot/head.S
> +++ b/xen/arch/x86/boot/head.S
> @@ -12,13 +12,16 @@
>          .text
>          .code32
>  
> -#define sym_phys(sym)     ((sym) - __XEN_VIRT_START)
> +#define sym_offset(sym)   ((sym) - __XEN_VIRT_START)

In a patch already large and hence hard to review, did you really
need to do this rename?

> @@ -126,26 +130,26 @@ vga_text_buffer:
>          .section .init.text, "ax", @progbits
>  
>  bad_cpu:
> -        mov     $(sym_phys(.Lbad_cpu_msg)),%esi # Error message
> +        lea     esi_offset(.Lbad_cpu_msg),%esi  # Error message

Why not just

        add     $sym_offset(.Lbad_cpu_msg),%esi  # Error message

? Together with not doing said rename, this would be more obviously
correct.

> @@ -409,36 +436,93 @@ trampoline_bios_setup:
>          cmovb   %edx,%ecx           /* and use the smaller */
>  
>  trampoline_setup:
> +        /*
> +         * Called on legacy BIOS and EFI platforms.
> +         *
> +         * Compute 0-15 bits of BOOT_FS segment descriptor base address.
> +         */
> +        mov     %esi,%edx
> +        shl     $16,%edx
> +        or      %edx,BOOT_FS+esi_offset(trampoline_gdt)

	mov   %si,BOOT_FS+esi_offset(trampoline_gdt)

> +        /* Compute 16-23 bits of BOOT_FS segment descriptor base address. */
> +        mov     %esi,%edx
> +        shr     $16,%edx
> +        and     $0x000000ff,%edx
> +        or      %edx,BOOT_FS+4+esi_offset(trampoline_gdt)

	mov   %dl,BOOT_FS+4+esi_offset(trampoline_gdt)

> +        /* Compute 24-31 bits of BOOT_FS segment descriptor base address. */
> +        mov     %esi,%edx
> +        and     $0xff000000,%edx
> +        or      %edx,BOOT_FS+4+esi_offset(trampoline_gdt)

	mov   %dh,BOOT_FS+7+esi_offset(trampoline_gdt)

(with various of the intermediate instructions dropped)

That'll reduce you code size concern for the GDT setup quite a bit.

> +        /*
> +         * Initialise %fs and later use it to access Xen data if possible.
> +         * According to Intel 64 and IA-32 Architectures Software Developer’s
> +         * Manual it is safe to do that without reloading GDTR before.
> +         *
> +         * Please check Intel 64 and IA-32 Architectures Software Developer’s
> +         * Manual, Volume 2 (2A, 2B & 2C): Instruction Set Reference,
> +         * LGDT and MOV instructions description and
> +         * Intel 64 and IA-32 Architectures Software Developer’s
> +         * Manual Volume 3 (3A, 3B & 3C): System Programming Guide,
> +         * section 3.4.3, Segment Registers for more details.
> +         *
> +         * AIUI, only GDT address and limit are loaded into GDTR when
> +         * lgdt is executed. Segment descriptor is loaded directly from
> +         * memory into segment register (hiden part) only when relevant
> +         * load instruction is used (e.g. mov %edx,%fs). Though GDT content
> +         * probably could be stored in CPU cache but nothing suggest that
> +         * CPU caching interfere in one way or another with segment descriptor
> +         * load. So, it looks that every change in active GDT is immediately
> +         * available for relevant segment descriptor load instruction.
> +         *
> +         * I was not able to find anything which invalidates above.
> +         * So, everything suggest that we do not need an extra lgdt here.
> +         */

All of this comment except for the first sentence is just stating basic
architecture facts. Please drop all that.

> +        mov     $BOOT_FS,%edx
> +        mov     %edx,%fs
> +
>          /* Reserve 64kb for the trampoline. */
>          sub     $0x1000,%ecx
>  
>          /* From arch/x86/smpboot.c: start_eip had better be page-aligned! */
>          xor     %cl, %cl
>          shl     $4, %ecx
> -        mov     %ecx,sym_phys(trampoline_phys)
> +        mov     %ecx,fs_offset(trampoline_phys)

Seeing the first instance, together with the earlier comment about
not renaming sym_phys() I think this would end up more consistently
with using sym_fs() (and then obviously sym_esi()) as names.

> @@ -461,62 +545,88 @@ trampoline_setup:
>  
>          /* Stash TSC to calculate a good approximation of time-since-boot */
>          rdtsc
> -        mov     %eax,sym_phys(boot_tsc_stamp)
> -        mov     %edx,sym_phys(boot_tsc_stamp+4)
> +        mov     %eax,fs_offset(boot_tsc_stamp)
> +        mov     %edx,fs_offset(boot_tsc_stamp)+4
> +
> +        /* Update frame addresses in page tables. */
> +        mov     $((__page_tables_end-__page_tables_start)/8),%ecx
> +1:      testl   $_PAGE_PRESENT,fs_offset(__page_tables_start)-8(,%ecx,8)
> +        jz      2f
> +        add     %esi,fs_offset(__page_tables_start)-8(,%ecx,8)
> +2:      loop    1b

This loop includes the mapping of the low Mb, which I don't think it
should modify. Or wait - you move __page_tables_start, which by
itself is fragile (but looks to be acceptable if accompanied by a
comment explaining why it doesn't cover l1_identmap).

Also, just to double check - all these page table setup adjustments
don't require reflecting in xen.efi's page table setup code?

>          /* Apply relocations to bootstrap trampoline. */
> -        mov     sym_phys(trampoline_phys),%edx
> -        mov     $sym_phys(__trampoline_rel_start),%edi
> +        mov     fs_offset(trampoline_phys),%edx
> +        mov     $sym_offset(__trampoline_rel_start),%edi
> +        mov     $sym_offset(__trampoline_rel_stop),%ebx
>  1:
> -        mov     (%edi),%eax
> -        add     %edx,(%edi,%eax)
> +        mov     %fs:(%edi),%eax
> +        add     %edx,%fs:(%edi,%eax)
>          add     $4,%edi
> -        cmp     $sym_phys(__trampoline_rel_stop),%edi
> +        cmp     %ebx,%edi
>          jb      1b

And again it looks like the switch to using %ebx here is not needed.

>          /* Patch in the trampoline segment. */
>          shr     $4,%edx
> -        mov     $sym_phys(__trampoline_seg_start),%edi
> +        mov     $sym_offset(__trampoline_seg_start),%edi
> +        mov     $sym_offset(__trampoline_seg_stop),%ebx
>  1:
> -        mov     (%edi),%eax
> -        mov     %dx,(%edi,%eax)
> +        mov     %fs:(%edi),%eax
> +        mov     %dx,%fs:(%edi,%eax)
>          add     $4,%edi
> -        cmp     $sym_phys(__trampoline_seg_stop),%edi
> +        cmp     %ebx,%edi
>          jb      1b

Same here then, obviously.

> --- a/xen/arch/x86/boot/trampoline.S
> +++ b/xen/arch/x86/boot/trampoline.S
> @@ -54,12 +54,20 @@ trampoline_gdt:
>          /* 0x0028: real-mode data @ BOOT_TRAMPOLINE */
>          .long   0x0000ffff
>          .long   0x00009200
> +        /*
> +         * 0x0030: ring 0 Xen data, 16 MiB size, base
> +         * address is computed during runtime.
> +         */
> +        .quad   0x00c0920000001000

This does not look like it covers 1Mb, it's more like 1Mb+4k-1.

>          .pushsection .trampoline_rel, "a"
>          .long   trampoline_gdt + BOOT_PSEUDORM_CS + 2 - .
>          .long   trampoline_gdt + BOOT_PSEUDORM_DS + 2 - .
>          .popsection
>  
> +GLOBAL(xen_img_load_base_addr)
> +        .long   0

I've yet to understand the purpose of this symbol, but in any case:
If the trampoline code doesn't reference it, why does it get placed
here?

> @@ -861,15 +860,17 @@ void __init noreturn __start_xen(unsigned long mbi_p)
>          highmem_start &= ~((1UL << L3_PAGETABLE_SHIFT) - 1);
>  #endif
>  
> +    /* Do not relocate Xen image if boot loader did work for us. */
> +    if ( xen_img_load_base_addr )
> +        xen_phys_start = xen_img_load_base_addr;

Okay, with this change the question really is: Why do you need the
new symbol? I.e. why can't you just use xen_phys_start, just like I
managed to re-use it when I introduced boot from EFI?

>      for ( i = boot_e820.nr_map-1; i >= 0; i-- )
>      {
>          uint64_t s, e, mask = (1UL << L2_PAGETABLE_SHIFT) - 1;
>          uint64_t end, limit = ARRAY_SIZE(l2_identmap) << L2_PAGETABLE_SHIFT;
>  
> -        /* Superpage-aligned chunks from BOOTSTRAP_MAP_BASE. */

I can see that you want to get rid of BOOTSTRAP_MAP_BASE, but
please don't delete the comment as a whole.

>          s = (boot_e820.map[i].addr + mask) & ~mask;
>          e = (boot_e820.map[i].addr + boot_e820.map[i].size) & ~mask;
> -        s = max_t(uint64_t, s, BOOTSTRAP_MAP_BASE);
>          if ( (boot_e820.map[i].type != E820_RAM) || (s >= e) )
>              continue;

This means you now map memory between 2Mb and 16Mb here. Is
this necessary?

> @@ -901,7 +902,6 @@ void __init noreturn __start_xen(unsigned long mbi_p)
>              l4_pgentry_t *pl4e;
>              l3_pgentry_t *pl3e;
>              l2_pgentry_t *pl2e;
> -            uint64_t load_start;
>              int i, j, k;
>  
>              /* Select relocation address. */
> @@ -915,9 +915,8 @@ void __init noreturn __start_xen(unsigned long mbi_p)
>               * with a barrier(). After this we must *not* modify static/global
>               * data until after we have switched to the relocated pagetables!
>               */
> -            load_start = (unsigned long)_start - XEN_VIRT_START;
>              barrier();
> -            move_memory(e + load_start, load_start, _end - _start, 1);
> +            move_memory(e + XEN_IMG_OFFSET, XEN_IMG_OFFSET, _end - _start, 1);

Assuming _start - XEN_VIRT_START == XEN_IMG_OFFSET, is this
change necessary? Or is it rather just simplification, which again
should be split from an already complex patch.

> @@ -957,15 +956,14 @@ void __init noreturn __start_xen(unsigned long mbi_p)
>               * Undo the temporary-hooking of the l1_identmap.  __2M_text_start
>               * is contained in this PTE.
>               */
> -            BUG_ON(l2_table_offset((unsigned long)_erodata) ==
> -                   l2_table_offset((unsigned long)_stext));

At least when using_2M_mapping() this surely ought to hold?

> @@ -1019,6 +1017,8 @@ void __init noreturn __start_xen(unsigned long mbi_p)
>                  : "memory" );
>  
>              bootstrap_map(NULL);
> +
> +            printk("New Xen image base address: 0x%08lx\n", xen_phys_start);

I don't see the motivation for adding such a message in this patch,
but if, then please use %#lx.

> @@ -1082,6 +1082,7 @@ void __init noreturn __start_xen(unsigned long mbi_p)
>  
>      if ( !xen_phys_start )
>          panic("Not enough memory to relocate Xen.");
> +
>      reserve_e820_ram(&boot_e820, __pa(&_start), __pa(&_end));

Another stray change.

> --- a/xen/arch/x86/xen.lds.S
> +++ b/xen/arch/x86/xen.lds.S
> @@ -55,7 +55,7 @@ SECTIONS
>    __2M_text_start = .;         /* Start of 2M superpages, mapped RX. */
>  #endif
>  
> -  . = __XEN_VIRT_START + MB(1);
> +  . = __XEN_VIRT_START + XEN_IMG_OFFSET;
>    _start = .;
>    .text : {
>          _stext = .;            /* Text and read-only data */
> @@ -257,12 +257,14 @@ SECTIONS
>    .reloc : {
>      *(.reloc)
>    } :text
> -  /* Trick the linker into setting the image size to exactly 16Mb. */
>    . = ALIGN(__section_alignment__);
> +#endif
> +
> +  /* Trick the linker into setting the image size to exactly 16Mb. */
>    .pad : {
>      . = ALIGN(MB(16));
> +    __end_of_image__ = .;

I see that you use this symbol in xen/arch/x86/Makefile, but I again
don't follow why this logic needs to change.

> --- a/xen/include/asm-x86/page.h
> +++ b/xen/include/asm-x86/page.h
> @@ -288,7 +288,7 @@ extern root_pgentry_t 
> idle_pg_table[ROOT_PAGETABLE_ENTRIES];
>  extern l2_pgentry_t  *compat_idle_pg_table_l2;
>  extern unsigned int   m2p_compat_vstart;
>  extern l2_pgentry_t l2_xenmap[L2_PAGETABLE_ENTRIES],
> -    l2_bootmap[L2_PAGETABLE_ENTRIES];
> +    l2_bootmap[4*L2_PAGETABLE_ENTRIES];

Why do you need 4 of them? I can see why one might not suffice,
but two surely should?

Jan

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel