* [RFC v2] xSplice design
@ 2015-05-15 19:44 Konrad Rzeszutek Wilk
  2015-05-18 12:41 ` Jan Beulich
                   ` (4 more replies)
  0 siblings, 5 replies; 37+ messages in thread
From: Konrad Rzeszutek Wilk @ 2015-05-15 19:44 UTC (permalink / raw)
  To: msw, aliguori, Antony Messerli, Rick Harris, Paul Voccio,
	Steven Wilson, Major Hayden, Josh Kearney, jinsong.liu,
	xiantao.zxt, boris.ostrovsky, Daniel Kiper, Elena Ufimtseva,
	bob.liu, lars.kurth, hanweidong, peter.huangpeng, fanhenglong,
	liuyingdong, john.liuqiming, xen-devel, jbeulich, andrew.cooper3,
	jeremy
  Cc: konrad

Hey!

During the Xen Hacka^H^H^H^HProject Summit? we chatted about live-patching
the hypervisor. We sketched out how it could be done, and brainstormed
some of the problems.

I took that and wrote a design - which is very much RFC. The design is
laid out in two sections - the format of the ELF payload, and then the
hypercalls to act on it.

Hypercall preemption has caused a couple of XSAs so I've baked the need
for that in the design so we hopefully won't have an XSA for this code.

There are two big *TODO*s in the design which I had hoped to get done
before sending this out - however I am going on vacation for two weeks
so I figured it would be better to send this off for folks to mull over now
than to have it languish.

Please feel free to add more folks on the CC list.

Enjoy!


# xSplice Design v1 (EXTERNAL RFC v2)

## Rationale

A mechanism is required to binary patch the running hypervisor with new
opcodes, arising primarily from security updates.

This document describes the design of the API that would allow us to
upload binary patches to the hypervisor.

## Glossary

 * splice - patch the binary code with new opcodes.
 * trampoline - a jump to a new instruction.
 * payload - telemetry of the old code along with the binary blob of the new
   function (if needed).
 * reloc - telemetry contained in the payload to construct a proper trampoline.

## Multiple ways to patch

The mechanism needs to be flexible enough to patch the hypervisor in multiple
ways and be as simple as possible. The compiled code is contiguous in memory
with no gaps - so we have no luxury of 'moving' existing code and must either
insert a trampoline to the new code to be executed - or modify the code
in-place if there is sufficient space. The placement of new code has to be done
by the hypervisor and the virtual address for the new code is allocated
dynamically.

This implies that the hypervisor must compute the new offsets when splicing
in the new trampoline code. Where the trampoline is added (inside
the function we are patching or just the callers?) is also important.

To lessen the amount of code in the hypervisor, the consumer of the API
is responsible for identifying which mechanism to employ and how many locations
to patch. Combinations of modifying code in-place, adding trampolines, etc.
must be supported. The API should allow reading and writing of any memory
within the hypervisor virtual address space.

We must also have a mechanism to query what has been applied and a mechanism
to revert it if needed.

We must also have a mechanism to provide: a copy of the old code - so that
the hypervisor can verify it against the code in memory; the new code;
and the symbol name of the function to be patched, an offset from the symbol,
or a virtual address.

The complications that this design will encounter are explained later
in this document.

## Patching code

The first mechanism to patch that comes to mind is in-place replacement:
replace the affected code with new code. Unfortunately the x86
ISA has variable-length instructions, which places limits on how much space
we have available to replace the instructions.

The second mechanism is by replacing the call or jump to the
old function with the address of the new function.

A third mechanism is to add a jump to the new function at the
start of the old function.

### Example of trampoline and in-place splicing

As an example we will assume the hypervisor does not have XSA-132 (see
*domctl/sysctl: don't leak hypervisor stack to toolstacks*,
4ff3449f0e9d175ceb9551d3f2aecb59273f639d) and we would like to binary patch
the hypervisor with it. The original code looks as so:

<pre>
   48 89 e0                  mov    %rsp,%rax  
   48 25 00 80 ff ff         and    $0xffffffffffff8000,%rax  
</pre>

while the new patched hypervisor would be:

<pre>
   48 c7 45 b8 00 00 00 00   movq   $0x0,-0x48(%rbp)  
   48 c7 45 c0 00 00 00 00   movq   $0x0,-0x40(%rbp)  
   48 c7 45 c8 00 00 00 00   movq   $0x0,-0x38(%rbp)  
   48 89 e0                  mov    %rsp,%rax  
   48 25 00 80 ff ff         and    $0xffffffffffff8000,%rax  
</pre>

This is inside `arch_do_domctl`. This new change adds 21 extra
bytes of code which alters all the offsets inside the function. To alter
these offsets and add the extra 21 bytes of code we might not have enough
space in .text to squeeze this in.

As such we could simplify this problem by only patching the site
which calls arch_do_domctl:

<pre>
<do_domctl>:  
 e8 4b b1 05 00          callq  ffff82d08015fbb9 <arch_do_domctl>  
</pre>

with a new address for where the new `arch_do_domctl` would be (this
area would be allocated dynamically).
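The offset arithmetic for such a call-site patch can be sketched as follows. This is a minimal illustration, not Xen code; the call-site address in the usage note is derived from the example bytes above and is otherwise hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Sketch: re-point a rel32 callq (opcode 0xe8) located at virtual
 * address 'site_va' so it calls 'new_target'. The displacement is
 * relative to the address of the instruction following the 5-byte call. */
static void patch_callq(uint8_t insn[5], uint64_t site_va, uint64_t new_target)
{
    int32_t disp = (int32_t)(new_target - (site_va + 5));

    insn[0] = 0xe8;                        /* callq rel32 */
    memcpy(&insn[1], &disp, sizeof(disp)); /* little-endian displacement */
}
```

Round-tripping the example: a call site at (hypothetically) ffff82d080104a69 targeting `arch_do_domctl` at ffff82d08015fbb9 yields the displacement 0x0005b14b, i.e. the `e8 4b b1 05 00` bytes shown above.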

Astute readers will wonder what we need to do if we were to patch `do_domctl`
- which is not called directly by the hypervisor but on behalf of the guests via
the `compat_hypercall_table` and `hypercall_table`.
Patching the offset in `hypercall_table` for `do_domctl`
(ffff82d080103079 <do_domctl>):
<pre>

 ffff82d08024d490:   79 30  
 ffff82d08024d492:   10 80 d0 82 ff ff   

</pre>
with the new address where the new `do_domctl` resides is possible. The other
place where it is used is in `hvm_hypercall64_table` which would need
to be patched in a similar way. This would require an in-place splicing
of the new virtual address of `do_domctl`.

In summary this example patched the callers of the affected function by
 * allocating memory for the new code to live in,
 * changing the virtual address in all the functions which called the old
   code (computing the new offset, patching the callq with a new callq).
 * changing the function pointer tables with the new virtual address of
   the function (splicing in the new virtual address). Since this table
   resides in the .rodata section we would need to temporarily change the
   page table permissions during this part.


However it has severe drawbacks - the safety checks which have to make sure
the function is not on the stack - must also check every caller. For some
patches, with a sufficiently large number of callers, we might never be
able to apply the update.

### Example of different trampoline patching.

An alternative mechanism exists where we can insert a trampoline in the
existing function to be patched to jump directly to the new code. This
lessens the locations to be patched to one but it puts pressure on the
CPU branching logic (I-cache, though it is just one unconditional jump).

For this example we will assume that the hypervisor has not been compiled
with fe2e079f642effb3d24a6e1a7096ef26e691d93e (XSA-125: *pre-fill structures
for certain HYPERVISOR_xen_version sub-ops*) which memsets a structure
in the `xen_version` hypercall. This function is not called **anywhere** in
the hypervisor (it is called by the guest) but is referenced in the
`compat_hypercall_table` and `hypercall_table` (and indirectly called
from those). Patching the offset in `hypercall_table` for the old
`do_xen_version` (ffff82d080112f9e <do_xen_version>)

<pre>
 ffff82d08024b270 <hypercall_table>  
 ...  
 ffff82d08024b2f8:   9e 2f 11 80 d0 82 ff ff  

</pre>
with the new address where the new `do_xen_version` resides is possible. The
other place where it is used is in `hvm_hypercall64_table` which would need
to be patched in a similar way. This would require an in-place splicing
of the new virtual address of `do_xen_version`.

An alternative solution would be to insert a trampoline in the
old `do_xen_version` function to jump directly to the new `do_xen_version`.

<pre>
 ffff82d080112f9e <do_xen_version>:  
 ffff82d080112f9e:       48 c7 c0 da ff ff ff    mov    $0xffffffffffffffda,%rax  
 ffff82d080112fa5:       83 ff 09                cmp    $0x9,%edi  
 ffff82d080112fa8:       0f 87 24 05 00 00       ja     ffff82d0801134d2 <do_xen_version+0x534>  
</pre>

with:

<pre>
 ffff82d080112f9e <do_xen_version>:  
 ffff82d080112f9e:       e9 XX YY ZZ QQ          jmpq   [new do_xen_version]  
</pre>

which would lessen the amount of patching to just one location.
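Building that 5-byte `e9` trampoline can be sketched as below (illustrative only; the XX YY ZZ QQ bytes above are simply the little-endian rel32 displacement, and the new address here is hypothetical):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Sketch: overwrite the first five bytes of the old function with
 * 'jmpq rel32' (opcode 0xe9) to the dynamically allocated replacement.
 * As with callq, the displacement is relative to the next instruction. */
static void insert_trampoline(uint8_t insn[5], uint64_t old_va, uint64_t new_va)
{
    int32_t disp = (int32_t)(new_va - (old_va + 5));

    insn[0] = 0xe9;                        /* jmpq rel32 */
    memcpy(&insn[1], &disp, sizeof(disp)); /* little-endian displacement */
}
```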

In summary this example patched the affected function to jump to the
new replacement function, which required:
 * allocating memory for the new code to live in,
 * inserting a trampoline with the new offset in the old function to point to
   the new function.
 * Optionally we can insert in the old function a trampoline jump to a function
   providing a BUG_ON to catch errant code.

The disadvantage of this is that the unconditional jump incurs a small
I-cache penalty. However the simplicity of the patching and of the safety
checks makes this a worthwhile option.

### Security

With this method we can re-write the hypervisor - and as such we **MUST** be
diligent in only allowing certain guests to perform this operation.

Furthermore with SecureBoot or tboot, we **MUST** also verify the signature
of the payload to be certain it came from a trusted source.

As such the hypercall **MUST** support an XSM policy to limit what a
guest is allowed to do. If the system is booted with signature checking, the
signature checking will be enforced.

## Payload format

The payload **MUST** contain enough data to allow us to apply the update
and also safely reverse it. As such we **MUST** know:

 * What the old code is expected to be. We **MUST** verify it against the
   runtime code.
 * The locations in memory to be patched. This can be determined dynamically
   via symbols or via virtual addresses.
 * The new code to be used.
 * Signature to verify the payload.

This binary format could be constructed using a custom binary format but
there are severe disadvantages to that:

 * The format might need to change and we would need a mechanism to
   accommodate that.
 * It would have to be made platform agnostic.
 * It should be easily constructed using existing tools.

As such having the payload in an ELF file is the sensible way. We would be
carrying the various sets of structures (and data) in the ELF sections under
different names and with definitions. The prefix for the ELF section names
would always be: *.xsplice_*

Note that every structure has padding. This is added so that the hypervisor
can re-use those fields as it sees fit.

There are five *.xsplice_* sections:

 * `.xsplice_symbols` and `.xsplice_str`. The array of symbols to be referenced
   during the update. This can contain the symbols (functions) that will be
   patched, or the list of symbols (functions) to be checked pre-patching which
   may not be on the stack.

 * `.xsplice_reloc` and `.xsplice_reloc_howto`. How to properly construct
   trampolines for a patch. We can have multiple locations for which we
   need to insert a trampoline for a payload, and each location might require
   a different way of handling it. This would naturally reference the `.text`
   section and its proper offset. The `.xsplice_reloc` is not directly concerned
   with patches but rather is an ELF relocation - describing the target
   of a relocation and how that is performed. They are also used where
   the new code references the running code.

 * `.xsplice_sections`. The safety data for the old code and new code.
   This contains an array of symbols (pointing into `.xsplice_symbols`
   and `.text`) which are to be used during safety and dependency checking.


 * `.xsplice_patches`: The description of the new functions to be patched
   in (size, type, pointer to code, etc.).

 * `.xsplice_code`. The structure that ties all of this together and defines
   the payload.

Additionally the ELF file would contain:

 * `.text` section for the new and old code (function).
 * `.rela.text` relocation data for the `.text` (both new and old).
 * `.rela.xsplice_patches` relocation data for `.xsplice_patches` (such as
   offsets into the `.text`, `.xsplice_symbols`, or `.xsplice_reloc` sections).
 * `.bss` section for the new code (function).
 * `.data` and `.data.read_mostly` sections for the new and old code (function).
 * `.rodata` section for the new and old code (function).

In short the *.xsplice_* sections represent various structures and the
ELF provides the mechanism to glue it all together when loaded in memory.

Note that a lot of these ideas are borrowed from kSplice which is
available at: https://github.com/jirislaby/ksplice

For ELF understanding the best starting point is the OSDev Wiki
(http://wiki.osdev.org/ELF). Furthermore the ELF specification is
at http://www.skyfree.org/linux/references/ELF_Format.pdf and
at Oracle's web site:
http://docs.oracle.com/cd/E23824_01/html/819-0690/chapter6-46512.html#scrolltoc

### ASCII art of the ELF structures

*TODO*: Include an ASCII art of how the sections are tied together.

### xsplice_symbols

The section contains an array of a structure that outlines the names
of the symbols to be patched (or checked against). The structure is
as follows:

<pre>
struct xsplice_symbol {  
    const char *name; /* The ELF name of the symbol. */  
    const char *label; /* A unique xSplice name for the symbol. */  
    uint8_t pad[16]; /* Must be zero. */  
};  
</pre>
The structures may be in the section in any order and in any amount
(duplicate entries are permitted).

Both `name` and `label` would be pointing to entries in `.xsplice_str`.

The `label` is used for diagnostic purposes - such as including the
name and the offset.

### xsplice_reloc and xsplice_reloc_howto

The section contains an array of a structure that outlines the different
locations (and how) for which a trampoline is to be inserted.

The howto defines the change in detail. It contains the type,
whether the relocation is relative, the size of the relocation,
a bitmask for which parts of the instruction or data are to be replaced,
the amount the final relocation is shifted by (to drop unwanted data), and
whether the replacement should be interpreted as a signed value.

The structure is as follows:

<pre>
#define XSPLICE_HOWTO_RELOC_INLINE  0 /* Inline replacement. */  
#define XSPLICE_HOWTO_RELOC_PATCH   1 /* Add trampoline. */  
#define XSPLICE_HOWTO_RELOC_DATA    2 /* __DATE__ type change. */  
#define XSPLICE_HOWTO_RELOC_TIME    3 /* __TIME__ type change. */  
#define XSPLICE_HOWTO_BUG           4 /* BUG_ON being replaced.*/  
#define XSPLICE_HOWTO_EXTABLE       5 /* exception_table change. */  
#define XSPLICE_HOWTO_SYMBOL        6 /* change in symbol table. */  

#define XSPLICE_HOWTO_FLAG_PC_REL    0x00000001 /* Is PC relative. */  
#define XSPLICE_HOWTO_FLAG_SIGN      0x00000002 /* Should the new value be treated as a signed value. */  

struct xsplice_reloc_howto {  
    uint32_t    type; /* XSPLICE_HOWTO_* */  
    uint32_t    flag; /* XSPLICE_HOWTO_FLAG_* */  
    uint32_t    size; /* Size, in bytes, of the item to be relocated. */  
    uint32_t    r_shift; /* The value the final relocation is shifted right by; used to drop unwanted data from the relocation. */  
    uint64_t    mask; /* Bitmask for which parts of the instruction or data are replaced with the relocated value. */  
    uint8_t     pad[8]; /* Must be zero. */  
};  

</pre>

This structure is used in:

<pre>
struct xsplice_reloc {  
    uint64_t addr; /* The address of the relocation (if known). */  
    struct xsplice_symbol *symbol; /* Symbol for this relocation. */  
    struct xsplice_reloc_howto  *howto; /* Pointer to the above structure. */  
    uint64_t isns_added; /* ELF addend resulting from quirks of instruction one of whose operands is the relocation. For example, this is -4 on x86 pc-relative jumps. */  
    uint64_t isns_target; /* rest of the ELF addend.  This is equal to the offset against the symbol that the relocation refers to. */  
    uint8_t pad[8];  /* Must be zero. */  
};  
</pre>
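How such a howto might be applied can be sketched as below. This is a hedged interpretation of the field descriptions above, with abbreviated structure and field names; it is not the actual implementation.

```c
#include <assert.h>
#include <stdint.h>

#define XSPLICE_HOWTO_FLAG_PC_REL 0x00000001 /* Is PC relative. */

/* Abbreviated howto: only the fields the computation needs. */
struct howto {
    uint32_t flag;    /* XSPLICE_HOWTO_FLAG_* */
    uint32_t r_shift; /* Right shift to drop unwanted data. */
    uint64_t mask;    /* Which bits of the target are replaced. */
};

/* Compute the relocated value and merge it into the target bits:
 * value = symbol + addend (minus PC if PC-relative), shifted, masked. */
static uint64_t apply_howto(const struct howto *h, uint64_t target,
                            uint64_t sym_va, int64_t addend, uint64_t pc)
{
    uint64_t val = sym_va + (uint64_t)addend;

    if (h->flag & XSPLICE_HOWTO_FLAG_PC_REL)
        val -= pc;
    val >>= h->r_shift;
    return (target & ~h->mask) | (val & h->mask);
}
```

For an x86 pc-relative jump the addend would be -4, as noted in the `isns_added` comment above.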

### xsplice_sections

The structure defined in this section is used to verify that it is safe
to update with the new changes. It can contain safety data on the old code
and what kind of matching we are to expect.

It can also contain safety data on what to check when about to patch:
that is, whether any of the addresses (either provided or resolved
when the payload is loaded by referencing the symbols) contain in memory
what we expect them to.

As such the flags can be or-ed together:

<pre>
#define XSPLICE_SECTION_TEXT   0x00000001 /* Section is in .text */  
#define XSPLICE_SECTION_RODATA 0x00000002 /* Section is in .rodata */  
#define XSPLICE_SECTION_DATA   0x00000004 /* Section is in .data */  
#define XSPLICE_SECTION_STRING 0x00000008 /* Section is in .str */  
#define XSPLICE_SECTION_ALTINSTRUCTIONS 0x00000010 /* Section has .altinstructions. */  
#define XSPLICE_SECTION_TEXT_INPLACE 0x00000200 /* Change is in place. */  
#define XSPLICE_SECTION_MATCH_EXACT 0x00000400 /* Must match exactly. */  
#define XSPLICE_SECTION_NO_STACKCHECK 0x00000800 /* Do not check the stack. */  

struct xsplice_section {  
    struct xsplice_symbol *symbol; /* The symbol associated with this change. */  
    uint64_t address; /* The address of the section (if known). */  
    uint64_t size; /* The size of the section. */  
    uint64_t flags; /* Various XSPLICE_SECTION_* flags. */
    uint8_t pad[16]; /* Must be zero. */  
};

</pre>

### xsplice_patches

Within this section we have an array of a structure defining the new code (patch).

This structure consists of a pointer to the new code (which in the ELF ends up
pointing to an offset in the `.text` or `.data` section); the type of patch:
inline - either text or data - or requesting a trampoline; and the size of
the patch.

The structure is as follows:

<pre>
#define XSPLICE_PATCH_INLINE_TEXT   0
#define XSPLICE_PATCH_INLINE_DATA   1
#define XSPLICE_PATCH_RELOC_TEXT    2

struct xsplice_patch {  
    uint32_t type; /* XSPLICE_PATCH_* .*/  
    uint32_t size; /* Size of patch. */  
    uint64_t addr; /* The address of the new code (or data). */  
    void *content; /* The bytes to be installed. */  
    uint8_t pad[16]; /* Must be zero. */  
};

</pre>
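A sketch of dispatching on the patch type is below. It assumes (hedged: the design does not spell this out) that `addr` is the destination of the installed bytes and that inline patches simply copy `content` over it once the pages are writable; the trampoline path is elided.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define XSPLICE_PATCH_INLINE_TEXT   0
#define XSPLICE_PATCH_INLINE_DATA   1
#define XSPLICE_PATCH_RELOC_TEXT    2

struct xsplice_patch {
    uint32_t type;  /* XSPLICE_PATCH_* */
    uint32_t size;  /* Size of patch. */
    uint64_t addr;  /* The address of the new code (or data). */
    void *content;  /* The bytes to be installed. */
};

static int apply_patch(const struct xsplice_patch *p)
{
    void *dst = (void *)(uintptr_t)p->addr;

    switch (p->type) {
    case XSPLICE_PATCH_INLINE_TEXT:
    case XSPLICE_PATCH_INLINE_DATA:
        memcpy(dst, p->content, p->size); /* pages made writable first */
        return 0;
    case XSPLICE_PATCH_RELOC_TEXT:
        /* Would install an e9 trampoline to the new code instead. */
        return 0;
    default:
        return -1; /* unknown type: reject the payload */
    }
}
```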

### xsplice_code

The structure embedded within this section ties it all together.
It has the name of the patch, and pointers to all the above
mentioned structures (the start and end addresses).

The structure is as follows:

<pre>
struct xsplice_code {  
    const char *name; /* A sensible name for the patch. Up to 40 characters. */  
    struct xsplice_reloc *relocs, *relocs_end; /* How to patch it */  
    struct xsplice_section *sections, *sections_end; /* Safety data */  
    struct xsplice_patch *patches, *patches_end; /* Patch code & data */  
    uint8_t pad[32]; /* Must be zero. */
};
</pre>

There should only be one such structure in the section.
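Since each `*_end` pointer would mark one past the last element, consumers can walk these pairs as half-open ranges; a toy sketch (abbreviated stand-in structures, not the real definitions):

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for one (start, end) pointer pair in xsplice_code. */
struct patch { int applied; };
struct code  { struct patch *patches, *patches_end; };

/* Iterate the half-open [patches, patches_end) range. */
static size_t apply_all(struct code *c)
{
    size_t n = 0;

    for (struct patch *p = c->patches; p < c->patches_end; p++, n++)
        p->applied = 1;
    return n;
}
```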

### Example

*TODO*: Include an objdump of how the ELF would look like for the XSA
mentioned earlier.

## Signature checking requirements.

The signature checking requires that the layout of the data in memory
**MUST** be the same for the signature to be verified. This means that the
payload data layout in ELF format **MUST** match what the hypervisor expects
so that it can properly do signature verification.

The signature is computed over all of the payload laid out contiguously
in memory. The signature is to be appended at the end of the ELF payload,
prefixed with the string "~Module signature appended~\n", followed by
a signature header, then the signature, the key identifier, and the signer's
name.

Specifically the signature header would be:

<pre>
#define PKEY_ALGO_DSA       0  
#define PKEY_ALGO_RSA       1  

#define PKEY_ID_PGP         0 /* OpenPGP generated key ID */  
#define PKEY_ID_X509        1 /* X.509 arbitrary subjectKeyIdentifier */  

#define HASH_ALGO_MD4          0  
#define HASH_ALGO_MD5          1  
#define HASH_ALGO_SHA1         2  
#define HASH_ALGO_RIPE_MD_160  3  
#define HASH_ALGO_SHA256       4  
#define HASH_ALGO_SHA384       5  
#define HASH_ALGO_SHA512       6  
#define HASH_ALGO_SHA224       7  
#define HASH_ALGO_RIPE_MD_128  8  
#define HASH_ALGO_RIPE_MD_256  9  
#define HASH_ALGO_RIPE_MD_320 10  
#define HASH_ALGO_WP_256      11  
#define HASH_ALGO_WP_384      12  
#define HASH_ALGO_WP_512      13  
#define HASH_ALGO_TGR_128     14  
#define HASH_ALGO_TGR_160     15  
#define HASH_ALGO_TGR_192     16  


struct elf_payload_signature {  
	u8	algo;		/* Public-key crypto algorithm PKEY_ALGO_*. */  
	u8	hash;		/* Digest algorithm: HASH_ALGO_*. */  
	u8	id_type;	/* Key identifier type PKEY_ID*. */  
	u8	signer_len;	/* Length of signer's name */  
	u8	key_id_len;	/* Length of key identifier */  
	u8	__pad[3];  
	__be32	sig_len;	/* Length of signature data */  
};

</pre>
(Note that this has been borrowed from the Linux module signature code.)
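A consumer might locate and sanity-check that trailer as sketched below. Assumptions (hedged): the marker string precedes the header as described above, the signer's name, key identifier, and signature fill the remainder of the file, and `sig_len` is big-endian per `__be32`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct elf_payload_signature {
    uint8_t  algo;       /* PKEY_ALGO_* */
    uint8_t  hash;       /* HASH_ALGO_* */
    uint8_t  id_type;    /* PKEY_ID_* */
    uint8_t  signer_len; /* Length of signer's name. */
    uint8_t  key_id_len; /* Length of key identifier. */
    uint8_t  pad[3];
    uint32_t sig_len;    /* Big-endian on the wire. */
};

static const char sig_magic[] = "~Module signature appended~\n";

/* Scan for the marker, parse the header after it, and verify that the
 * trailing lengths account for the rest of the payload. On success,
 * *elf_len is the size of the bare ELF preceding the marker. */
static int find_signature(const uint8_t *buf, size_t len,
                          struct elf_payload_signature *out, size_t *elf_len)
{
    size_t mlen = sizeof(sig_magic) - 1;

    for (size_t off = 0; off + mlen + sizeof(*out) <= len; off++) {
        uint8_t be[4];

        if (memcmp(buf + off, sig_magic, mlen) != 0)
            continue;
        memcpy(out, buf + off + mlen, sizeof(*out));
        memcpy(be, &out->sig_len, sizeof(be)); /* __be32 -> host order */
        out->sig_len = ((uint32_t)be[0] << 24) | ((uint32_t)be[1] << 16) |
                       ((uint32_t)be[2] << 8) | be[3];
        if (off + mlen + sizeof(*out) + out->signer_len +
            out->key_id_len + out->sig_len != len)
            continue; /* marker appeared inside other data; keep scanning */
        *elf_len = off;
        return 0;
    }
    return -1;
}
```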


## Hypercalls

We will employ the sub-operations of the system management hypercall (sysctl).
There are to be four sub-operations:

 * upload the payloads.
 * list the summaries of the uploaded payloads and their state.
 * get a particular payload's summary and its state.
 * command to apply, delete, or revert the payload.

The patching is asynchronous, therefore the caller is responsible
for verifying that it has been applied properly, by retrieving its summary
and verifying that there are no error codes associated with the payload.

We **MUST** make it asynchronous due to the nature of patching: it requires
every physical CPU to be in lock-step with the others. The patching mechanism,
while an implementation detail, is not a short operation and as such
the design **MUST** assume it will be a long-running operation.

Furthermore it is possible to have multiple different payloads for the same
function. As such a unique id has to be visible to allow proper manipulation.

The hypercall is part of the `xen_sysctl`. The top level structure contains
one uint32_t to determine the sub-operations:

<pre>
struct xen_sysctl_xsplice_op {  
    uint32_t cmd;  
	union {  
          ... see below ...  
        } u;  
};  

</pre>
while the rest of the hypercall-specific structures are part of this structure.


### XEN_SYSCTL_XSPLICE_UPLOAD (0)

Upload a payload to the hypervisor. The payload is verified and if there
are any issues the proper return code will be returned. The payload is
not applied at this time - that is controlled by *XEN_SYSCTL_XSPLICE_ACTION*.

The caller provides:

 * `id` unique id.
 * `payload` the virtual address of where the ELF payload is.

The return value is zero if the payload was successfully uploaded and the
signature was verified. Otherwise an EXX return value is provided.
Duplicate `id`s are not supported.

The `payload` is the ELF payload as mentioned in the `Payload format` section.

The structure is as follows:

<pre>
struct xen_sysctl_xsplice_upload {  
    char id[40];  /* IN, name of the patch. */  
    uint64_t size; /* IN, size of the ELF file. */
    XEN_GUEST_HANDLE_64(uint8) payload; /* ELF file. */  
}; 
</pre>

### XEN_SYSCTL_XSPLICE_GET (1)

Retrieve a summary of a specific payload. The caller provides:

 * `id` the unique id.
 * `status` *MUST* be set to zero.
 * `rc` *MUST* be set to zero.

The `summary` structure contains a summary of the payload which includes:

 * `id` the unique id.
 * `status` - whether it has been:
 1. *XSPLICE_STATUS_LOADED* (0) has been loaded.
 2. *XSPLICE_STATUS_PROGRESS* (1) acting on the **XEN_SYSCTL_XSPLICE_ACTION** command.
 3. *XSPLICE_STATUS_CHECKED*  (2) the ELF payload safety checks passed.
 4. *XSPLICE_STATUS_APPLIED* (3) loaded, checked, and applied.
 5. *XSPLICE_STATUS_REVERTED* (4) loaded, checked, applied and then also reverted.
 6. *XSPLICE_STATUS_IN_ERROR* (5) loaded and in a failed state. Consult `rc` for details.
 * `rc` - its error state if any.

The structure is as follows:

<pre>
#define XSPLICE_STATUS_LOADED    0  
#define XSPLICE_STATUS_PROGRESS  1  
#define XSPLICE_STATUS_CHECKED   2  
#define XSPLICE_STATUS_APPLIED   3  
#define XSPLICE_STATUS_REVERTED  4  
#define XSPLICE_STATUS_IN_ERROR  5  

struct xen_sysctl_xsplice_summary {  
    char id[40];  /* IN/OUT, name of the patch. */  
    uint32_t status;   /* OUT */  
    int32_t rc;  /* OUT */  
}; 
</pre>

### XEN_SYSCTL_XSPLICE_LIST (2)

Retrieve an array of abbreviated summaries of the payloads that are loaded
in the hypervisor.

The caller provides:

 * `idx` index iterator. Initially it *MUST* be zero.
 * `count` the max number of entries to populate.
 * `summary` virtual address of where to write payload summaries.

The hypercall returns zero on success and updates the `idx` (index) iterator
with the number of payloads returned, `count` with the number of remaining
payloads, and `summary` with that number of payload summaries.

If the hypercall returns E2BIG the `count` is too big and should be
lowered.

Note that due to the asynchronous nature of hypercalls the domain might have
added or removed payloads, making this information stale. It is
the responsibility of the domain to provide proper accounting.

The `summary` structure contains a summary of the payload which includes:

 * `id` unique id.
 * `status` - whether it has been:
 1. *XSPLICE_STATUS_LOADED* (0) has been loaded.
 2. *XSPLICE_STATUS_PROGRESS* (1) acting on the **XEN_SYSCTL_XSPLICE_ACTION** command.
 3. *XSPLICE_STATUS_CHECKED*  (2) the payload `old` and `addr` match with the hypervisor.
 4. *XSPLICE_STATUS_APPLIED* (3) loaded, checked, and applied.
 5. *XSPLICE_STATUS_REVERTED* (4) loaded, checked, applied and then also reverted.
 6. *XSPLICE_STATUS_IN_ERROR* (5) loaded and in a failed state. Consult `rc` for details.
 * `rc` - its error state if any.

The structure is as follows:

<pre>
struct xen_sysctl_xsplice_list {  
    uint32_t idx;  /* IN/OUT */  
    uint32_t count;  /* IN/OUT */
    XEN_GUEST_HANDLE_64(xen_sysctl_xsplice_summary) summary;  /* OUT */  
};  

struct xen_sysctl_xsplice_summary {  
    char id[40];  /* OUT, name of the patch. */  
    uint32_t status;   /* OUT */  
    int32_t rc;  /* OUT */  
};  

</pre>
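The `idx`/`count` semantics above can be exercised against a toy model; the `fake_xsplice_list` function below merely simulates the described behaviour of the hypercall and is not a real interface.

```c
#include <assert.h>
#include <stdint.h>

#define NR_PAYLOADS 5 /* pretend five payloads are loaded */

/* Toy model of XEN_SYSCTL_XSPLICE_LIST: on entry *idx is the caller's
 * cursor and *count the room available; on exit *idx is the number of
 * summaries returned and *count the number of payloads remaining. */
static int fake_xsplice_list(uint32_t *idx, uint32_t *count, uint32_t *ids)
{
    uint32_t start = *idx, n = 0;

    while (n < *count && start + n < NR_PAYLOADS) {
        ids[n] = start + n; /* stand-in for writing a summary entry */
        n++;
    }
    *idx = n;                     /* number of payloads returned */
    *count = NR_PAYLOADS - (start + n); /* remaining */
    return 0;
}
```

The caller accumulates its own cursor, advancing it by the returned `idx` each call until `count` drops to zero.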
### XEN_SYSCTL_XSPLICE_ACTION (3)

Perform an operation on the payload structure referenced by the `id` field.
The operation request is asynchronous and the status should be retrieved
by using either **XEN_SYSCTL_XSPLICE_GET** or **XEN_SYSCTL_XSPLICE_LIST** hypercall.

The caller provides:

 * `id` the unique id.
 * `cmd` the command requested:
  1. *XSPLICE_ACTION_CHECK* (0) check that the payload will apply properly.
  2. *XSPLICE_ACTION_UNLOAD* (1) unload the payload.  
  3. *XSPLICE_ACTION_REVERT* (2) revert the payload.  
  4. *XSPLICE_ACTION_APPLY* (3) apply the payload.   


The return value will be zero unless the provided fields are incorrect.

The structure is as follows:

<pre>
#define XSPLICE_ACTION_CHECK  0  
#define XSPLICE_ACTION_UNLOAD 1  
#define XSPLICE_ACTION_REVERT 2  
#define XSPLICE_ACTION_APPLY  3  

struct xen_sysctl_xsplice_action {  
    char id[40];  /* IN, name of the patch. */  
    uint32_t cmd; /* IN */  
};  

</pre>

## Sequence of events.

The normal sequence of events is to:

 1. *XEN_SYSCTL_XSPLICE_UPLOAD* to upload the payload. If there are errors *STOP* here.
 2. *XEN_SYSCTL_XSPLICE_GET* to check the `->status`. If in *XSPLICE_STATUS_PROGRESS* spin. If in *XSPLICE_STATUS_LOADED* go to next step.
 3. *XEN_SYSCTL_XSPLICE_ACTION* with *XSPLICE_ACTION_CHECK* command to verify that the payload can be successfully applied.
 4. *XEN_SYSCTL_XSPLICE_GET* to check the `->status`. If in *XSPLICE_STATUS_PROGRESS* spin. If in *XSPLICE_STATUS_CHECKED* go to next step.
 5. *XEN_SYSCTL_XSPLICE_ACTION* with *XSPLICE_ACTION_APPLY* to apply the patch.
 6. *XEN_SYSCTL_XSPLICE_GET* to check the `->status`. If in *XSPLICE_STATUS_PROGRESS* spin. If in *XSPLICE_STATUS_APPLIED* exit with success.
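The steps above can be sketched against a toy model of the hypervisor. All `fake_*` functions are stand-ins for the real hypercalls, and the model resolves *XSPLICE_STATUS_PROGRESS* on the next poll; a real caller would also bound its spinning.

```c
#include <assert.h>

enum {
    XSPLICE_STATUS_LOADED, XSPLICE_STATUS_PROGRESS, XSPLICE_STATUS_CHECKED,
    XSPLICE_STATUS_APPLIED, XSPLICE_STATUS_REVERTED, XSPLICE_STATUS_IN_ERROR
};
enum { XSPLICE_ACTION_CHECK, XSPLICE_ACTION_UNLOAD,
       XSPLICE_ACTION_REVERT, XSPLICE_ACTION_APPLY };

static int g_status = XSPLICE_STATUS_LOADED; /* after a successful upload */
static int g_pending = -1;

static void fake_action(int cmd)   /* XEN_SYSCTL_XSPLICE_ACTION */
{
    g_pending = cmd;
    g_status = XSPLICE_STATUS_PROGRESS;
}

static int fake_get(void)          /* XEN_SYSCTL_XSPLICE_GET */
{
    if (g_status == XSPLICE_STATUS_PROGRESS)
        g_status = (g_pending == XSPLICE_ACTION_CHECK)
                   ? XSPLICE_STATUS_CHECKED : XSPLICE_STATUS_APPLIED;
    return g_status;
}

static int live_patch(void)        /* steps 2-6 of the sequence */
{
    while (fake_get() == XSPLICE_STATUS_PROGRESS)
        ;                                        /* step 2: spin */
    fake_action(XSPLICE_ACTION_CHECK);           /* step 3 */
    while (fake_get() == XSPLICE_STATUS_PROGRESS)
        ;                                        /* step 4: spin */
    if (g_status != XSPLICE_STATUS_CHECKED)
        return -1;
    fake_action(XSPLICE_ACTION_APPLY);           /* step 5 */
    while (fake_get() == XSPLICE_STATUS_PROGRESS)
        ;                                        /* step 6: spin */
    return g_status == XSPLICE_STATUS_APPLIED ? 0 : -1;
}
```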

 
## Addendum

Implementation quirks should not be discussed in a design document.

However these observations can provide aid when developing against this
document.


### Alternative assembler

Alternative assembler is a mechanism to use different instructions depending
on what the CPU supports. This is done by providing multiple streams of code
that can be patched in - or, if the CPU does not support it, padded with
`nop` operations. The alternative assembler macros cause the compiler to
place the most generic code in place and emit a special
ELF section header to tag this location. During run-time the hypervisor
can leave the areas alone or patch them with better suited opcodes.

As we might be patching the alternative assembler sections as well - by
providing new better suited opcodes or perhaps nops - we need to
re-run the alternative assembler patching after we have done our
patching.

Also, when we are doing safety checks, the code we are checking might be
utilizing alternative assembler. As such we should relax our checks to
accommodate that.

### .rodata sections

The patching might require strings to be updated as well. As such we must be
also able to patch the strings as needed. This sounds simple - but the compiler
has a habit of coalescing strings that are the same - which means if we in-place
alter the strings - other users will be inadvertently affected as well.

This is also where pointers to functions live - and we may need to patch this
as well.

To guard against that we must be prepared to do patching similar to
trampoline patching or in-line depending on the flavour. If we can
do in-line patching we would need to:

 * alter `.rodata` to be writeable.
 * inline patch.
 * alter `.rodata` to be read-only.

If we are doing trampoline patching we would need to:

 * allocate a new memory location for the string.
 * all locations which use this string will have to be updated to use the
   offset to the string.
 * mark the region RO when we are done.

### .bss sections

Patching writable data is not suitable as it is unclear what should be done
depending on the current state of data. As such it should not be attempted.


### Patching code which is in the stack.

We should not patch the code which is on the stack. That can lead
to corruption.

### Trampoline (e9 opcode)

The e9 opcode used for jmpq uses a 32-bit signed displacement. That means
the new code must be placed within 2GB of virtual address space of
the old code. That should not be a problem since the Xen hypervisor has
a very small footprint.

However if we need to, we can always chain two trampolines: one at the 2GB
limit that jumps to the next trampoline.
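A quick range check for that constraint (simple arithmetic, not Xen code; the second address in the test is hypothetical):

```c
#include <assert.h>
#include <stdint.h>

/* An e9 jmpq reaches at most INT32_MIN..INT32_MAX bytes from the end
 * of the 5-byte instruction; beyond that a second hop is needed. */
static int in_jmp_range(uint64_t old_va, uint64_t new_va)
{
    int64_t disp = (int64_t)(new_va - (old_va + 5));

    return disp >= INT32_MIN && disp <= INT32_MAX;
}
```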

### Time rendezvous code instead of stop_machine for patching

The hypervisor's time rendezvous code runs synchronously across all CPUs
every second. Using stop_machine to patch can stall the time rendezvous
code and result in an NMI. As such, having the patching done at the tail
of the rendezvous code should avoid this problem.

### Security

Only the privileged domain should be allowed to do this operation.


* Re: [RFC v2] xSplice design
  2015-05-15 19:44 [RFC v2] xSplice design Konrad Rzeszutek Wilk
@ 2015-05-18 12:41 ` Jan Beulich
  2015-06-05 14:49   ` Konrad Rzeszutek Wilk
  2015-05-18 12:54 ` Liuqiming (John)
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 37+ messages in thread
From: Jan Beulich @ 2015-05-18 12:41 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Elena Ufimtseva, jeremy, hanweidong, john.liuqiming, Paul Voccio,
	xen-devel, Daniel Kiper, Major Hayden, liuyingdong, aliguori,
	konrad, lars.kurth, Steven Wilson, peter.huangpeng, msw,
	xiantao.zxt, Rick Harris, boris.ostrovsky, Josh Kearney,
	jinsong.liu, Antony Messerli, fanhenglong, andrew.cooper3

>>> On 15.05.15 at 21:44, <konrad.wilk@oracle.com> wrote:
> As such having the payload in an ELF file is the sensible way. We would be
> carrying the various sets of structures (and data) in the ELF sections under
> different names and with definitions. The prefix for the ELF section name
> would always be: *.xsplice_*
> 
> Note that every structure has padding. This is added so that the hypervisor
> can re-use those fields as it sees fit.
> 
> There are five *.xsplice_* sections:

A general remark: Having worked on ELF on different occasions,
UNIX'es (and hence Linux'es) use of section names to identify the
sections' purposes is pretty contrary to ELF's original intentions.
Section types (and architecture as well as OS extensions to them)
exist for a reason. Of course from the following sections it's not
really clear to what extent you intended this to be done. If you did
and just didn't explicitly say so, pointing out that relations
between sections should then also be expressed suitably via ELF
mechanisms (sh_link, sh_info) would then be unnecessary too. If
you didn't, then I'd strongly suggest switching to a formally more
correct model, which would then include specifying section types
and section attributes (and optional other relevant aspects)
along with their (then no longer mandatory) names.

>  * `.xsplice_symbols` and `.xsplice_str`. The array of symbols to be referenced
>    during the update. This can contain the symbols (functions) that will be
>    patched, or the list of symbols (functions) to be checked pre-patching which
>    may not be on the stack.
> 
>  * `.xsplice_reloc` and `.xsplice_reloc_howto`. How to properly construct
>    trampolines for a patch. We can have multiple locations for which we
>    need to insert a trampoline for a payload, and each location might require
>    a different way of handling it. This would naturally reference the `.text`
>    section and its proper offset. The `.xsplice_reloc` is not directly concerned
>    with patches but rather is an ELF relocation - describing the target
>    of a relocation and how that is performed.  They're also used for where
>    the new code references the running code too.
> 
>  * `.xsplice_sections`. The safety data for the old code and new code.
>    This contains an array of symbols (pointing to `.xsplice_symbols`
>    and `.text`) which are to be used during safety and dependency checking.
> 
> 
>  * `.xsplice_patches`: The description of the new functions to be patched
>    in (size, type, pointer to code, etc.).
> 
>  * `.xsplice_change`. The structure that ties all of this together and defines
>    the payload.
> 
> Additionally the ELF file would contain:
> 
>  * `.text` section for the new and old code (function).
>  * `.rela.text` relocation data for the `.text` (both new and old).
>  * `.rela.xsplice_patches` relocation data for `.xsplice_patches` (such as offset
>    to the `.text` ,`.xsplice_symbols`, or `.xsplice_reloc` section).

I think you should not immediately rule out REL relocations, i.e.
specify both .rela.* and .rel.*.

>  * `.bss` section for the new code (function)
>  * `.data` and `.data.read_mostly` section for the new and old code (function)
>  * `.rodata` section for the new and old code (function).
> 
> In short the *.xsplice_* sections represent various structures and the
> ELF provides the mechanism to glue it all together when loaded in memory.

As per above - I think ELF can do this in a more explicit way. Of
course that would imply sticking to strictly ELF types in the structure
definitions below. That would in turn have the advantage of
expressing e.g. "const char *" references to strings by section
offsets into a specified string section, reducing (with the ultimate
goal of eliminating) the need for processing relocations on the
ELF object (possible references to "external symbols" left aside for
the moment).

> Note that a lot of these ideas are borrowed from kSplice which is
> available at: https://github.com/jirislaby/ksplice 
> 
> For ELF understanding the best starting point is the OSDev Wiki
> (http://wiki.osdev.org/ELF). Furthermore the ELF specification is
> at http://www.skyfree.org/linux/references/ELF_Format.pdf and
> at Oracle's web site:
> http://docs.oracle.com/cd/E23824_01/html/819-0690/chapter6-46512.html#scrolltoc

I think the master copy still is

http://www.sco.com/developers/gabi/

> ### xsplice_reloc and xsplice_reloc_howto
> 
> The section contains an array of a structure that outlines the different
> locations (and howto) for which a trampoline is to be inserted.
> 
> The howto defines the change in detail. It contains the type,
> whether the relocation is relative, the size of the relocation,
> a bitmask for which parts of the instruction or data are to be replaced,
> the amount the final relocation is shifted by (to drop unwanted data), and
> whether the replacement should be interpreted as a signed value.
> 
> The structure is as follows:
> 
> <pre>
> #define XSPLICE_HOWTO_RELOC_INLINE  0 /* Inline replacement. */  
> #define XSPLICE_HOWTO_RELOC_PATCH   1 /* Add trampoline. */  
> #define XSPLICE_HOWTO_RELOC_DATA    2 /*  __DATE__ type change. */  

DATE or DATA?

> #define XSPLICE_HOWTO_RELOC_TIME    3 /* __TIME__ type change. */  

Assuming the former is DATE - what do you foresee date/time
kinds of changes to be good for?

> #define XSPLICE_HOWTO_BUG           4 /* BUG_ON being replaced.*/  
> #define XSPLICE_HOWTO_EXTABLE       5 /* exception_table change. */  
> #define XSPLICE_HOWTO_SYMBOL        6 /* change in symbol table. */  

For none of these I really understand their purpose.

> ### xsplice_sections
> 
> The structure defined in this section is used to verify that it is safe
> to update with the new changes. It can contain safety data on the old code
> and what kind of matching we are to expect.
> 
> It can also contain safety data on what to check when about to patch.
> That is, whether any of the addresses (either provided, or resolved
> when the payload is loaded by referencing the symbols) contain in memory
> what we expect them to.
> 
> As such the flags can be or-ed together:
> 
> <pre>
> #define XSPLICE_SECTION_TEXT   0x00000001 /* Section is in .text */  
> #define XSPLICE_SECTION_RODATA 0x00000002 /* Section is in .ro */  
> #define XSPLICE_SECTION_DATA   0x00000004 /* Section is in .rodata */  

DATA or .rodata?

> #define XSPLICE_SECTION_STRING 0x00000008 /* Section is in .str */  
> #define XSPLICE_SECTION_ALTINSTRUCTIONS 0x00000010 /* Section has .altinstructions. */  

I don't see the purpose of adding flags for a few special sections we
may know about now - what if over time hundreds of sections appear?

> #define XSPLICE_SECTION_TEXT_INPLACE 0x00000200 /* Change is in place. */   
> #define XSPLICE_SECTION_MATCH_EXACT 0x00000400 /* Must match exactly. */  
> #define XSPLICE_SECTION_NO_STACKCHECK 0x00000800 /* Do not check the stack. */  

This isn't the first reference to "stack", without me having seen a
description of what the word is meant to stand for in such contexts.
Did I overlook it?

> struct xsplice_section {  
>     struct xsplice_symbol *symbol; /* The symbol associated with this change. */  
>     uint64_t address; /* The address of the section (if known). */  
>     uint64_t size; /* The size of the section. */  

I don't think you need 64-bit types here even on 32-bit architectures.
Use Elf*_Addr and Elf*_Word?

> ### xsplice_patches
> 
> Within this section we have an array of a structure defining the new code 
> (patch).
> 
> This structure consists of a pointer to the new code (which in ELF ends up
> pointing to an offset in the `.text` or `.data` section); the type of patch:
> inline - either text or data - or requesting a trampoline; and the size of the patch.
> 
> The structure is as follows:
> 
> <pre>
> #define XSPLICE_PATCH_INLINE_TEXT   0
> #define XSPLICE_PATCH_INLINE_DATA   1
> #define XSPLICE_PATCH_RELOC_TEXT    2

Again it remains vague what these are supposed to express.

> ### XEN_SYSCTL_XSPLICE_GET (1)
> 
> Retrieve a summary of a specific payload. The caller provides:
> 
>  * `id` the unique id.
>  * `status` *MUST* be set to zero.
>  * `rc` *MUST* be set to zero.
> 
> The `summary` structure contains a summary of the payload which includes:
> 
>  * `id` the unique id.
>  * `status` - whether it has been:
>  1. *XSPLICE_STATUS_LOADED* (0) has been loaded.
>  2. *XSPLICE_STATUS_PROGRESS* (1) acting on the **XEN_SYSCTL_XSPLICE_ACTION** command.
>  3. *XSPLICE_STATUS_CHECKED*  (2) the ELF payload safety checks passed.
>  4. *XSPLICE_STATUS_APPLIED* (3) loaded, checked, and applied.
>  5. *XSPLICE_STATUS_REVERTED* (4) loaded, checked, applied and then also reverted.
>  6. *XSPLICE_STATUS_IN_ERROR* (5) loaded and in a failed state. Consult `rc` for details.
>  * `rc` - its error state if any.

This duplication of status and rc looks odd to me, the more that only
appears to mean an error. How about using negative status values
as error indicators?

> ## Sequence of events.
> 
> The normal sequence of events is to:
> 
>  1. *XEN_SYSCTL_XSPLICE_UPLOAD* to upload the payload. If there are errors 
> *STOP* here.
>  2. *XEN_SYSCTL_XSPLICE_GET* to check the `->status`. If in 
> *XSPLICE_STATUS_PROGRESS* spin. If in *XSPLICE_STATUS_LOADED* go to next 
> step.
>  3. *XEN_SYSCTL_XSPLICE_ACTION* with *XSPLICE_ACTION_CHECK* command to 
> verify that the payload can be successfully applied.
>  4. *XEN_SYSCTL_XSPLICE_GET* to check the `->status`. If in 
> *XSPLICE_STATUS_PROGRESS* spin. If in *XSPLICE_STATUS_CHECKED* go to next 
> step.
>  5. *XEN_SYSCTL_XSPLICE_ACTION* with *XSPLICE_ACTION_APPLY* to apply the 
> patch.
>  6. *XEN_SYSCTL_XSPLICE_GET* to check the `->status`. If in 
> *XSPLICE_STATUS_PROGRESS* spin. If in *XSPLICE_STATUS_APPLIED* exit with 
> success.

Is this meant to imply that e.g. ACTION_APPLY will fail unless the
respective payload is in CHECKED state?

> ### Alternative assembler
> 
> Alternative assembler is a mechanism to use different instructions depending
> on what the CPU supports. This is done by providing multiple streams of code
> that can be patched in - or if the CPU does not support it - padded with
> `nop` operations. The alternative assembler macros cause the compiler to
> expand the code, placing the most generic version in line, and emit a special
> ELF section header to tag this location. At run-time the hypervisor
> can leave the areas alone or patch them with better suited opcodes.
> 
> As we might be patching the alternative assembler sections as well - by
> providing new, better suited opcodes or perhaps nops - we need to
> also re-run the alternative assembler patching after we have done our
> patching.

These sections are part of .init.* and as such can't reasonably be
subject to patching.

> ### .bss sections
> 
> Patching writable data is not suitable as it is unclear what should be done
> depending on the current state of data. As such it should not be attempted.

This certainly also applies to .data.

> ### Time rendezvous code instead of stop_machine for patching
> 
> The hypervisor's time rendezvous code runs synchronously across all CPUs
> every second. Using the stop_machine to patch can stall the time rendezvous
> code and result in NMI. As such having the patching be done at the tail
> of rendezvous code should avoid this problem.

I know this was brought up on the hackathon, but did you verify this
to be a problem? Because if it is, we already have an issue here as
we already use stop_machine for things like CPU offlining.

Jan


* Re: [RFC v2] xSplice design
  2015-05-15 19:44 [RFC v2] xSplice design Konrad Rzeszutek Wilk
  2015-05-18 12:41 ` Jan Beulich
@ 2015-05-18 12:54 ` Liuqiming (John)
  2015-05-18 13:11   ` Daniel Kiper
  2015-06-05 14:50   ` Konrad Rzeszutek Wilk
  2015-05-19 19:13 ` Lars Kurth
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 37+ messages in thread
From: Liuqiming (John) @ 2015-05-18 12:54 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Elena Ufimtseva, jeremy, hanweidong, jbeulich, Paul Voccio,
	Daniel Kiper, Major Hayden, liuyingdong, aliguori, konrad,
	xiantao.zxt, lars.kurth, Steven Wilson, peter.huangpeng, msw,
	xen-devel, Rick Harris, boris.ostrovsky, Josh Kearney,
	jinsong.liu, Antony Messerli, fanhenglong, andrew.cooper3

Hi Konrad,

Will this design include the hotpatch build toolchain?
For example, how are these .xsplice_ sections created? How are Xen symbols handled when creating the hotpatch ELF file?

On 2015/5/16 3:44, Konrad Rzeszutek Wilk wrote:
> Hey!
>
> During the Xen Hacka^H^H^H^HProject Summit? we chatted about live-patching
> the hypervisor. We sketched out how it could be done, and brainstormed
> some of the problems.
>
> I took that and wrote an design - which is very much RFC. The design is
> laid out in two sections - the format of the ELF payload - and then the
> hypercalls to act on it.
>
> Hypercall preemption has caused a couple of XSAs so I've baked the need
> for that in the design so we hopefully won't have an XSA for this code.
>
> There are two big *TODO* in the design which I had hoped to get done
> before sending this out - however I am going on vacation for two weeks
> so I figured it would be better to send this off for folks to mull now
> then to have it languish.
>
> Please feel free to add more folks on the CC list.
>
> Enjoy!
>
>
> # xSplice Design v1 (EXTERNAL RFC v2)
>
> ## Rationale
>
> A mechanism is required to binarily patch the running hypervisor with new
> opcodes that have come about due to primarily security updates.
>
> This document describes the design of the API that would allow us to
> upload to the hypervisor binary patches.
>
> ## Glossary
>
>   * splice - patch in the binary code with new opcodes
>   * trampoline - a jump to a new instruction.
>   * payload - telemetries of the old code along with binary blob of the new
>     function (if needed).
>   * reloc - telemetries contained in the payload to construct proper trampoline.
>
> ## Multiple ways to patch
>
> The mechanism needs to be flexible to patch the hypervisor in multiple ways
> and be as simple as possible. The compiled code is contiguous in memory with
> no gaps - so we have no luxury of 'moving' existing code and must either
> insert a trampoline to the new code to be executed - or only modify in-place
> the code if there is sufficient space. The placement of new code has to be done
> by the hypervisor, and the virtual address for the new code is allocated dynamically.
>
> This implies that the hypervisor must compute the new offsets when splicing
> in the new trampoline code. Where the trampoline is added (inside
> the function we are patching or just the callers?) is also important.
>
> To lessen the amount of code in hypervisor, the consumer of the API
> is responsible for identifying which mechanism to employ and how many locations
> to patch. Combinations of modifying code in-place, adding trampolines, etc.,
> have to be supported. The API should allow reading/writing any memory within
> the hypervisor virtual address space.
>
> We must also have a mechanism to query what has been applied and a mechanism
> to revert it if needed.
>
> We must also have a mechanism to: provide a copy of the old code - so that
> the hypervisor can verify it against the code in memory; the new code;
> the symbol name of the function to be patched; or offset from the symbol;
> or virtual address.
>
> The complications that this design will encounter are explained later
> in this document.
>
> ## Patching code
>
> The first mechanism to patch that comes to mind is in-place replacement.
> That is, replace the affected code with new code. Unfortunately the x86
> ISA has variable-length instructions, which places limits on how much space
> we have available to replace the instructions.
>
> The second mechanism is by replacing the call or jump to the
> old function with the address of the new function.
>
> A third mechanism is to add a jump to the new function at the
> start of the old function.
>
> ### Example of trampoline and in-place splicing
>
> As example we will assume the hypervisor does not have XSA-132 (see
> *domctl/sysctl: don't leak hypervisor stack to toolstacks*
> 4ff3449f0e9d175ceb9551d3f2aecb59273f639d) and we would like to binary patch
> the hypervisor with it. The original code looks as so:
>
> <pre>
>     48 89 e0                  mov    %rsp,%rax
>     48 25 00 80 ff ff         and    $0xffffffffffff8000,%rax
> </pre>
>
> while the new patched hypervisor would be:
>
> <pre>
>     48 c7 45 b8 00 00 00 00   movq   $0x0,-0x48(%rbp)
>     48 c7 45 c0 00 00 00 00   movq   $0x0,-0x40(%rbp)
>     48 c7 45 c8 00 00 00 00   movq   $0x0,-0x38(%rbp)
>     48 89 e0                  mov    %rsp,%rax
>     48 25 00 80 ff ff         and    $0xffffffffffff8000,%rax
> </pre>
>
> This is inside arch_do_domctl. This new change adds 21 extra
> bytes of code which alters all the offsets inside the function. To alter
> these offsets and add the extra 21 bytes of code we might not have enough
> space in .text to squeeze this in.
>
> As such we could simplify this problem by only patching the site
> which calls arch_do_domctl:
>
> <pre>
> <do_domctl>:
>   e8 4b b1 05 00          callq  ffff82d08015fbb9 <arch_do_domctl>
> </pre>
>
> with a new address for where the new `arch_do_domctl` would be (this
> area would be allocated dynamically).
>
> Astute readers will wonder what we need to do if we were to patch `do_domctl`
> - which is not called directly by the hypervisor but on behalf of the guests via
> the `compat_hypercall_table` and `hypercall_table`.
> Patching the offset in `hypercall_table` for `do_domctl`:
> (ffff82d080103079 <do_domctl>:)
> <pre>
>
>   ffff82d08024d490:   79 30
>   ffff82d08024d492:   10 80 d0 82 ff ff
>
> </pre>
> with the new address of the new `do_domctl` is possible. The other
> place where it is used is in `hvm_hypercall64_table` which would need
> to be patched in a similar way. This would require an in-place splicing
> of the new virtual address of `do_domctl`.
>
> In summary this example patched the callee of the affected function by
>   * allocating memory for the new code to live in,
>   * changing the virtual address of all the functions which called the old
>     code (computing the new offset, patching the callq with a new callq).
>   * changing the function pointer tables with the new virtual address of
>     the function (splicing in the new virtual address). Since this table
>     resides in the .rodata section we would need to temporarily change the
>     page table permissions during this part.
>
>
> However this has severe drawbacks - the safety checks, which have to make sure
> the function is not on the stack, must also check every caller. For some
> patches, if there were a sufficiently large number of callers,
> we would never be able to apply the update.
>
> ### Example of different trampoline patching.
>
> An alternative mechanism exists where we can insert a trampoline in the
> existing function to be patched to jump directly to the new code. This
> reduces the locations to be patched to one, but it puts pressure on the
> CPU branching logic (I-cache, though it is just one unconditional jump).
>
> For this example we will assume that the hypervisor has not been compiled
> with fe2e079f642effb3d24a6e1a7096ef26e691d93e (XSA-125: *pre-fill structures
> for certain HYPERVISOR_xen_version sub-ops*) which mem-sets a structure
> in `xen_version` hypercall. This function is not called **anywhere** in
> the hypervisor (it is called by the guest) but referenced in the
> `compat_hypercall_table` and `hypercall_table` (and indirectly called
> from that). Patching the offset in `hypercall_table` for the old
> `do_xen_version` (ffff82d080112f9e <do_xen_version>)
>
> <pre>
>   ffff82d08024b270 <hypercall_table>
>   ...
>   ffff82d08024b2f8:   9e 2f 11 80 d0 82 ff ff
>
> </pre>
> with the new address of the new `do_xen_version` is possible. The other
> place where it is used is in `hvm_hypercall64_table` which would need
> to be patched in a similar way. This would require an in-place splicing
> of the new virtual address of `do_xen_version`.
>
> An alternative solution would be to insert a trampoline in the
> old `do_xen_version` function to directly jump to the new `do_xen_version`.
>
> <pre>
>   ffff82d080112f9e <do_xen_version>:
>   ffff82d080112f9e:       48 c7 c0 da ff ff ff    mov    $0xffffffffffffffda,%rax
>   ffff82d080112fa5:       83 ff 09                cmp    $0x9,%edi
>   ffff82d080112fa8:       0f 87 24 05 00 00       ja     ffff82d0801134d2 <do_xen_version+0x534>
> </pre>
>
> with:
>
> <pre>
>   ffff82d080112f9e <do_xen_version>:
>   ffff82d080112f9e:       e9 XX YY ZZ QQ          jmpq   [new do_xen_version]
> </pre>
>
> which would lessen the amount of patching to just one location.
>
> In summary this example patched the affected function to jump to the
> new replacement function which required:
>   * allocating memory for the new code to live in,
>   * inserting a trampoline with the new offset in the old function to point to the
>     new function.
>   * Optionally we can insert in the old function a trampoline jump to a function
>     providing a BUG_ON to catch errant code.
>
> The disadvantage of this is that the unconditional jump incurs a small
> I-cache penalty. However the simplicity of the patching and of the safety
> checks makes this a worthwhile option.
>
> ### Security
>
> With this method we can re-write the hypervisor - and as such we **MUST** be
> diligent in only allowing certain guests to perform this operation.
>
> Furthermore with SecureBoot or tboot, we **MUST** also verify the signature
> of the payload to be certain it came from a trusted source.
>
> As such the hypercall **MUST** support an XSM policy to limit which
> guests are allowed to perform it. If the system is booted with signature
> checking, the signature checking will be enforced.
>
> ## Payload format
>
> The payload **MUST** contain enough data to allow us to apply the update
> and also safely reverse it. As such we **MUST** know:
>
>   * What the old code is expected to be. We **MUST** verify it against the
>     runtime code.
>   * The locations in memory to be patched. This can be determined dynamically
>     via symbols or via virtual addresses.
>   * The new code to be used.
>   * Signature to verify the payload.
>
> This payload could be constructed using a custom binary format, but
> there are severe disadvantages to that:
>
>   * The format might need to change, and we need a mechanism to accommodate
>     that.
>   * It has to be platform agnostic.
>   * Easily constructed using existing tools.
>
> As such having the payload in an ELF file is the sensible way. We would be
> carrying the various sets of structures (and data) in the ELF sections under
> different names and with definitions. The prefix for the ELF section name
> would always be: *.xsplice_*
>
> Note that every structure has padding. This is added so that the hypervisor
> can re-use those fields as it sees fit.
>
> There are five *.xsplice_* sections:
>
>   * `.xsplice_symbols` and `.xsplice_str`. The array of symbols to be referenced
>     during the update. This can contain the symbols (functions) that will be
>     patched, or the list of symbols (functions) to be checked pre-patching which
>     may not be on the stack.
>
>   * `.xsplice_reloc` and `.xsplice_reloc_howto`. How to properly construct
>     trampolines for a patch. We can have multiple locations for which we
>     need to insert a trampoline for a payload, and each location might require
>     a different way of handling it. This would naturally reference the `.text`
>     section and its proper offset. The `.xsplice_reloc` is not directly concerned
>     with patches but rather is an ELF relocation - describing the target
>     of a relocation and how that is performed.  They're also used for where
>     the new code references the running code too.
>
>   * `.xsplice_sections`. The safety data for the old code and new code.
>     This contains an array of symbols (pointing to `.xsplice_symbols`
>     and `.text`) which are to be used during safety and dependency checking.
>
>
>   * `.xsplice_patches`: The description of the new functions to be patched
>     in (size, type, pointer to code, etc.).
>
>   * `.xsplice_change`. The structure that ties all of this together and defines
>     the payload.
>
> Additionally the ELF file would contain:
>
>   * `.text` section for the new and old code (function).
>   * `.rela.text` relocation data for the `.text` (both new and old).
>   * `.rela.xsplice_patches` relocation data for `.xsplice_patches` (such as offset
>     to the `.text` ,`.xsplice_symbols`, or `.xsplice_reloc` section).
>   * `.bss` section for the new code (function)
>   * `.data` and `.data.read_mostly` section for the new and old code (function)
>   * `.rodata` section for the new and old code (function).
>
> In short the *.xsplice_* sections represent various structures and the
> ELF provides the mechanism to glue it all together when loaded in memory.
>
> Note that a lot of these ideas are borrowed from kSplice which is
> available at: https://github.com/jirislaby/ksplice
>
> For ELF understanding the best starting point is the OSDev Wiki
> (http://wiki.osdev.org/ELF). Furthermore the ELF specification is
> at http://www.skyfree.org/linux/references/ELF_Format.pdf and
> at Oracle's web site:
> http://docs.oracle.com/cd/E23824_01/html/819-0690/chapter6-46512.html#scrolltoc
>
> ### ASCII art of the ELF structures
>
> *TODO*: Include an ASCII art of how the sections are tied together.
>
> ### xsplice_symbols
>
> The section contains an array of a structure that outlines the name
> of the symbol to be patched (or checked against). The structure is
> as follows:
>
> <pre>
> struct xsplice_symbol {
>      const char *name; /* The ELF name of the symbol. */
>      const char *label; /* A unique xSplice name for the symbol. */
>      uint8_t pad[16]; /* Must be zero. */
> };
> </pre>
> The structures may be in the section in any order and in any amount
> (duplicate entries are permitted).
>
> Both `name` and `label` would be pointing to entries in `.xsplice_str`.
>
> The `label` is used for diagnostic purposes - such as including the
> name and the offset.
>
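To illustrate how such an array might be consumed - a hypothetical lookup helper, not part of the proposed ABI:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct xsplice_symbol {
    const char *name;   /* The ELF name of the symbol. */
    const char *label;  /* A unique xSplice name for the symbol. */
    uint8_t pad[16];    /* Must be zero. */
};

/* Hypothetical helper: return the first entry matching an ELF name, or
 * NULL.  Since duplicates are permitted, a caller wanting every match
 * would continue scanning from the returned entry. */
static const struct xsplice_symbol *
find_symbol(const struct xsplice_symbol *syms, size_t n, const char *name)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp(syms[i].name, name) == 0)
            return &syms[i];
    return NULL;
}
```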
> ### xsplice_reloc and xsplice_reloc_howto
>
> The section contains an array of a structure that outlines the different
> locations (and howto) for which a trampoline is to be inserted.
>
> The howto defines the change in detail. It contains the type,
> whether the relocation is relative, the size of the relocation,
> a bitmask for which parts of the instruction or data are to be replaced,
> the amount the final relocation is shifted by (to drop unwanted data), and
> whether the replacement should be interpreted as a signed value.
>
> The structure is as follows:
>
> <pre>
> #define XSPLICE_HOWTO_RELOC_INLINE  0 /* Inline replacement. */
> #define XSPLICE_HOWTO_RELOC_PATCH   1 /* Add trampoline. */
> #define XSPLICE_HOWTO_RELOC_DATA    2 /*  __DATE__ type change. */
> #define XSPLICE_HOWTO_RELOC_TIME    3 /* __TIME__ type change. */
> #define XSPLICE_HOWTO_BUG           4 /* BUG_ON being replaced.*/
> #define XSPLICE_HOWTO_EXTABLE       5 /* exception_table change. */
> #define XSPLICE_HOWTO_SYMBOL        6 /* change in symbol table. */
>
> #define XSPLICE_HOWTO_FLAG_PC_REL    0x00000001 /* Is PC relative. */
> #define XSPLICE_HOWOT_FLAG_SIGN      0x00000002 /* Should the new value be treated as signed value. */
>
> struct xsplice_reloc_howto {
>      uint32_t    type; /* XSPLICE_HOWTO_* */
>      uint32_t    flag; /* XSPLICE_HOWTO_FLAG_* */
>      uint32_t    size; /* Size, in bytes, of the item to be relocated. */
>      uint32_t    r_shift; /* The value the final relocation is shifted right by; used to drop unwanted data from the relocation. */
>      uint64_t    mask; /* Bitmask for which parts of the instruction or data are replaced with the relocated value. */
>      uint8_t     pad[8]; /* Must be zero. */
> };
>
> </pre>
>
> This structure is used in:
>
> <pre>
> struct xsplice_reloc {
>      uint64_t addr; /* The address of the relocation (if known). */
>      struct xsplice_symbol *symbol; /* Symbol for this relocation. */
>      struct xsplice_reloc_howto  *howto; /* Pointer to the above structure. */
>      uint64_t isns_added; /* ELF addend resulting from quirks of instruction one of whose operands is the relocation. For example, this is -4 on x86 pc-relative jumps. */
>      uint64_t isns_target; /* rest of the ELF addend.  This is equal to the offset against the symbol that the relocation refers to. */
>      uint8_t pad[8];  /* Must be zero. */
> };
> </pre>
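To make the `mask`, `r_shift` and flag fields concrete, here is a hedged sketch of merging a relocated value into existing instruction bytes. It uses only a subset of the fields and is illustrative only - the real handling would differ per `type`:

```c
#include <assert.h>
#include <stdint.h>

#define XSPLICE_HOWTO_FLAG_PC_REL 0x00000001 /* Is PC relative. */

/* Subset of the howto fields used in this sketch. */
struct howto {
    uint32_t flag;     /* XSPLICE_HOWTO_FLAG_* */
    uint32_t r_shift;  /* final right shift to drop unwanted bits */
    uint64_t mask;     /* which bits receive the relocated value */
};

/* Illustrative: compute the relocated value for a target address and
 * merge it into the previous contents under `mask`. */
static uint64_t apply_howto(const struct howto *h, uint64_t old_bits,
                            uint64_t reloc_addr, uint64_t target)
{
    uint64_t v = target;

    if (h->flag & XSPLICE_HOWTO_FLAG_PC_REL)
        v -= reloc_addr;          /* displacement instead of absolute */
    v >>= h->r_shift;             /* drop unwanted low bits */
    return (old_bits & ~h->mask) | (v & h->mask);
}
```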
>
> ### xsplice_sections
>
> The structure defined in this section is used to verify that it is safe
> to update with the new changes. It can contain safety data on the old code
> and what kind of matching we are to expect.
>
> It can also contain safety data on what to check when about to patch.
> That is, whether any of the addresses (either provided, or resolved
> when the payload is loaded by referencing the symbols) contain in memory
> what we expect them to.
>
> As such the flags can be or-ed together:
>
> <pre>
> #define XSPLICE_SECTION_TEXT   0x00000001 /* Section is in .text */
> #define XSPLICE_SECTION_RODATA 0x00000002 /* Section is in .ro */
> #define XSPLICE_SECTION_DATA   0x00000004 /* Section is in .rodata */
> #define XSPLICE_SECTION_STRING 0x00000008 /* Section is in .str */
> #define XSPLICE_SECTION_ALTINSTRUCTIONS 0x00000010 /* Section has .altinstructions. */
> #define XSPLICE_SECTION_TEXT_INPLACE 0x00000200 /* Change is in place. */
> #define XSPLICE_SECTION_MATCH_EXACT 0x00000400 /* Must match exactly. */
> #define XSPLICE_SECTION_NO_STACKCHECK 0x00000800 /* Do not check the stack. */
>
> struct xsplice_section {
>      struct xsplice_symbol *symbol; /* The symbol associated with this change. */
>      uint64_t address; /* The address of the section (if known). */
>      uint64_t size; /* The size of the section. */
>      uint64_t flags; /* Various XSPLICE_SECTION_* flags. */
>      uint8_t pad[16]; /* To be zero. */
> };
>
> </pre>
>
> ### xsplice_patches
>
> Within this section we have an array of a structure defining the new code (patch).
>
> This structure consists of a pointer to the new code (which in ELF ends up
> pointing to an offset in the `.text` or `.data` section); the type of patch:
> inline - either text or data - or requesting a trampoline; and the size of the patch.
>
> The structure is as follows:
>
> <pre>
> #define XSPLICE_PATCH_INLINE_TEXT   0
> #define XSPLICE_PATCH_INLINE_DATA   1
> #define XSPLICE_PATCH_RELOC_TEXT    2
>
> struct xsplice_patch {
>      uint32_t type; /* XSPLICE_PATCH_* .*/
>      uint32_t size; /* Size of patch. */
>      uint64_t addr; /* The address of the new code (or data). */
>      void *content; /* The bytes to be installed. */
>      uint8_t pad[16]; /* Must be zero. */
> };
>
> </pre>
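A hedged sketch of consuming one such entry (hypothetical: a real implementation would verify the expected old bytes and temporarily make the target writable first):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define XSPLICE_PATCH_INLINE_TEXT   0
#define XSPLICE_PATCH_INLINE_DATA   1
#define XSPLICE_PATCH_RELOC_TEXT    2

struct xsplice_patch {
    uint32_t type;    /* XSPLICE_PATCH_* */
    uint32_t size;    /* Size of patch. */
    uint64_t addr;    /* The address of the new code (or data). */
    void *content;    /* The bytes to be installed. */
    uint8_t pad[16];  /* Must be zero. */
};

/* Illustrative only: install the patch bytes at the target address.
 * Real code would first map the target writable and compare the
 * expected old bytes against what is actually there. */
static int apply_patch(const struct xsplice_patch *p)
{
    if (p->type > XSPLICE_PATCH_RELOC_TEXT || p->size == 0)
        return -1;
    memcpy((void *)(uintptr_t)p->addr, p->content, p->size);
    return 0;
}
```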
>
> ### xsplice_code
>
> The structure embedded within this section ties it all together.
> It has the name of the patch, and pointers to all the above
> mentioned structures (the start and end addresses).
>
> The structure is as follows:
>
> <pre>
> struct xsplice_code {
>      const char *name; /* A sensible name for the patch. Up to 40 characters. */
>      struct xsplice_reloc *relocs, *relocs_end; /* How to patch it */
>      struct xsplice_section *sections, *sections_end; /* Safety data */
>      struct xsplice_patch *patches, *patches_end; /* Patch code & data */
>      uint8_t pad[32]; /* Must be zero. */
> };
> </pre>
>
> There should only be one such structure in the section.
>
> ### Example
>
> *TODO*: Include an objdump of what the ELF would look like for the XSA
> mentioned earlier.
>
> ## Signature checking requirements.
>
> The signature checking requires that the layout of the data in memory
> **MUST** be the same for the signature to be verified. This means that the payload
> data layout in ELF format **MUST** match what the hypervisor would be
> expecting such that it can properly do signature verification.
>
> The signature is based on all of the payload laid out contiguously
> in memory. The signature is to be appended at the end of the ELF payload,
> prefixed with the string `~Module signature appended~\n`, followed by
> a signature header, then followed by the signature, key identifier, and signer's
> name.
>
> Specifically the signature header would be:
>
> <pre>
> #define PKEY_ALGO_DSA       0
> #define PKEY_ALGO_RSA       1
>
> #define PKEY_ID_PGP         0 /* OpenPGP generated key ID */
> #define PKEY_ID_X509        1 /* X.509 arbitrary subjectKeyIdentifier */
>
> #define HASH_ALGO_MD4          0
> #define HASH_ALGO_MD5          1
> #define HASH_ALGO_SHA1         2
> #define HASH_ALGO_RIPE_MD_160  3
> #define HASH_ALGO_SHA256       4
> #define HASH_ALGO_SHA384       5
> #define HASH_ALGO_SHA512       6
> #define HASH_ALGO_SHA224       7
> #define HASH_ALGO_RIPE_MD_128  8
> #define HASH_ALGO_RIPE_MD_256  9
> #define HASH_ALGO_RIPE_MD_320 10
> #define HASH_ALGO_WP_256      11
> #define HASH_ALGO_WP_384      12
> #define HASH_ALGO_WP_512      13
> #define HASH_ALGO_TGR_128     14
> #define HASH_ALGO_TGR_160     15
> #define HASH_ALGO_TGR_192     16
>
>
> struct elf_payload_signature {
>     u8    algo;        /* Public-key crypto algorithm PKEY_ALGO_*. */
>     u8    hash;        /* Digest algorithm: HASH_ALGO_*. */
>     u8    id_type;    /* Key identifier type PKEY_ID_*. */
>     u8    signer_len;    /* Length of signer's name */
>     u8    key_id_len;    /* Length of key identifier */
>     u8    __pad[3];
>     __be32    sig_len;    /* Length of signature data */
> };
>
> </pre>
> (Note that this has been borrowed from the Linux module signature code.)
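>
> For illustration, a tool-side consumer could locate the signature header by
> scanning for the marker string within the payload. This is only a sketch of
> the layout described above (the function is hypothetical; a real
> implementation would also parse the header fields and verify the lengths):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SIG_MAGIC "~Module signature appended~\n"

/* Hypothetical helper: return the offset just past the marker string
 * (i.e. where the signature header would start), or -1 if no marker is
 * present in the payload.  A naive forward scan is used for clarity. */
static long find_sig_header(const uint8_t *payload, size_t len)
{
    size_t mlen = strlen(SIG_MAGIC);

    for (size_t off = 0; off + mlen <= len; off++)
        if (!memcmp(payload + off, SIG_MAGIC, mlen))
            return (long)(off + mlen);
    return -1;
}
```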
>
>
> ## Hypercalls
>
> We will employ the sub-operations of the system management hypercall (sysctl).
> There are to be four sub-operations:
>
>   * upload the payloads.
>   * list the summaries of the uploaded payloads and their state.
>   * get a particular payload's summary and its state.
>   * command to apply, delete, or revert the payload.
>
> The patching is asynchronous; therefore the caller is responsible
> for verifying that it has been applied properly by retrieving the summary
> and checking that there are no error codes associated with the payload.
>
> We **MUST** make it asynchronous due to the nature of patching: it requires
> every physical CPU to be in lock-step with each other. The patching mechanism,
> while an implementation detail, is not a short operation, and as such
> the design **MUST** assume it will be a long-running operation.
>
> Furthermore it is possible to have multiple different payloads for the same
> function. As such a unique id has to be visible to allow proper manipulation.
>
> The hypercall is part of the `xen_sysctl`. The top level structure contains
> one uint32_t to determine the sub-operations:
>
> <pre>
> struct xen_sysctl_xsplice_op {
>      uint32_t cmd;
>      union {
>           ... see below ...
>      } u;
> };
>
> </pre>
> while the rest of the hypercall-specific structures are members of the union
> within this structure.
>
>
> ### XEN_SYSCTL_XSPLICE_UPLOAD (0)
>
> Upload a payload to the hypervisor. The payload is verified and if there
> are any issues the proper return code will be returned. The payload is
> not applied at this time - that is controlled by *XEN_SYSCTL_XSPLICE_ACTION*.
>
> The caller provides:
>
>   * `id` unique id.
>   * `payload` the virtual address of where the ELF payload is.
>
> The return value is zero if the payload was successfully uploaded and the
> signature was verified. Otherwise an EXX return value is provided.
> Duplicate `id`s are not supported.
>
> The `payload` is the ELF payload as mentioned in the `Payload format` section.
>
> The structure is as follows:
>
> <pre>
> struct xen_sysctl_xsplice_upload {
>      char id[40];  /* IN, name of the patch. */
>      uint64_t size; /* IN, size of the ELF file. */
>      XEN_GUEST_HANDLE_64(uint8) payload; /* ELF file. */
> };
> </pre>
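>
> A toolstack sketch for filling in the fixed-size `id` field (the helper is
> hypothetical; it merely enforces the limit implied by `char id[40]`,
> reserving one byte for the NUL terminator, and zero-fills the rest):

```c
#include <string.h>

#define XSPLICE_ID_SIZE 40   /* Matches char id[40] above. */

/* Hypothetical helper: copy a patch name into the fixed-size id field,
 * rejecting names that do not fit (the NUL terminator must fit too).
 * Returns 0 on success, -1 if the name is too long. */
static int set_patch_id(char id[XSPLICE_ID_SIZE], const char *name)
{
    if (strlen(name) >= XSPLICE_ID_SIZE)
        return -1;
    memset(id, 0, XSPLICE_ID_SIZE);   /* Zero-fill the unused tail. */
    strcpy(id, name);
    return 0;
}
```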
>
> ### XEN_SYSCTL_XSPLICE_GET (1)
>
> Retrieve a summary of a specific payload. The caller provides:
>
>   * `id` the unique id.
>   * `status` *MUST* be set to zero.
>   * `rc` *MUST* be set to zero.
>
> The `summary` structure contains a summary of the payload, which includes:
>
>   * `id` the unique id.
>   * `status` - whether it has been:
>   1. *XSPLICE_STATUS_LOADED* (0) has been loaded.
>   2. *XSPLICE_STATUS_PROGRESS* (1) acting on the **XEN_SYSCTL_XSPLICE_ACTION** command.
>   3. *XSPLICE_STATUS_CHECKED*  (2) the ELF payload safety checks passed.
>   4. *XSPLICE_STATUS_APPLIED* (3) loaded, checked, and applied.
>   5. *XSPLICE_STATUS_REVERTED* (4) loaded, checked, applied and then also reverted.
>   6. *XSPLICE_STATUS_IN_ERROR* (5) loaded and in a failed state. Consult `rc` for details.
>   * `rc` - its error state if any.
>
> The structure is as follows:
>
> <pre>
> #define XSPLICE_STATUS_LOADED    0
> #define XSPLICE_STATUS_PROGRESS  1
> #define XSPLICE_STATUS_CHECKED   2
> #define XSPLICE_STATUS_APPLIED   3
> #define XSPLICE_STATUS_REVERTED  4
> #define XSPLICE_STATUS_IN_ERROR  5
>
> struct xen_sysctl_xsplice_summary {
>      char id[40];  /* IN/OUT, name of the patch. */
>      uint32_t status;   /* OUT */
>      int32_t rc;  /* OUT */
> };
> </pre>
>
> ### XEN_SYSCTL_XSPLICE_LIST (2)
>
> Retrieve an array of abbreviated summaries of the payloads that are loaded in
> the hypervisor.
>
> The caller provides:
>
>   * `idx` index iterator. Initially it *MUST* be zero.
>   * `count` the max number of entries to populate.
>   * `summary` virtual address of where to write payload summaries.
>
> The hypercall returns zero on success and updates the `idx` (index) iterator
> with the number of payloads returned, `count` to the number of remaining
> payloads, and `summary` with that number of payload summaries.
>
> If the hypercall returns E2BIG the `count` is too big and should be
> lowered.
>
> Note that due to the asynchronous nature of hypercalls the domain might have
> added or removed payloads, making this information stale. It is the
> responsibility of the domain to provide proper accounting.
>
> The `summary` structure contains a summary of the payload, which includes:
>
>   * `id` unique id.
>   * `status` - whether it has been:
>   1. *XSPLICE_STATUS_LOADED* (0) has been loaded.
>   2. *XSPLICE_STATUS_PROGRESS* (1) acting on the **XEN_SYSCTL_XSPLICE_ACTION** command.
>   3. *XSPLICE_STATUS_CHECKED*  (2) the payload `old` and `addr` match with the hypervisor.
>   4. *XSPLICE_STATUS_APPLIED* (3) loaded, checked, and applied.
>   5. *XSPLICE_STATUS_REVERTED* (4) loaded, checked, applied and then also reverted.
>   6. *XSPLICE_STATUS_IN_ERROR* (5) loaded and in a failed state. Consult `rc` for details.
>   * `rc` - its error state if any.
>
> The structure is as follows:
>
> <pre>
> struct xen_sysctl_xsplice_list {
>      uint32_t idx;  /* IN/OUT */
>      uint32_t count;  /* IN/OUT */
>      XEN_GUEST_HANDLE_64(xen_sysctl_xsplice_summary) summary;  /* OUT */
> };
>
> struct xen_sysctl_xsplice_summary {
>      char id[40];  /* OUT, name of the patch. */
>      uint32_t status;   /* OUT */
>      int32_t rc;  /* OUT */
> };
>
> </pre>
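>
> The iteration protocol above could be driven from the toolstack roughly as
> follows. This is a sketch against a mocked hypercall: `do_list_mock` stands
> in for **XEN_SYSCTL_XSPLICE_LIST** and pretends the hypervisor holds `total`
> payloads; the batch size is lowered on E2BIG as described:

```c
#include <errno.h>
#include <stdint.h>

/* Mock of the LIST sub-op: on input *idx is the iterator and *count the
 * requested batch size; on success *idx advances by the number of entries
 * returned and *count is set to how many payloads remain. */
static int do_list_mock(uint32_t total, uint32_t *idx, uint32_t *count)
{
    uint32_t left, got;

    if (*count > 64)               /* Pretend 64 is the hypervisor's limit. */
        return E2BIG;
    left = total - *idx;
    got = *count < left ? *count : left;
    *idx += got;
    *count = left - got;
    return 0;
}

/* Caller-side loop: fetch all summaries, halving the batch on E2BIG. */
static uint32_t list_all(uint32_t total)
{
    uint32_t idx = 0, batch = 256;

    for (;;) {
        uint32_t count = batch;
        int rc = do_list_mock(total, &idx, &count);

        if (rc == E2BIG) {         /* Too big a batch; lower and retry. */
            batch /= 2;
            continue;
        }
        if (count == 0)            /* No payloads remain; we are done. */
            return idx;
    }
}
```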
> ### XEN_SYSCTL_XSPLICE_ACTION (3)
>
> Perform an operation on the payload structure referenced by the `id` field.
> The operation request is asynchronous and the status should be retrieved
> by using either **XEN_SYSCTL_XSPLICE_GET** or **XEN_SYSCTL_XSPLICE_LIST** hypercall.
>
> The caller provides:
>
>   * `id` the unique id.
>   * `cmd` the command requested:
>    1. *XSPLICE_ACTION_CHECK* (0) check that the payload will apply properly.
>    2. *XSPLICE_ACTION_UNLOAD* (1) unload the payload.
>    3. *XSPLICE_ACTION_REVERT* (2) revert the payload.
>    4. *XSPLICE_ACTION_APPLY* (3) apply the payload.
>
>
> The return value will be zero unless the provided fields are incorrect.
>
> The structure is as follows:
>
> <pre>
> #define XSPLICE_ACTION_CHECK  0
> #define XSPLICE_ACTION_UNLOAD 1
> #define XSPLICE_ACTION_REVERT 2
> #define XSPLICE_ACTION_APPLY  3
>
> struct xen_sysctl_xsplice_action {
>      char id[40];  /* IN, name of the patch. */
>      uint32_t cmd; /* IN */
> };
>
> </pre>
>
> ## Sequence of events.
>
> The normal sequence of events is to:
>
>   1. *XEN_SYSCTL_XSPLICE_UPLOAD* to upload the payload. If there are errors *STOP* here.
>   2. *XEN_SYSCTL_XSPLICE_GET* to check the `->status`. If in *XSPLICE_STATUS_PROGRESS* spin. If in *XSPLICE_STATUS_LOADED* go to next step.
>   3. *XEN_SYSCTL_XSPLICE_ACTION* with *XSPLICE_ACTION_CHECK* command to verify that the payload can be successfully applied.
>   4. *XEN_SYSCTL_XSPLICE_GET* to check the `->status`. If in *XSPLICE_STATUS_PROGRESS* spin. If in *XSPLICE_STATUS_CHECKED* go to next step.
>   5. *XEN_SYSCTL_XSPLICE_ACTION* with *XSPLICE_ACTION_APPLY* to apply the patch.
>   6. *XEN_SYSCTL_XSPLICE_GET* to check the `->status`. If in *XSPLICE_STATUS_PROGRESS* spin. If in *XSPLICE_STATUS_APPLIED* exit with success.
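>
> Steps 3 and 4 of the loop above can be sketched as follows, with the two
> hypercalls mocked out (the function names are hypothetical; a real toolstack
> would issue the sysctl and sleep between polls rather than spin):

```c
#define XSPLICE_STATUS_PROGRESS  1
#define XSPLICE_STATUS_CHECKED   2
#define XSPLICE_STATUS_APPLIED   3

/* Poll the (mocked) XEN_SYSCTL_XSPLICE_GET until the payload leaves
 * XSPLICE_STATUS_PROGRESS; return 0 if it settled in `wanted`. */
static int wait_for_status(int (*get_status)(void), int wanted)
{
    int status;

    do {
        status = get_status();     /* XEN_SYSCTL_XSPLICE_GET */
    } while (status == XSPLICE_STATUS_PROGRESS);
    return status == wanted ? 0 : -1;
}

/* Mock hypervisor: report PROGRESS twice, then CHECKED - as if an
 * XSPLICE_ACTION_CHECK had been requested and eventually completed. */
static int mock_progress_left = 2;
static int mock_get_status(void)
{
    if (mock_progress_left-- > 0)
        return XSPLICE_STATUS_PROGRESS;
    return XSPLICE_STATUS_CHECKED;
}
```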
>
>
> ## Addendum
>
> Implementation quirks should not be discussed in a design document.
>
> However, these observations can provide aid when developing against this
> document.
>
>
> ### Alternative assembler
>
> Alternative assembler is a mechanism to use different instructions depending
> on what the CPU supports. This is done by providing multiple streams of code
> that can be patched in - or, if the CPU does not support it, padded with
> `nop` operations. The alternative assembler macros cause the compiler to
> place the most generic code in place and emit a special ELF section to tag
> this location. During run-time the hypervisor can leave the areas alone or
> patch them with better-suited opcodes.
>
> As we might be patching the alternative assembler sections as well - by
> providing new better-suited opcodes or perhaps nops - we need to also
> re-run the alternative assembler patching after we have done our patching.
>
> Also, when we are doing safety checks, the code we are checking might be
> utilizing alternative assembler. As such we should relax our checks to
> accommodate that.
>
> ### .rodata sections
>
> The patching might require strings to be updated as well, so we must also be
> able to patch the strings as needed. This sounds simple - but the compiler
> has a habit of coalescing identical strings - which means that if we alter
> a string in place, other users of it will be inadvertently affected as well.
>
> This is also where pointers to functions live - and we may need to patch these
> as well.
>
> To guard against that we must be prepared to do patching similar to
> trampoline patching or in-line depending on the flavour. If we can
> do in-line patching we would need to:
>
>   * alter `.rodata` to be writeable.
>   * inline patch.
>   * alter `.rodata` to be read-only.
>
> If we are doing trampoline patching we would need to:
>
>   * allocate a new memory location for the string.
>   * all locations which use this string will have to be updated to use the
>     offset to the string.
>   * mark the region RO when we are done.
>
> ### .bss sections
>
> Patching writable data is not suitable, as it is unclear what should be done
> given the current state of the data. As such it should not be attempted.
>
>
> ### Patching code which is in the stack.
>
> We should not patch code which is currently on the stack (i.e., in the middle
> of being executed), as that can lead to corruption.
>
> ### Trampoline (e9 opcode)
>
> The e9 opcode used for jmpq takes a 32-bit signed displacement. That means
> we are limited to placing the new code within 2GB of the old code. That
> should not be a problem since the Xen hypervisor has a very small footprint.
>
> However, if we need to, we can always chain two trampolines: one within the
> 2GB limit that jumps to the next trampoline.
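>
> A sketch of building the 5-byte trampoline and the range check (the helper
> is hypothetical; note the displacement is relative to the end of the jmp
> instruction, hence the `+ 5`):

```c
#include <stdint.h>
#include <string.h>

/* Encode `jmp rel32` (opcode 0xe9) at address `from` targeting `to`.
 * Returns 0 on success, -1 if the target is outside the +/-2GB range
 * that a 32-bit signed displacement can express. */
static int make_jmp_rel32(uint8_t insn[5], uint64_t from, uint64_t to)
{
    int64_t disp = (int64_t)to - (int64_t)(from + 5);
    int32_t d32 = (int32_t)disp;

    if (disp != d32)
        return -1;                 /* Would need a chained trampoline. */
    insn[0] = 0xe9;
    memcpy(&insn[1], &d32, 4);     /* Little-endian displacement on x86. */
    return 0;
}
```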
>
> ### Time rendezvous code instead of stop_machine for patching
>
> The hypervisor's time rendezvous code runs synchronously across all CPUs
> every second. Using stop_machine to patch can stall the time rendezvous
> code and result in an NMI. As such, having the patching done at the tail
> of the rendezvous code should avoid this problem.
>
> ### Security
>
> Only the privileged domain should be allowed to do this operation.
>
>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC v2] xSplice design
  2015-05-18 12:54 ` Liuqiming (John)
@ 2015-05-18 13:11   ` Daniel Kiper
  2015-06-05 14:50   ` Konrad Rzeszutek Wilk
  1 sibling, 0 replies; 37+ messages in thread
From: Daniel Kiper @ 2015-05-18 13:11 UTC (permalink / raw)
  To: Liuqiming (John)
  Cc: Elena Ufimtseva, jeremy, hanweidong, jbeulich, Paul Voccio,
	xen-devel, Major Hayden, liuyingdong, aliguori, konrad,
	lars.kurth, Steven Wilson, peter.huangpeng, msw, xiantao.zxt,
	Rick Harris, boris.ostrovsky, Josh Kearney, jinsong.liu,
	Antony Messerli, fanhenglong, andrew.cooper3

On Mon, May 18, 2015 at 08:54:22PM +0800, Liuqiming (John) wrote:
> Hi Konrad,
>
> Will this design include hotpatch build tools chain?
> Such as how these .xplice_ section are created?
> How to handle xen symbols when creating hotpatch elf file?

No, this is not a main goal of this project. However, it was agreed that
there will be a simple tool showing how to build patches. Additionally,
it will give us a chance to verify the correctness of the design and code
in the hypervisor itself.

Daniel

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC v2] xSplice design
  2015-05-15 19:44 [RFC v2] xSplice design Konrad Rzeszutek Wilk
  2015-05-18 12:41 ` Jan Beulich
  2015-05-18 12:54 ` Liuqiming (John)
@ 2015-05-19 19:13 ` Lars Kurth
  2015-05-20 15:11 ` Martin Pohlack
  2015-06-12 11:39 ` Martin Pohlack
  4 siblings, 0 replies; 37+ messages in thread
From: Lars Kurth @ 2015-05-19 19:13 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk, msw, aliguori, Antony Messerli,
	Rick Harris, Paul Voccio, Steven Wilson, Major Hayden,
	Josh Kearney, jinsong.liu, xiantao.zxt, boris.ostrovsky,
	Daniel Kiper, Elena Ufimtseva, bob.liu, hanweidong,
	peter.huangpeng, fanhenglong, liuyingdong
  Cc: konrad


Adding Don Slutz as he requested to be added
Lars

On 15/05/2015 20:44, "Konrad Rzeszutek Wilk" <konrad.wilk@oracle.com>
wrote:

>Hey!
>
>During the Xen Hacka^H^H^H^HProject Summit? we chatted about live-patching
>the hypervisor. We sketched out how it could be done, and brainstormed
>some of the problems.
>
>I took that and wrote an design - which is very much RFC. The design is
>laid out in two sections - the format of the ELF payload - and then the
>hypercalls to act on it.
>
>Hypercall preemption has caused a couple of XSAs so I've baked the need
>for that in the design so we hopefully won't have an XSA for this code.
>
>There are two big *TODO* in the design which I had hoped to get done
>before sending this out - however I am going on vacation for two weeks
>so I figured it would be better to send this off for folks to mull now
>then to have it languish.
>
>Please feel free to add more folks on the CC list.
>
>Enjoy!
>
>
># xSplice Design v1 (EXTERNAL RFC v2)
>
>## Rationale
>
>A mechanism is required to binarily patch the running hypervisor with new
>opcodes that have come about due to primarily security updates.
>
>This document describes the design of the API that would allow us to
>upload to the hypervisor binary patches.
>
>## Glossary
>
> * splice - patch in the binary code with new opcodes
> * trampoline - a jump to a new instruction.
> * payload - telemetries of the old code along with binary blob of the new
>   function (if needed).
> * reloc - telemetries contained in the payload to construct proper
>trampoline.
>
>## Multiple ways to patch
>
>The mechanism needs to be flexible to patch the hypervisor in multiple
>ways
>and be as simple as possible. The compiled code is contiguous in memory
>with
>no gaps - so we have no luxury of 'moving' existing code and must either
>insert a trampoline to the new code to be executed - or only modify
>in-place
>the code if there is sufficient space. The placement of new code has to
>be done
>by hypervisor and the virtual address for the new code is allocated
>dynamically.
>i
>This implies that the hypervisor must compute the new offsets when
>splicing
>in the new trampoline code. Where the trampoline is added (inside
>the function we are patching or just the callers?) is also important.
>
>To lessen the amount of code in hypervisor, the consumer of the API
>is responsible for identifying which mechanism to employ and how many
>locations
>to patch. Combinations of modifying in-place code, adding trampoline, etc
>has to be supported. The API should allow read/write any memory within
>the hypervisor virtual address space.
>
>We must also have a mechanism to query what has been applied and a
>mechanism
>to revert it if needed.
>
>We must also have a mechanism to: provide an copy of the old code - so
>that
>the hypervisor can verify it against the code in memory; the new code;
>the symbol name of the function to be patched; or offset from the symbol;
>or virtual address.
>
>The complications that this design will encounter are explained later
>in this document.
>
>## Patching code
>
>The first mechanism to patch that comes in mind is in-place replacement.
>That is replace the affected code with new code. Unfortunately the x86
>ISA is variable size which places limits on how much space we have
>available
>to replace the instructions.
>
>The second mechanism is by replacing the call or jump to the
>old function with the address of the new function.
>
>A third mechanism is to add a jump to the new function at the
>start of the old function.
>
>### Example of trampoline and in-place splicing
>
>As example we will assume the hypervisor does not have XSA-132 (see
>*domctl/sysctl: don't leak hypervisor stack to toolstacks*
>4ff3449f0e9d175ceb9551d3f2aecb59273f639d) and we would like to binary
>patch
>the hypervisor with it. The original code looks as so:
>
><pre>
>   48 89 e0                  mov    %rsp,%rax
>   48 25 00 80 ff ff         and    $0xffffffffffff8000,%rax
></pre>
>
>while the new patched hypervisor would be:
>
><pre>
>   48 c7 45 b8 00 00 00 00   movq   $0x0,-0x48(%rbp)
>   48 c7 45 c0 00 00 00 00   movq   $0x0,-0x40(%rbp)
>   48 c7 45 c8 00 00 00 00   movq   $0x0,-0x38(%rbp)
>   48 89 e0                  mov    %rsp,%rax
>   48 25 00 80 ff ff         and    $0xffffffffffff8000,%rax
></pre>
>
>This is inside the arch_do_domctl. This new change adds 21 extra
>bytes of code which alters all the offsets inside the function. To alter
>these offsets and add the extra 21 bytes of code we might not have enough
>space in .text to squeze this in.
>
>As such we could simplify this problem by only patching the site
>which calls arch_do_domctl:
>
><pre>
><do_domctl>:  
> e8 4b b1 05 00          callq  ffff82d08015fbb9 <arch_do_domctl>
></pre>
>
>with a new address for where the new `arch_do_domctl` would be (this
>area would be allocated dynamically).
>
>Astute readers will wonder what we need to do if we were to patch
>`do_domctl`
>- which is not called directly by hypervisor but on behalf of the guests
>via
>the `compat_hypercall_table` and `hypercall_table`.
>Patching the offset in `hypercall_table` for `do_domctl:
>(ffff82d080103079 <do_domctl>:)
><pre>
>
> ffff82d08024d490:   79 30
> ffff82d08024d492:   10 80 d0 82 ff ff
>
></pre>
>with the new address where the new `do_domctl` is possible. The other
>place where it is used is in `hvm_hypercall64_table` which would need
>to be patched in a similar way. This would require an in-place splicing
>of the new virtual address of `arch_do_domctl`.
>
>In summary this example patched the callee of the affected function by
> * allocating memory for the new code to live in,
> * changing the virtual address of all the functions which called the old
>   code (computing the new offset, patching the callq with a new callq).
> * changing the function pointer tables with the new virtual address of
>   the function (splicing in the new virtual address). Since this table
>   resides in the .rodata section we would need to temporarily change the
>   page table permissions during this part.
>
>
>However it has severe drawbacks - the safety checks which have to make
>sure
>the function is not on the stack - must also check every caller. For some
>patches this could if there were an sufficient large amount of callers
>that we would never be able to apply the update.
>
>### Example of different trampoline patching.
>
>An alternative mechanism exists where we can insert an trampoline in the
>existing function to be patched to jump directly to the new code. This
>lessens the locations to be patched to one but it puts pressure on the
>CPU branching logic (I-cache, but it is just one unconditional jump).
>
>For this example we will assume that the hypervisor has not been compiled
>with fe2e079f642effb3d24a6e1a7096ef26e691d93e (XSA-125: *pre-fill
>structures
>for certain HYPERVISOR_xen_version sub-ops*) which mem-sets an structure
>in `xen_version` hypercall. This function is not called **anywhere** in
>the hypervisor (it is called by the guest) but referenced in the
>`compat_hypercall_table` and `hypercall_table` (and indirectly called
>from that). Patching the offset in `hypercall_table` for the old
>`do_xen_version` (ffff82d080112f9e <do_xen_version>)
>
></pre>
> ffff82d08024b270 <hypercall_table>
> ...  
> ffff82d08024b2f8:   9e 2f 11 80 d0 82 ff ff
>
></pre>
>with the new address where the new `do_xen_version` is possible. The other
>place where it is used is in `hvm_hypercall64_table` which would need
>to be patched in a similar way. This would require an in-place splicing
>of the new virtual address of `do_xen_version`.
>
>An alternative solution would be to patch insert an trampoline in the
>old `do_xen_version' function to directly jump to the new
>`do_xen_version`.
>
><pre>
> ffff82d080112f9e <do_xen_version>:
> ffff82d080112f9e:       48 c7 c0 da ff ff ff    mov
>$0xffffffffffffffda,%rax
> ffff82d080112fa5:       83 ff 09                cmp    $0x9,%edi
> ffff82d080112fa8:       0f 87 24 05 00 00       ja     ffff82d0801134d2
><do_xen_version+0x534>
></pre>
>
>with:
>
><pre>
> ffff82d080112f9e <do_xen_version>:
> ffff82d080112f9e:       e9 XX YY ZZ QQ          jmpq   [new
>do_xen_version]  
></pre>
>
>which would lessen the amount of patching to just one location.
>
>In summary this example patched the affected function to jump to the
>new replacement function which required:
> * allocating memory for the new code to live in,
> * inserting trampoline with new offset in the old function to point to
>the
>   new function.
> * Optionally we can insert in the old function an trampoline jump to an
>function
>   providing an BUG_ON to catch errant code.
>
>The disadvantage of this are that the unconditional jump will consume a
>small
>I-cache penalty. However the simplicity of the patching of safety checks
>make this a worthwhile option.
>
>### Security
>
>With this method we can re-write the hypervisor - and as such we **MUST**
>be
>diligent in only allowing certain guests to perform this operation.
>
>Furthermore with SecureBoot or tboot, we **MUST** also verify the
>signature
>of the payload to be certain it came from a trusted source.
>
>As such the hypercall **MUST** support an XSM policy to limit the what
>guest is allowed. If the system is booted with signature checking the
>signature checking will be enforced.
>
>## Payload format
>
>The payload **MUST** contain enough data to allow us to apply the update
>and also safely reverse it. As such we **MUST** know:
>
> * What the old code is expected to be. We **MUST** verify it against the
>   runtime code.
> * The locations in memory to be patched. This can be determined
>dynamically
>   via symbols or via virtual addresses.
> * The new code to be used.
> * Signature to verify the payload.
>
>This binary format can be constructed using an custom binary format but
>there are severe disadvantages of it:
>
> * The format might need to be change and we need an mechanism to
>accommodate
>   that.
> * It has to be platform agnostic.
> * Easily constructed using existing tools.
>
>As such having the payload in an ELF file is the sensible way. We would be
>carrying the various set of structures (and data) in the ELF sections
>under
>different names and with definitions. The prefix for the ELF section name
>would always be: *.xsplice_*
>
>Note that every structure has padding. This is added so that the
>hypervisor
>can re-use those fields as it sees fit.
>
>There are five sections *.xsplice_* sections:
>
> * `.xsplice_symbols` and `.xsplice_str`. The array of symbols to be
>referenced
>   during the update. This can contain the symbols (functions) that will
>be
>   patched, or the list of symbols (functions) to be checked pre-patching
>which
>   may not be on the stack.
>
>* `.xsplice_reloc` and `.xsplice_reloc_howto`. The howto properly
>construct
>   trampolines for an patch. We can have multiple locations for which we
>   need to insert an trampoline for a payload and each location might
>require
>   a different way of handling it. This would naturally reference the
>`.text`
>   section and its proper offset. The `.xsplice_reloc` is not directly
>concerned
>   with patches but rather is an ELF relocation - describing the target
>   of a relocation and how that is performed.  They're also used for where
>   the new code references the run code too.
>
> * `.xsplice_sections`. The safety data for the old code and new code.
>   This contains an array of symbols (pointing to `.xsplice_symbols` to
>   and `.text`) which are to be used during safety and dependency
>checking.
>
>
> * `.xsplice_patches`: The description of the new functions to be patched
>   in (size, type, pointer to code, etc.).
>
> * `.xsplice_change`. The structure that ties all of this together and
>defines
>   the payload.
>
>Additionally the ELF file would contain:
>
> * `.text` section for the new and old code (function).
> * `.rela.text` relocation data for the `.text` (both new and old).
> * `.rela.xsplice_patches` relocation data for `.xsplice_patches` (such
>as offset
>   to the `.text` ,`.xsplice_symbols`, or `.xsplice_reloc` section).
> * `.bss` section for the new code (function)
> * `.data` and `.data.read_mostly` section for the new and old code
>(function)
> * `.rodata` section for the new and old code (function).
>
>In short the *.xsplice_* sections represent various structures and the
>ELF provides the mechanism to glue it all together when loaded in memory.
>
>Note that a lot of these ideas are borrowed from kSplice which is
>available at: https://github.com/jirislaby/ksplice
>
>For ELF understanding the best starting point is the OSDev Wiki
>(http://wiki.osdev.org/ELF). Furthermore the ELF specification is
>at http://www.skyfree.org/linux/references/ELF_Format.pdf and
>at Oracle's web site:
>http://docs.oracle.com/cd/E23824_01/html/819-0690/chapter6-46512.html#scro
>lltoc
>
>### ASCII art of the ELF structures
>
>*TODO*: Include an ASCII art of how the sections are tied together.
>
>### xsplice_symbols
>
>The section contains an array of an structure that outlines the name
>of the symbol to be patched (or checked against). The structure is
>as follow:
>
><pre>
>struct xsplice_symbol {
>    const char *name; /* The ELF name of the symbol. */
>    const char *label; /* A unique xSplice name for the symbol. */
>    uint8_t pad[16]; /* Must be zero. */
>};  
></pre>
>The structures may be in the section in any order and in any amount
>(duplicate entries are permitted).
>
>Both `name` and `label` would be pointing to entries in `.xsplice_str`.
>
>The `label` is used for diagnostic purposes - such as including the
>name and the offset.
>
>### xsplice_reloc and xsplice_reloc_howto
>
>The section contains an array of a structure that outlines the different
>locations (and howto) for which an trampoline is to be inserted.
>
>The howto defines in the detail the change. It contains the type,
>whether the relocation is relative, the size of the relocation,
>bitmask for which parts of the instruction or data are to be replaced,
>amount of final relocation is shifted by (to drop unwanted data), and
>whether the replacement should be interpreted as signed value.
>
>The structure is as follow:
>
><pre>
>#define XSPLICE_HOWTO_RELOC_INLINE  0 /* Inline replacement. */
>#define XSPLICE_HOWTO_RELOC_PATCH   1 /* Add trampoline. */
>#define XSPLICE_HOWTO_RELOC_DATA    2 /*  __DATE__ type change. */
>#define XSPLICE_HOWTO_RELOC_TIME    3 /* __TIME__ type chnage. */
>#define XSPLICE_HOWTO_BUG           4 /* BUG_ON being replaced.*/
>#define XSPLICE_HOWTO_EXTABLE       5 /* exception_table change. */
>#define XSPLICE_HOWTO_SYMBOL        6 /* change in symbol table. */
>
>#define XSPLICE_HOWTO_FLAG_PC_REL    0x00000001 /* Is PC relative. */
>#define XSPLICE_HOWOT_FLAG_SIGN      0x00000002 /* Should the new value
>be treated as signed value. */
>
>struct xsplice_reloc_howto {
>    uint32_t    type; /* XSPLICE_HOWTO_* */
>    uint32_t    flag; /* XSPLICE_HOWTO_FLAG_* */
>    uint32_t    size; /* Size, in bytes, of the item to be relocated. */
>    uint32_t    r_shift; /* The value the final relocation is shifted
>right by; used to drop unwanted data from the relocation. */
>    uint64_t    mask; /* Bitmask for which parts of the instruction or
>data are replaced with the relocated value. */
>    uint8_t     pad[8]; /* Must be zero. */
>};  
>
></pre>
>
>This structure is used in:
>
><pre>
>struct xsplice_reloc {
>    uint64_t addr; /* The address of the relocation (if known). */
>    struct xsplice_symbol *symbol; /* Symbol for this relocation. */
>    struct xsplice_reloc_howto  *howto; /* Pointer to the above
>structure. */  
>    uint64_t isns_added; /* ELF addend resulting from quirks of
>instruction one of whose operands is the relocation. For example, this is
>-4 on x86 pc-relative jumps. */
>    uint64_t isns_target; /* rest of the ELF addend.  This is equal to
>the offset against the symbol that the relocation refers to. */
>    uint8_t pad[8];  /* Must be zero. */
>};  
></pre>
>
>### xsplice_sections
>
>The structure defined in this section is used to verify that it is safe
>to update with the new changes. It can contain safety data on the old code
>and what kind of matching we are to expect.
>
>It also can contain safety date of what to check when about to patch.
>That is whether any of the addresses (either provided or resolved
>when payload is loaded by referencing the symbols) are in memory
>with what we expect it to be.
>
>As such the flags can be or-ed together:
>
><pre>
>#define XSPLICE_SECTION_TEXT   0x00000001 /* Section is in .text */
>#define XSPLICE_SECTION_RODATA 0x00000002 /* Section is in .ro */
>#define XSPLICE_SECTION_DATA   0x00000004 /* Section is in .rodata */
>#define XSPLICE_SECTION_STRING 0x00000008 /* Section is in .str */
>#define XSPLICE_SECTION_ALTINSTRUCTIONS 0x00000010 /* Section has
>.altinstructions. */
>#define XSPLICE_SECTION_TEXT_INPLACE 0x00000200 /* Change is in place. */
>  
>#dekine XSPLICE_SECTION_MATCH_EXACT 0x00000400 /* Must match exactly. */
>#define XSPLICE_SECTION_NO_STACKCHECK 0x00000800 /* Do not check the
>stack. */  
>
>struct xsplice_section {
>    struct xsplice_symbol *symbol; /* The symbol associated with this
>change. */  
>    uint64_t address; /* The address of the section (if known). */
>    uint64_t size; /* The size of the section. */
>    uint64_t flags; /* Various XSPLICE_SECTION_* flags. */
>    uint8_t pad[16]; /* To be zero. */
>};
>
></pre>
>
>### xsplice_patches
>
>Within this section we have an array of structures defining the new code
>(patches).
>
>Each structure consists of a pointer to the new code (which in the ELF
>ends up pointing to an offset in the `.text` or `.data` section); the
>type of patch: inline - either text or data - or requesting a
>trampoline; and the size of the patch.
>
>The structure is as follows:
>
><pre>
>#define XSPLICE_PATCH_INLINE_TEXT   0
>#define XSPLICE_PATCH_INLINE_DATA   1
>#define XSPLICE_PATCH_RELOC_TEXT    2
>
>struct xsplice_patch {
>    uint32_t type; /* XSPLICE_PATCH_* .*/
>    uint32_t size; /* Size of patch. */
>    uint64_t addr; /* The address of the new code (or data). */
>    void *content; /* The bytes to be installed. */
>    uint8_t pad[16]; /* Must be zero. */
>};
>
></pre>
>
>### xsplice_code
>
>The structure embedded within this section ties it all together.
>It has the name of the patch, and pointers to all of the above-mentioned
>structures (their start and end addresses).
>
>The structure is as follows:
>
><pre>
>struct xsplice_code {
>    const char *name; /* A sensible name for the patch. Up to 40
>characters. */  
>    struct xsplice_reloc *relocs, *relocs_end; /* How to patch it */
>    struct xsplice_section *sections, *sections_end; /* Safety data */
>    struct xsplice_patch *patches, *patches_end; /* Patch code & data */
>    uint8_t pad[32]; /* Must be zero. */
>};
></pre>
>
>There should only be one such structure in the section.
>
>### Example
>
>*TODO*: Include an objdump of how the ELF would look like for the XSA
>mentioned earlier.
>
>## Signature checking requirements.
>
>The signature checking requires that the layout of the data in memory
>**MUST** be the same for the signature to be verified. This means that
>the payload data layout in ELF format **MUST** match what the hypervisor
>expects so that it can properly do the signature verification.
>
>The signature is computed over all of the payload laid out contiguously
>in memory. The signature is to be appended at the end of the ELF payload,
>prefixed with the string '~Module signature appended~\n', followed by
>a signature header, then the signature, the key identifier, and the
>signer's name.
>
>Specifically the signature header would be:
>
><pre>
>#define PKEY_ALGO_DSA       0
>#define PKEY_ALGO_RSA       1
>
>#define PKEY_ID_PGP         0 /* OpenPGP generated key ID */
>#define PKEY_ID_X509        1 /* X.509 arbitrary subjectKeyIdentifier */
>
>#define HASH_ALGO_MD4          0
>#define HASH_ALGO_MD5          1
>#define HASH_ALGO_SHA1         2
>#define HASH_ALGO_RIPE_MD_160  3
>#define HASH_ALGO_SHA256       4
>#define HASH_ALGO_SHA384       5
>#define HASH_ALGO_SHA512       6
>#define HASH_ALGO_SHA224       7
>#define HASH_ALGO_RIPE_MD_128  8
>#define HASH_ALGO_RIPE_MD_256  9
>#define HASH_ALGO_RIPE_MD_320 10
>#define HASH_ALGO_WP_256      11
>#define HASH_ALGO_WP_384      12
>#define HASH_ALGO_WP_512      13
>#define HASH_ALGO_TGR_128     14
>#define HASH_ALGO_TGR_160     15
>#define HASH_ALGO_TGR_192     16
>
>
>struct elf_payload_signature {
>	u8	algo;		/* Public-key crypto algorithm PKEY_ALGO_*. */
>	u8	hash;		/* Digest algorithm: HASH_ALGO_*. */
>	u8	id_type;	/* Key identifier type PKEY_ID*. */
>	u8	signer_len;	/* Length of signer's name */
>	u8	key_id_len;	/* Length of key identifier */
>	u8	__pad[3];  
>	__be32	sig_len;	/* Length of signature data */
>};
>
></pre>
>(Note that this has been borrowed from the Linux module signature code.)
>
>
>## Hypercalls
>
>We will employ the sub-operations of the system management hypercall
>(sysctl).
>There are to be four sub-operations:
>
> * upload the payloads.
> * list the uploaded payloads along with their state.
> * get a particular payload's summary and its state.
> * command to apply, delete, or revert the payload.
>
>The patching is asynchronous, therefore the caller is responsible
>for verifying that it has been applied properly by retrieving the
>payload's summary and verifying that there are no error codes
>associated with it.
>
>We **MUST** make it asynchronous due to the nature of patching: it
>requires every physical CPU to be in lock-step with the others. The
>patching mechanism, while an implementation detail, is not a short
>operation and as such the design **MUST** assume it will be a
>long-running operation.
>
>Furthermore it is possible to have multiple different payloads for the
>same function. As such a unique id has to be visible to allow proper
>manipulation.
>
>The hypercall is part of the `xen_sysctl`. The top-level structure
>contains one uint32_t to determine the sub-operation:
>
><pre>
>struct xen_sysctl_xsplice_op {
>    uint32_t cmd;
>    union {
>        ... see below ...
>    } u;
>};
>
></pre>
>while the rest of the hypercall-specific structures are part of this
>structure.
>
>
>### XEN_SYSCTL_XSPLICE_UPLOAD (0)
>
>Upload a payload to the hypervisor. The payload is verified and if there
>are any issues the proper return code will be returned. The payload is
>not applied at this time - that is controlled by
>*XEN_SYSCTL_XSPLICE_ACTION*.
>
>The caller provides:
>
> * `id` unique id.
> * `payload` the virtual address of where the ELF payload is.
>
>The return value is zero if the payload was successfully uploaded and
>the signature was verified. Otherwise an EXX return value is provided.
>Duplicate `id`s are not supported.
>
>The `payload` is the ELF payload as mentioned in the `Payload format`
>section.
>
>The structure is as follows:
>
><pre>
>struct xen_sysctl_xsplice_upload {
>    char id[40];  /* IN, name of the patch. */
>    uint64_t size; /* IN, size of the ELF file. */
>    XEN_GUEST_HANDLE_64(uint8) payload; /* ELF file. */
>}; 
></pre>
>
>### XEN_SYSCTL_XSPLICE_GET (1)
>
>Retrieve a summary of a specific payload. The caller provides:
>
> * `id` the unique id.
> * `status` *MUST* be set to zero.
> * `rc` *MUST* be set to zero.
>
>The `summary` structure contains a summary of the payload, which includes:
>
> * `id` the unique id.
> * `status` - whether it has been:
> 1. *XSPLICE_STATUS_LOADED* (0) has been loaded.
> 2. *XSPLICE_STATUS_PROGRESS* (1) acting on the
>**XEN_SYSCTL_XSPLICE_ACTION** command.
> 3. *XSPLICE_STATUS_CHECKED*  (2) the ELF payload safety checks passed.
> 4. *XSPLICE_STATUS_APPLIED* (3) loaded, checked, and applied.
> 5. *XSPLICE_STATUS_REVERTED* (4) loaded, checked, applied and then also
>reverted.
> 6. *XSPLICE_STATUS_IN_ERROR* (5) loaded and in a failed state. Consult
>`rc` for details.
> * `rc` - its error state if any.
>
>The structure is as follows:
>
><pre>
>#define XSPLICE_STATUS_LOADED    0
>#define XSPLICE_STATUS_PROGRESS  1
>#define XSPLICE_STATUS_CHECKED   2
>#define XSPLICE_STATUS_APPLIED   3
>#define XSPLICE_STATUS_REVERTED  4
>#define XSPLICE_STATUS_IN_ERROR  5
>
>struct xen_sysctl_xsplice_summary {
>    char id[40];  /* IN/OUT, name of the patch. */
>    uint32_t status;   /* OUT */
>    int32_t rc;  /* OUT */
>}; 
></pre>
>
>### XEN_SYSCTL_XSPLICE_LIST (2)
>
>Retrieve an array of abbreviated summaries of the payloads that are
>loaded in the hypervisor.
>
>The caller provides:
>
> * `idx` index iterator. Initially it *MUST* be zero.
> * `count` the max number of entries to populate.
> * `summary` virtual address of where to write payload summaries.
>
>The hypercall returns zero on success and updates the `idx` (index)
>iterator with the number of payloads returned, `count` to the number of
>remaining payloads, and `summary` with a number of payload summaries.
>
>If the hypercall returns E2BIG the `count` is too big and should be
>lowered.
>
>Note that due to the asynchronous nature of hypercalls the domain might
>have added or removed payloads, making this information stale. It is
>the responsibility of the domain to provide proper accounting.
>
>The `summary` structure contains a summary of the payload, which includes:
>
> * `id` unique id.
> * `status` - whether it has been:
> 1. *XSPLICE_STATUS_LOADED* (0) has been loaded.
> 2. *XSPLICE_STATUS_PROGRESS* (1) acting on the
>**XEN_SYSCTL_XSPLICE_ACTION** command.
> 3. *XSPLICE_STATUS_CHECKED*  (2) the payload `old` and `addr` match with
>the hypervisor.
> 4. *XSPLICE_STATUS_APPLIED* (3) loaded, checked, and applied.
> 5. *XSPLICE_STATUS_REVERTED* (4) loaded, checked, applied and then also
>reverted.
> 6. *XSPLICE_STATUS_IN_ERROR* (5) loaded and in a failed state. Consult
>`rc` for details.
> * `rc` - its error state if any.
>
>The structure is as follows:
>
><pre>
>struct xen_sysctl_xsplice_list {
>    uint32_t idx;  /* IN/OUT */
>    uint32_t count;  /* IN/OUT */
>    XEN_GUEST_HANDLE_64(xen_sysctl_xsplice_summary) summary;  /* OUT */
>};  
>
>struct xen_sysctl_xsplice_summary {
>    char id[40];  /* OUT, name of the patch. */
>    uint32_t status;   /* OUT */
>    int32_t rc;  /* OUT */
>};  
>
></pre>
>### XEN_SYSCTL_XSPLICE_ACTION (3)
>
>Perform an operation on the payload structure referenced by the `id`
>field.
>The operation request is asynchronous and the status should be retrieved
>by using either **XEN_SYSCTL_XSPLICE_GET** or **XEN_SYSCTL_XSPLICE_LIST**
>hypercall.
>
>The caller provides:
>
> * `id` the unique id.
> * `cmd` the command requested:
>  1. *XSPLICE_ACTION_CHECK* (0) check that the payload will apply
>properly.
>  2. *XSPLICE_ACTION_UNLOAD* (1) unload the payload.
>  3. *XSPLICE_ACTION_REVERT* (2) revert the payload.
>  4. *XSPLICE_ACTION_APPLY* (3) apply the payload.
>
>
>The return value will be zero unless the provided fields are incorrect.
>
>The structure is as follows:
>
><pre>
>#define XSPLICE_ACTION_CHECK  0
>#define XSPLICE_ACTION_UNLOAD 1
>#define XSPLICE_ACTION_REVERT 2
>#define XSPLICE_ACTION_APPLY  3
>
>struct xen_sysctl_xsplice_action {
>    char id[40];  /* IN, name of the patch. */
>    uint32_t cmd; /* IN */
>};  
>
></pre>
>
>## Sequence of events.
>
>The normal sequence of events is to:
>
> 1. *XEN_SYSCTL_XSPLICE_UPLOAD* to upload the payload. If there are
>errors *STOP* here.
> 2. *XEN_SYSCTL_XSPLICE_GET* to check the `->status`. If in
>*XSPLICE_STATUS_PROGRESS* spin. If in *XSPLICE_STATUS_LOADED* go to next
>step.
> 3. *XEN_SYSCTL_XSPLICE_ACTION* with *XSPLICE_ACTION_CHECK* command to
>verify that the payload can be successfully applied.
> 4. *XEN_SYSCTL_XSPLICE_GET* to check the `->status`. If in
>*XSPLICE_STATUS_PROGRESS* spin. If in *XSPLICE_STATUS_CHECKED* go to next
>step.
> 5. *XEN_SYSCTL_XSPLICE_ACTION* with *XSPLICE_ACTION_APPLY* to apply the
>patch.
> 6. *XEN_SYSCTL_XSPLICE_GET* to check the `->status`. If in
>*XSPLICE_STATUS_PROGRESS* spin. If in *XSPLICE_STATUS_APPLIED* exit with
>success.
>
> 
>## Addendum
>
>Implementation quirks should not be discussed in a design document.
>
>However these observations can provide aid when developing against this
>document.
>
>
>### Alternative assembler
>
>Alternative assembler is a mechanism to use different instructions
>depending on what the CPU supports. This is done by providing multiple
>streams of code that can be patched in - or, if the CPU does not support
>them - padded with `nop` operations. The alternative assembler macros
>cause the compiler to emit the most generic code in place and emit a
>special ELF section header to tag this location. During run-time the
>hypervisor can leave the areas alone or patch them with better-suited
>opcodes.
>
>As we might be patching the alternative assembler sections as well - by
>providing new better-suited opcodes or perhaps nops - we need to
>also re-run the alternative assembler patching after we have done our
>patching.
>
>Also, when we are doing safety checks, the code we are checking might be
>utilizing alternative assembler. As such we should relax our checks to
>accommodate that.
>
>### .rodata sections
>
>The patching might require strings to be updated as well. As such we
>must also be able to patch the strings as needed. This sounds simple -
>but the compiler has a habit of coalescing identical strings - which
>means that if we alter the strings in place - other users will be
>inadvertently affected as well.
>
>This is also where pointers to functions live - and we may need to patch
>these as well.
>
>To guard against that we must be prepared to do patching similar to
>trampoline patching or in-line depending on the flavour. If we can
>do in-line patching we would need to:
>
> * alter `.rodata` to be writeable.
> * inline patch.
> * alter `.rodata` to be read-only.
>
>If we are doing trampoline patching we would need to:
>
> * allocate a new memory location for the string.
> * all locations which use this string will have to be updated to use the
>   offset to the string.
> * mark the region RO when we are done.
>
>### .bss sections
>
>Patching writable data is not suitable as it is unclear what should be
>done depending on the current state of the data. As such it should not
>be attempted.
>
>
>### Patching code which is in the stack.
>
>We should not patch code which is on the stack. That can lead
>to corruption.
>
>### Trampoline (e9 opcode)
>
>The e9 opcode used for jmpq uses a 32-bit signed displacement. That means
>we are limited to placing the new code within 2GB of virtual address
>space of the old code. That should not be a problem since the Xen
>hypervisor has a very small footprint.
>
>However, if we need to, we can always chain two trampolines: one at the
>2GB limit that jumps to the next trampoline.
>
>### Time rendezvous code instead of stop_machine for patching
>
>The hypervisor's time rendezvous code runs synchronously across all CPUs
>every second. Using stop_machine to patch can stall the time rendezvous
>code and result in an NMI. As such, having the patching done at the tail
>of the rendezvous code should avoid this problem.
>
>### Security
>
>Only the privileged domain should be allowed to do this operation.


* Re: [RFC v2] xSplice design
  2015-05-15 19:44 [RFC v2] xSplice design Konrad Rzeszutek Wilk
                   ` (2 preceding siblings ...)
  2015-05-19 19:13 ` Lars Kurth
@ 2015-05-20 15:11 ` Martin Pohlack
  2015-06-05 15:00   ` Konrad Rzeszutek Wilk
  2015-06-12 11:39 ` Martin Pohlack
  4 siblings, 1 reply; 37+ messages in thread
From: Martin Pohlack @ 2015-05-20 15:11 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk, msw, aliguori, Antony Messerli,
	Rick Harris, Paul Voccio, Steven Wilson, Major Hayden,
	Josh Kearney, jinsong.liu, xiantao.zxt, boris.ostrovsky,
	Daniel Kiper, Elena Ufimtseva, bob.liu, lars.kurth, hanweidong,
	peter.huangpeng, fanhenglong, liuyingdong, john.liuqiming,
	xen-devel, jbeulich, andrew.cooper3, jeremy
  Cc: konrad

Hi,

this looks very interesting.

I have talked about an experimental Xen hotpatching design at Linux
Plumbers Conference 2014 in Düsseldorf, slides are here:

http://www.linuxplumbersconf.net/2014/ocw//system/presentations/2421/original/xen_hotpatching-2014-10-16.pdf

and would like to share a couple of observations from the slides:

* We found splicing at hypervisor exit with a barrier for all CPUs to
  be a good place, because all Xen stacks are effectively empty at
  that point.  We use a barrier with a maximum timeout (that can be
  adapted).  The approach is similar in concept to stop_machine and
  the time rendezvous but is time-bound and does not require all the
  asynchronous operations discussed below.

* Hotpatch generation often requires support for compiling the target
  with -ffunction-sections / -fdata-sections.  Last time I looked, you
  can compile Xen with that option, but the linker scripts are not
  fully adapted to deal with it and the resulting hypervisor cannot be
  booted.

* Xen as it is now, has a couple of non-unique symbol names which will
  make runtime symbol identification hard.  Sometimes, static symbols
  simply have the same name in C files, sometimes such symbols get
  included via header files, and some C files are also compiled
  multiple times and linked under different names (guest_walk.c).  I
  think the Linux kernel solves this by aiming at unique symbol names
  even for local symbols.

  nm xen-syms | cut -f3- -d\  | sort | uniq -c | sort -nr | head

* You may need support for adapting or augmenting exception tables if
  patching such code is desired (it probably is).  Hotpatches may need
  to bring their own small exception tables (similar to how Linux
  modules support this).  If you don't plan on supporting hotpatches
  that introduce additional exception-locations, one could also change
  the exception table in-place and reorder it afterwards.

* Hotpatch interdependencies are tricky.  IIRC the discussion at the
  Live Patching track at LPC, both kpatch and kgraft aimed, at the
  time, at a single large patch that subsumes and replaces all previous
  ones.

Some more comments inline below ...

Cheers,
Martin

On 15.05.2015 21:44, Konrad Rzeszutek Wilk wrote:
> Hey!
> 
> During the Xen Hacka^H^H^H^HProject Summit? we chatted about live-patching
> the hypervisor. We sketched out how it could be done, and brainstormed
> some of the problems.
> 
> I took that and wrote an design - which is very much RFC. The design is
> laid out in two sections - the format of the ELF payload - and then the
> hypercalls to act on it.
> 
> Hypercall preemption has caused a couple of XSAs so I've baked the need
> for that in the design so we hopefully won't have an XSA for this code.
> 
> There are two big *TODO* in the design which I had hoped to get done
> before sending this out - however I am going on vacation for two weeks
> so I figured it would be better to send this off for folks to mull now
> then to have it languish.
> 
> Please feel free to add more folks on the CC list.
> 
> Enjoy!
> 
> 
> # xSplice Design v1 (EXTERNAL RFC v2)
> 
> ## Rationale
> 
> A mechanism is required to binarily patch the running hypervisor with new
> opcodes that have come about due to primarily security updates.
> 
> This document describes the design of the API that would allow us to
> upload to the hypervisor binary patches.
> 
> ## Glossary
> 
>  * splice - patch in the binary code with new opcodes
>  * trampoline - a jump to a new instruction.
>  * payload - telemetries of the old code along with binary blob of the new
>    function (if needed).
>  * reloc - telemetries contained in the payload to construct proper trampoline.
> 
> ## Multiple ways to patch
> 
> The mechanism needs to be flexible to patch the hypervisor in multiple ways
> and be as simple as possible. The compiled code is contiguous in memory with
> no gaps - so we have no luxury of 'moving' existing code and must either
> insert a trampoline to the new code to be executed - or only modify in-place
> the code if there is sufficient space. The placement of new code has to be done
> by hypervisor and the virtual address for the new code is allocated dynamically.
> 
> This implies that the hypervisor must compute the new offsets when splicing
> in the new trampoline code. Where the trampoline is added (inside
> the function we are patching or just the callers?) is also important.
> 
> To lessen the amount of code in hypervisor, the consumer of the API
> is responsible for identifying which mechanism to employ and how many locations
> to patch. Combinations of modifying in-place code, adding trampoline, etc
> has to be supported. The API should allow read/write any memory within
> the hypervisor virtual address space.
> 
> We must also have a mechanism to query what has been applied and a mechanism
> to revert it if needed.
> 
> We must also have a mechanism to: provide a copy of the old code - so that
> the hypervisor can verify it against the code in memory; the new code;

As Xen has no stable in-hypervisor API / ABI, you need to make sure
that a generated module matches a target hypervisor.  In our design,
we use build IDs for that (ld --build-id).  We embed build IDs at Xen
compile time and can query a running hypervisor for its ID and only
load matching patches.

This seems to be an alternative to your proposal to include old code
into hotpatch modules.

> the symbol name of the function to be patched; or offset from the symbol;
> or virtual address.
> 
> The complications that this design will encounter are explained later
> in this document.
> 
> ## Patching code
> 
> The first mechanism to patch that comes in mind is in-place replacement.
> That is replace the affected code with new code. Unfortunately the x86
> ISA is variable size which places limits on how much space we have available
> to replace the instructions.
> 
> The second mechanism is by replacing the call or jump to the
> old function with the address of the new function.

We found splicing via a non-conditional JMP in the beginning of the
old target function to be a sensible and uncomplicated approach.  This
approach works at function level which feels like a very natural unit
to work at.  The really nice thing of this approach is that you don't
have to track all the, potential indirect, call sites.

As you discussed, if you allocate hotpatch memory within +-2GB of the
patch location, no further trampoline indirection is required, a
5-byte JMP does the trick on x86.  We found that to be sufficient in
our experiments.

> A third mechanism is to add a jump to the new function at the
> start of the old function.
> 
> ### Example of trampoline and in-place splicing
> 
> As example we will assume the hypervisor does not have XSA-132 (see
> *domctl/sysctl: don't leak hypervisor stack to toolstacks*
> 4ff3449f0e9d175ceb9551d3f2aecb59273f639d) and we would like to binary patch
> the hypervisor with it. The original code looks as so:
> 
> <pre>
>    48 89 e0                  mov    %rsp,%rax  
>    48 25 00 80 ff ff         and    $0xffffffffffff8000,%rax  
> </pre>
> 
> while the new patched hypervisor would be:
> 
> <pre>
>    48 c7 45 b8 00 00 00 00   movq   $0x0,-0x48(%rbp)  
>    48 c7 45 c0 00 00 00 00   movq   $0x0,-0x40(%rbp)  
>    48 c7 45 c8 00 00 00 00   movq   $0x0,-0x38(%rbp)  
>    48 89 e0                  mov    %rsp,%rax  
>    48 25 00 80 ff ff         and    $0xffffffffffff8000,%rax  
> </pre>
> 
> This is inside the arch_do_domctl. This new change adds 21 extra
> bytes of code which alters all the offsets inside the function. To alter
> these offsets and add the extra 21 bytes of code we might not have enough
> space in .text to squeeze this in.
> 
> As such we could simplify this problem by only patching the site
> which calls arch_do_domctl:
> 
> <pre>
> <do_domctl>:  
>  e8 4b b1 05 00          callq  ffff82d08015fbb9 <arch_do_domctl>  
> </pre>
> 
> with a new address for where the new `arch_do_domctl` would be (this
> area would be allocated dynamically).
> 
> Astute readers will wonder what we need to do if we were to patch `do_domctl`
> - which is not called directly by hypervisor but on behalf of the guests via
> the `compat_hypercall_table` and `hypercall_table`.
> Patching the offset in `hypercall_table` for `do_domctl:
> (ffff82d080103079 <do_domctl>:)
> <pre>
> 
>  ffff82d08024d490:   79 30  
>  ffff82d08024d492:   10 80 d0 82 ff ff   
> 
> </pre>
> with the new address where the new `do_domctl` is possible. The other
> place where it is used is in `hvm_hypercall64_table` which would need
> to be patched in a similar way. This would require an in-place splicing
> of the new virtual address of `arch_do_domctl`.
> 
> In summary this example patched the callee of the affected function by
>  * allocating memory for the new code to live in,
>  * changing the virtual address of all the functions which called the old
>    code (computing the new offset, patching the callq with a new callq).
>  * changing the function pointer tables with the new virtual address of
>    the function (splicing in the new virtual address). Since this table
>    resides in the .rodata section we would need to temporarily change the
>    page table permissions during this part.
> 
> 
> However it has severe drawbacks - the safety checks which have to make sure
> the function is not on the stack - must also check every caller. For some
> patches there could be such a sufficiently large number of callers
> that we would never be able to apply the update.
> 
> ### Example of different trampoline patching.
> 
> An alternative mechanism exists where we can insert a trampoline in the
> existing function to be patched to jump directly to the new code. This
> lessens the locations to be patched to one but it puts pressure on the
> CPU branching logic (I-cache, but it is just one unconditional jump).
> 
> For this example we will assume that the hypervisor has not been compiled
> with fe2e079f642effb3d24a6e1a7096ef26e691d93e (XSA-125: *pre-fill structures
> for certain HYPERVISOR_xen_version sub-ops*) which mem-sets a structure
> in `xen_version` hypercall. This function is not called **anywhere** in
> the hypervisor (it is called by the guest) but referenced in the
> `compat_hypercall_table` and `hypercall_table` (and indirectly called
> from that). Patching the offset in `hypercall_table` for the old
> `do_xen_version` (ffff82d080112f9e <do_xen_version>)
> 
> <pre>
>  ffff82d08024b270 <hypercall_table>  
>  ...  
>  ffff82d08024b2f8:   9e 2f 11 80 d0 82 ff ff  
> 
> </pre>
> with the new address where the new `do_xen_version` is possible. The other
> place where it is used is in `hvm_hypercall64_table` which would need
> to be patched in a similar way. This would require an in-place splicing
> of the new virtual address of `do_xen_version`.
> 
> An alternative solution would be to insert a trampoline in the
> old `do_xen_version` function to directly jump to the new `do_xen_version`.
> 
> <pre>
>  ffff82d080112f9e <do_xen_version>:  
>  ffff82d080112f9e:       48 c7 c0 da ff ff ff    mov    $0xffffffffffffffda,%rax  
>  ffff82d080112fa5:       83 ff 09                cmp    $0x9,%edi  
>  ffff82d080112fa8:       0f 87 24 05 00 00       ja     ffff82d0801134d2 <do_xen_version+0x534>  
> </pre>
> 
> with:
> 
> <pre>
>  ffff82d080112f9e <do_xen_version>:  
>  ffff82d080112f9e:       e9 XX YY ZZ QQ          jmpq   [new do_xen_version]  
> </pre>
> 
> which would lessen the amount of patching to just one location.
> 
> In summary this example patched the affected function to jump to the
> new replacement function which required:
>  * allocating memory for the new code to live in,
>  * inserting trampoline with new offset in the old function to point to the
>    new function.
>  * Optionally we can insert in the old function a trampoline jump to a function
>    providing a BUG_ON to catch errant code.
> 
> The disadvantage of this is that the unconditional jump will consume a small
> I-cache penalty. However the simplicity of the patching and of the safety
> checks makes this a worthwhile option.
> 
> ### Security
> 
> With this method we can re-write the hypervisor - and as such we **MUST** be
> diligent in only allowing certain guests to perform this operation.
> 
> Furthermore with SecureBoot or tboot, we **MUST** also verify the signature
> of the payload to be certain it came from a trusted source.
> 
> As such the hypercall **MUST** support an XSM policy to limit what a
> guest is allowed to do. If the system is booted with signature checking,
> the signature checking will be enforced.
> 
> ## Payload format
> 
> The payload **MUST** contain enough data to allow us to apply the update
> and also safely reverse it. As such we **MUST** know:
> 
>  * What the old code is expected to be. We **MUST** verify it against the
>    runtime code.
>  * The locations in memory to be patched. This can be determined dynamically
>    via symbols or via virtual addresses.
>  * The new code to be used.
>  * Signature to verify the payload.
> 
> This payload could be constructed using a custom binary format but
> there are severe disadvantages to it:
> 
>  * The format might need to change and we need a mechanism to accommodate
>    that.
>  * It has to be platform agnostic.
>  * Easily constructed using existing tools.
> 
> As such having the payload in an ELF file is the sensible way. We would be
> carrying the various set of structures (and data) in the ELF sections under
> different names and with definitions. The prefix for the ELF section name
> would always be: *.xsplice_*
> 
> Note that every structure has padding. This is added so that the hypervisor
> can re-use those fields as it sees fit.
> 
> There are five *.xsplice_* sections:
> 
>  * `.xsplice_symbols` and `.xsplice_str`. The array of symbols to be referenced
>    during the update. This can contain the symbols (functions) that will be
>    patched, or the list of symbols (functions) to be checked pre-patching which
>    may not be on the stack.
> 
>  * `.xsplice_reloc` and `.xsplice_reloc_howto`. The howto for properly
>    constructing trampolines for a patch. We can have multiple locations for
>    which we need to insert a trampoline for a payload and each location might
>    require a different way of handling it. This would naturally reference the
>    `.text` section and its proper offset. The `.xsplice_reloc` is not directly
>    concerned with patches but rather is an ELF relocation - describing the
>    target of a relocation and how that is performed. They are also used where
>    the new code references the running code too.
> 
>  * `.xsplice_sections`. The safety data for the old code and new code.
>    This contains an array of symbols (pointing to `.xsplice_symbols` to
>    and `.text`) which are to be used during safety and dependency checking.
> 
> 
>  * `.xsplice_patches`: The description of the new functions to be patched
>    in (size, type, pointer to code, etc.).
> 
>  * `.xsplice_change`. The structure that ties all of this together and defines
>    the payload.
> 
> Additionally the ELF file would contain:
> 
>  * `.text` section for the new and old code (function).
>  * `.rela.text` relocation data for the `.text` (both new and old).
>  * `.rela.xsplice_patches` relocation data for `.xsplice_patches` (such as offset
>    to the `.text` ,`.xsplice_symbols`, or `.xsplice_reloc` section).
>  * `.bss` section for the new code (function)
>  * `.data` and `.data.read_mostly` section for the new and old code (function)
>  * `.rodata` section for the new and old code (function).
> 
> In short the *.xsplice_* sections represent various structures and the
> ELF provides the mechanism to glue it all together when loaded in memory.
> 
> Note that a lot of these ideas are borrowed from kSplice which is
> available at: https://github.com/jirislaby/ksplice
> 
> For ELF understanding the best starting point is the OSDev Wiki
> (http://wiki.osdev.org/ELF). Furthermore the ELF specification is
> at http://www.skyfree.org/linux/references/ELF_Format.pdf and
> at Oracle's web site:
> http://docs.oracle.com/cd/E23824_01/html/819-0690/chapter6-46512.html#scrolltoc
> 
> ### ASCII art of the ELF structures
> 
> *TODO*: Include an ASCII art of how the sections are tied together.
> 
> ### xsplice_symbols
> 
> The section contains an array of a structure that outlines the name
> of the symbol to be patched (or checked against). The structure is
> as follows:
> 
> <pre>
> struct xsplice_symbol {  
>     const char *name; /* The ELF name of the symbol. */  
>     const char *label; /* A unique xSplice name for the symbol. */  
>     uint8_t pad[16]; /* Must be zero. */  
> };  
> </pre>
> The structures may be in the section in any order and in any amount
> (duplicate entries are permitted).
> 
> Both `name` and `label` would be pointing to entries in `.xsplice_str`.
> 
> The `label` is used for diagnostic purposes - such as including the
> name and the offset.
> 
> ### xsplice_reloc and xsplice_reloc_howto
> 
> The section contains an array of a structure that outlines the different
> locations (and howto) for which a trampoline is to be inserted.
> 
> The howto defines the change in detail. It contains the type,
> whether the relocation is relative, the size of the relocation,
> bitmask for which parts of the instruction or data are to be replaced,
> amount of final relocation is shifted by (to drop unwanted data), and
> whether the replacement should be interpreted as signed value.
> 
> The structure is as follows:
> 
> <pre>
> #define XSPLICE_HOWTO_RELOC_INLINE  0 /* Inline replacement. */  
> #define XSPLICE_HOWTO_RELOC_PATCH   1 /* Add trampoline. */  
> #define XSPLICE_HOWTO_RELOC_DATA    2 /* __DATE__ type change. */  
> #define XSPLICE_HOWTO_RELOC_TIME    3 /* __TIME__ type change. */  
> #define XSPLICE_HOWTO_BUG           4 /* BUG_ON being replaced.*/  
> #define XSPLICE_HOWTO_EXTABLE       5 /* exception_table change. */  
> #define XSPLICE_HOWTO_SYMBOL        6 /* change in symbol table. */  
> 
> #define XSPLICE_HOWTO_FLAG_PC_REL    0x00000001 /* Is PC relative. */  
> #define XSPLICE_HOWTO_FLAG_SIGN      0x00000002 /* Should the new value be treated as a signed value. */  
> 
> struct xsplice_reloc_howto {  
>     uint32_t    type; /* XSPLICE_HOWTO_* */  
>     uint32_t    flag; /* XSPLICE_HOWTO_FLAG_* */  
>     uint32_t    size; /* Size, in bytes, of the item to be relocated. */  
>     uint32_t    r_shift; /* The value the final relocation is shifted right by; used to drop unwanted data from the relocation. */  
>     uint64_t    mask; /* Bitmask for which parts of the instruction or data are replaced with the relocated value. */  
>     uint8_t     pad[8]; /* Must be zero. */  
> };  
> 
> </pre>
> 
> This structure is used in:
> 
> <pre>
> struct xsplice_reloc {  
>     uint64_t addr; /* The address of the relocation (if known). */  
>     struct xsplice_symbol *symbol; /* Symbol for this relocation. */  
>     struct xsplice_reloc_howto  *howto; /* Pointer to the above structure. */  
>     uint64_t isns_added; /* ELF addend resulting from quirks of instruction one of whose operands is the relocation. For example, this is -4 on x86 pc-relative jumps. */  
>     uint64_t isns_target; /* rest of the ELF addend.  This is equal to the offset against the symbol that the relocation refers to. */  
>     uint8_t pad[8];  /* Must be zero. */  
> };  
> </pre>
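> 
> To make the howto fields concrete, here is a minimal sketch (illustrative
> only, not the hypervisor's implementation; the helper name is made up) of how
> the flag, shift, and mask could drive the final relocation computation:
> 
> ```c
> #include <stdint.h>
> 
> #define XSPLICE_HOWTO_FLAG_PC_REL 0x00000001
> 
> /* Illustrative helper: 'value' is the resolved symbol address plus the
>    addends, 'where' is the location being patched, and 'old_bits' is what
>    currently sits at that location. Only the masked bits are replaced. */
> static uint64_t apply_howto(uint64_t value, uint64_t where, uint32_t flags,
>                             uint32_t r_shift, uint64_t mask, uint64_t old_bits)
> {
>     if (flags & XSPLICE_HOWTO_FLAG_PC_REL)
>         value -= where;        /* make the value PC-relative */
>     value >>= r_shift;         /* drop unwanted low-order data */
>     /* Replace only the masked parts of the instruction or data. */
>     return (old_bits & ~mask) | (value & mask);
> }
> ```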
> 
> ### xsplice_sections
> 
> The structure defined in this section is used to verify that it is safe
> to update with the new changes. It can contain safety data on the old code
> and what kind of matching we are to expect.
> 
> It can also contain safety data describing what to check when about to
> patch: that is, whether any of the addresses (either provided, or resolved
> when the payload is loaded by referencing the symbols) contain in memory
> what we expect them to.
> 
> As such the flags can be or-ed together:
> 
> <pre>
> #define XSPLICE_SECTION_TEXT   0x00000001 /* Section is in .text */  
> #define XSPLICE_SECTION_RODATA 0x00000002 /* Section is in .rodata */  
> #define XSPLICE_SECTION_DATA   0x00000004 /* Section is in .data */  
> #define XSPLICE_SECTION_STRING 0x00000008 /* Section is in .str */  
> #define XSPLICE_SECTION_ALTINSTRUCTIONS 0x00000010 /* Section has .altinstructions. */  
> #define XSPLICE_SECTION_TEXT_INPLACE 0x00000200 /* Change is in place. */   
> #define XSPLICE_SECTION_MATCH_EXACT 0x00000400 /* Must match exactly. */  
> #define XSPLICE_SECTION_NO_STACKCHECK 0x00000800 /* Do not check the stack. */  
> 
> struct xsplice_section {  
>     struct xsplice_symbol *symbol; /* The symbol associated with this change. */  
>     uint64_t address; /* The address of the section (if known). */  
>     uint64_t size; /* The size of the section. */  
>     uint64_t flags; /* Various XSPLICE_SECTION_* flags. */
>     uint8_t pad[16]; /* To be zero. */  
> };
> 
> </pre>
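> 
> As a minimal sketch of how the *MATCH_EXACT* flag could drive the pre-patch
> safety check (the function name is made up and the relaxed-matching path is
> deliberately elided):
> 
> ```c
> #include <stdint.h>
> #include <string.h>
> 
> #define XSPLICE_SECTION_MATCH_EXACT 0x00000400
> 
> /* Compare what is currently in memory at the section's address against
>    the expected bytes carried in the payload. Returns 1 if safe. */
> static int xsplice_section_safe(const uint8_t *in_memory,
>                                 const uint8_t *expected,
>                                 uint64_t size, uint64_t flags)
> {
>     if (flags & XSPLICE_SECTION_MATCH_EXACT)
>         return memcmp(in_memory, expected, (size_t)size) == 0;
> 
>     /* A relaxed comparison (e.g. tolerating sites rewritten by the
>        alternative-assembler machinery) would go here. */
>     return 1;
> }
> ```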
> 
> ### xsplice_patches
> 
> Within this section we have an array of structures defining the new code (patches).
> 
> Each structure consists of a pointer to the new code (which in the ELF ends up
> pointing to an offset in the `.text` or `.data` section); the type of patch:
> inline - either text or data - or requesting a trampoline; and the size of the patch.
> 
> The structure is as follows:
> 
> <pre>
> #define XSPLICE_PATCH_INLINE_TEXT   0
> #define XSPLICE_PATCH_INLINE_DATA   1
> #define XSPLICE_PATCH_RELOC_TEXT    2
> 
> struct xsplice_patch {  
>     uint32_t type; /* XSPLICE_PATCH_* .*/  
>     uint32_t size; /* Size of patch. */  
>     uint64_t addr; /* The address of the new code (or data). */  
>     void *content; /* The bytes to be installed. */  
>     uint8_t pad[16]; /* Must be zero. */  
> };
> 
> </pre>
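> 
> The three patch types could be dispatched roughly as follows (a sketch under
> the assumption that inline patches are same-size replacements; the trampoline
> case is only stubbed out here):
> 
> ```c
> #include <stdint.h>
> #include <string.h>
> 
> #define XSPLICE_PATCH_INLINE_TEXT 0
> #define XSPLICE_PATCH_INLINE_DATA 1
> #define XSPLICE_PATCH_RELOC_TEXT  2
> 
> /* Illustrative dispatcher: returns 0 on success, -1 on unknown type. */
> static int xsplice_apply_patch(uint32_t type, void *addr,
>                                const void *content, uint32_t size)
> {
>     switch (type) {
>     case XSPLICE_PATCH_INLINE_TEXT:
>     case XSPLICE_PATCH_INLINE_DATA:
>         /* Same-size, in-place replacement of code or data. */
>         memcpy(addr, content, size);
>         return 0;
>     case XSPLICE_PATCH_RELOC_TEXT:
>         /* Here the old function would instead receive a jump
>            (trampoline) to the new code; elided in this sketch. */
>         return 0;
>     default:
>         return -1;
>     }
> }
> ```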
> 
> ### xsplice_code
> 
> The structure embedded within this section ties it all together.
> It has the name of the patch, and pointers to all the above-mentioned
> structures (their start and end addresses).
> 
> The structure is as follows:
> 
> <pre>
> struct xsplice_code {  
>     const char *name; /* A sensible name for the patch. Up to 40 characters. */  
>     struct xsplice_reloc *relocs, *relocs_end; /* How to patch it */  
>     struct xsplice_section *sections, *sections_end; /* Safety data */  
>     struct xsplice_patch *patches, *patches_end; /* Patch code & data */  
>     uint8_t pad[32]; /* Must be zero. */
> };
> </pre>
> 
> There should only be one such structure in the section.
> 
> ### Example
> 
> *TODO*: Include an objdump of how the ELF would look like for the XSA
> mentioned earlier.
> 
> ## Signature checking requirements.
> 
> The signature checking requires that the layout of the data in memory
> **MUST** be the same for the signature to be verified. This means that the
> payload data layout in ELF format **MUST** match what the hypervisor
> expects so that it can properly do signature verification.
> 
> The signature is computed over all of the payloads laid out contiguously
> in memory. The signature is to be appended at the end of the ELF payload,
> prefixed with the string "~Module signature appended~\n", followed by
> a signature header, then followed by the signature, key identifier, and
> signer's name.
> 
> Specifically the signature header would be:
> 
> <pre>
> #define PKEY_ALGO_DSA       0  
> #define PKEY_ALGO_RSA       1  
> 
> #define PKEY_ID_PGP         0 /* OpenPGP generated key ID */  
> #define PKEY_ID_X509        1 /* X.509 arbitrary subjectKeyIdentifier */  
> 
> #define HASH_ALGO_MD4          0  
> #define HASH_ALGO_MD5          1  
> #define HASH_ALGO_SHA1         2  
> #define HASH_ALGO_RIPE_MD_160  3  
> #define HASH_ALGO_SHA256       4  
> #define HASH_ALGO_SHA384       5  
> #define HASH_ALGO_SHA512       6  
> #define HASH_ALGO_SHA224       7  
> #define HASH_ALGO_RIPE_MD_128  8  
> #define HASH_ALGO_RIPE_MD_256  9  
> #define HASH_ALGO_RIPE_MD_320 10  
> #define HASH_ALGO_WP_256      11  
> #define HASH_ALGO_WP_384      12  
> #define HASH_ALGO_WP_512      13  
> #define HASH_ALGO_TGR_128     14  
> #define HASH_ALGO_TGR_160     15  
> #define HASH_ALGO_TGR_192     16  
> 
> 
> struct elf_payload_signature {  
> 	u8	algo;		/* Public-key crypto algorithm PKEY_ALGO_*. */  
> 	u8	hash;		/* Digest algorithm: HASH_ALGO_*. */  
> 	u8	id_type;	/* Key identifier type PKEY_ID*. */  
> 	u8	signer_len;	/* Length of signer's name */  
> 	u8	key_id_len;	/* Length of key identifier */  
> 	u8	__pad[3];  
> 	__be32	sig_len;	/* Length of signature data */  
> };
> 
> </pre>
> (Note that this has been borrowed from the Linux module signature code.)
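> 
> To illustrate the layout, a consumer could locate the marker string by
> scanning backwards from the end of the file (a sketch; real parsing would
> also validate the signature header fields that follow the marker):
> 
> ```c
> #include <string.h>
> 
> /* Return the offset of the last occurrence of the signature marker,
>    or -1 if the payload is unsigned. The signature header follows
>    immediately after the marker. */
> static long find_signature_marker(const unsigned char *payload, size_t len)
> {
>     static const char marker[] = "~Module signature appended~\n";
>     const size_t mlen = sizeof(marker) - 1;
> 
>     if (len < mlen)
>         return -1;
>     for (size_t off = len - mlen + 1; off-- > 0; )
>         if (memcmp(payload + off, marker, mlen) == 0)
>             return (long)off;
>     return -1;
> }
> ```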
> 
> 
> ## Hypercalls
> 
> We will employ the sub operations of the system management hypercall (sysctl).
> There are to be four sub-operations:
> 
>  * upload the payloads.
>  * list summaries of the uploaded payloads and their state.
>  * get a particular payload's summary and its state.
>  * command to apply, delete, or revert the payload.
> 
> The patching is asynchronous, therefore the caller is responsible
> for verifying that it has been applied properly, by retrieving its summary
> and verifying that there are no error codes associated with the payload.
> 
> We **MUST** make it asynchronous due to the nature of patching: it requires
> every physical CPU to be in lock-step with each other. The patching mechanism,
> while an implementation detail, is not a short operation and as such
> the design **MUST** assume it will be a long-running operation.
> 
> Furthermore it is possible to have multiple different payloads for the same
> function. As such a unique id has to be visible to allow proper manipulation.
> 
> The hypercall is part of the `xen_sysctl`. The top level structure contains
> one uint32_t to determine the sub-operations:
> 
> <pre>
> struct xen_sysctl_xsplice_op {  
>     uint32_t cmd;  
> 	union {  
>           ... see below ...  
>         } u;  
> };  
> 
> </pre>
> while the rest of the hypercall-specific structures are part of this structure.
> 
> 
> ### XEN_SYSCTL_XSPLICE_UPLOAD (0)
> 
> Upload a payload to the hypervisor. The payload is verified and if there
> are any issues the proper return code will be returned. The payload is
> not applied at this time - that is controlled by *XEN_SYSCTL_XSPLICE_ACTION*.
> 
> The caller provides:
> 
>  * `id` unique id.
>  * `payload` the virtual address of where the ELF payload is.
> 
> The return value is zero if the payload was successfully uploaded and the
> signature was verified. Otherwise an EXX return value is provided.
> Duplicate `id`s are not supported.
> 
> The `payload` is the ELF payload as mentioned in the `Payload format` section.
> 
> The structure is as follows:
> 
> <pre>
> struct xen_sysctl_xsplice_upload {  
>     char id[40];  /* IN, name of the patch. */  
>     uint64_t size; /* IN, size of the ELF file. */
>     XEN_GUEST_HANDLE_64(uint8) payload; /* ELF file. */  
> }; 
> </pre>
> 
> ### XEN_SYSCTL_XSPLICE_GET (1)
> 
> Retrieve a summary of a specific payload. The caller provides:
> 
>  * `id` the unique id.
>  * `status` *MUST* be set to zero.
>  * `rc` *MUST* be set to zero.
> 
> The `summary` structure contains a summary of the payload, which includes:
> 
>  * `id` the unique id.
>  * `status` - whether it has been:
>  1. *XSPLICE_STATUS_LOADED* (0) has been loaded.
>  2. *XSPLICE_STATUS_PROGRESS* (1) acting on the **XEN_SYSCTL_XSPLICE_ACTION** command.
>  3. *XSPLICE_STATUS_CHECKED*  (2) the ELF payload safety checks passed.
>  4. *XSPLICE_STATUS_APPLIED* (3) loaded, checked, and applied.
>  5. *XSPLICE_STATUS_REVERTED* (4) loaded, checked, applied and then also reverted.
>  6. *XSPLICE_STATUS_IN_ERROR* (5) loaded and in a failed state. Consult `rc` for details.
>  * `rc` - its error state if any.
> 
> The structure is as follows:
> 
> <pre>
> #define XSPLICE_STATUS_LOADED    0  
> #define XSPLICE_STATUS_PROGRESS  1  
> #define XSPLICE_STATUS_CHECKED   2  
> #define XSPLICE_STATUS_APPLIED   3  
> #define XSPLICE_STATUS_REVERTED  4  
> #define XSPLICE_STATUS_IN_ERROR  5  
> 
> struct xen_sysctl_xsplice_summary {  
>     char id[40];  /* IN/OUT, name of the patch. */  
>     uint32_t status;   /* OUT */  
>     int32_t rc;  /* OUT */  
> }; 
> </pre>
> 
> ### XEN_SYSCTL_XSPLICE_LIST (2)
> 
> Retrieve an array of abbreviated summaries of the payloads that are loaded in the
> hypervisor.
> 
> The caller provides:
> 
>  * `idx` index iterator. Initially it *MUST* be zero.
>  * `count` the max number of entries to populate.
>  * `summary` virtual address of where to write payload summaries.
> 
> The hypercall returns zero on success, updates the `idx` (index) iterator
> with the number of payloads returned, sets `count` to the number of remaining
> payloads, and fills `summary` with the returned payload summaries.
> 
> If the hypercall returns E2BIG the `count` is too big and should be
> lowered.
> 
> Note that due to the asynchronous nature of hypercalls, payloads might have
> been added or removed in the meantime, making this information stale. It is
> the responsibility of the domain to provide proper accounting.
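> 
> A toolstack-side consumer might drive the iterator like this (a sketch:
> `xsplice_list_mock` is a stand-in for the real hypercall, with five payloads
> simulated in a static array; its error return is a stand-in for E2BIG, and
> the IN/OUT semantics follow the description above):
> 
> ```c
> #include <stdint.h>
> #include <string.h>
> 
> struct summary { char id[40]; uint32_t status; int32_t rc; };
> 
> /* Mock hypervisor state: five loaded payloads. */
> static struct summary loaded[5];
> static const uint32_t nloaded = 5;
> 
> /* Mock of XEN_SYSCTL_XSPLICE_LIST: 'idx' is IN the starting index and
>    OUT the number of entries written; 'count' is IN the buffer size and
>    OUT the number of entries remaining. */
> static int xsplice_list_mock(uint32_t *idx, uint32_t *count,
>                              struct summary *buf)
> {
>     if (*count > 64)
>         return -1;  /* stand-in for E2BIG: lower 'count' and retry */
>     uint32_t start = *idx, n = 0;
>     while (n < *count && start + n < nloaded) {
>         buf[n] = loaded[start + n];
>         n++;
>     }
>     *idx = n;                       /* entries returned */
>     *count = nloaded - (start + n); /* entries remaining */
>     return 0;
> }
> 
> /* The caller's loop: fetch two entries at a time until none remain. */
> static uint32_t list_all(struct summary *out)
> {
>     uint32_t pos = 0, total = 0;
>     for (;;) {
>         uint32_t idx = pos, count = 2;
>         if (xsplice_list_mock(&idx, &count, out + total))
>             break;
>         total += idx;
>         pos += idx;
>         if (count == 0 || idx == 0)
>             break;
>     }
>     return total;
> }
> ```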
> 
> The `summary` structure contains a summary of the payload, which includes:
> 
>  * `id` unique id.
>  * `status` - whether it has been:
>  1. *XSPLICE_STATUS_LOADED* (0) has been loaded.
>  2. *XSPLICE_STATUS_PROGRESS* (1) acting on the **XEN_SYSCTL_XSPLICE_ACTION** command.
>  3. *XSPLICE_STATUS_CHECKED*  (2) the payload `old` and `addr` match with the hypervisor.
>  4. *XSPLICE_STATUS_APPLIED* (3) loaded, checked, and applied.
>  5. *XSPLICE_STATUS_REVERTED* (4) loaded, checked, applied and then also reverted.
>  6. *XSPLICE_STATUS_IN_ERROR* (5) loaded and in a failed state. Consult `rc` for details.
>  * `rc` - its error state if any.
> 
> The structure is as follows:
> 
> <pre>
> struct xen_sysctl_xsplice_list {  
>     uint32_t idx;  /* IN/OUT */  
>     uint32_t count;  /* IN/OUT */
>     XEN_GUEST_HANDLE_64(xen_sysctl_xsplice_summary) summary;  /* OUT */  
> };  
> 
> struct xen_sysctl_xsplice_summary {  
>     char id[40];  /* OUT, name of the patch. */  
>     uint32_t status;   /* OUT */  
>     int32_t rc;  /* OUT */  
> };  
> 
> </pre>
> 
> ### XEN_SYSCTL_XSPLICE_ACTION (3)
> 
> Perform an operation on the payload structure referenced by the `id` field.
> The operation request is asynchronous and the status should be retrieved
> by using either **XEN_SYSCTL_XSPLICE_GET** or **XEN_SYSCTL_XSPLICE_LIST** hypercall.
> 
> The caller provides:
> 
>  * `id` the unique id.
>  * `cmd` the command requested:
>   1. *XSPLICE_ACTION_CHECK* (0) check that the payload will apply properly.
>   2. *XSPLICE_ACTION_UNLOAD* (1) unload the payload.  
>   3. *XSPLICE_ACTION_REVERT* (2) revert the payload.  
>   4. *XSPLICE_ACTION_APPLY* (3) apply the payload.   
> 
> 
> The return value will be zero unless the provided fields are incorrect.
> 
> The structure is as follows:
> 
> <pre>
> #define XSPLICE_ACTION_CHECK  0  
> #define XSPLICE_ACTION_UNLOAD 1  
> #define XSPLICE_ACTION_REVERT 2  
> #define XSPLICE_ACTION_APPLY  3  
> 
> struct xen_sysctl_xsplice_action {  
>     char id[40];  /* IN, name of the patch. */  
>     uint32_t cmd; /* IN */  
> };  
> 
> </pre>
> 
> ## Sequence of events.
> 
> The normal sequence of events is to:
> 
>  1. *XEN_SYSCTL_XSPLICE_UPLOAD* to upload the payload. If there are errors *STOP* here.
>  2. *XEN_SYSCTL_XSPLICE_GET* to check the `->status`. If in *XSPLICE_STATUS_PROGRESS* spin. If in *XSPLICE_STATUS_LOADED* go to next step.
>  3. *XEN_SYSCTL_XSPLICE_ACTION* with *XSPLICE_ACTION_CHECK* command to verify that the payload can be successfully applied.
>  4. *XEN_SYSCTL_XSPLICE_GET* to check the `->status`. If in *XSPLICE_STATUS_PROGRESS* spin. If in *XSPLICE_STATUS_CHECKED* go to next step.
>  5. *XEN_SYSCTL_XSPLICE_ACTION* with *XSPLICE_ACTION_APPLY* to apply the patch.
>  6. *XEN_SYSCTL_XSPLICE_GET* to check the `->status`. If in *XSPLICE_STATUS_PROGRESS* spin. If in *XSPLICE_STATUS_APPLIED* exit with success.
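> 
> The spin-then-check pattern in steps 2, 4, and 6 could be factored into one
> helper (a sketch; `get_status_mock` is a made-up stand-in for the
> XEN_SYSCTL_XSPLICE_GET hypercall, here scripted to report PROGRESS twice
> before settling in LOADED):
> 
> ```c
> #define XSPLICE_STATUS_LOADED   0
> #define XSPLICE_STATUS_PROGRESS 1
> 
> /* Scripted stand-in for XEN_SYSCTL_XSPLICE_GET. */
> static int script[] = { XSPLICE_STATUS_PROGRESS, XSPLICE_STATUS_PROGRESS,
>                         XSPLICE_STATUS_LOADED };
> static int script_pos;
> 
> static int get_status_mock(const char *id)
> {
>     (void)id;  /* a real caller would pass the payload's unique id */
>     return script[script_pos++];
> }
> 
> /* Spin while the hypervisor reports PROGRESS; succeed only if the
>    payload settles in the wanted state. */
> static int wait_for_status(const char *id, int want)
> {
>     for (;;) {
>         int s = get_status_mock(id);
>         if (s == XSPLICE_STATUS_PROGRESS)
>             continue;
>         return s == want ? 0 : -1;
>     }
> }
> ```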
> 
>  
> ## Addendum
> 
> Implementation quirks should not be discussed in a design document.
> 
> However these observations can provide aid when developing against this
> document.
> 
> 
> ### Alternative assembler
> 
> Alternative assembler is a mechanism to use different instructions depending
> on what the CPU supports. This is done by providing multiple streams of code
> that can be patched in - or if the CPU does not support it - padded with
> `nop` operations. The alternative assembler macros cause the compiler to
> expand the code to place the most generic code in place, and emit a special
> ELF .section header to tag this location. During run-time the hypervisor
> can leave the areas alone or patch them with better-suited opcodes.
> 
> As we might be patching the alternative assembler sections as well - by
> providing a new better suited op-codes or perhaps with nops - we need to
> also re-run the alternative assembler patching after we have done our
> patching.
> 
> Also, the code we are inspecting during safety checks might itself be
> utilizing alternative assembler. As such we should relax our checks to
> accommodate that.
> 
> ### .rodata sections
> 
> The patching might require strings to be updated as well. As such we must
> also be able to patch the strings as needed. This sounds simple - but the
> compiler has a habit of coalescing identical strings - which means that if we
> alter the strings in place, other users will be inadvertently affected as well.
> 
> This is also where pointers to functions live - and we may need to patch these
> as well.
> 
> To guard against that we must be prepared to do patching similar to
> trampoline patching or in-line depending on the flavour. If we can
> do in-line patching we would need to:
> 
>  * alter `.rodata` to be writeable.
>  * inline patch.
>  * alter `.rodata` to be read-only.
> 
> If we are doing trampoline patching we would need to:
> 
>  * allocate a new memory location for the string.
>  * all locations which use this string will have to be updated to use the
>    offset to the string.
>  * mark the region RO when we are done.
> 
> ### .bss sections
> 
> Patching writable data is not suitable as it is unclear what should be done
> depending on the current state of data. As such it should not be attempted.
> 
> 
> ### Patching code which is on the stack.
> 
> We should not patch code which is on the stack. That can lead
> to corruption.
> 
> ### Trampoline (e9 opcode)
> 
> The e9 opcode used for jmpq uses a 32-bit signed displacement. That means
> we are limited to placing the new code within 2GB of virtual address space
> of the old code. That should not be a problem since the Xen hypervisor has
> a very small footprint.
> 
> However, if needed, we can always chain two trampolines: one within the 2GB
> limit that jumps to the next trampoline.
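> 
> The displacement computation can be sketched as follows (note the `-5`:
> the rel32 displacement is relative to the end of the 5-byte jmp
> instruction; the helper name is made up):
> 
> ```c
> #include <stdint.h>
> #include <string.h>
> 
> /* Build a 5-byte 'jmp rel32' (opcode 0xe9) at 'from' targeting 'to'.
>    Returns -1 if the target is outside the signed 32-bit range. */
> static int make_jmp32(uint8_t buf[5], uint64_t from, uint64_t to)
> {
>     int64_t delta = (int64_t)to - (int64_t)(from + 5);
> 
>     if (delta < INT32_MIN || delta > INT32_MAX)
>         return -1;  /* out of +/-2GB reach; chain trampolines instead */
> 
>     int32_t rel = (int32_t)delta;
>     buf[0] = 0xe9;
>     memcpy(buf + 1, &rel, sizeof(rel));  /* x86 is little-endian */
>     return 0;
> }
> ```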
> 
> ### Time rendezvous code instead of stop_machine for patching
> 
> The hypervisor's time rendezvous code runs synchronously across all CPUs
> every second. Using stop_machine to patch can stall the time rendezvous
> code and result in an NMI. As such, having the patching be done at the tail
> of the rendezvous code should avoid this problem.
> 
> ### Security
> 
> Only the privileged domain should be allowed to do this operation.
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel
> 

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC v2] xSplice design
  2015-05-18 12:41 ` Jan Beulich
@ 2015-06-05 14:49   ` Konrad Rzeszutek Wilk
  2015-06-05 15:16     ` Jan Beulich
  0 siblings, 1 reply; 37+ messages in thread
From: Konrad Rzeszutek Wilk @ 2015-06-05 14:49 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Elena Ufimtseva, jeremy, hanweidong, john.liuqiming, Paul Voccio,
	xen-devel, Daniel Kiper, Major Hayden, liuyingdong, aliguori,
	konrad, lars.kurth, Steven Wilson, peter.huangpeng, msw,
	xiantao.zxt, Rick Harris, boris.ostrovsky, Josh Kearney,
	jinsong.liu, Antony Messerli, fanhenglong, andrew.cooper3

On Mon, May 18, 2015 at 01:41:34PM +0100, Jan Beulich wrote:
> >>> On 15.05.15 at 21:44, <konrad.wilk@oracle.com> wrote:
> > As such having the payload in an ELF file is the sensible way. We would be
> > carrying the various set of structures (and data) in the ELF sections under
> > different names and with definitions. The prefix for the ELF section name
> > would always be: *.xsplice_*
> > 
> > Note that every structure has padding. This is added so that the hypervisor
> > can re-use those fields as it sees fit.
> > 
> > There are five sections *.xsplice_* sections:
> 
> A general remark: Having worked on ELF on different occasions,
> UNIX'es (and hence Linux'es) use of section names to identify the
> sections' purposes is pretty contrary to ELF's original intentions.
> Section types (and architecture as well as OS extensions to them)
> exist for a reason. Of course from the following sections it's not
> really clear in how far you intended this to be done. If you did
> and just didn't explicitly say so, pointing out that relations
> between sections should then also be expressed suitably via ELF
> mechanisms (sh_link, sh_info) would then be unnecessary too. If
> you didn't, then I'd strongly suggest switching to a formally more
> correct model, which would then include specifying section types
> and section attributes (and optional other relevant aspects)
> along with their (then no longer mandatory) names.


Something like:
http://docs.oracle.com/cd/E23824_01/html/819-0690/chapter7-1.html

With (for example) a pointer to a specific section:
http://docs.oracle.com/cd/E23824_01/html/819-0690/chapter7-28.html#scrolltoc
?

> 
> >  * `.xsplice_symbols` and `.xsplice_str`. The array of symbols to be referenced
> >    during the update. This can contain the symbols (functions) that will be
> >    patched, or the list of symbols (functions) to be checked pre-patching which
> >    may not be on the stack.
> > 
> > * `.xsplice_reloc` and `.xsplice_reloc_howto`. The howto properly construct
> >    trampolines for an patch. We can have multiple locations for which we
> >    need to insert an trampoline for a payload and each location might require
> >    a different way of handling it. This would naturally reference the `.text`
> >    section and its proper offset. The `.xsplice_reloc` is not directly concerned
> >    with patches but rather is an ELF relocation - describing the target
> >    of a relocation and how that is performed.  They're also used for where
> >    the new code references the run code too.
> > 
> >  * `.xsplice_sections`. The safety data for the old code and new code.
> >    This contains an array of symbols (pointing to `.xsplice_symbols` to
> >    and `.text`) which are to be used during safety and dependency checking.
> > 
> > 
> >  * `.xsplice_patches`: The description of the new functions to be patched
> >    in (size, type, pointer to code, etc.).
> > 
> >  * `.xsplice_change`. The structure that ties all of this together and defines
> >    the payload.
> > 
> > Additionally the ELF file would contain:
> > 
> >  * `.text` section for the new and old code (function).
> >  * `.rela.text` relocation data for the `.text` (both new and old).
> >  * `.rela.xsplice_patches` relocation data for `.xsplice_patches` (such as offset
> >    to the `.text` ,`.xsplice_symbols`, or `.xsplice_reloc` section).
> 
> I think you should not immediately rule out REL relocations, i.e.
> specify both .rela.* and .rel.*.
> 
> >  * `.bss` section for the new code (function)
> >  * `.data` and `.data.read_mostly` section for the new and old code (function)
> >  * `.rodata` section for the new and old code (function).
> > 
> > In short the *.xsplice_* sections represent various structures and the
> > ELF provides the mechanism to glue it all together when loaded in memory.
> 
> As per above - I think ELF can do this in a more explicit way. Of
> course that would imply sticking to strictly ELF types in the structure
> definitions below. That would in turn have the advantage of
> expressing e.g. "const char *" references to strings by section
> offsets into a specified string section, reducing (with the ultimate
> goal of eliminating) the need for processing relocations on the
> ELF object (possible references to "external symbols" left aside for
> the moment).

Do you think that going that way would be too complex for this design?
Folks have expressed that perhaps I went overboard with this ELF and
should have just stuck with structures and just mention that it is
in an ELF file?

> 
> > Note that a lot of these ideas are borrowed from kSplice which is
> > available at: https://github.com/jirislaby/ksplice 
> > 
> > For ELF understanding the best starting point is the OSDev Wiki
> > (http://wiki.osdev.org/ELF). Furthermore the ELF specification is
> > at http://www.skyfree.org/linux/references/ELF_Format.pdf and
> > at Oracle's web site:
> > http://docs.oracle.com/cd/E23824_01/html/819-0690/chapter6-46512.html#scrolltoc
> 
> I think the master copy still is
> 
> http://www.sco.com/developers/gabi/
> 
> > ### xsplice_reloc and xsplice_reloc_howto
> > 
> > The section contains an array of a structure that outlines the different
> > locations (and howto) for which an trampoline is to be inserted.
> > 
> > The howto defines in the detail the change. It contains the type,
> > whether the relocation is relative, the size of the relocation,
> > bitmask for which parts of the instruction or data are to be replaced,
> > amount of final relocation is shifted by (to drop unwanted data), and
> > whether the replacement should be interpreted as signed value.
> > 
> > The structure is as follow:
> > 
> > <pre>
> > #define XSPLICE_HOWTO_RELOC_INLINE  0 /* Inline replacement. */  
> > #define XSPLICE_HOWTO_RELOC_PATCH   1 /* Add trampoline. */  
> > #define XSPLICE_HOWTO_RELOC_DATA    2 /*  __DATE__ type change. */  
> 
> DATE or DATA?

DATE.

> 
> > #define XSPLICE_HOWTO_RELOC_TIME    3 /* __TIME__ type chnage. */  
> 
> Assuming the former is DATE - what do you foresee date/time
> kinds of changes to be good for?

In case we have source code which uses __DATE__ and the binary patch
modifies it - we would need to sort out other users of the __DATE__
and make sure that they are not affected (or perhaps they should be).

> 
> > #define XSPLICE_HOWTO_BUG           4 /* BUG_ON being replaced.*/  
> > #define XSPLICE_HOWTO_EXTABLE       5 /* exception_table change. */  
> > #define XSPLICE_HOWTO_SYMBOL        6 /* change in symbol table. */  
> 
> For none of these I really understand their purpose.

This is to help the hypervisor to figure out which of the lookup
mechanisms to use to find the relevant section. As in if we need
to update (as part of the patch) a normal function, but also
the exception table (because the new function is at a new virtual
address) we want to search in the exception table for the old 
one and replace it with the new virtual address.

> i
> > ### xsplice_sections
> > 
> > The structure defined in this section is used to verify that it is safe
> > to update with the new changes. It can contain safety data on the old code
> > and what kind of matching we are to expect.
> > 
> > It also can contain safety date of what to check when about to patch.
> > That is whether any of the addresses (either provided or resolved
> > when payload is loaded by referencing the symbols) are in memory
> > with what we expect it to be.
> > 
> > As such the flags can be or-ed together:
> > 
> > <pre>
> > #define XSPLICE_SECTION_TEXT   0x00000001 /* Section is in .text */  
> > #define XSPLICE_SECTION_RODATA 0x00000002 /* Section is in .ro */  
> > #define XSPLICE_SECTION_DATA   0x00000004 /* Section is in .rodata */  
> 
> DATA or .rodata?

Ooops. That should be the .rodata just by itself.
> 
> > #define XSPLICE_SECTION_STRING 0x00000008 /* Section is in .str */  
> > #define XSPLICE_SECTION_ALTINSTRUCTIONS 0x00000010 /* Section has .altinstructions. */  
> 
> I don't see the purpose of adding flags for a few special sections we
> may know about now - what if over time hundreds of sections appear?

We just need to know if it has - and then the hypervisor needs
to call apply_altinstructions to redo the alternative instructions.

I presume you are advocating to move this flag to a 'higher' level part
of the structures?

> 
> > #define XSPLICE_SECTION_TEXT_INPLACE 0x00000200 /* Change is in place. */   
> > #dekine XSPLICE_SECTION_MATCH_EXACT 0x00000400 /* Must match exactly. */  
> > #define XSPLICE_SECTION_NO_STACKCHECK 0x00000800 /* Do not check the stack. */  
> 
> This isn't the first reference to "stack", without me having seen a
> description of what the word is meant to stand for in such contexts.
> Did I overlook it?

I mentioned it earlier in 'Example of trampoline and in-place splicing':

126 However it has severe drawbacks - the safety checks which have to make sure     
127 the function is not on the stack - must also check every caller. For some       
128 patches this could if there were an sufficient large amount of callers          
129 that we would never be able to apply the update.

But perhaps I should dive much deeper in about the patching
mechanism having to check whether the function to be patched
is on the stack.
> 
> > struct xsplice_section {  
> >     struct xsplice_symbol *symbol; /* The symbol associated with this change. */  
> >     uint64_t address; /* The address of the section (if known). */  
> >     uint64_t size; /* The size of the section. */  
> 
> I don't think you need 64-bit types here even on 32-bit architectures.
> Use Elf*_Addr and Elf*_Word?

OK.
> 
> > ### xsplice_patches
> > 
> > Within this section we have an array of a structure defining the new code 
> > (patch).
> > 
> > This structure consist of an pointer to the new code (which in ELF ends up
> > pointing to an offset in `.text` or `.data` section); the type of patch:
> > inline - either text or data, or requesting an trampoline; and size of patch.
> > 
> > The structure is as follow:
> > 
> > <pre>
> > #define XSPLICE_PATCH_INLINE_TEXT   0
> > #define XSPLICE_PATCH_INLINE_DATA   1
> > #define XSPLICE_PATCH_RELOC_TEXT    2
> 
> Again it remains vague what these are supposed to express.

(So I don't forget to update):

INLINE_TEXT: Patch the function in-line. As in alter it at the original
virtual address location. The patching MUST have the same size as the
original code.

INLINE_DATA: Patch the data in-line. As in alter it at the original
virtual address location. The patching data MUST have the same size
as the original data.

RELOC_TEXT: Patch the original code with a jump to the new function.
Allocate a new page (or more) for the function and insert a trampoline in
the old function to jump to the new code.
> 
> > ### XEN_SYSCTL_XSPLICE_GET (1)
> > 
> > Retrieve an summary of an specific payload. This caller provides:
> > 
> >  * `id` the unique id.
> >  * `status` *MUST* be set to zero.
> >  * `rc` *MUST* be set to zero.
> > 
> > The `summary` structure contains an summary of payload which includes:
> > 
> >  * `id` the unique id.
> >  * `status` - whether it has been:
> >  1. *XSPLICE_STATUS_LOADED* (0) has been loaded.
> >  2. *XSPLICE_STATUS_PROGRESS* (1) acting on the **XEN_SYSCTL_XSPLICE_ACTION** command.
> >  3. *XSPLICE_STATUS_CHECKED*  (2) the ELF payload safety checks passed.
> >  4. *XSPLICE_STATUS_APPLIED* (3) loaded, checked, and applied.
> >  5. *XSPLICE_STATUS_REVERTED* (4) loaded, checked, applied and then also reverted.
> >  6. *XSPLICE_STATUS_IN_ERROR* (5) loaded and in a failed state. Consult `rc` for details.
> >  * `rc` - its error state if any.
> 
> This duplication of status and rc looks odd to me, the more that only
> appears to mean an error. How about using negative status values
> as error indicators?

Perfect idea!
> 
> > ## Sequence of events.
> > 
> > The normal sequence of events is to:
> > 
> >  1. *XEN_SYSCTL_XSPLICE_UPLOAD* to upload the payload. If there are errors 
> > *STOP* here.
> >  2. *XEN_SYSCTL_XSPLICE_GET* to check the `->status`. If in 
> > *XSPLICE_STATUS_PROGRESS* spin. If in *XSPLICE_STATUS_LOADED* go to next 
> > step.
> >  3. *XEN_SYSCTL_XSPLICE_ACTION* with *XSPLICE_ACTION_CHECK* command to 
> > verify that the payload can be succesfully applied.
> >  4. *XEN_SYSCTL_XSPLICE_GET* to check the `->status`. If in 
> > *XSPLICE_STATUS_PROGRESS* spin. If in *XSPLICE_STATUS_CHECKED* go to next 
> > step.
> >  5. *XEN_SYSCTL_XSPLICE_ACTION* with *XSPLICE_ACTION_APPLY* to apply the 
> > patch.
> >  6. *XEN_SYSCTL_XSPLICE_GET* to check the `->status`. If in 
> > *XSPLICE_STATUS_PROGRESS* spin. If in *XSPLICE_STATUS_APPLIED* exit with 
> > success.
> 
> Is this meant to imply that e.g. ACTION_APPLY will fail unless the
> respective payload is in CHECKED state?

Correct. I should include a state diagram to make this obvious. Thank you!

> 
> > ### Alternative assembler
> > 
> > Alternative assembler is a mechanism to use different instructions depending
> > on what the CPU supports. This is done by providing multiple streams of code
> > that can be patched in - or if the CPU does not support it - padded with
> > `nop` operations. The alternative assembler macros cause the compiler to
> > expand the code to place a most generic code in place - emit a special
> > ELF .section header to tag this location. During run-time the hypervisor
> > can leave the areas alone or patch them with an better suited opcodes.
> > 
> > As we might be patching the alternative assembler sections as well - by
> > providing new, better suited op-codes or perhaps with nops - we need to
> > also re-run the alternative assembler patching after we have done our
> > patching.
> 
> These sections are part of .init.* and as such can't reasonably be
> subject to patching.

True, though I thought we had some for run-time as well.
> 
> > ### .bss sections
> > 
> > Patching writable data is not suitable as it is unclear what should be done
> > depending on the current state of data. As such it should not be attempted.
> 
> This certainly also applies to .data.

True. Thank you.
> 
> > ### Time rendezvous code instead of stop_machine for patching
> > 
> > The hypervisor's time rendezvous code runs synchronously across all CPUs
> > every second. Using stop_machine to patch can stall the time rendezvous
> > code and result in an NMI. As such having the patching be done at the tail
> > of rendezvous code should avoid this problem.
> 
> I know this was brought up on the hackathon, but did you verify this
> to be a problem? Because if it is, we already have an issue here as

No. Code-wise I've just cooked up the hypercall invocation code and haven't
tried any of the harder stuff.

> we already use stop_machine for things like CPU offlining.
> 

Thank you for your review!
> Jan

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC v2] xSplice design
  2015-05-18 12:54 ` Liuqiming (John)
  2015-05-18 13:11   ` Daniel Kiper
@ 2015-06-05 14:50   ` Konrad Rzeszutek Wilk
  1 sibling, 0 replies; 37+ messages in thread
From: Konrad Rzeszutek Wilk @ 2015-06-05 14:50 UTC (permalink / raw)
  To: Liuqiming (John)
  Cc: Elena Ufimtseva, jeremy, hanweidong, jbeulich, Paul Voccio,
	Daniel Kiper, Major Hayden, liuyingdong, aliguori, konrad,
	xiantao.zxt, lars.kurth, Steven Wilson, peter.huangpeng, msw,
	xen-devel, Rick Harris, boris.ostrovsky, Josh Kearney,
	jinsong.liu, Antony Messerli, fanhenglong, andrew.cooper3

On Mon, May 18, 2015 at 08:54:22PM +0800, Liuqiming (John) wrote:
> Hi Konrad,
> 
> Will this design include hotpatch build tools chain?

Yes, that is certainly the idea.
> Such as how these .xsplice_ sections are created? How to handle xen symbols when creating hotpatch elf file?

Right now I envision using objdump and objcopy along with special GCC
parameters to create .o.xsplice files.

It would be rudimentary - and I hope that folks would help
in doing this work.


* Re: [RFC v2] xSplice design
  2015-05-20 15:11 ` Martin Pohlack
@ 2015-06-05 15:00   ` Konrad Rzeszutek Wilk
  2015-06-05 15:15     ` Andrew Cooper
                       ` (2 more replies)
  0 siblings, 3 replies; 37+ messages in thread
From: Konrad Rzeszutek Wilk @ 2015-06-05 15:00 UTC (permalink / raw)
  To: Martin Pohlack
  Cc: Elena Ufimtseva, jeremy, hanweidong, jbeulich, john.liuqiming,
	Paul Voccio, Daniel Kiper, Major Hayden, liuyingdong, aliguori,
	xiantao.zxt, lars.kurth, Steven Wilson, peter.huangpeng, msw,
	xen-devel, Rick Harris, boris.ostrovsky, Josh Kearney,
	jinsong.liu, Antony Messerli, konrad, fanhenglong,
	andrew.cooper3

On Wed, May 20, 2015 at 05:11:20PM +0200, Martin Pohlack wrote:
> Hi,
> 
> this looks very interesting.

Thank you!
> 
> I have talked about an experimental Xen hotpatching design at Linux
> Plumbers Conference 2014 in Düsseldorf, slides are here:
> 
> http://www.linuxplumbersconf.net/2014/ocw//system/presentations/2421/original/xen_hotpatching-2014-10-16.pdf
> 
> and would like to share a couple of observations from the slides:
> 
> * We found splicing at hypervisor exit with a barrier for all CPUs to
>   be a good place, because all Xen stacks are effectively empty at
>   that point.  We use a barrier with a maximum timeout (that can be
>   adapted).  The approach is similar in concept to stop_machine and
>   the time rendezvous but is time-bound and does not require all the
>   asynchronous operations discussed below.

hypervisor exit = vmexit?

> 
> * Hotpatch generation often requires support for compiling the target
>   with -ffunction-sections / -fdata-sections.  Last time I looked, you
>   can compile Xen with that option, but the linker scripts are not
>   fully adapted to deal with it and the resulting hypervisor cannot be
>   booted.

Correct. That (adapting the Makefile to do a second build with this)
would be needed as well.

> 
> * Xen as it is now, has a couple of non-unique symbol names which will
>   make runtime symbol identification hard.  Sometimes, static symbols
>   simply have the same name in C files, sometimes such symbols get
>   included via header files, and some C files are also compiled
>   multiple times and linked under different names (guest_walk.c).  I
>   think the Linux kernel solves this by aiming at unique symbol names
>   even for local symbols.
> 
>   nm xen-syms | cut -f3- -d\  | sort | uniq -c | sort -nr | head

Oh my. Looks like a couple of preparation patches will be in order!
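
For illustration, the duplicate scan that the `nm` pipeline above performs can be sketched as follows (the sample symbol listing is made up):

```python
from collections import Counter

# Hypothetical nm-style output: "address type name" per line.
NM_OUTPUT = """\
ffff82d080100000 t do_something
ffff82d080100040 t do_something
ffff82d080100080 T public_entry
"""

def duplicate_symbols(nm_text):
    """Count symbol names; any name seen more than once is ambiguous
    for runtime symbol lookup."""
    names = Counter(line.split()[-1]
                    for line in nm_text.splitlines() if line.strip())
    return {name: n for name, n in names.items() if n > 1}
```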

> 
> * You may need support for adapting or augmenting exception tables if
>   patching such code is desired (it probably is).  Hotpatches may need
>   to bring their own small exception tables (similar to how Linux
>   modules support this).  If you don't plan on supporting hotpatches
>   that introduce additional exception-locations, one could also change
>   the exception table in-place and reorder it afterwards.

Each patch carries 'section' structures which define what kind
of massaging needs to be done. As in, each of these 'section' structures
says:
 - At the original address X there was an exception table
 - (or) At the original address Y there was an alt assembler

And we would:
 - During 'check' state, confirm that indeed X is in the exception table
   (or Y in the alt assembler)

 - During the apply state, fix the exception table X offset to point to
   the new virtual address.

Or perhaps I am over-simplifying it? My recollection of the exception
table code is that it would be mostly dealing with the table and
changing the virtual address, but I haven't dug into it in detail.

> 
> * Hotpatch interdependencies are tricky.  IIRC the discussion at the

Oooh, that bubbled in my nightmare subconscious but hadn't burst out yet.
Ugh, what did I sign up for! 

>   Live Patching track at LPC, both kpatch and kgraft aimed, at the
>   time, at a single large patch that subsumes and replaces all previous
>   ones.
> 
> Some more comments inline below ...
> 
> Cheers,
> Martin
> 
> On 15.05.2015 21:44, Konrad Rzeszutek Wilk wrote:
> > Hey!
> > 
> > During the Xen Hacka^H^H^H^HProject Summit? we chatted about live-patching
> > the hypervisor. We sketched out how it could be done, and brainstormed
> > some of the problems.
> > 
> > I took that and wrote an design - which is very much RFC. The design is
> > laid out in two sections - the format of the ELF payload - and then the
> > hypercalls to act on it.
> > 
> > Hypercall preemption has caused a couple of XSAs so I've baked the need
> > for that in the design so we hopefully won't have an XSA for this code.
> > 
> > There are two big *TODO* in the design which I had hoped to get done
> > before sending this out - however I am going on vacation for two weeks
> > so I figured it would be better to send this off for folks to mull now
> > then to have it languish.
> > 
> > Please feel free to add more folks on the CC list.
> > 
> > Enjoy!
> > 
> > 
> > # xSplice Design v1 (EXTERNAL RFC v2)
> > 
> > ## Rationale
> > 
> > A mechanism is required to binarily patch the running hypervisor with new
> > opcodes that have come about due to primarily security updates.
> > 
> > This document describes the design of the API that would allow us to
> > upload to the hypervisor binary patches.
> > 
> > ## Glossary
> > 
> >  * splice - patch in the binary code with new opcodes
> >  * trampoline - a jump to a new instruction.
> >  * payload - telemetries of the old code along with binary blob of the new
> >    function (if needed).
> >  * reloc - telemetries contained in the payload to construct proper trampoline.
> > 
> > ## Multiple ways to patch
> > 
> > The mechanism needs to be flexible to patch the hypervisor in multiple ways
> > and be as simple as possible. The compiled code is contiguous in memory with
> > no gaps - so we have no luxury of 'moving' existing code and must either
> > insert a trampoline to the new code to be executed - or only modify in-place
> > the code if there is sufficient space. The placement of new code has to be done
> > by the hypervisor and the virtual address for the new code is allocated dynamically.
> >
> > This implies that the hypervisor must compute the new offsets when splicing
> > in the new trampoline code. Where the trampoline is added (inside
> > the function we are patching or just the callers?) is also important.
> > 
> > To lessen the amount of code in hypervisor, the consumer of the API
> > is responsible for identifying which mechanism to employ and how many locations
> > to patch. Combinations of modifying in-place code, adding trampoline, etc
> > has to be supported. The API should allow read/write any memory within
> > the hypervisor virtual address space.
> > 
> > We must also have a mechanism to query what has been applied and a mechanism
> > to revert it if needed.
> > 
> > We must also have a mechanism to: provide a copy of the old code - so that
> > the hypervisor can verify it against the code in memory; the new code;
> 
> As Xen has no stable in-hypervisor API / ABI, you need to make sure
> that a generated module matches a target hypervisor.  In our design,
> we use build IDs for that (ld --build-id).  We embed build IDs at Xen
> compile time and can query a running hypervisor for its ID and only
> load matching patches.
> 
> This seems to be an alternative to your proposal to include old code
> into hotpatch modules.

That is much simpler. There is a nice aspect to the old-code check - you
can detect (and deal with) patching already-patched code.

As in, if the payload was configured to be applied on top of an already
patched function, it would patch nicely. But if the payload is against
the virgin code - and the hypervisor is running an older patch - we would
bail out.
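
That check can be sketched as a plain byte comparison (illustrative; the byte strings below are made-up stand-ins for real function prologues):

```python
def can_apply(expected_old, live_bytes):
    """Refuse to patch unless memory contains exactly the code the
    payload was built against - virgin code or a known earlier patch."""
    return expected_old == live_bytes

VIRGIN  = bytes.fromhex("554889e5")    # hypothetical original prologue
PATCHED = bytes.fromhex("e900100000")  # hypothetical jmp left by patch v1
```

A v2 payload recording PATCHED as its expected old code applies cleanly on top of v1, while a payload built against VIRGIN bails out on a patched hypervisor.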

> 
> > the symbol name of the function to be patched; or offset from the symbol;
> > or virtual address.
> > 
> > The complications that this design will encounter are explained later
> > in this document.
> > 
> > ## Patching code
> > 
> > The first mechanism to patch that comes in mind is in-place replacement.
> > That is replace the affected code with new code. Unfortunately the x86
> > ISA is variable size which places limits on how much space we have available
> > to replace the instructions.
> > 
> > The second mechanism is by replacing the call or jump to the
> > old function with the address of the new function.
> 
> We found splicing via a non-conditional JMP in the beginning of the
> old target function to be a sensible and uncomplicated approach.  This
> approach works at function level which feels like a very natural unit
> to work at.  The really nice thing of this approach is that you don't
> have to track all the, potential indirect, call sites.

<nods>
> 
> As you discussed, if you allocate hotpatch memory within +-2GB of the
> patch location, no further trampoline indirection is required, a
> 5-byte JMP does the trick on x86.  We found that to be sufficient in
> our experiments.

And worst case, if you do need more than +-2GB, you can just have
two jumps. Kind of silly, but possible.
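
The +-2GB constraint follows from the 5-byte `jmp rel32` encoding; a sketch of the range check and encoding (illustrative):

```python
import struct

JMP_LEN = 5  # opcode 0xE9 + signed 32-bit displacement

def encode_jmp_rel32(src, dst):
    """Return the 5-byte 'jmp rel32' placed at address src targeting dst,
    or None when dst is out of rel32 range (a second hop would be needed)."""
    disp = dst - (src + JMP_LEN)  # displacement is from the next instruction
    if not -2**31 <= disp < 2**31:
        return None
    return b"\xe9" + struct.pack("<i", disp)
```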

Thank you for your input and looking forward to your reply!


* Re: [RFC v2] xSplice design
  2015-06-05 15:00   ` Konrad Rzeszutek Wilk
@ 2015-06-05 15:15     ` Andrew Cooper
  2015-06-05 15:27     ` Jan Beulich
  2015-06-08 14:38     ` Martin Pohlack
  2 siblings, 0 replies; 37+ messages in thread
From: Andrew Cooper @ 2015-06-05 15:15 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk, Martin Pohlack
  Cc: Elena Ufimtseva, jeremy, hanweidong, jbeulich, john.liuqiming,
	Paul Voccio, Daniel Kiper, Major Hayden, liuyingdong, aliguori,
	xiantao.zxt, lars.kurth, Steven Wilson, peter.huangpeng, msw,
	xen-devel, Rick Harris, boris.ostrovsky, Josh Kearney,
	jinsong.liu, Antony Messerli, konrad, fanhenglong

On 05/06/15 16:00, Konrad Rzeszutek Wilk wrote:
>> As you discussed, if you allocate hotpatch memory within +-2GB of the
>> > patch location, no further trampoline indirection is required, a
>> > 5-byte JMP does the trick on x86.  We found that to be sufficient in
>> > our experiments.
> And worst case if you do need more than +-2GB you can just have
> two jumps. Kind of silly but possible.
>
> Thank you for your input and looking forward to your reply!

The Xen combined text/rodata/data/init/bss size is currently less than
4MB.  The alignment for superpages bumps this to 12MB (so plenty of free
space), and arbitrary extra space can be inserted if wanted.  Xen has
(just less than) 1GB of virtual space reserved for this area.

If a hotpatch is going to blow these limits, you have bigger problems to
worry about.

~Andrew


* Re: [RFC v2] xSplice design
  2015-06-05 14:49   ` Konrad Rzeszutek Wilk
@ 2015-06-05 15:16     ` Jan Beulich
  2015-06-05 16:00       ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 37+ messages in thread
From: Jan Beulich @ 2015-06-05 15:16 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Elena Ufimtseva, jeremy, hanweidong, john.liuqiming, Paul Voccio,
	xen-devel, Daniel Kiper, Major Hayden, liuyingdong, aliguori,
	konrad, lars.kurth, Steven Wilson, peter.huangpeng, msw,
	xiantao.zxt, Rick Harris, boris.ostrovsky, Josh Kearney,
	jinsong.liu, Antony Messerli, fanhenglong, andrew.cooper3

>>> On 05.06.15 at 16:49, <konrad.wilk@oracle.com> wrote:
> On Mon, May 18, 2015 at 01:41:34PM +0100, Jan Beulich wrote:
>> >>> On 15.05.15 at 21:44, <konrad.wilk@oracle.com> wrote:
>> > As such having the payload in an ELF file is the sensible way. We would be
>> > carrying the various set of structures (and data) in the ELF sections under
>> > different names and with definitions. The prefix for the ELF section name
>> > would always be: *.xsplice_*
>> > 
>> > Note that every structure has padding. This is added so that the hypervisor
>> > can re-use those fields as it sees fit.
>> > 
>> > There are five sections *.xsplice_* sections:
>> 
>> A general remark: Having worked on ELF on different occasions,
>> UNIX'es (and hence Linux'es) use of section names to identify the
>> sections' purposes is pretty contrary to ELF's original intentions.
>> Section types (and architecture as well as OS extensions to them)
>> exist for a reason. Of course from the following sections it's not
>> really clear in how far you intended this to be done. If you did
>> and just didn't explicitly say so, pointing out that relations
>> between sections should then also be expressed suitably via ELF
>> mechanisms (sh_link, sh_info) would then be unnecessary too. If
>> you didn't, then I'd strongly suggest switching to a formally more
>> correct model, which would then include specifying section types
>> and section attributes (and optional other relevant aspects)
>> along with their (then no longer mandatory) names.
> 
> 
> Something like:
> http://docs.oracle.com/cd/E23824_01/html/819-0690/chapter7-1.html 

Yes, the SHT_SUNW_* ones are examples of what I mean.

> With (for example) a pointer to a specific section:
> http://docs.oracle.com/cd/E23824_01/html/819-0690/chapter7-28.html#scrolltoc 
> ?

Not sure what part of that page you mean to refer to.

>> >  * `.bss` section for the new code (function)
>> >  * `.data` and `.data.read_mostly` section for the new and old code (function)
>> >  * `.rodata` section for the new and old code (function).
>> > 
>> > In short the *.xsplice_* sections represent various structures and the
>> > ELF provides the mechanism to glue it all together when loaded in memory.
>> 
>> As per above - I think ELF can do this in a more explicit way. Of
>> course that would imply sticking to strictly ELF types in the structure
>> definitions below. That would in turn have the advantage of
>> expressing e.g. "const char *" references to strings by section
>> offsets into a specified string section, reducing (with the ultimate
>> goal of eliminating) the need for processing relocations on the
>> ELF object (possible references to "external symbols" left aside for
>> the moment).
> 
> Do you think that going that way would be too complex for this design?
> Folks have expressed that perhaps I went overboard with this ELF and
> should have just stuck with structures and just mention that it is
> in an ELF file?

I think it should be done either completely ELF-agnostic (with ELF
just happening to be the container to convey the data) or fully ELF-
compliant.

>> > #define XSPLICE_HOWTO_RELOC_TIME    3 /* __TIME__ type change. */
>> 
>> Assuming the former is DATE - what do you foresee date/time
>> kinds of changes to be good for?
> 
> In case we have source code which uses __DATE__ and the binary patch
> modifies it - we would need to sort out other users of the __DATE__
> and make sure that they are not affected (or perhaps they should be).

And how is this different from arbitrary other string literals?

>> > #define XSPLICE_HOWTO_BUG           4 /* BUG_ON being replaced.*/  
>> > #define XSPLICE_HOWTO_EXTABLE       5 /* exception_table change. */  
>> > #define XSPLICE_HOWTO_SYMBOL        6 /* change in symbol table. */  
>> 
>> For none of these I really understand their purpose.
> 
> This is to help the hypervisor to figure out which of the lookup
> mechanisms to use to find the relevant section. As in if we need
> to update (as part of the patch) a normal function, but also
> the exception table (because the new function is at a new virtual
> address) we want to search in the exception table for the old 
> one and replace it with the new virtual address.

I think this gets too specific. Telling apart code and data may make
sense, but I think beyond that things should be expressed as
generically as possible. Imagine new special data kinds showing up
- do you envision adding a special type (and hence special handling)
for all of them?

>> > #define XSPLICE_SECTION_STRING 0x00000008 /* Section is in .str */  
>> > #define XSPLICE_SECTION_ALTINSTRUCTIONS 0x00000010 /* Section has 
> .altinstructions. */  
>> 
>> I don't see the purpose of adding flags for a few special sections we
>> may know about now - what if over time hundreds of sections appear?
> 
> We just need to know if it has - and then the hypervisor needs
> to call apply_altinstructions to redo the alternative instructions.
> 
I presume you are advocating to move this flag to a 'higher' level part
> of the structures?

As above I'd rather see such things not special cased at all. You
already need to deal with the to be patched code having two
(or more) possible variants due to live patching at boot. I think
for the purpose of runtime patching it ought to be quite fine to
use correct, but less optimal instructions (i.e. no consideration of
alternative instructions at all).

>> > #define XSPLICE_SECTION_TEXT_INPLACE 0x00000200 /* Change is in place. */   
>> > #define XSPLICE_SECTION_MATCH_EXACT 0x00000400 /* Must match exactly. */
>> > #define XSPLICE_SECTION_NO_STACKCHECK 0x00000800 /* Do not check the stack. */  
>> 
>> This isn't the first reference to "stack", without me having seen a
>> description of what the word is meant to stand for in such contexts.
>> Did I overlook it?
> 
> I mentioned it earlier in 'Example of trampoline and in-place splicing':
> 
> 126 However it has severe drawbacks - the safety checks which have to make sure     
> 127 the function is not on the stack - must also check every caller. For some 
> 128 patches this could if there were an sufficient large amount of callers   
> 129 that we would never be able to apply the update.
> 
> But perhaps I should dive much deeper into the patching
> mechanism having to check whether the function to be patched
> is on the stack.

Yeah - the thing is that the lines you quote are precisely what I
referred to by "this isn't the first reference..." - it's not being
clarified anywhere what "stack" you talk about. In some places
it looked like you meant the stack in memory that the local
variables and spilled registers go onto, but in other places it
wasn't clear whether something else was meant. And with us
supposedly being in stop-machine state, what is on the call
stacks of each CPU should be pretty deterministic.

>> > ### Alternative assembler
>> > 
>> > Alternative assembler is a mechanism to use different instructions depending
>> > on what the CPU supports. This is done by providing multiple streams of code
>> > that can be patched in - or if the CPU does not support it - padded with
>> > `nop` operations. The alternative assembler macros cause the compiler to
>> > expand the code to place the most generic code in place and emit a special
>> > ELF .section header to tag this location. During run-time the hypervisor
>> > can leave the areas alone or patch them with better suited opcodes.
>> > 
>> > As we might be patching the alternative assembler sections as well - by
>> > providing new, better suited op-codes or perhaps with nops - we need to
>> > also re-run the alternative assembler patching after we have done our
>> > patching.
>> 
>> These sections are part of .init.* and as such can't reasonably be
>> subject to patching.
> 
> > True, though I thought we had some for run-time as well.

No, we're not switching between UP and SMP like Linux does (or did).

Jan


* Re: [RFC v2] xSplice design
  2015-06-05 15:00   ` Konrad Rzeszutek Wilk
  2015-06-05 15:15     ` Andrew Cooper
@ 2015-06-05 15:27     ` Jan Beulich
  2015-06-08  8:34       ` Martin Pohlack
  2015-06-08 14:38     ` Martin Pohlack
  2 siblings, 1 reply; 37+ messages in thread
From: Jan Beulich @ 2015-06-05 15:27 UTC (permalink / raw)
  To: Martin Pohlack, Konrad Rzeszutek Wilk
  Cc: Elena Ufimtseva, jeremy, hanweidong, john.liuqiming, Paul Voccio,
	xen-devel, Daniel Kiper, Major Hayden, liuyingdong, aliguori,
	konrad, lars.kurth, Steven Wilson, peter.huangpeng, msw,
	xiantao.zxt, Rick Harris, boris.ostrovsky, Josh Kearney,
	jinsong.liu, Antony Messerli, fanhenglong, andrew.cooper3

>>> On 05.06.15 at 17:00, <konrad.wilk@oracle.com> wrote:
> On Wed, May 20, 2015 at 05:11:20PM +0200, Martin Pohlack wrote:
>> * Xen as it is now, has a couple of non-unique symbol names which will
>>   make runtime symbol identification hard.  Sometimes, static symbols
>>   simply have the same name in C files, sometimes such symbols get
>>   included via header files, and some C files are also compiled
>>   multiple times and linked under different names (guest_walk.c).  I
>>   think the Linux kernel solves this by aiming at unique symbol names
>>   even for local symbols.
>> 
>>   nm xen-syms | cut -f3- -d\  | sort | uniq -c | sort -nr | head
> 
> Oh my. Looks like a couple of preparation patches will be in order!

Just because nm doesn't express this doesn't mean we have to
special case them: In ELF (and in COFF too, which is relevant as
long as xen.efi remains a separate binary) static symbols
can easily be qualified by their source or object file name - the
symbol table certainly has that information available. As said on
the hackathon, our current kallsyms mechanism - using nm output -
suffers from this too, yet I personally view it as rather bad practice
to add a globally unique prefix to local symbol names. Instead, as
also said there, we should switch to consuming the full ELF symbol
table. That's been on my todo list for two years or more - if only I
would ever get the time to do things like that...
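
A sketch of what consuming the full symbol table buys (illustrative; a real implementation would read `.symtab` with an ELF library — here the table is mocked as (type, name) pairs in file order):

```python
def qualify_locals(symtab):
    """In an ELF symbol table, an STT_FILE entry precedes the local
    symbols of that object file, so a static symbol can be disambiguated
    by its source file name without renaming it globally."""
    current_file, out = None, []
    for sym_type, name in symtab:
        if sym_type == "FILE":
            current_file = name
        elif sym_type == "LOCAL":
            out.append(f"{current_file}:{name}")
    return out
```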

Jan


* Re: [RFC v2] xSplice design
  2015-06-05 15:16     ` Jan Beulich
@ 2015-06-05 16:00       ` Konrad Rzeszutek Wilk
  2015-06-05 16:14         ` Jan Beulich
  0 siblings, 1 reply; 37+ messages in thread
From: Konrad Rzeszutek Wilk @ 2015-06-05 16:00 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Elena Ufimtseva, jeremy, hanweidong, john.liuqiming, Paul Voccio,
	xen-devel, Daniel Kiper, Major Hayden, liuyingdong, aliguori,
	konrad, lars.kurth, Steven Wilson, peter.huangpeng, msw,
	xiantao.zxt, Rick Harris, boris.ostrovsky, Josh Kearney,
	jinsong.liu, Antony Messerli, fanhenglong, andrew.cooper3

On Fri, Jun 05, 2015 at 04:16:59PM +0100, Jan Beulich wrote:
> >>> On 05.06.15 at 16:49, <konrad.wilk@oracle.com> wrote:
> > On Mon, May 18, 2015 at 01:41:34PM +0100, Jan Beulich wrote:
> >> >>> On 15.05.15 at 21:44, <konrad.wilk@oracle.com> wrote:
> >> > As such having the payload in an ELF file is the sensible way. We would be
> >> > carrying the various set of structures (and data) in the ELF sections under
> >> > different names and with definitions. The prefix for the ELF section name
> >> > would always be: *.xsplice_*
> >> > 
> >> > Note that every structure has padding. This is added so that the hypervisor
> >> > can re-use those fields as it sees fit.
> >> > 
> >> > There are five sections *.xsplice_* sections:
> >> 
> >> A general remark: Having worked on ELF on different occasions,
> >> UNIX'es (and hence Linux'es) use of section names to identify the
> >> sections' purposes is pretty contrary to ELF's original intentions.
> >> Section types (and architecture as well as OS extensions to them)
> >> exist for a reason. Of course from the following sections it's not
> >> really clear in how far you intended this to be done. If you did
> >> and just didn't explicitly say so, pointing out that relations
> >> between sections should then also be expressed suitably via ELF
> >> mechanisms (sh_link, sh_info) would then be unnecessary too. If
> >> you didn't, then I'd strongly suggest switching to a formally more
> >> correct model, which would then include specifying section types
> >> and section attributes (and optional other relevant aspects)
> >> along with their (then no longer mandatory) names.
> > 
> > 
> > Something like:
> > http://docs.oracle.com/cd/E23824_01/html/819-0690/chapter7-1.html 
> 
> Yes, the SHT_SUNW_* ones are examples of what I mean.
> 
> > With (for example) a pointer to a specific section:
> > http://docs.oracle.com/cd/E23824_01/html/819-0690/chapter7-28.html#scrolltoc 
> > ?
> 
> Not sure what part of that page you mean to refer to.

The whole page - with the:

typedef struct {
        Elf32_Word      c_tag;
        union {
                Elf32_Word      c_val;
                Elf32_Addr      c_ptr;
        } c_un;
} Elf32_Cap;

.. and such.

> 
> >> >  * `.bss` section for the new code (function)
> >> >  * `.data` and `.data.read_mostly` section for the new and old code (function)
> >> >  * `.rodata` section for the new and old code (function).
> >> > 
> >> > In short the *.xsplice_* sections represent various structures and the
> >> > ELF provides the mechanism to glue it all together when loaded in memory.
> >> 
> >> As per above - I think ELF can do this in a more explicit way. Of
> >> course that would imply sticking to strictly ELF types in the structure
> >> definitions below. That would in turn have the advantage of
> >> expressing e.g. "const char *" references to strings by section
> >> offsets into a specified string section, reducing (with the ultimate
> >> goal of eliminating) the need for processing relocations on the
> >> ELF object (possible references to "external symbols" left aside for
> >> the moment).
> > 
> > Do you think that going that way would be too complex for this design?
> > Folks have expressed that perhaps I went overboard with this ELF and
> > should have just stuck with structures and just mention that it is
> > in an ELF file?
> 
> I think it should be done either completely ELF-agnostic (with ELF
> just happening to be the container to convey the data) or fully ELF-
> compliant.

Let me do it ELF-agnostic for right now and focus on the structures.
If it becomes too difficult to understand then I can look at converting
the whole thing to be ELF compliant.

> 
> >> > #define XSPLICE_HOWTO_RELOC_TIME    3 /* __TIME__ type change. */
> >> 
> >> Assuming the former is DATE - what do you foresee date/time
> >> kinds of changes to be good for?
> > 
> > In case we have source code which uses __DATE__ and the binary patch
> > modifies it - we would need to sort out other users of the __DATE__
> > and make sure that they are not affected (or perhaps they should be).
> 
> And how is this different from arbitrary other string literals?

They are exactly the same. Will coalesce them.
> 
> >> > #define XSPLICE_HOWTO_BUG           4 /* BUG_ON being replaced.*/  
> >> > #define XSPLICE_HOWTO_EXTABLE       5 /* exception_table change. */  
> >> > #define XSPLICE_HOWTO_SYMBOL        6 /* change in symbol table. */  
> >> 
> >> For none of these I really understand their purpose.
> > 
> > This is to help the hypervisor to figure out which of the lookup
> > mechanisms to use to find the relevant section. As in if we need
> > to update (as part of the patch) a normal function, but also
> > the exception table (because the new function is at a new virtual
> > address) we want to search in the exception table for the old 
> > one and replace it with the new virtual address.
> 
> I think this gets too specific. Telling apart code and data may make
> sense, but I think beyond that things should be expressed as
> generically as possible. Imagine new special data kinds showing up
> - do you envision adding a special type (and hence special handling)
> for all of them?

Yes and then updating the design document to include it.

Would there be a more relaxed way to go about this?

> 
> >> > #define XSPLICE_SECTION_STRING 0x00000008 /* Section is in .str */  
> >> > #define XSPLICE_SECTION_ALTINSTRUCTIONS 0x00000010 /* Section has 
> > .altinstructions. */  
> >> 
> >> I don't see the purpose of adding flags for a few special sections we
> >> may know about now - what if over time hundreds of sections appear?
> > 
> > We just need to know if it has - and then the hypervisor needs
> > to call apply_altinstructions to redo the alternative instructions.
> > 
> > I presume you are advocating to move this flag to a 'higher' level part
> > of the structures?
> 
> As above I'd rather see such things not special cased at all. You
> already need to deal with the to be patched code having two
> (or more) possible variants due to live patching at boot. I think
> for the purpose of runtime patching it ought to be quite fine to
> use correct, but less optimal instructions (i.e. no consideration of
> alternative instructions at all).

OK.
> 
> >> > #define XSPLICE_SECTION_TEXT_INPLACE 0x00000200 /* Change is in place. */   
> >> > #define XSPLICE_SECTION_MATCH_EXACT 0x00000400 /* Must match exactly. */
> >> > #define XSPLICE_SECTION_NO_STACKCHECK 0x00000800 /* Do not check the stack. */  
> >> 
> >> This isn't the first reference to "stack", without me having seen a
> >> description of what the word is meant to stand for in such contexts.
> >> Did I overlook it?
> > 
> > I mentioned it earlier in 'Example of trampoline and in-place splicing':
> > 
> > 126 However it has severe drawbacks - the safety checks which have to make sure     
> > 127 the function is not on the stack - must also check every caller. For some 
> > 128 patches this could if there were an sufficient large amount of callers   
> > 129 that we would never be able to apply the update.
> > 
> > But perhaps I should dive much deeper into the patching
> > mechanism having to check whether the function to be patched
> > is on the stack.
> 
> Yeah - the thing is that the lines you quote are precisely what I
> referred to by "this isn't the first reference..." - it's not being
> clarified anywhere what "stack" you talk about. In some places
> it looked like you meant the stack in memory that the local
> variables and spilled registers go onto, but in other places it
> wasn't clear whether something else was meant. And with us
> supposedly being in stop-machine state, what is on the call
> stacks of each CPU should be pretty deterministic.

True, the 'should' is the operative word. Hence the extra mechanism to
check the stack as a fail-safe. Or, if we really are sure, we can do
without it - just patch and skip any stack checking.

I will update the glossary to describe the 'stack' problem.
> 
> >> > ### Alternative assembler
> >> > 
> >> > Alternative assembler is a mechanism to use different instructions depending
> >> > on what the CPU supports. This is done by providing multiple streams of code
> >> > that can be patched in - or if the CPU does not support it - padded with
> >> > `nop` operations. The alternative assembler macros cause the compiler to
> >> > expand the code to place the most generic code variant in place - and emit a
> >> > special ELF .section header to tag this location. During run-time the
> >> > hypervisor can leave the areas alone or patch them with better suited opcodes.
> >> > 
> >> > As we might be patching the alternative assembler sections as well - by
> >> > providing new better suited op-codes or perhaps with nops - we need to
> >> > also re-run the alternative assembler patching after we have done our
> >> > patching.
> >> 
> >> These sections are part of .init.* and as such can't reasonably be
> >> subject to patching.
> > 
> > True, though I thought we have some for run-time as well.
> 
> No, we're not switching between UP and SMP like Linux does (or did).

That simplifies it! Thanks.
> 
> Jan

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC v2] xSplice design
  2015-06-05 16:00       ` Konrad Rzeszutek Wilk
@ 2015-06-05 16:14         ` Jan Beulich
  0 siblings, 0 replies; 37+ messages in thread
From: Jan Beulich @ 2015-06-05 16:14 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Elena Ufimtseva, jeremy, hanweidong, john.liuqiming, Paul Voccio,
	xen-devel, Daniel Kiper, Major Hayden, liuyingdong, aliguori,
	konrad, lars.kurth, Steven Wilson, peter.huangpeng, msw,
	xiantao.zxt, Rick Harris, boris.ostrovsky, Josh Kearney,
	jinsong.liu, Antony Messerli, fanhenglong, andrew.cooper3

>>> On 05.06.15 at 18:00, <konrad.wilk@oracle.com> wrote:
> On Fri, Jun 05, 2015 at 04:16:59PM +0100, Jan Beulich wrote:
>> >>> On 05.06.15 at 16:49, <konrad.wilk@oracle.com> wrote:
>> > On Mon, May 18, 2015 at 01:41:34PM +0100, Jan Beulich wrote:
>> >> >>> On 15.05.15 at 21:44, <konrad.wilk@oracle.com> wrote:
>> >> A general remark: Having worked on ELF on different occasions,
>> >> UNIX'es (and hence Linux'es) use of section names to identify the
>> >> sections' purposes is pretty contrary to ELF's original intentions.
>> >> Section types (and architecture as well as OS extensions to them)
>> >> exist for a reason. Of course from the following sections it's not
>> >> really clear in how far you intended this to be done. If you did
>> >> and just didn't explicitly say so, pointing out that relations
>> >> between sections should then also be expressed suitably via ELF
>> >> mechanisms (sh_link, sh_info) would then be unnecessary too. If
>> >> you didn't, then I'd strongly suggest switching to a formally more
>> >> correct model, which would then include specifying section types
>> >> and section attributes (and optional other relevant aspects)
>> >> along with their (then no longer mandatory) names.
>> > 
>> > 
>> > Something like:
>> > http://docs.oracle.com/cd/E23824_01/html/819-0690/chapter7-1.html 
>> 
>> Yes, the SHT_SUNW_* ones are examples of what I mean.
>> 
> >> > With (for example) a pointer to a specific section:
>> > http://docs.oracle.com/cd/E23824_01/html/819-0690/chapter7-28.html#scrolltoc 
>> > ?
>> 
>> Not sure what part of that page you mean to refer to.
> 
> The whole page - with the:
> 
> typedef struct {
>         Elf32_Word      c_tag;
>         union {
>                 Elf32_Word      c_val;
>                 Elf32_Addr      c_ptr;
>         } c_un;
> } Elf32_Cap;
> 
> .. and such.

But where's the section reference that I thought you meant to use
as an example?

>> >> > #define XSPLICE_HOWTO_BUG           4 /* BUG_ON being replaced.*/  
>> >> > #define XSPLICE_HOWTO_EXTABLE       5 /* exception_table change. */  
>> >> > #define XSPLICE_HOWTO_SYMBOL        6 /* change in symbol table. */  
>> >> 
>> >> For none of these I really understand their purpose.
>> > 
>> > This is to help the hypervisor to figure out which of the lookup
>> > mechanisms to use to find the relevant section. As in if we need
>> > to update (as part of the patch) a normal function, but also
>> > the exception table (because the new function is at a new virtual
>> > address) we want to search in the exception table for the old 
>> > one and replace it with the new virtual address.
>> 
>> I think this gets too specific. Telling apart code and data may make
>> sense, but I think beyond that thing should be expressed as
>> generically as possible. Imagine new special data kinds showing up
>> - do you envision adding a special type (and hence special handling)
>> for all of them?
> 
> Yes and then updating the design document to include it.
> 
> Would there be a more relaxed way to go about this?

This isn't a question of relaxed or strict, but of being generic or
specific. And I think keeping things generic will make the whole
thing better maintainable.

Jan


* Re: [RFC v2] xSplice design
  2015-06-05 15:27     ` Jan Beulich
@ 2015-06-08  8:34       ` Martin Pohlack
  2015-06-08  8:51         ` Jan Beulich
  0 siblings, 1 reply; 37+ messages in thread
From: Martin Pohlack @ 2015-06-08  8:34 UTC (permalink / raw)
  To: Jan Beulich, Konrad Rzeszutek Wilk
  Cc: Elena Ufimtseva, jeremy, hanweidong, john.liuqiming, Paul Voccio,
	xen-devel, Daniel Kiper, Major Hayden, liuyingdong, aliguori,
	konrad, lars.kurth, Steven Wilson, peter.huangpeng, msw,
	xiantao.zxt, Rick Harris, boris.ostrovsky, Josh Kearney,
	jinsong.liu, Antony Messerli, fanhenglong, andrew.cooper3

On 05.06.2015 17:27, Jan Beulich wrote:
>>>> On 05.06.15 at 17:00, <konrad.wilk@oracle.com> wrote:
>> On Wed, May 20, 2015 at 05:11:20PM +0200, Martin Pohlack wrote:
>>> * Xen as it is now, has a couple of non-unique symbol names which will
>>>   make runtime symbol identification hard.  Sometimes, static symbols
>>>   simply have the same name in C files, sometimes such symbols get
>>>   included via header files, and some C files are also compiled
>>>   multiple times and linked under different names (guest_walk.c).  I
>>>   think the Linux kernel solves this by aiming at unique symbol names
>>>   even for local symbols.
>>>
>>>   nm xen-syms | cut -f3- -d\  | sort | uniq -c | sort -nr | head
>>
>> Oh my. Looks like a couple of preparation patches will be in order!
> 
> Just because nm doesn't express this doesn't mean we have to
> special case them: In ELF (and in COFF too, which is relevant as
> long as xen.efi still remains to be a separate binary) static symbols
> can easily be qualified by their source or object file name - the
> symbol table certainly has that information available. As said on
> the hackathon, our current kallsyms mechanism - using nm output -
> suffers from this too, yet I personally view it as rather bad practice
> to add a globally unique prefix to local symbol names. Instead, as
> also said there, we should switch to consuming the full ELF symbol
> table. That's been on my todo list for two years or more - if only I
> would ever get the time to do things like that...

Hmm, in my experience, you have file-type symbols in the symbol table
that refer to source-code files and you have source-file references in
the dwarf debug information.

I have *not* seen references to object-files in either place and as some
C-files are compiled multiple times in Xen (guest_walk.c), the mapping
is not unique.

Further, as soon as non-trivial linker scripts are used, the symbol
ordering in the final xen-syms does not necessarily reflect original
object-file contents anymore.

I might be missing some mechanism here?

Martin


* Re: [RFC v2] xSplice design
  2015-06-08  8:34       ` Martin Pohlack
@ 2015-06-08  8:51         ` Jan Beulich
  0 siblings, 0 replies; 37+ messages in thread
From: Jan Beulich @ 2015-06-08  8:51 UTC (permalink / raw)
  To: Martin Pohlack
  Cc: Elena Ufimtseva, jeremy, hanweidong, john.liuqiming, Paul Voccio,
	xen-devel, Daniel Kiper, Major Hayden, liuyingdong, aliguori,
	konrad, lars.kurth, Steven Wilson, peter.huangpeng, msw,
	xiantao.zxt, Rick Harris, boris.ostrovsky, Josh Kearney,
	jinsong.liu, Antony Messerli, fanhenglong, andrew.cooper3

>>> On 08.06.15 at 10:34, <mpohlack@amazon.com> wrote:
> On 05.06.2015 17:27, Jan Beulich wrote:
>>>>> On 05.06.15 at 17:00, <konrad.wilk@oracle.com> wrote:
>>> On Wed, May 20, 2015 at 05:11:20PM +0200, Martin Pohlack wrote:
>>>> * Xen as it is now, has a couple of non-unique symbol names which will
>>>>   make runtime symbol identification hard.  Sometimes, static symbols
>>>>   simply have the same name in C files, sometimes such symbols get
>>>>   included via header files, and some C files are also compiled
>>>>   multiple times and linked under different names (guest_walk.c).  I
>>>>   think the Linux kernel solves this by aiming at unique symbol names
>>>>   even for local symbols.
>>>>
>>>>   nm xen-syms | cut -f3- -d\  | sort | uniq -c | sort -nr | head
>>>
>>> Oh my. Looks like a couple of preparation patches will be in order!
>> 
>> Just because nm doesn't express this doesn't mean we have to
>> special case them: In ELF (and in COFF too, which is relevant as
>> long as xen.efi still remains to be a separate binary) static symbols
>> can easily be qualified by their source or object file name - the
>> symbol table certainly has that information available. As said on
>> the hackathon, our current kallsyms mechanism - using nm output -
>> suffers from this too, yet I personally view it as rather bad practice
>> to add a globally unique prefix to local symbol names. Instead, as
>> also said there, we should switch to consuming the full ELF symbol
>> table. That's been on my todo list for two years or more - if only I
>> would ever get the time to do things like that...
> 
> Hmm, in my experience, you have file-type symbols in the symbol table
> that refer to source-code files and you have source-file references in
> the dwarf debug information.
> 
> I have *not* seen references to object-files in either place

The linker can insert such. But the case likely isn't of interest in
our scenario anyway.

> and as some
> C-files are compiled multiple times in Xen (guest_walk.c), the mapping
> is not unique.

And I suppose there are always going to be cases where manual
intervention is required. All we can aim for is making as little of such
necessary as possible.

> Further, as soon as non-trivial linker scripts are used, the symbol
> ordering in the final xen-syms does not necessarily reflect original
> object-file contents anymore.

But the linker should ensure that static symbols don't lose their
association with the respective STT_FILE symbol.

Jan


* Re: [RFC v2] xSplice design
  2015-06-05 15:00   ` Konrad Rzeszutek Wilk
  2015-06-05 15:15     ` Andrew Cooper
  2015-06-05 15:27     ` Jan Beulich
@ 2015-06-08 14:38     ` Martin Pohlack
  2015-06-08 15:19       ` Konrad Rzeszutek Wilk
  2 siblings, 1 reply; 37+ messages in thread
From: Martin Pohlack @ 2015-06-08 14:38 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Elena Ufimtseva, jeremy, hanweidong, jbeulich, john.liuqiming,
	Paul Voccio, Daniel Kiper, Major Hayden, liuyingdong, aliguori,
	xiantao.zxt, lars.kurth, Steven Wilson, peter.huangpeng, msw,
	xen-devel, Rick Harris, boris.ostrovsky, Josh Kearney,
	jinsong.liu, Antony Messerli, konrad, fanhenglong,
	andrew.cooper3

Hi,

some inline comments.

On 05.06.2015 17:00, Konrad Rzeszutek Wilk wrote:
> On Wed, May 20, 2015 at 05:11:20PM +0200, Martin Pohlack wrote:
>> Hi,
>>
>> this looks very interesting.
> 
> Thank you!
>>
>> I have talked about an experimental Xen hotpatching design at Linux
>> Plumbers Conference 2014 in Düsseldorf, slides are here:
>>
>> http://www.linuxplumbersconf.net/2014/ocw//system/presentations/2421/original/xen_hotpatching-2014-10-16.pdf
>>
>> and would like to share a couple of observations from the slides:
>>
>> * We found splicing at hypervisor exit with a barrier for all CPUs to
>>   be a good place, because all Xen stacks are effectively empty at
>>   that point.  We use a barrier with a maximum timeout (that can be
>>   adapted).  The approach is similar in concept to stop_machine and
>>   the time rendezvous but is time-bound and does not require all the
>>   asynchronous operations discussed below.
> 
> hypervisor exit = vmexit?

No, when the hypervisor code is left and checks soft IRQs.

>>
>> * Hotpatch generation often requires support for compiling the target
>>   with -ffunction-sections / -fdata-sections.  Last time I looked, you
>>   can compile Xen with that option, but the linker scripts are not
>>   fully adapted to deal with it and the resulting hypervisor cannot be
>>   booted.
> 
> Correct. That (adapting the Makefile to do a second build with this)
> would be needed as well.
> 
>>
>> * Xen as it is now, has a couple of non-unique symbol names which will
>>   make runtime symbol identification hard.  Sometimes, static symbols
>>   simply have the same name in C files, sometimes such symbols get
>>   included via header files, and some C files are also compiled
>>   multiple times and linked under different names (guest_walk.c).  I
>>   think the Linux kernel solves this by aiming at unique symbol names
>>   even for local symbols.
>>
>>   nm xen-syms | cut -f3- -d\  | sort | uniq -c | sort -nr | head
> 
> Oh my. Looks like a couple of preparation patches will be in order!
> 
>>
>> * You may need support for adapting or augmenting exception tables if
>>   patching such code is desired (it probably is).  Hotpatches may need
>>   to bring their own small exception tables (similar to how Linux
>>   modules support this).  If you don't plan on supporting hotpatches
>>   that introduce additional exception-locations, one could also change
>>   the exception table in-place and reorder it afterwards.
> 
> Each patch carries 'section' structures which define what kind
> of massaging needs to be done. As in, each of these 'section' structures
> says:
>  - At the original address X there was an exception table
>  - (or) At the original address Y there was an alt assembler
> 
> And we would:
>  - During 'check' state, confirm that indeed X is in the exception table
>    (or Y in the alt assembler)
> 
>  - During the apply state, fix the exception table X offset to point to
>    the new virtual address.
> 
> Or perhaps I am over-simplifying it? My recollection of the exception
> table code is that it would be mostly dealing with the table and
> changing the virtual address, but I hadn't dug it in details.

A simple approach would indeed be to patch the table in-place with the
new addresses for the replacement functions and fix-up sections.  But,
you would also have to reorder the table to keep the binary search working.

And of course this approach would not support adding additional entries
to the table as it is allocated at compile / link time.  So you cannot
support a hotpatch that would introduce an additional entry.

Linux modules, in contrast can bring their own exception tables and
those are walked by the handler whenever the main table does not contain
an entry.  I have not implemented that aspect yet, but would consider
this to be the most desirable design upon first look.

>> * Hotpatch interdependencies are tricky.  IIRC the discussion at the
> 
> Oooh, that bubbled in my nightmare subconscious but hadn't burst out yet.
> Ugh, what did I sign up for! 
> 
>>   Live Patching track at LPC, both kpatch and kgraft aimed, at the
>>   time, at a single large patch that subsumes and replaces all previous
>>   ones.
>>
>> Some more comments inline below ...
>>
>> Cheers,
>> Martin
>>
>> On 15.05.2015 21:44, Konrad Rzeszutek Wilk wrote:
>>> Hey!
>>>
>>> During the Xen Hacka^H^H^H^HProject Summit? we chatted about live-patching
>>> the hypervisor. We sketched out how it could be done, and brainstormed
>>> some of the problems.
>>>
>>> I took that and wrote an design - which is very much RFC. The design is
>>> laid out in two sections - the format of the ELF payload - and then the
>>> hypercalls to act on it.
>>>
>>> Hypercall preemption has caused a couple of XSAs so I've baked the need
>>> for that in the design so we hopefully won't have an XSA for this code.
>>>
>>> There are two big *TODO* in the design which I had hoped to get done
>>> before sending this out - however I am going on vacation for two weeks
>>> so I figured it would be better to send this off for folks to mull now
>>> then to have it languish.
>>>
>>> Please feel free to add more folks on the CC list.
>>>
>>> Enjoy!
>>>
>>>
>>> # xSplice Design v1 (EXTERNAL RFC v2)
>>>
>>> ## Rationale
>>>
>>> A mechanism is required to binarily patch the running hypervisor with new
>>> opcodes that have come about due to primarily security updates.
>>>
>>> This document describes the design of the API that would allow us to
>>> upload to the hypervisor binary patches.
>>>
>>> ## Glossary
>>>
>>>  * splice - patch in the binary code with new opcodes
>>>  * trampoline - a jump to a new instruction.
>>>  * payload - telemetries of the old code along with binary blob of the new
>>>    function (if needed).
>>>  * reloc - telemetries contained in the payload to construct proper trampoline.
>>>
>>> ## Multiple ways to patch
>>>
>>> The mechanism needs to be flexible to patch the hypervisor in multiple ways
>>> and be as simple as possible. The compiled code is contiguous in memory with
>>> no gaps - so we have no luxury of 'moving' existing code and must either
>>> insert a trampoline to the new code to be executed - or only modify in-place
>>> the code if there is sufficient space. The placement of new code has to be done
>>> by hypervisor and the virtual address for the new code is allocated dynamically.
>>>
>>> This implies that the hypervisor must compute the new offsets when splicing
>>> in the new trampoline code. Where the trampoline is added (inside
>>> the function we are patching or just the callers?) is also important.
>>>
>>> To lessen the amount of code in hypervisor, the consumer of the API
>>> is responsible for identifying which mechanism to employ and how many locations
>>> to patch. Combinations of modifying in-place code, adding trampoline, etc
>>> has to be supported. The API should allow read/write any memory within
>>> the hypervisor virtual address space.
>>>
>>> We must also have a mechanism to query what has been applied and a mechanism
>>> to revert it if needed.
>>>
>>> We must also have a mechanism to: provide a copy of the old code - so that
>>> the hypervisor can verify it against the code in memory; the new code;
>>
>> As Xen has no stable in-hypervisor API / ABI, you need to make sure
>> that a generated module matches a target hypervisor.  In our design,
>> we use build IDs for that (ld --build-id).  We embed build IDs at Xen
>> compile time and can query a running hypervisor for its ID and only
>> load matching patches.
>>
>> This seems to be an alternative to your proposal to include old code
>> into hotpatch modules.
> 
> That is much simpler.

Just to clarify, are you agreeing that the build ID approach is much
simpler?

> There is a nice part of the old code check - you
> can check (and deal with) patching already patched code.
> As in, if the payload was configured to be applied on top of an already
> patched function it would patch nicely. But if the payload is against
> the virgin code - and the hypervisor is running an older patch, we would
> bail out.

You can do that too with the build IDs if there is some mechanism that
loads hotpatches in the same order as they were built in (if they
overlap).  The simplest approach that comes to mind is a hotpatch stack,
instead of independent patches.

>>> the symbol name of the function to be patched; or offset from the symbol;
>>> or virtual address.
>>>
>>> The complications that this design will encounter are explained later
>>> in this document.
>>>
>>> ## Patching code
>>>
>>> The first mechanism to patch that comes in mind is in-place replacement.
>>> That is replace the affected code with new code. Unfortunately the x86
>>> ISA is variable size which places limits on how much space we have available
>>> to replace the instructions.
>>>
>>> The second mechanism is by replacing the call or jump to the
>>> old function with the address of the new function.
>>
>> We found splicing via a non-conditional JMP in the beginning of the
>> old target function to be a sensible and uncomplicated approach.  This
>> approach works at function level which feels like a very natural unit
>> to work at.  The really nice thing of this approach is that you don't
>> have to track all the, potential indirect, call sites.
> 
> <nods>
>>
>> As you discussed, if you allocate hotpatch memory within +-2GB of the
>> patch location, no further trampoline indirection is required, a
>> 5-byte JMP does the trick on x86.  We found that to be sufficient in
>> our experiments.
> 
> And worst case if you do need more than +-2GB you can just have
> two jumps. Kind of silly but possible.
> 
> Thank you for your input and looking forward to your reply!

Martin


* Re: [RFC v2] xSplice design
  2015-06-08 14:38     ` Martin Pohlack
@ 2015-06-08 15:19       ` Konrad Rzeszutek Wilk
  2015-06-12 11:51         ` Martin Pohlack
  0 siblings, 1 reply; 37+ messages in thread
From: Konrad Rzeszutek Wilk @ 2015-06-08 15:19 UTC (permalink / raw)
  To: Martin Pohlack
  Cc: Elena Ufimtseva, jeremy, hanweidong, jbeulich, john.liuqiming,
	Paul Voccio, Daniel Kiper, Major Hayden, liuyingdong, aliguori,
	xiantao.zxt, lars.kurth, Steven Wilson, peter.huangpeng, msw,
	xen-devel, Rick Harris, boris.ostrovsky, Josh Kearney,
	jinsong.liu, Antony Messerli, konrad, fanhenglong,
	andrew.cooper3

.. snip ..
> >> * You may need support for adapting or augmenting exception tables if
> >>   patching such code is desired (it probably is).  Hotpatches may need
> >>   to bring their own small exception tables (similar to how Linux
> >>   modules support this).  If you don't plan on supporting hotpatches
> >>   that introduce additional exception-locations, one could also change
> >>   the exception table in-place and reorder it afterwards.
> > 
> > Each patch carries 'section' structures which define what kind
> > of massaging needs to be done. As in, each of these 'section' structures
> > says:
> >  - At the original address X there was an exception table
> >  - (or) At the original address Y there was an alt assembler
> > 
> > And we would:
> >  - During 'check' state, confirm that indeed X is in the exception table
> >    (or Y in the alt assembler)
> > 
> >  - During the apply state, fix the exception table X offset to point to
> >    the new virtual address.
> > 
> > Or perhaps I am over-simplifying it? My recollection of the exception
> > table code is that it would be mostly dealing with the table and
> > changing the virtual address, but I hadn't dug it in details.
> 
> A simple approach would indeed be to patch the table in-place with the
> new addresses for the replacement functions and fix-up sections.  But,
> you would also have to reorder the table to keep the binary search working.
> 
> And of course this approach would not support adding additional entries
> to the table as it is allocated at compile / link time.  So you cannot
> support a hotpatch that would introduce an additional entry.
> 
> Linux modules, in contrast can bring their own exception tables and
> those are walked by the handler whenever the main table does not contain
> an entry.  I have not implemented that aspect yet, but would consider
> this to be the most desirable design upon first look.

It is in some ways an implementation detail. I will include it in the doc
as there are many parts here and having them all nicely outlined
makes my head spin less.
.. snip..
> >>>
> >>> We must also have a mechanism to: provide a copy of the old code - so that
> >>> the hypervisor can verify it against the code in memory; the new code;
> >>
> >> As Xen has no stable in-hypervisor API / ABI, you need to make sure
> >> that a generated module matches a target hypervisor.  In our design,
> >> we use build IDs for that (ld --build-id).  We embed build IDs at Xen
> >> compile time and can query a running hypervisor for its ID and only
> >> load matching patches.
> >>
> >> This seems to be an alternative to your proposal to include old code
> >> into hotpatch modules.
> > 
> > That is much simpler.
> 
> Just to clarify, are you agreeing that the build ID approach is much
> simpler?

Yes.
> 
> > There is a nice part of the old code check - you
> > can check (and deal with) patching already patched code.
> > As in, if the payload was configured to be applied on top of an already
> > patched function it would patch nicely. But if the payload is against
> > the virgin code - and the hypervisor is running an older patch, we would
> > bail out.
> 
> You can do that too with the build IDs if there is some mechanism that
> loads hotpatches in the same order as they were built in (if they
> overlap).  The simplest approach that comes to mind is a hotpatch stack,
> instead of independent patches.

True. Murphy's law though says somebody will do this in reverse order :-)
And that is my worry - some system admin will reverse the order, or pick
a patch out of order, and we end up patching .. and things eventually
break and blow up.


* Re: [RFC v2] xSplice design
  2015-05-15 19:44 [RFC v2] xSplice design Konrad Rzeszutek Wilk
                   ` (3 preceding siblings ...)
  2015-05-20 15:11 ` Martin Pohlack
@ 2015-06-12 11:39 ` Martin Pohlack
  2015-06-12 14:03   ` Konrad Rzeszutek Wilk
  2015-10-27 12:05   ` Ross Lagerwall
  4 siblings, 2 replies; 37+ messages in thread
From: Martin Pohlack @ 2015-06-12 11:39 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk, msw, aliguori, Antony Messerli,
	Rick Harris, Paul Voccio, Steven Wilson, Major Hayden,
	Josh Kearney, jinsong.liu, xiantao.zxt, boris.ostrovsky,
	Daniel Kiper, Elena Ufimtseva, bob.liu, lars.kurth, hanweidong,
	peter.huangpeng, fanhenglong, liuyingdong, john.liuqiming,
	xen-devel, jbeulich, andrew.cooper3, jeremy
  Cc: konrad

On 15.05.2015 21:44, Konrad Rzeszutek Wilk wrote:
[...]
> ## Hypercalls
> 
> We will employ the sub operations of the system management hypercall (sysctl).
> There are to be four sub-operations:
> 
>  * upload the payloads.
>  * listing of payloads summary uploaded and their state.
>  * getting an particular payload summary and its state.
>  * command to apply, delete, or revert the payload.
> 
> The patching is asynchronous therefore the caller is responsible
> to verify that it has been applied properly by retrieving the summary of it
> and verifying that there are no error codes associated with the payload.
> 
> We **MUST** make it asynchronous due to the nature of patching: it requires
> every physical CPU to be in lock-step with each other. The patching mechanism,
> while an implementation detail, is not a short operation and as such
> the design **MUST** assume it will be a long-running operation.

I am not convinced yet, that you need an asynchronous approach here.

The experience from our prototype suggests that hotpatching itself is
not an expensive operation.  It can usually be completed well below 1ms
with the most expensive part being getting the hypervisor to a quiet state.

If we go for a barrier at hypervisor exit, combined with forcing all
other CPUs through the hypervisor with IPIs, the typical case is very quick.

The only reason why that would take some time is, if another CPU is
executing a lengthy operation in the hypervisor already.  In that case,
you probably don't want to block the whole machine waiting for the
joining of that single CPU anyway and instead re-try later, for example,
using a timeout on the barrier.  That could be signaled to the user-land
> process (EAGAIN) so that it could re-attempt hotpatching after some seconds.

Martin


* Re: [RFC v2] xSplice design
  2015-06-08 15:19       ` Konrad Rzeszutek Wilk
@ 2015-06-12 11:51         ` Martin Pohlack
  2015-06-12 14:06           ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 37+ messages in thread
From: Martin Pohlack @ 2015-06-12 11:51 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Elena Ufimtseva, jeremy, hanweidong, jbeulich, john.liuqiming,
	Paul Voccio, Daniel Kiper, Major Hayden, liuyingdong, aliguori,
	xiantao.zxt, lars.kurth, Steven Wilson, peter.huangpeng, msw,
	xen-devel, Rick Harris, boris.ostrovsky, Josh Kearney,
	jinsong.liu, Antony Messerli, konrad, fanhenglong,
	andrew.cooper3

On 08.06.2015 17:19, Konrad Rzeszutek Wilk wrote:
[...]
>>> There is a nice part of the old code check - you
>>> can check (and deal with) patching an already patched code.
>>> As in, if the payload was configured to be applied on top of an already
>>> patched function it would patch nicely. But if the payload is against
>>> the virgin code - and the hypervisor is running an older patch, we would
>>> bail out.
>>
>> You can do that too with the build IDs if there is some mechanism that
>> loads hotpatches in the same order as they were built in (if they
>> overlap).  The simplest approach that comes to mind is a hotpatch stack,
>> instead of independent patches.
> 
> True. Murphy law though says somebody will do this in reverse order :-)
> And that is my worry - some system admin will reverse the order, or pick
> an patch out of order, and we end up patching .. and things eventually
> break and blow up.

Right, I can see how this might be useful as an additional guard.

There are some additional benefits to using build IDs, beyond preventing
loading patches for the wrong hypervisor.  They can also help locate
patches for the currently running hypervisor if laid out correspondingly
on disk, e.g.:

  /some/path/<build_ID>/nnnnn-patch1.mod

A userland tool would query for the specific build ID of the currently
running hypervisor and only attempt to load hotpatches designated for
it.  This is a stronger protection than relying on the RPM version or a
similar mechanism.

* build ID
  * Prevent loading of wrong hotpatches (intended for other builds)
  * Allow to identify suitable hotpatches on disk and help with runtime
    tooling (if laid out using build ID)

* Comparing old code
  * Prevent loading of dynamically incompatible hotpatches

Martin


* Re: [RFC v2] xSplice design
  2015-06-12 11:39 ` Martin Pohlack
@ 2015-06-12 14:03   ` Konrad Rzeszutek Wilk
  2015-06-12 14:31     ` Martin Pohlack
  2015-10-27 12:05   ` Ross Lagerwall
  1 sibling, 1 reply; 37+ messages in thread
From: Konrad Rzeszutek Wilk @ 2015-06-12 14:03 UTC (permalink / raw)
  To: Martin Pohlack
  Cc: Elena Ufimtseva, jeremy, hanweidong, jbeulich, john.liuqiming,
	Paul Voccio, Daniel Kiper, Major Hayden, liuyingdong, aliguori,
	xiantao.zxt, lars.kurth, Steven Wilson, peter.huangpeng, msw,
	xen-devel, Rick Harris, boris.ostrovsky, Josh Kearney,
	jinsong.liu, Antony Messerli, konrad, fanhenglong,
	andrew.cooper3

On Fri, Jun 12, 2015 at 01:39:05PM +0200, Martin Pohlack wrote:
> On 15.05.2015 21:44, Konrad Rzeszutek Wilk wrote:
> [...]
> > ## Hypercalls
> > 
> > We will employ the sub operations of the system management hypercall (sysctl).
> > There are to be four sub-operations:
> > 
> >  * upload the payloads.
> >  * listing of payloads summary uploaded and their state.
> >  * getting an particular payload summary and its state.
> >  * command to apply, delete, or revert the payload.
> > 
> > The patching is asynchronous therefore the caller is responsible
> > to verify that it has been applied properly by retrieving the summary of it
> > and verifying that there are no error codes associated with the payload.
> > 
> > We **MUST** make it asynchronous due to the nature of patching: it requires
> > every physical CPU to be lock-step with each other. The patching mechanism
> > while an implementation detail, is not an short operation and as such
> > the design **MUST** assume it will be an long-running operation.
> 
> I am not convinced yet, that you need an asynchronous approach here.
> 
> The experience from our prototype suggests that hotpatching itself is
> not an expensive operation.  It can usually be completed well below 1ms
> with the most expensive part being getting the hypervisor to a quiet state.
> 
> If we go for a barrier at hypervisor exit, combined with forcing all
> other CPUs through the hypervisor with IPIs, the typical case is very quick.
> 
> The only reason why that would take some time is, if another CPU is
> executing a lengthy operation in the hypervisor already.  In that case,
> you probably don't want to block the whole machine waiting for the
> joining of that single CPU anyway and instead re-try later, for example,
> using a timeout on the barrier.  That could be signaled to the user-land
> process (EAGAIN) so that he could re-attempt hotpatching after some seconds.

Which is also an asynchronous operation.

The experience with previous preemption XSAs has left me quite afraid of
long-running operations - which is why I was thinking to have this
baked in at the start.

Both ways - EAGAIN or doing a _GET_STATUS - would provide a mechanism for
the VCPU to do other work instead of being tied up.

The EAGAIN approach mandates that the 'bringing the CPUs together' must be
done under 1ms and that there must be code to enforce a timeout on the barrier.

The _GET_STATUS approach does not enforce this and can take longer, giving
us more breathing room - but also unbounded time - which means that if
we were to try to cancel it (say it had run for an hour and still
could not patch) - we would have to add some hairy code to
deal with cancelling asynchronous code.

Your way is simpler - but I would advocate expanding the -EAGAIN to _all_
the xSplice hypercalls. Thoughts?
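Roughly, the EAGAIN flavour I have in mind for the splicing operation would
look something like this (purely a sketch - the names, the fake tick-based
time source, and the cpu_reached[] state are all made up, not proposed ABI):

```c
#include <errno.h>
#include <stdbool.h>

#define NR_CPUS 4

/* Stand-in for "CPU n has parked at the patching barrier".  A real
 * implementation would force CPUs here with IPIs. */
static bool cpu_reached[NR_CPUS];

/* Poll until every CPU has joined or the time budget runs out.  'now'
 * advances one fake tick per poll; a real implementation would read a
 * clock instead. */
static int xsplice_rendezvous(unsigned int timeout_ticks)
{
    for ( unsigned int now = 0; now <= timeout_ticks; now++ )
    {
        bool all = true;

        for ( int cpu = 0; cpu < NR_CPUS; cpu++ )
            if ( !cpu_reached[cpu] )
                all = false;

        if ( all )
            return 0;       /* Quiet state reached: safe to patch. */
    }

    return -EAGAIN;         /* Straggler: drop everything, retry later. */
}
```

The caller - whichever hypercall sub-op needs the quiet state - would just
propagate that -EAGAIN back to userland.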

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC v2] xSplice design
  2015-06-12 11:51         ` Martin Pohlack
@ 2015-06-12 14:06           ` Konrad Rzeszutek Wilk
  0 siblings, 0 replies; 37+ messages in thread
From: Konrad Rzeszutek Wilk @ 2015-06-12 14:06 UTC (permalink / raw)
  To: Martin Pohlack
  Cc: Elena Ufimtseva, jeremy, hanweidong, jbeulich, john.liuqiming,
	Paul Voccio, Daniel Kiper, Major Hayden, liuyingdong, aliguori,
	xiantao.zxt, lars.kurth, Steven Wilson, peter.huangpeng, msw,
	xen-devel, Rick Harris, boris.ostrovsky, Josh Kearney,
	jinsong.liu, Antony Messerli, konrad, fanhenglong,
	andrew.cooper3

On Fri, Jun 12, 2015 at 01:51:08PM +0200, Martin Pohlack wrote:
> On 08.06.2015 17:19, Konrad Rzeszutek Wilk wrote:q

Heh - ":q", well now I know what editor camp you are in :-)

> [...]
> >>> There is a nice part of the old code check - you
> >>> can check (and deal with) patching an already patched code.
> >>> As in, if the payload was configured to be applied on top of an already
> >>> patched function it would patch nicely. But if the payload is against
> >>> the virgin code - and the hypervisor is running an older patch, we would
> >>> bail out.
> >>
> >> You can do that too with the build IDs if there is some mechanism that
> >> loads hotpatches in the same order as they were built in (if they
> >> overlap).  The simplest approach that comes to mind is a hotpatch stack,
> >> instead of independent patches.
> > 
> > True. Murphy law though says somebody will do this in reverse order :-)
> > And that is my worry - some system admin will reverse the order, or pick
> > an patch out of order, and we end up patching .. and things eventually
> > break and blow up.
> 
> Right, I can see how this might be useful as an additional guard.
> 
> There are some additional benefits to using build IDs, beyond preventing
> loading patches for the wrong hypervisor.  They can also help locate
> patches for the currently running hypervisor if laid out correspondingly
> on disk, e.g.:
> 
>   /some/path/<build_ID>/nnnnn-patch1.mod
> 
> A userland tool would query for the specific build ID of the currently
> running hypervisor and only attempt to load hotpatches designated for
> it.  This is a stronger protection than relying on the RPM version or a
> similar mechanism.
> 
> * build ID
>   * Prevent loading of wrong hotpatches (intended for other builds)
>   * Allow to identify suitable hotpatches on disk and help with runtime
>     tooling (if laid out using build ID)
> 
> * Comparing old code
>   * Prevent loading of dynamically incompatible hotpatches

<nods> Having them both sounds sensible.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC v2] xSplice design
  2015-06-12 14:03   ` Konrad Rzeszutek Wilk
@ 2015-06-12 14:31     ` Martin Pohlack
  2015-06-12 14:43       ` Jan Beulich
  2015-06-12 16:09       ` Konrad Rzeszutek Wilk
  0 siblings, 2 replies; 37+ messages in thread
From: Martin Pohlack @ 2015-06-12 14:31 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Elena Ufimtseva, jeremy, hanweidong, jbeulich, john.liuqiming,
	Paul Voccio, Daniel Kiper, Major Hayden, liuyingdong, aliguori,
	xiantao.zxt, lars.kurth, Steven Wilson, peter.huangpeng, msw,
	xen-devel, Rick Harris, boris.ostrovsky, Josh Kearney,
	jinsong.liu, Antony Messerli, konrad, fanhenglong,
	andrew.cooper3

On 12.06.2015 16:03, Konrad Rzeszutek Wilk wrote:
> On Fri, Jun 12, 2015 at 01:39:05PM +0200, Martin Pohlack wrote:
>> On 15.05.2015 21:44, Konrad Rzeszutek Wilk wrote:
>> [...]
>>> ## Hypercalls
>>>
>>> We will employ the sub operations of the system management hypercall (sysctl).
>>> There are to be four sub-operations:
>>>
>>>  * upload the payloads.
>>>  * listing of payloads summary uploaded and their state.
>>>  * getting an particular payload summary and its state.
>>>  * command to apply, delete, or revert the payload.
>>>
>>> The patching is asynchronous therefore the caller is responsible
>>> to verify that it has been applied properly by retrieving the summary of it
>>> and verifying that there are no error codes associated with the payload.
>>>
>>> We **MUST** make it asynchronous due to the nature of patching: it requires
>>> every physical CPU to be lock-step with each other. The patching mechanism
>>> while an implementation detail, is not an short operation and as such
>>> the design **MUST** assume it will be an long-running operation.
>>
>> I am not convinced yet, that you need an asynchronous approach here.
>>
>> The experience from our prototype suggests that hotpatching itself is
>> not an expensive operation.  It can usually be completed well below 1ms
>> with the most expensive part being getting the hypervisor to a quiet state.
>>
>> If we go for a barrier at hypervisor exit, combined with forcing all
>> other CPUs through the hypervisor with IPIs, the typical case is very quick.
>>
>> The only reason why that would take some time is, if another CPU is
>> executing a lengthy operation in the hypervisor already.  In that case,
>> you probably don't want to block the whole machine waiting for the
>> joining of that single CPU anyway and instead re-try later, for example,
>> using a timeout on the barrier.  That could be signaled to the user-land
>> process (EAGAIN) so that he could re-attempt hotpatching after some seconds.
> 
> Which is also an asynchronous operation.

Right, but in userland.  My main aim is to have as little complicated
code as possible in the hypervisor for obvious reasons.  This approach
would not require any further tracking of state in the hypervisor.

> The experience with previous preemption XSAs have left me quite afraid of
> long-running operations - which is why I was thinking to have this
> baked this at the start.
> 
> Both ways - EAGAIN or doing an _GET_STATUS would provide an mechanism for
> the VCPU to do other work instead of being tied up.

If I understood your proposal correctly, there is a difference.  With
EAGAIN, all activity is dropped and the machine remains fully available
to whatever guests are running at the time.

With _GET_STATUS, you would continue to try to bring the hypervisor to a
quiet state in the background but return to userland to let this one
thread continue.  Behind the scenes though, you would still need to
capture all CPUs at one point and all captured CPUs would have to wait
for the last straggler.  That would lead to noticeable dead-time for
guests running on-top.

I might have misunderstood your proposal though.

> The EAGAIN mandates that the 'bringing the CPUs together' must be done
> under 1ms and that there must be code to enforce an timeout on the barrier.

The 1ms is just a random number.  I would actually suggest to allow a
sysadmin or hotpatch management tooling to specify how long one is
willing to potentially block the whole machine when waiting for a
stop_machine-like barrier as part of a relevant hypercall.  You could
imagine userland to start out with 1ms and slowly work its way up
whenever it retries.
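In code, the userland side of that might look like this (illustrative only;
apply_fn stands in for whatever wrapper ends up invoking the real hypercall,
and fake_apply is just a demo stub):

```c
#include <errno.h>

/* Retry the apply operation with an escalating time budget.  start_us
 * must be non-zero; the budget doubles on every -EAGAIN until max_us. */
static int apply_with_backoff(int (*apply_fn)(unsigned int timeout_us),
                              unsigned int start_us,
                              unsigned int max_us)
{
    for ( unsigned int t = start_us; t <= max_us; t *= 2 )
    {
        int rc = apply_fn(t);

        if ( rc != -EAGAIN )
            return rc;      /* Success, or a hard error worth reporting. */
        /* A real tool would sleep for a minute or so between attempts. */
    }

    return -EAGAIN;         /* Still busy after the largest budget. */
}

/* Demo stub: "succeeds" once the caller is willing to block >= 4000us. */
static int fake_apply(unsigned int timeout_us)
{
    return timeout_us >= 4000 ? 0 : -EAGAIN;
}
```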

> The _GET_STATUS does not enforce this and can take longer giving us
> more breathing room - and also unbounded time - which means if
> we were to try to cancel it (say it had run for an hour and still
> could not patch it)- we have to add some hairy code to
> deal with cancelling asynchronous code.
> 
> Your way is simpler - but I would advocate expanding the -EAGAIN to _all_
> the xSplice hypercalls. Thoughts?

In my experience, you only need the EAGAIN for hypercalls that use the
quiet state.  Depending on the design, that would be the operations that
do hotpatch activation and deactivation (i.e., the actual splicing).

Martin

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC v2] xSplice design
  2015-06-12 14:31     ` Martin Pohlack
@ 2015-06-12 14:43       ` Jan Beulich
  2015-06-12 17:31         ` Martin Pohlack
  2015-06-12 16:09       ` Konrad Rzeszutek Wilk
  1 sibling, 1 reply; 37+ messages in thread
From: Jan Beulich @ 2015-06-12 14:43 UTC (permalink / raw)
  To: Martin Pohlack
  Cc: Elena Ufimtseva, jeremy, hanweidong, john.liuqiming, PaulVoccio,
	xen-devel, Daniel Kiper, Major Hayden, liuyingdong, aliguori,
	konrad, lars.kurth, Steven Wilson, peter.huangpeng, msw,
	xiantao.zxt, Rick Harris, boris.ostrovsky, Josh Kearney,
	jinsong.liu, Antony Messerli, fanhenglong, andrew.cooper3

>>> On 12.06.15 at 16:31, <mpohlack@amazon.com> wrote:
> The 1ms is just a random number.  I would actually suggest to allow a
> sysadmin or hotpatch management tooling to specify how long one is
> willing to potentially block the whole machine when waiting for a
> stop_machine-like barrier as part of a relevant hypercall.  You could
> imagine userland to start out with 1ms and slowly work its way up
> whenever it retries.

In which case the question would be why it didn't start with a larger
timeout from the beginning. If anything, I could see this being used
to allow for a larger stop window for more critical patches.

Jan

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC v2] xSplice design
  2015-06-12 14:31     ` Martin Pohlack
  2015-06-12 14:43       ` Jan Beulich
@ 2015-06-12 16:09       ` Konrad Rzeszutek Wilk
  2015-06-12 16:17         ` Andrew Cooper
  2015-07-06 19:36         ` Konrad Rzeszutek Wilk
  1 sibling, 2 replies; 37+ messages in thread
From: Konrad Rzeszutek Wilk @ 2015-06-12 16:09 UTC (permalink / raw)
  To: Martin Pohlack
  Cc: Elena Ufimtseva, jeremy, hanweidong, jbeulich, john.liuqiming,
	Paul Voccio, Daniel Kiper, Major Hayden, liuyingdong, aliguori,
	xiantao.zxt, lars.kurth, Steven Wilson, peter.huangpeng, msw,
	xen-devel, Rick Harris, boris.ostrovsky, Josh Kearney,
	jinsong.liu, Antony Messerli, konrad, fanhenglong,
	andrew.cooper3

On Fri, Jun 12, 2015 at 04:31:12PM +0200, Martin Pohlack wrote:
> On 12.06.2015 16:03, Konrad Rzeszutek Wilk wrote:
> > On Fri, Jun 12, 2015 at 01:39:05PM +0200, Martin Pohlack wrote:
> >> On 15.05.2015 21:44, Konrad Rzeszutek Wilk wrote:
> >> [...]
> >>> ## Hypercalls
> >>>
> >>> We will employ the sub operations of the system management hypercall (sysctl).
> >>> There are to be four sub-operations:
> >>>
> >>>  * upload the payloads.
> >>>  * listing of payloads summary uploaded and their state.
> >>>  * getting an particular payload summary and its state.
> >>>  * command to apply, delete, or revert the payload.
> >>>
> >>> The patching is asynchronous therefore the caller is responsible
> >>> to verify that it has been applied properly by retrieving the summary of it
> >>> and verifying that there are no error codes associated with the payload.
> >>>
> >>> We **MUST** make it asynchronous due to the nature of patching: it requires
> >>> every physical CPU to be lock-step with each other. The patching mechanism
> >>> while an implementation detail, is not an short operation and as such
> >>> the design **MUST** assume it will be an long-running operation.
> >>
> >> I am not convinced yet, that you need an asynchronous approach here.
> >>
> >> The experience from our prototype suggests that hotpatching itself is
> >> not an expensive operation.  It can usually be completed well below 1ms
> >> with the most expensive part being getting the hypervisor to a quiet state.
> >>
> >> If we go for a barrier at hypervisor exit, combined with forcing all
> >> other CPUs through the hypervisor with IPIs, the typical case is very quick.
> >>
> >> The only reason why that would take some time is, if another CPU is
> >> executing a lengthy operation in the hypervisor already.  In that case,
> >> you probably don't want to block the whole machine waiting for the
> >> joining of that single CPU anyway and instead re-try later, for example,
> >> using a timeout on the barrier.  That could be signaled to the user-land
> >> process (EAGAIN) so that he could re-attempt hotpatching after some seconds.
> > 
> > Which is also an asynchronous operation.
> 
> Right, but in userland.  My main aim is to have as little complicated
> code as possible in the hypervisor for obvious reasons.  This approach
> would not require any further tracking of state in the hypervisor.

True.
> 
> > The experience with previous preemption XSAs have left me quite afraid of
> > long-running operations - which is why I was thinking to have this
> > baked this at the start.
> > 
> > Both ways - EAGAIN or doing an _GET_STATUS would provide an mechanism for
> > the VCPU to do other work instead of being tied up.
> 
> If I understood your proposal correctly, there is a difference.  With
> EAGAIN, all activity is dropped and the machine remains fully available
> to whatever guests are running at the time.

Correct.
> 
> With _GET_STATUS, you would continue to try to bring the hypervisor to a
> quiet state in the background but return to userland to let this one
> thread continue.  Behind the scenes though, you would still need to

<nods>
> capture all CPUs at one point and all captured CPUs would have to wait
> for the last straggler.  That would lead to noticeable dead-time for
> guests running on-top.

Potentially. Using the time calibration routine to do the patching guarantees
that we will have a sync-up every second on the machine - so there will always
be that possibility.
> 
> I might have misunderstood your proposal though.

You got it right.
> 
> > The EAGAIN mandates that the 'bringing the CPUs together' must be done
> > under 1ms and that there must be code to enforce an timeout on the barrier.
> 
> The 1ms is just a random number.  I would actually suggest to allow a
> sysadmin or hotpatch management tooling to specify how long one is
> willing to potentially block the whole machine when waiting for a
> stop_machine-like barrier as part of a relevant hypercall.  You could
> imagine userland to start out with 1ms and slowly work its way up
> whenever it retries.
> 
> > The _GET_STATUS does not enforce this and can take longer giving us
> > more breathing room - and also unbounded time - which means if
> > we were to try to cancel it (say it had run for an hour and still
> > could not patch it)- we have to add some hairy code to
> > deal with cancelling asynchronous code.
> > 
> > Your way is simpler - but I would advocate expanding the -EAGAIN to _all_
> > the xSplice hypercalls. Thoughts?
> 
> In my experience, you only need the EAGAIN for hypercalls that use the
> quiet state.  Depending on the design, that would be the operations that
> do hotpatch activation and deactivation (i.e., the actual splicing).

The uploading of the patch could be slow - as in, the checking to be done
on a big patch (2MB or more?) - so it would be good to be able to try again.

> 
> Martin
> 

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC v2] xSplice design
  2015-06-12 16:09       ` Konrad Rzeszutek Wilk
@ 2015-06-12 16:17         ` Andrew Cooper
  2015-06-12 16:39           ` Konrad Rzeszutek Wilk
  2015-07-06 19:36         ` Konrad Rzeszutek Wilk
  1 sibling, 1 reply; 37+ messages in thread
From: Andrew Cooper @ 2015-06-12 16:17 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk, Martin Pohlack
  Cc: Elena Ufimtseva, jeremy, hanweidong, jbeulich, john.liuqiming,
	Paul Voccio, xen-devel, Daniel Kiper, Major Hayden, liuyingdong,
	aliguori, konrad, lars.kurth, Steven Wilson, peter.huangpeng,
	msw, xiantao.zxt, Rick Harris, boris.ostrovsky, Josh Kearney,
	jinsong.liu, Antony Messerli, fanhenglong

On 12/06/15 17:09, Konrad Rzeszutek Wilk wrote:
>
>>> The _GET_STATUS does not enforce this and can take longer giving us
>>> more breathing room - and also unbounded time - which means if
>>> we were to try to cancel it (say it had run for an hour and still
>>> could not patch it)- we have to add some hairy code to
>>> deal with cancelling asynchronous code.
>>>
>>> Your way is simpler - but I would advocate expanding the -EAGAIN to _all_
>>> the xSplice hypercalls. Thoughts?
>> In my experience, you only need the EAGAIN for hypercalls that use the
>> quiet state.  Depending on the design, that would be the operations that
>> do hotpatch activation and deactivation (i.e., the actual splicing).
> The uploading of the patch could be slow - as in the checking to be done
> and on an big patch (2MB or more?) it would be good to try again.

If a patch is greater than a few kb, it is probably not something
sensible to be patching.

However, an upload_patch/apply_patch split in the hypercall ABI might be
a sensible idea.

~Andrew

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC v2] xSplice design
  2015-06-12 16:17         ` Andrew Cooper
@ 2015-06-12 16:39           ` Konrad Rzeszutek Wilk
  2015-06-12 18:36             ` Martin Pohlack
  0 siblings, 1 reply; 37+ messages in thread
From: Konrad Rzeszutek Wilk @ 2015-06-12 16:39 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Elena Ufimtseva, jeremy, hanweidong, jbeulich, john.liuqiming,
	Paul Voccio, Daniel Kiper, Major Hayden, liuyingdong, konrad,
	aliguori, xiantao.zxt, lars.kurth, Steven Wilson,
	peter.huangpeng, msw, xen-devel, Rick Harris, boris.ostrovsky,
	Josh Kearney, jinsong.liu, Antony Messerli, Martin Pohlack,
	fanhenglong

On Fri, Jun 12, 2015 at 05:17:13PM +0100, Andrew Cooper wrote:
> On 12/06/15 17:09, Konrad Rzeszutek Wilk wrote:
> >
> >>> The _GET_STATUS does not enforce this and can take longer giving us
> >>> more breathing room - and also unbounded time - which means if
> >>> we were to try to cancel it (say it had run for an hour and still
> >>> could not patch it)- we have to add some hairy code to
> >>> deal with cancelling asynchronous code.
> >>>
> >>> Your way is simpler - but I would advocate expanding the -EAGAIN to _all_
> >>> the xSplice hypercalls. Thoughts?
> >> In my experience, you only need the EAGAIN for hypercalls that use the
> >> quiet state.  Depending on the design, that would be the operations that
> >> do hotpatch activation and deactivation (i.e., the actual splicing).
> > The uploading of the patch could be slow - as in the checking to be done
> > and on an big patch (2MB or more?) it would be good to try again.
> 
> If a patch is greater than a few kb, it is probably not something
> sensible to be patching.

Potentially. It could be a cumulative update containing multiple XSAs.

> 
> However, an upload_patch/apply_patch split in the hypercall ABI might be
> a sensible idea.

The design has that (it has four hypercalls actually).

The question is whether that upload_patch hypercall should also
have the EAGAIN mechanism baked into the design.

The other one (GET_LIST) has it too - and can return EAGAIN with a
count of how many entries are left so that user-space can pick up where it left off.

> 
> ~Andrew

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC v2] xSplice design
  2015-06-12 14:43       ` Jan Beulich
@ 2015-06-12 17:31         ` Martin Pohlack
  2015-06-12 18:46           ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 37+ messages in thread
From: Martin Pohlack @ 2015-06-12 17:31 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Elena Ufimtseva, jeremy, hanweidong, john.liuqiming, PaulVoccio,
	xen-devel, Daniel Kiper, Major Hayden, liuyingdong, aliguori,
	konrad, lars.kurth, Steven Wilson, peter.huangpeng, msw,
	xiantao.zxt, Rick Harris, boris.ostrovsky, Josh Kearney,
	jinsong.liu, Antony Messerli, fanhenglong, andrew.cooper3

On 12.06.2015 16:43, Jan Beulich wrote:
>>>> On 12.06.15 at 16:31, <mpohlack@amazon.com> wrote:
>> The 1ms is just a random number.  I would actually suggest to allow a
>> sysadmin or hotpatch management tooling to specify how long one is
>> willing to potentially block the whole machine when waiting for a
>> stop_machine-like barrier as part of a relevant hypercall.  You could
>> imagine userland to start out with 1ms and slowly work its way up
>> whenever it retries.
> 
> In which case the question would be why it didn't start with a larger
> timeout from the beginning. If anything I could see this to be used
> to allow for a larger stop window for more critical patches.

The main idea is that situations where you cannot patch immediately are
transient (e.g., instance start / stop, ...).  So by trying a couple of
times with a very short timeout every minute or so, chances are very
high to succeed without causing any large interruptions for guests.

Also, you usually have some time to deploy a hotpatch, given the typical
XSA embargo period.  So by slowly increasing the maximum blocking time
that one is willing to pay, one would patch the vast majority very
quickly and one still would have the option to patch stragglers by
paying a bit more blocking time later in the patch period.

Martin

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC v2] xSplice design
  2015-06-12 16:39           ` Konrad Rzeszutek Wilk
@ 2015-06-12 18:36             ` Martin Pohlack
  2015-06-12 18:51               ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 37+ messages in thread
From: Martin Pohlack @ 2015-06-12 18:36 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk, Andrew Cooper
  Cc: Elena Ufimtseva, jeremy, hanweidong, jbeulich, john.liuqiming,
	Paul Voccio, xen-devel, Daniel Kiper, Major Hayden, liuyingdong,
	aliguori, konrad, lars.kurth, Steven Wilson, peter.huangpeng,
	msw, xiantao.zxt, Rick Harris, boris.ostrovsky, Josh Kearney,
	jinsong.liu, Antony Messerli, fanhenglong

On 12.06.2015 18:39, Konrad Rzeszutek Wilk wrote:
> On Fri, Jun 12, 2015 at 05:17:13PM +0100, Andrew Cooper wrote:
>> On 12/06/15 17:09, Konrad Rzeszutek Wilk wrote:
>>>
>>>>> The _GET_STATUS does not enforce this and can take longer giving us
>>>>> more breathing room - and also unbounded time - which means if
>>>>> we were to try to cancel it (say it had run for an hour and still
>>>>> could not patch it)- we have to add some hairy code to
>>>>> deal with cancelling asynchronous code.
>>>>>
>>>>> Your way is simpler - but I would advocate expanding the -EAGAIN to _all_
>>>>> the xSplice hypercalls. Thoughts?
>>>> In my experience, you only need the EAGAIN for hypercalls that use the
>>>> quiet state.  Depending on the design, that would be the operations that
>>>> do hotpatch activation and deactivation (i.e., the actual splicing).
>>> The uploading of the patch could be slow - as in the checking to be done
>>> and on an big patch (2MB or more?) it would be good to try again.
>>
>> If a patch is greater than a few kb, it is probably not something
>> sensible to be patching.
> 
> Potentially. It could be an cumlative update containing mulitple XSAs.
> 
>>
>> However, an upload_patch/apply_patch split in the hypercall ABI might be
>> a sensible idea.
> 
> The design has that (it has four hypercalls actually).
> 
> The question is whether that upload_patch hypercall should also
> have the EAGAIN mechansim baked in the design.

Why would you need it?  Do you envision any complex blocking operation
to happen when loading a module?  I can't think of any off the top of my
head.

> The other one (GET_LIST) has it too - and can return EAGAIN with an
> count of how many there are left so the user-space can pick up.

I don't understand how that relates to the async nature of the other
interface parts.  Is that similar to the readdir syscall where a second
invocation would continue from the current seek pointer?

Or do you have something like a pread for a subarray of list entries in
mind?

It might be a bit tricky to reliably deliver atomic snapshots of a
potentially larger list to userland.  Maybe a version field would be
desirable here.

Martin

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC v2] xSplice design
  2015-06-12 17:31         ` Martin Pohlack
@ 2015-06-12 18:46           ` Konrad Rzeszutek Wilk
  0 siblings, 0 replies; 37+ messages in thread
From: Konrad Rzeszutek Wilk @ 2015-06-12 18:46 UTC (permalink / raw)
  To: Martin Pohlack
  Cc: Elena Ufimtseva, jeremy, hanweidong, Jan Beulich, john.liuqiming,
	PaulVoccio, xen-devel, Daniel Kiper, Major Hayden, liuyingdong,
	aliguori, xiantao.zxt, lars.kurth, Steven Wilson,
	peter.huangpeng, msw, konrad, Rick Harris, boris.ostrovsky,
	Josh Kearney, jinsong.liu, Antony Messerli, fanhenglong,
	andrew.cooper3

On Fri, Jun 12, 2015 at 07:31:25PM +0200, Martin Pohlack wrote:
> On 12.06.2015 16:43, Jan Beulich wrote:
> >>>> On 12.06.15 at 16:31, <mpohlack@amazon.com> wrote:
> >> The 1ms is just a random number.  I would actually suggest to allow a
> >> sysadmin or hotpatch management tooling to specify how long one is
> >> willing to potentially block the whole machine when waiting for a
> >> stop_machine-like barrier as part of a relevant hypercall.  You could
> >> imagine userland to start out with 1ms and slowly work its way up
> >> whenever it retries.
> > 
> > In which case the question would be why it didn't start with a larger
> > timeout from the beginning. If anything I could see this to be used
> > to allow for a larger stop window for more critical patches.
> 
> The main idea is that situations where you cannot patch immediately are
> transient (e.g., instance start / stop, ...).  So by trying a couple of
> times with a very short timeout every minute or so, chances are very
> high to succeed without causing any large interruptions for guests.
> 
> Also, you usually have some time to deploy a hotpatch, given the typical
> XSA embargo period.  So by slowly increasing the maximum blocking time
> that one is willing to pay, one would patch the vast majority very
> quickly and one still would have the option to patch stragglers by
> paying a bit more blocking time later in the patch period.

The system admin would want the patch in regardless of whether the mechanism
took milliseconds or seconds. Having knobs to define the timeout is not
necessary - what the admin would most likely want to be told is:
"Hypervisor is busy, attempt #31415" instead of silence and a hung
command.

And maybe a --pause argument if the hypervisor is really tied up and
can't get any breathing room - which would pause all the guests before
patching, and afterwards unpause them.


However, I just realized one problem with doing the patching synchronously
in the hypercall (as opposed to having it done asynchronously).

We would not be able to patch all the code that is invoked while
this hypercall is in progress. That is - the do_domctl, the
spinlocks, anything put on the stack, etc.
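The safety check this implies would be something along these lines (purely
illustrative, not real Xen code): walk the saved return addresses of every
CPU and refuse to splice if any of them points into the code being replaced.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct old_func {
    uintptr_t start;    /* first byte of the code to be replaced */
    size_t    len;
};

/* Return true if it is safe to patch, i.e. no recorded return address
 * lies inside the old function's range.  ret_addrs would come from
 * walking each CPU's stack at the rendezvous point. */
static bool stack_is_clear(const struct old_func *f,
                           const uintptr_t *ret_addrs, size_t n)
{
    for ( size_t i = 0; i < n; i++ )
        if ( ret_addrs[i] >= f->start && ret_addrs[i] < f->start + f->len )
            return false;   /* Old code still live on some stack. */

    return true;
}
```

And with a synchronous hypercall, the splicing path itself (do_domctl and
friends) would always be on some stack, so such a check could never pass
for those functions.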

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC v2] xSplice design
  2015-06-12 18:36             ` Martin Pohlack
@ 2015-06-12 18:51               ` Konrad Rzeszutek Wilk
  0 siblings, 0 replies; 37+ messages in thread
From: Konrad Rzeszutek Wilk @ 2015-06-12 18:51 UTC (permalink / raw)
  To: Martin Pohlack
  Cc: Elena Ufimtseva, jeremy, hanweidong, jbeulich, john.liuqiming,
	Paul Voccio, Daniel Kiper, Major Hayden, liuyingdong, konrad,
	aliguori, xiantao.zxt, lars.kurth, Steven Wilson,
	peter.huangpeng, msw, xen-devel, Rick Harris, boris.ostrovsky,
	Josh Kearney, Andrew Cooper, Antony Messerli, fanhenglong,
	jinsong.liu

On Fri, Jun 12, 2015 at 08:36:29PM +0200, Martin Pohlack wrote:
> On 12.06.2015 18:39, Konrad Rzeszutek Wilk wrote:
> > On Fri, Jun 12, 2015 at 05:17:13PM +0100, Andrew Cooper wrote:
> >> On 12/06/15 17:09, Konrad Rzeszutek Wilk wrote:
> >>>
> >>>>> The _GET_STATUS does not enforce this and can take longer giving us
> >>>>> more breathing room - and also unbounded time - which means if
> >>>>> we were to try to cancel it (say it had run for an hour and still
> >>>>> could not patch it)- we have to add some hairy code to
> >>>>> deal with cancelling asynchronous code.
> >>>>>
> >>>>> Your way is simpler - but I would advocate expanding the -EAGAIN to _all_
> >>>>> the xSplice hypercalls. Thoughts?
> >>>> In my experience, you only need the EAGAIN for hypercalls that use the
> >>>> quiet state.  Depending on the design, that would be the operations that
> >>>> do hotpatch activation and deactivation (i.e., the actual splicing).
> >>> The uploading of the patch could be slow - as in the checking to be done
> >>> and on an big patch (2MB or more?) it would be good to try again.
> >>
> >> If a patch is greater than a few kb, it is probably not something
> >> sensible to be patching.
> > 
> > Potentially. It could be an cumlative update containing mulitple XSAs.
> > 
> >>
> >> However, an upload_patch/apply_patch split in the hypercall ABI might be
> >> a sensible idea.
> > 
> > The design has that (it has four hypercalls actually).
> > 
> > The question is whether that upload_patch hypercall should also
> > have the EAGAIN mechansim baked in the design.
> 
> Why would you need it?  Do you envision any complex blocking operation
> to happen when loading a module?  I can't think of any off the top of my
> head.

We have to load the ELF payload and do a signature verification. If that
is somehow tied into SecureBoot, perhaps we need to use the shim to verify
the payload. Hopefully that won't take forever, but it is unbounded
and we have no clue how long it could take.

> 
> > The other one (GET_LIST) has it too - and can return EAGAIN with a
> > count of how many are left so user-space can pick up from there.
> 
> I don't understand how that relates to the async nature of the other
> interface parts.  Is that similar to the readdir syscall where a second
> invocation would continue from the current seek pointer?

That, but without guarantees. That is, if another patch is loaded
during this operation, we would not see it unless we did another
GET_LIST hypercall.
> 
> Or do you have something like a pread for a subarray of list entries in
> mind?
> 
> It might be a bit tricky to reliably deliver atomic snapshots if a
> potentially larger list to userland.  Maybe a version field might be
> desirable here.

Sure, a version field would work nicely.
> 
> Martin
> 

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC v2] xSplice design
  2015-06-12 16:09       ` Konrad Rzeszutek Wilk
  2015-06-12 16:17         ` Andrew Cooper
@ 2015-07-06 19:36         ` Konrad Rzeszutek Wilk
  1 sibling, 0 replies; 37+ messages in thread
From: Konrad Rzeszutek Wilk @ 2015-07-06 19:36 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Elena Ufimtseva, jeremy, hanweidong, jbeulich, john.liuqiming,
	Paul Voccio, Daniel Kiper, Major Hayden, liuyingdong, aliguori,
	xiantao.zxt, lars.kurth, Steven Wilson, peter.huangpeng, msw,
	xen-devel, Rick Harris, boris.ostrovsky, Josh Kearney,
	jinsong.liu, Antony Messerli, Martin Pohlack, fanhenglong,
	andrew.cooper3

On Fri, Jun 12, 2015 at 12:09:24PM -0400, Konrad Rzeszutek Wilk wrote:
> On Fri, Jun 12, 2015 at 04:31:12PM +0200, Martin Pohlack wrote:
> > On 12.06.2015 16:03, Konrad Rzeszutek Wilk wrote:
> > > On Fri, Jun 12, 2015 at 01:39:05PM +0200, Martin Pohlack wrote:
> > >> On 15.05.2015 21:44, Konrad Rzeszutek Wilk wrote:
> > >> [...]
> > >>> ## Hypercalls
> > >>>
> > >>> We will employ the sub operations of the system management hypercall (sysctl).
> > >>> There are to be four sub-operations:
> > >>>
> > >>>  * upload the payloads.
> > >>>  * listing of payloads summary uploaded and their state.
> > >>>  * getting an particular payload summary and its state.
> > >>>  * command to apply, delete, or revert the payload.
> > >>>
> > >>> The patching is asynchronous therefore the caller is responsible
> > >>> to verify that it has been applied properly by retrieving the summary of it
> > >>> and verifying that there are no error codes associated with the payload.
> > >>>
> > >>> We **MUST** make it asynchronous due to the nature of patching: it requires
> > >>> every physical CPU to be lock-step with each other. The patching mechanism
> > >>> while an implementation detail, is not an short operation and as such
> > >>> the design **MUST** assume it will be an long-running operation.
> > >>
> > >> I am not convinced yet, that you need an asynchronous approach here.
> > >>
> > >> The experience from our prototype suggests that hotpatching itself is
> > >> not an expensive operation.  It can usually be completed well below 1ms
> > >> with the most expensive part being getting the hypervisor to a quiet state.
> > >>
> > >> If we go for a barrier at hypervisor exit, combined with forcing all
> > >> other CPUs through the hypervisor with IPIs, the typical case is very quick.
> > >>
> > >> The only reason why that would take some time is, if another CPU is
> > >> executing a lengthy operation in the hypervisor already.  In that case,
> > >> you probably don't want to block the whole machine waiting for the
> > >> joining of that single CPU anyway and instead re-try later, for example,
> > >> using a timeout on the barrier.  That could be signaled to the user-land
> > >> process (EAGAIN) so that he could re-attempt hotpatching after some seconds.
> > > 
> > > Which is also an asynchronous operation.
> > 
> > Right, but in userland.  My main aim is to have as little complicated
> > code as possible in the hypervisor for obvious reasons.  This approach
> > would not require any further tracking of state in the hypervisor.
> 
> True.
> > 
> > > The experience with previous preemption XSAs have left me quite afraid of
> > > long-running operations - which is why I was thinking to have this
> > > baked this at the start.
> > > 
> > > Both ways - EAGAIN or doing an _GET_STATUS would provide an mechanism for
> > > the VCPU to do other work instead of being tied up.
> > 
> > If I understood your proposal correctly, there is a difference.  With
> > EAGAIN, all activity is dropped and the machine remains fully available
> > to whatever guests are running at the time.
> 
> Correct.
> > 
> > With _GET_STATUS, you would continue to try to bring the hypervisor to a
> > quiet state in the background but return to userland to let this one
> > thread continue.  Behind the scenes though, you would still need to
> 
> <nods>
> > capture all CPUs at one point and all captured CPUs would have to wait
> > for the last straggler.  That would lead to noticeable dead-time for
> > guests running on-top.
> 
> Potentially. Using the time calibration routine to do the patching guarantees
> that we will have a sync-up every second on the machine - so there will
> always be that possibility.
> > 
> > I might have misunderstood your proposal though.
> 
> You got it right.

As I was going over v3 of this, I reread it and realized that in
fact I was not explaining it properly.

The 'dead-time' you refer to is, I presume, the time during which the
guests don't get to run as we are waiting (spinning) on other CPUs.

That is not the case here - the 'capture all CPUs at one point and all
captured CPUs would have to wait for the last straggler' is an
implementation detail. One can implement it that way (which is
what I think you have with time-out barriers at hypervisor exits),
or we can piggyback on the time rendezvous code which syncs all CPUs.

Either way - we will always have an opportunity to have the CPUs
all synced, and can utilize both if we need to.

> > 
> > > The EAGAIN mandates that the 'bringing the CPUs together' must be done
> > > under 1ms and that there must be code to enforce an timeout on the barrier.
> > 
> > The 1ms is just a random number.  I would actually suggest to allow a
> > sysadmin or hotpatch management tooling to specify how long one is
> > willing to potentially block the whole machine when waiting for a
> > stop_machine-like barrier as part of a relevant hypercall.  You could
> > imagine userland to start out with 1ms and slowly work its way up
> > whenever it retries.
> > 
> > > The _GET_STATUS does not enforce this and can take longer giving us
> > > more breathing room - and also unbounded time - which means if
> > > we were to try to cancel it (say it had run for an hour and still
> > > could not patch it)- we have to add some hairy code to
> > > deal with cancelling asynchronous code.

.. Which I am not sure is really needed. I think we could bake in
that if the patching cannot be done within a certain time (2 seconds),
regardless of the implementation, it would report an error in
the `status' for _GET_STATUS.


* Re: [RFC v2] xSplice design
  2015-06-12 11:39 ` Martin Pohlack
  2015-06-12 14:03   ` Konrad Rzeszutek Wilk
@ 2015-10-27 12:05   ` Ross Lagerwall
  2015-10-29 16:55     ` Ross Lagerwall
  1 sibling, 1 reply; 37+ messages in thread
From: Ross Lagerwall @ 2015-10-27 12:05 UTC (permalink / raw)
  To: Martin Pohlack, Konrad Rzeszutek Wilk, msw, aliguori,
	Antony Messerli, Rick Harris, Paul Voccio, Steven Wilson,
	Major Hayden, Josh Kearney, jinsong.liu, xiantao.zxt,
	boris.ostrovsky, Daniel Kiper, Elena Ufimtseva, bob.liu,
	lars.kurth, hanweidong, peter.huangpeng, fanhenglong,
	liuyingdong, john.liuqiming, xen-devel, jbeulich, andrew.cooper3,
	jeremy
  Cc: konrad

On 06/12/2015 12:39 PM, Martin Pohlack wrote:
> On 15.05.2015 21:44, Konrad Rzeszutek Wilk wrote:
> [...]
>> ## Hypercalls
>>
>> We will employ the sub operations of the system management hypercall (sysctl).
>> There are to be four sub-operations:
>>
>>   * upload the payloads.
>>   * listing of payloads summary uploaded and their state.
>>   * getting an particular payload summary and its state.
>>   * command to apply, delete, or revert the payload.
>>
>> The patching is asynchronous therefore the caller is responsible
>> to verify that it has been applied properly by retrieving the summary of it
>> and verifying that there are no error codes associated with the payload.
>>
>> We **MUST** make it asynchronous due to the nature of patching: it requires
>> every physical CPU to be lock-step with each other. The patching mechanism
>> while an implementation detail, is not an short operation and as such
>> the design **MUST** assume it will be an long-running operation.
>
> I am not convinced yet, that you need an asynchronous approach here.
>
> The experience from our prototype suggests that hotpatching itself is
> not an expensive operation.  It can usually be completed well below 1ms
> with the most expensive part being getting the hypervisor to a quiet state.
>

FWIW, my current implementation (which is almost certainly not optimal) 
tested on a 72 CPU machine takes about 3ms, whether idle or fully loaded.

-- 
Ross Lagerwall


* Re: [RFC v2] xSplice design
  2015-10-27 12:05   ` Ross Lagerwall
@ 2015-10-29 16:55     ` Ross Lagerwall
  2015-10-30 10:39       ` Martin Pohlack
  0 siblings, 1 reply; 37+ messages in thread
From: Ross Lagerwall @ 2015-10-29 16:55 UTC (permalink / raw)
  To: Martin Pohlack, Konrad Rzeszutek Wilk, msw, aliguori,
	Antony Messerli, Rick Harris, Paul Voccio, Steven Wilson,
	Major Hayden, Josh Kearney, jinsong.liu, xiantao.zxt,
	boris.ostrovsky, Daniel Kiper, Elena Ufimtseva, bob.liu,
	lars.kurth, hanweidong, peter.huangpeng, fanhenglong,
	liuyingdong, john.liuqiming, xen-devel, jbeulich, andrew.cooper3,
	jeremy
  Cc: konrad

On 10/27/2015 12:05 PM, Ross Lagerwall wrote:
> On 06/12/2015 12:39 PM, Martin Pohlack wrote:
>> On 15.05.2015 21:44, Konrad Rzeszutek Wilk wrote:
>> [...]
>>> ## Hypercalls
>>>
>>> We will employ the sub operations of the system management hypercall
>>> (sysctl).
>>> There are to be four sub-operations:
>>>
>>>   * upload the payloads.
>>>   * listing of payloads summary uploaded and their state.
>>>   * getting an particular payload summary and its state.
>>>   * command to apply, delete, or revert the payload.
>>>
>>> The patching is asynchronous therefore the caller is responsible
>>> to verify that it has been applied properly by retrieving the summary
>>> of it
>>> and verifying that there are no error codes associated with the payload.
>>>
>>> We **MUST** make it asynchronous due to the nature of patching: it
>>> requires
>>> every physical CPU to be lock-step with each other. The patching
>>> mechanism
>>> while an implementation detail, is not an short operation and as such
>>> the design **MUST** assume it will be an long-running operation.
>>
>> I am not convinced yet, that you need an asynchronous approach here.
>>
>> The experience from our prototype suggests that hotpatching itself is
>> not an expensive operation.  It can usually be completed well below 1ms
>> with the most expensive part being getting the hypervisor to a quiet
>> state.
>>
>
> FWIW, my current implementation (which is almost certainly not optimal)
> tested on a 72 CPU machine takes about 3ms, whether idle or fully loaded.
>

Let me correct that: it takes 60 μs to 100 μs to synchronize and apply 
the patch (on the same hardware) when synchronous console logging is 
turned off.

-- 
Ross Lagerwall

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


* Re: [RFC v2] xSplice design
  2015-10-29 16:55     ` Ross Lagerwall
@ 2015-10-30 10:39       ` Martin Pohlack
  2015-10-30 14:03         ` Ross Lagerwall
  0 siblings, 1 reply; 37+ messages in thread
From: Martin Pohlack @ 2015-10-30 10:39 UTC (permalink / raw)
  To: Ross Lagerwall, Konrad Rzeszutek Wilk, msw, aliguori,
	Antony Messerli, Rick Harris, Paul Voccio, Steven Wilson,
	Major Hayden, Josh Kearney, jinsong.liu, xiantao.zxt,
	boris.ostrovsky, Daniel Kiper, Elena Ufimtseva, bob.liu,
	lars.kurth, hanweidong, peter.huangpeng, fanhenglong,
	liuyingdong, john.liuqiming, xen-devel, jbeulich, andrew.cooper3,
	jeremy
  Cc: konrad

On 29.10.2015 17:55, Ross Lagerwall wrote:
> On 10/27/2015 12:05 PM, Ross Lagerwall wrote:
>> On 06/12/2015 12:39 PM, Martin Pohlack wrote:
>>> On 15.05.2015 21:44, Konrad Rzeszutek Wilk wrote:
>>> [...]
>>>> ## Hypercalls
>>>>
>>>> We will employ the sub operations of the system management hypercall
>>>> (sysctl).
>>>> There are to be four sub-operations:
>>>>
>>>>   * upload the payloads.
>>>>   * listing of payloads summary uploaded and their state.
>>>>   * getting an particular payload summary and its state.
>>>>   * command to apply, delete, or revert the payload.
>>>>
>>>> The patching is asynchronous therefore the caller is responsible
>>>> to verify that it has been applied properly by retrieving the summary
>>>> of it
>>>> and verifying that there are no error codes associated with the payload.
>>>>
>>>> We **MUST** make it asynchronous due to the nature of patching: it
>>>> requires
>>>> every physical CPU to be lock-step with each other. The patching
>>>> mechanism
>>>> while an implementation detail, is not an short operation and as such
>>>> the design **MUST** assume it will be an long-running operation.
>>>
>>> I am not convinced yet, that you need an asynchronous approach here.
>>>
>>> The experience from our prototype suggests that hotpatching itself is
>>> not an expensive operation.  It can usually be completed well below 1ms
>>> with the most expensive part being getting the hypervisor to a quiet
>>> state.
>>>
>>
>> FWIW, my current implementation (which is almost certainly not optimal)
>> tested on a 72 CPU machine takes about 3ms, whether idle or fully loaded.
>>
> 
> Let me correct that: it takes 60 μs to 100 μs to synchronize and apply 
> the patch (on the same hardware) when synchronous console logging is 
> turned off.

The interesting (and very rare) case is if other CPUs are already busy
in Xen, for example, with memory scrubbing or other long-running
activities.  Those are hard to interrupt and would delay patching activity.

Having multiple guests in a reboot-loop / being restarted all the time
might help triggering this case.

Martin
Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
Ust-ID: DE289237879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B


* Re: [RFC v2] xSplice design
  2015-10-30 10:39       ` Martin Pohlack
@ 2015-10-30 14:03         ` Ross Lagerwall
  2015-10-30 14:06           ` Martin Pohlack
  0 siblings, 1 reply; 37+ messages in thread
From: Ross Lagerwall @ 2015-10-30 14:03 UTC (permalink / raw)
  To: Martin Pohlack, Konrad Rzeszutek Wilk, msw, aliguori,
	Antony Messerli, Rick Harris, Paul Voccio, Steven Wilson,
	Major Hayden, Josh Kearney, jinsong.liu, xiantao.zxt,
	boris.ostrovsky, Daniel Kiper, Elena Ufimtseva, bob.liu,
	lars.kurth, hanweidong, peter.huangpeng, fanhenglong,
	liuyingdong, john.liuqiming, xen-devel, jbeulich, andrew.cooper3,
	jeremy
  Cc: konrad

On 10/30/2015 10:39 AM, Martin Pohlack wrote:
> On 29.10.2015 17:55, Ross Lagerwall wrote:
>> On 10/27/2015 12:05 PM, Ross Lagerwall wrote:
>>> On 06/12/2015 12:39 PM, Martin Pohlack wrote:
>>>> On 15.05.2015 21:44, Konrad Rzeszutek Wilk wrote:
>>>> [...]
>>>>> ## Hypercalls
>>>>>
>>>>> We will employ the sub operations of the system management hypercall
>>>>> (sysctl).
>>>>> There are to be four sub-operations:
>>>>>
>>>>>    * upload the payloads.
>>>>>    * listing of payloads summary uploaded and their state.
>>>>>    * getting an particular payload summary and its state.
>>>>>    * command to apply, delete, or revert the payload.
>>>>>
>>>>> The patching is asynchronous therefore the caller is responsible
>>>>> to verify that it has been applied properly by retrieving the summary
>>>>> of it
>>>>> and verifying that there are no error codes associated with the payload.
>>>>>
>>>>> We **MUST** make it asynchronous due to the nature of patching: it
>>>>> requires
>>>>> every physical CPU to be lock-step with each other. The patching
>>>>> mechanism
>>>>> while an implementation detail, is not an short operation and as such
>>>>> the design **MUST** assume it will be an long-running operation.
>>>>
>>>> I am not convinced yet, that you need an asynchronous approach here.
>>>>
>>>> The experience from our prototype suggests that hotpatching itself is
>>>> not an expensive operation.  It can usually be completed well below 1ms
>>>> with the most expensive part being getting the hypervisor to a quiet
>>>> state.
>>>>
>>>
>>> FWIW, my current implementation (which is almost certainly not optimal)
>>> tested on a 72 CPU machine takes about 3ms, whether idle or fully loaded.
>>>
>>
>> Let me correct that: it takes 60 μs to 100 μs to synchronize and apply
>> the patch (on the same hardware) when synchronous console logging is
>> turned off.
>
> The interesting (and very rare) case is if other CPUs are busy in Xen
> already, for example, with memory scrubbing or other long-running
> activities.  Those are hard to interrupt and delay patching activity.
>
> Having multiple guests in a reboot-loop / being restarted all the time
> might help triggering this case.
>

I have been able to trigger this, which is why I put a (currently 
hard-coded) 10ms timeout in the synchronization code; otherwise it gives 
up and returns an error. The user could then optionally retry it at a 
later point.

-- 
Ross Lagerwall



* Re: [RFC v2] xSplice design
  2015-10-30 14:03         ` Ross Lagerwall
@ 2015-10-30 14:06           ` Martin Pohlack
  0 siblings, 0 replies; 37+ messages in thread
From: Martin Pohlack @ 2015-10-30 14:06 UTC (permalink / raw)
  To: Ross Lagerwall, Konrad Rzeszutek Wilk, msw, aliguori,
	Antony Messerli, Rick Harris, Paul Voccio, Steven Wilson,
	Major Hayden, Josh Kearney, jinsong.liu, xiantao.zxt,
	boris.ostrovsky, Daniel Kiper, Elena Ufimtseva, bob.liu,
	lars.kurth, hanweidong, peter.huangpeng, fanhenglong,
	liuyingdong, john.liuqiming, xen-devel, jbeulich, andrew.cooper3,
	jeremy
  Cc: konrad

On 30.10.2015 15:03, Ross Lagerwall wrote:
> On 10/30/2015 10:39 AM, Martin Pohlack wrote:
>> On 29.10.2015 17:55, Ross Lagerwall wrote:
>>> On 10/27/2015 12:05 PM, Ross Lagerwall wrote:
>>>> On 06/12/2015 12:39 PM, Martin Pohlack wrote:
>>>>> On 15.05.2015 21:44, Konrad Rzeszutek Wilk wrote:
>>>>> [...]
>>>>>> ## Hypercalls
>>>>>>
>>>>>> We will employ the sub operations of the system management hypercall
>>>>>> (sysctl).
>>>>>> There are to be four sub-operations:
>>>>>>
>>>>>>    * upload the payloads.
>>>>>>    * listing of payloads summary uploaded and their state.
>>>>>>    * getting an particular payload summary and its state.
>>>>>>    * command to apply, delete, or revert the payload.
>>>>>>
>>>>>> The patching is asynchronous therefore the caller is responsible
>>>>>> to verify that it has been applied properly by retrieving the summary
>>>>>> of it
>>>>>> and verifying that there are no error codes associated with the payload.
>>>>>>
>>>>>> We **MUST** make it asynchronous due to the nature of patching: it
>>>>>> requires
>>>>>> every physical CPU to be lock-step with each other. The patching
>>>>>> mechanism
>>>>>> while an implementation detail, is not an short operation and as such
>>>>>> the design **MUST** assume it will be an long-running operation.
>>>>>
>>>>> I am not convinced yet, that you need an asynchronous approach here.
>>>>>
>>>>> The experience from our prototype suggests that hotpatching itself is
>>>>> not an expensive operation.  It can usually be completed well below 1ms
>>>>> with the most expensive part being getting the hypervisor to a quiet
>>>>> state.
>>>>>
>>>>
>>>> FWIW, my current implementation (which is almost certainly not optimal)
>>>> tested on a 72 CPU machine takes about 3ms, whether idle or fully loaded.
>>>>
>>>
>>> Let me correct that: it takes 60 μs to 100 μs to synchronize and apply
>>> the patch (on the same hardware) when synchronous console logging is
>>> turned off.
>>
>> The interesting (and very rare) case is if other CPUs are busy in Xen
>> already, for example, with memory scrubbing or other long-running
>> activities.  Those are hard to interrupt and delay patching activity.
>>
>> Having multiple guests in a reboot-loop / being restarted all the time
>> might help triggering this case.
>>
> 
> I have been able to trigger this which is why I put in a (currently 
> hard-coded) 10ms timeout in the synchronization code otherwise it gives 
> up and returns an error. It could then be optionally retried by the user 
> at a later point.

If you ever want to run this in QEMU etc., you need to account for the
scheduling timeslice of the host system.  I found it necessary to work
with 20 ms for that specific case.

Martin



end of thread, other threads:[~2015-10-30 14:07 UTC | newest]

Thread overview: 37+ messages
2015-05-15 19:44 [RFC v2] xSplice design Konrad Rzeszutek Wilk
2015-05-18 12:41 ` Jan Beulich
2015-06-05 14:49   ` Konrad Rzeszutek Wilk
2015-06-05 15:16     ` Jan Beulich
2015-06-05 16:00       ` Konrad Rzeszutek Wilk
2015-06-05 16:14         ` Jan Beulich
2015-05-18 12:54 ` Liuqiming (John)
2015-05-18 13:11   ` Daniel Kiper
2015-06-05 14:50   ` Konrad Rzeszutek Wilk
2015-05-19 19:13 ` Lars Kurth
2015-05-20 15:11 ` Martin Pohlack
2015-06-05 15:00   ` Konrad Rzeszutek Wilk
2015-06-05 15:15     ` Andrew Cooper
2015-06-05 15:27     ` Jan Beulich
2015-06-08  8:34       ` Martin Pohlack
2015-06-08  8:51         ` Jan Beulich
2015-06-08 14:38     ` Martin Pohlack
2015-06-08 15:19       ` Konrad Rzeszutek Wilk
2015-06-12 11:51         ` Martin Pohlack
2015-06-12 14:06           ` Konrad Rzeszutek Wilk
2015-06-12 11:39 ` Martin Pohlack
2015-06-12 14:03   ` Konrad Rzeszutek Wilk
2015-06-12 14:31     ` Martin Pohlack
2015-06-12 14:43       ` Jan Beulich
2015-06-12 17:31         ` Martin Pohlack
2015-06-12 18:46           ` Konrad Rzeszutek Wilk
2015-06-12 16:09       ` Konrad Rzeszutek Wilk
2015-06-12 16:17         ` Andrew Cooper
2015-06-12 16:39           ` Konrad Rzeszutek Wilk
2015-06-12 18:36             ` Martin Pohlack
2015-06-12 18:51               ` Konrad Rzeszutek Wilk
2015-07-06 19:36         ` Konrad Rzeszutek Wilk
2015-10-27 12:05   ` Ross Lagerwall
2015-10-29 16:55     ` Ross Lagerwall
2015-10-30 10:39       ` Martin Pohlack
2015-10-30 14:03         ` Ross Lagerwall
2015-10-30 14:06           ` Martin Pohlack
