Diffstat (limited to 'docs/system/i386')
-rw-r--r-- docs/system/i386/amd-memory-encryption.rst | 206
-rw-r--r-- docs/system/i386/hyperv.rst                | 288
-rw-r--r-- docs/system/i386/kvm-pv.rst                | 100
-rw-r--r-- docs/system/i386/sgx.rst                   | 188
-rw-r--r-- docs/system/i386/xen.rst                   | 144
5 files changed, 926 insertions, 0 deletions
diff --git a/docs/system/i386/amd-memory-encryption.rst b/docs/system/i386/amd-memory-encryption.rst
new file mode 100644
index 0000000000..e9bc142bc1
--- /dev/null
+++ b/docs/system/i386/amd-memory-encryption.rst
@@ -0,0 +1,206 @@

AMD Secure Encrypted Virtualization (SEV)
=========================================

Secure Encrypted Virtualization (SEV) is a feature found on AMD processors.

SEV is an extension to the AMD-V architecture which supports running encrypted
virtual machines (VMs) under the control of KVM. Encrypted VMs have their pages
(code and data) secured such that only the guest itself has access to the
unencrypted version. Each encrypted VM is associated with a unique encryption
key; if its data is accessed by a different entity using a different key, the
encrypted guest's data will be incorrectly decrypted, leading to unintelligible
data.

Key management for this feature is handled by a separate processor known as the
AMD Secure Processor (AMD-SP), which is present in AMD SoCs. Firmware running
inside the AMD-SP provides commands to support a common VM lifecycle. This
includes commands for launching, snapshotting, migrating and debugging the
encrypted guest. These SEV commands can be issued via KVM_MEMORY_ENCRYPT_OP
ioctls.

Secure Encrypted Virtualization - Encrypted State (SEV-ES) builds on the SEV
support to additionally protect the guest register state. In order to allow a
hypervisor to perform functions on behalf of a guest, there is architectural
support for notifying a guest's operating system when certain types of VMEXITs
are about to occur. This allows the guest to selectively share information with
the hypervisor to satisfy the requested function.

Launching
---------

Boot images (such as the BIOS) must be encrypted before a guest can be booted.
The ``MEMORY_ENCRYPT_OP`` ioctl provides commands to encrypt the images:
``LAUNCH_START``, ``LAUNCH_UPDATE_DATA``, ``LAUNCH_MEASURE`` and
``LAUNCH_FINISH``. These four commands together generate a fresh memory
encryption key for the VM, encrypt the boot images and provide a measurement
that can be used as an attestation of a successful launch.

For a SEV-ES guest, the ``LAUNCH_UPDATE_VMSA`` command is also used to encrypt
the guest register state, or VM save area (VMSA), for all of the guest vCPUs.

``LAUNCH_START`` is called first to create a cryptographic launch context
within the firmware. To create this context, the guest owner must provide a
guest policy, its public Diffie-Hellman key (PDH) and session parameters.
These inputs should be treated as a binary blob and must be passed as-is to
the SEV firmware.

The guest policy is passed as plaintext. A hypervisor may choose to read it,
but should not modify it (any modification of the policy bits will result in a
bad measurement). The guest policy is a 4-byte data structure containing
several flags that restrict what can be done on a running SEV guest.
See SEV API Spec ([SEVAPI]_) sections 3 and 6.2 for more details.

The guest policy can be provided via the ``policy`` property::

  # ${QEMU} \
     sev-guest,id=sev0,policy=0x1...\

Setting the "SEV-ES required" policy bit (bit 2) will launch the guest as a
SEV-ES guest::

  # ${QEMU} \
     sev-guest,id=sev0,policy=0x5...\
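For orientation, the low-order policy flag bits defined in section 3 of the
SEV API spec can be decoded as in the following sketch (illustrative Python,
not part of QEMU; verify the bit names against the spec revision you target)::

   # Illustrative sketch: the low SEV guest policy bits per SEV API
   # spec section 3.
   POLICY_BITS = {
       0: "NODBG  - debugging of the guest is disallowed",
       1: "NOKS   - sharing keys with other guests is disallowed",
       2: "ES     - the guest must be launched as SEV-ES",
       3: "NOSEND - the guest may not be sent to another platform",
       4: "DOMAIN - the guest may not leave the domain",
       5: "SEV    - the guest may only be sent to SEV-capable platforms",
   }

   def decode_policy(policy):
       return [text for bit, text in POLICY_BITS.items()
               if policy & (1 << bit)]

   print(decode_policy(0x5))   # the SEV-ES example above: NODBG + ES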
The guest owner provided DH certificate and session parameters will be used to
establish a cryptographic session with the guest owner to negotiate keys used
for the attestation.

The DH certificate and session blob can be provided via the ``dh-cert-file``
and ``session-file`` properties::

  # ${QEMU} \
       sev-guest,id=sev0,dh-cert-file=<file1>,session-file=<file2>

``LAUNCH_UPDATE_DATA`` encrypts the memory region using the cryptographic
context created via the ``LAUNCH_START`` command. If required, this command can
be called multiple times to encrypt different memory regions. The command also
calculates the measurement of the memory contents as it encrypts.

``LAUNCH_UPDATE_VMSA`` encrypts all the vCPU VMSAs for a SEV-ES guest using the
cryptographic context created via the ``LAUNCH_START`` command. The command
also calculates the measurement of the VMSAs as it encrypts them.

``LAUNCH_MEASURE`` can be used to retrieve the measurement of encrypted memory
and, for a SEV-ES guest, encrypted VMSAs. This measurement is a signature of
the memory contents and, for a SEV-ES guest, the VMSA contents, that can be
sent to the guest owner as an attestation that the memory and VMSAs were
encrypted correctly by the firmware. The guest owner may wait to provide the
guest confidential information until it can verify the attestation measurement.
Since the guest owner knows the initial contents of the guest at boot, the
attestation measurement can be verified by comparing it to what the guest owner
expects.

``LAUNCH_FINISH`` finalizes the guest launch and destroys the cryptographic
context.

See the SEV API Spec ([SEVAPI]_) 'Launching a guest' usage flow (Appendix A)
for the complete flow chart.

To launch a SEV guest::

  # ${QEMU} \
      -machine ...,confidential-guest-support=sev0 \
      -object sev-guest,id=sev0,cbitpos=47,reduced-phys-bits=1

To launch a SEV-ES guest::

  # ${QEMU} \
      -machine ...,confidential-guest-support=sev0 \
      -object sev-guest,id=sev0,cbitpos=47,reduced-phys-bits=1,policy=0x5

A SEV-ES guest has some restrictions compared to a SEV guest. Because the
guest register state is encrypted and cannot be updated by the VMM/hypervisor,
a SEV-ES guest:

 - Does not support SMM - SMM support requires updating the guest register
   state.
 - Does not support reboot - a system reset requires updating the guest
   register state.
 - Requires an in-kernel irqchip - the burden is placed on the hypervisor to
   manage booting APs.

Calculating expected guest launch measurement
---------------------------------------------

In order to verify the guest launch measurement, the guest owner must compute
it in exactly the same way as it is calculated by the AMD-SP. SEV API Spec
([SEVAPI]_) section 6.5.1 describes the AMD-SP operations:

    GCTX.LD is finalized, producing the hash digest of all plaintext data
    imported into the guest.

    The launch measurement is calculated as:

    HMAC(0x04 || API_MAJOR || API_MINOR || BUILD || GCTX.POLICY || GCTX.LD || MNONCE; GCTX.TIK)

    where "||" represents concatenation.

The values of API_MAJOR, API_MINOR, BUILD, and GCTX.POLICY can be obtained
from the ``query-sev`` QMP command.

The value of MNONCE is part of the response of ``query-sev-launch-measure``: it
is the last 16 bytes of the base64-decoded data field (see SEV API Spec
([SEVAPI]_) section 6.5.2 Table 52: LAUNCH_MEASURE Measurement Buffer).

The value of GCTX.LD is
``SHA256(firmware_blob || kernel_hashes_blob || vmsas_blob)``, where:

* ``firmware_blob`` is the content of the entire firmware flash file (for
  example, ``OVMF.fd``). Note that you must build a stateless firmware file
  which doesn't use an NVRAM store, because the NVRAM area is not measured, and
  it is therefore not secure to use a firmware which uses state from an NVRAM
  store.
* if a kernel is used, and ``kernel-hashes=on``, then ``kernel_hashes_blob`` is
  the content of PaddedSevHashTable (including the zero padding), which itself
  includes the hashes of the kernel, initrd, and cmdline that are passed to the
  guest. The PaddedSevHashTable struct is defined in ``target/i386/sev.c``.
* if SEV-ES is enabled (``policy & 0x4 != 0``), ``vmsas_blob`` is the
  concatenation of all VMSAs of the guest vCPUs. Each VMSA is 4096 bytes long;
  its content is defined inside Linux kernel code as ``struct vmcb_save_area``,
  or in AMD APM Volume 2 ([APMVOL2]_) Table B-2: VMCB Layout, State Save Area.

If kernel hashes are not used, or SEV-ES is disabled, use empty blobs for
``kernel_hashes_blob`` and ``vmsas_blob`` as needed.
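The check itself is straightforward to script. The following is a sketch of
the guest-owner side, assuming the guest owner already holds GCTX.TIK from the
launch session and has fetched the other values via QMP (illustrative Python,
not part of QEMU)::

   import base64, hashlib, hmac

   def expected_measurement(api_major, api_minor, build, policy, tik,
                            mnonce, firmware_blob,
                            kernel_hashes_blob=b"", vmsas_blob=b""):
       # GCTX.LD = SHA256(firmware_blob || kernel_hashes_blob || vmsas_blob)
       ld = hashlib.sha256(firmware_blob + kernel_hashes_blob
                           + vmsas_blob).digest()
       # HMAC(0x04 || API_MAJOR || API_MINOR || BUILD || GCTX.POLICY ||
       #      GCTX.LD || MNONCE; GCTX.TIK), policy as 4 bytes little-endian
       msg = (bytes([0x04, api_major, api_minor, build])
              + policy.to_bytes(4, "little") + ld + mnonce)
       return hmac.new(tik, msg, hashlib.sha256).digest()

   def split_measurement(data_b64):
       # The 'data' field of query-sev-launch-measure decodes to 48
       # bytes: 32 bytes of measurement followed by the 16-byte MNONCE.
       blob = base64.b64decode(data_b64)
       return blob[:32], blob[32:48]

A launch is then considered good when the measurement extracted from the QMP
response equals the value returned by ``expected_measurement()``.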
Debugging
---------

Since the memory contents of a SEV guest are encrypted, hypervisor access to
the guest memory will return ciphertext. If the guest policy allows debugging,
then a hypervisor can use the ``DEBUG_DECRYPT`` and ``DEBUG_ENCRYPT`` commands
to access the guest memory region for debug purposes. This is not supported in
QEMU yet.

Snapshot/Restore
----------------

TODO

Live Migration
--------------

TODO

References
----------

`AMD Memory Encryption whitepaper
<https://www.amd.com/content/dam/amd/en/documents/epyc-business-docs/white-papers/memory-encryption-white-paper.pdf>`_

.. [SEVAPI] `Secure Encrypted Virtualization API
   <https://www.amd.com/system/files/TechDocs/55766_SEV-KM_API_Specification.pdf>`_

.. [APMVOL2] `AMD64 Architecture Programmer's Manual Volume 2: System Programming
   <https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/24593.pdf>`_

KVM Forum slides:

* `AMD’s Virtualization Memory Encryption (2016)
  <http://www.linux-kvm.org/images/7/74/02x08A-Thomas_Lendacky-AMDs_Virtualizatoin_Memory_Encryption_Technology.pdf>`_
* `Extending Secure Encrypted Virtualization With SEV-ES (2018)
  <https://www.linux-kvm.org/images/9/94/Extending-Secure-Encrypted-Virtualization-with-SEV-ES-Thomas-Lendacky-AMD.pdf>`_

`AMD64 Architecture Programmer's Manual
<https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/24593.pdf>`_:

* SME is section 7.10
* SEV is section 15.34
* SEV-ES is section 15.35

diff --git a/docs/system/i386/hyperv.rst b/docs/system/i386/hyperv.rst
new file mode 100644
index 0000000000..2505dc4c86
--- /dev/null
+++ b/docs/system/i386/hyperv.rst
@@ -0,0 +1,288 @@

Hyper-V Enlightenments
======================


Description
-----------

Implementing a hardware interface in software can be slow, so in some cases
KVM implements its own paravirtualized interfaces. This works well for Linux,
as guest support for such features is added simultaneously with the features
themselves. It may, however, be hard or even impossible to add support for
these interfaces to proprietary OSes, namely Microsoft Windows.

KVM on x86 implements Hyper-V Enlightenments for Windows guests. These features
make Windows and Hyper-V guests think they're running on top of a Hyper-V
compatible hypervisor and use Hyper-V specific features.


Setup
-----

No Hyper-V enlightenments are enabled by default by either KVM or QEMU. In
QEMU, individual enlightenments can be enabled through CPU flags, e.g.:

.. parsed-literal::

    |qemu_system| --enable-kvm --cpu host,hv_relaxed,hv_vpindex,hv_time, ...
Sometimes there are dependencies between enlightenments; QEMU is supposed to
check that the supplied configuration is sane.

When any set of the Hyper-V enlightenments is enabled, QEMU changes hypervisor
identification (CPUID 0x40000000..0x4000000A) to Hyper-V. KVM identification
and features are kept in leaves 0x40000100..0x40000101.
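The identification change can be observed from inside a Linux guest through
the kernel's cpuid driver, which maps the file offset of ``/dev/cpu/*/cpuid``
to the input EAX value. A sketch (illustrative Python, not part of QEMU; run
as root in the guest, requires a kernel with CONFIG_X86_CPUID)::

   import os, struct

   def cpuid(leaf):
       # /dev/cpu/0/cpuid: the file offset selects EAX; 16 bytes come
       # back as EAX, EBX, ECX, EDX (Linux cpuid driver).
       with open("/dev/cpu/0/cpuid", "rb") as f:
           return struct.unpack("<4I", os.pread(f.fileno(), 16, leaf))

   eax, ebx, ecx, edx = cpuid(0x40000000)
   # With Hyper-V enlightenments enabled this spells "Microsoft Hv"
   # (unless hv-vendor-id overrides it); a plain KVM guest sees
   # "KVMKVMKVM\0\0\0" here instead.
   print(struct.pack("<3I", ebx, ecx, edx).decode("ascii", "replace"))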
Existing enlightenments
-----------------------

``hv-relaxed``
  This feature tells the guest OS to disable watchdog timeouts as it is running
  on a hypervisor. It is known that some Windows versions will do this even
  when they see the 'hypervisor' CPU flag.

``hv-vapic``
  Provides the so-called VP Assist page MSR to the guest, allowing it to work
  with the APIC more efficiently. In particular, this enlightenment allows
  paravirtualized (exit-less) EOI processing.

``hv-spinlocks`` = xxx
  Enables paravirtualized spinlocks. The parameter indicates how many times
  spinlock acquisition should be attempted before indicating the situation to
  the hypervisor. A special value 0xffffffff indicates "never notify".

``hv-vpindex``
  Provides the HV_X64_MSR_VP_INDEX (0x40000002) MSR to the guest, which carries
  the Virtual Processor index. This enlightenment makes sense in conjunction
  with hv-synic, hv-stimer and other enlightenments which require the guest to
  know its Virtual Processor indices (e.g. when a VP index needs to be passed
  in a hypercall).

``hv-runtime``
  Provides the HV_X64_MSR_VP_RUNTIME (0x40000010) MSR to the guest. The MSR
  keeps the virtual processor run time in 100ns units. This gives the guest
  operating system an idea of how much time was 'stolen' from it (when the
  virtual CPU was preempted to perform some other work).

``hv-crash``
  Provides the HV_X64_MSR_CRASH_P0..HV_X64_MSR_CRASH_P4
  (0x40000100..0x40000104) and HV_X64_MSR_CRASH_CTL (0x40000105) MSRs to the
  guest. These MSRs are written to by the guest when it crashes; the
  HV_X64_MSR_CRASH_P0..HV_X64_MSR_CRASH_P4 MSRs contain additional crash
  information. This information is output in the QEMU log and through QAPI.
  Note: unlike under genuine Hyper-V, a write to HV_X64_MSR_CRASH_CTL causes
  the guest to shut down. This effectively blocks crash dump generation by
  Windows.

``hv-time``
  Enables two Hyper-V-specific clocksources available to the guest: the
  MSR-based Hyper-V clocksource (HV_X64_MSR_TIME_REF_COUNT, 0x40000020) and the
  Reference TSC page (enabled via MSR HV_X64_MSR_REFERENCE_TSC, 0x40000021).
  Both clocksources are per-guest; the Reference TSC page clocksource allows
  for exit-less time stamp readings. Using this enlightenment leads to a
  significant speedup of all timestamp related operations.

``hv-synic``
  Enables the Hyper-V Synthetic interrupt controller - an extension of a local
  APIC. When enabled, this enlightenment provides additional communication
  facilities to the guest: SynIC messages and Events. This is a prerequisite
  for implementing VMBus devices (not yet in QEMU). Additionally, this
  enlightenment is needed to enable Hyper-V synthetic timers. SynIC is
  controlled through MSRs HV_X64_MSR_SCONTROL..HV_X64_MSR_EOM
  (0x40000080..0x40000084) and HV_X64_MSR_SINT0..HV_X64_MSR_SINT15
  (0x40000090..0x4000009F).

  Requires: ``hv-vpindex``

``hv-stimer``
  Enables Hyper-V synthetic timers. There are four synthetic timers per virtual
  CPU, controlled through the
  HV_X64_MSR_STIMER0_CONFIG..HV_X64_MSR_STIMER3_COUNT (0x400000B0..0x400000B7)
  MSRs. These timers can work either in single-shot or periodic mode. It is
  known that certain Windows versions revert to using the HPET (or even the RTC
  when the HPET is unavailable) extensively when this enlightenment is not
  provided; this can lead to significant CPU consumption, even when the virtual
  CPU is idle.

  Requires: ``hv-vpindex``, ``hv-synic``, ``hv-time``
``hv-tlbflush``
  Enables the paravirtualized TLB shoot-down mechanism. On x86, a remote TLB
  flush requires sending IPIs and waiting for the other CPUs to perform the
  local TLB flush. In a virtualized environment, some virtual CPUs may not even
  be scheduled at the time of the call and may not require flushing (or the
  flush may be postponed until the virtual CPU is scheduled). The hv-tlbflush
  enlightenment implements the TLB shoot-down through the hypervisor, enabling
  this optimization.

  Requires: ``hv-vpindex``

``hv-ipi``
  Enables the paravirtualized IPI send mechanism. The
  HvCallSendSyntheticClusterIpi hypercall may target more than 64 virtual CPUs
  simultaneously, while doing the same through the APIC requires more than one
  access (and thus an exit to the hypervisor).

  Requires: ``hv-vpindex``

``hv-vendor-id`` = xxx
  This changes the Hyper-V identification in CPUID 0x40000000.EBX-EDX from the
  default "Microsoft Hv". The parameter should be no longer than 12 characters.
  According to the specification, guests shouldn't use this information and it
  is unknown whether there is a Windows version which acts differently.
  Note: hv-vendor-id is not an enlightenment and thus doesn't enable Hyper-V
  identification when specified without some other enlightenment.

``hv-reset``
  Provides the HV_X64_MSR_RESET (0x40000003) MSR to the guest, allowing it to
  reset itself by writing to it. Even when this MSR is enabled, it is not a
  recommended way for Windows to perform a system reboot and thus it may not
  be used.

``hv-frequencies``
  Provides HV_X64_MSR_TSC_FREQUENCY (0x40000022) and HV_X64_MSR_APIC_FREQUENCY
  (0x40000023), allowing the guest to get its TSC/APIC frequencies without
  doing measurements.

``hv-reenlightenment``
  This enlightenment is nested-specific: it targets Hyper-V on KVM guests. When
  enabled, it provides the HV_X64_MSR_REENLIGHTENMENT_CONTROL (0x40000106),
  HV_X64_MSR_TSC_EMULATION_CONTROL (0x40000107) and
  HV_X64_MSR_TSC_EMULATION_STATUS (0x40000108) MSRs, allowing the guest to get
  notified when the TSC frequency changes (this only happens on migration) and
  to keep using the old frequency (through emulation in the hypervisor) until
  it is ready to switch to the new one. This, in conjunction with
  ``hv-frequencies``, allows Hyper-V on KVM to pass a stable clocksource
  (Reference TSC page) to its own guests.

  Note: KVM doesn't fully support re-enlightenment notifications and doesn't
  emulate TSC accesses after migration, so the 'tsc-frequency=' CPU option also
  has to be specified to make migration succeed. The destination host has to
  either have the same TSC frequency or support the TSC scaling CPU feature.

  Recommended: ``hv-frequencies``

``hv-evmcs``
  This enlightenment is nested-specific: it targets Hyper-V on KVM guests. When
  enabled, it provides the Enlightened VMCS version 1 feature to the guest. The
  feature implements a paravirtualized protocol between the L0 (KVM) and L1
  (Hyper-V) hypervisors, making L2 exits to the hypervisor faster. The feature
  is Intel-only.

  Note: some virtualization features (e.g. Posted Interrupts) are disabled when
  hv-evmcs is enabled. It may make sense to measure your nested workload with
  and without the feature to find out if enabling it is beneficial.

  Requires: ``hv-vapic``

``hv-stimer-direct``
  The Hyper-V specification allows synthetic timer operation in two modes:
  "classic", where the expiration event is delivered as a SynIC message, and
  "direct", where the event is delivered via a normal interrupt. It is known
  that nested Hyper-V can only use synthetic timers in direct mode and thus
  ``hv-stimer-direct`` needs to be enabled.

  Requires: ``hv-vpindex``, ``hv-synic``, ``hv-time``, ``hv-stimer``

``hv-avic`` (``hv-apicv``)
  This enlightenment makes it possible to use Hyper-V SynIC with hardware
  APICv/AVIC enabled. Normally, Hyper-V SynIC disables these hardware features
  and suggests that the guest use the paravirtualized AutoEOI feature.
  Note: enabling this feature on old hardware (without APICv/AVIC support) may
  have a negative effect on the guest's performance.

``hv-no-nonarch-coresharing`` = on/off/auto
  This enlightenment tells the guest OS that virtual processors will never
  share a physical core unless they are reported as sibling SMT threads. This
  information is required by Windows and Hyper-V guests to properly mitigate
  SMT-related CPU vulnerabilities.

  When the option is set to 'auto', QEMU will enable the feature only when KVM
  reports that non-architectural coresharing is impossible; this means that
  hyper-threading is not supported or is completely disabled on the host. This
  setting also prevents migration, as SMT settings on the destination may
  differ. When the option is set to 'on', QEMU will always enable the feature,
  regardless of host setup. To keep guests secure, this can only be used in
  conjunction with exposing the correct vCPU topology and vCPU pinning.

``hv-version-id-build``, ``hv-version-id-major``, ``hv-version-id-minor``, ``hv-version-id-spack``, ``hv-version-id-sbranch``, ``hv-version-id-snumber``
  These change the Hyper-V version identification in CPUID 0x40000002.EAX-EDX
  from the default (WS2016).

  - ``hv-version-id-build`` sets 'Build Number' (32 bits)
  - ``hv-version-id-major`` sets 'Major Version' (16 bits)
  - ``hv-version-id-minor`` sets 'Minor Version' (16 bits)
  - ``hv-version-id-spack`` sets 'Service Pack' (32 bits)
  - ``hv-version-id-sbranch`` sets 'Service Branch' (8 bits)
  - ``hv-version-id-snumber`` sets 'Service Number' (24 bits)

  Note: hv-version-id-* are not enlightenments and thus don't enable Hyper-V
  identification when specified without any other enlightenments.

``hv-syndbg``
  Enables the Hyper-V synthetic debugger interface. This is a special interface
  used by the Windows Kernel debugger to send packets through, rather than
  sending them via serial/network. When enabled, this enlightenment provides
  additional communication facilities to the guest: SynDbg messages. Using this
  channel instead of serial/network adds a significant performance boost over
  the other communication channels. This enlightenment requires a VMBus device
  (``-device vmbus-bridge,irq=15``).

  Requires: ``hv-relaxed``, ``hv-time``, ``hv-vapic``, ``hv-vpindex``, ``hv-synic``, ``hv-runtime``, ``hv-stimer``
``hv-emsr-bitmap``
  This enlightenment is nested-specific: it targets Hyper-V on KVM guests. When
  enabled, it allows the L0 (KVM) and L1 (Hyper-V) hypervisors to collaborate
  to avoid unnecessary updates to the L2 MSR-Bitmap upon vmexits. While the
  protocol is supported for both VMX (Intel) and SVM (AMD), the VMX
  implementation requires the Enlightened VMCS (``hv-evmcs``) feature to also
  be enabled.

  Recommended: ``hv-evmcs`` (Intel)

``hv-xmm-input``
  The Hyper-V specification allows passing parameters for certain hypercalls in
  XMM registers ("XMM Fast Hypercall Input"). When the feature is in use, it
  allows for faster hypercall processing, as KVM can avoid reading the guest's
  memory.

``hv-tlbflush-ext``
  Allows extended GVA ranges to be passed to the Hyper-V TLB flush hypercalls
  (HvFlushVirtualAddressList/HvFlushVirtualAddressListEx).

  Requires: ``hv-tlbflush``

``hv-tlbflush-direct``
  This enlightenment is nested-specific: it targets Hyper-V on KVM guests. When
  enabled, it allows L0 (KVM) to directly handle TLB flush hypercalls from the
  L2 guest without the need to exit to the L1 (Hyper-V) hypervisor. While the
  feature is supported for both VMX (Intel) and SVM (AMD), the VMX
  implementation requires the Enlightened VMCS (``hv-evmcs``) feature to also
  be enabled.

  Requires: ``hv-vapic``

  Recommended: ``hv-evmcs`` (Intel)

Supplementary features
----------------------

``hv-passthrough``
  In some cases (e.g. during development) it may make sense to use QEMU in
  'pass-through' mode and give Windows guests all the enlightenments currently
  supported by KVM. This pass-through mode is enabled by the "hv-passthrough"
  CPU flag.

  Note: the ``hv-passthrough`` flag only enables enlightenments which are known
  to QEMU (those that have a corresponding 'hv-' flag) and copies the
  ``hv-spinlocks`` and ``hv-vendor-id`` values from KVM to QEMU.
  ``hv-passthrough`` overrides all other 'hv-' settings on the command line.
  Also, enabling this flag effectively prevents migration, as the list of
  enabled enlightenments may differ between the source and destination hosts.

``hv-enforce-cpuid``
  By default, KVM allows the guest to use all currently supported Hyper-V
  enlightenments once the Hyper-V CPUID interface is exposed, regardless of
  whether some features were not announced in the guest-visible CPUID leaves.
  The ``hv-enforce-cpuid`` feature alters this behavior and only allows the
  guest to use the exposed Hyper-V enlightenments.


Useful links
------------
Hyper-V Top Level Functional specification and other information:

- https://github.com/MicrosoftDocs/Virtualization-Documentation
- https://docs.microsoft.com/en-us/virtualization/hyper-v-on-windows/tlfs/tlfs

diff --git a/docs/system/i386/kvm-pv.rst b/docs/system/i386/kvm-pv.rst
new file mode 100644
index 0000000000..1e5a9923ef
--- /dev/null
+++ b/docs/system/i386/kvm-pv.rst
@@ -0,0 +1,100 @@

Paravirtualized KVM features
============================

Description
-----------

Implementing hardware interfaces in software can be slow, so in some cases
``KVM`` implements its own paravirtualized interfaces.

Setup
-----

Paravirtualized ``KVM`` features are represented as CPU flags. The following
features are enabled by default for any CPU model when ``KVM`` acceleration is
enabled:

- ``kvmclock``
- ``kvm-nopiodelay``
- ``kvm-asyncpf``
- ``kvm-steal-time``
- ``kvm-pv-eoi``
- ``kvmclock-stable-bit``

The ``kvm-msi-ext-dest-id`` feature is enabled by default in x2apic mode with
split irqchip (e.g. "-machine ...,kernel-irqchip=split -cpu ...,x2apic").

Note: when the CPU model ``host`` is used, QEMU passes through all supported
paravirtualized ``KVM`` features to the guest.
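Which of these features a running guest actually sees can be checked from
inside the guest via CPUID leaf 0x40000001 (EAX). A sketch, using the same
``/dev/cpu/0/cpuid`` interface described in the Hyper-V documentation; the bit
positions are taken from the kernel's Documentation/virt/kvm/x86/cpuid.rst and
should be re-checked there (illustrative Python, not part of QEMU)::

   import os, struct

   # KVM PV feature bits in CPUID 0x40000001 EAX (see the kernel's
   # Documentation/virt/kvm/x86/cpuid.rst; verify for your kernel).
   KVM_FEATURES = {
       0: "kvmclock (legacy MSRs)", 1: "kvm-nopiodelay",
       2: "kvm-mmu (deprecated)",   3: "kvmclock",
       4: "kvm-asyncpf",            5: "kvm-steal-time",
       6: "kvm-pv-eoi",             7: "kvm-pv-unhalt",
       9: "kvm-pv-tlb-flush",      11: "kvm-pv-ipi",
       12: "kvm-poll-control",     13: "kvm-pv-sched-yield",
       14: "kvm-asyncpf-int",      15: "kvm-msi-ext-dest-id",
       24: "kvmclock-stable-bit",
   }

   with open("/dev/cpu/0/cpuid", "rb") as f:
       eax = struct.unpack("<4I", os.pread(f.fileno(), 16, 0x40000001))[0]

   print([name for bit, name in KVM_FEATURES.items() if eax & (1 << bit)])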
Existing features
-----------------

``kvmclock``
  Expose a ``KVM`` specific paravirtualized clocksource to the guest. Supported
  since Linux v2.6.26.

``kvm-nopiodelay``
  The guest doesn't need to perform delays on PIO operations. Supported since
  Linux v2.6.26.

``kvm-mmu``
  This feature is deprecated.

``kvm-asyncpf``
  Enable the asynchronous page fault mechanism. Supported since Linux v2.6.38.
  Note: since Linux v5.10 the feature is deprecated and not enabled by ``KVM``.
  Use ``kvm-asyncpf-int`` instead.

``kvm-steal-time``
  Enable stolen (when the guest vCPU is not running) time accounting. Supported
  since Linux v3.1.

``kvm-pv-eoi``
  Enable paravirtualized end-of-interrupt signaling. Supported since Linux
  v3.10.

``kvm-pv-unhalt``
  Enable paravirtualized spinlock support. Supported since Linux v3.12.

``kvm-pv-tlb-flush``
  Enable the paravirtualized TLB flush mechanism. Supported since Linux v4.16.

``kvm-pv-ipi``
  Enable the paravirtualized IPI mechanism. Supported since Linux v4.19.

``kvm-poll-control``
  Enable host-side polling on HLT control from the guest. Supported since Linux
  v5.10.

``kvm-pv-sched-yield``
  Enable the paravirtualized sched yield feature. Supported since Linux v5.10.

``kvm-asyncpf-int``
  Enable the interrupt based asynchronous page fault mechanism. Supported since
  Linux v5.10.

``kvm-msi-ext-dest-id``
  Support 'Extended Destination ID' for external interrupts. The feature allows
  using up to 32768 CPUs without IRQ remapping (but other limits may apply,
  making the number of supported vCPUs for a given configuration lower).
  Supported since Linux v5.10.

``kvmclock-stable-bit``
  Tell the guest that the guest-visible TSC value can be fully trusted for
  kvmclock computations and no warps are expected. Supported since Linux
  v2.6.35.

Supplementary features
----------------------

``kvm-pv-enforce-cpuid``
  Limit the supported paravirtualized feature set to the exposed features only.
  Note: by default, ``KVM`` allows the guest to use all currently supported
  paravirtualized features even when they were not announced in guest-visible
  CPUID leaves. Supported since Linux v5.10.


Useful links
------------

Please refer to Documentation/virt/kvm in Linux for additional details.

diff --git a/docs/system/i386/sgx.rst b/docs/system/i386/sgx.rst
new file mode 100644
index 0000000000..ab58b29392
--- /dev/null
+++ b/docs/system/i386/sgx.rst
@@ -0,0 +1,188 @@

Software Guard eXtensions (SGX)
===============================

Overview
--------

Intel Software Guard eXtensions (SGX) is a set of instructions and mechanisms
for controlling memory accesses in order to provide security guarantees for
sensitive applications and data. SGX allows an application to set aside part of
its address space as an *enclave*, a protected area that provides
confidentiality and integrity even in the presence of privileged malware.
Accesses to the enclave memory area from any software not resident in the
enclave are prevented, including those from privileged software.

Virtual SGX
-----------

The SGX feature is exposed to the guest via the SGX CPUID leaves. For most of
the SGX CPUID information, QEMU can report the same values to the guest as on
the host. By reporting the same CPUID information, the guest is able to use the
full capacity of SGX, and KVM doesn't need to emulate it.

The guest's EPC base and size are determined by QEMU, and KVM needs QEMU to
notify it of this information before it can initialize SGX for the guest.
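The EPC sections that QEMU constructs are visible to the guest in subleaves of
CPUID leaf 0x12, starting at subleaf 2. A sketch of walking them from inside a
Linux guest (illustrative Python, not part of QEMU; in ``/dev/cpu/0/cpuid`` the
upper 32 bits of the file offset select the input ECX, i.e. the subleaf; the
field encoding follows the Intel SDM SGX chapter and should be double-checked
there)::

   import os, struct

   def cpuid(leaf, subleaf=0):
       with open("/dev/cpu/0/cpuid", "rb") as f:
           off = (subleaf << 32) | leaf   # upper half of offset = ECX
           return struct.unpack("<4I", os.pread(f.fileno(), 16, off))

   subleaf = 2
   while True:
       eax, ebx, ecx, edx = cpuid(0x12, subleaf)
       if (eax & 0xF) != 1:               # type 1 marks a valid EPC section
           break
       base = ((ebx & 0xFFFFF) << 32) | (eax & 0xFFFFF000)
       size = ((edx & 0xFFFFF) << 32) | (ecx & 0xFFFFF000)
       print("EPC section %d: base=%#x size=%#x" % (subleaf - 2, base, size))
       subleaf += 1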
Virtual EPC
~~~~~~~~~~~

By default, QEMU does not assign EPC to a VM, i.e. fully enabling SGX in a VM
requires explicit allocation of EPC to the VM. Similar to other specialized
memory types, e.g. hugetlbfs, EPC is exposed as a memory backend.

SGX EPC is enumerated through CPUID, i.e. EPC "devices" need to be realized
prior to realizing the vCPUs themselves, which occurs long before generic
devices are parsed and realized. This limitation means that EPC does not
require -maxmem, as EPC is not treated as {cold,hot}plugged memory.

QEMU does not artificially restrict the number of EPC sections exposed to a
guest, e.g. QEMU will happily allow you to create 64 1M EPC sections. Be aware
that some kernels may not recognize all EPC sections, e.g. the Linux SGX driver
is hardwired to support only 8 EPC sections.

The following QEMU snippet creates two EPC sections, with 64M pre-allocated
to the VM and an additional 28M mapped but not allocated::

 -object memory-backend-epc,id=mem1,size=64M,prealloc=on \
 -object memory-backend-epc,id=mem2,size=28M \
 -M sgx-epc.0.memdev=mem1,sgx-epc.1.memdev=mem2

Note:

The size and location of the virtual EPC are far less restricted compared
to physical EPC. Because physical EPC is protected via range registers,
the size of the physical EPC must be a power of two (though software sees
a subset of the full EPC, e.g. 92M or 128M) and the EPC must be naturally
aligned. KVM SGX's virtual EPC is purely a software construct and only
requires the size and location to be page aligned. QEMU enforces that the EPC
size is a multiple of 4k and will ensure the base of the EPC is 4k aligned.
To simplify the implementation, EPC is always located above 4g in the guest
physical address space.

Migration
~~~~~~~~~

QEMU/KVM doesn't prevent live migration of SGX VMs, although from the
hardware's perspective SGX doesn't support live migration, since both the EPC
and the SGX key hierarchy are bound to the physical platform. However, live
migration can be supported in the sense that the guest software stack can
support recreating enclaves when it suffers a sudden loss of EPC, and guest
enclaves can detect SGX keys having changed and handle it gracefully. For
instance, when ERESUME fails with #PF.SGX, guest software can gracefully detect
it and recreate enclaves; and when an enclave fails to unseal sensitive
information from outside, it can detect such an error and the sensitive
information can be provisioned to it again.

CPUID
~~~~~

Due to its myriad dependencies, SGX is currently not listed as supported
in any of QEMU's built-in CPU configurations. To expose SGX (and SGX Launch
Control) to a guest, you must either use ``-cpu host`` to pass through the
host CPU model, or explicitly enable SGX when using a built-in CPU model,
e.g. via ``-cpu <model>,+sgx`` or ``-cpu <model>,+sgx,+sgxlc``.

All SGX sub-features enumerated through CPUID, e.g. SGX2, MISCSELECT,
ATTRIBUTES, etc., can be restricted via CPUID flags. Be aware that enforcing
restriction of MISCSELECT, ATTRIBUTES and XFRM requires intercepting ECREATE,
i.e. it may marginally reduce SGX performance in the guest.
All SGX sub-features controlled via -cpu are prefixed with "sgx", e.g.::

  $ qemu-system-x86_64 -cpu help | xargs printf "%s\n" | grep sgx
  sgx
  sgx-debug
  sgx-encls-c
  sgx-enclv
  sgx-exinfo
  sgx-kss
  sgx-mode64
  sgx-provisionkey
  sgx-tokenkey
  sgx1
  sgx2
  sgxlc

The following QEMU snippet passes through the host CPU but restricts access to
the provision and EINIT token keys::

 -cpu host,-sgx-provisionkey,-sgx-tokenkey

SGX sub-features cannot be emulated, i.e. sub-features that are not present
in hardware cannot be forced on via '-cpu'.

Virtualize SGX Launch Control
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

QEMU SGX support for Launch Control (LC) is passive, in the sense that it
does not actively change the LC configuration. QEMU SGX provides the user
the ability to set/clear the CPUID flag (and by extension the associated
IA32_FEATURE_CONTROL MSR bit in fw_cfg) and saves/restores the LE Hash MSRs
when getting/putting guest state, but QEMU does not add new controls to
directly modify the LC configuration. Similar to hardware behavior, locking
the LC configuration to a non-Intel value is left to guest firmware. Unlike
the host BIOS setting for SGX Launch Control (LC), by design there is no
special BIOS setting for an SGX guest. If the host is in locked mode, creating
a VM with SGX is still allowed.

Feature Control
~~~~~~~~~~~~~~~

QEMU SGX updates the ``etc/msr_feature_control`` fw_cfg entry to set the SGX
(bit 18) and SGX LC (bit 17) flags based on their respective CPUID support,
i.e. existing guest firmware will automatically set SGX and SGX LC accordingly,
assuming said firmware supports fw_cfg.msr_feature_control.
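For illustration, the relevant IA32_FEATURE_CONTROL bits can be composed as
follows (bit positions per the Intel SDM; a sketch, not QEMU's implementation)::

   # IA32_FEATURE_CONTROL bits relevant to SGX (Intel SDM); guest
   # firmware programs the value from fw_cfg and sets the lock bit.
   FC_LOCK   = 1 << 0     # locks the MSR until reset
   FC_SGX_LC = 1 << 17    # SGX Launch Control enable
   FC_SGX    = 1 << 18    # SGX global enable

   def msr_feature_control(sgx, sgx_lc):
       value = FC_LOCK
       if sgx:
           value |= FC_SGX
       if sgx_lc:
           value |= FC_SGX_LC
       return value

   print(hex(msr_feature_control(sgx=True, sgx_lc=True)))   # 0x60001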
Launching a guest
-----------------

To launch an SGX guest:

.. parsed-literal::

   |qemu_system_x86| \\
      -cpu host,+sgx-provisionkey \\
      -object memory-backend-epc,id=mem1,size=64M,prealloc=on \\
      -M sgx-epc.0.memdev=mem1,sgx-epc.0.node=0

Utilizing SGX in the guest requires a kernel/OS with SGX support.
The support can be determined in the guest by::

  $ grep sgx /proc/cpuinfo

and the SGX EPC info by::

  $ dmesg | grep sgx
  [    0.182807] sgx: EPC section 0x140000000-0x143ffffff
  [    0.183695] sgx: [Firmware Bug]: Unable to map EPC section to online node. Fallback to the NUMA node 0.

To launch an SGX NUMA guest:

.. parsed-literal::

   |qemu_system_x86| \\
      -cpu host,+sgx-provisionkey \\
      -object memory-backend-ram,size=2G,host-nodes=0,policy=bind,id=node0 \\
      -object memory-backend-epc,id=mem0,size=64M,prealloc=on,host-nodes=0,policy=bind \\
      -numa node,nodeid=0,cpus=0-1,memdev=node0 \\
      -object memory-backend-ram,size=2G,host-nodes=1,policy=bind,id=node1 \\
      -object memory-backend-epc,id=mem1,size=28M,prealloc=on,host-nodes=1,policy=bind \\
      -numa node,nodeid=1,cpus=2-3,memdev=node1 \\
      -M sgx-epc.0.memdev=mem0,sgx-epc.0.node=0,sgx-epc.1.memdev=mem1,sgx-epc.1.node=1

and the SGX EPC NUMA info by::

  $ dmesg | grep sgx
  [    0.369937] sgx: EPC section 0x180000000-0x183ffffff
  [    0.370259] sgx: EPC section 0x184000000-0x185bfffff

  $ dmesg | grep SRAT
  [    0.009981] ACPI: SRAT: Node 0 PXM 0 [mem 0x180000000-0x183ffffff]
  [    0.009982] ACPI: SRAT: Node 1 PXM 1 [mem 0x184000000-0x185bfffff]

References
----------

- `SGX Homepage <https://software.intel.com/sgx>`__

- `SGX SDK <https://github.com/intel/linux-sgx.git>`__

- SGX specification: Intel SDM Volume 3

diff --git a/docs/system/i386/xen.rst b/docs/system/i386/xen.rst
new file mode 100644
index 0000000000..46db5f34c1
--- /dev/null
+++ b/docs/system/i386/xen.rst
@@ -0,0 +1,144 @@

Xen HVM guest support
=====================


Description
-----------

KVM has support for hosting Xen guests, intercepting Xen hypercalls and event
channel (Xen PV interrupt) delivery. This allows guests which expect to be
run under Xen to be hosted in QEMU under Linux/KVM instead.

Using the split irqchip is mandatory for Xen support.

Setup
-----

Xen mode is enabled by setting the ``xen-version`` property of the KVM
accelerator, for example for Xen 4.17:

.. parsed-literal::

    |qemu_system| --accel kvm,xen-version=0x40011,kernel-irqchip=split

Additionally, virtual APIC support can be advertised to the guest through the
``xen-vapic`` CPU flag:

.. parsed-literal::

    |qemu_system| --accel kvm,xen-version=0x40011,kernel-irqchip=split --cpu host,+xen-vapic

When Xen support is enabled, QEMU changes hypervisor identification (CPUID
0x40000000..0x4000000A) to Xen. The KVM identification and features are not
advertised to a Xen guest. If Hyper-V is also enabled, the Xen identification
moves to leaves 0x40000100..0x4000010A.
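The ``xen-version`` value packs the Xen major version in the top 16 bits and
the minor version in the low 16 bits (``XENVER_version`` form, described under
Properties below), so ``0x40011`` above corresponds to Xen 4.17. A trivial,
illustrative helper::

   def xen_version(major, minor):
       # XENVER_version form used by the xen-version property
       return (major << 16) | minor

   assert xen_version(4, 17) == 0x40011      # value used in the examples
   assert divmod(0x40011, 1 << 16) == (4, 17)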
Properties
----------

The following properties exist on the KVM accelerator object:

``xen-version``
  This property contains the Xen version in ``XENVER_version`` form, with the
  major version in the top 16 bits and the minor version in the low 16 bits.
  Setting this property enables the Xen guest support. If Xen version 4.5 or
  greater is specified, the HVM leaf in Xen CPUID is populated. Xen version
  4.6 enables the vCPU ID in CPUID, and version 4.17 advertises vCPU upcall
  vector support to the guest.

``xen-evtchn-max-pirq``
  Xen PIRQs represent an emulated physical interrupt, either GSI or MSI, which
  can be routed to an event channel instead of to the emulated I/O or local
  APIC. By default, QEMU permits only 256 PIRQs because this allows maximum
  compatibility with 32-bit MSI, where the higher bits of the PIRQ# would need
  to be in the upper 64 bits of the MSI message. For guests with large numbers
  of PCI devices (and none which are limited to 32-bit addressing) it may be
  desirable to increase this value.

``xen-gnttab-max-frames``
  Xen grant tables are the means by which a Xen guest grants access to its
  memory for PV back ends (disk, network, etc.). Since QEMU only supports v1
  grant tables, whose entries are 8 bytes in size, each page (each frame) of
  the grant table can reference 512 pages of guest memory. The default number
  of frames is 64, allowing for 32768 pages of guest memory to be accessed by
  PV backends through simultaneous grants. For guests with large numbers of PV
  devices and high throughput, it may be desirable to increase this value.

Xen paravirtual devices
-----------------------

The Xen PCI platform device is enabled automatically for a Xen guest. This
allows a guest to unplug all emulated devices, in order to use paravirtual
block and network drivers instead.

Those paravirtual Xen block, network (and console) devices can be created
through the command line, and/or hot-plugged.

To provide a Xen console device, define a character device and then a device
of type ``xen-console`` to connect to it. For the Xen console equivalent of
the handy ``-serial mon:stdio`` option, for example:

.. parsed-literal::

   -chardev stdio,mux=on,id=char0,signal=off -mon char0 \\
   -device xen-console,chardev=char0

The Xen network device is ``xen-net-device``, which becomes the default NIC
model for emulated Xen guests, meaning that just the default NIC provided
by QEMU should automatically work and present a Xen network device to the
guest.

Disks can be configured with '``-drive file=${GUEST_IMAGE},if=xen``' and will
appear to the guest as ``xvda`` onwards.

Under Xen, the boot disk is typically available both via IDE emulation and as
a PV block device. Guest bootloaders typically use IDE to load the guest
kernel, which then unplugs the IDE device and continues with the Xen PV block
device.

This configuration can be achieved as follows:

.. parsed-literal::

  |qemu_system| --accel kvm,xen-version=0x40011,kernel-irqchip=split \\
       -drive file=${GUEST_IMAGE},if=xen \\
       -drive file=${GUEST_IMAGE},file.locking=off,if=ide

VirtIO devices can also be used; Linux guests may need to be dissuaded from
unplugging them by adding '``xen_emul_unplug=never``' on their command line.

Booting Xen PV guests
---------------------

Booting PV guest kernels is possible by using the Xen PV shim (a version of
Xen itself, designed to run inside a Xen HVM guest and provide memory
management services for one guest alone).

The Xen binary is provided as the ``-kernel`` and the guest kernel itself (or
the PV Grub image) as the ``-initrd`` image, which actually just means the
first multiboot "module". For example:

.. parsed-literal::

   |qemu_system| --accel kvm,xen-version=0x40011,kernel-irqchip=split \\
        -chardev stdio,id=char0 -device xen-console,chardev=char0 \\
        -display none -m 1G -kernel xen -initrd bzImage \\
        -append "pv-shim console=xen,pv -- console=hvc0 root=/dev/xvda1" \\
        -drive file=${GUEST_IMAGE},if=xen

The Xen image must be built with the ``CONFIG_XEN_GUEST`` and ``CONFIG_PV_SHIM``
options, and as of Xen 4.17, Xen's PV shim mode does not support using a serial
port; it must have a Xen console or it will panic.

The example above provides the guest kernel command line after a separator
(" ``--`` ") on the Xen command line, and does not provide the guest kernel
with an actual initramfs, which would need to be listed as a second multiboot
module. For more complicated alternatives, see the command line
:ref:`documentation <system/invocation-qemu-options-initrd>` for the
``-initrd`` option.
Host OS requirements
--------------------

The minimal Xen support in the KVM accelerator requires the host to be running
Linux v5.12 or newer. Later versions add optimisations: Linux v5.17 added
acceleration of interrupt delivery via the Xen PIRQ mechanism, and Linux v5.19
accelerated Xen PV timers and inter-processor interrupts (IPIs).